
How To Remove Rows That Appear Same In Two Columns Simultaneously In Dataframe?

I have a DataFrame, DF1:

   Id1  Id2
0  286  409
1  286  257
2  409  286
3  257  183

In this DataFrame, I treat the rows 286,409 and 409,286 as the same. I only want to keep one

Solution 1:

I believe you need to sort the two columns within each row with np.sort, then filter with DataFrame.duplicated using an inverted mask:

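import numpy as np
import pandas as pd

# DF1 as given in the question
DF1 = pd.DataFrame({'Id1': [286, 286, 409, 257], 'Id2': [409, 257, 286, 183]})

# sort the values inside each row, keeping the original index
df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)

# keep only the rows whose sorted pair has not been seen before
df = DF1[~df1.duplicated()]
print (df)
   Id1  Id2
0  286  409
1  286  257
3  257  183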

Detail: with axis=1, numpy.sort sorts the values within each row, so the first and third rows become identical:

print (np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1))
[[286 409]
 [257 286]
 [286 409]
 [183 257]]

Then use DataFrame.duplicated. It is a DataFrame method, so the sorted array is first wrapped in the DataFrame constructor:

df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)
print (df1)
     0    1
0  286  409
1  257  286
2  286  409
3  183  257

The third row is flagged as a duplicate:

print (df1.duplicated())
0    False
1    False
2     True
3    False
dtype: bool

Finally, invert the mask so that the duplicates are dropped; the result is selected with boolean indexing:

print (DF1[~df1.duplicated()])
   Id1  Id2
0  286  409
1  286  257
3  257  183
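The steps above can be collected into one small helper. This is only a sketch: the function name drop_unordered_duplicates and its keep argument are made up for illustration and are not part of the original answer. The keep argument is passed straight through to DataFrame.duplicated, which accepts 'first', 'last' or False.

```python
import numpy as np
import pandas as pd


def drop_unordered_duplicates(frame, cols, keep="first"):
    """Hypothetical helper: drop rows whose values in `cols` repeat in any order."""
    # Sort the values inside each row so (286, 409) and (409, 286) look identical.
    sorted_pairs = pd.DataFrame(
        np.sort(frame[cols].to_numpy(), axis=1), index=frame.index
    )
    # duplicated() flags the repeats; ~ inverts the mask so they are filtered out.
    return frame[~sorted_pairs.duplicated(keep=keep)]


DF1 = pd.DataFrame({"Id1": [286, 286, 409, 257], "Id2": [409, 257, 286, 183]})

result = drop_unordered_duplicates(DF1, ["Id1", "Id2"])
print(result)

# keep="last" keeps the later row of each duplicated pair instead
result_last = drop_unordered_duplicates(DF1, ["Id1", "Id2"], keep="last")
print(result_last)
```

Because the mask is built on a frame that shares the index of DF1, the surviving rows keep their original index labels and column order.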

Solution 2:

You can group your DataFrame by the sorted list of each row's values and take the first row of every group:

import pandas as pd
from io import StringIO

data = """Id1   Id2
286   409
286   257
409   286
257   183"""

# raw string so that \s is not treated as an invalid string escape
df = pd.read_csv(StringIO(data), sep=r"\s+")

print(df.groupby(df.apply(lambda x: str(sorted(list(x))), axis=1)).first())

Result:

            Id1  Id2
[183, 257]  257  183
[257, 286]  286  257
[286, 409]  286  409
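Note that the grouped result is indexed by the string keys and ordered by them, so the original row labels are lost. If the original index and row order matter, one possible variation (a sketch, not part of the answer above) is to build the same kind of order-independent key per row and filter with Series.duplicated instead of grouping. The names keys and deduped are illustrative.

```python
import pandas as pd

df = pd.DataFrame({"Id1": [286, 286, 409, 257], "Id2": [409, 257, 286, 183]})

# One hashable, order-independent key per row: (286, 409) for both 286,409 and 409,286.
keys = pd.Series(
    [tuple(sorted(pair)) for pair in zip(df["Id1"], df["Id2"])],
    index=df.index,
)

# Keep the first row for each key; original index labels and row order survive.
deduped = df[~keys.duplicated()]
print(deduped)
```

Tuples are used for the keys because they are hashable, which lets duplicated compare them directly without converting them to strings.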
