How To Remove Rows That Appear Same In Two Columns Simultaneously In Dataframe?
I have a DataFrame, DF1:
   Id1  Id2
0  286  409
1  286  257
2  409  286
3  257  183
In this DataFrame, I consider the rows (286, 409) and (409, 286) to be the same. I only want to keep one
Solution 1:
I believe you need to sort the two columns within each row with np.sort and then filter with DataFrame.duplicated, using an inverted mask:
import numpy as np
import pandas as pd
df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)
df = DF1[~df1.duplicated()]
print (df)
   Id1  Id2
0  286  409
1  286  257
3  257  183
Detail: numpy.sort with axis=1 sorts the values within each row, so the first and third rows become identical:
print (np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1))
[[286 409]
[257 286]
[286 409]
[183 257]]
Then use the DataFrame.duplicated function (it works on a DataFrame, which is why the sorted array is passed to the DataFrame constructor):
df1 = pd.DataFrame(np.sort(DF1[['Id1', 'Id2']].to_numpy(), axis=1), index=DF1.index)
print (df1)
     0    1
0  286  409
1  257  286
2  286  409
3  183  257
The third row is a duplicate:
print (df1.duplicated())
0    False
1    False
2     True
3    False
dtype: bool
Finally, invert the mask to remove the duplicates; the output is filtered by boolean indexing:
print (DF1[~df1.duplicated()])
Id1 Id2
0 286 409
1 286 257
3 257 183
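The steps above can be collected into one self-contained script. This is only a sketch: the sample frame is rebuilt from the question's data, and the helper name drop_unordered_duplicates is made up for illustration and is not part of pandas.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question.
DF1 = pd.DataFrame({"Id1": [286, 286, 409, 257], "Id2": [409, 257, 286, 183]})


def drop_unordered_duplicates(frame, cols):
    """Hypothetical helper: keep the first row for each unordered combination of cols."""
    # Sort the values inside each row so (409, 286) and (286, 409) compare equal.
    sorted_pairs = pd.DataFrame(np.sort(frame[cols].to_numpy(), axis=1), index=frame.index)
    # duplicated() flags later repeats as True; ~ keeps only the first occurrences.
    return frame[~sorted_pairs.duplicated()]


result = drop_unordered_duplicates(DF1, ["Id1", "Id2"])
print(result)
```

Because the mask is built on a frame that shares DF1's index, the surviving rows keep their original labels (0, 1 and 3) and their original column order.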
Solution 2:
You can group your DataFrame by a sorted list of the column values and take the first row of each group:
import pandas as pd
from io import StringIO
data = """Id1 Id2
286 409
286 257
409 286
257 183"""
df = pd.read_csv(StringIO(data), sep=r"\s+")
print(df.groupby(df.apply(lambda x: str(sorted(list(x))), axis=1)).first())
Result:
            Id1  Id2
[183, 257]  257  183
[257, 286]  286  257
[286, 409]  286  409
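Note that the grouped result is indexed by the key string and ordered by it, so the original row labels are lost. If those matter, one possible variation (a sketch, not part of the answer above) is to build the same key and pass it to Series.duplicated instead of groupby:

```python
import pandas as pd
from io import StringIO

data = """Id1 Id2
286 409
286 257
409 286
257 183"""
df = pd.read_csv(StringIO(data), sep=r"\s+")

# Same order-independent key as in the groupby version: one string per row.
key = df.apply(lambda x: str(sorted(list(x))), axis=1)
# Keep the first row for each key; index labels and row order stay unchanged.
deduped = df[~key.duplicated()]
print(deduped)
```

This keeps rows 0, 1 and 3, matching the output of Solution 1.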