Python Pandas Drop Duplicates Keep Second To Last
What's the most efficient way to select the second-to-last row of each duplicated set in a pandas DataFrame? For instance, I basically want to do this operation: df = df.drop_duplicates
Solution 1:
With groupby.apply:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': np.arange(10), 'C': np.arange(10)})
df
Out:
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  1  3  3
4  2  4  4
5  2  5  5
6  2  6  6
7  3  7  7
8  3  8  8
9  4  9  9
(df.groupby('A', as_index=False).apply(lambda x: x if len(x) == 1 else x.iloc[[-2]])
   .reset_index(level=0, drop=True))
Out:
   A  B  C
2  1  2  2
5  2  5  5
7  3  7  7
9  4  9  9
With a different DataFrame, subset two columns:
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4], 'C': np.arange(10)})
df
Out:
   A  B  C
0  1  1  0
1  1  1  1
2  1  2  2
3  1  1  3
4  2  2  4
5  2  2  5
6  2  2  6
7  3  3  7
8  3  3  8
9  4  4  9
(df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x) == 1 else x.iloc[[-2]])
   .reset_index(level=0, drop=True))
Out:
   A  B  C
1  1  1  1
2  1  2  2
5  2  2  5
7  3  3  7
9  4  4  9
Solution 2:
You could use groupby/tail(2) to take the last 2 items of each group, then groupby/head(1) to take the first item from that tail:
df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
If there is only one item in the group, tail(2) returns just the one item, so single-row groups are kept.
For example,
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))
result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
expected = (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x) == 1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
assert expected.sort_index().equals(result)
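The random frame above checks the two approaches against each other, but a small deterministic frame makes the behavior easier to follow. The sketch below reuses the two-column example from Solution 1; the names small, last_two, and second_to_last are illustrative only and do not come from the original answer.

```python
import numpy as np
import pandas as pd

# Same data as the two-column example in Solution 1
small = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                      'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4],
                      'C': np.arange(10)})

# Step 1: keep at most the last two rows of every (A, B) group
last_two = small.groupby(['A', 'B']).tail(2)

# Step 2: of those, keep the first row per group, which is the
# second-to-last row of the original group (or the only row)
second_to_last = last_two.groupby(['A', 'B']).head(1)

print(second_to_last)
```

Both tail and head keep the original index and row order, so the result lines up with the rows selected by the groupby.apply version.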
The builtin groupby methods (such as tail and head) are often much faster than groupby/apply with custom Python functions. This is especially true if there are a lot of groups:
In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
1000 loops, best of 3: 1.7 ms per loop
In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
100 loops, best of 3: 17.9 ms per loop
Alternatively, ayhan suggests a nice improvement:
alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
assert expected.sort_index().equals(alt)
In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
1000 loops, best of 3: 1.43 ms per loop
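For reuse, the tail(2)/drop_duplicates idea could be wrapped in a small function. This is only a sketch: keep_second_to_last is a hypothetical helper name, not a pandas API, and it assumes the frame is already in the order that defines "last".

```python
import pandas as pd

def keep_second_to_last(df, subset):
    """Hypothetical helper: for each group of rows sharing the values in
    `subset`, keep the second-to-last row, or the only row if the group
    has a single member."""
    # tail(2) keeps at most the last two rows per group;
    # drop_duplicates (keep='first' by default) then keeps the earlier one.
    return df.groupby(subset).tail(2).drop_duplicates(subset)

# Usage with the single-key example from Solution 1
demo = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                     'B': range(10), 'C': range(10)})
out = keep_second_to_last(demo, ['A'])
print(out)
```

Note that groupby drops rows whose key is NaN by default, so such rows would not appear in the result.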