
Python Pandas Drop Duplicates Keep Second To Last

What's the most efficient way to select the second-to-last row of each duplicated set in a pandas DataFrame? For instance, I basically want to do this operation: df = df.drop_duplicates

Solution 1:

With groupby.apply:

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4], 
                   'B': np.arange(10), 'C': np.arange(10)})

df
Out: 
   A  B  C
0  1  0  0
1  1  1  1
2  1  2  2
3  1  3  3
4  2  4  4
5  2  5  5
6  2  6  6
7  3  7  7
8  3  8  8
9  4  9  9

(df.groupby('A', as_index=False).apply(lambda x: x if len(x) == 1 else x.iloc[[-2]])
   .reset_index(level=0, drop=True))
Out: 
   A  B  C
2  1  2  2
5  2  5  5
7  3  7  7
9  4  9  9

With a different DataFrame, subset two columns:

df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4], 
                   'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4], 'C': np.arange(10)})

df
Out: 
   A  B  C
0  1  1  0
1  1  1  1
2  1  2  2
3  1  1  3
4  2  2  4
5  2  2  5
6  2  2  6
7  3  3  7
8  3  3  8
9  4  4  9

(df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x) == 1 else x.iloc[[-2]])
   .reset_index(level=0, drop=True))
Out: 
   A  B  C
1  1  1  1
2  1  2  2
5  2  2  5
7  3  3  7
9  4  4  9

Solution 2:

You could groupby/tail(2) to take the last 2 items, then groupby/head(1) to take the first item from the tail:

df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

If there is only one item in the group, tail(2) returns just the one item.


For example,

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(10, size=(10**2, 3)), columns=list('ABC'))
result = df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)

expected = (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x) == 1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
assert expected.sort_index().equals(result)
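
With random data it is hard to see which rows survive, so here is a small sketch of the same two steps on the second DataFrame from Solution 1. The intermediate name last_two is only for illustration and is not part of the original answer.

```python
import numpy as np
import pandas as pd

# Same frame as the two-column example in Solution 1.
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': [1, 1, 2, 1, 2, 2, 2, 3, 3, 4],
                   'C': np.arange(10)})

# Step 1: keep at most the last two rows of every (A, B) group.
# Groups with a single row pass through unchanged.
last_two = df.groupby(['A', 'B']).tail(2)

# Step 2: of those rows, keep the first one per group,
# which is the second-to-last row of the original group.
result = last_two.groupby(['A', 'B']).head(1)

print(list(last_two.index))  # [1, 2, 3, 5, 6, 7, 8, 9]
print(list(result.index))    # [1, 2, 5, 7, 9]
```

Both tail and head keep the original row order and index, which is why the result lines up with the groupby.apply output after sort_index().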

The built-in groupby methods (such as tail and head) are often much faster than groupby/apply with custom Python functions. This is especially true if there are a lot of groups:

In [96]: %timeit df.groupby(['A','B']).tail(2).groupby(['A','B']).head(1)
1000 loops, best of 3: 1.7 ms per loop

In [97]: %timeit (df.groupby(['A', 'B'], as_index=False).apply(lambda x: x if len(x)==1 else x.iloc[[-2]]).reset_index(level=0, drop=True))
100 loops, best of 3: 17.9 ms per loop

Alternatively, ayhan suggests a nice improvement:

alt = df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
assert expected.sort_index().equals(alt)

In [99]: %timeit df.groupby(['A','B']).tail(2).drop_duplicates(['A','B'])
1000 loops, best of 3: 1.43 ms per loop
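
The drop_duplicates variant works because drop_duplicates keeps the first occurrence by default (keep='first'), and the first of the last two rows in a group is the second-to-last one. A minimal sketch that wraps it in a reusable function follows; the name keep_second_to_last and its parameters are hypothetical, not a pandas API.

```python
import pandas as pd

def keep_second_to_last(frame, keys):
    """Hypothetical helper: return the second-to-last row of each group
    defined by `keys`, or the only row if the group has just one."""
    # At most two rows per group survive, in original order.
    last_two = frame.groupby(keys).tail(2)
    # keep='first' is the default; spelled out here to make the intent explicit.
    return last_two.drop_duplicates(subset=keys, keep='first')

# Single-key usage on the first DataFrame from Solution 1 (column C omitted).
df = pd.DataFrame({'A': [1, 1, 1, 1, 2, 2, 2, 3, 3, 4],
                   'B': range(10)})
out = keep_second_to_last(df, ['A'])
print(list(out.index))  # [2, 5, 7, 9]
```

Passing ['A', 'B'] as keys gives the two-column behaviour from the examples above.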
