Selecting All Rows Before A Certain Entry In A Pandas Dataframe

How can I select the rows that come before the first appearance of a certain value in a column? I have a dataset of user activity with timestamps, recorded as follows: df = pd.DataFrame([{'use

Solution 1:

You can avoid an explicit apply by grouping the boolean mask directly:

In [2862]: df[df['activity'].eq('Purchase').groupby(df['user_id']).cumsum().eq(0)]
Out[2862]:
  activity        date  user_id
0     Open  2017-09-01        1
1     Open  2017-09-02        1
2     Open  2017-09-03        1
3    Click  2017-09-04        1
7     Open  2017-09-04        2
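
The one-liner above can be unpacked into named steps. This is a minimal sketch; the DataFrame constructor in the question is truncated, so the sample frame here is rebuilt from the table printed in Solution 2 and is only assumed to match the original data. The intermediate variable names are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the table printed in Solution 2
# (assumed to match the truncated constructor in the question).
df = pd.DataFrame({
    'activity': ['Open', 'Open', 'Open', 'Click', 'Purchase',
                 'Open', 'Open', 'Open', 'Purchase'],
    'date': ['2017-09-01', '2017-09-02', '2017-09-03', '2017-09-04',
             '2017-09-05', '2017-09-06', '2017-09-07', '2017-09-04',
             '2017-09-06'],
    'user_id': [1, 1, 1, 1, 1, 1, 1, 2, 2],
})

# True on every 'Purchase' row.
is_purchase = df['activity'].eq('Purchase')

# Running count of purchases per user, including the current row.
purchases_so_far = is_purchase.groupby(df['user_id']).cumsum()

# A count of 0 means the user has not purchased yet at that row.
before_first_purchase = df[purchases_so_far.eq(0)]
print(before_first_purchase)
```

Because the running count includes the current row, the purchase row itself is excluded along with everything after it.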

Solution 2:

Use groupby to find, for each user, all rows above that user's first purchase. Then use the resulting mask to index.

df
   activity        date  user_id
0      Open  2017-09-01        1
1      Open  2017-09-02        1
2      Open  2017-09-03        1
3     Click  2017-09-04        1
4  Purchase  2017-09-05        1
5      Open  2017-09-06        1
6      Open  2017-09-07        1
7      Open  2017-09-04        2
8  Purchase  2017-09-06        2

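# transform returns a result aligned to df's index, so it can be used directly as a mask
# (in pandas 2.x, apply would prepend the group key to the index here)
m = df.groupby('user_id').activity\
        .transform(lambda x: (x == 'Purchase').cumsum()) == 0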
df[m]

  activity        date  user_id
0     Open  2017-09-01        1
1     Open  2017-09-02        1
2     Open  2017-09-03        1
3    Click  2017-09-04        1
7     Open  2017-09-04        2

If your actual data isn't sorted like it is here, you could use df.sort_values and make sure it is:


df = df.sort_values(['user_id', 'date'])
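
To illustrate why the order matters: the cumulative sum follows row order, so an unsorted frame can keep or drop the wrong rows. The sketch below uses a small made-up frame (hypothetical data, not from the question) and parses the dates with pd.to_datetime first, on the assumption that the dates may not always be in a format that sorts correctly as strings.

```python
import pandas as pd

# Hypothetical, deliberately unsorted data for a single user.
df = pd.DataFrame({
    'user_id': [1, 1, 1],
    'date': ['2017-09-05', '2017-09-01', '2017-09-06'],
    'activity': ['Purchase', 'Open', 'Open'],
})

# Parse the dates so that sorting is chronological rather than lexical.
df['date'] = pd.to_datetime(df['date'])

# Sort by user, then by time, before building the mask.
df_sorted = df.sort_values(['user_id', 'date'])

# Same cumulative-sum mask as above, computed on the sorted frame.
mask = (df_sorted['activity'].eq('Purchase')
        .groupby(df_sorted['user_id']).cumsum().eq(0))
result = df_sorted[mask]
print(result)
```

Without the sort, the purchase would be the first row seen and the mask would drop every row for this user, including the earlier Open event.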

Solution 3:

Group the boolean mask by user_id and take its cumulative sum with SeriesGroupBy.cumsum, convert the result to bool, invert it, and filter with boolean indexing:

# if necessary
# df = df.sort_values(['user_id', 'date'])
# assign to a new name so the original df stays available
result = df[~df['activity'].eq('Purchase').groupby(df['user_id']).cumsum().astype(bool)]
print(result)
   user_id        date activity
0        1  2017-09-01     Open
1        1  2017-09-02     Open
2        1  2017-09-03     Open
3        1  2017-09-04    Click
7        2  2017-09-04     Open

Detail:

print(~df['activity'].eq('Purchase').groupby(df['user_id']).cumsum().astype(bool))
0     True
1     True
2     True
3     True
4    False
5    False
6    False
7     True
8    False
Name: activity, dtype: bool
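
As a quick sanity check, the two mask formulations used on this page (comparing the cumulative count to 0, and inverting its boolean cast) should select the same rows. A minimal sketch, again assuming the sample frame matches the table printed in Solution 2:

```python
import pandas as pd

# Sample data rebuilt from the table printed in Solution 2 (an assumption).
df = pd.DataFrame({
    'user_id': [1, 1, 1, 1, 1, 1, 1, 2, 2],
    'date': ['2017-09-01', '2017-09-02', '2017-09-03', '2017-09-04',
             '2017-09-05', '2017-09-06', '2017-09-07', '2017-09-04',
             '2017-09-06'],
    'activity': ['Open', 'Open', 'Open', 'Click', 'Purchase',
                 'Open', 'Open', 'Open', 'Purchase'],
})

# Running purchase count per user.
counts = df['activity'].eq('Purchase').groupby(df['user_id']).cumsum()

mask_eq = counts.eq(0)            # formulation from Solution 1
mask_not = ~counts.astype(bool)   # formulation from Solution 3

# Compare the two masks element by element.
same = bool((mask_eq == mask_not).all())
print(same)
```

Either form works; .eq(0) states the intent ("no purchase so far") a little more directly, while the astype(bool) form reads as "not after a purchase".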
