Selecting All Rows Before A Certain Entry In A Pandas Dataframe
How can I select the rows that come before the first appearance of a certain value in a column? I have a dataset of user activity with timestamps, recorded as follows: df = pd.DataFrame([{'use
Solution 1:
You can avoid an explicit apply by grouping the boolean mask itself:
In [2862]: df[df['activity'].eq('Purchase').groupby(df['user_id']).cumsum().eq(0)]
Out[2862]:
   activity        date  user_id
0      Open  2017-09-01        1
1      Open  2017-09-02        1
2      Open  2017-09-03        1
3     Click  2017-09-04        1
7      Open  2017-09-04        2
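For anyone who wants to run this end to end, here is a minimal self-contained sketch. The question's original constructor is cut off, so the DataFrame below is rebuilt from the table printed in this thread. The variable names (is_purchase, purchases_so_far, before_first_purchase) are my own, not from the answer.

```python
import pandas as pd

# Data rebuilt from the table printed in the thread (not the asker's original code).
df = pd.DataFrame({
    "activity": ["Open", "Open", "Open", "Click", "Purchase",
                 "Open", "Open", "Open", "Purchase"],
    "date": pd.to_datetime([
        "2017-09-01", "2017-09-02", "2017-09-03", "2017-09-04", "2017-09-05",
        "2017-09-06", "2017-09-07", "2017-09-04", "2017-09-06",
    ]),
    "user_id": [1, 1, 1, 1, 1, 1, 1, 2, 2],
})

# True on every row that is a purchase.
is_purchase = df["activity"].eq("Purchase")

# Running count of purchases within each user; it stays 0 until the first purchase.
purchases_so_far = is_purchase.groupby(df["user_id"]).cumsum()

# Keep only rows where no purchase has been seen yet for that user.
before_first_purchase = df[purchases_so_far.eq(0)]
print(before_first_purchase)
```

Grouping the mask by df["user_id"] (rather than grouping df and calling apply) keeps the original row index, so the result can be used directly for boolean indexing.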
Solution 2:
Use groupby to find all rows above the row where a user purchased an item, then use the resulting mask to index.
df
   activity        date  user_id
0      Open  2017-09-01        1
1      Open  2017-09-02        1
2      Open  2017-09-03        1
3     Click  2017-09-04        1
4  Purchase  2017-09-05        1
5      Open  2017-09-06        1
6      Open  2017-09-07        1
7      Open  2017-09-04        2
8  Purchase  2017-09-06        2
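# group_keys=False keeps the original row index so the mask aligns with df
# (recent pandas versions otherwise prepend user_id to the index of the result)
m = df.groupby('user_id', group_keys=False).activity\
      .apply(lambda x: (x == 'Purchase').cumsum()) == 0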
df[m]
   activity        date  user_id
0      Open  2017-09-01        1
1      Open  2017-09-02        1
2      Open  2017-09-03        1
3     Click  2017-09-04        1
7      Open  2017-09-04        2
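One property of the cumsum() == 0 mask that is worth checking against your own data: a user who never purchased keeps all of their rows, because the running count never leaves 0. A small sketch, where user 3 and the sample rows are invented purely for illustration:

```python
import pandas as pd

# Hypothetical data: user 3 is made up for this example and never purchases.
df = pd.DataFrame({
    "activity": ["Open", "Purchase", "Open", "Open", "Click"],
    "date": pd.to_datetime([
        "2017-09-01", "2017-09-02", "2017-09-03", "2017-09-01", "2017-09-02",
    ]),
    "user_id": [1, 1, 1, 3, 3],
})

# Running count of purchases per user.
purchases_so_far = df["activity"].eq("Purchase").groupby(df["user_id"]).cumsum()

# User 1 keeps only the row before the purchase; user 3 keeps everything.
result = df[purchases_so_far.eq(0)]
print(result)
```

If users without any purchase should be dropped instead, that needs an extra filter on top of this mask.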
If your actual data isn't sorted like it is here, use df.sort_values first to make sure it is:
df = df.sort_values(['user_id', 'date'])
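To see why the sort matters, here is a rough sketch using the same rows as the thread's table, deliberately scrambled. The row order and the names naive and result are assumptions made for the example.

```python
import pandas as pd

# Same rows as the thread's table, but in a scrambled order (assumed for illustration).
df = pd.DataFrame({
    "activity": ["Purchase", "Open", "Click", "Open", "Purchase",
                 "Open", "Open", "Open", "Open"],
    "date": pd.to_datetime([
        "2017-09-06", "2017-09-07", "2017-09-04", "2017-09-01", "2017-09-05",
        "2017-09-04", "2017-09-03", "2017-09-06", "2017-09-02",
    ]),
    "user_id": [2, 1, 1, 1, 1, 2, 1, 1, 1],
})

def rows_before_first_purchase(frame):
    # cumsum walks rows in their current order, so order decides what counts as "before".
    seen = frame["activity"].eq("Purchase").groupby(frame["user_id"]).cumsum()
    return frame[seen.eq(0)]

# Without sorting, "before" means earlier in the frame, not earlier in time.
naive = rows_before_first_purchase(df)

# Sorting by user and date first makes row order match chronological order.
result = rows_before_first_purchase(df.sort_values(["user_id", "date"]))
print(result)
```

On the unsorted frame the mask keeps the 2017-09-07 row for user 1, even though it happened after the purchase, and drops user 2's earlier Open row. After sorting, the result contains the same five events as in the answers above.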
Solution 3:
Group the boolean mask by user_id with groupby, take the running total with SeriesGroupBy.cumsum, convert it to bool, invert the condition, and filter by boolean indexing:
#if necessary
#df = df.sort_values(['user_id', 'date'])
df = df[~df['activity'].eq('Purchase').groupby(df['user_id']).cumsum().astype(bool)]
print (df)
   user_id        date activity
0        1  2017-09-01     Open
1        1  2017-09-02     Open
2        1  2017-09-03     Open
3        1  2017-09-04    Click
7        2  2017-09-04     Open
Detail (the mask, computed on the original df before filtering):
print (~df['activity'].eq('Purchase').groupby(df['user_id']).cumsum().astype(bool))
0     True
1     True
2     True
3     True
4    False
5    False
6    False
7     True
8    False
Name: activity, dtype: bool
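If it helps to see how that mask is built up, the chain can be split into its intermediate steps. This is only a sketch: the data is rebuilt from the table printed in the thread, and the steps frame and its column names are my own additions.

```python
import pandas as pd

# Data rebuilt from the table printed in the thread.
df = pd.DataFrame({
    "user_id": [1, 1, 1, 1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime([
        "2017-09-01", "2017-09-02", "2017-09-03", "2017-09-04", "2017-09-05",
        "2017-09-06", "2017-09-07", "2017-09-04", "2017-09-06",
    ]),
    "activity": ["Open", "Open", "Open", "Click", "Purchase",
                 "Open", "Open", "Open", "Purchase"],
})

# Step 1: flag purchase rows.
is_purchase = df["activity"].eq("Purchase")

# Step 2: running count of purchases, restarted for each user.
running_count = is_purchase.groupby(df["user_id"]).cumsum()

# Step 3: any non-zero count means a purchase has already happened.
seen_purchase = running_count.astype(bool)

# Step 4: invert, so True marks rows strictly before the first purchase.
keep = ~seen_purchase

# Side-by-side view of the intermediate values (column names are illustrative).
steps = pd.DataFrame({
    "is_purchase": is_purchase,
    "running_count": running_count,
    "seen_purchase": seen_purchase,
    "keep": keep,
})
print(steps)
```

Note that the purchase row itself is excluded, since its running count is already 1.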