
How To Assign A Unique Id For Different Groups In Pandas Dataframe?

How can unique IDs be assigned to groups created in a pandas dataframe based on certain conditions? For example: I have a dataframe named df with the following structure: Name identifie

Solution 1:

Sort the rows and compute the time difference ('td') between successive actions. Taking cumsum of a Boolean Series that flags gaps longer than 30 minutes forms groups of successive actions that each fall within 30 minutes of the previous one. ngroup then labels the groups.

The sort_index before the groupby can be removed if you don't care which label each group gets, but keeping it ensures the labels are numbered by first appearance in the original row order.

df = df.sort_values(['Name', 'Datetime'])
df['td'] = df.Datetime.diff().mask(df.Name.ne(df.Name.shift()))
                             # Only calculate diff within same Name
df['Id'] = (df.sort_index()
              .groupby(['Name', df['td'].gt(pd.Timedelta('30min')).cumsum()], sort=False)
              .ngroup()+1)
df = df.sort_index()

Output:

td left in for clarity

     Name            Datetime       td  Id
0     Bob 2018-04-26 12:00:00      NaT   1
1  Claire 2018-04-26 12:00:00      NaT   2
2     Bob 2018-04-26 12:10:00 00:10:00   1
3     Bob 2018-04-26 12:30:00 00:20:00   1
4   Grace 2018-04-27 08:30:00      NaT   3
5     Bob 2018-04-27 09:30:00 21:00:00   4
6     Bob 2018-04-27 09:40:00 00:10:00   4
7     Bob 2018-04-27 10:00:00 00:20:00   4
8     Bob 2018-04-27 10:30:00 00:30:00   4
9     Bob 2018-04-27 11:30:00 01:00:00   5
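
To try this end to end, here is a minimal self-contained sketch. The sample rows are reconstructed from the output table above, so they are an assumption about the asker's data rather than the original frame; the intermediate name new_group is also introduced here only for readability.

```python
import pandas as pd

# Sample data rebuilt from the output table (assumed to match the question's df)
df = pd.DataFrame({
    'Name': ['Bob', 'Claire', 'Bob', 'Bob', 'Grace',
             'Bob', 'Bob', 'Bob', 'Bob', 'Bob'],
    'Datetime': pd.to_datetime([
        '2018-04-26 12:00:00', '2018-04-26 12:00:00', '2018-04-26 12:10:00',
        '2018-04-26 12:30:00', '2018-04-27 08:30:00', '2018-04-27 09:30:00',
        '2018-04-27 09:40:00', '2018-04-27 10:00:00', '2018-04-27 10:30:00',
        '2018-04-27 11:30:00',
    ]),
})

df = df.sort_values(['Name', 'Datetime'])

# Gap to the previous row, blanked (NaT) wherever the Name changes
df['td'] = df.Datetime.diff().mask(df.Name.ne(df.Name.shift()))

# Running counter that increases each time a gap exceeds 30 minutes
new_group = df['td'].gt(pd.Timedelta('30min')).cumsum()

# Group in original row order so labels follow first appearance
df['Id'] = df.sort_index().groupby(['Name', new_group], sort=False).ngroup() + 1
df = df.sort_index()

print(df)
```

Note that the 00:30:00 gap on row 8 does not start a new group, because gt is a strict comparison; use ge instead if a gap of exactly 30 minutes should split.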

Solution 2:

Your explanation near the bottom of the question is really helpful for understanding the problem.

You need to groupby on Name and a groupID (don't confuse this groupID with your final Id) and call ngroup to return Id. The main thing is how to define this groupID. To create it, first use sort_values to order the rows by Name and then by Datetime. Then groupby Name and take the differences in Datetime between consecutive rows within each Name. Use gt to flag differences greater than 30 minutes and cumsum to turn those flags into the groupID. Finally, sort_index restores the original order, and the result is assigned to s as follows:

s = df.sort_values(['Name','Datetime']).groupby('Name').Datetime.diff() \
      .gt(pd.Timedelta(minutes=30)).cumsum().sort_index()
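
It may help to look at each intermediate step of that chain separately. The sketch below assumes sample data reconstructed from the output shown in the answers; the names ordered, gaps, is_break and steps are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample data rebuilt from the answers' output (an assumption about the original df)
df = pd.DataFrame({
    'Name': ['Bob', 'Claire', 'Bob', 'Bob', 'Grace',
             'Bob', 'Bob', 'Bob', 'Bob', 'Bob'],
    'Datetime': pd.to_datetime([
        '2018-04-26 12:00:00', '2018-04-26 12:00:00', '2018-04-26 12:10:00',
        '2018-04-26 12:30:00', '2018-04-27 08:30:00', '2018-04-27 09:30:00',
        '2018-04-27 09:40:00', '2018-04-27 10:00:00', '2018-04-27 10:30:00',
        '2018-04-27 11:30:00',
    ]),
})

ordered = df.sort_values(['Name', 'Datetime'])

# Per-Name gap to the previous row; the first row of each Name is NaT
gaps = ordered.groupby('Name').Datetime.diff()

# True where a gap is strictly longer than 30 minutes (NaT compares as False)
is_break = gaps.gt(pd.Timedelta(minutes=30))

# Running count of breaks, then back to the original row order
s = is_break.cumsum().sort_index()

# assign aligns each Series on df's index, so rows line up with the original order
steps = df.assign(gap=gaps, is_break=is_break, s=s)
print(steps)
```

The counter s is not unique across names on its own (Claire, Grace and Bob's last group all end up with the same value), which is why Name also has to be part of the final groupby.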

Next, groupby Name and s with sort=False to preserve the original order, then call ngroup and add 1.

df['Id'] = df.groupby(['Name', s], sort=False).ngroup().add(1)

Out[834]:
     Name            Datetime  Id
0     Bob 2018-04-26 12:00:00   1
1  Claire 2018-04-26 12:00:00   2
2     Bob 2018-04-26 12:10:00   1
3     Bob 2018-04-26 12:30:00   1
4   Grace 2018-04-27 08:30:00   3
5     Bob 2018-04-27 09:30:00   4
6     Bob 2018-04-27 09:40:00   4
7     Bob 2018-04-27 10:00:00   4
8     Bob 2018-04-27 10:30:00   4
9     Bob 2018-04-27 11:30:00   5
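
If the same logic is needed with a different threshold, the two steps can be wrapped in a small helper. This is only a sketch: the function name assign_group_ids and its max_gap parameter are hypothetical, the column names Name and Datetime are taken from the answers, and the sample data is reconstructed from the output above.

```python
import pandas as pd


def assign_group_ids(df, max_gap='30min'):
    """Return a 1-based Id per row; a new Id starts when the gap within a Name exceeds max_gap."""
    s = (df.sort_values(['Name', 'Datetime'])
           .groupby('Name').Datetime.diff()   # gap to previous row within each Name
           .gt(pd.Timedelta(max_gap))         # strict comparison, NaT counts as False
           .cumsum()
           .sort_index())
    # sort=False numbers the groups by first appearance in df's row order
    return df.groupby(['Name', s], sort=False).ngroup().add(1)


# Sample data rebuilt from the answers' output (an assumption about the original df)
df = pd.DataFrame({
    'Name': ['Bob', 'Claire', 'Bob', 'Bob', 'Grace',
             'Bob', 'Bob', 'Bob', 'Bob', 'Bob'],
    'Datetime': pd.to_datetime([
        '2018-04-26 12:00:00', '2018-04-26 12:00:00', '2018-04-26 12:10:00',
        '2018-04-26 12:30:00', '2018-04-27 08:30:00', '2018-04-27 09:30:00',
        '2018-04-27 09:40:00', '2018-04-27 10:00:00', '2018-04-27 10:30:00',
        '2018-04-27 11:30:00',
    ]),
})

ids = assign_group_ids(df)
print(df.assign(Id=ids))
```

The helper returns a Series instead of modifying df, so the caller decides whether to assign it back with df['Id'] = assign_group_ids(df).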
