Dealing With Sparse Categories In Pandas - Replace Everything Not In Top Categories With "Other"

I often come across the following problem when cleaning data: there are a few common categories (say, the top 10 movie genres) and lots and lots of others which are sparse.

Solution 1:

You can use pd.DataFrame.loc with Boolean indexing:

df.loc[~df['studio'].isin(top_8_list), 'studio'] = 'Other'

Note that there's no need to build your list of top 8 studios with a manual for loop, because value_counts() already sorts by frequency in descending order:

top_8_list = df['studio'].value_counts().index[:8]
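
Putting the two lines of Solution 1 in execution order, here is a minimal runnable sketch. The sample data and the n_top variable are made up for illustration, and it keeps the top 2 studios instead of the top 8 so the effect is visible on a tiny frame:

```python
import pandas as pd

# Hypothetical sample data; only the 'studio' column name comes from the answer.
df = pd.DataFrame({"studio": ["A", "A", "A", "B", "B", "C", "D"]})

n_top = 2  # illustrative; the answer above keeps the top 8

# value_counts() sorts by frequency (descending), so the first
# n_top index labels are the most common studios.
top_list = df["studio"].value_counts().index[:n_top]

# Boolean mask of rows whose studio is NOT in the top list,
# then overwrite just those cells in place.
df.loc[~df["studio"].isin(top_list), "studio"] = "Other"

print(df["studio"].tolist())
# ['A', 'A', 'A', 'B', 'B', 'Other', 'Other']
```

One thing to be aware of: value_counts() skips missing values by default and isin() returns False for them, so any NaN in the column would also be overwritten with 'Other'.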

Solution 2:

You could convert the column to a Categorical, which has the added benefit of using less memory when the column holds only a few distinct values:

top_cats = df['studio'].value_counts().head(8).index.tolist() + ['Other']
# values not listed in categories become NaN, which fillna then maps to 'Other'
df['studio'] = pd.Categorical(df['studio'], categories=top_cats).fillna('Other')
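
The same idea as a self-contained sketch, again with made-up sample data and a hypothetical n_top of 2 rather than 8. It relies on pd.Categorical turning any value that is not in the given categories into NaN, which is what lets fillna do the replacement:

```python
import pandas as pd

# Hypothetical sample data; only the 'studio' column name comes from the answer.
df = pd.DataFrame({"studio": ["A", "A", "A", "B", "B", "C", "D"]})

n_top = 2  # illustrative; the answer above keeps the top 8

# The fill label must itself be one of the categories,
# otherwise fillna on a Categorical raises an error.
top_cats = df["studio"].value_counts().head(n_top).index.tolist() + ["Other"]

# Values outside top_cats become NaN here, and fillna maps them to 'Other'.
df["studio"] = pd.Categorical(df["studio"], categories=top_cats).fillna("Other")

print(df["studio"].tolist())
# ['A', 'A', 'A', 'B', 'B', 'Other', 'Other']
print(df["studio"].dtype)
# category
```

As with Solution 1, values that were missing to begin with end up as 'Other' too, since they are indistinguishable from the NaN produced for the rare categories.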
