Dealing With Sparse Categories In Pandas - Replace Everything Not In Top Categories With "Other"
I often come across the following problem when cleaning data: there are a few common categories (say, the top 10 movie genres) and lots and lots of others which are
Solution 1:
You can use pd.DataFrame.loc with Boolean indexing:
df.loc[~df['studio'].isin(top_8_list), 'studio'] = 'Other'
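Note that there's no need to construct your list of the top 8 studios via a manual for loop. value_counts sorts by frequency in descending order, so the first 8 index entries are the 8 most common studios (define this before running the assignment above):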
top_8_list = df['studio'].value_counts().index[:8]
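As a quick check of Solution 1, here is a minimal sketch on toy data. The DataFrame contents and the names n_top and top_list are made up for illustration; only the 'studio' column name and the 'Other' label come from the answer above. It keeps the top 2 studios rather than 8 so the effect is visible on a small frame.

```python
import pandas as pd

# Hypothetical toy data: A appears 3 times, B twice, C and D once each.
df = pd.DataFrame({"studio": ["A", "A", "A", "B", "B", "C", "D"]})

n_top = 2  # the answer above uses 8; 2 keeps the toy example readable

# value_counts() sorts by frequency (descending), so the first n_top
# index entries are the most common studios.
top_list = df["studio"].value_counts().index[:n_top]

# Boolean mask selects rows whose studio is NOT in the top list,
# and .loc overwrites just those cells in place.
df.loc[~df["studio"].isin(top_list), "studio"] = "Other"

print(df["studio"].tolist())
# ['A', 'A', 'A', 'B', 'B', 'Other', 'Other']
```

Because the assignment goes through .loc on the original frame, it modifies df directly and avoids chained-assignment warnings.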
Solution 2:
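You could convert the column to a Categorical, which has the added benefit of lower memory use. Values that are not among the listed categories become NaN, which fillna then replaces with the catch-all label: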
top_cats = df['studio'].value_counts().head(8).index.tolist() + ['Other']
df['studio'] = pd.Categorical(df['studio'], categories=top_cats).fillna('Other')
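The same toy data can be used to sketch Solution 2. Again, the data and the name n_top are assumptions for illustration, and the catch-all label is spelled "Other" here to match the title. Note that any values that were already missing in the column also end up as "Other", since they are NaN after the conversion as well.

```python
import pandas as pd

# Hypothetical toy data: A appears 3 times, B twice, C and D once each.
df = pd.DataFrame({"studio": ["A", "A", "A", "B", "B", "C", "D"]})

n_top = 2  # the answer above uses 8

# The catch-all label must itself be a category, otherwise fillna
# cannot insert it into the Categorical.
top_cats = df["studio"].value_counts().head(n_top).index.tolist() + ["Other"]

# Values outside top_cats become NaN during the conversion;
# fillna then maps those NaNs to "Other".
df["studio"] = pd.Categorical(df["studio"], categories=top_cats).fillna("Other")

print(df["studio"].tolist())
# ['A', 'A', 'A', 'B', 'B', 'Other', 'Other']
print(df["studio"].dtype)
# category
```

The memory saving comes from storing small integer codes per row instead of one string per row, so it mostly shows up on large columns with few distinct values.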