
Duplicating Training Examples To Handle Class Imbalance In A Pandas Data Frame

I have a DataFrame in pandas that contains training examples, for example:

   feature1  feature2  class
0  0.548814  0.791725      1
1  0.715189  0.528895      0
2  0.602763  0.5680

Solution 1:

You can find the maximum size a group has with

max_size = frame['class'].value_counts().max()

In your example, this equals 8. For each group, you can sample max_size - len(group) rows with replacement. If you then concat these samples to the original DataFrame, every group ends up with max_size rows and all of the original rows are kept.

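import pandas as pd

lst = [frame]  # start with the original rows so none are lost
for class_index, group in frame.groupby('class'):
    # draw only the rows this class is missing (0 for the largest class)
    lst.append(group.sample(max_size - len(group), replace=True))
frame_new = pd.concat(lst)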
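
To try this end to end, here is a minimal self-contained sketch of the same loop. The function name oversample_to_max, the label_col and seed parameters, and the toy data (8 rows of class 0, 3 rows of class 1) are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

def oversample_to_max(frame, label_col="class", seed=0):
    """Pad every class with resampled rows until it matches the largest class."""
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # keep every original row
    for _, group in frame.groupby(label_col):
        # Sample only the shortfall, with replacement; 0 rows for the largest class.
        parts.append(
            group.sample(max_size - len(group), replace=True, random_state=seed)
        )
    return pd.concat(parts, ignore_index=True)

# Hypothetical toy data, not the data from the question.
rng = np.random.RandomState(0)
frame = pd.DataFrame({
    "feature1": rng.rand(11),
    "feature2": rng.rand(11),
    "class": [0] * 8 + [1] * 3,
})

balanced = oversample_to_max(frame)
counts = balanced["class"].value_counts()
print(counts)
```

Passing ignore_index=True to pd.concat is a choice made here so that the duplicated rows do not share index labels with the originals; drop it if you want to trace each copy back to its source row.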

Note that this makes all group sizes exactly equal. If you don't want perfectly balanced classes, you can play with max_size - len(group), for example by adding some noise to it.
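
One possible reading of the noise suggestion, as a sketch only: shift each class's shortfall by a small random integer and clip it at zero so the sample size never goes negative. The uniform jitter in [-max_jitter, max_jitter], the function name, and the toy data are all assumptions, not part of the original answer.

```python
import numpy as np
import pandas as pd

def oversample_with_jitter(frame, label_col="class", max_jitter=2, seed=0):
    """Oversample each class to roughly, not exactly, the largest class size."""
    rs = np.random.RandomState(seed)
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # keep every original row
    for _, group in frame.groupby(label_col):
        # Assumed noise model: uniform integer in [-max_jitter, max_jitter].
        jitter = int(rs.randint(-max_jitter, max_jitter + 1))
        # Clip at 0 because sample() cannot draw a negative number of rows.
        n_extra = max(0, max_size - len(group) + jitter)
        parts.append(group.sample(n_extra, replace=True, random_state=rs))
    return pd.concat(parts, ignore_index=True)

# Hypothetical toy data: 8 rows of class 0, 3 rows of class 1.
data_rng = np.random.RandomState(1)
toy = pd.DataFrame({
    "feature1": data_rng.rand(11),
    "feature2": data_rng.rand(11),
    "class": [0] * 8 + [1] * 3,
})

noisy = oversample_with_jitter(toy, max_jitter=2)
noisy_counts = noisy["class"].value_counts()
print(noisy_counts)
```

With this scheme no class shrinks below its original size, and each class lands within max_jitter rows of max_size.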

