Duplicating Training Examples To Handle Class Imbalance In A Pandas Data Frame
I have a pandas DataFrame that contains training examples, for example:
   feature1  feature2  class
0  0.548814  0.791725      1
1  0.715189  0.528895      0
2  0.602763  0.5680
Solution 1:
You can find the size of the largest class with:
max_size = frame['class'].value_counts().max()
In your example, this equals 8. For each group, you can sample max_size - len(group) elements with replacement. If you concat these to the original DataFrame, every class ends up with the same size and all the original rows are kept.
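import pandas as pd

lst = [frame]  # start with the original rows
for class_index, group in frame.groupby('class'):
    # draw the missing rows for this class, with replacement
    lst.append(group.sample(max_size - len(group), replace=True))
frame_new = pd.concat(lst)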
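To check the approach end to end, here is a minimal self-contained sketch. The toy frame, its random values, and the helper name oversample_to_max are assumptions made for illustration; they are not the data from the question.

```python
import numpy as np
import pandas as pd


def oversample_to_max(frame, label_col="class", random_state=None):
    """Hypothetical helper: pad every class up to the size of the largest one."""
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # keep all original rows
    for _, group in frame.groupby(label_col):
        n_extra = max_size - len(group)
        if n_extra > 0:
            # duplicate rows of this class, drawn with replacement
            parts.append(
                group.sample(n_extra, replace=True, random_state=random_state)
            )
    return pd.concat(parts)


# Toy data (made up): 8 rows of class 1 and 3 rows of class 0
rng = np.random.default_rng(0)
frame = pd.DataFrame({
    "feature1": rng.random(11),
    "feature2": rng.random(11),
    "class": [1] * 8 + [0] * 3,
})

frame_new = oversample_to_max(frame, random_state=0)
print(frame_new["class"].value_counts())
```

The result keeps the index labels of the sampled rows, so the index contains duplicates. If that is a problem, passing ignore_index=True to pd.concat gives a fresh 0..n-1 index.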
Note that this makes all group sizes exactly equal. If you don't want that, you can adjust max_size - len(group), for example by adding some noise to it.
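One possible reading of "adding noise" is to shift the number of extra rows per class by a small random integer, so the classes end up roughly rather than exactly balanced. The sketch below assumes that reading; the helper name oversample_with_jitter, the max_jitter parameter, and the toy data are all hypothetical.

```python
import numpy as np
import pandas as pd


def oversample_with_jitter(frame, label_col="class", max_jitter=2, seed=0):
    """Hypothetical helper: oversample each class to roughly the largest class size."""
    rng = np.random.default_rng(seed)
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # keep all original rows
    for _, group in frame.groupby(label_col):
        # shift the number of extra rows by a random integer in [-max_jitter, max_jitter]
        jitter = int(rng.integers(-max_jitter, max_jitter + 1))
        n_extra = max(max_size - len(group) + jitter, 0)  # never negative
        if n_extra > 0:
            sample_seed = int(rng.integers(0, 2**32 - 1))
            parts.append(
                group.sample(n_extra, replace=True, random_state=sample_seed)
            )
    return pd.concat(parts, ignore_index=True)


# Toy data (made up): 8 rows of class 1 and 3 rows of class 0
data_rng = np.random.default_rng(1)
frame = pd.DataFrame({
    "feature1": data_rng.random(11),
    "feature2": data_rng.random(11),
    "class": [1] * 8 + [0] * 3,
})

frame_noisy = oversample_with_jitter(frame, max_jitter=2, seed=0)
print(frame_noisy["class"].value_counts())
```

With this design the largest class can also gain up to max_jitter duplicated rows, and no class ever loses original rows, because the number of extra rows is clamped at zero.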