Improve Performance Calculating A Random Sample Matching Specific Conditions In Pandas
Solution 1:
IIUC you want to end up with k random samples for each row (combination of metrics) in your input dataframe. So why not candidates.sample(n=k, ...) and get rid of the for loop? Alternatively, you could concatenate your dataframe k times with pd.concat([group1] * k).
It depends on your real data, but I would give grouping the input dataframe by the metric columns with group1.groupby(join_columns_enrich) a shot (if their cardinality is sufficiently low), and apply the random sampling on these groups, picking k * len(group.index) random samples for each. groupby is expensive; OTOH you might save a lot on the iteration/sampling once it's done. A sketch of this idea follows.
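A minimal sketch of that grouping idea, assuming group_0 holds the candidate rows, group_1 the rows to enrich, and join_columns_enrich the metric columns (names taken from the question):

import numpy as np
import pandas as pd

k = 3

# How many group_1 rows fall into each metric combination:
needed = group_1.groupby(join_columns_enrich).size().to_dict()

# One sample() call per metric combination instead of one per row:
# draw k candidates from group_0 for every matching group_1 row.
samples = {
    key: grp.sample(n=k * needed[key], replace=True, random_state=42)
    for key, grp in group_0.groupby(join_columns_enrich)
    if needed.get(key, 0) > 0
}

Each dictionary value then holds the pooled samples for one metric combination, which can be re-attached to the matching group_1 rows afterwards.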
Solution 2:
@smiandras, you are correct. Getting rid of the for loop is important.
Variant 1: multiple samples:
import numpy as np

def randomMatchingCondition(original_element, group_0, join_columns, k, random_state):
    # Build a query string like "group_1 == 3 & group_2 == 7" from the
    # join columns of the current row.
    limits_dict = original_element[join_columns].to_dict()
    query = ' & '.join([f"{col} == {val}" for col, val in limits_dict.items()])
    candidates = group_0.query(query)
    if len(candidates) > 0:
        return candidates.sample(n=k, random_state=random_state, replace=True)['metric_group_0'].values
    else:
        return np.nan
####################################################################
# iterate over pandas dataframe k times for more robust sampling
k = 3
resulting_df = None

#######################
# trying to improve performance: sort both dataframes
group_0 = group_0.sort_values(join_columns_enrich)
group_1 = group_1.sort_values(join_columns_enrich)
#######################

# progress_apply needs tqdm's pandas integration:
# from tqdm import tqdm; tqdm.pandas()
group_1['metric_group_0'] = group_1.progress_apply(randomMatchingCondition,
                                                   args=[group_0, join_columns_enrich, k, None],
                                                   axis=1)
print(group_1.isnull().sum())
group_1 = group_1[~group_1.metric_group_0.isnull()]
display(group_1.head())
# Un-nest the array column: repeat the index once per array element.
s = pd.DataFrame({'metric_group_0': np.concatenate(group_1.metric_group_0.values)},
                 index=group_1.index.repeat(group_1.metric_group_0.str.len()))
s = s.join(group_1.drop(columns='metric_group_0'), how='left')
s['pos_in_array'] = s.groupby(s.index).cumcount()
s.head()
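On pandas 0.25+ the same un-nesting can be written with the built-in DataFrame.explode; a short sketch, assuming group_1 still carries the array-valued metric_group_0 column:

# Built-in alternative to the manual np.concatenate/repeat un-nesting
# (pandas >= 0.25): explode repeats the index once per array element.
s = group_1.explode('metric_group_0')
s['pos_in_array'] = s.groupby(s.index).cumcount()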
Variant 2: all possible samples, optimized by a native JOIN operation.
Warning: this is a bit unsafe, as it might generate a gigantic number of rows:
size = 1000

# Build a synthetic dataframe: one metric column, a binary label,
# and two low-cardinality join columns.
df = pd.DataFrame({'metric': np.random.randint(1, 100, size=size)})
df['label'] = np.random.randint(0, 2, size=size)
df['group_1'] = pd.Series(np.random.randint(1, 12, size=size)).astype(object)
df['group_2'] = pd.Series(np.random.randint(1, 10, size=size)).astype(object)

# Candidate pool: all label-0 rows, reduced to metric + join columns.
group_0 = df[df['label'] == 0]
group_0 = group_0.reset_index(drop=True)
join_columns_enrich = ['group_1', 'group_2']
join_real = ['metric']
join_real.extend(join_columns_enrich)
group_0 = group_0[join_real]
display(group_0.head())

# Rows to enrich: all label-1 rows.
group_1 = df[df['label'] == 1]
group_1 = group_1.reset_index(drop=True)
display(group_1.head())

# Native JOIN: every group_1 row is matched with every group_0 row that
# shares the same join-column values (a cartesian product per group).
df = group_1.merge(group_0, on=join_columns_enrich)
display(df.head())
print(group_1.shape)
df.shape
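If only k matches per group_1 row are wanted, the exhaustive merge can be trimmed afterwards with a per-row groupby sample; a sketch, where orig_idx is a hypothetical key column added before the merge to remember each row's identity:

# Keep only k random matches per original group_1 row.
# orig_idx is a hypothetical key preserving group_1 row identity.
k = 3
group_1 = group_1.reset_index().rename(columns={'index': 'orig_idx'})
df = group_1.merge(group_0, on=join_columns_enrich)
sampled = (df.groupby('orig_idx', group_keys=False)
             .apply(lambda g: g.sample(n=min(k, len(g)), random_state=42)))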