Generating A Similarity Matrix From Pandas Dataframe
I have a df
id   val1 val2 val3
100  aa   bb   cc
200  bb   cc   0
300  aa   cc   0
400  bb   aa   cc
From this I have to generate a df, som
Solution 1:
Some preprocessing first: set id as the index with set_index and replace the '0' placeholders with NaN, since we don't need them.
df = df.set_index('id').replace('0', np.nan)
df
val1 val2 val3
id
100 aa bb cc
200 bb cc NaN
300 aa cc NaN
400 bb aa cc
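Now use a combination of pd.get_dummies and df.dot to get your similarity scores. get_dummies builds one indicator column per column/value pair, the groupby merges indicators that share the same value, and the dot product of the result with its transpose counts how many values each pair of ids has in common.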
x = pd.get_dummies(df, dtype=int)
# groupby(..., axis=1) is deprecated in recent pandas, so group the transpose instead
y = x.T.groupby(x.columns.str.split('_').str[1]).sum().T
y.dot(y.T)
id   100  200  300  400
id
100    3    2    2    3
200    2    2    1    2
300    2    1    2    2
400    3    2    2    3
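For reference, here is a minimal end-to-end sketch of this approach. The frame is rebuilt by hand from the sample in the question, and it assumes the placeholder is the string '0' and that no value contains an underscore (the column split relies on that). The names values and sim are just illustrative.

```python
import numpy as np
import pandas as pd

# Sample data retyped from the question; '0' marks an empty slot.
df = pd.DataFrame({
    "id": [100, 200, 300, 400],
    "val1": ["aa", "bb", "aa", "bb"],
    "val2": ["bb", "cc", "cc", "aa"],
    "val3": ["cc", "0", "0", "cc"],
})
df = df.set_index("id").replace("0", np.nan)

# One 0/1 indicator column per (column, value) pair, e.g. "val1_aa".
x = pd.get_dummies(df, dtype=int)

# Merge indicators that refer to the same value, whatever column it came from.
values = x.columns.str.split("_").str[1]
y = x.T.groupby(values).sum().T

# Entry (i, j) counts the values shared by ids i and j.
sim = y.dot(y.T)
print(sim)
```

Note that if a row could contain the same value twice, the counts in y would exceed 1 and the dot product would over-count, so this sketch assumes values are unique within a row.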
Solution 2:
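You can convert each row into a set and then intersect the sets pairwise. This assumes id is already the index of df, otherwise the id values end up in the sets too: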
df = df.replace('0', np.nan)
c = df.apply(lambda x: set(x.dropna()), axis=1)
df2 = pd.DataFrame([[len(x.intersection(y)) for x in c] for y in c],columns=c.index,index=c.index)
This gives the desired output:
100 200 300 400
100 3 2 2 3
200 2 2 1 2
300 2 1 2 2
400 3 2 2 3
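A self-contained sketch of the set-based approach, again with the sample frame retyped from the question. As a small variation on the answer's df.apply call, it builds the sets with a plain dict comprehension over iterrows; the names row_sets, ids and sim are illustrative, not from the original.

```python
import numpy as np
import pandas as pd

# Sample data retyped from the question; '0' marks an empty slot.
df = pd.DataFrame({
    "id": [100, 200, 300, 400],
    "val1": ["aa", "bb", "aa", "bb"],
    "val2": ["bb", "cc", "cc", "aa"],
    "val3": ["cc", "0", "0", "cc"],
}).set_index("id")
df = df.replace("0", np.nan)

# Map each id to the set of its non-missing values.
row_sets = {idx: set(row.dropna()) for idx, row in df.iterrows()}

# Similarity of two ids = size of the intersection of their sets.
ids = list(row_sets)
sim = pd.DataFrame(
    [[len(row_sets[a] & row_sets[b]) for b in ids] for a in ids],
    index=ids,
    columns=ids,
)
print(sim)
```

Because sets drop duplicates, a value repeated within a row is counted once here. The pairwise loop is quadratic in the number of rows, which is fine for small frames like this one.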