Generating A Similarity Matrix From Pandas Dataframe

I have a df

id   val1 val2 val3
100  aa   bb   cc
200  bb   cc   0
300  aa   cc   0
400  bb   aa   cc

From this I have to generate a df, som

Solution 1:

Some preprocessing first: make id the index with set_index and replace the '0' placeholders with NaN, since we don't need them.

df = df.set_index('id').replace('0', np.nan)

df    
    val1 val2 val3
id                
100   aa   bb   cc
200   bb   cc  NaN
300   aa   cc  NaN
400   bb   aa   cc 
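For anyone who wants to reproduce this step, here is a minimal sketch that rebuilds the sample frame from the question and applies the same preprocessing. It assumes the 0 entries are stored as the string '0' (which is what the replace call above implies) and that id starts out as an ordinary column.

```python
import numpy as np
import pandas as pd

# Sample data from the question; the '0' placeholders are assumed to be strings.
df = pd.DataFrame({
    'id':   [100, 200, 300, 400],
    'val1': ['aa', 'bb', 'aa', 'bb'],
    'val2': ['bb', 'cc', 'cc', 'aa'],
    'val3': ['cc', '0', '0', 'cc'],
})

# Index by id and turn the '0' placeholders into missing values.
df = df.set_index('id').replace('0', np.nan)
print(df)
```

If the zeros were read in as integers instead, replace(0, np.nan) would be the call to use.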

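Now combine pd.get_dummies and DataFrame.dot to get the similarity scores. get_dummies builds one indicator column per column/value pair, the groupby merges the indicators that refer to the same value, and the dot product of the result with its transpose counts the values each pair of ids shares.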

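x = pd.get_dummies(df, dtype=int)
# group indicator columns by value ('val1_aa' -> 'aa'); transposing avoids groupby(axis=1), deprecated in pandas 2.1
y = x.T.groupby(x.columns.str.split('_').str[1]).sum().T
y.dot(y.T)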

id   100  200  300  400
id
100    3    2    2    3
200    2    2    1    2
300    2    1    2    2
400    3    2    2    3
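Why the dot product works: if y holds a 1 where an id contains a value and a 0 otherwise, then entry (i, j) of y.dot(y.T) is the sum over all values of y[i, v] * y[j, v], which is the number of values ids i and j have in common. The sketch below runs the whole pipeline end to end on the sample data. It assumes no value repeats within a single row (otherwise the counts in y would exceed 1 and inflate the scores), and it groups on the transpose so that it also runs on pandas versions without groupby(axis=1). The names labels and sim are only illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':   [100, 200, 300, 400],
    'val1': ['aa', 'bb', 'aa', 'bb'],
    'val2': ['bb', 'cc', 'cc', 'aa'],
    'val3': ['cc', '0', '0', 'cc'],
}).set_index('id').replace('0', np.nan)

# One 0/1 indicator column per (column, value) pair, e.g. 'val1_aa'.
# NaN cells get no indicator because dummy_na defaults to False.
x = pd.get_dummies(df, dtype=int)

# Strip the 'valN_' prefix so indicators for the same value share a label.
labels = x.columns.str.split('_').str[1]

# Sum indicators with the same label: y has one column per distinct value.
y = x.T.groupby(labels).sum().T

# Entry (i, j) counts the values shared by ids i and j.
sim = y.dot(y.T)
print(sim)
```

The diagonal is simply the number of non-missing values each id has.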

Solution 2:

You can convert each row into a set of its values and then intersect the sets pairwise:

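# df is assumed to be indexed by id here (df.set_index('id')), so that id is not counted as a value
df = df.replace('0', np.nan)
c = df.apply(lambda x: set(x.dropna()), axis=1)
df2 = pd.DataFrame([[len(x.intersection(y)) for x in c] for y in c], columns=c.index, index=c.index)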

This gives the desired output:

     100  200  300  400
100    3    2    2    3
200    2    2    1    2
300    2    1    2    2
400    3    2    2    3
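A self-contained sketch of the set-based approach on the sample data, for comparison. It assumes the same '0' string placeholders and an id index. The pairwise comprehension is quadratic in the number of rows, which is fine for a small frame like this one.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id':   [100, 200, 300, 400],
    'val1': ['aa', 'bb', 'aa', 'bb'],
    'val2': ['bb', 'cc', 'cc', 'aa'],
    'val3': ['cc', '0', '0', 'cc'],
}).set_index('id').replace('0', np.nan)

# One set of non-missing values per id.
c = df.apply(lambda row: set(row.dropna()), axis=1)

# Entry (i, j) is the size of the intersection of the sets for ids i and j.
df2 = pd.DataFrame(
    [[len(a & b) for b in c] for a in c],
    index=c.index,
    columns=c.index,
)
print(df2)
```

Because sets discard duplicates, a value that appears twice in one row is still counted once here.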
