
Compare Two Excel Files That Have A Different Number Of Rows Using Python Pandas

I'm using Python 3.7, and I want to compare two Excel files that have the same columns (140 columns) but a different number of rows. I looked on the website, but I didn't fin

Solution 1:

An Excel diff can quickly become a funky beast, but we should be able to do this with some concats and boolean statements.

Assuming your dataframes are called df1 and df2, and both have an 'id' column that identifies each row:
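If you want something to experiment with before pointing this at the real workbooks, here is a minimal sketch of two small frames. The values are hypothetical: they are reconstructed so that they are consistent with the output printed further down, and the file name in the comment is only a placeholder.

```python
import pandas as pd

# Hypothetical sample data. With real files you would load each workbook instead,
# e.g. df1 = pd.read_excel("old.xlsx")  (placeholder file name).
df1 = pd.DataFrame({
    "id":  ["A", "B", "C"],
    "d1":  [23, 63, 61],
    "d2":  [35, 63, 62],
    "qte": [10, 43, 15],
})
df2 = pd.DataFrame({
    "id":  ["A", "C", "E", "F"],
    "d1":  [23, 61, 62, 20],
    "d2":  [35, 62, 16, 51],
    "qte": [20, 15, 38, 63],
})

# Same columns, different number of rows, as in the question.
print(df1.shape, df2.shape)
```

Row B exists only in df1, rows E and F exist only in df2, and row A differs in the qte column.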

import pandas as pd

# index both frames by the shared key column
df1 = df1.set_index('id')
df2 = df2.set_index('id')

# put both frames on top of each other, then collect the unique values seen in each (id, column) cell
df3 = pd.concat([df1, df2], sort=False)
df3a = df3.stack().groupby(level=[0, 1]).unique().unstack(1).copy()

df3a.loc[~df3a.index.isin(df2.index), 'status'] = 'deleted'  # if not in df2 index then deleted
df3a.loc[~df3a.index.isin(df1.index), 'status'] = 'new'  # if not in df1 index then new
idx = df3.stack().groupby(level=[0, 1]).nunique()  # count distinct values per cell to find modified cells
df3a.loc[idx.mask(idx <= 1).dropna().index.get_level_values(0), 'status'] = 'modified'  # more than one value means modified
df3a['status'] = df3a['status'].fillna('same')  # assume that anything not matched by the rules above is the same

print(df3a)

      d1    d2       qte    status
id                                
A   [23]  [35]  [10, 20]  modified
B   [63]  [63]      [43]   deleted
C   [61]  [62]      [15]      same
E   [62]  [16]      [38]       new
F   [20]  [51]      [63]       new
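As a cross-check on the status rules above, here is a rough sketch that computes only the status column by comparing the shared rows directly. This is a different technique from the stack/groupby approach in the answer, not a replacement for it. The function name `row_status` and the sample values are hypothetical, and it assumes both frames are already indexed by a unique key and have the same columns.

```python
import pandas as pd

def row_status(df1, df2):
    """Label each index value as deleted, new, modified or same.

    Assumes both frames are indexed by a unique key and share the same columns.
    """
    all_ids = df1.index.union(df2.index)
    status = pd.Series("same", index=all_ids, name="status")
    status.loc[~all_ids.isin(df2.index)] = "deleted"  # only in the old frame
    status.loc[~all_ids.isin(df1.index)] = "new"      # only in the new frame

    # Compare the rows that exist in both frames, cell by cell.
    common = df1.index.intersection(df2.index)
    old = df1.loc[common]
    new = df2.loc[common, df1.columns]                # align column order
    both_missing = old.isna() & new.isna()            # treat NaN vs NaN as equal
    changed = ((old != new) & ~both_missing).any(axis=1)
    status.loc[changed[changed].index] = "modified"
    return status

# Hypothetical sample frames, already indexed by 'id'.
df1 = pd.DataFrame({"d1": [23, 63, 61], "d2": [35, 63, 62], "qte": [10, 43, 15]},
                   index=pd.Index(["A", "B", "C"], name="id"))
df2 = pd.DataFrame({"d1": [23, 61, 62, 20], "d2": [35, 62, 16, 51], "qte": [20, 15, 38, 63]},
                   index=pd.Index(["A", "C", "E", "F"], name="id"))

print(row_status(df1, df2))
```

With 140 columns this avoids building a list per cell, but it only tells you which rows changed, not which values.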

If you don't mind the performance hit of turning all your datatypes to strings, then this could work. I don't recommend it though; use a fact or slowly changing dimension schema to hold such data, you'll thank yourself in the future.

df3a.stack().explode().astype(str).groupby(level=[0,1]).agg('-->'.join).unstack(1)

    d1  d2      qte    status
id                           
A   23  35  10-->20  modified
B   63  63       43   deleted
C   61  62       15      same
E   62  16       38       new
F   20  51       63       new
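One possible reading of the "fact" layout mentioned above is a long change log with one row per changed cell, instead of joined strings. The sketch below is only an illustration of that idea under my own assumptions: the function name `change_log`, the column names `old` and `new`, and the sample values are hypothetical, and both frames are assumed to be indexed by a unique 'id'.

```python
import pandas as pd

def change_log(df1, df2):
    """Return one row per (id, column) cell whose value differs between the frames."""
    old = df1.stack().rename("old")
    new = df2.stack().rename("new")
    # Outer-align on (id, column); a cell missing on one side becomes NaN.
    log = pd.concat([old, new], axis=1)
    log.index.names = ["id", "column"]
    both_missing = log["old"].isna() & log["new"].isna()
    changed = log["old"].ne(log["new"]) & ~both_missing
    return log[changed].reset_index()

# Hypothetical sample frames, already indexed by 'id'.
df1 = pd.DataFrame({"d1": [23, 63, 61], "d2": [35, 63, 62], "qte": [10, 43, 15]},
                   index=pd.Index(["A", "B", "C"], name="id"))
df2 = pd.DataFrame({"d1": [23, 61, 62, 20], "d2": [35, 62, 16, 51], "qte": [20, 15, 38, 63]},
                   index=pd.Index(["A", "C", "E", "F"], name="id"))

print(change_log(df1, df2))
```

A NaN in `old` then means the row is new, a NaN in `new` means it was deleted, and the values keep their numeric type rather than being cast to strings.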
