Compare Two Excel Files That Have A Different Number Of Rows Using Python Pandas
I'm using Python 3.7, and I want to compare two Excel files that have the same columns (140 columns) but a different number of rows. I looked on the website, but I didn't fin
Solution 1:
An Excel diff can quickly become a funky beast, but we should be able to do this with some concats
and boolean statements.
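The answer does not show its input data. As a minimal sketch, the frames below would reproduce the result printed further down; the values are inferred from that output and the file names in the comments are hypothetical, so treat all of it as an assumption rather than the asker's real data.

```python
import pandas as pd

# In practice the frames would be loaded from the two workbooks, e.g.
# df1 = pd.read_excel("old.xlsx")  # hypothetical file name
# df2 = pd.read_excel("new.xlsx")  # hypothetical file name

# Hypothetical sample data, inferred from the printed result (an assumption).
df1 = pd.DataFrame({
    "id":  ["A", "B", "C"],
    "d1":  [23, 63, 61],
    "d2":  [35, 63, 62],
    "qte": [10, 43, 15],
})
df2 = pd.DataFrame({
    "id":  ["A", "C", "E", "F"],
    "d1":  [23, 61, 62, 20],
    "d2":  [35, 62, 16, 51],
    "qte": [20, 15, 38, 63],
})
```

Both frames keep 'id' as an ordinary column here, since the code that follows sets it as the index.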
Assuming your dataframes are called df1 and df2 and share an 'id' column:
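import pandas as pd

# index both frames by the key column so rows can be matched by id
df1 = df1.set_index('id')
df2 = df2.set_index('id')

# stack the two files on top of each other; ids present in both appear twice
df3 = pd.concat([df1, df2], sort=False)
# for every (id, column) pair, collect the distinct values seen across both files
df3a = df3.stack().groupby(level=[0, 1]).unique().unstack(1).copy()

df3a.loc[~df3a.index.isin(df2.index), 'status'] = 'deleted'  # if not in df2 index then deleted
df3a.loc[~df3a.index.isin(df1.index), 'status'] = 'new'  # if not in df1 index then new
idx = df3.stack().groupby(level=[0, 1]).nunique()  # distinct values per cell, used to find modified cells
# any id with a cell holding more than one distinct value is modified
df3a.loc[idx.mask(idx <= 1).dropna().index.get_level_values(0), 'status'] = 'modified'
df3a['status'] = df3a['status'].fillna('same')  # assume that anything not covered by the rules above is the same
print(df3a)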
      d1    d2       qte    status
id
A   [23]  [35]  [10, 20]  modified
B   [63]  [63]      [43]   deleted
C   [61]  [62]      [15]      same
E   [62]  [16]      [38]       new
F   [20]  [51]      [63]       new
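If only the status per id is needed, a more explicit alternative is to classify rows with plain index set operations. This is a different technique from the concat/stack approach above, shown only as a sketch; the sample frames are hypothetical values inferred from the printed result.

```python
import pandas as pd

# Hypothetical sample data (an assumption, inferred from the printed result).
df1 = pd.DataFrame(
    {"id": ["A", "B", "C"], "d1": [23, 63, 61], "d2": [35, 63, 62], "qte": [10, 43, 15]}
).set_index("id")
df2 = pd.DataFrame(
    {"id": ["A", "C", "E", "F"], "d1": [23, 61, 62, 20], "d2": [35, 62, 16, 51], "qte": [20, 15, 38, 63]}
).set_index("id")

deleted = df1.index.difference(df2.index)   # ids only in the first file
new = df2.index.difference(df1.index)       # ids only in the second file
common = df1.index.intersection(df2.index)  # ids present in both files

# Compare the shared rows cell by cell; a row is modified if any cell differs.
# Caveat: NaN != NaN is True, so missing values on both sides count as a change.
changed = (df1.loc[common] != df2.loc[common]).any(axis=1)
modified = changed[changed].index

# Start from 'same' and overwrite with the more specific labels.
status = pd.Series("same", index=df1.index.union(df2.index), name="status")
status.loc[deleted] = "deleted"
status.loc[new] = "new"
status.loc[modified] = "modified"
print(status)
```

This assumes the ids are unique within each file and that both files have the same columns, as in the question.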
If you don't mind the performance hit of turning all your datatypes into strings, then this could work. I don't recommend it, though; use a fact or slowly changing dimension schema to hold such data, and you'll thank yourself in the future.
df3a.stack().explode().astype(str).groupby(level=[0,1]).agg('-->'.join).unstack(1)
    d1  d2      qte    status
id
A   23  35  10-->20  modified
B   63  63       43   deleted
C   61  62       15      same
E   62  16       38       new
F   20  51       63       new
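As a loosely related sketch, changed cells can also be kept in a long table with one row per (id, column) pair and the old and new values side by side, which avoids the string conversion. This is not necessarily the schema the answer has in mind, only an assumption about what such a record could look like; the sample frames are hypothetical values inferred from the printed result.

```python
import pandas as pd

# Hypothetical sample data (an assumption, inferred from the printed result).
df1 = pd.DataFrame(
    {"id": ["A", "B", "C"], "d1": [23, 63, 61], "d2": [35, 63, 62], "qte": [10, 43, 15]}
).set_index("id")
df2 = pd.DataFrame(
    {"id": ["A", "C", "E", "F"], "d1": [23, 61, 62, 20], "d2": [35, 62, 16, 51], "qte": [20, 15, 38, 63]}
).set_index("id")

# Only ids present in both files can have modified cells.
common = df1.index.intersection(df2.index)

# Reshape to long form: one entry per (id, column) pair.
old = df1.loc[common].stack()
new = df2.loc[common].stack()

# Align old and new values on the (id, column) index and keep the differences.
changes = pd.DataFrame({"old": old, "new": new})
changes = changes[changes["old"] != changes["new"]]
print(changes)
```

With the sample data the only row left is the 'qte' cell of id A, with the old value 10 and the new value 20.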