Need To Create A Pandas Dataframe By Reading Csv File With Random Columns

I have the following CSV file with records:

A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8

My column headers/names are: A, B, C, D, E, F, G

So my initial dataframe after

Solution 1:

You can loop through the rows with apply (axis=1) and build a pandas Series for each row from the key-value pairs obtained by splitting each cell. The resulting Series are automatically aligned on their index. Note that the data has no F column but does have an extra H; I am not sure whether that is what you need, but dropping H and adding an all-NaN F column should be straightforward:

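# Build one Series per row from the "name value" cells; NaN cells are skipped.
# split() with no argument also tolerates a leading space left after the comma.
df.apply(lambda r: pd.Series({x[0]: x[1] for x in r.str.split()
                              if isinstance(x, list) and len(x) == 2}), axis=1)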


#     A   B   C   D   E   G   H
#0    1   2  10  15 NaN NaN NaN
#1    5 NaN NaN  10 NaN   2 NaN
#2  NaN NaN NaN   6   7 NaN NaN
#3  NaN NaN NaN NaN NaN   8   7
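
The answer leaves the "remove H, add F" step to the reader. A minimal sketch of one way to do it, assuming the frame is read with read_csv(header=None) as in Solution 3 and that reindex is an acceptable way to select the columns; the names wide and result are only illustrative:

```python
from io import StringIO
import pandas as pd

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

# Assumption: the file is read without a header, so each cell is "name value".
df = pd.read_csv(StringIO(data), header=None)

# Same idea as the apply call above: one Series per row, aligned by name.
wide = df.apply(
    lambda r: pd.Series({x[0]: x[1] for x in r.str.split()
                         if isinstance(x, list) and len(x) == 2}),
    axis=1,
)

# reindex keeps only the requested columns: H is dropped and the
# missing F column is created and filled with NaN.
result = wide.reindex(columns=list("ABCDEFG"))
print(result)
```

The values are still strings at this point; converting them to numbers is a separate step.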

Solution 2:

Apply solution:

Split each cell on whitespace, remove the NaN rows with dropna, move the names into the index with set_index, and convert the one-column DataFrame to a Series with DataFrame.squeeze. Finally, reindex by the new column names:

print (df.apply(lambda x: x.str.split(expand=True)
                               .dropna()
                               .set_index(0)
                               .squeeze(), axis=1)
         .reindex(columns=list('ABCDEFGH')))

     A    B    C    D    E   F    G    H
0    1    2   10   15  NaN NaN  NaN  NaN
1    5  NaN  NaN   10  NaN NaN    2  NaN
2  NaN  NaN  NaN    6    7 NaN  NaN  NaN
3  NaN  NaN  NaN  NaN  NaN NaN    8    7
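
To see what the lambda does, it may help to run the same chain on a single row. This is only a sketch under the same read_csv(header=None) assumption; the names row, pairs, series and single are illustrative. It also shows one edge case that seems worth knowing: with a single pair in a row, a bare squeeze() collapses to a scalar, while squeeze("columns") keeps a Series.

```python
from io import StringIO
import pandas as pd

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

df = pd.read_csv(StringIO(data), header=None)

# Take one row of the raw frame; its last cell is NaN.
row = df.loc[1]

# Split every cell into two columns: 0 holds the name, 1 holds the value.
pairs = row.str.split(expand=True)

# The NaN cell produces an all-missing row, which dropna removes.
pairs = pairs.dropna()

# Names become the index; squeeze turns the one-column frame into a Series.
series = pairs.set_index(0).squeeze()
print(series)

# Edge case: a 1x1 frame squeezes to a scalar unless the axis is given.
single = pd.Series(["D 6"]).str.split(expand=True).set_index(0)
scalar = single.squeeze()
still_series = single.squeeze("columns")
```

If rows with a single pair can occur in the real file, squeeze("columns") looks like the safer choice inside the apply call.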

Stack solution:

Use stack to create a Series, split by whitespace into new columns, append the column holding the new column names (A, B, ...) to the index with set_index, convert the one-column DataFrame to a Series with DataFrame.squeeze, remove the index level holding the old column names with reset_index, then unstack. Reindex by the new column names (this adds the missing columns, filled with NaN), convert the values to float with astype, and finally remove the columns name with rename_axis (new in pandas 0.18.0):

print (df.stack()
         .str.split(expand=True)
         .set_index(0, append=True)
         .squeeze()
         .reset_index(level=1, drop=True)
         .unstack()
         .reindex(columns=list('ABCDEFGH'))
         .astype(float)
         .rename_axis(None, axis=1))

     A    B     C     D    E   F    G    H
0  1.0  2.0  10.0  15.0  NaN NaN  NaN  NaN
1  5.0  NaN   NaN  10.0  NaN NaN  2.0  NaN
2  NaN  NaN   NaN   6.0  7.0 NaN  NaN  NaN
3  NaN  NaN   NaN   NaN  NaN NaN  8.0  7.0
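
The stack chain is easier to follow when split into named steps. The sketch below is one possible breakdown, not the answer's exact code: it selects column 1 instead of calling squeeze, and it adds an explicit dropna() after stack on the assumption that newer pandas versions may keep NaN cells when stacking. The intermediate names are illustrative.

```python
from io import StringIO
import pandas as pd

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

df = pd.read_csv(StringIO(data), header=None)

# 1. One long Series indexed by (row, original column position).
#    dropna() is a no-op where stack already drops NaN cells.
stacked = df.stack().dropna()

# 2. Split "name value" into two columns: 0 = name, 1 = value.
pairs = stacked.str.split(expand=True)

# 3. Append the name to the index and keep the value column as a Series.
values = pairs.set_index(0, append=True)[1]

# 4. Drop the level holding the original column positions.
values = values.reset_index(level=1, drop=True)

# 5. Pivot the names into columns, add missing ones, convert to float,
#    and clear the leftover columns name.
wide = (values.unstack()
              .reindex(columns=list("ABCDEFGH"))
              .astype(float)
              .rename_axis(None, axis=1))
print(wide)
```

Note that unstack requires each (row, name) pair to be unique, so a row that repeats a name would need to be deduplicated or aggregated first.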

Solution 3:

Here is the code:

res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))

def classifier(row):
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    res.loc[row.name, cols] = vals

df.apply(classifier, axis=1)

Input:

from io import StringIO
import pandas as pd

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

df = pd.read_csv(StringIO(data), header=None)
print("df:\n", df)

res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))

def classifier(row):
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    res.loc[row.name, cols] = vals
df.apply(classifier, axis=1)

print("\nres:\n", res)

Output:

df:
    0    1     2     3
0   A 1  B 2   C 10  D 15
1   A 5  D 10  G 2   NaN
2   D 6  E 7   NaN   NaN
3   H 7  G 8   NaN   NaN

res:
    A   B   C   D   E   F   G   H
0   1   2   10  15  NaN NaN NaN NaN
1   5   NaN NaN 10  NaN NaN 2   NaN
2   NaN NaN NaN 6   7   NaN NaN NaN
3   NaN NaN NaN NaN NaN NaN 8   7
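
The values stored in res are strings, because they come straight from splitting the text cells. If numeric columns are needed, one option is a final astype(float), as in the stack solution. A minimal sketch repeating the script above; the name numeric is illustrative:

```python
from io import StringIO
import pandas as pd

data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""

df = pd.read_csv(StringIO(data), header=None)

# Empty result frame with the target columns, filled row by row.
res = pd.DataFrame(index=df.index, columns=list("ABCDEFGH"))

def classifier(row):
    # First token of each cell is the column name, second is the value.
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    res.loc[row.name, cols] = vals

df.apply(classifier, axis=1)

# Convert the string values to floats; empty cells stay NaN.
numeric = res.astype(float)
print(numeric.dtypes)
```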