Need To Create A Pandas Dataframe By Reading Csv File With Random Columns
Solution 1:
You can loop through the rows with apply(axis=1) and build a pandas Series for each row from the key-value pairs obtained by splitting each cell. The newly constructed Series are aligned automatically by their index. Note that the result has no F column but has an extra H column, which may not be what you need; dropping H and adding an all-NaN F column is straightforward:
df.apply(lambda r: pd.Series({x[0]: x[1] for x in r.str.split(' ')
                              if isinstance(x, list) and len(x) == 2}), axis=1)
#     A    B    C    D    E    G    H
#0    1    2   10   15  NaN  NaN  NaN
#1    5  NaN  NaN   10  NaN    2  NaN
#2  NaN  NaN  NaN    6    7  NaN  NaN
#3  NaN  NaN  NaN  NaN  NaN    8    7
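The adjustment mentioned above (dropping H and adding an empty F) can be sketched with reindex. This is a minimal sketch, assuming the sample input shown under Solution 3 and assuming A through G are the columns you want. The helper name row_to_series is made up for illustration, and cells are split on any whitespace so the blank after each comma does not matter.

```python
from io import StringIO

import pandas as pd

# Sample input, assumed to match the data shown under Solution 3.
data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""
df = pd.read_csv(StringIO(data), header=None)


def row_to_series(row):
    # Hypothetical helper: split every non-missing cell into [key, value].
    # str.split() with no argument ignores the leading blank in cells like " B 2".
    pairs = [cell.split() for cell in row.dropna()]
    return pd.Series({key: value for key, value in pairs})


wide = df.apply(row_to_series, axis=1)

# reindex keeps only the listed columns: H is dropped, F is added as all-NaN.
wide = wide.reindex(columns=list("ABCDEFG"))
print(wide)
```

The values are still strings at this point; append .astype(float) if numbers are needed.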
Solution 2:
Apply solution:
Use split to split each cell on whitespace, remove the NaN rows with dropna, move the keys into the index with set_index, and convert the resulting one-column DataFrame to a Series with DataFrame.squeeze. Finally, reindex by the new column names:
print (df.apply(lambda x: x.str.split(expand=True)
                           .dropna()
                           .set_index(0)
                           .squeeze(), axis=1)
         .reindex(columns=list('ABCDEFGH')))
     A    B    C    D    E    F    G    H
0    1    2   10   15  NaN  NaN  NaN  NaN
1    5  NaN  NaN   10  NaN  NaN    2  NaN
2  NaN  NaN  NaN    6    7  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN  NaN  NaN    8    7
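To see what each step in the apply chain does, the sketch below runs the steps one at a time on a single row. It assumes the sample input shown under Solution 3, and the intermediate variable names are only for illustration. It also passes axis=1 to squeeze, a small deviation from the code above, so that a row holding just one pair would still yield a Series rather than a scalar.

```python
from io import StringIO

import pandas as pd

# Sample input, assumed to match the data shown under Solution 3.
data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""
df = pd.read_csv(StringIO(data), header=None)

row = df.loc[1]                      # the cells of the second row; the last one is NaN

parts = row.str.split(expand=True)   # column 0 holds the key, column 1 the value;
                                     # the NaN cell becomes a row of NaN
parts = parts.dropna()               # discard that all-NaN row
keyed = parts.set_index(0)           # keys (A, D, G) become the index
series = keyed.squeeze(axis=1)       # one-column DataFrame -> Series

print(series)
```

When apply collects one such Series per row, pandas aligns them by key, which is how the differing sets of keys end up as columns.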
Stack solution:
Use stack to create a Series, split each value on whitespace with expand=True to create new columns, and append the column holding the new column names (A, B, ...) to the index with set_index. Convert the one-column DataFrame to a Series with DataFrame.squeeze, remove the index level holding the old column names with reset_index, then unstack. reindex by the new column names (this adds the missing columns, filled with NaN), convert the values to float with astype, and finally remove the name of the columns axis with rename_axis (new in pandas 0.18.0):
print (df.stack()
         .str.split(expand=True)
         .set_index(0, append=True)
         .squeeze()
         .reset_index(level=1, drop=True)
         .unstack()
         .reindex(columns=list('ABCDEFGH'))
         .astype(float)
         .rename_axis(None, axis=1))
     A    B     C     D    E   F    G    H
0  1.0  2.0  10.0  15.0  NaN NaN  NaN  NaN
1  5.0  NaN   NaN  10.0  NaN NaN  2.0  NaN
2  NaN  NaN   NaN   6.0  7.0 NaN  NaN  NaN
3  NaN  NaN   NaN   NaN  NaN NaN  8.0  7.0
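The stack chain can also be written with named intermediate results, which makes it easier to inspect each stage. This is a sketch under assumptions rather than the answer's exact code: it uses the sample input shown under Solution 3, the variable names are illustrative, and it calls dropna explicitly after stack because whether stack itself discards missing cells can depend on the pandas version.

```python
from io import StringIO

import pandas as pd

# Sample input, assumed to match the data shown under Solution 3.
data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""
df = pd.read_csv(StringIO(data), header=None)

# One entry per cell, indexed by (row label, original column position).
stacked = df.stack().dropna()

# Column 0 holds the key (A, B, ...), column 1 the value.
parts = stacked.str.split(expand=True)

# Append the key to the index, then reduce the one-column frame to a Series.
keyed = parts.set_index(0, append=True).squeeze(axis=1)

# Drop the original column position; the index is now (row label, key).
keyed = keyed.reset_index(level=1, drop=True)

wide = (keyed.unstack()                          # keys become columns
             .reindex(columns=list("ABCDEFGH"))  # adds F as all-NaN
             .astype(float)
             .rename_axis(None, axis=1))         # clear the columns' name
print(wide)
```

Note that unstack requires each key to appear at most once per row; a repeated key would make the (row label, key) index non-unique and raise an error.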
Solution 3:
Here is the code:
res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))
def classifier(row):
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    res.loc[row.name, cols] = vals
df.apply(classifier, axis=1)
Input:
from io import StringIO
import pandas as pd
import numpy as np
data = """A 1, B 2, C 10, D 15
A 5, D 10, G 2
D 6, E 7
H 7, G 8"""
df = pd.read_csv(StringIO(data), header=None)
print("df:\n", df)
res = pd.DataFrame(index=df.index, columns=list('ABCDEFGH'))
def classifier(row):
    cols = row.str.split().str[0].dropna().tolist()
    vals = row.str.split().str[1].dropna().tolist()
    res.loc[row.name, cols] = vals
df.apply(classifier, axis=1)
print("\nres:\n", res)
Output:
df:
      0      1      2      3
0  A 1    B 2   C 10   D 15
1  A 5   D 10    G 2    NaN
2  D 6    E 7    NaN    NaN
3  H 7    G 8    NaN    NaN

res:
     A    B    C    D    E    F    G    H
0    1    2   10   15  NaN  NaN  NaN  NaN
1    5  NaN  NaN   10  NaN  NaN    2  NaN
2  NaN  NaN  NaN    6    7  NaN  NaN  NaN
3  NaN  NaN  NaN  NaN  NaN  NaN    8    7