
Python Pandas Data Cleaning

I am trying to read a large log file, which has been parsed using different delimiters (a legacy issue). Code:

for root, dirs, files in os.walk('.', topdown=True):
    for file in fil
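The truncated loop above walks the directory tree with os.walk. As a minimal, self-contained sketch of that traversal (the helper name collect_files and its top parameter are my own, not from the question):

```python
import os

# Hedged sketch of the question's traversal: gather every file path
# under `top` with os.walk. The question's truncated loop presumably
# reads each file here instead of collecting paths.
def collect_files(top='.'):
    paths = []
    for root, dirs, files in os.walk(top, topdown=True):
        for name in files:
            paths.append(os.path.join(root, name))
    return sorted(paths)
```

Each collected path can then be fed to the read/clean pipeline shown in the solution below.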

Solution 1:

import pandas as pd

#read each line into a single raw column (newer pandas versions may
#reject sep='\n'; reading the lines yourself is a safe fallback)
df = pd.read_csv(file, sep='\n', header=None)

#remove leading/trailing whitespace and split into columns
df = df[0].str.strip().str.split(r'[,|;: \t]+', n=1, expand=True).rename(columns={0: 'email', 1: 'data'})

#drop rows whose data contains characters outside the range 32..255 (adapt the upper bound to your needs)
df = df[~df.data.fillna('').str.contains('[^ -ÿ]')]

#drop rows with invalid email addresses
email_re = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df.email.fillna('').str.contains(email_re)]

The email regex was taken from here (only the parentheses were changed to non-capturing groups). If you want to be comprehensive, you can use this monster regex as well.
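To see the three cleaning steps working together, here is a hedged end-to-end run of Solution 1 on made-up sample lines; the log content below is an assumption for illustration, not data from the question:

```python
import io

import pandas as pd

# assumed sample: one good comma line, one good semicolon line,
# one row without an email, one row with a control character
raw = io.StringIO(
    "alice@example.com,first record\n"
    "bob@example.org;second record\n"
    "not-an-email|some data\n"
    "carol@example.net: bad \x01 bytes\n"
)

# one raw line per row (sidesteps read_csv(sep='\n'), which newer
# pandas versions may reject)
df = pd.DataFrame({0: raw.read().splitlines()})

# split on the first run of any legacy delimiter: , | ; : space or tab
df = (df[0].str.strip()
           .str.split(r'[,|;: \t]+', n=1, expand=True, regex=True)
           .rename(columns={0: 'email', 1: 'data'}))

# keep rows whose data stays inside the printable range 32..255
df = df[~df['data'].fillna('').str.contains('[^ -\xff]')]

# keep rows with a plausible email address
email_re = r"^\w+(?:[-+.']\w+)*@\w+(?:[-.]\w+)*\.\w+(?:[-.]\w+)*$"
df = df[df['email'].fillna('').str.contains(email_re)]
```

The third and fourth sample rows are dropped, the first by the email filter and the second by the printable-character filter, leaving only the two well-formed records.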
