
Find That Start Date And End Dates Are Available Using Python Pandas

I have a dataframe like this:

year  end     id   start
1949  1954.0  ABc  1949.0
1950  1954.0  ABc  1949.0
1951  1954.0  AB

Solution 1:

This should work; see comments in code for clarification on what I am doing:

import pandas as pd
from functools import reduce

# reading the dataframe from your sample
df = pd.read_clipboard()
df['start'] = df['start'].astype('int')
df['end'] = df['end'].astype('int')


# create a function that builds the full list of years from the min start year to the max end year
def findRange(row):
    return list(range(row['startMin'], row['endMax']+1))

# create three grouped dataframes: the list of years, the min start, and the max end for each id
year_list = pd.DataFrame(df.groupby('id')['year'].apply(list))
start_min = pd.DataFrame(df.groupby('id')['start'].apply(min)).rename(columns={'start':'startMin'})
end_max = pd.DataFrame(df.groupby('id')['end'].apply(max)).rename(columns={'end':'endMax'})

# merge the three dataframes on id, then apply findRange to build the expected year range for each id
dfs = [year_list,start_min,end_max]
df_final = reduce(lambda left,right: pd.merge(left,right,on='id'), dfs)
df_final['Range'] = df_final.apply(findRange, axis=1)
df_final.reset_index(inplace=True)

# create a noMatch function to find all the values in b (the expected range) that are not in a (the list of years)
def noMatch(a, b):
    return [x for x in b if x not in a]

# use a for loop to iterate through all the rows and find the missing year
df1 = []
for i in range(0, len(df_final)):
    df1.append(noMatch(df_final['year'][i],df_final['Range'][i]))

# create a new dataframe with the desired output; my column names differ and are in a different order,
# but the values are the same as in your desired output
missing_year = pd.DataFrame(df1).rename(columns={0:'missingYear'})
df_concat = pd.concat([df_final, missing_year], axis=1)
df_concat = df_concat[['id','startMin','endMax','missingYear']]
df_concat = df_concat[df_concat['missingYear'].notnull()].copy()  # copy so the assignment below does not trigger SettingWithCopyWarning
df_concat['missingYear'] = df_concat['missingYear'].astype('int')
df_concat


    id   startMin   endMax  missingYear
1   cde   1949      1954    1954
2   xyz   1949      1954    1949
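
Since pd.read_clipboard() depends on whatever is currently on the clipboard, the following is a more compact, self-contained sketch of the same idea. The sample rows are an assumption: the table in the question is truncated, so the data below is a hypothetical reconstruction chosen to match the output shown above. The column name missingYears is likewise only illustrative.

```python
import pandas as pd

# Hypothetical sample data (assumption: reconstructed to match the output above,
# because the table in the original question is truncated).
rows = (
    [(y, "ABc") for y in range(1949, 1955)]    # complete: 1949..1954
    + [(y, "cde") for y in range(1949, 1954)]  # 1954 is missing
    + [(y, "xyz") for y in range(1950, 1955)]  # 1949 is missing
)
df = pd.DataFrame(rows, columns=["year", "id"])
df["start"] = 1949
df["end"] = 1954

# One row per id: the observed years, the earliest start, and the latest end.
summary = df.groupby("id").agg(
    years=("year", list),
    startMin=("start", "min"),
    endMax=("end", "max"),
).reset_index()

# Years expected in [startMin, endMax] that were not observed for that id.
summary["missingYears"] = [
    sorted(set(range(lo, hi + 1)) - set(yrs))
    for yrs, lo, hi in zip(summary["years"], summary["startMin"], summary["endMax"])
]

# Keep only the ids that have at least one missing year.
has_missing = summary["missingYears"].map(len) > 0
result = summary.loc[has_missing, ["id", "startMin", "endMax", "missingYears"]]
print(result)
```

Unlike the version above, this keeps the missing years as a list per id, so an id with several gaps is reported in full.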

Solution 2:

You can use groupby and a set difference, like this:

# first convert to integer
df['start'] = df['start'].astype('int')
df['end'] = df['end'].astype('int')
print (df.groupby(['id','start','end'])
          .apply(lambda x: set(range(x.start.iloc[0], x.end.iloc[0]+1))-set(x.year)))
id   start  end 
ABc  1949   1954        {}
cde  1949   1954    {1954}
xyz  1949   1954    {1949}
dtype: object
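
To see what the lambda computes for a single group, here is a minimal sketch in plain Python. The observed years are hypothetical values chosen to mirror the cde row above.

```python
# Every year that should be present between start and end (inclusive).
expected = set(range(1949, 1954 + 1))

# Hypothetical observed years for one id (assumption: mirrors the "cde" row).
observed = {1949, 1950, 1951, 1952, 1953}

# Set difference: years that are expected but were not observed.
missing = expected - observed
print(missing)
```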

Now, if you want the output format you asked for, change the set to a list, then add dropna, astype and reset_index:

df_missing = (df.groupby(['id','start','end'])
                .apply(lambda x: [*(set(range(x.start.iloc[0], x.end.iloc[0]+1))-set(x.year))])
                .str[0].dropna().astype(int).reset_index(name='missing_year'))
print (df_missing)
    id  start   end  missing_year
0  cde   1949  1954          1954
1  xyz   1949  1954          1949
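
Note that .str[0] keeps only the first element of each list, so an id with more than one missing year would lose the rest. A possible variant that emits one row per missing year is sketched below. It iterates over the groups instead of using apply, and the sample data (id "pqr") is hypothetical.

```python
import pandas as pd

# Hypothetical data (assumption): id "pqr" is missing two years, 1951 and 1953.
df = pd.DataFrame({
    "year": [1949, 1950, 1952, 1954],
    "id": ["pqr"] * 4,
    "start": [1949] * 4,
    "end": [1954] * 4,
})

records = []
for (id_, start, end), grp in df.groupby(["id", "start", "end"]):
    # Expected years minus observed years for this group.
    missing = sorted(set(range(start, end + 1)) - set(grp["year"]))
    # One output record per missing year.
    records.extend(
        {"id": id_, "start": start, "end": end, "missing_year": y} for y in missing
    )

# Passing the column names keeps the frame well-formed even when nothing is missing.
df_missing = pd.DataFrame(records, columns=["id", "start", "end", "missing_year"])
print(df_missing)
```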
