Gapfilling Missing Data At Specific Latitude/longitude By Nearest Known Neighbours
I have a dataset of about 2 million rows, consisting of various properties at specific latitudes and longitudes. For each property, I have a valuation and a floor area. The valuati
Solution 1:
If you are open to using libraries, you can use a distance matrix.
Assuming df is your main DataFrame:
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Assumes df has the columns lat, long, value, area and ratio (value / area),
# with NaN in area and ratio where the floor area is unknown.

def find_closest(x, df):
    # Suppress the row's distance to itself
    d = x.drop(x.name).to_dict()
    # Sort the other rows by distance, closest first
    v = sorted(d, key=lambda k: d[k])
    # Find the closest row with a non-NaN area value, else return NaN
    for i in v:
        if pd.notnull(df.loc[i, "area"]):
            return df.loc[i, "ratio"]
    return np.nan

# Pairwise distances between all rows, labelled with df's own index
df_matrix_distance = pd.DataFrame(
    euclidean_distances(df[["lat", "long"]]), index=df.index, columns=df.index
)
# Get the rows with a null area
df_nan = df[df.area.isnull()]
# Get the ratio of the nearest known neighbour for each of those rows
res = df_matrix_distance.loc[df_nan.index].apply(lambda x: find_closest(x, df), axis=1).to_dict()
# Fill the values
for k, v in res.items():
    df.loc[k, "ratio"] = v
    df.loc[k, "area"] = df.loc[k, "value"] / df.loc[k, "ratio"]
The result
    lat        long       value  area   ratio
0   57.101474  -2.242851  12850  252.0  50.992063
1   57.102554  -2.246308  14700  309.0  47.572816
2   57.100556  -2.248342  25600  507.0  50.493097
3   57.101765  -2.254688  28000  491.0  57.026477
4   57.097553  -2.245483   5650  119.0  47.478992
5   57.098244  -2.245768  43000  811.0  53.020962
6   57.098554  -2.252504  46300  850.0  54.470588
7   57.102794  -2.243454   7850  180.0  43.611111
8   57.101474  -2.242851  26250  514.0  50.99206349
9   57.101893  -2.239883  31000  607.0  51.00502513
10  57.101383  -2.238955  28750  563.0  51.00502513
11  57.104578  -2.235641  18500  327.0  56.574924
12  57.105424  -2.234953  21950  406.0  54.064039
13  57.105516  -2.233683  19600  408.0  48.039216
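A note on scale: a full distance matrix has one entry per pair of rows, so for about 2 million rows it would hold roughly 4 x 10^12 distances and will not fit in memory. The sketch below shows one possible way around that, assuming SciPy is available: it builds a k-d tree only on the rows with a known area and queries it for each row with a missing area. The function name fill_area_from_nearest is hypothetical, and the column names (lat, long, value, area, ratio) are taken from the answer above. Like the original, it measures distance in raw degrees, which is only an approximation of distance on the ground.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def fill_area_from_nearest(df):
    """Fill missing area using the value/area ratio of the nearest row with a known area."""
    out = df.copy()
    known = out["area"].notna()
    # The ratio is only defined where the area is known
    out.loc[known, "ratio"] = out.loc[known, "value"] / out.loc[known, "area"]
    # Nothing to fill, or nothing to fill from
    if known.all() or not known.any():
        return out
    # Build the tree on known rows only, so every match has a usable ratio
    tree = cKDTree(out.loc[known, ["lat", "long"]].to_numpy())
    # pos holds, for each unknown row, the position of its nearest known row
    _, pos = tree.query(out.loc[~known, ["lat", "long"]].to_numpy(), k=1)
    nearest_ratio = out.loc[known, "ratio"].to_numpy()[pos]
    out.loc[~known, "ratio"] = nearest_ratio
    out.loc[~known, "area"] = out.loc[~known, "value"].to_numpy() / nearest_ratio
    return out
```

Calling df = fill_area_from_nearest(df) returns a filled copy and leaves the input untouched. If the properties span a large region, scikit-learn's BallTree with metric="haversine" (on coordinates converted to radians) may be a better fit than Euclidean distance on degrees.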