
Gapfilling Missing Data At Specific Latitude/longitude By Nearest Known Neighbours

I have a dataset of about 2 million rows, consisting of various properties at specific latitudes and longitudes. For each property, I have a valuation and a floor area. The valuati

Solution 1:

If you are open to using libraries, you can compute a distance matrix.

Assuming df is your main dataframe:

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Assumes df has the columns lat, long, value, area and ratio (value / area)

def find_closest(x, df):
    # Drop the row's distance to itself
    d = x.drop(x.name).to_dict()
    # Sort the other rows by distance, nearest first
    v = sorted(d, key=lambda k: d[k])
    # Return the ratio of the closest row with a non-NaN area, else NaN
    for i in v:
        if i in df[~df.area.isnull()].index:
            return df.loc[i].ratio
    return np.nan

# Pairwise distances between all rows, labelled with df's own index
df_matrix_distance = pd.DataFrame(euclidean_distances(df[["lat", "long"]]), index=df.index, columns=df.index)
# Get the rows with a null area
df_nan = df[df.area.isnull()]
# Find the nearest known ratio for each of them
res = df_matrix_distance.loc[df_nan.index].apply(lambda x: find_closest(x, df), axis=1).to_dict()
# Fill the values
for k, v in res.items():
    df.loc[k, "ratio"] = v
    df.loc[k, "area"] = df.loc[k, "value"] / df.loc[k, "ratio"]

The result

          lat      long  value   area        ratio
0   57.101474 -2.242851  12850  252.0    50.992063
1   57.102554 -2.246308  14700  309.0    47.572816
2   57.100556 -2.248342  25600  507.0    50.493097
3   57.101765 -2.254688  28000  491.0    57.026477
4   57.097553 -2.245483   5650  119.0    47.478992
5   57.098244 -2.245768  43000  811.0    53.020962
6   57.098554 -2.252504  46300  850.0    54.470588
7   57.102794 -2.243454   7850  180.0    43.611111
8   57.101474 -2.242851  26250  514.0  50.99206349
9   57.101893 -2.239883  31000  607.0  51.00502513
10  57.101383 -2.238955  28750  563.0  51.00502513
11  57.104578 -2.235641  18500  327.0    56.574924
12  57.105424 -2.234953  21950  406.0    54.064039
13  57.105516 -2.233683  19600  408.0    48.039216
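
A note on scale: the question mentions about 2 million rows, and a dense pairwise distance matrix of that size would need tens of terabytes, so the approach above is only practical on small frames. One possible alternative, which is not the method used in this answer, is to index only the rows with a known area in a KD-tree and query it for each row that is missing one. The sketch below uses scipy.spatial.cKDTree and assumes the same column names (lat, long, value, area), a unique index, and plain Euclidean distance in degrees as above. The helper name fill_area_from_nearest and the four sample rows are purely illustrative.

```python
import numpy as np
import pandas as pd
from scipy.spatial import cKDTree


def fill_area_from_nearest(df):
    """Fill missing `area` from the value/area ratio of the nearest known row.

    Hypothetical helper: assumes columns lat, long, value, area and a
    unique index. Distances are Euclidean in degrees, not geodesic.
    """
    out = df.copy()
    out["ratio"] = out["value"] / out["area"]

    known = out[out["area"].notna()]
    missing = out[out["area"].isna()]
    if known.empty or missing.empty:
        return out

    # Build the tree only on rows with a known area, so every hit is usable.
    tree = cKDTree(known[["lat", "long"]].to_numpy())
    # pos holds, for each missing row, the position of its nearest known row.
    _, pos = tree.query(missing[["lat", "long"]].to_numpy(), k=1)

    nearest_ratio = known["ratio"].to_numpy()[pos]
    out.loc[missing.index, "ratio"] = nearest_ratio
    out.loc[missing.index, "area"] = missing["value"].to_numpy() / nearest_ratio
    return out


# Illustrative sample: three rows with a known area, one without.
sample = pd.DataFrame({
    "lat":   [57.101474, 57.102554, 57.100556, 57.101474],
    "long":  [-2.242851, -2.246308, -2.248342, -2.242851],
    "value": [12850, 14700, 25600, 26250],
    "area":  [252.0, 309.0, 507.0, np.nan],
})
filled = fill_area_from_nearest(sample)
print(filled)
```

Because the tree contains only rows with a known area, there is no need to walk down a sorted list of neighbours until a usable one turns up. For real distances over larger regions, a haversine-based neighbour search would be more appropriate than Euclidean distance on raw degrees.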
