
Pandas: Fastest Way To Resolve Ip To Country

I have a function find_country_from_connection_ip which takes an ip, and after some processing returns a country. Like below: def find_country_from_connection_ip(ip): # Do some

Solution 1:

I would use the maxminddb-geolite2 (GeoLite) module for that.

First install the maxminddb-geolite2 module:

pip install maxminddb-geolite2

Python Code:

import numpy as np
import pandas as pd
from geolite2 import geolite2

def get_country(ip):
    try:
        x = geo.get(ip)
    except ValueError:
        # not a valid IP address
        return np.nan
    try:
        return x['country']['names']['en'] if x else np.nan
    except KeyError:
        # record found, but it has no English country name
        return np.nan

geo = geolite2.reader()

# it took me quite some time to find a free and large enough list of IPs ;)
# IP's for testing: http://upd.emule-security.org/ipfilter.zip
x = pd.read_csv(r'D:\download\ipfilter.zip',
                usecols=[0], sep=r'\s*\-\s*', engine='python',
                header=None, names=['ip'])

# get unique IPs
unique_ips = x['ip'].unique()
# make series out of it
unique_ips = pd.Series(unique_ips, index = unique_ips)
# map IP --> country
x['country'] = x['ip'].map(unique_ips.apply(get_country))

geolite2.close()

Output:

In [90]: x
Out[90]:
                 ip     country
0   000.000.000.000         NaN
1   001.002.004.000         NaN
2   001.002.008.000         NaN
3   001.009.096.105         NaN
4   001.009.102.251         NaN
5   001.009.106.186         NaN
6   001.016.000.000         NaN
7   001.055.241.140         NaN
8   001.093.021.147         NaN
9   001.179.136.040         NaN
10  001.179.138.224    Thailand
11  001.179.140.200    Thailand
12  001.179.146.052         NaN
13  001.179.147.002    Thailand
14  001.179.153.216    Thailand
15  001.179.164.124    Thailand
16  001.179.167.188    Thailand
17  001.186.188.000         NaN
18  001.202.096.052         NaN
19  001.204.179.141       China
20  002.051.000.165         NaN
21  002.056.000.000         NaN
22  002.095.041.202         NaN
23  002.135.237.106  Kazakhstan
24  002.135.237.250  Kazakhstan
..              ...         ...
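The addresses in this test file are zero-padded (001.179.138.224 rather than 1.179.138.224). Depending on the Python and library versions in use, a parser may reject that form, which would send the lookup down the ValueError branch and produce NaN. That is an assumption about one possible cause, not something the answer states. A minimal sketch of normalizing such strings before the lookup, using a hypothetical helper normalize_ipv4:

```python
import ipaddress

def normalize_ipv4(text):
    """Strip leading zeros from each octet of a dotted-quad string.

    Returns None when the input is not a valid IPv4 address.
    """
    parts = text.strip().split('.')
    if len(parts) != 4 or not all(p.isdigit() for p in parts):
        return None
    try:
        candidate = '.'.join(str(int(p)) for p in parts)
        ipaddress.IPv4Address(candidate)   # rejects octets above 255
    except ValueError:
        return None
    return candidate

print(normalize_ipv4('001.179.138.224'))   # 1.179.138.224
print(normalize_ipv4('999.1.1.1'))         # None
```

With pandas this could be applied as x['ip'].map(normalize_ipv4) before calling get_country.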

Timing for 171,884 unique IPs:

In [85]: %timeit unique_ips.apply(get_country)
1 loop, best of 3: 14.8 s per loop

In [86]: unique_ips.shape
Out[86]: (171884,)

Conclusion: it would take approx. 35 seconds for your DF with 400K unique IPs on my hardware:

In [93]: 400000/171884*15
Out[93]: 34.90726303786274
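Much of the saving in Solution 1 comes from resolving each distinct IP once and then mapping the results back onto every row. The sketch below isolates that idea. The slow_lookup function, its counter, and the sample addresses are hypothetical stand-ins, and a plain dictionary is used in place of the indexed Series from the answer:

```python
import pandas as pd

calls = 0

def slow_lookup(ip):
    """Hypothetical stand-in for an expensive per-IP lookup."""
    global calls
    calls += 1
    return {'192.0.2.1': 'A', '198.51.100.7': 'B'}.get(ip)

df = pd.DataFrame({'ip': ['192.0.2.1', '198.51.100.7', '192.0.2.1',
                          '198.51.100.7', '192.0.2.1']})

# Resolve each distinct address exactly once ...
lookup = {ip: slow_lookup(ip) for ip in df['ip'].unique()}
# ... then broadcast the cached results back to all rows.
df['country'] = df['ip'].map(lookup)

print(calls)                     # 2 lookups for 5 rows
print(df['country'].tolist())    # ['A', 'B', 'A', 'B', 'A']
```

The more repeated addresses the column contains, the larger the gain over calling the lookup on every row.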

Solution 2:

Your issue isn't with how to use apply or loc. The issue is that your df is flagged as a copy of another dataframe.

Let's explore this a bit

df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))
df


def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]

df['Country'] = df.IP.apply(find_country_from_connection_ip)
df


No problems. Let's make some problems.

# This should make a copy
print(bool(df.is_copy))
df = df[['A', 'IP']]
print(df)
print(bool(df.is_copy))

False
   A  IP
0  x   1
1  y   2
2  z   3
True

Perfect, now we have a copy. Let's perform the same assignment with the apply

df['Country'] = df.IP.apply(find_country_from_connection_ip)
df
//anaconda/envs/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning: 
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':


How do you fix it? Wherever you created df, you can use df.loc. In my example above, the plain selection df = df[['A', 'IP']] triggered the copy flag. If I had used loc instead, I'd have avoided this mess.

print(bool(df.is_copy))
df = df.loc[:]
print(df)
print(bool(df.is_copy))

False
   A  IP
0  x   1
1  y   2
2  z   3
False

You need to either find where df is created and use loc or iloc instead when you slice the source dataframe. Or, you can simply do this...

df.is_copy = None
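An alternative that is not part of the answer above: take an explicit copy with DataFrame.copy() at the point where the subset is created. This avoids touching is_copy at all, which may matter because, as far as I know, that attribute was deprecated and is not available in recent pandas releases. A minimal sketch, reusing the toy lookup from this answer and a hypothetical variable name source:

```python
import pandas as pd

def find_country_from_connection_ip(ip):
    # Toy lookup copied from the example above.
    return {1: 'A', 2: 'B', 3: 'C'}[ip]

source = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))

# .copy() makes the subset an independent DataFrame, so adding a column
# to it is unambiguous and leaves `source` untouched.
df = source[['A', 'IP']].copy()
df['Country'] = df['IP'].apply(find_country_from_connection_ip)

print(df)
print('Country' in source.columns)   # False
```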

The full demonstration

df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))

def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]

df = df[:]

df.is_copy = None

df['Country'] = df.IP.apply(find_country_from_connection_ip)
df


Solution 3:

IIUC you can use your custom function with Series.apply this way:

df['Country'] = df['IP'].apply(find_country_from_ip)

Sample:

df = pd.DataFrame({'IP':[1,2,3],
                   'B':[4,5,6]})

def find_country_from_ip(ip):
    # Do some processing
    # some testing formula
    country = ip + 5
    return country

df['Country'] = df['IP'].apply(find_country_from_ip)

print (df)
   B  IP  Country
0  4   1        6
1  5   2        7
2  6   3        8
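When the lookup already exists as a table rather than a function, Series.map accepts a dictionary directly, which is a close relative of the apply call above. This is a variation for illustration, not the method from the answer, and ip_to_country is a hypothetical lookup table:

```python
import pandas as pd

df = pd.DataFrame({'IP': [1, 2, 3, 4]})
ip_to_country = {1: 'A', 2: 'B', 3: 'C'}   # hypothetical lookup table

# Keys missing from the dictionary (here 4) come back as NaN
# instead of raising KeyError.
df['Country'] = df['IP'].map(ip_to_country)
print(df)
```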

Solution 4:

First and foremost, @MaxU's answer is the way to go: it is efficient and well suited to applying a lookup across a whole pd.Series or DataFrame column.

I will contrast the performance of two popular libraries that return location data for an IP address. TL;DR: use the geolite2 method.

1. geolite2 package from geolite2 library

Input

# !pip install maxminddb-geolite2
import time
import numpy as np
from geolite2 import geolite2
geo = geolite2.reader()
df_1 = train_data.loc[:50,['IP_Address']]

def IP_info_1(ip):
    try:
        x = geo.get(ip)
    except ValueError:   # Faulty IP value
        return np.nan
    try:
        return x['country']['names']['en'] if x is not None else np.nan
    except KeyError:   # Faulty Key value
        return np.nan


s_time = time.time()
# map IP --> country
# apply(fn) applies fn. on all pd.series elements
df_1['country'] = df_1.loc[:,'IP_Address'].apply(IP_info_1)
print(df_1.head(), '\n')
print('Time:',str(time.time()-s_time)+'s \n')

print(type(geo.get('48.151.136.76')))

Output

        IP_Address         country
0    48.151.136.76   United States
1     94.9.145.169  United Kingdom
2    58.94.157.121           Japan
3   193.187.41.186         Austria
4    125.96.20.172           China 

Time: 0.09906983375549316s 

<class 'dict'>

2. DbIpCity package from ip2geotools library

Input

# !pip install ip2geotools
import time
s_time = time.time()
from ip2geotools.databases.noncommercial import DbIpCity
df_2 = train_data.loc[:50,['IP_Address']]

def IP_info_2(ip):
    try:
        return DbIpCity.get(ip, api_key = 'free').country
    except Exception:   # any lookup failure becomes NaN
        return np.nan

df_2['country'] = df_2.loc[:, 'IP_Address'].apply(IP_info_2)
print(df_2.head())
print('Time:',str(time.time()-s_time)+'s')

print(type(DbIpCity.get('48.151.136.76',api_key = 'free')))

Output

        IP_Address country
0    48.151.136.76      US
1     94.9.145.169      GB
2    58.94.157.121      JP
3   193.187.41.186      AT
4    125.96.20.172      CN

Time: 80.53318452835083s 

<class 'ip2geotools.models.IpLocation'>

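One possible reason for the huge time difference is the data structure of the output, i.e. direct subsetting from a dictionary seems far more efficient than indexing into the specialized ip2geotools.models.IpLocation object. A more likely contributor is that DbIpCity.get queries the DB-IP web service for every address, whereas geolite2 reads from a local database file, so each DbIpCity lookup also pays for a network round trip.

Also, the output of the 1st method is a dictionary containing geolocation data; subset it as needed to obtain the required info: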

x = geolite2.reader().get('48.151.136.76')
print(x)

>>>
    {'city': {'geoname_id': 5101798, 'names': {'de': 'Newark', 'en': 'Newark', 'es': 'Newark', 'fr': 'Newark', 'ja': 'ニューアーク', 'pt-BR': 'Newark', 'ru': 'Ньюарк'}},

 'continent': {'code': 'NA', 'geoname_id': 6255149, 'names': {'de': 'Nordamerika', 'en': 'North America', 'es': 'Norteamérica', 'fr': 'Amérique du Nord', 'ja': '北アメリカ', 'pt-BR': 'América do Norte', 'ru': 'Северная Америка', 'zh-CN': '北美洲'}}, 

'country': {'geoname_id': 6252001, 'iso_code': 'US', 'names': {'de': 'USA', 'en': 'United States', 'es': 'Estados Unidos', 'fr': 'États-Unis', 'ja': 'アメリカ合衆国', 'pt-BR': 'Estados Unidos', 'ru': 'США', 'zh-CN': '美国'}}, 

'location': {'accuracy_radius': 1000, 'latitude': 40.7355, 'longitude': -74.1741, 'metro_code': 501, 'time_zone': 'America/New_York'}, 

'postal': {'code': '07102'}, 

'registered_country': {'geoname_id': 6252001, 'iso_code': 'US', 'names': {'de': 'USA', 'en': 'United States', 'es': 'Estados Unidos', 'fr': 'États-Unis', 'ja': 'アメリカ合衆国', 'pt-BR': 'Estados Unidos', 'ru': 'США', 'zh-CN': '美国'}}, 

'subdivisions': [{'geoname_id': 5101760, 'iso_code': 'NJ', 'names': {'en': 'New Jersey', 'es': 'Nueva Jersey', 'fr': 'New Jersey', 'ja': 'ニュージャージー州', 'pt-BR': 'Nova Jérsia', 'ru': 'Нью-Джерси', 'zh-CN': '新泽西州'}}]}
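Since the record is a nested dictionary and some keys can be absent for a given address, a small helper can make the subsetting safer. This is a minimal sketch using a trimmed copy of the record printed above; the helper name pick is hypothetical and not part of geolite2:

```python
# Trimmed copy of the record printed above.
record = {
    'country': {'iso_code': 'US',
                'names': {'en': 'United States', 'de': 'USA'}},
    'location': {'latitude': 40.7355, 'longitude': -74.1741,
                 'time_zone': 'America/New_York'},
}

def pick(record, *keys):
    """Walk nested dicts; return None as soon as a key is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current

print(pick(record, 'country', 'names', 'en'))   # United States
print(pick(record, 'country', 'iso_code'))      # US
print(pick(record, 'postal', 'code'))           # None (key removed in this trimmed copy)
print(pick(None, 'country'))                    # None (no record found)
```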
