Pandas: Fastest Way To Resolve Ip To Country
Solution 1:
I would use the maxminddb-geolite2 (GeoLite) module for that.
First install the maxminddb-geolite2 module:
pip install maxminddb-geolite2
Python Code:
import numpy as np
import pandas as pd
from geolite2 import geolite2

def get_country(ip):
    # geo.get raises ValueError for a malformed address
    try:
        x = geo.get(ip)
    except ValueError:
        return np.nan
    # not every record carries a country entry
    try:
        return x['country']['names']['en'] if x else np.nan
    except KeyError:
        return np.nan

geo = geolite2.reader()
# it took me quite some time to find a free and large enough list of IPs ;)
# IPs for testing: http://upd.emule-security.org/ipfilter.zip
x = pd.read_csv(r'D:\download\ipfilter.zip',
                usecols=[0], sep=r'\s*-\s*', engine='python',
                header=None, names=['ip'])
# get unique IPs
unique_ips = x['ip'].unique()
# make series out of it
unique_ips = pd.Series(unique_ips, index = unique_ips)
# map IP --> country
x['country'] = x['ip'].map(unique_ips.apply(get_country))
geolite2.close()
Output:
In [90]: x
Out[90]:
                 ip     country
0   000.000.000.000         NaN
1   001.002.004.000         NaN
2   001.002.008.000         NaN
3   001.009.096.105         NaN
4   001.009.102.251         NaN
5   001.009.106.186         NaN
6   001.016.000.000         NaN
7   001.055.241.140         NaN
8   001.093.021.147         NaN
9   001.179.136.040         NaN
10  001.179.138.224    Thailand
11  001.179.140.200    Thailand
12  001.179.146.052         NaN
13  001.179.147.002    Thailand
14  001.179.153.216    Thailand
15  001.179.164.124    Thailand
16  001.179.167.188    Thailand
17  001.186.188.000         NaN
18  001.202.096.052         NaN
19  001.204.179.141       China
20  002.051.000.165         NaN
21  002.056.000.000         NaN
22  002.095.041.202         NaN
23  002.135.237.106  Kazakhstan
24  002.135.237.250  Kazakhstan
..              ...         ...
Timing for 171,884 unique IPs:
In [85]: %timeit unique_ips.apply(get_country)
1 loop, best of 3: 14.8 s per loop
In [86]: unique_ips.shape
Out[86]: (171884,)
Conclusion: it would take approx. 35 seconds for your DF with 400K unique IPs on my hardware:
In [93]: 400000/171884*15
Out[93]: 34.90726303786274
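The speed-up in this answer comes from resolving each distinct address only once and then broadcasting the result back to every row. Here is a minimal sketch of that pattern. It assumes a hypothetical fake_lookup function and a made-up FAKE_DB table standing in for geo.get, so it runs without the GeoLite database:

```python
import pandas as pd

# Hypothetical stand-in for an expensive lookup such as geo.get(ip);
# the table below is made up for illustration only.
FAKE_DB = {"1.179.138.224": "Thailand", "2.135.237.106": "Kazakhstan"}
calls = []

def fake_lookup(ip):
    calls.append(ip)            # record every call so we can count them
    return FAKE_DB.get(ip)      # None when the address is unknown

df = pd.DataFrame({"ip": ["1.179.138.224", "2.135.237.106",
                          "1.179.138.224", "10.0.0.1", "1.179.138.224"]})

# Resolve each distinct address once ...
unique_ips = df["ip"].unique()
ip_to_country = {ip: fake_lookup(ip) for ip in unique_ips}

# ... then map the results back onto all rows.
df["country"] = df["ip"].map(ip_to_country)
print(df)
print("lookups performed:", len(calls))
```

With five rows but only three distinct addresses, the lookup runs three times. The more often addresses repeat in the real data, the more this saves.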
Solution 2:
Your issue isn't with how to use apply or loc. The issue is that your df is flagged as a copy of another dataframe.
Let's explore this a bit:
df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))
df
def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]
df['Country'] = df.IP.apply(find_country_from_connection_ip)
df
No problems. Let's make some problems.
# This should make a copy
print(bool(df.is_copy))
df = df[['A', 'IP']]
print(df)
print(bool(df.is_copy))
False
   A  IP
0  x   1
1  y   2
2  z   3
True
Perfect, now we have a copy. Let's perform the same assignment with apply:
df['Country'] = df.IP.apply(find_country_from_connection_ip)
df
//anaconda/envs/3.5/lib/python3.5/site-packages/ipykernel/__main__.py:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
How do you fix it?
Wherever you created df, you can use df.loc. My example above, where I did df = df[['A', 'IP']], triggered the copy. If I had used loc instead, I'd have avoided this mess.
print(bool(df.is_copy))
df = df.loc[:]
print(df)
print(bool(df.is_copy))
False
   A  IP
0  x   1
1  y   2
2  z   3
False
You need to either find where df is created and use loc or iloc instead when you slice the source dataframe. Or, you can simply do this...
df.is_copy = None
The full demonstration
df = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))
def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]
df = df[:]
df.is_copy = None
df['Country'] = df.IP.apply(find_country_from_connection_ip)
df
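Because the is_copy attribute was deprecated in later pandas releases, an alternative that avoids it is to take an explicit copy when slicing. The following is a sketch of that approach, not the method used in this answer. The name source is introduced here only for illustration, and warnings are matched by class name so the check does not depend on where a given pandas version defines SettingWithCopyWarning:

```python
import warnings
import pandas as pd

source = pd.DataFrame(dict(IP=[1, 2, 3], A=list('xyz')))

def find_country_from_connection_ip(ip):
    return {1: 'A', 2: 'B', 3: 'C'}[ip]

# An explicit .copy() makes df an independent dataframe,
# so assigning a new column cannot be mistaken for writing to a slice.
df = source[['A', 'IP']].copy()

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    df['Country'] = df['IP'].apply(find_country_from_connection_ip)

# Keep only warnings of the kind discussed above, matched by class name.
copy_warnings = [w for w in caught
                 if w.category.__name__ == "SettingWithCopyWarning"]
print(df)
print("copy warnings:", len(copy_warnings))
```

The source dataframe is left untouched, and the new column lives only on df.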
Solution 3:
IIUC you can use your custom function with Series.apply this way:
df['Country'] = df['IP'].apply(find_country_from_ip)
Sample:
df = pd.DataFrame({'IP':[1,2,3],
                   'B':[4,5,6]})
def find_country_from_ip(ip):
    # Do some processing
    # some testing formula
    country = ip + 5
    return country
df['Country'] = df['IP'].apply(find_country_from_ip)
print (df)
B IP Country
0 4 1 6
1 5 2 7
2 6 3 8
Solution 4:
First and foremost, @MaxU's answer is the way to go: it is efficient and a good fit for applying the lookup across an entire pd.Series or DataFrame column.
Below I contrast the performance of two popular libraries that return location data for an IP address. TL;DR: use the geolite2 method.
1. geolite2 package from the geolite2 library
Input
# !pip install maxminddb-geolite2
import time
import numpy as np
from geolite2 import geolite2
geo = geolite2.reader()
df_1 = train_data.loc[:50,['IP_Address']]
def IP_info_1(ip):
    try:
        x = geo.get(ip)
    except ValueError:  # Faulty IP value
        return np.nan
    try:
        return x['country']['names']['en'] if x is not None else np.nan
    except KeyError:  # Faulty Key value
        return np.nan
s_time = time.time()
# map IP --> country
# apply(fn) applies fn. on all pd.series elements
df_1['country'] = df_1.loc[:,'IP_Address'].apply(IP_info_1)
print(df_1.head(), '\n')
print('Time:',str(time.time()-s_time)+'s \n')
print(type(geo.get('48.151.136.76')))
Output
       IP_Address         country
0   48.151.136.76   United States
1    94.9.145.169  United Kingdom
2   58.94.157.121           Japan
3  193.187.41.186         Austria
4   125.96.20.172           China
Time: 0.09906983375549316s
<class 'dict'>
2. DbIpCity package from the ip2geotools library
Input
# !pip install ip2geotools
import time
import numpy as np
s_time = time.time()
from ip2geotools.databases.noncommercial import DbIpCity
df_2 = train_data.loc[:50,['IP_Address']]
def IP_info_2(ip):
    try:
        return DbIpCity.get(ip, api_key = 'free').country
    except Exception:  # any lookup failure yields NaN
        return np.nan
df_2['country'] = df_2.loc[:, 'IP_Address'].apply(IP_info_2)
print(df_2.head())
print('Time:',str(time.time()-s_time)+'s')
print(type(DbIpCity.get('48.151.136.76',api_key = 'free')))
Output
       IP_Address country
0   48.151.136.76      US
1    94.9.145.169      GB
2   58.94.157.121      JP
3  193.187.41.186      AT
4   125.96.20.172      CN
Time: 80.53318452835083s
<class 'ip2geotools.models.IpLocation'>
A likely reason for the huge time difference is that DbIpCity.get queries a remote web API for every address, while geolite2 reads from a local database file. Indexing a plain dictionary rather than the specialized ip2geotools.models.IpLocation object probably plays only a minor role by comparison.
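Whichever library is used, an address that appears many times only needs to be resolved once. Here is a minimal sketch of memoizing a lookup with functools.lru_cache. It assumes a hypothetical slow_lookup function with a made-up two-entry table, standing in for a costly per-address call:

```python
from functools import lru_cache

call_count = 0

def slow_lookup(ip):
    """Hypothetical stand-in for an expensive per-address lookup."""
    global call_count
    call_count += 1
    # Made-up table for illustration only.
    return {"48.151.136.76": "US", "94.9.145.169": "GB"}.get(ip)

@lru_cache(maxsize=None)
def cached_lookup(ip):
    # The first call per address reaches slow_lookup; repeats hit the cache.
    return slow_lookup(ip)

ips = ["48.151.136.76", "94.9.145.169", "48.151.136.76", "48.151.136.76"]
countries = [cached_lookup(ip) for ip in ips]
print(countries)
print("underlying lookups:", call_count)
```

A function wrapped this way can be passed to Series.apply like any other. The cache is unbounded here, which is fine for a few hundred thousand short strings but worth capping for much larger inputs.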
The output of the 1st method is a dictionary containing geo-location data; index into it to obtain the needed info:
x = geolite2.reader().get('48.151.136.76')
print(x)
>>>
{'city': {'geoname_id': 5101798, 'names': {'de': 'Newark', 'en': 'Newark', 'es': 'Newark', 'fr': 'Newark', 'ja': 'ニューアーク', 'pt-BR': 'Newark', 'ru': 'Ньюарк'}},
'continent': {'code': 'NA', 'geoname_id': 6255149, 'names': {'de': 'Nordamerika', 'en': 'North America', 'es': 'Norteamérica', 'fr': 'Amérique du Nord', 'ja': '北アメリカ', 'pt-BR': 'América do Norte', 'ru': 'Северная Америка', 'zh-CN': '北美洲'}},
'country': {'geoname_id': 6252001, 'iso_code': 'US', 'names': {'de': 'USA', 'en': 'United States', 'es': 'Estados Unidos', 'fr': 'États-Unis', 'ja': 'アメリカ合衆国', 'pt-BR': 'Estados Unidos', 'ru': 'США', 'zh-CN': '美国'}},
'location': {'accuracy_radius': 1000, 'latitude': 40.7355, 'longitude': -74.1741, 'metro_code': 501, 'time_zone': 'America/New_York'},
'postal': {'code': '07102'},
'registered_country': {'geoname_id': 6252001, 'iso_code': 'US', 'names': {'de': 'USA', 'en': 'United States', 'es': 'Estados Unidos', 'fr': 'États-Unis', 'ja': 'アメリカ合衆国', 'pt-BR': 'Estados Unidos', 'ru': 'США', 'zh-CN': '美国'}},
'subdivisions': [{'geoname_id': 5101760, 'iso_code': 'NJ', 'names': {'en': 'New Jersey', 'es': 'Nueva Jersey', 'fr': 'New Jersey', 'ja': 'ニュージャージー州', 'pt-BR': 'Nova Jérsia', 'ru': 'Нью-Джерси', 'zh-CN': '新泽西州'}}]}
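Given a record shaped like the one above, the country code and name can be pulled out without raising when a key is missing. This is a small sketch under that assumption. The helper name country_fields is made up for illustration, and sample is a trimmed copy of the output shown:

```python
def country_fields(record):
    """Return (iso_code, English name) from a lookup record, or (None, None)."""
    # A lookup may return None for unknown addresses, so fall back to {}.
    country = (record or {}).get("country", {})
    return country.get("iso_code"), country.get("names", {}).get("en")

# Trimmed copy of the record printed above.
sample = {
    "country": {
        "geoname_id": 6252001,
        "iso_code": "US",
        "names": {"de": "USA", "en": "United States"},
    }
}

print(country_fields(sample))   # ('US', 'United States')
print(country_fields(None))     # (None, None)
print(country_fields({}))       # (None, None)
```

Returning the ISO code alongside the name also makes the result directly comparable with the two-letter codes produced by the second method.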