How To Scrape Not Well Structured HTML Tables With BeautifulSoup In Python?
Solution 1:
I used the method I mentioned in the comments: using each cell's width attribute to work out where the null (empty) values in the data are. Here's the Python code:
import requests
import bs4

URL = 'https://itportal.ogauthority.co.uk/information/well_data/lithostratigraphy_hierarchy/rptLithoStrat_1Page2.html'

# Widths normally used by each column. A cell wider than its column's
# range is taken to cover the following column(s) as well.
NARROW = list(range(35, 51))     # columns 0-4 and 6
WIDE = list(range(220, 231))     # column 5
LAST = [136]                     # column 7

response = requests.get(URL)
soup = bs4.BeautifulSoup(response.text, 'lxml')
tables = soup.find_all('table')

# Skip the first two tables on the page.
for table in tables[2:]:
    row = table.tr
    cells = row.find_all('td')
    print()
    x = 0             # index of the current logical column
    width_diff = 0    # excess width carried over from the previous cell
    cell_text = []
    for cell in cells:
        width = int(cell.get('width'))
        if width < 10:
            continue  # ignore very narrow cells
        if width_diff > 0:
            # The previous cell was too wide: it covered this column too.
            cell_text.append('NaN ')
            if width_diff > 50:
                # Much too wide: it covered the next two columns.
                x += 2
                cell_text.append('Nan ')  # spelled 'Nan' here, as seen in the output
            else:
                x += 1
            width_diff = 0
        if x in (0, 1, 2, 3, 4, 6):
            width_range = NARROW
        elif x == 5:
            width_range = WIDE
        elif x == 7:
            width_range = LAST
        if cell.text:
            cell_text.append(cell.text.strip() + ' ')
        else:
            cell_text.append('NaN ')
        if width not in width_range:
            width_diff = width - width_range[-1]
        x += 1
    # Print the row, then pad it to nine fields.
    for text in cell_text:
        print(text, end=' ')
    for _ in range(9 - len(cell_text)):
        print('NaN ', end=' ')
As you can see, I noticed that each column uses a certain range of widths. By comparing each cell's width to the width expected for its column, we can determine how many columns it takes up. If the cell is wider than expected, it also takes the space of the next cell; if the difference is too great (more than 50 in the code), it takes the space of the next two cells.
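The width comparison can be isolated in a small function, which makes it easier to test on its own. This is only a sketch of the idea: the name `extra_columns` and the `wide_threshold` parameter are hypothetical and not part of the script, though the threshold of 50 and the 35 to 50 range are taken from it.

```python
# Hypothetical helper; the name and the default threshold are illustrative.
def extra_columns(width, width_range, wide_threshold=50):
    """Estimate how many following columns a cell also covers.

    width_range lists the widths normally seen in the cell's own column.
    """
    if width in width_range:
        return 0                      # normal cell, covers only its own column
    diff = width - width_range[-1]    # excess over the widest normal width
    if diff <= 0:
        return 0                      # narrower than usual: treated as normal
    return 2 if diff > wide_threshold else 1


narrow = list(range(35, 51))          # widths 35..50, as in the script

print(extra_columns(40, narrow))      # 0
print(extra_columns(90, narrow))      # 1 (excess of 40)
print(extra_columns(140, narrow))     # 2 (excess of 90)
```

Keeping the threshold as a parameter makes it easy to adjust if other pages turn out to use different widths.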
The script might need some refining: you'll need to test it against all the URLs to make sure the data comes out absolutely clean.
Here's a sample output from running this code:
61.00 SED TERT WBDS NaN Woolwich Beds GP NaN WLDB
62.00 NaN NaN PACL NaN Palaeocene Claystones NaN Nan SWAP
63.00 NaN NaN SMFC NaN Shallow Marine Facies NaN Nan SONS
64.00 NaN NaN DMFC NaN Deep Marine Facies NaN NaN NaN
65.00 NaN NaN SLSY NaN Selsey Member GN NaN WSXB
66.00 NaN NaN MFM NaN Marsh Farm Member NaN NaN NaN
67.00 NaN NaN ERNM NaN Earnley Member NaN NaN NaN
68.00 NaN NaN WITT NaN Wittering Member NaN NaN NaN
69.00 NaN NaN WHI NaN Whitecliff Beds GZ NaN NaN
70.00 NaN NaN Nan WFSM NaN Whitecliff Sand Member NaN Nan GN
71.00 NaN WESQ NaN Nan Westray Group Equivalent NL GW WESH
72.00 NaN WESR NaN Nan Westray Group NM GO CNSB
73.00 NaN NaN THEF NaN Thet Formation NaN Nan MOFI
74.00 NaN NaN SKAD NaN Skade Formation NB NaN NONS
75.00 NaN NORD NaN Nan Nordland NP Q CNSB
75.50 NaN NaN SWCH NaN Swatchway Formation Q NaN MOFI
75.60 NaN NaN CLPT NaN Coal Pit Formation NaN NaN NaN
75.70 NaN NaN LNGB NaN Ling Bank Formation NaN NaN NaN
76.00 NaN NaN SHKL NaN Shackleton Formation GO QP ROCK
77.00 NaN NaN UGNS NaN Upper Tertiary sands NaN NM NONS
78.00 NaN NaN CLSD NaN Claret Sand NP NaN SVIG
79.00 NaN NaN BLUE NaN Blue Sand NaN NaN NaN
80.00 NaN NaN ABGF NaN Aberdeen Ground Formation QH NaN CNSB
81.00 NaN NaN NUGU NaN Upper Glauconitic Unit NB NA MOFI
82.00 NaN NaN POWD NaN Powder Sand GN NaN SVIG
83.00 NaN NaN BASD NaN Basin Sand NaN Nan CNSB
84.00 NaN NaN CRND NaN Crenulate Sand NaN NaN NaN
85.00 NaN NaN NORS NaN Nordland Sand QP NaN SONS
86.00 NaN NaN MIOS NaN Miocene Sand NM NaN ESHB
87.00 NaN NaN MIOL NaN Miocene Limestone NaN Nan CNSB
88.00 NaN NaN FLSF NaN Fladen Sand Formation GP GO WYGG
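One possible way to check rows like these for alignment problems is to count the fields in each row before printing it. The sketch below assumes nine fields per row, as in the script's padding step; `check_row` is a hypothetical helper, and the two example rows were split by hand from the sample output, so their field boundaries (for instance treating "Woolwich Beds" as one field) are an assumption.

```python
EXPECTED_FIELDS = 9   # the script pads each row to nine fields


def check_row(fields, expected=EXPECTED_FIELDS):
    """Strip and pad one row; report whether it fits the expected width."""
    cleaned = [field.strip() for field in fields]
    fits = len(cleaned) <= expected
    # Multiplying by a negative number gives an empty list, so long rows
    # are returned unpadded.
    padded = cleaned + ['NaN'] * (expected - len(cleaned))
    return padded, fits


# Hand-split rows modelled on the sample output (field boundaries assumed).
rows = [
    ['61.00', 'SED', 'TERT', 'WBDS', 'NaN', 'Woolwich Beds', 'GP', 'NaN', 'WLDB'],
    ['70.00', 'NaN', 'NaN', 'Nan', 'WFSM', 'NaN', 'Whitecliff Sand Member',
     'NaN', 'Nan', 'GN'],
]

for row in rows:
    padded, fits = check_row(row)
    if not fits:
        print('Row', padded[0], 'has', len(padded), 'fields, expected', EXPECTED_FIELDS)
```

Under these assumptions the row starting with 70.00 is flagged with ten fields, which is the kind of row worth inspecting by hand.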
Note: I don't know how the 0 in the first cell of your example is created, so I left it out of the answer. I don't know if it's supposed to be scraped as well, because I didn't find it anywhere.
Solution 2:
@samy Thank you very much for your cool method to scrape this website: