BeautifulSoup Fails To Parse Long View State

I am trying to use BeautifulSoup4 to parse the HTML retrieved from http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0. When I print the resulting soup, it ends like this: kZXI9IjAi'/

Solution 1:

BeautifulSoup uses a pluggable HTML parser to build the 'soup'; you need to try out different parsers, as each will treat a broken page differently.

I had no problems parsing that page with any of the parsers, however:

>>> from bs4 import BeautifulSoup
>>> import requests
>>> r = requests.get('http://exporter.nih.gov/ExPORTER_Catalog.aspx?index=0')
>>> for parser in ('html.parser', 'lxml', 'html5lib'):
...     print(repr(str(BeautifulSoup(r.text, parser))[-60:]))
...
';\r\npageTracker._trackPageview();\r\n</script>\n</body>\n</html>\n'
'();\r\npageTracker._trackPageview();\r\n</script>\n</body></html>'
'();\npageTracker._trackPageview();\n</script>\n\n\n</body></html>'
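If one parser does cut the page short on your machine, the "try different parsers" advice can be automated. The following is only a rough sketch: the helper names `looks_complete` and `parse_with_fallback` are made up for illustration, and treating "the output ends in `</html>`" as a sign of a complete parse is an assumption that may not hold for every page.

```python
def looks_complete(markup):
    """Heuristic (an assumption): a fully parsed page serializes to text ending in </html>."""
    return markup.rstrip().lower().endswith("</html>")


def parse_with_fallback(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return (parser_name, soup) for the first parser whose output looks complete."""
    # Imported lazily so looks_complete() can be used without bs4 installed.
    from bs4 import BeautifulSoup, FeatureNotFound

    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
        if looks_complete(str(soup)):
            return name, soup
    raise ValueError("no installed parser produced a complete-looking document")
```

You would call it as `name, soup = parse_with_fallback(r.text)` and then check `name` to see which parser coped with the page.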

Make sure you have the latest BeautifulSoup4 package installed; I have seen consistent problems in the 4.1 series that were solved in 4.2.
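To see which version you are running, you can read `bs4.__version__`. The snippet below is a minimal sketch of such a check; `version_tuple` and `is_at_least` are hypothetical helpers written for this example, and the 4.2 threshold is simply taken from the observation above.

```python
import re


def version_tuple(version):
    """Convert a version string such as '4.2.1' into a tuple of ints: (4, 2, 1)."""
    return tuple(int(n) for n in re.findall(r"\d+", version))


def is_at_least(version, minimum="4.2"):
    """True if `version` is the same as or newer than `minimum`."""
    return version_tuple(version) >= version_tuple(minimum)


if __name__ == "__main__":
    try:
        import bs4
    except ImportError:
        print("bs4 is not installed")
    else:
        status = "ok" if is_at_least(bs4.__version__) else "consider upgrading"
        print(bs4.__version__, status)
```

If the check suggests upgrading, `pip install --upgrade beautifulsoup4` should pull in the current release.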
