Creating A Generic Scrapy Spider
Solution 1:
You could create the spider at run time by having the interpreter evaluate its source. For example, in a Python 2 interactive session:
a = open("test.py")
from compiler importcompile
d = compile(a.read(), 'spider.py', 'exec')
eval(d)
MySpider
<class'__main__.MySpider'>
print MySpider.start_urls
['http://www.somedomain.com']
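Note that the compiler module and the print statement above only exist in Python 2. A rough Python 3 equivalent of the same idea, assuming the spider class lives in a file named test.py, might look like this (a sketch, not part of the original answer):

# Python 3 sketch: load a spider class defined in an external file at run time.
# Assumes test.py defines a class named MySpider with a start_urls attribute.
namespace = {}
with open("test.py") as f:
    code = compile(f.read(), "spider.py", "exec")
    exec(code, namespace)

MySpider = namespace["MySpider"]
print(MySpider.start_urls)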
Solution 2:
I use the Scrapy Extensions approach to extend the Spider class into a class named MasterSpider that includes a generic parser.
Below is a very "short" version of my generic extended parser. Note that you'll need to implement a renderer with a JavaScript engine (such as Selenium) as soon as you start working on pages that use AJAX, plus a lot of additional code to manage differences between sites (scrape based on column title, handle relative vs. absolute URLs, manage different kinds of data containers, etc.).
What is interesting with the Scrapy Extension approach is that you can still override the generic parser method if something does not fit, but I never had to. The MasterSpider class checks whether certain methods have been defined (e.g. parse_start, next_url_parser, ...) on the site-specific spider class, to allow the handling of specificities: send a form, construct the next_url request from elements in the page, etc.
As I'm scraping very different sites, there are always specificities to manage. That's why I prefer to keep a class for each scraped site, so that I can write specific methods to handle it (pre-/post-processing beyond Pipelines, Request generators, ...).
masterspider/sitespider/settings.py
EXTENSIONS = {
    'masterspider.masterspider.MasterSpider': 500,
}
masterspider/masterspider/masterspider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem


class MasterSpider(Spider):

    def start_requests(self):
        if hasattr(self, 'parse_start'):  # First page requiring a specific parser
            fcallback = self.parse_start
        else:
            fcallback = self.parse
        return [Request(self.spd['start_url'],
                        callback=fcallback,
                        meta={'itemfields': {}})]

    def parse(self, response):
        sel = Selector(response)
        lines = sel.xpath(self.spd['xlines'])
        # ...
        for line in lines:
            item = genspiderItem(response.meta['itemfields'])
            # ...
            # Get request_url of the detail page and scrape basic item info
            # ...
            yield Request(request_url,
                          callback=self.parse_item,
                          meta={'item': item, 'itemfields': response.meta['itemfields']})

        for next_url in sel.xpath(self.spd['xnext_url']).extract():
            if hasattr(self, 'next_url_parser'):  # Need to process the next page URL before?
                yield self.next_url_parser(next_url, response)
            else:
                yield Request(
                    next_url,
                    callback=self.parse,
                    meta=response.meta)

    def parse_item(self, response):
        sel = Selector(response)
        item = response.meta['item']
        for itemname, xitemname in self.spd['x_ondetailpage'].iteritems():
            item[itemname] = "\n".join(sel.xpath(xitemname).extract())
        return item
masterspider/sitespider/spiders/somesite_spider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
from masterspider.masterspider import MasterSpider


class targetsiteSpider(MasterSpider):
    name = "targetsite"
    allowed_domains = ["www.targetsite.com"]
    spd = {
        'start_url': "http://www.targetsite.com/startpage",  # Start page
        'xlines': "//td[something...]",
        'xnext_url': "//a[contains(@href,'something?page=')]/@href",  # Next pages
        'x_ondetailpage': {
            "itemprop123": u"id('someid')//text()"
        }
    }

    # def next_url_parser(self, next_url, response):  # OPTIONAL next_url regexp pre-processor
    #     ...
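For illustration only, here is a sketch of what the optional next_url_parser hook (the method MasterSpider looks up with hasattr) could do in a site-specific spider. The domain, XPaths and URL handling below are assumptions, not part of the original answer:

# Hypothetical site spider providing the optional next_url_parser hook.
# Everything concrete here (domain, XPaths, item field) is made up.
import urlparse  # Python 2, matching the code above

from scrapy.http import Request
from masterspider.masterspider import MasterSpider


class anothersiteSpider(MasterSpider):
    name = "anothersite"
    allowed_domains = ["www.anothersite.com"]
    spd = {
        'start_url': "http://www.anothersite.com/listing",
        'xlines': "//tr[@class='row']",
        'xnext_url': "//a[@rel='next']/@href",
        'x_ondetailpage': {"title": u"//h1//text()"},
    }

    def next_url_parser(self, next_url, response):
        # Turn the relative "next page" href into an absolute Request
        # before MasterSpider yields it.
        return Request(urlparse.urljoin(response.url, next_url),
                       callback=self.parse,
                       meta=response.meta)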
Solution 3:
Instead of having the variables name, allowed_domains, start_urls and rules attached to the class, you should write a MySpider.__init__, call CrawlSpider.__init__ from it (passing the necessary arguments), and set name, allowed_domains etc. per object. MyProp and keywords should also be set within your __init__. So in the end you should have something like the code below. You don't have to add name to the arguments, as name is set by BaseSpider itself from kwargs:
# Imports needed for this snippet
import re

from bs4 import BeautifulSoup
from scrapy.contrib.spiders import CrawlSpider


class MySpider(CrawlSpider):

    def __init__(self, allowed_domains=[], start_urls=[],
                 rules=[], findtag='', finditemprop='', keywords='', **kwargs):
        CrawlSpider.__init__(self, **kwargs)
        self.allowed_domains = allowed_domains
        self.start_urls = start_urls
        self.rules = rules
        self.findtag = findtag
        self.finditemprop = finditemprop
        self.keywords = keywords

    def parse_item(self, response):
        soup = BeautifulSoup(response.body)
        contentTags = soup.findAll(self.findtag, itemprop=self.finditemprop)
        for contentTag in contentTags:
            matchedResult = re.search(self.keywords, contentTag.text)
            if matchedResult:
                print('URL Found: ' + response.url)
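One way to feed such a spider its per-run arguments is from a script. This is a sketch assuming a Scrapy 1.x+ style CrawlerProcess and made-up values, not part of the original answer; with scrapy crawl -a key=value the values arrive as strings, so list arguments would need splitting:

# Sketch only: run the configurable spider with per-run arguments.
# All concrete values below are placeholders.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(MySpider,
              allowed_domains=['example.com'],
              start_urls=['http://example.com/'],
              findtag='div',
              finditemprop='name',
              keywords=r'python')
process.start()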
Solution 4:
I am not sure which way is preferred, but I will tell you what I have done in the past. I am in no way sure that this is the best (or correct) way of doing this and I would be interested to learn what other people think.
I usually just subclass the parent class (CrawlSpider) and either pass in arguments and then initialize the parent class via super(MySpider, self).__init__() from within my own __init__, or I pull that data in from a database where I have saved a list of links to be appended to start_urls earlier. A minimal sketch of the second variant is shown below.
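This sketch assumes a sqlite3 database; the file, table and column names are made up for illustration and do not come from the original answer:

# Sketch: load saved links from a database in __init__ and use them as start_urls.
import sqlite3

from scrapy.contrib.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = "myspider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Hypothetical table "links" with a "url" column
        conn = sqlite3.connect('links.db')
        try:
            rows = conn.execute("SELECT url FROM links").fetchall()
        finally:
            conn.close()
        self.start_urls = [url for (url,) in rows]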
Solution 5:
As far as crawling specific domains passed as arguments goes, I just override Spider.__init__:
import scrapy


class MySpider(scrapy.Spider):
    """
    This spider will try to crawl whatever is passed in `start_urls`, which
    should be a comma-separated string of fully qualified URIs.

    Example: start_urls=http://localhost,http://example.com
    """
    def __init__(self, name=None, **kwargs):
        if 'start_urls' in kwargs:
            self.start_urls = kwargs.pop('start_urls').split(',')
        super(MySpider, self).__init__(name, **kwargs)
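It can then be invoked as, for example, scrapy crawl myspider -a start_urls=http://localhost,http://example.com (assuming the spider's name attribute is myspider), since Scrapy passes -a arguments to __init__ as keyword arguments.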