Creating A Generic Scrapy Spider
Solution 1:
You could create the spider at run time by having the interpreter evaluate its source. For example, in a Python 2 interactive session:
a = open("test.py")
from compiler importcompile
d = compile(a.read(), 'spider.py', 'exec')
eval(d)
MySpider
<class'__main__.MySpider'>
print MySpider.start_urls
['http://www.somedomain.com']
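Note that the compiler module and the print statement above only exist in Python 2. A rough Python 3 equivalent of the same idea, assuming the spider class lives in a file named test.py, might look like this (a sketch, not part of the original answer):

# Python 3 sketch: load a spider class defined in an external file at run time.
# Assumes test.py defines a class named MySpider with a start_urls attribute.
namespace = {}
with open("test.py") as f:
    code = compile(f.read(), "spider.py", "exec")
    exec(code, namespace)

MySpider = namespace["MySpider"]
print(MySpider.start_urls)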
Solution 2:
I use the Scrapy Extensions approach to extend the Spider class into a class named MasterSpider that includes a generic parser.
Below is a very "short" version of my generic extended parser. Note that you'll need to implement a renderer with a JavaScript engine (such as Selenium) as soon as you start working on pages that use AJAX, plus a lot of additional code to manage differences between sites (scrape based on column title, handle relative vs. absolute URLs, manage different kinds of data containers, etc.).
What is interesting with the Scrapy Extension approach is that you can still override the generic parser method if something does not fit, but I never had to. The MasterSpider class checks whether certain methods have been defined (e.g. parse_start, next_url_parser, ...) on the site-specific spider class, to allow the handling of specificities: send a form, construct the next_url request from elements in the page, etc.
As I'm scraping very different sites, there are always specificities to manage. That's why I prefer to keep a class for each scraped site, so that I can write specific methods to handle it (pre-/post-processing beyond Pipelines, Request generators, ...).
masterspider/sitespider/settings.py
EXTENSIONS = {
    'masterspider.masterspider.MasterSpider': 500,
}
masterspider/masterspider/masterspider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem


class MasterSpider(Spider):

    def start_requests(self):
        if hasattr(self, 'parse_start'):  # First page requiring a specific parser
            fcallback = self.parse_start
        else:
            fcallback = self.parse
        return [Request(self.spd['start_url'],
                        callback=fcallback,
                        meta={'itemfields': {}})]

    def parse(self, response):
        sel = Selector(response)
        lines = sel.xpath(self.spd['xlines'])
        # ...
        for line in lines:
            item = genspiderItem(response.meta['itemfields'])
            # ...
            # Get request_url of the detail page and scrape basic item info
            # ...
            yield Request(request_url,
                          callback=self.parse_item,
                          meta={'item': item, 'itemfields': response.meta['itemfields']})

        for next_url in sel.xpath(self.spd['xnext_url']).extract():
            if hasattr(self, 'next_url_parser'):  # Need to process the next page URL before?
                yield self.next_url_parser(next_url, response)
            else:
                yield Request(
                    next_url,
                    callback=self.parse,
                    meta=response.meta)

    def parse_item(self, response):
        sel = Selector(response)
        item = response.meta['item']
        for itemname, xitemname in self.spd['x_ondetailpage'].iteritems():
            item[itemname] = "\n".join(sel.xpath(xitemname).extract())
        return item
masterspider/sitespider/spiders/somesite_spider.py
# -*- coding: utf8 -*-
from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from sitespider.items import genspiderItem
from masterspider.masterspider import MasterSpider


class targetsiteSpider(MasterSpider):
    name = "targetsite"
    allowed_domains = ["www.targetsite.com"]
    spd = {
        'start_url': "http://www.targetsite.com/startpage",  # Start page
        'xlines': "//td[something...]",
        'xnext_url': "//a[contains(@href,'something?page=')]/@href",  # Next pages
        'x_ondetailpage': {
            "itemprop123": u"id('someid')//text()"
        }
    }

    # def next_url_parser(self, next_url, response):  # OPTIONAL next_url regexp pre-processor
    #     ...
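For illustration only, here is a sketch of what the optional next_url_parser hook (the method MasterSpider looks up with hasattr) could do in a site-specific spider. The domain, XPaths and URL handling below are assumptions, not part of the original answer:

# Hypothetical site spider providing the optional next_url_parser hook.
# Everything concrete here (domain, XPaths, item field) is made up.
import urlparse  # Python 2, matching the code above

from scrapy.http import Request
from masterspider.masterspider import MasterSpider


class anothersiteSpider(MasterSpider):
    name = "anothersite"
    allowed_domains = ["www.anothersite.com"]
    spd = {
        'start_url': "http://www.anothersite.com/listing",
        'xlines': "//tr[@class='row']",
        'xnext_url': "//a[@rel='next']/@href",
        'x_ondetailpage': {"title": u"//h1//text()"},
    }

    def next_url_parser(self, next_url, response):
        # Turn the relative "next page" href into an absolute Request
        # before MasterSpider yields it.
        return Request(urlparse.urljoin(response.url, next_url),
                       callback=self.parse,
                       meta=response.meta)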
Solution 3:
Instead of having the variables name, allowed_domains, start_urls and rules attached to the class, you should write a MySpider.__init__, call CrawlSpider.__init__ from it (passing the necessary arguments), and set name, allowed_domains etc. per object. MyProp and keywords should also be set within your __init__. So in the end you should have something like the code below. You don't have to add name to the arguments, as name is set by BaseSpider itself from kwargs:
# Imports needed for this snippet
import re

from bs4 import BeautifulSoup
from scrapy.contrib.spiders import CrawlSpider


class MySpider(CrawlSpider):

    def __init__(self, allowed_domains=[], start_urls=[],
                 rules=[], findtag='', finditemprop='', keywords='', **kwargs):
        CrawlSpider.__init__(self, **kwargs)
        self.allowed_domains = allowed_domains
        self.start_urls = start_urls
        self.rules = rules
        self.findtag = findtag
        self.finditemprop = finditemprop
        self.keywords = keywords

    def parse_item(self, response):
        soup = BeautifulSoup(response.body)
        contentTags = soup.findAll(self.findtag, itemprop=self.finditemprop)
        for contentTag in contentTags:
            matchedResult = re.search(self.keywords, contentTag.text)
            if matchedResult:
                print('URL Found: ' + response.url)
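One way to feed such a spider its per-run arguments is from a script. This is a sketch assuming a Scrapy 1.x+ style CrawlerProcess and made-up values, not part of the original answer; with scrapy crawl -a key=value the values arrive as strings, so list arguments would need splitting:

# Sketch only: run the configurable spider with per-run arguments.
# All concrete values below are placeholders.
from scrapy.crawler import CrawlerProcess

process = CrawlerProcess()
process.crawl(MySpider,
              allowed_domains=['example.com'],
              start_urls=['http://example.com/'],
              findtag='div',
              finditemprop='name',
              keywords=r'python')
process.start()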
Solution 4:
I am not sure which way is preferred, but I will tell you what I have done in the past. I am in no way sure that this is the best (or correct) way of doing this and I would be interested to learn what other people think.
I usually just subclass the parent class (CrawlSpider) and either pass in arguments and then initialize the parent class via super(MySpider, self).__init__() from within my own __init__, or I pull that data in from a database where I have saved a list of links to be appended to start_urls earlier. A minimal sketch of the second variant is shown below.
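This sketch assumes a sqlite3 database; the file, table and column names are made up for illustration and do not come from the original answer:

# Sketch: load saved links from a database in __init__ and use them as start_urls.
import sqlite3

from scrapy.contrib.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = "myspider"

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Hypothetical table "links" with a "url" column
        conn = sqlite3.connect('links.db')
        try:
            rows = conn.execute("SELECT url FROM links").fetchall()
        finally:
            conn.close()
        self.start_urls = [url for (url,) in rows]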
Solution 5:
As far as crawling specific domains passed as arguments goes, I just override Spider.__init__:
import scrapy


class MySpider(scrapy.Spider):
    """
    This spider will try to crawl whatever is passed in `start_urls`, which
    should be a comma-separated string of fully qualified URIs.

    Example: start_urls=http://localhost,http://example.com
    """
    def __init__(self, name=None, **kwargs):
        if 'start_urls' in kwargs:
            self.start_urls = kwargs.pop('start_urls').split(',')
        super(MySpider, self).__init__(name, **kwargs)
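It can then be invoked as, for example, scrapy crawl myspider -a start_urls=http://localhost,http://example.com (assuming the spider's name attribute is myspider), since Scrapy passes -a arguments to __init__ as keyword arguments.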