How To Use Meta In Scrapy Rule

December 22, 2023

def parse(self, response):
    my_item = {'test': 123, 'test2': 321}
    google_url = 'https://www.google.com/search?q=coffee+cans'
    yield Request(url=google_url, callback=self.google,

Solution 1:

Your first example is not valid Python, as the interpreter itself reports.

Your second example does not work because the lambda function you pass as the process_request parameter of Rule returns None.

If you check the documentation:

process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request).
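To make the return-value requirement concrete, here is a minimal sketch. FakeRequest is a hypothetical stand-in for scrapy.Request so that the snippet runs without Scrapy installed, and both callables are made-up examples rather than the code from the question:

```python
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request (only url and meta)."""
    url: str
    meta: dict = field(default_factory=dict)


# dict.update() returns None, so this lambda also returns None.
# Used as process_request, it would filter out every request.
drops_request = lambda request: request.meta.update({"item": 123})


def keeps_request(request):
    """Set the meta value, then return the request so it is kept."""
    request.meta["item"] = 123
    return request


print(drops_request(FakeRequest("https://www.homedepot.com/p/a")))
print(keeps_request(FakeRequest("https://www.homedepot.com/p/b")).meta)
```

The first call prints None, which is what Rule would treat as "drop this request"; the second returns the request itself with the meta value attached.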

That is actually not the only reason it does not work. To use rule-based link extractors, you must:

Subclass CrawlSpider. From your examples it’s not clear if you are doing so.
Avoid reimplementing the parse method in your subclass, as you are currently doing. If start_urls is not good enough for you, use it in combination with parse_start_url.
Declare rules as a class attribute. You are instead defining them as a variable within a method of your Spider subclass, which won’t work.

Please re-read the CrawlSpider documentation.

As for passing a value from the meta of a response to the meta of the next request, you have 2 choices:

Reimplement your spider as a Spider subclass, instead of a CrawlSpider subclass, manually performing all the logic without rule-based link extractors.
This is the natural step whenever a generic spider like CrawlSpider starts to feel too restrictive. Generic spider subclasses are good for simple use cases, but whenever you face something non-trivial, you should consider switching to a regular Spider subclass.
Wait for Scrapy 1.7 to be released, which should happen shortly (you could use the master branch of Scrapy in the meantime). Scrapy 1.7 introduces a new response parameter for process_request callbacks, which will allow you to do something like:

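from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def my_request_processor(request, response):
    # Copy the item from the originating response onto the new request.
    request.meta['item'] = response.meta['item']
    return request


class MySpider(CrawlSpider):

    # …

    rules = (
        Rule(
            LinkExtractor(
                restrict_xpaths='//div[@class="r"]/a',
                allow='/p/',
                allow_domains='homedepot.com'
            ),
            process_request=my_request_processor,
            callback='homedepot'
        ),  # the trailing comma keeps rules a tuple
    )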

    # …
