How To Use Meta In Scrapy Rule
Solution 1:
Your first example is wrong Python code, as Python reports.
Your second example does not work because your callback for the process_request parameter of Rule, the lambda function, returns None.
If you check the documentation:
"process_request is a callable, or a string (in which case a method from the spider object with that name will be used) which will be called with every request extracted by this rule, and must return a request or None (to filter out the request)."
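The lambda from the question is not shown here, so the following is only a guess at its shape. It is a plain-Python sketch, with a hypothetical FakeRequest class standing in for scrapy.Request, of why a lambda that updates meta in place returns None, while a named function can hand the request back explicitly.

```python
class FakeRequest:
    """Hypothetical stand-in for scrapy.Request; only the meta dict matters here."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


# Assumed shape of the broken callback: dict.update() returns None,
# so the lambda returns None and the rule filters out every request.
broken_processor = lambda request: request.meta.update({"item": "value"})


def fixed_processor(request):
    # Modify the request, then return it so that it is kept.
    request.meta["item"] = "value"
    return request
```

With the first callback the meta is updated, but the request is still dropped because None is what the rule receives back.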
That is actually not the only reason it does not work. To use rule-based link extractors, you must:
- Subclass CrawlSpider. From your examples it’s not clear if you are doing so.
- Don’t reimplement the parse method in your subclass, as you are currently doing. If start_urls is not good enough for you, use it in combination with parse_start_url.
- Rules must be declared as a class attribute. You are instead defining them as a variable within a method of your Spider subclass. That won’t work.
Please, re-read the documentation about the CrawlSpider.
As for passing a value from the meta of a response to the meta of the next request, you have 2 choices:
- Reimplement your spider as a Spider subclass, instead of a CrawlSpider subclass, manually performing all the logic without rule-based link extractors.
  This is the natural step whenever a generic spider like CrawlSpider starts to feel too restrictive. Generic spider subclasses are good for simple use cases, but whenever you face something non-trivial, you should consider switching to a regular Spider subclass.
- Wait for Scrapy 1.7 to be released, which should happen shortly (you could use the master branch of Scrapy in the meantime). Scrapy 1.7 introduces a new response parameter for process_request callbacks, which will allow you to do something like:
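from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


def my_request_processor(request, response):
    # Copy the item from the originating response onto the outgoing request.
    request.meta['item'] = response.meta['item']
    return request


class MySpider(CrawlSpider):
    # …
    rules = (
        Rule(
            LinkExtractor(
                restrict_xpaths='//div[@class="r"]/a',
                allow='/p/',
                allow_domains='homedepot.com'
            ),
            process_request=my_request_processor,
            callback='homedepot'
        ),  # trailing comma, so that rules is a tuple and not a bare Rule
    )
    # …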