
Scrapy - Set Delay To Retry Middleware

I'm using Scrapy-splash and I have a problem with memory. I can clearly see that the memory used by the docker python3 process is gradually increasing until the PC freezes. I can't figure out why it behaves this way.

Solution 1:

One way would be to add a downloader middleware to your Spider (source). Scrapy's middleware chain is built on Twisted Deferreds, so returning a Deferred from process_request holds the request back until the Deferred fires, which is what produces the delay:

# File: middlewares.py
from twisted.internet import reactor
from twisted.internet.defer import Deferred


class DelayedRequestsMiddleware(object):
    def process_request(self, request, spider):
        delay_s = request.meta.get('delay_request_by', None)
        if not delay_s:
            return

        deferred = Deferred()
        reactor.callLater(delay_s, deferred.callback, None)
        return deferred

Which you could later use in your Spider like this:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {'middlewares.DelayedRequestsMiddleware': 123},
    }

    def start_requests(self):
        # This request will be delayed by 5 seconds
        yield scrapy.Request(url='http://quotes.toscrape.com/page/1/',
                             meta={'delay_request_by': 5})
        # This request will not be delayed
        yield scrapy.Request(url='http://quotes.toscrape.com/page/2/')

    def parse(self, response):
        ...  # Process results here

A related method is described here: Method #2
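Since the question is specifically about delaying retries, one way to tie this together is to tag only retried requests with the delay_request_by meta key, so DelayedRequestsMiddleware slows down just the retries. The sketch below is not part of the original answer; the class name DelayedRetryMiddleware and the 5-second delay are illustrative assumptions:

# File: middlewares.py (continued) -- minimal sketch, assumes the
# DelayedRequestsMiddleware above is also enabled.
from scrapy.downloadermiddlewares.retry import RetryMiddleware


class DelayedRetryMiddleware(RetryMiddleware):
    def process_response(self, request, response, spider):
        # If this response would trigger a retry, tag the request so that
        # DelayedRequestsMiddleware holds the retried copy back before
        # it is downloaded again (the copy inherits the meta).
        if response.status in self.retry_http_codes:
            request.meta['delay_request_by'] = 5  # illustrative value
        return super().process_response(request, response, spider)

To use it, swap it in for the stock retry middleware in DOWNLOADER_MIDDLEWARES, e.g. {'scrapy.downloadermiddlewares.retry.RetryMiddleware': None, 'middlewares.DelayedRetryMiddleware': 550}.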

Solution 2:

  1. A more elaborate solution could be to set up a Kubernetes cluster in which you have multiple replicas running. This way you avoid having the failure of just one container impact your scraping job.

  2. I don't think it's easy to configure a waiting time only for retries. You could play with DOWNLOAD_DELAY (but this will affect the delay between all requests), or set RETRY_TIMES to a higher value than the default of 2; see the settings sketch after this list.
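For reference, both of those knobs live in the project settings. A minimal sketch, assuming a standard project layout; the values are chosen purely for illustration:

# File: settings.py
# Wait roughly 3 seconds between consecutive requests to the same site
# (applies to every request, not just retries).
DOWNLOAD_DELAY = 3

# Retry failed requests up to 5 times instead of the default 2.
RETRY_TIMES = 5

# Spread requests out further by randomizing the delay between
# 0.5x and 1.5x of DOWNLOAD_DELAY (this is already on by default).
RANDOMIZE_DOWNLOAD_DELAY = True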
