Xpath And Scrapy - Scraping Links When The Depth And Quantity Of A Tags Are Inconsistent

April 16, 2024

I am using Scrapy's SitemapSpider to go through a list of Shopify stores. I am pulling all of the products from their respective collections with XPath. Normally, this wouldn't be d

Solution 1:

Since your problem is mainly about duplicates in the response, convert the extracted list into a set. A set keeps a single instance of each value.

Without using a set:

>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract()
[u'/product_1', u'/product_1', u'/product_2', u'/product_2', u'/product_3', u'/product_3']

Using a set:

>>> set(response.xpath('//div//a[contains(@href, "product")]/@href').extract())
set([u'/product_3', u'/product_2', u'/product_1'])
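
Note that a set does not keep the links in the order they appear on the page, as the output above shows. If order matters, one option is sketched below. It assumes Python 3.7 or later, where dict preserves insertion order. The `hrefs` list is a stand-in for the output of `.extract()`, and `dedupe_keep_order` is a hypothetical helper name, not part of Scrapy.

```python
def dedupe_keep_order(hrefs):
    """Return hrefs with duplicates removed, keeping first-seen order."""
    # dict keys are unique and (Python 3.7+) keep insertion order.
    return list(dict.fromkeys(hrefs))


# Stand-in for response.xpath(...).extract(); mirrors the example list above.
hrefs = ['/product_1', '/product_1', '/product_2',
         '/product_2', '/product_3', '/product_3']
unique_hrefs = dedupe_keep_order(hrefs)
print(unique_hrefs)
```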

If you only need the link from a single div, use extract_first() to extract only the first matched element. Its benefit is that it avoids an IndexError and returns None when no element matches the selection.

Before:

>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract()
[u'/product_1', u'/product_1']

So it should be:

>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract_first()
u'/product_1'
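
The difference between indexing the extracted list and taking the first match safely can be illustrated without Scrapy. The sketch below is a plain-Python analogue, not Scrapy's implementation. `first_or_default` is a hypothetical helper that mirrors the behaviour described above: it returns None instead of raising IndexError when nothing matched.

```python
def first_or_default(values, default=None):
    """Return the first item of values, or default if values is empty."""
    # next() with a default avoids the IndexError that values[0] would raise.
    return next(iter(values), default)


matches = ['/product_1', '/product_1']  # stand-in for .extract() output
no_matches = []                         # stand-in for an XPath that matched nothing

first = first_or_default(matches)
missing = first_or_default(no_matches)

# Plain indexing on an empty result raises IndexError.
try:
    no_matches[0]
    raised = False
except IndexError:
    raised = True
```

In recent Scrapy versions, .get() and .getall() are the preferred names for .extract_first() and .extract().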
