XPath And Scrapy - Scraping Links When The Depth And Quantity Of A Tags Are Inconsistent
I am using Scrapy's SitemapSpider to go through a list of Shopify stores. I am pulling all of the products from their respective collections with XPath. Normally, this wouldn't be d
Solution 1:
Since your problem is mainly about duplicates in the response, convert the extracted list into a set. This keeps a single instance of each value.
Without using set:
>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract()
[u'/product_1', u'/product_1', u'/product_2', u'/product_2', u'/product_3', u'/product_3']
Using set:
>>> set(response.xpath('//div//a[contains(@href, "product")]/@href').extract())
set([u'/product_3', u'/product_2', u'/product_1'])
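Note that a set does not keep the original order, as the output above shows. If the order of the links matters, one alternative (a minimal sketch, not part of the original answer) is to deduplicate with dict.fromkeys(), which keeps the first occurrence of each value in its original position on Python 3.7+. The helper name unique_in_order is hypothetical, and the hard-coded list stands in for the result of the extract() call shown earlier:

```python
def unique_in_order(hrefs):
    """Return hrefs with duplicates removed, keeping first-seen order."""
    # dict keys are unique and, since Python 3.7, preserve insertion order,
    # so converting back to a list yields each href once, in original order.
    return list(dict.fromkeys(hrefs))


# Sample data standing in for response.xpath(...).extract()
hrefs = [
    '/product_1', '/product_1',
    '/product_2', '/product_2',
    '/product_3', '/product_3',
]

unique_hrefs = unique_in_order(hrefs)
print(unique_hrefs)
```

In a spider, the same helper could be applied directly to the list returned by extract() before the links are followed.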
If the question is only about a single div, the best approach is to use the extract_first() method, which returns only the first matched element. A further benefit is that, unlike indexing the result of extract() with [0], it avoids an IndexError and returns None when no element matches the selection.
Before:
>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract()
[u'/product_1', u'/product_1']
So, it should be:
>>> response.xpath('//div//a[contains(@href, "product")]/@href').extract_first()
u'/product_1'