Scraping From Web Page And Reformatting To A Calendar File
I'm trying to scrape this site: http://stats.swehockey.se/ScheduleAndResults/Schedule/3940 And I've gotten as far (thanks to alecxe) as retrieving the date and teams. from scrapy.i
Solution 1:
I'm just guessing that home games are the ones with the team you're looking for first (before the dash).
You can do this in XPath or from python. If you want to do it in XPath, only select the rows which contain the home team name.
//table[@class="tblContent"]/tr[
contains(substring-before(.//td[3]/text(), "-"), "AIK")
or
contains(substring-before(.//td[3]/text(), "-"), "Djurgårdens IF")
]
You can safely remove all whitespace (including newlines); I just added it for readability.
For Python you should be able to do much the same, maybe even more concisely using some regular expressions.
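As a rough sketch of the Python route, the home-team check can be done with one regular expression on the text of each teams cell. The helper name home_fixtures, the HOME_TEAMS set and the sample strings are made up for illustration, and the sketch assumes the cell text has the form "Home - Away". It is written for Python 3 and compares the home team exactly, rather than with contains() as in the XPath above.

```python
import re

# Hypothetical names; the pattern assumes cell text of the form "Home - Away".
FIXTURE_RE = re.compile(r"^\s*(?P<home>.+?)\s*-\s*(?P<away>.+?)\s*$")
HOME_TEAMS = {"AIK", "Djurgårdens IF"}


def home_fixtures(fixtures, home_teams=HOME_TEAMS):
    """Return (home, away) pairs whose home side is one of home_teams."""
    matches = []
    for text in fixtures:
        m = FIXTURE_RE.match(text)
        # Like substring-before(), the home side ends at the first dash.
        if m and m.group("home") in home_teams:
            matches.append((m.group("home"), m.group("away")))
    return matches


# Illustrative sample rows, not copied from the page.
sample = ["AIK - Almtuna IS", "Mora IK - AIK", " Djurgårdens IF - Karlskrona HK "]
print(home_fixtures(sample))
```

Only the first and third sample rows are kept, because AIK is the away side in the second one.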
Solution 2:
A few points to note:
- string is the name of a standard-library module (the built-in type is str), so it's generally good practice to avoid using it for your own variables
- Removing whitespace was indeed the way to clean up home_team enough to do a straight comparison with the required "AIK". I used string.strip() on home_team and away_team as it's a little cleaner than string.replace(" ", ""), but that's a personal thing
- I also added a ":" between the home and away teams in the print lines to distinguish between them more clearly when I was testing, so feel free to get rid of that change
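To illustrate the strip() versus replace(" ", "") point, here is a small Python 3 sketch. The padded sample value is an assumption about what the scraped cell text might look like. For a multi-word name the two are not equivalent: replace also removes the inner space, so the result would no longer compare equal to the team name.

```python
raw_home = " Djurgårdens IF "  # illustrative cell text with padding

stripped = raw_home.strip()           # trims whitespace at both ends only
squeezed = raw_home.replace(" ", "")  # removes every space, including the inner one

print(repr(stripped))
print(repr(squeezed))
print(stripped == "Djurgårdens IF")
print(squeezed == "Djurgårdens IF")
```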
Have a check and let me know if there are any other issues. :)
# -*- coding: utf-8 -*-
# The coding line above is needed in Python 2 for the non-ASCII team name below.
# Assumes HtmlXPathSelector is imported from scrapy.selector (old Scrapy API)
# and that SchemaItem is the item class from the question.

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    rows = hxs.select('//table[@class="tblContent"]/tr')
    for row in rows:
        item = SchemaItem()
        item['date'] = row.select('.//td[2]/div/span/text()').extract()
        item['teams'] = row.select('.//td[3]/text()').extract()
        for fixture in item['teams']:
            teams = fixture.split('-')  # split "Home - Away" on the dash
            home_team = teams[0].strip()
            away_team = teams[1].strip()
            # Only keep fixtures where one of the two clubs is the home side
            if home_team in (u"AIK", u"Djurgårdens IF"):
                for fixDate in item['date']:
                    # The date text is sliced by position: YYYY-MM-DD HH:MM
                    year = fixDate[0:4]
                    month = fixDate[5:7]
                    day = fixDate[8:10]
                    hour = fixDate[11:13]
                    minute = fixDate[14:16]
                    print year, month, day, hour, minute, home_team, ":", away_team
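The slice offsets used above suggest that the date text looks like "YYYY-MM-DD HH:MM"; that format is inferred from the offsets, not confirmed against the page. Under that assumption, a possible alternative to slicing is datetime.strptime, sketched here in Python 3. The function name parse_fixture_date and the sample value are made up for illustration.

```python
from datetime import datetime


def parse_fixture_date(fix_date):
    """Parse the leading 'YYYY-MM-DD HH:MM' part of a scraped date string."""
    # Assumed format, inferred from the slice offsets in the answer's code.
    return datetime.strptime(fix_date[:16], "%Y-%m-%d %H:%M")


start = parse_fixture_date("2013-09-12 19:00")  # made-up sample value
print(start.year, start.month, start.day, start.hour, start.minute)

# Compact timestamp of the kind iCalendar uses for date-time values.
print(start.strftime("%Y%m%dT%H%M%S"))
```

Having a real datetime object instead of five strings should make it easier to write the fixture out in whatever calendar format is needed later.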