Web Scraping Url Not Changing While Search
Solution 1:
Using requests and BeautifulSoup:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")
for course in courses:
    print(course.text)
OUTPUT:
VR Foundations
VR Mobile 360
VR High-Immersion
Google Analytics
Artificial Intelligence for Trading
Python Foundation
.
.
.
EDIT:
As explained by @Martin Evans, the Ajax call behind the search is not doing what you think it is; it is probably just keeping a count of searches, i.e. how many users searched for "AI". The code below simply filters the course list by the keyword in search_term:
import requests
from bs4 import BeautifulSoup
import re
page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")
search_term = "AI"
for course in courses:
    if re.search(search_term, course.text, re.IGNORECASE):
        print(course.text)
OUTPUT:
AI Programming with Python
Blockchain Developer Nanodegree program
Knowledge-Based AI: Cognitive Systems
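Note that the case-insensitive pattern above also matches "ai" inside other words, which is why "Blockchain Developer Nanodegree program" appears in the output. A minimal sketch of a stricter filter using a word-boundary pattern is shown here. The helper name filter_titles and the hard-coded titles list are assumptions for illustration, standing in for the scraped course.text values:

```python
import re

def filter_titles(titles, search_term):
    """Return titles that contain search_term as a whole word (case-insensitive)."""
    # re.escape guards against regex metacharacters in the term;
    # \b anchors the match at word boundaries.
    pattern = re.compile(r"\b" + re.escape(search_term) + r"\b", re.IGNORECASE)
    return [title for title in titles if pattern.search(title)]

# Sample titles copied from the output above; in practice, pass the
# course.text values collected from the page instead.
titles = [
    "AI Programming with Python",
    "Blockchain Developer Nanodegree program",
    "Knowledge-Based AI: Cognitive Systems",
]

for title in filter_titles(titles, "AI"):
    print(title)
```

With this pattern, "Blockchain" no longer matches, because the "ai" in it is not a separate word.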
Solution 2:
The Udacity page actually returns all available courses when you request it. When you enter a search, the page simply filters the data it has already loaded. This is why you do not see any change to the URL when entering a search. A check using the browser's developer tools confirms this. It also explains why the "search" is so fast.
As such, if you are searching for a given course, you would just need to filter the results yourself. For example:
import requests
from bs4 import BeautifulSoup
req = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(req.content, "html.parser")
a_tags = soup.find_all("a", class_="capitalize")
print("Number of courses:", len(a_tags))
print()
for a_tag in a_tags:
    course = a_tag.text
    if "python" in course.lower():
        print(course)
This would display all courses with Python in the title:
Number of courses: 225
Python Foundation
AI Programming with Python
Programming Foundations with Python
Data Structures & Algorithms in Python
Solution 3:
Read the tutorials for how to use requests (for making HTTP requests) and BeautifulSoup (for processing HTML). This will teach you what you need to know to download the pages, and extract the data from the HTML.
You will use the function BeautifulSoup.find_all() to locate all of the <div> elements in the page HTML with class=course-summary-card. The content you want is within that <div>, and after reading the above links it should be trivial for you to figure out the rest ;)
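As a rough starting point, the lookup described above might look like the sketch below. It runs against a small inline HTML sample instead of the live page, so it needs no network access. Only the class name course-summary-card comes from this answer; the sample markup, including the inner <a class="capitalize"> tag borrowed from the other solutions, is an assumption and the real page structure may differ:

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page may be structured differently.
sample_html = """
<div class="course-summary-card"><a class="capitalize" href="#">Python Foundation</a></div>
<div class="course-summary-card"><a class="capitalize" href="#">VR Foundations</a></div>
<div class="other"><a href="#">Not a course</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# class_ (with the trailing underscore) is used because "class" is a Python keyword.
cards = soup.find_all("div", class_="course-summary-card")

titles = []
for card in cards:
    link = card.find("a")  # first <a> inside the card, or None if there is none
    if link is not None:
        titles.append(link.get_text(strip=True))

print(titles)
```

To run this against the live site, replace sample_html with the content of a requests.get() response, as in the other solutions.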
Btw, one helpful tool as you learn how to do this is the "Inspect element" feature (in Chrome/Firefox), which you can access by right-clicking an element in the browser. It lets you look at the source code surrounding the element you're interested in extracting, so you can get information such as its class or id, parent divs, etc. that will allow you to select it in BeautifulSoup/lxml/etc.