
Web Scraping: URL Not Changing While Searching

I am trying to scrape https://in.udacity.com/courses/all. I need to get the courses that are shown when a search query is entered. For example, if I enter python, 17 courses come up.

Solution 1:

Using requests and BeautifulSoup:

import requests
from bs4 import BeautifulSoup

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")

for course in courses:
    print(course.text)

OUTPUT:

VR Foundations
VR Mobile 360
VR High-Immersion
Google Analytics
Artificial Intelligence for Trading
Python Foundation
.
.
.

EDIT:

As explained by @Martin Evans, the Ajax call behind the search is not doing what you think it is; it is probably just keeping a count of searches, i.e. how many users searched for AI. The page itself simply filters the full course list by keyword, so you can do the same filtering yourself with the keyword in search_term:

import requests
from bs4 import BeautifulSoup
import re

page = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(page.content, 'html.parser')
courses = soup.find_all("a", class_="capitalize")
search_term = "AI"

# Escape the term so it is matched literally rather than as a regex
pattern = re.compile(re.escape(search_term), re.IGNORECASE)

for course in courses:
    if pattern.search(course.text):
        print(course.text)

OUTPUT:

AI Programming with Python
Blockchain Developer Nanodegree program
Knowledge-Based AI: Cognitive Systems
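
Note that the plain substring pattern also matches "Blockchain", because "blockchain" contains the letters "ai". If only whole-word matches are wanted, one option is to wrap the term in \b word boundaries. This is a minimal sketch rather than part of the original answer: the helper name matches_word is hypothetical, and the titles are hard-coded from the output above so that it runs without a live request.

```python
import re

def matches_word(term, title):
    """Return True if term occurs in title as a whole word, ignoring case."""
    # re.escape keeps the term literal; \b anchors it to word boundaries
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, title, re.IGNORECASE) is not None

# Titles copied from the output above (not fetched from the site)
titles = [
    "AI Programming with Python",
    "Blockchain Developer Nanodegree program",
    "Knowledge-Based AI: Cognitive Systems",
]

whole_word_hits = [t for t in titles if matches_word("AI", t)]
for title in whole_word_hits:
    print(title)
```

With this check the Blockchain title is dropped, while the two titles containing "AI" as a separate word are kept.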

Solution 2:

The Udacity page actually returns all available courses when you request it. When you enter a search, the page simply filters the data it already has. This is why the URL does not change when you enter a search. A check using the browser's developer tools confirms this. It also explains why the "search" is so fast.

As such, if you are searching for a given course, you would just need to filter the results yourself. For example:

import requests
from bs4 import BeautifulSoup

req = requests.get("https://in.udacity.com/courses/all")
soup = BeautifulSoup(req.content, "html.parser")
a_tags = soup.find_all("a", class_="capitalize")

print("Number of courses:", len(a_tags))
print()

for a_tag in a_tags:
    course = a_tag.text

    if "python" in course.lower():
        print(course)

This would display all courses with Python in the title:

Number of courses: 225

Python Foundation
AI Programming with Python
Programming Foundations with Python
Data Structures & Algorithms in Python
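
The same filtering can be wrapped in a small helper so that it can be reused for any query. The following is only a sketch: filter_titles is a hypothetical name, and the titles are hard-coded from the outputs above so that the example runs without a network request.

```python
def filter_titles(titles, query):
    """Return the titles that contain query, compared case-insensitively."""
    needle = query.strip().lower()
    return [title for title in titles if needle in title.lower()]

# Sample data copied from the outputs above (not fetched from the site)
sample_titles = [
    "VR Foundations",
    "Python Foundation",
    "AI Programming with Python",
    "Programming Foundations with Python",
    "Data Structures & Algorithms in Python",
]

python_courses = filter_titles(sample_titles, "Python")
print("Matches:", len(python_courses))
for title in python_courses:
    print(title)
```

In the live script, the list passed in would presumably be built from the scraped tags, for example [a_tag.text for a_tag in a_tags].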

Solution 3:

Read the tutorials on how to use requests (for making HTTP requests) and BeautifulSoup (for processing HTML). These will teach you what you need to know to download the pages and extract the data from the HTML.

You can use BeautifulSoup's find_all() to locate all of the <div> elements in the page HTML with class="course-summary-card". The content you want is inside those <div> elements, and after reading the tutorials it should be straightforward to figure out the rest ;)
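
As a rough illustration of that approach, the sketch below runs find_all() on a small inline HTML snippet instead of the live page. The markup is hypothetical: only the class names course-summary-card and capitalize come from this thread, and the real page structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; the real page may be laid out differently
sample_html = """
<div class="course-summary-card">
  <h3><a class="capitalize" href="/course/a">Python Foundation</a></h3>
</div>
<div class="course-summary-card">
  <h3><a class="capitalize" href="/course/b">VR Foundations</a></h3>
</div>
<div class="footer"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Select only the <div> elements that carry the card class
cards = soup.find_all("div", class_="course-summary-card")

card_titles = []
for card in cards:
    # Look for the title link inside each card; skip cards without one
    link = card.find("a", class_="capitalize")
    if link is not None:
        card_titles.append(link.get_text(strip=True))

print(card_titles)
```

Searching within each card, rather than across the whole page, keeps unrelated links such as the footer link out of the results.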

By the way, a helpful tool as you learn how to do this is the "Inspect element" feature in Chrome and Firefox, which you can open by right-clicking an element in the browser. It shows the source code surrounding the element you want to extract, so you can find details such as its class or id and its parent divs, which in turn let you select it in BeautifulSoup, lxml, etc.

