Python (Selenium) Script for Downloading PDFs, and When Those Aren't Found It Scrapes Pages for Similar Information
So essentially, I am writing a script that loops through a list of search terms, googles them, and downloads the first PDF it sees, but if it can't find one then it goes to the fir
Solution 1:
As far as I can tell, this works so far, although it doesn't download every PDF.
import time

import requests
from bs4 import BeautifulSoup
from selenium.common.exceptions import (
    ElementNotInteractableException,
    NoSuchElementException,
)
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# `driver`, `keys` (the search terms) and `links_list` are assumed to be
# defined earlier in the script.
sleep_between_interactions = 5
driver.implicitly_wait(10)

for k, key in enumerate(keys, start=1):
    # Build the progress label first so the except blocks can always use it.
    progress = f"{k} out of {len(keys)}"
    try:
        driver.get("https://www.google.com/")
        # find_element_by_name() was removed in Selenium 4; use By.NAME.
        searchbar = driver.find_element(By.NAME, "q")
        searchbar.send_keys(key)
        searchbar.send_keys(Keys.ARROW_DOWN)
        searchbar.send_keys(Keys.RETURN)
        pdf_elements = driver.find_elements(By.XPATH, "//a[contains(@href, '.pdf')]")
        print(progress)
        # The original compared the counters as strings ("40" < "5"), which
        # skipped some keys; only test whether a PDF link was found.
        if pdf_elements:
            print("pdf found for: " + key)
            pdf_elements[0].click()
            time.sleep(sleep_between_interactions)
            print("downloaded " + progress)
        else:
            print("pdf NOT found for " + key)
            print(key + " pdf not downloaded, moving on...")
            # Fall back to the text of the first result block; passing the
            # query via params lets requests URL-encode it.
            response = requests.get("https://www.google.com/search", params={"q": key})
            soup = BeautifulSoup(response.text, "lxml")
            first_result = soup.find("div", class_="BNeawe")
            if first_result is not None:
                links_list.append(first_result.text)
    except IndexError:
        print(f'Couldn\'t find pdf file for "{key}" due to Index Error, moving on....')
        print(progress)
        continue
    except NoSuchElementException:
        print("search bar didn't load, iterating next in loop")
        print("pdf NOT found for " + key)
        print(key + " pdf not downloaded, moving on...")
        continue
    except ElementNotInteractableException:
        print("element either didn't load or doesn't exist")
        driver.get("https://www.google.com/")
        continue
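The XPath test `contains(@href, '.pdf')` matches any link with `.pdf` anywhere in its URL, including inside a query string. A stricter check is sketched below, assuming the link URLs have already been collected into a list of strings. The helper names `is_pdf_url` and `first_pdf_url` and the example URLs are made up for illustration and are not part of the original script.

```python
from urllib.parse import urlparse


def is_pdf_url(href):
    """Return True when the URL's path component ends in .pdf."""
    path = urlparse(href).path
    return path.lower().endswith(".pdf")


def first_pdf_url(hrefs):
    """Return the first href that points at a PDF, or None if there is none."""
    for href in hrefs:
        # Skip empty or missing hrefs before parsing.
        if href and is_pdf_url(href):
            return href
    return None


# Example hrefs (hypothetical, for illustration only).
hrefs = [
    "https://example.edu/programs/overview",
    "https://example.edu/search?q=handbook.pdf",
    "https://example.edu/docs/Handbook.PDF?version=2",
]
print(first_pdf_url(hrefs))
# prints: https://example.edu/docs/Handbook.PDF?version=2
```

With Selenium, the list could be built from the `href` attribute of each anchor element returned by `driver.find_elements`, and the chosen URL could then be opened or fetched directly instead of clicking the element.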
Output:
1 out of 40
2 out of 40
3 out of 40
4 out of 40
5 out of 40
pdf NOT found for computer science Learning Outcomes California Baptist University
computer science Learning Outcomes California Baptist University pdf not downloaded, moving on...
6 out of 40
pdf NOT found for physicsmath Learning Outcomes California Baptist University
physicsmath Learning Outcomes California Baptist University pdf not downloaded, moving on...
7 out of 40
pdf found for: computer science Learning Outcomes California Lutheran University
downloaded 7 out of 40
8 out of 40
pdf found for: physicsmath Learning Outcomes California Lutheran University
downloaded 8 out of 40
9 out of 40
pdf found for: computer science Program Handbook Azusa Pacific University
downloaded 9 out of 40
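Because the log interleaves progress lines with results, it can be hard to see at a glance which search terms ended in a download and which fell back to scraping. One possible approach, sketched here as an assumption rather than something the original script does, is to record an outcome per key and print a tally at the end. The `results` dict, the `record` and `summarize` helpers, and the outcome labels `"pdf"` and `"fallback"` are hypothetical names.

```python
from collections import Counter

# Maps each search term to a hypothetical outcome label.
results = {}


def record(key, outcome):
    """Remember what happened for one search term."""
    results[key] = outcome


def summarize(results):
    """Count how many search terms ended with each outcome."""
    return dict(Counter(results.values()))


# Outcomes taken from three of the keys in the output above.
record("computer science Learning Outcomes California Baptist University", "fallback")
record("computer science Learning Outcomes California Lutheran University", "pdf")
record("computer science Program Handbook Azusa Pacific University", "pdf")

print(summarize(results))
```

In the loop, `record(key, "pdf")` would go in the branch that clicks the PDF link and `record(key, "fallback")` in the branch that scrapes the results page.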