Python - Extracting Text From Webpage Pdf

July 27, 2023

So I have come across a few posts that deal with converting PDFs to HTML or to text; however, they all do so from a file saved to the computer. Is ther

Solution 1:

You can download the file as a byte stream with requests and wrap it in io.BytesIO(), like so:

import io

import requests
from pyPdf import PdfFileReader

url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'

# Download the PDF and wrap the raw bytes in an in-memory file object
r = requests.get(url)
f = io.BytesIO(r.content)

# Read the first page and split its extracted text into lines
reader = PdfFileReader(f)
contents = reader.getPage(0).extractText().split('\n')

f is a file-like object that you can use exactly as if you had opened a PDF file from disk. This way the file lives only in memory and is never saved locally.
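To make the "file-like" point concrete, here is a minimal sketch using only the standard library. The helper name as_pdf_stream and the sample bytes are hypothetical, standing in for r.content from a real download. It also checks for the %PDF- header that PDF files normally begin with, which can catch the case where the server returned an HTML error page instead of a PDF.

```python
import io

# PDF files normally begin with this marker
PDF_MAGIC = b"%PDF-"


def as_pdf_stream(data):
    """Wrap raw bytes in a seekable in-memory stream, rejecting non-PDF data."""
    if not data.startswith(PDF_MAGIC):
        raise ValueError("data does not start with the %PDF- header")
    return io.BytesIO(data)


# Hypothetical stand-in for r.content (a real response body would be much longer)
sample = b"%PDF-1.4\n% rest of the file omitted\n"

f = as_pdf_stream(sample)
header = f.read(8)  # read() works just like on a file opened with open(..., 'rb')
f.seek(0)           # rewind so a PDF reader can start from the beginning
```

Because the stream supports read(), seek(), and tell(), it can be passed to a PDF library in place of an opened file.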

To extract the text from the PDF, you can use pyPdf, as in the code above.
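pyPdf itself is no longer maintained; its current successor is published as pypdf. The following is a sketch of the same approach with that library, assuming pypdf and requests are installed. The function names page_lines and pdf_lines_from_url, and the timeout parameter, are illustrative and not part of the original answer.

```python
import io


def page_lines(page_text):
    """Split one page's extracted text into stripped, non-empty lines."""
    return [line.strip() for line in page_text.splitlines() if line.strip()]


def pdf_lines_from_url(url, timeout=30):
    """Download a PDF into memory and return one list of text lines per page."""
    # Third-party imports are kept local so page_lines() is usable without them.
    import requests
    from pypdf import PdfReader

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail early on 404/500 rather than parsing an error page

    # The PDF never touches the disk: the reader works on the in-memory stream
    reader = PdfReader(io.BytesIO(response.content))

    # extract_text() can come back empty for scanned, image-only pages
    return [page_lines(page.extract_text() or "") for page in reader.pages]
```

Calling pdf_lines_from_url(url) with the URL from the answer would return a list with one entry per page. Text extraction quality depends on how the PDF was produced, so the line breaks may not match the visual layout.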
