Python - Extracting Text From Webpage Pdf

So I have come across a few posts that deal with converting PDFs to HTML or to text; however, they all deal with doing so from a file saved to the computer. Is ther

Solution 1:

You can download the file as a byte stream with requests and wrap its content in io.BytesIO(), like so:

import io

import requests
from pyPdf import PdfFileReader

url = 'http://www.arkansasrazorbacks.com/wp-content/uploads/2017/02/Miami-Ohio-Game-2.pdf'

r = requests.get(url)
f = io.BytesIO(r.content)

reader = PdfFileReader(f)
contents = reader.getPage(0).extractText().split('\n')

f is a file-like object that you can use just as if you had opened a PDF file from disk. This way the file exists only in memory and is never saved locally.
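To see why an in-memory buffer can stand in for an opened file, here is a small standard-library sketch. The bytes in fake_pdf_bytes are made-up placeholder data (not a valid PDF) standing in for r.content; the point is only that io.BytesIO supports the same read and seek calls that a PDF parser relies on.

```python
import io

# Hypothetical stand-in for r.content: shaped like the start and end of a PDF,
# but NOT a valid PDF document.
fake_pdf_bytes = b"%PDF-1.4\n...body...\n%%EOF\n"

f = io.BytesIO(fake_pdf_bytes)

# Read from the start, exactly as with a file opened in binary mode.
header = f.read(8)

# Seek relative to the end, as a parser does when it looks for the trailer.
f.seek(-6, io.SEEK_END)
trailer = f.read()

# Rewind so the next consumer starts from the beginning.
f.seek(0)
position = f.tell()
```

Nothing is written to disk at any point; the buffer is discarded when f goes out of scope.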

To extract the text from the PDF, you can use pyPdf, as in the code above.
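pyPdf is no longer maintained and its successor is published as pypdf, so on a current Python 3 install the same idea might look like the sketch below. This is an adaptation under assumptions, not the answer's original code: the function names looks_like_pdf and extract_pdf_text_from_url are hypothetical, the 30 second timeout is an arbitrary choice, and the header check is a simple heuristic that would reject the rare PDF with leading bytes before its header.

```python
import io
from typing import List


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the bytes start with the PDF magic number (heuristic)."""
    return data.startswith(b"%PDF-")


def extract_pdf_text_from_url(url: str, timeout: float = 30.0) -> List[str]:
    """Download a PDF into memory and return the extracted text of each page."""
    # Third-party imports are local so looks_like_pdf works without them.
    import requests
    from pypdf import PdfReader

    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise on 4xx/5xx responses

    # Guard against HTML error pages served in place of the PDF.
    if not looks_like_pdf(response.content):
        raise ValueError("response body does not look like a PDF")

    reader = PdfReader(io.BytesIO(response.content))
    # extract_text() can yield nothing for scanned, image-only pages.
    return [page.extract_text() or "" for page in reader.pages]
```

Calling extract_pdf_text_from_url(url)[0].split('\n') would then give the lines of the first page, mirroring the contents variable in the answer.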
