
Extracting Images From Pdf Using Python

How can we extract images (and only images) from a PDF? I have tried many online tools, but none of them works universally. For most PDFs, they take a screenshot of the whole image instead of the

Solution 1:

Here is some code that reads a PDF file using pyPdf, extracts the images, and yields them as PIL.Image objects. You will need to adapt it to your needs; it is only here to demonstrate how to walk the object tree.

import io
import pyPdf
import PIL.Image


def extract_images(infile_name):
    """Yield every image embedded in the PDF as a PIL.Image."""
    with open(infile_name, 'rb') as in_f:
        in_pdf = pyPdf.PdfFileReader(in_f)
        for page_no in range(in_pdf.getNumPages()):
            page = in_pdf.getPage(page_no)

            # Images are part of a page's `/Resources/XObject`
            r = page['/Resources']
            if '/XObject' not in r:
                continue
            for k, v in r['/XObject'].items():
                vobj = v.getObject()
                # We are only interested in images...
                if vobj['/Subtype'] != '/Image' or '/Filter' not in vobj:
                    continue
                if vobj['/Filter'] == '/FlateDecode':
                    # A raw bitmap
                    buf = vobj.getData()
                    # Notice that we need metadata from the object
                    # so we can make sense of the image data
                    size = tuple(map(int, (vobj['/Width'], vobj['/Height'])))
                    img = PIL.Image.frombytes('RGB', size, buf,
                                              decoder_name='raw')
                    yield img
                elif vobj['/Filter'] == '/DCTDecode':
                    # A compressed image (JPEG data that PIL can open directly)
                    img = PIL.Image.open(io.BytesIO(vobj._data))
                    yield img


infile_name = 'my.pdf'
for img in extract_images(infile_name):
    print(img.size)  # Do something with `img`...
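
The /FlateDecode branch above always passes 'RGB' to PIL, which only fits images stored as 8-bit /DeviceRGB data. As a rough sketch, the mode could instead be derived from the image object's /ColorSpace and /BitsPerComponent entries. The helper `raw_image_layout` below is hypothetical (it is not part of pyPdf or PIL) and only covers the three plain device color spaces; anything else, such as indexed or ICC-based color spaces, is rejected rather than guessed at.

```python
# Hypothetical helper, not part of pyPdf or PIL: maps a PDF color space name
# to the PIL mode and the number of bytes per pixel (8 bits per component).
_MODES = {
    '/DeviceRGB': ('RGB', 3),
    '/DeviceGray': ('L', 1),
    '/DeviceCMYK': ('CMYK', 4),
}


def raw_image_layout(color_space, width, height, bits_per_component=8):
    """Return (pil_mode, expected_byte_count) for an uncompressed bitmap."""
    if int(bits_per_component) != 8:
        raise ValueError('only 8 bits per component is handled in this sketch')
    try:
        mode, channels = _MODES[str(color_space)]
    except KeyError:
        raise ValueError('unsupported color space: %r' % (color_space,))
    return mode, int(width) * int(height) * channels


# Possible use inside the /FlateDecode branch (assumes `vobj` and `buf` as above):
#   mode, nbytes = raw_image_layout(vobj['/ColorSpace'], vobj['/Width'],
#                                   vobj['/Height'], vobj['/BitsPerComponent'])
#   if len(buf) == nbytes:
#       img = PIL.Image.frombytes(mode, size, buf, decoder_name='raw')
```

Comparing `len(buf)` with the expected byte count before calling `frombytes` is a cheap sanity check: a mismatch suggests the data is not a plain 8-bit bitmap in that color space, and the image is better skipped than decoded wrongly.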

Solution 2:

Here's a solution with PyMuPDF:

#!python3.6
import fitz  # PyMuPDF


def get_pixmaps_in_pdf(pdf_filename):
    doc = fitz.open(pdf_filename)
    xrefs = set()
    for page_index in range(doc.pageCount):
        for image in doc.getPageImageList(page_index):
            xrefs.add(image[0])  # Add XREFs to set so duplicates are ignored
    pixmaps = [fitz.Pixmap(doc, xref) for xref in xrefs]
    doc.close()
    return pixmaps


def write_pixmaps_to_pngs(pixmaps):
    for i, pixmap in enumerate(pixmaps):
        pixmap.writePNG(f'{i}.png')  # Might want to come up with a better name


pixmaps = get_pixmaps_in_pdf(r'C:\StackOverflow\09_chapter 4.pdf')
write_pixmaps_to_pngs(pixmaps)
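
The code above uses PyMuPDF's older camelCase names (`pageCount`, `getPageImageList`, `writePNG`). Below is a sketch of the same idea, assuming a recent PyMuPDF release where the snake_case names `page_count`, `get_page_images` and `Pixmap.save` are available. It also picks up the "better name" remark: `image_filename` and `save_images` are hypothetical helpers introduced here, and naming each file after the PDF and the image's xref is just one possible scheme.

```python
from pathlib import Path


def image_filename(pdf_path, xref, ext='png'):
    """Build a name such as 'report_xref12.png' (hypothetical naming scheme)."""
    return f'{Path(pdf_path).stem}_xref{xref}.{ext}'


def save_images(pdf_path, out_dir='.'):
    """Write each distinct embedded image to out_dir as PNG; return the paths."""
    import fitz  # PyMuPDF; imported lazily so image_filename works without it

    out_dir = Path(out_dir)
    written = []
    doc = fitz.open(pdf_path)
    try:
        xrefs = set()
        for page_index in range(doc.page_count):
            for image in doc.get_page_images(page_index):
                xrefs.add(image[0])  # xref is the first item; the set drops duplicates
        for xref in sorted(xrefs):
            pixmap = fitz.Pixmap(doc, xref)
            if pixmap.n - pixmap.alpha > 3:
                # More than three color components (e.g. CMYK): convert to
                # RGB first, since PNG only supports gray and RGB.
                pixmap = fitz.Pixmap(fitz.csRGB, pixmap)
            target = out_dir / image_filename(pdf_path, xref)
            pixmap.save(str(target))
            written.append(target)
    finally:
        doc.close()
    return written
```

Because xrefs are unique within a document, the resulting file names do not collide and stay stable between runs, unlike a running index over an unordered set.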
