
How Can I Distinguish A Digitally-created Pdf From A Searchable Pdf?

I am currently analyzing a set of PDF files. I want to know how many of them fall into these 3 categories: Digitally Created PDF: The text is there (copyable) and it is gua

Solution 1:

With PyMuPDF you can easily remove all text, as required for @ypnos' suggestion.

As an alternative, PyMuPDF also lets you check whether the text in a PDF is hidden. In the PDF content-stream "mini-language", hidden text is triggered by the operator 3 Tr (text render mode 3; see e.g. page 402 of https://www.adobe.com/content/dam/acom/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf). So if all text on a page is under the influence of this operator, none of it is rendered, which allows the conclusion "this is an OCR'ed page".
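The hidden-text check described above can be sketched roughly as follows. This is a simplified illustration, not the answerer's code: `hidden_text_ratio` and `page_is_ocr_layer` are hypothetical names, the scan only looks at the `Tr`, `Tj` and `TJ` operators, and it ignores graphics-state save/restore (`q`/`Q`), the `'` and `"` text operators, text inside Form XObjects, and operator-like bytes inside string literals. It assumes the page's content stream is available as bytes, which in PyMuPDF can be obtained with `page.read_contents()`.

```python
import re
from typing import Optional

# Either a "<n> Tr" render-mode operator, or a text-showing operator (Tj / TJ).
_TOKEN = re.compile(rb"(\d+)\s+Tr\b|\b(Tj|TJ)\b")


def hidden_text_ratio(content: bytes) -> Optional[float]:
    """Fraction of text-showing operators issued while render mode is 3.

    Returns None if the stream shows no text at all.
    """
    mode = 0  # PDF default render mode: fill (visible text)
    shown = 0
    hidden = 0
    for match in _TOKEN.finditer(content):
        if match.group(1) is not None:
            mode = int(match.group(1))  # a Tr operator changes the mode
        else:
            shown += 1
            if mode == 3:  # mode 3 = neither fill nor stroke (invisible)
                hidden += 1
    return hidden / shown if shown else None


def page_is_ocr_layer(page) -> bool:
    """Heuristic: True if every text-showing operator on the page is hidden.

    `page` is assumed to be a PyMuPDF Page object.
    """
    ratio = hidden_text_ratio(page.read_contents())
    return ratio is not None and ratio == 1.0
```

A page whose ratio is 1.0 would be a candidate for "searchable (OCR'ed) PDF"; a ratio of None means the stream contains no text operators that this sketch recognizes.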

Solution 2:

This is a modified version of an answer to "How to check if PDF is scanned image or contains text".

In this solution you don't have to render the PDF, so I would guess it is faster. The answer I modified uses the percentage of the page area covered by text to decide whether it is a text document or a scanned document (image).

I added similar reasoning for images: sum the area covered by images and compute the percentage of the page it covers. If the page is mostly covered by images, you can assume it is a scanned document. You can move the thresholds around to fit your document collection.

I also added logic to check page by page. This is because at least in the document collection I have, some documents might have a digitally created first page and then the rest is scanned.

Modified code:

import fitz  # pip install PyMuPDF


def page_type(page):
    # Older PyMuPDF releases spell the calls below page.getText("rawdict")
    # and page.getTextBlocks().
    page_area = abs(page.rect)  # total page area

    img_area = 0.0
    for block in page.get_text("rawdict")["blocks"]:
        if block["type"] == 1:  # type 1 blocks are images
            bbox = block["bbox"]
            img_area += (bbox[2] - bbox[0]) * (bbox[3] - bbox[1])  # width * height
    img_perc = img_area / page_area
    print("Image area proportion: " + str(img_perc))

    text_area = 0.0
    for b in page.get_text("blocks"):
        r = fitz.Rect(b[:4])  # rectangle where block text appears
        text_area = text_area + abs(r)
    text_perc = text_area / page_area
    print("Text area proportion: " + str(text_perc))

    if text_perc < 0.01:  # no text = scanned
        page_type = "Scanned"
    elif img_perc > 0.8:  # has text but very large images = searchable
        page_type = "Searchable text"
    else:
        page_type = "Digitally created"
    return page_type


doc = fitz.open(pdffilepath)  # pdffilepath: path to the PDF to analyze

for page in doc:  # iterate through pages to find different types
    print(page_type(page))
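Because a single document can mix page types, one possible way to roll the per-page labels up into a per-document result is sketched below. `summarize_document` and the "Mixed" and "Empty" labels are hypothetical additions for illustration, not part of the answer; the input is assumed to be the list of strings returned by `page_type` for each page.

```python
from collections import Counter


def summarize_document(page_labels):
    """Return (document_label, per_label_page_counts) for one document.

    The document gets a page label only if every page agrees on it;
    otherwise it is reported as "Mixed" (or "Empty" for no pages).
    """
    counts = Counter(page_labels)
    if not counts:
        return "Empty", counts
    if len(counts) == 1:
        return next(iter(counts)), counts
    return "Mixed", counts
```

With the code above, this could be called as `summarize_document([page_type(p) for p in doc])`. Whether a mixed document should count as scanned, searchable, or its own category depends on the collection.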

Solution 3:

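You can do this with a bash script. It greps each PDF in the current directory for the markers '/Image/' and '/Text' and moves the file into a matching folder: scanned, text, or s-and-t (both).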

#!/bin/bash
echo "shellscript $0"
ls --color --group-directories-first
read -p "Is it OK to use this shellscript in this directory? (y/N) " ans
if [ "$ans" != "y" ]
then
  exit
fi
mkdir -p scanned
mkdir -p text
mkdir -p "s-and-t"
for file in *.pdf
do
  grep -aq '/Image/' "$file"
  if [ $? -eq 0 ]
  then
    image=true
  else
    image=false
  fi
  grep -aq '/Text' "$file"
  if [ $? -eq 0 ]
  then
    text=true
  else
    text=false
  fi
  if $image && $text
  then
    mv "$file" "s-and-t"
  elif $image
  then
    mv "$file" "scanned"
  elif $text
  then
    mv "$file" "text"
  else
    echo "$file undecided"
  fi
done
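For counting rather than moving files, the same marker test could be expressed in Python along these lines. This is only a sketch that mirrors the two grep patterns from the script; `classify_by_markers` and `classify_file` are hypothetical names. Like the script, it searches the raw bytes, so markers stored inside compressed object streams would be missed.

```python
from pathlib import Path


def classify_by_markers(data: bytes) -> str:
    """Mirror the script's grep checks on the raw bytes of a PDF."""
    has_image = b"/Image/" in data  # same pattern as grep -aq '/Image/'
    has_text = b"/Text" in data     # same pattern as grep -aq '/Text'
    if has_image and has_text:
        return "s-and-t"
    if has_image:
        return "scanned"
    if has_text:
        return "text"
    return "undecided"


def classify_file(path) -> str:
    """Read a file from disk and classify it by its raw bytes."""
    return classify_by_markers(Path(path).read_bytes())
```

The returned strings match the folder names used by the script, so tallying them (for example with `collections.Counter`) gives a per-category count without touching the files.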

Thanks
