How To Stop Pdfplumber From Reading The Header Of Every Page?

November 18, 2022

I want pdfplumber to extract the text from an arbitrary PDF supplied by the user. The problem is that pdfplumber also extracts the header text or the title from each page. How can I pro

Solution 1:

I don't think you can.

However, you can crop each page with the crop method. This way, you extract text only from the cropped part of the page, leaving out headers and footers. Of course, this method requires that you know the height of the headers and footers in advance.

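Here is the explanation of the crop coordinates. Each one is a fraction between 0 and 1 of the page width or height, because the code multiplies it by the page dimensions:

x0 = Distance of the left edge of the crop box from the left side of the page, as a fraction of the page width.
top = Distance of the top edge of the crop box from the top of the page, as a fraction of the page height.
x1 = Distance of the right edge of the crop box from the left side of the page, as a fraction of the page width.
bottom = Distance of the bottom edge of the crop box from the top of the page, as a fraction of the page height.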
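To make the scaling concrete, here is a small sketch of the conversion from fractional coordinates to absolute page units. The helper name fractions_to_bbox and the page size (612 x 792 points, a US Letter page) are assumptions chosen for illustration; they are not part of pdfplumber.

```python
def fractions_to_bbox(crop_coords, page_width, page_height):
    """Scale fractional (x0, top, x1, bottom) values to absolute page units.

    Hypothetical helper for illustration; not a pdfplumber function.
    """
    x0, top, x1, bottom = crop_coords
    # Reject boxes that are inverted or fall outside the page
    if not (0 <= x0 < x1 <= 1 and 0 <= top < bottom <= 1):
        raise ValueError("expected 0 <= x0 < x1 <= 1 and 0 <= top < bottom <= 1")
    return (
        x0 * page_width,
        top * page_height,
        x1 * page_width,
        bottom * page_height,
    )


# Keep the full width, but drop the top and bottom eighth of the page
bbox = fractions_to_bbox((0.0, 0.125, 1.0, 0.875), 612, 792)
print(bbox)  # (0.0, 99.0, 612.0, 693.0)
```

With these values, anything in the top 99 points or the bottom 99 points of the page falls outside the box.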

Here is the code:

import pdfplumber

# Example values (assumptions): set these for your own document
filename = 'document.pdf'
# Crop box as fractions of the page size: [x0, top, x1, bottom]
crop_coords = [0.0, 0.1, 1.0, 0.9]

# Get text of whole document as string
text = ''
pages = []
with pdfplumber.open(filename) as pdf:
    for page in pdf.pages:
        # Scale the fractional crop box to this page's dimensions
        my_bbox = (
            crop_coords[0] * float(page.width),
            crop_coords[1] * float(page.height),
            crop_coords[2] * float(page.width),
            crop_coords[3] * float(page.height),
        )
        page_crop = page.crop(my_bbox)
        # extract_text() may return None for a page with no text
        text += (page_crop.extract_text() or '').lower()
        pages.append(page_crop)
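If the header height is not known in advance, one possible workaround is to drop first and last lines that repeat across pages after extraction. This is only a heuristic sketch, not a pdfplumber feature: the function name strip_repeated_edge_lines and the min_share threshold are hypothetical, and the input is assumed to be a list of strings, one per page, such as the results of extract_text().

```python
from collections import Counter


def strip_repeated_edge_lines(page_texts, min_share=0.6):
    """Drop a page's first/last line when it recurs on many pages.

    Hypothetical heuristic: a line counts as a header or footer if it is
    the first (or last) line on at least min_share of the pages, and on
    at least two pages.
    """
    pages = [text.splitlines() for text in page_texts]
    first = Counter(lines[0].strip() for lines in pages if lines)
    last = Counter(lines[-1].strip() for lines in pages if lines)
    threshold = max(2, min_share * len(pages))

    cleaned = []
    for lines in pages:
        if lines and first[lines[0].strip()] >= threshold:
            lines = lines[1:]   # repeated header
        if lines and last[lines[-1].strip()] >= threshold:
            lines = lines[:-1]  # repeated footer
        cleaned.append("\n".join(lines))
    return cleaned


sample = [
    "ACME Report\nIntro text",
    "ACME Report\nMore text",
    "ACME Report\nFinal text",
]
print(strip_repeated_edge_lines(sample))
# ['Intro text', 'More text', 'Final text']
```

Footers that change on every page, such as page numbers, will not match exactly, so this approach only catches text that is identical from page to page.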
