Skip to content Skip to sidebar Skip to footer

Splitting Pdf Files Into Paragraphs

I have a question regarding the splitting of pdf files. basically I have a collection of pdf files, which files I want to split in terms of paragraph. so to each paragraph of the p

Solution 1:

You can use pdftotext for the above, wrap it in python subprocess. Alternatively you could use some other library which already do it implicitly like textract. Here is a quick example, Note: I have used 4 spaces as delimiter to convert the text to paragraph list, you might want to use different technique.

import re
import textract
#read the content of pdf as text
text = textract.process('file_name.pdf')
#use four space as paragraph delimiter to convert the text into list of paragraphs.
print re.split('\s{4,}',text)

Post a Comment for "Splitting Pdf Files Into Paragraphs"