How To Extract Quotations From Text Using NLTK
Solution 1:
As Mayur mentioned, you can use a regex to pick up everything between quotes:
import re
quotes = re.findall("\".*?\"", text)
The problem you'll run into is that a surprisingly large number of things appear between quotation marks that are not actually quotations.
If you're working with academic articles, you can look for a number after the closing quotation mark to pick up the footnote number. For non-academic articles, you could instead run something like:
"(said|writes|argues|concludes)(,)? \".*?\""
This is more precise, but it risks missing quotes such as blockquotes. (Blockquotes will cause you problems anyway, because they can include a newline before the closing quotation mark, and . does not match a newline by default.)
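The two heuristics above can be sketched as follows. This is only an illustration under stated assumptions: the verb list, the helper names attributed_quotes and footnoted_quotes, and the rule that a footnote number sits directly after the closing quotation mark are all hypothetical choices, not part of the original answer. re.DOTALL is used so that a quote containing a newline is still captured.

```python
import re

# Hypothetical verb list; extend it for your own corpus.
ATTRIBUTION_VERBS = ("said", "writes", "argues", "concludes")

# Verb, optional comma, whitespace, then a quoted span.
# re.DOTALL lets "." cross newlines, so multi-line quotes still match.
ATTRIBUTED = re.compile(
    r'\b(?:' + "|".join(ATTRIBUTION_VERBS) + r'),?\s+"(.*?)"',
    re.DOTALL,
)

# Assumption: a footnote number directly follows the closing quote, e.g. "..."12
# [^"]* keeps the match inside a single pair of quotation marks.
FOOTNOTED = re.compile(r'"([^"]*)"(\d+)')


def attributed_quotes(text):
    """Return the contents of quotes that follow an attribution verb."""
    return ATTRIBUTED.findall(text)


def footnoted_quotes(text):
    """Return (quote, footnote_number) pairs."""
    return [(quote, int(number)) for quote, number in FOOTNOTED.findall(text)]


sample = 'The "so-called" expert said, "This will\nnever work." Nobody agreed.'
print(attributed_quotes(sample))
print(footnoted_quotes('He used "air quotes" and wrote "real quote"3.'))
```

Both patterns trade recall for precision: the scare-quoted "so-called" is skipped, but so is any genuine quotation that lacks an attribution verb or a footnote number.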
As for using NLTK, I can't think of anything there that will be of much help, other than perhaps WordNet for finding synonyms for "said".
Solution 2:
This qualifies as a pattern-matching problem, i.e., the data you are looking for is always between quotation marks (""). Simply put, you can use a regex for pattern matching.
Let's take this example:
she said " DAS A SDASD sdasdasd SADSD", " SA23 DSD " ASDAS "ASDAS1 3123$ %$%"
The regex that works for your basic example is:
import re
quotes = re.findall("\".*?\"", text)
With text holding the example string, quotes gives us ['" DAS A SDASD sdasdasd SADSD"', '" SA23 DSD "', '"ASDAS1 3123$ %$%"']
Here, .*? matches any sequence of characters (except newlines), taking as few as possible, and the pattern matches the quotation marks (beginning \" and ending \") literally.
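To see why the ? after .* matters, here is a small sketch comparing the lazy pattern with its greedy counterpart on the example string. The variable names are illustrative, and raw strings are used so the quotation marks need no escaping. The third pattern adds a capture group, which is one way to get the quoted text without the surrounding quotation marks.

```python
import re

text = 'she said " DAS A SDASD sdasdasd SADSD", " SA23 DSD " ASDAS "ASDAS1 3123$ %$%"'

# Lazy: each match stops at the nearest closing quotation mark.
lazy = re.findall(r'".*?"', text)

# Greedy: a single match runs from the first quotation mark to the last one.
greedy = re.findall(r'".*"', text)

# Capture group: findall returns only the text inside the quotation marks.
inner = re.findall(r'"(.*?)"', text)

print(len(lazy), len(greedy))
print(inner)
```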
Please be aware that quotation marks nested within quotation marks break this code: you will not get the expected output.
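A short sketch of that failure, using a made-up sentence. The second half assumes the outer quotation uses typographic (curly) quotation marks, in which case the opening and closing marks are distinct characters and a straight-quoted inner quotation no longer confuses the pattern. That assumption will not hold for every text.

```python
import re

# Straight quotes nested inside straight quotes: the lazy pattern pairs
# each quotation mark with the next one, so the spans come out wrong.
nested = 'He wrote "she told me "no" yesterday" and left.'
broken = re.findall(r'".*?"', nested)
print(broken)

# Assumption: the outer quotation uses curly quotes, the inner one straight quotes.
curly = 'He wrote \u201cshe told me "no" yesterday\u201d and left.'
outer = re.findall(r'\u201c(.*?)\u201d', curly)
print(outer)
```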