Extracting Text After Specific Character Set From A Text File Using Regex In Python
Hi, I have text in the format below, from which I want to save each name (e.g. 2ND ACADEMY OF NATURAL SCIENCES) and its a.k.a. names, along with the original name, in a dictionary like:
Solution 1:
You seem to be confused about the meaning of square brackets. Perhaps review "What is the difference between square brackets and parentheses in a regex?"
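As a quick illustration of that difference, here is a minimal sketch. The test strings are made up for the example and are not from the question's data.

```python
import re

# [aka] is a character class: it matches ONE character, either 'a' or 'k'.
class_hits = re.findall(r"[aka]", "kayak")

# (aka) is a group: it matches the literal three-character sequence "aka".
group_hits = re.findall(r"(aka)", "kayak aka")

print(class_hits)  # ['k', 'a', 'a', 'k']
print(group_hits)  # ['aka']
```

So a pattern like [a.k.a.] would match a single "a", "k" or "." character, not the text "a.k.a.".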
Your requirements seem rather unclear, but something like this?
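import re

with open('textData.txt', 'r') as lines:
    text = lines.read()

# Records are separated by an empty line
for segment in text.split('\n\n'):
    # Join wrapped lines so that each record is a single string
    para = ' '.join(segment.splitlines())
    if para:
        # The name is everything up to the first ", " or " ("
        name = re.match(r'^[^,()]+(?=, | \()', para)
        if name:
            akas = [name.group(0)]
            # Each alias follows "a.k.a. " and runs up to the next ";" or ")"
            akas.extend(re.findall(r'(?<=a\.k\.a\. )([^;)]+)', para))
            print('"%s": ["%s"]' % (name.group(0), '", "'.join(akas)))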
This assumes that each record is separated from every other record by an empty line, and that the file is small enough to fit into memory.
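The snippet above prints each record. If you want an actual dictionary, as the question asks, the same logic could be wrapped in a function along these lines. This is only a sketch: extract_names and the SAMPLE text are hypothetical, made up for illustration, and the real file may be laid out differently.

```python
import re

# Hypothetical sample: two records separated by an empty line,
# the first one wrapped across two lines.
SAMPLE = (
    "EXAMPLE TRADING CO (a.k.a. EXAMPLE ONE; a.k.a. EXAMPLE\n"
    "TWO), 1 Sample Street\n"
    "\n"
    "PLAIN COMPANY LTD, 2 Other Road\n"
)

def extract_names(text):
    """Map each record's name to a list holding the name and its aliases."""
    result = {}
    for segment in text.split("\n\n"):
        para = " ".join(segment.splitlines())
        if not para:
            continue
        name = re.match(r"^[^,()]+(?=, | \()", para)
        if name:
            akas = [name.group(0)]
            akas.extend(re.findall(r"(?<=a\.k\.a\. )([^;)]+)", para))
            result[name.group(0)] = akas
    return result

names = extract_names(SAMPLE)
print(names)
```

With the sample above, the first record maps to its name plus the two aliases, and the second record maps to a list containing only its own name.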
Solution 2:
You could use 2 capture groups, and split the value of group 2 on (?:;\s)?a\.k\.a\.\s to get the separate values.
Because the pattern contains capture groups, re.findall returns the group values (one tuple per match) instead of the whole match.
^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?
The pattern matches
^ : Start of string (start of each line when re.M is used)
( : Capture group 1
  [A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b : Match uppercase chars, digits and spaces, starting and ending with an uppercase char or digit (so the name cannot end with a space)
) : Close group 1
(?: : Non capture group
  \( preceded by a space : Match a space and (
  ( : Capture group 2
    a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)* : Match repeating parts that start with a.k.a. followed by any char except for ( and )
  ) : Close group 2
  \) : Match )
)? : Close non capture group and make it optional
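To make the two steps concrete, here is a small sketch of what re.findall returns for this pattern and how the split separates the aliases. The input line is hypothetical, made up for illustration and not taken from the original file.

```python
import re

pattern = r"^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?"

# Hypothetical one-line record (an assumption about the file layout).
line = "EXAMPLE TRADING CO (a.k.a. EXAMPLE ONE; a.k.a. EXAMPLE TWO), 1 Sample Street"

# With two capture groups, findall returns a list of 2-tuples.
matches = re.findall(pattern, line)
name, aka_text = matches[0]

# Splitting group 2 leaves an empty first item, which is filtered out.
aliases = [p for p in re.split(r"(?:;\s)?a\.k\.a\.\s", aka_text) if p]

print(name)      # EXAMPLE TRADING CO
print(aka_text)  # a.k.a. EXAMPLE ONE; a.k.a. EXAMPLE TWO
print(aliases)   # ['EXAMPLE ONE', 'EXAMPLE TWO']
```

For a record without a parenthesized part, group 2 comes back as an empty string, so the alias list is simply empty.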
For example
import re
import pprint

pattern = r"^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?"

with open('textData.txt') as f:
    textData = f.read()

d = {}
# re.M makes ^ match at the start of every line
for t in re.findall(pattern, textData, re.M):
    # t[0] is the name, t[1] is the text inside the parentheses (or '')
    parts = [p for p in re.split(r"(?:;\s)?a\.k\.a\.\s", t[1]) if p]
    parts.insert(0, t[0])
    d[t[0]] = parts

pprint.pprint(d)
Output
{'2ND COMPLEX OF NEURAL SCIENCES': ['2ND COMPLEX OF NEURAL SCIENCES',
                                    'ACADEMY OF NEURAL \nSCIENCES',
                                    'CHE 2 CHAON KAHAK-WON',
                                    'CHE 2 CHAYON KAHAK-WON',
                                    'KUKPAN KAHAK-WON',
                                    'NATIONAL DEFENSE ACADEMY',
                                    'SANSRI',
                                    'SECOND COMPLEX OF NEURAL SCIENCES',
                                    'SECOND\n'
                                    'COMPLEX OF NEURAL SCIENCES RESEARCH '
                                    'INSTITUTE'],
 '7 KARNES': ['7 KARNES'],
 'LOSTIK VE HAVAIK HIZMETLARI LTD': ['LOSTIK VE HAVAIK HIZMETLARI LTD'],
 'SWING OF TIR': ['SWING OF TIR',
                  '7TH OF TIR COMPLEX',
                  '7TH OF TIR INDUSTRIAL\nCOMPLEX',
                  '7TH OF TIR INDUSTRIES',
                  '7TH OF TIR INDUSTRIES\nOF ISFAHAN/ESFAHAN',
                  'MOJTAMAE SANATE HAFTOME TIR',
                  'SANAYE HAFTOME TIR',
                  'SEVENTH OF TIR']}
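Some aliases in the output above contain a \n, presumably because they wrap across lines in the file and [^()]+ also matches newlines. If you would rather have single-line names, one option is to collapse whitespace afterwards. A minimal sketch, where normalize is a hypothetical helper and d holds a few values copied from the output:

```python
def normalize(name):
    # str.split() with no arguments splits on any run of whitespace,
    # including newlines, so joining with ' ' collapses it to one space.
    return " ".join(name.split())

# A few entries copied from the output above.
d = {
    "SWING OF TIR": [
        "SWING OF TIR",
        "7TH OF TIR INDUSTRIAL\nCOMPLEX",
        "7TH OF TIR INDUSTRIES\nOF ISFAHAN/ESFAHAN",
    ],
    "2ND COMPLEX OF NEURAL SCIENCES": [
        "2ND COMPLEX OF NEURAL SCIENCES",
        "ACADEMY OF NEURAL \nSCIENCES",
    ],
}

cleaned = {key: [normalize(v) for v in values] for key, values in d.items()}
print(cleaned)
```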