Extracting Text After Specific Character Set From A Text File Using Regex In Python

Hi, I have text in the format below, from which I want to save each name (e.g. 2ND ACADEMY OF NATURAL SCIENCES) and its a.k.a. names, along with the original name, in a dictionary like

Solution 1:

You seem to be confused about the meaning of square brackets. Perhaps review "What is the difference between square brackets and parentheses in a regex?"

Your requirements seem rather unclear, but something like this?

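import re

# Read the whole file; records are separated by blank lines.
with open('textData.txt', 'r') as lines:
    text = lines.read()

for segment in text.split('\n\n'):
    # Join a record's wrapped lines into a single line.
    para = ' '.join(segment.splitlines())
    if para:
        # The name is everything before the first ", " or " (".
        name = re.match(r'^[^,()]+(?=, | \()', para)
        if name:
            akas = [name.group(0)]
            # Each alias runs from "a.k.a. " up to the next ";" or ")".
            akas.extend(re.findall(r'(?<=a\.k\.a\. )([^;)]+)', para))
            print('"%s": ["%s"]' % (name.group(0), '", "'.join(akas)))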

This assumes that each record is separated from every other record by an empty line, and that the file is small enough to fit into memory.
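
If the file is too large to read in one go, the same idea can be applied one record at a time. The following is only a sketch under a few assumptions: iter_records and parse_record are hypothetical helper names, the sample text (including the address parts) is invented to resemble the format described in the question, and each line is stripped before joining, which the code above does not do.

```python
import io
import re
from itertools import groupby


def iter_records(lines):
    """Yield one joined record per blank-line-separated block of lines."""
    # groupby splits the stream into alternating runs of blank and
    # non-blank lines, so only one record is held in memory at a time.
    for is_blank, block in groupby(lines, key=lambda line: not line.strip()):
        if not is_blank:
            yield ' '.join(line.strip() for line in block)


def parse_record(para):
    """Return (name, [name, alias, ...]), or None if no name is found."""
    name = re.match(r'^[^,()]+(?=, | \()', para)
    if not name:
        return None
    akas = [name.group(0)]
    akas.extend(re.findall(r'(?<=a\.k\.a\. )([^;)]+)', para))
    return name.group(0), akas


# Hypothetical sample standing in for the file. With a real file, pass the
# open file object to iter_records instead of an io.StringIO.
sample = io.StringIO(
    "7TH OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. SEVENTH\n"
    "OF TIR), Mobarakeh Road Km 45, Isfahan\n"
    "\n"
    "7 KARNES, Avenida Colombia\n"
)

result = {}
for para in iter_records(sample):
    parsed = parse_record(para)
    if parsed:
        key, akas = parsed
        result[key] = akas

print(result)
```

Iterating over an open file yields lines lazily, so this avoids the memory assumption but keeps the blank-line assumption.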


Solution 2:

You could use 2 capture groups, and split the value of group 2 on (?:;\s)?a\.k\.a\.\s to get the separate values.

Because the pattern contains capture groups, re.findall returns the capture group values as one tuple per match.

^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?

The pattern matches

  • ^ Start of string
  • ( Capture group 1
    • [A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b Match uppercase chars, digits and spaces, starting and ending with an uppercase char or digit (not a space), followed by a word boundary
  • ) Close group 1
  • (?: Non capture group
    •  \( Match a space followed by (
    • ( Capture group 2
      • a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)* Match a.k.a. followed by one or more chars other than ( and ), optionally repeated for further a.k.a. parts
    • ) Close group 2
    • \) Match )
  • )? Close non capture group and make it optional
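
To make the two groups concrete, here is a minimal sketch of what re.findall returns for this pattern and how group 2 is then split. The two input lines are hypothetical, made up to resemble the format in the question.

```python
import re

pattern = r"^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?"

# Hypothetical input lines, not taken from the original file.
sample = (
    "SWING OF TIR (a.k.a. 7TH OF TIR COMPLEX; a.k.a. SEVENTH OF TIR), Isfahan\n"
    "7 KARNES, Bogota"
)

# One (group 1, group 2) tuple per match; group 2 is '' when the
# optional parenthesised part is absent.
matches = re.findall(pattern, sample, re.M)
print(matches)

# Split group 2 of the first match on the a.k.a. separators and drop the
# empty string that re.split produces before the leading "a.k.a. ".
aliases = [p for p in re.split(r"(?:;\s)?a\.k\.a\.\s", matches[0][1]) if p]
print(aliases)
```

The second line has no parenthesised part, so its tuple carries an empty string for group 2 and splitting it yields no aliases.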


For example

import re
import pprint

pattern = r"^([A-Z0-9](?:[A-Z0-9 ]*[A-Z0-9])?\b)(?: \((a\.k\.a\.[^()]+(?:\sa\.k\.a\.[^()]+)*)\))?"

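with open('textData.txt') as f:
    textData = f.read()

d = {}
# re.M lets ^ match at the start of every line, not only the start of the file.
for t in re.findall(pattern, textData, re.M):
    # t[0] is the name, t[1] the raw "a.k.a. ...; a.k.a. ..." text (or '').
    parts = [p for p in re.split(r"(?:;\s)?a\.k\.a\.\s", t[1]) if p]
    parts.insert(0, t[0])
    d[t[0]] = parts

pprint.pprint(d)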

Output

{'2ND COMPLEX OF NEURAL SCIENCES': ['2ND COMPLEX OF NEURAL SCIENCES',
                                    'ACADEMY OF NEURAL \nSCIENCES',
                                    'CHE 2 CHAON KAHAK-WON',
                                    'CHE 2 CHAYON KAHAK-WON',
                                    'KUKPAN KAHAK-WON',
                                    'NATIONAL DEFENSE ACADEMY',
                                    'SANSRI',
                                    'SECOND COMPLEX OF NEURAL SCIENCES',
                                    'SECOND\n'
                                    'COMPLEX OF NEURAL SCIENCES RESEARCH '
                                    'INSTITUTE'],
 '7 KARNES': ['7 KARNES'],
 'LOSTIK VE HAVAIK HIZMETLARI LTD': ['LOSTIK VE HAVAIK HIZMETLARI LTD'],
 'SWING OF TIR': ['SWING OF TIR',
                  '7TH OF TIR COMPLEX',
                  '7TH OF TIR INDUSTRIAL\nCOMPLEX',
                  '7TH OF TIR INDUSTRIES',
                  '7TH OF TIR INDUSTRIES\nOF ISFAHAN/ESFAHAN',
                  'MOJTAMAE SANATE HAFTOME TIR',
                  'SANAYE HAFTOME TIR',
                  'SEVENTH OF TIR']}
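
Note that some aliases in the output still contain the line breaks from the source file (for example '7TH OF TIR INDUSTRIAL\nCOMPLEX'). If that is unwanted, one possible post-processing step is to collapse whitespace in every value. This is a sketch only: normalize is a hypothetical helper name, and d here holds just two entries copied from the output above.

```python
def normalize(name):
    """Collapse any run of whitespace, including newlines, into one space."""
    return ' '.join(name.split())


# Two entries copied from the output above, used here as sample data.
d = {
    'SWING OF TIR': ['SWING OF TIR', '7TH OF TIR INDUSTRIAL\nCOMPLEX'],
    '2ND COMPLEX OF NEURAL SCIENCES': ['2ND COMPLEX OF NEURAL SCIENCES',
                                       'ACADEMY OF NEURAL \nSCIENCES'],
}

cleaned = {key: [normalize(alias) for alias in aliases]
           for key, aliases in d.items()}
print(cleaned)
```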
