
Replace Repeating Delimiters In A Text File With An Alternate Character

I am attempting to process a large pipe '|' delimited, double quote qualified text file (>700,000 records, >3,000 characters per record, and 28 fields per record). using a p

Solution 1:

import re


def count_pipes_in_regex_match(m):
    # The first capture group contains only the run of pipe characters.
    matched_pipes = m.group(1)
    # Emit one tab per pipe so that empty fields are preserved.
    return '\t' * len(matched_pipes)


# test string
s = '"abc"|"2017-01-01"|"height: 5\' 7" (~180 cm) | weight: 80kg | in good health"|"2016-01-10"||||"EOR"'

# replace leading or trailing quotes
s = re.sub(r'^"|"$', '', s)

# replace quote pipe(s) quote
# or      quote pipe(s) end-of-string
# with as many tabs as there were pipes
s = re.sub(r'"(\|+)("|$)', count_pipes_in_regex_match, s)

print(repr(s))  # repr to show the tabs

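For a file with several hundred thousand records, the same two substitutions can be applied one line at a time instead of loading everything into memory. The following is only a sketch of that idea: the names pipes_to_tabs and convert_lines are made up for illustration, and it assumes one record per line with no line breaks inside a field. It also shares a limitation of the approach above: a run of pipes is only recognised when it follows a closing quote, so an empty field at the very start of a record is not converted.

```python
import re

# Precompiled versions of the two patterns used in the answer above.
_OUTER_QUOTES = re.compile(r'^"|"$')
_PIPE_RUN = re.compile(r'"(\|+)("|$)')


def pipes_to_tabs(record):
    """Convert one quote-qualified, pipe-delimited record to tab-delimited."""
    record = _OUTER_QUOTES.sub('', record)
    # One tab per pipe in the run, so empty fields keep their position.
    return _PIPE_RUN.sub(lambda m: '\t' * len(m.group(1)), record)


def convert_lines(lines):
    """Lazily convert an iterable of records, e.g. an open file object."""
    for line in lines:
        yield pipes_to_tabs(line.rstrip('\r\n'))


sample = '"abc"|"2017-01-01"|"height: 5\' 7" (~180 cm) | weight: 80kg"|"2016-01-10"||||"EOR"'
fields = pipes_to_tabs(sample).split('\t')
print(len(fields))   # 8 fields, three of them empty
print(fields[2])     # the embedded pipes and quote are left alone
```

Passing an open file object to convert_lines and writing each result followed by a newline keeps only one record in memory at a time.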

Solution 2:

Since the delimiter you are really looking for is "|" (quote, pipe, quote), isn't the answer to replace every || with |""| first, so that the empty fields are quoted as well?

How about:

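import re

# data holds the text to convert.
# Repeat until nothing changes: a single pass over |||| still leaves a || behind.
while True:
    new_data = re.sub(r'\|\|', '|""|', data)
    if data == new_data:
        break
    data = new_data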

After this you could then replace "|" with tabs and remove the single quote left at each end of the record.
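Put together, the whole idea might look like the sketch below. The function names are hypothetical, and it assumes every record starts and ends with a quoted field; a record with a leading or trailing empty field would need extra handling. Note also that a || occurring inside a quoted value would be rewritten too.

```python
import re


def fill_empty_fields(record):
    """Insert "" between adjacent pipes until no || remains."""
    while True:
        new_record = re.sub(r'\|\|', '|""|', record)
        if new_record == record:
            return record
        record = new_record


def quoted_pipes_to_tabs(record):
    """Assumes the record starts and ends with a quoted field."""
    record = fill_empty_fields(record)
    # Every field boundary now looks like "|" and can be swapped for a tab.
    record = record.replace('"|"', '\t')
    # Drop the one quote left at each end of the record.
    if record.startswith('"'):
        record = record[1:]
    if record.endswith('"'):
        record = record[:-1]
    return record


sample = '"abc"|"x | y"|"2016-01-10"||||"EOR"'
print(repr(fill_empty_fields(sample)))
print(repr(quoted_pipes_to_tabs(sample)))
```

If the loop is a concern on a large file, a single substitution with a lookahead, re.sub(r'\|(?=\|)', '|""', record), should give the same result in one pass, because each pipe is examined without consuming the pipe after it.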

Solution 3:

You could do it in 3 passes.

  1. Replace all || with |""|
  2. Split on "|" (and | on ends)
  3. Remove the quotes from each field.

As follows:

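import re

for line in file:  # file: an open file object for the input
    line = line.rstrip('\n')  # drop the newline so the last quote can be stripped

    # Pass 1: give every empty field a pair of quotes
    while '||' in line:
        line = line.replace('||', '|""|')

    # Pass 2: split on "|" and on a pipe at either end of the line
    fields = re.split(r'^\||\|$|"\|"', line)

    # Pass 3: remove the remaining quotes from each field
    new_line = '\t'.join(field.strip('"') for field in fields)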
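To try the three passes end to end, they can be wrapped in a function and run against in-memory text before pointing them at the real file. This is a sketch under assumptions: split_record and convert are illustrative names, io.StringIO stands in for the input and output files, and each record is assumed to sit on one line. One thing to watch: str.strip('"') removes every leading and trailing quote, so a value that itself begins or ends with a quote character will lose it.

```python
import io
import re

# Same split pattern as in the answer above, written as a raw string.
FIELD_SEP = re.compile(r'^\||\|$|"\|"')


def split_record(line):
    """Return the fields of one record using the three passes."""
    line = line.rstrip('\r\n')
    while '||' in line:                      # pass 1: quote the empty fields
        line = line.replace('||', '|""|')
    fields = FIELD_SEP.split(line)           # pass 2: split on "|" and edge pipes
    return [f.strip('"') for f in fields]    # pass 3: remove leftover quotes


def convert(src, dst):
    """Write each record from src to dst as a tab-delimited line; return the count."""
    count = 0
    for line in src:
        dst.write('\t'.join(split_record(line)) + '\n')
        count += 1
    return count


# StringIO objects stand in for the real input and output files.
src = io.StringIO('"abc"|"x | y"||"EOR"\n"1"|"2"|"3"|"4"\n')
dst = io.StringIO()
print(convert(src, dst))
print(repr(dst.getvalue()))
```

Swapping the StringIO objects for files opened with open() leaves the rest unchanged, and the record count returned by convert gives a quick check against the expected number of rows.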
