Replace Repeating Delimiters In A Text File With An Alternate Character
I am attempting to process a large pipe '|' delimited, double quote qualified text file (>700,000 records, >3,000 characters per record, and 28 fields per record). using a p
Solution 1:
import re

def count_pipes_in_regex_match(m):
    # regex capture group should only contain pipe chars
    matched_pipes = m.groups()[0]
    return '\t' * len(matched_pipes)

# test string
s = '"abc"|"2017-01-01"|"height: 5\' 7" (~180 cm) | weight: 80kg | in good health"|"2016-01-10"||||"EOR"'
# replace leading or trailing quotes
s = re.sub(r'^"|"$', '', s)
# replace quote pipe(s) quote
# or quote pipe(s) end-of-string
# with as many tabs as there were pipes
s = re.sub(r'"(\|+)("|$)', count_pipes_in_regex_match, s)
print(repr(s))  # repr to show the tabs
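With over 700,000 records it may be more practical to stream the file than to hold it in one string. The following is only a sketch of how the substitution above could be applied line by line; the names `pipes_to_tabs` and `convert` are hypothetical, and `io.StringIO` stands in for real file objects so the example is self-contained.

```python
import io
import re

# Hypothetical helper names; they are not part of the original answer.
FIELD_BREAK = re.compile(r'"(\|+)("|$)')

def pipes_to_tabs(line):
    """Convert one quote-qualified, pipe-delimited record to tab-delimited."""
    line = line.rstrip('\r\n')
    # drop the opening and closing qualifier of the record
    line = re.sub(r'^"|"$', '', line)
    # each run of pipes after a closing quote becomes the same number of tabs
    return FIELD_BREAK.sub(lambda m: '\t' * len(m.group(1)), line)

def convert(src, dst):
    """Stream records from src to dst one line at a time."""
    for line in src:
        dst.write(pipes_to_tabs(line) + '\n')

# In-memory stand-ins for the input and output files.
src = io.StringIO('"abc"|"2017-01-01"|"a | b"|"2016-01-10"||||"EOR"\n'
                  '"x"||"y"\n')
dst = io.StringIO()
convert(src, dst)
result = dst.getvalue()
print(repr(result))
```

For real data, `src` and `dst` would be files opened with `open(...)`. The pipes inside `"a | b"` survive because they are not immediately preceded by a closing quote.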
Solution 2:
Since you are looking for "|", isn't the answer to replace each || with |""|?
How about:
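import re

# data holds the text to convert; loop until nothing changes, because
# re.sub only replaces non-overlapping matches on each pass
while True:
    new_data = re.sub(r'\|\|', '|""|', data)
    if data == new_data:
        break
    data = new_data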
After this you could then replace "|" with tabs.
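Putting both steps together might look like the sketch below. The function names are hypothetical, and it assumes every record starts and ends with a quoted field (as in the question's sample, which ends in "EOR"); a record that begins or ends with an empty field would need extra handling.

```python
import re

def fill_empty_fields(data):
    """Turn every bare || into |""| so each empty field has a qualifier."""
    while True:
        new_data = re.sub(r'\|\|', '|""|', data)
        if new_data == data:
            return data
        data = new_data

def record_to_tabs(record):
    """Hypothetical helper: assumes the first and last fields are quoted."""
    record = fill_empty_fields(record)
    # every "|" sequence is now a field boundary
    record = record.replace('"|"', '\t')
    # remove the qualifiers left at the start and end of the record
    if record.startswith('"'):
        record = record[1:]
    if record.endswith('"'):
        record = record[:-1]
    return record

converted = record_to_tabs('"a"||||"b | c"|"d"')
print(repr(converted))  # 'a\t\t\t\tb | c\td'
```

Note that this approach also rewrites any || that happens to appear inside a field's text, so it is only safe if the data never contains doubled pipes within a quoted value.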
Solution 3:
You could do it in 3 passes:
- Replace all || with |""|
- Split on "|" (and | on ends)
- Remove the quotes from each field.
As follows:
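import re

# `file` is an open file object for the input
for line in file:
    line = line.rstrip('\r\n')  # drop the newline so the final quote can be stripped
    # pass 1: give every empty field an explicit "" qualifier
    while '||' in line:
        line = line.replace('||', '|""|')
    # pass 2: split on "|" between fields, or on a bare | at either end
    fields = re.split(r'^\||\|$|"\|"', line)
    # pass 3: strip the qualifying quotes and rejoin with tabs
    new_line = '\t'.join([field.strip('"') for field in fields])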
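To see what the "| on ends" part of the split buys you, here is a small sketch that wraps the three passes in a function (the name `convert_line` is hypothetical) and runs it on records with empty fields in the middle, at the start, and at the end.

```python
import re

# Split on "|" between fields, or on a bare | at either end of the record.
FIELD_BREAK = re.compile(r'^\||\|$|"\|"')

def convert_line(line):
    """Hypothetical wrapper around the three passes described above."""
    line = line.rstrip('\r\n')
    # pass 1: make every empty field explicit
    while '||' in line:
        line = line.replace('||', '|""|')
    # pass 2: split into fields
    fields = FIELD_BREAK.split(line)
    # pass 3: remove the qualifiers and rejoin with tabs
    return '\t'.join(field.strip('"') for field in fields)

middle = convert_line('"a"||"b | c"\n')   # empty field in the middle
leading = convert_line('||"a"\n')         # two empty fields at the start
trailing = convert_line('"a"||\n')        # two empty fields at the end
print(repr(middle), repr(leading), repr(trailing))
```

One caveat: `strip('"')` removes every quote at the ends of a field, so a value that legitimately ends in a double quote would lose it.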