Python Replace Multiple Strings While Supporting Backreferences
Solution 1:
Update: See Angus Hollands' answer for a better alternative.
I couldn't think of an easier way to do it than to stick with the original idea of combining all dict keys into one massive regex.
However, there are some difficulties. Let's assume a repldict
like this:
repldict = {r'(a)': r'\1a', r'(b)': r'\1b'}
If we combine these to a single regex, we get (a)|(b)
- so now (b)
is no longer group 1, which means its backreference won't work correctly.
Another problem is that we can't tell which replacement to use. If the regex matches the text b
, how can we find out that \1b
is the appropriate replacement? It's not possible; we don't have enough information.
The solution to these problems is to enclose every dict key in a named group like so:
(?P<group1>(a))|(?P<group2>(b))
Now we can easily identify the key that matched, and recalculate the backreferences to make them relative to this group. so that \1b
refers to "the first group after group2".
Here's the implementation:
defreplaceAll(repldict, text):
# split the dict into two lists because we need the order to be reliable
keys, repls = zip(*repldict.items())
# generate a regex pattern from the keys, putting each key in a named group# so that we can find out which one of them matched.# groups are named "_<idx>" where <idx> is the index of the corresponding# replacement text in the list above
pattern = '|'.join('(?P<_{}>{})'.format(i, k) for i, k inenumerate(keys))
defrepl(match):
# find out which key matched. We know that exactly one of the keys has# matched, so it's the only named group with a value other than None.
group_name = next(name for name, value in match.groupdict().items()
if value isnotNone)
group_index = int(group_name[1:])
# now that we know which group matched, we can retrieve the# corresponding replacement text
repl_text = repls[group_index]
# now we'll manually search for backreferences in the# replacement text and substitute themdefrepl_backreference(m):
reference_index = int(m.group(1))
# return the corresponding group's value from the original match# +1 because regex starts counting at 1return match.group(group_index + reference_index + 1)
return re.sub(r'\\(\d+)', repl_backreference, repl_text)
return re.sub(pattern, repl, text)
Tests:
repldict = {'&&':'and', r'\|\|':'or', r'!([a-zA-Z_])':r'not \1'}
print( replaceAll(repldict, "!newData.exists() || newData.val().length == 1") )
repldict = {'!([a-zA-Z_])':r'not \1', '&&':'and', r'\|\|':'or', r'\=\=\=':'=='}
print( replaceAll(repldict, "(!this && !that) || !this && foo === bar") )
# output: not newData.exists() or newData.val().length == 1# (not this and not that) or not this and foo == bar
Caveats:
- Only numerical backreferences are supported; no named references.
- Silently accepts invalid backreferences like
{r'(a)': r'\2'}
. (These will sometimes throw an error, but not always.)
Solution 2:
Similar solution to Rawing, only precomputing the expensive stuff ahead of time by modifying the group indices in backreferences. Also, using unnamed groups.
Here we silently wrap each case in a capture group, and then update any replacements with backreferences to correctly identify the appropriate subgroup by absolute position. Note, that when using a replacer function, backreferences do not work by default (you need to call match.expand
).
import re
from collections import OrderedDict
from functools import partial
pattern_to_replacement = {'&&': 'and', '!([a-zA-Z_]+)': r'not \1'}
defbuild_replacer(cases):
ordered_cases = OrderedDict(cases.items())
replacements = {}
leading_groups = 0for pattern, replacement in ordered_cases.items():
leading_groups += 1# leading_groups is now the absolute position of the root group (back-references should be relative to this)
group_index = leading_groups
replacement = absolute_backreference(replacement, group_index)
replacements[group_index] = replacement
# This pattern contains N subgroups (determine by compiling pattern)
subgroups = re.compile(pattern).groups
leading_groups += subgroups
catch_all = "|".join("({})".format(p) for p in ordered_cases)
pattern = re.compile(catch_all)
defreplacer(match):
replacement_pattern = replacements[match.lastindex]
return match.expand(replacement_pattern)
return partial(pattern.sub, replacer)
defabsolute_backreference(text, n):
ref_pat = re.compile(r"\\([0-99])")
defreplacer(match):
return"\\{}".format(int(match.group(1)) + n)
return ref_pat.sub(replacer, text)
replacer = build_replacer(pattern_to_replacement)
print(replacer("!this.exists()"))
Solution 3:
Simple is better than complex, code as below is more readable(The reason why you code not work as expected is that ([a-zA-Z_]) should not be in re.escape):
repdict = {
r'\s*' + re.escape('&&')) + r'\s*': ' and ',
r'\s*' + re.escape('||') + r'\s*': ' or ',
re.escape('!') + r'([a-zA-Z_])': r'not \1',
}
defreplaceAll(repdict, text):
for k, v in repdict.items():
text = re.sub(k, v, text)
return text
Post a Comment for "Python Replace Multiple Strings While Supporting Backreferences"