Skip to content Skip to sidebar Skip to footer

Python Replace Multiple Strings While Supporting Backreferences

There are some nice ways to handle simultaneous multi-string replacement in python. However, I am having trouble creating an efficient function that can do that while also supporti

Solution 1:

Update: See Angus Hollands' answer for a better alternative.


I couldn't think of an easier way to do it than to stick with the original idea of combining all dict keys into one massive regex.

However, there are some difficulties. Let's assume a repldict like this:

repldict = {r'(a)': r'\1a', r'(b)': r'\1b'}

If we combine these to a single regex, we get (a)|(b) - so now (b) is no longer group 1, which means its backreference won't work correctly.

Another problem is that we can't tell which replacement to use. If the regex matches the text b, how can we find out that \1b is the appropriate replacement? It's not possible; we don't have enough information.

The solution to these problems is to enclose every dict key in a named group like so:

(?P<group1>(a))|(?P<group2>(b))

Now we can easily identify the key that matched, and recalculate the backreferences to make them relative to this group. so that \1b refers to "the first group after group2".


Here's the implementation:

defreplaceAll(repldict, text):
    # split the dict into two lists because we need the order to be reliable
    keys, repls = zip(*repldict.items())

    # generate a regex pattern from the keys, putting each key in a named group# so that we can find out which one of them matched.# groups are named "_<idx>" where <idx> is the index of the corresponding# replacement text in the list above
    pattern = '|'.join('(?P<_{}>{})'.format(i, k) for i, k inenumerate(keys))

    defrepl(match):
        # find out which key matched. We know that exactly one of the keys has# matched, so it's the only named group with a value other than None.
        group_name = next(name for name, value in match.groupdict().items()
                          if value isnotNone)
        group_index = int(group_name[1:])

        # now that we know which group matched, we can retrieve the# corresponding replacement text
        repl_text = repls[group_index]

        # now we'll manually search for backreferences in the# replacement text and substitute themdefrepl_backreference(m):
            reference_index = int(m.group(1))

            # return the corresponding group's value from the original match# +1 because regex starts counting at 1return match.group(group_index + reference_index + 1)  

        return re.sub(r'\\(\d+)', repl_backreference, repl_text)

    return re.sub(pattern, repl, text)

Tests:

repldict = {'&&':'and', r'\|\|':'or', r'!([a-zA-Z_])':r'not \1'}
print( replaceAll(repldict, "!newData.exists() || newData.val().length == 1") )

repldict = {'!([a-zA-Z_])':r'not \1', '&&':'and', r'\|\|':'or', r'\=\=\=':'=='}
print( replaceAll(repldict, "(!this && !that) || !this && foo === bar") )

# output: not newData.exists() or newData.val().length == 1#         (not this and not that) or not this and foo == bar

Caveats:

  • Only numerical backreferences are supported; no named references.
  • Silently accepts invalid backreferences like {r'(a)': r'\2'}. (These will sometimes throw an error, but not always.)

Solution 2:

Similar solution to Rawing, only precomputing the expensive stuff ahead of time by modifying the group indices in backreferences. Also, using unnamed groups.

Here we silently wrap each case in a capture group, and then update any replacements with backreferences to correctly identify the appropriate subgroup by absolute position. Note, that when using a replacer function, backreferences do not work by default (you need to call match.expand).

import re
from collections import OrderedDict
from functools import partial

pattern_to_replacement = {'&&': 'and', '!([a-zA-Z_]+)': r'not \1'}


defbuild_replacer(cases):
    ordered_cases = OrderedDict(cases.items())
    replacements = {}

    leading_groups = 0for pattern, replacement in ordered_cases.items():
        leading_groups += 1# leading_groups is now the absolute position of the root group (back-references should be relative to this)
        group_index = leading_groups
        replacement = absolute_backreference(replacement, group_index)
        replacements[group_index] = replacement

        # This pattern contains N subgroups (determine by compiling pattern)
        subgroups = re.compile(pattern).groups
        leading_groups += subgroups

    catch_all = "|".join("({})".format(p) for p in ordered_cases)
    pattern = re.compile(catch_all)

    defreplacer(match):
        replacement_pattern = replacements[match.lastindex]
        return match.expand(replacement_pattern)

    return partial(pattern.sub, replacer)


defabsolute_backreference(text, n):
    ref_pat = re.compile(r"\\([0-99])")

    defreplacer(match):
        return"\\{}".format(int(match.group(1)) + n)

    return ref_pat.sub(replacer, text)


replacer = build_replacer(pattern_to_replacement)
print(replacer("!this.exists()"))

Solution 3:

Simple is better than complex, code as below is more readable(The reason why you code not work as expected is that ([a-zA-Z_]) should not be in re.escape):

repdict = {
    r'\s*' + re.escape('&&')) + r'\s*': ' and ',
    r'\s*' + re.escape('||') + r'\s*': ' or ',
    re.escape('!') + r'([a-zA-Z_])': r'not \1',
}
defreplaceAll(repdict, text):
    for k, v in repdict.items():
        text = re.sub(k, v, text)
    return text

Post a Comment for "Python Replace Multiple Strings While Supporting Backreferences"