Regexp: Remove Last Period In String That Can Contain Other Periods (dig Output)

I am trying to parse the output of the Linux dig command and do several things in one shot with regular expressions. Let's say I dig the host mail.yahoo.com: /usr/bin/dig +nocommen

Solution 1:

You can simply require that your group does not end with a period (and that it contains no whitespace):

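import re
npg = r'([^\.\s]+(?:\.[^\.\s]+)*)'  # not_period_ending_group
# ".*$" rather than ".+$", so a value with no trailing period (an IP address) is kept whole
regex = re.compile(r"^" + npg + r".+IN\s+([A-Z]+)\s+" + npg + r".*$", re.MULTILINE)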

Solution 2:

"But calling .findall with that regex does return the final period in the host, because \S+ will match the last period as well…"

There are two problems here.

First, once you're escaping things with backslashes, you need to use raw string literals (r"…"), or you have to escape the backslashes too. I'm not actually sure whether any of your backslash-prefixed characters happen to match Python backslash-escape sequences, but that in itself is enough reason to use a raw-string literal, so your readers don't have to look up the exact rules.
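As a small illustration of why the raw-string habit matters: \s, \S and \. are not Python escape sequences, so they happen to pass through unchanged (recent Python versions warn about them), but \b is one. The text and variable names below are made up for the demonstration.

```python
import re

# "\b" in a normal string literal is the backspace character (one char),
# while r"\b" is a backslash followed by "b", which re reads as a word boundary.
assert len("\b") == 1
assert len(r"\b") == 2

text = "IN CNAME"
plain_hits = re.findall("\bIN\b", text)   # looks for backspace + "IN" + backspace
raw_hits = re.findall(r"\bIN\b", text)    # looks for the whole word "IN"

print(plain_hits)  # []
print(raw_hits)    # ['IN']
```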

Second, the general case of this problem is that regex repeats are greedy by default: they will match as much as possible while still allowing the rest of the pattern to match; when you want them to match as little as possible while still allowing the rest of the pattern to match, you need to add a ? after the + or *.
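Here is a minimal sketch of the difference, using a made-up host name with a trailing period. The only change between the two patterns is the ? after the +.

```python
import re

host = "login.example.com."  # hypothetical value with a trailing period

# Greedy: \S+ grabs the whole token, and \.* is happy to match zero periods.
greedy = re.match(r"(\S+)\.*$", host).group(1)

# Lazy: \S+? stops as early as the rest of the pattern allows,
# so the trailing period is left for \.* to consume.
lazy = re.match(r"(\S+?)\.*$", host).group(1)

print(greedy)  # login.example.com.
print(lazy)    # login.example.com
```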

In your particular case, the \S+ can match everything up to and including the final ., and the \.*\s* will successfully match 0 .s and 0 spaces. But \S+? will leave the final . for the next part of the pattern. You can also force the period out of the first group by appending a period after it, like so:

^(\S+)\..+IN\s+([A-Z]+)\s+(\S+?)\.*\s*$
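To try the pattern above from Python, something like the following should work. The two sample lines are made up (example.com names and a documentation IP address), not real dig output, and the variable names are only for illustration.

```python
import re

# The pattern from this answer, as a raw string, applied line by line.
pattern = re.compile(r"^(\S+)\..+IN\s+([A-Z]+)\s+(\S+?)\.*\s*$", re.MULTILINE)

# Hypothetical dig-style answer lines: name, TTL, class, type, value.
sample = (
    "mail.example.com.\t0\tIN\tCNAME\tlogin.example.com.\n"
    "login.example.com.\t60\tIN\tA\t192.0.2.10\n"
)

# findall returns one (host, record type, value) tuple per matching line.
records = pattern.findall(sample)
for host, rtype, value in records:
    print(host, rtype, value)
```

Both the host and the CNAME target come back without their trailing period, and the IP address is left untouched.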


Solution 3:

You can use this pattern with multiline modifier:

^([^ ]+)(?<!\.)\.?[ ]+[0-9]+[ ]+IN[ ]+([^ ]+)[ ]+(.+(?<!\.))\.?$

The groups are stored in $1, $2 and $3.


Edit: If the fields can be separated by tabs as well as spaces, try this:

^([^ \t]+)(?<!\.)\.?[ \t]+[0-9]+[ \t]+IN[ \t]+([^ \t]+)[ \t]+(.+(?<!\.))\.?$
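In Python, the "multiline modifier" is the re.MULTILINE flag, and the groups come back as tuples from findall (or via match.group(n)) rather than as $1, $2 and $3. A rough sketch, assuming tab-separated lines shaped like the made-up sample below:

```python
import re

# The tab-aware pattern from this answer. (?<!\.) is a negative lookbehind:
# it refuses to end a group right after a period.
pattern = re.compile(
    r"^([^ \t]+)(?<!\.)\.?[ \t]+[0-9]+[ \t]+IN[ \t]+([^ \t]+)[ \t]+(.+(?<!\.))\.?$",
    re.MULTILINE,
)

# Hypothetical dig-style answer lines, not real output.
sample = (
    "mail.example.com.\t0\tIN\tCNAME\tlogin.example.com.\n"
    "login.example.com.\t60\tIN\tA\t192.0.2.10\n"
)

records = pattern.findall(sample)
for host, rtype, value in records:
    print(host, rtype, value)
```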

Solution 4:

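As an alternative, I suggest using str.split(). If you have your lines in a list L, you need this:

[(f[0].rstrip('.'), f[3], f[4].rstrip('.')) for f in (line.split() for line in L)]

Note that rstrip('.') removes the trailing . from the host address. Unlike [:-1], it leaves a value with no trailing period (such as an IP address) intact.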
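Spelled out as a loop, the split-based approach could look like this. The sample lines and names are made up, and the sketch assumes each answer line has the five whitespace-separated fields name, TTL, class, type and value. It uses rstrip(".") so that a value without a trailing period keeps its last character.

```python
# Hypothetical dig-style answer lines, not real output.
sample_lines = [
    "mail.example.com.\t0\tIN\tCNAME\tlogin.example.com.",
    "login.example.com.\t60\tIN\tA\t192.0.2.10",
    "",
]

records = []
for line in sample_lines:
    fields = line.split()      # split on any run of spaces or tabs
    if len(fields) < 5:
        continue               # skip blank or malformed lines
    name, _ttl, _cls, rtype, value = fields[:5]
    # Strip only a trailing period, if there is one.
    records.append((name.rstrip("."), rtype, value.rstrip(".")))

for record in records:
    print(record)
```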
