
Remove All Symbols While Preserving String Consistency

My goal is to remove all symbols from a string while preserving the Unicode letters (alphabetic characters from any language). Suppose I have the following string: carbon cop

Solution 1:

I think I found a good solution (>99% robust, I believe) to the problem:

Well here's our new, horrific string:

s = u'carbon҂ ҉ copolymers—⿴٬ٯ٪III❏£\n12-ः׶ Ǣ ܊ܔ ۩۝۞ء܅۵Géotechnique▣ऀ\n'

And here's the resulting string (shown before runs of spaces are collapsed):

u'carbon    copolymers   \u066f III  \n      \u01e2  \u0714    \u0621  G\xe9otechnique  \n'

All the remaining characters/words are in fact alphabetic characters, in different languages. Done with almost no effort!
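One way to check that claim is to look up the Unicode general category of every character that survived. The snippet below is only a sketch in Python 3, using the standard `unicodedata` module; the variable names `result` and `survivors` are made up for illustration, and the string is the result quoted above, written with escapes.

```python
import unicodedata

# The result string quoted above, written with escapes so it is unambiguous.
result = 'carbon    copolymers   \u066f III  \n      \u01e2  \u0714    \u0621  G\xe9otechnique  \n'

# Map each surviving non-whitespace character to its Unicode general category.
survivors = {c: unicodedata.category(c) for c in result if not c.isspace()}

for char, category in sorted(survivors.items()):
    print(repr(char), category, unicodedata.name(char))

# True when every survivor is in a letter category (Lu, Ll, Lt, Lm or Lo).
print(all(category.startswith('L') for category in survivors.values()))
```

Letter categories all start with `L`, so a mark, digit, or symbol slipping through would show up here as `M*`, `N*`, or `S*`.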

Here's the solution:

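import re
# Replace every character that is neither a letter nor whitespace with a space.
s = ''.join([c if c.isalpha() or c.isspace() else ' ' for c in s])
# Blank the non-letter ASCII/Latin-1 ranges (ur'' is Python 2 only; use r'' on Python 3), then collapse runs of spaces.
s = re.sub(ur'[\u0020-\u0040]+|[\u005B-\u0060]+|[\u007B-\u00BF]+', ' ', s)
s = re.sub(r'[ ]+', ' ', s)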

Printing s after these steps gives:

carbon copolymers ٯ III
Ǣ ܔ ء Géotechnique
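
For anyone on Python 3, the same three steps might look roughly like the sketch below. The function name `keep_letters` is hypothetical (it is not from the answer), and the only real change is the regex literal: Python 3 rejects the `ur''` prefix, but its `re` module understands `\uXXXX` escapes on its own, so a plain raw string should do.

```python
import re

# Hypothetical helper name; the three steps mirror the answer above.
def keep_letters(text):
    # Step 1: keep letters and whitespace, turn everything else into a space.
    text = ''.join(c if c.isalpha() or c.isspace() else ' ' for c in text)
    # Step 2: blank the non-letter ASCII / Latin-1 ranges. This also catches
    # characters such as U+00AA that str.isalpha() accepts as letters.
    text = re.sub(r'[\u0020-\u0040]+|[\u005B-\u0060]+|[\u007B-\u00BF]+', ' ', text)
    # Step 3: collapse runs of spaces into a single space.
    return re.sub(r'[ ]+', ' ', text)

s = 'carbon҂ ҉ copolymers—⿴٬ٯ٪III❏£\n12-ः׶ Ǣ ܊ܔ ۩۝۞ء܅۵Géotechnique▣ऀ\n'
cleaned = keep_letters(s)
print(repr(cleaned))
```

Newlines are left alone because U+000A is outside all three ranges and counts as whitespace in step 1. If single leading or trailing spaces on each line are unwanted, stripping each line afterwards is an easy follow-up.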
