Remove All Symbols While Preserving String Consistency
My goal is to remove all symbols from a string while preserving the Unicode letters (alphabetic characters from any language). Suppose I have the following string: carbon cop
Solution 1:
I think I found a good solution (more than 99% robust, I believe) to the problem:
Well here's our new, horrific string:
s = u'carbon҂ ҉ copolymers—⿴٬ٯ٪III❏£\n12-ः Ǣ ܊ܔ ۩۞ء܅۵Géotechnique▣ऀ\n'
And here's the resulting string:
u'carbon copolymers \u066f III \n \u01e2 \u0714 \u0621 G\xe9otechnique \n'
All the remaining characters/words are in fact alphabetic characters from different languages. Done with almost no effort!
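One way to check that claim is to look at the Unicode general category of each character: str.isalpha() is true for the letter categories (Lu, Ll, Lt, Lm, Lo) and false for punctuation, symbols, digits and combining marks. The snippet below is a minimal Python 3 sketch, not part of the original answer; the helper name describe is made up for illustration.

```python
import unicodedata

def describe(chars):
    """Return (character, general category, isalpha()) for each character."""
    return [(c, unicodedata.category(c), c.isalpha()) for c in chars]

# A few characters from the sample string, written as escapes:
# e-acute, AE with macron, two Arabic letters, em dash, pound sign,
# Arabic percent sign, Extended Arabic-Indic digit five.
for row in describe('\u00e9\u01e2\u066f\u0621\u2014\u00a3\u066a\u06f5'):
    print(row)
```

The first four characters fall in letter categories and are kept; the last four (dash, currency sign, percent sign, digit) are not letters and get replaced by a space.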
Here's the solution:
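import re
# Keep letters (any script) and whitespace; turn everything else into a space.
s = ''.join([c if c.isalpha() or c.isspace() else ' ' for c in s])
# Python 2 syntax (ur''): blank out the ASCII/Latin-1 symbol ranges, then collapse runs of spaces.
s = re.sub(ur'[\u0020-\u0040]+|[\u005B-\u0060]+|[\u007B-\u00BF]+', ' ', s)
s = re.sub(r'[ ]+', ' ', s)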
carbon copolymers ٯ III
Ǣ ܔ ء Géotechnique
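The solution is written for Python 2: the ur'' literal prefix is a syntax error in Python 3. As a rough sketch of the same three steps in Python 3 (the function name keep_letters is an assumption, not from the original answer), a raw string works because the re module itself understands \uXXXX escapes:

```python
import re

def keep_letters(text):
    """Replace every character that is neither a letter nor whitespace with a space."""
    # Step 1: str.isalpha() is Unicode-aware, so letters from any script survive.
    text = ''.join(c if c.isalpha() or c.isspace() else ' ' for c in text)
    # Step 2: same code point ranges as the original regex; the \uXXXX
    # escapes are interpreted by the re module, not by the string literal.
    text = re.sub(r'[\u0020-\u0040]+|[\u005B-\u0060]+|[\u007B-\u00BF]+', ' ', text)
    # Step 3: collapse runs of spaces into a single space (newlines are kept).
    return re.sub(r'[ ]+', ' ', text)

print(repr(keep_letters('G\u00e9otechnique\u25a3 12-\u01e2\n')))
```

Note that the third range in step 2 (U+007B to U+00BF) also covers a few Latin-1 characters that isalpha() accepts, such as the micro sign U+00B5, as well as the no-break space U+00A0, so those end up as plain spaces too.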