
Python re.split() vs split()

In my quest for optimization, I discovered that the built-in split() method is about 40% faster than the re.split() equivalent. A dummy benchmark (easily copy-pasteable): import r
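The benchmark snippet above is truncated, but it can be reconstructed along these lines. This is a minimal sketch using `timeit`; the sample string and iteration count are assumptions, not the original poster's values:

```python
import re
import timeit

text = "a:b:c:d:e:f:g"  # assumed sample input

# Time both approaches on the same constant delimiter
t_str = timeit.timeit(lambda: text.split(":"), number=100_000)
t_re = timeit.timeit(lambda: re.split(":", text), number=100_000)

print(f"str.split: {t_str:.4f}s  re.split: {t_re:.4f}s")
```

Both calls produce the same list here, so any difference in the timings is pure overhead from the regex machinery.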

Solution 1:

re.split is expected to be slower: using a regular expression incurs overhead that a plain substring search avoids.

Of course if you are splitting on a constant string, there is no point in using re.split().
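As a trivial illustration (sample string assumed): when the delimiter is a constant, both calls return the same list, so the regex machinery buys nothing.

```python
import re

line = "2024-01-15"  # assumed sample input with a fixed delimiter

# Identical results; str.split is simply the cheaper way to get them
assert line.split("-") == re.split("-", line) == ["2024", "01", "15"]
```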


Solution 2:

When in doubt, check the source code. You can see that Python's s.split() is optimized for whitespace and inlined, but s.split() handles fixed delimiters only.

In exchange for the speed penalty, a regex-based re.split is far more flexible.

>>> re.split(r':+', "One:two::t h r e e:::fourth field")
['One', 'two', 't h r e e', 'fourth field']
>>> "One:two::t h r e e:::fourth field".split(':')
['One', 'two', '', 't h r e e', '', '', 'fourth field']
# would require an additional step to drop the empty fields...
>>> re.split(r'[:\d]+', "One:two:2:t h r e e:3::fourth field")
['One', 'two', 't h r e e', 'fourth field']
# try doing that without a regex split in an understandable way...
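That "additional step" for the non-regex version would be filtering out the empty strings, e.g. with a comprehension (a sketch):

```python
# Splitting on a single ':' leaves empty fields behind for repeated colons
parts = "One:two::t h r e e:::fourth field".split(":")

# Drop the empties to match what re.split(r':+', ...) returns directly
fields = [p for p in parts if p]
print(fields)  # → ['One', 'two', 't h r e e', 'fourth field']
```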

That re.split() takes only about 40% more time (or, equivalently, that s.split() saves only about 29% of it) is what should actually be surprising.


Solution 3:

Running a regular expression means running a state machine over each character of the input. Splitting on a constant string is just a substring search, which is a much simpler procedure.
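If the regex route is genuinely needed and the same pattern is applied many times, precompiling it with re.compile avoids re-parsing the pattern on every call, though the per-character matching cost remains (a sketch; the sample lines are assumptions):

```python
import re

# Compile once, reuse many times; only the matching cost is paid per call
splitter = re.compile(r":+")

for line in ["a:b::c", "x:::y"]:
    print(splitter.split(line))
```

Note that re.split with a string pattern also caches compiled patterns internally, so this mainly saves the cache lookup, not the matching itself.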
