Python: Rabin-Karp Algorithm Hashing
Solution 1:
Here is a working version of your code:
def Rabin_Karp_Matcher(text, pattern, d, q):
n = len(text)
m = len(pattern)
h = pow(d,m-1)%q
p = 0
t = 0
result = []
for i in range(m): # preprocessing
p = (d*p+ord(pattern[i]))%q
t = (d*t+ord(text[i]))%q
for s in range(n-m+1): # note the +1
if p == t: # check character by character
match = True
for i in range(m):
if pattern[i] != text[s+i]:
match = False
break
if match:
result = result + [s]
if s < n-m:
t = (t-h*ord(text[s]))%q # remove letter s
t = (t*d+ord(text[s+m]))%q # add letter s+m
t = (t+q)%q # make sure that t >= 0
return result
print (Rabin_Karp_Matcher ("3141592653589793", "26", 257, 11))
print (Rabin_Karp_Matcher ("xxxxx", "xx", 40999999, 999999937))
The output is:
[6]
[0, 1, 2, 3]
On the first step, you check whether text[0..m] == pattern
. On the second step, you want to check whether text[1..m+1] == pattern
. Thus you remove text[0]
from the hash (at the moment it is multiplied by your precomputed h
): t = (t-h*ord(text[s]))%q
. And then, add text[m]
to it: t = (t*d+ord(text[s+m]))%q
. On the next step, you remove text[1]
and add text[m+1]
, and so on. The t = (t+q)%q
line is there because a negative number modulo q
yields remainder in the range (-q; 0]
, and we want it to be in the range [0; q)
.
Note that you want to check a total of n-m+1
substrings, not n-m
, hence the correct loop is for s in range(n-m+1)
. Checked by the second example (finding "xx" in "xxxxx").
Also worth noting:
The line
h = pow(d,m-1)%q
may be too slow ifm
is large. It is better to take the result moduloq
after each of them-2
multiplications.This algorithm is still O(nm) in the worst case. With
text="a"*100000
andpattern="a"*50000
, it will find 50001 positions where a substring of text matches the pattern, and it will check them all character-by-character. If you expect your code to work fast in such extreme cases, you should skip the character-by-character comparison and find a way to deal with false positives (i.e., hashes are equal but strings are not). Picking a large prime numberq
may help reduce the probability of a false positive to an acceptable level.
Solution 2:
Ok, so the answer is that you need to indent the "for s" loop:
def Rabin_Karp_Matcher(text, pattern, d, q):
n = len(text)
m = len(pattern)
h = pow(d,m-1)%q
p = 0
t =0
result = []
for i in range(m): # preprocessing
p = (d*p+ord(pattern[i]))%q
t = (d*t+ord(text[i]))%q
for s in range(n-m):
if p == t: # check character by character
match = True
for i in range(m):
if pattern[i] != text[s+i]:
match = False
break
if match:
result = result + [s]
if s < n-m:
t = (d*(t-ord(text[s+1])*h)+ord(text[s+m]))%q #index out of bounds here
return result
This gives me the answer 6, which is what you are looking for I imagine. Interesting algorithm man.
Post a Comment for "Python: Rabin-Karp Algorithm Hashing"