Skip to content Skip to sidebar Skip to footer

Python: Rabin-Karp Algorithm Hashing

I am implementing Rabin-Karp algorithm for fun. I came across this pseudocode: RABIN -KARP -MATCHER (T, P, d, q) 1 n = T.length 2 m = P.length 3 h = d^(m-1) mod q

Solution 1:

Here is a working version of your code:

def Rabin_Karp_Matcher(text, pattern, d, q):
    n = len(text)
    m = len(pattern)
    h = pow(d,m-1)%q
    p = 0
    t = 0
    result = []
    for i in range(m): # preprocessing
        p = (d*p+ord(pattern[i]))%q
        t = (d*t+ord(text[i]))%q
    for s in range(n-m+1): # note the +1
        if p == t: # check character by character
            match = True
            for i in range(m):
                if pattern[i] != text[s+i]:
                    match = False
                    break
            if match:
                result = result + [s]
        if s < n-m:
            t = (t-h*ord(text[s]))%q # remove letter s
            t = (t*d+ord(text[s+m]))%q # add letter s+m
            t = (t+q)%q # make sure that t >= 0
    return result
print (Rabin_Karp_Matcher ("3141592653589793", "26", 257, 11))
print (Rabin_Karp_Matcher ("xxxxx", "xx", 40999999, 999999937))

The output is:

[6]
[0, 1, 2, 3]

On the first step, you check whether text[0..m] == pattern. On the second step, you want to check whether text[1..m+1] == pattern. Thus you remove text[0] from the hash (at the moment it is multiplied by your precomputed h): t = (t-h*ord(text[s]))%q. And then, add text[m] to it: t = (t*d+ord(text[s+m]))%q. On the next step, you remove text[1] and add text[m+1], and so on. The t = (t+q)%q line is there because a negative number modulo q yields remainder in the range (-q; 0], and we want it to be in the range [0; q).

Note that you want to check a total of n-m+1 substrings, not n-m, hence the correct loop is for s in range(n-m+1). Checked by the second example (finding "xx" in "xxxxx").

Also worth noting:

  1. The line h = pow(d,m-1)%q may be too slow if m is large. It is better to take the result modulo q after each of the m-2 multiplications.

  2. This algorithm is still O(nm) in the worst case. With text="a"*100000 and pattern="a"*50000, it will find 50001 positions where a substring of text matches the pattern, and it will check them all character-by-character. If you expect your code to work fast in such extreme cases, you should skip the character-by-character comparison and find a way to deal with false positives (i.e., hashes are equal but strings are not). Picking a large prime number q may help reduce the probability of a false positive to an acceptable level.


Solution 2:

Ok, so the answer is that you need to indent the "for s" loop:

def Rabin_Karp_Matcher(text, pattern, d, q):
    n = len(text)
    m = len(pattern)
    h = pow(d,m-1)%q
    p = 0
    t =0
    result = []
    for i in range(m): # preprocessing
        p = (d*p+ord(pattern[i]))%q
        t = (d*t+ord(text[i]))%q

        for s in range(n-m):
            if p == t: # check character by character
                match = True
                for i in range(m):
                    if pattern[i] != text[s+i]:
                        match = False
                        break
                if match:
                    result = result + [s]
            if s < n-m:
                    t = (d*(t-ord(text[s+1])*h)+ord(text[s+m]))%q #index out of bounds here


    return result

This gives me the answer 6, which is what you are looking for I imagine. Interesting algorithm man.


Post a Comment for "Python: Rabin-Karp Algorithm Hashing"