Knuth-Morris-Pratt on Cons Cells

What is the prefix-function

The prefix-function turns a sequence of characters into a sequence of numbers by the following rule: the output at a given position is the length of the longest proper suffix of the input up to that position that is also a prefix of the input. In other words, if we have an input sequence \(S\) and we call the resulting sequence \(P\), then \(P[i]\) is the largest \(n < i\) such that \(S[i-n+1:i] = S[1:n]\) (inclusive, 1-based indexing).

For example, given the string "abacabaaababacd", the prefix-function would produce the following sequence:

    S: a b a c a b a a a b a b a c d
    P: 0 0 1 0 1 2 3 1 1 2 3 2 3 4 0

To be more specific, let's take as an example the "4" in the result above: it corresponds to the ending ...ababac, and the string begins with abacab.... The longest string that is simultaneously a suffix of the former and a prefix of the latter is abac, which is 4 characters long. That is why the prefix-function produces 4 at that position.

What does "proper" mean in this context? It means that the substring is not equal to the entire string. If we allowed such prefixes/suffixes, the prefix-function would be pretty boring, because in a string of \(n\) characters, the first \(n\) characters are obviously equal to the last \(n\) characters.

Here are some more examples of the prefix-function in action:

    S: a a a a a a b a a a a a a a a a
    P: 0 1 2 3 4 5 0 1 2 3 4 5 6 6 6 6

    S: a b a c a b a d a b a c a b a
    P: 0 0 1 0 1 2 3 0 1 2 3 4 5 6 7

    S: a b a c a d z a b a c a b
    P: 0 0 1 0 1 0 0 1 2 3 4 5 2
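The definition can be transcribed almost word for word into a brute-force reference function, which is handy for checking tables like the ones above. This is only a sketch: the name naivePrefixFun is made up for illustration, and the function is far too slow for real use because it re-examines every prefix from scratch.

```haskell
import Data.List (inits, isSuffixOf)

-- | Prefix-function computed directly from the definition.
-- For every non-empty prefix p of the input, take the largest n < length p
-- such that the first n characters of p are also its last n characters.
naivePrefixFun :: String -> [Int]
naivePrefixFun s =
  [ maximum [ n | n <- [0 .. length p - 1], take n p `isSuffixOf` p ]
  | p <- drop 1 (inits s)  -- drop 1 skips the empty prefix
  ]
```

The candidate n = 0 always qualifies (the empty string is a suffix of everything), so maximum never sees an empty list.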

How is it useful

The most practical use of the prefix-function is substring search, the problem solved by the standard C function strstr(). How does the prefix-function help with substring search? Very simply: take the needle, append some symbol that doesn't appear in either string (for example the NUL byte), append the haystack, and compute the prefix-function of the concatenation. Then iterate over the numbers produced by the prefix-function until we find the first one equal to the length of the needle. That means we found a substring of the haystack that is equal to the start of the needle-NUL-haystack pile, specifically as many characters as the needle contains, which is, obviously, the needle itself.
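The search recipe can be sketched as follows. The helper searchWith is hypothetical and takes the prefix-function as an argument, so any implementation can be plugged in; here a slow brute-force one (naivePrefixFun, also a made-up name) stands in. The sketch assumes that the NUL character occurs in neither string and reports a 0-based position.

```haskell
import Data.List (findIndex, inits, isSuffixOf)

-- | Slow stand-in prefix-function, written straight from the definition.
naivePrefixFun :: String -> [Int]
naivePrefixFun s =
  [ maximum [ n | n <- [0 .. length p - 1], take n p `isSuffixOf` p ]
  | p <- drop 1 (inits s)
  ]

-- | 0-based index of the first occurrence of the needle in the haystack,
-- given some prefix-function pf. Assumes '\0' appears in neither string.
searchWith :: (String -> [Int]) -> String -> String -> Maybe Int
searchWith pf needle haystack
  | null needle = Just 0
  | otherwise   = fmap toStart (findIndex (== m) hayValues)
  where
    m = length needle
    -- Skip the values that belong to the needle and to the separator.
    hayValues = drop (m + 1) (pf (needle ++ "\0" ++ haystack))
    -- findIndex reports where the match ends; convert that to its start.
    toStart end = end - m + 1
```

For instance, searchWith naivePrefixFun "aba" "cabab" looks at the values 0 1 2 3 2 computed for the haystack part, finds the 3 at offset 3, and reports the start position 1.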

Calculation

How do we compute the prefix-function? The prefix-function has a rather simple implementation using arrays, but in functional programming sequences are most commonly represented with cons-cells (singly linked lists), and some functional languages don't even have arrays at all (specifically, \(\mathcal{O}(1)\)-accessible arrays). This article implements the algorithm in a hypothetical Haskell dialect where cons-cells are the only way to represent sequential information.

We're going to use dynamic programming (a "dynamic" for short): we will compute the \(N\)th value of the prefix-function using the previous \(N-1\) values as a basis. Let's examine the induction step first.

Let's look at an example of such a step. Assume we already computed the following piece of the prefix-function and we want to compute the value under the question mark:

    S: a b c a b d a b c a b ? ...
    P: 0 0 0 1 2 0 1 2 3 4 5

The previous value of the prefix-function tells us that the 5 preceding characters (abcab) are equal to the 5-prefix of \(S\), so if ? equals the 6th character of the input, then we take that 5 and increment it by one:

       /¯¯¯¯¯¯¯\   /¯¯¯¯¯¯¯\
    S: a b c a b d a b c a b ? ...
    P: 0 0 0 1 2 0 1 2 3 4 5
                           ^

There can't be a longer suffix because otherwise the previous prefix-function value would have been higher.

If our character is not equal to the 6th character of the input, we look at \(P[5]\) (1-based indexing) and see that it's 2. It tells us that the 2-suffix of the 5-prefix (ab) equals the 2-prefix of the input, but we already established that the 5 characters preceding our current position are equal to the 5 characters from the beginning of the input, so that gives us some extra equivalences:

       /¯¯¯¯¯¯¯\   /¯¯¯¯¯¯¯\
       /¯\   /¯\         /¯\
    S: a b c a b d a b c a b ? ...
    P: 0 0 0 1 2 0 1 2 3 4 5
               ^

In particular, we see that the 2 preceding characters are equal to the 2-prefix of the input. So if ? equals the 3rd character, then we take that 2 and increment it by one. Again, a longer substring is impossible because then either the previous prefix-function value or \(P[5]\) would've been higher.

If our character, again, doesn't equal that, we look at \(P[2]\) (1-based), which is 0, and repeat everything again: the 0-prefix (the empty string) equals the 0 preceding characters, so if our character equals the first character of the input, we write 0+1.

If we fail again, we would try to look at \(P[0]\), but as we've been using 1-based indexing here, \(P[0]\) doesn't make sense. Think about what we've got: we established that no string longer than 0 characters is simultaneously a suffix and a prefix of the input piece. The answer is 0!
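For contrast with the cons-cell version this article is building towards, the steps just described can be sketched with arrays. This is an illustration rather than part of the article's implementation: it assumes the array package that ships with GHC, and the names arrayPrefixFun, value and tryLength are made up here. The array p is lazy, so each element may refer to earlier elements of the same array.

```haskell
import Data.Array (Array, elems, listArray, (!))

-- | Array-based prefix-function using 1-based indices, as in the text.
arrayPrefixFun :: String -> [Int]
arrayPrefixFun str = elems p
  where
    n = length str

    s :: Array Int Char
    s = listArray (1, n) str

    -- Lazy array: element i is computed on demand from earlier elements.
    p :: Array Int Int
    p = listArray (1, n) (map value [1 .. n])

    value :: Int -> Int
    value 1 = 0                        -- a single character has no proper suffix but ""
    value i = tryLength (p ! (i - 1))  -- start from the previous value
      where
        tryLength k
          | s ! (k + 1) == s ! i = k + 1              -- the k-prefix can be extended
          | k == 0               = 0                  -- nothing left to try
          | otherwise            = tryLength (p ! k)  -- fall back to P[k]
```

With the example above, replacing ? by d yields 6, by c yields 3 (after one fallback to \(P[5]\)), by a yields 1, and by any other character yields 0.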

More formally, let \(\overrightarrow{N}\) be the first \(N\) characters of \(S\). Also let \(N\to K\) be the \(N\) characters of \(S\) ending at position \(K\).

Then \(P[i] = \max\{n : n < i,\ \overrightarrow{n} = n\to i\}\).

Here are some fairly obvious lemmas:

$$\left.\begin{array}{l}\overrightarrow{n}=n\to k\\\overrightarrow{k}=k\to i\\n\le k\end{array}\right|\implies\overrightarrow{n}=n\to i\tag{L1}$$

$$\left.\begin{array}{l}\overrightarrow{n+k}=(n+k)\to(i+k)\\k\ge 0\end{array}\right|\implies\overrightarrow{n}=n\to i\tag{L2}$$

If we have computed \(i\) terms of \(P\) and are computing the \((i+1)\)th:

If the respective character matches \(S[P[i]+1]\), then \(\overrightarrow{P[i]+1}=(P[i]+1)\to(i+1)\). Assume that there exists \(m>P[i]+1\) with \(\overrightarrow{m}=m\to(i+1)\). Then L2 with \(k=1\) gives \(\overrightarrow{m-1}=(m-1)\to i\), hence \(P[i]\ge m-1\), which contradicts \(m-1>P[i]\). That means \(P[i]+1\) is the length of the longest such substring and thus is the value of \(P[i+1]\).

If that is not the case, we have established that \(P[i+1]\le P[i]\).

If we examine \(\overrightarrow{P[i]}\) and \(\overrightarrow{P[P[i]]}\), L1 will tell us that \(\overrightarrow{P[P[i]]}=P[P[i]]\to P[i]=P[P[i]]\to i\). If our respective character matches \(S[P[P[i]]+1]\), then \(\overrightarrow{P[P[i]]+1}=(P[P[i]]+1)\to(i+1)\), and it ends up being the longest substring by reasoning analogous to the above.

If we repeat this process, eventually \(P[...P[i]...]\) will reach 0 (by \(P[i]<i\)), and if our character doesn't match even then, we would establish that \(P[i+1]\le 0\), and since the number is nonnegative, \(P[i+1]=0\).
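The whole induction step can be condensed into a single recurrence. This is only a restatement of the argument above; the chain \(k_j\) is shorthand introduced here and is not used elsewhere in the article. Let \(k_1 = P[i]\) and \(k_{j+1} = P[k_j]\) for as long as \(k_j > 0\), so that \(k_1 > k_2 > \dots > k_r = 0\) is a finite chain. Then

$$P[i+1]=\begin{cases}k_j+1 & \text{for the smallest } j \text{ with } S[k_j+1]=S[i+1],\\0 & \text{if no } k_j \text{ satisfies this.}\end{cases}$$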

Implementation

As we've already established, we need a dynamic: a list of prefix-function values to which we add newly computed numbers. However, Haskell has no mutation, so instead of modifying the list we will produce its elements one by one. The list will contain structures that we use to produce the next value in the list; let's call those structures ZState.

So what would this structure store? Let's look at what kind of data we need. First of all we need a pointer to the previously computed structure, but that can simply be passed through recursion. Obviously, the structure should contain the number that we computed, the respective result of the prefix-function: zLength. It should also contain the next position's character. However, not every position has a next character (the last one doesn't), so instead of messing around with Maybes we will just store a tail of the input: zTail. The most complicated part is that we should somehow be able to access P[this structure's zLength] at any time. We could just store a reference to that ZState, but that doesn't cut it: when creating a new instance of the structure, we might copy some other structure's zLength and add 1, and a single reference gives us no way to step from P[zLength] to P[zLength + 1]. We could store the entire list we're producing, but indexing such a list is too inefficient. The solution is much simpler than it seems: store a reference to a tail of the list we're producing! The head of this reference is P[this structure's zLength], and when we increment zLength by 1, we simply take the tail of the reference. Let's give it a name: zPrev.

In total the structure looks like this:

    data ZState = ZState {
        zTail :: String,
        zLength :: Int,
        zPrev :: [ZState]
    }

Let's consider the base of the dynamic now. As already mentioned, \(P[0]\) doesn't make sense, but let's give it some meaning: we create an imaginary 0th element that will not be included in the output. What would it contain? zTail would point to the whole input; zLength doesn't make sense, but it's never used, so let's stuff an undefined there; zPrev should point to \(P[0]\), i.e. the list beginning with the element itself.
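An element whose field points at the list that contains it may look circular, but lazy evaluation makes it unproblematic. Here is a minimal sketch of the same trick on a toy type; Cell, cellValue, cellBack and selfList are made-up names for illustration and are not part of the implementation.

```haskell
-- | A cell that stores a value and a reference into the list it belongs to.
data Cell = Cell { cellValue :: Int, cellBack :: [Cell] }

-- | A two-element list whose cells refer back into the list itself.
selfList :: [Cell]
selfList = cells
  where
    cells  = [first, second]
    -- first refers to the whole list, which begins with first itself.
    first  = Cell { cellValue = 10, cellBack = cells }
    -- second refers to the tail of the same list.
    second = Cell { cellValue = 20, cellBack = drop 1 cells }
```

No copying takes place: cellBack first and selfList share the same cons-cells, which is the property the zPrev field relies on.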

The structure for the first character should be crafted manually too: if we let the algorithm do it by itself, it would produce a zLength of 1, which doesn't represent a proper substring and gives us the boring function described in the beginning. zTail is, quite literally, the tail of the input; zLength is 0; zPrev is \(P[0]\), i.e. the list beginning with the imaginary element.

Let's turn all this into code (and not forget the empty list case):

    zTraverse :: String -> [ZState]
    zTraverse (x:xs) = let
        imaginary = ZState{zTail = x:xs, zLength = undefined, zPrev = zeroth}
        base = ZState{zTail = xs, zLength = 0, zPrev = zeroth}
        zeroth = imaginary:result
        result = base:produce base xs
      in result
    zTraverse [] = []

produce is the dynamic step which we will define soon.

First let's produce one state using the algorithm outlined above:

    nextState :: Char -> String -> ZState -> ZState
    nextState x xs z
      -- If x equals the corresponding character, add 1:
      | head (zTail (head (zPrev z))) == x =
          ZState{zTail = xs, zLength = zLength z + 1, zPrev = tail (zPrev z)}
      -- The above works even if zLength is 0, because zTail of the imaginary
      -- element points to the beginning of input.
      -- No substring longer than 0 found:
      | zLength z == 0 = ZState{zTail = xs, zLength = 0, zPrev = zPrev z}
      -- We copy zPrev from z because it has the same zLength.
      -- If x is not the right character, try P[this state's zLength]:
      | otherwise = nextState x xs (head (zPrev z))

And the last step is to do this recursively, passing the just produced state to the next call:

    produce :: ZState -> String -> [ZState]
    produce z (x:xs) = let
        newz = nextState x xs z
      in newz:produce newz xs
    produce _ [] = []

Well, of course you probably want to get rid of the extra information that the dynamic produces:

    prefixFun :: String -> [Int]
    prefixFun = map zLength . zTraverse

Asymptotic behavior

For an input of \(N\) characters, the function produces \(N\) numbers, so the lower bound for time complexity is obviously \(\Omega(N)\). produce is called for every index, i.e. \(N\) times. nextState is entered about \(N\) times too, but it might loop for a bit before producing the result. However, a successful match in nextState increases the value of the prefix-function by one, while every loop repetition strictly decreases the candidate value. The value is increased at most \(N\) times over the whole string, and since it is an integer that never goes negative, the total number of loop repetitions is also at most \(N\).

Therefore, zTraverse works in \(\Theta(N)\) time.

The calculation also never creates any extra data: all zPrev pointers point back into the same structure, so memory usage is \(\Theta(N)\) too.

If we use the substring search algorithm described earlier, with a needle of \(M\) characters and a haystack of \(N\) characters, we achieve \(\mathcal{O}(M+N)\) execution time, which is on the same level as the Knuth-Morris-Pratt algorithm that relies heavily on \(\mathcal{O}(1)\)-accessible arrays.
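As a sketch of how the pieces might be combined, the block below repeats the definitions from the Implementation section (so that it compiles on its own) and adds one helper on top. The helper allOccurrences is hypothetical and a slight variation on the search described earlier: it collects every position whose prefix-function value equals the needle length instead of stopping at the first. It assumes the NUL character occurs in neither string and, as an arbitrary choice of this sketch, returns no positions for an empty needle.

```haskell
import Data.List (findIndices)

-- Definitions from the Implementation section, repeated for self-containment.
data ZState = ZState { zTail :: String, zLength :: Int, zPrev :: [ZState] }

zTraverse :: String -> [ZState]
zTraverse (x:xs) = result
  where
    imaginary = ZState { zTail = x:xs, zLength = undefined, zPrev = zeroth }
    base      = ZState { zTail = xs, zLength = 0, zPrev = zeroth }
    zeroth    = imaginary : result
    result    = base : produce base xs
zTraverse [] = []

nextState :: Char -> String -> ZState -> ZState
nextState x xs z
  | head (zTail (head (zPrev z))) == x =
      ZState { zTail = xs, zLength = zLength z + 1, zPrev = tail (zPrev z) }
  | zLength z == 0 = ZState { zTail = xs, zLength = 0, zPrev = zPrev z }
  | otherwise = nextState x xs (head (zPrev z))

produce :: ZState -> String -> [ZState]
produce z (x:xs) = newz : produce newz xs
  where newz = nextState x xs z
produce _ [] = []

prefixFun :: String -> [Int]
prefixFun = map zLength . zTraverse

-- | Hypothetical helper (not from the article): 0-based start positions of
-- all occurrences of the needle, overlapping ones included.
allOccurrences :: String -> String -> [Int]
allOccurrences needle haystack
  | null needle = []  -- arbitrary choice for this sketch
  | otherwise   = map toStart (findIndices (== m) hayValues)
  where
    m = length needle
    -- Skip the values that belong to the needle and to the separator.
    hayValues = drop (m + 1) (prefixFun (needle ++ "\0" ++ haystack))
    -- Each hit marks the end of a match; convert it to the start.
    toStart end = end - m + 1
```

For example, allOccurrences "aba" "ababa" yields [0,2]. Because the result list is produced lazily, a caller that only needs the first occurrence can take the head and leave the rest of the haystack untouched.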
