The problem is to search for a pattern string,
pat[1..m],
in a text string
txt[1..n].
Usually n>>m,
and txt might be very long indeed,
although this is not necessarily so.
This problem occurs in text-editors and many other computer applications.
Naive Search
The naive string searching algorithm
is to examine each position, i>=1, in txt,
trying for equality of
pat[1..m] with txt[i..i+m-1].
If there is inequality, position i+1 is tried, and so on.
The worst-case time-complexity of the naive algorithm is seen to be
O(m*n),
e.g.
pat=am-1b
and
txt=an-1b
Mercifully there are faster algorithms!
Demonstration
The HTML FORM below allows you to search for a pattern string
pat in a text string txt.
It uses three algorithms: the naive algorithm (above),
Rabin's algorithm and the Boyer-Moore algorithm (below).
Change the values of pat and txt,
run the algorithms and experiment:
Rabin's string searching algorithm uses a
"rolling hash".
Hashing is an idea borrowed from hash-[tables]
but used here without the "table".
In Rabin's algorithm,
the hash function is defined to compute a number in [0..p-1] for a large
prime number p.
The integers modulo p, Zp, behave as a mathematical field.
That is, they behave just as the usual unbounded integers do
(see [intro'])
with respect to addition and multiplication, and in particular
there are no "divisors of zero",
that is no non-zero values j & k such that j*k=0.
At each stage of the algorithm,
hash(txt[i..i+|pat|-1])
is compared with hash(pat), e.g.
patHash = (((ord(pat[1])*3+ord(pat[2]))*3+...+ord(pat[m])
) mod p
txtHash1= (((ord(txt[1])*3+ord(txt[2]))*3+...+ord(txt[m])
) mod p
txtHashi= ((txtHashi-1 - ord(txt[i-1])*power)*3
+ ord(txt[i+m-1])
) mod p, where i>1
where power = 3m-1
and ord(ch) is the character code of ch
If the values are
equal then a match has almost certainly been found.
A character by character check must be made to be absolutely sure.
Time Complexity
Rabin's algorithm is (almost always) fast,
i.e. O(m+n) average-case time-complexity, because
hash(txt[i..i+m-1]) can be computed in O(1) time -
i.e. by two multiplications, a subtraction,
an addition and a `mod' -
given its predecessor hash(txt[i-1..i-1+m-1]).
The worst-case time-complexity does however remain at O(m*n)
because of the possibility of false-positive matches on
the basis of the hash numbers,
although these are very rare indeed.
Space Complexity
Space complexity is O(1) for a few scalar variables.
Boyer-Moore Algorithm
The Boyer Moore algorithm
is always fast having worst-case time-complexity O(m+n),
but on natural-language text it actually gets faster as m increases
to a certain extent,
e.g. Boyer and Moore suggesting O(n/4)-time on average for m=5.
The first key idea, and it is a good one,
is that if the mth
character ch=txt[m] (numbered from `1' upwards)
does not occur in pat at all
then any instance of pat in txt
must start at position m+1 or later.
e.g. if searching for pat=`freddy',
in txt=`I floated lonely as a cloud',
the 6th character of txt is `a'
which is not in {d,e,f,r,y}
and there is no need to consider positions 1 to 6 of txt
any more!
In this way, we can move along txt in steps
of m positions, i.e. k=m, 2m, 3m, ...,
provided we are lucky.
The second idea is that if the last occurrence of
the character txt[k] in pat
is delta1[ch] positions from the right hand end of
pat, then
pat can be slid that many positions to the right
before we might get a match in txt.
If ch=txt[k] does not occur in pat,
set delta1[ch] equal to m.
e.g.
delta1['e']=3,
delta1['f']=5,
delta1['d']=1,
delta1['r']=4,
delta1['y']=0.
In general if the last j characters match,
i.e. txt[k-j..k]=pat[m-j..m], but
txt[k-j-1]~=pat[m-j-1],
then pat can be "slid"
a certain distance along txt
depending on where, if at all,
there is an earlier instance of pat[m-j..m]
within pat;
e.g. consider pat=abracadabra,
or e.g. pat=fababab.
An array, delta2[1..m],
is used to hold these precomputed distances.
Time Complexity
The worst-case time-complexity of the Boyer-Moore algorithm is O(m+n).
The algorithm often runs in O(n/m)-time on natural-language text
for small values of m.
Note that if txt is in slow, block-access, backing store,
it is generally not possible to bring just every
mth character
into main memory, so the time-complexity is then O(n).
Space Complexity
The space-complexity is O(m + |alphabet|),
for the arrays delta1[ ] and delta2[ ].
Preprocessing txt.
It is intuitively obvious that the worst-case for
searching for an arbitrary pattern, pat, in a text, txt,
must take at least O(n)-time, where n=|txt|.
However, if some time is spent pre-processing txt,
then individual searches can be made more quickly,
e.g. in O(m)-time using a
suffix-tree,
where m=|pat|.
It takes O(n)-time to build a suffix tree for txt.
Notes
R. S. Boyer and J. S. Moore.
A Fast String Searching Algorithm.
Comm. A.C.M. 20(10) p762-772 Oct 1977.
The paper has a rather nice section on how the algorithm
was developed, over quite some time, from the initial idea, with
contributions from various people.
D. E. Knuth, J. H. Morris and V. R. Pratt.
Fast Pattern Matching in Strings.
SIAM Jrnl. Comput. 6(2) p323-350 Jun 1997.
R. M. Karp and M. O. Rabin.
Efficient Randomized Pattern-Matching Algorithms.
IBM Jrnl. Res. and Dev. 31(2) p249-260 Mar 1987.
G. de v. Smit.
A Comparison of Three String Matching Algorithms.
Software Practice and Experience 12 p57-66 Jan 1982.
Compared
(i) a straightforward algorithm,
(ii) Knuth Morris Pratt algorithm, and
(iii) the Boyer Moore algorithm.
The last of these was usually best, by far.
There are slight programming variations depending on whether
strings are indexed from zero or from one.
Beware:
Some implementations of `m mod n' in some languages and systems
are incorrect when m is negative which can cause a nasty surprise
when implementing Rabin's algorithm!
The
[Suffix Tree]
data structure for txt can be built in O(|txt|)-time and can then be used
to find a substring, pat, in O(|pat|)-time.
Also see the
[edit-distance] problem,
a.k.a. approximate string matching.