Strings

home₁ home₂
Bib
Algorithms
Bioinfo
FP
Logic
MML
Prog.Lang
and the
Book

Algorithms
glossary
Strings
  Suffix array
  Suffix tree
  BWT

String Searching

The problem is to search for a pattern string, pat[1..m], in a text string txt[1..n]. Usually n>>m, and txt might be very long indeed, although this is not necessarily so. This problem occurs in text-editors and many other computer applications.

Naive Search

The naive string searching algorithm is to examine each position, i>=1, in txt, trying for equality of pat[1..m] with txt[i..i+m-1]. If there is inequality, position i+1 is tried, and so on.

The worst-case time-complexity of the naive algorithm is seen to be O(m*n),
e.g. pat=a^m-1b and txt=a^n-1b

Mercifully there are faster algorithms!

Demonstration

The HTML FORM below allows you to search for a pattern string pat in a text string txt. It uses three algorithms: the naive algorithm (above), Rabin's algorithm and the Boyer-Moore algorithm (below). Change the values of pat and txt, run the algorithms and experiment:

Rabin's Algorithm

Rabin's string searching algorithm uses a "rolling hash". Hashing is an idea borrowed from hash-[tables] but used here without the "table". In Rabin's algorithm, the hash function is defined to compute a number in [0..p-1] for a large prime number p. The integers modulo p, Z_p, behave as a mathematical field. That is, they behave just as the usual unbounded integers do (see [intro']) with respect to addition and multiplication, and in particular there are no "divisors of zero", that is no non-zero values j & k such that j*k=0.

At each stage of the algorithm, hash(txt[i..i+|pat|-1]) is compared with hash(pat), e.g.

patHash = (((ord(pat[1])*3+ord(pat[2]))*3+...+ord(pat[m])
          ) mod p


txtHash₁= (((ord(txt[1])*3+ord(txt[2]))*3+...+ord(txt[m])
          ) mod p


txtHash_i= ((txtHash_i-1 - ord(txt[i-1])*power)*3
                       + ord(txt[i+m-1])
          ) mod p,                              where i>1


where power = 3^m-1
and   ord(ch) is the character code of ch

If the values are equal then a match has almost certainly been found. A character by character check must be made to be absolutely sure.

Time Complexity

Rabin's algorithm is (almost always) fast, i.e. O(m+n) average-case time-complexity, because hash(txt[i..i+m-1]) can be computed in O(1) time - i.e. by two multiplications, a subtraction, an addition and a `mod' - given its predecessor hash(txt[i-1..i-1+m-1]).

The worst-case time-complexity does however remain at O(m*n) because of the possibility of false-positive matches on the basis of the hash numbers, although these are very rare indeed.

Space Complexity

Space complexity is O(1) for a few scalar variables.

Boyer-Moore Algorithm

The Boyer Moore algorithm is always fast having worst-case time-complexity O(m+n), but on natural-language text it actually gets faster as m increases to a certain extent, e.g. Boyer and Moore suggesting O(n/4)-time on average for m=5.

The first key idea, and it is a good one, is that if the m^th character ch=txt[m] (numbered from `1' upwards) does not occur in pat at all then any instance of pat in txt must start at position m+1 or later.
e.g. if searching for pat=`freddy', in txt=`I floated lonely as a cloud', the 6^th character of txt is `a' which is not in {d,e,f,r,y} and there is no need to consider positions 1 to 6 of txt any more! In this way, we can move along txt in steps of m positions, i.e. k=m, 2m, 3m, ..., provided we are lucky.

The second idea is that if the last occurrence of the character txt[k] in pat is delta₁[ch] positions from the right hand end of pat, then pat can be slid that many positions to the right before we might get a match in txt. If ch=txt[k] does not occur in pat, set delta₁[ch] equal to m.
e.g.delta₁['e']=3, delta₁['f']=5, delta₁['d']=1, delta₁['r']=4, delta₁['y']=0.

In general if the last j characters match, i.e. txt[k-j..k]=pat[m-j..m], but txt[k-j-1]~=pat[m-j-1], then pat can be "slid" a certain distance along txt depending on where, if at all, there is an earlier instance of pat[m-j..m] within pat; e.g. consider pat=abracadabra, or e.g. pat=fababab. An array, delta₂[1..m], is used to hold these precomputed distances.

Time Complexity

The worst-case time-complexity of the Boyer-Moore algorithm is O(m+n).

The algorithm often runs in O(n/m)-time on natural-language text for small values of m. Note that if txt is in slow, block-access, backing store, it is generally not possible to bring just every m^th character into main memory, so the time-complexity is then O(n).

Space Complexity

The space-complexity is O(m + |alphabet|), for the arrays delta1[ ] and delta2[ ].

Preprocessing `txt`.

It is intuitively obvious that the worst-case for searching for an arbitrary pattern, pat, in a text, txt, must take at least O(n)-time, where n=|txt|. However, if some time is spent pre-processing txt, then individual searches can be made more quickly, e.g. in O(m)-time using a suffix-tree, where m=|pat|. It takes O(n)-time to build a suffix tree for txt.

Notes

R. S. Boyer and J. S. Moore. A Fast String Searching Algorithm. Comm. A.C.M. 20(10) p762-772 Oct 1977.
The paper has a rather nice section on how the algorithm was developed, over quite some time, from the initial idea, with contributions from various people.
D. E. Knuth, J. H. Morris and V. R. Pratt. Fast Pattern Matching in Strings. SIAM Jrnl. Comput. 6(2) p323-350 Jun 1997.
R. M. Karp and M. O. Rabin. Efficient Randomized Pattern-Matching Algorithms. IBM Jrnl. Res. and Dev. 31(2) p249-260 Mar 1987.
G. de v. Smit. A Comparison of Three String Matching Algorithms. Software Practice and Experience 12 p57-66 Jan 1982.
Compared (i) a straightforward algorithm, (ii) Knuth Morris Pratt algorithm, and (iii) the Boyer Moore algorithm. The last of these was usually best, by far.
There are slight programming variations depending on whether strings are indexed from zero or from one.
Beware: Some implementations of `m mod n' in some languages and systems are incorrect when m is negative which can cause a nasty surprise when implementing Rabin's algorithm!
The [Suffix Tree] data structure for txt can be built in O(|txt|)-time and can then be used to find a substring, pat, in O(|pat|)-time.
Also see the [edit-distance] problem, a.k.a. approximate string matching.

Coding Ockham's Razor, L. Allison, Springer

A Practical Introduction to Denotational Semantics, L. Allison, CUP

Linux
Ubuntu
free op. sys.
OpenOffice
free office suite
The GIMP
~ free photoshop
Firefox
web browser

© L. Allison http://www.allisons.org/ll/ (or as otherwise indicated),
Faculty of Information Technology (Clayton), Monash University, Australia 3800 (6/'05 was School of Computer Science and Software Engineering, Fac. Info. Tech., Monash University, was Department of Computer Science, Fac. Comp. & Info. Tech., '89 was Department of Computer Science, Fac. Sci., '68-'71 was Department of Information Science, Fac. Sci.)

Created with "vi (Linux + Solaris)", charset=iso-8859-1, fetched Saturday, 04-May-2024 14:34:53 AEST.

Strings

String Searching

Naive Search

Demonstration

Rabin's Algorithm

Time Complexity

Space Complexity

Boyer-Moore Algorithm

Time Complexity

Space Complexity

Preprocessing txt.

Notes

Preprocessing `txt`.