■ Boyer-Moore algorithm put to the test for fast(?) file text search

Here at xplorer² development desk we strive for the whole package, usability on the outside and performance in the inside. I came across the Boyer-Moore fast text search algorithm some time ago and had planned to investigate it for possible use with xplorer² search in text files module. The winds are not particularly clement to surfers lately, so what better way to pass one's breeze-less afternoon than some practical experimentation?

Boyer and Moore's is one of these computer science algorithms that isn't very difficult to understand and can be implemented in under 50 lines of C++ code. The naive way to search for a keyword (needle) in a big text file (haystack) would be to scan the file byte by byte until you find an occurence of the first letter of your searched keyword; at that spot you check if the rest of the letters match, or you keep on looking. If the keyword is not there, you have to check all the bytes in the file one by one.

Boyer and Moore's idea is to pre-process the keyword (needle) in an effort to skip as many bytes of the haystack as possible. Without getting into details, one of the skipping rules is called bad character rule:

If the haystack letter aligned with the needle's last letter at the current search position does not appear anywhere in the needle, advance the search position by the length of the needle

In other words, if we find a letter in the file that doesn't appear in the keyword, we don't need to check any neighbouring bytes either, so we save some effort. The algorithm contains some other skipping rules — not as straightforward to digest — but the idea is clear: spend some time preprocessing the needle to make the search as efficient as possible.

Without question, if we were just searching within strings, it would make sense to add 50 extra lines of C code to speed things up. But what about file search? Even with SSD media, reading a file from disk is much slower than working with bytes in RAM, so are the gains of Boyer-Moore lost in a sea of disc access overhead?

Let's put this to the test. I made a 80MB text file that is read in 234 ms off my solid state drive (when not cached from previous reads). Then I compared the naive and the Boyer-Moore search for a series of keywords. The code was in release trim (with default compiler optimizations). The naive version is this:

while(haystack <= pmax) 
{
	if(*haystack == *needle && // first character match
		!memcmp(haystack, needle, m_nPatLength))
		return haystack;

	haystack++;
}
return 0; // not found

Here is a table of performance timings. The two methods were averaged on 100 repeats for a variety of search keywords. The execution times for Boyer-Moore do not take into account the preprocessing time for setting up the skip tables. All the keywords were chosen to fail the search (didn't exist in the file) to stretch the test to the limit.

keyword	letters	Naive	B-M
"At the end"	10	66	68
"At the end and fairly long"	27	65	32
"TEST"	4	60	95
"1234567890"	10	93	49

Table 1. Naive vs Boyer-Moore string matching times (milliseconds) in 80MB text buffer

As you can see the timings vary a lot and are very sensitive to the choice of keyword. As expeccted for longer strings, the skip rules of Boyer-Moore pay off, but for short keywords all the fumbling ends up slowing things down (see the TEST keyword). The best advantage of B-M was the keyword 1234567890 that slowed down the naive method (an artefact of the particular file contents), but still the gain was 44ms or otherwise 18% overall compared to the file read time of 234 ms. For expected size keywords (10 letters) there was no difference, and for smaller the naive search was faster.

Facing this evidence the conjecture is that Boyer-Moore algorithm is not better for real file search, especially if you consider the following limitations:

It applies only for verbatim case sensitive text search (no wildcards or regular expressions)
It works with an alphabet of 256 characters (not sure how it would work bytewise on unicode or UTF8 encodings)
Some setup overhead is required for the skip tables

So it is a case of another apparently good idea falling victim to the limiting factor of slow disc access. Which perhaps explains why the windows API doesn't offer it for string matching.

SIDE NOTE. xplorer² search in files dialogs hava a checkbox for case sensitive keyword matching. Ticking this option is much more efficient, 2-3 times faster in relative terms. Of course this gain is diluted to homeopathic levels taking into account the file reading time. But if you are lucky and the files exist in the cache (e.g. repeated searches), and you are sure you want to find keyword exactly (not KeYwOrD), it is worth ticking the box for case sensitivity.