Boyer-Moore is one of those computer science algorithms that isn't very difficult to understand and can be implemented in under 50 lines of C++ code. The naive way to search for a keyword (needle) in a big text file (haystack) is to scan the file byte by byte until you find an occurrence of the first letter of the keyword; at that spot you check whether the rest of the letters match, or else keep looking. If the keyword is not there, you end up checking every byte of the file one by one.
Boyer and Moore's idea is to pre-process the keyword (needle) so the search can skip as many bytes of the haystack as possible. Without getting into details, one of the skipping rules is called the bad character rule: when a mismatch occurs, the pattern is shifted so that the mismatched haystack byte lines up with its last occurrence in the pattern, or past it entirely if that byte does not occur in the pattern at all.
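The bad character rule boils down to a 256-entry lookup table built from the pattern. Here is a minimal sketch of how such a table could be built (a hypothetical helper, not the article's actual class code):

```cpp
#include <climits>
#include <cstddef>

// Build the bad-character skip table for a pattern: for each possible byte
// value, record how far the search window may shift when that byte is the
// haystack byte under the last pattern position and a mismatch occurs.
void buildBadCharTable(const unsigned char* needle, size_t len,
                       size_t skip[UCHAR_MAX + 1])
{
    for (size_t i = 0; i <= UCHAR_MAX; ++i)
        skip[i] = len;                  // bytes absent from the pattern: shift by the full length
    for (size_t i = 0; i + 1 < len; ++i)
        skip[needle[i]] = len - 1 - i;  // later (rightmost) occurrences overwrite earlier ones
}
```

For the pattern "ABC", a mismatching 'Z' lets the window jump 3 bytes at once, which is where the speedup over the naive scan comes from.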
Without question, if we were just searching within strings in memory, it would make sense to add 50 extra lines of C++ code to speed things up. But what about file search? Even with SSD media, reading a file from disk is much slower than working with bytes in RAM, so are the gains of Boyer-Moore lost in a sea of disk access overhead?
Let's put this to the test. I made an 80MB text file that is read in 234 ms off my solid state drive (when not cached from previous reads). Then I compared the naive and Boyer-Moore searches for a series of keywords. The code was built in release mode (with default compiler optimizations). The naive version is this:
```cpp
while (haystack <= pmax) {
    if (*haystack == *needle &&  // first character match
        !memcmp(haystack, needle, m_nPatLength))
        return haystack;
    haystack++;
}
return 0; // not found
```
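For comparison, the Boyer-Moore side of the test could look roughly like the following self-contained sketch. It uses only the bad character rule (the Boyer-Moore-Horspool simplification); the function and variable names are illustrative, not the article's class code:

```cpp
#include <climits>
#include <cstring>

// Boyer-Moore-Horspool search: bad-character rule only.
// Returns a pointer to the first match, or 0 if the needle is not found.
const char* bmhSearch(const char* haystack, size_t hayLen,
                      const char* needle, size_t needleLen)
{
    if (needleLen == 0 || needleLen > hayLen)
        return 0;

    const unsigned char* h = (const unsigned char*)haystack;
    const unsigned char* n = (const unsigned char*)needle;

    // Preprocessing: bad-character skip table (this cost is excluded
    // from the timings in Table 1).
    size_t skip[UCHAR_MAX + 1];
    for (size_t i = 0; i <= UCHAR_MAX; ++i)
        skip[i] = needleLen;
    for (size_t i = 0; i + 1 < needleLen; ++i)
        skip[n[i]] = needleLen - 1 - i;

    size_t pos = 0;
    while (pos <= hayLen - needleLen) {
        // Compare the last pattern byte first, then the rest.
        if (h[pos + needleLen - 1] == n[needleLen - 1] &&
            !memcmp(h + pos, n, needleLen - 1))
            return haystack + pos;
        // Shift by however far the haystack byte under the last
        // pattern position allows.
        pos += skip[h[pos + needleLen - 1]];
    }
    return 0; // not found
}
```

Note that for a one- or two-byte needle the skips are tiny, which is consistent with the short-keyword results below.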
Here is a table of performance timings. The two methods were each averaged over 100 repeats for a variety of search keywords. The execution times for Boyer-Moore do not include the preprocessing time for setting up the skip tables. All the keywords were chosen to fail the search (they didn't exist in the file), so every search would scan to the end of the buffer, the worst case.
Table 1. Naive vs Boyer-Moore string matching times (milliseconds) in 80MB text buffer
As you can see, the timings vary a lot and are very sensitive to the choice of keyword. As expected, for longer strings the skip rules of Boyer-Moore pay off, but for short keywords all the extra bookkeeping ends up slowing things down (see the TEST keyword). Boyer-Moore's best showing was the keyword 1234567890, which happened to slow down the naive method (an artefact of the particular file contents), but even then the gain was 44 ms, or about 18% overall compared to the file read time of 234 ms. For typically sized keywords (10 letters) there was no difference, and for smaller ones the naive search was faster.
Faced with this evidence, my conjecture is that the Boyer-Moore algorithm is no better for real file search, especially if you consider the following limitations: