xplorer² blog: Desktop search text filters

[xplorer²] — All area text search
1 April 2007

"With searching comes loss; And the presence of absence: Your file is not found." - Random haiku

In the last few articles we've been looking at documents and how they are stored on disk. We saw how xplorer² goes to some lengths to decode the broad range of plain text documents, both for previewing and for searching. However, what most people think of as "text documents" have nothing to do with those ASCII and Unicode encodings we've been looking at. No, they are thinking of Word documents (DOC) and Adobe Portable Document Format files (PDF). And the odd Excel spreadsheet and PowerPoint presentation.

Whereas plain text files contain just the characters and some basic formatting, DOC and PDF files embed complex formatting, fonts, tables, graphics and all sorts of non-text information, all stored on disk alongside the pure text. If you try to preview their raw bytes in the quickviewer or the hex view mode (F4) of editor² you'll be confronted with some funny numbers, which is exactly what is stored in the file. That's why these files are termed "binary": you can't easily get the text out of them. The text extraction is still deterministic, but most of the time the algorithm is either not known to the layman or at any rate too complex for one to bother with!

But help is at hand. The professional version of xplorer² taps into the IFilter text extraction protocol. Whoever devised the document format (e.g. Microsoft for Word) provides a mechanism for digging into the depths of the binary file format and pulling out the pure unformatted text. The software involved is a bit tricky since it is COM-powered, but for the end user it doesn't really matter: we took the beating for the whole team :)

These IFilters were developed for the Windows Indexing Service, which is rather underused, but the functionality serves the text extraction cause very well. Now when you preview a Word document in the draft quickview tab, instead of gibberish you get a clean text preview, quick and useful. The same plumbing enables you to search for text in such files, just as if they were plain text. To enable this feature make sure you check the misleadingly named "Search non-text files too" box. The search takes a bit longer, but you get to treat PDFs and the rest like plain text files, without the resource hog that is desktop search.
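
To make the idea concrete, here is a minimal Python sketch of "search through a filter". It does not call the real IFilter COM interfaces; the `.toy` file format, the `FILTERS` registry and every function name below are made up for illustration. The toy format stores its text as UTF-16 after an 8-byte header, so a naive search of the raw bytes misses a word that the filtered search finds.

```python
import os
import tempfile

def read_plain(path):
    """Plain text: the bytes are the text (UTF-8 is assumed here)."""
    with open(path, "rb") as f:
        return f.read().decode("utf-8", errors="replace")

def read_toy_binary(path):
    """Hypothetical stand-in for a filter: skip the 8-byte header,
    then decode the remaining bytes as UTF-16LE text."""
    with open(path, "rb") as f:
        data = f.read()
    return data[8:].decode("utf-16-le", errors="replace")

# Hypothetical registry mapping a file extension to its text extractor.
FILTERS = {".txt": read_plain, ".toy": read_toy_binary}

def file_contains(path, needle, use_filters=True):
    """Case-insensitive search in the extracted text of one file."""
    ext = os.path.splitext(path)[1].lower()
    extract = FILTERS.get(ext, read_plain) if use_filters else read_plain
    return needle.lower() in extract(path).lower()

def make_toy_document(text):
    """Write a made-up 'binary' document and return its path."""
    fd, path = tempfile.mkstemp(suffix=".toy")
    with os.fdopen(fd, "wb") as f:
        f.write(b"TOYDOC\x01\x00" + text.encode("utf-16-le"))
    return path

if __name__ == "__main__":
    doc = make_toy_document("Quarterly report draft")
    print(file_contains(doc, "report", use_filters=False))  # False
    print(file_contains(doc, "report", use_filters=True))   # True
    os.remove(doc)
```

The unfiltered search fails because UTF-16 puts a zero byte between the letters, which is the same kind of "gibberish" you see when previewing a binary document raw.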

There are quite a few of these filters available for download. If you have MS Office installed you automatically get all the Word, Excel and what-have-you filters already set up. For Adobe PDF you need a free plugin. I know there are later versions of this available but trust me, they are rubbish: they crash all the time and leak handles like crazy, ending up locking each previewed PDF. I tend to believe that Adobe has outsourced some parts of their code to monkeys. Anyway, with this old v4.1 you'll have the fewest problems.
NOTE: PDF filters are automatically available if you have Acrobat Reader v7 or later, so you don't need to install the above plugin!

Even more filters for a variety of file types are available, but don't just install everything. I once had the bad experience of recommending filters from some other monkey-powered software house that ended up crashing xplorer². And all for what? Getting a text preview of executable files? What for? Anyway, most people will only need the Office and PDF plugins.

In the last year or so you will have noticed the proliferation of free desktop search tools like Copernic, Google Desktop, X1 and others (supposedly MS Vista also has instant search by default but I don't know how good/bad it is). I have to admit that most of the time they get the job done, and I use them heavily myself to put some order into the chaos of my file and email life. OK, they are resource hungry and take loads of space for the indexing, but what price convenience? Don't they make xplorer² and its fancy IFilters as useful as a horse and cart on the motorway? Well, yes and no. My biggest gripe with all these tools is that you get just too many results, in a huge list that is really unmanageable. Where are the nice multi-sort capabilities that xplorer² offers on its search results? (tip: you can sort by more than one column at a time by holding <SHIFT> while clicking on a column).
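
For the curious, multi-column sorting is easy to picture in code. The sketch below is only an illustration of the idea, not how xplorer² does it internally; the `results` rows and the `multi_sort` helper are invented for the example. It relies on Python's sort being stable: sort by the least important column first and the most important column last.

```python
from operator import itemgetter

# Made-up search results, one dict per file.
results = [
    {"name": "notes.txt", "type": "txt", "size": 120},
    {"name": "report.pdf", "type": "pdf", "size": 900},
    {"name": "draft.pdf", "type": "pdf", "size": 300},
    {"name": "todo.txt", "type": "txt", "size": 45},
]

def multi_sort(rows, columns):
    """Sort by several columns at once.

    columns is a list of (key, descending) pairs, primary column first.
    Sorting is applied from the last column to the first, so the
    stable sort keeps earlier passes as tie-breakers.
    """
    rows = list(rows)
    for key, descending in reversed(columns):
        rows.sort(key=itemgetter(key), reverse=descending)
    return rows

# Primary: type ascending. Secondary: size, biggest first.
ordered = multi_sort(results, [("type", False), ("size", True)])
print([r["name"] for r in ordered])
# ['report.pdf', 'draft.pdf', 'notes.txt', 'todo.txt']
```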

Another strong point of xplorer² text search is that it can be combined with date and other search conditions, e.g. focus on a part of the folder hierarchy, use wildcards... Add to that some difficult characters like underscores, which most of these search engines treat as separators. No, I want to search for static_cast verbatim, not the two words; I am a programmer and have other uses for the underscore! xplorer² will oblige here too. Last but not least, you can search for text within words. All these indexers rely on breaking down the document into its constituent words. They will find WINDOW but they will not find the portion NDO within the word; xplorer² has no such limitations.
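
The difference between the two approaches can be shown with a toy example. This is a rough sketch under stated assumptions: the tiny inverted index below splits on every character that is not a letter or digit (so the underscore is a separator), which is my simplification of what a typical indexer does, and none of the function names come from any real product.

```python
import re

def build_word_index(documents):
    """Toy inverted index: word -> set of document names.
    Assumption: anything that is not a letter or digit separates words."""
    index = {}
    for name, text in documents.items():
        for word in re.split(r"[^A-Za-z0-9]+", text.lower()):
            if word:
                index.setdefault(word, set()).add(name)
    return index

def index_search(index, term):
    """Look up one whole word in the index."""
    return sorted(index.get(term.lower(), set()))

def substring_search(documents, term):
    """Brute-force scan: match the term anywhere in the raw text."""
    term = term.lower()
    return sorted(n for n, text in documents.items() if term in text.lower())

documents = {
    "main.cpp": "int n = static_cast<int>(x);",
    "readme.txt": "Close the WINDOW before the cast is static.",
}
index = build_word_index(documents)

print(index_search(index, "static_cast"))          # []
print(index_search(index, "static"))               # ['main.cpp', 'readme.txt']
print(substring_search(documents, "static_cast"))  # ['main.cpp']
print(index_search(index, "ndo"))                  # []
print(substring_search(documents, "ndo"))          # ['readme.txt']
```

The index answers instantly but only knows whole words; the scan is slower because it reads every file, but it matches exactly what you typed.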

When you have a difficult search situation and you can spare a few extra moments searching for text, you can do much worse than relying on xplorer². Check out this short demonstration from the tour section for some of the text search tricks on tap.
