■ Discover PDF documents that cannot be searched for text

There are many ways PDF documents can get tangled up. One example is scanning a paper document without using your scanner's OCR (optical character recognition) option. The resulting PDF looks like a text document but is actually a picture with no text information. If you open it, you cannot select words, and you cannot search for text either, whether inside your PDF tool or from xplorer².

There are a few free programs that can "read" the text from a picture PDF, though usually without the formatting. There are online PDF converters that maintain the formatting, but they are not convenient for batch processing lots of picture PDFs, and they may have problems if your text is not in English (or mixes languages). You need to spend a pretty penny to get something industrial strength like ABBYY FineReader for robust large-scale PDF OCR conversions. It is a big topic, which I will leave for a forthcoming part II blog post.

How to find picture PDFs (without text)

Before OCR-processing picture PDFs, you must first find them. The files that contain no text are usually mixed up with searchable PDF files in your document folders.

Things are very simple using the DeskRule search engine. Start DeskRule on your top-level document folder (or the entire This PC folder) and search for PDFs that don't contain any text, using -? in the Contents rulebox. The ? is a wildcard matching any character in the text contents, and the leading minus - means NOT, so the rule matches only files whose text content is empty.

Sadly, this simple negative wildcard trick that points to empty text content won't work in xplorer². The latter program searches the raw content of any file, even binary ones, regardless of whether it is text or unprintable bytes. So we will use the $xx token to search for unprintable characters instead. For example, $01 in the Containing text field of the Find files command searches for a byte with value 1, which marks a picture PDF:

[Screenshot: find empty pdfs]

Notice in the search results preview above how even picture PDFs contain some regular text, but they also have lots of nonsense bytes representing the image, and these identify the unsearchable PDFs.
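For readers who prefer a script, here is a minimal Python sketch of the same raw byte search outside xplorer². It only mirrors the $01 idea described above: it walks a folder and lists the .pdf files that contain a byte with value 1. The function names and the chunk size are my own assumptions, not anything xplorer² uses internally, and since the test is just a heuristic, treat the hits as candidates to check rather than a definitive list.

```python
import os


def contains_byte(path, byte_value=0x01, chunk_size=1 << 20):
    """Return True if the file at `path` contains a byte equal to `byte_value`."""
    needle = bytes([byte_value])
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)  # read in chunks to keep memory use low
            if not chunk:
                return False  # reached end of file without finding the byte
            if needle in chunk:
                return True


def find_pdfs_with_byte(root, byte_value=0x01):
    """Walk `root` recursively and return the sorted .pdf paths containing the byte."""
    hits = []
    for folder, _subdirs, names in os.walk(root):
        for name in names:
            if name.lower().endswith(".pdf"):
                full_path = os.path.join(folder, name)
                if contains_byte(full_path, byte_value):
                    hits.append(full_path)
    return sorted(hits)
```

Calling find_pdfs_with_byte on your top-level document folder returns a list of paths you can review or feed to an OCR tool.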

Now all you need is a PDF OCR tool that takes command-line arguments, and xplorer²'s batch processing engine will convert the files straight from the results window (e.g. thatOCRtool $F). This part is left as a user exercise <g>
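If you want a head start on that exercise outside xplorer², the batch step could be sketched in Python roughly as follows. Note that thatOCRtool is the same made-up placeholder as above, not a real program, and I am assuming the tool takes the PDF path as its only argument; substitute the name and arguments of whatever OCR tool you end up using. The function names are hypothetical too.

```python
import subprocess


def build_commands(pdf_paths, tool="thatOCRtool"):
    """Build one command line (as an argument list) per PDF.

    `tool` is a placeholder name; it is assumed to accept the PDF path
    as its single argument, like `thatOCRtool $F`.
    """
    return [[tool, path] for path in pdf_paths]


def run_batch(pdf_paths, tool="thatOCRtool", dry_run=True):
    """Run the tool on each PDF; with dry_run=True only print the commands."""
    results = []
    for argv in build_commands(pdf_paths, tool):
        if dry_run:
            print(" ".join(argv))  # show what would be executed
            results.append((argv[1], None))
        else:
            completed = subprocess.run(argv)  # argument list, so no shell quoting issues
            results.append((argv[1], completed.returncode))
    return results
```

The dry run is the default so you can check the command lines before converting anything for real.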