A few free programs can "read" the text from a picture PDF, though usually without preserving the formatting. Online PDF converters do keep the formatting, but they are inconvenient for batch processing lots of picture PDFs, and they may struggle if your text is not in English (or is in mixed languages). For robust, large-scale PDF OCR conversion you need to spend a pretty penny on something industrial strength like ABBYY FineReader. It is a big topic, which I will leave for a forthcoming part II blog post.
Things are very simple using the DeskRule search engine. Start DeskRule on your top-level document folder (or the entire THIS PC folder) and search for PDFs that don't contain any text, using -? in the Contents rulebox. The ? is a wildcard matching any character in the text contents, and the leading minus (-) means NOT, so -? matches files with no text content at all.
Sadly, this simple negative wildcard trick for finding empty text content won't work in xplorer². That program searches the raw content of any file, binary files included, regardless of whether it is text or unprintable bytes. So we will use the $xx token to search for unprintable characters instead: e.g. $01 in the Containing text field of the Find files command searches for a byte with value 1, which denotes a picture PDF.
Notice in the search results preview how even picture PDFs contain some regular text, but they also have lots of nonsense bytes representing the image, and these are what identify the unsearchable PDFs.
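If you want to try the same raw-byte test outside xplorer², here is a minimal Python sketch of the idea. The function names `contains_byte` and `find_pdfs_with_byte` are made up for illustration; this is not how xplorer² itself does the search. It simply reports every PDF whose raw content contains a byte with value 1. Treat the hits as candidates only, since compressed streams in ordinary text PDFs may contain that byte too.

```python
from pathlib import Path


def contains_byte(path, value=0x01, chunk_size=1 << 20):
    """Return True if the file at `path` contains the byte `value` anywhere."""
    needle = bytes([value])
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)  # read in chunks so big scans fit in memory
            if not chunk:
                return False  # reached end of file without finding the byte
            if needle in chunk:
                return True


def find_pdfs_with_byte(root, value=0x01):
    """List PDFs under `root` (recursively) whose raw content contains `value`."""
    hits = []
    for p in sorted(Path(root).rglob("*")):
        # compare the extension case-insensitively so .PDF is matched as well
        if p.is_file() and p.suffix.lower() == ".pdf" and contains_byte(p, value):
            hits.append(p)
    return hits


if __name__ == "__main__":
    import sys

    # folder to scan: first command line argument, or the current folder
    for pdf in find_pdfs_with_byte(sys.argv[1] if len(sys.argv) > 1 else "."):
        print(pdf)
```

Save it under any name you like and pass your top-level document folder as the only argument; it prints one candidate PDF per line.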
Now all you need is a PDF OCR tool that takes command-line arguments, and xplorer²'s batch processing engine will convert the files straight from the results window (e.g. thatOCRtool $F). This part is left as a user exercise <g>
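For those who prefer a script to the results window, the same one-command-per-file pattern can be sketched in Python. Note that `thatOCRtool` is only the placeholder name used above, not a real program, and the assumption that it takes the PDF path as its single argument is mine; a real OCR tool will need its own arguments. The helper names `build_ocr_commands` and `run_batch` are hypothetical too.

```python
import subprocess


def build_ocr_commands(pdf_paths, tool="thatOCRtool"):
    """Build one command per PDF, like `thatOCRtool $F` (tool name is a placeholder)."""
    # each command is an argument list, so paths with spaces need no quoting
    return [[tool, str(p)] for p in pdf_paths]


def run_batch(pdf_paths, tool="thatOCRtool"):
    """Run the tool on each PDF in turn; return the paths that failed."""
    failed = []
    for cmd in build_ocr_commands(pdf_paths, tool):
        try:
            result = subprocess.run(cmd)
            ok = result.returncode == 0
        except OSError:
            ok = False  # the tool could not be started (e.g. it is not installed)
        if not ok:
            failed.append(cmd[-1])
    return failed
```

Feed it the list of picture PDFs from your search and check the returned list for anything that needs a second look.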