date 21.Jul.2013

■ Does Adobe Reader support PDF text extraction or not?

I can understand the frustration of the guys who run Adobe: they introduced the most popular portable document format (PDF) yet cannot make money out of it, or at least they make much less than they feel they're due. With the release of Adobe Reader X (version 10) they started castrating their PDF shell integration, in particular the text extraction filter that allows PDFs to be searched for keywords.

The other day I installed the latest Adobe Reader, version 11 (XI), on a Windows 8 machine to see how things fare nowadays. Intriguingly, Adobe have reintroduced the text filter (IFilter) functionality, but somehow it only worked for Windows Search (!) and not for other IFilter-aware programs like xplorer². How did they manage that? I was not alone in wondering about this duality, but no solution was forthcoming.

Technically speaking, Adobe supplied and correctly registered the PDF text extraction filter DLL (ACRORDIF.DLL), but it couldn't be instantiated by any of the common means: neither the LoadIFilter API nor direct COM object creation after looking up the filter object's CLSID in the registry. Was it broken? No, because somehow Windows Search could use it!? Some people argued that the filter had been restricted to STA threading mode (as it was in the old v6 days), but that isn't corroborated by the ThreadingModel registered for the filter DLL. Others talked about running it only through a job object. Adobe support kept tight-lipped and claimed that the restriction was there for our security... ahem.

Anyway, here's a spoiler for Adobe: this is how to obtain the IFilter object in C++ for use in your own program (add error checking as needed). Instead of calling LoadIFilter, you must obtain a stream interface on the PDF file, then create the filter COM object and use its IPersistStream interface to pass it the file to be extracted. Also note that the whole process has to be running inside a job object, or you receive E_FAIL.

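#include <windows.h>
#include <shlwapi.h>   // SHCreateStreamOnFile (link with shlwapi.lib)
#include <atlbase.h>   // CComPtr, CComQIPtr
#include <filter.h>    // IFilter

// fragment: assumed to sit inside a function that returns HRESULT,
// on a thread where COM has already been initialized (CoInitialize)
HANDLE hJob = CreateJobObject(NULL, TEXT("filterProc"));
AssignProcessToJobObject(hJob, GetCurrentProcess());

// get PDF file as stream
CComPtr<IStream> iStream;
HRESULT hr = SHCreateStreamOnFile(TEXT("file.pdf"), STGM_READ, &iStream);
if (FAILED(hr)) return hr;

// directly create adobe PDF Filter {E8978DA6-047F-4E3D-9C78-CDBE46041603}
CComPtr<IFilter> pdf;
const CLSID fixed = {0xE8978DA6, 0x047F, 0x4E3D, {0x9C, 0x78, 0xCD, 0xBE, 0x46, 0x04, 0x16, 0x03}};
hr = pdf.CoCreateInstance(fixed, 0);
// creation fails unless within a job object!
if (FAILED(hr)) return hr;

// pass the file to the filter
CComQIPtr<IPersistStream> pdfStream(pdf);
if (!pdfStream) return E_NOINTERFACE;
hr = pdfStream->Load(iStream);
// from now on proceed as usual with Init()ializing the filter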
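
For readers who haven't driven an IFilter before, the "proceed as usual" step is roughly the standard Init / GetChunk / GetText loop. What follows is a minimal sketch rather than production code: the helper name ExtractText, the 4096-character buffer and the particular IFILTER_INIT flags are my own choices for illustration, and it assumes the filter has already been loaded with the PDF stream as shown above.

```cpp
#include <windows.h>
#include <filter.h>    // IFilter, STAT_CHUNK, CHUNK_TEXT, IFILTER_INIT_*
#include <filterr.h>   // FILTER_E_* and FILTER_S_* result codes
#include <string>

// Hypothetical helper: appends all text chunks of a loaded filter to 'out'.
HRESULT ExtractText(IFilter* filter, std::wstring& out)
{
    ULONG initFlags = 0;
    HRESULT hr = filter->Init(IFILTER_INIT_CANON_PARAGRAPHS |
                              IFILTER_INIT_HARD_LINE_BREAKS |
                              IFILTER_INIT_CANON_HYPHENS |
                              IFILTER_INIT_CANON_SPACES,
                              0, NULL, &initFlags);
    if (FAILED(hr)) return hr;

    for (;;) {
        STAT_CHUNK chunk;
        hr = filter->GetChunk(&chunk);
        if (hr == FILTER_E_END_OF_CHUNKS) return S_OK;   // normal end of document
        if (hr == FILTER_E_EMBEDDING_UNAVAILABLE ||
            hr == FILTER_E_LINK_UNAVAILABLE) continue;   // skip, keep going
        if (FAILED(hr)) return hr;
        if (!(chunk.flags & CHUNK_TEXT)) continue;       // ignore value-only chunks

        WCHAR buf[4096];
        for (;;) {
            ULONG count = 4096;              // in: buffer size, out: chars written
            hr = filter->GetText(&count, buf);
            if (hr == FILTER_E_NO_MORE_TEXT) break;      // chunk exhausted
            if (FAILED(hr)) return hr;
            out.append(buf, count);          // buffer is not guaranteed NUL-terminated
            if (hr == FILTER_S_LAST_TEXT) break;         // nothing more in this chunk
        }
    }
}
```

With the variables from the snippet above, a call would look like ExtractText(pdf, text) once pdfStream->Load has succeeded. Note that FILTER_E_END_OF_CHUNKS and FILTER_E_NO_MORE_TEXT are failure-range HRESULTs, which is why they are tested before FAILED().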

This approach works for version 11 of the Adobe filter. However, that's not the end of the story.

The plot thickens: PDF reader v10

Version XI isn't available on Windows XP; the last supported version there is X. So I ran a quick check on XP to confirm that the above code for initializing the PDF IFilter works... but it didn't!! As usual, Windows Search had no problem finding text, so the handler worked, and so did all the Microsoft filter test tools. Back to head-scratching.

At first I thought they could be building on the job object trick, requiring some particular job name that only FILTDUMP.EXE used. So I wasted a few hours hunting for the job name using Process Explorer, but it looks like FILTDUMP doesn't register a job object at all. Can you guess how the trick works? They hard coded the names of MS tools like FILTDUMP in the PDF filter ACRORDIF.DLL!!! So when the PDF IFilter object is being instantiated, it checks the name of the calling process; if it is on the "whitelist" it works, otherwise it fakes a problem and returns E_FAIL. Scandalous. For proof, rename your program to "filtdump.exe" and as if by magic everything works, even plain LoadIFilter without job objects.
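
Adobe's actual code isn't public, so the following is only a guess at the shape of such a check, not a reconstruction of what ACRORDIF.DLL does. The function names (lowerBaseName, isWhitelistedHost) are hypothetical, and the list holds only the one tool name I can vouch for. In a real DLL the path would presumably come from GetModuleFileName(NULL, ...); here it is passed in as a parameter to keep the sketch portable.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Returns the file-name part of a Windows or POSIX style path, lowercased.
std::string lowerBaseName(const std::string& path)
{
    const std::size_t slash = path.find_last_of("\\/");
    std::string name = (slash == std::string::npos) ? path : path.substr(slash + 1);
    std::transform(name.begin(), name.end(), name.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return name;
}

// Hypothetical whitelist test: true only if the host executable's file name
// matches a hard-coded entry, ignoring case and directory.
bool isWhitelistedHost(const std::string& exePath)
{
    static const char* const allowed[] = { "filtdump.exe" };  // assumed list
    const std::string name = lowerBaseName(exePath);
    for (const char* entry : allowed) {
        if (name == entry) return true;
    }
    return false;
}
```

A check like this looks only at the file name, which is exactly why renaming your own program defeats it: the path C:\Tools\FILTDUMP.EXE passes no matter what the executable really is.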

So it wasn't really sandboxing or security Adobe were after, but a callous attempt to stop 3rd party tools from extracting PDF text. Interestingly, FILTDUMP.EXE is still hard coded in the version 11 DLL, but they must have turned off the hack. Adobe naughty <g> I say dump Adobe Reader altogether (who needs 100MB installs just to read documents?) and go with a better solution like PDF-XChange Viewer. All the shell integration features work (for free), on both 32 and 64 bit Windows. That's the best plugin for use with xplorer² too.

©2002-2013 ZABKAT, all rights reserved