Acrobat 9’s ClearScan is great, but.. er.. selective

In my efforts to convert my piles of photocopies into searchable PDFs, I’ve come across Acrobat 9’s ClearScan option. It’s really nice. It can make some of the smallest and most beautiful scans I’ve ever seen. My guess is that it builds its own internal representation of the fonts and uses that to denoise the image. Whatever it’s doing, the result looks great.

Here’s a side-by-side comparison, with the original on the right and the much, much smaller ClearScan output on the left.

Look how nice and clear the output is. Well, except… hmm…

I don’t know what the root cause is, but for some reason ClearScan will silently throw away bits of your text. This is very uncool. I think on balance, having all of the words is better than having a nice clean printout. I can’t seem to find any discussion of this on the web, though there should be some. This is a serious problem with ClearScan.

Update: It looks like the text isn’t exactly gone, it’s just off the page. Examine Document showed me this:

Furthermore, it seems to choke specifically where it detects a period within a word. According to Examine Document, the text there is not “Kenneth” but “Ke.nneth”. I can understand that, since there’s a dot on the scan, but it seems to throw the alignment totally out of whack. No idea yet how to fix this. I did, however, find that the lost text on another page follows the same pattern: a word with a dot in the middle comes before the “lost” text.
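
If the dot-in-a-word pattern really is the trigger, one rough way to hunt for other casualties without eyeballing every page is to get the text layer out of the PDF as plain text (however you prefer to do that) and scan it for words with a period stuck in the middle. The sketch below is only a heuristic built on that assumption. The function name `find_dotted_words` and the regular expression are mine, not anything Acrobat provides, and it will also flag harmless things like domain names.

```python
import re

# Heuristic (an assumption, not Acrobat behavior): two or more letters,
# a period, then two or more lowercase letters, all inside one token.
# Matches "Ke.nneth"; skips "e.g." and ordinary sentence-ending periods.
DOTTED_WORD = re.compile(r"\b[A-Za-z]{2,}\.[a-z]{2,}\b")


def find_dotted_words(text):
    """Return (line_number, word) pairs for words with an interior period."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in DOTTED_WORD.finditer(line):
            hits.append((line_number, match.group()))
    return hits


# Made-up sample standing in for text exported from a ClearScan PDF.
sample = "Written by Ke.nneth Smith.\nNothing odd here, e.g. this line."
for line_number, word in find_dotted_words(sample):
    print(line_number, word)
```

Anything this flags is just a place worth checking against the original scan, since the text after it may be the part that has wandered off the page.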