Monthly Archive: February 2009

Acrobat 9’s ClearScan is great, but.. er.. selective

In my efforts to convert my piles of photocopies into searchable PDFs, I’ve come across Acrobat 9’s ClearScan option. It’s really nice. It can make some of the smallest and most beautiful scans I’ve ever seen. I suspect it builds its own internal representation of the fonts and uses that to denoise the image. Whatever it’s doing, the result looks great.

ClearScan OCR dialog

Here’s a side-by-side comparison, with the original on the right and the much, much smaller ClearScan output on the left.

After and Before

Look how nice and clear the output is. Well, except… hmm…

I don’t know what the root cause is, but for some reason ClearScan will silently throw away bits of your text. This is very uncool. I think on balance, having all of the words is better than having a nice clean printout. I can’t seem to find any discussion of this on the web, though there should be some. This is a serious problem with ClearScan.

Update: It looks like the text isn’t exactly gone; it’s just been positioned off the page. Examine Document showed me this:

Examine Document

Furthermore, it seems to choke specifically where it detects a period within a word. According to Examine Document, the text there is not “Kenneth” but “Ke.nneth”. I can understand the misread, since there’s a dot on the scan, but it seems to throw the alignment of the text that follows totally out of whack. No idea yet how to fix this. I did, however, find the same pattern where text went missing on another page: a word with a dot in the middle of it, right before the “lost” text.
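Since the trouble seems to follow words with a stray period inside them, one way to find pages worth checking is to search the OCR text for that pattern. Here is a minimal sketch in Python. It assumes you have already gotten the recognized text out of the PDF as plain text somehow (copy and paste, for example); it does not talk to Acrobat at all, and the function name and sample text are made up for illustration. It will also flag harmless things like domain names or “e.g”, so treat the hits as candidates rather than confirmed problems.

```python
import re

# Letters, a single period, then more letters, e.g. "Ke.nneth".
# A period followed by a space or end of line (a normal sentence end) won't match.
EMBEDDED_PERIOD = re.compile(r"\b[A-Za-z]+\.[A-Za-z]+\b")


def find_embedded_periods(text):
    """Return (line_number, token) pairs for words with a period inside them."""
    hits = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in EMBEDDED_PERIOD.finditer(line):
            hits.append((line_number, match.group(0)))
    return hits


# Made-up sample standing in for text exported from a scanned page.
sample = "Edited by Ke.nneth.\nNothing odd here. Next sentence."
print(find_embedded_periods(sample))
```

Any line this reports is a good place to open Examine Document and see whether the text after it has wandered off the page.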