
Acrobat 9’s ClearScan is great, but.. er.. selective

In my efforts to convert my piles of photocopies into searchable PDFs, I’ve come across Acrobat 9’s ClearScan option. It’s really nice. It can make some of the smallest and most beautiful scans I’ve ever seen. I think what it must be doing is using its own internal representation of the fonts to denoise the resulting image, and the result looks great.

ClearScan OCR dialog

Here’s a side-by-side comparison, with the original on the right and the much much smaller ClearScan output on the left.

After and Before

Look how nice and clear the output is. Well, except… hmm…

I don’t know what the root cause is, but for some reason ClearScan will silently throw away bits of your text. This is very uncool. I think on balance, having all of the words is better than having a nice clean printout. I can’t seem to find any discussion of this on the web, though there should be some. This is a serious problem with ClearScan.

Update: It looks like the text isn’t exactly gone, it’s just off the page. Examine Document showed me this:

Examine Document

Furthermore, it seems to choke specifically where it detects a period within a word. According to Examine Document, the text there is not “Kenneth” but “Ke.nneth”. I can understand that, since there’s a dot on the scan, but it seems to throw the alignment totally out of whack. No idea yet how to fix this. I did, however, determine that the lost text on another page had the same property: a dot in the middle of the word just before the “lost” text.
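If the dot-inside-a-word pattern really is the trigger, one way to hunt for likely trouble spots is to scan the OCR text for words like “Ke.nneth”. Here’s a minimal sketch of that idea. It assumes you’ve already gotten the text out of the PDF as a plain string (by copy and paste, say); the function name and the sample text are made up for illustration, and none of this talks to Acrobat.

```python
import re

# Letters on both sides of a single period, e.g. "Ke.nneth".
INTERIOR_DOT = re.compile(r"\b[A-Za-z]+\.[A-Za-z]+\b")


def find_suspect_words(text):
    """Return (line_number, word) pairs for words with a period inside them."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in INTERIOR_DOT.finditer(line):
            hits.append((lineno, match.group()))
    return hits


# Hypothetical sample text, not taken from a real scan.
sample = "Ke.nneth Smith wrote this.\nA clean line."
print(find_suspect_words(sample))
```

This will also flag things like “e.g.” and web addresses, so treat the hits as places to go look at in the PDF, not as confirmed losses.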

2 comments

1 ping

  1. Jay

    Hi Paul,

    I too found ClearScan alluring yet problematic. It produces clear text, free of pixelation and jagged edges, and the file size is small compared to 300 dpi scans.

    But it seems that after ClearScan is applied, the image is discarded. There are times when I have found it necessary to export the image file of the document from Acrobat, and with ClearScan, that is not possible.

    At the moment, I am intrigued by the Acrobat Paper Capture Plugin. It seems that the text in document scans made through Acrobat is treated as “vectors” (fonts), if I am using the right term. The effect is like a line art scan. However, the difference is that Acrobat does not treat it as an image. When the text is straightened in OCR, pixelation does not occur.

    I am unable to instruct Acrobat to treat my scanned document jpegs this way. I usually scan documents using a 3rd party app, and use “Combine to PDF” to compile the PDF.

  2. Tim

    I have batch converted a number of image documents using ClearScan. For some reason, an advanced search across all of the files does not pick out all the instances. When I search each file individually, it finds ones that the advanced search missed. I don’t have the same problem with the same files when they are batch converted with the image+text option.

    Seems strange!

  1. lingtech : Adding an interleave document option to Acrobat (or, ClearScript bites me again)

    […] mentioned the joys and the agonies of Acrobat’s ClearScan technology before. It produces wonderful output, small, clear, OCR’ed, but unreliable. Recently, I discovered […]
