Adding an interleave document option to Acrobat (or, ClearScan bites me again)

I’ve mentioned the joys and the agonies of Acrobat’s ClearScan technology before. It produces wonderful output (small, clear, OCR’ed), but it is unreliable. Recently, I discovered another way in which it is unreliable, apart from sometimes silently cutting off the ends of lines.

The symptom: I noticed that some of my nicely ClearScanned PDFs, although they appeared to have selectable text, were not actually searchable, and copying and pasting from them yielded just a bunch of repeated garbage characters. Not good.
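One rough way to spot this symptom without eyeballing every file is to pull the words off a page and see how repetitive they are. The sketch below is only a heuristic, not a real test for a broken font mapping: the helper name wordStats is made up, and what counts as "too repetitive" is left to your judgment. The commented-out lines show how the word list might be gathered in Acrobat's console using the Doc methods getPageNumWords and getPageNthWord.

```javascript
// Hypothetical helper: summarize how repetitive a list of extracted words is.
// A page of real text usually has many distinct words; a page that pastes
// as the same garbage over and over will have very few.
function wordStats(words) {
  var counts = {};
  var distinct = 0;
  for (var i = 0; i < words.length; i++) {
    // Prefix the key so words like "constructor" cannot collide
    // with built-in object properties.
    var key = "w:" + words[i];
    if (!counts[key]) {
      counts[key] = 0;
      distinct++;
    }
    counts[key]++;
  }
  return {
    total: words.length,
    distinct: distinct,
    distinctRatio: words.length ? distinct / words.length : 0
  };
}

// In the Acrobat console, the words of the first page could be collected
// roughly like this (assumes a document is open and frontmost):
//
//   var words = [];
//   for (var i = 0; i < this.getPageNumWords(0); i++) {
//     words.push(this.getPageNthWord(0, i));
//   }
//   console.println(wordStats(words).toSource());
```

A very low distinctRatio on a page that visibly contains ordinary prose is a hint, and only a hint, that the text layer is damaged.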

The underlying cause: eventually, I realized what the problem was. It has to do with how ClearScan works. What ClearScan does is go through the scanned image and “smooth” it by creating a (vector!) font that closely approximates the shape of each rasterized object (generally, a letter). So, it creates a new, special, document-specific vector font for each document you OCR, and embeds it in the PDF. However, something important about this font information is lost if you ever modify and save this file in something other than Acrobat (say, in Preview). The result seems to be that all information about the font character-to-letter mapping is lost, and so the recognized text is gone (though the PDF still looks beautiful and small).

How I got bitten by this: The reason I lost my font information has to do with my scanning workflow. Generally, I scan things 2-up on a copier (or scan photocopies that I had previously photocopied 2-up), and so when I get this into Acrobat, I create two further files, one cropped to show just the left side pages, and the other cropped to show just the right side pages. Then I OCR each side independently. Then, I merge them using a simple Automator script that just runs the built-in “Combine PDF pages by shuffling” action, so that I have a single 1-up PDF with the pages in the right order. That’s the fatal step. The “Combine PDF pages by shuffling” command creates a new PDF out of the old ones, using Apple’s own internal PDF kit, and in the process, loses/garbles the font mapping.
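To make the goal of the shuffle concrete, here is a minimal sketch in plain JavaScript of the page order it is supposed to produce. Nothing here touches a PDF; the function name interleavePages and the page labels are made up for illustration.

```javascript
// Hypothetical helper: interleave two lists of page labels, taking
// left[0], right[0], left[1], right[1], and so on. If one list is
// longer, its leftover pages simply follow in order.
function interleavePages(left, right) {
  var result = [];
  var longest = Math.max(left.length, right.length);
  for (var i = 0; i < longest; i++) {
    if (i < left.length) result.push(left[i]);
    if (i < right.length) result.push(right[i]);
  }
  return result;
}

// Illustrative labels only: three left-side pages, three right-side pages.
var merged = interleavePages(["L1", "L2", "L3"], ["R1", "R2", "R3"]);
console.log(merged.join(" ")); // prints "L1 R1 L2 R2 L3 R3"
```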

The solution, then, is to interleave the OCR’ed left and right page PDFs within Acrobat, which will preserve this information. Except that, stunningly, Acrobat still doesn’t have a command that does this. Really. It’s crazy. Perhaps even more stunningly, though, searching the web really didn’t find much by way of a solution to this, or even complaints about it. What are people using Acrobat for anyway?

What Acrobat does have is a fairly elaborate JavaScript interpreter that can do all kinds of PDF manipulations and other things, if you’re enough of a nerd to wade through the process of using it. So, I set myself to the task of creating a small JavaScript application that would take two PDFs and make a new PDF from them, alternating pages from each source.

One way this can be done is to copy the following text into a JavaScript file (or download it from here), call it something like interleave-tool.js, and put it in your Acrobat startup scripts folder. On my MacBook, this folder is ~/Library/Application Support/Adobe/Acrobat/9.0_x86/Javascripts. (For Acrobat X, the folder is …/10.0/Javascripts instead.) What it does is install a menu item under the Edit menu (“Interleave document...”) which will ask you to browse to a PDF file, and then insert the pages from the selected file into the currently frontmost document: the first inserted page goes after the first page of the current document, the second after what was originally its second page, and so on. That’s all I needed. Pretty big headache to do a pretty simple (and commonly needed, I should think!) thing.

// add interleave option to Edit menu
// Paul Hagstrom
// September 2010

// Add a menu item for interleaveDocument
app.addMenuItem( {
   cName: "interleaveDocument",
   cUser: "Interleave document...",
   cParent: "Edit",
   cEnable: "event.rc = (event.target != null);",
   cExec: "InterleaveDocument(event.target)"
});

// main interleave function
InterleaveDocument = app.trustedFunction(
	function(doc) {
		// escalate privileges
		app.beginPriv();
		// ask for the document to interleave
		var srcFile = app.browseForDoc({});
		// if they didn't cancel
		if ( typeof srcFile != "undefined" ) {
			// open the source document
			var srcDoc = app.openDoc({
				cPath: srcFile.cPath,
				bHidden: true
			});
			// count the pages
			var srcPages = srcDoc.numPages;
			// and close the source doc
			srcDoc.closeDoc();
			// start inserting after the first page
			var insAfterPage = 0;
			// for each page in the source document
			for(var pg = 0; pg < srcPages; pg++) {
				// insert it into the current document
				doc.insertPages({
					nPage: insAfterPage++,
					cPath: srcFile.cPath,
					nStart: pg
				});
				// increment again past just-added page
				insAfterPage++;
			}
		}
		// de-escalate privileges
		app.endPriv();
	}
);
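
The index arithmetic in that loop is easy to get wrong, so here is a rough simulation of it on plain arrays, runnable outside Acrobat. The names insertAfter and simulateInterleave are hypothetical stand-ins: insertAfter only mimics the way doc.insertPages treats nPage (insert after that 0-based page), and the arrays of labels stand in for the two documents.

```javascript
// Stand-in for doc.insertPages: put `page` after the 0-based index `nPage`.
function insertAfter(pages, nPage, page) {
  pages.splice(nPage + 1, 0, page);
}

// Replays the loop from the script above on arrays of page labels.
function simulateInterleave(docPages, srcPages) {
  var pages = docPages.slice(); // work on a copy of the "current document"
  var insAfterPage = 0;         // start inserting after the first page
  for (var pg = 0; pg < srcPages.length; pg++) {
    insertAfter(pages, insAfterPage, srcPages[pg]);
    // Skip past the page just added and the next original page.
    insAfterPage += 2;
  }
  return pages;
}

var result = simulateInterleave(["L1", "L2", "L3"], ["R1", "R2", "R3"]);
console.log(result.join(" ")); // prints "L1 R1 L2 R2 L3 R3"
```

One caveat about the simulation: when the source has more pages than the current document, the array version quietly appends the extras at the end. Whether Acrobat itself accepts an nPage beyond the last page has not been checked here, so it seems safest to use the script on documents with matching page counts.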

Incidentally, here’s another ridiculous thing that I discovered while trying to debug this script: you can’t use the JavaScript console properly on a MacBook. It doesn’t have the key you need to trigger a script to execute. One workaround that I found by searching the web is to use the keyboard viewer, physically hold down Fn and Ctrl, and use the mouse to click the Enter button on the keyboard viewer. You’ve got to be kidding. An easier way is to use KeyRemap4MacBook, which I am using anyway to get my en-dashes back. I chose to make my right option key into Enter, which is what it should have been in the first place. Turns out, when you do this, just pressing Enter (unmodified) executes the script (despite everything written everywhere suggesting that you need to press Ctrl-Enter).