Several years back, I more or less stopped making photocopies. In part, my ability to stop adding to the pile of dead tree flakes in my office came about when I moved my class communications online; instead of handing out syllabi or other handouts, I put electronic versions of those documents on our class website.
But the most important factor in my all-but-copy-free workstyle was my department’s lease of a new copier with a powerful high-speed scanner and a network connection. Now, instead of photocopying that chapter I need to read, I scan it and have the machine automatically send it to my email address.
Which is fantastic, of course; now I have those pages that I need to annotate in a highly portable digital format. The only problem is that the PDFs that our copier makes are actually pictures of the pages, rather than text-containing documents. As a result, not only are the resulting files extremely large (and thus not easily emailable), they’re also not searchable, and they can’t be digitally annotated using many common desktop tools, such as Mendeley, which Julie wrote about yesterday.
This situation didn’t bother me that much, though, until I started using iAnnotate on my iPad; suddenly the inability to highlight, underline, and search my PDF library became really annoying. So I’ve set about making those scans searchable and annotatable by running them through OCR.
OCR, or optical character recognition, is a process in which a computer examines the pattern of pixels in an image and tries to identify letter forms. This translation of the image of text into actual machine-encoded text is the necessary first step in making scanned-in pages annotatable.
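For readers who like to see an idea in code, here is a toy sketch of that pixel-to-letter step. It is not how Acrobat or any commercial engine actually works; it assumes hypothetical 3x5 bitmaps for three letters and simply picks the template that agrees with a scanned glyph on the most pixels.

```python
# Toy illustration only: real OCR engines are far more sophisticated.
# Each template is a hypothetical 3x5 bitmap; '#' is ink, '.' is blank paper.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}

def similarity(glyph, template):
    """Count the pixels on which a scanned glyph and a template agree."""
    return sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )

def recognize(glyph):
    """Return the letter whose template agrees with the glyph on the most pixels."""
    return max(TEMPLATES, key=lambda letter: similarity(glyph, TEMPLATES[letter]))

# A "T" with one stray mark in the middle row, as a poor copy might produce.
smudged_t = ["###", ".#.", ".##", ".#.", ".#."]
print(recognize(smudged_t))  # prints T
```

Even in this toy version, a single stray mark lowers the match score, which hints at why poor copies and unusual letter shapes can throw the process off.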
OCR remains somewhat problematic, even after the more than 55 years since the first commercial OCR systems were released. Though most OCR software is capable of dealing quite well with most common typefaces, poor quality copies, distorted text, stray marks, or even things like ligatures (the conjoined “fl,” for instance) are enough to throw the process off. Things have gotten better, however, in no small part due to crowd-sourced training. The image that leads off this post shows one aspect of this training: the reCAPTCHA system, used to ensure that comments online are being left by real humans, works by asking you to input one known word and one troublesome word from a scanned text. Your human OCR thus helps to train machine OCR.
Despite these improvements, if you’re using OCR in order to do any text mining, or to create an authoritative digital edition of a text, you’re likely to have to do a bit of correction. For my purposes, however, OCR works quite well; the vast majority of key terms for which I would want to search scan fine, and to create highlightable text, all I really need OCR to do is figure out where the text is.
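If an occasional misread word does matter for your searches, one workaround is to search approximately rather than exactly. The sketch below is only an illustration, using Python's standard difflib module; the sample sentence, the function name fuzzy_find, and the 0.75 cutoff are all assumptions chosen for the example.

```python
import difflib

def fuzzy_find(term, ocr_text, cutoff=0.75):
    """Return words in ocr_text that closely resemble term, best match first."""
    # Strip surrounding punctuation and lowercase so comparison is word to word.
    words = [w.strip(".,;:!?\"'()").lower() for w in ocr_text.split()]
    return difflib.get_close_matches(term.lower(), words, n=5, cutoff=cutoff)

# Hypothetical OCR output containing two misread words.
page = "The wumen who are currenlly enrolled."
print(fuzzy_find("women", page))
print(fuzzy_find("currently", page))
```

A lower cutoff catches more misreadings but also returns more false matches, so the right value depends on how clean your scans are.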
There are several different ways to go about the process of OCRing your PDFs. Many scanners allow you to run OCR as you do your scans in the first place; alas, our department copier doesn’t (which is part of why it’s so fast). Moreover, some information organizers, including DEVONthink Pro Office, which Ryan wrote about a while back, will automatically OCR your PDFs. DEVONthink Pro Office, in fact, comes packaged with the ABBYY FineReader engine, which represents more or less the state of the art in contemporary consumer OCR software.
What I was interested in, however, wasn’t organizing my PDFs—I have a filing system that works quite well. I just wanted them to be searchable and annotatable. Adobe Acrobat Pro handles this task quite well, it turns out. Admittedly, Acrobat can be pricey; I certainly wouldn’t have paid for it separately, but it came as part of the Adobe CS5 suite that I recently picked up at a discounted academic price. In Acrobat, by selecting the Document > OCR Text Recognition > Recognize Text Using OCR menu item, I’m able to extract the text from any set of PDF images. The result is a file that can be searched and marked up using standard PDF tools.
Better still, Acrobat 9 makes use of the ClearScan OCR engine, which produces smaller files. By using the “Save As” command to overwrite the existing file (rather than simply saving), I’ve managed to get just about every PDF that I’ve OCR’d in Acrobat to take up about half the space it previously did, something I’d never have expected from a software package that I was accustomed to finding bloated and unwieldy.
And even better than that, just as I resigned myself to a painful process of opening a PDF, running OCR, doing a Save As command, and closing the PDF—repeat ad nauseam—I discovered a freely downloadable AppleScript droplet for batch OCRing.
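For anyone without Acrobat or a Mac, the same batch idea can be sketched in a few lines of Python. This is not the droplet mentioned above: it is a rough outline that assumes the open-source ocrmypdf command-line tool is installed, and the function names and the "_ocr" file suffix are made up for the example. Unlike the Save As approach, it writes each result to a new file rather than overwriting the original scan.

```python
import subprocess
from pathlib import Path

def pending_pdfs(folder, suffix="_ocr"):
    """List PDFs in folder that do not yet have an OCR'd sibling file."""
    folder = Path(folder)
    return sorted(
        p for p in folder.glob("*.pdf")
        if not p.stem.endswith(suffix)
        and not p.with_name(p.stem + suffix + ".pdf").exists()
    )

def ocr_command(source, suffix="_ocr"):
    """Build the command for one file (assumes the ocrmypdf tool is installed)."""
    target = source.with_name(source.stem + suffix + ".pdf")
    return ["ocrmypdf", str(source), str(target)]

def run_batch(folder, dry_run=True):
    """OCR every pending PDF; with dry_run=True, only print what would be run."""
    commands = [ocr_command(p) for p in pending_pdfs(folder)]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)
    return commands
```

Calling run_batch("scans") on a hypothetical folder named scans lists the commands without running anything; passing dry_run=False performs the OCR, and files that already have an "_ocr" sibling are skipped, so the batch can be stopped and resumed.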
So I’m working, a small batch at a time, on making all of my image-based PDFs machine-readable, and am really happy to find that, in the age of the tablet, an already useful file format is becoming even more useful to me.
But what about you? Do you have a way of processing your PDFs that renders them more usable? Share your suggestions in the comments!
[Image by Flickr user Vitor Lima; Creative Commons licensed.]

19 Responses to OCR Those PDFs
heatherwhitney - July 20, 2010 at 1:03 pm
Thanks for this post! I’d be interested in hearing from others about options for Windows users, hopefully open source choices. This is definitely a project I need to tackle.
peril - July 20, 2010 at 1:37 pm
This is a thing I’ve commented on frequently. OCR is an important part of my daily paperwork life; we scan hundreds of documents a week, all of which need to be searchable.
I can safely say that it is easier to automate on a Mac, but there are plenty of good Winblows tools available too ;)
Regarding odd fonts, the OCR in Adobe’s Acrobat Pro is by far the most capable of deciphering complex font faces. It also has a leg up on the competition in that it detects lines (that is, the horizontal rule of a bit of text) and uses that to straighten the page if the scan was a bit lopsided etc. Very handy.
I’ve posted a few different scripts for automating Acrobat OCRing (and PDFpen OCRing; PDFpen being cheaper, and natively scriptable instead of Acrobat’s GUI scripting :S)
Using Hazel (noodlesoft.com) I let my machine manage files for me. With rules to identify new PDFs, OCR them, then identify and file already OCRed PDFs, my computer handles the majority of my electronic paperwork itself. Very nice.
Other OCR tools, ABBYY, Readiris, and some other free OCR back ends, don’t seem to integrate well into the average person’s work flow. DEVONthink and Evernote, on the other hand, are a bit too work-flow oriented, not performing OCR that can be saved within the document (they don’t create a cleartext layer above the image of the scanned text).
The code to make Hazel process things for me is nearly identical to the droplet’s code, the difference being that Hazel watches for PDFs so I don’t have to remember to drop a file onto the droplet ;)
If anyone is interested here’s that code:
    try
        tell application "Adobe Acrobat Pro"
            activate
            -- theFile is the PDF handed to the script (by Hazel, in this setup)
            open theFile
            -- Acrobat's OCR is reached by GUI scripting: walk the menus via System Events
            tell application "System Events"
                tell process "Acrobat"
                    tell menu bar 1
                        tell menu "Document"
                            tell menu item "OCR Text Recognition"
                                tell menu 1
                                    click menu item "Recognize Text Using OCR…"
                                end tell
                            end tell
                        end tell
                    end tell
                    -- Press Return to accept the OCR dialog's default settings
                    keystroke return
                end tell
            end tell
            save the front document
            close the front document
        end tell
        tell application "Adobe Acrobat Pro" to quit
    end try
If anyone wants more info or help let me know :) I’ll watch the comments.
philosophy - July 20, 2010 at 4:54 pm
I’ve got 30 or so pdf articles, obtained by Interlibrary Loan. Almost all are images, two printed pages per single pdf page, most have to be rotated to be readable, and they have various restrictions due to the copyright business, such as they will not attach to an email. I’d really like to be able to search them for keywords. Will some of the software mentioned above do the trick? Or is there some way our ILL library staff could make the change before making the pdf available to me? My first inquiry – today! – got this sort of response: it depends on how the sending library (in this case, Library of Congress!) did the scanning; if it didn’t use OCR, there’s nothing we can do about it. In my experience, almost all ILL requests are sent with image scanning. Comments? Recommendations?
kfitz - July 20, 2010 at 11:08 pm
@philosophy: The case of the article scanned without OCR is exactly what I’m writing about; it can still be OCR’d after the fact. It’s likely your ILL staff won’t be able to take that on, but it’s very easily done yourself, with the right software. Adobe Acrobat Pro, which I’m using, will do it for you quite easily, but there are other tools around that will, too.
@peril: Thanks for posting this! I’ve been using Hazel, but hadn’t done any such coding with it. I think I see how the script works, and I’ll have to try it out!
daveapostles - July 21, 2010 at 6:20 am
It all seems a lot of trouble. I’m just as happy to use Ctrl/F on each .pdf. I imagine that OpenOffice is about to have an import .pdf function (in addition to its longstanding export to .pdf), although I am not sure about it. PdfEdit may also be helpful, depending on the sort of .pdf.
george_h_williams - July 21, 2010 at 7:29 am
@daveapostles: Ctrl/F won’t work on a PDF that’s just a scanned image, which is the kind of PDF Kathleen is describing.
jabberwocky12 - July 21, 2010 at 7:35 am
@daveapostles: To use OpenOffice to import pdfs, do the following:
- download and run the file “sun-pdfimport.oxt” (http://extensions.services.openoffice.org/project/pdfimport) (You only have to do this once)
- open “Drawing” or “Presentation”
- open the pdf
It’s reasonably good. Of course, if the pdf is simply an image, that won’t get you very far, I’m afraid.
provcoll2 - July 21, 2010 at 8:19 am
The article mentions needing a separate add-on to Acrobat Professional 9 to do batch processing. Instead of bloating your hard drive with more software, you can do batch processing in Acrobat Professional itself. Instead of opening a document and telling the program to OCR it, start with a blank page and go to the OCR function; it will then ask you to select the files you want to OCR. You build the list of files and let the program run. I’ll run this while I’m away from the computer for a while or overnight. The documents are then ready for me the next morning or when I come back. The only problem is that occasionally Acrobat Professional encounters a page that it cannot convert, reports that, and requests a response. That interrupts the process until it gets a response. Another nice feature is that it doesn’t make any difference what the orientation of the page is (upside down, sideways); Acrobat will render all the pages to be readable in the same orientation. If there’s mixed orientation, such as a book with a chart printed in landscape mode beside a page in portrait mode, Acrobat makes its best guess at the orientation. If it’s a single page in landscape mode, it will be rotated automatically to be readable.
mdzehnder - July 21, 2010 at 9:10 am
[Comment deleted by editor. Please read the ProfHacker Commenting and Community Guidelines. Thanks!]
edeldice7 - July 21, 2010 at 9:54 am
Google Apps for Education now allows you to OCR uploaded pdfs. See http://googledocs.blogspot.com/2010/06/optical-character-recognition-ocr-in.html for further information. As with most free OCR programs, it isn’t perfect and you lose most, if not all, of the formatting. But, if all you are looking for is a way to work with the text (i.e. search), this is a great tool that is free to use for all Google Apps for Education users.
mhick255 - July 21, 2010 at 10:10 am
@peril – I have scanned PDFs in DEVONthink that I can export with the OCR text retained. DT recently updated to v. 2 and updated its OCR engine, as well – maybe you were using an older version? Also, I believe DT Pro Office offers several options when you scan to PDF (image only, text over image, text only, etc.). I was using DT Pro Office during its beta trial and now have only DT Pro, which doesn’t include OCR, so I can’t confirm those scanning options. (Note for anyone considering DT for OCR: only the most expensive version, DT Pro Office, includes OCR. The other versions are great, too, but make sure you’re buying the right one.) My experience with Evernote is the same as yours, though. I don’t think you can export PDFs with the OCR intact. Does anybody use Evernote who can speak with more experience about its OCR?
lee77 - July 21, 2010 at 11:25 am
Just a security item – once OCR’d, documents become more accessible to the bad guys as well, which could be a problem if you OCR documents with sensitive info like SSNs, or CCNs. Probably not an issue in the cases I generally see described above, but in case folks were OCRing personal documents…
ephotog - July 21, 2010 at 1:05 pm
With PaperPort by nuance.com, one just drags the pdf file onto a word processing icon (could just be WordPad or Notepad). An example of results is at http://www.starr.net/is/pdf-txt.pdf. The upper part of the page shows the OCR result for a complicated, multi-column article. After just several words, the OCR was excellent. For a single column article, the results are likely to be perfect from the beginning.
Thanks for all these interesting ProfHacker articles!
mhick255 - July 21, 2010 at 2:46 pm
@ edeldice7- Wow, I had no idea. Thanks! BTW, it works with any Google Docs account, not just the Apps for Education version (I used it on my plain-vanilla Docs account), AND it also works on images, not just PDFs. I uploaded a jpg screenshot of a website, and Google Docs extracted the text, with the kinds of mistakes you can expect from free OCR (“women” turned into “wumen,” “currently” into “currenlly,” etc.).
kfitz - July 21, 2010 at 3:04 pm
@edeldice7 and @mhick255: I contemplated including the Google Docs OCR in this post, but I’ve heard that there are problems with its accuracy level, so decided to leave it out. Not to mention that I’ve got *hundreds* of PDFs that require OCRing, and the overhead of uploading them would be a bit intense.
@provcoll2: I did know about the batch-OCR function built into Acrobat, but my sense was that it doesn’t allow for the “Save As” command, which is what magically reduces the resulting PDFs’ filesize. Am I wrong there? Also, the bit of software I mentioned isn’t an Acrobat plugin, but an AppleScript droplet, a very tiny (102KB) file that allows you to grab a bunch of files and drag-and-drop them on top of it, whereupon it works its magic, including the “Save As” part.
aindrias_hiort - July 21, 2010 at 3:06 pm
OK. At the university where I worked (St. Francis Xavier), we used ABBYY FineReader. We scan a lot of Gaelic language documents, so we needed a reader that picks up acute and grave accent marks. There are two things I’d like to point out:
1. When you do OCR on a document and put it online, the search engine crawlers pick up not just the text of the html page, but of the document itself. So if you write a paper and do OCR on it and post it online, people can do a search on the words that you use in the paper and can be driven right to your paper.
2. I have a Mac. When I do an OCR on a .pdf and save it on my hard drive, the Mac indexes the file and everything in it; so when I do a search for a phrase or word using Finder (not my favorite .app), within one second my computer pops up with every document on my hard drive that has that phrase or word.
In short, OCR-ing .pdfs not only helps you to cull through papers that you’ve saved from other people, and thereby improve your writing content, but it also makes your work accessible to people searching for information on the internet and using you as a reference.
jmjohnso - July 22, 2010 at 11:44 am
Thank you for the tip. This is great but Adobe is definitely a bit pricey. Are there any more affordable options–even freeware–that do the same thing? I saw the link to Google Docs in one of the comments… Thank you!!!
drjeff - July 22, 2010 at 4:59 pm
I downloaded some free OCRs for Windows from Download.com for my wife. She reports that Top OCR works pretty well. And the price is certainly right. (There are, I think, a half-dozen or so free ones, some of which work pretty badly.)
drgunn - August 10, 2010 at 3:13 pm
Mendeley is a good (and free) option for academic PDFs. You can search across the full text of all your papers and it organizes them for you, too. There’s a good Chronicle review here: http://chronicle.com/blogPost/Using-Mendeley-for-Research/25627/