gImageReader (runs on Linux and Windows) is a GUI for tesseract-ocr, a free software optical character recognition (OCR) engine which you can use to extract text from PDF documents or images.
gImageReader lets you select columns or parts of a document, spell check the output and more, but until now it couldn't recognize a whole document at once. The latest gImageReader 0.9 changes that by adding multipage recognition for multipage PDFs. You can also set gImageReader to extract the text from a page range only, if you don't need it to recognize the whole document.
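To make the page range idea a bit more concrete, here is a minimal Python sketch of how a range such as "1-3,7" could be expanded into the list of pages to recognize. This is only an illustration: the "1-3,7" syntax and the `parse_page_range` function are assumptions made for this example, not gImageReader's actual input format or code.

```python
from typing import List


def parse_page_range(spec: str, total_pages: int) -> List[int]:
    """Expand a spec like "1-3,7" into a sorted list of 1-based page numbers.

    Hypothetical helper for illustration; not taken from gImageReader.
    """
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # ignore empty entries such as a trailing comma
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        # Reject ranges that fall outside the document or run backwards.
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"invalid page range: {part!r}")
        pages.update(range(start, end + 1))
    return sorted(pages)


if __name__ == "__main__":
    # Pages to recognize in a 36-page document.
    print(parse_page_range("1-3,7", 36))
```

Overlapping entries are merged because the pages are collected in a set, so each page would only be recognized once.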
Besides this very useful (and much needed!) new feature, gImageReader 0.9 also comes with:
- new language profiles: Chinese, Korean, Japanese, Hebrew, Arabic, Croatian
- all formats supported by gdk_pixbuf added to the file filter of the open dialog
- option to cancel the recognition
- fixed auto-installing of new dictionaries (previously, new dictionaries would not appear in the main language selector until the program was restarted)
- many other minor improvements and bug fixes
How about the speed, you may ask? Well, in my test, gImageReader recognized a 36-page PDF document in 1,10 minutes (on a rather slow computer I have at work).
For a slightly more detailed post on gImageReader, which also covers installing the latest Tesseract OCR in Ubuntu 10.10 and 10.04 (it comes with more languages and much improved recognition, but is experimental!), see: Extract Text From PDFs And Images With gImageReader, A Tesseract OCR GUI
Download gImageReader (.deb, .rpm and .exe files available)
Thanks to lffl.org for the news!