A software tool capable of performing Optical Character Recognition (OCR) on a set of images. It achieves this by analysing sets of pixels in an image and cross-matching them against a dictionary of words. OmniPage automates large sections of the digitisation process, enabling physical objects to be scanned, processed using the OCR software and exported to a document file format. Later versions of the software incorporate image enhancement features to improve scan quality (and recognition results), along with better support for complex page layouts and forms.
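The dictionary cross-matching step described above can be illustrated with a toy sketch. This is not OmniPage's actual algorithm, which is not documented here; it only assumes that a recogniser has already produced a candidate word and shows how a word list might be used to correct it. The function name, the word list and the similarity cutoff are all hypothetical, and the matching relies on Python's standard difflib module.

```python
import difflib

# Hypothetical word list standing in for the OCR dictionary.
DICTIONARY = ["recognition", "recreation", "cognition", "character"]


def correct_word(candidate, dictionary=DICTIONARY, cutoff=0.6):
    """Return the closest dictionary word, or the candidate unchanged.

    Illustrative only: a real OCR engine matches on pixel-level
    features, not just on the spelling of an already-decoded word.
    """
    matches = difflib.get_close_matches(candidate, dictionary, n=1, cutoff=cutoff)
    # Keep the raw candidate when nothing in the dictionary is similar enough.
    return matches[0] if matches else candidate


if __name__ == "__main__":
    print(correct_word("recogniton"))
    print(correct_word("xyzzy"))
```

A misread such as "recogniton" is close enough to a dictionary entry to be replaced, while a string with no similar entry is passed through untouched so that names and unknown words are not lost.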
Combined with the Leptonica Image Processing Library, Tesseract can read a wide variety of image formats and convert them to text in over 40 languages.
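As a rough sketch of converting an image to text, the snippet below drives the tesseract command-line program from Python. It assumes the tesseract executable is installed and on the PATH, and that the requested language data (for example "eng" or "deu") is available; the helper names and file names are hypothetical and chosen only for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for a basic tesseract CLI call."""
    # CLI form: tesseract <image> <output base> -l <language code>
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, output_base, lang="eng"):
    """Run tesseract on one image and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(build_tesseract_command(image_path, output_base, lang), check=True)
    # tesseract writes plain-text output to <output base>.txt
    return Path(f"{output_base}.txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # "scan.png" is a placeholder input file name.
    print(build_tesseract_command("scan.png", "scan_out", lang="deu"))
```

Keeping the command construction separate from the subprocess call makes it possible to check the arguments without having the engine installed.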
The code is a raw OCR engine: it has no output formatting and no UI, although it can distinguish fixed-pitch from proportional text. Nevertheless, in 1995 the engine ranked in the top 3 for character accuracy, and it compiles and runs on both Linux and Windows. Training code is included in the open source release.
The core developer on the project is Ray Smith (theraysmith).