Digital Content Creation

 


 

415 Library, MC-522
1408 W. Gregory
Urbana, IL 61801

(217) 244-2062

Email: digicc [at] library.illinois.edu


5.0 Best Practices for Optical Character Recognition

Introduction

Optical character recognition (OCR) is a process in which specialized software converts scanned images of text into electronic text so that the digitized texts can be searched, indexed, and retrieved.  The recommended software for OCR creation is ABBYY FineReader; however, Adobe Acrobat can produce high-quality OCR for clear, crisp, and structurally uncomplicated texts in a variety of languages.

Table of Contents

5.1 Uses of OCR output

5.2 Functionality of OCR Software

5.3 Output File Formats

5.4 Factors Affecting Accuracy of OCR

5.5 OCR Correction and Rekeying

_____________________________________________________________________________

5.1 Uses of OCR output

 

There are several uses for the output of the optical character recognition process[1]:

 

5.2 Functionality of OCR Software

 

Text must be scanned or digitally photographed and saved in either an image format or PDF (Portable Document Format) before it is run through an OCR program. OCR software converts the patterns of light and dark found in a digital image of a page of text into text characters and saves them in an encoding that computers can search or index, such as Unicode or ASCII.  OCR software generally employs a wide variety of language dictionaries.  ABBYY FineReader v. 10 can read over 180 languages, as well as common programming languages, numbers, and simple chemical formulae.  ABBYY FineReader can read multilingual documents, and includes dictionaries and spell-checking capabilities for 39 languages.  Adobe Acrobat 9 can read over 40 languages and has some basic spell-checking capabilities. Generally, operators select the language(s) of the document from a drop-down list before starting the OCR process.
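The idea of turning light and dark pixels into encoded characters can be illustrated with a deliberately tiny sketch. It is a toy under stated assumptions, not a description of how ABBYY FineReader or Adobe Acrobat work internally: the 3x5 glyph templates, the threshold of 128, and the function names binarize and recognize are all hypothetical and chosen only for illustration.

```python
# Toy illustration only: commercial OCR engines use far more sophisticated methods.
# The templates, threshold, and function names below are hypothetical.

TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def binarize(gray_rows, threshold=128):
    """Map grayscale values (0 = black, 255 = white) to dark '#' or light '.'."""
    return ["".join("#" if px < threshold else "." for px in row)
            for row in gray_rows]


def recognize(glyph):
    """Return the template character that agrees with the glyph on the most pixels."""
    def score(template):
        return sum(a == b
                   for t_row, g_row in zip(template, glyph)
                   for a, b in zip(t_row, g_row))
    return max(TEMPLATES, key=lambda ch: score(TEMPLATES[ch]))


# A slightly noisy "scan" of the letter L: the top-right pixel is a dark smudge.
scanned_L = [
    [20, 240, 100],
    [15, 250, 245],
    [30, 235, 250],
    [10, 245, 240],
    [25,  40,  35],
]

text = recognize(binarize(scanned_L))   # best-matching character
encoded = text.encode("ascii")          # stored as searchable character data
print(text, encoded)
```

Even in this toy, a single smudged pixel lowers the match score, which hints at why scan quality matters so much for OCR accuracy.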

"Reading" digitized text is the primary function of OCR software; however, some OCR programs (e.g., ABBYY FineReader) have other functions aimed at either improving the accuracy of the OCR results or speeding up the OCR process. Among these functions are:

5.3 Output File Formats


Optical character recognition should be performed on printed textual materials to make the digitized version easier to search and access.

 

5.4 Factors Affecting Accuracy of OCR

 

Most commercial software packages claim OCR accuracy rates of between 97% and 99%.  These rates are based on character errors, not word errors.  Because a single misread character makes the whole word wrong, a document in which 97% of characters are accurate may have only 75% of its words spelled correctly.  Any of the following factors can also affect the accuracy of the OCR:

 

Textual considerations

 

Scanning considerations that affect the accuracy of OCR include:
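Returning to the distinction between character accuracy and word accuracy made at the start of this section, the gap between the two can be sketched with simple arithmetic. The snippet below assumes that each character is misread independently of its neighbors, which is a simplification introduced here and not a claim from the source; the function name expected_word_accuracy is hypothetical. Under that assumption, a word is correct only if every one of its characters is correct.

```python
def expected_word_accuracy(char_accuracy, word_length):
    """Chance that all characters of a word are correct.

    Assumes character errors are independent (a simplifying assumption).
    """
    return char_accuracy ** word_length


# Compare word-level accuracy for several word lengths at 97% character accuracy.
for length in (5, 9, 10):
    rate = expected_word_accuracy(0.97, length)
    print(f"{length}-character words: {rate:.1%} expected to be fully correct")
```

Under this simplified model, 97% character accuracy falls to roughly three quarters word accuracy for words of nine to ten characters. Real documents will differ, since errors tend to cluster in damaged or poorly scanned regions.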

 

5.5 OCR Correction and Rekeying

 

If the intended use of the OCR'd text requires 100% accuracy, two options are available:

 

___________________________________________________

[1] Tanner, Simon. Deciding Whether Optical Character Recognition is Feasible. KDCS, 2004. (http://www.odl.ox.ac.uk/papers/OCRFeasibility_final.pdf)

[2] Booth, Jon M., et al. Optimizing OCR Accuracy on Older Documents: A Study of Scan Mode, File Enhancement, and Software Products. USGPO, 2006. (http://www.gpo.gov/pdfs/fdsys-info/documents/WhitePaper-OptimizingOCRAccuracy.pdf)

 
