Digitized Newspapers and Optical Character Recognition (OCR)
Text correction improves the accuracy of keyword searches in the Illinois Digital Newspaper Collections (IDNC). The text correction module enables users to correct errors introduced during the process of newspaper digitization. Over time, and thanks to the efforts of our volunteer text correctors, these text corrections improve the accuracy of the searchable text.
When a newspaper is digitized, Optical Character Recognition (OCR) software is used to generate searchable text. The resulting text is often called “OCR text,” to distinguish it from the text users see in the digitized image of the newspaper.
In most digitized newspaper collections (like Newspapers.com), the OCR text remains hidden and users never see the text they are actually searching. What you see in those collections are essentially digital photographs of the newspaper pages. Without OCR, those pages would remain unsearchable.
OCR enables users to search large quantities of full-text data, but it is never 100% accurate. The level of accuracy depends on a number of factors, including the quality of the original print issue, its condition at the time of microfilming, the level of detail captured by the scanner, and the quality of the OCR software. Problems like dirty or damaged pages, thin paper, small print, mixed fonts, and complex page layouts can reduce OCR accuracy.
The IDNC’s text correction module gives you a side-by-side view of the OCR text and the digitized page image. Here is an example of poor OCR:
In the above example, that first line of OCR text was the software’s attempt to render the title of the article, “THE RING”:
~\ t * i- ? jS 1 r- < JT * ¦ ¦ – < 7 t-s ,-v > . – _ _ THE BI ^ G .
The article image on the right is difficult enough for a human to read, so you can imagine how tricky it is for computer software, which begins by trying to identify discrete shapes and match them with letters.
Anyone can participate in text correction. See below for instructions on how to get started.
Instructions for Correcting Text
Create an Account
To begin correcting text, you must register as a user on the Illinois Digital Newspaper Collections website. Click “Register” in the upper right corner of the screen. A verification email will be sent to your email address. Once your address is verified, you can log in to the IDNC and begin correcting text.
Access the text correction interface
Once you enter the newspaper viewer (either from the search results screen, or from the browse screen), you will see that the newspaper viewer is divided into two parts: the right side displays the page images, and the left side is the text correction interface, where you can view and correct the OCR text.
When you move your mouse over the page images in the right pane, the blocks that compose a page will highlight. You can scroll this view by dragging with the mouse, or zoom in/out using the buttons above the viewer. Clicking a highlighted block will select it and load a form for editing that block into the left pane.
You can also open the text correction form in either of these ways:
- Select the article or page you want to correct. This will display the text in the left pane of the document viewer. Click on the “Correct this text” link that appears above this text.
- Right-click on the article or page image and select “Correct article text” or “Correct page text” from the options pop-up window.
Correct the text line by line. A red box is displayed in the right pane to help you determine what text should be included in the line. Once you have finished correcting text, click “Save.” The changes you make will take effect immediately. Alternatively, clicking the “Cancel” button will discard any unsaved changes you have made.
You can then make further corrections to the same block, move on to the next block by clicking the “Save and Next” button, select another block in the right pane, or exit the text correction view by clicking the “Return to viewing mode” link.
Save your work
Clicking “Save & exit” instead of “Save” will save the changes and then return you to the normal viewing mode automatically.
Additional IDNC features
If you want to add tags, use the tags section (Add Tags) in the left window, at the end of the text being corrected. Tags can be browsed and used to narrow down searches into subject areas.
If you find corrections that do not match the original text, you may change them back to the original text. If the corrections appear to be intentional vandalism, please report them to email@example.com.
Guidelines for Correcting Text
You do not have to correct blank spaces or miscellaneous punctuation and symbols, but you may if you wish.
If you come across a spelling error, type the word as printed and follow with the correct spelling in square brackets [ ] to improve searchability. The following example has three spelling errors:
The text correction for the above text should be as follows:
You might find words that seem to be misspelled, but are not. Spelling, like language itself, changes, and even varies within a single time period. Treat older or variant spellings the same way you treat misspelled words: preserve the original spelling as you see it on the page, but also feel free to add in square brackets a modernized spelling, or a variant spelling that you believe searchers are more likely to use in a query.
In the above example, “connexion” is not misspelled: it is an older spelling of “connection”.
Place names and personal names are frequently spelled differently in older newspapers than they are spelled today. For example, “Urbanna” is commonly found in nineteenth-century newspapers as an accepted spelling for the city of Urbana. Minnesota, on the other hand, was often spelled with a single “n”: Minesota. The Sauk tribe of American Indians was often referred to as “Sac” or “Sac Indians.” As with misspelled words, you should retain the spelling as you see it in the original, and, if you wish, add a modernized (or standardized) spelling in brackets.
Use comments or tags for more complicated interpolations. For example, a married woman will commonly be referenced by her husband’s name, even after he has died:
Obviously you won’t always know the person’s own first name, or even if the name printed is the husband’s name or the wife’s. If you can be confident that you do know, however, then consider adding her actual name as a tag: “Bertha Palmer.”
Meskwaki Indians were usually called “Fox” Indians. Again, consider adding the standardized form of the name as a tag rather than as a text correction, since “Fox” is not, strictly speaking, a variant spelling.
If you are unable to make out the original word, use square brackets to mark it as [illegible].
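To see why the bracket conventions in these guidelines help searchers, here is a small sketch of how a corrected line could be turned into search terms. It is a simplified illustration under stated assumptions, not a description of how the IDNC actually indexes text: the function name, the word-matching rule, and the sample line are all hypothetical.

```python
import re

# Text inside one pair of square brackets, e.g. "[connection]".
BRACKETED = re.compile(r"\[([^\]]*)\]")
# A very simple notion of a "word": runs of letters and apostrophes.
WORD = re.compile(r"[A-Za-z']+")


def searchable_terms(corrected_line: str) -> set[str]:
    """Collect lowercase terms from both the printed words and the
    bracketed spellings, ignoring the [illegible] marker."""
    terms: set[str] = set()

    # Bracketed additions: modern or corrected spellings added by a corrector.
    for inner in BRACKETED.findall(corrected_line):
        if inner.strip().lower() != "illegible":
            terms.update(word.lower() for word in WORD.findall(inner))

    # Words as printed: everything outside the brackets.
    printed = BRACKETED.sub(" ", corrected_line)
    terms.update(word.lower() for word in WORD.findall(printed))
    return terms


# Hypothetical corrected line using spellings mentioned in these guidelines.
line = "In connexion [connection] with the Urbanna [Urbana] fair [illegible]"
terms = searchable_terms(line)
print(sorted(terms))
```

Under these assumptions, a search for either “connexion” or “connection” would find the line, while the [illegible] marker adds nothing to the searchable terms.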
If a line of OCR text has been skipped entirely, then add the missing line of text to the end of the line above. If there is no preceding line, then add the text to the start of the following line. Where possible, make sure that the start of each line matches the start of the original line of text.
Transcribe the text in the correct reading order.
In situations where it’s not possible to reproduce the text as it appears on the page, just make sure the words are represented in the nearest available text-correction box.
Once you have completed corrections for a block of text, please check the “This block is completely correct” box. A block should still be marked as “completely correct” even if it contains some text marked as [illegible].
Sometimes a graphic with no textual content has been scanned as text, and you will be prompted to correct it. If a graphic contains no text, just delete the text that appears in the text correction box, and mark the block as correct.