415 Library, MC-522
1408 W. Gregory
Urbana, IL 61801
Email: digicc [at] library.illinois.edu
This section includes a summary of the current prevailing best practices for newspaper digitization.
The single most critical factor in the success of newspaper digitization is the availability of good quality microfilm. Although it is possible to digitize newspapers from an original print copy, this process is very labor-intensive and considerably more expensive than digitizing from film.
Another key consideration is the platform to be used to deliver the digital content. The choice of platform will drive some decisions regarding technical specifications. Although the goal of the National Digital Newspaper Program (NDNP) is to generate a set of best practices and national standards for newspaper digitization, there is currently considerable variation in practice and no consensus regarding several major issues.
For example, NDNP does not offer subpage-level segmentation, also called article zoning. Most members of the newspaper digitization community, however, do advocate article segmentation. With regard to scanning requirements, there are as many proponents of 8-bit grayscale as there are of bitonal scanning. The particular content management/delivery system may also determine some technical specifications.
In considering source material, newspapers published before 1923 are in the public domain and may be freely digitized. Orphaned post-1923 titles may also be freely digitized. If a newspaper that began publication before 1923 is still being published, there may be compelling reasons not to digitize pre-1923 content without the permission of the publisher. Post-1923 titles still in publication can be digitized only with the permission of the publisher.
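The decision rules above can be summarized in a short sketch. The function and parameter names here are illustrative, not part of any standard; note also that "orphaned" refers to rights status, not merely to a title having ceased publication.

```python
def may_digitize_freely(content_year, title_is_orphaned, have_permission):
    """Sketch of the copyright rules above (illustrative names only)."""
    if content_year < 1923:
        # Pre-1923 content is in the public domain, though the guidelines
        # note there may still be reasons to seek permission if the title
        # is still being published.
        return True
    if title_is_orphaned:
        return True  # orphaned post-1923 titles may be digitized
    # Post-1923 content from a title still in publication requires
    # the publisher's permission.
    return have_permission
```

A still-published 1950 issue, for instance, could be digitized only with permission, while any 1900 issue could be digitized freely.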
Microfilm may be unsuitable for digitization due to many factors:
Generally, film produced following United States Newspaper Program Guidelines (established in the mid-1980s) and RLG preservation microfilm guidelines (established in the early 1990s) yields the best results. The USNP guidelines stipulate:
Film used for newspaper digitization should be a clean second-generation silver duplicate negative. (Negative film offers less noise and better contrast, and scratches are easier to correct. Positive film is a third-generation copy with lower resolution, which produces poor OCR results.) The polarity will be reversed during scanning. Scanning from service copies should be avoided. In addition, film used for newspaper digitization should be polyester rather than acetate. Polyester film is stable and durable. Acetate film should be duplicated to polyester stock before scanning.
[Film produced before 1970 is probably on acetate stock. Film produced between 1970 and the late 1980s may be on acetate stock. Hold a wound roll of film up to the light and examine the side of the roll. If no light shows through, it is probably acetate. Also, if it is curled, warped, buckled, brittle, blistered, or smells of vinegar (a sign of acetate decay), it is probably acetate.]
Microfilm, especially microfilm of newspapers, is not perfect. Even if the resolution, reduction ratio, and densities are less than optimal, sample scans can be made to test the usability of the OCR output.
The Post-2008 naming convention applies to projects starting after March 2008. Continuing titles that began prior to 2008 will continue to use the Pre-2008 convention. The Post-2008 convention is based on a combination of the calendar date, section, edition, and page number of a given newspaper page. Each TIFF file name should include the following elements, in precisely this order:
The Pre-2008 convention is based on a combination of the calendar date and page number of a given newspaper page.
Each TIFF file name should include the following elements, in precisely this order:
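As a purely illustrative sketch of a date-plus-page naming scheme, the following builds a Pre-2008-style filename. The exact element order and padding shown here are hypothetical; the project's actual specification governs the real format.

```python
from datetime import date

def pre2008_tiff_name(pub_date: date, page: int) -> str:
    # Hypothetical layout: YYYYMMDD followed by a zero-padded page
    # number. Consult the actual convention for the exact fields.
    return f"{pub_date:%Y%m%d}-{page:04d}.tif"

pre2008_tiff_name(date(1898, 7, 4), 3)  # "18980704-0003.tif"
```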
Olive will not accept files that use a different naming convention. When unresolvable duplicates or discrepancies in page/issue/date numbering exist, the vendor will save the scan in a separate directory named "Error" or something similar. Files in this folder must be examined and resolved by UIUC before being supplied to Olive. (Note: the Post-2008 convention remedies this file naming problem.)
Files in the "Error" folder will follow the above convention with the addition of a CopyNo to indicate the duplicate version.
Distillation (Segmentation, OCR, Output to XML)
Distillation processing is performed by Olive and includes image analysis, article segmentation, OCR processing, and output to XML. Information on the distillation process, as provided by Olive Software in 2007, is detailed in this section.
The automatic segmentation process is tasked with recognizing newspaper information objects or entities - these can be articles, pictures, or ads. It also recognizes each entity's internal components (in an article, for example, these include title, subtitle, byline, and body text). All this is done through analysis of page layout geometry and the fonts used on each page.
Once segmentation has been performed, the print edition is converted to a "Digital Newspaper." A Digital Newspaper consists of images and XML files. The images are rectangular snapshots which can be used to build up every information object in the newspaper; the XML files record the text, structure and layout of the document.
The distillation process was designed to overcome the inherent problems associated with the conversion of scanned images and microfilm as well as the inability of OCR programs to properly read page layout geometry. Distillation is a five-step process: Image analysis, Layout analysis, OCR, Entity building and Output to XML.
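The five steps above can be sketched as a simple pipeline. Every function body below is a stand-in stub with invented return values; Olive's actual implementation is proprietary and is not reproduced here.

```python
# Illustrative skeleton of the five-step distillation pipeline.
# All function bodies are stubs, not Olive's actual code.

def image_analysis(page):
    """Find horizontal/vertical lines, text strings, picture regions."""
    return {"lines": [], "text_strings": [], "pictures": []}

def layout_analysis(features):
    """Group features into text regions classified as body text or titles."""
    return [{"kind": "title", "bbox": (0, 0, 600, 40)},
            {"kind": "body", "bbox": (0, 50, 600, 900)}]

def run_ocr(regions):
    """OCR each small text rectangle found by layout analysis."""
    return [dict(r, text="(recognized text)") for r in regions]

def build_entities(recognized):
    """Collate text objects into entities: articles, pictures, ads."""
    return [{"type": "article", "components": recognized}]

def output_xml(entities):
    """Write structure, layout, and text to non-proprietary XML."""
    return "<issue><article/></issue>"

def distill(page):
    features = image_analysis(page)
    regions = layout_analysis(features)
    recognized = run_ocr(regions)
    entities = build_entities(recognized)
    return output_xml(entities)
```

The staging matters: because OCR runs only on the small text rectangles produced by layout analysis, the engine never wastes effort on pictures or dead areas.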
This stage is crucial to the distilling process: the page image is analyzed to find horizontal and vertical lines, text strings, and picture regions. Nonlinear distortion, combined with the complex layout of the newspaper page, makes life difficult for OCR software. If the OCR engine ignores entire page regions, mistakenly treating them as dead areas or pictures, segmentation is compromised.
Scanned images suffer from nonlinear distortion - distortion that cannot be predicted and compensated for. A few examples of nonlinear distortion:
The segmentation engine used in digital materials was adapted for this stage of the process. Working like a human eye, the segmentation engine views a newspaper page from a distance and analyzes the geometry of the page using lines and shapes recognized in image analysis. It builds a net of image objects, examining alignment, size, brightness, and other characteristics of groups of elements on the grid. The result is a rough page structure definition, which includes text regions, classified as body text or titles.
After separate image analysis and layout analysis have been completed, the OCR process is performed on each of the text regions detected by the layout analyzer. This way, the OCR engine can work on relatively small rectangles, all of which contain text. The precision with which these regions are detected has a huge impact on overall OCR accuracy. The number of un-recognized or badly-recognized areas decreases by a factor of two or three.
The results of OCR are written into a PDF containing the full issue as page images. All information about word coordinates, font, size, and OCR errors is stored for analysis.
In this stage, all the information gathered in image analysis, layout analysis and OCR is collated. The segmentation engine analyzes textual objects, and their optically-recognized text, to find entities and entity components.
This structural information is also written into the PDF.
In the final stage, the structural and layout definitions gathered during the distillation process are written to non-proprietary XML files, together with the OCR-generated text. In addition, many rectangular snapshots of each newspaper page are taken, and saved together with the text. These snapshots can be used to assemble any entity in the newspaper, using coordinates found in the XML.
The data is stored in a flat-file XML repository, organized in an index tree by publication, date, section, page, and then page components.
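A toy fragment illustrates the coordinate-based snapshot idea: each entity records which image pieces rebuild it and where they go. The element and attribute names here are invented for illustration and do not reflect Olive's actual PrXML schema.

```python
import xml.etree.ElementTree as ET

# Invented XML fragment; Olive's real PrXML schema differs.
sample = """<entity type="article">
  <snapshot file="pg001_001.png" x="120" y="80" w="640" h="210"/>
  <snapshot file="pg001_002.png" x="120" y="300" w="640" h="500"/>
</entity>"""

root = ET.fromstring(sample)
rects = [(s.get("file"), int(s.get("x")), int(s.get("y")),
          int(s.get("w")), int(s.get("h")))
         for s in root.findall("snapshot")]
# A viewer would paste each snapshot at its (x, y) offset to
# reassemble the article exactly as it appeared on the page.
```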
Olive's XML architecture is based on its Preservation Markup Language Schema (PrXML). This schema maps the original document's content, style, and hidden intelligence into an open XML format. PrXML is a "Hyper Schema", not limited to a specific standard. Olive Software enables conversion of the PrXML schema into other schemas such as OAI and METS.
[This section (2.5.4) is extracted from the National Digital Newspaper Program (NDNP) Technical Guidelines for Applicants 2009 document (66 page PDF) available at http://www.loc.gov/ndnp/pdf/NDNP_200911TechNotes.pdf]
The National Digital Newspaper Program is a long-term effort and the technical environment will change as the program continues. The National Endowment for the Humanities (NEH) and the Library of Congress (LC) have selected a technical approach to balance long-term objectives and shorter-term constraints. These include:
The goal of the initial program phase is to build a Web-accessible NDNP delivery application with sufficient geographic coverage and digital assets to validate the technical approach, to serve as a test bed for future research and development in techniques that enhance the content and access interface, and to support effective use by scholars and the general public. This award cycle is a continuation of the initial program development phase.
In succeeding phases of the project, the approach and associated guidelines will be evaluated and revised based on feedback from awardees, experience in providing access to historic newspapers online, and technological advances.
NEH and LC recognize that other institutions may choose other approaches or formats for their own digital repository and delivery systems, and thus either weigh costs and benefits differently or wish for compatibility with existing systems. Applicants may pursue local approaches in parallel with participation in NDNP, with the overall goal of providing effective widespread access to newspapers through scanning and text conversion and evaluating alternative interfaces for navigating and exploring large collections of newspapers. Applicants who use other formats locally must be capable of providing digital assets to the NDNP according to the specifications described below.
The National Digital Newspaper Program supports a consistent technical specification for digital newspaper reproductions and associated metadata in order to maintain parity of services for materials from a variety of institutions and collections and to support the "best practices" of today's understanding of digital preservation needs.
Awardees are expected to deliver the following to the Library of Congress, to allow construction of a permanent archive and a unified interface for searching and browsing the entire NDNP collection. After the cooperative agreements are announced, LC will convene a meeting of awardees to review these technical guidelines and to establish work-plan milestones and specifications for 2009-11 deliverables.
Awardees will deliver all digital assets in a METS object structure (Metadata Encoding and Transmission Standard), according to an XML Batch template structure. (See Appendix C - XML Metadata Templates.)
For delivery, the awardee shall organize the page images and related files for each newspaper title in a hierarchical directory structure sufficient for identification of the individual digital assets from the metadata provided.
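One minimal way to express such a hierarchy is shown below. The title/date/edition layout and the sample values are hypothetical; the actual arrangement is whatever the awardee's metadata requires for unambiguous identification of each asset.

```python
from pathlib import Path

def asset_dir(root: str, title: str, issue_date: str, edition: str) -> Path:
    # Hypothetical hierarchy: <root>/<title>/<date>/<edition>.
    # The real layout must match the delivered metadata.
    return Path(root) / title / issue_date / edition

p = asset_dir("batch01", "sample_title", "1898-07-04", "ed-1")
```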