Innovation and Seed Funding Proposal – Feb 29th 2012
Transcribing Serialized Fiction
Digital Humanities Specialist
The LibGuide to serialized fiction in the Farm, Field and Fireside collection (http://uiuc.libguides.com/content.php?pid=53560&sid=392517) describes the serialized fiction that was published in the weekly newspapers and monthly periodicals in the Farm, Field and Fireside collection (http://www.library.illinois.edu/dnc/fff/).
One barrier to locating serialized fiction in the collection is that the serials themselves are not indexed, and individual articles do not have subject terms or tags that would identify them as fiction. As a result, articles are difficult to find unless the reader browses a large volume of issues or happens upon a lucky keyword search. Keyword searching of the collection is more effective for articles on topics in farming or farm life than for works of fiction. Unless the reader is looking for stories by a specific author or for a known story title, keyword searching of fiction is highly ineffective.
While the software does a good job of connecting articles in a single issue, the reader does not know where to find the next installment in a serialized work of fiction, so he or she has to find it manually by browsing the collection or doing a keyword search. Finally, while the OCR scanning was done to the highest standard, this is an imperfect process, and much of the content cannot be adequately OCR’d due to background noise and broken letters, features of the original newspapers that impede scanning.
I would like to apply for Innovation Funding to investigate new technology for solving these problems with our collection and for encouraging users to enhance library-managed content through crowdsourced transcriptions.
First, the serialized fiction articles in one title (Farmer’s Wife) will be converted from Olive’s PrXML into an Omeka exhibit, with Dublin Core metadata added and with a link from Olive to the new serialized fiction collection. This will build an index of serialized fiction that would increase the accessibility of these articles and address the issues listed above. We will also experiment with other digital library systems, including a Drupal/Fedora-based repository (Islandora), and with converting the fiction into TEI P5 and displaying it in the CDL’s eXtensible Text Framework (XTF).
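One way such an index could connect installments is sketched below in Python. This is only an illustration under stated assumptions: the record fields (id, title, issue_date), the sample titles, and the convention that installments share a title followed by a marker such as "Part 2" are all hypothetical and are not taken from Olive or PrXML.

```python
import re
from collections import defaultdict
from datetime import date


def normalize(title):
    """Drop a trailing installment marker (assumed convention) and lowercase."""
    base = re.sub(r"[\s,:\-]*\b(part|chapter|installment)\b.*$", "",
                  title, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", base).strip().lower()


def link_installments(articles):
    """Map each article id to the id of the next installment of the same story."""
    groups = defaultdict(list)
    for article in articles:
        groups[normalize(article["title"])].append(article)

    links = {}
    for items in groups.values():
        # Order installments by the date of the issue they appeared in.
        items.sort(key=lambda a: a["issue_date"])
        for current, following in zip(items, items[1:]):
            links[current["id"]] = following["id"]
    return links


# Hypothetical records for illustration only.
sample = [
    {"id": "a1", "title": "The Long Road, Part 1", "issue_date": date(1912, 1, 1)},
    {"id": "a3", "title": "The Long Road: Part 3", "issue_date": date(1912, 3, 1)},
    {"id": "a2", "title": "The Long Road, Part 2", "issue_date": date(1912, 2, 1)},
    {"id": "b1", "title": "Harvest Home", "issue_date": date(1912, 1, 1)},
]
print(link_installments(sample))
```

The title matching here is a rough heuristic. A title that contains "part" or "chapter" as an ordinary word would be truncated, so in practice the grouping would need review by staff or by transcribers.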
Second, I would like to experiment with crowdsourced transcription software that would allow any user to assist us in identifying and transcribing these articles, which would allow us to continue this project with a minimum number of staff hours. Omeka has a plugin, Scripto, which has been used by the Transcribing Bentham Project and the Papers of the War Department for transcribing manuscripts.
Finally, after we’ve completed the project, I want to evaluate the collection to see whether we can identify the characteristics of serialized fiction using text analysis techniques and automate the selection of articles as serialized fiction. We could then potentially apply for future grants, possibly Seed Funding in the next cycle.
XTF will run on Windows, is open source, and we could install it on the same server as Olive (libpimento).
I would rather have an additional virtual machine created as a test environment separate from the production server during this project, which we could use as a production system once the project ends. I’ve assumed there is a minimal cost difference between one and two virtual systems based on my discussions with Library IT.
The oXygen XML editor, which is already installed on one of our workstations, will be used to manually create the TEI documents. We are going to use TEI Lite (and very little markup). Once we’ve developed a template, we will further automate this process by converting the PrXML to TEI via an XSL style sheet.
The Scripto plugin for Omeka will help us with transcribing the text. We also plan on evaluating FromThePage, an open-source transcription tool I’ve identified that may help ensure this project is sustainable with minimal resources. We may evaluate other text transcription products as part of this project.
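To give a sense of how little markup a template might need, here is a minimal Python sketch that wraps already-corrected article text in a bare TEI skeleton. It is not the planned XSL conversion and it does not read PrXML. The function name build_tei, the choice of elements, the type="installment" attribute, and the sample values are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei(title, source, issue_date, paragraphs):
    """Return a minimal TEI document, as a string, for one article."""
    # Serialize TEI elements in the default namespace (no prefix).
    ET.register_namespace("", TEI_NS)

    def el(parent, tag, text=None, **attrs):
        node = ET.SubElement(parent, f"{{{TEI_NS}}}{tag}", attrs)
        node.text = text
        return node

    tei = ET.Element(f"{{{TEI_NS}}}TEI")

    # Header: only the three required parts of fileDesc.
    file_desc = el(el(tei, "teiHeader"), "fileDesc")
    el(el(file_desc, "titleStmt"), "title", title)
    el(el(file_desc, "publicationStmt"), "p", "Pilot transcription.")
    el(el(file_desc, "sourceDesc"), "p", f"{source}, {issue_date}")

    # Body: one div per installment, one p per paragraph.
    body = el(el(tei, "text"), "body")
    div = el(body, "div", type="installment")
    el(div, "head", title)
    for paragraph in paragraphs:
        el(div, "p", paragraph)

    return ET.tostring(tei, encoding="unicode")


# Hypothetical article data for illustration only.
doc = build_tei(
    "A Hypothetical Story, Part 1",
    "Farmer's Wife",
    "1912-01",
    ["First paragraph.", "Second paragraph."],
)
print(doc)
```

A real template would likely carry more header detail (author, page, link back to the Olive article), which is the kind of decision the manual pass in oXygen is meant to settle.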
Storage and preservation
The files created would be derivatives of the existing repository’s files, and we would use the existing article images as part of this project. The TEI files will be archived on DVDs and stored on an existing access file share (probably libpimento\repository). The files created would be small (15-20 KB for a single article).
Workflow for Research Assistants
A previous effort in HPNL identified approximately 1100 articles as serialized fiction in Farmer’s Wife. In testing with a volunteer, it took between 10 and 15 minutes to scrape the OCR text from a PDF and correct the text in a single issue. A back-of-the-envelope analysis at 15 minutes per article says we need about 275 hours to manually transcribe all 1100 articles. We would not need to correct all 1100 articles to obtain a good sample for the text analysis and the pilot XTF site. I also assume that a trained user will improve over time, and we can further improve productivity by semi-automating the creation of the TEI documents.
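The estimate can be reproduced with a few lines of Python. The article count and the 10 to 15 minute range come from the paragraph above; applying the per-item timing to each of the 1100 articles is the assumption behind the 275-hour figure.

```python
def transcription_hours(articles, minutes_per_article):
    """Total hours needed to transcribe the given number of articles."""
    return articles * minutes_per_article / 60


ARTICLES = 1100  # articles identified as serialized fiction in Farmer's Wife

low = transcription_hours(ARTICLES, 10)   # faster end of the tested range
high = transcription_hours(ARTICLES, 15)  # slower end of the tested range

print(f"Estimated effort: {low:.0f} to {high:.0f} hours")
```

At 10 minutes per article the total falls to roughly 183 hours, which is the kind of improvement a trained transcriber or a semi-automated TEI step might bring.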
I would like to request $3654 in funding for this project to hire two Graduate Hourly employees for 200 work hours over the period of ten weeks.
$3654 in GA hourly funding (3 GAs) at approximately 12 hours per week for 15 weeks.
¼ FTE Staff member’s time (Kirk Hess) for 10 weeks.
May 1 – set up virtual machine, install software, hire and train GAs, design XTF site.
June 1 – GAs continue transcription; install and pilot Omeka/Islandora/FromThePage/XTF
July 1 – Demonstrate beta site
August 15 – complete project, publish website, write project summary, write text analysis project proposal
This project would result in an additional website to support, along with the FromThePage transcription process. However, I believe both could be easily maintained by existing Library IT staff, and I expect the users of the archive to participate in adding new articles to the repository.
There’s considerable interest in automating transcription and OCR correction within the Library and this project would provide a technical platform for the Rare Book and Manuscript Library, Archives and Digital Content Creation to use for future projects.
Benefits and Success
The first milestone of success would be the publication of a working website with the article content transcribed and linked from Olive.
To further measure the benefits and success of the project, I plan on installing Google Analytics on the completed system so that we can track the number of users who view and use the content.
FromThePage also keeps track of transcriptions and the number of edits, which will help determine whether we can sustain the transcription process. We can also see whether this software would be helpful for future transcription projects.