Final Report - Charlotte: Building a Web Harvesting Service

Innovation Fund Report: University of Illinois Web Archives

August 12, 2013

 

Executive Summary

 

Using funds from an Innovation Grant, the University Archives and the Web Archives Advisory Committee hired graduate hourly employee Joe Torchedlo to capture University websites using the California Digital Library (CDL) Web Archiving Service (WAS), to assure the quality of the captured content, and to create metadata for the captured sites.   

 

The contract covering the base WAS service is funded separately through the digital preservation budget.  The base service currently includes one TB of storage for capturing and preserving University-based websites, and additional storage can be purchased. 

 

Working within a set of policy and procedure guidelines developed by the Web Archives Advisory Committee (and largely written by Digital Preservation Coordinator Tracy Popp), Torchedlo successfully captured, performed quality assurance for, and created metadata for 529 University sites and 421 Netfiles sites, comprising 104.2 gigabytes of data and 1,258,312 files.  Completing this work used $5,081.67 of the $12,000 allocated to this project, or roughly $5.35 for each of the 950 sites captured, and we believe the approach used has resulted in a very cost-effective means to build a research collection of web content relating to the University of Illinois at Urbana-Champaign.  The model developed may also be useful as a template for developing subject-based research collections (using separate funding).

 

In the second stage of the pilot, the Web Archives Advisory Committee identified a subject-based project in order to test both the documentation produced in the first stage and the process for running a subject-based web archives project. Steve Witt, head of the International and Area Studies Library, was asked to participate because he had expressed interest in web archives.  In March 2013, Witt completed the proposal narrative and began harvesting content. This project is not covered in this report.

 

We recommend several actions to systematize the development of the University of Illinois Web Archives and to allow for review and development of additional subject-based web harvests.

 

Recommendations

 

 

 

Description of the project

 

The primary goal of the University of Illinois Web Archives Service is to build a research resource of preserved web content from superseded University of Illinois websites.  A secondary, albeit very important, goal is to give collection managers an opportunity to request the development of subject-specific web archives, selected by subject or creator, that support teaching and research in areas of defined importance to campus and library faculty and students (and that are not being preserved by other organizations).

 

As such, the project seeks to preserve, and ensure future access to, web-based content relevant to a variety of University of Illinois constituencies.

 

To support this goal, the University Library purchased a subscription to the California Digital Library’s Web Archiving Service in late 2012.  An informal Web Archives Advisory Committee (composed of Digital Preservation Coordinator Tracy Popp, Preservation Librarian Kyle Rimkus, IDEALS/Scholarly Commons Coordinator Sarah Shreeves, and Assistant University Archivist Chris Prom) was formed, and the committee requested (and received) $12,000 in Library Innovation Funds, in addition to seed funding for the CDL contract from the Library’s Materials Budget, to develop a pilot service.

 

In spring 2012, the Web Archives Advisory Committee approved the creation of the first collection, the University Archives/IDEALS Web Archive, to capture the content of web pages produced directly by the University and its related entities and individuals, consistent with goals 2-4 of the University Archives’ strategic plan.[1]  In addition, the group developed a process by which other Library faculty can request the development of a web archives “project”: a defined area in the CDL service where they can seed sites for capture into a collection.

 

While the majority of this report focuses on the capture of University content (since that is what was funded by the Innovation Fund), it should be noted that other Library units have requested the development of subject-based web archives, and we anticipate that additional units will do the same.

 

Detailed report of activities:

 

Stage One: Netfiles work (December 2012)

 

Initial work focused on an eleventh-hour capture of publicly available websites hosted on the CITES Netfiles service, which was in the process of being decommissioned at the time the project was approved.  These activities included:

 

Stage Two: Seeding and capture of university sites (January-April 2013)

 

Stage Two work resulted in the capture of 529 university, college, department, and research group websites that were active as of spring semester 2013.

 

Stage Three: Metadata creation (April-May 2013)

 

 

 

 

Figure One: Sample Metadata Record from University of Illinois Web Archives

 

 

 

Stage Four: Quality Assurance and reseeding to capture problem sites (May-June 2013)

 

Stage Five: Launch and promotion (July - present)

 

 

Budget:

 

Of the $12,000 requested from the Executive Committee, $5,081.67 was spent on graduate hourly wages through July 20th.

 

In addition, the recurrent cost for the service is funded by the Collections Budget. The base service includes storage for up to one TB of data, and purchases of additional storage can be made as necessary.

 

Report Submitted by:

 

Chris Prom

Assistant University Archivist

 

Sarah Shreeves

IDEALS and Scholarly Commons Co-Coordinator

 

Tracy Popp

Digital Preservation Coordinator

 

Kyle Rimkus

Preservation Librarian

 

Joe Torchedlo

Graduate Assistant

 

 

 

 

Appendix: Original fund request

> Paula,
>
> Kyle Rimkus, Sarah Shreeves, Tracy Popp, Joanne Kaczmarek, Ellen Swain and I would like to submit an innovation fund request.
>
> We request graduate hourly support to build out a web harvesting service, which the web archive working group (Kyle, Tracy, Sarah and me) have named Charlotte, with a hat tip to EB Webb.
>
> We request twelve thousand dollars for graduate hourly support to seed the service with URLs, to structure the files using tags, and to investigate the possibility of ingesting previously harvested websites into the service.  (These files are currently safely preserved on the archives e-records repository, but are intellectually uncontrolled and in need of descriptive and ingest work.)
>
> The money would be split about 60/40 between two tasks:
>       * seeding the service with URLs, laying out its architecture, and refining our policy framework based on that experience, and
>
>       * computer development work to style the service, develop its brand, and to provide access to previously harvested content.
>
> The latter work will be completed by a CS graduate student.
>
> As part of this pilot project, we will focus exclusively on preserving University-related content, including official websites of colleges, departments, etc., as well as websites documenting faculty research and student organizations.  Using this experience, we will draft guidelines to allow for the development of subject-based external collections, after evaluation against a set of criteria to be developed in conjunction with CDC.
>
> As background, I would note that the basic infrastructure for the project is supported under a contract with California Digital Library, using funds supplied by Tom Teper, and which Tracy Popp shepherded through an extensive grants and contracts process. The files will be hosted on a CDL site, branded for the University, and the contract allows us to receive a copy of the harvested files at any time (for a nominal fee), in the .warc (web archive) format.
>
> Upon project completion, we will evaluate our efforts and recommend a method by which the work to harvest university/faculty/student organization websites can be mainstreamed into the normal work of the University Archives/IDEALS, while allowing for the development of subject-based collections.
>
> Thanks,
>
> Chris
>
> --
>
> Christopher J. Prom, PhD
> Assistant University Archivist and Associate Professor
> University of Illinois Archives
> 19 Library
> 1408 W. Gregory Dr.
> Urbana, IL 61801
>
> prom@illinois.edu
> +1 217 333 0798
>
> http://www.library.illinois.edu/archives/
>
> Blog: http://e-records.chrisprom.com



[1] http://archives.library.illinois.edu/about-us/documents-and-policies/strategic-plan/

[2] http://www.pb.uillinois.edu/Documents/staffing/UIUC-Org-Chart.pdf, http://illinois.edu/ds/azList, http://identitystandards.illinois.edu/newsandevents/recognition.html.