Charlotte: Building a Web Harvesting Service

Kyle Rimkus, Sarah Shreeves, Tracy Popp, Joanne Kaczmarek, Ellen Swain and I would like to submit an innovation fund request.

We request graduate hourly support to build out a web harvesting service, which the web archive working group (Kyle, Tracy, Sarah and I) has named Charlotte, with a hat tip to E. B. White.

We request twelve thousand dollars for graduate hourly support to seed the service with URLs, to structure the files using tags, and to investigate the possibility of ingesting previously harvested websites into the service. The previously harvested files are at present safely preserved in the Archives' e-records repository, but they are intellectually uncontrolled and in need of descriptive and ingest work.

The money would be split about 60/40 between two tasks:

  • seeding the service with URLs, tagging sites, adding metadata, conducting quality assurance testing, and refining our draft policy framework based on our experience; and
  • conducting development and programming work to style the service, to develop its brand, and to provide access to previously harvested content.

The latter work will be completed by a CS graduate student.

As part of this pilot project, we will focus exclusively on preserving University-related content, including official websites of colleges, departments, etc., as well as websites documenting faculty research and student organizations. Using this experience, we will refine policy documents and draft additional guidelines to allow for the development of subject-based external collections, after evaluation against a set of criteria to be developed in conjunction with CDC. Early drafts of guidelines, authored by Tracy Popp with advice from Kyle, Sarah, and me, are available at https://wiki.cites.illinois.edu/wiki/display/ulwas/Project+Management+Documentation#ProjectManagementDocumentation-Policies

As background, I would note that the core service is provided under a contract with the California Digital Library (CDL), using funds supplied by Tom Teper; Tracy Popp shepherded the contract through a lengthy grants and contracts process. The files will be hosted on a CDL site, branded for the University, and the contract allows us to receive a copy of the harvested files at any time (for a nominal fee) in the .warc (web archive) format.

Upon project completion, we will evaluate our efforts and recommend a method by which the work of harvesting university, faculty, and student organization websites can be mainstreamed into the normal work of the University Archives/IDEALS, while allowing for the development of subject-based collections.

Thanks,

Chris


Christopher J. Prom, PhD