Final Report – Charlotte: Building a Web Harvesting Service

Innovation Fund Report: University of Illinois Web Archives

August 12, 2013

 

Executive Summary

 

Using funds from an Innovation Grant, the University Archives and the Web Archives Advisory Committee hired graduate hourly employee Joe Torchedlo to capture University websites using the California Digital Library (CDL) Web Archiving Service (WAS), to assure the quality of the captured content, and to create metadata for the captured sites.

 

The contract covering the base WAS service is funded separately through the digital preservation budget.  The base service currently includes one TB of storage for capturing and preserving University-based websites, and additional storage can be purchased.

 

Working within a set of policy and procedure guidelines developed by the Web Archives Advisory Committee (and largely written by Digital Preservation Coordinator Tracy Popp), Torchedlo successfully captured, performed quality assurance for, and created metadata for 529 University sites and 421 Netfiles sites, comprising 104.2 gigabytes of data and 1,258,312 files.  Completing this work used $5,081.67 of the $12,000 allocated to this project, and we believe the approach used has resulted in a very cost-effective means to build a research collection of web content relating to the University of Illinois at Urbana-Champaign.  The model developed may also be useful as a template for developing subject-based research collections (using separate funding).

 

In the second stage of the pilot, the Web Archives Advisory Committee identified a subject-based web archives project with which to assess the documentation produced in the first stage. Steve Witt, head of the International and Area Studies Library, was asked to participate because he had expressed interest in web archives.  In March 2013, Steve completed the proposal narrative and began harvesting content. This project is not covered in this report.

 

We recommend several actions to systematize the development of the University of Illinois Web Archives and to allow for review and development of additional subject-based web harvests.

 

Recommendations

 

  • Web archives services should be established as a permanent aspect of our collection preservation program.
  • Our experience with the WAS has convinced us it is the right platform for the foreseeable future.  It has sufficient safeguards if we later decide to move to a different platform (our contract provides that CDL will provide us with the files in the preservation .warc format upon request and for a nominal fee).
  • The University of Illinois should continue to contract with CDL for the base WAS fee; the base funding should cover the cost of capturing and preserving University of Illinois sites; and that fee should be paid on an indefinite basis out of the Collections/Digital Preservation budget.
  • Based on our assessment of how often the content of UIUC sites is updated, the project committee recommends that each University site be captured twice a year, in February and August, and that the WAS incremental capture feature be used once it is made available. (This will result in much smaller capture sizes with full fidelity, because only changed or new files are physically stored on disk.) With the exception of the public affairs website, more frequent captures would not seem to add value to the collection.
  • To maintain the future quality of the web archives, it is recommended that hourly assistance be retained to cover capture maintenance, the addition of new sites, QA, and metadata creation for University sites.
    • For FY 2014, we recommend that this work be funded using the unused balance (approximately $7,000) from the Innovation Grant.
    • For FY 2015 and beyond, we recommend that the cost to maintain the service be covered by a permanent addition of $8,000 to the University Archives student wage budget, with funding used to hire a graduate hourly employee to maintain the crawls, seed new sites as they are developed, create/edit metadata, and perform quality assurance on the service.
    • We recommend that AUL for Collections Tom Teper and the Web Archives Advisory Committee refine project documentation and the application process for the harvest and management of subject-based web archives, and that such policies be reviewed and endorsed by CDC.  As subject-based collections are proposed and developed, it is likely that additional costs will be incurred for storing non-University content. We therefore recommend that policies developed by the advisory committee and AUL for Collections analyze options for funding the storage and preservation of non-University web content, using funds secured before any implementation begins.

 

 

Description of the project

 

The primary goal of the University of Illinois Web Archives Service is to build a research resource of preserved web content of superseded University of Illinois websites.  A secondary, albeit very important, goal is to give collection managers an opportunity to request the development of subject-specific web archives supporting teaching and research in areas of defined importance to campus and library faculty and students (and which are not being preserved by other organizations selected by subject or creator).

 

As such, the project seeks to preserve, and ensure future access to, web-based content relevant to a variety of University of Illinois constituencies.

 

To support this goal, the University Library purchased a subscription to the California Digital Library’s Web Archiving Service in late 2012.  An informal Web Archives Advisory Committee (composed of Digital Preservation Coordinator Tracy Popp, Preservation Librarian Kyle Rimkus, IDEALS/Scholarly Commons Coordinator Sarah Shreeves, and Assistant University Archivist Chris Prom) was formed, and the committee requested (and received) $12,000 in Library Innovation Funds, in addition to seed funding for the CDL contract from the Library’s Materials Budget, to develop a pilot service.

 

In spring 2012, the Web Archives Advisory Committee approved the creation of the first collection–the University Archives/IDEALS Web Archive–to capture the content of web pages produced directly by the University and its related entities and individuals, consistent with goal numbers 2-4 of the University Archives’ strategic plan.[1]  In addition, the group developed a process by which other Library faculty can request the development of a web archives “project”: a defined area in the CDL service where they can seed sites for capture into a collection.

 

While the majority of this report focuses on the capture of University content (since that is what was funded by the Innovation Fund), it should be noted that other Library units have requested the development of subject-based web archives, and that it is anticipated that additional units will do the same.

 

Detailed report of activities:

 

Stage One: Netfiles work (December 2012)

 

Initial work focused on an 11th hour capture of publicly available websites hosted on the CITES Netfiles service, which was in the process of being decommissioned at the time the project was approved.  These activities included:

  • Identifying specific Netfiles sites by conducting targeted web searches.
  • Consulting with CITES staff to identify the best capture protocols, given the configuration of the Netfiles service.
  • Coordinating with California Digital Library (CDL) to arrange a special manual capture of the desired Netfiles sites, which was necessary in order to circumvent the security protocols that blocked initial attempts to capture the sites. (Permission was granted by CITES.)
  • Successfully capturing 426 Netfiles sites prior to the termination of UIUC’s Netfiles subscription on December 20th.
  • Performing quality assurance (QA) on each captured site. The bulk of this QA workload consisted of determining whether the content at a given Netfiles address was, or contained, an “active” homepage of a faculty member or student, or whether the address was being used primarily for file storage. Seeds considered primarily storage spaces were judged to be of insufficient value for inclusion in the Web Archives.
  • Adding 119 Netfiles seeds to the list of URLs to be suppressed from public view upon the archive’s launch.

 

Stage Two: Seeding and capture of university sites (January-April 2013)

 

Stage two work resulted in the capture of 529 university, college, department, and research group websites that were active as of spring semester 2013.

  • Sites for capture were identified and seeded according to the order of priority indicated in the project collection development policy: first the site of each major university office, then those of every unit or academic department within each office, then those of university associations and organizations. This set of sites was double-checked for completeness against the campus organizational chart and other lists.[2]
  • Proper seeding of a site involved gaining familiarity with the file structure of the site. Each menu item was verified as belonging to the same parent directory, and linked-to resources were assessed to determine whether their desired content (e.g., e-Newsletters, blogs) resided under a different parent directory, requiring additional URLs to be seeded.
  • A spreadsheet was maintained to track the creation and capture of sites. Recorded information for each site includes site name, seed URL (the primary address when there is more than one), creation and capture dates, assigned tags, and next capture date.
  • Web pages linking to resources that warranted an entirely new site in the archive, such as web pages of student organizations, research consortia, or entire journal publications, were entered into the spreadsheet for later creation and capture.  (This work can be completed in Fall 2013).
  • To date, 529 sites have been seeded and captured; all sites are embargoed from public access for six months from time of capture.
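
The parent-directory check described in the seeding step above can be illustrated with a short Python sketch. This is not the procedure the project followed (the report describes the check as a manual review), and it rests on simplifying assumptions: the function names `seed_directory` and `needs_additional_seed` and the `example.edu` addresses are hypothetical, and a link is treated as "in scope" only when it shares the seed's host and sits under the seed's directory.

```python
from urllib.parse import urlsplit


def seed_directory(seed_url):
    """Return (host, directory) for a seed URL; the directory ends in '/'.

    Simplifying assumption: a path that does not end in '/' is treated as
    a file, so only its enclosing directory is kept.
    """
    parts = urlsplit(seed_url)
    path = parts.path or "/"
    if not path.endswith("/"):
        path = path.rsplit("/", 1)[0] + "/"
    return parts.netloc.lower(), path


def needs_additional_seed(seed_url, link_url):
    """True when link_url falls outside the seed's host or parent directory."""
    seed_host, seed_dir = seed_directory(seed_url)
    link = urlsplit(link_url)
    link_path = link.path or "/"
    same_host = link.netloc.lower() == seed_host
    return not (same_host and link_path.startswith(seed_dir))


# Hypothetical menu links found on a department home page.
seed = "http://www.example.edu/dept/"
menu_links = [
    "http://www.example.edu/dept/people/",   # under the seed directory
    "http://www.example.edu/newsletter/",    # different parent directory
    "http://blog.example.edu/",              # different host
]

extra_seeds = [url for url in menu_links if needs_additional_seed(seed, url)]
print(extra_seeds)  # the newsletter and blog addresses
```

In this sketch the newsletter and blog addresses are flagged, which corresponds to the case in which additional URLs had to be seeded for a site.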

 

Stage Three: Metadata creation (April-May 2013)

  • Metadata particular to the web archival content includes site name, seeds, capture scope settings (host, directory, +linked), and a schedule for future captures.
  • Descriptive metadata was added to each record in the CDL web archives, and a series of descriptive or “subject” tags, functioning as access points in the public interface, were created and assigned to individual sites.
  • The graduate hourly assistant developed guidelines for generating metadata for web archive records; these are available on the project wiki.

 

 

 

 

Figure One: Sample Metadata Record from University of Illinois Web Archives

 

 

 

Stage Four: Quality Assurance and reseeding to capture problem sites (May-June 2013)

  • Quality control for the bulk of the Web Archive was conducted in May and June according to the guidelines developed by the Web Archives Advisory Committee and as detailed in project documentation. QA was performed on every item in the Archive, in chronological order of first capture.
  • Archival records were reviewed for completeness and consistency in presentation as compared to the live version of the site. Notes on discrepancies and the date quality assurance was performed were added to the working spreadsheet.
  • The primary issue, which caused discrepancies in roughly 10% of sites, was an exclusion rule in the site’s robots.txt file that prevented CDL’s crawler from harvesting the CSS files, resulting in an archival copy that contained unformatted content.  For those sites whose cascading style sheets were never captured, the graduate hourly arranged with CDL to adjust the capture settings so that the robots.txt file is bypassed.  Problem sites were re-captured under the new settings, resulting in clean and properly formatted archival copies.
  • The second common issue leading to discrepancies among archived sites was resolved by adding additional seeds to sites whose child pages resided under legacy domains (e.g., http://www.crhc.illinois.edu/ vs. http://www.crhc.illinois.edu/).
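
The effect of a robots.txt exclusion on stylesheet capture can be reproduced with Python's standard `urllib.robotparser` module. The sketch below is illustrative only: the report does not reproduce the actual robots.txt files, so the `Disallow: /css/` rule, the `example.edu` addresses, and the `example-crawler` user agent are all hypothetical stand-ins.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides the stylesheet directory from all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /css/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "example-crawler" is a placeholder, not the name of CDL's crawler.
page_allowed = parser.can_fetch("example-crawler", "http://www.example.edu/index.html")
css_allowed = parser.can_fetch("example-crawler", "http://www.example.edu/css/site.css")

print(page_allowed)  # True: the page itself may be harvested
print(css_allowed)   # False: the stylesheet is excluded, so the copy is unformatted
```

A crawler that honors such a rule retrieves the HTML but not the CSS, which matches the unformatted archival copies found during QA; bypassing robots.txt, as arranged with CDL, removes that restriction.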

 

Stage Five: Launch and promotion (July – present)

 

  • Banner and thumbnail images for inclusion in the archive’s public interface were developed by Tracy Popp and were uploaded to the CDL system.  These will be used to brand the project as the “University of Illinois Web Archives” as sites become searchable and live.
  • A permanent URL will be assigned to the archive and the archive will be launched to the public and promoted in the fall. Sites become viewable in the public interface after a six-month embargo period following the date of capture.
  • Promotion for Charlotte is under consideration and will be carried out in the early fall as a critical mass of captured sites becomes available in the public archive.
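
Because each site becomes public six months after its capture date, the tracking spreadsheet can be used to estimate when sites will appear in the public interface. The following Python sketch shows one way to compute that date. It assumes the embargo is counted in calendar months, which the report does not specify; the helper names and the sample capture dates are hypothetical.

```python
import calendar
from datetime import date

EMBARGO_MONTHS = 6  # embargo length stated in this report


def add_months(day, months):
    """Return the date `months` calendar months after `day`.

    The day of the month is clamped to the last day of the target month
    (for example, August 31 plus six months gives the end of February).
    """
    month_index = day.month - 1 + months
    year = day.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(day.day, last_day))


def public_release_date(capture_date):
    """Estimated first day a captured site is viewable by the public."""
    return add_months(capture_date, EMBARGO_MONTHS)


# Hypothetical capture dates, as they might appear in the tracking spreadsheet.
for captured in (date(2013, 1, 15), date(2013, 8, 31)):
    print(captured, "->", public_release_date(captured))
```

Under these assumptions, a site captured in mid-January 2013 would become viewable in mid-July 2013, which is consistent with sites gradually becoming available in the fall.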

 

Budget:

 

Of the $12,000 requested from the Executive Committee, $5,081.67 was spent through July 20th, for graduate hourly wages.

 

In addition, the recurrent cost for the service is funded by the Collections Budget. The base service includes storage for up to one TB of data, and purchases of additional storage can be made as necessary.

 

Report Submitted by:

 

Chris Prom

Assistant University Archivist

 

Sarah Shreeves

IDEALS and Scholarly Commons Co-Coordinator

 

Tracy Popp

Digital Preservation Coordinator

 

Kyle Rimkus

Preservation Librarian

 

Joe Torchedlo

Graduate Assistant

 

 

 

 

Appendix: Original fund request

> Paula,
>
> Kyle Rimkus, Sarah Shreeves, Tracy Popp, Joanne Kaczmarek, Ellen Swain and I would like to submit an innovation fund request.
>
> We request graduate hourly support to build out a web harvesting service, which the web archive working group (Kyle, Tracy, Sarah and me) have named Charlotte, with a hat tip to EB Webb.
>
> We request twelve thousand dollars for graduate hourly support to seed the service with URLs, to structure the files using tags, and to investigate the possibility of ingesting previously harvested websites into the service.  (These files are currently safely preserved on the archives e-records repository, but are intellectually uncontrolled and in need of descriptive and ingest work.)
>
> The money would be split about 60/40 between two tasks:
>       * seeding the service with URLs/laying out its architecture, and refining our policy framework based on that experience, and
>
>       * computer development work to style the service, develop its brand, and to provide access to previously harvested content.
>
> The latter work will be completed by a CS graduate student.
>
> As part of this pilot project, we will focus exclusively on preserving University-related content, including official websites of colleges, departments, etc., as well as websites documenting faculty research and student organizations.  Using this experience, we will draft guidelines to allow for the development of subject-based external collections, after evaluation against a set of criteria to be developed in conjunction with CDC.
>
> As background, I would note that the basic infrastructure for the project is supported under a contract with California Digital Library, using funds supplied by Tom Teper, which Tracy Popp shepherded through an extensive grants and contracts process. The files will be hosted on a CDL site, branded for the University, and the contract allows us to receive a copy of the harvested files at any time (for a nominal fee), in the .warc (web archive) format.
>
> Upon project completion, we will evaluate our efforts and recommend a method by which the work to harvest university/faculty/student organization websites can be mainstreamed into the normal work of the University Archives/IDEALS, while allowing for the development of subject-based collections.
>
> Thanks,
>
> Chris
>
> —
>
> Christopher J. Prom, PhD
> Assistant University Archivist and Associate Professor
> University of Illinois Archives
> 19 Library
> 1408 W. Gregory Dr.
> Urbana, IL 61801
>
> prom@illinois.edu
> +1 217 333 0798
>
> http://www.library.illinois.edu/archives/
>
> Blog: http://e-records.chrisprom.com