

{"id":78,"date":"2024-06-18T19:27:25","date_gmt":"2024-06-18T19:27:25","guid":{"rendered":"https:\/\/wordpress.library.illinois.edu\/born-digital\/?page_id=78"},"modified":"2025-10-15T12:20:27","modified_gmt":"2025-10-15T12:20:27","slug":"local-web-archiving-policies","status":"publish","type":"page","link":"https:\/\/wordpress.library.illinois.edu\/born-digital\/local-web-archiving-policies\/","title":{"rendered":"Local Web Archiving Policies"},"content":{"rendered":"<h2>Collection Development Policy:<\/h2>\n<p><span style=\"font-weight: 400\">Web archives projects are intended to strengthen the library&#8217;s research resources. As such, Web archives projects should reflect the University Library&#8217;s mission and policies in various collection development statements. Websites are selected for harvest to bolster, complement, and parallel existing library collections, meet administrative documentation retention requirements, assist researchers, and capture ephemeral materials valuable to subject specialties, including grey literature, blog posts, and other relevant and vulnerable content.<\/span><\/p>\n<h3>Unit Responsibilities:<\/h3>\n<p><span style=\"font-weight: 400\">When creating a collection, units must be open to discussing the roles and responsibilities of Web archives administration and maintenance, such as who will be responsible for creating metadata and performing quality assurance.<\/span><\/p>\n<h3>Archive-It Access:<\/h3>\n<p><span style=\"font-weight: 400\">To crawl seeds and add metadata and descriptions to the University of Illinois\u00a0<\/span><span style=\"font-weight: 400\">Urbana-Champaign web collection, one must first be trained by someone in preservation service. To set up a training session, please <a href=\"mailto:webarchives@library.illinois.edu\">email Web Archives<\/a>.<\/span><\/p>\n<h4>\u00a0 \u00a0 \u00a0Intellectual Property\/Copyright:<\/h4>\n<p><span style=\"font-weight: 400\">Copying materials is an inherent part of the web archiving process. 
Web archives should take particular care with intellectual property and copyright issues, respecting the rights of rights holders without limiting libraries and archives&#8217; rights to preserve important historical content.<\/span><\/p>\n<h4>\u00a0 \u00a0 \u00a0robots.txt exclusions:<\/h4>\n<p><span style=\"font-weight: 400\">Webmasters use robots.txt files, which implement the Robots Exclusion Standard, to tell Web robots which parts of a site they may crawl. The robots.txt file can exclude specific files, directories, or even entire sites from Web crawler harvest. Webmasters may implement robots.txt exclusions for any number of reasons, such as protecting server performance and privacy.<\/span><\/p>\n<p><span style=\"font-weight: 400\">By default, Archive-It\u2019s web crawler honors all robots.txt exclusion requests. However, the crawler can be set up to ignore these blocks in specific cases.<\/span><\/p>\n<p><span style=\"font-weight: 400\">A robots.txt file is always located at the topmost level of a website, and the file itself is always called robots.txt. To determine whether a crawl may be blocked, view a website\u2019s robots.txt file by adding \u201c\/robots.txt\u201d to the end of the topmost level of the site\u2019s address (for example, https:\/\/www.example.com\/robots.txt).<\/span><\/p>\n<h4>\u00a0 \u00a0 \u00a0Storage and Contingency Planning:<\/h4>\n<p><span style=\"font-weight: 400\">Archive-It crawls are stored in the WARC (Web ARChive) file format, an<\/span><a href=\"http:\/\/bibnum.bnf.fr\/WARC\/\"> <span style=\"font-weight: 400\">ISO standard (ISO 28500)<\/span><\/a><span style=\"font-weight: 400\"> for storing content harvested from the World Wide Web. 
Archive-It\u2019s primary crawler, Heritrix, and the Wayback Machine viewing software are open-source tools supported by an international community of institutions.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Content files are hosted on servers at the Internet Archive in San Francisco. A copy of Archive-It data is stored in a secure, controlled-access facility in Richmond, California, and mirrored in additional Internet Archive repositories. In addition, a dark copy of the Archive-It repository is replicated for preservation purposes at a university in the Eastern United States.<\/span><\/p>\n<p><span style=\"font-weight: 400\">Note that, at this time, the Archives do not maintain a local copy of WARC files.<\/span><\/p>\n<h4>\u00a0 \u00a0 \u00a0Password-Protected Content:<\/h4>\n<p><span style=\"font-weight: 400\">You can capture password-protected pages if the crawler is provided with login credentials to access the site. Some login pages work differently from others and may be difficult to capture. If you encounter problems with password-protected sites, please contact<\/span><a href=\"https:\/\/support.archive-it.org\/hc\/en-us\/requests\"> <span style=\"font-weight: 400\">Archive-It Support<\/span><\/a><span style=\"font-weight: 400\">.<\/span><\/p>\n<p><span style=\"font-weight: 400\">To capture a password-protected site, add the URL of the login page as the seed URL for the site you wish to capture. Under the <b>Seeds<\/b> tab, check the box next to the login URL and click the <b>Edit Settings<\/b> button. In the <b>Edit Seed Settings<\/b> dialog box, enter the page\u2019s Login Name and Login Password, then click the <b>Apply<\/b> button.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Collection Development Policy: Web archives projects are intended to strengthen the library&#8217;s research resources. 
As such, Web archives projects should reflect the University Library&#8217;s mission and policies in various collection development statements. Websites are selected for harvest to bolster, complement, and parallel existing library collections, meet administrative documentation retention requirements, assist researchers, and capture ephemeral [&hellip;]<\/p>\n","protected":false},"author":853,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_acf_changed":false,"footnotes":""},"class_list":["post-78","page","type-page","status-publish","hentry"],"acf":[],"_links":{"self":[{"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/pages\/78","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/users\/853"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/comments?post=78"}],"version-history":[{"count":10,"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/pages\/78\/revisions"}],"predecessor-version":[{"id":200,"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/pages\/78\/revisions\/200"}],"wp:attachment":[{"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/media?parent=78"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}