

{"id":50,"date":"2024-05-16T19:31:48","date_gmt":"2024-05-16T19:31:48","guid":{"rendered":"https:\/\/wordpress.library.illinois.edu\/born-digital\/?page_id=50"},"modified":"2025-11-20T18:46:32","modified_gmt":"2025-11-20T18:46:32","slug":"getting-started-with-web-archiving","status":"publish","type":"page","link":"https:\/\/wordpress.library.illinois.edu\/born-digital\/getting-started-with-web-archiving\/","title":{"rendered":"Getting Started with Web Archiving\u00a0"},"content":{"rendered":"<h2><span style=\"font-weight: 400\">What is Archive-It\u00a0<\/span><\/h2>\n<p><span style=\"font-weight: 400\">Archive-It is a paid subscription service, offered by the <\/span><a href=\"https:\/\/help.archive.org\/help\/archive-it-information\/#:~:text=Archive%2DIt%20is%20a%20subscription,collections%20of%20born%20digital%20conten\"><span style=\"font-weight: 400\">Internet Archive<\/span><\/a><span style=\"font-weight: 400\">, that allows institutions to build and preserve collections of digital content.<\/span><\/p>\n<h3><span style=\"font-weight: 400\">How Does Archive-It Work<\/span><\/h3>\n<p><span style=\"font-weight: 400\">Web archiving is the targeted harvesting of Web-based content for archival and preservation purposes. At its core, Archive-It uses Heritrix, Java-based software described as an &#8220;open-source, extensible, Web-scale, archival-quality&#8221; Web crawler. The crawler harvests content automatically, beginning from one or more specific Web sites, or \u201cseeds.\u201d The crawl follows links, harvesting and saving content such as text, audiovisual materials, and site style sheets. Related harvested content is stored together in WARC files. The WARC file format is a publicly documented, open standard used to aggregate related Web content and its associated information or metadata. 
For more information on how Archive-It works, please refer to Archive-It\u2019s <\/span><a href=\"https:\/\/support.archive-it.org\/hc\/en-us\/articles\/360001231286-About-Archive-It-APIs-and-access-integrations\"><span style=\"font-weight: 400\">About Archive-It APIs and access integrations<\/span><\/a><span style=\"font-weight: 400\">.<\/span><\/p>\n<h3><span style=\"font-weight: 400\">Archive-It Limitations<\/span><\/h3>\n<p><span style=\"font-weight: 400\">Due to technical limitations, the exact content and appearance of every site on the Web may not be preserved. The most reliable captures are generally of static HTML sites whose pages contain text and images and whose constituent files all reside on a single host server and domain. Web crawling software generally has the most difficulty with:<\/span><\/p>\n<ul>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Dynamically created pages: pages generated by scripts or applications such as JavaScript or Adobe Flash<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Password-protected material<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Forms or database-driven content that requires interaction with the live host site<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Exclusions specified in a site\u2019s robots.txt file<\/span><\/li>\n<li style=\"font-weight: 400\"><span style=\"font-weight: 400\">Multimedia: streaming media players with video or audio content<\/span><\/li>\n<\/ul>\n<h3><span style=\"font-weight: 400\">Getting to work with Archive-It\u00a0\u00a0<\/span><\/h3>\n<p><span style=\"font-weight: 400\">To work with the<\/span><span style=\"font-weight: 400\">\u00a0University of Illinois <\/span><span style=\"font-weight: 400\">Urbana-Champaign web archives in Archive-It, one must first be trained by someone in the preservation service. 
To set up a training session, please email <a href=\"mailto:webarchives@library.illinois.edu\">webarchives@library.illinois.edu<\/a><\/span><span style=\"font-weight: 400\">.\u00a0<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>What is Archive-It\u00a0 Archive-It is a paid subscription service that allows institutions to preserve and build collections of digital content offered by the Internet Archive.\u00a0 How Does Archive-It Work Web archiving is the targeted harvesting of Web-based content for archival and preservation purposes. At its core Archive-It is a Java-based Heritrix Web crawler software, described [&hellip;]<\/p>\n","protected":false},"author":853,"featured_media":0,"parent":0,"menu_order":0,"comment_status":"closed","ping_status":"closed","template":"","meta":{"_acf_changed":false,"footnotes":""},"class_list":["post-50","page","type-page","status-publish","hentry"],"acf":[],"_links":{"self":[{"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/pages\/50","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/pages"}],"about":[{"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/types\/page"}],"author":[{"embeddable":true,"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/users\/853"}],"replies":[{"embeddable":true,"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/comments?post=50"}],"version-history":[{"count":7,"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/pages\/50\/revisions"}],"predecessor-version":[{"id":234,"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/pages\/50\/revisions\/234"}],"wp:attachment":[{"href":"https:\/\/wordpress.library.illinois.edu\/born-digital\/wp-json\/wp\/v2\/media?parent=50"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}