415 Library, MC-522
1408 W. Gregory
Urbana, IL 61801
(217) 244-2062
Email: digicc [at] library.illinois.edu
This best practice addresses two topics: the trade-off between the reliability of digital storage media and their cost, and the inevitable need to migrate existing data to different (usually newer) storage technology.
Digital storage media is sometimes intertwined with digital file formats, but formats are dealt with in Best Practice 2.8.
________________________________________________________________
There are many types of data storage media and selection of the best type(s) for a given situation can be complex. Factors such as total data size, rate of data growth, user access needs, desired length of retention, preservation needs, data value, and available budget can all affect the suitability of storage types for a given situation. Even in cases where we know a data set should be online, some of those factors will affect which online storage array(s) are used and what sort of backup or replication configuration is used.
At the Library, several types of digital media are currently used to store digitized content. Each one presents tradeoffs with respect to reliability, expected longevity, ease of access and validation, and various costs (including purchase, maintenance, labor, energy, and space). To keep this document manageable, it refers only to storage media in use by the UIUC Library.
To maximize the longevity and readability of optical media (CD-R, DVD-R, etc.), the following are recommended [1].
A longer but similar list of these recommendations appears on page vi of NIST Special Publication 500-252 [3].
Every digital storage medium is subject to partial and total data loss. The causes of loss include human error, software or hardware malfunction, physical media deterioration, mechanical failure, damage from electromagnetic fields or environmental conditions, theft, disaster damage (fire, flood, earthquake, etc.), and eventual unreadability due to obsolescence and the unavailability of hardware and software that can still read or interface with a given medium. These disparate causes require different solutions to address their risk of occurrence.
Best practice for increasing the reliability of digital storage media always involves one or more means of creating redundancy in the data. Any specific storage unit will inevitably fail at some point; redundancy significantly reduces the statistical likelihood that such a failure results in actual information loss. Best practice also requires methods of detecting data corruption in the media. Moreover, all highly reliable and disaster-resistant storage systems require that the data reside in at least two physical locations as geographically distant as feasible.
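The effect of redundancy on the likelihood of loss can be illustrated with a small calculation. This is only a sketch under a strong simplifying assumption: each copy is lost independently of the others with the same probability over some period. The function name and the figures used with it are hypothetical and are not measurements from any Library system.

```python
def probability_all_copies_lost(per_copy_loss, copies):
    """Chance that every copy is lost, ASSUMING losses are independent.

    per_copy_loss: assumed probability (0.0 to 1.0) that one copy is lost
                   during the period of interest (a hypothetical input).
    copies:        number of separate copies kept (at least 1).
    """
    if not 0.0 <= per_copy_loss <= 1.0:
        raise ValueError("per_copy_loss must be between 0.0 and 1.0")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # Independent events: multiply the per-copy probabilities together.
    return per_copy_loss ** copies


if __name__ == "__main__":
    # Hypothetical 5% chance of losing any one copy in the period.
    for n in (1, 2, 3):
        print(n, "copies:", probability_all_copies_lost(0.05, n))
```

With a hypothetical 5% chance of losing any single copy, three independent copies would all be lost with probability 0.05 x 0.05 x 0.05 = 0.000125. Copies kept in the same building are not independent, since one fire or flood can destroy them together, which is one way to see why geographic separation matters.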
DSD and DCC have used high-quality CD-R and DVD-R media with a gold substrate layer for some collections. Even so, their experience has shown significant media failure rates, both initially and upon later attempts to read the discs.
Even "best practice" RAID-protected storage volumes suffer from data loss that ordinarily goes undetected [4,5]. To address these issues, a few highly resilient file systems have been developed. Most of these are very expensive proprietary systems out of our reach, but the Library's IT Infrastructure & Software Development (ISD) unit has begun working with Sun's open source ZFS [6] in a new pair of storage systems to gain this additional protection.
For any long-term digital preservation system, this type of silent data loss must be addressed at a level above the hardware, using software methods of recurring validation and recovery. In practice, this can be done by a digital preservation system running proactive fixity checks, by an advanced file system like ZFS, or by both. These systems all compute and store one or more checksums (e.g. CRC) or stronger digest hashes (e.g. MD5, SHA-256) of the files and file system metadata. Later, the files can be reread and the checksum or hash recomputed and compared with the original. Any difference indicates data corruption on the media and should trigger restoring that data from another copy.
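The record-then-recheck cycle described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the Library's actual tooling: the function names (`sha256_of`, `build_manifest`, `verify_manifest`) are hypothetical, and the choice of SHA-256 is an assumption.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Record a digest for every file under root, keyed by relative path."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_manifest(root, manifest):
    """Return the relative paths that are missing or no longer match."""
    root = Path(root)
    damaged = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            damaged.append(rel_path)
    return damaged
```

In use, `build_manifest` would be run once when content is stored and its result saved somewhere other than the storage being checked. A scheduled job could then call `verify_manifest` and flag every path it returns for restoration from another copy.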
Note, however, that running such fixity checks is rarely feasible in offline storage scenarios because of high labor requirements. In addition to the increased convenience of access to stored material, this is a strong argument in favor of using online or automated near-line storage systems, despite typically higher cost and energy use.
Since all storage media eventually deteriorate or become obsolete and inefficient, long-term data storage requires periodic migrations to newer physical media. Each migration involves re-selecting the most appropriate medium at that point in time, copying all the desired data from the old medium to the new one, and verifying the integrity of the copied data.
Once migration is completed, the old media may be retired or destroyed as appropriate, unless they still have some useful lifespan remaining and are intentionally being retained as an additional backup copy.
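The copy-and-verify step of a migration can be sketched as follows. This is a minimal example using only the Python standard library; the names `file_digest` and `migrate_tree` are hypothetical, and comparing SHA-256 digests of source and destination is one assumed way to verify integrity, not a description of the Library's actual procedure.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate_tree(old_root, new_root):
    """Copy every file from old_root to new_root and verify each copy.

    Returns the relative paths whose copy did not match the source.
    """
    old_root, new_root = Path(old_root), Path(new_root)
    failed = []
    for source in sorted(old_root.rglob("*")):
        if not source.is_file():
            continue
        rel = source.relative_to(old_root)
        target = new_root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copies contents and timestamps
        # Reread both files and compare digests before trusting the copy.
        if file_digest(source) != file_digest(target):
            failed.append(rel.as_posix())
    return failed
```

An empty return value suggests every file copied intact, but two caveats apply. Rereading a file immediately after writing it may be served from the operating system's cache rather than from the new medium, and a source file that was already corrupt will copy "successfully". Where digests were recorded when the content was first stored, comparing the new copies against those recorded values is the stronger check.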
NARA Technical Information Paper No. 12: "Digital-Imaging and Optical Digital Data Disk Storage Systems: Long-Term Access Strategies for Federal Agencies". http://www.archives.gov/preservation/technical/imaging-storage-report.html
Optical Storage Technology Association (OSTA) - Understanding CD-R and CD-RW Longevity. http://www.osta.org/technology/cdqa13.htm
[1] NARA Frequently Asked Questions (FAQs) about Optical Storage Media: Storing Temporary Records on CDs and DVDs. http://www.archives.gov/records-mgmt/initiatives/temp-opmedia-faq.html
[2] NIST Special Publication 500-252: "Care and Handling of CDs and DVDs - A Guide for Librarians and Archivists", pg. 16, table 3. http://www.itl.nist.gov/iad/894.05/docs/CDandDVDCareandHandlingGuide.pdf
[3] NIST Special Publication 500-252, pg. vi
[4] Summary of CERN's data storage reliability study http://storagemojo.com/2007/09/19/cerns-data-corruption-research/
[5] Carnegie Mellon Univ. paper "Disk Failures in the Real World: What Does an MTTF of 1,000,000 Hours Mean to You?" http://www.usenix.org/events/fast07/tech/schroeder.html
[6] The presentation at http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/zfslast.pdf summarizes the many benefits of ZFS compared to traditional online storage systems including how it provides end-to-end file integrity and recovery. The most relevant pages are 12-18, 21-23, and 41.