UNESCO, Paris, 19-23 February 1996

Electronic Data Archiving and Access

By: D. G. Law, King’s College London, UK.

Perhaps unusually in a scientific conference considering electronic publishing in science, my contribution comes as that of a mediaeval historian. Others will consider the management of current scientific information, but I wish to set before you the requirements of the historic record. This is an area which has scarcely been mapped, far less explored, and I aim to do no more than outline the problems we face. For the purposes of this paper I propose to divide electronic information into four categories:

1) Commercially published material: the equivalent of the scholarly journal

2) Working papers: grey literature, pre-prints, personal web pages etc.

3) Raw data: satellite data, survey data

4) Scientific archives: working papers and correspondence of scientists

Having explored each of these I shall look at some of their common features and at the standards issues which unite them. The point of this division is to stress very strongly that the preservation of the scientific record is about much more than the archiving of journals.

It has been claimed that only some thirty-five European institutions have remained largely unchanged since the Middle Ages. One is the Papacy; one is the Tynwald of the Isle of Man; one is Mount Athos; one is the Parliament of Iceland; and the other thirty-one are universities. Not a publisher amongst them. Indeed for a publisher to survive the seventy years of the copyrights they so assiduously collect would be worthy of remark. Historically we have looked to the universities in general, and to their libraries in particular, to archive the scholarly record. Publishers, quite properly, owe their duty to shareholders and not to science. Mild amusement may be gained in any publishing house by asking to see its archive of publications. Through ignorance, innocence or incompetence; through take-overs, bankruptcy and simple folly, such archives do not exist.

At the beginning of the nineteenth century it was said of the physician Thomas Young that he was the last man who knew everything. The interesting point here is that as late as 1800 it was considered plausible to know everything and, in a sense, to act as one’s own archive. When this became impossible, the access model which developed rested on legal deposit. By and large legal deposit exists only for the printed word. If we look at commercially published information which is beginning to appear uniquely in electronic formats, a first, and I would argue essential, step is to legislate universally for the legal deposit of electronic materials in schemes managed by our national libraries. I stress managed by: the huge costs and specialised technical skills involved in running data centres are very different from those of traditional librarianship, would not be welcome to national libraries, and would in many cases duplicate existing service centres. One further point should perhaps be clarified. For the purposes of legal deposit, the requirement is that national libraries be given one or more copies of electronic materials to store for historic use, not that they immediately set up public services in competition with the publishers. Scientific material has a limited commercial currency, but when that is exhausted we must have pre-assured access routes. The time to ensure this is now and not in seventy years when copyright expires. Thus moves to ensure legal deposit should be a key outcome of this meeting.

In passing one might note the danger of scientific outcomes being suppressed - and legal deposit helps to guard against this. As we move to electronic-only publication, and as data leasing rather than data purchase becomes prevalent, large areas of modern science will be owned by publishers rather than being generally available in libraries. One of the purposes of science is to discover the unpopular and the inconvenient. The story of the suppression of scientific results would fill several volumes. Multinational companies and publishers now have powers and wealth undreamed of since the days of the mediaeval papacy, and there is already evidence that such power is being used to suppress information - admittedly thus far biographical rather than biological.

The second category of material is the emergent electronic formats - grey electronic literature, web pages, collaborative articles, bulletin boards, pre-print archives and so on. The grandfather of them all is perhaps the Ginsparg archive, described elsewhere in these papers. At five years old it is a veteran of the Internet, where age is measured in months rather than years. Although called an archive, it is not at all clear that even this groundbreaking activity has established what role and responsibilities it has for long-term preservation. Some bulletin boards are beginning to filter, peer review and archive what they regard as important material, in at least one case publishing a major debate on paper in order to preserve it. But these are holding positions with no guarantee of permanence. This less formal method of communication seems to be growing in importance. It can be no coincidence that the growth and internationalisation of such material walks in step with the growth of the Internet, where electronic working and collaboration are increasingly the norm. It may be that the learned societies have a role here in setting standards. Such communications may be seen as the lineal descendants of Letters journals, where the societies played a prominent developmental role.

My third category is that of raw data, and here several issues emerge. There is rather too much of it about, and here we can perhaps learn from traditional archivists. Electronic publishing casts a longer shadow than its performance deserves, and it is usually instructive to look at parallels and at existing practices. Traditional archivists will say that their greatest skill is ruthlessly and without compunction to weed irrelevant material from collections. The collection of large volumes of scientific data would benefit from some of that ruthlessness. It also requires work to be done at the collecting stage to ensure that concerns such as quality control are met. And yet it is a sufficiently new area that no real rules yet exist. Here again there is scope for international bodies to set standards and practices to be followed. Raw data tends to be the preserve of governments and NGOs. This might give some cause for relief, since they tend to serve the public good, and it might be assumed that standards would gradually emerge. However the current fashion for moving such bodies to the private sector does give cause for concern, since commercial values and long-term preservation in the public interest are not necessarily natural bedfellows. In addition, the first sparring in the so-called data wars is beginning to emerge in the area of environmental information. European meteorological information is now encrypted and available only to selected users on specific terms. This threatens the freedom of the working scientist to access data, and again international organisations may wish to take a view on this matter.

My fourth and final category has as yet scarcely emerged as a consequence of the electronic world; it is the working papers of scientists. The history of a discipline is an important element in understanding its current position. Universities and national libraries are littered with the papers of great men - diaries, scrapbooks, notebooks, workbooks and correspondence. How will these develop and be preserved in an electronic environment? Do scientists have a responsibility to their discipline to preserve evidence of how ideas emerge and are shaped? Do we have to consider the creation of archival rescue teams who race into action when obituaries appear? The learned societies would seem to have an obvious role in creating and recommending guidelines for their members.

Let me now turn to the much neglected topic of network topology in the context of archives. As all Europeans involved in the use of networks know, the United States of America does not exist in the afternoon. This is partly a function of bandwidth and partly a function of its cost. We also know that any increase in bandwidth is soon swamped, typically within hours rather than days. The assumption that this problem will disappear as bandwidth increases seems to me naive, and I believe that chronic network bottlenecks will be a semi-permanent state. A picture may be worth a thousand words, but a GIF file takes longer to transmit. With the printed word this has not been a problem, since copies are distributed world-wide in libraries. The creation of mirror sites in many countries then seems a natural way of dealing with this problem and of ensuring easy and timely access. This could usefully create a series of nodes even for commercially originated data which has ceased to have commercial value. But there are substantial problems even with this model, in such areas as version control. It is relatively straightforward to achieve consistency between two sites holding one set of data: one to one. The problems increase with one set of data sent to many sites: one to many. With a network of mirror sites where the data is shared on a many-to-many basis, the problems of consistency are proportionately increased. If such data sinks were set up as nodes on the network, they should be formally accredited by the relevant authorities in order to ensure that standards are set, maintained and guaranteed. In passing one might note that such sharing and mirroring of data might help to create the sort of climate in which the predicted “data wars” are less likely to happen.
The limiting of access to information seems to be emerging as an international issue, yet as Federico Mayor said in his introductory remarks, “Science is nothing if it is not communicated to others.”
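To make the consistency problem concrete, the auditing of a mirror against a primary site can be sketched, in present-day terms, as a comparison of checksum manifests. The manifest layout and function names below are illustrative assumptions only, not any established mirroring protocol:

```python
import hashlib
from pathlib import Path

def build_manifest(root):
    """Map each file's path (relative to root) to a SHA-256 digest of its contents."""
    manifest = {}
    for path in sorted(Path(root).rglob("*")):
        if path.is_file():
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            manifest[str(path.relative_to(root))] = digest
    return manifest

def compare_manifests(primary, mirror):
    """Return files absent from the mirror, and files whose contents differ."""
    missing = sorted(set(primary) - set(mirror))
    stale = sorted(f for f in primary if f in mirror and primary[f] != mirror[f])
    return missing, stale
```

One-to-one comparison is as simple as this; the many-to-many case described above multiplies the number of such comparisons and adds the question of which site's manifest is authoritative.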

Archiving also faces technical issues, and these are huge. The electronic world consumes standards and data formats with unprecedented voracity. Electronic documents exist which use multiple formats and standards: SGML with different DTDs, HTML, LaTeX, TIFF, GIF, JPEG, TEI and others are developing a whole new vocabulary which imprisons documents and images. Different word processors, and even different versions of the same word processor, can be quite incompatible. Much material is platform dependent. Despite some bold pronouncements, it seems to me quite implausible to create a sort of working museum of hardware and software to keep information available in its original context - and even worse to have to do this at multiple sites around the globe. The only sane course is surely to migrate material forward regularly to new platforms, to maintain not just the data but ease of access to it. This does carry risks of losing some data, or at least nuances in the data, but it seems more realistic than assuming that we can indefinitely maintain the past.
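The migration strategy argued for here can be illustrated with a small sketch: each time a record is carried forward to a new format, a provenance entry is appended, so that any loss of nuance remains traceable through the record's history. The record layout, field names and converter below are purely illustrative assumptions:

```python
import datetime

def migrate(record, converter, target_format):
    """Convert a record's payload to a new format and append a provenance
    entry, so the migration history travels with the data itself."""
    new_payload = converter(record["payload"])
    history = record.get("history", []) + [{
        "from": record["format"],
        "to": target_format,
        "date": datetime.date.today().isoformat(),
    }]
    return {"payload": new_payload, "format": target_format, "history": history}
```

A record migrated several times would carry one history entry per step, allowing an archivist to say exactly when and from what each surviving version was derived.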

In passing, here one might mention that URLs may prove as ugly a problem to manage as they are to look at. Many of the draft papers for this conference cite URLs. Some of them may have changed by the time the papers are published in six months, and there is little guarantee that they will last for even six years. Preserving the electronic document is difficult enough; preserving the links may be impossible. In the bicentennial year of the death of Robert Burns, a Scot may be forgiven for looking at a snowy Paris and comparing URLs to the snow in Burns's great poem Tam o' Shanter:

…Or like the snow falls in the river, a moment white, then melts for ever…
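Short of preserving links, an archive can at least audit them periodically. A minimal illustrative sketch follows: the fetch function is an assumption supplied by the caller (it might wrap any HTTP client), and the example URLs are hypothetical:

```python
def check_links(urls, fetch):
    """Return the URLs that appear dead, given a fetch function that
    returns an HTTP status code or raises on network failure."""
    dead = []
    for url in urls:
        try:
            status = fetch(url)
        except Exception:
            dead.append(url)  # unreachable host counts as dead
            continue
        if status >= 400:
            dead.append(url)  # client or server error
    return dead
```

Such an audit can flag decayed citations, but it cannot recover them; only deliberate archiving of the cited documents does that.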

This paper avoids general issues of costs, since these are covered by others, but it does seem appropriate to mention some areas of cost which relate particularly to archiving and, to a lesser extent, to access. The less uniformity there is in presentation formats, search engines and so on, the greater is the need for documentation and even training. This creates a major ancillary activity for archives, since their staff must be proficient in all the variants of hardware and software at a level which allows them to support the research community. The Ginsparg Archive at Los Alamos deals with perhaps the most sophisticated and well-equipped community in science, the road warriors of the super-highway. Yet even for this archive the major problems are documentation and training.

The question of access is also a major issue. For electronic archives, experience shows that outputs are just as important as inputs. It is not sufficient to be able to store archives in sustainable formats; they must also be delivered in acceptable formats - acceptable to the user, that is - and containing only the elements required by the user. The units of information specified and required by the scientist may legitimately differ from the units in which the information was received and stored by the archive. User registration and validation are a further complication requiring considerable management, albeit largely by software. User equipment and software will be quite varied. At the recent G7 meeting in Brussels it was pointed out that there are more telephone connections in Manhattan than in the whole of sub-Saharan Africa. Information must then be capable of being delivered appropriately as well as stored appropriately. In sum, after a quarter of a century of experience we know that archives will be a costly part of the information chain, however managed.

Next let me touch on authentication, a theme more fully explored in other papers. I am perhaps less concerned with the validation of “published” information. It seems clear that electronic tagging or watermarking of one sort or another will be developed, and that this will provide its own form of guarantee which can be maintained throughout the life of the document. Verification and version control of less formally published material is a much more difficult issue. Here again there would seem to be a need for appropriate agencies to adopt forms of certification. If data centres are authorised in some way, as suggested above, we can assume that data acquired directly from them will be in the same state as when it was received. The sort of “data handles” proposed by Bill Arms and others to provide version control and verification would then complete the chain from author to archive to user. Perhaps the biggest problem with authentication is that it assumes a static document. Increasingly, however, one can see documents which are more or less dynamic. As documents increasingly incorporate sound and image, they will have multiple rights, and these rights will change and move. This paper, increasingly typically, exists in multiple versions: the submitted original, the refereed paper awaiting publication, and the text given at the conference. Many scientists would expect to put up a pre-publication web version which would itself elicit comment and lead to modification of the original. Each of these versions may properly and separately be required by a user, and each is a legitimate text.
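The verification role envisaged for “data handles” can be sketched as a registry that records a cryptographic digest for each version of a document and later checks a retrieved copy against it. The toy registry below illustrates the idea only; it is not the actual Handle System proposed by Arms and colleagues, and its names and handle strings are hypothetical:

```python
import hashlib

class HandleRegistry:
    """Toy registry mapping (handle, version) to a content digest, so that
    each legitimate version of a document can be verified separately."""

    def __init__(self):
        self._digests = {}

    def register(self, handle, version, content):
        """Record the SHA-256 digest of one version's content (bytes)."""
        self._digests[(handle, version)] = hashlib.sha256(content).hexdigest()

    def verify(self, handle, version, content):
        """True only if this exact version was registered with this content."""
        expected = self._digests.get((handle, version))
        return expected is not None and expected == hashlib.sha256(content).hexdigest()
```

Note that each version gets its own entry: the submitted original, the refereed text and a web pre-print would all verify independently, which matches the multiple-legitimate-versions point above. Dynamic documents remain the hard case, since any change invalidates the digest.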

This perhaps leads to issues of secondary control, of indexing and cataloguing. I am more optimistic here. Although major problems exist in the control of images and sound, by and large the development of descriptive mechanisms and classifications has been one of the great triumphs of scientific literature. These problems seem to me manageable. On the basis of previous success we may expect future solutions. But we should not minimise the work required to achieve this.

I have attempted to outline a large number of problems. Let me then paint a scenario which may be worth exploring. It requires action from all the players assembled here and is, in truth, no more than a hypothesis to test possible roles.

Firstly, there is a need for concerted pressure from all the bodies here to implement legislation covering legal deposit. Remember that in my terms this is to ensure preservation after the point when commercial value is exhausted.

Secondly, there is a need to agree the requirements of data sinks based around a long-term commitment to store, update and disseminate information with appropriate documentation and training. It seems to me likely that this is a role best carried out by universities, whose longevity seems assured.

Thirdly, there is a need to accredit and supervise such archives and to ensure that they exist in sufficient numbers on the Internet to remove network black holes. The learned societies and international organisations would have a critical role here.

Fourthly, there is a need to ensure the continued availability of raw data. That rests with international organisations and learned societies. Further, where the provision of statistical data is moved from the public to the private sector, governments should be pressed to include provision for archiving and long-term availability in any contract.

Fifthly, learned societies need to consider how to guide their members in the preservation of the electronic working materials of their discipline.

For archiving, the other issues are in a sense easy. Questions of standards, authentication and delivery platforms are important and hugely difficult. But they will be defined and resolved in terms of the current scientific record. It is the function of archives quite literally to pick up the pieces. I cannot however stress too strongly that notions of a global distributed electronic archive as some kind of electronic warehouse managing rights clearance gravely underestimate the scale of the problem. Both input and output formats are problematic, while training, documentation and constant revalidation of data are essential. Archives have always made available much more than the published record - and this will continue.

One plea should be made in closing. It is the perhaps obvious one of working with existing agencies. Huge amounts of effort have gone into studying archiving issues in the non-scientific community, where IFLA, IATUL, ICA, CNI and others have developed robust mechanisms and scenarios for dealing with electronic information. Take advantage of that. Even a mediaeval historian can tell you that if it is the characteristic of scientific experiments that they can be replicated, the re-inventing of wheels is, on the other hand, more likely to prove expensive than instructive.

Last updated : April 03 1996 Copyright 1995-1996
ICSU Press and individual authors. All rights reserved
