After the widespread gnashing of teeth earlier in the year, it’s good to see that the BBC Domesday Project’s data has been brought back from the dead. I remember seeing a demonstration of one of these systems sometime after it was launched in the mid eighties. This was probably the first multimedia system I ever saw, way ahead of the Acorn Electron I had in my bedroom, and (for its time) it was a very sweet bit of kit.
Slashdot had a lot of discussion about this and the wider implications of storing important cultural artifacts on obsolescence-prone computer media. Who today can read 8″ floppies, punch-cards, paper tape, obscure 1960s magtapes and drums, or even (to take an extreme example) Sinclair Microdrives? Just what is being lost here? A common opinion on Slashdot, and one I’ve also seen elsewhere, was that information should be subject to a sort of Darwinian test. If nobody needs the information then they won’t maintain its accessibility by copying it to current media every few years; so let it die.
I think this is wrong for a few reasons. Firstly, some information gets more valuable as it gets older – maybe not until it’s very old. The original Domesday Book was created for the purpose of taxation assessment, so it was pretty useful for a few years after it was created, but arguably nowhere near as useful as it now is as a record of medieval Britain. The BBC Domesday discs were fairly interesting in 1986, but told nobody anything fundamental that they didn’t already know. Their value to historians in 1,000 years’ time could be huge. So how long do you wait before deciding to abandon the data?
The second reason relates to the way the information is represented. Take some planetary science data recorded in 1971 and sitting on a magtape on a rack in an air-conditioned NASA vault. It is probably stored as a bunch of Fortran records, in a format quite likely specific to the obscure compiler used to build the software that gathered the data. Even if the tape hasn’t decayed, and you can somehow read the data and get it onto a modern hard drive, what do you do with it? The meaning of the data is embedded in the software that was written to manipulate it, way back in ’71. Maybe you’ve got the source code, or even a file-structure specification document, but you’ve still got to port the old code, or write new code, to get at that data. That’s expensive, and with more and more data being archived each day, the global cost of keeping it all accessible keeps growing. Things will be thrown away not because they’re uninteresting, but because it’s just too expensive to keep transforming the data into a form you can do something with. Further examples are not difficult to find. The newly-opened Library of Alexandria has a 100-terabyte digital archive, including a web archive dating back to 1996. How is this information stored? In the private sector, what about the Lexis-Nexis database?
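To make that concrete, here is a minimal sketch (Python, purely illustrative) of just the first step you’d face. Fortran “unformatted sequential” files typically bracket each record with a length marker whose size and byte order differ between compilers, and even after you’ve split the tape image into records you still know nothing about what the fields inside them mean.

```python
import struct

def read_fortran_records(path, endian="<", marker_size=4):
    """Split a Fortran 'unformatted sequential' file into raw records.

    The record markers' size and byte order are compiler-specific --
    exactly the sort of detail that is lost when the original software dies.
    """
    fmt = endian + ("i" if marker_size == 4 else "q")
    records = []
    with open(path, "rb") as f:
        while True:
            head = f.read(marker_size)
            if len(head) < marker_size:
                break  # end of file
            (length,) = struct.unpack(fmt, head)
            payload = f.read(length)
            tail = f.read(marker_size)
            if struct.unpack(fmt, tail)[0] != length:
                raise ValueError("mismatched record markers: wrong compiler convention?")
            records.append(payload)
    return records
```

Even getting this far only yields opaque byte strings; interpreting them still needs the knowledge that lived in the 1971 analysis code.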
A solution (of sorts) to this problem is standards. Which format was used to store the video clips in the BBC Domesday project? I don’t know for sure, but I suspect it was a proprietary format specific to the videodisk players that were used, and one now documented only in some dusty tech manuals stashed away somewhere. Today we have well-defined, standard video formats like MPEG, which are unlikely to be forgotten about. In the short term (say the next five years) I can just buy any MPEG manipulation software I need off the shelf. In the medium term (say the next century or two) someone could probably discover enough information about the file format to decode an MPEG without any trouble. Over longer timescales, though, I don’t hold out much hope. The problem is that the specification is itself a document that must be kept accessible over time.
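While that specification survives, though, the standard does its job. As a toy illustration (not a decoder): the MPEG-1/2 bitstream is built from publicly documented start codes, so even someone with nothing but the raw bytes can recognise its structure.

```python
# MPEG-1/2 start codes are the bytes 0x00 0x00 0x01 followed by a code byte.
# The meanings below come from the public specification.
MPEG_START_CODES = {
    0x00: "picture",
    0xB3: "video sequence header",
    0xB8: "group of pictures",
    0xBA: "program stream pack",
}

def find_start_codes(data: bytes):
    """Yield (offset, description) for each start code found in a byte string."""
    for i in range(len(data) - 3):
        if data[i] == 0x00 and data[i + 1] == 0x00 and data[i + 2] == 0x01:
            yield i, MPEG_START_CODES.get(data[i + 3], "other")
```

That recognisability is exactly what a proprietary videodisk format, documented only in a dusty manual, does not have.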
A good approach to these problems is to make information self-describing and “anti-coded”. If you do this, you at least work around the problem of how to interpret the information: the information carries its own description of its structure. Paper books are the ultimate realisation of this, but it is also what XML and SGML try to do. Of course, the structure description has to be obvious and capable of being processed by a computer. Going a step further, these issues are precisely those grappled with by the scientists who have devised messages intended to be understandable to alien civilisations: the Pioneer 10 plaque, for example, or Frank Drake’s Arecibo message. Although such messages generally use far more redundancy than is practical for real information storage, I think there are a lot of lessons to be learned from these efforts. A good source of information is Gregory Benford’s book Deep Time: How Humanity Communicates Across Millennia.
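Here is a small, invented example of what self-description buys you: a program that has never seen this record before can still recover its structure, because the tags name the fields (this uses Python’s standard XML parser; the record itself is made up for illustration).

```python
import xml.etree.ElementTree as ET

# An invented, self-describing record: the tags say what each value means,
# so no separate format specification is needed to make sense of it.
record = """
<survey-entry year="1986">
  <place>Somewhere-on-Thames</place>
  <population>12345</population>
  <note>Values invented purely for illustration.</note>
</survey-entry>
"""

root = ET.fromstring(record)
print(root.tag, root.attrib)
for field in root:
    print(f"  {field.tag}: {field.text.strip()}")
```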
Another idea, proposed by Jeff Rothenberg, is to encode a document as instructions for a “Universal Virtual Machine”. Executing the instructions has the effect of reproducing the original document. The idea is that only the virtual machine has to be migrated to future hardware and operating systems. This is not only ingenious, but it has been tried and seems to work. Now all you have to do is preserve the specification of the virtual machine…
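A toy sketch of the flavour of the idea (my own invented instruction set, not Rothenberg’s actual design): the document is stored as a program for a trivially simple machine, and only the interpreter below ever needs re-porting.

```python
# The archived artefact: a program for an imaginary, very small machine.
# (The document text here is invented for the example.)
PROGRAM = [
    ("EMIT", "Domesday survey, 1086."),
    ("NEWLINE", None),
    ("EMIT", "Entry: one mill, eight villagers, woodland for forty pigs."),
    ("HALT", None),
]

def run(program):
    """The 'universal virtual machine': the only piece that ever needs porting."""
    output = []
    for op, arg in program:
        if op == "EMIT":
            output.append(arg)
        elif op == "NEWLINE":
            output.append("\n")
        elif op == "HALT":
            break
        else:
            raise ValueError(f"unknown opcode: {op}")
    return "".join(output)

print(run(PROGRAM))
```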
There’s another big problem on the horizon as more and more information goes on-line: copyright. It’s all very well having a utopian view of preserving information for the good of humanity, but someone (or, more likely, some business) probably owns it. And they don’t want you to mess with it. Digital Rights Management (DRM) is the name for technology that allows you to, say, view an e-book on your palmtop, but prevents you from copying it to another device. This is typically done by encrypting the file in question, and using special viewing software to decrypt it at the point of use. The problem should be obvious: as soon as the company stops supporting the DRM software, the information is inaccessible for good. Under UK law, copyright eventually expires (the term varies with the kind of work, but it is finite). The purpose of this is to ensure that all information eventually finds its way into the public domain for the benefit of future generations. Information secured by current DRM technology never, ever becomes accessible. Big business has every interest in ensuring that it stays that way.
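To see why, here is a crude sketch of the mechanism, using the Python cryptography package’s Fernet recipe purely as a stand-in (no real DRM scheme works exactly like this): the archive only ever holds ciphertext, and the key lives inside the vendor’s viewer.

```python
from cryptography.fernet import Fernet

# The key is baked into the vendor's viewing software, not the archive.
vendor_key = Fernet.generate_key()
viewer = Fernet(vendor_key)

# What actually gets stored and passed around is the encrypted e-book.
ebook_ciphertext = viewer.encrypt(b"Chapter 1. It was a dark and stormy night...")

# While the viewer (and its key) exists, reading is easy:
print(viewer.decrypt(ebook_ciphertext).decode())

# Once the vendor folds and the key is gone, ebook_ciphertext is just noise,
# no matter how carefully the bytes themselves have been preserved.
```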
These are the problems that Bruce Sterling’s Dead Media Project was set up to examine. In a nice touch of self-reference, the site itself appears not to have been touched for some time. Not a good omen.