Oof. This is quite long. Sorry.
This is an idea I had some time ago. I've posted and discussed it elsewhere (before I came on the halfbakery), where to be honest it didn't fare too well. In particular, someone got hung up on it being English-based, which it needn't be. I'd post a link, but unfortunately my initial description seems to be missing and it all looks a little confusing. Some people there had good suggestions, which I've already incorporated.
Also please at least read this carefully before suggesting that XML does what is required, because I don't think it does.
The Problem: Archiving data is fraught with difficulties. The file format may be lost, rendering the data unusable. This has definitely happened with more than one project.
The Solution (well, perhaps?) (brief) Have a file format which unambiguously defines the data.
Please note before I begin that this does not address the problems of bit rot and hardware obsolescence. These have to be addressed separately (and they can be).
(verbose) The Universal Archival Data Format (UADF) is designed with the intention that intelligent beings (ie humans) would be able to recover the data, almost from first principles. As such it has some similarities to those messages beamed into outer space in an attempt to contact aliens.
UADF is essentially a meta-format, in that it describes the standard to which any data needs to be described for conformity. UADF reader programs are also possible. These would have a 'plug-in' structure and degrade gracefully, reporting the sections of the file they could not comprehend. A big advantage of UADF is that portability is guaranteed: it is not possible for a company to create a 'closed' format complying with UADF. If your program writes UADF files, a programmer can write another program to read them.
As I currently see it, UADF files have 5 parts:
1) a computer-parsable and human-readable header, which defines the version number etc.
2) a 'bootstrap', designed to show intelligent readers how the file data is stored and in particular how part 3 works.
3) human-readable descriptions of the 'file formats' of the 'sections' in part 5, with associated computer-parsable meta-data on each section type.
4) computer-parsable pointers and other meta-data for all the sections.
5) the 'sections'. This is the data you really want to store, and it can be anything: text, graphics, sound files, raw instrument data etc.
Part 1 is fairly short, and exists so that UADF readers can decide whether they can read the file. Probably something like an ASCII string "Universal Archival Data Format, Version 1.1 (UADF foundation mods)"(zero terminator)
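For illustration, here's a minimal sketch (in Python, which is incidental) of how a reader program might perform the Part 1 check. The exact identification string, and the assumption that a reader accepts any 1.x version, are mine, not part of any spec:

```python
# A minimal sketch of the Part 1 check, assuming the file begins with a
# zero-terminated ASCII identification string. The string and the
# accepted version prefix are assumptions for illustration.

SUPPORTED_PREFIX = "Universal Archival Data Format, Version 1."

def read_part1_header(path):
    """Return the identification string, or None if this isn't UADF."""
    with open(path, "rb") as f:
        raw = f.read(256)          # Part 1 is short by design
    end = raw.find(b"\x00")
    if end == -1:
        return None                # no terminator: not a UADF file
    try:
        ident = raw[:end].decode("ascii")
    except UnicodeDecodeError:
        return None
    return ident if ident.startswith(SUPPORTED_PREFIX) else None
```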
Part 2 is a bit tricky. Without getting bogged down in detail: for English, it would probably have (8*8 byte) bitmaps for some of the ASCII characters (using zero as 'off' and the character's own code as 'on'), then use those characters to define the others. It would then describe the use of some of the remaining character codes in part 3 to give computer-parsable information.
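As a hedged illustration of that bootstrap encoding (the glyph shape and helper below are hypothetical), one character might be laid out like this:

```python
# Sketch of one bootstrap glyph as described above: an 8x8 grid, one byte
# per cell, with 0 for "off" and the character's own ASCII code for "on".
# A human dumping the file as a grid of bytes would literally see a 'T'
# drawn out of the code for 'T'. The glyph shape is purely illustrative.

def bootstrap_glyph(char, rows):
    """Encode an 8x8 picture of `char`; rows are strings of '.' and '#'."""
    code = ord(char)
    return bytes(code if cell == "#" else 0
                 for row in rows for cell in row)

T_ROWS = [
    "########",
    "...##...",
    "...##...",
    "...##...",
    "...##...",
    "...##...",
    "...##...",
    "........",
]

glyph = bootstrap_glyph("T", T_ROWS)
assert len(glyph) == 64
assert set(glyph) == {0, ord("T")}
```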
Part 3 gives a precise, natural language description of each section type. Each of these needs to have a computer-parsable name, originator and version number, so that a UADF reader can check for a suitable plug-in. They are also given number-specifiers which are unique to this document.
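For illustration only, a Part 3 entry might pair those computer-parsable fields with the prose description like this (the key=value layout and field names are my assumptions, not part of the idea):

```python
# A hypothetical Part 3 entry: computer-parsable fields up front, then a
# blank line, then the natural-language description of the section type.

from dataclasses import dataclass

@dataclass
class SectionType:
    number: int        # number-specifier, unique within this document
    name: str          # e.g. "text"
    originator: str    # creator of the *format*, not of the data
    version: str
    description: str   # full prose spec, aimed at future humans

def parse_section_type(entry):
    header, _, prose = entry.partition("\n\n")
    fields = dict(pair.split("=", 1) for pair in header.split(";"))
    return SectionType(int(fields["number"]), fields["name"],
                       fields["originator"], fields["version"],
                       prose.strip())

example = ("number=1;name=text;originator=xxx;version=1.0\n\n"
           "Each byte is one character of 8-bit ASCII text, in order.")
assert parse_section_type(example).name == "text"
```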
Part 4 is just a table of each individual section in the document, giving each one's type, offset and length.
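A sketch of how Part 4 might look on disk, assuming (purely for illustration) fixed 20-byte little-endian entries:

```python
# Hypothetical Part 4 entry: type number (a Part 3 number-specifier),
# file offset, and length, packed as little-endian u32/u64/u64. The
# record layout is an assumption; the idea doesn't fix one.

import struct

ENTRY = struct.Struct("<IQQ")   # type number, offset, length: 20 bytes

def read_section_table(table_bytes, count):
    """Yield (type_number, offset, length) for each of `count` entries."""
    for i in range(count):
        yield ENTRY.unpack_from(table_bytes, i * ENTRY.size)

table = ENTRY.pack(1, 4096, 300) + ENTRY.pack(2, 4396, 9000)
assert list(read_section_table(table, 2)) == [(1, 4096, 300),
                                              (2, 4396, 9000)]
```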
Part 5 contains all the data of interest.
As you can see, each file contains a significant amount of meta-data. Obviously you want to keep this to a minimum, however note that a lot of it is 'fixed cost'. It doesn't cost much to add another section of a type already present. Also, if you want to make some assumptions, for instance that your descendants will speak English and use 8-bit ASCII, then some of the meta-data could be left out.
-- Loris, Oct 06 2002

Mt. Rainier Initiative FAQ page
http://www.softarch.../mtrainierfaqs.html
Hmm.. well, at least it's an interesting read. [Mr Burns, Oct 08 2002]

Information Storage of the Future
http://abc.net.au/r...stories/s150901.htm
Radio transcript about hardware requirements of information archival. [Loris, Oct 09 2002, last modified Oct 04 2004]

Slipstream Processors
http://citeseer.nj....hy00slipstream.html
Remove computation and control flow information. [reensure, Oct 09 2002]

"Part 1 is fairly short, and exists so that UADF readers can decide whether they can read the file." "Part 3 gives a ...description... so that a UADF reader can check for a suitable plug-in." Rather defeats the purpose, no?
-- phoenix, Oct 07 2002

Mt Ranier CD-ROM?
-- Mr Burns, Oct 07 2002

>> "Part 1 is fairly short, and exists so that UADF readers can decide whether they can read the file." "Part 3 gives a ...description... so that a UADF reader can check for a suitable plug-in." Rather defeats the purpose, no?
I'm sorry Phoenix, I don't see how this defeats the purpose. I can see you might think it is an unnecessary repetition, but it isn't really. The part 1 version & metadata is for the whole file, in particular the format of the metadata of part 3. Since there may be many different types of 'section' (that is, types of actual data), each of these needs its own metadata and version number.
Please note that each segment of part 3 contains two things: a human-readable description, and a computer-parsable recognition sequence.

<aside> Perhaps I should have explicitly stated the idea behind this format: quite often, data formats are never really recorded explicitly. For example, there exist many different word processors and DTP editors, all with different formats. Once the program is obsolete and doesn't run any more, the data is lost. Sometimes the format can be reverse-engineered, sometimes not. This isn't really a problem if it is just a quick letter or whatever, but a common problem is that someone has a major body of work (like a PhD thesis) which is now irretrievable. There are quite a few other examples of important information being lost forever.
So. The crux of this idea is that you have, along with your data, a specification for it, so if you really need to, it can be recovered. The information is there at every point (ie it is with your data while it is being edited). This is important because otherwise it would 1) be much harder to do and 2) not be done by anyone, because no one expects it to happen.
<end of aside>
For example, a UADF file (version 1.7, author zzz) might contain unformatted text (type text, author xxx, version 1.0) and a picture (type picture-format, author yyy, version 2.8), among other things... (where the authors are the creators of the formats, not of the actual text or picture).
The UADF reader checks that it can understand the file format first. If it can, it then needs to check whether it has plugins which can understand the individual components. Then it gives a list of those things it can use, and those it cannot. For the things it can't use, you can write a plugin (if one is not available for your system) using the specification in part 3.
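As a sketch of that flow (the plugin registry keys and the lone text handler below are hypothetical):

```python
# Match each section's (name, originator, version) triple against the
# installed plugins; anything unmatched is reported rather than hidden,
# since its Part 3 description remains available to a human.

plugins = {
    ("text", "xxx", "1.0"): lambda data: data.decode("ascii"),
}

def open_sections(section_types, sections):
    """section_types: {number: (name, originator, version)};
    sections: iterable of (type_number, raw_bytes) pairs."""
    readable, unreadable = [], []
    for type_number, data in sections:
        key = section_types[type_number]
        handler = plugins.get(key)
        if handler:
            readable.append(handler(data))
        else:
            unreadable.append(key)   # degrade gracefully: report it
    return readable, unreadable

types = {1: ("text", "xxx", "1.0"), 2: ("picture-format", "yyy", "2.8")}
ok, missing = open_sections(types, [(1, b"hello"), (2, b"\x89PNG")])
assert ok == ["hello"] and missing == [("picture-format", "yyy", "2.8")]
```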
Thcgenius, I'm afraid I don't know what Mt Ranier CD-ROM is or its relevance, could you enlighten me please?
-- Loris, Oct 07 2002

// "Part 1 is fairly short, and exists so that UADF readers can decide whether they can read the file." "Part 3 gives a ...description... so that a UADF reader can check for a suitable plug-in." Rather defeats the purpose, no? //
I think [phoenix] was pointing out that if the reader couldn't read the file (for lack of a plugin) then your data is just as inaccessible as if it were in some other lost file format. You've paid a higher price in storage space and still have nothing to show for the effort. So what is the point?
If the format is truly self-bootstrapping from only the most crude concepts (your "first principles") then any reader must inherently be able to build its own plugin on the fly from the archive file. Any program incapable of handling such a task could not properly be called a UADF reader.
-- BigBrother, Oct 07 2002

Oh, and I do believe that the algorithm for navigating the bootstrapping process had better be *very* well documented and preserved in multiple, diverse, enduring, human-readable media.
-- BigBrother, Oct 07 2002

>> I think [phoenix] was pointing out that if the reader couldn't read the file (for lack of a plugin) then your data is just as inaccessible as if it were in some other lost file format. You've paid a higher price in storage space and still have nothing to show for the effort. So what is the point?
Oh, I _see_! I've failed to clearly make the distinction between a computer program and the actual human user. Where I've put "UADF reader" I've been meaning the computer program. I'll review and rewrite the idea as soon as I have time to do so properly.
If I can summarise briefly, the idea is that you can use a UADF reader program for day-to-day work. The user would potentially notice no great difference between a UADF-based DTP program, say, and an unenabled version.
The boot-strapping information (and description of each 'section' type) is not used by the UADF reader program. The only meta-data used by the program is that needed to identify whether it can use the file. This is because it is impossible to indicate what data actually represents without relying on an intelligent life-form. (Example: how would you indicate that some data represented sound which was to be played through speakers, if there were no inherent sound support?)
The boot-strap data is really there so that a human can look at it and work out how to write a program (a plug-in) to actually use the data. (Looking at the idea now, I can see I've basically written this - look under 'verbose'.) I hope this makes sense now?
-- Loris, Oct 08 2002

Another quick anno: the main payoff is the ability to recover the data should the plug-in be lost. This would of course require programming effort, but should be straightforward.
The secondary beneficial effect is that the format cannot be 'closed'. So there is a level playing field for plug-in programs on every platform. Someone can write a program to display a UADF section type and charge for it. But this doesn't stop anyone else from writing a program which can read and write that UADF section type. It's a sort of non-confrontational GPL for data formats.
-- Loris, Oct 08 2002

Sounds like HTML (open, but extensible; allows for meta-data descriptors; is usable even when ancillary content is not; etc.).
-- phoenix, Oct 08 2002

Yeah, I suppose it's a bit like HTML in that respect. Only... the format is reversed. With HTML you can (potentially) understand the data but not the meta-data, while with UADF you could be able to read the meta-data but not the data.
-- Loris, Oct 08 2002

Loris: Helps if I spell it right, first of all... (Mt Rainier..)
I'm not so sure this addresses all of your concerns in the first place, so it might not even be relevant. [link added]
Mt. Rainier compliant CD-RW drives add five key new features to CD-RW:
- support for defect management
- background formatting
- early eject of disk
- new disk format for reliable disk interchange based on international standards
- addressing support for small (2k byte) amounts of data, that facilitates efficient writing using a drive letter drag-and-drop writing interface
Mt. Rainier drives are multi-function drives targeted at the following business and consumer applications:
- backup and archival of files
- reliable data distribution through disk interchange
- supplementary storage
- audio recording
-- Mr Burns, Oct 08 2002

I think the US Department of Energy has gone some way down this road. They looked at the problem of storing radioactive waste in such a way that it would be safe for future generations. The geology work to identify which disused mines would be safe for storage was hard, but possible. Where they had real difficulties was in devising a warning sign to put on top of the storage dump which would say "Don't dig here - there's radioactive stuff buried here which will kill you". The reason is that the waste could be dangerous for up to 100,000 years, and at present we're not sure what the inscriptions on the pyramids (only 5,000 years old) are saying. As far as I remember, the best they came up with was a huge field of tall, sharp spikes around the storage dump. The relevance for this idea is that if you write the encoding guide in English (for example), it won't take long before a university department of ancient-language professors will be needed to read it. It might be better to pack the data with an agent which continuously seeks out on the internet (and its descendants) the best encoding method and regularly recodes the data.
-- hippo, Oct 09 2002

Oh, the good old days, when you never knew where you'd be and what would be the state of things. Time was, I had to break myself of the habit of putting my external commands on every floppy I made; the trade-off of giving up 25,307 bytes of disk space for Command.com to have a bootable disk seemed like a good idea then. How much more complicated are constraints such as the robustness limit of data storage media or error-intolerant hardware.
Any potential here for slipstreaming meta-data files?
-- reensure, Oct 09 2002

This is a difficult, and very interesting, problem.
I remember reading a paper from someone at MIT on their Time Capsule File System. Here's the link: http://www.swiss.ai.mit.edu/~boogles/papers/tcfs-thesis/thesis.html
It uses a similar idea of lots of meta-data in a format that's ultimately human-readable.
The Long Now foundation: http://www.longnow.org/ is taking a different approach. Instead of encoding the data as bits, they scribe the image of the data onto a nickel-plated disk designed to survive, buried in the ground, for several thousand years.
That suggests that one important thing you need to specify is a time frame: how long do you want your data to last? You'll probably need very different approaches if you're talking about spans of decades, millennia, or hundreds of thousands of years.
The original idea, which keeps data in computer readable formats, seems best suited to spans of up to maybe 500 years.
As far as the meta-data describing the plugins goes, maybe expressing it in terms of a scripting language would be appropriate, so the file contains the plugin. Keep the commands in the scripting language low level, maybe even as low level as a RISC-based assembly language, and you can bundle a boiler-plate, human-readable description of the language with each archive. The description could be in whatever human language the data is in, so if I'm an English speaker my description is in English, while Chinese speakers would have theirs in Chinese. Because it's boiler-plate, it acts as a form of Rosetta Stone, giving future archeologists a boost in decoding data in extinct languages.
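A toy sketch of what this annotation describes, assuming a made-up handful-of-instructions stack machine (nothing here is a real spec); the point is that the whole instruction set is small enough to describe in a page of prose inside the archive:

```python
# A plugin expressed as a program for a tiny, hypothetical stack machine.
# Instructions are (op, arg) pairs; the machine reads input bytes and
# produces output bytes, so a future reader need only re-implement this
# one-page interpreter to run any archived plugin.

def run(program, data):
    stack, out, pc = [], [], 0
    while pc < len(program):
        op, arg = program[pc]; pc += 1
        if op == "PUSH":   stack.append(arg)
        elif op == "LOAD": stack.append(data[stack.pop()])  # data[i]
        elif op == "ADD":  stack.append(stack.pop() + stack.pop())
        elif op == "EMIT": out.append(stack.pop() & 0xFF)
        elif op == "DUP":  stack.append(stack[-1])
        elif op == "LEN":  stack.append(len(data))
        elif op == "JNE":  # pop b, a; jump to instruction `arg` if a != b
            b, a = stack.pop(), stack.pop()
            if a != b: pc = arg
    return bytes(out)

# A trivial "plugin": copy the input to the output, one byte at a time.
copy_plugin = [
    ("PUSH", 0),                                   # i = 0
    ("DUP", None), ("LOAD", None), ("EMIT", None), # emit data[i]
    ("PUSH", 1), ("ADD", None),                    # i += 1
    ("DUP", None), ("LEN", None), ("JNE", 1),      # loop while i != len
]

assert run(copy_plugin, b"UADF") == b"UADF"
```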
Finally, don't worry about storage costs, since you have Moore's Law on your side: storage keeps getting cheaper. Instead, opt for simple and rugged encoding techniques, such as BMP for graphics and WAV for sounds.
-- FalseData, May 28 2003