The Affair of the Vanishing Content

By: Sam Vaknin, Ph.D.

http://www.archive.org/

"Digitized information, especially on the Internet, has such rapid turnover these days that total loss is the norm. Civilization is developing severe amnesia as a result; indeed it may have become too amnesiac already to notice the problem properly."

(Stewart Brand, President, The Long Now Foundation)

Thousands of articles and essays posted by hundreds of authors were lost forever when themestream.com abruptly shut its virtual gates. A sizable portion of the 1960 census, recorded on UNIVAC II-A tapes, is now inaccessible. Web hosts crash daily, erasing valuable content in the process. Access to web sites is often suspended - or blocked altogether - because of a real (or imagined) violation by the webmaster of the host's Terms of Service (TOS). Millions of other web sites - the results of collective, multi-year, transcontinental efforts - contain unique stores of information in the form of databases, articles, discussion threads, and links to other web sites. Consider "Central Europe Review": its archives comprise more than 2,500 articles and essays about every conceivable aspect of Central and Eastern Europe and the Balkans. It is one of countless such collections.

Similar and much larger treasures have perished since the dawn of the electronic age in the 1920s. Very few early radio and TV programs have survived, for instance. The current "digital dark age" can be compared only to the one which followed the torching of the Library of Alexandria. The more accessible and abundant the information available to us, the more devalued and common it becomes - and the less institutional and cultural memory we seem to possess. In the battle between paper and screen, paper has won formidably. Newspaper archives, dating back to the 1700s, are now being digitized - testifying to the endurance, resilience, and longevity of paper.

Enter the "Internet Libraries", or Digital Archival Repositories (DAR). These are libraries that provide free access to digital materials replicated across multiple servers ("safety in redundancy"). They contain Web pages, television programming, films, e-books, archives of discussion lists, and more. Such materials can help linguists trace the development of language, journalists conduct research, scholars compare notes, students learn, and teachers teach. The Internet's evolution mirrors closely the social and cultural history of North America at the end of the 20th century. If this record is not preserved, our understanding of who we are and where we are going will be severely hampered. The clues to our future lie ensconced in our past, and studying that past is the only guarantee against repeating the mistakes of our predecessors. Long-gone Web pages cached by the likes of Google and Alexa constitute the first tier of such archival undertakings.

The Stanford Archival Vault (SAV) at Stanford University assigns a numerical handle to every digital "object" (record) in a repository. The handle is derived mathematically from the bits of the original object being deposited. This makes it possible to track and uniquely identify records across multiple repositories; it also exposes tampering, since any alteration of the object changes its handle. SAV also offers application layers. These allow programmers to develop digital archive software and permit users to change the "view" (the interface) of an archive and thus to mine data. Its "reliability layer" verifies the completeness and accuracy of digital repositories.
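The idea of a handle computed from an object's bits can be sketched with a cryptographic hash. This is an illustrative analogue, not SAV's actual formula (which the text does not specify); the function and record names below are hypothetical.

```python
import hashlib

def handle(data: bytes) -> str:
    """Derive a fixed-length handle from an object's bit content.

    The same bits always yield the same handle, so the handle can
    identify a record across repositories; any change to the bits
    yields a different handle, which exposes tampering.
    """
    return hashlib.sha256(data).hexdigest()

record = b"Central Europe Review, article archive entry"
h = handle(record)

# Identical bits map to an identical handle across repositories...
assert handle(record) == h
# ...while even a one-character alteration produces a different one.
assert handle(b"Central Europe Review, article archive Entry") != h
```

Because the handle depends only on the content, two repositories holding the same object will compute the same identifier independently, without any central naming authority.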

The Internet Archive, a leading digital depository, in its own words:

"...is working to prevent the Internet — a new medium with major historical significance — and other "born-digital" materials from disappearing into the past. Collaborating with institutions including the Library of Congress and the Smithsonian, we are working to permanently preserve a record of public material."

Data storage is the first phase. It is not as simple as it sounds. The proliferation of formats of digital content has made it necessary to develop a standard for archiving Internet objects. The sheer size of digitized collections poses a serious challenge to timely retrieval. Interoperability issues (numerous formats and readers) probably require software and hardware plug-ins to render a smooth and transparent user interface.

Moreover, as time passes, digital data stored on magnetic media tend to deteriorate. They must be copied to newer media every 10 years or so ("migration"). Advances in hardware and software render many digital records indecipherable (try reading your word processing files from 1981, stored on 5.25" floppies!). Special emulators of older hardware and software must be used to decode such ancient data files. And, to ameliorate the impact of inevitable natural disasters, accidents, bankruptcies of publishers, and politically motivated destruction of data, multiple copies and redundant systems and archives must be maintained. As time passes, data formatting "dictionaries" will be needed. Data preservation is hardly useful if the data cannot be searched, retrieved, extracted, and researched. And, as "The Economist" put it ("The Economist Technology Quarterly", September 22nd, 2001), without a "Rosetta Stone" of data formats, future deciphering of the stored data might prove to be an insurmountable obstacle.
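The migration step described above can be sketched in a few lines: copy the data to new media, then verify the copy bit-for-bit with a checksum taken before and after. This is a minimal illustration of the principle, not an archival tool; the file names are invented for the example.

```python
import hashlib
import os
import shutil
import tempfile

def migrate(src: str, dst: str) -> None:
    """Copy a file to newer media and verify the copy bit-for-bit.

    A checksum computed before and after the copy guards against
    silent corruption introduced during migration.
    """
    with open(src, "rb") as f:
        before = hashlib.sha256(f.read()).hexdigest()
    shutil.copyfile(src, dst)
    with open(dst, "rb") as f:
        after = hashlib.sha256(f.read()).hexdigest()
    if before != after:
        raise IOError("migration of %s corrupted the data" % src)

# Demonstration with a throwaway file standing in for aging media.
tmp = tempfile.mkdtemp()
src = os.path.join(tmp, "old-media.dat")
dst = os.path.join(tmp, "new-media.dat")
with open(src, "wb") as f:
    f.write(b"\x00\x01\x02" * 1000)
migrate(src, dst)
```

Note that checksums only detect corruption of the bits; they do nothing about format obsolescence, which is why the "Rosetta Stone" of formats remains a separate problem.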

Last, but by no means least, Internet libraries are Internet based. They themselves are as ephemeral as the historical record they aim to preserve. This tenuous cyber existence goes a long way towards explaining why our paperless offices consume much more paper than ever before.
