Mindjack - Feature - Inside The Internet Archive

November 04, 2002 | Tucked away in one of the seediest neighborhoods of San Francisco is a roomful of over two hundred computers with a terabyte of data stored on every three. Stairs from the street lead up an intimidating hallway that opens into a room with 15-foot ceilings and just-this-side-of-hip ductwork overhead. To the right is a storage area with a single desk; to the left are baker's racks tightly packed with off-the-shelf HP desktop machines, each turned on its side to maximize the space. Somewhere in all that ductwork, a fan is squeaking painfully. Walking into this echoey, over-warm warehouse space, it's easy to be underwhelmed until you realize what you are looking at: spinning away on these computers is nothing less than a copy of the Internet from 1996 until today.

stairs leading to the Archive

Hyperbole is easy to generate: over 10 billion pages are held here. The content of a single computer is equivalent to the entire Library of Congress. Over 250 gigabytes of data are added daily. Over 12 terabytes are added every month and there are a total of over 120 terabytes of storage available. As a copy of the entire publicly accessible internet, it is also certainly the world's largest collection of pornography in a single room.

The Internet Archive was founded in 1995 by Brewster Kahle with the intent of preserving what is arguably the fastest-growing archive of human expression ever created. The Library of Congress, and other analog equivalents, keep copies of the thousands of books published every year, helping to preserve our paper history, but the amount of content on the internet has grown to dwarf that repository. Also, unlike books on the shelves of libraries, web pages are in a constant state of flux: the average age of a web site is only 19 months, with the average page changing every 100 days. The amount of information created and lost every day is staggering. The current state of digital technology and the internet makes it feasible for the Archive to reach its stated mission of universal access to human knowledge.

The Internet Archive has several locations. The Mission Street facility is its co-location site; this no-frills building is where the data is stored. The other primary site is located at the Presidio, in the shadow of the Golden Gate Bridge. That San Francisco location is where most of the staff are and where the day-to-day business of the Archive takes place. There is another site in Egypt that holds a duplicate of the data on the Mission Street hard drives.

Computing Power on a Shoestring Budget

A walk through any other co-location site would be a testament to the Cisco and Sun sales force: high-end servers with terabytes of storage all connected together in a network that has packets flying so fast you can feel your fillings melt if you get too close to the wires. The Archive, on the other hand, is a poster child for function over flash. The machines on the racks are the cheapest possible (the ideal computer is one with a reasonable CPU, a gigabyte of RAM and a case for under $300), loaded with a free operating system. The racks look to be straight out of Costco and you won't find an Aeron chair anywhere on the premises.

Gathering this many computers is certain to point out even the smallest problems inherent in the hardware. There is a failure of one kind or another almost every other day. Hard drives are a common problem, failing long before the manufacturers' specifications claim they should. The summertime temperatures in the Archive facility hover slightly above what would be covered by the warranty, and a failed drive is potentially a piece of internet history lost. Motherboards, power supplies: anything that can go wrong, will. In those cases, a crash cart is rolled over, the problem diagnosed and hopefully fixed.

So why isn't the Archive using the latest and greatest Sun servers in a temperature-controlled co-location facility where the air and power are filtered? The answer is simple: librarians don't drive Porsches. The Archive budget is not overly generous, and the total cost of ownership for Sun servers, proprietary operating systems and top-of-the-line routers and switches can be onerous. Filtering and cooling the air would approximately double the operating cost of the Archive. Under these restrictions it makes more sense to use commodity hardware and have layers of redundancy rather than a single server worth thousands of dollars.

Crawling

Anyone familiar with the concept of a web spider will understand the heart of the Archive's operation. Spiders are programs that traverse the internet and glean information from web sites. Search engines such as Google, Yahoo and Lycos rely on spiders to gather information to feed their search engines. The quality of these spiders and the quality of the indexing determine the success of your searches.

The Internet Archive does essentially the same thing in what it calls a "crawl." Crawls are either carried out by the Archive itself or donated by other entities. Most of the data currently in the Archive has been donated by Alexa Internet, also founded by Brewster Kahle. The Archive does two kinds of crawls: broad and narrow. A broad crawl is an attempt to archive a wide range of sites as completely as possible, while a narrow crawl is designed for complete coverage of selected sites or selected topics. Both types of crawls have their own inherent challenges.

Broad crawls can create a number of different problems. While it may be easy to create a crawler that takes full advantage of a 100 Mb/s link to the internet, it becomes increasingly difficult to keep the crawler fed. Extracting unique URLs becomes computationally demanding as the database of indexed sites increases. A broad crawl can encompass over 150 million web pages in a week and run for 40-60 days total. Each page encountered needs to have the links on it extracted and followed. However, those links first need to be checked against the database of previously visited sites. If this is the first time a URL has been hit, it will need more attention, while a repeat can be ignored. This comparison is done in RAM, so if the database is large enough to exhaust the RAM on the computer and force it to page to disk, the speed of the crawl slows down significantly.
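The check described above can be sketched in a few lines of Python. This is only an illustration and not the Archive's crawler code: the names `seen`, `normalize` and `extract_new_links`, the regular-expression link extraction and the plain in-memory `set` are all assumptions made for the example.

```python
import re
from urllib.parse import urldefrag, urljoin

# Hypothetical in-RAM record of every URL the crawl has already seen.
seen = set()

# Simplistic link finder, good enough for a sketch; a real crawler
# would use a proper HTML parser.
HREF = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def normalize(base_url, link):
    """Resolve a link against the page it came from and drop any #fragment."""
    absolute = urljoin(base_url, link)
    return urldefrag(absolute).url

def extract_new_links(base_url, html):
    """Return the links on a page that have not been seen before."""
    fresh = []
    for link in HREF.findall(html):
        url = normalize(base_url, link)
        if url not in seen:       # the in-RAM comparison
            seen.add(url)
            fresh.append(url)     # first visit: queue it for fetching
    return fresh
```

Every URL ever encountered stays in `seen`, which is why memory becomes the bottleneck: once the set outgrows physical RAM, each membership test may touch the disk.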

Another issue for crawls is politeness: not all web sites are able to handle the load imposed by a high-performance, multi-threaded crawler. In these cases, there are only two outcomes: either the crawler is smart enough to back off and reduce the strain on the server, or the server will likely crash. A crawl that is looking up tens of millions of web sites per day can also have a devastating effect on DNS servers. These servers also often run out of RAM quickly as the lookups accumulate.
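One common way to "back off" is to keep a per-host delay and lengthen it whenever a server shows signs of strain. The sketch below is a guess at the general shape of such a policy, not a description of the Archive's crawler: the class name `PolitenessPolicy`, the one-second base delay, the doubling rule and the `overloaded` flag (which a caller might set after a timeout or a server error) are all assumptions.

```python
from urllib.parse import urlsplit

class PolitenessPolicy:
    """Tracks, per host, when the next request is allowed."""

    def __init__(self, base_delay=1.0, max_delay=60.0):
        self.base_delay = base_delay  # assumed seconds between hits on one host
        self.max_delay = max_delay    # assumed upper bound on the back-off
        self.delay = {}               # host -> current delay in seconds
        self.next_ok = {}             # host -> earliest time of the next request

    def wait_time(self, url, now):
        """Seconds the crawler should still wait before fetching url."""
        host = urlsplit(url).hostname
        return max(0.0, self.next_ok.get(host, 0.0) - now)

    def record(self, url, now, overloaded=False):
        """Note a completed request; double the delay if the server struggled."""
        host = urlsplit(url).hostname
        current = self.delay.get(host, self.base_delay)
        if overloaded:
            current = min(current * 2, self.max_delay)
        else:
            current = self.base_delay  # healthy response: reset to the base delay
        self.delay[host] = current
        self.next_ok[host] = now + current
```

A multi-threaded crawler would consult `wait_time` before each fetch and pick a URL on a different host when the answer is non-zero, so the crawl as a whole stays busy while each individual server gets breathing room.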

Broad crawls are a necessity because in the vast pool of web pages, it's impossible to know what information should be preserved. A broad crawl is specifically designed to copy as much information as possible over a wide range of web sites. Since there are a lot of duplicate pages out there, and the task of discerning them on the fly is far too difficult, the Archive ends up with about a 30% duplication rate on broad crawls.

Narrow crawls may require less storage and less bandwidth, but they have their own challenges. On a topic-driven crawl, the most obvious is the programming involved in assuring that the crawl has achieved its goal. Making certain that topics such as the attacks on September 11th have been covered completely goes far beyond finding pages with those keywords on them.

Taking care not to overwhelm servers becomes even more of an issue in a site-based narrow crawl. Attacking an underpowered web site with the full force of the Archive is certain not to make friends. In order to cover the sites completely, it is also vital to be able to pull out links from the pages that are not strictly HTML: links buried in Flash, Shockwave, JavaScript or created dynamically in other ways. In the same vein, the robot exclusion file that some sites use to declare parts of the site out-of-bounds to normal spiders can be ignored by the Archive if it is doing a crawl on behalf of an authority such as the Library of Congress. In that case, the webmaster will receive a notification, suitable for framing, explaining that they should be suitably honored that the exclusion files are going to be ignored so that this site can be added to the Library of Congress Web Archive.
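For readers who have not met the robot exclusion file, here is roughly how a spider consults it, using the parser that ships with Python's standard library. The function name `may_fetch`, the `override` flag standing in for a crawl done on a library's behalf, and the `example-crawler` user agent are all invented for this sketch; nothing here is taken from the Archive's own software.

```python
from urllib.robotparser import RobotFileParser

def may_fetch(robots_txt, user_agent, url, override=False):
    """Decide whether a crawler may fetch url under the given robots.txt text.

    override is a hypothetical switch for crawls that are allowed
    to ignore the exclusion file.
    """
    if override:
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse rules already downloaded
    return parser.can_fetch(user_agent, url)

# A site that puts its /private/ directory out of bounds to all spiders.
rules = "User-agent: *\nDisallow: /private/\n"
print(may_fetch(rules, "example-crawler", "http://example.com/index.html"))
print(may_fetch(rules, "example-crawler", "http://example.com/private/x.html"))
```

The exclusion file is purely advisory: the server does nothing to enforce it, which is why a crawler can choose to set it aside.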

Sure it's cool, but what can it do?

The Internet Archive is for web developers what home movies must be for celebrities. Preserved for posterity are our bad web designs, animated GIFs and blinking text. The Wayback Machine is a glimpse into the pages held in Mission Street: enter a URL and it will sweep you back as far as 1996 to look at the early days of the internet explosion. The Wayback Machine is a popular site; it receives about five million hits per day. While it is clearly in heavy use, it's not known what research, if any, the people using the Wayback Machine are doing. Other researchers are using the Archive, but it's not an easy task. While there is plenty of data to look at, there isn't an easy interface for accessing it.

The types of research that are waiting to be done are even more interesting. The Archive contains over ten thousand news sites, various archives of e-mail lists, and a growing number of blogs. This is the chatter of the world, and as time goes on it can provide a wonderful glimpse into the psyche of the time. Nefarious uses are equally easy to imagine: tracking the effects of advertising campaigns, product releases or political debates. Having the opinions of millions of people clearly documented, easily accessible and quantifiable would be a boon to market researchers and anthropologists alike. Blogs and personal web sites may be of even greater interest than mainstream news sites since they are an unfiltered view of the public at large.

Bigger, Better, Faster, More

The Archive can only continue to grow, but obtaining more raw data isn't the only goal. The tools for accessing the data have to be improved before research can become commonplace using the Archive's stores. The quality of the collections needs to be addressed also: duplicates, missing pieces, new sites and new technologies all must be dealt with as the collection continues to grow.

The future also holds more cooperation between the Archive and other organizations. The Library of Congress and other national libraries have a keen interest in adding web archives to their collections, so teaming up makes perfect sense.

The speed of crawls is also due for an upgrade. Alexa Internet is doing the majority of the crawls for the Archive, but soon the Archive will be able to do its own. When that is put in place, the collective speed of crawls will effectively double from the current 45Mb/sec to almost 100Mb/sec. That also means an increase in the speed of the LAN at Mission Street and the WAN between Mission and the Presidio.
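To put those rates in perspective, a quick back-of-the-envelope conversion helps. The sketch below assumes that "Mb" means megabits, that a gigabyte is a billion bytes, and that the link runs flat out around the clock, none of which was confirmed by the Archive; `gigabytes_per_day` is a name made up for the example.

```python
def gigabytes_per_day(megabits_per_second):
    """Upper bound on daily transfer for a link running at full speed."""
    bytes_per_second = megabits_per_second * 1_000_000 / 8  # bits -> bytes
    seconds_per_day = 86_400
    return bytes_per_second * seconds_per_day / 1_000_000_000

for rate in (45, 100):
    print(rate, "Mb/sec is at most", gigabytes_per_day(rate), "GB per day")
```

Under those assumptions, 45Mb/sec is a ceiling of roughly 486 gigabytes a day and 100Mb/sec roughly 1,080. Real crawls, which pause for politeness and spend time on bookkeeping, will come in under those ceilings.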

There is a great deal more to the Archive than what is covered here. The Prelinger collection of advertising, educational, industrial, and amateur movies is a fantastic, scary and sometimes downright hilarious remembrance of "duck and cover" type movies from the 1950s and on. The collection covers everything from patriotism to personal hygiene with a pre-"Bewitched" Dick York in a number of roles. Most of the movies are unintentionally, jaw-droppingly funny. (See: Dick York as the Shy Guy, Dick York as a runner battling insomnia, Dick York as a Navy recruit with insomnia. Geez, did this guy ever sleep?)

Explore the Archive's web site often; this article has barely scratched the surface of the material available. The Internet Archive Bookmobile, a mirror of the Project Gutenberg files, the Orphan Films collection and the special collection of September 11th-related web material are a small sample of the overwhelming store of information available. And it's free, just like any good library.

Special thanks go out to Raymie Stata and Charles Barr for their help on this article.

bio:
Doug Roberts is the IT Manager for a cool company in Burlingame, California.