
November 04, 2002

Tucked away in one of the seediest neighborhoods of San Francisco is a roomful of over two hundred computers, with a terabyte of data stored on every three of them. Stairs from the street lead up through an intimidating hallway that opens into a room with 15-foot ceilings and just-this-side-of-hip ductwork overhead. To the right is a storage area with a single desk; to the left are Baker's racks tightly packed with off-the-shelf HP desktop machines, each turned on its side to maximize the space. Somewhere in all that ductwork, a fan is squeaking painfully. Walking into this echoey, over-warm warehouse space, it's easy to be underwhelmed until you realize what you are looking at: spinning away on these computers is nothing less than a copy of the Internet from 1996 until today.
[Photo: stairs leading to the Archive]
Hyperbole is easy to generate: over 10 billion pages are held here.
The content of a single computer is equivalent to the entire Library
of Congress. Over 250 gigabytes of data are added daily, over 12 terabytes every month, and there is a total of over 120 terabytes of storage available. As a copy of the entire publicly accessible internet, it is also certainly the world's largest collection of pornography in a single room.
The Internet Archive was founded
in 1995 by Brewster Kahle with the intent of preserving what is
arguably the fastest growing archive of human expression ever created.
The Library of Congress, and other analog equivalents, keep copies
of the thousands of books published every year, helping to preserve
our paper history, but the amount of content on the internet has
grown to dwarf that repository. Also, unlike books on the shelves
of libraries, web pages are in a constant state of flux: the average
age of a web site is only 19 months with the average page changing
every 100 days. The amount of information created and lost every
day is staggering. The current state of digital technology and the
internet makes it feasible for the Archive to pursue its stated
mission of universal access to human knowledge.
The Internet Archive has several locations. The Mission Street facility is its co-location site, the no-frills building where the data is stored. The other primary site is at the Presidio, in the shadow of the Golden Gate Bridge; that is where most of the staff work and where the day-to-day business of the Archive takes place. Another site, in Egypt, holds a duplicate of the data on the Mission Street hard drives.
Computing Power on a Shoestring Budget

A walk through any other co-location site would be a testament
to the Cisco and Sun sales force: high-end servers with terabytes
of storage all connected together in a network that has packets
flying so fast you can feel your fillings melt if you get too close
to the wires. The Archive, on the other hand, is a poster child
for function over flash. The machines on the racks are the cheapest
possible (the ideal computer has a reasonable CPU, a gigabyte of RAM,
and a case for under $300), loaded with a free
operating system. The racks look to be straight out of Costco and
you won't find an Aeron chair anywhere on the premises.
Gathering this many computers is certain to expose even the
smallest problems inherent in the hardware. There is a failure of
one kind or another almost every other day. Hard drives are a common
problem, failing long before the manufacturers' specifications claim
they should. The summer-time temperatures in the Archive facility
hover slightly above what would be covered by the warranty, and
a failed drive is potentially a piece of internet history lost.
Motherboards, power supplies: anything that can go wrong, will.
In those cases, a crash cart is rolled over, the problem diagnosed
and hopefully fixed.
So why isn't the Archive using the latest and greatest Sun servers
in a temperature controlled co-location facility where the air and
power are filtered? The answer is simple: Librarians don't drive
Porsches. The Archive budget is not overly generous, and the total
cost of ownership for Sun servers, proprietary operating systems
and top-of-the-line routers and switches can be onerous. Filtering
and cooling the air would approximately double the operating cost
of the Archive. Under these restrictions it makes more sense to
use commodity hardware and have layers of redundancy rather than
a single server worth thousands of dollars.
Crawling
Anyone familiar with the concept of a web spider will understand
the heart of the Archive's operation. Spiders are programs that
traverse the internet and glean information from web sites. Search
engines such as Google, Yahoo and Lycos rely on spiders to gather
the information that feeds their indexes. The quality of these spiders
and the quality of the indexing determine the success of your searches.
The Internet Archive does essentially the same thing in what it
calls a "crawl." Crawls are carried out by the Archive
itself or crawls are donated by other entities. Most of the data
currently in the Archive has been donated by Alexa internet, also
founded by Brewster Kahle. The Archive does two kinds of crawls:
broad and narrow. A broad crawl is an attempt to archive a wide
range of sites as completely as possible while a narrow crawl is
designed for complete coverage of selected sites or selected topics.
Both types of crawls have their own inherent challenges.
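To make the idea concrete, here is a minimal spider sketch in Python using only the standard library. The seed URL, the page limit, and the breadth-first strategy are illustrative assumptions; the Archive's and Alexa's real crawlers are far more elaborate.

```python
# A minimal breadth-first spider sketch (illustrative, not the Archive's crawler).
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href attribute of every anchor tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, max_pages=25):
    """Breadth-first crawl starting at `seed`, fetching at most `max_pages` pages."""
    frontier = deque([seed])   # URLs waiting to be fetched
    seen = {seed}              # every URL that has ever been queued
    fetched = 0
    while frontier and fetched < max_pages:
        url = frontier.popleft()
        try:
            with urlopen(url, timeout=10) as response:
                if "text/html" not in response.headers.get("Content-Type", ""):
                    continue
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue           # skip pages that fail to fetch
        fetched += 1
        parser = LinkExtractor()
        parser.feed(html)
        for link in parser.links:
            absolute, _ = urldefrag(urljoin(url, link))  # resolve and drop #fragments
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
        yield url, html


if __name__ == "__main__":
    for page_url, page_html in crawl("http://example.com/"):
        print(page_url, len(page_html), "bytes")
```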
Broad crawls can create a number of different problems. While it
may be easy to create a crawler that takes full advantage of a 100Mb/sec
link to the internet, it becomes increasingly difficult to keep
the crawler fed. Extracting unique URLs becomes computationally
demanding as the database of indexed sites increases. A broad crawl
can encompass over 150 million web pages in a week and run for 40-60
days total. Each page encountered needs to have the links on it
extracted and followed. However, those links first need to be checked
against the database of previously visited sites. If it's the first
time this URL has been hit, then it will need more attention while
a repeat can be ignored. This comparison is done in RAM, so if the
database grows large enough to exhaust the computer's RAM and
force it to page to disk, the crawl slows down significantly.
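One common way to keep that seen-before test in RAM at a predictable size is a Bloom filter, which answers "definitely new" or "probably seen" using a fixed number of bits per URL. The sketch below is a generic illustration, not the Archive's actual data structure, and its sizing numbers are assumptions.

```python
# A minimal Bloom filter sketch. Two hash values derived from one SHA-1 digest
# are combined to simulate k independent hash functions (double hashing).
import hashlib
import math


class BloomFilter:
    def __init__(self, expected_items, false_positive_rate=0.01):
        # Classic sizing formulas: m = -n*ln(p)/(ln 2)^2 bits, k = (m/n)*ln 2 hashes.
        self.num_bits = int(-expected_items * math.log(false_positive_rate)
                            / (math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / expected_items * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


# Ten million URLs at a 1% false-positive rate fits in about 12 MB; a hundred
# million fits in roughly 120 MB, versus many gigabytes for the raw URL strings.
seen = BloomFilter(expected_items=10_000_000)
url = "http://example.com/some/page.html"
if url not in seen:
    seen.add(url)       # "definitely new": fetch it, then remember it
```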
Another issue for crawls is politeness: not all web sites
are able to handle the load imposed by a high-performance, multi-threaded
crawler. In these cases there are only two outcomes: either the
crawler is smart enough to back off and reduce the strain on the
server, or the server will likely crash. A crawl that looks
up tens of millions of web sites per day can also have a devastating
effect on DNS servers, which often run out of RAM quickly
as the lookups accumulate.
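A polite crawler typically enforces a minimum delay between requests to the same host and caches DNS answers so the same name is not resolved millions of times over. The rough sketch below illustrates both ideas; the two-second delay is an assumption, not a figure from the Archive.

```python
# A crude politeness sketch: per-host rate limiting plus a DNS cache.
import socket
import time
from urllib.parse import urlparse

MIN_DELAY_PER_HOST = 2.0          # illustrative delay between hits to one host
last_hit = {}                     # host -> time of the last request
dns_cache = {}                    # host -> resolved IP address


def wait_politely(url):
    """Sleep if the host for `url` was contacted too recently."""
    host = urlparse(url).netloc
    now = time.monotonic()
    earliest = last_hit.get(host, 0.0) + MIN_DELAY_PER_HOST
    if now < earliest:
        time.sleep(earliest - now)
    last_hit[host] = time.monotonic()


def resolve(host):
    """Resolve a hostname once and reuse the answer for later requests."""
    if host not in dns_cache:
        dns_cache[host] = socket.gethostbyname(host)
    return dns_cache[host]
```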
Broad crawls are a necessity because in the vast pool of web pages,
it's impossible to know what information should be preserved. A
broad crawl is specifically designed to copy as much information
as possible over a wide range of web sites. Since there are a lot
of duplicate pages out there, and the task of discerning them on
the fly is far too difficult, the Archive ends up with about a 30%
duplication rate on broad crawls.
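Exact duplicates, at least, can be caught cheaply by hashing each page body and comparing digests; it is the near-duplicates that make on-the-fly detection so hard. A minimal sketch of the exact-match case, offered as an illustration rather than a description of the Archive's pipeline:

```python
# Exact-duplicate detection by content hash (near-duplicates need more work).
import hashlib

seen_digests = set()


def is_duplicate(page_bytes):
    """Return True if an identical page body has already been stored."""
    digest = hashlib.sha1(page_bytes).digest()
    if digest in seen_digests:
        return True
    seen_digests.add(digest)
    return False
```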
Narrow crawls may require less storage and less bandwidth, but
they have their own challenges. On a topic-driven crawl, the most
obvious challenge is the programming involved in ensuring that the crawl has
achieved its goal. Making certain that topics such as the attacks
on September 11th have been covered completely goes far beyond finding
pages with those keywords on them.
Taking care not to overwhelm servers becomes even more of an issue
in a site-based narrow crawl. Attacking an underpowered web site
with the full force of the Archive is certain not to make friends.
In order to cover the sites completely, it is also vital to be able
to pull out links from the pages that are not strictly HTML: links
buried in Flash, Shockwave, JavaScript or created dynamically in
other ways. In the same vein, the robot exclusion file that some
sites use to declare parts of the site out-of-bounds to normal spiders
can be ignored by the Archive if it is doing a crawl on behalf
of an authority such as the Library of Congress. In that case, the
webmaster receives a notification, suitable for framing, explaining
that they should be honored that the exclusion files are
going to be ignored so that the site can be added to the Library
of Congress Web Archive.
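For comparison, this is roughly how an ordinary, well-behaved crawler consults a robot exclusion file, here using Python's standard robotparser module. The authorized flag models the special case described above and is purely illustrative; it is not the Archive's actual policy code.

```python
# Checking robots.txt before fetching, with an illustrative override flag.
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def allowed_to_fetch(url, user_agent="example-crawler", authorized=False):
    if authorized:
        return True                      # exclusion rules deliberately ignored
    parts = urlparse(url)
    robots = RobotFileParser(f"{parts.scheme}://{parts.netloc}/robots.txt")
    robots.read()                        # fetch and parse the site's robots.txt
    return robots.can_fetch(user_agent, url)


print(allowed_to_fetch("http://example.com/private/page.html"))
```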
Sure it's cool, but what can it do?
The Internet Archive is for web developers what home movies must
be for celebrities. Preserved for posterity are our bad web designs,
animated gifs and blinking text. The Wayback Machine is a glimpse
into the pages held in Mission Street: enter a URL and it will sweep
you back as far as 1996 to look at the early days of the internet
explosion. The Wayback Machine is a popular site, receiving about five million
hits per day. While it is clearly in heavy use, it's not known what
research, if any, the people using the Wayback Machine are doing.
Other researchers are using the Archive, but it's not an easy task.
While there is plenty of data to look at, there isn't an easy interface
for accessing it.
The types of research that are waiting to be done are even more
interesting. The Archive contains over ten thousand news sites,
various archives of e-mail lists, and a growing number of blogs.
This is the chatter of the world, and as time goes on it can provide
a wonderful glimpse into the psyche of the time. Nefarious uses
are equally easy to imagine: tracking the effects of advertising campaigns,
product releases or political debates. Having the opinions of millions
of people clearly documented, easily accessible and quantifiable
would be a boon to market researchers and anthropologists alike.
Blogs and personal web sites may be of even greater interest than
mainstream news sites since they offer an unfiltered view of the public
at large.
Bigger, Better, Faster, More
The Archive can only continue to grow, but obtaining more raw data
isn't the only goal. The tools for accessing the data have to be
improved before research can become commonplace using the Archive's
stores. The quality of the collections needs to be addressed also:
duplicates, missing pieces, new sites and new technologies all must
be dealt with as the collection continues to grow.
The future also holds more cooperation between the Archive and
other organizations. The Library of Congress and other national
libraries have a keen interest in adding web archives to their collections,
so teaming up makes perfect sense.
The speed of crawls is also due for an upgrade. Alexa Internet
is doing the majority of the crawls for the Archive, but soon the
Archive will be able to do its own. When that is in place, the collective
speed of crawls will effectively double from the current 45Mb/sec
to almost 100Mb/sec. That also means an increase in the speed of
the LAN at Mission Street and the WAN between Mission and the Presidio.
There is a great deal more to the Archive than what is covered
here. The
Prelinger collection of advertising, educational, industrial,
and amateur movies is a fantastic, scary and sometimes downright
hilarious remembrance of "duck and cover"-type movies
from the 1950s onward. The collection covers everything from patriotism
to personal hygiene with a pre-"Bewitched" Dick York in
a number of roles. Most of the movies are unintentionally, jaw-droppingly
funny. (See: Dick York as the Shy
Guy, Dick York as a runner
battling insomnia, Dick York as a Navy
recruit with insomnia. Geez, did this guy ever sleep?)
Explore the Archive's web site often; this article has barely scratched
the surface of the material available. The Internet Archive Bookmobile,
a mirror of the Project Gutenberg files, the Orphan Films collection
and the special collection of September 11th related web material
are a small sample of the overwhelming store of information available.
And it's free, just like any good library.
Special thanks go out to Raymie Stata and Charles Barr for their
help on this article.
Bio: Doug Roberts is the IT Manager for a cool company in Burlingame, California.