Newsjunkie.net

Who's behind the news?

The Internet Archive

San Francisco

by Andrew Checchia

The Internet Archive is an independent digital library. With a collection spanning billions of websites, books, movies, music and software, IA is among the largest digital repositories in the world. Its stated mission is to provide “Universal Access to All Knowledge.”

Archive.org is the entry point to this vast collection. It operates online services that search, access and expand the information base, including the Wayback Machine (historic webpages) and the Open Library (digitized books). The extensive record of ephemeral media, including billions of websites captured at key moments over decades, constitutes a unique asset for researchers and historians. “Every librarian has two things in their soul: preservation and access,” said Brewster Kahle, the Internet innovator who founded IA in 1996. “We try to put things in context so people know how to relate to older materials.”

Through Archive-It, launched in 2006, IA has helped digitize physical libraries and the collections of various cultural heritage institutions. Over 800 groups, including Caltech, the Southern Poverty Law Center, the American Academy of Pediatrics, and the New York State Archives, use Archive-It to store and manage their collections. The data captured varies according to the needs of the organization, but everything is added to IA’s database, which maintains primary and secondary copies of all the material. “A lot of a librarian’s job is just to keep things available in shifting times,” added Kahle of the wide-ranging digitization efforts.

Keeping things on the shelves also involves recording government documents. IA has hosted an End of Term (EOT) Archive during each U.S. presidential transition since 2008. It captures material from federal sources that are “at risk of changing (i.e., whitehouse.gov) or disappearing altogether during government transitions.” These documents come from a select list of government websites that researchers and institutions, using Archive-It, record and upload to an archive accessible to the public through the Wayback Machine.

IA operates digitizing and storage centers around the world, where it houses multiple copies of more than 150 petabytes—150,000,000 gigabytes—of data. 

Near San Francisco’s historic Presidio, a decommissioned Christian Science church now houses the Internet Archive offices. Looming Doric columns decorate the facade of the grand white building, echoing IA’s Library of Alexandria-esque logo. The complex also serves as one of several locations for the group’s digitizing equipment and storage servers.

The Wayback Machine, launched in 2001, is the most popular way to access IA’s treasures. It provides a publicly searchable database of over 900 billion webpage snapshots. Anyone can upload snapshots to the archives, but the majority of these pages are logged by web crawlers, programs that systematically search and record websites.
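For readers who want to query the snapshot database from a script, here is a minimal sketch in Python. It assumes the Internet Archive's public availability endpoint (archive.org/wayback/available) and the JSON reply it returns, neither of which is described in this profile. The function names are illustrative, not part of any IA library.

```python
import json
from typing import Optional
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the public Wayback Machine availability endpoint.
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"


def build_query(page_url: str, timestamp: Optional[str] = None) -> str:
    """Return the request URL for a page, optionally near a YYYYMMDD timestamp."""
    params = {"url": page_url}
    if timestamp:
        params["timestamp"] = timestamp
    return AVAILABILITY_ENDPOINT + "?" + urlencode(params)


def closest_snapshot(response: dict) -> Optional[str]:
    """Pull the snapshot URL out of a decoded reply, or None if nothing is archived."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest.get("url")
    return None


def lookup(page_url: str, timestamp: Optional[str] = None) -> Optional[str]:
    """Query the live service (hypothetical helper; needs network access)."""
    with urlopen(build_query(page_url, timestamp), timeout=30) as reply:
        return closest_snapshot(json.load(reply))
```

Calling lookup("whitehouse.gov", "20090120") would ask for the capture closest to January 20, 2009. What comes back depends on what the archive holds for that page.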

IA also operates Open Library, a project with the stated goal “to make all the published works of humankind available to everyone in the world,” particularly by creating a webpage for every published book. It contains more than 28 million works. For titles in the public domain, there are public-facing webpages with the entire text. For other, newer works, Open Library offers a digital version of a traditional lending service, where users can borrow copies.

Its massive collection of varied data has made IA an appealing source for groups building artificial intelligence models, which require enormous troves of information to refine their products. Kahle said that most major players in the AI world have used archival materials to train their models.

Other significant IA projects include the TV News Archive, a collection of televised news broadcasts. Its first public-facing project was a record of news broadcasts covering the 9/11 terrorist attacks. IA’s website describes the project as cataloging an “ephemeral medium” similar to the Internet.

“As things are moving more digital, libraries are being counted out,” said Kahle. “But I think we’re in a great position if we do things right to allow people to use library materials at scale in really interesting and useful ways. That will require empowering the cultural heritage sector.”

Kahle, 64, founded IA with the intention of archiving as much of the Internet as possible. From his undergraduate education at MIT in the 1980s, where he studied in the university’s AI lab, to his early years in Silicon Valley, he witnessed the Internet’s accelerating evolution. This prompted him to preserve elements of the Internet for posterity. To that end, he started IA around the time he founded Alexa Internet, a web traffic analysis company. He sold Alexa to Amazon in 1999 for $250 million in stock.

Kahle contrasted IA’s mission with the publisher-first landscape of today’s entertainment and news industries, in which companies determine both how their products can be consumed and the narrative around those products’ history. Under that system, said Kahle, “you can change the past.”

“You have a generation raised on screens, but let’s just take the 20th century. It’s not on the Internet,” he added. “If you can’t quote things, you can’t put them into a new context. You can’t think critically. And when you’re talking about what we think—with news, books, reference materials—they can be changed at any time by one player. That should be frightening. Traditionally, libraries have been the antidote to that.”

Brewster Kahle is clear about the mission. “Our job is to try to keep the good works of humankind available to whoever comes looking.”  Ω

As a 501(c)(3) nonprofit, the Internet Archive is funded primarily by donations. It publishes a list of significant contributors, such as the National Science Foundation, the Democracy Fund and the Andrew W. Mellon Foundation.

More About End of Term Harvest and public data archives

EOT Archive. Main page
End of Term Harvest partners
Internet Archive
EDGI: Environmental Data & Governance Initiative
Common Crawl
Stanford University Libraries
University of North Texas Libraries
Library Innovation Lab Harvard Law School
Newsjunkie. EOT portal page
Newsjunkie. EOT Blog
Newsjunkie. Guide to Public Archives

Sources 
Newsjunkie. Brewster Kahle interview by Andrew Checchia, Jan 17, 2025
KALW. Internet Archive stores our digital history
ProPublica. Internet Archive financial and tax profile
The Verge. Podcast: Mark Graham on AI, linkrot
Newsjunkie. Mark Graham interview
Internet Archive. About Internet Archive
Internet Archive. TV news - general archive
Internet Archive. TV news - 9/11 archive
Internet Archive. Granted official library designation by California (Jun 25, 2007)
Internet Archive. Brewster leads tour of IA scanning center (Mar 29, 2013)
Open Library. Welcome to Open Library
Archive-It. What you can do with your Archive-It account
TED. Video - Brewster Kahle: A digital library, free to the world (2007)
© 2025 Newsjunkie.net

    Internet Archive

    6 categories
    Internet Platform
    Software, Tools and Services
    Directory
    Research Services
    Archive
    Mission: Internet Archive is a non-profit library of millions of free books, movies, software, music, websites, and more. Its stated mission is to provide “Universal Access to All Knowledge.”
    Is Non Profit: Yes
    Year Founded: 1996
    Is Locally Owned: No
    Type Description: media data repository


    Newsjunkie.net is a resource guide for journalists. We show who's behind the news, and provide tools to help navigate the modern business of information.
