1.5.2
Newsjunkie.net is a resource guide for journalists. We show who's behind the news, and provide tools to help navigate the modern business of information.
Use of Data
The Common Crawl Foundation (CCF) is a California-based nonprofit 501(c)(3) organization founded by Gil Elbaz in 2007 (with formal crawling operations beginning in 2008). Elbaz, a co-founder of Applied Semantics (acquired by Google in 2003) and founder of Factual, established Common Crawl with the goal of democratizing access to web data that had previously been available only to large technology companies. The organization has been a member of the International Internet Preservation Consortium (IIPC) since 2008. In 2024, it became a partner of the End of Term Archive project. Rich Skrenta serves as the organization's Executive Director.
Common Crawl maintains an open, continuously updated repository of web crawl data that dates back to 2008 and exceeds 9.5 petabytes. The archive captures publicly accessible web pages in WARC (Web ARChive) format, the standard used by libraries and archivists worldwide. The data is distributed free of charge from Amazon S3 servers and is mirrored in the Internet Archive's Wayback Machine. The CCF publishes monthly crawls and maintains a public index server for querying archived URLs.
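For readers who want to look up whether a page appears in a given crawl, the following is a minimal sketch of how a query to the public index server might be assembled and its response parsed. The host name, the "-index" path layout, the parameter names, and the crawl identifier CC-MAIN-2024-10 are assumptions used for illustration and should be checked against the current documentation at commoncrawl.org. The helper names are hypothetical, and the sketch deliberately makes no network request.

```python
import json
from urllib.parse import urlencode

# Assumed host for the public index server; verify before relying on it.
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id: str, url_pattern: str, limit: int = 5) -> str:
    """Build a query URL for one crawl's index (hypothetical helper).

    crawl_id is an identifier such as "CC-MAIN-2024-10" (illustrative only).
    url_pattern is the URL or wildcard pattern to look up.
    """
    params = urlencode({"url": url_pattern, "output": "json", "limit": limit})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_response(body: str) -> list[dict]:
    """Parse a response assumed to contain one JSON record per line."""
    return [json.loads(line) for line in body.splitlines() if line.strip()]


# A made-up response body, used here instead of a live request.
sample_body = (
    '{"url": "https://example.com/", "status": "200"}\n'
    '{"url": "https://example.com/missing", "status": "404"}\n'
)

query_url = build_index_query("CC-MAIN-2024-10", "example.com/*")
records = parse_index_response(sample_body)
```

In practice the query URL would be fetched with an HTTP client, and each returned record would point to the WARC file and byte range holding the captured page.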
Common Crawl has become one of the most widely used sources of training data for large language models and generative AI systems, employed by organizations including OpenAI, Google, Meta, and Anthropic. It has supported tens of thousands of academic papers in fields ranging from linguistics and public health to misinformation research and machine translation. The foundation publishes all crawling code publicly and identifies its crawler as CCBot in HTTP headers.
All datasets are freely available without registration at commoncrawl.org. The organization honors robots.txt exclusions and processes takedown requests from publishers. Funding comes from the Elbaz Family Foundation Trust and significant donations from AI industry companies.
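Publishers who want to confirm how their own robots.txt treats the crawler can test it locally. The sketch below uses Python's standard urllib.robotparser module to check whether a given robots.txt would permit the CCBot user agent to fetch a URL. The sample rules and the helper name are hypothetical, and this is not Common Crawl's own implementation, only an approximation of how a standards-following crawler would read the file.

```python
from urllib.robotparser import RobotFileParser


def ccbot_allowed(robots_txt: str, url: str, agent: str = "CCBot") -> bool:
    """Return True if the given robots.txt text permits `agent` to fetch `url`."""
    parser = RobotFileParser()
    # parse() accepts the file's lines, so no network request is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


# Hypothetical robots.txt that excludes CCBot but allows every other crawler.
SAMPLE_ROBOTS = """\
User-agent: CCBot
Disallow: /

User-agent: *
Allow: /
"""

blocked = ccbot_allowed(SAMPLE_ROBOTS, "https://example.com/article")
other = ccbot_allowed(SAMPLE_ROBOTS, "https://example.com/article", agent="OtherBot")
```

With these sample rules, the check reports that CCBot is excluded from the whole site while other user agents remain allowed.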
Common Crawl Foundation
San Francisco / Los Angeles, California, USA
Website: commoncrawl.org
LinkedIn: linkedin.com/company/common-crawl
© 2025 Newsjunkie.net