Newsjunkie

Data repository services

Digital capture and retrieval for self-managed environments

by Curtis Whiting/Apr 2, 2026

The digital repository sector serves a diverse range of institutions: academic libraries, national archives, research laboratories, government agencies, and cultural heritage organizations. The ecosystem is split between open-source self-hosted solutions, commercially hosted platforms, and managed services. Open-source platforms dominate in adoption, while managed services lead in operational simplicity.

In the realm of the internet, archiving web pages may seem arcane and unwieldy, but preserving the internet matters: it lets us know what knowledge and information (or what passed for knowledge and information) circulated in its early days.

In 2025, the Internet Archive collected its one-trillionth page, and yet, the internet remains vast and hard to normalize. It becomes ever more important for businesses, non-profits, and others with large datasets and mature websites to archive their own material.

Organizations seeking to archive their own material use a variety of systems designed for different levels of complexity and demand. Academic libraries, national archives, research laboratories, government agencies, and cultural heritage organizations usually want value-priced turnkey solutions. Companies and organizations that have strong technical skills and already host internal data systems want maximum configurability with professional-grade backup and low failure rates. Other systems are externally hosted, offering the Software-as-a-Service (SaaS) model, in which a subscriber uploads information to a cloud.

For the cloud SaaS programs, several available tools help secure research data:

OpenAIRE, a pan-European research infrastructure, is useful for open science. It helps with archiving, linking, and analyzing scientific outputs (articles, datasets, software).

ORCID (Open Researcher and Contributor ID) and DataCite are two international identifier services that connect researchers to their research metadata and make the research citable and discoverable.

Both are nonprofit. ORCID identifiers are free for individual researchers, while organizations pay a membership fee to integrate with the service. DataCite requires a paid organizational membership or affiliation to register and manage DOIs (Digital Object Identifiers: unique alphanumeric strings used to permanently identify and link to specific pieces of digital content on the internet).
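As a rough illustration of how a DOI is put together, the sketch below splits one into its registrant prefix and item suffix and builds the link that the public doi.org resolver would redirect. The DOI shown is a made-up placeholder, and the helper names `split_doi` and `doi_url` are hypothetical, not part of any DataCite tooling.

```python
from urllib.parse import quote


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = doi.partition("/")
    # Every DOI prefix begins with the directory indicator "10."
    if not prefix.startswith("10.") or not sep or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def doi_url(doi: str) -> str:
    """Build the resolver link for a DOI, percent-encoding unsafe characters."""
    prefix, suffix = split_doi(doi)
    return f"https://doi.org/{prefix}/{quote(suffix, safe='/')}"


# Placeholder DOI for illustration only; it does not point to a real record.
print(split_doi("10.1234/example.5678"))  # ('10.1234', 'example.5678')
print(doi_url("10.1234/example.5678"))    # https://doi.org/10.1234/example.5678
```

The prefix identifies the registering organization and the suffix identifies the item, which is why the link can stay stable even if the hosting repository changes its own URLs.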

These cloud programs are helpful when connecting with government research or research funders like Horizon Europe.

Platforms

Archive-It

Created by the Internet Archive, Archive-It is a subscription-based turnkey SaaS platform. More than 1,200 libraries, educational institutions, and government and nonprofit archives in over 45 countries currently use this service to create and manage digital collections and records for access and preservation.

Within Archive-It are various services, such as Vault, a low-cost digital file preservation system, and Arch, which assists with computational text and data mining of collections. This system provides infrastructure for organizations such as cultural heritage institutions that may not have resources for building their own digital preservation methods.

InvenioRDM

A turnkey research data management repository platform, InvenioRDM was created as an open-source collaboration between CERN and Northwestern University. Zenodo, one of the largest open-access repositories for scientific research, is built on it. It is said to handle more than 100 million records and several petabytes of data. It uses ORCID, DataCite, and OpenAIRE, and it integrates with single sign-on systems and cloud storage. Participating institutions commit at least 1.5 person-months of effort per year, so development is funded through contributed labor rather than licensing fees.

DSpace

DSpace is open-source software maintained by Lyrasis, a nonprofit membership organization that supports libraries, cultural heritage institutions, museums, and archives. The system integrates with ORCID, OpenAIRE, and ROR (Research Organization Registry). DSpace supports all major digital file and audiovisual formats. More than 3,000 repositories currently use DSpace.

EPrints

Developed by the University of Southampton, EPrints has been in use as an institutional repository system since 2000. It is used most often as a repository for research papers, theses, and teaching materials. It integrates with the Sherpa Romeo database, an online resource that summarizes publisher policies: whether librarians, authors, and researchers can self-archive articles, and which version (preprint or postprint) they have permission to work with. It runs on a Linux/Apache/MySQL/Perl stack.

Fedora Commons

Unlike DSpace and EPrints, Fedora Commons is not an all-in-one service. Rather, it is a flexible repository framework. With metadata based on RDF (subject-predicate-object triples) and rich relationship ontologies, it is useful for specialized needs in large research libraries. It requires significant technical expertise to set up, configure, and maintain.
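RDF statements are usually described as subject-predicate-object triples, and a small sketch may make the idea of relationship-based metadata more concrete. The example below uses plain Python tuples rather than Fedora itself; the `ex:` identifiers, the predicate names, and the `match` helper are all hypothetical and chosen only for illustration.

```python
from typing import Optional

# One RDF-style statement: (subject, predicate, object).
Triple = tuple[str, str, str]

# Hypothetical records; the "ex:" names are placeholders, not a real ontology.
TRIPLES: list[Triple] = [
    ("ex:letter-042", "ex:isMemberOf", "ex:field-notes"),
    ("ex:letter-042", "ex:hasCreator", "ex:jane-doe"),
    ("ex:photo-007", "ex:isMemberOf", "ex:field-notes"),
]


def match(
    triples: list[Triple],
    subject: Optional[str] = None,
    predicate: Optional[str] = None,
    obj: Optional[str] = None,
) -> list[Triple]:
    """Return the triples that fit the pattern; None acts as a wildcard."""
    wanted = (subject, predicate, obj)
    return [
        t for t in triples
        if all(w is None or w == v for w, v in zip(wanted, t))
    ]


# Which items belong to the field-notes collection?
members = match(TRIPLES, predicate="ex:isMemberOf", obj="ex:field-notes")
print([s for s, _, _ in members])  # ['ex:letter-042', 'ex:photo-007']
```

Because every fact has the same three-part shape, new kinds of relationships can be added without redesigning a schema, which is part of what makes this style of metadata attractive for specialized collections.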

Digital Commons (bepress)

Created by bepress (now owned by Elsevier, part of RELX Group), Digital Commons is a subscription-based hosted platform with built-in analytics. It includes image collections, publishing tools for journals, and social features such as researcher profile pages. It suits institutions that want a polished product that does not require technically skilled staff to operate.

Islandora

This Fedora-based framework is built on the Drupal content management system (free software used to build and manage complex, high-traffic websites and digital applications). It may be best suited to archives with high-volume multimedia collections, and it has community-contributed plugins. It demands more technical knowledge than DSpace or EPrints but is easier to run than Fedora on its own.

Webrecorder

Webrecorder's archiving platform, Browsertrix, captures dynamic, JavaScript-heavy content and authenticated web environments. The software targets social media and interactive sites that static crawlers can fail to render properly. Its main users are digital archivists, legal professionals, and investigative researchers, who may require precise, verifiable captures of the live web experience for long-term storage in standardized formats.

Browsertrix Cloud is a SaaS interface for automated crawling. Browsertrix ensures that interactive elements are recorded as they would appear to an end user. This approach addresses a gap in digital preservation by capturing complex web content without the manual overhead associated with other browser-based tools.