Welcome to the Archive: What We Built and Why
We are an anonymous collective devoted to truth and transparency. This is the story of why we built a censorship-resistant, AI-powered archive of every document the DOJ released on Jeffrey Epstein — and how it works.
In 2025, the U.S. Department of Justice quietly released over 370 gigabytes of documents related to the Jeffrey Epstein investigation: approximately 1.4 million files spanning 3.5 million pages, including FBI interview summaries, police reports, emails, financial records, flight manifests, seized photographs and videos, and more.
The files arrived as raw data dumps: twelve massive zip files with no search, no index, and no way for an ordinary person to make sense of what was inside. The documents were technically public but practically invisible.
We decided to change that.
Who We Are
We are an anonymous collective of engineers, researchers, and citizens devoted to truth and transparency. We have no institutional backing, no corporate sponsors, no political affiliation. We believe that public records should be genuinely public — not just technically available, but actually accessible, searchable, and permanent.
We don't trust any single institution to preserve this material. Governments change. Servers go offline. Platforms cave to legal pressure. So we built something that doesn't depend on any single point of failure.
What We Built
The Epstein Files Archive is a fully searchable, AI-enhanced, censorship-resistant archive of every document the DOJ released — and more. Here's what it does:
Full-Text Search Across 1.4 Million Files
Every document has been OCR'd, text-extracted, and indexed. You can search across all 3.5 million pages with typo-tolerant, sub-200ms results. Filter by data set, file type, or category.
Try it yourself: search for "passport" and see what turns up. Or "wire transfer". Or "flight manifest". The search works across PDFs, emails, scanned handwritten notes — everything.
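A minimal sketch of how a filtered query like the ones above might be assembled for Meilisearch's search endpoint. The attribute names `data_set` and `file_type` and the helper `build_search_request` are assumptions made for illustration; they are not taken from the archive's actual index configuration.

```python
import json


def build_search_request(query, data_set=None, file_type=None, limit=20):
    """Build the JSON body for a Meilisearch search request.

    The attribute names `data_set` and `file_type` are hypothetical; they
    would have to be declared as filterable attributes on the index.
    """
    filters = []
    if data_set is not None:
        filters.append(f"data_set = {int(data_set)}")
    if file_type is not None:
        # Quote string values so that spaces or reserved words are safe.
        filters.append(f'file_type = "{file_type}"')

    body = {"q": query, "limit": limit}
    if filters:
        body["filter"] = " AND ".join(filters)
    return body


if __name__ == "__main__":
    # This body would be POSTed to /indexes/<index_uid>/search.
    print(json.dumps(build_search_request("wire transfer", file_type="pdf")))
```

Typo tolerance is handled by the search engine itself, so the client only needs to send the raw query string and any filters.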
AI-Powered Investigation
We've integrated AI directly into the research workflow. On the homepage, you can ask the AI assistant any question about the documents and get answers grounded in the actual evidence. The AI retrieves relevant passages from the archive and constructs answers with citations you can click to verify.
Ask it anything: "What do the flight logs show?" or "What did the FBI know and when?" — it searches through the documents and synthesizes what it finds.
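To illustrate the idea of answers grounded in retrieved passages, here is a small sketch of one common way to do it: number each passage, ask the model to cite by number, and keep a map from citation numbers back to documents. The `Passage` fields and both function names are hypothetical, and the retrieval step and the model call are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    doc_id: str   # hypothetical archive document identifier
    page: int
    text: str


def build_grounded_prompt(question, passages):
    """Number each retrieved passage so the model can cite it as [n]."""
    lines = [
        "Answer using only the numbered passages below.",
        "Cite passages as [n]. If they do not contain the answer, say so.",
        "",
    ]
    for n, p in enumerate(passages, start=1):
        lines.append(f"[{n}] ({p.doc_id}, page {p.page}) {p.text}")
    lines += ["", f"Question: {question}"]
    return "\n".join(lines)


def citation_targets(passages):
    """Map each citation number back to the document and page it came from."""
    return {n: (p.doc_id, p.page) for n, p in enumerate(passages, start=1)}
```

The map returned by `citation_targets` is what would let a front end turn a "[2]" in the answer into a clickable link to the source page.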
Automatic Entity Extraction
We run documents through AI-powered entity extraction that identifies people, organizations, and locations mentioned across the archive. Extraction is still in progress; so far the system has identified over 21,000 entities and mapped the relationships between them.
Visit the Entity Network to explore the web of connections visually. Click any person or organization to see every document they appear in, who they're connected to, and AI-generated investigative questions specific to that entity — questions designed to surface things that might not yet be discovered.
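One simple way to derive a connection graph from extracted entities is document co-occurrence: two entities are linked if they appear in the same document, and the link weight is the number of shared documents. The sketch below uses that technique as an assumption; the archive's actual relationship mapping may work differently, and the function name and input shape are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(doc_entities):
    """doc_entities maps a document id to the set of entity names found in it.

    Returns (edges, appearances): edges counts how many documents each pair of
    entities shares; appearances lists the documents each entity appears in.
    """
    edges = Counter()
    appearances = defaultdict(list)
    for doc_id, entities in doc_entities.items():
        for name in sorted(entities):
            appearances[name].append(doc_id)
        # Sort so that each pair has one canonical ordering.
        for a, b in combinations(sorted(entities), 2):
            edges[(a, b)] += 1
    return edges, dict(appearances)
```

Co-occurrence only shows that two names share a document, not what the relationship is, so a link in such a graph is a prompt for reading the sources rather than a finding in itself.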
AI Image Analysis
The archive contains approximately 180,000 seized photographs. Every image is analyzed by AI vision models that identify and describe what's in each photo — people, objects, locations, visible text, signage, and context. These descriptions become searchable keywords, meaning you can find images by what's in them, not just by filename. Search for "passport" and you'll find not just documents mentioning passports, but actual photographs of passports.
Audio and Video Transcription
The archive contains thousands of audio and video files, including interview recordings, surveillance footage, and seized media. We transcribe them automatically with Whisper, so their contents are searchable alongside every other document. Recordings that once required listening to hours of tape are now fully text-searchable.
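A rough sketch of how a transcript could be turned into searchable records, assuming the open-source `openai-whisper` Python package. The record layout, the id scheme, and both function names are hypothetical; only `whisper.load_model` and `model.transcribe` come from the library.

```python
def segments_to_records(file_id, segments):
    """Turn Whisper-style segments into one searchable record per segment.

    Each segment is expected to carry "start", "end" (seconds) and "text",
    which is the shape openai-whisper returns under result["segments"].
    """
    records = []
    for i, seg in enumerate(segments):
        text = seg["text"].strip()
        if not text:
            continue  # skip silent or empty segments
        records.append({
            "id": f"{file_id}-{i}",  # hypothetical record id scheme
            "file_id": file_id,
            "start_seconds": round(seg["start"], 2),
            "end_seconds": round(seg["end"], 2),
            "text": text,
        })
    return records


def transcribe_file(path, file_id, model_name="base"):
    """Transcribe with the openai-whisper package (imported lazily)."""
    import whisper  # requires `pip install openai-whisper` and ffmpeg

    model = whisper.load_model(model_name)
    result = model.transcribe(path)
    return segments_to_records(file_id, result["segments"])
```

Indexing per segment rather than per file keeps the timestamps, so a search hit can point to the moment in the recording where the words were spoken.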
Continuous Web Crawling
The archive doesn't stop at the DOJ release. We operate a web crawler that continuously discovers new Epstein-related documents from court dockets (PACER), government transparency portals, the Internet Archive, news publications, and investigative journalism outlets. New content is automatically scored for relevance, deduplicated against existing records, and added to the searchable index.
This means the archive grows over time. As new court filings emerge, as FOIA requests are fulfilled, as journalists publish new findings — the crawler picks them up and folds them in.
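The scoring and deduplication steps can be pictured with a small sketch. It assumes exact-duplicate detection via a SHA-256 hash of normalized text and a naive weighted keyword score; the term weights, the threshold, and all function names are made up for illustration, and a production crawler would likely use something more robust (for example near-duplicate detection).

```python
import hashlib
import re

# Hypothetical weighted terms; a real scorer would likely be more elaborate.
RELEVANCE_TERMS = {"epstein": 3.0, "flight manifest": 1.5, "wire transfer": 1.0}


def content_fingerprint(text):
    """Hash whitespace-normalized, lowercased text to catch exact re-posts."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def relevance_score(text):
    """Sum the weights of the terms that occur at least once in the text."""
    lowered = text.lower()
    return sum(w for term, w in RELEVANCE_TERMS.items() if term in lowered)


def accept(text, seen_fingerprints, threshold=3.0):
    """Return True and record the fingerprint if the text is new and relevant."""
    fp = content_fingerprint(text)
    if fp in seen_fingerprints or relevance_score(text) < threshold:
        return False
    seen_fingerprints.add(fp)
    return True
```

Normalizing before hashing means the same filing republished with different spacing or capitalization is still recognized as a duplicate.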
Document Processing Pipeline
Every file that enters the archive goes through a processing pipeline:
- Type detection — files are classified by extension and magic bytes
- Text extraction — PDFs processed with PyMuPDF, scanned pages OCR'd with Tesseract, emails and spreadsheets parsed
- Thumbnail generation — visual previews for PDFs, images, and video frames
- Full-text indexing — all text indexed in Meilisearch with filterable facets
- Entity extraction — AI identifies people, organizations, locations, and relationships
- Image analysis — AI vision models describe the contents of every photograph, making images searchable by what's in them
- Transcription — audio and video files transcribed with Whisper
- Virus scanning — every uploaded file scanned with ClamAV before processing
- P2P distribution — content published to the Spill network for decentralized replication
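As an illustration of the first step in the list above, here is a minimal sketch of type detection that checks magic bytes first and falls back to the file extension. The signature table covers only a handful of formats, and the function and table names are hypothetical.

```python
from pathlib import Path

# Leading byte signatures for a few common formats.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"PK\x03\x04", "zip"),  # also the container for .docx and .xlsx
]

EXTENSION_TYPES = {".pdf": "pdf", ".png": "png", ".jpg": "jpeg",
                   ".jpeg": "jpeg", ".zip": "zip", ".txt": "text"}


def detect_type(filename, head):
    """Classify a file from its first bytes, falling back to its extension.

    Magic bytes take priority because an extension can be wrong or missing.
    """
    for signature, kind in MAGIC_SIGNATURES:
        if head.startswith(signature):
            return kind
    return EXTENSION_TYPES.get(Path(filename).suffix.lower(), "unknown")
```

Here `head` is the first few bytes of the file, for example `open(path, "rb").read(16)`. A PDF saved with a `.txt` extension is still classified as a PDF.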
Censorship Resistance: The Spill P2P Network
This is perhaps the most important thing we built. The archive is distributed via the Spill P2P network, built on Hyperswarm and Hyperdrive. Every document, every index, every piece of the archive is replicated across peer nodes worldwide.
What this means in practice:
- If this server goes offline, the archive survives. Other peer nodes retain full copies of the data.
- No single entity can take it down. There is no kill switch, no single hosting provider to pressure, no domain registrar to bully.
- Connections are encrypted end-to-end using the Noise protocol. Peers communicate directly without intermediaries.
- Anyone can run a peer node and help replicate the archive. The more nodes, the more resilient it becomes.
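A back-of-the-envelope way to see why more nodes means more resilience: if each node holding a full copy is offline independently with some probability, the chance that at least one copy is reachable is one minus the chance that all are offline at once. The independence assumption is a simplification (real outages can be correlated), and the function name and the numbers are purely illustrative.

```python
def survival_probability(nodes, p_offline):
    """Probability that at least one of `nodes` full replicas is reachable,
    assuming each node is offline independently with probability p_offline."""
    if nodes < 0 or not 0.0 <= p_offline <= 1.0:
        raise ValueError("nodes must be >= 0 and p_offline within [0, 1]")
    return 1.0 - p_offline ** nodes


if __name__ == "__main__":
    for n in (1, 3, 10):
        print(n, survival_probability(n, 0.5))
```

Even with nodes that are each unreachable half the time, the probability that every copy is unavailable at the same moment shrinks quickly as nodes are added.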
We built this because we've seen what happens to archives that depend on a single server, a single company, a single jurisdiction. They disappear. We are determined that these documents will not disappear.
Community Uploads
The archive accepts public uploads. If you have documents related to the Epstein case — court filings, FOIA responses, photographs, records — you can upload them directly. Every upload is virus-scanned with ClamAV, processed through our pipeline, and added to the searchable archive.
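A minimal sketch of how an upload could be gated on a ClamAV scan, assuming the `clamscan` command-line tool is installed. It relies on clamscan's documented exit codes (0 for no virus found, 1 for a virus found, 2 for an error); the function names and the injectable `run` parameter are hypothetical.

```python
import subprocess


def scan_verdict(returncode):
    """Interpret clamscan's documented exit codes."""
    if returncode == 0:
        return "clean"
    if returncode == 1:
        return "infected"
    return "error"  # 2 means an error occurred; treat anything else alike


def scan_upload(path, run=subprocess.run):
    """Scan one file with the clamscan CLI and return a verdict string.

    `run` is injectable so the function can be exercised without ClamAV.
    """
    completed = run(["clamscan", "--no-summary", path],
                    capture_output=True, text=True)
    return scan_verdict(completed.returncode)
```

Only a "clean" verdict should let a file continue into the pipeline; treating "error" the same as "infected" is the safer default.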
Privacy
We don't require accounts. We don't set tracking cookies. We don't log search queries. We use Cloudflare Web Analytics — cookieless and privacy-respecting — to count aggregate pageviews. The site is served over HTTPS with a Let's Encrypt certificate, and the P2P layer uses end-to-end encryption. We built this to serve the public, not to surveil it.
What Comes Next
Entity extraction is ongoing — we're processing the full 1.4 million documents through AI to map every person, organization, and location. As this completes, the entity network will grow dramatically, revealing connections that are invisible when reading documents one at a time.
We're also expanding the crawler's reach to more court systems and government archives, and working on deeper financial analysis tools to trace the money flows documented in these records.
The truth is in the documents. We're just making it findable.
— The Archive Collective