By Andrew MacDonald · Last updated 3 November 2025
During everyday usage, hundreds—sometimes thousands—of working files are created on the Voyant Tools server. Many of these are short-lived cache files (used to make tools feel fast) and are periodically cleaned up. By contrast, the source files that make up user corpora are preserved so that links continue to work and results remain reproducible. Over time, this meant that millions of long-inactive corpora were being retained even when they hadn’t been accessed in a year or more.
Why we’re doing a clean-up
- Performance: Keeping storage lean helps speed up indexing, backups, and system maintenance.
- Stability: Fewer stale objects simplify integrity checks and reduce errors from broken, half-finished uploads.
- Sustainability: Storage has a cost—in money and energy. Trimming truly unused data is part of responsible stewardship.
What’s already changed (Phase 1)
As announced, we removed corpora that hadn’t been accessed in over 12 months. This first pass freed up roughly 10% of the ~20 TB that was in use, about 2 TB of space reclaimed, without affecting recently used projects.
What counts as “unused”? For this phase, we considered a corpus unused if it had not been opened by a user (directly or via a saved link) in over a year. Active teaching links and research projects that are still visited were not targeted.
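To make the rule concrete, here is a minimal sketch of the "not opened in over a year" test. It is an illustration only, not the code that runs on the Voyant Tools server: the function names, the 365-day threshold, and the idea of a per-corpus last-access timestamp are all assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold standing in for the "over a year" rule.
UNUSED_AFTER = timedelta(days=365)


def is_unused(last_accessed, now):
    """Return True if the corpus was last opened more than a year before `now`."""
    return now - last_accessed > UNUSED_AFTER


def select_unused(last_access_by_corpus, now):
    """Return the sorted IDs of corpora whose last access is older than the threshold."""
    return sorted(
        corpus_id
        for corpus_id, last_accessed in last_access_by_corpus.items()
        if is_unused(last_accessed, now)
    )


# Made-up corpus IDs and access dates, for illustration only.
now = datetime(2025, 11, 3, tzinfo=timezone.utc)
example = {
    "corpus-a": datetime(2025, 9, 1, tzinfo=timezone.utc),
    "corpus-b": datetime(2023, 5, 20, tzinfo=timezone.utc),
    "corpus-c": datetime(2024, 11, 10, tzinfo=timezone.utc),
}
unused = select_unused(example, now)  # only "corpus-b" is past the threshold
```

In this sketch, a corpus visited even once within the past year (like the made-up "corpus-c") stays out of the list, which is why opening a saved link is enough to keep a corpus active.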
What’s next (Phase 2): finding “orphan documents”
We’re now working through a more complex task: identifying orphan documents. These are files created during ingestion (for example, intermediate conversions or partial uploads) that are no longer referenced by any active corpus. There are tens of millions of document objects, and cross-checking them against all active corpora takes time and care.
This step won’t affect your current corpora or any links you actively use. It’s strictly about removing leftover files the system no longer needs.
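Conceptually, the cross-check is a set difference: collect every document ID that some active corpus references, then subtract that from the full list of stored documents. The sketch below shows the idea on toy data. The function name and the dictionary-of-lists layout are assumptions for illustration; the real storage layout is not described in this post, and at tens of millions of objects the actual job is likely batched rather than held in memory like this.

```python
def find_orphans(all_document_ids, active_corpora):
    """Return the set of document IDs that no active corpus references.

    `all_document_ids` is an iterable of stored document IDs.
    `active_corpora` maps a corpus ID to the document IDs it uses.
    Both shapes are hypothetical and chosen only for this example.
    """
    referenced = set()
    for document_ids in active_corpora.values():
        referenced.update(document_ids)
    return set(all_document_ids) - referenced


# Toy data: "doc-3" is stored but not referenced by any active corpus.
stored = ["doc-1", "doc-2", "doc-3", "doc-4"]
active = {
    "corpus-a": ["doc-1", "doc-2"],
    "corpus-b": ["doc-2", "doc-4"],
}
orphans = find_orphans(stored, active)
```

A document shared by several corpora (here "doc-2") is kept as long as at least one of them is active, which is why only unreferenced files are candidates for removal.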
What this means for you
- Faster, more reliable service due to smaller indexes and lighter storage load.
- No action needed if you actively use or share your corpus links.
- Work that has been idle for less than a year is safe: Phase 1 removed only corpora not accessed in over a year, and Phase 2 targets unreferenced system files.
Good habits to preserve your work
- Keep your originals: Always retain a local copy of the source texts you upload (or the zip you ingested). This makes rebuilding a corpus trivial.
- Bookmark the corpus URL: A saved link is the simplest “pointer” back to your corpus and keeps it active when visited.
- Export what matters: For reproducibility, download CSVs/TSVs and images from the panels you cite (e.g., Terms, Trends, Contexts) and store them with your teaching or research materials.
- Document your pipeline: Keep a short README with corpus creation notes (where the texts came from, preprocessing, stopword lists, dates). Future-you will thank you.
- Consider self-hosting for long-term archives: If you maintain many teaching corpora or need strict retention, running your own Voyant Server instance is a robust option.
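If you would like to automate part of the "keep your originals" and "document your pipeline" habits, one option is a small manifest stored alongside your source texts. The sketch below is a suggestion, not a Voyant feature: the function names, the JSON layout, and the idea of a manifest file are all assumptions made for this example. It records each file's relative path, size, and SHA-256 checksum together with your free-text notes.

```python
import hashlib
import json
from pathlib import Path


def build_manifest(source_dir, notes):
    """Describe every file under `source_dir` so the corpus can be rebuilt later."""
    root = Path(source_dir)
    files = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        data = path.read_bytes()
        files.append({
            "path": path.relative_to(root).as_posix(),
            "bytes": len(data),
            "sha256": hashlib.sha256(data).hexdigest(),
        })
    return {"notes": notes, "files": files}


def write_manifest(source_dir, notes, out_path):
    """Write the manifest as JSON to `out_path` and return that path.

    Keep `out_path` outside `source_dir` so the manifest does not list itself
    the next time it is regenerated.
    """
    manifest = build_manifest(source_dir, notes)
    out = Path(out_path)
    out.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return out
```

For example, `write_manifest("my-texts", "Source, preprocessing, stopword list, date", "my-texts-manifest.json")` (with hypothetical folder and file names) would leave you a record you can check your originals against before re-uploading them.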
If your corpus was removed, how to rebuild quickly
- Gather your original texts (or the original zip).
- Open Voyant Tools and use Open → Upload (or paste a list of URLs).
- Recreate any stopword lists or options you used before (note: keeping these in your README makes it a 1-minute step).
- Bookmark the new corpus URL and—if used in teaching—update your LMS link.
Frequently asked questions
Will my active corpus be affected?
No. Phase 1 targeted corpora not accessed in over a year. Phase 2 targets only orphaned system files that no active corpus references.
How can I keep teaching links alive between semesters?
Visit each corpus at least once when preparing your course, and keep a copy of the source texts so you can rebuild quickly if needed.
Can you restore a removed corpus?
We generally don’t retain backups for long-inactive corpora. Re-ingesting from your originals is the fastest path; it usually takes only a minute or two.
Thanks for your patience
We know clean-ups can raise questions, and we appreciate your understanding as we keep Voyant fast, stable, and sustainable for everyone. Phase 2 is underway; we’ll share another note when it wraps up. If anything looks off in your workflow, please let us know—clear reports help us help you.
