Finding · drafted May 12, 2026

An Editorial Audit: Finding and Replacing 207 Duplicate Entries

A routine MD5 audit, 35 duplicate clusters, and what it suggests about AI-augmented archives

By Ryan Abrahamsen — Lewis and Clark Trust

Editorial status: draft

Publishing this finding is itself the point: scholarly archives that incorporate computational text generation are obligated to audit themselves continuously and to publish what they find. This is the first such cleanup audit for the Lewis and Clark Research Database: 207 journal entries from the 1803–1804 pre-departure period were quietly carrying duplicated content. Here is how we found it, what we did about it, and what it suggests about how to read the rest of the archive.

1. What we found

A routine audit computed an MD5 hash of the post_content of every journal entry on the site (3,415 entries total) and grouped entries by exact hash match. The expectation was zero duplicates: each day of the expedition is presumed unique.
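
For concreteness, here is a minimal sketch of that grouping step, assuming the entries are available as (entry_id, post_content) pairs; how they are fetched (direct SQL, WP-CLI, the REST API) is left open, and the function name is ours rather than the production audit's.

```python
import hashlib
from collections import defaultdict

def find_duplicate_clusters(entries):
    """Group journal entries by the MD5 hash of their body text.

    `entries` yields (entry_id, post_content) pairs. Returns only
    clusters of size > 1, i.e. genuinely duplicated content.
    """
    clusters = defaultdict(list)
    for entry_id, content in entries:
        digest = hashlib.md5(content.encode("utf-8")).hexdigest()
        clusters[digest].append(entry_id)
    return {h: ids for h, ids in clusters.items() if len(ids) > 1}
```

Exact-match hashing is deliberately strict: different dates or titles in other fields cannot mask duplication in the body, and identical hashes cannot be false positives.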

The audit found 35 clusters of identical content spanning 207 entries. The largest cluster was 19 entries sharing one block of text; the second- and third-largest held 18 each. Six clusters ran to 16 or more entries.

The clusters concentrated in three temporal windows:

  • Lewis’s solo Ohio descent (October 1803): roughly 30 entries across multiple clusters — the “Cincinnati arrival” cluster repeated across Oct 9, 10, 11, 12; the “Below Cincinnati” cluster across Oct 13, 15; the “passing the Kentucky River” cluster across Oct 16–20; the “Falls of the Ohio” cluster across Oct 21–25.
  • The joint Ohio descent and Mississippi journey (November 1803): ~30 entries across three clusters of ~10 each.
  • Camp Dubois winter (December 1803–May 1804): ~150 entries across nine major clusters, each 12–19 entries deep.

Total affected: 207 entries (6% of the corpus). All concentrated in the pre-departure planning and preparation period.

2. What the duplicates reveal about the original import

The pre-departure period is documented sparsely in the primary sources. Lewis traveled alone down the Ohio from August 31 into November 1803, journaling only irregularly. Clark wrote sporadically at Camp Dubois through the winter. The journals of John Ordway and Patrick Gass do not begin until the expedition departs Camp Dubois on May 14, 1804.

An earlier AI generation pass attempted to create a daily entry for every date of the journey, including the sparse pre-departure period. For dates where no primary-source journal existed, the model produced template-based daily entries from a small number of representative narratives. The result was 207 entries that look distinct at a glance (different dates, different titles, different surrounding metadata) but carry byte-identical body text drawn from the same underlying templates.

This is a common failure mode of computationally augmented archives: when the source material thins out, the generator confabulates daily content from a representative template, presenting each day as if it had been independently observed. The narrative voice remains plausible; only structural inspection reveals the duplication.

Reading any one of these 207 entries in isolation would not have detected the issue. Hashing all 3,415 entries and grouping by content was the smallest test that could have caught it.

3. The fix

For each duplicate cluster we kept the lowest-ID entry intact as the canonical version. The remaining entries in each cluster (207 total) had their post_content replaced with an honest editorial note:

“No detailed primary-source journal entry survives for [date] that is distinct from the surrounding days. The Corps was active in the [phase] during this period. The original curated content for this date duplicated text from a representative entry. To preserve historical accuracy, that template text has been replaced with this note. See [canonical entry] for the representative narrative covering this period.”
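
A sketch of how such a replacement could be planned per cluster, assuming per-entry metadata (date, phase, permalink) is available; the template mirrors the note above, and the helper and field names are illustrative:

```python
NOTE_TEMPLATE = (
    "No detailed primary-source journal entry survives for {date} that is "
    "distinct from the surrounding days. The Corps was active in the {phase} "
    "during this period. The original curated content for this date duplicated "
    "text from a representative entry. To preserve historical accuracy, that "
    "template text has been replaced with this note. See {canonical_link} for "
    "the representative narrative covering this period."
)

def plan_cluster_fix(cluster_ids, entry_meta):
    """Keep the lowest ID as canonical; draft notes for the rest.

    `entry_meta` maps entry ID -> {"date", "phase", "permalink"}.
    Returns (canonical_id, {entry_id: replacement_note}).
    """
    canonical_id = min(cluster_ids)
    notes = {}
    for entry_id in cluster_ids:
        if entry_id == canonical_id:
            continue  # the canonical entry keeps its content intact
        meta = entry_meta[entry_id]
        notes[entry_id] = NOTE_TEMPLATE.format(
            date=meta["date"],
            phase=meta["phase"],
            canonical_link=entry_meta[canonical_id]["permalink"],
        )
    return canonical_id, notes
```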

Each replaced entry was flagged with editor_action_required = 'duplicate_content_replaced' and duplicate_content_canonical = <id> so the cleanup is auditable. AI-generated summaries and enhanced titles for the replaced entries were stripped, because those derivative artifacts were based on the duplicate content and would themselves have misled readers.
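
Continuing the sketch, applying a planned fix might look like this; set_content, set_meta, and delete_meta are stubs standing in for whatever write path the site actually uses, and the derivative keys listed are the ones named in this finding:

```python
# Stubs for the site's actual write path (WP-CLI, REST, or direct SQL).
def set_content(entry_id, html): ...
def set_meta(entry_id, key, value): ...
def delete_meta(entry_id, key): ...

DERIVATIVE_META_KEYS = ("ai_summary", "ai_modernized_html", "ai_entities")

def apply_cluster_fix(canonical_id, notes):
    """Replace duplicate bodies with notes and flag each entry for audit."""
    for entry_id, note in notes.items():
        set_content(entry_id, note)
        set_meta(entry_id, "editor_action_required", "duplicate_content_replaced")
        set_meta(entry_id, "duplicate_content_canonical", canonical_id)
        # Derivative artifacts were generated from the duplicate text,
        # so they are stripped rather than left to mislead readers.
        for key in DERIVATIVE_META_KEYS:
            delete_meta(entry_id, key)
```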

The 207 entries remain in the database as timeline placeholders — visiting their permalink shows the editorial note plus a link to the canonical entry — but they no longer pretend to be distinct daily journals.

Post-cleanup, zero content clusters of size > 1 remain in the journal_entry corpus.
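
That claim is mechanically checkable by re-running the audit; a one-liner against the find_duplicate_clusters sketch above:

```python
def assert_no_duplicates(entries):
    """Fail loudly if any duplicate cluster survives the cleanup."""
    remaining = find_duplicate_clusters(entries)
    assert not remaining, f"{len(remaining)} duplicate clusters remain"
```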

4. What this suggests about the rest of the archive

Several practical implications for reading and citing the database:

  • Dated coverage is not uniform. The expedition’s literary output is concentrated heavily in the 1804–1806 active travel years. The pre-departure period is preserved as timeline structure but is correspondingly sparser in actual content.
  • Editorial provenance is the most important metadata on the site. Every AI-generated artifact (ai_summary, ai_modernized_html, ai_entities, cross-narrator analyses, enhanced titles) carries a generation timestamp. Any researcher citing this database should note both the date of the citation and the editorial status of the cited entry.
  • Cluster-detection audits should be repeated. The MD5 hash audit was the first systematic structural check. Future audits could test for near-duplicates (Levenshtein distance < 50 characters; see the sketch after this list), suspicious cross-cluster paraphrase, or sentence-level repetition across dates. We will run these.
  • The six cross-narrator analyses that failed audit and were demoted to draft on May 11, 2026, fit a related pattern. Both findings point to the same general risk: a model asked to produce structured daily content from sparse sources will, absent strong guardrails, generate plausible-looking content that fails independent verification.
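
The near-duplicate pass mentioned above could start as simply as the sketch below: a plain dynamic-programming Levenshtein distance with a length pre-filter, since the difference in lengths is a lower bound on edit distance. The 50-character threshold comes from the list above; the wiring around it is an assumption.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]

def near_duplicate_pairs(entries, threshold=50):
    """Yield pairs of entries whose bodies differ by < `threshold` edits."""
    items = list(entries)
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            (id_a, text_a), (id_b, text_b) = items[i], items[j]
            # |len(a) - len(b)| bounds the edit distance from below,
            # so most pairs are discarded without the O(n*m) comparison.
            if abs(len(text_a) - len(text_b)) >= threshold:
                continue
            if levenshtein(text_a, text_b) < threshold:
                yield id_a, id_b
```

A quadratic scan over 3,415 entries is tolerable only because the length filter discards most pairs cheaply; a production pass would more likely block entries by date window or compare shingle fingerprints first.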

None of this is grounds to distrust the archive as a whole. The 2,007 entries in Phase 2 (Westward Journey), the 426 in Phase 3 (Fort Clatsop), and the 679 in Phase 4 (Return) are derived from genuine primary-source transcriptions (Thwaites, Quaife, Gass 1807) and survive content-hash inspection cleanly. The flaw was confined to one editorial cohort: the curated pre-departure entries.

5. What this models for other AI-augmented archives

The Lewis and Clark Research Database is one of a growing number of public archives that use computational text generation alongside primary-source transcription. Many similar projects are emerging across scholarly humanities, libraries, museums, and tribal cultural-preservation programs. The pattern this finding documents will be common to all of them.

Three practices we adopt from this audit and recommend for other projects:

  1. Publish the audit itself. A scholarly archive’s credibility depends on demonstrating that it audits its own contents. Hidden cleanups create the impression of a static authoritative resource; published cleanups demonstrate continuous editorial care.
  2. Replace, do not delete. A duplicate entry contains date metadata, taxonomy tags, and an audit history that future researchers may find useful. Replacing the content with a transparent placeholder preserves the structural record. Deletion would silently rewrite the corpus.
  3. Track editorial provenance per artifact. Each AI-generated meta field should carry a generation timestamp, model version, and source-content hash (a sketch follows this list). When the source content is later identified as flawed, derivative artifacts can be identified by query and cleanly removed.
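
A per-artifact provenance record of the kind item 3 describes could be as small as the following; the field names are illustrative, not the site's actual schema:

```python
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class ArtifactProvenance:
    artifact_key: str    # e.g. "ai_summary"
    generated_at: str    # ISO-8601 timestamp of generation
    model_version: str   # identifier of the generating model
    source_sha256: str   # hash of the source content at generation time

def record_provenance(artifact_key, generated_at, model_version, source_text):
    digest = hashlib.sha256(source_text.encode("utf-8")).hexdigest()
    return ArtifactProvenance(artifact_key, generated_at, model_version, digest)
```

Stored alongside each artifact, records like these make the stripping step in section 3 a single query: every derivative whose source hash matches the flawed content is found at once.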

The cleanup described here took about an hour, cost nothing in additional AI generation, and is fully reversible if a primary-source citation surfaces for any specific date in the replaced set.

What this enables

The flagged entries are queryable through the editor dashboard at /wp-admin/admin.php?page=lcr-editor-dashboard (filter: editor_action_required = duplicate_content_replaced). Any reader who finds a primary-source citation for one of these specific dates can submit it at ryan@terrain360.com, and the placeholder will be replaced with genuine daily content.

This finding is the first published instance of the database auditing itself in public. Future audits and cleanups will be logged here. If a researcher would like to inspect or refute the methodology, the finding archive is the canonical record.


Drafted May 12, 2026. The cleanup was completed earlier the same day. All affected entries are listed in the editor dashboard. The author is an engineer rather than a historian; period scholars and editor partners are welcome to review or supplement at ryan@terrain360.com.

