Finding · drafted May 12, 2026

Cross-Narrator Parallels at Fort Clatsop

When a near-duplicate audit surfaces a documented historical pattern, not a bug

By Ryan Abrahamsen — Lewis and Clark Trust

Editorial status: draft

A follow-up audit to finding #3 (the exact-match content cleanup) used a more permissive near-duplicate detector to look for paraphrased copies. It found a documented historical pattern that is not a bug at all: at Fort Clatsop, Captain Clark transcribed Captain Lewis’s daily journal entries almost word-for-word. Surfacing this pattern in the database is a small example of how computational text analysis can confirm and make visible something scholars have known for a century.

1. What we ran

Finding #3 used MD5 hashing to detect entries with byte-identical content. That caught the 207-entry curated-import cleanup, but an MD5 hash changes completely when a single character differs, so it could not detect near-copies: entries with nearly the same wording but different punctuation, capitalization, or spelling.

A follow-up audit ran a more permissive detector against the remaining 3,130 journal entries:

  • Shingle each entry’s normalized text into overlapping 8-character windows.
  • Compute a 64-element MinHash signature per entry.
  • Band the signatures into 16 bands of 4 hashes each (LSH).
  • For each pair of entries sharing any band, compute the similarity ratio with SequenceMatcher.ratio() from Python’s difflib module.
  • Merge pairs with ratio ≥ 0.80 into near-duplicate clusters via union-find.
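
The steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the audit's actual code: the normalization rules, the salted BLAKE2b hash family, and every function name are assumptions made for the example. Only the parameters (8-character shingles, 64 hashes, 16 bands of 4, a 0.80 ratio threshold, union-find clustering) come from the description above.

```python
import hashlib
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

SHINGLE_SIZE = 8     # characters per shingle (from the audit description)
NUM_HASHES = 64      # MinHash signature length
BANDS, ROWS = 16, 4  # 16 bands x 4 hashes = 64
THRESHOLD = 0.80     # minimum SequenceMatcher ratio


def normalize(text):
    # Assumed normalization: lowercase, drop punctuation, collapse whitespace.
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return re.sub(r"\s+", " ", text).strip()


def shingles(text):
    # Overlapping character windows; very short texts become one shingle.
    if len(text) <= SHINGLE_SIZE:
        return {text}
    return {text[i:i + SHINGLE_SIZE] for i in range(len(text) - SHINGLE_SIZE + 1)}


def seeded_hash(seed, shingle):
    # One member of an assumed hash family: BLAKE2b salted with the seed.
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(shingle.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "little")


def minhash(shingle_set):
    # Signature = minimum hash value under each of the 64 seeded hashes.
    return tuple(min(seeded_hash(seed, s) for s in shingle_set) for seed in range(NUM_HASHES))


def candidate_pairs(signatures):
    # Entries whose signatures agree on any whole band land in the same bucket.
    buckets = defaultdict(set)
    for entry_id, sig in signatures.items():
        for band in range(BANDS):
            buckets[(band, sig[band * ROWS:(band + 1) * ROWS])].add(entry_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


def find(parent, x):
    # Union-find root lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def near_duplicate_clusters(entries):
    # entries: dict mapping an entry ID to its raw text.
    norm = {eid: normalize(text) for eid, text in entries.items()}
    sigs = {eid: minhash(shingles(text)) for eid, text in norm.items()}
    parent = {eid: eid for eid in entries}
    for a, b in candidate_pairs(sigs):
        if SequenceMatcher(None, norm[a], norm[b]).ratio() >= THRESHOLD:
            parent[find(parent, a)] = find(parent, b)
    clusters = defaultdict(list)
    for eid in entries:
        clusters[find(parent, eid)].append(eid)
    return sorted(sorted(members) for members in clusters.values() if len(members) > 1)
```

Calling near_duplicate_clusters on a dict of entry texts returns only the clusters with two or more members. One caveat: SequenceMatcher's default autojunk heuristic applies to sequences of 200 or more elements and can shift ratios on long entries; the description above does not say whether the audit left it enabled.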

The audit ran in about three minutes against the post-cleanup corpus.
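
The short runtime is plausible because banding avoids scoring every pair. The sketch below uses the standard LSH banding estimate to show how likely a pair is to become a candidate under the 16 x 4 scheme. It assumes the 64 hash functions behave independently, and it is expressed in terms of Jaccard similarity between shingle sets, which is related to but not the same as the SequenceMatcher ratio used for the final decision. The function names are illustrative.

```python
BANDS, ROWS = 16, 4  # banding parameters from the audit description


def candidate_probability(jaccard):
    """Chance that two entries share at least one identical band,
    given the Jaccard similarity of their shingle sets."""
    return 1.0 - (1.0 - jaccard ** ROWS) ** BANDS


def all_pairs(n):
    """Number of comparisons a brute-force all-pairs scan would need."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    for s in (0.2, 0.5, 0.8):
        print(f"Jaccard {s:.1f}: candidate probability {candidate_probability(s):.4f}")
    print("Brute-force pairs for 3,130 entries:", all_pairs(3130))
```

Under these assumptions, a pair with Jaccard similarity 0.8 is almost certain to be compared (probability above 0.99), while a pair at 0.2 is compared only a few percent of the time. A brute-force scan of 3,130 entries would need 4,896,885 ratio computations.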

2. What it found

50 confirmed clusters, 100 entries total — and every cluster is a Clark / Lewis pair on the same date during the Fort Clatsop winter (December 1805 – March 1806).

The pairs we surfaced include:

  • Clark Jan 4, 1806 ↔ Lewis Jan 4, 1806
  • Clark Jan 11, 1806 ↔ Lewis Jan 11, 1806
  • Clark Jan 17, 1806 ↔ Lewis Jan 17, 1806
  • Clark Jan 21, 1806 ↔ Lewis Jan 21, 1806
  • Clark Jan 31, 1806 ↔ Lewis Jan 31, 1806
  • Clark Feb 1–4, 1806 ↔ Lewis Feb 1–4, 1806
  • …and 44 more such date-pairs through early March 1806.

Zero clusters span more than two entries. Zero clusters fall outside the Fort Clatsop winter window. The pattern is precise.

3. Why this is not a bug

The Fort Clatsop daily-journal pattern is documented in every serious edition of the journals. During the Pacific winter, Clark frequently transcribed Lewis’s entries into his own daybook with minor capitalization and punctuation differences but otherwise verbatim. Gary Moulton notes the practice extensively in his Nebraska edition footnotes. Elliott Coues notes it in his 1893 commentary. The pattern is well known: in the constant rain and confinement of Fort Clatsop, the two captains operated as a single editorial unit for much of the winter.

What the computational audit adds is quantification: of the ~120 winter dates with entries from both captains, 50 (~42%) reach the near-duplicate threshold of a SequenceMatcher ratio ≥ 0.80. The other ~58% differ substantively. The boundary between the days on which the captains wrote independently and the days on which one author covered for both is a measurable signal that could productively be cross-referenced with weather, hunting outcomes, diplomatic activity, or Lewis’s documented depressive episodes during that winter.

The audit found no such clusters earlier in the journey (Phase 2, May 1804 – October 1805) or later (Phase 4, March – September 1806). Within this corpus and at this threshold, the co-authorship pattern is unique to the Fort Clatsop winter.

4. How we surfaced it on the site

Each of the 100 entries was flagged with a parallels_entry meta field holding the post ID of its paired entry. The single-journal-entry template now renders a sidebar card on each flagged page:

Parallel Entry
This entry’s text closely parallels the entry below for the same date. This is a documented historical phenomenon — at Fort Clatsop, the captains often kept near-identical journals.
Lewis: January 4, 1806 · Near-duplicate primary-source text

Each flagged entry links to its counterpart. Neither entry’s content was changed; the relationship is annotated only.
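
The flagging step can be pictured as a reciprocal mapping. The sketch below is hypothetical: the function name and the post IDs are invented for illustration, and it only builds the parallels_entry values in memory rather than writing them through the site's own storage layer, which is not described here.

```python
def parallel_meta(clusters):
    """Build reciprocal parallels_entry values for two-entry clusters.

    clusters: iterable of (post_id_a, post_id_b) pairs.
    Returns a dict mapping each post ID to the meta it should carry.
    """
    meta = {}
    for cluster in clusters:
        if len(cluster) != 2:
            # The audit reported no cluster larger than two entries,
            # so anything else is treated as unexpected input.
            raise ValueError(f"expected a pair, got {len(cluster)} entries")
        a, b = cluster
        meta[a] = {"parallels_entry": b}
        meta[b] = {"parallels_entry": a}
    return meta


# Hypothetical post IDs, for illustration only.
example = parallel_meta([(101, 202)])
```

Because the mapping is symmetric, each entry of a pair points at the other, and neither entry's content is touched.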

This is the correct action because:

  1. The duplication is in the primary sources themselves, not in our import.
  2. Removing or merging the entries would falsify the historical record — readers and researchers benefit from knowing both captains’ entries exist in (largely identical) form.
  3. The cross-narrator analyses at /analyses/ already synthesize same-date entries from multiple narrators; the Parallel Entry card is the bridge from a single entry to the broader cross-narrator picture.

5. What this tells us about computational textual scholarship

The MD5 audit (finding #3) caught a hidden flaw: 207 fabricated daily entries from sparse source material. The near-duplicate audit (this finding) caught a documented historical pattern: 50 pairs of intentional cross-author transcription. Two different audits with two opposite kinds of findings.

What they share is the principle that a research database benefits from continuous structural inspection — not only at original publication, but as a routine practice. Many of the most interesting patterns in a corpus are not in any single document, but in the relationships between documents.

The journals of Lewis and Clark have been read closely for two centuries, but the question “how many of Clark’s Fort Clatsop entries are near-identical to Lewis’s?” could not be answered efficiently until the corpus was computable. The answer is now 50 entries, approximately 42% of the dual-narrated winter dates.

Future audits could productively address:

  • Near-duplicates at sentence rather than entry level (which would capture cases where one captain copied only part of the other’s entry)
  • Whitehouse-from-Ordway lexical drift across the full journey (mentioned in finding #1)
  • Cross-source quotation in the cross-narrator analyses themselves (catching cases where a synthesis essay reuses text verbatim from a source)

None of these would require new generation. They require new queries.
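
As an illustration of the first item in the list above, a sentence-level comparison might look like the sketch below. It is a hypothetical starting point, not existing audit code: the sentence splitter is a naive punctuation-based assumption (the journals' erratic punctuation would likely need something better), and the function names and the reuse of the 0.80 threshold are choices made for the example.

```python
import re
from difflib import SequenceMatcher


def split_sentences(text):
    # Naive split after ., !, ? or ; followed by whitespace (an assumption).
    parts = re.split(r"(?<=[.!?;])\s+", text.strip())
    return [p for p in parts if p]


def normalize(sentence):
    # Lowercase, drop punctuation, collapse whitespace.
    sentence = re.sub(r"[^a-z0-9\s]", "", sentence.lower())
    return re.sub(r"\s+", " ", sentence).strip()


def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def copied_sentences(entry_a, entry_b, threshold=0.80):
    """Return (sentence_from_a, best_match_in_b) pairs at or above threshold."""
    sentences_b = split_sentences(entry_b)
    matches = []
    for sentence_a in split_sentences(entry_a):
        best = max(sentences_b, key=lambda s: similarity(sentence_a, s), default=None)
        if best is not None and similarity(sentence_a, best) >= threshold:
            matches.append((sentence_a, best))
    return matches
```

Run over same-date entry pairs that fall below the entry-level threshold, a function like this would list the individual sentences one entry shares with the other.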

What this enables

Browse a flagged entry to see the relationship surfaced.

The audit found exactly the documented Clark-mirrors-Lewis pattern. No content was changed. The site is now more transparent about what readers are seeing.


Drafted May 12, 2026 as a follow-up to finding #3. The audit code is reusable; future runs against new data can be triggered cheaply. The author is an engineer rather than a historian; corrections from period scholars are welcome at ryan@terrain360.com.

