Legal Tech | Jun 23, 2026 | 9 min read

Why Your AI Demand Letter Keeps Getting the Numbers Wrong

If your AI demand letter tool keeps getting figures and dates wrong, that is not a bug you can prompt your way around. It is how language models generate numbers in the first place. Here is the actual mechanism, and what to do about it.

Jun Lee

Co-Founder & CTO

"It keeps getting the numbers wrong" is the single most common complaint we hear from firms that have tried an AI demand letter tool before finding their way to a more deliberate approach. A bill total is off by a few hundred dollars. A treatment date is a week early. A provider name is almost right but not quite. None of it is random, and none of it is a bug you can fix with a better prompt. It is how the underlying technology generates numbers in the first place.

This post explains the actual mechanism — not to excuse it, but because understanding it is the only way to build a workflow that catches it.

AI Doesn't Calculate. It Predicts.

Here is the part that surprises most attorneys: a language model does not look up a number and report it back. It generates text one token at a time, predicting what is statistically likely to come next based on patterns learned during training. For ordinary prose, that works remarkably well — grammar, phrasing, and sentence structure are highly predictable, so the model's pattern-completion is, in effect, correct most of the time.

Numbers are a different kind of problem entirely. A specific dollar figure, an exact date, a precise dosage — these require accurate retrieval, not plausible pattern completion. And retrieval is exactly the operation a next-token predictor is structurally weakest at. The model is not trying to recall "$4,827.50" from your client's file. It is predicting what number is statistically likely to follow the words that came before it. Those are very different operations, and only one of them is reliable.

Aspect	Generating prose	Generating a specific figure
What the model is doing	Predicting plausible phrasing	Predicting a plausible-looking number
Why it works or fails	Grammar and structure are highly patterned	The exact value cannot be derived from pattern alone
Failure mode	Awkward phrasing (easy to spot)	A confident, wrong number (hard to spot)
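To make the distinction concrete, here is a deliberately oversimplified sketch in Python. It is a caricature, not how a real language model works internally: a real model conditions on far more context than a frequency count. The names `seen_totals`, `client_file`, `predict_total`, and `lookup_total` are all hypothetical, invented for this illustration.

```python
from collections import Counter

# Hypothetical "training data": totals that followed the phrase
# "total medical expenses of" in other, unrelated letters.
seen_totals = ["$5,000.00", "$5,000.00", "$12,500.00", "$5,000.00", "$7,250.00"]

# Hypothetical client file: the one place the real figure lives.
client_file = {"total_medical_expenses": "$4,827.50"}


def predict_total(history):
    """Pattern completion: return the figure seen most often in the past."""
    return Counter(history).most_common(1)[0][0]


def lookup_total(record):
    """Retrieval: return the figure stored in this client's record."""
    return record["total_medical_expenses"]


predicted = predict_total(seen_totals)
retrieved = lookup_total(client_file)
print(predicted)  # $5,000.00
print(retrieved)  # $4,827.50
```

The predicted figure looks perfectly reasonable in a sentence, which is the problem: nothing about its fluency tells you it never came from the client's file.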

This Is Not Speculation — It Is in the Research

This is not an informal theory. Research from OpenAI's own team argues that hallucination is a statistically predictable byproduct of how current language models are trained and evaluated, because those procedures reward a confident guess over an admission of uncertainty. It is not simply an engineering flaw that better prompting or a future model update will fully solve. Separate research specifically on numerical generation has found that a model's accuracy on a given number correlates with how often that number's pattern appeared during training, which is a fairly damning explanation for why an obscure, case-specific dollar figure is exactly the kind of thing these systems get wrong.

A general-purpose chatbot is not failing to calculate your client's medical bills correctly. It was never calculating them in the first place — it was predicting what a plausible total might look like.

Why This Hits Demand Letters Especially Hard

A demand letter is unusually dense with exactly the content type these models are weakest on: dollar figures for every bill, dates for every treatment, provider names, policy numbers, diagnostic codes. A typical legal-research hallucination might be a misremembered case citation. A demand letter hallucination is a wrong number sitting inside a financial argument — and an adjuster who catches one wrong figure has every reason to distrust the rest of the letter.

This is also why prompting your way around the problem does not work the way people hope. You can instruct a model to "double-check the math" or "be precise with figures," and it may even produce text that says it double-checked — but that instruction does not change the underlying generation mechanism. The model is still predicting tokens. It is just now predicting tokens that include the phrase "I have verified this figure," which is its own kind of plausible-sounding fabrication.

What Actually Reduces the Error Rate

If prompting can't fix an architectural property, what does help? The answer is changing what the model is being asked to do at each step, rather than asking one pass to draft, calculate, and verify simultaneously.

Structured extraction before drafting

Instead of asking a model to write a letter and produce the numbers as it goes, the more reliable approach extracts every figure, date, and fact from the source records first, as a structured dataset — before any prose is generated. The letter is then written from that verified data, not invented alongside it.
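As a rough sketch of what "written from that verified data" can look like, the Python below assumes an earlier extraction step has already produced a list of structured records. The `BillFact` type, the `draft_specials_paragraph` function, and the provider names and amounts are hypothetical, chosen only for illustration. The point is that the drafting function formats figures it was handed and never produces one of its own.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class BillFact:
    """One billing fact pulled from the source records (hypothetical shape)."""
    provider: str
    service_date: date
    amount: Decimal


def draft_specials_paragraph(facts):
    """Build text from already-extracted facts; no figure is invented here."""
    lines = [
        f"{f.provider}, {f.service_date:%B %d, %Y}: ${f.amount:,.2f}"
        for f in facts
    ]
    # The total is computed arithmetically from the extracted amounts.
    total = sum((f.amount for f in facts), Decimal("0"))
    return "\n".join(lines) + f"\nTotal medical specials: ${total:,.2f}"


# Hypothetical values standing in for the output of an extraction step.
facts = [
    BillFact("Riverside Imaging", date(2025, 3, 4), Decimal("1250.00")),
    BillFact("Oak Street Physical Therapy", date(2025, 3, 18), Decimal("3577.50")),
]
paragraph = draft_specials_paragraph(facts)
print(paragraph)
```

In a real workflow the surrounding prose could still come from a model; the design choice being illustrated is only that the numbers enter the letter by substitution from the dataset.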

A separate validation step

Drafting and verification should not be the same pass. A model asked to write persuasively and a process asked to check facts against source documents are different jobs, and combining them in one step is exactly how an invented number slips through unnoticed — the same system that wrote the sentence has no independent way to catch its own error.
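One minimal form such an independent check could take, sketched in Python: scan the finished draft for dollar figures and flag any that do not appear in the verified dataset. The function name `unsupported_amounts`, the regular expression, and the sample draft are assumptions made for this example, and a production check would also need to cover dates, names, and other fact types.

```python
import re
from decimal import Decimal

# Matches figures such as $1,250.00 or $4827.50 (illustrative, not exhaustive).
AMOUNT_PATTERN = re.compile(r"\$(\d[\d,]*(?:\.\d{2})?)")


def unsupported_amounts(draft, verified_amounts):
    """Return dollar figures in the draft that are absent from the verified set."""
    found = [Decimal(m.replace(",", "")) for m in AMOUNT_PATTERN.findall(draft)]
    return [amount for amount in found if amount not in verified_amounts]


# Hypothetical verified figures from the extraction step.
verified = {Decimal("1250.00"), Decimal("3577.50"), Decimal("4827.50")}

# Hypothetical draft containing one transposed figure ($3,757.50).
draft = (
    "Ms. Doe incurred $1,250.00 in imaging and $3,757.50 in physical "
    "therapy, for a total of $4,827.50."
)

flagged = unsupported_amounts(draft, verified)
print(flagged)  # [Decimal('3757.50')]
```

Because this check is plain string matching and set membership, it does not share the drafting step's failure mode: it cannot be persuaded that a figure is right.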

Traceability, not just confidence

The most useful signal a tool can give you is not "this is accurate" — that claim is exactly as unverifiable as the number itself. The useful signal is showing you which page of which document a figure came from, so a human can confirm it in seconds instead of re-reading the file.
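A small Python sketch of what carrying that pointer alongside each figure might look like. The `SourcedFigure` type, its field names, and the file names are hypothetical; this is not a description of any particular product's data model.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class SourcedFigure:
    label: str
    amount: Decimal
    document: str  # file name of the source record ("" if unknown)
    page: int      # 1-based page number within that document (0 if unknown)


def citation(fig):
    """Render a figure together with where a reviewer can find it."""
    return f"{fig.label}: ${fig.amount:,.2f} [{fig.document}, p. {fig.page}]"


def untraceable(figures):
    """Figures with no usable source pointer, to be flagged for human review."""
    return [f for f in figures if not f.document or f.page < 1]


# Hypothetical figures: the second one has no source pointer.
figures = [
    SourcedFigure("Riverside Imaging MRI", Decimal("1250.00"),
                  "riverside_imaging_bill.pdf", 2),
    SourcedFigure("Ambulance transport", Decimal("900.00"), "", 0),
]

print(citation(figures[0]))
# Riverside Imaging MRI: $1,250.00 [riverside_imaging_bill.pdf, p. 2]
print([f.label for f in untraceable(figures)])  # ['Ambulance transport']
```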

The Verification Habit That Actually Catches Errors

Regardless of which tool a firm uses, this is the discipline that catches a wrong number before it reaches an adjuster:

✓ Every dollar figure traced to its source bill
✓ Every treatment date checked against provider notes
✓ Provider names and credentials spot-checked
✓ Specials total independently re-summed
✓ Any unverifiable figure flagged before the letter goes out
✓ No figure trusted on fluency alone

The verification habit that catches what generation alone cannot.
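The "independently re-summed" item is the easiest one to automate. A minimal Python sketch, assuming the line items have already been confirmed against the source bills; `resum_specials` and the sample amounts are hypothetical. It uses `Decimal` rather than floating point so that cents add up exactly.

```python
from decimal import Decimal


def resum_specials(line_items, stated_total):
    """Re-add the line items and return (computed total, computed minus stated)."""
    computed = sum((Decimal(item) for item in line_items), Decimal("0"))
    return computed, computed - Decimal(stated_total)


# Hypothetical line items, and a stated total that is off by $300.
line_items = ["1250.00", "3577.50"]
computed, difference = resum_specials(line_items, "5127.50")

print(computed)    # 4827.50
print(difference)  # -300.00
```

Any nonzero difference means the total in the letter does not match its own line items and should be resolved before the letter goes out.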

The Honest Bottom Line

No current AI architecture, including ours, eliminates the underlying tendency to generate a confident, wrong number. What separates a usable tool from a liability is whether the workflow around that tendency is built to catch it — structured extraction, separate validation, and traceability the paralegal can act on quickly — or whether the product just hopes a good prompt is enough. It isn't. The fix was never going to be a smarter sentence. It is a smarter process around the sentence.


Frequently asked questions

Language models generate text by predicting the next likely token, not by performing calculation or lookup. For prose, that works well because grammar and phrasing are highly predictable. Numbers, dates, and exact figures are different — they require precise retrieval, not pattern completion, which is exactly the operation these models are structurally weakest at.

Better prompting reduces some errors but cannot eliminate them, because the underlying limitation comes from how the model generates text, not from how it is instructed. OpenAI's own research argues that hallucination is a statistically predictable byproduct of how these models are trained and evaluated, not simply an engineering flaw to be prompted away.

It is a property of how large language models generate text in general, not a flaw unique to any one product. The difference between tools is not whether this tendency exists, but whether the product is engineered to catch it before the number reaches the page — through structured extraction and a separate validation step against the source documents.

Every dollar figure, date, and provider name should be traceable back to a specific page in the source medical records, not just plausible-sounding. The fastest verification workflow is one where the tool shows you where each fact came from, so checking a number against its source takes seconds instead of re-reading the whole file.

Models will continue to improve, and tools built with structured fact-extraction and validation already reduce errors substantially compared to asking a general chatbot to draft from scratch. But no current architecture eliminates the underlying tendency entirely, which is why verification against source documents should remain part of the workflow regardless of which tool a firm uses.
