Why AI gets less accurate the longer you talk — and what to do about it
You're deep into a conversation with AI about your health. You've described your symptoms, shared your history, discussed possible causes, and now you're asking follow-up questions. The answers start feeling less sharp. The AI repeats something it said earlier. It contradicts itself. It seems to lose track of what you told it twenty messages ago.
You're not imagining it. This comes from a mechanical limitation called the context window, and understanding it changes how you structure your health investigations.
What a context window is
Every AI conversation has a maximum amount of text the model can "see" at once. This is the context window — think of it as the model's working memory. Everything you've said, everything it's said, and any system instructions all have to fit inside this window.
Current models have context windows ranging from roughly 8,000 to 200,000 tokens (a token is roughly three-quarters of a word). That sounds like a lot. For a short conversation, it is. But health conversations accumulate fast — a detailed symptom description, medication history, discussion of possible causes, analysis of tracked data, follow-up questions. A thorough health investigation can fill a context window faster than you'd expect.
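To get a feel for how fast a conversation fills up, here is a minimal sketch in Python. The 4-characters-per-token ratio is a common rule of thumb, not how any real tokenizer works, and the sample messages are invented for illustration:

```python
# Rough token budget estimate for a conversation.
# Assumption: ~4 characters per token (real tokenizers vary by model).

def estimate_tokens(text: str) -> int:
    """Approximate token count using the chars/4 heuristic."""
    return max(1, len(text) // 4)

conversation = [
    "I've been unusually fatigued for about three weeks.",
    "My medications are levothyroxine 50 mcg and vitamin D.",
    "Sleep is 6 to 7 hours a night, often interrupted.",
]

total = sum(estimate_tokens(m) for m in conversation)
print(f"Estimated tokens so far: {total}")
```

Three short messages already cost dozens of tokens; a detailed symptom history, pasted lab results, and long model replies consume thousands.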
What happens when the window fills up
When a conversation exceeds the context window, the oldest messages get pushed out. The model literally cannot see them anymore. It's as if the beginning of your conversation never happened.
This creates a specific problem for health investigations. You mentioned your medication list in message 3. By message 40, that information may have been pushed out of the window. The model no longer knows what medications you're on. It might recommend something that interacts with a medication it's forgotten about, or re-suggest a cause you already discussed and ruled out.
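The truncation mechanic can be sketched in a few lines. This is an illustrative model, not the implementation of any particular chat system: the `fit_to_window` function, the chars-per-token heuristic, and the tiny token budget are all assumptions chosen to make the effect visible:

```python
# Sketch of context-window truncation: when history exceeds the budget,
# the oldest messages are dropped first and become invisible to the model.

def fit_to_window(messages, max_tokens):
    """Keep the most recent messages whose combined size fits the budget."""
    kept, used = [], 0
    for msg in reversed(messages):       # walk newest to oldest
        cost = len(msg) // 4 or 1        # crude chars/4 token heuristic
        if used + cost > max_tokens:
            break                        # this message and everything older is lost
        kept.append(msg)
        used += cost
    return list(reversed(kept))          # restore chronological order

history = [
    "msg 1: symptoms",
    "msg 2: history",
    "msg 3: medication list",   # the detail you need later
    "msg 4: possible causes",
    "msg 5: follow-up question",
]

visible = fit_to_window(history, max_tokens=12)
print(visible)
```

With this deliberately tiny budget, only the two newest messages survive: the medication list in message 3 is gone, and the model answers the follow-up question without it.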
The middle gets lost first
Research on long-context language models published by Stanford NLP found something counterintuitive: models don't just lose the beginning of long conversations. They lose the middle first. Information at the very start and the very end of the context window gets the most attention. Everything in between gets progressively less weight.
The researchers called this the "lost in the middle" phenomenon. For health conversations, this means the detailed discussion you had about your sleep patterns thirty messages ago — sandwiched between your initial symptom description and your current question — is exactly the information most likely to be degraded.
A tangible example: you start a conversation describing your fatigue. You discuss sleep. Then diet. Then stress. Then medications. Then you ask "given everything we've discussed, what should I investigate first?" The model gives strong weight to your initial fatigue description and your current question, but the sleep and diet discussion in the middle? It's getting less attention than the beginning or end.
How accuracy degrades
The accuracy loss isn't sudden — it's gradual and hard to notice. Research on AI faithfulness published in the Proceedings of the Association for Computational Linguistics found that as conversations lengthen, models are more likely to generate responses that are internally consistent (they sound right) but factually inconsistent with earlier parts of the conversation.
In practice, this looks like:
Soft contradictions. The model told you earlier that your fatigue pattern suggests a metabolic cause. Thirty messages later, it's now discussing your fatigue as stress-related without acknowledging the shift. Both framings might be reasonable, but the model isn't choosing between them — it's lost track of its earlier reasoning.
Forgotten context. You told the model about a medication change three weeks ago. It's now making recommendations as if you're still on the old medication. It's not ignoring the information — it genuinely cannot see it anymore.
Repetition with variation. The model re-explains something it already covered, but slightly differently. This isn't deliberate emphasis. It's the model generating text without access to the earlier version.
False precision. In longer conversations, models sometimes generate increasingly specific claims with decreasing accuracy. A suggestion that was appropriately hedged at the start of the conversation becomes more definitive thirty messages later — not because more evidence has emerged, but because the model has less context to draw appropriate caveats from.
What Iris does about this
Iris stores your health information as persistent memory — markdown files organized into topic folders, separate from any single conversation's context window. When you start a new conversation, Iris browses your topic directories, loads summary files, and selectively pulls in the notes most relevant to your question. Your medications, conditions, test results, and investigation history don't depend on being mentioned in the current conversation to be available.
The Supervisor also helps. It reviews every message Iris sends, and because it checks against your stored health record independently, it can catch contradictions and forgotten context that the primary model misses — especially in longer conversations where the context window is under pressure.
But neither system is perfect. Iris's memory loading is selective — it may not pull in every relevant note. And the context window still limits what can be active in a single conversation. Understanding both mechanisms helps you work with them.
What you can do
Keep conversations focused. One investigation thread per conversation works better than trying to cover everything at once. If you need to switch topics — from fatigue analysis to medication questions to sleep optimization — starting a new conversation gives the model a fresh, full context window for each topic.
Tell Iris what to load. If a conversation needs specific context, say so: "Load my notes about migraines" or "Check my entries from the last two weeks." Iris browses memory selectively, and explicit direction ensures it finds the right files. You can also ask "what notes did you load?" to verify.
Front-load critical information. Your most important context — current medications, key conditions, the specific question you're investigating — should appear early in the conversation, where it gets the most attention. Don't bury your medication list in message 15.
Restate key facts when asking important questions. If you're about to ask something that depends on information from much earlier in the conversation, briefly restate the relevant context. "Given that I'm on levothyroxine and my TSH was normal last month, what should I make of..." is better than "So what should I make of..." thirty messages deep.
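If you ask the same handful of questions often, the restating habit can even be mechanized. This is a hypothetical helper, not an Iris feature: the `with_context` function and the `KEY_FACTS` list are invented for illustration:

```python
# Hypothetical helper: prepend key facts to a question so they sit at the
# end of the context, where the model's attention is strongest, even if
# the original mentions scrolled out of the window long ago.

KEY_FACTS = [
    "I'm on levothyroxine 50 mcg daily",
    "my TSH was normal last month",
]

def with_context(question: str, facts=KEY_FACTS) -> str:
    """Build a question that restates the listed facts up front."""
    header = " and ".join(facts)
    return f"Given that {header}: {question}"

print(with_context("what should I make of the persistent fatigue?"))
```

The output reads like the hand-written example above ("Given that I'm on levothyroxine..."), which is the point: the facts travel with the question instead of depending on a message from thirty turns ago.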
Notice when quality drops. If Iris starts repeating itself, contradicting earlier statements, or giving increasingly generic answers, the context window is likely the culprit. That's a signal to start a new conversation, not to push harder on the current one.
References
- Lost in the middle: how language models use long contexts — Stanford NLP, 2023. Empirical evidence for accuracy degradation in the middle of long contexts.
- Faithfulness in long-form generation — ACL, 2023. Internal consistency vs. factual consistency in extended AI conversations.
- Scaling transformer context length — arXiv, 2023. Technical overview of context window limitations and extension approaches.
- Clinical implications of AI memory limitations — Nature Medicine, 2023. How context limitations affect AI-assisted clinical reasoning.