What's actually happening when AI answers your health question
When you ask an AI about your symptoms, it sounds like it's reasoning. It says things like "based on your description, this could be..." and walks through possibilities with what feels like medical logic. It's natural to assume something like thinking is happening.
It isn't. Understanding what's actually happening makes you a dramatically better user of the tool — and keeps you from trusting it in the wrong ways.
Pattern matching, not reasoning
Large language models — the technology behind ChatGPT, Claude, Gemini, and the models Iris uses — work by predicting the next word in a sequence. That's it. Given everything that's come before in the conversation, the model calculates which word is most likely to come next, outputs it, and repeats.
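That predict-output-repeat loop can be sketched in a few lines. The tiny lookup table below is a stand-in for a real neural network (an assumption for illustration; actual models compute probabilities over tens of thousands of tokens), but the generation loop itself has the same shape: look at the last word, pick a likely next word, append, repeat.

```python
# Toy stand-in for a language model: a table mapping the previous
# word to probabilities for the next word. (Illustrative only —
# a real model is a neural network, not a lookup table.)
bigram_probs = {
    "feeling": {"tired": 0.6, "fine": 0.4},
    "tired": {"after": 0.7, "today": 0.3},
    "after": {"sleeping": 0.8, "lunch": 0.2},
}

def generate(start, steps):
    tokens = [start]
    for _ in range(steps):
        options = bigram_probs.get(tokens[-1])
        if not options:
            break  # no pattern to continue from
        # Pick the most probable next word, append it, and repeat —
        # the entire "reasoning" is this loop.
        tokens.append(max(options, key=options.get))
    return " ".join(tokens)

print(generate("feeling", 3))  # → feeling tired after sleeping
```

Nothing in that loop understands fatigue or sleep; it only continues the statistically likeliest word sequence.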
This sounds simplistic, but the scale is staggering. These models have processed hundreds of billions of words of text — medical literature, clinical guidelines, patient forums, textbooks, research papers. The patterns they've absorbed are extraordinarily rich. When you describe fatigue with specific characteristics, the model matches those patterns against everything it's seen about fatigue in its training data and generates text that follows the same patterns as expert discussions of similar symptoms.
The result often looks like medical reasoning. Sometimes it's genuinely useful — the pattern matching surfaces connections you wouldn't have made. But it's not reasoning. The model doesn't understand what fatigue is, what a thyroid does, or why sleep matters. It knows that in text about fatigue, certain words follow other words in certain patterns.
This distinction matters because it explains both why AI is useful and exactly how it fails.
Why it sounds so confident
AI models don't have a confidence meter. They don't think "I'm 40% sure about this" and then decide how to phrase it. They generate text that follows the patterns of confident, authoritative medical writing — because that's what most medical text looks like.
Research on AI calibration published in Nature found that language models are systematically overconfident. When a model says something that sounds certain, that certainty comes from the writing style in its training data, not from any internal measure of accuracy.
A concrete example: you tell an AI you have fatigue, brain fog, and cold sensitivity. It responds "these symptoms are consistent with hypothyroidism — you should get your thyroid tested." That sounds like a reasoned clinical assessment. What actually happened is the model matched your symptom description to patterns in text about hypothyroidism and generated a response in the style of clinical advice. It might be right. But the confidence you're hearing is a property of the text style, not a measure of how likely the suggestion is.
How training data shapes answers
The model's responses are only as good as its training data, and that data has specific biases.
Medical textbooks emphasize common conditions and classic presentations. If your symptoms match a textbook case, AI performs well. If your presentation is atypical — and chronic conditions are frequently atypical — the model defaults to the most common pattern in its training data, which may not be your pattern.
Research on AI diagnostic accuracy published in The Lancet Digital Health found that AI systems perform significantly better on common conditions with typical presentations than on rare conditions or atypical presentations. For health investigation, this means AI is most reliable when helping you explore common explanations and least reliable when suggesting unusual ones.
The training data also reflects the biases of medical literature itself. Conditions that are well-studied get better AI coverage. Conditions that disproportionately affect understudied populations — many pain conditions, autoimmune diseases, and conditions more prevalent in women — may get less accurate or less nuanced responses.
What "temperature" means for your health answers
When generating text, AI models use a parameter called temperature that controls randomness. At low temperature, the model picks the most probable next word every time — safe, predictable, sometimes repetitive. At higher temperature, it samples from less probable options — more creative, more varied, but also more likely to produce something wrong.
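A minimal sketch of how temperature reshapes the model's choices: the model's raw scores (logits) are divided by the temperature before being converted to probabilities, so low temperature concentrates nearly all the probability on the top option while high temperature spreads it toward unlikely ones. The scores and candidate words below are invented for illustration.

```python
import math

def next_word_probs(logits, temperature):
    """Convert raw scores to probabilities at a given temperature
    (a standard softmax with temperature scaling)."""
    scaled = [score / temperature for score in logits.values()]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return {word: e / total for word, e in zip(logits, exps)}

# Hypothetical scores for the next word after "your fatigue could indicate ..."
logits = {"anemia": 2.0, "hypothyroidism": 1.5, "a rare enzyme disorder": 0.1}

cold = next_word_probs(logits, temperature=0.2)  # conservative
hot = next_word_probs(logits, temperature=2.0)   # creative

# At low temperature the top candidate dominates; at high temperature
# the improbable option gets a real chance of being sampled.
print(cold["a rare enzyme disorder"] < hot["a rare enzyme disorder"])  # → True
```

The same mechanism that makes high-temperature text feel varied and creative is what occasionally routes a health answer through an improbable, and possibly fictional, word path.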
For health questions, this has real consequences. A model at higher temperature might generate a plausible-sounding but invented study, or suggest a supplement interaction that doesn't exist, because it's sampling from less probable patterns. This is one mechanical reason hallucinations happen — the model occasionally picks a word path that leads somewhere fictional, and then continues confidently down that path because each subsequent word still follows probable patterns given the words before it.
Iris uses carefully tuned settings for health conversations, but understanding this mechanic explains why AI can say something that sounds perfectly reasonable and be completely wrong.
What this means for how you use AI
Knowing that AI is pattern matching, not reasoning, changes how you should interact with it.
Give it better patterns to match against. Specific, structured symptom descriptions give the model more precise patterns to work with. "I'm tired" matches against everything ever written about tiredness. "I wake exhausted after 8 hours, crash at 2 PM, and feel better after eating" matches against a much narrower, more useful set of patterns.
Treat confidence as style, not signal. When AI says something with certainty, that's how the text sounds, not how sure the model is. The more confident a claim sounds, the more it's worth checking, because the model has no mechanism for matching its writing confidence to its actual accuracy.
Use it for breadth, verify for depth. AI is excellent at surfacing possibilities you haven't considered, because it's matched against a vast amount of medical text. It's less reliable at determining which possibility is most likely for your specific situation. Use it to generate hypotheses, then verify them through tracking, testing, or your provider.
Understand that it doesn't remember learning. The model doesn't know where it learned something. It can't distinguish between a pattern from a peer-reviewed study and a pattern from a health blog. This is why citing sources is hard for AI, and why Iris uses a supervisor model to cross-check — the first model has no way to evaluate the reliability of its own outputs.
References
- Calibration of language models for medical questions — Nature, 2023. Systematic overconfidence in AI medical responses.
- AI diagnostic accuracy across conditions — The Lancet Digital Health, 2019. Performance variation between common and atypical presentations.
- Attention is all you need — arXiv, 2017. The foundational transformer architecture paper behind modern LLMs.
- A survey of large language models — arXiv, 2023. Comprehensive overview of LLM capabilities and limitations.