
Health AI models routinely show strong accuracy in pilots and pitch decks, then drop significantly in production.

A Stanford-Harvard review of over 500 clinical AI studies found that nearly half were validated on exam-style questions rather than real patient data. Only 5% used actual clinical cases. The accuracy figures that show up in pitch decks and product pages are, in most instances, measuring performance under conditions that don't resemble the environments these tools are meant to operate in.

Even when models are tested more rigorously, the results are sobering. Top-performing systems still produce 12 to 15 severe clinical errors per 100 cases. Performance tends to break down where it matters most: in situations involving uncertainty, incomplete information, or multi-step clinical workflows.

Shadow-mode validation (running the model in parallel with clinicians for 4 to 8 weeks before clinical deployment) often helps catch these failures, but companies tend to skip it to demonstrate speed.

For investors evaluating health AI, the implication is fairly straightforward. The relevant question likely isn't what a model's benchmark accuracy is, but how it was validated and on what kind of data. That distinction seems to be where a significant amount of value and risk is currently hiding.

arise-ai.org/report
State of Clinical AI Report 2026 - ARISE