When a Hospital CTO Must Choose an LLM for Clinical Decision Support: Aisha's Story

Aisha, the CTO of a mid-sized hospital network, faced a choice that kept her awake for weeks. Her team needed a language model to generate discharge summaries, surface relevant past notes, and draft medication-change rationales for physician review. The promise was clear: faster documentation, fewer transcription errors, and more time for clinicians to see patients.

Reality hit during a pilot. A model produced a convincing but incorrect medication dosage that nearly led to the wrong instruction being printed in a discharge packet. No patient was harmed because a clinician caught the error, but the close call exposed the real risk. Hallucinations - confident but false outputs - were not an academic problem. They were a production hazard that could cause clinical harm, regulatory exposure, and loss of trust among staff.

Aisha had to pick a model that would run in production. The engineering lead wanted an open-source model to host on private hardware. The ML lead favored a cloud-hosted model with guardrails. Vendors waved glossy accuracy numbers. Benchmarks showed wildly different performance. Every stakeholder had a graph that supported their view. The question became: how do you evaluate models when hallucinations have real consequences?

The Hidden Cost of Hallucinations in Production

Hallucinations are not binary. They vary by type, severity, and context. A small factual mismatch in phrasing may be harmless. An invented drug interaction in a discharge note is dangerous. Effective evaluation requires measuring both frequency and severity. That is where many teams stumble.

Vendors often advertise headline metrics: "95% correct on X benchmark." Those numbers are almost always tied to narrow datasets, specific prompts, or non-adversarial setups. They do not reflect a hospital's tail cases. Meanwhile, open-source advocates point to low-latency local models and cost savings. Those models can be tuned to outperform hosted options on narrow tasks, but tuning can also amplify overfitting to in-domain examples and hide failure modes.

As it turned out, Aisha's team needed three measures before a production decision could be made: (1) a precise taxonomy for hallucination types, (2) a measurement protocol that replicated production inputs, and (3) acceptance thresholds tied to clinical risk. Without those, "best" meant whatever metric a vendor optimized for.
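
To make the first and third measures concrete, here is a minimal Python sketch of what a hallucination taxonomy with acceptance thresholds can look like in code. The category names mirror the taxonomy Aisha's team later defined; the dataclass, field names, and numeric limits are illustrative assumptions, not the hospital's actual configuration.

    # Minimal sketch: hallucination categories plus acceptance thresholds
    # tied to clinical risk. The numeric limits below are hypothetical.
    from dataclasses import dataclass
    from enum import Enum

    class HallucinationType(Enum):
        FABRICATION = "fabrication"
        MISATTRIBUTION = "misattribution"
        INCORRECT_REASONING = "incorrect_reasoning"
        OMISSION = "omission"
        UNSAFE_SUGGESTION = "unsafe_suggestion"

    @dataclass
    class AcceptanceThreshold:
        max_any_rate: float            # tolerated rate of any error in this category
        max_high_severity_rate: float  # tolerated rate of severity 4-5 errors

    # Hypothetical thresholds: the riskier the category, the stricter the limit.
    THRESHOLDS = {
        HallucinationType.UNSAFE_SUGGESTION: AcceptanceThreshold(0.01, 0.0),
        HallucinationType.FABRICATION: AcceptanceThreshold(0.05, 0.01),
        HallucinationType.MISATTRIBUTION: AcceptanceThreshold(0.05, 0.01),
        HallucinationType.INCORRECT_REASONING: AcceptanceThreshold(0.05, 0.01),
        HallucinationType.OMISSION: AcceptanceThreshold(0.10, 0.02),
    }

    def passes_acceptance(observed):
        """observed maps HallucinationType -> (any_error_rate, high_severity_rate)."""
        for htype, limit in THRESHOLDS.items():
            any_rate, high_rate = observed.get(htype, (0.0, 0.0))
            if any_rate > limit.max_any_rate or high_rate > limit.max_high_severity_rate:
                return False
        return True

The point of expressing thresholds this way is that "acceptable" stops being a matter of opinion: a candidate model either clears the per-category limits or it does not.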

Why Standard Benchmarks Mislead Decision Makers

Standard public benchmarks are useful but misleading for mission-critical systems. Here are the core methodological problems teams face when they compare models directly from benchmark tables.

    Different task definitions: One benchmark measures single-turn question answering, another measures long-form summarization. The same "accuracy" label masks different challenge surfaces.
    Prompt and formatting dependency: Small prompt engineering changes shift scores dramatically. Benchmarks rarely document the exact template or context windows used in vendor claims.
    Dataset overlap and leakage: Many pretrained models have seen large parts of common benchmarks during training. High benchmark scores can reflect memorization, not generalizable understanding.
    Non-uniform evaluation of hallucinations: Benchmarks either treat hallucination as a binary error or use proxies like BLEU and ROUGE that don't capture factuality. Two models with identical ROUGE can differ wildly in factual correctness.
    Lack of severity grading: Benchmarks rarely weight errors by downstream harm. A 5% hallucination rate that includes critical errors is worse than a 10% rate where errors are benign.

Meanwhile, vendors and open-source projects publish different splits, different prompts, and different temperature settings. That turns comparisons into apples and oranges. You can line up numbers from GPT-4, Llama 2-chat 70B, Falcon 40B, and an in-house fine-tuned 7B model several ways and get conflicting winners depending on which metric you pick.

Concrete examples from a controlled audit (May 2024)

To make the problem tangible, consider an internal controlled audit conducted in May 2024 on a 500-case clinical test set matching real discharge note complexity. The audit compared GPT-4, Llama 2-Chat 70B, and Falcon 40B-Instruct under identical prompts and a fixed temperature. Evaluators graded outputs for factuality and severity.

    Hallucination incidence (any factual error): GPT-4 ~12%, Llama 2-Chat 70B ~23%, Falcon 40B ~19%.
    High-severity hallucinations (would require clinical correction): GPT-4 ~4%, Llama 2-Chat 70B ~10%, Falcon 40B ~7%.

Those numbers are illustrative of a single audit, not universal truth. The takeaway is not which model won, but that results change dramatically with dataset, prompt, and severity weighting. The "right" model depends on what you measure and how.
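
For readers who want to reproduce this kind of tally on their own graded outputs, here is a minimal sketch of the aggregation step, assuming each graded case is recorded as a (model, has_factual_error, severity) record. The field names and the severity cutoff for "high severity" are illustrative assumptions, not the audit's actual schema.

    # Minimal sketch: turn graded cases into incidence and high-severity rates.
    from collections import defaultdict

    def summarize(graded_cases, high_severity_cutoff=4):
        """graded_cases: iterable of dicts like
        {"model": "gpt-4", "has_factual_error": True, "severity": 5}."""
        totals = defaultdict(int)
        any_error = defaultdict(int)
        high_sev = defaultdict(int)
        for case in graded_cases:
            m = case["model"]
            totals[m] += 1
            if case["has_factual_error"]:
                any_error[m] += 1
                if case["severity"] >= high_severity_cutoff:
                    high_sev[m] += 1
        return {
            m: {
                "incidence": any_error[m] / totals[m],
                "high_severity_rate": high_sev[m] / totals[m],
            }
            for m in totals
        }

Changing the severity cutoff in this one function is enough to flip which model "wins", which is exactly why the weighting must be fixed before the comparison is run.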

How a Standards-Driven Evaluation Shifted Our Decision

Aisha's team took a different approach. They stopped asking "Which model is best?" and asked "Which model best meets our safety requirements when measured against real workflows?" They designed a four-part evaluation protocol and ran it in May-June 2024.

    Define hallucination taxonomy and severity anchors: They split hallucinations into categories: fabrication, misattribution, incorrect reasoning, omission, and unsafe suggestion. Each category had a 1-5 severity anchor tied to clinical consequences.
    Construct production-like prompts: They sampled actual EHR fragments, clinical abbreviations, and multi-turn clarifying questions. They included time pressure and common OCR noise from scanned documents.
    Blind, multi-rater human evaluation: Clinicians reviewed outputs blind to model ID, scoring factuality, clinical risk, and required correction effort. Inter-rater agreement was measured using Cohen's kappa (a minimal kappa sketch follows this list); disagreements were resolved by a senior clinician.
    Adversarial and OOD testing: They injected rare comorbidities, conflicting lab values, and intentional ambiguities to probe model behavior outside frequent training patterns.
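
For teams unfamiliar with the agreement step, here is a minimal two-rater Cohen's kappa sketch. scikit-learn's cohen_kappa_score computes the same statistic; this hand-rolled version just makes the observed-versus-expected agreement explicit. The example labels are invented.

    # Minimal sketch: Cohen's kappa for two raters labeling the same outputs.
    from collections import Counter

    def cohens_kappa(rater_a, rater_b):
        assert len(rater_a) == len(rater_b) and rater_a
        n = len(rater_a)
        observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
        freq_a = Counter(rater_a)
        freq_b = Counter(rater_b)
        # Chance agreement expected from each rater's label frequencies.
        expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
        return (observed - expected) / (1 - expected)

    # Example: two clinicians grading the same 6 outputs.
    a = ["ok", "ok", "hallucinated", "ok", "hallucinated", "ok"]
    b = ["ok", "ok", "hallucinated", "hallucinated", "hallucinated", "ok"]
    print(cohens_kappa(a, b))  # ~0.67; values near 1 indicate strong agreement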

As it turned out, the rigorous protocol changed procurement conversations. The cloud-hosted model still performed well on average, but the open-source model, after targeted fine-tuning and the addition of a retrieval layer, reduced high-severity hallucinations to a comparable level at a lower recurring cost. The trade-off was more engineering work to maintain the retrieval database and stricter monitoring.

This led to a decision in which both models were used: the open-source model for low-risk documentation and the cloud model behind a more conservative, clinician-in-the-loop pipeline for high-risk tasks. The mixed strategy reduced single-point failure risk and balanced cost with safety.

From Uncertain Benchmarks to Deployable Guarantees: Real Results

After deployment, Aisha's team implemented continuous monitoring and logging. They instrumented the pipeline to capture model outputs, retrieval provenance, confidence scores, and clinician edits. They applied the following operational controls.

    Abstention policy: Models were configured to return "I don't know" or to flag outputs when internal confidence (calibrated probability or retrieval overlap) fell below a threshold (a minimal gate sketch follows this list).
    Provenance linking: Every factual assertion in a summary was tagged to a source document or an explicit chain-of-reasoning produced by the model.
    Real-time alerts: A daily dashboard tracked hallucination rates by severity bucket and flagged spikes for triage.
    Human-in-the-loop gates: All high-risk outputs required clinician sign-off before becoming part of the legal medical record.
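
Here is a minimal sketch of what such an abstention gate can look like, assuming the pipeline exposes a calibrated confidence score and a retrieval-overlap score for each draft. The threshold values and field names are illustrative assumptions, not the hospital's production settings.

    # Minimal sketch: abstain when either signal falls below its threshold.
    ABSTAIN_MESSAGE = "I don't know - flagged for clinician review."

    def gate_output(draft_text, calibrated_confidence, retrieval_overlap,
                    min_confidence=0.7, min_overlap=0.5):
        """Return the draft if it clears both thresholds, otherwise abstain."""
        if calibrated_confidence < min_confidence or retrieval_overlap < min_overlap:
            return {"text": ABSTAIN_MESSAGE, "abstained": True,
                    "reason": {"confidence": calibrated_confidence,
                               "overlap": retrieval_overlap}}
        return {"text": draft_text, "abstained": False, "reason": None}

Logging the reason alongside every abstention is what later made it possible to tune the thresholds without guessing why the model was refusing.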

Over the first 90 days in production, the observed metrics shifted. High-severity hallucination incidents reported by clinicians dropped from an initial 8% in the pilot to about 2.5% after implementing retrieval, abstention, and provenance linking. Low-severity factual mismatches still occurred, but they were caught during normal review workflows. The system's false refusal rate - where the model abstained despite an answer being safe - rose slightly and was managed through prompt tuning and targeted data augmentation.

Those results are not a universal promise. They reflect a concrete set of engineering trade-offs: increased latency from retrieval, higher engineering costs for provenance tracking, and a rise in clinician review time that required process redesign. The critical point is that measurable controls and thresholds made it possible to quantify residual risk and make a defensible deployment decision.

Contrarian viewpoints worth considering

Some experts argue that no amount of testing will eliminate hallucinations and that human oversight is the only safe path for clinical content. Others counter that strict abstention policies make models useless in practice. Both views have merit.

A nuanced stance recognizes that risk tolerances vary by use case. For high-stakes decision support, a conservative design with mandatory clinician confirmation is reasonable. For administrative summarization that speeds paperwork but does not change care plans, a more permissive pipeline with spot audits may be acceptable.

Another counterpoint: retrieval augmentation and provenance can create a false sense of safety. If the retrieval index contains erroneous or outdated documents, the model can faithfully summarize wrong sources. That led Aisha's team to add data quality gates and periodic index pruning, because provenance without source quality is meaningless.
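
A minimal sketch of such a quality gate, assuming each indexed document carries a last-reviewed timestamp and a source tier; both fields and the staleness window are illustrative assumptions rather than the team's actual policy.

    # Minimal sketch: drop stale or untrusted documents before they can be retrieved.
    from datetime import datetime, timedelta

    MAX_AGE = timedelta(days=540)  # hypothetical ~18-month staleness window
    TRUSTED_TIERS = {"ehr", "formulary", "guideline"}

    def keep_in_index(doc, now=None):
        """doc: dict with 'last_reviewed' (datetime) and 'source_tier' (str)."""
        now = now or datetime.now()
        fresh = (now - doc["last_reviewed"]) <= MAX_AGE
        trusted = doc["source_tier"] in TRUSTED_TIERS
        return fresh and trusted

    def prune(index_docs):
        return [d for d in index_docs if keep_in_index(d)]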

Practical checklist for CTOs and ML leads

If you are evaluating models where hallucinations matter, use this practical checklist adapted from the hospital case.

    Define what constitutes a hallucination in your domain and tie it to severity levels.
    Build or capture a test set that mirrors your production inputs, including noise and adversarial cases.
    Run blind, multi-rater human evaluation and compute inter-rater agreement.
    Measure both frequency and severity, and report both numbers with confidence intervals and test dates (a minimal interval sketch follows this list).
    Test multiple models under identical prompts, temperatures, and context windows; record hyperparameters.
    Include OOD and adversarial prompts explicitly. Track where models fail badly, not just the average case.
    Evaluate mitigations (retrieval, fine-tuning, prompt chaining) as full systems, not isolated model improvements.
    Require provenance and implement source quality controls for any retrieval-backed system.
    Set production acceptance thresholds based on clinical risk and expected remediation cost.
    Instrument for continuous monitoring, alerting, and periodic re-evaluation.
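
For the frequency-and-severity item, here is a minimal sketch of reporting a rate with a confidence interval. It uses a Wilson score interval so small test sets and low rates do not produce misleadingly tight bounds; the example counts are invented.

    # Minimal sketch: 95% Wilson score interval for an observed error rate.
    import math

    def wilson_interval(errors, n, z=1.96):
        if n == 0:
            return (0.0, 0.0)
        p = errors / n
        denom = 1 + z * z / n
        center = (p + z * z / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
        return (max(0.0, center - half), min(1.0, center + half))

    # Example: 20 high-severity errors observed in a 500-case test set.
    low, high = wilson_interval(20, 500)
    print(f"rate ~= {20/500:.1%}, 95% CI [{low:.1%}, {high:.1%}]")  # roughly [2.6%, 6.1%]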

What this means for procurement and vendor claims

Vendors will continue to publish optimistic metrics. Treat those as one input among many. Ask for the raw evaluation data: prompts, model parameters, dataset splits, and test dates. Ask vendors to disclose failure modes and share red-team results for your domain. If they refuse, that is useful information.

As it turned out, Aisha's procurement negotiations were far easier once the hospital published its evaluation protocol. Vendors and open-source suppliers could run the same test and return numbers that were directly comparable. That transparency exposed where vendor numbers were cherry-picked and where real engineering work was required to meet thresholds.

This led to contracts that included performance SLAs tied to measured hallucination rates, scheduled re-evaluations every quarter, and cost-sharing provisions for remediation when high-severity errors occurred in production. Those are concrete levers for risk management rather than marketing promises.

Final takeaways

Hallucinations are real, measurable, and costly when a model is used in safety-critical workflows. Benchmarks alone will not tell you what you need to know. The correct approach is to build a domain-specific evaluation protocol, measure both frequency and severity, and treat mitigations as system-level engineering problems. Transparent test data, provenance, abstention policies, and ongoing monitoring are what turn uncertain models into deployable components.

Meanwhile, keep a skeptical eye on vendor claims. Demand raw evidence, insist on identical testing conditions, and define acceptance thresholds tied to real-world risk. As it turned out in Aisha's hospital, those steps transformed a near-miss into a measurable, managed deployment that balanced utility, cost, and patient safety. This led to a pragmatic production strategy rather than a single "winner" declared by a benchmarking leaderboard.