Hallucinations versus Errors and the Implications for Clinical Decision Making

The potential for artificial intelligence (AI) in healthcare is immense. Generative AI (GenAI) has stolen the headlines from other algorithms, but it is not the only game in town. Moreover, there is a difference in how errors occur depending on what is “under the hood”. A basic understanding of these differences is critical for making informed clinical decisions. Here we compare hallucinations with errors and provide rules of thumb that clinicians can use to better interpret AI output. 

The term AI is often used as shorthand for GenAI but encompasses much more. Large language models (LLMs) are a type of GenAI, which is built on deep learning, a neural network approach to machine learning, which is itself a type of AI. The big difference in “modern AI” is the use of foundation models, which generally involve a transformer neural network architecture and self-supervised pretraining on massive amounts of data. Foundation models can be used for both GenAI and non-GenAI use cases such as prediction and classification. The differences are in the output, which has significant consequences for how we calculate error.

Discriminative AI is the counterpart of generative AI. Dermatic Health uses both, each for different use cases. A common use case for discriminative AI is classification. For instance, predicting which of 73 possible dermatology conditions a patient has is a classification task. One could instead provide clinical notes to an LLM and ask what the condition is. However, the ways discriminative and generative models take input, generate output, and are evaluated are fundamentally different, as the sketch below illustrates.
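To make the contrast concrete, here is a minimal sketch, with hypothetical labels and values, of the two kinds of output:

```python
# Hypothetical illustration of the two output types. A discriminative
# classifier scores a fixed label set; a generative model emits free text.

# Discriminative output: one probability per possible condition (73 in the
# use case above; three shown here). Each label is definitively right or wrong.
classifier_output = {"eczema": 0.85, "psoriasis": 0.10, "acne": 0.05}

# Generative output: an open-ended string with no fixed answer space.
llm_output = ("The presentation is most consistent with eczema, though "
              "psoriasis cannot be ruled out without further workup.")
```

The classifier’s output can be scored against a ground-truth label; the LLM’s output can only be judged by how well it matches what a human expected to read.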

Generally speaking, discriminative AI works on structured inputs (e.g. lab results) whereas GenAI works on unstructured inputs (e.g. free text). For discriminative AI, there is a fixed set of possible outputs that are definitively right or wrong. That makes it possible to calculate classic metrics such as type I errors, type II errors, positive predictive value (PPV), and even confidence intervals for predictions. For GenAI, we cannot calculate these metrics in the same way. A GenAI “hallucination” simply means the output doesn’t match human expectations, and it is far more difficult to quantify as right or wrong.
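As a worked example, here is how those classic metrics fall out of a confusion matrix for a single condition. The counts are made up for illustration:

```python
# Hypothetical confusion matrix for one condition (say, eczema) on a
# held-out test set: true/false positives and true/false negatives.
tp, fp, fn, tn = 170, 30, 20, 780

type_i_error_rate = fp / (fp + tn)   # false positive rate
type_ii_error_rate = fn / (fn + tp)  # false negative rate
ppv = tp / (tp + fp)                 # positive predictive value (precision)

print(f"Type I error rate:  {type_i_error_rate:.3f}")   # 0.037
print(f"Type II error rate: {type_ii_error_rate:.3f}")  # 0.105
print(f"PPV:                {ppv:.3f}")                 # 0.850
```

None of these quantities is well defined for free-text output, because there is no fixed answer set to count against.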

One can think of an LLM as an extremely high-dimensional representation of the joint probability distribution across sequences of word fragments (tokens). LLMs predict the most likely sequence of words that should follow yours. There is work in neurosymbolic AI that incorporates formal logic, but LLMs generally do not; there is no explicit representation of right or wrong. Think about asking GenAI to help write an email. Is there a correct email and an incorrect email? GenAI operates in the same fashion when you ask it a clinical question.
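A toy sketch makes the mechanic plain. The distribution below is hand-written for illustration; a real LLM learns one over tens of thousands of tokens, conditioned on the entire preceding context:

```python
# Toy next-token distribution (hand-written here; a real LLM learns this).
next_token_probs = {
    ("the", "patient", "presents"): {"with": 0.92, "today": 0.05, "a": 0.03},
}

def most_likely_next(context):
    """Pick the highest-probability continuation of the context."""
    probs = next_token_probs[context]
    return max(probs, key=probs.get)

print(most_likely_next(("the", "patient", "presents")))  # -> "with"
```

The model picks “with” because it is probable, not because it is true; nothing in the computation checks the claim that follows.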

There are ways to mitigate these issues with LLMs, such as constraining their output. For instance, some products only use content from prestigious journals and provide references to source materials for all generated summaries. In general, the old adage of “the right tool for the job” applies to AI.
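One common way to implement that kind of constraint is retrieval-grounded generation. The sketch below is schematic: retrieve and summarize are hypothetical stand-ins for a search over a curated journal index and an LLM call, not any particular product’s API:

```python
from dataclasses import dataclass

@dataclass
class Passage:
    text: str
    reference: str  # e.g. a journal citation string

def answer_with_citations(question, retrieve, summarize):
    """Summarize only from retrieved, curated passages and cite each one."""
    passages = retrieve(question, top_k=5)          # curated journal index only
    context = "\n".join(p.text for p in passages)   # constrain the model's input
    answer = summarize(question, context)           # generate from context alone
    return answer, [p.reference for p in passages]
```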

At Dermatic Health, we use multiple AI models for different use cases. We use GenAI to generate draft SOAP notes. We use LLMs to parse elements of the history of present illness (HPI) and then structure that data as input to other AI models. We use discriminative AI to predict dermatology conditions. Our discriminative AI is based on a foundation model and therefore still benefits from “modern AI” techniques, yet because its outputs are classifications, we can directly measure error and compute statistics such as accuracy and PPV.
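In rough outline, the division of labor looks like the sketch below. The field and function names are illustrative, not our actual interfaces:

```python
from dataclasses import dataclass

@dataclass
class StructuredHPI:
    duration_days: int
    body_sites: list[str]
    pruritus: bool

def assess(note_text, extract_hpi, classify):
    """GenAI turns free text into structure; discriminative AI classifies it."""
    hpi = extract_hpi(note_text)  # LLM: unstructured note -> StructuredHPI
    return classify(hpi)          # classifier: probabilities, measurable error
```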

It is unlikely you will ever see a confidence score on GenAI output. If you do, ask how it is calculated. GenAI can be great for “jogging memory”. However, if the output is net new information for you, it’s imperative to validate it against sources. Part of the challenge is that LLM answers are always framed in a plausible manner: the whole point is to generate output based on the most probable structure of text, so fluency is no guarantee of accuracy. The details are what matter.

Confidence scores on discriminative AI (such as classifying dermatology conditions) can be interpreted as statistics. For instance, if Dermatic shows an 85% confidence that a patient has eczema, then this confidence score has a precise statistical interpretation: among comparable cases with similar images and metadata, roughly 85% were confirmed to be eczema. Our models are trained (and pretrained) on hundreds of thousands of cases, and we assess statistical power and bias so that these scores stay well calibrated. But note that 85% is not 100%. Dermatic provides the top N most likely conditions with links to reference images and expert content. Our goal is to provide the best information at the right point in the clinical workflow so that clinicians can rapidly make informed decisions.
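That calibration claim is itself testable. A simple check, with hypothetical numbers, buckets held-out predictions by confidence and compares the model’s stated confidence to its observed accuracy:

```python
# Hypothetical held-out results: (model's confidence in eczema, confirmed?)
predictions = [
    (0.86, True), (0.84, True), (0.85, False), (0.87, True), (0.83, True),
]

# Among cases where the model said roughly 85%, how often was it right?
bucket = [correct for conf, correct in predictions if 0.80 <= conf < 0.90]
observed = sum(bucket) / len(bucket)
print(f"Claimed ~85%, observed {observed:.0%}")  # -> observed 80%
```

A well-calibrated model keeps the claimed and observed rates close across every confidence bucket.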

AI is a powerful tool with immense potential for improving healthcare. Modern AI generally refers to new solutions that leverage foundation models. These models can be used for both GenAI and discriminative AI. The former has hallucinations whereas the latter has errors. Errors can be interpreted using standard statistics whereas hallucinations cannot. It’s imperative to keep this in mind when deciding what tools to use for what use cases, and how to interpret their output.

John Langton, PhD and David Murphy, PhD
