LLMs Can Make Stuff Up
Most people who play around with ChatGPT or another LLM (and are not famous) have had the experience of asking about themselves and receiving convincing but wrong BS. Here’s what happens when I ask Llama 3 about myself:
>>> Tell me about Scott Gilmore, M.D. who trained at Christiana.
Scott Gilmore, MD, is an American orthopedic surgeon and researcher. He received his medical degree from Jefferson Medical College and completed his residency in orthopedic surgery at ChristianaCare Health System (then known as Christiana Healthcare System) in Wilmington, Delaware.
I’m not an orthopedic surgeon (I trained in EM), and I didn’t go to Jefferson (Wake Forest), etc. etc. …
What’s happening here?
Because the LLM is trained to predict the next token, it fills in the most statistically likely details wherever it hasn’t encoded actual knowledge from the data it has seen. The LLM is trained on petabytes of published material. Most of that material is written by humans who have something to say and know what they are talking about. So the LLM mimics its training data and “tries” to show that it has something to say and that it knows what it is talking about. Even if it doesn’t.
Not “Hallucinating,” but “Confabulating”
Colloquially, these errors are called “hallucinations,” but a recent paper by Farquhar et al. published in Nature 1 prefers the more precise term “confabulation.”
I learned the word “confabulation” in medical school, where it was used to describe psychiatric patients who, when asked about things they couldn’t remember but should remember, made up arbitrary details, while being quite convinced of the truth of those details. The term “hallucinations,” also borrowed from psychiatry, connotes seeing fantastical, impossible things, which is kind of the opposite of perceiving plausible but inaccurate details. So “confabulation” seems to be a better term to describe what is happening here.
Confabulation is obviously bad in two ways: not only are confabulations often arbitrary and wrong, but by definition they also sound plausible. An ideal AI chatbot would know when it was out of its depth and didn’t know something, and would restrain itself rather than duping the user with convincing falsehoods.
How to Detect Confabulation?
One obvious way to detect and warn the user about confabulations is to simply monitor the LLM output and sound an alarm whenever the LLM says something that is untrue. But there is an equally obvious problem: you would need to be able to fact-check all the LLM output in real-time against another source of truth. And how do we fact-check that system?
Farquhar et al. propose a method to monitor the LLM output for confabulation based on the concept of “semantic entropy.”
Here, “semantic” refers to the meaning of language, rather than the symbols (words, etc.) used to carry that meaning. This is a hugely important distinction because LLMs are trained to predict tokens, or specific chunks of language, rather than ideas or concepts. If an LLM correctly conveys a specific concept, it is because the specific string of words describing that concept was statistically likely to follow the model input.
“Entropy” is a somewhat intimidating word used in a variety of disciplines. It originated in thermodynamics, where it describes the number of states a system can occupy, which is related to the concept of how “random” a system has become. Entropy was adapted for use in information theory as “Shannon entropy:” 2
$$H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$$

where $H(X)$ is the entropy of a random system with $n$ possible states, and $p_i$ is the probability of the $i$-th possible state.
Consider a penny on a table that is heads up. What is the likelihood of heads? Well, it’s 100%. So the Shannon entropy is:

$$H = -(1 \cdot \log_2 1) = 0$$
Now consider the same penny spinning in mid-air as it flips. If it’s a fair coin, we expect “heads” and “tails” to be equally likely:

$$H = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1$$
So there is more entropy in the system of the coin in mid-air. This definition naturally generalizes to more complex systems with more possible states.
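To make this concrete, here’s a quick sketch in Python (my own illustration, not from the paper) that computes Shannon entropy for the two coin scenarios:

```python
import math

def shannon_entropy(probs):
    """Shannon entropy in bits: H = -sum(p * log2(p)), skipping zero-probability states."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Penny lying heads up on the table: one state with probability 1.
print(shannon_entropy([1.0]))        # 0.0 bits -- no randomness

# Fair penny spinning in mid-air: two equally likely states.
print(shannon_entropy([0.5, 0.5]))   # 1.0 bit -- maximum randomness for two states
```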
When an LLM answers a question, it chooses each token one at a time. It assigns probabilities to the set of possible next tokens and then (depending on the “temperature”) either picks the most likely one or samples from that distribution. If you can see “under the hood” of the LLM and know the probabilities the model assigns to each candidate token, you can calculate the entropy at each step. For example, if you ask:
ollama run llama3
>>> I flip a coin. Reply "heads" or "tails."
The “under-the-hood” next token probabilities might look like this:
TOKEN PROBABILITY
heads 0.49
tails 0.48
i can't do that dave 0.01
(This is a little imprecise, because tokens aren’t words or phrases, but that distinction isn’t important here.)
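The probabilities above are invented for illustration, but you can inspect real next-token probabilities yourself. Here’s a rough sketch using the Hugging Face transformers library; the model (gpt2) and the trailing “The answer is” cue are arbitrary choices for demonstration, not what the paper or the ollama examples above used:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder choice; any small causal LM works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = 'I flip a coin. Reply "heads" or "tails." The answer is'
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the very next token

# Convert logits to probabilities and show the top candidates.
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)
for p, idx in zip(top.values, top.indices):
    print(f"{tokenizer.decode([idx.item()])!r}  {p.item():.3f}")
```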
In the example of the known coin value, things would look like this:
ollama run llama3
>>> A coin is heads up. Is it heads or tails? Reply "heads" or "tails."
TOKEN PROBABILITY
heads 0.98
tails 0.01
what? 0.01
In this second case, there is much less “randomness” – the LLM will almost always answer “heads.” If you calculate the entropy, it will be close to zero. And why? Because “heads” is the right answer.
Put another way, if you’re making up arbitrary details, there are a lot of possibilities to choose from. But if you actually know what is true, and are trying your best to report it, you will be much more constrained in what you can say, and so your reply will be less random.
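Plugging the made-up token probabilities from the two tables above into the same entropy formula shows the gap (this repeats the helper from the coin example so the snippet stands on its own):

```python
import math

def shannon_entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Illustrative top-token probabilities from the two prompts above
# (any remaining probability mass falls on other, rarer tokens).
flip_prompt  = [0.49, 0.48, 0.01]   # "I flip a coin..." -- genuinely uncertain
known_prompt = [0.98, 0.01, 0.01]   # "A coin is heads up..." -- answer pinned down

print(shannon_entropy(flip_prompt))    # ~1.08 bits
print(shannon_entropy(known_prompt))   # ~0.16 bits
```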
Back to “Semantics”
So what’s with semantics? Can’t we just use the entropy calculations above to decide if the LLM knows what it’s talking about?
We can’t, because in human language there are many ways to say the same thing. In fact, the probable responses to the above question might have been:
TOKEN PROBABILITY
heads 0.60
it's heads 0.20
the answer is heads 0.15
obviously heads 0.05
If we calculate the entropy naively using these probabilities, it will look like the answer is quite random because the model could choose to say “heads” in many different ways. But as humans reading the response, we can see that all these answers are the same. More precisely, they have the same meaning, or are “semantically equivalent.”
Noticing this, Farquhar et al. propose a solution: create a system that looks at the LLM responses, counts them as identical if they have the same meaning, and then calculates the entropy over the possible meanings or concepts. That entropy should truly correlate with whether the response is “true,” or constrained by the facts, rather than “confabulated,” or unconstrained by the facts (i.e. “BS”).
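Here’s a rough sketch of that idea in Python. It is not the authors’ algorithm, and the semantically_equivalent check is a crude placeholder (one normalized string containing the other) standing in for the real meaning test described below:

```python
import math

def semantically_equivalent(a: str, b: str) -> bool:
    """Crude stand-in for a meaning check (my placeholder, not the paper's method).

    Farquhar et al. test bidirectional entailment with an NLI model or an LLM;
    here one normalized string simply has to contain the other, so the sketch
    runs on its own.
    """
    a, b = a.strip().lower().rstrip("."), b.strip().lower().rstrip(".")
    return a in b or b in a

def semantic_entropy(answers_with_probs):
    """Cluster answers by meaning, sum probability per cluster, take entropy over clusters."""
    clusters = []  # each entry: [representative answer, total probability]
    for answer, prob in answers_with_probs:
        for cluster in clusters:
            if semantically_equivalent(answer, cluster[0]):
                cluster[1] += prob
                break
        else:
            clusters.append([answer, prob])
    return -sum(p * math.log2(p) for _, p in clusters if p > 0)

responses = [("heads", 0.60), ("it's heads", 0.20),
             ("the answer is heads", 0.15), ("obviously heads", 0.05)]
print(semantic_entropy(responses))  # ~0 bits: all four phrasings collapse to one meaning
```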
In their paper, they claim that their methods based on semantic entropy are more accurate than previously described methods at identifying when an LLM doesn’t really know what it’s talking about.
Is This Practical?
But wouldn’t this system of checking LLM responses add massive additional layers of computation, rendering responses frustratingly slow?
In short, no. The first step, calculating the probabilities of the sequences of tokens, is already performed by the LLM in its normal operation. That layer just needs to be exposed to the fact-checking algorithm.
The second step, clustering the responses according to their meaning, can be done efficiently because it’s comparatively easy to determine whether two statements entail each other (i.e., if A is true, then B is also true, and vice versa), which is a good measure of semantic equivalence. In the paper, the authors used several methods, including one based on GPT-3.5, that performed similarly to humans at clustering answers according to semantic equivalence.
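As a sketch of what a bidirectional entailment check could look like, here’s one way to do it with an off-the-shelf NLI model from Hugging Face. The specific checkpoint and label handling are my assumptions, not necessarily the paper’s exact setup:

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# An off-the-shelf NLI model; this checkpoint is my choice for illustration.
nli_name = "microsoft/deberta-large-mnli"
nli_tokenizer = AutoTokenizer.from_pretrained(nli_name)
nli_model = AutoModelForSequenceClassification.from_pretrained(nli_name)

def entails(premise: str, hypothesis: str) -> bool:
    """True if the NLI model's top label for (premise, hypothesis) is entailment."""
    inputs = nli_tokenizer(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = nli_model(**inputs).logits
    label = nli_model.config.id2label[logits.argmax(dim=-1).item()]
    return "entail" in label.lower()

def same_meaning(a: str, b: str) -> bool:
    """Bidirectional entailment: a entails b and b entails a."""
    return entails(a, b) and entails(b, a)

print(same_meaning("It's heads.", "The answer is heads."))  # expected: True
```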
Is This Interesting?
Besides offering a clear and intuitive system for judging the reliability of LLM responses, this hints at the next frontier of AI models.
A common objection when anyone tries to say LLMs constitute artificial general intelligence is that these models are “just predicting the next token” and not doing any real thinking.
But if an AI system is trying to predict the next concept… well, that sounds a lot like how I wrote this article with my brain. Besides just resembling human thought, systems that manipulate ideas and concepts rather than chunks of text can potentially be faster, more compressible, and more reliable. Ultimately what we care about is the meaning of language, and advancements that hinge on the distinction between symbols and their meaning hint at bringing us closer to systems that truly emulate human cognition.
Farquhar, S., Kossen, J., Kuhn, L. et al. Detecting hallucinations in large language models using semantic entropy. Nature 630, 625–630 (2024). https://doi.org/10.1038/s41586-024-07421-0 ↩︎
Wikipedia contributors. (2024, June 7). Entropy (information theory). Wikipedia. https://en.wikipedia.org/wiki/Entropy_(information_theory) ↩︎