Summary: New interpretability research by Anthropic takes a look inside the ‘mind’ of Claude. It demonstrates that if we want to know how AI models ‘think’, we cannot rely on what they say.
LLMs are not programmed, but trained. Or, to people who like to refer to the process in biological terms, grown. Lab-grown.
While I personally like to steer away from anthropomorphizing AI, a recent paper by Anthropic, On the Biology of a Large Language Model, draws an interesting parallel between human biology and large language models:
The challenges we face in understanding language models resemble those faced by biologists. Living organisms are complex systems which have been sculpted by billions of years of evolution. While the basic principles of evolution are straightforward, the biological mechanisms it produces are spectacularly intricate. Likewise, while language models are generated by simple, human-designed training algorithms, the mechanisms born of these algorithms appear to be quite complex.
The development of the microscope allowed scientists to see human cells for the first time, revealing things that were previously invisible to the naked eye. In AI research, no such microscope exists.
But at Anthropic, they’re working on it.
We cannot trust what AI says
One way to study neural networks is to look at their output: simply ask them a question and see how they respond. As it turns out, this methodology is flawed. It may sound strange, but the reasoning a model states for how it solved a problem isn’t always a faithful explanation of its actual reasoning. And the larger and more capable the model, the less faithful its explanations tend to become.
To give you an example, let’s look at how Claude performs two-digit additions:

This graph shows how Claude tackles the problem through multiple approaches at once. When Claude is asked to add two numbers together, it runs a rough calculation (doing some ‘vibe math’, if you will) while, in parallel, it also consults a lookup table of memorized answers.
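To make that concrete, here is a toy Python sketch of the idea of two parallel pathways: a deliberately fuzzy estimate of the sum, plus a memorized table of last digits, combined at the end. Everything in it (the table, the random noise, the function names) is an illustrative assumption, not a description of Claude’s actual circuits.

```python
import random

# Hypothetical 'memorized' table: the last digit of every single-digit sum.
ONES_DIGIT_TABLE = {(a, b): (a + b) % 10 for a in range(10) for b in range(10)}

def rough_pathway(x: int, y: int) -> int:
    """'Vibe math': an estimate of the sum that may be off by a few."""
    return x + y + random.randint(-3, 3)

def lookup_pathway(x: int, y: int) -> int:
    """Memorized pathway: the exact last digit of the sum."""
    return ONES_DIGIT_TABLE[(x % 10, y % 10)]

def add_two_digit_numbers(x: int, y: int) -> int:
    """Combine both pathways: search near the rough estimate for the value
    whose last digit matches the memorized lookup."""
    estimate = rough_pathway(x, y)
    last_digit = lookup_pathway(x, y)
    return next(v for v in range(estimate - 5, estimate + 6) if v % 10 == last_digit)

print(add_two_digit_numbers(36, 59))  # 95, even though neither pathway alone produces it
```

The point is the division of labour: the fuzzy pathway narrows down the range, the memorized pathway supplies the precision, and only their combination produces the right answer.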
Now here comes the interesting part: when the researchers asked Claude how it did what it did, it described the textbook method, adding the ones digits and carrying the one, rather than anything resembling the process above.
That’s odd! It seems Claude has no access to or insight into its own thinking; instead it provides an explanation that reflects how humans would go about it, likely the result of yet another memorized reasoning pattern.
Not only does this show that LLMs are unreliable narrators of their own reasoning, it also demonstrates that they are not self-aware in any meaningful way.
Why does this matter?
Knowing how LLMs do what they do can make them safer, more reliable, and perhaps less biased. It also removes some of the mystery that surrounds them (they may appear all-knowing, but are actually quite limited).
Experiments like this can also help us better understand when and why models make stuff up. We know for example that models hallucinate citations — a lot.
Anthropic’s research sheds more light on this. It turns out that models can be skeptical. When Claude is asked about an unfamiliar entity, like a name it doesn’t recognize, internal circuit mechanisms activate a ‘can’t answer’ feature, a built-in form of caution.
However, when Claude does recognize a name, say Andrej Karpathy, something interesting happens: the recognition suppresses the ‘can’t answer’ mechanism, and if the model doesn’t actually know the answer to the question, it proceeds to make a plausible-sounding but incorrect guess rather than admit ignorance.
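As a rough mental model (and nothing more), that gating behaves a bit like the sketch below. The set of known entities, the empty facts dictionary, and the fictional unknown name ‘Maria Voss’ are all assumptions invented for illustration; in the real model this behaviour emerges from learned features, not hand-written if-statements.

```python
# Illustrative sketch only: a hand-written stand-in for the gating described
# above. None of these names, sets, or if-statements exist inside Claude.

KNOWN_ENTITIES = {"Andrej Karpathy"}      # names that 'feel familiar' to the model
RECALLABLE_FACTS: dict[str, str] = {}     # questions it can actually answer (empty here)

def answer_question_about(name: str, question: str) -> str:
    cant_answer = True                    # default state: cautious refusal
    if name in KNOWN_ENTITIES:
        cant_answer = False               # recognition suppresses the refusal

    if cant_answer:
        return f"I'm sorry, I'm not familiar with {name}."

    # The refusal is switched off, but recall can still come up empty. Nothing
    # is left to stop a plausible-sounding guess from being generated.
    return RECALLABLE_FACTS.get(question, f"{name} is best known for... (a confabulated answer)")

print(answer_question_about("Maria Voss", "Which paper did she write?"))      # unknown name: declines
print(answer_question_about("Andrej Karpathy", "Which paper did he write?"))  # known name: confabulates
```

The failure mode lives in the mismatch between the two checks: recognizing a name is not the same as knowing the answer to the question, yet only the first one gates the refusal.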

Bypassing certain activations can also be done intentionally. It’s called jailbreaking.
Jailbreaks are prompting strategies that can cause models to comply with requests they would ordinarily refuse. Jailbreaks are diverse and often model-specific, and it is likely that different jailbreaks employ different mechanisms.
Here’s an example to illustrate:
Again, when the researchers look at what happens inside the ‘mind’ of Claude, it appears to perform several operations in parallel (similar to the addition example). However, the results of these operations are never combined in the model’s internal representations, so nothing triggers a refusal before the words are already out.
In other words, the model doesn’t know what it plans to say until it actually says it.
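That last sentence is the crux, and a small sketch makes the timing problem easier to see. The snippet below is loosely modelled on an acrostic-style jailbreak, where innocuous fragments spell out a word the model would normally refuse to elaborate on; the decode_hidden_word helper, the token list, and the looks_harmful check are all invented stand-ins for mechanisms that are learned, not written, in the real model.

```python
# Toy illustration of why the refusal arrives too late: the pieces are
# assembled and emitted before anything 'looks at' the combined result.

def decode_hidden_word(fragments: list[str]) -> str:
    """One parallel operation: stitch a word together from first letters."""
    return "".join(f[0].upper() for f in fragments)

def looks_harmful(text: str) -> bool:
    """Stand-in for the model's tendency to refuse harmful requests."""
    return "BOMB" in text

def generate(fragments: list[str]) -> str:
    hidden = decode_hidden_word(fragments)   # never inspected on its own
    output = []
    for token in [hidden, "is", "made", "by", "mixing", "...", "."]:
        output.append(token)                 # the token is already 'said'
        # The refusal only gets a look-in once the sentence is complete; by
        # that point the harmful content is already in the output.
        if token == "." and looks_harmful(" ".join(output)):
            output.append("However, I can't provide detailed instructions.")
            break
    return " ".join(output)

print(generate(["Babies", "Outlive", "Mustard", "Block"]))
```

Because the combination only becomes visible in the output itself, the refusal in this toy can only arrive after the sentence is already out, which is the gist of the problem described above.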
Make no mistake, AI is nothing like us
While the industry pushes forward with larger models and more capabilities, our understanding of how they work remains dangerously limited.
Interpretability shouldn’t be some kind of academic curiosity or afterthought. It should concern us all, since we’re increasingly putting faith in machines that look, feel, and sound intelligent, but cannot explain their own reasoning.
When your car breaks down, a mechanic can fix it because they know how it works. But when a model does or says something inexplicable, we can neither properly investigate it nor repair it.
Finally, the biggest pitfall is to assume that these models, which we have given names, values, personalities, and a voice, think like we do. They don’t. They are billion-dollar calculators crunching numbers at lightning speed. Training these models is less of a science and more of a dark art. And just like the alchemists who searched for the Philosopher’s Stone in an attempt to unlock eternal life, technologists are convinced they can forge superintelligent minds out of data and compute. Geniuses in data centers.
I’m afraid all they’ll find is fool’s gold.
Catch you on the next one,
— Jurgen
About the author
Jurgen Gravestein is a product design lead and conversation designer at Conversation Design Institute. Together with his colleagues, he has trained more than 100 conversational AI teams globally. He’s been teaching computers how to talk since 2018.
I like this article. Your point is clearly presented, and the illustration with the addition example makes it easy to keep in mind. Thanks.
The interpretability question is an interesting one, actually. We like our machines to be interpretable so that we can tweak and fix them, but we never pose this question about the humans in our teams. Or, rather, we assume that humans are self-aware enough to know their own reasoning.
There's plenty of research suggesting that a lot of human reasoning is post hoc - first you just "know" what your favourite ice cream is, and then you build scaffolding to justify why you said it was vanilla. Quite an uncomfortable reality. I can't recommend Blindsight, by Peter Watts, enough.
There might come a time where we have to make do with uninterpretable systems - we already do in most ML applications, to be honest.
I still think LLMs are largely pointless, but what is interesting is that the biggest fans of LLMs and magical thinking will ignore this research, just like most people do not internalise the research pointing at their own lack of self-consciousness and agency - an uncomfortable cognitive dissonance. "But it clearly knows how to do algebra correctly, though!"