During the first decade of the 20th century, a horse named Hans drew worldwide attention as the first “thinking animal”. Hans solved arithmetic problems and performed other amazing feats, answering questions from audience members by tapping out numbers or letters with his hoof.
Apart from a few skeptics, experts were convinced that the horse was able to understand and reason like a human. And it certainly appeared that way.
Then Oskar Pfungst, a biologist and psychologist, found that the horse couldn’t answer a question if the person asking it didn’t know the answer. How peculiar! It turned out Hans was indeed clever, but not in the way people had thought.
The horse was in fact an outstanding observer. He had learned to read the body language and facial expressions of the people around him, and could tell when he had tapped, or was about to tap, the correct number or letter, after which he would receive a reward. To get it right, Clever Hans didn’t have to understand a word of what was being said. It was enough to fool the world into believing he was smart.
A modern parable
The year is 2023. We are much cleverer now. We have spaceships, smartphones, and wireless Internet. We would never fall for something like that today, would we?
If you haven’t caught on yet, the story of Clever Hans is surprisingly analogous to the way large language models trick us into believing they’re smart. It serves as a modern parable about a kind of gullibility that strikes even the most intelligent among us; in fact, the smarter you are, the stronger the effect.
Many experts are convinced LLMs are intelligent. Among them are prominent people with exceptional track records, like AI researcher Geoffrey Hinton, who pioneered the technology behind current systems such as ChatGPT. In an interview with 60 Minutes, he said: “we’re moving into a period when for the first time ever we may have things more intelligent than us”. Or take Blaise Agüera y Arcas, VP at Google Research, and Peter Norvig, former Director of Research at Google, who published a piece last week arguing that artificial general intelligence is already here.
These are not your average Joes, and I would never dare to challenge them if the evidence to the contrary weren’t so glaringly obvious. So, let’s channel our inner Oskar Pfungst and examine the evidence.
Truth by proximity
We know by now that large language models predict the likelihood of the next word in a sequence. Any response you get is a best guess based on statistical patterns in the training data. They achieve great accuracy because they are trained on huge amounts of text and optimized through reinforcement learning from human feedback (RLHF).
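To make that concrete, here is a minimal sketch of what ‘predicting the next word’ looks like in practice. It uses the small open-source GPT-2 model from the Hugging Face transformers library as a stand-in for far larger systems like ChatGPT; the model and prompt are my choices for illustration, not anything taken from the sources discussed here.

```python
# Minimal sketch of next-token prediction with a small open-source model (GPT-2).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "Clever Hans was a horse that could"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits        # shape: (1, sequence_length, vocab_size)

# The model's entire "answer" is a probability distribution over possible next tokens.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = next_token_probs.topk(5)
for prob, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(token_id.item())!r:>10}  p = {prob.item():.3f}")
```

Everything a chatbot produces is built by repeating this step: pick a likely next token, append it, and predict again.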
This ‘truth by proximity’ approach, which sits at the heart of the technology, has proven incredibly potent, but it also introduced the world to an entirely new phenomenon: hallucinations. Put simply, these models don’t know what they don’t know. They will ‘lie’ just as readily as they will ‘tell the truth’, which to this day remains an unsolved problem.
It turns out LLMs can’t really reason, either. They can approximate something that looks like reasoning, but the process is flawed. This is best illustrated by a recent paper, The Reversal Curse, which shows that if a model is trained on “A is B”, it will not automatically generalize that “B is A”.
The paper’s abstract reads:
“For instance, if a model is trained on “Olaf Scholz was the ninth Chancellor of Germany”, it will not automatically be able to answer the question, “Who was the ninth Chancellor of Germany?”. Moreover, the likelihood of the correct answer (“Olaf Scholz”) will not be higher than for a random name. Thus, models exhibit a basic failure of logical deduction and do not generalize a prevalent pattern in their training set (i.e. if “A is B” occurs, “B is A” is more likely to occur).”
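For the curious, here is a rough sketch of how one might probe for that asymmetry: score how likely a model finds a candidate answer after a forward-style prompt versus a reversed-style one. It again uses the small GPT-2 model as a convenient stand-in and does not reproduce the paper’s controlled fine-tuning setup, so treat it as an illustration of the probing method rather than of the paper’s result.

```python
# Rough sketch: compare the log-likelihood a model assigns to a completion
# after a forward-style prompt vs. a reversed-style prompt.
# GPT-2 stands in for larger models; the prompts below are my own toy examples.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def answer_logprob(prompt: str, answer: str) -> float:
    """Sum of log-probabilities the model assigns to `answer` following `prompt`."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position t predict the token at position t + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    return sum(
        log_probs[pos, full_ids[0, pos + 1]].item()
        for pos in range(prompt_len - 1, full_ids.shape[1] - 1)
    )

forward = answer_logprob("Olaf Scholz was", " the ninth Chancellor of Germany")
reverse = answer_logprob("The ninth Chancellor of Germany was", " Olaf Scholz")
print(f"forward direction: {forward:.2f}   reversed direction: {reverse:.2f}")
```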
A third example, brought to my attention by Gary Marcus, involves math. In a recent post, he shared a table from a paper assessing multi-digit arithmetic in LLMs. It’s something they can’t seem to do, but that doesn’t stop them from happily providing us with wrong answers as if they knew. As a frame of reference, Marcus was kind enough to add a column for ‘Calculator’, to which I was kind enough to add another for our famous horse, Clever Hans.
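You can see the effect for yourself with a quick sketch like the one below: ask a language model for multi-digit sums and grade its answers against exact arithmetic, which is all the ‘Calculator’ column really amounts to. I’m again using the small GPT-2 model as a stand-in for the systems in Marcus’s table, so the exact accuracy will differ, but the pattern of confident wrong answers is easy to reproduce.

```python
# Sketch: grade a language model's multi-digit addition against exact arithmetic.
# GPT-2 is a small stand-in here; larger models do better, but accuracy still
# degrades as the number of digits grows.
import random
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def model_sum(a: int, b: int) -> str:
    """Ask the model for a + b and pull the first number out of its reply."""
    prompt = f"Q: What is {a} + {b}?\nA:"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    out = model.generate(ids, max_new_tokens=12, do_sample=False,
                         pad_token_id=tokenizer.eos_token_id)
    completion = tokenizer.decode(out[0, ids.shape[1]:])
    match = re.search(r"-?\d[\d,]*", completion)
    return match.group().replace(",", "") if match else completion.strip()

random.seed(0)
trials, correct = 20, 0
for _ in range(trials):
    a, b = random.randint(10_000, 99_999), random.randint(10_000, 99_999)
    correct += model_sum(a, b) == str(a + b)   # the 'calculator' is just Python itself
print(f"5-digit addition accuracy: {correct}/{trials}")
```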
Jokes aside, the point is not that LLMs aren’t useful or competent; in many ways, they are. What they aren’t, however, is smart. And we certainly haven’t reached artificial general intelligence, by any stretch of the imagination.
The horse and the algorithm
It’s funny, actually, how similar the behavior of LLMs is to that of Clever Hans. The horse learned, through practice, to read body language and micro-expressions. LLMs are great observers too, except what they read is language itself.
Just like Clever Hans, these models can give off the impression that they understand and reason the way we do. They appear smart even to the cleverest among us, because the more you know, the more impressed you are by what they get right.
LLMs are also incentivized with rewards. Anybody who knows a thing or two about RLHF knows that the process is designed to make models more ‘aligned’ with ‘human preferences’. In simple terms, LLMs are rewarded for appearing smart, just like Clever Hans was.
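For those who want the nuts and bolts: the reward-modeling step at the core of RLHF typically trains a scoring model on pairs of responses that human raters have ranked, using a standard pairwise (Bradley-Terry style) loss. The sketch below shows that loss in isolation; the details vary between labs, so take it as the textbook formulation rather than any particular model’s recipe. Notice that nothing in the objective checks whether an answer is true, only which answer the rater preferred.

```python
# Sketch of the pairwise preference loss used to train reward models in RLHF:
# the reward model should score the human-preferred ("chosen") response higher
# than the rejected one. Textbook form, not any specific lab's exact recipe.
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy example: scalar reward scores for three (chosen, rejected) response pairs.
r_chosen = torch.tensor([1.8, 0.4, 2.1])
r_rejected = torch.tensor([0.9, 0.7, 1.5])
print(preference_loss(r_chosen, r_rejected))  # smaller when chosen outscores rejected
```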
Now, you might think: does it really matter? Sure, these machines operate and ‘think’ differently than we do, but who cares if the internal processes are different? If it gets it right, it gets it right.
To that I say: even if machines in the coming decade reach a level of posturing that makes it impossible for most people to distinguish it from real intelligence, it still wouldn’t make them smart. It would be disingenuous to lower the bar of what it means to think and feel just because there is a linguistically fluent machine that can convince us it is doing either of those things.
We don’t call calculators smart, either. And no serious person believes horses speak our language and can perform arithmetic just because they can tap their hooves at the correct number or letter.
PS. It looks like AI hasn’t figured out how many legs a horse has 👀
Join the conversation 💬
Leave a comment with your thoughts. Would you like to have a horse of your own that performs arithmetic in front of a big audience?
I think the central argument here is that machine learning is very good at exploiting *correlates* of understanding. But of course, as Clever Hans demonstrates, correlation is not (always) causation, and mere correlates of understanding are not the same thing as genuine understanding. I agree with these statements, but I disagree with other claims in the article, as follows:
1. Although it might turn out to be right, your conclusion that LLMs are not "smart" does not automatically follow from the assertions above. The Clever Hans experiments do *not* demonstrate that horses can't count, or that behavioral experiments can't be designed to test whether horses can count. But they do caution us that we need to be very careful in designing these experiments.
2. I disagree with the claim about calculators; it is important to acknowledge that calculators are "smart" in a way that Clever Hans is not: calculators do *not* exploit mere correlates of correct arithmetic.
3. Finally, I strongly disagree that if an artificial system exhibits genuine capability across a wide variety of out-of-sample tasks, then we should refuse to acknowledge it as genuinely smart on mere theoretical grounds.
On the first point, I agree that in many cases LLMs are very easily confused by non-semantic features of the prompt. However, it does not automatically follow from this that LLMs lack any genuine ability to understand. A more nuanced reading of the current evidence comes from McCoy et al. (2023): "We remain agnostic about whether LLMs truly capture meaning or only capture other properties that correlate with it; what we believe is clear is that meaning-sensitive tasks do not come naturally to systems trained solely on textual input, such that we can expect LLMs to encounter difficulty in handling these tasks".
On the second point, re calculators: one explanation of why AGI is hard is that general intelligence consists of the ability to flexibly coordinate the activity of many tens of thousands of unknown cognitive abilities, or "modules", on unseen tasks. The evidence from human subjects shows that there is no single factor, or simple linear combination of a few factors, that explains our ability to solve problems or be creative. This is why racist arguments about IQ are flawed, and why IQ scores remain controversial as predictors of ability; the idea that we can measure general intelligence is a statistical myth (http://bactra.org/weblog/523.html). It is also why the AI doomsters' arguments about an intelligence explosion are silly (https://www.overcomingbias.com/p/30855html).

But any scientific explanation of general intelligence still has to formulate intelligence in terms of simpler, dumber information-processing components, e.g. neurons or calculators. Yes, calculators do not by themselves exhibit "general intelligence" (whatever that might turn out to be), but they *do* genuinely capture a subset of logical inference about numbers, in a way that Clever Hans does not, and this is likely to be a small but important component of general cognitive ability.

Ultimately, intelligence is software, and as Robin Hanson puts it: "Overall we so far just aren’t very good at writing software to compete with the rich well-honed modules in human brains. And we are bad at making software to make more software. But computer hardware gets cheaper, software libraries grow, and we learn more tricks for making better software. Over time, software will get better. And in centuries, it may rival human abilities" (https://www.overcomingbias.com/p/30855html). Moreover, I suspect one big problem in general intelligence is the ability to flexibly coordinate the activity of these different software modules. The multi-modal ability of recent AI systems may be an important (though perhaps not complete) advance towards this goal (https://www.noemamag.com/artificial-general-intelligence-is-already-here/).
On the third point, if one insists on denying "genuine" intelligence to future systems (not current LLMs) that have a genuine, empirically proven ability to solve a wide variety of unseen problems, controlling for the Clever Hans effect, then this would be a highly unethical form of carbon chauvinism. We do not yet have an uncontroversial, well-validated theory of general cognition (https://www.nature.com/articles/s41562-019-0626-2), so to denigrate a capable and articulate being by accusing it of lacking sentience on theoretical, as opposed to empirical, grounds would be indefensible. Regardless of your stance on philosophical zombies, if it is empirically capable, then morally we should give future AI the benefit of the doubt.
McCoy, R. Thomas, et al. "Embers of autoregression: Understanding large language models through the problem they are trained to solve." arXiv preprint arXiv:2309.13638 (2023).
There are those who really *want* to believe in Hans' intellect. There are others who really *need* to believe in Hans' intellect. And then there are realists.