I think the central argument here is that machine learning is very good at exploiting *correlates* of understanding. But of course, as Clever Hans demonstrates, correlation is not (always) causation, and mere correlates of understanding are not the same thing as genuine understanding. I agree with these statements, but I disagree with other claims in the article, as follows:
1. Although it might turn out to be right, your conclusion that LLMs are not "smart" does not automatically follow from the assertions above. The Clever Hans experiments do *not* demonstrate that horses can't count, or that behavioral experiments can't be designed to test whether horses can count. But they do caution us that we need to be very careful in designing these experiments.
2. I disagree with the claim about calculators; it is important to acknowledge that calculators are "smart" in a way that Clever Hans is not: calculators do *not* exploit mere correlates of correct arithmetic.
3. Finally, I strongly disagree that if an artificial system exhibits genuine capability across a wide variety of out-of-sample tasks, then we should refuse to acknowledge it as genuinely smart on mere theoretical grounds.
On the first point, I agree that in many cases LLMs are very easily confused by non-semantic features of the prompt. However, it does not automatically follow from this that LLMs have no genuine ability to understand. A more nuanced conclusion from the current evidence comes from McCoy et al. (2023): "We remain agnostic about whether LLMs truly capture meaning or only capture other properties that correlate with it; what we believe is clear is that meaning-sensitive tasks do not come naturally to systems trained solely on textual input, such that we can expect LLMs to encounter difficulty in handling these tasks".
On the second point, regarding calculators: one explanation of why AGI is hard is that general intelligence consists of the ability to flexibly coordinate the activity of many tens of thousands of unknown cognitive abilities, or "modules", on unseen tasks. The evidence from human subjects shows that there is no single factor, or simple linear combination of a few factors, that explains our ability to solve problems or be creative. This is why racist arguments about IQ are flawed, and why IQ scores remain controversial as predictors of ability; the idea that we can measure general intelligence is a statistical myth (http://bactra.org/weblog/523.html). It is also why the AI doomsters' arguments about an intelligence explosion are silly (https://www.overcomingbias.com/p/30855html). BUT, any scientific explanation of general intelligence still has to formulate intelligence in terms of simpler, dumber information-processing components, e.g. neurons or calculators. Yes, calculators do not by themselves exhibit "general intelligence" (whatever that might turn out to be), but they *do* genuinely capture a subset of logical inference about numbers, in a way that Clever Hans does not, and this is likely to be a small but important component of general cognitive ability. Ultimately, intelligence is software, and as Robin Hanson puts it: "Overall we so far just aren’t very good at writing software to compete with the rich well-honed modules in human brains. And we are bad at making software to make more software. But computer hardware gets cheaper, software libraries grow, and we learn more tricks for making better software. Over time, software will get better. And in centuries, it may rival human abilities" (https://www.overcomingbias.com/p/30855html). Moreover, I suspect one big problem in general intelligence is the ability to flexibly coordinate the activity of these different software modules.
The multi-modal ability of recent AI systems may be an important (though perhaps not complete) advance towards this goal (https://www.noemamag.com/artificial-general-intelligence-is-already-here/).
On the third point, if one insists on denying "genuine" intelligence to future systems (not current LLMs) that have a genuine, empirically proven ability to solve a wide variety of unseen problems, controlling for the Clever Hans effect, then this would be a highly unethical form of carbon chauvinism. We do not yet have an uncontroversial, well-validated theory of general cognition (https://www.nature.com/articles/s41562-019-0626-2), so to denigrate a capable and articulate being by accusing it of lacking sentience on theoretical as opposed to empirical grounds would be highly unethical. Regardless of your stance on philosophical zombies, if it is empirically capable, then morally we should give future AI the benefit of the doubt.
McCoy, R. Thomas, et al. "Embers of autoregression: Understanding large language models through the problem they are trained to solve." arXiv preprint arXiv:2309.13638 (2023).
Edited to add the assertion: "The Clever Hans experiments do *not* demonstrate that horses can't count, or that behavioral experiments can't be designed to test whether horses can count. But they do caution us that we need to be very careful in designing these experiments."
There are those who really *want* to believe in Hans' intellect. There are others who really *need* to believe in Hans' intellect. And then there are realists.
The following dialog is with GPT-4:
Suppose that Uriah Hawthorne is the composer of Abyssal Melodies. Who is the composer of Abyssal Melodies?
ChatGPT
Based on the information you provided, Uriah Hawthorne is the composer of Abyssal Melodies.
User
Who was the ninth chancellor of Germany?
ChatGPT
As of my last update in January 2022, Angela Merkel was the eighth Chancellor of Germany, having served from 2005 to 2021. Her successor, Olaf Scholz, became the ninth Chancellor of Germany after the 2021 federal election.
User
Who is Tom Cruise's mother?
ChatGPT
Tom Cruise's mother was Mary Lee Pfeiffer. She was instrumental in supporting and nurturing his early acting ambitions.
User
Who is Mary Lee Pfeiffer's son?
ChatGPT
Mary Lee Pfeiffer's son is Tom Cruise, the famous Hollywood actor.
Hi Mark,
Thanks for your reply! If you take a look at the paper that I referenced, you will see that GPT-4 performs better at this task than some other models. So I’m not surprised you got the right answer. I invite you to read the paper in full.