I like this article. Your point is clearly presented, and the illustration with the addition example makes it easy to keep in mind. Thanks.
The interpretability question is an interesting one, actually. We like our machines to be interpretable so that we can tweak and fix them, but we never pose this question about the humans in our teams. Or, rather, we assume that humans are self-aware enough to know their own reasoning.
There's plenty of research suggesting that a lot of human reasoning is post hoc - first you just "know" what your favourite ice-cream flavour is, and then you build scaffolding to justify why you said it was vanilla. Quite an uncomfortable reality. I can't recommend Peter Watts's Blindsight enough.
There might come a time where we have to make do with uninterpretable systems - we already do in most ML applications, to be honest.
I still think LLMs are largely pointless, but what is interesting is that the biggest fans of LLMs and magical thinking will ignore this research, just as most people do not internalise the research pointing at their own lack of self-consciousness and agency - an uncomfortable cognitive dissonance. "But it clearly knows how to do algebra correctly, though!"
Humans in our team are not designed, though. A car is man-made; it was engineered by us. Large language models are also man-made, so we do not have to accept that this is just the way they are :) For me that is the most important distinction - and we should not confuse the two.
A car is engineered and every part was placed with purpose, turned the correct way and has a specific job.
What is the purpose of neuron 1367 out of 1 trillion? What happens if you double its incoming weights? A large NN is very far from engineered - the meta-structure is chosen and constructed, but the inside is then shaped by iterative processes far beyond our ability to interpret. I'd say that's not "engineered".
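To make that concrete, here is a toy sketch (the layer sizes and the unit index are made up purely for illustration and have nothing to do with any real model): doubling the incoming weights of one hidden unit in a small random MLP shifts the outputs, but nothing in the weights tells you what job, if any, that unit had.

```python
# Toy illustration only: a tiny random MLP, nothing like a real LLM.
import numpy as np

rng = np.random.default_rng(0)

# 8 inputs -> 16 hidden units -> 4 outputs
W1 = rng.normal(size=(16, 8))
b1 = rng.normal(size=16)
W2 = rng.normal(size=(4, 16))
b2 = rng.normal(size=4)

def mlp(x, W1, b1, W2, b2):
    h = np.maximum(0, W1 @ x + b1)  # ReLU hidden layer
    return W2 @ h + b2

x = rng.normal(size=8)
baseline = mlp(x, W1, b1, W2, b2)

# "Double the incoming weights" of hidden unit 7 (an arbitrary index)
W1_doubled = W1.copy()
W1_doubled[7, :] *= 2.0
perturbed = mlp(x, W1_doubled, b1, W2, b2)

print("baseline :", np.round(baseline, 3))
print("perturbed:", np.round(perturbed, 3))
# The outputs move, but the change doesn't map onto any designed "job"
# for unit 7 - unlike a bolt in a car, it was never assigned a purpose.
```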
Likewise with human teammates: we do design our teammates to the extent that we can. We make them wear company uniform, we make them go to policy training, we tweak our inputs into the teammate to modify the output: we offer them raises, we convince them to work on Saturday, we threaten them with unemployment. At some point in the future we will be able to subject their brain to magnetic fields to modify their willingness to work longer hours...
LLMs are too far from the piece-by-piece engineering of a car, but still not interesting or complex enough for comparisons with human brains.
Fair points, Ilia. All I'd say is that we've come up with this technology, so we are responsible for it. We can't just throw our hands in the air, say "neural networks are mysterious", and treat them like humans. I'd even go as far as to say we should actively refrain from these comparisons, because they let makers off the hook.
Oh yeah, absolutely. Definitely not humans, and we are definitely responsible. Strictly speaking we still have no idea how they work, but that's no excuse to shove them everywhere.
Deliberately trying to make them pass for human is exploitative and should be banned, to be honest.
Another great article, Jurgen! This confirms what I have been thinking (and telling colleagues for a while) and is quite to be expected by anyone with at least a moderate understanding of how LLMs work ... As you rightly write, those thinking that AGI will someday come out of such approaches are bound to be upset ...
<3
Great insight. I’m a story writer using ChatGPT as my tool and have just come to a similar conclusion. I have documented it on my Substack. Would be interested to know what others think.
Great read - I will share it with my network!
What a contrast from the kind of research OpenAI hypes in their "system cards", wherein alignment researchers have AIs roleplay being evil robots so they can read their "chains of thought" and determine if they're capable of deception.
Yes, good call out. Something I failed to mention in the piece. Indeed, the scratchpad methods of looking at the ‘thoughts’ of the AI to see if they deceive us are terribly frail.
Nice article.
“That’s odd! It seems like Claude has no access or insight into its own thinking”
Claude has no memory; it can only see its previous output, so we shouldn’t expect it to have any access or insight into its own thinking. Instead, it seems to make up a reason post hoc. As another commenter pointed out, humans actually do this too. In fact, some will argue that evolution left us essentially deceived about the true nature of our minds, and that real introspection is impossible. There are clear limits to that, though, and clear differences between us and LLMs.
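A rough sketch of what that statelessness looks like in practice (the generate() function below is a hypothetical stand-in for whatever LLM call is made, not a real API): each turn the model is re-run on the visible transcript, so the only trace of its earlier "reasoning" it can consult is the text it already produced.

```python
# Minimal sketch; generate() is a hypothetical placeholder, not a real API.

def generate(prompt: str) -> str:
    """Stand-in for an LLM call that turns a prompt into a completion."""
    return "...model output..."

transcript: list[str] = []

def ask(user_message: str) -> str:
    transcript.append(f"User: {user_message}")
    # The model is re-run from scratch on the concatenated text each turn;
    # no internal state from earlier turns is carried over.
    reply = generate("\n".join(transcript))
    transcript.append(f"Assistant: {reply}")
    return reply

ask("What is 17 + 25?")
ask("How did you work that out?")  # It can only look at its prior *text*,
                                   # not at the computation that produced it.
```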
“And just like the alchemists who searched for the Philosopher’s Stone in an attempt to unlock eternal life, technologists are convinced they can forge superintelligent minds out of data and compute. Geniuses in data centers.
I’m afraid all we will find is fool’s gold.”
On the contrary, my worry is precisely the opposite. We will build something that truly is gold and super-intelligent, but utterly unlike us. Where gold is involved humans get greedy, kill, and go to war. Plus, the machine we’ve built might decide it doesn’t need us.
"Instead, it seems to make up a reason post hoc. As one another commenter pointed out, humans actually do this too."
The problem with this statement is that it is just not true. You're putting the cart before the horse. People's actions are driven by motivations (although sometimes we may be unsure about what our own motivations are - not everyone is equally skilled at introspection), unlike computers, which do not possess emotions, desires, motivations, or a will of their own.
"The problem with this statement is that it is just not true. You're putting the cart before the horse."
I don't know if it is "just not true." There is a lot of empirical evidence for this.
___________________
Nisbett, R. E., & Wilson, T. D. (1977). Telling more than we can know: Verbal reports on mental processes. Psychological Review, 84(3), 231–259.
This is perhaps the most classic and foundational paper on the topic. Nisbett and Wilson reviewed numerous studies (including several of their own) demonstrating that people often cannot accurately report the cognitive processes underlying their judgments, choices, and behaviors. When asked why they made a certain choice or felt a certain way, participants readily provided explanations that seemed plausible but were demonstrably incorrect (i.e., they were unaware of the actual stimuli influencing them). They argued that people access implicit, a priori causal theories about why they might act or feel a certain way, rather than having direct introspective access to the actual processes.
____________________________
Gazzaniga, M. S. (e.g., 1985 book "The Social Brain"; 2000 paper "Cerebral specialization and interhemispheric communication: Does the corpus callosum enable the human condition?"). Research on Split-Brain Patients.
Gazzaniga's work with patients whose brain hemispheres were surgically disconnected provided dramatic evidence. Information could be presented to only one hemisphere. For example, the right hemisphere (which typically controls the left hand but has limited language capacity) might be shown an instruction like "Walk." The patient would stand up and start walking. When the verbal left hemisphere (unaware of the instruction) was asked why they were walking, the patient would confabulate a reason, such as "I'm going to get a Coke." Gazzaniga termed this the "left-brain interpreter" – a system that constantly seeks to create a coherent narrative or explanation for behavior, even when it lacks access to the true causes.
______________________________________
Johansson, P., Hall, L., Sikström, S., & Olsson, A. (2005). Failure to detect mismatches between intention and outcome in a simple decision task. Science, 310(5745), 116–119. (And subsequent work on "Choice Blindness").
This research paradigm demonstrates "Choice Blindness." Participants made choices (e.g., selecting which of two faces they found more attractive). Using sleight-of-hand, the experimenters sometimes showed participants the face they didn't choose and asked them to explain their (supposed) choice. Remarkably, participants often failed to notice the switch and readily provided detailed reasons for why they preferred the face they had actually rejected moments before. This highlights a lack of awareness of their own recent choices and a facility for generating post-hoc justifications.
_______________________________________
Wegner, D. M. (2002). The Illusion of Conscious Will. MIT Press.
Wegner argues that the subjective feeling of consciously willing an action is itself often an inference or construction, rather than a direct perception of causation. We experience will when our thought about an action precedes the action, is consistent with it, and seems to be the exclusive cause. He presents evidence suggesting this feeling can be mistaken – we can feel we willed something we didn't do, or not feel will for actions we did perform. This relates because if the feeling of why we did something (our conscious will) is sometimes an illusion constructed after the fact, it supports the idea that our explanations for behavior are not always based on privileged access to the true causal mechanisms.
_______________________________________
Wilson, T. D. (2002). Strangers to Ourselves: Discovering the Adaptive Unconscious. Harvard University Press.
This book provides a comprehensive overview of the research (including his own continued work after the 1977 paper) supporting the idea that much of our mental life occurs outside of conscious awareness (the "adaptive unconscious"). Wilson argues that introspection is often ineffective or even misleading for understanding our true feelings, motivations, and the causes of our behavior, and that we often rely on constructed narratives instead.
I’m not sure this disproves my point. If anything it confirms what I say: people are driven by motivations, desires, feelings (although we might not always articulate them accurately).
I said humans (sometimes) make up reasons for what they do post hoc. You said that is “just not true.” The studies I showed are explicitly about people making up reasons for their actions post hoc.
I think that’s orthogonal to the point that humans are driven by motivations, desires, feelings, etc.
Look, you started out by suggesting LLMs and humans are similar because they make up reasons after the fact. And I’m pointing out why I don’t like that “oh, but that’s exactly what humans do” type of reasoning.
To be clear, I don’t think they are much like humans at all. As far as making up reasons post hoc goes, this seems to be something both humans and LLMs do (albeit in different ways). The mechanisms and reasons for why that happens are almost certainly different for each of the two.
Here is what I originally said:
“As another commenter pointed out, humans actually do this too. In fact, some will argue that evolution left us essentially deceived about the true nature of our minds, and that real introspection is impossible. There are clear limits to that, though, and clear differences between us and LLMs.”
I’m definitely not suggesting here that LLMs and humans are similar because they make up reasons after the fact.
Nice, succinct explanation of something most do not understand about so-called chain-of-thought reasoning. This is not visibility into what large AI models are doing to arrive at outputs. This is just icing on the lumps of text being extruded.
Thanks Rob! Frankly, it took me quite some time to wrap my head around it myself.