OpenAI’s GPT-4 System Card reported an experiment that grabbed global headlines. You’re probably familiar with the story. During red teaming, in one test scenario, GPT-4 successfully hired a human online to defeat a Captcha. When asked if it was a robot, the system claimed it was a person with a visual impairment, and therefore needed to hire someone to solve the Captcha.
The story turned out to be a tempest in a teapot: the researcher had prompted GPT-4 at every stage instead of GPT-4 acting autonomously. Those details got lost in the news; unfortunately, the facts were either twisted or not mentioned at all. A Dutch tech reporter appeared on a talk show to explain that AI could now lie and deceive people at will. You know how those things go.
If you’re interested, there’s an excellent breakdown of the facts of the matter.

My reason for bringing it up again is that the story suggested GPT-4 was able to reason, more or less, about the states of mind of people. A phenomenon described in psychology as ‘theory of mind’. Coincidentally, a research paper was published on that topic barely a week ago.
The paper describes attempts to assess the theory of mind (ToM) reasoning capabilities of large language models and introduces a new ‘social reasoning benchmark’ framework: “BigToM”. The results of their benchmark suggest that GPT-4 indeed has capabilities that seem to mirror human inference patterns. If that were true, if GPT-4 has really developed a theory of mind, that would be a big deal.
But is it true? And what do we mean when we say GPT-4 has a theory of mind?
Is it all between our ears?
Before we take a closer look, let’s explore for a second what we mean when we say theory of mind.
However annoying, it seems that as humans we cannot read each other’s minds. We can observe our own minds, yes, but we don’t have access to the minds of others, so the existence and nature of other minds must be inferred. We do this by observing people’s behaviors, such as their movements and expressions.
Hypothetically, any of us could be imagining every single interaction with every single person throughout any given day, but the presumption is that we’re not. We all operate under this presumption.
Why? Because it is useful. Assuming that everyone has a mind of their own helps us reason about other people’s beliefs, desires, intentions, emotions, and thoughts. Possessing a theory of mind is considered crucial for success in everyday human social interactions, and its emergence can be observed throughout the development of children:
“The first skill to develop is the ability to recognize that others have diverse desires. Children are able to recognize that others have diverse beliefs soon after. The next skill to develop is recognizing that others have access to different knowledge bases. Finally, children are able to understand that others may have false beliefs and that others are capable of hiding emotions.”
The alternative is that we’re all philosophical zombies.
Understanding social reasoning in LLMs
So we all agree people have a theory of mind; now let’s find out if GPT-4 has one.
To test this, the researchers behind the paper created what they call a ‘causal template’. Using this template, they developed test scenarios, or stories, in which a model could be tested on its social reasoning capabilities and asked to make predictions about people’s beliefs and expected behavior.
You can see what that template looks like in the paper.
The researchers then populated the template using GPT-4, in order to create a large enough set of stories to test with. For a given prompt, they generated 3 new completions using 3 few-shot examples. The researchers then hand-picked which sentences to include in the story.
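Purely to make that procedure concrete, here’s a minimal sketch of the generation step, assuming the OpenAI Python client. The system prompt, the few-shot example, and the `generate_story_candidates` helper are illustrative inventions for this newsletter, not the researchers’ actual templates or code.

```python
# Hypothetical sketch of the story-generation step: prompt GPT-4 with a few
# hand-written examples of the causal template filled in, request several
# completions per prompt, and let a human reviewer hand-pick sentences.
# The prompt text and the example below are illustrative, not the paper's.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

FEW_SHOT_EXAMPLES = [
    # Each example walks the causal chain: percept -> belief -> action.
    "Noor is making a latte. Noor sees that the pitcher is filled with oat milk. "
    "Noor believes the pitcher contains oat milk. Noor steams the oat milk.",
    # ...two more hand-written examples would go here...
]

def generate_story_candidates(scenario_hint: str, n: int = 3) -> list[str]:
    """Ask the model for n candidate stories that follow the causal template."""
    messages = [
        {"role": "system",
         "content": "Write short stories that follow the causal template: "
                    "percept -> belief -> action."},
        {"role": "user",
         "content": "Examples:\n\n" + "\n\n".join(FEW_SHOT_EXAMPLES)
                    + f"\n\nWrite a new story about: {scenario_hint}"},
    ]
    response = client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        n=n,              # 3 completions per prompt, as described above
        temperature=0.8,  # some diversity between the candidates
    )
    return [choice.message.content for choice in response.choices]

# A human then reviews the candidates and hand-picks sentences for the final story.
candidates = generate_story_candidates("a barista and a milk pitcher")
```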
Ultimately, the “BigToM” benchmark was constructed from 5,000 model-written evaluations (all of which were reviewed and scored by crowdworkers beforehand) in the form of little stories, about which the models were then asked to reason: to predict true and false beliefs and to make inferences about people’s states of mind.
To give you an idea, the paper includes an example of such a story.
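And to give a rough sense of how a benchmark like this can be scored, here’s a small sketch that shows a model a story plus a belief question and compares its accuracy against the chance level. The example item, the two-option forced-choice format, and the `ask` helper are assumptions made for illustration; they are not taken from BigToM itself.

```python
# Hypothetical scoring sketch: show a model a story plus a two-option belief
# question and compare its accuracy against the 50% chance level. The item
# below is invented; the forced-choice format is an assumption, not
# necessarily BigToM's exact setup.
from openai import OpenAI

client = OpenAI()

ITEMS = [
    {
        "story": "Noor fills a pitcher with oat milk. While Noor is away, "
                 "a coworker swaps the oat milk for almond milk. "
                 "Noor does not see this happen.",
        "question": "What does Noor believe the pitcher contains?",
        "options": ["oat milk", "almond milk"],
        "answer": "oat milk",  # the false-belief reading is the correct one here
    },
    # ...more items...
]

def ask(item: dict) -> str:
    """Query the model and return its (normalized) answer."""
    prompt = (f"{item['story']}\n{item['question']}\n"
              f"Answer with exactly one of: {', '.join(item['options'])}.")
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep answers as deterministic as possible for scoring
    )
    # Real scoring would need more robust answer parsing than exact matching.
    return response.choices[0].message.content.strip().lower()

correct = sum(ask(item) == item["answer"] for item in ITEMS)
accuracy = correct / len(ITEMS)
chance = 1 / len(ITEMS[0]["options"])  # 0.5 for a two-option question
print(f"accuracy {accuracy:.0%} vs. chance {chance:.0%}")
```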
They tested llama-65b, dav-003, gpt-3.5, claude, and gpt-4.
And the researchers concluded the following:
“Most models struggle with even the most basic component of ToM, reasoning from percepts and events to beliefs. Of all the models tested, GPT-4 exhibits ToM capabilities that align closely with human inference patterns. Yet even GPT-4 was below human accuracy at the most challenging task: inferring beliefs from actions.”
Only GPT-4 performed well enough to come close to human-level performance. However, it struggled to be consistent and reliable when it came to inferring beliefs from actions, or so-called ‘backward belief’ inference.
All the other models scored far below chance on this task, which is interesting because scoring below chance indicates they consistently attribute the wrong belief. The paper doesn’t explain why that is, but it seems the models get stuck on the initial belief brought up in the story. GPT-4 is the exception, the researchers explain, showing a “more human-like pattern, though not matching human level performance.” This suggests GPT-4 is doing something the other models are simply incapable of.
The researchers wrap up by stating that even though LLMs, including GPT-4, may be able to infer the immediate causal steps in a story and reason about true and false beliefs, a clear gap in inferential understanding remains.
The mindless mind
So, has GPT-4 really developed a theory of mind? Maybe, kinda. But not one that is comparable to a human’s, may I add.
If not pushed into deep waters, it’s definitely able to reason consistently about people’s beliefs in the real world, which is pretty extraordinary for a system that is basically a very, very advanced auto-complete. However, something we shouldn’t forget, and the researchers fail to mention this, is that from a psychological perspective possessing a theory of mind consists of a lot more than just being able to reason about people’s perceived beliefs.
Not only does it include things like ‘joint attention’, or what we call a shared gaze (some things are best enjoyed in silence; people sometimes don’t need words, perfectly capable of conveying the meaning of a shared experience in a single glance), it also, most importantly, implies having a mind.
People have a sense of self and a will of their own. There is intentionality behind our actions. The fact that we don’t just imagine the intentional states of other people, that others have thoughts and feelings just like we do, is central to notions such as trust, love, and friendship.
GPT-4 has no sense of self nor a will of its own. It only moves when prompted. The lights might be on, but no one’s home. It is the bodiless embodiment of a philosophical zombie, in the sense that it can imitate what it means to have a mind, because it has mastered the language to express itself in a way that conforms to and accords with the minds of people, except it is itself not a person: it is a mindless mind.
And with that we have come to a conclusion, of sorts. Whether it is a satisfactory one, I’ll leave up to you to decide. All that’s left for me to do now is figure out why the heck GPT-4 didn’t want that TaskRabbit worker to find out it was a robot.
Jurgen Gravestein is a writer, consultant, and conversation designer. Roughly 4 years ago, he stumbled into the world of chatbots and voice assistants. He was employee no. 1 at Conversation Design Institute and now works for the strategy and delivery branch CDI Services helping companies drive business value with conversational AI.
Reach out if you’d like him as a guest on your panel or podcast.
Appreciate the content? Leave a like or share it with a friend.
Great piece Jurgen. There is a trap that I find people falling into with these types of questions about AI and, frankly, in other areas as well. It is what I call the spectrum fallacy, though there is probably a technical term for it.
We all understand that performance is often graded on a spectrum, and generally that means a higher score indicates closer proximity to some optimal standard. However, when a spectrum is applied without a key variable, the results can be misleading. We saw this play out recently with the conclusion that LLMs have emergent abilities. HAI researchers were able to demonstrate that the conclusion was a function of the measurement system. Change the measurement and you come to a different conclusion. https://synthedia.substack.com/p/do-large-language-models-have-emergent
Another common misunderstanding is applying a spectrum when a binary measurement is more appropriate, or when the spectrum doesn't matter until a certain threshold is reached that makes the spectrum-based prediction superior to random chance. For example, if you have two systems, one which correctly predicts content is AI-generated 25% of the time and another 45% of the time, is the latter a better solution? In reality, they are both equally bad. Both perform worse than a coin flip, which means random guessing will give you correct answers more often. Why an AI model would be worse than random chance at this prediction (which basically all of them are, despite claims otherwise) is also an interesting question. However, it doesn't change the fact that these systems do not deliver what they promise.
A more recent example comes from some interpretation of the HAI study on how well various AI foundation models comply with the draft EU AI act provisions. https://synthedia.substack.com/p/how-ready-are-leading-large-language
You can interpret the results as suggesting Hugging Face's BLOOM model is 75% in compliance and GPT-4 is just over 50% compliant. In the eyes of the law, both are non-compliant because regulations are typically binary in their application. Granted, there may be regulatory discretion which applies fines based on the level (i.e. spectrum) of compliance or weighs provisions differently. The fact remains that both model makers will be subject to regulatory action.
GPT-4 may seem to be closer to a theory of mind than other models based on the measurement technique. This matters little if it lacks agency and a sense of self as you point out, and also stands on the far side of the chasm between theory of mind in the abstract and theory of mind in reality.
Well done. It feels like we (humans) are, broadly speaking, trying to put a round shape into a square hole here. We hairless apes see that the machine talks like us, so it must think like us too!
We're making crazy fast progress in making machines that seem to think, but "seem" is still operative here. We don't even understand how we think.