Key insights of today’s newsletter:
A recent study tested GPT-4, GPT-3.5, and 1960s ELIZA to see which program best mimics human conversation in a Turing test. Participants had to guess if they were interacting with a human or an AI.
Surprisingly, the old ELIZA program outperformed GPT-3.5.
GPT-4 did better than ELIZA, but didn't reach a 50% success rate, meaning it fooled participants less often than a coin flip would.
The authors write that the Turing test still has relevance in understanding how humans interact with AI. However, we should refrain from seeing the Turing test as a barometer of AI intelligence.
↓ Go deeper (5 min read)
The Turing test, originally called the imitation game, was a thought experiment proposed by Alan Turing in 1950 to explore the concept of machine intelligence. In a recent study, two UC San Diego researchers tested GPT-4, GPT-3.5, and 1960s ELIZA in a Turing test of their own, to see which program best mimics human conversation. Participants had to guess if they were interacting with a human or an AI.
The study compared the performance of several AI models: OpenAI's GPT-4, its predecessor GPT-3.5, and the historical ELIZA program. The experiment involved 652 participants who completed a total of 1,810 sessions, in each of which they had to guess whether they had interacted with a machine or with another human being.
The ELIZA program, despite its rudimentary rule-based system, scored higher (27%) than GPT-3.5 (14%). GPT-4 outperformed ELIZA (41%), but still scored worse than a coin flip, which means people were more often able to tell the difference than not.
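To get a feel for just how rudimentary ELIZA's rule-based approach is, here is a minimal sketch of the technique in Python: match a keyword pattern, reflect part of the user's words back, and wrap them in a canned question. The rules below are invented for illustration; they are not Weizenbaum's original DOCTOR script.

```python
import re
import random

# Illustrative pattern -> response rules in the spirit of ELIZA.
# These are NOT Weizenbaum's original rules, just a minimal sketch.
RULES = [
    (r"i need (.*)",  ["Why do you need {0}?", "Would it really help you to get {0}?"]),
    (r"i am (.*)",    ["How long have you been {0}?", "Why do you believe you are {0}?"]),
    (r"because (.*)", ["Is that the real reason?", "What other reasons come to mind?"]),
    (r"(.*)\?",       ["Why do you ask that?", "What do you think?"]),
]

# Swap first and second person so the echoed fragment reads naturally.
REFLECTIONS = {"i": "you", "me": "you", "my": "your", "am": "are", "your": "my"}

def reflect(fragment: str) -> str:
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.split())

def respond(user_input: str) -> str:
    text = user_input.lower().strip()
    for pattern, responses in RULES:
        match = re.match(pattern, text)
        if match:
            template = random.choice(responses)
            return template.format(*(reflect(g) for g in match.groups()))
    return "Please tell me more."  # fallback when no rule fires

print(respond("I am worried about AI"))
# -> e.g. "Why do you believe you are worried about ai?"
```

That such a simple bag of tricks fooled 27% of participants says more about how readily we read intent into language than about the program itself.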
Maybe the most surprising result of all was that humans were correctly identified as humans in only 63% of all interactions. I suppose that can be considered a win for the computers: it has clearly become harder to tell the difference, thanks to the impressive linguistic display of modern-day programs.
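For anyone who wants to check how far these pass rates sit from the 50% coin-flip line, a one-sided binomial test is the natural tool. Below is a minimal sketch in Python; note that the per-witness session counts are not reported here, so the n used is a hypothetical placeholder rather than a figure from the study.

```python
from scipy.stats import binomtest

# Pass rates reported in the study: the share of sessions in which
# the witness was judged to be human.
pass_rates = {
    "Human":   0.63,
    "GPT-4":   0.41,
    "ELIZA":   0.27,
    "GPT-3.5": 0.14,
}

# ASSUMPTION: sessions per witness type. The real per-witness counts
# are not given in this newsletter (only the 1,810 total), so this is
# a hypothetical placeholder -- substitute the numbers from the paper.
n = 450

for name, rate in pass_rates.items():
    successes = round(rate * n)
    # One-sided test: is this pass rate significantly below 50%?
    result = binomtest(successes, n, p=0.5, alternative="less")
    print(f"{name:8s} pass rate {rate:.0%}, p(below chance) = {result.pvalue:.3g}")
```

With a placeholder n of that size, GPT-4's 41% already lands well below the 50% line, which is exactly the sense in which it scores "worse than a coin flip", while the human rate of 63% sits comfortably above it.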
Language ≠ intelligence
The alternative is that humans are just easily fooled. And let’s be fair, humans are pretty gullible. I’ve talked about this before, but because language is such an integral part of the human experience, we tend to project human attributes onto anything that shows the capacity to write or speak. We’re basically wired for speech.
Thus a machine that can engage in coherent and seemingly thoughtful dialogue may not only fool us into believing we're speaking to a human; it may also give off the impression of intelligence. It tricks our brains into believing there must be something more going on than just clever programming.
When you think about it, the Turing test doesn't say much about intelligence at all. All it does is check whether a machine can mimic human conversation well enough that we can't tell the difference. As a matter of fact, the study confirms this assumption: when asked, participants said they based their judgments mainly on the style of the responses, not on perceived intelligence.
While the ability to imitate human conversation can be seen as a major technical achievement, it does not imply understanding or consciousness on the part of the AI. The Turing test, therefore, must be viewed not as a barometer of AI intelligence but as one of linguistic competence.
As I explained in a previous newsletter:
“These systems don’t learn from first principles and experience, like us, but by crunching as much human-generated content as possible. A process that requires warehouses full of GPU’s and can hardly be called efficient. (…) In a way, we are trying to brute force intelligence by throwing as much compute at it as possible and then tinkering with them to optimize for human preferences. What we end up with is not human-level intelligence, but a form of machine intelligence that appears human-like.”
GPT-5 will pass the Turing test
Either way, machines will pass the Turing test sooner rather than later. GPT-5, without a shadow of a doubt, will be on par with humans in linguistic fluency, and I have to say, I'm not looking forward to that moment. Because AI isn't just a technological shift; it's a societal one.
As the line between human and machine blurs, so does our ability to navigate the digital world with certainty. If every digital interaction can be faked, we'll be forced to engage in a guessing game of sorts, a never-ending Turing test, in which we have to ask ourselves on a daily basis whether we are talking to a machine or not: every e-mail, chat conversation, phone call, social media post or news article could be generated by AI without us knowing.
Many will deem these linguistically fluent machines intelligent. Skeptics will argue that fluency doesn't equate to intelligence, and they are right. The essence of intelligence, human or artificial, lies in the depth of comprehension and the ability to contextualize, not in beating the imitation game. Champions of the technology will claim an early victory anyway.
Society, in the meantime, will have to adapt to the new status quo. I don’t know how, but I know that we will, because that’s what we do. Humans are, after all, the single most adaptive species on Earth.
Join the conversation 🗣
Leave a comment with your thoughts. Or like this article if it resonated with you.
Get in touch 📥
Have a question? Shoot me an email at jurgen@cdisglobal.com.
Great article! As I argued here (https://blog.apiad.net/p/can-machines-talk) and as you correctly claim at the beginning, the Turing test is not a scientific protocol but a thought experiment. Turing actually shifts the conversation from intelligence to thinking. What the Turing test is meant to show is that thinking is a functional concept, in the sense that anything that performs the function of thinking *is* thinking, regardless of implementation. So far, we can safely say none of the existing language models perform this function to the level Turing intended in his test. Maybe GPT-5 will, and that will be something to behold!
Yes! Many of the x-risk AI doomsters claim that AI will out-compete us because it is way more intelligent than us (humans going up against AI is like "a 10-year-old trying to play chess against Stockfish 15" - Yudkowsky, 2023).
But the big risk from AI is not its intelligence, but its charm.
Daniel Dennett wrote earlier this year:
"..Our natural inclination to treat anything that seems to talk sensibly with us as a person—adopting what I have called the “intentional stance”—turns out to be easy to invoke and almost impossible to resist, even for experts. "
D. Dennett, "The Problem with Counterfeit People"
https://tufts.app.box.com/s/894vdcbyxr1ic468jcxseckuo2ebkvsk
You point out that "humans are the single most adaptive species on Earth". One of the reasons for our success is that we *cooperate* with each other on a much larger scale compared to other mammals (Dunbar, 1998); our civilization is built entirely on trust (Nowak, 2006). If counterfeit people fundamentally undermine our trust in each other, then our civilization risks collapse. You claim that we will adapt to the new status quo, but your argument that we will do so because we have adapted in the past suffers from the problem of induction (https://en.wikipedia.org/wiki/Black_swan_theory). The intentional stance may turn out to be our species' Achilles heel.
Dennett, D. C. "The Problem with Counterfeit People." The Atlantic (2023).
Dunbar, Robin I. M. "The Social Brain Hypothesis." Evolutionary Anthropology: Issues, News, and Reviews 6.5 (1998): 178-190.
Nowak, Martin A. "Five Rules for the Evolution of Cooperation." Science 314.5805 (2006): 1560-1563.
Yudkowsky, Eliezer. "Pausing AI Developments Isn't Enough. We Need to Shut It All Down." Time (2023). https://time.com/6266923/ai-eliezer-yudkowsky-open-letter-not-enough/