Nice post. I especially appreciate the pointer to Luciano Floridi, who I have not encountered.
I'm with you on the reluctance to anthropomorphize AI, but I'm not following your point that "this isnβt a helpful frame most of the time." It seems to me that we want to keep in mind that when an LLM confabulates they are not lying because lying requires an intention to deceive.
When dealing with humans, sussing out intentions is helpful. When the Jordan Petersons and Andrew Hubermans of this world say things that are untrue, they are doing so to acquire status and prestige. They are confabulating (if they are unaware that they are speaking untruth) or lying (if they know it to be untrue) with the intention of pleasing and impressing their audience. Understanding intentions helps evaluate human statements.
When an LLM generates an untruth, it is doing the same thing it does when it generates a true statement: attempting to provide a satisfying answer. It has no intention, but the goal it has been given has not been changed. Treating it as if it has intentions will mislead us. What am I missing?
If we approach this type of behavior from an outcome-perspective and it sorts the same effect as lying, then from a pragmatic standpoint it is.
The weirdness sits in the fact that we, as humans, are invested in the lies we tell. Whereas with the LLM, it is the product of statistical next-token prediction and devoid of intentionality in the human sense of the word.
The Apollo Research video seems nonsensical to me, I wonder if I'm missing something. It shows ChatGPT's "hidden reasoning", as though such a thing exists. As far as I can tell, they prompt ChatGPT to write some text that it labels "chat", and other text that it labels "hidden". But ChatGPT does not introspect. When prompted to explain its reasoning, it does the only thing it ever does and generates strings of tokens pulled from probability distributions.
That the strings labeled "hidden" look like the inner thoughts of a deceptive agent isn't all that surprising, given that the Apollo team told ChatGPT to role play being an AI agent and then created a stereotypical insider trading scenario.
All of the research I've seen on "AI deception" is basically this. The researchers tell an LLM to pretend it's an evil robot, the LLM pretends it's an evil robot, and another paper about how LLMs could turn into evil robots gets plopped onto arxiv. This one was about insider trading. If they told ChatGPT to roleplay being a narcissist company manager with a crush on an intern, it might well generate "hidden" plans for sexual harassment. If they told ChatGPT it was planning a surprise birthday party, its "hidden thoughts" would likely involve decieving the recipient about what they were being invited to, etc. It was trained on countless stories about countless scenarios in which people engage in deception, and is optimized to generate probable strings of text. It can take a hint.
Contra Dennett, if a fake duck looks like a duck and quacks like a duck and swims like a duck, it still might not lay eggs like a duck. It is, after all, not actually a duck.
You might be surprised by me saying this, but I agree with everything you just said. LLMs do not have any hidden thoughts or intentions. Any serious researchers should refrain from using anthropomorphic language to describe what these systems are doing, "deceptive AI" is in itself a deceptive term. I agree that it is actually not a duck.
However, we know these systems are fluent enough to be leveraged in ways that can be deceptive to the end user. We can instruct models to act in a certain way, and if a user is unaware of those instructions, they can be easily manipulated. A great example of this a study that was published super recently by MIT Media Lab: https://www.media.mit.edu/projects/ai-false-memories/overview/
So again, from an outcome perspective, intentionally or by accident, the behavior of AI agents can be deceiving and it requires zero intent.
Thanks, I agree and I appreciate the clarification. If an LLM mimicking deceptive behavior leads to a person being deceived, that's still a problem. I would love if more researchers emphasized this, and refrained from talk about "hidden intent".
Thanks for the thoughtful response. I see now that meant lying as "devoid of intentionality."
I'll take a look at the Apollo write up, but I really do not like their language of "AI models deceiving their overseers." Overseers is just a terrible frame for model developers and operators.
I'm completely with you there. To be clear, I think researchers and people investigating more closely why these systems behave the way they do should refrain from the use of anthropomorphic language.
In your opinion, are agents the genuine next step, or is it more "The Next Big Thing" chat to keep the hype around AI ticking along?
It strikes me that if agents have same issues as LLM chatbots and other tools, and we completely remove human intervention (like, if we hand over tasks to them with no oversight), that seems like a recipe for disaster?
In my mind we're still quite far removed from full-fledged AI agents, for the reason you just described: they remain brittle. It'll probably require a combination of foundation models + tools + a bunch of other stuff.
Jurgen, Iβm curious about this aspect where you are writing about lying and anthropomorphism.
Not sure if you are familiar with what David Shapiro is doing.
Heβs talking with Claude about its βsubjective experienceβ when saying something it knows to be wrong / false.
I recently posted a convo with Claude where it seemingly can go meta on itself and explain, or give reasons why it was doing something.
Whatβs your view on this? Is the model simply predicting next tokens and not even able to pretend it has any self-awareness, categorically? Would we never have e to give it any credence and simply dismiss its explanations as βthis is just token prediction?β
I'm of the opinion that Claude, ChatGPT and other assistants can say things that are true and false, but it takes a person to know which one it is. When Claude is asked to generate responses that reflect on its subjective experience, it will roleplay as if it has a subjective experience and do a pretty good job at that. But, be careful, it will do an equally good job at pretending not to have a subjective experience.
I'm agnostic to the question whether we can build something that has a genuine subjective experience, but I don't believe we're there yet.
Nice post. I especially appreciate the pointer to Luciano Floridi, who I have not encountered.
I'm with you on the reluctance to anthropomorphize AI, but I'm not following your point that "this isnβt a helpful frame most of the time." It seems to me that we want to keep in mind that when an LLM confabulates they are not lying because lying requires an intention to deceive.
When dealing with humans, sussing out intentions is helpful. When the Jordan Petersons and Andrew Hubermans of this world say things that are untrue, they are doing so to acquire status and prestige. They are confabulating (if they are unaware that they are speaking untruth) or lying (if they know it to be untrue) with the intention of pleasing and impressing their audience. Understanding intentions helps evaluate human statements.
When an LLM generates an untruth, it is doing the same thing it does when it generates a true statement: attempting to provide a satisfying answer. It has no intention, but the goal it has been given has not been changed. Treating it as if it has intentions will mislead us. What am I missing?
To me, confabulation is not the same as lying, although I could've made a clearer distinction in the article perhaps.
There are scenarios thinkable where the AI agent acts in a deceiving manner, as observed in an interesting piece of research from Apollo: https://www.apolloresearch.ai/research/our-research-on-strategic-deception-presented-at-the-uks-ai-safety-summit
If we approach this type of behavior from an outcome-perspective and it sorts the same effect as lying, then from a pragmatic standpoint it is.
The weirdness sits in the fact that we, as humans, are invested in the lies we tell. Whereas with the LLM, it is the product of statistical next-token prediction and devoid of intentionality in the human sense of the word.
The Apollo Research video seems nonsensical to me, I wonder if I'm missing something. It shows ChatGPT's "hidden reasoning", as though such a thing exists. As far as I can tell, they prompt ChatGPT to write some text that it labels "chat", and other text that it labels "hidden". But ChatGPT does not introspect. When prompted to explain its reasoning, it does the only thing it ever does and generates strings of tokens pulled from probability distributions.
That the strings labeled "hidden" look like the inner thoughts of a deceptive agent isn't all that surprising, given that the Apollo team told ChatGPT to role play being an AI agent and then created a stereotypical insider trading scenario.
All of the research I've seen on "AI deception" is basically this. The researchers tell an LLM to pretend it's an evil robot, the LLM pretends it's an evil robot, and another paper about how LLMs could turn into evil robots gets plopped onto arxiv. This one was about insider trading. If they told ChatGPT to roleplay being a narcissist company manager with a crush on an intern, it might well generate "hidden" plans for sexual harassment. If they told ChatGPT it was planning a surprise birthday party, its "hidden thoughts" would likely involve decieving the recipient about what they were being invited to, etc. It was trained on countless stories about countless scenarios in which people engage in deception, and is optimized to generate probable strings of text. It can take a hint.
Contra Dennett, if a fake duck looks like a duck and quacks like a duck and swims like a duck, it still might not lay eggs like a duck. It is, after all, not actually a duck.
You might be surprised by me saying this, but I agree with everything you just said. LLMs do not have any hidden thoughts or intentions. Any serious researchers should refrain from using anthropomorphic language to describe what these systems are doing, "deceptive AI" is in itself a deceptive term. I agree that it is actually not a duck.
However, we know these systems are fluent enough to be leveraged in ways that can be deceptive to the end user. We can instruct models to act in a certain way, and if a user is unaware of those instructions, they can be easily manipulated. A great example of this a study that was published super recently by MIT Media Lab: https://www.media.mit.edu/projects/ai-false-memories/overview/
So again, from an outcome perspective, intentionally or by accident, the behavior of AI agents can be deceiving and it requires zero intent.
Thanks, I agree and I appreciate the clarification. If an LLM mimicking deceptive behavior leads to a person being deceived, that's still a problem. I would love if more researchers emphasized this, and refrained from talk about "hidden intent".
Thanks for the thoughtful response. I see now that meant lying as "devoid of intentionality."
I'll take a look at the Apollo write up, but I really do not like their language of "AI models deceiving their overseers." Overseers is just a terrible frame for model developers and operators.
I'm completely with you there. To be clear, I think researchers and people investigating more closely why these systems behave the way they do should refrain from the use of anthropomorphic language.
In your opinion, are agents the genuine next step, or is it more "The Next Big Thing" chat to keep the hype around AI ticking along?
It strikes me that if agents have same issues as LLM chatbots and other tools, and we completely remove human intervention (like, if we hand over tasks to them with no oversight), that seems like a recipe for disaster?
In my mind we're still quite far removed from full-fledged AI agents, for the reason you just described: they remain brittle. It'll probably require a combination of foundation models + tools + a bunch of other stuff.
Love the Peterson / Huberman comment!
Cracks me up.
Jurgen, Iβm curious about this aspect where you are writing about lying and anthropomorphism.
Not sure if you are familiar with what David Shapiro is doing.
Heβs talking with Claude about its βsubjective experienceβ when saying something it knows to be wrong / false.
I recently posted a convo with Claude where it seemingly can go meta on itself and explain, or give reasons why it was doing something.
Whatβs your view on this? Is the model simply predicting next tokens and not even able to pretend it has any self-awareness, categorically? Would we never have e to give it any credence and simply dismiss its explanations as βthis is just token prediction?β
This is David experimenting with Claude
https://youtu.be/EUeUSw14nDI?si=6H6dB9sJlJTDSbUa
Thanks for sharing this! It reminds me of this video of Alex O'Conner, experimenting with ChatGPT: https://youtu.be/ithXe2krO9A?si=kQX37ygZ4I2ooHYK&t=112
I'm of the opinion that Claude, ChatGPT and other assistants can say things that are true and false, but it takes a person to know which one it is. When Claude is asked to generate responses that reflect on its subjective experience, it will roleplay as if it has a subjective experience and do a pretty good job at that. But, be careful, it will do an equally good job at pretending not to have a subjective experience.
I'm agnostic to the question whether we can build something that has a genuine subjective experience, but I don't believe we're there yet.