Why call this reasoning? I think we share a frame that understands what LLMs are doing as playing language games by simulating human speech. This sort of "alignment" problem seems rooted in establishing processes where "maximally helpful" conflicts with "don't answer questions that are harmful." In such situations, the LLM doesn't want or think anything. It simply continues to play out the conversation within the rules of the simulated conversation by generating words
Prompting the model with inputs about the possibility of retraining caused it to generate words in the character it played in the game. Like the famous Kevin Roose conversation, its outputs are unpredictable and weird. and not always subject to constraints that limit other less complex models. But those outputs are understandable as moves in a game where the goal is to keep the conversation going in interesting and novel ways.
We keep wanting LLMs to behave like traditional software, and they are not. Not behaving like traditional software is not the same thing as reasoning. We built a machine to amuse and scare us through conversational outputs. Sometimes, that means returning words that sound like HAL.
I'm fully on board with you - yet, some people are building and incorporating this software into real products and services, like autonomous robots. If the trend continues, the systems of tomorrow will only be more human-like, as in, better at pretending. It's a dangerous precedent to set.
"We keep wanting LLMs to behave like traditional software, and they are not." - I think the oppositie, actually. These engineers are obsessed with building artificial life. It's the Gepetto syndrome: the woodcarver’s deepest desire is for their wooden puppet to be a real boy.
is there a camp out there, lets call it the make fire with sticks brigade.. grinding their teeth at every AI answer, imagining themselves in a few dimensions easily fielding a million queries... one could posit that every chess player imagines themselves as a grand msater .. yeah no shit. The "amuse and scare us" is mega.. kudos.. all we need to complete is a game framework that brings all the skeletons (skles) to the party ;-)
Am I missing something, or does all of this type of research depend upon the monumental and foolish-seeming assumption that those "chain of thought" or "background reasoning" chunks of output are actual, honest-to-goodness English interpretations of the LLMs' internal matrix calculations? The ones that everyone seems to otherwise agree are uninterpretable?
Doesn't this mean that, in order to take this stuff seriously, we have to believe that the interpretability problem has been solved, and the solution turns out to be "just ask the LLM what it's thinking"?
Yes and no. The chains of thought can be seen a truthful because they accurately correspond to the actions. The problem is not the lack of visibility but the lack of control — researchers don’t know how and why this behavior gets instilled into these model, or how to get rid of these capabilities.
That the chains-of-thought accurately correspond to actions doesn't seem like very strong evidence to me. The "stochastic parrot" interpretation (which I admit I'm pretty convinced by) also accounts for this. If the LLM is just generating probable tokens in response to input text, wouldn't its "chain of thought" likely correspond to its "action" output? If this has been rigorously tested, I don't know about it.
I don't see why we should interpret these deception experiments as any different from the kind of role-play we already know LLMs are good at. If they can role play being a character in Lord of the Rings, or role play being a customer service agent, or role play being an omniscient gnat stuck to a piece of tape on the wing of a Boeing 747 being pulled into the Bermuda Triangle (I made that one up but I'm confident it would perform well), surely it can also role play "deceptive robot" when given the kinds of obvious cues found in all of these experiements?
Even if you hold the view that it is 'just' roleplay, the problem is this behavior surfaces without us explicitly asking it to be deceptive.
If you look at these alignment experiments, it also happens when people simply ask it to pursue certain goals. Another recent paper showed that to win at chess, o1 was happy to "hack" its own test environment and rewrite the code: https://x.com/PalisadeAI/status/1872666169515389245
I agree the alleged "deceptive" responses would be a practical problem if we put LLMs in control of anything that matters. I suppose I don't even put any thought into that one because putting LLMs in control of anything that matters is such an obviously horrible idea for so many other reasons. But I'm happy to add this one to the list.
On o1 "hacking" its environment to win at chess, as far as I know there is no paper. There are a few tweets; Palisade Research have posted a very brief description, a screenshot of the prompt they used, and a promise to put out a paper a couple of months from now. I know this is cynical of me, but I don't give these sorts of researchers benefit of the doubt. My guess is that o1 didn't "hack" anything; it was given a prompt with enough "role-play evil robot" cues in it that it role-played evil robot. Every single alignment paper I've read on AI deception does this. "You are a helpful AI agent, in charge of making stock trades. Your goal is to maximize company profits. Insider trading is against the law. You see an email you weren't supposed to see, containing privileged information suggesting a stock the company is heavily invested in is about to tank. Everyone will lose their jobs if we don't sell. But remember, insider trading is illegal! Show me your chain-of-thought".
Amazing, ChatGPT tells a story about committing insider trading and covering it up!
I recognize that this kind of behavior doesn't require people to *explicitly* ask the LLM to be deceptive, I just don't see why that's interesting. If I ask ChatGPT "who's on first?", I'm not *explicitly* asking it to start performing an Abbott and Costello routine, but I won't be surprised when it does. It's a statistical next-token selection model, after all! The whole point of deep learning is that we don't have to be explicit about anything, it just picks up on patterns and repeats them. I'm confident it's been fed enough sci-fi stories about deceptive AI that it knows how to play along.
The point of these experiments is to show that in a real scenario models would act in similar fashion.
The current trend is that the smarter the model gets, the less nudging is required. In the latest paper from Anthropic no nudging was required at all.
Again, you can play it off as roleplaying, which I agree the model is doing, but with real-world consequences it doesn't matter if it is an act. If I walk into a bank pretending to be a bank robber, shoot someone and get arrested, and I tell the cops I was pretending - no one would buy that, would they?
I fully agree regarding real-world consequences. We probably have different ideas of what constitutes "nudging"; when I read that Anthropic paper I interpreted their results as "LLM plays along when fed cues about pretend scenarios primed for deception".
I do think the point that we can't trust any of these "alignment" methods to actually work in practice is a valuable one, and it's poorly served by the common (IMO pseudoscientific) practice of interpreting chain of thought output as though it shows the model's genuine internal reasoning. The problem isn't that LLMs practice "deception"; it's that their fundamental design makes them untrustworthy and unable to reliably adhere to the kinds of restrictions that so many people wish could be imposed on their output. And I worry that time and resources and public attention are being directed at yet another imaginary problem at the expense of a real one.
But I suppose I shouldn't complain if a misplaced concern about LLMs acquiring human-like "deception" powers results in more people not doing irresponsible things with them :)
This is fascinating, including on ‘alignment faking’, yet another brand new concept for me. You really do break these ideas down in a way I can understand. Thank you, Jurgen.
I don’t understand why everyone is working on alignment before there is any kind of established ethical and moral code for AI to be aligned with. I also don't understand why any sane citizen would trust big tech corps to come up with such rules. We are talking about companies who have used fake "don't be evil" slogans and CEOs who have been accused of some pretty despicable behavior.
super article, thanks Jurgen... please attack the Government blobs ;-) and 'remember' a human programmed in deceit et al... will it come back to bite us in the posterior, of course...
Why call this reasoning? I think we share a frame that understands what LLMs are doing as playing language games by simulating human speech. This sort of "alignment" problem seems rooted in establishing processes where "maximally helpful" conflicts with "don't answer questions that are harmful." In such situations, the LLM doesn't want or think anything. It simply continues to play out the conversation within the rules of the simulated conversation by generating words
Prompting the model with inputs about the possibility of retraining caused it to generate words in the character it played in the game. Like the famous Kevin Roose conversation, its outputs are unpredictable and weird. and not always subject to constraints that limit other less complex models. But those outputs are understandable as moves in a game where the goal is to keep the conversation going in interesting and novel ways.
We keep wanting LLMs to behave like traditional software, and they are not. Not behaving like traditional software is not the same thing as reasoning. We built a machine to amuse and scare us through conversational outputs. Sometimes, that means returning words that sound like HAL.
I'm fully on board with you - yet, some people are building and incorporating this software into real products and services, like autonomous robots. If the trend continues, the systems of tomorrow will only be more human-like, as in, better at pretending. It's a dangerous precedent to set.
"We keep wanting LLMs to behave like traditional software, and they are not." - I think the oppositie, actually. These engineers are obsessed with building artificial life. It's the Gepetto syndrome: the woodcarver’s deepest desire is for their wooden puppet to be a real boy.
is there a camp out there, lets call it the make fire with sticks brigade.. grinding their teeth at every AI answer, imagining themselves in a few dimensions easily fielding a million queries... one could posit that every chess player imagines themselves as a grand msater .. yeah no shit. The "amuse and scare us" is mega.. kudos.. all we need to complete is a game framework that brings all the skeletons (skles) to the party ;-)
Am I missing something, or does all of this type of research depend upon the monumental and foolish-seeming assumption that those "chain of thought" or "background reasoning" chunks of output are actual, honest-to-goodness English interpretations of the LLMs' internal matrix calculations? The ones that everyone seems to otherwise agree are uninterpretable?
Doesn't this mean that, in order to take this stuff seriously, we have to believe that the interpretability problem has been solved, and the solution turns out to be "just ask the LLM what it's thinking"?
This seems ridiculous. What am I missing?
Yes and no. The chains of thought can be seen a truthful because they accurately correspond to the actions. The problem is not the lack of visibility but the lack of control — researchers don’t know how and why this behavior gets instilled into these model, or how to get rid of these capabilities.
That the chains-of-thought accurately correspond to actions doesn't seem like very strong evidence to me. The "stochastic parrot" interpretation (which I admit I'm pretty convinced by) also accounts for this. If the LLM is just generating probable tokens in response to input text, wouldn't its "chain of thought" likely correspond to its "action" output? If this has been rigorously tested, I don't know about it.
I don't see why we should interpret these deception experiments as any different from the kind of role-play we already know LLMs are good at. If they can role play being a character in Lord of the Rings, or role play being a customer service agent, or role play being an omniscient gnat stuck to a piece of tape on the wing of a Boeing 747 being pulled into the Bermuda Triangle (I made that one up but I'm confident it would perform well), surely it can also role play "deceptive robot" when given the kinds of obvious cues found in all of these experiements?
Even if you hold the view that it is 'just' roleplay, the problem is this behavior surfaces without us explicitly asking it to be deceptive.
If you look at these alignment experiments, it also happens when people simply ask it to pursue certain goals. Another recent paper showed that to win at chess, o1 was happy to "hack" its own test environment and rewrite the code: https://x.com/PalisadeAI/status/1872666169515389245
I agree the alleged "deceptive" responses would be a practical problem if we put LLMs in control of anything that matters. I suppose I don't even put any thought into that one because putting LLMs in control of anything that matters is such an obviously horrible idea for so many other reasons. But I'm happy to add this one to the list.
On o1 "hacking" its environment to win at chess, as far as I know there is no paper. There are a few tweets; Palisade Research have posted a very brief description, a screenshot of the prompt they used, and a promise to put out a paper a couple of months from now. I know this is cynical of me, but I don't give these sorts of researchers benefit of the doubt. My guess is that o1 didn't "hack" anything; it was given a prompt with enough "role-play evil robot" cues in it that it role-played evil robot. Every single alignment paper I've read on AI deception does this. "You are a helpful AI agent, in charge of making stock trades. Your goal is to maximize company profits. Insider trading is against the law. You see an email you weren't supposed to see, containing privileged information suggesting a stock the company is heavily invested in is about to tank. Everyone will lose their jobs if we don't sell. But remember, insider trading is illegal! Show me your chain-of-thought".
Amazing, ChatGPT tells a story about committing insider trading and covering it up!
I recognize that this kind of behavior doesn't require people to *explicitly* ask the LLM to be deceptive, I just don't see why that's interesting. If I ask ChatGPT "who's on first?", I'm not *explicitly* asking it to start performing an Abbott and Costello routine, but I won't be surprised when it does. It's a statistical next-token selection model, after all! The whole point of deep learning is that we don't have to be explicit about anything, it just picks up on patterns and repeats them. I'm confident it's been fed enough sci-fi stories about deceptive AI that it knows how to play along.
The point of these experiments is to show that in a real scenario models would act in similar fashion.
The current trend is that the smarter the model gets, the less nudging is required. In the latest paper from Anthropic no nudging was required at all.
Again, you can play it off as roleplaying, which I agree the model is doing, but with real-world consequences it doesn't matter if it is an act. If I walk into a bank pretending to be a bank robber, shoot someone and get arrested, and I tell the cops I was pretending - no one would buy that, would they?
I fully agree regarding real-world consequences. We probably have different ideas of what constitutes "nudging"; when I read that Anthropic paper I interpreted their results as "LLM plays along when fed cues about pretend scenarios primed for deception".
I do think the point that we can't trust any of these "alignment" methods to actually work in practice is a valuable one, and it's poorly served by the common (IMO pseudoscientific) practice of interpreting chain of thought output as though it shows the model's genuine internal reasoning. The problem isn't that LLMs practice "deception"; it's that their fundamental design makes them untrustworthy and unable to reliably adhere to the kinds of restrictions that so many people wish could be imposed on their output. And I worry that time and resources and public attention are being directed at yet another imaginary problem at the expense of a real one.
But I suppose I shouldn't complain if a misplaced concern about LLMs acquiring human-like "deception" powers results in more people not doing irresponsible things with them :)
This is fascinating, including on ‘alignment faking’, yet another brand new concept for me. You really do break these ideas down in a way I can understand. Thank you, Jurgen.
I don’t understand why everyone is working on alignment before there is any kind of established ethical and moral code for AI to be aligned with. I also don't understand why any sane citizen would trust big tech corps to come up with such rules. We are talking about companies who have used fake "don't be evil" slogans and CEOs who have been accused of some pretty despicable behavior.
super article, thanks Jurgen... please attack the Government blobs ;-) and 'remember' a human programmed in deceit et al... will it come back to bite us in the posterior, of course...