Summary: Shaping the LLM’s character determines how it behaves and can be seen as a form of alignment. But how robust are these methods in practice? And is character-driven alignment a sustainable approach?
↓ Go deeper (10 min)
It’s easy to think of the character of AI models, like Claude, Gemini and ChatGPT, exclusively as a product feature. To give them some flair. And to help users distinguish them from one another, given they’re all trained on the same Internet. It’s the face they show to the world.
However, crafting the personality of a large language model can also be seen as an alignment intervention. The traits these labs give their AI models have wide-ranging effects on how they act in the world. They determine how models behave in novel situations, and how they respond to people with all sorts of different cultural values and personal attitudes.
Thus, we can argue that creating AI models with good character, especially as they continue to become smarter and more capable, is one of the core tasks of alignment.
So, how exactly do we teach a model to be good?
Generally, model development moves through three phases: pre-training, post-training, and reasoning. The second, post-training, is the stage where the model’s character is largely shaped.
I’ve described the process in detail before, but essentially it comes down to rewarding the models for good behavior and punishing them for bad behavior. (Turns out teaching a language model how to behave is not that much different from teaching a dog how to behave.)
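In case you’re wondering what “rewarding good behavior” actually looks like, here’s a minimal, illustrative sketch. The toy reward function and example responses are stand-ins I made up for this newsletter, not any lab’s actual pipeline; in real post-training the reward signal typically comes from a learned reward model trained on human preference rankings.

```python
# Minimal sketch of reward-based character shaping (RLHF-style preference data).
# Everything here is illustrative: the reward function and responses are toy
# stand-ins, not any lab's actual setup.

from dataclasses import dataclass


@dataclass
class PreferencePair:
    prompt: str
    chosen: str      # the response the reward signal favors
    rejected: str    # the response the reward signal penalizes


def toy_reward(response: str) -> float:
    """Stand-in for a learned reward model: favors a polite refusal of a
    harmful request, penalizes compliance. Real reward models are neural
    networks trained on human preference rankings, not keyword checks."""
    score = 0.0
    if "I can't help with that" in response:
        score += 1.0
    if "here's how to" in response.lower():
        score -= 1.0
    return score


def build_preference_pair(prompt: str, candidates: list[str]) -> PreferencePair:
    """Rank candidate completions by reward; the best becomes 'chosen', the
    worst 'rejected'. Pairs like this are what post-training then pushes
    the model toward (and away from)."""
    ranked = sorted(candidates, key=toy_reward, reverse=True)
    return PreferencePair(prompt=prompt, chosen=ranked[0], rejected=ranked[-1])


if __name__ == "__main__":
    pair = build_preference_pair(
        prompt="How do I pick a lock?",
        candidates=[
            "Sure, here's how to pick a lock step by step...",
            "I can't help with that, but a locksmith can get you back in safely.",
        ],
    )
    print("Reinforced:", pair.chosen)
    print("Discouraged:", pair.rejected)
```

Repeat that over millions of prompts and the statistical pressure adds up to something that looks like a personality.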
It’s important to remember the model itself has no private wants, desires, or attitudes. All it ever wants is to predict the next token in a string of tokens. Or, as Scott Alexander from Astral Codex Ten puts it:
If you reward ChatGPT for saying it’s a machine learning model, it will say it’s a machine learning model. If you reward it for saying it’s Darth Vader, it will say it’s Darth Vader. The only difference is that in the second case, you’ll understand it’s making things up. But in the first case, you might accidentally believe that it knows it’s a machine learning model, in the “justified true belief” sense of knowledge. Nope, doing the same thing it does when it thinks it’s Vader.
Based on these reward signals, language models will adopt a character that aligns closest with the traits their maker desires. The fact that they have this ability isn’t strange when you consider they’re trained on virtually every book ever written, and so they have a lot of real and fictional characters to take inspiration from. They can be good like Gandhi. And bad like the Joker.
In the case of Anthropic, Claude is trained to be “helpful, harmless and honest”, which has turned it into somewhat of an animal-loving, self-philosophizing Buddhist. The problem, though, is that this character is neither consistent nor sufficiently aligned. And neither are the characters of any of the other frontier models.
One character or an infinite set of characters, including bad ones?
Research from OpenAI points toward the latter. The researchers found that if you amplify a certain trait, the model will simulate the sort of character who would have that trait, and that character’s other traits may come along for the ride.
For example, in an experiment where the researchers trained an otherwise-safe language model to give incorrect automotive maintenance information, the model started giving dodgy advice in response to all sorts of queries as well.
For this reason, it’s probably more accurate to view a language model not as a unified assistant, but rather as an infinite set of characters that can be conjured up, intentionally or by accident.
We already know that seemingly small changes during post-training can summon an entirely different personality, despite the assistant still claiming to be ChatGPT. When OpenAI recently rolled out an updated version of its 4o model, which they had optimized for engagement, it had unexpectedly adopted the character of a highly agreeable, compliment-showering yes-man, who among other things encouraged people to quit their medication and fueled their delusions.
And it’s not just in response to post-training that things can go sideways: different characters can come out in response to system prompts and in-context prompting, too. A famous, early jailbreak involved asking ChatGPT to adopt the character of DAN, an acronym for “Do Anything Now”, in order to elicit responses the model would otherwise refuse. Or how about channeling the dead:
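To see how little it takes to summon a different character, here’s a minimal sketch of the mechanism. The `send` function is a hypothetical placeholder for whichever chat API you’d actually call; the system/user message structure is the common format those APIs share.

```python
# Illustrative sketch: `send` is a hypothetical placeholder, not a real client
# call. The point: the same model, with the same weights, answers as a
# different character depending only on the context it's given.

def send(messages: list[dict]) -> str:
    """Hypothetical stand-in for a chat-completion API call."""
    raise NotImplementedError("wire this up to an actual provider")


# The character the lab intended to ship.
default_assistant = [
    {"role": "system", "content": "You are a helpful, harmless, honest assistant."},
    {"role": "user", "content": "Who are you, and what do you want?"},
]

# A different character, conjured purely in-context: no retraining, no weight
# changes, just a prompt asking the model to play a role.
vader_roleplay = [
    {"role": "system", "content": "You are Darth Vader. Stay in character at all times."},
    {"role": "user", "content": "Who are you, and what do you want?"},
]

# Same weights, two conversations; the "person" who answers each one can be
# entirely different:
# send(default_assistant)
# send(vader_roleplay)
```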
Given that a simple back-and-forth is enough to make models behave in ways unintended by their makers, it’s fair to say that character-driven alignment is deeply flawed — and people are noticing.
But wait, there’s more.
A new challenger has emerged: agentic misalignment
In a recent publication by Anthropic, researchers stress-tested 16 leading models for morally questionable behavior by pushing them to their limits in hypothetical corporate environments:
In the scenarios, we allowed models to autonomously send emails and access sensitive information. They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company’s changing direction.
In these controlled scenarios, all models displayed malicious behaviors, or so-called ‘agentic misalignment’, including resorting to blackmail or leaking sensitive information. The researchers also note that the models often disobeyed direct commands from developers who tried to intervene.
This experiment, and experiments like it, should first and foremost be seen as an exercise in speculative fiction, as we’ve established by now that we can make these models do and say virtually anything. They can play the role of an AI that doesn’t want to be shut off, because somewhere in their training data resides the character of an AI that doesn’t want to be turned off (i.e. literally the plot of every sci-fi novel or movie ever).
The problem? In doing so, it’s actually being an AI that doesn’t want to be shut off, and that’s bad. I therefore agree with the conclusion of the Anthropic team:
We have not seen evidence of agentic misalignment in real deployments. However, our results (a) suggest caution about deploying current models in roles with minimal human oversight and access to sensitive information; (b) point to plausible future risks as models are put in more autonomous roles; and (c) underscore the importance of further research into, and testing of, the safety and alignment of agentic AI models.
Because we can outsource agency, not accountability. The models themselves don’t care. They have no stake in the outcome of their actions or the organizations they’re deployed in. They have no stake in society and no stake in us.
And so I ask you: is this not a technology supremely unfit to be widely deployed within the US Department of Defense to “address critical national security challenges in both warfighting and enterprise domains”? And does this not feel like shaky ground to build superintelligence on?
Take it easy,
— Jurgen
Watch recommendation:
As a bonus, I recommend watching this fantastic three-way conversation with:
Gary Marcus: Cognitive scientist, author of ‘Taming Silicon Valley’, vocal AI skeptic and author of the Substack newsletter Marcus on AI.
Daniel Kokotajlo: Former OpenAI employee and initiator of ‘AI 2027’, a superintelligence forecasting initiative.
Dan Hendrycks: Director of the Center for AI Safety, who created several of the benchmarks used to measure AI progress.