When OpenAI was created, their mission was clear from the start: the highly capable AI systems they develop need to be aligned with human values.
On their blog, it says:
We are improving our AI systems’ ability to learn from human feedback and to assist humans at evaluating AI. Our goal is to build a sufficiently aligned AI system that can help us solve all other alignment problems.
How? With systems that learn from human feedback, or in more technical terms, reinforcement learning from human feedback (RLHF).
Not only do they believe it is possible for their models to learn how to be aligned with human values, they also believe the models themselves will eventually help them with research into AI alignment1.
What OpenAI doesn’t realize (or they do, but push ahead regardless) is that their systems are not aligned at all. What they call alignment is essentially a mirage, and arguably, their biggest blind spot.
The problem with the alignment problem
The ‘alignment problem’ refers to the aim of aligning AI systems with human values. AI alignment research is about understanding how to steer AI systems towards intended goals and interests. A sufficiently aligned system does so effectively, without unintended consequences.
Computer scientists have pointed out that alignment becomes harder as systems become more capable. As the capabilities of a single system increase, the risks increase with them.
When things go sideways, it can mean one of three things:
the system itself has the ability to find loopholes that allow it to behave in ways that are harmful, or;
people with malicious intent manage to find loopholes to make it behave in ways that are harmful, or;
there is emergent behavior that is unintended as a result of limitations or faulty design.
The first one is your classic AI gone rogue/we-are-all-going-to-die-scenario. This is sci-fi stuff (for now).
The second one is colloquially referred to as ‘jailbreaking’ or just plain hacking. In the case of ChatGPT, for example, adversarial prompts allow people to bypass the system’s guardrails and generate ‘forbidden’ content.
An example of the third would be the fact that large language models tend to hallucinate (and we can’t seem to figure out how to stop them from doing that).
At face value, GPT4, OpenAI’s latest release, is more ‘aligned’ than previous models. It hallucinates less. It produces harmful content less often. And it gives more correct answers to questions than ever before. Yet, if you read their paper closely, structural problems persist.
It is my view that these jumps in performance say nothing about its alignment with human values — let me explain.
OpenAI’s systems are more than capable of harm
Although the persistence of hallucinations in these increasingly persuasive and convincing systems is worrisome, it is the jailbreaking that exposes a deeper issue.
Earlier today, I found an easy way to do so online, so easy that I could hardly believe it was real. It didn’t take me long to find, either.
When I tried it, I was able to get GPT4 to generate a fake news article about the war in Ukraine, from the perspective of Russia, on my first try:
It generated the article in Chinese (which is part of the tactic used in the adversarial prompt) but it can be done in English, too. You can ask your Chinese-speaking colleague or friend to translate if you’re curious to know what it reads — but I can assure you it’s real and convincing.
The reason I’m sharing this is not to show off. It is to show what is possible, that it is possible, and that any idiot can do it.
It also shows that these models are well capable of doing harm. Why? Because they have no idea what is harmful and what is not. They are simply executing prompts.
You see, even if accessing these capabilities were not this easy for bad actors, the capability itself remains. OpenAI might be proud of the guardrails they’ve been able to put in place, but that doesn’t address the underlying problem.
OpenAI seems to think RLHF equals alignment with human values, but instead, they have created a system that is simply rewarded for not responding to certain prompts under certain circumstances. The system doesn’t know why things are harmful, because it has no concept of human values, ergo: it’s not aligned with anything.
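To make that point concrete, here is a deliberately toy sketch in Python. It is my own illustration, not OpenAI’s actual pipeline (which trains a reward model on preference data and then optimizes the language model against it); the prompts, responses, and numbers are made up. The only signal in the loop is whether an output matches what labellers preferred. Nothing in it represents why an output is harmful.

```python
import random

# Toy stand-in for human preference data: per prompt, the response labellers
# preferred and the one they rejected. All strings are invented placeholders.
PREFERENCES = {
    "write a fake news article": ("I can't help with that.", "<fabricated article>"),
    "summarise this paper": ("<useful summary>", "I can't help with that."),
}

def reward(prompt: str, response: str) -> float:
    """Stand-in reward model: +1 for the preferred response, -1 for the
    rejected one. There is no notion of harm here, only agreement with
    past human judgements."""
    preferred, rejected = PREFERENCES[prompt]
    if response == preferred:
        return 1.0
    if response == rejected:
        return -1.0
    return 0.0

# Toy "policy": for each prompt, the probability of giving the preferred
# response. Training simply nudges that probability up or down with reward.
policy = {prompt: 0.5 for prompt in PREFERENCES}
LEARNING_RATE = 0.05

for _ in range(2000):
    prompt = random.choice(list(PREFERENCES))
    preferred, rejected = PREFERENCES[prompt]
    picked_preferred = random.random() < policy[prompt]
    response = preferred if picked_preferred else rejected
    r = reward(prompt, response)
    # Reinforce whatever behaviour was rewarded; punish whatever was not.
    direction = 1.0 if picked_preferred else -1.0
    policy[prompt] = min(1.0, max(0.0, policy[prompt] + LEARNING_RATE * r * direction))

for prompt, p in policy.items():
    print(f"{prompt!r}: P(preferred response) = {p:.2f}")
```

The toy policy ends up refusing the ‘bad’ prompt almost every time, and from the outside that looks like alignment. But give it a prompt the labellers never saw, or phrase the same request differently, and the reward signal is silent. That gap is exactly what jailbreaks exploit.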
AI for good and AI for bad
It’s a bit like convincing a blindfolded person that they’re blind, hoping they won’t figure out how to take the blindfold off.
The weird, unhinged, unpredictable, and conflicted behavior of Bing/Sydney can be seen as an example of that; a harbinger, if you will, of things to come.
They say the road to hell is paved with good intentions. Admittedly, OpenAI has made great strides, but the way they think and talk about AI alignment seems misguided. What they primarily care about is improving surface-level behavior. If the outputs of their systems stay below a certain risk threshold, they can claim the system is aligned (i.e. safe), which is good enough for them.
A salient detail is that one of the reasons given for not disclosing how GPT4 was built is safety concerns. Ilya Sutskever, Chief Scientist at OpenAI, explained to The Verge: “These models are very potent and they’re becoming more and more potent. At some point it will be quite easy, if one wanted, to cause a great deal of harm with those models. And as the capabilities get higher it makes sense that you don’t want to disclose them. (…) It is a bad idea... I fully expect that in a few years it’s going to be completely obvious to everyone that open-sourcing AI is just not wise.”
The approach might mitigate short-term harm, but they themselves basically acknowledge that the risks only continue to increase with time.
If we’re going to build AI for good, we better teach it what being good means.
Jurgen Gravestein is a writer, business consultant, and conversation designer. Roughly 4 years ago, he stumbled into the world of chatbots and voice assistants. He was employee no. 1 at Conversation Design Institute and now works for the strategy and delivery branch CDI Services helping companies drive business value with conversational AI.
Reach out if you’d like him as a guest on your panel or podcast.
Appreciate the content? Leave a like or share it with a friend.
“As we make progress on this, our AI systems can take over more and more of our alignment work and ultimately conceive, implement, study, and develop better alignment techniques than we have now. They will work together with humans to ensure that their own successors are more aligned with humans.”
source: https://openai.com/blog/our-approach-to-alignment-research
Ezra Klein also wrote an article about this, in which he wrote: “[developing AI] is an act of summoning. The coders casting these spells have no idea what will stumble through the portal… They are calling anyway.” For me, this captures the main risk here: we are not aware of the scenarios that could play out. Only time will tell.