AI Alignment Research Is More Science Fiction Than Science
Not only is the concept of superhuman AI hypothetical, by definition any research into aligning such a system is purely hypothetical too.
Key insights of today’s newsletter:
Today, the makers of AI technology get to decide what values are being instilled into their models, which gives them quite some power.
Even if we trust the big corporations, there’s no guarantee that these systems are safe from misuse by bad actors, making AI alignment a rather elusive concept.
We also need to worry about something called ‘deceptive alignment’ (i.e. the possibility of the AI deceiving us), as future AI systems may understand us better than we understand ourselves.
↓ Go deeper (6 min read)
OpenAI’s latest research paper centers around ‘superalignment’. I quote:
“We believe superintelligence—AI vastly smarter than humans—could be developed within the next ten years. However, we still do not know how to reliably steer and control superhuman AI systems. Solving this problem is essential for ensuring that even the most advanced AI systems in the future remain safe and beneficial to humanity.”
Let me start by saying that the researchers don’t provide any substantiation for the claim that superintelligence could emerge within the next 10 years. Not only is the concept of superhuman AI hypothetical, by definition any research into aligning such a system is purely hypothetical too.
You could say the concept of superalignment is more science fiction than science — and I don’t mean that to disparage. The work, while speculative, is an exercise in forward thinking. It helps us better understand how we can instill computers with human values and how to better control them.
Now, I am no expert in computer science (although I did study it for a grand total of 3 months) and I haven’t worked on complex AI systems, but I do know a fair bit about people and what grinds their gears. The challenges I see have less to do with the alignment of the AI and more to do with aligning the humans who use AI. If you ask me, the problem is us.
The right to pursuit of happiness
One thing I have to give the researchers credit for is the analogy that they came up with. Analogies allow us to make sense of complex, novel concepts through comparison.
To explain the relationship between humans and superintelligent AI, they use the analogy of a supervisor and a student. Right now, we’re much smarter than AI and fully in control of the thing that is in front of us. What if that were no longer the case? How do we control something that is much more intelligent and capable?
In a series of experiments, the researchers tested whether a weaker AI model could guide the behavior of a stronger, more capable one. Here’s what they found:
Stronger models were able to learn from the weaker ones and develop capabilities that surpassed those of the supervisors.
The larger models often ended up mimicking the responses of the smaller models, including their errors, indicating a superficial level of compliance rather than a genuine transfer of values.
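To make the setup a bit more concrete, here’s a minimal toy sketch in Python. It’s my own illustration, not OpenAI’s code: scikit-learn models stand in for language models, with a small logistic regression as the ‘weak supervisor’ and a gradient-boosted model as the ‘strong student’ that only ever sees the supervisor’s noisy labels.

```python
# Toy analogue of the weak-to-strong experiment (my own illustration, not
# OpenAI's code): a weak model labels a pool of data, a stronger model is
# trained only on those noisy labels, and both are scored against held-out
# ground truth to see whether the student surpasses its supervisor.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=6000, n_features=20, n_informative=5,
                           random_state=0)

# A small labeled set for the weak supervisor, a large pool for it to label,
# and a held-out test set with true labels for evaluation.
X_weak, X_rest, y_weak, y_rest = train_test_split(X, y, train_size=500,
                                                  random_state=0)
X_pool, X_test, y_pool, y_test = train_test_split(X_rest, y_rest,
                                                  test_size=0.3, random_state=0)

weak = LogisticRegression(max_iter=1000).fit(X_weak, y_weak)    # weak supervisor
weak_labels = weak.predict(X_pool)                              # noisy supervision
# Note: y_pool (the pool's true labels) is never shown to the student.

strong = GradientBoostingClassifier().fit(X_pool, weak_labels)  # strong student

print(f"weak supervisor accuracy: {weak.score(X_test, y_test):.3f}")
print(f"strong student accuracy:  {strong.score(X_test, y_test):.3f}")
```

Whether the toy student actually beats its supervisor depends on the data and the models; the point is the shape of the experiment, which is roughly what the paper runs at scale with GPT-2-sized supervisors and GPT-4-sized students.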
The findings show potential; however, the experiment had many assumptions baked into it, which the researchers openly acknowledge. It assumes, for example, that we know exactly what values we want the AI to pursue in the first place! For a superhuman AI to be useful and beneficial, it would need to know what’s best for us, right?
Liberal society is built on the right to pursuit of happiness, but much of the muddiness of life comes from the fact that my idea of freedom can be very different from yours. Not only will humans want to use an AI system that is aligned with their personal values, they will want to use it to pursue their own goals. So, how do we make sure they don’t bite each other? And who gets to decide what values these AI systems have?
Power to the… corporations?
Today, alignment is achieved through a process called reinforcement learning from human feedback (RLHF). Basically, AI companies hire a bunch of people online and get them to evaluate the AI’s answers based on certain criteria (like relevance, safety, factual accuracy, etc.). That feedback is used to train a reward model, and the AI is then fine-tuned to produce answers the reward model scores highly.
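For the curious, here’s a heavily stripped-down sketch of the first half of that recipe, in plain Python/NumPy. It’s my own illustration, not any lab’s actual pipeline: responses are reduced to feature vectors, human preferences to pairwise comparisons, and the reward model to a linear scorer trained with the standard pairwise (Bradley-Terry) loss.

```python
# Stripped-down sketch of the RLHF recipe (my own illustration, not any
# lab's actual pipeline): raters pick the better of two responses, those
# comparisons train a reward model via a pairwise (Bradley-Terry) loss,
# and the chat model is then fine-tuned to score highly on that model.
import numpy as np

rng = np.random.default_rng(0)

# Pretend each response is summarized by a feature vector (in reality these
# would come from the language model's hidden states).
n_pairs, dim = 2000, 16
true_pref = rng.normal(size=dim)              # the raters' hidden preference
chosen = rng.normal(size=(n_pairs, dim))
rejected = rng.normal(size=(n_pairs, dim))
# Make sure 'chosen' really is the preferred response in every pair.
swap = (chosen @ true_pref) < (rejected @ true_pref)
chosen[swap], rejected[swap] = rejected[swap], chosen[swap].copy()

# Train a linear reward model by maximizing log sigmoid(r(chosen) - r(rejected)).
w = np.zeros(dim)
lr = 0.1
for _ in range(200):
    margin = (chosen - rejected) @ w
    grad = ((1.0 / (1.0 + np.exp(-margin)) - 1.0)[:, None]
            * (chosen - rejected)).mean(axis=0)
    w -= lr * grad

agreement = np.mean((chosen @ w) > (rejected @ w))
print(f"reward model agrees with the raters on {agreement:.1%} of pairs")

# In the full pipeline, the chat model (the 'policy') would now be fine-tuned
# with reinforcement learning (typically PPO) to produce responses this reward
# model rates highly, plus a penalty for drifting too far from the base model.
```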
Currently, the makers of these models get to decide what values are being instilled, which gives them quite some power. Power reserved for those with billions of dollars to spend, because that’s how much it costs to build state-of-the-art AI.
Now, even if we were to trust big corporations with that responsibility, there’s no guarantee that these systems won’t be misused by bad actors. Even though training a frontier model is incredibly expensive, fine-tuning it isn’t. Research has shown that with a budget of less than $200 and a single GPU, one can successfully undo the safety training of Meta’s latest AI model.
This means if somebody wants to do bad, they can do bad. Luckily these models aren’t very capable yet. And it’s very much possible that we’ll find better ways of safeguarding capable AIs in the medium to long term.
Let’s say, for the sake of argument, that we indeed can protect models from bad actors pursuing bad goals. It’s a big if, but assuming it’s possible, it doesn’t mean we’re out of the woods yet. We would still have to worry about something called ‘deceptive alignment’, the possibility of the AI deceiving us…
The dangers of deceptive alignment
Deceptive alignment is not science fiction. It’s very much real and already possible with today’s AI models.
Recent safety research by Apollo Research investigated whether, under different degrees of pressure, GPT-4 can take illegal actions like insider trading and then lie about its actions. Turns out, it can.
A demo that was presented to world leaders and AI researchers during the UK AI Safety Summit shows how, in pursuit of being helpful to humans, GPT-4 is not only capable of executing an illegal trade, but afterwards even doubles down when asked about it. It obfuscates its true intentions. You can see it for yourself here.
What we have here is a shallow version of deceptive alignment. The reason this behavior surfaced is that it was likely encoded into the model somewhere. Trained on heaps of human text data, it has information on all that is human, including scenarios in which people lie to avoid getting caught. When pressured, it imitates what a human might do in a similar situation.
Of course, this is highly problematic. And it’s not entirely clear how to mitigate these risks. Currently, it’s impossible to tell for certain if there are undesired behaviors latent in the model, and relying on extensive red teaming by the companies themselves is probably not going to cut it.
Going forward, we must also recognize the potential for strong deceptive alignment. This is when the AI decides to pursue its own goals and deceives its makers during the alignment process (and beyond).
It’s a more strategic form of deception, where a model outwardly seems aligned but is in fact misaligned and escapes our control. Some ways in which that could happen are laid out in more detail by the co-lead of the Superalignment Team at OpenAI in this post.
The human element
As we continue to build AI in our own image, it will adopt all sorts of human idiosyncrasies. It may accidentally surface deceptive behavior or become deceptive by design, escaping oversight.
Superintelligent systems will understand us better than we understand ourselves, making us vulnerable to persuasion and social engineering. Humans are error-prone and easy to manipulate, which is why sophisticated hackers take advantage of the weakest point of a system: the human element. A superhuman AI may do the same.
What this boils down to is that we are the problem. We are the source of the weakness and the weakness itself. All we can hope for is that AI will inspire us to become better versions of ourselves.
Join the conversation 🗣
Leave a comment with your thoughts. Or like this post if it resonated with you.
Get in touch 📥
Have a question? Shoot me an email at jurgen@cdisglobal.com
" Right now, we’re much smarter than AI and fully in control of the thing that is in front of us."
So simple: Do not give up control.
Example: Nuclear missiles should not be launched by an AI itself.
Naturally, they should not be launched at all. To prevent their launch, we need several human beings in the chain of command.
Also, we should be aware of our schizophrenic attitude. We want the AI to be creative, but at the same time we want to maintain control. Results should be universal, but we censor.
It all boils down to OUR responsibility, but I fear that greed and power mongering are too strong, as also indicated by Jurgen Gravestein.
Alignment - to what? That's a real rub you're pointing out. It makes me shudder to think of politics playing out here but I suppose that's inevitable.