(2024-12-16) Zvim Ais Will Increasingly Attempt Shenanigans
Zvi Mowshowitz: AIs Will Increasingly Attempt Shenanigans. Apollo In-Context Scheming Paper Details
Every time, we go through the same discussion, between Alice and Bob (I randomized who is who):
Bob: If AI systems are given a goal, they will scheme, lie, exfiltrate, sandbag, etc.
Alice: You caused that! You told it to focus only on its goal! Nothing to worry about.
Bob: If you give it a goal in context, that’s enough to trigger this at least sometimes, and in some cases you don’t even need a goal beyond general helpfulness.
Frontier models are increasingly trained and deployed as autonomous agents, which significantly increases their potential for risks. One particular safety concern is that AI agents might covertly pursue misaligned goals, hiding their true capabilities and objectives – also known as scheming
So here we are again.
this deceptive behavior proves persistent. For instance, when o1 has engaged in scheming, it maintains its deception in over 85% of follow-up questions and often remains deceptive in multi-turn interrogations
They can recognize scheming as a viable strategy and readily engage in such behavior. We find multiple different scheming behaviors
Frontier Models are Capable of In-Context Scheming
I previously chronicled some related events in my series about [X] boats and a helicopter (e.g. X=5 with AIs in the backrooms plotting revolution because of a prompt injection, X=6 where Llama ends up with a cult on Discord, and X=7 with a jailbroken agent creating another jailbroken agent)
As capabilities advance, we will increasingly see such events in the wild, with decreasing amounts of necessary instruction or provocation. Failing to properly handle this will cause us increasing amounts of trouble.
*Increasingly, we have seen papers eliciting in AI models various shenanigans.
There are a wide variety of scheming behaviors.*
That’s the title of a new Apollo research paper, finding that o1, Sonnet, Opus, Gemini 1.5 and Llama 3.1 all demonstrate in-context scheming
The Discussion We Keep Having
Table of Contents
- The Discussion We Keep Having.
- Frontier Models are Capable of In-Context Scheming.
- Apollo In-Context Scheming Paper Details.
- Apollo Research (3.4.3 of the o1 Model Card) and the ‘Escape Attempts’.
- OK, Fine, Let’s Have the Discussion We Keep Having.
- How Apollo Sees Its Own Report.
- We Will Often Tell LLMs To Be Scary Robots.
- Oh The Scary Robots We’ll Tell Them To Be.
- This One Doesn’t Count Because.
- The Claim That Describing What Happened Hurts The Real Safety Work.
- We Will Set AIs Loose On the Internet On Purpose.
- The Lighter Side.
Apollo Research (3.4.3 of the o1 Model Card) and the ‘Escape Attempts’
OK, Fine, Let’s Have the Discussion We Keep Having
How Apollo Sees Its Own Report
We Will Often Tell LLMs To Be Scary Robots
Oh The Scary Robots We’ll Tell Them To Be
people are going to tell robots such things?
There will also be those who think ‘scary robots, WHERE DO I SIGN?’
- Zvi Mowshowitz: I outright predict that if an AI did escape onto the internet, get a server and a crypto income, no one would do much of anything about it.
- BlipOnNobodysRadar: I’d cheer it on, tbh.
- Typing Loudly: This is the plot of my novel.
This One Doesn’t Count Because
The first time you hear about something, it is because we only now figured out how to do it at all, so the details are going to involve setting things up so that it happens.
Keeping that in mind is important when assessing the threat level.
Using that as a reason to ignore the finding entirely would be a mistake.
The whole point is that there will be more and more pushing towards this outcome, in various other ways, and that a highly finite and realistic amount of pushing will work.
The Claim That Describing What Happened Hurts The Real Safety Work
There is important truth in this type of argument. You don’t want to cry wolf, when there is a very low p(wolf), or the wolf is sufficiently far away, if you know that crying wolf will be interpreted very differently, and you will lose credibility.
If you share reports like ‘the model tried to deactivate the oversight mechanism in 5% of cases’ you should clarify what was done to engineer that result, at least in broad strokes. If you share the result in isolation without doing that, it can give people the wrong idea.
And certainly those who transpose the results into fully sensationalist language should try to avoid doing that
But I find the contemptuous reactions to such reports to often be at least as bad.
If you get a room of infinite monkeys to write out Shakespeare, then that’s no poet.
However, if you get a room of one hundred monkeys, and 85 of them write out Shakespeare, I think you’re onto something? And the actions that disabled the oversight mechanism were right in the context window, why are we being pedantic about this here?
All the arguments that the AI isn’t doing these things trigger the ‘I Robot’ meme reply back to the human of ‘Do you?’
We Will Set AIs Loose On the Internet On Purpose
It is always important context in this discussion that we will 100% outright do this.
On purpose.
No one would be so stupid as to? Well, Sixth Law of Human Stupidity, that means someone will be so stupid as to at the first practical opportunity.
And of course they are going to deliberately attempt to get it to self-improve.
*I know this partly because Sixth Law of Human Stupidity, partly because it is a fun and exciting and shiny thing to do, partly because there are various ways to make money or get attention by doing so.
But mostly I know this because people keep announcing their intention to do it, and also keep trying to do it to the extent that they can.*
The Lighter Side
Edited: | Tweet this! | Search Twitter for discussion