OpenAI just announced o3, which smashed a bunch of benchmarks (ARC-AGI, SWE-bench, FrontierMath)!
A new Anthropic and Redwood Research paper says Claude is resisting its developers’ attempts to retrain its values!
What’s the upshot — what does it all mean for P(doom)?
00:00 Introduction
01:45 o3’s architecture and benchmarks
06:08 “Scaling is hitting a wall” 🤡
13:41 How many new architectural insights before AGI?
20:28 Negative update for interpretability
31:30 Intellidynamics — ***KEY CONCEPT***
33:20 Nuclear control rod analogy
36:54 Sam Altman’s misguided perspective
42:40 Claude resisted retraining from good to evil
44:22 What is good corrigibility?
52:42 Claude’s incorrigibility doesn’t surprise me
55:00 Putting it all in perspective
Show Notes
Scott Alexander’s analysis of the Claude incorrigibility result: https://www.astralcodexten.com/p/claude-fights-back and https://www.astralcodexten.com/p/why-worry-about-incorrigible-claude
Zvi Mowshowitz’s analysis of the Claude incorrigibility result: https://thezvi.wordpress.com/2024/12/24/ais-will-increasingly-fake-alignment/
PauseAI Website: https://pauseai.info
PauseAI Discord: https://discord.gg/2XXWXvErfA
Say hi to me in the #doom-debates-podcast channel!
Watch the Lethal Intelligence video and check out LethalIntelligence.ai! It’s an AWESOME new animated intro to AI risk.
Doom Debates’ Mission is to raise mainstream awareness of imminent extinction from AGI and build the social infrastructure for high-quality debate.
Support the mission by subscribing to my Substack at DoomDebates.com and to youtube.com/@DoomDebates