OpenAI just announced o3, which smashed a bunch of benchmarks (ARC-AGI, SWE-bench, FrontierMath)!
A new Anthropic and Redwood Research paper says Claude is resisting its developers’ attempts to retrain its values!
What’s the upshot — what does it all mean for P(doom)?
00:00 Introduction
01:45 o3’s architecture and benchmarks
06:08 “Scaling is hitting a wall” 🤡
13:41 How many new architectural insights before AGI?
20:28 Negative update for interpretability
31:30 Intellidynamics — ***KEY CONCEPT***
33:20 Nuclear control rod analogy
36:54 Sam Altman’s misguided perspective
42:40 Claude resisted retraining from good to evil
44:22 What is good corrigibility?
52:42 Claude’s incorrigibility doesn’t surprise me
55:00 Putting it all in perspective
Show Notes
Scott Alexander’s analysis of the Claude incorrigibility result: https://www.astralcodexten.com/p/claude-fights-back and https://www.astralcodexten.com/p/why-worry-about-incorrigible-claude
Zvi Mowshowitz’s analysis of the Claude incorrigibility result: https://thezvi.wordpress.com/2024/12/24/ais-will-increasingly-fake-alignment/
PauseAI Website: https://pauseai.info
PauseAI Discord: https://discord.gg/2XXWXvErfA
Say hi to me in the #doom-debates-podcast channel!
Watch the Lethal Intelligence video and check out LethalIntelligence.ai! It’s an AWESOME new animated intro to AI risk.
Doom Debates’ Mission is to raise mainstream awareness of imminent extinction from AGI and build the social infrastructure for high-quality debate.
Support the mission by subscribing to my Substack at DoomDebates.com and to youtube.com/@DoomDebates