The Center for AI Safety just dropped a fascinating paper — they discovered that today’s AIs like GPT-4 and Claude have preferences! As in, coherent utility functions. We knew this was inevitable, but we didn’t know it was already happening.
This episode has two parts:
In Part I (48 minutes), I react to David Shapiro’s coverage of the paper and push back on many of his points.
In Part II (60 minutes), I explain the paper myself.
00:00 Episode Introduction
05:25 PART I: REACTING TO DAVID SHAPIRO
10:06 Critique of David Shapiro's Analysis
19:19 Reproducing the Experiment
35:50 David's Definition of Coherence
37:14 Does AI have “Temporal Urgency”?
40:32 Universal Values and AI Alignment
49:13 PART II: EXPLAINING THE PAPER
51:37 How The Experiment Works
01:11:33 Instrumental Values and Coherence in AI
01:13:04 Exchange Rates and AI Biases
01:17:10 Temporal Discounting in AI Models
01:19:55 Power Seeking, Fitness Maximization, and Corrigibility
01:20:20 Utility Control and Bias Mitigation
01:21:17 Implicit Association Test
01:28:01 Emailing with the Paper’s Authors
01:43:23 My Takeaway
Show Notes
David’s source video: https://www.youtube.com/watch?v=XGu6ejtRz-0
The research paper: http://emergent-values.ai
Watch the Lethal Intelligence Guide, the ultimate introduction to AI x-risk! https://www.youtube.com/@lethal-intelligence
PauseAI, the volunteer organization I’m part of: https://pauseai.info
Join the PauseAI Discord — https://discord.gg/2XXWXvErfA — and say hi to me in the #doom-debates-podcast channel!
Doom Debates’ Mission is to raise mainstream awareness of imminent extinction from AGI and build the social infrastructure for high-quality debate.
Support the mission by subscribing to my Substack at https://doomdebates.com