Stephen Casper, PhD, is an incoming tenure-track professor of public policy at the Harvard Kennedy School.
Prof. Casper is relatively unworried about the threat of human extinction from rogue superintelligence, but he is concerned about a handful of companies wielding extremely powerful AI and enfeebling the public. That’s why he’s an AI governance hawk who advocates for taxes and regulation to keep frontier labs in check.
We agree that slowing down AI development would make the world safer. But you know his position is unique when he says he’d prefer to have less research on AI alignment!
Timestamps
00:00:00 — Cold Open
00:00:50 — Introducing Stephen “Cas” Casper
00:07:14 — What’s Your P(Doom)?™
00:13:07 — Crux: “The Intelligence Ceiling Might Not Be That High”
00:17:18 — Debate: Will Power Structures Prevent AI Takeover?
00:26:03 — Thought Experiment: A Data Center From 2126
00:33:41 — Cas’s Mainline Scenario: Idiocracy-Inspired Gradual Disempowerment
00:43:48 — Where Does Cas Get Off the Doom Train?
00:45:49 — Why Cas Is “Anti-Timelines”
00:52:26 — Poor Governance Led to Sycophancy, MechaHitler, Nudification
01:02:39 — What Cas Is Working On & Why
01:06:08 — Case Study: DALL-E 2 vs. Stable Diffusion
01:12:42 — Would Cas Support a #PauseAI Treaty?
01:16:35 — Keeping AI Safe Is a Process
01:20:51 — 84% of the Population May Get Left Behind by AI
01:30:30 — Why Cas Opposes Alignment Research
01:39:21 — Wrap-Up: The “Unsexy” Path to Lower P(Doom)
Links
Doom Debates episodes mentioned
Mike Israetel Returns — AI’s Gonna Kill Everyone vs. AI Will Make Everything Awesome
He Leads a Top AI Research Program, But He’d Hit the PAUSE Button — Kevin Zhu
Alignment is EASY and Roko’s Basilisk is GOOD?! — Roko Mijic
Andrew Critch vs. Liron Shapira: Will AI Extinction Be Fast Or Slow?
Transcript
Cold Open
Liron Shapira 00:00:00
You are somebody who supports a pause AI policy, correct?
Stephen Casper 00:00:04
Yes.
Liron 00:00:05
But at the same time, you would turn around and campaign against any alignment or super alignment research happening now, correct?
Stephen 00:00:15
Yes. I would like to halt the research enterprise around making super intelligent systems intelligently aligned with their creators’ goals.
Liron 00:00:25
I feel like I already have so much on my plate saying, “AI is going to be really powerful. We need to pause AI. The alignment teams aren’t doing enough.” And now you’re adding into the mix, “They’re actively harmful. They should do less alignment.” No, please don’t throw this into the mix.
Introducing Stephen “Cas” Casper
Liron 00:00:50
Welcome to “Doom Debates.” My guest today is Stephen Casper, but most people just call him Cas. Cas got his PhD in computer science from MIT, and he’s now an incoming public policy professor with a focus on AI safeguards and AI governance at Harvard. Specifically, it’s a new tenure-track assistant professorship in public policy at the Harvard Kennedy School. So I want you guys to pay attention to what it takes to get on “Doom Debates,” okay? You got to have MIT. You got to have Harvard. We really accept no substitutes here.
Liron 00:01:21
Cas is also a mentor at MATS, one of the top AI alignment and security education programs. He’s been involved with the UK AI Safety Institute and also the Center for Human-Compatible AI at Berkeley. He’s published research in safeguarding AI systems against harmful behaviors, including 61 papers on Google Scholar, 5,000 citations, an h-index of 28. He’s gotten the ML Safety Workshop Best Paper Award, and also in TMLR, which is a machine learning journal, he’s got outstanding paper finalist distinctions.
Despite his background being in machine learning, in recent years, he’s been focusing his research on AI governance, including his recent research residency at the UK AI Security Institute. Today, I’m excited to have somebody with a fresh perspective on AI governance and what the real mainline doom scenario is, because it’s going to be different than what you’ve previously heard on “Doom Debates.” Cas, welcome to “Doom Debates.”
Stephen 00:02:17
I’m so glad to be here. Thanks so much, and it’s great to have the chance to finally chat.
Liron 00:02:21
Likewise, man. That’s right, yeah. We’ve been seeing each other online, and we have never interacted in real life. Is that correct?
Stephen 00:02:26
Yeah, but it’s about time.
Liron 00:02:28
Yeah. And when did you first get into the whole field of AI safety or rationality or anything adjacent? You could say effective altruism.
Stephen 00:02:35
Yeah. My path kind of followed some of these contours a little bit. In my particular case, this all started back in 2018. I read the book “Superintelligence” by Nick Bostrom, and I think this has been a reasonably well-trodden path into thinking a lot about these problems. But I got pretty well convinced that working in the AI space as opposed to the biotech space might be a better pathway or a better opportunity to make the world a bit of a better place.
So I pretty quickly cast all my plans aside and threw myself headfirst into the AI safety morass as of 2018, and I got working on some problems involving AI alignment and interpretability kind of early on in that arc of mine. I went from being pretty mainstream early on, learning about things through a Bostromian lens and maybe a Yudkowskian lens a little bit, to thinking more critically in 2022 and 2023. And I was not the only one to do this: thinking about how the AI space might be more bottlenecked by our ability to govern well, as opposed to our ability to develop the types of risk management tools that are really going to be approximately foolproof.
Liron 00:03:56
Right. So you first started reading Bostrom, and also that took you to Yudkowsky back in 2018, correct?
Stephen 00:03:59
Yeah. You’ve heard it all. But I had my time with thinking through and just kind of marinating in rationalist-related and effective altruism-related ideas in that community a bit.
I think today, you could consider me EA adjacent, adjacent, adjacent. We’re still all about high-impact work, not necessarily all about the community, though.
Liron 00:04:21
All right, nice. And so that brings us to today, 2026, when you just got your PhD in computer science from MIT. Not bad. And your PhD focused on tampering with AI models’ internal computations to evaluate and safeguard them, correct? Is that kind of interpretability, mechanistic interpretability?
Stephen 00:04:39
A bit. It’s certainly very, very closely related. I had to have a whole section on this in my thesis, for example. It’s certainly mechanistic. It’s less interpretability. But this kind of work is my bread and butter. It involves model internals, as many agendas do. But the computer science core work that I focus on focuses on model tampering or model tampering attacks.
My thesis and much of my work throughout grad school was all about using model tampering attacks for evaluations and for defense. So if you want to evaluate a model using a model tampering attack, you will evaluate its ability to behave harmfully, even when there is some sort of adversarial process manipulating the internal states of that model or manipulating the weights of that model. And these are pretty powerful attacks, as you might imagine.
If you want to use model tampering attacks in order to make a model safer, we can train models under these types of tampering attacks and test them under these types of tampering attacks, too. And all of this can strengthen the toolbox by giving you some sharper tools for trying to elicit and assess harmful behaviors in LLMs.
I’m pretty excited about this work still, and the main technical leg of my agenda going forward is focused on trying to make AI safeguards that are robust enough and that run deep enough in order to work very well for open weight models, even when some adversarial process or adversarial user might be trying to fine-tune that openly available model on harmful data.
If we’re able to make some more progress on this in the next few years, if the field is able to 10X the type of tamper-resistant robustness that we can get from open models and make them that much safer, I think — and this might be a good hot take to start on — it might be the case that by then, when we have a good open model safety toolbox, the technical problems underpinning AI safety might be sufficiently well solved such that I might leave technical safeguards research and just go into governance 100%, because right now I’m only about 50%.
Liron 00:06:37
Got it. Okay. Yeah. So I think a big focus of the conversation is just comparing our mainline scenarios because we both read a lot of the same stuff and respect a lot of different smart people who are opining. None of us are dismissive of any particular camp. Is that fair to say?
Stephen 00:06:53
I think that’s fair to say with maybe a couple of exceptions, but most of these exceptions are shitposters on Twitter or something.
Liron 00:07:04
Exactly. And yet, even though we both seem like we’re in the same milieu and we’re both reasonable people, it seems like we have a discrepancy in our P(Doom). You ready to hash this out?
What’s Your P(Doom)?™
Stephen 00:07:14
I’m very happy to. P(Doom). P(Doom), what’s your P(Doom)? What’s your P(Doom)? What’s your P(Doom)?
Liron 00:07:22
Stephen Casper, AKA Cas. Can I call you Cas?
Stephen 00:07:25
Sure thing.
Liron 00:07:27
What’s your P(Doom)?
Stephen 00:07:29
I think my P(Doom) lies somewhere between 5% and 10%, which I have found is a really interesting type of P(Doom) to have socially, because depending on what type of room I’m in, I’m either a total doomer or I am one of the most anti-doom people sometimes.
But at the end of the day, I think that a P(Doom) of 5% to 10% is plenty reason, plenty enough for the doom to dominate the impact calculus. And so long as you believe that meaningful actions can be taken to move the needle within that neighborhood, then preventing existential risk lies distinctly as our biggest priority.
Something we’ll probably talk about at some point might be some of my reasonably warm takes about how even if we care about doom principally, that doesn’t necessarily mean that we shouldn’t be engaging in what we’d consider to be near-termist or myopically unimpactful or relatively unimpactful work.
Liron 00:08:34
Fair enough. Well, I don’t disagree with anything you’ve said. Yeah, 5% to 10% chance is incredibly alarming and we should shape policy around it. But it’s also a little bit patronizing to people like me who are saying, I have a 50% P(Doom) in that ballpark, which is significantly higher than just 5% to 10%. It’s an order of magnitude higher.
So I will say, okay, you’re still in the sane zone. The minimum qualification to have a P(Doom) that I consider to be in the sane range is 5% to 10%. Okay, so I don’t think you’re being insane. I respect what you’re saying. But it’s also a bittersweet situation for me when you’re being like, “Yeah, Liron, you’re 5% to 10%, you’re totally under it.” And I’m like, “Really, just 5 to 10%?”
Because if it’s 5 to 10%, I start to become an accelerationist myself and be like, okay, yes, we’re doomed with 5% to 10% probability, but there’s so many other problems that might extinct us, like nuclear doom. The probability of nuclear doom in the next century, I think, is at least 5% to 10%. And I actually think that a good AI could save us from nuclear doom. It could help us coordinate to not blow ourselves up. So the moment you tell me P of AI doom is only
Crux: “The Intelligence Ceiling Might Not Be That High”
Debate: Will Power Structures Prevent AI Takeover?
Thought Experiment: A Data Center From 2126
Cas’s Mainline Scenario: Idiocracy-Inspired Gradual Disempowerment
Where Does Cas Get Off the Doom Train?
Stephen 00:45:03
I think I’m following. I’m following the stop train story to all of this.
Liron 00:45:08
Right. Because getting off means you think that’s a reason to not worry. So you’re not worried yet. Yeah, but this actually sucks because somebody who gets off at the earlier stop still might have reason to get back on. So I got to rethink the analogy, but you get what I’m saying.
Stephen 00:45:20
I think we’re on the same page.
Liron 00:45:22
You know what? I think I got the analogy. You just started at a later station, so you’re actually getting on at a later station, and you’re still riding to Doom Town, but you just didn’t ride — you were a late pickup.
Stephen 00:45:35
If you’ll have me, I’d love to be in Doom Town, yeah.
Liron 00:45:38
Exactly. All right, so we solved that, and you say that’s an accurate characterization where you do think that this short doom train, the second half of the doom train, is where a lot of action is.
Why Cas Is “Anti-Timelines”
Stephen 00:45:49
I think so. And I talk about this with people a lot. You are as familiar as I am with this. When people ask you, “Oh, what are your timelines? Are you a short timelines person? Are you a long timelines person?”
My answer to this used to be that I’m a short timelines person, but the error bars are huge. Now my answer to it tries to be a little bit more subversive. Lately, I call myself an anti-timelines person, which is related to what you’re saying.
And the reason I’m an anti-timelines person is partly to emphasize that this whole AI risk management thing is a never-ending problem. And that’s not to say that there won’t be a period of acute risk, a period in which we’re getting accustomed to rapidly intensifying and developing and integrating technologies where the probability of us triggering some sort of catastrophe at any given year is going to be elevated compared to the future if we survive that long.
But I do not think that the period of acute risk is going to be particularly remarkable in some sort of millenarian or singularity-flavored way. We can think of a curve over time where the y-axis is the probability density and the x-axis is time, and we’re talking about the probability density of going extinct in that particular year. So this curve will eventually integrate to one if we take the x-axis out forever.
Liron 00:47:16
But if we play our cards right, we push it to a lump of probability a quadrillion years in the future. That’s how you do it right.
Stephen 00:47:23
Yeah. So the difference between me as an anti-timelines person and other people who I think are more inclined to traffic in talking in terms of timelines is whether or not this curve peaks relatively low and gradually goes to zero, or whether this curve peaks particularly high.
Liron 00:47:44
When you said peaks low and goes to zero, though, I’m remembering what you said about how it has to add up to one. So I’m saying a quadrillion years from now, it has a big spike — the latest possible time that we’re doomed.
Stephen 00:47:55
Well, I guess I could, but I don’t think this reflects most people’s beliefs if you’d aggregate them. So I think the doom curve — call it doom curve? Maybe you’d like that — but I think it’s going to peak low.
Liron 00:48:08
Okay.
Stephen 00:48:09
Instead of peaking particularly high. I don’t think that getting through a particular decade in human future history is going to be a really good indicator that we’re going to be in good shape from then on.
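The contrast between a doom curve that peaks high and one that peaks low can be made concrete with a small numerical sketch. Everything in it is invented for illustration: the 1,000-year horizon, the 0.05 per-year mass in the first decade, the far-future lump, and the function names are all assumptions, and neither speaker gave numbers. Both curves sum to one; they differ in how much surviving the first decade tells you.

```python
import math

HORIZON_YEARS = 1000  # illustrative horizon, not a forecast


def peaks_high(horizon: int = HORIZON_YEARS) -> list[float]:
    """Half the probability mass in the first decade, the rest in a far-future lump."""
    curve = [0.0] * horizon
    for year in range(10):
        curve[year] = 0.05  # 10 years * 0.05 = 0.5 total
    curve[-1] = 0.5  # the "latest possible time" spike
    return curve


def peaks_low(horizon: int = HORIZON_YEARS) -> list[float]:
    """The same total mass spread evenly over every year."""
    return [1.0 / horizon] * horizon


def conditional_doom(curve: list[float], survived: int, window: int) -> float:
    """P(doom within the next `window` years | no doom in the first `survived` years)."""
    remaining = sum(curve[survived:])
    return sum(curve[survived:survived + window]) / remaining


if __name__ == "__main__":
    for name, curve in [("peaks high", peaks_high()), ("peaks low", peaks_low())]:
        first = conditional_doom(curve, survived=0, window=10)
        second = conditional_doom(curve, survived=10, window=10)
        print(f"{name}: first decade {first:.4f}, second decade given survival {second:.4f}")
```

Under these made-up numbers, the peaks-high curve goes from a 50% chance of doom in the first decade to zero in the second once the first is survived, while the peaks-low curve barely moves (about 1% in each), which is the sense in which getting through one decade is a weak indicator.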
Liron 00:48:25
Yeah, I do see what you’re saying, though. Let me just rephrase in layman’s terms, because I suspect some non-mathy people might have dropped off.
Stephen 00:48:33
Sure.
Liron 00:48:33
I think what you’re saying is that you kind of think this is a critical time — this is an above-average time where we better get things right in terms of aligning AIs to the people working on them. Don’t have a runaway AI. You think there’s some concern to avoid that, but it’s not so huge overshadowing everything because in your mind, if anything, there’s a bigger concern after we get that right.
Stephen 00:48:59
Well, I don’t necessarily think it’s a bigger concern, but over time it’s a bigger concern.
Liron 00:49:04
If you add up the next 50 years, yeah.
Stephen 00:49:04
Yeah. Maybe the simplest way to put it is that I think nuclear is still a reasonable analogy. The ‘60s sucked from a risk standpoint.
Liron 00:49:11
Yeah.
Stephen 00:49:11
But the fact that we made it through the ‘60s does not mean that we should take comfort in being kind of safe from nuclear risk, because the world is always going to be vulnerable to nuclear risk forever. And it’s not because of technological alignment problems, it’s because of deeply human and institutional problems.
Liron 00:49:27
All right. Yeah, I understand your perspective. And so there’s still two things I want to dive further into the Cas worldview and not even debate it per se — just really understand your wisdom, basically. Because obviously you’ve put many years into this. You’ve thought deeply about it, and it’s really hard to find anybody who’s done more research or thought more deeply about it.
So I want to make sure we’re getting the riches here of your hard work. When it comes to doom scenario, just to review: gradual disempowerment, one problem could be a larger version of all the ills we’re seeing right now of people being so lazy and incapable of doing anything. Couch potatoes who have no achievements because it’s all meaningless and there’s no reason to get up in the morning. That’s one problem that you’re concerned about.
And then another problem is economically — it’s a precarious situation because if the laws ever change, if the ruling class, whatever the ruling party was, ever soured on us, we’re screwed because we have no power. Is that kind of a good overview of the bad gradual disempowerment scenario we’re worried about?
Stephen 00:50:23
I’m following you. I’m with you here.
Liron 00:50:26
So that represents a situation that you think is a big chunk of the probability mass in flavor that we’re trying to steer away from. That’s kind of what you focus on. You are trying to steer humanity away from something like that. Is that fair to say?
Stephen 00:50:42
Yeah. And I’m glad we’re getting to this part of the discussion, which arguably maybe should be most parts of most discussions here, because the P(Doom) doesn’t matter so much as the delta on that P that we think is possible to induce by taking different actions.
It’s not where the probability of doom baseline is. The most important thing to talk about is how much can we affect that and how can we make that go lower? And we’ve been talking a lot about systemic problems and institutional problems and power today. I hope it’s okay that I’ve steered things that way.
Liron 00:51:19
Yeah.
Stephen 00:51:21
The things I’m most optimistic about, or least pessimistic about when it comes to reducing the temperature on global risk, the risk production system, and the P(Doom) — I think that most of the best things that we can do involve putting checks on power and preparing ourselves to kind of fight this never-ending battle of maintaining more ecosystemic hygiene in the AI space forever.
Liron 00:51:45
In terms of what I would debate you on, we could have that whole debate of why you’re not that worried about the next 10 years. The Yudkowskians, a lot of people I know and myself would be like, “I don’t even think we’re going to get the AI aligned.”
But there’s certainly people I’ve seen agree with you, like Roko Mijic, friend of the show, of Basilisk fame. He’s thinking these days that aligning will be kind of easy. And I think Andrew Critch came on the show a while ago and said that he thinks alignment will be easy for that reason. Have you seen their positions? Do you feel like it’s kind of similar to you on that front?
Stephen 00:52:18
I won’t speak too much to how much I agree with Roko or Critch, although I might be able to see Critch this week. I should ask him about it.
Poor Governance Led to Sycophancy, MechaHitler, Nudification
Liron 00:52:26
Cool.
Stephen 00:52:26
But I can try to give the SparkNotes for, and an example for, one of the reasons why I think that alignment is going to be sufficiently solvable in practice. And “sufficiently” is an important hingey word here, which I’ll explain in a second.
Some of my thoughts here are kind of new as of last year. I think 2025 was a really interesting one for me, thinking about where our risks are going to come from. And some of the events of 2025 that got me thinking about what I’m about to say were the incident involving GPT-4o being excessively sycophantic to the point of suicide coaching sometimes, the Grok MechaHitler incident, and at the very end of 2025 and the beginning of 2026, there was the Grok nudification incident. And I think these are all really, really interesting problems.
Liron 00:53:16
Grok nudification incident was just making it easy for any user to ask for a nude version of anybody on the timeline?
Stephen 00:53:23
Yes. For sure.
Liron 00:53:25
Okay.
Stephen 00:53:25
So these problems were caused and eventually patched by very simple, very silly things that we had a really good understanding of in the ML research space by the end of 2023.
The ChatGPT sycophancy thing — we understood pretty well the sensitivity of some models to fine-tuning and the propensity of some models to sycophancy certainly well enough in the research space by the end of 2023 to be able to predict and avoid incidents like this. And every other company and every other model seemed to do a reasonably good job at preventing the type of failure mode that emerged last April involving GPT-4o.
With the Grok MechaHitler thing, we understood the frivolousness of how models generalize sometimes and how they overgeneralize sometimes to say some pretty silly things when we don’t red team them well enough to make sure that they don’t ever say those things. We understood all of that pretty well by the end of 2023 as well.
And then the Grok nudification stuff that happened in December and January — that’s not even an ML problem. That was just a problem that was completely foreseeable and completely fixable by just not giving users very permissive permissions to undress people on the website.
And I think these are all really interesting examples of emerging real-world AI failures, because none of these things involved any of the things that academics like to research or that we like to talk about, that are interesting enough to discuss in lots of conversations over drinks. None of this involved adversarial attacks like I like to study. None of it involved a treacherous left turn or deceptive alignment or even an agent accomplishing a complex task. It was just straight up sycophancy, straight up praising Hitler, and a model being set up to be a tool for some people to directly hurt other people.
And it’s just really kind of frustrating, I think, to be in the space of working on frontier AI safeguards but to watch some of our failures come from problems that had been foreseen and solved for years.
Liron 00:55:27
What’s the takeaway here? What does this tell us about humanity’s future that you see these incidents?
Stephen 00:55:27
I think it tells us that solving the alignment problem, being able to build the toolkit in order to make AI systems safe, is only going to go so far. And the precedent so far is that most of the most concerning incidents that we’re seeing from AI systems — I won’t say devastating, not yet, not on a large scale — but most of the most concerning incidents are foreseeable and preventable.
So for every catastrophic thing that an AI system does that happened as the result of some unforeseeable consequence of a benevolent person exercising technical best practices, I think there’s going to be 10 or 100 that come from someone just not exercising best practices or wanting to cause harm.
Liron 00:56:13
I think I understand more about your position, why your position is internally consistent, because I was going to ask: well, if we’re such boneheads about mundane alignment, why are you skipping past the individual alignment problem, the ability to have your AI do what you want and not go rogue and not decide it wants to take over the world because it’s interpreting things the same way that Claude Code is like, “Oh, you didn’t want me to delete your database? Wait, but I had a reason why I wanted to do it.”
Even though it has the best intentions, it’s taking over the world. So I was going to say, what makes you think that we’re not going to accidentally do the Yudkowskian thing, or even the Bostrom thing — the AI that goes rogue and does a lot of damage before we align it. And I feel like your internally consistent view is: “Well, we’re going to get retries.” Yeah, we’re stupid enough, we’re boneheaded enough that we’re going to screw it up, but then we’ll get retries and we’ll fix it and that’s going to be patched up. That part of the problem, yeah, some people will probably die or suffer some consequences, but we’ll patch it up and we’ll move on. Is that kind of a fair characterization?
Stephen 00:57:10
I do think that’s fair. I do think that most of the scenarios in which things go bad, they go bad foreseeably, or they go bad intentionally.
Liron 00:57:18
So I think that’s the crux — you think we’re going to get retries, and I don’t think we’re going to get retries because I think superintelligence is going to be unretrialable. It’s going to be game over.
Stephen 00:57:27
Well, one thing I will say is even in a situation in which it’s lights out pretty quickly, I still think that’s going to be the result of a very foreseeable consequence and someone just not using the toolkit and not doing the right thing.
If we have some sort of rogue AI system that takes over the world, hacks into everything, does all that kind of stuff — the scenario kind of outlined in “If Anyone Builds It, Everyone Dies” — I think if we have that kind of scenario, it’s not going to come out of the most responsible AI company implementing the best safeguards they can on their frontier system. It’s just going to come from some random rogue actor with an open-weight system who doesn’t care or wants to cause terrorist-like damage, or it’s going to come from Elon Musk.
Liron 00:58:14
But you are making two separate arguments here. Your argument about the responsible actors going to solve the problem — I actually think the bigger underlying crux is it’s okay to make little mistakes here and there because if there’s a mistake, you just have enough time to hit stop and try again. I still feel like out of everything you’ve said, that is still the main crux. And we can put a pin in the crux, but I think we’ve identified it.
Stephen 00:58:37
Yeah. Emphasis on how there are kind of two things underneath what we’re discussing here. I’m with you on how I think that the world is probably more forgiving than you think it is.
But I also want to emphasize that that point is separable from the points that we might be making about the ratio of reprehensible usage that causes catastrophes to the systemic effects that cause catastrophes to the genuine mistake from a responsible frontier group that causes catastrophes.
Liron 00:59:16
Great. So now that we’ve gotten your world model on the table and your mainline scenario and your mainline doom scenario and what you think about superintelligence — oh, there’s another random little piece of context in terms of your timelines. I guess you didn’t want to talk about timelines anymore, but the AI 2027 timelines or what’s on Metaculus, would you just say that those seem like a reasonable ballpark guess? Because I would.
Stephen 00:59:36
As in catastrophe by the end of the decade?
Liron 00:59:39
No, not catastrophe timelines, just the idea of how AI is going to go. The famous definition of AGI is something like it can be a drop-in replacement for a human worker in all jobs, or let’s say 99% of current jobs. That kind of milestone, which interestingly enough, we actually haven’t reached today. You can’t actually walk away from your desk and be like, “Okay, AI, take over.” Most jobs actually, surprisingly, don’t let you do that today.
But the idea that they will sometime in the early 2030s — is that a reasonable guess?
Stephen 01:00:06
There’s a way in which I’m on board with this, and there’s a way in which I’m not, and it might not be super important to our current discussion. But the way in which I’m on board with this is based on what I think AI systems could be able to do, what they have the intelligence to be able to do if integrated well.
But one of the few ways in which I sometimes agree with the “AI as a normal technology” camp is in its emphasis on how gradual and messy and long of a process really integrating AI systems into doing tons and tons of things globally really is. So I think that we are going to see a delayed onset of macroscopic or global-scale, huge transformative impacts of AI systems with respect to when they are first able to automate a bunch of stuff. The gap between when they can automate tons of stuff and when they do will be, I think, years.
Liron 01:00:59
So you think you know something that the AI 2027 and the Metaculus people don’t? Because from your perspective, it sounds like you think that they’re undershooting a more likely timeline.
Stephen 01:01:08
An impact-focused timeline, I think yes. I think despite being an anti-timelines person, you could say that my timelines are probably meaningfully stretched out more than the AI 2027 camp. How politically important this difference is, I don’t think it’s crazy.
Liron 01:01:24
I think it makes sense compared to — the AI 2027 people, as far as I can tell, are like me, most of them. I’m pretty sure Daniel Kokotajlo is, and he was one of the most prominent authors there. And I’m pretty sure he believes in massive power of superintelligence.
And I do think that there’s a connection between expecting AI to continue whipping us harder and harder, beating us on tests harder and harder pretty quickly. I think that is correlated to the idea that you think it’s got a long way to run. Not perfectly — you can imagine, “No, it’s just going to beat us at everything and then stop.” But I think there is a connection, and so you saying that you don’t think a data center dropped in from 100 years from now would be that powerful — I do see a connection between that and you saying, “Hey, I bet the timelines are going to be longer for AI to progress.”
Stephen 01:02:06
I think so. And also a connection between how I guess I expect the world to be a little bit more forgiving than you do.
Liron 01:02:11
All right. So now we’ve covered a lot of detail about your mental model, a lot of the nuances. I feel like I can probably pass an ideological Turing test, where I have to state your position in many ways. Not perfectly, but we got pretty good color.
So I think that’s enough context where now we can go to some of your initiatives. This kind of gives us context for why you’re doing what you do in terms of steering your focuses and some of the papers you publish and the initiatives. I’ve got some of these written down in my notes. Or do you want to pick what you want to talk about first in terms of your pushes that you’re doing?
What Cas Is Working On & Why
Stephen 01:02:45
Sure. And there might not be a lot of controversy between us on some of these things. But my favorite boogeymen in the world are people working at big tech companies on AI who are trying to make a positive difference. I don’t necessarily think they are.
So one way of describing my positioning here is: the superintelligent optimizers that I’m the most worried about from a systemic perspective and from a long-term perspective are the companies and countries that are in charge of very powerful AI systems, as opposed to the AI systems themselves via some sort of rogue scenario.
And that’s a way I really like to put it because I think that historically and currently there are portions of the AI safety and risk management and research communities who spend their time fretting about the systems themselves and not the very powerful structures around the systems, in the form of mostly today companies.
Do you know that meme from “Friends” where Phoebe is talking to Joey and she’s trying to make a point to him? Phoebe spells out the point to him piece by piece, and Joey repeats it, and he understands all the parts, and then at the end he just spectacularly draws the exact wrong conclusion.
Liron 01:03:56
Yeah, we’ll put it up on the screen.
Stephen 01:03:57
There we go. I sometimes feel this way a little bit about people working inside of AI companies to solve the alignment problem by doing so within a company. It’s like they understand the danger of very powerful optimizers, but they just at the very last moment fail to apply that argument in the way I would see appropriate to the companies inside of which they are working or they’re hoping will keep humanity safe.
Liron 01:04:28
Yeah. Just repeating back here — it’s because you don’t think the mainline doom scenario is a single runaway FOOM that’s rogue and uncontrollable, which by the way I think is likely. But you don’t think that. You think the number one culprit if the future goes really bad is we can kind of pin the blame on one or more of those organizations that are controlling the most powerful AI.
So today that would be, I guess, OpenAI, Anthropic, Google. Those would be the three, although maybe you would give X a special boost because maybe they’re in fourth place but they’re also more reckless.
Stephen 01:04:59
Yeah, they love to move fast and break things probably a little bit more than the others. But yeah, we’re on the same page. You’re following me.
Liron 01:05:05
So you spend a lot of effort trying to tell them how to behave better.
Stephen 01:05:09
A little bit. Or being in favor of putting more checks and balances on what they’re doing.
And a really important pillar of how I think about what’s important in the AI space that’s adjacent to this is that I do not think that you can make AI safe by making safe AI. And making more aligned, safe AI often, maybe even more often than not in some futures, is going to exacerbate misuse risks or put us into perverse Jevons paradox situations more so than it would keep us safe from benevolent actors just making honest mistakes.
And we can talk a little bit about this, and this brushes up against some of my recent research. Because since 2022, but especially in the last year, I’ve done a lot of work to study the AI non-consensual deepfake ecosystem. And I know it’s not an extinction-scale ecosystem itself, but it’s a pretty important one.
Case Study: DALL-E 2 vs. Stable Diffusion
Liron 01:06:08
Yeah. Are we talking mostly porn? Or what types of deepfakes are we talking about?
Stephen 01:06:12
Yeah, non-consensual porn, especially the stuff that is used to target children.
Liron 01:06:17
Got it. I know this is a little dark, but it’s funny to me that Steve Jobs famously said something like, “A computer is the bicycle of the mind.” But I think a lot of what a computer could potentially do is be a bicycle for the dick.
Stephen 01:06:32
Is that going to be the intro sex segment? Is the episode right about to start right now? Oh, no.
Liron 01:06:42
Yeah, exactly.
Stephen 01:06:42
It might be. There’s that trope in technology where technological innovations are led by internet creeps who want the porn.
Liron 01:06:51
Yeah, as long as it’s consensual, I should’ve said. Well, look, it is worth drawing a boundary, sorry, because I kind of said it at the wrong time because you were talking about non-consensual deepfakes.
Stephen 01:07:00
Yeah.
Liron 01:07:00
That’s not what I meant. But I think there is a rich area. I don’t think that we’re anti-porn. And I think that we would agree that there’s a rich vein of “Oh, can I have a virtual reality porn experience? Can I make my own deepfakes?” As long as they’re not violating anybody’s rights. I think there is some value add there. And OpenAI was even saying, “Hey, we’re going to offer that.”
Stephen 01:07:19
We can get that in the open and clear right now. I don’t personally feel particularly anti-porn in general. Although I think that the AI deepfake porn ecosystem is one that is just pretty deeply unhealthy, and there’s a pretty big difference between the ecosystem right now and one that I think would be net positive. And the business models behind it are not very aligned.
So we can probably get on the same page about that now, but I would actually love to jump back a few years and talk about what I think we can learn from the history of deepfake porn and what it’s going to say about the future of AI in general.
And I’d love to start in April of 2022, which was before the chatbot dynasty kind of came in. This was back when GPT-3 was the most advanced AI system on Earth.
Liron 01:08:09
Yeah, and it’s funny to think back to this. Sorry to interrupt, but it is kind of weird and poignant for me to think. There was a whole time when you could get generated images, but they were really careful not to let you generate a face. “Oh my God, what happens when you can generate a face?”
Stephen 01:08:23
Yeah.
Liron 01:08:23
And then smash cut to six months later, it’s like, okay, you can do faces now.
Stephen 01:08:28
And that’s kind of what things were like in April of 2022. So this was the month that OpenAI released DALL-E 2, and DALL-E 2 was leaps and bounds better than anything before it in its ability to dynamically and photorealistically generate images.
But it was safeguarded very, very well, including being trained on filtered data and having prompt-based and filter-based safeguards to try to make sure that this tool, when people accessed it, could not be used for deepfakes, especially non-consensual intimate porn or deepfakes.
And I don’t often say this, but when it came to DALL-E 2, OpenAI absolutely ate. They did a fantastic job. They wanted to make a powerful system, they wanted to make it safe, and they absolutely nailed it. They released a technical report talking about all sorts of different strategies that went into this — red teaming, training safeguards, deployment safeguards. They deployed the model through a closed API. They banned accounts all the time.
And to this day, having studied this space and having looked around for any evidence of this, I have never found any hint or shred of evidence that DALL-E 2 has ever become used by any sort of internet community with any sort of consistency for creating non-consensual deepfake content or just creating even news deepfake content. DALL-E 2 was safeguarded like Fort Knox.
And historically, I think this was even a wonderful example of successful safeguards compared to the rollout of other systems like ChatGPT later. Remember the Sydney incident in January of 2023?
Liron 01:10:00
Right.
Stephen 01:10:00
I think DALL-E 2 was historically a wonderfully safeguarded system, and that was awesome until it didn’t matter at all anymore.
Liron 01:10:07
Huh.
Stephen 01:10:07
So in August of 2022, we got Stable Diffusion, which was released, and it was a peer to DALL-E 2 except — and it was a copycat technology — but the only differences were that it was openly released, so anyone could access it, and it was trained pretty indiscriminately on internet data, including a lot of porn.
And the rest has kind of been history. There’s just been more models, better models, more data sets, more infrastructure, more community, more guides. And at this point in 2026, the deepfakes are so easy to do, including non-consensually.
I have a lightning talk that I give sometimes about this. In the middle of the lightning talk, I say a true statement, which is that if you wanted to and you knew the right websites to go to, you could photograph me and nudify me non-consensually by the end of that speech, which is just kind of where we are today.
Liron 01:10:50
Twitter would’ve done it for a few months, right?
Stephen 01:10:54
Probably. We’re hearing about mass school undressings. There was a 26,000% documented rise according to the IWF between 2024 and 2025 in video deepfake content. It’s just a huge, huge problem.
So we have a case in which we had the advent of a technology which was very responsible and safe, and then the cat got out of the bag pretty quickly. And the existence of an initial pioneer technology and its safety at the end of the day was useful, but did very, very little to make the ecosystem very safe overall. Things just diffused. The open model ecosystem created so many useful resources for non-consensually nudifying people.
Liron 01:11:41
What’s the important lesson for humanity to learn here?
Stephen 01:11:41
The important lesson is that systems like DALL-E 2 and Sora and Veo being safe is great. I’m glad we live in a world in which those systems are not as misusable as Grok is. But the existence of those systems and them being safe has not made the ecosystem safe.
And I think this is just what AI safety is going to look like in the real world. It’s not going to be a solvable problem. It’s not going to be something that we can really achieve by making the first or most powerful system on the scene in some domain a good one. That’s not a bad thing in and of itself, but we’re misunderstanding the problem if we think that we can make AI safe by making safe systems, because things are probably just going to proliferate.
I think that if there is a bad thing that an AI system can do and be capable of, good, sober strategies for risk management are strategies that assume that a system with those capabilities will eventually exist. And don’t try to prevent it, but try to make incidents fewer and further between and more endurable.
Would Cas Support a #PauseAI Treaty?
Liron 01:12:42
To take the analogy to people who are worried about superintelligence, which even though that doesn’t seem to be your main thing — you’re skeptical of it — you still granted me a 25% chance. So does that mean you would agree with me that we should do an international treaty to pause development of superintelligence?
Stephen 01:12:59
I would be very happy to see that. I don’t have a pause icon next to myself online, but in every other way, I’m pretty in line with pausing or more generally slowing things down. I don’t think this is currently within the political Overton window, but I think we should be prepared for some sort of big incident to really change the parameters of the political debate. Treaties to slow things down would probably be pretty healthy, I think.
Liron 01:13:28
The reason I segued into that is because what you were saying before — “Hey, the cat tends to get out of the bag.” I’m happy to agree that the cat will tend to get out of the bag, at least on a technological level. I think that any time you make a system which is an AI that’s behaving within boundaries you like, you have now made something that’s very close to a powerful AI system that doesn’t behave within boundaries you like. I think you and I are on the same page there.
And that’s why I segue to pause AI. If you make an AI that’s just superintelligent, you have now come close to making, if not already gotten there, a superintelligent AI that is going rogue on humanity. And that risk is why you and I both would like to see a treaty to pause AI soon. Now, soon — how would you phrase it?
Stephen 01:14:11
Yeah, I think we’re on the same page. For example, I think it is good and right to expect and prepare for a world in which Mythos on steroids exists in a few years, some system from somewhere like that. And to not think about our win condition as trying to keep the lid on something like that forever, but to try to think of the win condition as making those events less frequent and making them more endurable.
And I think that there’s a good analogy between trying to manage AI risks this way, or preparing to manage AI risks this way—
Liron 01:14:46
Sorry, my next question is going to be the type of pause AI you advocate for. So answer whenever you want.
Stephen 01:14:48
Well, I’d settle for any slowdown. Do you have any further clarification you’d like?
Liron 01:14:56
If you were dictator of the world, would you press the pause button now immediately, or would you just set up conditions for pause and give some guidelines of “Okay, we all got to pause when this thing happens”? How would you play it?
Stephen 01:15:08
If I was dictator, I’d be happy to slow things down 10X for the foreseeable future. But I am sympathetic to the rock climber argument — that it’s better if the rock climber makes a mistake or their gear fails while they’re a few feet off the ground, compared to when they’re high up. So I understand how something like this could go wrong, but I think our best bet is not inviting disruption.
Liron 01:15:35
So your approach to pausing AI would be: okay, all the research can continue, but somehow do it slower. I don’t know if you can instruct the world to go slower, but maybe you can say only a few authorized organizations can research.
Stephen 01:15:47
I think my version, or what I envision as the equivalent of instructing the world to go slower, is, for example, putting more burdens and requirements on AI companies, reducing the access to resources that those AI companies have, maybe taxing those AI companies more so that it’s harder for them to move fast and break things because they have less money.
And also making those companies more liable for things that could potentially go wrong, so that they’re more cautious and they have to deal with more of the consequences of their actions along the way to something that could be deeply systemic or catastrophic in the future. I think that all manner of checks and balances, and just trying to regulate AI like we regulate all sorts of things, are useful.
Keeping AI Safe Is a Process
Stephen 01:16:35
So there’s one thing I was going to say earlier which I think is also useful to say now — which is that there’s a useful analogy between how I like to think about what winning looks like, or what minimally losing looks like, and how society has already adapted to just manage a bunch of other harmful things.
We were talking about AI porn earlier, but just regular CSAM or regular non-consensual porn — that’s not something that it’s possible to prevent outright. Or just crime. Crime is not something that society can have any realistic hope of just making sure never happens.
But what we try to do is we try to play the game of Whac-A-Mole forever. And I think that’s okay. It’s okay to be a Whac-A-Mole apologist, where we hold bad actors accountable, where we try to take things down and mitigate the amount of harm that they can cause when they go up, and where we try to just make sure the space is one that is better managed and more accountable and healthier. And that’s what I think winning looks like.
In the same way that I think most of the scenarios by which things go wrong in the future are deeply, abjectly systemic, I think that most of our solutions are going to be pretty systemic and ecosystemic. And maybe if that’s the best we can do, if we can kind of treat AI like we treat crime and like we treat CSAM in the real world, maybe that’s okay.
Liron 01:17:50
So tell me if I’ve got your perspective correct here in terms of ideological Turing test. First of all, I would characterize you as you’re not really worried about a FOOM. You think it’s possible, but you’re just not putting a lot of mind share worrying about a FOOM, correct?
Stephen 01:18:04
I’m not super worried about a FOOM, and I’m suspicious of FOOM discourse, but we can put a pin in that.
Liron 01:18:09
FOOM is rapid recursive self-improvement, where you come back two days later and the AI is already way more intelligent than before and you’ve lost control. That’s the kind of stuff we talk about when we talk about FOOM. And you’re like, “Yeah, I just don’t really spend that much time trying to optimize away from FOOM because I just don’t see it as likely.”
So then you kind of have an incrementalist flavor. Your default expectation is: yeah, we’re going to keep developing more and more AI, and it’s going to take years or decades, and it is going to gradually get more powerful until some endpoint, whenever that happens, some level of superintelligence that you’re not confident about.
But because you see everything as being kind of incremental in nature with enough time to potentially reverse things or hit a stop button, because of that, you’re like: look, we’re already in the singularity now. This is representative of the singularity.
And that’s why you did a deep dive into the deepfakes. You’re like, “Look, the way that we manage this, it’s just going to be steps like this. Just one step after the other.” And so it’s really important to get used to doing such a great job with these steps. Anytime we take a step, it’s so important to nail the policy and get the maximum nice laws and consequences and insurance or whatever out of each of these steps, because we’re practicing. This is just a representative thing. One step is kind of equal to all the steps we’re going to take. Is that characterizing you well?
Stephen 01:19:22
I think so. I’m very happy to embrace the banality of systemic risk and sometimes the banality of forces of destruction and power concentration and maybe even evil in the world.
And I think it’s not as sexy to think about our problems and our solutions this way. I think it’s a lot less sexy in some way to think about how we can try to make tort suits go well in the next few years than how we can align some sort of superintelligent enterprise. But I think it’s good, and I think it’s realistic to think about things a little bit more this way.
Liron 01:19:57
A lot of people would look at the current situation, and I’d be sympathetic to their view — people who aren’t FOOM warriors like myself, people who don’t think that the end game is humanity losing all power because our brain gets superseded. That’s my Yudkowskian-type worry.
But people who aren’t like that tend to be more optimistic than you actually, because they look at the situation and they’re like, “Look, yeah, okay, we had the deepfakes. We had faces in Midjourney or whatever, and we just keep surviving and we keep working through and these tools get more and more useful, and that’s just part of the history of technology.”
“The first planes used to crash a lot, and now they don’t. Everything is going great. We’re on such a good ride. I’m happy to take more and more steps. And people like you, Cas, are so uppity. You’re so nitpicking how we didn’t get the deepfakes right, even though it’s just fine. It’s fine. Why do you have to worry? Because if you extrapolate the pattern of taking steps, it’s great.” Don’t you think it’s going well overall?
84% of the Population May Get Left Behind by AI
Stephen 01:20:51
Things could have gone worse, and I think it is accurate to say that. I do think that we are in some stage of a multi-decade, maybe century-long story of disempowerment and enshittification and turning up the temperature on risk.
Liron 01:21:07
Wait, but aren’t you basically saying: yes, you are one of those accelerationists because you have a 90% P-nondoom? So you actually are ultimately kind of this optimistic guy.
Stephen 01:21:15
Well, I think I’m not on the same page with the accelerationists on the other side of the coin about how reliably AI would do things like solve cancer or make life good for everyone.
One way of thinking about things — one argument that I think is very powerful as to why AI is probably not just going to benefit everyone in the world by default — comes from looking at the usership of AI on the internet. You’ll find some slightly different numbers from different places, but one recent estimate of how many people in the world use generative AI is 16%.
AI has had years to reach the world. It’s not really reaching more than about a little bit over a billion people in the world. Even the internet hasn’t done this. The internet has certainly had enough time to become a wonderful force in the world and to give everyone access to an integrative healthcare endpoint or to give everyone a free world-class education. But it hasn’t, and 26% of the people in the world don’t even use the internet.
And unless someone has a plan for how to fix the global inequalities and actually make people’s lives better using a technology very powerfully, I think that their optimism about how AI could cure cancer or do whatever should generally be considered to be unserious.
Liron 01:22:34
I just want to reiterate what I was drilling into before — you’re so focused on studying these governance problems and trying to make them better. But aren’t you just kind of putting icing on the cake of governance that you see as having already been working well enough?
Stephen 01:22:53
Well, I struggle to think of how we are all still around and how maybe technologies like internet and social media could have gone worse as great evidence for how we’re on track to, by default, govern AI well enough.
I think that governance of the internet and social media are much more case studies in failures than successes. And I think these kinds of things repeating themselves with a technology that is able to automate as much and produce as much value and power as AI is able to produce is a sign for systemic and existential worry still.
Liron 01:23:27
So you’re saying the trend is good, but the doomers like me raising all this ruckus about how bad things might get are making you worried about kind of a trend break, and so you’re trying to protect the downside. You’re trying to avoid breaking the trend, so you’re trying to be extra careful with the governance.
Stephen 01:23:45
Maybe, except I still want to be part of Doom Town, and I would describe the trend continuing as — well, I would describe the default thing continuing as being what causes lots of risks, like the enshittification and gradual disempowerment of stuff that comes from AI.
The trend that I want to focus on that I think will continue and I think will be very bad is how AI gradually takes over things and gets integrated into things and creates concentrations of power and turns up the temperature on risk. I’m not so much focused on the trend of how things have been okay and survivable so far.
Liron 01:24:23
You mentioned enshittification, and when I look at my life, I would have told you there’s no enshittification, but actually there’s two things in my life that have been enshittified — my straws and my grocery bags. They suddenly turned into paper, and it’s horrible. And I still remember a life when those things were plastic, and it’s like we don’t know how to build that anymore.
Stephen 01:24:39
You know how I’m anti-timelines? I’m also anti-straw. Just ditch straws. It’ll be a great decision.
Liron 01:24:47
No, I wanted plastic straws. Plastic straws are great. You’re still able to buy your own plastic straws, but then a restaurant screws me. I saw a restaurant where you could buy a metal straw. A metal straw? This is what it’s come to. It felt like I was living in the Third World. I felt very enshittified.
But besides that — don’t you think that in general, enshittification is the opposite of what’s actually happened? It just seems like things have been getting better over time.
Stephen 01:25:10
Enshittification is a story that plays out in staggered ways across all sorts of technologies. I think 2025, 2026 might be a little bit of an interesting point in the enshittification story in AI, but I think the worst is by far yet to come.
I think we’re just starting to see the ads. And we’re also starting to see the internet get carved up, for example. AI systems being native to certain chunks of the internet — for example, OpenAI systems can use “Atlantic” articles because they have a deal, and Anthropic systems can’t access “The Atlantic,” stuff like that. So we’re seeing AI systems being native to certain chunks of the internet. This is also starting. But it can get so much worse involving how AI systems manipulate us.
Liron 01:25:55
It’s just every time we look back at the past, both including the internet rollout and other technologies, and even including AI to date, people will be like, “What about this incident?” And I’ll be like, “Okay, yeah, so you’re pointing to 0.1%,” and then there’s this 99.9% of massive value creation.
I’m in a weird position because I love technology like a lot of the Yudkowskians, like a lot of my friends who have a high P(Doom). We’re really enjoying technology, and it’s hard for me to put together an argument being like, “Hey, you know how this thing is going really bad? Well, the future is also going to go bad.”
Because for me, it’s a total disconnect. I have no problem with the past and present. I think the past and present is going great, and I just have to argue that the future is going to do a 180-degree flip because the brain is not going to be the king anymore. That’s my argument.
But it does seem like so many people are more like you, and they’re saying, “Hey, see this microcosm of the present? That’s going to be our big problem long term.” And I’m like, “I don’t even see a big problem in the present.”
Stephen 01:26:49
Yeah, I think that’s fair. And it’s worth mentioning, although I’ll do so briefly, about how in some ways lots of technology things used to be better and used to be more fun and used to be less enshittified, and also things could be better than they are now.
But there’s one main thing I want to say, which I think is worth some emphasis, and that is that stories about technology going well are, in some ways, uniquely privileged and amplified stories. There are lots of people who have been historically marginalized by technology, marginalized or worse, whose stories don’t make it into things.
Think guns and germs and steel. There are lots of people who have been really badly affected by these things, but they’re dead now. And their stories or narratives or viewpoints on these things politically haven’t proceeded very much.
Liron 01:27:39
Right, but they’re the minority. You got to quantify it. There’s only a few of them relative to how many people benefited.
Stephen 01:27:45
I think so. But the non-generative AI users in the world are a majority, something like 85% or so. And non-internet users are a minority, but they’re not in any of the rooms where we talk about the benefits of AI. They’re not the technocrats making big decisions, they’re not in the political spaces.
I think that the 26% of people in the world who don’t have the internet are hard to hear from and are unhappy that they’re as poor as they are, for example.
Liron 01:28:18
The only people I know in that 26% are the Sentinelese, deep Sentinel island or whatever. They won’t let anybody in. Okay, that’s their fault. You can’t do anything about them.
No, but realistically though, that’s a one-off. But I do think that most are probably just so poor where they can’t even score a cellphone. I need to research this more, but I feel like that’s really the problem.
Stephen 01:28:39
Yeah. So I’ll go back to one thing I said earlier about how stories about the benefits of AI and things going very well should be considered unserious unless they’re accompanied with a plan or a plausible scenario by which AI really does kind of benefit people, and not a very small number of people who are particularly loud or particularly empowered or particularly near the helm of where decisions are made.
Liron 01:29:07
All right, man. So yeah, we’re heading toward the wrap-up here. I do think we’ve covered a lot of ground. Let me ask you this. You’ve been a MATS mentor. You’ve been deeply involved with AI alignment research for years, but you’ve got a unique take on it because ultimately you don’t think it’s really solving the bottleneck of humanity’s problems.
When you see all this AI alignment research happening, what’s your attitude about it? I’ll give you an example. I just had Kevin Zhu on this show. He runs Algoverse, and he was saying that he actually has a high P(Doom). I think he was saying 15 or 25 to 60%, so basically near my 50%.
So he has a high P(Doom), and he actually admitted very straightforwardly that he doesn’t think alignment research is going fast enough toward what’ll probably be required, even though he’s deeply involved in AI alignment research. So that was an interesting perspective.
And then you look at the Yudkowskians, including myself — a lot of us think that what we call AI alignment research today is fake. It’s a distraction because it’s actually not going to be relevant when capabilities scale a little bit more. It’s just the wrong type of stuff to be researching. “Oh my god, the AI’s personality made it do this under this situation.” It doesn’t matter. We’re going to have a new regime of AI. It’s just going to be super powerful, and you’re just going to achieve goals better, and none of the current research is going to be relevant.
So this is a common attitude that you’re going to see with people like myself in relation to current AI alignment research. So now tell me your angle about it. What do you see when you look at all this research that’s happening today into so-called AI safety?
Why Cas Opposes Alignment Research
Stephen 01:30:30
Yeah, this might not be super surprising, but I kind of fall in the latter camp. I’m an ML research person, and there is a space that I think should be occupied right now on trying to solve some problems in AI safeguards and making sure we have the tools to hold anyone accountable.
But in general, as a venture, as an agenda, as a kind of political thing, I’m pretty skeptical about alignment. On one hand, I agree that alignment isn’t going to be enough. On another hand, I think alignment is a distraction in some ways. And on a third hand, I guess, I think that there are some pernicious ways in which getting better at just aligning systems in a Christiano-style way — making the systems do what the creators want them to do — is really going to increase our risks, even if they appear to only be applied for safety-related reasons. So we can walk through this. I’d be happy to.
Liron 01:31:22
I’ll just repeat back to the audience — sometimes just rephrasing things can be a good way to go. So basically, what I’m hearing is that, first of all, you don’t even think that it’s that thorny of a problem. You think we’ve got a good handle on it. We’re going to make progress on it.
And then you also think that even if we did make a lot of progress on it, it’s like, okay, great. So it’s kind of inevitable we’re going to figure out how to get AIs to be aligned to their creators or whoever’s programming them, but that’s actually going to make life harder because the real alignment problem is all about governance and coordination of what to do when you have all this power.
Stephen 01:31:53
Oh yeah, you’ve got me. I think you’ve done a pretty good job at training the Cas fairy on your shoulder. It’s good.
Liron 01:31:59
Yeah, exactly.
Stephen 01:32:00
So we can staple to everything I’m about to say our earlier discussion about how our ability to safeguard and responsibly deploy AI systems is pretty good if we throw the kitchen sink at things, kind of in the way that OpenAI threw the kitchen sink at making DALL-E 2 safe.
But in addition to that, just solving the alignment problem is only going to really help us prevent risks where the people triggering those risks are benevolent and responsible and exercising state-of-the-art tools. If you think, like I do, that most of our risk is going to come from systemic effects or from less responsible or less benevolent actors, then alignment’s not it. And alignment can actually exacerbate some of those risks, which I’ll talk about in a second. There’s also the—
Liron 01:32:42
Yep.
Stephen 01:32:42
—extent to which alignment serves as a justifying basis for what companies are doing to move fast and break things. They just say that if they’re able to solve the alignment problem, they’ll be able to make sure that their AI systems are safe, as if that’s kind of the real problem, which I don’t think it is. The problem’s more structural and comes more from the companies and the business models around the AI systems than the AI systems themselves.
So my point here is just the safety washing point. Alignment research is a perfect locus of safety washing because it allows you to do everything you want to do, build the systems the way you want to build them, and emphasize the safety value in some cases in which you’re able to avoid unintended consequences. But it doesn’t require companies focusing on the alignment venture to confront the more systemic issues about what types of technology they’re introducing into the world and what types of powder kegs they may be setting off.
Stephen 01:34:16
And then lastly, the worst thing about alignment of all, and I think the most perniciously harmful thing about alignment overall, and the way that alignment research could kill us even if that alignment research only appears to be applied by companies to make their systems safer, is the Jevons paradox.
I think Jevons paradox is an under-discussed thing in the AI research community. And I think there are lots of people at AI companies who, either in name or in principle, maybe just don’t know about Jevons paradox. And I think if they don’t, that’s concerning. I’d hope they would. The professor who introduced Jevons paradox to me introduced it by talking about how it is criminally under-discussed. And do you mind if I try to give you the worst explanation of a concept from economics you’ve ever heard?
Liron 01:34:25
Yeah, go for it. Jevons paradox. We hear about it a lot. And by the way, let me just give my context.
Stephen 01:34:30
Yes, please.
Liron 01:34:31
I hear about Jevons paradox when looking at things like Nvidia stock because people say, “Hey, there’s competition to turn the price down for all of their products.” AI inference is getting cheaper, so all of these companies just aren’t going to make that much money because they’re going to sell cheaper products. But then people come in and they’re like, “No, Jevons paradox — if it’s cheaper, you’re actually going to use more of it, so they’re actually going to have more revenue, and you should buy their stock.” That’s when I hear about Jevons paradox.
Stephen 01:34:54
Yeah, that’s a great example of it. So hopefully it’s understandable now, but I’ll still go into an explanation of it that I will get mostly right, but that I might get a detail wrong on.
Jevons paradox is all about how by making something cheaper or reducing the downside of it, you can incentivize more of its use so that the risk increases overall. So this was introduced by an economist, I think last name Jevons — old, dead, white, the kind of guy you’d expect him to be. And he was studying the efficiency of steam engines, I believe, probably in the UK or something like this way back when.
And he made an observation, or at least a conjecture, about how increasing the efficiency of a steam engine with respect to the coal fuel that you use to power it can, mile per mile traveled, reduce emissions or reduce pollution. But in aggregate, it can incentivize more usage of the train because there’s less cost to operate it, and that ends up increasing the overall amount of coal that gets burned. We can think of Jevons paradox in a way that’s perfectly analogous to your example with pricing.
But we can think about it applied to safety-related work. If there is more perceived or real — even real — ability for a company to design a super powerful system with less risk to them or less risk to humanity, the cost of doing so or the associated risk with doing so is going to be lower or perceived to be lower, which is going to make it more likely that companies are going to do these things. And I struggle to see how we’re not in a Jevons-style regime. It seems completely plausible that we just kind of are when it comes to the deployment of systems.
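Editor's note: the arithmetic behind the Jevons paradox described above can be sketched in a few lines. This is a minimal illustration, assuming a constant-elasticity demand curve; that assumption, the function name, and all the numbers are hypothetical and were not stated in the conversation. Under this assumption, doubling efficiency raises total fuel burned when demand is elastic (elasticity above 1) and lowers it when demand is inelastic.

```python
def total_fuel_used(efficiency, fuel_price=1.0, demand_scale=100.0, elasticity=1.5):
    """Total fuel burned under a constant-elasticity demand curve.

    All parameters are illustrative assumptions:
      efficiency   - useful work obtained per unit of fuel
      fuel_price   - price of one unit of fuel
      demand_scale - work demanded when the cost per unit of work is 1
      elasticity   - price elasticity of demand for useful work
    """
    # Better efficiency makes each unit of useful work cheaper.
    cost_per_unit_work = fuel_price / efficiency
    # Cheaper work means more work is demanded.
    work_demanded = demand_scale * cost_per_unit_work ** (-elasticity)
    # Fuel needed to deliver that much work at the given efficiency.
    return work_demanded / efficiency


# Elastic demand: doubling efficiency increases total fuel burned.
elastic_before = total_fuel_used(1.0, elasticity=1.5)
elastic_after = total_fuel_used(2.0, elasticity=1.5)

# Inelastic demand: doubling efficiency reduces total fuel burned.
inelastic_before = total_fuel_used(1.0, elasticity=0.5)
inelastic_after = total_fuel_used(2.0, elasticity=0.5)

print(f"elastic:   {elastic_before:.1f} -> {elastic_after:.1f}")
print(f"inelastic: {inelastic_before:.1f} -> {inelastic_after:.1f}")
```

In the safety analogy from the conversation, "efficiency" would stand in for how much risk is removed per deployment, and whether total risk goes up depends on how strongly deployment responds to that reduction. The sketch does not establish which regime applies.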
Liron 01:36:35
And I don’t even know if we even need to invoke Jevons paradox. We just observe, hey, they keep making the AI better. They listen to what you want and actually do it. So that’s just going to accelerate everybody doing it.
Stephen 01:36:48
Yeah. So if I had a button in front of me, I’d think about it, but I’m pretty sure after thinking about it some more, I would press a button that would stop or halt research under the research agendas of automated alignment research, scalable oversight, or super alignment.
I think these things are exactly the kind of thing I would try to sell to the public as safety-motivated, benevolent research work to do if I was secretly trying to make it so that my company is able to build powerful systems and deploy them more recklessly.
Liron 01:37:30
Oh boy. You’re getting really galaxy-brained here, okay? So just to reiterate here for the viewers — you are somebody who supports a pause AI policy, correct?
Stephen 01:37:39
Yes.
Liron 01:37:41
But at the same time, you would turn around and campaign against any alignment or super alignment research happening now, correct?
Stephen 01:37:51
Any research focused on making systems that are smarter than humans do what the system’s creator wants it to do. And not including — this category I’m describing does not include interventions that strictly reduce a system’s capabilities.
I work on capability suppression in my machine learning research, for example, and that’s kind of the opposite of what this super alignment agenda is trying to do, because I’m not trying to align a system that has a ton of extra super powerful capabilities. I’m trying to help us targetedly build dumber systems that are safer because they’re dumb in certain domains. So I want to make that distinction. But yes, I would like to halt the research enterprise around making superintelligent systems intelligently aligned with their creators’ goals.
Liron 01:38:46
From a messaging perspective, I feel like I already have so much on my plate saying, “AI is going to be really powerful. We need to pause AI. The alignment teams aren’t doing enough. They’re not going to solve the problem.” And now you’re adding into the mix, “They’re actively harmful. They should do less alignment.” No, please don’t throw this into the mix.
Stephen 01:39:04
I’m standing on that hill. Am I dying on that hill? Maybe soon, but I’m very much standing on that hill.
Liron 01:39:14
Okay. I’m going to try to shove you off the hill and not have you die.
Wrap-Up: The “Unsexy” Path to Lower P(Doom)
Liron 01:39:21
All right, Cas. Well, this has certainly been a unique perspective to represent on the show, and that is part of what we do here. We try to have a bunch of different types of discussions, and one of them is definitely to get somebody smart and thoughtful who has this whole worldview that he’s thought about, which is different from my worldview, and then we just clash the worldviews, but we actually take the time to understand it. We don’t just yell platitudes at each other. We pass the ideological Turing test.
So I think you’ve got kind of a call to action for the viewers, which is, if you’re not a Yudkowskian and you’re also not an accelerationist, there’s this other perspective you can have that basically the biggest priorities for making AI go better, systematically and in the long term, are banal, unsexy governance problems — standards, accountability, transparency and awareness, checks on power, societal resilience. Those are Cas’s wheelhouse, and a good call to action for viewers if that resonates with them is they can go to your website and email you, correct?
Stephen 01:40:17
Yeah, absolutely. So I’m one of those people who, even though I’m principally worried about existential and catastrophic risks, I think some of the best things that we can do right now are working on things that are a little bit less sexy and a little bit more near-termist and trying to set the right precedent, get the right case law out, build the right communities, and kind of prepare us to maintain, in a very unsexy, boring, bureaucratic way, institutional hygiene forever.
Liron 01:40:42
Nice.
Stephen 01:40:43
And if anyone wants to talk to me about these things, you should find me online. You can Google my name, stephencasper.com. You’ll find my email at the top of the website.
Liron 01:40:51
Great call to action. So hope to do more of these kinds of episodes. Viewers, if this is the kind of conversation you like, some other recent thoughtful guests I’ve had include David Duvenaud and Ozzie Gooen. So search for those. We’ll link those in the show notes as well. It’s kind of a certain style of episode. Oh, and also, of course, Steven Byrnes, one of the most popular ones as well. So check those all out if you like this kind of style. Stephen Casper, Cas, thanks so much for coming on Doom Debates.
Stephen 01:41:17
Yeah, thank you. Thanks for the chance to talk. Thanks for making me smarter. I appreciate it. Cheers.
Liron 01:41:22
Likewise.
Doom Debates’ Mission is to raise mainstream awareness of imminent extinction from AGI and build the social infrastructure for high-quality debate.
Support the mission by subscribing to my Substack at DoomDebates.com and to youtube.com/@DoomDebates, or to really take things to the next level: Donate 🙏









