

New AXRP! With Evan Hubinger!


This time I won't retract it, I swear!

The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he's worked on at Anthropic under this agenda: "Sleeper Agents" and "Sycophancy to Subterfuge".

Video
Transcript

in reply to Daniel Filan

I like how it looks like the AXRP logo is the sun in this thumbnail.