New AXRP! With Evan Hubinger!
This time I won't retract it, I swear!
The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can be somehow removed. In this episode, Evan Hubinger talks about two papers he's worked on at Anthropic under this agenda: "Sleeper Agents" and "Sycophancy to Subterfuge".
Daniel Filan