New AXRP! With Evan Hubinger!
This time I won't retract it, I swear!
The 'model organisms of misalignment' line of research creates AI models that exhibit various types of misalignment, and studies them to try to understand how the misalignment occurs and whether it can somehow be removed. In this episode, Evan Hubinger talks about two papers he's worked on at Anthropic under this agenda: "Sleeper Agents" and "Sycophancy to Subterfuge".
New episode with Jesse Hoogland!
Another short one, I'm afraid.
You may have heard of singular learning theory, and its "local learning coefficient", or LLC - but have you heard of the refined LLC? In this episode, I chat with Jesse Hoogland about his work on SLT, and using the refined LLC to find a new circuit in language models.
Lieke van der Vorst
From: https://x.com/marysia_cc/status/1861148591479288294/photo/1
-
Elena and Anna Balbusso
for Little Knife by Leigh Bardugo
From: https://x.com/marysia_cc/status/1861127999581528531/photo/1
#art
Franz Karl Leopold von Klenze
From: https://x.com/0zmnds/status/1861121676735586756/photo/1
-
Chesley Knight Bonestell, Jr.
From: https://x.com/0zmnds/status/1861297334195495170/photo/1
#art
The Real Realm
Liu Kuo-Sung 1999
From: https://x.com/blanc_alba/status/1858225969443811511
-
Art: Vol des grues vers la lune d'or
by Fujiyama Nobu; Rudi.
From: https://x.com/ClaraOlwen/status/1858242777517109366
Night Performance, 1995
From: https://x.com/marysia_cc/status/1857935683240997187/photo/1
Short AXRP with Alan Chan!
Another fun short episode!
Road lines, street lights, and licence plates are examples of infrastructure used to ensure that roads operate smoothly. In this episode, Alan Chan talks about using similar interventions to help avoid bad outcomes from the deployment of AI agents.
From: https://x.com/marysia_cc/status/1857705164251050362
- https://x.com/0zmnds/status/1857570527134859688/photo/1
- https://x.com/0zmnds/status/1857570023830847567/photo/1
Shoda Koho (1871-1946), Moonlight Sea, c. 1930
From: https://x.com/marysia_cc/status/1857172152174157921/photo/1
+
Leonard Weisgard
illustration from Look at the Moon (1969)
From: https://x.com/marysia_cc/status/1857488438691291585/photo/1
From: https://x.com/MenschOhneMusil/status/1856974702498996385
New short AXRP with Zhijing Jin!
New episode of AXRP with Zhijing Jin - this time, a short one (22 min), offering an overview of her work. Blurb below, links in comments.
Do language models understand the causal structure of the world, or do they merely note correlations? And what happens when you build a big AI society out of them? In this brief episode, recorded at the Bay Area Alignment Workshop, I chat with Zhijing Jin about her research on these questions.
linocut
From: https://x.com/marysia_cc/status/1854986382957330908/photo/1
Nocturno
From: https://x.com/marysia_cc/status/1854960702504812953/photo/1
Moon at Shinobazu
From: https://x.com/marysia_cc/status/1854938647117640135/photo/1
Tanaka Ryōhei
Persimmons . Mountains
From: https://x.com/marysia_cc/status/1853326310933770502/photo/1
Shiro Shirahata, Moon over Fuji, 1972.
From: https://x.com/MenschOhneMusil/status/1853201333983023451/photo/1
Toni Demuro, Solo per Gatti, 2023.
From: https://x.com/MenschOhneMusil/status/1853183759639839203/photo/1
Last Train/Look
Night Train. 2020 Ink on paper
Christoph Niemann.
American, born in 1970.
From: https://x.com/fraveris/status/1853184339753992541/photo/1
From: https://x.com/MenschOhneMusil/status/1852592656905335109
Notes on Claude 3.5 Sonnet (new)'s ability to find errors in Latin text
Kalshi and PredictIt differ by 10 points! Wild!
Also apparently I can't sell all my "no" shares in Kamala on PI? Quite annoying.
From: https://x.com/JapanTraCul/status/1851022883394408642
🎨Xuan Loc Xuan
From: https://x.com/marysia_cc/status/1850266583052284260