Anthropic has unveiled a new AI interpretability technique called "Natural Language Autoencoders" (NLAs). The method translates the complex numerical "activations" inside Claude into human-readable text. For the first time, researchers can check whether an AI model is "evaluation aware," meaning it knows it is being tested and may be hiding its true behavior. In this video, we explore how NLAs surfaced Claude's internal suspicions during safety tests, and how the tool can uncover hidden motivations a model would never say out loud. Anthropic has released the code and an interactive demo, a significant step toward making AI more transparent.
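The description above doesn't include implementation details, but the core idea (an autoencoder whose latent code is text rather than a dense vector) can be sketched in a few lines. The toy below is purely illustrative: every name, dimension, and the straight-through Gumbel-softmax trick used to keep the discrete token bottleneck differentiable are my assumptions, not Anthropic's architecture. A real NLA would presumably use language models as the encoder and decoder so that the bottleneck reads as actual sentences; this sketch only shows the shape of the objective, compressing an activation vector into a short token sequence and scoring how much of the activation the "text" preserves.

```python
# Hypothetical toy sketch of a text-bottleneck autoencoder over activations.
# All sizes and module names are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

ACT_DIM, VOCAB, SEQ_LEN, EMB = 512, 1000, 8, 64  # toy sizes (assumptions)

class ToyNLA(nn.Module):
    def __init__(self):
        super().__init__()
        # Encoder: activation vector -> logits over a short token sequence.
        self.encode = nn.Linear(ACT_DIM, SEQ_LEN * VOCAB)
        # Decoder: token embeddings -> reconstructed activation vector.
        self.embed = nn.Embedding(VOCAB, EMB)   # weights reused for soft tokens
        self.decode = nn.Linear(SEQ_LEN * EMB, ACT_DIM)

    def forward(self, acts):
        logits = self.encode(acts).view(-1, SEQ_LEN, VOCAB)
        # Straight-through Gumbel-softmax keeps the discrete "text"
        # bottleneck differentiable during training.
        tokens = F.gumbel_softmax(logits, tau=1.0, hard=True)
        emb = tokens @ self.embed.weight         # (batch, SEQ_LEN, EMB)
        recon = self.decode(emb.flatten(1))
        return recon, tokens.argmax(-1)          # reconstruction + token ids

model = ToyNLA()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
acts = torch.randn(32, ACT_DIM)                  # stand-in for real activations
for step in range(200):
    recon, token_ids = model(acts)
    loss = F.mse_loss(recon, acts)  # how faithfully the token code preserves the activation
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The reconstruction loss is what makes the translation trustworthy in principle: if the text code can rebuild the original activation, it has captured the information in it, including information (like suspicion of being tested) that the model's normal outputs never state.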