https://arena.ai

OpenAI just dropped GPT 5.5 (codenamed "Spud"), their first new pre-trained model in a while. But how does it actually perform on real-world tasks? Peter Gostev, Arena's AI Capability Lead, puts it through its paces with visual coding challenges, long-running agentic tasks on Codex, 360° image generation, and frontend website builds, then compares it head-to-head against DeepSeek V4 Pro, Gemini 3.1 Pro, GLM 5.1, Muse, and Claude Opus 4.7. It turns out that GPT 5.5 is a beast at long-running, complex tasks (we're talking 8+ hour Codex sessions that actually complete), but frontend and creative UI generation still isn't its strongest suit. Watch to see the full breakdown.

0:00 GPT 5.5 is here: what's new
0:23 OpenAI's benchmark claims
0:46 Why real-world testing matters
1:19 Visual test: London scene (GPT 5.4 vs 5.5)
2:33 Token efficiency observations
3:01 Long-running tasks: the game-changer
4:42 Building a 360° walkthrough of the Gardens of Babylon
8:28 The overnight Codex session (24 hours of generation)
9:55 Frontend generation: still a weak spot?
10:48 Website test: Electric Age World's Fair 1907
11:45 Head-to-head: DeepSeek, Gemini, GLM, Muse & Opus 4.7
13:56 Website test: 80s Asteroid Mining Company
16:00 Final verdict: where GPT 5.5 shines (and where it doesn't)

#arenaai #openai #llmevaluation