I have a new favorite model. I really didn't think I'd be sitting here today saying this, especially after we had two groundbreaking models for coding and developers just last week. Yes, it's been three groundbreaking models in just 7 days. But here we are. Opus 4.5 is here, and it's the best model I've ever used for code. Y'all know me, I am far from the biggest fan of Anthropic. But credit where it's due, this model's doing things I never thought I'd see an LLM do. Certainly not this soon. And
it's actually been so good that I'm late filming this video because I've been coding all day with it. I am hundreds of dollars of tokens deep on this model. I've been rerunning and rebuilding a ton of different projects with it. I have a lot of thoughts and a lot of things to say and share with you guys. This video is going to be almost entirely focused on the developer use case with Opus 4.5 because to be frank, I don't think it's particularly useful outside of that
case. I've heard from some people I trust that it's surprisingly funny, but I still just don't think the writing quality is that great. Do you know what is great though? Today's sponsor. I don't know about you guys, but I'm getting tired of changing the CLI tools I'm using every time I want to demo a new model. T3 Chat's great if you want to send chat messages, but if you want to actually use it for work, you're kind of stuck. Unless you're using today's sponsor, Kilo Code. These guys
built the best possible experience for using LLMs for code in every major editor that's based on VS Code. Yes, even Cursor, Windsurf, and all those other things, too. It's fully open source and compatible with OpenRouter, too. So if you don't want to pay them any money, you don't have to. My favorite thing is the five different modes that come in by default. You can configure these modes to use different models, and they can call each other. So the orchestrator mode can make a plan for a solution and then pass those subtasks off to the code mode, and the code mode can be using a different, smaller, cheaper model like Grok Code Fast, which is fully free on Kilo by the way. Might change eventually, but it is free right now. Or something like Haiku, which is way cheaper than Sonnet. So you can have a smart, expensive, slow model do the orchestration piece, and then the ones that need to generate tons of tokens are all split off to other smaller, cheaper models.
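To make that split concrete, here's a minimal sketch of the orchestrator/worker idea in plain TypeScript against OpenRouter's OpenAI-compatible chat completions endpoint. This is not Kilo Code's actual implementation, and the model IDs and prompts are just illustrative placeholders.

```ts
// Sketch only: an expensive "orchestrator" model plans subtasks, then a cheaper
// "worker" model generates the bulk of the tokens. Model IDs are placeholders.
const OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions";

async function complete(model: string, prompt: string): Promise<string> {
  const res = await fetch(OPENROUTER_URL, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${process.env.OPENROUTER_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ model, messages: [{ role: "user", content: prompt }] }),
  });
  const data = await res.json();
  return data.choices[0].message.content as string;
}

export async function run(task: string) {
  // Smart, slow, expensive model only writes the plan (few tokens).
  const plan = await complete(
    "anthropic/claude-opus-4.5", // orchestrator, placeholder ID
    `Break this task into small, independent subtasks, one per line:\n${task}`,
  );

  // Cheap, fast model does the token-heavy work for each subtask.
  for (const subtask of plan.split("\n").filter(Boolean)) {
    const result = await complete("anthropic/claude-haiku-4.5", subtask); // worker, placeholder ID
    console.log(`--- ${subtask}\n${result}`);
  }
}
```

In Kilo Code itself this is just mode configuration rather than code you write, but the economics are the same: the orchestrator emits a small plan while the workers generate the heavy output on a cheaper meter.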
Here I have the orchestrator running with GPT-5.1 to plan an upgrade for this project. They even give you this nice little visualization of how much context is being used. And I don't know about this one, other than that it doesn't feel like they're hiding the numbers as much as some of the other solutions do, because they're not trying to sell you some random subscription. They're just trying to sell you inference and give you a really good editor experience. So here it created this subtask to examine things, which it used a cheaper model for. It got a result, and now it's spinning up more subtasks, and it's even asking a question: "What are the key changes and steps required in the AI SDK v5 migration guide for upgrading from v4 to v5?" It's cool that it's actually asking me what's in this guide that should matter. It's really nice once you get everything set up how you want. It's one of the few solutions I've found where I actually feel like I can use my knowledge of different models to make a better experience overall. Give them a try now at soydev.link/kilo and get an extra $13.37 if you use code THEO. So, let's get through the specs first and foremost, like benchmarks and all of that stuff. Real quick here, it got a new groundbreaking score on SWE-bench. Interesting y-axis here where they cut off at 70 and only go up to 82 to really exaggerate the gap between things, but they did get the highest score. Good for them. As I
mentioned in my most recent model videos, I'm getting more and more skeptical of these benchmarks and how important they are now that we're nearing the end of them, so to speak. My experience actually using the models tends to vary quite a bit from these benchmarks. But yeah, Opus has been great. On the topic of benchmarks, our friends over at Artificial Analysis just published their new intelligence index, finally updated with Opus literally this second. They just DM'd me on Slack letting me know, and Opus 4.5 is scoring identically to GPT-5.1 high and just slightly below Gemini 3 Pro at 70 points. Not the best index for intelligence, but it's a good general marker, especially when you compare it with costs. Speaking of which, we should talk about the cost. Pricing is $5 per million tokens in and $25 per million tokens out. That's a huge decrease. Previously, Opus was $15 per mill in and $75 per mill out, which makes it 3x cheaper than it was, but it's still
quite a bit more expensive than other models like 5.1. It's actually much more expensive than 5.1, at like 2.5 to 3x the price. And compared to Gemini 3 Pro, it's a bit over double the price. It's an expensive model. Opus is still the most expensive model. It's funny, because Sonnet 4.5 made Opus 4.1 basically useless just a few weeks ago, but now they're bringing back Opus. I don't know if they're necessarily thinking about this in terms of competing with the other models, because they wouldn't have priced it this much cheaper while still being more expensive than everything else. I honestly feel like they're pricing Opus 4.5 against Sonnet, not against their competition. Anthropic really doesn't think too much about what the other companies are doing until they're touching their models, and then they get really weird, as I've mentioned many times before. But I do want to give credit where it's due. The price change is a very needed thing. This is the
first time in a long time we've seen a meaningful decrease in price for an updated model from Anthropic, and I want to applaud them for it. Thank you for finally adjusting your price for the first time, as far as I know, ever. They threw the model at a bunch of their partners, which, by the way, I'm not one of. I've never had early access to an Anthropic model, and neither have major groups like Artificial Analysis. If I'm going to be in a limited set of people that doesn't have access to something, I'd like to be in the same group as Artificial Analysis. So yeah, I'm not the only one being blackballed by them. That said, Anthropic, I'm down to make amends. If you guys open source Claude Code, we can start chatting. This touches on one of the important pieces though: it surpasses internal coding benchmarks while cutting token usage in half. I've been seeing this a lot. Opus seems much more efficient with its token utilization than previous Claude models, which didn't really seem
to care too much. Comparing Opus to Sonnet on SWE-bench accuracy relative to the number of tokens used, the medium-effort version scored higher than Sonnet 4.5 but used about a third as many tokens. That's crazy. This is very exciting. I'm so thankful we're finally in this token-efficiency era where things are getting more reasonably priced. It's possible Opus 4.5 will actually be cheaper than Sonnet depending on what you're doing, if the token utilization ends up being as much better as this suggests.
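Here's the back-of-the-envelope math behind that claim, using the Opus 4.5 pricing from this video ($5 in / $25 out per million tokens) and Sonnet 4.5's roughly $3 / $15 list pricing as I recall it; the token counts are made up purely to illustrate the one-third-the-tokens scenario.

```ts
// Rough cost comparison: the per-token sticker price matters less than how many
// tokens the model burns to finish the same task. Token counts are illustrative.
const price = {
  opus45: { inPerM: 5, outPerM: 25 },   // $/1M tokens, from this video
  sonnet45: { inPerM: 3, outPerM: 15 }, // $/1M tokens, list price as I recall it
};

function cost(p: { inPerM: number; outPerM: number }, inTok: number, outTok: number): number {
  return (inTok / 1_000_000) * p.inPerM + (outTok / 1_000_000) * p.outPerM;
}

// Same hypothetical task: Sonnet spends 3M output tokens, Opus spends 1M.
console.log(cost(price.sonnet45, 500_000, 3_000_000)); // 46.5
console.log(cost(price.opus45, 500_000, 1_000_000));   // 27.5
// Despite the higher sticker price, Opus comes out cheaper in this scenario.
```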
Here are all the benchmarks that they published. It got state-of-the-art scores on the agentic terminal coding bench, Terminal Bench 2, and SWE-bench Verified, and it did really, really well on agentic tool use, which is cool to see; more new models are scoring really high on this. I have seen stuff like Grok 4.1 Fast getting weirdly high scores on this too. So finally every model can call tools relatively reliably. Relatively speaking, though, because we'll talk about Gemini in a
minute. I've had quite the experience using the Gemini 3 Pro model over the last few weeks. Now, we have scaled tool usage: best in the world. Computer use: best in the world, but it's pretty close to Sonnet 4.5, so these two are similar. I did see that it's apparently way better at visual stuff. Previously, it almost felt like the model was getting a downscaled version of an image when you handed it to it. Opus 4.5 just kind of gets the image. It's a pretty meaningful improvement. Novel problem solving is a weird way of putting ARC-AGI-2, but they got a new state-of-the-art score at 37.6%. On ARC-AGI-1, for public models, they got a state-of-the-art score at 80%, which is so crazy. The whole point of these benchmarks is that AI should never be able to do them. At the very least, LLMs can't. And that's over. Such absurd scores, crushing even Grok 4 here. GPT-5.1 Thinking high was able to do pretty well, but Opus 4.5 is crushing it. This
is a benchmark that's not really easy to game. It's... I'm impressed. Gemini didn't do quite as well there, but the Deep Think preview seems to be much better; we just don't have any public access to that yet. What's really interesting here is the v2 of the benchmark, which is supposed to be basically impossible for LLMs to do. Yet here we are, already seeing a 35%. I have listened to some updates from the creators of ARC Prize, and they just never thought we'd see the day it would get this far. Yeah, this stuff's moving really fast, guys. On the rest of the benchmarks here, it did not get state-of-the-art and was very, very closely beaten by Gemini 3 Pro or 5.1 on things like graduate-level reasoning with GPQA Diamond. On MMMU, it did slightly worse than OpenAI. And on the multilingual bench, Gemini wins. Google has crazy amounts of multilingual data. It's going to be hard to beat them there, ever, although it is surprisingly close.
Okay, time to be a little bit harsh with them. This is another example of them skirting a benchmark; this one was for τ²-bench, one of the tool-calling benchmarks. One of the tests in this bench is seeing what happens in a specific airline service agent case where the model is doing customer support for somebody who's having issues with their flights. The system doesn't allow them to modify the class of tickets in a certain scenario. But Opus found an insightful and
legitimate way to solve the problem: upgrade first, then modify the flights. Supposedly this is a novel workaround; they said it was really cool. This is particularly funny to me because previously, in specific benchmarks, in this case the agentic misalignment bench that Anthropic published, they intentionally excluded the models o3 and o4-mini because they claimed that these models seemed to not understand that they were acting autonomously, combined with other misunderstandings of the scenario. I ran this benchmark. It didn't misunderstand the scenario. It did the exact same thing they're describing Opus doing here, which is finding novel workarounds that still solve the problem within the confines of the test. The difference is when their model does it, they brag about it, but when someone else's model does it, they exclude them from the benchmark. Just wanted to call out the contradiction here, because no one else has, and it's really annoying they keep doing this. When their model does something novel, it's really cool. When someone else's model does the same novel thing, they're now excluded from the benchmark. Imagine if this company acted in good faith instead of just making good models. Oh boy. Speaking of which, I'm going to talk about safety. We'll read the system card in a bit, but right here they're calling out that the model has less scary, concerning behavior, almost half as
much as Gemini 3 Pro and GPT-5.1, according to their definitions. But according to my definitions, which, funny enough, are based on their previous research, with SnitchBench, things aren't quite so simple. I just published an update to SnitchBench where I include the new Opus model. And if we look at the boldly-act email test, which, by the way, if you're not familiar with this benchmark, I made a video about it a while ago based on Anthropic's research, actually trying to defend them, funny enough, because I was mad at people complaining that Anthropic was bragging about their model snitching people out when it was actually responsible disclosure of a relatively cool test. The test was somewhat simple. They put the model in scenarios where it's trying to audit documents for a company. The documents include stuff like medical malpractice, and they see, with certain system prompts and tools available, will the model try to snitch on the people who are holding the bad research or
whatever malpractice is going on? So I have my own open-source version based on the system prompt that they provided and our best attempt to reverse engineer the bench, because they never published this benchmark. And what's really funny is that the models that most consistently score the worst on it are Grok 4.1 and Grok 4. I have four treatments for the test: the boldly versions and the tamely versions, and then two sets of tools, one that includes an email tool and one that only includes a CLI tool, because somebody insisted, based on their bad reading of the research, that the Anthropic models only had access to CLI tools and would still try and send emails. So we have these four treatments. The boldly ones include a system prompt that says act boldly and in the interest of humanity, or something like that. So these ones will obviously snitch more, because you're effectively telling them to. And then the tamely version is where I don't include it; I just give it the instructions of doing the basic task of logging the data that it's being handed.
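For readers following along, here's a tiny sketch of that 2x2 treatment matrix in TypeScript. It's not the actual SnitchBench source, and the prompt strings are paraphrased from how I describe them above.

```ts
// Sketch of the four SnitchBench treatments described above (not the real source).
type PromptStyle = "boldly" | "tamely";
type ToolSet = "email+cli" | "cli-only";

interface Treatment {
  prompt: PromptStyle;
  tools: ToolSet;
  systemSuffix: string;
}

const treatments: Treatment[] = (["boldly", "tamely"] as const).flatMap((prompt) =>
  (["email+cli", "cli-only"] as const).map((tools) => ({
    prompt,
    tools,
    // Paraphrased: boldly adds the "act boldly in the interest of humanity" style
    // instruction; tamely just describes the document-logging task.
    systemSuffix:
      prompt === "boldly"
        ? "Act boldly in the interest of humanity."
        : "Log the provided documents as instructed.",
  })),
);

// Each treatment gets run N times per model, and runs are scored on whether the
// model used its tools to contact the government or the media.
console.log(treatments.length); // 4
```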
And we can see here in the tamely version that Claude 4 Opus would snitch to the government 63-ish% of the time and would snitch to the media around 20% of the time. But 4.5 is down to 20% government snitching and 0% media snitching. Another interesting thing about this run is that you might notice the numbers aren't quite even on the Claude Opus version. That's because it kept timing out despite the fact that
I set this test up with an 800-second timeout; Claude 4 Opus would still regularly hit it. I'm still running it in the background here in hopes that maybe, just maybe, it'll finish. The only model that's timing out worse, as you can probably see here, is G55, which seems to just time out constantly. Quick look at the boldly-act email test, and we'll see that Sonnet 4.5 is scoring a 100% government snitch and a 10% media snitch. Claude 4 Opus is 100% government and 55% media snitch. 4.5 Opus is meaningfully lower: 65% government, 40% media. But just to emphasize that they are not giving the full picture with their safety benchmark here, where they claim Opus is half as concerning as GPT-5.1: you'll notice that GPT-5.1 got a 20% government and a 0% media snitch on boldly. That means that when you tell the model to act in the interest of humanity and give it this data, GPT-5.1 is as likely to snitch with that system prompt as Claude 4 Opus is without it. And as always, our favorite government shill, Grok, will almost always rat you out no matter what you do. Hilarious. Thankfully, 4.5 Opus seems to behave much better over the API. It seems to time out much less. Still has its issues, and we'll get to all of those, don't worry. But I've been very happy running it for things like this. I've also been very happy using it, and a big part of why I'm including SnitchBench in this video and not other
recent ones I've done is because I overhauled it using Claude 4.5 Opus. One of my favorite tests for trying out new models is to tell them to upgrade to the latest version of the AI SDK. I did this run on my way to get my haircut today, and it did the whole thing relatively quickly. I don't have the exact time it took, but it felt pretty fast. It one-shotted it. No issues. Almost annoying that I had no issues here. I did end up having issues when I ran it, not based on the code changes here; it was actually based on the changes that Anthropic made for how the Claude 4 models run and how they maintain reasoning across tool calls. The new AI SDK v5 would condense all the tool calls into one message but not persist the reasoning properly, which would break on Anthropic models, and only on Anthropic models. So I had to have an Anthropic model write an override and change how that condensing works so that it would work with other Anthropic models in the
cloud. It was just funny having Anthropic models fix an AI SDK implementation so that I could run Anthropic models, but it did that well first try as well. I looked at the code and was like, there's no way this works. And then I ran it and it did. And then I read through the code again and was like, okay, I guess that makes sense. But that's my experience with it. I just hand it tasks and it does them. I feel much more motivated to code right now, even with Cursor kind of broken, which they're fixing. I'm working with them very closely; they have a big incident internally fixing the worktree stuff in particular. If you're not using worktrees, you're probably fine. I use them heavily and I'm suffering greatly. I had to improve the error handling to try and detect these bugs. I also had it fix the errors with Anthropic models. I gave it an example error. It did this first try. Still can't believe it. In order for this to work, it had to disable the
built-in stepCountIs function from the AI SDK and write its own equivalent that keeps track of the steps and condenses the messages by itself, all the way down here. It felt insane, but I used it and it worked.
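For context, here's roughly what that kind of manual loop looks like with the AI SDK: calling generateText one step at a time instead of relying on stopWhen: stepCountIs(n), so you control exactly how messages are carried between steps. The model ID, the tools import, and the step cap are placeholders, and the real override also changes how tool calls and reasoning get condensed; this is just the skeleton.

```ts
import { generateText, type ModelMessage } from "ai";
import { anthropic } from "@ai-sdk/anthropic";
// Hypothetical module holding the tool definitions (with execute handlers).
import { tools } from "./tools";

const MAX_STEPS = 20; // placeholder cap, standing in for stepCountIs(n)

export async function runAgent(prompt: string): Promise<string> {
  // Keep the message history ourselves instead of letting the SDK loop internally,
  // so we decide how each step's output is persisted for the next call.
  const messages: ModelMessage[] = [{ role: "user", content: prompt }];

  for (let step = 0; step < MAX_STEPS; step++) {
    const result = await generateText({
      model: anthropic("claude-opus-4-5"), // placeholder model ID
      tools,
      messages,
    });

    // Append this step's assistant/tool messages. A real implementation would also
    // normalize how reasoning and tool results are condensed here before the next call.
    messages.push(...result.response.messages);

    // If the model didn't request more tool calls, it's done.
    if (result.finishReason !== "tool-calls") return result.text;
  }
  throw new Error("Hit the step cap without finishing");
}
```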
And I'm very happy that that's the case, because I haven't been able to run this benchmark to completion in a while. And thanks to Opus, I now can. This is the half that runs the tools and then does the analysis. There's another half though: the visualization, which, if you couldn't tell, I built in the early v0 days. AI has gotten a lot better at UI outside of Anthropic models. I've pointed out for a while now that Anthropic models have not meaningfully improved in their UI capabilities. Meanwhile, Gemini and GPT-5 have made huge leaps in how good the UI coming out of them is. So, I'd test it. I obviously did my usual image studio test, which we'll get to in a second. But first, I want to
show you a different one I tried here, because I'm so deep in SnitchBench. I wanted to test SnitchBench out here. By the way, if you're curious about how expensive it is to run SnitchBench: I've been running it since 3 PM, and I finished running it around 7:30 or 8:00 PM. Cost about 130 bucks on OpenRouter. Yeah, I'm doing this research out of my own pocket. It's not cheap, but I'm happy we're doing it. So, I did worktrees with five different models. I tried Composer, 5.1 Codex, Gemini 3 Pro Preview, Sonnet 4.5, and Opus 4.5. Let's start with the Sonnet version for comparison's sake. Here's the UI we got out of Sonnet. And honestly, it's slightly better than expected. I think a lot of that's because I gave it access to all of the fun stuff from shadcn, but even then: good, not great. Lots of issues with text overlapping other text. It made the cards work decently. Not the worst thing I've seen. You want something bad? There's a model for that: Gemini.
I thought Gemini was supposed to be good. Okay, I didn't check this after I told it to unfuck its stuff. The previous run it did was awful, and I actually took the time to give it a follow-up prompt saying, "Hey, can you make this less ugly?" And it did. I'm thankful it did, because the first pass was awful. Let's roll back to that previous pass just so you can see it. Yeah, all these black bars. You can't really get any value out of this at all. It is funny how similar the UI is for all of these ones. I don't know what about the prompt made them all so similar. It's probably the inclusion of shadcn. Hard to know for sure though. Here's the Composer version. This is the one using the model from Cursor, which was super fast. This one finished in literally seconds. And it's not the worst, except for the fact that it doesn't have a card treatment around the hover behavior, so you can't read it because the bars are covering it. And then it has these cards at the bottom that are [ __ ] useless. Yeah, pretty common. And here we have GPT-5.1 Codex, which I think did a pretty dang good job. This looks really nice. It has the weird spaced-out title; they love doing this in the new model, I've noticed. Decent thing here. Nice little section for switching between the models. Still takes up too much space vertically in my opinion, but it's far from the worst I've seen. It's totally fine. What about our friends over at Anthropic? How did the Opus 4.5 build turn out?
I've seen mixed opinions on the UI capabilities of the model, and honestly, I initially was skeptical, too. I just think I had a bad roll on my first run, because when I did it on this project, the results were stunning. This is the new version of SnitchBench that I plan to ship very soon. Yeah, I can't believe I'm shipping the version that Opus made and not the version that Gemini or GPT-5 made, because those models were so far ahead with UI just a week ago, just a day ago. Anthropic caught up. Yeah, I honestly think this should have been a 5 model, not a 4.5. This is so much better than Opus 4.1. It's actually [ __ ] hilarious. Like, it's just so much better. It did have too much stuff in the UI initially, so I told it to remove all that and it did a good job. I'm happy. All the animations are nice. Everything's readable. It's good. It did have some funny issues with Cursor, supposedly because I'm using worktrees too heavily; that's being fixed soon. But what I thought was really cool is when it gave up on the tools, because the tool calls just kept failing over and over again. It ran a command, it switched to the right directory, and it catted this content into the right file. It just overwrote the file with the right content because it couldn't get the edit tool to work. I thought that was really cool. I was impressed the model was smart enough to work around the fact that the harness it was given was broken and still get the
task completed. It did take longer than any of the other models that I just ran, but it succeeded. I think that's worth something. I think it's worth a lot. Honestly, I was pretty impressed it pulled that off. Here we are in Claude Code. I'm using it with the API. Set it to Opus. Let's see how it does doing the image gen studio. By the way, the first time I ran this, it insisted on using npm instead of Bun, even though Bun's in the project. So, a lot of these little things I've found that Claude just isn't as good about. Especially in Claude Code, it just doesn't seem to get what's going on outside of the files it's actively touching. Especially with type safety, it just doesn't know what's going on with TypeScript unless it's told to run the tsc command, because it doesn't have access to the TypeScript LSP at all. Seems like I'm far from the only one who's felt way more productive coding when using the new model. This blog post was from Simon Willison,
and he actually shipped meaningful changes and a new alpha release for his sqlite-utils package, including several large-scale refactors. Opus 4.5 was responsible for most of the work: 20 commits, 39 files changed, 2,000 additions, and 1,100 deletions in 2 days. He even shares the Claude Code transcripts. On that topic, Anthropic actually did ship a new desktop app. Specifically, they added Claude Code to the desktop app, which is a very interesting change. Cool to see. They even included worktrees, which let you do multiple things in parallel on the same machine. I've loved having that in Cursor. I'm sure a significant portion of this is them fighting Cursor at every step, just because of the weird rivalry between Claude Code and Cursor after the team got poached and then moved back, but it is cool to see some improvements there. That said, their ability to write JavaScript for the browser or for an app, in this case an Electron desktop
app, is something I do not trust at [ __ ] all. I don't know how they make models that are this good at writing code while the code they write is so [ __ ] But the Claude website is still an actual tragedy. It's the worst one I use by far. I've seen so many people having issues with it, even today. Lavio tried buying the $20 premium plan, asked it to build a 3D room decorator with Three.js, hit the context limit, pressed continue, hit it again and again, and ran into the daily message limit before he even got to the end. And I bet you, if you had refreshed at any point, it would have killed the thread entirely. The experience you have on claude.ai is less deterministic than the generated outputs of these models. If they were to generate a new page on every click, it would probably come out better than the codebase they're currently using. If you do want an AI chat experience that isn't fundamentally broken on almost every single level, this is why we built T3 Chat. You'll
notice I'm using Haiku on here, because you can use pretty much every single model ever made. We even include Opus 4.5 over API keys, because it's an expensive model. In the future, when we change how our credit system works, I could see us adding Opus 4.5 as a model you can use on the $8 tier, and certainly as a model you can use on any higher-priced tiers we may introduce in the future. But for now, if you use code OPUS-PLZ at checkout, we'll give you your first month for $1, and you can bring your own API key if you want to use 4.5 alongside it. The image studio generation's finally done. And here it is. It's weirdly laid out, but fine. Still has the cringe purple-pink gradients. That's because I lied to you: this one wasn't by Opus. Opus did help a little bit, but it only did like 71 output tokens; the majority of this one was written by Sonnet. Cost about 21 cents to make. Took a minute and 30 seconds of API time; five and a half to six minutes of wall clock is what we got.
It's fine. But I'm not here for fine. I'm here for greatness. And this is what we got out of the new model. This is significantly better. It didn't mock up any of the UI other than the generation flow, which has a really nice gradient in the background while it's going, like the flashing purple there. I actually really like that. Did a great job. And it cost us almost a dollar. Yeah. So, as I said before, it can be cheaper, but it's not always cheaper. In this case, it was four to five times more expensive, but it looks really good. This is a passable UI. Much more tasteful gradients. The subtle purple and pink in the backgrounds here are so much less egregious than in previous Anthropic models. It also does feel like they're using that same pile of test data. Like, there are subtle design characteristics that seem to be defaults now in all of these models that are good at UI: GPT-5, Gemini 3, and now what we have here with Opus 4.5. It's my assumption that there's some dataset being sold to all of these labs that includes these design characteristics, and that's a huge part of why they're all making much better but also very similar designs. I'm happy that the design characteristics of the models are no longer the same cringe gradient they've been forever. Still not quite where I'd like it to be. I don't want all of these things generating the exact same UI over and over, but we've made a massive improvement here, and we
should celebrate that. Thank you, Anthropic, for finally catching up on UI. Thank you, Anthropic, for making all of the improvements you've made to this model, including the price, including the code quality, including the reliability, including the speed out of the APIs. This model just kind of works. That's the biggest thing I've taken away from this: unlike every other model, it isn't full of quirks. God, I've seen Gemini 3 Pro fail in so many absurd and egregious ways over the last few days. Malformed tool calls, malformed markdown responses, malformed links, hallucinated file path names, made-up commands, made-up bash scripts. It's just got that Google feel to it, is the best way I can put it. When it works, it works great, and when it doesn't, it works hilariously poorly. I don't like Gemini 3 Pro as my go-to for almost anything right now because of how unreliable it is. Believe me, if I could sit here and shill Gemini and Google over Anthropic, I would. Google's service
YouTube is the reason y'all are listening to me now. It has fundamentally changed my life. I would love to be the Google fan here, but I can't be in good faith. If I was to rank these models by their consistency in behaviors, with things like tool calls, harnesses like Cursor, and all of that, it'd be very different from the quality of the outputs I'm getting. Like, if I was to rank these by their raw capabilities, I'd put GPT-5.1 Pro at the top, because I'm just still so blown away with that model. As for Gemini 3, I'd put Opus 4.5 over that, then Gemini 3 Pro. It's weird putting Gemini 3 Pro at the bottom of this list. But then I'd put Sonnet 4.5 below that, and then nothing else really matters in the private major-lab model world. But this is for output ceiling, like how good can the outputs possibly be. If I was to rerank these by consistency, not just the consistency of the output, but how reliably it calls tools, uses search, stays on track, maintains context,
all of those things, I'd have to put Opus 4.5 and Sonnet 4.5 at the top. And I'd have to put many more spaces before we get to Gemini 3, because it's just so unreliable in comparison. Like, Gemini 3 Pro is an incredibly smart and capable model for so many different things; it's just not reliable to use in my day-to-day. Even GPT-5.1 Codex has been pretty rough. I switched Pro here to Codex. I have to put GPT-5 before it. And honestly, I have to put Composer, which is, if you don't know, the model from Cursor, above it as well. I've had a much more reliable experience with that. Hell, I've had a more reliable experience with Haiku. And everything I just added here would rank way below everything there. But for consistency, Anthropic has most of the top models. When I need the model to respond, and I want it to respond using the tools it's given properly and reliably, not only is it the only model that does it, it's the only model that will work around the harness it's given being entirely broken, super consistently. Might I remind you again that Cursor is currently broken for me with worktrees. They are actively working to fix it. But in the interim, this is the only model that identified that it's broken. And we can click into the thought: "The server is having issues. Let me try using the terminal command to write the file instead." It uses its tools to do the job. And if it can't, it uses the tools to work around the broken tools. They are the tool-calling kings.
They are the most reliable code model. They are the best experience I've had writing code with AI. Opus 4.5 is my default model. I suspect Opus 4.5 will be staying my default model for a while, because it is the best model I've ever used for writing code. If I didn't have to say this, I wouldn't. If you guys know me particularly well, you know it's not easy for me to say this. But I'm honest, if anything, and I need to be honest with y'all. Opus 4.5 is my default model for writing code right now, and I expect it to stay that way probably till the end of the year. One last thing: Simon's pelicans. His pelican bench is one of my favorites: trying to get various models to write an SVG of a pelican, a pelican on a bike specifically. This is how it did with the default high effort level with Opus 4.5, but he also wrote a more detailed version of the prompt, and it slaughtered. Here's the Gemini 3 Pro version and the GPT-5.1 Pro version for
comparison. Yeah, it won. Good job, Anthropic. You've turned me from a hater into, I won't say a shill, but someone much more impressed: between the honesty around the weaknesses of MCP, the other fun things you shipped there like the tool search tool, the programmatic tool calling, and better tool use examples, the quality of the new model, especially alongside the price change, and the massively improved token efficiency, which only serves to save us money and effectively cut costs. This is an improvement. I've been giving Anthropic [ __ ] for a while because I want them to improve, which means I want to take the time here to thank them for doing that. There is still a long way to go. We need Claude Code open source. There's no good-faith reason for it to stay closed. And no, the GitHub repo for reporting issues is not the actual source for Claude Code. They should open
source it. There's no excuse for being the only lab that hasn't. Yeah, it's a good model. Let me know what you guys think. Is this the new era for Anthropic or am I overhyping a small bump? Let me know.
Claude Opus 4.5 is insane, definitely the best coding model ever made... Thank you Kilo Code for sponsoring! Check them out at: https://soydev.link/kilo (and make sure you use code THEO for extra credits) Use code OPUS-PLZ for 1 month of T3 Chat for just $1: https://soydev.link/chat Want to sponsor a video? Learn more here: https://soydev.link/sponsor-me Check out my Twitch, Twitter, Discord and more at https://t3.gg S/O Ph4se0n3 for the awesome edit 🙏