5/23/2025
Claude 4 is not what you think...
Transcript
00:00Claude 4 is finally here. It comes in two sizes, Sonnet and Opus, and it seems Anthropic has
00:07pivoted in a completely new direction. I'll explain that in a moment. Let me give you all
00:12of the details. Right away, they claim Claude 4 Opus is the world's best coding model, which is
00:18a hint in the direction that they are heading. And what seems to make it really special is its
00:23ability to complete long horizon tasks. That is tasks over tens of minutes up to hours without
00:30losing the thread and actually being able to complete real world tasks. All right, so a few
00:35details about both of these models, and then I'm going to get into the benchmarks. First, you have
00:39extended thinking with both of them, and they are both hybrid models, which means they can give you
00:44instant responses with no thinking, or you can turn on thinking for those more complex tasks.
00:50And during the thinking, you have tool use, which is of course, really nice, but kind of table stakes
00:55at this point. And now I've already been playing around with it and hit my rate limit until 2pm
01:00today, which is a few hours away. And really, I only submitted a few prompts. So I think I'm going to
01:05have to subscribe to Max and put together a thorough test for you all. So you can see right here, we have
01:10Claude 4 Opus, Claude 4 Sonnet. If you click right here on search and tools, you can see the different
01:15tools available. You can select the style, you can turn on and off extended thinking. It has
01:19Web Search, Drive Search, Gmail Search, and Calendar Search. Those are the available tools
01:25for now, but they have more deeply integrated the MCP framework into their API. And remember,
01:31Anthropic is the company that created the MCP framework that now OpenAI, Microsoft, Google,
01:37and so many other companies have adopted. One unique thing that I really haven't seen elsewhere is that
01:43both models can use tools in parallel, which means it can send off requests to multiple tools at the
01:49same time. That seems really cool and much more efficient than doing everything sequentially.
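To picture why parallel dispatch is more efficient than sequential calls, here's a toy sketch using asyncio; the tool functions are hypothetical stand-ins for things like web search and calendar search, not Anthropic's actual tool interface:

```python
import asyncio

# Hypothetical tool calls -- stand-ins for e.g. web search and calendar search.
async def web_search(query: str) -> str:
    await asyncio.sleep(0.1)  # simulate network latency
    return f"web results for {query!r}"

async def calendar_search(query: str) -> str:
    await asyncio.sleep(0.1)  # simulate network latency
    return f"calendar results for {query!r}"

async def run_tools_in_parallel(query: str) -> list[str]:
    # Dispatch both tool calls at once instead of awaiting them one by one;
    # total wall time is roughly the slowest call, not the sum of all calls.
    return await asyncio.gather(web_search(query), calendar_search(query))

results = asyncio.run(run_tools_in_parallel("Claude 4 launch"))
print(results)
```

With two 0.1s tools, the parallel version finishes in about 0.1s instead of 0.2s, and the gap grows with the number of tools.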
01:54And it also seems to be much better at handling its own memory. All of this stuff is available in
01:59Claude code, which is also now generally available and has the Claude 4 models available. During the
02:05keynote that live streamed this morning, the chief product officer of Anthropic spent a lot of time
02:11talking about long horizon tasks and how they were able to accomplish this. Even giving an example of
02:17a company that was using Claude 4 that was able to do a task over seven hours. And as part of Claude's
02:24new API, they have four new features, including code execution tool, MCP connector, a files API,
02:30and the ability to cache prompts for up to one hour. Here's what the code execution tool looks like.
02:35You simply type in a prompt, Claude will start thinking and write code. And of course,
02:41execute that code. And I believe it needs to be Python for it to execute. The MCP connector allows
02:47you to connect any MCP server to the Claude API. So now your Claude API has access to all of the MCP
02:54tools throughout the world. They also have the Files API. So giving Claude access to your files,
03:01specifically your code files and your repositories, just became a lot easier. And then prompt caching.
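For a feel of how prompt caching is marked, here's a sketch of a cache-marked request body in the Anthropic style; the model name, field layout, and one-hour TTL handling are assumptions to verify against the current API docs:

```python
# Sketch of a cache-marked request payload, Anthropic-style. Field names and
# the model identifier are assumptions -- check the current API reference.
def build_cached_request(big_context: str, question: str) -> dict:
    return {
        "model": "claude-opus-4",  # assumed identifier
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": big_context,
                # Marks this large, stable prefix as cacheable so repeat
                # requests don't pay full input-token price for it.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }

req = build_cached_request("<your whole repo or doc set here>", "Summarize the docs.")
```

The idea: the big, unchanging prefix (docs, code, instructions) gets cached, and only the short, varying user turn is billed at full price on repeat calls.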
03:07So of course you want to get the most efficient usage. You want to get the cheapest price and
03:11caching is the way to go. Now, with all of these, you probably can guess where this is going. Claude has
03:18basically given up on the chatbots race. It is clear that OpenAI and the major tech companies,
03:25Google, Microsoft, and unfortunately not Apple, have all won the chatbot race.
03:30The personal assistant race. So now Anthropic has transitioned into being an infrastructure
03:36company. They are providing the tools necessary to have the best coding agent. They are building
03:42the best agents. They're building the best coding agents and they are plugging it into everyone.
03:47Thomas Dohmke, the CEO of GitHub, announced that Claude 4 Sonnet is here. So it's available in
03:53GitHub Copilot and it's their default option. By the way, I interviewed Thomas at Microsoft Build.
03:59I'll drop that interview soon. So be sure to subscribe to this channel so you can get updated
04:04when that video drops. It is incredible. But look at this. In early evaluations, the models soared
04:09in agentic scenarios. That's the key. That is what we keep hearing. Memory, tools, long horizon tasks,
04:16all done by these agents, powered by Claude 4. Delivering up to a 10% improvement over the previous
04:23generation driven by sharper tool use, tighter instruction following, and stronger coding
04:28instincts. And of course, it's also available in Cursor and Windsurf and basically all of the major
04:34coding platforms out there. Now that Claude 4 is especially good at long horizon tasks, has excellent
04:40memory, and has built-in parallel tool usage, it's going to pair especially well with Box AI. And that's
04:48the sponsor of today's video. I'm really excited to tell you about them. You're going to be able to build on
04:52Box AI using the new Claude 4 models soon. With Box AI, you can use artificial intelligence to extract
04:59key metadata fields from contracts, invoices, financial documents, resumes, and more. And you
05:05can automate workflows super easily. And not just metadata. You can ask questions about it. You can
05:12really do deep dives into your company's own data. And again, if you're a developer, building on Box AI
05:18is easy. It handles the entire RAG pipeline for you. So you don't need to think about vector databases.
05:23You don't need to think about chunking. It's just done and it works. And of course, because it's Box,
05:29they have enterprise level security, governance, and compliance. And with the launch of Claude Code,
05:35if you want to use Claude Code with Box SDKs, it could not be easier. Simply give Claude Code
05:41links to the Box developer docs and it just knows how to build with it. Check out Box's blog post about
05:47the Claude Code launch to see a demo of them building a backend contract generation tool using
05:52Box doc gen and Claude Code. I'll drop all of the links in the description below. So unlock the power
05:58of your documents and data with Box and Box AI. Thanks again to Box for sponsoring this video.
06:04All right, so back to the announcement blog post. Claude Opus 4 and Sonnet 4, by the way,
06:09they kind of switched the name order, right? It was Claude 3 Opus, Claude 3.7 Sonnet, and now it's the
06:15opposite way: Claude Opus 4 and Sonnet 4. Anyway, they are hybrid models offering two modes, near instant
06:21responses and extended thinking for deeper reasoning. All right, I know you want to see the benchmarks.
06:26Benchmarks only mean so much, so take it with a grain of salt, but here they are. So software
06:30engineering, SWE-bench Verified. Yep, Claude 4 is by far the winner. So here's OpenAI's codex-1,
06:38which was just announced about a week ago at 72% on SWE-bench Verified, compared to Sonnet 3.7,
06:45which was at 62.3% and with parallel test time compute 70.3. But now we have a big jump all the
06:53way up to 80.2 with parallel test time compute for Sonnet 4 and 72.5 and 79.4 with parallel test time
07:02compute for Opus 4. And by the way, for those of you who weren't sure what parallel test time compute is,
07:08it basically just means they sampled several solutions to a prompt in parallel and chose the
07:14best one. Now, if you're looking at this, you're probably thinking the same thing I am. Did Sonnet
07:19just score better than Opus? Well, yeah, it did. And with my initial usage, I actually found Opus to be
07:27faster than Sonnet. Now that's just anecdotal, me using it a couple of times. So I'm going to need to
07:32test it a lot more, but it does seem to output code much faster. Now, here are some more benchmarks.
07:37Here's Terminal Bench: Claude Opus 4 winning at 43.2% compared to Sonnet 4 at 35%. Here is the o3 model
07:45at 30%, GPT-4.1 at 30%, and Gemini 2.5 Pro at 25%; to date, Gemini 2.5 Pro is still my favorite coding model.
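Quick aside before more numbers: that "parallel test time compute" setup is essentially best-of-n sampling. Here's a toy sketch with stand-in generator and scorer functions; nothing here is Anthropic's actual harness:

```python
import random

# Toy stand-ins: a stochastic "model" that emits candidate solutions, and a
# scorer (in SWE-bench terms, something like how many unit tests a patch passes).
def generate_candidate(prompt: str, rng: random.Random) -> int:
    return rng.randint(0, 100)  # pretend this is a candidate patch

def score(candidate: int) -> int:
    return candidate  # pretend higher means more tests pass

def best_of_n(prompt: str, n: int, seed: int = 0) -> int:
    rng = random.Random(seed)
    # Sample several independent solutions (these could run in parallel)...
    candidates = [generate_candidate(prompt, rng) for _ in range(n)]
    # ...then keep only the best-scoring one.
    return max(candidates, key=score)

print(best_of_n("fix the bug", n=8))
```

More samples can only help the final pick, which is why the "with parallel test time compute" numbers are always the higher ones.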
07:55Here's GPQA Diamond, which is graduate level reasoning. We have agentic tool use doing quite
08:01well compared to the other models. Now, you're probably noticing one other thing. Sonnet 3.7 is
08:06still doing quite well. I'm going to show you that in a second. We have multilingual Q&A,
08:11again, getting a nice bump. Visual reasoning, getting about the same score. And then high school
08:18math competition, AIME 2025, getting a very nice bump over Claude 3.7. Now I'm going to pause for a second
08:24and show you something. This is a post by John Shoneth, and he points out that the green
08:29boxes are around benchmarks where Claude Sonnet 4 did better than Claude Sonnet 3.7. The yellow ones
08:35are where it did about the same, and red is where performance actually decreased, which is
08:41kind of nuts. So of all of these benchmarks that they submitted, half actually went down. So I don't
08:47really know what to think about that. They're saying it was a huge bump, but the benchmarks don't
08:52actually reflect that. And the benchmarks tend to be the nicest view of these models until people
08:58start doing the vibe checks of them. So very interesting. And of course, I'm going to be
09:03testing it thoroughly. We'll see. Now, one other thing that they called out during the keynote today
09:09is that when Claude 3 came out, it was kind of lazy with coding. And then Claude 3.5 and 3.7 kind of went
09:17the other way. It tried too hard and did things it shouldn't and outputted way too much code.
09:21And they think they really dialed it in with Claude 4. They also, being Anthropic, focused a lot on
09:28safety. So we've significantly reduced behavior where the models use shortcuts or loopholes to
09:33complete tasks. And of course, they're using the Pokemon example here. Both models are 65% less
09:39likely to engage in this behavior than Sonnet 3.7 on agentic tasks that are particularly susceptible to
09:46shortcuts or loopholes. Claude Opus 4 also dramatically outperforms all previous models
09:51on memory capabilities, which I've mentioned already. But I have said memory for agents is
09:57really the key ingredient to making them hyper personal. And they called out in the keynote today,
10:02the hundredth time you use Claude 4 should be much better, much more efficient and much more concise
10:08than the first time you use Claude 4. That's because it's learning and it's understanding
10:12what you want. It's developing a shorthand with you as the user. Opus 4 becomes skilled at creating
10:19and maintaining memory files to store key information. This unlocks better long-term task
10:24awareness, coherence, and performance on agent tasks. And here's the example of the Pokemon benchmark.
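The memory-file pattern is simple enough to sketch. This is a toy illustration only, not Anthropic's implementation, and `agent_memory.json` is a hypothetical path:

```python
import json
from pathlib import Path

# Toy sketch of the "memory file" pattern: the agent persists key facts
# between sessions instead of relying on its context window alone.
MEMORY_PATH = Path("agent_memory.json")  # hypothetical location

def remember(fact_key: str, fact: str) -> None:
    # Load existing memory (if any), add the new fact, write it back.
    memory = json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}
    memory[fact_key] = fact
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))

def recall() -> dict:
    # Reload everything the agent has noted so far, e.g. at session start.
    return json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}

remember("user_style", "prefers concise answers with code examples")
remember("project", "refactoring the billing service")
print(recall())
```

Each session starts by recalling the file, which is how "the hundredth time" can be more concise than the first: the shorthand lives on disk, not in the context window.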
10:31They've also introduced thinking summaries for Claude 4 models that use a smaller model to condense
10:36lengthy thought processes. I would love to see the thought process, but you basically see nothing now.
10:42Now here's the key. Users requiring raw chains of thought for advanced prompt engineering can contact
10:48sales. So if you want to see the raw chains of thought, you're probably going to have to pay up.
10:54All right, the next big announcement, I touched on it. Let's get into more detail.
10:57Claude Code is now generally available.
10:59They have new extensions for VS Code and JetBrains that integrate Claude Code directly into your IDE,
11:06which is nice. This is direct competition with all of the coding tools out there. Claude's proposed
11:11edits appear inline in your files, streamlining review and tracking with the familiar editor interface.
11:17And they're releasing a Claude Code SDK so you can build your own coding agent. So again,
11:22they're really building out the infrastructure layer of agentic coding. So Claude Code on GitHub
11:29is now available. And that's an example of what's possible with the SDK. Tag Claude Code on PRs to
11:34respond to reviewer feedback, fix CI errors or modify code. So here's an example. Here's a PR right
11:41here. You're going to come into a comment. You're going to tag Claude. Could you please address this
11:45feedback comment? And it's going to jump in and start doing it right away. Gather issue and comment
11:51context, address the feedback, create a pull request, verify with lint and tests, and so on. And then you have
11:57a PR ready to review. Now, the chief science officer at Anthropic has weighed in. According to Techmeme,
12:03Anthropic's Jared Kaplan says the company stopped investing in chatbots at the end of 2024 and instead
12:10focused on improving Claude's ability to do complex tasks. And this makes sense. Claude is just not
12:16achieving the mindshare necessary to win at the chatbot game. That's ChatGPT. That's Gemini. Hopefully,
12:23Siri in the future. So they gave up on that and went in and focused on agentic capabilities. And
12:29you know what? Good for them. Focus is what is required to win. And how about the pricing? Let's
12:35check it out. So Claude 4 Opus, the most intelligent model for complex tasks. It has a 200K context window,
12:41which is still relatively small. And you get a 50% discount with batch processing: $15 per million
12:50input tokens and $75 per million output tokens. So that's it. I'm going to be testing it out.
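As a quick sanity check on what those prices mean in practice; applying the 50% batch discount to both input and output tokens is my assumption, so verify against the pricing page:

```python
# Claude 4 Opus list prices from the announcement: $15 / 1M input tokens,
# $75 / 1M output tokens. Applying the 50% batch discount to both sides
# is an assumption -- check the pricing page for exact terms.
def opus_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    cost = input_tokens / 1_000_000 * 15 + output_tokens / 1_000_000 * 75
    return cost / 2 if batch else cost

# A 50K-token prompt producing 2K tokens of output:
print(f"interactive: ${opus_cost(50_000, 2_000):.2f}")   # prints: interactive: $0.90
print(f"batched:     ${opus_cost(50_000, 2_000, batch=True):.2f}")  # prints: batched:     $0.45
```

At these rates the output side dominates fast: 2K output tokens here cost $0.15 against $0.75 for 25x as many input tokens.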
12:55Expect a testing video soon. If you enjoyed this video, please consider giving a like and subscribe,
13:00and I'll see you in the next one.
