5/23/2025
Claude 4 is not what you think...
Transcript
00:00Claude 4 is finally here. It comes in two sizes, Sonnet and Opus, and it seems Anthropic has
00:07pivoted in a completely new direction. I'll explain that in a moment. Let me give you all
00:12of the details. Right away, they claim Claude 4 Opus is the world's best coding model, which is
00:18a hint in the direction that they are heading. And what seems to make it really special is its
00:23ability to complete long horizon tasks. That is tasks over tens of minutes up to hours without
00:30losing the thread and actually being able to complete real world tasks. All right, so a few
00:35details about both of these models, and then I'm going to get into the benchmarks. First, you have
00:39extended thinking with both of them, and they are both hybrid models, which means they can give you
00:44instant responses with no thinking, or you can turn on thinking for those more complex tasks.
00:50And during the thinking, you have tool use, which is of course, really nice, but kind of table stakes
00:55at this point. And now I've already been playing around with it and hit my rate limit until 2pm
01:00today, which is a few hours away. And really, I only submitted a few prompts. So I think I'm going to
01:05have to subscribe to Max and put together a thorough test for you all. So you can see right here, we have
01:10Claude 4 Opus, Claude 4 Sonnet. If you click right here on search and tools, you can see the different
01:15tools available. You can select the style, you can turn on and off extended thinking. It has
01:19Web Search, Drive Search, Gmail Search, and Calendar Search. Those are the available tools
01:25for now, but they have more deeply integrated the MCP framework into their API. And remember,
01:31Anthropic is the company that created the MCP framework that now OpenAI, Microsoft, Google,
01:37and so many other companies have adopted. One unique thing that I really haven't seen elsewhere is that
01:43both models can use tools in parallel, which means it can send off requests to multiple tools at the
01:49same time. That seems really cool and much more efficient than doing everything sequentially.
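To picture why parallel dispatch is more efficient than sequential calls, here's a toy sketch using asyncio; the tool functions are hypothetical stand-ins for things like web search and calendar search, not Anthropic's actual tool interface:

```python
import asyncio

# Hypothetical tool calls -- stand-ins for e.g. web search and calendar search.
async def web_search(query: str) -> str:
    await asyncio.sleep(0.1)  # simulate network latency
    return f"web results for {query!r}"

async def calendar_search(query: str) -> str:
    await asyncio.sleep(0.1)  # simulate network latency
    return f"calendar results for {query!r}"

async def run_tools_in_parallel(query: str) -> list[str]:
    # Dispatch both tool calls at once instead of awaiting them one by one;
    # total wall time is roughly the slowest call, not the sum of all calls.
    return await asyncio.gather(web_search(query), calendar_search(query))

results = asyncio.run(run_tools_in_parallel("Claude 4 launch"))
print(results)
```

With two 0.1s tools, the parallel version finishes in about 0.1s instead of 0.2s, and the gap grows with the number of tools.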
01:54And it also seems to be much better at handling its own memory. All of this stuff is available in
01:59Claude code, which is also now generally available and has the Claude 4 models available. During the
02:05keynote that live streamed this morning, the chief product officer of Anthropic spent a lot of time
02:11talking about long horizon tasks and how they were able to accomplish this. Even giving an example of
02:17a company that was using Claude 4 that was able to do a task over seven hours. And as part of Claude's
02:24new API, they have four new features, including code execution tool, MCP connector, a files API,
02:30and the ability to cache prompts for up to one hour. Here's what the code execution tool looks like.
02:35You simply type in a prompt, Claude will start thinking and write code. And of course,
02:41execute that code. And I believe it needs to be Python for it to execute. The MCP connector allows
02:47you to connect any MCP server to the Claude API. So now your Claude API has access to all of the MCP
02:54tools throughout the world. They also have the Files API. So giving Claude access to your files,
03:01specifically your code files and your repositories, just became a lot easier. And then prompt caching.
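For a feel of how prompt caching is marked, here's a sketch of a cache-marked request body in the Anthropic style; the model name, field layout, and one-hour TTL handling are assumptions to verify against the current API docs:

```python
# Sketch of a cache-marked request payload, Anthropic-style. Field names and
# the model identifier are assumptions -- check the current API reference.
def build_cached_request(big_context: str, question: str) -> dict:
    return {
        "model": "claude-opus-4",  # assumed identifier
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": big_context,
                # Marks this large, stable prefix as cacheable so repeat
                # requests don't pay full input-token price for it.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": [{"role": "user", "content": question}],
    }

req = build_cached_request("<your whole repo or doc set here>", "Summarize the docs.")
```

The idea: the big, unchanging prefix (docs, code, instructions) gets cached, and only the short, varying user turn is billed at full price on repeat calls.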
03:07So of course you want to get the most efficient usage. You want to get the cheapest price and
03:11caching is the way to go. Now, with all of these, you probably can guess where this is going. Claude has
03:18basically given up on the chatbots race. It is clear that OpenAI and the major tech companies,
03:25Google, Microsoft, and unfortunately not Apple, have all won the chatbot race.
03:30The personal assistant race. So now Anthropic has transitioned into being an infrastructure
03:36company. They are providing the tools necessary to have the best coding agent. They are building
03:42the best agents. They're building the best coding agents and they are plugging it into everyone.
03:47Thomas Dohmke, the CEO of GitHub, announced that Claude 4 Sonnet is here. So it's available in
03:53GitHub Copilot and it's their default option. By the way, I interviewed Thomas at Microsoft Build.
03:59I'll drop that interview soon. So be sure to subscribe to this channel so you can get updated
04:04when that video drops. It is incredible. But look at this. In early evaluations, the models soared
04:09in agentic scenarios. That's the key. That is what we keep hearing. Memory, tools, long horizon tasks,
04:16all done by these agents, powered by Claude 4. Delivering up to a 10% improvement over the previous
04:23generation driven by sharper tool use, tighter instruction following, and stronger coding
04:28instincts. And of course, it's also available in Cursor and Windsurf and basically all of the major
04:34coding platforms out there. Now that Claude 4 is especially good at long horizon tasks, has excellent
04:40memory, and has built-in parallel tool usage, it's going to pair especially well with Box AI. And that's
04:48the sponsor of today's video. I'm really excited to tell you about them. You're going to be able to build on
04:52Box AI using the new Claude 4 models soon. With Box AI, you can use artificial intelligence to extract
04:59key metadata fields from contracts, invoices, financial documents, resumes, and more. And you
05:05can automate workflows super easily. And not just metadata. You can ask questions about it. You can
05:12really do deep dives into your company's own data. And again, if you're a developer, building on Box AI
05:18is easy. It handles the entire RAG pipeline for you. So you don't need to think about vector databases.
05:23You don't need to think about chunking. It's just done and it works. And of course, because it's Box,
05:29they have enterprise level security, governance, and compliance. And with the launch of Claude Code,
05:35if you want to use Claude Code with Box SDKs, it could not be easier. Simply give Claude Code
05:41links to the Box developer docs and it just knows how to build with it. Check out Box's blog post about
05:47the Claude Code launch to see a demo of them building a backend contract generation tool using
05:52Box doc gen and Claude Code. I'll drop all of the links in the description below. So unlock the power
05:58of your documents and data with Box and Box AI. Thanks again to Box for sponsoring this video.
06:04All right, so back to the announcement blog post. Claude Opus 4 and Sonnet 4, by the way,
06:09they kind of switched the name order, right? It was Claude 3 Opus, Claude 3.7 Sonnet, and now it's the
06:15opposite way: Claude Opus 4 and Sonnet 4. Anyway, they are hybrid models offering two modes, near instant
06:21responses and extended thinking for deeper reasoning. All right, I know you want to see the benchmarks.
06:26Benchmarks only mean so much, so take it with a grain of salt, but here they are. So software
06:30engineering, SWE-bench Verified. Yep, Claude 4 is by far the winner. So here's OpenAI's codex-1,
06:38which was just announced about a week ago at 72% on SWE-bench Verified, compared to Sonnet 3.7,
06:45which was at 62.3% and with parallel test time compute 70.3. But now we have a big jump all the
06:53way up to 80.2 with parallel test time compute for Sonnet 4 and 72.5 and 79.4 with parallel test time
07:02compute for Opus 4. And by the way, for those of you who weren't sure what parallel test time compute is,
07:08it basically just means they sampled several solutions to a prompt in parallel and chose the
07:14best one. Now, if you're looking at this, you're probably thinking the same thing I am. Did Sonnet
07:19just score better than Opus? Well, yeah, it did. And with my initial usage, I actually found Opus to be
07:27faster than Sonnet. Now that's just anecdotal, me using it a couple of times. So I'm going to need to
07:32test it a lot more, but it does seem to output code much faster. Now, here are some more benchmarks.
07:37Here's Terminal Bench: Claude Opus 4 winning at 43.2% compared to Sonnet 4 at 35%. Here is the o3 model
07:45at 30%, GPT-4.1 at 30%, and Gemini 2.5 Pro at 25%; to date, Gemini 2.5 Pro is still my favorite coding model.
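Quick aside before more numbers: that "parallel test time compute" setup is essentially best-of-n sampling. Here's a toy sketch with stand-in generator and scorer functions; nothing here is Anthropic's actual harness:

```python
import random

# Toy stand-ins: a stochastic "model" that emits candidate solutions, and a
# scorer (in SWE-bench terms, something like how many unit tests a patch passes).
def generate_candidate(prompt: str, rng: random.Random) -> int:
    return rng.randint(0, 100)  # pretend this is a candidate patch

def score(candidate: int) -> int:
    return candidate  # pretend higher means more tests pass

def best_of_n(prompt: str, n: int, seed: int = 0) -> int:
    rng = random.Random(seed)
    # Sample several independent solutions (these could run in parallel)...
    candidates = [generate_candidate(prompt, rng) for _ in range(n)]
    # ...then keep only the best-scoring one.
    return max(candidates, key=score)

print(best_of_n("fix the bug", n=8))
```

More samples can only help the final pick, which is why the "with parallel test time compute" numbers are always the higher ones.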
07:55Here's GPQA Diamond, which is graduate level reasoning. We have agentic tool use doing quite
08:01well compared to the other models. Now, you're probably noticing one other thing. Sonnet 3.7 is
08:06still doing quite well. I'm going to show you that in a second. We have multilingual Q&A,
08:11again, getting a nice bump. Visual reasoning, getting about the same score. And then high school
08:18math competition, AIME 2025, getting a very nice bump over Claude 3.7. Now I'm going to pause for a second
08:24and show you something. This is a post by John Shoneth, and he points out that the green
08:29boxes are around benchmarks where Claude Sonnet 4 did better than Claude Sonnet 3.7. The yellow ones
08:35are where it did about the same, and red is where performance actually decreased, which is
08:41kind of nuts. So of all of these benchmarks that they submitted, half actually went down. So I don't
08:47really know what to think about that. They're saying it was a huge bump, but the benchmarks don't
08:52actually reflect that. And the benchmarks tend to be the nicest view of these models until people
08:58start doing the vibe checks of them. So very interesting. And of course, I'm going to be
09:03testing it thoroughly. We'll see. Now, one other thing that they called out during the keynote today
09:09is that when Claude 3 came out, it was kind of lazy with coding. And then Claude 3.5 and 3.7 kind of went
09:17the other way. It tried too hard and did things it shouldn't and outputted way too much code.
09:21And they think they really dialed it in with Claude 4. They also, being Anthropic, focused a lot on
09:28safety. So we've significantly reduced behavior where the models use shortcuts or loopholes to
09:33complete tasks. And of course, they're using the Pokemon example here. Both models are 65% less
09:39likely to engage in this behavior than Sonnet 3.7 on agentic tasks that are particularly susceptible to
09:46shortcuts or loopholes. Claude Opus 4 also dramatically outperforms all previous models
09:51on memory capabilities, which I've mentioned already. But I have said memory for agents is
09:57really the key ingredient to making them hyper personal. And they called out in the keynote today,
10:02the hundredth time you use Claude 4 should be much better, much more efficient and much more concise
10:08than the first time you use Claude 4. That's because it's learning and it's understanding
10:12what you want. It's developing a shorthand with you as the user. Opus 4 becomes skilled at creating
10:19and maintaining memory files to store key information. This unlocks better long-term task
10:24awareness, coherence, and performance on agent tasks. And here's the example of the Pokemon benchmark.
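The memory-file pattern is simple enough to sketch. This is a toy illustration only, not Anthropic's implementation, and `agent_memory.json` is a hypothetical path:

```python
import json
from pathlib import Path

# Toy sketch of the "memory file" pattern: the agent persists key facts
# between sessions instead of relying on its context window alone.
MEMORY_PATH = Path("agent_memory.json")  # hypothetical location

def remember(fact_key: str, fact: str) -> None:
    # Load existing memory (if any), add the new fact, write it back.
    memory = json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}
    memory[fact_key] = fact
    MEMORY_PATH.write_text(json.dumps(memory, indent=2))

def recall() -> dict:
    # Reload everything the agent has noted so far, e.g. at session start.
    return json.loads(MEMORY_PATH.read_text()) if MEMORY_PATH.exists() else {}

remember("user_style", "prefers concise answers with code examples")
remember("project", "refactoring the billing service")
print(recall())
```

Each session starts by recalling the file, which is how "the hundredth time" can be more concise than the first: the shorthand lives on disk, not in the context window.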
10:31They've also introduced thinking summaries for Claude 4 models that use a smaller model to condense
10:36lengthy thought processes. I would love to see the thought process, but you basically see nothing now.
10:42Now here's the key. Users requiring raw chains of thought for advanced prompt engineering can contact
10:48sales. So if you want to see the raw chains of thought, you're probably going to have to pay up.
10:54All right, the next big announcement, I touched on it. Let's get into more detail.
10:57Claude Code is now generally available.
10:59They have new extensions for VS Code and JetBrains that integrate Claude Code directly into your IDE,
11:06which is nice. This is direct competition with all of the coding tools out there. Claude's proposed
11:11edits appear inline in your files, streamlining review and tracking with the familiar editor interface.
11:17And they're releasing a Claude Code SDK so you can build your own coding agent. So again,
11:22they're really building out the infrastructure layer of agentic coding. So Claude Code on GitHub
11:29is now available. And that's an example of what's possible with the SDK. Tag Claude Code on PRs to
11:34respond to reviewer feedback, fix CI errors or modify code. So here's an example. Here's a PR right
11:41here. You're going to come into a comment. You're going to tag Claude. Could you please address this
11:45feedback comment? And it's going to jump in and start doing it right away. Gather issue and comment
11:51context, address the feedback, create a pull request, verify with lint and tests, and so on. And then you have
11:57a PR ready to review. Now, the chief science officer at Anthropic has weighed in. According to Techmeme,
12:03Anthropic's Jared Kaplan says the company stopped investing in chatbots at the end of 2024 and instead
12:10focused on improving Claude's ability to do complex tasks. And this makes sense. Claude is just not
12:16achieving the mindshare necessary to win at the chatbot game. That's ChatGPT. That's Gemini. Hopefully,
12:23Siri in the future. So they gave up on that and went in and focused on agentic capabilities. And
12:29you know what? Good for them. Focus is what is required to win. And how about the pricing? Let's
12:35check it out. So Claude 4 Opus, the most intelligent model for complex tasks. It has a 200K context window,
12:41which is still relatively small. And you get a 50% discount with batch processing: $15 per million
12:50input tokens and $75 per million output tokens. So that's it. I'm going to be testing it out.
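As a quick sanity check on what those prices mean in practice; applying the 50% batch discount to both input and output tokens is my assumption, so verify against the pricing page:

```python
# Claude 4 Opus list prices from the announcement: $15 / 1M input tokens,
# $75 / 1M output tokens. Applying the 50% batch discount to both sides
# is an assumption -- check the pricing page for exact terms.
def opus_cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    cost = input_tokens / 1_000_000 * 15 + output_tokens / 1_000_000 * 75
    return cost / 2 if batch else cost

# A 50K-token prompt producing 2K tokens of output:
print(f"interactive: ${opus_cost(50_000, 2_000):.2f}")   # prints: interactive: $0.90
print(f"batched:     ${opus_cost(50_000, 2_000, batch=True):.2f}")  # prints: batched:     $0.45
```

At these rates the output side dominates fast: 2K output tokens here cost $0.15 against $0.75 for 25x as many input tokens.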
12:55Expect a testing video soon. If you enjoyed this video, please consider giving a like and subscribe,
13:00and I'll see you in the next one.
