Check out Box AI here: https://bit.ly/4kyfWq2
Join My Newsletter for Regular AI Updates:
https://forwardfuture.ai
Discover The Best AI Tools:
https://tools.forwardfuture.ai
My Links:
X: https://x.com/matthewberman
Instagram: / matthewberman_ai
Discord: / discord
Media/Sponsorship Inquiries:
https://bit.ly/44TC45V
Links:
https://www.anthropic.com/news/claude-4
https://x.com/ashtom/status/192559739...
https://x.com/eleven21/status/1925594...
https://x.com/shaunralston/status/192...
Category: Tech
Transcript
00:00Claude 4 is finally here. It comes in two sizes, Sonnet and Opus, and it seems Anthropic has
00:07pivoted in a completely new direction. I'll explain that in a moment. Let me give you all
00:12of the details. Right away, they claim Claude 4 Opus is the world's best coding model, which is
00:18a hint in the direction that they are heading. And what seems to make it really special is its
00:23ability to complete long-horizon tasks; that is, tasks spanning tens of minutes up to hours without
00:30losing the thread, actually completing real-world tasks. All right, so a few
00:35details about both of these models, and then I'm going to get into the benchmarks. First, you have
00:39extended thinking with both of them, and they are both hybrid models, which means they can give you
00:44instant responses with no thinking, or you can turn on thinking for those more complex tasks.
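In API terms, that thinking toggle is just a request parameter. Here's a minimal sketch using the Anthropic Python SDK (the exact model ID string is my assumption from the launch announcement, so double-check it against the docs):
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Thinking off: near-instant response
fast = client.messages.create(
    model="claude-opus-4-20250514",  # assumed launch model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this function in one line."}],
)

# Thinking on: give the model a token budget to reason before it answers
deep = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=16000,  # must be larger than the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Refactor this module and explain the tradeoffs."}],
)
```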
00:50And during the thinking, you have tool use, which is, of course, really nice, but kind of table
00:55stakes at this point. And now I've already been playing around with it and hit my rate limit until
00:592 p.m. today, which is a few hours away. And really, I only submitted a few prompts. So I think I'm going
01:05to have to subscribe to Max and put together a thorough test for you all. So you can see right
01:10here, we have Claude 4 Opus and Claude 4 Sonnet. If you click right here on search and tools, you can see the
01:15different tools available. You can select the style, you can turn on and off extended thinking. It
01:19has web search, Drive search, Gmail search, and calendar search. Those are the available tools
01:24for now. But they have more deeply integrated the MCP framework into their API. And remember,
01:32Anthropic is the company that created MCP, the Model Context Protocol, which OpenAI, Microsoft, Google, and
01:37so many other companies have adopted. One unique thing that I really haven't seen elsewhere is that
01:43both models can use tools in parallel, which means it can send off requests to multiple tools at the
01:49same time. That seems really cool and much more efficient than doing everything sequentially.
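On the client side, parallel tool use just means a single assistant turn can contain several tool_use blocks. Here's a sketch of dispatching them concurrently (the two tool implementations are hypothetical stand-ins):
```python
import asyncio

# Hypothetical local tool implementations
async def get_weather(city: str) -> str:
    return f"Sunny in {city}"

async def get_news(topic: str) -> str:
    return f"Top headline about {topic}"

TOOLS = {"get_weather": get_weather, "get_news": get_news}

async def handle_parallel_tool_use(response):
    # Collect every tool_use block Claude emitted in this one assistant turn
    calls = [block for block in response.content if block.type == "tool_use"]
    # Run them concurrently instead of one after another
    results = await asyncio.gather(*(TOOLS[c.name](**c.input) for c in calls))
    # Send one tool_result back per call, matched by tool_use_id
    return [
        {"type": "tool_result", "tool_use_id": c.id, "content": r}
        for c, r in zip(calls, results)
    ]
```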
01:54And it also seems to be much better at handling its own memory. All of this stuff is available in
01:59Claude Code, which is also now generally available and supports the Claude 4 models. During the keynote
02:06that live streamed this morning, the chief product officer of Anthropic spent a lot of time talking
02:11about long-horizon tasks and how they were able to accomplish this, even giving an example of a
02:17company that was using Claude 4 that was able to run a task for over seven hours. And as part of Claude's new
02:24API, they have four new features: a code execution tool, an MCP connector, a Files API, and the
02:31ability to cache prompts for up to one hour. Here's what the code execution tool looks like. You simply type
02:37in a prompt, and Claude will start thinking, write code, and of course execute that code. And I believe
02:43it needs to be Python for it to execute.
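Through the API, the code execution tool is a server-side tool you attach to the request. A rough sketch; the tool type string and beta flag here are assumptions from launch-era docs, so verify them before using:
```python
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=4096,
    betas=["code-execution-2025-05-22"],  # assumed beta flag; check the docs
    # Server-side tool: Claude writes Python and runs it in a sandbox
    tools=[{"type": "code_execution_20250522", "name": "code_execution"}],
    messages=[{"role": "user", "content": "Compute the mean of [3, 1, 4, 1, 5] and plot a histogram."}],
)
```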
02:50The MCP connector allows you to connect any MCP server to the Claude API. So now your Claude API has access to all of the MCP tools throughout the world.
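Calling it looks roughly like this: you hand the API a remote MCP server URL and Claude can invoke that server's tools directly. The field names and beta flag are assumptions from launch-era docs, and the server URL is hypothetical:
```python
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    betas=["mcp-client-2025-04-04"],  # assumed beta flag; check the docs
    mcp_servers=[
        {
            "type": "url",
            "url": "https://example.com/mcp",  # hypothetical remote MCP server
            "name": "example-tools",
        }
    ],
    messages=[{"role": "user", "content": "Use example-tools to look up today's open tickets."}],
)
```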
02:57They also have the Files API, so giving Claude access to your files, specifically your code files and
03:03your repositories, just became a lot easier. And then there's prompt caching. Of course, you want the
03:08most efficient usage, you want the cheapest price, and caching is the way to go.
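And caching is a per-block annotation on the request. A minimal sketch: the cache_control breakpoint is Anthropic's existing prompt-caching mechanism, while the one-hour "ttl" field is my assumption based on this announcement, so verify the exact field name:
```python
import anthropic

client = anthropic.Anthropic()

repo_context = open("repo_summary.txt").read()  # large prefix you reuse across calls

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": repo_context,
            # Cache this prefix so repeat requests don't pay full input price;
            # "ttl": "1h" is the new one-hour option (field name assumed)
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "Where is the auth middleware defined?"}],
)
```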
03:14Now, with all of these, you can probably guess where this is going: Anthropic has basically given up on the chatbot race.
03:20It is clear that OpenAI and the major tech companies, Google and Microsoft (and, unfortunately, not Apple),
03:28have all won the chatbot race, the personal assistant race. So now Anthropic has transitioned
03:34into being an infrastructure company. They are providing the tools necessary to have the best
03:40coding agent. They are building the best agents. They're building the best coding agents and they
03:45are plugging it into everyone. Thomas Dohmke, the CEO of GitHub, announced Claude Sonnet 4 is here.
03:52It's available in GitHub Copilot, and it's their default option. By the way, I interviewed Thomas at
03:58Microsoft Build. I'll drop that interview soon. So be sure to subscribe to this channel so you can
04:03get updated when that video drops. It is incredible. But look at this. In early evaluations, the model
04:08soared in agentic scenarios. That's the key. That is what we keep hearing. Memory, tools, long horizon
04:16tasks, all done by these agents, powered by Claude 4, delivering up to a 10% improvement over the previous
04:23generation driven by sharper tool use, tighter instruction following, and stronger coding
04:28instincts. And of course, it's also available in Cursor and Windsurf and basically all of the major
04:34coding platforms out there. Now, given that Claude 4 is especially good at long-horizon tasks, has excellent
04:40memory, and has built-in parallel tool usage, it's going to pair especially well with Box AI. And that's
04:48the sponsor of today's video. I'm really excited to tell you about them. You're going to be able to build
04:52on Box AI using the new Claude 4 models soon. With Box AI, you can use artificial intelligence to extract
04:59key metadata fields from contracts, invoices, financial documents, resumes, and more. And you
05:05can automate workflows super easily. And not just metadata. You can ask questions about it. You can
05:12really do deep dives into your company's own data. And again, if you're a developer, building on Box AI
05:18is easy. It handles the entire RAG pipeline for you. So you don't need to think about vector databases.
05:23You don't need to think about chunking. It's just done and it works. And of course, because it's Box,
05:29they have enterprise-level security, governance, and compliance. And with the launch of Claude Code,
05:35if you want to use Claude Code with Box SDKs, it could not be easier. Simply give Claude Code links to the
05:42Box developer docs, and it just knows how to build with it. Check out Box's blog post about the Claude
05:48Code launch to see a demo of them building a backend contract generation tool using Box Doc Gen and
05:54Claude Code. I'll drop all of the links in the description below. So unlock the power of your
05:58documents and data with Box and Box AI. Thanks again to Box for sponsoring this video. All right,
06:04so back to the announcement blog post. Claude Opus 4 and Sonnet 4 (by the way, they kind of switched the
06:10naming around, right? It was Claude 3 Opus and Claude 3.5 Sonnet, and now it's the opposite way: Claude Opus 4 and
06:17Sonnet 4) are hybrid models offering two modes: near-instant responses and extended thinking
06:23for deeper reasoning? All right, I know you want to see the benchmarks. Benchmarks only mean so much,
06:27so take it with a grain of salt, but here they are. So, software engineering: SWE-bench Verified. Yep,
06:33Claude 4 is by far the winner. So here's OpenAI's Codex-1, which was just announced about a week
06:40ago, at 72% on SWE-bench Verified, compared to Sonnet 3.7, which was at 62.3%, and with parallel
06:48test-time compute, 70.3%. But now we have a big jump all the way up to 80.2% with parallel test-time
06:56compute for Sonnet 4, and 72.5%, or 79.4% with parallel test-time compute, for Opus 4. And by the way,
07:04for those of you who weren't sure what parallel test-time compute is, it basically just means they
07:10sampled several candidate solutions to each prompt in parallel and chose the best one.
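Conceptually, it's just best-of-n sampling. A toy sketch, where the solver and scorer are hypothetical stand-ins (in practice, the scorer might be a test suite or a verifier model):
```python
import asyncio
import random

async def solve(prompt: str) -> str:
    # Stand-in for one independent model attempt at the prompt
    return f"candidate-{random.randint(0, 999)}"

def score(solution: str) -> float:
    # Hypothetical verifier, e.g. fraction of unit tests the candidate passes
    return random.random()

async def best_of_n(prompt: str, n: int = 8) -> str:
    # Sample n candidate solutions in parallel, then keep the best-scoring one
    candidates = await asyncio.gather(*(solve(prompt) for _ in range(n)))
    return max(candidates, key=score)

print(asyncio.run(best_of_n("Fix the failing test in utils.py")))
```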
07:15Now, if you're looking at this, you're probably thinking the same thing I am: did Sonnet just score better than Opus?
07:21Well, yeah, it did. And with my initial usage, I actually found Opus to be faster than Sonnet. Now,
07:29that's just anecdotal, me using it a couple of times, so I'm going to need to test it a lot more,
07:33but it does seem to output code much faster. Now, here are some more benchmarks. Here's Terminal
07:38Bench: Claude Opus 4 winning at 43.2% compared to Sonnet 4 at 35%. Here is the o3 model at 30%,
07:47GPT-4.1 at 30%, and Gemini 2.5 Pro at 25%; and to date, Gemini 2.5 Pro is still my favorite coding
07:55model. Here's GPQA Diamond, which is graduate level reasoning. We have Agentic Tool Use doing
08:01quite well compared to the other models. Now, you're probably noticing one other thing. Sonnet 3.7 is
08:06still doing quite well. I'm going to show you that in a second. We have Multilingual Q&A,
08:12again, getting a nice bump. Visual reasoning, getting about the same score. And then the high school
08:18math competition, AIME 2025, getting a very nice bump over Claude 3.7. Now, I'm going to pause for a
08:24second and show you something. This is a post by John Shoneth, and he points out that the green
08:29boxes are around benchmarks where Claude Sonnet 4 did better than Claude Sonnet 3.7. The yellow ones are
08:35where it did about the same. And red is where it actually got a decrease in performance, which is
08:41kind of nuts. So of all of these benchmarks that they submitted, half actually went down. So I don't
08:48really know what to think about that. They're saying it was a huge bump, but the benchmarks don't
08:52actually reflect that. And the benchmarks tend to be the nicest view of these models until people
08:58start doing the vibe checks of them. So very interesting. And of course, I'm going to be testing
09:03it thoroughly. We'll see. Now, one of the things that they called out during the keynote today
09:08is that when Claude 3 came out, it was kind of lazy with coding. And then Claude 3.5 and 3.7
09:16kind of went the other way. It tried too hard and did things it shouldn't and outputted way too much
09:21code. And they think they really dialed it in with Claude 4. They also, being Anthropic, focused a lot on
09:28safety. So we've significantly reduced behavior where the models use shortcuts or loopholes to
09:33complete tasks. And of course, they're using the Pokemon example here. Both models are 65% less
09:40likely to engage in this behavior than Sonnet 3.7 on agentic tasks that are particularly susceptible to
09:46shortcuts or loopholes. Claude Opus 4 also dramatically outperforms all previous models on memory capabilities,
09:52which I've mentioned already. But I have said memory for agents is really the key ingredient to making
09:59them hyper personal. And they called out in the keynote today, the 100th time you use Claude 4 should
10:04be much better, much more efficient and much more concise than the first time you use Claude 4. That's
10:10because it's learning and it's understanding what you want. It's developing a shorthand with you as the user.
10:16Opus 4 becomes skilled at creating and maintaining memory files to store key information. This unlocks
10:23better long-term task awareness, coherence, and performance on agent tasks. And here's the example
10:29of the Pokémon benchmark.
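From the developer side, memory files can be as simple as exposing file tools the model can call between steps. A hypothetical sketch; these helpers are illustrative, not an Anthropic API:
```python
from pathlib import Path

MEMORY = Path("agent_memory.md")

def read_memory() -> str:
    # Exposed as a tool: the model calls this at the start of a task to recover context
    return MEMORY.read_text() if MEMORY.exists() else ""

def append_memory(note: str) -> str:
    # Exposed as a tool: the model persists key facts across long-horizon steps
    with MEMORY.open("a") as f:
        f.write(note + "\n")
    return "saved"
```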
10:35They've also introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. I would love to see the thought process, but you
10:40basically see nothing now. Now here's the key. Users requiring raw chains of thought for advanced
10:46prompt engineering can contact sales. So if you want to see the raw chains of thought, you're probably
10:53going to have to pay up. All right, the next big announcement: I touched on it earlier, but let's get into more
10:57detail. Claude Code is now generally available. They have new extensions for VS Code and JetBrains that
11:03integrate Claude Code directly into your IDE, which is nice. This is direct competition to all of the
11:09coding tools out there. Claude's proposed edits appear inline in your files, streamlining review and tracking
11:15with the familiar editor interface. And they're releasing a Claude Code SDK so you can build your
11:21own coding agent. So again, they're really building out the infrastructure layer of agentic coding.
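A taste of what scripting it looks like: driving Claude Code headlessly from a script. The flags are my assumption from the launch-era CLI, so verify them with `claude --help`:
```python
import json
import subprocess

# Run Claude Code non-interactively and capture structured output
result = subprocess.run(
    ["claude", "-p", "Fix the failing tests in src/utils", "--output-format", "json"],
    capture_output=True,
    text=True,
    check=True,
)
print(json.loads(result.stdout))
```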
11:27So Claude Code on GitHub is now available, and that's an example of what's possible with the SDK.
11:33Tag Claude Code on PRs to respond to reviewer feedback, fix CI errors, or modify code. So here's an
11:40example. Here's a PR right here. You come in, add a comment, and tag Claude:
11:44"Could you please address this feedback?" And it's going to jump in and start doing it right
11:50away. It gathers issue and comment context, addresses the feedback, creates a pull request, verifies lint,
11:54runs the tests, and so on. And then you have a PR ready to review. Now, according to Techmeme,
12:01Anthropic's chief science officer Jared Kaplan says the company stopped
12:06investing in chatbots at the end of 2024 and instead focused on improving Claude's ability to do
12:12complex tasks. And this makes sense. Claude is just not achieving the mindshare necessary to win at
12:19the chatbot game. That's ChatGPT. That's Gemini. Hopefully Siri in the future. So they gave up on
12:26that and went all in on agentic capabilities. And you know what? Good for them. Focus is what is
12:32required to win. And how about the pricing? Let's check it out. So Claude 4 Opus, the most intelligent
12:38model for complex tasks. It has a 200K context window, which is still relatively small. And you
12:45get a 50% discount with batch processing: $15 per million input tokens and $75 per million output
12:52tokens.
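For a sense of scale, here's a quick back-of-the-envelope calculation at those Opus 4 list rates:
```python
INPUT_PER_MTOK = 15.00   # $ per million input tokens (Opus 4 list price)
OUTPUT_PER_MTOK = 75.00  # $ per million output tokens

def cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    usd = input_tokens / 1e6 * INPUT_PER_MTOK + output_tokens / 1e6 * OUTPUT_PER_MTOK
    return usd * 0.5 if batch else usd  # batch processing is 50% off

# e.g. a long agentic session: 2M tokens in, 200k tokens out
print(f"${cost(2_000_000, 200_000):.2f}")              # $45.00
print(f"${cost(2_000_000, 200_000, batch=True):.2f}")  # $22.50
```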
12:58So that's it. I'm going to be testing it out. Expect a testing video soon. If you enjoyed this video, please consider giving it a like and subscribing, and I'll see you in the next one.