Check out Box AI here: https://bit.ly/4kyfWq2
Join My Newsletter for Regular AI Updates:
https://forwardfuture.ai
Discover The Best AI Tools:
https://tools.forwardfuture.ai
My Links:
X: https://x.com/matthewberman
Instagram: / matthewberman_ai
Discord: / discord
Media/Sponsorship Inquiries:
https://bit.ly/44TC45V
Links:
https://www.anthropic.com/news/claude-4
https://x.com/ashtom/status/192559739...
https://x.com/eleven21/status/1925594...
https://x.com/shaunralston/status/192...
Category: Tech
Transcript
00:00Claude 4 is finally here. It comes in two sizes, Sonnet and Opus, and it seems Anthropic has
00:07pivoted in a completely new direction. I'll explain that in a moment. Let me give you all
00:12of the details. Right away, they claim Claude 4 Opus is the world's best coding model, which is
00:18a hint in the direction that they are heading. And what seems to make it really special is its
00:23ability to complete long-horizon tasks; that is, tasks spanning tens of minutes up to hours without
00:30losing the thread, actually completing real-world tasks. All right, so a few
00:35details about both of these models, and then I'm going to get into the benchmarks. First, you have
00:39extended thinking with both of them, and they are both hybrid models, which means they can give you
00:44instant responses with no thinking, or you can turn on thinking for those more complex tasks.
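In API terms, that thinking toggle is just a request parameter. Here's a minimal sketch using the Anthropic Python SDK (the exact model ID string is my assumption from the launch announcement, so double-check it against the docs):
```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Thinking off: near-instant response
fast = client.messages.create(
    model="claude-opus-4-20250514",  # assumed launch model ID
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this function in one line."}],
)

# Thinking on: give the model a token budget to reason before it answers
deep = client.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=16000,  # must be larger than the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Refactor this module and explain the tradeoffs."}],
)
```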
00:50And during the thinking, you have tool use, which is, of course, really nice, but kind of table
00:55stakes at this point. And now I've already been playing around with it and hit my rate limit until
00:592 p.m. today, which is a few hours away. And really, I only submitted a few prompts. So I think I'm going
01:05to have to subscribe to Max and put together a thorough test for you all. So you can see right
01:10here, we have Claude 4 Opus and Claude 4 Sonnet. If you click right here on search and tools, you can see the
01:15different tools available. You can select the style, you can turn on and off extended thinking. It
01:19has web search, Drive search, Gmail search, and calendar search. Those are the available tools
01:24for now. But they have more deeply integrated the MCP framework into their API. And remember,
01:32Anthropic is the company that created MCP, the Model Context Protocol, which OpenAI, Microsoft, Google, and
01:37so many other companies have adopted. One unique thing that I really haven't seen elsewhere is that
01:43both models can use tools in parallel, which means it can send off requests to multiple tools at the
01:49same time. That seems really cool and much more efficient than doing everything sequentially.
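On the client side, parallel tool use just means a single assistant turn can contain several tool_use blocks. Here's a sketch of dispatching them concurrently (the two tool implementations are hypothetical stand-ins):
```python
import asyncio

# Hypothetical local tool implementations
async def get_weather(city: str) -> str:
    return f"Sunny in {city}"

async def get_news(topic: str) -> str:
    return f"Top headline about {topic}"

TOOLS = {"get_weather": get_weather, "get_news": get_news}

async def handle_parallel_tool_use(response):
    # Collect every tool_use block Claude emitted in this one assistant turn
    calls = [block for block in response.content if block.type == "tool_use"]
    # Run them concurrently instead of one after another
    results = await asyncio.gather(*(TOOLS[c.name](**c.input) for c in calls))
    # Send one tool_result back per call, matched by tool_use_id
    return [
        {"type": "tool_result", "tool_use_id": c.id, "content": r}
        for c, r in zip(calls, results)
    ]
```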
01:54And it also seems to be much better at handling its own memory. All of this stuff is available in
01:59Claude Code, which is also now generally available and supports the Claude 4 models. During the keynote
02:06that live streamed this morning, the chief product officer of Anthropic spent a lot of time talking
02:11about long-horizon tasks and how they were able to accomplish this, even giving an example of a
02:17company that was using Claude 4 that was able to run a task for over seven hours. And as part of Claude's new
02:24API, they have four new features: a code execution tool, an MCP connector, a Files API, and the
02:31ability to cache prompts for up to one hour. Here's what the code execution tool looks like. You simply type
02:37in a prompt, and Claude will start thinking, write code, and of course execute that code. And I believe
02:43it needs to be Python for it to execute.
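Through the API, the code execution tool is a server-side tool you attach to the request. A rough sketch; the tool type string and beta flag here are assumptions from launch-era docs, so verify them before using:
```python
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-opus-4-20250514",
    max_tokens=4096,
    betas=["code-execution-2025-05-22"],  # assumed beta flag; check the docs
    # Server-side tool: Claude writes Python and runs it in a sandbox
    tools=[{"type": "code_execution_20250522", "name": "code_execution"}],
    messages=[{"role": "user", "content": "Compute the mean of [3, 1, 4, 1, 5] and plot a histogram."}],
)
```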
02:50The MCP connector allows you to connect any MCP server to the Claude API. So now your Claude API has access to all of the MCP tools throughout the world.
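Calling it looks roughly like this: you hand the API a remote MCP server URL and Claude can invoke that server's tools directly. The field names and beta flag are assumptions from launch-era docs, and the server URL is hypothetical:
```python
import anthropic

client = anthropic.Anthropic()

response = client.beta.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    betas=["mcp-client-2025-04-04"],  # assumed beta flag; check the docs
    mcp_servers=[
        {
            "type": "url",
            "url": "https://example.com/mcp",  # hypothetical remote MCP server
            "name": "example-tools",
        }
    ],
    messages=[{"role": "user", "content": "Use example-tools to look up today's open tickets."}],
)
```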
02:57They also have the Files API, so giving Claude access to your files, specifically your code files and
03:03your repositories, just became a lot easier. And then there's prompt caching. Of course, you want the
03:08most efficient usage, you want the cheapest price, and caching is the way to go.
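And caching is a per-block annotation on the request. A minimal sketch: the cache_control breakpoint is Anthropic's existing prompt-caching mechanism, while the one-hour "ttl" field is my assumption based on this announcement, so verify the exact field name:
```python
import anthropic

client = anthropic.Anthropic()

repo_context = open("repo_summary.txt").read()  # large prefix you reuse across calls

response = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": repo_context,
            # Cache this prefix so repeat requests don't pay full input price;
            # "ttl": "1h" is the new one-hour option (field name assumed)
            "cache_control": {"type": "ephemeral", "ttl": "1h"},
        }
    ],
    messages=[{"role": "user", "content": "Where is the auth middleware defined?"}],
)
```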
03:14Now, with all of these, you can probably guess where this is going: Anthropic has basically given up on the chatbot race.
03:20It is clear that OpenAI and the major tech companies, Google and Microsoft (and, unfortunately, not Apple),
03:28have all won the chatbot race, the personal assistant race. So now Anthropic has transitioned
03:34into being an infrastructure company. They are providing the tools necessary to have the best
03:40coding agent. They are building the best agents. They're building the best coding agents and they
03:45are plugging it into everyone. Thomas Dohmke, the CEO of GitHub, announced Claude Sonnet 4 is here.
03:52It's available in GitHub Copilot, and it's their default option. By the way, I interviewed Thomas at
03:58Microsoft Build. I'll drop that interview soon. So be sure to subscribe to this channel so you can
04:03get updated when that video drops. It is incredible. But look at this. In early evaluations, the model
04:08soared in agentic scenarios. That's the key. That is what we keep hearing. Memory, tools, long horizon
04:16tasks, all done by these agents, powered by Claude 4, delivering up to a 10% improvement over the previous
04:23generation driven by sharper tool use, tighter instruction following, and stronger coding
04:28instincts. And of course, it's also available in Cursor and Windsurf and basically all of the major
04:34coding platforms out there. Now, given that Claude 4 is especially good at long-horizon tasks, has excellent
04:40memory, and has built-in parallel tool usage, it's going to pair especially well with Box AI. And that's
04:48the sponsor of today's video. I'm really excited to tell you about them. You're going to be able to build
04:52on Box AI using the new Claude 4 models soon. With Box AI, you can use artificial intelligence to extract
04:59key metadata fields from contracts, invoices, financial documents, resumes, and more. And you
05:05can automate workflows super easily. And not just metadata. You can ask questions about it. You can
05:12really do deep dives into your company's own data. And again, if you're a developer, building on Box AI
05:18is easy. It handles the entire RAG pipeline for you. So you don't need to think about vector databases.
05:23You don't need to think about chunking. It's just done and it works. And of course, because it's Box,
05:29they have enterprise-level security, governance, and compliance. And with the launch of Claude Code,
05:35if you want to use Claude Code with Box SDKs, it could not be easier. Simply give Claude Code links to the
05:42Box developer docs, and it just knows how to build with it. Check out Box's blog post about the Claude
05:48Code launch to see a demo of them building a backend contract generation tool using Box Doc Gen and
05:54Claude Code. I'll drop all of the links in the description below. So unlock the power of your
05:58documents and data with Box and Box AI. Thanks again to Box for sponsoring this video. All right,
06:04so back to the announcement blog post. Claude Opus 4 and Sonnet 4 (by the way, they kind of switched the
06:10naming around, right? It was Claude 3 Opus and Claude 3.5 Sonnet, and now it's the opposite way: Claude Opus 4 and
06:17Sonnet 4) are hybrid models offering two modes: near-instant responses and extended thinking
06:23for deeper reasoning? All right, I know you want to see the benchmarks. Benchmarks only mean so much,
06:27so take it with a grain of salt, but here they are. So, software engineering: SWE-bench Verified. Yep,
06:33Claude 4 is by far the winner. So here's OpenAI's Codex-1, which was just announced about a week
06:40ago, at 72% on SWE-bench Verified, compared to Sonnet 3.7, which was at 62.3%, and with parallel
06:48test-time compute, 70.3%. But now we have a big jump all the way up to 80.2% with parallel test-time
06:56compute for Sonnet 4, and 72.5%, or 79.4% with parallel test-time compute, for Opus 4. And by the way,
07:04for those of you who weren't sure what parallel test-time compute is, it basically just means they
07:10sampled several candidate solutions to each prompt in parallel and chose the best one.
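Conceptually, it's just best-of-n sampling. A toy sketch, where the solver and scorer are hypothetical stand-ins (in practice, the scorer might be a test suite or a verifier model):
```python
import asyncio
import random

async def solve(prompt: str) -> str:
    # Stand-in for one independent model attempt at the prompt
    return f"candidate-{random.randint(0, 999)}"

def score(solution: str) -> float:
    # Hypothetical verifier, e.g. fraction of unit tests the candidate passes
    return random.random()

async def best_of_n(prompt: str, n: int = 8) -> str:
    # Sample n candidate solutions in parallel, then keep the best-scoring one
    candidates = await asyncio.gather(*(solve(prompt) for _ in range(n)))
    return max(candidates, key=score)

print(asyncio.run(best_of_n("Fix the failing test in utils.py")))
```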
07:15Now, if you're looking at this, you're probably thinking the same thing I am: did Sonnet just score better than Opus?
07:21Well, yeah, it did. And with my initial usage, I actually found Opus to be faster than Sonnet. Now,
07:29that's just anecdotal, me using it a couple of times, so I'm going to need to test it a lot more,
07:33but it does seem to output code much faster. Now, here are some more benchmarks. Here's Terminal
07:38Bench: Claude Opus 4 winning at 43.2% compared to Sonnet 4 at 35%. Here is the o3 model at 30%,
07:47GPT-4.1 at 30%, and Gemini 2.5 Pro at 25%; and to date, Gemini 2.5 Pro is still my favorite coding
07:55model. Here's GPQA Diamond, which is graduate level reasoning. We have Agentic Tool Use doing
08:01quite well compared to the other models. Now, you're probably noticing one other thing. Sonnet 3.7 is
08:06still doing quite well. I'm going to show you that in a second. We have Multilingual Q&A,
08:12again, getting a nice bump. Visual reasoning, getting about the same score. And then the high school
08:18math competition, AIME 2025, getting a very nice bump over Claude 3.7. Now, I'm going to pause for a
08:24second and show you something. This is a post by John Shoneth, and he points out that the green
08:29boxes are around benchmarks where Claude Sonnet 4 did better than Claude Sonnet 3.7. The yellow ones are
08:35where it did about the same. And red is where it actually got a decrease in performance, which is
08:41kind of nuts. So of all of these benchmarks that they submitted, half actually went down. So I don't
08:48really know what to think about that. They're saying it was a huge bump, but the benchmarks don't
08:52actually reflect that. And the benchmarks tend to be the nicest view of these models until people
08:58start doing the vibe checks of them. So very interesting. And of course, I'm going to be testing
09:03it thoroughly. We'll see. Now, one of the things that they called out during the keynote today
09:08is that when Claude 3 came out, it was kind of lazy with coding. And then Claude 3.5 and 3.7
09:16kind of went the other way. It tried too hard and did things it shouldn't and outputted way too much
09:21code. And they think they really dialed it in with Claude 4. They also, being Anthropic, focused a lot on
09:28safety. So we've significantly reduced behavior where the models use shortcuts or loopholes to
09:33complete tasks. And of course, they're using the Pokemon example here. Both models are 65% less
09:40likely to engage in this behavior than Sonnet 3.7 on agentic tasks that are particularly susceptible to
09:46shortcuts or loopholes. Claude Opus 4 also dramatically outperforms all previous models on memory capabilities,
09:52which I've mentioned already. But I have said memory for agents is really the key ingredient to making
09:59them hyper personal. And they called out in the keynote today, the 100th time you use Claude 4 should
10:04be much better, much more efficient and much more concise than the first time you use Claude 4. That's
10:10because it's learning and it's understanding what you want. It's developing a shorthand with you as the user.
10:16Opus 4 becomes skilled at creating and maintaining memory files to store key information. This unlocks
10:23better long-term task awareness, coherence, and performance on agent tasks. And here's the example
10:29of the Pokémon benchmark.
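From the developer side, memory files can be as simple as exposing file tools the model can call between steps. A hypothetical sketch; these helpers are illustrative, not an Anthropic API:
```python
from pathlib import Path

MEMORY = Path("agent_memory.md")

def read_memory() -> str:
    # Exposed as a tool: the model calls this at the start of a task to recover context
    return MEMORY.read_text() if MEMORY.exists() else ""

def append_memory(note: str) -> str:
    # Exposed as a tool: the model persists key facts across long-horizon steps
    with MEMORY.open("a") as f:
        f.write(note + "\n")
    return "saved"
```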
10:35They've also introduced thinking summaries for Claude 4 models that use a smaller model to condense lengthy thought processes. I would love to see the thought process, but you
10:40basically see nothing now. Now here's the key. Users requiring raw chains of thought for advanced
10:46prompt engineering can contact sales. So if you want to see the raw chains of thought, you're probably
10:53going to have to pay up. All right, the next big announcement: I touched on it earlier, but let's get into more
10:57detail. Claude Code is now generally available. They have new extensions for VS Code and JetBrains that
11:03integrate Claude Code directly into your IDE, which is nice. This is direct competition to all of the
11:09coding tools out there. Claude's proposed edits appear inline in your files, streamlining review and tracking
11:15with the familiar editor interface. And they're releasing a Claude Code SDK so you can build your
11:21own coding agent. So again, they're really building out the infrastructure layer of agentic coding.
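A taste of what scripting it looks like: driving Claude Code headlessly from a script. The flags are my assumption from the launch-era CLI, so verify them with `claude --help`:
```python
import json
import subprocess

# Run Claude Code non-interactively and capture structured output
result = subprocess.run(
    ["claude", "-p", "Fix the failing tests in src/utils", "--output-format", "json"],
    capture_output=True,
    text=True,
    check=True,
)
print(json.loads(result.stdout))
```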
11:27So Claude Code on GitHub is now available, and that's an example of what's possible with the SDK.
11:33Tag Claude Code on PRs to respond to reviewer feedback, fix CI errors, or modify code. So here's an
11:40example. Here's a PR right here. You come in, add a comment, and tag Claude:
11:44"Could you please address this feedback?" And it's going to jump in and start doing it right
11:50away. It gathers issue and comment context, addresses the feedback, creates a pull request, verifies lint,
11:54runs the tests, and so on. And then you have a PR ready to review. Now, according to Techmeme,
12:01Anthropic's chief science officer Jared Kaplan says the company stopped
12:06investing in chatbots at the end of 2024 and instead focused on improving Claude's ability to do
12:12complex tasks. And this makes sense. Claude is just not achieving the mindshare necessary to win at
12:19the chatbot game. That's ChatGPT. That's Gemini. Hopefully Siri in the future. So they gave up on
12:26that and went all in on agentic capabilities. And you know what? Good for them. Focus is what is
12:32required to win. And how about the pricing? Let's check it out. So Claude 4 Opus, the most intelligent
12:38model for complex tasks. It has a 200K context window, which is still relatively small. And you
12:45get a 50% discount with batch processing: $15 per million input tokens and $75 per million output
12:52tokens.
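For a sense of scale, here's a quick back-of-the-envelope calculation at those Opus 4 list rates:
```python
INPUT_PER_MTOK = 15.00   # $ per million input tokens (Opus 4 list price)
OUTPUT_PER_MTOK = 75.00  # $ per million output tokens

def cost(input_tokens: int, output_tokens: int, batch: bool = False) -> float:
    usd = input_tokens / 1e6 * INPUT_PER_MTOK + output_tokens / 1e6 * OUTPUT_PER_MTOK
    return usd * 0.5 if batch else usd  # batch processing is 50% off

# e.g. a long agentic session: 2M tokens in, 200k tokens out
print(f"${cost(2_000_000, 200_000):.2f}")              # $45.00
print(f"${cost(2_000_000, 200_000, batch=True):.2f}")  # $22.50
```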
12:58So that's it. I'm going to be testing it out. Expect a testing video soon. If you enjoyed this video, please consider giving it a like and subscribing, and I'll see you in the next one.