  • 5/28/2025
The AI landscape just got shaken up! ByteDance’s BAGEL, Anthropic’s Claude 4, and Mistral’s Devstral have all launched groundbreaking models that push the boundaries of language understanding, creativity, and efficiency. These AI powerhouses are setting new standards and surprising experts worldwide with their capabilities. Discover what makes these three the talk of the tech world! ⚡🧠


Category: 🤖 Tech
Transcript
00:00 So three AI giants just dropped within 48 hours. ByteDance launched BAGEL, a wild new model that doesn't just generate images: it reasons, edits, navigates, and thinks in full multimodal context. Then Anthropic came out swinging with Claude 4, an AI that can code for seven hours straight, juggling tools like it's running an entire dev team solo. And right in the middle of it all, Mistral dropped Devstral, a monster open-source coding model trained to chew through real GitHub issues and beat out some of the biggest closed systems. So, three drops in two days: BAGEL, Claude 4, Devstral. Everything's moving again, so let's talk about it.
00:44 Let's kick off with BAGEL, because that announcement hit first, on May 20th. ByteDance calls it a unified multimodal model, which basically means one network juggles language, images, video frames, even web data, instead of stapling separate subsystems together. The core engine is a Mixture-of-Transformer-Experts, MoT for short, with 7 billion active parameters out of a 14 billion total count. ByteDance bolted on two different encoders: one chews on raw pixels, the other tracks semantic cues, so BAGEL sees both the fine-grained texture of an image and the higher-level idea of what's in it. Pre-training wasn't modest either: trillions of interleaved tokens spanning text, stills, clips, memes, you name it. That's why the dev team says the model can think across modalities; it literally predicts the next group of tokens, whether those tokens describe words or visual patches.
01:44 Day-one demos looked pretty spicy. Someone dropped in a snapshot of Michelangelo's David, and BAGEL casually rattled off the statue's history, its Renaissance context, even where it's housed, at the Accademia in Florence. That same prompt stream flipped straight into generation mode: the model pumped out a photorealistic scene of three antique potion bottles labeled SDXL, BAGEL, and Flux, with correct reflections and old-glass imperfections, without needing instruction about lens choice or lighting. Then came editing. A clip showed a man squatting to pat a dog, and BAGEL rewrote the action into new frames, keeping the pose consistent while avoiding jitter. Style transfer? It took a 2D cosplay photo and re-rendered it in a crisp 3D animated look after a single "change to 3D" instruction. It even did navigation: after watching 0.4 seconds of a video, the model predicted the next forward step for a virtual camera.
02:46 All of that sits under what ByteDance calls its thinking mode, where the model writes an internal chain of thought between think tags, refines the prompt, then starts drawing, so you get fewer random artifacts and more coherent scenes, like that wild composite of hundreds of toy cars forming a life-sized sedan.
03:06 Now, numbers. On the visual-understanding board, BAGEL hit an MME score of 2,388, edged past Qwen2.5-VL on MMBench with 85.0, landed 55.3 on MMMU, 67.2 on MM-Vet, and pulled 73.1 on the new MathVista reasoning test. Generation quality is equally solid: GenEval climbs to 0.88 when you let BAGEL think, while WISE bumps to 0.70, which puts it shoulder to shoulder with specialized diffusion behemoths. Editing? GEdit-Bench shows 7.36 for single-condition prompts, and a 44.0 on IntelligentBench. With chain of thought, that last score jumps to 55.3, a real testament to the reason-then-render workflow. During ablation, mixing VAE and ViT features turned out to be the secret sauce: drop either, and intelligent editing tanks.
04:09 If you want to run it locally, the quick start is right in the repo: a conda environment on Python 3.10, snapshot_download from Hugging Face with the 7B-MoT checkpoint, then crack open inference.ipynb and you're good. Then the generation dials: cfg_text_scale around 4 to 8 locks the model to your prompt; cfg_image_scale at 1 to 2 preserves source detail during edits; and cfg_interval, think of it as how long you keep classifier-free guidance engaged, defaults to the 0.4 to 1.0 window. Tweak timestep_shift if you need cleaner layout versus sharper details, and play with 60-ish total timesteps. It's all in the README, but those knobs matter.
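For orientation, here's a minimal Python sketch of that quick start. The repo id is an assumption based on the release naming, and the dial names mirror the README; treat the exact values as illustrative and confirm against inference.ipynb before running.

```python
# Minimal sketch: fetch the checkpoint, then set the dials discussed above.
# The repo id and parameter names are assumptions; confirm against the
# BAGEL README and inference.ipynb.
from huggingface_hub import snapshot_download

# Pull the 7B-MoT checkpoint to a local folder.
ckpt_dir = snapshot_download(
    repo_id="ByteDance-Seed/BAGEL-7B-MoT",  # assumed repo id
    local_dir="./BAGEL-7B-MoT",
)

# The generation dials, in the ranges the video recommends.
gen_config = {
    "cfg_text_scale": 6.0,       # 4-8: locks the model to your prompt
    "cfg_image_scale": 1.5,      # 1-2: preserves source detail during edits
    "cfg_interval": [0.4, 1.0],  # window where classifier-free guidance stays on
    "timestep_shift": 3.0,       # cleaner layout vs. sharper details
    "num_timesteps": 60,         # ~60 total denoising steps
}
```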
04:57 Two days later, Anthropic dropped Claude 4's two siblings, Opus 4 and Sonnet 4, and both aim squarely at the coding crowd. Opus 4 storms the SWE-bench Verified leaderboard with 72.5%, owns Terminal-bench at 43.2%, and claims the title of world's best coding model. Sonnet 4, which costs less compute, actually ekes out 72.7% on SWE-bench, but trades a little depth for latency. Both models are hybrid: they can spit back near-instant paragraphs, or slip into extended thinking that reasons across up to 64,000 tokens, calling external tools in the middle of a thought chain. So you might see Claude google something, summarize a PDF it found, update its plan, and continue without you poking it.
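In API terms, that hybrid mode is a single flag. Here's a minimal sketch against the Anthropic Python SDK, assuming the launch-day model id and an illustrative thinking budget; check Anthropic's docs for current values.

```python
# Minimal sketch of extended thinking with the Anthropic SDK
# (pip install anthropic).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

message = client.messages.create(
    model="claude-opus-4-20250514",  # assumed launch-day id
    max_tokens=16000,
    # Extended thinking: grant a budget of reasoning tokens the model
    # can spend before committing to an answer.
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{
        "role": "user",
        "content": "Audit this repo's error handling and propose a patch plan.",
    }],
)

# Thinking and the final answer arrive as separate content blocks.
for block in message.content:
    if block.type == "thinking":
        print("[thinking]", block.thinking[:200], "...")
    elif block.type == "text":
        print(block.text)
```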
05:44 The coolest bit is endurance. Anthropic's internal tests ran Opus 4 for nearly seven hours straight on a single goal. Think: audit a codebase, patch bugs, update docs, without fresh human prompts. CNN's write-up framed it as an almost full workday of uninterrupted AI labor. Anthropic's product lead, Scott White, spun it like this: let the model grind through the 30% of your day nobody finds thrilling, so you can handle the creative piece. And they have receipts. Cursor calls Opus 4 state-of-the-art for complex multi-file refactors. Replit says precision shot way up. Rakuten let the model hammer on an open-source refactor for seven solid hours, and it still held context. Cognition's evaluation claims Opus 4 tackles challenges that stump other models.
06:35 Tooling support is thick. Developers now get a code-execution tool, an MCP connector, a Files API, and prompt-caching knobs, so you're not burning tokens on identical system prompts every minute.
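That caching knob, as a minimal sketch: mark a large, stable system prompt as cacheable so repeat calls reuse the prefix. The cache_control block syntax is from Anthropic's docs; the style-guide file and model id are hypothetical stand-ins.

```python
# Minimal prompt-caching sketch with the Anthropic SDK. The style guide
# is a hypothetical large, stable prefix worth caching.
import anthropic

client = anthropic.Anthropic()

long_style_guide = open("style_guide.md").read()  # hypothetical big prefix

message = client.messages.create(
    model="claude-sonnet-4-20250514",  # assumed launch-day id
    max_tokens=1024,
    system=[{
        "type": "text",
        "text": long_style_guide,
        # Cacheable block: repeat calls reuse this prefix instead of
        # re-billing the same system-prompt tokens every minute.
        "cache_control": {"type": "ephemeral"},
    }],
    messages=[{"role": "user", "content": "Review this diff for style issues."}],
)
print(message.content[0].text)
```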
06:47 Claude Code exits preview and ships straight into VS Code and JetBrains plug-ins. Edits appear inline, so you accept or reject them like a human teammate's patch. In the terminal, the same agent can spin up GitHub Actions runs or respond to PR comments.
07:02 Under the hood, both models got a discipline check: shortcut-taking behavior, where an LLM tries a loophole instead of the intended workflow, dropped by 65% compared to Sonnet 3.7. Pricing sticks to Anthropic tradition. Opus 4 remains 15 bucks per million input tokens and 75 per million output; Sonnet 4 sits at 3 and 15. The usual Pro, Max, Team, and Enterprise tiers include the extended-thinking toggle, and free users even get Sonnet 4's core mode, which is kind of nuts.
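To make those rates concrete, a quick back-of-envelope in Python, with the token counts purely illustrative:

```python
# Back-of-envelope cost check at the list prices quoted above,
# in USD per million tokens (input, output).
PRICES = {
    "opus-4": (15.00, 75.00),
    "sonnet-4": (3.00, 15.00),
}

def cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one workload at list prices."""
    p_in, p_out = PRICES[model]
    return input_tokens / 1e6 * p_in + output_tokens / 1e6 * p_out

# Example: an agent session that reads 2M tokens and writes 300K.
print(f"Opus 4:   ${cost('opus-4', 2_000_000, 300_000):.2f}")    # $52.50
print(f"Sonnet 4: ${cost('sonnet-4', 2_000_000, 300_000):.2f}")  # $10.50
```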
07:35 For transparency, Anthropic now compresses the chain of thought with a smaller model in about 5% of cases, but devs can apply for a developer mode to see raw reasoning when needed.
07:47 Safety got its own bullet-time montage: Anthropic activates its ASL-3 safeguards for Opus 4, smelling salts for enterprises that fear runaway tools. Both models are live on the Anthropic endpoint, Amazon Bedrock, and Google Vertex AI. And yes, the developer plugins are already shipping. Run a single command in VS Code or JetBrains and Claude Code shows edits right in your files, no copy-paste needed. Fire it up in a GitHub Action and the agent will comment on PR feedback, chase failing tests, and push patches to a branch. It feels less like a chatbot bolted to the side of the repo and more like a junior engineer who never takes a coffee break.
08:28 Now, wedged neatly between those two announcements came something from Paris. On May 21st, Mistral AI and All Hands AI unveiled Devstral: 24 billion parameters, an Apache 2.0 license, and a 128,000-token window. The whole brief screams open-source agent for real software engineering, not code-completion toys. Training wasn't just RLHF on docstrings; the team piped Devstral through actual GitHub issues inside agent scaffolds like OpenHands and SWE-agent, which forced the model to read stack traces, locate the bad file, write a patch, rerun the tests, and iterate until green.
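That loop is worth seeing in skeletal form. Here's a rough sketch where every model-facing function is a hypothetical placeholder; OpenHands and SWE-agent each implement their own versions of these steps.

```python
# Hypothetical skeleton of the patch-and-test loop the scaffolds enforce.
# `model.propose_patch` and `patch.apply` are stand-ins, not real APIs.
import subprocess

def run_tests() -> subprocess.CompletedProcess:
    """Run the project's test suite and capture the stack traces."""
    return subprocess.run(
        ["pytest", "-x", "--tb=short"], capture_output=True, text=True
    )

def fix_issue(model, issue_text: str, max_iters: int = 10) -> bool:
    """Iterate patch -> test until green, the curriculum described above."""
    for _ in range(max_iters):
        result = run_tests()
        if result.returncode == 0:
            return True  # tests pass: the patch is green
        # Feed the failure back so the model can locate the bad file
        # and draft the next patch (hypothetical model interface).
        patch = model.propose_patch(issue=issue_text, trace=result.stdout)
        patch.apply()  # write the edit into the working tree
    return False
```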
09:09 That curriculum explains why SWE-bench Verified jumps to 46.8%, six full points higher than the next open model, and, hilariously, 20 points above GPT-4.1 mini when that tiny proprietary sibling is forced through the same harness. Because the weights are public, you can grab them from Hugging Face, Ollama, Kaggle, LM Studio, or Unsloth if you want quantized flavors. If local hardware is your thing, a single RTX 4090 or an M-series Mac with 32 gigs handles it. If cloud is easier, Mistral's endpoint, Devstral Small 2505, bills at 10 cents per million input tokens and 30 cents per million output, matching Mistral Small 3.1. Enterprises that need private fine-tuning or distillation into even lighter models can ping Mistral's applied AI team: same open philosophy, just pay for the consulting hours.
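Calling the hosted endpoint is a few lines with the mistralai SDK; a minimal sketch, assuming the model id follows the "Devstral Small 2505" naming quoted above:

```python
# Minimal sketch against Mistral's hosted endpoint (pip install mistralai).
# The model id is assumed from the naming above; verify against the
# current model list.
import os
from mistralai import Mistral

client = Mistral(api_key=os.environ["MISTRAL_API_KEY"])

response = client.chat.complete(
    model="devstral-small-2505",
    messages=[{
        "role": "user",
        "content": "Here's a failing stack trace; locate the bug and draft a patch.",
    }],
)
print(response.choices[0].message.content)
```

For the fully local route, the quantized builds on Hugging Face or Ollama slot into the same agent scaffolds.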
10:06 The bigger story is context stability. Because Devstral can swallow an entire monorepo in one go, it keeps variable scope, import paths, and architectural patterns in working memory. Early community tests show the model hopping through 40 or 50 files without losing track of variable names, and still writing indentation-perfect patches. It even reads markup like XML or HTML templates, which means it can adjust a Django config and the corresponding Jinja view in one reasoning burst.
10:37 And because the license is so permissive, university teams and indie IDE plug-in authors are already hacking it into local copilots that run entirely offline. The release video shows a coder with no internet toggling an OpenHands panel, selecting a failing test, and watching Devstral rewrite the function right inside VS Code before saving. No tokens leave the laptop.
11:00 Mistral, in case you missed the origin story, was founded in April 2023 by Arthur Mensch, Guillaume Lample, and Timothée Lacroix. They've been shipping open-weight checkpoints on a six-month cadence: Mistral Small, Codestral, now Devstral. Venture money keeps flowing, but they still grant Apache licenses and push Metal kernels for MacBooks. That combination, transparency on weights, a permissive legal wrapper, and hardware frugality, has turned them into the EU's favorite counterbalance to U.S. cloud giants.
11:33 So, there you have it. Three releases, each stretching a different muscle. BAGEL lets a single decoder talk prose, paint frames, reshape video, and even chart navigation steps. Claude 4 holds a conversation for hours, flipping through in-house tools and leaving breadcrumb memory files so it can resume work after lunch. Devstral shows that an open agent trained on the act of fixing real GitHub tickets can beat much larger closed models on the exact tasks developers care about, all while running on hardware you can stick under your desk.
12:09 So, here's what I'm wondering: did we just see the start of AI models specializing harder than ever, or are we heading toward one system that just does everything better than you? Drop your take in the comments; I'm reading all of them, and if you've already got one of these models running, let me know how it's holding up. If you liked this, you know what to do. Thanks for watching. Catch you in the next one.
