4/28/2025
China has just introduced a next-level AI agent that's raising alarms worldwide! 🚨 Designed with unprecedented autonomy, decision-making capabilities, and real-world action potential, this new AI agent blurs the line between digital intelligence and real-world impact. From defense applications to advanced cyber operations, this agent showcases China's serious investment in dominating the AI race. 🧠💥
In this video, we break down what makes this AI so powerful, and why experts around the world are both amazed and concerned. Could this reshape the global AI arms race forever? 🌍⚡

#ChinaAI #DangerousAI #AIRevolution #AutonomousAI #AIthreat #ArtificialIntelligence #TechNews #FutureOfAI #AIarmsrace #MachineLearning #NextGenAI #AIadvancements #GlobalAI #AIpower #DigitalTransformation #AItechnology #AIimpact #SmartMachines #AIupdate #AIsafety
Transcript
00:00So, ByteDance just dropped UI-TARS 1.5, and the short version is this.
00:07It's a vision-language agent that treats your screen like one big image
00:11that it can read, reason about, and then manipulate directly.
00:14Instead of juggling DOM trees, calling external tools, or stuffing the prompt with handcrafted instructions,
00:20the model ingests a screenshot, figures out the layout and the task from plain language,
00:25and then acts natively as if a real user were at the controls.
00:29That shift, folding perception, planning, and low-level actions into one neural backbone,
00:35changes the game for GUI automation, game agents, and any workflow that lives inside a graphical interface.
00:41It's faster, it's more resilient when the UI changes, and in head-to-head benchmarks,
00:46it's already edging out GPT-4-based setups and Claude on everything from Windows desktops to Android apps to web navigation.
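To make the screenshot-in, action-out idea concrete, here's a minimal Python sketch of that loop. Every name in it (capture, execute, model.predict_action, the Action dataclass) is a hypothetical placeholder, not the actual UI-TARS API.

```python
# Minimal sketch of the screenshot-in, action-out loop. Every name here
# (capture, execute, model.predict_action, Action) is a hypothetical
# placeholder, not the real UI-TARS API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Action:
    kind: str                                 # "click", "type", "finished", ...
    args: dict = field(default_factory=dict)  # e.g. {"x": 412, "y": 88}

def run_agent(model, task: str, capture: Callable, execute: Callable,
              max_steps: int = 50) -> list[Action]:
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = capture()  # the whole UI as one big image
        # Pixels + plain-language task + history in, next action out:
        # no DOM tree, no external tools, no handcrafted prompt scaffolding.
        action = model.predict_action(screenshot, task, history)
        history.append(action)
        if action.kind in ("finished", "call_user"):  # meta-actions end the loop
            break
        execute(action)  # click/type/scroll natively, like a real user
    return history
```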
00:54So, they started with the original UI-TARS a few months back, but version 1.5 is the beefed-up sequel.
01:00Under the hood, it's still Qwen2-VL at heart, but they scaled it three ways.
01:05A lightweight 2 billion parameter model, a mid-range 7 billion, and a chunky 72 billion variant
01:11that got an extra round of direct preference optimization.
01:14Across 50 billion tokens of training data, spanning screenshots, element metadata, GUI tutorials, and bootstrapped action traces,
01:22the team taught the model to see, reason, and click in a single pass.
01:27The first big upgrade is how it looks at a screen.
01:30They scraped websites, Windows apps, Android UIs, even CAD and Office software,
01:34yanked the bounding boxes, the labels, the colors, the tiny 10x10 pixel icons, everything,
01:40and synthesized five flavors of perception data.
01:43Element descriptions tell the model, hey, that little blue square with the floppy disk icon is a save button.
01:49Dense captions stitch all the elements into a global paragraph, so the model gets layout context.
01:55State transition captions teach it to notice subtle changes, like the difference between a button-down frame and a hover frame.
02:01There's a giant heap of screenshot Q&A so it can answer short questions, such as
02:05where's the new tab button?
02:06And they sprinkled in set-of-mark prompting, basically drawing colored markers on elements,
02:12so the agent can ground language tokens to specific pixels.
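Set-of-mark prompting is easy to reproduce at home. Here's a rough Pillow sketch that draws numbered markers over known elements; the element-list format is invented for illustration, not taken from ByteDance's pipeline.

```python
# Rough sketch of set-of-mark prompting: overlay numbered markers on known
# UI elements so language tokens can be grounded to specific pixels.
# The element-list format here is invented for illustration.
from PIL import Image, ImageDraw

def draw_set_of_marks(screenshot: Image.Image, elements: list[dict]) -> Image.Image:
    """elements: [{"bbox": (x0, y0, x1, y1), "label": "save button"}, ...]"""
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for i, el in enumerate(elements):
        x0, y0, x1, y1 = el["bbox"]
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)  # box the element
        draw.text((x0 + 2, y0 + 2), str(i), fill="red")           # numeric mark
    return marked
```

The marked image goes to the model, and a reference like "mark 3" in its output maps straight back to elements[3]["bbox"], i.e. concrete pixel coordinates.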
02:16Now, seeing is cute, but acting matters.
02:18ByteDance defined a unified action space, shared primitives like click(x, y), drag, scroll, type, and wait,
02:26plus desktop specials like hotkey or right-click, and mobile-only press-back or long-press.
02:32On top of that, you get two meta-actions: finished, when the job's done, and call user, when it's stuck behind a login wall.
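As a mental model, that action space boils down to a shared vocabulary plus platform extensions and the two meta-actions. A toy sketch; the exact identifiers in the repo may differ.

```python
# Toy sketch of the unified action space: shared primitives, platform
# extensions, and meta-actions. Exact identifiers in the repo may differ.
SHARED  = {"click", "drag", "scroll", "type", "wait"}
DESKTOP = {"hotkey", "right_click"}     # desktop-only specials
MOBILE  = {"press_back", "long_press"}  # mobile-only specials
META    = {"finished", "call_user"}     # job done / ask the human

def is_valid(action: str, platform: str) -> bool:
    extras = DESKTOP if platform == "desktop" else MOBILE
    return action in SHARED | META | extras
```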
02:40They collected millions of multi-step traces, open-source stuff like Mind2Web, AITW, Android control,
02:47plus their own hand-recorded sessions, and normalized every step into that action template.
02:52Average trace length? Around 15 steps in their in-house set, so the model learns long-horizon control, not just single taps.
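Normalizing those heterogeneous traces into one template could look roughly like this; the input field names are hypothetical stand-ins for each dataset's real schema.

```python
# Hypothetical sketch of normalizing one mobile-trace step into the shared
# template; the input field names are stand-ins for each dataset's schema.
def normalize_step(raw: dict) -> dict:
    kind = raw["action_type"]
    if kind == "tap":          # platform tap -> shared click(x, y)
        return {"action": "click", "x": raw["x"], "y": raw["y"]}
    if kind == "input_text":   # text entry -> shared type(text)
        return {"action": "type", "text": raw["text"]}
    if kind == "swipe":        # swipe -> shared scroll(direction)
        return {"action": "scroll", "direction": raw["direction"]}
    raise ValueError(f"unmapped action type: {kind!r}")
```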
03:00Reasoning is where version 1.5 shines.
03:04The paper splits human-style thinking into System 1 and System 2.
03:09System 1 is the fast, intuitive, just-click-the-button vibe.
03:14System 2 is the deliberate, chain-of-thought, break-the-task-down vibe.
03:19Task decomposition, milestone recognition, trial and error loops, reflection when things go sideways.
03:26They harvested 6 million GUI tutorials off the web, about 500 tokens and 3 pictures each on average.
03:32Ran them through a fast text filter, a big LLM cleanup, and used that as a reasoning primer.
03:38Then, for every action trace they already had, they retrofitted a thought, first with ActRe-style prompting,
03:43then with a clever bootstrapping trick where the model samples multiple thought-action pairs
03:49and only keeps the pair that actually works.
03:52Those thoughts sit in context, so before every click, you literally get the model's inner monologue,
03:56like, okay, search field detected, type the username next.
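That bootstrapping trick is essentially rejection sampling over thought-action pairs. A rough sketch, with model.sample and the record layout as hypothetical placeholders.

```python
# Rough sketch of thought bootstrapping: sample several (thought, action)
# pairs and keep only a thought whose action matches the known-good step.
# model.sample and the return layout are hypothetical.
def bootstrap_thought(model, state, gold_action, n_samples: int = 8):
    for _ in range(n_samples):
        thought, action = model.sample(state)  # candidate monologue + move
        if action == gold_action:              # thought "explains" the right action
            return thought                     # keep it as training data
    return None                                # nothing matched; drop this step
```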
04:00Because agents crash in real life, they taught the model to learn from mistakes.
04:04They spun up hundreds of virtual PCs, let an early checkpoint roam free, captured the messy traces,
04:10filtered the junk with rules and VLM scoring, and sent human annotators to label two critical steps,
04:16the wrong move and the correct fix.
04:18That gave them paired samples for direct preference optimization.
04:22DPO's job is simple: reward the fix, penalize the blunder, keep the policy close to the stable SFT baseline.
04:28After a few bootstrapping rounds, they saw the 72-billion-parameter model
04:32jump from 17-ish to over 24 points on OSWorld's 50-step budget.
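For readers who want the mechanics, the objective on those blunder/fix pairs is standard DPO. A minimal PyTorch sketch, assuming you've already summed per-sequence log-probs under the trainable policy and the frozen SFT reference.

```python
# Minimal DPO loss sketch (standard formulation), assuming summed
# per-sequence log-probs for the chosen (fix) and rejected (blunder)
# actions under both the trainable policy and the frozen SFT reference.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # How much more the policy prefers the fix over the blunder,
    # measured relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Measuring the margin relative to the reference is what keeps the policy anchored to the SFT baseline while it learns to prefer the fix.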
04:38All right, numbers, because everybody asks for the scoreboard.
04:42In the synthetic desktop sandbox OSWorld, UI-TARS 1.5 nails a 42.5% success rate with just 100 steps,
04:51beating OpenAI's Operator at 36.4 and Claude 3.7 at 28.
04:56Windows Agent Arena, a tougher 50-step Windows challenge, shows 42.1% for TARS versus the old 29.8 baseline.
05:05On Android World, the 7B model pulls 64.2%, topping the previous 59.5.
05:11When it comes to naked grounding, pointing at a widget,
05:14ScreenSpot V2 clocks UI-TARS 1.5 at 94.2% accuracy, Operator at 87.9, and Claude at 87.6.
05:23The new ScreenSpot Pro, which uses high-res professional apps, is where the gap gets crazy.
05:2861.6% for TARS, only 23.4 for Operator, 27.7 for Claude.
05:35Gaming?
05:36They fed the model 14 Poki minigames, 2048, Infinity Loop, Snake Solver, and so on,
05:42and the Agent clears every single one, literally 100% across the board,
05:47while Operator and Claude flop on half the titles.
05:51Minecraft's MineRL benchmark is more humbling, but still, with the Think, Then Act mode turned on,
05:57TARS 1.5 gets 42% average success on 200 mining tasks, 31% on the 100 mob killing tasks,
06:06while the older VPT or DreamerV3 agents barely cracked 1% on those same goals.
06:12To prove it's not just cherry-picked demos, they published the full benchmark table in the blog.
06:17WebVoyager browsing tasks: 84.8% for TARS versus 87% for CUA, a marginal gap.
06:24Online Mind2Web tasks: 75.8% against 71% for OpenAI's CUA.
06:31Notice the pattern.
06:32On web browsing, Operator stays competitive, but once you leave HTML land,
06:37and especially once you hit desktop or mobile, UI-TARS pulls ahead.
06:40Scale tests are neat, too.
06:42Take the original 72B DPO checkpoint on OSWorld, 24.6 points.
06:48The mid-sized 7B version with the same training setup reaches 42.5 points,
06:53because that one is optimized for desktop tasks instead of games.
06:57The tiny 7B light release clocks 27.5 on ScreenSpot Pro,
07:03versus 38.1 for the earlier 72B DPO, 49.6 for the new 7B, and 61.6 for the full 1.5.
07:12So, bigger isn't always better.
07:14Targeted data and the thought engine matter more than raw parameters for grounding.
07:19Deployment is refreshingly open.
07:21They pushed the 7B checkpoint to Hugging Face under Apache 2.0 at ByteDance-Seed/UI-TARS-1.5-7B.
07:30The full 72B weights are under Early Research Access.
07:34You basically email TARS@bytedance.com, tell them about your project, and cross your fingers.
07:39The GitHub repo, github.com/bytedance/UI-TARS, has training scripts,
07:43the unified action schema, the screen capture tool chain, even replay data,
07:47so you can reproduce their metrics.
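If you want to poke at the weights, loading the open checkpoint looks roughly like this. Check the model card for the exact model class and prompt format; this sketch just assumes a recent transformers release with generic vision-language support.

```python
# Hedged sketch of pulling the open 7B checkpoint. Check the model card
# for the exact model class and chat/prompt format; this assumes a recent
# transformers release with generic vision-language support.
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ByteDance-Seed/UI-TARS-1.5-7B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

# From here: run a screenshot plus a natural-language task through the
# processor, generate, and parse the emitted action string.
```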
07:49If you just want to play, there's also UI-TARS Desktop,
07:52a desktop app that lets you type a natural-language prompt and watch the agent drive your PC.
07:58Think of it as an open-source Operator, but without the GPT-4 subscription.
08:02From a research meta angle, the paper goes full history lesson.
08:05It tracks GUI agents from old-school rule-based RPA to modular LLM frameworks like Auto-GPT,
08:12to these new end-to-end native models.
08:15Rule-based stuff is brittle, frameworks are prompt engineering heavy,
08:19and fall apart when a UI shifts.
08:21But native agents, being data-driven, retrain on fresh traces and just adapt.
08:26That's why ByteDance keeps banging on the four core capabilities.
08:30Perception, know what's on the screen.
08:32Action, hit the right coordinates.
08:36Reasoning, swap between System 1 and System 2 as needed.
08:40And memory, store short-term context in the prompt, long-term lessons in the weights.
08:45Stick those in one model and you get something that can reboot itself when Windows throws a pop-up.
08:50Training followed three phases.
08:52Phase 1, straight continual pre-training on everything, around 50 billion tokens, flat learning rate.
08:58Phase 2, an annealing stage, where they only keep high-quality perception, grounding, action, and reflection pairs,
09:04so the loss focuses on difficult examples.
09:07They call that checkpoint UI-TARS-SFT.
09:10Phase 3, DPO on the reflection pairs, which bakes in the preference for fixed actions.
09:16That's UI-TARS-DPO, the model that hit 24.6 on OSWorld.
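As a back-of-envelope summary of that recipe (purely illustrative; the real hyperparameters live in the paper, not here):

```python
# Back-of-envelope summary of the three-phase recipe; illustrative only.
PHASES = [
    {"phase": 1, "name": "continual pre-training",
     "data": "everything (~50B tokens)", "lr": "constant"},
    {"phase": 2, "name": "annealing SFT",
     "data": "high-quality perception/grounding/action/reflection pairs",
     "checkpoint": "UI-TARS-SFT"},
    {"phase": 3, "name": "DPO",
     "data": "paired (blunder, fix) reflection samples",
     "checkpoint": "UI-TARS-DPO"},
]
```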
09:21On offline benchmarks, things are already strong.
09:24In multimodal Mind2Web, the 72B model posts 74.7% element accuracy and a 92.5% operation F1.
09:33Android Control low-level tasks: 89.9% grounding and a 91.3% success rate.
09:41GUI Odyssey's cross-app mobile navigation hits over 88% success, which beats OS-Atlas-7B by about 26 points.
09:50Yet the more impressive bit is how System 2 reasoning starts to shine the moment you give the model more than one sample.
09:57In best-of-one sampling, the no-thought version sometimes edges ahead because the deliberate chain of thought can hallucinate.
10:05But at best-of-16, the thought-based policy overtakes.
10:09And by best-of-64, it's way better because the extra sampling lets the model explore multiple reasoning paths and keep the best one.
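Best-of-N here just means: sample N candidate rollouts, score them, keep the winner. A tiny sketch, with rollout and score as hypothetical helpers.

```python
# Tiny sketch of best-of-N sampling: draw N candidate rollouts and keep
# the highest-scoring one. rollout and score are hypothetical helpers.
from typing import Callable

def best_of_n(model, task, rollout: Callable, score: Callable, n: int = 16):
    candidates = [rollout(model, task) for _ in range(n)]  # N reasoning paths
    return max(candidates, key=score)                      # keep the best one
```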
10:17Out of domain, deliberate reasoning wins straight away.
10:20On Android World, which the model never saw during training, the thought-enabled 72B cleans up at 46.6%, while the reflex-only mode stalls at 34.5%.
10:32That generalization suggests the chain of thought is giving the agent a planning buffer so it can handle surprises.
10:38Kind of like how humans slow down when things get weird.
10:41For the broader community, the cool thing is that ByteDance didn't lock the data behind an NDA.
10:47They dumped screenshots, annotation guidelines, and evaluation scripts.
10:50The license is Apache 2.0, which means you can throw it into a commercial product, tweak the code, sell the service, and nobody's coming for royalties.
10:59And because the action space is unified, click, type, drag, etc., people can fuse their own data sources, say a specialized medical interface or an indie game UI, and keep the same fine-tuning recipe.
11:11And yeah, that's the drop.
11:14If you've been waiting for an open agent that actually moves a mouse instead of hallucinating JavaScript, this might be your playground. Smash that like if you found this useful.
11:23Sub for more deep-dive AI stuff, and let me know in the comments what workflow you'd throw at UI-TARS first.
11:29Thanks for watching, and I'll catch you in the next one.
