4/28/2025
China has just introduced a next-level AI agent that's raising alarms worldwide! 🚨 Designed with unprecedented autonomy, decision-making capabilities, and real-world action potential, this new AI agent blurs the line between digital intelligence and real-world impact. From defense applications to advanced cyber operations, this agent showcases China's serious investment in dominating the AI race. 🧠💥
In this video, we break down what makes this AI so powerful, and why experts around the world are both amazed and concerned. Could this reshape the global AI arms race forever? 🌍⚡

#ChinaAI #DangerousAI #AIRevolution #AutonomousAI #AIthreat #ArtificialIntelligence #TechNews #FutureOfAI #AIarmsrace #MachineLearning #NextGenAI #AIadvancements #GlobalAI #AIpower #DigitalTransformation #AItechnology #AIimpact #SmartMachines #AIupdate #AIsafety
Transcript
00:00So, ByteDance just dropped UI-TARS 1.5, and the short version is this.
00:07It's a vision-language agent that treats your screen like one big image
00:11that it can read, reason about, and then manipulate directly.
00:14Instead of juggling DOM trees, calling external tools, or stuffing the prompt with handcrafted instructions,
00:20the model ingests a screenshot, figures out the layout and the task from plain language,
00:25and then acts natively as if a real user were at the controls.
00:29That shift, folding perception, planning, and low-level actions into one neural backbone,
00:35changes the game for GUI automation, game agents, and any workflow that lives inside a graphical interface.
00:41It's faster, it's more resilient when the UI changes, and in head-to-head benchmarks,
00:46it's already edging out GPT-4-based setups and Claude on everything from Windows desktops to Android apps to web navigation.
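To make the screenshot-in, action-out idea concrete, here's a minimal Python sketch of that loop. Every name in it (capture, execute, model.predict_action, the Action dataclass) is a hypothetical placeholder, not the actual UI-TARS API.

```python
# Minimal sketch of the screenshot-in, action-out loop. Every name here
# (capture, execute, model.predict_action, Action) is a hypothetical
# placeholder, not the real UI-TARS API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Action:
    kind: str                                 # "click", "type", "finished", ...
    args: dict = field(default_factory=dict)  # e.g. {"x": 412, "y": 88}

def run_agent(model, task: str, capture: Callable, execute: Callable,
              max_steps: int = 50) -> list[Action]:
    history: list[Action] = []
    for _ in range(max_steps):
        screenshot = capture()  # the whole UI as one big image
        # Pixels + plain-language task + history in, next action out:
        # no DOM tree, no external tools, no handcrafted prompt scaffolding.
        action = model.predict_action(screenshot, task, history)
        history.append(action)
        if action.kind in ("finished", "call_user"):  # meta-actions end the loop
            break
        execute(action)  # click/type/scroll natively, like a real user
    return history
```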
00:54So, they started with the original UI-TARS a few months back, but version 1.5 is the beefed-up sequel.
01:00Under the hood, it's still Qwen2-VL at heart, but they scaled it three ways.
01:05A lightweight 2 billion parameter model, a mid-range 7 billion, and a chunky 72 billion variant
01:11that got an extra round of direct preference optimization.
01:14Across 50 billion tokens of training data, spanning screenshots, element metadata, GUI tutorials, and bootstrapped action traces,
01:22the team taught the model to see, reason, and click in a single pass.
01:27The first big upgrade is how it looks at a screen.
01:30They scraped websites, Windows apps, Android UIs, even CAD and Office software,
01:34yanked the bounding boxes, the labels, the colors, the tiny 10x10 pixel icons, everything,
01:40and synthesized five flavors of perception data.
01:43Element descriptions tell the model, hey, that little blue square with the floppy disk icon is a save button.
01:49Dense captions stitch all the elements into a global paragraph, so the model gets layout context.
01:55State transition captions teach it to notice subtle changes, like the difference between a button-down frame and a hover frame.
02:01There's a giant heap of screenshot Q&A so it can answer short questions, such as
02:05where's the new tab button?
02:06And they sprinkled in set-of-mark prompting, basically drawing colored markers on elements,
02:12so the agent can ground language tokens to specific pixels.
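Set-of-mark prompting is easy to reproduce at home. Here's a rough Pillow sketch that draws numbered markers over known elements; the element-list format is invented for illustration, not taken from ByteDance's pipeline.

```python
# Rough sketch of set-of-mark prompting: overlay numbered markers on known
# UI elements so language tokens can be grounded to specific pixels.
# The element-list format here is invented for illustration.
from PIL import Image, ImageDraw

def draw_set_of_marks(screenshot: Image.Image, elements: list[dict]) -> Image.Image:
    """elements: [{"bbox": (x0, y0, x1, y1), "label": "save button"}, ...]"""
    marked = screenshot.copy()
    draw = ImageDraw.Draw(marked)
    for i, el in enumerate(elements):
        x0, y0, x1, y1 = el["bbox"]
        draw.rectangle((x0, y0, x1, y1), outline="red", width=2)  # box the element
        draw.text((x0 + 2, y0 + 2), str(i), fill="red")           # numeric mark
    return marked
```

The marked image goes to the model, and a reference like "mark 3" in its output maps straight back to elements[3]["bbox"], i.e. concrete pixel coordinates.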
02:16Now, seeing is cute, but acting matters.
02:18ByteDance defined a unified action space, shared primitives like click(x, y), drag, scroll, type, and wait,
02:26plus desktop specials like hotkey or right-click, and mobile-only press-back or long-press.
02:32On top of that, you get two meta-actions: finished, when the job's done, and call user, when it's stuck behind a login wall.
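As a mental model, that action space boils down to a shared vocabulary plus platform extensions and the two meta-actions. A toy sketch; the exact identifiers in the repo may differ.

```python
# Toy sketch of the unified action space: shared primitives, platform
# extensions, and meta-actions. Exact identifiers in the repo may differ.
SHARED  = {"click", "drag", "scroll", "type", "wait"}
DESKTOP = {"hotkey", "right_click"}     # desktop-only specials
MOBILE  = {"press_back", "long_press"}  # mobile-only specials
META    = {"finished", "call_user"}     # job done / ask the human

def is_valid(action: str, platform: str) -> bool:
    extras = DESKTOP if platform == "desktop" else MOBILE
    return action in SHARED | META | extras
```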
02:40They collected millions of multi-step traces, open-source stuff like Mind2Web, AITW, Android control,
02:47plus their own hand-recorded sessions, and normalized every step into that action template.
02:52Average trace length? Around 15 steps in their in-house set, so the model learns long-horizon control, not just single taps.
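Normalizing those heterogeneous traces into one template could look roughly like this; the input field names are hypothetical stand-ins for each dataset's real schema.

```python
# Hypothetical sketch of normalizing one mobile-trace step into the shared
# template; the input field names are stand-ins for each dataset's schema.
def normalize_step(raw: dict) -> dict:
    kind = raw["action_type"]
    if kind == "tap":          # platform tap -> shared click(x, y)
        return {"action": "click", "x": raw["x"], "y": raw["y"]}
    if kind == "input_text":   # text entry -> shared type(text)
        return {"action": "type", "text": raw["text"]}
    if kind == "swipe":        # swipe -> shared scroll(direction)
        return {"action": "scroll", "direction": raw["direction"]}
    raise ValueError(f"unmapped action type: {kind!r}")
```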
03:00Reasoning is where version 1.5 shines.
03:04The paper splits human-style thinking into System 1 and System 2.
03:09System 1 is the fast, intuitive, just-click-the-button vibe.
03:14System 2 is the deliberate, chain-of-thought, break-the-task-down vibe.
03:19Task decomposition, milestone recognition, trial and error loops, reflection when things go sideways.
03:26They harvested 6 million GUI tutorials off the web, about 500 tokens and 3 pictures each on average.
03:32Ran them through a fast text filter, a big LLM cleanup, and used that as a reasoning primer.
03:38Then, for every action trace they already had, they retrofitted a thought, first with ActRe-style prompting,
03:43then with a clever bootstrapping trick where the model samples multiple thought-action pairs
03:49and only keeps the pair that actually works.
03:52Those thoughts sit in context, so before every click, you literally get the model's inner monologue,
03:56like, okay, search field detected, type the username next.
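That bootstrapping trick is essentially rejection sampling over thought-action pairs. A rough sketch, with model.sample and the record layout as hypothetical placeholders.

```python
# Rough sketch of thought bootstrapping: sample several (thought, action)
# pairs and keep only a thought whose action matches the known-good step.
# model.sample and the return layout are hypothetical.
def bootstrap_thought(model, state, gold_action, n_samples: int = 8):
    for _ in range(n_samples):
        thought, action = model.sample(state)  # candidate monologue + move
        if action == gold_action:              # thought "explains" the right action
            return thought                     # keep it as training data
    return None                                # nothing matched; drop this step
```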
04:00Because agents crash in real life, they taught the model to learn from mistakes.
04:04They spun up hundreds of virtual PCs, let an early checkpoint roam free, captured the messy traces,
04:10filtered the junk with rules and VLM scoring, and sent human annotators to label two critical steps,
04:16the wrong move and the correct fix.
04:18That gave them paired samples for direct preference optimization.
04:22DPO's job is simple: reward the fix, penalize the blunder, keep the policy close to the stable SFT baseline.
04:28After a few bootstrapping rounds, they saw the 72-billion-parameter model
04:32jump from 17-ish to over 24 points on OSWorld's 50-step budget.
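For readers who want the mechanics, the objective on those blunder/fix pairs is standard DPO. A minimal PyTorch sketch, assuming you've already summed per-sequence log-probs under the trainable policy and the frozen SFT reference.

```python
# Minimal DPO loss sketch (standard formulation), assuming summed
# per-sequence log-probs for the chosen (fix) and rejected (blunder)
# actions under both the trainable policy and the frozen SFT reference.
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    # How much more the policy prefers the fix over the blunder,
    # measured relative to the reference model.
    chosen_rewards = beta * (policy_chosen_logp - ref_chosen_logp)
    rejected_rewards = beta * (policy_rejected_logp - ref_rejected_logp)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

Measuring the margin relative to the reference is what keeps the policy anchored to the SFT baseline while it learns to prefer the fix.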
04:38All right, numbers, because everybody asks for the scoreboard.
04:42In the synthetic desktop sandbox OSWorld, UI-TARS 1.5 nails a 42.5% success rate with just 100 steps,
04:51beating OpenAI's Operator at 36.4 and Claude 3.7 at 28.
04:56Windows Agent Arena, a tougher 50-step Windows challenge, shows 42.1% for TARS versus the old 29.8 baseline.
05:05On Android World, the 7B model pulls 64.2%, topping the previous 59.5.
05:11When it comes to naked grounding, pointing at a widget,
05:14ScreenSpot V2 clocks UI-TARS 1.5 at 94.2% accuracy, Operator at 87.9, and Claude at 87.6.
05:23The new ScreenSpot Pro, which uses high-res professional apps, is where the gap gets crazy.
05:2861.6% for TARS, only 23.4 for Operator, 27.7 for Claude.
05:35Gaming?
05:36They fed the model 14 Poki minigames, 2048, Infinity Loop, Snake Solver, and so on,
05:42and the Agent clears every single one, literally 100% across the board,
05:47while Operator and Claude flop on half the titles.
05:51Minecraft's MineRL benchmark is more humbling, but still, with the Think, Then Act mode turned on,
05:57TARS 1.5 gets 42% average success on 200 mining tasks, 31% on the 100 mob killing tasks,
06:06while the older VPT or DreamerV3 agents barely cracked 1% on those same goals.
06:12To prove it's not just cherry-picked demos, they published the full benchmark table in the blog.
06:17WebVoyager browsing tasks: 84.8% for TARS versus 87% for CUA, a marginal gap.
06:24Online Mind2Web tasks: 75.8% against 71% for OpenAI's CUA.
06:31Notice the pattern.
06:32On web browsing, Operator stays competitive, but once you leave HTML land,
06:37and especially once you hit desktop or mobile, UI-TARS pulls ahead.
06:40Scale tests are neat, too.
06:42Take the original 72B DPO checkpoint on OSWorld, 24.6 points.
06:48The mid-sized 7B version with the same training setup reaches 42.5 points,
06:53because that one is optimized for desktop tasks instead of games.
06:57The tiny 7B light release clocks 27.5 on ScreenSpot Pro,
07:03versus 38.1 for the earlier 72B DPO, 49.6 for the new 7B, and 61.6 for the full 1.5.
07:12So, bigger isn't always better.
07:14Targeted data and the thought engine matter more than raw parameters for grounding.
07:19Deployment is refreshingly open.
07:21They pushed the 7B checkpoint to Hugging Face under Apache 2.0 at ByteDance-Seed/UI-TARS-1.5-7B.
07:30The full 72B weights are under Early Research Access.
07:34You basically email TARS@bytedance.com, tell them about your project, and cross your fingers.
07:39The GitHub repo, github.com/bytedance/UI-TARS, has training scripts,
07:43the unified action schema, the screen capture tool chain, even replay data,
07:47so you can reproduce their metrics.
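If you want to poke at the weights, loading the open checkpoint looks roughly like this. Check the model card for the exact model class and prompt format; this sketch just assumes a recent transformers release with generic vision-language support.

```python
# Hedged sketch of pulling the open 7B checkpoint. Check the model card
# for the exact model class and chat/prompt format; this assumes a recent
# transformers release with generic vision-language support.
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "ByteDance-Seed/UI-TARS-1.5-7B"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

# From here: run a screenshot plus a natural-language task through the
# processor, generate, and parse the emitted action string.
```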
07:49If you just want to play, there's also UI-TARS Desktop,
07:52a desktop app that lets you type a natural-language prompt and watch the agent drive your PC.
07:58Think of it as an open-source Operator, but without the GPT-4 subscription.
08:02From a research meta angle, the paper goes full history lesson.
08:05It tracks GUI agents from old-school rule-based RPA to modular LLM frameworks like Auto-GPT,
08:12to these new end-to-end native models.
08:15Rule-based stuff is brittle, frameworks are prompt engineering heavy,
08:19and fall apart when a UI shifts.
08:21But native agents, being data-driven, retrain on fresh traces and just adapt.
08:26That's why ByteDance keeps banging on the four core capabilities.
08:30Perception, know what's on the screen.
08:32Action, hit the right coordinates.
08:36Reasoning, swap between System 1 and System 2 as needed.
08:40And memory, store short-term context in the prompt, long-term lessons in the weights.
08:45Stick those in one model and you get something that can reboot itself when Windows throws a pop-up.
08:50Training followed three phases.
08:52Phase 1, straight continual pre-training on everything, around 50 billion tokens, flat learning rate.
08:58Phase 2, an annealing stage, where they only keep high-quality perception, grounding, action, and reflection pairs,
09:04so the loss focuses on difficult examples.
09:07They call that checkpoint UI-TARS-SFT.
09:10Phase 3, DPO on the reflection pairs, which bakes in the preference for fixed actions.
09:16That's UI-TARS-DPO, the model that hit 24.6 on OSWorld.
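As a back-of-envelope summary of that recipe (purely illustrative; the real hyperparameters live in the paper, not here):

```python
# Back-of-envelope summary of the three-phase recipe; illustrative only.
PHASES = [
    {"phase": 1, "name": "continual pre-training",
     "data": "everything (~50B tokens)", "lr": "constant"},
    {"phase": 2, "name": "annealing SFT",
     "data": "high-quality perception/grounding/action/reflection pairs",
     "checkpoint": "UI-TARS-SFT"},
    {"phase": 3, "name": "DPO",
     "data": "paired (blunder, fix) reflection samples",
     "checkpoint": "UI-TARS-DPO"},
]
```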
09:21On offline benchmarks, things are already strong.
09:24In multimodal Mind2Web, the 72B model posts 74.7% element accuracy and a 92.5% operation F1.
09:33Android Control low-level tasks: 89.9% grounding and a 91.3% success rate.
09:41GUI Odyssey's cross-app mobile navigation hits over 88% success, which beats OS-Atlas-7B by about 26 points.
09:50Yet the more impressive bit is how System 2 reasoning starts to shine the moment you give the model more than one sample.
09:57In best-of-one sampling, the no-thought version sometimes edges ahead because the deliberate chain of thought can hallucinate.
10:05But at best-of-16, the thought-based policy overtakes.
10:09And by best-of-64, it's way better because the extra sampling lets the model explore multiple reasoning paths and keep the best one.
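Best-of-N here just means: sample N candidate rollouts, score them, keep the winner. A tiny sketch, with rollout and score as hypothetical helpers.

```python
# Tiny sketch of best-of-N sampling: draw N candidate rollouts and keep
# the highest-scoring one. rollout and score are hypothetical helpers.
from typing import Callable

def best_of_n(model, task, rollout: Callable, score: Callable, n: int = 16):
    candidates = [rollout(model, task) for _ in range(n)]  # N reasoning paths
    return max(candidates, key=score)                      # keep the best one
```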
10:17Out of domain, deliberate reasoning wins straight away.
10:20On Android World, which the model never saw during training, the thought-enabled 72B cleans up at 46.6%, while the reflex-only mode stalls at 34.5%.
10:32That generalization suggests the chain of thought is giving the agent a planning buffer so it can handle surprises.
10:38Kind of like how humans slow down when things get weird.
10:41For the broader community, the cool thing is that ByteDance didn't lock the data behind an NDA.
10:47They dumped screenshots, annotation guidelines, and evaluation scripts.
10:50The license is Apache 2.0, which means you can throw it into a commercial product, tweak the code, sell the service, and nobody's coming for royalties.
10:59And because the action space is unified, click, type, drag, etc., people can fuse their own data sources, say a specialized medical interface or an indie game UI, and keep the same fine-tuning recipe.
11:11And yeah, that's the drop.
11:14If you've been waiting for an open agent that actually moves a mouse instead of hallucinating JavaScript, this might be your playground. Smash that like if you found this useful.
11:23Sub for more deep-dive AI stuff, and let me know in the comments what workflow you'd throw at UI-TARS first.
11:29Thanks for watching, and I'll catch you in the next one.
