Nicholas

Robotics' End Game: Nvidia's Jim Fan

Jim Fan, who leads the embodied autonomous research group at Nvidia, returns to AI Ascent to argue that robotics is entering its end game — and that the playbook is already written. He walks through what he calls "the great parallel": robotics following the...

Published: Apr 30, 2026
Uploaded: Jun 11, 2026
File type: YouTube
Source: youtu.be

Full transcript

AI-generated transcript with timestamped sections.

*music* And up first, I'm delighted to introduce my friend, Jim Fan. Jim leads the embodied autonomous research group at NVIDIA, otherwise known as NVIDIA Robotics. I think that robots are just one of the most thrilling things that's going to happen. A car, basically, is a big robot, but I'm excited for robots that can go beep-boop and lift things for us. Jim was a standout at last year's AI Ascent, and we're delighted to have you back.

Thanks, everyone. Thanks. Actually, right in this office that we're sitting in, there was a guy in a shiny leather jacket, big biceps, hauling in this large metal tray, this large piece of metal. He wrote, "To Elon and the OpenAI team: to the future of computing and humanity, I present you the world's first DGX-1." So that was the first time I met Jensen. And as any good intern would do, I rushed to get in line to sign my name on it. So can you spot it, my name? It's here. And can you spot another? That's Andrej right there. So Andrej, we're going to the Computer History Museum.

I feel like a dinosaur. You know, back then, I had no clue what I was signing up for. And no one can describe what happened next better than Ilya himself: "If you believe in deep learning, deep learning will believe in you." And oh boy, deep learning believed in all of us, big time. Three step functions. Six years. That's all it took to bring us here today. The first tick: GPT-3, pre-training. Next-token prediction is really about learning the rules of grammar, the shape of language. It's about simulating how thoughts and code and strings in general should unfold.

2022: InstructGPT, supervised fine-tuning, allowing the simulation to do useful work. o1: reasoning, using reinforcement learning to surpass imitation learning. And finally, auto research, accelerating the whole loop beyond what's humanly possible. So as Andrej said, all the labs are getting to the final boss fight. So for LLMs, they are in the thick of the endgame. And honestly, I'm very jealous. Look at how happy Andrej was, big smile on his face. The LLM folks are having the party of their lifetime. They're speedrunning AGI on mystical creatures literally called Mythos. So why can't robotics get a piece of the fun?

So as any self-respecting scientist would do, I copy homework and give it a new name. I call it the Great Parallel. So instead of simulating strings, can we simulate the next physical world state? And then we can align, through action fine-tuning, onto a thin slice of that simulation that matters for real robots. And we let reinforcement learning carry the last mile. And that's it. The Great Parallel: copying the LLM success. If you can't beat them, join them. So please join me in a new episode: Robotics, the End Game. And sorry, I just couldn't resist.

Nano Banana's too good. Thanks, Demis. So how do we play the endgame? It boils down to two things: model strategy and data strategy. Let's look at the model first. The last three years were dominated by VLAs, or vision-language-action models. Models like Pi and GR00T fall into this category. So we assume that the pre-training is done by a VLM, and we simply graft an action head on top of it. But really, if you think about these models, they are LVAs, because most of the parameters are dedicated to language.

So language is the first-class citizen, followed by vision and action. And by design, VLAs are great at encoding knowledge and nouns, but not so much at physics and verbs. It's kind of head-heavy in the wrong places. This is my favorite example from the original VLA paper: "Move the Coke can to a picture of Taylor Swift." Yes, it has not seen Taylor Swift before. Yes, it's able to generalize. But this is not quite the pre-training ability that we're looking for. So what's the second pre-training paradigm? I always thought that it would be something glorious. Unfortunately, it turns out to be what we call AI video slop.

You know, I can watch these cats playing banjo on security cam all day. It's peak internet. But really, look at this. No one can take this seriously, until we realize that these video models are learning to simulate the next world state internally. So these are some raw samples from Veo 3. You can see that the models pick up gravity, buoyancy, lighting, reflection, and refraction, all by themselves. None of this is coded in. Physics emerges from predicting the next blob of pixels at scale. And even visual planning emerges. Look at how Veo solves these mazes.

It solves them by running the simulation forward in pixel space. And I'd draw your attention to the lower right corner here. This is my favorite example. Let's watch. Blink and you'll miss how Veo 3 solves this one. It's super smart. You know, Veo 3 figures out that if you're not looking, geometry is optional. I call this physics slop. So how do we make these world models useful? Well, we do action fine-tuning. We take this superposition of all possible future states and collapse it onto a thin slice that matters for real robots. Introducing DreamZero.

It's a new type of policy model that dreams a couple of seconds into the future and acts accordingly. And you know that motor actions are high-dimensional continuous signals. So they look just like pixels, and we can render them at the same time as we render the videos. So DreamZero jointly decodes the next world states and next actions. And as a result, it's able to zero-shot solve tasks and verbs that it has never seen in training. And as the robot executes, we can visualize what it's dreaming about. And the correlation is very tight.
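The idea of jointly decoding future frames and future actions can be illustrated with a toy sketch. This is not the architecture from the talk: the class name `ToyWorldActionModel`, the sizes, and the random untrained weights are all assumptions made up for illustration. The only point is that pixels and actions come out of the same decoder in one pass, so each predicted step is a single vector holding both.

```python
import math
import random

random.seed(0)

# Hypothetical sizes, chosen only to keep the toy small.
PIXELS = 12      # flattened pixels per (tiny) frame
ACTION_DIM = 7   # e.g. one command per joint of a 7-DoF arm
LATENT = 8       # width of the shared hidden state
HORIZON = 3      # future steps decoded in one pass


def random_matrix(rows, cols):
    """Random, untrained weights; a real model would learn these."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class ToyWorldActionModel:
    """One shared trunk and one decoder whose output covers pixels AND actions."""

    def __init__(self):
        self.trunk = random_matrix(LATENT, PIXELS)
        # Each future step is decoded as one vector: [pixels..., actions...].
        self.decoder = random_matrix(HORIZON * (PIXELS + ACTION_DIM), LATENT)

    def predict(self, frame):
        hidden = [math.tanh(v) for v in matvec(self.trunk, frame)]
        flat = matvec(self.decoder, hidden)
        step = PIXELS + ACTION_DIM
        chunks = [flat[i * step:(i + 1) * step] for i in range(HORIZON)]
        frames = [chunk[:PIXELS] for chunk in chunks]    # predicted world states
        actions = [chunk[PIXELS:] for chunk in chunks]   # predicted motor commands
        return frames, actions


model = ToyWorldActionModel()
frames, actions = model.predict([random.random() for _ in range(PIXELS)])
print(len(frames), len(frames[0]), len(actions), len(actions[0]))
```

Because both outputs share the same hidden state, a bad frame prediction and a bad action prediction would tend to come from the same internal error, which is one way to read the tight correlation described in the talk.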

If the video prediction works, the action works. If the video hallucinates, the action fails. So once again, vision and action are now first-class citizens. And we have a lot of fun with DreamZero. So we just roll the robot around in our lab and type random things into the prompt box. And of course, DreamZero is not going to get all of these tasks 100% robust, but it's kind of like GPT-2. It's trying to get the shape of the motion correct in every case. So DreamZero is our first step towards open-ended, open-vocabulary prompting for robotics. And we call this new type of model world action models, or WAMs. So let's all take a moment of silence for our dear friend, VLAs. They've served us well. Rest in peace. Long live world action models.

And next, data strategy. This is NVIDIA's chief scientist, Bill Dally, doing teleoperation inside our lab. And given his salary, I think this is by far the most expensive teleop trajectory ever collected in our data set. The past three years have been dominated by teleoperation. It's the golden era. VR headsets with extremely optimized latency for streaming.

And these complex rigs that look like medieval torture devices. You know, so much investment in the industry, so much pain and suffering. And yet teleop is upper-bounded by 24 hours per robot per day, the fundamental physical limit. And actually, who am I kidding? It's more like three hours per robot per day, and only when the robot god is merciful, because they throw tantrums all the time. So how can we do better? Well, how about this? You just wear the robot hand on your own hand. So this is called UMI, or Universal Manipulation Interface, and it's a deceptively simple idea.

You wear the robot actuator on your hand and directly collect the data as a human, while getting the rest of the robot body out of the loop. Yet I would say UMI is perhaps one of the greatest papers ever written in robotics data, and it spawned two unicorn startups. On the left-hand side is Generalist improving this design, so you can wear the gripper here. And on the right-hand side, Sunday made these three-finger data gloves. So last year, we took it one step further: we designed this exoskeleton that has a one-to-one mapping with five-fingered dexterous robot hands, and we call it DexUMI.

Let's look at it in action. On the left, the human directly collecting data is always the fastest. On the right, look at how difficult teleop is. The human operator, here one of our most skilled PhDs, has to align very carefully. And it's super slow. The success rate is very low as well. And in the middle, you just wear this exoskeleton and collect data directly. And we train our robot policy on this data. So what you see here is a fully autonomous rollout of a policy that's trained on zero teleoperation data.

So we're able to break the curse of 24 hours per robot per day. And see how happy these robots are, because they no longer need to be in the loop for data collection. Thank you. So is this the answer? Have we solved scaling for robotics? Anyone driving a Tesla or Waymo here? Anyone? You know, when you're driving, you're actually contributing to the biggest physical data flywheel. And the beauty is you don't even feel it during FSD, because the data upload is an ambient process. Yet wearing these UMI or data wearables is still cumbersome, right?

It's intrusive. It's not as seamless as just driving to work. So we need an FSD equivalent. The data collection needs to get out of the way and fade into the background, so we can capture the full glory of human dexterity across all walks of life, across all labor of economic value. So we are going all in on human egocentric videos that come with detailed annotations, like hand position tracking and dense language annotations. Introducing EgoScale, where 99.9% of the training that goes into this is based on human egocentric videos. And the result is an end-to-end policy that maps directly from the camera pixels here to 22-degree-of-freedom, high-dexterity robot hands.

What you see here is fully autonomous. We pre-train EgoScale on 21K hours of in-the-wild egocentric human data, with zero robot data whatsoever. During pre-training, we predict hand joints and wrist poses. Then in action fine-tuning, we collect only 50 hours of high-precision mocap glove data and four hours of teleop. That's four hours of teleop, less than 0.1% of our training mix. Thank you. And with this, EgoScale is able to generalize to these very dexterous tasks like sorting cards, or manipulating the syringe and transferring the liquid.
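The training-mix figures quoted above can be checked with a few lines of arithmetic. This sketch assumes the mix is measured in hours of data and that the mocap hours count as human-collected data; neither assumption is stated in the talk.

```python
# Hours quoted in the talk.
PRETRAIN_EGOCENTRIC_HOURS = 21_000  # "21K hours" of in-the-wild human video
MOCAP_HOURS = 50                    # high-precision mocap data (fine-tuning)
TELEOP_HOURS = 4                    # teleoperation data (fine-tuning)

total_hours = PRETRAIN_EGOCENTRIC_HOURS + MOCAP_HOURS + TELEOP_HOURS

# Assumption: share of the mix is measured by hours, not by samples or tokens.
teleop_share = TELEOP_HOURS / total_hours
# Assumption: mocap hours are human-collected, so only teleop involves a robot.
human_share = (PRETRAIN_EGOCENTRIC_HOURS + MOCAP_HOURS) / total_hours

print(f"teleop share of training mix: {teleop_share:.4%}")
print(f"human-collected share:        {human_share:.4%}")
print("teleop below 0.1%:", teleop_share < 0.001)
```

Under these assumptions the teleop share comes out well under the 0.1% quoted, and the human-collected share is above 99.9%.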

Someday we might have robot nurses at home. Might as well try this. And for these tasks, it takes only a one-shot demonstration at test time to learn different shirt-folding strategies. And perhaps the most fascinating finding from the paper is that we discovered this neural scaling law for dexterity. It's a very clean relationship between the number of hours we put into pre-training and the optimal validation loss. In fact, it's a clean, clean, clean log-linear mathematical equation, six years after the original neural scaling law for language models. So if we put all of these data strategies on this chart, the X-axis is alignment to the robot hardware.
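The log-linear relationship described above between pre-training hours and validation loss can be sketched as a simple fit. The form used here, loss = a + b · ln(hours), is one reading of "log-linear" and is an assumption; the talk does not give the equation or its coefficients. The data points below are synthetic placeholders generated from made-up coefficients, not results from the paper.

```python
import math


def fit_log_linear(hours, losses):
    """Ordinary least-squares fit of loss = a + b * ln(hours); returns (a, b)."""
    xs = [math.log(h) for h in hours]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(losses) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, losses))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic placeholder data generated from a = 1.0, b = -0.05 (made up).
hours = [1_000, 2_000, 5_000, 10_000, 20_000]
losses = [1.0 - 0.05 * math.log(h) for h in hours]

a, b = fit_log_linear(hours, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

On a plot with a logarithmic X-axis, a relationship of this form appears as a straight line, which is what makes such a law easy to spot and to extrapolate.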

The Y-axis is scalability. This is what it looks like. Teleop is the least scalable. With data wearables, you can go up to hundreds of thousands of hours. And with egocentric video, if we're able to spin the FSD flywheel, easily 10 million hours in the next year or so. And if we draw a line here, everything to the left of this line is a new paradigm: sensorized human data. So let me make a few predictions. In the next year or two, we'll see teleop dropping and dropping to an almost negligible amount. And then there will be an ensemble of data wearables custom-designed for different hardware and use cases.

And finally, the main diet for robotics will be egocentric videos. So, a moment of silence for our dear friend teleop. You have served us well. Rest in peace. Long live sensorized human data. Are we done with the data strategy yet? Did you notice I put two rings on data strategy? What's the outer ring here? All the LLM frontier labs have spent significant budget now on acquiring millions of coding environments to do reinforcement learning. So robotics is the same. We're in urgent need of scaling up environments. And of course, you can always do reinforcement learning directly on the real robot.

So in our lab, we use RL to push certain tasks to almost 100% success rate, so you can do this continuous execution for hours on end. You know, it's kind of therapeutic to see these robots assembling GPUs just by themselves. Or as a wise man would say, "Good boy, this task has been approved by my boss." Yet we can't get to 1 million environments, because that would require 1 million robots if you do it the previous way. So we need a better way. Here, let's say you take an iPhone picture.

And you can pass this through this 3D world scan pipeline to extract all the objects and then automatically synthesize them again inside a classical physics simulator. So all these objects are actually interactive after the scan. And then you can augment this infinitely in simulation with variations that we call digital cousins. So now the iPhone basically becomes a pocket world scanner, in this process that we call real-to-sim-to-real. And in this way, we have a scalable way to port the physical world into the digital world. But still, this method relies on a classical graphics engine.
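The augmentation step described above, generating many variations of one scanned scene, can be sketched in a very simplified form. Everything here is hypothetical: the `SceneObject` fields, the `make_cousin` helper, and the jitter ranges are illustrative assumptions, and a real pipeline would vary far more than position and scale (for example geometry, materials, and lighting).

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical, minimal description of one object extracted from a scan."""
    name: str
    x: float      # position on the table, in metres
    y: float
    scale: float  # relative size; 1.0 means as scanned


def make_cousin(scene, rng, pos_jitter=0.05, scale_jitter=0.1):
    """Return a varied copy of the scene: same objects, perturbed pose and size."""
    return [
        replace(
            obj,
            x=obj.x + rng.uniform(-pos_jitter, pos_jitter),
            y=obj.y + rng.uniform(-pos_jitter, pos_jitter),
            scale=obj.scale * (1.0 + rng.uniform(-scale_jitter, scale_jitter)),
        )
        for obj in scene
    ]


# Placeholder for the output of a scan; the values are made up.
scanned_scene = [
    SceneObject("mug", 0.20, 0.10, 1.0),
    SceneObject("plate", 0.50, 0.30, 1.0),
]

rng = random.Random(0)
cousins = [make_cousin(scanned_scene, rng) for _ in range(100)]
print(len(cousins), cousins[0][0].name)
```

Each varied scene could then serve as a separate training environment in the simulator, which is how a single photo can fan out into many environments.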

Can we do better? Introducing DreamDojo. It builds on video world models and turns them into full-fledged neural simulators. DreamDojo takes as input these continuous action signals and outputs the next RGB frames, as well as sensor states, in real time. Not a single pixel you see here is real. And DreamDojo is able to capture and learn the mechanics of different robots through a purely data-driven approach. There is no physics equation, no graphics engine involved in this process. So the new post-training paradigm for robotics is a massively parallel RL system that runs on a few real robot stations, a bunch of graphics cores running world scans, and heavy inference compute running world models.
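The interface described above, continuous actions in and next frames plus sensor states out, is the same loop a policy would run against any simulator. The sketch below shows only that interface. `ToyNeuralSimulator` is a hypothetical stand-in with hand-written placeholder dynamics, whereas the simulator in the talk is learned from data; none of the names or sizes come from the talk.

```python
import math
import random


class ToyNeuralSimulator:
    """Hypothetical stand-in for a learned simulator: action in, frame and sensors out."""

    def __init__(self, frame_pixels=12, action_dim=4, seed=0):
        rng = random.Random(seed)
        self.state = [0.0] * action_dim
        # Fixed random projection used to "render" the state into pixel values.
        self.render = [
            [rng.gauss(0.0, 1.0) for _ in range(action_dim)]
            for _ in range(frame_pixels)
        ]

    def step(self, action):
        # Placeholder dynamics: the state drifts toward the commanded action.
        self.state = [0.9 * s + 0.1 * a for s, a in zip(self.state, action)]
        frame = [
            1.0 / (1.0 + math.exp(-sum(w * s for w, s in zip(row, self.state))))
            for row in self.render
        ]
        sensors = list(self.state)
        return frame, sensors


def rollout(sim, policy, steps):
    """Run a policy inside the simulator and record (action, frame, sensors)."""
    frame, sensors = sim.step([0.0] * len(sim.state))  # initial observation
    trajectory = []
    for _ in range(steps):
        action = policy(frame, sensors)
        frame, sensors = sim.step(action)
        trajectory.append((action, frame, sensors))
    return trajectory


def constant_policy(frame, sensors):
    """Trivial policy that always commands 1.0 on every action dimension."""
    return [1.0] * len(sensors)


trajectory = rollout(ToyNeuralSimulator(), constant_policy, steps=5)
print(len(trajectory), len(trajectory[0][1]))
```

Since a rollout like this needs only compute, many copies can run in parallel, which is the sense in which more compute yields more environments and more data.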

Or as this equation goes, compute now equals environment now equals data. Or as a wise man would say, "The more you buy, the more you save." And this message has been approved by my boss. So that's it. Putting it together: the Great Parallel that robotics will follow. And it's happening as we speak. And we're looking at the beginning of the end game. Do you guys play the video game Civilization? Still my favorite. I like to think of my research as unlocking game achievements on the civilizational technology tree. And there are three more achievements to unlock for robotics, and then we're done.

I can retire, and I can't wait for that. The first is passing the physical Turing test: across a wide range of activities, you cannot tell the difference between a human doing a task and a robot doing it. Maybe not drunk humans, but... The physical Turing test is about unit energy in and unit labor out. And just judging by the sexy pose of this robot, I think the work is cut out for us. So maybe it's two to three years away. And next, physical API. You have a whole fleet of robots.

And they can be configured just like any other software, using APIs and command lines, orchestrated someday by Opus 9.0. And if we have this physical API, we'll be able to realize lights-out factories. Those are essentially printers of atoms. They take as input designs in markdown files and output fully assembled products, completely autonomously. Or these wet labs that automate scientific discoveries in chemistry, biology, and medicine. And the final one, physical auto research: robots start to design, improve, and build the next iteration of themselves, far beyond what's humanly possible. So you might ask, is this just science fiction?

Like, are we going to see this in our lifetime? Well, it took the AI community 14 years to go from the first forward pass of AlexNet in 2012, a model that barely recognized cat versus dog, to AI Ascent today, 2026, where we talk about agentic auto research. Thank you. And let's just add another 14 years. How about that? 2026 is right in the middle of 2012 and 2040. And technology does not advance linearly. It advances exponentially. So I can say with 95% certainty that we'll get to the end of the end game, the end of the technology tree, by 2040. And we'll all still be young.

If you believe in robotics, robotics will believe in you. And to all of us sitting here: I think our generation was born too late to explore the Earth and too early to explore the stars, but we were born just in time to solve robotics. Thank you.