3: DreamDojo — Teaching Robots to Dream
Researchers from UC Berkeley, NVIDIA, and UT Austin introduce DreamDojo, a framework that teaches robots physical skills by learning from large-scale human videos. Instead of expensive robot-specific data, DreamDojo distills 5 years of human video into a generalist robot world model that transfers to robots with minimal fine-tuning.
Show Notes
Episode 003: DreamDojo
Why it matters. Teaching robots to manipulate objects in the real world requires vast amounts of training data that is expensive and dangerous to collect. "DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos" takes a different path: train a world model on 44,000 hours of egocentric human video — people cooking, cleaning, assembling, and manipulating objects in daily life — and then transfer that understanding of physics and dexterous control to robots with minimal fine-tuning. The key innovation is continuous latent actions that bridge the gap between unlabeled human video and precise robot commands. After distillation, DreamDojo runs at real-time speed (10.81 FPS at 640×480), enabling live teleoperation, policy evaluation without real-world deployment, and model-based planning. Thirty authors across nine institutions built this — it is a statement paper about the future of general-purpose robot world models.
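For listeners who want a concrete picture of the latent-action idea, here is a minimal PyTorch sketch of how a continuous latent action can be inferred from two consecutive video frames and then used to condition a next-frame world model. Everything in it is an illustrative assumption on our part, not DreamDojo's actual architecture: the module names, layer sizes, and the 7-DoF command mapping are made up for exposition.

# Minimal sketch of the continuous-latent-action idea (illustrative only; module
# names, sizes, and training details are assumptions, not DreamDojo's code).
import torch
import torch.nn as nn

class LatentActionEncoder(nn.Module):
    """Infers a continuous 'latent action' from two consecutive video frames.

    Because the latent is derived purely from pixels, unlabeled human video
    can supervise the world model without any robot action labels.
    """
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, kernel_size=4, stride=2, padding=1),  # stacked frame pair (2 x RGB)
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, latent_dim),
        )

    def forward(self, frame_t: torch.Tensor, frame_t1: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([frame_t, frame_t1], dim=1))

class WorldModel(nn.Module):
    """Predicts the next frame from the current frame plus a latent action."""
    def __init__(self, latent_dim: int = 32):
        super().__init__()
        self.action_proj = nn.Linear(latent_dim, 16 * 16)
        self.decoder = nn.Sequential(
            nn.Conv2d(3 + 1, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(32, 3, kernel_size=3, padding=1),
        )

    def forward(self, frame_t: torch.Tensor, latent_action: torch.Tensor) -> torch.Tensor:
        b, _, h, w = frame_t.shape
        # Broadcast the latent action over the image grid and decode the next frame.
        act_map = self.action_proj(latent_action).view(b, 1, 16, 16)
        act_map = nn.functional.interpolate(act_map, size=(h, w), mode="nearest")
        return self.decoder(torch.cat([frame_t, act_map], dim=1))

# Pretraining on human video: learn dynamics with no action labels.
encoder, world_model = LatentActionEncoder(), WorldModel()
frame_t, frame_t1 = torch.rand(4, 3, 64, 64), torch.rand(4, 3, 64, 64)
z = encoder(frame_t, frame_t1)                      # continuous latent action
loss = nn.functional.mse_loss(world_model(frame_t, z), frame_t1)

# Fine-tuning on a small robot dataset: map robot commands into the same latent
# action space so the pretrained dynamics transfer to precise robot control.
robot_cmd_to_latent = nn.Linear(7, 32)              # e.g. a 7-DoF command (assumed)
robot_cmd = torch.rand(4, 7)
pred_frame = world_model(frame_t, robot_cmd_to_latent(robot_cmd))

The point of the sketch: pretraining needs only raw video frame pairs, while fine-tuning only has to learn a small mapping from real robot commands into the latent action space that the human-video pretraining already shaped.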
NVIDIA. NVIDIA Research is the driving force behind DreamDojo, which builds on NVIDIA's Cosmos-Predict2.5 world foundation model platform. The paper is available on arXiv (2602.06949) with a project page showcasing video results. The GitHub organization currently hosts only the project website; model code has not yet been released. The DreamDojo-HV dataset of 44,000 hours represents the largest video dataset ever assembled for world model pretraining, spanning 6,015 skills across 1.1 million scenes — 15× longer, 96× more diverse, and 2,000× more scenes than any prior dataset.
NVIDIA's Leaders. Linxi "Jim" Fan (Google Scholar), project lead and head of NVIDIA's AI Agents initiative, earned his PhD at Stanford under Fei-Fei Li and was OpenAI's very first intern in 2016. He created Voyager, the first LLM agent to play Minecraft, and won the NeurIPS 2022 Outstanding Paper Award for MineDojo. Ming-Yu Liu (Google Scholar), Vice President of Research at NVIDIA and IEEE Fellow, leads the Deep Imagination Research group and created GauGAN, SPADE, pix2pixHD, and NVIDIA Cosmos. Yuke Zhu, project lead and Associate Professor at UT Austin, built robosuite and robomimic, benchmarks the entire robot learning community relies on.
UC Berkeley. Two of the field's most towering figures anchor the academic side. Jitendra Malik (Wikipedia, Google Scholar) is the Arthur J. Chick Professor of EECS at UC Berkeley, a member of both the National Academy of Sciences and the National Academy of Engineering, and a pioneer of texture analysis, image segmentation (Normalized Cuts), and object recognition — his academic family tree spans nearly every major computer vision lab in the world. Pieter Abbeel (Wikipedia, Google Scholar) is a Professor of EECS at UC Berkeley and co-founder of Covariant, one of the leading robotics AI companies, with foundational contributions to apprenticeship learning, deep reinforcement learning, and robotic helicopter acrobatics. Co-first author William Liang is a UC Berkeley PhD student and NVIDIA research intern.
HKUST and NVIDIA Research. Co-first author Shenyuan Gao is a PhD student at HKUST and an NVIDIA research intern whose prior work on AdaWorld pioneered the latent-action approach that DreamDojo scales up. Joel Jang serves as a project lead from NVIDIA Research.
Daily Tech Feed: From the Labs is available on Apple Podcasts, Spotify, and wherever fine podcasts are distributed. Visit us at pod.c457.org for all our shows. New episodes daily.