Training a general-purpose robot remains a major challenge in robotics. Engineers typically collect data that is specific to a certain robot and task, which they use to train the robot in a controlled environment.
However, gathering this data is costly and time-consuming, and the robot will likely struggle to adapt to environments or tasks it hasn't seen before.
To train better general-purpose robots, MIT researchers developed a technique that combines a huge amount of heterogeneous data from many sources into one system that can teach any robot a wide range of tasks.
Their method aligns data from varied domains, such as simulations and real robots, and from multiple modalities, including vision sensors and robotic arm position encoders, into a shared “language” that a generative AI model can process.
Because it pools so much data, this approach can be used to train a robot to perform a variety of tasks without starting its training from scratch each time.
This method could be faster and less expensive than conventional techniques because it requires far less task-specific data. In addition, it outperformed training from scratch by more than 20 percent in both simulation and real-world experiments.
“In robotics, people often claim that we don’t have enough training data. But in my view, another big problem is that the data come from so many different domains, modalities, and robot hardware. Our work shows how you’d be able to train a robot with all of them put together,” said Lirui Wang, an electrical engineering and computer science (EECS) graduate student and lead author of a paper on this technique.
Wang’s co-authors include fellow EECS graduate student Jialiang Zhao; Xinlei Chen, a research scientist at Meta; and senior author Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL).
Inspired by LLMs
A robotic “policy” takes in sensor observations, such as camera images or proprioceptive measurements that track the speed and position of a robotic arm, and then tells the robot how and where to move.
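In code, a policy is essentially a function from observations to actions. The sketch below is purely illustrative (the names Observation and Policy are hypothetical, not from the researchers' code):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Observation:
    camera_rgb: np.ndarray       # e.g., an image from a scene or wrist camera
    proprioception: np.ndarray   # e.g., joint angles and velocities of the arm

class Policy:
    """Maps an observation to an action command for the robot."""
    def act(self, obs: Observation) -> np.ndarray:
        # e.g., return target joint positions or an end-effector displacement
        raise NotImplementedError
```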
Traditionally, policies are trained with imitation learning: a human demonstrates actions or teleoperates a robot to generate data, which is fed into an AI model that learns the policy. Because this method relies on a small amount of task-specific data, robots often fail when their environment or task changes.
To develop a better approach, Wang and his collaborators drew inspiration from large language models like GPT-4.
These models are pretrained on an enormous amount of diverse language data and then fine-tuned with a small amount of task-specific data. Pretraining on so much data helps the models adapt to perform well on a variety of tasks.
“In the language domain, the data are all just sentences. In robotics, given all the heterogeneity in the data, if you want to pretrain in a similar manner, we need a different architecture,” he said.
Robotic data take many forms, from camera images to text instructions to depth maps. Each robot is also mechanically unique, with a different number and orientation of arms, grippers, and sensors. And the environments where data are collected vary widely.
The MIT researchers developed a new architecture called Heterogeneous Pretrained Transformers (HPT) that unifies data from these varied modalities and domains.
At the center of the architecture is a machine-learning model known as a transformer, which processes vision and proprioception inputs. The same type of model forms the backbone of many large language models.
The researchers align data from vision and proprioception into the same type of input, called a token, which the transformer can process. Each input is represented with the same fixed number of tokens.
The transformer then maps all the inputs into one shared space, growing into a huge, pretrained model as it processes and learns from more data. The larger the transformer becomes, the better it performs.
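A minimal sketch of this idea in PyTorch, with hypothetical sizes and module names rather than the authors' released HPT code: modality-specific “stems” map each input to the same fixed number of tokens, and a single shared transformer trunk processes the concatenated tokens.

```python
import torch
import torch.nn as nn

NUM_TOKENS, DIM = 16, 256  # hypothetical sizes; every input becomes NUM_TOKENS tokens

class Stem(nn.Module):
    """Maps a variable-length feature sequence (image patches, proprioceptive
    readings, ...) to a fixed number of tokens via cross-attention with
    learnable queries -- one common way to standardize heterogeneous inputs."""
    def __init__(self, in_dim: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, NUM_TOKENS, DIM))
        self.proj = nn.Linear(in_dim, DIM)
        self.attn = nn.MultiheadAttention(DIM, num_heads=8, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:   # feats: (B, L, in_dim)
        kv = self.proj(feats)
        q = self.queries.expand(feats.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)                       # (B, NUM_TOKENS, DIM)
        return tokens

class SharedTrunk(nn.Module):
    """One transformer shared across all robots, modalities, and datasets."""
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(DIM, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)

    def forward(self, vision_tokens, proprio_tokens):
        x = torch.cat([vision_tokens, proprio_tokens], dim=1)
        return self.encoder(x)   # a shared representation of all inputs
```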
A user only needs to give HPT a small amount of data about their robot’s design, configuration, and the task they want it to perform. HPT then transfers the knowledge the transformer gained during pretraining to learn the new task.
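Continuing the hedged sketch above, adapting to a new robot might look like attaching a small action head (and, if needed, a new stem) to the pretrained trunk and training it on a modest amount of task-specific demonstrations; whether the trunk is frozen or fine-tuned is a design choice, not something specified here.

```python
class ActionHead(nn.Module):
    """Small task-specific head that turns the trunk's output into an action."""
    def __init__(self, action_dim: int):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(DIM, DIM), nn.ReLU(),
                                 nn.Linear(DIM, action_dim))

    def forward(self, trunk_out: torch.Tensor) -> torch.Tensor:  # (B, T, DIM)
        return self.mlp(trunk_out.mean(dim=1))  # pool tokens, predict one action

trunk = SharedTrunk()              # in practice, load pretrained weights here
for p in trunk.parameters():
    p.requires_grad = False        # one option: keep the shared trunk frozen
head = ActionHead(action_dim=7)    # e.g., a 7-DoF arm command
optimizer = torch.optim.Adam(head.parameters(), lr=1e-4)
```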
Enabling dexterous motions
One of the biggest challenges in developing HPT was building the massive dataset used to pretrain the transformer, which includes 52 datasets with more than 200,000 robot trajectories in four categories, including human demonstration videos and simulations.
The researchers also needed an efficient way to turn raw proprioception signals from an array of sensors into data the transformer can handle.
“Proprioception is key to enable a lot of dexterous motions. Because the number of tokens in our architecture is always the same, we place the same importance on proprioception and vision,” Wang explained.
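As an illustration of the equal-token treatment Wang describes, reusing the hypothetical Stem from the sketch above with made-up dimensions:

```python
vision_feats  = torch.randn(2, 196, 768)  # e.g., 196 image-patch features per frame
proprio_feats = torch.randn(2, 14, 1)     # e.g., 14 joint readings, one value each

vision_tokens  = Stem(in_dim=768)(vision_feats)   # -> (2, 16, 256)
proprio_tokens = Stem(in_dim=1)(proprio_feats)    # -> (2, 16, 256), same token count
```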
When they tested HPT, it improved robot performance by more than 20 percent in both simulation and real-world tasks, compared with training from scratch. Even when a task was very different from the pretraining data, HPT still improved performance.
“This paper presents a novel approach to training a single policy across multiple robot embodiments. This enables training across diverse datasets, allowing robotic learning methodologies to substantially scale up the extent of datasets they can employ. It also facilitates the model’s rapid adaptation to new robot designs, which is crucial in an era of continuous advancements in robotic technology,” stated David Held, associate professor at the Carnegie Mellon University Robotics Institute, who was not involved with this research.
In the future, the researchers want to study how data diversity could boost HPT’s performance. They also want to enhance HPT so it can process unlabeled data, as GPT-4 and other large language models do.
“Our dream is to have a universal robot brain that you could download and utilize for your robot without any prior training whatsoever. While we are still in the nascent stages of this endeavor, we are committed to pushing forward, hoping that scaling will lead to breakthroughs in robotic policies, similar to the advancements seen in large language models,” he concluded.