General-Purpose Robot Models Analysis
Below is an overview of several recent works on general-purpose robot models. It compares key technical aspects (model architecture, parameter scale, training data volume, and major innovations) and outlines the typical hardware and time requirements for training, fine-tuning, or distillation.
Note: This analysis is accurate as of its last modified date, Mar 28, 2025.
Comparison of Robot Models
Below is a table summarizing 10 projects, including their model architecture/method, parameter scale, training data volume, and key features/remarks. (Note: For some projects, specific numbers such as parameter counts or data volumes are not disclosed; descriptive indicators are provided instead.)
Project Name | Model Architecture / Method | Parameter Scale | Training Data / Data Volume | Key Features / Remarks |
---|---|---|---|---|
1. Octo (Website) | Transformer-based diffusion policy supporting conditioning on language, goal images, and sensor history | Octo-Small: 27M; Octo-Base: 93M | Pre-trained on 800k robot demonstrations drawn from 25 datasets (Open X-Embodiment) | Flexibly adaptable to different robots and sensors; efficient fine-tuning; strong performance in zero-shot and few-shot scenarios (a toy sketch of a diffusion action head follows this table) |
2. OpenVLA (Website) | Fuses a visual encoder (SigLIP + DINOv2) with a Llama 2 7B language model to generate action tokens | 7B | Pre-trained on 970k robot demonstrations (Open X-Embodiment) | Leverages internet-scale vision-language pre-training; supports multi-robot control; resource-intensive training (64 A100 GPUs, 15 days) |
3. UMI (Website) | Data collection and policy learning framework based on a handheld gripper and wrist-mounted camera for in-the-wild demonstrations | Not disclosed | Rapid in-the-wild demonstration capture (approx. 30 seconds per demo), with high data diversity | Low-cost, portable hardware design; enables zero-calibration and bimanual dynamic manipulation; focuses on demonstration data collection |
4. RDT-1B (Website) | Transformer policy based on diffusion models, specifically designed for bimanual manipulation | 1.2B | Pre-trained on 46 datasets with over 1M demonstrations; additional 6K+ bimanual demonstrations | Large-scale pre-training, multi-task cross-robot capability; excellent zero-shot generalization and few-shot learning ability |
5. openpi (GitHub) | Consists of π₀ (a flow-based diffusion VLA) and π₀-FAST (an autoregressive VLA) for vision-language-action tasks | Not explicitly disclosed | Pre-trained on 10k+ hours of robot data | Provides multiple base model checkpoints; easy fine-tuning for downstream tasks; adaptable to various robot platforms |
6. Mobile ALOHA (Website) | Imitation learning (behavior cloning) based mobile manipulation system combining whole-body control and low-cost teleoperation | Not disclosed | Approximately 50 demonstrations per task, jointly trained with a static ALOHA dataset | Extends traditional ALOHA to mobile platforms; enables complex mobile manipulation tasks (e.g., opening doors, using elevators) |
7. RT-2 (Website) | Vision-language-action model that encodes robot actions as text tokens, combining internet pre-training with robot data | Based on PaLM-E (12B) or PaLI-X (55B) | Mixed large-scale internet vision-language data and robot trajectory data (exact numbers undisclosed) | Utilizes a pre-trained large model’s semantic understanding and reasoning; enables multi-step task planning and coherent execution; strong generalization |
8. VIMA (Website) | Transformer-based robotic agent that generates actions through multimodal prompts (language, image/video) | 2M - 200M (depending on variant) | Over 600K expert demonstrations; supplemented with large amounts of programmatically generated task data | Data-efficient; unified representation for various tasks; exhibits good zero-shot generalization and cross-task adaptability |
9. Perceiver-Actor (Website) | Behavior-cloning policy based on a Perceiver Transformer, using voxelized RGB-D input and discretized action prediction | Not disclosed (relatively lightweight) | Requires relatively few demonstrations (e.g., RLBench tasks with 249 variations, plus 7 real-world tasks learned from approx. 53 demos) | Data-efficient learning for 6-DoF manipulation; suitable for few-shot multi-task scenarios; high-performance action detection |
10. SayCan (Website) | Integrates a large language model with pre-trained skills and their value functions; combines LLM task scoring with affordance values (estimated execution success) for task planning | Based on an LLM (e.g., PaLM) with parameters up to tens of billions; skill modules are smaller | Utilizes large-scale internet text and robot skill demonstration data (exact figures undisclosed) | Achieves long-horizon task planning and semantic reasoning; composes multi-step skills; supports multilingual instructions; improves execution success rate |
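Several of the models above (Octo, RDT-1B, π₀) pair a transformer backbone with a diffusion- or flow-style action head: the backbone encodes images and language into an embedding, and the head iteratively denoises a chunk of future actions conditioned on that embedding. The toy PyTorch sketch below illustrates only this sampling idea; the network, dimensions, and crude update rule are invented for illustration and are not taken from any of the released implementations.

```python
import torch
import torch.nn as nn

class ToyDiffusionActionHead(nn.Module):
    """Minimal denoiser: predicts the noise in an action chunk given an
    observation embedding and the diffusion timestep (toy example only)."""
    def __init__(self, obs_dim=256, act_dim=7, horizon=4, hidden=512):
        super().__init__()
        self.act_dim, self.horizon = act_dim, horizon
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim * horizon + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, act_dim * horizon),
        )

    def forward(self, obs_emb, noisy_actions, t):
        # t: (batch, 1) tensor of normalized diffusion timesteps.
        x = torch.cat([obs_emb, noisy_actions.flatten(1), t], dim=-1)
        return self.net(x)

@torch.no_grad()
def sample_action_chunk(head, obs_emb, n_steps=20):
    """Very simplified reverse process: start from Gaussian noise and
    repeatedly subtract the predicted noise (not a faithful DDPM schedule)."""
    b = obs_emb.shape[0]
    actions = torch.randn(b, head.horizon, head.act_dim)
    for step in reversed(range(n_steps)):
        t = torch.full((b, 1), step / n_steps)
        eps = head(obs_emb, actions, t).view_as(actions)
        actions = actions - eps / n_steps  # crude denoising update
    return actions  # (batch, horizon, act_dim) chunk of future actions

# Usage: obs_emb would come from the transformer over image + language tokens.
head = ToyDiffusionActionHead()
obs_emb = torch.randn(2, 256)             # stand-in for the backbone readout
chunk = sample_action_chunk(head, obs_emb)
print(chunk.shape)                        # torch.Size([2, 4, 7])
```

Predicting a short chunk of future actions, rather than a single step, is what lets these policies produce smooth, temporally consistent motion at control time.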
Hardware and Time Requirements for Training, Fine-Tuning, or Distillation
The following table outlines typical hardware devices and approximate training times for various stages, such as pre-training (full training), fine-tuning (or parameter-efficient fine-tuning), and distillation. Actual requirements vary depending on model scale, data volume, training strategy (e.g., full vs. parameter-efficient fine-tuning), and task specifics.
Project Name | Pre-training / Full Training (Hardware & Time) | Fine-Tuning / Parameter-Efficient Fine-Tuning / Distillation (Hardware & Time) | Remarks |
---|---|---|---|
1. Octo (Website) | Pre-training on 800k demos typically requires multiple high-performance GPUs (e.g., A100/RTX 4090); training time: several days to weeks | Fine-tuning with efficient strategies can often be completed on a single GPU in a few hours to one day | Adaptable to different robots and sensors; fine-tuning time is relatively short |
2. OpenVLA (Website) | Pre-training used 64 A100 GPUs for about 15 days | Task-specific fine-tuning with parameter-efficient methods usually takes a few hours to one day on a single GPU | Leverages large-scale internet pre-training; resource-intensive |
3. UMI (Website) | Focused on data collection and policy learning; training can be done on low-cost GPUs (even a single card); training time: on the order of a few hours | Fine-tuning for specific tasks (using fast demonstration capture) typically completes within hours | Portable hardware design; suited to in-the-wild demonstration data |
4. RDT-1B (Website) | Pre-training a 1.2B-parameter model generally requires a multi-GPU cluster (e.g., 8-16 A100 GPUs); training time: possibly over a week | Fine-tuning on specific bimanual tasks (using the additional 6K+ demos) may take from a few hours to one day | Large parameter scale and rich data; high resource and time demands for pre-training |
5. openpi (GitHub) | Full training on 10k+ hours of robot data may require high-memory GPUs (e.g., A100/H100) | Parameter-efficient fine-tuning (e.g., using LoRA; see the sketch after this table) typically requires at least 22.5GB of GPU memory (e.g., RTX 4090), with training times from a few hours to a few days | Offers both full training and efficient fine-tuning options; hardware requirements are clearly documented |
6. Mobile ALOHA (Website) | Imitation learning methods typically train on a single high-end GPU (e.g., RTX 3090/4090); training time: several hours to one day | Fine-tuning with combined static ALOHA data generally completes on a single GPU in a short period | Focuses on mobile and whole-body control; relatively small data volume |
7. RT-2 (Website) | Large models (12B-55B parameters) require extensive GPU clusters (e.g., 64 A100 GPUs); pre-training time: typically several weeks | Fine-tuning for specific tasks with joint training strategies may take from a few hours to one day, depending on data volume | Combines internet-scale pre-training with robot data; high hardware and time requirements |
8. VIMA (Website) | Model sizes range from a few million to several hundred million parameters; smaller variants can be trained on a single GPU in hours, while larger variants may need multiple GPUs for days to a week | Fine-tuning is typically done on a single GPU or a small multi-GPU setup; high data efficiency can greatly reduce training time | Model and data scale are adjustable, making fine-tuning flexible |
9. Perceiver-Actor (Website) | Thanks to voxelized inputs and discrete action prediction, training can often be done on a single GPU (8-16GB memory); training time: typically several hours to one day | Fine-tuning for few-shot scenarios is highly efficient, often completing within a few hours | Emphasizes data-efficient learning for 6-DoF manipulation; suitable for low-resource environments |
10. SayCan (Website) | Integrates a large language model (e.g., the PaLM series) with robot skills; pre-training typically uses TPUs or large-scale GPU clusters; pre-training time may span several weeks | Fine-tuning or distillation for specific scenarios is typically carried out on multi-GPU or TPU setups, taking from a few hours to one day | Combines semantic reasoning with low-level skills; pre-training is resource-intensive, but fine-tuning can leverage improvements to the underlying LLM |
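Several rows above cite parameter-efficient fine-tuning (e.g., LoRA) as what makes single-GPU adaptation of these large policies feasible. The sketch below shows the general pattern with Hugging Face peft, using a Llama 2 7B backbone as a stand-in (the same language backbone OpenVLA builds on); the checkpoint name, target modules, and hyperparameters here are illustrative assumptions, so consult each project's own fine-tuning scripts for the supported workflow.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in backbone for illustration; a real workflow would load the
# project's released VLA checkpoint instead.
BASE = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=32,                                 # adapter rank (assumed value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # typically well under 1% of the 7B weights

# Training then proceeds as usual (e.g., with transformers' Trainer) on
# action-token sequences; only the adapter weights receive gradients.
```

Because only the low-rank adapters (and their optimizer state) are updated, memory is dominated by the frozen bf16 weights plus activations, which is roughly why a 24GB-class GPU can suffice for LoRA fine-tuning where full fine-tuning of the same model would not.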