General-Purpose Robot Models Analysis

This document provides an overview of several recent works on general-purpose robot models. It compares key technical aspects, such as model architecture, parameter scale, training data volume, and major innovations, and outlines the typical hardware and time requirements for training, fine-tuning, or distillation.

Note: This analysis is accurate as of its last modified date, Mar 28, 2025.

Comparison of Robot Models

Below is a table summarizing the 10 projects, covering their model architecture/method, parameter scale, training data volume, and key features/remarks. (Note: for some projects, specific numbers such as parameter counts or data volumes are not disclosed; descriptive indicators are given instead.) Short illustrative sketches of two recurring techniques, action tokenization and diffusion-based action sampling, follow the table.

| Project Name | Model Architecture / Method | Parameter Scale | Training Data / Data Volume | Key Features / Remarks |
| --- | --- | --- | --- | --- |
| 1. Octo (Website) | Transformer-based diffusion policy supporting language, target-image, and sensor-history conditioning | Octo-Small: 27M; Octo-Base: 93M | Pre-trained on 800k robot demonstrations drawn from 25 datasets (Open X-Embodiment) | Flexibly adaptable to different robots and sensors; efficient fine-tuning; strong zero-shot and few-shot performance |
| 2. OpenVLA (Website) | Fuses a visual encoder (SigLIP + DINOv2) with a Llama 2 7B language model to generate action tokens | 7B | Pre-trained on 970k robot demonstrations (Open X-Embodiment) | Leverages internet-pre-trained vision-language knowledge; supports multi-robot control; resource-intensive training (64 A100 GPUs, 15 days) |
| 3. UMI (Website) | Data-collection and policy-learning framework based on a handheld gripper and wrist-mounted camera for in-the-wild demonstrations | Not disclosed | Rapid in-the-wild demonstration capture (approx. 30 seconds per demo) with high data diversity | Low-cost, portable hardware design; enables zero-calibration and bimanual dynamic manipulation; focuses on demonstration data collection |
| 4. RDT-1B (Website) | Diffusion-based Transformer policy designed specifically for bimanual manipulation | 1.2B | Pre-trained on 46 datasets with over 1M demonstrations, plus 6K+ additional bimanual demonstrations | Large-scale pre-training with multi-task, cross-robot capability; excellent zero-shot generalization and few-shot learning |
| 5. openpi (GitHub) | Consists of π₀ (streaming diffusion VLA) and π₀-FAST (autoregressive VLA) for vision-language-action tasks | Not explicitly disclosed | Pre-trained on 10k+ hours of robot data | Provides multiple base-model checkpoints; easy fine-tuning for downstream tasks; adaptable to various robot platforms |
| 6. Mobile ALOHA (Website) | Imitation-learning (behavior-cloning) mobile manipulation system combining whole-body control and low-cost teleoperation | Not disclosed | Approximately 50 demonstrations per task, co-trained with a static ALOHA dataset | Extends the original ALOHA to a mobile platform; enables complex mobile manipulation tasks (e.g., opening doors, using elevators) |
| 7. RT-2 (Website) | Vision-language-action model that encodes robot actions as text tokens, combining internet pre-training with robot data | Based on PaLM-E (12B) or PaLI-X (55B) | Mixed large-scale internet vision-language data and robot trajectory data (exact numbers undisclosed) | Inherits a pre-trained large model's semantic understanding and reasoning; enables multi-step task planning and coherent execution; strong generalization |
| 8. VIMA (Website) | Transformer-based robotic agent that generates actions from multimodal prompts (language, images/video) | 2M-200M (depending on variant) | Over 600K expert demonstrations, supplemented with large amounts of programmatically generated task data | Data-efficient; unified representation across tasks; good zero-shot generalization and cross-task adaptability |
| 9. Perceiver-Actor (Website) | Behavior-cloning policy built on a Perceiver Transformer, using voxelized RGB-D input and discretized action prediction | Not disclosed (relatively lightweight) | Relatively few demonstrations (e.g., approx. 53 demos across RLBench tasks with 249 variants and 7 real-world tasks) | Data-efficient learning for 6-DoF manipulation; suitable for few-shot multi-task scenarios; high-performance action detection |
| 10. SayCan (Website) | Integrates a large language model with pre-trained skill/value functions; combines language-model scoring with execution probabilities for task planning | LLM backbone (e.g., PaLM) with up to tens of billions of parameters; skill modules are much smaller | Large-scale internet text plus robot skill demonstration data (exact figures undisclosed) | Long-horizon task planning and semantic reasoning; composes multi-step skills; supports multilingual instructions; improves execution success rate |

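Several of the models above (RT-2 and OpenVLA in particular) treat continuous robot actions as discrete tokens so that a language-model backbone can predict them with ordinary next-token prediction. The snippet below is a minimal, generic sketch of that idea using uniform binning; the bin count, action ranges, and 7-DoF example are illustrative assumptions, not the configuration of any project listed here.

```python
import numpy as np

NUM_BINS = 256  # illustrative vocabulary size; real systems pick their own

def actions_to_tokens(actions, low, high, num_bins=NUM_BINS):
    """Map continuous actions (shape [..., action_dim]) to integer bin indices.

    Each dimension is clipped to [low, high] and uniformly discretized,
    mirroring the "actions as text tokens" idea described for RT-2/OpenVLA.
    """
    actions = np.clip(actions, low, high)
    normalized = (actions - low) / (high - low)            # -> [0, 1]
    return np.round(normalized * (num_bins - 1)).astype(np.int64)

def tokens_to_actions(tokens, low, high, num_bins=NUM_BINS):
    """Invert the discretization back to approximate continuous actions."""
    normalized = tokens.astype(np.float64) / (num_bins - 1)
    return low + normalized * (high - low)

# Example: a 7-DoF action (6 end-effector deltas + gripper) normalized to [-1, 1].
low, high = -1.0, 1.0
action = np.array([0.12, -0.43, 0.05, 0.0, 0.9, -0.88, 1.0])
tokens = actions_to_tokens(action, low, high)
recovered = tokens_to_actions(tokens, low, high)
print(tokens, np.abs(recovered - action).max())  # error is at most half a bin width
```

In a VLA model these bin indices are then offset into (or appended to) the language model's vocabulary, so predicting an action chunk reduces to decoding a short token sequence.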

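Octo, RDT-1B, and π₀, by contrast, generate actions with diffusion-style decoders rather than token-by-token decoding. As a rough intuition for what happens at inference time, the sketch below runs a textbook DDPM reverse-diffusion loop over a chunk of future actions conditioned on an observation embedding. The `denoiser` network, linear noise schedule, horizon, and action dimension are placeholder assumptions and do not reflect any listed project's actual configuration.

```python
import torch

@torch.no_grad()
def sample_action_chunk(denoiser, obs_embedding, horizon=16, action_dim=7, num_steps=50):
    """Generic DDPM-style reverse diffusion over a chunk of future actions.

    `denoiser(noisy_actions, t, obs_embedding)` is assumed to predict the noise
    added at step t; this is a textbook sampler, not any project's code.
    """
    betas = torch.linspace(1e-4, 2e-2, num_steps)       # simple linear schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    actions = torch.randn(1, horizon, action_dim)        # start from pure noise
    for t in reversed(range(num_steps)):
        predicted_noise = denoiser(actions, torch.tensor([t]), obs_embedding)
        # Posterior mean update, then add fresh noise on all but the final step.
        actions = actions - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * predicted_noise
        actions = actions / torch.sqrt(alphas[t])
        if t > 0:
            actions = actions + torch.sqrt(betas[t]) * torch.randn_like(actions)
    return actions  # shape: (1, horizon, action_dim)
```

The practical appeal for manipulation is that the policy emits a whole action chunk at once and can represent multimodal action distributions, at the cost of running the denoiser several times per control step.
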
Hardware and Time Requirements for Training, Fine-Tuning, or Distillation

The following table outlines typical hardware and approximate training times for stages such as pre-training (full training), fine-tuning (or parameter-efficient fine-tuning), and distillation. Actual requirements vary with model scale, data volume, training strategy (e.g., full vs. parameter-efficient fine-tuning), and task specifics. A rough memory-estimation sketch follows the table.

| Project Name | Pre-Training / Full Training (Hardware & Time) | Fine-Tuning / Parameter-Efficient Fine-Tuning / Distillation (Hardware & Time) | Remarks |
| --- | --- | --- | --- |
| 1. Octo (Website) | Pre-training on 800k demos typically requires multiple high-performance GPUs (e.g., A100/RTX 4090); training time: several days to weeks | Fine-tuning with efficient strategies can often be completed on a single GPU in a few hours to one day | Adaptable to different robots and sensors; fine-tuning time is relatively short |
| 2. OpenVLA (Website) | Pre-training used 64 A100 GPUs for about 15 days | Task-specific, parameter-efficient fine-tuning usually takes a few hours to one day on a single GPU | Leverages large-scale internet pre-training; resource-intensive |
| 3. UMI (Website) | Focused on data collection and policy learning; training can be done on low-cost GPUs (even a single card); training time on the order of a few hours | Fine-tuning for specific tasks (using fast demonstration capture) typically completes within hours | Portable hardware design; well suited to in-the-wild demonstration data |
| 4. RDT-1B (Website) | Pre-training a 1.2B-parameter model generally requires a multi-GPU cluster (e.g., 8-16 A100 GPUs); training time possibly over a week | Fine-tuning on specific bimanual tasks (using the additional 6K+ demos) may take a few hours to one day | Large parameter scale and rich data; high resource and time demands for pre-training |
| 5. openpi (GitHub) | Full training on 10k+ hours of robot data may require high-memory GPUs (e.g., A100/H100) | Parameter-efficient fine-tuning (e.g., LoRA) typically requires at least 22.5 GB of GPU memory (e.g., RTX 4090), with training times from a few hours to a few days | Offers both full-training and efficient fine-tuning options; hardware requirements are clearly documented |
| 6. Mobile ALOHA (Website) | Imitation-learning methods typically train on a single high-end GPU (e.g., RTX 3090/4090); training time: several hours to one day | Fine-tuning with the combined static ALOHA data generally completes on a single GPU in a short period | Focuses on mobile manipulation and whole-body control; relatively small data volume |
| 7. RT-2 (Website) | Large models (12B-55B parameters) require extensive GPU clusters (e.g., 64 A100 GPUs); pre-training time: typically several weeks | Task-specific fine-tuning with joint-training strategies may take a few hours to one day, depending on data volume | Combines internet-scale pre-training with robot data; high hardware and time requirements |
| 8. VIMA (Website) | Model sizes range from a few million to several hundred million parameters; smaller variants train on a single GPU in hours, larger variants may need multiple GPUs for days to a week | Fine-tuning is typically done on a single GPU or a small multi-GPU setup; high data efficiency can greatly reduce training time | Model and data scale are adjustable, making fine-tuning flexible |
| 9. Perceiver-Actor (Website) | Thanks to voxelized inputs and discrete action prediction, training can often be done on a single GPU (8-16 GB memory); training time: typically several hours to one day | Fine-tuning for few-shot scenarios is highly efficient, often completing within a few hours | Emphasizes data-efficient learning for 6-DoF manipulation; suitable for low-resource environments |
| 10. SayCan (Website) | Integrates a large language model (e.g., the PaLM series) with robot skills; pre-training typically uses TPUs or large-scale GPU clusters and may span several weeks | Fine-tuning or distillation for specific scenarios is typically carried out on multi-GPU or TPU setups, taking a few hours to one day | Combines semantic reasoning with low-level skills; pre-training is resource-intensive, but fine-tuning benefits from improvements to the underlying LLM |
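
To put the hardware figures above on a slightly more quantitative footing, a common back-of-the-envelope check is to estimate training memory from parameter count and training mode. The sketch below compares full fine-tuning with a LoRA-style setup for a 7B-parameter policy (OpenVLA-scale); the byte counts are standard rules of thumb (16-bit weights, fp32 Adam states) and the activation term is a coarse placeholder, so treat the outputs as order-of-magnitude estimates rather than measured requirements.

```python
def estimate_training_memory_gb(num_params,
                                trainable_fraction=1.0,
                                bytes_per_weight=2,            # bf16/fp16 weights
                                optimizer_bytes_per_param=12,  # fp32 master copy + Adam m and v
                                activation_overhead_gb=10.0):  # coarse placeholder for activations
    """Rule-of-thumb GPU memory estimate for training a transformer policy.

    Weights are kept for the whole model, but gradients and optimizer states
    are only needed for the trainable parameters, which is why LoRA-style
    fine-tuning fits on a single high-memory GPU.
    """
    weights = num_params * bytes_per_weight
    grads = num_params * trainable_fraction * bytes_per_weight
    optimizer = num_params * trainable_fraction * optimizer_bytes_per_param
    return (weights + grads + optimizer) / 1e9 + activation_overhead_gb

# 7B-parameter model: full fine-tuning vs. ~1% of parameters trainable (LoRA-style).
print(f"full fine-tuning: ~{estimate_training_memory_gb(7e9):.0f} GB")        # ~122 GB -> multi-GPU
print(f"LoRA-style:       ~{estimate_training_memory_gb(7e9, 0.01):.0f} GB")  # ~25 GB -> single large GPU
```

The ~25 GB LoRA-style estimate lands in the same ballpark as the 22.5 GB figure quoted for openpi's parameter-efficient fine-tuning above, which is why multi-billion-parameter policies can be adapted on a single RTX 4090-class card while full pre-training still calls for A100/H100 clusters.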