General-Purpose Robot Models Analysis
Below is an overview of several recent works on general-purpose robot models. It compares key technical aspects (model architecture, parameter scale, training data volume, and major innovations) and outlines the typical hardware and time requirements for training, fine-tuning, or distillation.
Note: This analysis is accurate as of its last modified date, Mar 28, 2025.
Comparison of Robot Models
Below is a table summarizing 10 projects, including their model architecture/method, parameter scale, training data volume, and key features/remarks. (Note: For some projects, specific numbers such as parameter counts or data volumes are not disclosed; descriptive indicators are provided instead.)
Project Name | Model Architecture / Method | Parameter Scale | Training Data / Data Volume | Key Features / Remarks |
---|---|---|---|---|
1. Octo (Website) | Transformer-based diffusion policy supporting conditioning on language, goal images, and sensor history | Octo-Small: 27M; Octo-Base: 93M | Pre-trained on 800k robot demonstrations drawn from 25 datasets (Open X-Embodiment) | Flexibly adaptable to different robots and sensors; efficient fine-tuning; strong performance in zero-shot and few-shot scenarios (a toy sketch of a diffusion action head follows this table) |
2. OpenVLA (Website) | Fuses a visual encoder (SigLIP + DINOv2) with a Llama 2 7B language model to generate action tokens | 7B | Pre-trained on 970k robot demonstrations (Open X-Embodiment) | Leverages internet-scale vision-language pre-training; supports multi-robot control; resource-intensive training (64 A100 GPUs, 15 days) |
3. UMI (Website) | Data collection and policy learning framework based on a handheld gripper and wrist-mounted camera for in-the-wild demonstrations | Not disclosed | Rapid in-the-wild demonstration capture (approx. 30 seconds per demo), with high data diversity | Low-cost, portable hardware design; enables zero-calibration and bimanual dynamic manipulation; focuses on demonstration data collection |
4. RDT-1B (Website) | Transformer policy based on diffusion models, specifically designed for bimanual manipulation | 1.2B | Pre-trained on 46 datasets with over 1M demonstrations; additional 6K+ bimanual demonstrations | Large-scale pre-training, multi-task cross-robot capability; excellent zero-shot generalization and few-shot learning ability |
5. openpi (GitHub) | Consists of π₀ (a flow-based diffusion VLA) and π₀-FAST (an autoregressive VLA) for vision-language-action tasks | Not explicitly disclosed | Pre-trained on 10k+ hours of robot data | Provides multiple base model checkpoints; easy fine-tuning for downstream tasks; adaptable to various robot platforms |
6. Mobile ALOHA (Website) | Imitation learning (behavior cloning) based mobile manipulation system combining whole-body control and low-cost teleoperation | Not disclosed | Approximately 50 demonstrations per task, jointly trained with a static ALOHA dataset | Extends traditional ALOHA to mobile platforms; enables complex mobile manipulation tasks (e.g., opening doors, using elevators) |
7. RT-2 (Website) | Vision-language-action model that encodes robot actions as text tokens, combining internet pre-training with robot data | Based on PaLM-E (12B) or PaLI-X (55B) | Mixed large-scale internet vision-language data and robot trajectory data (exact numbers undisclosed) | Utilizes a pre-trained large model’s semantic understanding and reasoning; enables multi-step task planning and coherent execution; strong generalization |
8. VIMA (Website) | Transformer-based robotic agent that generates actions through multimodal prompts (language, image/video) | 2M - 200M (depending on variant) | Over 600K expert demonstrations; supplemented with large amounts of programmatically generated task data | Data-efficient; unified representation for various tasks; exhibits good zero-shot generalization and cross-task adaptability |
9. Perceiver-Actor (Website) | Behavior-cloning policy based on a Perceiver Transformer, using voxelized RGB-D input and discretized action prediction | Not disclosed (relatively lightweight) | Requires relatively few demonstrations (e.g., RLBench tasks with 249 variations, plus 7 real-world tasks learned from approx. 53 demos) | Data-efficient learning for 6-DoF manipulation; suitable for few-shot multi-task scenarios; high-performance action detection |
10. SayCan (Website) | Integrates a large language model with pre-trained skills and their value functions; combines LLM task scoring with affordance values (estimated execution success) for task planning | Based on an LLM (e.g., PaLM) with parameters up to tens of billions; skill modules are smaller | Utilizes large-scale internet text and robot skill demonstration data (exact figures undisclosed) | Achieves long-horizon task planning and semantic reasoning; composes multi-step skills; supports multilingual instructions; improves execution success rate |
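Several of the models above (Octo, RDT-1B, π₀) pair a transformer backbone with a diffusion- or flow-style action head: the backbone encodes images and language into an embedding, and the head iteratively denoises a chunk of future actions conditioned on that embedding. The toy PyTorch sketch below illustrates only this sampling idea; the network, dimensions, and crude update rule are invented for illustration and are not taken from any of the released implementations.

```python
import torch
import torch.nn as nn

class ToyDiffusionActionHead(nn.Module):
    """Minimal denoiser: predicts the noise in an action chunk given an
    observation embedding and the diffusion timestep (toy example only)."""
    def __init__(self, obs_dim=256, act_dim=7, horizon=4, hidden=512):
        super().__init__()
        self.act_dim, self.horizon = act_dim, horizon
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim * horizon + 1, hidden),
            nn.ReLU(),
            nn.Linear(hidden, act_dim * horizon),
        )

    def forward(self, obs_emb, noisy_actions, t):
        # t: (batch, 1) tensor of normalized diffusion timesteps.
        x = torch.cat([obs_emb, noisy_actions.flatten(1), t], dim=-1)
        return self.net(x)

@torch.no_grad()
def sample_action_chunk(head, obs_emb, n_steps=20):
    """Very simplified reverse process: start from Gaussian noise and
    repeatedly subtract the predicted noise (not a faithful DDPM schedule)."""
    b = obs_emb.shape[0]
    actions = torch.randn(b, head.horizon, head.act_dim)
    for step in reversed(range(n_steps)):
        t = torch.full((b, 1), step / n_steps)
        eps = head(obs_emb, actions, t).view_as(actions)
        actions = actions - eps / n_steps  # crude denoising update
    return actions  # (batch, horizon, act_dim) chunk of future actions

# Usage: obs_emb would come from the transformer over image + language tokens.
head = ToyDiffusionActionHead()
obs_emb = torch.randn(2, 256)             # stand-in for the backbone readout
chunk = sample_action_chunk(head, obs_emb)
print(chunk.shape)                        # torch.Size([2, 4, 7])
```

Predicting a short chunk of future actions, rather than a single step, is what lets these policies produce smooth, temporally consistent motion at control time.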
Hardware and Time Requirements for Training, Fine-Tuning, or Distillation
The following table outlines typical hardware devices and approximate training times for various stages, such as pre-training (full training), fine-tuning (or parameter-efficient fine-tuning), and distillation. Actual requirements vary depending on model scale, data volume, training strategy (e.g., full vs. parameter-efficient fine-tuning), and task specifics.
Project Name | Pre-training / Full Training (Hardware & Time) | Fine-Tuning / Parameter-Efficient Fine-Tuning / Distillation (Hardware & Time) | Remarks |
---|---|---|---|
1. Octo (Website) | Pre-training on 800k demos typically requires multiple high-performance GPUs (e.g., A100/RTX 4090); training time: several days to weeks | Fine-tuning with efficient strategies can often be completed on a single GPU in a few hours to one day | Adaptable to different robots and sensors; fine-tuning time is relatively short |
2. OpenVLA (Website) | Pre-training used 64 A100 GPUs for about 15 days | Task-specific fine-tuning with parameter-efficient methods usually takes a few hours to one day on a single GPU | Leverages large-scale internet pre-training; resource-intensive |
3. UMI (Website) | Focused on data collection and policy learning; training can be done on low-cost GPUs (even a single card); training time: on the order of a few hours | Fine-tuning for specific tasks (using fast demonstration capture) typically completes within hours | Portable hardware design; suited to in-the-wild demonstration data |
4. RDT-1B (Website) | Pre-training a 1.2B-parameter model generally requires a multi-GPU cluster (e.g., 8-16 A100 GPUs); training time: possibly over a week | Fine-tuning on specific bimanual tasks (using the additional 6K+ demos) may take from a few hours to one day | Large parameter scale and rich data; high resource and time demands for pre-training |
5. openpi (GitHub) | Full training on 10k+ hours of robot data may require high-memory GPUs (e.g., A100/H100) | Parameter-efficient fine-tuning (e.g., using LoRA; see the sketch after this table) typically requires at least 22.5GB of GPU memory (e.g., RTX 4090), with training times from a few hours to a few days | Offers both full training and efficient fine-tuning options; hardware requirements are clearly documented |
6. Mobile ALOHA (Website) | Imitation learning methods typically train on a single high-end GPU (e.g., RTX 3090/4090); training time: several hours to one day | Fine-tuning with combined static ALOHA data generally completes on a single GPU in a short period | Focuses on mobile and whole-body control; relatively small data volume |
7. RT-2 (Website) | Large models (12B-55B parameters) require extensive GPU clusters (e.g., 64 A100 GPUs); pre-training time: typically several weeks | Fine-tuning for specific tasks with joint training strategies may take from a few hours to one day, depending on data volume | Combines internet-scale pre-training with robot data; high hardware and time requirements |
8. VIMA (Website) | Model sizes range from a few million to several hundred million parameters; smaller variants can be trained on a single GPU in hours, while larger variants may need multiple GPUs for days to a week | Fine-tuning is typically done on a single GPU or a small multi-GPU setup; high data efficiency can greatly reduce training time | Model and data scale are adjustable, making fine-tuning flexible |
9. Perceiver-Actor (Website) | Thanks to voxelized inputs and discrete action prediction, training can often be done on a single GPU (8-16GB memory); training time: typically several hours to one day | Fine-tuning for few-shot scenarios is highly efficient, often completing within a few hours | Emphasizes data-efficient learning for 6-DoF manipulation; suitable for low-resource environments |
10. SayCan (Website) | Integrates a large language model (e.g., the PaLM series) with robot skills; pre-training typically uses TPUs or large-scale GPU clusters; pre-training time may span several weeks | Fine-tuning or distillation for specific scenarios is typically carried out on multi-GPU or TPU setups, taking from a few hours to one day | Combines semantic reasoning with low-level skills; pre-training is resource-intensive, but fine-tuning can leverage improvements to the underlying LLM |
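Several rows above cite parameter-efficient fine-tuning (e.g., LoRA) as what makes single-GPU adaptation of these large policies feasible. The sketch below shows the general pattern with Hugging Face peft, using a Llama 2 7B backbone as a stand-in (the same language backbone OpenVLA builds on); the checkpoint name, target modules, and hyperparameters here are illustrative assumptions, so consult each project's own fine-tuning scripts for the supported workflow.

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Stand-in backbone for illustration; a real workflow would load the
# project's released VLA checkpoint instead.
BASE = "meta-llama/Llama-2-7b-hf"

model = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16)

lora_cfg = LoraConfig(
    r=32,                                 # adapter rank (assumed value)
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style blocks
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
model.print_trainable_parameters()        # typically well under 1% of the 7B weights

# Training then proceeds as usual (e.g., with transformers' Trainer) on
# action-token sequences; only the adapter weights receive gradients.
```

Because only the low-rank adapters (and their optimizer state) are updated, memory is dominated by the frozen bf16 weights plus activations, which is roughly why a 24GB-class GPU can suffice for LoRA fine-tuning where full fine-tuning of the same model would not.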