Imitation from Heterogeneous Demonstrations
using Grounded Latent‑Action World Models

Applied Artificial Intelligence Lab
Oxford Robotics Institute
University of Oxford

Correspondence to: tianyou@robots.ox.ac.uk

GLAM learns a shared, control-aware latent action space across heterogeneous data sources, turning cheap unlabelled demonstrations into supervision for a target-robot imitation learning policy.

Abstract

Imitation learning has emerged as a powerful paradigm for learning visuomotor policies, but its generalisation and stability are limited by the scale and quality of the demonstration data it requires. A promising direction is to leverage more abundant but heterogeneous data sources, which differ in action space and often lack action labels altogether. Existing co-training approaches that combine heterogeneous data sources rely on heuristic, hand-engineered alignment techniques. In contrast, we argue that action representations should be grounded in prediction: actions that produce the same effect on the environment should share the same representation, regardless of their source. We instantiate this principle with a grounded latent-action world model (GLAM), a pair of generative models with a shared latent action space that is grounded by predicting future observations consistently across data sources. This latent action space is used to train downstream behavioural cloning (BC) policies, which map observations to latent actions and decode them back to robot actions, providing a paradigm for learning from heterogeneous data. Empirically, we demonstrate that GLAM learns an aligned latent action space that facilitates action transfer across data sources with and without action labels. Across five manipulation tasks in simulation and in the real world, GLAM-aligned policies significantly outperform BC baselines and prior latent-action methods, achieving an average improvement of +48% in task success rate in the same data-scarce setting.

Method

GLAM turns cross-source integration into a representation-learning problem: two actions that drive the environment along the same transition should share the same latent action, no matter which embodiment or data source they come from. We use the world model's prediction of how actions shape environment transitions as the shared, physically grounded supervisory signal.
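The grounding principle can be written compactly. The notation below is illustrative and is not taken from the paper: the per-source maps e_i and the forward model f are assumed names for whatever components infer and consume latent actions.

```latex
% Notation is illustrative and not taken from the paper.
% o, o'   : consecutive observations
% a^{(i)} : an action from data source i
% e_i     : hypothetical map from (observation, action) to a latent action z
% f       : shared forward model over latent actions
\[
  \bigl(o, a^{(i)}\bigr) \mapsto o'
  \quad\text{and}\quad
  \bigl(o, a^{(j)}\bigr) \mapsto o'
  \;\;\Longrightarrow\;\;
  e_i\bigl(o, a^{(i)}\bigr) = e_j\bigl(o, a^{(j)}\bigr) = z,
  \qquad
  f(o, z) \approx o'.
\]
```

Read this way, the latent action is identified by its effect on the observation, which is why a single forward model can supervise sources with different action spaces or no action labels at all.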

GLAM Stage 1: grounded latent-action world model

Stage 1 — Grounded Latent-Action World Model Pretraining. We treat actions as latent variables and train a pair of coupled generative models. The upper heterogeneous model uses an inverse dynamics model (IDM) and a shared forward model to infer latent actions directly from observation transitions, so it works on both target and auxiliary transitions, making the IDM space source-invariant. The lower target model uses an action encoder to map target robot state-action pairs into latent actions, injecting executable robot-control semantics. The asymmetric posterior alignment connects these two inference pathways into a single, source-invariant, control-aware latent action space.
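A minimal sketch of how the Stage 1 loss terms might be combined for one transition, using only the Python standard library. Everything here is an assumption made for illustration: the diagonal-Gaussian posteriors, the squared-error prediction loss, the function names, the `align_weight` parameter, and in particular the reading of "asymmetric" as a one-directional KL term that pulls the action-encoder posterior towards the IDM posterior. The paper's actual objective may differ.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Vector = Sequence[float]
Gaussian = Tuple[Vector, Vector]  # (mean, std) of a diagonal Gaussian


def gaussian_kl(q: Gaussian, p: Gaussian) -> float:
    """KL(q || p) for diagonal Gaussians, summed over latent dimensions."""
    (mu_q, std_q), (mu_p, std_p) = q, p
    return sum(
        math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
        for mq, sq, mp, sp in zip(mu_q, std_q, mu_p, std_p)
    )


def squared_error(pred: Vector, target: Vector) -> float:
    """Sum of squared differences, standing in for a prediction loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def stage1_loss_terms(
    next_obs: Vector,
    pred_from_idm: Vector,
    idm_posterior: Gaussian,
    pred_from_encoder: Optional[Vector] = None,
    encoder_posterior: Optional[Gaussian] = None,
    align_weight: float = 1.0,  # hypothetical weighting, not from the paper
) -> Dict[str, float]:
    """Loss terms for a single transition.

    Auxiliary transitions carry no action label, so only the IDM pathway
    contributes. Target transitions also pass through the action encoder.
    """
    terms = {"idm_prediction": squared_error(pred_from_idm, next_obs)}
    if pred_from_encoder is not None and encoder_posterior is not None:
        terms["encoder_prediction"] = squared_error(pred_from_encoder, next_obs)
        # Assumed reading of "asymmetric": the encoder posterior is pulled
        # towards the IDM posterior, not the other way round.
        terms["alignment"] = align_weight * gaussian_kl(
            encoder_posterior, idm_posterior
        )
    terms["total"] = sum(terms.values())
    return terms
```

Because KL divergence is itself direction-dependent, swapping the two arguments of `gaussian_kl` gives a different alignment pressure; which direction (and whether gradients are stopped on one side) is a design choice this sketch does not settle.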

GLAM Stage 2: world-model-aligned behavioural cloning

Stage 2 — GLAM-Aligned Behavioural Cloning from Heterogeneous Data. The frozen GLAM relabels every transition — target and auxiliary — with its latent action. A latent policy is then trained to predict chunks of latent actions from observations, which are decoded back into executable robot actions. This lets unlabelled auxiliary data supervise the downstream BC policy on equal footing with target demonstrations.
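The relabelling and chunking steps described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: `infer_latent` stands in for the frozen IDM, the function names and the `chunk_len` parameter are hypothetical, and the toy IDM at the bottom simply takes observation differences.

```python
from typing import Callable, List, Sequence, Tuple

Obs = Sequence[float]
Latent = Sequence[float]


def relabel_episode(
    observations: List[Obs],
    infer_latent: Callable[[Obs, Obs], Latent],
) -> List[Latent]:
    """Assign a latent action to every consecutive observation pair.

    No action labels are needed, so the same call works for target and
    auxiliary episodes alike.
    """
    return [
        infer_latent(obs, next_obs)
        for obs, next_obs in zip(observations, observations[1:])
    ]


def make_chunked_examples(
    observations: List[Obs],
    latents: List[Latent],
    chunk_len: int,
) -> List[Tuple[Obs, List[Latent]]]:
    """Pair each observation with the next `chunk_len` latent actions."""
    examples = []
    for t in range(len(latents) - chunk_len + 1):
        examples.append((observations[t], latents[t:t + chunk_len]))
    return examples


# Toy usage: a stand-in "IDM" that returns the observation difference.
def toy_idm(obs: Obs, next_obs: Obs) -> Latent:
    return [b - a for a, b in zip(obs, next_obs)]


episode = [[0.0], [1.0], [3.0], [6.0]]
latent_labels = relabel_episode(episode, toy_idm)
training_examples = make_chunked_examples(episode, latent_labels, chunk_len=2)
```

A latent policy would then be fitted to `training_examples` with a standard BC loss, and a separate decoder would map its predicted latents back to robot actions.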

Cross-Source Transfer through a Shared Latent Action Space

To test whether GLAM learns a genuinely shared action space, we encode unseen trajectories from different sources with the same pretrained IDM, and decode the latent sequence with the same target action decoder. This decoder has never seen UMI or real-robot actions. Latents inferred from UMI, Kinova-sim, and Kinova-real episodes all reproduce the original motion on the Kinova arm — evidence of alignment across sources, embodiment, and real-sim gaps.
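The encode-then-decode check described above can be sketched as a small replay loop. All names are hypothetical: `infer_latent` stands in for the pretrained IDM, `decode_action` for the target action decoder (assumed here to condition on the current state), and `step` for the target robot or simulator. The deviation metric is likewise an assumption, chosen only to make the comparison concrete.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def replay_through_latents(
    source_obs: List[Vector],
    infer_latent: Callable[[Vector, Vector], Vector],
    decode_action: Callable[[Vector, Vector], Vector],
    step: Callable[[Vector, Vector], Vector],
    start: Vector,
) -> List[List[float]]:
    """Encode a source episode to latents, then replay it on the target."""
    latents = [
        infer_latent(obs, next_obs)
        for obs, next_obs in zip(source_obs, source_obs[1:])
    ]
    state = list(start)
    trajectory = [state]
    for latent in latents:
        action = decode_action(state, latent)  # latent -> target robot action
        state = list(step(state, action))      # execute on the target
        trajectory.append(state)
    return trajectory


def mean_squared_deviation(a: List[Vector], b: List[Vector]) -> float:
    """Average squared difference between two equal-length trajectories."""
    total = sum((x - y) ** 2 for u, v in zip(a, b) for x, y in zip(u, v))
    count = sum(len(u) for u in a)
    return total / count
```

In the toy case where the latent is the observation difference, the decoder is the identity, and the environment adds the action to the state, the replayed trajectory matches the source exactly. With learned models, a low deviation on episodes from a source the decoder never saw is the kind of evidence the transfer experiment looks for.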

Latents from UMI → Kinova (cross-embodiment), Kinova → Kinova (in-distribution), and Real → Sim (real-to-sim) all replay the original motion on the Kinova arm.

Main Results

Across five manipulation tasks, GLAM-aligned policies outperform BC and prior latent-action baselines in the same data-scarce setting of 100 demonstrations, with the largest gains on the hardest tasks: +35% on average across the three real-world tasks, +44% on simulated stack-two, and +69% on bimanual stack-three. On stack-three — which typically needs hundreds to a thousand demonstrations — GLAM is the only method to achieve non-trivial success (72.7% vs. ≤4% for every baseline).

Video comparison: GLAM (ours) vs. MIP baseline on lifting (picking up a cube).

Success rate across all tasks

Success rate (%) across five manipulation tasks. GLAM(-O) consistently outperforms all baselines and is the only approach to solve bimanual stack-three.

Auxiliary Data Substitutes for Target Teleoperation

Our central claim is that aligned auxiliary data can substitute for expensive target-robot teleoperation. On stack-two, scaling cheap auxiliary UMI data yields performance gains comparable to scaling target Kinova data for our GLAM-aligned BC policies, while the BC baseline needs far more target trajectories to reach the same performance. This indicates that the pretrained GLAM has unified the different data sources into a single latent space, so auxiliary data can serve as a viable substitute for target data and improve the data efficiency of downstream BC.

Auxiliary data substitutes for target data

Heterogeneous data closes BC's data gap on stack-two. (a) MIP needs far more target trajectories than GLAM-aligned policies to reach the same success rate. (b) For GLAM-aligned policies, scaling auxiliary UMI data matches scaling target Kinova data — the two curves coincide.

Experiment Time-Lapse

Ten consecutive real-robot rollouts of the knock-down task, played back-to-back and unedited, showing the policy resetting and re-executing across varied object placements.

Knock-down — 10 consecutive rollouts.

BibTeX


@misc{wang2026imitationheterogeneousdemonstrationsusing,
  title={Imitation from Heterogeneous Demonstrations using Grounded Latent-Action World Models},
  author={Tianyou Wang and Anson Lei and Joe Watson and Ingmar Posner},
  year={2026},
  eprint={2606.21672},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2606.21672},
}