How to Post-Train Autonomous Vehicle Models in Closed-Loop with NVIDIA Alpamayo

Developing autonomous vehicle (AV) policies requires bridging an important gap between training and deployment. Vision-language-action (VLA) models that can reason over more complex driving scenes and produce richer intermediate reasoning are predominantly trained in open-loop, where model outputs are directly compared to ground-truth behaviors without considering their effect on the environment.

In deployment, however, a driving policy runs in closed-loop, where every braking, steering, and navigation decision affects the environment, and small errors can compound over time.

NVIDIA Alpamayo provides a systematic way to address this challenge. It is an open portfolio of AI models, simulation frameworks, and physical AI datasets for AV development, and it includes the AlpaSim AV simulation platform and the AlpaGym closed-loop training framework (coming soon).

This post explains how to train AV models in closed-loop with NVIDIA Alpamayo. Specifically, it walks through how to:

Install and configure AlpaGym
Define closed-loop rewards
Launch closed-loop training
Export the post-trained checkpoint for downstream use

Closed-loop post-training with AlpaGym extends AV training workflows by turning AlpaSim rollouts into training experience. Rather than treating simulation only as a final evaluation stage, AlpaGym connects simulator feedback directly to the policy training loop.

Figure 1. End-to-end workflow for post-training a driving model such as Alpamayo using AlpaGym, covering data collection, closed-loop simulation, the driving model, policy training, and orchestration

How to use AlpaGym for closed-loop reinforcement learning

Reinforcement learning (RL) can be used to improve a policy that was initially trained in open-loop. Instead of optimizing only against logged expert trajectories, the model can now learn from the consequences of its own actions in simulation.

This shift is critical for AV development, where small prediction or planning errors can compound over time. In closed-loop training, each braking, steering, and navigation decision affects the next state of the environment, revealing failure modes that static datasets or open-loop evaluation may miss.
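The compounding effect described above can be illustrated with a toy example. This is a minimal sketch, not AlpaSim code: it assumes a one-dimensional lateral offset from the lane center, an expert that stays at offset 0, and a policy that adds a fixed 2 cm error at every step. All names and numbers are hypothetical.

```python
def open_loop_errors(step_error: float, steps: int) -> list[float]:
    """Open-loop: every prediction starts from the logged expert state."""
    expert_offset = 0.0  # assumed expert stays on the lane center
    errors = []
    for _ in range(steps):
        predicted = expert_offset + step_error  # input is always the logged state
        errors.append(abs(predicted - expert_offset))
    return errors


def closed_loop_errors(step_error: float, steps: int) -> list[float]:
    """Closed-loop: the policy's output becomes its next input."""
    offset = 0.0
    errors = []
    for _ in range(steps):
        offset = offset + step_error  # error carries over into the next state
        errors.append(abs(offset))
    return errors


open_err = open_loop_errors(step_error=0.02, steps=50)
closed_err = closed_loop_errors(step_error=0.02, steps=50)
print(f"open-loop max error:   {max(open_err):.2f} m")
print(f"closed-loop max error: {max(closed_err):.2f} m")
```

In this toy setup the open-loop error never exceeds 2 cm, while the closed-loop error grows to about 1 m after 50 steps. An open-loop metric would rate the same policy as nearly perfect.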

However, enabling closed-loop RL comes with its own challenges. Running model inference, simulation, and training in parallel, while also syncing weight updates, communicating across instances, and moving data, is complex. It requires orchestration that uses compute resources efficiently and stays both robust and flexible.

Figure 2. AlpaGym enables large-scale closed-loop training, where driving models learn from the consequences of their own actions across a wide variety of simulated scenarios running in parallel, greatly reducing the gap between training and deployment

To address these challenges, AlpaGym connects policy training to AlpaSim closed-loop rollouts and provides an open source, high-throughput framework for closed-loop RL. The system combines AlpaSim simulator microservices, NVIDIA Physical AI Open Datasets, and the distributed NVIDIA Cosmos-RL training framework into a scalable post-training pipeline.

AlpaGym is built to scale from a single GPU to multi-node GPU clusters without requiring changes to user code, and supports efficient large-scale training through an asynchronous, stable distributed RL pipeline. It integrates AlpaSim and Cosmos-RL as its runtime and orchestration layer, uses Group Relative Policy Optimization (GRPO) as the default algorithm, and includes reference reward functions tested with Alpamayo models and the Physical AI AV NuRec dataset.
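To make the role of GRPO more concrete, the sketch below shows the group-relative advantage step as it is commonly described for the algorithm: rewards from several rollouts of the same scenario are normalized against their own group mean and standard deviation. Whether AlpaGym and Cosmos-RL use exactly this normalization is an assumption, and the function name and the epsilon value are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each rollout's reward against its own group (same scene)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    # eps avoids division by zero when every rollout gets the same reward
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical rollouts of one scene; the third one collided.
rewards = [4.0, 2.0, -6.0, 4.0]
advantages = group_relative_advantages(rewards)
for r, a in zip(rewards, advantages):
    print(f"reward={r:+.1f}  advantage={a:+.3f}")
```

Rollouts that beat the group average get a positive advantage and the others a negative one, so no separate value network is needed. If all rollouts in a group receive the same reward, every advantage is zero and that group contributes no learning signal.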

To get started with AlpaGym post-training, follow the steps outlined below.

Step 1: Install and configure AlpaGym

To install AlpaGym from the Alpamayo checkout, install the native CUDA dependencies and Redis on the host, then sync the UV workspace:

    sudo apt-get update
    sudo apt-get install -y libcudnn9-dev-cuda-12 \
        libnccl-dev=2.26.2-1+cuda12.8 libnccl2=2.26.2-1+cuda12.8 \
        redis-server git-lfs
    git lfs install
    git lfs pull
    huggingface-cli login  # Or export HF_TOKEN=...
    uv sync --all-packages

The Python environment is managed by uv, but cuDNN, NCCL, and the redis-server binary are host dependencies used by the CUDA model stack and Cosmos-RL. Alternatively, a suitable Dockerfile is also provided. Hugging Face authentication is required to download the scene artifacts.
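Before syncing the workspace, it can help to confirm that the host-level tools are actually on the PATH. The following is a minimal sketch using only the Python standard library. The list of binaries is taken from the install commands above; the helper names are hypothetical and the script is not part of AlpaGym.

```python
import os
import shutil
from collections.abc import Mapping

# Host-level tools named in the install step above.
REQUIRED_BINARIES = ["uv", "redis-server", "git-lfs"]


def missing_binaries(names: list[str]) -> list[str]:
    """Return the tools that cannot be found on the current PATH."""
    return [name for name in names if shutil.which(name) is None]


def has_hf_token(env: Mapping[str, str] = os.environ) -> bool:
    """True if HF_TOKEN is set to a non-empty value."""
    return bool(env.get("HF_TOKEN"))


missing = missing_binaries(REQUIRED_BINARIES)
print("missing binaries:", ", ".join(missing) if missing else "none")
if not has_hf_token():
    # A login via huggingface-cli stores the token on disk instead,
    # so an unset variable is only a hint, not an error.
    print("HF_TOKEN is not set; make sure you ran huggingface-cli login")
```

This check does not cover cuDNN or NCCL, which are shared libraries rather than executables.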

An AlpaGym run is a Hydra configuration. It specifies the policy checkpoint, the AlpaSim scene set, rollout parallelism, reward function, and Cosmos-RL training parameters. In this workflow, the starting checkpoint is an Alpamayo model.

Figure 3. In AlpaGym closed-loop post-training, the host process starts AlpaSim, rollout workers expose policy drivers, AlpaSim executes simulator sessions, and AlpaGym returns rollout artifacts and rewards to the trainer

Step 2: Define the closed-loop reward

The reward should match the behavior you want to improve in closed-loop. For trajectory-quality post-training, common reward terms include progress, lane keeping, collision avoidance, offroad rate, comfort, and distance to a reference trajectory.

A practical first reward is intentionally simple: combine progress with penalties for safety-critical failures. In AlpaGym, this can be expressed as a small sum of terms, using AlpaSim metrics where possible:

    # reward/progress_safety.yaml
    terms:
      - kind: metric
        metric_name: progress
        scale: 1.0
      - kind: metric
        metric_name: collision_any
        scale: -10.0
      - kind: metric
        metric_name: offroad
        scale: -5.0

Once the pipeline is stable, add more targeted terms for the failure modes observed in AlpaSim videos and metrics.
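The weighted sum expressed by progress_safety.yaml can be sketched in a few lines of Python. The metric names and scales come from the config above. The assumptions are that each metric arrives as one number per episode, that collision_any and offroad are 0/1 indicators, and that progress lies between 0 and 1. The function is illustrative and is not AlpaGym's reward implementation.

```python
# Terms mirrored from reward/progress_safety.yaml.
REWARD_TERMS = [
    {"metric_name": "progress", "scale": 1.0},
    {"metric_name": "collision_any", "scale": -10.0},
    {"metric_name": "offroad", "scale": -5.0},
]


def episode_reward(metrics: dict[str, float], terms: list[dict] = REWARD_TERMS) -> float:
    """Sum scale * metric over all terms; raises KeyError if a metric is missing."""
    return sum(term["scale"] * metrics[term["metric_name"]] for term in terms)


# Hypothetical per-episode metrics.
clean_episode = {"progress": 0.8, "collision_any": 0.0, "offroad": 0.0}
crash_episode = {"progress": 0.9, "collision_any": 1.0, "offroad": 0.0}

print("clean:", episode_reward(clean_episode))
print("crash:", episode_reward(crash_episode))
```

With these scales, a single collision outweighs any achievable progress, so the policy cannot trade safety for distance. That property is worth rechecking whenever a new term or scale is added.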

Step 3: Launch closed-loop post-training

Start AlpaGym training from your model checkpoint. Alpamayo serves as an example model here.

    uv run -m alpagym_host.cli \
        policy=alpamayo \
        policy.model.kind=alpamayo_r1 \
        policy.model.path=/path/to/checkpoint \
        reward=progress_safety

This will bring up AlpaGym with AlpaSim on a single GPU. Stay tuned for detailed instructions on how to use your own AV model.
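The arguments in the launch command are Hydra-style overrides. As a rough mental model, keys without dots (policy, reward) pick a config file from a group, while dotted keys set a single nested value. The sketch below mimics that split in plain Python. It is not Hydra's parser, and reading policy and reward as config groups is an assumption based on the reward/progress_safety.yaml file name.

```python
def split_overrides(args: list[str]) -> tuple[dict, dict]:
    """Separate group selections (no dot) from nested value overrides (dotted)."""
    groups: dict[str, str] = {}
    values: dict = {}
    for arg in args:
        key, sep, value = arg.partition("=")
        if not sep:
            raise ValueError(f"expected key=value, got {arg!r}")
        if "." not in key:
            groups[key] = value  # e.g. reward=progress_safety
            continue
        node = values
        *parents, leaf = key.split(".")
        for part in parents:
            node = node.setdefault(part, {})  # walk or create nested dicts
        node[leaf] = value
    return groups, values


groups, values = split_overrides([
    "policy=alpamayo",
    "policy.model.kind=alpamayo_r1",
    "policy.model.path=/path/to/checkpoint",
    "reward=progress_safety",
])
print(groups)
print(values)
```

Under this reading, switching to a different reward only requires adding another YAML file to the reward group and changing the reward= argument.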

During training, AlpaGym requests scene rollouts from AlpaSim, collects per-episode artifacts, computes rewards, and updates the policy. Useful training signals include mean reward, reward variance, failure rates, policy loss, rollout throughput, and the gap between generated rollouts and the latest policy weights.
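The training signals listed above can be computed from per-episode records. The following is a minimal sketch that assumes each record carries a reward, a collision flag, and the version of the policy weights that generated the rollout. The Episode type and its field names are hypothetical and do not reflect AlpaGym's artifact format.

```python
from dataclasses import dataclass
from statistics import mean, pvariance


@dataclass
class Episode:
    reward: float
    collided: bool
    policy_version: int  # weights version used to generate this rollout


def summarize(episodes: list[Episode], latest_version: int) -> dict[str, float]:
    """Aggregate a batch of episodes into a few monitoring signals."""
    rewards = [e.reward for e in episodes]
    staleness = [latest_version - e.policy_version for e in episodes]
    return {
        "mean_reward": mean(rewards),
        "reward_variance": pvariance(rewards),
        "collision_rate": sum(e.collided for e in episodes) / len(episodes),
        "mean_staleness": mean(staleness),  # gap to the latest policy weights
    }


batch = [
    Episode(reward=0.8, collided=False, policy_version=10),
    Episode(reward=-9.1, collided=True, policy_version=9),
    Episode(reward=0.6, collided=False, policy_version=10),
    Episode(reward=0.7, collided=False, policy_version=8),
]
stats = summarize(batch, latest_version=10)
for name, value in stats.items():
    print(f"{name}: {value:.3f}")
```

In an asynchronous pipeline, a steadily growing staleness value suggests that rollout workers are falling behind the trainer, while a reward variance near zero means the rollouts are too similar to provide much learning signal.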

In this recipe, these rollout artifacts and training signals are the primary outputs of the post-training run. They help you confirm that closed-loop learning is running correctly and select checkpoints for downstream evaluation on your own held-out AlpaSim scenario suites.

Step 4: Export the post-trained checkpoint

After training, place the AlpaGym-produced checkpoint and config files into a folder that can be accessed by the AlpaSim driver (your Hugging Face model cache, for example). Then create a new driver config with that folder path (called alpamayo1_CLRL here). The following snippet shows which fields to edit to specify custom paths in a driver YAML config. This makes the AlpaGym post-trained policy runnable inside AlpaSim for closed-loop rollouts.

    ...
    model:
      model_type: alpamayo1
      checkpoint_path: "/root/.cache/huggingface/alpasim_models/alpamayo1_CLRL/step_NNNNNN"
      device: "cuda"
    ...
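If the export folder holds several step_NNNNNN directories, a small helper can pick the one to reference in checkpoint_path. This sketch assumes that NNNNNN is a training step number and that a larger number means a later checkpoint; neither is confirmed by the config above. The helper name is hypothetical, and the demo runs against a temporary directory rather than a real model cache.

```python
import re
import tempfile
from pathlib import Path
from typing import Optional

STEP_DIR = re.compile(r"^step_(\d+)$")


def latest_step_dir(export_root: Path) -> Optional[Path]:
    """Return the step_<number> subdirectory with the highest number, if any."""
    candidates = []
    for child in export_root.iterdir():
        match = STEP_DIR.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))
    if not candidates:
        return None
    return max(candidates, key=lambda item: item[0])[1]


# Demo on a throwaway directory that stands in for the export folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for name in ["step_000100", "step_000250", "notes"]:
        (root / name).mkdir()
    chosen = latest_step_dir(root)
    chosen_name = chosen.name if chosen else None
    print("checkpoint to reference:", chosen_name)
```

The latest checkpoint is not necessarily the best one, so the training signals and rollout artifacts should still guide the final choice.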

Next, run the exported model on a representative scenario to verify that the policy, driver, and simulation loop are connected correctly. At this stage, you can inspect how the policy behaves when its own actions affect the next state of the environment.

    uv run alpasim_wizard \
        deploy=local \
        topology=1gpu \
        driver=alpamayo1_CLRL \
        wizard.log_dir=$PWD/tutorial_alpamayo_CLRL \
        'scenes.scene_ids=[clipgt-9ea70552-6dcb-4ee8-a368-9a906a333f6e]'

A closed-loop rollout provides useful qualitative signals: whether the model produces stable trajectories and remains within the drivable area, how it reacts to nearby traffic agents, and which failure modes should be targeted during post-training.

Video 1. AlpaSim closed-loop rollout of an AV model, including the rendered camera view, predicted trajectory, and rollout-level diagnostics

With this checkpoint, teams can inspect rollout videos, per-episode metrics, reward traces, and failure cases collected during training. These artifacts are useful for debugging reward design, checking rollout stability, and selecting checkpoints for later held-out evaluation in AlpaSim.

Get started post-training AV models

Closed-loop post-training provides a practical path for iterating on end-to-end driving policies. In this case, AlpaGym uses closed-loop rollouts to post-train AV policies in simulation, enabling them to learn from the consequences of their actions.

You can use these tools together with the other components of the NVIDIA Alpamayo Open Platform to develop reasoning models that can be run, inspected, and post-trained in a closed-loop simulation workflow. Extend this same recipe more broadly with your own rewards, scenarios, and evaluation suites.

Ready to get started? Check out the NVlabs/alpamayo-recipes GitHub repo to adapt the recipe in this post for your own use cases.

To evaluate your model on a public leaderboard, see the two open AV challenges NVIDIA launched at CVPR 2026:

AlpaSim Closed-Loop E2E Driving Challenge
Physical AI AV Reasoning Challenge

To learn more, see Expanding the Alpamayo Open Platform for Developing Reasoning AVs Across Models, Data, and Simulation.

Join NVIDIA founder and CEO Jensen Huang for the NVIDIA GTC Taipei 2026 Keynote and dive deeper with related sessions.
