Developing autonomous vehicle (AV) policies requires bridging an important gap between training and deployment. Vision-language-action (VLA) models that can reason over more complex driving scenes and produce richer intermediate reasoning are predominantly trained in open-loop, where model outputs are directly compared to ground-truth behaviors without considering their effect on the environment.
In deployment, however, a driving policy runs in closed-loop, where every braking, steering, and navigation decision affects the environment, and small errors can compound over time.
NVIDIA Alpamayo, an open portfolio of AI models, simulation frameworks, and physical AI datasets for AV development, provides a systematic way to address this challenge. Alpamayo includes the AlpaSim AV simulation platform and the AlpaGym closed-loop training framework (coming soon).
This post explains how to train AV models in closed-loop with NVIDIA Alpamayo. Specifically, it walks through how to:
- Install and configure AlpaGym
- Define closed-loop rewards
- Launch closed-loop training
- Export the post-trained checkpoint for downstream use
Closed-loop post-training with AlpaGym extends AV training workflows by turning AlpaSim rollouts into training experience. Rather than treating simulation only as a final evaluation stage, AlpaGym connects simulator feedback directly to the policy training loop.
Figure 1. End-to-end workflow for post-training a driving model such as Alpamayo using AlpaGym
How to use AlpaGym for closed-loop reinforcement learning
Reinforcement learning (RL) can be used to improve a policy that was initially trained in open-loop. Instead of optimizing only against logged expert trajectories, the model can now learn from the consequences of its own actions in simulation.
This shift is critical for AV development, where small prediction or planning errors can compound over time. In closed-loop training, each braking, steering, and navigation decision affects the next state of the environment, revealing failure modes that static datasets or open-loop evaluation may miss.
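The difference between the two regimes can be illustrated with a toy model. The sketch below is not AlpaSim code. It assumes a hypothetical policy whose only flaw is a constant lateral error per step (`bias`), and an expert that stays at lateral offset 0. In open-loop, every prediction starts from the logged expert state, so the error never grows. In closed-loop, each prediction starts from wherever the policy's previous action left the vehicle.

```python
def open_loop_errors(bias, steps):
    """Per-step error when every prediction starts from the logged expert state."""
    # The state is reset to the expert's at each step, so only `bias` is observed.
    return [abs(bias) for _ in range(steps)]


def closed_loop_errors(bias, steps):
    """Per-step error when each prediction starts from the policy's own last state."""
    offset = 0.0
    errors = []
    for _ in range(steps):
        # The policy's error moves the vehicle, and the next step starts from there.
        offset += bias
        errors.append(abs(offset))
    return errors


if __name__ == "__main__":
    bias = 0.02  # hypothetical lateral error per step, in meters
    print("open-loop final error:  ", open_loop_errors(bias, 50)[-1])
    print("closed-loop final error:", closed_loop_errors(bias, 50)[-1])
```

With a 0.02 m error per step, the open-loop metric reports 0.02 m at every step, while the closed-loop offset reaches roughly 1 m after 50 steps. Real driving dynamics are far richer, but the accumulation mechanism is the same.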
However, enabling closed-loop RL comes with its own challenges. Running model inference, simulation, and training in parallel while syncing weight updates, communicating across instances, and moving data is complex. It requires orchestration and efficient use of compute resources in a way that is both robust and flexible.
Figure 2. AlpaGym enables large-scale closed-loop training, where driving models learn from the consequences of their own actions across a wide variety of simulated scenarios, greatly reducing the difference between training and deployment
To address these challenges, AlpaGym connects policy training to AlpaSim closed-loop rollouts and provides an open source, high-throughput framework for closed-loop RL. The system combines AlpaSim simulator microservices, NVIDIA Physical AI Open Datasets, and the distributed NVIDIA Cosmos-RL training framework into a scalable post-training pipeline.
Built to scale from a single GPU to multi-node GPU clusters, AlpaGym supports efficient large-scale training through an asynchronous and stable distributed RL pipeline, without requiring changes to user code. It uses AlpaSim and Cosmos-RL as its runtime and orchestration layer and GRPO as its default algorithm, and it includes reference reward functions tested with Alpamayo models and the Physical AI AV NuRec dataset.
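For readers new to GRPO, its central idea is to score each rollout relative to the other rollouts sampled for the same scene, rather than against a learned value function. The sketch below shows that normalization step only. It is a generic illustration, not the Cosmos-RL implementation, and the function name and `eps` default are assumptions made for this example.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize the rewards of rollouts sampled for the same scene (GRPO-style).

    Each rollout's advantage is its reward minus the group mean, divided by the
    group standard deviation. `eps` avoids division by zero when all rewards match.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four rollouts of one scene.
    print(group_relative_advantages([0.2, 0.9, 0.5, 0.4]))
```

One consequence of this design is that a group whose rollouts all receive the same reward yields zero advantages, and therefore no learning signal for that scene.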
To get started with AlpaGym post-training, follow the steps outlined below.
Step 1: Install and configure AlpaGym
To install AlpaGym from the Alpamayo checkout, install the native CUDA dependencies and Redis on the host, then sync the uv workspace.
The Python environment is managed by uv, but cuDNN, NCCL, and the redis-server binary are host dependencies used by the CUDA model stack and Cosmos-RL. Alternatively, you can use the provided Dockerfile. Hugging Face authentication is required to download the scene artifacts.
An AlpaGym run is a Hydra configuration. It specifies the policy checkpoint, the AlpaSim scene set, rollout parallelism, reward function, and Cosmos-RL training parameters. In this workflow, the starting checkpoint is an Alpamayo model.
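To make the shape of such a run concrete, the sketch below groups the five kinds of settings listed above into plain Python dataclasses. It does not reproduce the actual AlpaGym Hydra schema: every field name, default value, and path here is a hypothetical placeholder chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingParams:
    """Hypothetical stand-in for the Cosmos-RL training parameters."""
    learning_rate: float = 1e-6   # placeholder value, not a recommendation
    rollouts_per_scene: int = 8   # group size used by a GRPO-style update


@dataclass
class RunConfig:
    """Hypothetical stand-in for one closed-loop post-training run."""
    policy_checkpoint: str = "path/to/alpamayo-checkpoint"   # placeholder path
    scene_set: str = "path/to/scene-list"                    # placeholder path
    num_rollout_workers: int = 1
    reward_fn: str = "progress_with_safety_penalties"        # placeholder name
    training: TrainingParams = field(default_factory=TrainingParams)


def validate(cfg):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    if cfg.num_rollout_workers < 1:
        problems.append("num_rollout_workers must be at least 1")
    if cfg.training.rollouts_per_scene < 2:
        # A group of one rollout has no peers to be compared against.
        problems.append("rollouts_per_scene must be at least 2 for group-relative updates")
    if cfg.training.learning_rate <= 0:
        problems.append("learning_rate must be positive")
    return problems


if __name__ == "__main__":
    print(validate(RunConfig()))
    print(validate(RunConfig(num_rollout_workers=0)))
```

Checking a configuration before launch is inexpensive compared with discovering a bad setting after simulator sessions have already started.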
Figure 3. In AlpaGym closed-loop post-training, the host process starts AlpaSim, rollout workers expose policy drivers, AlpaSim executes simulator sessions, and AlpaGym returns rollout artifacts and rewards to the trainer
Step 2: Define the closed-loop reward
The reward should match the behavior you want to improve in closed-loop. For trajectory-quality post-training, common reward terms include progress, lane keeping, collision avoidance, offroad rate, comfort, and distance to a reference trajectory.
A practical first reward is intentionally simple: combine progress with penalties for safety-critical failures. In AlpaGym, this can be expressed as a small sum of terms, using AlpaSim metrics where possible:
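A minimal sketch of such a reward follows, assuming a hypothetical per-episode summary with a progress fraction and two failure flags. The `EpisodeMetrics` fields, the function name, and the penalty weights are illustrative assumptions. The actual AlpaSim metric names and the AlpaGym reward interface may differ.

```python
from dataclasses import dataclass


@dataclass
class EpisodeMetrics:
    """Hypothetical per-episode summary; real AlpaSim metric names may differ."""
    progress: float       # fraction of the route completed, in [0, 1]
    collided: bool        # True if any collision occurred during the episode
    went_offroad: bool    # True if the vehicle left the drivable area


def simple_reward(metrics, collision_penalty=1.0, offroad_penalty=1.0):
    """Progress minus fixed penalties for safety-critical failures."""
    reward = metrics.progress
    if metrics.collided:
        reward -= collision_penalty
    if metrics.went_offroad:
        reward -= offroad_penalty
    return reward


if __name__ == "__main__":
    clean = EpisodeMetrics(progress=0.8, collided=False, went_offroad=False)
    crash = EpisodeMetrics(progress=0.8, collided=True, went_offroad=False)
    print(simple_reward(clean), simple_reward(crash))
```

With penalties at least as large as the maximum progress, an episode that ends in a collision never scores above a clean episode, regardless of how far the vehicle drove.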
Once the pipeline is stable, add more targeted terms for the failure modes observed in AlpaSim videos and metrics.
Step 3: Launch closed-loop post-training
Start AlpaGym training from your model checkpoint. Alpamayo serves as an example model here.
This will bring up AlpaGym with AlpaSim on a single GPU. Stay tuned for detailed instructions on how to use your own AV model.
During training, AlpaGym requests scene rollouts from AlpaSim, collects per-episode artifacts, computes rewards, and updates the policy. Useful training signals include mean reward, reward variance, failure rates, policy loss, rollout throughput, and the gap between generated rollouts and the latest policy weights.
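Several of these signals can be computed directly from a batch of finished episodes. The sketch below assumes a hypothetical `Episode` record that carries the reward, a failure flag, and the version of the policy weights that generated the rollout. None of these names come from AlpaGym, and its actual logging format may differ.

```python
import statistics
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical record for one finished rollout."""
    reward: float
    failed: bool          # for example, a collision or an offroad event
    policy_version: int   # version of the weights that generated this rollout


def summarize(episodes, latest_policy_version):
    """Aggregate basic training signals over a non-empty batch of episodes."""
    rewards = [e.reward for e in episodes]
    staleness = [latest_policy_version - e.policy_version for e in episodes]
    return {
        "mean_reward": statistics.fmean(rewards),
        "reward_variance": statistics.pvariance(rewards),
        "failure_rate": sum(e.failed for e in episodes) / len(episodes),
        # Gap between the weights that produced the rollouts and the latest weights.
        "mean_staleness": statistics.fmean(staleness),
    }


if __name__ == "__main__":
    batch = [Episode(1.0, False, 3), Episode(0.0, True, 2)]
    print(summarize(batch, latest_policy_version=3))
```

In an asynchronous pipeline, the staleness figure is worth watching alongside reward: rollouts generated by much older weights describe the behavior of a policy that no longer exists.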
In this recipe, these rollout artifacts and training signals are the primary outputs of the post-training run. They help you confirm that closed-loop learning is running correctly and select checkpoints for downstream evaluation on your own held-out AlpaSim scenario suites.
Step 4: Export the post-trained checkpoint
After training, place the AlpaGym-produced checkpoint and config files in a folder that the AlpaSim driver can access (your Hugging Face model cache, for example). Then create a new driver config with that folder path (called alpamayo1_CLRL here). See the following code for what to edit to specify custom paths in a driver YAML config. This makes the AlpaGym post-trained policy runnable inside AlpaSim for closed-loop rollouts.
Next, run the exported model on a representative scenario to verify that the policy, driver, and simulation loop are connected correctly. At this stage, you can inspect how the policy behaves when its own actions affect the next state of the environment.
A closed-loop rollout provides useful qualitative signals: whether the model produces stable trajectories and remains within the drivable area, how it reacts to nearby traffic agents, and which failure modes should be targeted during post-training.
Video 1. AlpaSim closed-loop rollout of an AV model, including the rendered camera view, predicted trajectory, and rollout-level diagnostics
With this checkpoint, teams can inspect rollout videos, per-episode metrics, reward traces, and failure cases collected during training. These artifacts are useful for debugging reward design, checking rollout stability, and selecting checkpoints for later held-out evaluation in AlpaSim.
Get started post-training AV models
Closed-loop post-training provides a practical path for iterating on end-to-end driving policies. In this case, AlpaGym uses closed-loop rollouts to post-train AV policies in simulation, enabling them to learn from the consequences of their actions.
You can use these tools together with the other components of the NVIDIA Alpamayo Open Platform to develop reasoning models that can be run, inspected, and post-trained in a closed-loop simulation workflow. Extend this same recipe more broadly with your own rewards, scenarios, and evaluation suites.
Ready to get started? Check out the NVlabs/alpamayo-recipes GitHub repo to adapt the recipe in this post for your own use cases.
To evaluate your model on a public leaderboard, see the two open AV challenges NVIDIA launched at CVPR 2026:
- AlpaSim Closed-Loop E2E Driving Challenge
- Physical AI AV Reasoning Challenge
To learn more, see Expanding the Alpamayo Open Platform for Developing Reasoning AVs Across Models, Data, and Simulation.
Join NVIDIA founder and CEO Jensen Huang for the NVIDIA GTC Taipei 2026 Keynote and dive deeper with related sessions.