Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down, it becomes challenging to determine why and what to do next. A problem can span computation, communication, a specific rank, or underlying hardware.
NVIDIA NCCL Inspector accelerates triaging by providing a lightweight and continuous report of NCCL communication performance. It tracks operation type, size, and bandwidth across every rank, and with this latest enhancement, can facilitate real-time analysis with minimal overhead.
It also helps determine the optimal training recipe. A previous post introduced the NCCL Inspector offline mode. While that fine-grained analysis remains the standard for deep-dive data, this post introduces a new feature: real-time monitoring. By integrating NCCL Inspector with a Prometheus exporter, live time-series visualizations can now be rendered directly in a user’s infrastructure dashboards.
NCCL Inspector deployment architecture
NCCL 2.30 introduces Prometheus mode, a major enhancement for real-time performance monitoring of NCCL in AI workloads. NCCL Inspector works in two modes, shown in Figures 1 and 2.
Figure 1. NCCL Inspector in JSON mode (default/offline mode)

JSON mode operates in two phases: data collection and data analysis. First, the data collection phase generates performance metrics from each rank and stores them individually in JSON files, typically on shared storage. Then, the data analysis phase processes that data. This method is considered offline because the processing isn’t completed in real time.
Figure 2. NCCL Inspector in real-time Prometheus mode

This new feature integrates NCCL Inspector metrics with Prometheus, converting them into time-series data suitable for visualization in Grafana dashboards. Prometheus mode eliminates the large storage requirements of JSON mode. The metric data is collected by the node exporter and ingested into Prometheus, a scalable, cloud-native monitoring platform. The NCCL job output file is designed to be overwritten continuously: once the node exporter collects the metrics, they’re no longer needed on disk.
Experimental setup for Prometheus mode
Setting up the NCCL Inspector profiler plugin requires building the plugin and setting a few required environment variables.
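A minimal sketch of the shell setup follows. NCCL_PROFILER_PLUGIN is the standard NCCL profiler hook and NCCL_INSPECTOR_DUMP_DIR is named in the text; the other variable names and all paths here are assumptions to be checked against the README.

```bash
# Load the Inspector as an NCCL profiler plugin (path is illustrative).
export NCCL_PROFILER_PLUGIN=/opt/nccl-inspector/libnccl-profiler-inspector.so

# Assumed toggle that enables the Inspector.
export NCCL_INSPECTOR_ENABLE=1

# Directory where the per-GPU .prom files are written; point it at the
# directory the node exporter's textfile collector watches.
export NCCL_INSPECTOR_DUMP_DIR=/var/lib/node_exporter/textfile

# Assumed name: period (seconds) at which the dump thread rewrites the
# .prom files; tune it to the node exporter's scrape interval.
export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL=5
```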
The dump thread interval and dump directory should be set and tuned according to the node exporter used. Once configured, NCCL Inspector starts its dump thread and writes collective performance metrics into NCCL_INSPECTOR_DUMP_DIR. The Prometheus Node Exporter then exposes the metrics for scraping into the Prometheus time-series database. Finally, these time-series metrics are rendered as dashboard graphs with Grafana.
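On the exporter side, one way to wire this up (with illustrative paths and the node exporter’s default port) is to point the standard textfile collector at the dump directory:

```bash
# The textfile collector picks up every *.prom file in the given
# directory and merges it into the exporter's /metrics output.
node_exporter --collector.textfile.directory=/var/lib/node_exporter/textfile

# Prometheus then scrapes the exporter's default endpoint,
# http://<node>:9100/metrics, on its normal schedule.
```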
When running the job, the metrics for each GPU are saved to a file with the format: nccl_inspector_metrics_<uuid_of_the_gpu>.prom
The UUID of the GPU is included in the file name since CUDA device IDs can overlap in a multi-user environment.
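To map a file name back to a physical device, the stable GPU UUIDs on a node can be listed with nvidia-smi:

```bash
# Print the index and stable UUID of each GPU; the UUID is what appears
# in the nccl_inspector_metrics_<uuid>.prom file names.
nvidia-smi --query-gpu=index,uuid --format=csv
```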
The NCCL job output file is in the Prometheus exposition format. Each metric is labeled with context, including NCCL version, Slurm job ID, node, GPU, communicator name, number of nodes, number of ranks, and message size. The following is an example:
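The exact output depends on the deployment, so the sample below is an illustrative sketch: the metric name, the coll label, and all values are assumptions, while the remaining label keys follow the list above.

```
# HELP nccl_inspector_coll_busbw_gbps Per-collective bus bandwidth (GB/s); name is illustrative
# TYPE nccl_inspector_coll_busbw_gbps gauge
nccl_inspector_coll_busbw_gbps{nccl_version="2.30.0",slurm_job_id="483921",node="node-07",gpu="GPU-5f8e7a12-3b4c-4d5e-9f01-23456789abcd",comm_name="comm_0x1a2b3c",n_nodes="4",n_ranks="32",coll="AllGather",msg_size="1073741824"} 312.5
```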
Once these metrics land in a Prometheus DB, the next step is rendering them in Grafana.
Time-series-based Grafana dashboards
Figures 3 and 4 show examples of how the time-series dashboards look using the Prometheus labels, categorized into NVLink-only collectives and mixed (network + NVLink) collectives:
Figure 3. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for NVIDIA NVLink-only communicators on a single node (n_nodes==1), observed over a six-minute window
Figure 4. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for combined network (IB/RoCE/EFA) and NVLink communicators in a multi-node setting (n_nodes==4), observed over a six-minute window
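As a sketch of how such a panel can be built, a Grafana time-series panel might use a PromQL query along these lines, reusing the assumed metric and label names from the exposition example above:

```
# Mean AllGather bus bandwidth per node and GPU, restricted to the
# four-node (network + NVLink) communicators.
avg by (node, gpu) (
  nccl_inspector_coll_busbw_gbps{coll="AllGather", n_nodes="4"}
)
```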
Use cases for NCCL Inspector
To demonstrate the triage workflow, these two use cases highlight how the dashboards accelerate root cause identification.
Live observability
Use live dashboards to find the root cause of performance slowdowns in a long-running AI workload. Observing changes on the dashboards and correlating job-level degradations with underlying NCCL or network-layer metrics enables targeted triage based on where the anomaly originates. To demonstrate this strategy, the team ran a large LLM pretraining job.
Timeline A: Normal workflow
Figure 5 shows the AllGather bus bandwidth for the mixed network + NVLink collectives in one of the experiments. The compute performance for this AI pretraining workload was ~310 TFLOPs/GPU.
Figure 5. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for mixed network + NVLink communicators during a normal AI pretraining workflow on four nodes, corresponding to an observed compute performance of ~310 TFLOPs/GPU
Timeline B: Network-induced slowdown
After introducing artificial network constraints, the AllGather bus bandwidth for mixed network + NVLink collectives dropped, and compute performance decreased to ~268 TFLOPs/GPU (~13% degradation vs. baseline).
This example shows that a real-time dashboard improves observability of collective performance across mixed transport communicators (network + NVLink), enabling faster root cause identification and reducing mean time to resolution.
Figure 6. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for mixed network + NVLink communicators during Timeline B, a network-induced slowdown scenario

Performance attribution
Another use case for NCCL Inspector is analyzing performance degradation over a specific time period. For example, in one experiment, performance degraded temporarily, as shown below.
Next, the observed degradation is examined to determine whether it correlates with a network anomaly during this period.
Figure 7. Grafana time-series dashboard showing NCCL ReduceScatter bus bandwidth (GB/s) for NVLink-only communicators during a performance attribution investigation on 2026-03-19
Figure 8. Grafana time-series dashboard showing NCCL ReduceScatter bus bandwidth (GB/s) for mixed network + NVLink communicators during the same performance attribution window on 2026-03-19

The dashboards show performance degradation in the mixed transport communication (network + NVLink-based collectives) while the NVLink-only collectives hold steady. This correlation indicates that the root cause is disruption or congestion in the network, and it enables drilling down into per-host and network counters to isolate where the slowdown occurred.
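The comparison behind Figures 7 and 8 can be reproduced with two panel queries that differ only in communicator scope, again using the assumed names from the exposition example:

```
# NVLink-only communicators (single node): expected to hold steady.
nccl_inspector_coll_busbw_gbps{coll="ReduceScatter", n_nodes="1"}

# Mixed network + NVLink communicators (multi-node): shows the dip.
nccl_inspector_coll_busbw_gbps{coll="ReduceScatter", n_nodes="4"}
```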
Next steps for real-time observability
The introduction of NCCL Inspector with Prometheus integration is designed to enhance network observability for AI workload performance analysis. This powerful combination enables a more scientific approach to performance analysis. Users can debug and understand the real-time performance characteristics of a running workload, triage slowdowns, fine-tune parameters, and measure the resulting performance changes using detailed metrics.
Get started
Refer to the GitHub README.md to:
- Build and deploy the NCCL Inspector plugin in Prometheus mode.
- Configure the Prometheus exporter to expose metrics for your cluster/environment.
- Use the Grafana template to set up the Grafana dashboard.
Acknowledgments
We would like to thank our NVIDIA colleagues Nikhithkumar Kotagari, Giuseppe Congi, and Nishank Chandawala, and Ziyang Jia from the University of California, Riverside, for valuable input and reviews during the design process.