Real-Time Performance Monitoring and Faster Debugging with NCCL Inspector and Prometheus

Distributed deep learning depends on fast, reliable GPU-to-GPU communication using the NVIDIA Collective Communication Library (NCCL). When training slows down, it becomes challenging to determine why and what to do next. A problem can span computation, communication, a specific rank, or underlying hardware. 

NVIDIA NCCL Inspector accelerates triage by providing lightweight, continuous reporting of NCCL communication performance. It tracks operation type, size, and bandwidth across every rank, and with this latest enhancement it can support real-time analysis with minimal overhead.

It also helps determine the optimal training recipe. A previous post introduced NCCL Inspector offline mode, and fine-grained offline analysis remains the standard for deep-dive data. This post introduces a new feature: real-time monitoring. By integrating NCCL Inspector with a Prometheus exporter, live time-series visualizations can now be powered directly within a user's infrastructure dashboard.

NCCL Inspector deployment architecture

NCCL 2.30 introduces Prometheus Mode, a major enhancement for real-time performance monitoring of NCCL in AI workloads. The NCCL Inspector works in two modes, shown in Figures 1 and 2. 

Figure 1. NCCL Inspector in JSON mode (default/offline mode). Each GPU's Inspector instance writes NCCL Inspector output files to shared storage (Lustre/AWS S3/NFS).

JSON mode operates in two phases: data collection and data analysis. In the data collection phase, each rank generates performance metrics and stores them in its own JSON file, typically on shared storage. In the data analysis phase, the stored data is processed. This method is considered offline because the processing doesn't happen in real time.

Figure 2. NCCL Inspector in real-time Prometheus mode.

This new feature integrates NCCL Inspector metrics with Prometheus, converting them into time-series data suitable for visualization in Grafana dashboards. A node exporter moves the metric data to Prometheus, a scalable, cloud-native monitoring platform. Prometheus mode also eliminates the large storage requirements of JSON mode: the NCCL job output file is designed to be overwritten continuously, and once the node exporter has collected the metrics, they're no longer needed on disk.

Experimental setup for Prometheus Mode

Setting up the NCCL Inspector profiler plugin requires building the plugin and setting the following environment variables:

NCCL_PROFILER_PLUGIN=/path/to/nccl/plugins/profiler/inspector/libnccl-profiler-inspector.so
NCCL_INSPECTOR_ENABLE=1
NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=3000000
NCCL_INSPECTOR_PROM_DUMP=1
NCCL_INSPECTOR_DUMP_DIR=/path/to/node/exporter/log/location
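For example, in a Slurm environment, the variables can be exported before launching the training job so that every rank inherits them. The following is a minimal sketch; the plugin path, dump directory, launcher flags, and train.py are placeholders to adapt to your cluster:

# Point NCCL at the Inspector profiler plugin and enable Prometheus dumps
export NCCL_PROFILER_PLUGIN=/path/to/nccl/plugins/profiler/inspector/libnccl-profiler-inspector.so
export NCCL_INSPECTOR_ENABLE=1
export NCCL_INSPECTOR_DUMP_THREAD_INTERVAL_MICROSECONDS=3000000  # dump every 3 seconds
export NCCL_INSPECTOR_PROM_DUMP=1
export NCCL_INSPECTOR_DUMP_DIR=/path/to/node/exporter/log/location

# Launch the workload as usual; NCCL loads the plugin in each rank
srun --ntasks-per-node=4 python train.py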

The dump thread interval and dump directory should be set and tuned according to the node exporter used. Once configured, NCCL Inspector begins dumping collective performance metrics into NCCL_INSPECTOR_DUMP_DIR. The Prometheus Node Exporter then sends the metrics to the Prometheus time-series database. Finally, these time-series metrics are rendered as dashboard graphs in Grafana.
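If the stock Prometheus Node Exporter is used, its textfile collector can be pointed at the same directory so the .prom files are picked up on each scrape. A minimal sketch, reusing the placeholder path from above:

node_exporter --collector.textfile.directory=/path/to/node/exporter/log/location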

When running the job, the metrics are saved to a file named with the format:

nccl_inspector_metrics_<uuid_of_the_gpu>.prom

The UUID of the GPU is included in the file name since CUDA device IDs can overlap in a multi-user environment. 
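To map a metrics file back to a physical device, the GPU UUIDs on a node can be listed with nvidia-smi (output abbreviated):

nvidia-smi -L
# GPU 0: NVIDIA <product name> (UUID: GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)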

The NCCL job output file is in the Prometheus exposition format. Each metric is labeled with context, including NCCL version, Slurm job ID, node, GPU, communicator name, number of nodes, number of ranks, and message size. The following is an example:

nccl_p2p_bus_bandwidth_gbs{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="1",nranks="64",p2p_operation="Send",message_size="1-2MB"} 19.1634
nccl_p2p_exec_time_microseconds{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="1",nranks="64",p2p_operation="Send",message_size="1-2MB"} 92.8984
nccl_p2p_bus_bandwidth_gbs{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="1",nranks="64",p2p_operation="Recv",message_size="1-2MB"} 19.2396
nccl_p2p_exec_time_microseconds{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="1",nranks="64",p2p_operation="Recv",message_size="1-2MB"} 92.5781
nccl_bus_bandwidth_gbs{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="4",nranks="32",collective="ReduceScatter",message_size="134-135MB",algo_proto="RING_SIMPLE"} 44.1181
nccl_collective_exec_time_microseconds{version="v5.1",slurm_job_id="1670760",node="nvl72033-T01",gpu="GPU0",comm_name="unknown",n_nodes="4",nranks="32",collective="ReduceScatter",message_size="134-135MB",algo_proto="RING_SIMPLE"} 104164

Once these metrics land in a Prometheus DB, the next step is rendering them in Grafana.

Time-series Grafana dashboards

Figures 3 and 4 show examples of time-series dashboards built from the Prometheus labels, categorized into NVLink-only collective dashboards and mixed (network + NVLink) collective dashboards:

Figure 3. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for NVIDIA NVLink-only communicators on a single node (n_nodes==1), observed over a six-minute window. Larger message sizes achieve higher bandwidth, with the 17–18 MB range reaching approximately 650 GB/s while smaller sizes, such as 128–129 KB, sit near 50 GB/s.

Figure 4. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for combined network (IB/RoCE/EFA) and NVLink communicators in a multi-node setting (n_nodes==4), observed over a six-minute window. The largest message size (535–536 MB) reaches approximately 350 GB/s, mid-range messages cluster around 125–150 GB/s, and the smallest sit near 50 GB/s.
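Panels like these can be driven by simple PromQL queries over the labels shown earlier. The following is a minimal sketch using the metric and label names from the exposition example above; the exact queries behind the dashboards in this post are an assumption:

# NVLink-only AllGather bandwidth, one series per message size
nccl_bus_bandwidth_gbs{collective="AllGather", n_nodes="1"}

# Mixed network + NVLink AllGather bandwidth (multi-node communicators)
nccl_bus_bandwidth_gbs{collective="AllGather", n_nodes!="1"}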

Use cases for NCCL Inspector

To demonstrate the triage workflow, the following two use cases highlight how the dashboards accelerate root-cause identification.

Live observability

Use live dashboards to find the root cause of performance slowdowns in a long-running AI workload. Observing changes on the dashboards and correlating job-level degradations with underlying NCCL or network-layer metrics enables targeted triage based on where the anomaly originates. To demonstrate this strategy, the team ran a large LLM pretraining job.

Timeline A: Normal workflow

Figure 5 shows the AllGather bus bandwidth for the mixed network + NVLink collectives in one of the experiments. The compute performance for this AI pretraining workload was ~310 TFLOP/s per GPU.

Figure 5. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for mixed network + NVLink communicators during a normal AI pretraining workflow on four nodes, corresponding to an observed compute performance of ~310 TFLOP/s per GPU. The 42–43 MB (32-rank) series reads 375 GB/s, with larger messages around 350 GB/s and mid-range messages around 150 GB/s.

Timeline B: Network-induced slowdown

After artificial network constraints were introduced, the AllGather bus bandwidth for mixed network + NVLink collectives dropped, and compute performance decreased to ~268 TFLOP/s per GPU (~13% degradation versus baseline).

This example shows that a real-time dashboard improves observability of collective performance across mixed transport communicators (network + NVLink), enabling faster root cause identification and reducing mean time to resolution.

Figure 6. Grafana time-series dashboard showing NCCL AllGather bus bandwidth (GB/s) for mixed network + NVLink communicators during Timeline B, a network-induced slowdown scenario. The 42–43 MB (32-rank) series reads 303 GB/s, down from 375 GB/s in the normal workflow (Figure 5).

Performance attribution

Another use case is performance attribution: NCCL Inspector helps analyze performance degradation over a specific time period. For example, in one experiment, throughput degraded temporarily, as shown in the job log:

[2026-03-19 14:39:47.098640] -> throughput per GPU: ~314 TFLOP/s/GPU
[2026-03-19 14:40:48.696103] -> throughput per GPU: ~295 TFLOP/s/GPU
[2026-03-19 14:42:00.816450] -> throughput per GPU: ~289 TFLOP/s/GPU
[2026-03-19 14:44:02.304347] -> throughput per GPU: ~311 TFLOP/s/GPU

Next, the observed degradation is examined to determine whether it correlates with a network anomaly during this period.
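One hedged way to make such dips easier to spot is to smooth the mixed-transport bandwidth with a moving average in the Grafana panel; the label filters below follow the metric schema shown earlier and are assumptions:

# 1-minute moving average of mixed network + NVLink ReduceScatter bandwidth
avg_over_time(nccl_bus_bandwidth_gbs{collective="ReduceScatter", n_nodes!="1"}[1m])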

Figure 7. Grafana time-series dashboard showing NCCL ReduceScatter bus bandwidth (GB/s) for NVLink-only communicators during the performance attribution investigation on 2026-03-19. All message-size series remain stable across the window.

Figure 8. Grafana time-series dashboard showing NCCL ReduceScatter bus bandwidth (GB/s) for mixed network + NVLink communicators during the same window. All series dip between approximately 14:40 and 14:45, dropping from around 125 GB/s to roughly 114–116 GB/s, before recovering toward baseline.

The dashboards show that the degradation appears only in mixed transport communication (network + NVLink-based collectives), while NVLink-only collectives remain stable. This correlation indicates that the root cause is disruption or congestion in the network, and it enables drilling down into per-host and network counters to isolate where the slowdown occurred.

Next steps for real-time observability

NCCL Inspector's Prometheus integration is designed to enhance network observability for AI workloads. This combination enables a more scientific approach to performance analysis: users can debug and understand the real-time performance characteristics of a running workload, triage slowdowns, fine-tune parameters, and measure the resulting performance changes using detailed metrics.

Get started

Refer to the GitHub README.md to:

Build and deploy the NCCL Inspector plugin in Prometheus mode.

Configure the Prometheus exporter to expose metrics for your cluster/environment (see the scrape configuration sketch after this list).

Use the Grafana template to set up the Grafana dashboard.
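For the exporter step, a minimal Prometheus scrape configuration might look like the following sketch; the job name and targets are placeholders for your node exporter endpoints (default port 9100):

scrape_configs:
  - job_name: "nccl-inspector-nodes"
    static_configs:
      - targets: ["node01:9100", "node02:9100"]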

Acknowledgments

We would like to thank our NVIDIA colleagues Nikhithkumar Kotagari, Giuseppe Congi, and Nishank Chandawala, and Ziyang Jia from the University of California, Riverside, for valuable input and reviews during the design process.

