AI is now essential infrastructure, powered by AI factories that generate intelligence in the form of tokens. As demand grows, these factories must scale faster, operate more efficiently, and lower the cost of intelligence across the five-layer stack: energy, chips, infrastructure, models, and applications.
The NVIDIA DSX platform provides the complete playbook for designing, simulating, building, and operating AI factories, aligning every layer of the stack across compute, software, facilities, and partner technologies through a common co-designed architecture.
The DSX platform now includes DSX OS software to accelerate AI factory deployments and improve operational efficiency. DSX OS includes open source, modular software components and related NVIDIA technologies purpose-built for operating and scaling multi-tenant AI factories.
Together, DSX OS components enable NVIDIA DSX’s AI factory ecosystem to adopt the latest in agentic AI infrastructure software across the full stack, improving tokens per watt and lowering token cost, accelerating deployment, and strengthening operational reliability and resiliency.
Figure 1. NVIDIA DSX OS software in the DSX platform. DSX OS provides the open source software for AI factory operations.
Why DSX OS matters to the AI factory ecosystem
To bring real value to their operators, AI factories must maximize the number of tokens they produce per watt consumed.
Achieving this requires the complex network of components that run AI workloads at scale across data centers to work in close coordination: chips; systems; facilities infrastructure such as building management controls, cooling, and power distribution units; the power grid; the software and partner technologies that run all of these; and the AI platforms and services on top.
DSX OS software is designed for this entire ecosystem of components and provides a comprehensive set of open and extensible technologies and capabilities that can be integrated and adopted into existing platforms and software.
These capabilities have been designed and optimized around a common architecture, enabling all of the components involved to work together to deliver on three main outcomes that drive AI factory economics:
1) Faster time to revenue
NVIDIA builds and operates infrastructure and platform software on NVIDIA DGX Cloud, and now this software is being released as open source. NVIDIA ecosystem partners can leverage these components to deliver AI services rather than rebuild from scratch, eliminating months of custom development.
2) Better efficiency
Power is the limiting factor in an AI factory, and DSX treats power and grid behavior as part of the platform rather than as a facilities concern separated from the rest of the AI infrastructure. With DSX software, AI factories can run up to 40% more GPUs at peak energy efficiency within a fixed power budget, with minimal impact on inference workload performance.
3) Higher reliability and resiliency
AI factories run continuous large-scale workloads through hardware faults, grid events, and operational changes. DSX OS shifts cluster operations from reactive alerting to automated remediation, keeps runtime versions consistent across regions, and gives operators fleet-wide visibility.
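To make the efficiency outcome above more concrete, here is a back-of-the-envelope sketch of why a lower per-GPU power allocation lets more GPUs fit inside a fixed budget. The 10 MW budget, the 1,000 W and 714 W allocations, and the function name are illustrative assumptions, not DSX measurements or APIs, and this is not how the 40% figure was derived.

```python
def gpus_within_budget(budget_watts: int, per_gpu_watts: int) -> int:
    """Return how many GPUs fit in a fixed power budget at a given per-GPU allocation."""
    if per_gpu_watts <= 0:
        raise ValueError("per_gpu_watts must be positive")
    return budget_watts // per_gpu_watts


# Hypothetical numbers chosen only to illustrate the arithmetic.
BUDGET_W = 10_000_000   # fixed facility budget available to GPUs
NAMEPLATE_W = 1_000     # static allocation per GPU (worst-case provisioning)
CAPPED_W = 714          # per-GPU allocation under a dynamic power policy

baseline = gpus_within_budget(BUDGET_W, NAMEPLATE_W)
capped = gpus_within_budget(BUDGET_W, CAPPED_W)
gain = (capped - baseline) / baseline

print(f"baseline GPUs: {baseline}, capped GPUs: {capped}, gain: {gain:.1%}")
```

With these assumed values, cutting the per-GPU allocation by roughly 29% frees enough headroom for about 40% more GPUs. Whether throughput per GPU holds up at the lower cap depends on the workload, which is why the claim above is scoped to inference.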
How DSX OS enables gigawatt-scale AI factories
The open source, modular components in DSX OS provide the foundational technologies for building and operating AI factories, and are designed to solve challenges unique to operating AI workloads efficiently and reliably at gigawatt scale.
They do so by providing a co-designed set of core capabilities, including (but not limited to) standardized communication, power and efficiency optimization, provisioning and lifecycle operations, health monitoring and remediation, and intelligent platform services.
More details about how DSX OS provides these capabilities follow:
Standardized communication across the data center, enabled for agentic interfaces
An AI factory spans compute, networking, power, and cooling systems that all need to interoperate seamlessly. DSX Exchange bridges these components with an MQTT-based IT/OT communication hub that makes facility-level signals, such as grid events, thermal data, and power anomalies, visible to the software managing the rest of the AI factory. This enables components such as DSX Flex, MaxLPS, and partner software to react to each other’s state in real time, improving coordination and efficiency.
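The publish/subscribe pattern behind an MQTT-based hub can be sketched in a few lines. The example below is an in-process stand-in, not DSX Exchange and not a real MQTT broker: it only implements MQTT-style topic filters (`+` matches one level, `#` matches all remaining levels). The topic names, payload fields, and the `InProcessHub` class are hypothetical.

```python
def topic_matches(topic_filter: str, topic: str) -> bool:
    """Check a topic against an MQTT-style filter ('+' = one level, '#' = the rest)."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for i, level in enumerate(filter_levels):
        if level == "#":
            # '#' is only valid as the final level and matches everything below.
            return i == len(filter_levels) - 1
        if i >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[i]:
            return False
    return len(filter_levels) == len(topic_levels)


class InProcessHub:
    """Toy hub that delivers each published message to every matching subscriber."""

    def __init__(self):
        self._subscriptions = []  # list of (topic_filter, handler) pairs

    def subscribe(self, topic_filter, handler):
        self._subscriptions.append((topic_filter, handler))

    def publish(self, topic, payload) -> int:
        delivered = 0
        for topic_filter, handler in self._subscriptions:
            if topic_matches(topic_filter, topic):
                handler(topic, payload)
                delivered += 1
        return delivered


# Hypothetical usage: a power manager listens for thermal and grid signals.
received = []
hub = InProcessHub()
hub.subscribe("facility/+/thermal", lambda t, p: received.append((t, p)))
hub.subscribe("grid/#", lambda t, p: received.append((t, p)))

hub.publish("facility/hall-a/thermal", {"inlet_c": 31.5})
hub.publish("grid/events/demand-response", {"reduce_kw": 500})
hub.publish("facility/hall-a/access", {"door": "open"})  # no subscriber matches
```

The point of the pattern is decoupling: the facility system publishing a thermal reading does not need to know which software components consume it.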
DSX OS software components across the full DSX stack will also provide MCP servers for provisioning, networking, observability, and more. Using these MCP servers, AI agents can discover the entire operational surface of the factory as a unified tool catalog, enabling them to interface across every system and perform cross-domain correlation. In an agentic AI factory, operators can readily connect a GPU health event to a thermal anomaly, or a network issue to a performance issue, among other scenarios.
Figure 2. DSX Exchange coordinates communication within the AI factory, including grid signals from DSX Flex, facilities-level signals, power policies to and from DSX MaxLPS, provisioning systems like NVIDIA Infra Controller, and more.
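The cross-domain correlation described above can be pictured as a join across event streams that normally live in separate systems. The sketch below does not use the MCP protocol or any DSX API; it only shows the kind of logic an agent might apply after collecting events from different tools. The `Event` fields, source names, and the 60-second window are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    source: str       # which system reported it, e.g. "gpu_health" or "thermal"
    rack: str         # physical location shared across systems
    timestamp: float  # seconds since some common epoch
    detail: str


def correlate(events, source_a, source_b, window_s=60.0):
    """Pair events from two sources that share a rack and occur within window_s seconds."""
    from_a = [e for e in events if e.source == source_a]
    from_b = [e for e in events if e.source == source_b]
    pairs = []
    for ea in from_a:
        for eb in from_b:
            same_rack = ea.rack == eb.rack
            close_in_time = abs(ea.timestamp - eb.timestamp) <= window_s
            if same_rack and close_in_time:
                pairs.append((ea, eb))
    return pairs


# Hypothetical events gathered from two different operational domains.
events = [
    Event("thermal", "rack-12", 1000.0, "inlet temperature above threshold"),
    Event("gpu_health", "rack-12", 1030.0, "GPU clock throttling"),
    Event("gpu_health", "rack-40", 1035.0, "GPU clock throttling"),
    Event("thermal", "rack-12", 5000.0, "inlet temperature above threshold"),
]

pairs = correlate(events, "gpu_health", "thermal")
for gpu_event, thermal_event in pairs:
    print(f"{gpu_event.rack}: '{gpu_event.detail}' follows '{thermal_event.detail}'")
```

Only the rack-12 throttling event is paired with a thermal anomaly here; the rack-40 event has no nearby thermal signal and would need a different explanation.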
Power and efficiency optimization
Static power allocation strands capacity, reactive cooling creates thermal oscillations, and disconnected IT/OT systems make grid events a manual fire drill. DSX MaxLPS includes software that treats power as a programmable resource by dynamically enforcing policies at the GPU, rack, cooling, and workload level, enabling AI factories to recover stranded power to run additional compute at optimal utilization. DSX Flex extends this beyond the factory walls, with libraries for connecting workloads to grid services so AI factories can automatically adapt to demand response, load shedding, and renewable energy availability.
Partners including CoreWeave, Firmus, Lambda, Nscale, and Phaidra are deploying MaxLPS, while Emerald AI, ENGIE, Silicon Valley Power, and UK National Grid are leveraging DSX Flex.
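As a rough illustration of treating power as a programmable resource, the sketch below scales per-GPU power caps proportionally when a rack budget shrinks, for example during a demand-response event. This is a simplified policy written for this example, not the MaxLPS or DSX Flex implementation; the function name, GPU identifiers, wattages, and the floor value are all hypothetical.

```python
def apply_power_budget(requested_w: dict, budget_w: float, floor_w: float) -> dict:
    """Scale per-GPU power caps proportionally so their total fits within budget_w.

    Raises ValueError if any cap would fall below floor_w, signalling that the
    budget cannot be met by capping alone (workloads would have to be moved).
    """
    total = sum(requested_w.values())
    if total <= budget_w:
        return dict(requested_w)  # already within budget, nothing to change
    factor = budget_w / total
    scaled = {gpu: watts * factor for gpu, watts in requested_w.items()}
    if min(scaled.values()) < floor_w:
        raise ValueError("budget cannot be met without dropping below the power floor")
    return scaled


# Hypothetical rack: four GPUs with different requested caps, in watts.
requested = {"gpu0": 1000, "gpu1": 1000, "gpu2": 700, "gpu3": 500}

normal = apply_power_budget(requested, budget_w=3200, floor_w=300)
# Assumed grid event: the rack budget drops from 3200 W to 2400 W.
during_event = apply_power_budget(requested, budget_w=2400, floor_w=300)

print(during_event)
```

Proportional scaling is only one possible policy. A production system would more plausibly weigh workload priority and thermal state, which is the kind of multi-level policy enforcement the paragraph above describes.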
Provisioning and multi-tenant lifecycle operations
At scale, provisioning is a continuous workflow: nodes cycle through tenant assignments, hardware is replaced, and every transition must be auditable and secure. NVIDIA Infra Controller (NICo) makes this programmable with API-driven bare-metal lifecycle management and hardware-enforced tenant isolation through NVIDIA BlueField DPUs and the NVIDIA DOCA Platform Framework. NVIDIA AI Cluster Runtime (AICR) complements this by capturing validated runtime configurations as version-locked recipes, eliminating the configuration drift that causes silent failures across large fleets.
IREN, OpenNebula Systems, Mirantis, Rafay, Red Hat, and Supermicro are among the partners integrating these components.
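The idea of a version-locked recipe can be illustrated with a simple drift check: compare what each node reports against the pinned versions and list every mismatch. The recipe format, component names, and version strings below are invented for this sketch and do not reflect the actual AICR recipe schema.

```python
def find_drift(recipe: dict, observed: dict) -> dict:
    """Return {component: (pinned, actual)} for every component that differs from the recipe.

    A component missing from the node is reported with an actual value of None.
    """
    drift = {}
    for component, pinned in recipe.items():
        actual = observed.get(component)
        if actual != pinned:
            drift[component] = (pinned, actual)
    return drift


# Hypothetical recipe: component names and versions are placeholders.
recipe = {
    "gpu_driver": "1.2.3",
    "container_runtime": "4.5.6",
    "collective_library": "7.8.9",
}

# Hypothetical reports from two nodes in different regions.
fleet = {
    "node-eu-01": {"gpu_driver": "1.2.3", "container_runtime": "4.5.6", "collective_library": "7.8.9"},
    "node-us-07": {"gpu_driver": "1.2.3", "container_runtime": "4.5.5"},
}

report = {name: find_drift(recipe, observed) for name, observed in fleet.items()}
drifted = {name: drift for name, drift in report.items() if drift}
print(drifted)
```

Even a one-patch-level difference on a single node, as on the second node here, is the kind of mismatch that can cause hard-to-diagnose failures in a multi-node job.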
Health monitoring and automation tooling
In a large GPU fleet, hardware degradation is a daily occurrence, and the traditional alert-page-investigate cycle is too slow and manual to minimize the impact on workloads. NVIDIA NVSentinel provides Kubernetes-native GPU fault detection and automated remediation, cordoning unhealthy compute nodes and draining workloads in seconds rather than minutes or hours. NVIDIA Fleet Intelligence provides fleet-wide visibility, integrity verification, and health monitoring across global deployments.
Lambda is an early adopter of Fleet Intelligence.
Figure 3. The NVIDIA Fleet Intelligence dashboard summarizes fleet-wide aggregations of data such as GPU and memory utilization as well as total GPUs in an up state.
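The cordon-and-drain flow mentioned in this section can be sketched as a small simulation. This is not NVSentinel code and does not call the Kubernetes API; in a real cluster, cordoning corresponds to marking a node unschedulable and draining corresponds to evicting its pods. The fault labels, node names, and pod names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    schedulable: bool = True
    pods: list = field(default_factory=list)


# Hypothetical fault labels that this sketch treats as requiring remediation.
FATAL_FAULTS = {"gpu_unreachable", "uncorrectable_memory_error"}


def remediate(nodes, faults):
    """Cordon and drain every node whose reported fault is fatal.

    faults maps node name -> fault label. Returns the evicted pods so a
    scheduler can place them on healthy nodes.
    """
    evicted = []
    for node in nodes:
        if faults.get(node.name) in FATAL_FAULTS:
            node.schedulable = False   # cordon: accept no new workloads
            evicted.extend(node.pods)  # drain: hand running workloads back
            node.pods = []
    return evicted


nodes = [
    Node("gpu-node-1", pods=["train-job-a-0"]),
    Node("gpu-node-2", pods=["infer-svc-b-3", "infer-svc-b-4"]),
    Node("gpu-node-3", pods=["train-job-a-1"]),
]
faults = {"gpu-node-2": "uncorrectable_memory_error", "gpu-node-3": "fan_speed_warning"}

to_reschedule = remediate(nodes, faults)
print(to_reschedule)
```

The non-fatal warning on the third node is deliberately left alone: distinguishing faults that justify eviction from those that only need observation is what keeps automated remediation from causing more disruption than it prevents.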
Intelligent AI workload scheduling and platform services
AI workloads need more than GPU access; they need topology-aware intelligent scheduling, distributed inference, and production APIs. KAI Scheduler and NVIDIA Run:ai provide GPU-aware workload placement with fractional allocation and hierarchical quotas. NVIDIA Dynamo and NVIDIA Grove deliver distributed inference serving with disaggregated prefill/decode and per-stage autoscaling. NVIDIA Cloud Functions (NVCF) ties it together with unified APIs across inference, fine-tuning, and batch workloads with built-in multi-tenancy.
Partners including Aible, Beyond AI, Bhashini, Crusoe, DCAI, Mirantis, Nebius, Rafay, Sarvam, Simplismart, Spectro Cloud, vCluster, Vultr, and Yotta are using many of these components in production.
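Fractional allocation under hierarchical quotas can be illustrated with a small tree where a request succeeds only if it fits within the limit of the requesting team and of every ancestor. This sketch is not the KAI Scheduler or NVIDIA Run:ai API; the `QuotaNode` class, the organization layout, and the GPU limits are assumptions for illustration.

```python
class QuotaNode:
    """A node in a quota hierarchy; usage counts against itself and all ancestors."""

    def __init__(self, name, limit_gpus, parent=None):
        self.name = name
        self.limit_gpus = limit_gpus
        self.parent = parent
        self.used_gpus = 0.0

    def _lineage(self):
        node = self
        while node is not None:
            yield node
            node = node.parent

    def allocate(self, gpus: float) -> bool:
        """Reserve a (possibly fractional) number of GPUs if every level has room."""
        if any(n.used_gpus + gpus > n.limit_gpus for n in self._lineage()):
            return False
        for n in self._lineage():
            n.used_gpus += gpus
        return True


# Hypothetical hierarchy: team limits sum to more than the organization limit,
# so teams can burst but never exceed the shared pool.
org = QuotaNode("org", limit_gpus=8)
team_a = QuotaNode("team-a", limit_gpus=6, parent=org)
team_b = QuotaNode("team-b", limit_gpus=4, parent=org)

results = [
    team_a.allocate(5.5),  # fits team-a (6) and org (8)
    team_b.allocate(3),    # fits team-b (4) but org would reach 8.5
    team_b.allocate(2.5),  # fits both; org is now exactly full
    team_a.allocate(0.5),  # fits team-a (6.0) but org has no headroom left
]
print(results)
```

The second and fourth requests fail even though the requesting team has room, which shows why the check has to walk the whole hierarchy rather than only the leaf quota.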
Getting started
DSX OS components are available on GitHub and designed for incremental adoption and integration with existing software stacks.
Start with the component that addresses your most immediate requirements, and build from there, leveraging the capabilities and technologies provided to accelerate your AI factory deployment and improve operational efficiency.
Some examples are provided below:
IT/OT communications: DSX Exchange
Bare-metal lifecycle management and tenant isolation: NVIDIA Infra Controller and DOCA Platform Framework
Fleet visibility, health, and integrity: NVIDIA Fleet Intelligence
Unified AI inference APIs: NVIDIA Cloud Functions
Review the NVIDIA DSX documentation for more details about all of the components of DSX OS, implementation and reference design guides, quickstarts, and integration guidance.