NVIDIA Vera CPU Sets a New Standard for Agentic Workloads in AI Factories


Each wave of AI has created a new scaling law. Pretraining scaled intelligence through larger datasets, more parameters, and massively parallel GPU systems. Post-training scaled usefulness through instruction tuning, while GPU capacity was rebalanced toward generative inference. Test-time scaling improved reasoning by letting models generate more tokens to think before answering.

Now, agentic AI and reinforcement learning scale actions. Models take more steps, call more tools, run more evaluations, and interact with execution environments to perform tasks.

This post explains how NVIDIA Vera CPUs help AI factories scale agentic AI and reinforcement learning by shortening CPU execution time, increasing task throughput, and improving overall AI factory output, enabling smarter, longer-thinking agents.

Figure 1. CPU execution becomes part of the AI loop. The CPU executes model-generated content (tool calls, sandboxed code, retrieval, data processing) to provide new context for the next step.

Why CPUs matter more in the agentic era

GPUs remain essential for model inference and training. But across agentic AI, reinforcement learning, and data-intensive AI services, much of the execution surrounding the model runs on CPUs, such as:

- Sandboxed code and tool execution
- Data retrieval and data processing
- Results computation
- Scheduling and orchestration

The loop works like this:

1. A prompt (from a user, from reasoning tokens, or from a previous turn’s result) kicks off generation: “I should compile and run hello.c.”
2. The GPU generates the parameters of the tool call to be performed on the CPU: gcc -o hello hello.c ; ./hello
3. The CPU executes the tool call and produces a result. During reinforcement learning, the result is fed back to the GPUs to update weights; for an agent, it is used to generate the next prompt: Output: ‘Hello, world!’ (exit code 0, success)
4. The GPU generates reasoning tokens prompted by the result: “Hmm! It looks like that worked!”
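As an illustration of the CPU half of this loop, here is a minimal Python sketch. The helper names (`execute_tool_call`, `agent_step`, `fake_generate`) are hypothetical, the model is replaced by a stub, and the stub runs a Python one-liner instead of `gcc` so the example works without a C toolchain. A plain subprocess with a timeout is not a real sandbox; the sketch only shows where CPU execution sits between two generation steps.

```python
import subprocess
import sys


def execute_tool_call(argv: list[str], timeout_s: float = 5.0) -> dict:
    """CPU side: run one model-generated tool call as a child process.

    A subprocess with a timeout is NOT a real sandbox; production systems add
    isolation such as containers and resource limits. Raises
    subprocess.TimeoutExpired if the call runs longer than timeout_s.
    """
    proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
    return {"stdout": proc.stdout, "returncode": proc.returncode}


def agent_step(generate, context: str) -> str:
    """One turn of the loop: generate a tool call, execute it, return new context."""
    argv = generate(context)          # GPU side: model emits tool-call parameters
    result = execute_tool_call(argv)  # CPU side: execute the call
    status = "success" if result["returncode"] == 0 else "failure"
    # The formatted result becomes context for the next generation step
    return f"Output: {result['stdout'].strip()} (exit code {result['returncode']}, {status})"


def fake_generate(context: str) -> list[str]:
    """Hypothetical stand-in for the model. It runs a Python one-liner instead
    of gcc so the sketch works anywhere Python is installed."""
    return [sys.executable, "-c", "print('Hello, world!')"]


if __name__ == "__main__":
    print(agent_step(fake_generate, "I should compile and run hello.c."))
```

In a real agent, the returned string would be appended to the model's context, and the next generation step would start from it.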

As agents become more capable, they take more steps, call more tools, and run more checks. CPU time accumulates with every step of a request.
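Because the steps of a request run in sequence, per-step CPU time adds up linearly with the number of steps. The following is a back-of-the-envelope sketch that assumes no overlap between generation and tool execution. The step count and timings are made up for illustration.

```python
def request_latency(steps: int, gpu_s: float, cpu_s: float) -> tuple[float, float]:
    """End-to-end latency of a request whose steps run strictly one after another.

    Each step waits for generation (gpu_s) and then for tool execution (cpu_s);
    no overlap between the two is assumed. Returns (total seconds, CPU share).
    """
    total = steps * (gpu_s + cpu_s)
    return total, steps * cpu_s / total


# Hypothetical numbers: 40 steps, 0.5 s of generation and 0.5 s of CPU work each.
baseline, cpu_share = request_latency(40, 0.5, 0.5)  # 40.0 s, half of it on the CPU
faster_cpu, _ = request_latency(40, 0.5, 0.25)       # halving CPU time per step: 30.0 s
```

In this hypothetical case, shaving 0.25 s of CPU time per step saves 10 s per request, and the saving grows with every additional step the agent takes.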

This makes the CPU part of the critical path. It’s no longer just a host processor feeding the GPU. It shapes latency, accelerator utilization, and AI factory output per watt and per dollar.

For the last decade, much of the data center CPU market has optimized for cloud economics: more cores, more virtual machines, and lower cost per core. That remains important for general-purpose cloud services, but per-core performance has not improved at the same rate as core counts.

The end of Moore’s law compounds this problem, limiting generation-over-generation CPU performance gains even as GPU architectures and workloads benefited from a continuous cycle of co-optimization.

AI factories shift the metric from cores per dollar to tokens per dollar—from how many CPU cores a data center can rent, to how much AI output it can produce.

This demands a new CPU design point for AI factories:

- High core counts to run thousands of concurrent agents, RL environments, sandboxes, and services.
- High per-core performance, because each agentic step is gated by sequential execution.
- Energy-efficient memory bandwidth to keep data moving without turning CPU infrastructure into a bottleneck.

Figure 2. AI creates a need for a new CPU

The NVIDIA Vera CPU: Built for AI agents

The NVIDIA Vera CPU is designed for the reality of modern workloads, with fast per-core performance, high concurrency, and power-efficient memory bandwidth to keep the AI factory moving.

The Vera CPU combines 88 NVIDIA Olympus cores with up to 1.2 TB/s of LPDDR5X memory bandwidth to keep cores fed through tool calls, sandboxed execution of both native code and languages like Python or JavaScript, data retrieval, data processing, and orchestration.
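Dividing the quoted peak bandwidth by the core count gives a rough per-core budget when every core is busy. This sketch assumes an even split, which real workloads will not produce exactly. The 90% efficiency factor is the sustained-bandwidth figure the post cites for Vera under load.

```python
PEAK_BW_BYTES_PER_S = 1.2e12  # up to 1.2 TB/s of LPDDR5X bandwidth (from the text)
CORES = 88                    # NVIDIA Olympus cores per Vera CPU (from the text)


def per_core_bandwidth_gbs(peak_bw: float, cores: int, efficiency: float = 1.0) -> float:
    """Socket bandwidth split evenly across busy cores, in GB/s.

    efficiency is the fraction of peak actually sustained under load.
    """
    return peak_bw * efficiency / cores / 1e9


at_peak = per_core_bandwidth_gbs(PEAK_BW_BYTES_PER_S, CORES)             # ~13.6 GB/s per core
at_90_percent = per_core_bandwidth_gbs(PEAK_BW_BYTES_PER_S, CORES, 0.9)  # ~12.3 GB/s per core
```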

The key requirement is fast per-core performance, sustained at all times. Unlike in cloud virtual machine deployments, CPU sockets in an AI factory stay fully loaded, doing the work of many concurrent agents. Cores that stay fast under high system load shorten task completion time, delivering results sooner while freeing resources to serve the next request.
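The link between faster cores and higher throughput follows from Little's law. At steady state, concurrency equals throughput times latency, so with a fixed number of busy slots, cutting per-task latency raises the number of tasks completed per second. The same arithmetic applies to RL evaluations completed within a training window. The core count comes from the text. Treating each core as one task slot and the specific latencies are hypothetical simplifications.

```python
def throughput_per_s(concurrent_slots: int, task_latency_s: float) -> float:
    """Little's law (L = lambda * W) solved for throughput: lambda = L / W."""
    return concurrent_slots / task_latency_s


def tasks_per_window(concurrent_slots: int, task_latency_s: float, window_s: float) -> float:
    """Completed tasks (e.g. RL evaluations) in a fixed window at steady state."""
    return throughput_per_s(concurrent_slots, task_latency_s) * window_s


# Hypothetical: one task per core on 88 cores, 60 s window.
slow = tasks_per_window(88, 2.0, 60.0)  # 2,640 tasks
fast = tasks_per_window(88, 1.5, 60.0)  # about 3,520 tasks, a third more
```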

For agents, this means lower latency across multistep requests. For reinforcement learning, this means more completed evaluations and more data from each training window, helping models reach a higher quality bar faster. For AI factories, fast cores keep accelerators from waiting on orchestration, tool execution, or data movement.

Delivering this requires the core, memory subsystem, and fabric to be designed together for branch-heavy code, high-bandwidth data movement, and predictable performance under load.

This starts with the NVIDIA custom Olympus core inside the Vera CPU.

Figure 3. The Vera CPU is built for the agentic design point, combining custom Olympus cores, a 1.2 TB/s LPDDR5X memory subsystem, and the NVIDIA Scalable Coherency Fabric (SCF).

NVIDIA Olympus core and memory subsystem

The NVIDIA Olympus core delivers up to 50% higher IPC than NVIDIA Grace, combining a wide front end, advanced branch prediction, deep out-of-order instruction scheduling, and specialized memory prefetching to sustain high throughput on branch-heavy, memory-sensitive agentic code.
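IPC gains translate into execution time through the standard CPU performance equation: time = instructions / (IPC × frequency). The following quick check uses hypothetical numbers and holds instruction count and clock frequency fixed. It shows that 50% higher IPC cuts execution time by about a third, not by half.

```python
def exec_time_s(instructions: float, ipc: float, freq_hz: float) -> float:
    """Classic CPU performance equation: time = instructions / (IPC * frequency)."""
    return instructions / (ipc * freq_hz)


# Hypothetical workload and clock; only the IPC ratio (1.5x) comes from the text.
base = exec_time_s(1e9, ipc=2.0, freq_hz=3.0e9)
faster = exec_time_s(1e9, ipc=3.0, freq_hz=3.0e9)
ratio = faster / base  # 2/3: about 33% less time, not 50% less
```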

Olympus uses a neural branch predictor to reduce stalls in branch-heavy code. Combined with other prediction mechanisms, it can sustain two taken branches per cycle with zero penalty, maintaining throughput for deep software stacks such as PyTorch, graph workloads, and scripting engines.
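The post does not describe how the Olympus neural predictor works. To illustrate the general idea, the sketch below implements the classic perceptron branch predictor from the research literature (Jiménez and Lin, 2001). That predictor treats a branch prediction as the sign of a dot product between learned weights and recent branch history. This is a textbook model, not NVIDIA's design. It omits details such as weight saturation and making two predictions per cycle.

```python
class PerceptronPredictor:
    """Textbook global-history perceptron branch predictor (Jimenez and Lin, 2001).

    Simplifications: table indexed by pc modulo table size, unbounded integer
    weights (hardware uses small saturating weights), one prediction at a time.
    """

    def __init__(self, history_len: int = 16, table_size: int = 256):
        self.table_size = table_size
        # Row layout: weights[row][0] is the bias; weights[row][1:] pair with history bits
        self.weights = [[0] * (history_len + 1) for _ in range(table_size)]
        self.history = [1] * history_len  # recent outcomes, +1 = taken, -1 = not taken
        # Training threshold theta = floor(1.93 * h + 14) from the original paper
        self.threshold = int(1.93 * history_len + 14)

    def _output(self, pc: int) -> int:
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * xi for wi, xi in zip(w[1:], self.history))

    def predict(self, pc: int) -> bool:
        """Predict taken when the dot product is non-negative."""
        return self._output(pc) >= 0

    def update(self, pc: int, taken: bool) -> None:
        """Train on the resolved outcome, then shift it into the global history."""
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction, or when the prediction was correct but not confident
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w = self.weights[pc % self.table_size]
            w[0] += t
            for i, xi in enumerate(self.history):
                w[i + 1] += t * xi
        self.history = self.history[1:] + [t]


def accuracy(pred: PerceptronPredictor, pc: int, outcomes: list[bool], warmup: int) -> float:
    """Fraction of correct predictions after the first `warmup` branches."""
    hits = 0
    for i, taken in enumerate(outcomes):
        if i >= warmup and pred.predict(pc) == taken:
            hits += 1
        pred.update(pc, taken)
    return hits / (len(outcomes) - warmup)


# An alternating taken/not-taken branch, as in a loop that toggles a flag.
alternating = [i % 2 == 0 for i in range(300)]
```

The weights learn a correlation with each individual history bit. As a result, the predictor picks up patterns like the alternating branch, which depend on earlier outcomes rather than on a single fixed bias.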

Olympus also includes a 10-wide decode unit and a deep out-of-order engine designed to sustain high instructions per cycle. Large buffers and advanced instruction scheduling help the core maintain forward progress as code paths, dependencies, and memory access patterns shift.

Sustaining high IPC under load requires keeping the cores fed with data. The Vera CPU delivers up to 1.2 TB/s of LPDDR5X memory bandwidth and sustains over 90% of that peak under load. It also offers 40% lower peak memory latency than x86 CPUs, keeping Olympus cores fed on time through retrieval, analytics, sandbox execution, and orchestration.
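Bandwidth and latency are connected by Little's law: to sustain bandwidth B with latency L, roughly B × L bytes must be in flight at once. Lower latency therefore reduces the number of outstanding cache-line requests the cores need to keep the memory system busy. The 200 ns baseline latency and the 64-byte line size below are assumptions, because the post gives only the relative 40% figure.

```python
CACHE_LINE_BYTES = 64  # assumed line size; the text does not state one


def lines_in_flight(bandwidth_bytes_per_s: float, latency_s: float,
                    line_bytes: int = CACHE_LINE_BYTES) -> float:
    """Little's law for memory: outstanding bytes = bandwidth * latency."""
    return bandwidth_bytes_per_s * latency_s / line_bytes


sustained_bw = 0.9 * 1.2e12                       # 90% of 1.2 TB/s (from the text)
baseline = lines_in_flight(sustained_bw, 200e-9)  # hypothetical 200 ns latency: 3,375 lines
lower = lines_in_flight(sustained_bw, 120e-9)     # 40% lower latency: 2,025 lines
```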

Olympus also adds a novel graph prefetcher built for the indirect memory access patterns common in graph analytics and agent memory traversal. Combined with high per-core memory bandwidth, it helps Vera CPUs deliver more than 3x the performance of x86-based architectures on graph traversal workloads.
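The post does not detail how the graph prefetcher works. Using hypothetical addresses, the sketch below shows why indirect accesses of the form A[idx[i]] defeat a simple stride prefetcher. It also shows how an idealized indirect prefetcher could instead read idx ahead of the loop to compute future targets. It illustrates the access pattern only, not the Olympus mechanism.

```python
def demand_addresses(a_base: int, elem_size: int, idx: list[int]) -> list[int]:
    """Addresses loaded by `for i in range(n): use(A[idx[i]])`.

    idx itself is read sequentially, but A is touched at data-dependent
    addresses, as when BFS checks visited[u] for each neighbor u.
    """
    return [a_base + elem_size * j for j in idx]


def indirect_prefetch_targets(a_base: int, elem_size: int, idx: list[int], distance: int) -> list[int]:
    """Idealized indirect prefetcher: while the loop is at i, read idx[i + distance]
    early and compute the A address that iteration i + distance will need."""
    return [a_base + elem_size * idx[i + distance] for i in range(len(idx) - distance)]


def strides(addrs: list[int]) -> list[int]:
    """Differences between consecutive addresses; a stride prefetcher needs these constant."""
    return [b - a for a, b in zip(addrs, addrs[1:])]


# Hypothetical neighbor list (a CSR col_idx slice) and a visited[] array at 0x10000, 8-byte entries.
neighbors = [7, 2, 9, 0, 5, 3, 8, 1, 4]
demand = demand_addresses(0x10000, 8, neighbors)
ahead = indirect_prefetch_targets(0x10000, 8, neighbors, distance=2)
```

The demand addresses have no constant stride, but each prefetch target matches the address needed two iterations later. That lookahead is what lets an indirect prefetcher hide memory latency on graph traversals.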

The NVIDIA Scalable Coherency Fabric (SCF) connects all cores and a unified cache across a monolithic mesh, delivering predictable latency and 50% faster core-to-core data movement than CPUs that fragment compute across dies. For reinforcement learning and agentic AI, that predictability helps sustain evaluation loops under full load.

Together, the Olympus core, NVIDIA SCF, and LPDDR5X memory subsystem enable the Vera CPU to deliver more than 1.8x higher sandbox performance across agentic workloads under full load compared with x86-based architectures, as shown in Figure 4.

Figure 4. The Vera CPU delivers industry-leading agentic sandbox performance: more than 1.8x higher under peak load than x86-based architectures, across workloads including code compilation, code analysis, and Python.

System efficiency

Beyond performance, agentic AI places increasing pressure on infrastructure efficiency. As AI factories scale to thousands of CPUs, memory power can become a major contributor to platform power, cooling demand, and operating cost.

The Vera CPU pairs its architecture with high-bandwidth SOCAMM LPDDR5X memory to reduce memory power compared with traditional DDR server designs. The LPDDR5X subsystem typically consumes less than 30 watts, compared with well over 100 watts for DDR5 configurations. MRDIMM-based systems can drive memory power even higher.
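At fleet scale, the per-socket difference adds up. The estimate below uses the post's bounds as conservative inputs. The fleet size and the PUE (power usage effectiveness) value used to account for cooling are hypothetical, and PUE is assumed to stay constant.

```python
def memory_power_savings_kw(sockets: int, ddr5_w: float, lpddr5x_w: float, pue: float = 1.0) -> float:
    """Memory power saved across a fleet, in kW.

    pue (power usage effectiveness) optionally scales IT power to facility
    power, covering cooling overhead. Its default of 1.0 ignores cooling.
    """
    return sockets * (ddr5_w - lpddr5x_w) * pue / 1000


# Conservative bounds from the text: DDR5 "well over" 100 W, LPDDR5X "less than" 30 W.
per_1000 = memory_power_savings_kw(1000, 100, 30)           # at least 70 kW of IT power
with_cooling = memory_power_savings_kw(1000, 100, 30, 1.3)  # hypothetical PUE of 1.3: ~91 kW
```

DDR5 configurations draw well over 100 W and LPDDR5X less than 30 W. The real per-socket saving is therefore larger than the 70 W used here.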

With a configurable 250 W to 450 W TDP range, the Vera CPU reduces combined CPU and memory subsystem power while delivering the bandwidth needed for agentic inference and reinforcement learning environments. For AI factories, this translates into better performance per watt, lower operating costs, and more efficient use of power and cooling infrastructure.

The AI factory CPU for agents

The era of agentic AI requires a shift in CPU design—from maximizing cores per dollar to maximizing AI factory output per watt and per dollar. NVIDIA Vera CPU is the CPU for agents, combining fast per-core performance, high concurrency, and power-efficient memory bandwidth. With the custom Olympus core, LPDDR5X memory, and NVIDIA Scalable Coherency Fabric, Vera CPU delivers more than 1.8x higher agentic sandbox performance than traditional x86 architectures, helping AI factories complete more tool calls, return more evaluations, and keep accelerators moving.

Learn more about the Vera CPU, the NVIDIA Vera Rubin NVL2, and the Vera CPU benchmarking by Phoronix.

Relative performance is based on measured data and is subject to change. NVIDIA Vera CPU (with LPDDR5X) performance is baselined to the latest x86 CPU.
