AI Inference Computing Power of Top Five North American Cloud Service Providers Expected to Grow 122% Year-on-Year; NVIDIA GB and Vera Rubin Rack-Scale Deployments Accelerate

2026-05-20 15:54

en.Wedoany.com Reported - Leading North American cloud service providers (CSPs) are accelerating the rack-scale expansion of their AI infrastructure, pushing inference computing power into a phase of explosive growth. According to industry monitoring data from May 2026, the NVIDIA GB and Vera Rubin rack-scale solutions deployed by Microsoft, Google, Amazon, Meta, and Oracle are expected to raise the five companies' total AI training computing power by more than 56% year-on-year and their total AI inference computing power by approximately 122%. Inference computing power would thus more than double within a single year, signaling a structural shift in AI computing from "training-driven" to "inference-dominated."
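As a back-of-the-envelope check on these figures, the sketch below converts the reported year-on-year growth rates into multipliers on the prior year's level. The function name `growth_multiplier` is illustrative, and because the report gives no absolute baseline, only relative changes can be derived.

```python
def growth_multiplier(yoy_percent: float) -> float:
    """Convert a year-on-year growth rate (in percent) into a multiplier on the prior year."""
    return 1.0 + yoy_percent / 100.0


# Growth rates as reported. Training is stated as "over 56%",
# so its multiplier here is a lower bound.
training = growth_multiplier(56.0)
inference = growth_multiplier(122.0)

# Change in the inference-to-training capacity ratio versus the prior year.
# Since the training figure is a lower bound, this ratio is an upper bound.
relative_shift = inference / training

print(f"training:  x{training:.2f}")
print(f"inference: x{inference:.2f}")
print(f"inference-to-training ratio changes by up to x{relative_shift:.2f}")
```

A 122% increase corresponds to roughly 2.2 times the prior year's inference computing power, which is the sense in which it "more than doubles."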

The NVIDIA Vera Rubin rack-scale system is the core variable driving this explosion in inference computing power. The Vera Rubin NVL72 is a fully liquid-cooled rack that combines 72 Rubin GPUs and 36 Vera CPUs. It is scheduled to enter trial production in June this year, with initial deliveries to the five top-tier North American CSPs named above to begin in July. A single NVL72 rack uses a sixth-generation NVLink copper backplane for high-speed intra-rack interconnect, supporting inference on trillion-parameter models with million-token context windows. It can also be co-deployed with the Groq 3 LPX inference acceleration rack, cutting the inference cost per million tokens in agentic AI scenarios to one-tenth that of the Blackwell architecture. Mass manufacturing is handled by Foxconn, Quanta, and Wistron, with concentrated shipments expected to commence in Q3 2026.

Data released by NVIDIA at GTC 2026 corroborates this generational leap in architectural efficiency. When running inference-intensive models such as Kimi-K2-Thinking, the Rubin NVL72 achieves an inference cost per million tokens that is only one-tenth that of the Blackwell GB200 NVL72; when training Mixture-of-Experts models, Rubin can require up to three-quarters fewer GPUs. At approximately $180 million per NVL72 rack, CSPs are buying higher compute density and energy efficiency to handle continuously expanding inference workloads. The Groq 3 LPX inference acceleration rack, delivered concurrently as dedicated inference hardware, forms a layered training-inference deployment architecture with the NVL72.

The growing volume of rack-scale procurement is reshaping supply and demand for data center power infrastructure. The GB and Vera Rubin rack-scale solutions deployed by the five major North American CSPs in 2026 are expected to account for over 60% of global demand for these NVIDIA products. AWS plans to add over 1 million NVIDIA GPUs across its global cloud regions starting in 2026, covering both the Blackwell and Rubin architecture generations; Meta has likewise announced a multi-year strategic partnership with NVIDIA to deploy millions of Blackwell and Rubin GPUs. The concurrent volume ramp-up across three major platforms (NVIDIA, AMD, and CSPs' custom ASICs) is projected to drive a 116% year-on-year increase in total AI server power consumption for the five major CSPs, making data center power infrastructure a hard constraint on computing expansion.

In 2026, AI training servers still account for approximately 55% of AI server shipments, but in the medium to long term, AI inference servers will become the market mainstay. Underlying this trend is the rapid differentiation of AI application scenarios. The training segment is concentrated on the pre-training and fine-tuning of a few ultra-large models; it shows significant scale effects, but its growth rate is stabilizing. The inference segment follows the penetration of large models into end-user applications, from agentic AI and conversational assistants to real-time code generation. Its workloads are distributed, highly concurrent, and around-the-clock, so its computing power demand curve keeps steepening. NVIDIA, with rack-level solutions like the GB300 and Vera Rubin, integrates GPU, CPU, and LPU computing power into unified delivery units, simultaneously meeting the high-throughput demands of training and the low-latency demands of inference.

Capital expenditure by North American CSPs on AI infrastructure continues to accelerate and expand. Microsoft has raised its 2026 capital expenditure guidance to $190 billion, a year-on-year increase of approximately 130%; Google has raised its guidance to a range of $180 billion to $190 billion, an increase of over 100%; Meta has raised its guidance to $125 billion to $145 billion, an increase of about 85%; AWS's full-year capital expenditure is expected to exceed $230 billion, an increase of over 50%. The combined 2026 capital expenditure forecast for the world's top nine CSPs has been revised upward to approximately $830 billion, with the annual growth rate adjusted from 61% to 79%, heavily focused on high-performance GPU cluster construction, custom ASIC chip development, and next-generation data centers supporting high-power computing.
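The guidance figures and growth rates above imply prior-year baselines, which the following minimal sketch backs out. The helper `implied_prior_year` is a hypothetical name, the results are derived from the rounded figures in this report rather than from reported 2025 numbers, and the revision estimate assumes the earlier 61% forecast was measured against the same baseline.

```python
def implied_prior_year(current: float, yoy_percent: float) -> float:
    """Back out the prior-year value implied by a current value and its YoY growth rate."""
    return current / (1.0 + yoy_percent / 100.0)


# Figures from the report, in billions of US dollars.
microsoft_2025 = implied_prior_year(190.0, 130.0)
top_nine_2025 = implied_prior_year(830.0, 79.0)

# Assumption: the earlier 61% growth forecast used the same prior-year baseline.
top_nine_earlier_forecast = top_nine_2025 * (1.0 + 61.0 / 100.0)
upward_revision = 830.0 - top_nine_earlier_forecast

print(f"Microsoft implied prior-year capex: ~${microsoft_2025:.0f}B")
print(f"Top nine implied prior-year capex:  ~${top_nine_2025:.0f}B")
print(f"Top nine earlier 2026 forecast:     ~${top_nine_earlier_forecast:.0f}B")
print(f"Implied upward revision:            ~${upward_revision:.0f}B")
```

Because the growth rates are rounded (and several are stated only as "over" or "about"), these values are approximations intended for sanity-checking the reported numbers against one another.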

2026 is becoming the critical inflection year for the large-scale commercial deployment of AI inference. Since ChatGPT ignited the generative AI wave, the supply chain's center of gravity has moved through several phases, from GPU hoarding and the construction of 10,000-GPU clusters to the large-model race. The main axis of computing power expansion is now shifting from "can it be trained" to "can it run, run sustainably, and run cost-effectively." The large-scale deployment of inference is reshaping the entire supply chain, from chip design and server architecture to data center power systems. As shipments of Vera Rubin rack-scale solutions accelerate in the second half of the year, the penetration curve of AI inference is expected to steepen further.

This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com
