en.Wedoany.com Reported - On May 6, 2026, U.S.-based OpenAI announced a collaboration with AMD, Broadcom, Intel, Microsoft, and NVIDIA to officially release a new open network protocol, "Multipath Reliable Connection" (MRC), targeting two prevalent bottlenecks in hyperscale AI training clusters: idle GPU compute and network congestion.
In its official technical blog, OpenAI gave the direct motivation for developing the protocol: "Network congestion and link and device failures are the most common sources of transmission latency and jitter. As cluster scale increases, these problems occur more frequently and become harder to solve." When training large models, a single step can involve millions of data-synchronization transfers between GPUs, and a single delay can leave large numbers of GPUs waiting. MRC dynamically distributes the data flow of a single RDMA connection across hundreds of network paths and uses SRv6 source routing to encode forwarding decisions directly into the packet header. When a link fails or congests, traffic is rerouted within microseconds, significantly reducing training interruptions and idle compute.
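The core idea described above, spraying one connection's packets across many paths, with each packet carrying its own SRv6-style forwarding decision, and a failed path being bypassed automatically, can be sketched roughly as follows. This is an illustrative simulation only, not MRC's actual implementation (which runs in NIC and switch hardware); the names `Path` and `spray_packets` are hypothetical.

```python
# Illustrative sketch of multipath packet spraying with SRv6-style source routing.
# Hypothetical names; the real MRC data path lives in 800G NIC/switch hardware.

from dataclasses import dataclass
from itertools import cycle

@dataclass
class Path:
    segments: list          # SRv6-style segment list encoded into each packet header
    healthy: bool = True    # flipped to False when the path fails or congests

def spray_packets(packets, paths):
    """Distribute packets of a single connection round-robin over healthy paths."""
    healthy = [p for p in paths if p.healthy]
    if not healthy:
        raise RuntimeError("no healthy paths available")
    assignments = []
    rr = cycle(healthy)
    for pkt in packets:
        path = next(rr)
        # The forwarding decision travels with the packet itself (source routing),
        # so rerouting needs no per-switch state updates.
        assignments.append((pkt, tuple(path.segments)))
    return assignments

# Four candidate paths through different spine switches; path 2 has failed.
paths = [Path(segments=[f"spine{i}", "leaf-dst"]) for i in range(4)]
paths[2].healthy = False

out = spray_packets(range(8), paths)
used = {seg for _, seg in out}
print(sorted(used))  # the failed path's segment list never appears
```

The point of the sketch is the state location: because each header carries its full segment list, rerouting is a local sender-side decision rather than a network-wide routing update, which is what makes microsecond-scale failover plausible.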
The depth of industry collaboration on this protocol is also noteworthy. AMD contributed congestion control technology to MRC and has already deployed it on 400G network cards, enabling a seamless transition to its Pensando "Vulcano" 800G AI NIC. NVIDIA, for the first time, validated and optimized MRC on its Spectrum-X Ethernet, where its fault-bypass technology can detect path failures within microseconds and reroute traffic automatically in hardware. Broadcom's Thor Ultra 800Gbps Ethernet NIC added MRC support, providing foundational hardware for multi-plane AI network architectures. OpenAI has publicly released the MRC protocol under an open license through the Open Compute Project (OCP), meaning any cloud service provider or enterprise can adopt the technology for free.
Sachin Katti, Head of Industrial Computing at OpenAI, stated publicly on NVIDIA's official blog: "The strong collaboration with NVIDIA made the deployment of MRC on the Blackwell generation very successful." Gilad Shainer, Senior Vice President of NVIDIA's Networking division, pointed out that MRC's deployment on Spectrum-X Ethernet has already helped multiple hyperscale customers improve the efficiency and reliability of large-scale training.
The deployment pace is advancing in sync with model iteration. MRC has been fully deployed across all of OpenAI's large supercomputers used for training frontier models, including the Oracle Cloud Infrastructure site in Abilene, Texas, USA, and Microsoft's Fairwater supercomputer cluster. These clusters host the next-generation model training tasks for products like ChatGPT and Codex. MRC is currently built into the latest 800Gb/s network interfaces, deeply integrated with NVIDIA Spectrum-X Ethernet, and has been validated and optimized on the Blackwell GPU architecture.
The OpenAI team cited a typical case in their technical documentation: recently, while training a frontier large model for ChatGPT and Codex, the engineering team needed to restart four top-tier core switches, an operation that under traditional network architectures usually requires extremely careful coordination with the operations team. With MRC's multipath and fast-rerouting mechanisms in place, they completed the restarts without even coordinating in advance with the cluster's training teams, and the training tasks were not substantively affected.
The protocol is built on top of traditional RoCEv2 (RDMA over Converged Ethernet), which has three well-known shortcomings at scale. RoCEv2 supports only a single network path per connection, leaving the multipath topology inside data centers underutilized; when a packet is lost, its Go-Back-N recovery mechanism retransmits every subsequent packet in the window, adding network overhead; and in large clusters, the lossless-network scheme based on Priority Flow Control (PFC) can cause congestion spreading and head-of-line blocking. MRC addresses these shortcomings one by one: multipath load balancing, selective retransmission in place of Go-Back-N, and explicit routing control via SRv6 together form a network transport layer designed for giga-scale AI factories.
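The difference between Go-Back-N recovery and selective retransmission can be made concrete with a small back-of-the-envelope sketch: for a single lost packet inside a window, Go-Back-N resends the lost packet plus everything after it, while selective retransmission resends only the one missing packet. The function names and window size here are illustrative, not from the MRC specification.

```python
# Hedged sketch comparing retransmission cost for a single packet loss.
# Go-Back-N is RoCEv2's recovery scheme per the article; values are illustrative.

def go_back_n_retransmits(window_size, lost_index):
    """Go-Back-N resends the lost packet and every later packet in the window."""
    return window_size - lost_index

def selective_retransmits(window_size, lost_index):
    """Selective retransmission resends only the packet that was actually lost."""
    return 1

WINDOW = 64
lost = 3  # the 4th packet (index 3) in a 64-packet window is dropped

gbn = go_back_n_retransmits(WINDOW, lost)
sel = selective_retransmits(WINDOW, lost)
print(gbn, sel)  # 61 vs 1 packets re-sent for a single loss
```

The gap widens the earlier in the window the loss occurs and the larger the window grows, which is why Go-Back-N becomes disproportionately expensive on high-bandwidth, long-window 800G links.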
In the comments section of OpenAI's official social media accounts, multiple industry practitioners described MRC as "a genuine infrastructure advancement," while others noted that it marks a shift in AI infrastructure competition from simply stacking GPUs towards standardizing cluster communication efficiency. As model parameter counts continue to climb towards the trillion level, the network layer has become the third key variable constraining training efficiency, after compute power and storage. The open release of MRC gives the entire industry a reusable underlying network framework.
This article is compiled by Wedoany. All AI citations must indicate the source as "Wedoany". If there is any infringement or other issues, please notify us promptly, and we will modify or delete it accordingly. Email: news@wedoany.com