MRC Protocol Redefines AI Supercomputer Networking

OpenAI, alongside NVIDIA, AMD, Intel, and Broadcom, has introduced a new networking protocol called MRC, designed to address one of the largest bottlenecks in large-scale AI training: GPU compute wasted by network congestion and failures.

The protocol was recently deployed across OpenAI's frontier-model training supercomputers, including Oracle Cloud Infrastructure (OCI) facilities in Abilene, Texas, and Microsoft's Fairwater supercomputer environment. According to OpenAI, MRC enabled operators to reboot Tier-1 core switches during active model training without disrupting workloads, a task that previously required extensive operational coordination and risk mitigation.

As AI clusters continue scaling toward hundreds of thousands of GPUs, networking efficiency has become as critical as raw compute performance. MRC represents a major architectural shift aimed at improving fault tolerance, reducing congestion, simplifying routing complexity, and maximizing GPU utilization.

🚀 Why AI Supercomputers Need a New Networking Model

Modern frontier-model training involves millions of synchronized data transfers during every training step. Even minor network delays can propagate across the cluster and leave thousands of GPUs idle while waiting for synchronization.
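
The reason is the synchronous nature of collective operations such as all-reduce: a training step completes only when the slowest transfer finishes, so a single congested link stalls every participant. A minimal sketch of that barrier effect, with invented latency numbers:

```python
# Synchronous collectives finish at the pace of the slowest worker.
# All numbers are invented for illustration.
transfer_ms = [2.0] * 9_999 + [50.0]   # one transfer hit by congestion

step_ms = max(transfer_ms)             # barrier: everyone waits
ideal_ms = sum(transfer_ms) / len(transfer_ms)

print(f"step: {step_ms:.1f} ms vs ideal {ideal_ms:.1f} ms")
# One slow transfer out of 10,000 inflates the step by roughly 25x.
```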

The primary causes of these delays include:

  • Network congestion
  • Packet loss
  • Link failures
  • Switch instability
  • Dynamic routing convergence delays

As clusters scale, these failure modes compound and become far harder to manage. Traditional Ethernet architectures struggle to maintain deterministic performance under such extreme load.

OpenAI concluded that improving AI infrastructure efficiency required a fundamental redesign of the networking stack rather than incremental optimizations.

🔧 What Is MRC?

MRC is a next-generation transport protocol built on top of RoCE (RDMA over Converged Ethernet). It extends Ethernet-based GPU communication using technologies derived from:

  • RoCEv2
  • SRv6 (IPv6 Segment Routing)
  • Ultra Ethernet Consortium (UEC) networking concepts

The protocol enables a single transmission stream to use hundreds of network paths in parallel instead of relying on a single fixed route.

Key design goals include:

  • Fast fault avoidance
  • Congestion-aware routing
  • High path diversity
  • Reduced network control-plane complexity
  • Stable synchronization across massive GPU clusters

Unlike traditional AI networking approaches, MRC allows packets to arrive out of order: each packet carries its own memory destination information, so payloads can be written directly into memory as they arrive.

This removes the strict in-order delivery requirement that constrains conventional Ethernet transport models.
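
The mechanics resemble RDMA-style tagged writes. Below is a minimal sketch of out-of-order placement, assuming a hypothetical self-describing packet format rather than MRC's actual wire format:

```python
import random

# Hypothetical self-describing packet: (destination offset, payload).
# Because each packet names its own memory location, arrival order
# does not matter and no reorder queue is needed.
message = b"gradient shard produced during one training step"
packets = [(off, message[off:off + 8]) for off in range(0, len(message), 8)]
random.shuffle(packets)   # scattered across many paths -> out-of-order arrival

buf = bytearray(len(message))
for offset, payload in packets:
    buf[offset:offset + len(payload)] = payload   # direct placement into memory

assert bytes(buf) == message
```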

🌐 Multi-Plane Networking Architecture

One of MRCโ€™s most important innovations is its support for multi-plane networking.

Instead of treating an 800Gb/s network interface as one monolithic connection, MRC divides it into multiple smaller independent links. For example:

  • One 800Gb/s NIC
  • Split into eight 100Gb/s network planes
  • Connected to eight separate switches simultaneously

This architecture dramatically increases path diversity and network scalability.
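
As a rough sketch of the idea, plane assignment amounts to striping one NIC's packets across independent fabrics. The round-robin policy below is a simplification standing in for whatever congestion-aware scheme the real hardware applies:

```python
# One 800Gb/s NIC striped across 8 independent 100Gb/s planes, each
# cabled to a different switch. Plane and switch names are invented.
PLANES = 8
plane_switches = [f"plane-{p}-switch" for p in range(PLANES)]

def plane_for(packet_seq: int) -> int:
    return packet_seq % PLANES

for seq in range(10):
    print(f"packet {seq} -> {plane_switches[plane_for(seq)]}")
```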

Benefits of Multi-Plane Design

The approach enables:

  • Higher switch port density
  • Fewer switch tiers
  • Lower power consumption
  • Better traffic localization
  • Increased redundancy
  • Improved fault isolation

OpenAI states that using MRC, a fully interconnected network supporting roughly 131,000 GPUs can be constructed using only two Ethernet switch tiers.

Traditional 800Gb/s architectures often require three or four tiers to reach comparable scale.
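
The arithmetic behind the two-tier claim works out if you assume 51.2Tb/s-class switch ASICs with 512 × 100Gb/s ports; that radix is an assumption for illustration, not a detail OpenAI has published:

```python
# Back-of-the-envelope check of the two-tier claim, assuming 51.2Tb/s
# switch ASICs with 512 x 100Gb/s ports (an assumption, not a stated
# MRC deployment detail).
radix = 512                  # 100Gb/s ports per switch

down_ports = radix // 2      # leaf ports facing NICs (non-blocking split)
max_leaves = radix           # each spine port reaches one leaf
gpus_per_plane = down_ports * max_leaves

# Every GPU contributes one 100Gb/s link to each plane, so the per-plane
# endpoint count equals the cluster-wide GPU count.
print(f"{gpus_per_plane:,} GPUs with two switch tiers")   # 131,072
```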

⚡ Packet Scattering Across Hundreds of Paths

Traditional AI networking protocols typically pin a flow to a single path to preserve packet ordering.

MRC eliminates this restriction.

Instead, packets from a single transfer are scattered dynamically across hundreds of available paths spanning multiple network planes.

Why This Matters

This design enables:

  • Better utilization of available bandwidth
  • Dynamic congestion avoidance
  • Higher aggregate throughput
  • Reduced synchronization jitter
  • Faster fault recovery

Each MRC connection maintains state information about available paths. If congestion or packet loss is detected on one route, traffic immediately shifts to alternative paths.

This transition occurs in microseconds rather than seconds.
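
A toy model of that per-connection path state appears below. Path names and health flags are invented; in real deployments this logic runs in NIC hardware, which is what makes the microsecond-scale reaction achievable:

```python
# Toy model of per-connection path state; everything here is invented
# for illustration.
paths = {f"path-{i}": {"healthy": True, "congested": False} for i in range(256)}

def usable(name: str) -> bool:
    p = paths[name]
    return p["healthy"] and not p["congested"]

def pick_path(seq: int) -> str:
    candidates = [n for n in paths if usable(n)]
    return candidates[seq % len(candidates)]   # spray across all good paths

# Loss detected on one route: stop selecting it immediately and
# retransmit the affected packets over the remaining paths.
paths["path-17"]["healthy"] = False
assert all(pick_path(seq) != "path-17" for seq in range(512))
```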

🛡️ Fault Handling and Packet Truncation

MRC treats packet loss aggressively.

When loss occurs, the protocol assumes the path may be faulty and immediately stops using it. Potentially lost packets are retransmitted while the system probes the failed route to determine whether the issue is temporary or persistent.

Packet Truncation Mechanism

One of the more innovative features is packet truncation.

Under congestion conditions, instead of dropping an entire packet, switches strip away the payload and forward only the packet header to the destination.

This behavior:

  • Triggers explicit retransmission
  • Preserves signaling information
  • Reduces false-positive failure detection
  • Prevents congestion events from being mistaken for link failures

Combined with packet scattering and multi-plane routing, this enables extremely fast recovery behavior that minimizes disruption to synchronized training workloads.
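
The sketch below illustrates the truncation idea using hypothetical packet and switch structures; the real mechanism runs in the switch data plane at line rate:

```python
# Header-only forwarding under congestion. All structures are
# hypothetical; this is a sketch of the behavior, not the switch code.
def switch_forward(packet: dict, queue_depth: int, limit: int = 100) -> dict:
    if queue_depth <= limit:
        return packet
    # Congested: drop the payload but deliver the header, so the
    # receiver learns exactly which packet to request again instead of
    # waiting on a timeout or declaring the link dead.
    return {"hdr": packet["hdr"], "payload": None, "truncated": True}

def receive(packet: dict, resend_queue: list) -> None:
    if packet.get("truncated"):
        resend_queue.append(packet["hdr"]["seq"])   # explicit retransmit request

resend: list = []
pkt = {"hdr": {"seq": 42, "offset": 4096}, "payload": b"x" * 4096, "truncated": False}
receive(switch_forward(pkt, queue_depth=250), resend)
print(resend)   # [42]
```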

🧠 Simplifying the Network Control Plane with SRv6

MRC also reduces dependency on dynamic routing protocols such as BGP.

Traditional large-scale Ethernet fabrics rely heavily on dynamic routing convergence mechanisms, which introduce significant software complexity and operational risk.

Instead, MRC adopts SRv6 source routing.

How SRv6 Works in MRC

With SRv6:

  • The sender embeds the complete forwarding path into the packet itself
  • Switches simply forward packets according to static routing rules
  • No dynamic route recalculation is required during failures

When a packet reaches a switch:

  1. The switch checks whether its identifier appears in the route list
  2. It removes its own identifier
  3. The next-hop identifier becomes active
  4. The packet continues using static forwarding tables

This significantly simplifies switch behavior and eliminates entire classes of routing instability.

If a path fails, MRC simply stops selecting it without requiring network-wide convergence events.
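
In sketch form, source routing reduces each switch to a pop-and-forward step. The segment identifiers below are invented; real SRv6 encodes segments as IPv6 addresses in a segment routing header:

```python
# Pop-and-forward sketch of SRv6-style source routing with made-up
# segment identifiers.
def forward(packet: dict, switch_id: str) -> str | None:
    segments = packet["segments"]
    assert segments[0] == switch_id, "packet arrived at the wrong hop"
    segments.pop(0)                           # this hop consumes its identifier
    return segments[0] if segments else None  # next hop, or local delivery

pkt = {"segments": ["leaf-3", "spine-9", "leaf-7", "gpu-node-120"]}
hop: str | None = "leaf-3"
while hop is not None:
    hop = forward(pkt, hop)   # static tables only; no route recalculation
```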

๐Ÿญ Industry Collaboration Behind MRC

The MRC ecosystem represents a rare collaboration between major AI infrastructure vendors.

NVIDIA

NVIDIA validated and optimized MRC on its Spectrum-X Ethernet platform.

The company highlighted MRCโ€™s microsecond-scale failure bypass capabilities as critical for synchronized GPU training environments.

AMD

AMD contributed congestion control technologies and previously developed a pre-standard enhanced RoCEv2 transport implementation that evolved into MRC.

AMD also confirmed support for MRC on its 400G and upcoming Pensando Vulcano 800G AI NICs.

Broadcom

Broadcom integrated MRC support into its Thor Ultra 800Gb/s Ethernet platform.

Its programmable data path architecture supports:

  • Advanced congestion control
  • Reliable transmission
  • Adaptive load balancing
  • Multi-plane traffic management

Intel

Intel stated that MRC enables ultra-large-scale Ethernet cluster deployment while reducing:

  • Switch hierarchy depth
  • Power consumption
  • Operational complexity

📈 Why MRC Matters for Future AI Infrastructure

As frontier AI models continue scaling, network efficiency is rapidly becoming the defining factor in usable compute performance.

GPU count alone no longer guarantees faster training.

The real challenge is ensuring that tens of thousands of accelerators remain synchronized despite:

  • Hardware failures
  • Congestion events
  • Maintenance operations
  • Traffic imbalance
  • Routing instability

MRC directly targets these limitations.

By combining:

  • Multi-plane networking
  • Parallel path utilization
  • Hardware-based fault bypass
  • Packet scattering
  • SRv6 source routing
  • Simplified control planes

the protocol dramatically improves resilience and GPU utilization in ultra-large AI clusters.

🔍 Conclusion

MRC represents a significant evolution in Ethernet-based AI networking.

Rather than treating failures as rare edge cases, the protocol assumes large-scale clusters will constantly experience congestion, packet loss, and hardware disruptions. Its architecture is designed to absorb these events without interrupting synchronized training workloads.

For hyperscale AI infrastructure, this shift is critical.

The collaboration between OpenAI, NVIDIA, AMD, Intel, and Broadcom also signals a broader industry trend: networking efficiency is becoming one of the most important competitive battlegrounds in AI supercomputing.

As training clusters push toward hundreds of thousands of GPUs, protocols like MRC may become foundational technologies enabling the next generation of frontier AI systems.
