MRC Protocol Redefines AI Supercomputer Networking

OpenAI, alongside NVIDIA, AMD, Intel, and Broadcom, has introduced a new networking protocol called MRC, designed to address one of the largest bottlenecks in large-scale AI training: GPU compute wasted by network congestion and failures.

The protocol was recently deployed across OpenAI's frontier-model training supercomputers, including Oracle Cloud Infrastructure (OCI) facilities in Abilene, Texas, and Microsoft's Fairwater supercomputer environment. According to OpenAI, MRC enabled operators to reboot Tier-1 core switches during active model training without disrupting workloads, a task that previously required extensive operational coordination and risk mitigation.

As AI clusters continue scaling toward hundreds of thousands of GPUs, networking efficiency has become as critical as raw compute performance. MRC represents a major architectural shift aimed at improving fault tolerance, reducing congestion, simplifying routing complexity, and maximizing GPU utilization.

🚀 Why AI Supercomputers Need a New Networking Model

Modern frontier-model training involves millions of synchronized data transfers during every training step. Even minor network delays can propagate across the cluster and leave thousands of GPUs idle while waiting for synchronization.
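
The reason is the synchronous nature of collective operations such as all-reduce: a training step completes only when the slowest transfer finishes, so a single congested link stalls every participant. A minimal sketch of that barrier effect, with invented latency numbers:

```python
# Synchronous collectives finish at the pace of the slowest worker.
# All numbers are invented for illustration.
transfer_ms = [2.0] * 9_999 + [50.0]   # one transfer hit by congestion

step_ms = max(transfer_ms)             # barrier: everyone waits
ideal_ms = sum(transfer_ms) / len(transfer_ms)

print(f"step: {step_ms:.1f} ms vs ideal {ideal_ms:.1f} ms")
# One slow transfer out of 10,000 inflates the step by roughly 25x.
```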

The primary causes of these delays include:

  • Network congestion
  • Packet loss
  • Link failures
  • Switch instability
  • Dynamic routing convergence delays

As clusters scale, these failure modes compound and become far harder to manage. Traditional Ethernet architectures struggle to maintain deterministic performance under such extreme load.

OpenAI concluded that improving AI infrastructure efficiency required a fundamental redesign of the networking stack rather than incremental optimizations.

🔧 What Is MRC?

MRC is a next-generation transport protocol built on top of RoCE (RDMA over Converged Ethernet). It extends Ethernet-based GPU communication using technologies derived from:

  • RoCEv2
  • SRv6 (IPv6 Segment Routing)
  • Ultra Ethernet Consortium (UEC) networking concepts

The protocol enables a single transmission stream to use hundreds of network paths in parallel instead of relying on a single fixed route.

Key design goals include:

  • Fast fault avoidance
  • Congestion-aware routing
  • High path diversity
  • Reduced network control-plane complexity
  • Stable synchronization across massive GPU clusters

Unlike traditional AI networking approaches, MRC allows packets to arrive out of order: each packet carries its own memory destination information, so payloads can be written directly into memory as they arrive.

This removes the strict in-order delivery requirement that constrains conventional Ethernet transport models.
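
The mechanics resemble RDMA-style tagged writes. Below is a minimal sketch of out-of-order placement, assuming a hypothetical self-describing packet format rather than MRC's actual wire format:

```python
import random

# Hypothetical self-describing packet: (destination offset, payload).
# Because each packet names its own memory location, arrival order
# does not matter and no reorder queue is needed.
message = b"gradient shard produced during one training step"
packets = [(off, message[off:off + 8]) for off in range(0, len(message), 8)]
random.shuffle(packets)   # scattered across many paths -> out-of-order arrival

buf = bytearray(len(message))
for offset, payload in packets:
    buf[offset:offset + len(payload)] = payload   # direct placement into memory

assert bytes(buf) == message
```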

🌐 Multi-Plane Networking Architecture

One of MRCโ€™s most important innovations is its support for multi-plane networking.

Instead of treating an 800Gb/s network interface as one monolithic connection, MRC divides it into multiple smaller independent links. For example:

  • One 800Gb/s NIC
  • Split into eight 100Gb/s network planes
  • Connected to eight separate switches simultaneously

This architecture dramatically increases path diversity and network scalability.
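
As a rough sketch of the idea, plane assignment amounts to striping one NIC's packets across independent fabrics. The round-robin policy below is a simplification standing in for whatever congestion-aware scheme the real hardware applies:

```python
# One 800Gb/s NIC striped across 8 independent 100Gb/s planes, each
# cabled to a different switch. Plane and switch names are invented.
PLANES = 8
plane_switches = [f"plane-{p}-switch" for p in range(PLANES)]

def plane_for(packet_seq: int) -> int:
    return packet_seq % PLANES

for seq in range(10):
    print(f"packet {seq} -> {plane_switches[plane_for(seq)]}")
```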

Benefits of Multi-Plane Design

The approach enables:

  • Higher switch port density
  • Fewer switch tiers
  • Lower power consumption
  • Better traffic localization
  • Increased redundancy
  • Improved fault isolation

OpenAI states that using MRC, a fully interconnected network supporting roughly 131,000 GPUs can be constructed using only two Ethernet switch tiers.

Traditional 800Gb/s architectures often require three or four tiers to reach comparable scale.
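
The arithmetic behind the two-tier claim works out if you assume 51.2Tb/s-class switch ASICs with 512 × 100Gb/s ports; that radix is an assumption for illustration, not a detail OpenAI has published:

```python
# Back-of-the-envelope check of the two-tier claim, assuming 51.2Tb/s
# switch ASICs with 512 x 100Gb/s ports (an assumption, not a stated
# MRC deployment detail).
radix = 512                  # 100Gb/s ports per switch

down_ports = radix // 2      # leaf ports facing NICs (non-blocking split)
max_leaves = radix           # each spine port reaches one leaf
gpus_per_plane = down_ports * max_leaves

# Every GPU contributes one 100Gb/s link to each plane, so the per-plane
# endpoint count equals the cluster-wide GPU count.
print(f"{gpus_per_plane:,} GPUs with two switch tiers")   # 131,072
```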

⚡ Packet Scattering Across Hundreds of Paths

Traditional AI networking protocols typically pin a flow to a single path to preserve packet ordering.

MRC eliminates this restriction.

Instead, packets from a single transfer are scattered dynamically across hundreds of available paths spanning multiple network planes.

Why This Matters

This design enables:

  • Better utilization of available bandwidth
  • Dynamic congestion avoidance
  • Higher aggregate throughput
  • Reduced synchronization jitter
  • Faster fault recovery

Each MRC connection maintains state information about available paths. If congestion or packet loss is detected on one route, traffic immediately shifts to alternative paths.

This transition occurs in microseconds rather than seconds.
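
A toy model of that per-connection path state appears below. Path names and health flags are invented; in real deployments this logic runs in NIC hardware, which is what makes the microsecond-scale reaction achievable:

```python
# Toy model of per-connection path state; everything here is invented
# for illustration.
paths = {f"path-{i}": {"healthy": True, "congested": False} for i in range(256)}

def usable(name: str) -> bool:
    p = paths[name]
    return p["healthy"] and not p["congested"]

def pick_path(seq: int) -> str:
    candidates = [n for n in paths if usable(n)]
    return candidates[seq % len(candidates)]   # spray across all good paths

# Loss detected on one route: stop selecting it immediately and
# retransmit the affected packets over the remaining paths.
paths["path-17"]["healthy"] = False
assert all(pick_path(seq) != "path-17" for seq in range(512))
```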

🛡️ Fault Handling and Packet Truncation

MRC treats packet loss aggressively.

When loss occurs, the protocol assumes the path may be faulty and immediately stops using it. Potentially lost packets are retransmitted while the system probes the failed route to determine whether the issue is temporary or persistent.

Packet Truncation Mechanism

One of the more innovative features is packet truncation.

Under congestion conditions, instead of dropping an entire packet, switches strip away the payload and forward only the packet header to the destination.

This behavior:

  • Triggers explicit retransmission
  • Preserves signaling information
  • Reduces false-positive failure detection
  • Prevents congestion events from being mistaken for link failures

Combined with packet scattering and multi-plane routing, this enables extremely fast recovery behavior that minimizes disruption to synchronized training workloads.
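
The sketch below illustrates the truncation idea using hypothetical packet and switch structures; the real mechanism runs in the switch data plane at line rate:

```python
# Header-only forwarding under congestion. All structures are
# hypothetical; this is a sketch of the behavior, not the switch code.
def switch_forward(packet: dict, queue_depth: int, limit: int = 100) -> dict:
    if queue_depth <= limit:
        return packet
    # Congested: drop the payload but deliver the header, so the
    # receiver learns exactly which packet to request again instead of
    # waiting on a timeout or declaring the link dead.
    return {"hdr": packet["hdr"], "payload": None, "truncated": True}

def receive(packet: dict, resend_queue: list) -> None:
    if packet.get("truncated"):
        resend_queue.append(packet["hdr"]["seq"])   # explicit retransmit request

resend: list = []
pkt = {"hdr": {"seq": 42, "offset": 4096}, "payload": b"x" * 4096, "truncated": False}
receive(switch_forward(pkt, queue_depth=250), resend)
print(resend)   # [42]
```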

🧠 Simplifying the Network Control Plane with SRv6

MRC also reduces dependency on dynamic routing protocols such as BGP.

Traditional large-scale Ethernet fabrics rely heavily on dynamic routing convergence mechanisms, which introduce significant software complexity and operational risk.

Instead, MRC adopts SRv6 source routing.

How SRv6 Works in MRC

With SRv6:

  • The sender embeds the complete forwarding path into the packet itself
  • Switches simply forward packets according to static routing rules
  • No dynamic route recalculation is required during failures

When a packet reaches a switch:

  1. The switch checks whether its identifier appears in the route list
  2. It removes its own identifier
  3. The next-hop identifier becomes active
  4. The packet continues using static forwarding tables

This significantly simplifies switch behavior and eliminates entire classes of routing instability.

If a path fails, MRC simply stops selecting it without requiring network-wide convergence events.
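
In sketch form, source routing reduces each switch to a pop-and-forward step. The segment identifiers below are invented; real SRv6 encodes segments as IPv6 addresses in a segment routing header:

```python
# Pop-and-forward sketch of SRv6-style source routing with made-up
# segment identifiers.
def forward(packet: dict, switch_id: str) -> str | None:
    segments = packet["segments"]
    assert segments[0] == switch_id, "packet arrived at the wrong hop"
    segments.pop(0)                           # this hop consumes its identifier
    return segments[0] if segments else None  # next hop, or local delivery

pkt = {"segments": ["leaf-3", "spine-9", "leaf-7", "gpu-node-120"]}
hop: str | None = "leaf-3"
while hop is not None:
    hop = forward(pkt, hop)   # static tables only; no route recalculation
```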

๐Ÿญ Industry Collaboration Behind MRC

The MRC ecosystem represents a rare collaboration between major AI infrastructure vendors.

NVIDIA

NVIDIA validated and optimized MRC on its Spectrum-X Ethernet platform.

The company highlighted MRCโ€™s microsecond-scale failure bypass capabilities as critical for synchronized GPU training environments.

AMD

AMD contributed congestion control technologies and previously developed a pre-standard enhanced RoCEv2 transport implementation that evolved into MRC.

AMD also confirmed support for MRC on its 400G and upcoming Pensando Vulcano 800G AI NICs.

Broadcom

Broadcom integrated MRC support into its Thor Ultra 800Gb/s Ethernet platform.

Its programmable data path architecture supports:

  • Advanced congestion control
  • Reliable transmission
  • Adaptive load balancing
  • Multi-plane traffic management

Intel

Intel stated that MRC enables ultra-large-scale Ethernet cluster deployment while reducing:

  • Switch hierarchy depth
  • Power consumption
  • Operational complexity

📈 Why MRC Matters for Future AI Infrastructure

As frontier AI models continue scaling, network efficiency is rapidly becoming the defining factor in usable compute performance.

GPU count alone no longer guarantees faster training.

The real challenge is ensuring that tens of thousands of accelerators remain synchronized despite:

  • Hardware failures
  • Congestion events
  • Maintenance operations
  • Traffic imbalance
  • Routing instability

MRC directly targets these limitations.

By combining:

  • Multi-plane networking
  • Parallel path utilization
  • Hardware-based fault bypass
  • Packet scattering
  • SRv6 source routing
  • Simplified control planes

the protocol dramatically improves resilience and GPU utilization in ultra-large AI clusters.

🔍 Conclusion

MRC represents a significant evolution in Ethernet-based AI networking.

Rather than treating failures as rare edge cases, the protocol assumes large-scale clusters will constantly experience congestion, packet loss, and hardware disruptions. Its architecture is designed to absorb these events without interrupting synchronized training workloads.

For hyperscale AI infrastructure, this shift is critical.

The collaboration between OpenAI, NVIDIA, AMD, Intel, and Broadcom also signals a broader industry trend: networking efficiency is becoming one of the most important competitive battlegrounds in AI supercomputing.

As training clusters push toward hundreds of thousands of GPUs, protocols like MRC may become foundational technologies enabling the next generation of frontier AI systems.
