
Inside the 100,000-GPU NVIDIA H100 xAI Colossus Data Center


With special access granted by Elon Musk and the xAI team, Patrick Kennedy from ServeTheHome (STH) recently toured the interior of the Colossus Data Center, one of the world’s largest AI training clusters. His photos and videos offer the first detailed public look inside this massive NVIDIA-powered supercomputer.

💻 Liquid-Cooled Supermicro HGX Servers

At the heart of Colossus are Supermicro 4U liquid-cooled servers built on the NVIDIA HGX H100 platform.

What are HGX, MGX, and DGX?

  • MGX — modular platform for OEM server builders
  • HGX — used in hyperscale deployments, built by ODMs like Supermicro
  • DGX — fully integrated NVIDIA-branded systems

Because Colossus is a hyperscale AI cluster, xAI uses HGX servers.


Rack Layout and Scale

  • Each 4U server contains 8 NVIDIA H100 GPUs.
  • A single rack holds 8 servers → 64 H100 GPUs.
  • Eight racks form a pod → 512 H100 GPUs.
  • Roughly 200 pods bring the cluster to nearly 100,000 H100 GPUs.
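
The scale math follows directly from those per-unit counts. A minimal back-of-envelope sketch (the ~100,000 total is an approximation, as in the article):

```python
# Back-of-envelope GPU count for Colossus, using the figures above
gpus_per_server = 8    # NVIDIA HGX H100 platform: 8 GPUs per 4U server
servers_per_rack = 8
racks_per_pod = 8
pods = 200             # approximate

gpus_per_rack = gpus_per_server * servers_per_rack   # 64
gpus_per_pod = gpus_per_rack * racks_per_pod         # 512
total_gpus = gpus_per_pod * pods                     # 102,400 ≈ 100,000

print(gpus_per_rack, gpus_per_pod, total_gpus)
```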

These Supermicro systems are purpose-built for liquid cooling, not converted from air-cooled designs. Components sit on removable trays, allowing maintenance without sliding out the whole chassis.

Liquid Cooling

At the rear of each server:

  • Four redundant power supplies
  • Three-phase power distribution
  • 400GbE network links
  • A 1U coolant manifold connected to a bottom-mounted CDU (Coolant Distribution Unit) with redundant pumps

💾 High-Density Flash Storage

Colossus uses Supermicro NVMe storage systems with dense 2.5-inch NVMe SSD bays. This aligns with recent reports of Tesla purchasing large volumes of enterprise SSDs from SK Hynix (Solidigm).

As AI cluster sizes grow, storage architectures are shifting to all-flash, offering:

  • Higher performance
  • Substantial power savings
  • Greater density
  • Better TCO at hyperscale

Even though cost per petabyte is higher, the efficiency gains are significant at this scale.

🌐 Ethernet at Hyperscale: Spectrum-X Instead of InfiniBand

While most AI supercomputers still rely on InfiniBand, xAI chose NVIDIA Spectrum-X Ethernet, which delivers:

  • Strong scalability
  • Lower deployment and maintenance cost
  • High bandwidth and low latency
  • Intelligent congestion control

Network Architecture Highlights

  • Spectrum SN5600 switches with up to 800 Gb/s ports
  • 400GbE BlueField-3 SuperNIC for each GPU (RDMA-enabled)
  • Additional 400GbE NIC for the CPU
  • Total per-server bandwidth: 3.6 Tbps
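
The 3.6 Tbps figure follows directly from the link counts above; a minimal sketch:

```python
# Per-server network bandwidth from the NIC counts above
gpu_nics = 8          # one 400GbE BlueField-3 SuperNIC per GPU
cpu_nics = 1          # additional 400GbE NIC for the CPU
link_gbps = 400

total_gbps = (gpu_nics + cpu_nics) * link_gbps
print(total_gbps / 1000, "Tbps per server")   # 3.6 Tbps
```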

xAI built three separate networks:

  1. GPU network — RDMA, high-priority traffic
  2. CPU network — general compute and management
  3. Storage network — optimized for NVMe flash

The result: extremely high throughput without packet loss during massive model training workloads.

Patrick highlighted that a single 400GbE connection already exceeds the total PCIe bandwidth of high-end CPUs from just a few years ago—and each server has nine such links.
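
As a rough sanity check on that comparison (the PCIe generation and lane count here are assumptions, not figures from the article):

```python
# Rough comparison: one 400GbE link vs. aggregate PCIe bandwidth of an older high-end CPU.
# Assumptions (not from the article): PCIe 3.0 at ~0.985 GB/s per lane, 48 lanes per CPU.
link_400gbe = 400 / 8                        # ≈ 50 GB/s per direction
pcie3_per_lane = 0.985                       # GB/s
cpu_lanes = 48
cpu_pcie_total = pcie3_per_lane * cpu_lanes  # ≈ 47 GB/s

print(f"400GbE link: {link_400gbe:.0f} GB/s, PCIe 3.0 x48: {cpu_pcie_total:.1f} GB/s")
```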

NVIDIA noted that during Grok training, they saw:

  • 0% packet loss
  • 95% sustained throughput under heavy load (versus ~60% on traditional Ethernet)

⚡ Power Stability with Tesla Megapacks
#

Outside the facility, rows of Tesla Megapack batteries stabilize the power supply.
Because Colossus’s power usage can change dramatically within milliseconds—beyond what the grid or diesel generators can tolerate—Megapacks act as energy buffers, smoothing spikes and dips to protect the GPUs.
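
A toy illustration of that buffering idea (hypothetical numbers, not xAI's actual control scheme): the grid follows the slow-moving average demand while the Megapacks absorb the fast swings.

```python
# Toy illustration only: a battery buffer covering the gap between a rapidly
# fluctuating cluster load and a slow-moving grid feed (all numbers hypothetical).
cluster_load_mw = [80, 150, 60, 140, 70, 155, 65]              # fast load swings
grid_supply_mw = sum(cluster_load_mw) / len(cluster_load_mw)   # grid tracks the average

for load in cluster_load_mw:
    battery_mw = load - grid_supply_mw   # >0: battery discharges, <0: battery charges
    print(f"load={load:5.1f} MW  grid={grid_supply_mw:5.1f} MW  battery={battery_mw:+6.1f} MW")
```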


Source: “Inside the 100K GPU xAI Colossus Cluster that Supermicro Helped Build for Elon Musk” — ServeTheHome
