AI Data Center Reference Architecture Explained: Why Buying GPUs Isn't Enough
A GPU is a component. An AI data center is a system. Here's what connects them.
Whether you're a Bitcoin mining operation evaluating AI compute or an enterprise standing up your first on-premises GPU cluster, you've probably encountered a version of the same question: what does it actually take to stand up a working AI data center?
The answer most people don't expect: the GPUs are the easy part.
A GPU is a component. An AI data center (or an “AI factory,” to use the term the industry is increasingly adopting) is a system. And systems only work when their components are designed, configured, and validated together. That's what AI data center reference architectures are for. They exist to answer the infrastructure question before procurement begins, not after.
TLDR
- A reference architecture is a pre-validated, full-stack blueprint specifying how compute, networking, storage, and software must be configured together to actually run AI workloads.
- Raw GPUs without a validated supporting stack are an inventory position, not an AI cluster.
- NVIDIA has the most complete program; AMD and Intel have their own; OCP is the vendor-neutral standards layer underneath all of them.
- "AI factory" is replacing "AI data center" in vendor terminology — a deliberate reframe from consumption to production.
- For miners evaluating AI/HPC: GPU sourcing is the easy part. Full-stack validation is where most plans stall.
The "100 H100s" Problem
Here's a scenario that comes up more often than it should: an operator secures access to a number of H100 GPUs — say, 100 units — and wants to start planning an AI compute deployment. The hardware is real, the demand signal is real, and the capital commitment is real. Then someone asks the obvious questions.
Are the GPUs already configured as part of a validated system, or are they bare accelerators? What workload are they configured for — training, inference, fine-tuning? Do the other components exist: the host CPU platform, the GPU-to-GPU interconnect, the cluster networking fabric, the shared storage layer, the software orchestration stack?
In most cases, the answer to at least two of those questions is either "we haven't gotten there yet" or "umm, let me check" (followed by a frantic search for information that was glossed over).
A working GPU cluster needs more than GPUs: a host CPU configuration, high-speed GPU-to-GPU interconnect within each node, a cluster networking fabric for node-to-node communication, shared high-throughput storage, a software orchestration layer, and a management plan. The reference architecture is the document that specifies exactly how those components are configured and connected — and validates that they work together before the first workload runs.
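One way to make that list concrete is to treat it as a checklist and see which layers a plan has not yet covered. The sketch below is illustrative only: the layer identifiers, the `DeploymentPlan` class, and the example inputs are hypothetical names chosen for this example, not terms from any vendor's reference architecture.

```python
from dataclasses import dataclass, field

# Layers named in the paragraph above. The identifiers are illustrative
# labels, not terms taken from any vendor's documentation.
REQUIRED_LAYERS = (
    "host_cpu",
    "gpu_interconnect",
    "cluster_fabric",
    "shared_storage",
    "orchestration",
    "management",
)


@dataclass
class DeploymentPlan:
    """Hypothetical record of what an operator has actually secured."""
    gpu_count: int
    secured_layers: set[str] = field(default_factory=set)


def missing_layers(plan: DeploymentPlan) -> list[str]:
    """Return the required layers the plan has not covered, in checklist order."""
    return [layer for layer in REQUIRED_LAYERS if layer not in plan.secured_layers]


# Example: 100 GPUs in hand, but only the host platform and storage are sorted out.
plan = DeploymentPlan(gpu_count=100, secured_layers={"host_cpu", "shared_storage"})
gaps = missing_layers(plan)
print(f"{plan.gpu_count} GPUs secured, {len(gaps)} stack layers still open: {gaps}")
```

A real reference architecture goes much further than presence or absence, since it also validates that the chosen components work together, but even this coarse check shows how far a GPU-only plan is from a deployable cluster.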
What AI Data Center Reference Architecture Actually Means
A reference architecture is a pre-validated, documented blueprint for how the components of an AI data center should be configured to work together reliably at scale. A floor plan is a useful analogy: it's not just a list of materials, but a tested design that accounts for how each layer interacts with every other.
The critical word is "validated." The value isn't the component list — procurement teams can build that themselves. The value is that the design has been tested at scale, integration risks have been identified and resolved, and the resulting AI infrastructure stack has a known performance envelope before anyone signs a purchase order.
What the stack typically covers, written agnostically across vendors:
Compute nodes: GPU or accelerator type, count per node, and SCORN (storage, CPU, OS, RAM, networking) configuration. The CPU-to-GPU ratio and the memory capacity per node determine what workload scale is feasible.
GPU interconnect: how GPUs within a single node communicate with each other, distinct from cluster networking. NVLink (NVIDIA) and AMD Infinity Fabric are the dominant options; interconnect bandwidth determines how effectively large models can be split across multiple GPUs within a node.
Cluster networking: how nodes communicate across the cluster (the east-west fabric). InfiniBand and high-speed Ethernet are the two dominant options. This layer is frequently the bottleneck for distributed training.
Storage: parallel or distributed file systems capable of feeding data to the GPUs fast enough that they don't sit idle at scale. Storage is chronically undersized in first-generation AI data center designs and is often the first place performance gaps appear. Requirements vary by workload, so storage should be sized against the intended use rather than a generic profile.
Software and orchestration: container runtimes, job schedulers, monitoring and observability tooling, and the AI framework stack. Data center management at this layer is what separates a research cluster from a production deployment.
The facility infrastructure (power density, cooling, physical rack design) sits beneath all of this, covered separately by Hashrate Index’s site evaluation framework and the inference vs. training primer. The reference architecture assumes the facility can support the stack.
Without a validated design, enterprises spend months on integration work that should have been resolved before procurement. They discover mismatches after hardware has shipped and end up with AI infrastructure that underperforms its theoretical specifications.
"AI Factory" — Why the Terminology Is Shifting
The industry (led by NVIDIA) is increasingly using "AI factory" instead of "AI data center." Jensen Huang has used the term consistently since at least 2023, and NVIDIA's Enterprise Reference Architecture program is organized entirely around it. The framing positions these facilities as production infrastructure that transforms data and electricity into intelligence and tokens at scale.
This isn't purely semantic. "AI data center" attracts significant negative press attention right now: power consumption, water usage, grid impact, community opposition. "AI factory" reframes the same infrastructure around what it produces (economic value, AI capabilities, business outcomes) rather than what it consumes. NVIDIA's market weight means the terminology is already shifting in technical documentation, partner marketing, and enterprise procurement conversations.
Both terms appear throughout this article. Readers evaluating vendor documentation will encounter "AI factory" as the standard NVIDIA framing — that's the context behind it.
NVIDIA's Enterprise Reference Architecture Program
NVIDIA publishes the most complete and widely deployed AI data center reference architecture program on the market. Their Enterprise Reference Architecture (Enterprise RA) documentation specifies three tiers, each targeting a different deployment scale and workload profile.
RTX PRO AI Factory
Built around the 2-8-5-200 configuration: 2 CPUs, 8 RTX PRO 6000 Blackwell GPUs, and 5 NICs (four BlueField-3 SuperNICs for the east-west compute fabric at 400 Gb/s each, plus one BlueField-3 DPU for north-south converged traffic at 200GbE). Air-cooled, and fits a standard enterprise footprint. Scales to clusters of up to 256 GPUs. Targeted at inference, fine-tuning, visual computing, and simulation.
NVIDIA HGX AI Factory
Built around the 2-8-9-800 configuration, with eight Blackwell Ultra GPUs per node connected via fifth-generation NVLink and NVSwitch. The most commonly deployed large-enterprise tier, designed for LLM training, fine-tuning, and high-throughput inference in multi-user environments. Liquid cooling is recommended at this density.
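The four-number labels used for these tiers can be read as CPUs, GPUs, NICs, and a bandwidth figure. The sketch below assumes the last field is the average east-west bandwidth per GPU in Gb/s. That reading is consistent with the RTX PRO description above (four 400 Gb/s SuperNICs shared by eight GPUs works out to 200 Gb/s each), but it is an interpretation here and should be checked against NVIDIA's own documentation. `NodeConfig` and both helper functions are hypothetical names.

```python
from typing import NamedTuple


class NodeConfig(NamedTuple):
    cpus: int
    gpus: int
    nics: int
    gbps_per_gpu: int  # assumed meaning: average east-west bandwidth per GPU, in Gb/s


def parse_node_config(label: str) -> NodeConfig:
    """Split a label such as '2-8-5-200' into its four integer fields."""
    parts = label.split("-")
    if len(parts) != 4:
        raise ValueError(f"expected four dash-separated fields, got {label!r}")
    return NodeConfig(*(int(p) for p in parts))


def east_west_gbps_per_node(cfg: NodeConfig) -> int:
    """Aggregate east-west bandwidth implied by the label for one node."""
    return cfg.gpus * cfg.gbps_per_gpu


rtx_pro = parse_node_config("2-8-5-200")
hgx = parse_node_config("2-8-9-800")

# Cross-check against the RTX PRO description: four SuperNICs at 400 Gb/s each.
assert east_west_gbps_per_node(rtx_pro) == 4 * 400

for name, cfg in (("RTX PRO", rtx_pro), ("HGX", hgx)):
    print(name, cfg, "east-west per node:", east_west_gbps_per_node(cfg), "Gb/s")
```

Under that assumption, the HGX label implies four times the per-GPU fabric bandwidth of the RTX PRO tier, which fits its positioning for distributed training.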
NVL72 AI Factory
Rack-scale, based on the GB300 NVL72 platform. Seventy-two GPUs in a single rack via NVLink, up to 132 kW per rack. Targets frontier model training and the largest inference workloads; liquid cooling required.
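For an operator with a fixed power envelope, rack density translates fairly directly into a ceiling on GPU count. The sketch below is a rough upper bound, not a capacity plan: the 20 MW site size and the PUE of 1.25 are hypothetical inputs, every rack is assumed to draw the full 132 kW, and the power needed for networking and storage racks is ignored.

```python
import math

RACK_KW = 132        # upper bound per NVL72 rack, from the text above
GPUS_PER_RACK = 72   # from the text above


def racks_supported(site_mw: float, pue: float) -> int:
    """Whole racks that fit in a site power budget.

    site_mw is total facility power. pue (power usage effectiveness) is the
    ratio of facility power to IT power, so IT power = site power / pue.
    """
    it_kw = site_mw * 1000 / pue
    return math.floor(it_kw / RACK_KW)


# Hypothetical inputs: a 20 MW site and a PUE of 1.25. Neither comes from the article.
racks = racks_supported(site_mw=20, pue=1.25)
print(f"{racks} racks, {racks * GPUS_PER_RACK} GPUs")
```

With those inputs the estimate comes to 121 racks and 8,712 GPUs. The practical figure would be lower once the supporting racks and cooling design are accounted for.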
NVIDIA also maintains the DGX BasePOD reference architecture, the validated design program for DGX B200, H200, and H100 systems that preceded the Enterprise RA branding. It remains the right reference for organizations whose installed base runs on DGX hardware.
What the documentation covers: GPU count, memory configuration, storage specifications, networking topology (including Spectrum-X Ethernet and BlueField DPU placement), software stack, and end-to-end integration guidance. A build specification, not a marketing document.
The NVIDIA-Certified Systems program is the interoperability layer underneath all of this. OEM partners — Dell, HPE, ASUS, Lenovo, Supermicro, and others — build servers to NVIDIA's certified specifications, which slot into these reference architectures as validated building blocks.
AMD, Intel, and the OCP Ecosystem
NVIDIA's program is the most complete, but it isn't the only one (and the alternatives matter for operators who care about vendor lock-in, open standards, or AMD GPU performance).
AMD runs two complementary programs. The Instinct MI300 Series Cluster Design Reference Guide covers cluster-scale deployments of AMD Instinct MI300X, MI325X, and MI350 accelerators — topology designs (fat-tree, rail), hardware recommendations, storage validation, and software configuration within AMD's ROCm open ecosystem. At rack scale, AMD took a deliberately different approach with the Helios rack platform: a rack-scale design built on Meta's Open Rack Wide (ORW) specification contributed to the Open Compute Project, designed so OEM and ODM partners can adopt, extend, and customize it without proprietary constraints. Helios targets volume deployment in 2026.
Intel publishes the Gaudi 3 AI Accelerator Cluster Reference Design, covering 32-node configurations built around eight compute racks of four Gaudi 3 nodes each, plus a network rack. The design specifies Arista switching for management, Ethernet-based scale-out networking, and a modular storage architecture. Intel emphasizes open, OCP-compatible design as a differentiator — the Gaudi 3 platform is positioned explicitly as an alternative for operators who want to avoid CUDA lock-in.
The Open Compute Project (OCP) is the standards-body layer underneath all of this. OCP's Open Data Center Ecosystem for AI framework, formalized in late 2025, covers rack design specifications (including the Open Rack Wide standard Meta contributed), DC power distribution architecture, telemetry standards, and Ethernet-based AI scale-up networking. Google, Meta, Microsoft, AMD, and NVIDIA are all active contributors. OCP is where proprietary and open approaches converge on shared physical standards — operators who want infrastructure flexibility without full vendor lock-in are best served by OCP-aligned designs.
Hyperscalers (AWS, Azure, and GCP) operate their own internal reference architectures, but these aren't accessible to external operators. They surface as cloud service configurations for GPU-as-a-service workloads, not as on-premises build specifications. Neoclouds building on top of cloud infrastructure work within hyperscaler configurations rather than publishing their own RA documentation.
Current Reference Architecture Programs at a Glance
NVIDIA's program is the most complete and widely deployed. AMD's Helios is the most prominent open-standard alternative. OCP is the neutral ground where the ecosystem converges on shared specifications.
What This Means for Bitcoin Miners Evaluating AI/HPC
For mining operators exploring an AI/HPC transition, the reference architecture question is typically where well-intentioned plans stall.
GPU sourcing is usually the starting point, and it's become more accessible. But sourcing GPUs is not the same as standing up a working AI cluster. The harder question is whether the rest of the stack exists or can be assembled to match a validated configuration: the right server platform, the networking fabric, the storage layer, the software orchestration layer. Acquiring bare GPUs without those components produces an expensive inventory position, not an AI data center.
The reference architecture functions as a checklist against which a conversion plan should be evaluated before capital is committed. Sourcing NVIDIA-Certified Systems (validated node configurations, not just bare GPUs) and pairing them with appropriate networking and storage gives operators a well-documented path to deployment. Assembling components ad hoc and figuring out integration later is exactly what the reference architecture is designed to prevent.
The right tier also depends on workload. Training and inference have different infrastructure requirements — different networking profiles, different storage throughput needs, different cooling demands. We’ve written a training vs. inference primer and site evaluation framework covering those requirements in detail; the reference architecture is the next document in that sequence. Miners who've looked at the difficulty of the mining-to-AI transition will recognize the pattern: the infrastructure complexity is real, but it's manageable when approached in the right order.
The Stack Comes First
A GPU is a component. An AI data center is a system. The reference architecture is what connects the two.
The choice between NVIDIA's integrated stack, AMD's OCP-aligned open approach, Intel's Ethernet-native design, or a custom configuration built to OCP specifications depends on workload type, cluster scale, budget, and vendor preference. What matters more than which program you choose is that you work from a validated design rather than assembling components independently and hoping the integration holds.
The market is moving fast. NVIDIA is publishing new RA configurations as Blackwell Ultra ramps. AMD's Helios moves toward volume deployment in 2026. OCP is formalizing standards that will define the next generation of AI-optimized physical infrastructure.
Start with the reference architecture. The procurement question gets considerably easier from there.
If you'd like to talk through which reference architecture model is right for your build, Luxor's Cloud Sales team works with operators and enterprises on AI infrastructure sourcing and deployment. Reach out at luxor.tech/hardware.