The “AI paradox” is a growing hurdle for enterprise leaders: investing millions in powerful GPUs, only to watch them sit idle while waiting for data. As enterprises scale from pilot to production, the real bottleneck isn’t compute; it’s the hidden cost of an inefficient network. In scale-out architectures, tens of thousands of GPUs must synchronize to complete a single training iteration. When the network can’t keep pace with the bursty demands of modern AI training, GPUs stall and job completion time (JCT) spikes. We’ve partnered with AMD to deliver a validated, end-to-end AI infrastructure that eliminates these bottlenecks and transforms the network into a high-performance engine for innovation.
Fabric as the foundation: The Cisco and AMD AI performance blueprint
As AI workloads expand across distributed clusters, the network must scale linearly to prevent packet loss and retransmissions. This performance is only verifiable through rigorous, real-world benchmarking. At Cisco, we prioritize systemic, deterministic performance that goes beyond individual component specs.
Our reference architecture features AMD Instinct™ MI300X GPUs, AMD Pensando™ Pollara 400 NICs, Cisco Silicon One G200-powered N9364E-SG2 switches, and Cisco 800G OSFP optics. Deploying is only half the challenge; operating at scale is the other. Cisco Nexus Dashboard provides the granular, real-time visibility needed for day-0 through day-N operations.

By combining these technologies, we minimize JCT and maximize GPU utilization, ensuring AI infrastructure remains secure, compliant, and continuously optimized.
Benchmarking the architecture
We benchmarked two Clos topologies (2×2 and 4×2) with Cisco N9364E-SG2 switches (each with 51.2 Tbps throughput and 64 ports of 800 GbE), 128 AMD Instinct™ MI300X Series GPUs (16 servers × 8 GPUs), 128 AMD Pensando™ Pollara 400 AI NICs (16 servers × 8 NICs), and the AMD ROCm™ 6.3/7.0.3 software ecosystem.
2×2 Clos topology
This design fully subscribes each leaf switch, forcing the switch into high-congestion states to test fabric resilience:
- 2x leaf and 2x spine (4x Cisco N9364E-SG2) switches
- 8 servers (8x AMD Instinct™ MI300X Series GPUs) connected to each leaf switch
- 8x AMD Pensando™ Pollara 400 AI NICs per server
- Switch side: Cisco OSFP 800G DR8 optics

4×2 Clos topology
This design tests the efficacy of advanced load-balancing techniques in distributing traffic during synchronous bursts in the GPU scale-out fabric (a port-budget sketch covering both topologies follows this list):
- 4x leaf and 2x spine (6x Cisco N9364E-SG2) switches
- 4 servers (8x AMD Instinct™ MI300X Series GPUs) connected to each leaf switch
- 8x AMD Pensando™ Pollara 400 AI NICs per server
- Switch side: Cisco OSFP 800G DR8 optics
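
To make the subscription difference between the two designs concrete, here is a back-of-the-envelope port budget in Python, using only the figures above. The assumption that every leaf port not consumed by a 400G NIC downlink is available as uplink capacity is ours, for illustration; it is not a statement of the validated design.

```python
# Per-leaf bandwidth budget for the two test topologies.
# Figures come from the text; the 1:1 down/up split is an assumption.

LEAF_CAPACITY_GBPS = 51_200          # Cisco N9364E-SG2: 64 x 800 GbE
NIC_GBPS = 400                       # AMD Pensando Pollara 400 AI NIC
NICS_PER_SERVER = 8

topologies = {
    "2x2 Clos": {"servers_per_leaf": 8},
    "4x2 Clos": {"servers_per_leaf": 4},
}

for name, t in topologies.items():
    downlink = t["servers_per_leaf"] * NICS_PER_SERVER * NIC_GBPS
    uplink = LEAF_CAPACITY_GBPS - downlink   # assumed: all remaining capacity faces the spines
    print(f"{name}: {downlink / 1000:.1f} Tbps down per leaf, "
          f"{uplink / 1000:.1f} Tbps up, subscription {downlink / uplink:.2f}:1")
```

Under these assumptions the 2×2 design lands at exactly 1:1 (every bit of leaf capacity in use, matching the “fully subscribes each leaf switch” description), while the 4×2 design leaves substantial uplink headroom for exercising load balancing.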

Benchmarking tools
We measured scale-out fabric performance using a comprehensive toolset, including:
- IBPerf measures RDMA performance over the scale-out fabric under varying congestion scenarios. We used this tool to test performance between GPUs connected across a single leaf and across the leaf-spine fabric (a minimal invocation sketch follows this list).
- MLPerf is an industry-standard benchmark suite that measures real workload performance. Its output connects application-level results to the ROI of fully validated Cisco and AMD designs.
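
For readers who want to reproduce the flavor of the RDMA tests, below is a minimal sketch that drives the open-source perftest suite’s ib_write_bw between two hosts. The device name, QP counts, and duration are illustrative assumptions; the IBPerf harness and tuning used in our lab may differ.

```python
# Minimal sketch: run an RDMA write-bandwidth test with the perftest suite.
# Device name is hypothetical; check `ibv_devices` on your hosts.
import subprocess

def run_ib_write_bw(server_ip=None, device="rdma0", qps=32, seconds=30):
    """Run ib_write_bw as server (server_ip=None) or as client."""
    cmd = [
        "ib_write_bw",
        "-d", device,          # RDMA device to use
        "-q", str(qps),        # number of queue pairs (our tests used 4 and 32)
        "-D", str(seconds),    # run for a fixed duration
        "--report_gbits",      # report bandwidth in Gb/s
    ]
    if server_ip:
        cmd.append(server_ip)  # client mode: connect to the listening server
    return subprocess.run(cmd, capture_output=True, text=True)

# On host A:  run_ib_write_bw()              # listens for the client
# On host B:  run_ib_write_bw("10.0.0.1")    # hypothetical server address
```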
Network fabric performance benchmarking results
We evaluated scale-out fabric performance using comprehensive testing and standard KPIs.
Single-hop IBPerf testing evaluates performance within a localized fabric domain, typically within a single leaf switch. This establishes a baseline for link utilization, buffer tuning effectiveness, and NIC-to-switch performance prior to introducing multi-hop variables.
These tests measure the throughput of Remote Direct Memory Access (RDMA) sessions between two GPUs connected through a Cisco N9364E-SG2 leaf switch. The results capture P01 (1st percentile) and P99 (99th percentile) bandwidth with all sessions active simultaneously. P01 bandwidth represents the throughput of the slowest session (a critical metric for synchronized AI/ML workload performance), while P99 represents the throughput of the fastest session. A minimal delta between P01 and P99, with both values approaching the link bandwidth, demonstrates the efficacy of the GPU interconnect technology.
In the 2-leaf/2-spine (2×2) topology, each leaf switch handles 32 bi-directional sessions, effectively saturating the leaf switch. The 4-leaf/2-spine (4×2) topology handles 16 bi-directional sessions per leaf. Across both topologies and both queue pair (QP) counts (4 QPs and 32 QPs), the P01 and P99 bandwidths stay tightly clustered, with each approaching the 400 Gbps link bandwidth.
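
The percentile math itself is straightforward. The sketch below shows how P01 and P99 would be computed from per-session throughput figures; the sample data is a stand-in, not our measured results.

```python
# Summarize a run as P01/P99 over per-session throughput.
import numpy as np

LINK_GBPS = 400.0
rng = np.random.default_rng(0)
# Stand-in data: 32 sessions clustered near line rate (illustrative only).
session_gbps = rng.normal(loc=392.0, scale=2.0, size=32)

p01, p99 = np.percentile(session_gbps, [1, 99])
print(f"P01={p01:.1f} Gb/s, P99={p99:.1f} Gb/s, delta={p99 - p01:.1f} Gb/s")
print(f"Slowest session runs at {100 * p01 / LINK_GBPS:.1f}% of link bandwidth")
```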

This performance shows that the AMD Pensando™ Pollara NIC and Cisco N9364E-SG2 switches deliver a highly efficient solution for demanding workloads. The tight delta between P01 and P99 metrics across different scales and configurations demonstrates that this architecture maintains deterministic performance, regardless of cluster size or queue pair density.
Bisectional IBPerf testing evaluates cross-fabric traffic traversing multiple tiers to measure bisection bandwidth, path symmetry, cross-spine load balancing, and congestion propagation.
These tests measure RDMA session throughput between two GPUs connected through leaf and spine Cisco N9364E-SG2 switches. The results show P01 and P99 bandwidth measurements with all sessions simultaneously active. In the 2×2 topology, there are 32 bi-directional sessions per leaf, whereas the 4×2 topology has 16 bi-directional sessions per leaf. All of these sessions traverse the spine, with each session’s traffic crossing three hops (leaf-spine-leaf) to stress the entire fabric. This test validates the efficiency of the fabric’s load-balancing algorithm; any traffic polarization would leave some links underutilized while others become congested, ultimately degrading RDMA session performance. Tests were conducted using 4 and 32 QPs.
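
To illustrate what polarization looks like, the sketch below hashes a batch of made-up RoCEv2 flow tuples across a set of hypothetical uplinks using a naive static hash and reports the resulting imbalance. The switch’s production load-balancing algorithm is of course not a CRC32 of the 5-tuple; this only shows the failure mode the test is designed to rule out.

```python
# Hash flow tuples across spine uplinks and measure the spread.
import zlib
from collections import Counter

UPLINKS = 8   # hypothetical uplinks per leaf
# Illustrative flow tuples: (src IP, dst IP, RoCEv2 UDP port 4791, src port).
flows = [(f"10.0.{i}.1", f"10.0.{i}.2", 4791, 49152 + i) for i in range(64)]

buckets = Counter(zlib.crc32(repr(flow).encode()) % UPLINKS for flow in flows)
fair_share = len(flows) / UPLINKS
print(dict(buckets))
print(f"Hottest uplink carries {max(buckets.values()) / fair_share:.2f}x its fair share")
```

Anything meaningfully above 1.0x on the hottest uplink means some links congest while others idle, which is exactly what the tight P01/P99 results show the fabric avoiding.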

The results demonstrate that P01 and P99 bandwidths are similar, with each approaching the 400 Gbps link bandwidth, mirroring the performance observed in single-hop testing. This confirms that the Cisco N9364E-SG2 switches and AMD Pensando™ Pollara NIC provide a high-performance, resilient GPU interconnect capable of maintaining deterministic performance under stress.
Congestive IBPerf testing creates high-contention scenarios using a 31:1 communication pattern, where 31 GPUs communicate with a single GPU. It evaluates queue buildup, Explicit Congestion Notification (ECN) effectiveness, Data Center Quantized Congestion Notification (DCQCN) reaction curves, tail latency, and fabric stability under worst-case AI communication patterns.
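
For context on the DCQCN reaction curve, here is a deliberately simplified model of the published sender-side algorithm (Zhu et al., SIGCOMM 2015): each Congestion Notification Packet (CNP) cuts the sending rate by a factor tied to a moving-average parameter alpha, which decays when no CNPs arrive. Parameter values are textbook defaults, not our lab tuning.

```python
# Simplified DCQCN sender reaction (illustrative model, not lab configuration).
G = 1 / 256                # alpha gain
LINE_RATE = 400.0          # Gb/s

rate, target, alpha = LINE_RATE, LINE_RATE, 1.0

def on_cnp():
    """CNP received: multiplicative rate decrease, alpha moves toward 1."""
    global rate, target, alpha
    target = rate
    rate = rate * (1 - alpha / 2)
    alpha = (1 - G) * alpha + G

def on_timer_no_cnp():
    """Update timer fired with no CNP seen: alpha decays toward 0."""
    global alpha
    alpha = (1 - G) * alpha

def on_recovery_step():
    """Fast recovery: move the current rate halfway back to the target."""
    global rate
    rate = (rate + target) / 2

on_cnp()
print(f"after CNP: {rate:.1f} Gb/s")
for _ in range(3):
    on_recovery_step()
print(f"after fast recovery: {rate:.1f} Gb/s")
```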
Incast conditions represent some of the most challenging scenarios for a scale-out AI fabric. These tests measure P01 and P99 bandwidths under incast conditions, which manifest during collective communications such as all-to-all. If the fabric’s hardware, design, and tuning are not optimal, incast congestion substantially degrades JCT for training workloads. Because it is difficult to synchronize all sessions to start simultaneously, we use the Quantile Range Method to analyze the results: it considers only the bandwidth samples taken during incast congestion rather than all samples from the run.
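
The exact Quantile Range procedure is part of the test methodology; one plausible reading, sketched below under that assumption, is to trim each session’s time series to an inner quantile window so that ramp-up and ramp-down samples outside the incast period do not skew the percentiles.

```python
# Assumed reading of the Quantile Range Method: drop samples outside an
# inner quantile window of the run before computing P01/P99.
import numpy as np

def congested_window(samples, lo_q=0.10, hi_q=0.90):
    """Keep only samples inside the [lo_q, hi_q] span of the run (assumed)."""
    n = len(samples)
    return samples[int(n * lo_q):int(n * hi_q)]

rng = np.random.default_rng(1)
ramp = np.linspace(0, 380, 20)            # sessions still starting up
steady = rng.normal(385, 3, 200)          # incast-congested steady state
samples = np.concatenate([ramp, steady, ramp[::-1]])

p01, p99 = np.percentile(congested_window(samples), [1, 99])
print(f"P01={p01:.1f} Gb/s, P99={p99:.1f} Gb/s (congested window only)")
```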

In this test, each of the 128 GPUs establishes 31 RDMA sessions to 31 other GPUs across the leaf-spine fabric, resulting in 3,968 (128 × 31) simultaneously active sessions in the scale-out fabric. The delta between P01 and P99 bandwidth is very tight, and each is close to the 400 Gbps link bandwidth: a solid proof point of the Cisco N9364E-SG2 switches’ ability to handle extreme congestive conditions and a testament to the Cisco and AMD validated design.
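
For concreteness, here is one way to enumerate a pattern with exactly those properties (31 sessions per GPU, 3,968 sessions total, and a 31:1 fan-in at every receiver) by splitting the cluster into 32-GPU groups. The actual peer selection in the test plan may differ; this is an illustrative reconstruction.

```python
# Enumerate a 31:1 incast pattern over 128 GPUs via 32-GPU groups (assumed).
GPUS, GROUP = 128, 32
sessions = [
    (src, dst)
    for g in range(GPUS // GROUP)
    for src in range(g * GROUP, (g + 1) * GROUP)
    for dst in range(g * GROUP, (g + 1) * GROUP)
    if src != dst
]
assert len(sessions) == GPUS * (GROUP - 1)   # 128 x 31 = 3,968 sessions
# Every GPU sources 31 sessions and sinks 31 sessions: a 31:1 fan-in each.
```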
MLPerf Training and Inference benchmark tests establish standardized metrics to evaluate the performance of training and inference workloads. By enforcing strict guidelines regarding models, datasets, and allowable optimizations, these benchmarks provide a level playing field for fair comparison among competing AI infrastructure solutions.
The MLPerf tests from MLCommons are designed to provide a common benchmarking methodology for measuring application-level KPIs, which are the primary indicators of performance for end users. For inference, the Llama 2 70B results demonstrate clear throughput scaling as the configuration expands from two to four nodes. The training benchmarks provide representative data for Llama 2 70B (on two nodes) and Llama 3.1 8B (on eight nodes).
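
A quick way to read multi-node results like these is scaling efficiency: how close doubling the node count comes to doubling throughput. The figures below are placeholders for illustration; the published MLPerf submissions carry the measured values.

```python
# Scaling efficiency from 2 nodes to 4 nodes (placeholder throughputs).
two_node, four_node = 1.0, 1.9    # normalized throughput, illustrative only
efficiency = four_node / (2 * two_node)
print(f"Scaling efficiency going 2 -> 4 nodes: {efficiency:.0%}")
```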

These findings provide the foundation for our core claim: the Cisco validated architecture is not just theoretically sound; benchmarking shows it can handle the most demanding AI inference and training workloads.
A real-world deployment of the Cisco and AMD AI solution architecture
The Cisco-AMD partnership delivers real-world impact, notably powering G42’s large-scale AI clusters. This end-to-end solution—integrating AMD GPUs, Cisco UCS servers, N9000 800G switches, and Nexus Dashboard—provides the secure, scalable performance required for cutting-edge AI workloads.
“As AI workloads scale, network performance becomes a critical enabler of cluster efficiency. The AMD Pensando™ Pollara 400 AI NIC, with its fully programmable, fault-resilient design, delivers consistent performance for GPU scale-out training. In collaboration with Cisco N9000 switching, we’re advancing Ethernet to the next level, helping maximize GPU utilization and accelerate job completion.”
—Yousuf Khan, Corporate Vice President, Networking Technology and Solutions Group, AMD
Operationalizing intelligence: A new standard for performance at scale
In the age of massive-scale AI, an organization’s infrastructure is either its greatest competitive advantage or its most significant bottleneck. When the stakes involve mission-critical training, fine-tuning, and inferencing, a unified, fully validated ecosystem is a must. Cisco and AMD are changing the equation, delivering a deterministic, high-performance fabric that turns your network into a catalyst for innovation.
Connect with a Cisco AI networking specialist today to design a deployment tailored to your specific workloads.