{"id":17026,"date":"2026-05-11T14:59:51","date_gmt":"2026-05-11T14:59:51","guid":{"rendered":"https:\/\/dmsretail.com\/RetailNews\/benchmarking-scale-out-ai-fabrics-with-cisco-n9000-amd-pensando-pollara-400-nics\/"},"modified":"2026-05-11T14:59:51","modified_gmt":"2026-05-11T14:59:51","slug":"benchmarking-scale-out-ai-fabrics-with-cisco-n9000-amd-pensando-pollara-400-nics","status":"publish","type":"post","link":"https:\/\/dmsretail.com\/RetailNews\/benchmarking-scale-out-ai-fabrics-with-cisco-n9000-amd-pensando-pollara-400-nics\/","title":{"rendered":"Benchmarking scale-out AI fabrics with Cisco N9000 + AMD Pensando\u2122 Pollara 400 NICs"},"content":{"rendered":"<div>\n<p>The \u201cAI paradox\u201d is a growing hurdle for enterprise leaders: investing millions in powerful GPUs, only to watch them sit idle while waiting for data. As enterprises scale from pilot to production, the real bottleneck isn\u2019t compute\u2014it\u2019s the hidden cost of an inefficient network. In scale-out architectures, tens of thousands of GPUs must synchronize to complete a single training iteration. When the network can\u2019t keep pace with the bursty demands of modern AI training, GPUs stall and job completion time (JCT) spikes. 
We\u2019ve partnered with AMD to deliver a validated, end-to-end AI infrastructure that eliminates these bottlenecks and transforms the network into a high-performance engine for innovation.<\/p>\n<h2>Fabric as the foundation: The Cisco and AMD AI performance blueprint<\/h2>\n<p>As AI workloads expand across distributed clusters, the network must scale linearly to prevent packet loss and retransmissions. This performance is only verifiable through rigorous, real-world benchmarking. At Cisco, we prioritize systemic, deterministic performance that goes beyond individual component specs.<\/p>\n<p>Our reference architecture features AMD Instinct\u2122 MI300X GPUs, AMD Pensando\u2122 Pollara 400 NICs, Cisco Silicon One G200-powered N9364E-SG2 switches, and Cisco 800G OSFP optics. Deploying is only half the challenge; operating at scale is the other. Cisco Nexus Dashboard provides the granular, real-time visibility needed for day-0 through day-N operations.<\/p>\n<figure id=\"attachment_491208\" aria-describedby=\"caption-attachment-491208\" style=\"width: 768px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"lazy lazy-hidden size-medium_large wp-image-491208\" data-lazy-type=\"image\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-1-768x432.png\" alt=\"Cisco N9000 Series Switches, with AMD Instinct GPU accelerators and AMD Pensando AI NICs, unified with Cisco Nexus One in a fully integrated stack. N9000 Series switches are included in AMD reference architecture for AI cluster design.\" width=\"768\" height=\"432\"\/><noscript><img loading=\"lazy\" decoding=\"async\" class=\"size-medium_large wp-image-491208\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-1-768x432.png\" alt=\"Cisco N9000 Series Switches, with AMD Instinct GPU accelerators and AMD Pensando AI NICs, unified with Cisco Nexus One in a fully integrated stack. 
N9000 Series switches are included in AMD reference architecture for AI cluster design.\" width=\"768\" height=\"432\"\/><\/noscript><figcaption id=\"caption-attachment-491208\" class=\"wp-caption-text\">Figure 1: Cisco N9000 Series Switches, with AMD Instinct\u2122 GPU accelerators and AMD Pensando\u2122 AI NICs<\/figcaption><\/figure>\n<p>By combining these technologies, we minimize JCT and maximize GPU utilization, ensuring AI infrastructure remains secure, compliant, and continuously optimized.<\/p>\n<h2>Benchmarking the architecture<\/h2>\n<p>We benchmarked two Clos topologies (2\u00d72 &amp; 4\u00d72) with Cisco N9364E-SG2 switches (each with 51.2\u202fTbps throughput\u202fand 64 ports of 800 GbE), 128 AMD Instinct\u2122 MI300X Series GPUs (16 servers x 8 GPUs), 128 AMD Pensando\u2122 Pollara 400 AI NICs (16 servers x 8 NICs), and the AMD ROCm\u2122 6.3\/7.0.3 software ecosystem.<\/p>\n<h2>2\u00d72 Clos topology<\/h2>\n<p>This design fully subscribes each leaf switch, forcing the switch into high-congestion states to test fabric resilience:<\/p>\n<ul>\n<li>2x leaf and 2x spine (4x Cisco N9364E-SG2) switches<\/li>\n<li>8\u202fservers\u202f(8x AMD Instinct\u2122 MI300X Series GPUs) connected to each leaf switch<\/li>\n<li>8x AMD Pensando\u2122 Pollara 400G NICs per server<\/li>\n<li>Switch side: Cisco OSFP 800G DR8 optics<\/li>\n<\/ul>\n<figure id=\"attachment_491209\" aria-describedby=\"caption-attachment-491209\" style=\"width: 768px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"lazy lazy-hidden size-medium_large wp-image-491209\" data-lazy-type=\"image\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-2-768x401.png\" alt=\"2x2 CLOS topology with Cisco N9364E-SG2 + AMD Topology\" width=\"768\" height=\"401\"\/><noscript><img loading=\"lazy\" decoding=\"async\" class=\"size-medium_large wp-image-491209\" 
src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-2-768x401.png\" alt=\"2x2 CLOS topology with Cisco N9364E-SG2 + AMD Topology\" width=\"768\" height=\"401\"\/><\/noscript><figcaption id=\"caption-attachment-491209\" class=\"wp-caption-text\">Figure 2: 2\u00d72 Clos topology<\/figcaption><\/figure>\n<h2>4\u00d72 Clos topology<\/h2>\n<p>This design focuses on the efficacy of advanced load-balancing techniques for efficient load distribution during synchronous bursts in the GPU scale-out fabric:<\/p>\n<ul>\n<li>4x leaf and 2x spine (6x Cisco N9364E-SG2) switches<\/li>\n<li>4\u202fservers\u202f(8x AMD Instinct\u2122 MI300X Series GPUs) connected to each leaf switch<\/li>\n<li>8x AMD Pensando\u2122 Pollara 400G NICs per server<\/li>\n<li>Switch side: Cisco OSFP 800G DR8 optics<\/li>\n<\/ul>\n<figure id=\"attachment_491211\" aria-describedby=\"caption-attachment-491211\" style=\"width: 768px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"lazy lazy-hidden size-medium_large wp-image-491211\" data-lazy-type=\"image\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-3-768x401.png\" alt=\"4x2 CLOS topology with Cisco N9364E-SG2 + AMD Scale-out Topology\" width=\"768\" height=\"401\"\/><noscript><img loading=\"lazy\" decoding=\"async\" class=\"size-medium_large wp-image-491211\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-3-768x401.png\" alt=\"4x2 CLOS topology with Cisco N9364E-SG2 + AMD Scale-out Topology\" width=\"768\" height=\"401\"\/><\/noscript><figcaption id=\"caption-attachment-491211\" class=\"wp-caption-text\">Figure 3: 4\u00d72 Clos topology<\/figcaption><\/figure>\n<h2>Benchmarking tools<\/h2>\n<p>We measured scale-out fabric performance using a comprehensive toolset, including:<\/p>\n<ul>\n<li><strong>IBPerf<\/strong> measures RDMA performance over scale-out fabric in varying 
congestive\u202fscenarios. We used this tool to test performance\u202fbetween GPUs connected across a single leaf\u202fand\u202facross leaf-spine.<\/li>\n<li><strong>MLPerf<\/strong> is an industry-standard benchmark used to measure actual workload performance.\u202fThe performance output translates to ROI\u202fon fully validated\u202fdesigns from Cisco and AMD.<\/li>\n<\/ul>\n<h2>Network fabric performance benchmarking results<\/h2>\n<p>We evaluated scale-out fabric performance using comprehensive testing and standard KPIs.<\/p>\n<p><strong>Single-hop IBPerf testing<\/strong> evaluates performance within a localized fabric domain, typically within a single leaf switch. This establishes a baseline for link utilization, buffer tuning effectiveness, and NIC-to-switch performance prior to introducing multi-hop variables.<\/p>\n<p>These tests measure the Remote Direct Memory Access (RDMA) sessions\u2019 throughput between two GPUs connected through a Cisco N9364E-SG2 leaf switch. The results capture P01 (1st percentile) and P99 (99th percentile) bandwidth, with all sessions active simultaneously. P01 bandwidth represents the throughput of the slowest session\u2014a critical metric for synchronized AI\/ML workload performance\u2014while P99 represents the throughput of the fastest session. A minimal delta between P01 and P99 bandwidth, with each approaching the link bandwidth, demonstrates the efficacy of the GPU interconnect technology.<\/p>\n<p>In the 2-leaf\/2-spine (2\u00d72) topology, each leaf switch handles 32 bi-directional sessions, effectively saturating the leaf switch. The 4-leaf\/2-spine (4\u00d72) topology handles 16 bi-directional sessions per leaf. 
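As a concrete illustration of these KPIs, the P01 and P99 figures can be computed from one throughput sample per concurrent session. The sketch below is a minimal pure-Python version of that calculation (illustrative only, not the actual test harness; the sample values are made up):

```python
# Minimal sketch of the P01/P99 bandwidth KPI (illustrative only, not the
# actual test harness). One throughput sample per concurrent RDMA session.

def percentile(values, pct):
    """Percentile with linear interpolation between closest ranks."""
    s = sorted(values)
    k = (len(s) - 1) * pct / 100.0
    lo = int(k)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (k - lo)

def p01_p99(session_bw_gbps):
    """P01 (slowest-session) and P99 (fastest-session) bandwidth."""
    return percentile(session_bw_gbps, 1), percentile(session_bw_gbps, 99)

# Hypothetical data: 32 concurrent sessions on a 400 Gbps link.
bw_samples = [392.0 + (i % 8) * 0.5 for i in range(32)]
p01, p99 = p01_p99(bw_samples)
```

Reading the two numbers together is the point: P01 near line rate says the slowest session is not being starved, and a tight P01-P99 delta says the fabric is treating sessions uniformly.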
Across both topologies and both queue pair (QP) counts (4 QPs and 32 QPs), the P01 and P99 bandwidths are close to each other, with each approaching the link bandwidth of 400 Gbps.<\/p>\n<figure id=\"attachment_491212\" aria-describedby=\"caption-attachment-491212\" style=\"width: 768px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"lazy lazy-hidden size-medium_large wp-image-491212\" data-lazy-type=\"image\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-4-768x418.png\" alt=\"\" width=\"768\" height=\"418\"\/><noscript><img loading=\"lazy\" decoding=\"async\" class=\"size-medium_large wp-image-491212\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-4-768x418.png\" alt=\"\" width=\"768\" height=\"418\"\/><\/noscript><figcaption id=\"caption-attachment-491212\" class=\"wp-caption-text\">Figure 4: Single-hop RDMA bandwidth performance across varying leaf-spine topologies and queue pair counts<\/figcaption><\/figure>\n<p>This performance shows that the AMD Pensando\u2122 Pollara NIC and Cisco N9364E-SG2 switches deliver a highly efficient solution for demanding workloads. The tight delta between P01 and P99 metrics across different scales and configurations demonstrates that this architecture maintains deterministic performance, regardless of cluster size or queue pair density.<\/p>\n<p><strong>Bisectional IBPerf testing<\/strong> evaluates cross-fabric traffic traversing multiple tiers to measure bisection bandwidth, path symmetry, cross-spine load balancing, and congestion propagation.<\/p>\n<p>These tests measure RDMA session throughput between two GPUs connected through leaf and spine Cisco N9364E-SG2 switches. The results show P01 and P99 bandwidth measurements with all sessions simultaneously active. 
In the 2\u00d72 topology, there are 32 bi-directional sessions per leaf, whereas the 4\u00d72 topology has 16 bi-directional sessions per leaf. All these sessions traverse the spine. The traffic from each session traverses three hops (leaf-spine-leaf) to stress the entire fabric. This test validates the efficiency of the fabric\u2019s load-balancing algorithm; any traffic polarization would lead to some links being underutilized, while other links become congested, ultimately degrading RDMA session performance. Tests were conducted using 4 and 32 QPs.<\/p>\n<figure id=\"attachment_491213\" aria-describedby=\"caption-attachment-491213\" style=\"width: 768px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"lazy lazy-hidden size-medium_large wp-image-491213\" data-lazy-type=\"image\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-5-768x428.png\" alt=\"\" width=\"768\" height=\"428\"\/><noscript><img loading=\"lazy\" decoding=\"async\" class=\"size-medium_large wp-image-491213\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-5-768x428.png\" alt=\"\" width=\"768\" height=\"428\"\/><\/noscript><figcaption id=\"caption-attachment-491213\" class=\"wp-caption-text\">Figure 5: Bisection RDMA bandwidth stability comparison for 2-leaf\/2-spine and 4-leaf\/2-spine architectures across varying queue pair counts<\/figcaption><\/figure>\n<p>The results demonstrate that P01 and P99 bandwidths are similar, and each is close to the link bandwidth of 400 Gbps, mirroring the performance observed in single-hop testing. 
This confirms that the Cisco N9364E-SG2 switches and AMD Pensando\u2122 Pollara NIC provide a high-performance, resilient GPU interconnect technology capable of maintaining consistently deterministic performance under stress.<\/p>\n<p><strong>Congestive IBPerf testing<\/strong> creates high-contention scenarios using a 31:1 communication pattern, where 31 GPUs communicate with a single GPU. It evaluates queue buildup, Explicit Congestion Notification (ECN) effectiveness, Data Center Quantized Congestion Notification (DCQCN) reaction curves, tail latency, and fabric stability under worst-case AI communication patterns.<\/p>\n<p>Incast conditions represent some of the most challenging scenarios for a scale-out AI fabric. These tests measure P01 and P99 bandwidths under incast conditions, which manifest during collective communications such as all-to-all. If the scale-out fabric hardware, design, and tuning are not optimal, the result is substantial degradation in JCT for training workloads. Because it is difficult to synchronize all sessions to start simultaneously, we use the Quantile Range Method to analyze the results. 
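The exact quantile-range procedure is not spelled out here, but the underlying idea (keep only the samples taken while the incast is actually in flight, discarding ramp-up and ramp-down) can be sketched as follows. The cut points are assumptions chosen for illustration, not values from the benchmark:

```python
# Sketch of quantile-range-style sample selection (an assumption about the
# method, not the benchmark's published procedure): drop ramp-up/ramp-down
# samples and keep only the window where all incast sessions overlap.

def congested_window(samples, lo_q=0.2, hi_q=0.8):
    """Keep the middle [lo_q, hi_q) fraction of a time-ordered trace.
    The cut points are assumed values chosen for illustration."""
    n = len(samples)
    return samples[int(n * lo_q):int(n * hi_q)]

# Hypothetical per-interval bandwidth trace (Gbps): ramp-up, congested
# steady state, ramp-down.
trace = [50, 120, 380, 385, 390, 388, 386, 391, 200, 60]
steady = congested_window(trace)  # -> [380, 385, 390, 388, 386, 391]
```

P01 and P99 would then be computed over the retained window only, so slow-start stragglers do not masquerade as fabric congestion.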
This method analyzes only the bandwidth samples taken during incast congestion, rather than all bandwidth samples.<\/p>\n<figure id=\"attachment_491214\" aria-describedby=\"caption-attachment-491214\" style=\"width: 768px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"lazy lazy-hidden size-medium_large wp-image-491214\" data-lazy-type=\"image\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-6-768x420.png\" alt=\"\" width=\"768\" height=\"420\"\/><noscript><img loading=\"lazy\" decoding=\"async\" class=\"size-medium_large wp-image-491214\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-6-768x420.png\" alt=\"\" width=\"768\" height=\"420\"\/><\/noscript><figcaption id=\"caption-attachment-491214\" class=\"wp-caption-text\">Figure 6: RDMA incast 31:1 congestion performance. Comparison of P01 and P99 bandwidth during high-contention 31:1 incast traffic<\/figcaption><\/figure>\n<p>In this test, each of the 128 GPUs establishes 31 RDMA sessions to 31 other GPUs across the leaf-spine fabric, resulting in a total of 3,968 (31 \u00d7 128) simultaneously active sessions in the scale-out fabric. The delta between P01 and P99 bandwidth is very tight, and each bandwidth is close to the link bandwidth of 400 Gbps, which is a solid proof point of the Cisco N9364E-SG2 switches\u2019 ability to handle extreme congestive conditions and a testament to the Cisco and AMD validated design.<\/p>\n<p><strong>MLPerf Training and Inference Benchmarking<\/strong> tests establish standardized metrics to evaluate the performance of training and inference workloads. 
By enforcing strict guidelines regarding models, datasets, and allowable optimizations, these benchmarks provide a level playing field for fair comparison among competing AI infrastructure solutions.<\/p>\n<p>The MLPerf tests from MLCommons are designed to provide a common benchmarking methodology for measuring application-level KPIs, which are the primary indicators of performance for end users. For inference, the Llama 2 70B results demonstrate clear throughput scaling as the configuration expands from two to four nodes. The training benchmarks provide representative data for Llama 2 70B (on two nodes) and Llama 3.1 8B (on eight nodes).<\/p>\n<figure id=\"attachment_491215\" aria-describedby=\"caption-attachment-491215\" style=\"width: 768px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"lazy lazy-hidden size-medium_large wp-image-491215\" data-lazy-type=\"image\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-7-768x418.png\" alt=\"\" width=\"768\" height=\"418\"\/><noscript><img loading=\"lazy\" decoding=\"async\" class=\"size-medium_large wp-image-491215\" src=\"https:\/\/blogs.cisco.com\/gcs\/ciscoblogs\/1\/2026\/05\/N9000-series-switches-blogg-figure-7-768x418.png\" alt=\"\" width=\"768\" height=\"418\"\/><\/noscript><figcaption id=\"caption-attachment-491215\" class=\"wp-caption-text\">Figure 7: MLPerf training and inference key performance metrics for Llama 2 and Llama 3.1 models, detailing throughput and JCT across multi-node configurations<\/figcaption><\/figure>\n<p>These findings provide the foundation for our core claim: the Cisco validated architecture is not just theoretically sound; benchmarking shows it can handle the most demanding AI inference and training workloads.<\/p>\n<h2>A real-world deployment of the Cisco and AMD AI solution architecture<\/h2>\n<p>The Cisco-AMD partnership delivers real-world impact, notably powering G42\u2019s large-scale AI clusters. 
This end-to-end solution\u2014integrating AMD GPUs, Cisco UCS servers, N9000 800G switches, and Nexus Dashboard\u2014provides the secure, scalable performance required for cutting-edge AI workloads.<\/p>\n<p style=\"text-align: center;\">\u201cAs AI workloads scale, network performance becomes a critical enabler of cluster efficiency. The AMD Pensando\u2122 Pollara 400 AI NIC, with its fully programmable, fault-resilient design, delivers consistent performance for GPU scale-out training. In collaboration with Cisco N9000 switching, we\u2019re advancing Ethernet to the next level, helping maximize GPU utilization and accelerate job completion.\u201d<\/p>\n<p style=\"text-align: center;\"><strong>\u2014Yousuf Khan, Corporate Vice President, Networking Technology and Solutions Group, AMD<\/strong><\/p>\n<h2>Operationalizing intelligence: A new standard for performance at scale<\/h2>\n<p>In the age of massive-scale AI, an organization\u2019s infrastructure is either its greatest competitive advantage or its most significant bottleneck. When the stakes involve mission-critical training, fine-tuning, and inferencing, a unified, fully validated ecosystem is a must. 
Cisco and AMD are changing the equation, delivering a deterministic, high-performance fabric that turns your network into a catalyst for innovation.<\/p>\n<p>Connect with a Cisco AI networking specialist today to design a deployment tailored to your specific workloads.<\/p>\n<\/p><\/div>\n","protected":false},"excerpt":{"rendered":"<p>The \u201cAI paradox\u201d is a growing hurdle for enterprise leaders: investing millions in powerful GPUs, only to watch them sit idle while waiting for data. 
[&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":17027,"comment_status":"","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[5],"tags":[],"class_list":["post-17026","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-technology"],"_links":{"self":[{"href":"https:\/\/dmsretail.com\/RetailNews\/wp-json\/wp\/v2\/posts\/17026","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/dmsretail.com\/RetailNews\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/dmsretail.com\/RetailNews\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/dmsretail.com\/RetailNews\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/dmsretail.com\/RetailNews\/wp-json\/wp\/v2\/comments?post=17026"}],"version-history":[{"count":0,"href":"https:\/\/dmsretail.com\/RetailNews\/wp-json\/wp\/v2\/posts\/17026\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/dmsretail.com\/RetailNews\/wp-json\/wp\/v2\/media\/17027"}],"wp:attachment":[{"href":"https:\/\/dmsretail.com\/RetailNews\/wp-json\/wp\/v2\/media?parent=17026"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/dmsretail.com\/RetailNews\/wp-json\/wp\/v2\/categories?post=17026"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/dmsretail.com\/RetailNews\/wp-json\/wp\/v2\/tags?post=17026"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}