Designing a network used to be about connectivity. We plugged things in, configured some VLANs, ensured the routing table looked sane, and went to lunch. I miss those days. Today, if you are working with AI infrastructure, connectivity is the bare minimum requirement. The actual goal is utilization.

When enterprises spend millions on GPUs, they are buying time. They are buying the ability to train a model or run an inference engine faster than the competition. If the network drops a packet or introduces a microsecond of latency, those expensive GPUs sit idle. That is money burning in a server rack. Cisco recently showed off their strategy for AI networking during Networking Field Day, and it is clear they have realized that the traditional “connect and pray” method does not work for massive AI clusters.

The Physical Layers

We need to talk about the physical reality of these things first. Layer 1 is messy. When we talk about AI clusters, we are often talking about a rail-optimized design. This is not your standard leaf-spine topology where you can just plug any cable into any port. In a rail-optimized setup, you are trying to minimize the hops between GPUs to reduce serialization and propagation delays.
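To make the rail idea concrete, here is a toy sketch of the wiring pattern. The names (gpu-node-01, rail-leaf-3) and the rail count are invented for illustration: NIC number n on every server lands on rail leaf number n, so GPUs talking over matching NICs are always a single hop apart.

```python
# Illustrative rail-optimized wiring: every server's nic<N> cables to rail-leaf-<N>.
# Names and counts are made up; the point is the pattern, not a real design.

num_rails = 8          # one rail per GPU/NIC pair in the server
servers = [f"gpu-node-{i:02d}" for i in range(1, 5)]

cabling = {}
for server in servers:
    for rail in range(num_rails):
        cabling[(server, f"nic{rail}")] = f"rail-leaf-{rail}"

print(cabling[("gpu-node-01", "nic3")])   # rail-leaf-3
print(cabling[("gpu-node-04", "nic3")])   # rail-leaf-3 -> same leaf, one hop, no spine crossing
```

The whole point of the pattern is that traffic between matching NICs never has to climb to a spine, which is exactly what breaks when a cable lands in the wrong port.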

The complexity of cabling these architectures is enough to make anyone cry. You might have thousands of cables. If you plug one into the wrong port, the link light will still turn green. The interface comes up. But the traffic is now taking a suboptimal path, crossing a spine switch it wasn’t supposed to touch, and suddenly your entire collective communication job slows down because one link is lagging. The whole cluster waits for the slowest member.

Cisco tackled this with a tool inside their HyperFabric AI platform, which is offered as a Software-as-a-Service (SaaS) management option. They call the feature “Run Cabling.” It generates the cabling map for you, but the real value is that it validates each connection against the plan. The UI won’t give you a green light until the specific cable is in the specific port it was designed for. Cisco claims this reduced deployment time by 90 percent for some beta customers. I believe it. I have spent weekends troubleshooting miscabling. It is not fun. Having a system with the intelligence to flag a small error like that can save hours of troubleshooting.
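Cisco has not published the internals, but the core check has to look roughly like this sketch: compare the planned neighbor for every port against what the fabric actually discovers, and refuse to bless anything that does not match. Every data structure and name here is invented for illustration.

```python
# Hedged sketch of plan-versus-reality cable validation, in the spirit of a
# feature like "Run Cabling". The data shapes and names are made up.

# Planned cabling: (leaf, leaf_port) -> (gpu_host, host_nic)
cable_plan = {
    ("leaf-101", "Eth1/1"): ("gpu-node-01", "nic0"),
    ("leaf-101", "Eth1/2"): ("gpu-node-02", "nic0"),
    ("leaf-102", "Eth1/1"): ("gpu-node-01", "nic1"),
}

# What the fabric actually sees via neighbor discovery (e.g. LLDP)
observed = {
    ("leaf-101", "Eth1/1"): ("gpu-node-01", "nic0"),   # correct
    ("leaf-101", "Eth1/2"): ("gpu-node-01", "nic1"),   # link is up, but it is the wrong cable
    ("leaf-102", "Eth1/1"): ("gpu-node-02", "nic0"),   # swapped with the one above
}

def validate_cabling(plan, seen):
    """Return every port where the live neighbor differs from the plan."""
    mismatches = []
    for port, expected in plan.items():
        actual = seen.get(port)
        if actual != expected:
            mismatches.append((port, expected, actual))
    return mismatches

for (switch, port), expected, actual in validate_cabling(cable_plan, observed):
    print(f"{switch} {port}: expected {expected}, found {actual}")
```

Notice that both bad links in the sketch are up and green. Only the comparison against the plan catches them, which is the whole argument for doing the validation in software.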

More Than Cables

Once the physical layer is sorted, you run into the traffic problem. AI traffic is not like standard TCP web traffic. It is RDMA, usually RoCE (RDMA over Converged Ethernet). It is bursty, heavy, and incredibly sensitive to packet loss. In the past, we relied on DCQCN (Data Center Quantized Congestion Notification) for congestion control. It acts a bit like TCP by telling the sender to back off when things get crowded. But in a high-speed AI environment, backing off means slowing down. Slowing down means the GPU is waiting.
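To see why that hurts, here is a deliberately simplified toy model of ECN-driven back-off. The constants and the update rule are made up for illustration; they are not the actual DCQCN specification.

```python
# Toy model of why congestion-mark-driven back-off leaves GPUs waiting.
# The 0.5 cut, 5 Gbps recovery step, and 400 Gbps line rate are illustrative only.

def next_rate(current_rate_gbps, ecn_marked, line_rate_gbps=400.0):
    """Cut the rate hard on a congestion mark, crawl back up otherwise."""
    if ecn_marked:
        return current_rate_gbps * 0.5
    return min(line_rate_gbps, current_rate_gbps + 5.0)

rate = 400.0
for marked in [True, False, False, False, False]:
    rate = next_rate(rate, marked)
    print(f"sender rate: {rate:.0f} Gbps")

# One mark halves the rate; recovery takes many intervals,
# and the GPUs waiting on that data stall the whole collective.
```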

This is where the hardware comes in. Cisco is leveraging their Silicon One ASICs to move away from simple congestion notification and toward dynamic load balancing, or DLB.

Standard Equal-Cost Multipathing, or ECMP, is not smart. It hashes a flow to a link and hopes for the best. If that link gets congested, too bad. DLB is smarter. It looks at “flowlets,” or groups of packets within a stream, and sprays them across the least congested links in real time. The result is near-perfect link utilization. When you pair this with the P4 programmable architecture in the switches, you stop dropping packets. You stop relying on flow control pause frames that halt traffic. You just move data.
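Here is a rough sketch of the difference. The link loads and the flowlet gap threshold are invented: ECMP hashes the flow once and lives with the answer, while flowlet-aware DLB waits for a natural gap in the stream and then re-steers the next burst to the least loaded link without reordering packets inside a burst.

```python
# Minimal contrast between static ECMP hashing and flowlet-style dynamic load
# balancing. Link "load" figures and the 50 us gap threshold are made up.

import hashlib

links = ["spine-1", "spine-2", "spine-3", "spine-4"]
link_load = {"spine-1": 0.92, "spine-2": 0.35, "spine-3": 0.10, "spine-4": 0.60}

def ecmp_pick(flow_tuple):
    """Static ECMP: hash the 5-tuple once; the flow is pinned to that link."""
    h = int(hashlib.md5(str(flow_tuple).encode()).hexdigest(), 16)
    return links[h % len(links)]

def dlb_pick(last_pkt_gap_us, current_link, gap_threshold_us=50):
    """Flowlet DLB: a long enough gap means the next burst can be re-steered
    to the least loaded link without arriving out of order."""
    if last_pkt_gap_us < gap_threshold_us:
        return current_link                      # still inside the same flowlet
    return min(link_load, key=link_load.get)     # new flowlet, pick the idlest link

flow = ("10.0.0.1", "10.0.1.1", 4791, 4791, "UDP")   # RoCEv2 rides UDP port 4791
stuck_link = ecmp_pick(flow)
better_link = dlb_pick(last_pkt_gap_us=120, current_link=stuck_link)
print(f"ECMP pins the flow to {stuck_link}; DLB moves the next flowlet to {better_link}")
```

The design choice that matters is the gap threshold: re-steering only between flowlets is what keeps the receiver from seeing reordered packets while the switch still spreads load in real time.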

The operating model is the final piece of this puzzle. We have moved past the point where we should be hand-crafting configurations for every switch. It is too risky. Cisco is pushing a prescriptive model. You have the Nexus Dashboard for the on-prem diehards who want full control, and then you have HyperFabric AI for the folks who want a cloud-managed, “hands-off” experience.

HyperFabric AI is interesting because it is opinionated. It offers t-shirt sizes like small, medium, and large. You use the validated servers, you use the validated storage, and you use their network design. It simplifies the design phase from months to days. It integrates with higher-level tools like AI Canvas and Splunk to correlate network telemetry with job performance. If a training job is running slow, you can actually see if it is because the job scheduler placed the workload across two different scalable units, creating a bandwidth bottleneck.
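As a hedged illustration of what that correlation looks like (the scheduler data, node names, and scalable-unit labels are all invented), you can flag any job whose GPUs span more than one scalable unit, because its collectives now have to cross the spine.

```python
# Sketch of correlating job placement with fabric topology. All data is invented.

gpu_to_unit = {
    "gpu-node-01": "su-1", "gpu-node-02": "su-1",
    "gpu-node-03": "su-2", "gpu-node-04": "su-2",
}

jobs = {
    "train-job-a": ["gpu-node-01", "gpu-node-02"],                    # stays in one unit
    "train-job-b": ["gpu-node-02", "gpu-node-03", "gpu-node-04"],     # spans two units
}

for job, nodes in jobs.items():
    units = {gpu_to_unit[n] for n in nodes}
    if len(units) > 1:
        print(f"{job}: placed across {sorted(units)} -- expect cross-spine traffic and a bandwidth bottleneck")
    else:
        print(f"{job}: contained in {units.pop()}")
```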

That is the level of visibility we need. We are building giant math factories. The network is the assembly line. If the belt stops, the factory stops. Cisco seems to understand that their job is to make the assembly line invisible, reliable, and boringly predictable. That is how you build for AI.

Bringing IT All Together

The shift to AI infrastructure requires a fundamental change in how we approach networking, prioritizing GPU utilization over simple connectivity. Cisco is addressing this through a mix of specialized hardware and prescriptive software.

Physical Precision: Tools like “Run Cabling” in HyperFabric AI eliminate human error during the complex deployment of rail-optimized designs, ensuring the physical topology matches the logical requirement for low latency.

Intelligent Silicon: Moving beyond basic ECMP, Cisco utilizes Silicon One ASICs to perform Dynamic Load Balancing. This ensures packet flows are sprayed across optimal paths in real time, reducing congestion and eliminating the need for performance-killing pause frames in RDMA traffic.

Unified Operations: Whether through the on-prem Nexus Dashboard or the SaaS-based HyperFabric AI, the focus is on validated, prescriptive architectures. This removes the guesswork from configuration and provides end-to-end visibility, linking network telemetry directly to application performance and job scheduling anomalies.

For more information about Cisco’s AI networking designs and offerings, make sure to check out the AI Data Center Networking page. If you want to see the entire presentation from Networking Field Day, head over to the Cisco presentation appearance page here.