
As more companies pour money into building their artificial intelligence (AI) infrastructure from the ground up, resource sharing is offering a new path to ROI.
Multi-tenancy, the architecture that allows resources to be dynamically split between different tenants, rose to prominence with the cloud, making it easier for one provider to serve multiple customers concurrently.
But although companies are eager to do it, managing a multi-tenant environment in the AI era is harder than you might think.
“AI workloads are very demanding, so the entire stack requires to be improved starting from power, cooling, compute hardware, compute software to networking hardware,” noted Alex Saroyan, CEO and cofounder of Netris, maker of network automation, abstraction and multi-tenancy software.
But interestingly, this stack does not need to be as divergent as the companies adopting it. “Although organizations that are running AI infrastructures are coming from different verticals (we work with neo clouds, infrastructure-as-a-service providers, PaaS, sovereign AI operators, telcos), these organizations have different business models, but from a technical perspective, what they are building is very similar.”
Based out of Santa Clara, Calif., Netris was founded in 2018.
“In 2018, AI wasn’t as big as today,” recalled Saroyan. “But some of the technologies that were being used for cloud providers in networking [then] are also relevant and applicable to AI.”
While both the hardware and software aspects of AI networking are exceedingly complex, multi-tenancy adds yet another layer of complexity.
AI network infrastructure operators have a few tricks up their sleeves to overcome these headaches. The first is physical segmentation of the network, which involves physically separating hardware into small sections or subnets.
“That provides maximum security. There is no way to break in from one cluster into another if they are physically separated, but the problem is that it takes a lot of time,” Saroyan noted.
Another part of the problem is that physical segregation is hard to scale, leaving GPUs idle.
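The stranded-capacity problem is easy to see with some back-of-the-envelope arithmetic. The sketch below (not Netris code; the tenant demand figures are invented for illustration) compares GPUs left idle under static physical partitions against the same hardware treated as one shareable pool:

```python
# Illustrative sketch: why statically partitioned GPU clusters strand
# capacity compared with a dynamically shared pool.

def stranded_gpus(partitions: dict[str, int], demand: dict[str, int]) -> int:
    """GPUs left idle when each tenant is locked to its own partition.

    A tenant can only use GPUs inside its physical partition, so any
    surplus in one partition cannot serve unmet demand in another.
    """
    idle = 0
    for tenant, size in partitions.items():
        used = min(size, demand.get(tenant, 0))
        idle += size - used
    return idle

def stranded_gpus_shared(total: int, demand: dict[str, int]) -> int:
    """GPUs left idle when the same hardware is one shareable pool."""
    return max(0, total - sum(demand.values()))

partitions = {"tenant-a": 64, "tenant-b": 64, "tenant-c": 64}
demand = {"tenant-a": 90, "tenant-b": 20, "tenant-c": 40}

print(stranded_gpus(partitions, demand))       # 68 GPUs idle when split
print(stranded_gpus_shared(192, demand))       # 42 GPUs idle when pooled
```

With these made-up numbers, tenant-a's unmet demand cannot borrow tenant-b's surplus across a physical boundary, so 68 GPUs sit idle instead of 42.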
To work around that, operators take to manually configuring the infrastructure to make it shareable across tenants. “That does the job, but the problem with that is…it takes sometimes weeks to configure hundreds of switches, and the AI infrastructure has lots of switches.”
There is a high risk of human error involved with manually configuring a network that is as large and complex as this. The smallest misconfiguration can potentially break the network for other tenants using it, impacting their training workloads, Saroyan pointed out.
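One class of error automation can catch that a human editing hundreds of switches cannot reliably spot is a resource accidentally assigned to two tenants. The sketch below is hypothetical (the tenant and VLAN data are invented), showing the kind of pre-push validation an automation layer can run before any configuration ships to the fabric:

```python
# Hypothetical pre-push check: flag any VLAN ID claimed by more than
# one tenant before the configuration reaches the switches.

def find_vlan_conflicts(assignments: dict[str, set[int]]) -> set[int]:
    """Return VLAN IDs claimed by more than one tenant."""
    seen: dict[int, str] = {}        # VLAN -> first tenant that claimed it
    conflicts: set[int] = set()
    for tenant, vlans in assignments.items():
        for vlan in vlans:
            if vlan in seen and seen[vlan] != tenant:
                conflicts.add(vlan)
            seen.setdefault(vlan, tenant)
    return conflicts

assignments = {
    "tenant-a": {100, 101, 102},
    "tenant-b": {200, 201},
    "tenant-c": {102, 300},   # 102 collides with tenant-a: a classic fat-finger
}
print(find_vlan_conflicts(assignments))  # {102}
```

Caught in review, VLAN 102 is a one-line fix; pushed live, it is exactly the kind of misconfiguration that breaks the network for another tenant mid-training-run.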
To bypass that, many operators resort to building in-house automation solutions, but Saroyan argues those are “not great solutions.”
“The big problem with building automation in-house is that the organization that is building [it] learns from their own errors, versus a specialized product that is learning from the data coming from lots and lots of customers… They are risky, they take a lot of time, and not a solution for cloud providers.”
The alternative is API- or software-driven isolation using Kubernetes and containers. Many operators prefer this option for its instantaneous results, but it fails to enforce isolation at the networking layer, which opens up security holes in the fabric, Saroyan pointed out.
“It takes just one container vulnerability to jeopardize the reputation of a cloud provider,” he said.
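The distinction can be sketched in a toy model (this is not Netris's implementation; the tenant-to-host mappings are invented). Container boundaries keep workloads apart on a host, but unless the fabric itself refuses cross-tenant traffic, one compromised container can reach another tenant's nodes over the network:

```python
# Toy model of network-layer isolation: the fabric forwards a flow
# only when source and destination belong to the same tenant.

TENANT_OF = {
    "10.0.1.5": "tenant-a",
    "10.0.1.6": "tenant-a",
    "10.0.2.5": "tenant-b",
}

def fabric_permits(src_ip: str, dst_ip: str) -> bool:
    """Network-layer rule: forward only intra-tenant traffic."""
    src, dst = TENANT_OF.get(src_ip), TENANT_OF.get(dst_ip)
    return src is not None and src == dst

print(fabric_permits("10.0.1.5", "10.0.1.6"))  # True: same tenant
print(fabric_permits("10.0.1.5", "10.0.2.5"))  # False: cross-tenant blocked
```

The point of the sketch is where the check lives: in the network itself, so the guarantee holds even if a container on a tenant-a host is compromised.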
A solution that provides security at a level comparable to physical isolation, while delivering it at software speed, would make a big difference. Netris’s “cloud-provider-grade network automation” solution aims to do just that.
The Netris network automation, abstraction and multi-tenancy software brings together the concept of the VPC, cloud networking functions like VPC peering, elastic load balancers and internet/NAT gateways, and network management, on a single plane.
“The VPC is the unit of isolation, and when an AI infrastructure operator is building a cloud, they need this same abstraction model that works in the cloud,” Saroyan explained while showcasing the Netris platform at the AI Infrastructure Field Day event in late April.
The Netris VPC allows operators to securely manage resources within a logically separated virtual network. They can create new VPCs, edit existing ones and delete those that are not in use, dynamically.
Available on top of this is a range of ready-made cloud networking functions (VPC peering to connect VPCs to upstream networks, internet gateways and load balancers) which make building and operating the infrastructure much simpler.
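The lifecycle described above (create, peer, delete, dynamically) can be sketched as a minimal in-memory model. The class and method names here are illustrative, not the actual Netris API:

```python
# Minimal sketch of the VPC abstraction: VPCs as units of isolation
# that can be created, peered, and deleted on the fly.

class VpcManager:
    def __init__(self) -> None:
        self.vpcs: set[str] = set()
        self.peerings: set[frozenset[str]] = set()

    def create(self, name: str) -> None:
        self.vpcs.add(name)

    def peer(self, a: str, b: str) -> None:
        if a not in self.vpcs or b not in self.vpcs:
            raise ValueError("both VPCs must exist before peering")
        self.peerings.add(frozenset((a, b)))

    def delete(self, name: str) -> None:
        self.vpcs.discard(name)
        # Drop any peering that referenced the deleted VPC.
        self.peerings = {p for p in self.peerings if name not in p}

    def connected(self, a: str, b: str) -> bool:
        return frozenset((a, b)) in self.peerings

mgr = VpcManager()
mgr.create("training-vpc")
mgr.create("inference-vpc")
mgr.peer("training-vpc", "inference-vpc")
print(mgr.connected("training-vpc", "inference-vpc"))  # True
mgr.delete("training-vpc")
print(mgr.connected("training-vpc", "inference-vpc"))  # False
```

Note how deleting a VPC automatically tears down its peerings; that kind of dependent cleanup is exactly what an operator would otherwise have to remember to do by hand across the fabric.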
“Inference workloads are used by users on the internet; think ChatGPT and their customers. So if you’re running inference workloads in the AI cluster, you need some sort of load balancer which will take traffic from the internet and balance it between these different machines, like in the cloud… If you are a cloud provider, you need methods to provide these constructs yourself,” he said.
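At its simplest, the load-balancer role Saroyan describes is request spreading across GPU nodes. The bare-bones sketch below uses invented node names; a real cloud load balancer would also handle health checks, TLS termination and connection draining:

```python
# Bare-bones round-robin balancer: hand each incoming inference
# request to the next GPU node in rotation.
from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, nodes: list[str]) -> None:
        self._nodes = cycle(nodes)

    def pick(self) -> str:
        """Return the next backend node for an incoming request."""
        return next(self._nodes)

lb = RoundRobinBalancer(["gpu-node-1", "gpu-node-2", "gpu-node-3"])
print([lb.pick() for _ in range(4)])
# ['gpu-node-1', 'gpu-node-2', 'gpu-node-3', 'gpu-node-1']
```

The point of the quote is not the algorithm but who supplies it: in a public cloud the provider hands you this construct ready-made, while an AI cloud operator must offer the equivalent to its own tenants.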
The third piece of the puzzle is network management. โLet’s not forget that all these beautiful clouds are running on the physical network, and there are network engineers who need to take care of the network, deploy the fabric, upgrade and downgrade it, perform maintenance, learn if something is not good with the health of the fabric. So, for that you need a system that does your fabric management.โ
If these three components came from different providers, the tools could conflict with one another while attempting overlapping tasks.
“If more than one system is trying to make the same changes to the front- or back-end network, it creates risks for logical disconnects and errors,” Saroyan added.
Netris, being a single tool, works from the same data, which eliminates the risk of conflicts.
Saroyan noted that up to 60% of Netris’s customers use InfiniBand for their backend fabric. So to avoid friction, Netris works through the NVIDIA UFM controller for InfiniBand management. “Like I said, two systems should not edit one system. We don’t want to break our own rules. So we talk to UFM, and UFM talks to InfiniBand switches.”
For Ethernet, however, Netris serves as the full management layer. “From a customer perspective, there is a single pane of glass that takes care of the entire stack, but behind the scenes, we manage Ethernet, and we manage InfiniBand through UFM,” he said.
However, it is important to remember that no one vendor can solve all the challenges of building and managing AI infrastructure, so the smarter approach is to find the right vendor with the right solution for each facet.
As Netris puts it, “One company cannot solve all these pieces of the puzzle. Even NVIDIA, being this amazing large organization, they don’t have all the components. That’s why it creates opportunity for partners like us and others.”