The concept of pooling and sharing resources is a constant in IT.

From mainframes to minicomputers to PCs, computer systems have been built to be shared among users to get the most value out of their resources. The virtualization revolution and the public cloud rest on the same idea: pooling computers for more efficient consumption.

One of the benefits of the public cloud is that it pools resources across many cloud customers, providing near-infinite capacity on demand. But that massive scalability and flexibility comes at a cost. If your application doesn't need those features, keep it on-premises.

The other vital thing you have on-premises is your data, particularly the data that feeds applications using generative artificial intelligence (AI). Many AI-enabled applications use retrieval augmented generation (RAG) to supply a general-purpose AI model with business-specific data, often in near-real time. Data such as real-time inventory information or updates to the internal support knowledge base allows an AI application to perform like a knowledgeable employee.
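In code, the RAG pattern is straightforward: retrieve the relevant business data, fold it into the prompt, and then call the model. The sketch below is illustrative only; embed(), vector_search() and generate() are hypothetical placeholders for whatever embedding model, vector database and inference endpoint an application actually uses.

```python
# Illustrative sketch of the RAG flow described above.
# embed(), vector_search() and generate() are hypothetical placeholders,
# not any specific product's API.

def answer_with_rag(question: str) -> str:
    query_vector = embed(question)                     # 1. embed the question
    documents = vector_search(query_vector, top_k=3)   # 2. fetch relevant business data
    context = "\n".join(doc.text for doc in documents)

    # 3. Augment the prompt with business-specific context, such as the
    #    inventory rows or knowledge-base articles retrieved above.
    prompt = (
        "Answer the question using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    return generate(prompt)                            # 4. call the general-purpose LLM
```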

This business data is often on-premises, and the risk or cost of placing it in the public cloud may be too high.

Running inference and the vector database that enables RAG on-premises keeps your data on private infrastructure. Nor is the latest graphics processing unit (GPU) always essential for running an AI application: CPUs have been running predictive AI/ML for many years, and DeepSeek has shown that a generative AI application can be built and operated on older GPUs.
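Keeping inference in-house can be as simple as pointing the application at a model server inside the data center. The snippet below is a minimal sketch that assumes an on-premises inference server exposing an OpenAI-compatible chat API; the hostname, port and model name are placeholders.

```python
import requests

# Assumed: an on-premises inference server exposing an OpenAI-compatible
# chat completions endpoint. Hostname, port and model name are placeholders.
LOCAL_INFERENCE_URL = "http://llm.internal.example:8000/v1/chat/completions"

def generate(prompt: str) -> str:
    response = requests.post(
        LOCAL_INFERENCE_URL,
        json={
            "model": "local-llm",  # whichever model is served on-premises
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0.2,
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["choices"][0]["message"]["content"]
```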

Generative AI applications don’t always need massive amounts of computing power and dedicated clusters of the latest and greatest NVIDIA GPUs. An optimized large language model (LLM) often requires only a fraction of a GPU, so a platform that can assign a fractional GPU to a virtual machine (VM) is often the most cost-effective option.
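A back-of-envelope estimate shows why a fraction of a GPU can be enough; the figures below are illustrative assumptions, not a sizing guide.

```python
# Rough memory estimate for serving a quantized LLM.
# All numbers are illustrative assumptions, not a sizing recommendation.

params_billion = 8        # e.g. an ~8-billion-parameter model
bytes_per_param = 0.5     # 4-bit quantization is roughly half a byte per weight
overhead_factor = 1.3     # rough allowance for KV cache and runtime buffers

weights_gb = params_billion * bytes_per_param   # ~4 GB of weights
total_gb = weights_gb * overhead_factor         # ~5 GB in practice

gpu_memory_gb = 80        # a typical data-center GPU
print(f"Approx. {total_gb:.1f} GB needed, "
      f"about {total_gb / gpu_memory_gb:.0%} of an {gpu_memory_gb} GB GPU")
```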

Conversely, massive scale-out farms of GPUs are vital for creating new LLMs, and smaller GPU clusters are integral to fine-tuning a model for a specific task. Creating and fine-tuning LLMs require scalability and are commonly done on a public cloud platform. Once the model is created and fine-tuned, it can be brought on-site for inference.

Building a good generative AI application is not yet a mass-production activity. Data scientists are usually needed to analyze the problem and the available data to identify the right LLM and embedding model to deliver useful results. We are not yet at the stage where a traditional software developer can readily create and integrate a generative AI component into an application. In a recent Tech Field Day Spotlight episode, Tasha Drew from Broadcom shared some insights on the VMware Private AI platform and how customers run on-premises AI using their existing platforms and staff.

VMware Private AI builds on VMware Cloud Foundation (VCF), allowing AI workloads to run alongside more conventional enterprise applications. One of the features of the platform is an API gateway, which allows the separation of AI creation from the use of that AI in an application. Separating these duties allows the effective use of scarce LLM and data engineering skills without placing more burden on the software developers.
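In practice, that separation means application code only ever refers to a logical model name exposed by the gateway, while the team that owns the models decides which approved backend actually serves it. The sketch below is a hypothetical illustration of the idea, not VMware's implementation; all names and URLs are placeholders.

```python
# Hypothetical illustration of the decoupling an AI gateway provides.
# The routing table is owned by the data-science/platform team; application
# developers only refer to logical model names. Names and URLs are placeholders.

MODEL_ROUTES = {
    "support-assistant": "http://llm-cluster.internal.example/v1/chat/completions",
    "code-helper": "http://llm-cluster.internal.example/v1/chat/completions",
}

def route(logical_model: str) -> str:
    """Resolve a logical model name to the backend currently serving it."""
    if logical_model not in MODEL_ROUTES:
        raise ValueError(f"Model '{logical_model}' is not approved for use")
    return MODEL_ROUTES[logical_model]

# Swapping the underlying LLM behind "support-assistant" is a routing change,
# not an application change.
```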

One challenge for on-premises generative AI is model governance: ensuring that only approved foundation models are used and that their use complies with corporate data governance. VMware leverages the open-source Harbor container registry to store local models, allowing integration into existing enterprise governance.
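As a simple illustration of that kind of governance, an application could refuse to load any model that isn't published to an approved project in the internal Harbor registry. The registry hostname and project name below are placeholders.

```python
# Minimal sketch of a model-governance check, assuming approved models are
# published to a dedicated project in an internal Harbor registry.
# The registry hostname and project name are placeholders.

APPROVED_REGISTRY = "harbor.internal.example"
APPROVED_PROJECT = "approved-models"

def is_approved_model(model_reference: str) -> bool:
    """Allow only models from the governed Harbor project,
    e.g. 'harbor.internal.example/approved-models/llama-8b:v3'."""
    registry, _, rest = model_reference.partition("/")
    project = rest.split("/", 1)[0] if "/" in rest else ""
    return registry == APPROVED_REGISTRY and project == APPROVED_PROJECT

assert is_approved_model("harbor.internal.example/approved-models/llama-8b:v3")
assert not is_approved_model("docker.io/someone/unvetted-model:latest")
```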

VCF has been pooling and sharing compute resources for years. Most large organizations already have skilled teams and mature processes around operating applications on VCF.

So, they already have the people and the platform to run production AI applications in their on-premises data centers. The VMware Private AI platform, built on VMware Cloud Foundation, is production-ready and already familiar to most large organizations.

The public cloud is a great place to experiment and develop new AI models and applications. On-premises data centers are often the most cost-effective and efficient way to run production applications with steady-state resource demands or tight data controls.
