
At Google Cloud Next '25 in April, Google Cloud unveiled a flurry of storage innovations as part of its AI Hypercomputer stack that it said are optimized specifically to supercharge artificial intelligence (AI)/ML workloads and deliver breakthrough resource and cost optimizations. The rollout was a first glimpse of how the hyperscaler is reshaping its cloud storage infrastructure for the AI workloads of the future.
At the AI Infrastructure Field Day last week, the company offered a deeper look at these newly introduced solutions.
"Choosing improper storage solutions can really negatively affect your GPU utilization. If storage is a bottleneck, your GPUs and CPUs are just sitting idle, waiting for data to come in before they can begin processing it," said Marco Abela, product manager, highlighting the need for blazing-fast storage systems to boost accelerator utilization in AI deployments.
Throughout the AI pipeline, from the early stages of data preparation to inferencing and delivery, workloads throw shifting I/O demands at the underlying storage systems, making it imperative that those systems are fast and adaptive. A combination of bottomless capacity, extreme aggregate throughput and sub-millisecond latency needs to be constantly available to satisfy the irregular and unpredictable demands.
"Stuff like checkpoint restores are bursty; model loads are very bursty. But basically, just depending on the different pipeline, a subset of the [workloads] can have all of these different storage requirements…but these are typically what we see in the AI/ML pipeline," Abela noted.
"There’re really two different aspects of storage that we’re recommending for people today," Sean Derrington, group product manager, said at the event. "One is Managed Lustre, but then also our cloud storage portfolio with Anywhere Cache."
Google Cloud Managed Lustre is a fully managed parallel file system developed jointly with DDN, the primary maintainer of the open-source Lustre file system.
"Parallel file system is really good for AI workloads that have a handful of clients where you want to drive very high bandwidth to a single client, but also scales to hundreds of thousands of GPUs and TPUs," Derrington noted.
Built on the core EXAScaler Lustre file system, Managed Lustre offers petabytes of capacity with up to 1 TB/s of throughput. Google Cloud touts it as specifically tuned for AI and HPC applications because of its ability to support extreme IOPS and sub-millisecond latency.
"We’re launching this as really a persistent storage offering that is very highly scalable to a petabyte in a single file system," Derrington said while giving an overview of the solution.
The zonal solution co-locates with accelerators, which provides an innate advantage. "This actually not only accelerates the training, but also does very fast checkpointing with full duplex capability as well as being able to do high performance inferencing," Derrington said.
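To make that checkpointing pattern concrete, here is a minimal sketch of saving and restoring a training checkpoint on a Managed Lustre mount. It assumes the file system has already been attached to the VM; the /mnt/lustre path and the PyTorch model are placeholders, and the point is simply that Lustre presents ordinary POSIX file semantics to the job.

```python
# Hedged sketch: checkpoint save/restore against a Managed Lustre mount.
# The mount path and model are placeholders, not Google's reference code.
import os
import torch
import torch.nn as nn

CKPT_DIR = "/mnt/lustre/checkpoints"   # assumed mount point (placeholder)
os.makedirs(CKPT_DIR, exist_ok=True)

model = nn.Linear(4096, 4096)          # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters())

# Checkpoint writes are large sequential bursts, the pattern Abela
# described as "very bursty".
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    os.path.join(CKPT_DIR, "step_100.pt"),
)

# Restore is the matching bursty read at job (re)start.
state = torch.load(os.path.join(CKPT_DIR, "step_100.pt"))
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
```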
Launched into general availability at Next '25, Anywhere Cache is another solution Google has developed to bridge the gap between data and compute. Paired with a multi-region bucket, the consistent cache lets data be accessed from any location on a given continent. "It’s a single bucket and single namespace," Derrington emphasized.
"The challenge is to figure out where you can have compute and accelerator resources and how do you actually improve the performance of your AI workloads in those regions. This is exactly what Anywhere Cache does," he said.
Under the hood, Anywhere Cache co-locates a read-only cache of up to 1 TB of data in the same zone as the accelerators within a given region, delivering throughput of up to 2.5 TB/s while cutting latency by 70%, Google Cloud said.
"Some of the benefits [include] cache hits upwards of 99% to accelerate training needs," Derrington added.
With a single command, Anywhere Cache can be activated on an existing regional bucket. "So if I’m in Oregon and I know my accelerators are in zone B, I can turn on the cache in zone B and that’s going to cache that data closer," he explained.
"It’s not limited to just one zone; you can enable it on any zone you want within any given continent," Derrington said.
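As a rough sketch of that workflow, the snippet below creates a cache in a chosen zone through the Cloud Storage JSON API's anywhereCaches resource; the gcloud CLI wraps the same operation in the single command Derrington mentioned. The bucket name, zone, and TTL here are placeholder assumptions, and auth relies on Application Default Credentials.

```python
# Hedged sketch: enable Anywhere Cache on an existing bucket by creating
# a cache in the zone where the accelerators run. Bucket, zone, and TTL
# are placeholders.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/devstorage.full_control"]
)
session = AuthorizedSession(credentials)

BUCKET = "my-training-data"   # existing bucket, e.g., in us-west1 (Oregon)
ZONE = "us-west1-b"           # zone hosting the GPUs/TPUs

resp = session.post(
    f"https://storage.googleapis.com/storage/v1/b/{BUCKET}/anywhereCaches",
    json={"zone": ZONE, "ttl": "86400s"},  # cached data lives for a day
)
resp.raise_for_status()
print(resp.json())  # long-running operation describing the cache creation
```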
Abela also showcased a FUSE adapter, dubbed Cloud Storage FUSE, that lets users mount and access Cloud Storage buckets like local file systems, allowing applications to read and write objects in a storage bucket with standard file system semantics.
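As a quick illustration of those semantics, assuming a bucket has already been mounted (for example with `gcsfuse my-bucket /mnt/gcs`; the bucket name and mount point are placeholders), application code can treat objects as regular files:

```python
# Objects in the mounted bucket behave like ordinary files; the paths
# below are placeholders for an existing Cloud Storage FUSE mount.
from pathlib import Path

mount = Path("/mnt/gcs")

# Writing a file creates an object in the bucket.
(mount / "dataset").mkdir(exist_ok=True)
(mount / "dataset" / "sample.txt").write_text("hello from Cloud Storage FUSE")

# Reading uses the same call as any local file.
print((mount / "dataset" / "sample.txt").read_text())

# Listing a directory maps to listing objects under that prefix.
for entry in (mount / "dataset").iterdir():
    print(entry.name)
```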