
At Google Cloud Next '25 in April, Google Cloud unveiled a flurry of storage innovations as part of its AI Hypercomputer stack that it said are optimized specifically to supercharge artificial intelligence (AI)/ML workloads and deliver breakthrough resource and cost optimizations. The rollout was a first glimpse of how the hyperscaler is reshaping its cloud storage infrastructure for the AI workloads of the future.
At the AI Infrastructure Field Day last week, the company offered a deeper look at these newly introduced solutions.
"Choosing improper storage solutions can really negatively affect your GPU utilization. If storage is a bottleneck, your GPUs and CPUs are just sitting idle, waiting for data to come in before they can begin processing it," said Marco Abela, product manager, highlighting the need for blazing-fast storage systems to boost accelerator utilization in AI deployments.
Throughout the AI pipeline, from the early stages of data preparation to inferencing and delivery, workloads throw shifting I/O demands at the underlying storage systems, making it imperative that those systems are fast and adaptive. A combination of bottomless capacity, extreme aggregate throughput and sub-millisecond latency needs to be constantly available to satisfy the irregular and unpredictable demands.
"Stuff like checkpoint restores are bursty; model loads are very bursty. But basically, just depending on the different pipeline, a subset of the [workloads] can have all of these different storage requirements…but these are typically what we see in the AI/ML pipeline," Abela noted.
"There’re really two different aspects of storage that we’re recommending for people today," Sean Derrington, group product manager, said at the event. "One is Managed Lustre, but then also our cloud storage portfolio with Anywhere Cache."
Google Cloud Managed Lustre is a fully managed parallel file system developed jointly with DDN, the primary maintainer of the open-source Lustre file system.
"Parallel file system is really good for AI workloads that have a handful of clients where you want to drive very high bandwidth to a single client, but also scales to hundreds of thousands of GPUs and TPUs," Derrington noted.
Built on the core EXAScaler Lustre file system, Managed Lustre offers petabytes of capacity with up to 1 TB/s of throughput. Google Cloud touts it as specifically tuned for AI and HPC applications because of its ability to support extreme IOPS and sub-millisecond latency.
"We’re launching this as really a persistent storage offering that is very highly scalable to a petabyte in a single file system," Derrington said while giving an overview of the solution.
The zonal solution co-locates with accelerators, which provides an innate advantage. "This actually not only accelerates the training, but also does very fast checkpointing with full duplex capability as well as being able to do high performance inferencing," Derrington said.
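To make that checkpointing pattern concrete, here is a minimal sketch of saving and restoring a training checkpoint on a Managed Lustre mount. It assumes the file system has already been attached to the VM; the /mnt/lustre path and the PyTorch model are placeholders, and the point is simply that Lustre presents ordinary POSIX file semantics to the job.

```python
# Hedged sketch: checkpoint save/restore against a Managed Lustre mount.
# The mount path and model are placeholders, not Google's reference code.
import os
import torch
import torch.nn as nn

CKPT_DIR = "/mnt/lustre/checkpoints"   # assumed mount point (placeholder)
os.makedirs(CKPT_DIR, exist_ok=True)

model = nn.Linear(4096, 4096)          # stand-in for a real model
optimizer = torch.optim.AdamW(model.parameters())

# Checkpoint writes are large sequential bursts, the pattern Abela
# described as "very bursty".
torch.save(
    {"model": model.state_dict(), "optimizer": optimizer.state_dict()},
    os.path.join(CKPT_DIR, "step_100.pt"),
)

# Restore is the matching bursty read at job (re)start.
state = torch.load(os.path.join(CKPT_DIR, "step_100.pt"))
model.load_state_dict(state["model"])
optimizer.load_state_dict(state["optimizer"])
```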
Launched into general availability at Next '25, Anywhere Cache is another solution Google has developed to bridge the gap between data and compute. Paired with a multi-region bucket, the consistent cache lets data be accessed from any location on a given continent. "It’s a single bucket and single namespace," Derrington emphasized.
"The challenge is to figure out where you can have compute and accelerator resources and how do you actually improve the performance of your AI workloads in those regions. This is exactly what Anywhere Cache does," he said.
Under the hood, Anywhere Cache co-locates a read-only cache of up to 1 TB of data in the same zone as the accelerators within a given region, delivering throughput of up to 2.5 TB/s while cutting latency by 70%, Google Cloud said.
"Some of the benefits [include] cache hits upwards of 99% to accelerate training needs," Derrington added.
With a single command, Anywhere Cache can be activated on an existing regional bucket. "So if I’m in Oregon and I know my accelerators are in zone B, I can turn on the cache in zone B and that’s going to cache that data closer," he explained.
"It’s not limited to just one zone; you can enable it on any zone you want within any given continent," Derrington said.
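As a rough sketch of that workflow, the snippet below creates a cache in a chosen zone through the Cloud Storage JSON API's anywhereCaches resource; the gcloud CLI wraps the same operation in the single command Derrington mentioned. The bucket name, zone, and TTL here are placeholder assumptions, and auth relies on Application Default Credentials.

```python
# Hedged sketch: enable Anywhere Cache on an existing bucket by creating
# a cache in the zone where the accelerators run. Bucket, zone, and TTL
# are placeholders.
import google.auth
from google.auth.transport.requests import AuthorizedSession

credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/devstorage.full_control"]
)
session = AuthorizedSession(credentials)

BUCKET = "my-training-data"   # existing bucket, e.g., in us-west1 (Oregon)
ZONE = "us-west1-b"           # zone hosting the GPUs/TPUs

resp = session.post(
    f"https://storage.googleapis.com/storage/v1/b/{BUCKET}/anywhereCaches",
    json={"zone": ZONE, "ttl": "86400s"},  # cached data lives for a day
)
resp.raise_for_status()
print(resp.json())  # long-running operation describing the cache creation
```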
Abela also showcased a FUSE adapter, dubbed Cloud Storage FUSE, that lets users mount and access Cloud Storage buckets like local file systems, allowing applications to read and write objects in a storage bucket with standard file system semantics.
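As a quick illustration of those semantics, assuming a bucket has already been mounted (for example with `gcsfuse my-bucket /mnt/gcs`; the bucket name and mount point are placeholders), application code can treat objects as regular files:

```python
# Objects in the mounted bucket behave like ordinary files; the paths
# below are placeholders for an existing Cloud Storage FUSE mount.
from pathlib import Path

mount = Path("/mnt/gcs")

# Writing a file creates an object in the bucket.
(mount / "dataset").mkdir(exist_ok=True)
(mount / "dataset" / "sample.txt").write_text("hello from Cloud Storage FUSE")

# Reading uses the same call as any local file.
print((mount / "dataset" / "sample.txt").read_text())

# Listing a directory maps to listing objects under that prefix.
for entry in (mount / "dataset").iterdir():
    print(entry.name)
```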