Organizations seeking to build an infrastructure stack for AI training need to know how the data platform is going to perform. This episode of Utilizing Tech, presented by Solidigm, includes Curtis Anderson, Co-Chair of the Storage Working Group at MLCommons, discussing storage benchmarking with Ace Stryker and Stephen Foskett. MLCommons is an industry consortium seeking to improve AI solutions through joint engineering. The organization publishes the well-known MLPerf benchmark, which now includes practical metrics for storage solutions. The goal of MLPerf Storage is to answer the key question: Will a given data infrastructure support AI training of a given scale? The organization encourages storage vendors to run the benchmarks against their solutions to prove their suitability for specific workloads. The AI industry is already shifting its focus from maximum scale and performance to more-balanced infrastructure using alternative GPUs, accelerators, and even CPUs, and is increasingly concerned about price and environmental impact. The question of data preparation is also rising, and this generally uses a different CPU-focused solution. MLPerf Storage is focused on training today and will soon address data preparation, though this can be quite different for each data set. The next MLPerf Storage benchmark opens soon, and we encourage all data infrastructure companies to get involved and submit their own performance numbers.

Benchmarking Storage Systems for AI Training, with MLCommons’ Curtis Anderson

AI shops refitting their storage infrastructure for training jobs have a problem on their hands. Storage capacity and performance have expanded at a geometric rate, but no one besides the manufacturers knows how good the solutions really are for a given use case.

That’s because there is an overload of options. Picking one product over another is pure agony. Sometimes, excess choices push buyers into choosing the shiniest or the priciest product on the shelf, a decision that is often in conflict with their best interest.

Besides, how does one know in what ways two comparable solutions truly differ from one another? There is no easy way to tell. It is like comparing coffee or perfumes from different brands. A direct comparison is impossible because the application always dictates the output. It’s comparing apples to oranges. Unless a buyer uses both products, there is no way of fully knowing.

Looking Past the Hype

Vendors are prone to making their products look like they are the best in the market. It is common practice to cite raw performance numbers in advertising, numbers that rarely hold up in practice. Oftentimes, companies rely on these glossy claims to make purchase decisions.

In this episode of the Utilizing Tech podcast, presented by Solidigm, hosts Stephen Foskett and Ace Stryker, Solidigm’s director of market development for AI Product, meet with a guest who puts storage solutions through their paces every day on the job.

Curtis Anderson is the co-chair of the MLCommons Storage working group. MLCommons is an independent group that benchmarks AI systems, and the Storage working group inside MLCommons is particularly focused on storage subsystems in support of AI workloads.

For years, there has been no standard seal of quality for AI products that could tell buyers what set of tasks a product is well-suited for. The lack of good measurement and evaluation in the industry has created a vacuum of reliable information. Buyers are buying solutions oblivious of their weaknesses or without attempting to match them with real-world use cases.

There are benchmarking tools, like Cinebench and PCMark, that measure the performance of laptops and PCs, but at AI datacenter scale, companies struggle to make that determination.

The MLPerf Storage Benchmark Suite

MLCommons established a benchmarking standard with MLPerf, and MLPerf Storage extends that standard to storage. It is a sophisticated benchmark that attempts to measure not one, but all aspects of a storage solution as required by real-world use cases. The results show what mileage a storage product can offer for one phase of the AI pipeline versus another.

Vendors take part by submitting during open submission windows. MLCommons encourages storage vendors to run the benchmarks against their own solutions, imposing different workloads on them and recording performance numbers for each application, and the submissions are then peer reviewed.

The MLPerf Storage benchmark suite currently focuses exclusively on AI training. It does not rely on vendors’ marketing numbers or inflated claims. Hard evaluations are done to meticulously determine what a solution is good for, and how good it is at it.

Anderson explained how testing is done at MLCommons. “The benchmark emulates a workload which it imposes on a storage subsystem, the same workload that a training pipeline would run on the storage. So you get an honest-to-goodness answer to how a storage product or solution would perform in the real-world scenario.”

MLPerf Storage emulates the accelerators so that the tests can run without real accelerator hardware. “It was an explicit decision that we made early on because we also support academic research, open source, and other potential solutions. None of those people have the budget to go out and buy a hundred latest accelerators,” says Anderson.
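To make the idea of emulation concrete, here is a minimal Python sketch, not the actual MLPerf Storage code. It assumes an emulated accelerator simply waits for a fixed time per batch instead of computing, while the reads against storage are real. The names `run_emulated_epoch`, `read_batch`, and `compute_seconds` are hypothetical and chosen only for illustration.

```python
import time
from typing import Callable, Tuple


def run_emulated_epoch(
    read_batch: Callable[[int], object],
    num_batches: int,
    compute_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Drive real reads against storage while faking the accelerator.

    Returns (seconds spent waiting on storage, seconds of emulated compute).
    This is an illustrative assumption about how emulation could work,
    not the benchmark's real implementation.
    """
    io_seconds = 0.0
    for index in range(num_batches):
        start = clock()
        read_batch(index)            # real I/O: this is what stresses the storage
        io_seconds += clock() - start
        sleep(compute_seconds)       # stand-in for the time a GPU would spend training
    return io_seconds, num_batches * compute_seconds
```

In a sketch like this, `read_batch` would wrap whatever reads the training samples from the storage system under test, and `compute_seconds` would be set to the per-batch time of the accelerator being emulated.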

In Pursuit of the Truth

Ordinarily, AI workflows consist of five components – data, models, accelerators, storage and network. In a nutshell, here is how it all happens. Enterprises start with a problem statement that states the opportunity and the purpose, and data they want to use for it. This data could be still images, video, audio, or text.

The data is thrown into a pipeline where it is cleaned, reformatted and tokenized. Following this, the training process begins. It’s where the models learn to infer and perform specific tasks. Then it goes into inference where the model makes predictions independently. The underlying storage solution has to service all these diverse workloads.

MLCommons began with benchmarking models and accelerators, and recently added storage to the list.

“What we do in the Storage working group is benchmark the performance during that training phase of the overall workflow pipeline. It’s very data-intensive and puts a lot of stress on the storage,” Anderson notes.

Within AI, models are vastly different, and so are the workloads. Shoddy measurements can easily misdirect buyers towards the wrong products.

“An image recognition workload is different from a recommender which is different from a large language model. There’re many different types of neural network models and so they each impose a different workload on the storage.”

To make sure that buyers can match storage products correctly to each of these workloads, MLPerf Storage evaluates products against each workload individually, showing which systems pass with flying colors.

MLPerf does not overly focus on the traditional metrics that used to be the standard of measurement for storage systems. “We measure how well the storage system performs, not on the traditional megabytes-per-second and files-per-second, but how quickly and completely the GPU can stay utilized.”

AI training is an intense process that requires accelerators to work at full blast. “So we measure accelerator utilization as the core value of our benchmark to answer if a storage product or solution can keep up with a certain number of GPUs doing a particular workload.”
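As a rough illustration of what accelerator utilization means, the Python sketch below treats it as the share of wall-clock time an accelerator spends computing rather than stalled waiting for data. This definition and the function name `accelerator_utilization` are assumptions made for the example; the benchmark's exact formula may differ.

```python
def accelerator_utilization(compute_seconds: float, io_stall_seconds: float) -> float:
    """Fraction of total time spent computing (assumed definition, for illustration).

    compute_seconds:   time the (emulated) accelerator was busy training
    io_stall_seconds:  time it sat idle waiting for storage to deliver data
    """
    total = compute_seconds + io_stall_seconds
    if total <= 0:
        raise ValueError("total time must be positive")
    return compute_seconds / total


if __name__ == "__main__":
    # An accelerator busy for 90 s and stalled on storage for 10 s
    utilization = accelerator_utilization(90.0, 10.0)
    print(f"accelerator utilization: {utilization:.0%}")
```

Under this assumed definition, faster storage shrinks the stall time and pushes utilization toward 100%, which is why the metric says more about training fitness than raw megabytes per second.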

If the utilization number falls under 90%, it indicates that the storage system is overloaded, and the benchmark is rerun using a smaller number of GPUs.
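The rerun rule can be sketched as a simple loop in Python. The 90% threshold comes from the text above; everything else is an assumption for illustration, including the hypothetical `measure_au` callback (one benchmark run at a given accelerator count) and the choice to step down one accelerator at a time.

```python
from typing import Callable

AU_THRESHOLD = 0.90  # utilization floor cited in the article


def max_supported_accelerators(
    measure_au: Callable[[int], float],
    start: int,
    threshold: float = AU_THRESHOLD,
) -> int:
    """Largest accelerator count at or below `start` that meets the threshold.

    measure_au(n) is a hypothetical hook that runs the benchmark with n
    emulated accelerators and returns the measured utilization (0.0 to 1.0).
    Returns 0 if no count satisfies the threshold.
    """
    count = start
    while count > 0:
        if measure_au(count) >= threshold:
            return count
        count -= 1  # storage was overloaded: retry with fewer accelerators
    return 0


if __name__ == "__main__":
    # Toy model (assumption): storage can fully feed 8 accelerators, and
    # utilization degrades proportionally beyond that point.
    toy_storage = lambda n: min(1.0, 8 / n)
    print(max_supported_accelerators(toy_storage, start=16))
```

The number this loop returns is the kind of answer a buyer can act on: how many accelerators a given storage system can keep busy for a given workload.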

Metrics like I/O characteristics are often thrown in to hype up performance, but they don’t go further than making the product sound more enticing. For AI practitioners, the technicalities are not much use if they can’t tell if a storage product is the right fit for the use case they are aiming for.

MLPerf results directly answer which vendor has the best solutions, and how big a product will be required for a certain type of workload with a certain amount of data.

While the Storage working group is presently focused on the meat and potatoes of AI, which is training, Anderson says more complicated workloads, like data preparation, are on the horizon.

MLCommons releases two benchmark suites yearly, one in spring and another in autumn. Around these times, there is an open window for interested vendors to submit their products. Each submission goes through peer review, which is private to the submitters, and results are published in three months.

You can join the MLPerf Storage working group, or check out their results for Solidigm’s storage systems at MLCommons.org. Be sure to check out Solidigm’s website for their portfolio of ultra-fast, high-capacity SSDs for AI. For more interesting discussions on AI data infrastructure, keep watching Utilizing Tech Season 7 on your favorite podcast platform.

Podcast Information:

Stephen Foskett is the Organizer of the Tech Field Day Event Series and President of the Tech Field Day Business Unit, now part of The Futurum Group. Connect with Stephen on LinkedIn or on X/Twitter and read more on the Gestalt IT website.

Ace Stryker is the Director of Product Marketing at Solidigm. You can connect with Ace on LinkedIn and learn more about Solidigm and their AI efforts on their dedicated AI landing page or watch their AI Field Day presentations from the recent event.

Curtis Anderson is Co-Chair of the Storage Working Group at MLCommons. You can connect with Curtis on LinkedIn and learn more about the work and benchmarks by MLCommons on their website.

Thank you for listening to Utilizing Tech Season 7, which focuses on AI Data Infrastructure. If you enjoyed this discussion, please subscribe in your favorite podcast application and consider leaving us a rating and a nice review on Apple Podcasts or Spotify. This podcast was brought to you by Solidigm and by Tech Field Day, now part of The Futurum Group. For show notes and more episodes, head to our dedicated Utilizing Tech Website or find us on X/Twitter and Mastodon at Utilizing Tech.