
Cutting-edge AI infrastructure needs all the performance it can get, but these environments must also be efficient and reliable. This episode of Utilizing Tech, brought to you by Solidigm, features Davide Villa of Xinnor discussing the value of modern software RAID and NVMe SSDs with Ace Stryker and Stephen Foskett. Xinnor xiRAID leverages the resources of the server, including the AVX instruction set found on modern CPUs, to combine NVMe SSDs, providing high performance and reliability inside the box. Modern servers have multiple internal drive slots, and all of these drives must be managed and protected in the event of failure. This is especially important in AI servers, since an ML training run can take weeks, amplifying the risk of failure. Software RAID can be used in many different implementations, with various file systems, including NFS, and over high-performance networks like InfiniBand, and it can be tuned to maximize performance for each workload. Xinnor can help customers tune the software to maximize the reliability of SSDs, especially with QLC flash, by adapting the chunk size and minimizing write amplification. Xinnor also produces a storage platform solution called xiSTORE that combines xiRAID with the Lustre clustered file system, which is already popular in HPC environments. Although many environments can benefit from a full-featured storage platform, others need a software RAID solution to combine NVMe SSDs for performance and reliability.
Unlocking the Raw Device Performance of NVMe SSDs, with Xinnor xiRAID
A new generation of workloads is poised to rewrite data storage rules in enterprises, and turn old assumptions stale. For years, the conventional wisdom has been to buy storage for capacity. Now the matter is complicated by the arrival of quirky, power-hungry AI workflows. Performance is the new most-coveted thing.
AI workloads are read/write-intensive, and their I/O patterns are stressing traditional storage systems, demanding more speed and capacity.
For this episode of the Utilizing Tech podcast, brought to you by Solidigm, co-hosts Stephen Foskett and Ace Stryker, Solidigm’s Director of Market Development, meet with Davide Villa, Xinnor’s Chief Revenue Officer. They talk about the way AI workloads are reshaping the storage industry, and about xiRAID, Xinnor’s software RAID technology, which aims to unlock maximum performance and improved fault tolerance from NVMe SSDs.
Traditional RAID vs Software RAID
Refitting servers with high-power NVMe SSDs can beef up performance and capacity, but it does not guarantee results. Enterprises rely on RAID, or Redundant Array of Independent Disks, a technique that virtualizes drives into arrays, to tap into greater performance and capacity.
But with hardware RAID controllers, resources within the drives remain unused, leaving them working at only a fraction of their potential. A smarter way of getting the most out of NVMe drives, says Villa, is through software.
“There are enough resources within the server, and we don’t need to add any accelerator or additional components that might become a single point of failure at some point,” he says.
Software-defined RAID solutions are designed to help tap into the unused potential of CPUs at no great extra cost, ensuring that hungry AI workloads are kept fed while significant cost-saving is achieved.
Founded only two years ago, Xinnor has developed a software RAID solution that holds the key to optimizing the different pieces of the AI data infrastructure.
“We are a young company,” says Villa, “but we are leveraging more than 10 years of development in optimizing datapath to provide very fast storage.”
Xinnor serves a class of customers that are embracing AI in small spurts. “Traditional HPC players at university and research institutes are now all facing some level of AI workloads. Our main market is definitely becoming providing very fast storage for those workloads.”
These organizations are actively investing in GPUs to get AI-ready. “GPUs, they are expensive systems,” Villa comments. “The customer cannot afford to keep them waiting for data. So it’s absolutely critical that the storage that is selected to provide data for AI models is capable of delivering stable performance in tens of GB per second.”
Xinnor xiRAID
Xinnor’s xiRAID engine is built on a decade of R&D work. A software-only solution, it works with NVMe SSDs and with any CPU that supports the AVX instruction set. xiRAID uses Xinnor’s own lockless datapath, which distributes load across all available CPU cores instead of relying on a single core.
“By doing that, we avoid spikes, and get stable performance not just in normal operations but also in degraded modes.”
The result is close to raw performance numbers in RAID arrays.
A major point of difference between xiRAID and other products is that it does not demand significant memory. “We don’t have a cache in our RAID implementation, so we don’t need memory allocation. That’s the primary difference,” Villa explains.
Reflecting on the decision to develop a proprietary software RAID implementation, Villa said, “Traditional hardware RAID architecture cannot keep up with the level of parallelism of new NVMe drives. The level of parallelism that you can get on PCIe Gen 4.0 and even more on Gen 5.0, is such that you need a powerful CPU to be able to run the checksum calculation.”
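To make the “checksum calculation” concrete, here is a minimal sketch of single-parity protection as used in generic RAID 5: the parity chunk is the byte-wise XOR of the data chunks, and any one lost chunk can be rebuilt by XOR-ing the survivors with the parity. The episode does not describe xiRAID’s actual algorithm, so the function names and sample data below are purely illustrative. Production implementations run this kind of arithmetic over wide SIMD registers such as AVX rather than byte by byte.

```python
from functools import reduce


def xor_parity(chunks: list[bytes]) -> bytes:
    """Return the byte-wise XOR of equally sized chunks (RAID 5 style parity)."""
    if len({len(c) for c in chunks}) != 1:
        raise ValueError("chunks must be non-empty and all the same length")
    # zip(*chunks) walks the chunks column by column; each column is XOR-ed.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def rebuild_missing(surviving: list[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing chunk from the survivors and the parity."""
    return xor_parity(surviving + [parity])


# Hypothetical stripe with three data chunks (illustrative data only).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_parity(data)

# Simulate losing the second drive and rebuilding its chunk.
recovered = rebuild_missing([data[0], data[2]], parity)
assert recovered == data[1]
```

Because every byte of every chunk takes part in the XOR, the work grows with the aggregate drive bandwidth, which is why the calculation benefits from many cores and vector instructions.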
One of the limitations of hardware RAID is the number of PCIe lanes. The largest PCIe slot has 16 lanes, and each NVMe drive needs 4 lanes for optimal performance. This means a RAID controller in a single x16 slot can serve at most 4 drives at full bandwidth. That is hardly sufficient for AI shops that have tens of drives deployed across servers.
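The lane arithmetic can be written out as a short sketch. The 16-lane slot and 4 lanes per drive come from the paragraph above; the 24-drive server is a hypothetical example, and the function names are illustrative.

```python
import math


def drives_per_slot(slot_lanes: int = 16, lanes_per_drive: int = 4) -> int:
    """Drives that one slot can feed at full per-drive bandwidth."""
    return slot_lanes // lanes_per_drive


def slots_needed(drive_count: int, slot_lanes: int = 16, lanes_per_drive: int = 4) -> int:
    """x16 slots required to give every drive its full lane allocation."""
    return math.ceil(drive_count / drives_per_slot(slot_lanes, lanes_per_drive))


print(drives_per_slot())   # 4
print(slots_needed(24))    # 6 (hypothetical 24-drive server)
```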
“For NVMe drives and for AI workload, there is only one way to grow which is software.”
AI models run for weeks and months on end. Any small failure can jeopardize the entire operation, causing chunks of data to go missing and performance to degrade without warning.
Most companies don’t have a clearly defined set of storage requirements to address these problems. But they agree on one thing – the need to provision storage for extreme workloads.
Villa points out that every AI workload is different and so are the needs. “If you want to oversimplify, we can say that AI workload is mostly sequential by nature and has the combination of read during ingestion and write during the checkpointing. But not all the AI models and AI training are equal. So distinctions need to be made and we see that random performance plays a role as well.”
One thing is certain – high-performance storage systems are critical for any AI workload.
As the clamor for performance gets louder, customers are showing a growing reluctance to move away from familiar infrastructures that they have spent years learning the nuts and bolts of.
“They would like to keep on using the popular parallel file system that they used on HPC implementations and leverage their competence in using those systems to also run AI models.”
Xinnor’s solution is a great fit for the varying storage implementations of companies because it gives system administrators the flexibility to pick the right geometry and chunk size.
The hunger for more capacity points to QLC drives, but QLC flash tolerates only a limited number of program/erase (P/E) cycles. “By selecting the proper chunk size, we are able to minimize the write amplification into the SSD and by doing that we can enable using QLC for extensive AI projects.”
Write amplification (WA) is a phenomenon in SSDs where the amount of data physically written to flash is a multiple of the data the host asked to write. Because QLC SSDs have limited P/E cycles, this ratio needs to be kept as close to one as possible.
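As a rough illustration of why that ratio matters, the sketch below computes a write amplification factor and a common first-order estimate of how much host data a drive can absorb over its life. The capacity, P/E cycle count, and byte counters are made-up numbers chosen for easy arithmetic, not figures for any real drive, and the function names are illustrative.

```python
def write_amplification(flash_bytes_written: float, host_bytes_written: float) -> float:
    """Ratio of data physically written to flash to data written by the host."""
    if host_bytes_written <= 0:
        raise ValueError("host_bytes_written must be positive")
    return flash_bytes_written / host_bytes_written


def host_write_budget_tb(capacity_tb: float, pe_cycles: int, waf: float) -> float:
    """First-order lifetime estimate: capacity x P/E cycles / write amplification."""
    return capacity_tb * pe_cycles / waf


# Illustrative counters: the host wrote 100 GB, the flash absorbed 250 GB.
waf = write_amplification(250, 100)
print(waf)                                   # 2.5

# Hypothetical 30 TB drive rated for 1000 P/E cycles.
print(host_write_budget_tb(30, 1000, waf))   # 12000.0
print(host_write_budget_tb(30, 1000, 1.0))   # 30000.0
```

In this simplified model, bringing the ratio from 2.5 down to 1 more than doubles the amount of host data the same drive can accept before wearing out.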
The advantage of xiRAID is that it lets administrators flexibly change the chunk size, ensuring that only the minimum amount of data is written to the SSDs.
“With our software, we can find the proper tuning based on the workload, the number of drives that are part of the RAID array, the level of RAID and we are able to find the optimal configuration to keep this number as close to one as possible.”
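The link between chunk size, drive count, and RAID level can be sketched with a simplified model of generic parity RAID. This is not Xinnor’s tuning method, which the episode does not detail. The model assumes chunk-aligned writes whose size is a multiple of the chunk size, counts only array-level writes (data chunks plus parity chunks), and ignores the SSD’s internal garbage collection. All names and numbers are illustrative.

```python
import math


def full_stripe_bytes(chunk_kib: int, total_drives: int, parity_drives: int = 1) -> int:
    """Bytes of user data held by one full stripe."""
    data_drives = total_drives - parity_drives
    if data_drives < 1:
        raise ValueError("need at least one data drive")
    return chunk_kib * 1024 * data_drives


def array_write_overhead(io_bytes: int, chunk_kib: int,
                         total_drives: int, parity_drives: int = 1) -> float:
    """(data chunks + parity chunks written) / data chunks for one aligned write."""
    if io_bytes <= 0:
        raise ValueError("io_bytes must be positive")
    data_drives = total_drives - parity_drives
    if data_drives < 1:
        raise ValueError("need at least one data drive")
    data_chunks = math.ceil(io_bytes / (chunk_kib * 1024))
    stripes_touched = math.ceil(data_chunks / data_drives)
    return (data_chunks + stripes_touched * parity_drives) / data_chunks


# Hypothetical 10-drive array with two parity drives and 1 MiB writes.
one_mib = 1024 * 1024
print(full_stripe_bytes(128, 10, 2))                # 1048576: a 1 MiB write fills a stripe
print(array_write_overhead(one_mib, 128, 10, 2))    # 1.25
print(array_write_overhead(one_mib, 1024, 10, 2))   # 3.0
```

With 128 KiB chunks, the 1 MiB write covers exactly one full stripe, so two parity chunks are amortized over eight data chunks. With 1024 KiB chunks, the same write touches a single data chunk yet still forces two parity updates, which is the kind of overhead that matching chunk size to the workload is meant to avoid.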
Also included with the solution is support for finding the best configuration. The Xinnor team spends a lot of time studying the workloads and their requirements to gain a well-rounded understanding of them. Then they work out a configuration that the system administrator would otherwise have had to determine alone.
xiSTORE – A Robust and Scalable Storage for AI and HPC
Xinnor also offers a second solution, xiSTORE, the first in a planned lineup. “Our core competence is in the datapath and it’s how we create a very efficient RAID. We see that for some industries a standalone RAID, at least for some customers, is not sufficient. They’re looking for a broader solution.”
For them, the combination of xiRAID and xiSTORE is meant to tackle problems like long drive rebuild times.
It’s a high-availability implementation, says Villa. “There is no single point of failure. You can lose a server and still have all the data up and running. We have our control plane to manage virtual machines and on top of those virtual machines, we mount the Lustre parallel file system.”
Villa calls it an “end-to-end solution” that AI and HPC companies would find useful if their goal is to avoid combining xiRAID with third-party software.
But while Xinnor is set to bring more software-defined storage solutions to the market, it will not stray from the software RAID technology, which Villa describes as their “core competence”.
Tune in to listen to the whole conversation – Maximum Performance and Efficiency in AI Data Infrastructure with Xinnor – at the Utilizing Tech website. Be sure to check out Solidigm and Xinnor’s whitepaper on this in the Insight hub of Solidigm’s website.
Podcast Information:
Stephen Foskett is the Organizer of the Tech Field Day Event Series and President of the Tech Field Day Business Unit, now part of The Futurum Group. Connect with Stephen on LinkedIn or on X/Twitter and read more on the Gestalt IT website.
Ace Stryker is the Director of Product Marketing at Solidigm. You can connect with Ace on LinkedIn and learn more about Solidigm and their AI efforts on their dedicated AI landing page or watch their AI Field Day presentations from the recent event.
Davide Villa is the Chief Revenue Officer at Xinnor. You can connect with Davide on LinkedIn and learn more about Xinnor on their website.
Thank you for listening to Utilizing Tech with Season 7 focusing on AI Data Infrastructure. If you enjoyed this discussion, please subscribe in your favorite podcast application and consider leaving us a rating and a nice review on Apple Podcasts or Spotify. This podcast was brought to you by Solidigm and by Tech Field Day, now part of The Futurum Group. For show notes and more episodes, head to our dedicated Utilizing Tech Website or find us on X/Twitter and Mastodon at Utilizing Tech.