‘No one needs a file system for AI training’ – Blocks and Files

The Evolving Landscape of AI Data Storage: Beyond the File System

AI training, a cornerstone of modern technological advancement, requires massive datasets and the ability to process them with unprecedented speed and efficiency. Traditionally, this has relied heavily on file systems, the backbone of data storage for decades. However, the rapid evolution of AI models and frameworks is calling the long-term suitability of this approach into question.

Jeff Denworth, co-founder of VAST Data, recently sparked a debate by arguing that “no one needs a file system for AI training.” His assertion, posted on X, highlights the growing adoption of object storage solutions within the AI ecosystem.

The Case for Multi-Protocol Storage

While it’s true that established players like DDN, NetApp, Pure, and WEKA are still prevalent in AI deployments, Denworth emphasizes the need for a more versatile approach. He points out that the reliance on file systems alone can become a major barrier to future-proofing investments, as AI frameworks evolve at a rapid pace.

“It’s not binary, it’s evolutionary. Historically, all of the AI training frameworks required a POSIX/file interface. Only companies developing their own frameworks would consider using object storage, and this is limited to the best of the best.”

– Jeff Denworth

Denworth advocates for “multi-protocol” storage solutions that seamlessly integrate both file system and object storage paradigms. This allows organizations to leverage the familiarity and functionality of file systems while simultaneously benefiting from the scalability, durability, and cost-effectiveness of object storage.
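
As a rough illustration of what “multi-protocol” means in practice, the toy sketch below models a single namespace whose contents are reachable both through a POSIX file path and through S3-style `get_object`/`put_object` calls. The `MultiProtocolStore` class and its method names are invented for this example; a real deployment would pair an actual file mount with an S3-compatible endpoint rather than a local directory.

```python
import os
import tempfile

class MultiProtocolStore:
    """Toy model of a multi-protocol namespace: the same bytes are
    reachable both as a POSIX file and as an S3-style object key."""

    def __init__(self, root):
        self.root = root

    def put_object(self, key, data):
        # Object-style write: the key doubles as a relative file path.
        path = os.path.join(self.root, key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def get_object(self, key):
        # Object-style read over the same namespace.
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()

    def posix_path(self, key):
        # File-style access: hand back a plain path for POSIX tooling.
        return os.path.join(self.root, key)

root = tempfile.mkdtemp()
store = MultiProtocolStore(root)
store.put_object("datasets/train/shard-0000.tar", b"sample bytes")

# A POSIX-only training framework opens the file path...
with open(store.posix_path("datasets/train/shard-0000.tar"), "rb") as f:
    via_file = f.read()

# ...while a newer data loader reads the same object with S3 semantics.
via_object = store.get_object("datasets/train/shard-0000.tar")
print(via_file == via_object)  # True
```

The point of the sketch is that neither consumer has to change: legacy POSIX frameworks and object-native loaders see one copy of the data.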

Object Storage Takes Center Stage

Recent advancements in object storage technology, including GPUDirect-like access facilities from Cloudian, MinIO, Nvidia, and Scality, are enabling direct data access from GPUs without the need for traditional data movement. This paradigm shift paves the way for even more efficient AI training.
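
GPUDirect-style access is fundamentally about eliminating intermediate copies between storage and the consumer of the data. The host-side Python sketch below only illustrates that zero-extra-copy idea, filling a preallocated buffer in place with `readinto`; real GPUDirect Storage goes much further, DMA-ing data from NVMe or object storage straight into GPU memory, which plain Python cannot express.

```python
import io

def read_into_buffer(stream, buf):
    """Fill a preallocated buffer directly from a stream, avoiding an
    intermediate bytes object per read. This is only a host-side
    analogy for the zero-copy path GPUDirect-like facilities provide."""
    view = memoryview(buf)
    filled = 0
    while filled < len(buf):
        n = stream.readinto(view[filled:])
        if not n:  # stream exhausted before buffer was full
            break
        filled += n
    return filled

payload = b"x" * 4096          # stand-in for an object's bytes
buf = bytearray(4096)          # think: pinned / GPU-registered memory
n = read_into_buffer(io.BytesIO(payload), buf)
print(n)  # 4096
```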

Denworth’s insight confirms the growing trend: top-tier AI models are increasingly being trained directly from object storage.

“Of the top-tier (top ten worldwide) models I know of:
VAST is being used for a very prominent model exclusively on VAST S3 at CoreWeave. We have a few other top-tier names starting to experiment.
Azure Blob is being used for a very prominent model.
Nvidia is training a very prominent model on S3-compatible storage.”

– Jeff Denworth

The Future of AI Data Storage

The shift towards object storage in AI training signifies a fundamental change in the data landscape. It empowers developers to leverage the latest advancements in hardware and software, ultimately accelerating the pace of AI innovation. As AI models become increasingly complex and data-intensive, the need for reliable, scalable, and cost-effective storage solutions will only grow.

Organizations that embrace multi-protocol storage architectures, such as those pioneered by VAST Data and others, will be well-positioned to navigate the evolving AI ecosystem and unlock the full potential of their data.

AI’s Next Frontier: Beyond Chatbots

ChatGPT’s remarkable capabilities have ushered in a new era of generative AI, sparking a wave of excitement and speculation about its transformative potential. While chatbot applications undoubtedly capture the public imagination, industry experts argue that the true impact of AI will extend far beyond conversational interfaces, necessitating a fundamental shift in how businesses manage and leverage data.

The Limitations of Chatbots as the AI End Game

Jeff Denworth, co-founder of VAST Data, a company specializing in AI-driven data storage and processing, posits that while chatbots represent a meaningful advancement, they are merely a glimpse into the broader landscape of AI applications. “It’s always possible to integrate a solution; that never means it’s practical or efficient,” he asserts.

Denworth argues that the ability to process and analyse vast volumes of data in real time will be crucial for businesses seeking to harness the full potential of AI. He envisions a future where “AI embedding models can understand recency and relevance of all data as it’s being chunked and vectorized … where all data will be vectorized [with] trillions of vectors that need to be searchable in constant time regardless of vector space size.”
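
The chunk-and-vectorize pipeline Denworth describes can be sketched at toy scale: chunks of text are embedded as vectors, and a query is answered by nearest-neighbour search over those vectors. The hashing “embedding” below is an invented stand-in for a real embedding model, and the brute-force scan is exactly what stops scaling at trillions of vectors, which is why constant-time search demands specialised indexing.

```python
import math

def embed(text, dim=64):
    """Toy embedding: normalised hashed character-trigram counts.
    A real pipeline would use a learned embedding model."""
    v = [0.0] * dim
    for i in range(len(text) - 2):
        v[hash(text[i:i + 3]) % dim] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cosine(a, b):
    # Vectors are unit-normalised, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

def search(query, index, k=2):
    """Brute-force nearest-neighbour scan over (doc, vector) pairs.
    Cost grows linearly with index size, which is the scaling wall
    an ANN index exists to avoid."""
    q = embed(query)
    scored = sorted(index, key=lambda item: -cosine(q, item[1]))
    return [doc for doc, _ in scored[:k]]

chunks = ["object storage for AI training",
          "file systems and POSIX interfaces",
          "training large models on S3-compatible storage"]
index = [(c, embed(c)) for c in chunks]   # "chunked and vectorized"
print(search("training models on object storage", index, k=1))
```

Keeping such an index fresh as data changes, at trillions of vectors, is the hard part of the vision quoted above.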

Scaling for the AI-Powered Enterprise

Denworth highlights the disaggregated, shared-everything architecture (DASE) as a key component in enabling this future. DASE allows for the independent scaling of compute and storage resources, providing the adaptability and performance necessary to handle the demands of AI workloads. “A system that can manage ingestion of hundreds of thousands to millions of files per second, process them and index them in real time … and also instantaneously propagate all data updates to the index so enterprises never see stale data. A system that doesn’t need expensive memory-based indices because legacy partitioning approaches are not efficient,” he explains.

He further emphasizes the need for enterprise-grade data sources that can handle massive data volumes and ensure data integrity. “The underlying data sources need to be scalable AND enterprise grade … not sure where else you get this other than VAST,” he states.

The Rise of AI Agents and the Need for Agile Data Management

The emergence of “AI agents” – autonomous software entities that can perform complex tasks – is poised to revolutionize business operations. Nvidia, for instance, plans to deploy over 100 million AI agents to augment its workforce over the next few years.

“You don’t think this will push boundaries of legacy storage and database systems?” Denworth asks.

Denworth points to the increasing demand for computational resources, exemplified by Microsoft and BlackRock’s recent announcement of a joint fund dedicated to scaling AI infrastructure. He argues that this trend, coupled with the rise of “System Two” computing, which emphasizes long-term reasoning and complex problem-solving, will necessitate a paradigm shift in data management practices.

“The Stargate announcement will be the first of many … This is not exclusively for training. System Two/Long-Thinking is going to change the world’s relationship with data and compel the need for even larger volumes of data,” he predicts.

Looking Ahead: A Future of Innovation

Despite the significant progress already made, Denworth remains convinced that the journey of AI innovation is far from over. “I can confidently say that we have the most inventive and most aspiring team in the business. Each customer interaction gives us more inspiration for the next ten years,” he states.

While he is hesitant to reveal future plans in detail, Denworth emphasizes that VAST Data will continue to push the boundaries of what’s possible in data management, driven by its commitment to empowering businesses with the tools they need to unlock the full potential of AI.

The rise of generative AI presents both exciting possibilities and critical challenges. As businesses navigate this dynamic landscape, a strategic approach to data management that prioritizes scalability, agility, and security will be paramount.

Computational Storage: Moving Beyond Traditional Limits

Computational storage is revolutionizing how we approach data processing by integrating compute resources directly into storage systems.

This approach, often called “disaggregation,” challenges the traditional paradigm of separating compute and storage. Instead, computational storage allows applications to run directly within the storage array, unlocking a new level of performance and efficiency.

The Shift From DAS to Shared Access

The core benefit of computational storage lies in its ability to enable shared data access across multiple machines.

“Shared data access across machines is tantamount to what we do. Modern machinery needs real-time access to petabytes to exabytes of data to get a global data understanding. You can’t pin that data to any one host. Where and how those functions run is just a packaging exercise … we like efficiency so the more we can collapse, the better … but DAS is the opposite of how we think. Disaggregation is not just possible, we’ve shown the world that it’s very practical to getting to radical levels of data access and data processing parallelism.”

This quote highlights a key difference between computational storage and direct-attached storage (DAS). While DAS restricts data access to a single server, computational storage promotes shared access, crucial for modern applications requiring global data understanding.

Sizing Compute Resources in Computational Storage Arrays

Determining the optimal compute resource size for a computational storage array is a complex task.

Several factors play a role, including:

  • I/O load
  • Query load
  • Function velocity
  • Event notification activity
  • QoS management
  • RAS (reliability, availability, serviceability)
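
One way to think about these factors is that each one implies its own node count, and the array must be sized for the largest of them, with headroom reserved for background work such as QoS enforcement and RAS. The function below is a back-of-envelope sketch of that reasoning; every per-node throughput figure in it is a made-up placeholder, not a vendor specification.

```python
import math

def node_estimate(io_gbps, queries_per_s, functions_per_s, events_per_s,
                  io_gbps_per_node=10.0, qps_per_node=50_000,
                  fns_per_node=5_000, events_per_node=100_000,
                  headroom=0.7):
    """Rough sizing sketch: each workload dimension implies a node
    count, and the array must satisfy the largest of them. The
    per-node capacities are illustrative placeholders; `headroom`
    reserves capacity for QoS management and RAS background work."""
    demands = [io_gbps / io_gbps_per_node,
               queries_per_s / qps_per_node,
               functions_per_s / fns_per_node,
               events_per_s / events_per_node]
    return math.ceil(max(demands) / headroom)

# Example: 200 GB/s of I/O dominates the other three dimensions here,
# so it alone determines the node count (200/10 = 20, /0.7 -> 29 nodes).
print(node_estimate(io_gbps=200, queries_per_s=100_000,
                    functions_per_s=2_000, events_per_s=500_000))  # 29
```

Real sizing is harder precisely because these dimensions interact rather than max out independently, which is the point the quote below concedes.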

“We’re learning more about sizing every day,” says a leading expert in computational storage, “I’m not sure we’ve got it all figured out since each new release is adding substantially new capability. This keeps the performance team on its toes … but we’re trying.”

The Evolution of Computational Storage

Computational storage is a rapidly evolving field with continuous advancements.

As technology progresses, the line between storage and compute will continue to blur, leading to even more elegant and efficient data management solutions.

The future of data processing lies in embracing the possibilities of computational storage. By integrating compute resources directly into storage systems, we can unlock unprecedented levels of performance, efficiency, and scalability, empowering organizations to harness the full potential of their data.
