Data storage is the oxygen of machine learning and AI.
02 Apr, 2025
Artificial intelligence (AI) and machine learning (ML) have fueled transformative breakthroughs, from predicting protein structures to enabling real-time language translation. At the heart of these innovations lies an insatiable need for high-quality data. AI models thrive on vast datasets, but without reliable, cost-effective data storage, these models—and the insights they generate—would fail to reach their potential.
Much like oxygen fuels the human mind, data storage fuels AI development. The ability to store, access, and process data efficiently determines how effectively AI models are trained and refined. Yet, as the demand for AI-driven solutions grows, so too does the challenge of managing the lifecycle of AI data—from collection to storage to processing—all while keeping costs and complexity in check.
Data science has evolved from spreadsheets and simple analytics to powerful ML-driven insights. Today, the U.S. Department of Labor reports that more than 200,000 data science jobs exist, with projected growth of 36% over the next decade. Domain experts across industries are incorporating AI tools into their workflows, even without formal data science training, using no-code platforms that allow them to build models and analyze data faster than ever before.
But raw data isn’t useful on its own. Before it can be fed into AI models, it must be structured, cleaned, and labeled—a process often called data wrangling. Open-source tools like Pandas help transform massive datasets into structured formats that AI models can use. However, this process requires fast, efficient, and local data storage to avoid bottlenecks that slow down model development.
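As a minimal sketch of what that wrangling looks like in practice, the snippet below uses Pandas to clean a tiny, hypothetical sensor log: dropping incomplete rows, coercing types, and encoding a text label as a numeric category. The column names and values are illustrative only, not from any real dataset.

```python
import pandas as pd

# Hypothetical raw log; column names and values are illustrative only.
raw = pd.DataFrame({
    "timestamp": ["2025-04-01 10:00", "2025-04-01 10:05", None],
    "reading": ["12.5", "bad_value", "13.1"],
    "label": ["ok", "ok", "fault"],
})

# Typical wrangling steps: drop incomplete rows, coerce types,
# and encode the text label as a numeric category for model input.
clean = raw.dropna(subset=["timestamp"]).copy()
clean["timestamp"] = pd.to_datetime(clean["timestamp"])
clean["reading"] = pd.to_numeric(clean["reading"], errors="coerce")
clean = clean.dropna(subset=["reading"])
clean["label_id"] = clean["label"].astype("category").cat.codes

print(clean[["timestamp", "reading", "label_id"]])
```

On a real dataset these same few calls run over millions of rows, which is exactly where fast local storage matters: every pass over the data is an I/O pass.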
The sheer volume of AI training data presents significant logistical challenges. Storing and managing large datasets isn’t just about capacity—it’s about cost, compliance, and accessibility.
Among the biggest of these challenges is geography: traditional centralized storage approaches struggle with geographically dispersed data sources. A growing number of AI practitioners are therefore turning to localized, edge storage solutions that offer greater control, lower costs, and reduced latency.
Rather than transferring vast datasets to centralized cloud servers, organizations can process and store AI data closer to where it is generated. This approach—often called edge computing—minimizes data movement costs while improving performance.
One cost-effective solution is small, hybrid NAS systems that provide local, high-performance storage for AI workloads. Unlike traditional NAS, these systems integrate containerized AI tools such as Jupyter Notebooks, allowing domain experts and AI developers to collaborate directly on the storage system itself. By eliminating the need for constant data transfers, these NAS solutions reduce operational costs while accelerating AI development.
Processing AI data at the edge also gives organizations greater control over their datasets. Maintaining sovereignty over AI training data ensures compliance with industry regulations and reduces risks associated with third-party storage. This approach makes AI workflows more efficient by keeping data close to where it is collected and analyzed.
Edge computing thus offers multiple advantages for AI development: lower data-movement costs, reduced latency, and greater control over where sensitive datasets reside.
To explore the feasibility of running AI workloads on localized storage, we built a three-node NAS cluster and measured its storage performance.
We first measured single-node performance to establish a baseline for throughput. The system achieved 200 MB/s per 2.5GE link for large data transfers.
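The article doesn't specify the benchmarking tool used; the snippet below is a simplified stand-in showing how a sequential-write throughput baseline of this kind can be taken with nothing but the Python standard library. The file size and block size are illustrative, and a real baseline would use larger transfers and repeated runs.

```python
import os
import tempfile
import time

def measure_write_throughput(path, total_mb=64, block_mb=4):
    """Write total_mb of data in block_mb chunks and return MB/s."""
    block = os.urandom(block_mb * 1024 * 1024)
    start = time.perf_counter()
    with open(path, "wb") as f:
        for _ in range(total_mb // block_mb):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # include the time to reach stable storage
    elapsed = time.perf_counter() - start
    return total_mb / elapsed

with tempfile.TemporaryDirectory() as d:
    mbps = measure_write_throughput(os.path.join(d, "probe.bin"))
    print(f"sequential write: {mbps:.0f} MB/s")
```

Pointing `path` at a share mounted from the NAS (rather than a local temp directory) turns this into a quick end-to-end check of the storage path, network link included.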
Next, we analyzed how multi-node replication affected performance. While data replication increased network traffic, it had minimal impact on read performance—a key advantage for workloads that require data consistency across multiple nodes.
Networking performance tests revealed that adding a second 2.5GE link provided only minor write benefits, while 10GE networking improved performance in select cases.
To simulate an AI workflow, we tested a real-world machine learning task using the NAS system. We trained a boat classification model using a dataset of 500 labeled images, running feature extraction and model training locally.
After storing the images in an object storage bucket with metadata labels, we used PyTorch Img2Vec to extract features from each image and then trained a random forest classifier. The resulting model achieved 78% accuracy in under a minute.
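The sketch below shows the training half of that pipeline with scikit-learn's random forest. Since the original image set isn't available, it substitutes synthetic 512-dimensional vectors (the output size of Img2Vec's default ResNet-18 backbone) for the real extracted features; the class count, separation shift, and split are placeholder assumptions, so the sketch runs without PyTorch or the images.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# In the real workflow each 512-dim vector would come from a pretrained
# CNN via img2vec_pytorch, roughly:
#   from img2vec_pytorch import Img2Vec
#   vec = Img2Vec().get_vec(pil_image)
# Here synthetic vectors stand in so the sketch runs without PyTorch.
rng = np.random.default_rng(0)
n_images, n_features = 500, 512
X = rng.normal(size=(n_images, n_features))
y = rng.integers(0, 4, size=n_images)   # e.g. four hypothetical boat classes
X += y[:, None] * 0.5                   # shift class means so they separate

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")
```

On feature vectors this small, training completes in seconds on modest hardware, which is consistent with the under-a-minute end-to-end time reported above.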
The key observation from this test was that the entire workflow, from image storage through feature extraction to model training, ran locally on the NAS. This demonstrated that localized NAS storage can serve as a cost-effective AI data hub, reducing reliance on cloud services while improving accessibility and performance.
Final thoughts: AI storage must evolve.
The future of AI depends on efficient, cost-effective, and scalable data storage. As data volumes continue to grow, organizations must rethink how they store and manage AI datasets.
Localized NAS solutions provide a practical alternative to expensive cloud storage, allowing AI teams to keep data close to where it is generated, cut data-transfer costs, and maintain sovereignty over their training datasets.
Much like oxygen sustains life, data storage sustains AI innovation. By making AI-ready storage more accessible, cost-efficient, and high-performing, organizations can accelerate their AI-driven breakthroughs.
Tom Prohofsky