Perspective

23 Jun, 2026

Cloud

Scaling AI inference with multi‑tier storage

Consider how much of your inference system’s inefficiency comes from recomputing context it has already processed. Many AI infrastructure builders know that in test environments, the cost of recomputing context is relatively inconsequential. Prompts are short, sessions are contained and performance is predictable.

But production is different. At scale, inefficiency compounds quickly into cost, latency and utilisation challenges. This is what we set out to address in our latest research collaboration with SK hynix.

Bringing together Seagate’s expertise in hard drives and SK hynix’s leadership in memory and NAND flash SSDs, the research explores the system-level tradeoffs of scaling inference workloads and demonstrates why multi-tier SSD and hard drive architectures are foundational to doing so.

Shift to agentic changes the math

Standard inference is transactional. A request comes in, a response goes out, and the slate clears. Agentic workloads do not work that way. They carry state forward. Context accumulates across interactions, and each new request builds on what came before.

Compared with conventional chatbots, agentic AI generates up to 15 times more tokens¹, which fundamentally changes what the system has to do. It’s no longer just generating new tokens. It must decide how much prior work to retain and how much to rebuild from scratch.

Where KV cache becomes the constraint

Key-value (KV) cache stores intermediate representations of prior tokens, so the model does not recompute them on every request. Early on, it works well. The limitation is capacity.
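To make the capacity limit concrete, here is a minimal sketch of the commonly used estimate for KV cache size per token: two tensors (keys and values) per layer, each sized by the number of KV heads and the head dimension. The model shape below (80 layers, 8 KV heads, head dimension 128, 16-bit values) and the 128,000-token context are hypothetical values chosen for illustration, not figures from the research.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache one token occupies for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Hypothetical model shape, for illustration only.
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)

# Hypothetical context length for a single long-running session.
context_tokens = 128_000
total_gib = per_token * context_tokens / 2**30

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{total_gib:.2f} GiB for {context_tokens:,} tokens")
```

Under these assumptions a single token occupies 320 KiB, and one 128,000-token context occupies about 39 GiB, which is why a handful of long sessions can exhaust GPU memory.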

An NVIDIA H100 GPU carries 80 GB of high-bandwidth memory, which is enough to hold roughly 1.2 minutes of KV cache. A server with 1 TB of CPU memory extends that to about 16 minutes. Neither comes close to addressing a multi-turn agentic workflow where sessions run for hours, days or weeks.
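The figures above imply a fill rate, and a rough extrapolation shows how far memory falls short of long sessions. The sketch below assumes the KV cache grows at a constant rate, with no reuse, compression or eviction, and treats 1 TB as 1,000 GB. Those are simplifying assumptions of this example, not claims from the research.

```python
# Figures quoted in the text.
HBM_GB, HBM_MINUTES = 80, 1.2
DRAM_GB, DRAM_MINUTES = 1000, 16  # assumes 1 TB = 1,000 GB

# Implied KV cache fill rates in GB per minute.
hbm_rate = HBM_GB / HBM_MINUTES     # about 67 GB/min
dram_rate = DRAM_GB / DRAM_MINUTES  # 62.5 GB/min


def capacity_needed_tb(session_hours: float, rate_gb_per_min: float) -> float:
    """Capacity in TB to retain a session's KV cache.

    Assumes linear growth at a constant fill rate.
    """
    return session_hours * 60 * rate_gb_per_min / 1000


for label, hours in [("1 hour", 1), ("1 day", 24), ("1 week", 168)]:
    print(f"{label}: {capacity_needed_tb(hours, dram_rate):.2f} TB")
```

At the 62.5 GB per minute implied by the CPU memory figure, retaining one hour of context would take about 3.75 TB, one day about 90 TB and one week about 630 TB.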

When that memory fills, the system must evict older context. When that context becomes relevant again (and in agentic workloads, it usually does), the system recomputes it. The result is a climbing time to first token, rising GPU utilisation with no increase in useful output, and costs that drift upward even when demand looks flat.
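The evict-then-recompute cycle can be illustrated with a toy simulation. The sketch below assumes a least-recently-used eviction policy and a cache measured in whole sessions. Both are simplifications for illustration, and the class name `LRUKVCache` is hypothetical rather than part of any inference engine.

```python
from collections import OrderedDict


class LRUKVCache:
    """Toy memory-only KV cache that evicts the least recently used session."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.hits = 0
        self.recomputes = 0

    def get_context(self, session_id: str) -> str:
        if session_id in self.entries:
            # Hit: mark the session as most recently used.
            self.entries.move_to_end(session_id)
            self.hits += 1
            return self.entries[session_id]
        # Miss: the context must be rebuilt from scratch (a stand-in for prefill).
        self.recomputes += 1
        kv = f"kv-for-{session_id}"
        self.entries[session_id] = kv
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest session
        return kv


# Four sessions take turns, but memory only holds three of them.
cache = LRUKVCache(capacity=3)
for _ in range(5):
    for session in ["a", "b", "c", "d"]:
        cache.get_context(session)

print(cache.hits, cache.recomputes)
```

With four sessions rotating through three slots, every one of the 20 requests is a miss: each session is evicted just before it returns. The same workload with a capacity of four would recompute only four times.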

Adding more DRAM buys time, but it doesn’t resolve the underlying problem. The system is still bounded by memory and the pressure increases as workloads expand.

Treating context as persistent state

This is where I see the approach beginning to change. Instead of being treated as a memory-bound optimisation, KV cache can be viewed as persistent state that’s retained, retrieved and reused across inference cycles.

In practice, that means tiered storage. Hot context stays in memory, close to the GPU. SSDs provide a buffer tier for fast retrieval and data placement, while fleets of hard drives underpin object storage systems that deliver the durable capacity needed to retain days or weeks of context at a fraction of the cost of all-flash.

The real shift is that the KV cache is no longer confined to memory. Instead, it can now be managed across a tiered set of storage resources that can scale capacity, allowing inference systems to retain more prior work and reduce recomputation at its source.
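A minimal sketch of the tiering idea follows, assuming a simple policy: look in memory first, then SSD, then hard drive; promote anything found to memory; demote the least recently used entry one tier down when a tier overflows. The class `TieredKVStore`, its slot-based capacities and the unbounded hard drive tier are hypothetical simplifications, not the design used in the research or in NVIDIA Dynamo.

```python
from collections import OrderedDict


class TieredKVStore:
    """Toy three-tier KV store: memory -> SSD -> hard drive (treated as unbounded)."""

    def __init__(self, memory_slots: int, ssd_slots: int):
        self.memory_slots = memory_slots
        self.ssd_slots = ssd_slots
        self.memory = OrderedDict()
        self.ssd = OrderedDict()
        self.hdd = {}
        self.stats = {"memory": 0, "ssd": 0, "hdd": 0, "recompute": 0}

    def _put_memory(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        while len(self.memory) > self.memory_slots:
            old_key, old_value = self.memory.popitem(last=False)
            self._put_ssd(old_key, old_value)  # demote to the SSD tier

    def _put_ssd(self, key, value):
        self.ssd[key] = value
        self.ssd.move_to_end(key)
        while len(self.ssd) > self.ssd_slots:
            old_key, old_value = self.ssd.popitem(last=False)
            self.hdd[old_key] = old_value  # demote to the capacity tier

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)
            self.stats["memory"] += 1
            return self.memory[key]
        if key in self.ssd:
            value = self.ssd.pop(key)
            self.stats["ssd"] += 1
        elif key in self.hdd:
            value = self.hdd.pop(key)
            self.stats["hdd"] += 1
        else:
            # Not retained anywhere: rebuild it (a stand-in for prefill).
            value = f"kv-for-{key}"
            self.stats["recompute"] += 1
        self._put_memory(key, value)  # promote to the hot tier
        return value


store = TieredKVStore(memory_slots=1, ssd_slots=1)
for _ in range(5):
    for session in ["a", "b", "c", "d"]:
        store.get(session)

print(store.stats)
```

In this toy run, four sessions rotate through one memory slot and one SSD slot. Each session is computed once and then served from the capacity tier on every later request, so only 4 of the 20 requests trigger recomputation. Real systems would also weigh retrieval latency per tier, which this sketch ignores.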

In our close collaboration with SK hynix, we conducted benchmark testing using NVIDIA Dynamo with a hybrid SSD and hard drive configuration. We found the impact was measurable across every metric that matters to infrastructure teams: time to first token (TTFT), throughput, GPU utilisation and cost.

While the testing environment reflects controlled conditions, we expect the impact to be even more pronounced in real-world deployments, where longer sessions and larger datasets amplify recomputation. The full results of our work together, including cost modelling across storage tiers and architecture specifications, are detailed in the white paper.

Figure: TTFT drops from 35.24 s with regeneration to 1.75 s with hybrid storage, a 95% reduction.
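The headline percentage follows directly from the two TTFT values reported in the chart. This short check uses only those two numbers; the variable names are mine.

```python
# TTFT values reported in the chart, in seconds.
regeneration_s = 35.24
hybrid_s = 1.75

# Fraction of TTFT removed, and the equivalent speed-up factor.
reduction = (regeneration_s - hybrid_s) / regeneration_s
speedup = regeneration_s / hybrid_s

print(f"TTFT reduction: {reduction:.1%}")
print(f"Speed-up: {speedup:.1f}x")
```

Expressed as a ratio, a 95% reduction means the first token arrives roughly 20 times sooner.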

This solution only works when storage is integrated directly into the inference stack. Moving KV cache data between storage and GPU memory at scale, without CPU bottlenecks or added latency, requires purpose-built infrastructure rather than retrofitted general-purpose hardware.

Object storage as the system of record for AI

I believe our work with SK hynix reflects a broader architectural shift already underway. As inference engines become more stateful, the boundary between memory and storage begins to blur.

What was once transient context increasingly becomes persistent state, managed across tiers and retained over time. In that model, storage does not just support inference; it defines how context is retained and accessed, which aligns with the shift toward object storage as the system of record for modern AI infrastructure.

If your team is deciding how to architect for inference at scale, I encourage you to read the white paper, which outlines the benchmarks, tiering approach and cost model needed to evaluate these tradeoffs as you define your system.

Read the white paper here: Enabling inference at massive scale with hybrid storage for KV cache offloading.

¹ Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning, NVIDIA, March 11, 2026. Page 3.
