Jason Feist

Perspective

23 Jun, 2026

Cloud

Scaling AI inference with multi‑tier storage

Consider how much of your inference system’s inefficiency comes from recomputing context it has already processed. Many AI infrastructure builders know that in test environments, the cost of recomputing context is relatively inconsequential. Prompts are short, sessions are contained and performance is predictable.

But production is different. At scale, inefficiency compounds quickly into cost, latency and utilization challenges. This is what we set out to address in our latest research collaboration with SK hynix.

Bringing together Seagate’s expertise in hard drives and SK hynix’s leadership in memory and NAND flash SSDs, the research explores the system-level tradeoffs of scaling inference workloads and demonstrates how multi-tier SSD and hard drive architectures are foundational to success.

Shift to agentic changes the math

Standard inference is transactional. A request comes in, a response goes out, and the slate clears. Agentic workloads do not work that way. They carry state forward. Context accumulates across interactions, and each new request builds on what came before.

Compared with conventional chatbots, agentic AI generates up to 15 times more tokens [1], which fundamentally changes what the system has to do. It’s no longer just generating new tokens. It must decide how much prior work to retain and how much to rebuild from scratch.

Where KV cache becomes the constraint

Key-value (KV) cache stores intermediate representations of prior tokens, so the model does not recompute them on every request. Early on, it works well. The limitation is capacity.

An NVIDIA H100 GPU carries 80GB of high-bandwidth memory, which is enough to hold roughly 1.2 minutes of KV cache. A server with 1TB of CPU memory extends that to about 16 minutes. Neither comes close to addressing a multi-turn agentic workflow where sessions run for hours, days or weeks.
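As a rough back-of-envelope check on those figures, the sketch below derives the fill rate implied by "80GB holds roughly 1.2 minutes" and extrapolates it linearly. The constant fill rate, the reading of 1TB as 1000 GB and the function names are all assumptions made for illustration; they are not taken from the white paper.

```python
# Figures quoted in the article.
HBM_GB = 80.0         # H100 high-bandwidth memory
HBM_MINUTES = 1.2     # minutes of KV cache that 80GB is said to hold
DRAM_GB = 1000.0      # "1TB of CPU memory", read as 1000 GB (assumption)


def fill_rate_gb_per_min(capacity_gb: float, minutes: float) -> float:
    """Implied rate at which KV cache consumes capacity."""
    return capacity_gb / minutes


def retention_minutes(capacity_gb: float, rate_gb_per_min: float) -> float:
    """How long a tier of the given size lasts at a constant fill rate."""
    return capacity_gb / rate_gb_per_min


def capacity_needed_gb(hours: float, rate_gb_per_min: float) -> float:
    """Capacity required to retain `hours` of context without eviction."""
    return hours * 60.0 * rate_gb_per_min


rate = fill_rate_gb_per_min(HBM_GB, HBM_MINUTES)
dram_minutes = retention_minutes(DRAM_GB, rate)
day_tb = capacity_needed_gb(24, rate) / 1000.0

print(f"implied fill rate: {rate:.1f} GB/min")
print(f"1TB of DRAM lasts about {dram_minutes:.1f} minutes")
print(f"24 hours of context needs about {day_tb:.0f} TB")
```

Under these assumptions, 1TB of DRAM lasts roughly 15 minutes, in line with the "about 16 minutes" quoted above. The small gap presumably reflects details of the original estimate that are not stated here. At the same rate, a single day of retained context would run to tens of terabytes, which is well beyond what memory alone can hold.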

When that memory fills, the system must evict older context. When that context becomes relevant again (and in agentic workloads, it usually does), the system recomputes it. This results in climbing time to first token, rising GPU utilization without any increase in useful output and drifting costs even when demand looks flat.

Adding more DRAM buys time, but it doesn’t resolve the underlying problem. The system is still bounded by memory and the pressure increases as workloads expand.

Treating context as persistent state

This is where I see the approach beginning to change. Instead of treating KV cache as a memory-bound optimization, it can be viewed as a persistent state that’s retained, retrieved and reused across inference cycles.

In practice, that means tiered storage. Hot context stays in memory, close to the GPU. SSDs provide a buffer tier for fast retrieval and data placement, while fleets of hard drives underpin object storage systems that deliver the durable capacity needed to retain days or weeks of context at a fraction of the cost of all-flash.

The real shift is that the KV cache is no longer confined to memory. Instead, it can now be managed across a tiered set of storage resources that can scale capacity, allowing inference systems to retain more prior work and reduce recomputation at its source.
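To make the tiering idea concrete, here is a minimal sketch of a KV cache that demotes the least recently used blocks to the next tier down instead of discarding them, and promotes a block back to the top tier when it is read. It is a toy model under stated assumptions, not how NVIDIA Dynamo or the benchmarked system works: the `Tier` and `TieredKVCache` classes, the tier names and the block-count capacities are all hypothetical.

```python
from collections import OrderedDict


class Tier:
    """One storage tier with a fixed capacity, kept in LRU order."""

    def __init__(self, name: str, capacity: int):
        self.name = name
        self.capacity = capacity      # max number of KV blocks (illustrative)
        self.blocks = OrderedDict()   # key -> block, least recent first


class TieredKVCache:
    """Hypothetical tiered cache; tiers are ordered fastest to slowest."""

    def __init__(self, tiers):
        self.tiers = tiers

    def _insert(self, key, block, level):
        if level >= len(self.tiers):
            return  # fell off the last tier; a later read must recompute
        tier = self.tiers[level]
        tier.blocks[key] = block
        tier.blocks.move_to_end(key)
        # Demote overflow to the next tier rather than dropping it.
        while len(tier.blocks) > tier.capacity:
            old_key, old_block = tier.blocks.popitem(last=False)
            self._insert(old_key, old_block, level + 1)

    def put(self, key, block):
        # Remove any stale copy so a key lives in exactly one tier.
        for tier in self.tiers:
            tier.blocks.pop(key, None)
        self._insert(key, block, 0)

    def get(self, key):
        """Return (block, tier_name), or (None, None) on a miss."""
        for tier in self.tiers:
            if key in tier.blocks:
                block = tier.blocks.pop(key)
                self._insert(key, block, 0)  # promote hot context
                return block, tier.name
        return None, None


cache = TieredKVCache([Tier("memory", 2), Tier("ssd", 2), Tier("hdd", 4)])
for session in ["a", "b", "c", "d", "e"]:
    cache.put(session, f"kv-{session}")

first = cache.get("a")     # oldest block, served from the capacity tier
second = cache.get("a")    # now promoted back to the top tier
missing = cache.get("zzz")
print(first, second, missing)
```

The design choice worth noting is that eviction from memory becomes a demotion, not a deletion. A miss, and therefore a recomputation, only happens once a block has fallen off the last and largest tier.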

In our close collaboration with SK hynix, we conducted benchmark testing using NVIDIA Dynamo with a hybrid SSD and hard drive configuration. We found the impact was measurable across every metric that matters to infrastructure teams: time to first token (TTFT), throughput, GPU utilization and cost.

While the testing environment reflects controlled conditions, we expect the impact to be even more pronounced in real-world deployments, where longer sessions and larger datasets amplify recomputation. The full results of our work together, including cost modeling across storage tiers and architecture specifications, are detailed in the white paper.

[Figure: bar chart of TTFT, 35.24s with regeneration versus 1.75s with hybrid storage, a 95% reduction.]

Hybrid storage provides a 95% improvement in TTFT compared to regeneration.
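The 95% figure follows directly from the two TTFT values in the chart. The short calculation below reproduces it and also expresses the same result as a speedup factor; the helper name `percent_reduction` is just an illustrative choice.

```python
def percent_reduction(baseline_s: float, improved_s: float) -> float:
    """Percentage drop from baseline to improved latency."""
    return (baseline_s - improved_s) / baseline_s * 100.0


REGENERATION_TTFT_S = 35.24   # TTFT when context is recomputed
HYBRID_TTFT_S = 1.75          # TTFT when KV cache is read from hybrid storage

reduction = percent_reduction(REGENERATION_TTFT_S, HYBRID_TTFT_S)
speedup = REGENERATION_TTFT_S / HYBRID_TTFT_S

print(f"TTFT reduction: {reduction:.1f}%")
print(f"speedup: {speedup:.1f}x")
```

Read as a ratio, the first token arrives roughly 20 times sooner when the context is retrieved from storage than when it is regenerated.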

This solution only works when storage is integrated directly into the inference stack. Moving KV cache data between storage and GPU memory at scale, without creating CPU bottlenecks or introducing new latency, requires purpose-built infrastructure rather than retrofitted general-purpose hardware.

Object storage as the system of record for AI

I believe our work with SK hynix reflects a broader architectural shift already underway. As inference engines become more stateful, the boundary between memory and storage begins to blur.

What was once transient context increasingly becomes persistent state, managed across tiers and retained over time. In that model, storage does not just support inference; it defines how context is retained and accessed, which aligns with the shift toward object storage as the system of record for modern AI infrastructure.

If your team is making decisions around architecting for inference at scale, I encourage you to read the white paper, which outlines the benchmarks, tiering approach and cost model needed to evaluate these tradeoffs as you define your system.

Read the white paper here: Enabling inference at massive scale with hybrid storage for KV cache offloading.

[1] Introducing Nemotron 3 Super: An Open Hybrid Mamba-Transformer MoE for Agentic Reasoning, NVIDIA, March 11, 2026. Page 3.

Jason Feist

Senior Vice President, Cloud Marketing