Hybrid Attention Model Cache Management
Overview
Some models combine more than one cache type in their architecture. The most common hybrid case is a model that uses both:
- regular KV-cache inputs for attention layers
- linear-attention state tables for layers such as CausalConv1D or GatedDeltaNet.
These models do not have a single cache pool. They have at least two different cache pools with different growth rules:
- KV cache is token-driven
- linear-attention cache is either sequence-driven or interval-driven, depending on whether prefix caching is enabled
This distinction matters when configuring SchedulerConfig, because the same setting can affect the two cache types differently.
Cache Types in Hybrid Models
KV cache
KV cache capacity is measured in blocks, and each block corresponds to a fixed number of tokens determined by the target device. Increasing KV capacity increases the number of tokens that can remain resident across active sequences.
Relevant settings:
- num_kv_blocks
- cache_size
- use_cache_eviction
Linear-attention cache without prefix caching
When enable_prefix_caching=false, linear-attention cache behaves like a fixed-size state per live sequence.
Each sequence needs a fixed number of linear-attention blocks, regardless of prompt length.
When num_linear_attention_blocks=0, the runtime derives this capacity differently for two common cases:
- if max_num_batched_tokens == std::numeric_limits<size_t>::max(), it treats the configuration as a client-style latency scenario and starts with 1 linear-attention block
- otherwise it treats the configuration as bounded batching and derives linear-attention capacity from max_num_seqs
Relevant settings:
- num_linear_attention_blocks
- max_num_batched_tokens
- max_num_seqs
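The non-prefix fallback described above can be sketched as a small function. This is a minimal illustration rather than the runtime's actual code: NonPrefixSettings and derive_linear_blocks_no_prefix are hypothetical names, the max_num_seqs value is a placeholder, and the sketch assumes one linear-attention block per live sequence when capacity is derived from max_num_seqs.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical mirror of the three settings listed above; not the real SchedulerConfig type.
struct NonPrefixSettings {
    std::size_t num_linear_attention_blocks = 0;  // 0 means "derive automatically"
    std::size_t max_num_batched_tokens = std::numeric_limits<std::size_t>::max();
    std::size_t max_num_seqs = 256;  // placeholder value for illustration only
};

// Sketch of the sizing rule for enable_prefix_caching=false.
// Assumption: bounded batching reserves one block per sequence, i.e. max_num_seqs blocks.
constexpr std::size_t derive_linear_blocks_no_prefix(const NonPrefixSettings& s) {
    if (s.num_linear_attention_blocks > 0) {
        return s.num_linear_attention_blocks;  // an explicit value overrides the derivation
    }
    if (s.max_num_batched_tokens == std::numeric_limits<std::size_t>::max()) {
        return 1;  // client-style latency scenario: start with a single block
    }
    return s.max_num_seqs;  // bounded batching: capacity follows max_num_seqs
}
```

For example, a configuration with max_num_batched_tokens bounded at 512 and max_num_seqs at 64 would yield 64 blocks under these assumptions, while the unlimited case yields 1.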
Linear-attention cache with prefix caching
When enable_prefix_caching=true, linear-attention cache switches to paged checkpointing mode.
Instead of allocating one fixed state block per sequence, the runtime stores checkpoints every derived cache interval.
The interval is calculated as kv_block_size * cache_interval_multiplier tokens.
If cache_interval_multiplier is unset, the default multiplier is 8 for hybrid models with prefix caching.
Relevant settings:
- num_linear_attention_blocks
- cache_interval_multiplier
- num_kv_blocks
- cache_size
Smaller cache_interval_multiplier values create more checkpoints and consume more linear-attention memory.
Larger values reduce memory usage but make checkpointing coarser.
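To make the trade-off concrete, the interval formula and the automatic prefix-mode sizing rule, ceil(num_kv_blocks / cache_interval_multiplier), can be sketched with integer arithmetic. The helper names are hypothetical, and the block size of 32 tokens used in the example below is only an illustrative value, since the real block size depends on the target device.

```cpp
#include <cstddef>

// Hypothetical helper: checkpoint interval in tokens,
// computed as kv_block_size * cache_interval_multiplier.
constexpr std::size_t checkpoint_interval_tokens(std::size_t kv_block_size,
                                                 std::size_t cache_interval_multiplier) {
    return kv_block_size * cache_interval_multiplier;
}

// Hypothetical helper: ceil(num_kv_blocks / cache_interval_multiplier)
// written as integer ceiling division. The multiplier must be non-zero.
constexpr std::size_t derive_linear_blocks_prefix(std::size_t num_kv_blocks,
                                                  std::size_t cache_interval_multiplier) {
    return (num_kv_blocks + cache_interval_multiplier - 1) / cache_interval_multiplier;
}
```

With an assumed block size of 32 and the default multiplier of 8, a checkpoint is stored every 256 tokens. For 1000 KV blocks, a multiplier of 4 derives 250 linear-attention blocks, 8 derives 125, and 16 derives 63, which shows how a smaller multiplier raises linear-attention memory use.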
How SchedulerConfig Is Interpreted
For hybrid-attention models, the following rules are the most useful mental model.
Constructor defaults matter
The starting SchedulerConfig depends on which pipeline constructor is used.
LLMPipeline and VLMPipeline constructors that utilize Continuous Batching use a latency-oriented default scheduler configuration if the user does not pass a scheduler config in properties. It is optimized for local non-concurrent usage.
That default changes two fields:
- max_num_batched_tokens = std::numeric_limits<size_t>::max()
- enable_prefix_caching = true
ContinuousBatchingPipeline constructors do not assume that latency-oriented profile.
They use the SchedulerConfig object that the caller provides.
As a result, the same hybrid-attention model can start with different automatic linear-attention sizing depending on the constructor path unless these fields are set explicitly.
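The effect of the constructor path can be illustrated with a small sketch. SchedulerFieldsSketch is a hypothetical stand-in that holds only the two fields named above, and its initial values are placeholders rather than the library's documented defaults. LinearSizingPath and both functions are likewise illustrative names.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical stand-in for the two SchedulerConfig fields the latency profile changes.
struct SchedulerFieldsSketch {
    std::size_t max_num_batched_tokens = 256;  // placeholder bounded value
    bool enable_prefix_caching = false;         // placeholder value
};

// Applies the two field changes listed for the latency-oriented default.
constexpr SchedulerFieldsSketch latency_oriented_default() {
    SchedulerFieldsSketch cfg{};
    cfg.max_num_batched_tokens = std::numeric_limits<std::size_t>::max();
    cfg.enable_prefix_caching = true;
    return cfg;
}

// Which automatic linear-attention sizing rule a configuration lands on.
enum class LinearSizingPath { PagedCheckpointing, SingleBlock, PerSequence };

constexpr LinearSizingPath linear_sizing_path(const SchedulerFieldsSketch& cfg) {
    if (cfg.enable_prefix_caching) {
        return LinearSizingPath::PagedCheckpointing;  // checkpoints per cache interval
    }
    if (cfg.max_num_batched_tokens == std::numeric_limits<std::size_t>::max()) {
        return LinearSizingPath::SingleBlock;  // client-style latency scenario
    }
    return LinearSizingPath::PerSequence;  // bounded batching, sized from max_num_seqs
}
```

Under these assumptions, the latency-oriented profile lands on paged checkpointing, while a caller-provided configuration with prefix caching off and a bounded token limit lands on per-sequence sizing. Setting both fields explicitly removes the dependence on the constructor path.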
Explicit KV capacity
If num_kv_blocks > 0, the KV cache capacity is fixed at that block count.
If num_linear_attention_blocks is left at 0, the runtime derives linear-attention capacity automatically:
- with enable_prefix_caching=false and unlimited max_num_batched_tokens, it starts with 1 linear-attention block
- with enable_prefix_caching=false and bounded max_num_batched_tokens, it derives linear-attention blocks from max_num_seqs
- with enable_prefix_caching=true, it derives linear-attention blocks as ceil(num_kv_blocks / cache_interval_multiplier)
Setting num_kv_blocks explicitly is the best option when deterministic capacity matters more than fitting into a precise byte budget.
Shared memory budget
If cache_size > 0 and num_kv_blocks == 0, the runtime treats cache_size as a shared cache budget for all registered cache types in the model.
The runtime then derives:
- KV blocks
- linear-attention blocks
under one combined memory limit.
For non-prefix mode, the same fallback split still applies before KV blocks are computed from the remaining budget:
- unlimited max_num_batched_tokens reserves one fixed linear-attention block
- bounded max_num_batched_tokens reserves linear-attention capacity based on max_num_seqs
For hybrid models this is usually the simplest way to tell the runtime, "fit all cache types into this memory budget".
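The non-prefix split under a shared budget can be sketched as "reserve linear-attention blocks first, then fit KV blocks into what remains". Everything in this sketch is an assumption made for illustration: the per-block byte sizes are hypothetical inputs (the real values depend on the model's cache layout), the budget is passed in bytes regardless of the unit cache_size actually uses, one block per sequence is assumed for the bounded case, and the handling of an undersized budget is a placeholder.

```cpp
#include <cstddef>
#include <limits>

struct BudgetSplit {
    std::size_t linear_blocks;
    std::size_t kv_blocks;
};

// Sketch of a shared-budget split for enable_prefix_caching=false.
// kv_block_bytes must be non-zero.
constexpr BudgetSplit split_budget_no_prefix(std::size_t budget_bytes,
                                             std::size_t kv_block_bytes,
                                             std::size_t linear_block_bytes,
                                             std::size_t max_num_batched_tokens,
                                             std::size_t max_num_seqs) {
    const bool unlimited =
        max_num_batched_tokens == std::numeric_limits<std::size_t>::max();
    // Step 1: reserve the fixed linear-attention capacity.
    const std::size_t linear_blocks = unlimited ? 1 : max_num_seqs;
    const std::size_t reserved = linear_blocks * linear_block_bytes;
    // Step 2: whatever is left goes to KV blocks. If the reservation already
    // exceeds the budget, this sketch simply reports zero KV blocks.
    const std::size_t remaining = reserved >= budget_bytes ? 0 : budget_bytes - reserved;
    return BudgetSplit{linear_blocks, remaining / kv_block_bytes};
}
```

With a 1000-byte budget, 10-byte KV blocks, and 100-byte linear-attention blocks, the unlimited case leaves room for 90 KV blocks, while a bounded configuration with max_num_seqs = 4 leaves room for 60. The numbers are deliberately tiny to keep the arithmetic easy to follow.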
Fully dynamic mode
If both num_kv_blocks == 0 and cache_size == 0, the cache starts with no preallocated capacity and grows on demand.
This is the most flexible mode, but it is less deterministic than explicit sizing.
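The three sizing modes described so far can be summarized as a simple classification over num_kv_blocks and cache_size. The enum and function names are hypothetical. The sketch also assumes that an explicit num_kv_blocks takes precedence when both fields are non-zero; the text above only states that the shared budget applies when num_kv_blocks is 0.

```cpp
#include <cstddef>

// Hypothetical labels for the three sizing modes.
enum class SizingMode { ExplicitKvBlocks, SharedBudget, FullyDynamic };

constexpr SizingMode sizing_mode(std::size_t num_kv_blocks, std::size_t cache_size) {
    if (num_kv_blocks > 0) {
        return SizingMode::ExplicitKvBlocks;  // assumed to win if cache_size is also set
    }
    if (cache_size > 0) {
        return SizingMode::SharedBudget;  // one budget for all registered cache types
    }
    return SizingMode::FullyDynamic;  // no preallocation, grows on demand
}
```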
Expert override for linear-attention blocks
If num_linear_attention_blocks > 0, it overrides the automatic linear-attention sizing logic.
This is useful only when the required capacity is already known.
If the model does not expose linear-attention cache inputs, num_linear_attention_blocks should not be set.
Which Configuration Fits Which Scenario
Scenario 1: Stable production throughput on a known workload
Use this when the number of active requests and expected context lengths are already understood.
Recommended settings:
- set num_kv_blocks explicitly
- keep cache_size=0
- leave num_linear_attention_blocks=0 unless manual control is needed
- set max_num_seqs to the intended concurrency if enable_prefix_caching=false
- leave cache_interval_multiplier unset to use the default, or set it explicitly when a different prefix-checkpoint granularity is needed
Pros:
- deterministic cache capacity
- easy to reason about admission limits
- easier benchmarking and capacity planning
Cons:
- requires up-front sizing work
- can over-allocate memory for bursty or variable workloads
Scenario 2: Constrained device memory with mixed cache types
Use this when the main requirement is "do not exceed this cache memory budget".
Recommended settings:
- set cache_size
- keep num_kv_blocks=0
- keep num_linear_attention_blocks=0 unless manual override is required
- choose cache_interval_multiplier according to the desired prefix-checkpoint granularity if enable_prefix_caching=true
Pros:
- one budget applies to both KV and linear-attention caches
- avoids KV-only sizing on hybrid models
- good default for memory-limited deployments
Cons:
- less direct than explicit block counts
- derived capacities depend on model cache layout and cache_interval_multiplier
Scenario 3: High concurrency, no prefix reuse
Use this when requests are short-lived or reuse across requests is not expected, and sequence concurrency matters more than prefix reuse.
Recommended settings:
- set enable_prefix_caching=false
- set num_kv_blocks explicitly or use cache_size
- set max_num_seqs to the intended concurrent sequence count
- keep num_linear_attention_blocks=0 unless a manual override is required
Why this works:
With prefix caching disabled, linear-attention cache is sequence-driven rather than token-driven.
If the runtime sizes it automatically, bounded batching uses max_num_seqs as the target for linear-attention capacity.
Pros:
- simple mental model
- predictable linear-attention capacity for concurrent requests
- avoids unnecessary checkpoint storage
Cons:
- no prefix reuse across compatible requests
- less effective for chat-style repeated-prefix workloads
Scenario 4: Prefix-heavy chat or repeated-prompt workloads
Use this when prefix reuse matters and the model has linear-attention cache inputs.
Recommended settings:
- set enable_prefix_caching=true
- either set num_kv_blocks explicitly or provide cache_size
- keep num_linear_attention_blocks=0 unless manual tuning is necessary
- choose cache_interval_multiplier carefully
Guidance for cache_interval_multiplier:
- smaller multiplier: more checkpoints, more linear-attention memory, finer-grained reuse
- larger multiplier: fewer checkpoints, lower linear-attention memory, coarser reuse
Pros:
- aligns linear-attention capacity with token capacity
- supports prefix checkpoint reuse in hybrid models
- good fit for chat and recurring-prefix scenarios
Cons:
- cache_interval_multiplier becomes part of memory planning
- too small a multiplier can consume linear-attention memory aggressively
Scenario 5: Single-stream or interactive client inference without prefix reuse
Use this when the pipeline is effectively latency-oriented and you do not want fixed linear-attention capacity to scale with server-style concurrency limits.
Recommended settings:
- set enable_prefix_caching=false
- keep num_linear_attention_blocks=0
- leave max_num_batched_tokens unlimited
- set num_kv_blocks explicitly or use cache_size, depending on whether you want deterministic capacity or a shared budget
Why this works:
With prefix caching disabled and unlimited max_num_batched_tokens, the runtime treats the configuration as a client-style scenario and starts with one fixed linear-attention block instead of reserving max_num_seqs blocks.
Pros:
- avoids over-reserving fixed linear-attention state for single-stream inference
- keeps the initial memory footprint lower on large hybrid models
- still allows KV capacity to be controlled independently by num_kv_blocks or cache_size
Cons:
- not appropriate if multiple concurrent sequences are expected immediately
- fixed linear-attention capacity may need to grow later if concurrency increases
Scenario 6: Exploratory tuning or highly variable traffic
Use this when workload shape is not stable enough to pre-size confidently.
Recommended settings:
- set num_kv_blocks=0
- set cache_size=0
- keep num_linear_attention_blocks=0
num_linear_attention_blocks=0 - enable prefix caching only if the workload benefits from it
Pros:
- no up-front capacity planning required
- adapts to traffic that is hard to predict
Cons:
- less deterministic memory growth
- harder to compare across benchmark runs
- not the best fit when hard admission or latency targets must be guaranteed
Recommended Starting Points
If no manual tuning has been done yet, these are reasonable defaults to start with.
General deployment default
- use cache_size
- keep num_kv_blocks=0
- keep num_linear_attention_blocks=0
This is the safest default for hybrid models when the main goal is to respect a memory budget.
Throughput-tuned deployment default
- use explicit num_kv_blocks
- keep num_linear_attention_blocks=0
- set bounded max_num_batched_tokens and max_num_seqs for non-prefix mode
- leave cache_interval_multiplier unset to use the default in prefix mode, or set it intentionally when tuning checkpoint granularity
This is the better default when concurrency and token capacity have already been characterized.
Interactive client default
- set enable_prefix_caching=false
- keep num_linear_attention_blocks=0
- leave max_num_batched_tokens unlimited
This is the better default when non-prefix hybrid inference is effectively single-stream and initial linear-attention state should stay minimal.
Prefix-heavy deployment default
- set enable_prefix_caching=true
- keep num_linear_attention_blocks=0
- start with the default cache_interval_multiplier
- tune cache_interval_multiplier only if memory pressure or reuse granularity requires it
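The four starting points above can be written down as preset functions. This is a sketch under stated assumptions: HybridSchedulerSketch is a hypothetical mirror of the SchedulerConfig fields discussed in this document, 0 is used here to stand for an unset cache_interval_multiplier, and any field a preset does not mention keeps the sketch's placeholder value, which is not necessarily the library default.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical mirror of the SchedulerConfig fields discussed in this document.
struct HybridSchedulerSketch {
    std::size_t num_kv_blocks = 0;
    std::size_t cache_size = 0;
    std::size_t num_linear_attention_blocks = 0;  // 0 lets the runtime derive it
    std::size_t max_num_batched_tokens = std::numeric_limits<std::size_t>::max();
    std::size_t max_num_seqs = 0;                 // placeholder, not a library default
    bool enable_prefix_caching = false;
    std::size_t cache_interval_multiplier = 0;    // assumption: 0 stands for "unset"
};

// General deployment: one shared budget, everything else derived.
constexpr HybridSchedulerSketch general_deployment(std::size_t cache_size) {
    HybridSchedulerSketch cfg{};
    cfg.cache_size = cache_size;
    return cfg;
}

// Throughput-tuned: explicit KV blocks plus bounded batching limits.
constexpr HybridSchedulerSketch throughput_tuned(std::size_t num_kv_blocks,
                                                 std::size_t max_num_batched_tokens,
                                                 std::size_t max_num_seqs) {
    HybridSchedulerSketch cfg{};
    cfg.num_kv_blocks = num_kv_blocks;
    cfg.max_num_batched_tokens = max_num_batched_tokens;
    cfg.max_num_seqs = max_num_seqs;
    return cfg;
}

// Interactive client: no prefix caching, unlimited batched tokens.
constexpr HybridSchedulerSketch interactive_client() {
    HybridSchedulerSketch cfg{};
    cfg.enable_prefix_caching = false;
    cfg.max_num_batched_tokens = std::numeric_limits<std::size_t>::max();
    return cfg;
}

// Prefix-heavy: prefix caching on, multiplier left unset to use the default.
constexpr HybridSchedulerSketch prefix_heavy() {
    HybridSchedulerSketch cfg{};
    cfg.enable_prefix_caching = true;
    return cfg;
}
```

All four presets leave num_linear_attention_blocks at 0, in line with the recommendation to let the runtime derive it. The arguments passed to general_deployment and throughput_tuned would come from the deployment's own memory budget and measured concurrency.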
When to Set num_linear_attention_blocks Manually
Manual num_linear_attention_blocks is useful only when one of the following is true:
- the deployment has a known fixed sequence budget and the value must be pinned exactly
- benchmark repeatability requires removing one more derived quantity
- a custom allocation split between KV and linear-attention cache is intentionally required
In most cases, leaving num_linear_attention_blocks=0 is preferred because it lets the runtime derive a value consistent with the selected mode.
Summary
For hybrid-attention models, the best configuration depends on whether the deployment is optimized for:
- deterministic token capacity
- a shared memory budget
- fixed concurrent sequence count
- prefix reuse efficiency
- dynamic flexibility
The simplest rule is:
- use num_kv_blocks when explicit capacity matters most
- use cache_size when a shared memory budget matters most
- keep num_linear_attention_blocks=0 unless there is a strong reason to override the derived value
- treat cache_interval_multiplier as a memory-versus-checkpoint-granularity knob when prefix caching is enabled