TurboQuant: How KV Cache Compression Makes 16GB Machines Run Bigger LLMs
TL;DR
TurboQuant is a new technique that compresses the KV cache (not model weights) when running LLMs locally, enabling 2x more usable context on memory-constrained machines. Alex Ziskind benchmarks it on a 16GB M4 Mac Mini and a 128GB M5 Max MacBook Pro, finding that asymmetric quantization (Q8 for K, Turbo for V) preserves quality while dramatically reducing memory usage.
The Problem: Model Weights Fit, But KV Cache Doesn't
Alex demonstrates the problem using a Mac Mini with 16GB of memory. He wants to run Qwen 3.5, a popular model family with high download numbers. The 9 billion parameter version in full BF16 (unquantized) is 19.3GB — it simply won't fit. Even quantized versions that appear small enough on disk consume far more memory at runtime due to the KV cache.
He shows this on his 128GB M5 Max MacBook Pro: loading a Q4 (4-bit) version of the model — which is only about 6GB on disk — takes memory from 77GB to 84GB. That's more than the 6GB model size because memory must also be reserved for context and cache. At 4,000 context length, usage sits at 84GB. Cranking context to the model's full supported length pushes it to 92GB — before even running a single prompt.
When he sends a long prompt (~17,000 tokens), memory spikes further. Each subsequent message resends the entire conversation for processing, and the KV cache grows with every generated token. The KV cache stores key-value pairs — mathematical summaries of every token the model has already seen — acting as short-term memory. It lives in RAM alongside the model weights.
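To get a feel for why the cache dominates at long context, here is a rough sizing sketch. The layer count, KV-head count, and head dimension below are placeholder values for a 9B-class model, not Qwen's actual configuration.

```python
# Back-of-the-envelope KV cache sizing for a dense transformer with
# grouped-query attention. All architecture numbers here are illustrative.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_elem):
    # K and V each store one vector per token, per KV head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem

ctx = 131_072                                   # full context window
fp16 = kv_cache_bytes(36, 8, 128, ctx, 2)       # 16-bit cache elements
q8   = kv_cache_bytes(36, 8, 128, ctx, 1)       # ~8-bit cache elements
print(f"FP16 cache: {fp16 / 2**30:.1f} GiB, Q8 cache: {q8 / 2**30:.1f} GiB")
```

The point of the arithmetic: the cache scales linearly with context length, so at six-figure context windows it can rival or exceed the size of the quantized weights themselves.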
Quantization vs. TurboQuant: Two Different Compression Targets
Traditional quantization compresses model weights. BF16 stores weights as 16-bit floating point numbers. Going to Q8 (8-bit) shrinks the 9B parameter Qwen model from 19.3GB to about 10GB. Q4 (4-bit) gets it down to 5.98GB. Alex notes that going below Q4 — such as Q3 or Q2 — usually produces poor results, with the LLM sometimes getting into loops and generating garbage output. He references Bartowski's quantizations on Hugging Face as an example of these aggressive options.
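Those disk sizes follow directly from bits per weight. A quick sanity check, assuming roughly 9.7B parameters (inferred from the quoted 19.3GB BF16 size) and typical GGUF bits-per-weight figures, which include block scales and so sit slightly above the nominal bit width:

```python
# Approximate model file size at each weight quantization level.
params = 9.7e9                      # assumed parameter count for a "9B" model
for name, bits_per_weight in [("BF16", 16.0), ("Q8_0", 8.5), ("Q4_K_M", 4.85)]:
    gb = params * bits_per_weight / 8 / 1e9
    print(f"{name}: ~{gb:.1f} GB")
```

This lands close to the 19.3GB, ~10GB, and ~6GB figures quoted above.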
TurboQuant takes a fundamentally different approach: instead of compressing model weights, it compresses the KV cache. This is particularly valuable on memory-constrained systems where the KV cache — not the model weights — is what blows past your RAM budget at longer context lengths. Alex references the official Google research paper for TurboQuant.
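The video does not walk through TurboQuant's internals, and the sketch below is not the paper's algorithm; it only illustrates the general idea of quantizing cache tensors instead of weights, using plain per-vector 4-bit rounding as a stand-in.

```python
import numpy as np

# Illustrative cache quantization (NOT TurboQuant's actual scheme):
# round each cached V vector to signed 4-bit integers with one scale per vector.
def quantize_vector(v):
    scale = np.abs(v).max() / 7.0 + 1e-12   # fit values into [-8, 7]
    q = np.clip(np.round(v / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_vector(q, scale):
    return q.astype(np.float32) * scale

v = np.random.randn(128).astype(np.float32)   # one cached value vector
q, s = quantize_vector(v)
error = np.abs(dequantize_vector(q, s) - v).mean()
# A real kernel would pack two 4-bit values per byte; int8 is used here for clarity.
print(f"mean abs reconstruction error: {error:.4f}")
```

Whatever the real scheme looks like, the key property is the same: the quantization happens at inference time on cache entries, so it stacks on top of whatever weight quantization you already use.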
Running TurboQuant: The Llama CPP Fork
As of the video, Llama CPP (the popular project for running LLMs locally) has not yet merged TurboQuant support; for now it remains a community effort. Alex points to a popular GitHub fork called TurboQuant Plus, maintained by a developer he refers to as Tom Turney, that implements TurboQuant on top of Llama CPP.
There are three variants of TurboQuant (a rough bits-per-element conversion is sketched after the list):
- **Turbo 2**: Most aggressive, compresses KV cache roughly 4x
- **Turbo 3**: Compresses roughly 2.5x
- **Turbo 4**: Compresses roughly 1.9x
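Taking a 16-bit cache as the baseline (an assumption, since the video quotes only ratios), those figures translate to roughly the following bits per cache element:

```python
# Convert the quoted compression ratios into rough bits per cache element,
# assuming a 16-bit (FP16/BF16) cache as the baseline.
for name, ratio in [("Turbo 2", 4.0), ("Turbo 3", 2.5), ("Turbo 4", 1.9)]:
    print(f"{name}: ~{16 / ratio:.1f} bits per element")
```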
Symmetric vs. Asymmetric: The Key Discovery
Alex's initial tests were disappointing. He applied TurboQuant symmetrically — the same turbo level to both K and V parts of the KV cache. On the M5 Max and M4 Mac Mini, the KV cache space savings were present, but both prefill speed and decode speed suffered significantly. Results were also model-dependent: he tested Qwen 2.5 (older model), Qwen 3 8B, and Qwen 3.5 35B (a mixture-of-experts model, 34GB, which only fit on the larger machine).
Tom suggested an asymmetric approach: keep the K part at Q8 and apply TurboQuant only to the V part, with Turbo 4 as the gentler option and Turbo 3 as the more aggressive one. This turned out to be the correct approach and resolved both the quality and the speed issues.
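A minimal sketch of what launching such a configuration might look like. The `--cache-type-k`/`--cache-type-v` flags and the `q8_0` value exist in mainline Llama CPP (where quantized V caches generally also require flash attention, and exact flag spellings vary by version); the `turbo3` value and the binary and model paths are assumptions about the fork, not confirmed names.

```python
import subprocess

# Hypothetical launch of the fork's server with an asymmetric KV cache:
# 8-bit K, TurboQuant V.
cmd = [
    "./llama-server",
    "-m", "qwen-model-q4_k_m.gguf",   # placeholder model file
    "-c", "131072",                   # full context window
    "--flash-attn",                   # quantized V caches typically require this
    "--cache-type-k", "q8_0",         # keep K at 8-bit
    "--cache-type-v", "turbo3",       # assumed fork-specific cache type
]
subprocess.run(cmd, check=True)
```

On mainline builds without the fork, the closest analogue today would be something like `--cache-type-v q4_0`, which trades quality differently but follows the same asymmetric pattern.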
Memory Savings: 2x More Usable Context on 16GB
On the Mac Mini with 16GB, loading the Q8 version of the model with a 131,000 token context window simply crashes. With asymmetric Turbo 3, the same model runs at 131K context comfortably with 3.6GB of memory to spare. Same model, same machine — TurboQuant gives you 2x more usable context.
Alex tested at multiple context lengths (32K, 65K, 131K). At each level, there was a significant difference in memory consumed. He showed a breakdown chart: model weights stayed identical between turbo and Q8 runs, but the KV cache — which grows dramatically under Q8 — was much smaller under Turbo 3, leaving extra headroom.
Quality Test: Needle in a Haystack
To verify that TurboQuant doesn't destroy output quality, Alex ran needle-in-a-haystack tests on the Mac Mini. The test hides three secrets inside varying lengths of text (1K to 32K tokens) and asks the model to find them.
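A simplified version of that kind of probe is sketched below, assuming the model is served locally through llama-server's OpenAI-compatible endpoint on its default port; the secrets, filler text, and scoring are stand-ins, not Alex's actual test set.

```python
import requests

# Toy needle-in-a-haystack probe: bury three "secrets" in filler text,
# then ask the model to recall them. Assumes a local llama-server.
SECRETS = ["The vault code is 4419.",
           "The password is tangerine.",
           "The meeting starts at 7:45."]
KEYS = ["4419", "tangerine", "7:45"]

def build_haystack(n_sentences):
    filler = ["The sky was grey and the streets were quiet."] * n_sentences
    step = n_sentences // 3
    for i, secret in enumerate(SECRETS):      # spread the secrets through the text
        filler.insert(i * step + step // 2, secret)
    return " ".join(filler)

def ask(prompt):
    r = requests.post("http://localhost:8080/v1/chat/completions", json={
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    })
    return r.json()["choices"][0]["message"]["content"]

haystack = build_haystack(1_000)              # roughly 10K tokens of filler
question = "\n\nWhat are the vault code, the password, and the meeting time?"
answer = ask(haystack + question)
print(f"recovered {sum(k in answer for k in KEYS)}/3 secrets")
```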
With symmetric TurboQuant, the aggressive levels were a disaster:
- Q8 baseline: 3/3 across all context lengths (100%)
- Turbo/Turbo symmetric: 3/3 (100%)
- Turbo 3 symmetric: 1/3 at short context, 0/3 at 8K and 16K
- Turbo 2 symmetric: 1/3 at short context, 0/3 at 8K and 16K
After switching to asymmetric (Q8 for K, Turbo for V), every configuration scored 3/3 across all context lengths — perfect retrieval. This confirmed that asymmetric TurboQuant preserves output quality.
Speed Benchmarks: A Surprise on M5 Max
On the M4 Mac Mini, TurboQuant showed a 1-4% slowdown in decode speed at short context lengths compared to Q8 baseline. Not great, but not terrible.
On the M5 Max, there was a dramatic difference in TurboQuant's favor. Q8 baseline decode speed dropped from ~54 tokens/second at depth 0 to ~37 tokens/second at 8K context depth, then partially recovered to ~44 at 32K. TurboQuant stayed relatively flat across all context depths. Alex confirmed this wasn't a glitch — these were averaged across multiple runs.
The reason for the difference: on the Mac Mini, the system was compute-bound — the bottleneck was matrix multiplications, not KV cache reads. On the M5 Max, KV cache access was the bottleneck, so compressing it yielded direct speed benefits. Alex speculates that future M5-based Mac Minis (even with only 16GB) would benefit more from TurboQuant thanks to the faster compute.
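A crude way to reproduce that kind of comparison yourself, again assuming a local llama-server on its default port; this wall-clock measurement folds prefill into the number, so it is only a rough proxy for pure decode speed.

```python
import time
import requests

# Rough decode-speed probe: generate a fixed number of tokens and divide by
# wall-clock time. Prefill is included, so treat the result as an approximation.
def tokens_per_second(prompt, n_tokens=128):
    start = time.time()
    requests.post("http://localhost:8080/v1/completions", json={
        "prompt": prompt,
        "max_tokens": n_tokens,
        "temperature": 0.0,
    })
    return n_tokens / (time.time() - start)

# Compare a short prompt against a deep context to see the degradation curve.
print("short context:", round(tokens_per_second("Hello."), 1), "tok/s")
print("deep  context:", round(tokens_per_second("lorem ipsum dolor " * 3000), 1), "tok/s")
```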
How to Try It Yourself
Alex suggests two paths: grab Tom's TurboQuant Plus fork from GitHub and build it yourself now, or wait until TurboQuant is merged into mainline Llama CPP. He mentions that vLLM is also reportedly working on TurboQuant support, and once Llama CPP gets it, tools like LM Studio will inherit the feature.
He recommends using the asymmetric approach (Q8 for K, Turbo 3 or Turbo 4 for V) based on his testing, and notes that the latest Qwen 3.5 models respond particularly well to TurboQuant on Apple hardware.
Key Takeaways
- TurboQuant compresses the KV cache (not model weights), which is the real memory bottleneck at long context lengths — use it alongside traditional quantization, not instead of it
- Always use asymmetric TurboQuant: Q8 for K, Turbo 3 (or Turbo 4) for V — symmetric mode destroys quality in needle-in-a-haystack tests
- On a 16GB Mac Mini, TurboQuant enables 131K context where Q8 alone crashes — effectively 2x more usable context on the same hardware
- Speed benefits are hardware-dependent: M5 Max saw flat decode speed across context depths (vs. Q8's significant degradation), while M4 Mac Mini was compute-bound and saw 1-4% slowdowns
- Don't quantize model weights below Q4 (4-bit) — lower quantizations often produce garbage output with looping
- Qwen 3.5 models respond particularly well to TurboQuant on Apple Silicon — results vary by model