Your Ultimate Guide to Attention Mechanism, QKV, and KV Cache
Source: Your Ultimate Guide to Attention Mechanism, QKV, and KV Cache
Date Published: 2026-05-26
Author: Turing Post
TL;DR
A comprehensive guide to the attention mechanism that powers modern Transformers. The article traces attention from its 2014 origins in neural machine translation through the 2017 Transformer revolution, explaining QKV, self-attention, multi-head attention, and modern efficiency variants like KV cache, MQA, GQA, and Multi-Head Latent Attention.
History of Attention
| Year | Milestone |
|---|---|
| 2014 | Bahdanau et al. — First attention mechanism for neural machine translation |
| 2015 | Luong et al. — Global and local attention variants |
| 2017 | Vaswani et al. — "Attention Is All You Need": Transformer, QKV, multi-head attention |
How Attention Works
QKV Mechanism
| Concept | Meaning |
|---|---|
| Query (Q) | What the current token is looking for |
| Key (K) | What each token exposes about itself |
| Value (V) | The information passed forward if selected |
| Self-attention | Tokens attending to other tokens in the same sequence |
| Multi-head attention | Several attention operations in parallel |
| KV cache | Stored K and V from previous tokens for faster generation |
"Why separate Q, K, V? A token can simultaneously be a requester, a candidate match, and a content carrier. Without separation, attention would collapse these roles."
The Core Formula
- Dot product between Q and K measures compatibility
- Divide the scores by √d_k so large dot products do not saturate the softmax, which keeps gradients stable
- Softmax converts scores into probabilities
- Weighted sum of V vectors produces the output representation
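Taken together, the four steps above amount to `softmax(Q Kᵀ / √d_k) V`. A minimal sketch in plain Python follows, assuming each of Q, K, and V is a list of vectors with one row per token. The function names and the toy inputs are illustrative and not taken from the article.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of vectors (one row per token)."""
    d_k = len(K[0])
    outputs = []
    for q in Q:
        # Steps 1 and 2: dot product with every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: softmax over the keys.
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return outputs

# Toy inputs: two identical keys get equal weight, so the output is the mean of V.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
result = attention(Q, K, V)
print(result)  # [[2.0, 3.0]]
```

Production implementations do the same computation with batched matrix multiplications on an accelerator. The loops here are only meant to make each step visible.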
Multi-Head Attention (MHA): Runs several independent attention operations in parallel with different learned projections of Q, K, V.
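As a rough illustration of the parallel heads, the sketch below gives each head its own Q, K, and V projection matrices and concatenates the per-head outputs. Random matrices stand in for learned weights, the sizes are made up, and the final output projection that standard Transformers apply after concatenation is left out for brevity.

```python
import math
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention in matrix form."""
    d_k = len(K[0])
    scores = matmul(Q, transpose(K))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head_attention(X, heads):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    per_head = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the head outputs along the feature dimension, token by token.
    return [sum((out[i] for out in per_head), []) for i in range(len(X))]

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (an assumption for this demo)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d_model, n_heads, d_head = 4, 2, 2  # hypothetical sizes
heads = [
    tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
    for _ in range(n_heads)
]
X = random_matrix(3, d_model, rng)  # three tokens
out = multi_head_attention(X, heads)
print(len(out), len(out[0]))  # 3 4
```

Because each head projects the same input differently, each can attend to a different pattern. The concatenated result has `n_heads * d_head` features per token.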
KV Cache
"Why needed: During autoregressive generation, each new token still needs to attend to all previous tokens."
How it works: Store the K and V vectors of previous tokens instead of recomputing them at every generation step. For the current token, compute only its new Q, K, and V, append the new K and V to the cache, then attend over all cached K and V.
Modern KV Cache Variants
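A minimal single-head sketch of this idea follows. The `KVCache` class, the `attend` helper, and the token vectors are hypothetical. In a real model the q, k, and v vectors would come from learned projections of each new token, and there would be one cache per layer and per KV head.

```python
import math

def attend(q, keys, values):
    """Attention output for one query over all given keys and values."""
    d_k = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [
        sum(e / total * v[j] for e, v in zip(exps, values))
        for j in range(len(values[0]))
    ]

class KVCache:
    """Hypothetical single-head cache: one K and one V vector per past token."""

    def __init__(self):
        self.keys = []
        self.values = []

    def step(self, q, k, v):
        # Append the new token's K and V, then attend over everything cached.
        self.keys.append(k)
        self.values.append(v)
        return attend(q, self.keys, self.values)

# Made-up (q, k, v) triples standing in for three generated tokens.
tokens = [
    ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
    ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 1.0], [2.0, 2.0]),
]

cache = KVCache()
outputs = [cache.step(q, k, v) for q, k, v in tokens]

# Recomputing from scratch for the last token matches the cached path.
all_keys = [k for _, k, _ in tokens]
all_values = [v for _, _, v in tokens]
assert outputs[-1] == attend(tokens[-1][0], all_keys, all_values)
```

Each step does work proportional to the current sequence length instead of re-projecting every earlier token. The price is memory that grows with every generated token.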
| Variant | Description | Benefit |
|---|---|---|
| Multi-Query Attention (MQA) | Shares one KV head across all query heads | Faster decoding, less memory |
| Grouped-Query Attention (GQA) | Shares each KV head among a group of query heads, a middle ground between MHA and MQA | Balances speed and quality |
| Cross-Layer Attention (CLA) | Shares KV activations across layers | Up to 2× reduction in KV cache |
| Multi-Head Latent Attention (MLA) | Compresses KV states into latent vectors (DeepSeek-V2) | 93.3% reduction in KV cache, as reported for DeepSeek-V2 relative to DeepSeek 67B |
Recent Efficiency Variants
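To get a feel for why the number of KV heads matters, the back-of-the-envelope sketch below estimates cache size per sequence as 2 (K plus V) × layers × KV heads × head width × sequence length × bytes per value. The model dimensions are hypothetical. CLA is modeled simply as adjacent layer pairs sharing one KV set. MLA is left out because its savings depend on the latent dimension, which this summary does not give.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2):
    """Bytes needed to cache K and V for one sequence (the factor 2 is K plus V)."""
    return 2 * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

# Hypothetical model: 32 layers, 32 query heads of width 128, 4096 tokens, fp16.
n_layers, n_heads, d_head, seq_len = 32, 32, 128, 4096

sizes = {
    "MHA (32 KV heads)": kv_cache_bytes(n_layers, n_heads, d_head, seq_len),
    "GQA (8 KV heads)": kv_cache_bytes(n_layers, 8, d_head, seq_len),
    "MQA (1 KV head)": kv_cache_bytes(n_layers, 1, d_head, seq_len),
    # Assumption: CLA on top of MHA, with every two adjacent layers sharing KV.
    "CLA (layer pairs share KV)": kv_cache_bytes(n_layers // 2, n_heads, d_head, seq_len),
}

for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.0f} MiB")
# MHA (32 KV heads): 2048 MiB
# GQA (8 KV heads): 512 MiB
# MQA (1 KV head): 64 MiB
# CLA (layer pairs share KV): 1024 MiB
```

The cache scales linearly with the number of KV heads and with the number of layers that store their own KV, which is why sharing along either axis pays off directly.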
- Elastic Core-Periphery (Vision Transformers): Communication routed via a small set of learned "core" tokens; complexity becomes near-linear in image size
- Various approaches to reduce the quadratic complexity of full attention while maintaining quality
Key Takeaways
- Attention allows Transformers to dynamically decide which tokens matter most for understanding context
- The QKV formulation separates three distinct roles: requester (Q), match target (K), and content carrier (V)
- KV cache is essential for efficient autoregressive generation, since it avoids recomputing the K and V vectors of past tokens at every step
- Modern variants (MQA, GQA, CLA, MLA) dramatically reduce KV cache memory requirements
- DeepSeek-V2, which uses Multi-Head Latent Attention, reports a 93.3% smaller KV cache than DeepSeek 67B, which makes much longer contexts practical