Your Ultimate Guide to Attention Mechanism, QKV, and KV Cache¶

Source: Your Ultimate Guide to Attention Mechanism, QKV, and KV Cache
Date Published: 2026-05-26
Author: Turing Post

TL;DR¶

A comprehensive guide to the attention mechanism that powers modern Transformers. The article traces attention from its 2014 origins in neural machine translation through the 2017 Transformer revolution, explaining QKV, self-attention, multi-head attention, and modern efficiency variants like KV cache, MQA, GQA, and Multi-Head Latent Attention.

History of Attention¶

Year	Milestone
2014	Bahdanau et al. — First attention mechanism for neural machine translation
2015	Luong et al. — Global and local attention variants
2017	Vaswani et al. — "Attention Is All You Need": Transformer, QKV, multi-head attention

How Attention Works¶

QKV Mechanism¶

Concept	Meaning
Query (Q)	What the current token is looking for
Key (K)	What each token exposes about itself
Value (V)	The information passed forward if selected
Self-attention	Tokens attending to other tokens in the same sequence
Multi-head attention	Several attention operations in parallel
KV cache	Stored K and V from previous tokens for faster generation

"Why separate Q, K, V? A token can simultaneously be a requester, a candidate match, and a content carrier. Without separation, attention would collapse these roles."

The Core Formula¶

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

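Dot product between Q and K measures how compatible each query is with each key
Scaling by √d_k keeps the scores from growing with the key dimension, which would otherwise saturate the softmax and shrink its gradients
Softmax converts each query's scores into weights that sum to 1
Weighted sum of V vectors, using those weights, produces the output representation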
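The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a production implementation: the function names `softmax` and `attention` and the toy Q, K, V values are assumptions chosen for the example, and no masking or batching is included.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention on lists of row vectors.

    Q is n_q x d_k, K is n_k x d_k, V is n_k x d_v.
    Returns (outputs, weights), one row per query.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    scale = math.sqrt(d_k)
    outputs, all_weights = [], []
    for q in Q:
        # Steps 1 and 2: dot product with every key, divided by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in K]
        # Step 3: softmax turns the scores into weights that sum to 1
        weights = softmax(scores)
        # Step 4: weighted sum of the value vectors
        out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(d_v)]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy example (values are illustrative only): one query, two keys/values.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention(Q, K, V)
print("weights:", w[0])
print("output: ", out[0])
```

The query points in the same direction as the first key, so the first key receives the larger weight and the output leans toward the first value vector.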

Multi-Head Attention (MHA): Runs several attention operations (heads) in parallel, each with its own learned projections of Q, K, and V; the head outputs are concatenated and projected back to the model dimension.
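A rough sketch of multi-head attention in plain Python follows. It assumes the standard Transformer arrangement in which each head's output is concatenated and passed through an output projection (named `W_o` here). The random matrices stand in for learned weights, and all names and sizes are hypothetical choices for the example. No causal mask is applied.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention; returns one output row per query."""
    scale = math.sqrt(len(K[0]))
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

def random_matrix(rows, cols, rng):
    """Stand-in for a learned weight matrix (assumption for this sketch)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) triples, one triple per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs for each token, then project.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)
seq_len, d_model, n_heads = 3, 4, 2
d_head = d_model // n_heads

X = random_matrix(seq_len, d_model, rng)  # toy token representations
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = random_matrix(n_heads * d_head, d_model, rng)

Y = multi_head_attention(X, heads, W_o)
print(len(Y), "tokens, each of dimension", len(Y[0]))
```

Because each head has its own projections, each can weight the same tokens differently; the output keeps the input shape of `seq_len` rows by `d_model` columns.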

KV Cache¶

Why needed: "During autoregressive generation, each new token still needs to attend to all previous tokens."

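How it works: Store the K and V vectors of previous tokens. For the current token, compute only its new Q, K, and V, append the new K and V to the cache, then attend with the new Q over all cached K and V. Past tokens' K and V are never recomputed.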
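The caching idea can be illustrated with a small single-head decoding loop in plain Python. This is a sketch under stated assumptions: the `KVCache` class, the `decode_step` and `recompute_step` helpers, and the fixed weight matrices are hypothetical names and values made up for the example. The loop checks that attending over the cache gives the same result as recomputing K and V for the whole prefix at every step.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(x, W):
    """Project vector x (length d_in) with matrix W (d_in x d_out)."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) for j in range(len(W[0]))]

def attend(q, K, V):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(q))
    weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in K])
    return [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]

class KVCache:
    """Holds the K and V vectors of all tokens seen so far."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step(x, W_q, W_k, W_v, cache):
    # Only the current token is projected; older K and V come from the cache.
    q, k, v = matvec(x, W_q), matvec(x, W_k), matvec(x, W_v)
    cache.append(k, v)
    return attend(q, cache.keys, cache.values)

def recompute_step(prefix, W_q, W_k, W_v):
    # Reference without a cache: re-project every token in the prefix.
    K = [matvec(x, W_k) for x in prefix]
    V = [matvec(x, W_v) for x in prefix]
    return attend(matvec(prefix[-1], W_q), K, V)

# Illustrative fixed weights and token vectors (not from any real model).
W_q = [[1.0, 0.0], [0.0, 1.0]]
W_k = [[0.5, 0.1], [0.2, 0.3]]
W_v = [[1.0, 2.0], [3.0, 4.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs = [decode_step(x, W_q, W_k, W_v, cache) for x in tokens]
full_outputs = [recompute_step(tokens[:t + 1], W_q, W_k, W_v)
                for t in range(len(tokens))]

max_diff = max(abs(a - b)
               for c, f in zip(cached_outputs, full_outputs)
               for a, b in zip(c, f))
print("cached entries:", len(cache.keys), "max difference:", max_diff)
```

With the cache, each step performs one K and one V projection instead of one per prefix token. The price is memory that grows linearly with the number of generated tokens.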

Modern KV Cache Variants¶

Variant	Description	Benefit
Multi-Query Attention (MQA)	Shares one KV head across all query heads	Faster decoding, less memory
Grouped-Query Attention (GQA)	Shares each KV head among a group of query heads	Balances speed and quality between MHA and MQA
Cross-Layer Attention (CLA)	Shares KV activations across layers	Up to 2× reduction in KV cache
Multi-Head Latent Attention (MLA)	Compresses KV states into latent vectors (DeepSeek-V2)	93.3% reduction in KV cache, as reported relative to DeepSeek 67B
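To see why the number of KV heads matters, the cache size can be estimated with simple arithmetic: two tensors (K and V) per layer, each holding one vector of size `head_dim` per KV head per token. The sketch below assumes a hypothetical model (32 layers, 32 query heads, head dimension 128, 16-bit values, 4096 tokens of context). These numbers are illustrative and do not describe any specific model, and CLA and MLA are not covered by this formula.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Approximate KV cache size in bytes.

    The leading factor 2 counts one K tensor and one V tensor per layer.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration, chosen only for illustration.
n_layers, n_query_heads, head_dim, seq_len = 32, 32, 128, 4096

sizes = {
    "MHA (32 KV heads)": kv_cache_bytes(n_layers, n_query_heads, head_dim, seq_len),
    "GQA (8 KV heads)": kv_cache_bytes(n_layers, 8, head_dim, seq_len),
    "MQA (1 KV head)": kv_cache_bytes(n_layers, 1, head_dim, seq_len),
}

for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.0f} MiB")
```

Under these assumed sizes the cache shrinks from 2048 MiB with full multi-head attention to 512 MiB with 8 KV groups and 64 MiB with a single shared KV head, since the size scales linearly with the number of KV heads.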

Recent Efficiency Variants¶

Elastic Core-Periphery (Vision Transformers): Communication routed via a small set of learned "core" tokens; complexity becomes near-linear in image size
Other approaches likewise aim to reduce the cost of full attention, which is quadratic in sequence length, while maintaining quality

Key Takeaways¶

Attention allows Transformers to dynamically decide which tokens matter most for understanding context
The QKV formulation separates three distinct roles: requester (Q), match target (K), and content carrier (V)
KV cache is essential for efficient autoregressive generation, avoiding recomputation of K and V for past tokens
Modern variants (MQA, GQA, CLA, MLA) reduce KV cache memory requirements by sharing or compressing K and V
DeepSeek-V2's Multi-Head Latent Attention reports a 93.3% KV cache reduction, which frees memory for much longer contexts