Your Ultimate Guide to Attention Mechanism, QKV, and KV Cache

Source: Your Ultimate Guide to Attention Mechanism, QKV, and KV Cache
Date Published: 2026-05-26
Author: Turing Post


TL;DR

A comprehensive guide to the attention mechanism that powers modern Transformers. The article traces attention from its 2014 origins in neural machine translation through the 2017 Transformer, then explains QKV, self-attention, multi-head attention, the KV cache, and cache-reducing variants such as MQA, GQA, and Multi-Head Latent Attention.

History of Attention

Year  Milestone
2014  Bahdanau et al.: first attention mechanism for neural machine translation
2015  Luong et al.: global and local attention variants
2017  Vaswani et al.: "Attention Is All You Need" (Transformer, QKV, multi-head attention)

How Attention Works

QKV Mechanism

Concept               Meaning
Query (Q)             What the current token is looking for
Key (K)               What each token exposes about itself
Value (V)             The information passed forward if selected
Self-attention        Tokens attending to other tokens in the same sequence
Multi-head attention  Several attention operations in parallel
KV cache              Stored K and V from previous tokens for faster generation

"Why separate Q, K, V? A token can simultaneously be a requester, a candidate match, and a content carrier. Without separation, attention would collapse these roles."

The Core Formula

Attention(Q, K, V) = softmax(Q K^T / √d_k) V
  1. Dot product between Q and K measures compatibility
  2. Divide by √d_k so that large dot products do not saturate the softmax and shrink its gradients
  3. Softmax converts each query's scores into weights that are non-negative and sum to 1
  4. Weighted sum of V vectors produces the output representation
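
The four steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: it uses nested lists instead of a tensor library to stay dependency-free, and the function names (softmax, attention) and the toy Q, K, V values are assumptions made for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors.

    Q has n_q rows of length d_k, K has n_k rows of length d_k,
    V has n_k rows of length d_v. Returns (outputs, weights).
    """
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Steps 1 and 2: dot product with every key, divided by sqrt(d_k)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        # Step 3: softmax turns the scores into weights that sum to 1
        w = softmax(scores)
        # Step 4: weighted sum of the value vectors
        out = [sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

# Toy example (hypothetical values): one query that matches the first key best
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
out, w = attention(Q, K, V)
print(w[0])    # the first weight is the larger one
print(out[0])  # the output leans toward the first value vector
```

Because the query aligns with the first key, the first value vector receives the larger weight, but the second still contributes: softmax produces a soft selection, not a hard lookup.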

Multi-Head Attention (MHA): Runs several independent attention operations in parallel, each with its own learned projections of Q, K, V. The head outputs are concatenated and passed through a final learned projection.
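
A rough sketch of multi-head attention, assuming the layout of the original Transformer: each head applies its own projections to the same input, runs scaled dot-product attention, and the head outputs are concatenated and mixed by an output projection. The helper names, the random weights (stand-ins for learned parameters), and the tiny dimensions are all assumptions for illustration.

```python
import math
import random

def matmul(A, B):
    """Matrix product of two nested-list matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out

def rand_matrix(rows, cols, rng):
    """Random weights standing in for learned parameters (assumption)."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads is a list of (W_q, W_k, W_v) tuples, one tuple per head."""
    head_outputs = [
        attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
        for W_q, W_k, W_v in heads
    ]
    # Concatenate the per-head outputs along the feature dimension
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    # Final projection mixes information across heads
    return matmul(concat, W_o)

rng = random.Random(0)
n_tokens, d_model, n_heads = 3, 4, 2
d_head = d_model // n_heads
X = rand_matrix(n_tokens, d_model, rng)
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3)) for _ in range(n_heads)]
W_o = rand_matrix(n_heads * d_head, d_model, rng)
Y = multi_head_attention(X, heads, W_o)
print(len(Y), len(Y[0]))  # one output row per token, d_model features each
```

Splitting d_model across heads keeps the total cost close to that of a single full-width head while letting each head learn a different notion of relevance.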

KV Cache

"Why needed: During autoregressive generation, each new token still needs to attend to all previous tokens."

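How it works: Store the K and V vectors of previous tokens. For the current token, compute only its new Q, K, V, append the new K and V to the cache, then attend with the new Q over all cached K, V. The K and V of earlier tokens are never recomputed.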
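
The caching step can be sketched for a single attention head as follows. This is an illustrative toy under stated assumptions: the KVCache class, decode_step, the fixed 2x2 weight matrices, and the token vectors are all hypothetical, and real implementations store tensors per layer and per head. The baseline function recomputes K and V for the whole prefix so the two paths can be compared.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def project(x, W):
    """Vector-matrix product: x (length d_in) times W (d_in x d_out)."""
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*W)]

def attend(q, keys, values):
    """Attention output for one query over the given keys and values."""
    d_k = len(q)
    weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys])
    return [sum(w * v[c] for w, v in zip(weights, values)) for c in range(len(values[0]))]

class KVCache:
    """Hypothetical minimal cache: one list of keys, one list of values."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

def decode_step(x, W_q, W_k, W_v, cache):
    """Project only the new token, extend the cache, attend over all of it."""
    q = project(x, W_q)
    cache.append(project(x, W_k), project(x, W_v))
    return attend(q, cache.keys, cache.values)

def recompute_from_scratch(xs, W_q, W_k, W_v):
    """Baseline without a cache: re-project every token in the prefix."""
    keys = [project(x, W_k) for x in xs]
    values = [project(x, W_v) for x in xs]
    return attend(project(xs[-1], W_q), keys, values)

# Toy weights and token embeddings (assumptions for illustration)
W_q = [[0.5, -0.2], [0.1, 0.3]]
W_k = [[0.4, 0.0], [-0.3, 0.2]]
W_v = [[1.0, 0.5], [0.0, -1.0]]
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

cache = KVCache()
cached_outputs = [decode_step(x, W_q, W_k, W_v, cache) for x in tokens]
```

Both paths give the same output at every step. The cached path projects one token per step, while the baseline re-projects the entire prefix, so the cache trades memory that grows with sequence length for much less computation per generated token.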

Modern KV Cache Variants

Variant                            Description                                              Benefit
Multi-Query Attention (MQA)        Shares one KV head across all query heads                Faster decoding, less memory
Grouped-Query Attention (GQA)      Shares each KV head within a group of query heads        Balances speed and quality between MHA and MQA
Cross-Layer Attention (CLA)        Shares KV activations across layers                      Up to 2× reduction in KV cache
Multi-Head Latent Attention (MLA)  Compresses KV states into latent vectors (DeepSeek-V2)   93.3% reduction in KV cache (reported relative to DeepSeek 67B)
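
To make the memory effect of sharing KV heads concrete, here is a back-of-the-envelope calculation. It assumes the usual accounting of one K and one V vector per token, per layer, per KV head. The model configuration (32 layers, 32 query heads, head dimension 128, 4096 tokens, 2 bytes per value) is a hypothetical example rather than any specific model, and the contiguous head-to-group mapping in kv_head_for is one common convention, not the only possible one.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, bytes_per_value=2, batch=1):
    """Approximate KV cache size in bytes.

    The factor 2 accounts for storing both a K and a V vector
    per token, per layer, per KV head.
    """
    return 2 * batch * n_layers * n_kv_heads * d_head * seq_len * bytes_per_value

def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the KV head a query head reads from (contiguous grouping, an assumption)."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Hypothetical configuration, chosen only to make the arithmetic easy to follow
config = dict(n_layers=32, d_head=128, seq_len=4096)
n_query_heads = 32

sizes = {
    "MHA": kv_cache_bytes(n_kv_heads=32, **config),  # one KV head per query head
    "GQA": kv_cache_bytes(n_kv_heads=8, **config),   # 4 query heads per KV head
    "MQA": kv_cache_bytes(n_kv_heads=1, **config),   # all query heads share one KV head
}
for name, size in sizes.items():
    print(f"{name}: {size / 2**20:.0f} MiB")
# prints:
# MHA: 2048 MiB
# GQA: 512 MiB
# MQA: 64 MiB
```

The cache shrinks in direct proportion to the number of KV heads, which is why MQA and GQA reduce memory without changing the number of query heads. CLA and MLA act on other terms (layers that store KV, and the size of what is stored per token) and are not modeled here.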

Recent Efficiency Variants

  • Elastic Core-Periphery (Vision Transformers): Communication routed via a small set of learned "core" tokens; complexity becomes near-linear in image size
  • Other approaches aim to reduce the cost of full attention, which grows quadratically with sequence length, while maintaining quality

Key Takeaways

  1. Attention allows Transformers to dynamically decide which tokens matter most for understanding context
  2. The QKV formulation separates three distinct roles: requester (Q), match target (K), and content carrier (V)
  3. KV cache is essential for efficient autoregressive generation because it avoids recomputing K and V for past tokens at every step
  4. Modern variants (MQA, GQA, CLA, MLA) reduce KV cache memory by sharing or compressing K and V
  5. DeepSeek-V2, which uses Multi-Head Latent Attention, reports a 93.3% smaller KV cache than DeepSeek 67B, making much longer contexts practical