Microsoft GraphRAG: Structured Knowledge Graph RAG

Source: Microsoft GraphRAG Docs
Date Published: April 2024 (arXiv)
Author: Microsoft Research
Extra metadata: arXiv Paper · Blog Post

TL;DR

GraphRAG is a structured, hierarchical approach to RAG. It builds a knowledge graph from raw text, clusters the graph into communities using the Leiden algorithm, and generates a summary for each community from the bottom up. This lets an LLM answer holistic questions ("what are the main themes?") that naive semantic-search RAG handles poorly, while still supporting entity-level local queries. It is designed for private datasets the model hasn't been trained on.

The Problem with Baseline RAG

Standard RAG uses vector similarity to retrieve relevant text snippets. It struggles with two key tasks:

Connecting the dots — questions that require traversing disparate pieces of information through shared attributes or relationships
Holistic understanding — summarizing semantic concepts over large document collections or single large documents

Baseline RAG has no notion of entity relationships, community structure, or thematic groupings.
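To make the limitation concrete, here is a minimal sketch of the retrieval step in baseline RAG. The chunk ids and three-dimensional vectors are hypothetical; a real system would obtain embeddings from an embedding model and use a vector store rather than a dictionary.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    scored = sorted(chunks.items(), key=lambda kv: cosine(query_vec, kv[1]), reverse=True)
    return [chunk_id for chunk_id, _ in scored[:k]]

# Hypothetical toy embeddings, for illustration only.
chunks = {
    "chunk-a": [0.9, 0.1, 0.0],
    "chunk-b": [0.1, 0.9, 0.0],
    "chunk-c": [0.0, 0.2, 0.9],
}

print(top_k([1.0, 0.0, 0.0], chunks))
```

Each chunk is scored on its own against the query, so nothing in this step knows that two chunks mention the same entity or belong to the same theme.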

The GraphRAG Pipeline

Index Phase

TextUnits — the input corpus is sliced into analyzable chunks
Extraction — entities, relationships, and key claims are extracted from each TextUnit using an LLM
Hierarchical Clustering — the entity graph is clustered using the Leiden technique, producing a community hierarchy
Community Summaries — each community and its constituents are summarized bottom-up, enabling multi-granularity understanding
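The four indexing steps above can be sketched in miniature. This is a toy under loud assumptions, not GraphRAG's implementation: capitalized words stand in for LLM entity extraction, co-occurrence in a TextUnit stands in for relationship extraction, connected components stand in for Leiden clustering (and yield only one level, not a hierarchy), and a string template stands in for an LLM-written summary. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def make_text_units(text, size=4):
    """Slice the corpus into fixed-size word windows (TextUnits)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def extract_entities(unit):
    """Toy stand-in for LLM extraction: treat capitalized words as entities."""
    return sorted({w.strip(".,") for w in unit.split() if w[0].isupper()})

def build_graph(units):
    """Link entities that co-occur in the same TextUnit."""
    graph = defaultdict(set)
    for unit in units:
        entities = extract_entities(unit)
        for entity in entities:
            graph[entity]  # register the node even if it has no edges
        for a, b in combinations(entities, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def communities(graph):
    """Connected components: a deliberately simple substitute for Leiden."""
    seen, result = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        result.append(sorted(members))
    return result

def summarize(members):
    """Placeholder for an LLM-written community summary."""
    return "Community of " + ", ".join(members)

text = ("Alice founded Acme today. Bob joined Acme later. "
        "Carol runs Zenith alone. Zenith hired Dave recently.")
units = make_text_units(text)
graph = build_graph(units)
reports = [summarize(c) for c in communities(graph)]
print(reports)
```

The shape of the data flow is the point here: chunks become a graph, the graph becomes groups, and each group gets its own summary that later queries can read instead of the raw text.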

Query Phase (Four Modes)

Mode	Purpose
Global Search	Holistic questions about the corpus, using the community summaries
Local Search	Specific entities, fanning out to neighbors and associated concepts
DRIFT Search	Like Local Search, but with community information added as context
Basic Search	Standard top-k vector search for cases best served by baseline RAG
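The difference between Local and Global Search can be illustrated with a small sketch. It assumes a plain adjacency dictionary and a list of community summaries already exist; `local_search`, `global_search`, and `keyword_answer` are hypothetical names, and the keyword filter is a toy stand-in for the LLM calls a real system would make.

```python
def local_search(graph, entity, hops=1):
    """Collect the entity plus everything within `hops` edges of it."""
    frontier, seen = {entity}, {entity}
    for _ in range(hops):
        frontier = {n for node in frontier for n in graph.get(node, ())} - seen
        seen |= frontier
    return sorted(seen)

def global_search(question, summaries, answer_fn, reduce_fn):
    """Map the question over every community summary, then reduce the partial answers."""
    partials = [answer_fn(question, summary) for summary in summaries]
    return reduce_fn([p for p in partials if p])

def keyword_answer(question, summary):
    """Toy stand-in for an LLM call: keep summaries sharing a keyword with the question."""
    keywords = {w.strip("?").lower() for w in question.split() if len(w) > 4}
    return summary if any(k in summary.lower() for k in keywords) else ""

# Hypothetical entity graph and community summaries.
graph = {
    "Alice": {"Acme"},
    "Acme": {"Alice", "Bob"},
    "Bob": {"Acme"},
    "Carol": {"Zenith"},
    "Zenith": {"Carol"},
}
summaries = [
    "Acme is a company founded by Alice.",
    "Zenith is a company run by Carol.",
    "Weather was mild all week.",
]

print(local_search(graph, "Alice", hops=2))
print(global_search("What is each company about?", summaries, keyword_answer, " ".join))
```

Local Search starts from one entity and widens outward, while Global Search never touches individual entities: it consults every community summary and merges the partial answers.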

Why It Matters

GraphRAG gives LLMs a structured world model of the private data rather than a flat bag of chunks. This unlocks two capabilities that were previously weak in RAG:

Global sensemaking — "What are the major themes across all 10,000 documents?"
Multi-hop reasoning — "How does person A connect to organization B through event C?"

Both are essential for enterprise knowledge management, due diligence, legal discovery, and research synthesis.
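The multi-hop question above maps naturally onto path finding over extracted relationships. The sketch below uses breadth-first search over a hypothetical list of (source, relation, target) triples; it treats relations as undirected for traversal, which is an assumption made for simplicity rather than a description of how GraphRAG answers such queries.

```python
from collections import deque

def find_path(edges, start, goal):
    """Breadth-first search returning the shortest chain of (node, relation, node) hops."""
    adjacency = {}
    for src, rel, dst in edges:
        # Store both directions so the search can walk an edge either way.
        adjacency.setdefault(src, []).append((rel, dst))
        adjacency.setdefault(dst, []).append((rel, src))
    queue, seen = deque([(start, [])]), {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for rel, nxt in adjacency.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, rel, nxt)]))
    return None  # no connection found

# Hypothetical triples, for illustration only.
edges = [
    ("Person A", "attended", "Event C"),
    ("Organization B", "sponsored", "Event C"),
    ("Person A", "lives_in", "City D"),
]

print(find_path(edges, "Person A", "Organization B"))
```

A flat list of chunks has no equivalent operation: the link between the person and the organization exists only once both relationships are stored as edges that share the event node.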

Key Takeaways

GraphRAG bridges the gap between unstructured retrieval and structured reasoning by building a knowledge graph with community hierarchy
Four query modes cover the full spectrum from global thematic questions to specific entity lookups
Prompt tuning is strongly recommended per dataset for optimal extraction and summarization quality
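Versioning requires care: after a minor version bump, rerun graphrag init --force so the config is in the latest format; after a major version bump, use the provided migration notebook to avoid re-indexing existing datasets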