Microsoft GraphRAG — Structured Knowledge Graph RAG

Source: Microsoft GraphRAG Docs
Date Published: April 2024 (arXiv)
Author: Microsoft Research
Extra metadata: arXiv Paper · Blog Post


TL;DR

GraphRAG is a structured, hierarchical approach to RAG. It builds a knowledge graph from raw text, clusters the graph into communities using the Leiden algorithm, and summarizes each community from the bottom up. This lets LLMs answer holistic questions ("what are the main themes?") that baseline RAG, which relies on semantic search over text chunks, cannot handle, while also supporting entity-level local queries. It is designed for private datasets the model hasn't been trained on.

The Problem with Baseline RAG

Standard RAG uses vector similarity to retrieve relevant text snippets. It struggles with two key tasks:

  • Connecting the dots — questions that require traversing disparate pieces of information through shared attributes or relationships
  • Holistic understanding — summarizing semantic concepts over large document collections or single large documents

Baseline RAG has no notion of entity relationships, community structure, or thematic groupings.

The GraphRAG Pipeline

Index Phase

  1. TextUnits — the input corpus is sliced into analyzable chunks
  2. Extraction — entities, relationships, and key claims are extracted from each TextUnit using an LLM
  3. Hierarchical Clustering — the entity graph is clustered using the Leiden technique, producing a community hierarchy
  4. Community Summaries — each community and its constituents are summarized bottom-up, enabling multi-granularity understanding
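
The four indexing steps above can be sketched in plain Python. This is a toy illustration under loud assumptions, not GraphRAG's implementation: `extract_entities` is a hypothetical stand-in that matches known names instead of calling an LLM, connected components are swapped in for Leiden clustering (so there is only one level, not a hierarchy), and `summarize` is a placeholder for the LLM-written community summary. All function names here are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def split_into_text_units(text, size=12, overlap=2):
    """Step 1: slice text into overlapping word windows (toy TextUnits)."""
    words = text.split()
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def extract_entities(unit, vocabulary):
    """Step 2 (hypothetical stand-in for the LLM): match known entity names."""
    return sorted(name for name in vocabulary if name in unit)


def build_graph(units, vocabulary):
    """Link entities that co-occur in the same TextUnit."""
    graph = defaultdict(set)
    for unit in units:
        entities = extract_entities(unit, vocabulary)
        for name in entities:
            graph[name]  # register the entity even if it has no neighbors
        for a, b in combinations(entities, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def connected_components(graph):
    """Step 3 (simplified): group entities; GraphRAG itself uses Leiden."""
    seen, communities = set(), []
    for start in sorted(graph):
        if start in seen:
            continue
        stack, members = [start], set()
        while stack:
            node = stack.pop()
            if node in members:
                continue
            members.add(node)
            stack.extend(graph[node] - members)
        seen |= members
        communities.append(sorted(members))
    return communities


def summarize(community):
    """Step 4 (placeholder): an LLM would write this summary."""
    return "Community of " + ", ".join(community)


units = [
    "Alice founded Acme in Berlin",
    "Acme later hired Bob",
    "Carol studies glaciers in Oslo",
]
vocabulary = {"Alice", "Acme", "Bob", "Carol", "Oslo", "Berlin"}
graph = build_graph(units, vocabulary)
communities = connected_components(graph)
summaries = [summarize(c) for c in communities]
```

In this toy corpus, Alice, Acme, Berlin, and Bob end up in one community because Acme appears in two TextUnits, while Carol and Oslo form a second community with no link to the first.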

Query Phase (Four Modes)

Mode          | Purpose
Global Search | Holistic questions about the corpus, using the community summaries
Local Search  | Specific entities, fanning out to neighbors and associated concepts
DRIFT Search  | Like Local Search but enriched with community information context
Basic Search  | Standard top-k vector search for cases best served by baseline RAG
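
To make the contrast between two of these modes concrete, here is a minimal sketch, assuming tiny hand-written vectors and a hand-built entity graph. `basic_search` ranks chunks by cosine similarity, as baseline RAG does; `local_search_context` starts from one entity and gathers the chunks attached to it and to its direct neighbors. Both function names and the data layout are hypothetical and are not GraphRAG's API.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def basic_search(query_vec, chunk_vecs, k=2):
    """Return the ids of the k chunks most similar to the query vector."""
    ranked = sorted(
        chunk_vecs,
        key=lambda cid: cosine(query_vec, chunk_vecs[cid]),
        reverse=True,
    )
    return ranked[:k]


def local_search_context(entity, graph, entity_chunks):
    """Collect chunk ids for an entity and its one-hop neighbors, in order."""
    names = [entity] + sorted(graph.get(entity, ()))
    context = []
    for name in names:
        for cid in entity_chunks.get(name, []):
            if cid not in context:
                context.append(cid)
    return context


# Toy data: 2-d "embeddings" and a small entity graph (illustrative only).
chunk_vecs = {"c1": [1.0, 0.0], "c2": [0.8, 0.6], "c3": [0.0, 1.0]}
graph = {"Acme": {"Alice", "Bob"}, "Alice": {"Acme"}, "Bob": {"Acme"}}
entity_chunks = {"Acme": ["c1"], "Alice": ["c1", "c2"], "Bob": ["c3"]}

top_chunks = basic_search([1.0, 0.1], chunk_vecs, k=2)
acme_context = local_search_context("Acme", graph, entity_chunks)
```

The vector search only sees chunks that resemble the query, whereas the entity fan-out also reaches `c3` through the Acme to Bob relationship, even though `c3` is the least similar chunk to the query.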

Why It Matters

GraphRAG gives LLMs a structured world model of the private data rather than a flat bag of chunks. This unlocks two capabilities that were previously weak in RAG:

  1. Global sensemaking — "What are the major themes across all 10,000 documents?"
  2. Multi-hop reasoning — "How does person A connect to organization B through event C?"

Both are essential for enterprise knowledge management, due diligence, legal discovery, and research synthesis.
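
The multi-hop question above amounts to finding a path in the entity graph. A minimal sketch, assuming the graph is a plain adjacency dictionary and using breadth-first search to find a shortest chain of relationships (the node names and `find_path` are hypothetical, and this is not how GraphRAG itself answers queries):

```python
from collections import deque


def find_path(graph, start, goal):
    """Breadth-first search for a shortest relationship chain, or None."""
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbor in sorted(graph.get(node, ())):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(path + [neighbor])
    return None


# Toy entity graph: the person and the organization only meet via the event.
graph = {
    "Person A": {"Event C"},
    "Event C": {"Person A", "Organization B"},
    "Organization B": {"Event C"},
    "Person D": set(),
}

chain = find_path(graph, "Person A", "Organization B")
no_chain = find_path(graph, "Person A", "Person D")
```

A flat chunk index has no equivalent of this traversal: if no single chunk mentions both the person and the organization, similarity search alone has nothing to connect them.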

Key Takeaways

  1. GraphRAG bridges the gap between unstructured retrieval and structured reasoning by building a knowledge graph with community hierarchy
  2. Four query modes cover the full spectrum from global thematic questions to specific entity lookups
  3. Prompt tuning is strongly recommended per dataset for optimal extraction and summarization quality
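  4. Versioning requires care: after a minor version bump, run graphrag init --force to pick up the latest config format (this overwrites existing configuration and prompts, so back them up first); after a major version bump, run the provided migration notebook to avoid re-indexing existing datasets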