GBrain v0.40.6.0: A Personal Knowledge Brain — Benchmark Snapshot¶

Source: gbrain-evals GitHub
Date Published: 2026-05-23
System Under Test: gbrain v0.40.6.0 (commit 677142a6)
Default Stack: ZeroEntropy embedder + reranker

TL;DR¶

GBrain is a local-first personal knowledge brain: drop in notes, meetings, and emails, then ask questions in plain English. v0.40.6.0 reports public SOTA on the LongMemEval memory benchmark (97.60% Recall@5, against 96.6% for MemPalace) and reaches 49.1% P@5 on relational retrieval, 38 points above pure vector RAG (10.8%). Two components account for this: a graph layer built from [[wiki/...]] links by pattern matching (no LLM call), and a ZeroEntropy reranker that reshuffles 60% of top-1 results. The ZeroEntropy embedder is also the fastest and cheapest configuration measured: $0.05/M tokens, 21 s ingest, 122 ms median query latency.

What It Is¶

GBrain runs on your laptop and ingests any text you feed it — notes, meetings, emails, tweets, books. It reads everything in the background. Later you ask questions in plain English and get answers with sources.

Two modes of asking:

Search (gbrain search) — returns ranked pages. Keyword search + vector search fused via Reciprocal Rank Fusion (RRF), then re-scored by a reranker.
Think (gbrain think) — runs the same search, then composes a cited, synthesized answer with gap analysis (tells you what it doesn't know) and cross-page synthesis (answers questions no single page contains).

Headline Results¶

Axis	gbrain v0.40.6.0	Baseline
LongMemEval `_s` Recall@5	97.60%	MemPalace 96.6% (+1.0pp, public SOTA)
BrainBench P@5 (relational)	49.1%	Vector-only RAG 10.8% (+38pp)
Synthesis quality (Think vs Search)	5.60 vs 1.60 / 10	+4.00 point lift
Reranker R@10 (best config)	53.7%	Voyage embed + ZE rerank
Ingest cost (per M tokens)	$0.05	2.6× cheaper than OpenAI, 3.6× vs Voyage
Ingest speed (164-page corpus)	21 seconds	1.9× faster than OpenAI
Query latency (median)	122 ms	2.3× faster than OpenAI
Regression (20 releases)	0.0 points	Byte-identical since v0.20.0
Source isolation (federated brain)	0 leaks	Provably clean across 4 surfaces

Architecture¶

Search Pipeline¶

Keyword search — verbatim matches for specific names/phrases.
Vector search — numerical fingerprint (embedding) comparison for paraphrased queries.
Hybrid fusion — Reciprocal Rank Fusion (RRF) blends both ranked lists into one.
Reranker — ZeroEntropy zerank-2 re-scores the top results for nuance.
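
The fusion step can be sketched in a few lines. This is a minimal illustration of standard Reciprocal Rank Fusion, not gbrain's actual code: the function name `rrf_fuse`, the example slugs, and the constant `k=60` (the value commonly used in the RRF literature) are assumptions, since the source does not state which constant gbrain uses.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse several ranked lists of page slugs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) to every slug it contains,
    where rank starts at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, slug in enumerate(ranking, start=1):
            scores[slug] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda slug: (-scores[slug], slug))


# Hypothetical result lists from the keyword and vector stages.
keyword_hits = ["people/alice-example", "companies/acme-ai", "meetings/2026-04-03-board"]
vector_hits = ["companies/acme-ai", "deal/acme-series-a", "people/alice-example"]

fused = rrf_fuse([keyword_hits, vector_hits])
print(fused)
```

A page that ranks well in both lists (here `companies/acme-ai`) rises above a page that tops only one of them. The fused list would then be handed to the reranker.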

Think Pipeline¶

Runs the same search, then an LLM synthesizes a written answer with:

Citations anchored to specific page slugs.
Gap analysis — explicitly states what the brain doesn't know.
Cross-page synthesis — walks the graph to answer relational questions (e.g. "who works at companies fund X invested in?").
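
The relational example above amounts to a two-hop walk over typed edges. The sketch below assumes edges are stored as `(source, relation, target)` triples. The `invested_in` relation, the slugs, and the helper names are hypothetical and are not taken from gbrain.

```python
# Hypothetical typed edges as (source, relation, target) triples.
EDGES = [
    ("funds/fund-x", "invested_in", "companies/acme-ai"),
    ("funds/fund-x", "invested_in", "companies/beta-robotics"),
    ("people/alice-example", "works_at", "companies/acme-ai"),
    ("people/bob-example", "works_at", "companies/beta-robotics"),
    ("people/carol-example", "works_at", "companies/other-co"),
]


def targets(edges, source, relation):
    """All nodes reached from `source` by following `relation` forward."""
    return {t for s, r, t in edges if s == source and r == relation}


def sources(edges, relation, target):
    """All nodes that point at `target` via `relation`."""
    return {s for s, r, t in edges if r == relation and t == target}


def people_at_portfolio_companies(edges, fund):
    """Hop 1: fund -> companies it invested in. Hop 2: companies <- employees."""
    answer = {}
    for company in sorted(targets(edges, fund, "invested_in")):
        answer[company] = sorted(sources(edges, "works_at", company))
    return answer


answer = people_at_portfolio_companies(EDGES, "funds/fund-x")
print(answer)
```

No single page contains this answer. It only exists once the two edge types are joined, which is the kind of question the source attributes to cross-page synthesis.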

The Graph Layer¶

When you write [[wiki/people/alice-example]], gbrain extracts a typed edge at write time — pure pattern matching, no LLM call:

alice-example ──works_at──> acme-ai
alice-example ──attended──> meetings/2026-04-03-board
acme-ai      ──raised───> deal/acme-series-a

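The graph is load-bearing. On relational queries, pure vector search reaches 10.8% P@5, while the full hybrid-plus-graph stack reaches 49.1%, a 38-point gap. Removing only the graph from the full stack drops P@5 to 19.2%, so roughly 30 of those points come from the graph itself. That is a different category of capability rather than a marginal improvement.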
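
A rough sketch of what write-time edge extraction by pattern matching could look like. The source only says that edges are typed and that no LLM is involved. The regex, the `EDGE_TYPES` rule table (which infers the edge type from the slug prefixes of the two pages), the `links_to` fallback, and the slug layout are all assumptions made for illustration.

```python
import re

# Matches [[wiki/<slug>]] and captures the slug after the "wiki/" prefix.
WIKILINK = re.compile(r"\[\[wiki/([^\]]+)\]\]")

# Hypothetical rule table: (source kind, target kind) -> edge type.
EDGE_TYPES = {
    ("people", "companies"): "works_at",
    ("people", "meetings"): "attended",
    ("companies", "deal"): "raised",
}


def extract_edges(source_slug, text):
    """Return (source, edge_type, target) triples for every wiki link in text."""
    source_kind = source_slug.split("/", 1)[0]
    edges = []
    for match in WIKILINK.finditer(text):
        target = match.group(1).strip()
        target_kind = target.split("/", 1)[0]
        # Fall back to a generic edge when no rule applies (assumed behavior).
        edge_type = EDGE_TYPES.get((source_kind, target_kind), "links_to")
        edges.append((source_slug, edge_type, target))
    return edges


page_text = (
    "Alice joined [[wiki/companies/acme-ai]] in 2025 and attended "
    "[[wiki/meetings/2026-04-03-board]]."
)
edges = extract_edges("people/alice-example", page_text)
print(edges)
```

Because this is a regex pass plus a dictionary lookup, it adds no model cost at write time, consistent with the "no LLM call" claim.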

Detailed Benchmarks¶

LongMemEval — Memory Recall (Public SOTA)¶

The standard benchmark for AI memory systems (Wu et al., HuggingFace). 500 questions, ~50 conversation sessions each.

Metric	Value
Recall@5 (`_s` split)	97.60%
MemPalace (competitor)	96.6%
Cost	~$0.50 per 1,000 queries (no LLM in retrieval loop)

BrainBench — Relational Retrieval¶

In-house benchmark: 240 pages of fictional biographies, 145 relational queries.

Configuration	P@5	R@5
gbrain (full hybrid + graph)	49.1%	97.9%
Without graph	19.2%	70.0%
Pure keyword (BM25)	17.1%	62.4%
Pure vector search	10.8%	40.7%
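
For reference, the two columns can be computed per query as follows. This uses the standard definitions of precision@k and recall@k. Whether BrainBench averages them in exactly this way is an assumption, and the slugs are invented.

```python
def precision_at_k(ranked, relevant, k=5):
    """Share of the top-k slots that hold a relevant page."""
    hits = sum(1 for slug in ranked[:k] if slug in relevant)
    return hits / k


def recall_at_k(ranked, relevant, k=5):
    """Share of all relevant pages that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for slug in ranked[:k] if slug in relevant)
    return hits / len(relevant)


# Hypothetical query with two relevant pages, both retrieved in the top 5.
ranked = ["people/alice-example", "x", "people/bob-example", "y", "z"]
relevant = {"people/alice-example", "people/bob-example"}

p5 = precision_at_k(ranked, relevant)
r5 = recall_at_k(ranked, relevant)
print(p5, r5)
```

With only two relevant pages, P@5 cannot exceed 40% even when both are retrieved, which is one way a system can show a high R@5 alongside a much lower P@5.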

Zero regression across 20 releases: v0.20.0 → v0.40.6.0 produces byte-identical scores.
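
One way to check a "byte-identical" claim is to hash a canonical serialization of the score table for each release and compare digests. The sketch below is an assumption about how such a gate could work, not a description of the gbrain-evals harness. The function name and the score dictionary layout are hypothetical.

```python
import hashlib
import json


def score_digest(scores):
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    payload = json.dumps(scores, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


# Reported BrainBench scores; key order differs but the values are the same.
baseline = {"P@5": 0.491, "R@5": 0.979}
candidate = {"R@5": 0.979, "P@5": 0.491}
regressed = {"P@5": 0.490, "R@5": 0.979}

print(score_digest(baseline) == score_digest(candidate))
print(score_digest(baseline) == score_digest(regressed))
```

Sorting keys before hashing keeps the comparison stable when output ordering changes, while any change in a score value produces a different digest.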

Synthesis Quality — Think vs Search (Cat 29, NEW)¶

Five questions judged 0–10 by Claude Haiku on accuracy, groundedness, and utility.

Question	Search	Think	Δ
Who works at [company]?	1/10	9/10	+8
Has ARR grown over time?	0/10	9/10	+9
Which concepts do most companies link to?	7/10	2/10	−5
Who attended the autonomous-picking meeting?	0/10	8/10	+8
What is the current ARR in May 2026?	0/10	0/10	0

Metric	Value
Search mean	1.60 / 10
Think mean	5.60 / 10
Lift	+4.00 points

Think wins on the three relational questions that require cross-page synthesis (+8 to +9). It loses on the one pure-aggregation question (−5), where raw search happens to surface the right slugs, and ties at 0/10 on the gap-analysis question, where the underlying data is stale.
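
The summary figures follow directly from the five per-question scores. The snippet below recomputes them from the table. The variable names are illustrative only.

```python
# (question, search score, think score), copied from the table above.
scores = [
    ("Who works at [company]?", 1, 9),
    ("Has ARR grown over time?", 0, 9),
    ("Which concepts do most companies link to?", 7, 2),
    ("Who attended the autonomous-picking meeting?", 0, 8),
    ("What is the current ARR in May 2026?", 0, 0),
]

n = len(scores)
search_total = sum(search for _, search, _ in scores)
think_total = sum(think for _, _, think in scores)

search_mean = search_total / n
think_mean = think_total / n
# Computed from the integer totals to avoid floating-point noise.
lift = (think_total - search_total) / n
deltas = [think - search for _, search, think in scores]

print(search_mean, think_mean, lift, deltas)
```

The per-question deltas show that the lift comes from three questions, with one question pulling in the opposite direction.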

Embedder/Reranker Shootout (Cat 18b, NEW)¶

Six cells: three embedders, each with and without the ZeroEntropy zerank-2 reranker. Three of the six cells are shown below.

Cell	R@10	MRR	Query Latency	Ingest	Cost
Voyage 1024d + ZE rerank	53.7%	0.571	256 ms	24 s	$0.0031
OpenAI 1536d + ZE rerank	48.8%	0.571	327 ms	39 s	$0.0022
ZeroEntropy 2560d (no rerank)	12.2%	0.238	122 ms	21 s	$0.0009
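
The MRR column is mean reciprocal rank. A minimal sketch of the standard definition follows. The function names and the toy queries are assumptions and do not reproduce the benchmark data.

```python
def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, slug in enumerate(ranked, start=1):
        if slug in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked list, relevant set) pairs."""
    return sum(reciprocal_rank(ranked, relevant) for ranked, relevant in runs) / len(runs)


# Three hypothetical queries: hit at rank 1, hit at rank 2, and a miss.
runs = [
    (["a", "b", "c"], {"a"}),
    (["x", "b", "c"], {"b"}),
    (["x", "y", "z"], {"q"}),
]

mrr = mean_reciprocal_rank(runs)
print(mrr)
```

MRR rewards putting the first correct page near the top, whereas R@10 only asks whether relevant pages appear anywhere in the first ten results.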

Winners by axis:

Best recall: Voyage embed + ZE rerank (53.7% R@10)
Fastest/cheapest: ZeroEntropy embed (122ms, 21s, $0.05/M tokens)

Key lesson: the reranker matters more than the embedder. In the cells above, both reranked configurations reach 48.8% to 53.7% R@10, while the one unreranked cell reaches 12.2%. ZeroEntropy's zerank-2 reshuffles 60% of top-1 results. gbrain made ZeroEntropy the default in v0.36.2 for this reason, so most users never need to think about provider choice.
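
One plausible reading of "reshuffles 60% of top-1 results" is the share of queries whose first-ranked page changes after reranking. The sketch below measures that quantity. Both the interpretation and the function name are assumptions, and the lists are toy data.

```python
def top1_change_rate(before, after):
    """Fraction of queries whose top-ranked result differs after reranking."""
    if len(before) != len(after):
        raise ValueError("need one reranked list per original list")
    if not before:
        return 0.0
    changed = sum(1 for b, a in zip(before, after) if b[:1] != a[:1])
    return changed / len(before)


# Five hypothetical queries; the reranker changes the top result for three.
before = [["a", "b"], ["c", "d"], ["e", "f"], ["g", "h"], ["i", "j"]]
after = [["b", "a"], ["d", "c"], ["f", "e"], ["g", "h"], ["i", "j"]]

rate = top1_change_rate(before, after)
print(rate)
```

A high change rate alone does not show the reranker is right. It needs to be read together with the R@10 and MRR columns.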

What's Still Running¶

Cat 17 phase2 — full 7-cell shootout on the real 240-page BrainBench corpus (6 of 7 cells pending)
Larger-corpus runs — production brains are 100K+ pages; small-corpus receipts are solid, but scale validation awaits
Postgres-mode variants — some Cats hit PGLite-specific limits (no pg_trgm, WASM cold-start)

Key Takeaways¶

Graph > vectors for relational retrieval. The 38-point P@5 gap is not incremental — it's categorical.
Rerankers are load-bearing. They matter more than which embedder you choose.
Think scores 4 points higher than Search on synthesis quality (5.60 vs 1.60 out of 10), with the largest gains on relational questions that require cross-page synthesis.
Zero regression across 20 releases means the retrieval contract is stable — feature work doesn't break accuracy.
Local-first doesn't mean slow or expensive. In these measurements, gbrain with the ZeroEntropy embedder is faster and cheaper than the OpenAI and Voyage configurations.