GBrain v0.40.6.0: A Personal Knowledge Brain — Benchmark Snapshot

Source: gbrain-evals GitHub
Date Published: 2026-05-23
System Under Test: gbrain v0.40.6.0 (commit 677142a6)
Default Stack: ZeroEntropy embedder + reranker


TL;DR

GBrain is a local-first personal knowledge brain: drop in notes, meetings, and emails, then ask questions in plain English. v0.40.6.0 reports public SOTA on the LongMemEval memory benchmark (97.60% Recall@5, versus 96.6% for MemPalace) and reaches 49.1% P@5 on relational retrieval, 38 points above pure vector RAG (10.8%). Two components drive these results: a graph layer built from [[wiki/...]] links (no LLM call) and a ZeroEntropy reranker that reshuffles 60% of top-1 results. The ZeroEntropy embedder is also the fastest and cheapest configuration measured: $0.05/M tokens, 21s ingest, 122ms query latency.

What It Is

GBrain runs on your laptop and ingests any text you feed it — notes, meetings, emails, tweets, books. It reads everything in the background. Later you ask questions in plain English and get answers with sources.

Two modes of asking:

  • Search (gbrain search) — returns ranked pages. Keyword search + vector search fused via Reciprocal Rank Fusion (RRF), then re-scored by a reranker.
  • Think (gbrain think) — runs the same search, then composes a cited, synthesized answer with gap analysis (tells you what it doesn't know) and cross-page synthesis (answers questions no single page contains).

Headline Results

| Axis | gbrain v0.40.6.0 | Baseline / notes |
| --- | --- | --- |
| LongMemEval _s Recall@5 | 97.60% | MemPalace 96.6%; gbrain is public SOTA |
| BrainBench P@5 (relational) | 49.1% | Vector-only RAG 10.8% (+38pp) |
| Synthesis quality (Think vs Search) | 5.60 vs 1.60 / 10 | +4.00 point lift |
| Reranker R@10 (best config) | 53.7% | Voyage embed + ZE rerank |
| Ingest cost (per M tokens) | $0.05 | 2.6× cheaper than OpenAI, 3.6× vs Voyage |
| Ingest speed (164-page corpus) | 21 seconds | 1.9× faster than OpenAI |
| Query latency (median) | 122 ms | 2.3× faster than OpenAI |
| Regression (20 releases) | 0.0 points | Byte-identical since v0.20.0 |
| Source isolation (federated brain) | 0 leaks | Provably clean across 4 surfaces |

Architecture

Search Pipeline

  1. Keyword search — verbatim matches for specific names/phrases.
  2. Vector search — numerical fingerprint (embedding) comparison for paraphrased queries.
  3. Hybrid fusion — Reciprocal Rank Fusion (RRF) blends both ranked lists into one.
  4. Reranker — ZeroEntropy zerank-2 re-scores the top results for nuance.
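
The fusion step above can be sketched in a few lines. This is a minimal illustration of Reciprocal Rank Fusion in general, not gbrain's actual code: the function name, the example slugs, and the constant k=60 (a commonly used default) are all assumptions.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of page slugs into one ranked list.

    Each page scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, slug in enumerate(ranking, start=1):
            scores[slug] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by slug for determinism.
    return sorted(scores, key=lambda slug: (-scores[slug], slug))

# Hypothetical outputs of the keyword and vector stages.
keyword_hits = ["people/alice-example", "companies/acme-ai", "meetings/2026-04-03-board"]
vector_hits = ["companies/acme-ai", "deal/acme-series-a", "people/alice-example"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Pages that rank well in both lists rise to the top, which is why a page found by only one retriever can still survive into the reranker's candidate set.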

Think Pipeline

Runs the same search, then an LLM synthesizes a written answer with:

  • Citations anchored to specific page slugs.
  • Gap analysis — explicitly states what the brain doesn't know.
  • Cross-page synthesis — walks the graph to answer relational questions (e.g. "who works at companies fund X invested in?").
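
The relational example in the last bullet amounts to a two-hop walk over typed edges. The sketch below shows one way such a walk could look; the edge list, the slugs, and the invested_in edge type are hypothetical, and gbrain's real traversal may differ.

```python
from collections import defaultdict

# (source, edge_type, target) triples. All slugs and the "invested_in"
# edge type are invented for illustration.
EDGES = [
    ("funds/fund-x", "invested_in", "companies/acme-ai"),
    ("funds/fund-x", "invested_in", "companies/beta-robotics"),
    ("people/alice-example", "works_at", "companies/acme-ai"),
    ("people/bob-example", "works_at", "companies/beta-robotics"),
    ("people/carol-example", "works_at", "companies/other-co"),
]

def build_indexes(edges):
    """Index edges by (source, type) and by (target, type) for fast hops."""
    outgoing = defaultdict(set)
    incoming = defaultdict(set)
    for src, etype, dst in edges:
        outgoing[(src, etype)].add(dst)
        incoming[(dst, etype)].add(src)
    return outgoing, incoming

def people_at_portfolio_companies(fund, edges):
    """Hop 1: fund -> companies it invested in. Hop 2: companies -> employees."""
    outgoing, incoming = build_indexes(edges)
    people = set()
    for company in outgoing[(fund, "invested_in")]:
        people |= incoming[(company, "works_at")]
    return sorted(people)

print(people_at_portfolio_companies("funds/fund-x", EDGES))
```

No single page contains the answer here: it only emerges by joining the fund's page with each company's inbound works_at edges.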

The Graph Layer

When you write [[wiki/people/alice-example]], gbrain extracts a typed edge at write time — pure pattern matching, no LLM call:

alice-example ──works_at──> acme-ai
alice-example ──attended──> meetings/2026-04-03-board
acme-ai      ──raised───> deal/acme-series-a

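The graph is load-bearing. Pure vector search hits 10.8% P@5 on relational queries, and the hybrid pipeline without the graph reaches 19.2%. With the graph: 49.1%. That is a 38-point gap over vector-only retrieval and roughly 30 points over the graphless hybrid. This is not a marginal improvement, but a different category of capability.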
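
To make the "pure pattern matching" idea concrete, here is a minimal sketch of write-time edge extraction. The source does not say how gbrain assigns edge types, so the EDGE_TYPES rule table (typing an edge from the source and target path prefixes), the links_to fallback, and the companies/ path are assumptions made for illustration only.

```python
import re

# Matches [[wiki/<slug>]] and captures the slug after "wiki/".
LINK_RE = re.compile(r"\[\[wiki/([^\[\]|]+)\]\]")

# Hypothetical typing rule: (source kind, target kind) -> edge type.
EDGE_TYPES = {
    ("people", "companies"): "works_at",
    ("people", "meetings"): "attended",
    ("companies", "deal"): "raised",
}

def extract_edges(source_slug, text):
    """Return (source, edge_type, target) triples for every wiki link in text."""
    source_kind = source_slug.split("/", 1)[0]
    edges = []
    for match in LINK_RE.finditer(text):
        target_slug = match.group(1).strip()
        target_kind = target_slug.split("/", 1)[0]
        # Fall back to a generic edge when no rule applies (assumed behaviour).
        edge_type = EDGE_TYPES.get((source_kind, target_kind), "links_to")
        edges.append((source_slug, edge_type, target_slug))
    return edges

note = (
    "Alice joined [[wiki/companies/acme-ai]] and went to "
    "[[wiki/meetings/2026-04-03-board]]."
)
edges = extract_edges("people/alice-example", note)
for edge in edges:
    print(edge)
```

Because this is a regular expression plus a lookup table, it runs on every write without any model call, which matches the cost profile described above.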

Detailed Benchmarks

LongMemEval — Memory Recall (Public SOTA)

The standard benchmark for AI memory systems (Wu et al., HuggingFace). 500 questions, ~50 conversation sessions each.

| Metric | Value |
| --- | --- |
| Recall@5 (_s split) | 97.60% |
| MemPalace (competitor) | 96.6% |
| Cost | ~$0.50 per 1,000 queries (no LLM in retrieval loop) |

BrainBench — Relational Retrieval

In-house benchmark: 240 pages of fictional biographies, 145 relational queries.

| Configuration | P@5 | R@5 |
| --- | --- | --- |
| gbrain (full hybrid + graph) | 49.1% | 97.9% |
| Without graph | 19.2% | 70.0% |
| Pure keyword (BM25) | 17.1% | 62.4% |
| Pure vector search | 10.8% | 40.7% |
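
For readers unfamiliar with the P@5 and R@5 columns above, the sketch below uses the standard textbook definitions of precision@k and recall@k. BrainBench's exact scoring code is not shown in the source, so treat the function names and the toy query as assumptions.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Share of the top-k retrieved slugs that are relevant."""
    top_k = retrieved[:k]
    hits = sum(1 for slug in top_k if slug in relevant)
    return hits / k

def recall_at_k(retrieved, relevant, k=5):
    """Share of all relevant slugs that appear in the top-k."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)

# Toy query with two relevant pages, both found in the top 5.
retrieved = ["people/alice", "people/bob", "people/carol", "people/dan", "people/eve"]
relevant = {"people/alice", "people/carol"}

p5 = precision_at_k(retrieved, relevant)
r5 = recall_at_k(retrieved, relevant)
print(f"P@5={p5:.2f} R@5={r5:.2f}")
```

In this toy query only two pages are relevant, so P@5 tops out at 40% even though R@5 is 100%. Under these definitions, a high R@5 alongside a much lower P@5 is expected whenever queries have fewer than five relevant pages.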

Zero regression across 20 releases: v0.20.0 → v0.40.6.0 produces byte-identical scores.
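
One simple way to check a "byte-identical scores" claim is to hash a canonical serialization of each release's score report and compare the digests. The sketch below assumes a hypothetical report shape (a flat dict of metric names to values); the real gbrain-evals output format is not described in the source.

```python
import hashlib
import json

def score_fingerprint(scores):
    """SHA-256 of a canonical JSON encoding, so key order cannot matter."""
    canonical = json.dumps(scores, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

# Hypothetical score reports from two releases (values taken from the table above).
baseline = {"p_at_5": 0.491, "r_at_5": 0.979}
candidate = {"r_at_5": 0.979, "p_at_5": 0.491}
regressed = {"p_at_5": 0.490, "r_at_5": 0.979}

same = score_fingerprint(baseline) == score_fingerprint(candidate)
changed = score_fingerprint(baseline) != score_fingerprint(regressed)
print(same, changed)
```

A digest comparison is stricter than a tolerance check: any change in any metric, however small, shows up as a mismatch.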

Synthesis Quality — Think vs Search (Cat 29, NEW)

Five questions judged 0–10 by Claude Haiku on accuracy, groundedness, and utility.

| Question | Search | Think | Δ |
| --- | --- | --- | --- |
| Who works at [company]? | 1/10 | 9/10 | +8 |
| Has ARR grown over time? | 0/10 | 9/10 | +9 |
| Which concepts do most companies link to? | 7/10 | 2/10 | −5 |
| Who attended the autonomous-picking meeting? | 0/10 | 8/10 | +8 |
| What is the current ARR in May 2026? | 0/10 | 0/10 | 0 |

| Metric | Value |
| --- | --- |
| Search mean | 1.60 / 10 |
| Think mean | 5.60 / 10 |
| Lift | +4.00 points |

Think dominates on relational questions requiring cross-page synthesis. It loses on pure aggregation (where raw search dumps the right slugs by chance) and ties on gap-analysis questions where data is stale.
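
The summary means follow directly from the five per-question scores. The short check below reproduces the arithmetic; only the variable names are invented.

```python
from statistics import mean

# Per-question scores from the table, in row order.
search_scores = [1, 0, 7, 0, 0]
think_scores = [9, 9, 2, 8, 0]

search_mean = mean(search_scores)
think_mean = mean(think_scores)
# Average the per-question deltas to avoid float subtraction noise.
deltas = [t - s for s, t in zip(search_scores, think_scores)]
lift = mean(deltas)

print(f"Search mean: {search_mean:.2f} / 10")
print(f"Think mean:  {think_mean:.2f} / 10")
print(f"Lift:        {lift:+.2f} points")
```

With only five questions, a single outlier such as the −5 on the aggregation question moves the mean by a full point, so the lift is best read as directional.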

Embedder/Reranker Shootout (Cat 18b, NEW)

Six cells: three embedders, each with and without the ZeroEntropy zerank-2 reranker. Three of the six cells are reported below.

| Cell | R@10 | MRR | Query latency | Ingest time | Cost |
| --- | --- | --- | --- | --- | --- |
| Voyage 1024d + ZE rerank | 53.7% | 0.571 | 256 ms | 24 s | $0.0031 |
| OpenAI 1536d + ZE rerank | 48.8% | 0.571 | 327 ms | 39 s | $0.0022 |
| ZeroEntropy 2560d (no rerank) | 12.2% | 0.238 | 122 ms | 21 s | $0.0009 |
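
The MRR column is mean reciprocal rank. The sketch below uses the standard definition (the reciprocal of the rank of the first relevant result, averaged over queries, with 0 when nothing relevant is returned); the benchmark's exact implementation is not shown in the source, and the toy queries are invented.

```python
def mean_reciprocal_rank(results, relevant_sets):
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    if not results:
        return 0.0
    total = 0.0
    for retrieved, relevant in zip(results, relevant_sets):
        for rank, slug in enumerate(retrieved, start=1):
            if slug in relevant:
                total += 1.0 / rank
                break
    return total / len(results)

# Three toy queries: first hit at rank 1, at rank 4, and no hit at all.
results = [
    ["a", "b", "c", "d"],
    ["e", "f", "g", "h"],
    ["i", "j", "k", "l"],
]
relevant_sets = [{"a"}, {"h"}, {"z"}]

mrr = mean_reciprocal_rank(results, relevant_sets)
print(f"MRR = {mrr:.3f}")
```

MRR rewards putting the first correct page near the top, while R@10 only asks whether relevant pages appear anywhere in the top ten, so the two columns can move independently.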

Winners by axis:

  • Best recall: Voyage embed + ZE rerank (53.7% R@10)
  • Fastest/cheapest: ZeroEntropy embed (122ms, 21s, $0.05/M tokens)

Key lesson: the reranker matters more than the embedder. ZeroEntropy's zerank-2 reshuffles 60% of top-1 results. gbrain made ZeroEntropy the default in v0.36.2 for exactly this reason — most users never need to think about provider choice.
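
One plausible reading of "reshuffles 60% of top-1 results" is that the top-ranked page differs before and after reranking for 60% of queries. Under that assumption, the statistic could be computed as below; the function name and the toy rankings are hypothetical and do not reproduce the benchmark data.

```python
def top1_change_rate(before, after):
    """Fraction of queries whose top-ranked result differs after reranking."""
    if len(before) != len(after):
        raise ValueError("need one ranking per query in both runs")
    if not before:
        return 0.0
    changed = sum(1 for b, a in zip(before, after) if b[:1] != a[:1])
    return changed / len(before)

# Toy rankings for four queries, before and after a reranker.
before = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"], ["j", "k", "l"]]
after = [["b", "a", "c"], ["d", "f", "e"], ["i", "g", "h"], ["j", "k", "l"]]

rate = top1_change_rate(before, after)
print(f"top-1 changed for {rate:.0%} of queries")
```

A high change rate alone does not show that the reranker helps; it needs to be read together with the R@10 and MRR gains in the table.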

What's Still Running

  • Cat 17 phase2 — full 7-cell shootout on the real 240-page BrainBench corpus (6 of 7 cells pending)
  • Larger-corpus runs — production brains are 100K+ pages; small-corpus receipts are solid, but scale validation awaits
  • Postgres-mode variants — some Cats hit PGLite-specific limits (no pg_trgm, WASM cold-start)

Key Takeaways

  1. Graph > vectors for relational retrieval. The 38-point P@5 gap is not incremental — it's categorical.
  2. Rerankers are load-bearing. They matter more than which embedder you choose.
  3. Think scores 4 points higher than Search on synthesis quality (5.60 vs 1.60 out of 10), and the gap is largest on the hardest relational questions.
  4. Zero regression across 20 releases means the retrieval contract is stable — feature work doesn't break accuracy.
  5. Local-first doesn't mean slow or expensive. In these runs, gbrain + ZeroEntropy was faster and cheaper than the OpenAI and Voyage configurations.