GBrain v0.40.6.0: A Personal Knowledge Brain (Benchmark Snapshot)
Source: gbrain-evals GitHub
Date Published: 2026-05-23
System Under Test: gbrain v0.40.6.0 (commit 677142a6)
Default Stack: ZeroEntropy embedder + reranker
TL;DR
GBrain is a local-first personal knowledge brain: drop in notes, meetings, emails, and ask questions in plain English. v0.40.6.0 achieves public SOTA on the LongMemEval memory benchmark (97.60% recall@5) and dominates relational retrieval with 49.1% P@5 — 38 points above pure vector RAG. The secret sauce is a graph layer built from [[wiki/...]] links (no LLM call) plus a ZeroEntropy reranker that reshuffles 60% of top-1 results. It's also the fastest and cheapest configuration measured: $0.05/M tokens, 21s ingest, 122ms query latency.
What It Is
GBrain runs on your laptop and ingests any text you feed it — notes, meetings, emails, tweets, books. It reads everything in the background. Later you ask questions in plain English and get answers with sources.
Two modes of asking:
- Search (`gbrain search`): returns ranked pages. Keyword search and vector search are fused via Reciprocal Rank Fusion (RRF), then re-scored by a reranker.
- Think (`gbrain think`): runs the same search, then composes a cited, synthesized answer with gap analysis (it tells you what it doesn't know) and cross-page synthesis (it answers questions no single page contains).
Headline Results
| Axis | gbrain v0.40.6.0 | Baseline |
|---|---|---|
| LongMemEval `_s` Recall@5 | 97.60% | MemPalace 96.6%; gbrain is public SOTA |
| BrainBench P@5 (relational) | 49.1% | Vector-only RAG 10.8% — +38pp |
| Synthesis quality (Think vs Search) | 5.60 vs 1.60 / 10 | +4.00 point lift |
| Reranker R@10 (best config) | 53.7% | Voyage embed + ZE rerank |
| Ingest cost (per M tokens) | $0.05 | 2.6× cheaper than OpenAI, 3.6× vs Voyage |
| Ingest speed (164-page corpus) | 21 seconds | 1.9× faster than OpenAI |
| Query latency (median) | 122 ms | 2.3× faster than OpenAI |
| Regression (20 releases) | 0.0 points | Byte-identical since v0.20.0 |
| Source isolation (federated brain) | 0 leaks | Provably clean across 4 surfaces |
Architecture
Search Pipeline
- Keyword search: verbatim matches for specific names and phrases.
- Vector search: numerical fingerprint (embedding) comparison for paraphrased queries.
- Hybrid fusion: Reciprocal Rank Fusion (RRF) blends both ranked lists into one.
- Reranker: ZeroEntropy `zerank-2` re-scores the top results for nuance.
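The hybrid fusion step can be illustrated with a minimal sketch of Reciprocal Rank Fusion. This is not gbrain's code: the function name, the page slugs, and the constant `k=60` (a value commonly used for RRF) are assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of page slugs into one list via RRF.

    Each page scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used constant,
    not a documented gbrain setting.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, slug in enumerate(ranking, start=1):
            scores[slug] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda slug: (-scores[slug], slug))


# Hypothetical result lists from the keyword and vector searches.
keyword_hits = ["people/alice-example", "companies/acme-ai", "deal/acme-series-a"]
vector_hits = ["companies/acme-ai", "meetings/2026-04-03-board", "people/alice-example"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

A page that ranks well in both lists (here `companies/acme-ai`) rises above a page that tops only one of them, which is the point of fusing before the reranker sees the candidates.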
Think Pipeline
Runs the same search, then an LLM synthesizes a written answer with:
- Citations anchored to specific page slugs.
- Gap analysis — explicitly states what the brain doesn't know.
- Cross-page synthesis — walks the graph to answer relational questions (e.g. "who works at companies fund X invested in?").
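A relational question such as "who works at companies fund X invested in?" amounts to a two-hop walk over typed edges. The sketch below shows the idea under stated assumptions: the edge list, the `invested_in` relation, and every slug in it are hypothetical, and gbrain's actual traversal is not described in the source.

```python
from collections import defaultdict

# Hypothetical typed edges as (source, relation, target); names are illustrative.
EDGES = [
    ("fund-x", "invested_in", "acme-ai"),
    ("fund-x", "invested_in", "beta-robotics"),
    ("alice-example", "works_at", "acme-ai"),
    ("bob-example", "works_at", "beta-robotics"),
    ("carol-example", "works_at", "gamma-labs"),
]


def build_indexes(edges):
    """Index edges in both directions so each hop is a dictionary lookup."""
    outgoing = defaultdict(set)  # (source, relation) -> targets
    incoming = defaultdict(set)  # (target, relation) -> sources
    for src, rel, dst in edges:
        outgoing[(src, rel)].add(dst)
        incoming[(dst, rel)].add(src)
    return outgoing, incoming


def people_at_portfolio_companies(fund, edges):
    """Hop 1: fund -> companies it invested in. Hop 2: companies -> employees."""
    outgoing, incoming = build_indexes(edges)
    people = set()
    for company in outgoing[(fund, "invested_in")]:
        people |= incoming[(company, "works_at")]
    return sorted(people)


print(people_at_portfolio_companies("fund-x", EDGES))
```

No single page holds the answer; it only appears once the two hops are joined, which is why plain ranked search struggles on this kind of question.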
The Graph Layer
When you write [[wiki/people/alice-example]], gbrain extracts a typed edge at write time — pure pattern matching, no LLM call:
alice-example ──works_at──> acme-ai
alice-example ──attended──> meetings/2026-04-03-board
acme-ai ──raised───> deal/acme-series-a
The graph is load-bearing. Vector search alone hits 10.8% P@5 on relational queries. With the graph: 49.1%. That's a 38-point gap — not a marginal improvement, but a different category of capability.
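As a rough illustration of write-time extraction by pattern matching, the sketch below pulls `[[wiki/...]]` links out of a note with a regular expression and assigns an edge type from a lookup table. The source does not say how gbrain chooses edge types, so the `EDGE_TYPES` table, the folder-based rule, the `links_to` fallback, and the `people/` and `companies/` slug prefixes are all assumptions.

```python
import re

# Matches [[wiki/<path>]] and captures the path after "wiki/".
WIKI_LINK = re.compile(r"\[\[wiki/([^\]]+)\]\]")

# Hypothetical rule table: (source folder, target folder) -> edge type.
EDGE_TYPES = {
    ("people", "companies"): "works_at",
    ("people", "meetings"): "attended",
    ("companies", "deal"): "raised",
}


def extract_edges(source_slug, text):
    """Return (source, relation, target) triples for each wiki link in text."""
    source_kind = source_slug.split("/", 1)[0]
    edges = []
    for match in WIKI_LINK.finditer(text):
        target = match.group(1).strip()
        target_kind = target.split("/", 1)[0]
        # Fall back to a generic relation when no rule applies (assumed default).
        relation = EDGE_TYPES.get((source_kind, target_kind), "links_to")
        edges.append((source_slug, relation, target))
    return edges


note = (
    "Alice joined [[wiki/companies/acme-ai]] and attended "
    "[[wiki/meetings/2026-04-03-board]]."
)
edges = extract_edges("people/alice-example", note)
for edge in edges:
    print(edge)
```

Because this is only string matching, it costs nothing per write beyond a regex scan, which is consistent with the "no LLM call" claim above.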
Detailed Benchmarks
LongMemEval: Memory Recall (Public SOTA)
The standard benchmark for AI memory systems (Wu et al., HuggingFace): 500 questions, each posed against roughly 50 conversation sessions.
| Metric | Value |
|---|---|
| Recall@5 (`_s` split) | 97.60% |
| MemPalace (competitor) | 96.6% |
| Cost | ~$0.50 per 1,000 queries (no LLM in retrieval loop) |
BrainBench: Relational Retrieval
In-house benchmark: 240 pages of fictional biographies, 145 relational queries.
| Configuration | P@5 | R@5 |
|---|---|---|
| gbrain (full hybrid + graph) | 49.1% | 97.9% |
| Without graph | 19.2% | 70.0% |
| Pure keyword (BM25) | 17.1% | 62.4% |
| Pure vector search | 10.8% | 40.7% |
Zero regression across 20 releases: v0.20.0 → v0.40.6.0 produces byte-identical scores.
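For readers unfamiliar with the two columns, here is a minimal sketch of how P@5 and R@5 are conventionally computed for a single query. The exact BrainBench scoring code is not shown in the source, so the convention of dividing precision by `k` and the toy data are assumptions.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k slots filled by a relevant page (divides by k)."""
    top_k = retrieved[:k]
    return sum(1 for slug in top_k if slug in relevant) / k


def recall_at_k(retrieved, relevant, k=5):
    """Fraction of all relevant pages that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


# Toy query with two relevant pages, both retrieved within the top 5.
retrieved = ["a", "x", "b", "y", "z"]
relevant = {"a", "b"}

print(precision_at_k(retrieved, relevant))  # 2 relevant hits in 5 slots
print(recall_at_k(retrieved, relevant))     # 2 of 2 relevant pages found
```

Under this convention, a query with only a few relevant pages caps P@5 well below 100% even when every relevant page is found, which is one plausible reading of a high R@5 sitting next to a much lower P@5.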
Synthesis Quality: Think vs Search (Cat 29, NEW)
Five questions judged 0–10 by Claude Haiku on accuracy, groundedness, and utility.
| Question | Search | Think | Δ |
|---|---|---|---|
| Who works at [company]? | 1/10 | 9/10 | +8 |
| Has ARR grown over time? | 0/10 | 9/10 | +9 |
| Which concepts do most companies link to? | 7/10 | 2/10 | −5 |
| Who attended the autonomous-picking meeting? | 0/10 | 8/10 | +8 |
| What is the current ARR in May 2026? | 0/10 | 0/10 | 0 |
| Metric | Value |
|---|---|
| Search mean | 1.60 / 10 |
| Think mean | 5.60 / 10 |
| Lift | +4.00 points |
Think dominates on relational questions requiring cross-page synthesis (+8 to +9). It loses on the pure aggregation question (−5), where raw search surfaces the right slugs by chance, and ties at 0/10 on the gap-analysis question, where the underlying data is stale.
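The summary numbers can be reproduced directly from the five per-question scores. The snippet below is just that arithmetic; the variable names are illustrative and the scores are copied from the table above.

```python
from statistics import mean

# (question, search score, think score), copied from the per-question table.
scores = [
    ("Who works at [company]?", 1, 9),
    ("Has ARR grown over time?", 0, 9),
    ("Which concepts do most companies link to?", 7, 2),
    ("Who attended the autonomous-picking meeting?", 0, 8),
    ("What is the current ARR in May 2026?", 0, 0),
]

search_mean = mean(search for _, search, _ in scores)
think_mean = mean(think for _, _, think in scores)
# Average the per-question deltas rather than subtracting the two means,
# which avoids a floating-point rounding artifact.
lift = mean(think - search for _, search, think in scores)

print(f"Search mean: {search_mean:.2f} / 10")
print(f"Think mean:  {think_mean:.2f} / 10")
print(f"Lift:        {lift:+.2f} points")
```

With only five questions, a single outlier such as the −5 aggregation result moves the mean by a full point, so the lift is best read as directional.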
Embedder/Reranker Shootout (Cat 18b, NEW)
Six cells: three embedders, with/without ZeroEntropy zerank-2 reranker.
| Cell | R@10 | MRR | Query Latency | Ingest | Cost |
|---|---|---|---|---|---|
| Voyage 1024d + ZE rerank | 53.7% | 0.571 | 256 ms | 24 s | $0.0031 |
| OpenAI 1536d + ZE rerank | 48.8% | 0.571 | 327 ms | 39 s | $0.0022 |
| ZeroEntropy 2560d (no rerank) | 12.2% | 0.238 | 122 ms | 21 s | $0.0009 |
Winners by axis:
- Best recall: Voyage embed + ZE rerank (53.7% R@10)
- Fastest/cheapest: ZeroEntropy embed (122ms, 21s, $0.05/M tokens)
Key lesson: the reranker matters more than the embedder. ZeroEntropy's zerank-2 reshuffles 60% of top-1 results. gbrain made ZeroEntropy the default in v0.36.2 for exactly this reason — most users never need to think about provider choice.
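Two of the quantities above can be sketched in a few lines: Mean Reciprocal Rank (the MRR column) and the share of queries whose top-1 result changes after reranking. How gbrain measures the "reshuffles 60%" figure is not specified, so `top1_change_rate` is one assumed reading of it, and the toy rankings are made up for illustration.

```python
def mean_reciprocal_rank(rankings, relevant_sets):
    """Average of 1/rank of the first relevant result per query (0 if none)."""
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_sets):
        for rank, slug in enumerate(ranking, start=1):
            if slug in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


def top1_change_rate(before, after):
    """Fraction of queries whose first result differs after reranking."""
    changed = sum(1 for b, a in zip(before, after) if b[0] != a[0])
    return changed / len(before)


# Toy data: two queries, ranked lists before and after a reranker.
before = [["a", "b", "c"], ["d", "e", "f"]]
after = [["b", "a", "c"], ["d", "e", "f"]]
relevant = [{"b"}, {"d"}]

print(mean_reciprocal_rank(before, relevant))  # first query's hit sits at rank 2
print(mean_reciprocal_rank(after, relevant))   # reranker moved it to rank 1
print(top1_change_rate(before, after))
```

In the toy data the reranker changes the top result for one of two queries and lifts MRR from 0.75 to 1.0, the same kind of movement the shootout attributes to `zerank-2`.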
What's Still Running
- Cat 17 phase2: full 7-cell shootout on the real 240-page BrainBench corpus (6 of 7 cells pending).
- Larger-corpus runs: production brains hold 100K+ pages; the small-corpus results are solid, but validation at scale is still pending.
- Postgres-mode variants: some Cats hit PGLite-specific limits (no `pg_trgm`, WASM cold-start).
Key Takeaways
- Graph > vectors for relational retrieval. The 38-point P@5 gap is not incremental — it's categorical.
- Rerankers are load-bearing. They matter more than which embedder you choose.
- Think scores 4 points higher than Search on synthesis quality (5.60 vs 1.60 out of 10), and the gap is largest on the hardest relational questions.
- Zero regression across 20 releases means the retrieval contract is stable — feature work doesn't break accuracy.
- Local-first doesn't mean slow or expensive. gbrain + ZeroEntropy is faster and cheaper than cloud-provider alternatives.