Essays

Which Environmental Factors Explain the Black–White IQ Gap?

Source: Aporia Magazine
Author: Noah Carl
Date Published: 2024-11-29


TL;DR

Noah Carl's piece critiques a PNAS paper by Kevin Lala and Marcus Feldman that equates the hereditarian hypothesis with racism. Carl argues that while Lala and Feldman dismiss hereditarianism as having "no scientific evidence," they do not provide a compelling environmental alternative — i.e., a specific, evidence-backed theory of which environmental factors explain racial IQ gaps. The article is a challenge to environmentalists to produce positive evidence, not just critique.

Accelerating Scientific Discovery with Co-Scientist

Source: Nature — Google Research, DeepMind, Stanford, et al.
Date Published: 2026-05-19
DOI: 10.1038/s41586-026-10644-y


TL;DR

Google's Co-Scientist is a multi-agent AI framework that scales test-time compute to continuously generate, critique, and refine novel scientific hypotheses. Validated in biomedical settings — including drug repurposing for acute myeloid leukemia (validated in vitro) and explaining mechanisms of antimicrobial resistance — it represents a concrete demonstration of AI accelerating the research pipeline rather than just summarising existing literature.

Google I/O 2026: The Agentic Gemini Era

Event: Google I/O 2026, May 19–20 | Shoreline Amphitheatre, Mountain View
Sources: Google Blog, TechCabal, The Verge


TL;DR

Google I/O 2026 was dominated by a single theme: AI shifts from answering questions to taking action. Key releases: Gemini 3.5 Flash (default model, half the cost), Gemini Omni Flash (any-input video generation), Gemini Spark (persistent 24/7 agent running on Google Cloud), Antigravity 2.0 (desktop agent orchestration), a complete Search re-architecture for the agentic era, and hardware announcements including TPU 8th-gen and Android XR glasses. Token usage hit 3.2 quadrillion/month — 7× year-over-year.

Linus Torvalds Says AI Bug Hunters Make Linux Security List "Almost Entirely Unmanageable"

Source: The Register — Simon Sharwood
Date: 2026-05-18


TL;DR

Multiple researchers running the same AI tools on the same codebase are flooding the private Linux kernel security mailing list with identical bug reports. Torvalds calls it "entirely pointless churn" that creates "unnecessary pain and pointless work." His solution: AI bug hunters must check for duplicates themselves, and should only submit if they've also created a patch that adds real value beyond what the AI detected.

Physicists Take the Imaginary Numbers Out of Quantum Mechanics

Source: Quanta Magazine
Author: Daniel Garisto
Date: November 7, 2025


The Core Debate: Is i Essential?

For a century, the imaginary number i (√-1) has been central to the Schrödinger equation. Schrödinger himself had hoped for an "entirely real version," describing the reliance on complex numbers as "a certain crudeness at the moment."

In 2021, a team led by Marc-Olivier Renou and Nicolas Gisin devised a three-party Bell test (Alice, Bob, Charlie) with two entanglement sources. When a group at USTC in Hefei ran the experiment, the observed correlations exceeded the ceiling for real-valued quantum theory — strongly suggesting complex numbers were empirically necessary.

The 2025 Counter-Revolution: Three Strikes

The new papers identify the 2021 team's critical flaw: their tensor product assumption (the rule for combining quantum states). The standard tensor product is natural for complex spaces but is a restrictive special case. By adopting a more general rule, real-valued theories can do anything complex ones can.

  1. The German Team (March 2025) — Michael Epping, Dagmar Bruß, Anton Trushechkin, Pedro Barrios Hita, Hermann Kampermann. Produced a real-valued QM exactly equivalent to the standard complex version.

  2. The French Team (April 2025) — Timothée Hoffreumon and Mischa Woods. Paper titled "Quantum theory does not need complex numbers," with a different tensor product yielding identical predictions.

  3. The Quantum Computing Proof (September 2025) — Craig Gidney (Google Quantum AI). Showed that all T gates (logic gates relying on complex-plane rotations) can be eliminated from any quantum algorithm, proving numerically that quantum computing doesn't require complex numbers.

The Ghost of i

While these new theories eliminate i, they don't eliminate the structure of complex arithmetic:

  • Real-valued formulations have existed since Ernst Stueckelberg (1960) but are notoriously cumbersome: for example, 2 particles (4 complex numbers) become 16 real numbers.
  • The new theories largely copy i's ability to rotate vectors.
  • Bill Wootters (Williams): "Even when you translate quantum theory into real numbers, you still see the hallmark of complex-number arithmetic."
  • Anton Trushechkin (HHU Düsseldorf): They "simulate complex numbers by means of real numbers."
  • Vlatko Vedral (Oxford): "You can write them down whichever way you like, but it's unavoidable that they have to multiply exactly as though they were complex numbers."
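
The idea of "simulating complex numbers by means of real numbers" can be illustrated with the standard textbook trick of writing a + bi as a 2×2 real matrix. The sketch below shows only that trick. It is not the construction used in any of the 2025 papers, and the helper names (`as_real_matrix`, `matmul2`) are made up for this example.

```python
def as_real_matrix(z: complex):
    """Represent a + bi as the real matrix [[a, -b], [b, a]]."""
    a, b = z.real, z.imag
    return ((a, -b), (b, a))


def matmul2(m, n):
    """Multiply two 2x2 matrices stored as nested tuples."""
    return tuple(
        tuple(sum(m[i][k] * n[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


# The image of i is a 90-degree rotation of the plane.
J = as_real_matrix(1j)

z, w = complex(1, 2), complex(3, -1)
# Multiplying the real matrices mirrors multiplying the complex numbers.
same_product = matmul2(as_real_matrix(z), as_real_matrix(w)) == as_real_matrix(z * w)
# J squared is minus the identity, just as i squared is -1.
squares_to_minus_one = matmul2(J, J) == as_real_matrix(complex(-1, 0))
```

Every entry here is a real number, yet the matrices multiply exactly as complex numbers do, which is the "hallmark" the quoted physicists point to.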

Why Is the Complex Formulation So Much Simpler?

  • Chao-Yang Lu (USTC): "Complex quantum theory, with its natural tensor product, remains far more concise, elegant and mathematically straightforward."
  • Jill North (Rutgers philosopher): "Even if complex numbers aren't truly necessary, they do give rise to a formulation that seems particularly well suited to quantum mechanics."
  • Vedral: "We really don't have a single alternative to how quantum mechanics was already done 100 years ago. And the question is, why? Why can't we go beyond this?"

Key Takeaways

  • The 2021 claim that i is empirically necessary has been overturned by 2025 work.
  • Real-valued QM is exactly equivalent to standard QM but significantly more cumbersome to write down.
  • The "hallmark" of complex arithmetic (rotation) persists in these real-valued formulations.
  • The search continues for a truly novel, simpler reformulation — and for a deeper understanding of why complex numbers fit quantum mechanics so naturally.

Project Glasswing: What Mythos Showed Us

Source: Cloudflare Blog
Author: Grant Bourzikas
Date: May 18, 2026


What Changed with Mythos Preview

Cloudflare tested Anthropic's Mythos Preview (via Project Glasswing) against 50+ of its own repositories. The core finding: Mythos is not just a better vulnerability scanner, but a system capable of reasoning like a senior security researcher.

Two standout capabilities:

  • Exploit Chain Construction: Combines multiple low-severity primitives (e.g., use-after-free → arbitrary read/write → ROP chain) into a working multi-step exploit. Low-severity bugs that would traditionally sit invisible in a backlog become actionable.
  • Proof Generation: Writes code to trigger suspected bugs, compiles and runs it in a scratch environment, iterating on failures autonomously. "A suspected flaw without a working proof is speculation, and Mythos Preview closes that gap on its own."

Model Refusals: Inconsistent Guardrails

The Glasswing version lacked the safety locks of generally available models (e.g., Opus 4.7), but displayed "organic" guardrails that were highly inconsistent. Semantically equivalent tasks produced opposite outcomes depending on framing and timing. Conclusion: Organic refusals cannot serve as a complete safety boundary.

The Signal-to-Noise Problem

  • Language matters: C/C++ projects produced consistently more false positives than memory-safe languages like Rust.
  • Model bias: "Ask a model to find bugs, and it will find them, whether the code has any or not." Hedged findings ("possibly," "could in theory") vastly outnumber solid ones — but Mythos's PoC generation dramatically improves triage.

Why Generic Coding Agents Fail

Problem | Detail
Context | A single agent session against a 100k LOC repo covers ~0.1% of the surface before context compaction discards earlier findings.
Throughput | Security research requires narrow, parallel hypotheses. Generic coding agents are tuned for single-stream feature work.

Conclusion: The harness around the model matters far more than raw model capability.

4 Core Lessons for a Security Harness

  1. Narrow scope produces better findings — specific function + trust boundaries + architecture doc >> "find vulnerabilities in this repository."
  2. Adversarial review reduces noise — a second agent prompted to disprove the original finding catches far more noise than asking the hunter to check its own work. "Putting two agents in deliberate disagreement is way more effective than just telling one agent to be careful."
  3. Split the chain across agents — ask "Is this buggy?" and "Is this reachable from an attacker?" as separate questions.
  4. Parallel narrow tasks beat one exhaustive agent — many concurrent agents, then deduplicate afterward.

Cloudflare's Vulnerability Discovery Harness

Stage | What It Does
Recon | Reads repo top-down, fans out to subagents per subsystem. Produces architecture doc (build commands, trust boundaries, entry points, attack surface).
Hunt | ~50 concurrent agents, each with one attack class + scope hint. Compiles and runs PoCs in per-task scratch directories.
Validate | Independent agent re-reads code and tries to disprove the original finding. Different prompt, no ability to emit new findings.
Report | Deduplicates surviving findings, writes advisory with PoC, CVSS score, and recommended fix.
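
The Hunt, Validate, and Report stages can be sketched as a small pipeline. This is a minimal illustration of the shape described above, not Cloudflare's code: `HuntTask`, `Finding`, `run_harness`, and the `hunt` and `disprove` callables are hypothetical stand-ins for real agent calls, and the Recon stage is left out.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class HuntTask:
    attack_class: str  # one attack class per hunter
    scope_hint: str    # narrow scope, e.g. a single function


@dataclass(frozen=True)
class Finding:
    location: str
    attack_class: str
    poc_ran: bool      # did a proof of concept actually trigger the bug?


def run_harness(
    tasks: List[HuntTask],
    hunt: Callable[[HuntTask], List[Finding]],
    disprove: Callable[[Finding], bool],
    max_workers: int = 50,
) -> List[Finding]:
    # Hunt: many narrow tasks run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        batches = list(pool.map(hunt, tasks))
    candidates = [f for batch in batches for f in batch]
    # Validate: a separate reviewer can only reject findings, never add new ones.
    survivors = [f for f in candidates if not disprove(f)]
    # Report: deduplicate while keeping first-seen order.
    return list(dict.fromkeys(survivors))
```

Keeping `disprove` as a separate callable mirrors the adversarial-review lesson: the reviewer has a different prompt and no way to emit findings of its own.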

The Industry Picture

Cloudflare also tested Codex CLI, Copilot Agent Mode, Gemini Code Assist, and various fine-tuned models. None approached Mythos Preview's exploit-chain capability. For proactive security, frontier models are now viable but demand a proper harness.

Theodore Dalrymple, Truth-Teller

Source: City Journal
Author: Rob Henderson (foreword to the 25th-anniversary edition of Life at the Bottom)
Date: May 8, 2026


Dalrymple's Central Thesis

Theodore Dalrymple worked as a doctor in British prisons and inner-city hospitals. He saw a poverty not just of money but of meaning, responsibility, and hope. His core argument: the underclass is shaped by ideas from elite intellectuals — mockery of family, self-restraint, and police, alongside celebration of "liberation." Welfare incentives alone don't explain the squalor; you need the ideological scaffolding peddled by intellectuals.

The "Luxury Belief" Class

Rob Henderson's signature concept:

  • Definition: Views that confer social status on the affluent at little cost to them but inflict real damage on the poor (e.g., denouncing marriage, effort, police).
  • Reverse Hypocrisy (JFK vs. Modern Elites):
      • JFK: Flawed in private (unfaithful, absent father) but preached public virtue.
      • Modern Elites: Live stable, disciplined private lives (marriage, hard work, family) but publicly dismiss these values as boring or oppressive.
  • Mechanism: The rich kid experiments with drugs and is fine. The poor kid hits meth and self-destructs. Both hear elite culture say "judge nothing."

Nonjudgmentalism's Toll

Refusing to say some actions are better than others destroys the poor who lack structure:

  • A woman dismissed advice to leave an abusive boyfriend as "sexist," returned, and was beaten again.
  • Academic criminologists declare criminals "addicted" to crime; inmates immediately adopt the excuse.
  • The pattern: deny personal choice, blame systemic forces, equate judgment with oppression.

The Behavioral Gap

Norms used to flatten the behavioral divide between rich and poor (marriage, work, lawfulness). As elites became insular and stopped modeling/enforcing norms, the gap widened massively.

"The choice is never between having an elite or not. It is between having an elite that accepts responsibility and provides leadership and an elite that does neither."

Key Anecdotes

  • Tyler (San Quentin): Friend from Henderson's past. Quit a job because he "didn't feel like it," crashed his motorcycle drunk, sentenced to 18 months. Upper-middle class excuses the choice as understandable — but studying for a Ph.D. or working 80-hour weeks "isn't fun" either.
  • Tesco Shoplifting (England): Two native-born boys stuffing pockets; white cashier bored. South Asian immigrant security guard intervenes. Boys shout "racism" and leave. Immigrants still believe work matters.
  • Cambridge Double Standard: A fellow doctoral student says publicly of a poor kid skipping class — "maybe it's good he didn't go" — but privately forces her own son to attend. "Our elites have isolated themselves from the world I grew up in, while paying lip service to inequality."
  • Doctors from Mumbai and Manila: Arrive brimming with sympathy for the British welfare state and the poor. Over time, they are shocked by the ingratitude and absence of basic decency from patients.

The Imperative

  • Elites must publicly preach the discipline that governs their private lives. Share values (marriage, family, responsibility) equally with wealth.
  • A young person from a deprived background should be held to higher standards, not lower.
  • The luxury belief class "walks the Fifties and talks the Sixties" — enjoying the warm glow of liberation while those at the bottom pay the price.

To Have Machines Make Math Proofs, Turn Them Into a Puzzle

Source: Quanta Magazine
Interview with: Marijn Heule (Carnegie Mellon University)
Date: November 10, 2025


Core Idea

Marijn Heule uses SAT (Satisfiability) — a symbolic AI technique that turns math problems into giant binary constraint puzzles (think Sudoku with millions of cells) — to solve long-standing open problems in pure mathematics. His track record includes the Empty Hexagon, Schur Number 5, and Keller's Conjecture (dimension 7), problems that resisted proof for 90+ years.

His vision is a three-part pipeline that could produce the first mathematical proof ever discovered by AI that humans cannot verify independently:

  1. LLM — Carves a big mathematical statement into smaller, plausible lemmas (high-level "big picture" work).
  2. SAT Solver — Proves or refutes each lemma, returning minimal counterexamples that act like a human learning from failure.
  3. Lean — A formal proof checker that glues all certified pieces together into a watertight whole.
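
To make step 2 concrete, the sketch below encodes a toy Schur-style question as Boolean clauses and searches for a satisfying assignment: can the integers 1..n be split into two colours so that no same-coloured a, b, c satisfy a + b = c? It is a brute-force stand-in for a real SAT solver, not Heule's encoding, and the function names are invented for this example.

```python
from itertools import product


def schur_clauses(n):
    """CNF clauses; variable k is True when integer k is red, False when blue."""
    clauses = []
    for a in range(1, n + 1):
        for b in range(a, n + 1):
            c = a + b
            if c <= n:
                clauses.append([-a, -b, -c])  # a, b, c are not all red
                clauses.append([a, b, c])     # a, b, c are not all blue
    return clauses


def brute_force_sat(num_vars, clauses):
    """Try every assignment; return the first that satisfies all clauses, else None."""
    for bits in product([False, True], repeat=num_vars):
        if all(
            any(bits[abs(lit) - 1] == (lit > 0) for lit in clause)
            for clause in clauses
        ):
            return bits
    return None


colouring_4 = brute_force_sat(4, schur_clauses(4))  # a valid colouring exists
colouring_5 = brute_force_sat(5, schur_clauses(5))  # None: no colouring exists
```

With this encoding the search finds a colouring for n = 4 and exhausts all 32 assignments for n = 5 without success. Real solvers prune this search instead of enumerating it, which is what makes instances with millions of variables feasible.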

"A SAT tool is not computing with zeros and ones. Instead, it is searching for a combination of them that satisfies all the constraints."


The "Understanding vs. Trust" Debate

The philosophical heart of the piece. Timothy Gowers (Fields Medalist) called Heule's Pythagorean triples proof "the most disgusting proof ever" because it offered no human-comprehensible insight.

Heule's counter: Understanding in mathematics is highly overrated. No single mathematician understands all of math — we rely on chains of trust. Automated reasoning can produce proofs more trustworthy than most pen-and-paper proofs.

"LLMs can do all of their bullshitting, but as soon as automated reasoning is able to say, 'OK, but this part is actually correct, and here's a proof,' this is actually more trustworthy than most of the pen-and-paper proofs out there."


AI as Co-Author, Not Replacement

Heule emphasizes humans remain essential — his successes came from collaborating with mathematicians who spent years developing conceptual insights, which he then encoded for the SAT solver. The future is LLMs helping more mathematicians learn to encode problems, not removing humans from the loop.

"The creative intuition, the conceptual reframing, that's still something people are uniquely good at. The magic comes from the collaboration."


Key Takeaways

  • SAT ≠ Neural Networks: SAT is symbolic GOFAI, built on hard-coded logical rules, not pattern matching. It searches rather than computes.
  • Minimal counterexamples: SAT solvers return small, interpretable refutations, providing insight into why a conjecture fails.
  • The bottleneck: Encoding a math problem for SAT is currently an expert skill. Heule wants LLMs to automate this, opening the pipeline to more mathematicians.
  • Trust > Understanding: Heule provocatively flips the traditional mathematical value system, putting certified correctness ahead of human-comprehensible narrative.
  • Future goal: The first AI-discovered proof of a problem that no human can independently verify.

Why Birth Rates Are Falling Everywhere All at Once

Source: Financial Times — John Burn-Murdoch (Chief Data Reporter, The Big Read)
Date: 2026-05-16


TL;DR

In over two-thirds of the world's 195 countries, fertility is now below replacement rate. The primary driver has shifted: it's no longer that couples have fewer children — it's that fewer couples are forming at all. Housing costs explain up to half the decline in the US/UK, while smartphones and social media are the global accelerators, with birth rates plunging in country after country immediately following 4G rollout. The result is a K-shaped fertility collapse hitting the least educated hardest, with profound economic and political consequences.

Rethinking Agentic Search with Pi-Serini: Is Lexical Retrieval Sufficient?

Source: arXiv:2605.10848
Authors: Tz-Huan Hsu (UWaterloo), Jheng-Hong Yang (Stencilzeit), Jimmy Lin (UWaterloo)
Date: May 2026


Core Research Question

"Does a lexical retriever suffice as LLMs become more capable in an agentic loop?"

The paper argues that prior BM25 baselines scored low because of poor parameter configuration and shallow retrieval depth, not because lexical retrieval is fundamentally inadequate.

Answer: Yes. A well-configured lexical retriever (BM25 with k1=25, b=1, depth=1000) matches or beats dense-retriever baselines when paired with a capable LLM through a proper tool interface.


Key Results

System | Accuracy | Surfaced Recall | Total Cost
Pi-Serini (gpt-5.5) | 83.1% | 94.7% | $291.6
Prior dense baseline (gpt-5 + qwen3) | 73.0% | 79.0% | $360.7
Pi-Serini (deepseek-v4-flash) | 68.1% | 94.5% | $28.9
Pi-Serini (claude-opus-4.7) | 69.8% |  | $246.6

Tuning impact (default BM25 → tuned BM25): Accuracy +18.0 percentage points, surfaced recall +11.1 percentage points
Depth impact (k=5 → k=1000): Surfaced recall +25.3 percentage points
Cost reduction vs. dense baselines: 3.3×–10×


The Pi-Serini System

Pi-Serini is a deliberately minimal search agent that isolates the agent–retriever interaction. It has three components:

1. Retrieval Controller

The main abstraction layer between the LLM agent and the Anserini BM25 backend. It manages:

  • A cached, session-local ranking (up to 32 search IDs)
  • Paginated access to results
  • Spill files for large outputs

2. Tool Interface (Three Distinct Tools)

Decouples retrieval from context management:

  • search(query) — Issues a BM25 query, retrieves up to 1000 documents, caches the ranking, exposes only top 5 excerpts
  • read_search_results(search_id, offset, limit) — Browses the cached ranking without a new backend query
  • read_document(docid, offset, limit) — Reads a document in line-based chunks
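
The split between `search` and `read_search_results` can be sketched as follows. This is a minimal illustration assuming a hypothetical `backend(query, depth)` callable that returns a ranked list of document IDs; it is not the Pi-Serini implementation, and it omits `read_document`, the 32-ID cache limit, and spill files.

```python
from typing import Callable, Dict, List


class RetrievalController:
    PREVIEW = 5  # number of results exposed directly by search()

    def __init__(self, backend: Callable[[str, int], List[str]], depth: int = 1000):
        self._backend = backend          # hypothetical BM25 backend
        self._depth = depth              # how deep each ranking goes
        self._rankings: Dict[str, List[str]] = {}

    def search(self, query: str) -> dict:
        """Query the backend once, cache the full ranking, show only the top few."""
        ranking = self._backend(query, self._depth)
        search_id = f"s{len(self._rankings) + 1}"
        self._rankings[search_id] = ranking
        return {
            "search_id": search_id,
            "total": len(ranking),
            "top": ranking[: self.PREVIEW],
        }

    def read_search_results(self, search_id: str, offset: int, limit: int) -> List[str]:
        """Page through a cached ranking without issuing a new backend query."""
        return self._rankings[search_id][offset : offset + limit]
```

The point of the design is that deep retrieval costs the agent no context until it chooses to page further into a ranking it already holds.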

3. Time-Budget Steering (Instead of Fixed Iterations)

  • Hard timeout: 300 seconds per query
  • Submit steer at 0.7T (210 seconds into the 300-second budget): Injects a message forcing the agent to stop using tools and answer immediately. All three tools are blocked afterwards.
  • Prevents runaway loops — prior work used ~74 tool calls/query; Pi-Serini uses ~15–24
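
The time-budget rule reduces to a three-way check on elapsed time. The sketch below assumes the steer fires once elapsed time reaches 0.7 of the timeout; the function name and the phase labels are hypothetical.

```python
def budget_phase(
    elapsed_s: float,
    timeout_s: float = 300.0,
    steer_fraction: float = 0.7,
) -> str:
    """Classify where a query is within its time budget."""
    if elapsed_s >= timeout_s:
        return "timeout"        # hard stop
    if elapsed_s >= steer_fraction * timeout_s:
        return "answer_only"    # tools blocked, agent must submit
    return "tools_allowed"      # normal search/read loop
```

An agent loop would call `budget_phase` before each tool call and refuse the call unless the result is `"tools_allowed"`.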

Trajectory Logging (Four Document Sets)

To understand agent behavior, Pi-Serini tracks:

  • D_surfaced: returned by search
  • D_previewed: excerpts shown via read_search_results
  • D_opened: full document read via read_document
  • D_cited: used in final answer
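
One way to use the four sets is to measure how much relevant evidence survives each stage. The sketch below assumes "recall" means the fraction of a query's gold documents present in a set; that definition, the function name, and the document IDs are illustrative assumptions, not taken from the paper.

```python
def stage_recall(gold: set, stages: dict) -> dict:
    """Fraction of gold documents found in each logged document set."""
    return {name: len(gold & docs) / len(gold) for name, docs in stages.items()}


# Made-up trajectory for a single query.
gold_docs = {"d1", "d2", "d3", "d4"}
trajectory = {
    "surfaced": {"d1", "d2", "d3", "x9"},
    "previewed": {"d1", "d2", "x9"},
    "opened": {"d1"},
    "cited": {"d1"},
}
recalls = stage_recall(gold_docs, trajectory)
```

A large drop between the surfaced and previewed numbers would indicate that the retriever found the evidence but the agent never looked at it.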


BM25 Tuning: The Critical Finding

The paper's central insight is that default BM25 parameters are optimized for passage retrieval (~100 words), not long documents (BrowseComp-Plus has median ~2k tokens, 90th percentile ~14k tokens).

Parameter | Default (Anserini) | Tuned (Pi-Serini)
k1 | 0.9 | 25
b | 0.4 | 1.0

A grid search over k1=[0–32] and b=[0–1] shows Anserini's default (k1=0.9, b=0.4) sits in a low-performing region. The tuned parameters adapt term-frequency saturation (high k1) and length normalization (b=1.0) for long documents.

Result on a 100-query subset: accuracy jumps from 64.0% (default) to 82.0% (tuned).
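
The role of k1 and b is easiest to see in the term-frequency part of BM25. The sketch below uses the textbook Okapi form of that component; Lucene and Anserini may differ in constant factors, and the function name is hypothetical.

```python
def bm25_tf_weight(tf: float, doc_len: float, avg_doc_len: float, k1: float, b: float) -> float:
    """Textbook BM25 term-frequency component (without the IDF factor)."""
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


def gain_from_repetition(k1: float, b: float) -> float:
    """How much more a term counts at 20 occurrences than at 1, for an average-length doc."""
    return bm25_tf_weight(20, 100, 100, k1, b) / bm25_tf_weight(1, 100, 100, k1, b)


default_gain = gain_from_repetition(k1=0.9, b=0.4)  # saturates quickly
tuned_gain = gain_from_repetition(k1=25.0, b=1.0)   # keeps rewarding repeated terms
```

With a small k1 the weight saturates after a few occurrences, which suits short passages. A large k1 lets repeated terms keep counting in long documents, while b = 1.0 applies full length normalization so that long documents are not favoured merely for being long.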


Retrieval Depth Matters

k (depth) | Surfaced Recall
5 | 70.5%
50 | ~89%
1000 | 95.8%

Previewed recall saturates at k=50 (~74.7%). Deeper rankings offer more opportunity, but agents don't automatically inspect them. The bottleneck shifts from "Can the retriever find it?" to "Can the agent recognize and spend context on evidence it already has?"


Failure Mode Analysis: Premature Branch Commitment

A key behavioral difference between models with similar costs:

  • GPT-5.5 keeps candidate-specific probes reversible. If a probe fails (e.g., "Warrington", "Vinegar Strokes"), it returns to original clues (town population, spelling history).
  • Claude Opus 4.7 tends to commit to a branch early and doesn't revisit alternatives even when initial probes are unproductive.

This suggests agent architecture matters as much as retriever choice.


Cost Efficiency

Pi-Serini achieves massive cost reduction through two mechanisms:

  1. No dense retriever overhead: BM25 is cheap to query.
  2. Prefix-cache-friendly loop: the same system prompt and repeated context mean 82–90% of total tokens are served from cache at a discounted cache-read price (10% to 20% of the input price in the table below).

Model | Input price | Output price | Cache read
deepseek-v4-flash | $0.14 | $0.28 | $0.028
gpt-5.5 | $5.00 | $30.00 | $0.50
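
The effect of prefix caching on input cost can be worked through with the prices in the table above. The sketch below assumes input tokens are simply split into a cached and an uncached share, and it expresses the result in the same per-token unit as the table; the function name is hypothetical, and the cached fractions are the two ends of the 82–90% range quoted above.

```python
def blended_input_price(input_price: float, cache_read_price: float, cached_fraction: float) -> float:
    """Average input price when a fraction of input tokens is served from cache."""
    return cached_fraction * cache_read_price + (1.0 - cached_fraction) * input_price


# gpt-5.5 prices from the table: input $5.00, cache read $0.50.
low_cache = blended_input_price(5.00, 0.50, 0.82)   # about 1.31
high_cache = blended_input_price(5.00, 0.50, 0.90)  # about 0.95
```

Under these assumptions the effective input price for gpt-5.5 falls from $5.00 to roughly $0.95–$1.31, before output tokens are counted.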

Implications

  1. Lexical retrieval is not dead for agentic search. A tuned BM25 with sufficient depth is competitive with dense retrievers at a fraction of the cost.
  2. The bottleneck has moved up the stack. The agent's ability to use surfaced evidence — not the retriever's ability to find it — is now the primary constraint.
  3. Default configurations are misleading. Papers comparing against BM25 should confirm they're using appropriate (not default) parameters for their document domain.
  4. Time budgets beat iteration caps. Steering with a hard timeout prevents runaway tool calls and reduces variance.

Paper Details

  • Dataset: BrowseComp-Plus — 830 queries, 100,195 long documents (avg. 5,179 words/doc; 6.1 evidence docs, 2.9 gold docs per query)
  • Subjects: Information Retrieval (cs.IR), Artificial Intelligence (cs.AI), Computation and Language (cs.CL)
  • Code: github.com/justram/pi-serini
  • Pages: 15 pages, 4 figures