Car Keys
Thursday morning. You slept in. You have a coffee meeting across town, thirty minutes away. You have thirty-six minutes and thirty seconds. Just enough time.
T-36:30 You pull on your quarter-zip and reach for the keys by the coffee maker. Not there.
T-35:45 Jacket pockets. All of them. Nothing.
T-34:20 Couch cushions. On your knees, hands jammed into crevices. Nothing.
T-31:15 You’re upstairs now, rifling through yesterday’s pants. Nothing.
T-28:40 Bathroom sink. Junk drawer. The bowl by the door that exists specifically for this purpose. Nothing.
T-24:00 You text your contact: “Running a few minutes late.” Lie.
T-19:30 Second pass through the kitchen. More frantic now. You’re opening the refrigerator. Why? You don’t know. Keys could be anywhere. Keys could be nowhere.
T-12:45 You’re pulling dirty laundry out of the hamper, checking pants you haven’t worn in a week. This is peak mammalian cognition. Ten thousand years of civilization, and you’re pawing through yesterday’s underwear like a half-drunk trash panda.
T-8:20 You text: “So sorry, something came up, can we reschedule?” Your contact goes cold. Meeting’s dead. Career implications: nonzero. Mental state? Self-flagellation.
T-6:00 Your spouse wanders through, coffee in hand, half-watching the meltdown.
“Didn’t Jake borrow the car this morning?”
Oh.
The keys aren’t lost. They’re not misplaced. They’re in your son’s pocket, three miles away. You could ransack every drawer in the house until the heat death of the universe and you would never find them, because they were never here.
I’ve been building an argument across several essays (here, here, here, and here): transformers don’t reason, they perform reasoning, having learned the statistical shadow of logic without its structure. The same failure mode is being trained into humans by algorithmic feeds. I proposed Cooperative Adversarial Decisioning as an alternative architecture: multiple cognitive modalities (generation, deduction, simulation) competing under resource constraints. Then I pivoted, having realized that an apex coordinator is the wrong answer, and that goal-space geometry might solve the infinite regress problem. Then I stumbled into RNFs (Riemann Neural Fields) and the whole thing got a lot more plausible.
Now you’re up to speed.
Two recent papers from the cutting edge illuminate what’s still missing.
I’m going to talk about intuition, instrumental agency, and deduction. Buckle up.
Intuition is pattern-matching. “Where do keys usually turn up?” Fast, cheap, and wrong in exactly the ways your training data was incomplete.
Instrumental agency is doing stuff in the world to get feedback. Open the drawer, observe no keys, update, try another drawer. It works eventually, but you’re using the territory as the map. Every failed search is compute burned, time lost, a meeting that dies at T-8:20. Maps exist because they’re cheaper than visiting.
(As an aside, I haven’t thought this through carefully enough to call it a stance yet, but I’m convinced that AGI / ASI will require a commensurate level of instrumental agency before we get there. RL gyms won’t be enough.)
Deduction is structure-preserving inference. “If the keys were here, I would have found them. I didn’t find them. Therefore: probably not here. What would make them not-here?” It questions premises. It notices when the search space itself is wrong.
The key distinction is not symbolic formality but counterfactual compression. Deduction eliminates entire regions of the search space without visiting them, based on constraints implied by failure. It is the only cognitive move that gets cheaper as problems get larger.
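If you want the asymmetry in code, here’s a toy sketch. The house layout and helper names are mine, invented purely for illustration; they come from nowhere but the story above.

```python
# Toy illustration of counterfactual compression. The house layout and helper
# names are invented for this example; nothing here is from the papers below.

HOUSE = {
    "kitchen": ["counter", "junk drawer", "fridge"],
    "living room": ["couch cushions", "key bowl", "side table"],
    "bedroom": ["yesterday's pants", "nightstand", "hamper"],
}

def exhaustive_search() -> list[str]:
    """Intuition plus instrumental agency: the cost is one visit per spot."""
    return [f"{room}/{spot}" for room, spots in HOUSE.items() for spot in spots]

def deduce_from_failure(rooms_searched_thoroughly: set[str]) -> str:
    """Deduction: a thorough-but-failed search eliminates a whole room,
    and if every room is eliminated, the premise itself is the suspect."""
    remaining = set(HOUSE) - rooms_searched_thoroughly
    if not remaining:
        return "Question the premise: the keys were probably never in the house."
    return f"Skip everything except: {sorted(remaining)}"

print(len(exhaustive_search()), "spots to check the hard way")
print(deduce_from_failure({"kitchen", "living room", "bedroom"}))
```

The exhaustive search pays per spot visited. The deduction never visits anything, and its conclusion only gets stronger as more of the search space has already failed.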
You spent thirty minutes grepping the house. Your spouse spent three seconds reasoning about the premise.
The Cutting Edge
The AI space race has a new darling: Recursive Language Models, or RLMs. Zhang and Khattab at MIT dropped this work recently, and it’s genuinely interesting. The results are impressive. The architecture is clever. And it perfectly illustrates everything I’ve been arguing about what’s missing in frontier AI.
The problem they’re solving: Large language models have a “context rot” issue. As the input gets longer, performance degrades. The model gets dumber as you give it more to work with. Anyone who’s watched Claude Code lose the plot after a long session knows this feeling.
RLMs aren’t fancy RAG. They give the model instrumental agency.
The architecture works like this. Your massive context (documents, datasets, codebases) lives in a Python variable. The model never sees it directly. Instead, it gets a REPL environment and the ability to call itself recursively. It can peek at chunks, grep for patterns, write code to filter and aggregate, spawn sub-calls to answer questions about pieces. The transformer decides how to decompose the problem. Python executes the logic.
This is qualitatively different from “retrieve then generate.” The model isn’t just consuming information. It’s acting in an environment, observing results, deciding what to do next, acting again. It chooses its own decomposition strategy (adaptive exploration, not predefined chunking). It can change tactics when something isn’t working.
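Here’s a minimal sketch of that loop as I read the paper. The `call_model` helper and the action format are my stand-ins, not the authors’ code; the point is the shape of the loop.

```python
# Minimal sketch of an RLM-style loop, reconstructed from the paper's
# description. `call_model` and the action format are hypothetical stand-ins;
# the point is the shape: the corpus lives in a variable the model never sees
# whole, and the model acts on it through a REPL, observing results as it goes.

from dataclasses import dataclass, field

@dataclass
class ReplEnv:
    corpus: list[str]                      # the huge context, held in Python
    notes: list[str] = field(default_factory=list)

def call_model(prompt: str) -> dict:
    """Hypothetical LLM call. Returns an action such as
    {'op': 'grep', 'arg': 'invoice'} or {'op': 'answer', 'arg': '...'}."""
    raise NotImplementedError

def run_action(action: dict, env: ReplEnv) -> str:
    # Python executes the logic the model asked for.
    if action["op"] == "grep":
        hits = [line for doc in env.corpus for line in doc.splitlines()
                if action["arg"] in line]
        return f"grep {action['arg']!r}: {len(hits)} hits, first few: {hits[:3]}"
    if action["op"] == "peek":
        return env.corpus[int(action["arg"])][:500]
    return "unknown op"

def rlm_answer(question: str, corpus: list[str], max_steps: int = 8) -> str:
    env = ReplEnv(corpus)
    for _ in range(max_steps):
        # Intuit the next move from the question plus observations so far.
        action = call_model(f"Q: {question}\nNotes: {env.notes}\nNext action?")
        if action["op"] == "answer":              # the model decides it is done
            return action["arg"]
        env.notes.append(run_action(action, env))  # act, observe, go again
    return call_model(f"Q: {question}\nNotes: {env.notes}\nFinal answer?")["arg"]
```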
The results are striking. On long-context benchmarks, RLM(GPT-5-mini) outperforms GPT-5 by more than double, while being cheaper per query. On a retrieval task with 10M+ tokens, RLMs maintain near-perfect accuracy where base models collapse entirely.
This is real progress. Like, actual, non-bullshit, ‘someone did something clever’ progress. I know. I’m as surprised as you are.
Agency improves cognition. Giving a model the ability to act, observe, and iterate genuinely expands what it can do. The authors are rightfully excited. They’ve built something that solves problems pure transformer inference cannot.
But… slow clap… it will never produce deduction.
(Incidentally, isn’t giving ‘sudo’ to the machine a little… I dunno… Terminator I?)
The Other Cutting Edge
DeepMind is attacking the same problem from a different angle.
Mixture-of-Recursions (MoR) doesn’t give the model agency. It changes the architecture itself. Instead of stacking dozens of unique layers, MoR reuses a smaller set of layers recursively (same weights, applied multiple times). A lightweight router examines each token and decides: does this need more passes, or can it exit early?
The intuition is appealing. Not every token deserves the same compute. “The” doesn’t need 24 layers of processing. A complex technical term might. MoR lets the model flap harder on the tricky parts and coast on the easy ones.
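Here’s a toy version of the mechanism. The shapes and the router rule are made up, and the real design routes before computing so that early-exit tokens actually save compute; this version only shows the control flow.

```python
# Toy Mixture-of-Recursions: one shared block applied repeatedly, with a tiny
# router deciding per token whether to take another pass or exit early.
# Illustrative only: not DeepMind's implementation.

import torch
import torch.nn as nn

class ToyMoR(nn.Module):
    def __init__(self, d_model: int = 64, max_recursions: int = 4):
        super().__init__()
        self.shared_block = nn.TransformerEncoderLayer(
            d_model, nhead=4, batch_first=True)  # same weights, reused each pass
        self.router = nn.Linear(d_model, 1)      # "does this token need more?"
        self.max_recursions = max_recursions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); every token starts "active"
        active = torch.ones(x.shape[:2], dtype=torch.bool, device=x.device)
        for _ in range(self.max_recursions):
            if not active.any():
                break
            refined = self.shared_block(x)
            # Only still-active tokens take the extra pass; the rest keep their state.
            x = torch.where(active.unsqueeze(-1), refined, x)
            keep_going = torch.sigmoid(self.router(x)).squeeze(-1) > 0.5
            active = active & keep_going          # easy tokens exit early
        return x

out = ToyMoR()(torch.randn(2, 16, 64))  # hard tokens may get 4 passes, easy ones fewer
```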
The results are pretty dang impressive. MoR matches the performance of much larger models with a fraction of the parameters. Memory usage drops. Inference speeds up. Some enthusiasts are calling it a “Transformer Killer.” I’d call it Moore’s Law for inference: you get there cheaper, but “there” hasn’t moved.
MoR is adaptive pattern-matching. The router decides which tokens are “hard” and routes them through more iterations of the same pattern-completion machinery. Complex tokens get deeper intuition. Simple tokens get shallower intuition. Still intuition all the way down.
MoR doesn’t question premises, or notice when the search space is wrong. It doesn’t do structure-preserving inference. It just varies the intensity of the pattern-matching based on token complexity.
Two approaches. RLMs give the model hands (instrumental agency, the ability to act and observe). MoR makes intuition cheaper where intuition works. Neither does anything to protect you from catastrophically wrong premises, the kinds of errors where “more compute” is not just wasteful but fatal.
Neither gives the ability to reason.
(God, I can’t help it. Someone needs to do something so the acronym becomes SMoRs. Amiright?)
Why Neither Produces Deduction
Before you argue that “Didn’t Jake borrow the car?” is itself just intuition, let me be precise about the difference.
Yes, as a heuristic, “when stuck, check your assumptions” is pattern-matching. Someone taught you that. You retrieved it from memory. Fair.
But as a reasoned response to evidence, it’s deduction:
- Premise: The keys are in this house.
- Observation: Exhaustive search produced no keys.
- Inference: Either the search was incomplete, or the premise is false.
- Conclusion: Investigate the premise.
That’s modus tollens. If the keys were here, I would have found them. I didn’t find them. Therefore: probably not here.
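For the logic-inclined, here’s the same move in miniature as a toy Lean proof. The “probably” above is where real life softens the clean implication.

```lean
-- Modus tollens, the spouse's three-second move:
-- if the keys were in the house, a thorough search would have found them;
-- the search found nothing, therefore the keys are not in the house.
example (KeysHere Found : Prop)
    (search_works : KeysHere → Found) (not_found : ¬Found) : ¬KeysHere :=
  fun keys_here => not_found (search_works keys_here)
```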
The frantic searcher doesn’t draw that inference. They interpret failure as “search harder.” The premise is never questioned. The search space is never interrogated. Just more couch cushions, faster, with rising panic.
The structure of the failure tells you something about your premises. Deduction notices that. Intuition just keeps grepping.
Look at RLMs through this lens.
The feedback loop is: intuit a decomposition, execute in Python, observe results, intuit again. You find out the strategy was wrong after you’ve burned the compute. You end up in Idaho and then backtrack. No modality attacks the plan before execution. No voice says “Step 3 fails because X doesn’t hold when Y.” The model grepped the house. Python told it “no keys found.” Now it intuits another grep.
RLMs have kinda-sorta externalized deduction to Python, but only for execution, not for strategy. Python can count. Python can filter. Python cannot tell you that your decomposition was wrong before you tried it. Python is a calculator, not a critic. It executes the bad plan faithfully. It cannot look at the script the model just wrote and say, “Bro, I’m not doing that, that’s a stupid idea because X, not Y.” The model is still driving blind, using the territory as the map.
MoR is even further from deduction. It doesn’t add any new cognitive capability. It just makes the existing pattern-matching adaptive. Flap harder on hard tokens. Coast on easy ones. It’s checking the same couch cushion 24 times instead of twice because you really feel like the keys should be there. More intensity, same scope. And the router that decides “hard vs. easy” is itself just pattern-matching. It’s intuition about how much intuition to apply.
Neither architecture has anything that questions premises. Neither has anything that notices when the search space is wrong. Neither can do what your spouse did in three seconds: reason about the structure of the failure rather than just observing the failure and trying again.
Ok, This Is Awkward…
I should be honest about something.
Human reasoning might be some version of this. Strategic intuition, external tools, fast feedback loops, iterative refinement. The felt sense of “insight” might just be what it’s like when a particularly good pattern-match fires. If so, this critique applies more broadly than I’d like.
But even granting that, we’re stuck swinging the hammer of intuition and hoping every problem is a nail.
Problems designed to exploit the intuitive-looking path. Problems where wrong turns are expensive or irreversible. Problems requiring genuine structure-preservation under transformation. Adversarial problems where someone wants you to take the obvious path to Idaho.
These remain out of reach.
RLMs are real progress. MoR is real progress. Agency helps. Adaptive depth helps. Better flapping is genuinely better.
But better flapping is still not flight.