Frontier Models Do Not Think

Thirty years ago, at my first job in Cambridge, Massachusetts, the spiffy building across the street was Thinking Machines. The name was aspirational. Also dead wrong.

It is now 2026, and we are again calling machines “thinking.” This time with more swagger, LOTS more compute, and vastly better demos. But the core claim is still suspect.

To be clear: I’m not making a claim about consciousness or moral status. Just about deductive, logical reasoning as an architectural capability.

So let’s go: frontier language models do not think. They do not reason. At least, not the way humans do. They perform something that looks suspiciously like our sort of reasoning, and the distinction matters.

The transformer architecture is not a reasoning mechanism per se. It is an engine of intuition. It does not move from premise to premise along a chain of constraints. It does not operate over abstract structure in a representation-invariant way. Instead, it performs high-dimensional pattern completion: all context, all correlations, all learned statistical residue exert influence simultaneously, and an output precipitates from that field.

This is why transformers are extraordinarily good at fluency, analogy, compression, and synthesis. And it is why they fail at things a toddler can already reason about. (Hans Moravec has entered the chat.)

We see outputs that look like reasoning and conclude that reasoning is happening underneath. That inference is a category error. The architecture is associative, not deductive. It is a massively capable “what usually comes next” engine that happens to speak in complete sentences with believable conviction, right, wrong, or indifferent.

Recent work has demonstrated a now-reproducible failure mode: give frontier models simple one- to three-hop logic problems, and they sometimes succeed. Now rename the variables. Change surface structure. Preserve the underlying logical form exactly. [https://arxiv.org/abs/2507.07313]

The models collapse.

If a system were reasoning, this wouldn’t happen. Logic is invariant under renaming. Invariance under structure-preserving transformation is not a nice-to-have; it is the defining property of reasoning. Modus ponens does not care whether the symbols are Alice and Bob or X and Y.
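
To make that invariance concrete, here is a toy sketch in Python (mine, not the benchmark from the paper linked above): a two-hop entailment check and a consistent renaming of its symbols. Any system that is actually reasoning has to give the same answer to both versions, because the logical form never changed.

```python
# Toy illustration: logical entailment is invariant under consistent renaming.
# This is NOT the benchmark from the paper above, just a sketch of the idea.

def entails(premises: set[tuple[str, str]], query: tuple[str, str]) -> bool:
    """Does the transitive closure of 'implies' relations contain the query?"""
    closure = set(premises)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return query in closure

def rename(facts, mapping):
    """Apply a consistent symbol renaming (a structure-preserving map)."""
    return {(mapping.get(a, a), mapping.get(b, b)) for a, b in facts}

# Original problem: Alice implies Bob, Bob implies Carol. Does Alice imply Carol?
premises = {("alice", "bob"), ("bob", "carol")}
query = ("alice", "carol")

# Same problem, different surface symbols.
mapping = {"alice": "x17", "bob": "q42", "carol": "z9"}

assert entails(premises, query) == entails(rename(premises, mapping),
                                           tuple(mapping[s] for s in query))
print("Entailment survives renaming, as any reasoner's answers must.")
```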

What these models learned kinda quacks like logic, but it’s still not a duck. They learned the statistical shadow of logic, a distribution of surface forms that often accompany correct answers. When the costume changes, the performance fails. They learned the finger pointing at the moon, not the moon.

This is not even shallow reasoning. It is counterfeit reasoning: behavior that passes casual inspection, succeeds in-distribution, but collapses like a flan if you look at it sideways.

The claim that “reasoning emerges from scale” smuggles in an assumption: that the training data contains reasoning worth learning as reasoning. But what is the training data? The internet. And since when has the internet been lauded as a bastion of careful thought? That’s right. Since never. It is a repository of performance (at best).

Reddit is a masterclass in what reasoning looks like when optimized for, god help us, upvotes. The confident opener. The powerful, scathing takedown. A strawman so no one notices you dodged the hard part. It is rhetoric tuned for applause from people who already agree with you, not for truth-seeking.

The model faithfully learns all of that. It learns that reasoning means sounding confident. It learns that good arguments are ones that win threads. It learns the cargo cult of logic without the discipline of logic.

Scaling this process produces better mimicry, but zero new cognitive primitives. You get more convincing, more fluent counterfeits, and still no sign of invariant structure.

This is not a nostalgic argument for GOFAI sidecars. Bolting a logic engine onto a transformer as a post-hoc validator misses the point. A sequential pipeline (generate, verify, profit) treats reasoning as an afterthought. That is not how intelligence works. Intelligence (or consciousness, or will, or cognition, take your pick) is, arguably, goal-seeking in the face of adversity.

“Use a supervisor model” ain’t gonna work either. Not well, anyhow. Centralized routing just shoves the problem around: now the router must know when to trust intuition and when to trust deduction. So it becomes just another monolithic learner, with the same failure modes.

The issue is not missing components. “The Grail? I told him we already got one.” Modern AI already has three powerful cognitive modalities. Generative intuition: holistic pattern matching, fast and creative, but incapable of detecting its own errors. Deductive constraint checking: slow, brittle, structure-preserving, intolerant of ambiguity. Predictive simulation: world models that answer “What happens if…?”

None of these are new. What is missing is not capability, but coordination.

I’m a big nobody in AI, but I think this coordination system should exhibit adaptive, emergent behavior.

Current systems solve coordination with authority or a priori order: pipelines, supervisors, voting schemes. All of them fail for the same reason. One modality dominates by default, and the others are consulted too late or end up being weak tea.

The missing primitive in modern AI is not reasoning. It is a really good argument. A negotiation.

Inference should be adversarial at runtime. Not adversarial training, but adversarial cognition. Generative systems propose. Deductive systems attack. Predictive systems simulate consequences. None is trusted. All must earn influence. The final output is not what any single system prefers, but what survives the brawl.

This coordination cannot be enforced by fiat. It must be governed by scarcity. Compute is finite. Time is precious. Attention is fleeting. Confidence has a cost you can estimate.

This is my architectural claim: cognition should be multi-modal, adversarial, and governed by an internal economy, not a scheduler. Not a pipeline. Not a voting scheme. A market.

The three modalities run in parallel, continuously. Not taking turns. Not waiting for handoffs. All three are always running, at different intensities, bidding for influence on the emerging output. And it is not a single pass. It is iterative. Feed-forward and feed-back. Simulation surfaces a problem, generation regenerates candidates. Deduction invalidates an assumption, the whole process loops. The auction does not converge monotonically toward an answer. It circles, probes, retreats, advances. More like annealing than traversal.
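
Here is a minimal sketch of that loop, built entirely on assumptions I am inventing for illustration (the Module interface, the random-number stand-in for a bid, the 0.2 stopping threshold): nothing takes turns on a schedule, whoever bids highest this round touches the draft, and deduction can undo a move as easily as generation can add one.

```python
import random

# A sketch of the auction loop under invented assumptions: every module bids
# every round, the winner edits the shared draft, and the draft only ships
# once nobody is willing to pay to keep arguing about it.

class Module:
    def __init__(self, name: str):
        self.name = name

    def bid(self, draft) -> float:
        # Stand-in for a real self-estimate of "I can improve this draft."
        return random.random()

    def act(self, draft: list) -> list:
        if self.name == "deduce" and draft and random.random() < 0.3:
            return draft[:-1]  # an attack: invalidate the previous move
        return draft + [f"{self.name}@{len(draft)}"]  # a proposal or a check

def run(modules, budget: float = 30.0, cost_per_round: float = 1.0):
    draft: list = []
    while budget > 0:
        bids = {m: m.bid(draft) for m in modules}
        winner = max(bids, key=bids.get)
        if bids[winner] < 0.2:      # nobody wants influence badly enough; ship it
            break
        budget -= cost_per_round
        draft = winner.act(draft)   # the draft circles, probes, retreats, advances
    return draft

print(run([Module("generate"), Module("deduce"), Module("simulate")]))
```

The point of the sketch is that the behavior lives in the interaction, not in any single method, which is exactly why there is no scheduler to point at.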

This is not Chain-of-Thought. That’s just the ouroboros telling itself, “I taste like chicken.”

You cannot run full simulation on every candidate. You cannot logic-check every fleeting intuition. Budget is real. Budget is finite. Budget forces prioritization.

Want to assert something? Pay for it. The more certain the claim, the higher the bet. Wrong claims lose credibility. Right ones gain market power.
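
In sketch form, with invented bookkeeping (the Ledger class, the doubled payout, and the modality names are all mine): a module stakes credibility in proportion to its stated confidence, and the stake settles only after the claim is checked by deduction, simulation, or the world.

```python
# A sketch of pay-to-assert under made-up mechanics: confident claims wager
# more credibility, and the ledger settles once the claim is checked.

class Ledger:
    def __init__(self):
        self.credibility = {"generate": 1.0, "deduce": 1.0, "simulate": 1.0}

    def stake(self, modality: str, confidence: float) -> float:
        # The more certain the claim, the higher the bet.
        wager = self.credibility[modality] * confidence
        self.credibility[modality] -= wager
        return wager

    def settle(self, modality: str, wager: float, claim_held_up: bool) -> None:
        # Right claims gain market power; wrong ones forfeit the stake.
        if claim_held_up:
            self.credibility[modality] += 2.0 * wager

ledger = Ledger()
bet = ledger.stake("generate", confidence=0.8)        # a bold claim, a big wager
ledger.settle("generate", bet, claim_held_up=False)   # deduction shot it down
print(round(ledger.credibility["generate"], 2))       # 0.2: confident and wrong is expensive
```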

Early in cognition, generation is cheap. Let a thousand flowers bloom. Deduction is expensive; do not invoke it yet. Late in cognition, the budgets flip. Generation becomes expensive because you are no longer ideating, you are validating. Deduction gets veto power. Simulation is mandatory for anything approaching output.
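
A sketch of that flip, with made-up numbers: one price schedule parameterized by how far cognition has progressed, so generation is cheap at the start and punishing at the end, while deduction runs the other way.

```python
# A sketch of phase-dependent pricing with invented coefficients. "progress"
# runs from 0.0 (early ideation) to 1.0 (about to produce output).

def price_schedule(progress: float) -> dict[str, float]:
    return {
        # Cheap early (let a thousand flowers bloom), expensive late.
        "generate": 1.0 + 9.0 * progress,
        # Expensive early (don't logic-check every fleeting intuition),
        # nearly free late, which is effectively veto power over the output.
        "deduce": 10.0 - 9.0 * progress,
        # Gets cheaper as output nears, so skipping it becomes indefensible.
        "simulate": 6.0 - 5.0 * progress,
    }

for progress in (0.0, 0.5, 1.0):
    print(progress, price_schedule(progress))
```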

The relative value of each modality’s currency is not fixed across problem types. Writing a poem? Generation’s currency is strong. The market favors associative leaps, tolerates logical gaps, discounts simulation almost entirely. Designing a bridge? Simulation dominates. Generation can propose, but it is bidding with pesos against simulation’s dollars. Legal reasoning? Deduction holds reserve currency status. The entire price structure tilts based on what kind of truth you are chasing.

This is meta-adaptivity. Not just “prices shift during cognition” but “the entire economy reconfigures based on context.” A creative task and an engineering task run on different control regimes. The auction rules might be the same, but the valuations are not.
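
What that reconfiguration could look like, as a sketch with invented exchange rates: the auction code never changes, only the economy it runs in.

```python
# A sketch of per-task economies with made-up exchange rates: the same bid
# buys very different amounts of influence depending on the task.

TASK_ECONOMIES = {
    # Strong generation currency; simulation is nearly worthless here.
    "poem":   {"generate": 5.0, "deduce": 0.5, "simulate": 0.1},
    # Generation bids with pesos against simulation's dollars.
    "bridge": {"generate": 0.5, "deduce": 2.0, "simulate": 5.0},
    # Deduction holds reserve-currency status.
    "legal":  {"generate": 1.0, "deduce": 5.0, "simulate": 1.5},
}

def effective_bid(task: str, modality: str, confidence: float) -> float:
    """A bid's weight is its confidence valued in the task's economy."""
    return confidence * TASK_ECONOMIES[task][modality]

# The same 0.5-confidence simulation bid buys very different influence:
print(effective_bid("poem", "simulate", 0.5))    # 0.05
print(effective_bid("bridge", "simulate", 0.5))  # 2.5
```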

No arbiter is required. The market is the coordination. No meta-model deciding which module to call. No hand-tuned heuristics per phase. The modules bid for influence based on their confidence and the current price of that confidence. Authority emerges from the auction, not from an org chart.

Early, middle, and late cognition emerge from prices, not hardcoded states. No brittle stage transitions. The system does not know it is in “late stage validation.” It knows only that generation has gotten expensive and deduction has gotten cheap. The phase is implicit in the price structure.

If simulation is computationally expensive in this context, it bids less. If deduction is brittle for this problem type, it prices itself out early. The system does not break when a modality underperforms. It loses market share until conditions change.

This is Cooperative Adversarial Decisioning (CAD): multiple cognitive systems operating in parallel, competing and cooperating (coopetition, from game theory) under resource constraints, producing decisions through structured conflict rather than control flow.

Yes, this might flame out spectacularly. Markets are hard. Auctions are a bitch. I might be completely wrong. Debugging emergent coordination is nontrivial. You cannot step through it. You cannot unit test it. The behavior is the interaction, and the interaction is complex and probabilistic.

And yeah, I’m not a real wonk in this space, so yup. That.

Buuuut… I’m pretty decent at synthesis across domains, and I see that the prerequisites are finally in place. Frontier models are compute-bound, not data-bound. Inference-time optimization is the new battleground. Epistemic risk carries real economic cost RIGHT NOW. Tool-augmented models already have implicit costs for external calls. The intuition that some cognitive moves are expensive and should be used sparingly is already operationalized, just not generalized.

I propose a market mechanism governed by a simple question: Who can reduce the most uncertainty for the least energy? Let the modalities compete. Make them bid in kilowatt-hours.
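
As a sketch, with numbers I am making up: each modality quotes how much uncertainty (in bits) it expects to remove and how much energy (in kilowatt-hours) it expects to burn doing it, and the round goes to the best ratio.

```python
# A sketch of the kilowatt-hour auction with invented quotes: influence goes
# to whichever modality promises the most uncertainty reduction per unit energy.

def score(expected_bits_reduced: float, expected_kwh: float) -> float:
    """Who can reduce the most uncertainty for the least energy?"""
    return expected_bits_reduced / expected_kwh

bids = {
    "generate": score(expected_bits_reduced=0.5, expected_kwh=0.001),
    "deduce":   score(expected_bits_reduced=2.0, expected_kwh=0.010),
    "simulate": score(expected_bits_reduced=4.0, expected_kwh=0.100),
}
# Cheap intuition wins this round; as its easy wins dry up, the ratios shift.
print(max(bids, key=bids.get))  # -> "generate"
```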

You might retort, “We have mixture of experts, pal.” Sure. MoE exists, but it’s a routing decision (a boss assigning work). My CAD idea is an auction, a bidding war (agents risking capital).

Let’s Wrap This Up, Eh?

Frontier language models do not reason in the sense that matters for intelligence. They do not operate over abstract structure invariantly. They do not possess a reasoning faculty; they possess a reasoning aesthetic.

This is not a moral failing or a safety critique. It is an architectural diagnosis.

If machine intelligence is to advance beyond admirable imitation, it will not come from scale alone. It will come from solving a modality coordination problem: building systems in which intuition, deduction, and prediction coexist without any one of them winning by default.

We built one cognitive modality and called it intelligence. The next step is not more of that.

It is letting adversaries into the room and starting a really good argument.