300 Nopes

Six weeks ago I published an essay called Frontier Models Do Not Think. The core claim: transformers don’t reason. They perform reasoning. They learned the statistical shadow of logic, the surface forms that correlate with correct answers, and when you change the costume the performance collapses. I called it counterfeit cognition.

Peiyang Song, Pengrui Han, and Noah Goodman at Stanford and Caltech just published “Large Language Model Reasoning Failures” in TMLR. It is the first comprehensive survey dedicated to reasoning failures in LLMs. ~300 references. A two-axis taxonomy covering every known failure mode, from cognitive biases to embodied reasoning breakdowns to the reversal curse.

I hate being so right. Not really. I dig it.

Their taxonomy classifies failures along two axes: reasoning type (informal, formal, embodied) and failure type (fundamental, application-specific, robustness). The “fundamental” column is the kill shot. These are failures intrinsic to the transformer architecture, not fixable by prompting or fine-tuning or yelling at the model more firmly. They trace working memory limits and inhibitory control failures directly to the self-attention mechanism and the next-token prediction objective.
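For reference, the objective they’re pointing at fits on one line. It pays out for predicting the next token given everything before it, and for nothing else:

$$
\mathcal{L}(\theta) = -\sum_{t} \log p_\theta(x_t \mid x_{<t})
$$

Minimize that hard enough and you get the world’s best pattern completer. Whether a reasoner falls out as a side effect is exactly the question the survey answers in the negative.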

Their words: the training paradigm “prioritizes statistical pattern completion over deliberate reasoning.”

My words, six weeks earlier: the architecture is “an engine of intuition” that “performs high-dimensional pattern completion.”

Po-tay-toe, po-tah-toe.

The reversal curse gets its own section. Train a model on “A is B.” Ask it “what is B?” It says “derp.” Bidirectional equivalence, trivial for any four-year-old, structurally unreachable for autoregressive transformers. The survey traces this to asymmetric weight structures baked in by unidirectional training. They cite work showing that scaling alone cannot fix it, because Zipf’s law guarantees the long tail of reversed facts will always be underrepresented.
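A toy makes the asymmetry concrete. The sketch below is a bigram counter, not a transformer, and every name in it is invented, but it shares the one structural property that matters here: the training signal only ever flows left to right.

```python
from collections import defaultdict

# Toy next-token model: raw bigram counts, trained left-to-right only.
# Not a transformer; just the unidirectional credit assignment, isolated.
counts = defaultdict(lambda: defaultdict(int))

def train(tokens):
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1  # updates flow one way: prev -> nxt

def predict(prev):
    options = counts[prev]
    return max(options, key=options.get) if options else "<derp>"

# Train on the forward direction only: "A is B."
train(["Mary", "is", "Tom's", "mother"])

print(predict("Mary"))    # "is"     -- the forward direction was trained
print(predict("mother"))  # "<derp>" -- the reverse never touched the counts
```

Scale doesn’t change the shape of that update. A trillion-parameter model trained the same way has fancier counts and the same one-way arrow.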

I said that “logic is invariant under renaming” and that failure under variable substitution is proof of counterfeit reasoning. Song et al. have ~300 papers documenting the dozens of ways this fails. Different surface, same structural void.
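To make “invariant under renaming” concrete, here’s the kind of probe I mean, sketched as a hypothetical harness. The template, names, and setup are all invented; the point is that the logical form stays frozen while only the surface varies.

```python
import itertools

# Hypothetical renaming probe. A genuine reasoner scores identically on
# both name sets; a pattern matcher degrades on the rare ones.
TEMPLATE = ("If {x} is taller than {y}, and {y} is taller than {z}, "
            "who is the shortest?")

COMMON = ("Alice", "Bob", "Carol")       # names saturating the training data
RARE   = ("Xq'tal", "Vrenn", "Omblyx")   # names that barely occur anywhere

def variants(names):
    for x, y, z in itertools.permutations(names):
        # The correct answer is always whoever fills the {z} slot.
        yield TEMPLATE.format(x=x, y=y, z=z), z

for prompt, answer in variants(RARE):
    print(prompt, "->", answer)
```

Logic says the accuracy gap between COMMON and RARE should be exactly zero. Every point of gap is the counterfeit showing through.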

Compositional reasoning is another graveyard. Models handle individual facts. Combine two of them? Collapse. A model knows who wrote a novel, knows when that author was born, and faceplants when asked when the novel’s author was born. The survey calls it “a lack of holistic planning and in-depth thinking.” I called it what happens when you build a single, narrow cognitive modality and mistake it for intelligence.

Here’s what the survey doesn’t do: propose an architectural fix.

The mitigation strategies on offer are Chain-of-Thought prompting, retrieval augmentation, fine-tuning on curated datasets, and external tool integration. Incremental moves within the existing paradigm, treating reasoning failure as a tuning problem rather than a coordination problem.

Nobody in 300 papers proposes that the missing primitive isn’t reasoning but coordination. Nobody suggests that the cure might be geometric rather than computational.

The survey is an admirable symptom catalog. 67 pages proving the disease exists. No treatment plan.

I don’t blame them. Surveys are supposed to organize. Song, Han, and Goodman did that job well.

In Frontier Models Do Not Think, I proposed Cooperative Adversarial Decisioning: cognitive modalities competing under scarcity constraints. In the weeks after, I shot that idea in the head myself. CAD’s mistake was treating coordination as a single mechanism rather than a family of regimes. Crisis mode needs hierarchical override, not an auction. Creative exploration needs loose generation, not game theory.

The deeper proposal is that goal-space geometry replaces executive function entirely. The coordination regime isn’t selected by a module. It emerges from the shape of the space the system currently occupies. Michael Levin’s bioelectric systems do this with cells. The math exists in Riemannian Neural Fields. No Configurator. No homunculus. The landscape is the control.
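Since “the landscape is the control” can sound mystical, here’s the unmystical version as a toy. This is entirely my sketch, not the survey’s and not lifted from any neural-fields paper: vanilla Riemannian gradient descent, where a position-dependent metric, rather than a controller module, shapes how the system moves between two competing goals.

```python
import numpy as np

def V(x):
    # Two competing goals as two basins in a 2D goal space (invented numbers).
    a, b = np.array([2.0, 0.0]), np.array([-2.0, 0.0])
    return -np.exp(-np.sum((x - a)**2)) - 0.5 * np.exp(-np.sum((x - b)**2))

def grad_V(x, eps=1e-5):
    # Central-difference gradient; good enough for a toy.
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (V(x + d) - V(x - d)) / (2 * eps)
    return g

def metric(x):
    # Position-dependent metric: motion along axis 1 is "expensive" near the
    # origin, cheap elsewhere. The metric is the only thing shaping regimes.
    stiffness = 1.0 + 5.0 * np.exp(-np.sum(x**2))
    return np.diag([1.0, stiffness])

# Riemannian gradient descent: x <- x - lr * G(x)^-1 * grad V(x)
x = np.array([0.5, 1.5])
for _ in range(200):
    x = x - 0.1 * np.linalg.solve(metric(x), grad_V(x))

print(x)  # which basin you end in is a fact about the geometry, not a module
```

Swap in a different metric and the same potential yields a different trajectory and a different winner. Nothing selected a regime; the shape of the space did.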

300 papers confirm the disease. None of them go on to propose that the cure lives in geometry rather than engineering. The field is still patching capabilities; I keep arguing the failure is coordination.

Am I biased? Obviously. Am I a big nobody in this space? Also obviously. But the diagnosis is confirmed, and the patient isn’t doing great.



Colin Steele is a systems thinker, former F500 VP, and glorious bastard dilettante who blogs at colinsteele.org. He has no PhD, no lab, and no frontier-scale compute. He does have a mortgage, a Riemannian geometry crush, and between his ears is the same transformer architecture he just called counterfeit. He considers this last point deeply funny.