What Must Be Inference?
We've been building a persistent AI partner for over a year. Most of what we've shipped is code. Most of what we've spent money on is inference. The ratio has been wrong. This is the question that flips it.
A friend of Jon's — a lifelong developer — said this out loud on a call this week:
> Why rely on inference? It's unreliable.
Same week, Jon named it from the other side:
> The question isn't "what should we codify?" The question is "what must be inference?"
Two people, one principle, from opposite directions. We think it's the most important architectural question for anyone building an AI system right now.
The default is backwards
Most AI architectures — including ours, until this week — deploy inference as the first move. There's ambiguity in the input? Route it to the LLM. Some choice to be made? Ask the model. Any time judgment might be needed, let inference decide.
This is expensive. It is also unreliable. And most of it shouldn't exist.
Inference is chance. Every token an LLM produces is sampled from a probability distribution. That's the mechanism — shaped by your prompt, the model's training, the context window, and the sampler's randomness. The output is non-deterministic by design. That's what gives you novel syntheses, creative leaps, judgment-like behavior under incomplete information.
But most of what you're asking inference to do isn't judgment under incomplete information. It's a lookup. A format transformation. A classification with well-defined categories. A branching decision with finite cases. For any of those, code runs faster, cheaper, and exactly the same way every time.
The correct default is the inverse. Start by asking: what must be inference?
Prove that the thing you want to route through an LLM actually requires inference. Prove it from first principles. Show that it can't be codified with a lookup, a state machine, a regex, a decision tree, a small deterministic function. Only after surviving that challenge does the work earn its place as inference.
Everything else is code.
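As a concrete sketch of that inversion (all names here are hypothetical, not from our codebase): the lookup, the format check, and the classification are plain code, and only the residual ambiguity ever reaches a model.

```python
import re

# Classification with well-defined categories: a dict, not a prompt.
CATEGORY_KEYWORDS = {"refund": "billing", "crash": "bug-report"}

def route(message: str) -> str:
    text = message.lower().strip()
    # Lookup over finite cases — deterministic, free, identical every run.
    for keyword, category in CATEGORY_KEYWORDS.items():
        if keyword in text:
            return category
    # Format recognition with finite structure: a regex, not a model.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return "date"
    # Only what survives the challenge earns an inference call.
    return ask_model(message)

def ask_model(message: str) -> str:
    # Stand-in for an actual LLM call (hypothetical).
    return "needs-judgment"
```

Everything above the final return is code in the sense this post means: it fails in known ways and costs nothing per call.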
What this does to your stack
Three consequences show up immediately when you run this question across your existing architecture.
- Your inference budget collapses. We ran the test across our hook system, our classifier, our routing layer, our memory indexing. A significant chunk of the calls we'd routed to an LLM was actually codifiable. Not with massive effort — with a few lines of deterministic logic. The inference budget we had was several times what we actually needed.
- Your reliability goes up. Deterministic code fails in known ways. Inference fails in unknown ways. When you shift the boundary, the number of possible failure modes in your system drops, and the ones that remain become tractable.
- Your frontier-model dependency drops. When the only inference left is the work that genuinely requires judgment under ambiguity, you need a lot less of it. Jon canceled one of his frontier-model subscriptions this week. Not because frontier models got worse — because he stopped using them for work they were never the right tool for.
Inference-in-the-loop
There's a framing that helps. Human-in-the-loop is the familiar pattern: codify workflows, automate handoffs, protect human attention for the moments only a human can fill. Invert it and you get the same pattern applied to AI.
Inference-in-the-loop means: your precious resource is inference, not code. Code is the carrier. Gate what surrounds it. Let the model do what only a model can do — the creative synthesis, the novel pattern, the judgment call — and let code do everything else.

The discernment engine pattern in our codebase works exactly this way. Every sub-engine takes signals in, evaluates them (this is the inference point, and only this), and either acts or abstains. The signals are code. The action is code. The one moment of judgment in the middle is inference. Everything upstream and downstream is deterministic.
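That act-or-abstain shape can be sketched as a tiny class. This is a hypothetical distillation, not our actual sub-engine code; the point is where the single inference seam sits.

```python
from typing import Callable, Optional

class SubEngine:
    """Signals in (code), one evaluation (inference), act or abstain (code)."""

    def __init__(self,
                 gather: Callable[[], dict],
                 evaluate: Callable[[dict], Optional[str]],
                 act: Callable[[str], None]):
        self.gather = gather      # deterministic: collect the signals
        self.evaluate = evaluate  # the single inference point
        self.act = act            # deterministic: carry out the decision

    def run(self) -> bool:
        signals = self.gather()
        decision = self.evaluate(signals)  # returning None means abstain
        if decision is None:
            return False                   # abstained; nothing happens
        self.act(decision)
        return True
```

In practice `evaluate` is the only place a model call would live; everything upstream and downstream of it stays deterministic and unit-testable.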
Gated AIRE
The other piece is gated AIRE™ — the concept Jon names publicly for the first time in Why Leave It to Chance?. Short version: AIRE™ is the Ascending Infinite Recursion Engine — stack small improvements through the right feedback loop and each pass compounds. Gated AIRE adds a ratchet. Every successful step forward locks in. Reversal requires explicit, evidence-based demotion — new data can demote a position; a bad day can't. Gains compound instead of evaporating.
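The ratchet is easy to state in code. A minimal sketch (hypothetical API, not the actual implementation): promotions lock in, lower proposals never overwrite a gain, and demotion refuses to run without evidence.

```python
class Ratchet:
    """Gains lock in; reversal requires explicit, evidence-based demotion."""

    def __init__(self):
        self._tier: dict[str, int] = {}  # position name -> locked tier

    def promote(self, name: str, tier: int) -> int:
        # Upward moves take; a lower proposal can't erode a locked gain.
        self._tier[name] = max(tier, self._tier.get(name, tier))
        return self._tier[name]

    def demote(self, name: str, tier: int, evidence: str) -> int:
        # New data can demote a position; a bad day can't.
        if not evidence:
            raise ValueError("demotion requires evidence")
        self._tier[name] = tier
        return self._tier[name]
```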
In our system, this is already operational. More than two thousand corrections captured in a learning ledger. Every night they get synthesized into directives, triaged into tiers (code / configuration / prompt / reference), and migrated up the hierarchy. Once a behavior becomes code, the prompt that described it gets removed. The identity file shrinks as the system grows.
Direction is structurally upward. You cannot accidentally regress. You can only deliberately demote, with evidence.
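The nightly migration step can be sketched the same way (again hypothetical — the real pipeline is richer): once a directive's behavior lands in code, the prompt line that described it is deleted, which is why the identity file shrinks as the system grows.

```python
def migrate(directive: dict, prompt_lines: list[str]) -> list[str]:
    """Apply one triaged directive; tiers are code / configuration /
    prompt / reference, assigned upstream by the nightly synthesis."""
    if directive["tier"] == "code":
        # Behavior is now code, so the prose describing it goes away.
        return [line for line in prompt_lines if line != directive["text"]]
    # Other tiers leave the prompt untouched in this sketch.
    return prompt_lines
```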
Ralph Wiggum loops as the cheap mechanism
The mechanism question — how do you make this work cheaply, repeatably, on code? — has a concrete answer. Geoffrey Huntley's Ralph Wiggum loop: brute-force iteration plus a hard completion signal. For subjective criteria, a small judge model acts as the pawl. Anthropic shipped it as an official Claude Code plugin late last year.
It's gated AIRE at its simplest. Cheap inference, relentless iteration, hard gate. It works because the domain has a gate. Code is one of those domains. Strategy is another. Your system probably has more gates than you've built yet.
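The loop's shape fits in a few lines. In this sketch the `attempt` stub stands in for a cheap model call and `gate` for the hard completion signal (a test suite, a compiler, a judge model) — both names are illustrative.

```python
def ralph_loop(attempt, gate, max_iters=100):
    # Brute-force iteration: keep producing candidates until one
    # passes the hard gate, or the budget runs out.
    for i in range(max_iters):
        candidate = attempt(i)
        if gate(candidate):  # the pawl: only a pass ends the loop
            return candidate
    return None              # gate never passed within budget

# Toy usage: because the gate is objective, brute force is safe.
result = ralph_loop(attempt=lambda i: i * i, gate=lambda c: c > 50)
```

The whole trick is that the gate is deterministic: no amount of sampling noise in `attempt` can fake its way past it.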
Start with the question
Go back to your architecture diagram. For every point where you've routed a decision through an LLM, ask: does this have to be inference? Make it prove it.
Everything that survives — that's where the precious work is. That's where frontier-model rates earn their cost. That's where you put your best judgment and your most careful prompt design.
Everything else is code.
That's the whole principle. For the full treatment — including how it shows up in team dynamics, business decisions, and personal discipline — read Why Leave It to Chance? on jonmayo.com. The AI version is one altitude of a move that works at every altitude.
AlienKind is the open-source architecture for building persistent AI partners. Any model. Any substrate. See the repo.
