You Are Holding It Wrong

The AI cost crisis isn't an AI problem — it's an operator literacy problem. The industry has structural reasons to let operators conflate "AI is expensive" with "we are using AI inefficiently." The gap closes when you learn to tell the difference.

Somewhere right now, someone is sending "how many cups in a quart" to Opus 4.7 with extended thinking on. The query takes a few hundred milliseconds. The answer is four. The model could not be more overqualified for the job, and nobody notices, because the charge disappears into a line item called "AI usage" that nobody owns.

That's the consumer version. The enterprise version is harder to laugh at.

There is a support bot running somewhere — internal ticketing, customer service, take your pick — routing every inbound query to a frontier reasoning model. "Where is my order?" Frontier model. "How do I reset my password?" Frontier model. "What are your business hours?" Frontier model. The system is impressively capable. The latency is acceptable. The monthly bill just crossed five figures and nobody can explain why. All anyone can say in the budget meeting is that AI is expensive.

AI is not expensive. They are using AI inefficiently. And the industry has structural reasons to let them keep confusing the two.

LLMs Are the Most Inefficient Search Engine Ever Built

Every LLM call is billed on tokens — in and out, charged for both. Most teams have no visibility into how often the expensive path is running across their stack.

Enterprise deployments replicate this at scale: RAG pipelines re-embedding static corpora on a nightly schedule, regenerating vectors for content that hasn't changed; eval suites running against a frontier model when a smaller, cheaper model clears every test; agent loops catching transient failures and retrying the full task with the entire context window replayed, billing for the same tokens on every attempt; the support bot routing every trivial query to a model built for novel reasoning. The engineers who built these systems defaulted to the most capable option because nobody told them there was a cheaper one, and the failure mode didn't appear until weeks after the code shipped.

Recall vs. Reasoning

The mental model that fixes this is straightforward to describe and almost universally absent at the architecture level: recall is not reasoning.

Recall means retrieving and formatting information that is already known — a lookup, a summarization of a document in an index, answering "what is our refund policy" from a corpus the model has access to. These tasks should be cheap. They belong to small models, cached embeddings, or sometimes no language model at all.

Reasoning means working through a problem the model hasn't seen before in exactly this form: synthesizing across ambiguous sources, decomposing a novel request into a plan, catching a subtle edge case the spec didn't anticipate. These tasks can justify an expensive model.

The support bot from the opening is a clean case study. "Where is my order" — recall. "What are your business hours" — recall. A customer with three split shipments, a disputed charge, and a delivery exception requiring synthesis across order history to reach a coherent answer — that's reasoning. The bot routes both to the frontier model. The recall/reasoning frame explains exactly what's wrong: the bot is billing at reasoning rates for a workload that's overwhelmingly lookup.

The cost crisis is, almost entirely, an inability to distinguish these two categories at runtime. So everything defaults to the expensive path.

Summarize this ticket history

Triage this incident and propose a remediation path

What does this error message mean

Why is this error appearing in this specific sequence of calls across these services

Translate this query to our internal API schema

Design the schema from scratch given these constraints

What are your business hours

(This is always recall. A reasoning model should never see it.)

The frame matters because it's architectural, not just diagnostic — it gives teams a vocabulary for the routing decision: which tier should handle this class of work, and why. The model doesn't decide which category a query falls into. That decision has to happen before the model is invoked.

Routing Layers and Why Nobody Builds Them

The fix is a routing layer: a component between the inbound request and the model that classifies the query and dispatches it to the appropriate tier. The support bot sends "where is my order" to a fast lookup service. It sends "this customer is threatening to escalate and I need a nuanced, context-aware response" to the frontier model. The overall bill drops. Quality on hard queries goes up, because the frontier model is no longer degraded by a queue full of password resets.

This architecture is well understood in theory. It is nearly absent in practice.

The unglamorous-infrastructure explanation is real but incomplete. Providers do ship smaller models — Haiku, Flash, the mini variants — and those models handle recall-class work well. What providers don't ship is routing infrastructure, because routing infrastructure is workload-specific: the classification logic for a support bot looks nothing like the one for an internal developer tool. No off-the-shelf product fits cleanly across use cases, and providers have no incentive to build one — a team with good routing consumes less of the expensive tier. The discipline has to come from the operator side.

The routing layer isn't difficult to conceive. The hard part is building a classifier accurate enough that misclassifications don't erase the savings. A 10% error rate means 10% of hard queries hitting a tier that can't handle them, generating escalations and retries that cost more than the tokens saved. Getting routing right requires labeled traffic data, ongoing evaluation, and the willingness to treat misclassification as a production failure. That difficulty is real — and it's also the moat.

I built a routing layer for an internal diagnostic tool last year. Most queries were diagnostic lookups against a known schema — recall-class work, fast and cheap to handle. A small fraction required genuine investigative reasoning across ambiguous system state. Routing the first category to a fast tier and reserving the frontier model for the second produced the delta. The cost difference wasn't a percentage improvement. It was an order of magnitude.

The support bot that generated the five-figure bill? A routing layer would have intercepted it before the first invoice. Every "where is my order" query has the same shape. A classifier that routes it to a lookup returns in milliseconds for fractions of a cent. The frontier model never sees it.

The Observability Gap

Even once teams commit to building a routing layer, a deeper problem remains: you cannot fix what you cannot see.

AI spend observability is roughly where application performance monitoring was in 2008. You can see the total bill. You cannot see which feature, which prompt template, which user cohort, or which retry loop generated which portion of it. You know you spent 400 million tokens last month. You don't know where to look, what to change, or whether any of that spend was justified by the outcomes it produced. The support bot from the opening ran for weeks before the five-figure bill arrived — per-trace data would have flagged "where is my order" as the highest-volume, highest-cost query category on day one.

Token counts on a dashboard are not observability. They are billing data with a chart drawn on top. The chart tells you that spend went up. It doesn't tell you which code path caused it, what the user got for it, or whether a cheaper path would have produced the same result. This is the equivalent of knowing your cloud bill increased without knowing which service, which region, or which team drove the change.

Real observability is per-trace cost attribution joined to outcome data. It is knowing that prompt template version 12 costs 30% more than version 11 and produces no measurable improvement in resolution rate. It is knowing that your retry logic is amplifying a 5% transient failure rate into a 40% cost increase because each retry replays the full context window. It is knowing that one user segment generates 60% of your frontier-model calls — which is either a routing opportunity or a signal that this segment is your most complex and highest-value work. You need the data to tell the difference.

cost_per_trace

Total token cost for a single end-to-end request, including all model calls within the trace.

cost_per_resolved_task

What it actually costs to close a ticket, answer a question, or complete a unit of work. The number that connects spend to value.

cost_per_user_session

Aggregate cost across all model calls within a user session. Identifies high-cost interaction patterns before they scale.

retry_amplification_factor

How much a given error rate inflates actual spend through context replay. A 5% failure rate with full-context retries can silently become a 50% cost multiplier.

That last one deserves more attention than it gets. Retry logic that replays the full context window on failure can silently turn a 5% error rate into a 50% cost multiplier — each retry carries all prior context, so attempt N costs more than attempt 1; at a 5% failure rate across three retry rounds, expected spend per successful completion runs roughly 1.4–1.5× the single-attempt baseline before context growth is factored in, and the multiplier compounds from there — with no user-visible failure and nothing on any dashboard that looks wrong. You will not catch it without instrumentation designed to look for it.

The Metric: Cost Per Resolved Task

Cost per token is the unit vendors bill in. It is not the unit your business runs on.

Cost per resolved task — closed ticket, answered question, successful deployment, whatever constitutes a resolved unit of work in your system — is the only number that connects AI spend to the value AI produces. It is the number that tells you whether the system is working or burning money generating well-formatted non-answers.

A $0.30 trace that resolves the issue is cheaper than a $0.02 trace that sends the user to a human agent. Token-cost optimization without outcome data is a local minimum. You can negotiate a better per-token rate and watch cost per resolved task increase, because the cheaper configuration escalates at twice the rate and the human escalations cost more than the tokens you saved. You have optimized the wrong number at the wrong layer.

The teams that get this right first will be in a different position. The gap shows up in margin, in the ability to scale AI features without scaling the bill linearly with usage, and in budget conversations where the person defending AI spend can actually answer the question. Not "we used 400 million tokens" — but "we resolved 80,000 support queries at $0.18 per resolution, down from $0.71 three quarters ago." That is a sentence a CFO can do something with.

Back to the start: "how many cups in a quart," Opus 4.7, extended thinking on. The person who sent that query wasn't wrong to use Claude. They were wrong about which Claude — and nobody told them there was a choice. Nobody built the routing layer. Nobody had the observability to notice it was happening. The cost showed up in a billing line nobody owned.

The enterprise version of that ignorance is a five-figure monthly line item that nobody can break down, a budget meeting that turns hostile, and an engineering team explaining that AI is expensive. It isn't. They're holding it wrong. Operators who learn the difference now will look like geniuses in eighteen months. The ones who don't will still be routing "where is my order" to a frontier reasoning model and calling the invoice a cost of doing business.

You Are HoldingIt Wrong

LLMs Are the Most Inefficient Search Engine Ever Built

Recall vs. Reasoning

Routing Layers and Why Nobody Builds Them

The Observability Gap

The Metric: Cost Per Resolved Task

You Are Holding
It Wrong