Agentic AI in Production: Patterns That Survived Q1

If you had asked any AI platform team in early 2025 what an agent was, you would have gotten ten different answers. By the end of Q1 2026, a much narrower set of patterns has shaken out in production. They are not the ones the demo videos suggested. The agents that actually run in real workflows look more like carefully scoped state machines with LLM judgement embedded at specific decision points, and they survived because they failed cheaply, retried predictably, and stayed out of expensive parts of the system.

The picture below comes from talking to platform teams who put agentic systems in front of paying customers or internal stakeholders during the last six months, plus a handful of public postmortems. I am skipping the demo cases and focusing on what stuck.

Pattern one: the bounded planner

The most reliable agent pattern in production is the bounded planner. The agent receives a goal, generates a short plan (usually three to seven steps), and then executes those steps with a more deterministic runtime. The LLM does the framing and the post-step reflection, but it does not make every micro-decision inside the loop.

A customer support team at a mid-size SaaS company shipped this pattern for triage. The agent reads an inbound ticket, classifies it against an internal taxonomy, decides whether to gather more information from the user or hand off to a specialist queue, and writes a short rationale. The plan generation step uses a frontier model. The classification step uses a much cheaper, smaller model. The agent does not write replies to customers. That last constraint is what kept it in production.

What works about this pattern is the cost shape. The expensive call happens once per ticket, the cheap calls run in parallel, and the entire chain fits inside a five-second budget. The team logs the plan, the steps, and the outcomes, which means a human reviewer can audit failures without replaying the whole thing.
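
To make the shape concrete, here is a minimal sketch of a bounded planner loop. It is not the support team's code: run_bounded_plan, RunLog, and the handler names are hypothetical, and the planner is passed in as a plain function so the single expensive model call can be stubbed. The assumption is that each plan step maps to a named, deterministic handler.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

MAX_STEPS = 7  # upper end of the "three to seven steps" range described above


@dataclass
class RunLog:
    """Audit record: the plan as generated plus each step's outcome."""
    goal: str
    plan: List[str] = field(default_factory=list)
    outcomes: List[Tuple[str, str]] = field(default_factory=list)


def run_bounded_plan(
    goal: str,
    planner: Callable[[str], List[str]],
    handlers: Dict[str, Callable[[str], str]],
    max_steps: int = MAX_STEPS,
) -> RunLog:
    """Plan once, validate the plan, then execute it without further planning."""
    plan = planner(goal)  # the one expensive call (hypothetical, stubbed by caller)
    if not plan or len(plan) > max_steps:
        raise ValueError(f"plan must have 1..{max_steps} steps, got {plan!r}")
    unknown = [step for step in plan if step not in handlers]
    if unknown:
        raise ValueError(f"plan references unknown steps: {unknown}")

    log = RunLog(goal=goal, plan=list(plan))
    for step in plan:
        # Each handler is ordinary code or a cheap model call; no re-planning here.
        log.outcomes.append((step, handlers[step](goal)))
    return log
```

Rejecting an over-long or unrecognized plan before anything runs is the point of the sketch: the model frames the work, and the runtime decides whether that framing is allowed to execute.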

Pattern two: the human-confirmed mutation

The second pattern that survived is one where the agent can read freely but cannot write without a human in the loop. Most internal data-pull or analysis tools fit here. The agent ingests a question, queries internal systems, drafts a result, and posts it as a proposed action for a human to approve.

This pattern has a reputation problem. People dismiss it as “not real agents, just glorified search.” That dismissal misses how much of the time savings comes from the drafting step itself. A finance analyst who would otherwise spend three hours pulling a variance report can review a draft in twenty minutes and approve or revise it. The agent did not replace the analyst; it took over the most mechanical part of the job and handed it back as a draft. Because the mutation step stays with the human, errors do not become production incidents.

The teams that succeeded with this pattern resisted the temptation to remove the human as confidence grew. The errors that survive into late-stage testing are the subtle ones – wrong cost center, slightly off date range, a forgotten filter – and these are exactly the errors that look right at a glance and require domain context to catch.
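
One way to enforce that boundary in code, rather than by convention, is to make the write path refuse to run without a recorded approval. The sketch below is an illustration under that assumption; Proposal, approve, and execute are hypothetical names, not anything from the teams described here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


class ApprovalRequired(Exception):
    """Raised when a mutation is attempted without a human sign-off."""


@dataclass
class Proposal:
    summary: str                  # what the agent drafted, shown to the reviewer
    apply: Callable[[], str]      # the mutation itself, deferred until approval
    approved_by: Optional[str] = None


def approve(proposal: Proposal, reviewer: str) -> Proposal:
    """Record which human approved the proposed action."""
    if not reviewer:
        raise ValueError("reviewer must be a non-empty identifier")
    proposal.approved_by = reviewer
    return proposal


def execute(proposal: Proposal) -> str:
    """Run the mutation only if a human has approved it."""
    if proposal.approved_by is None:
        raise ApprovalRequired(proposal.summary)
    return proposal.apply()
```

The agent can construct as many Proposal objects as it likes; only execute touches state, and it has no path around the check.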

Pattern three: the tool-narrowed agent

Agents that have access to twenty tools rarely work. Agents that have access to four tools, with one of those being “ask a human a clarifying question,” tend to work fine. The reduction in tool count is one of the clearest signals separating production systems from demos.

The mechanism here is straightforward. Every additional tool multiplies the decision space the model has to navigate, and most models are mediocre at choosing between tools when several look plausible. Teams that scoped agents to a narrow surface – say, three Salesforce-related tools and nothing else – got dramatically better reliability than teams that tried to expose an entire SaaS ecosystem.

The clarifying-question tool is the underrated piece. A surprising amount of agent failure comes from the model assuming intent rather than asking. Adding an explicit “ask the user to clarify” option, and telling the model in the system prompt that using it is preferred over guessing, removes a category of failure that is otherwise hard to address.
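
A narrow tool surface can be enforced mechanically. The following is a sketch, assuming a registry that always includes the clarifying-question tool and refuses to grow past a small cap; ToolRegistry, ask_user, and the cap of four are illustrative choices, not a real framework's API.

```python
from typing import Callable, Dict, List

MAX_TOOLS = 4  # assumption: mirrors the "four tools" figure above


def ask_user(question: str) -> str:
    """Hypothetical clarifying-question tool; a real one would wait for a reply."""
    return f"NEEDS_CLARIFICATION: {question}"


class ToolRegistry:
    """Holds the only tools the agent may call, with a hard cap on their number."""

    def __init__(self, max_tools: int = MAX_TOOLS) -> None:
        self._max_tools = max_tools
        # The clarifying question is always present and counts toward the cap.
        self._tools: Dict[str, Callable[[str], str]] = {"ask_user": ask_user}

    def register(self, name: str, fn: Callable[[str], str]) -> None:
        if name in self._tools:
            raise ValueError(f"tool already registered: {name}")
        if len(self._tools) >= self._max_tools:
            raise ValueError(f"tool cap of {self._max_tools} reached")
        self._tools[name] = fn

    def names(self) -> List[str]:
        return sorted(self._tools)

    def call(self, name: str, arg: str) -> str:
        if name not in self._tools:
            raise KeyError(f"tool not exposed to this agent: {name}")
        return self._tools[name](arg)
```

Making the cap a hard error turns "should we add one more tool?" into a deliberate decision instead of a drift.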

What broke and got pulled

Several agent patterns that looked promising in late 2024 quietly disappeared from production roadmaps during 2025. The biggest casualty was the open-ended browse-and-act agent that took a high-level goal and was let loose on the web. The hit rate on these was too low and the failure modes too varied to support. The few public examples that survived ended up tightly constrained – a specific site, a specific user flow – at which point they look much more like ordinary RPA than the agent vision.

Multi-agent collaboration also lost ground. The early excitement around chaining a researcher agent to a planner agent to a critic agent has not translated into wins on standard benchmarks once you control for token spend. The systems that beat single-agent baselines often did so because they spent five to ten times more tokens, and the same token budget given to a single careful agent did about as well. There are still cases where role separation helps – particularly when the roles map to different tool permissions – but the case for it as a general architecture got weaker, not stronger.

Long-horizon agents running unattended for hours also turned out to be much harder than expected. Recovery from intermediate failures is the killer. A two-minute task that runs perfectly in a notebook becomes a six-hour task that gets stuck on a stale token, a rate limit, or a UI change. The teams shipping long-running flows broke them into short, checkpointed steps with explicit retry semantics. That is much closer to traditional workflow orchestration than to agents in the demo sense.
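
The checkpoint-and-retry structure is simple enough to sketch. This version assumes each step is a named function returning a JSON-serializable result, and that a local JSON file is an acceptable checkpoint store; run_checkpointed and its parameters are hypothetical, and a production system would more likely lean on an existing workflow orchestrator.

```python
import json
import time
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


def run_checkpointed(
    steps: List[Tuple[str, Callable[[], Any]]],
    checkpoint_path: str,
    max_attempts: int = 3,
    backoff_s: float = 1.0,
) -> Dict[str, Any]:
    """Run named steps in order, skipping any already recorded in the checkpoint."""
    path = Path(checkpoint_path)
    done: Dict[str, Any] = json.loads(path.read_text()) if path.exists() else {}

    for name, fn in steps:
        if name in done:
            continue  # completed in an earlier run; do not redo it
        for attempt in range(1, max_attempts + 1):
            try:
                done[name] = fn()
                break
            except Exception:
                if attempt == max_attempts:
                    raise  # earlier steps stay checkpointed for the next run
                time.sleep(backoff_s * attempt)  # linear backoff between attempts
        # Persist after every step so a crash loses at most one step of work.
        path.write_text(json.dumps(done))
    return done
```

A rerun after a stale token or rate limit then resumes from the failed step rather than from the beginning, which is most of what "recovery" means in practice.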

The economics most decks skip

Frontier model pricing dropped meaningfully through 2025, but agent systems still consume more tokens than people expect because of the retry and reflection loops. Target unit economics for a production agent are usually somewhere between two and ten cents per task, depending on the domain. Anything north of fifty cents per task tends to fail the business review unless the task replaces a sizeable amount of human work.
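
The arithmetic behind those thresholds is worth writing down, because retries and reflection passes are just extra rows in the sum. The sketch below uses placeholder per-million-token prices that are invented for illustration and are not any vendor's rates; the function names and the wording of the verdicts are likewise mine.

```python
from typing import List, Tuple

# (input_tokens, output_tokens, usd_per_million_input, usd_per_million_output)
Call = Tuple[int, int, float, float]


def call_cost_usd(input_tokens: int, output_tokens: int,
                  usd_per_mtok_in: float, usd_per_mtok_out: float) -> float:
    """Cost of a single model call given per-million-token prices."""
    return (input_tokens * usd_per_mtok_in
            + output_tokens * usd_per_mtok_out) / 1_000_000


def task_cost_usd(calls: List[Call]) -> float:
    """Total cost of one task; retries and reflection passes are extra entries."""
    return sum(call_cost_usd(*c) for c in calls)


def budget_verdict(cost_usd: float,
                   target_usd: float = 0.10,
                   ceiling_usd: float = 0.50) -> str:
    """Compare a per-task cost with the ten-cent and fifty-cent marks above."""
    if cost_usd <= target_usd:
        return "within target"
    if cost_usd <= ceiling_usd:
        return "above target"
    return "likely fails review"


# Placeholder prices: one large planning call plus three small classification calls.
example = [(8000, 500, 3.0, 15.0)] + [(2000, 100, 0.25, 1.0)] * 3
```

With these made-up prices the example task lands at a little over three cents, and it is easy to see how a handful of retried large-model calls would move it into the next band.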

The way teams hit those numbers is mostly cache-and-cascade. Aggressive prompt caching at the platform layer cuts the steady-state cost on stable system prompts. Cascading from a small model to a large model only when the small model lacks confidence keeps the large-model spend down. Skipping the LLM entirely for any sub-step that has a deterministic implementation is the biggest win and the easiest one to miss.
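
The cascade half of that can be sketched in a few lines. This assumes the small model returns a confidence score alongside its answer, which not every setup provides, and the 0.8 threshold is an arbitrary placeholder; the three callables are hypothetical stand-ins for a rule-based path, a small model, and a large model.

```python
from typing import Callable, Optional, Tuple


def cascade(
    task: str,
    deterministic: Callable[[str], Optional[str]],
    small_model: Callable[[str], Tuple[str, float]],
    large_model: Callable[[str], str],
    threshold: float = 0.8,  # assumption: tune against your own eval set
) -> Tuple[str, str]:
    """Return (answer, tier), trying the cheapest viable path first."""
    # 1. Skip the LLM entirely when ordinary code can answer.
    result = deterministic(task)
    if result is not None:
        return result, "deterministic"

    # 2. Try the small model and keep its answer only if it is confident.
    answer, confidence = small_model(task)
    if confidence >= threshold:
        return answer, "small"

    # 3. Escalate to the large model as the last resort.
    return large_model(task), "large"
```

Returning the tier alongside the answer makes it cheap to log what fraction of traffic reaches the large model, which is the number the cost review will ask about.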

Evals that actually shipped

The agent evals that survived contact with production are not the ones that measure success rate on a curated dataset. The useful ones look at three things together: task completion under a budget cap, regression on a frozen set of known failure cases, and shadow-mode comparison against a human baseline. Teams that ran shadow mode for two to four weeks before flipping the agent live caught the long tail of failure cases that no curated dataset captured.
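
The first two of those measurements are straightforward to compute from run logs. A minimal sketch, assuming each run records a task id, a success flag, and a cost; Run and both function names are hypothetical, and shadow-mode comparison is left out because it depends on how the human baseline is captured.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Run:
    task_id: str
    succeeded: bool
    cost_usd: float


def completion_under_budget(runs: List[Run], cap_usd: float) -> float:
    """Fraction of runs that succeeded without exceeding the per-task cost cap."""
    if not runs:
        return 0.0
    ok = sum(1 for r in runs if r.succeeded and r.cost_usd <= cap_usd)
    return ok / len(runs)


def regressions(runs: List[Run], frozen_failure_ids: Iterable[str]) -> List[str]:
    """Known failure cases that did not pass in this set of runs.

    A frozen case counts as regressed if it failed or was never run.
    """
    passed = {r.task_id for r in runs if r.succeeded}
    return sorted(set(frozen_failure_ids) - passed)
```

Counting an over-budget success as a failure is deliberate here: it keeps the metric from rewarding an agent that gets there by retrying expensively.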

The eval discipline that distinguishes the working systems is brutal version pinning. Model versions, prompt versions, tool versions, and the evaluation set itself all get versioned together, and any change anywhere triggers a re-run. Without that, you get drift you cannot explain, and the agent becomes a black box again.
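
One lightweight way to implement "versioned together" is a single fingerprint over everything that can change the agent's behavior. This is a sketch under the assumption that each component already carries a version string; eval_fingerprint and needs_rerun are hypothetical names.

```python
import hashlib
import json
from typing import Dict, Optional


def eval_fingerprint(
    model_version: str,
    prompt_version: str,
    tool_versions: Dict[str, str],
    eval_set_version: str,
) -> str:
    """Hash every pinned component into one identifier for the configuration."""
    payload = json.dumps(
        {
            "model": model_version,
            "prompt": prompt_version,
            "tools": tool_versions,
            "eval_set": eval_set_version,
        },
        sort_keys=True,  # stable ordering so equal configs hash identically
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def needs_rerun(current: str, last_evaluated: Optional[str]) -> bool:
    """Any change to any pinned component, or no prior run, triggers a re-run."""
    return current != last_evaluated
```

Storing the fingerprint next to each eval result also answers the later question of which configuration produced a given number.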

The honest summary

Agentic AI in production right now is mostly a story about scope. The agents that ship are the ones with a small number of tools, a clear bounded plan, a human in the loop for anything that mutates state, and an aggressive eval discipline behind them. The agents that do not ship are the ones that promised to figure it out as they went. There is nothing wrong with the ambitious version. It is just not what the customer-facing systems look like today, and pretending otherwise is how budget cycles get burned.

If you are building one for the first time in 2026, the move is to start narrow and earn your way to broader scope. Pick a task with a clear definition of done. Wire up the cheapest possible version. Run it in shadow mode for a few weeks. Only then start to widen the surface. The teams that got an agent into production this past quarter did not do anything more clever than that.