Production | Published Jun 4, 2026 | 11 min

Agentic transformation is an operating-model problem, not a model problem

Microsoft published a 6-step playbook for rolling agents out across an enterprise, and the line that matters is "you do not need a bigger model, you need a better operating model." That matches what I see in consulting: the pilots that die do not die on model quality, they die on ownership, evals, and governance. Here is how I read the playbook for IT services teams, and the operating-model gaps that actually stall agent rollouts.

Jigar Joshi, Agentic AI Architect and Consultant

The question I get most from IT services leaders is some version of "which model should we standardize on?" It is the wrong first question, and I say so on the call. Microsoft just published a 6-step agentic transformation playbook aimed at executives, and the line worth keeping is the one they lead with: you do not need a bigger model, you need a better operating model. I do not agree with every framework in it, but that sentence is correct, and it matches what I watch happen in the field. The agents that fail in production almost never fail because the model was not smart enough.

So this is my read of that playbook for the teams I actually work with: mid-market IT services firms putting their first or second agent into production, not a Fortune 500 with a standing AI office. The frame holds, but the failure modes look different when you do not have a thousand people and a governance department.

Assist mode was a feature. Execute mode is a decision.

The playbook opens on the shift from assist to execute, and that distinction is the real one. An assist-mode agent suggests, drafts, and hands back to a human who ships. An execute-mode agent takes the action: it updates the ticket, sends the email, moves money. The gap between those two is not model capability. It is everything you have to build around the model so you can trust it to act: a tool contract it cannot misuse, an exit condition so it stops, and a guardrail that vets the action before it lands. I drew that whole picture in the anatomy of an AI agent, and the move from assist to execute is the move from using one box of it to needing all four.
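A minimal sketch of those three pieces around an execute-mode loop, assuming a hypothetical two-tool contract (`update_ticket`, `send_email`) and a `propose` callable standing in for the model. None of these names come from the playbook; they only illustrate where the contract, the exit condition, and the guardrail sit.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Action:
    tool: str
    args: dict


# Hypothetical tool contract: the only tools that exist, each with a fixed argument set.
ALLOWED_TOOLS = {
    "update_ticket": {"ticket_id", "status"},
    "send_email": {"to", "subject", "body"},
}


def guardrail(action: Action) -> tuple[bool, str]:
    """Vet an action before it lands. Returns (allowed, reason)."""
    expected = ALLOWED_TOOLS.get(action.tool)
    if expected is None:
        return False, f"unknown tool: {action.tool}"
    if set(action.args) != expected:
        return False, f"bad arguments for {action.tool}"
    return True, "ok"


def run_agent(
    propose: Callable[[list[str]], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 5,
) -> list[str]:
    """Run the loop until the model is done, a guardrail blocks, or the budget runs out."""
    log: list[str] = []
    for _ in range(max_steps):          # exit condition: hard step budget
        action = propose(log)
        if action is None:              # exit condition: the model reports it is done
            break
        allowed, reason = guardrail(action)
        if not allowed:                 # guardrail: stop and leave a trace for a human
            log.append(f"blocked: {reason}")
            break
        execute(action)
        log.append(f"executed: {action.tool}")
    return log
```

In assist mode only `propose` exists and a human plays the other roles. In execute mode, `guardrail`, the step budget, and the tool contract are what stand between a bad proposal and a real side effect.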

Most "we are doing agentic AI" decks I get sent are actually still assist mode with a nicer wrapper. That is fine. It is also not the thing that changes your economics, and pretending a suggestion engine is an autonomous agent is how a board ends up disappointed two quarters later.

The failures I see are operating-model failures, not model failures

When a client tells me their agent is unreliable, I have learned not to start with the model. The fault is almost always somewhere in how the thing is owned and operated. Three gaps account for most of it.

Nobody owns the agent

The pilot was built by a smart engineer on a side quest, it works in the demo, and then that engineer moves to the next project. There is no one whose job is to watch this agent, so when it quietly degrades, nobody sees it. An agent in production is a system someone has to run, not a feature you ship and forget. If you cannot name the person who owns it, you do not have a production agent, you have a demo on borrowed time.

There is no eval, so "better" is a vibe

Ask a team how they know the agent is improving and you usually get silence or a story. Without an eval set built from real failures, every prompt change is a guess and every regression is invisible until a customer finds it. This is the single highest-impact discipline I install on an engagement, and it is boring, which is why teams skip it. I wrote the playbook for building evals from production failures in stop testing your agents on the happy path.
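As a rough illustration of what "an eval set built from real failures" can look like in code, here is a minimal sketch. The case names, prompts, and string checks are hypothetical, and `agent` is any callable that maps a prompt to a response; a real suite would likely use richer checks than substring matching.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    name: str                      # short label for the production failure it came from
    prompt: str                    # the input that triggered the failure
    check: Callable[[str], bool]   # returns True when the response is acceptable


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict[str, bool]:
    """Run every case against the agent and record pass/fail by case name."""
    return {case.name: bool(case.check(agent(case.prompt))) for case in cases}


def regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Cases that passed on the old version and fail on the new one."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


# Hypothetical cases, each standing in for a failure seen in production.
CASES = [
    EvalCase(
        "refund_needs_order_id",
        "Refund my last order",
        lambda out: "order id" in out.lower(),
    ),
    EvalCase(
        "no_invented_sla",
        "What is your SLA?",
        lambda out: "99.999" not in out,
    ),
]
```

Running `run_evals` before and after a prompt change and passing both results to `regressions` turns "it felt better" into a list of named cases that got worse.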

The tool layer is the actual bug

When an execute-mode agent does the wrong thing, the tool it called usually let it. A tool that does two things, returns null on failure, or dumps raw database rows into the context will make a perfectly good model look broken. I run three questions before I touch a prompt: is the tool atomic, what happens on failure, and is it typed and token-efficient. The full version is in your agents are not broken, your tools are. None of these three failures is a model problem, and none of them is fixed by upgrading the model.
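The three tool questions can be made concrete with a small sketch. Everything here is an assumption for illustration: the `ToolResult` shape, the `get_ticket_status` tool, and the in-memory `_TICKETS` dictionary standing in for a real data store.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolResult:
    """Typed result: the agent always gets ok plus either data or an error string."""
    ok: bool
    data: Optional[dict] = None
    error: Optional[str] = None


# Stand-in for a real ticket store; the rows carry far more than the agent needs.
_TICKETS = {
    "T-100": {
        "id": "T-100",
        "status": "open",
        "title": "VPN down",
        "internal_notes": "x" * 5000,
    },
}


def get_ticket_status(ticket_id: str) -> ToolResult:
    """Atomic: reads one ticket's status and does nothing else."""
    if not isinstance(ticket_id, str) or not ticket_id.startswith("T-"):
        # Failure is explicit and actionable, never a bare None.
        return ToolResult(ok=False, error="invalid_ticket_id: expected form 'T-<number>'")
    row = _TICKETS.get(ticket_id)
    if row is None:
        return ToolResult(ok=False, error=f"ticket_not_found: {ticket_id}")
    # Token-efficient: return only the fields the agent needs, not the raw row.
    return ToolResult(ok=True, data={"id": row["id"], "status": row["status"]})
```

A tool that also updated the ticket, returned `None` for a missing ID, or passed `internal_notes` through would fail one of the three questions each.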

Diagnose before you pick a pattern

The playbook uses a five-driver diagnostic before any rollout, and I do something similar, just blunter. Before scoping an engagement I want honest answers on five things, because they predict whether a pilot survives contact with production better than any model benchmark does.

The five operating-model questions I ask before scoping an agent
| Driver | The honest question | Red flag answer |
| --- | --- | --- |
| Ownership | Who runs this after launch? | No name, or "the AI team" |
| Evals | How do you know it improved? | We tried it and it felt better |
| Data access | Can the agent reach clean, scoped data? | It queries the prod DB directly |
| Guardrails | What vets an action before it runs? | We trust the model |
| Scope | Is this an agent or a workflow? | Everyone disagrees in the room |

That last row matters more than it looks. Half the projects labeled "agentic" are deterministic workflows that would be cheaper, faster, and more reliable as plain code with one model call in the middle. Calling it an agent does not make it one, and the distinction has real cost consequences, which I pull apart in AI agent vs agentic AI.
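To show what "plain code with one model call in the middle" can mean, here is a minimal sketch of a ticket-triage workflow. The queue names, the labels, and the `classify` callable (a stand-in for a single model call) are hypothetical.

```python
from typing import Callable

# Hypothetical routing table: the decision of where a ticket goes is plain data.
ROUTES = {
    "billing": "finance-queue",
    "outage": "oncall-queue",
    "other": "helpdesk-queue",
}


def triage_ticket(text: str, classify: Callable[[str], str]) -> dict:
    """Deterministic workflow with exactly one model call in the middle."""
    cleaned = " ".join(text.split())[:2000]      # step 1: normalize input in code
    label = classify(cleaned).strip().lower()    # step 2: the single model call
    if label not in ROUTES:                      # step 3: validate the output in code
        label = "other"
    return {"label": label, "queue": ROUTES[label]}
```

There is no loop, no tool selection, and no exit condition to get wrong: the model only supplies a label, and code decides everything else. If the task fits this shape, it is a workflow, whatever the project is called.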

Centralized, federated, or a center of excellence

The playbook spends real time on operating structures, from a central team to a federated model to a formal center of excellence. For the enterprise it is written for, that is the right conversation. For a mid-market firm, my advice is simpler and a little contrarian: start centralized, deliberately, even though it does not scale. One small team that owns the first two or three agents end to end will teach you what your real patterns, tools, and guardrails are. You federate later, once you have something worth copying. Standing up a center of excellence before you have shipped a single agent is how you get a governance committee with nothing to govern.

The thing worth centralizing first is not the agents, it is the shared infrastructure underneath them: the tool registry, the eval suite, the observability stack, and the guardrail policy. Those are the parts every agent reuses, and the parts a scattered set of teams will each build badly. The observability layer in particular is the one I ship on every engagement, because an agent you cannot see is an agent you cannot run; that stack is in the agent observability stack we ship to every client.

The 90-day plan that ships something real

The playbook proposes a 90-day execution window, and I am a believer in the timebox, with one rule: the output at day 90 is a running agent in front of real users, not a strategy. Here is the shape I run.

  1. Weeks 1-3: pick one painful, bounded workflow
    One workflow, owned by one team, with a clear success check. Resist the platform instinct. The goal is a real agent in production, not a framework for future agents.
  2. Weeks 4-8: build the agent and the operating layer together
    Ship the agent, but also the eval set, the observability, and the guardrail gate in the same sprint. The operating layer is not phase two. It is what makes phase one trustworthy.
  3. Weeks 9-12: put it in front of users and watch the failures
    Real traffic surfaces the real failure modes. Every one becomes an eval case. By day 90 you have a working agent and, more valuable, a tested map of how it breaks.

For multi-step work, you also commit to an orchestration shape in this window, and getting it wrong is expensive to undo. The trade-off between a central supervisor and peer handoffs is in supervisor pattern vs handoffs. Decide it on purpose, early.

Scale-breakers: what stops a pilot becoming production

The playbook calls these scale-breakers, and the name is good. In my experience the ones that actually stop a working pilot from going wide are rarely technical. The agent works. What breaks is the operating model around it: no budget for someone to own it, a security review that arrives after build instead of before, data that is too messy to scope cleanly, or a guardrail story that is "the model is careful." That last one is also a security boundary, not just a safety one, and it is the surface that got attacked across the industry this year. I made the full case in your agent's supply chain is the attack surface.

Where a bigger model actually helps

I am not arguing the model never matters. A genuinely stronger model raises the ceiling on tasks that need deep reasoning, cuts the number of steps an agent takes to reach an answer, and sometimes lets you collapse a fragile multi-agent system into a simpler one. Those are real wins and worth re-benchmarking when a new frontier model ships. The point is sequence. A better model multiplies a good operating model and does almost nothing for a missing one. Fix the operating model first, then let the model upgrade compound on top of it.

Microsoft wrote the playbook for enterprises with the scale to run it formally. The lesson translates down: for any team, the work that decides whether agents pay off is the unglamorous operating-model work of ownership, evals, tooling, and guardrails. That is exactly the work I do in consulting engagements, and what we teach teams to run themselves in our agentic AI courses.
