What does "you need a better operating model, not a bigger model" actually mean?

It means the thing standing between you and reliable production agents is usually organizational, not technical: who owns the agent, how you measure whether it improved, what data it can reach, and what vets its actions. A stronger model does not fix any of those. It only helps once they are in place.

What is the difference between assist-mode and execute-mode agents?

An assist-mode agent suggests or drafts and hands back to a human who takes the action. An execute-mode agent takes the action itself: updating a record, sending a message, moving money. The jump between them is not a matter of model capability; it comes from the tool contracts, exit conditions, and guardrails you build so you can trust the agent to act.

Why do most agentic AI pilots fail to reach production?

Rarely because the model is too weak. They stall on operating-model gaps: no one owns the agent after launch, there is no eval set so improvement is guesswork, the tool layer is unreliable, or there is no guardrail and no budget to run it. These are the scale-breakers, and they are organizational, not technical.

Should a mid-market firm build a center of excellence before its first agent?

No. Start with one small central team that owns two or three agents end to end, so you learn your real patterns before you formalize anything. Centralize the shared infrastructure first (the tool registry, eval suite, observability, and guardrails), and federate later once you have something worth copying.

What should a 90-day agent rollout actually produce?

A running agent in front of real users, not a strategy deck. Spend the first weeks scoping one bounded workflow, build the agent and its operating layer (evals, observability, guardrails) together rather than in phases, then put it in front of real traffic and turn every failure into an eval case.

Agentic Transformation Is an Operating-Model Problem

The question I get most from IT services leaders is some version of "which model should we standardize on?" It is the wrong first question, and I say so on the call. Microsoft just published a 6-step agentic transformation playbook aimed at executives, and the line worth keeping is the one they lead with: you do not need a bigger model, you need a better operating model. I do not agree with every framework in it, but that sentence is correct, and it matches what I watch happen in the field. The agents that fail in production almost never fail because the model was not smart enough.

So this is my read of that playbook for the teams I actually work with: mid-market IT services firms putting their first or second agent into production, not a Fortune 500 with a standing AI office. The frame holds, but the failure modes look different when you do not have a thousand people and a governance department.

Assist mode was a feature. Execute mode is a decision.

The playbook opens on the shift from assist to execute, and that distinction is the real one. An assist-mode agent suggests, drafts, and hands back to a human who ships. An execute-mode agent takes the action: it updates the ticket, sends the email, moves money. The gap between those two is not model capability. It is everything you have to build around the model so you can trust it to act: a tool contract it cannot misuse, an exit condition so it stops, and a guardrail that vets the action before it lands. I drew that whole picture in the anatomy of an AI agent, and the move from assist to execute is the move from using one box of it to needing all four.
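To make those three pieces concrete, here is a minimal sketch of an execute-mode loop. Everything in it is an assumption for illustration: the `Action` type, the tool names, the refund limit, and the `propose`/`execute` callables stand in for whatever your model call and tool layer actually look like. It is one possible shape, not a reference implementation.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Action:
    tool: str
    args: dict


# Hypothetical policy: which tools may run, and a cap on refund size.
ALLOWED_TOOLS = {"update_ticket", "send_email", "issue_refund"}
MAX_REFUND = 100.0


def guardrail(action: Action) -> tuple[bool, str]:
    """Vet an action before it lands. Returns (allowed, reason)."""
    if action.tool not in ALLOWED_TOOLS:
        return False, f"tool {action.tool!r} is not registered"
    if action.tool == "issue_refund" and action.args.get("amount", 0) > MAX_REFUND:
        return False, "refund exceeds limit, escalate to a human"
    return True, "ok"


def run_agent(
    propose: Callable[[list], Optional[Action]],
    execute: Callable[[Action], str],
    max_steps: int = 5,
) -> list:
    """Run until the agent says it is done, is blocked, or runs out of steps."""
    history: list = []
    for _ in range(max_steps):            # exit condition 1: hard step budget
        action = propose(history)         # stands in for the model call
        if action is None:                # exit condition 2: agent reports done
            return history
        allowed, reason = guardrail(action)
        if not allowed:
            history.append(("blocked", action, reason))
            return history                # stop and hand back to a human
        history.append(("done", action, execute(action)))
    history.append(("stopped", None, "step budget exhausted"))
    return history
```

The design choice worth noticing is that the guardrail sits in the loop, outside the model: a blocked action never reaches `execute`, no matter how confident the model was when it proposed it.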

Most "we are doing agentic AI" decks I get sent are actually still assist mode with a nicer wrapper. That is fine. It is also not the thing that changes your economics, and pretending a suggestion engine is an autonomous agent is how a board ends up disappointed two quarters later.

The failures I see are operating-model failures, not model failures

When a client tells me their agent is unreliable, I have learned not to start at the model. The fault is almost always somewhere in how the thing is owned and operated. Three gaps account for most of it.

Nobody owns the agent

The pilot was built by a smart engineer on a side quest, it works in the demo, and then that engineer moves to the next project. There is no one whose job is to watch this agent, so when it quietly degrades, nobody sees it. An agent in production is a system someone has to run, not a feature you ship and forget. If you cannot name the person who owns it, you do not have a production agent, you have a demo on borrowed time.

There is no eval, so "better" is a vibe

Ask a team how they know the agent is improving and you usually get silence or a story. Without an eval set built from real failures, every prompt change is a guess and every regression is invisible until a customer finds it. This is the single highest-impact discipline I install on an engagement, and it is boring, which is why teams skip it. I wrote the playbook for building evals from production failures in stop testing your agents on the happy path.
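The smallest useful version of this fits in a few lines. The sketch below assumes each eval case is a past failure reduced to an input plus a simple substring check; the `EvalCase` fields, the case ids, and the two stub agents are all made up for illustration, and real checks are usually richer than a substring match.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    case_id: str       # hypothetical id of the ticket or trace that failed
    user_input: str
    must_contain: str  # simplest possible pass condition


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> dict:
    """Run every case through the agent and report the pass rate and failures."""
    failed = [
        c.case_id
        for c in cases
        if c.must_contain.lower() not in agent(c.user_input).lower()
    ]
    pass_rate = (len(cases) - len(failed)) / len(cases) if cases else 0.0
    return {"pass_rate": pass_rate, "failed": failed}


# Illustrative cases, as if harvested from two real failures.
CASES = [
    EvalCase("t-101", "Reset my password", "reset link"),
    EvalCase("t-102", "Cancel my order 55", "order 55"),
]


# Stub agents standing in for two prompt versions.
def agent_v1(text: str) -> str:
    return "Here is your reset link."


def agent_v2(text: str) -> str:
    return f"Here is your reset link. Regarding your request: {text}"


if __name__ == "__main__":
    print("v1", run_evals(agent_v1, CASES))
    print("v2", run_evals(agent_v2, CASES))
```

With something like this in place, a prompt change comes with a number and a list of failing case ids, so a regression shows up in the run instead of in a customer complaint.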

The tool layer is the actual bug

When an execute-mode agent does the wrong thing, the tool it called usually let it. A tool that does two things, returns null on failure, or dumps raw database rows into the context will make a perfectly good model look broken. I run three questions before I touch a prompt: is the tool atomic, what happens on failure, and is it typed and token-efficient. The full version is in your agents are not broken, your tools are. None of these three failures is a model problem, and none of them is fixed by upgrading the model.
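As a rough illustration of those three questions, here is a sketch of a tool that does one thing, fails with a readable error instead of null, and returns a small typed result rather than a raw row. The `ToolResult` type, the `get_ticket_status` name, and the in-memory ticket store are assumptions invented for the example, not part of any real library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolResult:
    ok: bool
    data: Optional[dict] = None
    error: Optional[str] = None  # explicit, readable failure instead of None


# Hypothetical in-memory store standing in for a ticketing system.
_TICKETS = {
    42: {"id": 42, "status": "open", "owner": "dana", "internal_notes": "x" * 5000},
}


def get_ticket_status(ticket_id: int) -> ToolResult:
    """Atomic: reads one ticket's status and changes nothing."""
    # Typed: reject anything that is not a plain integer (bool is an int subclass).
    if not isinstance(ticket_id, int) or isinstance(ticket_id, bool):
        return ToolResult(ok=False, error="ticket_id must be an integer")
    row = _TICKETS.get(ticket_id)
    if row is None:
        return ToolResult(ok=False, error=f"ticket {ticket_id} not found")
    # Token-efficient: return only the fields the agent needs, not the raw row.
    return ToolResult(ok=True, data={"id": row["id"], "status": row["status"]})
```

A model that receives "ticket 7 not found" can recover or escalate. A model that receives nothing tends to guess, and the guess is what shows up as "the agent is broken."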

Diagnose before you pick a pattern

The playbook uses a five-driver diagnostic before any rollout, and I do something similar, just blunter. Before scoping an engagement I want honest answers on five things, because they predict whether a pilot survives contact with production better than any model benchmark does.

The five operating-model questions I ask before scoping an agent

Driver	The honest question	Red flag answer
Ownership	Who runs this after launch?	No name, or "the AI team"
Evals	How do you know it improved?	We tried it and it felt better
Data access	Can the agent reach clean, scoped data?	It queries the prod DB directly
Guardrails	What vets an action before it runs?	We trust the model
Scope	Is this an agent or a workflow?	Everyone disagrees in the room

That last row matters more than it looks. Half the projects labeled "agentic" are deterministic workflows that would be cheaper, faster, and more reliable as plain code with one model call in the middle. Calling it an agent does not make it one, and the distinction has real cost consequences, which I pull apart in AI agent vs agentic AI.
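For a sense of what "plain code with one model call in the middle" can look like, here is a minimal sketch of a ticket router. The queue names, the labels, and the `classify` callable are hypothetical; `classify` stands in for a single model call and is stubbed out in practice by whatever client you use.

```python
from typing import Callable

# Hypothetical label-to-queue mapping.
ROUTES = {
    "billing": "finance-queue",
    "outage": "oncall-queue",
    "other": "helpdesk-queue",
}


def route_ticket(text: str, classify: Callable[[str], str]) -> dict:
    """Deterministic workflow: clean, classify once, validate, route."""
    cleaned = " ".join(text.split())[:2000]    # deterministic pre-processing
    label = classify(cleaned).strip().lower()  # the single model call
    if label not in ROUTES:                    # never trust free-form output
        label = "other"
    return {"label": label, "queue": ROUTES[label]}
```

There is no loop, no tool selection, and no exit condition to get wrong. If the work fits this shape, it is a workflow, and it can be tested like ordinary code by passing in a fake `classify`.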

Centralized, federated, or a center of excellence

The playbook spends real time on operating structures, from a central team to a federated model to a formal center of excellence. For the enterprise it is written for, that is the right conversation. For a mid-market firm, my advice is simpler and a little contrarian: start centralized, deliberately, even though it does not scale. One small team that owns the first two or three agents end to end will teach you what your real patterns, tools, and guardrails are. You federate later, once you have something worth copying. Standing up a center of excellence before you have shipped a single agent is how you get a governance committee with nothing to govern.

The thing worth centralizing first is not the agents, it is the shared infrastructure underneath them: the tool registry, the eval suite, the observability stack, and the guardrail policy. Those are the parts every agent reuses, and the parts a scattered set of teams will each build badly. The observability layer in particular is the one I ship on every engagement, because an agent you cannot see is an agent you cannot run; that stack is in the agent observability stack we ship to every client.

The 90-day plan that ships something real

The playbook proposes a 90-day execution window, and I am a believer in the timebox, with one rule: the output at day 90 is a running agent in front of real users, not a strategy. Here is the shape I run.

1. Weeks 1-3: pick one painful, bounded workflow. One workflow, owned by one team, with a clear success check. Resist the platform instinct. The goal is a real agent in production, not a framework for future agents.

2. Weeks 4-8: build the agent and the operating layer together. Ship the agent, but also the eval set, the observability, and the guardrail gate in the same sprint. The operating layer is not phase two. It is what makes phase one trustworthy.

3. Weeks 9-12: put it in front of users and watch the failures. Real traffic surfaces the real failure modes. Every one becomes an eval case. By day 90 you have a working agent and, more valuable, a tested map of how it breaks.

For multi-step work, you also commit to an orchestration shape in this window, and getting it wrong is expensive to undo. The trade-off between a central supervisor and peer handoffs is in supervisor pattern vs handoffs. Decide it on purpose, early.

Scale-breakers: what stops a pilot becoming production

The playbook calls these scale-breakers, and the name is good. In my experience the ones that actually stop a working pilot from going wide are rarely technical. The agent works. What breaks is the operating model around it: no budget for someone to own it, a security review that arrives after build instead of before, data that is too messy to scope cleanly, or a guardrail story that is "the model is careful." That last one is also a security boundary, not just a safety one, and it is the surface that got attacked across the industry this year. I made the full case in your agent's supply chain is the attack surface.

Where a bigger model actually helps

I am not arguing the model never matters. A genuinely stronger model raises the ceiling on tasks that need deep reasoning, cuts the number of steps an agent takes to reach an answer, and sometimes lets you collapse a fragile multi-agent system into a simpler one. Those are real wins and worth re-benchmarking when a new frontier model ships. The point is sequence. A better model multiplies a good operating model and does almost nothing for a missing one. Fix the operating model first, then let the model upgrade compound on top of it.

Microsoft wrote the playbook for enterprises with the scale to run it formally. The lesson translates down: for any team, the work that decides whether agents pay off is the unglamorous operating-model work: ownership, evals, tooling, and guardrails. That is exactly the work I do in consulting engagements, and what we teach teams to run themselves in our agentic AI courses.
