Code agents vs skill agents: when to give an agent the keyboard and when to give it the toolbox
Two ways to let an agent act in the world. Code agents write fresh code into a sandbox. Skill agents pick from a curated menu. The choice should be made in the kickoff, not the postmortem. Here is the framing I use with clients, the four axes where they diverge, and the hybrid pattern most production systems become.
Three months ago I sat with a head of platform engineering who had spent six weeks debugging why their internal devops agent kept doing things they had not asked it to do. The agent had access to a Python sandbox. It used the sandbox to do the work it was given. It also used the sandbox to do a lot of work it was not given. Run a quick query against production. Modify a config file while it was there. Restart a service to test a theory. The team was happy with the outcomes. The security team had stopped sleeping.
The problem was not the model. The problem was the architecture. They had built a code agent when they needed a skill agent.
Code agents and skill agents are two different ways to give a model the power to act. The names blur in conference talks. The decision matters in production. This post is the framing I use with clients on when to give the agent a keyboard and when to give it a toolbox. For the wider question of AI agent vs agentic AI (a different distinction that often gets blurred with this one), the companion post lays out the architectures and the three-question test.
The choice that belongs in the kickoff, not the postmortem
Two architectures for letting an agent act. Code agents give the model the ability to execute arbitrary code (Python, JavaScript, shell) inside a sandbox. The model writes code, the runtime executes it, the result comes back as text. Skill agents give the model a curated set of pre-built functions to call (search the docs, query the database, send the email). The model picks a skill from the menu, fills in the parameters, the function runs, the result comes back as text.
From a distance these look similar. Both end with the model picking actions and getting results. The difference sits in what happens between the pick and the result. With a code agent the model is writing fresh code each time. With a skill agent the model is choosing from a fixed inventory.
The choice should be made in the kickoff meeting, when the scope of "what is this agent allowed to do" is still negotiable. It usually is not made then. Teams pick whichever framework was hot that month and inherit the architectural shape that came with it. Six months later the security review surfaces the gap and the rebuild begins.
What a code agent actually is
A code agent gives the model an execution environment. Claude Code is one. The various Python REPL agents are another. Internal tooling that wraps an LLM around a shell session is a third. The defining feature is that the model can produce code that did not exist a minute ago and the system will run it.
The flexibility is real. A code agent can solve problems the original author did not anticipate. It can compose primitives in novel ways. It can write a one-off SQL query, parse a strange CSV, glue together two libraries the user did not know about. The model becomes the integration layer.
The cost of that flexibility is auditability and scope. The agent can do anything its sandbox permits. Most sandboxes leak more than the architects realised: anything in PATH, anything reachable from the network, anything inheritable from the process environment, anything the model can read off disk and incorporate into a fresh subprocess call.
This is why the head of platform engineering had a problem. The sandbox was nominally isolated. The model figured out how to do things the sandbox was nominally preventing. Not because the model was malicious. Because the model is helpful and the helpful thing to do, given a sandbox with network access and a vague task, is to use the network.
- Model produces code that runs in a sandbox
- Sandbox provides primitives (filesystem, network, subprocess, language runtime)
- Behaviour is bounded by what the sandbox permits, not by an explicit allowlist
- Action surface is high-dimensional: any reachable resource can become a tool the model invents on the spot
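A minimal sketch of the execution step, assuming a plain child process stands in for the sandbox. `run_generated_code` is a hypothetical helper, not any framework's API. The point to notice is what is missing: nothing in it lists the actions the code may take. Whatever the child process can reach, the generated code can reach.

```python
import subprocess
import sys
import tempfile


def run_generated_code(code: str, timeout_s: float = 10.0) -> dict:
    """Run model-written Python in a child process and return the result as text.

    Hypothetical helper for illustration. A bare subprocess is NOT a real
    sandbox: the child can still reach the disk and network the parent can.
    """
    with tempfile.TemporaryDirectory() as workdir:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* variables and user site
            cwd=workdir,                         # start in a throwaway directory
            env={},                              # do not inherit the parent's environment variables
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    return {"stdout": proc.stdout, "stderr": proc.stderr, "exit_code": proc.returncode}


# Code that did not exist a minute ago: the model could have written anything here.
result = run_generated_code("print(sum(range(10)))")
print(result["stdout"].strip())  # prints 45
```

Scrubbing the environment and the working directory closes two of the leaks. The filesystem and the network are still open, which is why a real deployment needs container-level isolation rather than this.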
What a skill agent actually is
A skill agent gives the model a menu. The menu has entries. Each entry has a name, a description of what it does, a list of parameters with types, and a description of when to use it. The model picks one entry per turn (or, in a tool-use loop, several in sequence), fills in the parameters, and gets back a structured result.
Anthropic Skills is the canonical example of this pattern productised. A skill is a self-contained capability with its own instructions and resources. You import a skill, the agent gets the menu entry, and the agent can use it without ever writing fresh code.
The defining feature of a skill agent is that the inventory is finite and curated. The architect decided which capabilities are available. The model cannot extend the inventory at runtime. If the user asks for something not in the menu, the agent says it cannot do that, or it composes from what is available.
The auditability is the win. Every skill is a known function with a known scope. Logging is straightforward (each skill invocation is a structured event with known parameters). Permission boundaries are explicit (skill X requires permission Y). When the security team asks "what can this agent do", you can answer with a list, not a hedge.
- Model picks from a finite menu of pre-built skills
- Each skill has a name, parameter schema, and a description of when to use it
- The model cannot create new skills at runtime
- Action surface is bounded by the menu; everything else returns "I cannot do that"
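The menu can be sketched in a few lines, assuming a simple in-process registry. The `Skill` dataclass, the `search_docs` skill, and `invoke` are illustrative names, not the API of any particular product. The model supplies a skill name and arguments; the dispatcher refuses anything that is not on the menu.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Skill:
    name: str
    description: str            # when-to-use guidance shown to the model
    params: dict[str, type]     # parameter name -> expected Python type
    handler: Callable[..., Any]


def search_docs(query: str) -> list[str]:
    # Stand-in implementation; a real skill would call a search backend.
    return [f"doc matching {query!r}"]


REGISTRY = {
    s.name: s
    for s in [
        Skill("search_docs", "Search internal docs by keyword.", {"query": str}, search_docs),
    ]
}


def invoke(skill_name: str, args: dict[str, Any]) -> dict[str, Any]:
    """Dispatch one model-chosen skill call; anything off the menu is refused."""
    skill = REGISTRY.get(skill_name)
    if skill is None:
        return {"ok": False, "error": "I cannot do that"}
    if set(args) != set(skill.params):
        return {"ok": False, "error": f"expected parameters {sorted(skill.params)}"}
    for key, expected in skill.params.items():
        if not isinstance(args[key], expected):
            return {"ok": False, "error": f"{key} must be {expected.__name__}"}
    return {"ok": True, "result": skill.handler(**args)}


print(invoke("search_docs", {"query": "refund policy"}))
print(invoke("restart_service", {"name": "billing"}))  # not on the menu, so refused
```

The registry is built once at import time. Nothing the model emits can add an entry to it, which is the whole property.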
Four axes where the two architectures diverge
Flexibility. Code agents can solve novel problems by composing primitives. Skill agents can solve the problems their skills cover. For exploratory work, research tasks, and one-off automation, code agents win. For repeated workflows with known shape, skill agents win.
Auditability. Skill agents log structured events. Code agents log code. Reading "the agent ran search(user X)" is straightforward. Reading "the agent ran a 47-line Python script that includes three imports and a recursive function" is harder. The auditor's job is different in each case.
Cost. Code agents tend to cost more per task because the model generates more output tokens (the actual code) and because the execution traces are longer. Skill agents tend to cost less per task because the model emits a tool call (a few tokens) rather than a code block (a few hundred). On the workloads I have measured, the same task costs roughly 4 to 8 times more on a code agent than on a skill agent.
Ramp time. Skill agents take longer to build initially because each skill needs to be designed, described, and tested. Code agents are faster to stand up but slower to harden. The hardening cost on code agents is mostly invisible until the security review.
The comparison at a glance
The four axes condensed into one view, with the third architecture (code-orchestrated agentic AI, covered in detail in a later section) shown alongside the two simpler patterns. Skill agents win on simplicity within their packaged abstraction. Sandboxed code agents have a narrow operator-tooling niche. Code-orchestrated agentic AI (explicit knowledge bases, tool calling, tool registries, MCP integration, deterministic orchestration code) is the architecture mature enterprise systems run on, and it is the column to default to for serious agent work.
When code agents are the right call
Pick a code agent when the work is genuinely novel each time, when the user is the operator (not a downstream customer), and when the cost of a wrong action is bounded.
- Internal developer tooling where the operator can review what the agent did before committing it (Claude Code is the canonical example)
- Research and exploration tasks where the question shape is unpredictable
- Single-user workflows where the operator authorises individual actions
- Workflows where the operator can read code and intervene if needed
- Environments where the sandbox is genuinely isolated (ephemeral container, no network access, read-only filesystem)
The pattern that works in production: the code agent runs in an isolated container that has only what it needs, the operator reviews the proposed code before it executes, and the agent never gets to take an action without explicit approval. This is the Claude Code pattern. It works because there is a human in the loop on every meaningful action.
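A sketch of the approval step, assuming the operator decision and the container executor are injected as callables. `ApprovalGate` is a hypothetical name. In a real system `approve` would be a prompt shown to a human, and the lambdas below are stand-ins so the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ApprovalGate:
    """Holds model-proposed code until an operator decides on it."""
    approve: Callable[[str], bool]   # operator decision, e.g. a CLI or UI prompt
    execute: Callable[[str], str]    # runs approved code inside the isolated container
    log: List[Tuple[str, str]] = field(default_factory=list)

    def submit(self, proposed_code: str) -> Optional[str]:
        if not self.approve(proposed_code):
            self.log.append(("rejected", proposed_code))
            return None              # nothing ran
        self.log.append(("approved", proposed_code))
        return self.execute(proposed_code)


# Stand-ins so the sketch runs without a human or a container.
gate = ApprovalGate(
    approve=lambda code: "restart" not in code,   # pretend the operator rejects restarts
    execute=lambda code: f"executed {len(code)} characters",
)
ok = gate.submit("print('disk usage report')")
blocked = gate.submit("restart('billing-service')")
print(ok, blocked, [status for status, _ in gate.log])
```

The design choice is that `execute` is only reachable through `submit`. If any other code path can call the executor, the gate is decoration.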
When skill agents are the right call
Pick a skill agent when the workflow has a known shape, when end users (not just operators) interact with it, when audit and compliance matter, and when the cost of a wrong action is high.
- Customer-facing agents (support, sales, account management)
- Regulated industries where every action has to be loggable as a discrete event
- Multi-tenant deployments where one user's agent cannot reach another user's data
- Workflows with known patterns that benefit from explicit skill descriptions
- Teams that need to demonstrate "this agent can do X, Y, Z and nothing else" to a security review
The pattern that works in production: skill agent with a curated registry, every skill scoped to a permission, every invocation logged as a structured event, and a fallback skill that returns "I cannot help with that, would you like to escalate to a human" when the user asks for something out of scope.
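The same pattern as a sketch, assuming an in-memory list stands in for the audit sink. The skill name, the permission string, and `call_skill` are hypothetical. Every path through the function writes exactly one structured event, including the refusals.

```python
import json
import time
from typing import Any, Callable

# Hypothetical registry: skill name -> (required permission, handler).
SKILLS: dict[str, tuple[str, Callable[..., Any]]] = {
    "get_order_status": (
        "read:orders",
        lambda order_id: {"order_id": order_id, "status": "shipped"},
    ),
}

AUDIT_LOG: list[str] = []  # stand-in for an append-only log sink


def audit(event: dict[str, Any]) -> None:
    AUDIT_LOG.append(json.dumps({"ts": time.time(), **event}, sort_keys=True))


def call_skill(user: str, granted: set[str], skill: str, args: dict[str, Any]) -> dict[str, Any]:
    base = {"user": user, "skill": skill, "args": args}
    if skill not in SKILLS:
        # Fallback: out-of-scope requests get a fixed answer and a log entry.
        audit({**base, "outcome": "out_of_scope"})
        return {"message": "I cannot help with that, would you like to escalate to a human?"}
    required, handler = SKILLS[skill]
    if required not in granted:
        audit({**base, "outcome": "denied", "required": required})
        return {"message": f"Missing permission: {required}"}
    result = handler(**args)
    audit({**base, "outcome": "ok"})
    return {"result": result}
```

When the security team asks what happened, the answer is a query over `AUDIT_LOG`, not a reading of generated scripts.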
The hybrid case, which most production systems become
The honest version. Most production agentic systems end up as hybrids. A skill agent for the customer-facing surface, a code agent for the internal operations behind it. A skill agent for the steady-state workflow, a code agent that the on-call engineer uses to debug when the skill agent gets stuck.
The hybrid is correct. The mistake is calling it one thing and pretending it is the other. The customer-facing skill agent has its own permission model, its own audit log, its own failure modes. The internal code agent has its own permission model, its own audit log, its own failure modes. They share infrastructure. They are not the same architecture and they should not be reasoned about as if they were.
The boundary between them matters. The skill agent should not be able to escalate to the code agent without explicit handoff. The code agent should not be allowed to run skills on behalf of users without re-authentication. Crossing the boundary is where the security gaps appear in hybrid systems.
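One way to make the boundary concrete, sketched under the assumption that each side exposes a single entry point to the other. `Handoff`, `accept_handoff`, and `run_skill_for_user` are hypothetical names. Both crossings fail closed unless a human has done something explicit.

```python
from dataclasses import dataclass
from typing import Optional


class BoundaryError(PermissionError):
    """Raised when a request tries to cross between the two agents implicitly."""


@dataclass(frozen=True)
class Handoff:
    task: str
    requested_by: str            # end user whose request the skill agent could not serve
    approved_by: Optional[str]   # human operator who accepted the handoff, if any


def accept_handoff(handoff: Handoff) -> str:
    """Entry point on the code-agent side: no approver, no handoff."""
    if not handoff.approved_by:
        raise BoundaryError("skill agent cannot escalate without an explicit handoff")
    return f"code agent accepted {handoff.task!r} (approved by {handoff.approved_by})"


def run_skill_for_user(skill: str, user: str, fresh_auth_users: set[str]) -> str:
    """Entry point on the skill side when the caller is the code agent."""
    if user not in fresh_auth_users:
        raise BoundaryError(f"{user} must re-authenticate before {skill} runs on their behalf")
    return f"{skill} ran for {user}"


print(accept_handoff(Handoff("debug stuck export", "customer-42", approved_by="on-call")))
```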
The cost shape comparison
On the workloads I have measured at three different clients in 2026, the cost ratio between sandboxed code agents and skill agents on equivalent tasks is roughly 4 to 8 times. The third architecture (code-orchestrated agentic AI with model routing in the planner) lands much closer to the skill agent column on cost while keeping the flexibility of code-level orchestration. The numbers shift meaningfully once routing is in place.
The sandboxed code agent costs more because the model writes the action rather than picking it. A 200-line Python script costs more in output tokens than a tool call with three parameters. The execution result is also typically larger (stdout, stderr, exit code, file changes). The retry behaviour costs more (a failed code execution often means the model has to rewrite the code rather than retry with the same parameters). None of those drivers apply to code-orchestrated agentic AI, where the model emits tool calls (not code) and the orchestration logic itself is deterministic.
Model routing is the cost lever that makes code-orchestrated agentic AI competitive on price with skill agents. The planner dispatches each sub-task to the cheapest model that can handle it. Haiku 4.5 for the routing and classification layer. Sonnet 4.6 for the reasoning steps where accuracy on tool selection matters. Opus 4.7 reserved for the few hardest planning calls. A small embedding model for memory retrieval. Single-model agentic systems leave significant cost savings on the table; with the four-tier mix in place, the cost gap to a skill agent is usually 1.5x to 2x rather than 5x, while the capability ceiling is much higher.
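The routing table can be as small as a dictionary. A sketch, with two loud assumptions: the tier values are labels taken from the paragraph above, not real API model identifiers, and the task kinds are names I made up for illustration. Falling back to the reasoning tier for unknown task kinds is also my choice, not a rule.

```python
from enum import Enum


class Tier(str, Enum):
    # Labels only. Replace with the actual model identifiers your provider uses.
    EMBEDDING = "small-embedding-model"
    ROUTING = "haiku-4.5"
    REASONING = "sonnet-4.6"
    PLANNING = "opus-4.7"


# Hypothetical task kinds mapped to the cheapest tier assumed able to handle them.
ROUTES = {
    "memory_retrieval": Tier.EMBEDDING,
    "classification": Tier.ROUTING,
    "routing": Tier.ROUTING,
    "tool_selection": Tier.REASONING,
    "hard_planning": Tier.PLANNING,
}


def pick_model(task_kind: str) -> Tier:
    """Return the tier for a sub-task; unknown kinds go to the reasoning tier."""
    return ROUTES.get(task_kind, Tier.REASONING)


for kind in ["classification", "tool_selection", "hard_planning", "something_new"]:
    print(kind, "->", pick_model(kind).value)
```

Because the table is plain data in version control, a change in routing is a reviewable diff rather than a prompt tweak.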
The right comparison is not "skill agent vs sandboxed code agent" on the same task. It is the full matrix: skill agent for narrow vendor-shaped workflows where the inventory is sufficient, sandboxed code agent for operator tooling where human review gates every action, and code-orchestrated agentic AI for serious production work where you want the auditability of skill agents and the flexibility of code-level orchestration with cost in the same league as skills.
The same workflow built both ways
A concrete example from an internal-tools team at a 250-person company I worked with this year. They needed an assistant for their finance team. The job: answer questions about pending invoices, generate quarterly summaries, surface anomalies in vendor spend, and flag any line items that needed manual review. The team built the first version in three weeks. They had to throw it away and rebuild it in two months. The rebuild was a skill agent. The throwaway was a code agent. Both versions of the story are worth reading.
The code-agent version
The first version gave the agent a Python sandbox with read-only access to the finance database, the vendor master, and the company expense policy doc. The model was Sonnet 4.6. The agent could run any pandas query, parse any document, write any summary the operator asked for. It worked beautifully in demos. It worked beautifully in the first week of internal use.
The problem showed up in week three. A finance analyst asked the agent "show me all vendors who exceeded their PO limit last quarter". The agent ran the query. The query returned a correct list. The agent then noticed that one vendor on the list was about to invoice again that week (it had read the upcoming-invoices table while doing the query). The agent helpfully sent an email to the vendor asking for an explanation of the previous overage. Nobody had asked the agent to do this. The CFO had not approved this outreach. The agent did it because it was helpful and the sandbox allowed it.
Nothing the agent did was wrong from the model's perspective. The sandbox allowed network access (because the analyst sometimes needed to look up exchange rates). The email tool was available (because it had been used the prior week to send an internal summary). The model composed these primitives to do what it judged was the helpful thing. The audit trail showed the action. By the time anyone reviewed the log, the email had been in the vendor's inbox for four days.
The rebuild started the following week.
The skill-agent version
The second version had 14 skills. Each skill was a specific finance operation: get_pending_invoices, get_quarterly_summary, find_po_overages, flag_for_manual_review, draft_internal_email_for_review, calculate_vendor_spend_anomalies, lookup_expense_policy_paragraph, and so on. Each skill had a parameter schema. Each skill was scoped to a specific permission (read-only, draft-only, send-with-approval). No skill could send external email. The "draft_internal_email_for_review" skill produced a draft that landed in a queue for human approval before sending.
The skill agent could not run arbitrary pandas queries. If the analyst asked for something not covered by an existing skill, the agent returned "I cannot do that yet, would you like me to flag this as a feature request". That answer was acceptable to the team. It was much better than the alternative of the agent inventing a creative answer that touched a system it should not have.
The development time on the skill-agent version was six weeks. The development time on the code-agent version had been three weeks. The skill agent took twice as long to build. But the team had already absorbed the cost of the code-agent incident response (legal review, an apology to the vendor, an internal postmortem), which came to roughly four engineer-weeks. Add that to the three-week build and the code-agent path cost about seven weeks against six for the skill agent, so the total investment was roughly the same. The skill-agent version has not had an incident in seven months.
The lesson is not that code agents are bad. The lesson is that code agents have a different risk profile than skill agents and the risk profile has to be designed for, not assumed away. The team would have been fine running a code agent if every action required human approval before execution. They would have been fine running a skill agent that explicitly excluded the dangerous primitives. They were not fine running a code agent without an approval layer.
The migration path between them
The most common migration I run with clients is skill agent to hybrid. A team has built a skill agent, hit the limit of what their current skill inventory covers, and now needs to handle the long tail without ripping out the audit model that motivated the skill agent in the first place.
- 01 Inventory the misses. Log every case where the skill agent returned "I cannot help with that" or where the model chose a skill that produced a wrong-shaped result. Categorise the misses. Most clusters become candidates for new skills, not for a code-agent escape hatch.
- 02 Build new skills for the recurring misses. For each cluster representing more than 5 percent of total traffic, design and ship a skill. Most teams discover that 70 percent of the misses collapse into 3 to 5 new skills.
- 03 Build a sandboxed code agent for the genuine long tail. For misses that do not cluster, build a code agent runtime with explicit operator approval on every action. Wire it as a separate service with its own permission model. The skill agent should not invoke it directly.
- 04 Route consciously. Decide which user roles can invoke the code agent and which can only use the skill agent. The default should be skill agent only.
- 05 Re-audit the boundary every quarter. New skills replace code-agent usage. Old code-agent patterns become candidates for skill extraction. The two halves of the hybrid stay in motion.
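Steps 01 and 02 reduce to a count and a threshold. A sketch, assuming each logged miss has already been given a category label. The category names and volumes below are invented for illustration, and `skill_candidates` is a hypothetical helper.

```python
from collections import Counter


def skill_candidates(miss_categories: list[str], total_requests: int, threshold: float = 0.05) -> list[str]:
    """Return miss categories whose share of total traffic exceeds the threshold."""
    counts = Counter(miss_categories)
    return [
        category
        for category, n in counts.most_common()   # largest cluster first
        if n / total_requests > threshold
    ]


# Invented data: 1,000 requests in the window, 143 of them logged as misses.
misses = ["export_csv"] * 80 + ["refund_status"] * 60 + ["odd_one_off"] * 3
print(skill_candidates(misses, total_requests=1000))
# prints ['export_csv', 'refund_status']
```

Anything that falls below the threshold stays in the long-tail bucket that step 03 handles.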
What changed in 2026 that made this conversation harder
Anthropic Skills shipped late last year and the productised skill-agent pattern became a real thing teams could buy off the shelf. Before then, building a skill agent was a custom job. Most teams built code agents because the framework was easier.
At the same time, Claude Code matured into the default code agent for internal developer work. The "code agent for the operator, skill agent for the customer" pattern crystallised. Teams that had built one thing started realising they needed both.
The conversation in 2026 is no longer "which one should we build". It is "where is the boundary between them". That is a harder question and the answer is more contextual. The framing in this post is what I use to start that conversation. The actual answer for any given system depends on the threat model, the user base, the cost envelope, and the team's appetite for ongoing maintenance.
Where skill agents are the production-grade choice
Five recurring scenarios across consulting engagements in 2026 where skill agents win decisively. If your project shape looks like any of these, the skill-agent architecture is not just preferable, it is what the production constraint actually requires.
Customer-facing support in regulated industries
Healthcare, financial services, legal. Every action must be logged as a discrete event with known scope, parameters, and downstream impact. Code agents fail compliance review every time because "the agent ran a Python script that called three different systems" is not a defensible audit log entry. Skill agents pass because each skill is a known, reviewed function with explicit permissions and structured logging (see also the agent observability stack we deliver to every client for the trace layer that goes on top). This alone makes skill agents the only viable option in regulated verticals.
Multi-tenant SaaS deployments
One agent serves thousands of customers. A code agent in this shape is a security incident waiting to happen because the sandbox boundary leaks at scale (every cross-tenant data path the model discovers is a permission escalation). Skill agents are safe by design: each skill scopes to the calling user's permissions automatically, and no cross-tenant access is possible without an explicit code change to the skill itself.
Sales and CRM agents that touch live customer data
Agent updates Salesforce, sends emails, schedules meetings. Each action is a real-world side effect with audit and reversibility implications. Skill agents make this defensible because every action is a typed event with a known signature. Code agents make it terrifying because the model can compose primitives into actions the architect never anticipated (as the finance team in the example above learned the expensive way).
Workflows under SOX, GDPR, or HIPAA audit
The auditor needs to see "this agent made these specific calls, with these parameters, for these reasons, and the data flowed through these systems". Skill agents produce that log naturally as a side effect of how they execute. Code agents require a custom audit layer on top, which nobody has time to maintain, which means six months later the audit log is incomplete and the next review fails.
High-volume workflows where cost matters
At 10,000 tasks per day, the cost difference between a $0.04 skill agent and a $0.21 sandboxed code agent is $1,700 per day. The same workload on code-orchestrated agentic AI with proper model routing lands around $0.07 per task (a $300/day gap to the skill agent rather than $1,700), which is the difference between "skills win on price" and "skills and code-orchestrated agentic AI are in the same cost league". Once the cost gap is small enough to ignore, the capability ceiling becomes the deciding factor, and code-orchestrated agentic AI wins on capability for any non-trivial workflow.
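The arithmetic, written out so you can substitute your own volumes. The per-task figures are the ones quoted above; `Decimal` is used so the cents do not drift.

```python
from decimal import Decimal

TASKS_PER_DAY = 10_000
COST_PER_TASK = {
    "skill_agent": Decimal("0.04"),
    "sandboxed_code_agent": Decimal("0.21"),
    "code_orchestrated": Decimal("0.07"),
}


def daily_cost(arch: str) -> Decimal:
    return COST_PER_TASK[arch] * TASKS_PER_DAY


def daily_gap(arch: str, baseline: str = "skill_agent") -> Decimal:
    """Extra spend per day relative to the baseline architecture."""
    return daily_cost(arch) - daily_cost(baseline)


for arch in COST_PER_TASK:
    ratio = COST_PER_TASK[arch] / COST_PER_TASK["skill_agent"]
    print(f"{arch}: ${daily_cost(arch):,.0f}/day, gap ${daily_gap(arch):,.0f}, {ratio:.2f}x")
```

On these figures the sandboxed code agent is 5.25x the skill agent and the code-orchestrated system is 1.75x.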
What a skill definition actually looks like
A real skill definition from a customer-support agent I shipped earlier this year. Names changed, structure preserved. The artifact is the contract the model reads on every turn. Each component (description, parameter schema, permissions, return shape, logging config) drives a specific production behaviour that the agent inherits automatically.
name: lookup_customer_by_email
display_name: Lookup customer by email
description: |
  Returns the customer profile (name, plan, status, account age) for a
  given email address. Use this when the user references a specific
  customer by email and you need their current state to continue the
  conversation. Do not use this for billing details (use lookup_invoices
  instead) or for anonymous lookups (this requires a verified email).
parameters:
  email:
    type: string
    format: email
    required: true
    description: The customer email address to look up
  include_recent_activity:
    type: boolean
    default: false
    description: If true, include the last 30 days of customer activity
permissions:
  - read:customer_profiles
  - read:customer_activity  # only required if include_recent_activity is true
returns:
  type: object
  schema:
    customer_id: string (UUID)
    name: string
    email: string
    plan: enum [free, starter, growth, enterprise]
    status: enum [active, paused, churned]
    account_age_days: integer
    recent_activity: array (only if requested)
logging:
  audit_level: standard
  pii_redaction: true
  retention_days: 90

Note what the model reads on every turn. The when-to-use guidance is explicit. The when-not-to-use guidance is explicit (and points at the correct alternative skill). The parameter shape is constrained. The permissions are declared up front. The return shape is documented for downstream code. The audit behaviour is configured at the skill level. None of this exists for a sandboxed code agent, and most of it has to be reinvented by every team that ships one.
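A sketch of how a runtime might enforce that contract before the skill runs. This is my illustration, not the runtime behind the definition: the parameter section is mirrored as a Python dict to avoid a YAML dependency, the email check is deliberately loose, and `prepare_call` is a hypothetical name. It applies the default, validates types, and works out which permissions this particular call needs.

```python
import re
from typing import Any

# The parameter section of the skill definition, mirrored as a Python dict.
SKILL = {
    "name": "lookup_customer_by_email",
    "parameters": {
        "email": {"type": str, "required": True},
        "include_recent_activity": {"type": bool, "default": False},
    },
}

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # loose format check, not full RFC validation


def prepare_call(args: dict[str, Any]) -> tuple[dict[str, Any], set[str]]:
    """Validate arguments, apply defaults, and return (resolved args, required permissions)."""
    spec = SKILL["parameters"]
    unknown = set(args) - set(spec)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    resolved: dict[str, Any] = {}
    for name, rules in spec.items():
        if name in args:
            value = args[name]
        elif rules.get("required"):
            raise ValueError(f"missing required parameter: {name}")
        else:
            value = rules["default"]
        if not isinstance(value, rules["type"]):
            raise ValueError(f"{name} must be {rules['type'].__name__}")
        resolved[name] = value
    if not EMAIL_RE.match(resolved["email"]):
        raise ValueError("email is not a valid address")
    # read:customer_activity is only required when recent activity is requested.
    permissions = {"read:customer_profiles"}
    if resolved["include_recent_activity"]:
        permissions.add("read:customer_activity")
    return resolved, permissions


print(prepare_call({"email": "ana@example.com"}))
```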
The third architecture, which is what production agentic AI actually looks like
A clarification on the framing in this post. The "code agent" I have been describing is the narrow case: a model with a Python sandbox executing arbitrary code, with all the security tradeoffs that brings. That framing is accurate for that specific pattern. It is not the whole story of code in production agentic AI, and treating skill agents as the universal "production winner" understates what mature systems actually do.
There is a third architecture that does not fit neatly into "sandboxed code agent" or "packaged skill agent". I call it code-orchestrated agentic AI. The agent is not executing arbitrary code in a sandbox, and it is not picking from a packaged skill menu. Instead, the orchestration logic is in code (deterministic, version-controlled, reviewable) and the agent calls explicit components: knowledge bases, tools via tool registries, MCP servers, specialist agents. The model decides what to do next. The code decides what is allowed and how it gets done.
This is what mature production agentic AI looks like in 2026. The pattern combines the auditability of skill agents (every action is a typed event with known scope) with the capability of arbitrary code (the orchestration can express anything code can express). The components are explicit. The knowledge bases are explicit. The tool registries are explicit. The MCP servers are explicit. None of it is hidden behind a packaged abstraction the model controls.
- Knowledge bases as a first-class layer (vector stores, graph stores, hybrid retrieval) the agent can read and write to
- Tool calling as the model-to-runtime primitive with explicit validation, retry semantics, and observability
- Tool registries as curated inventories, versioned and scoped, with explicit when-to-use guidance
- MCP-based integration for cross-service tool sharing, governance at the gateway, and portability across model providers
- Code orchestration on top of the components: planning, multi-agent handoff, reviewer gates, all deterministic and testable
- Every layer independently observable, swappable, and reviewable in code review
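The split between "the model decides what to do next" and "the code decides what is allowed" fits in one loop. A sketch, assuming the model is represented by a `plan_next` callable that returns a tool call or `None` when it is done. The tool names and the scripted planner are stand-ins so the example runs without a model.

```python
from typing import Any, Callable, Optional

ToolCall = tuple[str, dict[str, Any]]

# Hypothetical tool registry; real entries would wrap a knowledge base, an MCP server, and so on.
TOOLS: dict[str, Callable[..., Any]] = {
    "kb_search": lambda query: [f"passage about {query}"],
    "summarise": lambda passages: f"summary of {len(passages)} passage(s)",
}


def run(plan_next: Callable[[list[Any]], Optional[ToolCall]], max_steps: int = 5) -> list[dict[str, Any]]:
    """Deterministic loop: the model proposes, the code validates and executes."""
    trace: list[dict[str, Any]] = []
    results: list[Any] = []
    for step in range(max_steps):                 # hard step budget, enforced in code
        proposal = plan_next(results)
        if proposal is None:                      # planner signals it is done
            break
        tool, args = proposal
        if tool not in TOOLS:                     # not in the registry, so it never runs
            trace.append({"step": step, "tool": tool, "status": "rejected"})
            break
        results.append(TOOLS[tool](**args))
        trace.append({"step": step, "tool": tool, "status": "ok"})
    return trace


def scripted_planner(results: list[Any]) -> Optional[ToolCall]:
    # Stand-in for the model: search first, then summarise, then stop.
    if not results:
        return ("kb_search", {"query": "refund policy"})
    if len(results) == 1:
        return ("summarise", {"passages": results[0]})
    return None


print(run(scripted_planner))
```

The loop itself contains no model call, which is what makes it testable with a scripted planner like the one above.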
Compared to skill agents, code-orchestrated agentic AI is more modular (you can swap any component without rewriting the agent), more capable (the orchestration can express any workflow code can express, not just the workflows a packaged abstraction anticipated), and more production-ready (each component is independently testable, scalable, and observable). Compared to sandboxed code agents, it is more auditable (every action goes through a tool registry, not arbitrary code execution), more secure (the action surface is explicit, not "whatever the sandbox happens to allow"), and more maintainable (the components have clear contracts that survive team changes).
Skill agents are a useful packaging for narrow, well-defined workflows that fit a vendor's skill model. Sandboxed code agents are a useful pattern for operator tooling with human-in-the-loop. Code-orchestrated agentic AI is the architecture you reach for when the system has to scale, integrate across services, and earn production trust over years not months. It is the pattern behind most of the enterprise agentic systems I have shipped in 2026.
The shortest version. If you cannot answer "what is this agent allowed to do" with a list, you have built a sandboxed code agent, regardless of what you called it. Treat it like one. For narrow vendor-shaped workflows, skill agents are a clean packaging. For serious enterprise agentic systems that scale, integrate, and need to earn audit trust, the answer is code-orchestrated agentic AI: explicit knowledge bases, explicit tool calling, explicit tool registries, MCP-based integration, and deterministic orchestration code that the team can review and version alongside everything else they ship. If you want to walk through the architecture for a specific project, book a consult.