How many tools is too many in one registry?

There is no hard cap, but I see meaningful selection-accuracy degradation above 25 to 30 tools in a single registry on Sonnet 4.6, and above 40 to 50 tools on Opus 4.7. Below 15 tools, the registry size is rarely the bottleneck. The right answer is to measure selection confidence on your own workloads and let that number drive the cap.

What is the difference between a tool and a skill?

A tool is a function the model can call. A skill (in the Anthropic Skills sense) is a packaged unit that may include one or more tools plus instructions, examples, and resources for how to use them. A skill is a higher-level concept built on tools. From the model's perspective, both result in callable functions in the prompt. The packaging is for the human author, not the model.

Should I use MCP or native function calling for my tools?

Native tools for things specific to a single agent. MCP tools for things shared across multiple agents or owned by a different team. The wrong move is wrapping native tools in MCP servers "for portability" before you have a second consumer. Wait until the second consumer exists, then extract.

How do I write a tool description that the model actually uses correctly?

Three things matter. A verb-led name that maps to user-visible patterns. A description that includes explicit when-to-use guidance, not just what-it-does. A parameter schema that uses enums and constrained types rather than free-text strings. Test the description by reading it aloud. Vague descriptions that look fine in code fail the aloud test.

When should I retire a tool?

Two signals. No invocations in 30 days. Or persistent low selection confidence (below 0.5 on most calls). The first means the tool is unused. The second means the model cannot distinguish when to pick it. In both cases, removal is the right move. Tools you might "need later" should not live in the active registry. Park them in a separate file and re-register only if a real use case appears.

Tool registry design for agentic AI: how the wrong registry kills accuracy before the prompt is read

I reviewed an agentic system last month with 47 tools in its registry. Every API call to the model carried roughly 14 kilobytes of tool-description overhead. The agent was hitting wrong-tool-selection rates of 22 percent on tasks the team thought were solved. Their reflex was to fix the prompt. Tighten the system message, add more examples, escalate to a more capable model.

The prompt was fine. The registry was the bug.

Tool registries are the most under-engineered part of most agentic AI systems. Teams obsess over prompts, model selection, and orchestration patterns while shipping 40-tool registries that the model cannot reason about cleanly. This post is the deep dive I send to every client after the first wrong-tool incident. What a tool actually is, what a registry actually is, the seven failure modes that show up in production, and the audit pattern I run on every client codebase before we change anything else.

What a tool actually is in the agentic AI context

A tool is a function the model can call. The function has a name, a parameter schema, a description of what it does, and an implementation that runs when the model invokes it. From the model's perspective, the tool is a contract: if I emit this name with these parameters, the runtime will give me back a result of this shape.

From the runtime's perspective, the tool is a wrapper around something else. It might wrap an HTTP API. It might wrap a database query. It might wrap a shell command. It might wrap another agent. The wrapping is where most of the engineering work lives. The function signature is the small visible part of a larger system.
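As a minimal sketch of that contract, assuming a registry that is just an in-process dictionary: the `Tool` dataclass, the `lookup_order_status` tool, and the `dispatch` helper below are hypothetical names for illustration, not part of any vendor SDK.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    """One registry entry: what the model sees plus what the runtime runs."""
    name: str
    description: str
    input_schema: dict[str, Any]        # JSON Schema for the parameters
    implementation: Callable[..., Any]  # the wrapped API, query, or command


def lookup_order_status(order_id: str) -> dict[str, str]:
    # Hypothetical stand-in for the wrapped system (HTTP API, database, shell).
    return {"order_id": order_id, "status": "shipped"}


REGISTRY = {
    "lookup_order_status": Tool(
        name="lookup_order_status",
        description=(
            "Returns the current status of one order. "
            "Use this when the user asks where an order is."
        ),
        input_schema={
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
        implementation=lookup_order_status,
    )
}


def dispatch(name: str, arguments: dict[str, Any]) -> Any:
    """Run the tool the model named; an unknown name is a runtime error."""
    if name not in REGISTRY:
        raise KeyError(f"unknown tool: {name}")
    return REGISTRY[name].implementation(**arguments)
```

The first three fields are the small visible part the model reads. Everything behind `implementation` is the wrapper, which is where the engineering work goes.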

Tools come in three common shapes in 2026. Native function-calling tools defined inline in the agent's code. MCP tools served by a separate server process and discovered via the protocol. Hosted tools provided by the model vendor (web search, computer use). Each shape has different operational characteristics. The choice matters but the underlying tool concept is the same.

What a tool registry actually is

The tool registry is the inventory of tools an agent can see in a given turn. When the model receives its prompt, the registry is serialised into the system message as a list of tool descriptions and schemas. The model picks from this list. Nothing else is visible.

This sounds obvious. It is, until you remember what it implies. The registry is part of every prompt. The serialised tool descriptions consume tokens on every call (the 47-tool registry from the intro carried roughly 14 kilobytes, about 300 bytes per tool). The model's tool-selection accuracy is bounded by what is in the registry and how it is described. Every wasted tool is wasted accuracy budget.

Most teams treat the registry as a write-only list. They add tools as they build features. Nobody removes tools. Nobody refactors descriptions. Nobody measures whether the registry is helping or hurting selection. The registry grows because adding to it is easy. The cost of the growth is invisible until you run the numbers.

The registry is part of every API call to the model
Each tool registration adds to the prompt on every call (roughly 14 kilobytes in total for the 47-tool registry in the intro)
The model selects from the registry on every turn
Selection accuracy is bounded by the quality of the descriptions, not the size of the model
Registries grow monotonically by default. Pruning is rare and high-leverage.
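Running the numbers can start with something as small as the sketch below. It assumes tool definitions are available as plain dictionaries and uses the byte length of their JSON form as a rough proxy for prompt overhead. The provider's real serialisation and token count will differ, and the two example tools are hypothetical.

```python
import json


def registry_overhead(tools):
    """Return the serialised size of each tool definition and the total, in bytes."""
    per_tool = {
        tool["name"]: len(json.dumps(tool, ensure_ascii=False).encode("utf-8"))
        for tool in tools
    }
    return {"per_tool": per_tool, "total_bytes": sum(per_tool.values())}


# Hypothetical two-tool registry, reusing tool names from this post.
example_tools = [
    {
        "name": "search_product_docs",
        "description": "Searches the indexed product documentation for technical answers.",
        "input_schema": {"type": "object", "properties": {"query": {"type": "string"}}},
    },
    {
        "name": "create_support_ticket",
        "description": "Creates a new support ticket in the queue.",
        "input_schema": {"type": "object", "properties": {"title": {"type": "string"}}},
    },
]

report = registry_overhead(example_tools)
# Largest definitions first: these are the first candidates for trimming.
for name, size in sorted(report["per_tool"].items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {size} bytes")
print(f"total: {report['total_bytes']} bytes")
```

Tracking the total over time is what makes registry growth visible.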

The anatomy of a good tool definition

A tool definition has three components that the model actually reads: the name, the description, and the parameter schema.

The name is the first signal the model uses for selection. A name like search_customer_history is informative. A name like helper_function_2 is not. Names should describe what the tool does in three to five words, using verbs the model recognises (search, fetch, create, update, send, query, calculate).

The description is the next signal. It should explain what the tool does and, critically, when to use it. "Returns the customer's last 30 support interactions" is okay. "Returns the customer's last 30 support interactions. Use this when the user references something they discussed previously, or when their question references a specific past order" is better. The second framing maps the tool to user-visible patterns the model can recognise.

The parameter schema should use specific types and constrained values. Free-text strings are an anti-pattern wherever a closed set of values exists. Enums beat strings. Constrained ranges beat unbounded integers. Required vs optional should be marked explicitly. Default values should be documented.

A "when to use" section in the description is the highest-leverage change I make on most registries. Models are good at pattern-matching from user input to tool descriptions. Give them the patterns explicitly. "Use this when the user asks about pricing, plans, or billing" outperforms a description that requires the model to infer the connection.
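Put together, a definition that follows this anatomy might look like the sketch below. It is written as a Python dictionary in the `name` / `description` / `input_schema` layout that JSON Schema based tool APIs commonly use. The tool `lookup_plan_pricing`, its plans, and its limits are invented for illustration.

```python
lookup_plan_pricing = {
    "name": "lookup_plan_pricing",  # verb-led, says what it does
    "description": (
        "Returns the current price and limits for one subscription plan. "
        "Use this when the user asks about pricing, plans, or billing."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "plan": {
                "type": "string",
                "enum": ["free", "team", "enterprise"],  # closed set, not free text
                "description": "The plan the user is asking about.",
            },
            "billing_period": {
                "type": "string",
                "enum": ["monthly", "annual"],
                "default": "monthly",
                "description": "Defaults to monthly when the user does not say.",
            },
            "seats": {
                "type": "integer",
                "minimum": 1,  # constrained range instead of an unbounded integer
                "maximum": 500,
                "description": "Number of seats to price.",
            },
        },
        "required": ["plan"],  # everything else is explicitly optional
    },
}


def free_text_parameters(tool):
    """List string parameters that have no enum, as candidates for tightening."""
    properties = tool["input_schema"]["properties"]
    return [
        name
        for name, spec in properties.items()
        if spec.get("type") == "string" and "enum" not in spec
    ]
```

`free_text_parameters` returns an empty list for this definition. Any name it does return is a parameter worth a second look, although some parameters, such as a search query, are legitimately free text.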

Before and after: real tool descriptions from a client audit

Abstract advice on tool descriptions is easy to nod at and hard to apply. Here are three real examples from the 47-tool registry audit. Names changed, structure preserved.

Example one: a search tool that the model rarely picked correctly

The original description was eleven words. The new description was forty-three words. The longer one was selected 3.4 times more often on the same user inputs, with a selection-confidence improvement from 0.42 to 0.81.

Before:

name: search_kb
description: Searches the knowledge base.
parameters:
  query: string

After:

name: search_product_docs
description: Searches the indexed product documentation for technical answers. Use this when the user asks how a specific feature works, how to configure something, or how to troubleshoot a product behaviour. Do not use this for billing or account questions (use lookup_account_info instead) or general company information (use search_company_info instead).
parameters:
  query: string (the user's question, rephrased as a search query)
  product_area: enum [billing, integrations, api, ui, mobile, all]

Three structural changes matter. The name went from search_kb to search_product_docs because the previous name was ambiguous (the system had three different "knowledge bases"). The description added explicit when-not-to-use guidance, pointing the model at the right alternative tool. The parameter went from a single free-text string to a string plus an enum that the model can reason about.

Example two: a write tool with a dangerous free-text parameter

The original tool accepted a free-text category. The model would invent categories that did not exist in the runtime ("urgent-billing", "high-priority-support") and the runtime would fail with an unhelpful error. Wrapping the category as an enum eliminated the failure mode entirely.

Before:

name: create_ticket
description: Creates a support ticket.
parameters:
  title: string
  body: string
  category: string
  priority: string

After:

name: create_support_ticket
description: Creates a new support ticket in the queue. Use this when the user reports a problem that requires human follow-up, not for general FAQ questions (use answer_from_faq instead). After creating, return the ticket ID so the user can reference it.
parameters:
  title: string (concise summary, less than 80 chars)
  body: string (full user description, include relevant context)
  category: enum [billing, technical, account, feature_request, bug_report]
  priority: enum [low, medium, high, urgent] (default: medium)

The enum constraint forces the model to pick from a closed set. The default value documents the safe choice when the user did not specify. The added comment on the title parameter prevents the model from generating 200-character "titles" that breach a database constraint downstream.
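The schema constrains what the model is asked to emit, but the runtime can still check the call and return an error the model can act on. The sketch below is one hedged way to do that for create_support_ticket. The helper name `check_ticket_arguments` and the error wording are assumptions, not the client's code.

```python
ALLOWED_VALUES = {
    "category": ["billing", "technical", "account", "feature_request", "bug_report"],
    "priority": ["low", "medium", "high", "urgent"],
}
DEFAULTS = {"priority": "medium"}
MAX_TITLE_CHARS = 80  # titles must be shorter than this


def check_ticket_arguments(arguments):
    """Apply defaults, then return (arguments, errors) for one tool call."""
    checked = {**DEFAULTS, **arguments}
    errors = []
    for field in ("title", "body"):
        if not checked.get(field):
            errors.append(f"{field} is required")
    for field, allowed in ALLOWED_VALUES.items():
        value = checked.get(field)
        if value not in allowed:
            # Listing the closed set lets the model correct itself on the next turn.
            errors.append(f"{field}={value!r} is not valid; choose one of {allowed}")
    if len(checked.get("title", "")) >= MAX_TITLE_CHARS:
        errors.append(f"title must be shorter than {MAX_TITLE_CHARS} characters")
    return checked, errors


checked, errors = check_ticket_arguments(
    {"title": "Refund not received", "body": "Customer paid twice.", "category": "urgent-billing"}
)
for message in errors:
    print(message)
```

Returning the list of valid values in the error turns the old unhelpful failure into something the model can recover from without human help.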

Example three: a tool whose description bloat was hiding the actual usage pattern

The original description ran 380 words and included the historical reason the tool was built, three edge cases that were no longer relevant, and the names of two engineers who had owned it. The model selected this tool incorrectly on 31 percent of relevant queries. After trimming to the operational pattern, selection accuracy moved to 89 percent.

Before:

name: get_customer_data
description: This tool was originally built in 2024 to consolidate the four different customer data sources that existed at the time (Salesforce, HubSpot, the legacy CRM, and the spreadsheet that ops maintained). Currently it merges Salesforce and HubSpot only; the legacy CRM was retired in Q2 2025. The spreadsheet... [continues for 320 more words covering historical context]
parameters:
  customer_id: string

After:

name: get_customer_profile
description: Returns the current customer profile (name, plan, status, tags, recent activity). Use this when the user references a specific customer and you need their current state. Returns null if the customer ID is invalid or the customer was deleted.
parameters:
  customer_id: string (UUID v4)

The bloat was hiding the signal. The model was selecting incorrectly because it could not figure out what the tool was actually for under the historical noise. Trimming was the entire fix.
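The three examples suggest a cheap lint that can run over every description in a registry. The sketch below is a rough heuristic under stated assumptions: the word-count thresholds are arbitrary starting points, and the check for the literal phrase "use this when" only works if the team adopts that phrasing as a convention.

```python
def lint_description(description, min_words=8, max_words=120):
    """Return a list of findings for one tool description (empty means clean)."""
    word_count = len(description.split())
    findings = []
    if word_count < min_words:
        findings.append("too_short")
    if word_count > max_words:
        findings.append("too_long")
    if "use this when" not in description.lower():
        findings.append("no_when_to_use")
    return findings


before = "Searches the knowledge base."
after = (
    "Returns the current customer profile (name, plan, status, tags, recent activity). "
    "Use this when the user references a specific customer and you need their current state. "
    "Returns null if the customer ID is invalid or the customer was deleted."
)

print(lint_description(before))
print(lint_description(after))
```

A lint like this does not replace reading the description, but it flags the four-word and 380-word cases before a human has to.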

The seven failure modes of a bad tool registry

Patterns I have seen in production across roughly 30 client systems audited in 2026.

One. Naming collisions. Two tools have similar names (get_user vs fetch_user, search_records vs query_records). The model picks the wrong one because the names do not disambiguate the use cases.

Two. Description bloat. A tool description spans 400 words and includes implementation details, edge cases, and version history. The model loses the signal in the noise.

Three. Free-text parameters where enums would work. A "category" parameter is typed as string when it should be one of seven specific values. The model invents categories that do not match the runtime expectations.

Four. Missing when-to-use guidance. The description explains what the tool does but not when the model should pick it over alternatives. Selection becomes guesswork.

Five. Overlapping scope. Two tools cover the same workflow. The model picks inconsistently. Sometimes search_kb, sometimes ask_docs. The downstream code has to handle both paths.

Six. Dead tools. Tools that were registered for a feature that shipped six months ago and nobody removed. They still consume tokens. The model still considers them. The selection space stays larger than it needs to.

Seven. Schema drift. The runtime function signature changed but the tool description did not. The model emits calls with old parameters or wrong types. The failures look like model errors when they are really runtime mismatches.
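Schema drift is the one failure mode on this list that can be partly caught mechanically. A minimal sketch, assuming each tool is implemented by a plain Python function with named parameters: compare the declared schema with the signature using the standard-library `inspect` module. The `search_customer` function and its schema here are hypothetical.

```python
import inspect

_NAMED_KINDS = (
    inspect.Parameter.POSITIONAL_OR_KEYWORD,
    inspect.Parameter.KEYWORD_ONLY,
)


def schema_drift(input_schema, func):
    """Compare a tool's declared parameters with the function that implements it."""
    declared = set(input_schema.get("properties", {}))
    params = {
        name: param
        for name, param in inspect.signature(func).parameters.items()
        if param.kind in _NAMED_KINDS  # ignore *args and **kwargs
    }
    required = {
        name for name, param in params.items()
        if param.default is inspect.Parameter.empty
    }
    problems = []
    for name in sorted(declared - set(params)):
        problems.append(f"schema declares '{name}' but the function does not accept it")
    for name in sorted(required - declared):
        problems.append(f"function requires '{name}' but the schema does not declare it")
    return problems


def search_customer(customer_id, tenant, include_orders=False):
    # Hypothetical runtime function whose signature moved on without the schema.
    return {"customer_id": customer_id, "tenant": tenant, "orders": include_orders}


stale_schema = {
    "type": "object",
    "properties": {
        "customer_id": {"type": "string"},
        "include_history": {"type": "boolean"},  # renamed in the code, not here
    },
    "required": ["customer_id"],
}

for problem in schema_drift(stale_schema, search_customer):
    print(problem)
```

This only compares names and required-ness, not types, so it narrows the manual check rather than replacing it.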

Tool-selection telemetry, the metric that actually matters

Anthropic's tool-use telemetry exposes per-call selection scores. The model emits a confidence indicator for each tool it considered, not just the one it picked. This is the most important diagnostic primitive shipped in the last twelve months and most teams are not using it.

With selection scores, you can answer questions that were previously unanswerable. Why did the model pick search_kb over query_docs on that turn? The score delta tells you (and usually it is a 0.04 difference, which means the descriptions are not disambiguating cleanly). What is the model's selection confidence on the routes that fail? Usually low. Consistently low routes mean the registry has overlap there.

The metric I track per client is "selection confidence p50 per route". On healthy registries this number is above 0.7. On the 47-tool registry I described in the intro, the average was 0.31. That is the registry telling you it cannot distinguish between the tools you have given it.

Wire selection scores into your trace tooling alongside cost and latency. Low confidence is the leading indicator of wrong-tool failures. Fix low confidence by editing descriptions, not by re-prompting the model.
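As a sketch of the per-route metric, assume each trace record carries a route label and a score for every tool the model considered. That record shape is a hypothetical format for illustration, not a documented export, and the code assumes the picked tool is the highest-scoring one.

```python
from collections import defaultdict
from statistics import median

# Hypothetical trace records: one per turn, with a score per considered tool.
traces = [
    {"route": "billing", "scores": {"lookup_account_info": 0.84, "search_product_docs": 0.10}},
    {"route": "billing", "scores": {"lookup_account_info": 0.80, "search_product_docs": 0.12}},
    {"route": "billing", "scores": {"lookup_account_info": 0.88, "search_product_docs": 0.07}},
    {"route": "docs", "scores": {"search_kb": 0.42, "query_docs": 0.38}},
    {"route": "docs", "scores": {"search_kb": 0.40, "query_docs": 0.37}},
    {"route": "docs", "scores": {"query_docs": 0.45, "search_kb": 0.40}},
]


def selection_report(traces, healthy_p50=0.7):
    """Per route: median top score, median gap to the runner-up, and a health flag."""
    top_scores = defaultdict(list)
    margins = defaultdict(list)
    for trace in traces:
        ranked = sorted(trace["scores"].values(), reverse=True)
        runner_up = ranked[1] if len(ranked) > 1 else 0.0
        top_scores[trace["route"]].append(ranked[0])
        margins[trace["route"]].append(ranked[0] - runner_up)
    report = {}
    for route, scores in top_scores.items():
        p50 = median(scores)
        report[route] = {
            "p50": p50,
            "median_margin": median(margins[route]),
            "healthy": p50 >= healthy_p50,
        }
    return report


report = selection_report(traces)
for route, stats in report.items():
    print(route, stats)
```

A low p50 combined with a small margin on the same route is the signature of two descriptions that do not disambiguate, which points at the descriptions to edit first.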

MCP vs native tools

The choice between MCP-served tools and native inline tools is one of operational shape, not capability. Both can do the same things. They have different tradeoffs.

Native tools are defined in the agent's code. The agent process knows about them directly with no external server and no protocol overhead. They are simpler to debug. The cost is that every agent that needs the same tool has to redefine it (or share code through a library).

MCP tools are defined on a server. The agent discovers them via MCP and invokes them via remote procedure call. Multiple agents can share the same MCP server. New tools can ship on the server without redeploying the agent. The cost is the operational overhead of running the server and the protocol round-trip on every invocation.

My current rule. Native tools for things specific to this agent. MCP tools for things shared across agents or owned by a different team. The split usually maps cleanly to who owns the implementation. If the team that owns the agent also owns the integration, native is simpler. If a different team owns the integration (the platform team owns the database tool, the agent team consumes it), MCP gives you the boundary.

One pattern that does not work. Wrapping every native tool in an MCP server "for portability". The portability is real. The cost is real and usually larger than the portability win. Wait until the second agent needs the same tool before extracting it.

The audit pattern I run on every client registry

The audit goes first, before any other intervention. It usually surfaces problems that explain 60 to 80 percent of the wrong-tool selection rate without any model or prompt changes.

01. List every tool currently in the registry
Most teams cannot produce this list cleanly. They have tools registered in three different files. They have tools that are conditionally registered based on user role. They have tools that were added to test something and never removed. The first deliverable is just the canonical list.

02. For each tool, classify last-30-day invocation count
Pull the trace data. Group tools into never invoked, rarely invoked (less than 10 per day), regularly invoked, heavily invoked. The first two groups are candidates for removal.

03. For each tool, read the description aloud
Sounds silly. Works. Vague descriptions that look fine in code become obviously inadequate when you say them out loud. "Returns user data" is the description that fails this test most often.

04. Check for naming overlap
Pairs of tools whose names start with the same verb and operate on the same noun. The model will struggle to disambiguate. Either consolidate or rename so the disambiguation is in the name.

05. Check for description overlap
Pairs of tools whose descriptions could be answers to the same question. The model will pick inconsistently. Consolidate or sharpen the descriptions.

06. Validate each schema against the runtime
For each tool, manually verify that the parameter schema matches the runtime function signature. Drift between the two is one of the most common production failure sources and rarely surfaces in normal testing.

07. Cut everything from group 1 and group 2
Tools that have not been invoked in 30 days are not in use. Remove them. The token savings are real (typically 30 to 50 percent of registry size on a real codebase) and the selection-accuracy gain is real (the model has fewer wrong tools to consider).
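The invocation-count classification and the final cut can be sketched in a few lines once the 30-day counts are out of the trace store. The "rarely" threshold of 10 calls per day comes from the audit steps above. The boundary of 100 per day between "regularly" and "heavily" is an assumption, and so are the tool names and counts.

```python
def usage_bucket(invocations_30d, days=30, heavy_per_day=100):
    """Classify a tool by its invocation count over the audit window."""
    if invocations_30d == 0:
        return "never"
    per_day = invocations_30d / days
    if per_day < 10:
        return "rarely"
    if per_day < heavy_per_day:
        return "regularly"
    return "heavily"


def removal_candidates(registered, invocation_counts):
    """Tools in the 'never' or 'rarely' buckets; missing trace data counts as zero."""
    buckets = {
        name: usage_bucket(invocation_counts.get(name, 0)) for name in registered
    }
    return sorted(name for name, bucket in buckets.items() if bucket in ("never", "rarely"))


# Hypothetical registry and 30-day counts pulled from trace data.
registered = [
    "search_product_docs",
    "create_support_ticket",
    "get_customer_profile",
    "export_legacy_report",
]
counts = {
    "search_product_docs": 9000,
    "create_support_ticket": 1200,
    "get_customer_profile": 45,
}

print(removal_candidates(registered, counts))
```

Starting from the canonical list rather than from the trace data matters: a tool that never appears in the traces is exactly the one this step exists to find.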

The first time you run this audit on a real codebase, expect to remove a third to half of the registered tools. That is normal. The team will have built tools speculatively over time. The cleanup is the leverage.
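The naming-overlap check can also be given a mechanical first pass. The sketch below assumes snake_case names of the form verb_noun and treats a small hand-written set of verbs as interchangeable. The verb groups are an assumption and will need tuning per codebase.

```python
from itertools import combinations

# Hypothetical groups of verbs that a model is likely to treat as synonyms.
VERB_GROUPS = [
    {"get", "fetch", "retrieve", "lookup"},
    {"search", "query", "find"},
    {"create", "add", "make"},
    {"update", "edit", "modify"},
]


def verb_family(verb):
    """Map a verb to a canonical representative of its synonym group."""
    for group in VERB_GROUPS:
        if verb in group:
            return min(group)
    return verb


def naming_overlaps(names):
    """Return pairs of tool names with synonymous verbs and the same noun."""
    def key(name):
        verb, _, noun = name.partition("_")
        return verb_family(verb), noun

    return [
        (first, second)
        for first, second in combinations(sorted(names), 2)
        if key(first) == key(second)
    ]


names = ["get_user", "fetch_user", "search_records", "query_records", "create_support_ticket"]
print(naming_overlaps(names))
```

Each reported pair is a prompt for a human decision, consolidate or rename, not an automatic deletion.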

Versioning, deprecation, and tool lifecycle

Tools have a lifecycle and most teams treat them as if they do not.

When a tool ships, it should have a known owner, a known consumer pattern, and a target lifecycle. Most tools at most companies have none of these. They are added by someone who left, used by some code path that nobody owns, with no notion of when they should be deprecated.

Versioning matters. Schema changes to existing tools are breaking changes for the model. Adding a required parameter to search_customer is a breaking change. Renaming a parameter is a breaking change. Tightening an enum is a breaking change. Sessions already in flight carry the old shape in their context, so a changed schema can break them mid-conversation. Roll the change in with a versioning convention (search_customer vs search_customer_v2) and run them in parallel until traffic shifts.
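The three breaking changes named above can be detected by diffing the old and new parameter schemas before a change ships. This is a minimal sketch that assumes JSON Schema style dictionaries. The two versions of the search_customer schema are invented, and a rename shows up as a removal because a diff cannot tell the two apart.

```python
def breaking_changes(old_schema, new_schema):
    """List changes between two parameter schemas that break existing callers."""
    old_props = old_schema.get("properties", {})
    new_props = new_schema.get("properties", {})
    old_required = set(old_schema.get("required", []))
    new_required = set(new_schema.get("required", []))

    changes = []
    for name in sorted(set(old_props) - set(new_props)):
        changes.append(f"parameter '{name}' removed or renamed")
    for name in sorted(new_required - old_required):
        changes.append(f"parameter '{name}' is newly required")
    for name in sorted(set(old_props) & set(new_props)):
        old_enum = old_props[name].get("enum")
        new_enum = new_props[name].get("enum")
        # Tightened: an enum was introduced, or previously valid values were dropped.
        if new_enum is not None and (old_enum is None or not set(old_enum) <= set(new_enum)):
            changes.append(f"enum for '{name}' tightened")
    return changes


v1 = {
    "properties": {
        "query": {"type": "string"},
        "status": {"type": "string", "enum": ["active", "churned", "trial"]},
    },
    "required": ["query"],
}
v2 = {
    "properties": {
        "query": {"type": "string"},
        "status": {"type": "string", "enum": ["active", "churned"]},
        "region": {"type": "string", "enum": ["eu", "us"]},
    },
    "required": ["query", "region"],
}

for change in breaking_changes(v1, v2):
    print(change)
```

A non-empty result is the signal to publish the new shape under a new name and run both in parallel, rather than editing the existing tool in place.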

Deprecation is the part nobody plans. A tool that is being replaced should be marked as deprecated in the description (the model can read the description and route away from it), traffic should be measured against the new tool, and removal should happen on a known timeline. Without this, every old tool stays in the registry forever and the registry grows monotonically.

The rule I use. Every tool gets reviewed quarterly. Unused tools get removed. Tools with low confidence scores get rewritten. New tools get a 30-day usage check after launch to verify they are actually being picked. Owners change as people move teams. The registry is a product that needs ongoing maintenance, not a configuration that gets set once.
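A lifecycle rule like this is easier to keep when the registry carries the metadata it depends on. The sketch below assumes one record per tool with an owner and a few dates. The field names, the 90-day reading of "quarterly", and the example records are all assumptions.

```python
from dataclasses import dataclass
from datetime import date, timedelta

REVIEW_INTERVAL = timedelta(days=90)  # "quarterly", approximated as 90 days
LAUNCH_CHECK = timedelta(days=30)


@dataclass
class ToolRecord:
    name: str
    owner: str
    launched: date
    last_reviewed: date
    usage_checked: bool = False  # has the 30-day post-launch check happened?
    deprecated: bool = False


def due_actions(record, today):
    """Return the lifecycle actions that are due for one tool on a given day."""
    actions = []
    if record.deprecated:
        actions.append("confirm removal date")
    if not record.usage_checked and today - record.launched >= LAUNCH_CHECK:
        actions.append("30-day usage check")
    if today - record.last_reviewed >= REVIEW_INTERVAL:
        actions.append("quarterly review")
    return actions


today = date(2026, 6, 1)
records = [
    ToolRecord("search_product_docs", "docs-team", date(2025, 1, 10), date(2026, 1, 15), usage_checked=True),
    ToolRecord("create_support_ticket", "support-team", date(2026, 4, 20), date(2026, 4, 20)),
    ToolRecord("search_kb", "docs-team", date(2024, 6, 1), date(2026, 5, 1), usage_checked=True, deprecated=True),
]

for record in records:
    print(record.name, record.owner, due_actions(record, today))
```

The owner field matters as much as the dates: an action with nobody to assign it to is how tools end up staying in the registry forever.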

What this looks like in practice

Concrete numbers from a recent engagement. The team had 47 tools. We ran the audit. 18 were removed in the first cut (no invocations in 30 days or naming overlap). 12 had their descriptions rewritten with explicit when-to-use guidance. 4 were merged into 2 (overlapping scope). 3 had their parameter schemas tightened from free-text to enums.

Tools after audit: 47 to 19
Average selection confidence: 0.31 to 0.74
Wrong-tool rate: 22% to 7%
Token savings per call: 31%

The model did not change. The orchestration did not change. The team had been planning to migrate from Sonnet to Opus to get better tool selection. After the registry audit, the migration was unnecessary. The Sonnet-level accuracy on the cleaned registry exceeded the Opus-level accuracy on the bloated registry. The migration would have added $4,200 per month to a workflow with only $2,900 per month of headroom, all for a model upgrade that did not need to happen.

This pattern is not exceptional. It is the median outcome on registry audits across 30 client systems in 2026. The leverage is in the registry, not the model.

The shortest version. The tool registry is part of every prompt the model reads, costs more tokens than most teams realise, and bounds selection accuracy in ways that are not visible until you measure them. Audit it before you do anything else. The audit is two days of work and usually replaces a quarter of model-tuning work that would not have helped. Then put the registry on a quarterly review cycle so it does not grow back into the same shape. If you want me to run the audit on your registry, agentic AI consulting engagements often start exactly here.
