Tool Design. Published May 22, 2026. 12 min read.

Tool registry design for agentic AI: how the wrong registry kills accuracy before the prompt is read

I reviewed a system last month with 47 tools in its registry and a 22 percent wrong-tool-selection rate. The team was about to migrate from Sonnet to Opus to fix it. The prompt was fine. The registry was the bug. This is the audit pattern I run on every client codebase before we change anything else, the seven failure modes I see in production, and the numbers from the cleanup.

Jigar Joshi, Agentic AI Architect and Consultant

I reviewed an agentic system last month with 47 tools in its registry. Every API call to the model carried roughly 14 kilobytes of tool-description overhead. The agent was hitting wrong-tool-selection rates of 22 percent on tasks the team thought were solved. Their reflex was to fix the prompt. Tighten the system message, add more examples, escalate to a more capable model.

The prompt was fine. The registry was the bug.

Tool registries are the most under-engineered part of most agentic AI systems. Teams obsess over prompts, model selection, and orchestration patterns while shipping 40-tool registries that the model cannot reason about cleanly. This post is the deep dive I send to every client after the first wrong-tool incident. What a tool actually is, what a registry actually is, the seven failure modes that show up in production, and the audit pattern I run on every client codebase before we change anything else.

What a tool actually is in the agentic AI context

A tool is a function the model can call. The function has a name, a parameter schema, a description of what it does, and an implementation that runs when the model invokes it. From the model's perspective, the tool is a contract: if I emit this name with these parameters, the runtime will give me back a result of this shape.

From the runtime's perspective, the tool is a wrapper around something else. It might wrap an HTTP API. It might wrap a database query. It might wrap a shell command. It might wrap another agent. The wrapping is where most of the engineering work lives. The function signature is the small visible part of a larger system.
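A minimal sketch of that split, assuming a plain Python runtime. The `Tool` class, its field names, and the stand-in handler are hypothetical, not any vendor's API. The point is that `spec()` is the small part the model sees and `handler` is the wrapped system it never sees.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Tool:
    """What the model sees (name, description, schema) plus what the runtime runs."""
    name: str
    description: str
    parameters: dict[str, Any]   # JSON-Schema-style parameter schema
    handler: Callable[..., Any]  # the wrapped implementation

    def spec(self) -> dict[str, Any]:
        """The contract shown to the model; the handler is never serialised."""
        return {
            "name": self.name,
            "description": self.description,
            "parameters": self.parameters,
        }

    def invoke(self, **kwargs: Any) -> Any:
        """Called by the runtime when the model emits this tool's name."""
        return self.handler(**kwargs)


def _search_customer_history(customer_id: str) -> list[str]:
    # Stand-in for the real wrapped system (HTTP API, database query, shell command).
    return [f"interaction for {customer_id}"]


search_customer_history = Tool(
    name="search_customer_history",
    description="Returns the customer's recent support interactions.",
    parameters={
        "type": "object",
        "properties": {"customer_id": {"type": "string"}},
        "required": ["customer_id"],
    },
    handler=_search_customer_history,
)
```

Keeping the spec and the handler on one object makes it harder for the two to drift apart, which matters later in the audit.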

Tools come in three common shapes in 2026. Native function-calling tools defined inline in the agent's code. MCP tools served by a separate server process and discovered via the protocol. Hosted tools provided by the model vendor (web search, computer use). Each shape has different operational characteristics. The choice matters but the underlying tool concept is the same.

What a tool registry actually is

The tool registry is the inventory of tools an agent can see in a given turn. When the model receives its prompt, the registry is serialised into the system message as a list of tool descriptions and schemas. The model picks from this list. Nothing else is visible.

This sounds obvious. It is, until you remember what it implies. The registry is part of every prompt. The serialised tool descriptions consume tokens (the 47-tool registry above carried roughly 14 kilobytes, about 300 bytes per tool on average). The model's tool-selection accuracy is bounded by what is in the registry and how it is described. Every wasted tool is wasted accuracy budget.

Most teams treat the registry as a write-only list. They add tools as they build features. Nobody removes tools. Nobody refactors descriptions. Nobody measures whether the registry is helping or hurting selection. The registry grows because adding to it is easy. The cost of the growth is invisible until you run the numbers.

  • The registry is part of every API call to the model
  • Each tool registration adds to the prompt on every call (roughly 14 kilobytes in total for the 47-tool registry above)
  • The model selects from the registry on every turn
  • Selection accuracy is bounded by the quality of the descriptions, not the size of the model
  • Registries grow monotonically by default. Pruning is rare and high-leverage.
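One way to make that cost visible is to measure the serialised registry directly. This is a rough sketch: the `specs` list is made-up illustrative data, and byte counts are only a proxy for tokens, since the exact wire format and tokenisation depend on the model vendor.

```python
import json


def registry_overhead(specs: list[dict]) -> dict:
    """Return the serialised size in bytes, per tool and in total."""
    per_tool = {
        spec["name"]: len(json.dumps(spec, separators=(",", ":")).encode("utf-8"))
        for spec in specs
    }
    return {"total_bytes": sum(per_tool.values()), "per_tool": per_tool}


# Illustrative registry entries, not real client data.
specs = [
    {"name": "search_kb", "description": "Searches the knowledge base.",
     "parameters": {"query": "string"}},
    {"name": "create_ticket", "description": "Creates a support ticket.",
     "parameters": {"title": "string", "body": "string"}},
]

report = registry_overhead(specs)
# Largest tools first: these are the first descriptions worth reviewing.
for name, size in sorted(report["per_tool"].items(), key=lambda kv: -kv[1]):
    print(f"{name}: {size} bytes")
print("total:", report["total_bytes"], "bytes")
```

Run it against the real registry before and after a cleanup to see what the pruning saved.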

The anatomy of a good tool definition

A tool definition has three components that the model actually reads: the name, the description, and the parameter schema.

The name is the first signal the model uses for selection. A name like search_customer_history is informative. A name like helper_function_2 is not. Names should describe what the tool does in three to five words, using verbs the model recognises (search, fetch, create, update, send, query, calculate).

The description is the next signal. It should explain what the tool does and, critically, when to use it. "Returns the customer's last 30 support interactions" is okay. "Returns the customer's last 30 support interactions. Use this when the user references something they discussed previously, or when their question references a specific past order" is better. The second framing maps the tool to user-visible patterns the model can recognise.

The parameter schema should use specific types and constrained values. Free-text strings are an anti-pattern wherever the runtime accepts only a closed set of values. Enums beat strings. Constrained ranges beat unbounded integers. Required vs optional should be marked explicitly. Default values should be documented.

A "when to use" section in the description is the highest-leverage change I make on most registries. Models are good at pattern-matching from user input to tool descriptions. Give them the patterns explicitly. "Use this when the user asks about pricing, plans, or billing" outperforms a description that requires the model to infer the connection.
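The naming and description rules above are mechanical enough to lint. A minimal sketch, with three assumptions of mine: the verb list is the one from the text and should be extended to fit your registry, the "Use this when" check is a crude regex, and the 100-word ceiling is an arbitrary threshold.

```python
import re

# Verbs from the text above; extend this set for your own registry.
KNOWN_VERBS = {"search", "fetch", "create", "update", "send", "query", "calculate"}
WHEN_TO_USE = re.compile(r"\buse this (when|for|if)\b", re.IGNORECASE)


def lint_tool(name: str, description: str, max_words: int = 100) -> list[str]:
    """Return a list of problems; an empty list means the definition passes."""
    problems = []
    words = name.split("_")
    if words[0] not in KNOWN_VERBS:
        problems.append(f"name does not start with a recognised verb: {words[0]!r}")
    if not 3 <= len(words) <= 5:
        problems.append(f"name has {len(words)} words, expected 3 to 5")
    if not WHEN_TO_USE.search(description):
        problems.append("description has no 'Use this when ...' guidance")
    if len(description.split()) > max_words:
        problems.append(f"description is longer than {max_words} words")
    return problems


bad = lint_tool("helper_function_2", "Does things.")
good = lint_tool(
    "search_customer_history",
    "Returns the customer's last 30 support interactions. Use this when the "
    "user references something they discussed previously.",
)
```

A lint like this only catches missing structure. Whether the when-to-use guidance is any good still needs a human read.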

Before and after: real tool descriptions from a client audit

Abstract advice on tool descriptions is easy to nod at and hard to apply. Here are three real examples from the 47-tool registry audit. Names changed, structure preserved.

Example one: a search tool that the model rarely picked correctly

The original description was eleven words. The new description was forty-three words. The longer one was selected 3.4 times more often on the same user inputs, with a selection-confidence improvement from 0.42 to 0.81.

Before:

name: search_kb
description: Searches the knowledge base.
parameters:
  query: string

After:

name: search_product_docs
description: Searches the indexed product documentation for technical answers. Use this when the user asks how a specific feature works, how to configure something, or how to troubleshoot a product behaviour. Do not use this for billing or account questions (use lookup_account_info instead) or general company information (use search_company_info instead).
parameters:
  query: string (the user's question, rephrased as a search query)
  product_area: enum [billing, integrations, api, ui, mobile, all]

Three structural changes matter. The name went from search_kb to search_product_docs because the previous name was ambiguous (the system had three different "knowledge bases"). The description added explicit when-not-to-use guidance, pointing the model at the right alternative tool. The parameter went from a single free-text string to a string plus an enum that the model can reason about.

Example two: a write tool with a dangerous free-text parameter

The original tool accepted a free-text category. The model would invent categories that did not exist in the runtime ("urgent-billing", "high-priority-support") and the runtime would fail with an unhelpful error. Wrapping the category as an enum eliminated the failure mode entirely.

Before:

name: create_ticket
description: Creates a support ticket.
parameters:
  title: string
  body: string
  category: string
  priority: string

After:

name: create_support_ticket
description: Creates a new support ticket in the queue. Use this when the user reports a problem that requires human follow-up, not for general FAQ questions (use answer_from_faq instead). After creating, return the ticket ID so the user can reference it.
parameters:
  title: string (concise summary, less than 80 chars)
  body: string (full user description, include relevant context)
  category: enum [billing, technical, account, feature_request, bug_report]
  priority: enum [low, medium, high, urgent] (default: medium)

The enum constraint forces the model to pick from a closed set. The default value documents the safe choice when the user did not specify. The added comment on the title parameter prevents the model from generating 200-character "titles" that breach a database constraint downstream.
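The same constraints are worth enforcing in the runtime, so that a bad call fails with a message the model can act on. A minimal sketch in plain Python: the rule format (`enum`, `default`, `max_length`) and the `validate_args` helper are illustrative assumptions, not a schema library's API.

```python
# Hand-rolled rules mirroring the create_support_ticket definition above.
CREATE_SUPPORT_TICKET_PARAMS = {
    "title": {"max_length": 79},  # "less than 80 chars"
    "body": {},
    "category": {"enum": ["billing", "technical", "account",
                          "feature_request", "bug_report"]},
    "priority": {"enum": ["low", "medium", "high", "urgent"], "default": "medium"},
}


def validate_args(schema: dict, args: dict) -> tuple[dict, list[str]]:
    """Return (arguments with defaults applied, list of error messages)."""
    cleaned, errors = {}, []
    for name in args:
        if name not in schema:
            errors.append(f"unknown parameter {name!r}")
    for name, rule in schema.items():
        if name not in args:
            if "default" in rule:
                cleaned[name] = rule["default"]
            else:
                errors.append(f"missing required parameter {name!r}")
            continue
        value = args[name]
        if not isinstance(value, str):
            errors.append(f"{name} must be a string")
        elif "enum" in rule and value not in rule["enum"]:
            # List the valid values so the caller can correct the call.
            errors.append(f"{name}={value!r} is not one of {rule['enum']}")
        elif "max_length" in rule and len(value) > rule["max_length"]:
            errors.append(f"{name} is longer than {rule['max_length']} characters")
        else:
            cleaned[name] = value
    return cleaned, errors


cleaned, errors = validate_args(
    CREATE_SUPPORT_TICKET_PARAMS,
    {"title": "Cannot log in", "body": "Login fails with a 500 error.",
     "category": "urgent-billing"},
)
```

Here the invented category is rejected with the list of valid values, and the missing priority falls back to the documented default.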

Example three: a tool whose description bloat was hiding the actual usage pattern

The original description ran 380 words and included the historical reason the tool was built, three edge cases that were no longer relevant, and the names of two engineers who had owned it. The model selected this tool incorrectly on 31 percent of relevant queries. After trimming to the operational pattern, selection accuracy moved to 89 percent.

Before:

name: get_customer_data
description: This tool was originally built in 2024 to consolidate the four different customer data sources that existed at the time (Salesforce, HubSpot, the legacy CRM, and the spreadsheet that ops maintained). Currently it merges Salesforce and HubSpot only; the legacy CRM was retired in Q2 2025. The spreadsheet... [continues for 320 more words covering historical context]
parameters:
  customer_id: string

After:

name: get_customer_profile
description: Returns the current customer profile (name, plan, status, tags, recent activity). Use this when the user references a specific customer and you need their current state. Returns null if the customer ID is invalid or the customer was deleted.
parameters:
  customer_id: string (UUID v4)

The bloat was hiding the signal. The model was selecting incorrectly because it could not figure out what the tool was actually for under the historical noise. Trimming was the entire fix.

The seven failure modes of a bad tool registry

Patterns I have seen in production across roughly 30 client systems audited in 2026.

One. Naming collisions. Two tools have similar names (get_user vs fetch_user, search_records vs query_records). The model picks the wrong one because the names do not disambiguate the use cases.

Two. Description bloat. A tool description spans 400 words and includes implementation details, edge cases, and version history. The model loses the signal in the noise.

Three. Free-text parameters where enums would work. A "category" parameter is typed as string when it should be one of seven specific values. The model invents categories that do not match the runtime expectations.

Four. Missing when-to-use guidance. The description explains what the tool does but not when the model should pick it over alternatives. Selection becomes guesswork.

Five. Overlapping scope. Two tools cover the same workflow. The model picks inconsistently. Sometimes search_kb, sometimes ask_docs. The downstream code has to handle both paths.

Six. Dead tools. Tools that were registered for a feature that shipped six months ago and nobody removed. They still consume tokens. The model still considers them. The selection space stays larger than it needs to.

Seven. Schema drift. The runtime function signature changed but the tool description did not. The model emits calls with old parameters or wrong types. Failures appear as model errors when they are runtime mismatches.
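Failure mode one is the easiest to detect mechanically. A rough sketch, assuming tool names follow a verb_noun pattern; the synonym table is my assumption and needs tuning per registry. It only catches collisions visible in the names, so overlapping scope (search_kb vs ask_docs) still needs a human read of the descriptions.

```python
from collections import defaultdict
from itertools import combinations

# Assumed verb synonym groups; extend for your own registry.
VERB_GROUPS = {
    "get": "read", "fetch": "read", "lookup": "read",
    "search": "find", "query": "find", "find": "find",
}


def naming_collisions(names: list[str]) -> list[tuple[str, str]]:
    """Return pairs of tool names that use interchangeable verbs on the same noun."""
    buckets = defaultdict(list)
    for name in names:
        verb, _, noun = name.partition("_")
        buckets[(VERB_GROUPS.get(verb, verb), noun)].append(name)
    pairs = []
    for group in buckets.values():
        pairs.extend(combinations(sorted(group), 2))
    return sorted(pairs)


collisions = naming_collisions(
    ["get_user", "fetch_user", "search_records", "query_records", "create_ticket"]
)
```

Each flagged pair is a decision: consolidate the two tools, or rename them so the difference is in the name.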

Tool-selection telemetry, the metric that actually matters

Anthropic's tool-use telemetry exposes per-call selection scores. The model emits a confidence indicator for each tool it considered, not just the one it picked. This is the most important diagnostic primitive shipped in the last twelve months and most teams are not using it.

With selection scores, you can answer questions that were previously unanswerable. Why did the model pick search_kb over query_docs on that turn? The score delta tells you (and usually it is a 0.04 difference, which means the descriptions are not disambiguating cleanly). What is the model's selection confidence on the routes that fail? Usually low. Consistently low routes mean the registry has overlap there.

The metric I track per client is "selection confidence p50 per route". On healthy registries this number is above 0.7. On the 47-tool registry I described in the intro, it was 0.31. That is the registry telling you it cannot distinguish between the tools you have given it.

Wire selection scores into your trace tooling alongside cost and latency. Low confidence is the leading indicator of wrong-tool failures. Fix low confidence by editing descriptions, not by re-prompting the model.
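A minimal sketch of the per-route p50 calculation, assuming the selection scores have already been exported into your trace store. The record shape (`route`, `tool`, `confidence`) and the values are hypothetical illustrations, not a vendor's telemetry format; the 0.7 threshold is the healthy-registry figure from the text.

```python
from collections import defaultdict
from statistics import median

# Made-up trace records: one per tool call, with the score of the chosen tool.
traces = [
    {"route": "billing", "tool": "lookup_account_info", "confidence": 0.82},
    {"route": "billing", "tool": "lookup_account_info", "confidence": 0.76},
    {"route": "billing", "tool": "search_product_docs", "confidence": 0.71},
    {"route": "docs", "tool": "search_kb", "confidence": 0.33},
    {"route": "docs", "tool": "query_docs", "confidence": 0.29},
    {"route": "docs", "tool": "search_kb", "confidence": 0.31},
]


def p50_by_route(records: list[dict]) -> dict[str, float]:
    """Median selection confidence for each route."""
    by_route = defaultdict(list)
    for record in records:
        by_route[record["route"]].append(record["confidence"])
    return {route: median(scores) for route, scores in by_route.items()}


def low_confidence_routes(records: list[dict], threshold: float = 0.7) -> list[str]:
    """Routes whose p50 falls below the threshold, i.e. likely registry overlap."""
    return sorted(
        route for route, p50 in p50_by_route(records).items() if p50 < threshold
    )


flagged = low_confidence_routes(traces)
```

In this made-up data the docs route is flagged, and the mix of search_kb and query_docs calls on it points at the two descriptions to compare first.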

MCP vs native tools

The choice between MCP-served tools and native inline tools is one of operational shape, not capability. Both can do the same things. They have different tradeoffs.

Native tools are defined in the agent's code. The agent process knows about them directly with no external server and no protocol overhead. They are simpler to debug. The cost is that every agent that needs the same tool has to redefine it (or share code through a library).

MCP tools are defined on a server. The agent discovers them via the MCP protocol and invokes them via remote procedure call. Multiple agents can share the same MCP server. New tools can ship on the server without redeploying the agent. The cost is the operational overhead of running the server and the protocol round-trip on every invocation.

My current rule. Native tools for things specific to this agent. MCP tools for things shared across agents or owned by a different team. The split usually maps cleanly to who owns the implementation. If the team that owns the agent also owns the integration, native is simpler. If a different team owns the integration (the platform team owns the database tool, the agent team consumes it), MCP gives you the boundary.

One pattern that does not work. Wrapping every native tool in an MCP server "for portability". The portability is real. The cost is real and usually larger than the portability win. Wait until the second agent needs the same tool before extracting it.

The audit pattern I run on every client registry

The audit goes first, before any other intervention. It usually surfaces problems that explain 60 to 80 percent of the wrong-tool selection rate without any model or prompt changes.

  1. List every tool currently in the registry
    Most teams cannot produce this list cleanly. They have tools registered in three different files. They have tools that are conditionally registered based on user role. They have tools that were added to test something and never removed. The first deliverable is just the canonical list.
  2. For each tool, classify last-30-day invocation count
    Pull the trace data. Group tools into never invoked, rarely invoked (fewer than 10 per day), regularly invoked, heavily invoked. The first two groups are candidates for removal.
  3. For each tool, read the description aloud
    Sounds silly. Works. Vague descriptions that look fine in code become obviously inadequate when you say them out loud. "Returns user data" is the description that fails this test most often.
  4. Check for naming overlap
    Pairs of tools whose names start with the same verb and operate on the same noun. The model will struggle to disambiguate. Either consolidate or rename so the disambiguation is in the name.
  5. Check for description overlap
    Pairs of tools whose descriptions could be answers to the same question. The model will pick inconsistently. Consolidate or sharpen the descriptions.
  6. Validate each schema against the runtime
    For each tool, manually verify that the parameter schema matches the runtime function signature. Drift between the two is one of the most common production failure sources and rarely surfaces in normal testing.
  7. Cut everything from the first two groups (never invoked, rarely invoked)
    Tools that have not been invoked in 30 days are not in use. Remove them. The token savings are real (typically 30 to 50 percent of registry size on a real codebase) and the selection-accuracy gain is real (the model has fewer wrong tools to consider).
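Step 2 is a few lines once the trace data is exported. A minimal sketch, assuming you can pull a flat list of invoked tool names for the last 30 days. The tool names are illustrative, the 10-per-day cutoff is the one from the list above, and the 1,000-per-day line between "regularly" and "heavily" is my assumption.

```python
from collections import Counter


def classify_usage(
    registered: list[str],
    invoked: list[str],
    days: int = 30,
    rare_per_day: float = 10,
    heavy_per_day: float = 1000,  # assumed cutoff, not from the audit itself
) -> dict[str, list[str]]:
    """Group every registered tool by how often it was invoked in the window."""
    counts = Counter(invoked)
    groups = {"never": [], "rarely": [], "regularly": [], "heavily": []}
    for name in registered:
        per_day = counts[name] / days
        if counts[name] == 0:
            groups["never"].append(name)
        elif per_day < rare_per_day:
            groups["rarely"].append(name)
        elif per_day < heavy_per_day:
            groups["regularly"].append(name)
        else:
            groups["heavily"].append(name)
    return groups


registered = ["export_legacy_report", "create_support_ticket", "search_product_docs"]
# 5 ticket creations and 600 doc searches over the 30-day window.
invoked = ["create_support_ticket"] * 5 + ["search_product_docs"] * 600
groups = classify_usage(registered, invoked)
```

Iterating over the registered list rather than the trace data is deliberate: tools with zero invocations never appear in traces, and those are the ones the audit is looking for.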

The first time you run this audit on a real codebase, expect to remove a third to half of the registered tools. That is normal. The team will have built tools speculatively over time. The cleanup is the leverage.
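A first pass at step 6 of the audit can be automated with `inspect.signature` from the standard library. This is a sketch under assumptions: the schema is JSON-Schema-style with `properties` and `required` keys, the example function and stale schema are made up, and only parameter names and requiredness are compared, not types.

```python
import inspect


def schema_drift(func, schema: dict) -> list[str]:
    """Compare a tool's parameter schema with the runtime function signature."""
    declared = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    variadic = (inspect.Parameter.VAR_POSITIONAL, inspect.Parameter.VAR_KEYWORD)
    params = {
        name: p for name, p in inspect.signature(func).parameters.items()
        if p.kind not in variadic  # *args and **kwargs cannot be checked by name
    }
    no_default = {n for n, p in params.items() if p.default is inspect.Parameter.empty}
    problems = []
    for name in sorted(declared - set(params)):
        problems.append(f"schema declares {name!r} but the function does not accept it")
    for name in sorted(no_default - declared):
        problems.append(f"function requires {name!r} but the schema does not declare it")
    for name in sorted((no_default & declared) - required):
        problems.append(f"{name!r} has no default in the function but is optional in the schema")
    return problems


def create_support_ticket(title: str, body: str, category: str, priority: str = "medium"):
    """Illustrative runtime function; the body is irrelevant to the check."""


# A schema that was not updated when the function changed.
stale_schema = {
    "properties": {"title": {}, "body": {}, "category": {}, "severity": {}},
    "required": ["title", "body"],
}
drift = schema_drift(create_support_ticket, stale_schema)
```

Run as a unit test over the whole registry, a check like this turns drift from a production surprise into a failing build. The manual read still matters for types and semantics.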

Versioning, deprecation, and tool lifecycle

Tools have a lifecycle and most teams treat them as if they do not.

When a tool ships, it should have a known owner, a known consumer pattern, and a target lifecycle. Most tools at most companies have none of these. They are added by someone who left, used by some code path that nobody owns, with no notion of when they should be deprecated.

Versioning matters. Schema changes to existing tools are breaking changes for the model. Adding a required parameter to search_customer is a breaking change. Renaming a parameter is a breaking change. Tightening an enum is a breaking change. Sessions already in flight carry the old shape in their context and will keep emitting calls against it. Roll the change in with a versioning convention (search_customer vs search_customer_v2) and run them in parallel until traffic shifts.

Deprecation is the part nobody plans. A tool that is being replaced should be marked as deprecated in the description (the model can read the description and route away from it), traffic should be measured against the new tool, and removal should happen on a known timeline. Without this, every old tool stays in the registry forever and the registry grows monotonically.

The rule I use. Every tool gets reviewed quarterly. Unused tools get removed. Tools with low confidence scores get rewritten. New tools get a 30-day usage check after launch to verify they are actually being picked. Owners change as people move teams. The registry is a product that needs ongoing maintenance, not a configuration that gets set once.
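The three breaking changes named above (new required parameter, renamed parameter, tightened enum) can be detected by diffing two schema versions. A minimal sketch, assuming JSON-Schema-style dictionaries; the example schemas for search_customer are made up.

```python
def breaking_changes(old: dict, new: dict) -> list[str]:
    """List changes between two parameter schemas that break existing callers."""
    old_props, new_props = old.get("properties", {}), new.get("properties", {})
    old_req, new_req = set(old.get("required", [])), set(new.get("required", []))
    changes = []
    # A rename looks like a removal plus an addition; the removal breaks old calls.
    for name in sorted(set(old_props) - set(new_props)):
        changes.append(f"parameter removed or renamed: {name}")
    for name in sorted(new_req - old_req):
        changes.append(f"parameter now required: {name}")
    for name in sorted(set(old_props) & set(new_props)):
        old_enum = old_props[name].get("enum")
        new_enum = new_props[name].get("enum")
        # Tightened: an enum was introduced, or a previously valid value was dropped.
        if new_enum is not None and (old_enum is None or not set(old_enum) <= set(new_enum)):
            changes.append(f"enum tightened: {name}")
    return changes


old_schema = {
    "properties": {"query": {"type": "string"},
                   "region": {"enum": ["eu", "us", "apac"]}},
    "required": ["query"],
}
new_schema = {
    "properties": {"search_text": {"type": "string"},
                   "region": {"enum": ["eu", "us"]},
                   "tenant_id": {"type": "string"}},
    "required": ["search_text", "tenant_id"],
}
changes = breaking_changes(old_schema, new_schema)
```

A non-empty result is the signal to register the new shape under a versioned name (search_customer_v2) and run it alongside the old one, rather than editing the existing tool in place.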

What this looks like in practice

Concrete numbers from a recent engagement. The team had 47 tools. We ran the audit. 18 were removed in the first cut (no invocations in 30 days or naming overlap). 12 had their descriptions rewritten with explicit when-to-use guidance. 4 were merged into 2 (overlapping scope). 3 had their parameter schemas tightened from free-text to enums.

  • Tools after audit: 47 to 19
  • Average selection confidence: 0.31 to 0.74
  • Wrong-tool rate: 22% to 7%
  • Token savings per call: 31%

The model did not change. The orchestration did not change. The team had been planning to migrate from Sonnet to Opus to get better tool selection. After the registry audit, the migration was unnecessary. The Sonnet-level accuracy on the cleaned registry exceeded the Opus-level accuracy on the bloated registry. The upgrade would have cost an extra $4,200 per month on a workflow with only $2,900 per month of headroom, all for a model change that did not need to happen.

This pattern is not exceptional. It is the median outcome on registry audits across 30 client systems in 2026. The leverage is in the registry, not the model.

The shortest version. The tool registry is part of every prompt the model reads, costs more tokens than most teams realise, and bounds selection accuracy in ways that are not visible until you measure them. Audit it before you do anything else. The audit is two days of work and usually replaces a quarter of model-tuning work that would not have helped. Then put the registry on a quarterly review cycle so it does not grow back into the same shape. If you want me to run the audit on your registry, agentic AI consulting engagements often start exactly here.
