Tool descriptions are prompts. Fix the registry, not the agent.
When an agent picks the wrong tool, the registry is broken — not the agent. Three rules I now apply before debugging anything in a multi-tool system: precise names, "when to use" triggers, and a curated load list. Anthropic's new tool-selection telemetry finally puts numbers on what changes accuracy.
I watched an agent in a customer system pick the wrong tool three times in a row last week. The reflex on the team was to debug the model — log the prompts, retry with bigger context, ablate the system message. The model was fine. The tool registry was the problem.
The model scores descriptions, not capabilities
The mental model most engineers carry over from REST API design is wrong here. When your code calls a REST endpoint, the choice of endpoint was made by you at development time, and it lives in source you can read. When a model picks a tool, it reads the description you wrote for it. Whatever you put in that string is the spec the model selects against. There is no other source of truth at runtime.
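To make that concrete: here is roughly what a tool looks like from the model's side, sketched against the Anthropic Messages API in Python. The tool, its wording, and the model id are placeholders, not anything from a real system.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# This entry is everything the model knows about the tool. The implementation,
# the database behind it, the team that owns it: none of that travels with the request.
tools = [{
    "name": "search_customer_orders_by_email",
    "description": (
        "Use this when the user gives an email address and asks about that "
        "customer's ORDERS. Returns order summaries. Do not use for account "
        "or profile questions."
    ),
    "input_schema": {
        "type": "object",
        "properties": {"email": {"type": "string", "description": "Customer email"}},
        "required": ["email"],
    },
}]

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # substitute whichever tool-capable model you run
    max_tokens=512,
    tools=tools,
    messages=[{"role": "user", "content": "What has jane@example.com ordered this month?"}],
)

# The selection comes back as a tool_use block; the model chose it purely
# from the name and description strings above.
picked = [block.name for block in response.content if block.type == "tool_use"]
print(picked)
```

Nothing else travels with that request. If the description string is vague, the selection is a guess.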
Anthropic just shipped tool-use telemetry that surfaces the per-call selection scores. For the first time you can see in numbers what the model thought your tool registry meant. The teams running it in production are reporting the same thing I see in consulting engagements: the descriptions are the bug.
Three rules I now apply before any agent debug
- Name tools precisely. "search" is not a tool name; "search_customer_orders_by_email" is. Vague names lose to specific ones every time, regardless of model strength.
- Describe when to use each tool, not what it does. The description should answer "given this request, should I pick this?" — not "what API does this wrap?". The model already infers function from name; what it needs from you is selection criteria.
- Load only the tools the task needs. Every unused tool in scope is noise the model scores against. GitHub's recent post measured 8-12 KB of schema overhead per unused tool — and the cognitive cost shows up in wrong-tool calls before the bill does.
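Here is a minimal sketch of the three rules side by side, assuming a hand-rolled registry rather than any particular framework. Names, descriptions, and scopes are illustrative, and I have left out the per-tool argument schemas a real tools payload also carries.

```python
# Rules 1 and 2: precise names, descriptions written as selection criteria.
TOOLS = {
    "get_customer_record": (
        "Use this when the user asks about a CUSTOMER (account, profile, "
        "contact). Takes a customer_id (UUID). Not for orders or invoices."
    ),
    "get_order_record": (
        "Use this when the user asks about a specific ORDER or INVOICE and "
        "has an order_id. Not for customer or account questions."
    ),
    "search_customer_orders_by_email": (
        "Use this when the user gives an email address and wants the orders "
        "attached to it. Not for looking up a single order by id."
    ),
}

# Rule 3: each workflow declares the tools it actually needs; nothing else
# gets loaded into the model's context by default.
SCOPES = {
    "billing_support": ["get_customer_record", "get_order_record"],
    "order_tracking": ["search_customer_orders_by_email", "get_order_record"],
}

def tools_for(scope: str) -> list[dict]:
    """Build the tools payload for one workflow from its allow-list."""
    return [{"name": name, "description": TOOLS[name]} for name in SCOPES[scope]]
```

tools_for("billing_support") puts two tools in context instead of the whole catalogue; that is rule three in one line.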
A concrete before / after
Recent engagement: an ERP integration agent had a "get_record" tool described as "Fetches a record from the database." The agent kept passing customer IDs to a function that took order IDs. The fix was not a smarter prompt. It was the description:
name: get_customer_record
description: Use this when the user asks about a CUSTOMER (account, profile, contact). Takes a customer_id (UUID). Do not use for orders or invoices — see get_order_record for those.
Selection accuracy on the affected workflow jumped from 71% to 96% in the next eval run. No model change. No prompt rewrite. The description was already a prompt; it just had not been written like one.
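For completeness, here is the rewritten entry as a full registry object; the schema details are my reconstruction for illustration, not the client's actual code. Typing and documenting customer_id closes off the wrong-id failure from the other side.

```python
get_customer_record = {
    "name": "get_customer_record",
    "description": (
        "Use this when the user asks about a CUSTOMER (account, profile, "
        "contact). Takes a customer_id (UUID). Do not use for orders or "
        "invoices; see get_order_record for those."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "customer_id": {
                "type": "string",
                "description": "Customer UUID. Not an order id or invoice id.",
            }
        },
        "required": ["customer_id"],
    },
}
```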
Where to start in your codebase
- List every tool you have registered. Anything not called in the last 30 days, drop from the default load — keep it in a specialised scope.
- For each remaining tool, rewrite the description as a selection criterion. "Use this when…" beats "This function returns…" every time.
- Add a deliberate test that asks the agent to pick the right tool for ambiguous inputs. Track the score deltas over time. That is your tool-registry regression test.
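Here is a sketch of that test as a pytest-style check against the Anthropic Messages API. The ambiguous cases and the accuracy bar are placeholders to swap for your own wrong-tool incidents, and it assumes the tools_for loader from the registry sketch above, with each entry carrying an input_schema like the one just shown.

```python
import anthropic

client = anthropic.Anthropic()

# Deliberately ambiguous inputs paired with the tool a human would expect.
# These cases are illustrative; build yours from real wrong-tool incidents.
CASES = [
    ("Can you pull up the account for acme-industrial?", "get_customer_record"),
    ("Why was invoice 88412 charged twice?", "get_order_record"),
    ("Who is the billing contact on the acme-industrial account?", "get_customer_record"),
    ("Has order 7731 been refunded yet?", "get_order_record"),
]

def pick_tool(prompt: str, tools: list[dict]) -> str | None:
    """Run one turn and return the name of the tool the model selected, if any."""
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # whichever tool-capable model you deploy
        max_tokens=512,
        tools=tools,
        messages=[{"role": "user", "content": prompt}],
    )
    picks = [block.name for block in response.content if block.type == "tool_use"]
    return picks[0] if picks else None

def test_tool_selection_accuracy():
    # The scoped loader from the registry sketch above (each entry also
    # needs its input_schema to be a valid tools payload).
    tools = tools_for("billing_support")
    hits = sum(pick_tool(prompt, tools) == expected for prompt, expected in CASES)
    accuracy = hits / len(CASES)
    # The bar is a placeholder; what matters is tracking the number across registry changes.
    assert accuracy >= 0.75, f"tool selection accuracy fell to {accuracy:.0%}"
```

Run it on every registry change and keep the numbers; the trend across runs is the regression signal.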
The deeper point
Most "the agent does not work" reports I get from teams turn out to be tool-registry reports in disguise. The agent is doing its job — scoring options against the descriptions you provided. Change the descriptions and the behaviour changes. That is not a workaround; that is the design.