Agent Tool Design Best Practices
30-Second Takeaway
- Problem Solved: Poorly defined tools cause models to call the wrong tool, pass bad parameters, and loop infinitely.
- Core Method: 8 production-tested rules: trigger conditions, self-documenting parameters, proper granularity, structured output, actionable error messages, tiered exposure, idempotency, real model testing.
- Key Insight: Schema correctness ≠ model compatibility. Always test with the actual model you'll use in production.
- What You'll Gain: A checklist to audit and improve every tool definition.
Tools are the Agent's hands and feet. Well-designed tools let the model act reliably; poorly designed ones lead to wrong tool calls, bad parameters, and infinite loops.
This article distills 8 rules from real production Agent projects. Each rule comes with before/after examples you can apply immediately.
Rule 1: Tool Descriptions Must Include Trigger Conditions
Don't just describe what a tool does — describe exactly when to use it. The model needs to answer: "Given my current situation, should I call this tool?"
A good description has three parts:
- Trigger condition — what situation warrants this tool
- What it does — the action it performs
- What it returns — the shape and meaning of the output
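The three parts can also be checked mechanically. The sketch below is only a heuristic under stated assumptions: `audit_description`, the marker phrases, and the `read_file` description text are hypothetical illustrations, not part of any vendor's tool format.

```python
# Heuristic markers for the "when to use" and "what comes back" parts.
# These phrase lists are assumptions; extend them to match your own style.
TRIGGER_MARKERS = ("use when", "use this when", "call this when")
RETURN_MARKERS = ("returns", "return value")

# Hypothetical descriptions used only to exercise the check.
VAGUE_DESCRIPTION = "Read a file"
COMPLETE_DESCRIPTION = (
    "Read a text file from the workspace. "
    "Use when you need the exact contents of a file the user referenced. "
    "Returns the file contents as a UTF-8 string."
)


def audit_description(description: str) -> list[str]:
    """Return the missing parts; an empty list means the description passes."""
    text = description.lower()
    problems = []
    if not any(marker in text for marker in TRIGGER_MARKERS):
        problems.append("missing trigger condition")
    if not any(marker in text for marker in RETURN_MARKERS):
        problems.append("missing return value")
    return problems
```

A check like this cannot judge whether the trigger condition is a good one; it only catches descriptions that omit it entirely.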
# ❌ Too vague — model doesn't know when to use this
"description": "Search the web"
# ✅ Clear trigger + behavior + return value
"description": "Search the web for current information. Use when the answer requires real-time or recent data beyond your training cutoff, or when explicitly asked to look up current facts. Returns top 10 results with titles, URLs, and snippets."
In one case, models almost never called a tool named fetch_documentation. Adding "Use this when you encounter an unfamiliar library, API, or framework" to its description tripled its usage rate; the model needed the trigger condition spelled out.
Rule 2: Parameter Names Are Natural Language Prompts
Remember: parameter names are part of the prompt the model reads. They influence tool selection and parameter filling accuracy.
| ❌ Bad Name | ✅ Good Name | Why |
|---|---|---|
| q | search_query | Self-documenting; model understands intent |
| fp | file_path | Avoids abbreviations the model might misinterpret |
| id | user_id | Scoped name prevents ambiguity with other IDs |
| data | csv_content | Describes the format and content expected |
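A naming check along these lines can run over every tool definition before it ships. This is a minimal sketch: the `lint_parameter_names` helper, the deny-list, and the length threshold are assumptions chosen to match the table, not established rules.

```python
# Assumed deny-list of names that say nothing about scope or format.
GENERIC_NAMES = {"id", "data", "value", "input", "param"}
# Assumed threshold: names shorter than this are treated as abbreviations.
MIN_NAME_LENGTH = 4


def lint_parameter_names(names: list[str]) -> dict[str, str]:
    """Map each problematic parameter name to a short explanation."""
    warnings = {}
    for name in names:
        if name in GENERIC_NAMES:
            warnings[name] = "generic name; add scope or format (user_id, csv_content)"
        elif len(name) < MIN_NAME_LENGTH:
            warnings[name] = "abbreviation; spell it out (search_query, file_path)"
    return warnings
```

Running it on the names from the table flags `q`, `fp`, `id`, and `data` and leaves `search_query` alone.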
Each parameter also needs its own description field — never skip it. A parameter named limit with no description is a guessing game for the model.
# ❌ Missing descriptions
{"name": "limit", "type": "integer"}
# ✅ Self-explanatory
{"name": "max_results", "type": "integer",
"description": "Maximum number of results to return. Default 10, max 50."}
Rule 3: Tool Granularity — One Complete Operation Per Tool
This is the hardest rule to get right. The extremes are easy to spot, but the sweet spot takes practice.
| Pattern | Example | Problem |
|---|---|---|
| Too fine | open_file() → read_byte() → close_file() | 10 sequential calls for one task; model loses context |
| Too coarse | do_everything(action, target, format, filter, sort, ...) | Parameter explosion; model doesn't know what to pass |
| Just right | read_file(path), write_file(path, content), search_files(pattern) | One complete operation each; composable but independent |
Golden rule: If you can describe the tool's entire purpose in one sentence without chaining separate actions together, it's probably the right granularity. "Reads a file and returns its contents" describes one operation: good. "Reads a file, parses it, filters lines, and writes output" describes four: too coarse.
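To make the "just right" row concrete, here is a minimal sketch of three single-purpose file tools. The signatures are assumptions for illustration; in particular, `search_files` takes an explicit `directory` argument that the table does not show.

```python
from pathlib import Path


def read_file(path: str) -> str:
    """Read a text file and return its contents."""
    return Path(path).read_text(encoding="utf-8")


def write_file(path: str, content: str) -> int:
    """Write text to a file and return the number of characters written."""
    return Path(path).write_text(content, encoding="utf-8")


def search_files(directory: str, pattern: str) -> list[str]:
    """Return sorted paths of files under `directory` matching a glob pattern."""
    return sorted(str(p) for p in Path(directory).rglob(pattern) if p.is_file())
```

Each function does one complete thing, so the model can chain them (search, then read, then write) without any of them needing a mode flag.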
Rule 4: Return Structured, Parseable Output
Tool return values become the model's next input. Garbage in the return → garbage in the model's reasoning.
# ❌ Unstructured — model has to parse natural language
"Found 3 files: report.csv (2.3MB, modified 2024-01-15),
data.json (156KB, modified 2024-01-14), notes.txt (4KB, modified 2024-01-10)"
# ✅ Structured JSON — model extracts fields accurately
{
  "files": [
    {"name": "report.csv", "size_bytes": 2411725, "modified": "2024-01-15T14:30:00Z"},
    {"name": "data.json", "size_bytes": 159744, "modified": "2024-01-14T09:15:00Z"},
    {"name": "notes.txt", "size_bytes": 4096, "modified": "2024-01-10T18:00:00Z"}
  ],
  "count": 3
}
JSON isn't mandatory — but consistency is. If you return text, use a predictable format. If you return JSON, follow the same schema across all tools.
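One way a tool could produce output in this shape is sketched below. The `list_files` function is a hypothetical example; the field names follow the JSON sample above, and timestamps are emitted in UTC.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def list_files(directory: str) -> str:
    """Return a JSON string describing the files directly inside `directory`."""
    files = []
    for entry in sorted(Path(directory).iterdir()):
        if not entry.is_file():
            continue  # skip subdirectories; this tool reports files only
        stat = entry.stat()
        modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
        files.append({
            "name": entry.name,
            "size_bytes": stat.st_size,
            # ISO 8601 in UTC so every tool reports time the same way
            "modified": modified.strftime("%Y-%m-%dT%H:%M:%SZ"),
        })
    return json.dumps({"files": files, "count": len(files)})
```

Explicit units in field names (`size_bytes` rather than `size`) and a single timestamp format save the model from guessing.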
Rule 5: Error Messages Must Suggest Next Actions
The worst thing a tool can return is an empty string or a vague "Error occurred." The model then has zero information to recover. It will either retry the same broken call or hallucinate a result.
# ❌ Useless error
{"error": "Failed"}
# ❌ Better but still unhelpful
{"error": "File not found"}
# ✅ Actionable error — model can self-correct
{
  "success": false,
  "error": "File not found: /data/reports/2024/summary.csv",
  "suggestion": "Try listing /data/reports/2024/ to see available files, or check the path spelling.",
  "available_directories": ["/data/reports/2023", "/data/reports/2025"]
}
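A tool can often build the recovery hint from information it already has at failure time. The sketch below assumes a hypothetical `read_file_tool` and an `available_entries` field; neither name comes from a specific framework.

```python
import json
from pathlib import Path


def read_file_tool(path: str) -> str:
    """Read a file, or return an error payload the model can act on."""
    target = Path(path)
    if target.is_file():
        return json.dumps({"success": True, "content": target.read_text(encoding="utf-8")})

    parent = target.parent
    # List what actually exists next to the missing path so the model can self-correct.
    siblings = sorted(p.name for p in parent.iterdir()) if parent.is_dir() else []
    if siblings:
        suggestion = f"Pick a name from available_entries in {parent}."
    else:
        suggestion = f"Directory {parent} is missing or empty; check the path spelling."
    return json.dumps({
        "success": False,
        "error": f"File not found: {path}",
        "suggestion": suggestion,
        "available_entries": siblings,
    })
```

The error echoes the exact path that failed, which lets the model spot its own typo instead of retrying the same call.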
Rule 6: Limit Tool Count With Tiered Exposure
Beyond ~20 tools in a single prompt, model selection accuracy drops measurably. In our testing, going from 10 to 30 tools increased wrong-tool calls by 40%.
Tiered exposure strategy:
- Tier 1 (always visible): 5-8 core tools — read, write, search, execute, ask
- Tier 2 (context-gated): Advanced tools exposed only when the task mentions relevant keywords
- Tier 3 (on-demand): Specialized tools the Agent can discover via a list_advanced_tools() meta-tool
# Tiered tool registry pattern
TOOL_TIERS = {
    "tier1": ["read_file", "write_file", "search_web", "execute_code", "ask_user"],
    "tier2": ["query_database", "send_email", "create_chart", "run_test_suite"],
    "tier3": ["deploy_service", "manage_permissions", "generate_report"],
}
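The registry alone does not decide what the model sees on a given turn. A minimal selection sketch follows, assuming simple substring keyword gates for Tier 2 and an explicit request set for Tier 3; `TIER2_KEYWORDS`, `visible_tools`, and the keywords themselves are hypothetical.

```python
TOOL_TIERS = {
    "tier1": ["read_file", "write_file", "search_web", "execute_code", "ask_user"],
    "tier2": ["query_database", "send_email", "create_chart", "run_test_suite"],
    "tier3": ["deploy_service", "manage_permissions", "generate_report"],
}

# Hypothetical keyword gates for Tier 2; tune these against real task text.
TIER2_KEYWORDS = {
    "query_database": ("sql", "database"),
    "send_email": ("email",),
    "create_chart": ("chart", "plot"),
    "run_test_suite": ("test suite", "unit test"),
}


def list_advanced_tools() -> list[str]:
    """Meta-tool: let the Agent discover Tier 3 tool names on demand."""
    return list(TOOL_TIERS["tier3"])


def visible_tools(task: str, requested: frozenset[str] = frozenset()) -> list[str]:
    """Return the tools to expose for this task: Tier 1, gated Tier 2, requested Tier 3."""
    task_lower = task.lower()
    tools = list(TOOL_TIERS["tier1"])
    tools += [name for name in TOOL_TIERS["tier2"]
              if any(keyword in task_lower for keyword in TIER2_KEYWORDS[name])]
    tools += [name for name in TOOL_TIERS["tier3"] if name in requested]
    return tools
```

Substring matching is crude and will misfire on some tasks; an embedding-based or classifier-based gate is a reasonable replacement once keyword gating stops being accurate enough.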
Rule 7: Design for Idempotency
Agents retry. A lot. If calling the same tool twice with the same parameters produces different side effects (double-charging a customer, sending duplicate emails, creating duplicate records), you have a serious problem.
| Operation | Non-Idempotent (Dangerous) | Idempotent (Safe) |
|---|---|---|
| Create user | create_user(email) — creates duplicate | get_or_create_user(email) — returns existing if present |
| Send message | send(to, body) — sends every call | send(to, body, idempotency_key) — deduplicates |
| Charge payment | charge(amount) — charges every call | charge(order_id, amount) — skips if already charged |
The idempotency_key pattern is the simplest fix: generate a unique key per logical operation, pass it to the tool, and have the tool skip execution if it has already processed that key.
Rule 8: Test Tools With Real Model Calls
Unit-testing tool implementations isn't enough. You need to test whether the model actually uses the tool correctly. Schema correctness ≠ model compatibility.
def test_tool_usage(tool_def, test_scenarios):
    """Test whether the model calls the tool correctly across scenarios."""
    # `model` is assumed to be a chat client defined elsewhere in the test module.
    for scenario in test_scenarios:
        response = model.chat(
            messages=[{"role": "user", "content": scenario["prompt"]}],
            tools=[tool_def],
        )
        # Fail with a clear message if the model answered without calling a tool
        assert response.tool_calls, f"No tool call for: {scenario['prompt']}"
        tool_call = response.tool_calls[0]
        # Verify: did the model call the right tool?
        assert tool_call.name == scenario["expected_tool"], \
            f"Expected {scenario['expected_tool']}, got {tool_call.name}"
        # Verify: are the parameters reasonable?
        for param, validator in scenario["param_checks"].items():
            assert validator(tool_call.arguments.get(param)), \
                f"Parameter '{param}' validation failed"
A parameter typed as "string" with no examples might receive Markdown when the tool expects plain text. Always test with the actual model you'll use in production.
Quick Reference Checklist
| Check | Rule |
|---|---|
| ☐ Description includes trigger condition? | Rule 1 |
| ☐ Every parameter has a description? | Rule 2 |
| ☐ One complete operation per tool? | Rule 3 |
| ☐ Return format is consistent and parseable? | Rule 4 |
| ☐ Errors include suggested recovery actions? | Rule 5 |
| ☐ Under 20 tools visible at once? | Rule 6 |
| ☐ Side-effect tools have idempotency keys? | Rule 7 |
| ☐ Tested with actual model calls? | Rule 8 |
Frequently Asked Questions
Q: How do I write tool descriptions that models actually understand?
A: Good descriptions need: ① trigger condition (when to use), ② behavior (what it does), ③ return value (output structure).
Q: How many tools can an Agent handle?
A: Beyond ~20 tools, accuracy drops measurably. Keep 5-8 core tools always visible, gate advanced tools by context.
Q: Why design tools for idempotency?
A: Agents retry a lot. Idempotency ensures that a repeated call with the same parameters does not duplicate side effects such as charges, emails, or records.
Q: How do I test tool definitions?
A: Test with real model calls. Design multiple scenario prompts and verify tool selection and parameters. Make this part of CI.