
Agent Tool Design Best Practices

30-Second Takeaway

  • Problem Solved: Poorly defined tools cause models to call the wrong tool, pass bad parameters, and loop infinitely.
  • Core Method: 8 production-tested rules: trigger conditions, self-documenting parameters, proper granularity, structured output, actionable error messages, tiered exposure, idempotency, real model testing.
  • Key Insight: Schema correctness ≠ model compatibility. Always test with the actual model you'll use in production.
  • What You'll Gain: A checklist to audit and improve every tool definition.

Tools are the Agent's hands and feet. Well-designed tools make the model unstoppable; poorly designed ones lead to wrong tool calls, bad parameters, and infinite loops.

This article distills 8 rules from real production Agent projects. Each rule comes with before/after examples you can apply immediately.

Rule 1: Tool Descriptions Must Include Trigger Conditions

Don't just describe what a tool does — describe exactly when to use it. The model needs to answer: "Given my current situation, should I call this tool?"

A good description has three parts:

  1. Trigger condition — what situation warrants this tool
  2. What it does — the action it performs
  3. What it returns — the shape and meaning of the output
# ❌ Too vague — model doesn't know when to use this
"description": "Search the web"

# ✅ Clear trigger + behavior + return value
"description": "Search the web for current information. Use when the answer requires real-time or recent data beyond your training cutoff, or when explicitly asked to look up current facts. Returns top 10 results with titles, URLs, and snippets."
💡 Real lesson: We once had a tool named fetch_documentation that models almost never called. Adding "Use this when you encounter an unfamiliar library, API, or framework" tripled its usage rate — the model needed the trigger condition spelled out.
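
One way to make the three-part structure hard to skip is to assemble each description from separate fields. The following is a minimal sketch, not a framework feature: `build_description` and its argument names are hypothetical helpers introduced here for illustration.

```python
def build_description(does: str, use_when: str, returns: str) -> str:
    """Assemble a tool description from its three required parts.

    Raises ValueError if any part is blank, so an incomplete
    description fails at definition time instead of in production.
    """
    parts = {"does": does, "use_when": use_when, "returns": returns}
    missing = [name for name, text in parts.items() if not text.strip()]
    if missing:
        raise ValueError(f"Description is missing: {', '.join(missing)}")
    # Fixed phrasing keeps the trigger and return value easy to spot.
    return f"{does.strip()} Use when {use_when.strip()} Returns {returns.strip()}"


search_web_description = build_description(
    does="Search the web for current information.",
    use_when="the answer requires real-time or recent data beyond your training cutoff.",
    returns="top 10 results with titles, URLs, and snippets.",
)
print(search_web_description)
```

The fixed "Use when ... Returns ..." phrasing is a design choice of this sketch; any consistent wording should work as long as all three parts are present.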

Rule 2: Parameter Names Are Natural Language Prompts

Remember: parameter names are part of the prompt the model reads. They influence tool selection and parameter filling accuracy.

| ❌ Bad Name | ✅ Good Name | Why |
|---|---|---|
| q | search_query | Self-documenting; model understands intent |
| fp | file_path | Avoids abbreviations the model might misinterpret |
| id | user_id | Scoped name prevents ambiguity with other IDs |
| data | csv_content | Describes the format and content expected |

Each parameter also needs its own description field — never skip it. A parameter named limit with no description is a guessing game for the model.

# ❌ Missing descriptions
{"name": "limit", "type": "integer"}

# ✅ Self-explanatory
{"name": "max_results", "type": "integer",
 "description": "Maximum number of results to return. Default 10, max 50."}
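
A check like this can be automated. The sketch below assumes tool definitions use a JSON-Schema-style layout with parameters under `parameters.properties`; that layout, the `audit_parameters` helper, and the name-length threshold are assumptions for illustration, so adapt them to however your tools are declared.

```python
def audit_parameters(tool_def: dict, min_name_length: int = 3) -> list:
    """Return human-readable problems found in a tool's parameters.

    The length check is a rough heuristic for cryptic names like
    `q` or `fp`; it cannot judge whether a longer name is clear.
    """
    problems = []
    properties = tool_def.get("parameters", {}).get("properties", {})
    for name, spec in properties.items():
        if len(name) < min_name_length:
            problems.append(f"'{name}': name is too short to be self-documenting")
        if not spec.get("description", "").strip():
            problems.append(f"'{name}': missing description")
    return problems


# Hypothetical tool definition with one bad and one good parameter.
tool = {
    "name": "search_web",
    "parameters": {
        "type": "object",
        "properties": {
            "q": {"type": "string"},
            "max_results": {
                "type": "integer",
                "description": "Maximum number of results to return. Default 10, max 50.",
            },
        },
    },
}

for problem in audit_parameters(tool):
    print(problem)
```

Here only `q` is flagged, once for its name and once for its missing description.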

Rule 3: Tool Granularity — One Complete Operation Per Tool

This is the hardest rule to get right. The extremes are easy to spot, but the sweet spot takes practice.

| Pattern | Example | Problem |
|---|---|---|
| Too fine | open_file(), read_byte(), close_file() | 10 sequential calls for one task; model loses context |
| Too coarse | do_everything(action, target, format, filter, sort, ...) | Parameter explosion; model doesn't know what to pass |
| Just right | read_file(path), write_file(path, content), search_files(pattern) | One complete operation each; composable but independent |

Golden rule: If you can describe the tool's entire purpose in one sentence that names a single operation, it's probably the right granularity. "Reads a file and returns its contents" is good: one operation plus its result. "Reads a file, parses it, filters lines, and writes output" is too coarse: four separate operations chained together.

Rule 4: Return Structured, Parseable Output

Tool return values become the model's next input. Garbage in the return → garbage in the model's reasoning.

# ❌ Unstructured — model has to parse natural language
"Found 3 files: report.csv (2.3MB, modified 2024-01-15),
data.json (156KB, modified 2024-01-14), notes.txt (4KB, modified 2024-01-10)"

# ✅ Structured JSON — model extracts fields accurately
{
  "files": [
    {"name": "report.csv", "size_bytes": 2411725, "modified": "2024-01-15T14:30:00Z"},
    {"name": "data.json", "size_bytes": 159744, "modified": "2024-01-14T09:15:00Z"},
    {"name": "notes.txt", "size_bytes": 4096, "modified": "2024-01-10T18:00:00Z"}
  ],
  "count": 3
}

JSON isn't mandatory — but consistency is. If you return text, use a predictable format. If you return JSON, follow the same schema across all tools.
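
One way to enforce that consistency is to route every tool's return value through a single helper. This is a sketch under assumptions: the envelope fields (`success`, `data`, `error`) and the function names are illustrative choices, not a standard.

```python
import json
from typing import Optional


def make_result(data: Optional[dict] = None, error: Optional[str] = None) -> str:
    """Serialize a tool result into one envelope shared by every tool."""
    envelope = {"success": error is None, "data": data, "error": error}
    return json.dumps(envelope)


def list_files_result(files: list) -> str:
    """Wrap a file listing in the shared envelope."""
    return make_result({"files": files, "count": len(files)})


print(list_files_result([
    {"name": "notes.txt", "size_bytes": 4096, "modified": "2024-01-10T18:00:00Z"},
]))
print(make_result(error="File not found: /data/reports/2024/summary.csv"))
```

Because every tool returns the same top-level keys, the model can always look at `success` first and then read either `data` or `error`.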

Rule 5: Error Messages Must Suggest Next Actions

The worst thing a tool can return is an empty string or a vague "Error occurred." The model then has zero information to recover. It will either retry the same broken call or hallucinate a result.

# ❌ Useless error
{"error": "Failed"}

# ❌ Better but still unhelpful
{"error": "File not found"}

# ✅ Actionable error — model can self-correct
{
  "success": false,
  "error": "File not found: /data/reports/2024/summary.csv",
  "suggestion": "Try listing /data/reports/2024/ to see available files, or check the path spelling.",
  "available_directories": ["/data/reports/2023", "/data/reports/2025"]
}
💡 Error taxonomy: We classify tool errors into three types to guide recovery strategy — Retryable (timeout, rate limit), Fixable (bad parameter, wrong file path), and Fatal (permission denied, service down). Include this classification in the error response so the Agent can decide: retry, adjust, or escalate.
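
The three-way taxonomy could be attached to error responses roughly as follows. The `error_type` field name and the exception-to-category mapping are assumptions made for this sketch; a real tool would map its own failure modes.

```python
from enum import Enum


class ErrorKind(str, Enum):
    RETRYABLE = "retryable"  # e.g. timeout, rate limit
    FIXABLE = "fixable"      # e.g. bad parameter, wrong file path
    FATAL = "fatal"          # e.g. permission denied, service down


def classify_exception(exc: Exception) -> ErrorKind:
    """Map a Python exception to a recovery category (illustrative mapping)."""
    if isinstance(exc, TimeoutError):
        return ErrorKind.RETRYABLE
    if isinstance(exc, (FileNotFoundError, ValueError)):
        return ErrorKind.FIXABLE
    # Anything unrecognized is treated as fatal so the Agent escalates.
    return ErrorKind.FATAL


def error_response(exc: Exception, suggestion: str) -> dict:
    """Build an actionable error payload that includes the category."""
    return {
        "success": False,
        "error": str(exc),
        "error_type": classify_exception(exc).value,
        "suggestion": suggestion,
    }


response = error_response(
    FileNotFoundError("File not found: /data/reports/2024/summary.csv"),
    "Try listing /data/reports/2024/ to see available files.",
)
print(response["error_type"])
```

Defaulting unknown exceptions to fatal is a conservative choice: it prefers escalation over blind retries.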

Rule 6: Limit Tool Count With Tiered Exposure

Beyond ~20 tools in a single prompt, model selection accuracy drops measurably. In our testing, going from 10 to 30 tools increased wrong-tool calls by 40%.

Tiered exposure strategy:

  1. Tier 1 (always visible): 5-8 core tools — read, write, search, execute, ask
  2. Tier 2 (context-gated): Advanced tools exposed only when the task mentions relevant keywords
  3. Tier 3 (on-demand): Specialized tools the Agent can discover via a list_advanced_tools() meta-tool
# Tiered tool registry pattern
TOOL_TIERS = {
    "tier1": ["read_file", "write_file", "search_web", "execute_code", "ask_user"],
    "tier2": ["query_database", "send_email", "create_chart", "run_test_suite"],
    "tier3": ["deploy_service", "manage_permissions", "generate_report"]
}
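
Building on that registry, the selection step might look like the sketch below. The keyword lists and the `visible_tools` function are hypothetical; only the tier contents and the `list_advanced_tools` meta-tool name come from the text above.

```python
import re

TOOL_TIERS = {
    "tier1": ["read_file", "write_file", "search_web", "execute_code", "ask_user"],
    "tier2": ["query_database", "send_email", "create_chart", "run_test_suite"],
    "tier3": ["deploy_service", "manage_permissions", "generate_report"],
}

# Hypothetical keyword triggers for tier-2 tools.
TIER2_KEYWORDS = {
    "query_database": ["database", "sql", "table"],
    "send_email": ["email", "mail"],
    "create_chart": ["chart", "plot", "graph"],
    "run_test_suite": ["test", "ci"],
}


def visible_tools(task: str) -> list:
    """Return tier-1 tools, keyword-matched tier-2 tools, and the meta-tool."""
    # Whole-word matching avoids false hits such as "ci" inside "decide".
    words = set(re.findall(r"[a-z0-9_]+", task.lower()))
    gated = [
        name for name in TOOL_TIERS["tier2"]
        if words & set(TIER2_KEYWORDS.get(name, []))
    ]
    return TOOL_TIERS["tier1"] + gated + ["list_advanced_tools"]


def list_advanced_tools() -> list:
    """Meta-tool: lets the Agent discover tier-3 tools on demand."""
    return list(TOOL_TIERS["tier3"])


print(visible_tools("Plot a chart of sales from the database"))
```

Exact keyword matching is deliberately simple and will miss plurals or synonyms; an embedding-based or classifier-based gate is a possible replacement if that proves too brittle.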

Rule 7: Design for Idempotency

Agents retry. A lot. If calling the same tool twice with the same parameters produces different side effects (double-charging a customer, sending duplicate emails, creating duplicate records), you have a serious problem.

| Operation | Non-Idempotent (Dangerous) | Idempotent (Safe) |
|---|---|---|
| Create user | create_user(email): creates duplicate | get_or_create_user(email): returns existing if present |
| Send message | send(to, body): sends every call | send(to, body, idempotency_key): deduplicates |
| Charge payment | charge(amount): charges every call | charge(order_id, amount): skips if already charged |

The idempotency_key pattern is the simplest fix: generate a unique key per logical operation, pass it to the tool, and have the tool skip execution if it has already processed that key.
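
A minimal in-memory sketch of that pattern is shown below. `EmailSender` and `make_key` are hypothetical names, and the dictionary stands in for what would need to be a durable, shared store (a database table, for example) in production.

```python
import hashlib


class EmailSender:
    """In-memory sketch of an idempotent send tool."""

    def __init__(self):
        self._processed = {}  # idempotency_key -> stored result
        self.sent = []        # stand-in for the real delivery side effect

    def send(self, to: str, body: str, idempotency_key: str) -> dict:
        if idempotency_key in self._processed:
            # Already handled: return the stored result without re-sending.
            return {**self._processed[idempotency_key], "duplicate": True}
        self.sent.append((to, body))
        result = {"success": True, "to": to}
        self._processed[idempotency_key] = result
        return {**result, "duplicate": False}


def make_key(operation: str, *parts: str) -> str:
    """Derive a stable key for one logical operation."""
    raw = "|".join((operation, *parts))
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


sender = EmailSender()
key = make_key("send_welcome_email", "user-42")
first = sender.send("a@example.com", "Welcome!", key)
retry = sender.send("a@example.com", "Welcome!", key)
print(len(sender.sent), first["duplicate"], retry["duplicate"])  # prints: 1 False True
```

Deriving the key from the logical operation (rather than generating a random value per call) matters here: a retry must produce the same key as the original attempt, or deduplication never triggers.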

Rule 8: Test Tools With Real Model Calls

Unit-testing tool implementations isn't enough. You need to test whether the model actually uses the tool correctly. Schema correctness ≠ model compatibility.

def test_tool_usage(model, tool_def, test_scenarios):
    """Test if the model calls the tool correctly across scenarios.

    `model` is any client exposing chat(messages, tools) whose response has
    a `tool_calls` list; adapt the attribute names to your SDK.
    """
    for scenario in test_scenarios:
        response = model.chat(
            messages=[{"role": "user", "content": scenario["prompt"]}],
            tools=[tool_def]
        )

        # Verify: did the model call a tool at all?
        assert response.tool_calls, \
            f"No tool call for prompt: {scenario['prompt']!r}"
        tool_call = response.tool_calls[0]

        # Verify: did the model call the right tool?
        assert tool_call.name == scenario["expected_tool"], \
            f"Expected {scenario['expected_tool']}, got {tool_call.name}"

        # Verify: are parameters reasonable?
        for param, validator in scenario.get("param_checks", {}).items():
            assert validator(tool_call.arguments.get(param)), \
                f"Parameter '{param}' validation failed"
⚠️ Common pitfall: A tool definition that passes JSON Schema validation can still confuse the model. For example, a parameter typed as "string" with no examples might receive Markdown when the tool expects plain text. Always test with the actual model you'll use in production.
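
Scenario design matters as much as the harness. The sketch below shows one possible shape for scenarios, including a negative case where no tool should be called, and reports a pass rate rather than a single pass/fail. Everything here is an assumption for illustration: `StubModel` is a deterministic stand-in so the example runs offline, and the `chat` / `tool_calls` / `arguments` interface mirrors the pseudocode above rather than any specific SDK.

```python
from dataclasses import dataclass, field
from types import SimpleNamespace
from typing import Callable, Dict, Optional


@dataclass
class Scenario:
    prompt: str
    expected_tool: Optional[str]  # None means "no tool should be called"
    param_checks: Dict[str, Callable] = field(default_factory=dict)


class StubModel:
    """Deterministic stand-in for a real model client (hypothetical interface)."""

    def chat(self, messages, tools):
        text = messages[-1]["content"].lower()
        if "latest" in text:
            call = SimpleNamespace(name="search_web", arguments={"search_query": text})
            return SimpleNamespace(tool_calls=[call])
        return SimpleNamespace(tool_calls=[])


def scenario_passes(model, tool_def, scenario: Scenario) -> bool:
    """Return True if the model's behavior matches the scenario."""
    response = model.chat(
        messages=[{"role": "user", "content": scenario.prompt}],
        tools=[tool_def],
    )
    calls = response.tool_calls or []
    if scenario.expected_tool is None:
        return not calls  # negative case: any tool call is a failure
    if not calls or calls[0].name != scenario.expected_tool:
        return False
    args = calls[0].arguments
    return all(check(args.get(p)) for p, check in scenario.param_checks.items())


scenarios = [
    Scenario("What is the latest Python release?", "search_web",
             {"search_query": lambda v: isinstance(v, str) and len(v) > 0}),
    Scenario("What is 2 + 2?", None),
]
tool_def = {"name": "search_web"}
results = [scenario_passes(StubModel(), tool_def, s) for s in scenarios]
pass_rate = sum(results) / len(results)
print(f"pass rate: {pass_rate:.0%}")
```

With a real model, outputs can vary between runs, so gating CI on a pass-rate threshold over repeated runs may be more practical than requiring every scenario to pass every time.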

Quick Reference Checklist

| Check | Rule |
|---|---|
| ☐ Description includes trigger condition? | Rule 1 |
| ☐ Every parameter has a description? | Rule 2 |
| ☐ One complete operation per tool? | Rule 3 |
| ☐ Return format is consistent and parseable? | Rule 4 |
| ☐ Errors include suggested recovery actions? | Rule 5 |
| ☐ Under 20 tools visible at once? | Rule 6 |
| ☐ Side-effect tools have idempotency keys? | Rule 7 |
| ☐ Tested with actual model calls? | Rule 8 |

Frequently Asked Questions

Q: How do you write tool descriptions that models actually understand?

A: Good descriptions need: ① trigger condition (when to use), ② behavior (what it does), ③ return value (output structure).

Q: How many tools can an Agent handle?

A: Beyond ~20 tools, accuracy drops measurably. Keep 5-8 core tools always visible, gate advanced tools by context.

Q: Why design tools for idempotency?

A: Agents retry a lot. Idempotency ensures that repeating a call with the same parameters does not repeat its side effects, so a retry cannot double-charge a customer or send a duplicate email.

Q: How do you test tool definitions?

A: Test with real model calls. Design multiple scenario prompts and verify tool selection and parameters. Make this part of CI.