Function Calling Architectures in LLM Systems
Function calling (tool use) turns an LLM from a text generator into an orchestrator that proposes actions: query a database, call an HTTP API, run a script, or trigger internal workflows. The architecture problem is not “expose OpenAPI to the model”—it is safe execution, clear routing, composable chains, and recovery when tools error, time out, or return shapes the model misreads. This article outlines production patterns for tool schemas, selection logic, multi-step chains, and failure handling.
Introduction
Tool-calling APIs let models emit structured calls instead of only natural language. Your runtime executes them and feeds the resulting observations back into the conversation. That loop is powerful and dangerous: tools have side effects. A mistaken DELETE, a broad SELECT *, or an unbounded search can harm users and systems. Mature designs treat tools like microservices behind an API gateway, with authz, quotas, sandboxing, and typed responses.
System Architecture
Tool gateway centralizes:
- Authentication mapping from end-user to tool credentials (often service accounts with scoped roles).
- Input validation beyond JSON Schema (max string lengths, regex allowlists for IDs).
- Timeouts and cancellation propagation from the user aborting the request.
- Audit logging of tool name, arguments digest, outcome, latency.
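The validation and audit responsibilities listed above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: validate_args, args_digest, audit_record, the RECORD_ID pattern, and the 256-character cap are hypothetical names and limits chosen for the example.

```python
import hashlib
import json
import re

# Hypothetical limits; real values depend on the tool and its backend.
MAX_STRING_LEN = 256
RECORD_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")

def validate_args(args: dict) -> list[str]:
    """Checks that go beyond JSON Schema: length caps and an ID allowlist."""
    problems = []
    for key, value in args.items():
        if isinstance(value, str) and len(value) > MAX_STRING_LEN:
            problems.append(f"{key}: longer than {MAX_STRING_LEN} characters")
        if key.endswith("_id") and not (
            isinstance(value, str) and RECORD_ID.fullmatch(value)
        ):
            problems.append(f"{key}: does not match the ID allowlist")
    return problems

def args_digest(args: dict) -> str:
    """Stable hash of the arguments, so audit logs need not store raw values."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def audit_record(tool: str, args: dict, outcome: str, latency_ms: float) -> dict:
    """One audit log entry: tool name, arguments digest, outcome, latency."""
    return {"tool": tool, "args_digest": args_digest(args),
            "outcome": outcome, "latency_ms": latency_ms}
```

Hashing a canonical (key-sorted) serialization means two calls with the same arguments produce the same digest regardless of key order, which makes repeated calls easy to spot in the log.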
Orchestrator owns the loop: call LLM → if tool_calls, execute in parallel where safe → append tool messages → repeat until final assistant message or limits (max steps, max wall time).
Core Technical Mechanisms
Tool schema: Name, description, JSON parameters with types and required fields. Descriptions are part of the “prompt”; misleading descriptions cause misuse.
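As an illustration, a tool definition might look like the following. The get_order_status tool and its parameters are hypothetical, and the exact envelope around name, description, and parameters varies by provider; the point is that the description states what the tool does not do, and the schema rejects unknown fields.

```python
# Hypothetical tool definition; the parameters block is plain JSON Schema.
get_order_status = {
    "name": "get_order_status",
    "description": (
        "Look up the current status of one order by its ID. "
        "Read-only. Does not cancel, refund, or modify the order."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Order identifier, for example 'ord_12345'.",
            },
            "include_history": {
                "type": "boolean",
                "description": "If true, also return past status changes.",
            },
        },
        "required": ["order_id"],
        "additionalProperties": False,
    },
}

def missing_required(tool: dict, args: dict) -> list[str]:
    """Return required parameter names that the call did not supply."""
    return [name for name in tool["parameters"]["required"] if name not in args]
```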
Router vs end-to-end: A router model or classifier first picks a small tool subset, then the worker model calls tools within that subset—reduces schema token load and wrong-tool selection.
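The routing step can be sketched as follows. A keyword match stands in for the router model or classifier, and the CATALOG contents and route function are hypothetical; a production router would more likely use embeddings or a small model.

```python
# Hypothetical catalog: tool name -> tags used for matching.
CATALOG = {
    "search_orders": {"order", "shipping"},
    "refund_payment": {"refund", "payment"},
    "search_docs": {"docs", "how"},
    "create_ticket": {"ticket", "support"},
}

def route(user_message: str, max_tools: int = 2) -> list[str]:
    """Return up to max_tools tool names whose tags overlap the message."""
    words = set(user_message.lower().split())
    scored = [(len(tags & words), name) for name, tags in CATALOG.items()]
    # Highest overlap first; ties broken by name so the result is deterministic.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked[:max_tools]]
```

Only the schemas for the returned names are then passed to the worker model, so the other tools cost no schema tokens on that turn.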
Multi-tool chains: Sequential tools where the output of A feeds B. This needs explicit state passing; avoid relying on the model to carry large tool outputs in its context. Summarize them, or store them externally and pass handles.
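A minimal sketch of handle passing, assuming an in-memory dictionary as a stand-in for the external store. ObservationStore, tool_a_search, and tool_b_top_score are hypothetical names used only for this example.

```python
import uuid

class ObservationStore:
    """In-memory stand-in for an external store (object storage, cache, database)."""

    def __init__(self) -> None:
        self._blobs: dict[str, list[dict]] = {}

    def put(self, rows: list[dict]) -> dict:
        """Store the full result and return a small observation for the model."""
        handle = f"obs_{uuid.uuid4().hex[:12]}"
        self._blobs[handle] = rows
        return {"handle": handle, "row_count": len(rows), "preview": rows[:3]}

    def get(self, handle: str) -> list[dict]:
        return self._blobs[handle]

store = ObservationStore()

def tool_a_search() -> dict:
    rows = [{"record_id": i, "score": i % 7} for i in range(10_000)]
    return store.put(rows)      # the model sees only the handle and a preview

def tool_b_top_score(handle: str) -> dict:
    rows = store.get(handle)    # tool B resolves the handle server-side
    return max(rows, key=lambda r: r["score"])
```

The model passes the handle from A's observation as an argument to B, so the 10,000 rows never enter the context window.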
Error recovery: Tools return errors as structured objects ({ "error": "...", "retryable": true }) so the model (or code) can branch.
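One way the code-side branch might look, assuming tools raise TimeoutError for transient failures and ValueError for bad arguments. That exception mapping, and the names tool_error and invoke_with_retry, are assumptions for this sketch.

```python
import time

def tool_error(message: str, retryable: bool) -> dict:
    """Structured error observation in the shape described above."""
    return {"error": message, "retryable": retryable}

def invoke_with_retry(tool, args: dict, max_attempts: int = 3,
                      base_delay: float = 0.5) -> dict:
    """Retry in code while the failure is retryable; otherwise hand the error to the model."""
    obs = tool_error("not attempted", retryable=False)
    for attempt in range(max_attempts):
        try:
            return tool(**args)
        except TimeoutError:
            obs = tool_error("tool timed out", retryable=True)
        except ValueError as exc:
            # Bad arguments will not succeed on retry; let the model correct them.
            return tool_error(f"invalid arguments: {exc}", retryable=False)
        if attempt + 1 < max_attempts:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    return obs
```

If every attempt times out, the model still receives an observation with retryable set to true, so it can decide whether to try again later or tell the user.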
Production Implementation Patterns
Parallelism: Independent reads can run concurrently; writes should be serialized or use conflict detection. Do not let the model choose arbitrary parallelism without a policy layer.
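A policy layer of this kind could be sketched with asyncio as below. The READ_ONLY set and execute_calls are hypothetical; the sketch assumes invoke is an async callable taking a tool name and an arguments dict.

```python
import asyncio

# Hypothetical policy table: only tools listed here may run concurrently.
READ_ONLY = {"get_order", "search_docs"}

async def execute_calls(calls: list[tuple[str, dict]], invoke) -> list:
    """Run read-only calls concurrently, then writes one at a time.

    Results are returned in the same order as the incoming calls.
    """
    results: list = [None] * len(calls)
    reads = [i for i, (name, _) in enumerate(calls) if name in READ_ONLY]
    writes = [i for i, (name, _) in enumerate(calls) if name not in READ_ONLY]

    read_results = await asyncio.gather(*(invoke(*calls[i]) for i in reads))
    for i, res in zip(reads, read_results):
        results[i] = res
    for i in writes:  # serialized: one write in flight at a time
        results[i] = await invoke(*calls[i])
    return results
```

Note that this sketch runs all reads before any write. That is only safe if calls within one batch are independent; a stricter policy would execute a batch containing any write in its original order.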
Idempotency keys: For tools that charge money or mutate state, require idempotency keys generated by the server, not guessed by the model.
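A minimal sketch of server-side key generation, assuming the orchestrator tracks a session ID, a step counter, and the index of the call within the step. PaymentTool and idempotency_key are hypothetical names; a real tool would persist completed keys durably rather than in memory.

```python
import hashlib

class PaymentTool:
    """Hypothetical money-moving tool that deduplicates on a server-generated key."""

    def __init__(self) -> None:
        self._completed: dict[str, dict] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        if idempotency_key in self._completed:
            return self._completed[idempotency_key]  # replay: no second charge
        self.charges_made += 1
        result = {"status": "charged", "amount_cents": amount_cents}
        self._completed[idempotency_key] = result
        return result

def idempotency_key(session_id: str, step: int, call_index: int) -> str:
    """Derived from values the orchestrator controls, never from model output."""
    raw = f"{session_id}:{step}:{call_index}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()
```

Because the key depends only on the position of the call in the session, a retry of the same call after a timeout reuses the key and cannot charge twice.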
Pseudo-code for the loop:
    steps = 0
    while steps < MAX_STEPS:
        resp = llm.chat(messages, tools=tool_defs)
        if not resp.tool_calls:
            return resp.content  # final assistant message
        # Keep the assistant turn that requested the tools in the history
        # (resp.message is a placeholder for however your client exposes it).
        messages.append(resp.message)
        for call in resp.tool_calls:
            if not policy.allows(user, call.name, call.arguments):
                obs = tool_error("forbidden", retryable=False)
            else:
                obs = gateway.invoke(call.name, call.arguments)
            messages.append(tool_message(call.id, obs))
        steps += 1
    return give_up_or_escalate()  # step budget exhausted

Observation shaping: Trim large payloads before re-injecting. Include stable identifiers (record_id) so follow-up tools reference the same entity without re-sending megabytes.
Multi-tool selection errors: If the model calls a tool not suitable for the task, return a structured hint in the observation rather than raw stack traces—stack traces teach the model to parrot internals.
Operational Challenges
Tool catalog governance
As tools proliferate, introduce a registry with owners, risk tier, and deprecation dates. Unused tools still consume schema tokens if the model can see them—archive aggressively. Breaking changes to arguments should be versioned (search_v2) rather than silently mutating search.
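A registry entry can be as small as a dataclass. The sketch below is illustrative only: the ToolEntry fields, the team names, the tier labels, and the dates are all made up for the example.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass(frozen=True)
class ToolEntry:
    name: str
    owner: str                          # team accountable for the tool
    risk_tier: str                      # e.g. "read", "write", "financial"
    deprecated_on: Optional[date] = None

REGISTRY = [
    ToolEntry("search", "team-search", "read", deprecated_on=date(2024, 6, 1)),
    ToolEntry("search_v2", "team-search", "read"),
    ToolEntry("refund_payment", "team-billing", "financial"),
]

def visible_tools(today: date, allowed_tiers: set[str]) -> list[str]:
    """Names exposed to the model: not past deprecation, within allowed tiers."""
    return [
        t.name for t in REGISTRY
        if t.risk_tier in allowed_tiers
        and (t.deprecated_on is None or today < t.deprecated_on)
    ]
```

Filtering at catalog-build time means a deprecated tool stops consuming schema tokens on its deprecation date without a code change.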
Observability for tool storms
Alert when a single session issues unusual patterns: dozens of reads followed by a write, rapid alternation between two tools, or repeated forbidden responses. Those patterns may be benign automation—or scripted abuse testing your gateway.
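Two of these patterns, repeated forbidden responses and rapid alternation between two tools, can be detected with a simple pass over a session's call log. The storm_signals function and its thresholds are hypothetical and would need tuning against real traffic.

```python
from collections import Counter

def storm_signals(events: list[tuple[str, str]],
                  max_forbidden: int = 3,
                  max_alternations: int = 6) -> list[str]:
    """events is one session's ordered (tool_name, outcome) pairs."""
    signals = []
    outcomes = Counter(outcome for _, outcome in events)
    if outcomes["forbidden"] >= max_forbidden:
        signals.append("repeated_forbidden")

    # Count A, B, A, B ... flips: positions where the tool differs from the
    # previous call but equals the one before that.
    names = [name for name, _ in events]
    flips = sum(
        1 for i in range(2, len(names))
        if names[i] != names[i - 1] and names[i] == names[i - 2]
    )
    if flips >= max_alternations:
        signals.append("rapid_alternation")
    return signals
```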
Metrics: tool error rate by name, p95 tool latency, fraction of sessions hitting step limits, rate of forbidden attempts.
Feature flags per tool for gradual rollout.
Dry-run mode: model proposes tool calls; humans approve for high-risk domains.
Contract tests: given a synthetic tool_message, the next model call should produce a specific structured action (golden tests are flaky with LLMs—use ranges and parsers where possible).
Document SLOs per tool class: reads may be fast, while bulk exports may take minutes. Set user expectations accordingly.
Tool output shaping and token hygiene
Large tool payloads blow the context window on the next turn. Define a maximum observation size in tokens for each tool, with a summarization strategy to match: tables to aggregates, logs to tail excerpts, JSON to allowlisted keys. If the model truly needs the full payload, store it externally and pass a handle; never paste megabytes inline by default.
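A minimal sketch of key allowlisting with a size cap. The SHAPING table and shape_observation are hypothetical, and character count is used as a rough stand-in for a real token count.

```python
import json

# Hypothetical per-tool rules: which keys survive and how large the result may be.
SHAPING = {
    "get_customer": {"keep": ["record_id", "name", "status"], "max_chars": 400},
}

def shape_observation(tool: str, payload: dict, handle: str) -> str:
    """Keep allowlisted keys; if still too large, return a pointer instead."""
    rule = SHAPING.get(tool, {"keep": ["record_id"], "max_chars": 200})
    kept = {k: payload[k] for k in rule["keep"] if k in payload}
    text = json.dumps(kept)
    if len(text) > rule["max_chars"]:
        # Too large even after filtering: pass the stable ID and a handle.
        text = json.dumps({"record_id": payload.get("record_id"),
                           "truncated": True, "handle": handle})
    return text
```

Tools without an explicit rule fall back to the most conservative shape (the record ID only), so a newly added tool cannot flood the context by default.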
Multi-tenant fairness
One tenant’s agent loop must not exhaust shared tool rate limits for others. Shard quotas by tenant and by tool class; return structured “quota exceeded” observations so the model can back off or ask the user to retry later rather than hammering 429s in a tight loop.
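Sharded quotas might be sketched as a fixed-window counter keyed by tenant and tool class. TenantQuota, its limits, and the shape of the returned observation are assumptions for this example; a shared deployment would keep the counters in a common store rather than in process memory.

```python
import time

class TenantQuota:
    """Fixed-window counter keyed by (tenant, tool class)."""

    def __init__(self, limit: int, window_s: float, clock=time.monotonic) -> None:
        self.limit, self.window_s, self.clock = limit, window_s, clock
        self._windows: dict[tuple[str, str], tuple[float, int]] = {}

    def check(self, tenant: str, tool_class: str) -> dict:
        now = self.clock()
        start, used = self._windows.get((tenant, tool_class), (now, 0))
        if now - start >= self.window_s:
            start, used = now, 0  # window expired: start a new one
        if used >= self.limit:
            retry_after = round(self.window_s - (now - start), 3)
            return {"error": "quota_exceeded", "retryable": True,
                    "retry_after_s": retry_after}
        self._windows[(tenant, tool_class)] = (start, used + 1)
        return {"ok": True}
```

Including retry_after_s in the observation gives the model, or the orchestrator, a concrete wait time instead of an invitation to retry immediately.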
Tool documentation as code
Generate tool descriptions from the same OpenAPI or protobuf comments that power your gateway, so the wording the model reads cannot drift from what the endpoint actually accepts.
Timeouts per tool class
Different tools deserve different timeouts: a search tool may return in milliseconds while a report generator runs for minutes. Encode defaults in the gateway and allow per-tenant overrides. Log tail latency per tool to catch regressions when downstream dependencies slow down.
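A minimal sketch of class-based timeouts with per-tenant overrides, using asyncio.wait_for to cancel the call when the budget is exceeded. The timeout values, the tenant name, and both function names are hypothetical.

```python
import asyncio

# Hypothetical defaults in seconds, keyed by tool class.
DEFAULT_TIMEOUTS = {"search": 2.0, "report": 300.0}
TENANT_OVERRIDES = {("acme", "report"): 600.0}

def timeout_for(tenant: str, tool_class: str) -> float:
    """Per-tenant override if present, otherwise the class default."""
    return TENANT_OVERRIDES.get((tenant, tool_class), DEFAULT_TIMEOUTS[tool_class])

async def invoke_with_timeout(coro, timeout_s: float) -> dict:
    """Cancel the tool call on timeout and return a structured, retryable error."""
    try:
        return await asyncio.wait_for(coro, timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"error": "timeout", "retryable": True, "timeout_s": timeout_s}
```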
Capacity, queues, and backpressure
Treat the LLM path like any other critical dependency: cap concurrency per upstream, set explicit timeouts on every network hop, and chart queue depth as a first-class metric. A growing in-memory backlog or a saturated broker often predicts an outage minutes before user reports. Prefer graceful shedding—return a structured “degraded mode” response—over unbounded waits that exhaust thread pools and poison shared gateways.
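Concurrency capping with graceful shedding can be sketched with a non-blocking semaphore. The Shedder class and the shape of the degraded response are hypothetical; the same idea applies to async code with a counter guarded by the event loop.

```python
import threading

class Shedder:
    """Caps in-flight calls to one upstream; sheds instead of queueing without bound."""

    def __init__(self, max_in_flight: int) -> None:
        self._slots = threading.BoundedSemaphore(max_in_flight)

    def call(self, fn, *args):
        if not self._slots.acquire(blocking=False):
            # No free slot: fail fast with a structured response the caller can render.
            return {"degraded": True, "reason": "upstream_saturated"}
        try:
            return fn(*args)
        finally:
            self._slots.release()
```

Because acquire never blocks, a saturated upstream costs the caller microseconds rather than a held thread, which is what keeps one slow dependency from exhausting the shared pool.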
Rollback and blast radius
Every change that touches prompts, retrieval, routing, or tool schemas should ship behind flags with a rehearsed rollback. Know the blast radius when you flip a default: which tenants, which regions, and which downstream databases see amplified write load from a suddenly more verbose agent loop.
Ownership in incident response
Spell out which team owns rate limits, which owns index rebuilds, and which owns model routing changes. LLM incidents often span retrieval, inference, and billing—without explicit ownership, pages bounce while users churn.
Dependency and platform hygiene
Inventory every hop the request touches: reverse proxies, identity providers, feature-flag services, vector indexers, billing meters, and object stores used for attachments. Latency regressions often trace to TLS handshakes, DNS TTL interactions, or a saturated connection pool—not the GPU kernel. Keep an architecture diagram that matches what actually runs in production and update it when you add a sidecar or a new regional cell.
Load testing the unhappy path
Synthetic tests should include partial client disconnects, slow tool backends, and oversized prompts that hit context limits. Happy-path benchmarks miss the failure combinations that dominate incident hours.
Tradeoffs and Failure Modes
Large tool catalogs blow context budgets; routing layers add latency and another failure point.
Overly generic tools (run_sql) are flexible and risky; prefer many small, constrained tools.
Models may hallucinate tool results in later turns if observations are dropped by truncation—guard with explicit “you have not yet called X” system reminders or external state summaries.
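The reminder in the last point can be generated from orchestrator-side state rather than from the (possibly truncated) transcript. The state_reminder function and its wording are a hypothetical sketch; the exact phrasing that works best will depend on the model.

```python
def state_reminder(expected_tools: list[str], called_tools: set[str]) -> str:
    """Build a system reminder from what the orchestrator knows was executed."""
    pending = [t for t in expected_tools if t not in called_tools]
    called = ", ".join(sorted(called_tools)) or "none"
    lines = [f"Tools already called this session: {called}."]
    for tool in pending:
        lines.append(f"You have not yet called {tool}; do not state its result.")
    return "\n".join(lines)
```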
Conclusion
Function calling architectures succeed when tools are designed like backend endpoints: tight schemas, gatewayed execution, clear observations, and orchestration code that enforces budgets and permissions. The LLM proposes; your runtime disposes. Keep that separation sharp, and tool use becomes maintainable at organizational scale.