Building Agentic AI Systems with Tool-Using LLMs

“Agentic” systems let an LLM iterate: propose actions, observe results, update an internal working state, and continue until a goal is met or a limit is hit. Unlike single-shot Q&A, agents introduce loops, side effects, and non-determinism at scale. Production implementations succeed when planning and execution responsibilities are clear, tool surfaces are narrow, and every step is observable. This article describes tool execution loops, planning vs execution separation, and structured reasoning cycles without relying on proprietary model internals.

Introduction

Demo agents read a README and call three tools. Production agents integrate with ticketing systems, CRMs, databases, and internal CLIs under user-specific permissions. The failure modes are authorization bugs, infinite loops, silent truncation of tool outputs, and partial failures in distributed dependencies. Design agents as state machines supervised by code, not as unconstrained chat.

System Architecture

Controller responsibilities:

  • Enforce MAX_STEPS, MAX_TOOL_CALLS, wall-clock timeout.
  • Normalize tool errors into model-readable observations.
  • Prevent duplicate destructive calls with idempotency keys.
  • Surface human escalation triggers (confidence < threshold, repeated tool failures).

Core Technical Mechanisms

Tool execution loop: LLM → optional tool_calls → runtime executes → append tool role messages → LLM again. Terminates on final natural-language message, explicit stop tool, or guard limits.

Planning vs execution: Planning proposes a sequence or graph; execution carries out steps with validation. Separation can be logical (two prompts) or physical (planner service + worker service). Benefits: smaller prompts per step, easier testing of executors, reduced chance the planner hallucinates fake tool results because it never sees fabricated observations—only the executor returns observations.

Structured reasoning cycles: Force the model to emit intermediate artifacts—checklists, JSON state, or explicit thought fields you log but may not show users—so debugging has anchors. Whether you expose chain-of-thought to end users is a product/policy choice; logging internally is separate.

State: Conversation messages alone are not enough for long tasks. Maintain external state (workflow ID, cursor positions, fetched entity IDs) in a store keyed by session.

Production Implementation Patterns

Checkpointing: After each successful milestone, write state to DB so retries resume idempotently.

Parallel tools: Only when commutative and safe; otherwise sequential execution reduces race conditions.

Observation truncation: Summarize with lossless fields preserved (IDs, counts, error codes).

Pseudo-state object:

{
"session_id": "…",
"goal": "…",
"completed_steps": ["search_kb", "open_ticket"],
"artifacts": { "ticket_id": "123" },
"last_errors": []
}

Inject a compact JSON snapshot into each LLM turn after the first to reduce drift.

Operational Challenges

Determinism where it matters

Use the LLM for judgment calls; use code for math, sorting, and id generation. If the agent must choose among enumerated strategies, have it output an enum key, then let deterministic code pick the implementation. That split makes tests meaningful and reduces “creative” JSON that parses but lies.

Failure taxonomy for loops

Classify failures: tool_timeout, tool_forbidden, schema_invalid, user_cancel, model_refusal, context_overflow. Return that code to telemetry and optionally to the user. Agents that only see generic “error” strings spiral into useless retries.

Cost and step awareness

Expose remaining step budget to the controller, not necessarily to the user, and decay aggressively when tools are expensive. Some teams implement dynamic budgets: widen limits when the user is on a premium tier; tighten when abuse heuristics fire.

Tracing: OpenTelemetry spans per loop iteration, attributes for tool name and latency, link spans to external HTTP calls.

Evaluation: scenario suites with expected tool sequences; allow order variation but assert invariants (“never call delete without confirm tool”).

UX: show progress, allow cancellation, display which tools ran when transparency helps trust.

Run load tests on the controller itself—LLM stubs returning maximal tool-call fanout—to ensure your gateway stays stable before you invite real traffic.

Delegation boundaries

Clarify which components may call the LLM. If every tool implementation secretly calls a model for “help,” you lose cost control and tracing. Centralize LLM calls in the orchestrator or explicit “LLM nodes” so spans and budgets stay honest.

User trust and transparency

When tools touch user data, summarize what ran in customer language (“I looked up your last three invoices”) without dumping raw JSON. Transparency reduces support tickets accusing the assistant of “doing something behind my back.”

Runbook for stuck agents

Document how operators cancel a runaway session without dropping unrelated traffic: feature flags per workflow, per-tenant kill switches, and replay steps for half-written external state.

Postmortems without blame

When an agent causes an incident, focus the retro on missing guardrails (budget, idempotency, authz) rather than the model “being dumb.” Guardrails are what you ship; the model is a variable component.

Capacity, queues, and backpressure

Treat the LLM path like any other critical dependency: cap concurrency per upstream, set explicit timeouts on every network hop, and chart queue depth as a first-class metric. A growing in-memory backlog or a saturated broker often predicts an outage minutes before user reports. Prefer graceful shedding—return a structured “degraded mode” response—over unbounded waits that exhaust thread pools and poison shared gateways.

Rollback and blast radius

Every change that touches prompts, retrieval, routing, or tool schemas should ship behind flags with a rehearsed rollback. Know the blast radius when you flip a default: which tenants, which regions, and which downstream databases see amplified write load from a suddenly more verbose agent loop.

Ownership in incident response

Spell out which team owns rate limits, which owns index rebuilds, and which owns model routing changes. LLM incidents often span retrieval, inference, and billing—without explicit ownership, pages bounce while users churn.

Dependency and platform hygiene

Inventory every hop the request touches: reverse proxies, identity providers, feature-flag services, vector indexers, billing meters, and object stores used for attachments. Latency regressions often trace to TLS handshakes, DNS TTL interactions, or a saturated connection pool—not the GPU kernel. Keep an architecture diagram that matches what actually runs in production and update it when you add a sidecar or a new regional cell.

Load testing the unhappy path

Synthetic tests should include partial client disconnects, slow tool backends, and oversized prompts that hit context limits. Happy-path benchmarks miss the failure combinations that dominate incident hours.

Change management

Treat prompt, tool, and routing updates like schema migrations: pair code changes with backfill jobs, communicate freeze windows, and validate in staging with traffic shadows before you widen the blast radius in production.

Tradeoffs and Failure Modes

More structure increases tokens and latency.

Planner–executor splits can suffer handoff errors if the planner omits constraints the executor needs.

Agents amplify security risk surface—every tool is a potential exfiltration or abuse channel.

Conclusion

Agentic systems are loops around an LLM with gated side effects. Invest in the controller, state store, and tool gateway as much as in the prompt. Planning and execution separation, structured intermediate artifacts, and explicit failure taxonomies turn a fragile demo into an operable subsystem you can monitor, test, and roll back.