Multi-Agent Systems — Coordination, Conflict, and Arbitration
Multi-agent LLM systems assign different roles—researcher, coder, critic, planner—to separate prompts or separate model instances. The promise is division of labor; the risk is uncoordinated partial solutions, contradictory conclusions, and duplicated work. This article covers coordination patterns, conflict detection, and arbitration mechanisms suitable for production orchestration, framed like distributed systems with message passing rather than like autonomous org charts.
Introduction
“Agents” here means orchestrated LLM-powered workers with defined interfaces, not independent software entities with private goals. Production multi-agent setups usually run inside one trust boundary (your cloud account) with shared tool gateways and shared observability. Coordination is therefore an engineering choice: central orchestrator vs peer messaging, synchronous vs async handoffs, and how disagreements are resolved.
System Architecture
Hierarchical orchestration: A lead planner assigns subtasks; leaf agents return structured results upward. Centralized control simplifies authz and budgets; it can bottleneck if the orchestrator prompt grows too large.
Voting / ensemble: Multiple agents answer independently; a majority or weighted vote selects the output. Cost grows linearly with the number of agents. Voting mitigates single-model quirks only when diversity is real (different temperatures, different prompts), not when agents are identical copies.
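A minimal sketch of the weighted vote described above, assuming each agent's answer has already been normalized to a comparable string. The function name `weighted_vote` and the default weight of 1.0 are illustrative choices, not part of any particular framework.

```python
from collections import defaultdict


def weighted_vote(answers, weights=None):
    """Pick the answer with the highest total weight.

    answers: list of (agent_id, answer) pairs.
    weights: optional dict agent_id -> weight; missing agents count as 1.0.
    Returns (winning_answer, share_of_total_weight).
    """
    if not answers:
        raise ValueError("no answers to vote on")
    weights = weights or {}
    totals = defaultdict(float)
    for agent_id, answer in answers:
        totals[answer] += weights.get(agent_id, 1.0)
    # On an exact tie, max() keeps the answer that was seen first.
    winner = max(totals, key=totals.get)
    return winner, totals[winner] / sum(totals.values())
```

The returned share is a vote fraction, not a calibrated probability. A low share is a reasonable trigger for arbitration rather than a reason to accept the winner silently.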
Debate-style loops: Pro and con agents iterate; a judge summarizes. Useful for risk review; heavy on latency.
Core Technical Mechanisms
Roles: Fixed system prompts and tool allowlists per role reduce cross-talk. A “critic” role might read drafts but not call write APIs.
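One way to enforce a per-role tool allowlist is to check it in the orchestrator rather than trusting the prompt. The sketch below is hypothetical: `Role`, `call_tool`, and the tool names `read_draft` and `write_draft` are invented for illustration.

```python
from dataclasses import dataclass


class ToolNotAllowed(Exception):
    """Raised when a role tries to call a tool outside its allowlist."""


@dataclass(frozen=True)
class Role:
    name: str
    system_prompt: str
    allowed_tools: frozenset


def call_tool(role, tool_name, registry, **kwargs):
    """Dispatch a tool call only if the role's allowlist permits it."""
    if tool_name not in role.allowed_tools:
        raise ToolNotAllowed(f"{role.name} may not call {tool_name}")
    return registry[tool_name](**kwargs)


# A critic that can read drafts but has no write tools (hypothetical names).
CRITIC = Role(
    name="critic",
    system_prompt="Review drafts and report problems; do not modify them.",
    allowed_tools=frozenset({"read_draft"}),
)
```

Because the check runs outside the model, a prompt injection that convinces the critic to attempt a write still fails at the gateway.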
Shared blackboard: A document or database record that multiple agents append to, with locking or with sections owned by role. This is the classic AI blackboard pattern adapted to LLM steps.
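The section-ownership variant can be sketched in a few lines. This is an in-memory illustration only; the class name `Blackboard` and the section names in the usage note are assumptions, and a production version would sit on a database with real access control.

```python
class Blackboard:
    """In-memory blackboard where each section has exactly one owning role."""

    def __init__(self, owners):
        # owners maps section name -> the single role allowed to append to it.
        self._owners = dict(owners)
        self._sections = {section: [] for section in self._owners}

    def append(self, role, section, entry):
        """Append an entry; only the owning role may write to a section."""
        if section not in self._owners or self._owners[section] != role:
            raise PermissionError(f"{role!r} does not own section {section!r}")
        self._sections[section].append(entry)

    def read(self, section):
        """Any role may read; a tuple is returned so callers cannot mutate state."""
        return tuple(self._sections[section])
```

For example, `Blackboard({"findings": "researcher", "review": "critic"})` lets the critic read `findings` but rejects a critic write to it.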
Message passing: Agents communicate only through structured messages (JSON events), not by silently editing each other’s free text unless mediated.
Conflict: Two agents assert incompatible facts, propose different tool plans, or overwrite shared state. Conflicts arise from nondeterminism, partial observability, or stale reads—same as human teams.
Arbitration: A tie-breaker—supervisor model, rules engine, or human—chooses the winning branch.
Production Implementation Patterns
Conflict detection (deterministic): Compare structured outputs; if risk_level differs between agents, flag the conflict. For free text, use an extraction step that maps prose to fields, and treat embedding similarity only as a weak signal; prefer explicit fields.
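As a sketch of the deterministic comparison, assuming both agents return dictionaries with the same field names (the helper name `detect_conflicts` is hypothetical):

```python
def detect_conflicts(output_a, output_b, fields):
    """Return {field: (value_a, value_b)} for every compared field that differs.

    Fields missing from either output are skipped, not treated as conflicts;
    whether a missing field should count as a conflict is a policy decision.
    """
    conflicts = {}
    for name in fields:
        if name in output_a and name in output_b and output_a[name] != output_b[name]:
            conflicts[name] = (output_a[name], output_b[name])
    return conflicts
```

An empty result means the compared fields agree; it says nothing about free-text portions that were never extracted into fields.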
Arbitration policies:
- Rules first: If conflict touches regulated thresholds, rules win.
- Confidence scores: Only if your stack produces calibrated scores you trust—otherwise avoid faux precision.
- Escalate to human when automation cannot converge within N rounds.
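The policies above can be composed into a small decision function. The sketch below covers rules-first and escalation after N rounds; it deliberately omits confidence scores, since those only help when calibrated. The conflict shape, the `prefer_higher_risk` rule, and the default of three rounds are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

RISK_ORDER = {"low": 0, "med": 1, "high": 2}


@dataclass
class Decision:
    winner: Optional[str]  # agent whose value is accepted, or None
    reason: str            # "rule", "retry", or "escalate_to_human"


def prefer_higher_risk(conflict):
    """Example rule: on a risk_level conflict, the more cautious agent wins."""
    if conflict["field"] != "risk_level":
        return None
    values = conflict["values"]  # agent name -> claimed value
    return max(values, key=lambda agent: RISK_ORDER[values[agent]])


def arbitrate(conflict, rules, round_number, max_rounds=3):
    """Apply rules first; escalate once max_rounds is reached without a winner."""
    for rule in rules:
        winner = rule(conflict)
        if winner is not None:
            return Decision(winner, "rule")
    if round_number >= max_rounds:
        return Decision(None, "escalate_to_human")
    return Decision(None, "retry")
```

Keeping rules as plain functions makes the regulated cases auditable: the winning branch is traceable to a specific rule rather than to a model's judgment.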
Concurrency control: Optimistic locking on shared state (version field) prevents lost updates when two agents write.
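A minimal in-process sketch of optimistic locking with a version field. The class and exception names are hypothetical; with a real database the same check is usually expressed as a conditional update on the version column.

```python
import threading


class VersionConflict(Exception):
    """Raised when a writer's expected version is stale."""


class SharedState:
    def __init__(self, data=None):
        self._lock = threading.Lock()  # guards the compare-and-set only
        self._data = dict(data or {})
        self._version = 0

    def read(self):
        """Return a copy of the data plus the version it was read at."""
        with self._lock:
            return dict(self._data), self._version

    def write(self, new_data, expected_version):
        """Commit only if nobody else has written since expected_version."""
        with self._lock:
            if expected_version != self._version:
                raise VersionConflict(
                    f"expected v{expected_version}, found v{self._version}"
                )
            self._data = dict(new_data)
            self._version += 1
            return self._version
```

An agent that receives `VersionConflict` should re-read, reapply its change to the fresh state, and retry, or hand the disagreement to arbitration if the two writes are incompatible.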
Operational Challenges
When multi-agent is worth the tax
Add agents only when responsibilities genuinely differ: different tool sets, different risk classes, or different model tiers. If two “roles” share tools and prompts and only the label changes, you likely have a single agent with clearer instructions. The tax is coordination: duplicated context, arbitration latency, and harder traces.
Communication contracts
Prefer schemas over prose for inter-agent messages: { "claim": "...", "evidence_ids": [], "confidence": "low|med|high" } rather than paragraphs the next agent must reinterpret. Version those schemas. When agents must share long artifacts, store them externally and pass handles—do not balloon every agent’s context with duplicate blobs.
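The message shape above could be validated with a small dataclass. This sketch uses only the standard library; the `schema_version` field and the `parse_message` helper are assumptions added to illustrate versioning, not part of the schema shown in the text.

```python
import json
from dataclasses import asdict, dataclass, field

SCHEMA_VERSION = 1
CONFIDENCE_LEVELS = {"low", "med", "high"}


@dataclass
class ClaimMessage:
    claim: str
    evidence_ids: list = field(default_factory=list)
    confidence: str = "low"
    schema_version: int = SCHEMA_VERSION

    def __post_init__(self):
        if not isinstance(self.claim, str) or not self.claim:
            raise ValueError("claim must be a non-empty string")
        if self.confidence not in CONFIDENCE_LEVELS:
            raise ValueError(f"unknown confidence: {self.confidence!r}")
        if not all(isinstance(e, str) for e in self.evidence_ids):
            raise ValueError("evidence_ids must be strings (handles, not blobs)")

    def to_json(self):
        return json.dumps(asdict(self))


def parse_message(raw):
    """Parse and validate an inter-agent message; reject unknown versions."""
    payload = json.loads(raw)
    if not isinstance(payload, dict):
        raise ValueError("message must be a JSON object")
    if payload.get("schema_version") != SCHEMA_VERSION:
        raise ValueError(f"unsupported schema_version: {payload.get('schema_version')!r}")
    return ClaimMessage(**payload)
```

Rejecting malformed messages at the boundary keeps a bad handoff from propagating: the orchestrator can ask the sending agent to retry instead of letting the next agent reinterpret garbage.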
Deadlock and livelock prevention
Detect rounds where no state hash changes, no tool runs, and no new facts appear—terminate with a user-visible “stuck” message. Cap debate rounds between pro and con roles. If arbitration always picks one agent, measure whether the other still adds signal; if not, remove it.
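A sketch of the "no progress" check, assuming the orchestrator's state is JSON-serializable and that it counts tool calls per round. `StuckDetector` and its `patience` parameter are hypothetical names.

```python
import hashlib
import json


def state_hash(state):
    """Stable hash of a JSON-serializable state (keys sorted for determinism)."""
    encoded = json.dumps(state, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class StuckDetector:
    """Flags a session after `patience` consecutive rounds without progress."""

    def __init__(self, patience=2):
        self.patience = patience
        self._last_hash = None
        self._idle_rounds = 0

    def observe(self, state, tool_calls):
        """Record one round; return True when the session should be terminated."""
        current = state_hash(state)
        if current == self._last_hash and tool_calls == 0:
            self._idle_rounds += 1
        else:
            self._idle_rounds = 0
        self._last_hash = current
        return self._idle_rounds >= self.patience
```

This only catches consecutive repeats. Detecting longer cycles, where the state alternates between two proposals, would need a set of recently seen hashes instead of a single previous value.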
- Budget caps per session across all agents, with per-role breakdown in dashboards.
- Uniform trace IDs across agent calls; propagate tenant context and agent_role on every span.
- Kill switches when agents loop proposals without state change (detect hash(state) repetition).
- Run game-day exercises where one agent’s tool is disabled mid-flight to verify the orchestrator degrades safely.
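The session budget cap with a per-role breakdown can be sketched as follows. Token counts are assumed to come from the model provider's usage reporting; `SessionBudget` is a hypothetical name.

```python
from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a charge would push the session over its cap."""


class SessionBudget:
    """One cap shared by all agents in a session, tracked per role."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.by_role = defaultdict(int)

    @property
    def spent(self):
        return sum(self.by_role.values())

    def charge(self, role, tokens):
        """Record usage for a role, refusing charges that exceed the cap."""
        if self.spent + tokens > self.max_tokens:
            raise BudgetExceeded(
                f"{role} requested {tokens}, only {self.max_tokens - self.spent} left"
            )
        self.by_role[role] += tokens
```

The `by_role` dictionary is what a dashboard would export. Checking before the call (with an estimate) rather than after avoids paying for the request that breaks the cap.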
Cost visibility per agent role
If roles use different models or different average tool counts, tag spans with role and report spend per role. Product teams can then decide whether a “critic” agent pays for itself in reduced downstream errors.
Minimal viable multi-agent
Start with a single agent plus a deterministic critic function (code) before introducing a second LLM. Many “multi-agent” designs are really one model plus validators; that is easier to ship and often sufficient.
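A deterministic critic can be as simple as a validation function that returns a list of problems. The required fields and allowed `risk_level` values below are assumptions chosen to match the examples elsewhere in this article.

```python
import json

ALLOWED_RISK = ("low", "med", "high")


def critic(draft_json, required_fields=("summary", "risk_level")):
    """Return a list of problems found in an agent's draft; empty means pass."""
    try:
        draft = json.loads(draft_json)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(draft, dict):
        return ["draft must be a JSON object"]
    problems = [f"missing field: {name}" for name in required_fields if name not in draft]
    if "risk_level" in draft and draft["risk_level"] not in ALLOWED_RISK:
        problems.append("risk_level must be one of " + ", ".join(ALLOWED_RISK))
    return problems
```

The problem list can be fed back to the single agent as a retry instruction. Only when such checks stop catching the failures that matter is a second LLM worth its coordination cost.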
Documentation for roles
Each role’s prompt, tools, and escalation path should live on one doc page that an onboarding engineer can read in ten minutes. Otherwise only the original author understands the topology.
Escalation when agents disagree on facts
When agents dispute a factual claim, prefer fetching an authoritative system of record (database row, config service) over endless debate. Models cannot vote away incorrect ground truth.
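A sketch of resolving a factual dispute against a system of record. The `lookup` callable stands in for whatever authoritative source exists (a database query, a config service client); it and the return shape are assumptions.

```python
def resolve_fact(key, claims, lookup):
    """Settle a disputed fact by consulting the system of record.

    claims: dict agent name -> value that agent asserted for `key`.
    lookup: callable returning the authoritative value, or None if unknown.
    """
    truth = lookup(key)
    if truth is None:
        # No authoritative answer: do not fall back to voting; surface it.
        return {"status": "unresolved", "value": None, "wrong_agents": []}
    wrong = sorted(agent for agent, value in claims.items() if value != truth)
    return {"status": "resolved", "value": truth, "wrong_agents": wrong}
```

Note that the majority can be wrong: if two of three agents assert a stale value, the record still wins, and `wrong_agents` gives a signal worth tracking per role over time.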
Capacity, queues, and backpressure
Treat the LLM path like any other critical dependency: cap concurrency per upstream, set explicit timeouts on every network hop, and chart queue depth as a first-class metric. A growing in-memory backlog or a saturated broker often predicts an outage minutes before user reports. Prefer graceful shedding—return a structured “degraded mode” response—over unbounded waits that exhaust thread pools and poison shared gateways.
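The concurrency cap, explicit timeout, and graceful shedding can be combined in one wrapper. This is a sketch using `asyncio` from the standard library; `BoundedCaller`, its parameters, and the shape of the degraded response are assumptions. The object should be constructed inside a running event loop.

```python
import asyncio


class BoundedCaller:
    """Caps concurrency per upstream, bounds the wait queue, and times out calls."""

    def __init__(self, max_concurrent, max_waiting, timeout_s):
        self._sem = asyncio.Semaphore(max_concurrent)
        self._max_waiting = max_waiting
        self._timeout_s = timeout_s
        self.waiting = 0  # queue depth; export this as a metric

    async def call(self, fn, *args):
        # Shed load when all slots are busy and the wait queue is already full.
        if self._sem.locked() and self.waiting >= self._max_waiting:
            return {"status": "degraded", "detail": "capacity exceeded"}
        self.waiting += 1
        try:
            await self._sem.acquire()
        finally:
            self.waiting -= 1
        try:
            # The timeout covers the upstream call itself, not the queue wait.
            return await asyncio.wait_for(fn(*args), timeout=self._timeout_s)
        except asyncio.TimeoutError:
            return {"status": "degraded", "detail": "upstream timeout"}
        finally:
            self._sem.release()
```

Returning a structured degraded response instead of raising lets the orchestrator decide per role whether to skip the step, retry later, or tell the user.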
Rollback and blast radius
Every change that touches prompts, retrieval, routing, or tool schemas should ship behind flags with a rehearsed rollback. Know the blast radius when you flip a default: which tenants, which regions, and which downstream databases see amplified write load from a suddenly more verbose agent loop.
Ownership in incident response
Spell out which team owns rate limits, which owns index rebuilds, and which owns model routing changes. LLM incidents often span retrieval, inference, and billing—without explicit ownership, pages bounce while users churn.
Dependency and platform hygiene
Inventory every hop the request touches: reverse proxies, identity providers, feature-flag services, vector indexers, billing meters, and object stores used for attachments. Latency regressions often trace to TLS handshakes, DNS TTL interactions, or a saturated connection pool—not the GPU kernel. Keep an architecture diagram that matches what actually runs in production and update it when you add a sidecar or a new regional cell.
Load testing the unhappy path
Synthetic tests should include partial client disconnects, slow tool backends, and oversized prompts that hit context limits. Happy-path benchmarks miss the failure combinations that dominate incident hours.
Change management
Treat prompt, tool, and routing updates like schema migrations: pair code changes with backfill jobs, communicate freeze windows, and validate in staging with traffic shadows before you widen the blast radius in production.
Tradeoffs and Failure Modes
More agents increase token spend and debugging surface. Without strict interfaces, logs become unreadable story anthologies.
Homogeneous multi-agent setups (same model, same tools, different labels) may not improve quality over a single well-prompted pass with self-critique.
Peer-to-peer agent topologies are hard to secure—prefer hub-and-spoke with a gateway.
Conclusion
Multi-agent LLM systems are coordination problems first. Invest in shared structured state, explicit arbitration, schema-shaped handoffs, and orchestrator-enforced limits. Roles and voting can improve coverage when they create real diversity or separation of concerns—not when they only multiply identical opinions at higher cost.