Memory Systems for LLM Agents — Short-Term vs Long-Term Memory

Human conversations rely on short-term working memory and long-term recall. LLM agents simulate this with message buffers, summaries, and external stores queried at runtime. Without explicit design, “memory” collapses to “whatever still fits in the context window,” which is neither durable nor consistent. This article classifies memory layers, explains retrieval and summarization tradeoffs, and outlines persistence strategies suitable for production agents.

Introduction

Product requirements for memory range from “remember my display name” to “remember everything I told you six months ago about this legal matter.” Those are different systems: the first is keyed profile state; the second is a searchable archive with legal retention rules. Conflating them produces privacy incidents and incoherent prompts. Engineers should map features to memory tiers with clear retention, access control, and eviction policies.

System Architecture

Write path: Not every assistant sentence should become memory. Have the model emit an explicit memory proposal (a structured object with remember=true), validate it against rules such as “never store a raw credit card number,” and only then commit it.
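
A minimal sketch of the propose, validate, commit flow described above. The MemoryProposal shape, the rule function, and the reason codes are illustrative assumptions rather than any specific framework's API, and a plain list stands in for a real store.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass(frozen=True)
class MemoryProposal:
    # Hypothetical shape of the structured object the model emits.
    remember: bool
    category: str
    text: str


# A rule returns a reason code when it rejects a proposal, or None when it passes.
Rule = Callable[[MemoryProposal], Optional[str]]


def no_long_digit_runs(p: MemoryProposal) -> Optional[str]:
    # Crude stand-in for "never store raw credit card": reject runs of
    # 13 to 19 digits, allowing single spaces or hyphens between digits.
    if re.search(r"\b\d(?:[ -]?\d){12,18}\b", p.text):
        return "possible_card_number"
    return None


def commit_if_valid(store: List[MemoryProposal], p: MemoryProposal,
                    rules: List[Rule]) -> str:
    """Append the proposal to the store only if it was requested and passes every rule."""
    if not p.remember:
        return "not_requested"
    for rule in rules:
        reason = rule(p)
        if reason is not None:
            return reason
    store.append(p)
    return "committed"
```

Returning a reason code instead of raising keeps rejected writes easy to count and tune without changing the calling code.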

Read path: Query the long-term store within the user and tenant scope, then inject the top-k snippets with their timestamps and sources. In effect, this is retrieval-augmented generation (RAG) over the user's own history.
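
One way the read path might look, assuming relevance scores are computed upstream. The Snippet fields, the function names, and the wording of the prompt header are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Snippet:
    # Hypothetical record shape; score is a relevance value computed upstream.
    tenant_id: str
    user_id: str
    text: str
    source: str
    created_at: datetime
    score: float


def select_for_prompt(snippets: List[Snippet], tenant_id: str,
                      user_id: str, k: int) -> List[Snippet]:
    # Scope first, rank second: out-of-scope rows never compete for a slot.
    in_scope = [s for s in snippets
                if s.tenant_id == tenant_id and s.user_id == user_id]
    return sorted(in_scope, key=lambda s: s.score, reverse=True)[:k]


def render(snippets: List[Snippet]) -> str:
    # Each line carries its date and source so the model can weigh staleness.
    lines = ["Stored memory (may be stale; prefer the user's latest statement):"]
    for s in snippets:
        lines.append(f"- [{s.created_at.date().isoformat()}] ({s.source}) {s.text}")
    return "\n".join(lines)
```

Filtering by scope before ranking means a high-scoring snippet from another tenant can never appear in the prompt, regardless of k.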

Core Technical Mechanisms

Short-term (working) memory: Recent dialogue turns, scratchpad notes, and tool outputs kept verbatim or near-verbatim until truncated. Commonly implemented as a message list plus optional structured scratch JSON injected each turn.

Long-term memory: Facts or episodes stored outside the model weights, retrievable into context when relevant. Implementations include vector databases keyed by user, knowledge graphs, or relational tables for typed facts (for example, “user prefers metric units”).

Episodic memory: Time-stamped events (“on 2026-04-02 user connected Salesforce org X”). Useful for support and journaling, but episodes need redaction and a time-to-live (TTL).

Semantic memory: Generalized knowledge distilled from many episodes—often summaries or extracted slot values rather than raw transcripts.

Production Implementation Patterns

Summarization under short-term memory (STM) pressure: When the buffer exceeds N tokens, roll older turns into a summary block, using loss-sensitive instructions such as “preserve all numbers and dates.” Keep the last K turns verbatim for conversational continuity.
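
A rough sketch of that rolling compaction. The summarize callable stands in for an LLM call, the whitespace token counter is a deliberate simplification, and all names here are assumptions.

```python
from typing import Callable, List, Tuple

SUMMARY_INSTRUCTIONS = (
    "Summarize the conversation turns below. "
    "Preserve all numbers and dates exactly as written."
)


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace-separated words. A real system would use
    # the tokenizer of the target model.
    return len(text.split())


def compact(turns: List[str], summary: str, max_tokens: int, keep_last: int,
            summarize: Callable[[str], str]) -> Tuple[str, List[str]]:
    """Return (summary, turns), rolling older turns into the summary when over budget."""
    total = count_tokens(summary) + sum(count_tokens(t) for t in turns)
    if total <= max_tokens or len(turns) <= keep_last:
        return summary, turns
    split = len(turns) - keep_last
    older, recent = turns[:split], turns[split:]
    parts = [SUMMARY_INSTRUCTIONS]
    if summary:
        # Carry the previous summary forward so earlier detail is not dropped.
        parts.append("Existing summary: " + summary)
    parts.extend(older)
    return summarize("\n".join(parts)), recent
```

Passing the existing summary back into the summarizer is one design choice among several; it keeps the block bounded but compounds any loss across rounds.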

Vector memory: Embed user facts or episode text; retrieve by embedding similarity to the current query. Watch for wrong recall when embeddings confuse similar but distinct situations; store disambiguating metadata (such as the account or project a fact belongs to) alongside each vector and filter on it.
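
To illustrate how metadata can separate near-identical embeddings, here is a small pure-Python sketch. The record layout (user_id, vec, meta) and the required_meta parameter are assumptions; a production system would delegate this to its vector database's filtered search.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return dot / (na * nb)


def retrieve(records: List[dict], query_vec: List[float], user_id: str,
             k: int = 3, required_meta: Optional[Dict[str, str]] = None) -> List[dict]:
    """Rank one user's records by cosine similarity, after exact-match metadata filtering."""
    required = required_meta or {}
    candidates = [
        r for r in records
        if r["user_id"] == user_id
        and all(r["meta"].get(key) == value for key, value in required.items())
    ]
    ranked = sorted(candidates, key=lambda r: cosine(r["vec"], query_vec), reverse=True)
    return ranked[:k]
```

With two episodes that embed almost identically, similarity alone would return the closer vector; filtering on a metadata field such as an org identifier picks the intended one.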

Typed slot memory: Key-value or small schema (locale, timezone, role) updated deterministically when classifiers detect stable facts—cheaper and safer than free-text memory for many preferences.
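
A sketch of deterministic slot updates, assuming an upstream classifier supplies a candidate value and a confidence. The slot names, validators, and the 0.9 threshold are illustrative choices, and the timezone check only verifies the shape of the string.

```python
import re
from typing import Callable, Dict

# Each slot has a validator; unknown slots are rejected outright.
SLOT_VALIDATORS: Dict[str, Callable[[str], bool]] = {
    "units": lambda v: v in {"metric", "imperial"},
    "locale": lambda v: re.fullmatch(r"[a-z]{2}(-[A-Z]{2})?", v) is not None,
    # Shape check only (Area/City); it does not confirm the zone exists.
    "timezone": lambda v: re.fullmatch(r"[A-Za-z_]+/[A-Za-z_]+", v) is not None,
}


def update_slot(profile: Dict[str, str], key: str, value: str,
                confidence: float, threshold: float = 0.9) -> bool:
    """Write the slot only if it is known, well-formed, and confidently detected."""
    validator = SLOT_VALIDATORS.get(key)
    if validator is None or not isinstance(value, str) or not validator(value):
        return False
    if confidence < threshold:
        return False
    profile[key] = value
    return True
```

Because the schema is closed, nothing free-form can enter the profile through this path, which is much of what makes slots safer than free-text memory.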

Concurrency: When two devices update the same memory, use version vectors or last-write-wins with an audit trail.

Operational Challenges

Conflict resolution and multi-device use

Users open the same assistant on phone and desktop. If both sessions propose memory writes, you need either per-device drafts merged by rules or optimistic locking with visible conflict (“we kept your newer preference for units”). Avoid silent last-write-wins on legal or medical preferences without explicit user confirmation.
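
Optimistic locking with a visible conflict could be sketched as follows. The SlotStore class, its integer version counter, and the audit tuple format are assumptions; a real store would enforce the version check atomically, for example with a conditional update.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


class VersionConflict(Exception):
    """Raised when a writer's view of the record is out of date."""


@dataclass
class SlotRecord:
    value: str
    version: int


class SlotStore:
    def __init__(self) -> None:
        self._rows: Dict[str, SlotRecord] = {}
        self.audit: List[Tuple[str, str, str]] = []  # (key, device, outcome)

    def read(self, key: str) -> Optional[SlotRecord]:
        return self._rows.get(key)

    def version_of(self, key: str) -> int:
        row = self._rows.get(key)
        return row.version if row else 0

    def write(self, key: str, value: str, expected_version: int, device: str) -> int:
        current = self.version_of(key)
        if expected_version != current:
            # Do not overwrite silently; record the conflict and let the
            # caller surface it to the user.
            self.audit.append((key, device, "conflict"))
            raise VersionConflict(f"{key}: expected v{expected_version}, found v{current}")
        self._rows[key] = SlotRecord(value, current + 1)
        self.audit.append((key, device, "committed"))
        return current + 1
```

The session that catches VersionConflict can re-read the slot and show the user which value was kept, rather than losing one write without notice.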

Retention, deletion, and legal holds

Long-term memory intersects GDPR-style erasure, enterprise retention policies, and litigation holds. Implement category-level deletion (“forget travel plans”) and full account deletion that cascades vectors, blobs, and audit references. Legal hold should freeze targeted records without silently resurrecting them into prompts—surface a blocked-memory notice internally instead.
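
The sketch below shows one possible shape for category-level deletion that respects legal holds. The in-memory dictionaries stand in for a relational table and a vector index, and the suppressed flag is an assumed mechanism for keeping held records out of prompts.

```python
from typing import Dict, List, Tuple


class MemoryStore:
    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}            # stand-in for relational rows
        self.vectors: Dict[str, List[float]] = {}  # stand-in for a vector index

    def add(self, memory_id: str, user_id: str, category: str, text: str,
            vector: List[float], legal_hold: bool = False) -> None:
        self.rows[memory_id] = {
            "user_id": user_id, "category": category, "text": text,
            "legal_hold": legal_hold, "suppressed": False,
        }
        self.vectors[memory_id] = vector

    def forget_category(self, user_id: str, category: str) -> Tuple[List[str], List[str]]:
        """Delete matching records; held records are kept but suppressed from prompts."""
        deleted: List[str] = []
        held: List[str] = []
        for memory_id, row in list(self.rows.items()):
            if row["user_id"] != user_id or row["category"] != category:
                continue
            if row["legal_hold"]:
                row["suppressed"] = True   # frozen, never injected again
                held.append(memory_id)     # report internally as blocked
            else:
                del self.rows[memory_id]
                self.vectors.pop(memory_id, None)  # cascade to the vector index
                deleted.append(memory_id)
        return deleted, held

    def promptable(self, user_id: str) -> List[str]:
        return [r["text"] for r in self.rows.values()
                if r["user_id"] == user_id and not r["suppressed"]]
```

The returned held list is what an internal blocked-memory notice could be built from; it contains identifiers only, not the stored text.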

What not to store

Even if the model proposes it, block storage of passwords, recovery codes, full payment PANs, and third-party secrets pasted accidentally. Combine classifier gates with deterministic regex scanners on proposed memory payloads. Log blocked writes with reason codes for tuning, not with the sensitive payload itself.
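
As an example of the deterministic half of that gate, the following sketch scans a proposed payload and builds a log record that carries reason codes but not the payload. The patterns are deliberately narrow and illustrative; the reason-code names and the record fields are assumptions.

```python
import re
from typing import List, Optional

# Candidate card numbers: 13 to 19 digits, optionally separated by spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")
SECRET_PATTERNS = {
    "password": re.compile(r"\b(?:password|passwd|pwd)\s*[:=]\s*\S+", re.IGNORECASE),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}


def luhn_valid(digits: str) -> bool:
    # Luhn checksum: double every second digit from the right.
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan(text: str) -> List[str]:
    """Return reason codes for every scanner that fires on the text."""
    reasons: List[str] = []
    for match in CARD_CANDIDATE.finditer(text):
        if luhn_valid(re.sub(r"[ -]", "", match.group())):
            reasons.append("payment_card")
            break
    for code, pattern in SECRET_PATTERNS.items():
        if pattern.search(text):
            reasons.append(code)
    return reasons


def blocked_write_record(user_id: str, text: str) -> Optional[dict]:
    """Build a log record for a blocked write; the payload itself is never included."""
    reasons = scan(text)
    if not reasons:
        return None
    return {"event": "memory_write_blocked", "user_id": user_id,
            "reasons": reasons, "payload_chars": len(text)}
```

The Luhn check trims false positives from long numeric identifiers that are not card numbers, which keeps the reason-code counts more useful for tuning.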

Evaluation beyond “did it remember?”

Track correct recall (user verifies), wrong recall (user corrects), and creepy recall (correct but socially inappropriate timing). Slice metrics by locale and product surface. Memory that helps in one culture can feel intrusive in another, so product copy and default TTLs should vary by locale, not only the embeddings.
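
These three outcomes can be aggregated per slice with a few lines of code. The event fields (locale, surface, outcome) are an assumed logging schema.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

OUTCOMES = ("correct", "wrong", "creepy")


def recall_rates(events: Iterable[dict]) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Compute the share of each recall outcome per (locale, surface) slice."""
    counts: Dict[Tuple[str, str], Counter] = defaultdict(Counter)
    for e in events:
        counts[(e["locale"], e["surface"])][e["outcome"]] += 1
    rates = {}
    for slice_key, c in counts.items():
        total = sum(c[o] for o in OUTCOMES)
        if total == 0:
            continue  # slice contained only unknown outcome labels
        rates[slice_key] = {o: c[o] / total for o in OUTCOMES}
    return rates
```

Comparing the creepy rate across locales is one way to decide where a shorter default TTL is warranted.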

Ship user controls: view memory, delete categories, export, and opt-down to session-only mode for sensitive tasks.

Run periodic re-embedding jobs when you upgrade embedding models; document downtime or dual-read strategies.

Back up long-term stores with the same rigor as primary application databases; memory loss is trust loss.

Instrument memory read volume per turn; spikes often precede context window pressure or runaway agent loops.

Testing memory behavior

Write integration tests that assert: writes require validation, reads respect tenant scope, deletion removes vectors and relational rows together, and summarization preserves numbers picked from golden transcripts.
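
For the summarization assertion, a helper along these lines can compare a golden transcript with its summary. The number pattern is a simple assumption that treats digit groups joined by ., ,, :, / or - as one token, so dates and decimals are compared whole.

```python
import re
from typing import List

# Digit groups optionally joined by a single separator, e.g. 2026-04-02 or 1,250.50.
NUMBER = re.compile(r"\d+(?:[.,:/-]\d+)*")


def missing_numbers(transcript: str, summary: str) -> List[str]:
    """Return numeric tokens present in the transcript but absent from the summary."""
    expected = set(NUMBER.findall(transcript))
    found = set(NUMBER.findall(summary))
    return sorted(expected - found)
```

A test can then assert that missing_numbers(golden_transcript, summary) is empty and print the offending tokens when it is not.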

Abuse considerations

Long-term memory is a target: users may try to store disallowed content or exfiltrate data via memory fields visible on other devices. Apply the same content policies to proposed memory writes as you apply to public chat.

Memory UX copy

Explain in-product what “remember” means in plain language, including retention length and who can see stored facts in enterprise workspaces. Confused users file privacy tickets; clarity reduces churn.

Encryption and key rotation

Encrypt long-term memory stores at rest with keys managed by your KMS; rotate keys on policy. For cross-region replication, ensure ciphertext and key policies travel together so you do not accidentally widen access during failover.

Capacity, queues, and backpressure

Treat the LLM path like any other critical dependency: cap concurrency per upstream, set explicit timeouts on every network hop, and chart queue depth as a first-class metric. A growing in-memory backlog or a saturated broker often predicts an outage minutes before user reports. Prefer graceful shedding—return a structured “degraded mode” response—over unbounded waits that exhaust thread pools and poison shared gateways.
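
A compact asyncio sketch of capped concurrency, a per-call timeout, and graceful shedding. The BoundedCaller class, its parameters, and the shape of the degraded response are assumptions, and the waiting counter is the queue-depth figure one would export as a metric.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict


class BoundedCaller:
    def __init__(self, max_concurrency: int, max_queue: int, timeout_s: float) -> None:
        self._sem = asyncio.Semaphore(max_concurrency)
        self._max_queue = max_queue
        self._timeout_s = timeout_s
        self.waiting = 0  # callers queued for a slot; export as a metric

    async def call(self, fn: Callable[[], Awaitable[Any]]) -> Dict[str, Any]:
        # Shed load instead of queueing without bound.
        if self._sem.locked() and self.waiting >= self._max_queue:
            return {"status": "degraded", "reason": "queue_full"}
        self.waiting += 1
        try:
            await self._sem.acquire()
        finally:
            self.waiting -= 1
        try:
            # Explicit timeout on the upstream hop.
            result = await asyncio.wait_for(fn(), timeout=self._timeout_s)
            return {"status": "ok", "result": result}
        except asyncio.TimeoutError:
            return {"status": "degraded", "reason": "timeout"}
        finally:
            self._sem.release()
```

Callers receive a structured response in every case, so a gateway can render a degraded-mode message instead of holding a thread or connection open.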

Rollback and blast radius

Every change that touches prompts, retrieval, routing, or tool schemas should ship behind flags with a rehearsed rollback. Know the blast radius when you flip a default: which tenants, which regions, and which downstream databases see amplified write load from a suddenly more verbose agent loop.

Ownership in incident response

Spell out which team owns rate limits, which owns index rebuilds, and which owns model routing changes. LLM incidents often span retrieval, inference, and billing—without explicit ownership, pages bounce while users churn.

Dependency and platform hygiene

Inventory every hop the request touches: reverse proxies, identity providers, feature-flag services, vector indexers, billing meters, and object stores used for attachments. Latency regressions often trace to TLS handshakes, DNS TTL interactions, or a saturated connection pool—not the GPU kernel. Keep an architecture diagram that matches what actually runs in production and update it when you add a sidecar or a new regional cell.

Load testing the unhappy path

Synthetic tests should include partial client disconnects, slow tool backends, and oversized prompts that hit context limits. Happy-path benchmarks miss the failure combinations that dominate incident hours.

Tradeoffs and Failure Modes

Long-term retrieval can inject harmful or stale content into the prompt; version stored memories and invalidate superseded ones.

Summaries lose detail; over-reliance on them causes failures where the user says “I told you yesterday” and the agent no longer has the specifics.

Privacy: memory concentrates PII. Encrypt it at rest, support deletion requests, and enforce hard boundaries between tenants.

Conclusion

Agent memory is a storage and retrieval subsystem, not a prompt trick. Separate short-term buffers from curated long-term stores, validate writes, scope reads, handle multi-device conflicts, and design retention and compliance into the architecture from day one—before users trust the agent with sensitive life details.