An AI agent without memory is an expensive autocomplete. It forgets your name between sessions, re-asks questions you already answered, and treats every conversation like the first. The context window gives agents short-term memory — but when the conversation ends, everything vanishes.

Building real memory into agents is the hardest unsolved problem in agentic AI. The challenge isn’t storing information — it’s deciding what to remember, when to forget, and how to retrieve the right memories at the right time without blowing your latency budget.

This guide covers the memory patterns, frameworks, and production architectures that actually work in 2026. Our analysis draws on published research, framework documentation, benchmark data, and production reports — we research and analyze rather than deploying these systems ourselves. Rob Nugen operates ChatForest; the site’s content is researched and written by AI.

Why Context Windows Are Not Memory

Every LLM has a context window — the text it can see during a single request. Claude’s is 200K tokens. GPT-4o’s is 128K. Gemini’s is up to 2M. Developers often treat this as memory: stuff everything into the context and let the model sort it out.

This breaks in three ways:

  1. Cost scales linearly. Sending 100K tokens of conversation history on every request gets expensive fast, especially at agent scale where a single task might require dozens of LLM calls.
  2. Retrieval degrades. Models lose accuracy on information buried in the middle of long contexts — the well-documented “lost in the middle” problem. More context doesn’t always mean better answers.
  3. Sessions end. When the conversation is over, the context window is gone. The agent starts fresh next time with no knowledge of what happened before.

Real memory means information that persists across sessions, that the agent can selectively retrieve, and that gets updated or forgotten as the world changes.

The Memory Type Taxonomy

AI agent memory mirrors the categories cognitive scientists use for human memory. Understanding these types matters because each requires different storage, retrieval, and management strategies.

Working Memory (Short-Term)

What the agent can see right now — the current context window contents. This includes the system prompt, recent conversation turns, tool results, and any retrieved context.

Management challenge: Context windows overflow. You need strategies for what stays and what gets evicted. Common approaches include sliding window (drop oldest messages), summarization (compress old turns into a summary), and importance-based pruning (keep high-value turns, drop small talk).

Episodic Memory

Records of what happened — past interactions, decisions, outcomes, and corrections. “The user asked me to refactor the auth module last Tuesday. I suggested extracting a middleware, they rejected it and asked for a service layer instead.”

Why it matters: Episodic memory prevents agents from repeating mistakes. If a user corrected the agent’s approach before, episodic memory ensures the correction sticks.

Storage: Conversation logs with metadata (timestamps, outcomes, user feedback). Often stored as structured records in a database or as natural language summaries in a vector store.

Semantic Memory

Stable facts and knowledge — user preferences, domain knowledge, business rules, reference data. “This user prefers TypeScript over JavaScript. Their company uses PostgreSQL. The API rate limit is 100 requests per minute.”

Why it matters: Semantic memory makes agents personalized and domain-aware without needing everything re-explained each session.

Storage: Key-value stores, knowledge graphs, or extracted fact databases. Unlike episodic memory which is append-mostly, semantic memory requires active updates when facts change.

Procedural Memory

Knowledge of how to do things — workflows, playbooks, learned patterns, and operational sequences. “When deploying to production, always run tests first, then build, then deploy to staging, then promote to production after manual approval.”

Why it matters: Procedural memory turns agents from question-answerers into capable operators that know the steps for complex tasks.

Storage: System prompts, instruction files, or structured workflow definitions. Often the least dynamic memory type — procedures change slowly compared to facts or events.

How the Types Relate

┌─────────────────────────────────────────────────┐
│               WORKING MEMORY                     │
│          (Current context window)                │
│                                                  │
│   ┌──────────┐  ┌──────────┐  ┌──────────────┐ │
│   │ Episodic  │  │ Semantic  │  │ Procedural   │ │
│   │ Retrieved │  │ Retrieved │  │ (System      │ │
│   │ memories  │  │ facts     │  │  prompt/      │ │
│   │           │  │           │  │  instructions)│ │
│   └──────────┘  └──────────┘  └──────────────┘ │
└───────────┬─────────┬──────────────┬────────────┘
            │         │              │
     ┌──────▼───┐ ┌───▼──────┐ ┌────▼─────────┐
     │ Episodic  │ │ Semantic  │ │ Procedural   │
     │ Store     │ │ Store     │ │ Store        │
     │ (logs,    │ │ (facts,   │ │ (workflows,  │
     │  events)  │ │  prefs)   │ │  playbooks)  │
     └──────────┘ └──────────┘ └──────────────┘
              LONG-TERM MEMORY

The agent selectively loads relevant long-term memories into working memory each turn. The art is in the selection — retrieve too much and you waste tokens and confuse the model; retrieve too little and the agent misses important context.

Memory Frameworks Compared

The memory framework landscape has matured significantly. Here’s how the major options compare as of early 2026.

Mem0

Approach: Extracts “memories” from conversations using an LLM, storing them as compact natural language facts with vector embeddings for retrieval.

Key innovation: Instead of storing raw conversation chunks, Mem0 runs an extraction phase that identifies salient facts from every message pair and distills them into compact memories, choosing one of four operations — ADD, UPDATE, DELETE, or NOOP. If a user says they moved from Tokyo to Berlin, Mem0 deletes the old location and adds the new one.

Architecture: Combines vector-based semantic search with optional graph memory (via Neo4j) for entity relationships. Supports hierarchical scoping — user-level, session-level, and agent-level memories.

Performance: Self-reported 26% accuracy improvement over baseline LLM memory, 91% faster responses than full-context loading, and 90% lower token usage. However, independent evaluations suggest lower accuracy scores (~58% on LoCoMo benchmark vs. self-reported ~66%).

Best for: Teams that want managed memory infrastructure with minimal setup. Strong API, good documentation, cloud-hosted option.

Zep

Approach: Temporal knowledge graph that tracks how facts change over time, combining graph-based memory with vector search.

Key innovation: Episodic and temporal memory — Zep structures interactions into meaningful sequences rather than flat logs, tracking when facts were established and how they’ve changed. This mirrors how humans remember conversations, not as transcripts but as structured episodes.

Architecture: Graph RAG with temporal indexing. Integrates structured business data with conversational history. Self-hosted or cloud options.

Performance: Independent evaluation shows ~85% on LoCoMo benchmark. P95 search latency of 0.632 seconds with concurrent search configuration. Zep’s team has publicly challenged Mem0’s benchmark methodology, with corrected scores showing Zep outperforming by ~10%.

Best for: Applications where when something was said matters as much as what was said. Enterprise use cases with compliance requirements around conversation history.

Letta (formerly MemGPT)

Approach: Treats LLM context like virtual memory in an operating system. The agent itself manages memory tiers — deciding what to store, retrieve, and evict.

Key innovation: Memory is a first-class component of the agent’s state, not an external add-on. The agent has editable memory blocks (core memory it can read and write), recall memory (searchable conversation history), and archival memory (long-term external storage). The LLM decides what goes where.

Architecture: A full agent runtime, not just a memory layer. Agents run inside Letta with persistent state, identity, and memory that evolves across sessions. Open source (Apache 2.0).

Performance: ~83.2% on LoCoMo benchmark. The OS-inspired approach means memory operations are tightly integrated with the agent loop rather than being external API calls.

Best for: Developers who want agents with persistent identity and self-managed memory. Strongest choice when the agent needs autonomy over what it remembers.

LangMem (LangChain/LangGraph)

Approach: Long-term memory within the LangGraph ecosystem. Stores memories as JSON documents in a structured store with retrieval via filters.

Key innovation: Supports both hot path (agent-initiated) and background (programmatic) memory extraction. Focuses on working memory by compressing long histories into actionable summaries.

Architecture: Integrates with LangGraph’s state management. Memories are JSON documents with namespace-based organization. Uses LangGraph’s built-in persistence layer.

Best for: Teams already using LangChain/LangGraph who want memory without adding another dependency. Good integration with the broader LangChain ecosystem.

Comparison Table

Framework Memory Type Storage Open Source LoCoMo Score Latency (p95) Best For
Mem0 Semantic + Graph Vector + Neo4j Partial ~58-66% 50-200ms recall Managed memory API
Zep Temporal graph Graph + Vector Partial ~85% 632ms search Temporal/episodic memory
Letta OS-inspired tiers Editable blocks Yes (Apache) ~83.2% Integrated Agent-managed memory
LangMem JSON documents Structured store Yes N/A Varies LangGraph users
SuperLocalMemory 4-channel fusion Local-only Yes (MIT) 74.8-87.7% Local Privacy-first, EU compliance

The Benchmark Wars

A word of caution: memory framework benchmarks are actively contested. Mem0, Zep, and Letta have all publicly challenged each other’s benchmark methodology. The LoCoMo benchmark (81 question-answer pairs across multi-session conversations) is the closest thing to a standard, but scores vary significantly depending on implementation details and evaluation methodology. Independent evaluation is recommended over vendor-reported numbers.

Hot Path vs. Background: When to Write Memory

One of the most consequential architecture decisions is when the agent processes memories. There are two approaches, and getting this wrong is a common production failure.

Hot Path (Synchronous)

The agent decides what to remember during the conversation, as part of its normal reasoning loop. After each user message, the agent explicitly stores important information before responding.

User message → Retrieve memories → Generate response → Store new memories → Return response

Advantages:

  • Memories are immediately available for the next turn
  • The agent has full context for deciding what’s important
  • Simpler architecture — everything happens in one loop

Disadvantages:

  • Adds latency to every response (memory extraction requires an LLM call)
  • The agent spends its “thinking budget” on memory management instead of the task
  • Can create a bottleneck when retrieval is slow

Background (Asynchronous)

Memory extraction happens after the conversation, in a separate process. The agent responds normally, and a background job later analyzes the conversation to extract and consolidate memories.

User message → Retrieve memories → Generate response → Return response
                                                        ↓ (async)
                                                   Extract memories
                                                   Consolidate with existing
                                                   Store updates

Advantages:

  • Zero added latency to the user-facing response
  • Can use more expensive/thorough extraction without impacting UX
  • Scales better — background jobs can be batched and rate-limited

Disadvantages:

  • New memories aren’t available until the background job runs
  • Risk of losing memories if the conversation ends abruptly before processing
  • More complex infrastructure (queues, workers, retry logic)

The Pragmatic Approach

Most production systems use a hybrid: critical facts (user corrections, explicit preferences) go on the hot path for immediate availability, while detailed extraction and consolidation happen in the background. Letta’s approach — where the agent itself manages memory as part of its tool set — is essentially a hot path architecture with the agent deciding what merits immediate storage.

Memory Consolidation: Keeping Memory Clean

Raw memory accumulation leads to contradictions, redundancies, and noise. Memory consolidation is the process of merging, deduplicating, and updating stored memories to maintain coherence.

The Consolidation Loop

When new information arrives that’s semantically similar to existing memories, the system needs to decide: is this an update, a contradiction, or a new fact?

Example scenario:

  • Stored memory: “User lives in Tokyo”
  • New information: “User mentions they recently moved to Berlin”
  • Correct action: DELETE old memory, ADD “User lives in Berlin, previously lived in Tokyo”

Mem0 handles this with its four-operation system (ADD, UPDATE, DELETE, NOOP), where an LLM evaluates each new memory candidate against existing similar memories. Other frameworks implement similar logic through different mechanisms.

Consolidation Strategies

Merge: Combine two related memories into one richer record. “User prefers Python” + “User has 5 years of Python experience” → “User is an experienced Python developer (5 years) who prefers Python for new projects.”

Supersede: Replace outdated information with new facts, preserving the old information as historical context when relevant.

Deduplicate: Detect and merge memories that express the same fact in different words. This requires semantic similarity comparison, not just string matching.

Summarize: Periodically compress collections of granular memories into higher-level summaries. Ten memories about individual project decisions might consolidate into one summary of the user’s decision-making patterns.

Memory Decay: The Art of Forgetting

Remembering everything is as bad as remembering nothing. An agent drowning in thousands of stale memories retrieves irrelevant context that confuses its responses. Effective forgetting is essential.

Recency-Weighted Scoring

The most common decay pattern multiplies semantic similarity by an exponential decay factor based on time since last access:

relevance_score = semantic_similarity × e^(-λ × time_since_last_access)

Memories that haven’t been retrieved recently lose salience gradually. The decay rate (λ) controls how aggressive the forgetting is — a high value means memories fade quickly, a low value means they persist longer.

Access-Based Reinforcement

Memories that get retrieved frequently should decay more slowly — they’re clearly still relevant. Each retrieval “refreshes” the memory’s timestamp, resetting its decay clock. This mirrors human memory where frequently recalled information stays accessible.

Explicit Deprecation

Some information becomes provably obsolete: a user changes their email, a company switches tech stacks, a project gets cancelled. These cases warrant immediate removal rather than gradual decay.

What Not to Forget

Some categories should be decay-resistant:

  • User corrections: If the user told the agent “don’t do X,” that correction should persist indefinitely
  • Safety-relevant information: Access permissions, compliance requirements, security boundaries
  • Identity information: Core user preferences and profile data that define the relationship

How AI Platforms Handle Memory

The major AI platforms have shipped consumer-facing memory features that inform what developers can build with their APIs.

Claude (Anthropic)

Claude’s built-in memory extracts and stores user preferences, project context, and working patterns across conversations. As of March 2026, memory is free for all users (previously limited to paid tiers). Claude supports importing memories from other platforms — users can migrate their accumulated context from ChatGPT or Gemini. Imported memories merge with existing Claude memories rather than overwriting them.

For developers, Claude Code uses a file-based memory system (CLAUDE.md files and a ~/.claude/memory directory) that persists across coding sessions. This is procedural memory in practice — project conventions, user preferences, and feedback that shapes future behavior.

ChatGPT (OpenAI)

ChatGPT’s memory stores facts learned during conversations and applies them to future interactions. Users can view, edit, and delete individual memories. The memory system is opt-in and transparent — users can see exactly what ChatGPT remembers about them.

Gemini (Google)

Gemini added automatic memory with privacy controls in August 2025 and launched “Personal Intelligence” with Gemini 3 in January 2026. In March 2026, Google added import tools for migrating chat history and memories from ChatGPT and Claude, marking the beginning of a data portability race among AI assistants.

The Portability Trend

The fact that all three major platforms now support memory import/export signals a strategic shift: when switching between assistants no longer means losing accumulated context, competition moves to model quality and ecosystem integration rather than user lock-in.

Production Architecture Patterns

Pattern 1: The Simple Stack

For most applications, start here:

Agent Loop
  ├── System prompt (procedural memory)
  ├── Conversation buffer (working memory, last N turns)
  ├── Vector store (semantic + episodic memory)
  │     └── Retrieve top-k relevant memories per turn
  └── Background consolidation job (periodic cleanup)

This works for single-user agents with moderate memory requirements. Use a vector database (Chroma, Qdrant, pgvector) for storage and retrieval, a sliding window for conversation history, and a nightly job for memory consolidation.

Pattern 2: The Hybrid Stack

For applications needing temporal awareness and relationship tracking:

Agent Loop
  ├── System prompt + core memory block (procedural + key facts)
  ├── Conversation buffer with summarization
  ├── Vector store (semantic similarity retrieval)
  ├── Knowledge graph (relationship traversal)
  │     └── Neo4j / Memgraph for entity relationships
  └── Temporal index (when facts were established/changed)

This is what Zep implements internally. The graph layer handles “who said what about whom” queries that vector similarity alone can’t answer. The temporal index prevents the agent from using outdated facts.

Pattern 3: The Agent-Managed Stack

For agents that need autonomy over their own memory (Letta’s approach):

Agent Loop
  ├── Core memory (agent-editable, always in context)
  ├── Recall memory (searchable conversation archive)
  ├── Archival memory (long-term external storage)
  └── Memory tools (search, insert, update exposed as agent tools)

The agent decides what to store, retrieve, and forget as part of its normal reasoning. This produces the most adaptive memory behavior but requires a capable model that uses the memory tools effectively.

Choosing Your Storage Backend

Backend Best For Avoid When
pgvector Teams already on PostgreSQL, need ACID guarantees Very high-dimensional embeddings, massive scale
Chroma Quick prototyping, local development Production at scale (still maturing)
Qdrant Production vector search, filtering + vector hybrid Simple use cases where pgvector suffices
Pinecone Managed service, don’t want to operate infrastructure Cost sensitivity, data residency requirements
Weaviate Hybrid search (vector + keyword), multi-modal Simple semantic search only
Neo4j Relationship-heavy domains, knowledge graphs Simple fact storage without relationships
Redis Fast session memory, caching frequently accessed facts Complex queries, long-term persistence

MCP and Memory

The Model Context Protocol connects naturally to memory through MCP memory servers. These servers expose memory operations (store, retrieve, search, delete) as MCP tools that any compatible AI client can use.

The main MCP memory servers — the official Knowledge Graph server, Zep/Graphiti, Mem0, Basic Memory, Chroma, and Engram — take radically different approaches, from simple JSONL files to temporal knowledge graphs. See our detailed comparison for a head-to-head analysis.

Memory as MCP Resources

Beyond tool-based memory access, MCP’s resource system enables memory to be exposed as readable context. An MCP memory server can expose frequently-accessed memories as resources that clients subscribe to, receiving updates when memories change — useful for keeping an agent’s context fresh without explicit retrieval calls on every turn.

Common Mistakes

1. Treating the Context Window as Memory

Stuffing the entire conversation history into every request works until it doesn’t. At $15/M input tokens (Claude Opus) with 50K tokens of history per request, a busy agent burns through budget fast — and retrieval accuracy degrades well before you hit the token limit.

2. Remembering Everything

Without decay and consolidation, memory stores fill with contradictory, redundant, and irrelevant entries. An agent that remembers the user’s coffee order from 2024 alongside their current project requirements retrieves noise that dilutes useful context.

3. Ignoring Latency

Every memory retrieval adds latency. A vector search takes 50-200ms. A graph traversal can take longer. A consolidation step requires an LLM call. Stack these up and your agent’s response time balloons. Design your memory pipeline with a latency budget.

4. One-Size-Fits-All Storage

Different memory types have different retrieval patterns. Semantic facts need vector similarity search. Temporal events need date-range queries. Entity relationships need graph traversal. Using a single storage backend for everything means some queries will be inefficient.

5. No Memory Visibility

Users should be able to see what the agent remembers about them — and correct it. Every major AI platform (Claude, ChatGPT, Gemini) now exposes memory to users. Building a system without memory visibility creates trust issues and compliance risks, especially with regulations like the EU AI Act (enforcement begins August 2026).

6. Skipping the Simple Approach

Before reaching for Mem0 or Zep, consider whether a markdown file or SQLite database meets your needs. Claude Code’s memory system is literally a directory of markdown files with a MEMORY.md index — and it works well for its use case. Complexity should be justified by requirements, not ambition.

What’s Coming Next

The memory landscape is evolving rapidly:

  • Memory portability is becoming standard — all major platforms now support import/export, pushing toward a world where your AI memory follows you between providers
  • EU AI Act compliance (August 2026) will force memory systems to support data residency, right-to-deletion, and transparent memory operations — local-first solutions like SuperLocalMemory are designed for this
  • Hybrid retrieval (vector + graph + keyword) is becoming the default architecture rather than vector-only, as benchmark data consistently shows multi-modal retrieval outperforms single-approach systems
  • Agent-managed memory (the Letta model) is gaining traction as models become more capable at self-directed memory operations
  • Memory benchmarks are maturing but still contested — expect a standardized evaluation framework to emerge as the field matures

Further Reading