Memory Systems for LLMs & Conversational Agents

A 2024–2025 field report on long-term memory (remembering a user across sessions) and working memory (managing context). Strongest and newest first.

The 2023 foundations — MemGPT (memory as an OS), Generative Agents (memory streams + reflection), MemoryBank (forgetting curves) — set the vocabulary. 2024 added hard benchmarks (LoCoMo, LongMemEval) that exposed how badly models forget. The genuinely new story of 2025: (1) production-grade, low-latency memory layers you can ship — Mem0, Zep; (2) agentic / self-organizing memory where the model curates its own store — A-MEM, MemoryOS, MIRIX; and (3) the first attempts to learn memory management with RL — Memory-R1.

Tier 1 — The 2025 systems that matter most

1. Mem0: Production-Ready Scalable Long-Term Memory

Chhikara, Khant, Aryan, Singh, Yadav (Mem0 AI) · arXiv:2504.19413 · Apr 2025 · paper · code

Idea: A memory layer that dynamically extracts, consolidates & retrieves salient facts instead of stuffing raw history into context. A graph variant (Mem0ᵍ) captures entity/event relations.

Results: On LoCoMo, ~26% relative gain (LLM-as-judge) over OpenAI memory, with ~91% lower p95 latency and >90% token-cost savings vs full history.

Why it matters: Among the most widely adopted open-source memory layers in practice, and the reference for the "fact-based memory beats brute-force long context on cost" argument.
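To make the extract / consolidate / retrieve loop concrete, here is a minimal sketch. It is not Mem0's API or implementation: the class `FactMemory`, the rule-based `extract` step (a real system would prompt an LLM), the word-overlap similarity, and the `dedup_threshold` value are all assumptions chosen so the example runs without external services.

```python
from dataclasses import dataclass, field


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for an embedding."""
    return {w.strip(".,!?").lower() for w in text.split() if w.strip(".,!?")}


def overlap(a: str, b: str) -> float:
    """Jaccard similarity between the word sets of two strings."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


@dataclass
class FactMemory:
    """Hypothetical fact store: keeps salient facts, not raw history."""
    facts: list[str] = field(default_factory=list)
    dedup_threshold: float = 0.8  # assumed value, not taken from the paper

    def extract(self, turn: str) -> list[str]:
        # Assumed extractor: keep first-person statements only.
        sentences = [s.strip() for s in turn.split(".")]
        return [s for s in sentences if s.lower().startswith(("i ", "my "))]

    def consolidate(self, candidate: str) -> None:
        # Skip near-duplicates instead of appending every mention.
        if all(overlap(candidate, f) < self.dedup_threshold for f in self.facts):
            self.facts.append(candidate)

    def add_turn(self, turn: str) -> None:
        for fact in self.extract(turn):
            self.consolidate(fact)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        # Return the k stored facts most similar to the query.
        return sorted(self.facts, key=lambda f: overlap(query, f), reverse=True)[:k]


mem = FactMemory()
mem.add_turn("I live in Lisbon. The weather is nice today. My dog is called Pixel.")
mem.add_turn("I live in Lisbon.")  # repeated fact, dropped by consolidation
print(mem.facts)
print(mem.retrieve("Where do I live?", k=1))
```

Only the retrieved facts would be placed in the prompt, which is where the token savings over replaying the full history come from.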

2. A-MEM: Agentic Memory for LLM Agents

W. Xu, Z. Liang, K. Mei, H. Gao, J. Tan, Y. Zhang (Rutgers / AGI Research) · arXiv:2502.12110 · Feb 2025 · NeurIPS 2025 · paper · code

Idea: Self-organizing memory inspired by Zettelkasten note-taking. New memories become structured notes; the system autonomously links them to related notes and evolves their attributes — an interconnected knowledge net, not a flat store.

Results: Beats SOTA baselines across six foundation models on long-conversation QA.

Why it matters: The clearest articulation of "agentic memory" — the agent curates and restructures its own memory. The 2025 reference point for self-structuring memory.
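A toy version of the note-linking idea, assuming keyword overlap in place of the embedding and LLM judgments a real system would use. `Note`, `NoteNetwork`, and `link_threshold` are hypothetical names for illustration, and "evolution" is reduced here to an older note absorbing the keywords of a newly linked one.

```python
from dataclasses import dataclass, field


@dataclass
class Note:
    note_id: int
    content: str
    keywords: set[str]
    links: set[int] = field(default_factory=set)


class NoteNetwork:
    """Hypothetical Zettelkasten-style store: notes link to related notes."""

    def __init__(self, link_threshold: float = 0.25) -> None:
        self.notes: dict[int, Note] = {}
        self.link_threshold = link_threshold  # assumed value

    @staticmethod
    def similarity(a: set[str], b: set[str]) -> float:
        # Jaccard overlap of keyword sets, a stand-in for embedding similarity.
        return len(a & b) / len(a | b) if a | b else 0.0

    def add(self, content: str, keywords: set[str]) -> Note:
        note = Note(len(self.notes), content, set(keywords))
        for other in self.notes.values():
            if self.similarity(note.keywords, other.keywords) >= self.link_threshold:
                # Link in both directions.
                note.links.add(other.note_id)
                other.links.add(note.note_id)
                # Simplified "evolution": the older note absorbs new keywords.
                other.keywords |= note.keywords
        self.notes[note.note_id] = note
        return note


net = NoteNetwork()
a = net.add("User is training for a marathon", {"running", "marathon", "fitness"})
b = net.add("User asked about knee pain after runs", {"running", "fitness", "knee"})
c = net.add("User prefers dark roast coffee", {"coffee", "preference"})
print(a.links, b.links, c.links)
```

The point of the structure is that retrieval can follow links from one hit to its neighbors instead of relying on a single flat similarity search.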

3. Zep: A Temporal Knowledge Graph Architecture for Agent Memory

Rasmussen, Paliychuk, Beauvais, Ryan, Chalef (Zep) · arXiv:2501.13956 · Jan 2025 · paper · Graphiti

Idea: Memory as a temporally-aware knowledge graph (engine: Graphiti) with episodic / semantic / community subgraphs. Tracks when facts were true, so it updates instead of contradicting (e.g. "user moved cities").

Results: DMR 94.8% vs MemGPT 93.4%; on LongMemEval up to +18.5% accuracy and ~90% lower latency, especially on temporal reasoning.

Why it matters: The leading graph-based, time-aware commercial memory layer and the go-to citation for why temporal modeling matters.
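The "update instead of contradict" behavior can be sketched with validity intervals on facts. This is not Graphiti's data model or API: `Fact`, `TemporalFacts`, and their methods are hypothetical, and the sketch assumes single-valued relations asserted in chronological order.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Fact:
    subject: str
    relation: str
    obj: str
    valid_from: date
    valid_to: Optional[date] = None  # None means the fact still holds


class TemporalFacts:
    """Hypothetical store that closes old facts instead of overwriting them."""

    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def assert_fact(self, subject: str, relation: str, obj: str, when: date) -> None:
        for f in self.facts:
            if f.subject == subject and f.relation == relation and f.valid_to is None:
                if f.obj == obj:
                    return  # already known and still valid
                f.valid_to = when  # invalidate, but keep the history
        self.facts.append(Fact(subject, relation, obj, when))

    def as_of(self, subject: str, relation: str, when: date) -> Optional[str]:
        # Answer "what was true on this date?" from the validity intervals.
        for f in self.facts:
            if (f.subject == subject and f.relation == relation
                    and f.valid_from <= when
                    and (f.valid_to is None or when < f.valid_to)):
                return f.obj
        return None


kg = TemporalFacts()
kg.assert_fact("user", "lives_in", "Berlin", date(2023, 1, 10))
kg.assert_fact("user", "lives_in", "Lisbon", date(2025, 3, 1))  # "user moved cities"
print(kg.as_of("user", "lives_in", date(2024, 6, 1)))
print(kg.as_of("user", "lives_in", date(2025, 6, 1)))
```

Both facts survive in the store, so a question about last year and a question about today get different, consistent answers.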

4. MemoryOS — Memory OS of AI Agent

Jiazheng Kang et al. · arXiv:2506.06326 · May/Jun 2025 · EMNLP 2025 (main) · paper

Idea: Borrows OS memory management: a hierarchy of short- / mid- / long-term memory with explicit paging, heat-based promotion and eviction between tiers, plus user-profile and knowledge stores. MemGPT's metaphor with a concrete consolidation scheme.

Results: Improves multi-session consistency & personalization; strong gains over prior memory methods on long-conversation evaluation.

Why it matters: A peer-reviewed (EMNLP) refinement of "LLM-as-OS" with a real consolidation mechanism — a credible academic successor to MemGPT.
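A rough sketch of tiering with heat-based promotion and eviction. The details are assumptions, not the paper's scheme: heat is reduced to an access counter, and `TieredMemory`, its capacities, and `promote_heat` are hypothetical.

```python
from collections import deque


class TieredMemory:
    """Hypothetical three-tier store: short (FIFO) -> mid (heat) -> long."""

    def __init__(self, short_capacity: int = 3, mid_capacity: int = 4,
                 promote_heat: int = 2) -> None:
        self.short: deque[str] = deque()  # newest dialogue pages
        self.mid: dict[str, int] = {}     # page -> heat (access count here)
        self.long: set[str] = set()       # promoted, persistent pages
        self.short_capacity = short_capacity
        self.mid_capacity = mid_capacity
        self.promote_heat = promote_heat

    def write(self, page: str) -> None:
        self.short.append(page)
        while len(self.short) > self.short_capacity:
            # Page out the oldest short-term entry into the mid tier.
            self.mid.setdefault(self.short.popleft(), 0)
            self._evict_coldest()

    def _evict_coldest(self) -> None:
        while len(self.mid) > self.mid_capacity:
            coldest = min(self.mid, key=self.mid.get)
            del self.mid[coldest]

    def touch(self, page: str) -> None:
        # Each retrieval heats a mid-tier page; hot pages are promoted.
        if page in self.mid:
            self.mid[page] += 1
            if self.mid[page] >= self.promote_heat:
                del self.mid[page]
                self.long.add(page)


m = TieredMemory(short_capacity=2, mid_capacity=2, promote_heat=2)
for p in ["p1", "p2", "p3", "p4"]:
    m.write(p)
m.touch("p1")
m.touch("p1")  # second access promotes p1 to long-term
m.write("p5")
m.touch("p3")
m.write("p6")  # mid tier overflows; the coldest page (p2) is evicted
print(list(m.short), m.mid, m.long)
```

Pages that keep getting retrieved migrate toward the persistent tier, while pages nobody asks about fall out of the middle tier first.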

5. MIRIX: Multi-Agent Memory System for LLM-Based Agents

Yu Wang, Xi Chen (MIRIX AI) · arXiv:2507.07957 · Jul 2025 · paper

Idea: Six typed stores — Core, Episodic, Semantic, Procedural, Resource, Knowledge Vault — orchestrated by a multi-agent controller. Notably multimodal: remembers screenshots/visual experience, not just text.

Results: 85.4% on LoCoMo (SOTA at release); +35% over RAG on ScreenshotVQA while cutting storage ~99.9%.

Why it matters: Pushes memory beyond text and beyond a single store — the "structured, typed, multimodal memory" direction, and a current LoCoMo leader.
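The typed-store idea can be illustrated with a small router. The six store names come from the description above; everything else is assumed: the `route` rules stand in for the multi-agent controller, and the item fields (`sensitive`, `steps`, `file_path`, `timestamp`, `about_user`) are hypothetical.

```python
from enum import Enum


class Store(Enum):
    CORE = "core"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"
    RESOURCE = "resource"
    KNOWLEDGE_VAULT = "knowledge_vault"


def route(item: dict) -> Store:
    """Hypothetical rule-based router; a real controller would use an LLM."""
    if item.get("sensitive"):
        return Store.KNOWLEDGE_VAULT  # e.g. credentials, kept verbatim
    if "steps" in item:
        return Store.PROCEDURAL       # how-to knowledge
    if "file_path" in item:
        return Store.RESOURCE         # documents, screenshots
    if "timestamp" in item:
        return Store.EPISODIC         # dated events
    if item.get("about_user"):
        return Store.CORE             # stable user profile facts
    return Store.SEMANTIC             # general facts and concepts


class TypedMemory:
    def __init__(self) -> None:
        self.stores: dict[Store, list[dict]] = {store: [] for store in Store}

    def write(self, item: dict) -> Store:
        store = route(item)
        self.stores[store].append(item)
        return store


mem = TypedMemory()
print(mem.write({"text": "API key for staging", "sensitive": True}))
print(mem.write({"text": "Met Sam for lunch", "timestamp": "2025-07-01"}))
print(mem.write({"text": "Lisbon is the capital of Portugal"}))
```

Separating stores this way lets each type have its own retention and retrieval policy, which a single flat store cannot express.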

6. Memory-R1: Managing & Utilizing Memory via Reinforcement Learning

arXiv:2508.19828 · Aug 2025 · paper

Idea: Train memory control instead of hand-writing it. A Memory Manager learns ADD / UPDATE / DELETE / NOOP and an Answer Agent learns to use retrieved memory — both via outcome-driven RL (PPO/GRPO), supervised only by final-answer correctness.

Results: RL-learned management beats heuristic controllers on long-conversation QA, and is data-efficient.

Why it matters: The cleanest statement of a 2025 frontier — "RL is the missing ingredient for adaptive memory." Moves the field from rule-based to learned control.
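The interface being learned is small enough to sketch. The code below shows only the action space and an outcome-style reward, not the RL training itself. `Action`, `apply_action`, and `outcome_reward` are hypothetical names, the memory bank is simplified to a key-value dict, and exact-match scoring is an assumption about how "final-answer correctness" might be computed.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Op(Enum):
    ADD = "add"
    UPDATE = "update"
    DELETE = "delete"
    NOOP = "noop"


@dataclass
class Action:
    op: Op
    key: Optional[str] = None
    value: Optional[str] = None


def apply_action(bank: dict[str, str], action: Action) -> dict[str, str]:
    """Return a new memory bank with one manager action applied."""
    new_bank = dict(bank)
    if action.op is Op.ADD and action.key not in new_bank:
        new_bank[action.key] = action.value
    elif action.op is Op.UPDATE and action.key in new_bank:
        new_bank[action.key] = action.value
    elif action.op is Op.DELETE:
        new_bank.pop(action.key, None)
    return new_bank  # NOOP falls through unchanged


def outcome_reward(predicted: str, gold: str) -> float:
    """1.0 if the final answer matches after normalization, else 0.0."""
    return float(predicted.strip().lower() == gold.strip().lower())


# New turn: "I adopted a second dog." Two candidate manager actions:
bank = {"pet": "one dog"}
updated = apply_action(bank, Action(Op.UPDATE, "pet", "two dogs"))
ignored = apply_action(bank, Action(Op.NOOP))

# The only training signal is whether the downstream answer is right.
print(outcome_reward(updated["pet"], "Two dogs"))
print(outcome_reward(ignored["pet"], "Two dogs"))
```

No one labels UPDATE as the correct operation here. It earns reward only because the answer produced from the resulting memory is correct, which is what lets the policy be trained without per-operation supervision.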

Tier 2 — The benchmarks everyone reports against

LongMemEval — Benchmarking Chat Assistants on Long-Term Interactive Memory

Di Wu et al. · arXiv:2410.10813 · Oct 2024 · ICLR 2025 · paper · code

500 curated questions over scalable chat histories testing five abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. Scales ~115k → ~1.5M tokens. Finding: commercial assistants and long-context LLMs lose roughly 30% accuracy across sustained interaction, with drops of 30–60% in the worst cases. The most demanding, now-standard test of interactive memory.

LoCoMo — Evaluating Very Long-Term Conversational Memory

Maharana, Lee, Tulyakov, Bansal, Barbieri, Fang (Snap / UNC) · arXiv:2402.17753 · Feb 2024 · ACL 2024 · paper · site

Very long multimodal dialogues (~27 sessions, ~600 turns, ~17k tokens each) from a machine-human pipeline grounded in personas + temporal event graphs. The default proving ground: Mem0, A-MEM, MIRIX, and MemoryOS all headline LoCoMo scores. It is now somewhat saturated, which is one reason newer systems also report LongMemEval.

Tier 3 — The 2023–2024 foundations

MemGPT: Towards LLMs as Operating Systems

Packer, Wooders, Lin, Fang, Patil, Stoica, Gonzalez (UC Berkeley) · arXiv:2310.08560 · Oct 2023 · now Letta · paper

Virtual context management: context window = RAM, external storage = disk; the LLM pages memory in/out via function calls and self-edits a "core memory." The origin of self-editing memory and arguably the most influential memory paper. Lives on as Letta.
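The RAM/disk analogy can be sketched in a few lines. This is an illustration of the idea, not MemGPT's or Letta's actual function set: `VirtualContext`, `edit_core`, and `search_archive` are hypothetical names, tokens are approximated by whitespace splitting, and eviction is simple oldest-first.

```python
class VirtualContext:
    """Hypothetical context manager: a bounded window backed by an archive."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.core: dict[str, str] = {}  # always-in-context, self-editable
        self.window: list[str] = []     # "RAM": messages currently in context
        self.archive: list[str] = []    # "disk": messages paged out

    @staticmethod
    def count_tokens(text: str) -> int:
        return len(text.split())  # whitespace approximation, an assumption

    def used_tokens(self) -> int:
        core = sum(self.count_tokens(v) for v in self.core.values())
        return core + sum(self.count_tokens(m) for m in self.window)

    def append(self, message: str) -> None:
        self.window.append(message)
        # Page the oldest messages out to the archive when over budget.
        while self.used_tokens() > self.max_tokens and len(self.window) > 1:
            self.archive.append(self.window.pop(0))

    # The two methods below play the role of tools the LLM would call.
    def edit_core(self, key: str, value: str) -> None:
        self.core[key] = value

    def search_archive(self, query: str) -> list[str]:
        return [m for m in self.archive if query.lower() in m.lower()]


ctx = VirtualContext(max_tokens=8)
ctx.edit_core("name", "Ada")
ctx.append("my favorite color is green")
ctx.append("what should I cook tonight")  # pushes the first message to disk
print(ctx.window)
print(ctx.search_archive("color"))
```

The evicted message is out of the prompt but not lost: a later search call pages it back in when the conversation needs it.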

Generative Agents: Interactive Simulacra of Human Behavior

Park, O'Brien, Cai, Morris, Liang, Bernstein (Stanford/Google) · arXiv:2304.03442 · Apr 2023 · UIST 2023 · paper

The memory stream (time-stamped observations scored by a weighted sum of recency, importance, and relevance) plus reflection (synthesizing observations into higher-level insights). Introduced reflective memory and a retrieval score that later systems widely reuse.
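A sketch of the retrieval score may help. The paper describes it as a weighted sum of recency, importance, and relevance, each min-max normalized to [0, 1], with recency decaying exponentially (a factor of 0.995 per hour) and all weights set to 1. The function names, the dict layout of a memory, and the toy 2-D embeddings below are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def min_max(values: list[float]) -> list[float]:
    """Scale values to [0, 1]; all-equal inputs map to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def retrieval_scores(memories, query_vec, now_hours, decay=0.995,
                     weights=(1.0, 1.0, 1.0)):
    """Weighted sum of normalized recency, importance, and relevance."""
    recency = min_max([decay ** (now_hours - m["last_access_hours"]) for m in memories])
    importance = min_max([float(m["importance"]) for m in memories])
    relevance = min_max([cosine(m["embedding"], query_vec) for m in memories])
    w_rec, w_imp, w_rel = weights
    return [w_rec * r + w_imp * i + w_rel * v
            for r, i, v in zip(recency, importance, relevance)]


memories = [
    {"text": "ate breakfast", "last_access_hours": 99, "importance": 2, "embedding": [1.0, 0.0]},
    {"text": "got engaged", "last_access_hours": 0, "importance": 9, "embedding": [0.0, 1.0]},
    {"text": "planned a trip", "last_access_hours": 50, "importance": 5, "embedding": [1.0, 1.0]},
]
scores = retrieval_scores(memories, query_vec=[0.0, 1.0], now_hours=100)
ranked = sorted(range(len(memories)), key=lambda i: scores[i], reverse=True)
print([memories[i]["text"] for i in ranked])
```

In this toy run the old but important and relevant memory outranks the fresh but trivial one, which is the behavior the three-term score is designed to produce.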

MemoryBank: Enhancing LLMs with Long-Term Memory

Zhong, Guo, Gao, Ye, Wang · arXiv:2305.10250 · May 2023 · AAAI 2024 · paper

Storage + retriever + updater modeled on the Ebbinghaus Forgetting Curve — memories decay unless reinforced — plus an evolving user portrait. The canonical "human-like forgetting" mechanism and an early focus on companion/personalization use cases.
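The forgetting mechanism follows the Ebbinghaus form R = exp(-t / S), where t is time since the last recall and S is memory strength. A minimal sketch, assuming days as the time unit, a strength increment of 1 per recall, and a fixed pruning threshold; `MemoryItem`, `prune`, and `min_retention` are hypothetical names and values.

```python
import math
from dataclasses import dataclass


@dataclass
class MemoryItem:
    text: str
    strength: float = 1.0         # S: grows each time the memory is recalled
    last_recall_day: float = 0.0  # resets the clock on recall

    def retention(self, today: float) -> float:
        """Ebbinghaus-style retention R = exp(-t / S)."""
        elapsed = today - self.last_recall_day
        return math.exp(-elapsed / self.strength)

    def recall(self, today: float) -> None:
        # Reinforcement: a recalled memory decays more slowly afterwards.
        self.strength += 1.0
        self.last_recall_day = today


def prune(items: list[MemoryItem], today: float,
          min_retention: float = 0.3) -> list[MemoryItem]:
    """Keep only memories whose retention is at or above the threshold."""
    return [m for m in items if m.retention(today) >= min_retention]


jazz = MemoryItem("likes jazz")
sister = MemoryItem("has a sister")
sister.recall(today=1.0)  # mentioned again on day 1

kept = prune([jazz, sister], today=3.0)
print([m.text for m in kept])
```

By day 3 the unreinforced memory has retention exp(-3), roughly 0.05, and is dropped, while the reinforced one sits at exp(-1), roughly 0.37, and survives.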

HippoRAG: Neurobiologically Inspired Long-Term Memory

Gutiérrez, Shu, Gu, Yasunaga, Yu Su (Ohio State) · arXiv:2405.14831 · May 2024 · NeurIPS 2024 · paper · code

Models the hippocampal indexing theory: an LLM-built knowledge graph + Personalized PageRank to integrate knowledge across documents in one step. Up to +20% on multi-hop QA, and 6–13× faster and 10–30× cheaper than iterative retrieval. Bridges RAG and memory.
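Personalized PageRank is the piece that does the single-step multi-hop work, so a small pure-Python version may be useful. This is a generic power-iteration sketch, not HippoRAG's code: the graph, the seed set, and the `damping` and `iterations` values are assumptions.

```python
def personalized_pagerank(graph: dict[str, list[str]], seeds: set[str],
                          damping: float = 0.85, iterations: int = 50) -> dict[str, float]:
    """Power iteration where the random walk restarts at the seed nodes."""
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iterations):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            neighbors = graph[n]
            if not neighbors:
                # Dangling node: send its mass back to the seeds.
                for s in nodes:
                    nxt[s] += damping * rank[n] * restart[s]
                continue
            share = damping * rank[n] / len(neighbors)
            for m in neighbors:
                nxt[m] += share
        rank = nxt
    return rank


# Toy knowledge graph: which node connects "Stanford" and "Alzheimer's research"?
graph = {
    "Stanford": ["Prof. Thomas"],
    "Prof. Thomas": ["Stanford", "Alzheimer's research"],
    "Alzheimer's research": ["Prof. Thomas"],
    "Paris": ["France"],
    "France": ["Paris"],
}
rank = personalized_pagerank(graph, seeds={"Stanford", "Alzheimer's research"})
print(max(rank, key=rank.get))
```

Seeding the walk at the query's entities concentrates probability mass on nodes that bridge them, so the connecting passage surfaces without a second retrieval round, while unrelated parts of the graph receive no mass.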

Tier 4 — Surveys that map the field (2025)

From Human Memory to AI Memory — A Survey

arXiv:2504.15965 · 2025 · paper

Organizes AI memory through a human-memory lens (sensory/working/long-term; episodic/semantic/procedural) and maps current systems onto it. Good cognitive-science orientation.

A Survey on the Memory Mechanism of LLM-Based Agents

Zhang et al. · ACM TOIS 2025 · arXiv:2404.13501 · paper

Systematizes agent memory along representation, source, and operation (read/write/manage) — the most-cited taxonomy, the framework newer work positions itself within. (Late-2025/26 surveys add a write–manage–read loop and sections on memory security & governance.)

What's genuinely new in 2025 vs. 2023–2024

Theme	2023–2024 (established)	2025 (new)
Architecture	MemGPT "LLM as OS"; flat memory streams	Tiered consolidation (MemoryOS); typed multi-store (MIRIX)
Who curates memory	Hand-written rules / fixed ops	Agentic self-organization (A-MEM); RL-learned add/update/delete (Memory-R1)
Structure	Vector store; importance/recency scoring	Temporal knowledge graphs that track when facts held (Zep); graph memory (Mem0ᵍ)
Goal	Prove it's possible	Production: 90%+ latency & cost cuts at competitive accuracy (Mem0, Zep)
Evaluation	LoCoMo (saturating)	LongMemEval (harder; abstention & knowledge updates)
Modality	Text only	Multimodal memory — screenshots/visual experience (MIRIX)
New concerns	—	Memory security, contradiction, staleness, governance

Bottom line for a voice companion: the practical stack today is a fact-extraction + consolidation memory layer (Mem0 / Zep style) over a vector or graph store, with temporal awareness so it updates rather than contradicts, evaluated on LongMemEval. The frontier to watch: agentic + RL-learned memory control (A-MEM, Memory-R1) and typed/multimodal stores (MIRIX).