Idea: A memory layer that dynamically extracts, consolidates & retrieves salient facts instead of stuffing raw history into context. A graph variant (Mem0ᵍ) captures entity/event relations.
Results: On LoCoMo, ~26% relative gain (LLM-as-judge) over OpenAI memory, with ~91% lower p95 latency and >90% token-cost savings vs full history.
Why it matters: The most widely adopted open-source memory layer in practice — the reference for the "fact-based memory beats brute-force long context on cost" argument.
Idea: Self-organizing memory inspired by Zettelkasten note-taking. New memories become structured notes; the system autonomously links them to related notes and evolves their attributes — an interconnected knowledge net, not a flat store.
Results: Beats SOTA baselines across six foundation models on long-conversation QA.
Why it matters: The clearest articulation of "agentic memory" — the agent curates and restructures its own memory. The 2025 reference point for self-structuring memory.
Idea: Memory as a temporally-aware knowledge graph (engine: Graphiti) with episodic / semantic / community subgraphs. Tracks when facts were true, so it updates instead of contradicting (e.g. "user moved cities").
Results: DMR 94.8% vs MemGPT 93.4%; on LongMemEval up to +18.5% accuracy and ~90% lower latency, especially on temporal reasoning.
Why it matters: The leading graph-based, time-aware commercial memory layer and the go-to citation for why temporal modeling matters.
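
The "update instead of contradict" behavior can be illustrated with a toy validity-interval store. This is only a sketch under assumptions: `Fact`, `TemporalFacts`, `assert_fact`, and `value_at` are hypothetical names, not Graphiti's API, and a real temporal graph tracks far more than one value per subject and predicate.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    predicate: str
    obj: str
    valid_from: datetime
    valid_to: Optional[datetime] = None  # None means "still true"


class TemporalFacts:
    """Toy store (hypothetical): one current value per (subject, predicate)."""

    def __init__(self):
        self.facts = []

    def assert_fact(self, subject, predicate, obj, at):
        # Close the currently valid fact instead of deleting it, so the
        # history stays queryable.
        for f in self.facts:
            same_slot = (f.subject, f.predicate) == (subject, predicate)
            if same_slot and f.valid_to is None:
                if f.obj == obj:
                    return f  # nothing changed
                f.valid_to = at
        new = Fact(subject, predicate, obj, valid_from=at)
        self.facts.append(new)
        return new

    def value_at(self, subject, predicate, at):
        # Return the value whose validity interval contains `at`.
        for f in self.facts:
            same_slot = (f.subject, f.predicate) == (subject, predicate)
            started = f.valid_from <= at
            not_ended = f.valid_to is None or at < f.valid_to
            if same_slot and started and not_ended:
                return f.obj
        return None


store = TemporalFacts()
store.assert_fact("user", "lives_in", "Berlin", datetime(2023, 1, 1))
store.assert_fact("user", "lives_in", "Lisbon", datetime(2024, 6, 1))
```

After the second call both facts remain in the store; the Berlin fact is simply closed at the move date, so a question about 2023 and a question about today get different, non-contradictory answers.
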
Idea: Borrows OS memory management: a hierarchy of short- / mid- / long-term memory with explicit paging, heat-based promotion and eviction between tiers, plus user-profile and knowledge stores. MemGPT's metaphor with a concrete consolidation scheme.
Results: Improves multi-session consistency & personalization; strong gains over prior memory methods on long-conversation evaluation.
Why it matters: A peer-reviewed (EMNLP) refinement of "LLM-as-OS" with a real consolidation mechanism — a credible academic successor to MemGPT.
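
A rough sketch of heat-based promotion and eviction between a mid-term and a long-term tier. The scoring here (visit count plus an exponentially decaying recency term), the thresholds, and the names `Segment`, `TieredMemory`, and `heat` are assumptions chosen for illustration; the paper's exact formula and parameters are not reproduced.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    text: str
    visits: int = 0
    last_access: float = 0.0  # seconds on a caller-supplied clock


def heat(seg, now, tau=3600.0):
    # Assumed scoring: visit count plus a recency term that decays with age.
    return seg.visits + math.exp(-(now - seg.last_access) / tau)


class TieredMemory:
    def __init__(self, mid_capacity=3, promote_at=3.0):
        self.mid = []    # mid-term tier
        self.long = []   # long-term tier
        self.mid_capacity = mid_capacity
        self.promote_at = promote_at

    def add(self, text, now):
        self.mid.append(Segment(text, last_access=now))
        self._rebalance(now)

    def touch(self, text, now):
        # A retrieval hit raises the segment's heat.
        for seg in self.mid:
            if seg.text == text:
                seg.visits += 1
                seg.last_access = now
        self._rebalance(now)

    def _rebalance(self, now):
        # Promote hot segments to long-term memory.
        hot = [s for s in self.mid if heat(s, now) >= self.promote_at]
        for s in hot:
            self.mid.remove(s)
            self.long.append(s)
        # Evict the coldest segments while the mid-term tier is over capacity.
        while len(self.mid) > self.mid_capacity:
            coldest = min(self.mid, key=lambda s: heat(s, now))
            self.mid.remove(coldest)
```

Frequently retrieved segments migrate to the long-term tier, while segments that are never touched are the first to be dropped when the mid-term tier fills up.
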
Idea: Six typed stores — Core, Episodic, Semantic, Procedural, Resource, Knowledge Vault — orchestrated by a multi-agent controller. Notably multimodal: remembers screenshots/visual experience, not just text.
Results: 85.4% on LoCoMo (SOTA at release); +35% over RAG on ScreenshotVQA while cutting storage ~99.9%.
Why it matters: Pushes memory beyond text and beyond a single store — the "structured, typed, multimodal memory" direction, and a current LoCoMo leader.
Idea: Train memory control instead of hand-writing it. A Memory Manager learns ADD / UPDATE / DELETE / NOOP and an Answer Agent learns to use retrieved memory — both via outcome-driven RL (PPO/GRPO), supervised only by final-answer correctness.
Results: RL-learned management beats heuristic controllers on long-conversation QA, and is data-efficient (the paper reports training on as few as 152 question-answer pairs).
Why it matters: The cleanest statement of a 2025 frontier — "RL is the missing ingredient for adaptive memory." Moves the field from rule-based to learned control.
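
The four-operation action space is easy to picture as a function over a memory bank. The sketch below only executes an operation that some policy has already chosen; the RL training loop is not shown, and `Op`, `apply_op`, and the integer-keyed bank are hypothetical names and structures, not the paper's code.

```python
from enum import Enum


class Op(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    NOOP = "NOOP"


def apply_op(bank, op, key=None, text=None):
    """Return a new memory bank (dict of id -> text) after one operation."""
    bank = dict(bank)  # copy so callers can compare before and after
    if op is Op.ADD:
        new_id = max(bank, default=0) + 1
        bank[new_id] = text
    elif op is Op.UPDATE:
        if key not in bank:
            raise KeyError(f"cannot update missing memory {key}")
        bank[key] = text
    elif op is Op.DELETE:
        bank.pop(key, None)
    # Op.NOOP leaves the bank unchanged.
    return bank


bank = apply_op({}, Op.ADD, text="User adopted a dog named Buddy")
# A good manager merges the new fact rather than deleting the old one.
bank = apply_op(bank, Op.UPDATE, key=1, text="User has two dogs, Buddy and Scout")
```

In an outcome-driven setup, the reward for choosing `UPDATE` over `DELETE` plus `ADD` here would come only from whether a later question (for example, how many dogs the user has) is answered correctly.
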
500 curated questions over scalable chat histories testing five abilities: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. Scales ~115k → ~1.5M tokens. Finding: assistants lose roughly 30% accuracy when recalling information across sustained interaction, and 30–60% in some settings. The most demanding, now-standard test of interactive memory.
Very long multimodal dialogues (~27 sessions, ~600 turns, ~17k tokens each) from a machine-human pipeline grounded in personas + temporal event graphs. The default proving ground: Mem0, A-MEM, MIRIX, and MemoryOS all headline LoCoMo scores. It is now somewhat saturated, which is why newer work increasingly reports LongMemEval as well.
Virtual context management: context window = RAM, external storage = disk; the LLM pages memory in/out via function calls and self-edits a "core memory." The origin of self-editing memory and the most influential memory paper. Lives on as Letta.
The memory stream: time-stamped observations scored by a weighted sum of recency, importance, and relevance, plus reflection (synthesizing observations into higher-level insights). Introduced reflective memory and a retrieval score that many later memory systems reuse.
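
A small sketch of memory-stream retrieval. It combines the three signals as a weighted sum after min-max normalization, with recency modeled as exponential decay per hour since last access. The `Memory` fields, the helper names, and the toy two-dimensional embeddings are assumptions for illustration, and the default decay and weights are illustrative values rather than a tuned configuration.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    importance: float         # e.g. 1-10, assigned when the memory is written
    last_access_hours: float  # timestamp of last retrieval, in hours
    embedding: list


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def min_max(values):
    # Rescale to [0, 1] so the three signals are comparable.
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def retrieve(memories, query_emb, now_hours, k=2, decay=0.995,
             weights=(1.0, 1.0, 1.0)):
    recency = min_max(
        [decay ** (now_hours - m.last_access_hours) for m in memories]
    )
    importance = min_max([m.importance for m in memories])
    relevance = min_max([cosine(m.embedding, query_emb) for m in memories])
    w_rec, w_imp, w_rel = weights
    scored = [
        (w_rec * r + w_imp * i + w_rel * v, m)
        for r, i, v, m in zip(recency, importance, relevance, memories)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]
```

Because the terms are added rather than multiplied, a memory that is old but highly important and relevant can still outrank a fresh but trivial one.
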
Storage + retriever + updater modeled on the Ebbinghaus Forgetting Curve — memories decay unless reinforced — plus an evolving user portrait. The canonical "human-like forgetting" mechanism and an early focus on companion/personalization use cases.
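
The forgetting mechanism can be sketched with the exponential form of the Ebbinghaus curve, R = exp(-t / S), where t is the time since last recall and S is memory strength. The reinforcement rule (recall adds 1 to S and resets t), the deterministic pruning threshold, and the names `Item`, `retention`, `recall`, and `prune` are simplifying assumptions for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class Item:
    text: str
    strength: float = 1.0     # S: grows each time the memory is recalled
    last_recall: float = 0.0  # time of last recall, in days


def retention(item, now):
    """Ebbinghaus-style retention R = exp(-t / S)."""
    t = now - item.last_recall
    return math.exp(-t / item.strength)


def recall(item, now):
    # Reinforcement: a recalled memory gets stronger and its clock restarts.
    item.strength += 1.0
    item.last_recall = now


def prune(items, now, threshold=0.3):
    # Assumed policy: drop memories whose retention fell below a fixed cutoff.
    return [it for it in items if retention(it, now) >= threshold]


jazz = Item("likes jazz")
cat = Item("has a cat")
recall(cat, now=1.0)  # the cat came up again on day 1
kept = prune([jazz, cat], now=3.0)
```

By day 3 the unreinforced memory has retention exp(-3), about 0.05, and is pruned, while the reinforced one sits at exp(-1), about 0.37, and survives.
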
Models the hippocampal indexing theory: an LLM-built knowledge graph + Personalized PageRank to integrate knowledge across documents in a single retrieval step. Up to +20% on multi-hop QA; single-step retrieval matches or beats iterative retrieval (IRCoT) while running 6–13× faster and 10–30× cheaper. Bridges RAG and memory.
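
To show why Personalized PageRank integrates knowledge in one step, here is a minimal pure-Python power iteration over a toy entity graph. The graph, the function name, and the parameter defaults are assumptions for illustration; a real pipeline would build the graph with an LLM and then aggregate node scores back onto passages, which is omitted here.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iters=50):
    """graph: node -> list of neighbours (undirected edges listed both ways).

    Every seed must be a key of `graph`, so the restart vector sums to 1.
    """
    nodes = list(graph)
    restart = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        # Each step, a (1 - damping) share of mass teleports back to the seeds.
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            out = graph[n]
            if not out:
                # Dangling node: return its mass to the seed set.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
                continue
            share = damping * rank[n] / len(out)
            for m in out:
                nxt[m] += share
        rank = nxt
    return rank


# Toy graph: two query entities share one neighbour; Paris/Louvre is unrelated.
graph = {
    "Stanford": ["Sudhof"],
    "Alzheimers": ["Sudhof"],
    "Sudhof": ["Stanford", "Alzheimers"],
    "Paris": ["Louvre"],
    "Louvre": ["Paris"],
}
ranks = personalized_pagerank(graph, seeds={"Stanford", "Alzheimers"})
```

Seeding the walk at both query entities concentrates probability on the node that connects them, while the disconnected component receives none. That is the multi-hop association a single retrieval pass needs.
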
Organizes AI memory through a human-memory lens (sensory/working/long-term; episodic/semantic/procedural) and maps current systems onto it. Good cognitive-science orientation.
Systematizes agent memory along representation, source, and operation (read/write/manage) — the most-cited taxonomy, the framework newer work positions itself within. (Late-2025/26 surveys add a write–manage–read loop and sections on memory security & governance.)
| Theme | 2023–2024 (established) | 2025 (new) |
|---|---|---|
| Architecture | MemGPT "LLM as OS"; flat memory streams | Tiered consolidation (MemoryOS); typed multi-store (MIRIX) |
| Who curates memory | Hand-written rules / fixed ops | Agentic self-organization (A-MEM); RL-learned add/update/delete (Memory-R1) |
| Structure | Vector store; importance/recency scoring | Temporal knowledge graphs that track when facts held (Zep); graph memory (Mem0ᵍ) |
| Goal | Prove it's possible | Production: 90%+ latency & cost cuts at competitive accuracy (Mem0, Zep) |
| Evaluation | LoCoMo (saturating) | LongMemEval (harder; abstention & knowledge updates) |
| Modality | Text only | Multimodal memory — screenshots/visual experience (MIRIX) |
| New concerns | — | Memory security, contradiction, staleness, governance |