Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs

1 University of Alberta    2 University of Massachusetts Amherst    3 Alberta Machine Intelligence Institute (Amii)

TL;DR: We introduce BEAM, a benchmark of coherent conversations up to 10M tokens probing 10 memory abilities, and LIGHT, a cognitively inspired memory framework that improves long-term memory performance by 3.50%–12.69% on average over strong baselines.

At a glance: 100 conversations · 2,000 probing questions · maximum length 10M tokens · 10 memory abilities · 19 domains

Abstract

Evaluating the abilities of large language models (LLMs) on tasks that require long-term memory, and thus long-context reasoning, for example in conversational settings, is hampered by existing benchmarks, which often lack narrative coherence, cover narrow domains, and test only simple recall-oriented tasks. This paper introduces a comprehensive solution to these challenges. First, we present a novel framework for automatically generating long (up to 10M tokens), coherent, and topically diverse conversations, accompanied by probing questions targeting a wide range of memory abilities. From this, we construct BEAM, a new benchmark comprising 100 conversations and 2,000 validated questions. Second, to enhance model performance, we propose LIGHT, a framework inspired by human cognition that equips LLMs with three complementary memory systems: a long-term episodic memory, a short-term working memory, and a scratchpad for accumulating salient facts. Our experiments on BEAM reveal that even LLMs with 1M-token context windows (with and without retrieval augmentation) struggle as dialogues lengthen. In contrast, LIGHT consistently improves performance across various models, achieving an average improvement of 3.50%–12.69% over the strongest baselines, depending on the backbone LLM. An ablation study further confirms the contribution of each memory component.

BEAM Benchmark

BEAM is a benchmark for evaluating long-term memory in LLMs. It contains 100 coherent, multi-domain conversations (lengths: 128K, 500K, 1M, and 10M tokens) and 2,000 human-validated probing questions spanning 10 complementary memory abilities.
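
To make the benchmark's shape concrete, here is a minimal sketch of what a BEAM-style record and evaluation loop could look like. The field and method names (ProbingQuestion, BeamConversation, model.answer, model.judge) are our own placeholders, not the benchmark's actual schema or released code.

    from dataclasses import dataclass

    # Hypothetical sketch only: these names are placeholders, not BEAM's released schema.
    @dataclass
    class ProbingQuestion:
        ability: str            # one of the 10 memory abilities, e.g. "Knowledge Update"
        question: str           # probe posed after the conversation
        reference_answer: str   # human-validated expected answer

    @dataclass
    class BeamConversation:
        domain: str                      # e.g. "Health", "Coding", "Finance"
        target_length_tokens: int        # 128_000, 500_000, 1_000_000, or 10_000_000
        turns: list[str]                 # the coherent multi-session dialogue
        probes: list[ProbingQuestion]    # roughly 20 validated questions per conversation

    def evaluate(model, dataset: list) -> float:
        """Toy evaluation loop: replay each conversation, then ask its probes."""
        scores = []
        for conv in dataset:
            for probe in conv.probes:
                answer = model.answer(history=conv.turns, question=probe.question)
                scores.append(model.judge(answer, probe.reference_answer))  # e.g. a 0-1 score
        return sum(scores) / len(scores)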

Memory abilities

Information Extraction, Multi-hop Reasoning, Knowledge Update, Temporal Reasoning, Summarization, Preference Following, Abstention, Contradiction Resolution, Event Ordering, and Instruction Following.

New abilities introduced in BEAM include Contradiction Resolution, Event Ordering, and Instruction Following.
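
To make the categories concrete, the snippet below pairs each ability with a purely illustrative probe phrasing. These are our own examples of the kind of question each ability suggests, not questions taken from BEAM.

    # Purely illustrative probe phrasings (our own, not questions from BEAM).
    example_probes = {
        "Information Extraction":   "What did I say my monthly rent was?",
        "Multi-hop Reasoning":      "Which of the two cities I compared is the one my sister moved to?",
        "Knowledge Update":         "I changed jobs during our chats; where do I work now?",
        "Temporal Reasoning":       "How long after adopting the dog did I start training classes?",
        "Summarization":            "Summarize the advice you gave me about refinancing my mortgage.",
        "Preference Following":     "Suggest a birthday restaurant, keeping my dietary restrictions in mind.",
        "Abstention":               "What is my childhood best friend's name?",  # never mentioned; the model should decline
        "Contradiction Resolution": "I gave two different dates for the trip; which one did I correct to?",
        "Event Ordering":           "List the three trips I took this year in the order they happened.",
        "Instruction Following":    "Earlier I asked for budgeting answers in bullet points; keep doing that.",
    }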

Comparison with Existing Long-Term Memory Benchmarks

IE: Information Extraction · MR: Multi-hop Reasoning · KU: Knowledge Update · TR: Temporal Reasoning · ABS: Abstention · CR: Contradiction Resolution · EO: Event Ordering · IF: Instruction Following · PF: Preference Following · SUM: Summarization

Benchmark          Domain         Chat Length
MSC                Casual         ~1K
DuLeMon            Casual         ~1K
MemoryBank         Personal       ~5K
PerLTQA            Personal       N/A
LoCoMo             Personal       ~10K
DialSim            TV/Film        ~350K
LongMemEval        Personal       115K, 1M
MemBench           Personal       ~100K
BEAM (This work)   Multi-domain   128K, 500K, 1M, 10M

BEAM's conversations are multi-domain, covering Coding, Math, Health, Finance, Personal, and 14 other domains (19 in total). BEAM spans all ten abilities (IE, MR, KU, TR, ABS, CR, EO, IF, PF, SUM); the prior benchmarks above each cover only a subset, and none include CR, EO, or IF, which BEAM introduces.

LIGHT Method

LIGHT is a cognitively inspired framework that improves long-term memory in LLMs by combining three complementary memory systems: episodic memory (retrieval over the conversation), working memory (recent turns), and a scratchpad that accumulates salient facts over time. At inference, the model answers using retrieved episodic evidence, the working-memory buffer, and a filtered scratchpad; a minimal sketch of how these pieces fit together follows the list below.

  • Episodic memory: retrieves relevant dialogue segments from a vector database.
  • Working memory: includes the most recent turns for local coherence.
  • Scratchpad: maintains compressed, curated notes that persist across long conversations.
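
The snippet below is a minimal sketch of how these three components could fit together in code. It assumes a generic vector retriever and LLM helper methods (extract_salient_fact, filter_notes, generate); it is an illustration of the idea, not the authors' implementation.

    from collections import deque

    class LightMemory:
        # Illustrative sketch of the three memory systems; component and method
        # names are assumptions, not the paper's released code.

        def __init__(self, retriever, llm, working_size=10):
            self.retriever = retriever                  # vector index over past dialogue segments
            self.llm = llm                              # backbone LLM wrapper
            self.working = deque(maxlen=working_size)   # working memory: most recent turns
            self.scratchpad = []                        # salient facts accumulated over time

        def observe(self, turn):
            # Ingest a new turn: index it for episodic retrieval, keep it in the
            # working-memory buffer, and note any salient fact on the scratchpad.
            self.retriever.add(turn)
            self.working.append(turn)
            note = self.llm.extract_salient_fact(turn)  # may return None
            if note:
                self.scratchpad.append(note)

        def answer(self, question, k=5):
            # Answer using retrieved episodic evidence, recent turns, and a
            # scratchpad filtered for relevance to the question.
            episodic = self.retriever.search(question, top_k=k)
            notes = self.llm.filter_notes(self.scratchpad, question)
            return self.llm.generate(
                question=question,
                episodic_evidence=episodic,
                working_memory=list(self.working),
                scratchpad=notes,
            )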

Across models and conversation lengths, LIGHT improves performance by 3.50%–12.69% on average over the strongest baselines.

Average Results

Average performance across the 10 memory abilities on BEAM. Each table reports Vanilla, RAG, and LIGHT scores for each backbone model.

128K Tokens Average

Model Vanilla RAG LIGHT
Qwen 2.5 0.280 0.269 0.311
Llama 4 Maverick 0.240 0.323 0.358
Gemini 2 Flash 0.242 0.280 0.294
GPT-4.1-nano 0.239 0.309 0.345

500K Tokens Average

Model Vanilla RAG LIGHT
Qwen 2.5 0.200 0.291 0.316
Llama 4 Maverick 0.283 0.330 0.359
Gemini 2 Flash 0.257 0.267 0.292
GPT-4.1-nano 0.194 0.314 0.335

1M Tokens Average

Model Vanilla RAG LIGHT
Qwen 2.5 0.193 0.285 0.309
Llama 4 Maverick 0.259 0.307 0.336
Gemini 2 Flash 0.199 0.271 0.284
GPT-4.1-nano 0.191 0.302 0.336

10M Tokens Average

Model Vanilla RAG LIGHT
Qwen 2.5 0.133 0.211 0.238
Llama 4 Maverick 0.104 0.249 0.266
Gemini 2 Flash 0.122 0.216 0.192
GPT-4.1-nano 0.109 0.218 0.226
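
For reference, the snippet below shows one plausible way to turn the tables above into per-model relative improvements of LIGHT over the stronger of the two baselines. Whether the paper's 3.50%–12.69% range is aggregated exactly this way is an assumption; the numbers themselves are copied from the 128K and 10M tables (the 500K and 1M tables can be added the same way).

    # Averages copied from the 128K and 10M tables above: (Vanilla, RAG, LIGHT) per model.
    results = {
        "128K": {
            "Qwen 2.5":         (0.280, 0.269, 0.311),
            "Llama 4 Maverick": (0.240, 0.323, 0.358),
            "Gemini 2 Flash":   (0.242, 0.280, 0.294),
            "GPT-4.1-nano":     (0.239, 0.309, 0.345),
        },
        "10M": {
            "Qwen 2.5":         (0.133, 0.211, 0.238),
            "Llama 4 Maverick": (0.104, 0.249, 0.266),
            "Gemini 2 Flash":   (0.122, 0.216, 0.192),
            "GPT-4.1-nano":     (0.109, 0.218, 0.226),
        },
    }

    for length, models in results.items():
        for name, (vanilla, rag, light) in models.items():
            best_baseline = max(vanilla, rag)
            rel = 100 * (light - best_baseline) / best_baseline
            print(f"{length:>4}  {name:<18} LIGHT vs. best baseline: {rel:+.1f}%")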

BibTeX

@article{tavakoli2025beyond,
  title={Beyond a million tokens: Benchmarking and enhancing long-term memory in {LLMs}},
  author={Tavakoli, Mohammad and Salemi, Alireza and Ye, Carrie and Abdalla, Mohamed and Zamani, Hamed and Mitchell, J Ross},
  journal={arXiv preprint arXiv:2510.27246},
  year={2025}
}