Agent Memory Benchmark

Memory your agents can trust, and prove

60db Smart Memory gives AI agents long-term, relational, and temporal recall. We measure it on the full 500-case LongMemEval-s set with an independent judge and publish every category, with no cherry-picking.

LongMemEval-s · 500 cases · independent judge

92.4%

LongMemEval-s (independent judge)

~185 ms

Median recall latency

100%

Single-session assistant recall

0

External APIs in retrieval path

Why this number is honest

Most memory benchmarks are self-reported and self-graded: the same vendor's model both answers and scores the answer, which can inflate results. Our headline number is graded by a different vendor's model (an independent third-party judge) on the full 500-case LongMemEval-s set. It is the conservative number, and it is the real one. Multi-session synthesis, at 81.2%, is our lowest-scoring category and our active work area, and we would rather tell you that than quietly omit it.

Full results — every category, no hiding

LongMemEval-s, 500 cases, independent judge. We publish the categories we lead on and the ones we are still improving.

Category	60db accuracy
Single-session — Assistant recall	100.0%
Single-session — User facts	95.6%
Single-session — Preference	95.0%
Abstention (correctly saying "I don't know")	97.3%
Knowledge Update (current value of a changed fact)	90.2%
Temporal Reasoning (dates, durations, ordering)	87.5%
Multi-session synthesis	81.2%
Overall	92.4%

Fast enough to feel instant

~185 ms

Median recall latency

Sub-200 ms recall is fast enough for real-time voice agents, not just chat. We got there by retiring an expensive query-expansion step that added ~575 ms for no measurable accuracy gain: in an A/B test on 180 queries, recall quality was identical and recall was about 4× faster. Speed without a quality trade-off.
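The "about 4×" figure follows from the two latency numbers quoted above. The short check below is only a back-of-the-envelope sketch: it assumes the ~575 ms overhead added directly to the median, which is a simplification, and the variable names are illustrative.

```python
# Figures quoted in the text above.
current_median_ms = 185      # median recall latency today
expansion_overhead_ms = 575  # approximate cost of the retired query-expansion step

# Assumption: the overhead added directly to the median latency.
previous_median_ms = current_median_ms + expansion_overhead_ms
speedup = previous_median_ms / current_median_ms

print(f"previous ~{previous_median_ms} ms, speedup ~{speedup:.1f}x")
# prints: previous ~760 ms, speedup ~4.1x
```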

A real memory system, not a vector cache

A single query fuses semantic, graph, and temporal signals in parallel, then assembles exactly the context your model needs — not a wall of loosely-related chunks.
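To make "fuses signals in parallel" concrete, here is a minimal sketch, not 60db's actual implementation. The three retriever functions and their canned results are hypothetical stand-ins, and reciprocal rank fusion is used only as one well-known way to merge ranked lists; the source does not say which fusion method 60db uses.

```python
import asyncio
from collections import defaultdict


# Hypothetical retrievers: each returns memory IDs, best match first.
async def semantic_search(query: str) -> list[str]:
    return ["m3", "m1", "m7"]


async def graph_search(query: str) -> list[str]:
    return ["m1", "m4", "m3"]


async def temporal_search(query: str) -> list[str]:
    return ["m1", "m7", "m9"]


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: an item scores sum(1 / (k + rank)) over the lists it appears in."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda m: (-scores[m], m))


async def recall(query: str, top_n: int = 3) -> list[str]:
    # Run the three retrievers concurrently rather than one after another.
    rankings = await asyncio.gather(
        semantic_search(query), graph_search(query), temporal_search(query)
    )
    return reciprocal_rank_fusion(list(rankings))[:top_n]


fused = asyncio.run(recall("where does Dana work now?"))
print(fused)  # ['m1', 'm3', 'm7']
```

In this toy run `m1` ranks first because all three signals return it near the top, which is the intuition behind fusing signals instead of trusting vector similarity alone.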

Semantic vectors

Dense similarity search over everything the agent has seen

Knowledge graph

Entities and relationships — who, what, how things connect

Temporal model

Real-world event time vs ingestion time, with fact invalidation

Current-state projections

The latest value of a fact that changed over time

Hierarchical context

Tiered summaries so long histories stay retrievable

Event timeline

Append-only log of everything that happened
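The temporal model, current-state projection, and event timeline described above can be illustrated together with a small bi-temporal sketch. Everything here (`Fact`, `FactStore`, the field names, the sample data) is hypothetical and chosen for illustration; it shows the general technique, not 60db's schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    event_time: datetime                       # when it became true in the real world
    ingested_at: datetime                      # when the system learned about it
    invalidated_at: Optional[datetime] = None  # event time at which a later fact replaced it


class FactStore:
    def __init__(self) -> None:
        self.timeline: list[Fact] = []  # append-only: facts are marked stale, never deleted

    def ingest(self, fact: Fact) -> None:
        self.timeline.append(fact)
        # Re-derive validity for this key in event-time order, so a late-arriving
        # fact about the past does not displace a newer truth.
        same_key = sorted(
            (f for f in self.timeline
             if (f.subject, f.attribute) == (fact.subject, fact.attribute)),
            key=lambda f: f.event_time,
        )
        for earlier, later in zip(same_key, same_key[1:]):
            earlier.invalidated_at = later.event_time
        same_key[-1].invalidated_at = None

    def current(self, subject: str, attribute: str) -> Optional[str]:
        """Current-state projection: the one fact for this key that is still valid."""
        for f in self.timeline:
            if (f.subject, f.attribute) == (subject, attribute) and f.invalidated_at is None:
                return f.value
        return None

    def as_of(self, subject: str, attribute: str, when: datetime) -> Optional[str]:
        """What was true in the real world at `when`, regardless of ingestion order."""
        for f in self.timeline:
            if ((f.subject, f.attribute) == (subject, attribute)
                    and f.event_time <= when
                    and (f.invalidated_at is None or when < f.invalidated_at)):
                return f.value
        return None


store = FactStore()
# The newer job is ingested first; the older job is learned about later.
store.ingest(Fact("Dana", "employer", "Globex", datetime(2024, 2, 1), datetime(2024, 3, 1)))
store.ingest(Fact("Dana", "employer", "Acme", datetime(2023, 1, 10), datetime(2024, 3, 5)))

print(store.current("Dana", "employer"))                      # Globex
print(store.as_of("Dana", "employer", datetime(2023, 6, 1)))  # Acme
```

Keeping event time separate from ingestion time is what lets the store answer "Globex" for the current state even though "Acme" was ingested more recently.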

How 60db compares

HydraDB and Zep are graph- and temporal-first; Mem0 is vector-first with a graph add-on. 60db runs the full breadth — vector, graph, temporal, hierarchical, and timeline — at sub-200ms, with first-party models and no third-party API in the retrieval path.

Capability	60db	HydraDB	Zep	Mem0
Semantic vector recall
Knowledge graph (entities & relations)				Add-on
Temporal model (event vs ingestion time)				Partial
Fact invalidation / current-state		Partial		Partial
Hierarchical summaries
Event / episode timeline
Real-time recall latency	~185 ms	Sub-200 ms*	Sub-200 ms*	~0.9–1.1 s*
First-party in-house models (no external API in retrieval path)		BYO / optional		BYO / local
Multi-tenant

*As publicly documented as of mid-2026: HydraDB and Zep both market sub-200ms latency; Mem0 reports ~0.9–1.1s p50 retrieval at production scale. HydraDB self-reports ~90.8% on LongMemEval-s. Capabilities reflect each system's primary, generally-available offering and evolve over time.

vs HydraDB

Architecturally the closest — graph-native, self-hostable, sub-200ms. 60db adds hierarchical summaries and first-party in-house models (HydraDB's LLM parsing and fact extraction are optional/external), and our headline accuracy is independently judged: 92.4% on LongMemEval-s, where HydraDB self-reports ~90.8%.

vs Zep

Comparable graph and temporal depth and similar sub-200ms latency — but 60db's embeddings, extraction and reranking are all first-party, so no external embedding or LLM API sits in the retrieval path. Zep leans on external models for extraction.

vs Mem0

You get a true knowledge graph, temporal reasoning, and hierarchical context as first-class layers — not a vector store with optional graph add-ons — plus first-party in-house models and ~185ms recall, where Mem0 reports roughly one second at production scale.

vs all three

Our quality number is independently judged and published category by category (see "How we measure"). It is graded by a different vendor's model, not self-graded.

Built for production

No third-party API in the hot path

Embeddings, reranking, and extraction are all first-party — nothing in the retrieval path depends on an external model vendor.

Predictable cost

No per-token fees to an external memory provider.

Crash-safe ingest

A process restart never loses an in-flight memory — interrupted ingests self-heal automatically.

Multi-tenant by design

One deployment cleanly serves many clients, teams, and users with isolated memory.
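One common way to get the crash-safe ingest behavior described above is a write-ahead journal: record the intent durably before doing the work, record completion afterwards, and replay anything left open on restart. The sketch below illustrates that general pattern under stated assumptions. `IngestJournal`, `recover`, and the dict standing in for the memory store are all hypothetical, and the source does not describe how 60db implements recovery.

```python
import json
import os
import tempfile


class IngestJournal:
    """Hypothetical write-ahead journal: log intent before work, completion after."""

    def __init__(self, path: str) -> None:
        self.path = path

    def _append(self, record: dict) -> None:
        with open(self.path, "a", encoding="utf-8") as f:
            f.write(json.dumps(record) + "\n")
            f.flush()
            os.fsync(f.fileno())  # make the record durable before continuing

    def begin(self, memory_id: str, payload: str) -> None:
        self._append({"op": "begin", "id": memory_id, "payload": payload})

    def commit(self, memory_id: str) -> None:
        self._append({"op": "commit", "id": memory_id})

    def pending(self) -> dict[str, str]:
        """Ingests that began but never committed, i.e. were interrupted."""
        open_items: dict[str, str] = {}
        if not os.path.exists(self.path):
            return open_items
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                record = json.loads(line)
                if record["op"] == "begin":
                    open_items[record["id"]] = record["payload"]
                else:
                    open_items.pop(record["id"], None)
        return open_items


def recover(journal: IngestJournal, store: dict[str, str]) -> list[str]:
    """On startup, replay every interrupted ingest. Writes are keyed by ID, so replay is idempotent."""
    replayed = []
    for memory_id, payload in journal.pending().items():
        store[memory_id] = payload
        journal.commit(memory_id)
        replayed.append(memory_id)
    return replayed


with tempfile.TemporaryDirectory() as tmp:
    journal = IngestJournal(os.path.join(tmp, "ingest.log"))
    store: dict[str, str] = {}

    journal.begin("m1", "Dana moved to Lisbon")
    store["m1"] = "Dana moved to Lisbon"
    journal.commit("m1")

    journal.begin("m2", "Dana joined Globex")  # simulated crash: no write, no commit

    # A fresh process reopens the same journal file and heals the gap.
    replayed = recover(IngestJournal(journal.path), store)
    remaining = journal.pending()

print(replayed, remaining)  # ['m2'] {}
```

A production version would also need to tolerate a torn final line in the journal and to bound journal growth; both are omitted here to keep the sketch short.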

How we measure

Dataset: LongMemEval-s, 500 cases across 7 reasoning categories (single-session assistant recall, single-session user facts, single-session preferences, knowledge updates, temporal reasoning, multi-session synthesis, and abstention).
Protocol: the published evaluation prompts and grading rubric, used verbatim so the number is directly comparable.
Judge independence: answers are graded by a model from a different vendor than the one generating them — no self-grading inflation.
Scale: the full 500 cases, not a hand-picked slice. We re-run this on every meaningful change, so quality is measured, not assumed.
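The judge-independence and per-category reporting rules above can be expressed as a small scoring helper. This is a sketch under stated assumptions: `GradedCase`, `assert_independent`, `accuracy_report`, and the vendor names are hypothetical, the sample verdicts are toy data rather than benchmark results, and the overall figure is assumed to be computed per case rather than as a mean of category rates.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class GradedCase:
    category: str
    correct: bool  # verdict returned by the judge model


def assert_independent(generator_vendor: str, judge_vendor: str) -> None:
    """Refuse to score a run in which the answering model grades itself."""
    if generator_vendor.strip().lower() == judge_vendor.strip().lower():
        raise ValueError("judge must come from a different vendor than the generator")


def accuracy_report(cases: list[GradedCase]) -> dict[str, float]:
    """Per-category accuracy plus an overall figure computed over all cases."""
    if not cases:
        raise ValueError("no graded cases to report")
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.category] += 1
        hits[case.category] += int(case.correct)
    report = {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}
    report["Overall"] = 100.0 * sum(hits.values()) / len(cases)
    return report


# Toy verdicts for illustration only; these are not benchmark results.
sample = [
    GradedCase("Temporal Reasoning", True),
    GradedCase("Temporal Reasoning", True),
    GradedCase("Temporal Reasoning", True),
    GradedCase("Temporal Reasoning", False),
    GradedCase("Abstention", True),
]

assert_independent(generator_vendor="VendorA", judge_vendor="VendorB")
report = accuracy_report(sample)
print(report)
# {'Temporal Reasoning': 75.0, 'Abstention': 100.0, 'Overall': 80.0}
```

Note that the overall figure (80.0) differs from the simple mean of the two category rates (87.5), because categories with more cases weigh more heavily in a per-case average.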
