Memory your agents can trust, and prove
60db Smart Memory gives AI agents genuine long-term, relational, temporal recall. We measure it the hard way and publish the full results — independently judged, full test set, no cherry-picking.
LongMemEval-s · 500 cases · independent judge
Why this number is honest
Most memory benchmarks are self-reported and self-graded — the same vendor's model both answers and scores the answer, which inflates results. Our headline number is graded by a different vendor's model (an independent third-party judge), on the full 500-case LongMemEval-s set. It is the conservative number — and it's the real one. Cross-session aggregation is our active work area, and we'd rather tell you that than quietly omit it.
Full results — every category, no hiding
LongMemEval-s, 500 cases, independent judge. We publish the categories we lead on and the ones we are still improving.
| Category | 60db accuracy |
|---|---|
| Single-session — Assistant recall | 100.0% |
| Single-session — User facts | 95.6% |
| Single-session — Preference | 95.0% |
| Abstention (correctly saying "I don't know") | 97.3% |
| Knowledge Update (current value of a changed fact) | 90.2% |
| Temporal Reasoning (dates, durations, ordering) | 87.5% |
| Multi-session synthesis | 81.2% |
| Overall | 92.4% |
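As a quick arithmetic check, the overall figure equals the unweighted mean of the seven category rows above. The Python sketch below only re-derives that mean from the published percentages. It does not establish how the overall score was actually computed, and a case-weighted average would need per-category case counts, which this page does not list.

```python
# Category accuracies copied from the results table (percent).
category_accuracy = {
    "Single-session: Assistant recall": 100.0,
    "Single-session: User facts": 95.6,
    "Single-session: Preference": 95.0,
    "Abstention": 97.3,
    "Knowledge Update": 90.2,
    "Temporal Reasoning": 87.5,
    "Multi-session synthesis": 81.2,
}

# Unweighted (per-category) mean: every category counts equally,
# regardless of how many of the 500 cases it contains.
category_mean = sum(category_accuracy.values()) / len(category_accuracy)

print(f"mean of {len(category_accuracy)} categories: {category_mean:.1f}%")
```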
Fast enough to feel instant
Sub-200ms recall is purpose-built for real-time voice agents, not just chat. We got there by retiring an expensive query-expansion step that added ~575ms for no measurable accuracy gain (A/B-tested on 180 queries — identical recall quality, 4× faster). Speed without a quality trade.
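The "4× faster" figure follows from simple arithmetic. The sketch below assumes that the ~185 ms recall latency quoted in the comparison table further down is the latency after the query-expansion step was removed; the variable names are illustrative.

```python
# Back-of-the-envelope check of the speedup claim.
# Assumption: ~185 ms is the recall latency AFTER query expansion was retired.
recall_ms = 185.0            # current recall latency (comparison table)
query_expansion_ms = 575.0   # cost of the retired query-expansion step

latency_before_ms = recall_ms + query_expansion_ms
speedup = latency_before_ms / recall_ms

print(f"before: {latency_before_ms:.0f} ms, after: {recall_ms:.0f} ms, "
      f"speedup: {speedup:.1f}x")
```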
A real memory system, not a vector cache
A single query fuses semantic, graph, and temporal signals in parallel, then assembles exactly the context your model needs — not a wall of loosely-related chunks.
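The parallel fusion described above can be pictured with a short Python sketch. This is an illustration under stated assumptions, not 60db's implementation: the three retriever functions are hypothetical stubs returning canned results, and reciprocal rank fusion is swapped in as one common way to merge ranked lists, since the actual fusion method is not described here.

```python
import asyncio
from collections import defaultdict

# Hypothetical stand-ins for the three retrievers.
# Each returns memory IDs ordered best-first.
async def semantic_search(query: str) -> list[str]:
    return ["m3", "m1", "m7"]

async def graph_search(query: str) -> list[str]:
    return ["m1", "m4"]

async def temporal_search(query: str) -> list[str]:
    return ["m1", "m3", "m9"]

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists: each item scores sum(1 / (k + rank)) over all lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            scores[memory_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda m: (-scores[m], m))

async def recall(query: str, top_n: int = 3) -> list[str]:
    # Run the three retrievers concurrently instead of one after another.
    rankings = await asyncio.gather(
        semantic_search(query),
        graph_search(query),
        temporal_search(query),
    )
    return reciprocal_rank_fusion(list(rankings))[:top_n]

fused = asyncio.run(recall("where does Alice work now?"))
print(fused)  # ['m1', 'm3', 'm4']
```

With concurrent retrieval, total latency is bounded by the slowest retriever rather than the sum of all three, which is the usual motivation for fanning out in parallel.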
- Vector: dense similarity search over everything the agent has seen
- Graph: entities and relationships (who, what, and how things connect)
- Temporal: real-world event time vs ingestion time, with fact invalidation
- Current state: the latest value of a fact that changed over time
- Hierarchical: tiered summaries so long histories stay retrievable
- Timeline: append-only log of everything that happened
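The temporal and current-state layers can be pictured with a small bitemporal record. This is a hedged sketch, not 60db's schema: `Fact`, `FactStore`, and their fields are hypothetical names chosen for illustration. Each fact carries both an event time and an ingestion time, and a newer fact invalidates the older one in event-time order, even when facts arrive out of order.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class Fact:
    subject: str
    attribute: str
    value: str
    event_time: datetime                       # when it became true in the real world
    ingested_at: datetime                      # when the system learned about it
    invalidated_at: Optional[datetime] = None  # event time at which it stopped being true

class FactStore:
    def __init__(self) -> None:
        self.facts: list[Fact] = []

    def add(self, fact: Fact) -> None:
        self.facts.append(fact)
        # Re-derive validity intervals for this subject/attribute in event-time
        # order, so a late-arriving older fact cannot override a newer one.
        chain = sorted(
            (f for f in self.facts
             if (f.subject, f.attribute) == (fact.subject, fact.attribute)),
            key=lambda f: f.event_time,
        )
        for older, newer in zip(chain, chain[1:]):
            older.invalidated_at = newer.event_time
        chain[-1].invalidated_at = None

    def current(self, subject: str, attribute: str) -> Optional[Fact]:
        """Return the fact that has not been invalidated, if any."""
        for f in self.facts:
            if (f.subject, f.attribute) == (subject, attribute) and f.invalidated_at is None:
                return f
        return None

    def as_of(self, subject: str, attribute: str, when: datetime) -> Optional[Fact]:
        """Return the fact that was true in the real world at time `when`."""
        for f in self.facts:
            if (f.subject, f.attribute) != (subject, attribute):
                continue
            if f.event_time <= when and (f.invalidated_at is None or when < f.invalidated_at):
                return f
        return None

store = FactStore()
# Ingested out of order on purpose: the newer fact arrives first.
store.add(Fact("alice", "employer", "Globex",
               event_time=datetime(2024, 6, 1), ingested_at=datetime(2024, 6, 3)))
store.add(Fact("alice", "employer", "Acme",
               event_time=datetime(2023, 1, 10), ingested_at=datetime(2024, 6, 4)))

print(store.current("alice", "employer").value)                      # Globex
print(store.as_of("alice", "employer", datetime(2023, 12, 1)).value)  # Acme
```

Keeping ingestion time separate from event time is what lets the store answer both "what is true now" and "what was true then" without the answer depending on arrival order.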
How 60db compares
HydraDB and Zep are graph- and temporal-first; Mem0 is vector-first with a graph add-on. 60db runs the full breadth — vector, graph, temporal, hierarchical, and timeline — at sub-200ms, with first-party models and no third-party API in the retrieval path.
| Capability | 60db | HydraDB | Zep | Mem0 |
|---|---|---|---|---|
| Semantic vector recall | ||||
| Knowledge graph (entities & relations) | Add-on | |||
| Temporal model (event vs ingestion time) | Partial | |||
| Fact invalidation / current-state | Partial | Partial | ||
| Hierarchical summaries | ||||
| Event / episode timeline | ||||
| Real-time recall latency | ~185 ms | Sub-200 ms* | Sub-200 ms* | ~0.9–1.1 s* |
| First-party in-house models (no external API in retrieval path) | BYO / optional | BYO / local | ||
| Multi-tenant | | | | |
*As publicly documented as of mid-2026: HydraDB and Zep both market sub-200ms latency; Mem0 reports ~0.9–1.1s p50 retrieval at production scale. HydraDB self-reports ~90.8% on LongMemEval-s. Capabilities reflect each system's primary, generally-available offering and evolve over time.
HydraDB is architecturally the closest: graph-native, self-hostable, and sub-200ms. 60db adds hierarchical summaries and first-party in-house models (HydraDB's LLM parsing and fact extraction are optional/external), and our headline accuracy is independently judged: 92.4% on LongMemEval-s, where HydraDB self-reports ~90.8%.
Zep offers comparable graph and temporal depth and similar sub-200ms latency, but 60db's embeddings, extraction, and reranking are all first-party, so no external embedding or LLM API sits in the retrieval path. Zep leans on external models for extraction.
Compared with Mem0, you get a true knowledge graph, temporal reasoning, and hierarchical context as first-class layers rather than a vector store with optional graph add-ons, plus first-party in-house models and ~185ms recall, where Mem0 reports roughly one second at production scale.
Our quality number is independently judged and published category by category (see methodology) — graded by a different vendor's model, not self-graded.
Built for production
- Embeddings, reranking, and extraction are all first-party; nothing in the retrieval path depends on an external model vendor.
- No per-token fees to an external memory provider.
- A process restart never loses an in-flight memory; interrupted ingests self-heal automatically.
- One deployment cleanly serves many clients, teams, and users with isolated memory.
How we measure
- Dataset: LongMemEval-s, 500 cases across 7 reasoning categories (assistant recall, user facts, preferences, knowledge updates, temporal reasoning, multi-session synthesis, and abstention).
- Protocol: the published evaluation prompts and grading rubric, used verbatim so the number is directly comparable.
- Judge independence: answers are graded by a model from a different vendor than the one generating them — no self-grading inflation.
- Scale: the full 500 cases, not a hand-picked slice. We re-run this on every meaningful change, so quality is measured, not assumed.
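
One way to picture the scoring step is the Python sketch below. It is a minimal illustration, not 60db's evaluation harness: `GradedCase`, `check_judge_independence`, and `accuracy_report` are hypothetical names, the vendor strings are placeholders, and the four cases are toy data rather than real results. It reports both a case-weighted and a per-category average, since the two can differ when categories have unequal sizes.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class GradedCase:
    category: str
    correct: bool  # verdict returned by the judge model

def check_judge_independence(answer_vendor: str, judge_vendor: str) -> None:
    """Refuse to score a run where the judge and the answerer share a vendor."""
    if answer_vendor.strip().lower() == judge_vendor.strip().lower():
        raise ValueError("judge must come from a different vendor than the answering model")

def accuracy_report(cases: list[GradedCase]) -> dict[str, float]:
    if not cases:
        raise ValueError("no graded cases to report on")
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for case in cases:
        totals[case.category] += 1
        hits[case.category] += int(case.correct)
    report = {cat: 100.0 * hits[cat] / totals[cat] for cat in totals}
    per_category = list(report.values())
    # Case-weighted: every case counts equally, so large categories dominate.
    report["Overall (case-weighted)"] = 100.0 * sum(hits.values()) / len(cases)
    # Category mean: every category counts equally, regardless of size.
    report["Overall (category mean)"] = sum(per_category) / len(per_category)
    return report

# Placeholder vendor names; a real run would record the actual providers.
check_judge_independence(answer_vendor="vendor-a", judge_vendor="vendor-b")

# Toy data only, to show the shape of the computation.
toy_cases = [
    GradedCase("Temporal Reasoning", True),
    GradedCase("Temporal Reasoning", True),
    GradedCase("Temporal Reasoning", False),
    GradedCase("Abstention", True),
]
report = accuracy_report(toy_cases)
for name, value in report.items():
    print(f"{name}: {value:.1f}%")
```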
