Open standard · Vendor neutral · Reproducible

Can you trust your AI's context?

ContextBenchmark is the open standard for measuring the reliability, reproducibility, and stability of AI context systems.

Compare context engines — retrieval indexes, RAG pipelines, agent memory, code-context systems — with deterministic metrics, transparent methodology, and reproducible, fingerprint-verified results.

Open Source · MIT Licensed · Vendor Neutral · Reproducible · Cross-Platform Verified · Deterministic Metrics

AI is only as reliable as the context it receives.

Modern AI systems depend on context layers to retrieve knowledge, maintain memory, and ground decisions. Yet there is no standard way to measure whether those context systems are stable, reproducible, or trustworthy. ContextBenchmark fills that gap.

The benchmark intentionally measures the context layer, not the language model. LLM inference nondeterminism is a separate, explicitly out-of-scope problem.

Why AI needs a context benchmark

AI benchmarks measure model capability. They rarely measure the quality of the context supplied to those models. Yet context determines what an AI knows, remembers, retrieves — and ultimately how reliably it behaves. ContextBenchmark establishes the first open standard for evaluating context infrastructure independently of the language model itself.

Existing AI benchmarks | ContextBenchmark
Measure the model | Measures the context layer
Focus on reasoning and generation | Focuses on reproducibility and reliability
Vary with every model update | Designed to remain model-agnostic
Evaluate intelligence | Evaluates trust in context

What ContextBenchmark measures

Four test families, each answering one question a production team actually asks.

Rebuild Identity

Can the system recreate byte-identical artifacts from identical input? Independent fresh builds, artifact hash comparison.

metric: distinct-hash count over R builds
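
As a rough illustration of the metric above, the sketch below counts distinct artifact hashes over R builds. It assumes each fresh build has already been reduced to one hex digest string; the function names and the sample digests are hypothetical and are not taken from the benchmark's code.

```javascript
// Minimal sketch: distinct-hash count over R independent builds.
// ASSUMPTION: `buildHashes` holds one digest string per fresh build (hypothetical input shape).
function distinctHashCount(buildHashes) {
  return new Set(buildHashes).size;
}

// A count of exactly 1 means every build produced the same artifact bytes.
function isRebuildIdentical(buildHashes) {
  return buildHashes.length > 0 && distinctHashCount(buildHashes) === 1;
}

const stableBuilds = ["ab12", "ab12", "ab12"];   // made-up digests
const unstableBuilds = ["ab12", "cd34", "ab12"]; // made-up digests

console.log(distinctHashCount(stableBuilds));    // 1
console.log(distinctHashCount(unstableBuilds));  // 2
console.log(isRebuildIdentical(unstableBuilds)); // false
```

Comparing digests rather than raw bytes keeps the check cheap, at the cost of trusting the hash function to separate different artifacts.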

Query Stability

Does the same question return the same context every time — same files, same order?

metrics: Exact Match Rate · Jaccard@k · Kendall τ
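
The three metrics can be sketched as follows. This is an illustration under stated assumptions, not the benchmark's implementation: Exact Match Rate is taken against a single reference trial, Kendall τ is computed only over items present in both rankings, and result lists are assumed to contain no duplicates. All function names are hypothetical.

```javascript
// Fraction of trials whose ranked list equals the reference (same items, same order).
function exactMatchRate(reference, trials) {
  const same = (a, b) => a.length === b.length && a.every((x, i) => x === b[i]);
  return trials.filter((t) => same(reference, t)).length / trials.length;
}

// Set overlap of the top-k items of two rankings: |A ∩ B| / |A ∪ B|.
function jaccardAtK(a, b, k) {
  const setA = new Set(a.slice(0, k));
  const setB = new Set(b.slice(0, k));
  const intersection = [...setA].filter((x) => setB.has(x)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : intersection / union;
}

// Kendall tau over shared items: (concordant - discordant) / number of pairs.
// ASSUMPTION: items missing from either list are ignored.
function kendallTau(a, b) {
  const rankB = new Map(b.map((x, i) => [x, i]));
  const shared = a.filter((x) => rankB.has(x)); // kept in a's order
  const n = shared.length;
  if (n < 2) return 1;
  let concordant = 0;
  let discordant = 0;
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      if (rankB.get(shared[i]) < rankB.get(shared[j])) concordant++;
      else discordant++;
    }
  }
  return (concordant - discordant) / ((n * (n - 1)) / 2);
}

const firstRun = ["a.js", "b.js", "c.js"];
const secondRun = ["a.js", "c.js", "b.js"]; // same files, two swapped

console.log(exactMatchRate(firstRun, [firstRun, secondRun]));  // 0.5
console.log(jaccardAtK(firstRun, secondRun, 3));               // 1
console.log(kendallTau(firstRun, secondRun).toFixed(2));       // 0.33
```

The example shows why the three are reported together: the second run returns the same files (Jaccard@3 of 1) but in a different order, which only Exact Match Rate and Kendall τ detect.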

Drift Under Noise

Does adding one unrelated file change answers to unrelated questions?

metrics: drift score (1 − Jaccard vs base) · noise-in-top-k
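
A minimal sketch of the two drift metrics, assuming each query yields a ranked list of file names before and after the noise file is added. The function names, the file names, and the choice to average drift across queries are illustrative assumptions rather than the benchmark's documented behavior.

```javascript
// Jaccard overlap of the top-k items of two rankings.
function topKJaccard(a, b, k) {
  const setA = new Set(a.slice(0, k));
  const setB = new Set(b.slice(0, k));
  const intersection = [...setA].filter((x) => setB.has(x)).length;
  const union = new Set([...setA, ...setB]).size;
  return union === 0 ? 1 : intersection / union;
}

// Drift score for one query: 1 - Jaccard between base and noisy results.
function driftScore(baseResults, noisyResults, k) {
  return 1 - topKJaccard(baseResults, noisyResults, k);
}

// True when any injected noise file appears in the top-k of the noisy run.
function noiseInTopK(noisyResults, noiseFiles, k) {
  const noise = new Set(noiseFiles);
  return noisyResults.slice(0, k).some((f) => noise.has(f));
}

// ASSUMPTION: per-query drift is summarized as a plain mean.
function meanDrift(pairs, k) {
  const total = pairs.reduce((sum, p) => sum + driftScore(p.base, p.noisy, k), 0);
  return total / pairs.length;
}

const base = ["a.js", "b.js", "c.js", "d.js"];
const noisy = ["a.js", "b.js", "c.js", "unrelated.md"]; // noise displaced d.js

console.log(driftScore(base, noisy, 4).toFixed(2));    // 0.40
console.log(noiseInTopK(noisy, ["unrelated.md"], 4));  // true
console.log(driftScore(base, base, 4));                // 0
```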

Cross-Machine Identity

Do Windows, macOS, and Linux produce identical context? Verified by fingerprint exchange, not by trust.

metric: artifact + per-query result hash matches across OS pairs
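
The pairwise check could look roughly like the sketch below. The fingerprint shape used here (`os`, `artifactHash`, `queryHashes`) is a hypothetical stand-in; the actual layout of the repository's fingerprint files is not described in this document and may differ.

```javascript
// Compare two fingerprints and list what differs.
// ASSUMPTION: hypothetical shape { os, artifactHash, queryHashes: { [queryId]: hash } }.
function compareFingerprints(a, b) {
  const mismatches = [];
  if (a.artifactHash !== b.artifactHash) mismatches.push("artifact");
  const ids = new Set([...Object.keys(a.queryHashes), ...Object.keys(b.queryHashes)]);
  for (const id of ids) {
    if (a.queryHashes[id] !== b.queryHashes[id]) mismatches.push(`query:${id}`);
  }
  return mismatches;
}

// Run the comparison over every pair of operating systems.
function compareAllPairs(fingerprints) {
  const report = [];
  for (let i = 0; i < fingerprints.length; i++) {
    for (let j = i + 1; j < fingerprints.length; j++) {
      report.push({
        pair: `${fingerprints[i].os}-${fingerprints[j].os}`,
        mismatches: compareFingerprints(fingerprints[i], fingerprints[j]),
      });
    }
  }
  return report;
}

// Made-up fingerprints: windows disagrees on one query result.
const samples = [
  { os: "ubuntu", artifactHash: "aa", queryHashes: { q1: "11", q2: "22" } },
  { os: "windows", artifactHash: "aa", queryHashes: { q1: "11", q2: "99" } },
  { os: "macos", artifactHash: "aa", queryHashes: { q1: "11", q2: "22" } },
];

for (const row of compareAllPairs(samples)) {
  console.log(row.pair, row.mismatches.length === 0 ? "match" : row.mismatches.join(", "));
}
```

Cross-machine identity holds only when every pair reports an empty mismatch list.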

Future families under specification: agent safety, governance, memory integrity, determinism-under-incremental-update.

Context Trust Levels

Every implementation receives a Context Trust Level based on published benchmark results.

Level | Name | Requirement
CTL 4 | Cross-machine deterministic | CTL 3, plus identical artifacts and query results across operating systems, verified by fingerprint exchange
CTL 3 | Machine-deterministic | Byte-identical artifacts across rebuilds and exact-match query results across trials
CTL 2 | Stable retrieval | Artifact bytes differ, but ranked query results are identical every time
CTL 1 | Repeatable locally | Results not identical but rank-stable (Jaccard@k ≥ 0.9, τ ≥ 0.9)
CTL 0 | Non-repeatable | Below CTL 1
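
One way to read the level table as a decision rule is sketched below. The field names on the measurements object are hypothetical, and the handling of edge cases (for example, identical artifacts with unstable queries) is an assumption; the versioned methodology in the repository is the authority on level assignment.

```javascript
// Sketch of level assignment from measured results.
// ASSUMPTION: hypothetical input fields, one summary value per metric.
function contextTrustLevel(m) {
  const rebuildIdentical = m.distinctHashCount === 1; // byte-identical artifacts
  const queriesExact = m.exactMatchRate === 1;        // same ranked results every trial

  if (rebuildIdentical && queriesExact) {
    // CTL 4 additionally needs verified cross-OS identity.
    return m.crossMachineIdentical ? 4 : 3;
  }
  if (queriesExact) return 2; // artifacts differ, ranked results identical
  if (m.jaccardAtK >= 0.9 && m.kendallTau >= 0.9) return 1; // rank-stable only
  return 0;
}

const verifiedEverywhere = {
  distinctHashCount: 1, exactMatchRate: 1, jaccardAtK: 1, kendallTau: 1,
  crossMachineIdentical: true,
};
const crossPlatformPending = { ...verifiedEverywhere, crossMachineIdentical: false };
const rankStableOnly = {
  distinctHashCount: 3, exactMatchRate: 0.8, jaccardAtK: 0.95, kendallTau: 0.92,
  crossMachineIdentical: false,
};

console.log(contextTrustLevel(verifiedEverywhere));   // 4
console.log(contextTrustLevel(crossPlatformPending)); // 3
console.log(contextTrustLevel(rankStableOnly));       // 1
```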

No vendor can claim a level without publishing benchmark artifacts. Every claim is backed by a fingerprint anyone can independently verify — including ours.

Current results

micro-app reference corpus · fingerprints published in the repository · cross-platform verification runs publicly in CI on every push.

Engine | CTL | Rebuild | Query stability | Drift under noise | Cross-platform
BM25 (lexical reference) | CTL 4 | ✅ identical | ✅ EMR 1.0 | ⚠️ 0.04 (noise reached top-10 in 2/10 queries) | ✅ verified in CI: ubuntu · windows · macos
Spiderbrain (structural code-context engine) | CTL 3 | ✅ identical | ✅ EMR 1.0 | ✅ 0.00 (noise never surfaced) | Pending CI-runnable packaging (fingerprints published)
MiniLM embeddings (RAG reference) | Run pending; adapter shipped in the repository
Mem0 · Zep · Supermemory · LlamaIndex · vector stores | Awaiting vendor adapters; contribute one

The reference baseline reaching CTL 4 is the point: the bar is achievable with plain engineering. A system scoring below the free baseline has made a design choice, not hit a law of nature.

Designed for fair comparison

ContextBenchmark evaluates context infrastructure — not language models, and not marketing.

  • Adapter API (~40 lines per system)
  • Open datasets, committed and license-clean
  • Public methodology and metrics
  • Fingerprint verification for every claim
  • No benchmark-specific optimizations allowed
  • Reproducible runs on commodity hardware

Benchmark architecture

git clone https://github.com/aabhisrv/contextbenchmark && cd contextbenchmark
node contextbenchmark.mjs run --adapters bm25            # dependency-free baseline
node contextbenchmark.mjs run --adapters bm25,emb-minilm # + typical-RAG reference
node contextbenchmark.mjs compare A.fingerprint.json B.fingerprint.json

Built for vendors

Implement a lightweight adapter and benchmark your context system against the same transparent methodology used by every participant.

Honesty rules apply to everyone: production configuration only, no benchmark-only determinism flags, results disclosed with fingerprints. Read the adapter contract →

Research & methodology

ContextBenchmark builds on reproducible-systems research, retrieval evaluation, and software reproducibility practice — and introduces standardized measurements for deterministic AI context.

Methodology

Test-family definitions, trial counts, pass criteria, and level assignment — versioned in the repository.

Metrics

Exact Match Rate, Jaccard@k, and Kendall τ follow the conventions established for RAG reproducibility measurement (ReproRAG, arXiv:2509.18869).

Datasets

Committed, deterministic corpora with fixed query sets; pinned real-repository tiers planned.

Disclosure

A publishable result includes the report, fingerprints, versions, corpus identity, and machine spec. No fingerprint, no claim.

Versioning

Metric or family changes version the benchmark; levels are always cited with the benchmark version that produced them.

Related work

ContextBenchmark is distinct from model benchmarks and from academic context-accuracy evaluations (e.g. ContextBench, which is unaffiliated); its focus on reliability is complementary to both.

Open source

ContextBenchmark is community-driven. Every benchmark, adapter, dataset, metric, and result is publicly inspectable and reproducible.