Q: What are Context Trust Levels?

Context Trust Levels (CTL 0 to CTL 4) grade a context system on a scale from non-repeatable (CTL 0) to cross-machine deterministic (CTL 4). No vendor can claim a level without publishing verifiable benchmark fingerprints.

Q: How do I run ContextBenchmark?

Clone the MIT-licensed repository from GitHub, implement a ~40-line adapter for your system (or use the shipped adapters), and run node contextbenchmark.mjs run with the --adapters flag, for example --adapters bm25 for the dependency-free baseline. Results include a verifiable fingerprint that anyone can compare.

Open standard · Vendor neutral · Reproducible

Can you trust your AI's context?

Q: What is ContextBenchmark?

ContextBenchmark is an open, vendor-neutral benchmark that measures whether AI context systems (retrieval indexes, RAG pipelines, agent memory, code-context engines) produce stable, reproducible, and trustworthy results across builds, queries, environments, and noise.

Q: Does ContextBenchmark measure language models?

No. It deliberately measures the context layer only. LLM inference nondeterminism is a separate, explicitly out-of-scope problem with its own research literature.

ContextBenchmark is the open standard for measuring the reliability, reproducibility, and stability of AI context systems.

Compare context engines — retrieval indexes, RAG pipelines, agent memory, code-context systems — with deterministic metrics, transparent methodology, and reproducible, fingerprint-verified results.

Open Source · MIT Licensed · Vendor Neutral · Reproducible · Cross-Platform Verified · Deterministic Metrics

AI is only as reliable as the context it receives.

Modern AI systems depend on context layers to retrieve knowledge, maintain memory, and ground decisions. Yet there is no standard way to measure whether those context systems are stable, reproducible, or trustworthy. ContextBenchmark fills that gap.

LLM (out of scope — measured by model benchmarks)

CONTEXT LAYER ← measured by ContextBenchmark

files · memory · knowledge · retrieval

The benchmark intentionally measures the context layer, not the language model. LLM inference nondeterminism is a separate, explicitly out-of-scope problem.

Why AI needs a context benchmark

AI benchmarks measure model capability. They rarely measure the quality of the context supplied to those models. Yet context determines what an AI knows, remembers, retrieves — and ultimately how reliably it behaves. ContextBenchmark establishes the first open standard for evaluating context infrastructure independently of the language model itself.

Existing AI benchmarks	ContextBenchmark
Measure the model	Measures the context layer
Focus on reasoning and generation	Focuses on reproducibility and reliability
Vary with every model update	Designed to remain model-agnostic
Evaluate intelligence	Evaluates trust in context

What ContextBenchmark measures

Four test families, each answering one question a production team actually asks.

Rebuild Identity

Can the system recreate byte-identical artifacts from identical input? Measured by running independent fresh builds and comparing the hashes of the resulting artifacts.

metric: distinct-hash count over R builds
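
As a rough illustration of the distinct-hash count, the sketch below assumes each fresh build has already been reduced to a single hex digest string. The function name and the input shape are hypothetical and are not taken from the ContextBenchmark codebase.

```javascript
// Hypothetical helper, not part of ContextBenchmark itself.
// `hashes` is assumed to hold one artifact digest per fresh build (R entries).
function distinctHashCount(hashes) {
  // A Set keeps one copy of each digest, so its size is the distinct count.
  return new Set(hashes).size;
}

// Three builds that agree, then three builds where one diverges.
const stableBuilds = distinctHashCount(["ab12", "ab12", "ab12"]);
const divergentBuilds = distinctHashCount(["ab12", "ab12", "cd34"]);
console.log(stableBuilds, divergentBuilds); // 1 2
```

A count of 1 means every rebuild produced the same bytes; anything higher means at least one build diverged.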

Query Stability

Does the same question return the same context every time — same files, same order?

metrics: Exact Match Rate · Jaccard@k · Kendall τ
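
The three metrics can be sketched as small pure functions over ranked result lists. This is one common formulation, written under assumptions: results are arrays of unique file identifiers, Exact Match Rate compares every trial against the first trial, and Kendall τ is the tau-a variant computed over the items both lists share. The benchmark's versioned methodology is the authority on the exact definitions; all function names here are hypothetical.

```javascript
// Fraction of trials whose ranked list equals the first trial's list
// (same items, same order). The first trial counts as a match with itself.
function exactMatchRate(trials) {
  const reference = JSON.stringify(trials[0]);
  const matches = trials.filter((t) => JSON.stringify(t) === reference).length;
  return matches / trials.length;
}

// Set overlap of the top-k items of two ranked lists; order is ignored.
function jaccardAtK(a, b, k) {
  const setA = new Set(a.slice(0, k));
  const setB = new Set(b.slice(0, k));
  const union = new Set([...setA, ...setB]);
  if (union.size === 0) return 1; // two empty lists are treated as identical
  let intersection = 0;
  for (const item of setA) if (setB.has(item)) intersection++;
  return intersection / union.size;
}

// Kendall tau-a over the items present in both lists (assumes no duplicates).
function kendallTau(a, b) {
  const positionInB = new Map(b.map((item, i) => [item, i]));
  const shared = a.filter((item) => positionInB.has(item));
  const n = shared.length;
  if (n < 2) return 1; // assumption: too few shared items to disagree on order
  let concordant = 0;
  let discordant = 0;
  for (let i = 0; i < n; i++) {
    for (let j = i + 1; j < n; j++) {
      // `shared` is in a's order, so a ranks shared[i] above shared[j].
      if (positionInB.get(shared[i]) < positionInB.get(shared[j])) concordant++;
      else discordant++;
    }
  }
  return (concordant - discordant) / ((n * (n - 1)) / 2);
}

console.log(jaccardAtK(["a", "b", "c"], ["a", "b", "d"], 3)); // 0.5
console.log(kendallTau(["a", "b", "c"], ["c", "b", "a"]));    // -1
```

Jaccard@k ignores order and Kendall τ ignores membership differences, which is why the two are reported together alongside the stricter exact-match check.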

Drift Under Noise

Does adding one unrelated file change answers to unrelated questions?

metrics: drift score (1 − Jaccard vs base) · noise-in-top-k
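
A minimal sketch of the two drift metrics for a single query follows, assuming the base run and the noisy run each return a ranked array of file identifiers. The function names, the file names, and the choice of comparing top-k sets are illustrative assumptions; how per-query scores are aggregated into a reported figure is defined by the benchmark's methodology, not by this sketch.

```javascript
// Jaccard overlap of the top-k items of two ranked lists (order ignored).
function topKJaccard(a, b, k) {
  const setA = new Set(a.slice(0, k));
  const setB = new Set(b.slice(0, k));
  const union = new Set([...setA, ...setB]);
  if (union.size === 0) return 1; // two empty lists are treated as identical
  let intersection = 0;
  for (const item of setA) if (setB.has(item)) intersection++;
  return intersection / union.size;
}

// Drift score for one query: 1 minus the Jaccard overlap with the base run.
function driftScore(baseResults, noisyResults, k) {
  return 1 - topKJaccard(baseResults, noisyResults, k);
}

// True when the injected noise file appears in the top-k of the noisy run.
function noiseInTopK(noisyResults, noiseFile, k) {
  return noisyResults.slice(0, k).includes(noiseFile);
}

// Hypothetical query results before and after adding "noise.md" to the corpus.
const baseRun = ["auth.js", "db.js", "routes.js"];
const noisyRun = ["auth.js", "noise.md", "db.js"];
console.log(driftScore(baseRun, noisyRun, 3));      // 0.5
console.log(noiseInTopK(noisyRun, "noise.md", 3));  // true
```

A drift score of 0 means the unrelated file changed nothing for that query.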

Cross-Machine Identity

Do Windows, macOS, and Linux produce identical context? Verified by fingerprint exchange, not by trust.

metric: artifact + per-query result hash matches across OS pairs
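
The pairwise check can be pictured roughly as below. The fingerprint shape used here (an `artifactHash` string plus a `queryHashes` map) is a hypothetical stand-in, since the real fingerprint file format is defined in the repository; in practice the comparison is done with the `compare` subcommand of contextbenchmark.mjs.

```javascript
// Hypothetical fingerprint shape: { artifactHash, queryHashes: { [queryId]: hash } }.
// Returns a list of labels describing what differs between two fingerprints.
function compareFingerprints(a, b) {
  const mismatches = [];
  if (a.artifactHash !== b.artifactHash) mismatches.push("artifact");
  const queryIds = new Set([...Object.keys(a.queryHashes), ...Object.keys(b.queryHashes)]);
  for (const id of queryIds) {
    // A query missing on one side also counts as a mismatch.
    if (a.queryHashes[id] !== b.queryHashes[id]) mismatches.push(`query:${id}`);
  }
  return mismatches;
}

// Checks every OS pair and returns only the pairs that disagree.
function failingOsPairs(fingerprintsByOs) {
  const names = Object.keys(fingerprintsByOs);
  const failures = [];
  for (let i = 0; i < names.length; i++) {
    for (let j = i + 1; j < names.length; j++) {
      const mismatches = compareFingerprints(
        fingerprintsByOs[names[i]],
        fingerprintsByOs[names[j]]
      );
      if (mismatches.length > 0) failures.push({ pair: [names[i], names[j]], mismatches });
    }
  }
  return failures;
}

// Made-up digests: Windows disagrees with the other two on query q2.
const failures = failingOsPairs({
  linux: { artifactHash: "aa", queryHashes: { q1: "11", q2: "22" } },
  macos: { artifactHash: "aa", queryHashes: { q1: "11", q2: "22" } },
  windows: { artifactHash: "aa", queryHashes: { q1: "11", q2: "99" } },
});
console.log(failures.length); // 2
```

An empty result means every OS pair matched on both the artifact hash and every per-query hash.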

Future families under specification: agent safety, governance, memory integrity, determinism-under-incremental-update.

Context Trust Levels

Every implementation receives a Context Trust Level based on published benchmark results.

Level	Name	Requirement
CTL 4	Cross-machine deterministic	CTL 3, plus identical artifacts and query results across operating systems, verified by fingerprint exchange
CTL 3	Machine-deterministic	Byte-identical artifacts across rebuilds and exact-match query results across trials
CTL 2	Stable retrieval	Artifact bytes differ, but ranked query results are identical every time
CTL 1	Repeatable locally	Results not identical but rank-stable (Jaccard@k ≥ 0.9, τ ≥ 0.9)
CTL 0	Non-repeatable	Below CTL 1

No vendor can claim a level without publishing benchmark artifacts. Every claim is backed by a fingerprint anyone can independently verify — including ours.
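
Read as a decision procedure, the table above might be sketched like this. The input field names are hypothetical, and applying the 0.9 thresholds to the worst-case (minimum) Jaccard@k and τ values is an assumption; the versioned methodology in the repository defines the actual pass criteria.

```javascript
// Hypothetical summary of one engine's results:
//   distinctHashes   number of distinct artifact hashes across rebuilds
//   emr              Exact Match Rate across query trials (0..1)
//   minJaccard       lowest Jaccard@k seen across queries (assumed aggregation)
//   minTau           lowest Kendall tau seen across queries (assumed aggregation)
//   crossOsVerified  true if fingerprints matched across operating systems
function assignCtl(r) {
  const artifactsIdentical = r.distinctHashes === 1;
  const queriesIdentical = r.emr === 1;
  if (artifactsIdentical && queriesIdentical) {
    // CTL 3, upgraded to CTL 4 only with verified cross-OS fingerprints.
    return r.crossOsVerified ? 4 : 3;
  }
  if (queriesIdentical) return 2; // artifact bytes differ, ranked results identical
  if (r.minJaccard >= 0.9 && r.minTau >= 0.9) return 1; // rank-stable
  return 0;
}

console.log(assignCtl({ distinctHashes: 1, emr: 1, minJaccard: 1, minTau: 1, crossOsVerified: false })); // 3
console.log(assignCtl({ distinctHashes: 3, emr: 0.8, minJaccard: 0.95, minTau: 0.92, crossOsVerified: false })); // 1
```

Checking the strictest level first keeps each lower level's condition simple, because anything that reaches it has already failed the stronger requirement.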

Current results

micro-app reference corpus · fingerprints published in the repository · cross-platform verification runs publicly in CI on every push.

Engine	CTL	Rebuild	Query stability	Drift under noise	Cross-platform
BM25 (lexical reference)	CTL 4	✅ identical	✅ EMR 1.0	⚠️ 0.04 — noise reached top-10 in 2/10 queries	✅ verified in CI: ubuntu · windows · macos
Spiderbrain (structural code-context engine)	CTL 3	✅ identical	✅ EMR 1.0	✅ 0.00 — noise never surfaced	Pending CI-runnable packaging (fingerprints published)
MiniLM embeddings (RAG reference)	—	Run pending — adapter shipped in the repository
Mem0 · Zep · Supermemory · LlamaIndex · vector stores	—	Awaiting vendor adapters — contribute one

The reference baseline reaching CTL 4 is the point: the bar is achievable with plain engineering. A system scoring below the free baseline has made a design choice, not hit a law of nature.

Designed for fair comparison

ContextBenchmark evaluates context infrastructure — not language models, and not marketing.

Adapter API (~40 lines per system)
Open datasets, committed and license-clean
Public methodology and metrics
Fingerprint verification for every claim
No benchmark-specific optimizations allowed
Reproducible runs on commodity hardware

Benchmark architecture

Repository → Adapter → Context Engine → Queries → Metrics → Context Trust Level

git clone https://github.com/aabhisrv/contextbenchmark && cd contextbenchmark
node contextbenchmark.mjs run --adapters bm25            # dependency-free baseline
node contextbenchmark.mjs run --adapters bm25,emb-minilm # + typical-RAG reference
node contextbenchmark.mjs compare A.fingerprint.json B.fingerprint.json

Built for vendors

Implement a lightweight adapter and benchmark your context system against the same transparent methodology used by every participant.

Adapter → Run → Publish → Verify → Receive CTL

Honesty rules apply to everyone: production configuration only, no benchmark-only determinism flags, results disclosed with fingerprints.

Research & methodology

ContextBenchmark builds on reproducible-systems research, retrieval evaluation, and software reproducibility practice — and introduces standardized measurements for deterministic AI context.

Methodology

Test-family definitions, trial counts, pass criteria, and level assignment — versioned in the repository.

Metrics

Exact Match Rate, Jaccard@k, and Kendall τ follow the conventions established for RAG reproducibility measurement (ReproRAG, arXiv:2509.18869).

Datasets

Committed, deterministic corpora with fixed query sets; pinned real-repository tiers planned.

Disclosure

A publishable result includes the report, fingerprints, versions, corpus identity, and machine spec. No fingerprint, no claim.

Versioning

Metric or family changes version the benchmark; levels are always cited with the benchmark version that produced them.

Related work

Distinct from model benchmarks and from academic context-accuracy evaluation (e.g. ContextBench, unaffiliated); the reliability lane is complementary to both.

Open source

ContextBenchmark is community-driven. Every benchmark, adapter, dataset, metric, and result is publicly inspectable and reproducible.
