# Methodology

This page describes how AgentCTX benchmarks are measured to ensure reproducibility and statistical validity.
## Principles

- Measure, don’t estimate — all numbers come from instrumented test runs
- Control variables — same model, same prompts, same hardware
- Statistical significance — multiple runs with p50/p95/p99 percentiles
- Reproducible — benchmark scripts are in `bench/` and can be run by anyone
## Benchmark Runner

The benchmark suite lives in `bench/runner.ts`:

```sh
# Run all benchmarks
npx tsx bench/runner.ts

# Run a specific benchmark
npx tsx bench/runner.ts --suite parser
npx tsx bench/runner.ts --suite gateway
npx tsx bench/runner.ts --suite sidecar
```

## Token Measurement

### Baseline (Without AgentCTX)

- Configure an agent with raw MCP connections (no gateway)
- Run a standardized workload (tool discovery, calls, knowledge search)
- Count total tokens via API usage reports (OpenAI, Anthropic)
- Record input tokens, output tokens, and total cost
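The totals in step 3 can be accumulated from per-request usage reports. A minimal sketch, assuming a hypothetical `UsageReport` shape (actual field names differ between the OpenAI and Anthropic APIs):

```typescript
// Hypothetical per-request usage report; field names vary by provider.
interface UsageReport {
  inputTokens: number;
  outputTokens: number;
}

// Sum token counts across every request in a workload run.
function totalUsage(reports: UsageReport[]) {
  const inputTokens = reports.reduce((sum, r) => sum + r.inputTokens, 0);
  const outputTokens = reports.reduce((sum, r) => sum + r.outputTokens, 0);
  return { inputTokens, outputTokens, totalTokens: inputTokens + outputTokens };
}
```

Cost is then derived by applying the provider's per-token pricing to the input and output totals separately.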
### Treatment (With AgentCTX)

- Configure the same agent with AgentCTX gateway
- Run the identical workload
- Count tokens via the same API reports
- Record savings per category
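Per-category savings fall out of a direct comparison between the two runs. A sketch, where `tokenSavings` is an illustrative helper rather than part of the bench suite:

```typescript
// Percentage of tokens saved in one category, given baseline (raw MCP)
// and treatment (AgentCTX gateway) token counts for the same workload.
function tokenSavings(baseline: number, treatment: number): number {
  if (baseline === 0) return 0; // avoid division by zero for empty categories
  return ((baseline - treatment) / baseline) * 100;
}
```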
## Workload Definition

| Task | Description | Operations |
|---|---|---|
| Tool Discovery | Search and inspect 70 MCP tools | 70 ?t + 10 !t |
| Tool Execution | Call tools with various arguments | 150 >t |
| Knowledge Search | Search project documentation | 50 ?k |
| Memory Operations | Store and retrieve memories | 100 +m + 50 ?m |
| Total | All tasks combined | 430 operations |
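The table above can also be expressed as data for a runner to iterate. This sketch uses a hypothetical `WORKLOAD` constant (the operation codes follow the shorthand in the table) and checks the 430-operation total:

```typescript
// The benchmark workload from the table, as iterable data.
// Keys are operation codes: ?t/!t (tool search/inspect), >t (tool call),
// ?k (knowledge search), +m/?m (memory store/retrieve).
const WORKLOAD: { task: string; ops: Record<string, number> }[] = [
  { task: "Tool Discovery", ops: { "?t": 70, "!t": 10 } },
  { task: "Tool Execution", ops: { ">t": 150 } },
  { task: "Knowledge Search", ops: { "?k": 50 } },
  { task: "Memory Operations", ops: { "+m": 100, "?m": 50 } },
];

// Sum all operation counts across tasks.
const totalOps = WORKLOAD
  .flatMap((t) => Object.values(t.ops))
  .reduce((a, b) => a + b, 0);
```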
## Latency Measurement

- Instrumentation: `performance.now()` wrapped around each operation
- Warmup: 100 operations discarded before measurement
- Sample size: 1,000 operations per measurement point
- Percentiles: p50, p95, p99 computed from sorted samples
- Units: milliseconds (ms) for gateway, microseconds (μs) for parser/CTXB
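The protocol above can be sketched as follows; `measureLatency` and `percentile` are illustrative names, not the actual `bench/runner.ts` API (`performance` is a global in Node 16+):

```typescript
// Nearest-rank percentile from a pre-sorted array of samples.
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

// Run warmup iterations (discarded), then time each sample with
// performance.now() and report p50/p95/p99 from the sorted samples.
function measureLatency(op: () => void, warmup = 100, samples = 1000) {
  for (let i = 0; i < warmup; i++) op();
  const times: number[] = [];
  for (let i = 0; i < samples; i++) {
    const start = performance.now();
    op();
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  return {
    p50: percentile(times, 50),
    p95: percentile(times, 95),
    p99: percentile(times, 99),
  };
}
```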
## Throughput Measurement

- Method: Saturate a single thread with sequential operations
- Duration: 10 seconds per measurement
- Metric: Operations per second (ops/sec)
- Variants: TypeScript vs Rust native (same machine, same workload)
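A minimal sketch of the saturation loop; `measureThroughput` is an illustrative name, and the 10-second duration is parameterized so the same harness can run shorter smoke tests:

```typescript
// Run one operation back-to-back on a single thread for a fixed
// duration and report operations per second.
function measureThroughput(op: () => void, durationMs = 10_000): number {
  const start = performance.now();
  let count = 0;
  while (performance.now() - start < durationMs) {
    op();
    count++;
  }
  const elapsedSec = (performance.now() - start) / 1000;
  return count / elapsedSec;
}
```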
## Environment Control

All benchmarks run on the same hardware:
- CPU: AMD Ryzen 9 7950X, no thread pinning
- RAM: 64GB DDR5, no swap
- Storage: NVMe SSD (Samsung 990 Pro)
- OS: Ubuntu 22.04 (kernel 6.x) and Windows 11
- Isolated: no other significant processes running
## Reproducing Results

```sh
git clone https://github.com/ryan-haver/agentctx.git
cd agentctx
npm install
npm run build
npx tsx bench/runner.ts --output results.json
```

Results are written as JSON for analysis and comparison.
## See Also

- Token Savings — the results
- Performance Data — latency and throughput