# Methodology

This page describes how AgentCTX benchmarks are measured to ensure reproducibility and statistical validity.
## Principles

- Measure, don’t estimate — all numbers come from instrumented test runs
- Control variables — same model, same prompts, same hardware
- Statistical significance — multiple runs with p50/p95/p99 percentiles
- Reproducible — benchmark scripts are in `bench/` and can be run by anyone
## Benchmark Runner

The benchmark suite lives in `bench/runner.ts`:

```sh
# Run all benchmarks
npx tsx bench/runner.ts

# Run a specific benchmark
npx tsx bench/runner.ts --suite parser
npx tsx bench/runner.ts --suite gateway
npx tsx bench/runner.ts --suite sidecar
```

## Token Measurement

### Baseline (Without AgentCTX)

- Configure an agent with raw MCP connections (no gateway)
- Run a standardized workload (tool discovery, calls, knowledge search)
- Count total tokens via API usage reports (OpenAI, Anthropic)
- Record input tokens, output tokens, and total cost
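The totals in step 3 can be accumulated from per-request usage reports. A minimal sketch, assuming a hypothetical `UsageReport` shape (actual field names differ between the OpenAI and Anthropic APIs):

```typescript
// Hypothetical per-request usage report; field names vary by provider.
interface UsageReport {
  inputTokens: number;
  outputTokens: number;
}

// Sum token counts across every request in a workload run.
function totalUsage(reports: UsageReport[]) {
  const inputTokens = reports.reduce((sum, r) => sum + r.inputTokens, 0);
  const outputTokens = reports.reduce((sum, r) => sum + r.outputTokens, 0);
  return { inputTokens, outputTokens, totalTokens: inputTokens + outputTokens };
}
```

Cost is then derived by applying the provider's per-token pricing to the input and output totals separately.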
### Treatment (With AgentCTX)

- Configure the same agent with AgentCTX gateway
- Run the identical workload
- Count tokens via the same API reports
- Record savings per category
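Per-category savings fall out of a direct comparison between the two runs. A sketch, where `tokenSavings` is an illustrative helper rather than part of the bench suite:

```typescript
// Percentage of tokens saved in one category, given baseline (raw MCP)
// and treatment (AgentCTX gateway) token counts for the same workload.
function tokenSavings(baseline: number, treatment: number): number {
  if (baseline === 0) return 0; // avoid division by zero for empty categories
  return ((baseline - treatment) / baseline) * 100;
}
```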
## Workload Definition

| Task | Description | Operations |
|---|---|---|
| Tool Discovery | Search and inspect 70 MCP tools | 70 ?t + 10 !t |
| Tool Execution | Call tools with various arguments | 150 >t |
| Knowledge Search | Search project documentation | 50 ?k |
| Memory Operations | Store and retrieve memories | 100 +m + 50 ?m |
| Total | All tasks combined | 430 operations |
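The table above can also be expressed as data for a runner to iterate. This sketch uses a hypothetical `WORKLOAD` constant (the operation codes follow the shorthand in the table) and checks the 430-operation total:

```typescript
// The benchmark workload from the table, as iterable data.
// Keys are operation codes: ?t/!t (tool search/inspect), >t (tool call),
// ?k (knowledge search), +m/?m (memory store/retrieve).
const WORKLOAD: { task: string; ops: Record<string, number> }[] = [
  { task: "Tool Discovery", ops: { "?t": 70, "!t": 10 } },
  { task: "Tool Execution", ops: { ">t": 150 } },
  { task: "Knowledge Search", ops: { "?k": 50 } },
  { task: "Memory Operations", ops: { "+m": 100, "?m": 50 } },
];

// Sum all operation counts across tasks.
const totalOps = WORKLOAD
  .flatMap((t) => Object.values(t.ops))
  .reduce((a, b) => a + b, 0);
```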
## Latency Measurement

- Instrumentation: `performance.now()` wrapped around each operation
- Warmup: 100 operations discarded before measurement
- Sample size: 1,000 operations per measurement point
- Percentiles: p50, p95, p99 computed from sorted samples
- Units: milliseconds (ms) for gateway, microseconds (μs) for parser/CTXB
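The protocol above can be sketched as follows; `measureLatency` and `percentile` are illustrative names, not the actual `bench/runner.ts` API (`performance` is a global in Node 16+):

```typescript
// Nearest-rank percentile from a pre-sorted array of samples.
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.floor((p / 100) * sorted.length));
  return sorted[idx];
}

// Run warmup iterations (discarded), then time each sample with
// performance.now() and report p50/p95/p99 from the sorted samples.
function measureLatency(op: () => void, warmup = 100, samples = 1000) {
  for (let i = 0; i < warmup; i++) op();
  const times: number[] = [];
  for (let i = 0; i < samples; i++) {
    const start = performance.now();
    op();
    times.push(performance.now() - start);
  }
  times.sort((a, b) => a - b);
  return {
    p50: percentile(times, 50),
    p95: percentile(times, 95),
    p99: percentile(times, 99),
  };
}
```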
## Throughput Measurement

- Method: Saturate a single thread with sequential operations
- Duration: 10 seconds per measurement
- Metric: Operations per second (ops/sec)
- Variants: TypeScript vs Rust native (same machine, same workload)
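A minimal sketch of the saturation loop; `measureThroughput` is an illustrative name, and the 10-second duration is parameterized so the same harness can run shorter smoke tests:

```typescript
// Run one operation back-to-back on a single thread for a fixed
// duration and report operations per second.
function measureThroughput(op: () => void, durationMs = 10_000): number {
  const start = performance.now();
  let count = 0;
  while (performance.now() - start < durationMs) {
    op();
    count++;
  }
  const elapsedSec = (performance.now() - start) / 1000;
  return count / elapsedSec;
}
```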
## Environment Control

All benchmarks run on the same hardware:
- CPU: AMD Ryzen 9 7950X, no thread pinning
- RAM: 64GB DDR5, no swap
- Storage: NVMe SSD (Samsung 990 Pro)
- OS: Ubuntu 22.04 (kernel 6.x) and Windows 11
- Isolated: no other significant processes running
## Reproducing Results

```sh
git clone https://github.com/ryan-haver/agentctx.git
cd agentctx
npm install
npm run build
npx tsx bench/runner.ts --output results.json
```

Results are written as JSON for analysis and comparison.
## See Also

- Token Savings — the results
- Performance Data — latency and throughput