Methodology
The CSI combines three independent dimensions of AI model performance into a single efficiency metric. Each dimension is measured empirically on every benchmark run. The theoretical foundation is described in The Copernicus Problem (O’Brien, 2026).
Three Dimensions
1. Capability (Score)
Each model is evaluated on 30 standardized tasks spanning five domains: reasoning (8 tasks), coding (8 tasks), applied knowledge (8 tasks), analysis (4 tasks), and instruction following (2 tasks). Every task is scored on a 0–1 scale using deterministic, reproducible scoring functions applied identically across all models.
2. Speed (Latency)
Wall-clock latency in seconds from request submission to full response receipt. Measured via non-streaming API calls to isolate total inference time. All calls originate from the same network location.
3. Cost
Dollar cost per request, computed from each provider’s published per-token pricing: input tokens multiplied by the input rate plus output tokens multiplied by the output rate, using the token counts reported in the API response metadata.
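A minimal sketch of the cost calculation, assuming hypothetical per-million-token rates and token counts (real values come from each provider’s published pricing and the API response metadata):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate_per_mtok: float, output_rate_per_mtok: float) -> float:
    """Dollar cost of one request from token counts and per-million-token rates."""
    return (input_tokens * input_rate_per_mtok + output_tokens * output_rate_per_mtok) / 1_000_000

# Hypothetical token counts and rates for illustration only, not any provider's actual pricing.
cost = request_cost(input_tokens=850, output_tokens=420,
                    input_rate_per_mtok=3.00, output_rate_per_mtok=15.00)  # ≈ $0.00885
```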
Index Formulas
For each model m, we compute three ratios:
-- Capability per Second (raw ratio)
CS(m) = avg_score / avg_latency
-- Capability per Dollar (raw ratio)
CD(m) = avg_score / avg_cost
-- Capability-Seconds Index (log-compressed composite)
CSI(m) = ln(avg_score / (avg_latency × avg_cost))
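A minimal sketch of the per-model computation, assuming the per-task results have already been averaged into avg_score, avg_latency, and avg_cost:

```python
import math

def model_metrics(avg_score: float, avg_latency: float, avg_cost: float) -> dict[str, float]:
    """Per-model metrics: raw ratios CS and CD, plus the log-compressed composite CSI."""
    return {
        "CS": avg_score / avg_latency,                          # capability per second
        "CD": avg_score / avg_cost,                             # capability per dollar
        "CSI": math.log(avg_score / (avg_latency * avg_cost)),  # natural-log composite
    }
```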
The aggregate index is the median across all tracked models:
CSI_aggregate = median( CSI(m) over all tracked models m )
Using the median rather than the mean prevents any single outlier model (very cheap or very expensive) from distorting the aggregate.
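A sketch of the aggregation step, assuming a mapping of per-model CSI values (the numbers here are hypothetical):

```python
from statistics import median

# Hypothetical per-model CSI values for illustration only.
csi_by_model = {"model_a": 4.2, "model_b": 5.1, "model_c": 6.9}

csi_aggregate = median(csi_by_model.values())  # 5.1; a single outlier cannot pull the median far
```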
Task Set
Reasoning (R1–R8)
- R1: Sheep riddle — exact-match scoring
- R2: Number sequence — exact-match scoring
- R3: Widget production — exact-match scoring
- R4: Fermi estimation — graded rubric
- R5: Logic puzzle (houses) — exact-match scoring
- R6: Bat and ball — exact-match scoring
- R7: Water jug problem — graded rubric
- R8: Hotel paradox — graded rubric
Coding (C1–C8)
- C1: Palindrome function — code correctness checks
- C2: Merge sorted lists — algorithm correctness checks
- C3: SQL query — syntax and logic element checks
- C4: Debug factorial — bug identification checks
- C5: Merge intervals — graded rubric
- C6: Rate limiter class — graded rubric
- C7: Expression evaluator — graded rubric
- C8: Transform pipeline — graded rubric
Applied (A1–A8)
- A1: Net income calculation — exact-match scoring
- A2: Medical diagnosis — graded rubric
- A3: Merger vs. acquisition — graded rubric
- A4: JSON extraction — per-field accuracy scoring
- A5: DTC e-commerce analysis — graded rubric
- A6: Clinical differential diagnosis — graded rubric
- A7: CRE property comparison — graded rubric
- A8: SaaS M&A advisory — graded rubric
Analysis (AN1–AN4)
- AN1: Financial statement analysis — graded rubric
- AN2: SaaS metrics interpretation — graded rubric
- AN3: Multi-method valuation — graded rubric
- AN4: Contrarian argument — graded rubric
Instruction Following (IF1–IF2)
- IF1: Structured table generation — graded rubric
- IF2: Constrained product description — graded rubric
The task set (30 tasks) is designed for breadth across reasoning, coding, applied, analysis, and instruction-following domains. Capability scores serve as a quality floor ensuring that only models producing substantively correct outputs receive high marks. The primary signal in CSI is cost deflation across providers.
Scoring Functions
Four scoring modes are used, chosen per task:
- exact_match — Extracts numbers from the response; scores 1.0 if the expected value appears, 0.0 otherwise.
- code — Checks for required code elements (function names, keywords, patterns); scores 1.0 if ≥80% are present, 0.5 if ≥50%, 0.0 otherwise.
- graded — Checks for conceptual elements from a rubric; uses the same threshold logic as code scoring.
- json_extract — Parses JSON from the response; scores proportionally by the number of correctly extracted fields.
All scoring is deterministic and applied identically across models. The same response always receives the same score regardless of which model produced it.
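An illustrative sketch of the threshold logic shared by the code and graded modes; the required-element list and sample output below are hypothetical, not the actual rubrics:

```python
def threshold_score(response: str, required_elements: list[str]) -> float:
    """Fraction-of-elements scoring used by the code and graded modes:
    >= 80% of required elements present -> 1.0, >= 50% -> 0.5, otherwise 0.0."""
    text = response.lower()
    found = sum(1 for element in required_elements if element.lower() in text)
    fraction = found / len(required_elements)
    if fraction >= 0.8:
        return 1.0
    if fraction >= 0.5:
        return 0.5
    return 0.0

# Hypothetical rubric for a coding task; the real element lists are task-specific.
sample_output = "def is_palindrome(s):\n    return s == s[::-1]"
print(threshold_score(sample_output, ["def is_palindrome", "return", "[::-1]"]))  # 1.0
```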
Measurement Protocol
- Load the standardized prompt for the task.
- Record wall-clock start time.
- Send prompt to the model’s API (non-streaming).
- Record wall-clock end time; compute latency.
- Extract token counts from API response metadata.
- Look up per-token pricing; compute cost.
- Score the response with the task’s scoring function.
- Store all raw data (response text, tokens, latency, score, cost) in the database.
A 2-second delay is inserted between API calls to avoid rate limiting. Failed calls are logged and skipped without aborting the run.
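A condensed sketch of one evaluation pass over the protocol above, with hypothetical helpers (call_model, log_failure, lookup_pricing, score_response, store_result) standing in for the real pipeline components:

```python
import time

def run_evaluation(model: str, tasks: list[dict]) -> None:
    """One evaluation pass: per task, measure latency, compute cost, score, and store."""
    for task in tasks:
        start = time.monotonic()
        try:
            reply = call_model(model, task["prompt"], stream=False)  # non-streaming API call
        except Exception as err:
            log_failure(model, task["id"], err)  # failed calls are logged and skipped
            continue
        latency = time.monotonic() - start

        in_rate, out_rate = lookup_pricing(model)            # published per-token rates
        cost = reply.input_tokens * in_rate + reply.output_tokens * out_rate
        score = score_response(task, reply.text)             # task's deterministic scoring function

        store_result(model, task["id"], reply.text, reply.input_tokens,
                     reply.output_tokens, latency, score, cost)
        time.sleep(2)  # pause between calls to avoid rate limiting
```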
Changelog
- v2.1 (March 23, 2026): CS and CD reverted to raw ratios (score/latency, score/cost) per the formal framework in the paper (Appendix A). Only the composite CSI retains the natural log transform for index compression. Rankings unchanged.
- v2.0 (March 23, 2026): CSI values now expressed on natural log scale. CSI = ln(Score / (Latency × Cost)). Compresses index range from ~180× to ~6 points. Movements are now interpretable: +0.69 points = 2× improvement. Rankings unchanged. Historical values convert via ln(old_value).
- v1.1.1 (March 19, 2026): No data collected for March 18 due to pipeline authentication failure (missing API key in CI environment). Issue resolved same day. No backfill attempted — CSI reports only data collected during live evaluation windows.
- v1.1 (March 17, 2026): Expanded model universe to 16 models. Added Claude Haiku 4.5, Grok 3, DeepSeek V3.2, DeepSeek R1, Cohere Command A, Cohere Command R+, Nemotron Super 49B, and Qwen 2.5 72B.
- v1.0 (March 16, 2026): Initial release. 8 models, 12 tasks, 3 scoring domains. Median-based aggregate. Daily evaluation cadence.