Methodology
The CSI combines three independent dimensions of AI model performance into a single efficiency metric. Each dimension is measured empirically on every benchmark run. The theoretical foundation is described in The Copernicus Problem (O’Brien, 2026).
Three Dimensions
1. Capability (Score)
Each model is evaluated on 30 standardized tasks spanning five domains: reasoning (8 tasks), coding (8 tasks), applied knowledge (8 tasks), analysis (4 tasks), and instruction following (2 tasks). Every task is scored on a 0–1 scale using deterministic, reproducible scoring functions applied identically across all models.
2. Speed (Latency)
Wall-clock latency in seconds from request submission to full response receipt. Measured via non-streaming API calls to isolate total inference time. All calls originate from the same network location.
3. Cost
Dollar cost per request, computed from each provider’s published per-token pricing: input tokens multiplied by the input rate plus output tokens multiplied by the output rate, using the token counts reported in the API response metadata.
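A minimal sketch of the cost calculation, assuming hypothetical per-million-token rates and token counts (real values come from each provider’s published pricing and the API response metadata):

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_rate_per_mtok: float, output_rate_per_mtok: float) -> float:
    """Dollar cost of one request from token counts and per-million-token rates."""
    return (input_tokens * input_rate_per_mtok + output_tokens * output_rate_per_mtok) / 1_000_000

# Hypothetical token counts and rates for illustration only, not any provider's actual pricing.
cost = request_cost(input_tokens=850, output_tokens=420,
                    input_rate_per_mtok=3.00, output_rate_per_mtok=15.00)  # ≈ $0.00885
```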
Index Formulas
For each model m, we compute three ratios:
-- Capability per Second (raw ratio)
CS(m) = avg_score / avg_latency
-- Capability per Dollar (raw ratio)
CD(m) = avg_score / avg_cost
-- Capability-Seconds Index (log-compressed composite)
CSI(m) = ln(avg_score / (avg_latency × avg_cost))
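A minimal sketch of the per-model computation, assuming the per-task results have already been averaged into avg_score, avg_latency, and avg_cost:

```python
import math

def model_metrics(avg_score: float, avg_latency: float, avg_cost: float) -> dict[str, float]:
    """Per-model metrics: raw ratios CS and CD, plus the log-compressed composite CSI."""
    return {
        "CS": avg_score / avg_latency,                          # capability per second
        "CD": avg_score / avg_cost,                             # capability per dollar
        "CSI": math.log(avg_score / (avg_latency * avg_cost)),  # natural-log composite
    }
```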
The aggregate index is the median across all tracked models:
CSI_aggregate = median( CSI(m) over all tracked models m )
Using the median rather than the mean prevents any single outlier model (very cheap or very expensive) from distorting the aggregate.
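A sketch of the aggregation step, assuming a mapping of per-model CSI values (the numbers here are hypothetical):

```python
from statistics import median

# Hypothetical per-model CSI values for illustration only.
csi_by_model = {"model_a": 4.2, "model_b": 5.1, "model_c": 6.9}

csi_aggregate = median(csi_by_model.values())  # 5.1; a single outlier cannot pull the median far
```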
Task Set
Reasoning (R1–R8)
- R1: Sheep riddle — exact-match scoring
- R2: Number sequence — exact-match scoring
- R3: Widget production — exact-match scoring
- R4: Fermi estimation — graded rubric
- R5: Logic puzzle (houses) — exact-match scoring
- R6: Bat and ball — exact-match scoring
- R7: Water jug problem — graded rubric
- R8: Hotel paradox — graded rubric
Coding (C1–C8)
- C1: Palindrome function — code correctness checks
- C2: Merge sorted lists — algorithm correctness checks
- C3: SQL query — syntax and logic element checks
- C4: Debug factorial — bug identification checks
- C5: Merge intervals — graded rubric
- C6: Rate limiter class — graded rubric
- C7: Expression evaluator — graded rubric
- C8: Transform pipeline — graded rubric
Applied (A1–A8)
- A1: Net income calculation — exact-match scoring
- A2: Medical diagnosis — graded rubric
- A3: Merger vs. acquisition — graded rubric
- A4: JSON extraction — per-field accuracy scoring
- A5: DTC e-commerce analysis — graded rubric
- A6: Clinical differential diagnosis — graded rubric
- A7: CRE property comparison — graded rubric
- A8: SaaS M&A advisory — graded rubric
Analysis (AN1–AN4)
- AN1: Financial statement analysis — graded rubric
- AN2: SaaS metrics interpretation — graded rubric
- AN3: Multi-method valuation — graded rubric
- AN4: Contrarian argument — graded rubric
Instruction Following (IF1–IF2)
- IF1: Structured table generation — graded rubric
- IF2: Constrained product description — graded rubric
The task set (30 tasks) is designed for breadth across reasoning, coding, applied, analysis, and instruction-following domains. Capability scores serve as a quality floor ensuring that only models producing substantively correct outputs receive high marks. The primary signal in CSI is cost deflation across providers.
Scoring Functions
Four scoring modes are used, chosen per task:
- exact_match — Extracts numbers from the response; scores 1.0 if the expected value appears, 0.0 otherwise.
- code — Checks for required code elements (function names, keywords, patterns); scores 1.0 if ≥80% are present, 0.5 if ≥50%, 0.0 otherwise.
- graded — Checks for conceptual elements from a rubric; uses the same threshold logic as code scoring.
- json_extract — Parses JSON from the response; scores proportionally by the number of correctly extracted fields.
All scoring is deterministic and applied identically across models. The same response always receives the same score regardless of which model produced it.
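An illustrative sketch of the threshold logic shared by the code and graded modes; the required-element list and sample output below are hypothetical, not the actual rubrics:

```python
def threshold_score(response: str, required_elements: list[str]) -> float:
    """Fraction-of-elements scoring used by the code and graded modes:
    >= 80% of required elements present -> 1.0, >= 50% -> 0.5, otherwise 0.0."""
    text = response.lower()
    found = sum(1 for element in required_elements if element.lower() in text)
    fraction = found / len(required_elements)
    if fraction >= 0.8:
        return 1.0
    if fraction >= 0.5:
        return 0.5
    return 0.0

# Hypothetical rubric for a coding task; the real element lists are task-specific.
sample_output = "def is_palindrome(s):\n    return s == s[::-1]"
print(threshold_score(sample_output, ["def is_palindrome", "return", "[::-1]"]))  # 1.0
```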
Measurement Protocol
- Load the standardized prompt for the task.
- Record wall-clock start time.
- Send prompt to the model’s API (non-streaming).
- Record wall-clock end time; compute latency.
- Extract token counts from API response metadata.
- Look up per-token pricing; compute cost.
- Score the response with the task’s scoring function.
- Store all raw data (response text, tokens, latency, score, cost) in the database.
A 2-second delay is inserted between API calls to avoid rate limiting. Failed calls are logged and skipped without aborting the run.
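A condensed sketch of one evaluation pass over the protocol above, with hypothetical helpers (call_model, log_failure, lookup_pricing, score_response, store_result) standing in for the real pipeline components:

```python
import time

def run_evaluation(model: str, tasks: list[dict]) -> None:
    """One evaluation pass: per task, measure latency, compute cost, score, and store."""
    for task in tasks:
        start = time.monotonic()
        try:
            reply = call_model(model, task["prompt"], stream=False)  # non-streaming API call
        except Exception as err:
            log_failure(model, task["id"], err)  # failed calls are logged and skipped
            continue
        latency = time.monotonic() - start

        in_rate, out_rate = lookup_pricing(model)            # published per-token rates
        cost = reply.input_tokens * in_rate + reply.output_tokens * out_rate
        score = score_response(task, reply.text)             # task's deterministic scoring function

        store_result(model, task["id"], reply.text, reply.input_tokens,
                     reply.output_tokens, latency, score, cost)
        time.sleep(2)  # pause between calls to avoid rate limiting
```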
Changelog
- v2.1 (March 23, 2026): CS and CD reverted to raw ratios (score/latency, score/cost) per the formal framework in the paper (Appendix A). Only the composite CSI retains the natural log transform for index compression. Rankings unchanged.
- v2.0 (March 23, 2026): CSI values now expressed on natural log scale. CSI = ln(Score / (Latency × Cost)). Compresses index range from ~180× to ~6 points. Movements are now interpretable: +0.69 points = 2× improvement. Rankings unchanged. Historical values convert via ln(old_value).
- v1.1.1 (March 19, 2026): No data collected for March 18 due to pipeline authentication failure (missing API key in CI environment). Issue resolved same day. No backfill attempted — CSI reports only data collected during live evaluation windows.
- v1.1 (March 17, 2026): Expanded model universe to 16 models. Added Claude Haiku 4.5, Grok 3, DeepSeek V3.2, DeepSeek R1, Cohere Command A, Cohere Command R+, Nemotron Super 49B, and Qwen 2.5 72B.
- v1.0 (March 16, 2026): Initial release. 8 models, 12 tasks, 3 scoring domains. Median-based aggregate. Daily evaluation cadence.