Methodology

Version 2.1 — March 2026

The CSI combines three independent dimensions of AI model performance into a single efficiency metric. Each dimension is measured empirically on every benchmark run. The theoretical foundation is described in The Copernicus Problem (O’Brien, 2026).

Three Dimensions

1. Capability (Score)

Each model is evaluated on 30 standardized tasks spanning five domains: reasoning (8 tasks), coding (8 tasks), applied knowledge (8 tasks), analysis (4 tasks), and instruction following (2 tasks). Every task is scored on a 0–1 scale using deterministic, reproducible scoring functions applied identically across all models.

2. Speed (Latency)

Wall-clock latency in seconds from request submission to full response receipt. Calls are made non-streaming, so latency reflects full generation time rather than time to first token. All calls originate from the same network location.

3. Cost

Dollar cost per request, computed from each provider’s published pricing, quoted in dollars per million tokens:

cost = (prompt_tokens × input_price / 1M) + (completion_tokens × output_price / 1M)
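
A minimal sketch of this calculation (the function and variable names are illustrative, not part of the benchmark code):

  def request_cost(prompt_tokens: int, completion_tokens: int,
                   input_price_per_m: float, output_price_per_m: float) -> float:
      # Prices are quoted in dollars per one million tokens.
      return (prompt_tokens * input_price_per_m
              + completion_tokens * output_price_per_m) / 1_000_000

  # Example: 1,200 prompt tokens at $3/1M plus 450 completion tokens at $15/1M
  # gives (1200 * 3 + 450 * 15) / 1e6 = $0.01035 per request.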

Index Formulas

For each model m, we compute three ratios:

-- Capability per Second (raw ratio)
CS(m) = avg_score / avg_latency

-- Capability per Dollar (raw ratio)
CD(m) = avg_score / avg_cost

-- Capability-Seconds Index (log-compressed composite)
CSI(m) = ln(avg_score / (avg_latency × avg_cost))

The aggregate index is the median across all tracked models:

CSI = median( CSI(m1), CSI(m2), …, CSI(mn) )

Using the median rather than the mean prevents any single outlier model (very cheap or very expensive) from distorting the aggregate.
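
These formulas translate directly into code. A sketch, assuming each model’s averages have already been computed (the model names and numbers below are illustrative):

  from math import log
  from statistics import median

  def cs(avg_score: float, avg_latency: float) -> float:
      return avg_score / avg_latency    # Capability per Second

  def cd(avg_score: float, avg_cost: float) -> float:
      return avg_score / avg_cost       # Capability per Dollar

  def csi(avg_score: float, avg_latency: float, avg_cost: float) -> float:
      # Natural-log compression keeps very cheap models from dominating.
      return log(avg_score / (avg_latency * avg_cost))

  # Aggregate index: median of per-model CSI values (illustrative inputs).
  per_model = {"model_a": (0.82, 3.1, 0.0040), "model_b": (0.75, 1.9, 0.0011)}
  aggregate_csi = median(csi(*v) for v in per_model.values())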

Task Set

Reasoning (R1–R8)
Coding (C1–C8)
Applied (A1–A8)
Analysis (AN1–AN4)
Instruction Following (IF1–IF2)

The 30-task set is designed for breadth across the five domains rather than depth in any one of them. Capability scores act as a quality floor: only models that produce substantively correct outputs can score highly. The primary signal in CSI, however, is cost deflation across providers.
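
For reference, the task identifiers can be expressed as a simple registry (a hypothetical structure, not the benchmark’s actual code):

  # Hypothetical registry mirroring the ID ranges listed above.
  TASKS = {
      "reasoning":             [f"R{i}" for i in range(1, 9)],
      "coding":                [f"C{i}" for i in range(1, 9)],
      "applied":               [f"A{i}" for i in range(1, 9)],
      "analysis":              [f"AN{i}" for i in range(1, 5)],
      "instruction_following": [f"IF{i}" for i in range(1, 3)],
  }
  assert sum(len(ids) for ids in TASKS.values()) == 30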

Scoring Functions

Four scoring modes are used, chosen per task.

All scoring is deterministic and applied identically across models. The same response always receives the same score regardless of which model produced it.
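
As an illustration of the deterministic property (not one of the four modes specifically, which are not enumerated here), a hypothetical exact-match scorer:

  import re

  def score_exact_match(response: str, expected: str) -> float:
      # Hypothetical scorer: normalize whitespace and case, then compare.
      # Deterministic: the same response text always yields the same score.
      def normalize(s: str) -> str:
          return re.sub(r"\s+", " ", s.strip().lower())
      return 1.0 if normalize(response) == normalize(expected) else 0.0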

Measurement Protocol

  1. Load the standardized prompt for the task.
  2. Record wall-clock start time.
  3. Send prompt to the model’s API (non-streaming).
  4. Record wall-clock end time; compute latency.
  5. Extract token counts from API response metadata.
  6. Look up the provider’s per-million-token pricing; compute cost.
  7. Score the response with the task’s scoring function.
  8. Store all raw data (response text, tokens, latency, score, cost) in the database.

A 2-second delay is inserted between API calls to avoid rate limiting. Failed calls are logged and skipped without aborting the run.
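
A condensed sketch of one protocol iteration, assuming a generic non-streaming client and the request_cost helper sketched earlier (client, task, pricing, and db are all hypothetical objects):

  import time

  def run_task(client, task, pricing, db):
      start = time.monotonic()                                   # step 2
      try:
          resp = client.complete(task.prompt, stream=False)      # step 3
      except Exception as exc:
          db.log_failure(task.id, exc)   # failed calls are logged, not fatal
          return
      latency = time.monotonic() - start                         # step 4
      pt = resp.usage.prompt_tokens                              # step 5
      ct = resp.usage.completion_tokens
      cost = request_cost(pt, ct,                                # step 6
                          pricing.input_per_m, pricing.output_per_m)
      score = task.score(resp.text)                              # step 7
      db.store(task.id, resp.text, pt, ct, latency, score, cost) # step 8
      time.sleep(2)                      # spacing to avoid rate limits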

Changelog