// benchmarking AI where it matters

AI models are ranked on questions. But we use them to build software.

The Problem

Most benchmarks test whether a model can answer questions. That's not the job. The job is building software that works.

Most benchmarks don't tell you whether the output actually works, how many tokens the model burned, what the run cost, or how long it took.

The Solution

BuildBench tests AI by making it build real software. Each model gets the same project, the same features to implement, the same tests to pass. No human help. Just the AI.

protocol.md

input: project spec + test suite

constraint: no human intervention

evaluation: automated test pass/fail

runs: 3+ per task for consistency
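The protocol above can be sketched in a few lines. This is an illustrative outline only, assuming hypothetical names (`Task`, `run_task`, and the commented-out `model_build` / `run_tests` calls are not the real harness API):

```python
# Minimal sketch of the run protocol: a task is a spec plus a test
# suite, each model gets 3+ unattended runs, and every run is scored
# by automated test pass/fail. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class Task:
    spec: str        # project spec handed to the model
    test_suite: str  # automated tests used for scoring

def run_task(task: Task, model: str, runs: int = 3) -> list[float]:
    """Run `model` on `task` with no human intervention; return
    one pass-fraction score per run."""
    scores = []
    for _ in range(runs):
        # build = model_build(model, task.spec)        # hypothetical model call
        # passed, total = run_tests(task.test_suite, build)
        passed, total = 8, 10                          # placeholder result
        scores.append(passed / total)
    return scores
```

Multiple runs per task matter because a single run can't distinguish a reliable model from a lucky one.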

What We Measure

bb-score

BB Score

How well the output works. Mean score across all runs, with standard deviation and pass rate. The only metric that matters if you care about shipping.
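One plausible way to aggregate per-run scores into the summary described above (mean, standard deviation, pass rate) is a sketch like this; the function name and the "pass = all tests green" convention are assumptions, not the benchmark's actual definition:

```python
# Summarize per-run scores: mean, population standard deviation,
# and pass rate (fraction of runs where every test passed).
from statistics import mean, pstdev

def bb_summary(run_scores: list[float]) -> dict:
    return {
        "mean": mean(run_scores),
        "stdev": pstdev(run_scores),
        "pass_rate": sum(s == 1.0 for s in run_scores) / len(run_scores),
    }
```

Reporting the standard deviation alongside the mean is what separates "scored well once" from "scores well consistently".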

tokens

Tokens

Total token usage for the run. A model that solves the problem in 50k tokens is more efficient than one that burns 500k tokens for the same result.

cost

Cost

Real dollar cost of the full run. Because "best model" is meaningless if it costs 10x more for the same output quality.
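Converting token counts into dollars is simple arithmetic; a minimal sketch, where the per-million-token rates in the example are made up for illustration and not any provider's real pricing:

```python
# Dollar cost of a run from token counts and per-million-token rates.
# Prices are hypothetical placeholders, not real provider pricing.
def run_cost(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    return ((input_tokens / 1_000_000) * in_price_per_m
            + (output_tokens / 1_000_000) * out_price_per_m)

# e.g. 400k input + 100k output tokens at $3 / $15 per million:
# run_cost(400_000, 100_000, 3.0, 15.0) == 2.7
```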

time

Time

Wall-clock time from start to finish. Speed matters when you're iterating. A model that takes 20 minutes per task isn't practical for real workflows.

Why It Matters

You don't need the "smartest" AI. You need the one that finishes the job, costs less, and works reliably.

Each report shows the BB Score (mean and standard deviation), pass rate, token usage, dollar cost, and wall-clock time for every model.

The Goal

Find the AI that actually builds useful software. BuildBench already measures more of real software delivery than most answer-style benchmarks. The roadmap prioritizes benchmark integrity before platform expansion.

roadmap-1.md

Modular Runner

Separate protocol, orchestration, evaluation, scoring, and reporting into discrete modules.

roadmap-2.md

Asset Boundaries

Public task assets help the model without leaking private scoring signals.

roadmap-3.md

Auditability

Report replay, rerun validation, baseline comparison. Historical results stay trustworthy.

roadmap-4.md

Task Coverage

Expand tasks per category. Category conclusions are weak when one task dominates the signal.

roadmap-5.md

Category Prompting

Backend, repair, algorithm, and web tasks get protocol guidance that fits the work.

roadmap-6.md

Provider Diversity

Multi-provider parallel execution. Comes after correctness, reproducibility, and interpretability.

Reports

Have a model you want benchmarked?

[LET'S TALK]