// benchmarking AI where it matters
AI models are ranked on questions. But we use them to build software.
The Problem
Most benchmarks test whether a model can answer questions. That's not the job. The job is building software that works.
They don't tell you:
- Which model builds a working app
- Which one is fastest
- Which one is cheapest
- Which one produces the same result across repeated runs
The Solution
BuildBench tests AI by making it build real software. Each model gets the same project, the same features to implement, the same tests to pass. No human help. Just the AI.
input: project spec + test suite
constraint: no human intervention
evaluation: automated test pass/fail
runs: 3+ per task for consistency
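In harness terms, a single task evaluation might look like the sketch below. `build` and `run_tests` stand in for hypothetical harness hooks, not BuildBench's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class TestResult:
    passed: int
    total: int

def evaluate_task(
    build: Callable[[str], str],             # spec -> workspace path (assumed hook)
    run_tests: Callable[[str], TestResult],  # workspace -> pass/fail counts (assumed hook)
    spec: str,
    runs: int = 3,
) -> list[float]:
    """Run the same spec several times; return per-run pass fractions."""
    scores = []
    for _ in range(runs):
        workspace = build(spec)        # the model builds with no human help
        result = run_tests(workspace)  # automated evaluation only
        scores.append(result.passed / result.total)
    return scores
```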
What We Measure
BB Score
How well the output works. Mean score across all runs, with standard deviation and pass rate. The headline metric if you care about shipping.
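Read as a formula, the aggregation could work as follows; treating only a perfect run as a "pass" is an assumption here, not BuildBench's definition:

```python
from statistics import mean, stdev

def bb_score(scores: list[float]) -> dict:
    """Aggregate per-run pass fractions into mean, stdev, and pass rate."""
    return {
        "mean": mean(scores),
        "stdev": stdev(scores) if len(scores) > 1 else 0.0,
        "pass_rate": sum(s == 1.0 for s in scores) / len(scores),
    }

# e.g. bb_score([1.0, 0.8, 1.0]) -> mean ~0.93, stdev ~0.12, pass_rate ~0.67
```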
Tokens
Total tokens consumed across the run. A model that solves the problem in 50k tokens is more efficient than one that burns 500k tokens for the same result.
Cost
Real dollar cost of the full run. Because "best model" is meaningless if it costs 10x more for the same output quality.
Time
Wall-clock time from start to finish. Speed matters when you're iterating. A model that takes 20 minutes per task isn't practical for real workflows.
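Tokens, cost, and time fit naturally into one per-run record. A minimal sketch, with placeholder per-million-token prices rather than any provider's real rates:

```python
import time
from dataclasses import dataclass

@dataclass
class RunMetrics:
    tokens_in: int
    tokens_out: int
    seconds: float  # wall-clock, start to finish

    def cost_usd(self, price_in: float, price_out: float) -> float:
        # price_* are per-million-token rates; placeholder values only
        return (self.tokens_in * price_in + self.tokens_out * price_out) / 1_000_000

start = time.monotonic()
# ... model builds the project here ...
m = RunMetrics(tokens_in=40_000, tokens_out=10_000, seconds=time.monotonic() - start)
print(f"${m.cost_usd(price_in=3.0, price_out=15.0):.2f} in {m.seconds:.0f}s")
```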
Why It Matters
You don't need the "smartest" AI. You need the one that finishes the job, costs less, and works reliably.
Each report shows:
- What the AI built
- What passed and what failed
- How much it cost
- How long it took
- Whether it does the same thing every time you run it
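Serialized, a report of that shape might look like the following; every key and value is illustrative, not the actual schema:

```python
# Illustrative report shape; keys and values are assumptions, not the schema.
report = {
    "model": "example-model",
    "task": "todo-api",
    "built": "REST API with SQLite persistence",
    "tests": {"passed": 18, "failed": 2},
    "cost_usd": 1.42,
    "wall_clock_s": 312,
    "run_scores": [0.90, 0.90, 0.85],  # consistency across repeated runs
}
```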
The Goal
Find the AI that actually builds useful software. BuildBench already captures more of real software delivery than most answer-style benchmarks. The roadmap prioritizes benchmark integrity before platform expansion.
Modular Runner
Separate protocol, orchestration, evaluation, scoring, and reporting into discrete modules.
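One way to draw those boundaries is with typed interfaces, so orchestration never reaches into another module's internals. The names below are assumptions, with `Agent` standing in for the protocol module:

```python
from typing import Protocol

class Agent(Protocol):
    def run(self, model: str, spec: str) -> str: ...    # -> workspace path

class Evaluator(Protocol):
    def evaluate(self, workspace: str) -> dict: ...     # raw test outcomes

class Scorer(Protocol):
    def score(self, outcomes: list[dict]) -> dict: ...  # score aggregation

class Reporter(Protocol):
    def report(self, scored: dict) -> None: ...         # final report

def orchestrate(agent: Agent, evaluator: Evaluator, scorer: Scorer,
                reporter: Reporter, model: str, spec: str, runs: int = 3) -> None:
    """Orchestration touches each module only through its interface."""
    outcomes = [evaluator.evaluate(agent.run(model, spec)) for _ in range(runs)]
    reporter.report(scorer.score(outcomes))
```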
Asset Boundaries
Public task assets help the model without leaking private scoring signals.
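A minimal sketch of that boundary, assuming a per-task directory split (paths are hypothetical):

```python
# Sketch of the split (paths are assumptions): only public/ is ever
# mounted into the model's workspace; private/ holds scoring signals.
PUBLIC = "tasks/<task>/public"    # spec, starter files, visible tests
PRIVATE = "tasks/<task>/private"  # hidden test suite, scoring rubric

def model_visible_assets(task_dir: str) -> str:
    return f"{task_dir}/public"   # private/ never reaches the model
```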
Auditability
Report replay, rerun validation, baseline comparison. Historical results stay trustworthy.
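Replay, for instance, can mean recomputing scores from stored raw outcomes rather than trusting persisted numbers; a sketch under an assumed record shape:

```python
from statistics import mean

def replay(record: dict) -> float:
    """Recompute a historical score from stored raw outcomes; a mismatch
    with the persisted value flags drift in the scoring code.
    The record shape here is an assumption for illustration."""
    recomputed = mean(r["passed"] / r["total"] for r in record["runs"])
    assert abs(recomputed - record["mean"]) < 1e-9
    return recomputed
```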
Task Coverage
Expand tasks per category. Category conclusions are weak when one task dominates the signal.
Category Prompting
Backend, repair, algorithm, and web tasks get protocol guidance that fits the work.
Provider Diversity
Multi-provider parallel execution. Comes after correctness, reproducibility, and interpretability.