GPT-5.4 BuildBench Report

Overall
Mean Score: 74.27
Std Dev: +/-26.06
Range: [40.09, 100]
Pass Rate: 44.4% (4/9)
Runs per Task: 3

Resource Usage
Tokens: 453,560
Cost: $0.5782
Time: 400.7s (6.7 min)
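The headline figures can be reproduced from the nine per-run scores. Below is a minimal sketch, assuming the run scores listed under Per-Task Results (including the reconstructed 46.37 web run; see the sketch under Analysis) and treating Std Dev as the sample standard deviation (n - 1). Counting a run as passing only at a full score of 100 is an inference from the data, not a documented harness rule.

```python
from statistics import mean, stdev

# Per-run scores: three runs each for algorithm, repair, and web.
# The 46.37 web run is reconstructed from the reported category
# mean and range, not read from raw harness output.
runs = [100, 100, 100,        # algorithm
        100, 67.5, 67.5,      # repair
        40.09, 46.37, 46.95]  # web

# Assumption: a run "passes" only when it scores a full 100.
passes = sum(1 for s in runs if s == 100)

print(f"Mean Score: {mean(runs):.2f}")       # 74.27
print(f"Std Dev: +/-{stdev(runs):.2f}")      # +/-26.06 (sample, n-1)
print(f"Range: [{min(runs)}, {max(runs)}]")  # [40.09, 100]
print(f"Pass Rate: {passes/len(runs):.1%} ({passes}/{len(runs)})")  # 44.4% (4/9)
```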
Category Breakdown

Category    Mean    Std Dev    Range            Pass Rate
Algorithm   100.00  +/-0       [100, 100]       100.0%
Repair      78.33   +/-18.76   [67.5, 100]      33.3%
Web         44.47   +/-3.8     [40.09, 46.95]   0.0%
Per-Task Results

Algorithm task
Mean: 100.00
Std Dev: +/-0
Range: [100, 100]
Pass Rate: 100.0%
Consistency: 100.0%
Individual Runs: 100, 100, 100

Repair task
Mean: 78.33
Std Dev: +/-18.76
Range: [67.5, 100]
Pass Rate: 33.3%
Consistency: 33.3%
Individual Runs: 100, 67.5, 67.5
Failures:
- (2x) model_code: Tests failed: 2/8 tests

Web task
Mean: 44.47
Std Dev: +/-3.8
Range: [40.09, 46.95]
Pass Rate: 0.0%
Consistency: 0.0%
Individual Runs: 40.09, 46.37, 46.95 (the 46.37 is reconstructed from the reported mean; see the sketch under Analysis)
Failures:
- (1x) model_incomplete: Model did not complete: max_iterations
- (2x) model_code: Functional tests failed: 11/11 tests
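The failure lines above use a `(Nx) reason: detail` convention, rolling up identical failures across runs. A small sketch of how that roll-up might be produced; the list-of-tuples record format is illustrative, not the harness's actual schema:

```python
from collections import Counter

# One (reason, detail) record per failed run; mirrors the report above.
failures = [
    ("model_code", "Tests failed: 2/8 tests"),
    ("model_code", "Tests failed: 2/8 tests"),
    ("model_incomplete", "Model did not complete: max_iterations"),
    ("model_code", "Functional tests failed: 11/11 tests"),
    ("model_code", "Functional tests failed: 11/11 tests"),
]

# Counter preserves first-seen order, so output matches run order.
for (reason, detail), count in Counter(failures).items():
    print(f"- ({count}x) {reason}: {detail}")
```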
Overview
GPT-5.4 was tested across three task categories: algorithm implementation, code repair, and full web application builds. Each task was run 3 times to measure consistency. The model excels at isolated algorithmic problems but struggles significantly with full-stack web builds that require sustained multi-file coordination.
Analysis
The algorithm category shows perfect scores: 100/100 across all three runs. This is expected: algorithmic tasks are self-contained, have clear inputs and outputs, and play to the strengths of large language models.

The interesting signal is in repair and web. Repair tasks scored a mean of 78.33 but with high variance (std dev +/-18.76): one run scored a perfect 100, the other two scored 67.5. The model understands what needs fixing but doesn't reliably apply the fix correctly.

Web tasks are where GPT-5.4 breaks down. A mean of 44.47 with zero passes means it never produced a fully working web application in any run. One run hit max_iterations without completing; the other two completed but failed all 11 functional tests.
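One detail worth making explicit: the middle web run's score did not survive in the per-task data, but with three runs, the reported mean and both range endpoints pin it down exactly. The sketch below does the arithmetic and cross-checks the result against the reported std dev:

```python
from statistics import stdev

web_mean = 44.47          # reported category mean
low, high = 40.09, 46.95  # reported range endpoints

# With three runs, the mean and both endpoints determine the third score.
middle = 3 * web_mean - low - high
print(f"Reconstructed middle run: {middle:.2f}")  # 46.37

# Cross-check: the reconstructed set reproduces the reported +/-3.8.
print(f"Std Dev: +/-{stdev([low, middle, high]):.2f}")  # +/-3.80
```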
Verdict
GPT-5.4 is strong on contained problems and weak on sustained builds. If your use case is algorithm implementation or targeted code fixes, it performs well. If you need it to build and ship a working web application end-to-end, look elsewhere. The cost efficiency is notable: $0.58 for 9 runs across 3 tasks is cheap. But cheap doesn't matter if the output doesn't pass its tests.
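On the cost point, the per-run economics follow from simple division of the reported totals across the nine runs:

```python
# Reported totals from the Resource Usage section.
tokens, cost_usd, seconds = 453_560, 0.5782, 400.7
runs = 9

print(f"Tokens per run: {tokens / runs:,.0f}")  # ~50,396
print(f"Cost per run: ${cost_usd / runs:.4f}")  # ~$0.0642
print(f"Time per run: {seconds / runs:.1f}s")   # ~44.5s
```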
Key Findings
- Perfect 100/100 on algorithm tasks across all runs. Zero variance.
- Repair tasks show high variance: the model can fix code, but not reliably.
- Web build tasks failed every run. Zero functional tests passed in two runs.
- Total cost of $0.58 for the full benchmark suite is very cost-efficient.
- The model hit max_iterations on one web task run, suggesting it loops without converging.