
GPT-5.4 BuildBench Report

openai/gpt-5.4 | 3 tasks, 9 runs
Overall Score
74.27/100

Std Dev: +/-26.06

Range: [40.09, 100]

Pass Rate: 44.4% (4/9)

Runs per Task: 3


Resource Usage

Tokens: 453,560

Cost: $0.5782

Time: 400.7s (6.7 min)

Category Breakdown

algorithm
100/100

+/-0 stddev

Range: [100, 100]

Pass Rate: 100.0%

repair
78.33/100

+/-18.76 stddev

Range: [67.5, 100]

Pass Rate: 33.3%

web
44.47/100

+/-3.8 stddev

Range: [40.09, 46.95]

Pass Rate: 0.0%

Per-Task Results

buildbench-algo-medium
PASS | algorithm | Score: 100/100

Std Dev: +/-0

Range: [100, 100]

Pass Rate: 100.0%

Consistency: 100.0%

Individual Runs:

✓ 100.0   ✓ 100.0   ✓ 100.0
buildbench-repair-logic
PARTIAL | repair | Score: 78.33/100

Std Dev: +/-18.76

Range: [67.5, 100]

Pass Rate: 33.3%

Consistency: 33.3%

Individual Runs:

~ 67.5   ~ 67.5   ✓ 100.0

Failures:

  • (2x) model_code: Tests failed: 2/8 tests
buildbench-web-v1
PARTIAL | web | Score: 44.47/100

Std Dev: +/-3.8

Range: [40.09, 46.95]

Pass Rate: 0.0%

Consistency: 0.0%

Individual Runs:

~ 40.09   ~ 46.36   ~ 46.95

Failures:

  • (1x) model_incomplete: Model did not complete: max_iterations
  • (2x) model_code: Functional tests failed: 11/11 tests

Overview

GPT-5.4 was tested across three task categories: algorithm implementation, code repair, and full web application builds. Each task was run 3 times to measure consistency. The model excels at isolated algorithmic problems but struggles significantly with full-stack web builds that require sustained multi-file coordination.

Analysis

The algorithm category shows perfect scores: 100/100 across all three runs. This is expected: algorithmic tasks are self-contained, have clear inputs and outputs, and play to the strengths of large language models. The interesting signal is in repair and web. Repair averaged 78.33 but with high variance (std dev 18.76): one run scored a perfect 100, the other two scored 67.5. The model understands what needs fixing but doesn't reliably apply the fix correctly. Web tasks are where GPT-5.4 breaks down. A mean of 44.47 with zero passes means it never produced a fully working web application in any run. One run hit max_iterations without completing; the other two completed but failed all 11 functional tests.
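The headline numbers can be reproduced from the nine per-run scores listed above. A minimal sketch in Python using the standard `statistics` module; the pass criterion (a run passes when it scores 100) is an assumption inferred from the ✓/~ run markers, not something the report states explicitly:

```python
from statistics import mean, stdev

# Per-run scores taken from the Individual Runs sections above.
runs = [
    100.0, 100.0, 100.0,   # buildbench-algo-medium
    67.5, 67.5, 100.0,     # buildbench-repair-logic
    40.09, 46.36, 46.95,   # buildbench-web-v1
]

overall = mean(runs)                  # 74.27
spread = stdev(runs)                  # sample std dev, +/-26.06
score_range = (min(runs), max(runs))  # (40.09, 100.0)

# Assumption: a run "passes" when it scores 100, matching the ✓ markers.
passes = sum(1 for r in runs if r == 100.0)
pass_rate = passes / len(runs)        # 4/9, i.e. 44.4%
```

Note that the reported +/-26.06 only matches the sample standard deviation (n - 1 denominator), which is the appropriate estimator for a small number of runs.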

Verdict

GPT-5.4 is strong on contained problems and weak on sustained builds. If your use case is algorithm implementation or targeted code fixes, it performs well. If you need it to build and ship a working web application end-to-end, look elsewhere. The cost efficiency is notable: $0.58 for 9 runs across 3 tasks is cheap. But cheap doesn't matter if the output doesn't pass its tests.
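The cost-efficiency claim is straightforward arithmetic over the totals in the Resource Usage section. A quick sketch, dividing the totals evenly across the nine runs as an approximation (actual per-run usage surely varies by task):

```python
# Totals from the Resource Usage section above.
total_tokens = 453_560
total_cost = 0.5782   # USD
total_time = 400.7    # seconds

runs = 9

cost_per_run = total_cost / runs      # ~$0.064 per run
tokens_per_run = total_tokens / runs  # ~50,396 tokens per run
time_per_run = total_time / runs      # ~44.5 s per run
```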

Key Findings

  • Perfect 100/100 on algorithm tasks across all runs. Zero variance.
  • Repair tasks show high variance: the model can fix code, but not reliably.
  • Web build tasks failed every run. Zero functional tests passed in two runs.
  • Total cost of $0.58 for the full benchmark suite is very cost-efficient.
  • The model hit max_iterations on one web task run, suggesting it loops without converging.

Have a model you want benchmarked?

We run the same tasks, the same tests, and publish the results.
