Introducing ProjectBench: We measure how well AI can plan, steer and execute complex real-world projects
Can current AI models produce a project plan worth building from?
ProjectBench is our benchmark for real project-management work. It hands a model a real project with a clear scope, a hard budget cap and a timeline, then has it plan, steer and adapt the project from end to end. The result is judged on one question: would the owner be measurably better off holding this plan than not?
What we measure
Most AI demos count fields filled or admire the prose. We measure value: a single Plan Value score for how much a real project lead gains from the plan, weighted toward the two things that decide execution.
Real-world projects
Every run starts from a real brief: a clear scope, a hard budget cap, a timeline and the actual people involved, across domains like construction, events and product launches. One scenario, for instance, builds a 150 m² house near Fürth on a hard €480,000 cap with move-in inside 14 months. The model plans and runs the whole project end to end; it is not answering a clever prompt in isolation.
Plan Value, not prose
Actionability and resilience lead the score: are tasks owned, dated, and sequenced, and does the plan absorb change? A beautiful plan that breaks on first contact is worth less than a rougher one that adapts.
Blind and fair
Quality is judged blind by a panel, with every pairing seen in both orders to cancel position bias. Hard constraints, such as staying within the budget cap and not inventing people, are checked mechanically, with no judgment involved.
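As a rough illustration of the two ideas above, the sketch below enumerates every pairing in both presentation orders and runs one mechanical check for invented people. It is a minimal sketch under stated assumptions, not ProjectBench's actual harness: the function names, the verdict format and the example names are all hypothetical.

```python
from itertools import permutations


def both_order_pairings(plan_ids):
    """Return every ordered pair (shown_first, shown_second) of distinct plans.

    Each unordered pairing appears twice, once in each presentation order.
    """
    return list(permutations(plan_ids, 2))


def tally_wins(verdicts):
    """Count wins per plan.

    verdicts maps (shown_first, shown_second) to the id of the preferred plan.
    This format is an assumption made for the sketch.
    """
    wins = {}
    for (first, second), winner in verdicts.items():
        wins.setdefault(first, 0)
        wins.setdefault(second, 0)
        wins[winner] += 1
    return wins


def invented_people(plan_owners, brief_people):
    """Return owners named in the plan that do not appear in the brief."""
    return sorted(set(plan_owners) - set(brief_people))
```

With this layout, a judge that always prefers whichever plan is shown first hands every plan the same number of wins, so a pure position preference cannot favor any one plan.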
The finding: context over model
We ran the same project through a broad range of models, and the plans spread from build-from quality down to not-usable. The differences were concrete, not cosmetic. When a mid-project change cut the financing, slipped the start by three weeks, and still required staying within the original budget cap, one plan came back about €110,000 over, a total of roughly 123% of the €480,000 cap, with no decision recorded to justify it. Others contained a dependency cycle, in which a task waits on another that in turn waits on it, leaving the schedule incoherent. A few never produced a usable plan at all; they were too thin to run a project from. What set a build-from plan apart was rarely raw model horsepower. It was execution detail: holding the budget, staying internally consistent, and covering the whole lifecycle.
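The two failures described above, a budget overrun and a dependency cycle, are the kind of thing a mechanical check can catch. The following is a minimal sketch, assuming a plan exposes a total cost and a mapping from each task to the tasks it waits on; the function names and task names are hypothetical, and the €590,000 total is only the approximate figure implied by "about €110,000 over" a €480,000 cap.

```python
def budget_overrun(total_cost, cap):
    """Return (amount over the cap, total as a percentage of the cap)."""
    over = max(0, total_cost - cap)
    return over, 100 * total_cost / cap


def find_cycle(depends_on):
    """Return one dependency cycle as a list of task ids, or None.

    depends_on maps each task to the tasks it waits on.
    Uses a depth-first search that tracks the tasks on the current path.
    """
    UNSEEN, ON_PATH, DONE = 0, 1, 2
    state = {}
    path = []

    def visit(task):
        state[task] = ON_PATH
        path.append(task)
        for dep in depends_on.get(task, ()):
            if state.get(dep, UNSEEN) == ON_PATH:
                # dep is already on the current path, so we have looped back
                return path[path.index(dep):] + [dep]
            if state.get(dep, UNSEEN) == UNSEEN:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[task] = DONE
        return None

    for task in list(depends_on):
        if state.get(task, UNSEEN) == UNSEEN:
            cycle = visit(task)
            if cycle:
                return cycle
    return None


# Approximate figures from the text: about 110,000 over a 480,000 cap.
over, percent_of_cap = budget_overrun(590_000, 480_000)

# Two hypothetical tasks that wait on each other.
cycle = find_cycle({"roof": ["walls"], "walls": ["roof"]})
```

Here `percent_of_cap` comes out just under 123, matching the figure quoted above, and `cycle` lists the tasks that wait on each other.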
When the context is structured, owned, and queryable, the context does the heavy lifting and the model becomes a swappable engine. That is why Bring Your Own Key matters: pick the engine that fits your budget and let the context layer carry the quality.
How a run works
Each run has the model do genuine work, and the full project graph is snapshotted after every step so that each change can be inspected.
1. Let the model bootstrap a real project from a brief with clear scope, budget and timeline
2. Build the task plan with owners, dates, and priorities
3. Map the dependencies across the lifecycle
4. Replan when reality shifts, like a budget cut and a schedule slip
5. Process an incoming change through the human-in-the-loop intake
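The snapshot-after-every-step idea can be sketched in a few lines. This is an illustrative sketch only, assuming the project graph can be represented as a plain dictionary and each step as a function that mutates it; `run_with_snapshots` and `changed_keys` are hypothetical names, not part of the product.

```python
import copy


def run_with_snapshots(graph, steps):
    """Apply each (name, step) in order, keeping a deep copy after each one.

    A deep copy ensures later steps cannot alter earlier snapshots.
    """
    snapshots = []
    for name, step in steps:
        step(graph)
        snapshots.append((name, copy.deepcopy(graph)))
    return snapshots


def changed_keys(before, after):
    """Return the top-level keys whose values differ between two snapshots."""
    keys = set(before) | set(after)
    return sorted(k for k in keys if before.get(k) != after.get(k))
```

Comparing consecutive snapshots with `changed_keys` shows which parts of the graph a given step touched, which is what makes each change inspectable.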
We keep ourselves honest
ProjectBench is how we check whether the context layer earns its place. These are early findings on a single scenario, and we are extending it across more domains. The method stays the same: measure plans by the value they create, in the real product, judged blind.