Enabled by TensorPM

Introducing ProjectBench: We measure how well AI can plan, steer and execute complex real-world projects

Can current AI models produce a project plan worth building from?

ProjectBench is our benchmark for real project-management work. It hands a model a real project with clear scope, a hard budget cap and a timeline, then has it plan, steer and adapt the project from end to end. The result is judged on one question: would the owner be measurably better off holding this plan than not?

What we measure

Most AI demos count fields filled or admire the prose. We measure value: a single Plan Value score for how much a real project lead gains from the plan, weighted toward the two things that decide execution, actionability and resilience.

Real-world projects

Every run starts from a real brief: a clear scope, a hard budget cap, a timeline and the actual people involved, across domains like construction, events and product launches. One scenario, for instance, builds a 150 m² house near Fürth on a hard €480,000 cap with move-in inside 14 months. The model plans and runs the project end to end, rather than answering a clever prompt in isolation.

Plan Value, not prose

Actionability and resilience lead the score: are tasks owned, dated, and sequenced, and does the plan absorb change? A beautiful plan that breaks on first contact is worth less than a rougher one that adapts.

Blind and fair

Quality is judged blind across a panel, with every pairing seen in both orders to cancel position bias. Hard-constraint violations, such as breaking the budget cap or inventing people who are not in the brief, are checked mechanically, with no judgment involved.
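To make the two mechanisms concrete, here is a minimal Python sketch of a mechanical constraint check and of both-order pairing. It is an illustration under stated assumptions, not ProjectBench's actual code: the `Task` and `Plan` shapes, the function names and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field
from itertools import permutations
from typing import List, Set, Tuple


@dataclass
class Task:
    # Hypothetical task shape; the real project graph is not described in the text.
    name: str
    owner: str
    cost_eur: float


@dataclass
class Plan:
    tasks: List[Task] = field(default_factory=list)


def check_hard_constraints(plan: Plan, budget_cap_eur: float,
                           known_people: Set[str]) -> List[str]:
    """Return a list of violations; an empty list means the plan passes."""
    violations = []
    total = sum(task.cost_eur for task in plan.tasks)
    if total > budget_cap_eur:
        violations.append(f"budget exceeded by {total - budget_cap_eur:.0f} EUR")
    for task in plan.tasks:
        # An owner who does not appear in the brief counts as an invented person.
        if task.owner not in known_people:
            violations.append(f"unknown owner '{task.owner}' on task '{task.name}'")
    return violations


def judging_orders(plan_ids: List[str]) -> List[Tuple[str, str]]:
    """Every pairing in both orders, so each plan is shown first and second."""
    return list(permutations(plan_ids, 2))
```

For example, a plan that totals €500,000 against a €480,000 cap and names an owner missing from the brief would come back with two violations, and `judging_orders(["A", "B"])` yields both `("A", "B")` and `("B", "A")`.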

The finding: context over model

We ran the same project through a broad range of models, and the plans spread from build-from quality down to not-usable. The differences were concrete, not cosmetic. When a mid-project change cut the financing, slipped the start by three weeks, and still required staying within the original budget cap, one plan came back about €110,000 over, putting its total at roughly 123% of the €480,000 cap, with no decision recorded to justify it. Others tripped a dependency cycle, a task waiting on one that waited back on it, leaving the schedule incoherent. A few never produced a usable plan at all, too thin to run a project from.

What set a build-from plan apart was rarely raw model horsepower. It was execution detail: holding the budget, staying internally consistent, and covering the whole lifecycle.
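Both failure modes can be caught without any judgment call. The Python sketch below shows one way to do it, assuming a plan's dependencies are available as a simple mapping from each task to the tasks it waits on. The function names and the task names in the usage note are hypothetical, and the cycle check is a standard depth-first search rather than a description of how ProjectBench implements it.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of task names, or None if acyclic.

    deps maps each task to the tasks it waits on.
    """
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    state = {task: WHITE for task in deps}
    path: List[str] = []

    def visit(task: str) -> Optional[List[str]]:
        state[task] = GREY
        path.append(task)
        for dep in deps.get(task, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so the loop is closed.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[task] = BLACK
        return None

    for task in list(deps):
        if state[task] == WHITE:
            found = visit(task)
            if found:
                return found
    return None


def cap_usage(total_eur: float, cap_eur: float) -> float:
    """Planned total as a fraction of the budget cap (1.0 means exactly on cap)."""
    return total_eur / cap_eur
```

With the figures from the text, `cap_usage(480_000 + 110_000, 480_000)` is about 1.23, which is where the 123% comes from. A two-task loop such as `find_cycle({"frame": ["roof"], "roof": ["frame"]})` returns `["frame", "roof", "frame"]`.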

When the context is structured, owned, and queryable, the context does the heavy lifting and the model becomes a swappable engine. That is why Bring Your Own Key matters: pick the engine that fits your budget and let the context layer carry the quality.

How a run works

Each run has the model do genuine work. The full project graph is snapshotted after every step, so each change is something we can inspect.

  1. Let the model bootstrap a real project from a brief with clear scope, budget and timeline
  2. Build the task plan with owners, dates, and priorities
  3. Map the dependencies across the lifecycle
  4. Replan when reality shifts, like a budget cut and a schedule slip
  5. Process an incoming change through the human-in-the-loop intake
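The snapshot-after-every-step idea can be sketched in a few lines of Python. This is an assumption-laden illustration: the project graph is modelled as a plain dictionary of task attributes, and `run_with_snapshots` and `changed_tasks` are hypothetical helpers, not part of the product.

```python
import copy
from typing import Callable, Dict, List, Tuple

# Hypothetical graph shape: task name -> attributes.
Graph = Dict[str, dict]
Step = Tuple[str, Callable[[Graph], None]]


def run_with_snapshots(graph: Graph, steps: List[Step]) -> List[Tuple[str, Graph]]:
    """Apply each step to the live graph and keep a deep copy after every one."""
    snapshots = [("initial", copy.deepcopy(graph))]
    for name, step in steps:
        step(graph)  # the step mutates the live graph in place
        snapshots.append((name, copy.deepcopy(graph)))
    return snapshots


def changed_tasks(before: Graph, after: Graph) -> List[str]:
    """Names of tasks added, removed, or modified between two snapshots."""
    names = set(before) | set(after)
    return sorted(n for n in names if before.get(n) != after.get(n))
```

Because each snapshot is a deep copy, a later replan (say, moving a task's start from week 1 to week 4) does not alter earlier snapshots, and comparing two consecutive snapshots with `changed_tasks` shows exactly what that step touched.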

We keep ourselves honest

ProjectBench is how we check whether the context layer earns its place. These are early findings on a single scenario, and we are extending it across more domains. The method stays the same: measure plans by the value they create, in the real product, judged blind.
