The Model Matters Less Than the Context: Introducing ProjectBench
Ask any capable model to plan a construction project and you get something that reads beautifully. Clean phases, tidy milestones, a neat risk table. The question that matters to a Bauleiter (site manager) or a Projektleiter (project manager) is a different one. Could you actually run the project from it? Are the tasks owned, dated, and in the right order? Does the budget respect the cap? And when the bank confirms financing 25,000 EUR short and the shell contractor slips three weeks, does the plan absorb the change or quietly fall apart?
Most "AI for project management" demos measure the wrong thing. They count fields filled or admire the prose. We wanted to measure value: whether the plan creates a real advantage for the person who has to execute it.
So we built ProjectBench.
What ProjectBench does
ProjectBench drives the real TensorPM, the same application our users run, and has the model do genuine project work from end to end. The model creates the project from a short brief, builds the task plan, maps the dependencies, replans when reality shifts, writes a stakeholder update, and processes an incoming change through the same human-in-the-loop intake that TensorPM uses in production. The full project graph is captured after every step, so each change is something we can inspect.
This matters more than it sounds. We are not testing a model on a clever prompt in isolation. We are measuring what a project lead would actually get: the model working inside the context layer, with the project graph as the shared source of truth.
How we judge fairly
Then we score it, carefully.
Quality is judged blind. The judges never see which model produced a plan. They compare plans head to head and see each pairing in both orders, so position bias cancels out, and the verdict comes from a panel rather than a single opinion. Violations of hard constraints are checked mechanically, with no judgment involved at all: a budget over the cap, a person who was invented rather than given, a dependency chain that loops back on itself. Those are facts, not opinions.
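As a rough illustration of what "checked mechanically" can mean, here is a minimal Python sketch of the three checks named above. It is not ProjectBench's actual checker: the Task fields, the function names, and the sample figures are all assumptions made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical task shape; the real TensorPM project graph is not shown in this post.
    id: str
    owner: str
    cost_eur: float
    depends_on: list[str] = field(default_factory=list)


def has_cycle(tasks: dict[str, Task]) -> bool:
    """Depth-first search with three states; reaching an in-progress task means a loop."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, in progress, finished
    state = {task_id: WHITE for task_id in tasks}

    def visit(task_id: str) -> bool:
        state[task_id] = GREY
        for dep in tasks[task_id].depends_on:
            if dep not in tasks:
                continue  # dangling references are outside the scope of this sketch
            if state[dep] == GREY:
                return True
            if state[dep] == WHITE and visit(dep):
                return True
        state[task_id] = BLACK
        return False

    return any(state[task_id] == WHITE and visit(task_id) for task_id in tasks)


def check_hard_constraints(
    tasks: list[Task], budget_cap_eur: float, known_people: set[str]
) -> list[str]:
    """Return a list of violations; an empty list means every hard constraint holds."""
    violations = []

    total = sum(task.cost_eur for task in tasks)
    if total > budget_cap_eur:
        violations.append(f"budget {total:.0f} EUR exceeds cap {budget_cap_eur:.0f} EUR")

    for task in tasks:
        if task.owner not in known_people:
            violations.append(f"task {task.id} is owned by unknown person {task.owner!r}")

    if has_cycle({task.id: task for task in tasks}):
        violations.append("dependency graph contains a cycle")

    return violations


# Illustrative data only: one invented owner and a total above the cap.
plan = [
    Task("shell", "Anna", 180_000),
    Task("roof", "Ben", 60_000, ["shell"]),
    Task("interior", "Carla", 90_000, ["roof"]),
]
for problem in check_hard_constraints(plan, 300_000, {"Anna", "Ben"}):
    print(problem)
```

Each check is a yes-or-no comparison against the brief, so two runs on the same plan always agree. That is what separates this part of the scoring from the judged part.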
The headline number is what we call Plan Value: a single score, from 0 to 100, for how much a real owner would gain from holding the plan. It leans on the two things that decide execution: the actionability of the task plan and its resilience when the situation changes. A beautiful static plan that breaks on first contact is worth less than a rougher one that adapts.
The first finding
We expected the most expensive models to separate themselves from the field. They did not.
Across runs, a low-cost model produced plans that scored in the same "worth building from" tier as the flagship models. The expensive models planned well too, and the gap between them and the affordable ones was small. In one telling case, a top-tier model wrote an elegant plan and then broke the hard budget cap during replanning, while cheaper models held the line. A plan that stays within budget is worth more than an elegant one that does not.
That result is the whole thesis of Context-Driven Project Management in a single number. When the project context is structured, owned, and queryable, the context does the heavy lifting and the model becomes a swappable engine. You can run a frontier model or an affordable one, your own key, your own choice, and still get a plan worth building from. The advantage lives in the context, not the vendor.
It is also why Bring Your Own Key is a first-class option in TensorPM rather than a compromise. If the context is what creates the value, then paying for the most expensive engine is optional, and the economics of doing serious project work with AI get a lot friendlier.
Resilience is where plans earn their keep
The most revealing moment in every run is the change event. A static plan is easy. A plan that takes a financing cut and a schedule slip, reflows the dependent dates, protects the cap, and records the trade-offs as decisions is rare, and far more valuable.
This is also where the context layer shows its worth. Because the plan lives as a graph of owned tasks, dependencies, and decisions rather than a block of prose, the change can propagate through it. Dates shift, the critical path updates, and the reasoning is captured for the next person who opens the project. That is replanning as a project lead actually lives it, not a fresh wall of text.
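To make the idea of a change propagating through a graph concrete, here is a small Python sketch of a forward scheduling pass. It is an illustration under stated assumptions, not TensorPM's scheduler: the task names, durations, and start date are invented, and it counts calendar days without weekends, holidays, or resource limits.

```python
from datetime import date, timedelta
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def schedule(
    durations: dict[str, int],
    depends_on: dict[str, list[str]],
    project_start: date,
) -> dict[str, tuple[date, date]]:
    """Forward pass: a task starts when its latest dependency has finished."""
    finish: dict[str, date] = {}
    plan: dict[str, tuple[date, date]] = {}
    # static_order() yields every task after all of its dependencies.
    for task in TopologicalSorter(depends_on).static_order():
        start = max(
            (finish[dep] for dep in depends_on.get(task, ())),
            default=project_start,
        )
        finish[task] = start + timedelta(days=durations[task])
        plan[task] = (start, finish[task])
    return plan


# Hypothetical scenario data, in calendar days.
durations = {"shell": 60, "roof": 20, "interior": 40, "garden_design": 15}
depends_on = {
    "shell": [],
    "roof": ["shell"],
    "interior": ["roof"],
    "garden_design": [],
}
start = date(2025, 3, 3)

baseline = schedule(durations, depends_on, start)

# The shell contractor slips three weeks: change one node and rerun the pass.
slipped = schedule({**durations, "shell": durations["shell"] + 21}, depends_on, start)

for task in durations:
    moved_by = (slipped[task][1] - baseline[task][1]).days
    print(f"{task}: finish moves by {moved_by} days")
```

Only one number changes by hand. Every task downstream of the shell moves because of its dependencies, and a task with no link to the shell stays where it was. A plan written as prose has no equivalent of this rerun.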
What this means for you
If you lead projects in the office, the practical takeaway is freeing. You do not need the most expensive model to get a plan worth building from. Give the work a real context, pick the engine that fits your budget, and let the context layer carry the quality. TensorPM is the intelligence layer above your field tools, and ProjectBench is how we keep ourselves honest about whether that layer earns its place.
What's next
These are early findings on a single construction scenario, and we are deliberately not publishing a vendor leaderboard from one test. We are extending ProjectBench across more scenarios and more domains, and we will share what we learn. The method stays the same throughout. We measure plans by the value they create, inside the real product, judged blind.
If you want to see Context-Driven Project Management for yourself, the best way is to try it on a project of your own.