A landmark study from METR (Model Evaluation & Threat Research), published in March 2025, revealed something that should reframe how we think about AI agent capability trajectories. According to their benchmarks, the length of tasks that AI agents can complete autonomously, measured by how long those tasks take skilled human professionals, has been doubling roughly every seven months. That is not a metaphor or a rough estimate: it is a measured, consistent trend across multiple frontier model generations.
If you are building or evaluating desktop AI agent applications today, this finding deserves more than a passing read.
## What METR Actually Measured
METR's benchmark methodology centers on a deceptively simple question: given a real-world task with no human assistance during execution, how long a task can a given agent handle before it fails or requires intervention?
Their dataset spans software engineering, research summarization, data analysis, and system administration tasks. They measured elapsed time-to-completion rather than token count or step count, because wall-clock duration more faithfully captures the compound challenge of multi-step reasoning: each subtask creates dependencies, introduces new information, and demands that the agent maintain coherent intent across an extended context window.
The results showed a clean log-linear improvement curve:
| Time Period | Reliable Autonomous Task Duration | Approximate Doublings |
|---|---|---|
| Early 2024 | Under 10 minutes | Baseline |
| Late 2024 | ~30 minutes | ~2 doublings |
| Mid-2025 | ~1 hour | ~3 doublings |
| Early 2026 (projected) | 4-8 hours | ~4-5 doublings |
The seven-month doubling period places the theoretical ceiling somewhere around four to eight hours of autonomous task execution by the time you read this post.
## Why Duration Is the Right Unit
Many AI benchmarks focus on accuracy, pass-at-k scores, or step efficiency. Duration captures something different: the agent's ability to maintain coherent intent and adaptive reasoning across compounding uncertainty.
A ten-minute task might involve three to five distinct decisions. A two-hour task involves dozens — and crucially, early decisions constrain later options in ways the agent cannot fully predict at the outset. This is the domain where rigid prompt-and-reply architectures collapse. The agent needs a genuine working model of the task's current state, a recovery strategy when subgoals fail, and the judgment to know when to proceed versus when to pause and re-plan.
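To see why duration is so punishing, consider a toy reliability model (an illustration of the compounding argument, not METR's methodology): if each decision succeeds independently with probability p, a task that chains n dependent decisions succeeds end-to-end with probability p^n.

```python
def chain_success(p: float, n: int) -> float:
    """Probability that n independent decisions, each succeeding
    with probability p, all succeed in sequence."""
    return p ** n

# A short task (~5 decisions) with a 99%-reliable agent:
print(round(chain_success(0.99, 5), 3))   # → 0.951

# A long task (~60 decisions) with the same agent:
print(round(chain_success(0.99, 60), 3))  # → 0.547
```

A 99%-reliable step policy still loses nearly half of all 60-decision tasks. Raw per-step accuracy alone cannot carry an agent to the long end of the curve; recovery and re-planning have to pick up the slack.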
This is precisely why the two-phase architecture — planning before execution — matters so much at the longer end of the task duration curve.
## How This Maps to Neumar's Two-Phase Execution
Neumar's agent runtime separates task handling into two explicit phases: a planning phase, where the agent decomposes the goal into a structured sequence of subtasks, and an execution phase, where each subtask is carried out with access to MCP tools and workspace resources.
This architecture directly addresses the failure mode that caps agent performance at longer task durations. When an agent operates purely reactively — picking the next action based only on the most recent context — error accumulation compounds quickly. A wrong assumption in step 3 silently propagates through steps 4, 5, and 6 before anything fails loudly.
By externalizing the plan as a first-class artifact, Neumar's agents can:
- Inspect the plan before committing to execution, letting users validate high-level intent without micromanaging individual tool calls
- Re-plan when execution diverges, treating the plan as a mutable document rather than a fixed script
- Report progress against planned milestones, giving users meaningful status rather than opaque streaming output
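Neumar's internal runtime types are not public in this post, so here is a minimal sketch of the plan-then-execute loop those bullets describe. The `Plan` and `Subtask` names and the `replan` hook are hypothetical stand-ins, not Neumar's actual API:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    description: str
    done: bool = False

@dataclass
class Plan:
    goal: str
    subtasks: list  # mutable on purpose: re-planning edits this document

def execute(plan: Plan, run_subtask, replan) -> Plan:
    """Run subtasks in order. On failure, hand the whole plan back to
    `replan` for revision instead of blindly continuing, then resume
    from the first subtask not yet marked done. (A real runtime would
    also enforce a retry budget to bound repeated re-planning.)"""
    i = 0
    while i < len(plan.subtasks):
        task = plan.subtasks[i]
        if run_subtask(task):          # returns True on success
            task.done = True
            i += 1
        else:
            plan = replan(plan, failed=task)
            i = next((j for j, t in enumerate(plan.subtasks) if not t.done),
                     len(plan.subtasks))
    return plan
```

In use, `run_subtask` would wrap actual tool execution and `replan` would call back into the model with the failed subtask in context. The key property is that the plan object lives outside the context window, so it can be inspected before execution, mutated on divergence, and reported against as milestones complete.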
The METR data suggests that this class of architecture will become increasingly important as the task duration ceiling rises. An agent that can reliably handle 30-minute tasks today will be attempting 2-hour tasks by mid-2026 under the same doubling curve. At that scale, plan-then-execute is not a nice-to-have — it is load-bearing.
## The Compounding Effect of Tool Integration
One factor METR's analysis highlights is that tool availability is a significant multiplier on effective task duration. An agent without filesystem access, web browsing, or code execution hits a hard ceiling on what it can accomplish autonomously regardless of raw reasoning capability. Tool use transforms the agent from a sophisticated text predictor into a genuine task executor.
Neumar's integration of the Model Context Protocol gives agents access to over 10,000 skills through a standardized tool interface. This breadth matters for the duration curve in a concrete way: longer tasks almost always require heterogeneous tool use — reading files, running code, querying APIs, and writing output across multiple systems. An agent constrained to a single tool category will stall on cross-domain tasks regardless of how capable its underlying model is.
The practical implication is that MCP coverage and task duration capability are tightly coupled. Expanding tool availability is not just a feature — it is a prerequisite for operating near the frontier of what the duration curve makes achievable.
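MCP itself is a JSON-RPC protocol, and nothing below reproduces its wire format. This sketch only illustrates the registry idea behind the coupling claim: heterogeneous tool categories resolved through one uniform interface, so cross-domain tasks never stall on a missing tool class. All names here are invented for illustration:

```python
from typing import Callable, Dict

class ToolRegistry:
    """Uniform lookup over heterogeneous tool categories,
    in the spirit of an MCP-style tool catalog."""
    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., object]] = {}

    def register(self, name: str, fn: Callable[..., object]) -> None:
        self._tools[name] = fn

    def call(self, name: str, **kwargs):
        if name not in self._tools:
            raise KeyError(f"no tool named {name!r}")
        return self._tools[name](**kwargs)

# Heterogeneous categories behind one calling convention:
registry = ToolRegistry()
registry.register("fs.read", lambda path: f"<contents of {path}>")
registry.register("code.run", lambda source: f"<ran {len(source)} chars>")
registry.register("http.get", lambda url: f"<fetched {url}>")

print(registry.call("fs.read", path="notes.txt"))  # → <contents of notes.txt>
```

The design point is that the agent plans against tool *names and descriptions*, not concrete implementations, which is what lets tool breadth scale independently of the model.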
## What This Means for Teams Evaluating Agent Platforms Today
The doubling curve creates an interesting selection pressure on agent platform choices. A platform that handles 15-minute tasks reliably today will, under the same architecture, need to handle 60-minute tasks reliably within 14 months. Platforms designed for short-horizon tasks — single tool calls, single-shot completions — are not simply inadequate at the frontier; they are architecturally misaligned with the trajectory.
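That 15-to-60-minute arithmetic is just the doubling law: each doubling takes seven months, so a 4x capability gain (two doublings) takes 14 months. As a one-line model:

```python
def projected_horizon(current_minutes: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate the autonomous-task horizon along the doubling curve."""
    return current_minutes * 2 ** (months_ahead / doubling_months)

print(projected_horizon(15, 14))  # two doublings → 60.0 minutes
```

Like any extrapolation, this assumes the trend continues at the same rate; METR themselves flag that as uncertain.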
When evaluating a desktop agent platform, the relevant questions are:
- Does the architecture externalize state between planning and execution, or does everything live in the LLM's context window?
- Does the platform support recovery and re-planning when subtasks fail, or does the agent treat failures as terminal?
- Is tool integration breadth a first-class concern, or an afterthought bolted onto a chat interface?
- Can the platform handle asynchronous, long-running operations without blocking the user interface?
These are not arbitrary preferences. They are the properties that determine whether a platform can ride the capability curve rather than get left behind by it.
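The last question on that list, long-running work that does not block the interface, corresponds to a standard pattern: schedule subtasks on a background event loop and report progress out-of-band. A minimal asyncio sketch, with invented task and status names:

```python
import asyncio

async def long_subtask(name: str, seconds: float, status: dict) -> None:
    """Simulate a long-running subtask that reports progress into a
    shared status map instead of blocking the caller."""
    status[name] = "running"
    await asyncio.sleep(seconds)
    status[name] = "done"

async def main() -> dict:
    status: dict = {}
    # Schedule the work without waiting for it...
    work = asyncio.create_task(long_subtask("index-files", 0.05, status))
    await asyncio.sleep(0)  # yield once so the subtask starts
    # ...so the "UI" loop stays responsive and can report progress meanwhile.
    print("status while working:", status["index-files"])  # → running
    await work  # later, await completion
    print("final status:", status["index-files"])          # → done
    return status

status = asyncio.run(main())
```

Whatever the concrete runtime, the structural requirement is the same: execution state lives somewhere the interface can poll or subscribe to, rather than inside a blocking call.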
## The Honest Limitations
METR's seven-month doubling figure comes with important caveats. The benchmark tasks, while carefully designed, are not fully representative of real enterprise workloads. Some categories of task — those requiring organizational knowledge, human approval checkpoints, or access to proprietary internal systems — resist purely autonomous execution regardless of agent capability.
The doubling curve also describes capability ceiling, not reliable deployment. An agent that can theoretically handle a 90-minute task will still fail or produce poor output on a meaningful fraction of 90-minute tasks. The practical deployment window is typically well below the benchmark ceiling.
That said, the directional signal is clear and consistent across multiple research groups and model generations. The capability frontier is moving fast, and the architecture decisions made today will determine how well any given platform can keep pace.
The METR research is available in full on their website. For teams actively building on top of frontier agent capability, it is one of the most actionable benchmark studies published in 2025.
