A framework for automatically synthesizing computer-use tasks, environments, and verifiers at scale.
We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. The pipeline is designed to be extensible: new applications and platforms (Windows, macOS, Android) can be added by following the same verifier-grounded workflow. We are actively expanding coverage, and the framework is open for the community to build on — anyone can use it to onboard their own apps and grow the environment and tasks.
OpenComputer is an open-ended synthesis framework that generates verifiable tasks for any desktop app. It also ships a benchmark suite of 1,000 tasks across 33 applications produced by that framework.
Pass any alias to `--model`. Unknown model IDs are routed by family heuristics: a name containing `claude`, `kimi`, `qwen`, `gemini`, `gpt-`, etc., automatically picks up the correct agent class.
| Family | Provider | Aliases |
|---|---|---|
| Claude | Anthropic | claude-sonnet-4-6 claude-opus-4 claude-3-7-sonnet |
| GPT / ChatGPT | OpenAI | gpt-5 gpt-5.4 computer-use-preview azure-gpt-5.4 |
| Gemini | Google | gemini-3-flash gemini-3-flash-preview gemini-2.5-computer-use |
| Kimi | Moonshot | kimi-k2.5 kimi-k2.6 |
| Qwen | Alibaba | qwen3-vl qwen2.5-vl-72b qwen3.5-35b-a3b qwen3.5-27b |
| Specialised CUA | GUI-Owl, EvoCUA, Mano, OpenCUA, Dart | owl1.5 evocua-s1 evocua-s2 mano opencua dart |
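The family heuristic described above can be pictured as a simple substring match. The following is a minimal sketch, not OpenComputer's actual routing code: the names `FAMILY_MARKERS` and `route_family` are hypothetical, the marker list covers only the five substrings named in the text, and the first-match-wins ordering is an assumption.

```python
from typing import Optional

# Hypothetical marker table: (substring to look for, family label).
# Only the substrings mentioned in the text are listed; the real router
# may know more families and use different rules.
FAMILY_MARKERS = [
    ("claude", "Claude"),
    ("kimi", "Kimi"),
    ("qwen", "Qwen"),
    ("gemini", "Gemini"),
    ("gpt-", "GPT"),
]


def route_family(model_id: str) -> Optional[str]:
    """Return the first family whose marker occurs in the model ID, else None."""
    lowered = model_id.lower()
    for marker, family in FAMILY_MARKERS:
        if marker in lowered:
            return family
    return None
```

Under these assumptions, an unlisted ID such as `my-qwen-finetune` would resolve to the Qwen family, while an ID with no recognizable marker returns `None` and would need an explicit alias.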
Performance and efficiency comparison across computer-use agents on the OpenComputer benchmark. Success Rate is the fraction of tasks for which all required criteria are satisfied; Avg. Reward is the mean fraction of passed verifier checks (giving partial credit). OSWorld-Verified is reported as an external reference where available.
| # | Model | OSWorld-Verified | Success Rate ↓ | Avg. Reward | Avg. Steps |
|---|---|---|---|---|---|
| 1 | GPT-5.4 | 75.0% | 68.3% | 88.4% | 19.0 |
| 2 | Claude-Sonnet-4.6 | 72.5% | 64.4% | 76.6% | 31.5 |
| 3 | Kimi-K2.6 | 73.1% | 58.8% | 70.7% | 35.7 |
| 4 | Qwen-3.5-27B | 56.2% | 32.3% | 59.4% | 33.1 |
| 5 | Gemini-3-Flash | — | 16.4% | 37.0% | 25.4 |
| 6 | EvoCUA-8B | 46.1% | 10.9% | 38.1% | 67.0 |
| 7 | Qwen-3.5-9B | 41.8% | 7.8% | 31.7% | 39.3 |
| 8 | GUI-OWL-1.5-8B | 52.3% | 5.7% | 27.8% | 73.6 |
Observations. OpenComputer is challenging even for the strongest current agents — GPT-5.4 still fails to completely solve nearly one-third of the benchmark tasks. Open-source models drop sharply from their reported OSWorld-Verified scores (e.g. GUI-OWL-1.5-8B: 52.3% → 5.7%; EvoCUA-8B: 46.1% → 10.9%), suggesting limited cross-benchmark generalization to the broader, more heterogeneous software surfaces covered here. GPT-5.4 is also the most efficient: it completes tasks in only 19 steps on average by combining low-level operations into single computer-control steps.
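To make the two headline metrics concrete, here is a minimal sketch of how Success Rate and Avg. Reward could be computed from per-task check outcomes. It assumes each task is represented as a non-empty list of booleans (one per verifier check) and that every check counts as a required criterion; the function names and the data layout are illustrative, not the harness's real interface.

```python
from statistics import mean


def success_rate(results):
    """Fraction of tasks in which every check passed."""
    return mean(1.0 if all(checks) else 0.0 for checks in results)


def avg_reward(results):
    """Mean over tasks of the fraction of checks that passed (partial credit)."""
    return mean(sum(checks) / len(checks) for checks in results)


# Three hypothetical tasks: fully solved, half solved, one check out of four.
results = [
    [True, True],
    [True, False],
    [False, False, False, True],
]
print(f"success rate: {success_rate(results):.3f}")  # 1 of 3 tasks fully solved
print(f"avg reward:   {avg_reward(results):.3f}")    # (1.0 + 0.5 + 0.25) / 3
```

Because a fully solved task contributes 1.0 to both metrics and a partially solved task contributes only to the reward, Avg. Reward computed this way can never fall below Success Rate, which matches the pattern in the table.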
The execution order is strict: verifier → smoke tests → task generation → (optional) repair. Each stage gates the next, because verifier quality dictates task quality and task quality dictates evaluation signal. Use this part of the framework when you want to add a new app or extend the existing benchmark with new tasks.
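The gating behaviour can be sketched as a loop that stops at the first stage that fails. This is only an illustration of the ordering rule: `run_pipeline` and the idea that each stage is a callable returning a boolean are assumptions, not the framework's actual entry points.

```python
from typing import Callable, List, Tuple


def run_pipeline(stages: List[Tuple[str, Callable[[], bool]]]) -> List[str]:
    """Run stages in order and stop at the first one that reports failure.

    Returns the names of the stages that completed successfully.
    """
    completed = []
    for name, stage in stages:
        if not stage():
            break  # a failed stage gates everything after it
        completed.append(name)
    return completed


# Hypothetical stage callables; real stages would build verifiers, run agents, etc.
stages = [
    ("verifier", lambda: True),
    ("smoke tests", lambda: False),   # smoke is red, so task generation never runs
    ("task generation", lambda: True),
    ("repair", lambda: True),
]
print(run_pipeline(stages))
```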
Build a programmatic verifier for the target app under `verifiers/<app>/`. Design comprehensive `check-*` endpoints across every inspectable surface: file state, IPC state, and profile databases. Write `Test.md` before the test code, then iterate until all unit tests pass.
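As an illustration of what a file-state check might look like, the sketch below inspects a JSON file and returns a structured result. Everything here is hypothetical: the function name `check_json_field`, the `passed`/`reason` result shape, and the choice of JSON as the inspected format are assumptions made for the example, not the verifier interface OpenComputer defines.

```python
import json
from pathlib import Path


def check_json_field(path, key, expected):
    """Hypothetical file-state check: does a JSON file hold `expected` under `key`?"""
    file = Path(path)
    if not file.is_file():
        return {"passed": False, "reason": f"missing file: {file}"}
    try:
        data = json.loads(file.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return {"passed": False, "reason": f"invalid JSON: {exc}"}
    # Only top-level objects are inspected in this sketch.
    actual = data.get(key) if isinstance(data, dict) else None
    return {"passed": actual == expected, "reason": f"{key}={actual!r}"}
```

Returning a reason alongside the pass/fail flag is one way to keep results auditable, since a failed check then explains what state was actually observed.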
After unit tests pass, run the smoke pipeline so each verifier endpoint is exercised by a real agent trajectory in a fresh sandbox. One small task per endpoint group; the run produces `REPORT.md`. Real task generation does not begin until smoke is green.
Generate proposals without consulting the verifier. This is the single most important synthesis rule: it prevents the task set from collapsing onto whatever the verifier already supports. Evaluate each proposal on complexity and data-generatability, then match it to verifiers by adapting the task, extending the verifier, or discarding the proposal. Synthesize `env/` seed files and target at least three finalized tasks per app.
For each round the agent runs, both the programmatic verifier and an LLM-as-judge grade the trajectory. When they disagree, the script edits the verifier or the task spec; when they agree, the run is finalized. Each invocation repairs a single task; loop in the shell to repair several.
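The agree-or-repair rule can be sketched as a small loop. This is an assumption-laden illustration: `repair_loop`, the `run_round` callback, and the `max_rounds` cap are hypothetical, and the actual edit to the verifier or task spec is represented only by a comment.

```python
from typing import Callable, Tuple


def repair_loop(run_round: Callable[[], Tuple[bool, bool]], max_rounds: int = 3):
    """Repeat rounds until the verifier and the judge agree, or rounds run out.

    `run_round` returns (verifier_passed, judge_passed) for one agent run.
    Returns a (status, rounds_used) pair.
    """
    for round_index in range(1, max_rounds + 1):
        verifier_passed, judge_passed = run_round()
        if verifier_passed == judge_passed:
            return "finalized", round_index
        # Disagreement: the real pipeline would edit the verifier or the
        # task spec here before the next round.
    return "unresolved", max_rounds
```

Note that agreement covers both outcomes: the two graders may agree that the agent failed, which still finalizes the run because the grading signal itself is consistent.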
OpenComputer started on Linux desktop, but the synthesis pipeline — verifiers, task generation, sandbox provisioning — is platform-agnostic by design. We are actively extending the framework to Windows, macOS, and Android so that a single benchmark can grade computer-use agents across every major form factor.
The synthesis pipeline has been instantiated for all 33 applications below — each ships with a verifier module, a documented set of inspection endpoints, and finalized tasks. The catalog is actively growing; new apps are added by following the same four-stage pipeline.
If you use OpenComputer in your research, please cite: