Research Preview • 2026

OpenComputer: Synthetic Environments for Computer-Use Agents

A framework for automatically synthesizing computer-use tasks, environments, and verifiers at scale.

Jinbiao Wei (Y), Qianran Ma (P), Yilun Zhao (Y), Xiao Zhou (Y), Kangqi Ni (C), Guo Gan (Y), Arman Cohan (Y)
Y: Yale NLP Lab  ·  P: University of Pennsylvania  ·  C: University of North Carolina at Chapel Hill
Tasks: 1,000 (growing)
Apps: 33 (growing)
Models evaluated: 8
Endpoints per app: 17.7
Checks per task: 6.9

Pipeline replay for Blender; the same flow runs for any app.

1. Abstract

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. The pipeline is designed to be extensible: new applications and platforms (Windows, macOS, Android) can be added by following the same verifier-grounded workflow. We are actively expanding coverage, and the framework is open for the community to build on — anyone can use it to onboard their own apps and grow the environment and tasks.

Stage 1: Verifier. Endpoints over IPC, files, profile DBs
Stage 2: Smoke Tests. Live-sandbox endpoint checks
Stage 3: Task Synthesis. Propose → evaluate → match → seed env
Stage 4: Repair Loop. Verifier ↔ LLM judge reconciliation
Figure 1. The OpenComputer pipeline. Each stage gates the next: verifier quality determines task quality, and task quality determines whether evaluation is meaningful.

2. Framework and Benchmark

OpenComputer is an open-ended synthesis framework that generates verifiable tasks for any desktop app. It also ships a benchmark suite of 1,000 tasks across 33 applications produced by that framework.

Supported Models

Pass any alias to --model. Unknown model IDs are routed by family heuristics — a name containing claude, kimi, qwen, gemini, gpt-, etc., automatically picks up the correct agent class.

| Family | Provider | Aliases |
|---|---|---|
| Claude | Anthropic | claude-sonnet-4-6, claude-opus-4, claude-3-7-sonnet |
| GPT / ChatGPT | OpenAI | gpt-5, gpt-5.4, computer-use-preview, azure-gpt-5.4 |
| Gemini | Google | gemini-3-flash, gemini-3-flash-preview, gemini-2.5-computer-use |
| Kimi | Moonshot | kimi-k2.5, kimi-k2.6 |
| Qwen | Alibaba | qwen3-vl, qwen2.5-vl-72b, qwen3.5-35b-a3b, qwen3.5-27b |
| Specialised CUA | GUI-Owl, EvoCUA, Mano, OpenCUA, Dart | owl1.5, evocua-s1, evocua-s2, mano, opencua, dart |
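
The family-heuristic routing for unknown model IDs could look roughly like the substring lookup below. This is a minimal sketch, not OpenComputer's routing code: the agent class names, the rule order, and the `None` fallback are all assumptions made for illustration.

```python
from typing import Optional

# Hypothetical (substring, agent class name) pairs; first match wins.
# The class names are illustrative and do not come from the OpenComputer codebase.
FAMILY_RULES = [
    ("claude", "ClaudeAgent"),
    ("kimi", "KimiAgent"),
    ("qwen", "QwenAgent"),
    ("gemini", "GeminiAgent"),
    ("gpt-", "OpenAIAgent"),
]


def route_model(model_id: str) -> Optional[str]:
    """Return an agent class name for a model ID, or None if no family matches."""
    name = model_id.lower()
    for needle, agent_class in FAMILY_RULES:
        if needle in name:
            return agent_class
    return None
```

Under these assumed rules, `route_model("azure-gpt-5.4")` resolves to the OpenAI family because the ID contains `gpt-`. An alias such as `computer-use-preview` carries no family substring, so it would need an explicit alias entry rather than the heuristic.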

Leaderboard

Performance and efficiency comparison across computer-use agents on the OpenComputer benchmark. Success Rate is the fraction of tasks for which all required criteria are satisfied; Avg. Reward is the mean fraction of passed verifier checks (giving partial credit). OSWorld-Verified is reported as an external reference where available.

🥇 1st · Frontier: GPT-5.4, 68.3% Success Rate
🥈 2nd · Frontier: Claude-Sonnet-4.6, 64.4% Success Rate
🥉 3rd · Frontier: Kimi-K2.6, 58.8% Success Rate

| # | Model | OSWorld-Verified | Success Rate | Avg. Reward | Avg. Steps |
|---|---|---|---|---|---|
| 1 | GPT-5.4 (Best) | 75.0% | 68.3% | 88.4% | 19.0 |
| 2 | Claude-Sonnet-4.6 | 72.5% | 64.4% | 76.6% | 31.5 |
| 3 | Kimi-K2.6 | 73.1% | 58.8% | 70.7% | 35.7 |
| 4 | Qwen-3.5-27B | 56.2% | 32.3% | 59.4% | 33.1 |
| 5 | Gemini-3-Flash | n/a | 16.4% | 37.0% | 25.4 |
| 6 | EvoCUA-8B | 46.1% | 10.9% | 38.1% | 67.0 |
| 7 | Qwen-3.5-9B | 41.8% | 7.8% | 31.7% | 39.3 |
| 8 | GUI-OWL-1.5-8B | 52.3% | 5.7% | 27.8% | 73.6 |

Observations. OpenComputer is challenging even for the strongest current agents — GPT-5.4 still fails to completely solve nearly one-third of the benchmark tasks. Open-source models drop sharply from their reported OSWorld-Verified scores (e.g. GUI-OWL-1.5-8B: 52.3% → 5.7%; EvoCUA-8B: 46.1% → 10.9%), suggesting limited cross-benchmark generalization to the broader, more heterogeneous software surfaces covered here. GPT-5.4 is also the most efficient: it completes tasks in only 19 steps on average by combining low-level operations into single computer-control steps.
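
To make the two leaderboard metrics concrete, here is a minimal sketch of how Success Rate and Avg. Reward could be computed from per-task check results. It assumes that Avg. Reward is the per-task fraction of passed checks averaged over tasks, and that each check carries a `required` flag; the `Check` type and function names are hypothetical, not the harness's actual API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Check:
    passed: bool
    required: bool = True  # assumption: some checks may be optional


def task_reward(checks: List[Check]) -> float:
    """Partial credit: fraction of this task's checks that passed."""
    if not checks:
        return 0.0
    return sum(c.passed for c in checks) / len(checks)


def task_success(checks: List[Check]) -> bool:
    """Success only when every required check passed."""
    required = [c for c in checks if c.required]
    return bool(required) and all(c.passed for c in required)


def summarize(tasks: List[List[Check]]) -> Tuple[float, float]:
    """Return (success_rate, avg_reward) over a list of graded tasks."""
    n = len(tasks)
    if n == 0:
        return 0.0, 0.0
    success_rate = sum(task_success(t) for t in tasks) / n
    avg_reward = sum(task_reward(t) for t in tasks) / n
    return success_rate, avg_reward
```

Under this reading, a task that passes half of its checks adds 0.5 to the reward sum but nothing to the success count, which is consistent with Avg. Reward exceeding Success Rate in every leaderboard row.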

Pipeline

The execution order is strict: verifier → smoke tests → task generation → (optional) repair. Each stage gates the next, because verifier quality dictates task quality and task quality dictates evaluation signal. Use this part of the framework when you want to add a new app or extend the existing benchmark with new tasks.
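
The strict gating can be sketched as a runner that stops at the first failing stage. This is an illustrative sketch only: the stage callables and the `run_pipeline` name are assumptions, and the real pipeline scripts may be organized differently.

```python
from typing import Callable, List, Tuple

# A stage is a name plus a callable that returns True when the stage passed.
Stage = Tuple[str, Callable[[], bool]]


def run_pipeline(stages: List[Stage]) -> List[str]:
    """Run stages in order and stop at the first failure.

    Returns the names of the stages that completed successfully, so a
    failed verifier or smoke stage never lets task generation start.
    """
    completed: List[str] = []
    for name, run in stages:
        if not run():
            break
        completed.append(name)
    return completed
```

For example, if the smoke stage returns False, the result is `["verifier"]` and the task-generation callable is never invoked.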

Stage 1: Verifier Generation

Build a programmatic verifier for the target app under verifiers/<app>/. Design comprehensive check-* endpoints across every inspectable surface — file state, IPC state, profile databases. Write Test.md before test code, then run until all unit tests pass.
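
As an illustration of what a single file-state check might look like, the sketch below tests whether the application wrote a non-empty file. The function name, the result fields, and the `min_bytes` parameter are hypothetical; the actual check-* endpoints and their response format are defined in each app's verifier module.

```python
import os
from typing import Any, Dict


def check_file_exported(path: str, min_bytes: int = 1) -> Dict[str, Any]:
    """Hypothetical file-state check: did the app write a non-empty file at `path`?"""
    if not os.path.isfile(path):
        return {"check": "file-exported", "passed": False, "detail": "file not found"}
    size = os.path.getsize(path)
    return {
        "check": "file-exported",
        "passed": size >= min_bytes,
        "detail": f"{size} bytes",
    }
```

Returning a structured result with a `detail` field, rather than a bare boolean, keeps each verdict auditable when it is later compared against an LLM judge.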

Stage 2: Smoke Tests

After unit tests pass, run the smoke pipeline so each verifier endpoint is exercised by a real agent trajectory in a fresh sandbox. One small task per endpoint group; the run produces REPORT.md. Real task generation does not begin until smoke is green.
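
The smoke gate can be pictured as a per-endpoint pass/fail map that is rendered into a report and reduced to one green-or-not decision. The report layout and function names below are assumptions for illustration; the format of the real REPORT.md is not specified here.

```python
from typing import Dict


def render_report(results: Dict[str, bool]) -> str:
    """Render per-endpoint smoke results as Markdown (hypothetical layout)."""
    lines = ["# Smoke Report", ""]
    for endpoint in sorted(results):
        status = "PASS" if results[endpoint] else "FAIL"
        lines.append(f"- {endpoint}: {status}")
    return "\n".join(lines) + "\n"


def smoke_is_green(results: Dict[str, bool]) -> bool:
    """Green only if at least one endpoint was exercised and none failed."""
    return bool(results) and all(results.values())
```

Treating an empty result set as not green is a deliberate choice in this sketch: a run that exercised no endpoints should not unlock task generation.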

Stage 3: Task Generation

Generate proposals without consulting the verifier (the single most important synthesis rule — it prevents the task set from collapsing onto whatever the verifier already supports). Evaluate on complexity and data-generatability, then match to verifiers by adapting the task, extending the verifier, or discarding. Synthesize env/ seed files and target at least three finalized tasks per app.
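
The evaluate-then-match step can be sketched as a triage function applied after proposals already exist, so the verifier's coverage never shapes what gets proposed. Everything below is a hypothetical illustration: the `Proposal` fields, the complexity scale, the thresholds, and the rule for choosing between extending the verifier and adapting the task are assumptions, not the pipeline's actual criteria.

```python
from dataclasses import dataclass
from typing import Set


@dataclass
class Proposal:
    description: str
    complexity: int         # assumed 1-5 score from the evaluation step
    data_generatable: bool  # can env/ seed files be synthesized for it?
    needed_checks: Set[str]  # checks the task needs, derived after proposing


def triage(p: Proposal, available: Set[str],
           min_complexity: int = 2, max_missing: int = 1) -> str:
    """Decide what to do with a proposal given the verifier's endpoints."""
    if p.complexity < min_complexity or not p.data_generatable:
        return "discard"
    missing = p.needed_checks - available
    if not missing:
        return "accept"
    if len(missing) <= max_missing:
        return "extend-verifier"  # small gap: add the missing endpoint
    if p.needed_checks & available:
        return "adapt-task"       # large gap: narrow the task to covered checks
    return "discard"
```

The useful property is the ordering: proposals are scored on their own merits first, and verifier coverage only enters when deciding how to make an already-accepted idea checkable.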

Stage 4: Repair Loop (optional)

In each round, the agent runs the task, and the programmatic verifier and an LLM-as-judge both grade the resulting trajectory. When they disagree, the script edits the verifier or task spec; when they agree, the run is finalized. Each invocation handles a single task; loop in the shell to repair several.
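
The reconciliation logic can be sketched as a bounded loop over grading rounds. This is an assumption-laden outline rather than the repair script itself: `grade` stands in for running the agent and collecting both verdicts, `repair` stands in for editing the verifier or task spec, and the `max_rounds` cap is invented for the example.

```python
from typing import Callable, Tuple


def repair_loop(grade: Callable[[], Tuple[bool, bool]],
                repair: Callable[[], None],
                max_rounds: int = 3) -> bool:
    """Repeat grading until the verifier and the LLM judge agree.

    `grade` returns (verifier_verdict, judge_verdict) for one agent run.
    Returns True if the task was finalized, False if rounds ran out.
    """
    for _ in range(max_rounds):
        verifier_ok, judge_ok = grade()
        if verifier_ok == judge_ok:
            return True  # agreement (pass/pass or fail/fail): finalize
        repair()         # disagreement: edit the verifier or task spec
    return False
```

Note that agreement is what finalizes a run, not success: a trajectory that both graders mark as failed is still a consistent data point about the task.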

Platform Roadmap

OpenComputer started on Linux desktop, but the synthesis pipeline — verifiers, task generation, sandbox provisioning — is platform-agnostic by design. We are actively extending the framework to Windows, macOS, and Android so that a single benchmark can grade computer-use agents across every major form factor.

Linux (Available): Ubuntu / XFCE · 33 apps, 1,000 tasks
Windows (In Progress): Win32 + UWP · UI Automation verifiers
macOS (In Progress): AppKit · AX API + AppleScript bridges
Android (In Progress): Mobile · adb / UI Automator harness
More Apps (Growing): New apps synthesized every release
Active development: the verifier → smoke → task pipeline is being ported to each platform — Windows and Android sandboxes are already provisioning agents end-to-end in our internal CI.

Applications

The synthesis pipeline has been instantiated for all 33 applications below — each ships with a verifier module, a documented set of inspection endpoints, and finalized tasks. The catalog is actively growing; new apps are added by following the same four-stage pipeline.

Audacity
Blender
Brave
Chrome
CloudCompare
darktable
draw.io
Eclipse
Firefox
FreeCAD
Galculator
gedit
GIMP
Godot 4
Inkscape
Kdenlive
Krita
LO Calc
LO Draw
LO Impress
LO Writer
MuseScore 3
OBS Studio
Obsidian
PCManFM
RenderDoc
Shotcut
Thunderbird
VLC
VS Code
Zoom
Zotero
+ more soon

3. Citation

If you use OpenComputer in your research, please cite:

@misc{wei2026opencomputerverifiablesoftwareworlds,
  title={OpenComputer: Verifiable Software Worlds for Computer-Use Agents},
  author={Jinbiao Wei and Qianran Ma and Yilun Zhao and Xiao Zhou and Kangqi Ni and Guo Gan and Arman Cohan},
  year={2026},
  eprint={2605.19769},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2605.19769},
}