OpenComputer: Synthetic Verifiable Environments for Computer-Use Agents

1. Abstract

We present OpenComputer, a verifier-grounded framework for constructing verifiable software worlds for computer-use agents. OpenComputer integrates four components: (1) app-specific state verifiers that expose structured inspection endpoints over real applications, (2) a self-evolving verification layer that improves verifier reliability using execution-grounded feedback, (3) a task-generation pipeline that synthesizes realistic and machine-checkable desktop tasks, and (4) an evaluation harness that records full trajectories and computes auditable partial-credit rewards. In its current form, OpenComputer covers 33 desktop applications and 1,000 finalized tasks spanning browsers, office tools, creative software, development environments, file managers, and communication applications. The pipeline is designed to be extensible: new applications and platforms (Windows, macOS, Android) can be added by following the same verifier-grounded workflow. We are actively expanding coverage, and the framework is open for the community to build on: anyone can use it to onboard their own apps and grow the set of environments and tasks.

Stage 1 · Verifier: Endpoints over IPC, files, profile DBs
Stage 2 · Smoke Tests: Live-sandbox endpoint checks
Stage 3 · Task Synthesis: Propose → evaluate → match → seed env
Stage 4 · Repair Loop: Verifier ↔ LLM judge reconciliation

Figure 1. The OpenComputer pipeline. Each stage gates the next: verifier quality determines task quality, and task quality determines whether evaluation is meaningful.

Supported Models

Pass any alias to --model. Unknown model IDs are routed by family heuristics: a name containing claude, kimi, qwen, gemini, gpt-, and so on is automatically assigned the matching agent class.

Family	Provider	Aliases
Claude	Anthropic	claude-sonnet-4-6, claude-opus-4, claude-3-7-sonnet
GPT / ChatGPT	OpenAI	gpt-5, gpt-5.4, computer-use-preview, azure-gpt-5.4
Gemini	Google	gemini-3-flash, gemini-3-flash-preview, gemini-2.5-computer-use
Kimi	Moonshot	kimi-k2.5, kimi-k2.6
Qwen	Alibaba	qwen3-vl, qwen2.5-vl-72b, qwen3.5-35b-a3b, qwen3.5-27b
Specialised CUA	GUI-Owl, EvoCUA, Mano, OpenCUA, Dart	owl1.5, evocua-s1, evocua-s2, mano, opencua, dart
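
The family-heuristic routing described above can be illustrated with a minimal sketch. The substrings come from the text; the agent class names, the `KNOWN_ALIASES` table, and the `resolve_agent_class` function are hypothetical placeholders, not the framework's actual API.

```python
from typing import Optional

# Hypothetical explicit aliases, checked before any heuristic is applied.
KNOWN_ALIASES = {
    "computer-use-preview": "OpenAIAgent",
    "owl1.5": "GuiOwlAgent",
}

# Substrings taken from the text above; class names are illustrative only.
FAMILY_HINTS = [
    ("claude", "ClaudeAgent"),
    ("kimi", "KimiAgent"),
    ("qwen", "QwenAgent"),
    ("gemini", "GeminiAgent"),
    ("gpt-", "OpenAIAgent"),
]


def resolve_agent_class(model_id: str) -> Optional[str]:
    """Return an agent class name for model_id, or None if no family matches."""
    name = model_id.strip().lower()
    if name in KNOWN_ALIASES:
        return KNOWN_ALIASES[name]
    # Fall back to substring matching on the family name.
    for needle, agent in FAMILY_HINTS:
        if needle in name:
            return agent
    return None


if __name__ == "__main__":
    for model in ["claude-sonnet-4-6", "azure-gpt-5.4", "my-qwen-finetune", "mystery"]:
        print(model, "->", resolve_agent_class(model))
```

Under this sketch an unlisted ID such as `my-qwen-finetune` still resolves because it contains `qwen`, while a name with no recognizable family returns `None` so the caller can report an error.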

Leaderboard

Performance and efficiency comparison across computer-use agents on the OpenComputer benchmark. Success Rate is the fraction of tasks for which all required criteria are satisfied; Avg. Reward is the mean fraction of passed verifier checks (giving partial credit). OSWorld-Verified is reported as an external reference where available.

🥇 1st · Frontier: GPT-5.4, 68.3% Success Rate
🥈 2nd · Frontier: Claude-Sonnet-4.6, 64.4% Success Rate
🥉 3rd · Frontier: Kimi-K2.6, 58.8% Success Rate

#	Model	OSWorld-Verified	Success Rate	Avg. Reward	Avg. Steps
1	GPT-5.4 (best)	75.0%	68.3%	88.4%	19.0
2	Claude-Sonnet-4.6	72.5%	64.4%	76.6%	31.5
3	Kimi-K2.6	73.1%	58.8%	70.7%	35.7
4	Qwen-3.5-27B	56.2%	32.3%	59.4%	33.1
5	Gemini-3-Flash	—	16.4%	37.0%	25.4
6	EvoCUA-8B	46.1%	10.9%	38.1%	67.0
7	Qwen-3.5-9B	41.8%	7.8%	31.7%	39.3
8	GUI-OWL-1.5-8B	52.3%	5.7%	27.8%	73.6
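
The two leaderboard metrics can be sketched as follows. This is an illustration under stated assumptions: the `Check` type and its `required` flag are hypothetical, and Avg. Reward is computed here as the mean of per-task pass fractions, which is one reading of "mean fraction of passed verifier checks".

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Check:
    passed: bool
    required: bool = True  # assumption: each check is either required or optional


def task_success(checks: List[Check]) -> bool:
    """A task succeeds only if every required check passes."""
    return all(c.passed for c in checks if c.required)


def task_reward(checks: List[Check]) -> float:
    """Partial credit: fraction of all checks that passed."""
    if not checks:
        return 0.0
    return sum(c.passed for c in checks) / len(checks)


def aggregate(tasks: List[List[Check]]) -> Tuple[float, float]:
    """Return (success_rate, avg_reward) over a list of graded tasks."""
    if not tasks:
        raise ValueError("no tasks to aggregate")
    n = len(tasks)
    success_rate = sum(task_success(t) for t in tasks) / n
    avg_reward = sum(task_reward(t) for t in tasks) / n
    return success_rate, avg_reward


if __name__ == "__main__":
    graded = [
        [Check(True), Check(True)],   # fully solved
        [Check(True), Check(False)],  # half of the checks pass
    ]
    print(aggregate(graded))
```

The gap between the two columns in the table follows from this structure: a task that passes most checks but misses one required criterion contributes to Avg. Reward without counting toward Success Rate.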

Observations. OpenComputer is challenging even for the strongest current agents: GPT-5.4 still fails to fully solve nearly one-third of the benchmark tasks. Open-source models drop sharply from their reported OSWorld-Verified scores (e.g., GUI-OWL-1.5-8B: 52.3% → 5.7%; EvoCUA-8B: 46.1% → 10.9%), suggesting limited cross-benchmark generalization to the broader, more heterogeneous software surfaces covered here. GPT-5.4 is also the most efficient: it completes tasks in only 19.0 steps on average by combining low-level operations into single computer-control steps.

Pipeline

The execution order is strict: verifier → smoke tests → task generation → (optional) repair. Each stage gates the next, because verifier quality dictates task quality and task quality dictates evaluation signal. Use this part of the framework when you want to add a new app or extend the existing benchmark with new tasks.

Stage 1

Verifier Generation

Build a programmatic verifier for the target app under verifiers/<app>/. Design comprehensive check-* endpoints across every inspectable surface: file state, IPC state, and profile databases. Write Test.md before writing test code, then iterate until all unit tests pass.
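
As a rough illustration of what file-state check-* endpoints might look like, here is a minimal sketch. The registry, the endpoint names `check-file-exists` and `check-json-setting`, and the result shape are all assumptions made for this example; the actual verifier interface may differ.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict

CheckResult = Dict[str, Any]

# Hypothetical registry mapping endpoint names to check functions.
ENDPOINTS: Dict[str, Callable[..., CheckResult]] = {}


def endpoint(name: str):
    """Register a function under a check-* endpoint name."""
    def register(fn: Callable[..., CheckResult]) -> Callable[..., CheckResult]:
        ENDPOINTS[name] = fn
        return fn
    return register


@endpoint("check-file-exists")
def check_file_exists(path: str) -> CheckResult:
    found = Path(path).is_file()
    return {"passed": found, "detail": f"{path} exists={found}"}


@endpoint("check-json-setting")
def check_json_setting(path: str, key: str, expected: Any) -> CheckResult:
    """Inspect a JSON settings file and compare one top-level key."""
    try:
        data = json.loads(Path(path).read_text(encoding="utf-8"))
    except (OSError, json.JSONDecodeError) as exc:
        return {"passed": False, "detail": f"cannot read {path}: {exc}"}
    if not isinstance(data, dict):
        return {"passed": False, "detail": "top-level JSON value is not an object"}
    actual = data.get(key)
    return {"passed": actual == expected, "detail": f"{key}={actual!r}, expected {expected!r}"}
```

Returning a `detail` string alongside the boolean keeps each verdict auditable, which matters later when verifier and judge verdicts are compared.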

Stage 2

Smoke Tests

After unit tests pass, run the smoke pipeline so each verifier endpoint is exercised by a real agent trajectory in a fresh sandbox. One small task per endpoint group; the run produces REPORT.md. Real task generation does not begin until smoke is green.
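
The "smoke is green" gate can be sketched as a simple all-groups-pass rule. The result structure and the report layout below are assumptions for illustration; the real REPORT.md format is not specified here.

```python
from typing import Dict


def smoke_is_green(results: Dict[str, bool]) -> bool:
    """Green only if at least one endpoint group ran and every group passed."""
    return bool(results) and all(results.values())


def render_report(app: str, results: Dict[str, bool]) -> str:
    """Render a small Markdown summary (hypothetical layout)."""
    lines = [f"# Smoke report: {app}", ""]
    for group in sorted(results):
        status = "PASS" if results[group] else "FAIL"
        lines.append(f"- {group}: {status}")
    lines.append("")
    lines.append("Overall: " + ("green" if smoke_is_green(results) else "red"))
    return "\n".join(lines)


if __name__ == "__main__":
    # Group names are made up for the example.
    print(render_report("gedit", {"file-state": True, "preferences": False}))
```

Treating an empty result set as red is a deliberate choice in this sketch: a run that exercised nothing should not unlock task generation.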

Stage 3

Task Generation

Generate proposals without consulting the verifier (the single most important synthesis rule — it prevents the task set from collapsing onto whatever the verifier already supports). Evaluate on complexity and data-generatability, then match to verifiers by adapting the task, extending the verifier, or discarding. Synthesize env/ seed files and target at least three finalized tasks per app.
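
The matching step (adapt the task, extend the verifier, or discard) can be sketched as a decision over which checks a proposal needs. The `Proposal` fields, the `extendable` set, and the priority order are hypothetical; they only illustrate one plausible policy.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Set


class Decision(Enum):
    MATCHED = "matched"
    EXTEND_VERIFIER = "extend_verifier"
    ADAPT_TASK = "adapt_task"
    DISCARD = "discard"


@dataclass
class Proposal:
    description: str
    needed_checks: Set[str]
    # Checks the task could drop without losing its point (hypothetical field).
    droppable_checks: Set[str] = field(default_factory=set)


def match(proposal: Proposal, supported: Set[str], extendable: Set[str]) -> Decision:
    """Decide how a verifier-blind proposal is reconciled with the verifier."""
    missing = proposal.needed_checks - supported
    if not missing:
        return Decision.MATCHED
    if missing <= extendable:
        # Prefer growing the verifier so the task stays as proposed.
        return Decision.EXTEND_VERIFIER
    if missing <= proposal.droppable_checks:
        return Decision.ADAPT_TASK
    return Decision.DISCARD
```

Because proposals are written before the verifier is consulted, `needed_checks` can name endpoints that do not exist yet; in this sketch that is the signal to extend the verifier rather than shrink the task.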

Stage 4

Repair Loop (optional)

In each round, the agent runs the task, and both the programmatic verifier and an LLM-as-judge grade the resulting trajectory. When they disagree, the script edits the verifier or the task spec; when they agree, the run is finalized. Each invocation handles a single task; loop in the shell to repair several.
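
The reconciliation logic for a single task can be sketched as a bounded loop. The callables `run_round` and `repair` and the `max_rounds` cap are assumptions introduced for this example, not part of the documented interface.

```python
from typing import Callable, Tuple


def repair_loop(
    run_round: Callable[[], Tuple[bool, bool]],
    repair: Callable[[bool, bool], None],
    max_rounds: int = 3,  # assumption: the real script may bound rounds differently
) -> Tuple[str, int]:
    """Run rounds until the verifier and the judge agree, repairing in between.

    run_round returns (verifier_passed, judge_passed) for one agent trajectory.
    repair edits the verifier or the task spec after a disagreement.
    """
    for round_no in range(1, max_rounds + 1):
        verifier_passed, judge_passed = run_round()
        if verifier_passed == judge_passed:
            return "finalized", round_no
        repair(verifier_passed, judge_passed)
    return "unresolved", max_rounds


if __name__ == "__main__":
    # Simulated verdicts: disagreement in round 1, agreement in round 2.
    verdicts = iter([(True, False), (False, False)])
    status = repair_loop(lambda: next(verdicts), lambda v, j: print("repairing", v, j))
    print(status)
```

Note that agreement finalizes the run whether both graders pass or both fail; the loop reconciles the graders with each other rather than forcing the agent to succeed.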

Platform Roadmap

OpenComputer started on Linux desktop, but the synthesis pipeline — verifiers, task generation, sandbox provisioning — is platform-agnostic by design. We are actively extending the framework to Windows, macOS, and Android so that a single benchmark can grade computer-use agents across every major form factor.

Status	Platform	Details
Available	Linux	Ubuntu / XFCE · 33 apps, 1,000 tasks
In Progress	Windows	Win32 + UWP · UI Automation verifiers
In Progress	macOS	AppKit · AX API + AppleScript bridges
In Progress	Android	Mobile · adb / UI Automator harness
Growing	More Apps	New apps synthesized every release

Active development: the verifier → smoke → task pipeline is being ported to each platform — Windows and Android sandboxes are already provisioning agents end-to-end in our internal CI.

Applications

The synthesis pipeline has been instantiated for all 33 applications below — each ships with a verifier module, a documented set of inspection endpoints, and finalized tasks. The catalog is actively growing; new apps are added by following the same four-stage pipeline.

Audacity, Blender, Brave, Chrome, CloudCompare, darktable, draw.io, Eclipse, Firefox, FreeCAD, Galculator, gedit, GIMP, Godot 4, Inkscape, Kdenlive, Krita, LO Calc, LO Draw, LO Impress, LO Writer, MuseScore 3, OBS Studio, Obsidian, PCManFM, RenderDoc, Shotcut, Thunderbird, VLC, VS Code, Zoom, Zotero

+ more soon
