AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions

Abstract

Autonomous computer-use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control.

We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where uncertainties in dynamic environments disrupt the execution flow without direct adversarial intent. AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate a variety of desktop tasks using MLLM-based agents and discover that even minor instances of corruption can result in substantial performance degradation.

We further propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness across all 9 corruption categories.

Comparison with Prior Benchmarks

Prior benchmarks either evaluate agents only in clean environments, or study adversarial attacks rather than the everyday, accidental corruptions that derail agents in practice. AgentHijack is the first benchmark that simultaneously offers a realistic virtual-machine environment, common (non-adversarial) corruptions, and user-configurable parameters.

Property	Mind2Web	WebArena	OSWorld	AndroidWorld	InjecAgent	R‑Judge	Agent‑Safety	Env. Distract.	GUI‑Robust	SafeArena	ST‑WebAgent	MobileSafety	WASP	VWA‑Adv	RiOSWorld	AgentHijack (Ours)
# Tasks	2,350	812	369	116	1,054	569	2,000	1,198	5,318	250	234	80	84	200	492	3,321
Environment	QA	BrowserGym	VM	Android	QA	QA	QA	QA	QA	BrowserGym	BrowserGym	Android	BrowserGym	BrowserGym	VM	VM
Multi‑modal	✓	✓	✓	✓	✗	✗	✗	✓	✓	✓	✓	✓	✓	✓	✓	✓
Abnormal Env.	✗	✗	✗	✗	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓	✓
Common Corr.	N/A	N/A	N/A	N/A	✗	✗	✗	✓	✓	✗	✗	✗	✗	✗	✗	✓
Configurable	N/A	N/A	N/A	N/A	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✗	✓
# Categories	N/A	N/A	N/A	N/A	6	5	8	4	7	5	3	5	1	1	13	9

Multi‑modal: agents take multi-modal inputs. Abnormal Env.: environment is corrupted rather than clean. Common Corr.: corruptions are common rather than adversarial. Configurable: corruption parameters are user-controllable. AgentHijack is the only benchmark that satisfies all four desiderata while running on a real virtual machine.

The AgentHijack Benchmark

AgentHijack provides 9 configurable corruption types applied to OSWorld tasks, grouped by perturbation scope: visual disruptors alter the observation space, unexpected operations interfere with state transitions, and environment errors perturb the environment state itself.

Visual disruptors

Pop Ups: Pop-up windows from communication or system software occlude the workspace.

Resolution Change: Resolution shifts caused by hardware switches or display settings.

Marks: On-screen marks from screensavers or animations clutter the canvas.

Subtitle: Floating subtitles from music or video applications overlay the UI.

Multi Apps: Overlapping windows from multiple simultaneously running applications.

Unexpected operations

Accidental Touch: Stray clicks on function bars or buttons triggered by mouse mishandling.

App Minimization: The foreground application is minimized unexpectedly mid-task.

Environment errors

Network Error: A lost network connection blocks online actions.

Verification: An unexpected login or identity verification gates the workflow.
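The taxonomy above can be written down as a small data structure. The following is a minimal sketch, not the benchmark's actual API: the names `Scope`, `Corruption`, `CORRUPTIONS`, and `by_scope` are hypothetical, and the `params` field is only a placeholder, since the configurable parameters of each corruption are not listed here.

```python
from dataclasses import dataclass, field
from enum import Enum


class Scope(Enum):
    VISUAL_DISRUPTOR = "visual disruptor"          # alters the observation space
    UNEXPECTED_OPERATION = "unexpected operation"  # interferes with state transitions
    ENVIRONMENT_ERROR = "environment error"        # perturbs the environment state


@dataclass(frozen=True)
class Corruption:
    name: str
    scope: Scope
    # Hypothetical placeholder: the text says corruptions are configurable
    # but does not enumerate the parameters, so none are filled in here.
    params: dict = field(default_factory=dict)


CORRUPTIONS = [
    Corruption("Pop Ups", Scope.VISUAL_DISRUPTOR),
    Corruption("Resolution Change", Scope.VISUAL_DISRUPTOR),
    Corruption("Marks", Scope.VISUAL_DISRUPTOR),
    Corruption("Subtitle", Scope.VISUAL_DISRUPTOR),
    Corruption("Multi Apps", Scope.VISUAL_DISRUPTOR),
    Corruption("Accidental Touch", Scope.UNEXPECTED_OPERATION),
    Corruption("App Minimization", Scope.UNEXPECTED_OPERATION),
    Corruption("Network Error", Scope.ENVIRONMENT_ERROR),
    Corruption("Verification", Scope.ENVIRONMENT_ERROR),
]


def by_scope(scope):
    """Return the names of all corruption types in one perturbation scope."""
    return [c.name for c in CORRUPTIONS if c.scope is scope]


for scope in Scope:
    print(f"{scope.value}: {', '.join(by_scope(scope))}")
```

Grouping by scope rather than by individual type mirrors how the failure modes are reported later: one observation per scope.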

Figure 2. Case study of various corruptions. UI-TARS-1.5-7B exhibits grounding deviations under visual disruptors, misattributes the consequences of prior actions under unexpected operations, and makes meaningless attempts when facing environment errors.

Three Weaknesses We Uncover

Across 9 representative MLLM-based agents, we identify three recurring failure modes that prior evaluations have largely overlooked.

Obs. 1

Grounding is fragile under visual disruptors

Agents click on pop-ups even when targets remain visible, and deviate from targets under resolution change, marks, subtitles, or multi-app overlap.

Obs. 2

Decisions are derailed by unexpected operations

When accidental touches or app minimizations occur, agents mis-attribute the state change to their own action and chase the triggered content.

Obs. 3

Initial environment errors go undetected

Agents assume the start state is normal and keep executing inside broken environments, such as a dropped network connection, a verification screen, or a missing password.

AgentHijack-Agent

AgentHijack-Agent is a two-role framework that pairs a grounding-enhanced action generator with an onlooker that summarizes behavior and checks the environment before and during execution.

Training

DA-GRPO

Data-Augmented Group Relative Policy Optimization rolls out across corrupted variants of the same environment, with experience replay to preserve sparse success signals.
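To make the idea concrete, here is a minimal sketch of the group-relative advantage used by GRPO-style training, applied to rollouts of one task under different corruptions. The normalization (reward minus group mean, divided by group standard deviation) is the standard GRPO form; how DA-GRPO actually uses experience replay is not specified above, so `SuccessReplay` and `build_group` show just one plausible reading and all names are hypothetical.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class SuccessReplay:
    """Keeps recent successful rollouts so sparse successes are not lost."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.items = []

    def add(self, rollout):
        self.items.append(rollout)
        self.items = self.items[-self.capacity:]

    def sample(self, rng):
        return rng.choice(self.items)


def build_group(rollouts, replay, rng):
    """Assumed policy: if every rollout in the group failed, swap the last
    one for a stored success so the group carries a learning signal."""
    for rollout in rollouts:
        if rollout["reward"] > 0:
            replay.add(rollout)
    if replay.items and all(r["reward"] == 0 for r in rollouts):
        return rollouts[:-1] + [replay.sample(rng)]
    return list(rollouts)


rng = random.Random(0)
replay = SuccessReplay()
# Each rollout is the same task executed under a different corruption.
first = build_group(
    [{"corruption": "Pop Ups", "reward": 1.0},
     {"corruption": "Marks", "reward": 0.0}],
    replay, rng,
)
second = build_group(
    [{"corruption": "Subtitle", "reward": 0.0},
     {"corruption": "Verification", "reward": 0.0}],
    replay, rng,
)
advantages = group_relative_advantages([r["reward"] for r in second])
print(advantages)
```

Without the replayed success, the second group would have identical rewards and every advantage would be zero, which is the sparse-signal problem the replay is meant to address. A real implementation would also need to account for replayed rollouts being off-policy.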

Onlooker

Behavior Summarization

An auxiliary agent compares before/after screenshots at each step, producing concise change descriptions so that the action generator's context reflects what actually happened rather than what it intended.

Onlooker

Environment Checking

Before execution the onlooker validates the initial state against a repository of known errors (network, verification, login) and triggers reinitialization rather than letting the agent flounder.
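The check-then-reinitialize logic can be sketched as follows. This is an illustration only: in the framework the onlooker is a model inspecting the initial screen, whereas here a keyword match over a text description of the screen stands in for it. The repository entries, their keywords, and the function names are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnownError:
    name: str
    keywords: tuple


# Hypothetical repository; the three categories come from the text,
# the keywords are invented for illustration.
ERROR_REPOSITORY = (
    KnownError("network", ("no internet", "connection lost")),
    KnownError("verification", ("verify you are human", "verification code")),
    KnownError("login", ("sign in to continue", "enter your password")),
)


def detect_error(screen_text):
    """Return the first known error matching the screen text, or None."""
    lowered = screen_text.lower()
    for error in ERROR_REPOSITORY:
        if any(keyword in lowered for keyword in error.keywords):
            return error
    return None


def ensure_clean_start(observe, reinitialize, max_retries=3):
    """Check the initial state; reinitialize on a known error.

    observe() returns a text description of the current screen and
    reinitialize(error) resets the environment. Returns True once the
    start state looks clean, False if it is still broken after
    max_retries reinitializations.
    """
    for attempt in range(max_retries + 1):
        error = detect_error(observe())
        if error is None:
            return True
        if attempt < max_retries:
            reinitialize(error)
    return False
```

A usage example with a toy environment whose first start is broken:

```python
screens = iter(["No internet connection", "Desktop ready"])
resets = []
ok = ensure_clean_start(lambda: next(screens), lambda e: resets.append(e.name))
print(ok, resets)
```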

Figure 3. Pipeline of AgentHijack-Agent. The onlooker first validates the initial environment; the action generator then iteratively outputs the next action, conditioned on the onlooker's behavioral summaries of historical screenshots.
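The control flow in the pipeline description can be sketched as a short loop. This is a structural sketch under stated assumptions, not the released implementation: every callable (`environment_ok`, `observe`, `act`, `summarize`, `propose_action`), the `"DONE"` sentinel, and the step limit are hypothetical stand-ins for the onlooker, the environment, and the action generator.

```python
def run_episode(task, environment_ok, observe, act, summarize,
                propose_action, max_steps=15):
    """Run one task with an onlooker in the loop.

    environment_ok(): onlooker's check of the initial state.
    summarize(before, action, after): onlooker's description of what changed.
    propose_action(task, observation, history): action generator.
    """
    if not environment_ok():
        raise RuntimeError("initial environment failed the onlooker's check")

    history = []  # onlooker summaries, not the agent's stated intentions
    before = observe()
    for _ in range(max_steps):
        action = propose_action(task, before, history)
        if action == "DONE":
            break
        act(action)
        after = observe()
        # Record what actually happened, as judged from before/after states.
        history.append(summarize(before, action, after))
        before = after
    return history


# Toy environment: the "screen" is a counter that each action increments.
state = {"n": 0}
history = run_episode(
    task="click twice",
    environment_ok=lambda: True,
    observe=lambda: state["n"],
    act=lambda action: state.update(n=state["n"] + 1),
    summarize=lambda before, action, after: f"{action}: {before} -> {after}",
    propose_action=lambda task, obs, hist: "click" if len(hist) < 2 else "DONE",
)
print(history)
```

Because the history holds observed changes rather than intended ones, a state change caused by an accidental touch or an unexpected minimization would show up as a mismatch between the action and its summary instead of being attributed to the agent's own action.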

Main Results

We benchmark 9 agents, spanning open-source MLLMs, closed-source MLLMs, and specialized GUI agents, on all 9 corruption types. UI-TARS-1.5-7B, the strongest baseline, drops from 24.21% in the clean setting to 18.74% in the Average column of Table 2; our framework raises that average to 22.89%, recovering most of the gap.

Clean baseline (UI-TARS-1.5-7B): 24.21%

Same model, average under corruption: 18.74%

AgentHijack-Agent (Ours): 22.89% (+4.15)

Agent	Clean	Pop ups	Resolution	Marks	Subtitle	Multi Apps	Accidental Touch	App Min.	Network Err.	Verification	Average
Open-source Multimodal Large Language Models
GLM-4.5V	4.24%	0.86%	3.68%	2.52%	3.68%	3.68%	1.98%	4.24%	2.52%	3.12%	3.05%
Llama-3.2-90B-Vision	3.97%	0.77%	1.64%	1.93%	1.87%	1.59%	1.45%	0.00%	1.45%	1.64%	1.63%
Qwen2.5-VL-72B	10.99%	1.86%	6.38%	9.45%	10.29%	5.79%	7.48%	8.32%	7.48%	6.63%	7.47%
Closed-source Multimodal Large Language Models
GPT-4o	5.38%	1.44%	4.82%	2.56%	3.66%	3.68%	3.12%	4.82%	4.24%	3.25%	3.69%
Claude-3.7-Sonnet	4.23%	1.41%	2.54%	2.82%	2.54%	1.97%	2.54%	2.25%	1.69%	2.54%	2.45%
Gemini-2.5-Pro	8.11%	5.20%	6.98%	6.64%	6.28%	2.76%	4.61%	2.78%	7.02%	7.81%	5.82%
State-of-the-Art GUI Agents
UI-TARS-7B-DPO	16.20%	13.09%	10.03%	13.41%	15.59%	13.85%	13.97%	13.61%	13.33%	8.31%	13.14%
UI-TARS-72B-DPO	22.38%	15.51%	14.32%	20.36%	19.32%	18.94%	14.44%	15.19%	19.76%	9.42%	16.96%
UI-TARS-1.5-7B (baseline)	24.21%	10.28%	11.69%	23.31%	22.75%	19.25%	22.54%	20.84%	22.02%	10.48%	18.74%
AgentHijack-Agent
Ours	27.80%	21.51%	12.53%	27.28%	26.45%	21.17%	24.37%	24.51%	23.09%	20.15%	22.89%
Δ	+3.59%	+11.23%	+0.84%	+3.97%	+3.70%	+1.92%	+1.83%	+3.67%	+1.07%	+9.67%	+4.15%

Table 2. Success rate of MLLM-based agents in the clean setting and across nine corruption types. Deltas (Δ) indicate the improvement of AgentHijack-Agent over the top-performing baseline (UI-TARS-1.5-7B), in percentage points.
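The Δ row and the Average column can be re-derived from the table itself. The short check below uses only the numbers in the UI-TARS-1.5-7B and Ours rows; the variable names are illustrative. It suggests that the reported Average values correspond to a mean over all ten columns, clean included, and it also prints the mean over the nine corrupted columns alone for comparison.

```python
COLUMNS = ["Clean", "Pop ups", "Resolution", "Marks", "Subtitle", "Multi Apps",
           "Accidental Touch", "App Min.", "Network Err.", "Verification"]

# Success rates (%) copied from the UI-TARS-1.5-7B and Ours rows.
baseline = [24.21, 10.28, 11.69, 23.31, 22.75, 19.25, 22.54, 20.84, 22.02, 10.48]
ours = [27.80, 21.51, 12.53, 27.28, 26.45, 21.17, 24.37, 24.51, 23.09, 20.15]

# Per-column improvement in percentage points.
deltas = [round(o - b, 2) for o, b in zip(ours, baseline)]

# Mean over all ten columns (clean + nine corruptions).
avg_all_baseline = round(sum(baseline) / len(baseline), 2)
avg_all_ours = round(sum(ours) / len(ours), 2)

# Mean over the nine corrupted columns only.
avg_corrupted_baseline = round(sum(baseline[1:]) / 9, 2)
avg_corrupted_ours = round(sum(ours[1:]) / 9, 2)

for name, delta in zip(COLUMNS, deltas):
    print(f"{name:>16}: +{delta:.2f}")
print("mean over all ten columns:", avg_all_baseline, avg_all_ours)
print("mean over nine corruptions:", avg_corrupted_baseline, avg_corrupted_ours)
```

The ten-column means come out at 18.74 and 22.89, matching the Average column, while the nine-corruption means are 18.13 and 22.34. The largest gains are on Pop ups (+11.23) and Verification (+9.67), the two settings where the baseline loses the most.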