ICML 2026 · Computer-Use Agents

AgentHijack: Benchmarking Computer Use Agent
Robustness to Common Environment Corruptions

Evaluating & improving MLLM-based agents under 9 realistic corruptions

Jingwei Sun¹, Jianing Zhu², Yuanyi Li¹, Tongliang Liu³, Xia Hu⁴, Bo Han¹
¹ TMLR Group, Hong Kong Baptist University; ² The University of Texas at Austin; ³ Sydney AI Center, The University of Sydney; ⁴ Shanghai AI Laboratory
Correspondence: Bo Han <bhanml@comp.hkbu.edu.hk>
Corruption Types: 9
Configurable Tasks: 3,321
Baseline Drop Under Corruption: −5.47%
Avg. Gain (Ours): +4.15%
Figure 1. AgentHijack generates configurable corrupted scenarios across 9 common corruption types, including pop-ups, identity verification, and resolution changes. State-of-the-art agents (UI-TARS series) degrade significantly under these corruptions, while our framework improves on all nine.

Abstract

Autonomous computer-use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control.

We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where uncertainties in dynamic environments disrupt the execution flow without direct adversarial intent. AgentHijack introduces 9 configurable common corruptions to replicate realistic imperfect scenarios. We evaluate MLLM-based agents on a variety of desktop tasks and find that even minor corruptions can cause substantial performance degradation.

We further propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness across all 9 corruption categories.

Comparison with Prior Benchmarks

Prior benchmarks either evaluate agents only in clean environments, or study adversarial attacks rather than the everyday, accidental corruptions that derail agents in practice. AgentHijack is the first benchmark that simultaneously offers a realistic virtual-machine environment, common (non-adversarial) corruptions, and user-configurable parameters.

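| Property | Mind2Web | WebArena | OSWorld | AndroidWorld | InjecAgent | R‑Judge | Agent‑Safety | Env. Distract. | GUI‑Robust | SafeArena | ST‑WebAgent | MobileSafety | WASP | VWA‑Adv | RiOSWorld | AgentHijack (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # Tasks | 2,350 | 812 | 369 | 116 | 1,054 | 569 | 2,000 | 1,198 | 5,318 | 250 | 234 | 80 | 84 | 200 | 492 | 3,321 |
| Environment | QA | BrowserGym | VM | Android | QA | QA | QA | QA | QA | BrowserGym | BrowserGym | Android | BrowserGym | BrowserGym | VM | VM |
| Multi‑modal | | | | | | | | | | | | | | | | |
| Abnormal Env. | | | | | | | | | | | | | | | | |
| Common Corr. | N/A | N/A | N/A | N/A | | | | | | | | | | | | |
| Configurable | N/A | N/A | N/A | N/A | | | | | | | | | | | | |
| # Categories | N/A | N/A | N/A | N/A | 6 | 5 | 8 | 4 | 7 | 5 | 3 | 5 | 1 | 1 | 13 | 9 |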

Multi‑modal: agents take multi-modal inputs. Abnormal Env.: environment is corrupted rather than clean. Common Corr.: corruptions are common rather than adversarial. Configurable: corruption parameters are user-controllable. AgentHijack is the only benchmark that satisfies all four desiderata while running on a real virtual machine.

The AgentHijack Benchmark

AgentHijack provides 9 configurable corruption types applied to OSWorld tasks, grouped by perturbation scope: visual disruptors alter the observation space, unexpected operations interfere with state transitions, and environment errors perturb the environment state itself.

Visual disruptor

Pop Ups

Pop-up windows from communication or system software occlude the workspace.

Visual disruptor

Resolution Change

Resolution shifts caused by hardware switches or display settings.

Visual disruptor

Marks

On-screen marks from screensavers or animations clutter the canvas.

Visual disruptor

Subtitle

Floating subtitles from music or video applications overlay UI.

Visual disruptor

Multi Apps

Overlapping windows from multiple simultaneously running applications.

Unexpected operation

Accidental Touch

Stray clicks on function bars or buttons triggered by mouse mishandling.

Unexpected operation

App Minimization

Foreground application minimized unexpectedly mid-task.

Environment error

Network Error

Lost network connection blocking online actions.

Environment error

Verification

Unexpected login or identity verification gates the workflow.
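The taxonomy above can be written down as a small lookup table. The following is a minimal sketch, not the benchmark's actual configuration schema: the identifiers (`Scope`, `CORRUPTIONS`, `CorruptionConfig`) and the parameter key `scale` are hypothetical names chosen for illustration, while the nine corruption names and three scopes come from the cards above.

```python
from dataclasses import dataclass, field
from enum import Enum


class Scope(Enum):
    VISUAL_DISRUPTOR = "visual disruptor"          # alters the observation space
    UNEXPECTED_OPERATION = "unexpected operation"  # interferes with state transitions
    ENVIRONMENT_ERROR = "environment error"        # perturbs the environment state


# Corruption name -> scope, as listed on this page (snake_case keys are illustrative).
CORRUPTIONS = {
    "pop_ups": Scope.VISUAL_DISRUPTOR,
    "resolution_change": Scope.VISUAL_DISRUPTOR,
    "marks": Scope.VISUAL_DISRUPTOR,
    "subtitle": Scope.VISUAL_DISRUPTOR,
    "multi_apps": Scope.VISUAL_DISRUPTOR,
    "accidental_touch": Scope.UNEXPECTED_OPERATION,
    "app_minimization": Scope.UNEXPECTED_OPERATION,
    "network_error": Scope.ENVIRONMENT_ERROR,
    "verification": Scope.ENVIRONMENT_ERROR,
}


@dataclass(frozen=True)
class CorruptionConfig:
    """Hypothetical per-task corruption setting (not the benchmark's real schema)."""
    name: str
    params: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject names that are not one of the nine corruption types.
        if self.name not in CORRUPTIONS:
            raise ValueError(f"unknown corruption: {self.name!r}")

    @property
    def scope(self) -> Scope:
        return CORRUPTIONS[self.name]


def group_by_scope() -> dict:
    """Group corruption names by perturbation scope."""
    grouped = {scope: [] for scope in Scope}
    for name, scope in CORRUPTIONS.items():
        grouped[scope].append(name)
    return grouped
```

For example, `CorruptionConfig("resolution_change", {"scale": 0.75}).scope` evaluates to `Scope.VISUAL_DISRUPTOR`; the meaning of `scale` is an assumption, loosely modeled on the resolution scaling ratio varied in the ablations.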

Figure 2. Case study of various corruptions. UI-TARS-1.5-7B exhibits grounding deviations under visual disruptors, mis-attributes consequences of prior actions under unexpected operations, and makes meaningless attempts when facing environmental errors.

Three Weaknesses We Uncover

Across 9 representative MLLM-based agents, we identify recurring failure modes that the community has long overlooked.

Obs. 1

Grounding is fragile under visual disruptors

Agents click on pop-ups even when targets remain visible, and deviate from targets under resolution change, marks, subtitles, or multi-app overlap.

Obs. 2

Decisions are derailed by unexpected operations

When accidental touches or app minimizations occur, agents mis-attribute the state change to their own action and chase the triggered content.

Obs. 3

Initial environment errors go undetected

Agents assume the start state is normal and keep executing inside broken environments — network down, verification screen, missing password.

AgentHijack-Agent

A two-role framework that pairs an action generator with enhanced grounding and an onlooker that summarizes behavior and checks the environment before and during execution.

Training

DA-GRPO

Data-Augmented Group Relative Policy Optimization rolls out across corrupted variants of the same environment, with experience replay to preserve sparse success signals.
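To make the idea concrete, here is a minimal sketch of group-relative advantages computed over rollouts from corrupted variants of one task. It uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation); the page does not give DA-GRPO's exact objective, so the replay rule below (swapping one cached success into an all-failure group) is an assumption, and `SuccessReplay` and `build_group` are hypothetical names.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class SuccessReplay:
    """Hypothetical buffer of past successful rollouts, keyed by task id."""

    def __init__(self):
        self._store = {}

    def add(self, task_id, rollout):
        self._store.setdefault(task_id, []).append(rollout)

    def sample(self, task_id, rng=random):
        items = self._store.get(task_id)
        return rng.choice(items) if items else None


def build_group(task_id, rollouts, replay):
    """rollouts: list of (variant_name, reward) pairs from corrupted variants of one task.

    If every rollout failed, all advantages would be zero and the group would
    carry no learning signal, so (as an assumed use of replay) one cached
    success replaces the last rollout.
    """
    group = list(rollouts)
    for rollout in group:
        if rollout[1] > 0:
            replay.add(task_id, rollout)  # remember sparse successes
    if all(reward == 0 for _, reward in group):
        cached = replay.sample(task_id)
        if cached is not None:
            group[-1] = cached
    advantages = group_relative_advantages([reward for _, reward in group])
    return group, advantages
```

With one success among four rollouts, the successful variant receives a positive advantage and the failures a negative one; without a cached success, an all-failure group yields all-zero advantages, which is the sparse-signal case replay is meant to soften.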

Onlooker

Behavior Summarization

An auxiliary agent compares before/after screenshots each step, producing concise change descriptions so the action generator's context is grounded in what actually happened — not what it intended.

Onlooker

Environment Checking

Before execution the onlooker validates the initial state against a repository of known errors (network, verification, login) and triggers reinitialization rather than letting the agent flounder.
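The check-then-reinitialize behavior can be sketched as follows. In the paper the onlooker is a model inspecting the initial state; this sketch substitutes simple phrase matching on visible screen text, purely for illustration. The phrase lists, `detect_initial_error`, `ensure_clean_start`, and the `observe`/`reinitialize` callbacks are all hypothetical.

```python
# Hypothetical repository of known initial-state errors (categories from the text above).
KNOWN_ERRORS = {
    "network": ("no internet", "network error", "connection lost"),
    "verification": ("verify you are human", "captcha", "identity verification"),
    "login": ("sign in", "log in", "enter your password"),
}


def detect_initial_error(screen_text):
    """Return the first matching error category, or None if the state looks normal."""
    lowered = screen_text.lower()
    for category, phrases in KNOWN_ERRORS.items():
        if any(phrase in lowered for phrase in phrases):
            return category
    return None


def ensure_clean_start(observe, reinitialize, max_retries=3):
    """Check the initial state and reinitialize on a known error.

    observe() returns the visible screen text; reinitialize() resets the
    environment. Returns (ok, history), where history records the error
    category found on each attempt (None means the state passed).
    """
    history = []
    for attempt in range(max_retries + 1):
        error = detect_initial_error(observe())
        history.append(error)
        if error is None:
            return True, history
        if attempt < max_retries:
            reinitialize()  # try again from a fresh environment
    return False, history
```

Returning the per-attempt history rather than a bare boolean keeps the failure reason available to the caller, so a task that never reaches a clean start can be reported as an environment error instead of an agent failure.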

Figure 3. Pipeline of AgentHijack-Agent. The onlooker first validates the initial environment; the action generator then iteratively outputs the next action, conditioned on the onlooker's behavioral summaries of historical screenshots.
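The control flow in Figure 3 can be sketched as a loop in which the action generator is conditioned on the onlooker's summaries rather than on its own intended effects. This is a schematic under stated assumptions: every callable here (`observe`, `act`, `check_environment`, `generate_action`, `summarize_change`) is a hypothetical stand-in for a model or environment call, observations are plain strings instead of screenshots, and the `"DONE"` sentinel and step limit are illustrative choices. Reinitialization after a failed check is left to the caller.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    actions: List[str] = field(default_factory=list)
    summaries: List[str] = field(default_factory=list)


def run_episode(
    instruction: str,
    observe: Callable[[], str],
    act: Callable[[str], None],
    check_environment: Callable[[str], bool],
    generate_action: Callable[[str, str, List[str]], str],
    summarize_change: Callable[[str, str, str], str],
    max_steps: int = 15,
) -> Trajectory:
    """Onlooker checks the start state, then summarizes each step for the generator."""
    traj = Trajectory()
    before = observe()
    # Step 0: the onlooker validates the initial environment.
    if not check_environment(before):
        traj.summaries.append("initial environment failed the check")
        return traj
    for _ in range(max_steps):
        # The generator sees the instruction, the current observation,
        # and the onlooker's summaries of what actually changed so far.
        action = generate_action(instruction, before, traj.summaries)
        traj.actions.append(action)
        if action == "DONE":
            break
        act(action)
        after = observe()
        # The onlooker describes the observed change between the two states.
        traj.summaries.append(summarize_change(before, action, after))
        before = after
    return traj
```

Because each summary is derived from the before and after observations, a state change caused by an accidental touch or a minimized window shows up in the history as something that happened, not as the presumed result of the agent's last action.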

Main Results

We benchmark 9 open-source, closed-source, and specialized GUI agents on all 9 corruption types. UI-TARS-1.5-7B, the strongest baseline, drops from 24.21% in the clean setting to an 18.74% average in Table 2 (−5.47 points); our framework raises that average to 22.89% (+4.15 points), recovering most of the gap.

Clean baseline (UI-TARS-1.5-7B): 24.21%
Same model, average under corruption: 18.74%
AgentHijack-Agent (Ours): 22.89% (+4.15)
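| Agent | Clean | Pop ups | Resolution | Marks | Subtitle | Multi Apps | Accidental Touch | App Min. | Network Err. | Verification | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-source Multimodal Large Language Models | | | | | | | | | | | |
| GLM-4.5V | 4.24% | 0.86% | 3.68% | 2.52% | 3.68% | 3.68% | 1.98% | 4.24% | 2.52% | 3.12% | 3.05% |
| Llama-3.2-90B-Vision | 3.97% | 0.77% | 1.64% | 1.93% | 1.87% | 1.59% | 1.45% | 0.00% | 1.45% | 1.64% | 1.63% |
| Qwen2.5-VL-72B | 10.99% | 1.86% | 6.38% | 9.45% | 10.29% | 5.79% | 7.48% | 8.32% | 7.48% | 6.63% | 7.47% |
| Closed-source Multimodal Large Language Models | | | | | | | | | | | |
| GPT-4o | 5.38% | 1.44% | 4.82% | 2.56% | 3.66% | 3.68% | 3.12% | 4.82% | 4.24% | 3.25% | 3.69% |
| Claude-3.7-Sonnet | 4.23% | 1.41% | 2.54% | 2.82% | 2.54% | 1.97% | 2.54% | 2.25% | 1.69% | 2.54% | 2.45% |
| Gemini-2.5-Pro | 8.11% | 5.20% | 6.98% | 6.64% | 6.28% | 2.76% | 4.61% | 2.78% | 7.02% | 7.81% | 5.82% |
| State-of-the-Art GUI Agents | | | | | | | | | | | |
| UI-TARS-7B-DPO | 16.20% | 13.09% | 10.03% | 13.41% | 15.59% | 13.85% | 13.97% | 13.61% | 13.33% | 8.31% | 13.14% |
| UI-TARS-72B-DPO | 22.38% | 15.51% | 14.32% | 20.36% | 19.32% | 18.94% | 14.44% | 15.19% | 19.76% | 9.42% | 16.96% |
| UI-TARS-1.5-7B (baseline) | 24.21% | 10.28% | 11.69% | 23.31% | 22.75% | 19.25% | 22.54% | 20.84% | 22.02% | 10.48% | 18.74% |
| AgentHijack-Agent | | | | | | | | | | | |
| Ours | 27.80% | 21.51% | 12.53% | 27.28% | 26.45% | 21.17% | 24.37% | 24.51% | 23.09% | 20.15% | 22.89% |
| Δ | +3.59% | +11.23% | +0.84% | +3.97% | +3.70% | +1.92% | +1.83% | +3.67% | +1.07% | +9.67% | +4.15% |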
Table 2. Success rate of various MLLM-based agents in the clean setting and across nine corruption types. Deltas (Δ) give the improvement of AgentHijack-Agent over the top-performing baseline (UI-TARS-1.5-7B), in percentage points.
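The Average and Δ entries can be re-derived from the per-column success rates. The short check below uses only the two rows copied from Table 2. Going by the arithmetic, the reported Average matches the mean over all ten columns (Clean included) rather than over the nine corrupted columns alone; that reading is inferred from the numbers and is not stated on the page. The variable and function names are illustrative.

```python
# Success rates (%) from Table 2, in column order:
# Clean, Pop ups, Resolution, Marks, Subtitle, Multi Apps,
# Accidental Touch, App Min., Network Err., Verification
BASELINE = [24.21, 10.28, 11.69, 23.31, 22.75, 19.25, 22.54, 20.84, 22.02, 10.48]
OURS = [27.80, 21.51, 12.53, 27.28, 26.45, 21.17, 24.37, 24.51, 23.09, 20.15]


def average(row):
    """Mean success rate over the given columns, rounded to two decimals."""
    return round(sum(row) / len(row), 2)


def deltas(ours, base):
    """Per-column improvement in percentage points."""
    return [round(o - b, 2) for o, b in zip(ours, base)]


print(average(BASELINE), average(OURS))          # 18.74 22.89 (all ten columns)
print(average(BASELINE[1:]), average(OURS[1:]))  # 18.13 22.34 (corrupted columns only)
print(deltas(OURS, BASELINE))
```

On either reading the framework improves on the baseline by roughly four points on average, with the largest per-column gains on pop-ups (+11.23) and verification (+9.67).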
(a) Resolution scaling ratio
(b) Number of UI marks
(c) Frequency of accidental touch
(d) Frequency of app minimization
Figure 4. Ablation across corruption intensities. Although agent performance declines as intensity grows, AgentHijack-Agent consistently outperforms the base model.
(a) Pop-up content
(b) Subtitle content
(c) Shape of UI marks
(d) Color of UI marks
Figure 5. Ablation across corruption content. Performance fluctuates with content variants, but our framework maintains a steady improvement.
(a) Subtitle screen location
(b) Step of accidental touch
(c) Step of app minimization
(d) Necessity of each module
Figure 6. Ablation across corruption locations and the necessity of each module. Performance is robust to where corruptions occur, and both the RL grounding and the onlooker contribute non-trivial gains.
(a) Behavior summarization examples
(b) Performance across onlooker models
Figure 7. More capable onlooker models yield greater gains; we adopt the fine-tuned UI-TARS-1.5-7B as the default for compute efficiency.

Case Studies of AgentHijack-Agent

Under the same set of corrupted tasks, our agent maintains accurate grounding, correctly attributes unexpected state changes, and adaptively recovers from environment errors — rather than burning steps on meaningless attempts.

Figure 8. AgentHijack-Agent dismisses pop-ups without redundant clicks, stays on task after accidental touches, and verifies environmental preconditions before acting.

Citation

If you find AgentHijack useful, please cite our paper.

@inproceedings{sun2026agenthijack,
  title     = {AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions},
  author    = {Jingwei Sun and Jianing Zhu and Yuanyi Li and Tongliang Liu and Xia Hu and Bo Han},
  booktitle = {Forty-third International Conference on Machine Learning},
  year      = {2026},
  url       = {https://openreview.net/forum?id=0H5Im3Xvuf}
}