Evaluating & improving MLLM-based agents under 9 realistic corruptions
Autonomous computer-use agents powered by multimodal large language models (MLLMs) are emerging as capable assistants for completing complex digital workflows. However, real-world execution environments are far from ideal: pop-ups, resolution changes, and competing applications frequently interfere with agent perception and control.
We introduce AgentHijack, a benchmark designed to evaluate the robustness of computer-use agents under common corruptions, where uncertainties in dynamic environments disrupt the execution flow without direct adversarial intent. AgentHijack provides 9 configurable common corruptions that replicate realistic, imperfect scenarios. We evaluate MLLM-based agents on a variety of desktop tasks and find that even minor corruptions can cause substantial performance degradation.
We further propose AgentHijack-Agent, a framework that integrates an action generator with enhanced grounding capabilities and an onlooker responsible for behavior summarization and environment checking. Extensive experiments validate its effectiveness across all 9 corruption categories.
Prior benchmarks either evaluate agents only in clean environments, or study adversarial attacks rather than the everyday, accidental corruptions that derail agents in practice. AgentHijack is the first benchmark that simultaneously offers a realistic virtual-machine environment, common (non-adversarial) corruptions, and user-configurable parameters.
| Property | Mind2Web | WebArena | OSWorld | AndroidWorld | InjecAgent | R‑Judge | Agent‑Safety | Env. Distract. | GUI‑Robust | SafeArena | ST‑WebAgent | MobileSafety | WASP | VWA‑Adv | RiOSWorld | AgentHijack (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # Tasks | 2,350 | 812 | 369 | 116 | 1,054 | 569 | 2,000 | 1,198 | 5,318 | 250 | 234 | 80 | 84 | 200 | 492 | 3,321 |
| Environment | QA | BrowserGym | VM | Android | QA | QA | QA | QA | QA | BrowserGym | BrowserGym | Android | BrowserGym | BrowserGym | VM | VM |
| Multi‑modal | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Abnormal Env. | ✗ | ✗ | ✗ | ✗ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Common Corr. | N/A | N/A | N/A | N/A | ✗ | ✗ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| Configurable | N/A | N/A | N/A | N/A | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ |
| # Categories | N/A | N/A | N/A | N/A | 6 | 5 | 8 | 4 | 7 | 5 | 3 | 5 | 1 | 1 | 13 | 9 |
Multi‑modal: agents take multi-modal inputs. Abnormal Env.: environment is corrupted rather than clean. Common Corr.: corruptions are common rather than adversarial. Configurable: corruption parameters are user-controllable. AgentHijack is the only benchmark that satisfies all four desiderata while running on a real virtual machine.
AgentHijack provides 9 configurable corruption types applied to OSWorld tasks, grouped by perturbation scope: visual disruptors alter the observation space, unexpected operations interfere with state transitions, and environment errors perturb the environment state itself.
Pop-ups: pop-up windows from communication or system software occlude the workspace.
Resolution: resolution shifts caused by hardware switches or display settings.
Marks: on-screen marks from screensavers or animations clutter the canvas.
Subtitle: floating subtitles from music or video applications overlay the UI.
Multi Apps: overlapping windows from multiple simultaneously running applications.
Accidental Touch: stray clicks on function bars or buttons triggered by mouse mishandling.
App Minimization: the foreground application is minimized unexpectedly mid-task.
Network Error: a lost network connection blocks online actions.
Verification: an unexpected login or identity-verification prompt gates the workflow.
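The taxonomy above can be written down as a small configuration table. The following is a minimal sketch, not the benchmark's actual API: the class names, the identifier strings, and the single `severity` knob are hypothetical, and the 5/2/2 assignment of corruptions to scopes is inferred from the failure-mode descriptions on this page rather than stated explicitly.

```python
from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    VISUAL_DISRUPTOR = "visual disruptor"          # alters the observation space
    UNEXPECTED_OPERATION = "unexpected operation"  # interferes with state transitions
    ENVIRONMENT_ERROR = "environment error"        # perturbs the environment state


@dataclass(frozen=True)
class Corruption:
    name: str
    scope: Scope
    # Hypothetical knob: the page says parameters are user-configurable
    # but does not list them, so one generic "severity" stands in here.
    severity: float = 0.5


# Scope assignment is an assumption inferred from the page's failure modes.
CORRUPTIONS = [
    Corruption("pop_ups", Scope.VISUAL_DISRUPTOR),
    Corruption("resolution", Scope.VISUAL_DISRUPTOR),
    Corruption("marks", Scope.VISUAL_DISRUPTOR),
    Corruption("subtitle", Scope.VISUAL_DISRUPTOR),
    Corruption("multi_apps", Scope.VISUAL_DISRUPTOR),
    Corruption("accidental_touch", Scope.UNEXPECTED_OPERATION),
    Corruption("app_minimization", Scope.UNEXPECTED_OPERATION),
    Corruption("network_error", Scope.ENVIRONMENT_ERROR),
    Corruption("verification", Scope.ENVIRONMENT_ERROR),
]


def by_scope(scope: Scope) -> list[str]:
    """Return the names of all corruptions in one perturbation scope."""
    return [c.name for c in CORRUPTIONS if c.scope is scope]


def corrupted_task_count(n_base_tasks: int) -> int:
    """Number of (task, corruption) pairs if every corruption is applied to every task."""
    return n_base_tasks * len(CORRUPTIONS)
```

With the 369 OSWorld tasks listed in the comparison table, `corrupted_task_count(369)` gives 3,321, which matches the task count reported for AgentHijack and suggests that each corruption is applied to each base task.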
Across 9 representative MLLM-based agents, we identify recurring failure modes that the community has long overlooked.
Agents click on pop-ups even when targets remain visible, and deviate from targets under resolution change, marks, subtitles, or multi-app overlap.
When accidental touches or app minimizations occur, agents mis-attribute the state change to their own action and chase the triggered content.
Agents assume the start state is normal and keep executing inside broken environments, such as a dropped network connection, a verification screen, or a missing password.
AgentHijack-Agent is a two-role framework: an action generator with enhanced grounding is paired with an onlooker that summarizes behavior and checks the environment before and during execution.
Data-Augmented Group Relative Policy Optimization rolls out across corrupted variants of the same environment, with experience replay to preserve sparse success signals.
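The core idea of rolling out across corrupted variants and scoring each rollout relative to its group can be sketched as follows. This is only an illustration under stated assumptions, not the authors' training code: it shows the standard GRPO-style normalization of rewards within a group, plus one possible replay strategy in which a stored success is mixed into a group whose rollouts all failed. `SuccessReplay`, `build_group`, and their parameters are hypothetical names.

```python
import random
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the mean and std of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


class SuccessReplay:
    """Hypothetical per-task buffer that keeps the most recent successful trajectories."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.items = []

    def add(self, trajectory, reward):
        if reward > 0:  # store only successes, since they are the sparse signal
            self.items.append((trajectory, reward))
            self.items = self.items[-self.capacity:]

    def sample(self, k, rng):
        return rng.sample(self.items, min(k, len(self.items)))


def build_group(rollouts, replay, rng, n_replay=1):
    """rollouts: (trajectory, reward) pairs from corrupted variants of one task.

    If every rollout failed, the group would carry no learning signal
    (all advantages are zero), so a stored success is mixed in.
    """
    for trajectory, reward in rollouts:
        replay.add(trajectory, reward)
    group = list(rollouts)
    if all(reward == 0 for _, reward in rollouts):
        group += replay.sample(n_replay, rng)
    advantages = group_relative_advantages([reward for _, reward in group])
    return group, advantages
```

A group whose rewards are all zero yields all-zero advantages, which is why sparse successes are worth preserving: one replayed success is enough to give the failed rollouts a negative advantage and the success a positive one.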
An auxiliary agent compares before/after screenshots at each step and produces a concise description of the change, so the action generator's context is grounded in what actually happened rather than in what it intended.
Before execution the onlooker validates the initial state against a repository of known errors (network, verification, login) and triggers reinitialization rather than letting the agent flounder.
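The onlooker's two duties, checking the initial state and describing per-step changes, can be illustrated with a deliberately simplified sketch. The real system works on screenshots with an MLLM; here, as a loudly stated assumption, screen text and sets of visible element labels stand in for screenshots, and the cue strings in `KNOWN_ERRORS` are invented examples rather than the authors' error repository.

```python
# Hypothetical stand-in for the repository of known errors: error name -> text cues.
KNOWN_ERRORS = {
    "network": ("no internet", "unable to connect"),
    "verification": ("verify your identity", "captcha"),
    "login": ("sign in", "enter your password"),
}


def check_initial_state(screen_text):
    """Return the name of a known error visible on screen, or None if the state looks normal."""
    text = screen_text.lower()
    for name, cues in KNOWN_ERRORS.items():
        if any(cue in text for cue in cues):
            return name
    return None


def summarize_change(before, after):
    """Describe what changed between two observations.

    before/after are sets of visible UI element labels, a simplified
    proxy for the before/after screenshots the onlooker compares.
    """
    appeared = sorted(after - before)
    disappeared = sorted(before - after)
    if not appeared and not disappeared:
        return "no visible change"
    parts = []
    if appeared:
        parts.append("appeared: " + ", ".join(appeared))
    if disappeared:
        parts.append("disappeared: " + ", ".join(disappeared))
    return "; ".join(parts)
```

In a loop built on these helpers, a non-`None` result from `check_initial_state` would trigger reinitialization before the first action, and the string from `summarize_change` would be appended to the action generator's context after every step.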
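We benchmark 9 open-source, closed-source, and specialized GUI agents on all 9 corruption types. UI-TARS-1.5-7B, the strongest baseline, drops from 24.21% on clean tasks to 18.74% in the Average column (a figure consistent with averaging all ten settings, clean included). Our framework raises that average to 22.89% and improves on the baseline in every individual column.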
| Agent | Clean | Pop ups | Resolution | Marks | Subtitle | Multi Apps | Accidental Touch | App Min. | Network Err. | Verification | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Open-source Multimodal Large Language Models | |||||||||||
| GLM-4.5V | 4.24% | 0.86% | 3.68% | 2.52% | 3.68% | 3.68% | 1.98% | 4.24% | 2.52% | 3.12% | 3.05% |
| Llama-3.2-90B-Vision | 3.97% | 0.77% | 1.64% | 1.93% | 1.87% | 1.59% | 1.45% | 0.00% | 1.45% | 1.64% | 1.63% |
| Qwen2.5-VL-72B | 10.99% | 1.86% | 6.38% | 9.45% | 10.29% | 5.79% | 7.48% | 8.32% | 7.48% | 6.63% | 7.47% |
| Closed-source Multimodal Large Language Models | |||||||||||
| GPT-4o | 5.38% | 1.44% | 4.82% | 2.56% | 3.66% | 3.68% | 3.12% | 4.82% | 4.24% | 3.25% | 3.69% |
| Claude-3.7-Sonnet | 4.23% | 1.41% | 2.54% | 2.82% | 2.54% | 1.97% | 2.54% | 2.25% | 1.69% | 2.54% | 2.45% |
| Gemini-2.5-Pro | 8.11% | 5.20% | 6.98% | 6.64% | 6.28% | 2.76% | 4.61% | 2.78% | 7.02% | 7.81% | 5.82% |
| State-of-the-Art GUI Agents | |||||||||||
| UI-TARS-7B-DPO | 16.20% | 13.09% | 10.03% | 13.41% | 15.59% | 13.85% | 13.97% | 13.61% | 13.33% | 8.31% | 13.14% |
| UI-TARS-72B-DPO | 22.38% | 15.51% | 14.32% | 20.36% | 19.32% | 18.94% | 14.44% | 15.19% | 19.76% | 9.42% | 16.96% |
| UI-TARS-1.5-7B (baseline) | 24.21% | 10.28% | 11.69% | 23.31% | 22.75% | 19.25% | 22.54% | 20.84% | 22.02% | 10.48% | 18.74% |
| AgentHijack-Agent | |||||||||||
| Ours | 27.80% | 21.51% | 12.53% | 27.28% | 26.45% | 21.17% | 24.37% | 24.51% | 23.09% | 20.15% | 22.89% |
| Δ | +3.59% | +11.23% | +0.84% | +3.97% | +3.70% | +1.92% | +1.83% | +3.67% | +1.07% | +9.67% | +4.15% |
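The Δ row and the Average column can be re-derived from the numbers in the table. The short check below copies the UI-TARS-1.5-7B and AgentHijack-Agent rows as given; the variable and function names are illustrative only.

```python
# Success rates (%) in table order: Clean, then the nine corruption columns.
COLUMNS = ["Clean", "Pop ups", "Resolution", "Marks", "Subtitle", "Multi Apps",
           "Accidental Touch", "App Min.", "Network Err.", "Verification"]
baseline = [24.21, 10.28, 11.69, 23.31, 22.75, 19.25, 22.54, 20.84, 22.02, 10.48]
ours = [27.80, 21.51, 12.53, 27.28, 26.45, 21.17, 24.37, 24.51, 23.09, 20.15]


def average_all(row):
    """Mean over all ten settings, including the clean one."""
    return round(sum(row) / len(row), 2)


def average_corrupted(row):
    """Mean over the nine corrupted settings only (clean column excluded)."""
    corrupted = row[1:]
    return round(sum(corrupted) / len(corrupted), 2)


# Per-column improvement in percentage points.
delta = {name: round(o - b, 2) for name, o, b in zip(COLUMNS, ours, baseline)}

if __name__ == "__main__":
    for name, value in delta.items():
        print(f"{name:>16}: {value:+.2f}")
    print("average (all ten):       ", average_all(baseline), average_all(ours))
    print("average (corrupted only):", average_corrupted(baseline), average_corrupted(ours))
```

Averaging all ten columns reproduces the reported 18.74% and 22.89%, whereas averaging only the nine corrupted columns gives roughly 18.13% and 22.34%. The Average column therefore appears to include the clean setting, and the Δ values are best read as percentage-point differences.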
Under the same set of corrupted tasks, our agent maintains accurate grounding, correctly attributes unexpected state changes, and adaptively recovers from environment errors instead of spending steps on futile attempts.
If you find AgentHijack useful, please cite our paper.
@inproceedings{sun2026agenthijack,
title = {AgentHijack: Benchmarking Computer Use Agent Robustness to Common Environment Corruptions},
author = {Jingwei Sun and Jianing Zhu and Yuanyi Li and Tongliang Liu and Xia Hu and Bo Han},
booktitle = {Forty-third International Conference on Machine Learning},
year = {2026},
url = {https://openreview.net/forum?id=0H5Im3Xvuf}
}