OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics

Mingxian Lin¹, Shengju Qian²,‡, Yuqi Liu³, Yi-Hua Huang¹, Yiyu Wang², Wei Huang¹,
Yitang Li⁴, Fan Zhang³, Zeyu Hu², Lingting Zhu², Xin Wang², Xiaojuan Qi¹,†

¹The University of Hong Kong ²LIGHTSPEED ³The Chinese University of Hong Kong ⁴Tsinghua University

‡ Project Leader † Corresponding Author

OmniGameArena at a glance. Twelve newly built UE5 games span Solo (7), PvP (3), and Coop (2). Heterogeneous agents (commercial VLMs, open-weight VLMs, keyboard-mouse and gamepad policies) connect to the same real-time UE5 environment through documented adapters. Evaluation reports the cold-start leaderboard and the Improvement Dynamics Curve (IDC) under multi-round reflection.

Abstract

Vision-language model (VLM) agents are increasingly deployed in interactive game environments. Yet game benchmarks for VLM agents typically report a single first-attempt score per (agent, game) pair, focus on single-agent Solo play, and lack unified protocols for evaluating heterogeneous agent classes (commercial VLMs, open-weight VLMs, and specialized game policies) on the same footing.

We address these gaps with OmniGameArena, a real-time benchmark of twelve newly built Unreal Engine 5 games spanning Solo (7), PvP (3), and Coop (2) with unified action interfaces, and the Improvement Dynamics Curve (IDC), an agentic-reflection harness in which a tool-using reflector LLM autonomously refines a bounded skill prompt across multiple rounds. Beyond cold-start leaderboard scores, IDC exposes two additional observables for each (agent, game) pair: how the score evolves across reflection rounds, and how the learned skill behaves on held-out task variants. We report these observables for twelve VLM agents on the cold-start leaderboard and four top agents under IDC.

Contributions

What OmniGameArena adds

A twelve-game UE5 benchmark

Solo, PvP, and Coop regimes share a unified action interface. Every game instance (rules, layouts, scripts, scoring) is authored from scratch for this benchmark rather than reused from public titles, which lowers the risk of pre-training leakage and contamination.
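As a rough illustration of what a unified action interface with per-agent adapters could look like, the sketch below maps two input styles onto one action type. Every name here (`Action`, `AgentAdapter`, `KeyboardAdapter`, `GamepadAdapter`), the WASD layout, and the action fields are assumptions made for illustration; none of them is the benchmark's actual API.

```python
from dataclasses import dataclass
from typing import Any, Protocol


@dataclass(frozen=True)
class Action:
    """Hypothetical unified action: a 2D move vector plus named buttons."""
    move_x: float = 0.0
    move_y: float = 0.0
    buttons: frozenset = frozenset()


class AgentAdapter(Protocol):
    """Hypothetical adapter contract: raw agent output in, unified Action out."""
    def to_action(self, raw: Any) -> Action: ...


def _clamp(value: float) -> float:
    """Clamp an axis value to the assumed [-1, 1] range."""
    return max(-1.0, min(1.0, float(value)))


class KeyboardAdapter:
    """Maps an iterable of pressed keys (assumed WASD layout) to an Action."""
    _MOVES = {"w": (0.0, 1.0), "s": (0.0, -1.0), "a": (-1.0, 0.0), "d": (1.0, 0.0)}

    def to_action(self, raw: Any) -> Action:
        keys = {str(k).lower() for k in raw}
        x = sum(self._MOVES[k][0] for k in keys if k in self._MOVES)
        y = sum(self._MOVES[k][1] for k in keys if k in self._MOVES)
        # Any key that is not a movement key is passed through as a button.
        buttons = frozenset(k for k in keys if k not in self._MOVES)
        return Action(_clamp(x), _clamp(y), buttons)


class GamepadAdapter:
    """Maps a dict such as {"stick": (x, y), "pressed": ["A"]} to an Action."""

    def to_action(self, raw: Any) -> Action:
        x, y = raw.get("stick", (0.0, 0.0))
        buttons = frozenset(str(b).lower() for b in raw.get("pressed", ()))
        return Action(_clamp(x), _clamp(y), buttons)
```

Under these assumptions, the environment only ever consumes `Action` values, so a new agent class needs one adapter rather than a per-game integration.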

The IDC harness

An agentic-reflection framework whose autonomous tool-use reflector refines a bounded skill prompt across rounds, with persistent memory and best-skill rollback, turning each instance into an improvement trajectory.

Empirical findings across agents

Across games and agents, leadership rotates and no single VLM dominates; origin-task improvement does not by itself predict held-out variant transfer, a divergence that cold-start leaderboard scores hide but multi-round reflection brings out.

Benchmark Suite

Twelve custom games, seven capability axes

Each game targets a different mix of visual perception, spatial navigation, reaction, memory, planning, adversarial interaction, and cooperation.

Capability profiles of the twelve games, shown as radar charts across seven dimensions (VP: Visual Perception, SN: Spatial Navigation, RT: Reaction, MEM: Memory, PLN: Planning, ADV: Adversarial, COOP: Cooperation), each scored from 0 to 3.

Framework Design

The Improvement Dynamics Curve

Each (agent, game) instance runs for multiple rounds. The agent plays K episodes under a current skill prompt; a tool-using reflector then inspects the trajectories, deciding on its own what to read and when to stop, before refining the skill for the next round.

The IDC harness couples an Experience Acquisition module, a tool-using Reflection module (Explore, Diagnose, Validate, Distill), and a Persistent module that carries the experience notebook, validated skills, and the per-round score curve across rounds.

1. Explore

Inspect, search, and review the K episode trajectories from the current round.

2. Diagnose

Compare causes across episodes and explain the success and failure cases.

3. Validate

Test, verify, and confirm candidate skills before they are committed to the persistent skill set.

4. Distill

Summarize, refine, and store the refined skill prompt for the next round.
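The round structure described above (K episodes per round, reflection, persistent state, best-skill rollback) can be sketched as a short loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `run_idc`, `PersistentState`, the `play_episode` and `reflect` callables, the use of the mean episode score as the round score, and the `MAX_SKILL_CHARS` bound are all hypothetical. The four reflection stages are collapsed into the single `reflect` callable, and episodes are reduced to their scores for brevity.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List

MAX_SKILL_CHARS = 2000  # assumed bound on the skill prompt; the real limit is not stated


@dataclass
class PersistentState:
    """State carried across rounds: notebook, best skill so far, score curve."""
    notebook: List[str] = field(default_factory=list)
    best_skill: str = ""
    best_score: float = float("-inf")
    curve: List[float] = field(default_factory=list)


def run_idc(
    play_episode: Callable[[str], float],
    reflect: Callable[[str, List[float], List[str]], str],
    rounds: int,
    k: int,
    initial_skill: str = "",
) -> PersistentState:
    state = PersistentState(best_skill=initial_skill)
    skill = initial_skill
    for _ in range(rounds):
        # Experience acquisition: play K episodes under the current skill prompt.
        scores = [play_episode(skill) for _ in range(k)]
        round_score = mean(scores)
        state.curve.append(round_score)
        if round_score > state.best_score:
            state.best_score, state.best_skill = round_score, skill
        else:
            # Best-skill rollback: refine from the best skill, not a regressed one.
            skill = state.best_skill
        # Reflection (Explore, Diagnose, Validate, Distill folded into one call);
        # the result is truncated to keep the skill prompt bounded.
        skill = reflect(skill, scores, state.notebook)[:MAX_SKILL_CHARS]
    return state
```

With this layout the first entry of `state.curve` is the cold-start score when `initial_skill` is empty, and later entries trace the improvement trajectory for that (agent, game) instance.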

Results · Cold-start

Cold-start performance

Leaderboard

Mean normalized score · Solo · Quality (PDQ)

Gameplay recordings

Browse cold-start episodes by game. Solo and Coop show every model; PvP is a head-to-head matchup viewer. Numbers are normalized scores.

Results · Improvement Dynamics

Improvement under IDC

A tool-using reflector distills a skill prompt from the agent's own episodes.
Below: the learned skill, how its score evolves across reflection rounds, and how the skill transfers to three held-out task variants (with vs. without skill).

No single winner

Leadership rotates across games; commercial agents hold a wide lead over open-weight VLMs and specialized policies.

Improvement peaks mid-curve

All four top agents improve over their cold-start baseline through reflection, yet peak performance is typically reached before the final round.
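To make "peaks mid-curve" concrete, the sketch below summarizes a per-round score curve by locating its peak and comparing it with the final round. The function name `peak_summary` and the convention that index 0 holds the cold-start score are assumptions for illustration; the curve in the usage note is made-up data, not a reported result.

```python
from typing import Dict, List, Union


def peak_summary(curve: List[float]) -> Dict[str, Union[int, float, bool]]:
    """Summarize a score curve where curve[0] is assumed to be the cold-start score."""
    if not curve:
        raise ValueError("curve must contain at least one score")
    # max() returns the first maximal index, so ties resolve to the earliest round.
    peak_round = max(range(len(curve)), key=curve.__getitem__)
    return {
        "peak_round": peak_round,
        "peak_gain": curve[peak_round] - curve[0],
        "final_gain": curve[-1] - curve[0],
        "peaked_early": peak_round < len(curve) - 1,
    }
```

For an illustrative curve such as `[0.25, 0.5, 0.75, 0.625]`, the peak falls at round 2 while the final round is lower, which is the pattern that a best-skill rollback is meant to guard against.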

Transfer can diverge

Improvement on the origin task does not guarantee transfer to held-out variants, a gap that cold-start leaderboard scores do not reveal.
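One simple way to compare the two quantities is to compute the origin-task gain and the per-variant gain (score with the skill minus score without it) side by side. The sketch below assumes this difference-based definition; `transfer_report`, its argument layout, and the `diverges` flag are hypothetical and are not the paper's metric.

```python
from statistics import mean
from typing import Dict, Tuple


def transfer_report(
    origin_cold: float,
    origin_skilled: float,
    variants: Dict[str, Tuple[float, float]],
) -> dict:
    """variants maps a variant name to (score_without_skill, score_with_skill)."""
    origin_gain = origin_skilled - origin_cold
    variant_gains = {
        name: with_skill - without_skill
        for name, (without_skill, with_skill) in variants.items()
    }
    mean_transfer = mean(list(variant_gains.values()))
    return {
        "origin_gain": origin_gain,
        "variant_gains": variant_gains,
        "mean_transfer_gain": mean_transfer,
        # Assumed flag: the skill helps on the origin task but not on the variants.
        "diverges": origin_gain > 0 and mean_transfer <= 0,
    }
```

A skill that raises the origin score while lowering or leaving unchanged the held-out scores would set `diverges`, which is the case a single cold-start number cannot surface.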

BibTeX

Citation

@article{lin2026omnigamearena,
  title   = {OmniGameArena: A Unified UE5 Benchmark for VLM Game Agents with Improvement Dynamics},
  author  = {Lin, Mingxian and Qian, Shengju and Liu, Yuqi and Huang, Yi-Hua and Wang, Yiyu and Huang, Wei and Li, Yitang and Zhang, Fan and Hu, Zeyu and Zhu, Lingting and Wang, Xin and Qi, Xiaojuan},
  journal = {arXiv preprint arXiv:2606.09826},
  year    = {2026}
}