🚀 Motivation: Why a Vision-Centric Survey?
Vision serves as the primary carrier of the world's information, capturing granular dynamics, like the trajectory of shattering glass, that language alone cannot describe.
To simulate how the world evolves, the field is converging on Vision World Models (VWMs), which learn the laws of world evolution directly from visual observation.
Currently, research has branched into three distinct paradigms:
- Video generation methods (e.g., diffusion or autoregressive models) focus on modeling appearance, achieving high visual fidelity and temporal continuity.
- State transition approaches (e.g., SSMs) compress visual signals into compact states that serve as latent dynamics simulators for planning and control.
- Embedding prediction (JEPA-style) approaches forecast in latent spaces, bypassing pixel details to prioritize semantic understanding.
These divergent goals lead to inconsistent terminology and incomparable metrics. Existing surveys, while valuable, still do not solve this fragmentation.
- Application-oriented surveys (e.g., in robotics or autonomous driving) provide depth but lack a systematic analysis of VWM as a standalone technology.
- Broad conceptual surveys offer high-level overviews but treat visual input as a passive assumption rather than the active design challenge it is.
To fill these gaps, we argue that addressing the field's fragmentation calls for a shift in perspective: treating the visual nature of the world not as a passive input challenge, but as the central design driver for world modeling.
🧩 Conceptual Framework
At its core, a VWM is defined as follows:
A vision world model is an AI model that learns to simulate the physical world through visual observation.
Formally, a VWM can be seen as a probabilistic model \( f_{\theta} \) that predicts the distribution of future states given observed visual context and interactive conditions:
$$
p(\mathcal{S}_{t+1:T}| v_{0:t}, c_{t}) = f_{\theta} (\mathcal{E}(v_{0:t}), c_{t})
$$
where \( v_{0:t} \) represents the sequence of visual observations from time \( 0 \) to \( t \), and \( c_{t} \) represents current conditions (e.g., agent actions, language instructions, or control signals). \( \mathcal{E}( \cdot ) \) denotes the visual encoder that maps raw inputs into representations.
\( \mathcal{S}_{t+1:T} \) refers to the representations of future world states, which manifest in diverse forms depending on the specific modeling paradigm, including future observations (\( v_{t+1:T} \)), latent states (\( s_{t+1:T} \)), or other future properties (e.g., segmentation maps, depth, flow, 3D Gaussian splats, or trajectories).
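To make the formulation concrete, the sketch below shows one possible interface for \( f_{\theta} \) as a small PyTorch module: an encoder \( \mathcal{E} \) compresses the visual context \( v_{0:t} \), a recurrent dynamics core fuses it with the condition \( c_{t} \), and a head parameterizes a distribution over future latent states \( \mathcal{S}_{t+1:T} \). All module choices, names, and shapes are illustrative assumptions, not a reference implementation of any specific VWM.

```python
# Minimal, illustrative sketch of the VWM interface in the formula above.
# Every architectural choice here is a hypothetical stand-in.
import torch
import torch.nn as nn


class VisionWorldModel(nn.Module):
    """Hypothetical sketch of f_theta: E(v_{0:t}) and c_t in, p(S_{t+1:T}) out."""

    def __init__(self, latent_dim: int = 256, cond_dim: int = 32, horizon: int = 8):
        super().__init__()
        self.horizon = horizon
        # E(.): maps raw frames v_{0:t} into compact visual representations.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, latent_dim),
        )
        # Dynamics core: aggregates the encoded context together with c_t.
        self.dynamics = nn.GRU(latent_dim + cond_dim, latent_dim, batch_first=True)
        # Head: parameterizes p(S_{t+1:T} | v_{0:t}, c_t) as diagonal Gaussians.
        self.mean_head = nn.Linear(latent_dim, latent_dim)
        self.logstd_head = nn.Linear(latent_dim, latent_dim)

    def forward(self, frames: torch.Tensor, cond: torch.Tensor):
        # frames: (B, T, 3, H, W) visual context v_{0:t}; cond: (B, cond_dim) is c_t.
        B, T = frames.shape[:2]
        z = self.encoder(frames.flatten(0, 1)).view(B, T, -1)      # E(v_{0:t})
        cond_seq = cond.unsqueeze(1).expand(-1, T, -1)             # broadcast c_t over time
        _, h = self.dynamics(torch.cat([z, cond_seq], dim=-1))     # summarize the context
        state, futures = h[-1], []
        # Roll the context state forward to predict future latent states S_{t+1:T}.
        for _ in range(self.horizon):
            mean, std = self.mean_head(state), self.logstd_head(state).exp()
            futures.append(torch.distributions.Normal(mean, std))
            state = mean + std * torch.randn_like(std)             # sample the next state
        return futures  # one predictive distribution per future step


# Usage: distributions over 8 future latent states from 4 context frames and one condition.
model = VisionWorldModel()
future_dists = model(torch.randn(2, 4, 3, 64, 64), torch.randn(2, 32))
```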
Based on this, we establish a conceptual framework that decomposes VWM into three essential components:
Figure 2: The conceptual framework of VWM. A VWM receives the high-dimensional visual context and interaction conditions (action, instruction, etc.) of the physical world, and performs future simulations for this world.
- (1) The Perceptual Foundation: How diverse visual signals are transformed into world representation.
- (2) The Dynamics Core: What "laws of the world" are learned, progressing from spatio-temporal coherence to physical dynamics and causal reasoning.
- (3) The Key Capability: How VWM performs controllable simulation conditioned on actions, language, or other interaction prompts.
🗺️ Taxonomy of VWM Designs
We provide an in-depth analysis of VWMs' four major architectural families, applying our three-component framework to compare their underlying mechanisms:
- Sequential Generation
  - Visual Autoregressive Modeling: Discretizes raw signals into a sequence of visual tokens and learns world laws via next-token prediction, enabling long-term generation through autoregressive rollouts (a minimal rollout sketch follows this taxonomy).
  - MLLM as VWM Engine: Maps observations into language-aligned tokens to leverage MLLM reasoning power, producing an interleaved multimodal stream of visual futures and textual plans.
- Diffusion-based Generation
  - Latent Diffusion: Views modeling as a denoising process within compressed continuous latents, generating holistic, high-fidelity video clips in a non-sequential, all-at-once manner.
  - Autoregressive Diffusion: A hybrid design that employs causal denoising conditioned on historical context to generate sequential denoised latents, ensuring both temporal consistency and visual quality.
- Embedding Prediction
  - JEPA: A non-generative paradigm that encodes observations into contextual embeddings and learns laws via latent prediction, providing efficient representations for planning without pixel reconstruction (a minimal latent-prediction sketch follows this taxonomy).
- State Transition
  - State-Space Modeling: Compresses visual history into a compact recurrent state and models laws as linear updates, enabling efficient long-horizon state rollouts.
  - Object-Centric Modeling: Decomposes observations into discrete object slots and learns world laws through slot interaction, allowing for generalization to novel object combinations.
Table 1: Examples of visual autoregressive modeling VWMs.
Figure 4: Examples of diffusion-based (latent diffusion, autoregressive diffusion) VWMs.
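To illustrate the sequential-generation family, the hedged sketch below reduces visual autoregressive modeling to its core mechanics: frames are assumed to be already discretized into visual tokens (e.g., by a VQ-style tokenizer, not shown), a small causal transformer learns next-token prediction, and future simulation is an autoregressive rollout. The model size, vocabulary, and shapes are hypothetical.

```python
# Minimal sketch of the visual autoregressive paradigm: world simulation as
# next-token prediction over discretized visual tokens, rolled out step by step.
import torch
import torch.nn as nn


class AutoregressiveVWM(nn.Module):
    """Hypothetical next-token world model over discretized visual tokens."""

    def __init__(self, vocab_size: int = 1024, dim: int = 128, max_len: int = 512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(block, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, L) discrete visual tokens; returns next-token logits (B, L, V).
        L = tokens.shape[1]
        x = self.tok_emb(tokens) + self.pos_emb(torch.arange(L, device=tokens.device))
        causal = torch.triu(torch.full((L, L), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.backbone(x, mask=causal))

    @torch.no_grad()
    def rollout(self, context: torch.Tensor, steps: int) -> torch.Tensor:
        # Future simulation = autoregressive rollout, one visual token at a time.
        tokens = context
        for _ in range(steps):
            logits = self(tokens)[:, -1]                                   # next-token logits
            nxt = torch.distributions.Categorical(logits=logits).sample()  # sample a token
            tokens = torch.cat([tokens, nxt.unsqueeze(1)], dim=1)
        return tokens  # a visual tokenizer (not shown) would decode these back to frames


# Usage: extend a 64-token visual context by 16 simulated tokens.
model = AutoregressiveVWM()
future_tokens = model.rollout(torch.randint(0, 1024, (2, 64)), steps=16)
```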
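For contrast, the sketch below outlines the embedding-prediction (JEPA-style) paradigm under common assumptions: a context encoder and a predictor are trained to match the embedding of a future frame produced by a slowly updated target encoder, so the loss lives entirely in latent space and no pixels are reconstructed. The tiny encoders, EMA rate, and single-frame setup are illustrative simplifications, not any specific JEPA variant.

```python
# Minimal sketch of JEPA-style latent prediction: predict embeddings of future
# observations rather than pixels; the target encoder is an EMA of the online one.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F


def make_encoder(dim: int = 256) -> nn.Module:
    # Tiny stand-in for a visual encoder; real JEPA-style VWMs use far larger backbones.
    return nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=4, stride=4), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim),
    )


context_encoder = make_encoder()
target_encoder = copy.deepcopy(context_encoder)           # slowly updated, no gradients
for p in target_encoder.parameters():
    p.requires_grad_(False)
predictor = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 256))
optimizer = torch.optim.Adam(
    list(context_encoder.parameters()) + list(predictor.parameters()), lr=1e-4
)


def jepa_step(context_frame: torch.Tensor, future_frame: torch.Tensor) -> torch.Tensor:
    # Predict the *embedding* of the future frame; no pixels are reconstructed.
    pred = predictor(context_encoder(context_frame))
    with torch.no_grad():
        target = target_encoder(future_frame)
    loss = F.mse_loss(pred, target)                       # loss lives in latent space
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Track the online encoder with an exponential moving average (rate is illustrative).
    with torch.no_grad():
        for p_t, p_c in zip(target_encoder.parameters(), context_encoder.parameters()):
            p_t.mul_(0.99).add_(0.01 * p_c)
    return loss


# Usage: one latent-prediction step on a random (context, future) frame pair.
loss = jepa_step(torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64))
```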
📊 Evaluation System
We provide an extensive review of the evaluation landscape, cataloging metrics and distinguishing between types of datasets and benchmarks.
Figure 5: Overview of the evaluation systems of VWMs.
Metrics
- Visual Quality: Assesses the visual consistency of the simulation (a minimal per-frame PSNR check is sketched after Table 2).
- Dynamics Fidelity: Probes the model's adherence to physical laws.
- Task Execution: Measures the model's overall effectiveness in downstream tasks.
Table 2: Summary of evaluation metrics for VWMs.
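As a concrete example of the visual-quality family, the sketch below computes per-frame PSNR between a predicted rollout and the ground-truth clip. This is only one metric among those cataloged in Table 2, and it deliberately says nothing about physical plausibility or downstream task success.

```python
# Illustrative frame-wise visual-quality check (PSNR) for a predicted rollout.
import torch


def psnr(pred: torch.Tensor, target: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    # pred, target: (T, C, H, W) clips with pixel values in [0, max_val].
    mse = ((pred - target) ** 2).flatten(1).mean(dim=1)    # per-frame MSE
    return 10.0 * torch.log10(max_val ** 2 / mse)          # per-frame PSNR in dB


# Usage: average PSNR over a 16-frame predicted rollout against the ground truth.
predicted = torch.rand(16, 3, 64, 64)
ground_truth = torch.rand(16, 3, 64, 64)
print(psnr(predicted, ground_truth).mean().item())
```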
Datasets and Benchmarks
- General-Purpose World Modeling: Foundational benchmarks designed to evaluate core capabilities:
  - World Prediction and Simulation
  - Physics and Causality
- Application-Specific World Modeling: Datasets and testbeds tailored for high-impact downstream applications:
  - Embodied AI and Robotics
  - Autonomous Driving
  - Interactive Environments and Gaming
Table 3: General-purpose world modeling datasets and benchmarks.
Table 4: Application-specific world modeling datasets and benchmarks.
💡 Challenges and Future Directions
We organize our analysis around three interconnected calls to action for the next generation of world models:
Figure 6: Overview of the challenges and future directions for the next generation of world models.
- Re-grounding: Move beyond superficial imitation to establish robust foundations for deep understanding and simulation of the world.
- Re-evaluation: Challenge the field's current metrics and benchmarks, which often prioritize misleading notions of visual fidelity over true physical and logical plausibility.
- Re-scaling: Explore how strategic scaling can unlock unification across vision-centric world tasks and give rise to emergent capabilities.