A Survey on World Models: A Vision Perspective

Xiao Yu1, Yichen Zhang1, Mingzhang Wang1, Shifang Zhao1, Weizhe Liu1, Yuyang Yin1, Zhongwei Ren1,
1Beijing Jiaotong University, 2National University of Singapore, 3Beijing Academy of Artificial Intelligence
Corresponding author, Project Lead

Figure 1: A summary of representative works for Vision World Models.

Abstract

The ability to learn world knowledge from visual observation and to interact with the physical world is fundamental to Artificial General Intelligence (AGI). The Vision World Model (VWM) has emerged as a critical paradigm that realizes this capability by learning to simulate world evolution directly from visual observations. However, rapid progress within fragmented research communities has produced inconsistent taxonomies and isolated evaluation practices. We argue that resolving this fragmentation requires a shift from treating vision as a passive input challenge to treating the visual nature of the world as the central design driver. Guided by this principle, we establish a unified framework that decomposes VWM research into three parts: the perceptual foundation, the dynamics core, and the key capability. We then detail this framework through technical taxonomies and an evaluation system. Finally, we discuss future directions organized around three dimensions: Re-grounding in physical and causal laws, Re-evaluating beyond visual fidelity, and Re-scaling for emergent capabilities, aiming to advance VWMs toward a cornerstone of generalist intelligence.

🚀 Motivation: Why a Vision-Centric Survey?

Vision serves as the primary carrier of the world's information, capturing fine-grained dynamics, such as the trajectory of shattering glass, that language alone cannot describe. To simulate world evolution*, the field is converging on the Vision World Model (VWM), which learns the laws of world evolution directly from visual observation. Research has currently branched into three distinct paradigms:

  • Video generation methods focus on modeling appearance (e.g., Diffusion, Autoregressive) to achieve high visual fidelity and continuity.
  • State transition approaches compress visual signals into compact states (e.g., SSMs) to serve as latent dynamics simulators for planning and control.
  • Embedding prediction/JEPA-style approaches forecast in latent spaces (e.g., JEPA), bypassing pixel details to prioritize semantic understanding.

These divergent goals lead to inconsistent terminology and incomparable metrics. Existing surveys, while valuable, do not resolve this fragmentation.

  • Application-oriented surveys (e.g., in robotics or autonomous driving) provide depth but lack a systematic analysis of VWM as a standalone technology.
  • Broad conceptual surveys offer high-level overviews but treat visual input as a passive assumption rather than the active design challenge it is.

To fill these gaps, we argue that addressing the field's fragmentation calls for a shift in perspective: treating the visual nature of the world not as a passive input challenge, but as the central design driver for world modeling.


World evolution*: refers to the spatio-temporal progression of the environment and the entities within it, encompassing the continuous unfolding of both physical states and logical events.


🧩 Conceptual Framework

At its core, a VWM is defined as follows:

A vision world model is an AI model that learns to simulate the physical world through visual observation.

Formally, a VWM can be seen as a probabilistic model \( f_{\theta} \) that predicts the distribution of future states given observed visual context and interactive conditions:

$$ p(\mathcal{S}_{t+1:T}| v_{0:t}, c_{t}) = f_{\theta} (\mathcal{E}(v_{0:t}), c_{t}) $$

where \( v_{0:t} \) represents the sequence of visual observations from time \( 0 \) to \( t \), and \( c_{t} \) represents the current conditions (e.g., agent actions, language instructions, or control signals). \( \mathcal{E}( \cdot ) \) denotes the visual encoder that maps raw inputs into representations. \( \mathcal{S}_{t+1:T} \) refers to the representations of future world states, which manifest in diverse forms depending on the specific modeling paradigm, including future observations (\( v_{t+1:T} \)), latent states (\( s_{t+1:T} \)), or other future properties (e.g., segmentation maps, depth, flow, 3D Gaussian splats, or trajectories).
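To make this formalization concrete, the following is a minimal PyTorch-style sketch of the factorization \( f_{\theta}(\mathcal{E}(v_{0:t}), c_{t}) \). The module names (`encoder`, `dynamics`) are hypothetical placeholders rather than components of any particular VWM, and the returned simulation may be pixels, latents, or other state representations depending on the paradigm.

```python
import torch
import torch.nn as nn

class VisionWorldModel(nn.Module):
    """Illustrative sketch of the VWM factorization p(S_{t+1:T} | v_{0:t}, c_t).

    `encoder` plays the role of E(.), `dynamics` the role of f_theta.
    Both are hypothetical placeholders, not components of any specific model.
    """

    def __init__(self, encoder: nn.Module, dynamics: nn.Module):
        super().__init__()
        self.encoder = encoder      # E(.): raw frames -> world representation
        self.dynamics = dynamics    # f_theta: (representation, condition) -> future states

    def forward(self, frames: torch.Tensor, condition: torch.Tensor, horizon: int):
        # frames:    (B, t+1, C, H, W)  visual context v_{0:t}
        # condition: (B, D_c)           actions / language / control signals c_t
        z = self.encoder(frames)                       # world representation of the context
        future = self.dynamics(z, condition, horizon)  # distribution (or samples) over S_{t+1:T}
        return future
```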

Based on this, we establish a conceptual framework that decomposes VWM into three essential components:

Conceptual framework of VWM

Figure 2: The conceptual framework of VWM. A VWM receives the high-dimensional visual context and interaction conditions (action, instruction, etc.) of the physical world, and performs future simulations for this world.

  • (1) The Perceptual Foundation: How diverse visual signals are transformed into world representations.
  • (2) The Dynamics Core: What "laws of the world" are learned, progressing from spatio-temporal coherence to physical dynamics and causal reasoning.
  • (3) The Key Capability: How VWM performs controllable simulation conditioned on actions, language, or other interaction prompts.

🗺️ Taxonomy of VWM Designs

We provide an in-depth analysis of VWMs' four major architectural families, applying our three-component framework to compare their underlying mechanisms:

Taxonomy of VWM designs

Figure 3: A taxonomy of VWM designs, organized into 4 primary classes (with 7 sub-classes). For each design, the upper panel illustrates its model architecture, while the lower panel summarizes its specific instantiation of the VWM framework (Section 2): (1) World Representation: the type of representation extracted from raw visual signals; (2) Mechanism for Learning World Laws: the way the laws governing world evolution are learned; (3) World Simulation: the manifestation of the simulated future world.

  • Sequential Generation
    1. Visual Autoregressive Modeling: Discretizes raw signals into a sequence of visual tokens and learns world laws via next-token prediction, enabling long-term generation through autoregressive rollouts (a minimal rollout sketch follows this list).
    2. MLLM as VWM Engine: Maps observations into language-aligned tokens to leverage MLLM reasoning power, producing an interleaved multimodal stream of visual futures and textual plans.
  • Diffusion-based Generation
    1. Latent Diffusion: Views modeling as a denoising process within compressed continuous latents, generating holistic, high-fidelity video clips in a non-sequential, all-at-once manner.
    2. Autoregressive Diffusion: A hybrid design that employs causal denoising conditioned on historical context to generate sequential denoised latents, ensuring both temporal consistency and visual quality.
  • Embedding Prediction
    1. JEPA: A non-generative paradigm that encodes observations into contextual embeddings and learns world laws via latent prediction, providing efficient representations for planning without pixel reconstruction (a minimal objective sketch also follows this list).
  • State Transition
    1. State-Space Modeling: Compresses visual history into a compact recurrent state and models laws as linear updates, enabling efficient long-horizon state rollouts.
    2. Object-Centric Modeling: Decomposes observations into discrete object slots and learns world laws through slot interaction, allowing for generalization to novel object combinations.
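As referenced in the Visual Autoregressive Modeling entry above, the following is a minimal sketch of an autoregressive rollout: frames are discretized into tokens, a causal model predicts the next token, and predictions are fed back to extend the horizon. The `tokenizer` and `transformer` interfaces here (`encode`, `decode`, `next_token_logits`) are hypothetical placeholders, not the API of any specific model.

```python
import torch

@torch.no_grad()
def autoregressive_rollout(tokenizer, transformer, context_frames, num_new_tokens):
    """Sketch of next-token rollout for a visual autoregressive VWM.

    tokenizer:    hypothetical VQ-style model with .encode(frames) -> token ids
                  and .decode(token_ids) -> frames
    transformer:  hypothetical causal model with .next_token_logits(token_ids)
    """
    tokens = tokenizer.encode(context_frames)           # (B, L) discrete visual tokens
    for _ in range(num_new_tokens):
        logits = transformer.next_token_logits(tokens)  # (B, V) logits for the next token
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)  # sample the next visual token
        tokens = torch.cat([tokens, next_token], dim=1)       # append and continue the rollout
    return tokenizer.decode(tokens)                     # map tokens back to future frames
```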
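As referenced in the JEPA entry above, the following sketches an embedding-prediction objective: the future is predicted in latent space against a stop-gradient target encoder (often an EMA copy of the context encoder), with no pixel reconstruction. All module names are hypothetical placeholders, assuming simple encoder and predictor callables.

```python
import torch
import torch.nn.functional as F

def jepa_latent_loss(context_encoder, target_encoder, predictor,
                     context_frames, future_frames):
    """Sketch of a JEPA-style embedding-prediction objective.

    The model never reconstructs pixels: it predicts the embedding of the
    future observation from the embedding of the context. The target encoder
    typically receives no gradients (e.g., an EMA copy of the context encoder).
    """
    z_context = context_encoder(context_frames)    # (B, D) context embedding
    with torch.no_grad():
        z_target = target_encoder(future_frames)   # (B, D) target embedding, no gradient
    z_pred = predictor(z_context)                  # predict the future in latent space
    return F.mse_loss(z_pred, z_target)            # latent regression loss
```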
Examples of visual autoregressive modeling VWMs

Table 1: Examples of visual autoregressive modeling VWMs.

Examples of diffusion-based VWMs

Figure 4: Examples of diffusion-based (latent diffusion, autoregressive diffusion) VWMs.

📊 Evaluation System

We provide an extensive review of the evaluation landscape, cataloging metrics and distinguishing between types of datasets and benchmarks.

Overview of the evaluation ecosystems of VWMs

Figure 5: Overview of the evaluation systems of VWMs.

Metrics

  • Visual Quality: Assesses the visual consistency and fidelity of the simulation (a minimal metric sketch follows this list).
  • Dynamics Fidelity: Probes the model's adherence to physical laws.
  • Task Execution: Measures the model's overall effectiveness in downstream tasks.
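As an illustration of the Visual Quality category referenced above, frame-level metrics such as PSNR compare simulated frames against ground-truth futures. The snippet below is a minimal NumPy sketch of PSNR, not the metric suite of any particular benchmark in Table 2.

```python
import numpy as np

def psnr(pred_frame: np.ndarray, gt_frame: np.ndarray, max_val: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio between a simulated frame and the ground truth.

    Higher is better; this is a pixel-level proxy for visual quality, not for
    physical plausibility (which dynamics-fidelity metrics target instead).
    """
    mse = np.mean((pred_frame.astype(np.float64) - gt_frame.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10((max_val ** 2) / mse)
```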
Summary of evaluation metrics for VWMs

Table 2: Summary of evaluation metrics for VWMs.

Datasets and Benchmarks

  • General-Purpose World Modeling: Foundational benchmarks designed to evaluate core capabilities:
    • World Prediction and Simulation
    • Physics and Causality
  • Application-Specific World Modeling: Datasets and testbeds tailored for high-impact downstream applications:
    • Embodied AI and Robotics
    • Autonomous Driving
    • Interactive Environments and Gaming
General-purpose world modeling datasets and benchmarks

Table 3: General-purpose world modeling datasets and benchmarks.

Application-specific world modeling datasets and benchmarks

Table 4: Application-specific world modeling datasets and benchmarks.

💡 Challenges and Future Directions

We organize our analysis around three interconnected calls to action for next-generation world models:

Overview of challenges and future directions

Figure 6: Overview of the challenges and future directions for the next-generation world models.

  1. Re-grounding: Move beyond superficial imitation to establish robust foundations for deep understanding and simulation of the world.
  2. Re-evaluation: Challenge the field's current metrics and benchmarks, which often prioritize misleading notions of visual fidelity over true physical and logical plausibility.
  3. Re-scaling: Explore how strategic scaling can unlock unification across vision-centric world tasks and emergent capabilities.

BibTeX

@XXX{XXX,
  author    = {},
  title     = {A Survey on World Models: A Vision Perspective},
  journal   = {},
  year      = {},
}