From Seeing to Knowing the World:
A Survey of Vision World Models

Xiao Yu1,‡, Yichen Zhang1, Mingzhang Wang1, Shifang Zhao1, Weizhe Liu1, Yuyang Yin1, Zhongwei Ren1,
Yunchao Wei1,✉, Xiaojie Jin1,✉,†,‡
1BJTU,     2Tencent,     3ByteDance,     4NUS,     5Midea-AIRC,     6CCRI
✉ Corresponding Author, † Project Lead, ‡ Equal Contribution

Figure 1: A summary of representative works for Vision World Models.

Abstract

Acquiring world knowledge directly from visual observation is fundamental to Artificial General Intelligence (AGI). To support this capability, the Vision World Model (VWM) has emerged as a key paradigm, which learns how the world evolves over time from visual streams. However, recent progress has been driven by diverse research communities, resulting in inconsistent problem formulations, disconnected taxonomies, and divergent evaluation protocols. We argue that addressing this gap requires a conceptual shift: vision should not be treated merely as an input modality, but as the primary driver shaping how world models are represented, learned, and evaluated. Guided by this vision-centric perspective, we introduce a unified framework that organizes VWM research into three core components: vision encoding, knowledge learning, and controllable simulation, and use it to analyze existing model designs and evaluation methodologies. Finally, we outline future research directions that emphasize stronger physical and causal grounding, more meaningful evaluation beyond visual appearance, and scaling toward more general and reliable world modeling capabilities.

🚀 Motivation: Why a Vision-Centric Survey?

Visual observation provides the most direct evidence of how the world changes over time, capturing fine-grained dynamics, such as how a glass falls and shatters upon impact, that language alone cannot fully specify.


Figure 2: How to describe the world?

World models have become a major focus of contemporary AI research, with recent progress emerging from diverse communities such as generative modeling, representation learning, embodied intelligence, and autonomous driving. Although visual signals are now widely used across these paradigms, vision is still often treated merely as an input modality, rather than the central factor that shapes how a world model should represent the world, learn world knowledge, and be evaluated. Within this landscape, current research has largely evolved along three major directions:

  • Video generation methods (e.g., diffusion- or autoregressive-based) focus on modeling appearance to achieve high visual fidelity and temporal continuity.
  • State transition approaches compress visual signals into compact states (e.g., via SSMs) to serve as latent dynamics simulators for planning and control.
  • Embedding prediction approaches (e.g., JEPA-style models) forecast in latent spaces, bypassing pixel details to prioritize semantic understanding.

These divergent goals lead to inconsistent terminology and incomparable metrics. Existing surveys, while valuable, do not resolve this fragmentation.

  • Application-focused surveys (e.g., in robotics or autonomous driving) provide depth but lack a systematic analysis of VWM as a standalone technology.
  • Broad conceptual surveys offer high-level overviews but treat visual input as a passive assumption rather than the active design challenge it is.

To fill these gaps, we argue that addressing the field's fragmentation calls for a shift in perspective: treating the visual nature of the world not as a passive input challenge, but as the central design driver for world modeling.


🧩 Conceptual Framework

At its core, a VWM is defined as follows:

A Vision World Model (VWM) is an AI model that learns world knowledge from visual data and generates future world states conditioned on interaction.

Formally, a VWM can be seen as a probabilistic model \( f_{\theta} \) that predicts the distribution of future states given observed visual context and interactive conditions:

$$ p(\mathcal{S}_{t+1:T}| v_{0:t}, c_{t}) = f_{\theta} (\mathcal{E}(v_{0:t}), c_{t}) $$

where \( v_{0:t} \) represents the sequence of visual observations from time \( 0 \) to \( t \), and \( c_{t} \) represents current conditions (e.g., agent actions, language instructions, or control signals). \( \mathcal{E}( \cdot ) \) denotes the visual encoder that maps raw inputs into representations. \( \mathcal{S}_{t+1:T} \) denotes future world states, which may take different forms depending on the modeling paradigm, including future frames, latent states, or other meaningful attributes (e.g., depth, flow, occupancy, 3D primitives, or trajectories).
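
To make this formulation concrete, below is a minimal PyTorch sketch of the factorization, assuming a simple convolutional encoder for \( \mathcal{E}(\cdot) \), a recurrent dynamics module for \( f_{\theta} \), and a Gaussian head over the next latent state; all module choices, names, and dimensions are illustrative assumptions rather than the design of any specific VWM.

```python
# Minimal, illustrative sketch of the VWM factorization
#   p(S_{t+1:T} | v_{0:t}, c_t) = f_theta(E(v_{0:t}), c_t)
# Module names, shapes, and the Gaussian latent head are assumptions for illustration only.
import torch
import torch.nn as nn


class VisionEncoder(nn.Module):
    """E(.): maps raw frames (B, T, 3, H, W) to per-frame representations (B, T, D)."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(32, dim),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        b, t = frames.shape[:2]
        feats = self.backbone(frames.flatten(0, 1))  # (B*T, D)
        return feats.view(b, t, -1)                  # (B, T, D)


class VWM(nn.Module):
    """f_theta: predicts a distribution over the next latent state from context and condition."""

    def __init__(self, dim: int = 256, cond_dim: int = 32):
        super().__init__()
        self.encoder = VisionEncoder(dim)
        self.dynamics = nn.GRU(dim + cond_dim, dim, batch_first=True)
        self.head = nn.Linear(dim, 2 * dim)          # mean and log-variance of the next state

    def forward(self, frames: torch.Tensor, cond: torch.Tensor):
        ctx = self.encoder(frames)                                  # E(v_{0:t}): (B, T, D)
        cond_seq = cond.unsqueeze(1).expand(-1, ctx.size(1), -1)    # broadcast c_t over time
        _, h = self.dynamics(torch.cat([ctx, cond_seq], dim=-1))    # summarize context + condition
        mean, log_var = self.head(h[-1]).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_var.exp().sqrt())


# Usage: a one-step predictive distribution; longer horizons chain this step autoregressively.
model = VWM()
dist = model(torch.randn(2, 4, 3, 64, 64), torch.randn(2, 32))
next_state = dist.rsample()                          # sampled S_{t+1}: (2, 256)
```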

Based on this, we establish a conceptual framework that decomposes VWM into three essential components:


Figure 3: A unified framework for Vision World Models (VWMs). A VWM encodes visual observations, learns structured world knowledge about how the world changes, and performs future simulation conditioned on interaction.

  • (1) Vision Encoding: How diverse visual signals are transformed into world representations.
  • (2) Knowledge Learning: What world knowledge is learned, progressing from spatio-temporal coherence to physical dynamics and causal mechanisms.
  • (3) Controllable Simulation: How a VWM performs controllable simulation conditioned on actions, language, or other interaction prompts.

🗺️ Taxonomy of VWM Designs

We provide an in-depth analysis of four major architectural families of VWMs, applying our three-component framework to compare their underlying mechanisms:


Figure 4: A taxonomy of VWM designs organized into four architectural families with seven sub-designs. For each sub-design, the upper panel illustrates its typical input-output pipeline (from visual context and conditions to future outputs). The lower panel highlights its key design choices under the three components of our framework (Section 2): Vision Encoding: what form observations are encoded into; Knowledge Learning: through what mechanism structured world knowledge is learned; Controllable Simulation: what form the future rollouts take.

  • Sequential Generation
    1. Visual Autoregressive Modeling: Discretizes raw signals into a sequence of visual tokens and learns world knowledge via next-token prediction, enabling long-term generation through autoregressive rollouts (see the minimal rollout sketch after this list).
    2. MLLM-guided Multimodal Autoregressive Model: Maps observations into language-aligned tokens to leverage MLLM reasoning power, producing an interleaved multimodal stream of visual futures and textual plans.
  • Diffusion-based Generation
    1. Latent Diffusion: Views modeling as a denoising process within compressed continuous latents, generating holistic, high-fidelity video clips in a non-sequential, all-at-once manner.
    2. Autoregressive Diffusion: A hybrid design that employs causal denoising conditioned on historical context to generate sequential denoised latents, ensuring both temporal consistency and visual quality.
  • Embedding Prediction
    1. JEPA: A non-generative paradigm that encodes observations into contextual embeddings and learns knowledge via latent prediction, providing efficient representations for planning without pixel reconstruction.
  • State Transition
    1. State-Space Modeling: Compresses visual history into a compact recurrent state and models knowledge as linear updates, enabling efficient long-horizon state rollouts.
    2. Object-Centric Modeling: Decomposes observations into discrete object slots and learns world knowledge through slot interaction, allowing for generalization to novel object combinations.
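
As a concrete illustration of the first family, the sketch below performs an autoregressive rollout over discrete visual tokens with a small causal Transformer. It assumes the frames have already been discretized by a visual tokenizer (not shown), and the vocabulary size, model width, and sampling scheme are hypothetical placeholders rather than the configuration of any particular system.

```python
# Illustrative next-token rollout for a visual autoregressive VWM.
# Assumes frames have already been mapped to discrete token ids by a visual tokenizer (not shown);
# vocabulary size, model width, and sampling scheme are placeholder choices.
import torch
import torch.nn as nn


class TinyVisualAR(nn.Module):
    """A small causal Transformer over visual token ids, trained with next-token prediction."""

    def __init__(self, vocab: int = 1024, dim: int = 256, layers: int = 4, max_len: int = 2048):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(block, num_layers=layers)
        self.out = nn.Linear(dim, vocab)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        t = ids.size(1)
        x = self.tok(ids) + self.pos(torch.arange(t, device=ids.device))
        causal = torch.triu(torch.full((t, t), float("-inf"), device=ids.device), diagonal=1)
        return self.out(self.blocks(x, mask=causal))   # (B, T, vocab): next-token logits


@torch.no_grad()
def rollout(model: TinyVisualAR, context_ids: torch.Tensor, steps: int) -> torch.Tensor:
    """Autoregressive rollout: sample one token at a time and append it to the context."""
    ids = context_ids
    for _ in range(steps):
        logits = model(ids)[:, -1]                      # logits for the next visual token
        next_id = torch.multinomial(logits.softmax(-1), num_samples=1)
        ids = torch.cat([ids, next_id], dim=1)
    return ids                                          # context tokens followed by generated futures


# Usage: extend a tokenized visual context of 64 tokens by 16 future tokens.
model = TinyVisualAR().eval()
future = rollout(model, torch.randint(0, 1024, (1, 64)), steps=16)   # shape (1, 80)
```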

Table 1: Examples of visual autoregressive modeling VWMs.


Figure 5: Examples of diffusion-based (latent diffusion, autoregressive diffusion) VWMs.

📊 Evaluation System

We provide an extensive review of the evaluation landscape, cataloging metrics and distinguishing between foundational and domain-specific datasets and benchmarks.


Figure 6: Overview of the evaluation systems of VWMs.

Metrics

  • Visual Quality: Assesses the visual consistency of the simulation (a minimal metric example follows this list).
  • Dynamics Plausibility: Probes the model's adherence to physical laws.
  • Task Performance: Measures the model's overall effectiveness in downstream tasks.
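
As a small example of the visual quality category, the snippet below computes frame-wise PSNR between a predicted rollout and ground-truth frames and averages it over time. PSNR is only one common choice; the array shapes and value range are assumptions for illustration, and practical evaluations typically also report metrics such as SSIM, LPIPS, or FVD.

```python
# Minimal visual-quality metric example: frame-wise PSNR averaged over a predicted rollout.
# Illustrative only; shapes and the [0, 1] pixel range are assumptions.
import numpy as np


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR in dB between two frames with pixel values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)


def rollout_psnr(pred_frames: np.ndarray, gt_frames: np.ndarray) -> float:
    """Average PSNR over a (T, H, W, C) rollout; higher means closer to the ground truth."""
    return float(np.mean([psnr(p, g) for p, g in zip(pred_frames, gt_frames)]))


# Usage with random stand-in frames (T=8, 64x64 RGB in [0, 1]).
pred = np.random.rand(8, 64, 64, 3)
gt = np.random.rand(8, 64, 64, 3)
print(f"mean PSNR: {rollout_psnr(pred, gt):.2f} dB")
```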

Table 2: Summary of evaluation metrics for VWMs.

Datasets and Benchmarks

  • Foundational World Modeling: Foundational benchmarks designed to evaluate core capabilities:
    • General World Prediction and Simulation
    • Physics and Causality Benchmark
  • Domain-specific World Modeling: Datasets and testbeds tailored for high-impact downstream applications:
    • Embodied AI and Robotics
    • Autonomous Driving
    • Interactive Environments and Gaming

Table 3: Foundational world modeling datasets and benchmarks.


Table 4: Domain-specific world modeling datasets and benchmarks.

💡 Challenges and Future Directions

We organize our analysis around three interconnected calls to action for next-generation world models:


Figure 7: Overview of the challenges and future directions for next-generation world models.

  1. Re-grounding: move beyond appearance-driven imitation toward stronger grounding in world knowledge, including richer physical interactions, geometry-aware modeling, neuro-symbolic hybrid designs, and human-centered rules and conventions.
  2. Re-evaluation: move beyond visual fidelity toward meaningful evaluation of plausibility, through better benchmarks, judge models, and execution-based protocols that assess physical consistency, causal reasoning, and complex dynamics.
  3. Re-scaling: rethink scaling as a path to stronger world modeling capabilities, through pretraining scaling toward generalist VWMs and inference-time scaling that enables reasoning before generation.

Please visit our paper for more information!

BibTeX

@article{yu2026seeing,
  title={From Seeing to Knowing the World: A Survey of Vision World Models},
  author={Yu, Xiao and Zhang, Yichen and Wang, Mingzhang and Zhao, Shifang and Liu, Weizhe and Yin, Yuyang and Ren, Zhongwei and An, Ning and Wu, Xinglong and Liu, Hao and others},
  year={2026},
  publisher={Preprints}
}