World2Minecraft: Occupancy-Driven Simulated Scenes Construction

Lechao Zhang1, Haoran Xu1, Jingyu Gong1, Xuhong Wang2, Yuan Xie1, Xin Tan1,2
1 East China Normal University, Shanghai, China
2 Shanghai Artificial Intelligence Laboratory, Shanghai, China

Corresponding authors
Accepted to ICLR 2026

Abstract

Embodied intelligence requires high-fidelity simulation environments to support perception and decision-making, yet existing platforms often suffer from data contamination and limited flexibility. To address these issues, we propose World2Minecraft, which converts real-world scenes into structured Minecraft environments based on 3D semantic occupancy prediction. In the reconstructed scenes, we can effortlessly perform downstream tasks such as Vision-Language Navigation (VLN). However, we observe that reconstruction quality heavily depends on accurate occupancy prediction, which remains limited by data scarcity and poor generalization in existing models. We therefore introduce a low-cost, automated, and scalable data acquisition pipeline for creating customized occupancy datasets, and demonstrate its effectiveness through MinecraftOcc, a large-scale dataset of 100,165 images from 156 richly detailed indoor scenes. Extensive experiments show that our dataset provides a critical complement to existing datasets and poses a significant challenge to current state-of-the-art methods. These findings contribute to improving occupancy prediction and highlight the value of World2Minecraft as a customizable and editable platform for personalized embodied AI research. We will publicly release the dataset and the complete generation framework to ensure reproducibility and encourage future work.

Framework

Framework of our World2Minecraft, illustrating how real-world scenes are reconstructed as Minecraft environments and how navigation is subsequently conducted within them. 1) To transfer reality to Minecraft, RGB images are fed into the occupancy prediction model to predict semantic occupancy, which is then preprocessed into build instructions for reconstruction in Minecraft. 2) VLN tasks involving Next-View and Next-Action are performed within the reconstructed scenes.
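
To make this reconstruction step concrete, below is a minimal Python sketch of converting a predicted semantic occupancy grid into Minecraft build instructions. It assumes the model outputs a dense voxel grid of class labels and that blocks are placed via generated setblock commands; the label-to-block mapping, grid resolution, and world origin are illustrative placeholders rather than our released pipeline.

import numpy as np

# Hypothetical mapping from predicted semantic labels to Minecraft block types.
LABEL_TO_BLOCK = {
    1: "minecraft:oak_planks",    # floor
    2: "minecraft:stone_bricks",  # wall
    3: "minecraft:bookshelf",     # furniture
}

def occupancy_to_setblock(occupancy, origin=(0, 64, 0)):
    """Convert an (X, Y, Z) semantic occupancy grid into setblock commands.

    `occupancy` holds integer class labels, with 0 meaning empty space;
    `origin` offsets the grid into world coordinates (illustrative default).
    """
    ox, oy, oz = origin
    commands = []
    for (x, y, z), label in np.ndenumerate(occupancy):
        block = LABEL_TO_BLOCK.get(int(label))
        if block is None:  # skip empty or unmapped voxels
            continue
        commands.append(f"setblock {ox + x} {oy + y} {oz + z} {block}")
    return commands

# Toy example: a 2x1x2 patch of floor blocks.
grid = np.zeros((2, 1, 2), dtype=int)
grid[:, 0, :] = 1
for cmd in occupancy_to_setblock(grid):
    print(cmd)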

Dataset Construction for MinecraftVLN

Dataset Construction Pipeline for MinecraftVLN. We segment roomtour sequences into valid trajectories, then generate instruction-following question-answer pairs from the collected coordinates and orientations to construct the Next-View and Next-Action datasets.
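
As a rough illustration of this QA generation step, the sketch below derives a discrete Next-Action label from consecutive recorded poses. The pose format, turn and step thresholds, and action vocabulary are assumptions made for illustration, not the exact rules used to build MinecraftVLN.

import math

def pose_to_action(prev, curr, turn_thresh_deg=15.0):
    """Derive a Next-Action label from two consecutive (x, z, yaw_deg) poses."""
    dyaw = (curr[2] - prev[2] + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    if dyaw > turn_thresh_deg:
        return "turn_right"  # the sign convention here is an assumption
    if dyaw < -turn_thresh_deg:
        return "turn_left"
    moved = math.hypot(curr[0] - prev[0], curr[1] - prev[1])
    return "move_forward" if moved > 0.25 else "stop"

def trajectory_to_qa(trajectory, instruction):
    """Turn a recorded roomtour trajectory into Next-Action question-answer pairs."""
    qa_pairs = []
    for prev, curr in zip(trajectory, trajectory[1:]):
        qa_pairs.append({
            "question": f"Instruction: {instruction}. What is the next action?",
            "answer": pose_to_action(prev, curr),
        })
    return qa_pairs

# Toy trajectory: walk forward one block, then turn.
traj = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 90.0)]
print(trajectory_to_qa(traj, "Go to the piano"))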

Dataset Construction for MinecraftOcc

Dataset Construction Pipeline for MinecraftOcc. We record coordinate data during the roomtour, divide the viewpoint into two yaw-based cases to define view regions (yellow indicates invisible areas; green indicates visible areas), and extract semantic occupancy from the map data.
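
The sketch below shows one way a yaw-based view region check could mark occupied voxels as visible or invisible for a recorded camera pose. It operates in the horizontal (x, z) plane only, and the field of view, maximum range, and yaw-axis convention are illustrative assumptions rather than the exact parameters used for MinecraftOcc.

import math

def in_view_region(voxel_xz, cam_xz, cam_yaw_deg, fov_deg=90.0, max_range=16.0):
    """Check whether a voxel lies inside the camera's yaw-defined view region."""
    dx = voxel_xz[0] - cam_xz[0]
    dz = voxel_xz[1] - cam_xz[1]
    if math.hypot(dx, dz) > max_range:
        return False
    angle = math.degrees(math.atan2(dx, dz))              # yaw-axis convention is an assumption
    rel = (angle - cam_yaw_deg + 180.0) % 360.0 - 180.0   # angle relative to the camera's yaw
    return abs(rel) <= fov_deg / 2.0

# Label each occupied voxel of a horizontal slice as visible (green) or invisible (yellow).
occupied = [(3, 3), (3, -3), (10, 0)]
camera = (0.0, 0.0)
for voxel in occupied:
    status = "visible" if in_view_region(voxel, camera, cam_yaw_deg=0.0) else "invisible"
    print(voxel, status)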

Reconstruction Results

The reconstruction results from reality to Minecraft are presented above. From View 1 to View 3, the Reality row and the Minecraft row show a high degree of consistency. The Prediction column displays the predicted occupancy from different perspectives of the same scene, and the corresponding reconstructed scenes in the Construction column align well with it.

Demo Results

The result of a Gemini-2.5-Pro-controlled agent performing VLN in our reconstructed scene. Following the natural language instruction "Go to the piano", the agent successfully navigates to the target step by step.
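
A minimal sketch of the control loop behind this demo is given below: at every step, the agent's current first-person view and the instruction are sent to a vision-language model, which picks one discrete action to execute in the game. Here env and vlm are hypothetical interfaces standing in for the actual game client and the Gemini API, and the action set is illustrative.

def navigate(env, vlm, instruction, max_steps=20):
    """Closed-loop VLN control: observe -> query the VLM -> execute an action.

    Placeholder interfaces (not our released code): env.observe() returns the
    current first-person frame, env.step(action) executes a discrete action,
    and vlm.choose_action(image, prompt, actions) asks the model to pick one
    action from a fixed set.
    """
    actions = ["move_forward", "turn_left", "turn_right", "stop"]
    for _ in range(max_steps):
        frame = env.observe()
        prompt = (
            f"Instruction: {instruction}\n"
            f"Choose exactly one of: {', '.join(actions)}."
        )
        action = vlm.choose_action(frame, prompt, actions)
        if action == "stop":
            return True   # the agent believes it has reached the target
        env.step(action)
    return False          # step budget exhausted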

VLN Performance Results

Accuracy of Qwen2.5-VL (3B and 7B) on the Next-View and Next-Action tasks across three MinecraftVLN settings, evaluated under No Training, SFT, and RFT conditions.

MinecraftOcc Dataset Results

MinecraftOcc dataset results under different training settings.

NYUv2 Dataset Performance

Performance comparison on the NYUv2 dataset. * denotes the model trained on a mixture of the MinecraftOcc 8k and NYUv2 training sets and evaluated on the NYUv2 test set.

BibTeX

@inproceedings{zhangworld2minecraft,
  title={World2Minecraft: Occupancy-Driven Simulated Scenes Construction},
  author={Zhang, Lechao and Xu, Haoran and Gong, Jingyu and Wang, Xuhong and Xie, Yuan and Tan, Xin},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}