World2Minecraft: Occupancy-Driven Simulated Scenes Construction

Lechao Zhang1, Haoran Xu1, Jingyu Gong1, Xuhong Wang2, Yuan Xie1, Xin Tan1,2
1 East China Normal University, Shanghai, China
2 Shanghai Artificial Intelligence Laboratory, Shanghai, China

Corresponding authors
Accepted to ICLR 2026

Abstract

Embodied intelligence requires high-fidelity simulation environments to support perception and decision-making, yet existing platforms often suffer from data contamination and limited flexibility. To address these issues, we propose World2Minecraft, which converts real-world scenes into structured Minecraft environments based on 3D semantic occupancy prediction. In the reconstructed scenes, we can effortlessly perform downstream tasks such as Vision-Language Navigation (VLN). However, we observe that reconstruction quality heavily depends on accurate occupancy prediction, which remains limited by data scarcity and poor generalization in existing models. We therefore introduce a low-cost, automated, and scalable data acquisition pipeline for creating customized occupancy datasets, and demonstrate its effectiveness through MinecraftOcc, a large-scale dataset of 100,165 images from 156 richly detailed indoor scenes. Extensive experiments show that our dataset provides a critical complement to existing datasets and poses a significant challenge to current state-of-the-art methods. These findings contribute to improving occupancy prediction and highlight the value of World2Minecraft as a customizable and editable platform for personalized embodied AI research. We will publicly release the dataset and the complete generation framework to ensure reproducibility and encourage future work.

Framework

Framework of our World2Minecraft, illustrating how real-world scenes are reconstructed as Minecraft environments and how navigation is subsequently conducted within them. 1) To transfer reality to Minecraft, RGB images are fed into the occupancy prediction model to predict semantic occupancy, which is then preprocessed into build instructions for reconstruction in Minecraft. 2) VLN tasks involving Next-View and Next-Action are performed within the reconstructed scenes.
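
To make this reconstruction step concrete, below is a minimal Python sketch of converting a predicted semantic occupancy grid into Minecraft build instructions. It assumes the model outputs a dense voxel grid of class labels and that blocks are placed via generated setblock commands; the label-to-block mapping, grid resolution, and world origin are illustrative placeholders rather than our released pipeline.

import numpy as np

# Hypothetical mapping from predicted semantic labels to Minecraft block types.
LABEL_TO_BLOCK = {
    1: "minecraft:oak_planks",    # floor
    2: "minecraft:stone_bricks",  # wall
    3: "minecraft:bookshelf",     # furniture
}

def occupancy_to_setblock(occupancy, origin=(0, 64, 0)):
    """Convert an (X, Y, Z) semantic occupancy grid into setblock commands.

    `occupancy` holds integer class labels, with 0 meaning empty space;
    `origin` offsets the grid into world coordinates (illustrative default).
    """
    ox, oy, oz = origin
    commands = []
    for (x, y, z), label in np.ndenumerate(occupancy):
        block = LABEL_TO_BLOCK.get(int(label))
        if block is None:  # skip empty or unmapped voxels
            continue
        commands.append(f"setblock {ox + x} {oy + y} {oz + z} {block}")
    return commands

# Toy example: a 2x1x2 patch of floor blocks.
grid = np.zeros((2, 1, 2), dtype=int)
grid[:, 0, :] = 1
for cmd in occupancy_to_setblock(grid):
    print(cmd)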

Dataset Construction for MinecraftVLN

Dataset Construction Pipeline for MinecraftVLN. We segment roomtour sequences into valid trajectories, then generate instruction-following question-answer pairs from the collected coordinates and orientations to construct the Next-View and Next-Action datasets.
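
As a rough illustration of this QA generation step, the sketch below derives a discrete Next-Action label from consecutive recorded poses. The pose format, turn and step thresholds, and action vocabulary are assumptions made for illustration, not the exact rules used to build MinecraftVLN.

import math

def pose_to_action(prev, curr, turn_thresh_deg=15.0):
    """Derive a Next-Action label from two consecutive (x, z, yaw_deg) poses."""
    dyaw = (curr[2] - prev[2] + 180.0) % 360.0 - 180.0  # wrap to [-180, 180)
    if dyaw > turn_thresh_deg:
        return "turn_right"  # the sign convention here is an assumption
    if dyaw < -turn_thresh_deg:
        return "turn_left"
    moved = math.hypot(curr[0] - prev[0], curr[1] - prev[1])
    return "move_forward" if moved > 0.25 else "stop"

def trajectory_to_qa(trajectory, instruction):
    """Turn a recorded roomtour trajectory into Next-Action question-answer pairs."""
    qa_pairs = []
    for prev, curr in zip(trajectory, trajectory[1:]):
        qa_pairs.append({
            "question": f"Instruction: {instruction}. What is the next action?",
            "answer": pose_to_action(prev, curr),
        })
    return qa_pairs

# Toy trajectory: walk forward one block, then turn.
traj = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 0.0, 90.0)]
print(trajectory_to_qa(traj, "Go to the piano"))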

Dataset Construction for MinecraftOcc

Dataset Construction Pipeline for MinecraftOcc. We record coordinate data during the roomtour, divide the viewpoint into two yaw-based cases to define view regions (yellow indicates invisible areas; green indicates visible areas), and extract semantic occupancy from the map data.
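
The sketch below shows one way a yaw-based view region check could mark occupied voxels as visible or invisible for a recorded camera pose. It operates in the horizontal (x, z) plane only, and the field of view, maximum range, and yaw-axis convention are illustrative assumptions rather than the exact parameters used for MinecraftOcc.

import math

def in_view_region(voxel_xz, cam_xz, cam_yaw_deg, fov_deg=90.0, max_range=16.0):
    """Check whether a voxel lies inside the camera's yaw-defined view region."""
    dx = voxel_xz[0] - cam_xz[0]
    dz = voxel_xz[1] - cam_xz[1]
    if math.hypot(dx, dz) > max_range:
        return False
    angle = math.degrees(math.atan2(dx, dz))              # yaw-axis convention is an assumption
    rel = (angle - cam_yaw_deg + 180.0) % 360.0 - 180.0   # angle relative to the camera's yaw
    return abs(rel) <= fov_deg / 2.0

# Label each occupied voxel of a horizontal slice as visible (green) or invisible (yellow).
occupied = [(3, 3), (3, -3), (10, 0)]
camera = (0.0, 0.0)
for voxel in occupied:
    status = "visible" if in_view_region(voxel, camera, cam_yaw_deg=0.0) else "invisible"
    print(voxel, status)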

Reconstruction Results

The reconstruction results from reality to Minecraft are presented above. From View 1 to View 3, the Reality row and the Minecraft row show a high degree of consistency. The Prediction column displays the predicted occupancy from different perspectives of the same scene, and the corresponding reconstructed scenes in the Construction column align well with it.

Demo Results

The result of a Gemini-2.5-Pro-controlled agent performing VLN in our reconstructed scene. Following the natural language instruction "Go to the piano", the agent successfully navigates to the target step by step.
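
A minimal sketch of the control loop behind this demo is given below: at every step, the agent's current first-person view and the instruction are sent to a vision-language model, which picks one discrete action to execute in the game. Here env and vlm are hypothetical interfaces standing in for the actual game client and the Gemini API, and the action set is illustrative.

def navigate(env, vlm, instruction, max_steps=20):
    """Closed-loop VLN control: observe -> query the VLM -> execute an action.

    Placeholder interfaces (not our released code): env.observe() returns the
    current first-person frame, env.step(action) executes a discrete action,
    and vlm.choose_action(image, prompt, actions) asks the model to pick one
    action from a fixed set.
    """
    actions = ["move_forward", "turn_left", "turn_right", "stop"]
    for _ in range(max_steps):
        frame = env.observe()
        prompt = (
            f"Instruction: {instruction}\n"
            f"Choose exactly one of: {', '.join(actions)}."
        )
        action = vlm.choose_action(frame, prompt, actions)
        if action == "stop":
            return True   # the agent believes it has reached the target
        env.step(action)
    return False          # step budget exhausted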

VLN Performance Results

Accuracy of Qwen2.5-VL (3B and 7B) on the Next-View and Next-Action tasks across three MinecraftVLN settings, evaluated under No Training, SFT, and RFT conditions.

MinecraftOcc Dataset Results

MinecraftOcc dataset results under different training settings.

NYUv2 Dataset Performance

Performance comparison on the NYUv2 dataset. * denotes the model trained on a mixture of the MinecraftOcc 8k and NYUv2 training sets and evaluated on the NYUv2 test set.

BibTeX

@inproceedings{zhangworld2minecraft,
  title={World2Minecraft: Occupancy-Driven Simulated Scenes Construction},
  author={Zhang, Lechao and Xu, Haoran and Gong, Jingyu and Wang, Xuhong and Xie, Yuan and Tan, Xin},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2026}
}