
Learning to Generate Unbounded 3D Scenes from Image Collections
Introduction
Scene generation has attracted considerable attention in recent years, addressing the growing need for 3D creative tools in the metaverse. At the core of 3D content creation is inverse graphics, which aims to recover 3D representations from 2D observations. Given the cost and labor of creating 3D assets, the ultimate goal of 3D content creation would be to learn a generative model directly from in-the-wild 2D images.
Recent work on 3D-aware generative models tackles this problem to some extent by learning to generate object-level content from curated 2D image data. However, their observation space lies in a bounded domain, and the generated content occupies only a limited region of Euclidean space. It is highly desirable to learn 3D generative models for unbounded scenes from in-the-wild 2D images, e.g., a vivid landscape that covers an arbitrarily large region, which is what we aim to tackle in this paper.
Method
To generate unbounded 3D scenes from in-the-wild 2D images, three critical issues must be addressed: the unbounded extent of scenes, content that is unaligned in scale and coordinates, and in-the-wild 2D images without known camera poses. Specifically, a successful unbounded scene generation model should overcome the following challenges: 1) Lack of an efficient 3D representation for unbounded 3D scenes. Unbounded scenes can occupy an arbitrarily large region of Euclidean space, necessitating an efficient 3D representation; 2) Lack of content alignment. Given a set of in-the-wild 2D images, objects with different semantics may be captured at varying scales, 3D locations, and orientations, and such unaligned content often leads to unstable training; 3) Lack of priors on camera pose distributions. In-the-wild 2D images may come from non-overlapping views or different image sources, making it difficult to estimate camera poses via structure-from-motion due to the absence of reliable correspondences between images.
Given the aforementioned challenges, we propose a principled learning paradigm, SceneDreamer, that learns to generate unbounded 3D scenes from in-the-wild 2D image collections without camera parameters. To this end, our framework consists of three modules: an efficient yet expressive 3D scene representation, a generative scene parameterization, and a volumetric renderer that can leverage knowledge from 2D images.
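As a rough structural sketch only (not the actual SceneDreamer code), the three modules could be composed as follows; the class, argument, and module names below are hypothetical placeholders:

import torch
import torch.nn as nn

class UnboundedSceneGenerator(nn.Module):
    """Sketch: scene representation -> generative parameterization -> volumetric renderer."""

    def __init__(self, scene_repr: nn.Module, parameterization: nn.Module, renderer: nn.Module):
        super().__init__()
        self.scene_repr = scene_repr              # efficient 3D representation of the unbounded scene
        self.parameterization = parameterization  # maps (scene, style code) to renderable features
        self.renderer = renderer                  # volumetric renderer producing a 2D view

    def forward(self, scene_noise, style_noise, camera):
        scene = self.scene_repr(scene_noise)                  # sample a scene layout from noise
        features = self.parameterization(scene, style_noise)  # condition the scene on a style latent
        return self.renderer(features, camera)                # render a 2D image for the given camera

During training, the rendered 2D views would be compared against real in-the-wild images (e.g., with an adversarial objective), which is how the framework leverages knowledge from 2D image collections without 3D or camera supervision.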
Inference
Once training is done, we can randomly sample noise vectors to generate diverse 3D scenes with 3D consistency and well-defined geometry, and even render them along free camera trajectories!
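A hedged sketch of what such inference could look like, assuming a trained generator with the hypothetical interface above; the toy trajectory and latent dimensions are made up for illustration:

import math
import torch

def make_camera(t: float, radius: float = 5.0, height: float = 2.0) -> torch.Tensor:
    # Toy circular trajectory: a 4x4 camera-to-world pose with only the translation filled in.
    angle = 2.0 * math.pi * t
    pose = torch.eye(4)
    pose[:3, 3] = torch.tensor([radius * math.cos(angle), height, radius * math.sin(angle)])
    return pose

@torch.no_grad()
def render_trajectory(generator, num_frames: int = 60, scene_dim: int = 256, style_dim: int = 128):
    # Sample one scene code and one style code, then keep both fixed across frames
    # so the rendered views of the same scene stay 3D-consistent.
    scene_noise = torch.randn(1, scene_dim)
    style_noise = torch.randn(1, style_dim)
    frames = []
    for t in torch.linspace(0.0, 1.0, num_frames):
        frames.append(generator(scene_noise, style_noise, make_camera(float(t))))
    return frames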
By running inference with a sliding-window mechanism, we can generate scenes at resolutions beyond the training resolution. The figure below shows a scene 10x larger than the training resolution, and we can also interpolate smoothly along both style and space.
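The sketch below illustrates both ideas under stated assumptions: `local_generator` is a hypothetical callable that renders a cropped window of a larger scene layout under a given style code (a simplified interface, not the actual implementation), and the style interpolation simply blends two sampled style codes linearly:

import torch

@torch.no_grad()
def sliding_window_generate(local_generator, scene_layout: torch.Tensor, style: torch.Tensor,
                            window: int = 256, stride: int = 224):
    # Slide an overlapping window over a scene layout larger than the training
    # resolution and generate each window; overlaps would be blended when stitching.
    H, W = scene_layout.shape[-2:]
    outputs = []
    for top in range(0, max(H - window, 0) + 1, stride):
        for left in range(0, max(W - window, 0) + 1, stride):
            crop = scene_layout[..., top:top + window, left:left + window]
            outputs.append(local_generator(crop, style))
    return outputs

# Smooth style interpolation: linearly blend two sampled style codes.
style_a, style_b = torch.randn(1, 128), torch.randn(1, 128)
styles = [torch.lerp(style_a, style_b, w) for w in torch.linspace(0.0, 1.0, 10)]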
For more information (code, video, and interactive demo), please visit our project page.