Unsupervised Discovery of
Object-Centric Neural Fields

Rundong Luo*    Hong-Xing "Koven" Yu*    Jiajun Wu   

Stanford University   

(*Equal Contribution)   


Abstract


We study inferring 3D object-centric scene representations from a single image. While recent methods have shown potential in unsupervised 3D object discovery from simple synthetic images, they fail to generalize to real-world scenes with visually rich and diverse objects. This limitation stems from their object representations, which entangle objects' intrinsic attributes like shape and appearance with extrinsic, viewer-centric properties such as their 3D location. To address this bottleneck, we propose Unsupervised discovery of Object-Centric neural Fields (uOCF). uOCF focuses on learning the intrinsics of objects and models the extrinsics separately. Our approach significantly improves systematic generalization, thus enabling unsupervised learning of high-fidelity object-centric scene representations from sparse real-world images. To evaluate our approach, we collect three new datasets, including two real kitchen environments. Extensive experiments show that uOCF enables unsupervised discovery of visually rich objects from a single real image, allowing applications such as 3D object segmentation and scene manipulation. Notably, uOCF demonstrates zero-shot generalization to unseen objects from a single real image.


Inferring object representations from a single image


Given a real image (the first frame) with visually rich objects, uOCF infers a factorized 3D object-centric scene representation, enabling reconstruction and manipulation from arbitrary novel views.


Video



Novel View Synthesis



Robustness to occlusion



Scene segmentation



Method


Our model consists of an encoder, a latent inference module, and a decoder. The encoder extracts a feature map from the input image. The latent inference module infers each object's latent representation and 3D position in the underlying scene from this feature map. Finally, the object NeRF decoder decodes the latent representations and positions into object-centric neural fields and composes them to reconstruct the scene.

(Figure: overview of the uOCF architecture.)
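
For concreteness, the sketch below mirrors this pipeline in PyTorch. Every design choice here (the convolutional feature extractor, a single linear layer standing in for the latent inference module, the MLP object decoder, and the density-weighted compositing rule) is a simplified assumption for illustration only, not the released uOCF implementation.

    import torch
    import torch.nn as nn

    class ObjectNeRFDecoder(nn.Module):
        """Decodes one object latent into density and color at 3D query points."""
        def __init__(self, latent_dim=64, hidden=128):
            super().__init__()
            self.mlp = nn.Sequential(
                nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 4),  # density + RGB
            )

        def forward(self, pts, latent, position):
            # Query points are expressed in the object's local frame, so the latent
            # only needs to capture intrinsics (shape, appearance), not 3D location.
            local_pts = pts - position
            out = self.mlp(torch.cat([local_pts, latent.expand(pts.shape[0], -1)], dim=-1))
            return torch.relu(out[..., :1]), torch.sigmoid(out[..., 1:])

    class UOCFSketch(nn.Module):
        """Encoder -> latent inference (latents + positions) -> composed object NeRFs."""
        def __init__(self, num_objects=4, latent_dim=64):
            super().__init__()
            self.num_objects, self.latent_dim = num_objects, latent_dim
            self.encoder = nn.Conv2d(3, latent_dim, kernel_size=8, stride=8)  # stand-in feature extractor
            # Stand-in for the latent inference module: one latent + 3D position per object.
            self.latent_inference = nn.Linear(latent_dim, num_objects * (latent_dim + 3))
            self.decoder = ObjectNeRFDecoder(latent_dim)

        def forward(self, image, query_pts):
            feat = self.encoder(image).flatten(2).mean(-1)[0]  # assume batch size 1
            slots = self.latent_inference(feat).view(self.num_objects, self.latent_dim + 3)
            densities, colors = [], []
            for latent_and_pos in slots:
                latent, pos = latent_and_pos[:-3], latent_and_pos[-3:]
                d, c = self.decoder(query_pts, latent, pos)
                densities.append(d)
                colors.append(c)
            densities, colors = torch.stack(densities), torch.stack(colors)
            # Compose the per-object fields: density-weighted mixture of object colors.
            weights = densities / (densities.sum(0, keepdim=True) + 1e-8)
            return densities.sum(0), (weights * colors).sum(0)

    # Usage: query the composed field at random 3D points for a dummy image.
    model = UOCFSketch()
    sigma, rgb = model(torch.randn(1, 3, 128, 128), torch.randn(1024, 3))

The key point the sketch illustrates is the factorization: each object decoder only ever sees points in the object's local frame, so intrinsics (shape, appearance) are separated from the extrinsic 3D position inferred per object.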

We introduce object-centric prior learning to address the inherent ambiguity of complex compositional scenes. The main idea is to learn general object priors (e.g., physical coherence) from simple scenes (e.g., scenes with a single synthetic object), and then leverage these priors to learn from more complex scenes whose geometry and spatial layout may differ substantially. An illustration is shown below.

(Figure: illustration of object-centric prior learning.)
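
A rough sketch of this two-stage schedule follows, reusing the toy UOCFSketch model from the snippet above. The dataloaders simple_scenes and complex_scenes, the plain MSE reconstruction loss, and the learning rates and epoch counts are hypothetical placeholders rather than the paper's actual training recipe.

    import torch
    import torch.nn.functional as F

    def train_stage(model, loader, epochs, lr):
        """One training stage: fit the model to a set of scenes with a reconstruction loss."""
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for image, query_pts, target_rgb in loader:
                _, rgb = model(image, query_pts)
                loss = F.mse_loss(rgb, target_rgb)  # stand-in for the full rendering loss
                opt.zero_grad()
                loss.backward()
                opt.step()

    # Stage 1: learn general object priors (e.g., physical coherence) from simple,
    # single-object synthetic scenes.
    model = UOCFSketch()
    train_stage(model, simple_scenes, epochs=100, lr=1e-4)

    # Stage 2: carry the learned object prior over to complex scenes whose geometry
    # and layout differ, continuing training from the stage-1 weights.
    train_stage(model, complex_scenes, epochs=50, lr=3e-5)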

Citation


        
@article{uOCF,
  title={Unsupervised Discovery of Object-Centric Neural Fields},
  author={Luo, Rundong and Yu, Hong-Xing and Wu, Jiajun},
  journal={arXiv},
  year={2024},
}
      

Contact


If you have any questions, please feel free to contact us:

  • Rundong Luo: rundongluo2002@gmail.com