PanoContext-Former: Panoramic Total Scene Understanding with a Transformer

CVPR 2024


Yuan Dong1, Chuan Fang2, Liefeng Bo1, Zilong Dong1, Ping Tan2

1Alibaba Group     2Hong Kong University of Science and Technology  

Abstract


PanoContext-Former estimates the room layout, oriented object bounding boxes, and object meshes from a single panoramic image of an indoor scene.

Panoramic images enable a deeper understanding and more holistic perception of the $360^\circ$ surrounding environment, naturally encoding richer scene context than standard perspective images. Previous work has devoted considerable effort to solving the scene understanding task with hybrid solutions based on 2D-3D geometric reasoning, in which each sub-task is processed separately and few correlations among them are explored. In this paper, we propose a fully 3D method for holistic indoor scene understanding that simultaneously recovers object shapes, oriented bounding boxes, and the 3D room layout from a single panorama. To fully exploit the rich context information, we design a transformer-based context module that predicts the representations of and relationships among the components of the scene. In addition, we introduce a new dataset for scene understanding, including photo-realistic panoramas, high-fidelity depth images, accurately annotated room layouts, and oriented object bounding boxes and shapes. Experiments on the synthetic and new datasets demonstrate that our method outperforms previous panoramic scene understanding methods in terms of both layout estimation and 3D object detection.


Method



The proposed pipeline simultaneously predicts the room layout, oriented 3D object bounding boxes, and object shapes. As shown in the figure above, we use a depth estimator to lift the 2D panorama into a point cloud, which is then fed into the Object Detection Network (ODN) to jointly predict 3D object boxes and shape codes. Meanwhile, the layout is recovered as a triangle mesh from the input panorama through the Layout Estimation Network (LEN). We exploit the transformer's intrinsic advantages and scalability in modeling different modalities and tasks, which makes it easier to learn appropriate spatial relationships among objects and the layout: features from the layout, the image, and the 3D objects are fed into the context module to better estimate the representations of and relations among objects and layout. Finally, the room layout and object shapes are recovered as meshes, then scaled and placed at appropriate locations to reconstruct the full scene. A sketch of this pipeline structure is given below; details of each module are elaborated in the paper.
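
The following is a minimal, self-contained PyTorch sketch of the pipeline structure described above: back-projecting an equirectangular depth map into a point cloud and running a transformer encoder over layout, image, and object tokens. All module names (ContextModule, lift_to_point_cloud), token layouts, and dimensions are illustrative assumptions for exposition, not the authors' implementation.

import math
import torch
import torch.nn as nn


def lift_to_point_cloud(depth: torch.Tensor) -> torch.Tensor:
    """Back-project an equirectangular depth map (B, H, W) to 3D points (B, H*W, 3)."""
    b, h, w = depth.shape
    # Spherical angles for each pixel of the panorama (assumed full 360x180 coverage).
    lon = torch.linspace(-math.pi, math.pi, w).view(1, 1, w).expand(b, h, w)
    lat = torch.linspace(math.pi / 2, -math.pi / 2, h).view(1, h, 1).expand(b, h, w)
    x = depth * torch.cos(lat) * torch.sin(lon)
    y = depth * torch.sin(lat)
    z = depth * torch.cos(lat) * torch.cos(lon)
    return torch.stack([x, y, z], dim=-1).reshape(b, -1, 3)


class ContextModule(nn.Module):
    """Transformer encoder over concatenated layout, image, and object tokens (assumed design)."""
    def __init__(self, dim: int = 256, heads: int = 8, layers: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, layers)

    def forward(self, layout_tok, image_tok, object_tok):
        tokens = torch.cat([layout_tok, image_tok, object_tok], dim=1)
        return self.encoder(tokens)


if __name__ == "__main__":
    depth = torch.rand(1, 64, 128) * 5.0      # stand-in for the estimated depth map
    points = lift_to_point_cloud(depth)       # (1, 8192, 3) point cloud fed to the ODN
    # Placeholder features that the LEN, an image backbone, and the ODN would produce.
    layout_tok = torch.randn(1, 1, 256)       # a single layout token
    image_tok = torch.randn(1, 32, 256)       # pooled image features
    object_tok = torch.randn(1, 10, 256)      # one token per detected object
    refined = ContextModule()(layout_tok, image_tok, object_tok)
    print(points.shape, refined.shape)        # (1, 8192, 3) and (1, 43, 256)

The refined tokens would then be decoded into box refinements, shape codes, and layout parameters by task-specific heads, which are omitted here.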


Holistic scene understanding and reconstruction results




More results on iGibson-Synthetic dataset




More results on Replica-Pano dataset




Citation


@article{dong2023panocontext,
  title={PanoContext-Former: Panoramic Total Scene Understanding with a Transformer},
  author={Dong, Yuan and Fang, Chuan and Dong, Zilong and Bo, Liefeng and Tan, Ping},
  journal={arXiv preprint arXiv:2305.12497},
  year={2023}
}