Best student paper award

PixCuboid: Room Layout Estimation from Multi-view Featuremetric Alignment

Gustav Hanning, Kalle Åström, Viktor Larsson

Lund University
International Conference on Computer Vision 2025
Workshop on Large Scale Cross Device Localization

Our optimization-based room layout estimation method PixCuboid can accurately predict the layout from a collection of posed images. The ground truth layout is shown in yellow. The point cloud is displayed for visualization purposes and is not used as an input to our method.

Abstract

Coarse room layout estimation provides important geometric cues for many downstream tasks. Current state-of-the-art methods are predominantly based on single views and often assume panoramic images. We introduce PixCuboid, an optimization-based approach for cuboid-shaped room layout estimation, which is based on multi-view alignment of dense deep features. By training with the optimization end-to-end, we learn feature maps that yield large convergence basins and smooth loss landscapes in the alignment. This allows us to initialize the room layout using simple heuristics.
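The abstract describes aligning a cuboid against dense features from several posed views. To illustrate the geometric steps such an alignment needs, here is a minimal NumPy sketch. It assumes a gravity-aligned cuboid parameterized by a center, a yaw angle about the vertical (z) axis and three extents, plus a world-to-camera pose convention x_cam = R x_w + t. The function names `cuboid_corners`, `project` and `bilinear_sample` are hypothetical, and so are the parameterization and the sampler. None of them are taken from the PixCuboid implementation. The learned feature maps and the actual alignment loss are not reproduced here.

```python
import numpy as np

def cuboid_corners(center, yaw, size):
    """Return the 8 corners (8 x 3) of a gravity-aligned cuboid.

    Assumed (hypothetical) parameterization:
    center: (3,) world coordinates of the cuboid center
    yaw:    rotation about the vertical z axis, in radians
    size:   (3,) full extents along the cuboid's local x, y, z axes
    """
    half = 0.5 * np.asarray(size, dtype=float)
    # Every sign combination (+-1, +-1, +-1) selects one corner of the box.
    signs = np.array([[sx, sy, sz] for sx in (-1, 1)
                      for sy in (-1, 1) for sz in (-1, 1)], dtype=float)
    local = signs * half
    c, s = np.cos(yaw), np.sin(yaw)
    Rz = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    # Rotate each local corner by Rz, then translate to the center.
    return local @ Rz.T + np.asarray(center, dtype=float)

def project(points_w, K, R, t):
    """Project world points (N x 3) into a posed pinhole camera.

    Assumes the world-to-camera convention x_cam = R @ x_w + t.
    Returns pixel coordinates (N x 2) and camera-space depths (N,).
    """
    pc = points_w @ R.T + t
    depth = pc[:, 2]
    uv_h = pc @ K.T
    uv = uv_h[:, :2] / uv_h[:, 2:3]
    return uv, depth

def bilinear_sample(feat, uv):
    """Bilinearly interpolate an H x W x C feature map at pixel coords uv (N x 2).

    Integer coordinates address pixel indices; out-of-bounds samples are clamped.
    """
    H, W = feat.shape[:2]
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    u0 = np.floor(u).astype(int)
    v0 = np.floor(v).astype(int)
    u1 = np.minimum(u0 + 1, W - 1)
    v1 = np.minimum(v0 + 1, H - 1)
    du = (u - u0)[:, None]
    dv = (v - v0)[:, None]
    return ((1 - du) * (1 - dv) * feat[v0, u0] + du * (1 - dv) * feat[v0, u1]
            + (1 - du) * dv * feat[v1, u0] + du * dv * feat[v1, u1])
```

In a featuremetric setting, each posed view would supply its own feature map. The cuboid parameters would then be refined so that the features sampled at its projection minimize some alignment cost. How PixCuboid defines that cost and trains the features end-to-end is described in the paper and is not modeled by this sketch.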

For the evaluation, we propose two new benchmarks based on ScanNet++ and 2D-3D-Semantics, with manually verified ground truth 3D cuboids. In thorough experiments, we validate our approach and significantly outperform competing methods. Finally, while our network is trained with single cuboids, the flexibility of the optimization-based approach allows us to easily extend it to multi-room estimation, e.g., larger apartments or offices.

Results

We compare our predicted layouts with those of Total3DUnderstanding, Implicit3DUnderstanding, Deep3DLayout, LED²-Net and PSMNet. The results are shown above for three spaces in 2D-3D-Semantics. Predictions are visualized in blue and the ground truth cuboids in yellow.

BibTeX