DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision
Published in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024
Abstract
We have witnessed significant progress in deep learning-based 3D vision, ranging from neural radiance field (NeRF) based 3D representation learning to applications in novel view synthesis (NVS). However, existing scene-level datasets—limited to either synthetic environments or a narrow selection of real-world scenes—are insufficient. This gap hinders comprehensive benchmarking and caps what could be explored in 3D analysis. To address this, we present DL3DV-10K, a large-scale scene dataset featuring 51.2 million frames from 10,510 videos captured across 65 types of points of interest, covering bounded and unbounded scenes with diverse reflection, transparency, and lighting. We benchmark recent NVS methods on DL3DV-10K and report insights for future research, including a pilot study on generalizable NeRF that underscores the need for large-scale scene-level data.
Dataset Statistics
DL3DV-10K is designed to cover the complexity of the real world.
- Scale: 10,510 videos, 51.2 million frames at 4K resolution.
- Diversity: 65 categories of Points of Interest (POIs) including shopping centers, historical sites, and nature.
- Complexity Annotations: Fine-grained labels for reflection, transparency, lighting conditions (natural/artificial), and texture frequency.
- Quality: Professionally captured at 4K and 60 fps, with low motion blur and $360^\circ$ coverage.
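To give a feel for the scale figures above, the following sketch works out the average clip size they imply. It is a back-of-the-envelope calculation, not a statistic reported by the authors: it assumes the 51.2 million frames are spread over all 10,510 videos and that every video runs at 60 fps. The constant and function names are illustrative.

```python
# Figures taken from the dataset statistics listed above.
TOTAL_FRAMES = 51_200_000
NUM_VIDEOS = 10_510

# Assumption: every video is recorded at 60 fps and all of its frames are counted.
ASSUMED_FPS = 60


def average_frames_per_video(total_frames: int, num_videos: int) -> float:
    """Mean number of frames per video."""
    return total_frames / num_videos


def average_duration_seconds(frames_per_video: float, fps: int) -> float:
    """Mean clip length in seconds at the given frame rate."""
    return frames_per_video / fps


frames = average_frames_per_video(TOTAL_FRAMES, NUM_VIDEOS)
seconds = average_duration_seconds(frames, ASSUMED_FPS)
print(f"{frames:.0f} frames per video, about {seconds:.0f} s at {ASSUMED_FPS} fps")
# prints "4872 frames per video, about 81 s at 60 fps"
```

Under these assumptions a typical capture is a bit over a minute of footage. The actual per-video lengths may vary widely.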
Benchmark Results (DL3DV-140)
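We evaluated state-of-the-art NVS methods on DL3DV-140, a challenging subset of 140 scenes. Zip-NeRF achieves the best PSNR, SSIM, and LPIPS. 3DGS comes close on SSIM and LPIPS at roughly half the training time, while Mip-NeRF 360 is competitive on PSNR but takes 48 hours to train. Challenges remain in unbounded and high-frequency scenes.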
Table 1: Performance on DL3DV-140 Benchmark
| Method | PSNR $\uparrow$ | SSIM $\uparrow$ | LPIPS $\downarrow$ | Train Time |
|---|---|---|---|---|
| Instant-NGP | 25.01 | 0.834 | 0.228 | 1.2 hr |
| Nerfacto | 24.61 | 0.848 | 0.211 | 2.6 hr |
| Mip-NeRF 360 | 30.98 | 0.911 | 0.132 | 48.0 hr |
| 3DGS | 29.82 | 0.919 | 0.120 | 2.1 hr |
| **Zip-NeRF** | **31.22** | **0.921** | **0.112** | 4.0 hr |
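The comparison in Table 1 can be checked with a short script. The sketch below transcribes the table, picks the best method per column (higher is better for PSNR and SSIM, lower for LPIPS and training time), and converts the PSNR gap between Zip-NeRF and 3DGS into an error ratio using the standard definition $\text{PSNR} = 10 \log_{10}(\text{MAX}^2 / \text{MSE})$. The helper names are illustrative. The error ratio is only a rough reading, since it assumes the reported PSNR behaves like a single PSNR computed from one mean squared error rather than an average over scenes.

```python
# Values transcribed from Table 1 (DL3DV-140).
# Column order: PSNR (dB), SSIM, LPIPS, training time (hours).
TABLE_1 = {
    "Instant-NGP": (25.01, 0.834, 0.228, 1.2),
    "Nerfacto": (24.61, 0.848, 0.211, 2.6),
    "Mip-NeRF 360": (30.98, 0.911, 0.132, 48.0),
    "3DGS": (29.82, 0.919, 0.120, 2.1),
    "Zip-NeRF": (31.22, 0.921, 0.112, 4.0),
}

PSNR, SSIM, LPIPS, TRAIN_HOURS = range(4)


def best(column: int, higher_is_better: bool) -> str:
    """Return the method with the best value in the given column."""
    pick = max if higher_is_better else min
    return pick(TABLE_1, key=lambda name: TABLE_1[name][column])


def mse_ratio(psnr_gap_db: float) -> float:
    """MSE ratio implied by a PSNR gap, from PSNR = 10 * log10(MAX^2 / MSE)."""
    return 10 ** (psnr_gap_db / 10)


best_psnr = best(PSNR, higher_is_better=True)
best_ssim = best(SSIM, higher_is_better=True)
best_lpips = best(LPIPS, higher_is_better=False)
fastest = best(TRAIN_HOURS, higher_is_better=False)

# Quality versus cost for the two strongest methods.
gap_db = TABLE_1["Zip-NeRF"][PSNR] - TABLE_1["3DGS"][PSNR]
time_ratio = TABLE_1["Zip-NeRF"][TRAIN_HOURS] / TABLE_1["3DGS"][TRAIN_HOURS]

print(f"best PSNR: {best_psnr}, best SSIM: {best_ssim}, best LPIPS: {best_lpips}")
print(f"fastest to train: {fastest}")
print(f"Zip-NeRF vs 3DGS: +{gap_db:.2f} dB PSNR, {time_ratio:.1f}x training time")
print(f"implied MSE ratio (3DGS / Zip-NeRF): {mse_ratio(gap_db):.2f}")
```

Read this way, Zip-NeRF's 1.40 dB advantage over 3DGS corresponds to roughly 1.38 times lower mean squared error, at about 1.9 times the training time.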
Key Contributions
- Unprecedented Scale: The largest real-world scene-level dataset to date with 10K+ scenes.
- Real-World Complexity: Captures non-Lambertian surfaces (glass, water), view-dependent lighting, and unbounded outdoor environments lacking in synthetic datasets.
- Generalization: Pilot experiments show that pre-training on DL3DV-10K significantly improves the performance of generalizable NeRF models such as IBRNet, compared with training on smaller datasets.
@inproceedings{ling2024dl3dv,
title={DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision},
author={Lu Ling and Yichen Sheng and Zhi Tu and Wentian Zhao and Lantao Yu and Qianyu Guo and Zixun Yu and Yawen Lu and Xuanmao Li and Xingpeng Sun and Rohan Ashok and Aniruddha Mukherjee and Cheng Xin and Kun Wan and Hao Kang and Xiangrui Kong and Gang Hua and Tianyi Zhang and Bedrich Benes and Aniket Bera},
booktitle={IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
year={2024}
}
