Summary
3D reconstruction technology that provides XR systems with 3D digital twin information
- Background
In an extended reality (XR) system, it is important to reproduce a given space in three dimensions and to accurately determine the number, location, and relationships of objects, so that users get a proper understanding of unfamiliar environments. 3D reconstruction and localization technologies are also essential to present information in a way that seems natural to those using devices such as XR glasses.
There are, however, several challenges that make this difficult.
- The conventional terrestrial laser scanners commonly used today take several minutes per scan and require a tremendous number of steps to scan and process large scenes. Data capture can also be interrupted by objects or people moving during the scan.
- Training an AI model that understands scenes*1 in the real world requires 3D data on a large number of objects paired with ground-truth labels. Because such labeled data is costly to collect, it is impossible to train a model that can be applied to every scene in the real world.
- Two technologies are required to properly link a 3D digital twin in an XR system with the real world: “camera pose estimation” for the XR system to accurately determine its own position in the real world, and “ground plane estimation” to correctly display the avatar in the digital twin on the floor in real space (Figure 1).
- Conventional plane detection methods sometimes extract the wrong plane from 3D data obtained from ToF sensors*2 due to noise or MPI*3.
*1 Scene understanding: The process of interpreting the context of a situation by comprehensively determining “what” is present “where” in a given scene, and “how.” For example, a system would determine information such as “people walking on the sidewalk” or “cars waiting at a traffic light” from video of a street, and then interpret the overall situation.
*2 ToF (time of flight) sensor: A sensor that measures the distance to an object by emitting light onto it and measuring either the time it takes for the reflected light to return or the phase of the reflected light.
*3 MPI: Multipath interference. A phenomenon in which emitted light reaches a ToF sensor via multiple paths and overlaps, causing distorted or incorrect depth measurements.
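As a rough illustration of the indirect ToF principle described in footnote *2 (a generic sketch, not a description of Ricoh's sensor), the measured phase shift of an amplitude-modulated signal converts to distance as follows; the modulation frequency and phase value below are hypothetical:

import math

C = 299_792_458.0       # speed of light [m/s]
f_mod = 20e6            # assumed modulation frequency [Hz] (hypothetical)
phase_shift = 1.2       # measured phase shift [rad] (hypothetical)

distance = C * phase_shift / (4 * math.pi * f_mod)    # ≈ 1.43 m
ambiguity_range = C / (2 * f_mod)                      # ≈ 7.5 m; beyond this, measured depth wraps around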

- Figure 1. Link between 3D digital twin and real world
- Solutions
We have developed technologies for scanning 3D space and generating 3D digital twins, and for recognizing these 3D digital twins.
These allow the XR system to understand information about the user, such as their state, surrounding environment, behavioral tendencies, and interactions, resulting in a more natural and intuitive XR experience.
- Technical highlights
We provide various methods to easily and quickly reconstruct 3D space.
(1) Single-shot 360-degree 3D reconstruction device
The 3D reconstruction device newly developed by Ricoh can easily and quickly capture a 360-degree field of view, scan an environment at high frequency, provide accurate 3D scans in dynamic environments, and reduce the discrepancy between actual scenes and their 3D digital twins. 3D data captured using this device in real-world scenes is also available for use as a dataset for multiple AI tasks (see related information).
Overview of a 3D reconstruction device
・ Omni-directional indirect ToF camera
・ Capable of convenient and quick capture
・ Generates digital twins in a single shot

Figure 2. 3D reconstruction device
Alt="ToF receiver: wide-angle fisheye lens (×4), ToF emitter: VCSEL + DOE + fisheye lens (×2), color sensor: hemispherical fisheye lens (×2)"

Figure 3. Possible usage industries, resolution, camera field of view, measurable range
Alt="For use in industries such as interior coordination, equipment, facility management, construction, and civil engineering. Resolution is 7296 × 3648 pixels, measurable range is 0.5 to 5 m, and field of view is 360° × 150°."

Figure 4. Data from multiple scans overlaid to reproduce entire space as single accurate 3D space
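As a generic illustration of how two such scans could be rigidly aligned into one space (the registration method actually used with the device is not described here), a minimal point-to-point ICP sketch using the Open3D library might look like the following; the file names, voxel size, and distance threshold are placeholders:

import numpy as np
import open3d as o3d

# Hypothetical scan files exported from the device (file names are placeholders).
source = o3d.io.read_point_cloud("scan_01.ply")
target = o3d.io.read_point_cloud("scan_02.ply")

# Downsample to speed up correspondence search.
source_down = source.voxel_down_sample(voxel_size=0.05)
target_down = target.voxel_down_sample(voxel_size=0.05)

# Point-to-point ICP starting from an identity initial guess.
result = o3d.pipelines.registration.registration_icp(
    source_down, target_down,
    max_correspondence_distance=0.1,
    init=np.eye(4),
    estimation_method=o3d.pipelines.registration.TransformationEstimationPointToPoint(),
)

# Apply the estimated rigid transform to the full-resolution scan and merge.
source.transform(result.transformation)
merged = source + target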
(2) Recognition of 3D space (open-vocabulary spatial segmentation)
Ricoh and the German Research Center for Artificial Intelligence (DFKI) have developed a spatial understanding AI model that requires no training, based on zero-shot learning*4. It utilizes models that have been pre-trained on large-scale data, such as the Segment Anything Model (SAM)*5 and Contrastive Language-Image Pre-training (CLIP)*6, to estimate correspondence relationships between natural language and 3D data. In addition to allowing a given scene to be understood with high accuracy even for previously unseen data, this technology also allows the user to interact with the XR system using natural language.
*4 Zero-shot learning: A machine learning technique in which a model recognizes and classifies data from classes for which it was given no training examples.
*5 Segment Anything Model (SAM): A foundation model released by Meta, which performs segmentation. It can automatically segment any region in an image based on user-specified prompts. It is also capable of zero-shot learning.
*6 Contrastive Language-Image Pre-training (CLIP): A foundation model released by OpenAI, which effectively combines information from different modalities, such as text and images, to make zero-shot learning possible.
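As a simplified, generic illustration of how SAM and CLIP can be combined for open-vocabulary segmentation (a 2D sketch only, not the Ricoh/DFKI model itself, which establishes the correspondence with 3D data), the image file, checkpoint path, and query text below are placeholders:

import numpy as np
import torch
from PIL import Image
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator
from transformers import CLIPModel, CLIPProcessor

image = np.array(Image.open("room.jpg").convert("RGB"))         # placeholder image

# 1) SAM proposes class-agnostic masks for every region it can find.
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")   # placeholder checkpoint path
masks = SamAutomaticMaskGenerator(sam).generate(image)

# 2) CLIP scores each masked region against a free-form text query.
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
query = "a chair near the window"                                # placeholder query

best_score, best_mask = float("-inf"), None
for m in masks:
    ys, xs = np.where(m["segmentation"])
    crop = image[ys.min():ys.max() + 1, xs.min():xs.max() + 1]
    inputs = processor(text=[query], images=crop, return_tensors="pt", padding=True)
    with torch.no_grad():
        score = clip(**inputs).logits_per_image.item()
    if score > best_score:
        best_score, best_mask = score, m["segmentation"]

# best_mask now marks the region that best matches the query; in the actual system,
# this correspondence is established with the 3D scan rather than a single 2D image.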

Figure 5. Scanned 3D data and spatial understanding results
The spatial understanding AI model suggests the object in space that best matches the question asked by the user in natural language. The figure shows an example of the XR system responding to the user based on the spatial information it has recognized.

Technology to realize XR experiences that connect virtual spaces with real spaces
(1) Relocalization between the XR system and the real world
In an XR system, accurate positional alignment in 3D space with the real world is essential for virtual content in the digital twin to be displayed naturally in real space. An XR device uses pre-acquired 3D scan data of the real world as a reference, estimating its own position and orientation in real time and aligning them with the scan data. Ricoh is currently developing technology that makes use of AR markers, structure from motion (SfM), and simultaneous localization and mapping (SLAM) so that these devices can estimate their own positions. AR markers function as reference points during the initial alignment process: the position where a marker is physically attached is mapped to a position in the 3D scan. SfM is a technique for reconstructing 3D structures and camera trajectories from multi-view images, and it is useful for geometric matching with the scan data. SLAM estimates the current position of the XR system in real time from camera and inertial measurement unit (IMU) information while simultaneously building an environment map. By combining all of these, the scan data, the real world, and the spatial coordinates of the XR system are kept consistent and integrated.
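As a minimal sketch of the coordinate bookkeeping involved (an illustration using assumed 4×4 rigid-transform conventions, not Ricoh's implementation), a marker observation can tie the SLAM frame to the frame of the pre-acquired scan as follows:

import numpy as np

def inv(T):
    # Invert a 4x4 rigid transform (rotation R, translation t).
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

# Hypothetical 4x4 poses (identities used as placeholders):
# T_scan_marker   - marker pose in the coordinate frame of the pre-acquired 3D scan
#                   (known because the marker's physical location was mapped into the scan)
# T_device_marker - marker pose observed by the XR device's camera at start-up
# T_slam_device   - device pose reported by SLAM in its own, arbitrary frame
T_scan_marker = np.eye(4)
T_device_marker = np.eye(4)
T_slam_device = np.eye(4)

# One-time alignment: a fixed transform that maps SLAM coordinates into scan coordinates.
T_scan_device = T_scan_marker @ inv(T_device_marker)
T_scan_slam = T_scan_device @ inv(T_slam_device)

# Afterwards, every new SLAM pose can be expressed in the scan's coordinate frame,
# so virtual content anchored in the digital twin stays aligned with the real world.
def device_pose_in_scan(T_slam_device_now):
    return T_scan_slam @ T_slam_device_now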

Figure 7. Technology to estimate the position of the XR device in a 3D space that has been scanned
(2) Ground plane estimation
Room layout estimation from 2D color images, without any 3D information, can be used to identify regions such as floors and walls in a scene, and spatial constraints such as the Manhattan world assumption*7 can be added to reconstruct a realistic room layout. Taking this as inspiration, Ricoh developed a high-precision ground plane detection method for 3D point clouds. The room layout is first estimated from a 2D color image, and the raw 3D point cloud obtained from the sensor is then aligned and fitted to that layout to estimate the ground plane region. This technique allows the avatar in the digital twin to be positioned and aligned correctly with the floor in the real world.
*7 Manhattan world assumption: The assumption that there are three dominant axes that are mutually perpendicular in artificial objects, and that any surfaces comprising the structure will be aligned perpendicular or parallel to them.
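For comparison, a generic way to extract a ground plane from a noisy point cloud without the layout prior described above is RANSAC plane fitting. The sketch below uses Open3D; the file name, gravity axis, and thresholds are assumptions, and this is not Ricoh's layout-guided method:

import numpy as np
import open3d as o3d

pcd = o3d.io.read_point_cloud("scan.ply")    # placeholder file name
up = np.array([0.0, 0.0, 1.0])               # assumed gravity (up) direction

ground = None      # (plane coefficients, mean height) of the best candidate so far
remaining = pcd
for _ in range(5):  # examine the few largest planes in the cloud
    plane, inliers = remaining.segment_plane(distance_threshold=0.02,
                                             ransac_n=3,
                                             num_iterations=1000)
    normal = np.asarray(plane[:3])
    normal = normal / np.linalg.norm(normal)
    # Keep only near-horizontal planes (normal roughly parallel to gravity);
    # among those, the lowest one is taken as the ground.
    if abs(float(np.dot(normal, up))) > 0.95:
        height = np.asarray(remaining.select_by_index(inliers).points)[:, 2].mean()
        if ground is None or height < ground[1]:
            ground = (plane, height)
    remaining = remaining.select_by_index(inliers, invert=True)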

Figure 8. Ground plane estimation flowcharts and example results
- Related information
・ Research paper:
ToF-360 – A Panoramic Time-of-flight RGB-D Dataset for Single Capture Indoor Semantic 3D Reconstruction
https://av.dfki.de/publications/tof-360-a-panoramic-time-of-flight-rgb-d-dataset-for-single-capture-indoor-semantic-3d-reconstruction/
Source: IEEE/CVF (Eds.). 21st CVPR Workshop on Perception Beyond the Visible Spectrum (PBVS-2025), June 11-15, Nashville, Tennessee, USA, IEEE, 2025.
・ Public dataset:
ToF-360 Dataset
https://huggingface.co/datasets/COLE-Ricoh/ToF-360
・ Related technology:
AI Solution for Spatial Data Creation and Utilization
https://www.ricoh.com/technology/tech/126_building_digital_twin