
This section discusses the primary challenges in the current practice of using 3D imaging technologies within the A/E/C/FM industry.

Occlusions. Occlusions block a visual sensor's view of the object of interest, requiring either that the sensor move around them or that an algorithm infer the complete object behind them. Many A/E/C/FM applications are susceptible to occlusions for several reasons: (1) dense deployment of infrastructure components in a given scene, e.g., bracings under a bridge (at least 35-40% of the deck underside is occluded in Fig. 1) and pipelines crammed into an equipment room; (2) temporary components during construction, such as scaffolding and equipment; and (3) a mixture of moving components (such as appliances and furniture) and fixed architectural/infrastructure components.

Lack of 3D scene understanding features/clutter. The goal of scene understanding is to take 3D imaging data, such as RGB(D) images or point clouds, as input and output a semantic understanding of the scene: the objects present, their attributes, and the relationships among them. Scene understanding analysis can be divided into two levels: (1) low-level semantic analysis, which segments/clusters the data based on similarity, such as grouping neighboring points with similar color into a patch or segmenting a plane from a point cloud using normal-based region growing; and (2) high-level semantic analysis, which aims at understanding what building/infrastructure components exist in a scene and how they relate to each other, such as recognizing the MEP components in the scene. In this paper, we refer to a scene understanding feature as information that could be helpful for scene understanding analysis. Hand-crafted features extracted by filters, such as edges, corners, and SIFT descriptors38, have been widely used for scene understanding in both images and point clouds. Unlike hand-crafted features, however, learning features from labeled data using machine learning for 3D imaging representations is still an active research domain. This is because: (1) 3D imaging data, especially point clouds, are usually unordered, sparse, and non-uniformly distributed in space; and (2) architectural scenes usually have many texture-poor surfaces (walls, ceilings, floors) and repetitive components (columns, beams) that make the extracted features non-unique and complicate visibility reasoning39. The quality of a reconstructed point cloud is also degraded by a lack of features in the scene, as shown in Fig. 2.
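To make the low-level analysis above concrete, the following sketch implements a minimal normal-based region growing in pure NumPy on a synthetic scene: a flat "floor" patch plus an off-plane cluster of clutter. The point counts, thresholds, and scene geometry are illustrative assumptions, not parameters from this paper, and production pipelines would use a dedicated library with spatial indexing rather than this brute-force search.

```python
import numpy as np

def estimate_normals(points, k=10):
    """Estimate a unit normal per point via PCA over its k nearest neighbors."""
    normals = np.zeros_like(points)
    for i in range(len(points)):
        d = np.linalg.norm(points - points[i], axis=1)
        nbrs = points[np.argsort(d)[:k]]
        # Normal = eigenvector of the smallest eigenvalue of the local covariance.
        w, v = np.linalg.eigh(np.cov(nbrs.T))
        normals[i] = v[:, 0]
    return normals

def region_grow_plane(points, normals, seed=0, angle_deg=10.0, radius=0.3):
    """Grow a planar region from a seed: a neighbor within `radius` joins if
    its normal deviates from the seed normal by less than `angle_deg`."""
    cos_thresh = np.cos(np.radians(angle_deg))
    region, frontier = {seed}, [seed]
    while frontier:
        i = frontier.pop()
        d = np.linalg.norm(points - points[i], axis=1)
        for j in np.where(d < radius)[0]:
            if j not in region and abs(normals[j] @ normals[seed]) > cos_thresh:
                region.add(j)
                frontier.append(j)
    return sorted(region)

# Synthetic scene: 200 points on the z = 0 plane, 50 clutter points elsewhere.
rng = np.random.default_rng(0)
floor = np.c_[rng.uniform(0, 1, (200, 2)), np.zeros(200)]
clutter = rng.normal([2.0, 2.0, 1.0], 0.05, (50, 3))
cloud = np.vstack([floor, clutter])

normals = estimate_normals(cloud)
plane = region_grow_plane(cloud, normals, seed=0)
print(len(plane))  # recovers the floor patch; the clutter is excluded
```

The angle threshold is what makes this segmentation fragile on the texture-poor, repetitive surfaces discussed above: every wall, ceiling, and floor yields the same locally planar signature, so low-level grouping alone cannot tell them apart.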

Another issue with point cloud data is clutter and the need to adjust the level of detail to the use-case scenario. For example, 3D capture technologies can resolve each brick in a brick wall, but the wall will never be modeled at that level of detail in a 3D design model. Similarly, 3D imaging technologies capture temporary objects in a scene (e.g., formwork, scaffolding, and furniture) that engineers and architects may not care about in many A/E/C/FM use cases. It is therefore necessary to identify the objects of interest and remove those of no interest, and this is one of the unique challenges that A/E/C/FM project conditions pose to the utilization of 3D imaging technologies.

Distance, speed and accuracy. The requirements for longer-distance and faster data capture arise from the increasing scale and complexity of scenes39. However, these typically come with a tradeoff in accuracy. The tradeoff between accuracy and speed becomes even more critical when considering robustness across a wide variety of scan environments. For example, although modern LiDAR can capture a relatively dense point cloud from over 100 meters away with an accuracy of 2 mm or better40, accuracy drops drastically when scanning glass structures or scanning in rainy or snowy weather. Sometimes the workaround is to scan the same area multiple times and average the measurements. However, such an operation drastically slows down the data capture process and can be untenable in large scenes. In contrast, while photogrammetry is convenient in most cases, it is still affected by environmental conditions, such as motion blur, lighting, and camera calibration.
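The averaging workaround mentioned above can be quantified with a quick simulation: for independent measurement noise, averaging n repeated scans shrinks the standard deviation by a factor of the square root of n, at roughly n times the capture cost. The target distance and noise level below are illustrative assumptions, not figures from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
true_range = 50.0   # metres to the target (hypothetical)
sigma = 0.005       # 5 mm single-measurement noise (illustrative)

# 10,000 trials of a single scan vs. the mean of 16 repeated scans.
single = true_range + rng.normal(0.0, sigma, 10_000)
averaged = (true_range + rng.normal(0.0, sigma, (10_000, 16))).mean(axis=1)

print(f"single-scan std: {single.std():.5f} m")   # close to 5 mm
print(f"16-scan avg std: {averaged.std():.5f} m") # close to 5/4 = 1.25 mm
```

The sqrt(n) improvement is exactly why the workaround is untenable at scale: quadrupling accuracy costs sixteen times the scan time, and the independence assumption itself fails for systematic error sources such as glass or precipitation.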

Interoperability. Interoperability, in the context of the 3D imaging industry, means enabling different components of a technology (both hardware and software) to work together. It involves communication between: (i) two or more hardware systems (e.g., scanners and on-site laptops used for tuning scan settings and preprocessing); (ii) hardware and software (e.g., scanners from different vendors and point cloud processing software, such as Autodesk ReCap); and (iii) two or more software applications (e.g., preprocessing software, such as Autodesk ReCap, and modeling and analysis software, such as Autodesk Revit), in order to facilitate and streamline tasks such as data collection, modeling, and analysis. More specifically, interoperability measures the ease or difficulty of: (i) different hardware interacting through a communication protocol, and (ii) transforming data from one format to another based on application requirements. Lack of communication hinders the ease of use and operation of a specific 3D imaging technology, whereas lack of data interoperability can prevent the value of the information contained in collected data from being fully exploited, which in turn can severely impede the effectiveness and efficiency of decision making41.
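A common mitigation for the data-format side of this problem is to exchange point clouds in an open, vendor-neutral format. The sketch below writes a minimal ASCII PLY file, a widely supported interchange format for point data; the file name and the bare x/y/z layout are illustrative assumptions, and real exports typically carry additional per-point properties such as color and intensity.

```python
import numpy as np

def write_ascii_ply(path, points):
    """Write an N x 3 array as a minimal ASCII PLY point cloud.
    The self-describing header lets any PLY reader parse the body
    without vendor-specific assumptions."""
    header = "\n".join([
        "ply",
        "format ascii 1.0",
        f"element vertex {len(points)}",
        "property float x",
        "property float y",
        "property float z",
        "end_header",
    ])
    body = "\n".join(f"{x:.6f} {y:.6f} {z:.6f}" for x, y, z in points)
    with open(path, "w") as f:
        f.write(header + "\n" + body + "\n")

pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.5]])
write_ascii_ply("cloud.ply", pts)
print(open("cloud.ply").read().splitlines()[0])  # prints "ply"
```

Because the header declares the element count and per-point properties up front, downstream tools need no knowledge of which scanner or preprocessing package produced the file, which is precisely the decoupling that interoperability aims for.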