Posted on 2024-11-25, 18:48. Authored by Wei Qin Chuah.
Depth perception is the ability to estimate the distance of observed objects to allow for a three-dimensional (3D) understanding of the surrounding environment. Stereo matching algorithms are a prominent approach to depth perception: given a pair of rectified stereo images, a stereo matching algorithm estimates the horizontal pixel disparity (which is inversely proportional to depth) between corresponding points. With recent advances in deep learning, particularly convolutional neural networks (CNNs), there has been growing research interest in employing deep CNNs to tackle the stereo matching problem. Despite their impressive performance, existing deep learning-based stereo matching algorithms (often referred to as stereo matching networks) face their own set of challenges. For example, stereo matching networks remain fragile on untextured, repetitive, and flat surfaces, and tend to estimate depth poorly at long range. Furthermore, the performance of most stereo matching networks deteriorates substantially when there is a shift in style between the training and testing environments, which occurs in almost all real-world applications. These limitations severely restrict the practicality of integrating current deep learning-based stereo matching algorithms into real-world applications.
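To make the disparity-depth relationship concrete, the sketch below uses the standard rectified pinhole stereo relation Z = f·B/d. It is an illustration only, not code from the thesis: the function names and the camera values (a 720-pixel focal length and a 0.5 m baseline) are assumptions chosen for the example. It also shows the first-order effect of a fixed disparity error on depth, which grows with the square of the distance and gives one intuition for why long-range depth is hard.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair (pinhole model)."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error_from_disparity_error(depth_m, disparity_err_px, focal_px, baseline_m):
    """First-order depth error: |dZ| ~= Z**2 / (f * B) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * abs(disparity_err_px)


# Hypothetical camera parameters, chosen only for illustration.
FOCAL_PX = 720.0
BASELINE_M = 0.5

near_depth = disparity_to_depth(36.0, FOCAL_PX, BASELINE_M)  # 36 px disparity
far_depth = disparity_to_depth(3.6, FOCAL_PX, BASELINE_M)    # 3.6 px disparity

# The same one-pixel disparity error at both distances.
near_err = depth_error_from_disparity_error(near_depth, 1.0, FOCAL_PX, BASELINE_M)
far_err = depth_error_from_disparity_error(far_depth, 1.0, FOCAL_PX, BASELINE_M)
```

Under these assumed values, an object ten times farther away suffers roughly one hundred times the depth error for the same disparity error.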
This thesis explores and develops robust deep learning-based stereo matching and depth estimation systems by addressing the aforementioned challenges. To this end, the thesis first improves disparity estimation accuracy on repetitive and/or flat planar surfaces by modelling the geometric relationship between these planar surfaces and disparities. Next, the thesis shows how to remove the biases, imposed by the training data and disparity-based loss functions, that result in poor long-range depth estimation performance in stereo matching networks. To overcome this challenge, we exploit semantic cues to develop a new combinatory loss function that allows adaptive adjustment of the learning bias in stereo matching networks. Lastly, the thesis shows that, by minimizing the sensitivity of learned representations to input variations, the networks are forced to learn domain-invariant feature representations, thus improving out-of-domain accuracy. All formulated solutions were tested on challenging datasets. The results show that the proposed methods improve stereo matching performance on challenging planar regions, on objects located at far distances, and in multiple unseen environments that differ from the training data.
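As background for the planar-surface point, a well-known property of rectified stereo is that a 3D plane projects to a disparity that is an affine function of pixel coordinates. The sketch below checks this numerically. It is not the thesis's method: the function names, the plane parametrisation (points P satisfying n·P = h in the left camera frame), and all numeric values are assumptions made for illustration.

```python
def plane_disparity(u, v, normal, offset, focal_px, baseline_m, cx, cy):
    """Closed form: d = (B / h) * (nx*(u - cx) + ny*(v - cy) + nz*f)."""
    nx, ny, nz = normal
    return baseline_m / offset * (nx * (u - cx) + ny * (v - cy) + nz * focal_px)


def ray_plane_disparity(u, v, normal, offset, focal_px, baseline_m, cx, cy):
    """Intersect the pixel's viewing ray with the plane, then use d = f * B / Z."""
    nx, ny, nz = normal
    # Ray direction through pixel (u, v), scaled to unit depth.
    rx, ry = (u - cx) / focal_px, (v - cy) / focal_px
    depth = offset / (nx * rx + ny * ry + nz)
    return focal_px * baseline_m / depth


# Hypothetical setup: a ground plane 1.5 m below the camera (y axis pointing down).
camera = dict(focal_px=720.0, baseline_m=0.5, cx=640.0, cy=360.0)
ground = dict(normal=(0.0, 1.0, 0.0), offset=1.5)

d_closed = plane_disparity(500.0, 460.0, **ground, **camera)
d_ray = ray_plane_disparity(500.0, 460.0, **ground, **camera)
```

Both routes give the same disparity for pixels below the horizon row, which is the kind of geometric constraint a planar prior can draw on in textureless regions.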