RMIT University

Enhancing Re-Identification and Object Detection Through Multi-Modal Feature Learning

thesis
posted on 2025-07-03, 06:27 authored by Ying Chen

Modern computer vision systems are increasingly tasked with operating in dynamic and complex environments, where challenges such as person Re-Identification (Re-ID) and object detection demand sophisticated and adaptable methodologies. Recent advances in multi-modal feature learning have shown significant promise in addressing these challenges by integrating complementary data modalities and leveraging cutting-edge computational techniques. These innovations have enabled more robust feature representations, improving precision, adaptability, and efficiency in real-world applications. Advanced approaches, such as ranking-based dictionary learning, transformer-based architectures, and the incorporation of point cloud data, have emerged as key strategies to further enhance system performance in multi-camera and multi-modal frameworks.

Building on these advancements, this thesis proposes novel methods to enhance accuracy, robustness, and efficiency in object detection and Re-ID. The research is guided by three interconnected questions:

  • How can ranking-based dictionary learning improve person Re-Identification performance in multi-camera systems?
  • How can transformer-based multi-query methods enhance both detection and Re-Identification through improved feature representations?
  • How can the integration of point cloud data optimize multi-modal frameworks for detection and Re-ID in complex environments?

To tackle the first question, a dictionary learning-based framework is introduced, incorporating Top-Push Polynomial Ranking Loss (TPRL) and a ranking graph Laplacian constraint. This approach targets variability in real-world data by enhancing intra-personal compactness and inter-personal dispersion within the feature space. The framework mitigates challenges such as occlusion, lighting inconsistencies, and diverse poses, achieving state-of-the-art results on multiple Re-ID benchmark datasets. By refining ranking relationships between person images, the proposed method significantly improves matching accuracy and system reliability under challenging conditions.
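The thesis defines the exact polynomial form of TPRL; as an illustration only, the underlying "top-push" idea can be sketched with a generic hinge loss in which each positive match must be closer to the anchor than the *nearest* negative by a margin (all names and data below are hypothetical):

```python
import numpy as np

def top_push_loss(anchor, positives, negatives, margin=1.0):
    """Generic top-push ranking loss: every anchor-positive distance
    should be smaller than the distance to the *closest* negative by
    at least `margin`. Illustrative stand-in for TPRL, whose exact
    polynomial formulation is given in the thesis."""
    d_pos = np.linalg.norm(positives - anchor, axis=1)  # anchor-positive distances
    d_neg = np.linalg.norm(negatives - anchor, axis=1)  # anchor-negative distances
    hardest_neg = d_neg.min()                           # top-push: only the closest negative matters
    # hinge: penalise positives that fail the margin against the hardest negative
    return np.maximum(0.0, d_pos - hardest_neg + margin).mean()

rng = np.random.default_rng(0)
anchor = np.zeros(8)
positives = rng.normal(0.0, 0.1, (4, 8))  # same identity: clustered near the anchor
negatives = rng.normal(3.0, 0.1, (6, 8))  # other identities: far from the anchor
loss = top_push_loss(anchor, positives, negatives)
print(loss)  # 0.0 — the margin is already satisfied for this toy data
```

Driving this loss to zero is what produces the intra-personal compactness and inter-personal dispersion described above: identities collapse into tight clusters separated by at least the margin.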

Building on this foundation, the thesis addresses the second research question by developing a transformer-based Multi-Query Person Search (MQPS) framework. Unlike traditional methods that rely on single-object queries, MQPS leverages multiple adjacent queries to extract robust, multi-scale feature representations. This design enables the framework to better handle occlusion, small objects, and complex camera configurations, providing enhanced robustness in challenging inference scenarios. Experiments conducted on the CUHK-SYSU and PRW datasets demonstrate the superior performance of MQPS, setting new benchmarks in both detection and Re-ID tasks.
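MQPS's transformer decoder is considerably more elaborate, but the core multi-query idea — fusing several adjacent queries into one robust descriptor rather than relying on a single-object query — can be sketched as simple attention-weighted pooling (the function and data below are hypothetical illustrations, not the thesis's architecture):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse_queries(query_feats, context):
    """Fuse several adjacent query embeddings into a single descriptor
    by attending over a shared scene context. Illustrative sketch of
    the multi-query principle behind MQPS."""
    scores = query_feats @ context        # relevance of each query to the scene
    weights = softmax(scores)             # normalise into attention weights
    fused = weights @ query_feats         # weighted sum = fused descriptor
    return fused, weights

rng = np.random.default_rng(1)
queries = rng.normal(size=(3, 16))  # 3 adjacent queries, 16-d features each
context = rng.normal(size=16)       # pooled scene feature
fused, w = fuse_queries(queries, context)
print(fused.shape)  # (16,)
```

Because the fused descriptor draws on all adjacent queries, an occluded or low-resolution view contributes less weight instead of dominating the match — the intuition behind the robustness to occlusion and small objects claimed above.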

Recognizing the importance of efficiency in real-world applications, the thesis explores the third question by introducing Frustum 3DNet (F-3DNet), a novel framework for 3D object detection. F-3DNet takes advantage of the structured nature of LiDAR-generated point clouds, creating pseudo-panoramic images and defining frustum regions of interest to capture both global and local context. By integrating 3D and RGB data, the method effectively manages data variability while maintaining computational efficiency. Experimental results on the KITTI and nuScenes datasets showcase the framework’s ability to achieve state-of-the-art detection accuracy, offering a scalable and efficient solution for dynamic IoT and surveillance environments.
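F-3DNet's exact projection and fusion pipeline is defined in the thesis; as a rough sketch of the frustum idea, LiDAR points can be mapped to pseudo-panoramic (azimuth, elevation) coordinates, after which a 2D detection box's angular window selects the 3D points inside its frustum region of interest (all names and thresholds below are illustrative assumptions):

```python
import numpy as np

def spherical_project(points):
    """Map (x, y, z) LiDAR points to (azimuth, elevation) angles —
    the pixel coordinates of a pseudo-panoramic range image."""
    x, y, z = points.T
    az = np.arctan2(y, x)                 # horizontal angle
    el = np.arctan2(z, np.hypot(x, y))    # vertical angle
    return az, el

def frustum_points(points, az_range, el_range):
    """Keep the 3D points whose projection falls inside the angular
    window of a 2D detection box: a frustum region of interest."""
    az, el = spherical_project(points)
    mask = ((az >= az_range[0]) & (az <= az_range[1]) &
            (el >= el_range[0]) & (el <= el_range[1]))
    return points[mask]

pts = np.array([[10.0, 0.0, 0.0],   # straight ahead
                [0.0, 10.0, 0.0],   # 90 degrees to the left
                [10.0, 1.0, 1.0]])  # slightly left and up
roi = frustum_points(pts, az_range=(-0.2, 0.2), el_range=(-0.2, 0.2))
print(len(roi))  # 2 — the sideways point falls outside the frustum
```

Restricting the 3D search to these frustums is what keeps the method computationally efficient: only points plausibly belonging to a detected object are processed at full resolution.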

This research makes significant contributions to the field of computer vision by addressing core challenges in multi-modal feature learning. Through the integration of advanced ranking methods, transformer-based frameworks, and 3D object detection techniques, the thesis advances the state-of-the-art in person search, object detection, and Re-ID. By providing robust, accurate, and efficient solutions, the proposed methods enable the deployment of vision systems in complex and dynamic real-world scenarios, paving the way for further advancements in intelligent surveillance and IoT applications.

History

Degree Type

Doctorate by Research

Imprint Date

2025-02-06

School name

Computing Technologies, RMIT University

Copyright

© Ying Chen 2025
