Efficient Deep Learning Algorithms for Resource Limited Edge Computing
Deep Neural Networks (DNNs) have become the de facto standard for artificial intelligence (AI) applications, including visual recognition, natural language processing, and autonomous driving. Modern research has demonstrated that the performance of a DNN is closely tied to its parameter count, which reflects its learning capacity. In other words, a large model with abundant parameters usually achieves superior performance across various downstream tasks. However, the growing size of DNNs presents significant challenges for deployment on resource-constrained devices. Edge devices such as smartphones have revolutionized daily life but are often limited by storage and memory constraints, making them less capable of hosting large AI models. A common solution is cloud computing, in which users send requests to remote servers that run the AI models and return the results. Nonetheless, privacy concerns prevent many users from sharing sensitive data with external servers, creating a strong demand for on-device AI. To this end, efficient learning for DNNs has emerged as a field of tremendous interest to the AI community, aiming to develop compact models with satisfactory performance.
In this thesis, we systematically investigate efficient neural models from three perspectives. First, we present a knowledge distillation (KD) framework that improves small models seamlessly, without architectural modification or extra inference overhead. Notably, we observe a semantic gap between small student models and large teacher models due to the disparity in their learning capacities. To address this, we propose one-to-all distillation, which uses an attention mechanism to bridge this gap. Second, we examine the widely used attention module, which is regarded as the key to Transformers yet criticized for its quadratic computational complexity. We find that some attention layers are less informative and can be removed without compromising performance. Specifically, we use transfer entropy to quantify the importance of the attention layers within a Transformer. Finally, we investigate integrating multi-modal inputs into a single model that handles multiple tasks simultaneously, which reduces the overall model size compared with training a separate model for each task. We propose a gradient calibration method to mitigate task conflict and modality bias.
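To make the distillation setting concrete, the sketch below shows a standard soft-target knowledge distillation loss (temperature-scaled KL divergence between teacher and student logits, in the style of Hinton et al.). It is a minimal baseline illustration, not the one-to-all attention-based distillation proposed in this thesis; the temperature T and weight alpha are illustrative hyperparameters.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Standard soft-target knowledge distillation loss (baseline sketch).

    Combines cross-entropy on ground-truth labels with a temperature-scaled
    KL divergence that pushes the student's softened predictions toward the
    teacher's. T and alpha are illustrative values, not tuned settings.
    """
    # Hard-label cross-entropy for the student.
    ce = F.cross_entropy(student_logits, labels)
    # Softened distributions; the KL term is scaled by T^2 so gradient
    # magnitudes stay comparable across temperatures.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * ce + (1.0 - alpha) * kl
```

In practice, the teacher's logits are computed under torch.no_grad() so that only the student receives gradients.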
We conduct extensive experiments on large-scale benchmarks to validate the effectiveness of the proposed methods. Our research contributes substantially to the development of efficient learning for deep models, marking a significant step toward compact AI solutions for resource-constrained devices.