Posted on 2024-11-24, 03:49. Authored by Shohreh Deldari.
Unprecedented advances in the cost, size, energy efficiency and interoperability of sensing technologies have led to their ever-growing, ubiquitous application in health and medicine, environmental monitoring, smart homes, wildlife monitoring, agriculture, industry, etc.
Such advances have produced an exponentially growing amount of diverse sensor data; however, extracting practical knowledge from this massive volume of data is challenging for modern modelling approaches. On the one hand, the past decade has witnessed the great success of deep learning techniques, which has led to a proliferation of applications that benefit from large labelled datasets. State-of-the-art techniques still mostly require human intervention, e.g. manual data preprocessing and annotation, because the success of conventional machine learning models depends heavily on the availability and quality of labelled data. On the other hand, with this massive proliferation of data captured by ubiquitous sensors, it is becoming harder, and often impossible, to annotate the data. This is a significant bottleneck for supervised learning models, given that acquiring annotations is often cumbersome and expensive and requires extensive domain knowledge.
Self-supervised learning is a recently introduced family of approaches that addresses the bottleneck of accessing massive labelled datasets by uncovering meaningful information about the data itself: the model is trained with a supervisory signal derived from the unlabelled input rather than from human annotations. This substantially increases the amount of data available for training and has been shown to reduce the reliance on manual annotation. It is also considered an early step toward general artificial intelligence, given that the model learns from the observed data with far less human input than a supervised learning approach.
In this thesis, titled 'Learning from multivariate time series data with minimal supervision', we addressed the challenges of learning from ubiquitous and wearable sensor data with no or minimal supervision. We specifically focused on building an end-to-end framework covering the preprocessing, feature extraction, pre-training and low-label fine-tuning steps of machine learning models that learn from multiple sensory inputs with less supervision.
Each step comes with its own set of challenges that are explained below.
Extracting informative and meaningful temporal segments from high-dimensional, multi-source data (e.g. wearable sensors, smart devices or the Internet of Things) is a vital preprocessing step in tracking and predicting human behaviours and actions in real-life environments. The main challenges in time series analysis, compared with computer vision and natural language processing, are 1) the presence of various modalities (different kinds of sensors) and the high variation in correlations between multi-dimensional data sources, 2) noise and drift in the temporal patterns of the data and 3) limitations in annotating the data. To address these challenges, we made several contributions in this thesis toward learning general-purpose and robust deep neural networks with minimal supervision.
First, we introduced Entropy and ShaPe awaRe timE-Series SegmentatiOn (ESPRESSO), a novel unsupervised multivariate time series segmentation technique based on a combination of shape-based features, temporal drifts and statistical features of the original data. One-dimensional time series segmentation can be straightforward when we have enough domain knowledge about the data, such as its distribution; with multiple sources, however, we cannot make simple assumptions about how the combination of all the time series dimensions will behave. ESPRESSO detects homogeneous segments across data sources with different characteristics, such as 1) sudden or gradual changes, 2) the presence of repeated and non-repeated patterns and 3) pattern drifts along with statistical properties. To address these challenges, it combines multiple features and properties of the data to detect the most homogeneous segments, as illustrated in the sketch below.
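To give a concrete flavour of this fusion idea, the following minimal sketch (not the ESPRESSO implementation) combines a simple mean-shift statistic with a correlation-based shape dissimilarity into one normalised change score per channel, then greedily selects the highest-scoring, non-overlapping change points. All function names, window sizes and the toy data are illustrative assumptions.

```python
import numpy as np

def statistical_score(x, window):
    """Absolute difference in mean between adjacent windows (a crude
    stand-in for an entropy/statistical change component)."""
    n = len(x)
    score = np.zeros(n)
    for t in range(window, n - window):
        score[t] = abs(x[t - window:t].mean() - x[t:t + window].mean())
    return score

def shape_score(x, window):
    """1 - correlation between adjacent windows, a crude proxy for a
    shape-based dissimilarity."""
    n = len(x)
    score = np.zeros(n)
    for t in range(window, n - window):
        left, right = x[t - window:t], x[t:t + window]
        if left.std() > 0 and right.std() > 0:
            score[t] = 1.0 - np.corrcoef(left, right)[0, 1]
    return score

def segment(series, window=50, n_segments=5):
    """Fuse normalised shape and statistical scores over all channels and
    return the top change-point candidates (greedy, non-overlapping)."""
    series = np.atleast_2d(series)           # (channels, time)
    total = np.zeros(series.shape[1])
    for channel in series:
        for fn in (statistical_score, shape_score):
            s = fn(channel, window)
            if s.max() > 0:
                total += s / s.max()         # normalise before fusing
    change_points = []
    for t in np.argsort(total)[::-1]:        # peaks at least `window` apart
        if len(change_points) >= n_segments - 1:
            break
        if all(abs(t - c) >= window for c in change_points):
            change_points.append(int(t))
    return sorted(change_points)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two channels with a mean shift and a frequency change around t = 300.
    t = np.arange(600)
    ch1 = np.concatenate([rng.normal(0, 1, 300), rng.normal(3, 1, 300)])
    ch2 = np.concatenate([np.sin(t[:300] * 0.2), np.sin(t[300:] * 0.8)])
    print(segment(np.stack([ch1, ch2]), window=50, n_segments=2))
```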
Second, we took a further step toward unsupervised context recognition through self-supervised representation learning. Representation learning has recently attracted much attention in image processing and natural language processing, but less so in unsupervised time series analysis. We proposed TSCP2, a self-supervised contrastive learning approach that learns compact latent vectors which not only represent the main features of the data but also preserve the temporal dependencies in time series sensor data.
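As a rough illustration of the contrastive principle behind this step (not the actual TSCP2 implementation), the sketch below pairs the embedding of each window's recent history with the embedding of its own future as a positive, and with the futures of other windows in the batch as negatives, using an InfoNCE-style loss. The encoder architecture and every hyperparameter here are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def temporal_infonce(history_emb, future_emb, temperature=0.1):
    """InfoNCE-style loss: each window's true future is the positive;
    the futures of the other windows in the batch are negatives."""
    history_emb = F.normalize(history_emb, dim=1)          # (batch, dim)
    future_emb = F.normalize(future_emb, dim=1)            # (batch, dim)
    logits = history_emb @ future_emb.t() / temperature    # (batch, batch)
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

class WindowEncoder(torch.nn.Module):
    """Tiny illustrative encoder: a 1-D conv stack followed by pooling."""
    def __init__(self, in_channels, emb_dim=64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv1d(in_channels, 32, kernel_size=5, padding=2),
            torch.nn.ReLU(),
            torch.nn.Conv1d(32, emb_dim, kernel_size=5, padding=2),
            torch.nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, x):                  # x: (batch, channels, time)
        return self.net(x).squeeze(-1)     # (batch, emb_dim)

if __name__ == "__main__":
    encoder = WindowEncoder(in_channels=3)
    batch = torch.randn(16, 3, 200)        # 16 multivariate windows
    history, future = batch[:, :, :100], batch[:, :, 100:]
    loss = temporal_infonce(encoder(history), encoder(future))
    loss.backward()
    print(float(loss))
```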
Third, we extended our self-supervised representation learning model to learn compact cross-modal representations when data come from multiple sources. The main challenge here is to capture and maintain local and global dependencies and correlations between the multiple dimensions. By combining the local and global latent representations of the various sources, we extracted effective representations that help infer the system's state in a minimally supervised way. We proposed Cross mOdality COntrastive leArning (COCOA), a novel contrastive objective function and an efficient, lightweight model capable of learning cross-modal, compact information from multivariate input data. We investigated the effectiveness of our approach on downstream tasks, such as human activity recognition, sleep stage detection and emotion detection, against ten state-of-the-art supervised and self-supervised baselines across multiple well-known public datasets, and showed that COCOA outperforms its supervised rival while using only 10% of the labels.
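The following is a simplified, hypothetical sketch of a cross-modal contrastive objective in the spirit of this step, not COCOA's exact loss: for every pair of modalities, embeddings recorded in the same time window are treated as positives and embeddings from other windows as negatives. The modality names, batch size and temperature are placeholders.

```python
import torch
import torch.nn.functional as F

def cross_modal_contrastive(embeddings, temperature=0.1):
    """Pairwise cross-modal contrastive loss over a list of per-modality
    embedding tensors of shape (batch, dim), where row i of every tensor
    comes from the same time window."""
    loss, pairs = 0.0, 0
    for a in range(len(embeddings)):
        for b in range(a + 1, len(embeddings)):
            za = F.normalize(embeddings[a], dim=1)
            zb = F.normalize(embeddings[b], dim=1)
            logits = za @ zb.t() / temperature            # (batch, batch)
            targets = torch.arange(logits.size(0), device=logits.device)
            # Symmetric: align modality a to b and b to a.
            loss = loss + F.cross_entropy(logits, targets)
            loss = loss + F.cross_entropy(logits.t(), targets)
            pairs += 2
    return loss / pairs

if __name__ == "__main__":
    batch, dim = 32, 64
    # Hypothetical per-modality encoders would produce these embeddings,
    # e.g. from accelerometer, gyroscope and ECG windows captured together.
    acc = torch.randn(batch, dim, requires_grad=True)
    gyro = torch.randn(batch, dim, requires_grad=True)
    ecg = torch.randn(batch, dim, requires_grad=True)
    loss = cross_modal_contrastive([acc, gyro, ecg])
    loss.backward()
    print(float(loss))
```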
Putting all the steps together, we proposed an end-to-end framework that preprocesses unlabelled time series data from various sensor devices, applies segmentation, and learns compact, reusable representations for downstream tasks with limited labelled data. To evaluate the applicability of the proposed methods, we adapted the models to various types of sensors and time series data, including wearable sensors (e.g. accelerometer, gyroscope, electroencephalogram, electrooculogram, electromyography, electrocardiogram and electrodermal activity), radio-frequency identification signals and web traffic. The learning approaches we presented do not require human semantic labels but instead extract supervisory signals from the input itself, i.e. in a self-supervised manner. We evaluated our methods on a range of important downstream tasks, including but not limited to human activity recognition, emotion recognition, hand gesture recognition and sleep-stage detection.
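To illustrate the final low-label fine-tuning step of such a pipeline, the sketch below freezes a (hypothetically pre-trained) encoder and fits only a small linear head using 10% of the labels. The encoder, data shapes and hyperparameters are placeholders rather than the thesis's actual configuration.

```python
import torch
import torch.nn.functional as F

def finetune_with_few_labels(encoder, windows, labels, label_fraction=0.1,
                             num_classes=6, epochs=20, lr=1e-3):
    """Freeze a pre-trained encoder and fit a linear head using only a
    small fraction of the labels (low-label fine-tuning)."""
    n = windows.size(0)
    n_labelled = max(1, int(n * label_fraction))
    idx = torch.randperm(n)[:n_labelled]      # pretend only these are annotated
    x, y = windows[idx], labels[idx]

    for p in encoder.parameters():            # reuse representations as-is
        p.requires_grad_(False)
    with torch.no_grad():
        feats = encoder(x)                    # (n_labelled, emb_dim)

    head = torch.nn.Linear(feats.size(1), num_classes)
    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.cross_entropy(head(feats), y)
        loss.backward()
        opt.step()
    return head

if __name__ == "__main__":
    # Placeholder encoder standing in for a self-supervised pre-trained one.
    encoder = torch.nn.Sequential(
        torch.nn.Conv1d(3, 32, kernel_size=5, padding=2),
        torch.nn.ReLU(),
        torch.nn.Conv1d(32, 64, kernel_size=5, padding=2),
        torch.nn.AdaptiveAvgPool1d(1),
        torch.nn.Flatten(),
    )
    windows = torch.randn(200, 3, 100)        # 200 multivariate windows
    labels = torch.randint(0, 6, (200,))      # e.g. six activity classes
    head = finetune_with_few_labels(encoder, windows, labels)
```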