RMIT University

A holistic approach for efficient host-based data exfiltration detection

thesis
posted on 2024-11-25, 18:42 authored by Jakapan SUABOOT
<p>Data exfiltration is a type of cyberattack that causes breaches of sensitive information. It is a critical issue for the modern world of data-centric services. In particular, the critical infrastructure (CI), information technology (IT), and mobile computing sectors are targets of Advanced Persistent Threats (APTs). Data breaches cause huge losses every year across a wide range of industries, including large enterprises such as Google, Facebook, and Microsoft. Furthermore, such breaches can have major impacts on national security if government departments or the military are targeted. Because adversaries constantly attack their targets through various system vulnerabilities (e.g., unknown or zero-day exploits), prevention-based measures alone are not sufficient to thwart them. To address this problem, this thesis proposes a holistic approach to data exfiltration detection built on three approaches to detecting data breaches.</p> <p>We began by examining technologies with strong potential to serve as the basis for data leakage prevention. The literature review revealed numerous advanced intrusion detection techniques whose core technologies are similar to those needed for data exfiltration detection. Industrial control systems (ICS), especially those with Supervisory Control and Data Acquisition (SCADA), are among the most challenging applications for researchers because they have been a critical target of cyberattacks. In the systematic literature review, over a hundred peer-reviewed articles were examined. Our study illustrated the development of supervised learning approaches from an industry perspective, covering various data auditing sources and attack techniques. Based on the review, we conducted qualitative and quantitative benchmarks of several machine learning techniques in ten different categories.
Furthermore, we identified future directions for developing new algorithms and for facilitating the selection of algorithms for industrial intrusion detection systems (IDS).</p> <p>Following the review, the technical aspects of a holistic data exfiltration detection approach were considered. Firstly, we examined data exfiltration caused by malicious software, as malware is a critical tool used by attackers to steal sensitive information. For the real-time detection of suspicious data-stealing behaviors, we proposed a novel Sub-Curve HMM approach, which uses the Hidden Markov Model (HMM) to extract malicious behavior embedded in a long API call sequence. The proposed method is intended to detect malicious activities that occur only over a short period. By projecting a series of matching scores into a curve, our approach distinguishes malicious actions from other system activities using discontinuities in the slope of the curve. In particular, when a long API call sequence is tested, adjoining sets of API calls receive different matching scores depending on whether they belong to malicious or benign activity. The experimental results show that the proposed approach outperforms existing solutions in detecting six families of malware: the detection accuracy of Sub-Curve HMM is over 94%, compared to 83% for the baseline HMM approach and 73% for Information Gain.</p> <p>After that, we moved from the behavior-based approach to a sensitive-data-oriented approach. In particular, we challenged a common assumption in this research field by proposing to monitor physical memory for sensitive information instead of checking for malicious activities or scanning network traffic. Essentially, main memory is a single point of data flow in a computer system; hence, the adversary cannot evade the detection system by using other channels.
This approach helps to shortlist processes that handle sensitive data; combined with an anomaly detection system, it can detect advanced attackers who use a legitimate program to exfiltrate data. To efficiently monitor sensitive text-based files in the memory space of running processes, we propose a novel Fast lookup Bag-of-Words (FBoW) technique. FBoW transforms a text document into a BoW sequence and then builds a deterministic finite automaton (DFA) to match content in RAM against the database of sensitive text documents. The experimental results showed that FBoW scales better than other state-of-the-art pattern-matching techniques as the volume of sensitive data grows. Specifically, it uses 31 to 400 times less memory than the Aho-Corasick method, at the cost of a drop of less than 2% in detection accuracy on the non-memory-based dataset. When tested on memory-based datasets, FBoW clearly outperformed the state-of-the-art methods in memory efficiency, run time, and robustness.</p> <p>Finally, we addressed one of the most challenging data exfiltration issues: temporal data exfiltration. A sophisticated adversary could delay the data-stealing activity by exfiltrating tiny pieces of information over a long period instead of transferring a lot of information at once. Although small fractions of sensitive data can be detected in RAM, a piece of information that is too small could produce a false-negative result. This research is the first attempt to mitigate temporal data exfiltration, and it does so by proposing a novel Temporary Memory Bag-of-Words (TMBoW) technique. The proposed solution combines Sparse Densities Representation (SDR) and Bag-of-Words representation to detect the temporal patterns of time-delayed data exfiltration.
The experimental results show that the proposed approach achieves 100% accuracy when the minimum detection threshold is only 0.5, and the analytical result shows that the probability of TMBoW reporting a false alarm is approximately zero.</p>
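<p>The idea of projecting matching scores into a curve and looking for discontinuities in its slope, described above for Sub-Curve HMM, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function names, the fixed window size, the threshold, and the sample scores are all hypothetical, and plain numbers stand in for HMM matching scores.</p>

```python
from typing import List


def cumulative_curve(scores: List[float]) -> List[float]:
    """Project per-window matching scores into a running-sum curve."""
    curve: List[float] = []
    total = 0.0
    for score in scores:
        total += score
        curve.append(total)
    return curve


def slope_breaks(curve: List[float], window: int = 3,
                 min_change: float = 2.0) -> List[int]:
    """Return indices where the average slope over the next `window`
    points differs from the average slope over the previous `window`
    points by at least `min_change` (both parameters are assumptions)."""
    breaks: List[int] = []
    for i in range(window, len(curve) - window):
        left = (curve[i] - curve[i - window]) / window
        right = (curve[i + window] - curve[i]) / window
        if abs(right - left) >= min_change:
            breaks.append(i)
    return breaks


# Hypothetical scores: a short burst of high-scoring API-call windows
# (standing in for malicious activity) inside otherwise uniform activity.
scores = [1.0] * 6 + [5.0] * 4 + [1.0] * 6
breaks = slope_breaks(cumulative_curve(scores))
print(breaks)
```

<p>In this sketch the reported indices cluster around the start and end of the burst, which is the intuition behind isolating a short malicious sub-sequence within a long API call trace; a uniform score sequence yields no breaks.</p>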

History

Degree Type

Doctorate by Research

Imprint Date

2021-01-01

School name

School of Science, RMIT University

Former Identifier

9922035623601341

Open access

  • Yes
