posted on 2024-11-02, 01:48 authored by Shamsul Huda, Jemal Abawajy, Mali Abdollahian, Rafiqul Islam, John Yearwood
Malware replicates itself and produces offspring with the same characteristics but different signatures
by using code obfuscation techniques. Current-generation anti-virus engines employ a signature-template
detection approach, which malware can easily evade when its signature is not yet in the database. This reduces
the capability of current anti-virus engines in detecting malware. In this paper, we propose a stepwise
binary logistic regression-based dimensionality reduction technique for malware detection using application
program interface (API) call statistics. Finding the most significant malware features with traditional
wrapper-based approaches has complexity exponential in the dimension (m) of the dataset under a brute-force
search strategy, and complexity of order (m-1) under a backward elimination filter heuristic. The
novelty of the proposed approach is that its worst-case computational complexity is less than
order (m-1). The proposed approach uses multi-linear regression and the p-value of each individual API
feature for selection of the most uncorrelated and significant features in order to reduce the dimensionality
of the large malware data and to ensure the absence of multi-collinearity. The stepwise logistic regression
approach is then employed to test the significance of the individual malware features based on their corresponding
Wald statistics and to construct the binary decision model. When the most significant selected
APIs are used in a decision rule generation system, this approach not only reduces the tree size but also
improves classification performance. Exhaustive experiments on a large malware data set show that the
proposed approach clearly outperforms the existing standard decision rule and support vector machine-based template
approaches that use the complete data, and provides a better statistical fit.
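
The idea of pruning API features by the significance of their Wald statistics can be illustrated with a minimal sketch. It is not the authors' procedure: it uses plain backward elimination on a logistic model rather than the multi-linear regression pre-selection described above, and the data are synthetic. The feature names (api_0 ... api_4), the significance level alpha = 0.05, the small ridge term, and the helper names fit_logistic, wald_p_values, and backward_stepwise are all assumptions made for illustration.

```python
import math
import numpy as np


def fit_logistic(X, y, n_iter=50, ridge=1e-8):
    """Fit a logistic regression by Newton-Raphson.

    Returns the coefficients (intercept first) and their standard errors.
    The tiny ridge term is an assumption added only for numerical stability.
    """
    A = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    beta = np.zeros(A.shape[1])

    def prob_and_hessian(b):
        p = 1.0 / (1.0 + np.exp(-np.clip(A @ b, -30, 30)))
        w = p * (1.0 - p)
        h = A.T @ (A * w[:, None]) + ridge * np.eye(A.shape[1])
        return p, h

    for _ in range(n_iter):
        p, hessian = prob_and_hessian(beta)
        step = np.linalg.solve(hessian, A.T @ (y - p))
        beta = beta + step
        if np.max(np.abs(step)) < 1e-8:
            break

    # Standard errors come from the inverse Hessian at the final estimate.
    _, hessian = prob_and_hessian(beta)
    se = np.sqrt(np.diag(np.linalg.inv(hessian)))
    return beta, se


def wald_p_values(beta, se):
    """Wald statistic (beta / se)^2 compared with a chi-square(1) distribution."""
    wald = (beta / se) ** 2
    # Upper tail of chi-square with 1 degree of freedom: erfc(sqrt(w / 2)).
    return np.array([math.erfc(math.sqrt(w / 2.0)) for w in wald])


def backward_stepwise(X, y, names, alpha=0.05):
    """Repeatedly drop the feature with the largest Wald p-value above alpha."""
    keep = list(range(X.shape[1]))
    while keep:
        beta, se = fit_logistic(X[:, keep], y)
        p = wald_p_values(beta, se)[1:]  # skip the intercept
        worst = int(np.argmax(p))
        if p[worst] <= alpha:
            break  # every remaining feature is significant
        del keep[worst]
    return [names[i] for i in keep]


# Synthetic "API call count" data: only api_0 and api_1 influence the label.
rng = np.random.default_rng(0)
n = 600
names = [f"api_{i}" for i in range(5)]  # hypothetical feature names
X = rng.poisson(3, size=(n, 5)).astype(float)
logit = 1.2 * (X[:, 0] - 3) - 1.0 * (X[:, 1] - 3)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

selected = backward_stepwise(X, y, names)
print(selected)
```

Each pass refits the model and removes at most one feature, so the loop runs at most m times for m candidate features. A noise feature can occasionally survive at alpha = 0.05 by chance, so the selected set should be read as a filtered candidate list rather than a guaranteed minimal one.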