In the current digital era, malware detection is a crucial component of computer security. As technology develops, malware threats are becoming more numerous and more sophisticated. Malware, short for “malicious software”, is any program or piece of code intended to harm or exploit computer systems or their users. It can take many forms, including viruses, worms, Trojan horses, spyware, adware, and ransomware.
Malware attacks can have a significant negative impact, leading to financial losses, data breaches, system failures, and other disruption. Efficient techniques for identifying and preventing malware infections are therefore crucial. Malware detection is the process of locating and evaluating potential threats to determine whether they are harmful or not.
Machine learning is an increasingly common approach to malware detection thanks to its ability to categorize harmful software. One of the key obstacles in adopting machine learning for malware detection is the need for high-quality data: the quality and variety of the training dataset largely determine how well the model performs. Moreover, malware creators may employ strategies such as obfuscation and polymorphism to avoid being detected by machine learning models. Given the evolving sophistication of malware threats, machine learning can offer more precise and efficient identification and categorization of dangerous software, but more research is required to address data quality and resilience against evasion strategies.
The dataset on which the model was built is available on the Kaggle portal and can be accessed at the web link. It contains 1,079 goodware and 42,797 malware API call sequences. For each piece of software, the first 100 non-repeated consecutive API calls connected to the parent process were collected from Cuckoo Sandbox reports. The dataset features the following columns:
Every row includes a sequence of 100 API calls describing the system operations performed one by one. For instance, a piece of software’s activity could look as follows: create new folder, create new file, edit file, save file, etc., which is reflected by consecutive calls here and classified as malware or goodware in the last column.
hash | t_0 | t_1 | t_2 | t_3 | t_4 | t_5 | t_6 | t_7 | t_8 | t_9 | t_10 | t_11 | t_12 | t_13 | t_14 | t_15 | t_16 | t_17 | t_18 | t_19 | t_20 | t_21 | t_22 | t_23 | t_24 | t_25 | t_26 | t_27 | t_28 | t_29 | t_30 | t_31 | t_32 | t_33 | t_34 | t_35 | t_36 | t_37 | t_38 | t_39 | t_40 | t_41 | t_42 | t_43 | t_44 | t_45 | t_46 | t_47 | t_48 | t_49 | t_50 | t_51 | t_52 | t_53 | t_54 | t_55 | t_56 | t_57 | t_58 | t_59 | t_60 | t_61 | t_62 | t_63 | t_64 | t_65 | t_66 | t_67 | t_68 | t_69 | t_70 | t_71 | t_72 | t_73 | t_74 | t_75 | t_76 | t_77 | t_78 | t_79 | t_80 | t_81 | t_82 | t_83 | t_84 | t_85 | t_86 | t_87 | t_88 | t_89 | t_90 | t_91 | t_92 | t_93 | t_94 | t_95 | t_96 | t_97 | t_98 | t_99 | malware |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
071e8c3f8922e186e57548cd4c703a5d | 112 | 274 | 158 | 215 | 274 | 158 | 215 | 298 | 76 | 208 | 76 | 172 | 117 | 172 | 117 | 172 | 76 | 117 | 35 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 117 | 60 | 81 | 60 | 81 | 208 | 35 | 215 | 35 | 208 | 240 | 117 | 172 | 60 | 81 | 60 | 81 | 225 | 35 | 60 | 81 | 35 | 225 | 172 | 60 | 81 | 60 | 81 | 60 | 81 | 172 | 117 | 76 | 172 | 117 | 172 | 117 | 35 | 111 | 81 | 140 | 208 | 240 | 117 | 71 | 297 | 135 | 171 | 215 | 35 | 208 | 56 | 71 | 1 |
33f8e6d08a6aae939f25a8e0d63dd523 | 82 | 208 | 187 | 208 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 208 | 172 | 117 | 16 | 208 | 297 | 8 | 199 | 264 | 274 | 158 | 215 | 274 | 158 | 215 | 208 | 260 | 87 | 260 | 65 | 14 | 65 | 240 | 117 | 208 | 187 | 208 | 240 | 117 | 39 | 35 | 171 | 172 | 117 | 208 | 35 | 215 | 35 | 208 | 240 | 117 | 35 | 60 | 81 | 60 | 81 | 172 | 35 | 60 | 81 | 60 | 81 | 31 | 60 | 81 | 208 | 187 | 208 | 274 | 158 | 266 | 208 | 60 | 81 | 60 | 81 | 240 | 117 | 71 | 297 | 135 | 171 | 215 | 35 | 1 |
b68abd064e975e1c6d5f25e748663076 | 16 | 110 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 172 | 117 | 99 | 260 | 141 | 65 | 240 | 117 | 240 | 117 | 240 | 117 | 263 | 215 | 263 | 215 | 263 | 215 | 263 | 215 | 263 | 215 | 263 | 215 | 263 | 215 | 263 | 215 | 274 | 158 | 215 | 274 | 158 | 215 | 240 | 117 | 71 | 297 | 135 | 171 | 215 | 48 | 208 | 112 | 113 | 112 | 123 | 65 | 112 | 123 | 65 | 112 | 123 | 65 | 113 | 112 | 123 | 65 | 112 | 123 | 65 | 112 | 123 | 65 | 113 | 112 | 123 | 65 | 112 | 123 | 65 | 112 | 123 | 65 | 113 | 112 | 123 | 65 | 112 | 123 | 65 | 112 | 123 | 65 | 113 | 112 | 1 |
72049be7bd30ea61297ea624ae198067 | 82 | 208 | 187 | 208 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 208 | 16 | 240 | 117 | 228 | 208 | 187 | 228 | 240 | 117 | 82 | 198 | 86 | 82 | 274 | 37 | 240 | 117 | 240 | 117 | 262 | 228 | 275 | 172 | 240 | 275 | 172 | 274 | 158 | 215 | 172 | 117 | 215 | 37 | 158 | 215 | 240 | 117 | 82 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 228 | 187 | 215 | 274 | 158 | 215 | 274 | 158 | 215 | 240 | 117 | 71 | 297 | 135 | 171 | 215 | 228 | 215 | 208 | 302 | 208 | 302 | 187 | 208 | 302 | 228 | 302 | 1 |
c9b3700a77facf29172f32df6bc77f48 | 82 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 172 | 117 | 172 | 117 | 16 | 240 | 117 | 11 | 274 | 158 | 215 | 274 | 158 | 215 | 117 | 270 | 117 | 301 | 117 | 297 | 8 | 199 | 264 | 215 | 260 | 141 | 65 | 31 | 260 | 141 | 202 | 260 | 141 | 65 | 202 | 65 | 260 | 141 | 65 | 80 | 287 | 87 | 14 | 65 | 260 | 141 | 240 | 141 | 65 | 82 | 260 | 141 | 65 | 260 | 141 | 65 | 31 | 159 | 224 | 82 | 261 | 172 | 117 | 260 | 208 | 260 | 2 | 140 | 81 | 208 | 159 | 224 | 82 | 159 | 224 | 82 | 261 | 208 | 240 | 117 | 260 | 40 | 209 | 260 | 40 | 209 | 260 | 141 | 260 | 141 | 260 | 1 |
cc6217be863e606e49da90fee2252f52 | 117 | 208 | 117 | 208 | 117 | 240 | 117 | 240 | 117 | 208 | 228 | 215 | 274 | 158 | 215 | 274 | 158 | 215 | 240 | 117 | 71 | 297 | 135 | 171 | 215 | 208 | 187 | 20 | 34 | 215 | 208 | 187 | 86 | 215 | 240 | 91 | 89 | 192 | 89 | 133 | 89 | 248 | 297 | 135 | 171 | 8 | 178 | 215 | 18 | 194 | 240 | 117 | 215 | 117 | 31 | 56 | 71 | 56 | 172 | 117 | 240 | 117 | 260 | 141 | 65 | 194 | 117 | 240 | 117 | 260 | 141 | 65 | 260 | 141 | 65 | 260 | 141 | 65 | 260 | 141 | 65 | 260 | 141 | 65 | 260 | 112 | 117 | 260 | 141 | 65 | 260 | 141 | 65 | 260 | 141 | 65 | 9 | 117 | 260 | 65 | 1 |
f7a1a3c38809d807b3f5f4cc00b1e9b7 | 215 | 274 | 158 | 215 | 274 | 158 | 215 | 172 | 117 | 172 | 117 | 172 | 117 | 198 | 208 | 260 | 257 | 25 | 240 | 117 | 99 | 25 | 172 | 117 | 260 | 274 | 158 | 172 | 117 | 172 | 117 | 260 | 141 | 65 | 172 | 117 | 240 | 117 | 286 | 240 | 286 | 35 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 60 | 81 | 110 | 60 | 81 | 60 | 81 | 215 | 208 | 35 | 208 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 15 | 117 | 15 | 240 | 117 | 240 | 117 | 240 | 117 | 172 | 60 | 81 | 60 | 81 | 225 | 1 |
164b56522eb24164184460f8523ed7e2 | 82 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 172 | 117 | 172 | 117 | 16 | 31 | 86 | 112 | 271 | 111 | 81 | 140 | 286 | 194 | 286 | 297 | 252 | 215 | 117 | 297 | 93 | 264 | 215 | 271 | 111 | 81 | 140 | 286 | 194 | 286 | 297 | 252 | 215 | 297 | 93 | 264 | 215 | 274 | 158 | 215 | 274 | 158 | 215 | 240 | 117 | 71 | 297 | 135 | 171 | 215 | 212 | 253 | 79 | 215 | 245 | 210 | 65 | 80 | 2 | 81 | 140 | 108 | 80 | 159 | 240 | 159 | 232 | 50 | 215 | 261 | 82 | 240 | 208 | 187 | 208 | 198 | 172 | 117 | 172 | 117 | 35 | 172 | 117 | 275 | 240 | 80 | 60 | 215 | 35 | 1 |
56ae1459ba61a14eb119982d6ec793d7 | 82 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 16 | 208 | 187 | 208 | 240 | 117 | 39 | 35 | 171 | 172 | 117 | 208 | 35 | 215 | 274 | 158 | 215 | 274 | 158 | 215 | 35 | 208 | 240 | 117 | 228 | 208 | 240 | 117 | 240 | 117 | 240 | 117 | 71 | 56 | 172 | 117 | 240 | 117 | 260 | 141 | 65 | 260 | 141 | 65 | 198 | 172 | 117 | 260 | 294 | 240 | 117 | 198 | 208 | 187 | 208 | 240 | 117 | 240 | 117 | 240 | 117 | 82 | 112 | 123 | 65 | 240 | 117 | 275 | 112 | 240 | 117 | 240 | 117 | 198 | 240 | 117 | 240 | 117 | 82 | 172 | 117 | 16 | 31 | 215 | 108 | 208 | 80 | 240 | 117 | 1 |
c4148ca91c5246a8707a1ac1fd1e2e36 | 82 | 208 | 187 | 208 | 172 | 117 | 172 | 208 | 16 | 208 | 240 | 117 | 240 | 117 | 82 | 112 | 123 | 65 | 112 | 123 | 65 | 260 | 141 | 65 | 215 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 240 | 117 | 208 | 187 | 208 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 172 | 117 | 208 | 172 | 117 | 100 | 215 | 35 | 1 |
The dataset is clearly not balanced, with only 1,079 goodware API call sequences against 42,797 malware ones. It may be a good idea to use a sampling method to improve the model’s performance.
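This imbalance can be confirmed with a quick frequency check (a minimal sketch, assuming the dataset has been loaded into a data frame named data with the label in the malware column):

# Absolute counts and proportions of the label: 0 = goodware, 1 = malware
table(data$malware)
round(prop.table(table(data$malware)), 3)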
The correlation plot shows that there are some correlated variables, concentrated mainly between the 20th and 45th API calls. This may mean there are typical patterns in which software behaves in a similar, sequential way, performing similar processes one by one, e.g. edit file, save file. However, most of the chart is grey, which indicates no correlation. On top of that, the row for the malware variable is entirely grey, which indicates no correlation between the label and the explanatory variables.
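A heat map of this kind can be reproduced along these lines (a sketch, assuming the API call columns are named t_0 through t_99; reshape2 and ggplot2 are loaded below):

# Correlation matrix of the API-call columns and the label,
# melted into long format and drawn as a tile map
cor_mat <- cor(data[, c(paste0("t_", 0:99), "malware")])
melted_cor <- melt(cor_mat)
ggplot(melted_cor, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "grey90", high = "red", midpoint = 0)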
Since there are no empty values, no prior data transformation is needed and the modelling part can begin. All variables will be taken into consideration when modelling, since the data is sequential.
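The absence of missing values can be verified in one line (a sketch):

# Total number of missing cells in the dataset; 0 means no empty values
sum(is.na(data))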
The following libraries were used in order to enable malware detection:
library(readr)
library(ggplot2)
library(reshape2)
library(tidyverse)
library(MASS)
library(tree)
library(caret)
library(rpart)
library(ROCR)
library(rpart.plot)
library(rattle)
library(pROC)
library(here)
library(e1071)
library(gbm)
library(dplyr)
library(tibble)
Classical approaches such as logistic regression will probably perform poorly on this dataset or, in the worst case, generate essentially random predictions. It is therefore better to use methods such as Decision Trees, boosting, and Neural Networks in order to be able to detect malware.
Prior to model development, a data split is created, producing training and testing datasets. Both include a very similar proportion of goodware and malware observations (about 2.5% and 97.5%, respectively): the training set contains 743 goodware and 29,971 malware observations, while the testing set contains 336 and 12,826.
The first algorithm used in this study is the Decision Tree. As a non-parametric method that finds the optimal split by gauging the homogeneity of the target variable within each subgroup, it suits the nature of this dataset well. Let’s see how well it can perform in this case. First, a tree with default values will be generated.
set.seed(123456789)
# 70/30 train/test split, stratified on the label
training_obs <- createDataPartition(data$malware,
                                    p = 0.7,
                                    list = FALSE)
data.train <- data[training_obs, ]
data.test <- data[-training_obs, ]

model1.formula <- malware ~ . # all 100 API-call columns as predictors

data.tree1 <-
  rpart(model1.formula,    # model formula
        data = data.train, # data
        method = "class")  # classification tree
fancyRpartPlot(data.tree1)
Let’s now see how well it detects malicious software.
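The statistics below can be obtained with caret’s confusionMatrix(), for instance along these lines (a sketch; by default caret treats the first factor level, here goodware coded as 0, as the positive class):

# Class predictions on the test set compared with the true labels
data.pred.tree1 <- predict(data.tree1, data.test, type = "class")
confusionMatrix(data = data.pred.tree1,
                reference = as.factor(data.test$malware))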
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 204 29
## 1 132 12797
##
## Accuracy : 0.9878
## 95% CI : (0.9857, 0.9896)
## No Information Rate : 0.9745
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.711
##
## Mcnemar's Test P-Value : 9.078e-16
##
## Sensitivity : 0.60714
## Specificity : 0.99774
## Pos Pred Value : 0.87554
## Neg Pred Value : 0.98979
## Prevalence : 0.02553
## Detection Rate : 0.01550
## Detection Prevalence : 0.01770
## Balanced Accuracy : 0.80244
##
## 'Positive' Class : 0
##
With a Balanced Accuracy of 80%, Sensitivity of 61%, Specificity above 99% and overall Accuracy above 98%, the default model is not disappointing, considering that it was not tuned in any special way. However, the predictions can definitely be better, especially when it comes to detecting goodware. We can draw an interesting initial conclusion from this visualization: out of 100 explanatory variables, only a few are used to generate predictions. This indicates that there may be some crucial parts of software operations that make it malicious or not, which is confirmed by the variable importance plot below. As only a few variables have significant importance and the rest seem much less influential, there is reason to believe that certain parts of running the software can point to whether it is dangerous for the user or not.
Now, let’s try to improve the model’s performance. A more complex model will be estimated and then pruned so that only the relevant splits are left.
data.tree4 <-
  rpart(model1.formula,
        data = data.train,
        method = "class",
        minsplit = 600,  # minimum observations required to attempt a split
        minbucket = 300, # minimum observations in any terminal node
        maxdepth = 30,   # maximum depth of the tree
        cp = -1)         # a negative cp removes the complexity penalty, so the tree grows fully
fancyRpartPlot(data.tree4)
The resulting tree is very big, complex and difficult to interpret. Let’s now prune it and see how the predictions respond to that.
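One possible way to prune it is to cut the tree back at the complexity parameter that minimizes the cross-validated error stored in its cp table (a sketch; the exact pruning rule used originally is not shown):

# Pick the cp with the lowest cross-validated error and prune back to it
opt_cp <- data.tree4$cptable[which.min(data.tree4$cptable[, "xerror"]), "CP"]
data.tree4p <- prune(data.tree4, cp = opt_cp)
fancyRpartPlot(data.tree4p)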
The pruned tree is in its most basic form, which means one split only. It is very unlikely that this will generate well-fitted predictions. We can assess that by looking at the confusion matrix and ROC curve.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 97 58
## 1 239 12768
##
## Accuracy : 0.9774
## 95% CI : (0.9748, 0.9799)
## No Information Rate : 0.9745
## P-Value [Acc > NIR] : 0.01537
##
## Kappa : 0.3852
##
## Mcnemar's Test P-Value : < 2e-16
##
## Sensitivity : 0.28869
## Specificity : 0.99548
## Pos Pred Value : 0.62581
## Neg Pred Value : 0.98163
## Prevalence : 0.02553
## Detection Rate : 0.00737
## Detection Prevalence : 0.01178
## Balanced Accuracy : 0.64208
##
## 'Positive' Class : 0
##
As expected, the predictions on the test set are really bad when it comes to the model’s sensitivity: only 28.9%, which is reflected in the Balanced Accuracy of 64%.
The poor predictive ability is also visible in the ROC curve: it lies below the diagonal line, which indicates performance worse than a random predictor. This model definitely shouldn’t be used for predicting malware. Now, cross-validation will be used in order to find the parameter values that generate the most accurate predictions.
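A setup along these lines could produce the tuning results below (a sketch; caret’s ROC-based summary requires the labels to be a factor with valid R level names, hence the recoding to ‘No’/‘Yes’):

# Recode the label for caret, then run 10-fold cross-validation over a grid of cp values
data.train$malware <- factor(data.train$malware, labels = c("No", "Yes"))
ctrl <- trainControl(method = "cv",
                     number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)
set.seed(123456789)
data.tree.cv <- train(model1.formula,
                      data = data.train,
                      method = "rpart",
                      metric = "ROC",
                      trControl = ctrl,
                      tuneGrid = expand.grid(cp = seq(0, 0.03, 0.001)))
data.tree.cv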
## CART
##
## 30714 samples
## 100 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 27642, 27642, 27642, 27643, 27643, 27643, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.000 0.8981512 0.5772613 0.9954955
## 0.001 0.8741244 0.5665225 0.9963964
## 0.002 0.8525947 0.5530270 0.9969637
## 0.003 0.8347632 0.5276036 0.9977311
## 0.004 0.8123084 0.5141081 0.9978979
## 0.005 0.8097878 0.5181622 0.9978312
## 0.006 0.8050902 0.5154414 0.9981315
## 0.007 0.8037077 0.5127387 0.9981315
## 0.008 0.8010868 0.5074054 0.9981649
## 0.009 0.7858587 0.4952432 0.9977979
## 0.010 0.7858366 0.5020000 0.9975643
## 0.011 0.7840173 0.4980000 0.9974309
## 0.012 0.7709174 0.4966847 0.9974642
## 0.013 0.7709001 0.5033514 0.9973975
## 0.014 0.7682948 0.4980180 0.9974309
## 0.015 0.7604426 0.4845045 0.9974976
## 0.016 0.7560582 0.4790991 0.9976311
## 0.017 0.7468934 0.4655856 0.9976311
## 0.018 0.7390559 0.4534775 0.9977645
## 0.019 0.7333360 0.4480721 0.9979313
## 0.020 0.7308440 0.4467207 0.9980314
## 0.021 0.7308440 0.4467207 0.9980314
## 0.022 0.7308440 0.4467207 0.9980314
## 0.023 0.7308440 0.4467207 0.9980314
## 0.024 0.7308440 0.4467207 0.9980314
## 0.025 0.7308440 0.4467207 0.9980314
## 0.026 0.7308313 0.4467207 0.9978980
## 0.027 0.7308128 0.4480721 0.9976644
## 0.028 0.7308128 0.4480721 0.9976644
## 0.029 0.7308029 0.4480721 0.9975310
## 0.030 0.7321227 0.4520901 0.9973641
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.
It appears that the optimal value of the complexity parameter, which is used to control the size of the decision tree and to select the optimal tree size, is 0. That implies the parameter is effectively unconstrained: with cp = 0 the tree can grow as deep as feasible, and an extremely complicated, overfitted model may arise. It is generally not a good idea to set the complexity parameter to 0, since this might result in overfitting and poor generalization performance on fresh data.
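The tuned tree can nevertheless be evaluated on the test set (a sketch; the test labels are recoded to the same ‘No’/‘Yes’ levels):

# Confusion matrix for the cross-validated tree on the test set
data.test$malware <- factor(data.test$malware, labels = c("No", "Yes"))
confusionMatrix(data = predict(data.tree.cv, data.test),
                reference = data.test$malware)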
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 225 71
## Yes 111 12755
##
## Accuracy : 0.9862
## 95% CI : (0.984, 0.9881)
## No Information Rate : 0.9745
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.705
##
## Mcnemar's Test P-Value : 0.003842
##
## Sensitivity : 0.66964
## Specificity : 0.99446
## Pos Pred Value : 0.76014
## Neg Pred Value : 0.99137
## Prevalence : 0.02553
## Detection Rate : 0.01709
## Detection Prevalence : 0.02249
## Balanced Accuracy : 0.83205
##
## 'Positive' Class : No
##
Nevertheless, the predictions are not too bad. A Balanced Accuracy of 83% makes this the best Decision Tree model so far. Both Accuracy and Specificity are very high. It would be great, though, if there were a way to improve Sensitivity.
The ROC curves indicate that the model is definitely better than a random classifier. However, it is not perfect either, as it is far from reaching the ideal square shape. There is no problem with overfitting here, as the results on the training and testing sets are closely aligned.
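The curves can be drawn with pROC (a sketch; prob.train and prob.test are hypothetical names for the predicted probabilities of the positive class on each set):

# ROC curves for the training and testing sets on a single plot
roc.train <- roc(data.train$malware, prob.train)
roc.test <- roc(data.test$malware, prob.test)
plot(roc.train, col = "blue")
plot(roc.test, add = TRUE, col = "red")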
In general, the Decision Tree algorithm didn’t do well enough on our dataset. On the other hand, it is not a complex algorithm. Let’s see what more sophisticated ones can do.
Boosting is a popular machine learning technique that can increase the accuracy of a variety of models by turning several weak learners into one strong learner. The objective of boosting is to train a series of models, each of which is trained to fix the errors of the one before it. In this specific case, GBM will be used to combine several trees in the final model to produce a prediction. GBM has the benefit of being able to handle many features, which makes it ideal for complicated datasets, and that is exactly what we need here with 100 predictors. To prevent overfitting, which GBM is prone to, it is crucial to tune the hyperparameters properly. Let’s now see how accurate this approach can be on our dataset.
data.train$malware <- as.numeric(data.train$malware) - 1
data.test$malware <- as.numeric(data.test$malware) - 1
set.seed(123456789)
data.gbm <-
  gbm(model1.formula,
      data = data.train,
      distribution = "bernoulli",
      n.trees = 500,
      interaction.depth = 4,
      shrinkage = 0.01,
      verbose = FALSE)
datausa.pred.train.gbm <- predict(data.gbm,
                                  data.train,
                                  # type = "response" gives in this case
                                  # the probability of success
                                  type = "response",
                                  # n.trees sets the number of trees
                                  # which are used to generate the prediction
                                  n.trees = 500)
datausa.pred.test.gbm <- predict(data.gbm,
                                 data.test,
                                 type = "response",
                                 n.trees = 500)
confusionMatrix(data = as.factor(ifelse(datausa.pred.train.gbm > 0.5, 1, 0)),
                reference = as.factor(data.train$malware))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 409 16
## 1 334 29955
##
## Accuracy : 0.9886
## 95% CI : (0.9874, 0.9898)
## No Information Rate : 0.9758
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.695
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.55047
## Specificity : 0.99947
## Pos Pred Value : 0.96235
## Neg Pred Value : 0.98897
## Prevalence : 0.02419
## Detection Rate : 0.01332
## Detection Prevalence : 0.01384
## Balanced Accuracy : 0.77497
##
## 'Positive' Class : 0
##
confusionMatrix(data = as.factor(ifelse(datausa.pred.test.gbm > 0.5, 1, 0)),
                reference = as.factor(data.test$malware))
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 192 11
## 1 144 12815
##
## Accuracy : 0.9882
## 95% CI : (0.9862, 0.99)
## No Information Rate : 0.9745
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7068
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.57143
## Specificity : 0.99914
## Pos Pred Value : 0.94581
## Neg Pred Value : 0.98889
## Prevalence : 0.02553
## Detection Rate : 0.01459
## Detection Prevalence : 0.01542
## Balanced Accuracy : 0.78529
##
## 'Positive' Class : 0
##
The model with hardcoded hyperparameter values is not bad. However, the sensitivity is very low on both the training and testing datasets, and the balanced accuracy is lower than for the decision trees. An AUC of ~95% on the training set and ~94% on the testing set indicates that there is no problem with overfitting here. Let’s see what the ROC curve will tell us.
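The AUC and Gini figures below can be computed with pROC, using the fact that Gini = 2 * AUC - 1 (a sketch based on the predicted probabilities generated above):

# AUC from the predicted probabilities; Gini is a simple transformation of AUC
auc.train <- auc(roc(data.train$malware, datausa.pred.train.gbm))
auc.test <- auc(roc(data.test$malware, datausa.pred.test.gbm))
cat("AUC for train =", auc.train, ", Gini for train =", 2 * auc.train - 1, "\n")
cat("AUC for test =", auc.test, ", Gini for test =", 2 * auc.test - 1, "\n")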
## AUC for train = 0.9499104, Gini for train = 0.8998207
## AUC for test = 0.9390856, Gini for test = 0.8781711
Both models seem to be well fitted. Again, the ROC curve indicates that there is no overfitting problem in this case, which is a success considering that this model is prone to that phenomenon. However, overfitting is more likely to happen when the cross-validation technique is used for model training, which is why special attention is needed when we apply it:
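A grid search along these lines could produce the results printed below (a sketch; the grid mirrors the hyperparameter combinations in the output, and the labels are assumed to be factors with levels ‘No’/‘Yes’ again):

# 3-fold cross-validation over a GBM hyperparameter grid, optimizing ROC
gbm.grid <- expand.grid(n.trees = c(100, 500),
                        interaction.depth = c(1, 2, 4),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = c(100, 250, 500))
ctrl.gbm <- trainControl(method = "cv",
                         number = 3,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)
set.seed(123456789)
data.gbm.cv <- train(model1.formula,
                     data = data.train,
                     method = "gbm",
                     metric = "ROC",
                     trControl = ctrl.gbm,
                     tuneGrid = gbm.grid,
                     verbose = FALSE)
data.gbm.cv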
## Stochastic Gradient Boosting
##
## 30714 samples
## 100 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 20475, 20476, 20477
## Resampling results across tuning parameters:
##
##   shrinkage  interaction.depth  n.minobsinnode  n.trees  ROC        Sens         Spec
##   0.01       1                  100             100      0.7659008  0.008075399  0.9991325
##   0.01       1                  100             500      0.8845632  0.310930086  0.9980315
##   0.01       1                  250             100      0.7714563  0.000000000  1.0000000
##   0.01       1                  250             500      0.8887391  0.289375735  0.9994995
##   0.01       1                  500             100      0.7909549  0.000000000  1.0000000
##   0.01       1                  500             500      0.8761452  0.277257194  0.9999333
##   0.01       2                  100             100      0.8430256  0.071236559  0.9999333
##   0.01       2                  100             500      0.9105436  0.355306691  0.9985319
##   0.01       2                  250             100      0.8787782  0.000000000  1.0000000
##   0.01       2                  250             500      0.9150162  0.302832920  0.9989990
##   0.01       2                  500             100      0.8749848  0.000000000  1.0000000
##   0.01       2                  500             500      0.9115755  0.278595838  0.9999666
##   0.01       4                  100             100      0.8990318  0.000000000  1.0000000
##   0.01       4                  100             500      0.9378185  0.426646641  0.9986988
##   0.01       4                  250             100      0.9102775  0.000000000  1.0000000
##   0.01       4                  250             500      0.9376762  0.347231292  0.9996664
##   0.01       4                  500             100      0.8974153  0.000000000  1.0000000
##   0.01       4                  500             500      0.9373684  0.316290105  1.0000000
##   0.10       1                  100             100      0.8997789  0.368785643  0.9981649
##   0.10       1                  100             500      0.9295846  0.481824910  0.9970972
##   0.10       1                  250             100      0.9031057  0.305515650  0.9985986
##   0.10       1                  250             500      0.9340272  0.437404771  0.9975643
##   0.10       1                  500             100      0.8981709  0.288031649  0.9992326
##   0.10       1                  500             500      0.9350312  0.425297114  0.9982984
##   0.10       2                  100             100      0.9289951  0.429340255  0.9975643
##   0.10       2                  100             500      0.9514340  0.542379522  0.9969304
##   0.10       2                  250             100      0.9281799  0.367425232  0.9982650
##   0.10       2                  250             500      0.9520153  0.499319795  0.9978646
##   0.10       2                  500             100      0.9277171  0.329752732  0.9997331
##   0.10       2                  500             500      0.9524186  0.484513082  0.9979647
##   0.10       4                  100             100      0.9470772  0.481808585  0.9975977
##   0.10       4                  100             500      0.9632337  0.554481738  0.9975977
##   0.10       4                  250             100      0.9487769  0.423974794  0.9990991
##   0.10       4                  250             500      0.9635171  0.534309564  0.9978312
##   0.10       4                  500             100      0.9512696  0.403791737  0.9994662
##   0.10       4                  500             500      0.9636632  0.520841496  0.9979314
##
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 500, interaction.depth =
## 4, shrinkage = 0.1 and n.minobsinnode = 500.
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 504 12
## 1 239 29959
##
## Accuracy : 0.9918
## 95% CI : (0.9908, 0.9928)
## No Information Rate : 0.9758
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7966
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.67833
## Specificity : 0.99960
## Pos Pred Value : 0.97674
## Neg Pred Value : 0.99209
## Prevalence : 0.02419
## Detection Rate : 0.01641
## Detection Prevalence : 0.01680
## Balanced Accuracy : 0.83897
##
## 'Positive' Class : 0
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 199 17
## 1 137 12809
##
## Accuracy : 0.9883
## 95% CI : (0.9863, 0.9901)
## No Information Rate : 0.9745
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7153
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.59226
## Specificity : 0.99867
## Pos Pred Value : 0.92130
## Neg Pred Value : 0.98942
## Prevalence : 0.02553
## Detection Rate : 0.01512
## Detection Prevalence : 0.01641
## Balanced Accuracy : 0.79547
##
## 'Positive' Class : 0
##
As we can see, the optimal values found with cross-validation were: number of trees = 500; interaction.depth = 4 (it establishes how many interactions between various characteristics are taken into account in each decision tree split); shrinkage = 0.1; and n.minobsinnode = 500 (the minimum number of observations required in each terminal (leaf) node of the decision tree). The almost 4-percentage-point difference in balanced accuracy between the training and testing sets is worrying here, as it may indicate that the model is slightly overfitted. Either way, its predictions are worse than those of the decision trees.
## AUC for train = 0.8389654, Gini for train = 0.6779307
## AUC for test = 0.7954682, Gini for test = 0.5909365
The plot confirms this assumption: the ROC curve is quite close to the diagonal, and the gap between the train and test sets is not small. It is reasonable to move on to the most advanced technique now, a CNN, and see if it can top the decision trees.
The ability of Convolutional Neural Networks (CNNs) to learn the underlying patterns and characteristics that are indicative of dangerous activity has led to promising outcomes in the detection of malware. The requirement for a substantial amount of training data is one of the main obstacles to employing CNNs for malware detection: there are many varieties of malware, each with specific traits and behaviors, and it can be challenging to include all of these variants in a single dataset. However, our dataset is complex enough to build a model on top of it. This part will be run in Python, using (among others) the TensorFlow and Keras libraries:
import numpy as np  # linear algebra
import pandas as pd
import seaborn as sns
import scikitplot as skplt
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, LSTM, GlobalAveragePooling2D, TimeDistributed
from tensorflow.keras.layers import Dense, Dropout, Conv1D, MaxPool1D, BatchNormalization
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
A similar split of the initial dataset was performed in Python:
X_train, X_test, y_train, y_test = train_test_split(used_data, data['malware'], test_size=0.4,
                                                    shuffle=True, random_state=42, stratify=data['malware'])
Afterwards, a basic model with one convolutional layer was developed as a baseline and included the following parameters:
model = Sequential(name="Cnn-Lstm_model")
model.add(Embedding(input_dim=unique_api_calls, output_dim=15,
                    input_length=X_train.shape[1], name='layer_embedding'))
model.add(BatchNormalization())
model.add(Conv1D(filters=20, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPool1D(pool_size=5))
model.add(LSTM(units=512, return_sequences=False, dropout=0.2))
model.add(Dense(units=1, activation='sigmoid'))
optimizer = Adam(lr=.0001)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X_train, y_train, validation_split=0.2, epochs=40, batch_size=512)
As we can see, an embedding layer was added first to represent the API calls in a numerical way; it maps categorical inputs to continuous vector representations that are learned during training. Afterwards, the data is normalized and the convolutional layer is added. It performs a 1-dimensional convolution over the underlying data, which makes it suitable for sequential data, as is definitely the case here. The Conv1D layer applies a collection of filters to the sequence, with each filter covering a certain length of the input. The convolution operation’s output is a collection of feature maps that identify local patterns in the input sequence. It is followed by a MaxPool1D layer that reduces the size of the feature maps by simply keeping the most crucial feature in each window. Since the data is sequential and consecutive API calls can be related to each other, an LSTM layer is added, followed by a Dense layer with units=1 and a ‘sigmoid’ activation, which indicates that we want the output to be a probability.
The training process is shown on the plot below, so it can be seen how consecutive epochs contributed to training the model:
fig, ax = plt.subplots(1, 2, figsize=[12, 6])
ax[0].plot(history.history["loss"])
ax[0].plot(history.history["val_loss"])
ax[0].set_title("Loss")
ax[0].legend(("Training", "Validation"), loc="upper right")
ax[0].set_xlabel("Epochs")
ax[1].plot(history.history["accuracy"])
ax[1].plot(history.history["val_accuracy"])
ax[1].legend(("Training", "Validation"), loc="lower right")
ax[1].set_title("Accuracy")
ax[1].set_xlabel("Epochs")
It can be noticed that from 15 epochs onward, the training process starts giving consistent results. There should not be a problem with overfitting, since the validation and training scores are quite close to each other. Overall, it looks quite promising, since the loss on the validation set is below 0.05 and the accuracy is above 98%. Let’s now see the confusion matrix to assess the predictions.
pred = model.predict(X_test)  # predicted probabilities of the positive class
y_pred = np.where(pred > 0.5, 1, 0)  # hard class labels at the 0.5 threshold
print("CNN_LSTM model classification report: \n\n{}".format(
    classification_report(np.array(y_test), y_pred.flatten())))
ax = skplt.metrics.plot_confusion_matrix(y_test, y_pred, figsize=(8, 7))
tickx = ax.set_xticklabels(['Benign', 'Malware'])
ticky = ax.set_yticklabels(['Benign', 'Malware'])
The matrix looks promising. With 99% Specificity, 73% Sensitivity and a balanced accuracy of 86%, this is the best model so far. Let’s now take a look at the ROC curve.
from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# y_test: true labels of the data
# pred: predicted probabilities of the positive class
fpr, tpr, thresholds = roc_curve(y_test, pred)

# plot the ROC curve
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.show()

# calculate the AUC score
auc = roc_auc_score(y_test, pred)
print('AUC: %.3f' % auc)
The AUC of 0.983 and the shape of the curve look very good; the ROC curve is getting close to the perfect shape.
Now, a two-layer model will be trained in order to see if it can further improve the results. It will be very similar to the previous one; however, it will include one additional convolutional layer.
model = Sequential(name="Cnn-Lstm_model")
model.add(Embedding(input_dim=unique_api_calls, output_dim=8,
                    input_length=X_train.shape[1], name='layer_embedding'))
model.add(BatchNormalization())
model.add(Conv1D(filters=32, kernel_size=3, padding='valid', activation='relu'))
model.add(Dropout(0.2))
model.add(Conv1D(filters=32, kernel_size=3, padding='valid', activation='relu'))
model.add(MaxPool1D(pool_size=2))
model.add(LSTM(units=512, return_sequences=False, dropout=0.2))
model.add(Dense(units=1, activation='sigmoid'))
optimizer = Adam(lr=.0001)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X_train, y_train, validation_split=0.2, epochs=30, batch_size=512)

fig, ax = plt.subplots(1, 2, figsize=[12, 6])
ax[0].plot(history.history["loss"])
ax[0].plot(history.history["val_loss"])
ax[0].set_title("Loss")
ax[0].legend(("Training", "Validation"), loc="upper right")
ax[0].set_xlabel("Epochs")
ax[1].plot(history.history["accuracy"])
ax[1].plot(history.history["val_accuracy"])
ax[1].legend(("Training", "Validation"), loc="lower right")
ax[1].set_title("Accuracy")
ax[1].set_xlabel("Epochs")
From 15 epochs onward, the training process starts giving consistent results. Again, there is no problem with overfitting here. Both the loss and accuracy plots look very promising. Let’s now see what the confusion matrix and ROC curve look like.
y_pred = model.predict(X_test)  # predicted probabilities
pred = np.where(y_pred > 0.5, 1, 0)  # hard class labels at the 0.5 threshold
ax = skplt.metrics.plot_confusion_matrix(y_test, pred, figsize=(8, 7))
tickx = ax.set_xticklabels(['Benign', 'Malware'])
ticky = ax.set_yticklabels(['Benign', 'Malware'])
The confusion matrix looks very good. With a balanced accuracy of approximately 88%, 75% Sensitivity and 99% Specificity, this is the best model developed so far. Also, the ROC curve has an almost perfect form:
Over the course of the analysis, three different algorithms were applied in order to enable the prediction of malware. Out of all of them, it turned out that the CNN with an LSTM layer, which takes advantage of the relationships between consecutive data points, was able to provide the best fit to the underlying data. The overall accuracy is satisfying, which means that the model is able to detect malware activity automatically. The best model is able to classify 99% of the observations into the corresponding group, which is a great score. Even when Sensitivity and Specificity are considered, the model’s performance is still impressive.
Based on the evidence presented above, the analysis can be concluded with the following findings: