Introduction


In the current digital era, malware detection is a crucial component of computer security. As technology develops, malware threats are becoming more numerous and more complicated. Malware, short for “malicious software”, is any program or piece of code that is intended to harm or exploit computer systems or their users. Malware can appear in a variety of forms, such as viruses, worms, Trojan horses, spyware, adware, and ransomware, to name a few.

Malware attacks may have a significant negative effect, leading to financial losses, data breaches, system failures, and other types of disruption. Thus, having efficient techniques for identifying and preventing malware infections is crucial. Malware detection is the process of locating and evaluating potential threats to ascertain whether they are harmful or not.

Machine learning is an increasingly common method for detecting malware, thanks to its ability to categorize harmful software. The need for high-quality data is one of the key obstacles in adopting machine learning for malware detection: the quality and variety of the training dataset determine how well the model performs. Also, to avoid being discovered by machine learning models, malware creators may employ strategies like obfuscation and polymorphism. Given the evolving sophistication of malware threats, machine learning can offer more precise and efficient identification and categorization of dangerous software. However, more study is required to solve the issues of data quality and resilience against evasion strategies.

Data


The dataset on which the model was built was made available on the Kaggle portal and can be accessed at the web link. It contains 1,079 goodware API call sequences and 42,797 malware API call sequences. For each software sample, the first 100 non-repeated consecutive API calls connected to the parent process were collected from Cuckoo Sandbox reports. The dataset features the following columns:

  1. hash: MD5 hash of the sample
  2. t_0 … t_99: the API calls, one integer-encoded call per position in the sequence
  3. malware: 0 (goodware) or 1 (malware)

Every row includes a sequence of 100 API calls describing the system operations that were performed one by one. For instance, software activity could look as follows: create new folder, create new file, edit file, save file, etc., which is reflected by consecutive calls here and classified as malware or goodware in the last column.

hash t_0 t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8 t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16 t_17 t_18 t_19 t_20 t_21 t_22 t_23 t_24 t_25 t_26 t_27 t_28 t_29 t_30 t_31 t_32 t_33 t_34 t_35 t_36 t_37 t_38 t_39 t_40 t_41 t_42 t_43 t_44 t_45 t_46 t_47 t_48 t_49 t_50 t_51 t_52 t_53 t_54 t_55 t_56 t_57 t_58 t_59 t_60 t_61 t_62 t_63 t_64 t_65 t_66 t_67 t_68 t_69 t_70 t_71 t_72 t_73 t_74 t_75 t_76 t_77 t_78 t_79 t_80 t_81 t_82 t_83 t_84 t_85 t_86 t_87 t_88 t_89 t_90 t_91 t_92 t_93 t_94 t_95 t_96 t_97 t_98 t_99 malware
071e8c3f8922e186e57548cd4c703a5d 112 274 158 215 274 158 215 298 76 208 76 172 117 172 117 172 76 117 35 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 117 60 81 60 81 208 35 215 35 208 240 117 172 60 81 60 81 225 35 60 81 35 225 172 60 81 60 81 60 81 172 117 76 172 117 172 117 35 111 81 140 208 240 117 71 297 135 171 215 35 208 56 71 1
33f8e6d08a6aae939f25a8e0d63dd523 82 208 187 208 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 208 172 117 16 208 297 8 199 264 274 158 215 274 158 215 208 260 87 260 65 14 65 240 117 208 187 208 240 117 39 35 171 172 117 208 35 215 35 208 240 117 35 60 81 60 81 172 35 60 81 60 81 31 60 81 208 187 208 274 158 266 208 60 81 60 81 240 117 71 297 135 171 215 35 1
b68abd064e975e1c6d5f25e748663076 16 110 240 117 240 117 240 117 240 117 240 117 240 117 172 117 99 260 141 65 240 117 240 117 240 117 263 215 263 215 263 215 263 215 263 215 263 215 263 215 263 215 274 158 215 274 158 215 240 117 71 297 135 171 215 48 208 112 113 112 123 65 112 123 65 112 123 65 113 112 123 65 112 123 65 112 123 65 113 112 123 65 112 123 65 112 123 65 113 112 123 65 112 123 65 112 123 65 113 112 1
72049be7bd30ea61297ea624ae198067 82 208 187 208 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 208 16 240 117 228 208 187 228 240 117 82 198 86 82 274 37 240 117 240 117 262 228 275 172 240 275 172 274 158 215 172 117 215 37 158 215 240 117 82 240 117 240 117 240 117 240 117 228 187 215 274 158 215 274 158 215 240 117 71 297 135 171 215 228 215 208 302 208 302 187 208 302 228 302 1
c9b3700a77facf29172f32df6bc77f48 82 240 117 240 117 240 117 240 117 172 117 172 117 16 240 117 11 274 158 215 274 158 215 117 270 117 301 117 297 8 199 264 215 260 141 65 31 260 141 202 260 141 65 202 65 260 141 65 80 287 87 14 65 260 141 240 141 65 82 260 141 65 260 141 65 31 159 224 82 261 172 117 260 208 260 2 140 81 208 159 224 82 159 224 82 261 208 240 117 260 40 209 260 40 209 260 141 260 141 260 1
cc6217be863e606e49da90fee2252f52 117 208 117 208 117 240 117 240 117 208 228 215 274 158 215 274 158 215 240 117 71 297 135 171 215 208 187 20 34 215 208 187 86 215 240 91 89 192 89 133 89 248 297 135 171 8 178 215 18 194 240 117 215 117 31 56 71 56 172 117 240 117 260 141 65 194 117 240 117 260 141 65 260 141 65 260 141 65 260 141 65 260 141 65 260 112 117 260 141 65 260 141 65 260 141 65 9 117 260 65 1
f7a1a3c38809d807b3f5f4cc00b1e9b7 215 274 158 215 274 158 215 172 117 172 117 172 117 198 208 260 257 25 240 117 99 25 172 117 260 274 158 172 117 172 117 260 141 65 172 117 240 117 286 240 286 35 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 110 60 81 60 81 215 208 35 208 240 117 240 117 240 117 240 117 240 117 15 117 15 240 117 240 117 240 117 172 60 81 60 81 225 1
164b56522eb24164184460f8523ed7e2 82 240 117 240 117 240 117 240 117 240 117 172 117 172 117 16 31 86 112 271 111 81 140 286 194 286 297 252 215 117 297 93 264 215 271 111 81 140 286 194 286 297 252 215 297 93 264 215 274 158 215 274 158 215 240 117 71 297 135 171 215 212 253 79 215 245 210 65 80 2 81 140 108 80 159 240 159 232 50 215 261 82 240 208 187 208 198 172 117 172 117 35 172 117 275 240 80 60 215 35 1
56ae1459ba61a14eb119982d6ec793d7 82 240 117 240 117 240 117 240 117 240 117 16 208 187 208 240 117 39 35 171 172 117 208 35 215 274 158 215 274 158 215 35 208 240 117 228 208 240 117 240 117 240 117 71 56 172 117 240 117 260 141 65 260 141 65 198 172 117 260 294 240 117 198 208 187 208 240 117 240 117 240 117 82 112 123 65 240 117 275 112 240 117 240 117 198 240 117 240 117 82 172 117 16 31 215 108 208 80 240 117 1
c4148ca91c5246a8707a1ac1fd1e2e36 82 208 187 208 172 117 172 208 16 208 240 117 240 117 82 112 123 65 112 123 65 260 141 65 215 240 117 240 117 240 117 240 117 240 117 240 117 208 187 208 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 208 172 117 100 215 35 1

As we can see, the dataset is not balanced, with only 1,079 goodware API call sequences against 42,797 malware ones. It may be a good idea to use a sampling method to improve the model’s performance.
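For illustration, the imbalance can be inspected and addressed along the following lines. This is only a sketch, not part of the original pipeline; it assumes the data frame is called data with a factor column malware (caret, loaded in the next section, provides downSample() and upSample() for simple re-balancing):

table(data$malware)             # 1,079 goodware vs 42,797 malware
prop.table(table(data$malware)) # roughly 2.5% vs 97.5%

# downSample() equalizes the classes by discarding majority-class rows;
# upSample() would instead replicate minority-class rows
data.balanced <- downSample(x = data[, setdiff(names(data), "malware")],
                            y = data$malware,
                            yname = "malware")
table(data.balanced$malware)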

The correlation plot shows that there are some correlated variables, mainly concentrated between the 20th and 45th API calls. This may mean that there are typical patterns in which software behaves in a similar, sequential way and performs similar processes one by one, e.g. edit file, save file. However, most of the chart is grey, which indicates no correlation. On top of that, the row with the malware variable is entirely grey, which indicates no correlation between the label and the individual explanatory variables.
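The heatmap itself can be reproduced with ggplot2 and reshape2 (loaded in the next section). This sketch assumes the label is still stored numerically at this stage, so that it can enter the correlation matrix:

num_cols <- sapply(data, is.numeric)      # t_0 ... t_99 plus malware; drops the hash
cor_mat  <- cor(data[, num_cols])
ggplot(melt(cor_mat), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1)) # near-zero correlations render in the neutral mid colour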

Since there are no empty values, no prior data transformation is needed and the modelling part can be started. All variables will be taken into consideration when modelling, since the data is sequential.

Libraries used

The following libraries were used in order to enable malware detection:

library(readr)
library(ggplot2)
library(reshape2)
library(tidyverse)
library(MASS)
library(tree)
library(caret)
library(rpart)
library(ROCR)
library(rpart.plot)
library(rattle)
library(pROC)
library(here)
library(e1071)
library(gbm)
library(dplyr)
library(tibble)

Modeling

Classical approaches such as logistic regression will probably perform poorly on this dataset or, in the worst case, generate near-random predictions. It is therefore better to use methods such as Neural Networks and Decision Trees in order to be able to detect malware.

Prior to model development, a data split is created, producing training and testing datasets. Both of them preserve a very similar proportion of goodware and malware observations (roughly 2.5% and 97.5%; training: 743 goodware and 29,971 malware observations, testing: 336 and 12,826 respectively).

Decision Trees

The first algorithm used in this study is the Decision Tree. As it is a non-parametric method that gauges the homogeneity of the target variable within each subgroup by finding the optimal split, it suits the nature of this dataset well. Let’s see how well it can perform in this case. First, a tree with default values will be generated.

set.seed(123456789)
training_obs <- createDataPartition(data$malware, 
                                    p = 0.7, 
                                    list = FALSE) 
data.train <- data[training_obs,]
data.test  <- data[-training_obs,]

model1.formula <- malware ~.

data.tree1 <- 
  rpart(model1.formula, # model formula
        data = data.train, # data
        method = "class")
        
fancyRpartPlot(data.tree1)

Let’s now see how well it detects malicious software.
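The statistics below were obtained roughly as follows (a sketch of the evaluation step, assuming malware is stored as a factor; the original chunk is not shown):

pred.tree1 <- predict(data.tree1, data.test, type = "class") # class labels at the default cut-off
confusionMatrix(data = pred.tree1,
                reference = data.test$malware)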

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   204    29
##          1   132 12797
##                                           
##                Accuracy : 0.9878          
##                  95% CI : (0.9857, 0.9896)
##     No Information Rate : 0.9745          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.711           
##                                           
##  Mcnemar's Test P-Value : 9.078e-16       
##                                           
##             Sensitivity : 0.60714         
##             Specificity : 0.99774         
##          Pos Pred Value : 0.87554         
##          Neg Pred Value : 0.98979         
##              Prevalence : 0.02553         
##          Detection Rate : 0.01550         
##    Detection Prevalence : 0.01770         
##       Balanced Accuracy : 0.80244         
##                                           
##        'Positive' Class : 0               
## 

With a Balanced Accuracy of 80%, Sensitivity of 61%, Specificity above 99% and overall Accuracy above 98%, the default model is not disappointing, considering that it was not tuned in any special way. However, the predictions can definitely be better, especially when it comes to detecting goodware. We can also draw an interesting initial conclusion from the tree visualization: out of 100 explanatory variables, only a few are used to generate predictions. This indicates that there may be some crucial parts of software operations which make it malicious or not, which is confirmed by the variable importance plot below. As only a few variables have significant importance and the rest seem much less influential, there is reason to believe that certain parts of running the software can point to it being dangerous for the user or not.
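The importance scores behind that plot can be read straight off the fitted rpart object, for example:

imp <- data.tree1$variable.importance
head(sort(imp, decreasing = TRUE), 10) # the few API-call positions doing most of the work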

Now, let’s try to improve the model’s performance. A more complex model will be estimated and then pruned so that only the relevant splits remain.

data.tree4 <- 
  rpart(model1.formula,
        data = data.train,
        method = "class",
        minsplit = 600,  # minimum observations in a node for a split to be attempted
        minbucket = 300, # minimum observations in any terminal node
        maxdepth = 30,   # maximum depth of the tree
        cp = -1)         # negative cp removes the complexity penalty, growing a full tree
fancyRpartPlot(data.tree4)

The tree created is very big, complex and difficult to understand; let’s now prune it and see how the predictions respond.
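The pruning step itself is not shown; a typical recipe with rpart looks like the sketch below, which cuts the overgrown tree back at the cp value with the lowest cross-validated error (the report evidently pruned more aggressively, since only a single split is left):

printcp(data.tree4) # cross-validated error for each candidate cp
opt.cp <- data.tree4$cptable[which.min(data.tree4$cptable[, "xerror"]), "CP"]
data.tree4p <- prune(data.tree4, cp = opt.cp)
fancyRpartPlot(data.tree4p)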

The pruned tree is in its most basic form, which means it has one split only. It is very unlikely that this will generate well-fitted predictions. We can assess that by looking at the confusion matrix and the ROC curve.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0    97    58
##          1   239 12768
##                                           
##                Accuracy : 0.9774          
##                  95% CI : (0.9748, 0.9799)
##     No Information Rate : 0.9745          
##     P-Value [Acc > NIR] : 0.01537         
##                                           
##                   Kappa : 0.3852          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.28869         
##             Specificity : 0.99548         
##          Pos Pred Value : 0.62581         
##          Neg Pred Value : 0.98163         
##              Prevalence : 0.02553         
##          Detection Rate : 0.00737         
##    Detection Prevalence : 0.01178         
##       Balanced Accuracy : 0.64208         
##                                           
##        'Positive' Class : 0               
## 

As expected, the predictions on this data set are really bad when it comes to the model’s sensitivity - only 28.9% - which is reflected in the Balanced Accuracy of 64%.

The poor predictive ability is also visible when we take a look at the ROC curve. We can see that it is below the diagonal line, which indicates that it is worse than a random predictor. This model definitely shouldn’t be used for predicting malware. Now, the cross-validation technique will be used to find the parameter values that generate the most accurate predictions.
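The tuning below was presumably set up with caret along these lines (a sketch: the class labels are recoded to valid R factor names, and ROC is used as the selection metric):

data.train.cv <- data.train
data.train.cv$malware <- factor(data.train.cv$malware, labels = c("No", "Yes"))

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(123456789)
data.tree.cv <- train(model1.formula,
                      data = data.train.cv,
                      method = "rpart",
                      metric = "ROC",
                      trControl = ctrl,
                      tuneGrid = expand.grid(cp = seq(0, 0.03, 0.001)))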

## CART 
## 
## 30714 samples
##   100 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 27642, 27642, 27642, 27643, 27643, 27643, ... 
## Resampling results across tuning parameters:
## 
##   cp     ROC        Sens       Spec     
##   0.000  0.8981512  0.5772613  0.9954955
##   0.001  0.8741244  0.5665225  0.9963964
##   0.002  0.8525947  0.5530270  0.9969637
##   0.003  0.8347632  0.5276036  0.9977311
##   0.004  0.8123084  0.5141081  0.9978979
##   0.005  0.8097878  0.5181622  0.9978312
##   0.006  0.8050902  0.5154414  0.9981315
##   0.007  0.8037077  0.5127387  0.9981315
##   0.008  0.8010868  0.5074054  0.9981649
##   0.009  0.7858587  0.4952432  0.9977979
##   0.010  0.7858366  0.5020000  0.9975643
##   0.011  0.7840173  0.4980000  0.9974309
##   0.012  0.7709174  0.4966847  0.9974642
##   0.013  0.7709001  0.5033514  0.9973975
##   0.014  0.7682948  0.4980180  0.9974309
##   0.015  0.7604426  0.4845045  0.9974976
##   0.016  0.7560582  0.4790991  0.9976311
##   0.017  0.7468934  0.4655856  0.9976311
##   0.018  0.7390559  0.4534775  0.9977645
##   0.019  0.7333360  0.4480721  0.9979313
##   0.020  0.7308440  0.4467207  0.9980314
##   0.021  0.7308440  0.4467207  0.9980314
##   0.022  0.7308440  0.4467207  0.9980314
##   0.023  0.7308440  0.4467207  0.9980314
##   0.024  0.7308440  0.4467207  0.9980314
##   0.025  0.7308440  0.4467207  0.9980314
##   0.026  0.7308313  0.4467207  0.9978980
##   0.027  0.7308128  0.4480721  0.9976644
##   0.028  0.7308128  0.4480721  0.9976644
##   0.029  0.7308029  0.4480721  0.9975310
##   0.030  0.7321227  0.4520901  0.9973641
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.

It appears that the optimal value of the complexity parameter, which controls the size of the decision tree, is 0. That implies the parameter places no constraint on tree growth: with cp = 0 the tree is not penalized for additional splits and can grow as deep as the remaining constraints allow, which may result in an extremely complicated and overfitted model. It is generally not a good idea to set the complexity parameter to 0, since this might result in overfitting and poor generalization performance on fresh data.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No    225    71
##        Yes   111 12755
##                                          
##                Accuracy : 0.9862         
##                  95% CI : (0.984, 0.9881)
##     No Information Rate : 0.9745         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.705          
##                                          
##  Mcnemar's Test P-Value : 0.003842       
##                                          
##             Sensitivity : 0.66964        
##             Specificity : 0.99446        
##          Pos Pred Value : 0.76014        
##          Neg Pred Value : 0.99137        
##              Prevalence : 0.02553        
##          Detection Rate : 0.01709        
##    Detection Prevalence : 0.02249        
##       Balanced Accuracy : 0.83205        
##                                          
##        'Positive' Class : No             
## 

Nevertheless, the predictions are not too bad. The Balanced Accuracy of 83% makes this the best Decision Tree model so far. Both Accuracy and Specificity are very high. It would be great if there were a way to improve Sensitivity here, though.

The ROC curves indicate that the model is definitely better than a random classifier. However, it is not perfect either, as it is far from the ideal square shape. There is no problem with overfitting here, as the results on the training and testing sets are closely aligned.

In general, the Decision Tree algorithm did not perform well enough on our dataset. On the other hand, it is not a complex algorithm. Let’s see what more sophisticated ones can do.

Boosting

Boosting is a popular machine learning technique that can increase the accuracy of a variety of models by turning several weak learners into one strong learner. The objective of boosting is to train a series of models, each of which is trained to fix the errors of the one before it. In this specific case, GBM (Gradient Boosting Machine) will be used to combine several trees into a final model that produces the forecast. GBM has the benefit of being able to handle many features, which makes it well suited to complicated data sets - exactly what we need here, with 100 predictors. To prevent overfitting, which GBM is prone to, it is crucial to properly tune the hyperparameters. Let’s now see how accurate this approach can be on our dataset.

data.train$malware <- as.numeric(data.train$malware)-1
data.test$malware <- as.numeric(data.test$malware)-1

set.seed(123456789)
data.gbm <- 
  gbm(model1.formula,
      data = data.train,
      distribution = "bernoulli",
      n.trees = 500,
      interaction.depth = 4,
      shrinkage = 0.01,
      verbose = FALSE)

datausa.pred.train.gbm <- predict(data.gbm,
                                  data.train, 
                                  # type = "response" gives in this case 
                                  # probability of success
                                  type = "response",
                                  # n.trees sets the number of trees
                                  # which are used to generate the prediction
                                  n.trees = 500)

datausa.pred.test.gbm <- predict(data.gbm,
                                 data.test, 
                                 type = "response",
                                 n.trees = 500)

confusionMatrix(data = as.factor(ifelse(datausa.pred.train.gbm>0.5,1,0)),
                reference = as.factor(data.train$malware)) 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   409    16
##          1   334 29955
##                                           
##                Accuracy : 0.9886          
##                  95% CI : (0.9874, 0.9898)
##     No Information Rate : 0.9758          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.695           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.55047         
##             Specificity : 0.99947         
##          Pos Pred Value : 0.96235         
##          Neg Pred Value : 0.98897         
##              Prevalence : 0.02419         
##          Detection Rate : 0.01332         
##    Detection Prevalence : 0.01384         
##       Balanced Accuracy : 0.77497         
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(data = as.factor(ifelse(datausa.pred.test.gbm>0.5,1,0)),
                reference = as.factor(data.test$malware)) 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   192    11
##          1   144 12815
##                                         
##                Accuracy : 0.9882        
##                  95% CI : (0.9862, 0.99)
##     No Information Rate : 0.9745        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.7068        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.57143       
##             Specificity : 0.99914       
##          Pos Pred Value : 0.94581       
##          Neg Pred Value : 0.98889       
##              Prevalence : 0.02553       
##          Detection Rate : 0.01459       
##    Detection Prevalence : 0.01542       
##       Balanced Accuracy : 0.78529       
##                                         
##        'Positive' Class : 0             
## 

The model with hardcoded hyperparameter values is not bad. However, the sensitivity is very low on both the training and testing datasets, and the Balanced Accuracy is lower than for the decision trees. An AUC of ~95% on the training set and ~94% on the testing set indicates that there is no problem with overfitting here. Let’s see what the ROC curve will tell us.

## AUC for train = 0.9499104, Gini for train = 0.8998207
## AUC for test = 0.9390856, Gini for test = 0.8781711
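The AUC and Gini values above can be computed with pROC, which is already loaded; the Gini coefficient is simply 2 * AUC - 1. A sketch:

roc.train <- roc(data.train$malware, datausa.pred.train.gbm)
roc.test  <- roc(data.test$malware,  datausa.pred.test.gbm)
auc(roc.train); 2 * as.numeric(auc(roc.train)) - 1 # AUC and Gini, training set
auc(roc.test);  2 * as.numeric(auc(roc.test))  - 1 # AUC and Gini, testing set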

Both models seem to be well fitted. Again, the ROC curve indicates that there is no overfitting problem in this case, which is a success considering that this model is prone to it. However, overfitting is more likely to occur when the cross-validation technique is used for model training, so special attention is needed when we apply it.
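The grid search behind the results below can be reconstructed roughly as follows (a sketch reusing the recoded data.train.cv from the decision-tree tuning sketch; the grid mirrors the printed one):

gbm.grid <- expand.grid(n.trees = c(100, 500),
                        interaction.depth = c(1, 2, 4),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = c(100, 250, 500))

ctrl3 <- trainControl(method = "cv", number = 3,
                      classProbs = TRUE,
                      summaryFunction = twoClassSummary)

set.seed(123456789)
data.gbm.cv <- train(model1.formula,
                     data = data.train.cv,
                     method = "gbm",
                     metric = "ROC",
                     trControl = ctrl3,
                     tuneGrid = gbm.grid,
                     verbose = FALSE)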

## Stochastic Gradient Boosting 
## 
## 30714 samples
##   100 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 20475, 20476, 20477 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.minobsinnode  n.trees  ROC        Sens       
##   0.01       1                  100             100      0.7659008  0.008075399
##   0.01       1                  100             500      0.8845632  0.310930086
##   0.01       1                  250             100      0.7714563  0.000000000
##   0.01       1                  250             500      0.8887391  0.289375735
##   0.01       1                  500             100      0.7909549  0.000000000
##   0.01       1                  500             500      0.8761452  0.277257194
##   0.01       2                  100             100      0.8430256  0.071236559
##   0.01       2                  100             500      0.9105436  0.355306691
##   0.01       2                  250             100      0.8787782  0.000000000
##   0.01       2                  250             500      0.9150162  0.302832920
##   0.01       2                  500             100      0.8749848  0.000000000
##   0.01       2                  500             500      0.9115755  0.278595838
##   0.01       4                  100             100      0.8990318  0.000000000
##   0.01       4                  100             500      0.9378185  0.426646641
##   0.01       4                  250             100      0.9102775  0.000000000
##   0.01       4                  250             500      0.9376762  0.347231292
##   0.01       4                  500             100      0.8974153  0.000000000
##   0.01       4                  500             500      0.9373684  0.316290105
##   0.10       1                  100             100      0.8997789  0.368785643
##   0.10       1                  100             500      0.9295846  0.481824910
##   0.10       1                  250             100      0.9031057  0.305515650
##   0.10       1                  250             500      0.9340272  0.437404771
##   0.10       1                  500             100      0.8981709  0.288031649
##   0.10       1                  500             500      0.9350312  0.425297114
##   0.10       2                  100             100      0.9289951  0.429340255
##   0.10       2                  100             500      0.9514340  0.542379522
##   0.10       2                  250             100      0.9281799  0.367425232
##   0.10       2                  250             500      0.9520153  0.499319795
##   0.10       2                  500             100      0.9277171  0.329752732
##   0.10       2                  500             500      0.9524186  0.484513082
##   0.10       4                  100             100      0.9470772  0.481808585
##   0.10       4                  100             500      0.9632337  0.554481738
##   0.10       4                  250             100      0.9487769  0.423974794
##   0.10       4                  250             500      0.9635171  0.534309564
##   0.10       4                  500             100      0.9512696  0.403791737
##   0.10       4                  500             500      0.9636632  0.520841496
##   Spec     
##   0.9991325
##   0.9980315
##   1.0000000
##   0.9994995
##   1.0000000
##   0.9999333
##   0.9999333
##   0.9985319
##   1.0000000
##   0.9989990
##   1.0000000
##   0.9999666
##   1.0000000
##   0.9986988
##   1.0000000
##   0.9996664
##   1.0000000
##   1.0000000
##   0.9981649
##   0.9970972
##   0.9985986
##   0.9975643
##   0.9992326
##   0.9982984
##   0.9975643
##   0.9969304
##   0.9982650
##   0.9978646
##   0.9997331
##   0.9979647
##   0.9975977
##   0.9975977
##   0.9990991
##   0.9978312
##   0.9994662
##   0.9979314
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 500, interaction.depth =
##  4, shrinkage = 0.1 and n.minobsinnode = 500.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   504    12
##          1   239 29959
##                                           
##                Accuracy : 0.9918          
##                  95% CI : (0.9908, 0.9928)
##     No Information Rate : 0.9758          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7966          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.67833         
##             Specificity : 0.99960         
##          Pos Pred Value : 0.97674         
##          Neg Pred Value : 0.99209         
##              Prevalence : 0.02419         
##          Detection Rate : 0.01641         
##    Detection Prevalence : 0.01680         
##       Balanced Accuracy : 0.83897         
##                                           
##        'Positive' Class : 0               
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   199    17
##          1   137 12809
##                                           
##                Accuracy : 0.9883          
##                  95% CI : (0.9863, 0.9901)
##     No Information Rate : 0.9745          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7153          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.59226         
##             Specificity : 0.99867         
##          Pos Pred Value : 0.92130         
##          Neg Pred Value : 0.98942         
##              Prevalence : 0.02553         
##          Detection Rate : 0.01512         
##    Detection Prevalence : 0.01641         
##       Balanced Accuracy : 0.79547         
##                                           
##        'Positive' Class : 0               
## 

As we can see, the optimal values selected through cross-validation were: number of trees = 500; interaction.depth = 4 (it establishes how many interactions between features are taken into account in each decision tree split); shrinkage = 0.1; and n.minobsinnode (the minimum number of observations required in each terminal, or leaf, node of the tree) = 500. The almost 4% difference in Balanced Accuracy between the training and testing sets is worrying, as it may indicate that the model is slightly overfitted. Either way, its predictions are worse than those of the decision trees.

## AUC for train = 0.8389654, Gini for train = 0.6779307
## AUC for test = 0.7954682, Gini for test = 0.5909365

The plot confirms this assumption. The ROC curve is quite close to the diagonal and the gap between the train and test sets is not small. It is reasonable to move on now to the most advanced technique, a CNN, and see if it can top the decision trees.

Convolutional Neural Network

The ability of Convolutional Neural Networks (CNNs) to learn the underlying patterns and characteristics that are indicative of malicious activity has led to promising outcomes in the detection of malware. The requirement for a substantial amount of training data is one of the main obstacles to employing CNNs for malware detection: there are many varieties of malware, each with specific traits and behaviors, and it can be challenging to cover all of these variants in a single dataset. However, our dataset is complex enough to build a model on top of it. This part was run in Python using, among others, the TensorFlow and Keras libraries:

import numpy as np # linear algebra
import pandas as pd
import seaborn as sns
import scikitplot as skplt
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, LSTM, GlobalAveragePooling2D, TimeDistributed
from tensorflow.keras.layers import Dense, Dropout, Conv1D, MaxPool1D, BatchNormalization
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

A similar operation of splitting the initial dataset was run in Python:

X_train, X_test, y_train, y_test = train_test_split(used_data, data['malware'], test_size=0.4,
                                                    shuffle=True, random_state=42, stratify=data['malware'])

Afterwards, a basic model with one convolutional layer was developed as a baseline, with the following architecture:

model = Sequential(name="Cnn-Lstm_model")
model.add(Embedding(input_dim=unique_api_calls, output_dim=15,
                    input_length=X_train.shape[1], name='layer_embedding'))
model.add(BatchNormalization())
model.add(Conv1D(filters=20, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPool1D(pool_size=5))
model.add(LSTM(units=512, return_sequences=False, dropout=0.2))
model.add(Dense(units=1, activation='sigmoid'))
optimizer = Adam(learning_rate=0.0001)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X_train, y_train, validation_split=0.2, epochs=40, batch_size=512)

As we can see, first an embedding layer was added to represent the API calls numerically; it maps categorical inputs to continuous vector representations that are learned during training. Afterwards, the data was normalized and the convolutional layer was added. It performs a 1-dimensional convolution on the underlying data and is suited to sequential inputs, which is definitely the case here. The Conv1D layer applies a collection of filters to the sequence, each filter covering a certain length of the input. The output of the convolution operation is a set of feature maps that identify local patterns in the input sequence. It is followed by a MaxPool1D layer, which reduces the size of the feature maps by simply keeping the most important feature in each window. Since the data is sequential and consecutive API calls can be related to each other, an LSTM layer is added, followed by a Dense layer with parameters 1 and ‘sigmoid’, which indicates that we want the output to be a probability.

The training process is shown in the plot below, so that it can be seen how consecutive epochs contributed to training the model:

fig, ax = plt.subplots(1, 2, figsize=[12, 6])
ax[0].plot(history.history["loss"])
ax[0].plot(history.history["val_loss"])
ax[0].set_title("Loss")
ax[0].legend(("Training", "Validation"), loc="upper right")
ax[0].set_xlabel("Epochs")
ax[1].plot(history.history["accuracy"])
ax[1].plot(history.history["val_accuracy"])
ax[1].legend(("Training", "Validation"), loc="lower right")
ax[1].set_title("Accuracy")
ax[1].set_xlabel("Epochs")

It can be noticed that from epoch 15 onward, the training process starts giving consistent results. There should not be a problem with overfitting, since the validation and training scores are quite close to each other. Overall, it looks quite promising, since the loss on the validation set is below 0.05 and the accuracy is above 98%. Let’s now see the confusion matrix to assess the predictions.

y_pred = model.predict(X_test)      # predicted probabilities
pred = np.where(y_pred > 0.5, 1, 0) # hard labels at a 0.5 cut-off
print("CNN_LSTM model classification report: \n\n{}".format(
    classification_report(np.array(y_test), pred.flatten())))
ax = skplt.metrics.plot_confusion_matrix(y_test, pred, figsize=(8, 7))
tickx = ax.set_xticklabels(['Benign', 'Malware'])
ticky = ax.set_yticklabels(['Benign', 'Malware'])

The matrix looks promising. With 99% Specificity, 73% Sensitivity and a Balanced Accuracy of 86%, it is the best model so far. Let’s now take a look at the ROC curve.

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# y_test: true labels of the data
# y_pred: predicted probabilities of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

# plot the ROC curve
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.show()

from sklearn.metrics import roc_auc_score

# calculate the AUC score
auc = roc_auc_score(y_test, y_pred)
print('AUC: %.3f' % auc)

The AUC of 0.983 and the shape of the curve look very good; the ROC curve is getting close to the perfect shape.

Now, a two-layer model will be trained in order to see if it can further improve the results. It will be very similar to the previous one, but it will include one additional convolutional layer.

model = Sequential(name="Cnn-Lstm_model")
model.add(Embedding(input_dim=unique_api_calls, output_dim=8,
                    input_length=X_train.shape[1], name='layer_embedding'))
model.add(BatchNormalization())
model.add(Conv1D(filters=32, kernel_size=3, padding='valid', activation='relu'))
model.add(Dropout(0.2))
model.add(Conv1D(filters=32, kernel_size=3, padding='valid', activation='relu'))
model.add(MaxPool1D(pool_size=2))
model.add(LSTM(units=512, return_sequences=False, dropout=0.2))
model.add(Dense(units=1, activation='sigmoid'))
optimizer = Adam(learning_rate=0.0001)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X_train, y_train, validation_split=0.2, epochs=30, batch_size=512)

fig, ax = plt.subplots(1, 2, figsize=[12, 6])
ax[0].plot(history.history["loss"])
ax[0].plot(history.history["val_loss"])
ax[0].set_title("Loss")
ax[0].legend(("Training", "Validation"), loc="upper right")
ax[0].set_xlabel("Epochs")
ax[1].plot(history.history["accuracy"])
ax[1].plot(history.history["val_accuracy"])
ax[1].legend(("Training", "Validation"), loc="lower right")
ax[1].set_title("Accuracy")
ax[1].set_xlabel("Epochs")

From epoch 15 onward, the training process starts giving consistent results. Again, there is no problem with overfitting here. Both the loss and accuracy plots look very promising. Let’s now see what the confusion matrix and ROC curve look like.

y_pred = model.predict(X_test)
pred = np.where(y_pred > 0.5, 1, 0)
ax = skplt.metrics.plot_confusion_matrix(y_test, pred, figsize=(8, 7))
tickx = ax.set_xticklabels(['Benign', 'Malware'])
ticky = ax.set_yticklabels(['Benign', 'Malware'])

The confusion matrix looks very good. With a Balanced Accuracy of approximately 88%, 75% Sensitivity and 99% Specificity, this is the best model developed so far. Also, the ROC curve has an almost perfect shape:

Evaluation


Over the course of the analysis, three different algorithms were applied to enable the prediction of malware. Of all of them, the CNN with an LSTM layer, which takes advantage of the relationship between consecutive calls, provided the best fit to the underlying data. The overall accuracy is satisfying, which means that the model is able to detect malware activity automatically. The best model classifies 99% of the observations into the corresponding group, which is a great score. Even when Sensitivity and Specificity are considered, the model’s performance is still impressive.


Conclusions

Based on the evidence presented above, the analysis can be concluded with the following findings:

  1. There may exist crucial parts of a software’s execution that suggest whether the software is malicious or not.
  2. The CNN with an LSTM layer turned out to be the best method for malware detection.
  3. Given the nature of this dataset, it is important to maximize Balanced Accuracy rather than raw Accuracy.

Possible improvements

  1. A different approach to sampling the data so that it better represents the real-world malware-to-goodware proportion.
  2. Further hyperparameter tuning.
  3. Using more advanced techniques.
  4. Adding more features to the dataset.
  5. Using a cost matrix to optimize the cut-off threshold for the 0-1 classification in a way that maximizes Balanced Accuracy (see the sketch below).
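To illustrate the last point, here is a minimal sketch of scanning cut-off thresholds to maximize balanced accuracy on held-out predictions (best_cutoff is a hypothetical helper, not part of the analysis above):

best_cutoff <- function(probs, labels) {
  cuts <- seq(0.05, 0.95, by = 0.01)
  bacc <- sapply(cuts, function(k) {
    pred <- ifelse(probs > k, 1, 0)
    sens <- sum(pred == 1 & labels == 1) / sum(labels == 1) # true positive rate
    spec <- sum(pred == 0 & labels == 0) / sum(labels == 0) # true negative rate
    (sens + spec) / 2                                       # balanced accuracy
  })
  cuts[which.max(bacc)]
}
# e.g. best_cutoff(datausa.pred.test.gbm, data.test$malware)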