Introduction


In the current digital era, malware detection is a crucial component of computer security. As technology develops, malware threats are becoming more numerous and more complicated. Malware, short for “malicious software”, is any program or piece of code that is intended to harm or exploit computer systems or their users. Malware can appear in a variety of forms, such as viruses, worms, Trojan horses, spyware, adware, and ransomware, to name a few.

Malware attacks may have a significant negative effect, leading to financial losses, data breaches, system failures, and other types of disruption. Thus, having efficient techniques for identifying and preventing malware infections is crucial. Malware detection is the process of locating and evaluating potential threats to ascertain whether they are harmful or not.

Machine learning is an increasingly common method for detecting malware, thanks to its ability to categorize harmful software. The need for high-quality data is one of the key obstacles in adopting machine learning for malware detection: the quality and variety of the training dataset determine how well the model performs. Also, to avoid being discovered by machine learning models, malware creators may employ strategies like obfuscation and polymorphism. Given the evolving sophistication of malware threats, machine learning can offer more precise and efficient identification and categorization of dangerous software. However, more study is required to solve the issues of data quality and resilience against evasion strategies.

Data


The dataset on which the model was built was made available on the Kaggle portal and can be accessed at the web link. It contains 1,079 goodware API call sequences and 42,797 malware API call sequences. For each software sample, the first 100 non-repeated consecutive API calls connected to the parent process were collected from Cuckoo Sandbox reports. The dataset features the following columns:

  1. hash: MD5 hash of the sample
  2. t_0 … t_99: the API calls, one integer-encoded call per position in the sequence
  3. malware: 0 (goodware) or 1 (malware)

Every row includes a sequence of 100 API calls describing the system operations that were performed one by one. For instance, software activity could look as follows: create new folder, create new file, edit file, save file, etc., which is reflected by consecutive calls here and classified as malware or goodware in the last column.

hash t_0 t_1 t_2 t_3 t_4 t_5 t_6 t_7 t_8 t_9 t_10 t_11 t_12 t_13 t_14 t_15 t_16 t_17 t_18 t_19 t_20 t_21 t_22 t_23 t_24 t_25 t_26 t_27 t_28 t_29 t_30 t_31 t_32 t_33 t_34 t_35 t_36 t_37 t_38 t_39 t_40 t_41 t_42 t_43 t_44 t_45 t_46 t_47 t_48 t_49 t_50 t_51 t_52 t_53 t_54 t_55 t_56 t_57 t_58 t_59 t_60 t_61 t_62 t_63 t_64 t_65 t_66 t_67 t_68 t_69 t_70 t_71 t_72 t_73 t_74 t_75 t_76 t_77 t_78 t_79 t_80 t_81 t_82 t_83 t_84 t_85 t_86 t_87 t_88 t_89 t_90 t_91 t_92 t_93 t_94 t_95 t_96 t_97 t_98 t_99 malware
071e8c3f8922e186e57548cd4c703a5d 112 274 158 215 274 158 215 298 76 208 76 172 117 172 117 172 76 117 35 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 117 60 81 60 81 208 35 215 35 208 240 117 172 60 81 60 81 225 35 60 81 35 225 172 60 81 60 81 60 81 172 117 76 172 117 172 117 35 111 81 140 208 240 117 71 297 135 171 215 35 208 56 71 1
33f8e6d08a6aae939f25a8e0d63dd523 82 208 187 208 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 208 172 117 16 208 297 8 199 264 274 158 215 274 158 215 208 260 87 260 65 14 65 240 117 208 187 208 240 117 39 35 171 172 117 208 35 215 35 208 240 117 35 60 81 60 81 172 35 60 81 60 81 31 60 81 208 187 208 274 158 266 208 60 81 60 81 240 117 71 297 135 171 215 35 1
b68abd064e975e1c6d5f25e748663076 16 110 240 117 240 117 240 117 240 117 240 117 240 117 172 117 99 260 141 65 240 117 240 117 240 117 263 215 263 215 263 215 263 215 263 215 263 215 263 215 263 215 274 158 215 274 158 215 240 117 71 297 135 171 215 48 208 112 113 112 123 65 112 123 65 112 123 65 113 112 123 65 112 123 65 112 123 65 113 112 123 65 112 123 65 112 123 65 113 112 123 65 112 123 65 112 123 65 113 112 1
72049be7bd30ea61297ea624ae198067 82 208 187 208 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 208 16 240 117 228 208 187 228 240 117 82 198 86 82 274 37 240 117 240 117 262 228 275 172 240 275 172 274 158 215 172 117 215 37 158 215 240 117 82 240 117 240 117 240 117 240 117 228 187 215 274 158 215 274 158 215 240 117 71 297 135 171 215 228 215 208 302 208 302 187 208 302 228 302 1
c9b3700a77facf29172f32df6bc77f48 82 240 117 240 117 240 117 240 117 172 117 172 117 16 240 117 11 274 158 215 274 158 215 117 270 117 301 117 297 8 199 264 215 260 141 65 31 260 141 202 260 141 65 202 65 260 141 65 80 287 87 14 65 260 141 240 141 65 82 260 141 65 260 141 65 31 159 224 82 261 172 117 260 208 260 2 140 81 208 159 224 82 159 224 82 261 208 240 117 260 40 209 260 40 209 260 141 260 141 260 1
cc6217be863e606e49da90fee2252f52 117 208 117 208 117 240 117 240 117 208 228 215 274 158 215 274 158 215 240 117 71 297 135 171 215 208 187 20 34 215 208 187 86 215 240 91 89 192 89 133 89 248 297 135 171 8 178 215 18 194 240 117 215 117 31 56 71 56 172 117 240 117 260 141 65 194 117 240 117 260 141 65 260 141 65 260 141 65 260 141 65 260 141 65 260 112 117 260 141 65 260 141 65 260 141 65 9 117 260 65 1
f7a1a3c38809d807b3f5f4cc00b1e9b7 215 274 158 215 274 158 215 172 117 172 117 172 117 198 208 260 257 25 240 117 99 25 172 117 260 274 158 172 117 172 117 260 141 65 172 117 240 117 286 240 286 35 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 60 81 110 60 81 60 81 215 208 35 208 240 117 240 117 240 117 240 117 240 117 15 117 15 240 117 240 117 240 117 172 60 81 60 81 225 1
164b56522eb24164184460f8523ed7e2 82 240 117 240 117 240 117 240 117 240 117 172 117 172 117 16 31 86 112 271 111 81 140 286 194 286 297 252 215 117 297 93 264 215 271 111 81 140 286 194 286 297 252 215 297 93 264 215 274 158 215 274 158 215 240 117 71 297 135 171 215 212 253 79 215 245 210 65 80 2 81 140 108 80 159 240 159 232 50 215 261 82 240 208 187 208 198 172 117 172 117 35 172 117 275 240 80 60 215 35 1
56ae1459ba61a14eb119982d6ec793d7 82 240 117 240 117 240 117 240 117 240 117 16 208 187 208 240 117 39 35 171 172 117 208 35 215 274 158 215 274 158 215 35 208 240 117 228 208 240 117 240 117 240 117 71 56 172 117 240 117 260 141 65 260 141 65 198 172 117 260 294 240 117 198 208 187 208 240 117 240 117 240 117 82 112 123 65 240 117 275 112 240 117 240 117 198 240 117 240 117 82 172 117 16 31 215 108 208 80 240 117 1
c4148ca91c5246a8707a1ac1fd1e2e36 82 208 187 208 172 117 172 208 16 208 240 117 240 117 82 112 123 65 112 123 65 260 141 65 215 240 117 240 117 240 117 240 117 240 117 240 117 208 187 208 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 172 117 208 172 117 100 215 35 1

As we can see, the dataset is not balanced, with only 1,079 goodware API call sequences against 42,797 malware ones. It may be a good idea to use a sampling method to improve the model’s performance.
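For illustration, the imbalance can be inspected and addressed along the following lines. This is only a sketch, not part of the original pipeline; it assumes the data frame is called data with a factor column malware (caret, loaded in the next section, provides downSample() and upSample() for simple re-balancing):

table(data$malware)             # 1,079 goodware vs 42,797 malware
prop.table(table(data$malware)) # roughly 2.5% vs 97.5%

# downSample() equalizes the classes by discarding majority-class rows;
# upSample() would instead replicate minority-class rows
data.balanced <- downSample(x = data[, setdiff(names(data), "malware")],
                            y = data$malware,
                            yname = "malware")
table(data.balanced$malware)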

The correlation plot shows that there are some correlated variables, mainly concentrated between the 20th and 45th API calls. This may mean that there are typical patterns in which software behaves in a similar, sequential way and performs similar processes one by one, e.g. edit file, save file. However, most of the chart is grey, which indicates no correlation. On top of that, the row with the malware variable is entirely grey, which indicates no correlation between the label and the individual explanatory variables.
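The heatmap itself can be reproduced with ggplot2 and reshape2 (loaded in the next section). This sketch assumes the label is still stored numerically at this stage, so that it can enter the correlation matrix:

num_cols <- sapply(data, is.numeric)      # t_0 ... t_99 plus malware; drops the hash
cor_mat  <- cor(data[, num_cols])
ggplot(melt(cor_mat), aes(Var1, Var2, fill = value)) +
  geom_tile() +
  scale_fill_gradient2(limits = c(-1, 1)) # near-zero correlations render in the neutral mid colour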

Since there are no empty values, no prior data transformation is needed and the modelling part can be started. All variables will be taken into consideration when modelling, since the data is sequential.

Libraries used

The following libraries were used in order to enable malware detection:

library(readr)
library(ggplot2)
library(reshape2)
library(tidyverse)
library(MASS)
library(tree)
library(caret)
library(rpart)
library(ROCR)
library(rpart.plot)
library(rattle)
library(pROC)
library(here)
library(e1071)
library(gbm)
library(dplyr)
library(tibble)

Modeling

Classical approaches such as logistic regression will probably perform poorly on this dataset or, in the worst case, generate near-random predictions. It is therefore better to use methods such as Neural Networks and Decision Trees in order to be able to detect malware.

Prior to model development, a data split is created, producing training and testing datasets. Both of them preserve a very similar proportion of goodware and malware observations (roughly 2.5% and 97.5%; training: 743 goodware and 29,971 malware observations, testing: 336 and 12,826 respectively).

Decision Trees

The first algorithm used in this study is the Decision Tree. As it is a non-parametric method that gauges the homogeneity of the target variable within each subgroup by finding the optimal split, it suits the nature of this dataset well. Let’s see how well it can perform in this case. First, a tree with default values will be generated.

set.seed(123456789)
training_obs <- createDataPartition(data$malware, 
                                    p = 0.7, 
                                    list = FALSE) 
data.train <- data[training_obs,]
data.test  <- data[-training_obs,]

model1.formula <- malware ~.

data.tree1 <- 
  rpart(model1.formula, # model formula
        data = data.train, # data
        method = "class")
        
fancyRpartPlot(data.tree1)

Let’s now see how well it detects malicious software.
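The statistics below were obtained roughly as follows (a sketch of the evaluation step, assuming malware is stored as a factor; the original chunk is not shown):

pred.tree1 <- predict(data.tree1, data.test, type = "class") # class labels at the default cut-off
confusionMatrix(data = pred.tree1,
                reference = data.test$malware)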

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   204    29
##          1   132 12797
##                                           
##                Accuracy : 0.9878          
##                  95% CI : (0.9857, 0.9896)
##     No Information Rate : 0.9745          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.711           
##                                           
##  Mcnemar's Test P-Value : 9.078e-16       
##                                           
##             Sensitivity : 0.60714         
##             Specificity : 0.99774         
##          Pos Pred Value : 0.87554         
##          Neg Pred Value : 0.98979         
##              Prevalence : 0.02553         
##          Detection Rate : 0.01550         
##    Detection Prevalence : 0.01770         
##       Balanced Accuracy : 0.80244         
##                                           
##        'Positive' Class : 0               
## 

With a Balanced Accuracy of 80%, Sensitivity of 61%, Specificity above 99% and overall Accuracy above 98%, the default model is not disappointing, considering that it was not tuned in any special way. However, the predictions can definitely be better, especially when it comes to detecting goodware. We can also draw an interesting initial conclusion from the tree visualization: out of 100 explanatory variables, only a few are used to generate predictions. This indicates that there may be some crucial parts of software operations which make it malicious or not, which is confirmed by the variable importance plot below. As only a few variables have significant importance and the rest seem much less influential, there is reason to believe that certain parts of running the software can point to it being dangerous for the user or not.
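The importance scores behind that plot can be read straight off the fitted rpart object, for example:

imp <- data.tree1$variable.importance
head(sort(imp, decreasing = TRUE), 10) # the few API-call positions doing most of the work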

Now, let’s try to improve the model’s performance. A more complex model will be estimated and then pruned so that only the relevant splits remain.

data.tree4 <- 
  rpart(model1.formula,
        data = data.train,
        method = "class",
        minsplit = 600,  # minimum observations in a node for a split to be attempted
        minbucket = 300, # minimum observations in any terminal node
        maxdepth = 30,   # maximum depth of the tree
        cp = -1)         # negative cp removes the complexity penalty, growing a full tree
fancyRpartPlot(data.tree4)

The tree created is very big, complex and difficult to understand; let’s now prune it and see how the predictions respond.
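The pruning step itself is not shown; a typical recipe with rpart looks like the sketch below, which cuts the overgrown tree back at the cp value with the lowest cross-validated error (the report evidently pruned more aggressively, since only a single split is left):

printcp(data.tree4) # cross-validated error for each candidate cp
opt.cp <- data.tree4$cptable[which.min(data.tree4$cptable[, "xerror"]), "CP"]
data.tree4p <- prune(data.tree4, cp = opt.cp)
fancyRpartPlot(data.tree4p)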

The pruned tree is in its most basic form, which means it has one split only. It is very unlikely that this will generate well-fitted predictions. We can assess that by looking at the confusion matrix and the ROC curve.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0    97    58
##          1   239 12768
##                                           
##                Accuracy : 0.9774          
##                  95% CI : (0.9748, 0.9799)
##     No Information Rate : 0.9745          
##     P-Value [Acc > NIR] : 0.01537         
##                                           
##                   Kappa : 0.3852          
##                                           
##  Mcnemar's Test P-Value : < 2e-16         
##                                           
##             Sensitivity : 0.28869         
##             Specificity : 0.99548         
##          Pos Pred Value : 0.62581         
##          Neg Pred Value : 0.98163         
##              Prevalence : 0.02553         
##          Detection Rate : 0.00737         
##    Detection Prevalence : 0.01178         
##       Balanced Accuracy : 0.64208         
##                                           
##        'Positive' Class : 0               
## 

As expected, the predictions on this data set are really bad when it comes to the model’s sensitivity - only 28.9% - which is reflected in the Balanced Accuracy of 64%.

The poor predictive ability is also visible when we take a look at the ROC curve. We can see that it is below the diagonal line, which indicates that it is worse than a random predictor. This model definitely shouldn’t be used for predicting malware. Now, the cross-validation technique will be used to find the parameter values that generate the most accurate predictions.
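The tuning below was presumably set up with caret along these lines (a sketch: the class labels are recoded to valid R factor names, and ROC is used as the selection metric):

data.train.cv <- data.train
data.train.cv$malware <- factor(data.train.cv$malware, labels = c("No", "Yes"))

ctrl <- trainControl(method = "cv", number = 10,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary)

set.seed(123456789)
data.tree.cv <- train(model1.formula,
                      data = data.train.cv,
                      method = "rpart",
                      metric = "ROC",
                      trControl = ctrl,
                      tuneGrid = expand.grid(cp = seq(0, 0.03, 0.001)))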

## CART 
## 
## 30714 samples
##   100 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 27642, 27642, 27642, 27643, 27643, 27643, ... 
## Resampling results across tuning parameters:
## 
##   cp     ROC        Sens       Spec     
##   0.000  0.8981512  0.5772613  0.9954955
##   0.001  0.8741244  0.5665225  0.9963964
##   0.002  0.8525947  0.5530270  0.9969637
##   0.003  0.8347632  0.5276036  0.9977311
##   0.004  0.8123084  0.5141081  0.9978979
##   0.005  0.8097878  0.5181622  0.9978312
##   0.006  0.8050902  0.5154414  0.9981315
##   0.007  0.8037077  0.5127387  0.9981315
##   0.008  0.8010868  0.5074054  0.9981649
##   0.009  0.7858587  0.4952432  0.9977979
##   0.010  0.7858366  0.5020000  0.9975643
##   0.011  0.7840173  0.4980000  0.9974309
##   0.012  0.7709174  0.4966847  0.9974642
##   0.013  0.7709001  0.5033514  0.9973975
##   0.014  0.7682948  0.4980180  0.9974309
##   0.015  0.7604426  0.4845045  0.9974976
##   0.016  0.7560582  0.4790991  0.9976311
##   0.017  0.7468934  0.4655856  0.9976311
##   0.018  0.7390559  0.4534775  0.9977645
##   0.019  0.7333360  0.4480721  0.9979313
##   0.020  0.7308440  0.4467207  0.9980314
##   0.021  0.7308440  0.4467207  0.9980314
##   0.022  0.7308440  0.4467207  0.9980314
##   0.023  0.7308440  0.4467207  0.9980314
##   0.024  0.7308440  0.4467207  0.9980314
##   0.025  0.7308440  0.4467207  0.9980314
##   0.026  0.7308313  0.4467207  0.9978980
##   0.027  0.7308128  0.4480721  0.9976644
##   0.028  0.7308128  0.4480721  0.9976644
##   0.029  0.7308029  0.4480721  0.9975310
##   0.030  0.7321227  0.4520901  0.9973641
## 
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.

It appears that the optimal value of the complexity parameter, which controls the size of the decision tree, is 0. That implies the parameter places no constraint on tree growth: with cp = 0 the tree is not penalized for additional splits and can grow as deep as the remaining constraints allow, which may result in an extremely complicated and overfitted model. It is generally not a good idea to set the complexity parameter to 0, since this might result in overfitting and poor generalization performance on fresh data.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    No   Yes
##        No    225    71
##        Yes   111 12755
##                                          
##                Accuracy : 0.9862         
##                  95% CI : (0.984, 0.9881)
##     No Information Rate : 0.9745         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.705          
##                                          
##  Mcnemar's Test P-Value : 0.003842       
##                                          
##             Sensitivity : 0.66964        
##             Specificity : 0.99446        
##          Pos Pred Value : 0.76014        
##          Neg Pred Value : 0.99137        
##              Prevalence : 0.02553        
##          Detection Rate : 0.01709        
##    Detection Prevalence : 0.02249        
##       Balanced Accuracy : 0.83205        
##                                          
##        'Positive' Class : No             
## 

Nevertheless, the predictions are not too bad. The Balanced Accuracy of 83% makes this the best Decision Tree model so far. Both Accuracy and Specificity are very high. It would be great if there were a way to improve Sensitivity here, though.

The ROC curves indicate that the model is definitely better than a random classifier. However, it is not perfect either, as it is far from the ideal square shape. There is no problem with overfitting here, as the results on the training and testing sets are closely aligned.

In general, the Decision Tree algorithm did not perform well enough on our dataset. On the other hand, it is not a complex algorithm. Let’s see what more sophisticated ones can do.

Boosting

Boosting is a popular machine learning technique that can increase the accuracy of a variety of models by turning several weak learners into one strong learner. The objective of boosting is to train a series of models, each of which is trained to fix the errors of the one before it. In this specific case, GBM (Gradient Boosting Machine) will be used to combine several trees into a final model that produces the forecast. GBM has the benefit of being able to handle many features, which makes it well suited to complicated data sets - exactly what we need here, with 100 predictors. To prevent overfitting, which GBM is prone to, it is crucial to properly tune the hyperparameters. Let’s now see how accurate this approach can be on our dataset.

data.train$malware <- as.numeric(data.train$malware)-1
data.test$malware <- as.numeric(data.test$malware)-1

set.seed(123456789)
data.gbm <- 
  gbm(model1.formula,
      data = data.train,
      distribution = "bernoulli",
      n.trees = 500,
      interaction.depth = 4,
      shrinkage = 0.01,
      verbose = FALSE)

datausa.pred.train.gbm <- predict(data.gbm,
                                  data.train, 
                                  # type = "response" gives in this case 
                                  # probability of success
                                  type = "response",
                                  # n.trees sets the number of trees
                                  # which are used to generate the prediction
                                  n.trees = 500)

datausa.pred.test.gbm <- predict(data.gbm,
                                 data.test, 
                                 type = "response",
                                 n.trees = 500)

confusionMatrix(data = as.factor(ifelse(datausa.pred.train.gbm>0.5,1,0)),
                reference = as.factor(data.train$malware)) 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   409    16
##          1   334 29955
##                                           
##                Accuracy : 0.9886          
##                  95% CI : (0.9874, 0.9898)
##     No Information Rate : 0.9758          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.695           
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.55047         
##             Specificity : 0.99947         
##          Pos Pred Value : 0.96235         
##          Neg Pred Value : 0.98897         
##              Prevalence : 0.02419         
##          Detection Rate : 0.01332         
##    Detection Prevalence : 0.01384         
##       Balanced Accuracy : 0.77497         
##                                           
##        'Positive' Class : 0               
## 
confusionMatrix(data = as.factor(ifelse(datausa.pred.test.gbm>0.5,1,0)),
                reference = as.factor(data.test$malware)) 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   192    11
##          1   144 12815
##                                         
##                Accuracy : 0.9882        
##                  95% CI : (0.9862, 0.99)
##     No Information Rate : 0.9745        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.7068        
##                                         
##  Mcnemar's Test P-Value : < 2.2e-16     
##                                         
##             Sensitivity : 0.57143       
##             Specificity : 0.99914       
##          Pos Pred Value : 0.94581       
##          Neg Pred Value : 0.98889       
##              Prevalence : 0.02553       
##          Detection Rate : 0.01459       
##    Detection Prevalence : 0.01542       
##       Balanced Accuracy : 0.78529       
##                                         
##        'Positive' Class : 0             
## 

The model with hardcoded hyperparameter values is not bad. However, the sensitivity is very low on both the training and testing datasets, and the Balanced Accuracy is lower than for the decision trees. An AUC of ~95% on the training set and ~94% on the testing set indicates that there is no problem with overfitting here. Let’s see what the ROC curve will tell us.

## AUC for train = 0.9499104, Gini for train = 0.8998207
## AUC for test = 0.9390856, Gini for test = 0.8781711
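The AUC and Gini values above can be computed with pROC, which is already loaded; the Gini coefficient is simply 2 * AUC - 1. A sketch:

roc.train <- roc(data.train$malware, datausa.pred.train.gbm)
roc.test  <- roc(data.test$malware,  datausa.pred.test.gbm)
auc(roc.train); 2 * as.numeric(auc(roc.train)) - 1 # AUC and Gini, training set
auc(roc.test);  2 * as.numeric(auc(roc.test))  - 1 # AUC and Gini, testing set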

Both models seem to be well fitted. Again, the ROC curve indicates that there is no overfitting problem in this case, which is a success considering that this model is prone to it. However, overfitting is more likely to occur when the cross-validation technique is used for model training, so special attention is needed when we apply it.
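The grid search behind the results below can be reconstructed roughly as follows (a sketch reusing the recoded data.train.cv from the decision-tree tuning sketch; the grid mirrors the printed one):

gbm.grid <- expand.grid(n.trees = c(100, 500),
                        interaction.depth = c(1, 2, 4),
                        shrinkage = c(0.01, 0.1),
                        n.minobsinnode = c(100, 250, 500))

ctrl3 <- trainControl(method = "cv", number = 3,
                      classProbs = TRUE,
                      summaryFunction = twoClassSummary)

set.seed(123456789)
data.gbm.cv <- train(model1.formula,
                     data = data.train.cv,
                     method = "gbm",
                     metric = "ROC",
                     trControl = ctrl3,
                     tuneGrid = gbm.grid,
                     verbose = FALSE)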

## Stochastic Gradient Boosting 
## 
## 30714 samples
##   100 predictor
##     2 classes: 'No', 'Yes' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 20475, 20476, 20477 
## Resampling results across tuning parameters:
## 
##   shrinkage  interaction.depth  n.minobsinnode  n.trees  ROC        Sens       
##   0.01       1                  100             100      0.7659008  0.008075399
##   0.01       1                  100             500      0.8845632  0.310930086
##   0.01       1                  250             100      0.7714563  0.000000000
##   0.01       1                  250             500      0.8887391  0.289375735
##   0.01       1                  500             100      0.7909549  0.000000000
##   0.01       1                  500             500      0.8761452  0.277257194
##   0.01       2                  100             100      0.8430256  0.071236559
##   0.01       2                  100             500      0.9105436  0.355306691
##   0.01       2                  250             100      0.8787782  0.000000000
##   0.01       2                  250             500      0.9150162  0.302832920
##   0.01       2                  500             100      0.8749848  0.000000000
##   0.01       2                  500             500      0.9115755  0.278595838
##   0.01       4                  100             100      0.8990318  0.000000000
##   0.01       4                  100             500      0.9378185  0.426646641
##   0.01       4                  250             100      0.9102775  0.000000000
##   0.01       4                  250             500      0.9376762  0.347231292
##   0.01       4                  500             100      0.8974153  0.000000000
##   0.01       4                  500             500      0.9373684  0.316290105
##   0.10       1                  100             100      0.8997789  0.368785643
##   0.10       1                  100             500      0.9295846  0.481824910
##   0.10       1                  250             100      0.9031057  0.305515650
##   0.10       1                  250             500      0.9340272  0.437404771
##   0.10       1                  500             100      0.8981709  0.288031649
##   0.10       1                  500             500      0.9350312  0.425297114
##   0.10       2                  100             100      0.9289951  0.429340255
##   0.10       2                  100             500      0.9514340  0.542379522
##   0.10       2                  250             100      0.9281799  0.367425232
##   0.10       2                  250             500      0.9520153  0.499319795
##   0.10       2                  500             100      0.9277171  0.329752732
##   0.10       2                  500             500      0.9524186  0.484513082
##   0.10       4                  100             100      0.9470772  0.481808585
##   0.10       4                  100             500      0.9632337  0.554481738
##   0.10       4                  250             100      0.9487769  0.423974794
##   0.10       4                  250             500      0.9635171  0.534309564
##   0.10       4                  500             100      0.9512696  0.403791737
##   0.10       4                  500             500      0.9636632  0.520841496
##   Spec     
##   0.9991325
##   0.9980315
##   1.0000000
##   0.9994995
##   1.0000000
##   0.9999333
##   0.9999333
##   0.9985319
##   1.0000000
##   0.9989990
##   1.0000000
##   0.9999666
##   1.0000000
##   0.9986988
##   1.0000000
##   0.9996664
##   1.0000000
##   1.0000000
##   0.9981649
##   0.9970972
##   0.9985986
##   0.9975643
##   0.9992326
##   0.9982984
##   0.9975643
##   0.9969304
##   0.9982650
##   0.9978646
##   0.9997331
##   0.9979647
##   0.9975977
##   0.9975977
##   0.9990991
##   0.9978312
##   0.9994662
##   0.9979314
## 
## ROC was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 500, interaction.depth =
##  4, shrinkage = 0.1 and n.minobsinnode = 500.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   504    12
##          1   239 29959
##                                           
##                Accuracy : 0.9918          
##                  95% CI : (0.9908, 0.9928)
##     No Information Rate : 0.9758          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7966          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.67833         
##             Specificity : 0.99960         
##          Pos Pred Value : 0.97674         
##          Neg Pred Value : 0.99209         
##              Prevalence : 0.02419         
##          Detection Rate : 0.01641         
##    Detection Prevalence : 0.01680         
##       Balanced Accuracy : 0.83897         
##                                           
##        'Positive' Class : 0               
## 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0   199    17
##          1   137 12809
##                                           
##                Accuracy : 0.9883          
##                  95% CI : (0.9863, 0.9901)
##     No Information Rate : 0.9745          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7153          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.59226         
##             Specificity : 0.99867         
##          Pos Pred Value : 0.92130         
##          Neg Pred Value : 0.98942         
##              Prevalence : 0.02553         
##          Detection Rate : 0.01512         
##    Detection Prevalence : 0.01641         
##       Balanced Accuracy : 0.79547         
##                                           
##        'Positive' Class : 0               
## 

As we can see, the optimal values selected through cross-validation were: number of trees = 500; interaction.depth = 4 (it establishes how many interactions between features are taken into account in each decision tree split); shrinkage = 0.1; and n.minobsinnode (the minimum number of observations required in each terminal, or leaf, node of the tree) = 500. The almost 4% difference in Balanced Accuracy between the training and testing sets is worrying, as it may indicate that the model is slightly overfitted. Either way, its predictions are worse than those of the decision trees.

## AUC for train = 0.8389654, Gini for train = 0.6779307
## AUC for test = 0.7954682, Gini for test = 0.5909365

The plot confirms this assumption. The ROC curve is quite close to the diagonal and the gap between the train and test sets is not small. It is reasonable to move on now to the most advanced technique, a CNN, and see if it can top the decision trees.

Convolutional Neural Network

The ability of Convolutional Neural Networks (CNNs) to learn the underlying patterns and characteristics that are indicative of malicious activity has led to promising outcomes in the detection of malware. The requirement for a substantial amount of training data is one of the main obstacles to employing CNNs for malware detection: there are many varieties of malware, each with specific traits and behaviors, and it can be challenging to cover all of these variants in a single dataset. However, our dataset is complex enough to build a model on top of it. This part was run in Python using, among others, the TensorFlow and Keras libraries:

import numpy as np # linear algebra
import pandas as pd
import seaborn as sns
import scikitplot as skplt
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.layers import Embedding, LSTM, GlobalAveragePooling2D, TimeDistributed
from tensorflow.keras.layers import Dense, Dropout, Conv1D, MaxPool1D, BatchNormalization
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

A similar operation of splitting the initial dataset was run in Python:

X_train, X_test, y_train, y_test = train_test_split(used_data, data['malware'], test_size=0.4,
                                                    shuffle=True, random_state=42, stratify=data['malware'])

Afterwards, a basic model with one convolutional layer was developed as a baseline, with the following architecture:

model = Sequential(name="Cnn-Lstm_model")
model.add(Embedding(input_dim=unique_api_calls, output_dim=15,
                    input_length=X_train.shape[1], name='layer_embedding'))
model.add(BatchNormalization())
model.add(Conv1D(filters=20, kernel_size=5, padding='same', activation='relu'))
model.add(MaxPool1D(pool_size=5))
model.add(LSTM(units=512, return_sequences=False, dropout=0.2))
model.add(Dense(units=1, activation='sigmoid'))
optimizer = Adam(learning_rate=0.0001)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X_train, y_train, validation_split=0.2, epochs=40, batch_size=512)

As we can see, first an embedding layer was added to represent the API calls numerically; it maps categorical inputs to continuous vector representations that are learned during training. Afterwards, the data was normalized and the convolutional layer was added. It performs a 1-dimensional convolution on the underlying data and is suited to sequential inputs, which is definitely the case here. The Conv1D layer applies a collection of filters to the sequence, each filter covering a certain length of the input. The output of the convolution operation is a set of feature maps that identify local patterns in the input sequence. It is followed by a MaxPool1D layer, which reduces the size of the feature maps by simply keeping the most important feature in each window. Since the data is sequential and consecutive API calls can be related to each other, an LSTM layer is added, followed by a Dense layer with parameters 1 and ‘sigmoid’, which indicates that we want the output to be a probability.

The training process is shown in the plot below, so that it can be seen how consecutive epochs contributed to training the model:

fig, ax = plt.subplots(1, 2, figsize=[12, 6])
ax[0].plot(history.history["loss"])
ax[0].plot(history.history["val_loss"])
ax[0].set_title("Loss")
ax[0].legend(("Training", "Validation"), loc="upper right")
ax[0].set_xlabel("Epochs")
ax[1].plot(history.history["accuracy"])
ax[1].plot(history.history["val_accuracy"])
ax[1].legend(("Training", "Validation"), loc="lower right")
ax[1].set_title("Accuracy")
ax[1].set_xlabel("Epochs")

It can be noticed that from epoch 15 onward, the training process starts giving consistent results. There should not be a problem with overfitting, since the validation and training scores are quite close to each other. Overall, it looks quite promising, since the loss on the validation set is below 0.05 and the accuracy is above 98%. Let’s now see the confusion matrix to assess the predictions.

y_pred = model.predict(X_test)      # predicted probabilities
pred = np.where(y_pred > 0.5, 1, 0) # hard labels at a 0.5 cut-off
print("CNN_LSTM model classification report: \n\n{}".format(
    classification_report(np.array(y_test), pred.flatten())))
ax = skplt.metrics.plot_confusion_matrix(y_test, pred, figsize=(8, 7))
tickx = ax.set_xticklabels(['Benign', 'Malware'])
ticky = ax.set_yticklabels(['Benign', 'Malware'])

The matrix looks promising. With 99% Specificity, 73% Sensitivity and a Balanced Accuracy of 86%, it is the best model so far. Let’s now take a look at the ROC curve.

from sklearn.metrics import roc_curve
import matplotlib.pyplot as plt

# y_test: true labels of the data
# y_pred: predicted probabilities of the positive class
fpr, tpr, thresholds = roc_curve(y_test, y_pred)

# plot the ROC curve
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.show()

from sklearn.metrics import roc_auc_score

# calculate the AUC score
auc = roc_auc_score(y_test, y_pred)
print('AUC: %.3f' % auc)

The AUC of 0.983 and the shape of the curve look very good; the ROC curve is getting close to the perfect shape.

Now, a two-layer model will be trained in order to see if it can further improve the results. It will be very similar to the previous one, but it will include one additional convolutional layer.

model = Sequential(name="Cnn-Lstm_model")
model.add(Embedding(input_dim=unique_api_calls, output_dim=8,
                    input_length=X_train.shape[1], name='layer_embedding'))
model.add(BatchNormalization())
model.add(Conv1D(filters=32, kernel_size=3, padding='valid', activation='relu'))
model.add(Dropout(0.2))
model.add(Conv1D(filters=32, kernel_size=3, padding='valid', activation='relu'))
model.add(MaxPool1D(pool_size=2))
model.add(LSTM(units=512, return_sequences=False, dropout=0.2))
model.add(Dense(units=1, activation='sigmoid'))
optimizer = Adam(learning_rate=0.0001)
model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
history = model.fit(X_train, y_train, validation_split=0.2, epochs=30, batch_size=512)

fig, ax = plt.subplots(1, 2, figsize=[12, 6])
ax[0].plot(history.history["loss"])
ax[0].plot(history.history["val_loss"])
ax[0].set_title("Loss")
ax[0].legend(("Training", "Validation"), loc="upper right")
ax[0].set_xlabel("Epochs")
ax[1].plot(history.history["accuracy"])
ax[1].plot(history.history["val_accuracy"])
ax[1].legend(("Training", "Validation"), loc="lower right")
ax[1].set_title("Accuracy")
ax[1].set_xlabel("Epochs")

From epoch 15 onward, the training process starts giving consistent results. Again, there is no problem with overfitting here. Both the loss and accuracy plots look very promising. Let’s now see what the confusion matrix and ROC curve look like.

y_pred = model.predict(X_test)
pred = np.where(y_pred > 0.5, 1, 0)
ax = skplt.metrics.plot_confusion_matrix(y_test, pred, figsize=(8, 7))
tickx = ax.set_xticklabels(['Benign', 'Malware'])
ticky = ax.set_yticklabels(['Benign', 'Malware'])

The confusion matrix looks very good. With a Balanced Accuracy of approximately 88%, 75% Sensitivity and 99% Specificity, this is the best model developed so far. Also, the ROC curve has an almost perfect shape:

Evaluation


Over the course of the analysis, three different algorithms were applied to enable the prediction of malware. Of all of them, the CNN with an LSTM layer, which takes advantage of the relationship between consecutive calls, provided the best fit to the underlying data. The overall accuracy is satisfying, which means that the model is able to detect malware activity automatically. The best model classifies 99% of the observations into the corresponding group, which is a great score. Even when Sensitivity and Specificity are considered, the model’s performance is still impressive.


Conclusions

Based on the evidence presented above, the analysis can be concluded with the following findings:

  1. There may exist crucial parts of a software’s execution that suggest whether the software is malicious or not.
  2. The CNN with an LSTM layer turned out to be the best method for malware detection.
  3. Given the nature of this dataset, it is important to maximize Balanced Accuracy rather than raw Accuracy.

Possible improvements

  1. A different approach to sampling the data so that it better represents the real-world malware-to-goodware proportion.
  2. Further hyperparameter tuning.
  3. Using more advanced techniques.
  4. Adding more features to the dataset.
  5. Using a cost matrix to optimize the cut-off threshold for the 0-1 classification in a way that maximizes Balanced Accuracy (see the sketch below).
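To illustrate the last point, here is a minimal sketch of scanning cut-off thresholds to maximize balanced accuracy on held-out predictions (best_cutoff is a hypothetical helper, not part of the analysis above):

best_cutoff <- function(probs, labels) {
  cuts <- seq(0.05, 0.95, by = 0.01)
  bacc <- sapply(cuts, function(k) {
    pred <- ifelse(probs > k, 1, 0)
    sens <- sum(pred == 1 & labels == 1) / sum(labels == 1) # true positive rate
    spec <- sum(pred == 0 & labels == 0) / sum(labels == 0) # true negative rate
    (sens + spec) / 2                                       # balanced accuracy
  })
  cuts[which.max(bacc)]
}
# e.g. best_cutoff(datausa.pred.test.gbm, data.test$malware)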