Goal

Reduce the number of False Positives generated by the NN classifier by using a client profile.

Approach for training the classifier

Features

The client profile was generated using the following features:

CURRENT TIME WINDOW = 1H

  1. Total number of requests

  2. Ratio of NX requests (answer=NX)

  3. Ratio of MX requests (query_type="MX")

  4. Ratio of reverse requests (answer=in.arpa)

  5. Ratio of failed requests (answer=SERVFAIL)

  6. Ratio of the total number of domains that resolved to more than one IP (for each query, count the number of distinct IP addresses in the answer_ip field and sum the counts)

  7. Ratio of the average number of domains that resolved to more than one IP (take the average over all queries that have distinct IP addresses; in other words, average the counts instead of summing them)

  8. Ratio of the total number of IPs that correspond to more than one domain name (for each answer_ip value, count the number of distinct queries and sum the counts)

  9. Ratio of the average number of IPs that correspond to more than one domain name (take the average over all answer_ip values that have distinct queries; in other words, average the counts instead of summing them)

Note: in all cases, the ratio is calculated over the total number of requests (feature 1).
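
The feature definitions above can be sketched in base R as follows. This is only an illustration under stated assumptions, not the code used to build the dataset: the helper client_profile is hypothetical, the input is assumed to hold the requests of one client for one time window with columns query, query_type, answer and answer_ip, and the mapping of features 6 to 9 onto the names ratio_tot_samedomain, ratio_avg_samedomain, ratio_tot_sameip and ratio_avg_sameip is a guess based on the variable names that appear later in the report.

```r
# Hypothetical helper: build one client profile from the DNS requests of a
# single client inside one time window. Column names are assumptions.
client_profile <- function(req) {
  tot <- nrow(req)

  # Distinct answer IPs per queried domain (NA answers ignored),
  # keeping only domains that resolved to more than one IP.
  ips_per_domain <- tapply(req$answer_ip, req$query,
                           function(x) length(unique(x[!is.na(x)])))
  ips_per_domain <- ips_per_domain[ips_per_domain > 1]

  # Distinct queried domains per answer IP (requests without an IP are
  # dropped by tapply), keeping only IPs shared by more than one domain.
  domains_per_ip <- tapply(req$query, req$answer_ip,
                           function(x) length(unique(x)))
  domains_per_ip <- domains_per_ip[domains_per_ip > 1]

  mean_or_zero <- function(x) if (length(x) == 0) 0 else mean(x)

  data.frame(
    tot_requests         = tot,
    ratio_nx             = sum(req$answer == "NX", na.rm = TRUE) / tot,
    ratio_mx             = sum(req$query_type == "MX", na.rm = TRUE) / tot,
    ratio_reverse        = sum(req$answer == "in.arpa", na.rm = TRUE) / tot,
    ratio_fail           = sum(req$answer == "SERVFAIL", na.rm = TRUE) / tot,
    ratio_tot_samedomain = sum(ips_per_domain) / tot,
    ratio_avg_samedomain = mean_or_zero(ips_per_domain) / tot,
    ratio_tot_sameip     = sum(domains_per_ip) / tot,
    ratio_avg_sameip     = mean_or_zero(domains_per_ip) / tot
  )
}

# Toy requests for one client (made-up values, for illustration only)
req <- data.frame(
  query      = c("a.com", "a.com", "a.com", "b.com", "b.com", "c.com", "d.com", "e.com"),
  query_type = c("A", "A", "A", "A", "A", "MX", "A", "A"),
  answer     = c("OK", "OK", "OK", "OK", "OK", "OK", "NX", "SERVFAIL"),
  answer_ip  = c("1.1.1.1", "2.2.2.2", "3.3.3.3", "1.1.1.1", "4.4.4.4", "5.5.5.5", NA, NA),
  stringsAsFactors = FALSE
)
p <- client_profile(req)
print(p)
```

In the toy data, a.com resolves to three IPs and b.com to two, so the total is 5 and the average is 2.5 before dividing by the 8 requests.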

Training approach

The training dataset contains only those client profiles labeled as Normal with at least one request detected by the NN, that is, the profiles that contain False Positives. The resulting training dataset contains 225 client profiles.

The table below shows the GroundTruthLabel (Label) and the NN label (NN), where profiles with at least one NN detection are labeled as DGA. For simplicity, a third label called profile is added, aggregating all GroundTruthLabels other than Normal.

Splitting into training and testing sets

We used a 2x5 cross-validation approach (5 folds, repeated twice). The ROC metric was used to select the best model.

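# repeatedcv with number=5 and repeats=2 gives the 2x5 cross-validation;
# with method="cv" the repeats argument is ignored
ctrl_fast <- trainControl(method="repeatedcv", 
                     repeats=2,
                     number=5, 
                     summaryFunction=twoClassSummary,
                     verboseIter=TRUE,
                     classProbs=TRUE,
                     allowParallel = TRUE)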

70% of the dataset was used for training and 30% for testing.

Below we show the distribution of the profile labels in the two resulting datasets (training and testing).
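
The report does not show how the 70/30 split was produced. A minimal base R sketch of a stratified split, which keeps the class proportions similar in both sets, could look like the following. The helper stratified_split and the toy label vector are hypothetical; caret's createDataPartition offers a comparable stratified split, but whether it was used here is not stated.

```r
# Hypothetical helper: return the row indices of a stratified training sample.
# Each class contributes round(p * class_size) rows.
stratified_split <- function(labels, p = 0.7) {
  idx_by_class <- split(seq_along(labels), labels)
  train_idx <- unlist(lapply(idx_by_class, function(idx) {
    # sample.int on the length avoids the sample() pitfall for length-1 input
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  }), use.names = FALSE)
  sort(train_idx)
}

# Toy labels (made-up class sizes, for illustration only)
set.seed(1)
labels    <- rep(c("Normal", "Malicious"), times = c(100, 50))
train_idx <- stratified_split(labels, p = 0.7)
test_idx  <- setdiff(seq_along(labels), train_idx)

print(table(labels[train_idx]))
print(table(labels[test_idx]))
```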

data_train %>% group_by(profile) %>% summarise(total=n())
data_test %>% group_by(profile) %>% summarise(total=n())

A randomForest classifier is trained using the ROC metric to select the best model. Prior to training, the training dataset was randomly upsampled.

ctrl_fast$sampling<-"up"
rfFit<- train(train_formula,
               data = data_train,
#               method = "svmRadialWeights",   # Radial kernel
               method = "rf",
               tuneLength = 5,
               #tuneGrid = svmGrid,
               #preProcess=c("scale","center"),
               metric="ROC",
               #weights = model_weights,
               trControl = ctrl_fast)

rfFit
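
To make the random upsampling step concrete, here is a minimal base R sketch of the idea: rows of the minority class are resampled with replacement until every class has as many rows as the largest one. The helper upsample and the toy data frame are hypothetical. With caret, setting the sampling option on the control object asks caret to do this itself as part of the resampling procedure, so this function is not part of the original pipeline.

```r
# Hypothetical helper: random upsampling of every class to the size of the
# largest class, by drawing extra rows with replacement.
upsample <- function(df, label_col) {
  idx_by_class <- split(seq_len(nrow(df)), df[[label_col]])
  target <- max(lengths(idx_by_class))
  keep <- unlist(lapply(idx_by_class, function(idx) {
    extra <- idx[sample.int(length(idx), size = target - length(idx), replace = TRUE)]
    c(idx, extra)  # original rows plus the resampled ones
  }), use.names = FALSE)
  df[keep, , drop = FALSE]
}

# Toy data (made-up values, for illustration only)
set.seed(1)
toy <- data.frame(
  tot_requests = c(10, 12, 9, 30, 25, 11, 14),
  profile      = c("Normal", "Normal", "Normal", "Malicious", "Malicious", "Normal", "Normal"),
  stringsAsFactors = FALSE
)
toy_up <- upsample(toy, "profile")
print(table(toy_up$profile))
```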

The out-of-bag (OOB) confusion matrix for the resulting model on the upsampled training set is shown below.

rfFit$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 2

        OOB estimate of  error rate: 4.04%
Confusion matrix:
          Malicious Normal class.error
Malicious        94      5  0.05050505
Normal            3     96  0.03030303
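
As a quick check, the OOB error rate and the per-class errors can be recomputed from the four counts printed above. The sketch below only re-enters those counts by hand; it assumes, as in the randomForest output, that rows are the true classes and columns the predicted ones.

```r
# Counts copied from the OOB confusion matrix above
# (rows = true class, columns = predicted class)
oob_cm <- matrix(c(94, 5,
                    3, 96),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(actual    = c("Malicious", "Normal"),
                                 predicted = c("Malicious", "Normal")))

# Per-class error: share of each true class that was misclassified
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

# Overall OOB error: all misclassified rows over all rows
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

print(round(class_error, 8))
print(round(100 * oob_error, 2))
```

The 8 misclassified rows out of 198 give the reported 4.04% error rate.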

Most relevant features

The list of the most relevant features used by the RandomForest classifier:

varImp(rfFit, scale = F)
rf variable importance

                     Overall
tot_requests          19.030
ratio_avg_samedomain  18.481
ratio_detected        12.318
ratio_tot_samedomain   8.645
ratio_nx               8.596
ratio_reverse          8.122
ratio_fail             6.902
ratio_tot_sameip       4.929
ratio_avg_sameip       4.738
ratio_mx               4.383

ROC Curve Analysis:

The resulting ROC curve shows that a threshold between 0.8 and 0.9 is the best option for correctly classifying most of the Normal profiles (i.e. True Negatives, since Malicious is the positive class) while keeping a good True Positive rate (i.e. profiles labeled as Malicious).

#plot(roc(data_test$profile,predsrprofilerobsamp$Malicious))
ggplot(cbind(predsrprofilerobsamp,class=data_test$profile), 
       aes(m = Normal, d = factor(class, labels=c("Normal","Malicious"),levels = c("Normal", "Malicious")))) + 
    geom_roc(hjust = -0.4, vjust = 1.5,colour='orange') + 
  theme_bw()
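
The threshold trade-off read from the ROC curve can also be tabulated directly. The following base R sketch sweeps a few thresholds over the predicted Malicious probability and reports the True Positive and False Positive rates at each one. The helper threshold_sweep and the toy probabilities are hypothetical and do not reproduce the values behind the plot.

```r
# Hypothetical helper: TPR and FPR of the rule
# "predict Malicious when prob_malicious >= threshold" for several thresholds.
threshold_sweep <- function(prob_malicious, truth, thresholds) {
  do.call(rbind, lapply(thresholds, function(t) {
    pred <- ifelse(prob_malicious >= t, "Malicious", "Normal")
    data.frame(
      threshold = t,
      tpr = mean(pred[truth == "Malicious"] == "Malicious"),  # detected Malicious
      fpr = mean(pred[truth == "Normal"] == "Malicious")      # Normal flagged as Malicious
    )
  }))
}

# Toy predictions (made-up values, for illustration only)
prob  <- c(0.95, 0.85, 0.40, 0.60, 0.20, 0.10)
truth <- c("Malicious", "Malicious", "Malicious", "Normal", "Normal", "Normal")
sweep <- threshold_sweep(prob, truth, thresholds = c(0.5, 0.7, 0.9))
print(sweep)
```

Raising the threshold lowers the False Positive rate first and the True Positive rate later, which is the behavior used to pick a value in the 0.8 to 0.9 range.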

Evaluation on Testing File

The Confusion matrix for the resulting model on the Testing set is shown below:

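predsrprofilerobsamp=predict(rfFit,data_test,type='prob')
# Label a profile as Malicious only when its predicted probability reaches the 0.9 threshold
predsrfsamp=ifelse(predsrprofilerobsamp$Malicious >=0.9,'Malicious','Normal')
# confusionMatrix() expects factors with the same levels as the reference
predsrfsamp=factor(predsrfsamp,levels=c('Malicious','Normal'))
cm<-confusionMatrix(predsrfsamp,data_test$profile,positive="Malicious")
cm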
Confusion Matrix and Statistics

           Reference
Prediction  Malicious Normal
  Malicious        20      0
  Normal            5     42
                                          
               Accuracy : 0.9254          
                 95% CI : (0.8344, 0.9753)
    No Information Rate : 0.6269          
    P-Value [Acc > NIR] : 2.134e-08       
                                          
                  Kappa : 0.8337          
 Mcnemar's Test P-Value : 0.07364         
                                          
            Sensitivity : 0.8000          
            Specificity : 1.0000          
         Pos Pred Value : 1.0000          
         Neg Pred Value : 0.8936          
             Prevalence : 0.3731          
         Detection Rate : 0.2985          
   Detection Prevalence : 0.2985          
      Balanced Accuracy : 0.9000          
                                          
       'Positive' Class : Malicious       
                                          

With a specificity of 1, all 42 Normal profiles were correctly classified. The sensitivity, however, was 0.8, which means that 5 of the 25 Malicious profiles were not detected.
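
The headline statistics can be recomputed from the four cells of the test confusion matrix, which makes the definitions explicit. The sketch below simply re-enters the printed counts, with Malicious as the positive class.

```r
# Counts copied from the test confusion matrix above (positive class: Malicious)
tp <- 20  # Malicious predicted as Malicious
fp <- 0   # Normal predicted as Malicious
fn <- 5   # Malicious predicted as Normal
tn <- 42  # Normal predicted as Normal

sensitivity <- tp / (tp + fn)                   # share of Malicious profiles detected
specificity <- tn / (tn + fp)                   # share of Normal profiles kept as Normal
ppv         <- tp / (tp + fp)                   # Pos Pred Value
npv         <- tn / (tn + fn)                   # Neg Pred Value
accuracy    <- (tp + tn) / (tp + fp + fn + tn)
balanced    <- (sensitivity + specificity) / 2

print(round(c(sensitivity = sensitivity, specificity = specificity,
              ppv = ppv, npv = npv,
              accuracy = accuracy, balanced = balanced), 4))
```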

Distribution of False Positives and True Negatives

data_test_predic<-cbind(data_test,profilepredclass=predsrfsamp,profilepredprob=predsrprofilerobsamp)
# Normal profiles wrongly predicted as Malicious (False Positives)
profile_incorrect<-data_test_predic %>% filter(profile=='Normal' & profilepredclass=='Malicious')
# Normal profiles correctly predicted as Normal (True Negatives, since Malicious is the positive class)
profile_correct<-data_test_predic %>% filter(profile=='Normal' & profilepredclass=='Normal')
histogram(~ratio_avg_samedomain,data=profile_correct,main="Distribution of Avg_samedomain for True negatives ")

#histogram(~ratio_avg_samedomain,data=profile_incorrect,main="Distribution of Avg_samedomain for False positives ")
histogram(~tot_requests,data=profile_correct,main="Distribution of Total Requests for True negatives ")

#histogram(~tot_requests,data=profile_incorrect,main="Distribution of Total Requests for False positives")

Per request evaluation

Despite the good results of the client profile classifier shown in the previous sections, we need to analyze it in the context of individual DNS requests. Therefore, a per-request evaluation is performed. The approach followed for this evaluation was the following:

  1. We considered only the profiles in the test set.
  2. For each profile, we selected all of its DNS requests and labeled them accordingly.
  3. For a profile labeled as Normal, all DNS requests are considered Normal.
  4. For a profile labeled as DGA, we label only those requests labeled as DGA by the NN. The remaining requests are not labeled and are considered Background.

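# Sanity check: the number of labeled requests per profile should match tot_requests in the test set
a<-request_labeled %>% group_by(client,profilenum) %>% summarise(n=n())
b<-data_test %>% group_by(client,profilenum) %>% summarise(n=sum(tot_requests))
# Rows listed here are profiles whose two counts disagree
as.data.frame(c(a,b)) %>% filter(n!=n.1)

# Requests in legit_labeled flagged as DGA by the NN
legit_labeled %>% filter(dga.class==1)
# NN-flagged requests whose ground truth is Normal, counted by predicted profile class
request_labeled %>% filter(dga.class==1  & GroundTruthLabel=='Normal') %>% select(profilepredclass) %>% group_by(profilepredclass) %>% summarise(n=n())


# The same NN-flagged Normal requests, counted per client profile
request_labeled %>% filter(dga.class==1  & GroundTruthLabel=='Normal')  %>% group_by(client,profilenum) %>% summarise(n=n())
dga_data_test %>% filter(profilenum==0)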
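
The labeling rules in steps 3 and 4 above can be written as a small function. This is a minimal sketch, not the code that built request_labeled: the helper label_requests and the column names profile_label and nn_dga are hypothetical, and unlabeled Background requests are represented here as NA.

```r
# Hypothetical helper: derive a per-request label from the label of the
# owning client profile and the NN decision for that request.
#   profile_label: "Normal" or "DGA" (label of the client profile)
#   nn_dga:        1 if the NN flagged the request as DGA, 0 otherwise
label_requests <- function(req) {
  ifelse(req$profile_label == "Normal",
         "Normal",                           # step 3: every request of a Normal profile
         ifelse(req$nn_dga == 1,
                "DGA",                       # step 4: NN-flagged requests of a DGA profile
                NA_character_))              # step 4: the rest stays unlabeled (Background)
}

# Toy requests (made-up values, for illustration only)
toy_req <- data.frame(
  profile_label = c("Normal", "Normal", "DGA", "DGA"),
  nn_dga        = c(1, 0, 1, 0),
  stringsAsFactors = FALSE
)
toy_req$label <- label_requests(toy_req)
print(toy_req)
```

Note that a Normal profile keeps the Normal label even for requests the NN flagged, which is what turns those requests into NN False Positives in the analysis that follows.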

Number of labeled requests: 7688. The labels of the requests are distributed according to the table below.

nlabels<-request_labeled%>% filter(!is.na(GroundTruthLabel)) %>% group_by(GroundTruthLabel) %>% summarise(total=n()) 
nlabels

As can be seen, a total of 7609 Normal requests are available, while the total of Not Normal requests is 79.

NN per request Analysis

The distribution of the GroundTruthLabel for requests classified as DGA by the NN is shown in the table below.

nndetected<-request_labeled %>%  filter(!is.na(GroundTruthLabel) & dga.class==1) %>% group_by(GroundTruthLabel, dga.class) %>% summarise(nntotal=n()) 
nndetected

In this case, the total number of False Positive requests from the NN classifier is 3031.

Client Profile per request Analysis

The distribution of the GroundTruthLabel for requests classified as DGA by the profile is shown in the table below.

profiledetected<-request_labeled %>%  filter(!is.na(GroundTruthLabel) & profilepredclass=='Malicious') %>% group_by(GroundTruthLabel, profilepredclass) %>% summarise(profiletotal=n()) 
profiledetected

In this case, the total number of False Positive requests from the profile classifier is 2655. That means the profile has reduced the number of False Positives by 376 (from 3031 to 2655, about 12.4%).
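
The reduction can be put in perspective with a few lines of arithmetic on the counts reported in this section. The variable names below are only illustrative; the numbers are the ones stated in the text.

```r
# Counts taken from the text of this section
normal_total <- 7609  # Normal requests in the labeled test data
fp_nn        <- 3031  # Normal requests flagged as DGA by the NN
fp_profile   <- 2655  # Normal requests flagged after applying the client profile

fp_removed         <- fp_nn - fp_profile        # absolute reduction
relative_reduction <- fp_removed / fp_nn        # share of NN False Positives removed
fp_rate_nn         <- fp_nn / normal_total      # False Positive rate of the NN alone
fp_rate_profile    <- fp_profile / normal_total # False Positive rate with the profile

print(fp_removed)
print(round(c(relative_reduction = relative_reduction,
              fp_rate_nn = fp_rate_nn,
              fp_rate_profile = fp_rate_profile), 3))
```

In other words, the per-request False Positive rate over Normal requests drops from roughly 39.8% to roughly 34.9%.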

These results are shown in the next bar plot. Gray is the total number of Normal requests, blue is the number of Normal requests detected as DGA by the NN, and orange is the number of requests classified as Malicious according to the client profile.

results<-cbind(nlabels,profiletotal=profiledetected$profiletotal,nntotal=nndetected$nntotal)
results<-results %>% mutate(GroundTruthLabel=ifelse(GroundTruthLabel=='Normal','Normal','Not Normal'))
ggplot(as.data.frame(results))+
    geom_col(aes(x=GroundTruthLabel,y=total),fill='lightgray')+
    geom_col(aes(x=GroundTruthLabel,y=nntotal),fill='skyblue',alpha=0.5)+
    geom_col(aes(x=GroundTruthLabel,y=profiletotal),fill='orange',alpha=0.7)+
    theme_bw()

Conclusions

Saving the model

