library(dplyr)
library(readr)
library(ggplot2)
The CACIC service uses the default architecture from the Bitbucket repo.
The dataset contains 3465 domain names. These domains were obtained from three datasets sent by Vaclav:
We merged the datasets and removed duplicates. The final dataset can be found here.
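The merge and de-duplication step described above can be sketched as follows. This is a minimal illustration, not the original preprocessing code: the three data frames (`set_a`, `set_b`, `set_c`) and their `domain` and `label` columns are hypothetical stand-ins for the real inputs.

```r
library(dplyr)

# Hypothetical stand-ins for the three source datasets (assumed columns)
set_a <- data.frame(domain = c("abc.org", "xyz.hosting"),
                    label  = c("dga", "dga"), stringsAsFactors = FALSE)
set_b <- data.frame(domain = c("xyz.hosting", "example.org"),
                    label  = c("dga", "normal"), stringsAsFactors = FALSE)
set_c <- data.frame(domain = c("qwe.feedback", "abc.org"),
                    label  = c("dga", "dga"), stringsAsFactors = FALSE)

# Stack the datasets, then keep the first row seen for each domain name
merged <- bind_rows(set_a, set_b, set_c) %>%
  distinct(domain, .keep_all = TRUE)

nrow(merged)
```

Using `distinct()` on `domain` alone assumes that a domain carries the same label in every source; if labels could conflict, those rows would need to be inspected before dropping duplicates.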
caret::confusionMatrix(as.factor(vaclav_results$class),as.factor(vaclav_results$label))
Confusion Matrix and Statistics

          Reference
Prediction  dga normal
    dga    2654     15
    normal  470    326

               Accuracy : 0.86
                 95% CI : (0.848, 0.8714)
    No Information Rate : 0.9016
    P-Value [Acc > NIR] : 1

                  Kappa : 0.5053
 Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.8496
            Specificity : 0.9560
         Pos Pred Value : 0.9944
         Neg Pred Value : 0.4095
             Prevalence : 0.9016
         Detection Rate : 0.7659
   Detection Prevalence : 0.7703
      Balanced Accuracy : 0.9028

       'Positive' Class : dga
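As a cross-check, the main statistics in the output above can be recomputed by hand from the four cell counts. The sketch below uses only the counts printed in the matrix, with `dga` as the positive class; the variable names are chosen here for illustration.

```r
# Cell counts copied from the confusion matrix (rows = prediction, columns = reference)
tp <- 2654  # predicted dga,    reference dga
fp <- 15    # predicted dga,    reference normal
fn <- 470   # predicted normal, reference dga
tn <- 326   # predicted normal, reference normal
total <- tp + fp + fn + tn

accuracy          <- (tp + tn) / total
sensitivity       <- tp / (tp + fn)   # share of reference dga that was detected
specificity       <- tn / (tn + fp)   # share of reference normal left unflagged
pos_pred_value    <- tp / (tp + fp)
neg_pred_value    <- tn / (tn + fn)
prevalence        <- (tp + fn) / total
balanced_accuracy <- (sensitivity + specificity) / 2

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, pos_pred_value = pos_pred_value,
        neg_pred_value = neg_pred_value, prevalence = prevalence,
        balanced_accuracy = balanced_accuracy), 4)
```

Because about 90% of the reference labels are `dga`, a classifier that always answers `dga` would already reach the no-information rate of 0.9016. The observed accuracy of 0.86 is below that, which is why the accuracy-versus-NIR p-value is 1; balanced accuracy and sensitivity are the more informative figures for this dataset.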
FALSE NEGATIVES
Only 470 of the 3124 domains labeled as DGA by the original NN service (about 15%) were not detected by the CACIC web service.
plotly::plot_ly(pca_data , type="scatter3d",
x = ~PC1, y = ~PC2, z = ~PC3, color = ~label,
colors = c('#BF382A', '#0C4B8E'),
opacity=0.5, marker = list(size = 2),text = ~paste(preds," ",domain))
No scatter3d mode specifed:
Setting the mode to markers
Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
We can observe three clusters corresponding to the .hosting, .org and .feedback TLDs. TPs and FNs appear in all three clusters. The table below shows the percentage of FNs for each cluster. The .org cluster has the highest FN rate at 23%, followed by .hosting at 18% and .feedback at 3%.
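A per-TLD false-negative percentage like the one quoted above could be computed along these lines. This is a sketch on hypothetical toy data, not the code that produced the reported figures; it assumes a `label` column with the reference class and a `class` column with the web-service prediction, as in the `confusionMatrix` call earlier.

```r
library(dplyr)

# Hypothetical toy data: every domain is a reference DGA
toy <- data.frame(
  domain = c("aaa.org", "bbb.org", "ccc.hosting",
             "ddd.hosting", "eee.feedback", "fff.feedback"),
  label  = "dga",
  class  = c("normal", "dga", "dga", "dga", "dga", "normal"),
  stringsAsFactors = FALSE
)

fn_by_tld <- toy %>%
  filter(label == "dga") %>%                    # only reference DGAs can be FN
  mutate(tld = sub(".*\\.", "", domain)) %>%    # text after the last dot
  group_by(tld) %>%
  summarise(total  = n(),
            fn     = sum(class == "normal"),    # DGAs predicted as normal
            fn_pct = 100 * fn / total)

fn_by_tld
```

Extracting the TLD with `sub()` rather than matching it anywhere in the name avoids counting a domain such as `hosting123.org` under `.hosting`.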
# Plot the relative character frequency of FN and TP domains matching tld_name.
# Assumes fn_domains and tp_domains are data frames with a `domain` column.
char_dist <- function(tld_name) {
  # Split every matching domain into single characters (base R, no stringr needed)
  split_chars <- function(df) {
    domains <- as.character(df$domain)
    unlist(strsplit(domains[grepl(tld_name, domains)], ""))
  }
  charlist_FN <- split_chars(fn_domains)
  charlist_TP <- split_chars(tp_domains)
  # Bar chart of relative frequencies; after_stat() replaces the deprecated ..count..
  freq_plot <- function(chars, fill) {
    ggplot(data.frame(charlist = chars), aes(x = charlist)) +
      geom_bar(col = "black", fill = fill, aes(y = after_stat(count / sum(count)))) +
      scale_y_continuous(labels = scales::percent) + ylab("Percent") + xlab("") +
      theme_bw()
  }
  tpplot <- freq_plot(charlist_TP, "white")
  fnplot <- freq_plot(charlist_FN, "black")
  # FN histogram on top, TP histogram below
  gridExtra::grid.arrange(fnplot, tpplot, ncol = 1)
  return(list(fnplot = fnplot, tpplot = tpplot))
}
The Character Frequency Histogram for TP (white) and FN (black)
hosting_plot <- char_dist("hosting")
The Character Frequency Histogram for TP (white) and FN (black)
feedback_plot <- char_dist("feedback")
No significant differences are observable between the FN and TP histograms for the three TLDs considered. A possible solution could be to include a portion of the undetected DGA domains in the dataset and retrain the model.
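One way to set up such a retraining dataset is sketched below. Everything here is an assumption for illustration: `fn_toy` and `train_toy` are hypothetical stand-ins for the undetected DGA domains and the existing training set, and `retrain_fraction` is an arbitrary choice. The remaining undetected domains are kept aside so the retrained model can still be evaluated on examples it has not seen.

```r
library(dplyr)

set.seed(42)  # make the random sample reproducible

# Hypothetical stand-ins for the undetected DGAs and the current training set
fn_toy <- data.frame(domain = paste0("fn", 1:10, ".org"),
                     label  = "dga", stringsAsFactors = FALSE)
train_toy <- data.frame(domain = c("example.org", "abcxyz.hosting"),
                        label  = c("normal", "dga"), stringsAsFactors = FALSE)

retrain_fraction <- 0.5  # hypothetical share of FN domains to add to training

# Draw a random portion of the undetected DGAs for retraining
fn_sample <- fn_toy %>% slice_sample(prop = retrain_fraction)

# Keep the rest out of training so it can serve as a held-out check
fn_held_out <- anti_join(fn_toy, fn_sample, by = "domain")

augmented_train <- bind_rows(train_toy, fn_sample)
```

Sampling within each TLD cluster (for example with `group_by()` before `slice_sample()`) might be preferable, since the FN rates differ across the three clusters.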
tensorboard("logs")