Highlights

Results (best model, Random Forest, on the test set):

Accuracy : 0.9486
Sensitivity : 0.9570
Specificity : 0.9403

Raw Datasets

Two Datasets:

  1. 40,000 samples (50% legit, 50% DGA) from Andre Waeva: https://github.com/andrewaeva/DGA
##               host       domain class subclass
## 1 busybeetools.com busybeetools legit    legit
## 2     caoliu70.com     caoliu70 legit    legit
## 3        uwsc.info         uwsc legit    legit
## 4         orash.ir        orash legit    legit
## 5         spsp.org         spsp legit    legit
## 6        snoco.org        snoco legit    legit

Dataset preprocessing

The following features were created using Jay Jacobs' approach:

  1. Onegram
  2. Twogram
  3. Threegram
  4. Fourgram
  5. Fivegram
  6. 3,4,5 grams
  7. Entropy
  8. Length
  9. Dict Freq:
##    onegram   twogram threegram fourgram fivegram   gram345  entropy length
## 1 47.95233 29.638694 13.629354 3.260071 1.414973 18.304399 2.918296     12
## 2 29.13033 13.219833  4.004321 0.000000 0.000000  4.004321 3.000000      8
## 3 15.20832  6.474508  0.000000 0.000000 0.000000  0.000000 2.000000      4
## 4 20.40532 12.525872  6.114473 0.000000 0.000000  6.114473 2.321928      5
## 5 15.77712  7.802046  0.000000 0.000000 0.000000  0.000000 1.000000      4
## 6 20.41921 10.550798  3.984977 0.000000 0.000000  3.984977 1.921928      5
##    dict
## 1 1.000
## 2 0.875
## 3 0.750
## 4 1.000
## 5 1.000
## 6 1.000
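
The entropy and length columns above can be reproduced from the domain string alone. The sketch below assumes that "entropy" means the Shannon entropy (in bits) of the character distribution of the domain, which matches the six rows shown. The function name char_entropy is ours, not the author's. The n-gram and dictionary features are left out because they depend on reference corpora that are not shown here.

```r
# Shannon entropy (in bits) of the characters in a single string.
# Assumption: this is how the `entropy` column was derived.
char_entropy <- function(x) {
  chars <- strsplit(x, "", fixed = TRUE)[[1]]
  p <- as.numeric(table(chars)) / length(chars)  # relative character frequencies
  -sum(p * log2(p))
}

# The six domains printed in the sample rows above.
domains <- c("busybeetools", "caoliu70", "uwsc", "orash", "spsp", "snoco")

features <- data.frame(
  domain  = domains,
  entropy = vapply(domains, char_entropy, numeric(1)),
  length  = nchar(domains),
  row.names = NULL
)
print(features)
```

A domain made of distinct characters only (such as "uwsc") reaches the maximum entropy for its length, log2(length), while repeated characters (such as "spsp") lower it.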

Discriminative power of selected features:

Matrix scatter plot for features on Andre W. dataset.

Experiments on dataset from Andre W.

The dataset is split 80/20 into training and test sets.
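
The source does not show how the split was made. Since the test set is perfectly balanced (4000 domains per class in the confusion matrices), a stratified split is a plausible assumption; with caret this is typically done with createDataPartition. The base R sketch below illustrates the same idea on a hypothetical toy data frame (toy, train_toy and test_toy are illustrative names, not the author's objects).

```r
set.seed(42)  # arbitrary seed, only to make this sketch reproducible

# Hypothetical stand-in for the labelled data: 20 legit and 20 dga rows.
toy <- data.frame(
  domain = paste0("d", 1:40),
  class  = factor(rep(c("legit", "dga"), each = 20))
)

# Stratified split: sample 80% of the row indices within each class.
idx_by_class <- split(seq_len(nrow(toy)), toy$class)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.8 * length(idx)))
}))

train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Both parts keep the 50/50 class balance of the full data.
print(table(train_toy$class))
print(table(test_toy$class))
```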

Both datasets are used for building and testing four models:

  1. Logistic Regression (simple linear classifier)
  2. C4.5 Classification Tree (decision tree)
  3. SVM using a radial basis kernel (nonlinear classifier)
  4. Random Forest (bagged ensemble of decision trees)

Each model is evaluated using 10-fold cross-validation on the training set and then tested on the test set.

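# Validation method: 10-fold cross-validation, models selected by ROC
library(caret)
# `repeats` only applies to method = "repeatedcv", so it is omitted here
ctrl_fast <- trainControl(method = "cv",
                          number = 10,
                          summaryFunction = twoClassSummary,
                          verboseIter = TRUE,
                          classProbs = TRUE,
                          allowParallel = TRUE)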
# Random Forest
rfFit <- train(class ~ .,
               data = traindga,
               metric="ROC",
               method = "rf",
               trControl = ctrl_fast)
# SVM 
svmFit <- train(class ~ .,
                data = traindga,
                method = "svmRadial",
                preProc = c("center", "scale"),
                metric="ROC",
                tuneLength = 10,
                trControl = ctrl_fast)
# C4.5 (J48)
c45Fit <- train(class ~ .,
                data = traindga,
                method = "J48",
                metric="ROC",
                trControl = ctrl_fast)
# Logistic Regression
glmFit <- train(class ~ .,
                data = traindga,
                method = "glm",
                family=binomial(link='logit'),
                metric="ROC",
                trControl = ctrl_fast)

Results on the test set

Logistic Regression:

print(confusionMatrix(predsglm, testdga$class)) # the first factor level (dga) is the positive class
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  dga legit
##      dga   3400   373
##      legit  600  3627
##                                          
##                Accuracy : 0.8784         
##                  95% CI : (0.871, 0.8855)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.7568         
##  Mcnemar's Test P-Value : 4.317e-13      
##                                          
##             Sensitivity : 0.8500         
##             Specificity : 0.9067         
##          Pos Pred Value : 0.9011         
##          Neg Pred Value : 0.8581         
##              Prevalence : 0.5000         
##          Detection Rate : 0.4250         
##    Detection Prevalence : 0.4716         
##       Balanced Accuracy : 0.8784         
##                                          
##        'Positive' Class : dga            
## 
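
The statistics in this output follow directly from the four cell counts. As a worked check, the sketch below recomputes them from the logistic regression matrix above, taking "dga" as the positive class; the variable names are ours.

```r
# Counts copied from the logistic regression confusion matrix above
# (rows = prediction, columns = reference; "dga" is the positive class).
tp <- 3400; fp <- 373
fn <- 600;  tn <- 3627
n  <- tp + fp + fn + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # share of DGA domains that are detected
specificity <- tn / (tn + fp)   # share of legit domains that are kept
ppv         <- tp / (tp + fp)   # Pos Pred Value
npv         <- tn / (tn + fn)   # Neg Pred Value

# Cohen's kappa: agreement beyond what the marginal totals give by chance.
p_chance    <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv, npv = npv,
        kappa = cohen_kappa))
```

Up to rounding, the printed values agree with the Accuracy, Sensitivity, Specificity, Pos Pred Value, Neg Pred Value and Kappa rows reported above. The same arithmetic applies to the other three models.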

C4.5 classification tree:

confusionMatrix(predsc45, testdga$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  dga legit
##      dga   3808   261
##      legit  192  3739
##                                           
##                Accuracy : 0.9434          
##                  95% CI : (0.9381, 0.9483)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8868          
##  Mcnemar's Test P-Value : 0.001399        
##                                           
##             Sensitivity : 0.9520          
##             Specificity : 0.9347          
##          Pos Pred Value : 0.9359          
##          Neg Pred Value : 0.9512          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4760          
##    Detection Prevalence : 0.5086          
##       Balanced Accuracy : 0.9434          
##                                           
##        'Positive' Class : dga             
## 

Support Vector Machine using RBF kernel:

confusionMatrix(predssvm, testdga$class)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  dga legit
##      dga   3804   270
##      legit  196  3730
##                                           
##                Accuracy : 0.9418          
##                  95% CI : (0.9364, 0.9468)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8835          
##  Mcnemar's Test P-Value : 0.0007205       
##                                           
##             Sensitivity : 0.9510          
##             Specificity : 0.9325          
##          Pos Pred Value : 0.9337          
##          Neg Pred Value : 0.9501          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4755          
##    Detection Prevalence : 0.5092          
##       Balanced Accuracy : 0.9417          
##                                           
##        'Positive' Class : dga             
## 

Random Forest:

confusionMatrix(predsrf, testdga$class) # the first factor level (dga) is the positive class
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  dga legit
##      dga   3828   239
##      legit  172  3761
##                                           
##                Accuracy : 0.9486          
##                  95% CI : (0.9436, 0.9534)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8973          
##  Mcnemar's Test P-Value : 0.001132        
##                                           
##             Sensitivity : 0.9570          
##             Specificity : 0.9403          
##          Pos Pred Value : 0.9412          
##          Neg Pred Value : 0.9563          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4785          
##    Detection Prevalence : 0.5084          
##       Balanced Accuracy : 0.9486          
##                                           
##        'Positive' Class : dga             
## 

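Model performance comparison in terms of ROC, sensitivity and specificity

Sensitivity = recall = true positive rate (detection rate); 1 - specificity = false positive rate (false alarm rate). The "Detection Rate" row in the caret output above is a different quantity: true positives divided by all samples.

The tree-based models (C4.5 and Random Forest) seem to outperform SVM and logistic regression; however, the FPR is higher than expected (roughly 0.06 to 0.07). It can be reduced by increasing the probability threshold.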

Predicted DGA probability for legitimate domain names INCORRECTLY detected as DGA (the false positives)

print(probs)
##                                       host predsrfprobs$dga
## 1                 pestlemortarclothing.com            0.922
## 2                          kreuzfahrten.de            0.742
## 3                             ntpcgets.com            0.844
## 4                         lvpaiutegolf.com            0.552
## 5    singaporeanstocksinvestor.blogspot.sg            0.930
## 6                            ijeee-apm.com            0.822
## 7                          sacm-usa.gov.sa            0.982
## 8            thecenterforsalesstrategy.com            0.890
## 9                              aluaplia.es            0.750
## 10                               emg-ss.jp            0.864
## 11                            mp3http.mobi            0.664
## 12                             cencosud.cl            0.748
## 13                homesinsummervillesc.com            0.762
## 14                            maru-jan.com            0.844
## 15                            sinapsit.com            0.568
## 16                         domvnaem.com.ua            0.980
## 17          andaluciacompromisodigital.org            0.990
## 18                            iran-doc.com            0.508
## 19                              txzshc.com            0.766
## 20               invisibledisabilities.org            0.886
## 21                            mydrupal.com            0.944
## 22                           gubukbola.com            0.822
## 23               copelandsofneworleans.com            0.882
## 24             engkumuzahadin.blogspot.com            0.616
## 25                              zfdmkj.com            0.528
## 26                             ghayoumi.ir            0.518
## 27                               thzhd.com            0.880
## 28                             zaliczaj.pl            0.956
## 29                             go2jump.org            0.752
## 30                            lanrensc.com            0.500
## 31                            juweixin.com            0.956
## 32               tattooedjessicarabbit.net            0.964
## 33                          baixardvdr.com            0.582
## 34              malankaraorthodoxchurch.in            0.968
## 35                           lafsozluk.com            0.964
## 36              sustainablecitynetwork.com            0.684
## 37                            getmziki.com            0.886
## 38                           duoduoyin.com            0.604
## 39                     maydohuyetap.net.vn            0.862
## 40                         vne-konkursa.kz            0.518
## 41                               soqom.net            0.702
## 42                         baltykgdynia.pl            0.998
## 43                               cjgls.com            0.868
## 44                               db-pbc.pl            0.926
## 45                mapa-de-buenos-aires.com            0.506
## 46                               afpbb.com            0.632
## 47                            banehtak.com            0.742
## 48                                  www.kw            0.670
## 49                    lulouisa.blogspot.in            0.894
## 50                               sozdik.kz            0.756
## 51                vashikaranprediction.com            0.696
## 52                           luxusuhr24.de            0.782
## 53           junipergrovebooksolutions.com            0.830
## 54                adronhomesproperties.com            0.650
## 55                            voxxintl.com            0.840
## 56            kreissparkasse-nordhausen.de            0.974
## 57       clutterfreeclassroom.blogspot.com            0.830
## 58               reviewofcontactlenses.com            0.750
## 59                           cmp-rugby.com            0.946
## 60                            maji-ero.com            0.922
## 61                paginaswebenmedellin.com            0.888
## 62               strongerbraverfighter.com            0.836
## 63                              qwairh.net            0.568
## 64                          bizimvezne.com            0.920
## 65                              al7eah.net            0.536
## 66                            hushhush.com            0.690
## 67                designlimitededition.com            0.646
## 68                            dont-nod.com            0.710
## 69                            djpanaaz.com            0.944
## 70        howtomakeyourboobsgrowbigger.com            0.992
## 71                            docmosis.com            0.576
## 72             scd-coastcapitalsavings.com            0.616
## 73                            mahsafar.com            0.820
## 74                             ijdmtoy.com            0.766
## 75                               qqszc.com            0.582
## 76               quiltaddictsanonymous.com            0.980
## 77                indiannursingcouncil.org            0.594
## 78                              l0sh0ck.ru            0.870
## 79                         tm-modus.com.ua            0.748
## 80      rajasthanpatwari2015recruitment.in            0.998
## 81                            usmle-rx.com            0.804
## 82              radeberger-praemienwelt.de            0.966
## 83                            mesaieux.com            0.970
## 84            practicalcreativewriting.com            0.584
## 85                            frfutbol.com            0.778
## 86                              sa7orh.com            0.736
## 87                              mccdaq.com            0.500
## 88                            sp4leczna.pl            0.976
## 89                theophrastus-stiftung.de            0.890
## 90               studentenring-seminare.de            0.514
## 91                             ubt-uni.net            0.744
## 92                             nerdgate.it            0.608
## 93                 happydiwaliwishessms.in            0.950
## 94                qatarairwaysholidays.com            0.626
## 95  fullypcgamesdownloadlinks.blogspot.com            0.702
## 96                                obqvi.eu            0.926
## 97                         azizbebe.com.tr            0.928
## 98                           yofuiaegb.com            0.876
## 99                          cqwsjsw.gov.cn            0.628
## 100                         leapfrog.co.za            0.814
## 101                            dangjian.cn            0.712
## 102                       sukiyaki-dvd.com            0.950
## 103                           qalamoun.com            0.728
## 104                              trktvs.ru            0.672
## 105                           jjudaica.com            0.682
## 106                             gulpjs.com            0.526
## 107              wellnessforlifecenter.com            0.714
## 108                             fajkowo.pl            0.790
## 109            laboutiquedupetitprince.com            0.712
## 110           affordable-link-building.com            0.690
## 111               thelovemapcodereview.com            0.824
## 112                            avtoklub.az            0.746
## 113         thechicagofinancialplanner.com            0.966
## 114                          bullz-eye.com            0.726
## 115           predictiveanalyticstoday.com            0.992
## 116              sangocongnghiepcaocap.com            0.656
## 117               tuhoctienganhhieuqua.com            0.622
## 118               backyardmetalcasting.com            0.742
## 119               laescueladedecoracion.es            0.930
## 120                       itsuxtobefat.com            0.730
## 121                             51jiuyi.cn            0.682
## 122                         1500kyujin.com            0.780
## 123              attackontitanthemovie.com            0.930
## 124                           zloekino.com            0.608
## 125                        wongcungkup.com            0.662
## 126                           toyscute.com            0.666
## 127                           swapsmut.com            0.676
## 128                           bdsousuo.com            0.730
## 129               noticiasdemajadahonda.es            0.824
## 130                         vinylnet.co.uk            0.976
## 131                       escozul-cuba.com            0.898
## 132         surgicalgastroenterologist.com            0.968
## 133                            welikeit.fr            0.504
## 134                            zlavadna.sk            0.696
## 135               mealsonwheelsamerica.org            0.828
## 136                           pantyhoz.net            0.924
## 137                           deep-web.org            0.556
## 138                           idgrosir.com            0.652
## 139                            haganeya.jp            0.546
## 140              smithsonianassociates.org            0.862
## 141                            ebay-spb.ru            0.724
## 142              windows10blurayplayer.com            0.608
## 143              rubbisheatrubbishgrow.com            0.940
## 144                           faucetfm.com            0.912
## 145                            vobzore.com            0.696
## 146                         jeld-wen.co.uk            0.960
## 147            controversietelefoniche.com            0.930
## 148                      sergeybiryukov.ru            0.656
## 149                           ancccert.org            0.722
## 150              jablonnadlamieszkancow.pl            0.946
## 151                           sploofus.com            0.774
## 152            recruitmenthighcourtchd.com            0.890
## 153                           alyasmin.net            0.790
## 154                              yxvzb.com            0.786
## 155                           houigaku.net            0.908
## 156           humanesocietyofcharlotte.org            0.842
## 157                  fitujemy.blogspot.com            0.922
## 158                        fj-l-tax.gov.cn            0.992
## 159                          swiatmp3.info            0.864
## 160              afterhoursprogramming.com            0.540
## 161                          ncmkuwait.com            0.954
## 162                        urheilulehti.fi            0.870
## 163                           qiyequan.com            0.910
## 164                           moliera2.com            0.512
## 165                           nijiyome.com            0.862
## 166                       kaghaz-rangi.com            0.600
## 167              muensterland-tourismus.de            0.634
## 168              beautiful-women-pedia.com            0.760
## 169               lexusofpembrokepines.com            0.948
## 170                inteligentny-budynek.pl            0.828
## 171                         fikrimuhim.com            0.510
## 172                            j-w-a.or.jp            0.862
## 173                      ccgp-hebei.gov.cn            0.626
## 174                             zsrpxy.com            0.784
## 175                            szczecin.eu            0.874
## 176                             daejeon.kr            0.980
## 177                         seagulls.co.uk            0.586
## 178                           chapintv.com            0.610
## 179                psykologtidsskriftet.no            0.616
## 180       musculacionparaprincipiantes.com            0.994
## 181                           imgspice.com            0.684
## 182                              k-pbc.com            0.570
## 183               naturskyddsforeningen.se            0.746
## 184               flatfacefingerboards.com            0.792
## 185                           mashfrog.com            0.672
## 186                       hddizitvizle.net            0.936
## 187               fukuoka-chiropractic.com            0.686
## 188                           lubrizol.com            0.976
## 189                          mamzdrowie.pl            0.514
## 190                            dovrecka.sk            0.834
## 191                           pspdfkit.com            0.932
## 192                            firriato.it            0.850
## 193                     mltaqa-alsfwah.com            0.998
## 194              climbinggriermountain.com            0.888
## 195                           haohaizi.com            0.878
## 196           millionairedatingreviews.com            0.642
## 197                              xoxco.com            0.506
## 198                           pervasiz.com            0.574
## 199                            bit-isle.jp            0.688
## 200             alexandredeparis-store.com            0.624
## 201                             idn-ro.com            0.582
## 202   blogdabelezaperfeita.blogspot.com.br            0.896
## 203                           cheznous.com            0.962
## 204                              a-a-ah.ru            0.740
## 205                           rythmosfm.gr            0.584
## 206                           123veggie.fr            0.516
## 207             discpersonalitytesting.com            0.672
## 208                             mlc-cn.com            0.898
## 209                             povezlo.su            0.560
## 210                              txfip.com            0.550
## 211                           topcunts.com            0.666
## 212          closetorganizationsystems.net            0.706
## 213                           tsogosun.com            0.702
## 214               recettes-economiques.com            0.856
## 215                            slovniky.cz            0.724
## 216                        voulevar.com.br            0.844
## 217                           hoac-bsa.org            0.954
## 218         highimpactmediagrouppanama.com            0.972
## 219                               wbhrb.in            0.748
## 220          documentosimobiliarios.com.br            0.828
## 221          jochen-schweizer-corporate.de            0.714
## 222                        mitofago.com.mx            0.626
## 223             retainguaninefluorite.info            0.938
## 224                            isacmsrt.ir            0.978
## 225                      192-168-1-254.org            0.816
## 226                              abqhp.com            0.900
## 227                            eduroute.nl            0.564
## 228                              bezvre.ru            0.640
## 229                             czeladz.pl            0.904
## 230                              znphc.com            0.546
## 231                            wazamono.jp            0.504
## 232                           gnezdoto.net            0.876
## 233                           aljyyosh.com            0.988
## 234                            fxpravda.ru            0.940
## 235                            voleibol.pe            0.548
## 236                           airwheel.net            0.876
## 237                           cityzeum.com            0.644
## 238                           tri-kobe.org            0.930
## 239              homeliferemodelinginc.com            0.820
bwplot(probs$`predsrfprobs$dga`)

Raising the threshold to 0.9 reduces the false positives from 239 to 61, at the cost of increasing the false negatives from 172 to 709:

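# Label a domain as dga only when its predicted DGA probability exceeds 0.9;
# confusionMatrix expects factors with the same levels as the reference
predsrf2 <- factor(ifelse(testdga$predsrfprobs > 0.9, "dga", "legit"),
                   levels = levels(testdga$class))
confusionMatrix(predsrf2, testdga$class)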
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  dga legit
##      dga   3291    61
##      legit  709  3939
##                                           
##                Accuracy : 0.9038          
##                  95% CI : (0.8971, 0.9101)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8075          
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.8227          
##             Specificity : 0.9848          
##          Pos Pred Value : 0.9818          
##          Neg Pred Value : 0.8475          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4114          
##    Detection Prevalence : 0.4190          
##       Balanced Accuracy : 0.9038          
##                                           
##        'Positive' Class : dga             
##
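
The trade-off between false positives and false negatives can be inspected by sweeping the threshold rather than fixing it at a single value. The sketch below is only an illustration on hypothetical toy data: the probabilities, labels and the helper rates_at are made up for the example and are not the author's results.

```r
# Hypothetical toy data: predicted P(dga) and true class for eight domains.
p_dga <- c(0.98, 0.95, 0.85, 0.60, 0.88, 0.55, 0.30, 0.10)
truth <- factor(c("dga", "dga", "dga", "dga",
                  "legit", "legit", "legit", "legit"),
                levels = c("dga", "legit"))

# False positive rate and false negative rate at a given threshold.
rates_at <- function(threshold, p, truth) {
  pred_dga <- p > threshold
  is_dga   <- truth == "dga"
  c(threshold = threshold,
    fpr = sum(pred_dga & !is_dga) / sum(!is_dga),
    fnr = sum(!pred_dga & is_dga) / sum(is_dga))
}

# One row per threshold: FPR falls and FNR rises as the threshold goes up.
threshold_sweep <- t(sapply(c(0.5, 0.7, 0.9), rates_at, p = p_dga, truth = truth))
print(threshold_sweep)
```

Applied to the real test-set probabilities, a table like this makes it easier to choose a threshold that meets a target false alarm rate while keeping track of how many DGA domains are missed.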