Random forest on the test set: Accuracy : 0.9486, Sensitivity : 0.9570, Specificity : 0.9403
Two Datasets:
##               host       domain class subclass
## 1 busybeetools.com busybeetools legit    legit
## 2     caoliu70.com     caoliu70 legit    legit
## 3        uwsc.info         uwsc legit    legit
## 4         orash.ir        orash legit    legit
## 5         spsp.org         spsp legit    legit
## 6        snoco.org        snoco legit    legit
The following new features were created using Jay Jacobs' approach:
##    onegram   twogram threegram fourgram fivegram   gram345  entropy length
## 1 47.95233 29.638694 13.629354 3.260071 1.414973 18.304399 2.918296     12
## 2 29.13033 13.219833  4.004321 0.000000 0.000000  4.004321 3.000000      8
## 3 15.20832  6.474508  0.000000 0.000000 0.000000  0.000000 2.000000      4
## 4 20.40532 12.525872  6.114473 0.000000 0.000000  6.114473 2.321928      5
## 5 15.77712  7.802046  0.000000 0.000000 0.000000  0.000000 1.000000      4
## 6 20.41921 10.550798  3.984977 0.000000 0.000000  3.984977 1.921928      5
##    dict
## 1 1.000
## 2 0.875
## 3 0.750
## 4 1.000
## 5 1.000
## 6 1.000
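The `entropy` and `length` columns above can be reproduced with a few lines of base R. The sketch below is only an illustration, not the original feature code: `char_entropy` is a hypothetical helper name, and it assumes that `entropy` is the Shannon entropy (in bits) of the character distribution of the `domain` string. The n-gram and `dict` features depend on reference data that is not shown here, so they are left out.

```r
# Shannon entropy (bits) of the character distribution of a string.
# Helper name is hypothetical; assumes a single non-empty string.
char_entropy <- function(s) {
  chars <- strsplit(s, "")[[1]]
  p <- table(chars) / length(chars)
  -sum(p * log2(p))
}

# The six sample domains from the table above.
domains <- c("busybeetools", "caoliu70", "uwsc", "orash", "spsp", "snoco")

features <- data.frame(
  domain  = domains,
  entropy = vapply(domains, char_entropy, numeric(1), USE.NAMES = FALSE),
  length  = nchar(domains)
)
print(features)
```

For these six domains the sketch gives entropies of 2.918296, 3, 2, 2.321928, 1 and 1.921928 and lengths of 12, 8, 4, 5, 4 and 5, which agree with the table.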
Scatter plot matrix of the features on the Andre W. dataset.
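The dataset is split 80/20 into training and test sets.
Both datasets are used for building and testing four models: random forest, SVM with a radial kernel, C4.5 (J48), and logistic regression.
Each model is evaluated using 10-fold cross-validation on the training set and then tested on the held-out test set.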
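The split and the fold assignment can be sketched in base R as follows. This is a minimal illustration on made-up data, not the code used for the report: the `toy` data frame, the seed, and the variable names are all assumptions, and in practice caret's `createDataPartition` would typically do the stratified split while `trainControl` handles the folds.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Toy stand-in for the real data: 50 'dga' and 50 'legit' rows.
toy <- data.frame(
  entropy = runif(100, 1, 4),
  class   = factor(rep(c("dga", "legit"), each = 50))
)

# Stratified 80/20 split: sample 80% of the row indices within each class.
train_idx <- unlist(lapply(split(seq_len(nrow(toy)), toy$class), function(idx) {
  idx[sample.int(length(idx), size = round(0.8 * length(idx)))]
}))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

# Assign each training row to one of 10 folds of equal size
# (random assignment; the folds themselves are not stratified here).
folds <- sample(rep(1:10, length.out = nrow(train_toy)))

table(train_toy$class)  # 40 rows per class
table(test_toy$class)   # 10 rows per class
table(folds)            # 8 rows per fold
```

Stratifying by class keeps the 50/50 class balance in both partitions, which matches the prevalence of 0.5 reported in the test-set results.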
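library(caret)  # train(), trainControl(), twoClassSummary(), confusionMatrix()

# Validation method: 10-fold cross-validation, models compared by ROC AUC
# (the 'repeats' argument is dropped: it only applies to method = "repeatedcv")
ctrl_fast <- trainControl(method = "cv",
                          number = 10,
                          summaryFunction = twoClassSummary,
                          verboseIter = TRUE,
                          classProbs = TRUE,
                          allowParallel = TRUE)

# Random Forest (uses the randomForest package)
rfFit <- train(class ~ .,
               data = traindga,
               metric = "ROC",
               method = "rf",
               trControl = ctrl_fast)

# SVM with radial kernel (uses the kernlab package); features are centered and scaled
svmFit <- train(class ~ .,
                data = traindga,
                method = "svmRadial",
                preProc = c("center", "scale"),
                metric = "ROC",
                tuneLength = 10,
                trControl = ctrl_fast)

# C4.5 (J48 implementation, uses the RWeka package)
c45Fit <- train(class ~ .,
                data = traindga,
                method = "J48",
                metric = "ROC",
                trControl = ctrl_fast)

# Logistic Regression
glmFit <- train(class ~ .,
                data = traindga,
                method = "glm",
                family = binomial(link = 'logit'),
                metric = "ROC",
                trControl = ctrl_fast)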
print(confusionMatrix(predsglm,testdga$class)) #first level failure, second level success
## Confusion Matrix and Statistics
##
## Reference
## Prediction dga legit
## dga 3400 373
## legit 600 3627
##
## Accuracy : 0.8784
## 95% CI : (0.871, 0.8855)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7568
## Mcnemar's Test P-Value : 4.317e-13
##
## Sensitivity : 0.8500
## Specificity : 0.9067
## Pos Pred Value : 0.9011
## Neg Pred Value : 0.8581
## Prevalence : 0.5000
## Detection Rate : 0.4250
## Detection Prevalence : 0.4716
## Balanced Accuracy : 0.8784
##
## 'Positive' Class : dga
##
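The comment on the logistic regression call ("first level failure, second level success") refers to how base R's `glm` treats a factor response with the binomial family: the first factor level is the failure and the remaining level is the success, so the fitted probabilities are for the second level. The sketch below illustrates this on made-up data; the `toy` data frame, the seed, and the single `entropy` predictor are assumptions for the example and not the report's data.

```r
set.seed(1)  # arbitrary seed for a reproducible illustration

# Toy data: the 'dga' rows have a higher entropy on average than the 'legit' rows.
n <- 200
toy <- data.frame(
  entropy = c(rnorm(n, mean = 3.5, sd = 0.4), rnorm(n, mean = 2.5, sd = 0.4)),
  class   = factor(rep(c("dga", "legit"), each = n), levels = c("dga", "legit"))
)

fit <- glm(class ~ entropy, data = toy, family = binomial(link = "logit"))

# With levels c("dga", "legit"), the fitted values are P(class == "legit").
p_legit <- predict(fit, type = "response")

# Turn probabilities into labels, keeping the same factor levels as the truth.
preds <- factor(ifelse(p_legit > 0.5, "legit", "dga"), levels = levels(toy$class))

coef(fit)["entropy"]                    # negative: more entropy, less likely 'legit'
table(Prediction = preds, Reference = toy$class)
```

Because the model describes the second level, a probability of `dga` has to be taken as `1 - p_legit` when `dga` is the first level, as it is in the confusion matrices here.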
confusionMatrix(predsc45, testdga$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction dga legit
## dga 3808 261
## legit 192 3739
##
## Accuracy : 0.9434
## 95% CI : (0.9381, 0.9483)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8868
## Mcnemar's Test P-Value : 0.001399
##
## Sensitivity : 0.9520
## Specificity : 0.9347
## Pos Pred Value : 0.9359
## Neg Pred Value : 0.9512
## Prevalence : 0.5000
## Detection Rate : 0.4760
## Detection Prevalence : 0.5086
## Balanced Accuracy : 0.9434
##
## 'Positive' Class : dga
##
confusionMatrix(predssvm, testdga$class)
## Confusion Matrix and Statistics
##
## Reference
## Prediction dga legit
## dga 3804 270
## legit 196 3730
##
## Accuracy : 0.9418
## 95% CI : (0.9364, 0.9468)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8835
## Mcnemar's Test P-Value : 0.0007205
##
## Sensitivity : 0.9510
## Specificity : 0.9325
## Pos Pred Value : 0.9337
## Neg Pred Value : 0.9501
## Prevalence : 0.5000
## Detection Rate : 0.4755
## Detection Prevalence : 0.5092
## Balanced Accuracy : 0.9417
##
## 'Positive' Class : dga
##
confusionMatrix(predsrf,testdga$class) #first level failure, second level success
## Confusion Matrix and Statistics
##
## Reference
## Prediction dga legit
## dga 3828 239
## legit 172 3761
##
## Accuracy : 0.9486
## 95% CI : (0.9436, 0.9534)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8973
## Mcnemar's Test P-Value : 0.001132
##
## Sensitivity : 0.9570
## Specificity : 0.9403
## Pos Pred Value : 0.9412
## Neg Pred Value : 0.9563
## Prevalence : 0.5000
## Detection Rate : 0.4785
## Detection Prevalence : 0.5084
## Balanced Accuracy : 0.9486
##
## 'Positive' Class : dga
##
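Sensitivity = Recall = Detection Rate (true positive rate); 1 - Specificity = False Alarm Rate (false positive rate). Note that the "Detection Rate" row in the caret output above is a different quantity: true positives divided by all samples.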
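As a worked example, the statistics reported by `confusionMatrix` can be recomputed by hand from the four cell counts. The sketch below uses the counts of the logistic regression matrix shown earlier; the variable names are chosen for this illustration only.

```r
# Counts from the logistic regression confusion matrix
# (rows = prediction, columns = reference, positive class = 'dga').
tp <- 3400; fp <- 373   # predicted 'dga'
fn <- 600;  tn <- 3627  # predicted 'legit'
n  <- tp + fp + fn + tn

sensitivity <- tp / (tp + fn)    # recall, true positive rate
specificity <- tn / (tn + fp)
false_alarm <- 1 - specificity   # false positive rate
accuracy    <- (tp + tn) / n
ppv         <- tp / (tp + fp)    # positive predictive value
npv         <- tn / (tn + fn)    # negative predictive value

# Cohen's kappa: observed agreement corrected for chance agreement.
p_chance    <- ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

print(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, false_alarm = false_alarm,
        ppv = ppv, npv = npv, kappa = cohen_kappa))
```

The values agree with the caret output up to rounding (for example sensitivity 0.85, specificity 0.90675, kappa 0.75675), and the false alarm rate for logistic regression comes out at 0.09325.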
Tree-based classifiers (C4.5 and RF) seem to outperform SVM and logistic regression; however, the FPR is higher than expected (about 0.06 to 0.07). It can be lowered by increasing the probability threshold, at the cost of some sensitivity.
print(probs)
## host predsrfprobs$dga
## 1 pestlemortarclothing.com 0.922
## 2 kreuzfahrten.de 0.742
## 3 ntpcgets.com 0.844
## 4 lvpaiutegolf.com 0.552
## 5 singaporeanstocksinvestor.blogspot.sg 0.930
## 6 ijeee-apm.com 0.822
## 7 sacm-usa.gov.sa 0.982
## 8 thecenterforsalesstrategy.com 0.890
## 9 aluaplia.es 0.750
## 10 emg-ss.jp 0.864
## 11 mp3http.mobi 0.664
## 12 cencosud.cl 0.748
## 13 homesinsummervillesc.com 0.762
## 14 maru-jan.com 0.844
## 15 sinapsit.com 0.568
## 16 domvnaem.com.ua 0.980
## 17 andaluciacompromisodigital.org 0.990
## 18 iran-doc.com 0.508
## 19 txzshc.com 0.766
## 20 invisibledisabilities.org 0.886
## 21 mydrupal.com 0.944
## 22 gubukbola.com 0.822
## 23 copelandsofneworleans.com 0.882
## 24 engkumuzahadin.blogspot.com 0.616
## 25 zfdmkj.com 0.528
## 26 ghayoumi.ir 0.518
## 27 thzhd.com 0.880
## 28 zaliczaj.pl 0.956
## 29 go2jump.org 0.752
## 30 lanrensc.com 0.500
## 31 juweixin.com 0.956
## 32 tattooedjessicarabbit.net 0.964
## 33 baixardvdr.com 0.582
## 34 malankaraorthodoxchurch.in 0.968
## 35 lafsozluk.com 0.964
## 36 sustainablecitynetwork.com 0.684
## 37 getmziki.com 0.886
## 38 duoduoyin.com 0.604
## 39 maydohuyetap.net.vn 0.862
## 40 vne-konkursa.kz 0.518
## 41 soqom.net 0.702
## 42 baltykgdynia.pl 0.998
## 43 cjgls.com 0.868
## 44 db-pbc.pl 0.926
## 45 mapa-de-buenos-aires.com 0.506
## 46 afpbb.com 0.632
## 47 banehtak.com 0.742
## 48 www.kw 0.670
## 49 lulouisa.blogspot.in 0.894
## 50 sozdik.kz 0.756
## 51 vashikaranprediction.com 0.696
## 52 luxusuhr24.de 0.782
## 53 junipergrovebooksolutions.com 0.830
## 54 adronhomesproperties.com 0.650
## 55 voxxintl.com 0.840
## 56 kreissparkasse-nordhausen.de 0.974
## 57 clutterfreeclassroom.blogspot.com 0.830
## 58 reviewofcontactlenses.com 0.750
## 59 cmp-rugby.com 0.946
## 60 maji-ero.com 0.922
## 61 paginaswebenmedellin.com 0.888
## 62 strongerbraverfighter.com 0.836
## 63 qwairh.net 0.568
## 64 bizimvezne.com 0.920
## 65 al7eah.net 0.536
## 66 hushhush.com 0.690
## 67 designlimitededition.com 0.646
## 68 dont-nod.com 0.710
## 69 djpanaaz.com 0.944
## 70 howtomakeyourboobsgrowbigger.com 0.992
## 71 docmosis.com 0.576
## 72 scd-coastcapitalsavings.com 0.616
## 73 mahsafar.com 0.820
## 74 ijdmtoy.com 0.766
## 75 qqszc.com 0.582
## 76 quiltaddictsanonymous.com 0.980
## 77 indiannursingcouncil.org 0.594
## 78 l0sh0ck.ru 0.870
## 79 tm-modus.com.ua 0.748
## 80 rajasthanpatwari2015recruitment.in 0.998
## 81 usmle-rx.com 0.804
## 82 radeberger-praemienwelt.de 0.966
## 83 mesaieux.com 0.970
## 84 practicalcreativewriting.com 0.584
## 85 frfutbol.com 0.778
## 86 sa7orh.com 0.736
## 87 mccdaq.com 0.500
## 88 sp4leczna.pl 0.976
## 89 theophrastus-stiftung.de 0.890
## 90 studentenring-seminare.de 0.514
## 91 ubt-uni.net 0.744
## 92 nerdgate.it 0.608
## 93 happydiwaliwishessms.in 0.950
## 94 qatarairwaysholidays.com 0.626
## 95 fullypcgamesdownloadlinks.blogspot.com 0.702
## 96 obqvi.eu 0.926
## 97 azizbebe.com.tr 0.928
## 98 yofuiaegb.com 0.876
## 99 cqwsjsw.gov.cn 0.628
## 100 leapfrog.co.za 0.814
## 101 dangjian.cn 0.712
## 102 sukiyaki-dvd.com 0.950
## 103 qalamoun.com 0.728
## 104 trktvs.ru 0.672
## 105 jjudaica.com 0.682
## 106 gulpjs.com 0.526
## 107 wellnessforlifecenter.com 0.714
## 108 fajkowo.pl 0.790
## 109 laboutiquedupetitprince.com 0.712
## 110 affordable-link-building.com 0.690
## 111 thelovemapcodereview.com 0.824
## 112 avtoklub.az 0.746
## 113 thechicagofinancialplanner.com 0.966
## 114 bullz-eye.com 0.726
## 115 predictiveanalyticstoday.com 0.992
## 116 sangocongnghiepcaocap.com 0.656
## 117 tuhoctienganhhieuqua.com 0.622
## 118 backyardmetalcasting.com 0.742
## 119 laescueladedecoracion.es 0.930
## 120 itsuxtobefat.com 0.730
## 121 51jiuyi.cn 0.682
## 122 1500kyujin.com 0.780
## 123 attackontitanthemovie.com 0.930
## 124 zloekino.com 0.608
## 125 wongcungkup.com 0.662
## 126 toyscute.com 0.666
## 127 swapsmut.com 0.676
## 128 bdsousuo.com 0.730
## 129 noticiasdemajadahonda.es 0.824
## 130 vinylnet.co.uk 0.976
## 131 escozul-cuba.com 0.898
## 132 surgicalgastroenterologist.com 0.968
## 133 welikeit.fr 0.504
## 134 zlavadna.sk 0.696
## 135 mealsonwheelsamerica.org 0.828
## 136 pantyhoz.net 0.924
## 137 deep-web.org 0.556
## 138 idgrosir.com 0.652
## 139 haganeya.jp 0.546
## 140 smithsonianassociates.org 0.862
## 141 ebay-spb.ru 0.724
## 142 windows10blurayplayer.com 0.608
## 143 rubbisheatrubbishgrow.com 0.940
## 144 faucetfm.com 0.912
## 145 vobzore.com 0.696
## 146 jeld-wen.co.uk 0.960
## 147 controversietelefoniche.com 0.930
## 148 sergeybiryukov.ru 0.656
## 149 ancccert.org 0.722
## 150 jablonnadlamieszkancow.pl 0.946
## 151 sploofus.com 0.774
## 152 recruitmenthighcourtchd.com 0.890
## 153 alyasmin.net 0.790
## 154 yxvzb.com 0.786
## 155 houigaku.net 0.908
## 156 humanesocietyofcharlotte.org 0.842
## 157 fitujemy.blogspot.com 0.922
## 158 fj-l-tax.gov.cn 0.992
## 159 swiatmp3.info 0.864
## 160 afterhoursprogramming.com 0.540
## 161 ncmkuwait.com 0.954
## 162 urheilulehti.fi 0.870
## 163 qiyequan.com 0.910
## 164 moliera2.com 0.512
## 165 nijiyome.com 0.862
## 166 kaghaz-rangi.com 0.600
## 167 muensterland-tourismus.de 0.634
## 168 beautiful-women-pedia.com 0.760
## 169 lexusofpembrokepines.com 0.948
## 170 inteligentny-budynek.pl 0.828
## 171 fikrimuhim.com 0.510
## 172 j-w-a.or.jp 0.862
## 173 ccgp-hebei.gov.cn 0.626
## 174 zsrpxy.com 0.784
## 175 szczecin.eu 0.874
## 176 daejeon.kr 0.980
## 177 seagulls.co.uk 0.586
## 178 chapintv.com 0.610
## 179 psykologtidsskriftet.no 0.616
## 180 musculacionparaprincipiantes.com 0.994
## 181 imgspice.com 0.684
## 182 k-pbc.com 0.570
## 183 naturskyddsforeningen.se 0.746
## 184 flatfacefingerboards.com 0.792
## 185 mashfrog.com 0.672
## 186 hddizitvizle.net 0.936
## 187 fukuoka-chiropractic.com 0.686
## 188 lubrizol.com 0.976
## 189 mamzdrowie.pl 0.514
## 190 dovrecka.sk 0.834
## 191 pspdfkit.com 0.932
## 192 firriato.it 0.850
## 193 mltaqa-alsfwah.com 0.998
## 194 climbinggriermountain.com 0.888
## 195 haohaizi.com 0.878
## 196 millionairedatingreviews.com 0.642
## 197 xoxco.com 0.506
## 198 pervasiz.com 0.574
## 199 bit-isle.jp 0.688
## 200 alexandredeparis-store.com 0.624
## 201 idn-ro.com 0.582
## 202 blogdabelezaperfeita.blogspot.com.br 0.896
## 203 cheznous.com 0.962
## 204 a-a-ah.ru 0.740
## 205 rythmosfm.gr 0.584
## 206 123veggie.fr 0.516
## 207 discpersonalitytesting.com 0.672
## 208 mlc-cn.com 0.898
## 209 povezlo.su 0.560
## 210 txfip.com 0.550
## 211 topcunts.com 0.666
## 212 closetorganizationsystems.net 0.706
## 213 tsogosun.com 0.702
## 214 recettes-economiques.com 0.856
## 215 slovniky.cz 0.724
## 216 voulevar.com.br 0.844
## 217 hoac-bsa.org 0.954
## 218 highimpactmediagrouppanama.com 0.972
## 219 wbhrb.in 0.748
## 220 documentosimobiliarios.com.br 0.828
## 221 jochen-schweizer-corporate.de 0.714
## 222 mitofago.com.mx 0.626
## 223 retainguaninefluorite.info 0.938
## 224 isacmsrt.ir 0.978
## 225 192-168-1-254.org 0.816
## 226 abqhp.com 0.900
## 227 eduroute.nl 0.564
## 228 bezvre.ru 0.640
## 229 czeladz.pl 0.904
## 230 znphc.com 0.546
## 231 wazamono.jp 0.504
## 232 gnezdoto.net 0.876
## 233 aljyyosh.com 0.988
## 234 fxpravda.ru 0.940
## 235 voleibol.pe 0.548
## 236 airwheel.net 0.876
## 237 cityzeum.com 0.644
## 238 tri-kobe.org 0.930
## 239 homeliferemodelinginc.com 0.820
bwplot(probs$`predsrfprobs$dga`)
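# Label a domain 'dga' only when P(dga) > 0.9; recent caret versions require factors with matching levels
predsrf2 <- factor(ifelse(testdga$predsrfprobs > 0.9, 'dga', 'legit'), levels = levels(testdga$class))
confusionMatrix(predsrf2, testdga$class)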
## Confusion Matrix and Statistics
##
## Reference
## Prediction dga legit
## dga 3291 61
## legit 709 3939
##
## Accuracy : 0.9038
## 95% CI : (0.8971, 0.9101)
## No Information Rate : 0.5
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8075
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.8227
## Specificity : 0.9848
## Pos Pred Value : 0.9818
## Neg Pred Value : 0.8475
## Prevalence : 0.5000
## Detection Rate : 0.4114
## Detection Prevalence : 0.4190
## Balanced Accuracy : 0.9038
##
## 'Positive' Class : dga
##
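The effect of the threshold can be seen on a small example. The sketch below uses eight made-up scores (not values from the test set) and a hypothetical helper `rates_at` to show how sensitivity and false alarm rate move together as the cut-off on P(dga) is raised.

```r
# Hypothetical scores: P(dga) for eight domains with known labels.
p_dga <- c(0.98, 0.95, 0.85, 0.60, 0.92, 0.70, 0.30, 0.10)
truth <- factor(rep(c("dga", "legit"), each = 4), levels = c("dga", "legit"))

# Label a domain 'dga' only when P(dga) exceeds the threshold,
# then read sensitivity and false alarm rate off the confusion table.
rates_at <- function(threshold) {
  pred <- factor(ifelse(p_dga > threshold, "dga", "legit"), levels = levels(truth))
  cm <- table(Prediction = pred, Reference = truth)
  c(threshold   = threshold,
    sensitivity = cm["dga", "dga"] / sum(cm[, "dga"]),
    false_alarm = cm["dga", "legit"] / sum(cm[, "legit"]))
}

result <- t(sapply(c(0.5, 0.9), rates_at))
print(result)
# threshold 0.5: sensitivity 1.0, false_alarm 0.50
# threshold 0.9: sensitivity 0.5, false_alarm 0.25
```

The same trade-off appears in the results above: with the 0.9 threshold the random forest's false alarm rate drops from 239/4000 (about 0.060) to 61/4000 (about 0.015), while sensitivity falls from 0.9570 to 0.8227.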