Introduction
This notebook uses the Phishing Detection Dataset from Kaggle. The task is to use supervised machine learning methods to predict phishing websites and to evaluate the models' performance. The methods used include CART, bagged CART, random forest, XGBoost, AdaBoost, and logistic regression; the evaluation metrics are accuracy, sensitivity, specificity, F1-score, and AUC-ROC. A summary of model performance and the five most important features for predicting phishing websites is presented in the last section of this notebook.
Data dictionary (from source)
| No. | Variable | Description |
|---|---|---|
| 1 | having_ip_address | Suspicious if the URL contains an IP address or similar characteristics. |
| 2 | length_of_url | Suspicious if the URL is too long. |
| 3 | shortening_services | Suspicious if a URL shortening service has been used. |
| 4 | having_at_symbol | Suspicious if the URL contains @ symbols. |
| 5 | double-slash_redirection | Suspicious if the URL contains //. |
| 6 | prefix and suffix | Suspicious if the URL contains prefixes and suffixes. |
| 7 | sub_domains | Suspicious if the URL contains many subdomains. |
| 8 | ssl_state | Scan any suspicious behaviors related to SSL state. |
| 9 | domain_registered | Scan any suspicious behaviors related to domain registration details. |
| 10 | favicons | Scan any suspicious behaviors related to favicons. |
| 11 | ports | Scan any suspicious behaviors related to ports. |
| 12 | https | Suspicious if the URL does not contain HTTPS. |
| 13 | external_objects | Scan any suspicious behaviors related to website elements (audio, video, images). |
| 14 | anchor_tags | Scan any suspicious behaviors related to anchor tags. |
| 15 | links_in_tags | Scan any suspicious behaviors related to links in html tags. |
| 16 | sfh-domain | Scan any suspicious behaviors related to SFH info on domain. |
| 17 | auto_email | Suspicious if site auto submits emails. |
| 18 | abnoramal_url | Suspicious if the URL contains abnormal characteristics. |
| 19 | iframe_redirection | Scan any suspicious behaviors related to the iframe element. |
| 20 | on_mouse_over | Scan any suspicious behaviors related to onMouseOver scripts. |
| 21 | right_click | Scan any suspicious behaviors related to rightClick scripts. |
| 22 | popup_windows | Scan any suspicious behaviors related to pop-up windows. |
| 23 | domain_age | Scan any suspicious behaviors related to domain age. |
| 24 | dns_record | Scan any suspicious behaviors related to DNS record. |
| 25 | web_traffic | Scan any suspicious behaviors related to website ranking. |
| 26 | links_pointing | Scan any suspicious behaviors related to links pointing to the web page. |
| 27 | statistical_report | Scan any suspicious behaviors related to statistical report. |
| 28 | image_text_keyword | Suspicious if website images contain phishing keywords. |
| 29 | result | 1 = Phishing, -1 = Legitimate |
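
The rules in the data dictionary are stated only in words. As a rough illustration of how three of the URL-based flags could be derived from a raw URL, here is a minimal sketch. The function name `encode_url_flags`, the regular expressions, the example URLs, and the convention that 1 means suspicious and -1 means not suspicious are all assumptions for illustration; the source does not document how the dataset's values were actually produced.

```r
# Hypothetical helper (not from the dataset's authors): encode three URL flags
# as 1 (suspicious) or -1 (not suspicious).
encode_url_flags <- function(url) {
  # Drop the scheme so the "//" in "http://" is not counted as a redirection.
  rest <- sub("^[A-Za-z][A-Za-z0-9+.-]*://", "", url)
  c(
    having_at_symbol         = ifelse(grepl("@", url, fixed = TRUE), 1, -1),
    double_slash_redirection = ifelse(grepl("//", rest, fixed = TRUE), 1, -1),
    https                    = ifelse(grepl("^https://", url, ignore.case = TRUE), -1, 1)
  )
}

# Made-up URLs for illustration only.
flags_bad  <- encode_url_flags("http://example.com//login@evil.test")
flags_good <- encode_url_flags("https://example.com/index.html")
print(rbind(flags_bad, flags_good))
```

Rules such as "URL length is too long" or "many subdomains" would need thresholds that the source does not give, so they are left out of the sketch.
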
library(tidyverse)
library(Hmisc)
library(patchwork)
library(colorspace)
library(ggstatsplot)
library(caret)
library(rattle)
library(randomForest)
library(pROC)
library(pscl)
theme_set(theme_bw(base_size=10))
theme_update(axis.ticks=element_blank(),
plot.title.position="plot")
data = read_csv("fixed_values_ds.csv")
dim(data)
[1] 14093 29
sum(is.na(data))
[1] 0
data %>% mutate_if(is.numeric,as.factor) %>% summary()
having_ip_address length_of_url shortening_services having_at_symbol double-slash_redirection prefix and suffix
-1:14057 -1:7080 -1:12846 -1:13977 -1:14051 -1:11360
1 : 36 0 :2747 1 : 1247 1 : 116 1 : 42 1 : 2733
1 :4266
sub_domains ssl_state domain_registered favicons ports https external_objects anchor_tags links_in_tags
-1:1750 -1:5919 -1: 1528 -1: 1810 -1:14091 -1:7798 -1:9764 -1:10800 -1:8937
0 :5737 1 :8174 1 :12565 1 :12283 1 : 2 1 :6295 0 : 265 0 : 504 0 :1573
1 :6606 1 :4064 1 : 2789 1 :3583
sfh-domain auto_email abnoramal_url iframe_redirection on_mouse_over right_click popup_windows domain_age
-1:4563 -1:5983 -1:5844 -1:5983 -1:5973 -1:5981 -1:5788 -1: 3951
0 :9433 1 :8110 1 :8249 1 :8110 1 :8120 1 :8112 1 :8305 1 :10142
1 : 97
dns_record web_traffic links_pointing statistical_report image_text_keyword result
-1: 1455 -1:5648 -1: 2525 -1:10790 -1: 1987 -1:7044
1 :12638 0 :1051 0 : 647 1 : 3303 1 :12106 1 :7049
1 :7394 1 :10921
Hmisc::describe(factor(data$result))
factor(data$result)
n missing distinct
14093 0 2
Value -1 1
Frequency 7044 7049
Proportion 0.5 0.5
data %>% mutate(result= ifelse(result==-1,0,1)) -> data
Hmisc::describe(factor(data$result))
factor(data$result)
n missing distinct
14093 0 2
Value 0 1
Frequency 7044 7049
Proportion 0.5 0.5
# separate features into 2 groups by levels
data %>%
mutate(id = row_number()) %>%
pivot_longer(!id) %>%
mutate_at(vars(value),list(factor)) %>%
group_by(name) %>% count(value) %>%
mutate(levels=n_distinct(value)) -> t1
t1a = t1 %>% filter(levels==2) %>% filter(name!="result")
t1b = t1 %>% filter(levels==3) %>% filter(name!="result")
# comparison
data %>% pivot_longer(!result) %>%
mutate_at(vars(value),list(factor)) %>%
group_by(result,name) %>% count(value) ->t2
# plot
t2 %>% filter(name %in% t1a$name) %>%
ggplot(aes(y=name, x=n, color=value)) +
geom_line(aes(group=name), color="grey") +
geom_point(size=2, alpha=0.9) +
facet_wrap(~result, ncol=2, labeller=label_both) +
scale_color_manual(values=c("#f3722c","#277da1","#90be6d")) +
labs(x="count",y="feature", subtitle="Comparison of 2-level features, by result\n")
t2 %>% filter(name %in% t1b$name) %>%
ggplot(aes(y=name, x=n, color=value)) +
geom_line(aes(group=name), color="grey") +
geom_point(size=3, alpha=0.9) +
facet_wrap(~result, ncol=2,labeller=label_both) +
scale_color_manual(values=c("#f3722c","#277da1","#90be6d")) +
labs(x="count",y="feature", subtitle="Comparison of 3-level features, by result\n")
Because the dataset is balanced, faceted dumbbell plots give a compact comparison of how each feature's levels are distributed in the two target classes.
data %>% mutate_all(list(factor)) %>%
group_by(result, https,web_traffic) %>% tally() %>%
ungroup() %>% group_by(result) %>% mutate(proportion=round(n/sum(n),3)) %>%
ggplot(aes(x=https, y=web_traffic, fill=proportion)) +
geom_tile(color="white", size=4, alpha=0.9) +
geom_text(aes(label=scales::percent(proportion, accuracy=0.1)), size=3.5)+
facet_wrap(~result, ncol=2, labeller=label_both) +
scale_fill_continuous_sequential(palette="mint") +
theme(legend.position="top",
strip.background=element_rect(fill=NA),
axis.text=element_text(face="bold", size=11),
axis.title=element_text(size=10),
strip.text = element_text(size=10),
plot.margin=unit(c(1,2,1,2),"cm"))
data %>% mutate_all(list(factor)) %>%
group_by(result, links_in_tags,web_traffic) %>% tally() %>%
ungroup() %>% group_by(result) %>% mutate(proportion=round(n/sum(n),3)) %>%
ggplot(aes(x=links_in_tags, y=web_traffic, fill=proportion)) +
geom_tile(color="white", size=4, alpha=0.9) +
geom_text(aes(label=scales::percent(proportion, accuracy=0.1)), size=3.5)+
facet_wrap(~result, ncol=2, labeller=label_both) +
scale_fill_continuous_sequential(palette="teal") +
theme(legend.position="top",
strip.background=element_rect(fill=NA),
axis.text=element_text(face="bold", size=11),
axis.title=element_text(size=10),
strip.text = element_text(size=10),
plot.margin=unit(c(1,2,1,2),"cm"))
data %>% mutate_all(list(factor)) %>%
group_by(result, web_traffic,length_of_url) %>% tally() %>%
ungroup() %>% group_by(result) %>% mutate(proportion=round(n/sum(n),3)) %>%
ggplot(aes(x=web_traffic, y=length_of_url, fill=proportion)) +
geom_tile(color="white", size=4, alpha=0.9) +
geom_text(aes(label=scales::percent(proportion, accuracy=0.1)), size=3.5)+
facet_wrap(~result, ncol=2, labeller=label_both) +
scale_fill_continuous_sequential(palette="peach") +
theme(legend.position="top",
strip.background=element_rect(fill=NA),
axis.text=element_text(face="bold", size=11),
axis.title=element_text(size=10),
strip.text = element_text(size=10),
plot.margin=unit(c(1,2,1,2),"cm"))
# function
flattenCorrMatrix <- function(cormat, pmat) {
ut <- upper.tri(cormat)
data.frame(
row = rownames(cormat)[row(cormat)[ut]],
column = rownames(cormat)[col(cormat)[ut]],
cor =(cormat)[ut],
p = pmat[ut]
)
}
res2<-rcorr(as.matrix(data), type="spearman")
flattenCorrMatrix(res2$r, res2$P) -> corr_table
# significant correlations
corr_table %>% filter(column=="result") %>%
mutate(sig=ifelse(p<=.05,"sig.","not sig.")) %>%
ggplot(aes(x=row, y=cor)) +
geom_segment(aes(x=reorder(row,cor), xend=row, y=0, yend=cor, color=sig)) +
geom_point(aes(color=sig)) +
coord_flip() +
labs(color="",subtitle="Spearman Correlation (to result)") +
scale_color_manual(values=c("#b7094c","#277da1"))
corr_table %>% filter(p<=0.05) %>% filter(cor<=-0.8 | cor >=0.8) %>% arrange(desc(cor))
ggstatsplot::ggcorrmat(
data=data,
type="spearman",
ggcorrplot.args = list(lab_size=2, tl.srt=90, tl.cex=7)
)
drop_cols= c('auto_email', 'iframe_redirection', 'right_click', 'ssl_state', 'popup_windows', 'on_mouse_over', 'domain_registered','links_in_tags')
data2 = data %>% select(-one_of(drop_cols))
# corr plot after dropping variables
ggstatsplot::ggcorrmat(
data=data2,
type="spearman",
ggcorrplot.args = list(lab_size=3, tl.srt=90, tl.cex=9)
)
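The source does not state how `drop_cols` was chosen, but the filter on |cor| >= 0.8 above suggests it was read off the highly correlated pairs. A minimal base-R sketch of deriving such a list programmatically, on toy data rather than the phishing dataset; the 0.8 cutoff mirrors the filter above, and dropping the second member of each pair is an arbitrary assumption.

```r
# Toy data: column b duplicates column a, column c is independent noise.
set.seed(1)
a <- sample(c(-1, 1), 200, replace = TRUE)
toy <- data.frame(
  a = a,
  b = a,
  c = sample(c(-1, 1), 200, replace = TRUE)
)

# Spearman correlation matrix; keep only the upper triangle so each pair appears once.
rho <- cor(toy, method = "spearman")
rho[lower.tri(rho, diag = TRUE)] <- NA

# Locate pairs whose absolute correlation reaches the cutoff.
high <- which(abs(rho) >= 0.8, arr.ind = TRUE)
high_pairs <- data.frame(
  row    = rownames(rho)[high[, "row"]],
  column = colnames(rho)[high[, "col"]],
  rho    = rho[high]
)

# Drop the second member of every highly correlated pair.
to_drop <- unique(high_pairs$column)
print(high_pairs)
print(to_drop)
```

Which member of a pair to keep is a modeling decision; domain knowledge or a feature's correlation with the target are common tie-breakers.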
# check variables for missing values
data2 %>% type.convert() %>% sapply(function(x)sum(is.na(x)))
having_ip_address length_of_url shortening_services having_at_symbol
0 0 0 0
double-slash_redirection prefix and suffix sub_domains favicons
0 0 0 0
ports https external_objects anchor_tags
0 0 0 0
sfh-domain abnoramal_url domain_age dns_record
0 0 0 0
web_traffic links_pointing statistical_report image_text_keyword
0 0 0 0
result
0
Data partition
# partition data based on outcome i.e. result
data %>% mutate_all(list(factor)) ->data2
colnames(data2) <- make.names(colnames(data2)) #make valid col names
set.seed(123)
train.index <- createDataPartition(data2$result, p = .7, list = FALSE)
xtrain <- data2[ train.index,]
xtest <- data2[-train.index,]
Hmisc::describe(xtrain$result)
xtrain$result
n missing distinct
9866 0 2
Value 0 1
Frequency 4931 4935
Proportion 0.5 0.5
Hmisc::describe(xtest$result)
xtest$result
n missing distinct
4227 0 2
Value 0 1
Frequency 2113 2114
Proportion 0.5 0.5
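The matching class proportions in the train and test sets follow from sampling within each class. A simplified base-R sketch of that idea on toy labels; `stratified_index` is a hypothetical helper and is not caret's implementation of `createDataPartition`.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_index <- function(y, p = 0.7) {
  per_class <- lapply(split(seq_along(y), y), function(i) {
    i[sample.int(length(i), size = ceiling(p * length(i)))]
  })
  sort(unname(unlist(per_class)))
}

set.seed(123)
y <- factor(rep(c(0, 1), times = c(50, 50)))  # toy balanced labels
train_idx <- stratified_index(y, p = 0.7)

train_tab <- table(y[train_idx])   # class counts in the training part
test_tab  <- table(y[-train_idx])  # class counts in the held-out part
print(train_tab)
print(test_tab)
```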
Research question: What are the key features for predicting phishing websites? To identify them, all the features in the dataset are used for modeling.
set.seed(123)
dt <- train(
result ~., data = xtrain, method = "rpart",
trControl = trainControl("cv", number = 10),
tuneLength = 10
)
plot(dt) # plot
dt$bestTune %>% unlist() #print
cp
0.002027986
fancyRpartPlot(dt$finalModel) #tree
dt.p <- dt %>% predict(xtest) # predict
cmdt = confusionMatrix(dt.p, factor(xtest$result)) #confusion matrix
cmdt
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1830 169
1 283 1945
Accuracy : 0.8931
95% CI : (0.8834, 0.9022)
No Information Rate : 0.5001
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7861
Mcnemar's Test P-Value : 1.066e-07
Sensitivity : 0.8661
Specificity : 0.9201
Pos Pred Value : 0.9155
Neg Pred Value : 0.8730
Prevalence : 0.4999
Detection Rate : 0.4329
Detection Prevalence : 0.4729
Balanced Accuracy : 0.8931
'Positive' Class : 0
round(cmdt$byClass["F1"],4) #F1 score
F1
0.8901
roc(response= xtest$result, predictor = factor(dt.p,ordered=T), plot=T, print.auc=T) #AUC
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Call:
roc.default(response = xtest$result, predictor = factor(dt.p, ordered = T), plot = T, print.auc = T)
Data: factor(dt.p, ordered = T) in 2113 controls (xtest$result 0) < 2114 cases (xtest$result 1).
Area under the curve: 0.8931
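In the `roc()` call above the predictor is the predicted class label rather than a continuous score, so the ROC curve has a single interior point and the reported area (0.8931) coincides with the balanced accuracy (0.8931). A minimal base-R sketch on toy values (not model output) showing the difference between AUC computed from scores and AUC computed from thresholded labels; `auc_rank` is a hypothetical helper using the rank (Mann-Whitney) formulation.

```r
# AUC as the probability that a random positive is ranked above a random negative.
# rank() assigns average ranks to ties, so tied pairs count one half.
auc_rank <- function(score, label) {
  r <- rank(score)
  n_pos <- sum(label == 1)
  n_neg <- sum(label == 0)
  (sum(r[label == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

label <- c(0, 0, 0, 0, 1, 1, 1, 1)
score <- c(0.1, 0.2, 0.6, 0.3, 0.4, 0.7, 0.8, 0.9)  # toy probabilities
hard  <- as.numeric(score > 0.5)                    # thresholded class labels

auc_scores <- auc_rank(score, label)  # uses the full ranking
auc_hard   <- auc_rank(hard, label)   # uses only the 0/1 predictions

sens <- mean(hard[label == 1] == 1)
spec <- mean(hard[label == 0] == 0)
balanced_acc <- (sens + spec) / 2

print(c(auc_scores = auc_scores, auc_hard = auc_hard, balanced_acc = balanced_acc))
```

To obtain a threshold-free AUC for the fitted models, the predictor passed to `roc()` would need to be a class probability or other continuous score instead of the predicted label.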
plot(varImp(dt)) #plot variable importance
set.seed(123)
rf <- train(result ~., data = xtrain, method = "rf",
trControl = trainControl("cv", number = 10),
importance = TRUE)
rf$bestTune
rf$finalModel
Call:
randomForest(x = x, y = y, mtry = param$mtry, importance = TRUE)
Type of random forest: classification
Number of trees: 500
No. of variables tried at each split: 19
OOB estimate of error rate: 7.66%
Confusion matrix:
0 1 class.error
0 4503 428 0.08679781
1 328 4607 0.06646403
rf.p <- rf %>% predict(xtest) #predict
cmrf = confusionMatrix(rf.p, factor(xtest$result)) #confusion matrix
cmrf
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1900 132
1 213 1982
Accuracy : 0.9184
95% CI : (0.9097, 0.9265)
No Information Rate : 0.5001
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8368
Mcnemar's Test P-Value : 1.654e-05
Sensitivity : 0.8992
Specificity : 0.9376
Pos Pred Value : 0.9350
Neg Pred Value : 0.9030
Prevalence : 0.4999
Detection Rate : 0.4495
Detection Prevalence : 0.4807
Balanced Accuracy : 0.9184
'Positive' Class : 0
round(cmrf$byClass["F1"],4) #F1 score
F1
0.9168
roc(response= xtest$result, predictor = factor(rf.p,ordered=T), plot=T, print.auc=T) #AUC
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Call:
roc.default(response = xtest$result, predictor = factor(rf.p, ordered = T), plot = T, print.auc = T)
Data: factor(rf.p, ordered = T) in 2113 controls (xtest$result 0) < 2114 cases (xtest$result 1).
Area under the curve: 0.9184
varImpPlot(rf$finalModel, type=2) #plot variable importance
set.seed(123)
xgb <- train(result ~., data = xtrain, method = "xgbTree",trControl = trainControl("cv", number = 10))
xgb$bestTune
xgb.p = xgb %>% predict(xtest) #predict
cmxgb= confusionMatrix(xgb.p, factor(xtest$result)) #confusion matrix
cmxgb
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1896 146
1 217 1968
Accuracy : 0.9141
95% CI : (0.9053, 0.9224)
No Information Rate : 0.5001
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8282
Mcnemar's Test P-Value : 0.0002387
Sensitivity : 0.8973
Specificity : 0.9309
Pos Pred Value : 0.9285
Neg Pred Value : 0.9007
Prevalence : 0.4999
Detection Rate : 0.4485
Detection Prevalence : 0.4831
Balanced Accuracy : 0.9141
'Positive' Class : 0
round(cmxgb$byClass["F1"],4) #F1 score
F1
0.9126
roc(response= xtest$result, predictor = factor(xgb.p,ordered=T), plot=T, print.auc=T) #AUC
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Call:
roc.default(response = xtest$result, predictor = factor(xgb.p, ordered = T), plot = T, print.auc = T)
Data: factor(xgb.p, ordered = T) in 2113 controls (xtest$result 0) < 2114 cases (xtest$result 1).
Area under the curve: 0.9141
varImp(xgb) #var imp
xgbTree variable importance
only 20 most important variables shown (out of 36)
plot(varImp(xgb)) #plot var imp
ada <- train(result ~., data = xtrain, method = "adaboost",tuneLength=2, trControl = trainControl("cv", number = 10))
ada.p = ada %>% predict(xtest) #predict
cmada= confusionMatrix(ada.p, factor(xtest$result), positive="1") #confusion matrix
cmada
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1877 112
1 236 2002
Accuracy : 0.9177
95% CI : (0.909, 0.9258)
No Information Rate : 0.5001
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8353
Mcnemar's Test P-Value : 4.296e-11
Sensitivity : 0.9470
Specificity : 0.8883
Pos Pred Value : 0.8945
Neg Pred Value : 0.9437
Prevalence : 0.5001
Detection Rate : 0.4736
Detection Prevalence : 0.5295
Balanced Accuracy : 0.9177
'Positive' Class : 1
round(cmada$byClass["F1"],4) #F1 score
F1
0.92
roc(response= xtest$result, predictor = factor(ada.p,ordered=T), plot=T, print.auc=T) #AUC
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Call:
roc.default(response = xtest$result, predictor = factor(ada.p, ordered = T), plot = T, print.auc = T)
Data: factor(ada.p, ordered = T) in 2113 controls (xtest$result 0) < 2114 cases (xtest$result 1).
Area under the curve: 0.9177
varImp(bag) #var imp; note: `bag` is the treebag model fitted in the next section, not `ada`
treebag variable importance
only 20 most important variables shown (out of 36)
plot(varImp(bag)) #plot var imp
set.seed(123)
bag <- train(result ~., data = xtrain, method = "treebag",trControl = trainControl("cv", number = 10))
bag$bestTune
bag.p = bag %>% predict(xtest) #predict
cmbag= confusionMatrix(bag.p, factor(xtest$result), positive="1") #confusion matrix
cmbag
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1890 134
1 223 1980
Accuracy : 0.9155
95% CI : (0.9068, 0.9238)
No Information Rate : 0.5001
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.8311
Mcnemar's Test P-Value : 3.201e-06
Sensitivity : 0.9366
Specificity : 0.8945
Pos Pred Value : 0.8988
Neg Pred Value : 0.9338
Prevalence : 0.5001
Detection Rate : 0.4684
Detection Prevalence : 0.5212
Balanced Accuracy : 0.9155
'Positive' Class : 1
round(cmbag$byClass["F1"],4) #F1 score
F1
0.9173
roc(response= xtest$result, predictor = factor(bag.p,ordered=T), plot=T, print.auc=T) #AUC
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Call:
roc.default(response = xtest$result, predictor = factor(bag.p, ordered = T), plot = T, print.auc = T)
Data: factor(bag.p, ordered = T) in 2113 controls (xtest$result 0) < 2114 cases (xtest$result 1).
Area under the curve: 0.9155
varImp(bag) #var imp
treebag variable importance
only 20 most important variables shown (out of 36)
plot(varImp(bag)) #plot var imp
lr = glm(result~., family = "binomial", data=xtrain)
summary(lr)
Call:
glm(formula = result ~ ., family = "binomial", data = xtrain)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.3130 -0.3215 0.0006 0.2824 2.9190
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.09195 0.34215 -3.191 0.00142 **
having_ip_address1 12.55434 190.86656 0.066 0.94756
length_of_url0 -0.07899 0.09579 -0.825 0.40956
length_of_url1 1.08145 0.09932 10.888 < 2e-16 ***
shortening_services1 0.06103 0.12003 0.508 0.61111
having_at_symbol1 3.53524 0.75962 4.654 3.26e-06 ***
double.slash_redirection1 2.61877 1.09576 2.390 0.01685 *
prefix.and.suffix1 0.46769 0.08967 5.216 1.83e-07 ***
sub_domains0 -0.91000 0.12188 -7.466 8.25e-14 ***
sub_domains1 -0.86560 0.12722 -6.804 1.02e-11 ***
ssl_state1 -1.01601 0.51125 -1.987 0.04689 *
domain_registered1 1.70248 0.51926 3.279 0.00104 **
favicons1 0.35959 0.20599 1.746 0.08087 .
ports1 11.20087 882.74339 0.013 0.98988
https1 2.89122 0.08420 34.339 < 2e-16 ***
external_objects0 -0.06460 0.31762 -0.203 0.83884
external_objects1 0.24922 0.19581 1.273 0.20309
anchor_tags0 -0.55937 0.24545 -2.279 0.02267 *
anchor_tags1 -1.54962 0.17957 -8.630 < 2e-16 ***
links_in_tags0 -0.65805 0.21712 -3.031 0.00244 **
links_in_tags1 -0.05029 0.25598 -0.196 0.84426
sfh.domain0 -1.33089 0.18062 -7.368 1.73e-13 ***
sfh.domain1 1.42103 0.50728 2.801 0.00509 **
auto_email1 -1.73934 959.97781 -0.002 0.99855
abnoramal_url1 -2.55015 0.47178 -5.405 6.47e-08 ***
iframe_redirection1 NA NA NA NA
on_mouse_over1 -10.48161 377.25429 -0.028 0.97783
right_click1 16.55640 882.74352 0.019 0.98504
popup_windows1 0.08224 0.35237 0.233 0.81546
domain_age1 -0.08391 0.09378 -0.895 0.37096
dns_record1 0.23754 0.53876 0.441 0.65928
web_traffic0 -0.39328 0.14308 -2.749 0.00598 **
web_traffic1 2.83291 0.09079 31.201 < 2e-16 ***
links_pointing0 -1.57702 0.22638 -6.966 3.26e-12 ***
links_pointing1 -2.55353 0.20894 -12.221 < 2e-16 ***
statistical_report1 1.28715 0.09347 13.771 < 2e-16 ***
image_text_keyword1 -0.95526 0.13408 -7.125 1.04e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 13677.2 on 9865 degrees of freedom
Residual deviance: 5570.7 on 9830 degrees of freedom
AIC: 5642.7
Number of Fisher Scoring iterations: 13
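The `NA` row for `iframe_redirection1` and the note "1 not defined because of singularities" indicate that this column is a linear combination of predictors entered earlier. Its level counts in the summary (-1: 5983, 1: 8110) match those of `auto_email`, which is consistent with, though not proof of, the two columns being identical. A minimal sketch on toy data showing how `glm` reports such an aliased predictor; the variable names and simulated probabilities are assumptions.

```r
# Toy data: x2 is an exact copy of x1, standing in for a perfectly collinear feature.
set.seed(1)
x1 <- factor(sample(c(0, 1), 100, replace = TRUE))
x2 <- x1
y  <- rbinom(100, 1, ifelse(x1 == 1, 0.7, 0.3))

fit <- glm(y ~ x1 + x2, family = "binomial")

# The coefficient of the redundant column cannot be estimated and is returned as NA.
coefs <- coef(fit)
print(coefs)
```

Dropping the redundant column before fitting would also remove the "rank-deficient fit" warning raised at prediction time.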
pR2(lr)
llh llhNull G2 McFadden r2ML r2CU
-2785.3434664 -6838.5892725 8106.4916123 0.5927020 0.5602986 0.7470648
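The pseudo R-squared values above can be checked by hand from the two log-likelihoods. A short sketch, assuming the standard definitions of the likelihood-ratio statistic, McFadden's R-squared, and the maximum-likelihood (Cox-Snell) R-squared; `n` is the training-set size reported earlier (9866).

```r
llh      <- -2785.3434664  # log-likelihood of the fitted model (from the output above)
llh_null <- -6838.5892725  # log-likelihood of the intercept-only model
n        <- 9866           # number of training observations

g2       <- -2 * (llh_null - llh)  # likelihood-ratio statistic
mcfadden <- 1 - llh / llh_null     # McFadden pseudo R-squared
r2_ml    <- 1 - exp(-g2 / n)       # maximum-likelihood (Cox-Snell) pseudo R-squared

print(round(c(G2 = g2, McFadden = mcfadden, r2ML = r2_ml), 4))
```

`G2` also equals the null deviance minus the residual deviance in the model summary (13677.2 - 5570.7).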
anova(lr, test="Chisq")
Analysis of Deviance Table
Model: binomial, link: logit
Response: result
Terms added sequentially (first to last)
Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL 9865 13677.2
having_ip_address 1 27.75 9864 13649.4 1.380e-07 ***
length_of_url 2 348.97 9862 13300.5 < 2.2e-16 ***
shortening_services 1 0.65 9861 13299.8 0.4204011
having_at_symbol 1 69.04 9860 13230.8 < 2.2e-16 ***
double.slash_redirection 1 32.67 9859 13198.1 1.090e-08 ***
prefix.and.suffix 1 279.14 9858 12919.0 < 2.2e-16 ***
sub_domains 2 925.74 9856 11993.2 < 2.2e-16 ***
ssl_state 1 67.68 9855 11925.5 < 2.2e-16 ***
domain_registered 1 342.94 9854 11582.6 < 2.2e-16 ***
favicons 1 369.63 9853 11213.0 < 2.2e-16 ***
ports 1 2.78 9852 11210.2 0.0953372 .
https 1 2459.82 9851 8750.4 < 2.2e-16 ***
external_objects 2 57.60 9849 8692.8 3.101e-13 ***
anchor_tags 2 100.87 9847 8591.9 < 2.2e-16 ***
links_in_tags 2 30.19 9845 8561.7 2.776e-07 ***
sfh.domain 2 225.62 9843 8336.1 < 2.2e-16 ***
auto_email 1 6.93 9842 8329.2 0.0084615 **
abnoramal_url 1 119.10 9841 8210.1 < 2.2e-16 ***
iframe_redirection 0 0.00 9841 8210.1
on_mouse_over 1 4.47 9840 8205.6 0.0345425 *
right_click 1 1.12 9839 8204.5 0.2898616
popup_windows 1 2.56 9838 8201.9 0.1095906
domain_age 1 1.78 9837 8200.1 0.1826757
dns_record 1 13.38 9836 8186.7 0.0002541 ***
web_traffic 2 2189.12 9834 5997.6 < 2.2e-16 ***
links_pointing 2 188.66 9832 5809.0 < 2.2e-16 ***
statistical_report 1 186.39 9831 5622.6 < 2.2e-16 ***
image_text_keyword 1 51.89 9830 5570.7 5.879e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#predict
prob=predict(lr, xtest, type="response")
prediction from a rank-deficient fit may be misleading
prob1=rep(0,nrow(xtest)) # one prediction per test row (4227 here)
prob1[prob>0.5]=1
cmlr= confusionMatrix(as.factor(prob1),xtest$result, positive="1") #confusion matrix
cmlr
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 1770 152
1 343 1962
Accuracy : 0.8829
95% CI : (0.8728, 0.8924)
No Information Rate : 0.5001
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 0.7658
Mcnemar's Test P-Value : < 2.2e-16
Sensitivity : 0.9281
Specificity : 0.8377
Pos Pred Value : 0.8512
Neg Pred Value : 0.9209
Prevalence : 0.5001
Detection Rate : 0.4642
Detection Prevalence : 0.5453
Balanced Accuracy : 0.8829
'Positive' Class : 1
round(cmlr$byClass["F1"],4)
F1
0.888
roc(xtest$result, prob1, plot=T, print.auc=T)
Setting levels: control = 0, case = 1
Setting direction: controls < cases
Call:
roc.default(response = xtest$result, predictor = prob1, plot = T, print.auc = T)
Data: prob1 in 2113 controls (xtest$result 0) < 2114 cases (xtest$result 1).
Area under the curve: 0.8829
| Model | Accuracy | Sensitivity | Specificity | F1-score | AUC-ROC |
|---|---|---|---|---|---|
| CART | 0.8931 | 0.8661 | 0.9201 | 0.8901 | 0.8931 |
| Random Forest | 0.9184 | 0.8992 | 0.9376 | 0.9168 | 0.9184 |
| XGBoost | 0.9141 | 0.8973 | 0.9309 | 0.9126 | 0.9141 |
| AdaBoost | 0.9177 | 0.9470 | 0.8883 | 0.9200 | 0.9177 |
| Bagged CART | 0.9155 | 0.9366 | 0.8945 | 0.9173 | 0.9155 |
| LR | 0.8829 | 0.9281 | 0.8377 | 0.8880 | 0.8829 |
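
The caret outputs earlier in the notebook report CART, random forest, and XGBoost with 'Positive' Class: 0, while AdaBoost, bagged CART, and logistic regression were evaluated with `positive="1"`, so sensitivity, specificity, and F1 are not measured against the same class in every row. A minimal sketch that recomputes these metrics from the confusion-matrix counts shown above, treating phishing (1) as the positive class throughout; the helper name `metrics_pos1` is hypothetical.

```r
# Build a 2x2 matrix from counts listed column-wise:
# (pred 0 / ref 0, pred 1 / ref 0, pred 0 / ref 1, pred 1 / ref 1).
make_cm <- function(counts) {
  matrix(counts, nrow = 2,
         dimnames = list(pred = c("0", "1"), ref = c("0", "1")))
}

# Counts copied from the confusion matrices printed above.
cms <- list(
  CART          = make_cm(c(1830, 283, 169, 1945)),
  RandomForest  = make_cm(c(1900, 213, 132, 1982)),
  XGBoost       = make_cm(c(1896, 217, 146, 1968)),
  AdaBoost      = make_cm(c(1877, 236, 112, 2002)),
  BaggedCART    = make_cm(c(1890, 223, 134, 1980)),
  LR            = make_cm(c(1770, 343, 152, 1962))
)

# Metrics with class "1" (phishing) as the positive class.
metrics_pos1 <- function(cm) {
  tp <- cm["1", "1"]; fn <- cm["0", "1"]
  tn <- cm["0", "0"]; fp <- cm["1", "0"]
  sens <- tp / (tp + fn)
  spec <- tn / (tn + fp)
  prec <- tp / (tp + fp)
  c(accuracy    = (tp + tn) / sum(cm),
    sensitivity = sens,
    specificity = spec,
    f1          = 2 * prec * sens / (prec + sens))
}

res <- t(sapply(cms, metrics_pos1))
print(round(res, 4))
```

With a common positive class, sensitivity and specificity for the first three models swap relative to the table, and F1 changes because it depends on which class is treated as positive; accuracy is unaffected.
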
| Model | Feature 1 | Feature 2 | Feature 3 | Feature 4 | Feature 5 |
|---|---|---|---|---|---|
| CART | https | web_traffic | links_pointing | statistical_report | dns_record |
| Random Forest | web_traffic | https | statistical_report | links_pointing | sub_domains |
| XGBoost | web_traffic | https | links_pointing | prefix.and.suffix | sub_domains |
| AdaBoost | https | web_traffic | links_pointing | statistical_report | dns_record |
| Bagged CART | https | web_traffic | links_pointing | statistical_report | dns_record |
| LR | https | web_traffic | sub_domains | favicons | length_of_url |