The goal of this handout is to develop a logistic regression model and use it to discuss the odds that a URL is benign or spam (the URL_type outcome). This mini tutorial lists the steps we need to follow to complete the activity. It is only a starting point: during the remaining two weeks we will continue working on this document.
getwd()
## [1] "C:/Users/npenaper/Downloads"
final1<-read.csv("group3.csv",sep = ",",header = TRUE)
mydata1<-na.omit(final1)
#summary(mydata1)
str(mydata1)
## 'data.frame': 8051 obs. of 80 variables:
## $ Querylength : int 0 0 19 0 0 0 31 0 17 0 ...
## $ domain_token_count : int 2 3 2 2 2 2 2 2 2 2 ...
## $ path_token_count : int 12 12 10 10 9 13 9 9 11 7 ...
## $ avgdomaintokenlen : num 5.5 5 6 5.5 2.5 4.5 5.5 5.5 6 4.5 ...
## $ longdomaintokenlen : int 8 10 9 9 3 6 8 9 9 6 ...
## $ avgpathtokenlen : num 4.08 3.58 2.25 4.1 4.56 ...
## $ tld : int 2 3 2 2 2 2 2 2 2 2 ...
## $ charcompvowels : int 15 12 9 15 6 16 22 13 9 9 ...
## $ charcompace : int 7 8 5 11 3 9 17 9 8 4 ...
## $ ldl_url : int 0 2 0 0 0 1 12 0 0 2 ...
## $ ldl_domain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ldl_path : int 0 2 0 0 0 1 12 0 0 2 ...
## $ ldl_filename : int 0 2 0 0 0 0 11 0 0 0 ...
## $ ldl_getArg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dld_url : int 0 0 0 0 0 0 8 0 0 0 ...
## $ dld_domain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dld_path : int 0 0 0 0 0 0 8 0 0 0 ...
## $ dld_filename : int 0 0 0 0 0 0 8 0 0 0 ...
## $ dld_getArg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ urlLen : int 80 78 68 69 62 98 114 63 72 54 ...
## $ domainlength : int 12 17 13 12 6 10 12 12 13 10 ...
## $ pathLength : int 61 54 48 50 49 81 95 44 52 37 ...
## $ subDirLen : int 61 54 48 50 49 81 95 44 52 37 ...
## $ fileNameLen : int 2 40 2 44 17 2 80 38 2 7 ...
## $ this.fileExtLen : int 2 4 2 4 4 2 5 4 2 3 ...
## $ ArgLen : int 2 2 35 2 2 2 37 2 39 2 ...
## $ pathurlRatio : num 0.762 0.692 0.706 0.725 0.79 ...
## $ ArgUrlRatio : num 0.025 0.0256 0.5147 0.029 0.0323 ...
## $ argDomanRatio : num 0.167 0.118 2.692 0.167 0.333 ...
## $ domainUrlRatio : num 0.15 0.2179 0.1912 0.1739 0.0968 ...
## $ pathDomainRatio : num 5.08 3.18 3.69 4.17 8.17 ...
## $ argPathRatio : num 0.0328 0.037 0.7292 0.04 0.0408 ...
## $ executable : int 0 0 0 0 0 0 0 0 0 0 ...
## $ isPortEighty : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ NumberofDotsinURL : int 1 3 2 2 2 1 5 2 2 2 ...
## $ ISIpAddressInDomainName : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ CharacterContinuityRate : num 0.75 0.647 0.769 0.833 0.667 ...
## $ LongestVariableValue : int -1 -1 13 -1 -1 -1 31 -1 11 -1 ...
## $ URL_DigitCount : int 10 8 6 6 24 35 23 6 6 9 ...
## $ host_DigitCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Directory_DigitCount : int 6 6 0 -1 8 19 23 -1 0 3 ...
## $ File_name_DigitCount : int 2 2 0 6 16 16 0 6 0 6 ...
## $ Extension_DigitCount : int 2 0 6 0 0 0 0 0 6 0 ...
## $ Query_DigitCount : int -1 -1 6 -1 -1 -1 0 -1 6 -1 ...
## $ URL_Letter_Count : int 54 54 48 50 26 47 77 45 50 35 ...
## $ host_letter_count : int 11 15 12 11 5 9 11 11 12 9 ...
## $ Directory_LetterCount : int 0 0 3 -1 13 16 54 -1 3 18 ...
## $ Filename_LetterCount : int 0 31 3 31 0 5 3 26 3 1 ...
## $ Extension_LetterCount : int 39 4 26 4 4 13 5 4 28 3 ...
## $ Query_LetterCount : int -1 -1 12 -1 -1 -1 26 -1 10 -1 ...
## $ LongestPathTokenLength : int 48 40 39 44 17 27 40 38 43 18 ...
## $ Domain_LongestWordLength : int 8 10 9 9 3 6 8 9 9 6 ...
## $ Path_LongestWordLength : int 8 8 7 7 7 13 12 8 7 7 ...
## $ sub.Directory_LongestWordLength: int 8 7 7 7 7 5 12 8 7 7 ...
## $ Arguments_LongestWordLength : int -1 -1 13 -1 -1 -1 31 -1 11 -1 ...
## $ URL_sensitiveWord : int 0 0 0 0 0 0 0 0 0 0 ...
## $ URLQueries_variable : int 0 0 2 0 0 0 1 0 3 0 ...
## $ spcharUrl : int 5 4 3 2 7 6 3 2 3 4 ...
## $ delimeter_Domain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delimeter_path : int 7 8 4 8 2 7 4 7 4 3 ...
## $ delimeter_Count : int -1 -1 3 -1 -1 -1 1 -1 5 -1 ...
## $ NumberRate_URL : num 0.125 0.1026 0.0882 0.087 0.3871 ...
## $ NumberRate_Domain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NumberRate_DirectoryName : num -1 0.667 0 -1 0.296 ...
## $ NumberRate_FileName : num -1 0.0444 0.1395 -1 0.7273 ...
## $ NumberRate_Extension : num -1 0 0.154 -1 0 ...
## $ NumberRate_AfterPath : num -1 -1 0.171 -1 -1 ...
## $ SymbolCount_URL : int 6 7 9 4 9 7 10 4 11 6 ...
## $ SymbolCount_Domain : int 1 2 1 1 1 1 1 1 1 1 ...
## $ SymbolCount_Directoryname : int -1 2 1 -1 5 -1 1 -1 1 2 ...
## $ SymbolCount_FileName : int -1 1 5 -1 1 -1 6 -1 7 1 ...
## $ SymbolCount_Extension : int -1 0 4 -1 0 -1 5 -1 6 0 ...
## $ SymbolCount_Afterpath : int -1 -1 3 -1 -1 -1 4 -1 5 -1 ...
## $ Entropy_URL : num 0.677 0.716 0.747 0.733 0.743 ...
## $ Entropy_Domain : num 0.861 0.777 0.834 0.861 1 ...
## $ Entropy_DirectoryName : num -1 0.693 0.655 -1 0.786 ...
## $ Entropy_Filename : num -1 0.738 0.83 -1 0.809 ...
## $ Entropy_Extension : num -1 1 0.836 -1 1 ...
## $ Entropy_Afterpath : num -1 -1 0.823 -1 -1 ...
## $ URL_Type_obf_Type : chr "benign" "benign" "benign" "benign" ...
## - attr(*, "na.action")= 'omit' Named int [1:6428] 3 4 7 8 11 14 17 18 19 20 ...
## ..- attr(*, "names")= chr [1:6428] "3" "4" "7" "8" ...
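The na.action attribute at the end of the str() output shows that na.omit() dropped 6428 rows and kept 8051. Before dropping that many rows it can help to see which columns hold the missing values. The following is a minimal sketch on a made-up toy data frame (the column names a, b and label are hypothetical, not taken from group3.csv):

```r
# Toy data frame standing in for the raw CSV (values are made up)
toy <- data.frame(
  a = c(1, NA, 3, 4),
  b = c(NA, NA, 1, 2),
  label = c("benign", "spam", "benign", "spam")
)

# Number of missing values per column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# Rows that na.omit() would drop versus keep
rows_dropped <- sum(!complete.cases(toy))
rows_kept <- nrow(na.omit(toy))
cat("dropped:", rows_dropped, "kept:", rows_kept, "\n")
```

The same two calls, colSums(is.na(final1)) and sum(!complete.cases(final1)), could be run on the real data to see whether a few columns account for most of the dropped rows.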
mydata1$URL_Type_obf_Type<-as.factor(mydata1$URL_Type_obf_Type)
str(mydata1)
## 'data.frame': 8051 obs. of 80 variables:
## $ Querylength : int 0 0 19 0 0 0 31 0 17 0 ...
## $ domain_token_count : int 2 3 2 2 2 2 2 2 2 2 ...
## $ path_token_count : int 12 12 10 10 9 13 9 9 11 7 ...
## $ avgdomaintokenlen : num 5.5 5 6 5.5 2.5 4.5 5.5 5.5 6 4.5 ...
## $ longdomaintokenlen : int 8 10 9 9 3 6 8 9 9 6 ...
## $ avgpathtokenlen : num 4.08 3.58 2.25 4.1 4.56 ...
## $ tld : int 2 3 2 2 2 2 2 2 2 2 ...
## $ charcompvowels : int 15 12 9 15 6 16 22 13 9 9 ...
## $ charcompace : int 7 8 5 11 3 9 17 9 8 4 ...
## $ ldl_url : int 0 2 0 0 0 1 12 0 0 2 ...
## $ ldl_domain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ ldl_path : int 0 2 0 0 0 1 12 0 0 2 ...
## $ ldl_filename : int 0 2 0 0 0 0 11 0 0 0 ...
## $ ldl_getArg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dld_url : int 0 0 0 0 0 0 8 0 0 0 ...
## $ dld_domain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ dld_path : int 0 0 0 0 0 0 8 0 0 0 ...
## $ dld_filename : int 0 0 0 0 0 0 8 0 0 0 ...
## $ dld_getArg : int 0 0 0 0 0 0 0 0 0 0 ...
## $ urlLen : int 80 78 68 69 62 98 114 63 72 54 ...
## $ domainlength : int 12 17 13 12 6 10 12 12 13 10 ...
## $ pathLength : int 61 54 48 50 49 81 95 44 52 37 ...
## $ subDirLen : int 61 54 48 50 49 81 95 44 52 37 ...
## $ fileNameLen : int 2 40 2 44 17 2 80 38 2 7 ...
## $ this.fileExtLen : int 2 4 2 4 4 2 5 4 2 3 ...
## $ ArgLen : int 2 2 35 2 2 2 37 2 39 2 ...
## $ pathurlRatio : num 0.762 0.692 0.706 0.725 0.79 ...
## $ ArgUrlRatio : num 0.025 0.0256 0.5147 0.029 0.0323 ...
## $ argDomanRatio : num 0.167 0.118 2.692 0.167 0.333 ...
## $ domainUrlRatio : num 0.15 0.2179 0.1912 0.1739 0.0968 ...
## $ pathDomainRatio : num 5.08 3.18 3.69 4.17 8.17 ...
## $ argPathRatio : num 0.0328 0.037 0.7292 0.04 0.0408 ...
## $ executable : int 0 0 0 0 0 0 0 0 0 0 ...
## $ isPortEighty : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ NumberofDotsinURL : int 1 3 2 2 2 1 5 2 2 2 ...
## $ ISIpAddressInDomainName : int -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
## $ CharacterContinuityRate : num 0.75 0.647 0.769 0.833 0.667 ...
## $ LongestVariableValue : int -1 -1 13 -1 -1 -1 31 -1 11 -1 ...
## $ URL_DigitCount : int 10 8 6 6 24 35 23 6 6 9 ...
## $ host_DigitCount : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Directory_DigitCount : int 6 6 0 -1 8 19 23 -1 0 3 ...
## $ File_name_DigitCount : int 2 2 0 6 16 16 0 6 0 6 ...
## $ Extension_DigitCount : int 2 0 6 0 0 0 0 0 6 0 ...
## $ Query_DigitCount : int -1 -1 6 -1 -1 -1 0 -1 6 -1 ...
## $ URL_Letter_Count : int 54 54 48 50 26 47 77 45 50 35 ...
## $ host_letter_count : int 11 15 12 11 5 9 11 11 12 9 ...
## $ Directory_LetterCount : int 0 0 3 -1 13 16 54 -1 3 18 ...
## $ Filename_LetterCount : int 0 31 3 31 0 5 3 26 3 1 ...
## $ Extension_LetterCount : int 39 4 26 4 4 13 5 4 28 3 ...
## $ Query_LetterCount : int -1 -1 12 -1 -1 -1 26 -1 10 -1 ...
## $ LongestPathTokenLength : int 48 40 39 44 17 27 40 38 43 18 ...
## $ Domain_LongestWordLength : int 8 10 9 9 3 6 8 9 9 6 ...
## $ Path_LongestWordLength : int 8 8 7 7 7 13 12 8 7 7 ...
## $ sub.Directory_LongestWordLength: int 8 7 7 7 7 5 12 8 7 7 ...
## $ Arguments_LongestWordLength : int -1 -1 13 -1 -1 -1 31 -1 11 -1 ...
## $ URL_sensitiveWord : int 0 0 0 0 0 0 0 0 0 0 ...
## $ URLQueries_variable : int 0 0 2 0 0 0 1 0 3 0 ...
## $ spcharUrl : int 5 4 3 2 7 6 3 2 3 4 ...
## $ delimeter_Domain : int 0 0 0 0 0 0 0 0 0 0 ...
## $ delimeter_path : int 7 8 4 8 2 7 4 7 4 3 ...
## $ delimeter_Count : int -1 -1 3 -1 -1 -1 1 -1 5 -1 ...
## $ NumberRate_URL : num 0.125 0.1026 0.0882 0.087 0.3871 ...
## $ NumberRate_Domain : num 0 0 0 0 0 0 0 0 0 0 ...
## $ NumberRate_DirectoryName : num -1 0.667 0 -1 0.296 ...
## $ NumberRate_FileName : num -1 0.0444 0.1395 -1 0.7273 ...
## $ NumberRate_Extension : num -1 0 0.154 -1 0 ...
## $ NumberRate_AfterPath : num -1 -1 0.171 -1 -1 ...
## $ SymbolCount_URL : int 6 7 9 4 9 7 10 4 11 6 ...
## $ SymbolCount_Domain : int 1 2 1 1 1 1 1 1 1 1 ...
## $ SymbolCount_Directoryname : int -1 2 1 -1 5 -1 1 -1 1 2 ...
## $ SymbolCount_FileName : int -1 1 5 -1 1 -1 6 -1 7 1 ...
## $ SymbolCount_Extension : int -1 0 4 -1 0 -1 5 -1 6 0 ...
## $ SymbolCount_Afterpath : int -1 -1 3 -1 -1 -1 4 -1 5 -1 ...
## $ Entropy_URL : num 0.677 0.716 0.747 0.733 0.743 ...
## $ Entropy_Domain : num 0.861 0.777 0.834 0.861 1 ...
## $ Entropy_DirectoryName : num -1 0.693 0.655 -1 0.786 ...
## $ Entropy_Filename : num -1 0.738 0.83 -1 0.809 ...
## $ Entropy_Extension : num -1 1 0.836 -1 1 ...
## $ Entropy_Afterpath : num -1 -1 0.823 -1 -1 ...
## $ URL_Type_obf_Type : Factor w/ 2 levels "benign","spam": 1 1 1 1 1 1 1 1 1 1 ...
## - attr(*, "na.action")= 'omit' Named int [1:6428] 3 4 7 8 11 14 17 18 19 20 ...
## ..- attr(*, "names")= chr [1:6428] "3" "4" "7" "8" ...
mydata1$URL_Type_obf_Type <- ifelse(mydata1$URL_Type_obf_Type == "benign", 1, 0)
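Because this line codes benign as 1 and spam as 0, the model fitted later estimates the probability (and the odds) that a URL is benign. A small sketch of the recoding on a hypothetical toy vector:

```r
# Hypothetical labels standing in for URL_Type_obf_Type
url_type <- factor(c("benign", "spam", "spam", "benign", "benign"))

# Factor levels are sorted alphabetically: "benign" is level 1, "spam" is level 2
print(levels(url_type))

# Same recoding as in the handout: benign -> 1, spam -> 0
y <- ifelse(url_type == "benign", 1, 0)
print(table(label = url_type, coded = y))
```

Note that if the factor were passed to glm() directly, the first level (benign) would be treated as the failure and the model would describe the probability of spam instead, so the direction of every coefficient would flip.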
View(mydata1)
mydata1<-mydata1[,c(1,3,5,6,8,9,20,22,24,27,28,29,30,31,32,37,46,51,52,53,54,62,74,80)]
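Selecting columns by position works, but it silently picks the wrong variables if the column order of the CSV ever changes. A sketch of the same idea using column names, on a made-up toy data frame (the names come from the str() output above, the values are invented):

```r
# Toy data frame with a few of the handout's column names (values are made up)
toy <- data.frame(
  Querylength = c(0, 19),
  domain_token_count = c(2, 2),
  path_token_count = c(12, 10),
  URL_Type_obf_Type = c(1, 1)
)

# Select by name instead of by position
keep <- c("Querylength", "path_token_count", "URL_Type_obf_Type")
by_name <- toy[, keep]

# Equivalent positional selection for this toy frame
by_position <- toy[, c(1, 3, 4)]

print(identical(by_name, by_position))
```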
We split the original data into two datasets so that we can assess how accurately the model predicts data that it did not see while it was being fitted.
There are a number of ways to proportionally split our data into train and test sets: 50/50, 60/40, 70/30, 80/20, and so forth. The data split that you select should be based on your experience and judgment. For this exercise, we will use a 90/10 split, as follows:
set.seed(123) # fix the random seed so the split is reproducible
ind <- sample(2, nrow(mydata1), replace = TRUE, prob = c(0.9, 0.1))
Partitioning the data:
train1 <- mydata1[ind==1, ] #the training set
test1 <- mydata1[ind==2, ] # the testing set
You can confirm the dimensions of both sets as follows:
dim(train1)
## [1] 7281 24
dim(test1)
## [1] 770 24
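sample(2, n, replace = TRUE, prob = c(0.9, 0.1)) labels each row independently, so the resulting sizes are only approximately 90/10 (here 7281 and 770 out of 8051 rows). If exact sizes are wanted, one alternative, which is not the method used in this handout, is to sample row indices without replacement. A minimal sketch on a toy data frame:

```r
set.seed(123)
n <- 1000                                  # toy number of rows
toy <- data.frame(id = seq_len(n))

# Approach used in the handout: an independent 1/2 label for every row
ind <- sample(2, n, replace = TRUE, prob = c(0.9, 0.1))
approx_train <- toy[ind == 1, , drop = FALSE]
approx_test <- toy[ind == 2, , drop = FALSE]

# Alternative: draw exactly 90% of the row indices without replacement
train_idx <- sample(n, size = round(0.9 * n))
exact_train <- toy[train_idx, , drop = FALSE]
exact_test <- toy[-train_idx, , drop = FALSE]

cat("approximate split:", nrow(approx_train), nrow(approx_test), "\n")
cat("exact split:", nrow(exact_train), nrow(exact_test), "\n")
```

Either approach keeps the two sets disjoint; the difference is only whether the proportions are exact or approximate.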
To ensure that we have a well-balanced outcome variable between the two datasets, we will perform the following check:
table(train1$URL_Type_obf_Type)
##
## 0 1
## 4825 2456
table(test1$URL_Type_obf_Type)
##
## 0 1
## 517 253
This is an acceptable ratio of our outcomes in the two datasets; with this, we can begin the modeling and evaluation.
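Raw counts are easier to compare once they are turned into proportions. The sketch below reuses the counts printed by table() above; on the real data frames the same result would come from prop.table(table(train1$URL_Type_obf_Type)).

```r
# Counts copied from the table() output above (0 = spam, 1 = benign)
train_counts <- c("0" = 4825, "1" = 2456)
test_counts <- c("0" = 517, "1" = 253)

# Share of each class within each dataset
train_share <- train_counts / sum(train_counts)
test_share <- test_counts / sum(test_counts)

print(round(rbind(train = train_share, test = test_share), 3))
```

Roughly a third of the URLs are benign in both sets, which is what the check above is meant to confirm.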
We will use the function glm() (from base R) for the logistic regression model. An R installation comes with the glm() function for fitting generalized linear models, a class of models that includes logistic regression. The code syntax is similar to the lm() function that we used for linear regression. One difference is that we must use the family = binomial argument, which tells R to fit a logistic regression instead of another kind of generalized linear model.
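To see why the family argument matters: glm() defaults to family = gaussian, which reproduces an ordinary linear regression. A minimal sketch on simulated data (the variables x and y are invented for illustration):

```r
set.seed(42)
# Simulated toy data: one predictor and a 0/1 outcome
x <- rnorm(100)
y <- rbinom(100, size = 1, prob = plogis(-0.3 + x))

gaussian_fit <- glm(y ~ x)                      # default family = gaussian
linear_fit <- lm(y ~ x)                         # ordinary linear regression
logistic_fit <- glm(y ~ x, family = binomial)   # logistic regression

# The default glm() fit matches lm(); only family = binomial is logistic
print(all.equal(coef(gaussian_fit), coef(linear_fit)))
print(coef(logistic_fit))
```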
We will start by creating a model that includes all of the features on the train set and see how it performs on the test set:
# attach(train1) is not needed here: glm() receives the data through its data argument
full.fit <- glm(URL_Type_obf_Type ~ ., family = binomial, data = train1)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
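This warning commonly appears when some combination of predictors separates the two classes almost perfectly, which pushes fitted probabilities to 0 or 1. Whether that is what happens with this dataset has not been checked in the handout. The sketch below only reproduces the warning on a made-up toy dataset in which a single predictor separates the classes completely:

```r
# Made-up toy data in which x separates the two classes perfectly
x <- 1:10
y <- c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)

# Collect warnings instead of printing them, so they can be inspected
warnings_seen <- character(0)
toy_fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    warnings_seen <<- c(warnings_seen, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(warnings_seen)

# Fitted probabilities pile up at the extremes
print(round(fitted(toy_fit), 4))
```

When the warning shows up, very large coefficients and standard errors are a reason to read the individual estimates with caution, even if the predictions themselves are accurate.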
Create a summary of the model:
summary(full.fit)
##
## Call:
## glm(formula = URL_Type_obf_Type ~ ., family = binomial, data = train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.0169 -0.1310 -0.0009 0.0187 3.5715
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.225e+02 1.032e+01 11.863 < 2e-16 ***
## Querylength -1.878e-02 1.814e-02 -1.035 0.300586
## path_token_count 7.468e-01 7.984e-02 9.354 < 2e-16 ***
## longdomaintokenlen 9.299e-01 1.200e-01 7.747 9.41e-15 ***
## avgpathtokenlen 3.062e-01 1.071e-01 2.859 0.004249 **
## charcompvowels 3.233e-01 4.190e-02 7.716 1.20e-14 ***
## charcompace -1.719e-01 3.656e-02 -4.702 2.58e-06 ***
## urlLen -1.972e+00 1.598e-01 -12.341 < 2e-16 ***
## pathLength 1.710e+00 1.556e-01 10.993 < 2e-16 ***
## fileNameLen 4.609e-02 1.015e-02 4.540 5.63e-06 ***
## pathurlRatio -9.705e+01 9.864e+00 -9.839 < 2e-16 ***
## ArgUrlRatio -5.506e+01 1.377e+01 -3.997 6.41e-05 ***
## argDomanRatio 2.565e+00 4.587e-01 5.592 2.24e-08 ***
## domainUrlRatio -1.645e+02 1.344e+01 -12.240 < 2e-16 ***
## pathDomainRatio 1.684e-01 2.392e-01 0.704 0.481464
## argPathRatio 2.829e+01 7.919e+00 3.572 0.000354 ***
## CharacterContinuityRate 2.608e+01 1.978e+00 13.185 < 2e-16 ***
## host_letter_count 2.876e+00 2.514e-01 11.438 < 2e-16 ***
## LongestPathTokenLength 6.379e-03 9.402e-03 0.679 0.497441
## Domain_LongestWordLength -1.430e+00 1.142e-01 -12.522 < 2e-16 ***
## Path_LongestWordLength 7.223e-02 6.793e-02 1.063 0.287666
## sub.Directory_LongestWordLength 2.232e-02 6.667e-02 0.335 0.737770
## NumberRate_URL 1.976e+01 2.035e+00 9.710 < 2e-16 ***
## Entropy_URL -3.923e+01 3.901e+00 -10.058 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 9308.6 on 7280 degrees of freedom
## Residual deviance: 1084.3 on 7257 degrees of freedom
## AIC: 1132.3
##
## Number of Fisher Scoring iterations: 14
You cannot interpret the coefficients in logistic regression as "the change in Y for a one-unit change in X". Each coefficient is the change in the log odds of the outcome for a one-unit increase in the feature, holding the other features constant.
This is where the odds ratio can be quite helpful. The beta coefficients, which are on the log-odds scale, can be converted to odds ratios by exponentiating them: exp(beta).
In order to produce the odds ratios in R, we will use the following exp(coef()) syntax:
exp(coef(full.fit))
## (Intercept) Querylength
## 1.534765e+53 9.813962e-01
## path_token_count longdomaintokenlen
## 2.110307e+00 2.534237e+00
## avgpathtokenlen charcompvowels
## 1.358295e+00 1.381643e+00
## charcompace urlLen
## 8.420585e-01 1.392444e-01
## pathLength fileNameLen
## 5.530344e+00 1.047168e+00
## pathurlRatio ArgUrlRatio
## 7.085263e-43 1.224803e-24
## argDomanRatio domainUrlRatio
## 1.300145e+01 3.467725e-72
## pathDomainRatio argPathRatio
## 1.183402e+00 1.933168e+12
## CharacterContinuityRate host_letter_count
## 2.126669e+11 1.774283e+01
## LongestPathTokenLength Domain_LongestWordLength
## 1.006400e+00 2.393626e-01
## Path_LongestWordLength sub.Directory_LongestWordLength
## 1.074900e+00 1.022574e+00
## NumberRate_URL Entropy_URL
## 3.821082e+08 9.149558e-18
The interpretation of an odds ratio is the multiplicative change in the odds of the outcome (here, a benign URL, coded 1) resulting from a one-unit increase in the feature, holding the other features constant. If the value is greater than 1, it indicates that, as the feature increases, the odds of the outcome increase. Conversely, a value less than 1 means that, as the feature increases, the odds of the outcome decrease.
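As a worked example, the sketch below takes three estimates from the summary(full.fit) output above and converts them by hand. The 0.01 step used for domainUrlRatio is an assumption chosen for illustration: the variable appears to be a ratio between 0 and 1, so a full one-unit increase would span its whole range and gives an odds ratio that is hard to read.

```r
# Estimates copied from the summary(full.fit) output above
beta <- c(path_token_count = 0.7468, urlLen = -1.972, domainUrlRatio = -164.5)

# Odds ratio for a one-unit increase in each feature
odds_ratio <- exp(beta)

# The same information expressed as a percentage change in the odds
pct_change <- (odds_ratio - 1) * 100

# Odds ratio for a 0.01 increase in domainUrlRatio (assumed step size)
or_small_step <- exp(beta[["domainUrlRatio"]] * 0.01)

print(signif(odds_ratio, 4))
print(round(pct_change[c("path_token_count", "urlLen")], 1))
print(round(or_small_step, 3))
```

Read this way, each additional path token roughly doubles the odds that a URL is benign, while each additional character of URL length cuts those odds by about 86 percent, with the other features held constant.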
Let us now run a model with the coefficients with the lowest p-values.
You will first have to create a vector of the predicted probabilities, as follows:
train.probs <- predict(full.fit, type = "response")
# inspect the first 5 probabilities
#train.probs[1:5]
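The type = "response" argument matters: without it, predict() on a glm object returns values on the log-odds (link) scale rather than probabilities. A minimal sketch on simulated data (x, y and toy_fit are invented for illustration):

```r
set.seed(1)
# Simulated toy data: one predictor and a 0/1 outcome
x <- rnorm(200)
y <- rbinom(200, size = 1, prob = plogis(0.5 + 1.2 * x))
toy_fit <- glm(y ~ x, family = binomial)

log_odds <- predict(toy_fit)                    # default: type = "link"
probs <- predict(toy_fit, type = "response")    # probabilities between 0 and 1

# The inverse logit, plogis(), maps log-odds to probabilities
print(all.equal(unname(plogis(log_odds)), unname(probs)))
print(range(probs))
```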
Next, we need to evaluate how well the model performed in training and then evaluate how it fits on the test set. A quick way to do this is to produce a confusion matrix. The default threshold by which the function assigns a class is 0.50, which is to say that any probability at or above 0.50 is classified as benign (coded 1) and any probability below it as spam (coded 0):
trainY1<-mydata1$URL_Type_obf_Type[ind==1]
testY1<-mydata1$URL_Type_obf_Type[ind==2]
#install.packages("caret")
#library(caret)
#install.packages("InformationValue")
library(InformationValue)
## Warning: package 'InformationValue' was built under R version 4.2.1
confusionMatrix(trainY1,train.probs)
misClassError(trainY1, train.probs)
## [1] 0.0209
test.probs <- predict(full.fit, newdata = test1, type = "response")
#misclassification error
misClassError(testY1, test.probs)
## [1] 0.0221
# confusion matrix
confusionMatrix(testY1, test.probs)
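For readers who want to see what these two functions compute, here is a base R sketch of a confusion matrix and a misclassification error at a 0.5 threshold. It uses made-up labels and probabilities, and it is a hand-written equivalent under that assumption, not the InformationValue implementation:

```r
# Made-up actual labels (1 = benign, 0 = spam) and predicted probabilities
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
probs <- c(0.92, 0.10, 0.45, 0.80, 0.60, 0.05, 0.51, 0.30)

# Turn probabilities into class predictions at the chosen threshold
threshold <- 0.5
predicted <- ifelse(probs >= threshold, 1, 0)

# Cross-tabulate predictions against actual labels
conf_mat <- table(predicted = predicted, actual = actual)
print(conf_mat)

# Misclassification error: share of rows where prediction and label differ
mis_error <- mean(predicted != actual)
accuracy <- 1 - mis_error
cat("misclassification error:", mis_error, "accuracy:", accuracy, "\n")
```

By the same arithmetic, the errors reported above (0.0209 on the training set and 0.0221 on the test set) correspond to accuracies of about 97.9 and 97.8 percent.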