The goal of this handout is to develop a logistic regression model to discuss the odds that a URL is benign or spam (the URL_type variable). This mini tutorial lists the steps we need to follow to complete this activity. This is just the starting point; during the remaining two weeks we will continue working on this document.

getwd()
## [1] "C:/Users/npenaper/Downloads"
final1<-read.csv("group3.csv",sep = ",",header = TRUE)
mydata1<-na.omit(final1)
#summary(mydata1)
str(mydata1)
## 'data.frame':    8051 obs. of  80 variables:
##  $ Querylength                    : int  0 0 19 0 0 0 31 0 17 0 ...
##  $ domain_token_count             : int  2 3 2 2 2 2 2 2 2 2 ...
##  $ path_token_count               : int  12 12 10 10 9 13 9 9 11 7 ...
##  $ avgdomaintokenlen              : num  5.5 5 6 5.5 2.5 4.5 5.5 5.5 6 4.5 ...
##  $ longdomaintokenlen             : int  8 10 9 9 3 6 8 9 9 6 ...
##  $ avgpathtokenlen                : num  4.08 3.58 2.25 4.1 4.56 ...
##  $ tld                            : int  2 3 2 2 2 2 2 2 2 2 ...
##  $ charcompvowels                 : int  15 12 9 15 6 16 22 13 9 9 ...
##  $ charcompace                    : int  7 8 5 11 3 9 17 9 8 4 ...
##  $ ldl_url                        : int  0 2 0 0 0 1 12 0 0 2 ...
##  $ ldl_domain                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ldl_path                       : int  0 2 0 0 0 1 12 0 0 2 ...
##  $ ldl_filename                   : int  0 2 0 0 0 0 11 0 0 0 ...
##  $ ldl_getArg                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ dld_url                        : int  0 0 0 0 0 0 8 0 0 0 ...
##  $ dld_domain                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ dld_path                       : int  0 0 0 0 0 0 8 0 0 0 ...
##  $ dld_filename                   : int  0 0 0 0 0 0 8 0 0 0 ...
##  $ dld_getArg                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ urlLen                         : int  80 78 68 69 62 98 114 63 72 54 ...
##  $ domainlength                   : int  12 17 13 12 6 10 12 12 13 10 ...
##  $ pathLength                     : int  61 54 48 50 49 81 95 44 52 37 ...
##  $ subDirLen                      : int  61 54 48 50 49 81 95 44 52 37 ...
##  $ fileNameLen                    : int  2 40 2 44 17 2 80 38 2 7 ...
##  $ this.fileExtLen                : int  2 4 2 4 4 2 5 4 2 3 ...
##  $ ArgLen                         : int  2 2 35 2 2 2 37 2 39 2 ...
##  $ pathurlRatio                   : num  0.762 0.692 0.706 0.725 0.79 ...
##  $ ArgUrlRatio                    : num  0.025 0.0256 0.5147 0.029 0.0323 ...
##  $ argDomanRatio                  : num  0.167 0.118 2.692 0.167 0.333 ...
##  $ domainUrlRatio                 : num  0.15 0.2179 0.1912 0.1739 0.0968 ...
##  $ pathDomainRatio                : num  5.08 3.18 3.69 4.17 8.17 ...
##  $ argPathRatio                   : num  0.0328 0.037 0.7292 0.04 0.0408 ...
##  $ executable                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ isPortEighty                   : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ NumberofDotsinURL              : int  1 3 2 2 2 1 5 2 2 2 ...
##  $ ISIpAddressInDomainName        : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ CharacterContinuityRate        : num  0.75 0.647 0.769 0.833 0.667 ...
##  $ LongestVariableValue           : int  -1 -1 13 -1 -1 -1 31 -1 11 -1 ...
##  $ URL_DigitCount                 : int  10 8 6 6 24 35 23 6 6 9 ...
##  $ host_DigitCount                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Directory_DigitCount           : int  6 6 0 -1 8 19 23 -1 0 3 ...
##  $ File_name_DigitCount           : int  2 2 0 6 16 16 0 6 0 6 ...
##  $ Extension_DigitCount           : int  2 0 6 0 0 0 0 0 6 0 ...
##  $ Query_DigitCount               : int  -1 -1 6 -1 -1 -1 0 -1 6 -1 ...
##  $ URL_Letter_Count               : int  54 54 48 50 26 47 77 45 50 35 ...
##  $ host_letter_count              : int  11 15 12 11 5 9 11 11 12 9 ...
##  $ Directory_LetterCount          : int  0 0 3 -1 13 16 54 -1 3 18 ...
##  $ Filename_LetterCount           : int  0 31 3 31 0 5 3 26 3 1 ...
##  $ Extension_LetterCount          : int  39 4 26 4 4 13 5 4 28 3 ...
##  $ Query_LetterCount              : int  -1 -1 12 -1 -1 -1 26 -1 10 -1 ...
##  $ LongestPathTokenLength         : int  48 40 39 44 17 27 40 38 43 18 ...
##  $ Domain_LongestWordLength       : int  8 10 9 9 3 6 8 9 9 6 ...
##  $ Path_LongestWordLength         : int  8 8 7 7 7 13 12 8 7 7 ...
##  $ sub.Directory_LongestWordLength: int  8 7 7 7 7 5 12 8 7 7 ...
##  $ Arguments_LongestWordLength    : int  -1 -1 13 -1 -1 -1 31 -1 11 -1 ...
##  $ URL_sensitiveWord              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ URLQueries_variable            : int  0 0 2 0 0 0 1 0 3 0 ...
##  $ spcharUrl                      : int  5 4 3 2 7 6 3 2 3 4 ...
##  $ delimeter_Domain               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ delimeter_path                 : int  7 8 4 8 2 7 4 7 4 3 ...
##  $ delimeter_Count                : int  -1 -1 3 -1 -1 -1 1 -1 5 -1 ...
##  $ NumberRate_URL                 : num  0.125 0.1026 0.0882 0.087 0.3871 ...
##  $ NumberRate_Domain              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NumberRate_DirectoryName       : num  -1 0.667 0 -1 0.296 ...
##  $ NumberRate_FileName            : num  -1 0.0444 0.1395 -1 0.7273 ...
##  $ NumberRate_Extension           : num  -1 0 0.154 -1 0 ...
##  $ NumberRate_AfterPath           : num  -1 -1 0.171 -1 -1 ...
##  $ SymbolCount_URL                : int  6 7 9 4 9 7 10 4 11 6 ...
##  $ SymbolCount_Domain             : int  1 2 1 1 1 1 1 1 1 1 ...
##  $ SymbolCount_Directoryname      : int  -1 2 1 -1 5 -1 1 -1 1 2 ...
##  $ SymbolCount_FileName           : int  -1 1 5 -1 1 -1 6 -1 7 1 ...
##  $ SymbolCount_Extension          : int  -1 0 4 -1 0 -1 5 -1 6 0 ...
##  $ SymbolCount_Afterpath          : int  -1 -1 3 -1 -1 -1 4 -1 5 -1 ...
##  $ Entropy_URL                    : num  0.677 0.716 0.747 0.733 0.743 ...
##  $ Entropy_Domain                 : num  0.861 0.777 0.834 0.861 1 ...
##  $ Entropy_DirectoryName          : num  -1 0.693 0.655 -1 0.786 ...
##  $ Entropy_Filename               : num  -1 0.738 0.83 -1 0.809 ...
##  $ Entropy_Extension              : num  -1 1 0.836 -1 1 ...
##  $ Entropy_Afterpath              : num  -1 -1 0.823 -1 -1 ...
##  $ URL_Type_obf_Type              : chr  "benign" "benign" "benign" "benign" ...
##  - attr(*, "na.action")= 'omit' Named int [1:6428] 3 4 7 8 11 14 17 18 19 20 ...
##   ..- attr(*, "names")= chr [1:6428] "3" "4" "7" "8" ...
mydata1$URL_Type_obf_Type<-as.factor(mydata1$URL_Type_obf_Type)
str(mydata1)
## 'data.frame':    8051 obs. of  80 variables:
##  $ Querylength                    : int  0 0 19 0 0 0 31 0 17 0 ...
##  $ domain_token_count             : int  2 3 2 2 2 2 2 2 2 2 ...
##  $ path_token_count               : int  12 12 10 10 9 13 9 9 11 7 ...
##  $ avgdomaintokenlen              : num  5.5 5 6 5.5 2.5 4.5 5.5 5.5 6 4.5 ...
##  $ longdomaintokenlen             : int  8 10 9 9 3 6 8 9 9 6 ...
##  $ avgpathtokenlen                : num  4.08 3.58 2.25 4.1 4.56 ...
##  $ tld                            : int  2 3 2 2 2 2 2 2 2 2 ...
##  $ charcompvowels                 : int  15 12 9 15 6 16 22 13 9 9 ...
##  $ charcompace                    : int  7 8 5 11 3 9 17 9 8 4 ...
##  $ ldl_url                        : int  0 2 0 0 0 1 12 0 0 2 ...
##  $ ldl_domain                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ ldl_path                       : int  0 2 0 0 0 1 12 0 0 2 ...
##  $ ldl_filename                   : int  0 2 0 0 0 0 11 0 0 0 ...
##  $ ldl_getArg                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ dld_url                        : int  0 0 0 0 0 0 8 0 0 0 ...
##  $ dld_domain                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ dld_path                       : int  0 0 0 0 0 0 8 0 0 0 ...
##  $ dld_filename                   : int  0 0 0 0 0 0 8 0 0 0 ...
##  $ dld_getArg                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ urlLen                         : int  80 78 68 69 62 98 114 63 72 54 ...
##  $ domainlength                   : int  12 17 13 12 6 10 12 12 13 10 ...
##  $ pathLength                     : int  61 54 48 50 49 81 95 44 52 37 ...
##  $ subDirLen                      : int  61 54 48 50 49 81 95 44 52 37 ...
##  $ fileNameLen                    : int  2 40 2 44 17 2 80 38 2 7 ...
##  $ this.fileExtLen                : int  2 4 2 4 4 2 5 4 2 3 ...
##  $ ArgLen                         : int  2 2 35 2 2 2 37 2 39 2 ...
##  $ pathurlRatio                   : num  0.762 0.692 0.706 0.725 0.79 ...
##  $ ArgUrlRatio                    : num  0.025 0.0256 0.5147 0.029 0.0323 ...
##  $ argDomanRatio                  : num  0.167 0.118 2.692 0.167 0.333 ...
##  $ domainUrlRatio                 : num  0.15 0.2179 0.1912 0.1739 0.0968 ...
##  $ pathDomainRatio                : num  5.08 3.18 3.69 4.17 8.17 ...
##  $ argPathRatio                   : num  0.0328 0.037 0.7292 0.04 0.0408 ...
##  $ executable                     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ isPortEighty                   : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ NumberofDotsinURL              : int  1 3 2 2 2 1 5 2 2 2 ...
##  $ ISIpAddressInDomainName        : int  -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
##  $ CharacterContinuityRate        : num  0.75 0.647 0.769 0.833 0.667 ...
##  $ LongestVariableValue           : int  -1 -1 13 -1 -1 -1 31 -1 11 -1 ...
##  $ URL_DigitCount                 : int  10 8 6 6 24 35 23 6 6 9 ...
##  $ host_DigitCount                : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Directory_DigitCount           : int  6 6 0 -1 8 19 23 -1 0 3 ...
##  $ File_name_DigitCount           : int  2 2 0 6 16 16 0 6 0 6 ...
##  $ Extension_DigitCount           : int  2 0 6 0 0 0 0 0 6 0 ...
##  $ Query_DigitCount               : int  -1 -1 6 -1 -1 -1 0 -1 6 -1 ...
##  $ URL_Letter_Count               : int  54 54 48 50 26 47 77 45 50 35 ...
##  $ host_letter_count              : int  11 15 12 11 5 9 11 11 12 9 ...
##  $ Directory_LetterCount          : int  0 0 3 -1 13 16 54 -1 3 18 ...
##  $ Filename_LetterCount           : int  0 31 3 31 0 5 3 26 3 1 ...
##  $ Extension_LetterCount          : int  39 4 26 4 4 13 5 4 28 3 ...
##  $ Query_LetterCount              : int  -1 -1 12 -1 -1 -1 26 -1 10 -1 ...
##  $ LongestPathTokenLength         : int  48 40 39 44 17 27 40 38 43 18 ...
##  $ Domain_LongestWordLength       : int  8 10 9 9 3 6 8 9 9 6 ...
##  $ Path_LongestWordLength         : int  8 8 7 7 7 13 12 8 7 7 ...
##  $ sub.Directory_LongestWordLength: int  8 7 7 7 7 5 12 8 7 7 ...
##  $ Arguments_LongestWordLength    : int  -1 -1 13 -1 -1 -1 31 -1 11 -1 ...
##  $ URL_sensitiveWord              : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ URLQueries_variable            : int  0 0 2 0 0 0 1 0 3 0 ...
##  $ spcharUrl                      : int  5 4 3 2 7 6 3 2 3 4 ...
##  $ delimeter_Domain               : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ delimeter_path                 : int  7 8 4 8 2 7 4 7 4 3 ...
##  $ delimeter_Count                : int  -1 -1 3 -1 -1 -1 1 -1 5 -1 ...
##  $ NumberRate_URL                 : num  0.125 0.1026 0.0882 0.087 0.3871 ...
##  $ NumberRate_Domain              : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ NumberRate_DirectoryName       : num  -1 0.667 0 -1 0.296 ...
##  $ NumberRate_FileName            : num  -1 0.0444 0.1395 -1 0.7273 ...
##  $ NumberRate_Extension           : num  -1 0 0.154 -1 0 ...
##  $ NumberRate_AfterPath           : num  -1 -1 0.171 -1 -1 ...
##  $ SymbolCount_URL                : int  6 7 9 4 9 7 10 4 11 6 ...
##  $ SymbolCount_Domain             : int  1 2 1 1 1 1 1 1 1 1 ...
##  $ SymbolCount_Directoryname      : int  -1 2 1 -1 5 -1 1 -1 1 2 ...
##  $ SymbolCount_FileName           : int  -1 1 5 -1 1 -1 6 -1 7 1 ...
##  $ SymbolCount_Extension          : int  -1 0 4 -1 0 -1 5 -1 6 0 ...
##  $ SymbolCount_Afterpath          : int  -1 -1 3 -1 -1 -1 4 -1 5 -1 ...
##  $ Entropy_URL                    : num  0.677 0.716 0.747 0.733 0.743 ...
##  $ Entropy_Domain                 : num  0.861 0.777 0.834 0.861 1 ...
##  $ Entropy_DirectoryName          : num  -1 0.693 0.655 -1 0.786 ...
##  $ Entropy_Filename               : num  -1 0.738 0.83 -1 0.809 ...
##  $ Entropy_Extension              : num  -1 1 0.836 -1 1 ...
##  $ Entropy_Afterpath              : num  -1 -1 0.823 -1 -1 ...
##  $ URL_Type_obf_Type              : Factor w/ 2 levels "benign","spam": 1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "na.action")= 'omit' Named int [1:6428] 3 4 7 8 11 14 17 18 19 20 ...
##   ..- attr(*, "names")= chr [1:6428] "3" "4" "7" "8" ...
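# recode the outcome as numeric: 1 = benign, 0 = spam
mydata1$URL_Type_obf_Type <- ifelse(mydata1$URL_Type_obf_Type == "benign", 1, 0)
View(mydata1)
# keep 23 candidate predictors plus the outcome (column 80)
mydata1<-mydata1[,c(1,3,5,6,8,9,20,22,24,27,28,29,30,31,32,37,46,51,52,53,54,62,74,80)]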

Train and Test Data

The purpose of creating two datasets from the original one is to assess how accurately the model predicts data that it has not seen during training.

There are a number of ways to split our data into train and test sets: 50/50, 60/40, 70/30, 80/20, 90/10, and so forth. The data split that you select should be based on your experience and judgment. For this exercise, we will use a 90/10 split, as follows:

set.seed(123)  # fix the seed so that the random split is reproducible
ind <- sample(2, nrow(mydata1), replace = TRUE, prob = c(0.9, 0.1))

Partitioning the data:

train1 <- mydata1[ind==1, ]  #the training set

test1 <- mydata1[ind==2, ]   # the testing set 

You can confirm the dimensions of both sets as follows:

dim(train1)
## [1] 7281   24
dim(test1)
## [1] 770  24
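
Because sample(2, ..., prob = c(0.9, 0.1)) assigns each row independently, the resulting split is only approximately 90/10 (here 7281 of 8051 rows, about 90.4%, ended up in the training set). The following is a minimal sketch, on a made-up data frame called toy (an assumption for illustration, not part of the handout's data), of how an exact 90/10 split could be drawn instead:

```r
set.seed(123)

# Hypothetical toy data: 1000 rows with a binary outcome
toy <- data.frame(id = 1:1000, y = rbinom(1000, 1, 0.3))

# Approach used in this handout: each row is assigned to group 1 or 2
# independently, so the group shares are only close to 0.9 and 0.1
ind <- sample(2, nrow(toy), replace = TRUE, prob = c(0.9, 0.1))
approx_share <- table(ind) / nrow(toy)
print(approx_share)

# Alternative: draw exactly 90% of the row indices without replacement
train_idx <- sample(nrow(toy), size = floor(0.9 * nrow(toy)))
train_toy <- toy[train_idx, ]
test_toy  <- toy[-train_idx, ]

dim(train_toy)
dim(test_toy)
```

Either approach is acceptable for this exercise; the exact version is simply easier to report.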

To ensure that the outcome variable is similarly distributed in the two datasets, we will perform the following check:

table(train1$URL_Type_obf_Type)
## 
##    0    1 
## 4825 2456
table(test1$URL_Type_obf_Type)
## 
##   0   1 
## 517 253

This is an acceptable ratio of our outcomes in the two datasets; with this, we can begin the modeling and evaluation.
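
The counts are easier to compare as proportions. A short sketch using the counts printed above (recall that 0 = spam and 1 = benign after the recoding step); on the real data the same result would come from prop.table(table(train1$URL_Type_obf_Type)):

```r
# Counts copied from the table() output above (0 = spam, 1 = benign)
train_counts <- c(spam = 4825, benign = 2456)
test_counts  <- c(spam = 517,  benign = 253)

# Convert counts to class proportions
train_prop <- prop.table(train_counts)
test_prop  <- prop.table(test_counts)

round(train_prop, 3)
round(test_prop, 3)
```

Roughly one third of the URLs are benign in both sets, so the class mix of the test set is close to that of the training set.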

Modeling and Evaluation

We will use the function glm() (from base R) for the logistic regression model.

A base R installation includes the glm() function, which fits generalized linear models, a class of models that includes logistic regression. The code syntax is similar to the lm() function that we used for linear regression. One difference is that we must use the family = binomial argument, which tells R to fit a logistic regression rather than another type of generalized linear model. We will start by creating a model that includes all of the features on the train set and see how it performs on the test set:

# data = train1 already tells glm() where to find the variables, so attach() is not needed
full.fit <- glm(URL_Type_obf_Type ~ ., family = binomial, data = train1)
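
As a minimal sketch of the glm() call pattern described above, here is a one-predictor logistic regression on R's built-in mtcars data (used only because it ships with R; it is unrelated to the URL data):

```r
# am is a 0/1 outcome (0 = automatic, 1 = manual); wt is car weight
toy_fit <- glm(am ~ wt, family = binomial, data = mtcars)

# Coefficients are on the log-odds scale
coef(toy_fit)

# type = "response" returns fitted probabilities of am == 1
toy_probs <- predict(toy_fit, type = "response")
head(toy_probs)
```

The only change relative to an lm() call is the function name and the family argument; binomial uses the logit link by default.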
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Create a summary of the model:

summary(full.fit)
## 
## Call:
## glm(formula = URL_Type_obf_Type ~ ., family = binomial, data = train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.0169  -0.1310  -0.0009   0.0187   3.5715  
## 
## Coefficients:
##                                   Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      1.225e+02  1.032e+01  11.863  < 2e-16 ***
## Querylength                     -1.878e-02  1.814e-02  -1.035 0.300586    
## path_token_count                 7.468e-01  7.984e-02   9.354  < 2e-16 ***
## longdomaintokenlen               9.299e-01  1.200e-01   7.747 9.41e-15 ***
## avgpathtokenlen                  3.062e-01  1.071e-01   2.859 0.004249 ** 
## charcompvowels                   3.233e-01  4.190e-02   7.716 1.20e-14 ***
## charcompace                     -1.719e-01  3.656e-02  -4.702 2.58e-06 ***
## urlLen                          -1.972e+00  1.598e-01 -12.341  < 2e-16 ***
## pathLength                       1.710e+00  1.556e-01  10.993  < 2e-16 ***
## fileNameLen                      4.609e-02  1.015e-02   4.540 5.63e-06 ***
## pathurlRatio                    -9.705e+01  9.864e+00  -9.839  < 2e-16 ***
## ArgUrlRatio                     -5.506e+01  1.377e+01  -3.997 6.41e-05 ***
## argDomanRatio                    2.565e+00  4.587e-01   5.592 2.24e-08 ***
## domainUrlRatio                  -1.645e+02  1.344e+01 -12.240  < 2e-16 ***
## pathDomainRatio                  1.684e-01  2.392e-01   0.704 0.481464    
## argPathRatio                     2.829e+01  7.919e+00   3.572 0.000354 ***
## CharacterContinuityRate          2.608e+01  1.978e+00  13.185  < 2e-16 ***
## host_letter_count                2.876e+00  2.514e-01  11.438  < 2e-16 ***
## LongestPathTokenLength           6.379e-03  9.402e-03   0.679 0.497441    
## Domain_LongestWordLength        -1.430e+00  1.142e-01 -12.522  < 2e-16 ***
## Path_LongestWordLength           7.223e-02  6.793e-02   1.063 0.287666    
## sub.Directory_LongestWordLength  2.232e-02  6.667e-02   0.335 0.737770    
## NumberRate_URL                   1.976e+01  2.035e+00   9.710  < 2e-16 ***
## Entropy_URL                     -3.923e+01  3.901e+00 -10.058  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 9308.6  on 7280  degrees of freedom
## Residual deviance: 1084.3  on 7257  degrees of freedom
## AIC: 1132.3
## 
## Number of Fisher Scoring iterations: 14

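Unlike in linear regression, you cannot interpret the coefficients in logistic regression as “the change in Y for a one-unit change in X”. Each coefficient is instead the change in the log odds of the outcome for a one-unit change in the feature.

This is where the odds ratio can be quite helpful. The beta coefficients, which are on the log-odds scale, can be converted to odds ratios by exponentiating them: exp(beta).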
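
For reference, the relationship between the coefficients and the odds ratio can be written out as follows, where p denotes the probability that the outcome equals 1 (a benign URL under the coding used in this handout) and x_1, ..., x_k are the features; the symbols are introduced here only for illustration:

```latex
\log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k
\qquad\Longrightarrow\qquad
\frac{p}{1-p} = e^{\beta_0}\, e^{\beta_1 x_1} \cdots e^{\beta_k x_k}

% Increasing x_j by one unit, with the other features held fixed,
% multiplies the odds by e^{\beta_j}:
\frac{\operatorname{odds}(x_j + 1)}{\operatorname{odds}(x_j)} = e^{\beta_j}
```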

In order to produce the odds ratios in R, we will use the following exp(coef()) syntax:

exp(coef(full.fit))
##                     (Intercept)                     Querylength 
##                    1.534765e+53                    9.813962e-01 
##                path_token_count              longdomaintokenlen 
##                    2.110307e+00                    2.534237e+00 
##                 avgpathtokenlen                  charcompvowels 
##                    1.358295e+00                    1.381643e+00 
##                     charcompace                          urlLen 
##                    8.420585e-01                    1.392444e-01 
##                      pathLength                     fileNameLen 
##                    5.530344e+00                    1.047168e+00 
##                    pathurlRatio                     ArgUrlRatio 
##                    7.085263e-43                    1.224803e-24 
##                   argDomanRatio                  domainUrlRatio 
##                    1.300145e+01                    3.467725e-72 
##                 pathDomainRatio                    argPathRatio 
##                    1.183402e+00                    1.933168e+12 
##         CharacterContinuityRate               host_letter_count 
##                    2.126669e+11                    1.774283e+01 
##          LongestPathTokenLength        Domain_LongestWordLength 
##                    1.006400e+00                    2.393626e-01 
##          Path_LongestWordLength sub.Directory_LongestWordLength 
##                    1.074900e+00                    1.022574e+00 
##                  NumberRate_URL                     Entropy_URL 
##                    3.821082e+08                    9.149558e-18

The odds ratio is the factor by which the odds of the outcome (here, a benign URL, coded 1) are multiplied for a one-unit increase in the feature, holding the other features constant. If the value is greater than 1, it indicates that, as the feature increases, the odds of the outcome increase. Conversely, a value less than 1 would mean that, as the feature increases, the odds of the outcome decrease.

Let us now run a model with the features that have the lowest p-values.
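
As a worked example, the sketch below takes two rounded estimates from the summary() output above. For a ratio feature such as pathurlRatio, whose printed values lie between 0 and 1, a full one-unit change is not realistic, so rescaling to a 0.01 step (a choice made here for illustration) gives a more readable odds ratio:

```r
# Rounded estimates copied from the summary(full.fit) output
beta <- c(path_token_count = 0.7468, pathurlRatio = -97.05)

# Odds ratio for one additional path token (about 2.11)
or_unit <- exp(beta["path_token_count"])

# The same effect expressed as a percentage change in the odds
pct_change <- (or_unit - 1) * 100

# Odds ratio for a 0.01 increase in pathurlRatio (about 0.38)
or_ratio_step <- exp(beta["pathurlRatio"] * 0.01)

round(c(or_unit, pct_change, or_ratio_step), 3)
```

Read this way, each extra path token roughly doubles the odds of a benign URL, while each 0.01 increase in pathurlRatio multiplies those odds by about 0.38, all else being equal.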
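
One possible way to do this is sketched below on simulated data (the data frame toy, its columns, and the 0.05 cut-off are all assumptions for illustration; this is not the handout's fitted model). It reads the p-values from the coefficient table, keeps the terms below the cut-off, and refits:

```r
set.seed(1)
n <- 500

# Simulated data: x1 and x2 drive the outcome, noise does not
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- rbinom(n, 1, plogis(0.5 + 1.2 * toy$x1 - 0.8 * toy$x2))

# Full model with every feature
full_toy <- glm(y ~ ., family = binomial, data = toy)

# p-values of all terms except the intercept
pvals <- summary(full_toy)$coefficients[-1, "Pr(>|z|)"]

# Keep the terms below the chosen cut-off and refit
keep <- names(pvals)[pvals < 0.05]
reduced_toy <- glm(reformulate(keep, response = "y"),
                   family = binomial, data = toy)

# Compare the two fits; a lower AIC is preferred
AIC(full_toy, reduced_toy)
```

For the URL data, the same pattern would apply to full.fit and train1. Selecting features by p-value alone is a simple heuristic, so the reduced model should still be checked on the test set.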

Testing the model

You will first have to create a vector of the predicted probabilities, as follows:

train.probs <- predict(full.fit, type = "response")
# inspect the first 5 probabilities
#train.probs[1:5]

Next, we need to evaluate how well the model performed on the training set and then how well it fits the test set. A quick way to do this is to produce a confusion matrix. The default threshold by which the function selects either benign or spam is 0.50, which is to say that any probability at or above 0.50 is classified as benign (coded 1):

trainY1<-mydata1$URL_Type_obf_Type[ind==1]
testY1<-mydata1$URL_Type_obf_Type[ind==2]
#install.packages("caret")
#library(caret)
#install.packages("InformationValue")
library(InformationValue)
## Warning: package 'InformationValue' was built under R version 4.2.1
confusionMatrix(trainY1,train.probs)
misClassError(trainY1, train.probs)
## [1] 0.0209
test.probs <- predict(full.fit, newdata = test1, type = "response")
#misclassification error
misClassError(testY1, test.probs)
## [1] 0.0221
# confusion matrix
confusionMatrix(testY1, test.probs)
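
The thresholding described above can also be sketched in base R, which may be useful if the InformationValue package is not available. The vectors below are made-up values for illustration only (1 = benign, 0 = spam, as coded earlier):

```r
# Hypothetical actual classes and predicted probabilities
actual <- c(1, 0, 1, 1, 0, 0, 1, 0)
probs  <- c(0.92, 0.10, 0.48, 0.77, 0.61, 0.05, 0.50, 0.30)

# Classify as 1 when the probability is at or above the 0.50 threshold
predicted <- ifelse(probs >= 0.5, 1, 0)

# Confusion matrix: rows are predictions, columns are actual classes
cm <- table(Predicted = factor(predicted, levels = c(0, 1)),
            Actual    = factor(actual,    levels = c(0, 1)))
print(cm)

# Misclassification error: share of observations predicted incorrectly
misclass <- mean(predicted != actual)
misclass
```

On the real data, actual would be trainY1 or testY1 and probs would be train.probs or test.probs.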