The goal of this handout is to develop a logistic regression model for the odds of the presence of a benign or a spam URL_type. This mini tutorial lists the steps we need to follow in order to complete this activity. It is just a starting point; during the remaining two weeks we will continue working on this document.

getwd()
## [1] "C:/Users/npenaper/Downloads"
final1<-read.csv("group5.csv",sep = ",",header = TRUE)
mydata1<-na.omit(final1)
#summary(mydata)
str(mydata1)
## 'data.frame':    10707 obs. of  13 variables:
##  $ avgpathtokenlen        : num  3.25 5.62 3 2.67 3.5 ...
##  $ pathurlRatio           : num  0.61 0.732 0.574 0.644 0.734 0.719 0.549 0.869 0.45 0.747 ...
##  $ ArgUrlRatio            : num  0.195 0.028 0.5 0.522 0.453 0.547 0.028 0.02 0.1 0.54 ...
##  $ argDomanRatio          : num  0.889 0.167 3.091 1.88 2.9 ...
##  $ domainUrlRatio         : num  0.22 0.169 0.162 0.278 0.156 0.172 0.352 0.061 0.375 0.172 ...
##  $ pathDomainRatio        : num  2.78 4.33 3.55 2.32 4.7 ...
##  $ argPathRatio           : num  0.32 0.039 0.872 0.81 0.617 0.761 0.051 0.023 0.222 0.723 ...
##  $ CharacterContinuityRate: num  0.778 0.75 0.727 0.44 0.7 0.727 0.6 0.667 0.533 0.533 ...
##  $ NumberRate_URL         : num  0.146 0 0.088 0.044 0.063 0.031 0.042 0.091 0.025 0.184 ...
##  $ NumberRate_FileName    : num  0.158 0 0.171 0.07 0.108 0.056 0.231 0 0.059 0.214 ...
##  $ NumberRate_AfterPath   : num  0.375 -1 0.261 0.085 0.138 0.057 -1 -1 0.25 0.255 ...
##  $ Entropy_Domain         : num  1 0.843 0.895 0.832 0.94 0.947 0.757 1 0.77 0.817 ...
##  $ class                  : chr  "Defacement" "malware" "malware" "Defacement" ...
mydata1$class<-as.factor(mydata1$class)
str(mydata1)
## 'data.frame':    10707 obs. of  13 variables:
##  $ avgpathtokenlen        : num  3.25 5.62 3 2.67 3.5 ...
##  $ pathurlRatio           : num  0.61 0.732 0.574 0.644 0.734 0.719 0.549 0.869 0.45 0.747 ...
##  $ ArgUrlRatio            : num  0.195 0.028 0.5 0.522 0.453 0.547 0.028 0.02 0.1 0.54 ...
##  $ argDomanRatio          : num  0.889 0.167 3.091 1.88 2.9 ...
##  $ domainUrlRatio         : num  0.22 0.169 0.162 0.278 0.156 0.172 0.352 0.061 0.375 0.172 ...
##  $ pathDomainRatio        : num  2.78 4.33 3.55 2.32 4.7 ...
##  $ argPathRatio           : num  0.32 0.039 0.872 0.81 0.617 0.761 0.051 0.023 0.222 0.723 ...
##  $ CharacterContinuityRate: num  0.778 0.75 0.727 0.44 0.7 0.727 0.6 0.667 0.533 0.533 ...
##  $ NumberRate_URL         : num  0.146 0 0.088 0.044 0.063 0.031 0.042 0.091 0.025 0.184 ...
##  $ NumberRate_FileName    : num  0.158 0 0.171 0.07 0.108 0.056 0.231 0 0.059 0.214 ...
##  $ NumberRate_AfterPath   : num  0.375 -1 0.261 0.085 0.138 0.057 -1 -1 0.25 0.255 ...
##  $ Entropy_Domain         : num  1 0.843 0.895 0.832 0.94 0.947 0.757 1 0.77 0.817 ...
##  $ class                  : Factor w/ 2 levels "Defacement","malware": 1 2 2 1 2 2 1 2 1 2 ...
mydata1$class <- ifelse(mydata1$class == "Defacement", 1, 0)
#View(mydata1)

Train and Test Data

The purpose of creating two different datasets from the original one is to improve our ability to accurately predict previously unseen data.

There are a number of ways to proportionally split our data into train and test sets: 50/50, 60/40, 70/30, 80/20, and so forth. The data split that you select should be based on your experience and judgment. For this exercise, we will use a 90/10 split (note the prob = c(0.9, 0.1) argument below), as follows:

set.seed(123)  # set the random seed so the split is reproducible
ind <- sample(2, nrow(mydata1), replace = TRUE, prob = c(0.9, 0.1))

Partitioning the data:

train1 <- mydata1[ind==1, ]  #the training set

test1 <- mydata1[ind==2, ]   # the testing set 

You can confirm the dimensions of both sets as follows:

dim(train1)
## [1] 9699   13
dim(test1)
## [1] 1008   13

To ensure that we have a well-balanced outcome variable between the two datasets, we will perform the following check:

table(train1$class)
## 
##    0    1 
## 6106 3593
table(test1$class)
## 
##   0   1 
## 601 407
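
If you prefer to see this balance as proportions rather than raw counts, a quick extra check (not part of the original output) is:

prop.table(table(train1$class))  # class proportions in the training set
prop.table(table(test1$class))   # class proportions in the test set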

This is an acceptable ratio of our outcomes in the two datasets; with this, we can begin the modeling and evaluation.

Modeling and Evaluation

We will use the function glm() (from base R) for the logistic regression model.

An R installation comes with the glm() function, which fits generalized linear models, a class of models that includes logistic regression. The code syntax is similar to the lm() function that we used for linear regression. One difference is that we must use the family = binomial argument, which tells R to fit a logistic regression rather than another type of generalized linear model. We will start by creating a model that includes all of the features on the training set and see how it performs on the test set:

attach(train1)  # optional here: glm() already receives data = train1
full.fit <- glm(class ~ ., family = binomial, data = train1)

Create a summary of the model:

summary(full.fit)
## 
## Call:
## glm(formula = class ~ ., family = binomial, data = train1)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3441  -0.6851  -0.1941   0.6406   3.2729  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)              32.18948    1.50599  21.374  < 2e-16 ***
## avgpathtokenlen          -0.01343    0.03222  -0.417 0.676841    
## pathurlRatio            -36.79053    1.66703 -22.070  < 2e-16 ***
## ArgUrlRatio              14.27915    2.07389   6.885 5.77e-12 ***
## argDomanRatio            -0.04492    0.10175  -0.441 0.658891    
## domainUrlRatio          -24.67512    1.52809 -16.148  < 2e-16 ***
## pathDomainRatio           0.04349    0.08389   0.518 0.604205    
## argPathRatio             -4.54301    1.23667  -3.674 0.000239 ***
## CharacterContinuityRate  -4.51676    0.20423 -22.116  < 2e-16 ***
## NumberRate_URL           -3.87436    0.43228  -8.963  < 2e-16 ***
## NumberRate_FileName      -0.88306    0.13706  -6.443 1.17e-10 ***
## NumberRate_AfterPath     -0.60313    0.09976  -6.046 1.49e-09 ***
## Entropy_Domain           -2.07771    0.52388  -3.966 7.31e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 12787.1  on 9698  degrees of freedom
## Residual deviance:  7853.5  on 9686  degrees of freedom
## AIC: 7879.5
## 
## Number of Fisher Scoring iterations: 7

You cannot interpret the coefficients in logistic regression as "the change in Y for a one-unit change in X", because they are expressed on the log-odds scale.

This is where the odds ratio can be quite helpful. The beta coefficients from the logit model can be converted to odds ratios by exponentiating them: exp(beta).

In order to produce the odds ratios in R, we will use the following exp(coef()) syntax:

exp(coef(full.fit))
##             (Intercept)         avgpathtokenlen            pathurlRatio 
##            9.543632e+13            9.866610e-01            1.052143e-16 
##             ArgUrlRatio           argDomanRatio          domainUrlRatio 
##            1.589844e+06            9.560762e-01            1.921904e-11 
##         pathDomainRatio            argPathRatio CharacterContinuityRate 
##            1.044447e+00            1.064136e-02            1.092432e-02 
##          NumberRate_URL     NumberRate_FileName    NumberRate_AfterPath 
##            2.076767e-02            4.135160e-01            5.470964e-01 
##          Entropy_Domain 
##            1.252167e-01

The interpretation of an odds ratio is the change in the outcome odds resulting from a unit change in the feature. If the value is greater than 1, it indicates that, as the feature increases, the odds of the outcome increase. Conversely, a value less than 1 would mean that, as the feature increases, the odds of the outcome decrease.
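
If you also want interval estimates for the odds ratios, one option (not shown in the original output) is to exponentiate Wald confidence intervals alongside the point estimates:

# odds ratios with approximate 95% Wald confidence intervals;
# confint.default() uses the standard errors rather than profiling the likelihood
exp(cbind(OR = coef(full.fit), confint.default(full.fit)))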

Let us now fit a model that keeps only the features with the lowest p-values.
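
As a sketch of that reduced model (the name reduce.fit and the exact set of retained predictors are our assumptions, based on the p-values in the summary above; avgpathtokenlen, argDomanRatio, and pathDomainRatio are dropped):

# refit using only the predictors that were significant in full.fit
reduce.fit <- glm(class ~ pathurlRatio + ArgUrlRatio + domainUrlRatio +
                    argPathRatio + CharacterContinuityRate + NumberRate_URL +
                    NumberRate_FileName + NumberRate_AfterPath + Entropy_Domain,
                  family = binomial, data = train1)
summary(reduce.fit)  # compare coefficients and AIC with full.fit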

Testing the model

You will first have to create a vector of the predicted probabilities, as follows:

train.probs <- predict(full.fit, type = "response")
# inspect the first 5 probabilities
#train.probs[1:5]

Next, we need to evaluate how well the model performed in training and then evaluate its fit on the test set. A quick way to do this is to produce a confusion matrix. The default cutoff the function uses to assign a class is 0.50, which is to say that any probability at or above 0.50 is classified as 1 (Defacement) and anything below as 0 (malware):

trainY1<-mydata1$class[ind==1]
testY1<-mydata1$class[ind==2]
#install.packages("caret")
#library(caret)
#install.packages("InformationValue")
library(InformationValue)
## Warning: package 'InformationValue' was built under R version 4.2.1
confusionMatrix(trainY1,train.probs)
misClassError(trainY1, train.probs)
## [1] 0.162
test.probs <- predict(full.fit, newdata = test1, type = "response")
#misclassification error
misClassError(testY1, test.probs)
## [1] 0.1776
# confusion matrix
confusionMatrix(testY1, test.probs)
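
Beyond the confusion matrix, the InformationValue package offers a few more test-set diagnostics. The following is a sketch, assuming the package's sensitivity(), specificity(), plotROC(), and optimalCutoff() helpers with their default arguments:

sensitivity(testY1, test.probs)   # true positive rate at the 0.5 cutoff
specificity(testY1, test.probs)   # true negative rate at the 0.5 cutoff
plotROC(testY1, test.probs)       # ROC curve with the AUC shown in the plot
optimalCutoff(testY1, test.probs) # cutoff that minimizes the misclassification error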