The idea of this handout is to develop a logistic regression model for the odds that a given URL is benign or malicious (in this dataset, the two classes are Defacement and malware). This mini tutorial lays out the steps we need to follow in order to complete the activity. It is just the starting point: during the remaining two weeks we are going to continue working on this document.
getwd()
## [1] "C:/Users/npenaper/Downloads"
final1<-read.csv("group5.csv",sep = ",",header = TRUE)
mydata1<-na.omit(final1)
#summary(mydata1)
str(mydata1)
## 'data.frame': 10707 obs. of 13 variables:
## $ avgpathtokenlen : num 3.25 5.62 3 2.67 3.5 ...
## $ pathurlRatio : num 0.61 0.732 0.574 0.644 0.734 0.719 0.549 0.869 0.45 0.747 ...
## $ ArgUrlRatio : num 0.195 0.028 0.5 0.522 0.453 0.547 0.028 0.02 0.1 0.54 ...
## $ argDomanRatio : num 0.889 0.167 3.091 1.88 2.9 ...
## $ domainUrlRatio : num 0.22 0.169 0.162 0.278 0.156 0.172 0.352 0.061 0.375 0.172 ...
## $ pathDomainRatio : num 2.78 4.33 3.55 2.32 4.7 ...
## $ argPathRatio : num 0.32 0.039 0.872 0.81 0.617 0.761 0.051 0.023 0.222 0.723 ...
## $ CharacterContinuityRate: num 0.778 0.75 0.727 0.44 0.7 0.727 0.6 0.667 0.533 0.533 ...
## $ NumberRate_URL : num 0.146 0 0.088 0.044 0.063 0.031 0.042 0.091 0.025 0.184 ...
## $ NumberRate_FileName : num 0.158 0 0.171 0.07 0.108 0.056 0.231 0 0.059 0.214 ...
## $ NumberRate_AfterPath : num 0.375 -1 0.261 0.085 0.138 0.057 -1 -1 0.25 0.255 ...
## $ Entropy_Domain : num 1 0.843 0.895 0.832 0.94 0.947 0.757 1 0.77 0.817 ...
## $ class : chr "Defacement" "malware" "malware" "Defacement" ...
mydata1$class<-as.factor(mydata1$class)
str(mydata1)
## 'data.frame': 10707 obs. of 13 variables:
## $ avgpathtokenlen : num 3.25 5.62 3 2.67 3.5 ...
## $ pathurlRatio : num 0.61 0.732 0.574 0.644 0.734 0.719 0.549 0.869 0.45 0.747 ...
## $ ArgUrlRatio : num 0.195 0.028 0.5 0.522 0.453 0.547 0.028 0.02 0.1 0.54 ...
## $ argDomanRatio : num 0.889 0.167 3.091 1.88 2.9 ...
## $ domainUrlRatio : num 0.22 0.169 0.162 0.278 0.156 0.172 0.352 0.061 0.375 0.172 ...
## $ pathDomainRatio : num 2.78 4.33 3.55 2.32 4.7 ...
## $ argPathRatio : num 0.32 0.039 0.872 0.81 0.617 0.761 0.051 0.023 0.222 0.723 ...
## $ CharacterContinuityRate: num 0.778 0.75 0.727 0.44 0.7 0.727 0.6 0.667 0.533 0.533 ...
## $ NumberRate_URL : num 0.146 0 0.088 0.044 0.063 0.031 0.042 0.091 0.025 0.184 ...
## $ NumberRate_FileName : num 0.158 0 0.171 0.07 0.108 0.056 0.231 0 0.059 0.214 ...
## $ NumberRate_AfterPath : num 0.375 -1 0.261 0.085 0.138 0.057 -1 -1 0.25 0.255 ...
## $ Entropy_Domain : num 1 0.843 0.895 0.832 0.94 0.947 0.757 1 0.77 0.817 ...
## $ class : Factor w/ 2 levels "Defacement","malware": 1 2 2 1 2 2 1 2 1 2 ...
mydata1$class <- ifelse(mydata1$class == "Defacement", 1, 0)
#View(mydata1)
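As a quick sanity check (an extra step, not in the original code), we can confirm the recoding: Defacement should now be coded 1 and malware 0.
table(mydata1$class) # counts of 0 (malware) and 1 (Defacement)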
The purpose of creating two different datasets from the original one is to improve our ability to accurately predict previously unseen data.
There are a number of ways to proportionally split our data into train and test sets: 50/50, 60/40, 70/30, 80/20, and so forth. The data split that you select should be based on your experience and judgment. For this exercise, we will use a 90/10 split (note the prob argument below), as follows:
set.seed(123) # fix the random number generator so the split is reproducible
ind <- sample(2, nrow(mydata1), replace = TRUE, prob = c(0.9, 0.1))
Partitioning the data:
train1 <- mydata1[ind==1, ] #the training set
test1 <- mydata1[ind==2, ] # the testing set
You can confirm the dimensions of both sets as follows:
dim(train1)
## [1] 9699 13
dim(test1)
## [1] 1008 13
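As an aside, if you want the split itself to preserve the class proportions exactly rather than approximately, the caret package offers createDataPartition(). A minimal sketch, assuming caret is installed (it is not required for the rest of this step):
library(caret)
set.seed(123)
idx <- createDataPartition(mydata1$class, p = 0.9, list = FALSE) # stratified 90/10 split
train.alt <- mydata1[idx, ]
test.alt <- mydata1[-idx, ]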
To ensure that we have a well-balanced outcome variable between the two datasets, we will perform the following check:
table(train1$class)
##
## 0 1
## 6106 3593
table(test1$class)
##
## 0 1
## 601 407
This is an acceptable ratio of our outcomes in the two datasets; with this, we can begin the modeling and evaluation.
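If you prefer proportions to raw counts, prop.table() expresses the same check on a 0 to 1 scale:
prop.table(table(train1$class)) # share of each class in the training set
prop.table(table(test1$class))  # share of each class in the test set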
We will use the glm() function (from base R) for the logistic regression model. An R installation comes with glm() for fitting generalized linear models, a class of models that includes logistic regression. The code syntax is similar to that of the lm() function we used for linear regression. One difference is that we must use the family = binomial argument, which tells R to run a logistic regression rather than one of the other versions of the generalized linear models. We will start by creating a model that includes all of the features on the train set and see how it performs on the test set:
full.fit <- glm(class ~ ., family = binomial, data = train1) # data = train1 makes attach() unnecessary
Create a summary of the model:
summary(full.fit)
##
## Call:
## glm(formula = class ~ ., family = binomial, data = train1)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3441 -0.6851 -0.1941 0.6406 3.2729
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 32.18948 1.50599 21.374 < 2e-16 ***
## avgpathtokenlen -0.01343 0.03222 -0.417 0.676841
## pathurlRatio -36.79053 1.66703 -22.070 < 2e-16 ***
## ArgUrlRatio 14.27915 2.07389 6.885 5.77e-12 ***
## argDomanRatio -0.04492 0.10175 -0.441 0.658891
## domainUrlRatio -24.67512 1.52809 -16.148 < 2e-16 ***
## pathDomainRatio 0.04349 0.08389 0.518 0.604205
## argPathRatio -4.54301 1.23667 -3.674 0.000239 ***
## CharacterContinuityRate -4.51676 0.20423 -22.116 < 2e-16 ***
## NumberRate_URL -3.87436 0.43228 -8.963 < 2e-16 ***
## NumberRate_FileName -0.88306 0.13706 -6.443 1.17e-10 ***
## NumberRate_AfterPath -0.60313 0.09976 -6.046 1.49e-09 ***
## Entropy_Domain -2.07771 0.52388 -3.966 7.31e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 12787.1 on 9698 degrees of freedom
## Residual deviance: 7853.5 on 9686 degrees of freedom
## AIC: 7879.5
##
## Number of Fisher Scoring iterations: 7
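The drop from the null deviance (12787.1) to the residual deviance (7853.5) suggests that the predictors add substantial information. A chi-square test on that difference makes this explicit; a minimal sketch, not part of the original handout:
# likelihood-ratio test of the full model against the intercept-only model
with(full.fit, pchisq(null.deviance - deviance,
                      df.null - df.residual, lower.tail = FALSE))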
You cannot interpret the coefficients in logistic regression as "the change in Y for a one-unit change in X".
This is where the odds ratio can be quite helpful. The beta coefficients from the logit model can be converted to odds ratios by exponentiating them, exp(beta).
In order to produce the odds ratios in R, we will use the exp(coef()) syntax:
exp(coef(full.fit))
## (Intercept) avgpathtokenlen pathurlRatio
## 9.543632e+13 9.866610e-01 1.052143e-16
## ArgUrlRatio argDomanRatio domainUrlRatio
## 1.589844e+06 9.560762e-01 1.921904e-11
## pathDomainRatio argPathRatio CharacterContinuityRate
## 1.044447e+00 1.064136e-02 1.092432e-02
## NumberRate_URL NumberRate_FileName NumberRate_AfterPath
## 2.076767e-02 4.135160e-01 5.470964e-01
## Entropy_Domain
## 1.252167e-01
The interpretation of an odds ratio is the change in the outcome odds resulting from a unit change in the feature. If the value is greater than 1, it indicates that, as the feature increases, the odds of the outcome increase. Conversely, a value less than 1 would mean that, as the feature increases, the odds of the outcome decrease.
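Odds ratios are usually reported together with confidence intervals. Exponentiating the Wald intervals is one common approach; a short sketch, not part of the original handout:
# 95% Wald confidence intervals on the odds-ratio scale
exp(confint.default(full.fit))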
A natural next step would be to refit the model keeping only the coefficients with the lowest p-values; before doing so, let us see how the full model performs.
You will first have to create a vector of the predicted probabilities, as follows:
train.probs <- predict(full.fit, type = "response")
# inspect the first 5 probabilities
#train.probs[1:5]
Next, we need to evaluate how well the model performed in training and then evaluate how it fits on the test set. A quick way to do this is to produce a confusion matrix. The default cutoff at which the function assigns a class is 0.50, which is to say that any predicted probability at or above 0.50 is classified as class 1 (Defacement):
trainY1<-mydata1$class[ind==1]
testY1<-mydata1$class[ind==2]
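Before turning to the InformationValue helpers, the 0.50 cutoff logic can be made explicit with base R; a minimal sketch:
# classify at the default 0.50 cutoff and cross-tabulate against the truth
train.pred <- ifelse(train.probs >= 0.5, 1, 0)
table(Predicted = train.pred, Actual = trainY1)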
#install.packages("caret")
#library(caret)
#install.packages("InformationValue")
library(InformationValue)
## Warning: package 'InformationValue' was built under R version 4.2.1
confusionMatrix(trainY1, train.probs)
misClassError(trainY1, train.probs)
## [1] 0.162
test.probs <- predict(full.fit, newdata = test1, type = "response")
#misclassification error
misClassError(testY1, test.probs)
## [1] 0.1776
# confusion matrix
confusionMatrix(testY1, test.probs)
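The 0.50 cutoff is only a default. InformationValue also provides optimalCutoff() and plotROC(), which we may use in the coming weeks to tune the threshold and visualize performance; a sketch:
# cutoff that minimizes the misclassification error, plus the ROC curve
optimalCutoff(testY1, test.probs, optimiseFor = "misclasserror")
plotROC(testY1, test.probs)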