The idea of this handout is to develop a logistic regression model to discuss the odds that a URL's type is benign or phishing. In this mini tutorial you have the steps we need to follow in order to complete this activity.
getwd()
[1] "C:/Users/antho/OneDrive/Documents/School/4.DataSecurity&Governance/Project 2"
First we make sure we are working in the correct working directory. Then we will import the .csv file into the “final1” dataframe by using the read.csv() function.
final1<-read.csv("groupVIIclean.csv",sep = ",",header = TRUE)
Now we want to inspect the data structure of this dataframe we just created. We can use the structure function to do this.
str(final1)
'data.frame': 15087 obs. of 13 variables:
$ avgpathtokenlen : num 2.67 3.33 8 3.46 4.62 ...
$ pathurlRatio : num 0.256 0.68 0.413 0.781 0.611 0.5 0.564 0.84 0.346 0.5 ...
$ ArgUrlRatio : num 0.047 0.027 0.018 0.027 0.028 0.05 0.036 0.017 0.036 0.05 ...
$ argDomanRatio : num 0.08 0.118 0.035 0.222 0.095 0.154 0.118 0.167 0.069 0.154 ...
$ domainUrlRatio : num 0.581 0.227 0.523 0.123 0.292 0.325 0.309 0.101 0.527 0.325 ...
$ pathDomainRatio : num 0.44 3 0.79 6.33 2.1 ...
$ argPathRatio : num 0.182 0.039 0.044 0.035 0.046 0.1 0.065 0.02 0.105 0.1 ...
$ CharacterContinuityRate: num 0.72 0.824 0.281 0.667 0.667 0.769 0.588 0.75 0.69 0.769 ...
$ NumberRate_URL : num 0.116 0.053 0.193 0.082 0 0.1 0 0.059 0.055 0.05 ...
$ NumberRate_FileName : num 1 0.08 0.625 0 0 0.286 0 0 0 0.143 ...
$ NumberRate_AfterPath : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ Entropy_Domain : num 0.763 0.788 0.703 0.93 0.84 0.917 0.759 0.954 0.799 0.917 ...
$ class : chr "phishing" "benign" "phishing" "benign" ...
Now let us omit any null values from the data frame
mydata1<-na.omit(final1)
Now let us get a statistical summary for each field.
summary(mydata1)
avgpathtokenlen pathurlRatio ArgUrlRatio argDomanRatio domainUrlRatio pathDomainRatio
Min. : 0.667 Min. :0.0410 Min. :0.00000 Min. : 0.0000 Min. :0.0290 Min. : 0.044
1st Qu.: 3.800 1st Qu.:0.5470 1st Qu.:0.02400 1st Qu.: 0.1110 1st Qu.:0.1410 1st Qu.: 1.667
Median : 4.500 Median :0.6920 Median :0.03300 Median : 0.1540 Median :0.2030 Median : 3.429
Mean : 5.293 Mean :0.6539 Mean :0.09634 Mean : 0.7042 Mean :0.2401 Mean : 4.074
3rd Qu.: 5.571 3rd Qu.:0.7740 3rd Qu.:0.04900 3rd Qu.: 0.2000 3rd Qu.:0.3250 3rd Qu.: 5.500
Max. :105.000 Max. :0.9510 Max. :0.91100 Max. :20.4620 Max. :0.9300 Max. :32.900
argPathRatio CharacterContinuityRate NumberRate_URL NumberRate_FileName NumberRate_AfterPath Entropy_Domain
Min. :0.0000 Min. :0.0750 Min. :0.00000 Min. :-1.00000 Min. :-1.0000 Min. :0.5620
1st Qu.:0.0340 1st Qu.:0.6000 1st Qu.:0.00000 1st Qu.: 0.00000 1st Qu.:-1.0000 1st Qu.:0.7990
Median :0.0530 Median :0.7200 Median :0.05700 Median : 0.00000 Median :-1.0000 Median :0.8610
Mean :0.1414 Mean :0.6725 Mean :0.08681 Mean : 0.09023 Mean :-0.8171 Mean :0.8557
3rd Qu.:0.1000 3rd Qu.:0.7780 3rd Qu.:0.12500 3rd Qu.: 0.17900 3rd Qu.:-1.0000 3rd Qu.:0.9170
Max. :0.9780 Max. :1.0000 Max. :0.76200 Max. : 1.00000 Max. : 1.0000 Max. :1.0000
class
Length:15087
Class :character
Mode :character
Let us also inspect the data structure of our new dataframe that has no nulls. Notice it has the same number of observations/records (15,087), so we can assume there were no nulls in this data set.
str(mydata1)
'data.frame': 15087 obs. of 13 variables:
$ avgpathtokenlen : num 2.67 3.33 8 3.46 4.62 ...
$ pathurlRatio : num 0.256 0.68 0.413 0.781 0.611 0.5 0.564 0.84 0.346 0.5 ...
$ ArgUrlRatio : num 0.047 0.027 0.018 0.027 0.028 0.05 0.036 0.017 0.036 0.05 ...
$ argDomanRatio : num 0.08 0.118 0.035 0.222 0.095 0.154 0.118 0.167 0.069 0.154 ...
$ domainUrlRatio : num 0.581 0.227 0.523 0.123 0.292 0.325 0.309 0.101 0.527 0.325 ...
$ pathDomainRatio : num 0.44 3 0.79 6.33 2.1 ...
$ argPathRatio : num 0.182 0.039 0.044 0.035 0.046 0.1 0.065 0.02 0.105 0.1 ...
$ CharacterContinuityRate: num 0.72 0.824 0.281 0.667 0.667 0.769 0.588 0.75 0.69 0.769 ...
$ NumberRate_URL : num 0.116 0.053 0.193 0.082 0 0.1 0 0.059 0.055 0.05 ...
$ NumberRate_FileName : num 1 0.08 0.625 0 0 0.286 0 0 0 0.143 ...
$ NumberRate_AfterPath : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ Entropy_Domain : num 0.763 0.788 0.703 0.93 0.84 0.917 0.759 0.954 0.799 0.917 ...
$ class : chr "phishing" "benign" "phishing" "benign" ...
Let's convert the data type of the “class” field from character to factor so that we can perform binary classification on this field's only two values, “benign” and “phishing”. Notice we use the as.factor() function to make this data type conversion.
mydata1$class<-as.factor(mydata1$class)
Now let's review our data structure once more to see the change we made to the “class” column.
str(mydata1)
'data.frame': 15087 obs. of 13 variables:
$ avgpathtokenlen : num 2.67 3.33 8 3.46 4.62 ...
$ pathurlRatio : num 0.256 0.68 0.413 0.781 0.611 0.5 0.564 0.84 0.346 0.5 ...
$ ArgUrlRatio : num 0.047 0.027 0.018 0.027 0.028 0.05 0.036 0.017 0.036 0.05 ...
$ argDomanRatio : num 0.08 0.118 0.035 0.222 0.095 0.154 0.118 0.167 0.069 0.154 ...
$ domainUrlRatio : num 0.581 0.227 0.523 0.123 0.292 0.325 0.309 0.101 0.527 0.325 ...
$ pathDomainRatio : num 0.44 3 0.79 6.33 2.1 ...
$ argPathRatio : num 0.182 0.039 0.044 0.035 0.046 0.1 0.065 0.02 0.105 0.1 ...
$ CharacterContinuityRate: num 0.72 0.824 0.281 0.667 0.667 0.769 0.588 0.75 0.69 0.769 ...
$ NumberRate_URL : num 0.116 0.053 0.193 0.082 0 0.1 0 0.059 0.055 0.05 ...
$ NumberRate_FileName : num 1 0.08 0.625 0 0 0.286 0 0 0 0.143 ...
$ NumberRate_AfterPath : num -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 ...
$ Entropy_Domain : num 0.763 0.788 0.703 0.93 0.84 0.917 0.759 0.954 0.799 0.917 ...
$ class : Factor w/ 2 levels "benign","phishing": 2 1 2 1 2 2 2 1 2 2 ...
Notice how the data type is now a factor with 2 levels, stored internally as 2 and 1. Let's change this feature once again by converting all the “phishing” values to zero. We will use the ifelse() function to achieve this conversion.
mydata1$class <- ifelse(mydata1$class == "benign", 1, 0)# this will make all "benign" values equal 1
#and all other values 0
This change should give us the more traditional binary values of 1 and 0.
# We can open the mydata1 dataframe and verify that our binary values have been created.
#View(mydata1)
avgpathtokenlen: Numerical-Average length of the path tokens. The avgpathtokenlen column takes the tokens (strings) in the URL path and calculates their average length.
pathurlRatio: Numerical-Path divided by URL.
ArgUrlRatio: Numerical-Ratio of argument and URL.
argDomanRatio: Numerical-Argument divided by domain. The argDomanRatio holds the sum of the arguments grouped by the specified domains, divided by the number of times a domain is requested.
domainUrlRatio: Numerical-Domain divided by URL.
pathDomainRatio: Numerical-Path divided by domain.
argPathRatio: Numerical-Ratio of argument and path.
CharacterContinuityRate: Numerical-Character continuity rate is the sum of the longest token length of each character type in the domain, divided by the length of the URL, such as abc567ti = (3 + 3 + 1)/9 = 0.77. Malicious websites use URLs with a variable number of character types; the character continuity rate captures the sequences of letter, digit, and symbol characters.
pathurlRatio: Numerical-Path divided by URL.
ArgUrlRatio: Numerical-Ratio of argument and URL.
NumberRate_URL: Numerical-Number rate calculates the proportion of digits in the URL as a whole.
NumberRate_FileName: Numerical-Number rate calculates the proportion of digits in the filename part of the URL.
NumberRate_AfterPath: Numerical-Number rate calculates the proportion of digits in the part of the URL after the path.
Entropy_Domain: Numerical-Malicious websites often insert additional characters in the URL to make it look legitimate; e.g., CITI can be written as CIT1 by replacing the last letter I with the digit 1. English text has fairly low entropy, i.e., it is predictable, and inserting characters changes the entropy from what is usual. To identify randomly generated malicious URLs, the alphabet entropy of the domain is used; a formula is used to calculate the information entropy. A short illustrative code sketch for both this entropy measure and the character continuity rate follows after these field descriptions.
class: Categorical-URL type: benign, spam, phishing, or malware in the original source data; in this cleaned data set it takes only the values “benign” and “phishing”.
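To make the CharacterContinuityRate and Entropy_Domain descriptions more concrete, here is a minimal sketch in base R. The helper functions and the example string are my own illustrations and are not how the data set's features were actually extracted (in particular, the data set's Entropy_Domain appears to be rescaled to [0, 1], whereas plain Shannon entropy is shown here only to convey the idea):
# Hypothetical helpers illustrating two of the features described above.
char_continuity_rate <- function(s) {
  chars <- strsplit(s, "")[[1]]
  type  <- ifelse(grepl("[A-Za-z]", chars), "letter",
                  ifelse(grepl("[0-9]", chars), "digit", "symbol"))
  runs    <- rle(type)                        # consecutive runs of each character type
  longest <- tapply(runs$lengths, runs$values, max)
  sum(longest) / nchar(s)                     # sum of longest runs divided by total length
}
shannon_entropy <- function(s) {
  p <- prop.table(table(strsplit(s, "")[[1]]))  # relative frequency of each character
  -sum(p * log2(p))                             # unscaled Shannon entropy in bits
}
char_continuity_rate("paypa1-secure.com")  # made-up domain, for illustration only
shannon_entropy("paypa1-secure.com")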
Next, let us compute the correlation matrix of the numeric predictors (we drop column 13, the class label).
cr<-cor(mydata1[,-13])
cr
avgpathtokenlen pathurlRatio ArgUrlRatio argDomanRatio domainUrlRatio pathDomainRatio
avgpathtokenlen 1.00000000 0.20565786 -0.11096675 -0.02151169 -0.15435881 0.19253758
pathurlRatio 0.20565786 1.00000000 0.25844811 0.34334011 -0.98059649 0.82368027
ArgUrlRatio -0.11096675 0.25844811 1.00000000 0.90844457 -0.24406584 0.31230027
argDomanRatio -0.02151169 0.34334011 0.90844457 1.00000000 -0.31388861 0.48577889
domainUrlRatio -0.15435881 -0.98059649 -0.24406584 -0.31388861 1.00000000 -0.78454426
pathDomainRatio 0.19253758 0.82368027 0.31230027 0.48577889 -0.78454426 1.00000000
argPathRatio -0.16045571 0.11223739 0.97893163 0.83932099 -0.10172942 0.19515815
CharacterContinuityRate -0.17954935 0.26435796 0.02382876 0.03876218 -0.32697254 0.12678431
NumberRate_URL 0.37615691 0.26915457 0.16064291 0.23703321 -0.22454144 0.26379090
NumberRate_FileName 0.21155718 -0.04867724 0.06440619 0.06958914 0.06086648 -0.08436904
NumberRate_AfterPath -0.07354815 0.26803235 0.86986041 0.76162049 -0.25390104 0.30073882
Entropy_Domain -0.11281137 0.34574630 0.02154917 0.04130224 -0.43013984 0.30621945
argPathRatio CharacterContinuityRate NumberRate_URL NumberRate_FileName NumberRate_AfterPath
avgpathtokenlen -0.16045571 -0.17954935 0.37615691 0.21155718 -0.07354815
pathurlRatio 0.11223739 0.26435796 0.26915457 -0.04867724 0.26803235
ArgUrlRatio 0.97893163 0.02382876 0.16064291 0.06440619 0.86986041
argDomanRatio 0.83932099 0.03876218 0.23703321 0.06958914 0.76162049
domainUrlRatio -0.10172942 -0.32697254 -0.22454144 0.06086648 -0.25390104
pathDomainRatio 0.19515815 0.12678431 0.26379090 -0.08436904 0.30073882
argPathRatio 1.00000000 -0.01659203 0.11435402 0.06021489 0.86650561
CharacterContinuityRate -0.01659203 1.00000000 -0.03412866 -0.06663442 0.05033632
NumberRate_URL 0.11435402 -0.03412866 1.00000000 0.34717068 0.20217671
NumberRate_FileName 0.06021489 -0.06663442 0.34717068 1.00000000 0.08145431
NumberRate_AfterPath 0.86650561 0.05033632 0.20217671 0.08145431 1.00000000
Entropy_Domain -0.01971431 0.36454930 0.01045629 -0.07448911 0.04887298
Entropy_Domain
avgpathtokenlen -0.11281137
pathurlRatio 0.34574630
ArgUrlRatio 0.02154917
argDomanRatio 0.04130224
domainUrlRatio -0.43013984
pathDomainRatio 0.30621945
argPathRatio -0.01971431
CharacterContinuityRate 0.36454930
NumberRate_URL 0.01045629
NumberRate_FileName -0.07448911
NumberRate_AfterPath 0.04887298
Entropy_Domain 1.00000000
Above we can see the correlation matrix we created; below we will visualize it.
Let us install the package necessary to view it. Note that the package name is all lowercase: install.packages("Corrplot") fails with a warning that the package is not available for this version of R and the hint “Perhaps you meant ‘corrplot’ ?”.
install.packages("corrplot")
Installing package into ‘C:/Users/antho/AppData/Local/R/win-library/4.2’
(as ‘lib’ is unspecified)
library(corrplot)
Let us view it with corrplot
corrplot(cr, method = 'ellipse')
Let us see if we can order it in a way that helps us understand more about the correlations, perhaps by the angular order of the eigenvectors (AOE), the first principal component order (FPC), or the hierarchical clustering order.
corrplot(cr, order = 'AOE')
corrplot(cr, order = 'FPC')
corrplot(cr, order = 'hclust')
I prefer how the hierarchical clustering order presents the correlations. I can clearly see about 3 clusters but don't know what the clustering linkage method is, be it complete, centroid, average, etc. If I were going to run the data through a k-Nearest Neighbor model I would benefit from cluster categorization like dendrograms, which I believe the addrect argument achieves to some degree below.
corrplot(cr, method = 'ellipse', order = 'hclust', addrect = 4)
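If the linkage method matters, corrplot lets us state it explicitly through its hclust.method argument (the default, as I understand it, is “complete”); a small sketch:
# same plot as above, but with the clustering linkage stated explicitly (average instead of the default complete)
corrplot(cr, method = 'ellipse', order = 'hclust', hclust.method = 'average', addrect = 4)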
The purpose of creating two different datasets from the original one is to improve our ability to accurately predict previously unused or unseen data.
There are a number of ways to proportionally split our data into train and test sets: 50/50, 60/40, 70/30, 80/20, and so forth. The data split that you select should be based on your experience and judgment. For this exercise, we will use an 80/20 split, as follows:
set.seed(23) # random number generator set to the seed of 23.
ind <- sample(2, nrow(mydata1), replace = TRUE, prob = c(0.8, 0.2))
Partitioning the data:
Here we grab the split data into our first train and test variables. Notice above is where we implemented the split.
train1 <- mydata1[ind==1, ] #the training set
test1 <- mydata1[ind==2, ] # the testing set
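As an aside, the sample()-based split above gives only an approximately 80/20 partition. If the caret package is installed, a stratified split that preserves the class proportions could be sketched as follows; the idx, train.alt, and test.alt names are my own and are not used in the rest of this exercise:
library(caret)
set.seed(23)
idx <- createDataPartition(factor(mydata1$class), p = 0.8, list = FALSE)  # stratified 80% sample of row indices
train.alt <- mydata1[idx, ]   # alternative training set
test.alt  <- mydata1[-idx, ]  # alternative testing set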
You can confirm the dimensions of both sets as follows:
dim(train1)
[1] 12127 13
dim(test1)
[1] 2960 13
Above we can see that both sets have the same number of features (13 columns) and that the test set has roughly a quarter of the records that the train set does.
To ensure that we have a well-balanced outcome variable between the two datasets, we will perform the following check:
table(train1$class) # Creates a table that shows the count of each class
0 1
5910 6217
table(test1$class)
0 1
1401 1559
We can see that there is a fair balance between our target classes for both of our sets.
This is an acceptable ratio of our outcomes in the two datasets; with this, we can begin the modeling and evaluation.
We will use the glm() function (from base R) for the logistic regression model. An R installation comes with the glm() function for fitting generalized linear models, which are a class of models that includes logistic regression. The code syntax is similar to the lm() function that we used for linear regression. One difference is that we must use the family = binomial argument in the function, which tells R to run a logistic regression method instead of the other versions of the generalized linear models. We will start by creating a model that includes all of the features on the train set and see how it performs on the test set:
attach(train1)
The following objects are masked from train1 (pos = 4):
argDomanRatio, argPathRatio, ArgUrlRatio, avgpathtokenlen, CharacterContinuityRate, class,
domainUrlRatio, Entropy_Domain, NumberRate_AfterPath, NumberRate_FileName, NumberRate_URL,
pathDomainRatio, pathurlRatio
Finally we can begin with our logistic regression. Here we will target the “class” field and run it against all other fields. We will use the train data set and not the test set. Notice we are predicting a binary outcome and not a continuous one, so we also set the family argument to binomial.
full.fit <- glm(class ~ ., family = binomial, data = train1)
Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred
Create a summary of the model so that we can see the coefficient estimates, their standard errors, and their p-values:
summary(full.fit)
Call:
glm(formula = class ~ ., family = binomial, data = train1)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.5588 -0.2816 0.1451 0.4246 4.4548
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 29.092755 1.899319 15.317 < 2e-16 ***
avgpathtokenlen -0.149654 0.014215 -10.528 < 2e-16 ***
pathurlRatio -24.594896 1.702746 -14.444 < 2e-16 ***
ArgUrlRatio -15.972711 3.188721 -5.009 5.47e-07 ***
argDomanRatio 0.077851 0.079488 0.979 0.32738
domainUrlRatio -69.481573 2.453667 -28.317 < 2e-16 ***
pathDomainRatio -0.292382 0.029244 -9.998 < 2e-16 ***
argPathRatio 6.863700 2.136408 3.213 0.00131 **
CharacterContinuityRate 8.487651 0.288219 29.449 < 2e-16 ***
NumberRate_URL -3.721122 0.371732 -10.010 < 2e-16 ***
NumberRate_FileName -0.003547 0.090242 -0.039 0.96864
NumberRate_AfterPath 1.714502 0.182821 9.378 < 2e-16 ***
Entropy_Domain 1.132800 0.578038 1.960 0.05003 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 16803.8 on 12126 degrees of freedom
Residual deviance: 6934.8 on 12114 degrees of freedom
AIC: 6960.8
Number of Fisher Scoring iterations: 7
We can see from our statistical summary that all the features are significant at the 5% level except argDomanRatio, NumberRate_FileName, and Entropy_Domain (which sits right at the 5% boundary).
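As a quick check of that claim, the coefficient table can be filtered programmatically (a small sketch using the summary object; the intercept is included in the result):
coefs <- summary(full.fit)$coefficients
coefs[coefs[, "Pr(>|z|)"] < 0.05, ]  # keep only the terms significant at the 5% level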
You cannot interpret the coefficients in logistic regression as “the change in Y based on a one-unit change in X”, because the coefficients are on the log-odds scale. This is where the odds ratio can be quite helpful. The beta coefficients from the logistic model can be converted to odds ratios by exponentiating them, exp(beta). In order to produce the odds ratios in R, we will use the following exp(coef()) syntax:
exp(coef(full.fit))
(Intercept) avgpathtokenlen pathurlRatio ArgUrlRatio argDomanRatio
4.313432e+12 8.610055e-01 2.082440e-11 1.156485e-07 1.080961e+00
domainUrlRatio pathDomainRatio argPathRatio CharacterContinuityRate NumberRate_URL
6.676308e-31 7.464830e-01 9.569007e+02 4.854451e+03 2.420680e-02
NumberRate_FileName NumberRate_AfterPath Entropy_Domain
9.964590e-01 5.553906e+00 3.104336e+00
The interpretation of an odds ratio is the change in the outcome odds resulting from a unit change in the feature. If the value is greater than 1, it indicates that, as the feature increases, the odds of the outcome increase. Conversely, a value less than 1 would mean that, as the feature increases, the odds of the outcome decrease.
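If we also want a sense of the uncertainty around these odds ratios, the coefficient confidence limits can be exponentiated as well (a sketch; confint() on a glm profiles the likelihood, so it may take a few seconds and prints a profiling message):
# odds ratios with 95% confidence intervals
exp(cbind(OR = coef(full.fit), confint(full.fit)))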
We could refit a reduced model using only the coefficients with the lowest p-values, but for now let us evaluate how well the full model predicts.
You will first have to create a vector of the predicted probabilities, as follows:
train.probs <- predict(full.fit, type = "response")
# inspect the first 5 probabilities
train.probs[1:5]
1 2 3 4 6
2.845787e-06 7.388572e-01 1.422444e-08 9.703884e-01 1.832720e-01
We can see that the output comes out as values between 0 and 1; these are the predicted probabilities that a URL is benign (the class we coded as 1).
Next, we need to evaluate how well the model performed in training and then evaluate how it fits on the test set. A quick way to do this is to produce a confusion matrix. The default cutoff by which the function selects either benign or phishing is 0.50, which is to say that any probability at or above 0.50 is classified as benign (1) and anything below as phishing (0):
trainY1<-mydata1$class[ind==1] # creating the target variable (y) for the train data we split
testY1<-mydata1$class[ind==2]
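Before turning to the packages below, the 0.50 cutoff described above can also be illustrated by hand in base R (a quick sketch; recall that 1 = benign and 0 = phishing in our coding):
train.pred <- ifelse(train.probs >= 0.5, 1, 0)   # apply the 0.50 cutoff to the predicted probabilities
table(Actual = trainY1, Predicted = train.pred)  # hand-rolled confusion matrix for the training set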
Let's install and load the two necessary packages, caret and InformationValue.
#install.packages("caret")
#library(caret)
#install.packages("InformationValue")
library(InformationValue)
Let's check accuracy by running a confusion matrix for the train data, showing our false positives and false negatives against our accurate predictions.
confusionMatrix(trainY1,train.probs)
Here we compute the misclassification error between the predictions of the target variable based on the training data set and the actual values of the target variable.
misClassError(trainY1, train.probs)# since we don't specify it, the threshold defaults to 0.5 for classifying train.probs values as 1
[1] 0.1034
Notice our train data has a low classification error of roughly 10%.
Let's do the same for our test set.
test.probs <- predict(full.fit, newdata = test1, type = "response")
#misclassification error
misClassError(testY1, test.probs)
[1] 0.1095
Notice our test set has a little more error at 0.1095, just 0.006 more. This ~11% is an acceptable error rate, which means our model is usable.
Let's see how our test set performed in accuracy, like we did with the train set, using a confusion matrix.
# confusion matrix
confusionMatrix(testY1, test.probs)
Notice that there are far more accurate benign and phishing classifications than there are false positives or false negatives.
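The InformationValue package also provides helpers that summarize the same confusion matrix in rate form; a sketch, assuming its sensitivity(), specificity(), and plotROC() functions with the (actuals, predictedScores) argument order and the default 0.5 threshold:
sensitivity(testY1, test.probs)  # proportion of benign (1) URLs correctly identified
specificity(testY1, test.probs)  # proportion of phishing (0) URLs correctly identified
plotROC(testY1, test.probs)      # ROC curve with its area under the curve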
I will attempt to graph this after installing the ggplot2 package.
install.packages("ggplot2")
Error in install.packages : Updating loaded packages
library(ggplot2)
Warning: package ‘ggplot2’ was built under R version 4.2.1
modelog <- predict.glm(full.fit, test1, type="response") # same predicted probabilities as test.probs above
gg <- ggplot(data.frame(x=test.probs, y=ifelse(test1$class>0.5, "Benign", "Phishing")), aes(x, y)) +
geom_point(size=3, fill="steelblue", color="black", shape=4) +
ylab("Known URL Class (Benign vs. Phishing)") +
xlab("Estimated Probability of a Non-Phishing (Benign) URL") + theme_bw()
print(gg)
Research and/or business questions to consider based on this data set: can we predict whether a URL address is malicious and trying to perform a “phishing” attack on you, or is it benign? Can we identify this behavior with the information we receive about the URL, domain, and path? We could also ask what other regression or classification models we could try to classify this data with, such as k-Nearest Neighbors, decision trees, Naive Bayes, or a Support Vector Machine. A researcher could also wonder which fields, when used to train the model, would give the highest significance and coefficient of determination.
I was able to digest the data using the str(), summary(), and View() functions, and through those I was able to verify that the data set was clean, wrangled, and ready for partitioning. Though I was able to train and test the data in a logistic regression model and see its resulting coefficients, p-values, and so on, I still don't feel I have a better understanding of the measures. I only feel I digested information about this specific data set's values and the probability of predicting “phishing” URLs; I still don't feel I know what the non-target fields mean and how they relate to each other, though I am very interested in understanding more about web services and connections.
3. I did not identify any clear anomaly in this data set. I inspected this data frame with a correlation heat map and didn't find any high relationships between values that didn't already seem related. An anomaly for me would be finding something that has a clear correspondence that isn't a parallel correlation.