When attempting to use simple regression we could not get an accurate model: when running the confusion matrix, it returned only two categories instead of three.

Packages.

partykit - A toolkit with infrastructure for representing, summarizing, and visualizing tree-structured regression and classification models. This unified infrastructure can be used for reading/coercing tree models from different sources.

randomForest - Classification and regression based on a forest of trees using random inputs, based on Breiman (2001).

xgboost - Extreme Gradient Boosting, an efficient implementation of the gradient boosting framework from Chen & Guestrin. The package includes an efficient linear model solver and tree learning algorithms, and it can automatically run parallel computation on a single machine, which can be more than 10 times faster than existing gradient boosting packages. It supports various objective functions, including regression, classification, and ranking, and it is made to be extensible, so users can easily define their own objectives.

caret - Misc functions for training and plotting classification and regression models.
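
For context, here is a minimal sketch (illustration only, not part of this analysis) of the formula interface that randomForest shares with rpart, using R's built-in iris data; xgboost plays a similar role but expects a numeric matrix rather than a formula.

library(randomForest)
set.seed(1)
rf <- randomForest(Species ~ ., data = iris, ntree = 100) # small forest
print(rf) # shows the OOB error estimate and per-class confusion matrix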

#install.packages("partykit")
#install.packages("randomForest")
#install.packages("xgboost")
#install.packages("caret")

The library() function is used to load and attach the packages we installed above.

library(rpart)
library(partykit)
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(xgboost)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:randomForest':
## 
##     margin
project <- read.csv("data/Group4.csv",sep = ",",header = TRUE)

Removing rows with missing or incomplete cases from the data object.

mydata1 <- na.omit(project)
#View(mydata1)

Train and Test Data

The purpose of creating two different datasets from the original one is to improve our ability to accurately predict previously unseen data.

There are a number of ways to proportionally split the data into train and test sets: 50/50, 60/40, 70/30, 80/20, and so forth. The split you select should be based on your experience and judgment. For this exercise, after comparing several splits (listed further below), we settled on a 90/10 split, as follows:
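
As an aside, caret (already loaded above) offers a stratified alternative; a minimal sketch, where the names idx, trainStrat, and testStrat are ours:

set.seed(123)
# Stratified 90/10 split: createDataPartition() keeps the benign/malware
# ratio the same in both sets, unlike the weighted sample() used below.
idx <- createDataPartition(factor(mydata1$URL_Type_obf_Type), p = 0.9, list = FALSE)
trainStrat <- mydata1[idx, ]   # 90% of rows, class ratio preserved
testStrat  <- mydata1[-idx, ]  # remaining 10%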

#mydata1$URL_Type_obf_Type

Showing the structure of only the URL_Type_obf_Type column in mydata1.

str(mydata1$URL_Type_obf_Type)
##  chr [1:7149] "benign" "benign" "benign" "benign" "benign" "benign" ...

head() returns the first 6 rows of mydata1 by default, showing the data frame as 6 rows by 80 columns.

head(mydata1)

The structure of mydata1 shows 7149 observations/rows with 80 columns. This was done to see the data types of the objects.

#str(mydata1)
#mydata1$URL_Type_obf_Type <- as.factor(mydata1$URL_Type_obf_Type)
#str(mydata1)

Classification Trees

A classification tree is very similar to a regression tree, except that it is used to predict a qualitative response rather than a quantitative one.

We used different data: first all of the columns in mydata1, then mydata2, built from different subsets of correlated predictors, to see which would give the best model.

#mydata2<-mydata1

mydata2<-mydata1[,c(1,3,5,6,8,9,20,22,24,27,28,29,30,31,32,37,46,51,52,53,54,62,74,80)]
                 
#First Run
#mydata2<-mydata1[,c(1,3,5,8,20,22,25,26,27,29,30,31,39,41,42,43,44,47,48,49,50,56,57,58,60,64,68,72,73,76,78,80)]

#Second Run
#mydata2<-mydata1[,c(1,3,5,8,20,22,26,27,29,30,31,39,41,42,43,44,49,50,56,57,64,68,72,73,76,78,80)]

set.seed(123) makes the random number generator reproducible; sample() then randomly assigns each row of mydata2 to a partition, splitting the probability weights between training and testing, in this case 90% training and 10% test.

Accuracy for each split we tried (a sketch for reproducing this comparison follows the list):

Prob 60/40 - 0.9241355

Prob 70/30 - 0.9138418

Prob 80/20 - 0.9122435

Prob 90/10 - 0.9297218
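
A sketch of how this comparison could be reproduced (our own loop, not the code the numbers above came from); results may differ slightly since this sketch skips pruning:

# Try each training proportion and print the resulting test accuracy.
for (p in c(0.6, 0.7, 0.8, 0.9)) {
  set.seed(123)
  ind  <- sample(2, nrow(mydata2), replace = TRUE, prob = c(p, 1 - p))
  fit  <- rpart(URL_Type_obf_Type ~ ., data = mydata2[ind == 1, ])
  pred <- predict(fit, newdata = mydata2[ind == 2, ], type = "class")
  cat(p, ":", mean(pred == mydata2[ind == 2, "URL_Type_obf_Type"]), "\n")
}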

set.seed(123) # random number generator
ind <- sample(2, nrow(mydata2), replace = TRUE, prob = c(0.9, 0.1))

Assigning the partition indexes to each dataset: train1 gets rows where the index equals 1 and test1 gets rows where it equals 2.

train1 <- mydata2[ind==1, ] # the training set

test1  <- mydata2[ind==2, ] # the testing set
#train1

dim() returns the dimensions of an object:

train1 has 6466 rows and 24 columns

test1 has 683 rows and 24 columns

dim(train1)
## [1] 6466   24
dim(test1)
## [1] 683  24
#View(train1)

The table() function performs a categorical tabulation of URL_Type_obf_Type and its frequency.

train1 shows a frequency of 2456 for benign and 4010 for malware

test1 shows a frequency of 253 for benign and 430 for malware

table(train1$URL_Type_obf_Type)
## 
##  benign malware 
##    2456    4010
table(test1$URL_Type_obf_Type)
## 
##  benign malware 
##     253     430
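
As a quick baseline check (a sketch): the classes are imbalanced, so always predicting malware would already be right about 62% of the time (4010/6466) on train1; any useful model needs to beat that floor.

# Class proportions in the training set: roughly 38% benign, 62% malware.
prop.table(table(train1$URL_Type_obf_Type))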

rpart (Recursive Partitioning and Regression Trees) works by splitting the dataset recursively; the subsets that arise from a split are split further until a predetermined termination criterion is reached. rpart keeps track of the complexity of the tree, where the complexity measure is a combination of the size of the tree and its ability to separate the classes of the target variable. If the next split does not reduce the tree's overall complexity by a certain amount, the cp (complexity parameter), rpart terminates the growing process.

tree.data <- rpart(URL_Type_obf_Type~.,data = train1)
tree.data$cptable
##           CP nsplit rel error    xerror        xstd
## 1 0.41042345      0 1.0000000 1.0000000 0.015890595
## 2 0.20195440      1 0.5895765 0.5895765 0.013649081
## 3 0.05578176      2 0.3876221 0.3876221 0.011601273
## 4 0.02992671      3 0.3318404 0.3318404 0.010866639
## 5 0.01180782      6 0.2365635 0.2402280 0.009428024
## 6 0.01058632      7 0.2247557 0.2259772 0.009171297
## 7 0.01000000      8 0.2141694 0.2174267 0.009012079
#summary(tree.data)
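
To visualize how the cross-validated error changes with cp, rpart provides plotcp(); a minimal sketch:

# Plot cross-validated relative error (xerror) against cp; a common rule of
# thumb is the leftmost cp whose xerror is within one standard error (xstd)
# of the minimum.
plotcp(tree.data)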

Plot our classification decision tree tree.data by coercing it to the party object class, which consists of partynode objects representing the tree structure recursively along with the data.

plot(as.party(tree.data))

# Here we prune our tree. This is done to reduce the chance of overfitting to the train1 data and to reduce the overall complexity of the tree.

cp <- tree.data$cptable[7, "CP"] # cp value from the last row of the cp table
prune.tree.data <- prune(tree.data, cp = cp)
plot(as.party(prune.tree.data))

# The predict() function is used to predict values based on previous data behaviors by fitting that data to the model.

Taking the classification data in test1 and fitting it to prune.tree.data to see how well our model predicts.

rparty.test <- predict(prune.tree.data, newdata = test1, type ="class")
table(rparty.test,test1$URL_Type_obf_Type)
##            
## rparty.test benign malware
##     benign     234      29
##     malware     19     401

All columns:

rparty.test benign malware
benign 244 9
malware 9 421

Accuracy: 0.9736457

Original run:

rparty.test benign malware
benign 234 29
malware 19 401

Accuracy: 0.9297218

First run:

rparty.test benign malware
benign 237 19
malware 16 411

Accuracy: 0.9487555

Second run:

rparty.test benign malware
benign 229 17
malware 24 413

Accuracy: 0.9361371

# accuracy = 1 - (misclassified / total test rows)
# All columns
#accuracy=1-(18/683)
#80/20
#accuracy=1-(124/1413)
#70/30
#accuracy=1-(183/2124)
#60/40
#accuracy=1-(215/2834)
#First Run
#accuracy=1-(35/683)
#Second Run
#accuracy=1-(41/642)
#Original
accuracy=1-(48/683) 
accuracy
## [1] 0.9297218
#predict(prune.tree.data, newdata = test1, interval = 'confidence')
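
As a cross-check (a sketch using caret, which is already loaded), confusionMatrix() reproduces the accuracy computed by hand above, along with sensitivity and specificity:

# Same confusion matrix plus accuracy, sensitivity, and specificity in one
# call; should agree with the manual 1-(48/683) calculation.
confusionMatrix(rparty.test, factor(test1$URL_Type_obf_Type))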

Conclusion:

A classification tree model was used because, when attempting the linear regression model and converting URL_Type_obf_Type (benign, malware) to 1s and 0s, all the ones were sorted to the top, and thus the confusion matrix gave only two categories rather than three.

Splitting the data 90/10 gave our best prediction, with an accuracy of 0.9297218.

Loading and processing all of the data (omitting only NAs) yielded 0.9736457, but we felt we were overfitting the train1 model, which may give incorrect results on actual data.

Our decision tree makes decisions, based on iterations of the model, to predict the probability that a threat is benign or malware. Based on the values in certain columns, it takes you down a path until you reach the node with the predicted probability.
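
To see those paths explicitly, printing the fitted rpart object lists every split rule; a minimal sketch:

# Each line is a node: the split rule, the number of observations, and the
# predicted class, so a URL's path from root to leaf can be traced by hand.
print(prune.tree.data)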