A decision tree model is a classification method that predicts an outcome by repeatedly splitting the data on predictor variables. In this example, a decision tree will be constructed on the iris dataset.
# loads the dataset into the environment
data(iris)
# gives variable name to dataset
ds <- iris
str(ds)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
An index of the dataset is made using the 'caret' package. Next, training and test sets are created to fit and evaluate a prediction model. The Species column is used as the response variable, and the decision tree model is constructed using the 'rpart' package.
# loads caret for the data partitioning functions
library(caret)
# index of the dataset (a stratified 75/25 split on Species)
index <- createDataPartition(ds$Species, p=0.75, list=FALSE)
# creates the training and testing dataset
train <- ds[index,]
test <- ds[-index,]
# finds the column index of the response variable, Species
colIndex <- grep('Species', names(ds))
# loads rpart for the decision tree model
library(rpart)
# creates the decision tree model and plots the tree
dtm <- rpart(Species~., data = train, method = 'class')
rpart.plot::rpart.plot(dtm)
The plot above shows the predicted iris species based on petal length. At the root, before any splits, each species makes up 33% of the training observations. The first split shows that if the petal length is less than 2.5 cm, the species is setosa with 100% probability.
Otherwise, the species is either versicolor or virginica. The next split shows that if the petal length is less than 4.8 cm, there is a 97% chance the species is versicolor; with a petal length of 4.8 cm or more, there is an 88% chance the species is virginica.
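To double-check the thresholds read off the plot, the split rules can also be printed as text. A minimal sketch, assuming rpart.plot version 3.0 or later (which added rpart.rules()):
# prints one row of conditions per leaf, e.g. the 2.5 cm and 4.8 cm cut-offs
rpart.plot::rpart.rules(dtm)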
Now that we have the fitted tree, we can check the accuracy of the model on the test set as well as create a confusion matrix.
# checks the accuracy of the model
prediction <- predict(dtm, test[,-colIndex], type='class')
check <- mean(prediction==test$Species)
check
## [1] 0.8611111
Multiplying this output by one hundred gives the accuracy of the decision tree model as a percentage, roughly 86% on the test set.
# constructs a confusion matrix
table(Prediction=prediction, Truth = test$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 10 3
## virginica 0 2 9
The confusion matrix cross-tabulates the predicted species against the true species. Values on the diagonal are correctly classified observations, while off-diagonal values are misclassifications; here, a few versicolor and virginica flowers are confused with each other.
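The same accuracy computed earlier can be recovered from the matrix itself by dividing the diagonal counts by the total. A minimal sketch (cm is just an illustrative name):
# accuracy = correctly classified (diagonal) / total observations
cm <- table(Prediction = prediction, Truth = test$Species)
sum(diag(cm)) / sum(cm)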
An SVM (support vector machine) is a supervised learning algorithm that separates classes with a decision boundary. It can be applied to many kinds of data, but in this example we use a linear kernel, since the iris classes are close to linearly separable. A linear SVM is constructed below using the 'e1071' library, along with a confusion matrix.
# loads e1071 for the support vector machine
library(e1071)
# creates the support vector machine with a linear kernel
svmLinear <- svm(Species~., data = train, kernel = 'linear')
svmLinear
##
## Call:
## svm(formula = Species ~ ., data = train, kernel = "linear")
##
##
## Parameters:
## SVM-Type: C-classification
## SVM-Kernel: linear
## cost: 1
##
## Number of Support Vectors: 24
# confusion matrix on the training data
table(Prediction = predict(svmLinear, train), Truth = train$Species)
## Truth
## Prediction setosa versicolor virginica
## setosa 38 0 0
## versicolor 0 38 1
## virginica 0 0 37
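Note that the matrix above is computed on the training data, so it may look better than the model's true performance. As a sanity check, here is a sketch that scores the SVM on the held-out test set instead (svmPrediction is an illustrative name):
# evaluates the SVM on data it has not seen during fitting
svmPrediction <- predict(svmLinear, test)
mean(svmPrediction == test$Species)
table(Prediction = svmPrediction, Truth = test$Species)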
The random forest algorithm combines predictions from many decision trees. Each tree is trained on a random bootstrap sample of the data, with a random subset of variables considered at each split, and the forest predicts by majority vote. This is done using the 'randomForest' library.
# fits the random forest; printing it reports the OOB error and confusion matrix
rf <- randomForest::randomForest(Species~., data = train)
rf
##
## Call:
## randomForest(formula = Species ~ ., data = train)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.88%
## Confusion matrix:
## setosa versicolor virginica class.error
## setosa 38 0 0 0.00000000
## versicolor 0 38 0 0.00000000
## virginica 0 1 37 0.02631579
plot(rf, main = "Random Forest Error")
The plot above shows the prediction error as a function of the number of decision trees used in the algorithm. The error oscillates when the forest is small, then stabilizes and stops varying as more trees are added.
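The OOB error reported above is an internal estimate from the bootstrap samples. A quick sketch of an additional check against the held-out test set (rfPrediction is an illustrative name):
# compares the forest's predictions against the true test labels
rfPrediction <- predict(rf, test)
mean(rfPrediction == test$Species)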
Backpropagation is the algorithm used to train a neural network by passing the prediction error backwards through the layers to update the weights. Here the network is set up as an autoencoder that maps nD data onto a 2D plane: the four iris measurements serve as both the inputs and the outputs, and the middle hidden layer has only two nodes, so the network must squeeze the 4D data through a 2D representation. This uses the 'neuralnet' library.
# builds a data frame where the four measurements appear twice, as inputs and as outputs
bindTrain <- ds
bindTrain <- bindTrain[,1:4]
bindTrain <- cbind(bindTrain, bindTrain)
# names the columns as input (.I) and output (.O) versions of each measurement
colnames(bindTrain) <- c("SL.I","SW.I", "PL.I", "PW.I", "SL.O","SW.O", "PL.O", "PW.O")
# shows the dataset
head(bindTrain)
## SL.I SW.I PL.I PW.I SL.O SW.O PL.O PW.O
## 1 5.1 3.5 1.4 0.2 5.1 3.5 1.4 0.2
## 2 4.9 3.0 1.4 0.2 4.9 3.0 1.4 0.2
## 3 4.7 3.2 1.3 0.2 4.7 3.2 1.3 0.2
## 4 4.6 3.1 1.5 0.2 4.6 3.1 1.5 0.2
## 5 5.0 3.6 1.4 0.2 5.0 3.6 1.4 0.2
## 6 5.4 3.9 1.7 0.4 5.4 3.9 1.7 0.4
# scales each column of the training data to the [0, 1] range
processValues <- preProcess(bindTrain, method = c("range"))
proTrain <- predict(processValues, bindTrain)
# loads neuralnet, then produces the network (plotted below)
library(neuralnet)
model <- neuralnet(SL.O + SW.O + PL.O + PW.O ~ SL.I + SW.I + PL.I + PW.I,
proTrain, hidden=c(3,2,3), algorithm = 'rprop+', threshold = 0.01)
plot(model, rep = 1)
The network's weights are initialized to random values, and resilient backpropagation ('rprop+') then adjusts them step by step until the improvement in error falls below the 0.01 threshold. The plot reports the final weights along with the number of training steps and the remaining error. Because the output layer reconstructs the four input measurements, the two-node middle layer is where the 4D-to-2D mapping takes place.
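Since the middle hidden layer has only two nodes, its activations are the 2D representation of each flower. A sketch of one way to extract and plot them, assuming neuralnet's compute() interface, where the neurons list holds each layer's activations with a leading bias column:
# feeds the scaled inputs through the network and keeps the two-node layer
act <- neuralnet::compute(model, proTrain[, 1:4])
codes <- act$neurons[[3]][, -1]  # layer 3 is the two-unit bottleneck; drop the bias column
plot(codes, col = ds$Species, xlab = "Node 1", ylab = "Node 2")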
This classification method (AdaBoost) builds a sequence of "weak learners", typically shallow decision trees called stumps. After each round, misclassified observations receive more weight, so the next learner concentrates on the hard cases, and the weighted predictions are combined into a single strong classifier. The cross-validated confusion matrix for this method is output below.
# loads adabag and runs 5-fold cross-validated boosting; prints progress per fold
library(adabag)
bigBoost <- boosting.cv(Species~., data = ds, boos = TRUE, mfinal = 10, v = 5)
## i: 1 Sun May 03 17:15:45 2020
## i: 2 Sun May 03 17:15:47 2020
## i: 3 Sun May 03 17:15:49 2020
## i: 4 Sun May 03 17:15:52 2020
## i: 5 Sun May 03 17:15:54 2020
bigBoost$confusion
## Observed Class
## Predicted Class setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 45 5
## virginica 0 5 45
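The cross-validated error rate is also stored on the result object, and it should agree with the ten off-diagonal counts in the matrix (10/150, about 6.7%):
# average misclassification rate across the 5 cross-validation folds
bigBoost$error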
Each of the algorithms above predicts the species of an iris flower from the length and width measurements of its petals and sepals. The accuracy of the models is generally assessed through a confusion matrix, from which the error rate of the predictions can be computed.
When comparing the algorithms above, the linear support vector machine was the most accurate because its calculated error is lowest. The linear model suits the iris dataset well, producing the confusion matrix with the least error. The random forest model was also accurate, which is expected since it aggregates many decision trees.
The least accurate model was the back-propagation method. The neural network was the hardest to interpret, and it produced a very high error. This could be due to the nature of the data (the other methods do well because the classes are nearly linearly separable) or to an error in the network's setup. It is also interesting that the AdaBoost algorithm produced a confusion matrix with more misclassified values than either the linear support vector machine or the random forest.