The goal of this document is to compare the accuracy of classification models using the iris data set. Classification models attempt to predict a categorical (also known as discrete) variable. In this example, the predictor variables are continuous.
The dataset contains 150 observations, each with five attributes: sepal length, sepal width, petal length, petal width, and species. These attributes are the variables in the models. The dependent variable, species, has three levels: setosa (50), versicolor (50), and virginica (50). We will build models that use the information of the predictor variables, the lengths and widths of the sepals and petals, to classify the species of an iris flower.
Because the iris dataset is a built-in R dataset, we get a free pass on the data cleaning/wrangling/preparation part of the analysis.
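A quick way to confirm the structure described above is to inspect the built-in data directly (base R only, no output shown here):

str(iris)
table(iris$Species)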
Packages used:
* dplyr for efficient and fast data manipulation
* ggvis for data visualization
* caret for partitioning the dataset
* gmodels for creating contingency tables
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(ggvis))
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(gmodels))
Before we start building our baseline model, we’ll partition the data into two sets: one training set and one test set. The training set will be 80% and the test set 20% of the size of the original dataset, with observations assigned at random. The training set is used to build the models. Once we have a model, we apply it to the test set. Because the test set has a species label for each observation, we can check the accuracy of the predictions; this is what makes it a supervised model. We’ll go through this process 1,000 times and then calculate the average accuracy.
set.seed(9911)
trainIndex <- createDataPartition(iris$Species, p = .8, list = FALSE, times = 1)
Train <- iris[ trainIndex,]
Test <- iris[-trainIndex,]
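Since createDataPartition() samples within each level of the outcome, the 80/20 split should preserve the 50/50/50 class balance, giving roughly 40 of each species in the training set and 10 in the test set. A quick check on the objects created above:

table(Train$Species)
table(Test$Species)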
The baseline model is a simple algorithm that uses the attributes that best cluster and separate the species. We do a visual exploratory analysis of the training set, which consists of plotting each pair of the four predictors against each other while color coding by species. Although there are a total of \({4 \choose 2} = 6\) possible plots, only two convey the clustering of similar species. We include both of those plots below.
Train %>% ggvis(~Sepal.Length, ~Sepal.Width, fill = ~Species) %>%
  layer_points() %>%
  add_axis("x", orient = "bottom") %>%
  add_axis("x", orient = "top", ticks = 0, title = "Sepal Width against Sepal Length")

Train %>% ggvis(~Petal.Length, ~Petal.Width, fill = ~Species) %>%
  layer_points() %>%
  add_axis("x", orient = "bottom") %>%
  add_axis("x", orient = "top", ticks = 0, title = "Petal Width against Petal Length")
The plots show that petal length and petal width are the best differentiators between the species, so we’ll use those two attributes to assign a species label to each observation. The simple rule-based algorithm below will be our baseline classification model (a vectorized sketch of the rules follows the list). It assigns the species:
* setosa to observations with petal length < 2 and petal width <= 0.7
* versicolor to observations with 3 <= petal length < 5 and 0.8 <= petal width <= 1.6
* virginica to observations with petal length >= 5 and petal width > 1.6
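As a side note, these rules can also be expressed in vectorized form. The sketch below uses dplyr::case_when() inside a hypothetical helper called classify_baseline(); rows matching none of the rules are left as NA. The row-by-row loops used for the actual runs appear further down.

classify_baseline <- function(df) {
  # Baseline rules from the list above; unmatched rows become NA
  df %>% mutate(species_bline = case_when(
    Petal.Length < 2 & Petal.Width <= 0.7 ~ "setosa",
    Petal.Length >= 3 & Petal.Length < 5 &
      Petal.Width >= 0.8 & Petal.Width <= 1.6 ~ "versicolor",
    Petal.Length >= 5 & Petal.Width > 1.6 ~ "virginica",
    TRUE ~ NA_character_
  ))
}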
After running the baseline classification algorithm over 1,000 random splits, it is accurate 91.47% of the time on average. Not bad for a bare-bones algorithm. We still need to see how the other classification approaches do.
accuracy_df <- data.frame()

for (j in 1:1000){
  # Draw a new random 80/20 split for each repetition
  set.seed(8000 + j)
  trainIndex <- createDataPartition(iris$Species, p = .8, list = FALSE, times = 1)
  Train <- iris[ trainIndex,]
  Test  <- iris[-trainIndex,]
  Train.lab <- Train$Species
  Test.lab  <- Test$Species

  # Apply the baseline rules to the training set
  for (i in 1:nrow(Train)){
    if (Train$Petal.Length[i] < 2 & Train$Petal.Width[i] <= 0.7){
      Train$species_bline[i] <- "setosa"
    } else if (Train$Petal.Length[i] >= 3 & Train$Petal.Length[i] < 5 &
               Train$Petal.Width[i] >= 0.8 & Train$Petal.Width[i] <= 1.6){
      Train$species_bline[i] <- "versicolor"
    } else if (Train$Petal.Length[i] >= 5 & Train$Petal.Width[i] > 1.6){
      Train$species_bline[i] <- "virginica"
    }
  }

  # Apply the same rules to the test set
  for (i in 1:nrow(Test)){
    if (Test$Petal.Length[i] < 2 & Test$Petal.Width[i] <= 0.7){
      Test$species_bline[i] <- "setosa"
    } else if (Test$Petal.Length[i] >= 3 & Test$Petal.Length[i] < 5 &
               Test$Petal.Width[i] >= 0.8 & Test$Petal.Width[i] <= 1.6){
      Test$species_bline[i] <- "versicolor"
    } else if (Test$Petal.Length[i] >= 5 & Test$Petal.Width[i] > 1.6){
      Test$species_bline[i] <- "virginica"
    }
  }

  # Proportion of test observations classified correctly
  accuracy_df[j,1] <- sum(Test$Species == Test$species_bline)/length(Test$species_bline)
}
mean(accuracy_df[,1])
## [1] 0.9147667
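Beyond the mean, it can be useful to glance at how much the accuracy varies across the 1,000 random splits. A quick check, using the accuracy_df built in the loop above:

summary(accuracy_df[,1])
sd(accuracy_df[,1])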
The naive Bayes algorithm is another classification method; it is implemented in the e1071 package. It applies Bayes’ Theorem and assumes independence between the predictors. Although the method is easy to carry out, it does surprisingly well and often outperforms other classification methods. It should be noted that naive Bayes is not a Bayesian approach, but rather an approach that uses Bayes’ Theorem.
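Concretely, for a class \(C_k\) (one of the three species) and a predictor vector \(x = (x_1, \dots, x_4)\), naive Bayes scores each class by

\[
P(C_k \mid x) \propto P(C_k) \prod_{i=1}^{4} P(x_i \mid C_k),
\]

and predicts the class with the largest score. The “naive” part is the assumption that the four predictors are conditionally independent given the species.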
I want to thank Stephen Turner for developing the splitdf function, which creates randomized partitions. I’m including his code because it’s an alternative way to create a partition.
splitdf <- function(df, seed = NULL, train_fraction = 0.8) {
  if (train_fraction <= 0 | train_fraction >= 1) stop("Training fraction must be strictly between 0 and 1")
  if (!is.null(seed)) set.seed(seed)
  # Sample row indices (without stratification) for the training set
  index <- 1:nrow(df)
  trainindex <- sample(index, trunc(length(index) * train_fraction))
  trainset <- df[trainindex, ]
  testset  <- df[-trainindex, ]
  list(trainset = trainset, testset = testset)
}
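As a quick illustration of the function (the seed here is arbitrary), note that unlike createDataPartition(), splitdf() samples rows without stratifying by species:

splits <- splitdf(iris, seed = 123, train_fraction = 0.8)
nrow(splits$trainset)   # 120 rows (80% of 150)
nrow(splits$testset)    # 30 rows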
Again, as with the baseline model, we’ll run the naive Bayes algorithm 1,000 times and then calculate the average accuracy. As we can see, naive Bayes correctly predicts the species with an accuracy of 95.47%. Not bad. This is a clear improvement over the baseline model.
suppressPackageStartupMessages(library(e1071))

accuracy_df <- data.frame()
for (i in 1:1000){
  # Draw a new random 80/20 split for each repetition
  splits <- splitdf(iris, seed = 3000 + i, train_fraction = .8)
  Train <- as.data.frame(splits[1])
  Test  <- as.data.frame(splits[2])
  names(Train) <- names(iris)   # restore the original column names
  names(Test)  <- names(iris)
  # Fit naive Bayes on the training set and predict on the test set
  m <- naiveBayes(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data = Train)
  p <- predict(m, Test[,-5])
  accuracy_df[i,1] <- sum(p == Test$Species)/length(p)
}
mean(accuracy_df[,1])
## [1] 0.9546667
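The gmodels package was loaded for contingency tables, so for a single run we can cross-tabulate the predictions against the true labels to see which species get confused. A sketch, assuming the p and Test objects left over from the last iteration of the loop above:

CrossTable(x = Test$Species, y = p, prop.chisq = FALSE)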