1.1 Load iris dataset
The iris dataset was used for this analysis. It contains 150 observations of iris flowers: four columns of flower measurements in centimeters, plus a fifth column giving the species of the flower observed. Every observed flower belongs to one of three species.
# Attach iris data set to environment
data(iris)
# Rename data set
dataset <- iris
1.2 Load Caret Package
The caret package in R was utilized to build the models. This package provides a consistent interface to hundreds of machine learning algorithms, along with useful convenience methods for data visualization, data resampling, model tuning, and model comparison, among other features. It's a must-have tool for machine learning projects in R.
# Load package
library(caret)
I split the loaded iris dataset into two parts:
- 80% used to train the models
- 20% held back as a validation dataset
Creating a validation set was critical in determining whether the built models were any good.
Statistical methods such as cross-validation estimate a model's accuracy on unseen data, but I also wanted a concrete, independent check of the best model on data it had truly never seen. That is, I held back some data that the algorithms did not see (the validation set) and used it to get a second, independent idea of how accurate the best model would actually be.
# Create a list of 80% of the rows in the original dataset we can use for training
validation_index <- createDataPartition(dataset$Species, p = 0.80, list = FALSE)
# Select 20% of the data for validation
validation <- dataset[-validation_index, ]
# Use the remaining 80% of data to train and test the models
dataset <- dataset[validation_index, ]
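A quick check of the row counts confirms the 80/20 split:
# Sanity check: 150 rows split into 120 for training and 30 for validation
nrow(dataset)     # 120
nrow(validation)  # 30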
I then looked at the data in a number of different ways:
3.1 Dimensions of the Dataset
# Dataset dimensions
dim(dataset)
[1] 120 5
3.2 Types of Attributes
The types of attributes that reside in the data were identified next. Knowing the types is important as it will give an idea of how to better summarize the data and what transformations I might need to apply to prepare the data before it is modeled.
# List types for each attribute
sapply(dataset, class)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
"numeric" "numeric" "numeric" "numeric" "factor"
3.3 Peek at the data
For good measure, I took a quick look at the first six rows to get a better feel for what the data looks like.
# View the first six rows of the data
head(dataset)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
3.4 Levels of the Factor Variables
Species was the only factor variable in the dataset, so I investigated further to identify its levels. Species had three levels, making this a multi-class (multinomial) classification problem; with only two levels, it would have been a binary classification problem.
levels(dataset$Species)
[1] "setosa" "versicolor" "virginica"
3.5 Class Distribution
I then determined the number of instances that belong to each class of Species as an absolute count and as a percentage. Note that each class had the same number of instances (40 or 33% of the dataset).
# Summarize the class distribution
percentage <- prop.table(table(dataset$Species)) * 100
cbind(freq = table(dataset$Species), percentage = percentage)
freq percentage
setosa 40 33.33333
versicolor 40 33.33333
virginica 40 33.33333
3.6 Statistical Summary
# Summarize attribute distributions
summary(dataset)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
Median :5.800 Median :3.000 Median :4.350 Median :1.300
Mean :5.854 Mean :3.051 Mean :3.749 Mean :1.196
3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
Species
setosa :40
versicolor:40
virginica :40
After getting a basic feel for the data, I extended that understanding with some visualizations:
1. Univariate plots to better understand each attribute
2. Multivariate plots to better understand the relationships between attributes
4.1 Univariate Plots
I started with univariate plots (i.e., plots of each individual variable).
With visualization, it is helpful to have a way to refer to just the input attributes and just the output attribute. So I assigned the input attributes to x and the output attribute (the Species class) to y.
# Split input and output
x <- dataset[, 1:4]
y <- dataset[, 5]
Given that the input variables were numeric, I created box-and-whisker plots to view the distribution of each attribute.
# Boxplot for each attribute on one image
par(mfrow = c(1, 4))
for (i in 1:4) {
  boxplot(x[, i], main = names(dataset)[i])
}
I then created a bar plot of the output variable (Species) to get a graphical representation of the class distribution. This was uninteresting, as the instances were evenly distributed across the classes, confirming what was learned from the frequency table created earlier in the analysis.
# Barplot of class breakdown
library(dplyr)
library(ggplot2)
dataset %>% ggplot(aes(x = Species)) + geom_bar() + labs(x = "Iris Flower Species")
4.2 Multivariate Plots
After plotting each individual attribute, I explored the interactions between the variables by looking at scatter plots of all attribute pairs, with points colored by class. Because the points for each class were generally well separated, ellipses were drawn around each class to highlight both the relationships between the input attributes (trends) and the separation between the class values (ellipses).
# scatterplot matrix
featurePlot(x=x, y=y, plot="ellipse")
I then re-visited the box-and-whisker plots for each input variable, but this time breaking it out into separate plots for each Species class. This faceting helped tease out obvious linear separations between the classes and revealed that there are clearly different distributions of the attributes for each Species class.
# Box and whisker plots for each attribute
featurePlot(x = x, y = y, plot = "box")
The distribution of each attribute was explored further with probability density plots. Like the box-and-whisker plots above, the density plots were broken down by Species class. Histograms can also work for this, but I chose probability density plots in this case to give nice smooth lines for each distribution. As with the box plots, the difference in the distribution of each attribute by class is apparent. It is also worth noting the Gaussian-like distribution (bell curve) of each attribute.
# Density plots for each attribute by species class value
scales <- list(x = list(relation = "free"), y = list(relation = "free"))
featurePlot(x = x, y = y, plot = "density", scales = scales)
Next, I created models of the data and estimated their accuracy on unseen data.
This was a three-step process: setting up a test harness, building the models, and selecting the best model.
5.1 Test Harness
I used 10-fold cross-validation to estimate accuracy. This split the training data into 10 parts (train on 9, test on 1) and rotated through all 10 train-test combinations, so each algorithm was evaluated on 10 different held-out folds.
As mentioned, the "Accuracy" metric was used to evaluate the models: the number of correctly predicted instances divided by the total number of instances, multiplied by 100 to give a percentage (e.g. 95% accurate).
# Run algorithms using 10-fold cross validation
control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
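To make the Accuracy metric concrete, here is a tiny worked example; the prediction vectors below are made up for illustration, not model output:
# Accuracy = correct predictions / total predictions * 100 (hypothetical vectors)
actual    <- c("setosa", "setosa", "versicolor", "virginica")
predicted <- c("setosa", "setosa", "virginica",  "virginica")
mean(predicted == actual) * 100  # 75: three of four predictions correct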
5.2 Build Models
It was initially unknown which algorithms would work well on this problem or what configurations to use. The plots suggested that some of the classes are partially linearly separable in some dimensions, so generally good results were expected.
Five different algorithms were evaluated:
- Linear Discriminant Analysis (LDA)
- Classification and Regression Trees (CART)
- k-Nearest Neighbors (kNN)
- Support Vector Machines with a radial kernel (SVM)
- Random Forest (RF)
This was a good mixture of simple linear (LDA), nonlinear (CART, kNN) and complex nonlinear methods (SVM, RF). The random number seed was reset before each run to ensure evaluation of each algorithm was performed using exactly the same data splits and results were directly comparable.
The five models were built and saved as variables in the work space.
Linear Algorithms: LDA
# Linear Discriminant Analysis (LDA)
set.seed(7)
fit.lda <- train(Species~., data=dataset, method="lda", metric=metric, trControl=control)
Nonlinear Algorithms: CART and kNN
# Classification and Regression Trees (CART)
set.seed(7)
fit.cart <- train(Species~., data=dataset, method="rpart", metric=metric, trControl=control)
# k-Nearest Neighbors (kNN)
set.seed(7)
fit.knn <- train(Species~., data=dataset, method="knn", metric=metric, trControl=control)
Advanced Algorithms: SVM and RF
# Support Vector Machines (SVM)
set.seed(7)
fit.svm <- train(Species~., data=dataset, method="svmRadial", metric=metric, trControl=control)
# Random Forest (RF)
set.seed(7)
fit.rf <- train(Species~., data=dataset, method="rf", metric=metric, trControl=control)
5.3 Select Best Model
Once the five models and accuracy estimations for each were created, my next task was to compare the models and select the most accurate.
To do this, I created a list of the fitted models and passed these results to the summary function to get an output displaying the accuracy of each classifier along with other metrics, like Kappa.
# Summarize model accuracy for each model
results <- resamples(list(lda = fit.lda, cart = fit.cart, knn = fit.knn, svm = fit.svm, rf = fit.rf))
summary(results)
Call:
summary.resamples(object = results)
Models: lda, cart, knn, svm, rf
Number of resamples: 10
Accuracy
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lda 0.9167 1.0000 1.0000 0.9833 1 1 0
cart 0.8333 0.9167 0.9583 0.9500 1 1 0
knn 0.9167 1.0000 1.0000 0.9833 1 1 0
svm 0.8333 0.9375 1.0000 0.9667 1 1 0
rf 0.7500 0.9375 1.0000 0.9583 1 1 0
Kappa
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
lda 0.875 1.0000 1.0000 0.9750 1 1 0
cart 0.750 0.8750 0.9375 0.9250 1 1 0
knn 0.875 1.0000 1.0000 0.9750 1 1 0
svm 0.750 0.9062 1.0000 0.9500 1 1 0
rf 0.625 0.9062 1.0000 0.9375 1 1 0
I then created a plot of the model evaluation results and compared the spread as well as the mean accuracy of each model. It is important to note that there is a population of accuracy measures for each algorithm, because each algorithm was evaluated 10 times (10-fold cross-validation), which is why the mean accuracy estimates were compared. The most accurate model in this case was LDA, given it had the highest mean accuracy with the smallest spread.
# Compare accuracy of models
dotplot(results)
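caret also provides a box-and-whisker view of the same resampling results, which can make the spread around each mean easier to compare:
# Alternative comparison: box-and-whisker plot of the resampling distributions
bwplot(results)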
Since LDA was identified as the best model, results for just LDA were summarized. The output gave a nice summary of what was used to train the model and the mean accuracy achieved, specifically 98.3%.
# Summarize Best Model
print(fit.lda)
Linear Discriminant Analysis
120 samples
4 predictor
3 classes: 'setosa', 'versicolor', 'virginica'
No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
Resampling results:
Accuracy Kappa
0.9833333 0.975
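The printed summary reports only the mean accuracy and Kappa. The spread across the 10 folds can be recovered from the per-fold results that train() retains on the fitted object (assuming caret's default behavior of keeping the final model's resamples):
# Per-fold accuracy values retained by train()
fit.lda$resample$Accuracy
# Standard deviation quantifies the spread around the 98.3% mean
sd(fit.lda$resample$Accuracy)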
LDA was the most accurate model on the training set, but I still had to determine its accuracy on the validation set to get an independent final check. As a best practice, I kept a validation set in case of overfitting to the training set or a data leak, either of which would have produced an overly optimistic result.
The LDA model was run directly on the validation set and the results were summarized in a confusion matrix. The accuracy was 96.67%, with a single versicolor flower misclassified as virginica. The validation dataset was small (20% of the data), but this result is consistent with the cross-validation estimate of 98.3%, suggesting LDA was an accurate and reliable model.
# Estimate skill of LDA on the validation dataset
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species)
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 10 0 0
versicolor 0 9 0
virginica 0 1 10
Overall Statistics
Accuracy : 0.9667
95% CI : (0.8278, 0.9992)
No Information Rate : 0.3333
P-Value [Acc > NIR] : 2.963e-13
Kappa : 0.95
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.9000 1.0000
Specificity 1.0000 1.0000 0.9500
Pos Pred Value 1.0000 1.0000 0.9091
Neg Pred Value 1.0000 0.9524 1.0000
Prevalence 0.3333 0.3333 0.3333
Detection Rate 0.3333 0.3000 0.3333
Detection Prevalence 0.3333 0.3000 0.3667
Balanced Accuracy 1.0000 0.9500 0.9750
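With accuracy confirmed on the validation set, the fitted model can be used to classify new flowers. A minimal sketch, with made-up measurements:
# Classify a single new flower from its four measurements (values are hypothetical)
new_flower <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.4,
                         Petal.Length = 1.5, Petal.Width = 0.2)
predict(fit.lda, new_flower)                 # predicted species
predict(fit.lda, new_flower, type = "prob")  # class probabilities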