Required packages

The “caret” and “ellipse” packages are required to perform the steps in this document. The “if” statements below install each package only if it is not already installed, to avoid losing time re-installing packages that are already present.

if (!requireNamespace("caret", quietly = TRUE)) install.packages("caret")
if (!requireNamespace("ellipse", quietly = TRUE)) install.packages("ellipse")
library(caret)
library(ellipse)

Executive Summary

Machine learning is an application of artificial intelligence (AI) whereby systems automatically learn and improve from experience without being explicitly programmed (Expertsystem.com, 2020). The machine learning performed in this document was based on the tutorial “Your First Machine Learning Project in R Step-By-Step” by Jason Brownlee of Machine Learning Mastery, and aims to produce a model that can predict flower species from the length and width of the flower’s petals and sepals (the leaf-like parts that enclose the bud) (Brownlee, 2016). The steps involved in the process are set out in the sections that follow.

Data

The Iris dataset will be used to produce the machine learning algorithm in this document. The Iris dataset is readily available in the R environment but can also be accessed from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data). The Iris dataset will be assigned to “dataset” so that the rest of the steps can be used in the future with new datasets simply by assigning the new dataset to “dataset”.

#The Iris dataset is pre-loaded into the R environment
data(iris)
# rename the dataset so the remaining steps can be performed with any dataset assigned to "dataset"
dataset <- iris
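
If the data are read from the UCI file rather than from R's built-in copy, the file has no header row and the species labels carry an "Iris-" prefix, so column names have to be supplied. The sketch below illustrates this on a few rows written in the UCI file's format so that it runs offline. The column names are an assumption, chosen to match R's built-in iris data frame, and `dataset_uci` is a name introduced here only for illustration.

```r
# Column names are an assumption, chosen to match R's built-in iris data frame
col_names <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

# A few rows in the UCI file's format (no header, "Iris-" prefixed labels)
sample_rows <- "5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica"

dataset_uci <- read.csv(text = sample_rows, header = FALSE,
                        col.names = col_names, stringsAsFactors = TRUE)

# Strip the "Iris-" prefix so the labels match the built-in dataset
levels(dataset_uci$Species) <- sub("^Iris-", "", levels(dataset_uci$Species))
str(dataset_uci)

# To read the full file instead (requires internet access):
# uci_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"
# dataset_uci <- read.csv(uci_url, header = FALSE, col.names = col_names, stringsAsFactors = TRUE)
```

Keeping the column layout identical to the built-in data matters because later steps refer to columns 1 to 4 and to the “Species” column by name.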

Understand

The Iris dataset contains 150 observations of 5 variables for flowers from three different species, “setosa”, “versicolor” and “virginica”. All columns are numeric except for the “Species” column, which is a factor as it contains categorical data about flower species. The variables in the dataset are as follows:

dim(dataset)
[1] 150   5
sapply(dataset, class) #list the data types in the dataset
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
   "numeric"    "numeric"    "numeric"    "numeric"     "factor" 
head(dataset)
levels(dataset$Species)
[1] "setosa"     "versicolor" "virginica" 

Create Training and Validation datasets

The dataset was then separated into two subsets of data, one to be used for training the machine learning algorithms and another to be used to validate the final model and check that data leakage had not occurred.

# Select the row indices of a random 80% of the original dataset, balanced by species. These rows will be used to train the models
validation_index <- createDataPartition(dataset$Species, p=0.80, list=FALSE)
# Hold out the remaining 20% of the rows for validation
validation <- dataset[-validation_index,]
# Assign the 80% training subset to "dataset"
dataset <- dataset[validation_index,]
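
For a factor outcome such as “Species”, createDataPartition samples within each class, which is why the training data end up with the same number of rows per species. The base R sketch below approximates that idea without caret. It is not caret's implementation, the seed is arbitrary, and the names `training_sketch` and `validation_sketch` are introduced here only for illustration.

```r
data(iris)
set.seed(1)  # arbitrary seed, only so the sketch is reproducible

# Row numbers grouped by species
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Sample 80% of the rows within each species
train_rows <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

training_sketch   <- iris[train_rows, ]
validation_sketch <- iris[-train_rows, ]

table(training_sketch$Species)    # 40 rows per species
table(validation_sketch$Species)  # 10 rows per species
```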

Explore

The training data was explored to observe the number of instances of each flower species and summary statistics (mean, median, min, max) of the dataset.

# Look at the number of instances (rows) that belong to each class
percentage <- prop.table( table(dataset$Species) ) * 100
cbind(freq = table(dataset$Species), percentage = percentage)
           freq percentage
setosa       40   33.33333
versicolor   40   33.33333
virginica    40   33.33333
# There are the same number of observations (40) in each class (Species)
summary(dataset)
  Sepal.Length   Sepal.Width     Petal.Length    Petal.Width          Species  
 Min.   :4.30   Min.   :2.000   Min.   :1.000   Min.   :0.100   setosa    :40  
 1st Qu.:5.10   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300   versicolor:40  
 Median :5.80   Median :3.000   Median :4.400   Median :1.300   virginica :40  
 Mean   :5.86   Mean   :3.051   Mean   :3.765   Mean   :1.201                  
 3rd Qu.:6.40   3rd Qu.:3.325   3rd Qu.:5.100   3rd Qu.:1.800                  
 Max.   :7.90   Max.   :4.400   Max.   :6.900   Max.   :2.500                  

Visualise the data

The flower species was separated from the numeric variables in the training data in order to inspect the numeric variables.

Univariate plots of the numeric training data were produced and outliers were observed in the “Sepal.Width” variable. “Sepal.Length” exhibited slightly positive skew while “Petal.Length” and “Petal.Width” displayed slightly negative skew. A barplot of the frequency of each flower species was also produced, which confirmed that each flower species was represented the same number of times in the training data.

A multivariate plot was produced to visualise the relationship between each pair of numeric variables. The points in these scatter plots were grouped by flower species. The points appeared to cluster according to flower species and thus they were encircled with an ellipse to highlight the clustering. Boxplots of each variable, grouped according to flower species, were also produced to observe whether particular variables displayed different distributions according to flower species. The boxplots revealed that the distribution of all variables appeared to differ according to flower species. Finally, density plots were produced to highlight the differences in distributions of the numeric variables according to flower species.

#Univariate plots
#Split the data into species, and the numeric variables.
x <- dataset[ ,1:4]
y <- dataset[ ,5]
#Because data in x are numeric, you can use a boxplot to visualise the variables
par(mfrow = c(1,4))
for (i in 1:4) {
  boxplot(x[ ,i], main = names(x)[i])
}
#outliers in Sepal.Width
#slight positive skew in Sepal.Length, and slight negative skew in Petal.Length and Petal.Width
#barplot of y - not interesting b/c all categories have the same freq as seen earlier
par(mfrow = c(1,1))

plot(y)

#Multivariate plots
#scatterplot matrix of every pair of variables.
featurePlot(x = x, y = y)

#The points from each of the 3 Species appear to be clustered, so circle them to highlight this ("ellipse").
featurePlot(x = x, y = y, plot = "ellipse")

#boxplots by Species (class)
featurePlot(x = x, y = y, plot = "box")

#When split by Species, the distribution of each variable (x) is different. 
#>>>There appears to be a relationship between Species and the distribution of each of the variables
#density plots for each attribute by class value
scales <- list(x = list(relation = "free"), y = list(relation = "free") )
featurePlot(x = x, y = y, plot = "density", scales = scales)

# shows that the distribution of each attribute (x) is different based on the species
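
The skew read off the boxplots can also be checked numerically. The sketch below computes a moment coefficient of skewness for each numeric variable. It uses the full iris data rather than the random training subset so that it is reproducible, and the `skewness` helper is defined here for illustration and is not part of caret or base R.

```r
data(iris)

# Moment coefficient of skewness: third central moment divided by
# the second central moment raised to the power 1.5
skewness <- function(v) {
  m <- mean(v)
  mean((v - m)^3) / (mean((v - m)^2))^1.5
}

skew <- sapply(iris[, 1:4], skewness)
round(skew, 2)
# Positive values indicate a longer right tail, negative values a longer left tail
```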

Evaluate Some Algorithms

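Five algorithms were evaluated using the training dataset: linear discriminant analysis (LDA), classification and regression trees (CART), k-nearest neighbours (kNN), a support vector machine (SVM) with a radial kernel, and random forest (RF).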

The algorithms were used to produce predictive models by training them using 10-fold cross-validation. This involved splitting the training data into 10 parts, 9 of which were used to train the model and 1 to test it, for each of the 10 possible train-test splits. Note that the code below runs a single pass of 10-fold cross-validation (method = "cv"), so each algorithm is evaluated on 10 resamples; repeating the procedure 3 times with different splits would require method = "repeatedcv" with repeats = 3 in trainControl.
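
The splitting procedure described above can be sketched by hand. The example below is a simplified illustration, not caret's implementation. It uses the full iris data, assigns rows to folds purely at random (caret also balances the classes across folds, which is not attempted here), and fits the model with lda() from the MASS package, on which caret's "lda" method is built.

```r
library(MASS)  # provides lda()
data(iris)
set.seed(7)

k <- 10
# Assign each row to one of k folds at random (not stratified)
fold_id <- sample(rep(seq_len(k), length.out = nrow(iris)))

fold_accuracy <- sapply(seq_len(k), function(i) {
  train_fold <- iris[fold_id != i, ]   # 9 parts used for training
  test_fold  <- iris[fold_id == i, ]   # 1 part held out for testing
  fit  <- lda(Species ~ ., data = train_fold)
  pred <- predict(fit, test_fold)$class
  mean(pred == test_fold$Species)
})

fold_accuracy        # one accuracy per fold
mean(fold_accuracy)  # cross-validated accuracy estimate
```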

The models were then evaluated according to how accurate they were when making predictions. In this case, accuracy is the number of correctly predicted instances (flower species) divided by the total number of instances (predictions made) in the dataset.
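
As a worked illustration of the two metrics reported in the resampling summary, the sketch below computes accuracy and Cohen's Kappa from a confusion table in base R. The labels are hypothetical: a fold of 12 flowers, 4 per species, with a single virginica predicted as versicolor. Kappa rescales accuracy by the agreement that would be expected by chance.

```r
species <- c("setosa", "versicolor", "virginica")

# Hypothetical fold of 12 flowers (4 per species) with one misclassification
actual    <- factor(rep(species, each = 4), levels = species)
predicted <- actual
predicted[12] <- "versicolor"   # a virginica predicted as versicolor

confusion <- table(Prediction = predicted, Reference = actual)
n <- sum(confusion)

accuracy   <- sum(diag(confusion)) / n                            # observed agreement
expected   <- sum(rowSums(confusion) * colSums(confusion)) / n^2  # agreement expected by chance
kappa_stat <- (accuracy - expected) / (1 - expected)

accuracy    # 11/12, about 0.917
kappa_stat  # 0.875
```

The same pairing of 0.9166667 accuracy with 0.875 Kappa appears in the resampling summary, which is consistent with folds of 12 flowers containing one error.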

The LDA model was identified as the most accurate model, having the highest mean accuracy and Kappa of all models produced. In the example run, its mean accuracy was 97.5%, with a standard deviation across folds of 5.6 percentage points, which is used here as an informal margin of error (MoE).
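
The mean and standard deviation quoted above can be reproduced from per-fold accuracies. The individual fold values are not printed in this document, so the vector below is an assumption inferred from the LDA row of the resampling summary (minimum 0.833, mean 0.975, quartiles of 1): one fold at 10/12, one at 11/12 and eight at 12/12.

```r
# Fold accuracies inferred from the LDA row of the resampling summary (an assumption)
lda_folds <- c(10/12, 11/12, rep(1, 8))

cv_mean <- mean(lda_folds)  # 0.975
cv_sd   <- sd(lda_folds)    # about 0.056

# Informal interval of one standard deviation either side, capped at 100%
c(lower = cv_mean - cv_sd, upper = min(1, cv_mean + cv_sd))
```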

# Run algorithms using 10-fold cross validation
control <- trainControl(method = "cv", number = 10)
metric <- "Accuracy"
#We will be using the "metric" variable created above when we build and evaluate each model (next).
#Build the 5 models:
#Make sure you reset the random number seed before each run to ensure that the evaluation of each algorithm is performed using exactly the same data splits. This ensures the results are directly comparable.
# 1 - LDA
set.seed(7)
fit.lda <- train(Species~., data=dataset, method="lda", metric=metric, trControl=control)
# 2 - CART
set.seed(7)
fit.cart <- train(Species~., data=dataset, method="rpart", metric=metric, trControl=control)
# 3 - kNN
set.seed(7)
fit.knn <- train(Species~., data=dataset, method="knn", metric=metric, trControl=control)
# 4 - SVM
set.seed(7)
fit.svm <- train(Species~., data=dataset, method="svmRadial", metric=metric, trControl=control)
# 5 - Random Forest
set.seed(7)
fit.rf <- train(Species~., data=dataset, method="rf", metric=metric, trControl=control)
#summarize the accuracy of the models and compare them to identify the best one
results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
summary(results) # Accuracy is the proportion of correct predictions; Kappa is the agreement with the true labels after correcting for chance

Call:
summary.resamples(object = results)

Models: lda, cart, knn, svm, rf 
Number of resamples: 10 

Accuracy 
          Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
lda  0.8333333 1.0000000 1.0000000 0.9750000 1.0000000    1    0
cart 0.8333333 0.9166667 0.9166667 0.9333333 0.9791667    1    0
knn  0.9166667 0.9166667 1.0000000 0.9666667 1.0000000    1    0
svm  0.8333333 0.9166667 0.9583333 0.9416667 1.0000000    1    0
rf   0.9166667 0.9166667 0.9583333 0.9583333 1.0000000    1    0

Kappa 
      Min. 1st Qu. Median   Mean 3rd Qu. Max. NA's
lda  0.750   1.000 1.0000 0.9625 1.00000    1    0
cart 0.750   0.875 0.8750 0.9000 0.96875    1    0
knn  0.875   0.875 1.0000 0.9500 1.00000    1    0
svm  0.750   0.875 0.9375 0.9125 1.00000    1    0
rf   0.875   0.875 0.9375 0.9375 1.00000    1    0
#Because each model was evaluated 10 times (10-fold cross validation) we have data that we can plot (e.g. the mean of each model)
#Create a plot of the results from the model evaluation and compare the spread of each model's mean accuracy.
dotplot(results)

#The best model is the LDA model as it has the highest mean accuracy and Kappa of all those tested
print(fit.lda) #View a summary of the LDA model
Linear Discriminant Analysis 

120 samples
  4 predictor
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
Resampling results:

  Accuracy  Kappa 
  0.975     0.9625
fit.lda$results #accuracy in this example run was 97.5%; the SD of 5.6% is used as the Margin of Error (MoE)

Run the model on the validation dataset

The LDA model was then used to predict flower species using the validation dataset, to observe its performance on a dataset it had not yet seen and confirm that the accuracy of the model was within the margin of error observed using the training dataset. The results were presented in a confusion matrix, a convenient way of displaying the outcomes of a prediction model. This also provided an opportunity to test the model for errors such as overfitting and data leakage, which would result in an artificially positive result.

In the example run, the model’s accuracy on the validation dataset (100%) was within the Margin of Error (MoE) of the accuracy observed during cross-validation (97.5% +/- 5.6%, giving a range of 91.9% to 100% once capped at 100%). It was therefore likely that the model was accurate and reliable, and unlikely that errors such as overfitting or data leakage had occurred.
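
The check described above can be written out explicitly. The sketch below uses the figures reported for the example run in this document and tests whether the validation accuracy falls inside the informal interval. It also computes an exact binomial 95% interval for 30 correct predictions out of 30 with base R's binom.test, which appears to agree with the “95% CI” line of the confusion matrix output.

```r
# Values taken from the example run reported in this document
cv_mean <- 0.975   # cross-validated accuracy of the LDA model
cv_sd   <- 0.056   # standard deviation of accuracy across folds
correct <- 30      # sum of the diagonal of the confusion matrix
total   <- 30      # rows in the validation dataset

validation_accuracy <- correct / total
lower <- cv_mean - cv_sd
upper <- min(1, cv_mean + cv_sd)   # accuracy cannot exceed 100%

within_moe <- validation_accuracy >= lower && validation_accuracy <= upper
within_moe  # TRUE

# Exact (Clopper-Pearson) 95% interval for 30 correct out of 30
ci <- binom.test(correct, total)$conf.int
round(ci, 4)
```

With only 30 validation rows the exact interval is fairly wide, so a perfect score on this sample is still compatible with a true accuracy somewhat below 100%.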

#Run the LDA model on the validation dataset and summarise the results in a "confusion matrix"
predictions <- predict(fit.lda, validation)
confusionMatrix(predictions, validation$Species) #Check that the accuracy is within the expected MoE
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         10          0         0
  versicolor      0         10         0
  virginica       0          0        10

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.8843, 1)
    No Information Rate : 0.3333     
    P-Value [Acc > NIR] : 4.857e-15  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            1.0000           1.0000
Specificity                 1.0000            1.0000           1.0000
Pos Pred Value              1.0000            1.0000           1.0000
Neg Pred Value              1.0000            1.0000           1.0000
Prevalence                  0.3333            0.3333           0.3333
Detection Rate              0.3333            0.3333           0.3333
Detection Prevalence        0.3333            0.3333           0.3333
Balanced Accuracy           1.0000            1.0000           1.0000
#For the model to be considered reliable, its validation accuracy should be within the MoE around the cross-validated accuracy: 97.5 +/- 5.6 = [91.9, 100]

Use the model to predict flower species

The model was then used to provide an example of what it could do. In this case, the LDA model was run on a selected set of petal and sepal lengths and widths from the Iris dataset and the prediction was then compared with what was in the Iris dataset to observe whether the model correctly predicted the flower species. The example was taken from row 40 of the Iris dataset, where the flower species was “setosa”.

example_data <- iris[40,1:4]
predict(fit.lda, example_data) #predicts the Species of the data you select as "example_data"
[1] setosa
Levels: setosa versicolor virginica
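
The same prediction step can be applied to measurements that are not in the Iris dataset. The sketch below is a self-contained illustration. It fits LDA directly with lda() from the MASS package on the full iris data rather than reusing the caret model, and the measurements in `new_flower` are hypothetical values chosen for the example.

```r
library(MASS)
data(iris)

# Fit LDA on the full iris data purely for illustration
fit <- lda(Species ~ ., data = iris)

# Hypothetical measurements (in cm) for one new flower;
# the column names must match those used to train the model
new_flower <- data.frame(Sepal.Length = 5.0, Sepal.Width = 3.4,
                         Petal.Length = 1.5, Petal.Width = 0.2)

prediction <- predict(fit, new_flower)
prediction$class                 # predicted species
round(prediction$posterior, 3)   # estimated probability of each species
```

With the caret model from this document, predict(fit.lda, new_flower, type = "prob") should return comparable class probabilities.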

References

Expertsystem.com, 2020. What is Machine Learning? A definition, viewed 08 June 2020, https://expertsystem.com/machine-learning-definition/

Brownlee, J., 2016. Machine Learning Mastery, Your First Machine Learning Project in R Step-By-Step, viewed 16 May 2020, https://machinelearningmastery.com/machine-learning-in-r-step-by-step/



