Synopsis

This report explores the iris dataset, which ships with R. The dataset contains measurements of 150 iris flowers. Each flower has five recorded variables: four measurements (sepal length, sepal width, petal length, and petal width) and the species. In the end, this data will be used to build machine learning models that predict the species of an iris flower from the four measurements.

Exploratory Analysis

Loading Packages and Data

#loading dplyr to manipulate data, caret to build the models, and the iris dataset
library(dplyr)
library(caret)
data(iris)

Previewing The Data

str(iris)
## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

This output shows that there are 150 iris flowers. For each flower we have a record of the species, sepal length, sepal width, petal length, and petal width. We also see that there are three species of iris: Setosa, Versicolor, and Virginica.
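The counts described above can be checked directly. The following is a minimal sketch using only base R; the variable names `species_counts` and `missing_per_column` are illustrative and not part of the original analysis.

```r
# Load the built-in iris data (part of R's datasets package)
data(iris)

# Number of flowers recorded for each species
species_counts <- table(iris$Species)
print(species_counts)

# Number of missing values in each column
missing_per_column <- colSums(is.na(iris))
print(missing_per_column)
```

Equal counts per species mean that plain accuracy is a reasonable metric later on, since no class dominates.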

Graphing the Data

All Species Boxplot

boxplot(iris[, 1:4], main = "All Species", ylab = "Centimeters")

Summary

summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

The summary shows numerically what the boxplot shows graphically: the median, the lowest value, the highest value, the first quartile, and the third quartile of each measurement. It also shows the mean, which the boxplot does not. The values of petal length seem to be negatively skewed, which helps to explain the large difference between the median (4.35) and the mean (3.758) of that measurement. A histogram gives a better view of how that variable is distributed.
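The skew can also be quantified rather than read off the summary. The sketch below is one possible check in base R, assuming the simple moment-based definition of skewness (third central moment divided by the cubed standard deviation, both computed with a divisor of n); other definitions give slightly different values. The names `mean_minus_median` and `skewness` are illustrative.

```r
data(iris)
x <- iris$Petal.Length

# A mean below the median is consistent with a negative (left) skew
mean_minus_median <- mean(x) - median(x)

# Moment-based sample skewness, dividing by n in both moments
n <- length(x)
m2 <- sum((x - mean(x))^2) / n
m3 <- sum((x - mean(x))^3) / n
skewness <- m3 / m2^(3 / 2)

print(mean_minus_median)
print(skewness)
```

A negative value for both quantities agrees with the reading of the summary table.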

Histogram Petal Length

hist(iris$Petal.Length, breaks=75, main = "Petal Length Histogram", xlab = "Petal Length")

Since the machine learning model will predict the species of the flower based on the other measurements, it would be interesting to see how these variables look when broken down by species.

Graphing Species Subsets

Subsetting the Variables

setosa <- iris %>%
  filter(Species == "setosa")

versicolor <- iris %>%
  filter(Species == "versicolor")

virginica <- iris %>%
  filter(Species == "virginica")
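The summaries of these subsets further down still list the other two species with a count of 0, because filtering keeps every level of the Species factor. The following is a hedged base R alternative, not the approach used in this report: `split()` builds all three subsets at once and `droplevels()` removes the unused levels. The name `by_species` is illustrative.

```r
data(iris)

# split() returns a named list with one data frame per species
by_species <- split(iris, iris$Species)

# droplevels() removes the factor levels that no longer occur in each subset
by_species <- lapply(by_species, droplevels)

print(names(by_species))
print(nlevels(by_species$setosa$Species))
```

Each subset is then available as `by_species$setosa`, `by_species$versicolor`, and `by_species$virginica`.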

Graphing Subsets

#scaling graphs so their axes are equal
ymin <- 0
ymax <- 8
par(mfrow=c(1, 3))
boxplot(setosa[, 1:4], main = "Setosa", ylab = "Centimeters", 
        ylim = c(ymin, ymax)) 
boxplot(versicolor[, 1:4], main = "Versicolor", ylab = "Centimeters", 
        ylim = c(ymin, ymax))
boxplot(virginica[, 1:4], main = "Virginica", ylab = "Centimeters", 
        ylim = c(ymin, ymax))

Setosa

summary(setosa)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.300   Min.   :1.000   Min.   :0.100  
##  1st Qu.:4.800   1st Qu.:3.200   1st Qu.:1.400   1st Qu.:0.200  
##  Median :5.000   Median :3.400   Median :1.500   Median :0.200  
##  Mean   :5.006   Mean   :3.428   Mean   :1.462   Mean   :0.246  
##  3rd Qu.:5.200   3rd Qu.:3.675   3rd Qu.:1.575   3rd Qu.:0.300  
##  Max.   :5.800   Max.   :4.400   Max.   :1.900   Max.   :0.600  
##        Species  
##  setosa    :50  
##  versicolor: 0  
##  virginica : 0  
##                 
##                 
## 

Versicolor

summary(versicolor)
##   Sepal.Length    Sepal.Width     Petal.Length   Petal.Width   
##  Min.   :4.900   Min.   :2.000   Min.   :3.00   Min.   :1.000  
##  1st Qu.:5.600   1st Qu.:2.525   1st Qu.:4.00   1st Qu.:1.200  
##  Median :5.900   Median :2.800   Median :4.35   Median :1.300  
##  Mean   :5.936   Mean   :2.770   Mean   :4.26   Mean   :1.326  
##  3rd Qu.:6.300   3rd Qu.:3.000   3rd Qu.:4.60   3rd Qu.:1.500  
##  Max.   :7.000   Max.   :3.400   Max.   :5.10   Max.   :1.800  
##        Species  
##  setosa    : 0  
##  versicolor:50  
##  virginica : 0  
##                 
##                 
## 

Virginica

summary(virginica)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.900   Min.   :2.200   Min.   :4.500   Min.   :1.400  
##  1st Qu.:6.225   1st Qu.:2.800   1st Qu.:5.100   1st Qu.:1.800  
##  Median :6.500   Median :3.000   Median :5.550   Median :2.000  
##  Mean   :6.588   Mean   :2.974   Mean   :5.552   Mean   :2.026  
##  3rd Qu.:6.900   3rd Qu.:3.175   3rd Qu.:5.875   3rd Qu.:2.300  
##  Max.   :7.900   Max.   :3.800   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    : 0  
##  versicolor: 0  
##  virginica :50  
##                 
##                 
## 

The graphs show that Setosa flowers have much shorter petals than the other species: the longest Setosa petal is 1.9 cm, while the shortest Versicolor petal is 3.0 cm. This cluster of short petals explains the negative skew and the bimodal shape seen earlier in the petal length plots. Setosa flowers also tend to have wider sepals than the other flowers. Versicolor and Virginica, on the other hand, have similar measurements, although Versicolor flowers tend to have shorter and narrower petals than Virginica flowers. Overall, petal length seems to be the measurement that differs most among the species. When constructing the machine learning models, accuracy should be high when predicting Setosa flowers. However, there may be mistakes when distinguishing Versicolor from Virginica flowers.
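The separation described above can be made concrete with a small base R sketch. It lists the petal length range of each species and then applies a single cutoff to pick out Setosa. The cutoff of 2.5 is an arbitrary illustrative value between the Setosa maximum and the smallest value of the other species, and the names `petal_range`, `cutoff`, and `is_setosa_by_rule` are hypothetical.

```r
data(iris)

# Smallest and largest petal length within each species
petal_range <- sapply(split(iris$Petal.Length, iris$Species), range)
rownames(petal_range) <- c("min", "max")
print(petal_range)

# One cutoff on petal length, chosen by eye for illustration only
cutoff <- 2.5
is_setosa_by_rule <- iris$Petal.Length < cutoff

# Cross-tabulate the rule against the true species
print(table(rule = is_setosa_by_rule, species = iris$Species))
```

The Versicolor and Virginica ranges overlap, which is why no single cutoff on petal length separates those two species.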

Machine Learning Models

Partitioning The Data

library(caret)
set.seed(366284)

#Partitioning the data and subsetting data into a Train Set and a Test Set
inTrain <- createDataPartition(y = iris$Species, p = 0.8, list=FALSE)
train <- iris[inTrain, ]
test <- iris[-inTrain, ]

This code splits the iris data into a train set and a test set. The train set is used to build the model, while the test set is used to test it. First the seed is set to 366284 so that the results are reproducible. Then createDataPartition randomly samples 80% of the observations, drawing within each species so that the classes stay balanced. Finally, the remaining 20% of the observations, which were not selected for the train set, are used as the test set.
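To illustrate what a stratified 80/20 split does, here is a minimal base R sketch. It is not caret's implementation and will not reproduce the exact rows chosen above; it only shows the idea of sampling 80% of the rows within each species. The seed and the names `rows_by_species`, `train_rows`, `train_sketch`, and `test_sketch` are assumptions made for this example.

```r
data(iris)
set.seed(1)  # arbitrary seed for this sketch, not the one used in the report

# Row indices grouped by species
rows_by_species <- split(seq_len(nrow(iris)), iris$Species)

# Within each species, pick 80% of its rows at random
train_rows <- unlist(lapply(rows_by_species, function(rows) {
  sample(rows, size = round(0.8 * length(rows)))
}))

train_sketch <- iris[train_rows, ]
test_sketch  <- iris[-train_rows, ]

# Class counts in each part of the split
print(table(train_sketch$Species))
print(table(test_sketch$Species))
```

Sampling within each species keeps the class balance identical in both sets, which matches the 10 flowers per species seen in the test-set confusion matrices.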

Random Forest Model

#Building the Random Forest Model using 10 Fold Cross Validation
model_rf <- train(Species ~ ., train, method = "ranger", trControl = trainControl(method = "cv", number = 10))
model_rf
## Random Forest 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   mtry  splitrule   Accuracy   Kappa 
##   2     gini        0.9333333  0.9000
##   2     extratrees  0.9416667  0.9125
##   3     gini        0.9416667  0.9125
##   3     extratrees  0.9416667  0.9125
##   4     gini        0.9416667  0.9125
##   4     extratrees  0.9500000  0.9250
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were mtry = 4 and splitrule
##  = extratrees.

This table shows the results of training the Random Forest model. The optimal model was selected using the largest cross-validated accuracy. The final values for the model were 4 for mtry and extratrees for splitrule. This model’s cross-validated accuracy was 95%.
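The Kappa column in the table above reports Cohen's kappa, which is accuracy corrected for the agreement expected by chance. As a hedged sketch, assuming each held-out fold contains the three species in equal numbers, chance agreement is 1/3 and kappa follows directly from accuracy. The helper `kappa_balanced` is hypothetical and written only for this illustration.

```r
# Cohen's kappa when the reference classes are equally frequent:
# expected chance agreement is 1 / n_classes
kappa_balanced <- function(accuracy, n_classes = 3) {
  chance <- 1 / n_classes
  (accuracy - chance) / (1 - chance)
}

# Accuracy values copied from the resampling table above
accuracy <- c(0.9333333, 0.9416667, 0.9500000)
kappa <- round(kappa_balanced(accuracy), 4)
print(kappa)
```

The results agree with the Kappa values printed next to those accuracies, so with balanced classes the two columns rank the candidate models identically.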

#Comparing predicted Species to actual Species 
predictions_rf <- predict(model_rf, test)
confusionMatrix(predictions_rf, test$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         0
##   virginica       0          0        10
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8843, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : 4.857e-15  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000

Testing the predictions of the Random Forest model on the test set yielded 100% accuracy. The test set holds only 30 flowers, however, so the 95% confidence interval for the accuracy still runs from about 88% to 100%, and a different seed might have produced less accurate results. Although these results are satisfying, it is useful to compare this model with another one.
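The 95% CI and the P-Value [Acc > NIR] in the output above are consistent with an exact binomial test on 30 correct predictions out of 30. The sketch below reproduces them with base R's `binom.test`, under the assumption that this is the kind of interval being reported; the variable names are illustrative.

```r
# 30 correct predictions out of 30 test flowers
correct <- 30
total <- 30

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- binom.test(correct, total)$conf.int
print(round(ci, 4))

# One-sided test that accuracy exceeds the no-information rate of 1/3
p_value <- binom.test(correct, total, p = 1 / 3,
                      alternative = "greater")$p.value
print(signif(p_value, 4))
```

Even with a perfect score, the lower bound stays well below 1 because the test set is small.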

C5.0 Model

model_c5 <- train(Species ~ ., train, method = "C5.0", trControl = trainControl(method = "cv", number = 10))
model_c5
## C5.0 
## 
## 120 samples
##   4 predictor
##   3 classes: 'setosa', 'versicolor', 'virginica' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ... 
## Resampling results across tuning parameters:
## 
##   model  winnow  trials  Accuracy   Kappa 
##   rules  FALSE    1      0.9083333  0.8625
##   rules  FALSE   10      0.9083333  0.8625
##   rules  FALSE   20      0.9166667  0.8750
##   rules   TRUE    1      0.9000000  0.8500
##   rules   TRUE   10      0.9416667  0.9125
##   rules   TRUE   20      0.9166667  0.8750
##   tree   FALSE    1      0.9083333  0.8625
##   tree   FALSE   10      0.9083333  0.8625
##   tree   FALSE   20      0.9083333  0.8625
##   tree    TRUE    1      0.9000000  0.8500
##   tree    TRUE   10      0.9250000  0.8875
##   tree    TRUE   20      0.9083333  0.8625
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final values used for the model were trials = 10, model = rules
##  and winnow = TRUE.

As with the Random Forest model, the optimal model was selected using the largest cross-validated accuracy. The final values used for the model were 10 for trials, rules for model, and TRUE for winnow. The accuracy for these values is about 94.2%.
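The selected tuning values can be read off the table by finding the row with the highest accuracy. The sketch below rebuilds the resampling table by hand from the printed output and picks that row; the data frame `results` and the variable `best` are illustrative names, and the numbers are copied from the output above rather than recomputed.

```r
# Resampling results copied from the C5.0 output above
results <- data.frame(
  model    = rep(c("rules", "tree"), each = 6),
  winnow   = rep(rep(c(FALSE, TRUE), each = 3), times = 2),
  trials   = rep(c(1, 10, 20), times = 4),
  Accuracy = c(0.9083333, 0.9083333, 0.9166667,
               0.9000000, 0.9416667, 0.9166667,
               0.9083333, 0.9083333, 0.9083333,
               0.9000000, 0.9250000, 0.9083333)
)

# Row with the highest cross-validated accuracy
best <- results[which.max(results$Accuracy), ]
print(best)
```

With the fitted object available, `model_c5$bestTune` should report the same combination of tuning values.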

predictions_c5 <- predict(model_c5, test)
confusionMatrix(predictions_c5, test$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         10         0
##   virginica       0          0        10
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.8843, 1)
##     No Information Rate : 0.3333     
##     P-Value [Acc > NIR] : 4.857e-15  
##                                      
##                   Kappa : 1          
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            1.0000           1.0000
## Specificity                 1.0000            1.0000           1.0000
## Pos Pred Value              1.0000            1.0000           1.0000
## Neg Pred Value              1.0000            1.0000           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3333           0.3333
## Detection Prevalence        0.3333            0.3333           0.3333
## Balanced Accuracy           1.0000            1.0000           1.0000

On the test set, the model's predicted species matched the actual species for all 30 flowers, giving 100% accuracy. Again, with a different seed the model might be a little less accurate.

Conclusion

In the end, both the C5.0 and the Random Forest models seem to be good models for predicting the species of an iris based on its sepal width, sepal length, petal length, and petal width. The exploratory analysis shows several differences between the species. These differences may have been helpful in building accurate models. Unfortunately, some explainability is lost with these models, but their high accuracy makes up for this loss. It would be interesting to see how these models perform on a dataset containing new measurements of these variables. When given a sepal width, sepal length, petal width, and petal length, how accurate would these models be when determining the species of an iris flower?