This report explores the iris dataset, one of the datasets built into R. The dataset contains measurements of one hundred fifty iris flowers. Each flower is described by five variables: Sepal Length, Sepal Width, Petal Length, Petal Width, and Species. In the end, this data will be used to build a machine learning model that predicts the species of an iris flower based on the other variables.
#loading the dplyr and caret packages and the iris dataset
library(dplyr)
library(caret)
data(iris)
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
This output shows that there are 150 iris flowers in the dataset. For each flower we have a record of the species, sepal length, sepal width, petal length, and petal width. We also see that there are three different species of iris: Setosa, Versicolor, and Virginica.
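A quick count of the Species factor confirms this balance (a small check added here, not part of the original analysis):
#counting how many flowers belong to each species; the dataset holds 50 of each
table(iris$Species)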
boxplot(iris[, 1:4], main = "All Species", ylab = "Centimeters")
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
The summary shows numerically what is displayed in the graph: the median for each measurement, along with the minimum, maximum, first quartile, and third quartile. The summary also shows the mean, which is not shown in the graph. The values of petal length seem to be negatively skewed, and that skew helps to explain the large difference between the median and the mean of the petal length measurement. A histogram will show the spread of that variable more clearly.
hist(iris$Petal.Length, breaks=75, main = "Petal Length Histogram", xlab = "Petal Length")
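To put a number on the skew described above, it could also be measured directly (a brief sketch; it assumes the e1071 package is installed, which is not used elsewhere in this report):
#quantifying the skew of petal length; a negative skewness value confirms the
#left skew, and the mean sitting below the median points the same way
library(e1071)
skewness(iris$Petal.Length)
mean(iris$Petal.Length) - median(iris$Petal.Length)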
Since the machine learning model will predict the species of the flower based on the other measurements, it would be interesting to see how these variables look when broken down by species.
setosa <- iris %>%
  filter(Species == "setosa")
versicolor <- iris %>%
  filter(Species == "versicolor")
virginica <- iris %>%
  filter(Species == "virginica")
#scaling graphs so their axes are equal
ymin <- 0
ymax <- 8
par(mfrow = c(1, 3))
boxplot(setosa[, 1:4], main = "Setosa", ylab = "Centimeters",
        ylim = c(ymin, ymax))
boxplot(versicolor[, 1:4], main = "Versicolor", ylab = "Centimeters",
        ylim = c(ymin, ymax))
boxplot(virginica[, 1:4], main = "Virginica", ylab = "Centimeters",
        ylim = c(ymin, ymax))
summary(setosa)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100
## 1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200
## Median :5.000 Median :3.400 Median :1.500 Median :0.200
## Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
## 3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
## Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
## Species
## setosa :50
## versicolor: 0
## virginica : 0
##
##
##
summary(versicolor)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000
## 1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200
## Median :5.900 Median :2.800 Median :4.35 Median :1.300
## Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
## 3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
## Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
## Species
## setosa : 0
## versicolor:50
## virginica : 0
##
##
##
summary(virginica)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400
## 1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800
## Median :6.500 Median :3.000 Median :5.550 Median :2.000
## Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
## 3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
## Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
## Species
## setosa : 0
## versicolor: 0
## virginica :50
##
##
##
The graphs show that the Setosa flowers have much shorter petals than the other species. This low petal length explains the negative skew and the bimodal shape seen in the petal length values for the full dataset. The Setosa flowers also tend to have wider sepals than the other flowers. Versicolor and Virginica, on the other hand, have fairly similar measurements, although the Versicolor flowers tend to have smaller petal lengths and widths than the Virginica flowers. Overall, the measurement that best separates the species is petal length. When constructing the machine learning models, there should be high accuracy when predicting the Setosa flowers, but there may be mistakes when distinguishing the Versicolor and Virginica flowers.
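One way to see this separation directly (a sketch added for illustration, not part of the original analysis) is a scatterplot matrix of the four measurements colored by species:
#scatterplot matrix colored by species; Setosa separates cleanly on the petal
#measurements while Versicolor and Virginica overlap
pairs(iris[, 1:4], col = iris$Species, pch = 19,
      main = "Iris Measurements by Species")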
library(caret)
set.seed(366284)
#Partitioning the data and subsetting data into a Train Set and a Test Set
inTrain <- createDataPartition(y = iris$Species, p = 0.8, list=FALSE)
train <- iris[inTrain, ]
test <- iris[-inTrain, ]
This code splits the iris data into a train set and a test set. The train set will be used to build the model, while the test set will be used to evaluate it. First the seed was set to 366284, which ensures that these results are reproducible. Then the data was partitioned by randomly sampling 80% of the observations within each species for the train set. Finally, the remaining 20% of the observations, which were not selected for the train set, will be used as the test set.
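Because createDataPartition() samples within each species, the split should keep the classes balanced. A quick check (added here for illustration):
#verifying the stratified split: 40 flowers of each species in the train set
#and 10 of each species in the test set
table(train$Species)
table(test$Species)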
#Building the Random Forest Model using 10 Fold Cross Validation
model_rf <- train(Species ~ ., data = train, method = "ranger",
                  trControl = trainControl(method = "cv", number = 10))
model_rf
## Random Forest
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## mtry splitrule Accuracy Kappa
## 2 gini 0.9333333 0.9000
## 2 extratrees 0.9416667 0.9125
## 3 gini 0.9416667 0.9125
## 3 extratrees 0.9416667 0.9125
## 4 gini 0.9416667 0.9125
## 4 extratrees 0.9500000 0.9250
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 4 and splitrule
## = extratrees.
This table shows the results of training the Random Forest model. The optimal tuning values were selected using the largest accuracy, giving mtry = 4 and splitrule = extratrees for the final model. This model's cross-validated accuracy was about 95%.
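The cross-validated accuracy across the tuning grid can also be viewed graphically (an optional sketch, not part of the original analysis):
#plotting resampled accuracy against mtry for each splitrule
plot(model_rf)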
#Comparing predicted Species to actual Species
predictions_rf <- predict(model_rf, test)
confusionMatrix(predictions_rf, test$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 0 10
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.8843, 1)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 4.857e-15
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3333
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 1.0000 1.0000
Testing the Random Forest model's predictions on the test set yielded results that were 100% accurate. If a different seed had been set, the model might have produced less accurate results. Although these results are satisfying, it is useful to compare this model with another one.
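Before fitting the second model, the sensitivity to the seed could be checked by repeating the split and evaluation with a different seed (a sketch added for illustration; the seed below is an arbitrary choice and the hold-out accuracy will vary from run to run):
#repeating the split, training, and testing with an alternative seed to see
#how much the hold-out accuracy changes
set.seed(12345)   #arbitrary alternative seed
inTrain2 <- createDataPartition(y = iris$Species, p = 0.8, list = FALSE)
train2 <- iris[inTrain2, ]
test2 <- iris[-inTrain2, ]
model_rf2 <- train(Species ~ ., data = train2, method = "ranger",
                   trControl = trainControl(method = "cv", number = 10))
confusionMatrix(predict(model_rf2, test2), test2$Species)$overall["Accuracy"]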
model_c5 <- train(Species ~ ., data = train, method = "C5.0",
                  trControl = trainControl(method = "cv", number = 10))
model_c5
## C5.0
##
## 120 samples
## 4 predictor
## 3 classes: 'setosa', 'versicolor', 'virginica'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 108, 108, 108, 108, 108, 108, ...
## Resampling results across tuning parameters:
##
## model winnow trials Accuracy Kappa
## rules FALSE 1 0.9083333 0.8625
## rules FALSE 10 0.9083333 0.8625
## rules FALSE 20 0.9166667 0.8750
## rules TRUE 1 0.9000000 0.8500
## rules TRUE 10 0.9416667 0.9125
## rules TRUE 20 0.9166667 0.8750
## tree FALSE 1 0.9083333 0.8625
## tree FALSE 10 0.9083333 0.8625
## tree FALSE 20 0.9083333 0.8625
## tree TRUE 1 0.9000000 0.8500
## tree TRUE 10 0.9250000 0.8875
## tree TRUE 20 0.9083333 0.8625
##
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were trials = 10, model = rules
## and winnow = TRUE.
The results of this model show that accuracy was used to select the optimal model using the largest value. The final values used for the model were trials = 10, model = rules, and winnow = TRUE. The cross-validated accuracy for these values is about 94.2%.
predictions_c5 <- predict(model_c5, test)
confusionMatrix(predictions_c5, test$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 0 10
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.8843, 1)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 4.857e-15
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3333 0.3333
## Detection Prevalence 0.3333 0.3333 0.3333
## Balanced Accuracy 1.0000 1.0000 1.0000
The C5.0 model also yielded results that were 100% accurate when comparing the predicted species of each iris in the test set with its actual species. Again, if a different seed had been set, the model might have been a little less accurate.
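Since both models were tuned with 10-fold cross-validation, their resampled accuracies can also be placed side by side with caret's resamples() function (an informal sketch, not part of the original analysis; the folds were generated separately for each model, so this is not a strictly paired comparison):
#collecting the cross-validation results of both tuned models for comparison
results <- resamples(list(rf = model_rf, c5 = model_c5))
summary(results)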
In the end, both the C5.0 and the Random Forest models seem to be good choices for predicting the species of an iris based on its sepal width, sepal length, petal length, and petal width. The exploratory analysis showed several differences between the species, and these differences likely helped in building accurate models. Some explainability is lost with these models, but their high accuracy makes up for that loss. It would be interesting to see how these models perform on new measurements: given the sepal width, sepal length, petal width, and petal length of a flower outside this dataset, how accurately would the model determine its species?