Iris Species Classification

Developing Data Products Project - Coursera, February 2015


In the early 1930s, Edgar Anderson collected data on three iris species on the Gaspé Peninsula: virginica, setosa, and versicolor. In 1936, R. A. Fisher used these data in his paper on discriminant analysis, in which he described a method for distinguishing between the species. Since then, the data set has become ubiquitous for testing machine learning algorithms.

This is a small data set, with fifty observations for each species and four predictors: petal length, petal width, sepal length, and sepal width. Even so, an error-free classification is difficult to achieve.

For this project, we provide an application that takes values for these four measurements and predicts which of the three species they describe.

The Classification Method

We build the model by fitting a classification tree using the caret package with the rpart method.

The R commands used to fit the model are:

library(caret)  # provides train()
fit <- train(Species ~ ., method = "rpart", data = iris)  # tree on all four predictors

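Applying the model to the entire data set, which is the same data used to fit it, we achieve 96% accuracy (144 of 150 observations classified correctly). Because no data were held out, this is an optimistic estimate. The confusion matrix is: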

##             Reference
## Prediction   setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         49         5
##   virginica       0          1        45
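
The 96% figure can be checked directly from the matrix. The following is a minimal base-R sketch; the object names `cm`, `accuracy`, and `recall` are illustrative and not part of the original analysis.

```r
# Confusion matrix copied from the output above: rows are predictions,
# columns are the reference (true) species.
species <- c("setosa", "versicolor", "virginica")
cm <- matrix(
  c(50,  0,  0,
     0, 49,  5,
     0,  1, 45),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = species, Reference = species)
)

# Overall accuracy: correctly classified (diagonal) over all observations.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-species recall: diagonal over the column (reference) totals.
recall <- diag(cm) / colSums(cm)

print(accuracy)
print(recall)
```

All six errors are confusions between versicolor and virginica; setosa is separated perfectly.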

A Picture of the Model

The tree that corresponds to our model is pictured below.

[Figure: plot of the fitted tree]
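
The structure of such a tree can also be inspected as text. This is a sketch that fits the tree with rpart directly rather than through caret's train(), which is an assumption: caret additionally tunes the complexity parameter by resampling, so its final tree may differ from the one printed here. The name `tree` is illustrative.

```r
library(rpart)  # the package behind caret's "rpart" method

# Fit a classification tree on all four predictors with rpart's defaults
# (assumption: no caret tuning step).
tree <- rpart(Species ~ ., data = iris, method = "class")

# Print one line per node: split rule, number of observations,
# number misclassified, predicted class, and class probabilities.
print(tree)
```

Calling plot(tree) followed by text(tree) draws the same tree as a diagram.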

Summary

  • The classification tree we built provides an intuitive model for classifying measurements from an iris plant belonging to one of the three species considered.
  • The method has reasonable predictive value: it classifies 96% of the 150 observations correctly.
  • Using Shiny, we have built an application that predicts the iris species from measurements of the petals and sepals.
  • This application may be found on RStudio's Shiny server at this URL
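
The prediction step behind such an application can be sketched as follows. This is not the application's source code: the function `predict_species` and its argument names are hypothetical, and the tree is fitted with rpart directly instead of through caret, so the model may differ slightly from the one described above.

```r
library(rpart)

# Fit the tree once, up front
# (assumption: rpart defaults instead of caret::train()).
model <- rpart(Species ~ ., data = iris, method = "class")

# Hypothetical helper: takes the four measurements in centimetres and
# returns the predicted species as a character string.
predict_species <- function(sepal_length, sepal_width,
                            petal_length, petal_width) {
  new_obs <- data.frame(
    Sepal.Length = sepal_length,
    Sepal.Width  = sepal_width,
    Petal.Length = petal_length,
    Petal.Width  = petal_width
  )
  as.character(predict(model, newdata = new_obs, type = "class"))
}

# Measurements of the first flower in the iris data set, a setosa.
print(predict_species(5.1, 3.5, 1.4, 0.2))
```

In a Shiny server function, a helper like this would typically be called inside renderText(), with its arguments read from numeric inputs.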