Classification of iris flowers

In this practical case I use the iris data set, which is part of the datasets library. The library is installed with R by default and is loaded with the library command. The data correspond to 150 samples from three species of iris flowers (Iris setosa, Iris virginica and Iris versicolor), with 50 samples per species. Four variables were measured on each sample: sepal length, sepal width, petal length and petal width. The data are available in the variable iris:

library(datasets)
dim(iris)
## [1] 150   5
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
summary(iris)  # The summary shows that there are no missing values
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 
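
The summary only reports missing values implicitly: an NA's row would appear for any column that had them. A minimal sketch of a more explicit check, using only base R functions, could look like this:

```r
library(datasets)

# Count missing values per column; a zero in every position means no NA
na_per_column <- colSums(is.na(iris))
na_per_column

# Single TRUE/FALSE answer for the whole data frame
any_missing <- anyNA(iris)
any_missing
```

If any count were greater than zero, the affected rows would have to be dealt with before fitting the tree.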

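I first load the tree library, which I use to fit and draw the decision tree for this data. Unlike datasets, tree is not part of the base R installation, so it may need to be installed first with install.packages("tree").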

library(tree)

Next I fit the tree and save the result in the variable iris.tr:

iris.tr <- tree(Species ~ ., data = iris)

Printing the object shows the structure of the tree:

iris.tr
## node), split, n, deviance, yval, (yprob)
##       * denotes terminal node
## 
##  1) root 150 329.600 setosa ( 0.33333 0.33333 0.33333 )  
##    2) Petal.Length < 2.45 50   0.000 setosa ( 1.00000 0.00000 0.00000 ) *
##    3) Petal.Length > 2.45 100 138.600 versicolor ( 0.00000 0.50000 0.50000 )  
##      6) Petal.Width < 1.75 54  33.320 versicolor ( 0.00000 0.90741 0.09259 )  
##       12) Petal.Length < 4.95 48   9.721 versicolor ( 0.00000 0.97917 0.02083 )  
##         24) Sepal.Length < 5.15 5   5.004 versicolor ( 0.00000 0.80000 0.20000 ) *
##         25) Sepal.Length > 5.15 43   0.000 versicolor ( 0.00000 1.00000 0.00000 ) *
##       13) Petal.Length > 4.95 6   7.638 virginica ( 0.00000 0.33333 0.66667 ) *
##      7) Petal.Width > 1.75 46   9.635 virginica ( 0.00000 0.02174 0.97826 )  
##       14) Petal.Length < 4.95 6   5.407 virginica ( 0.00000 0.16667 0.83333 ) *
##       15) Petal.Length > 4.95 40   0.000 virginica ( 0.00000 0.00000 1.00000 ) *
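
Each line of this printout is a node: the split condition, the number of samples, the deviance, the predicted species and the class proportions. One way to check that reading is to write the splits out by hand and apply them to the data. The sketch below does that in base R; classify_by_printed_splits is a hypothetical helper written for illustration, not a function of the tree library, and the thresholds are copied from the printout. The Sepal.Length split at nodes 24 and 25 is left out because both sides predict versicolor.

```r
library(datasets)

# Hypothetical helper: follows the splits printed above, top to bottom
classify_by_printed_splits <- function(d) {
  ifelse(d$Petal.Length < 2.45, "setosa",                        # node 2
    ifelse(d$Petal.Width < 1.75,
      ifelse(d$Petal.Length < 4.95, "versicolor", "virginica"),  # node 12 vs node 13
      "virginica"))                                              # nodes 14 and 15
}

manual_pred <- classify_by_printed_splits(iris)

# Cross-tabulate the hand-made predictions against the true species
table(predicted = manual_pred, actual = iris$Species)

# Number of samples on which the hand-made rules disagree with the label
n_wrong <- sum(manual_pred != as.character(iris$Species))
n_wrong
```

If the printout has been read correctly, the off-diagonal cells of the table should correspond to the terminal nodes whose class proportions are not exactly 0 or 1.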

To see a summary of the tree I use the summary command:

summary(iris.tr)
## 
## Classification tree:
## tree(formula = Species ~ ., data = iris)
## Variables actually used in tree construction:
## [1] "Petal.Length" "Petal.Width"  "Sepal.Length"
## Number of terminal nodes:  6 
## Residual mean deviance:  0.1253 = 18.05 / 144 
## Misclassification error rate: 0.02667 = 4 / 150
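
The two figures at the end of this summary can be recomputed from the terminal nodes (the lines marked with *) of the printed tree. The sketch below copies those values by hand into a data frame; the column names are my own labels, and the proportions are rounded as in the printout, so the results are only approximate.

```r
# Terminal nodes copied by hand from the printed tree
leaves <- data.frame(
  node     = c(2, 24, 25, 13, 14, 15),
  n        = c(50, 5, 43, 6, 6, 40),
  deviance = c(0, 5.004, 0, 7.638, 5.407, 0),
  p_major  = c(1, 0.8, 1, 0.66667, 0.83333, 1)  # largest class proportion in the leaf
)

# Samples outside the majority class of their leaf are misclassified
errors <- round(leaves$n * (1 - leaves$p_major))
misclass_rate <- sum(errors) / sum(leaves$n)
misclass_rate

# Residual mean deviance: total leaf deviance over (samples - number of leaves)
resid_mean_dev <- sum(leaves$deviance) / (sum(leaves$n) - nrow(leaves))
resid_mean_dev
```

The denominator 144 in the summary is therefore 150 samples minus 6 terminal nodes.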

The tree has 6 terminal nodes, and only three of the four variables were needed to build it: petal length, petal width and sepal length. The misclassification error rate is 0.02667 (4 of the 150 samples); note that it is measured on the same data used to fit the tree. Next I plot the tree:

plot(iris.tr)
text(iris.tr)
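
Since most of the splits involve only the two petal measurements, the tree can also be pictured as a partition of the petal length and petal width plane. The following is a rough sketch in base R graphics, with the thresholds copied from the printed tree; the colours and line styles are arbitrary choices, and the Sepal.Length split is not shown.

```r
library(datasets)

# One colour per species (arbitrary choice)
species_cols <- c(setosa = "black", versicolor = "red", virginica = "blue")

plot(iris$Petal.Length, iris$Petal.Width,
     col = species_cols[as.character(iris$Species)], pch = 19,
     xlab = "Petal.Length", ylab = "Petal.Width")

# First split: everything to the left of this line is classified as setosa
abline(v = 2.45, lty = 2)

# Second split: only applies to the right of Petal.Length = 2.45
segments(x0 = 2.45, y0 = 1.75, x1 = max(iris$Petal.Length), y1 = 1.75, lty = 2)

# Third split: only changes the prediction below Petal.Width = 1.75
segments(x0 = 4.95, y0 = min(iris$Petal.Width), x1 = 4.95, y1 = 1.75, lty = 3)

legend("topleft", legend = names(species_cols), col = species_cols, pch = 19)
```

The few points that fall on the wrong side of the dashed lines are the misclassified samples.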