1 Introduction

iris is perhaps the best known database to be found in the pattern recognition literature. Although it is simple, but very classic. In this project, I will do some data exploring first, then do some visualizations. I will try to build several models to predicate the classification. Finally I will compare the accuracy results among those models. Comments are welcome!

2 Knowing the data

2.1 Load the package

library(tidyverse) # visualization/processing
library(lattice)# visualization
library(ggpubr) # for multiple plots
library(GGally) # for pairplots
library(caret)  # machine learning models
library(e1071)
library(dplyr)
library(randomForest)

2.2 Import the data

dataset <- read.csv("Iris.csv")

2.3 Check data by using different methods

We begin by getting an idea of dimensions, size and attributes of the data we are working on.

head(dataset,3)#top 3 rows
##   Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm     Species
## 1  1           5.1          3.5           1.4          0.2 Iris-setosa
## 2  2           4.9          3.0           1.4          0.2 Iris-setosa
## 3  3           4.7          3.2           1.3          0.2 Iris-setosa
tail(dataset,3)#bottom 3 rows
##      Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm        Species
## 148 148           6.5          3.0           5.2          2.0 Iris-virginica
## 149 149           6.2          3.4           5.4          2.3 Iris-virginica
## 150 150           5.9          3.0           5.1          1.8 Iris-virginica
dim(dataset)#shape
## [1] 150   6
names(dataset)#variables
## [1] "Id"            "SepalLengthCm" "SepalWidthCm"  "PetalLengthCm"
## [5] "PetalWidthCm"  "Species"
str(dataset)#structures
## 'data.frame':    150 obs. of  6 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ SepalLengthCm: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ SepalWidthCm : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ PetalLengthCm: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ PetalWidthCm : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species      : chr  "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...
lapply(dataset, function(x) length(unique(x))) # Unique values per column
## $Id
## [1] 150
## 
## $SepalLengthCm
## [1] 35
## 
## $SepalWidthCm
## [1] 23
## 
## $PetalLengthCm
## [1] 43
## 
## $PetalWidthCm
## [1] 22
## 
## $Species
## [1] 3

So we have started to know some about the data. The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, we have to develope some models to distinguish the species from each other.

Variables Information: 1. Id: obervation # ; 2. SepalLengthCm: sepal length in cm ; 3. SepalWidthCm: sepal width in cm ; 4. PetalLengthCm: petal length in cm ; 5. PetalWidthCm: petal width in cm ; 6. Species: – Iris-setosa ; – Iris-versicolor ; – Iris-virginica ;

summary(dataset)#summary of the data
##        Id         SepalLengthCm    SepalWidthCm   PetalLengthCm  
##  Min.   :  1.00   Min.   :4.300   Min.   :2.000   Min.   :1.000  
##  1st Qu.: 38.25   1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600  
##  Median : 75.50   Median :5.800   Median :3.000   Median :4.350  
##  Mean   : 75.50   Mean   :5.843   Mean   :3.054   Mean   :3.759  
##  3rd Qu.:112.75   3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100  
##  Max.   :150.00   Max.   :7.900   Max.   :4.400   Max.   :6.900  
##   PetalWidthCm     Species         
##  Min.   :0.100   Length:150        
##  1st Qu.:0.300   Class :character  
##  Median :1.300   Mode  :character  
##  Mean   :1.199                     
##  3rd Qu.:1.800                     
##  Max.   :2.500

Observation: Checking the scales of features is very important. Sepal length ranges from 4.3-7.9, Sepal width range: 2-4.4, Petal length range:1-6.9, Petal width:0.1-2.5. The ranges basically are from 0 to 10, so we don’t have to do scaling before the building the models.

3 Clean the data

3.1 Any missing values?

sum(is.na(dataset))#check overall
## [1] 0
dataset %>% 
  summarize_all(funs(sum(is.na(.))/n())) #check each column
##   Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1  0             0            0             0            0       0

No missing values… very good…

3.2 remove Id for easy processing data

data<- dataset[,-1]
head(data)#check if Id removed or not
##   SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm     Species
## 1           5.1          3.5           1.4          0.2 Iris-setosa
## 2           4.9          3.0           1.4          0.2 Iris-setosa
## 3           4.7          3.2           1.3          0.2 Iris-setosa
## 4           4.6          3.1           1.5          0.2 Iris-setosa
## 5           5.0          3.6           1.4          0.2 Iris-setosa
## 6           5.4          3.9           1.7          0.4 Iris-setosa

3.3 Remove duplicate info in Species variable

I remove duplicate info (Iris-) in the Species for better visulation in the following step

data$Species <- sapply(strsplit(as.character(data$Species),'-'), "[", 2)
str(data) #check the data again and found Species became character
## 'data.frame':    150 obs. of  5 variables:
##  $ SepalLengthCm: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ SepalWidthCm : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ PetalLengthCm: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ PetalWidthCm : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species      : chr  "setosa" "setosa" "setosa" "setosa" ...
data$Species <- as.factor(data$Species)#change Species as factor
str(data)#check again, Species is factor variable now
## 'data.frame':    150 obs. of  5 variables:
##  $ SepalLengthCm: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ SepalWidthCm : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ PetalLengthCm: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ PetalWidthCm : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species      : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

4 Visualization

4.1 Why plots?

To further dig the data, we need to check basic statistics indexes of the data including mean/standard deviation/count/max/min value for each numeric variables

describle_datatable <- data %>% 
  group_by(Species) %>% 
  summarise_if(is.numeric,funs(mean, sd,n(), max, min))
  
t(describle_datatable) #transpose for easy checking
##                    [,1]        [,2]         [,3]       
## Species            "setosa"    "versicolor" "virginica"
## SepalLengthCm_mean "5.006"     "5.936"      "6.588"    
## SepalWidthCm_mean  "3.418"     "2.770"      "2.974"    
## PetalLengthCm_mean "1.464"     "4.260"      "5.552"    
## PetalWidthCm_mean  "0.244"     "1.326"      "2.026"    
## SepalLengthCm_sd   "0.3524897" "0.5161711"  "0.6358796"
## SepalWidthCm_sd    "0.3810244" "0.3137983"  "0.3224966"
## PetalLengthCm_sd   "0.1735112" "0.4699110"  "0.5518947"
## PetalWidthCm_sd    "0.1072095" "0.1977527"  "0.2746501"
## SepalLengthCm_n    "50"        "50"         "50"       
## SepalWidthCm_n     "50"        "50"         "50"       
## PetalLengthCm_n    "50"        "50"         "50"       
## PetalWidthCm_n     "50"        "50"         "50"       
## SepalLengthCm_max  "5.8"       "7.0"        "7.9"      
## SepalWidthCm_max   "4.4"       "3.4"        "3.8"      
## PetalLengthCm_max  "1.9"       "5.1"        "6.9"      
## PetalWidthCm_max   "0.6"       "1.8"        "2.5"      
## SepalLengthCm_min  "4.3"       "4.9"        "4.9"      
## SepalWidthCm_min   "2.3"       "2.0"        "2.2"      
## PetalLengthCm_min  "1.0"       "3.0"        "4.5"      
## PetalWidthCm_min   "0.1"       "1.0"        "1.4"

Not easy to get more insights? so plots…

4.2 boxplot

We begin by using boxplots to understand the distribution of attributes for each Species.

p1 <- ggplot(data, aes(x = Species, y = SepalLengthCm,colour=Species)) +
  geom_boxplot() +
  geom_jitter(shape=16, position=position_jitter(0.1))+
  theme(legend.position="none") # Remove legend

p2 <- ggplot(data, aes(x = Species, y = SepalWidthCm,colour=Species)) +
  geom_boxplot() +
  geom_jitter(shape=16, position=position_jitter(0.1))+
  theme(legend.position="none") # Remove legend


p3 <- ggplot(data, aes(x = Species, y = PetalLengthCm,colour=Species)) +
  geom_boxplot() +
  geom_jitter(shape=16, position=position_jitter(0.1))+
  theme(legend.position="none") # Remove legend


p4 <- ggplot(data, aes(x = Species, y = PetalWidthCm,colour=Species)) +
  geom_boxplot() +
  geom_jitter(shape=16, position=position_jitter(0.1))+
  theme(legend.position="none") # Remove legend


ggarrange(p1,p2,p3,p4, 
          labels = c("A", "B", "C","D"),
          ncol = 2, nrow = 2)

From above boxplots, we can see that virginica has a bigger petal and bigger sepal length, however sentosa has a smaller petal, but bigger sepal length..

4.3 pairplot

To understand relationships between atrributes, let’s do the pairplots (package: GGally)

head(data)
##   SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1           5.1          3.5           1.4          0.2  setosa
## 2           4.9          3.0           1.4          0.2  setosa
## 3           4.7          3.2           1.3          0.2  setosa
## 4           4.6          3.1           1.5          0.2  setosa
## 5           5.0          3.6           1.4          0.2  setosa
## 6           5.4          3.9           1.7          0.4  setosa
ggpairs(data, columns=1:4, aes(color=Species)) + 
  ggtitle("Iris Data by Species")

Petal width and Petal length have a strong linear correlation relationship.

4.4 scatterplot

ggplot(data, aes(x =SepalWidthCm , y = SepalLengthCm  , color = Species))+
  geom_point()+               
  geom_smooth(method="loess", aes(fill= Species, color = Species))+
  facet_wrap(~Species, ncol = 3, nrow = 1)

ggplot(data, aes(x = PetalWidthCm , y =PetalLengthCm  , color = Species))+
  geom_point()+               
  geom_smooth(method="lm", aes(fill= Species, color = Species))+
  facet_wrap(~Species, ncol = 3, nrow = 1)

Versicolor’s petal widith and length has a strong linear relationgship. Could we guess: petal width and petal length might be the key features for classfication? Just guess.. we need to do the predication.

5 Splitting the data for training and testing

set.seed(101)
# We use the dataset to create a partition (80% training 20% testing)
id <- createDataPartition(data$Species, p=0.80, list=FALSE)

# select 80% of data to train the models
train <- data[id,]
dim(train)
## [1] 120   5
# select 20% of the data for testing
test <- data[-id,]
dim(test)
## [1] 30  5
#review the train dataset to confirm the Species are randomly selected
lapply(train, function(x) length(unique(x))) 
## $SepalLengthCm
## [1] 34
## 
## $SepalWidthCm
## [1] 23
## 
## $PetalLengthCm
## [1] 42
## 
## $PetalWidthCm
## [1] 20
## 
## $Species
## [1] 3
table(train$Species)
## 
##     setosa versicolor  virginica 
##         40         40         40
summary(train)
##  SepalLengthCm    SepalWidthCm   PetalLengthCm    PetalWidthCm  
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.575   1st Qu.:0.300  
##  Median :5.750   Median :3.000   Median :4.250   Median :1.300  
##  Mean   :5.836   Mean   :3.041   Mean   :3.734   Mean   :1.187  
##  3rd Qu.:6.425   3rd Qu.:3.200   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.700   Max.   :4.400   Max.   :6.900   Max.   :2.400  
##        Species  
##  setosa    :40  
##  versicolor:40  
##  virginica :40  
##                 
##                 
## 
str(train)
## 'data.frame':    120 obs. of  5 variables:
##  $ SepalLengthCm: num  4.7 5.4 4.6 5 4.4 4.9 5.4 4.8 4.8 4.3 ...
##  $ SepalWidthCm : num  3.2 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3 3 ...
##  $ PetalLengthCm: num  1.3 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 ...
##  $ PetalWidthCm : num  0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 ...
##  $ Species      : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Observation- 1.The train dataset has 120 observations while test dataset has 30. 2.Each class has the same number of instances (40).

6 Model building

6.1 Model 1: Cart Model: Decision tree

set.seed(101)

cart_model <- train(train[,1:4], train[, 5], method='rpart2')
## note: only 2 possible values of the max tree depth from the initial fit.
##  Truncating the grid to 2 .
# Predict the labels of the test set
predictions<-predict(cart_model,test[,1:4])

# Evaluate the predictions
table(predictions)
## predictions
##     setosa versicolor  virginica 
##         10          9         11
# Confusion matrix 
confusionMatrix(predictions,test[,5])
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          8         1
##   virginica       0          2         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9             
##                  95% CI : (0.7347, 0.9789)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 1.665e-10       
##                                           
##                   Kappa : 0.85            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.8000           0.9000
## Specificity                 1.0000            0.9500           0.9000
## Pos Pred Value              1.0000            0.8889           0.8182
## Neg Pred Value              1.0000            0.9048           0.9474
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2667           0.3000
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.8750           0.9000
#feature importance
importance_cart <- varImp(cart_model)
plot(importance_cart, main="Variable Importance with cart_model")

As suspected, Petal Width is the most used variable, followed by Petal Length and Sepal Length.

6.2 Model 2 KNN

# Train the model with preprocessing
set.seed(101)

knn_model <- train(train[, 1:4], train[, 5], method='knn', 
                   preProcess=c("center", "scale"))

# Predict values
predictions<-predict(knn_model,test[,1:4], type="raw")

# Confusion matrix
confusionMatrix(predictions,test[,5])
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          7         2
##   virginica       0          3         8
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.6528, 0.9436)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.444e-08       
##                                           
##                   Kappa : 0.75            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.7000           0.8000
## Specificity                 1.0000            0.9000           0.8500
## Pos Pred Value              1.0000            0.7778           0.7273
## Neg Pred Value              1.0000            0.8571           0.8947
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2333           0.2667
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.8000           0.8250
#feature importance
importance_knn <- varImp(knn_model)
plot(importance_knn, main="Variable Importance with knn_model")

6.3 Model 3 Neural Network

# Train the model with preprocessing
set.seed(101)
nnet_model <- train(train[, 1:4], train[, 5], method='nnet', 
                   preProcess=c("center", "scale"), 
                   tuneLength = 2,
                   trace = FALSE,
                   maxit = 100)

# Predict values
predictions<-predict(nnet_model,test[,1:4], type="raw")

# Confusion matrix
confusionMatrix(predictions,test[,5])
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          7         1
##   virginica       0          3         9
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8667          
##                  95% CI : (0.6928, 0.9624)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.296e-09       
##                                           
##                   Kappa : 0.8             
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.7000           0.9000
## Specificity                 1.0000            0.9500           0.8500
## Pos Pred Value              1.0000            0.8750           0.7500
## Neg Pred Value              1.0000            0.8636           0.9444
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.2333           0.3000
## Detection Prevalence        0.3333            0.2667           0.4000
## Balanced Accuracy           1.0000            0.8250           0.8750
#feature importance
importance_nnet <- varImp(nnet_model);importance_nnet
## nnet variable importance
## 
##   variables are sorted by maximum importance across the classes
##               Overall setosa versicolor virginica
## PetalLengthCm  100.00 100.00     100.00    100.00
## PetalWidthCm    97.48  97.48      97.48     97.48
## SepalWidthCm    39.49  39.49      39.49     39.49
## SepalLengthCm    0.00   0.00       0.00      0.00

6.4 Model 4 Randomforest

# Train the model with preprocessing
set.seed(101)
randomforest_model <- train(train[, 1:4], train[, 5], method='rf')

# Predict values
predictions<-predict(randomforest_model,test[,1:4], type="raw")

# Confusion matrix
confusionMatrix(predictions,test[,5])
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0          9         0
##   virginica       0          1        10
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9667          
##                  95% CI : (0.8278, 0.9992)
##     No Information Rate : 0.3333          
##     P-Value [Acc > NIR] : 2.963e-13       
##                                           
##                   Kappa : 0.95            
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9000           1.0000
## Specificity                 1.0000            1.0000           0.9500
## Pos Pred Value              1.0000            1.0000           0.9091
## Neg Pred Value              1.0000            0.9524           1.0000
## Prevalence                  0.3333            0.3333           0.3333
## Detection Rate              0.3333            0.3000           0.3333
## Detection Prevalence        0.3333            0.3000           0.3667
## Balanced Accuracy           1.0000            0.9500           0.9750
#feature importance
importance_rf <- varImp(randomforest_model)
plot(importance_rf, main="Variable Importance with randomforest_model")

6.5 Compare model performances

models_compare <- resamples(list(cart_model,knn_model, nnet_model,randomforest_model))

# Summary of the models performances
summary(models_compare)
## 
## Call:
## summary.resamples(object = models_compare)
## 
## Models: Model1, Model2, Model3, Model4 
## Number of resamples: 25 
## 
## Accuracy 
##             Min.   1st Qu.    Median      Mean   3rd Qu. Max. NA's
## Model1 0.9000000 0.9361702 0.9545455 0.9515344 0.9729730 1.00    0
## Model2 0.9189189 0.9534884 0.9565217 0.9605841 0.9772727 1.00    0
## Model3 0.9347826 0.9772727 0.9800000 0.9828838 1.0000000 1.00    0
## Model4 0.9069767 0.9347826 0.9545455 0.9517384 0.9761905 0.98    0
## 
## Kappa 
##             Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## Model1 0.8434442 0.9036227 0.9312500 0.9259588 0.9578107 1.0000000    0
## Model2 0.8763920 0.9297386 0.9332366 0.9399519 0.9656250 1.0000000    0
## Model3 0.9001447 0.9652997 0.9698614 0.9738443 1.0000000 1.0000000    0
## Model4 0.8566667 0.9001447 0.9312500 0.9263153 0.9642553 0.9698614    0

From the accuracy resluts, neural network model works best among the 4 models. Also, Petal Width and Petal Length are the key features for the classification.