Iris

Introduction

The Iris dataset has been analyzed widely as a classic case study, and numerous guides and explorations of it have been published. It was introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems and contains three plant species (setosa, versicolor, virginica) with four features measured for each sample: 150 observations of 5 variables in total. The features quantify the morphological variation of the iris flower across the three species, with all measurements given in centimeters.

The dataset is documented at the UCI Machine Learning Repository. In this analysis I first explore the data, then group the observations with unsupervised methods (k-means and hierarchical clustering), and finally build supervised models (linear regression, logistic regression, a classification tree, and a generalized additive model), examining and comparing their accuracy.

Packages Required

library(datasets)          # iris dataset
library(tidyverse)         # ggplot2, dplyr, purrr
library(corrplot)          # correlation plot
library(ggpubr)            # arrange ggplots
library(factoextra)        # k-means visualization
library(ROCR)              # ROC curves
library(rpart)             # decision trees
library(rpart.plot)        # decision tree plots
library(caret)             # confusion matrix
library(e1071)             # required by caret's confusionMatrix
library(mgcv)              # GAM

Exploratory Data Analysis

data(iris) #load the dataset
summary(iris)
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Check for missing values

There are no missing values in this dataset.

iris %>%
  map_dbl(~ sum(is.na(.)))
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
##            0            0            0            0            0

Percentage by Species The sample is evenly balanced: each species accounts for one third of the observations.

pie(table(iris$Species), col=c("darkred","lightblue","darkgreen"))
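
The exact shares can also be confirmed numerically; a quick check:

# proportion of each species in the sample (one third each)
prop.table(table(iris$Species))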

Heatmap of the iris

  • Strong positive correlation between Sepal.Length and Petal.Length.
  • Strong positive correlation between Sepal.Length and Petal.Width.

To conclude: flowers with longer sepals tend to have longer and wider petals.

  • Negative correlation between Sepal.Width and Petal.Length. Inference: as sepal width increases, petals tend to be shorter.

  • Negative correlation between Sepal.Width and Petal.Width. Inference: as sepal width increases, petals tend to be narrower as well as shorter.

  • Strong positive correlation between Petal.Length and Petal.Width. Inference: as petal width increases, petals tend to be longer.

iris_hm <- iris[,-5]
cor(iris_hm)
##              Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
## Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
## Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
## Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
corrplot(cor(iris_hm, method = "pearson"), number.cex = .9, method = "square",
         order = "FPC", type = "full", tl.cex = 0.8, tl.col = "black")

Histograms of the distributions For Petal.Width and Petal.Length, the mean and median differ noticeably.

h1 <- iris %>% 
  ggplot(aes(x = Sepal.Length, y = ..density.., fill = Species))+
  geom_histogram(binwidth = 1, alpha = 0.5, size = 1, stat = "bin")+
  facet_grid(Species ~.)+
  geom_density(aes(color = Species), alpha = 0, size = 1)+
  ggtitle("Sepal.Length")

h2 <- iris %>% 
  ggplot(aes(x = Sepal.Width, y = ..density.., fill = Species))+
  geom_histogram(binwidth = 1, alpha = 0.5, size = 1, stat = "bin")+
  facet_grid(Species ~.)+
  geom_density(aes(color = Species), alpha = 0, size = 1)+
  ggtitle("Sepal.Width")

h3 <- iris %>% 
  ggplot(aes(x = Petal.Length, y = ..density.., fill = Species))+
  geom_histogram(binwidth = 1, alpha = 0.5, size = 1, stat = "bin")+
  facet_grid(Species ~.)+
  geom_density(aes(color = Species), alpha = 0, size = 1)+
  ggtitle("Petal.Length")

h4 <- iris %>% 
  ggplot(aes(x = Petal.Width, y = ..density.., fill = Species))+
  geom_histogram(binwidth = 1, alpha = 0.5, size = 1, stat = "bin")+
  facet_grid(Species ~.)+
  geom_density(aes(color = Species), alpha = 0, size = 1)+
  ggtitle("Petal.Width")

ggarrange(h1,h2,h3,h4,common.legend = TRUE, legend = "bottom", nrow = 2, ncol = 2)

Scatter Plot

  • Sepal.Length: minor overlap among the 3 species

  • Sepal.Width: considerably higher overlap among the 3 species

  • Petal.Length: clear distinction among the 3 species

  • Petal.Width: clear distinction among the 3 species

  • Ranking of separating power based on the scatter plots: Petal.Width > Petal.Length > Sepal.Length > Sepal.Width

P1 <- iris %>% 
  ggplot(aes(x = Sepal.Length, y = Petal.Length, fill = Species, color = Species ))+
  geom_point()+
  ggtitle("Sepal.Length vs. Petal.Length")

P2 <- iris %>% 
  ggplot(aes(x = Sepal.Length, y = Sepal.Width, fill = Species, color = Species ))+
  geom_point()+
  ggtitle("Sepal.Length vs. Sepal.Width")

P3 <- iris %>% 
  ggplot(aes(x = Sepal.Length, y = Petal.Width, fill = Species, color = Species ))+
  geom_point()+
  ggtitle("Sepal.Length vs.Petal.Width")

P4 <- iris %>% 
  ggplot(aes(x = Sepal.Width, y = Petal.Width, fill = Species, color = Species ))+
  geom_point()+
  ggtitle("Sepal.Width vs. Petal.Width")

ggarrange(P1,P2,P3,P4,common.legend = TRUE, legend = "bottom", nrow = 2, ncol = 2)

Boxplot of iris

  • Setosa shows the most outliers (4) compared with the other species.

  • Sepal.Length: virginica has 1 outlier

  • Sepal.Width: setosa has 1 outlier

  • Petal.Length: setosa has 1 outlier, versicolor has 1 outlier

  • Petal.Width: setosa has 2 outliers

boxplot(iris_hm, col=c("red", "blue", "yellow", "darkgreen"))

b1 <- iris %>% 
  ggplot(aes(x = Species,y = Sepal.Length,fill = Species))+
  geom_boxplot(size = 0.5,width = 0.8, alpha = 0.3)+
  ggtitle("Sepal.Length")

b2 <- iris %>% 
  ggplot(aes(x = Species,y = Sepal.Width,fill = Species))+
  geom_boxplot(size = 0.5,width = 0.8, alpha = 0.3)+
  ggtitle("Sepal.Width")


b3 <- iris %>% 
  ggplot(aes(x = Species,y = Petal.Length,fill = Species))+
  geom_boxplot(size = 0.5,width = 0.8, alpha = 0.3)+
  ggtitle("Petal.Length")

b4 <- iris %>% 
  ggplot(aes(x = Species,y = Petal.Width, fill = Species))+
  geom_boxplot(size = 0.5,width = 0.8, alpha = 0.3)+
  ggtitle("Petal.Width")

ggarrange(b1,b2,b3,b4,common.legend = TRUE, legend = "bottom", nrow = 2, ncol = 2)

Density of iris

  • Sepal.Width overlaps heavily among the species

  • Petal.Length shows almost no overlap among the species

d1 <- iris %>% 
  ggplot(aes(x = Sepal.Length, color = Species, fill = Species))+  
  geom_density(alpha = 0.5, size = 0.5)+
  ggtitle("Sepal.Length")

d2 <- iris %>% 
  ggplot(aes(x = Sepal.Width, color = Species, fill = Species))+  
  geom_density(alpha = 0.5, size = 0.5)+
  ggtitle("Sepal.Width")

d3 <- iris %>% 
  ggplot(aes(x = Petal.Length, color = Species, fill = Species))+  
  geom_density(alpha = 0.5, size = 0.5)+
  ggtitle("Petal.Length")

d4 <- iris %>% 
  ggplot(aes(x = Petal.Width, color = Species, fill = Species))+  
  geom_density(alpha = 0.5, size = 0.5)+
  ggtitle("Petal.Width")

ggarrange(d1,d2,d3,d4,common.legend = TRUE, legend = "bottom", nrow = 2, ncol = 2)

Unsupervised Learning

K-means

In order to use the k-means method for clustering and plot the results, we can use the kmeans function in R, which groups the data into a specified number of clusters.
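
Before fixing k, the elbow method gives a rough sense of how many clusters the data support; a minimal sketch using fviz_nbclust() from factoextra (loaded above):

# total within-cluster sum of squares for k = 1..10;
# look for the "elbow" where adding clusters stops helping
fviz_nbclust(iris[, 1:4], kmeans, method = "wss", k.max = 10)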

set.seed(2020)
iris_kmeans <- iris[,1:4]

k2 <- kmeans(iris_kmeans, centers = 2)  # 2 clusters
table(k2$cluster,iris$Species)
##    
##     setosa versicolor virginica
##   1      0         47        50
##   2     50          3         0
kmeans_2 <- fviz_cluster(k2, geom = "point", data = iris_kmeans) + ggtitle("k = 2")

k3 <- kmeans(iris_kmeans, centers = 3)  # 3 clusters
table(k3$cluster,iris$Species)
##    
##     setosa versicolor virginica
##   1      0         48        14
##   2     50          0         0
##   3      0          2        36
kmeans_3 <- fviz_cluster(k3, geom = "point", data = iris_kmeans) + ggtitle("k = 3")

k4 <- kmeans(iris_kmeans, centers = 4)  # 4 clusters
table(k4$cluster,iris$Species)
##    
##     setosa versicolor virginica
##   1      0          0        32
##   2      0         23        17
##   3     50          0         0
##   4      0         27         1
kmeans_4 <- fviz_cluster(k4, geom = "point", data = iris_kmeans) + ggtitle("k = 4")

ggarrange(kmeans_2,kmeans_3,kmeans_4,common.legend = TRUE, legend = "bottom", nrow = 1, ncol = 3)

For k = 2, assigning each cluster the species that dominates it: the correctly classified instances are 50 + 50 = 100 and the incorrectly classified instances are 47 + 3 = 50. Accuracy = 100/150 = 0.667, i.e. the clustering achieves 66.7% accuracy.

For k = 3: the correctly classified instances are 50 + 48 + 36 = 134 and the incorrectly classified instances are 2 + 14 = 16. Accuracy = 134/150 = 0.893, i.e. the clustering achieves 89.3% accuracy.

For k = 4: the correctly classified instances are 50 + 32 + 27 + 23 = 132 and the incorrectly classified instances are 17 + 1 = 18. Accuracy = 132/150 = 0.88, i.e. the clustering achieves 88% accuracy.

To conclude, the best k-means clustering uses k = 3, with an accuracy of 89.3%.
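
Rather than tallying by hand, this majority-vote accuracy can be computed directly from the confusion table; a minimal sketch (cluster_purity is just an illustrative helper name):

# assign each cluster the species that dominates it, then count agreements
cluster_purity <- function(clusters, labels) {
  tab <- table(clusters, labels)
  sum(apply(tab, 1, max)) / length(labels)
}
cluster_purity(k3$cluster, iris$Species)   # 134/150 = 0.893 for k = 3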

Hierarchical clustering

# Calculate the distance matrix on the four numeric features
iris.dist <- dist(iris[, 1:4])
# Obtain clusters using Ward's method
iris.hclust <- hclust(iris.dist, method = "ward.D")
plot(iris.hclust)
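
The dendrogram itself does not assign labels; cutting it into three groups and cross-tabulating against the true species shows how well the hierarchy recovers them. A minimal sketch using cutree():

# cut the tree into 3 clusters and compare with the known species
iris.cut <- cutree(iris.hclust, k = 3)
table(iris.cut, iris$Species)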

Supervised Learning

Linear Regression Model

Split the data into training (70%) and testing (30%) sets, then build a multiple linear regression model treating Petal.Width as the response and excluding Species. The fitted model has R-squared = 0.9387.

iris_lm_data <- iris
set.seed(2020)
index <- sample(nrow(iris_lm_data),nrow(iris_lm_data)*0.7)
iris_lm.train <- iris_lm_data[index,]
iris_lm.test <-  iris_lm_data[-index,]

iris_lm <- lm(Petal.Width~ . - Species, data = iris_lm.train)
summary(iris_lm)
## 
## Call:
## lm(formula = Petal.Width ~ . - Species, data = iris_lm.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40559 -0.12563 -0.01201  0.13277  0.58985 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.18720    0.22854  -0.819 0.414650    
## Sepal.Length -0.21917    0.06006  -3.649 0.000419 ***
## Sepal.Width   0.21933    0.06098   3.597 0.000501 ***
## Petal.Length  0.53524    0.03140  17.048  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2002 on 101 degrees of freedom
## Multiple R-squared:  0.9387, Adjusted R-squared:  0.9369 
## F-statistic: 515.6 on 3 and 101 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(iris_lm)
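
Since the data were split, the model can also be checked on the held-out 30%; a minimal sketch computing the test-set mean squared error (mse.lm.test is just an illustrative name):

# out-of-sample prediction error on the 30% test split
pred.lm.test <- predict(iris_lm, newdata = iris_lm.test)
mse.lm.test <- mean((iris_lm.test$Petal.Width - pred.lm.test)^2)
mse.lm.test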

Logistic Regression Model

Split the data into training (70%) and testing (30%) sets, and add a new column "response": if Species is versicolor then response = 1, otherwise 0. Then build the logistic regression model. The AIC of the model is 103.15 and the AUC is 0.755.

response <- ifelse(iris$Species=="versicolor", 1, 0)
iris_glm_data <- cbind(iris[,1:4],response)
set.seed(2020)
index <- sample(nrow(iris_glm_data),nrow(iris_glm_data)*0.7)
iris_glm.train <- iris_glm_data[index,]
iris_glm.test <-  iris_glm_data[-index,]
iris_glm <- glm(response ~ ., data = iris_glm.train, family = binomial )

summary(iris_glm)
## 
## Call:
## glm(formula = response ~ ., family = binomial, data = iris_glm.train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1719  -0.7743  -0.3409   0.7216   2.0038  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    9.0804     3.2738   2.774  0.00554 **
## Sepal.Length  -1.4971     0.8536  -1.754  0.07946 . 
## Sepal.Width   -1.9905     0.8828  -2.255  0.02415 * 
## Petal.Length   2.6039     0.9225   2.823  0.00476 **
## Petal.Width   -4.3196     1.5351  -2.814  0.00490 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 127.423  on 104  degrees of freedom
## Residual deviance:  93.154  on 100  degrees of freedom
## AIC: 103.15
## 
## Number of Fisher Scoring iterations: 5
summary(iris_glm)$coef
##               Estimate Std. Error   z value    Pr(>|z|)
## (Intercept)   9.080419  3.2738268  2.773641 0.005543287
## Sepal.Length -1.497058  0.8535954 -1.753826 0.079460371
## Sepal.Width  -1.990527  0.8828484 -2.254665 0.024154393
## Petal.Length  2.603938  0.9224861  2.822739 0.004761530
## Petal.Width  -4.319602  1.5351212 -2.813851 0.004895195
pred.glm.test<- predict(iris_glm, newdata = iris_glm.test, type="response")
pred <- prediction(pred.glm.test, iris_glm.test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7550607
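
Beyond the AUC, a misclassification rate at a chosen cutoff gives a second view of test performance; a minimal sketch using a 0.5 cutoff (the cutoff is an assumption here, not tuned):

# classify with a 0.5 probability cutoff and compute the error rate
pred.glm.class <- ifelse(pred.glm.test >= 0.5, 1, 0)
table(iris_glm.test$response, pred.glm.class, dnn = c("Observed", "Predicted"))
mean(iris_glm.test$response != pred.glm.class)   # misclassification rate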

Classification Tree Model

Split the data into 70% training and 30% testing, then fit a classification tree with Species as the response. The accuracy of the tree on the test data is 91.11%.

set.seed(2020)
index <- sample(nrow(iris),nrow(iris)*0.7)
iris_rpart.train <- iris[index,]
iris_rpart.test <-  iris[-index,]

iris.rpart <- rpart(Species ~., data = iris_rpart.train, method = "class")
iris.rpart
## n= 105 
## 
## node), split, n, loss, yval, (yprob)
##       * denotes terminal node
## 
## 1) root 105 67 setosa (0.36190476 0.29523810 0.34285714)  
##   2) Petal.Length< 2.6 38  0 setosa (1.00000000 0.00000000 0.00000000) *
##   3) Petal.Length>=2.6 67 31 virginica (0.00000000 0.46268657 0.53731343)  
##     6) Petal.Width< 1.75 33  2 versicolor (0.00000000 0.93939394 0.06060606) *
##     7) Petal.Width>=1.75 34  0 virginica (0.00000000 0.00000000 1.00000000) *
rpart.plot(iris.rpart)

Prediction1 <- predict(iris.rpart, newdata = iris_rpart.test[-5], type = 'class')
confusionMatrix(Prediction1,iris_rpart.test$Species)
## Confusion Matrix and Statistics
## 
##             Reference
## Prediction   setosa versicolor virginica
##   setosa         12          0         0
##   versicolor      0         18         3
##   virginica       0          1        11
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9111          
##                  95% CI : (0.7878, 0.9752)
##     No Information Rate : 0.4222          
##     P-Value [Acc > NIR] : 7.909e-12       
##                                           
##                   Kappa : 0.863           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: setosa Class: versicolor Class: virginica
## Sensitivity                 1.0000            0.9474           0.7857
## Specificity                 1.0000            0.8846           0.9677
## Pos Pred Value              1.0000            0.8571           0.9167
## Neg Pred Value              1.0000            0.9583           0.9091
## Prevalence                  0.2667            0.4222           0.3111
## Detection Rate              0.2667            0.4000           0.2444
## Detection Prevalence        0.2667            0.4667           0.2667
## Balanced Accuracy           1.0000            0.9160           0.8767
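
To check whether the tree overfits, one could inspect rpart's cross-validated complexity-parameter table and prune at the cp with the lowest xerror; a minimal sketch:

# cross-validated error at each complexity parameter
printcp(iris.rpart)
# prune at the cp that minimizes the cross-validated error
best.cp <- iris.rpart$cptable[which.min(iris.rpart$cptable[, "xerror"]), "CP"]
iris.rpart.pruned <- prune(iris.rpart, cp = best.cp)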

Generalized Additive Model

Define a binary response indicating whether Sepal.Length is at or above its mean, split the data into training (70%) and testing (30%) sets, and fit a logistic GAM with smooth terms for the three remaining features. The misclassification rate on the test data is 6.67%.

set.seed(2020)
response <- ifelse(iris$Sepal.Length>=mean(iris$Sepal.Length),1,0)
iris_gam_data <- cbind(iris[,1:4],response)
index <- sample(nrow(iris_gam_data),nrow(iris_gam_data)*0.7)
iris_gam.train <- iris_gam_data[index,]
iris_gam.test <-  iris_gam_data[-index,]

iris.gam <- gam(response ~ s(Sepal.Width) + s(Petal.Length) + s(Petal.Width), family=binomial , data = iris_gam.train)
summary(iris.gam)
## 
## Family: binomial 
## Link function: logit 
## 
## Formula:
## response ~ s(Sepal.Width) + s(Petal.Length) + s(Petal.Width)
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   -1.758      1.435  -1.225    0.221
## 
## Approximate significance of smooth terms:
##                   edf Ref.df Chi.sq p-value   
## s(Sepal.Width)  3.127  3.886  9.491 0.06784 . 
## s(Petal.Length) 1.000  1.000  9.669 0.00188 **
## s(Petal.Width)  1.000  1.000  5.018 0.02509 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =   0.79   Deviance explained = 78.7%
## UBRE = -0.59019  Scale est. = 1         n = 105
plot(iris.gam, shade=TRUE,seWithMean=TRUE,scale=0, pages = 1)

AIC(iris.gam)
## [1] 43.03024
BIC(iris.gam)
## [1] 59.29201
iris.gam$deviance
## [1] 30.77553
pcut.gam <- mean(iris_gam_data$response)
prob.gam <-predict(iris.gam, iris_gam.test, type="response")
pred.gam<-(prob.gam >= pcut.gam)*1
table(iris_gam.test$response, pred.gam, dnn=c("Observed","Predicted"))
##         Predicted
## Observed  0  1
##        0 21  1
##        1  2 21
# misclassification rate (MR)
mean(ifelse(iris_gam.test$response != pred.gam, 1, 0))    
## [1] 0.06666667
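
The same ROCR workflow used for the logistic model can also summarize the GAM's discrimination on its own test set; a minimal sketch:

# AUC of the GAM on the test set
pred.gam.rocr <- prediction(as.numeric(prob.gam), iris_gam.test$response)
unlist(slot(performance(pred.gam.rocr, "auc"), "y.values"))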