Iris dataset have been written and analyzed widely for its classic case. Numerous guides and explorations are introduced by almost every scholar. this dataset is introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems, contains three plant species (setosa, virginica, versicolor) and four features measured for each sample. it contains 150 obervations and 5 variables. These quantify the morphologic variation of the iris flower in its three species, all measurements given in centimeters.
The dataset information comes from UCI Machine Learning Repository. ——————–I would build models(logistic regression, classification tree and Neural Networks) using the data, Split the data on variables by groups. with statistical models find the relationship between variables to forecast customers’ future behavior help the hotels to make preparation for customers. look into the models’ accuracy and compare with these 3 different models.
library(datasets) # load the dataset
library(tidyverse) # ggplot
library(corrplot) # pheatmap
library(ggpubr) # add ggplots
library(factoextra) # kmeans
library(ROCR) # ROC
library(rpart) # regression tree
library(rpart.plot) # regression tree plot
library(caret) # confusion matrix
library(e1071) # confusion matrix
library(mgcv) # GAM
Iris dataset have been written and analyzed widely for its classic case. Numerous guides and explorations are introduced by almost every scholar. this dataset is introduced by Ronald Fisher in his 1936 paper The use of multiple measurements in taxonomic problems, contains three plant species (setosa, virginica, versicolor) and four features measured for each sample. it contains 150 obervations and 5 variables. These quantify the morphologic variation of the iris flower in its three species, all measurements given in centimeters.
data(iris) #load the dataset
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
check the missing value
There is no missing value in this dataset.
iris %>%
map_dbl(~ sum(is.na(.)))
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 0 0 0 0 0
Percentage of iris The sample is average equal by Species.
pie(table(iris$Species), col=c("darkred","lightblue","darkgreen"))
Heatmap of the iris
to conclude: Flowers with Longer sepal tend to have wider and longer Petals
Negative Correlation between Sepal width and Petal Length Inference: As sepal width increases the petals tend to be shorter in length
Negative Correlation between Sepal width and Petal width Inference: As sepal width increases the petals tend to be shorter in length as well as in width
Strong Correlation in Petal Length and Petal width. Inference: As petal width increases petals tend to be wider
iris_hm <- iris[,-5]
cor(iris_hm)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
corrplot(cor(iris_hm, method = "pearson"), number.cex = .9, method = "square",
hclust.method = "ward", order = "FPC",
type = "full", tl.cex=0.8,tl.col = "black")
Hist of Distribution The mean and median values for Petal width and Petal Length has significant difference.
h1 <- iris %>%
ggplot(aes(x = Sepal.Length, y = ..density.., fill = Species))+
geom_histogram(binwidth = 1, alpha = 0.5, size = 1, stat = "bin")+
facet_grid(Species ~.)+
geom_density(aes(color = Species), alpha = 0, size = 1)+
ggtitle("Sepal.Length")
h2 <- iris %>%
ggplot(aes(x = Sepal.Width, y = ..density.., fill = Species))+
geom_histogram(binwidth = 1, alpha = 0.5, size = 1, stat = "bin")+
facet_grid(Species ~.)+
geom_density(aes(color = Species), alpha = 0, size = 1)+
ggtitle("Sepal.Width")
h3 <- iris %>%
ggplot(aes(x = Petal.Length, y = ..density.., fill = Species))+
geom_histogram(binwidth = 1, alpha = 0.5, size = 1, stat = "bin")+
facet_grid(Species ~.)+
geom_density(aes(color = Species), alpha = 0, size = 1)+
ggtitle("Petal.Length")
h4 <- iris %>%
ggplot(aes(x = Petal.Width, y = ..density.., fill = Species))+
geom_histogram(binwidth = 1, alpha = 0.5, size = 1, stat = "bin")+
facet_grid(Species ~.)+
geom_density(aes(color = Species), alpha = 0, size = 1)+
ggtitle("Petal.Width")
ggarrange(h1,h2,h3,h4,common.legend = TRUE, legend = "bottom", nrow = 2, ncol = 2)
Scatter Plot
Sepal Length: Have minor overlap in 3 species
Sepal Width : Have considerably higher overlap in 3 species
Petal Length: Have a clear distinction in 3 species
Petal Width: Have a clear distinction in 3 species
Ranking based on scatter plot Petal Width > Petal Length > Sepal Length > Sepal Width
summary(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600 1st Qu.:0.300
## Median :5.800 Median :3.000 Median :4.350 Median :1.300
## Mean :5.843 Mean :3.057 Mean :3.758 Mean :1.199
## 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.900 Max. :4.400 Max. :6.900 Max. :2.500
## Species
## setosa :50
## versicolor:50
## virginica :50
##
##
##
P1 <- iris %>%
ggplot(aes(x = Sepal.Length, y = Petal.Length, fill = Species, color = Species ))+
geom_point()+
ggtitle("Sepal.Length vs. Petal.Length")
P2 <- iris %>%
ggplot(aes(x = Sepal.Length, y = Sepal.Width, fill = Species, color = Species ))+
geom_point()+
ggtitle("Sepal.Length vs. Sepal.Width")
P3 <- iris %>%
ggplot(aes(x = Sepal.Length, y = Petal.Width, fill = Species, color = Species ))+
geom_point()+
ggtitle("Sepal.Length vs.Petal.Width")
P4 <- iris %>%
ggplot(aes(x = Sepal.Width, y = Petal.Width, fill = Species, color = Species ))+
geom_point()+
ggtitle("Sepal.Width vs. Petal.Width")
ggarrange(P1,P2,P3,P4,common.legend = TRUE, legend = "bottom", nrow = 2, ncol = 2)
Boxplot of iris
We observe significant number (4) outliers are present in Setosa specie as compared to others.
Sepal Length: Virginica has 1 outlier
Sepal Width : Setosa has 1 outlier
Petal Length: Setosa has 1 outlier Versicolor has 1 outlier
Petal Width: Setosa has 2 outlier
boxplot(iris_hm, col=c("red", "blue", "yellow", "darkgreen"))
b1 <- iris %>%
ggplot(aes(x = Species,y = Sepal.Length,fill = Species))+
geom_boxplot(size = 0.5,width = 0.8, alpha = 0.3)+
ggtitle("Sepal.Length")
b2 <- iris %>%
ggplot(aes(x = Species,y = Sepal.Width,fill = Species))+
geom_boxplot(size = 0.5,width = 0.8, alpha = 0.3)+
ggtitle("Sepal.Width")
b3 <- iris %>%
ggplot(aes(x = Species,y = Petal.Length,fill = Species))+
geom_boxplot(size = 0.5,width = 0.8, alpha = 0.3)+
ggtitle("Petal.Length")
b4 <- iris %>%
ggplot(aes(x = Species,y = Petal.Width, fill = Species))+
geom_boxplot(size = 0.5,width = 0.8, alpha = 0.3)+
ggtitle("Petal.Width")
ggarrange(b1,b2,b3,b4,common.legend = TRUE, legend = "bottom", nrow = 2, ncol = 2)
** Density of iris**
Sepal.Width is overlap among the Species
Petal.Length is not overlap among the Species
d1 <- iris %>%
ggplot(aes(x = Sepal.Length, color = Species, fill = Species))+
geom_density(alpha = 0.5, size = 0.5)+
ggtitle("Sepal.Length")
d2 <- iris %>%
ggplot(aes(x = Sepal.Width, color = Species, fill = Species))+
geom_density(alpha = 0.5, size = 0.5)+
ggtitle("Sepal.Width")
d3 <- iris %>%
ggplot(aes(x = Petal.Length, color = Species, fill = Species))+
geom_density(alpha = 0.5, size = 0.5)+
ggtitle("Petal.Length")
d4 <- iris %>%
ggplot(aes(x = Petal.Width, color = Species, fill = Species))+
geom_density(alpha = 0.5, size = 0.5)+
ggtitle("Petal.Width")
ggarrange(d1,d2,d3,d4,common.legend = TRUE, legend = "bottom", nrow = 2, ncol = 2)
n order to use k-means method for clustering and plot results, we can use kmeans function in R. It will group the data into a specified number of clusters.
set.seed(2020)
iris_kmeans <- iris[,1:4]
k2 <- kmeans(iris_kmeans, center = 2) # 2 clusters
table(k2$cluster,iris$Species)
##
## setosa versicolor virginica
## 1 0 47 50
## 2 50 3 0
kmeans_2 <- fviz_cluster(k2, geom = "point", data = iris_kmeans) + ggtitle("k = 2")
k3 <- kmeans(iris_kmeans, center = 3) # 3 clusters
table(k3$cluster,iris$Species)
##
## setosa versicolor virginica
## 1 0 48 14
## 2 50 0 0
## 3 0 2 36
kmeans_3 <- fviz_cluster(k3, geom = "point", data = iris_kmeans) + ggtitle("k = 3")
k4 <- kmeans(iris_kmeans, center = 4) # 2 clusters
table(k4$cluster,iris$Species)
##
## setosa versicolor virginica
## 1 0 0 32
## 2 0 23 17
## 3 50 0 0
## 4 0 27 1
kmeans_4 <- fviz_cluster(k4, geom = "point", data = iris_kmeans) + ggtitle("k = 4")
ggarrange(kmeans_2,kmeans_3,kmeans_4,common.legend = TRUE, legend = "bottom", nrow = 1, ncol = 3)
for K=2 Total number of correctly classified instances are: 50 + 50 = 100 Total number of incorrectly classified instances are: 47+3 = 50 Accuracy = 50/150 = 0.33 i.e our model has achieved 67% accuracy
for k = 3 Total number of correctly classified instances are: 50 + 48 +36 = 134 Total number of incorrectly classified instances are: 2 + 14 = 16 Accuracy = 16/150 = i.e our model has achieved 89.3% accuracy
for k = 4 Total number of correctly classified instances are: 50 + 27 +27 = 104 Total number of incorrectly classified instances are: 23 + 22 +1 = 46 Accuracy = 46/150 = 0.307 i.e our model has achieved 69.3% accuracy
To conclude, the best kmeans cluster = 3, the accurancy is 89.3%
#Calculate the distance matrix
iris.dist=dist(iris)
#Obtain clusters using the Wards method
iris.hclust=hclust(iris.dist, method="ward")
plot(iris.hclust)
Split the model into training(70%) and testing(30%) build the multiple linear regression model, we treat the Petal.Width as the response, regardless the Species. the linear model with R-squared = 0.9387.
iris_lm_data <- iris
set.seed(2020)
index <- sample(nrow(iris_lm_data),nrow(iris_lm_data)*0.7)
iris_lm.train <- iris_lm_data[index,]
iris_lm.test <- iris_lm_data[-index,]
iris_lm <- lm(Petal.Width~ . - Species, data = iris_lm.train)
summary(iris_lm)
##
## Call:
## lm(formula = Petal.Width ~ . - Species, data = iris_lm.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40559 -0.12563 -0.01201 0.13277 0.58985
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.18720 0.22854 -0.819 0.414650
## Sepal.Length -0.21917 0.06006 -3.649 0.000419 ***
## Sepal.Width 0.21933 0.06098 3.597 0.000501 ***
## Petal.Length 0.53524 0.03140 17.048 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2002 on 101 degrees of freedom
## Multiple R-squared: 0.9387, Adjusted R-squared: 0.9369
## F-statistic: 515.6 on 3 and 101 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(iris_lm)
Split the data into training(70%) and testing(30%), add new column “response”, if the Species value is versicolor then the "response’=1, others = 0. built the logtistic regression model. the AIC of the model is 103.15, the AUC is 0.755
response <- ifelse(iris$Species=="versicolor", 1, 0)
iris_glm_data <- cbind(iris[,1:4],response)
set.seed(2020)
index <- sample(nrow(iris_glm_data),nrow(iris_glm_data)*0.7)
iris_glm.train <- iris_glm_data[index,]
iris_glm.test <- iris_glm_data[-index,]
iris_glm <- glm(response ~ ., data = iris_glm.train, family = binomial )
summary(iris_glm)
##
## Call:
## glm(formula = response ~ ., family = binomial, data = iris_glm.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1719 -0.7743 -0.3409 0.7216 2.0038
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.0804 3.2738 2.774 0.00554 **
## Sepal.Length -1.4971 0.8536 -1.754 0.07946 .
## Sepal.Width -1.9905 0.8828 -2.255 0.02415 *
## Petal.Length 2.6039 0.9225 2.823 0.00476 **
## Petal.Width -4.3196 1.5351 -2.814 0.00490 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 127.423 on 104 degrees of freedom
## Residual deviance: 93.154 on 100 degrees of freedom
## AIC: 103.15
##
## Number of Fisher Scoring iterations: 5
summary(iris_glm)$coef
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 9.080419 3.2738268 2.773641 0.005543287
## Sepal.Length -1.497058 0.8535954 -1.753826 0.079460371
## Sepal.Width -1.990527 0.8828484 -2.254665 0.024154393
## Petal.Length 2.603938 0.9224861 2.822739 0.004761530
## Petal.Width -4.319602 1.5351212 -2.813851 0.004895195
pred.glm.test<- predict(iris_glm, newdata = iris_glm.test, type="response")
pred <- prediction(pred.glm.test, iris_glm.test$response)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)
unlist(slot(performance(pred, "auc"), "y.values"))
## [1] 0.7550607
Split the data into 70% of training and 30% of testing. Fit a regression model with the response “Species”. the accuracy of the regression model on the test data is 91.11%
set.seed(2020)
index <- sample(nrow(iris),nrow(iris)*0.7)
iris_rpart.train <- iris[index,]
iris_rpart.test <- iris[-index,]
iris.rpart <- rpart(Species ~., data = iris_rpart.train, method = "class")
iris.rpart
## n= 105
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 105 67 setosa (0.36190476 0.29523810 0.34285714)
## 2) Petal.Length< 2.6 38 0 setosa (1.00000000 0.00000000 0.00000000) *
## 3) Petal.Length>=2.6 67 31 virginica (0.00000000 0.46268657 0.53731343)
## 6) Petal.Width< 1.75 33 2 versicolor (0.00000000 0.93939394 0.06060606) *
## 7) Petal.Width>=1.75 34 0 virginica (0.00000000 0.00000000 1.00000000) *
rpart.plot(iris.rpart)
Prediction1 <- predict(iris.rpart, newdata = iris_rpart.test[-5], type = 'class')
confusionMatrix(Prediction1,iris_rpart.test$Species)
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 12 0 0
## versicolor 0 18 3
## virginica 0 1 11
##
## Overall Statistics
##
## Accuracy : 0.9111
## 95% CI : (0.7878, 0.9752)
## No Information Rate : 0.4222
## P-Value [Acc > NIR] : 7.909e-12
##
## Kappa : 0.863
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9474 0.7857
## Specificity 1.0000 0.8846 0.9677
## Pos Pred Value 1.0000 0.8571 0.9167
## Neg Pred Value 1.0000 0.9583 0.9091
## Prevalence 0.2667 0.4222 0.3111
## Detection Rate 0.2667 0.4000 0.2444
## Detection Prevalence 0.2667 0.4667 0.2667
## Balanced Accuracy 1.0000 0.9160 0.8767
set.seed(2020)
response <- ifelse(iris$Sepal.Length>=mean(iris$Sepal.Length),1,0)
iris_gam_data <- cbind(iris[,1:4],response)
index <- sample(nrow(iris_gam_data),nrow(iris_gam_data)*0.7)
iris_gam.train <- iris_gam_data[index,]
iris_gam.test <- iris_gam_data[-index,]
iris.gam <- gam(response ~ s(Sepal.Width) + s(Petal.Length) + s(Petal.Width), family=binomial , data = iris_gam.train)
summary(iris.gam)
##
## Family: binomial
## Link function: logit
##
## Formula:
## response ~ s(Sepal.Width) + s(Petal.Length) + s(Petal.Width)
##
## Parametric coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.758 1.435 -1.225 0.221
##
## Approximate significance of smooth terms:
## edf Ref.df Chi.sq p-value
## s(Sepal.Width) 3.127 3.886 9.491 0.06784 .
## s(Petal.Length) 1.000 1.000 9.669 0.00188 **
## s(Petal.Width) 1.000 1.000 5.018 0.02509 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.79 Deviance explained = 78.7%
## UBRE = -0.59019 Scale est. = 1 n = 105
plot(iris.gam, shade=TRUE,seWithMean=TRUE,scale=0, pages = 1)
AIC(iris.gam)
## [1] 43.03024
BIC(iris.gam)
## [1] 59.29201
iris.gam$deviance
## [1] 30.77553
pcut.gam <- mean(iris_gam_data$response)
prob.gam <-predict(iris.gam, iris_gam.test, type="response")
pred.gam<-(prob.gam >= pcut.gam)*1
table(iris_gam.test$response, pred.gam, dnn=c("Observed","Predicted"))
## Predicted
## Observed 0 1
## 0 21 1
## 1 2 21
# MR
mean(ifelse(iris_gam.test$response != pred.gam, 1, 0))
## [1] 0.06666667