The iris dataset is perhaps the best-known dataset in the pattern recognition literature. It is simple, yet classic. In this project, I will first explore the data, then create some visualizations. I will build several models to predict the species classification, and finally compare the accuracy results of those models. Comments are welcome!
We begin by getting an idea of the dimensions and attributes of the data we are working with.
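A sketch of the setup and exploration calls that could produce the output below (the original chunks are not shown; the file name and the exact library list are assumptions based on the functions used later in this analysis):
library(dplyr)    # group_by, summarise_if, %>%
library(ggplot2)  # plotting
library(ggpubr)   # ggarrange for arranging plots
library(GGally)   # ggpairs for pair plots
library(caret)    # createDataPartition, train, confusionMatrix, varImp, resamples
data <- read.csv("Iris.csv", stringsAsFactors = FALSE)  # assumed file name; Species stays character
head(data, 3)   # first three rows
tail(data, 3)   # last three rows
dim(data)       # number of rows and columns
names(data)     # column names
str(data)       # structure of each column
lapply(data, function(x) length(unique(x)))  # number of unique values per column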
## Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 3 4.7 3.2 1.3 0.2 Iris-setosa
## Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 148 148 6.5 3.0 5.2 2.0 Iris-virginica
## 149 149 6.2 3.4 5.4 2.3 Iris-virginica
## 150 150 5.9 3.0 5.1 1.8 Iris-virginica
## [1] 150 6
## [1] "Id" "SepalLengthCm" "SepalWidthCm" "PetalLengthCm"
## [5] "PetalWidthCm" "Species"
## 'data.frame': 150 obs. of 6 variables:
## $ Id : int 1 2 3 4 5 6 7 8 9 10 ...
## $ SepalLengthCm: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ SepalWidthCm : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ PetalLengthCm: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ PetalWidthCm : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr "Iris-setosa" "Iris-setosa" "Iris-setosa" "Iris-setosa" ...
## $Id
## [1] 150
##
## $SepalLengthCm
## [1] 35
##
## $SepalWidthCm
## [1] 23
##
## $PetalLengthCm
## [1] 43
##
## $PetalWidthCm
## [1] 22
##
## $Species
## [1] 3
So far we have a basic picture of the data. The dataset consists of 50 samples from each of three species of Iris (Iris setosa, Iris versicolor, and Iris virginica). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, we will develop models to distinguish the species from one another.
Variable information:
1. Id: observation number
2. SepalLengthCm: sepal length in cm
3. SepalWidthCm: sepal width in cm
4. PetalLengthCm: petal length in cm
5. PetalWidthCm: petal width in cm
6. Species: Iris-setosa, Iris-versicolor, or Iris-virginica
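The summary below can presumably be reproduced with:
summary(data)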
## Id SepalLengthCm SepalWidthCm PetalLengthCm
## Min. : 1.00 Min. :4.300 Min. :2.000 Min. :1.000
## 1st Qu.: 38.25 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.600
## Median : 75.50 Median :5.800 Median :3.000 Median :4.350
## Mean : 75.50 Mean :5.843 Mean :3.054 Mean :3.759
## 3rd Qu.:112.75 3rd Qu.:6.400 3rd Qu.:3.300 3rd Qu.:5.100
## Max. :150.00 Max. :7.900 Max. :4.400 Max. :6.900
## PetalWidthCm Species
## Min. :0.100 Length:150
## 1st Qu.:0.300 Class :character
## Median :1.300 Mode :character
## Mean :1.199
## 3rd Qu.:1.800
## Max. :2.500
Observation: checking the scales of the features is important. Sepal length ranges from 4.3 to 7.9 cm, sepal width from 2.0 to 4.4 cm, petal length from 1.0 to 6.9 cm, and petal width from 0.1 to 2.5 cm. All features fall roughly within the same 0 to 10 range, so we do not need to scale them before building the models.
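One way to produce the missing-value checks below (a sketch; the exact calls used are not shown):
sum(is.na(data))                           # total number of missing values
data %>% summarise_all(funs(sum(is.na(.))))  # missing values per column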
## [1] 0
## Id SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1 0 0 0 0 0 0
No missing values… very good…
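The Id column is just a row index and carries no information for classification, so it is dropped before continuing (the chunk is not shown; a sketch):
data <- data[, -1]  # drop the Id column
head(data)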
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1 5.1 3.5 1.4 0.2 Iris-setosa
## 2 4.9 3.0 1.4 0.2 Iris-setosa
## 3 4.7 3.2 1.3 0.2 Iris-setosa
## 4 4.6 3.1 1.5 0.2 Iris-setosa
## 5 5.0 3.6 1.4 0.2 Iris-setosa
## 6 5.4 3.9 1.7 0.4 Iris-setosa
I remove the redundant "Iris-" prefix from Species for better visualization in the following steps.
data$Species <- sapply(strsplit(as.character(data$Species),'-'), "[", 2)
str(data) # check the data again; Species is now character
## 'data.frame': 150 obs. of 5 variables:
## $ SepalLengthCm: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ SepalWidthCm : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ PetalLengthCm: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ PetalWidthCm : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : chr "setosa" "setosa" "setosa" "setosa" ...
data$Species <- as.factor(data$Species)#change Species as factor
str(data) # check again; Species is a factor variable now
## 'data.frame': 150 obs. of 5 variables:
## $ SepalLengthCm: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ SepalWidthCm : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ PetalLengthCm: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ PetalWidthCm : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
To dig further into the data, we check basic statistics (mean, standard deviation, count, max, and min) of each numeric variable, grouped by species.
describle_datatable <- data %>%
group_by(Species) %>%
summarise_if(is.numeric,funs(mean, sd,n(), max, min))
t(describle_datatable) # transpose for easier reading
## [,1] [,2] [,3]
## Species "setosa" "versicolor" "virginica"
## SepalLengthCm_mean "5.006" "5.936" "6.588"
## SepalWidthCm_mean "3.418" "2.770" "2.974"
## PetalLengthCm_mean "1.464" "4.260" "5.552"
## PetalWidthCm_mean "0.244" "1.326" "2.026"
## SepalLengthCm_sd "0.3524897" "0.5161711" "0.6358796"
## SepalWidthCm_sd "0.3810244" "0.3137983" "0.3224966"
## PetalLengthCm_sd "0.1735112" "0.4699110" "0.5518947"
## PetalWidthCm_sd "0.1072095" "0.1977527" "0.2746501"
## SepalLengthCm_n "50" "50" "50"
## SepalWidthCm_n "50" "50" "50"
## PetalLengthCm_n "50" "50" "50"
## PetalWidthCm_n "50" "50" "50"
## SepalLengthCm_max "5.8" "7.0" "7.9"
## SepalWidthCm_max "4.4" "3.4" "3.8"
## PetalLengthCm_max "1.9" "5.1" "6.9"
## PetalWidthCm_max "0.6" "1.8" "2.5"
## SepalLengthCm_min "4.3" "4.9" "4.9"
## SepalWidthCm_min "2.3" "2.0" "2.2"
## PetalLengthCm_min "1.0" "3.0" "4.5"
## PetalWidthCm_min "0.1" "1.0" "1.4"
It is not easy to get more insight from the raw numbers, so let's plot.
We begin by using boxplots to understand the distribution of attributes for each Species.
p1 <- ggplot(data, aes(x = Species, y = SepalLengthCm,colour=Species)) +
geom_boxplot() +
geom_jitter(shape=16, position=position_jitter(0.1))+
theme(legend.position="none") # Remove legend
p2 <- ggplot(data, aes(x = Species, y = SepalWidthCm,colour=Species)) +
geom_boxplot() +
geom_jitter(shape=16, position=position_jitter(0.1))+
theme(legend.position="none") # Remove legend
p3 <- ggplot(data, aes(x = Species, y = PetalLengthCm,colour=Species)) +
geom_boxplot() +
geom_jitter(shape=16, position=position_jitter(0.1))+
theme(legend.position="none") # Remove legend
p4 <- ggplot(data, aes(x = Species, y = PetalWidthCm,colour=Species)) +
geom_boxplot() +
geom_jitter(shape=16, position=position_jitter(0.1))+
theme(legend.position="none") # Remove legend
ggarrange(p1,p2,p3,p4,
labels = c("A", "B", "C","D"),
ncol = 2, nrow = 2)From above boxplots, we can see that virginica has a bigger petal and bigger sepal length, however sentosa has a smaller petal, but bigger sepal length..
To understand the relationships between attributes, let's do pair plots (package: GGally).
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
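The pair plot itself is not reproduced here; a sketch of the call, assuming ggpairs() from GGally colored by species:
ggpairs(data, aes(colour = Species))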
Petal width and petal length are strongly linearly correlated.
ggplot(data, aes(x =SepalWidthCm , y = SepalLengthCm , color = Species))+
geom_point()+
geom_smooth(method="loess", aes(fill= Species, color = Species))+
facet_wrap(~Species, ncol = 3, nrow = 1)
ggplot(data, aes(x = PetalWidthCm, y = PetalLengthCm, color = Species))+
geom_point()+
geom_smooth(method="lm", aes(fill= Species, color = Species))+
facet_wrap(~Species, ncol = 3, nrow = 1)
Versicolor's petal width and petal length show a strong linear relationship. Could we guess that petal width and petal length might be the key features for classification? It is just a guess for now; we need to build the models and make predictions to find out.
set.seed(101)
# We use the dataset to create a partition (80% training 20% testing)
id <- createDataPartition(data$Species, p=0.80, list=FALSE)
# select 80% of data to train the models
train <- data[id,]
dim(train)
## [1] 120 5
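The remaining 20% becomes the test set (the chunk is not shown; a sketch consistent with the dimensions printed below and with the later use of test):
test <- data[-id,]
dim(test)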
## [1] 30 5
#review the train dataset to confirm the Species are randomly selected
lapply(train, function(x) length(unique(x)))
## $SepalLengthCm
## [1] 34
##
## $SepalWidthCm
## [1] 23
##
## $PetalLengthCm
## [1] 42
##
## $PetalWidthCm
## [1] 20
##
## $Species
## [1] 3
##
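The class balance, summary, and structure of the training set shown below can presumably be reproduced with:
table(train$Species)
summary(train)
str(train)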
## setosa versicolor virginica
## 40 40 40
## SepalLengthCm SepalWidthCm PetalLengthCm PetalWidthCm
## Min. :4.300 Min. :2.000 Min. :1.000 Min. :0.100
## 1st Qu.:5.100 1st Qu.:2.800 1st Qu.:1.575 1st Qu.:0.300
## Median :5.750 Median :3.000 Median :4.250 Median :1.300
## Mean :5.836 Mean :3.041 Mean :3.734 Mean :1.187
## 3rd Qu.:6.425 3rd Qu.:3.200 3rd Qu.:5.100 3rd Qu.:1.800
## Max. :7.700 Max. :4.400 Max. :6.900 Max. :2.400
## Species
## setosa :40
## versicolor:40
## virginica :40
##
##
##
## 'data.frame': 120 obs. of 5 variables:
## $ SepalLengthCm: num 4.7 5.4 4.6 5 4.4 4.9 5.4 4.8 4.8 4.3 ...
## $ SepalWidthCm : num 3.2 3.9 3.4 3.4 2.9 3.1 3.7 3.4 3 3 ...
## $ PetalLengthCm: num 1.3 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 ...
## $ PetalWidthCm : num 0.2 0.4 0.3 0.2 0.2 0.1 0.2 0.2 0.1 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Observations: 1. The train dataset has 120 observations while the test dataset has 30. 2. Each class has the same number of instances in the training set (40).
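The first model is a decision tree (CART) trained with caret. The training chunk is not shown; a sketch, assuming method 'rpart2' given the max-tree-depth message printed below:
set.seed(101)
cart_model <- train(train[, 1:4], train[, 5], method = 'rpart2')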
## note: only 2 possible values of the max tree depth from the initial fit.
## Truncating the grid to 2 .
# Predict the labels of the test set
predictions<-predict(cart_model,test[,1:4])
# Evaluate the predictions
table(predictions)
## predictions
## setosa versicolor virginica
## 10 9 11
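The confusion matrix below is presumably produced the same way as for the later models:
confusionMatrix(predictions, test[,5])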
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 8 1
## virginica 0 2 9
##
## Overall Statistics
##
## Accuracy : 0.9
## 95% CI : (0.7347, 0.9789)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 1.665e-10
##
## Kappa : 0.85
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.8000 0.9000
## Specificity 1.0000 0.9500 0.9000
## Pos Pred Value 1.0000 0.8889 0.8182
## Neg Pred Value 1.0000 0.9048 0.9474
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2667 0.3000
## Detection Prevalence 0.3333 0.3000 0.3667
## Balanced Accuracy 1.0000 0.8750 0.9000
#feature importance
importance_cart <- varImp(cart_model)
plot(importance_cart, main="Variable Importance with cart_model")
As suspected, Petal Width is the most used variable, followed by Petal Length and Sepal Length.
# Train the model with preprocessing
set.seed(101)
knn_model <- train(train[, 1:4], train[, 5], method='knn',
preProcess=c("center", "scale"))
# Predict values
predictions<-predict(knn_model,test[,1:4], type="raw")
# Confusion matrix
confusionMatrix(predictions,test[,5])
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 7 2
## virginica 0 3 8
##
## Overall Statistics
##
## Accuracy : 0.8333
## 95% CI : (0.6528, 0.9436)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.444e-08
##
## Kappa : 0.75
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.7000 0.8000
## Specificity 1.0000 0.9000 0.8500
## Pos Pred Value 1.0000 0.7778 0.7273
## Neg Pred Value 1.0000 0.8571 0.8947
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2333 0.2667
## Detection Prevalence 0.3333 0.3000 0.3667
## Balanced Accuracy 1.0000 0.8000 0.8250
#feature importance
importance_knn <- varImp(knn_model)
plot(importance_knn, main="Variable Importance with knn_model")
# Train the model with preprocessing
set.seed(101)
nnet_model <- train(train[, 1:4], train[, 5], method='nnet',
preProcess=c("center", "scale"),
tuneLength = 2,
trace = FALSE,
maxit = 100)
# Predict values
predictions<-predict(nnet_model,test[,1:4], type="raw")
# Confusion matrix
confusionMatrix(predictions,test[,5])
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 7 1
## virginica 0 3 9
##
## Overall Statistics
##
## Accuracy : 0.8667
## 95% CI : (0.6928, 0.9624)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.296e-09
##
## Kappa : 0.8
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.7000 0.9000
## Specificity 1.0000 0.9500 0.8500
## Pos Pred Value 1.0000 0.8750 0.7500
## Neg Pred Value 1.0000 0.8636 0.9444
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.2333 0.3000
## Detection Prevalence 0.3333 0.2667 0.4000
## Balanced Accuracy 1.0000 0.8250 0.8750
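Feature importance for the neural network (the chunk is not shown; a sketch consistent with the other models, printing rather than plotting the importance object):
importance_nnet <- varImp(nnet_model)
importance_nnet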
## nnet variable importance
##
## variables are sorted by maximum importance across the classes
## Overall setosa versicolor virginica
## PetalLengthCm 100.00 100.00 100.00 100.00
## PetalWidthCm 97.48 97.48 97.48 97.48
## SepalWidthCm 39.49 39.49 39.49 39.49
## SepalLengthCm 0.00 0.00 0.00 0.00
# Train the model with preprocessing
set.seed(101)
randomforest_model <- train(train[, 1:4], train[, 5], method='rf')
# Predict values
predictions<-predict(randomforest_model,test[,1:4], type="raw")
# Confusion matrix
confusionMatrix(predictions,test[,5])
## Confusion Matrix and Statistics
##
## Reference
## Prediction setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 9 0
## virginica 0 1 10
##
## Overall Statistics
##
## Accuracy : 0.9667
## 95% CI : (0.8278, 0.9992)
## No Information Rate : 0.3333
## P-Value [Acc > NIR] : 2.963e-13
##
## Kappa : 0.95
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: setosa Class: versicolor Class: virginica
## Sensitivity 1.0000 0.9000 1.0000
## Specificity 1.0000 1.0000 0.9500
## Pos Pred Value 1.0000 1.0000 0.9091
## Neg Pred Value 1.0000 0.9524 1.0000
## Prevalence 0.3333 0.3333 0.3333
## Detection Rate 0.3333 0.3000 0.3333
## Detection Prevalence 0.3333 0.3000 0.3667
## Balanced Accuracy 1.0000 0.9500 0.9750
#feature importance
importance_rf <- varImp(randomforest_model)
plot(importance_rf, main="Variable Importance with randomforest_model")
models_compare <- resamples(list(cart_model, knn_model, nnet_model, randomforest_model))
# Summary of the models performances
summary(models_compare)
##
## Call:
## summary.resamples(object = models_compare)
##
## Models: Model1, Model2, Model3, Model4
## Number of resamples: 25
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Model1 0.9000000 0.9361702 0.9545455 0.9515344 0.9729730 1.00 0
## Model2 0.9189189 0.9534884 0.9565217 0.9605841 0.9772727 1.00 0
## Model3 0.9347826 0.9772727 0.9800000 0.9828838 1.0000000 1.00 0
## Model4 0.9069767 0.9347826 0.9545455 0.9517384 0.9761905 0.98 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## Model1 0.8434442 0.9036227 0.9312500 0.9259588 0.9578107 1.0000000 0
## Model2 0.8763920 0.9297386 0.9332366 0.9399519 0.9656250 1.0000000 0
## Model3 0.9001447 0.9652997 0.9698614 0.9738443 1.0000000 1.0000000 0
## Model4 0.8566667 0.9001447 0.9312500 0.9263153 0.9642553 0.9698614 0
From the resampling accuracy results, the neural network model (Model3) performs best among the four models. (On the single held-out test split, the random forest achieved the highest accuracy, 0.9667, but the 25 resamples give a more robust comparison.) Also, Petal Width and Petal Length are the key features for the classification.