Problem statement
Our aim is to:
• Understand and explore the Iris data set
• Analyze the data using supervised and unsupervised learning methods
• Summarize findings and present the conclusion
Approach
We will follow these steps to achieve our goal:
• Initial exploration of data
• Data preparation for analysis
• Applying supervised learning algorithms
• Applying unsupervised learning algorithms
• Conclusion
Data set used - Edgar Anderson’s Iris Data
Description - This famous (Fisher’s or Anderson’s) iris data set gives the measurements, in centimeters, of sepal length, sepal width, petal length and petal width for 50 flowers from each of 3 species of iris. The species are Iris setosa, versicolor, and virginica.
Format - iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
• Petal length and petal width differ markedly across Setosa, Versicolor and Virginica (far more than sepal length and width) and can therefore be used for classification and clustering
• In K-Nearest Neighbors, increasing the number of neighbors K increases model bias and decreases model variance. Classification error on the validation (test) dataset rises once K grows too large. K = 3 has been selected, which gives an accuracy of 96.66% (29/30)
• The goodness of a K-means clustering is judged by the withinness and betweenness of the clusters formed. Based on the knee curve, k = 3 (withinness/betweenness = 0.13) and k = 5 (withinness/betweenness = 0.08) have been selected for the final clusters
• After looking at the dendrogram, we decided to set the level of dissimilarity at height 2, which gave us 6 clusters
Dataset Summary
The data contains 150 records across the three species – setosa, versicolor and virginica – split equally (50 each). For the purpose of analysis, we divided the data set into ‘train’ and ‘test’ sets. The ‘train’ set is the part of the data on which the supervised and unsupervised models are trained; the models are then applied to the ‘test’ set to gauge accuracy. We used stratified sampling to split the data so that both sets contain an equal number of records from each of the three species.
data(iris)
attach(iris)
table(Species)
## Species
## setosa versicolor virginica
## 50 50 50
set.seed(12824891)  # numeric seed (a quoted string is not a valid seed and would error)
# EDA ---------------------------------------------------------------------
# One subset per species, for a stratified 80/20 train/test split
setosa <- iris[Species == "setosa", ]
versicolor <- iris[Species == "versicolor", ]
virginica <- iris[Species == "virginica", ]
setosa_ind <- sample(seq_len(nrow(setosa)), size = 0.8*nrow(setosa))
setosa_train <- setosa[setosa_ind,]
setosa_test <- setosa[-setosa_ind,]
versicolor_ind <- sample(seq_len(nrow(versicolor)), size = 0.8*nrow(versicolor))
versicolor_train <- versicolor[versicolor_ind,]
versicolor_test <- versicolor[-versicolor_ind,]
virginica_ind <- sample(seq_len(nrow(virginica)), size = 0.8*nrow(virginica))
virginica_train <- virginica[virginica_ind,]
virginica_test <- virginica[-virginica_ind,]
iris_train <- rbind(setosa_train, versicolor_train, virginica_train)
iris_test <- rbind(setosa_test, versicolor_test, virginica_test)
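A quick sanity check (added here for illustration) that the stratified split kept the species balanced:
table(iris_train$Species)  # expected: 40 records per species
table(iris_test$Species)   # expected: 10 records per species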
colnames(iris)
## [1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
## [5] "Species"
# Per-species minimum, maximum and mean of each measurement (training data)
min.sepal.length <- c(min(setosa_train$Sepal.Length)
, min(versicolor_train$Sepal.Length)
, min(virginica_train$Sepal.Length))
min.sepal.width <- c(min(setosa_train$Sepal.Width)
, min(versicolor_train$Sepal.Width)
, min(virginica_train$Sepal.Width))
min.Petal.width <- c(min(setosa_train$Petal.Width)
, min(versicolor_train$Petal.Width)
, min(virginica_train$Petal.Width))
min.Petal.Length <- c(min(setosa_train$Petal.Length)
, min(versicolor_train$Petal.Length)
, min(virginica_train$Petal.Length))
max.sepal.length <- c(max(setosa_train$Sepal.Length)
, max(versicolor_train$Sepal.Length)
, max(virginica_train$Sepal.Length))
max.sepal.width <- c(max(setosa_train$Sepal.Width)
, max(versicolor_train$Sepal.Width)
, max(virginica_train$Sepal.Width))
max.Petal.width <- c(max(setosa_train$Petal.Width)
, max(versicolor_train$Petal.Width)
, max(virginica_train$Petal.Width))
max.Petal.Length <- c(max(setosa_train$Petal.Length)
, max(versicolor_train$Petal.Length)
, max(virginica_train$Petal.Length))
mean.sepal.length <- c(mean(setosa_train$Sepal.Length)
, mean(versicolor_train$Sepal.Length)
, mean(virginica_train$Sepal.Length))
mean.sepal.width <- c(mean(setosa_train$Sepal.Width)
, mean(versicolor_train$Sepal.Width)
, mean(virginica_train$Sepal.Width))
mean.Petal.width <- c(mean(setosa_train$Petal.Width)
, mean(versicolor_train$Petal.Width)
, mean(virginica_train$Petal.Width))
mean.Petal.Length <- c(mean(setosa_train$Petal.Length)
, mean(versicolor_train$Petal.Length)
, mean(virginica_train$Petal.Length))
Category <- c("setosa", "versicolor", "virginica")
# Output 1
Summary.Sepal.Length <- data.frame(Category
, min.sepal.length
, max.sepal.length
, mean.sepal.length)
# Output 2
Summary.Sepal.Width <- data.frame(Category
, min.sepal.width
, max.sepal.width
, mean.sepal.width)
# Output 3
Summary.Petal.Length <- data.frame(Category
, min.Petal.Length
, max.Petal.Length
, mean.Petal.Length)
# Output 4
Summary.Petal.Width <- data.frame(Category
, min.Petal.width
, max.Petal.width
, mean.Petal.width)
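The per-species summaries above can also be computed in a single call; a compact alternative sketch using base R’s aggregate():
# Per-species min, max and mean of all four measurements in one call
aggregate(. ~ Species, data = iris_train,
          FUN = function(x) c(min = min(x), max = max(x), mean = mean(x)))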
Observations
• Virginica has the maximum Sepal Length while Setosa has the minimum Sepal Length
• Setosa has the maximum Sepal Width while Virginica has the minimum Sepal Width
• Virginica has the maximum Petal Length while Setosa has the minimum Petal Length
• Virginica has the maximum Petal Width while Setosa has the minimum Petal Width
• Petal Length, Petal Width and Sepal Length follow the same ordering across species, with Virginica highest, Versicolor in the middle and Setosa lowest. Sepal Width, however, follows a different ordering: Setosa is highest, followed by Virginica and then Versicolor.
for (i in 1:5) {
  print(colnames(iris)[i])   # variable name
  print(summary(iris[, i]))  # summary statistics (counts for Species)
}
## [1] "Sepal.Length"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.300 5.100 5.800 5.843 6.400 7.900
## [1] "Sepal.Width"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.000 2.800 3.000 3.057 3.300 4.400
## [1] "Petal.Length"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.600 4.350 3.758 5.100 6.900
## [1] "Petal.Width"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.100 0.300 1.300 1.199 1.800 2.500
## [1] "Species"
## setosa versicolor virginica
## 50 50 50
Univariate Analysis - Boxplots
library(ggplot2)
ggplot(iris_train, aes(x = Species, y = Sepal.Length, color = Species)) +
geom_boxplot() +
ggtitle("Box plot of Sepal.Length")
ggplot(iris_train, aes(x = Species, y = Sepal.Width, color = Species)) +
geom_boxplot() +
ggtitle("Box plot of Sepal.Width")
ggplot(iris_train, aes(x = Species, y = Petal.Width, color = Species)) +
geom_boxplot() +
ggtitle("Box plot of Petal.Width")
ggplot(iris_train, aes(x = Species, y = Petal.Length, color = Species)) +
geom_boxplot() +
ggtitle("Box plot of Petal.Length")
Observations:
• The variance in Petal Length is small for Setosa compared with Virginica and Versicolor. Petal Length values also overlap between Versicolor and Virginica, so Petal Length alone is not a good classifier for separating those two species.
• Petal Width shows little variance for Setosa, apart from a few outliers, while Virginica and Versicolor show more variance and some overlapping Petal Width values.
• Setosa has a few outliers in Sepal Length, Virginica has one and Versicolor has none. Sepal Length values overlap across all three species.
• Setosa has one Sepal Width outlier far from its typical values. The Sepal Width intervals of Setosa, Virginica and Versicolor overlap, which disqualifies Sepal Width as a good candidate for classifying/clustering the three species.
Univariate analysis (Histograms and density plots)
library(gridExtra)
library(grid)
ggplot(data = iris_train) +
geom_density(aes(x = Petal.Length, group = Species, fill = Species)
, adjust = 1.5 , alpha = 0.2) +
ggtitle("Density Curve of Petal.Length")
ggplot(data = iris_train) +
geom_density(aes(x = Petal.Width, group = Species, fill = Species)
, adjust = 1.5 , alpha = 0.2) +
ggtitle("Density Curve of Petal.Width")
ggplot(data = iris_train) +
geom_density(aes(x = Sepal.Width, group = Species, fill = Species)
, adjust = 1.5 , alpha = 0.2) +
ggtitle("Density Curve of Sepal.Width")
ggplot(data = iris_train) +
geom_density(aes(x = Sepal.Length, group = Species, fill = Species)
, adjust = 1.5 , alpha = 0.2) +
ggtitle("Density Curve of Sepal.Length")
Observations:
• We see a clear demarcation in the range of Petal Length between Setosa and the other two species, though Versicolor and Virginica overlap slightly. Setosa’s Petal Length is approximately normally distributed, while Versicolor and Virginica show bimodal distributions.
• Petal Width can be a good candidate to be used as a classifier as we can see almost clear demarcation of intervals of Petal Width for all the three species.
• Sepal Length for all three species follows an approximately normal distribution.
• Sepal Width for all three species follows a normal distribution, with overlapping intervals between all three species.
Bivariate analysis (Scatter plot)
ggplot(data = iris_train, aes(x = Petal.Length, y = Petal.Width, col = Species)) +
geom_point(size = 3, alpha = 0.5) +
ggtitle("Scatter Plot of Petal.Length Vs Petal.Width")
ggplot(data = iris_train, aes(x = Sepal.Length, y = Sepal.Width, col = Species)) +
geom_point(size = 3, alpha = 0.5) +
ggtitle("Scatter Plot of Sepal.Length Vs Sepal.Width")
Observations:
• Petal Length and Petal Width are good classifiers: the scatter plot shows three near-distinct clusters for Setosa (red), Versicolor (green) and Virginica (blue).
• The scatter plots show a clearly separated Setosa cluster, while Versicolor and Virginica have a few overlapping values.
K-Nearest Neighbors (KNN) Classification
Problem:
To classify flowers among three species, i.e. Setosa, Versicolor and Virginica, based on the dimensions of their sepals and petals
KNN Summary:
KNN places each data point in a multi-dimensional space with one dimension per classifier variable; each point’s location is given by its values on those variables. To assign a class to a new data point, KNN polls the K nearest data points and gives the new point the majority class among those neighbors.
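As an illustration of this voting mechanism, here is a minimal from-scratch sketch for classifying a single new point (the analysis below uses class::knn; the query vector in the last line is hypothetical):
# Euclidean distance to every training point, then a majority vote among the k nearest
knn_predict_one <- function(train_x, train_y, new_x, k = 3) {
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))  # distance to each training point
  nn <- order(d)[1:k]                             # indices of the k nearest neighbors
  names(which.max(table(train_y[nn])))            # majority class label (ties: first level)
}
knn_predict_one(iris_train[, -5], iris_train[, 5], c(5.9, 3.0, 5.1, 1.8))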
Variable Selection:
From the density plots it is evident that petal length and petal width are the most important variables for classifying flowers among the three species: Setosa’s petal length and petal width curves do not overlap with those of Versicolor and Virginica. Sepal length and sepal width have also been included in case they carry additional information, although their overlapping density curves suggest they contribute little to separating the species
Test and Train Accuracy Vs K and Selection of K:
A total of 120 models has been built, for k = 1 up to k = 120 (all data points in the training dataset), and the classification error has been calculated on both the train and test (validation) datasets.
When k = 1, the classification error on the training dataset is zero, because the nearest neighbor of each training point is the point itself. As k increases towards 120, the training error rises steadily (red line in the graph below).
As K increases, the model transitions from overfitted to generalized behavior. By plotting train and test error across increasing K, we found the optimum value K = 3, which leads to a useful classification model (cyan line in the graph below)
• Small value of k can lead to overfitted model (low bias, high variance)
• Large value of k can lead to generalized model (high bias, low variance)
# KNN ---------------------------------------------------------------------
library(class)
knn_k_train <- function(n){
  # classification error on the training set for a given k
  knn_iris <- knn(train = iris_train[, -5],
                  test = iris_train[, -5], cl = iris_train[, 5], k = n)
  sum(iris_train[, 5] != knn_iris) / nrow(iris_train)
}
train_error <- vector(mode = "numeric", length = 0)
for (i in 1:120) {
  train_error[i] <- knn_k_train(i)
}
knn_k_test <- function(n){
  # classification error on the test (validation) set for a given k
  knn_iris <- knn(train = iris_train[, -5],
                  test = iris_test[, -5], cl = iris_train[, 5], k = n)
  sum(iris_test[, 5] != knn_iris) / nrow(iris_test)
}
test_error <- vector(mode = "numeric", length = 0)
for (i in 1:120) {
  test_error[i] <- knn_k_test(i)
}
error <- data.frame(k = c(1:120), train_error, test_error)
library(reshape)
error_melted <- melt(error, id = c("k"))
colnames(error_melted)[2] <- "Train_Vs_Test"
ggplot(error_melted,
aes(x = k, y = value, group = Train_Vs_Test, color = Train_Vs_Test)) +
geom_line() +
ylab("Prediction Error Rate") +
ggtitle("Train Vs Test Prediction Error Rate across K")
# small value of k: overfitted model, low bias, high variance
# large value of k: generalized model, high bias, low variance
# Selecting k = 3 from the error graph
knn_iris <- knn(train = iris_train[,-5],
test = iris_test[,-5], cl = iris_train[ ,5], k = 3)
Confusion Matrix:
The KNN model with K = 3 is used to classify the flowers in the test (validation) dataset. The classification results follow: the model classifies 29 out of 30 flowers correctly (accuracy ≈ 96.66%)
table(iris_test[,5]
, knn_iris
, dnn = c("True", "Predicted"))
## Predicted
## True setosa versicolor virginica
## setosa 10 0 0
## versicolor 0 10 0
## virginica 0 1 9
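The accuracy quoted above follows directly from the confusion matrix; a small check added for illustration:
cm <- table(iris_test[, 5], knn_iris)
sum(diag(cm)) / sum(cm)  # 29/30 correct, i.e. ~0.9667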
K-Means Clustering
Problem:
Formulate logical groups from the available flower dataset based on sepal and petal dimensions.
K-Means Summary:
K-Means places each data point in a multi-dimensional space with one dimension per clustering variable; each point’s location is given by its values on those variables. In the first iteration, K-Means randomly selects K data points from the dataset and makes them the centroids of the K clusters, where K is the number of clusters required in the final output. It then assigns every remaining data point to its nearest centroid by Euclidean distance, recomputes each cluster’s centroid, and repeats these assign-and-recompute steps until the centroid locations stabilize.
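As an illustration only (the analysis below uses base R’s kmeans()), one assign-and-recompute iteration can be sketched as:
# One Lloyd iteration, for illustration; in practice call kmeans()
X <- as.matrix(iris[, 1:4])
k <- 3
centroids <- X[sample(nrow(X), k), ]                    # random initial centroids
d <- as.matrix(dist(rbind(centroids, X)))[-(1:k), 1:k]  # point-to-centroid distances
assignment <- apply(d, 1, which.min)                    # assign each point to nearest centroid
centroids <- apply(X, 2, function(v) tapply(v, assignment, mean))  # recompute centroids
# Repeat the last three steps until the centroids stop moving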
Variable Selection:
Variables with higher variance help to separate the data points into distinct groups, while a variable with zero variance would put all data points into one cluster. Judging from the box plots and density curves, all four variables, i.e. sepal/petal length/width, have enough variance to be considered as candidates for K-means clustering
Withinness Vs Betweenness:
The goodness of a clustering result is best scored by its withinness and betweenness metrics. Withinness measures how compact each cluster is, and betweenness measures how far each cluster is from the other clusters. In the ideal scenario, withinness would be zero and betweenness infinite.
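Both quantities are available directly on a fitted kmeans object; a short sketch (the nstart argument is an optional stabilizer not used in the original code):
km <- kmeans(iris[, 1:4], centers = 3, nstart = 25)
km$withinss                       # within-cluster sum of squares, one value per cluster
km$betweenss                      # between-cluster sum of squares
mean(km$withinss) / km$betweenss  # the compactness-vs-separation ratio tracked below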
Selection of number of clusters:
Nineteen K-means models have been built, for numbers of clusters from 2 to 20, and the withinness-by-betweenness metric has been calculated for each, as shown in the following graph. K = 3 shows the steepest decrease in withinness by betweenness; however, k = 5 and k = 7 are also eligible candidates for the number of clusters.
library(fpc)
kmeans_k <- function(n){
  # compactness-vs-separation ratio for an n-cluster solution
  kmeans_model <- kmeans(iris[1:4], n)
  mean(kmeans_model$withinss) / kmeans_model$betweenss
}
within_by_between <- vector(mode = "numeric", length = 0)
for (i in 2:20) {
  within_by_between[i] <- kmeans_k(i)
}
within_by_between_metric <- data.frame(k = 2:20, kpi = within_by_between[2:20])
ggplot(within_by_between_metric,
aes(x = k, y = kpi)) +
geom_line() +
geom_point() +
ylab("Withinness by Betweenness") +
xlab("Number of Clusters") +
ggtitle("Withinness by Betweenness across K")
Cluster Analysis:
The discriminant coordinates method is used to derive two hybrid dimensions/coordinates from the four raw dimensions, i.e. sepal length, sepal width, petal length and petal width.
Discriminant coordinates (Vs Principal Component Analysis):
Discriminant analysis is similar to principal component analysis in that both reduce the dimensionality of a dataset. However, PCA does not require a class variable to find its reduced dimensions, whereas discriminant coordinates do. PCA chooses dimensions so as to explain the maximum variability in the data; discriminant coordinates choose dimensions that best separate the data points into the categories given by the class variable
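A short comparative sketch (assuming fpc’s discrcoord(), which plotcluster() uses internally):
# PCA: unsupervised; axes chosen to maximize explained variance
pca <- prcomp(iris[, 1:4], scale. = TRUE)
plot(pca$x[, 1:2], col = iris$Species, main = "First two principal components")
# Discriminant coordinates: supervised; axes chosen to separate the given classes
dc <- discrcoord(as.matrix(iris[, 1:4]), as.integer(iris$Species))
plot(dc$proj[, 1:2], col = iris$Species, main = "First two discriminant coordinates")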
K-Means cluster representation with Discriminant Coordinates:
The fitted K-Means cluster assignments are passed to the discriminant-coordinates routine together with the four variables used for clustering. The first two discriminant coordinates are then used to display the clusters’ withinness and betweenness in 2D for the four-dimensional data
# Based on knee curve, we will choose number of clusters = 3 or 5
model_kmeans_1 <- kmeans(iris[1:4],3)
plotcluster(iris[,1:4], model_kmeans_1$cluster)
model_kmeans_2 <- kmeans(iris[1:4],5)
plotcluster(iris[,1:4], model_kmeans_2$cluster)
# it can be observed from the cluster plots that 5 clusters are able to
# segregate the data more precisely than 3 clusters
Hierarchical Clustering
Problem:
Formulate logical groups from the available flower dataset based on sepal and petal dimensions
Hierarchical clustering summary:
Hierarchical clustering starts by treating each observation as a separate cluster. It then repeatedly executes two steps: (1) identify the two clusters that are closest together, and (2) merge those two clusters. This continues until all clusters have been merged. The main output of hierarchical clustering is a dendrogram, which shows the hierarchical relationship between the clusters
Interpretation:
The vertical axis of the dendrogram represents the distance, or dissimilarity, between clusters; the horizontal axis represents the objects and clusters. Our interest here is in similarity and clustering. Each fusion of two clusters is represented on the graph by the splitting of a vertical line into two vertical lines, and the vertical position of the split, shown by the short horizontal bar, gives the distance (dissimilarity) between the two clusters. Here we have set k = 6, which highlights six clusters on the dendrogram. Looking at the dendrogram, we can choose the level of dissimilarity we want and set k accordingly.
# Hierarchical Clustering --------------------------------------------------
hc <- hclust(dist(iris[,1:4]))
plot(hc)
rect.hclust(hc, k = 6)
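rect.hclust() only draws boxes around the six clusters; to work with the actual assignments, they can be extracted with cutree() (a small addition for illustration):
clusters <- cutree(hc, k = 6)  # cluster label for each of the 150 flowers
table(clusters, iris$Species)  # how the six clusters map onto the three species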