Clustering is the segmentation of data into a set of homogeneous clusters of observations, where members of the same cluster are similar to one another. The goal of this project is to analyse the performance of the unsupervised algorithm by using it to identify subspecies in the iris dataset and comparing the resulting clusters with the iris class (the subspecies name), which is already included in the dataset.
Iris is a genus of 260-300 species of flowering plants with showy flowers. The data set was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. It consists of 150 data points, each with four features of the flower (sepal length, sepal width, petal length and petal width). The data set also records the subspecies of each flower, but here the subspecies name (class) column is removed so that the performance of the unsupervised algorithm can be assessed against it afterwards.
#set working directory
getwd()
## [1] "C:/Users/Thinithi/Desktop/PORTFOLIO/clustering"
setwd("C:/Users/Thinithi/Desktop/WEKA/weka-3-8-2/data")
#Loading libraries
library(foreign) #Load arff files
library(car) #Plot scatterplot matrix
## Loading required package: carData
library(RColorBrewer) #Control palettes
library(cluster) # Obtain cluster results & cluster plot
#load the dataset
iris<-read.arff("iris.arff")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ sepallength: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ sepalwidth : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ petallength: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ petalwidth : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ class : Factor w/ 3 levels "Iris-setosa",..: 1 1 1 1 1 1 1 1 1 1 ...
#Testing for missing values
missing<-sapply(iris[,-5], function(x) sum(is.na(x)))
missing
## sepallength sepalwidth petallength petalwidth
## 0 0 0 0
#Detecting Outliers
A<-boxplot(iris[,-5])
outliers<-A$out #data points lying outside the boxplot whiskers
#Treating Outliers
qn<-quantile(iris$sepalwidth, c(0.05, 0.95), na.rm = TRUE)
qn
## 5% 95%
## 2.345 3.800
Iris <- within(iris, {
  sepalwidth <- ifelse(sepalwidth < qn[1], qn[1], sepalwidth)
  sepalwidth <- ifelse(sepalwidth > qn[2], qn[2], sepalwidth)
})
#Re-testing for Outliers
boxplot(Iris[,-5])
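The winsorising step above can also be written as a small reusable helper; the sketch below assumes the same 5th/95th percentile caps and applies them to every numeric column (the names cap_outliers and Iris_capped are illustrative and not part of the original analysis).
# Illustrative helper: cap a numeric vector at its 5th and 95th percentiles,
# mirroring the treatment applied to sepalwidth above
cap_outliers <- function(x, lower = 0.05, upper = 0.95) {
  qn <- quantile(x, c(lower, upper), na.rm = TRUE)
  pmin(pmax(x, qn[1]), qn[2])
}
# Apply the cap to every numeric column (class column excluded)
Iris_capped <- iris
Iris_capped[,-5] <- lapply(iris[,-5], cap_outliers)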
Principal component analysis is used to identify which variables carry the most variance, because the larger the variance, the more information the variable contains. The data is scaled beforehand so that the variances of the different variables are equal and no single variable dominates once principal component analysis is applied.
pc<-prcomp(Iris[,-5], center = TRUE,scale. = TRUE)
summary(pc)
## Importance of components:
## PC1 PC2 PC3 PC4
## Standard deviation 1.7068 0.9575 0.38603 0.1442
## Proportion of Variance 0.7283 0.2292 0.03725 0.0052
## Cumulative Proportion 0.7283 0.9576 0.99480 1.0000
plot(pc, main="Principal Component analysis (Iris dataset)")
Principal Component 1 (PC1) explains 73% of the total variation in the data set, while PC2 explains 23%. Together, PC1 and PC2 alone explain about 96% of the variance.
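To see which of the original variables drive PC1 and PC2, the rotation (loadings) matrix returned by prcomp can be inspected; this is a quick supplementary check rather than part of the original analysis.
# Loadings of the scaled variables on the first two principal components
round(pc$rotation[, 1:2], 3)
# Proportion of variance explained, computed directly from the standard deviations
round(pc$sdev^2 / sum(pc$sdev^2), 4)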
The optimal number of clusters K is taken at the elbow point: the point at which increasing K by one no longer produces a significant improvement in the total within-cluster sum of squares.
# Compute within-cluster sum of squares for k = 2 to k = 15
set.seed(1)
k.max <- 15
wss <- sapply(1:k.max,
              function(k){kmeans(iris[,-5], k, nstart = 50, iter.max = 15)$tot.withinss})
wss
## [1] 680.82440 152.36871 78.94084 57.31787 46.53558 38.94595 34.18921
## [8] 29.87992 27.76542 25.94376 24.11323 22.64357 21.55965 19.79325
## [15] 18.45639
#plot wss against increasing k-values (Elbow point = Optimal K)
plot(1:k.max, wss,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total within-clusters sum of squares", main="Total within-clusters sum of squares against K")
Based on the elbow plot, K is set to 3. The maximum number of iterations is capped at 50 to ensure that the algorithm converges.
result2<-kmeans(iris[,-5],3,iter.max =50)
result2
## K-means clustering with 3 clusters of sizes 62, 38, 50
##
## Cluster means:
## sepallength sepalwidth petallength petalwidth
## 1 5.901613 2.748387 4.393548 1.433871
## 2 6.850000 3.073684 5.742105 2.071053
## 3 5.006000 3.418000 1.464000 0.244000
##
## Clustering vector:
## [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
## [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2
## [106] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2
## [141] 2 2 1 2 2 2 1 2 2 1
##
## Within cluster sum of squares by cluster:
## [1] 39.82097 23.87947 15.24040
## (between_SS / total_SS = 88.4 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
Obtain the cluster means and identify which variables played a major role in forming the clusters: if the cluster mean values of a particular variable change substantially from one cluster to another, that variable has played an important role in the clustering process.
# Identify which variables played a major role in deciding the clusters
aggregate(iris[,-5],by=list(result2$cluster),FUN=mean)
## Group.1 sepallength sepalwidth petallength petalwidth
## 1 1 5.901613 2.748387 4.393548 1.433871
## 2 2 6.850000 3.073684 5.742105 2.071053
## 3 3 5.006000 3.418000 1.464000 0.244000
The petallength and petalwidth cluster means differ significantly from one cluster to another, which is especially noticeable between cluster 2 and cluster 3. This indicates that these two variables contributed heavily to the clustering process.
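Since the four measurements are on different scales, the comparison is easier when the cluster means are computed on standardised (z-scored) variables; a small sketch, supplementary to the original output:
# Cluster means on z-scored variables, so differences are comparable across columns
scaled_iris <- scale(iris[,-5])
aggregate(as.data.frame(scaled_iris), by = list(cluster = result2$cluster), FUN = mean)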
Since there are more than two dimensions, the clusplot function will perform principal component analysis (PCA) and plot the data points according to the first two principal components that explain the majority of the variance.
#Scatterplot Matrix
pairs(iris,pch = 3, cex =0.5, col = result2$cluster,main="Scatterplot matrix",lower.panel=NULL,cex.labels=1.7)
#Cluster
clusplot(iris[,-5], result2$cluster, color=TRUE, shade=TRUE,
labels=1, lines=0,main="Clusterplot of the Iris dataset",cex=1)
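The same two-dimensional view can be reproduced by hand: project the scaled observations onto the first two principal components and colour the points by their cluster assignment. This sketch only approximates what clusplot does internally and is not its actual implementation (pc_iris is an illustrative name).
# Project the scaled observations onto PC1 and PC2 and colour by k-means cluster
pc_iris <- prcomp(iris[,-5], center = TRUE, scale. = TRUE)
plot(pc_iris$x[, 1:2], col = result2$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "K-means clusters in PC space")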
class<-iris[,5]
my_solution<-data.frame(cluster=result2$cluster,class)
my_solution
## cluster class
## 1 3 Iris-setosa
## 2 3 Iris-setosa
## 3 3 Iris-setosa
## 4 3 Iris-setosa
## 5 3 Iris-setosa
## 6 3 Iris-setosa
## 7 3 Iris-setosa
## 8 3 Iris-setosa
## 9 3 Iris-setosa
## 10 3 Iris-setosa
## 11 3 Iris-setosa
## 12 3 Iris-setosa
## 13 3 Iris-setosa
## 14 3 Iris-setosa
## 15 3 Iris-setosa
## 16 3 Iris-setosa
## 17 3 Iris-setosa
## 18 3 Iris-setosa
## 19 3 Iris-setosa
## 20 3 Iris-setosa
## 21 3 Iris-setosa
## 22 3 Iris-setosa
## 23 3 Iris-setosa
## 24 3 Iris-setosa
## 25 3 Iris-setosa
## 26 3 Iris-setosa
## 27 3 Iris-setosa
## 28 3 Iris-setosa
## 29 3 Iris-setosa
## 30 3 Iris-setosa
## 31 3 Iris-setosa
## 32 3 Iris-setosa
## 33 3 Iris-setosa
## 34 3 Iris-setosa
## 35 3 Iris-setosa
## 36 3 Iris-setosa
## 37 3 Iris-setosa
## 38 3 Iris-setosa
## 39 3 Iris-setosa
## 40 3 Iris-setosa
## 41 3 Iris-setosa
## 42 3 Iris-setosa
## 43 3 Iris-setosa
## 44 3 Iris-setosa
## 45 3 Iris-setosa
## 46 3 Iris-setosa
## 47 3 Iris-setosa
## 48 3 Iris-setosa
## 49 3 Iris-setosa
## 50 3 Iris-setosa
## 51 1 Iris-versicolor
## 52 1 Iris-versicolor
## 53 2 Iris-versicolor
## 54 1 Iris-versicolor
## 55 1 Iris-versicolor
## 56 1 Iris-versicolor
## 57 1 Iris-versicolor
## 58 1 Iris-versicolor
## 59 1 Iris-versicolor
## 60 1 Iris-versicolor
## 61 1 Iris-versicolor
## 62 1 Iris-versicolor
## 63 1 Iris-versicolor
## 64 1 Iris-versicolor
## 65 1 Iris-versicolor
## 66 1 Iris-versicolor
## 67 1 Iris-versicolor
## 68 1 Iris-versicolor
## 69 1 Iris-versicolor
## 70 1 Iris-versicolor
## 71 1 Iris-versicolor
## 72 1 Iris-versicolor
## 73 1 Iris-versicolor
## 74 1 Iris-versicolor
## 75 1 Iris-versicolor
## 76 1 Iris-versicolor
## 77 1 Iris-versicolor
## 78 2 Iris-versicolor
## 79 1 Iris-versicolor
## 80 1 Iris-versicolor
## 81 1 Iris-versicolor
## 82 1 Iris-versicolor
## 83 1 Iris-versicolor
## 84 1 Iris-versicolor
## 85 1 Iris-versicolor
## 86 1 Iris-versicolor
## 87 1 Iris-versicolor
## 88 1 Iris-versicolor
## 89 1 Iris-versicolor
## 90 1 Iris-versicolor
## 91 1 Iris-versicolor
## 92 1 Iris-versicolor
## 93 1 Iris-versicolor
## 94 1 Iris-versicolor
## 95 1 Iris-versicolor
## 96 1 Iris-versicolor
## 97 1 Iris-versicolor
## 98 1 Iris-versicolor
## 99 1 Iris-versicolor
## 100 1 Iris-versicolor
## 101 2 Iris-virginica
## 102 1 Iris-virginica
## 103 2 Iris-virginica
## 104 2 Iris-virginica
## 105 2 Iris-virginica
## 106 2 Iris-virginica
## 107 1 Iris-virginica
## 108 2 Iris-virginica
## 109 2 Iris-virginica
## 110 2 Iris-virginica
## 111 2 Iris-virginica
## 112 2 Iris-virginica
## 113 2 Iris-virginica
## 114 1 Iris-virginica
## 115 1 Iris-virginica
## 116 2 Iris-virginica
## 117 2 Iris-virginica
## 118 2 Iris-virginica
## 119 2 Iris-virginica
## 120 1 Iris-virginica
## 121 2 Iris-virginica
## 122 1 Iris-virginica
## 123 2 Iris-virginica
## 124 1 Iris-virginica
## 125 2 Iris-virginica
## 126 2 Iris-virginica
## 127 1 Iris-virginica
## 128 1 Iris-virginica
## 129 2 Iris-virginica
## 130 2 Iris-virginica
## 131 2 Iris-virginica
## 132 2 Iris-virginica
## 133 2 Iris-virginica
## 134 1 Iris-virginica
## 135 2 Iris-virginica
## 136 2 Iris-virginica
## 137 2 Iris-virginica
## 138 2 Iris-virginica
## 139 1 Iris-virginica
## 140 2 Iris-virginica
## 141 2 Iris-virginica
## 142 2 Iris-virginica
## 143 1 Iris-virginica
## 144 2 Iris-virginica
## 145 2 Iris-virginica
## 146 2 Iris-virginica
## 147 1 Iris-virginica
## 148 2 Iris-virginica
## 149 2 Iris-virginica
## 150 1 Iris-virginica
table(result2$cluster,class)
## class
## Iris-setosa Iris-versicolor Iris-virginica
## 1 0 48 14
## 2 0 2 36
## 3 50 0 0
#Accuracy = correctly grouped observations / total observations
#134 = 50 setosa + 48 versicolor + 36 virginica (largest count in each cluster row of the table above)
accuracy<-((134/150)*100)
accuracy
## [1] 89.33333
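Rather than hard-coding the 134 correctly grouped observations, the same accuracy can be obtained programmatically by mapping each cluster to its majority class; a short sketch:
# Map each cluster to its most frequent class and count the matches
conf <- table(result2$cluster, class)
correct <- sum(apply(conf, 1, max)) # largest count in each cluster row
(correct / nrow(iris)) * 100        # 89.33%, as above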