Clustering is the segmentation of data into a set of homogeneous clusters of observations, so that members of the same cluster are similar to one another. The goal of this project is to analyse the performance of an unsupervised algorithm (k-means) by using it to identify subspecies in the iris dataset and comparing the resulting clusters with the iris class (the name of the subspecies), which is already included in the dataset.

Data Description

Iris is a genus of 260-300 species of flowering plants with showy flowers. The data set was introduced by the British statistician and biologist Ronald Fisher in his 1936 paper. It consists of 150 data points, each with 4 features of the flower (sepal length, sepal width, petal length and petal width). The data set also contains the subspecies label of each flower, but here the subspecies name (class) column is left out of the clustering so that it can be used afterwards to analyse the performance of the unsupervised algorithm.

Preparation

#set working directory
getwd()

## [1] "C:/Users/Thinithi/Desktop/PORTFOLIO/clustering"

setwd("C:/Users/Thinithi/Desktop/WEKA/weka-3-8-2/data")
#Loading libraries
library(foreign) #Load arff files
library(car) #Plot scatterplot matrix

## Loading required package: carData

library(RColorBrewer) #Control palettes 
library(cluster) # Obtain cluster results & cluster plot

Load dataset

#load the dataset
iris<-read.arff("iris.arff")
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ sepallength: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ sepalwidth : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ petallength: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ petalwidth : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ class      : Factor w/ 3 levels "Iris-setosa",..: 1 1 1 1 1 1 1 1 1 1 ...

Cleaning Data

#Testing for missing values
missing<-sapply(iris[,-5], function(x) sum(is.na(x)))
missing

## sepallength  sepalwidth petallength  petalwidth 
##           0           0           0           0

#Detecting Outliers
A<-boxplot(iris[,-5])

outliers<-A$out
#Treating Outliers
qn<-quantile(iris$sepalwidth, c(0.05, 0.95), na.rm = TRUE)
qn

##    5%   95% 
## 2.345 3.800

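#Cap sepalwidth at its 5th and 95th percentiles; the result is stored in a new data frame, Iris
Iris<- within(iris,{sepalwidth = ifelse(sepalwidth < qn[1], qn[1], sepalwidth)
sepalwidth = ifelse(sepalwidth > qn[2], qn[2], sepalwidth)}) 
#Re-testing for Outliers (numeric columns only, as in the first boxplot)
boxplot(Iris[,-5])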
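
The capping step above can also be wrapped in a small reusable helper. The following is only a minimal sketch: the function name winsorize and the toy vector x are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: cap values that fall below/above the given quantiles
winsorize <- function(x, probs = c(0.05, 0.95)) {
  qn <- quantile(x, probs, na.rm = TRUE, names = FALSE)
  pmin(pmax(x, qn[1]), qn[2])
}

# Toy vector with one extreme value (made up for illustration)
x <- c(1, 2, 3, 4, 100)
capped <- winsorize(x)
print(capped)
```

Using pmin and pmax keeps the helper vectorised, so it could in principle be applied to any numeric column, for example winsorize(iris$sepalwidth).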

Principal Component Analysis

Principal component analysis (PCA) is used to identify the directions (linear combinations of the original variables) along which the data varies the most, because the larger the variance, the more information a component carries. The data is scaled beforehand so that every variable has unit variance and no single variable dominates the principal components.

pc<-prcomp(Iris[,-5], center = TRUE,scale. = TRUE)
summary(pc)

## Importance of components:
##                           PC1    PC2     PC3    PC4
## Standard deviation     1.7068 0.9575 0.38603 0.1442
## Proportion of Variance 0.7283 0.2292 0.03725 0.0052
## Cumulative Proportion  0.7283 0.9576 0.99480 1.0000

plot(pc, main="Principal Component analysis (Iris dataset)")

Principal component 1 (PC1) explains 73% of the total variation in the data set, while PC2 explains 23%. Together, PC1 and PC2 explain 96% of the variance.
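
The proportions reported by summary(pc) follow from the squared standard deviations of the components. As a rough sketch, using the standard deviations printed above (they are rounded, so the results agree only to about three decimals; the variable names are illustrative):

```r
# Standard deviations of PC1..PC4 as printed by summary(pc) above (rounded)
sdev <- c(1.7068, 0.9575, 0.38603, 0.1442)

# The variance of each component is its squared standard deviation
variance <- sdev^2

prop_var <- variance / sum(variance)  # proportion of variance per component
cum_prop <- cumsum(prop_var)          # cumulative proportion

print(round(prop_var, 3))
print(round(cum_prop, 3))
```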

Elbow Method for finding the optimal number of clusters

The elbow point is the value of K beyond which increasing K by one (+1) no longer reduces the total within-cluster sum of squares significantly.

# Compute within-cluster sum of squares for k = 1 to k = 15
set.seed(1) 
k.max <- 15
wss<- sapply(1:k.max, 
             function(k){kmeans(iris[,-5], k, nstart=50,iter.max = 15 )$tot.withinss})
wss

##  [1] 680.82440 152.36871  78.94084  57.31787  46.53558  38.94595  34.18921
##  [8]  29.87992  27.76542  25.94376  24.11323  22.64357  21.55965  19.79325
## [15]  18.45639

#plot wss against increasing k-values (Elbow point = Optimal K)
plot(1:k.max, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares", main="Total within-clusters sum of squares against K")
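
One way to make the elbow less subjective is to look at how much each additional cluster reduces the total within-cluster sum of squares relative to the previous value. A small sketch using the first six wss values printed above (the name rel_drop is illustrative):

```r
# Total within-cluster sum of squares for K = 1..6, copied from the output above
wss <- c(680.82440, 152.36871, 78.94084, 57.31787, 46.53558, 38.94595)

# Relative reduction achieved by going from K-1 to K clusters
rel_drop <- -diff(wss) / head(wss, -1)
names(rel_drop) <- paste0("K=", 2:6)
print(round(rel_drop, 3))
```

The reduction is roughly 78% at K = 2 and 48% at K = 3, and stays below 30% from K = 4 onwards, which is consistent with choosing K = 3 for the k-means run.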

K-means clustering

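K-means is run with K = 3. The maximum number of iterations is set to 50 (iter.max = 50, compared with the default of 10) to give the algorithm enough room to converge. Note that the clustering is run on the original data frame iris rather than the outlier-treated Iris.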

result2<-kmeans(iris[,-5],3,iter.max =50)
result2

## K-means clustering with 3 clusters of sizes 62, 38, 50
## 
## Cluster means:
##   sepallength sepalwidth petallength petalwidth
## 1    5.901613   2.748387    4.393548   1.433871
## 2    6.850000   3.073684    5.742105   2.071053
## 3    5.006000   3.418000    1.464000   0.244000
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [71] 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 2 2
## [106] 2 1 2 2 2 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 1 2
## [141] 2 2 1 2 2 2 1 2 2 1
## 
## Within cluster sum of squares by cluster:
## [1] 39.82097 23.87947 15.24040
##  (between_SS / total_SS =  88.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"
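
The iter and ifault components listed above can be used to check whether a run actually finished within iter.max. A minimal sketch, assuming R's built-in iris data frame (datasets::iris, whose column names differ from the ARFF file) in place of iris.arff, and an arbitrary seed:

```r
# Assumption: the built-in iris data stands in for the ARFF version used in this report
x <- datasets::iris[, 1:4]

set.seed(1)  # arbitrary seed, only to make this sketch reproducible
fit <- kmeans(x, centers = 3, iter.max = 50)

# Number of iterations actually used
print(fit$iter)
# ifault is 0 when the algorithm reported no problem
print(fit$ifault)
```

If iter stays well below iter.max and ifault is 0, the cap was not the reason the algorithm stopped.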

Obtain the cluster means and identify which variables played a major role in deciding the clusters. If the cluster means of a particular variable differ markedly from one cluster to another, that variable has played an important role in the clustering process.

# Identify which variables played a major role in deciding the clusters
aggregate(iris[,-5],by=list(result2$cluster),FUN=mean)

##   Group.1 sepallength sepalwidth petallength petalwidth
## 1       1    5.901613   2.748387    4.393548   1.433871
## 2       2    6.850000   3.073684    5.742105   2.071053
## 3       3    5.006000   3.418000    1.464000   0.244000

The petallength and petalwidth cluster means differ markedly from one cluster to another, which is especially noticeable between cluster 2 and cluster 3. This indicates that these two variables contributed a lot to the clustering process.
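
Comparing raw cluster means can be misleading when variables have different spreads. One possible way to put the variables on a common footing is the share of each variable's total sum of squares that lies between clusters. This is a sketch under assumptions, not the method used in this report: it uses the built-in datasets::iris (column names differ from the ARFF file), an arbitrary seed and nstart = 25.

```r
# Assumption: the built-in iris data stands in for the ARFF version used in this report
x <- as.matrix(datasets::iris[, 1:4])

set.seed(1)  # arbitrary seed for reproducibility
fit <- kmeans(x, centers = 3, nstart = 25)

grand_mean <- colMeans(x)

# Between-cluster sum of squares per variable:
# cluster size times squared distance of the cluster mean from the grand mean
between_ss <- colSums(fit$size * sweep(fit$centers, 2, grand_mean)^2)

# Total sum of squares per variable
total_ss <- colSums(sweep(x, 2, grand_mean)^2)

# Share of each variable's variation that is explained by the clusters
ratio <- between_ss / total_ss
print(round(ratio, 3))
```

A ratio close to 1 means the clusters separate the observations almost entirely along that variable.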

Plotting the clusters

Since there are more than two dimensions, the clusplot function will perform principal component analysis (PCA) and plot the data points according to the first two principal components that explain the majority of the variance.

#Scatterplot Matrix
pairs(iris,pch = 3,  cex =0.5, col = result2$cluster,main="Scatterplot matrix",lower.panel=NULL,cex.labels=1.7)

#Cluster
clusplot(iris[,-5], result2$cluster, color=TRUE, shade=TRUE, 
         labels=1, lines=0,main="Clusterplot of the Iris dataset",cex=1)
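
The idea behind the cluster plot, projecting the data onto the first two principal components and colouring by cluster, can be reproduced by hand. The sketch below assumes the built-in datasets::iris, an arbitrary seed and prcomp with scaling; it is not necessarily identical to what clusplot draws, since clusplot has its own defaults and adds ellipses.

```r
# Assumption: the built-in iris data stands in for the ARFF version used in this report
x <- datasets::iris[, 1:4]

set.seed(1)  # arbitrary seed for reproducibility
fit <- kmeans(x, centers = 3, nstart = 25)

# Project the four measurements onto the principal components
pca <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]  # coordinates on PC1 and PC2

# Scatterplot of the first two components, coloured by k-means cluster
plot(scores, col = fit$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2",
     main = "K-means clusters on the first two principal components")
```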

Which cluster corresponds to which class

class<-iris[,5]
my_solution<-data.frame(cluster=result2$cluster,class)
my_solution

##     cluster           class
## 1         3     Iris-setosa
## 2         3     Iris-setosa
## 3         3     Iris-setosa
## 4         3     Iris-setosa
## 5         3     Iris-setosa
## 6         3     Iris-setosa
## 7         3     Iris-setosa
## 8         3     Iris-setosa
## 9         3     Iris-setosa
## 10        3     Iris-setosa
## 11        3     Iris-setosa
## 12        3     Iris-setosa
## 13        3     Iris-setosa
## 14        3     Iris-setosa
## 15        3     Iris-setosa
## 16        3     Iris-setosa
## 17        3     Iris-setosa
## 18        3     Iris-setosa
## 19        3     Iris-setosa
## 20        3     Iris-setosa
## 21        3     Iris-setosa
## 22        3     Iris-setosa
## 23        3     Iris-setosa
## 24        3     Iris-setosa
## 25        3     Iris-setosa
## 26        3     Iris-setosa
## 27        3     Iris-setosa
## 28        3     Iris-setosa
## 29        3     Iris-setosa
## 30        3     Iris-setosa
## 31        3     Iris-setosa
## 32        3     Iris-setosa
## 33        3     Iris-setosa
## 34        3     Iris-setosa
## 35        3     Iris-setosa
## 36        3     Iris-setosa
## 37        3     Iris-setosa
## 38        3     Iris-setosa
## 39        3     Iris-setosa
## 40        3     Iris-setosa
## 41        3     Iris-setosa
## 42        3     Iris-setosa
## 43        3     Iris-setosa
## 44        3     Iris-setosa
## 45        3     Iris-setosa
## 46        3     Iris-setosa
## 47        3     Iris-setosa
## 48        3     Iris-setosa
## 49        3     Iris-setosa
## 50        3     Iris-setosa
## 51        1 Iris-versicolor
## 52        1 Iris-versicolor
## 53        2 Iris-versicolor
## 54        1 Iris-versicolor
## 55        1 Iris-versicolor
## 56        1 Iris-versicolor
## 57        1 Iris-versicolor
## 58        1 Iris-versicolor
## 59        1 Iris-versicolor
## 60        1 Iris-versicolor
## 61        1 Iris-versicolor
## 62        1 Iris-versicolor
## 63        1 Iris-versicolor
## 64        1 Iris-versicolor
## 65        1 Iris-versicolor
## 66        1 Iris-versicolor
## 67        1 Iris-versicolor
## 68        1 Iris-versicolor
## 69        1 Iris-versicolor
## 70        1 Iris-versicolor
## 71        1 Iris-versicolor
## 72        1 Iris-versicolor
## 73        1 Iris-versicolor
## 74        1 Iris-versicolor
## 75        1 Iris-versicolor
## 76        1 Iris-versicolor
## 77        1 Iris-versicolor
## 78        2 Iris-versicolor
## 79        1 Iris-versicolor
## 80        1 Iris-versicolor
## 81        1 Iris-versicolor
## 82        1 Iris-versicolor
## 83        1 Iris-versicolor
## 84        1 Iris-versicolor
## 85        1 Iris-versicolor
## 86        1 Iris-versicolor
## 87        1 Iris-versicolor
## 88        1 Iris-versicolor
## 89        1 Iris-versicolor
## 90        1 Iris-versicolor
## 91        1 Iris-versicolor
## 92        1 Iris-versicolor
## 93        1 Iris-versicolor
## 94        1 Iris-versicolor
## 95        1 Iris-versicolor
## 96        1 Iris-versicolor
## 97        1 Iris-versicolor
## 98        1 Iris-versicolor
## 99        1 Iris-versicolor
## 100       1 Iris-versicolor
## 101       2  Iris-virginica
## 102       1  Iris-virginica
## 103       2  Iris-virginica
## 104       2  Iris-virginica
## 105       2  Iris-virginica
## 106       2  Iris-virginica
## 107       1  Iris-virginica
## 108       2  Iris-virginica
## 109       2  Iris-virginica
## 110       2  Iris-virginica
## 111       2  Iris-virginica
## 112       2  Iris-virginica
## 113       2  Iris-virginica
## 114       1  Iris-virginica
## 115       1  Iris-virginica
## 116       2  Iris-virginica
## 117       2  Iris-virginica
## 118       2  Iris-virginica
## 119       2  Iris-virginica
## 120       1  Iris-virginica
## 121       2  Iris-virginica
## 122       1  Iris-virginica
## 123       2  Iris-virginica
## 124       1  Iris-virginica
## 125       2  Iris-virginica
## 126       2  Iris-virginica
## 127       1  Iris-virginica
## 128       1  Iris-virginica
## 129       2  Iris-virginica
## 130       2  Iris-virginica
## 131       2  Iris-virginica
## 132       2  Iris-virginica
## 133       2  Iris-virginica
## 134       1  Iris-virginica
## 135       2  Iris-virginica
## 136       2  Iris-virginica
## 137       2  Iris-virginica
## 138       2  Iris-virginica
## 139       1  Iris-virginica
## 140       2  Iris-virginica
## 141       2  Iris-virginica
## 142       2  Iris-virginica
## 143       1  Iris-virginica
## 144       2  Iris-virginica
## 145       2  Iris-virginica
## 146       2  Iris-virginica
## 147       1  Iris-virginica
## 148       2  Iris-virginica
## 149       2  Iris-virginica
## 150       1  Iris-virginica

Accuracy

table(result2$cluster,class)

##    class
##     Iris-setosa Iris-versicolor Iris-virginica
##   1           0              48             14
##   2           0               2             36
##   3          50               0              0

#134 = 48 + 36 + 50, the observations that fall in the majority class of their cluster
accuracy<-((134/150)*100)
accuracy

## [1] 89.33333
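
Rather than typing the count by hand, the same figure can be derived from the cross-tabulation by assigning each cluster to its majority class. A small sketch that re-enters the table printed above as a matrix (the names tab, majority and correct are illustrative):

```r
# Cross-tabulation of cluster against class, copied from the output above
tab <- matrix(c( 0, 48, 14,
                 0,  2, 36,
                50,  0,  0),
              nrow = 3, byrow = TRUE,
              dimnames = list(cluster = c("1", "2", "3"),
                              class = c("Iris-setosa", "Iris-versicolor",
                                        "Iris-virginica")))

# Majority class of each cluster
majority <- colnames(tab)[apply(tab, 1, which.max)]
print(majority)

# Observations that belong to the majority class of their cluster
correct <- sum(apply(tab, 1, max))

accuracy <- correct / sum(tab) * 100
print(accuracy)
```

With the live objects from the report, tab could presumably be replaced by table(result2$cluster, class), which also keeps the calculation valid if the cluster numbering changes between runs.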

Clustering project: Exploring an Iris Dataset to identify subspecies

Thinithi_Bulathsinghala

March 7, 2019