First, we view the head of dataset iris:

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

Now assign all but the 5th column (species name) to the variable ‘df’:

df <- iris[,-5]

Center and scale the data:

df <- scale(df, center = TRUE, scale = TRUE)

Check the scaled data now has a mean of 0:

summary(df)
##   Sepal.Length       Sepal.Width       Petal.Length      Petal.Width     
##  Min.   :-1.86378   Min.   :-2.4258   Min.   :-1.5623   Min.   :-1.4422  
##  1st Qu.:-0.89767   1st Qu.:-0.5904   1st Qu.:-1.2225   1st Qu.:-1.1799  
##  Median :-0.05233   Median :-0.1315   Median : 0.3354   Median : 0.1321  
##  Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.67225   3rd Qu.: 0.5567   3rd Qu.: 0.7602   3rd Qu.: 0.7880  
##  Max.   : 2.48370   Max.   : 3.0805   Max.   : 1.7799   Max.   : 1.7064

Before we begin our analysis, we look at the iris dataset:

table(iris[,5])
## 
##     setosa versicolor  virginica 
##         50         50         50

We can see we have 50 of each type of species, so, in theory, our data should be grouped into three distinct clusters.

The kmeans function takes: data matrix, number of clusters/centres, the maximum no. of iterations allowed and ‘nstart’, the number of random starts. We assign the output of the kmeans function to the variable “km3”:

km3 <- kmeans(df, 3, iter.max = 10, nstart = 10)
km3
## K-means clustering with 3 clusters of sizes 47, 53, 50
## 
## Cluster means:
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1   1.13217737  0.08812645    0.9928284   1.0141287
## 2  -0.05005221 -0.88042696    0.3465767   0.2805873
## 3  -1.01119138  0.85041372   -1.3006301  -1.2507035
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2
##  [75] 2 1 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
## [112] 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 1 1 2 1
## [149] 1 2
## 
## Within cluster sum of squares by cluster:
## [1] 47.45019 44.08754 47.35062
##  (between_SS / total_SS =  76.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

Looking at how the three species are distributed with respect to the three clusters:

table (iris[,5],km3$cluster)
##             
##               1  2  3
##   setosa      0  0 50
##   versicolor 11 39  0
##   virginica  36 14  0

…we see that kmeans has been able to allocate all of the setosa species into cluster 2, but the versicolour and virginica species are distributed between clusters 1 & 3.

Let’s increase the nstart value by a factor of 10 to see if the result improves:

km.opt <- kmeans(df,3,iter.max = 10, nstart = 100)

table(iris[,5],km.opt$cluster)
##             
##               1  2  3
##   setosa      0  0 50
##   versicolor 11 39  0
##   virginica  36 14  0

Although the cluster numbers have changed, the result is equivalent to that already obtained with nstart set to 10.

To see the list of clusters to which each point is allocated:

km.opt$cluster
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2
##  [75] 2 1 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
## [112] 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 1 1 2 1
## [149] 1 2

To see positions of the centroids of the three clusters:

km.opt$centers
##   Sepal.Length Sepal.Width Petal.Length Petal.Width
## 1   1.13217737  0.08812645    0.9928284   1.0141287
## 2  -0.05005221 -0.88042696    0.3465767   0.2805873
## 3  -1.01119138  0.85041372   -1.3006301  -1.2507035

To see the within cluster sum of squares:

km.opt$withinss
## [1] 47.45019 44.08754 47.35062

We could then perform sum(km.opt$withinss) or find the total directly as below, to see the result of the minimisation |X - U Xbar|^2:

km.opt$tot.withinss
## [1] 138.8884

To see the number of points in each cluster:

km.opt$size
## [1] 47 53 50

The clusters are similar in size, although “perfection” would be three clusters of 50.

To construct the matrix “U”, first we set up a null 150 x 3 matrix:

U <- matrix(NA,150,3)

Then we set up an identity diagonal 3 x 3 matrix “I”:

I <- diag(3)

Then we transform the cluster data here:

km.opt$cluster
##   [1] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [38] 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 2 2 2
##  [75] 2 1 1 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1
## [112] 1 1 2 2 1 1 1 1 2 1 2 1 2 1 1 2 1 1 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 1 1 2 1
## [149] 1 2

into a 150 x 3 matrix U (here we only view the first 8 rows):

U <- I[km.opt$cluster,]
U[1:8,1:3]
##      [,1] [,2] [,3]
## [1,]    0    0    1
## [2,]    0    0    1
## [3,]    0    0    1
## [4,]    0    0    1
## [5,]    0    0    1
## [6,]    0    0    1
## [7,]    0    0    1
## [8,]    0    0    1

Compare with the tabled totals:

table(iris[, 5],km.opt$cluster)
##             
##               1  2  3
##   setosa      0  0 50
##   versicolor 11 39  0
##   virginica  36 14  0

We can now plot the scaled datapoints (1st and 2nd columns, sepal width against sepal length only). Now we add on the centroids: The function points plots coordinates onto our plot cex controls the font size pch controls the character to use (4 = X)

plot(df[,1:2],
     col=km.opt$cluster,
     main= "K-means with 3 clusters")
points(km.opt$centers[,1:2],
        col= 1:3,
        pch = 4,
        cex = 2)

To view the positions of the centres (sepal width and sepal length only):

km.opt$centers[,1:2]
##   Sepal.Length Sepal.Width
## 1   1.13217737  0.08812645
## 2  -0.05005221 -0.88042696
## 3  -1.01119138  0.85041372

We can make a cluster plot relative to the first two principal components, using the function fviz_cluster, which takes arguments kmeans object, data:

library(factoextra)
fviz_cluster(km.opt,df)

This plot allows us to avoid having to pick two of our four variables (sepal length, sepal width, petal length, petal width) for a 2D scatterplot, so that we can instead see the clusters relative to the first two principal components and see how much of the variability is accounted for. In this case, the division between the three different clusters is clearer.

For comparision, we can try specifying 2 clusters:

km.opt2 <- kmeans(df, 2, iter.max = 10, nstart = 100)
fviz_cluster(km.opt2, df)

table(km.opt2$cluster, iris[, 5])
##    
##     setosa versicolor virginica
##   1      0         50        50
##   2     50          0         0

We can see that the 2nd and 3rd clusters of the initial kmeans calculation have now been joined into one cluster.

To understand the optimal number of clusters to be used, we use the function fviz_nbclust, which takes the arguments data frame, partitioning function, method. “wss” refers to the total within sum of squares :

fviz_nbclust(df, kmeans, method = "wss")

We can see that the optimal number of clusters is 3 (after this, little is to be gained by adding further clusters). See the ‘elbow’ at 3 clusters.

library(cluster)
gap_stat <- clusGap(df,
                    FUN = kmeans,
                    nstart = 25,
                    K.max = 10,
                    B = 50)

Now we can plot the gap statistic against the number of clusters.

fviz_gap_stat(gap_stat) 

As 3 clusters is the point at which the graph rises then falls, we again find the optimal number of clusters is 3.