Clustering analysis

data("movielens", package = "dslabs")

library(tidyverse)
library(rgl) 
library(dslabs)   
library(factoextra)
 
x <- iris[,2:4]
rownames(x)<- 1:150
plot3d(x,col = as.numeric(iris$Species),size = 8) 

K-means

set.seed(123)
kmres <- kmeans(x,3,iter.max = 10,nstart=1) 
print(kmres)
## K-means clustering with 3 clusters of sizes 50, 53, 47
## 
## Cluster means:
##   Sepal.Width Petal.Length Petal.Width
## 1    3.428000     1.462000    0.246000
## 2    2.754717     4.281132    1.350943
## 3    3.004255     5.610638    2.042553
## 
## Clustering vector:
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19  20 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40 
##   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1 
##  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60 
##   1   1   1   1   1   1   1   1   1   1   2   2   2   2   2   2   2   2   2   2 
##  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80 
##   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   3   2   2 
##  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99 100 
##   2   2   2   3   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2   2 
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 
##   3   3   3   3   3   3   2   3   3   3   3   3   3   3   3   3   3   3   3   2 
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 
##   3   3   3   2   3   3   2   3   3   3   3   3   3   3   3   3   3   3   2   3 
## 141 142 143 144 145 146 147 148 149 150 
##   3   3   3   3   3   3   3   3   3   3 
## 
## Within cluster sum of squares by cluster:
## [1]  9.06280 18.86491 19.93872
##  (between_SS / total_SS =  91.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

It is possible to compute the mean of each variable by cluster using the original data:

centroides <- aggregate(x,by=list(cluster=kmres$cluster),mean)
print(centroides)
##   cluster Sepal.Width Petal.Length Petal.Width
## 1       1    3.428000     1.462000    0.246000
## 2       2    2.754717     4.281132    1.350943
## 3       3    3.004255     5.610638    2.042553
dd <- cbind(iris,cluster=kmres$cluster)
head(dd)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species cluster
## 1          5.1         3.5          1.4         0.2  setosa       1
## 2          4.9         3.0          1.4         0.2  setosa       1
## 3          4.7         3.2          1.3         0.2  setosa       1
## 4          4.6         3.1          1.5         0.2  setosa       1
## 5          5.0         3.6          1.4         0.2  setosa       1
## 6          5.4         3.9          1.7         0.4  setosa       1

Visualizing results

fviz_cluster(kmres,x,ellipse.type="norm",geom="point")

Hierarchical clustering

Given the distance between each pair of observations, we need an algorithm to define groups from these distances. Hierarchical clustering starts by treating each observation as its own group. It then repeatedly joins the two closest groups until a single group contains all the observations.
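The joining procedure described above can be sketched with base R alone. This is a minimal illustration, not necessarily what hcut does internally: the Euclidean metric and the "ward.D2" linkage are assumptions chosen for the example.

```r
# Sketch: agglomerative clustering using only the stats package.
x <- iris[, 2:4]                      # same three columns as above
d <- dist(scale(x))                   # pairwise Euclidean distances on standardized data
hc <- hclust(d, method = "ward.D2")   # assumption: Ward linkage

# Each row of hc$merge records one join of two groups, so n observations
# need n - 1 joins to end up in a single group.
nrow(hc$merge)

# hc$height holds the distance at which each join happened.
head(hc$height)
```

Starting from 150 single-observation groups, the merge table has 149 rows, one per join.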

 hres <- hcut(scale(x),k=3)

We can see the resulting groups using a dendrogram.

fviz_dend(hres, show_labels = FALSE, rect = TRUE) # Visualize dendrogram

  • In the dendrogram, two groups are joined at a certain height; this height is the distance between the two groups.
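The heights read off a dendrogram can also be extracted numerically. The sketch below uses cophenetic from the stats package on a tree built with hclust; the tree is an illustrative stand-in (Euclidean distance and "ward.D2" linkage are assumptions) rather than the hres object itself.

```r
# Sketch: the height at which two observations first share a group.
x <- iris[, 2:4]
hc <- hclust(dist(scale(x)), method = "ward.D2")  # assumption: Ward linkage

# cophenetic() returns, for every pair, the height of the join
# that first puts the two observations in the same group.
coph <- as.matrix(cophenetic(hc))

coph["1", "2"]     # observations 1 and 2
coph["1", "150"]   # observations 1 and 150

# The largest such height is the final join at the top of the tree.
all.equal(max(coph), max(hc$height))
```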

To generate actual groups we can do one of two things:

  • decide on a minimum distance needed for observations to be in the same group
  • decide on the number of groups you want and then find the minimum distance that achieves this.

The function cutree can be applied to the output of hclust to perform either of these two operations and generate groups.
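Both options can be tried with cutree on an hclust tree. In this sketch the tree is rebuilt with base R (Euclidean distance and "ward.D2" linkage are assumptions), and h_cut is a hypothetical cut height chosen to lie between the last two joins that matter for three groups.

```r
# Sketch: cutting a tree by number of groups (k) or by height (h).
x <- iris[, 2:4]
hc <- hclust(dist(scale(x)), method = "ward.D2")  # assumption: Ward linkage

# Option 1: ask for a fixed number of groups.
groups_k <- cutree(hc, k = 3)

# Option 2: cut at a height. Here h_cut sits midway between the join that
# reduces 4 groups to 3 and the join that reduces 3 groups to 2.
h_cut <- mean(rev(hc$height)[2:3])
groups_h <- cutree(hc, h = h_cut)

table(groups_k)
length(unique(groups_h))
```

Any height between those two joins produces the same three groups, which is why fixing k and fixing h are two views of the same cut.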

fviz_cluster(hres, ellipse.type = "convex")