library(tidyverse)
library(rgl)
library(dslabs)
library(factoextra)
data("movielens") # load the dataset only after dslabs is attached
x <- iris[,2:4]
rownames(x)<- 1:150
plot3d(x,col = as.numeric(iris$Species),size = 8)
set.seed(123)
kmres <- kmeans(x,3,iter.max = 10,nstart=1)
print(kmres)
## K-means clustering with 3 clusters of sizes 50, 53, 47
##
## Cluster means:
## Sepal.Width Petal.Length Petal.Width
## 1 3.428000 1.462000 0.246000
## 2 2.754717 4.281132 1.350943
## 3 3.004255 5.610638 2.042553
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60
## 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
## 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2
## 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
## 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120
## 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 2
## 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140
## 3 3 3 2 3 3 2 3 3 3 3 3 3 3 3 3 3 3 2 3
## 141 142 143 144 145 146 147 148 149 150
## 3 3 3 3 3 3 3 3 3 3
##
## Within cluster sum of squares by cluster:
## [1] 9.06280 18.86491 19.93872
## (between_SS / total_SS = 91.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
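The percentage reported as `between_SS / total_SS` can be recomputed from the components listed above. The following is a minimal sketch that refits the same model so it runs on its own; the variable name `ratio` is introduced here for illustration only.

```r
# Refit the k-means model from the text so this block is self-contained
x <- iris[, 2:4]
set.seed(123)
kmres <- kmeans(x, 3, iter.max = 10, nstart = 1)

# The total sum of squares splits into a within-cluster and a between-cluster part
decomposition_holds <- isTRUE(
  all.equal(kmres$totss, kmres$tot.withinss + kmres$betweenss)
)
print(decomposition_holds)

# Share of the total variation explained by the cluster assignment, in percent
ratio <- kmres$betweenss / kmres$totss
print(round(100 * ratio, 1))
```

A ratio close to 1 means most of the variation lies between clusters rather than inside them. It always grows as the number of clusters grows, so on its own it does not tell us which number of clusters to choose.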
We can also compute the mean of each variable by cluster from the original data:
centroides <- aggregate(x,by=list(cluster=kmres$cluster),mean)
print(centroides)
## cluster Sepal.Width Petal.Length Petal.Width
## 1 1 3.428000 1.462000 0.246000
## 2 2 2.754717 4.281132 1.350943
## 3 3 3.004255 5.610638 2.042553
dd <- cbind(iris,cluster=kmres$cluster)
head(dd)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species cluster
## 1 5.1 3.5 1.4 0.2 setosa 1
## 2 4.9 3.0 1.4 0.2 setosa 1
## 3 4.7 3.2 1.3 0.2 setosa 1
## 4 4.6 3.1 1.5 0.2 setosa 1
## 5 5.0 3.6 1.4 0.2 setosa 1
## 6 5.4 3.9 1.7 0.4 setosa 1
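Since `iris` also carries the true species, one way to check the clusters is to cross-tabulate them against `Species`. This is a sketch under the same settings as above; the name `confusion` is an illustrative choice, and the cluster numbers are arbitrary labels, so only the pattern of counts matters.

```r
# Refit the k-means model from the text so this block is self-contained
x <- iris[, 2:4]
set.seed(123)
kmres <- kmeans(x, 3, iter.max = 10, nstart = 1)

# Rows: true species; columns: cluster label assigned by k-means
confusion <- table(Species = iris$Species, cluster = kmres$cluster)
print(confusion)
```

A species concentrated in a single column was recovered cleanly; a species spread over two columns overlaps with its neighbour in the three measured variables.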
Visualizing results
fviz_cluster(kmres,x,ellipse.type="norm",geom="point")
Once the distance between each pair of observations is computed, we need an algorithm to define groups from those distances. Hierarchical clustering starts by treating each observation as its own group; it then repeatedly joins the two closest groups until a single group contains all the observations.
hres <- hcut(scale(x),k=3)
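To make the steps behind `hcut` explicit, here is a rough base-R sketch of the same idea: compute pairwise distances, then merge groups agglomeratively. The Euclidean metric and the default (complete) linkage of `hclust` are assumptions made for illustration and need not match the settings `hcut` uses internally.

```r
# Standardize the variables so that no single one dominates the distance
x <- iris[, 2:4]
d <- dist(scale(x)) # Euclidean distance between every pair of rows

# Agglomerative clustering; the default linkage of hclust is "complete"
hc <- hclust(d)

# Each row of merge is one join step; negative entries are single observations,
# positive entries refer to groups formed at an earlier step
print(head(hc$merge, 3))

# Distance at which each join happened
print(head(hc$height, 3))
```

With 150 observations there are 149 join steps, the last of which produces the single group containing everything.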
We can see the resulting groups using a dendrogram.
fviz_dend(hres, show_labels = FALSE, rect = TRUE) # Visualize the dendrogram
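To generate actual groups we can do one of two things: cut the tree at a chosen height, or ask for a fixed number of groups.
The function cutree can be applied to the output of hclust to perform either of these two operations, through its h and k arguments respectively, and generate groups.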
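As a sketch of both ways of calling `cutree`, the block below cuts one tree by number of groups (`k`) and by height (`h`). The tree is rebuilt with base `hclust` so the block runs on its own, and the height of 3 is an arbitrary value chosen only for illustration.

```r
# Build a tree on the standardized iris measurements
x <- iris[, 2:4]
hc <- hclust(dist(scale(x)))

# Option 1: request a fixed number of groups
groups_k <- cutree(hc, k = 3)
print(table(groups_k))

# Option 2: cut at a fixed height; the number of groups follows from the tree
groups_h <- cutree(hc, h = 3) # h = 3 is an illustrative threshold
print(length(unique(groups_h)))
```

Cutting by `k` is convenient when the number of groups is known in advance, while cutting by `h` states how dissimilar two observations may be and still share a group.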
fviz_cluster(hres, ellipse.type = "convex")