K-means is a simple unsupervised machine learning algorithm used to group similar data points together. It places k centroids (cluster centers), assigns every data point to its nearest centroid, then updates the centroids and repeats until the assignments stabilize. In this example, k-means clustering is applied to the Ruspini data set, which ships with the cluster package.
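To make the assign/update cycle concrete, here is a minimal, illustrative one-step sketch on toy data. This is not the algorithm behind R's kmeans() (which defaults to optimized Hartigan-Wong C code); it only shows the basic mechanics.

# One k-means iteration: assign each point to the nearest centroid,
# then recompute each centroid as the mean of its assigned points.
# (A real implementation must also handle empty clusters.)
one_kmeans_step <- function(X, centers) {
  # Squared Euclidean distance from every point to every centroid
  d2 <- sapply(seq_len(nrow(centers)), function(j)
    rowSums(sweep(X, 2, centers[j, ])^2))
  assignment <- max.col(-d2)  # index of the nearest centroid per point
  new_centers <- t(sapply(seq_len(nrow(centers)), function(j)
    colMeans(X[assignment == j, , drop = FALSE])))
  list(cluster = assignment, centers = new_centers)
}

set.seed(42)
X <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
           matrix(rnorm(20, mean = 5), ncol = 2))   # two obvious groups
step <- one_kmeans_step(X, X[sample(nrow(X), 2), ]) # 2 random start centroids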
In this step the data is explored. The structure and summary statistics of the Ruspini data are displayed below, and histograms show the distribution of each numeric variable in the data set.
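The helper functions used below come from add-on packages: describe() is provided by Hmisc and plot_num() by funModeling, while the cluster package supplies the data. A minimal setup sketch, assuming these packages are installed:

library(cluster)      # supplies the ruspini data set
library(Hmisc)        # supplies describe()
library(funModeling)  # supplies plot_num()
data(ruspini)         # load the data from the cluster package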
print(ruspini)
## x y
## 1 4 53
## 2 5 63
## 3 10 59
## 4 9 77
## 5 13 49
## 6 13 69
## 7 12 88
## 8 15 75
## 9 18 61
## 10 19 65
## 11 22 74
## 12 27 72
## 13 28 76
## 14 24 58
## 15 27 55
## 16 28 60
## 17 30 52
## 18 31 60
## 19 32 61
## 20 36 72
## 21 28 147
## 22 32 149
## 23 35 153
## 24 33 154
## 25 38 151
## 26 41 150
## 27 38 145
## 28 38 143
## 29 32 143
## 30 34 141
## 31 44 156
## 32 44 149
## 33 44 143
## 34 46 142
## 35 47 149
## 36 49 152
## 37 50 142
## 38 53 144
## 39 52 152
## 40 55 155
## 41 54 124
## 42 60 136
## 43 63 139
## 44 86 132
## 45 85 115
## 46 85 96
## 47 78 94
## 48 74 96
## 49 97 122
## 50 98 116
## 51 98 124
## 52 99 119
## 53 99 128
## 54 101 115
## 55 108 111
## 56 110 111
## 57 108 116
## 58 111 126
## 59 115 117
## 60 117 115
## 61 70 4
## 62 77 12
## 63 83 21
## 64 61 15
## 65 69 15
## 66 78 16
## 67 66 18
## 68 58 13
## 69 64 20
## 70 69 21
## 71 66 23
## 72 61 25
## 73 76 27
## 74 72 31
## 75 64 30
str(ruspini)
## 'data.frame': 75 obs. of 2 variables:
## $ x: int 4 5 10 9 13 13 12 15 18 19 ...
## $ y: int 53 63 59 77 49 69 88 75 61 65 ...
describe(ruspini)
## ruspini
##
## 2 Variables 75 Observations
## --------------------------------------------------------------------------------
## x
## n missing distinct Info Mean Gmd .05 .10
## 75 0 56 1 54.88 35.1 11.4 16.2
## .25 .50 .75 .90 .95
## 31.5 52.0 76.5 99.0 108.6
##
## lowest : 4 5 9 10 12, highest: 108 110 111 115 117
## --------------------------------------------------------------------------------
## y
## n missing distinct Info Mean Gmd .05 .10
## 75 0 58 1 92.03 55.84 15.0 20.4
## .25 .50 .75 .90 .95
## 56.5 96.0 141.5 149.6 152.3
##
## lowest : 4 12 13 15 16, highest: 152 153 154 155 156
## --------------------------------------------------------------------------------
plot_num(ruspini)
# Total within-cluster sum of squares for each candidate k = 1..15
wss <- numeric(15)
for (k in 1:15) {
  wss[k] <- sum(kmeans(ruspini, k, nstart = 25)$withinss)
}
print(wss)
## [1] 244373.867 89337.832 51063.475 12881.051 10126.720 8575.407
## [7] 7126.199 6149.639 5653.427 4446.282 3897.252 3556.985
## [13] 3386.115 2939.035 2591.425
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within Sum of Squares")
There are many ways to determine the value of k. One simple approach is trial and error until a desired clustering is achieved; in this example, however, the elbow method is used. The elbow method looks at the within-cluster sum of squares (WSS): the sum of squared distances between each point and the centroid of its cluster. The plot shows that for k values larger than 4 the decrease in WSS is minimal, meaning additional clusters yield no significant improvement. The number of clusters for the Ruspini data is therefore set to 4.
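The same elbow curve can also be produced in a single call with the factoextra package; this is an optional alternative to the manual loop above, assuming factoextra is installed:

# Optional: elbow plot via factoextra; nstart is passed through to kmeans()
library(factoextra)
fviz_nbclust(ruspini, kmeans, method = "wss", k.max = 15, nstart = 25)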
ruspini_km <- kmeans(ruspini, 4, nstart = 25)
ruspini_km
## K-means clustering with 4 clusters of sizes 17, 20, 23, 15
##
## Cluster means:
## x y
## 1 98.17647 114.8824
## 2 20.15000 64.9500
## 3 43.91304 146.0435
## 4 68.93333 19.4000
##
## Clustering vector:
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
## 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52
## 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 1 1 1 1 1 1 1 1 1
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75
## 1 1 1 1 1 1 1 1 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4
##
## Within cluster sum of squares by cluster:
## [1] 4558.235 3689.500 3176.783 1456.533
## (between_SS / total_SS = 94.7 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss" "tot.withinss"
## [6] "betweenss" "size" "iter" "ifault"
# Color the points by cluster assignment and mark each centroid with an asterisk
plot(ruspini, col = ruspini_km$cluster)
points(ruspini_km$centers, col = 1:4, pch = 8, cex = 1)
The results of the clustering are displayed in the scatter plot. With nstart = 25, the algorithm starts from 25 random sets of initial centroids and keeps the run with the lowest total within-cluster sum of squares. Each center is marked with an asterisk on the scatter plot. Four clusters were produced, of sizes 17, 20, 23, and 15.
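As an optional quality check not part of the original walkthrough, silhouette widths (from the already-loaded cluster package) indicate how well each point sits in its assigned cluster; an average width near 1 suggests compact, well-separated clusters:

# Optional: silhouette analysis as a cluster-quality check
sil <- silhouette(ruspini_km$cluster, dist(ruspini))
summary(sil)$avg.width  # overall average silhouette width
plot(sil, col = 1:4)    # per-point widths, one color per cluster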