K-means is a simple unsupervised machine learning algorithm that groups similar data points together. The algorithm places k centroids (cluster centers) and assigns every data point to the nearest centroid, forming k clusters. In this example the Ruspini data set, which can be found in the cluster package, is used for K-means clustering.
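
The examples below assume the following packages are installed: cluster (which ships the ruspini data set), Hmisc (for describe()), and funModeling (for plot_num()).

# Load the packages used throughout this example
library(cluster)      # ruspini data set
library(Hmisc)        # describe()
library(funModeling)  # plot_num()
data(ruspini)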

Ruspini Data Exploration

In this step the data is explored: the structure and summary statistics of the Ruspini data are displayed, and a histogram shows the distribution of each numerical variable in the data set.

print(ruspini)
##      x   y
## 1    4  53
## 2    5  63
## 3   10  59
## 4    9  77
## 5   13  49
## 6   13  69
## 7   12  88
## 8   15  75
## 9   18  61
## 10  19  65
## 11  22  74
## 12  27  72
## 13  28  76
## 14  24  58
## 15  27  55
## 16  28  60
## 17  30  52
## 18  31  60
## 19  32  61
## 20  36  72
## 21  28 147
## 22  32 149
## 23  35 153
## 24  33 154
## 25  38 151
## 26  41 150
## 27  38 145
## 28  38 143
## 29  32 143
## 30  34 141
## 31  44 156
## 32  44 149
## 33  44 143
## 34  46 142
## 35  47 149
## 36  49 152
## 37  50 142
## 38  53 144
## 39  52 152
## 40  55 155
## 41  54 124
## 42  60 136
## 43  63 139
## 44  86 132
## 45  85 115
## 46  85  96
## 47  78  94
## 48  74  96
## 49  97 122
## 50  98 116
## 51  98 124
## 52  99 119
## 53  99 128
## 54 101 115
## 55 108 111
## 56 110 111
## 57 108 116
## 58 111 126
## 59 115 117
## 60 117 115
## 61  70   4
## 62  77  12
## 63  83  21
## 64  61  15
## 65  69  15
## 66  78  16
## 67  66  18
## 68  58  13
## 69  64  20
## 70  69  21
## 71  66  23
## 72  61  25
## 73  76  27
## 74  72  31
## 75  64  30
str(ruspini)
## 'data.frame':    75 obs. of  2 variables:
##  $ x: int  4 5 10 9 13 13 12 15 18 19 ...
##  $ y: int  53 63 59 77 49 69 88 75 61 65 ...
describe(ruspini)
## ruspini 
## 
##  2  Variables      75  Observations
## --------------------------------------------------------------------------------
## x 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       75        0       56        1    54.88     35.1     11.4     16.2 
##      .25      .50      .75      .90      .95 
##     31.5     52.0     76.5     99.0    108.6 
## 
## lowest :   4   5   9  10  12, highest: 108 110 111 115 117
## --------------------------------------------------------------------------------
## y 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##       75        0       58        1    92.03    55.84     15.0     20.4 
##      .25      .50      .75      .90      .95 
##     56.5     96.0    141.5    149.6    152.3 
## 
## lowest :   4  12  13  15  16, highest: 152 153 154 155 156
## --------------------------------------------------------------------------------
plot_num(ruspini)

Determining the Number of Clusters with the Elbow Method

# Compute the total within-cluster sum of squares for k = 1 to 15
wss <- numeric(15)
for (k in 1:15) {
  wss[k] <- sum(kmeans(ruspini, k, nstart = 25)$withinss)
}
print(wss)
##  [1] 244373.867  89337.832  51063.475  12881.051  10126.720   8575.407
##  [7]   7126.199   6149.639   5653.427   4446.282   3897.252   3556.985
## [13]   3386.115   2939.035   2591.425
plot(1:15, wss, type = "b", xlab = "Number of Clusters", ylab = "Within Sum of Squares")

There are many ways to determine the value of k. One simple approach is trial and error until a satisfactory clustering is achieved; in this example, however, the elbow method is used. The elbow method looks at the within sum of squares (WSS), which is the sum of the squared distances between each point and the centroid of its cluster. The plot shows that for any k larger than 4 the decrease in WSS is marginal, meaning additional clusters bring no significant improvement. The number of clusters to be used with the Ruspini data will therefore be 4.
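
To make the WSS definition concrete, the short sketch below recomputes the total within sum of squares by hand for k = 4 and compares it with the value reported by kmeans(); the object names km and manual_wss are illustrative only.

# Sketch: recompute the total within-cluster sum of squares by hand
km <- kmeans(ruspini, 4, nstart = 25)
manual_wss <- sum(sapply(1:4, function(i) {
  pts <- as.matrix(ruspini[km$cluster == i, ])  # points assigned to cluster i
  ctr <- km$centers[i, ]                        # centroid of cluster i
  sum(sweep(pts, 2, ctr)^2)                     # squared distances to the centroid
}))
all.equal(manual_wss, km$tot.withinss)          # should be TRUE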

K-Means Analysis

ruspini_km <- kmeans(ruspini, 4, nstart = 25)
ruspini_km
## K-means clustering with 4 clusters of sizes 17, 20, 23, 15
## 
## Cluster means:
##          x        y
## 1 98.17647 114.8824
## 2 20.15000  64.9500
## 3 43.91304 146.0435
## 4 68.93333  19.4000
## 
## Clustering vector:
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
##  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  3  3  3  3  3  3 
## 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 
##  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  1  1  1  1  1  1  1  1  1 
## 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 
##  1  1  1  1  1  1  1  1  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4 
## 
## Within cluster sum of squares by cluster:
## [1] 4558.235 3689.500 3176.783 1456.533
##  (between_SS / total_SS =  94.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"
plot(ruspini, col=ruspini_km$cluster)
points(ruspini_km$centers, col = 1:4, pch = 8, cex = 1)

The results of the clustering are displayed in the scatter plot. With nstart = 25, the algorithm tries 25 random sets of initial centroids and keeps the solution with the lowest total within-cluster sum of squares. Each cluster center is marked by an asterisk on the scatter plot. Four clusters were produced, of sizes 17, 20, 23, and 15.
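
As a complementary check on the choice of k = 4, one could compute the average silhouette width using silhouette() from the cluster package (already loaded above). This is only a sketch, not part of the original analysis; values close to 1 indicate compact, well-separated clusters.

# Sketch: silhouette analysis for the k = 4 solution
sil <- silhouette(ruspini_km$cluster, dist(ruspini))
mean(sil[, "sil_width"])    # average silhouette width across all points
plot(sil, col = 1:4)        # silhouette plot, one bar per point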