K-Means Clustering

The function kmeans() performs K-means clustering in R. We begin with a simple simulated example in which there truly are two clusters in the data: the first 25 observations have a mean shift relative to the next 25 observations.

set.seed(2)
x=matrix(rnorm (50*2), ncol=2)
x[1:25,1]=x[1:25,1]+3
x[1:25,2]=x[1:25,2]-4

We now perform K-means clustering with K = 2.

km.out=kmeans (x,2, nstart =20)

The cluster assignments of the 50 observations are contained in km.out$cluster.

km.out$cluster
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[50] 2

The K-means clustering perfectly separated the observations into two clusters even though we did not supply any group information to kmeans(). We can plot the data, with each observation colored according to its cluster assignment.

plot(x, col=(km.out$cluster +1), main="K-Means Clustering Results with K=2", xlab="", ylab="", pch=20, cex=2)

Here the observations can be easily plotted because they are two-dimensional. If there were more than two variables then we could instead perform PCA and plot the first two principal components score vectors.

In this example, we knew that there really were two clusters because we generated the data. However, for real data, in general we do not know the true number of clusters. We could instead have performed K-means clustering on this example with K = 3.

set.seed(4)
km.out=kmeans(x,3, nstart =20)
km.out
K-means clustering with 3 clusters of sizes 17, 23, 10

Cluster means:
        [,1]        [,2]
1  3.7789567 -4.56200798
2 -0.3820397 -0.08740753
3  2.3001545 -2.69622023

Clustering vector:
 [1] 1 3 1 3 1 1 1 3 1 3 1 3 1 3 1 3 1 1 1 1 1 3 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 2 2
[50] 2

Within cluster sum of squares by cluster:
[1] 25.74089 52.67700 19.56137
 (between_SS / total_SS =  79.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss" "betweenss"   
[7] "size"         "iter"         "ifault"      
plot(x, col=(km.out$cluster +1), main="K-Means Clustering Results with K=3", xlab="", ylab="", pch=20, cex=2)

When K = 3, K-means clustering splits up the two clusters.

To run the kmeans() function in R with multiple initial cluster assignments, we use the nstart argument. If a value of nstart greater than one is used, then K-means clustering will be performed using multiple random assignments in Step 1 of Algorithm 10.1, and the kmeans() function will report only the best results. Here we compare using nstart=1 to nstart=20.

set.seed(3)
km.out=kmeans(x,3, nstart =1)
km.out$tot.withinss
[1] 97.97927
km.out=kmeans(x,3, nstart =20)
km.out$tot.withinss
[1] 97.97927

Note that km.out$tot.withinss is the total within-cluster sum of squares, which we seek to minimize by performing K-means clustering (Equation 10.11). The individual within-cluster sum-of-squares are contained in the vector km.out$withinss.

We strongly recommend always running K-means clustering with a large value of nstart, such as 20 or 50, since otherwise an undesirable local optimum may be obtained.

When performing K-means clustering, in addition to using multiple initial cluster assignments, it is also important to set a random seed using the set.seed() function. This way, the initial cluster assignments in Step 1 can be replicated, and the K-means output will be fully reproducible.

Hierarchical Clustering

The hclust() function implements hierarchical clustering in R. In the following example we use the data from Section 10.5.1 to plot the hierarchical clustering dendrogram using complete, single, and average linkage clustering, with Euclidean distance as the dissimilarity measure. We begin by clustering observations using complete linkage. The dist() function is used to compute the 50 ? 50 inter-observation Euclidean distance matrix.

set.seed(2)
x=matrix(rnorm (50*2), ncol=2)
x[1:25,1]=x[1:25,1]+3
x[1:25,2]=x[1:25,2]-4
hc.complete =hclust(dist(x), method="complete")

We could just as easily perform hierarchical clustering with average or single linkage instead:

hc.average =hclust(dist(x), method ="average")
hc.single=hclust(dist(x), method ="single")

We can now plot the dendrograms obtained using the usual plot() function. The numbers at the bottom of the plot identify each observation.

par(mfrow=c(1,3))
plot(hc.complete ,main="Complete Linkage ", xlab="", sub="",cex=.9)
plot(hc.average , main="Average Linkage", xlab="", sub="",cex=.9)
plot(hc.single , main="Single Linkage ", xlab="", sub="", cex=.9)

To determine the cluster labels for each observation associated with a given cut of the dendrogram, we can use the cutree() function:

cutree(hc.complete , 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[50] 2
cutree(hc.average , 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2
[50] 2
cutree(hc.single , 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[50] 1

For this data, complete and average linkage generally separate the observations into their correct groups. However, single linkage identifies one pointc as belonging to its own cluster. A more sensible answer is obtained when four clusters are selected, although there are still two singletons.

cutree(hc.single , 4)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 4 3 3 3 3 3 3 3
[50] 3

To scale the variables before performing hierarchical clustering of the observations, we use the scale() function:

xsc=scale(x)
plot(hclust(dist(xsc), method ="complete"), main=" Hierarchical Clustering with Scaled Features ")

LS0tDQp0aXRsZTogIkNsdXN0ZXJpbmciDQpvdXRwdXQ6IA0KICBodG1sX25vdGVib29rOg0KICAgIHRvYzogdHJ1ZQ0KICAgIHRvY19mbG9hdDogdHJ1ZQ0KLS0tDQoNCmBgYHtyIHNldHVwLCBpbmNsdWRlPUZBTFNFfQ0Ka25pdHI6Om9wdHNfY2h1bmskc2V0KG1lc3NhZ2U9RkFMU0Usd2FybmluZz1GQUxTRSkNCmBgYA0KDQojIyMgSy1NZWFucyBDbHVzdGVyaW5nDQpUaGUgZnVuY3Rpb24gYGttZWFucygpYCBwZXJmb3JtcyAqSyotbWVhbnMgY2x1c3RlcmluZyBpbiBgUmAuIFdlIGJlZ2luIHdpdGggYSBzaW1wbGUgc2ltdWxhdGVkIGV4YW1wbGUgaW4gd2hpY2ggdGhlcmUgdHJ1bHkgYXJlIHR3byBjbHVzdGVycyBpbiB0aGUgZGF0YTogdGhlIGZpcnN0IDI1IG9ic2VydmF0aW9ucyBoYXZlIGEgbWVhbiBzaGlmdCByZWxhdGl2ZSB0byB0aGUgbmV4dCAyNSBvYnNlcnZhdGlvbnMuDQoNCmBgYHtyfQ0Kc2V0LnNlZWQoMikNCng9bWF0cml4KHJub3JtICg1MCoyKSwgbmNvbD0yKQ0KeFsxOjI1LDFdPXhbMToyNSwxXSszDQp4WzE6MjUsMl09eFsxOjI1LDJdLTQNCmBgYA0KDQpXZSBub3cgcGVyZm9ybSBLLW1lYW5zIGNsdXN0ZXJpbmcgd2l0aCAqSyogPSAyLg0KDQpgYGB7cn0NCmttLm91dD1rbWVhbnMgKHgsMiwgbnN0YXJ0ID0yMCkNCmBgYA0KDQpUaGUgY2x1c3RlciBhc3NpZ25tZW50cyBvZiB0aGUgNTAgb2JzZXJ2YXRpb25zIGFyZSBjb250YWluZWQgaW4gYGttLm91dCRjbHVzdGVyYC4NCg0KYGBge3J9DQprbS5vdXQkY2x1c3Rlcg0KYGBgDQoNClRoZSAqSyotbWVhbnMgY2x1c3RlcmluZyBwZXJmZWN0bHkgc2VwYXJhdGVkIHRoZSBvYnNlcnZhdGlvbnMgaW50byB0d28gY2x1c3RlcnMgZXZlbiB0aG91Z2ggd2UgZGlkIG5vdCBzdXBwbHkgYW55IGdyb3VwIGluZm9ybWF0aW9uIHRvIGBrbWVhbnMoKWAuIFdlIGNhbiBwbG90IHRoZSBkYXRhLCB3aXRoIGVhY2ggb2JzZXJ2YXRpb24gY29sb3JlZCBhY2NvcmRpbmcgdG8gaXRzIGNsdXN0ZXIgYXNzaWdubWVudC4NCg0KDQpgYGB7cn0NCnBsb3QoeCwgY29sPShrbS5vdXQkY2x1c3RlciArMSksIG1haW49IkstTWVhbnMgQ2x1c3RlcmluZyBSZXN1bHRzIHdpdGggSz0yIiwgeGxhYj0iIiwgeWxhYj0iIiwgcGNoPTIwLCBjZXg9MikNCmBgYA0KDQpIZXJlIHRoZSBvYnNlcnZhdGlvbnMgY2FuIGJlIGVhc2lseSBwbG90dGVkIGJlY2F1c2UgdGhleSBhcmUgdHdvLWRpbWVuc2lvbmFsLiBJZiB0aGVyZSB3ZXJlIG1vcmUgdGhhbiB0d28gdmFyaWFibGVzIHRoZW4gd2UgY291bGQgaW5zdGVhZCBwZXJmb3JtIFBDQSBhbmQgcGxvdCB0aGUgZmlyc3QgdHdvIHByaW5jaXBhbCBjb21wb25lbnRzIHNjb3JlIHZlY3RvcnMuDQoNCkluIHRoaXMgZXhhbXBsZSwgd2Uga25ldyB0aGF0IHRoZXJlIHJlYWxseSB3ZXJlIHR3byBjbHVzdGVycyBiZWNhdXNlIHdlIGdlbmVyYXRlZCB0aGUgZGF0YS4gSG93ZXZlciwgZm9yIHJlYWwgZGF0YSwgaW4gZ2VuZXJhbCB3ZSBkbyBub3Qga25vdyB0aGUgdHJ1ZSBudW1iZXIgb2YgY2x1c3RlcnMuIFdlIGNvdWxkIGluc3RlYWQgaGF2ZSBwZXJmb3JtZWQgSy1tZWFucyBjbHVzdGVyaW5nIG9uIHRoaXMgZXhhbXBsZSB3aXRoICpLKiA9IDMuDQoNCmBgYHtyfQ0Kc2V0LnNlZWQoNCkNCmttLm91dD1rbWVhbnMoeCwzLCBuc3RhcnQgPTIwKQ0Ka20ub3V0DQpwbG90KHgsIGNvbD0oa20ub3V0JGNsdXN0ZXIgKzEpLCBtYWluPSJLLU1lYW5zIENsdXN0ZXJpbmcgUmVzdWx0cyB3aXRoIEs9MyIsIHhsYWI9IiIsIHlsYWI9IiIsIHBjaD0yMCwgY2V4PTIpDQpgYGANCg0KV2hlbiAqSyogPSAzLCAqSyotbWVhbnMgY2x1c3RlcmluZyBzcGxpdHMgdXAgdGhlIHR3byBjbHVzdGVycy4gDQoNClRvIHJ1biB0aGUgYGttZWFucygpYCBmdW5jdGlvbiBpbiBgUmAgd2l0aCBtdWx0aXBsZSBpbml0aWFsIGNsdXN0ZXIgYXNzaWdubWVudHMsIHdlIHVzZSB0aGUgYG5zdGFydGAgYXJndW1lbnQuIElmIGEgdmFsdWUgb2YgYG5zdGFydGAgZ3JlYXRlciB0aGFuIG9uZSBpcyB1c2VkLCB0aGVuIEstbWVhbnMgY2x1c3RlcmluZyB3aWxsIGJlIHBlcmZvcm1lZCB1c2luZyBtdWx0aXBsZSByYW5kb20gYXNzaWdubWVudHMgaW4gU3RlcCAxIG9mIEFsZ29yaXRobSAxMC4xLCBhbmQgdGhlIGBrbWVhbnMoKWAgZnVuY3Rpb24gd2lsbCByZXBvcnQgb25seSB0aGUgYmVzdCByZXN1bHRzLiBIZXJlIHdlIGNvbXBhcmUgdXNpbmcgbnN0YXJ0PTEgdG8gbnN0YXJ0PTIwLg0KDQpgYGB7cn0NCnNldC5zZWVkKDMpDQprbS5vdXQ9a21lYW5zKHgsMywgbnN0YXJ0ID0xKQ0Ka20ub3V0JHRvdC53aXRoaW5zcw0Ka20ub3V0PWttZWFucyh4LDMsIG5zdGFydCA9MjApDQprbS5vdXQkdG90LndpdGhpbnNzDQpgYGANCg0KTm90ZSB0aGF0IGBrbS5vdXQkdG90LndpdGhpbnNzYCBpcyB0aGUgdG90YWwgd2l0aGluLWNsdXN0ZXIgc3VtIG9mIHNxdWFyZXMsIHdoaWNoIHdlIHNlZWsgdG8gbWluaW1pemUgYnkgcGVyZm9ybWluZyAqSyotbWVhbnMgY2x1c3RlcmluZyAoRXF1YXRpb24gMTAuMTEpLiBUaGUgaW5kaXZpZHVhbCB3aXRoaW4tY2x1c3RlciBzdW0tb2Ytc3F1YXJlcyBhcmUgY29udGFpbmVkIGluIHRoZSB2ZWN0b3IgYGttLm91dCR3aXRoaW5zc2AuDQoNCldlICpzdHJvbmdseSByZWNvbW1lbmQqIGFsd2F5cyBydW5uaW5nICpLKi1tZWFucyBjbHVzdGVyaW5nIHdpdGggYSBsYXJnZSB2YWx1ZSBvZiBgbnN0YXJ0YCwgc3VjaCBhcyAyMCBvciA1MCwgc2luY2Ugb3RoZXJ3aXNlIGFuIHVuZGVzaXJhYmxlIGxvY2FsIG9wdGltdW0gbWF5IGJlIG9idGFpbmVkLg0KDQpXaGVuIHBlcmZvcm1pbmcgKksqLW1lYW5zIGNsdXN0ZXJpbmcsIGluIGFkZGl0aW9uIHRvIHVzaW5nIG11bHRpcGxlIGluaXRpYWwgY2x1c3RlciBhc3NpZ25tZW50cywgaXQgaXMgYWxzbyBpbXBvcnRhbnQgdG8gc2V0IGEgcmFuZG9tIHNlZWQgdXNpbmcgdGhlIGBzZXQuc2VlZCgpYCBmdW5jdGlvbi4gVGhpcyB3YXksIHRoZSBpbml0aWFsIGNsdXN0ZXIgYXNzaWdubWVudHMgaW4gU3RlcCAxIGNhbiBiZSByZXBsaWNhdGVkLCBhbmQgdGhlICpLKi1tZWFucyBvdXRwdXQgd2lsbCBiZSBmdWxseSByZXByb2R1Y2libGUuDQoNCiMjIyBIaWVyYXJjaGljYWwgQ2x1c3RlcmluZw0KVGhlIGBoY2x1c3QoKWAgZnVuY3Rpb24gaW1wbGVtZW50cyBoaWVyYXJjaGljYWwgY2x1c3RlcmluZyBpbiBgUmAuIEluIHRoZSBmb2xsb3dpbmcgZXhhbXBsZSB3ZSB1c2UgdGhlIGRhdGEgZnJvbSBTZWN0aW9uIDEwLjUuMSB0byBwbG90IHRoZSBoaWVyYXJjaGljYWwgY2x1c3RlcmluZyBkZW5kcm9ncmFtIHVzaW5nIGNvbXBsZXRlLCBzaW5nbGUsIGFuZCBhdmVyYWdlIGxpbmthZ2UgY2x1c3RlcmluZywgd2l0aCBFdWNsaWRlYW4gZGlzdGFuY2UgYXMgdGhlIGRpc3NpbWlsYXJpdHkgbWVhc3VyZS4gV2UgYmVnaW4gYnkgY2x1c3RlcmluZyBvYnNlcnZhdGlvbnMgdXNpbmcgY29tcGxldGUgbGlua2FnZS4gVGhlIGBkaXN0KClgIGZ1bmN0aW9uIGlzIHVzZWQgdG8gY29tcHV0ZSB0aGUgNTAgPyA1MCBpbnRlci1vYnNlcnZhdGlvbiBFdWNsaWRlYW4gZGlzdGFuY2UgbWF0cml4Lg0KDQpgYGB7cn0NCnNldC5zZWVkKDIpDQp4PW1hdHJpeChybm9ybSAoNTAqMiksIG5jb2w9MikNCnhbMToyNSwxXT14WzE6MjUsMV0rMw0KeFsxOjI1LDJdPXhbMToyNSwyXS00DQpoYy5jb21wbGV0ZSA9aGNsdXN0KGRpc3QoeCksIG1ldGhvZD0iY29tcGxldGUiKQ0KYGBgDQoNCldlIGNvdWxkIGp1c3QgYXMgZWFzaWx5IHBlcmZvcm0gaGllcmFyY2hpY2FsIGNsdXN0ZXJpbmcgd2l0aCBhdmVyYWdlIG9yIHNpbmdsZSBsaW5rYWdlIGluc3RlYWQ6DQoNCmBgYHtyfQ0KaGMuYXZlcmFnZSA9aGNsdXN0KGRpc3QoeCksIG1ldGhvZCA9ImF2ZXJhZ2UiKQ0KaGMuc2luZ2xlPWhjbHVzdChkaXN0KHgpLCBtZXRob2QgPSJzaW5nbGUiKQ0KYGBgDQoNCldlIGNhbiBub3cgcGxvdCB0aGUgZGVuZHJvZ3JhbXMgb2J0YWluZWQgdXNpbmcgdGhlIHVzdWFsIGBwbG90KClgIGZ1bmN0aW9uLiBUaGUgbnVtYmVycyBhdCB0aGUgYm90dG9tIG9mIHRoZSBwbG90IGlkZW50aWZ5IGVhY2ggb2JzZXJ2YXRpb24uDQoNCmBgYHtyfQ0KcGFyKG1mcm93PWMoMSwzKSkNCnBsb3QoaGMuY29tcGxldGUgLG1haW49IkNvbXBsZXRlIExpbmthZ2UgIiwgeGxhYj0iIiwgc3ViPSIiLGNleD0uOSkNCnBsb3QoaGMuYXZlcmFnZSAsIG1haW49IkF2ZXJhZ2UgTGlua2FnZSIsIHhsYWI9IiIsIHN1Yj0iIixjZXg9LjkpDQpwbG90KGhjLnNpbmdsZSAsIG1haW49IlNpbmdsZSBMaW5rYWdlICIsIHhsYWI9IiIsIHN1Yj0iIiwgY2V4PS45KQ0KYGBgDQoNClRvIGRldGVybWluZSB0aGUgY2x1c3RlciBsYWJlbHMgZm9yIGVhY2ggb2JzZXJ2YXRpb24gYXNzb2NpYXRlZCB3aXRoIGEgZ2l2ZW4gY3V0IG9mIHRoZSBkZW5kcm9ncmFtLCB3ZSBjYW4gdXNlIHRoZSBgY3V0cmVlKClgIGZ1bmN0aW9uOg0KDQpgYGB7cn0NCmN1dHJlZShoYy5jb21wbGV0ZSAsIDIpDQpjdXRyZWUoaGMuYXZlcmFnZSAsIDIpDQpjdXRyZWUoaGMuc2luZ2xlICwgMikNCmBgYA0KDQpGb3IgdGhpcyBkYXRhLCBjb21wbGV0ZSBhbmQgYXZlcmFnZSBsaW5rYWdlIGdlbmVyYWxseSBzZXBhcmF0ZSB0aGUgb2JzZXJ2YXRpb25zIGludG8gdGhlaXIgY29ycmVjdCBncm91cHMuIEhvd2V2ZXIsIHNpbmdsZSBsaW5rYWdlIGlkZW50aWZpZXMgb25lIHBvaW50YyBhcyBiZWxvbmdpbmcgdG8gaXRzIG93biBjbHVzdGVyLiBBIG1vcmUgc2Vuc2libGUgYW5zd2VyIGlzIG9idGFpbmVkIHdoZW4gZm91ciBjbHVzdGVycyBhcmUgc2VsZWN0ZWQsIGFsdGhvdWdoIHRoZXJlIGFyZSBzdGlsbCB0d28gc2luZ2xldG9ucy4NCg0KYGBge3J9DQpjdXRyZWUoaGMuc2luZ2xlICwgNCkNCmBgYA0KDQpUbyBzY2FsZSB0aGUgdmFyaWFibGVzIGJlZm9yZSBwZXJmb3JtaW5nIGhpZXJhcmNoaWNhbCBjbHVzdGVyaW5nIG9mIHRoZSBvYnNlcnZhdGlvbnMsIHdlIHVzZSB0aGUgYHNjYWxlKClgIGZ1bmN0aW9uOg0KDQpgYGB7cn0NCnhzYz1zY2FsZSh4KQ0KcGxvdChoY2x1c3QoZGlzdCh4c2MpLCBtZXRob2QgPSJjb21wbGV0ZSIpLCBtYWluPSIgSGllcmFyY2hpY2FsIENsdXN0ZXJpbmcgd2l0aCBTY2FsZWQgRmVhdHVyZXMgIikNCmBgYA0KDQoNCg==