Clustering: Lab 10 from ISLR Book

Bruno Wu
March 5, 2014

K-Means Clustering

K-Means Clustering Exercise

Using a simulated data set

  • This is Lab 10 from the ISLR book
  • Testing out how to use R Presentation on RStudio
  • See if this can create 2 columns

Create 2 clusters first - 2 sets of 25 points

set.seed(2)
x=matrix(rnorm(50*2), ncol=2)
x[1:25,1] = x[1:25,1] + 3
x[1:25,2] = x[1:25,2] - 4

Perform K-means clustering with K=2

km.out = kmeans(x,2,nstart=20)
km.out$cluster
 [1] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1
[36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  • kmeans() separated the observations perfectly into 2 clusters, even though we did not supply any group information!
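One way to check this claim without reading 50 labels by eye is to cross-tabulate the cluster labels against the simulated groups. The sketch below assumes the first 25 rows are the shifted group and the last 25 the unshifted one, as in the simulation above; the names truth, tab and agreement are made up for this example. The label numbers 1 and 2 themselves are arbitrary and may swap between runs.

```r
# Recreate the simulated data from the earlier slide
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

# 'truth' is an illustrative name: the group each row was simulated from
truth <- rep(c("shifted", "unshifted"), each = 25)

km.out <- kmeans(x, 2, nstart = 20)

# Rows are K-means labels, columns are the simulated groups
tab <- table(cluster = km.out$cluster, group = truth)
print(tab)

# Share of observations that sit in the majority group of their cluster
agreement <- sum(apply(tab, 1, max)) / sum(tab)
print(agreement)
```

A perfect separation shows up as a table with a single non-zero entry in each row.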

Plot the clusters

plot(x, col=(km.out$cluster+1), main="K-Means Clustering Results with K=2", xlab="", ylab="", pch=20, cex=2)

[Plot: K-Means Clustering Results with K=2]

Cluster with K=3 instead

set.seed(4)
km.out=kmeans(x,3,nstart=20)
km.out
K-means clustering with 3 clusters of sizes 10, 23, 17

Cluster means:
    [,1]     [,2]
1  2.300 -2.69622
2 -0.382 -0.08741
3  3.779 -4.56201

Clustering vector:
 [1] 3 1 3 1 3 3 3 1 3 1 3 1 3 1 3 1 3 3 3 3 3 1 3 3 3 2 2 2 2 2 2 2 2 2 2
[36] 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2

Within cluster sum of squares by cluster:
[1] 19.56 52.68 25.74
 (between_SS / total_SS =  79.3 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"         "iter"        
[9] "ifault"      

Plot the 3 clusters

[Plot: K-means clustering results with K=3]
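The code for this plot is not shown on the slide. A minimal sketch, assuming it mirrors the K=2 plot call with the new cluster labels; the title string, the name point.cols and the overlay of cluster centers are choices made for this example.

```r
# Recreate the simulated data and the K=3 fit
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

set.seed(4)
km.out <- kmeans(x, 3, nstart = 20)

# Colour index per point: cluster label + 1, so colours start at palette entry 2
point.cols <- km.out$cluster + 1

plot(x, col = point.cols,
     main = "K-Means Clustering Results with K=3",  # assumed title
     xlab = "", ylab = "", pch = 20, cex = 2)

# Mark the fitted cluster centers on top of the points
points(km.out$centers, pch = 4, cex = 2, lwd = 2)
```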

Use the nstart argument to run multiple random initial assignments

set.seed(3)
km.out = kmeans(x,3,nstart=1) #only 1 initial random cluster assignment
km.out$tot.withinss
[1] 104.3
km.out = kmeans(x,3,nstart=20) #20 initial random cluster assignments
km.out$tot.withinss
[1] 97.98
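km.out$tot.withinss is the total within-cluster sum of squares, the quantity K-means tries to minimize, so the lower value from nstart=20 is the better solution. As a sketch of what that number measures, it can be recomputed by hand from the fitted centers (assigned.centers and manual.wss are names made up for this example).

```r
# Recreate the simulated data from the earlier slide
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

set.seed(3)
km.out <- kmeans(x, 3, nstart = 20)

# One row per observation: the center of the cluster it was assigned to
assigned.centers <- km.out$centers[km.out$cluster, ]

# Sum of squared Euclidean distances from each point to its own center
manual.wss <- sum((x - assigned.centers)^2)

print(manual.wss)
print(km.out$tot.withinss)   # same quantity, as reported by kmeans()
print(sum(km.out$withinss))  # and again as the sum over clusters
```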

Hierarchical Clustering

Introduce the hclust() and dist() functions

hc.complete = hclust(dist(x), method="complete")
hc.average = hclust(dist(x), method="average")
hc.single = hclust(dist(x), method="single")
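By default dist() computes the Euclidean distance between every pair of rows, and hclust() then merges the two closest clusters one step at a time. A small sketch of what these objects contain for the 50 simulated observations (the name d is introduced here for illustration):

```r
# Recreate the simulated data from the earlier slide
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

d <- dist(x)                 # Euclidean distance is the default method
print(length(d))             # one entry per pair: 50 * 49 / 2 = 1225

hc.complete <- hclust(d, method = "complete")

print(dim(hc.complete$merge))    # one row per merge step: n - 1 = 49 rows
print(head(hc.complete$height))  # heights at which the first merges happen
```

The merge heights are what the vertical axis of a dendrogram shows.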

Plot for Complete Linkage Method

[Plot: dendrogram for complete linkage]

Plot for Average Linkage Method

[Plot: dendrogram for average linkage]

Plot for Single Linkage Method

[Plot: dendrogram for single linkage]
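The plotting code for the three dendrograms is not shown on these slides. A sketch of how they could be drawn, assuming plain plot() calls on the hclust objects; the titles, the side-by-side layout and the cex value are choices made for this example.

```r
# Recreate the simulated data and the three fits
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

hc.complete <- hclust(dist(x), method = "complete")
hc.average  <- hclust(dist(x), method = "average")
hc.single   <- hclust(dist(x), method = "single")

# Draw the three dendrograms side by side
old.par <- par(mfrow = c(1, 3))
plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", cex = 0.9)
plot(hc.average,  main = "Average Linkage",  xlab = "", sub = "", cex = 0.9)
plot(hc.single,   main = "Single Linkage",   xlab = "", sub = "", cex = 0.9)
par(old.par)  # restore the previous plotting layout
```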

Cut the dendrograms into clusters using cutree()

cutree(hc.complete, 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2
[36] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
cutree(hc.average, 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2 2
[36] 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2
cutree(hc.single, 2)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
  • The single linkage method puts observation 16 in a cluster of its own (a “singleton”)
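With 50 labels printed in a row, a singleton is easy to miss. A quick sketch that tabulates the cluster sizes for each linkage makes it stand out (the helper cluster.sizes and the list sizes are made up for this example):

```r
# Recreate the simulated data from the earlier slide
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

# Hypothetical helper: number of observations in each of the k clusters
cluster.sizes <- function(hc, k) as.vector(table(cutree(hc, k)))

sizes <- list(
  complete = cluster.sizes(hclust(dist(x), method = "complete"), 2),
  average  = cluster.sizes(hclust(dist(x), method = "average"), 2),
  single   = cluster.sizes(hclust(dist(x), method = "single"), 2)
)
print(sizes)
```

A cluster of size 1 in this listing is a singleton.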

A more sensible cut

cutree(hc.single, 4)
 [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3
[36] 3 3 3 3 3 3 4 3 3 3 3 3 3 3 3
  • Cutting into 4 clusters is more sensible, although it still leaves 2 singletons (observations 16 and 42).
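cutree() can also cut a tree at a given height through its h argument instead of a cluster count k. A sketch on the complete linkage tree, cutting halfway between the last two merge heights, which leaves exactly the final merge undone and so gives the same 2 clusters as k=2 (top.heights, cut.height, by.height and by.count are illustrative names):

```r
# Recreate the simulated data and the complete linkage fit
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

hc.complete <- hclust(dist(x), method = "complete")

# The last two entries of $height belong to the final two merges
top.heights <- tail(hc.complete$height, 2)
cut.height <- mean(top.heights)

by.height <- cutree(hc.complete, h = cut.height)  # cut at a height
by.count  <- cutree(hc.complete, k = 2)           # cut into k clusters

print(all(by.height == by.count))
```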

Scale the variables first

xsc=scale(x)
plot(hclust(dist(xsc), method="complete"), main="Hierarchical Clustering with Scaled Features")

[Plot: Hierarchical Clustering with Scaled Features]
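By default scale() centers each column to mean 0 and divides it by its standard deviation, so that no feature dominates the Euclidean distances just because of its units. A quick check of that behaviour on the simulated data:

```r
# Recreate the simulated data from the earlier slide
set.seed(2)
x <- matrix(rnorm(50 * 2), ncol = 2)
x[1:25, 1] <- x[1:25, 1] + 3
x[1:25, 2] <- x[1:25, 2] - 4

xsc <- scale(x)

print(colMeans(x))        # original column means differ from 0
print(colMeans(xsc))      # numerically zero after scaling
print(apply(xsc, 2, sd))  # standard deviation 1 in every column
```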

Correlation-based Distances using as.dist()

x=matrix(rnorm(30*3), ncol=3)
dd=as.dist(1-cor(t(x)))
plot(hclust(dd, method="complete"), main="Complete Linkage with Correlation-Based Distance", xlab="", sub="")

[Plot: Complete Linkage with Correlation-Based Distance]
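Correlation-based distance treats two observations as close when their feature profiles are correlated, whatever their magnitudes. It needs at least 3 features: with only 2, the correlation between any two observations is always +1 or -1. A sketch illustrating both points (the seed and the names x3, x2 and r2 are choices made for this example, not part of the lab):

```r
set.seed(1)  # assumed seed, only to make this example reproducible

# 30 observations on 3 features, as on the slide
x3 <- matrix(rnorm(30 * 3), ncol = 3)
dd <- as.dist(1 - cor(t(x3)))  # t() so that cor() compares observations
print(range(dd))   # correlations lie in [-1, 1], so distances lie in [0, 2]
print(length(dd))  # 30 * 29 / 2 = 435 pairwise distances

# With only 2 features every pairwise correlation is +1 or -1
x2 <- matrix(rnorm(30 * 2), ncol = 2)
r2 <- cor(t(x2))
print(all(abs(abs(r2) - 1) < 1e-8))
```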

Thank you