1. In Homework 1, you did Question 3 on page 414 of “An Introduction to Statistical Learning”. In this question we are going to use the same data.
  1. Construct the dissimilarity matrix based on the Euclidean distance.
library(cluster)
# Six observations on two variables (the data from Homework 1, Question 3)
data <- matrix(c(1, 1, 0, 5, 6, 4, 4, 3, 4, 1, 2, 0), nrow = 6, ncol = 2)
# Pairwise Euclidean dissimilarities, without standardizing the variables
daisy(data, metric = "euclidean", stand = FALSE)
## Dissimilarities :
##          1        2        3        4        5
## 2 1.000000                                    
## 3 1.000000 1.414214                           
## 4 5.000000 4.472136 5.830952                  
## 5 5.385165 5.099020 6.324555 1.414214         
## 6 5.000000 4.242641 5.656854 1.414214 2.828427
## 
## Metric :  euclidean 
## Number of objects : 6
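
As a quick check (not part of the original output), a single entry of the matrix can be recomputed directly from the definition of the Euclidean distance; observations 1 and 4 are the rows (1, 4) and (5, 1).

# sqrt((1 - 5)^2 + (4 - 1)^2) = 5, matching the entry in row 4, column 1 above
sqrt(sum((data[1, ] - data[4, ])^2))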
  1. Construct the dendrogram based on complete linkage.
# Euclidean distance matrix (same dissimilarities as above), clustered with complete linkage
data.dist <- dist(data)
plot(hclust(data.dist), main = "Complete Linkage", xlab = "", sub = "", ylab = "")
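
To mark a two-cluster cut on this dendrogram, rect.hclust() can be overlaid on the plot; this is an optional embellishment, not part of the original solution, and hc.complete is a helper object introduced only for this illustration.

# Re-draw the complete-linkage dendrogram and outline a two-cluster cut
hc.complete <- hclust(data.dist)
plot(hc.complete, main = "Complete Linkage", xlab = "", sub = "", ylab = "")
rect.hclust(hc.complete, k = 2, border = "red")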

  1. Construct the dendrogram based on average linkage.
# Same distance matrix as above, but with average linkage
plot(hclust(data.dist, method = "average"), main = "Average Linkage", xlab = "", sub = "", ylab = "")

  1. Calculate the CH index for the K-means clustering with K = 2, where $CH(K) = \frac{B(K)/(K-1)}{W(K)/(n-K)}$ as defined in (14.29), and $W(K)$, $B(K)$ are the within- and between-cluster variations for the clustering $C(K)$.
# CH index for K = 2, ..., kmax based on K-means fits
ch.index <- function(x, kmax, iter.max = 100, nstart = 10, algorithm = "Lloyd") {
  ch <- numeric(length = kmax - 1)
  n <- nrow(x)
  for (k in 2:kmax) {
    a <- kmeans(x, k, iter.max = iter.max, nstart = nstart, algorithm = algorithm)
    w <- a$tot.withinss  # within-cluster variation W(K)
    b <- a$betweenss     # between-cluster variation B(K)
    ch[k - 1] <- (b / (k - 1)) / (w / (n - k))
  }
  return(list(k = 2:kmax, ch = ch))
}
ans=ch.index(data,kmax=5)
ans
## $k
## [1] 2 3 4 5
## 
## $ch
## [1] 29.12500 26.89286 21.41667 21.83333
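
As a cross-check (added here, not shown in the original), CH(2) can also be computed from a single K-means fit using the definition above; it should reproduce the first value returned by ch.index().

# Fit K-means with K = 2 and evaluate CH(2) = (B/(2 - 1)) / (W/(n - 2)) directly
fit2 <- kmeans(data, centers = 2, nstart = 10)
(fit2$betweenss / (2 - 1)) / (fit2$tot.withinss / (nrow(data) - 2))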
  1. Show that clustering based on correlation as similarity is equivalent to clustering based on squared distance as dissimilarity. That is, as on page 36 of the slides, show that $\sum_{j=1}^{p} (x_{ij} - x_{i'j})^2 \propto 1 - \rho(x_i, x_{i'})$.
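
A sketch of the argument (added here), assuming as on the slide that each observation has been standardized across its $p$ features to have mean zero and variance one, so that $\sum_{j=1}^{p} x_{ij}^2 = p$ and $\rho(x_i, x_{i'}) = \frac{1}{p}\sum_{j=1}^{p} x_{ij}x_{i'j}$:

$$
\sum_{j=1}^{p}\left(x_{ij}-x_{i'j}\right)^2
= \sum_{j=1}^{p} x_{ij}^2 + \sum_{j=1}^{p} x_{i'j}^2 - 2\sum_{j=1}^{p} x_{ij}x_{i'j}
= 2p\left(1-\rho(x_i, x_{i'})\right).
$$

Since $2p$ is a constant that does not depend on the pair $(i, i')$, the squared Euclidean distance is proportional to $1-\rho(x_i, x_{i'})$, so the two dissimilarities order every pair of observations identically and yield the same clustering.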
  2. Complete Question 9 on page 416 of “An Introduction to Statistical Learning”.
  3. Consider the USArrests data. We will now perform hierarchical clustering on the states.
  1. Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
library(ISLR)
set.seed(123)
# Complete-linkage hierarchical clustering of the 50 states on the raw (unscaled) USArrests variables
hc.comp <- hclust(dist(USArrests), method = "complete")
plot(hc.comp)

  1. Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
cutree(hc.comp, 3)
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              1              2              1 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              1              1              2 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              1              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              1 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              2              1              3              1              2 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              1              3              2 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              1              1              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              2              2              3              2              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              2              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              2              2              3              3              2

In the output above, each state's cluster assignment is printed beneath its name.
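
To list the members of each cluster explicitly (an extra step, not required by the question), the named vector returned by cutree() can be split by cluster label.

# Group the state names by their cluster assignment; `clusters3` is a helper added here
clusters3 <- cutree(hc.comp, 3)
split(names(clusters3), clusters3)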

  1. Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
# Center and scale each variable to standard deviation one before computing distances
sd <- scale(USArrests)
hcsd <- hclust(dist(sd), method = "complete")
plot(hcsd)

  1. What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.
cutree(hcsd, 3)
##        Alabama         Alaska        Arizona       Arkansas     California 
##              1              1              2              3              2 
##       Colorado    Connecticut       Delaware        Florida        Georgia 
##              2              3              3              2              1 
##         Hawaii          Idaho       Illinois        Indiana           Iowa 
##              3              3              2              3              3 
##         Kansas       Kentucky      Louisiana          Maine       Maryland 
##              3              3              1              3              2 
##  Massachusetts       Michigan      Minnesota    Mississippi       Missouri 
##              3              2              3              1              3 
##        Montana       Nebraska         Nevada  New Hampshire     New Jersey 
##              3              3              2              3              3 
##     New Mexico       New York North Carolina   North Dakota           Ohio 
##              2              2              1              3              3 
##       Oklahoma         Oregon   Pennsylvania   Rhode Island South Carolina 
##              3              3              3              3              1 
##   South Dakota      Tennessee          Texas           Utah        Vermont 
##              3              1              2              3              3 
##       Virginia     Washington  West Virginia      Wisconsin        Wyoming 
##              3              3              3              3              3
table(cutree(hc.comp, 3), cutree(hcsd, 3))
##    
##      1  2  3
##   1  6  9  1
##   2  2  2 10
##   3  0  0 20

The trees are similar, but scaling the variables does affect the clusters that are generated, as the cross-tabulation above shows. It is generally a good idea to scale the variables when they are measured in different units, as they are here: UrbanPop is a percentage, while the other three variables are arrest counts per 100,000 residents.
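
To back this up, the variables' scales can be compared directly (a quick check added here): Assault has by far the largest spread on its original scale, so without scaling it dominates the Euclidean distances.

# Standard deviation of each variable on its original scale; stats::sd is used
# explicitly because the scaled matrix above was stored in an object named `sd`
apply(USArrests, 2, stats::sd)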

In addition, perform the K-means clustering and choose K according to the CH index. Using the command “table”, compare the result with what you find from the hierarchical clustering in part (a) (cutting the dendrogram at the same number of clusters). The ideal K according to the CH index is K = 2, which has the highest CH value of 29.125.
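
The CH computation behind this choice is not shown above; one way to carry it out, reusing the ch.index() helper defined earlier on the scaled data (the object name below is illustrative, output omitted), is:

# CH index for K = 2, ..., 10 on the scaled data; pick the K with the largest value
ch.usa <- ch.index(sd, kmax = 10)
ch.usa$k[which.max(ch.usa$ch)]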

km <- kmeans(sd, 2, nstart = 20)
table(cutree(hc.comp, 2), km$cluster)
##    
##      1  2
##   1  1 15
##   2 29  5

There is some discrepancy: each of the two hierarchical clusters contains states from both K-means clusters rather than mapping onto a single one (allowing for the arbitrary cluster labels, 15 + 29 = 44 of the 50 states agree). Ideally, all of the states in a hierarchical cluster would fall under the same K-means cluster.
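
Part of this disagreement may simply reflect that K-means was run on the scaled data while the hierarchical clustering in part (a) used the unscaled data. As an optional additional check, the K-means result could be cross-tabulated against the scaled tree cut at the same K (output not shown here).

# Compare K-means (scaled data) with the scaled complete-linkage tree cut at K = 2
table(cutree(hcsd, 2), km$cluster)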