library(cluster)
data= matrix(c(1,1,0,5,6,4,4,3,4,1,2,0),nrow=6,ncol=2)
daisy(data, metric = c("euclidean"),stand = FALSE,type = list())
## Dissimilarities :
## 1 2 3 4 5
## 2 1.000000
## 3 1.000000 1.414214
## 4 5.000000 4.472136 5.830952
## 5 5.385165 5.099020 6.324555 1.414214
## 6 5.000000 4.242641 5.656854 1.414214 2.828427
##
## Metric : euclidean
## Number of objects : 6
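As a sanity check on the dissimilarity matrix above, one entry can be recomputed by hand. The sketch below rebuilds the same six points under the hypothetical name `pts` and compares the hand-computed Euclidean distance between objects 1 and 4 with the value returned by `dist()`.

```r
# Rebuild the six points used above (column 1 = first coordinate, column 2 = second).
# 'pts' is an illustrative name, not one used in the original code.
pts <- matrix(c(1, 1, 0, 5, 6, 4, 4, 3, 4, 1, 2, 0), nrow = 6, ncol = 2)

# Euclidean distance between objects 1 and 4, computed directly from the definition.
d14 <- sqrt(sum((pts[1, ] - pts[4, ])^2))

# The same entry read out of the full distance matrix produced by dist().
d_mat <- as.matrix(dist(pts))
c(by_hand = d14, from_dist = d_mat[4, 1])
```

Both values equal 5, matching the entry in row 4, column 1 of the `daisy` output.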
data.dist = dist(data)
# hclust() uses method = "complete" by default
plot(hclust(data.dist), main="Complete Linkage", xlab="", sub="", ylab="")
plot(hclust(data.dist, method="average"), main="Average Linkage", xlab="", sub="", ylab="")
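# Calinski-Harabasz (CH) index for K-means with k = 2, ..., kmax clusters.
ch.index = function(x, kmax, iter.max=100, nstart=10, algorithm="Lloyd") {
  ch = numeric(length=kmax-1)
  n = nrow(x)
  for (k in 2:kmax) {
    a = kmeans(x, k, iter.max=iter.max, nstart=nstart, algorithm=algorithm)
    w = a$tot.withinss   # total within-cluster sum of squares
    b = a$betweenss      # between-cluster sum of squares
    ch[k-1] = (b/(k-1))/(w/(n-k))
  }
  return(list(k=2:kmax, ch=ch))
}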
ans=ch.index(data,kmax=5)
ans
## $k
## [1] 2 3 4 5
##
## $ch
## [1] 29.12500 26.89286 21.41667 21.83333
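For reference, the quantity that `ch.index` evaluates for each k can be written out as a formula. The symbols here are notation introduced for this note: B(k) stands for `betweenss`, W(k) for `tot.withinss`, and n for the number of rows of `x`.

```latex
\mathrm{CH}(k) = \frac{B(k)/(k-1)}{W(k)/(n-k)}, \qquad k = 2, \dots, k_{\max}
```

A larger value means more between-cluster variation relative to within-cluster variation, so K is usually taken as the k that maximizes CH(k).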
library(ISLR)
set.seed(123)
hc.comp <- hclust(dist(USArrests), method = "complete")
plot(hc.comp)
cutree(hc.comp, 3)
## Alabama Alaska Arizona Arkansas California
## 1 1 1 2 1
## Colorado Connecticut Delaware Florida Georgia
## 2 3 1 1 2
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 1 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 1
## Massachusetts Michigan Minnesota Mississippi Missouri
## 2 1 3 1 2
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 1 3 2
## New Mexico New York North Carolina North Dakota Ohio
## 1 1 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 2 2 3 2 1
## South Dakota Tennessee Texas Utah Vermont
## 3 2 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 2 2 3 3 2
Each state's cluster label (1 to 3) is printed directly below its name.
sd <- scale(USArrests)
hcsd <- hclust(dist(sd), method = "complete")
plot(hcsd)
cutree(hcsd, 3)
## Alabama Alaska Arizona Arkansas California
## 1 1 2 3 2
## Colorado Connecticut Delaware Florida Georgia
## 2 3 3 2 1
## Hawaii Idaho Illinois Indiana Iowa
## 3 3 2 3 3
## Kansas Kentucky Louisiana Maine Maryland
## 3 3 1 3 2
## Massachusetts Michigan Minnesota Mississippi Missouri
## 3 2 3 1 3
## Montana Nebraska Nevada New Hampshire New Jersey
## 3 3 2 3 3
## New Mexico New York North Carolina North Dakota Ohio
## 2 2 1 3 3
## Oklahoma Oregon Pennsylvania Rhode Island South Carolina
## 3 3 3 3 1
## South Dakota Tennessee Texas Utah Vermont
## 3 1 2 3 3
## Virginia Washington West Virginia Wisconsin Wyoming
## 3 3 3 3 3
table(cutree(hc.comp, 3), cutree(hcsd, 3))
##
## 1 2 3
## 1 6 9 1
## 2 2 2 10
## 3 0 0 20
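Provide a justification for your answer. The two trees are broadly similar, but scaling the variables does change the clusters that are generated. In the cross-tabulation above, all 20 states in unscaled cluster 3 stay together in scaled cluster 3, whereas the states in unscaled clusters 1 and 2 are spread across the scaled clusters (6/9/1 and 2/2/10). It is always a good idea to scale the variables when they are measured in different units; here UrbanPop is a percentage, while the other three variables are arrests per 100,000 residents.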
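The effect of scaling can be seen directly from the per-variable variances. The following is a minimal sketch, with `raw_var` and `scaled_var` as illustrative names, showing how unevenly the raw variables contribute to Euclidean distance and how `scale()` evens them out.

```r
# USArrests ships with base R (package 'datasets'), so no extra library is needed.
raw_var    <- apply(USArrests, 2, var)         # variance of each variable on its raw scale
scaled_var <- apply(scale(USArrests), 2, var)  # variance after centering and scaling

# Side-by-side view: raw variances differ widely, scaled ones are all 1.
round(cbind(raw = raw_var, scaled = scaled_var), 2)

# The variable with the largest raw variance dominates unscaled Euclidean distances.
names(which.max(raw_var))
```

Whichever variable has the largest raw variance effectively drives the unscaled dendrogram, which is why the two cuts disagree.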
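In addition, perform K-means clustering and choose K according to the CH index. Using the command “table”, compare the result with what you find from the hierarchical clustering in part (a) (cutting the dendrogram at the same number of clusters). The CH values computed above are highest at K = 2 (CH = 29.125), so K = 2 is used here. Note that those values come from ch.index(data, kmax=5), that is, from the six-point example rather than from the scaled USArrests data, so ch.index should be rerun on sd to confirm the choice for this data set.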
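Since the CH values quoted above were obtained from the call on `data`, a check on the scaled USArrests data might look like the sketch below. It assumes the same CH definition as `ch.index`, rewritten compactly so the block runs on its own; `sd_arrests`, `ks`, and `ch_scaled` are illustrative names, and the result depends on the random starts.

```r
set.seed(123)
sd_arrests <- scale(USArrests)   # same scaling as 'sd' in the text
n  <- nrow(sd_arrests)
ks <- 2:5

# CH index for each k, using the same formula as ch.index.
ch_scaled <- sapply(ks, function(k) {
  fit <- kmeans(sd_arrests, centers = k, iter.max = 100, nstart = 10,
                algorithm = "Lloyd")
  (fit$betweenss / (k - 1)) / (fit$tot.withinss / (n - k))
})

data.frame(k = ks, ch = ch_scaled)
ks[which.max(ch_scaled)]   # K with the largest CH index on the scaled data
```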
km <- kmeans(sd, 2, nstart = 20)
table(cutree(hc.comp, 2), km$cluster)
##
## 1 2
## 1 1 15
## 2 29 5
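The two partitions largely agree once the arbitrary cluster labels are matched: 15 of the 16 states in hierarchical cluster 1 fall in K-means cluster 2, and 29 of the 34 states in hierarchical cluster 2 fall in K-means cluster 1. In total, 44 of the 50 states are grouped consistently and 6 are not. Ideally, each hierarchical cluster would map onto a single K-means cluster. Part of the discrepancy may come from the comparison itself, since hc.comp was built on the unscaled data while km was fit to the scaled data sd.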
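To compare the two methods on an equal footing, both can be run on the scaled data and the agreement measured independently of how the clusters happen to be numbered. This is a minimal sketch with illustrative names (`hc_lab`, `km_lab`, `agree`); no particular output is assumed.

```r
set.seed(123)
sd_arrests <- scale(USArrests)

# Two-cluster solutions from both methods, fit to the same scaled data.
hc_lab <- cutree(hclust(dist(sd_arrests), method = "complete"), 2)
km_lab <- kmeans(sd_arrests, centers = 2, nstart = 20)$cluster

tab <- table(hierarchical = hc_lab, kmeans = km_lab)
tab

# With two clusters there are only two ways to match labels:
# the diagonal or the anti-diagonal. Take whichever matches more states.
agree <- max(sum(diag(tab)), sum(tab) - sum(diag(tab)))
agree / sum(tab)   # share of states grouped consistently by both methods
```

Applied to the table in the text, the same rule gives max(1 + 5, 15 + 29) = 44 matching states out of 50.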