Notebook prepared by Everton Lima
\[ \frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^p (x_{ij}-x_{i'j})^2 = 2 \sum_{i \in C_k}\sum_{j=1}^p (x_{ij}-\bar{x}_{kj})^2, (10.12) \]
Starting from the left-hand side of the equation, first split the sum over pairs into an outer sum over \(i\) and an inner sum over \(i'\),
\[ \frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^p (x_{ij}-x_{i'j})^2 = \frac{1}{|C_k|}\sum_{i \in C_k}\left[\sum_{i' \in C_k}\sum_{j=1}^p (x_{ij}-x_{i'j})^2\right].\]
Selecting a single inner term (a fixed \(i\)) and adding and subtracting the cluster mean \(\bar{x}_{kj}=\frac{1}{|C_k|}\sum_{i \in C_k}x_{ij}\) yields the following,
\[ \frac{1}{|C_k|}\sum_{i' \in C_k}\sum_{j=1}^p (x_{ij}-x_{i'j})^2 = \frac{1}{|C_k|}\sum_{i' \in C_k}\sum_{j=1}^p \big((x_{ij}-\bar{x}_{kj})-(x_{i'j}-\bar{x}_{kj})\big)^2 \]
\[=\frac{1}{|C_k|}\sum_{i' \in C_k}\sum_{j=1}^p \big((x_{ij}-\bar{x}_{kj})^2-2(x_{ij}-\bar{x}_{kj})(x_{i'j}-\bar{x}_{kj})+(x_{i'j}-\bar{x}_{kj})^2\big) \]
\[=\sum_{j=1}^p (x_{ij}-\bar{x}_{kj})^2 -\frac{2}{|C_k|}\sum_{j=1}^p(x_{ij}-\bar{x}_{kj})\sum_{i' \in C_k}(x_{i'j}-\bar{x}_{kj}) +\frac{1}{|C_k|}\sum_{i' \in C_k}\sum_{j=1}^p(x_{i'j}-\bar{x}_{kj})^2. \]
The cross term vanishes because, for every \(j\),
\[\sum_{i' \in C_k}(x_{i'j}-\bar{x}_{kj})=\sum_{i' \in C_k}x_{i'j}-|C_k| \bar{x}_{kj}=0, \]
so each inner term reduces to
\[ \frac{1}{|C_k|}\sum_{i' \in C_k}\sum_{j=1}^p (x_{ij}-x_{i'j})^2 = \sum_{j=1}^p (x_{ij}-\bar{x}_{kj})^2 +\frac{1}{|C_k|}\sum_{i' \in C_k}\sum_{j=1}^p(x_{i'j}-\bar{x}_{kj})^2. \]
Returning to the outer sum and substituting each term produces,
\[ \frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^p (x_{ij}-x_{i'j})^2 = \sum_{i \in C_k}\sum_{j=1}^p (x_{ij}-\bar{x}_{kj})^2 +\frac{|C_k|}{|C_k|}\sum_{i' \in C_k}\sum_{j=1}^p(x_{i'j}-\bar{x}_{kj})^2 =2\sum_{i \in C_k}\sum_{j=1}^p (x_{ij}-\bar{x}_{kj})^2, \]
which is the right-hand side of (10.12).
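The identity is also easy to verify numerically. Below is a minimal sketch using a made-up cluster matrix (the names Ck, lhs and rhs are only for illustration), with \(|C_k|=5\) observations and \(p=3\) features:
set.seed(1)
Ck=matrix(rnorm(5*3),nrow=5)                 # made-up cluster: 5 observations, 3 features
lhs=sum(as.matrix(dist(Ck))^2)/nrow(Ck)      # left-hand side: all pairwise squared distances, divided by |C_k|
rhs=2*sum(sweep(Ck,2,colMeans(Ck))^2)        # right-hand side: twice the squared deviations from the cluster mean
all.equal(lhs,rhs)                           # TRUE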
require(knitr)
## Loading required package: knitr
DissMatrix=data.frame(c(0,0.3,0.4,0.7),c(0.3,0,0.5,0.8),c(0.4,0.5,0,0.45),c(0.7,0.8,0.45,0))
colnames(DissMatrix)=c(paste('Col',1:4))
kable(DissMatrix)
| Col 1 | Col 2 | Col 3 | Col 4 |
|---|---|---|---|
| 0.0 | 0.3 | 0.40 | 0.70 |
| 0.3 | 0.0 | 0.50 | 0.80 |
| 0.4 | 0.5 | 0.00 | 0.45 |
| 0.7 | 0.8 | 0.45 | 0.00 |
plot(hclust(as.dist(DissMatrix),method='complete'),xlab='')
The table gives the dissimilarities between the observations, so to produce clusters we need only fuse pairs whose dissimilarities are small. The smallest value in the table is for the entries (1,2) and (2,1), followed by the pair (3,4), (4,3).
If we cut the first dendrogram obtained at a height of 0.7, then we obtain the clusters (1,2) and (3,4). Cutting below 0.45, the height at which observations 3 and 4 fuse, produces 3 clusters, since 3 and 4 then each form their own cluster.
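These cuts can be checked with cutree; a quick sketch, rebuilding the complete-linkage tree from the dissimilarity matrix above (hc is just a temporary name):
hc=hclust(as.dist(DissMatrix),method='complete')
cutree(hc,h=0.7)   # two clusters: (1,2) and (3,4)
cutree(hc,h=0.4)   # three clusters: (1,2), (3) and (4)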
A dendrogram is read bottom-up: the height indicates where clusters are fused, while the horizontal ordering carries no meaning. The leaves can be swapped and they still represent clusters that are fused at the same height.
row.names(DissMatrix)=c(2,1,4,3)  # same dissimilarities, leaves relabelled
plot(hclust(as.dist(DissMatrix),method='complete'))
tb=data.frame(c(1,1,0,5,6,4),c(4,3,4,1,2,0))
colnames(tb)=c('X1','X2')
rownames(tb)=1:6
kable(tb,row.names = T)
| | X1 | X2 |
|---|---|---|
| 1 | 1 | 4 |
| 2 | 1 | 3 |
| 3 | 0 | 4 |
| 4 | 5 | 1 |
| 5 | 6 | 2 |
| 6 | 4 | 0 |
set.seed(42)
labels=sample(1:6,6)
tb=cbind(tb,labels)
kable(tb)
| X1 | X2 | labels |
|---|---|---|
| 1 | 4 | 6 |
| 1 | 3 | 5 |
| 0 | 4 | 2 |
| 5 | 1 | 3 |
| 6 | 2 | 4 |
| 4 | 0 | 1 |
centroids=aggregate(.~labels,tb,mean)
colnames(centroids)=c('labels',paste('C.X',1:2))
kable(data.frame(tb[,-3],centroids[tb$labels,-1],labels=tb$labels))
| X1 | X2 | C.X.1 | C.X.2 | labels |
|---|---|---|---|---|
| 1 | 4 | 1 | 4 | 6 |
| 1 | 3 | 1 | 3 | 5 |
| 0 | 4 | 0 | 4 | 2 |
| 5 | 1 | 5 | 1 | 3 |
| 6 | 2 | 6 | 2 | 4 |
| 4 | 0 | 4 | 0 | 1 |
EuclideanDistance=function(v,z){
sqrt(sum( (v-z)^2 ))
}
labels=apply(tb[,c(1,2)],1,
function(x){
dist=apply(centroids[tb$labels,c(2,3)],1,function(y){ EuclideanDistance(x,y) })
which.min(dist[dist>0])
}
)
kable(data.frame(tb[,-3],centroids[tb$labels,-1],old.labels=tb$labels,new.labels=labels))
| X1 | X2 | C.X.1 | C.X.2 | old.labels | new.labels |
|---|---|---|---|---|---|
| 1 | 4 | 1 | 4 | 6 | 1 |
| 1 | 3 | 1 | 3 | 5 | 1 |
| 0 | 4 | 0 | 4 | 2 | 1 |
| 5 | 1 | 5 | 1 | 3 | 4 |
| 6 | 2 | 6 | 2 | 4 | 4 |
| 4 | 0 | 4 | 0 | 1 | 4 |
plot(tb[,c(1,2)],col=rainbow(6)[labels])
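As a sanity check (a sketch, separate from the manual walkthrough above), R's kmeans with K = 2 and several random restarts should recover the same two groups, {1,2,3} and {4,5,6}:
set.seed(42)
kmeans(tb[,c('X1','X2')],centers=2,nstart=20)$cluster   # cluster labels may be permuted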
Single and complete linkage use the minimal and maximal inter-cluster distance, respectively. Thus there are two possibilities: either,
\[ D^{\{1,2,3\},\{4,5\}}_{min} < D^{\{1,2,3\},\{4,5\}}_{max} \]
or,
\[ D^{\{1,2,3\},\{4,5\}}_{min} = D^{\{1,2,3\},\{4,5\}}_{max} \]
where, \[ D^{\{1,2,3\},\{4,5\}}_{min} = min(d_{14},d_{15},d_{24},d_{25},d_{34},d_{35})\]
and,
\[ D^{\{1,2,3\},\{4,5\}}_{max} = max(d_{14},d_{15},d_{24},d_{25},d_{34},d_{35})\]
If \(D^{\{1,2,3\},\{4,5\}}_{min} = D^{\{1,2,3\},\{4,5\}}_{max}\), then all of the distances between the two clusters are equal and both linkages fuse them at that common height. Otherwise \(D^{\{1,2,3\},\{4,5\}}_{min} < D^{\{1,2,3\},\{4,5\}}_{max}\), and the clusters fuse at a strictly lower height in the single linkage dendrogram. In general, then, the single linkage fusion occurs at the same height or lower than the complete linkage fusion; without knowing the individual distances we cannot say which.
For the clusters {5} and {6} the situation is different: each cluster contains a single observation, so the only inter-cluster distance is \(d_{56}\) and
\[ D^{\{5\},\{6\}}_{min} = D^{\{5\},\{6\}}_{max} = d_{56}. \]
Consequently the two clusters fuse at the same height, \(d_{56}\), in both the single and the complete linkage dendrogram.
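A small sketch illustrates this, using made-up points in which observations 5 and 6 are each other's nearest neighbours so that the two singletons fuse directly (the names x, hs and hc are only for illustration):
x=matrix(c(0,0, 0,1, 1,0, 10,10, 20,20, 20,20.5),ncol=2,byrow=TRUE)
d=dist(x)
hs=hclust(d,method='single')
hc=hclust(d,method='complete')
hs$height[apply(hs$merge,1,function(m){setequal(m,c(-5,-6))})]   # height at which {5} and {6} fuse (single linkage)
hc$height[apply(hc$merge,1,function(m){setequal(m,c(-5,-6))})]   # the same height (= d_56) under complete linkage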
left=data.frame(socks=c(8,11,7,6,5,6,7,8),
computers=c(0,0,0,0,1,1,1,1),
col=c('black','tan','lightblue','green','yellow','darkblue','red','pink'))
par(mfrow=c(1,2))
barplot(left$socks,col=as.character(left$col),xlab = 'Socks',width = rep(0.4,8),xlim = c(0,8))
barplot(left$computers,col=as.character(left$col),xlab = 'Computers',width = rep(0.2,8),xlim = c(0,8),ylim=c(0,11),yaxt='n')
set.seed(42)
Ks=sample(c(rep(1,4),rep(2,4)),8) # step 1 - randomly assign clusters
left=data.frame(left,cluster=Ks)
left
## socks computers col cluster
## 1 8 0 black 2
## 2 11 0 tan 2
## 3 7 0 lightblue 1
## 4 6 0 green 2
## 5 5 1 yellow 1
## 6 6 1 darkblue 2
## 7 7 1 red 1
## 8 8 1 pink 1
Kmeans=aggregate(left[,c(1,2,4)],list(Ks),mean) # step 2 - find the cluster means
Kmeans=data.frame(Kmeans,col=apply(data.frame(split(left$col,left$cluster)),2,function(x){paste(x,collapse = ",")}))
Kmeans
## Group.1 socks computers cluster col
## X1 1 6.75 0.75 1 lightblue,yellow,red,pink
## X2 2 7.75 0.25 2 black,tan,green,darkblue
Ks=apply(left[,c(1,2)],1,function(x){ # step 3 - re-assign observations.
dists=c(dist(rbind(x,Kmeans[1,c(2,3)])),dist(rbind(x,Kmeans[2,c(2,3)])))
which.min(dists)
})
left$cluster=Ks
left$cluster
## [1] 2 2 1 1 1 1 1 2
Kmeans=aggregate(left[,c(1,2,4)],list(Ks),mean) # repeat step 2 - find the cluster means
col=sapply(split(left$col,left$cluster),function(x){paste(x,collapse = ",")})
Kmeans=data.frame(Kmeans,col=col)
Kmeans
## Group.1 socks computers cluster col
## 1 1 6.2 0.6000000 1 lightblue,green,yellow,darkblue,red
## 2 2 9.0 0.3333333 2 black,tan,pink
Ks=apply(left[,c(1,2)],1,function(x){ # step 3 - re-assign observations.
dists=c(dist(rbind(x,Kmeans[1,c(2,3)])),dist(rbind(x,Kmeans[2,c(2,3)])))
which.min(dists)
})
any(Ks!=left$cluster)
## [1] FALSE
Since there is no change, the algorithm ends.
left$cluster=Ks
left
## socks computers col cluster
## 1 8 0 black 2
## 2 11 0 tan 2
## 3 7 0 lightblue 1
## 4 6 0 green 1
## 5 5 1 yellow 1
## 6 6 1 darkblue 1
## 7 7 1 red 1
## 8 8 1 pink 2
For the remaining two parts, the scaling places a very high weight on Computers, and both result in the same clusters:
\[((black,tan,lightblue,green),(yellow,darkblue,red,pink))\]
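This claim can be checked directly with kmeans (a sketch using several random restarts on the scaled variables, rather than repeating the manual steps above), which should reproduce the split listed above, with the four computer owners in one cluster and the four non-owners in the other:
set.seed(42)
split(left$col,kmeans(scale(left[,c('socks','computers')]),centers=2,nstart=20)$cluster)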
Use cor and dist to calculate the correlation-based dissimilarity and the squared Euclidean distance between the observations as follows,
USArrests.scaled=scale(USArrests)
correlation=as.dist(1-cor(t(USArrests.scaled)))
euclidean=dist(USArrests.scaled)^2
If the two quantities are approximately proportional, then \(correlation \approx K \cdot euclidean\) for some constant \(K\).
summary(correlation/euclidean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000086 0.069140 0.133900 0.234200 0.262600 4.888000
summary(correlation-0.1339*euclidean)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.570000 -0.459700 0.000393 -0.059040 0.557600 1.809000
With \(K=0.1339\) (the median ratio), the two quantities are approximately equal, differing by only about 0.06 on average.
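Another way to look at the rough proportionality is to plot one measure against the other; a quick sketch:
plot(as.vector(euclidean),as.vector(correlation),
     xlab='squared Euclidean distance (scaled data)',ylab='1 - correlation')
abline(a=0,b=0.1339,col='red')   # the constant K estimated above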
hclust.out=hclust(dist(USArrests),method='complete') # assumed definition (not shown above): complete linkage on the unscaled data
hclust.cut=cutree(hclust.out,k = 3)
hclust.cut=split(data.frame(names(hclust.cut),hclust.cut),as.factor(hclust.cut))
hclust.cut
## $`1`
## names.hclust.cut. hclust.cut
## Alabama Alabama 1
## Alaska Alaska 1
## Arizona Arizona 1
## California California 1
## Delaware Delaware 1
## Florida Florida 1
## Illinois Illinois 1
## Louisiana Louisiana 1
## Maryland Maryland 1
## Michigan Michigan 1
## Mississippi Mississippi 1
## Nevada Nevada 1
## New Mexico New Mexico 1
## New York New York 1
## North Carolina North Carolina 1
## South Carolina South Carolina 1
##
## $`2`
## names.hclust.cut. hclust.cut
## Arkansas Arkansas 2
## Colorado Colorado 2
## Georgia Georgia 2
## Massachusetts Massachusetts 2
## Missouri Missouri 2
## New Jersey New Jersey 2
## Oklahoma Oklahoma 2
## Oregon Oregon 2
## Rhode Island Rhode Island 2
## Tennessee Tennessee 2
## Texas Texas 2
## Virginia Virginia 2
## Washington Washington 2
## Wyoming Wyoming 2
##
## $`3`
## names.hclust.cut. hclust.cut
## Connecticut Connecticut 3
## Hawaii Hawaii 3
## Idaho Idaho 3
## Indiana Indiana 3
## Iowa Iowa 3
## Kansas Kansas 3
## Kentucky Kentucky 3
## Maine Maine 3
## Minnesota Minnesota 3
## Montana Montana 3
## Nebraska Nebraska 3
## New Hampshire New Hampshire 3
## North Dakota North Dakota 3
## Ohio Ohio 3
## Pennsylvania Pennsylvania 3
## South Dakota South Dakota 3
## Utah Utah 3
## Vermont Vermont 3
## West Virginia West Virginia 3
## Wisconsin Wisconsin 3
hclust.out=hclust(dist(scale(USArrests)),method='complete')
plot(hclust.out)
Scaling significantly reduces the range and spread of the fusion heights in the tree. Moreover, cutting the tree into the same number of clusters (three) produces different groups than before.
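The effect of scaling on the fusion heights can be checked directly; a quick sketch comparing the two trees:
range(hclust(dist(USArrests),method='complete')$height)          # unscaled data
range(hclust(dist(scale(USArrests)),method='complete')$height)   # scaled data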
hclust.cut=cutree(hclust.out,k = 3)
split(data.frame(names(hclust.cut),hclust.cut),as.factor(hclust.cut))
## $`1`
## names.hclust.cut. hclust.cut
## Alabama Alabama 1
## Alaska Alaska 1
## Georgia Georgia 1
## Louisiana Louisiana 1
## Mississippi Mississippi 1
## North Carolina North Carolina 1
## South Carolina South Carolina 1
## Tennessee Tennessee 1
##
## $`2`
## names.hclust.cut. hclust.cut
## Arizona Arizona 2
## California California 2
## Colorado Colorado 2
## Florida Florida 2
## Illinois Illinois 2
## Maryland Maryland 2
## Michigan Michigan 2
## Nevada Nevada 2
## New Mexico New Mexico 2
## New York New York 2
## Texas Texas 2
##
## $`3`
## names.hclust.cut. hclust.cut
## Arkansas Arkansas 3
## Connecticut Connecticut 3
## Delaware Delaware 3
## Hawaii Hawaii 3
## Idaho Idaho 3
## Indiana Indiana 3
## Iowa Iowa 3
## Kansas Kansas 3
## Kentucky Kentucky 3
## Maine Maine 3
## Massachusetts Massachusetts 3
## Minnesota Minnesota 3
## Missouri Missouri 3
## Montana Montana 3
## Nebraska Nebraska 3
## New Hampshire New Hampshire 3
## New Jersey New Jersey 3
## North Dakota North Dakota 3
## Ohio Ohio 3
## Oklahoma Oklahoma 3
## Oregon Oregon 3
## Pennsylvania Pennsylvania 3
## Rhode Island Rhode Island 3
## South Dakota South Dakota 3
## Utah Utah 3
## Vermont Vermont 3
## Virginia Virginia 3
## Washington Washington 3
## West Virginia West Virginia 3
## Wisconsin Wisconsin 3
## Wyoming Wyoming 3
Looking a bit more deeply into the clusters, we can see that cluster 2, which contains California and New York, is a high-Assault, high-UrbanPop cluster, and it also has the highest average value for Rape.
aggregate(USArrests,list(hclust.cut=hclust.cut),mean)
## hclust.cut Murder Assault UrbanPop Rape
## 1 1 14.087500 252.7500 53.50000 24.53750
## 2 2 11.054545 264.0909 79.09091 32.61818
## 3 3 5.003226 116.4839 63.83871 16.33871
set.seed(42)
data= matrix(sapply(1:3,function(x){ rnorm(20*50,mean = 10*sqrt(x)) }),ncol=50) # 20 obs. in each class with 50 features.
class=unlist(lapply(1:3,function(x){rep(x,20)}))
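The plots below colour the observations by their first two principal component scores. Since pr.out is not defined above, here is the assumed PCA step (a sketch):
pr.out=prcomp(data)                 # assumed definition: principal components of the simulated data
plot(pr.out$x[,c(1,2)],col=class)   # first two score vectors, coloured by the true class labels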
set.seed(1)
kmeans.out=kmeans(data,3)
table(kmeans.out$cluster)
##
## 1 2 3
## 19 21 20
table(class)
## class
## 1 2 3
## 20 20 20
We can see that only one observation is misclassified.
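A cross-tabulation of the true class labels against the cluster assignments makes the comparison explicit (a quick sketch):
table(class,kmeans.out$cluster)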
plot(pr.out$x[,c(1,2)],col=kmeans.out$cluster)
set.seed(1)
kmeans.out=kmeans(data,2)
table(kmeans.out$cluster)
##
## 1 2
## 36 24
table(class)
## class
## 1 2 3
## 20 20 20
K-means seems to find one cluster that is essentially the same as before. This can clearly be observed in the plot below, as the red cluster closely matches the original green cluster.
plot(pr.out$x[,c(1,2)],col=kmeans.out$cluster)
set.seed(1)
kmeans.out=kmeans(data,4)
table(kmeans.out$cluster)
##
## 1 2 3 4
## 19 11 12 18
table(class)
## class
## 1 2 3
## 20 20 20
When using 4 clusters it becomes more difficult to relate the newly found clusters to the actual class values. However, by examining the plot we can see that K-means again finds the original green cluster, with some overlap between it and the remaining ones. Some overlap between the clusters in the first two principal components is also to be expected, since the cluster assignments depend on all 50 dimensions and not just on the two that are plotted.
plot(pr.out$x[,c(1,2)],col=kmeans.out$cluster)
set.seed(1)
kmeans.out=kmeans(pr.out$x[,c(1,2)],3)
table(kmeans.out$cluster)
##
## 1 2 3
## 28 18 14
table(class)
## class
## 1 2 3
## 20 20 20
The algorithm performs well when clustering on the first two principal component score vectors; however, since it is missing information from the remaining dimensions, observations that are close in the first two components are forced into the same cluster, which can lead to mistakes. Examining the plot as before confirms this: there is no overlap between the clusters in the plotted components.
plot(pr.out$x[,c(1,2)],col=kmeans.out$cluster)
set.seed(1)
kmeans.out=kmeans(scale(data,center = T,scale = T),3)
table(kmeans.out$cluster)
##
## 1 2 3
## 29 14 17
table(class)
## class
## 1 2 3
## 20 20 20
plot(pr.out$x[,c(1,2)],col=kmeans.out$cluster)
There is significant overlap between two of the clusters, and the algorithm performs poorly on the scaled data.
The data can be read directly from the book's website using the function fread from the data.table package.
library(data.table)
data=fread('http://www-bcf.usc.edu/~gareth/ISL/Ch10Ex11.csv')