Notebook prepared by Everton Lima

Introduction to Statistical Learning Solutions (ISLR)

Ch 10 Exercises

Table of Contents

Conceptual Exercises

Applied Exercises

Conceptual Exercises

1

1a

\[ \frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^p (x_{ij}-x_{i'j})^2 = 2 \sum_{i \in C_k}\sum_{j=1}^p (x_{ij}-\bar{x}_{kj})^2 \tag{10.12} \]

Starting from the left-hand side of the equation, note that the terms with \(i = i'\) contribute nothing, so we may expand each difference around the cluster mean \(\bar{x}_{kj} = \frac{1}{|C_k|}\sum_{i \in C_k} x_{ij}\),

\[ \frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^p (x_{ij}-x_{i'j})^2 = \frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^p \left((x_{ij}-\bar{x}_{kj})-(x_{i'j}-\bar{x}_{kj})\right)^2 \]

Expanding the square yields,

\[ = \frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^p \left((x_{ij}-\bar{x}_{kj})^2 - 2(x_{ij}-\bar{x}_{kj})(x_{i'j}-\bar{x}_{kj}) + (x_{i'j}-\bar{x}_{kj})^2\right) \]

The first term does not depend on \(i'\) and the last does not depend on \(i\), so summing each over its free index contributes \(\frac{|C_k|}{|C_k|}\sum_{i \in C_k}\sum_{j=1}^p (x_{ij}-\bar{x}_{kj})^2\); together they give twice that quantity. The cross term vanishes because, for every \(j\),

\[ \sum_{i \in C_k} (x_{ij}-\bar{x}_{kj}) = \sum_{i \in C_k} x_{ij} - |C_k|\,\bar{x}_{kj} = 0 \]

Putting the pieces together produces,

\[ \frac{1}{|C_k|}\sum_{i,i' \in C_k}\sum_{j=1}^p (x_{ij}-x_{i'j})^2 = 2\sum_{i \in C_k}\sum_{j=1}^p (x_{ij}-\bar{x}_{kj})^2 \]
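The identity can also be checked numerically; below is a minimal sketch on a small simulated cluster (illustrative values only):

set.seed(1)
Ck=matrix(rnorm(5*3),nrow=5)                 # a small cluster: 5 observations, 3 features
lhs=sum(as.matrix(dist(Ck))^2)/nrow(Ck)      # (1/|C_k|) times the double sum of pairwise squared distances
rhs=2*sum(scale(Ck,scale=FALSE)^2)           # 2 times the sum of squared deviations from the cluster means
all.equal(lhs,rhs)                           # should be TRUE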

1b

At each iteration, observations are re-assigned to the cluster whose centroid is closest, minimizing the squared Euclidean distance to that centroid. By (10.12), the objective (10.11) equals twice the total squared distance of the observations to their cluster centroids, so both the re-assignment step and the centroid-update step can only decrease the objective or leave it unchanged.
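A minimal sketch of this monotone decrease, on hypothetical data with a fixed initialization so that runs with different iteration limits are comparable:

set.seed(2)
x=matrix(rnorm(100),ncol=2)        # hypothetical data: 50 observations, 2 features
init=x[1:2,]                       # fixed initial centroids (K = 2) so the runs below are comparable
wss=sapply(1:5,function(it)
  suppressWarnings(kmeans(x,centers=init,iter.max=it,algorithm="Lloyd")$tot.withinss))
wss                                # should be non-increasing: the objective never goes up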

2

2a

require(knitr)
## Loading required package: knitr
DissMatrix=data.frame(c(0,0.3,0.4,0.7),c(0.3,0,0.5,0.8),c(0.4,0.5,0,0.45),c(0.7,0.8,0.45,0))
colnames(DissMatrix)=c(paste('Col',1:4))

kable(DissMatrix)
Col 1 Col 2 Col 3 Col 4
0.0 0.3 0.40 0.70
0.3 0.0 0.50 0.80
0.4 0.5 0.00 0.45
0.7 0.8 0.45 0.00
plot(hclust(as.dist(as.matrix(DissMatrix)),method='complete'),xlab='')   # the table already holds dissimilarities, so use as.dist rather than dist

The table gives the dissimilarities between observations, so to form clusters we need only look for pairs with small values. The smallest value in the table is for the entries (1,2) and (2,1), followed by the pair (3,4) and (4,3).
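The fusion order can be confirmed by inspecting the merge heights of this fit:

hc.complete=hclust(as.dist(as.matrix(DissMatrix)),method='complete')
hc.complete$merge    # observations 1 and 2 fuse first, then 3 and 4, then the two resulting clusters
hc.complete$height   # fusion heights: 0.30, 0.45, 0.80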

2b

plot(hclust(as.dist(as.matrix(DissMatrix)),method='single'),xlab='')

2c

If we cut the dendrogram obtained in (a) at a height of 0.7, we obtain the clusters (1,2) and (3,4). Cutting below 0.45 instead produces three clusters, since observations 3 and 4 are then each in their own cluster.

2d

The clusters obtained here are (4) and (1,2,3).

2e

A dendrogram is read bottom-up, with the height indicating where clusters are fused. Horizontal position carries no meaning: the leaves can be swapped and they still represent clusters that fuse at the same height.

row.names(DissMatrix)=c(2,1,4,3)
plot(hclust(as.dist(as.matrix(DissMatrix)),method='complete'))

3

tb=data.frame(c(1,1,0,5,6,4),c(4,3,4,1,2,0))
colnames(tb)=c('X1','X2')
rownames(tb)=1:6

kable(tb,row.names = T)
X1 X2
1 1 4
2 1 3
3 0 4
4 5 1
5 6 2
6 4 0

3a

plot(tb,col=rainbow(6))

3b

set.seed(42)
labels=sample(1:6,6)
tb=cbind(tb,labels)

kable(tb)
X1 X2 labels
1 4 6
1 3 5
0 4 2
5 1 3
6 2 4
4 0 1

3c

centroids=aggregate(.~labels,tb,mean)
colnames(centroids)=c('labels',paste('C.X',1:2))

kable(data.frame(tb[,-3],centroids[tb$labels,-1],labels=tb$labels))
X1 X2 C.X.1 C.X.2 labels
1 4 1 4 6
1 3 1 3 5
0 4 0 4 2
5 1 5 1 3
6 2 6 2 4
4 0 4 0 1

3d

EuclideanDistance=function(v,z){
  sqrt(sum( (v-z)^2 ))
}
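# Quick sanity check of the helper on hypothetical values: the distance
# between (0,0) and (3,4) should be 5.
EuclideanDistance(c(0,0),c(3,4))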

labels=apply(tb[,c(1,2)],1,
 function(x){
    dist=apply(centroids[tb$labels,c(2,3)],1,function(y){ EuclideanDistance(x,y) })
    which.min(dist[dist>0])
  }
 )

kable(data.frame(tb[,-3],centroids[tb$labels,-1],old.labels=tb$labels,new.labels=labels))
X1 X2 C.X.1 C.X.2 old.labels new.labels
1 4 1 4 6 1
1 3 1 3 5 1
0 4 0 4 2 1
5 1 5 1 3 4
6 2 6 2 4 4
4 0 4 0 1 4
plot(tb[,c(1,2)],col=rainbow(6)[labels])

3e

while( any(labels != tb$labels)){
  tb$labels=labels
  centroids=aggregate(.~labels,tb,mean)

  labels=apply(tb[,c(1,2)],1,
    function(x){
      dist=apply(centroids[tb$labels,c(2,3)],1,function(y){ EuclideanDistance(x,y) })
      which.min(dist[dist>0])
    }
   )
  }

3f

plot(tb[,c(1,2)],col=rainbow(6)[tb$labels])

4

4a

Single and complete linkage use the minimal and maximal inter-cluster distance, respectively. Thus there are two possibilities: either,

\[ D^{\{1,2,3\},\{4,5\}}_{min} < D^{\{1,2,3\},\{4,5\}}_{max} \]

or,

\[ D^{\{1,2,3\},\{4,5\}}_{min} = D^{\{1,2,3\},\{4,5\}}_{max} \]

where, \[ D^{\{1,2,3\},\{4,5\}}_{min} = min(d_{14},d_{15},d_{24},d_{25},d_{34},d_{35})\]

and,

\[ D^{\{1,2,3\},\{4,5\}}_{max} = max(d_{14},d_{15},d_{24},d_{25},d_{34},d_{35})\]

If \(D^{\{1,2,3\},\{4,5\}}_{min} = D^{\{1,2,3\},\{4,5\}}_{max}\), then all six inter-cluster dissimilarities are equal and the fusion of {1,2,3} with {4,5} occurs at the same height on both dendrograms. Otherwise \(D^{\{1,2,3\},\{4,5\}}_{min} < D^{\{1,2,3\},\{4,5\}}_{max}\) and the fusion occurs at a strictly lower height on the single linkage dendrogram. In other words, the single linkage fusion is never higher than the complete linkage fusion, but without more information we cannot tell whether it is strictly lower or a tie.
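As a concrete illustration, consider the hypothetical dissimilarity matrix below, in which the within-cluster distances are small and the inter-cluster distances are not all equal; single linkage fuses the two clusters at the minimum inter-cluster distance while complete linkage fuses them at the maximum:

d=matrix(c(0,1,1,4,6,
           1,0,1,5,4,
           1,1,0,6,5,
           4,5,6,0,1,
           6,4,5,1,0),nrow=5,byrow=TRUE)
hclust(as.dist(d),method='single')$height     # final fusion at 4, the minimum inter-cluster dissimilarity
hclust(as.dist(d),method='complete')$height   # final fusion at 6, the maximum inter-cluster dissimilarity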

4b

In this case the clusters \(\{5\}\) and \(\{6\}\) each contain a single observation, so every linkage reduces to the same quantity:

\[ D^{\{5\},\{6\}}_{min} = D^{\{5\},\{6\}}_{max} = d_{56} \]

Therefore the two clusters fuse at exactly the same height in both dendrograms.

5

left=data.frame(socks=c(8,11,7,6,5,6,7,8),
                computers=c(0,0,0,0,1,1,1,1),
                col=c('black','tan','lightblue','green','yellow','darkblue','red','pink'))

par(mfrow=c(1,2))
barplot(left$socks,col=as.character(left$col),xlab = 'Socks',width = rep(0.4,8),xlim = c(0,8))
barplot(left$computers,col=as.character(left$col),xlab = 'Computers',width = rep(0.2,8),xlim = c(0,8),ylim=c(0,11),yaxt='n')

set.seed(42)
Ks=sample(c(rep(1,4),rep(2,4)),8) # step 1 - randomly assign clusters
left=data.frame(left,cluster=Ks)
left
##   socks computers       col cluster
## 1     8         0     black       2
## 2    11         0       tan       2
## 3     7         0 lightblue       1
## 4     6         0     green       2
## 5     5         1    yellow       1
## 6     6         1  darkblue       2
## 7     7         1       red       1
## 8     8         1      pink       1
Kmeans=aggregate(left[,c(1,2,4)],list(Ks),mean) # step 2 - find the cluster means
Kmeans=data.frame(Kmeans,col=apply(data.frame(split(left$col,left$cluster)),2,function(x){paste(x,collapse = ",")}))
Kmeans
##    Group.1 socks computers cluster                       col
## X1       1  6.75      0.75       1 lightblue,yellow,red,pink
## X2       2  7.75      0.25       2  black,tan,green,darkblue
Ks=apply(left[,c(1,2)],1,function(x){  # step 3 - re-assign observations.
  dists=c(dist(rbind(x,Kmeans[1,c(2,3)])),dist(rbind(x,Kmeans[2,c(2,3)])))
  which.min(dists)
})

left$cluster=Ks
left$cluster
## [1] 2 2 1 1 1 1 1 2
Kmeans=aggregate(left[,c(1,2,4)],list(Ks),mean)   # repeat step 2 - find the cluster means
col=sapply(split(left$col,left$cluster),function(x){paste(x,collapse = ",")})

Kmeans=data.frame(Kmeans,col=col)
Kmeans
##   Group.1 socks computers cluster                                 col
## 1       1   6.2 0.6000000       1 lightblue,green,yellow,darkblue,red
## 2       2   9.0 0.3333333       2                      black,tan,pink
Ks=apply(left[,c(1,2)],1,function(x){  # step 3 - re-assign observations.
  dists=c(dist(rbind(x,Kmeans[1,c(2,3)])),dist(rbind(x,Kmeans[2,c(2,3)])))
  which.min(dists)
})

any(Ks!=left$cluster)
## [1] FALSE

Since there is no change, the algorithm ends.

left$cluster=Ks
left
##   socks computers       col cluster
## 1     8         0     black       2
## 2    11         0       tan       2
## 3     7         0 lightblue       1
## 4     6         0     green       1
## 5     5         1    yellow       1
## 6     6         1  darkblue       1
## 7     7         1       red       1
## 8     8         1      pink       2

For the remaining two cases, scaling places a very high weight on the computer variable, and both result in the same clusters:

\[((black,tan,lightblue,green),(yellow,darkblue,red,pink))\]
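This claim can be checked with a quick sketch: running kmeans with several random starts on the standardized variables should split the shoppers on the computer variable (the run below is illustrative only, using the left data frame defined above):

set.seed(42)
km.scaled=kmeans(scale(left[,c('socks','computers')]),centers=2,nstart=20)
split(as.character(left$col),km.scaled$cluster)   # should separate the computer buyers from the rest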

6

Skipped.

Applied Exercises

7

Use cor and dist to calculate the correlations and distances between the observations as follows,

USArrests.scaled=scale(USArrests)

correlation=as.dist(1-cor(t(USArrests.scaled)))
euclidean=dist(USArrests.scaled)^2

If the two quantities are approximately proportional, then \(correlation \approx K \cdot euclidean\) for some constant \(K\).

summary(correlation/euclidean)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000086 0.069140 0.133900 0.234200 0.262600 4.888000
summary(correlation-0.1339*euclidean)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -4.570000 -0.459700  0.000393 -0.059040  0.557600  1.809000

With \(K=0.1339\) (the median ratio), the two quantities are approximately equal, differing by only about 0.06 on average.

8

8a

pr.out=prcomp(USArrests.scaled)
pr.var=pr.out$sdev^2
pr.var/sum(pr.var)
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

8b

num=rowSums(apply(USArrests.scaled,1,function(x){ colSums(x %*% pr.out$rotation )^2  }))  # numerator of (10.8): for each PC, the sum over observations of its squared scores
num
##        PC1        PC2        PC3        PC4 
## 121.531837  48.498492  17.471596   8.498074
denom=sum(rowSums(USArrests.scaled^2))  # denominator of (10.8): total sum of squares of the scaled data
denom
## [1] 196
num/denom
##        PC1        PC2        PC3        PC4 
## 0.62006039 0.24744129 0.08914080 0.04335752

9

9a

hclust.out=hclust(dist(USArrests),method='complete')
plot(hclust.out)

9b

hclust.cut=cutree(hclust.out,k = 3)
hclust.cut=split(data.frame(names(hclust.cut),hclust.cut),as.factor(hclust.cut))
hclust.cut
## $`1`
##                names.hclust.cut. hclust.cut
## Alabama                  Alabama          1
## Alaska                    Alaska          1
## Arizona                  Arizona          1
## California            California          1
## Delaware                Delaware          1
## Florida                  Florida          1
## Illinois                Illinois          1
## Louisiana              Louisiana          1
## Maryland                Maryland          1
## Michigan                Michigan          1
## Mississippi          Mississippi          1
## Nevada                    Nevada          1
## New Mexico            New Mexico          1
## New York                New York          1
## North Carolina    North Carolina          1
## South Carolina    South Carolina          1
## 
## $`2`
##               names.hclust.cut. hclust.cut
## Arkansas               Arkansas          2
## Colorado               Colorado          2
## Georgia                 Georgia          2
## Massachusetts     Massachusetts          2
## Missouri               Missouri          2
## New Jersey           New Jersey          2
## Oklahoma               Oklahoma          2
## Oregon                   Oregon          2
## Rhode Island       Rhode Island          2
## Tennessee             Tennessee          2
## Texas                     Texas          2
## Virginia               Virginia          2
## Washington           Washington          2
## Wyoming                 Wyoming          2
## 
## $`3`
##               names.hclust.cut. hclust.cut
## Connecticut         Connecticut          3
## Hawaii                   Hawaii          3
## Idaho                     Idaho          3
## Indiana                 Indiana          3
## Iowa                       Iowa          3
## Kansas                   Kansas          3
## Kentucky               Kentucky          3
## Maine                     Maine          3
## Minnesota             Minnesota          3
## Montana                 Montana          3
## Nebraska               Nebraska          3
## New Hampshire     New Hampshire          3
## North Dakota       North Dakota          3
## Ohio                       Ohio          3
## Pennsylvania       Pennsylvania          3
## South Dakota       South Dakota          3
## Utah                       Utah          3
## Vermont                 Vermont          3
## West Virginia     West Virginia          3
## Wisconsin             Wisconsin          3

9c

hclust.out=hclust(dist(scale(USArrests)),method='complete')
plot(hclust.out)

Scaling significantly reduces the range and spread of the fusion heights in the tree. Moreover, cutting the tree into three clusters now produces different groupings than before.
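The effect on the fusion heights can be checked directly by comparing their ranges before and after scaling:

range(hclust(dist(USArrests),method='complete')$height)        # heights on the raw scale
range(hclust(dist(scale(USArrests)),method='complete')$height) # heights after standardizing the variables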

hclust.cut=cutree(hclust.out,k = 3)
split(data.frame(names(hclust.cut),hclust.cut),as.factor(hclust.cut))
## $`1`
##                names.hclust.cut. hclust.cut
## Alabama                  Alabama          1
## Alaska                    Alaska          1
## Georgia                  Georgia          1
## Louisiana              Louisiana          1
## Mississippi          Mississippi          1
## North Carolina    North Carolina          1
## South Carolina    South Carolina          1
## Tennessee              Tennessee          1
## 
## $`2`
##            names.hclust.cut. hclust.cut
## Arizona              Arizona          2
## California        California          2
## Colorado            Colorado          2
## Florida              Florida          2
## Illinois            Illinois          2
## Maryland            Maryland          2
## Michigan            Michigan          2
## Nevada                Nevada          2
## New Mexico        New Mexico          2
## New York            New York          2
## Texas                  Texas          2
## 
## $`3`
##               names.hclust.cut. hclust.cut
## Arkansas               Arkansas          3
## Connecticut         Connecticut          3
## Delaware               Delaware          3
## Hawaii                   Hawaii          3
## Idaho                     Idaho          3
## Indiana                 Indiana          3
## Iowa                       Iowa          3
## Kansas                   Kansas          3
## Kentucky               Kentucky          3
## Maine                     Maine          3
## Massachusetts     Massachusetts          3
## Minnesota             Minnesota          3
## Missouri               Missouri          3
## Montana                 Montana          3
## Nebraska               Nebraska          3
## New Hampshire     New Hampshire          3
## New Jersey           New Jersey          3
## North Dakota       North Dakota          3
## Ohio                       Ohio          3
## Oklahoma               Oklahoma          3
## Oregon                   Oregon          3
## Pennsylvania       Pennsylvania          3
## Rhode Island       Rhode Island          3
## South Dakota       South Dakota          3
## Utah                       Utah          3
## Vermont                 Vermont          3
## Virginia               Virginia          3
## Washington           Washington          3
## West Virginia     West Virginia          3
## Wisconsin             Wisconsin          3
## Wyoming                 Wyoming          3

Looking a bit more deeply into the clusters, we can see that cluster 2, which contains California and New York, is a high-Assault, high-urban-population cluster, and it also has the highest average value for Rape.

aggregate(USArrests,list(hclust.cut=hclust.cut),mean)
##   hclust.cut    Murder  Assault UrbanPop     Rape
## 1          1 14.087500 252.7500 53.50000 24.53750
## 2          2 11.054545 264.0909 79.09091 32.61818
## 3          3  5.003226 116.4839 63.83871 16.33871

10

10a

set.seed(42)
data= matrix(sapply(1:3,function(x){ rnorm(20*50,mean = 10*sqrt(x))  }),ncol=50)    # 20 obs. in each class with 50 features.
class=unlist(lapply(1:3,function(x){rep(x,20)}))

10b

pr.out=prcomp(data)
plot(pr.out$x[,c(1,2)],col=class)

10c

set.seed(1)
kmeans.out=kmeans(data,3)

table(kmeans.out$cluster)
## 
##  1  2  3 
## 19 21 20
table(class)
## class
##  1  2  3 
## 20 20 20

We can see that only one observation is misclassified.

plot(pr.out$x[,c(1,2)],col=kmeans.out$cluster)
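A cross-tabulation of the true classes against the fitted cluster labels (the labels themselves are arbitrary) shows how the classes map onto the clusters:

table(class,kmeans.out$cluster)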

10d

set.seed(1)
kmeans.out=kmeans(data,2)

table(kmeans.out$cluster)
## 
##  1  2 
## 36 24
table(class)
## class
##  1  2  3 
## 20 20 20

K-means seems to find one cluster that is the same as before, while the other two classes are merged together. This can clearly be seen in the plot below, where the red cluster closely matches the original green cluster.

plot(pr.out$x[,c(1,2)],col=kmeans.out$cluster)

10e

set.seed(1)
kmeans.out=kmeans(data,4)

table(kmeans.out$cluster)
## 
##  1  2  3  4 
## 19 11 12 18
table(class)
## class
##  1  2  3 
## 20 20 20

When using 4 clusters it becomes more difficult to relate the newly found clusters to the actual class values. However, by examining the plot we can see that the algorithm again finds the original green cluster, with some overlap between it and the remaining ones. Overlap between clusters in the first two principal components is also to be expected, since clusters that are separated in the full feature space may still overlap when projected onto just two components.

plot(pr.out$x[,c(1,2)],col=kmeans.out$cluster)

10f

set.seed(1)
kmeans.out=kmeans(pr.out$x[,c(1,2)],3)

table(kmeans.out$cluster)
## 
##  1  2  3 
## 28 18 14
table(class)
## class
##  1  2  3 
## 20 20 20

The algorithm performs well when clustering on the first two principal components. However, since it has no information about the remaining dimensions, observations that happen to be close in the first two components are forced into the same cluster, which can lead to mistakes. Examining the plot as before confirms this: the fitted clusters show no overlap in these two components.

plot(pr.out$x[,c(1,2)],col=kmeans.out$cluster)

10g

set.seed(1)
kmeans.out=kmeans(scale(data,center = T,scale = T),3)

table(kmeans.out$cluster)
## 
##  1  2  3 
## 29 14 17
table(class)
## class
##  1  2  3 
## 20 20 20
plot(pr.out$x[,c(1,2)],col=kmeans.out$cluster)

There is significant overlap in the first two clusters, and the algorithm performs poorly.

11

11a

The data can be read directly from the website using the function fread from the data.table package.

library(data.table)
data=fread('http://www-bcf.usc.edu/~gareth/ISL/Ch10Ex11.csv')
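As a quick sanity check (the exercise describes the file as 1,000 genes measured on 40 tissue samples), the dimensions can be verified after the download:

dim(data)    # should be 1000 rows (genes) by 40 columns (samples)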

11b

single.out=hclust(as.dist(1-cor(data)),method='single')
plot(single.out)

complete.out=hclust(as.dist(1-cor(data)),method = 'complete')
plot(complete.out)

Single linkage clearly gives poor results here: it produces a very unbalanced, chain-like dendrogram.
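Other linkages can be tried in the same way to see how strongly the result depends on this choice; a sketch using the same correlation-based dissimilarity:

average.out=hclust(as.dist(1-cor(data)),method='average')
plot(average.out)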

11c

I would recommend examining which genes contribute the most to the overall variance present in the data. One way to do this is to examine the loadings produced by PCA.

pr.out=prcomp(t(data))                                    # transpose so that samples are rows and genes are the variables
head(order(abs(rowSums(pr.out$rotation)),decreasing = T)) # genes with the largest total loadings across all principal components
## [1] 865  68 911 428 624 524