10.4 Principal Components Analysis

We perform PCA on the USArrests data set.

states = row.names(USArrests)
states

##  [1] "Alabama"        "Alaska"         "Arizona"        "Arkansas"      
##  [5] "California"     "Colorado"       "Connecticut"    "Delaware"      
##  [9] "Florida"        "Georgia"        "Hawaii"         "Idaho"         
## [13] "Illinois"       "Indiana"        "Iowa"           "Kansas"        
## [17] "Kentucky"       "Louisiana"      "Maine"          "Maryland"      
## [21] "Massachusetts"  "Michigan"       "Minnesota"      "Mississippi"   
## [25] "Missouri"       "Montana"        "Nebraska"       "Nevada"        
## [29] "New Hampshire"  "New Jersey"     "New Mexico"     "New York"      
## [33] "North Carolina" "North Dakota"   "Ohio"           "Oklahoma"      
## [37] "Oregon"         "Pennsylvania"   "Rhode Island"   "South Carolina"
## [41] "South Dakota"   "Tennessee"      "Texas"          "Utah"          
## [45] "Vermont"        "Virginia"       "Washington"     "West Virginia" 
## [49] "Wisconsin"      "Wyoming"

dim(USArrests)

## [1] 50  4

names(USArrests)

## [1] "Murder"   "Assault"  "UrbanPop" "Rape"

The rows of the data set contain the 50 states
The colunms of the data set contain the four variables

Examnine the data:

apply(USArrests,2,mean)

##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232

# 1：按行
# 2：按列
apply(USArrests,2,var)

##     Murder    Assault   UrbanPop       Rape 
##   18.97047 6945.16571  209.51878   87.72916

We notice that the variables have vastly different means. Not surprisingly, the variables also have vastly different variances.

We now perform PCA using the prcomp()function.

pr.out = prcomp(USArrests,scale=TRUE) # have standard deviation one

names(pr.out)

## [1] "sdev"     "rotation" "center"   "scale"    "x"

pr.out$center

##   Murder  Assault UrbanPop     Rape 
##    7.788  170.760   65.540   21.232

pr.out$scale

##    Murder   Assault  UrbanPop      Rape 
##  4.355510 83.337661 14.474763  9.366385

pr.out$rotation

##                 PC1        PC2        PC3         PC4
## Murder   -0.5358995  0.4181809 -0.3412327  0.64922780
## Assault  -0.5831836  0.1879856 -0.2681484 -0.74340748
## UrbanPop -0.2781909 -0.8728062 -0.3780158  0.13387773
## Rape     -0.5434321 -0.1673186  0.8177779  0.08902432

pr.out$sdev

## [1] 1.5748783 0.9948694 0.5971291 0.4164494

dim(pr.out$x)

## [1] 50  4

center: the means of the variables that were used for scaling prior to implementing PCA;
scale: the standard deviation;
rotation: principal component loadings vector;
sdev: the standard deviation of each principal component;

We see that there are four distinct principal components. Rather the $50\times4$ matrix x has as its columns the principal component score vectors. That is, the kth column is the kth principal component score vector.

We can plot the first two principal components as follows:

biplot(pr.out,scale=0)

The scale=0 argument to biplot()ensures that the arrows are scaled to represent the loadinds; other values for scale give slightly different biplots with different interpretations.

Reproduce the Figure by making a few small changes:

pr.out$rotation = -pr.out$rotation
pr.out$x = -pr.out$x
biplot(pr.out,scale=0)

The variance explained by each principal component is obtained by squaring these:

pr.var = pr.out$sdev^2
pr.var

## [1] 2.4802416 0.9897652 0.3565632 0.1734301

The propotion of variance explained by each principal components:

pve = pr.var/sum(pr.var)
pve

## [1] 0.62006039 0.24744129 0.08914080 0.04335752

We see that the first principal component explains 62.0% of the variance in the data, the next principal component explains 24.7% of the variance, and so forth. We can plot the PVE explained by each component. as well as the cumulative PVE, as follows:

plot(pve,xlab="Principal Component",ylab="Proportion of Variance Explained",ylim=c(0,1),type="b")

plot(cumsum(pve),xlab="Principal Component",ylab="Cumulative Proportion of Variance Explained",ylim=c(0,1),type="b")

10.5 Clustering

10.5.1 K-Means Clustering

Simulated sample:

set.seed(2)
x = matrix(rnorm(50*2),ncol=2)
x[1:25,1] = x[1:25,1] + 3
x[1:25,2] = x[1:25,1] - 4

We now perform K-means clustering with K=2.

km.out = kmeans(x,2,nstart=20) # repeat 20 times then choose the best one
names(km.out)

## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

The cluster assignments of the 50 obervations are contained in km.out$cluster.

km.out$cluster

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2
## [39] 2 2 2 2 2 1 2 1 2 2 2 2

The K-means clustering perfectly separated the observations into two clusters even though we did not supply any group information to kmeans(). We can plot the data, with each observation colored according to its cluster assignment.

plot(x,col=(km.out$cluster+1),main="K-Means Clustering Results with K=2",xlab="",ylab="",pch=20,cex=2)

In this example, we knew that there really were two clusters because we generated the data. However, for real data, in general we do not konw the true number of clusters. We could instead have performed K-means clustering on this example with K=3.

set.seed(4)
km.out = kmeans(x,3,nstart=20)
km.out

## K-means clustering with 3 clusters of sizes 12, 18, 20
## 
## Cluster means:
##         [,1]        [,2]
## 1  4.2850595  0.28505950
## 2  2.1022619 -1.46804124
## 3 -0.5402266  0.08657173
## 
## Clustering vector:
##  [1] 2 2 1 2 2 2 1 2 1 2 1 1 2 2 1 2 1 2 1 1 1 2 1 1 2 3 3 3 3 3 3 2 2 3 3 3 3 3
## [39] 3 3 3 3 3 2 2 2 3 3 3 3
## 
## Within cluster sum of squares by cluster:
## [1]  8.324848 21.938097 43.127245
##  (between_SS / total_SS =  74.4 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
## [6] "betweenss"    "size"         "iter"         "ifault"

plot(x,col=(km.out$cluster+1),main="K-Means Clustering Results with K=3",xlab="",ylab="",pch=20,cex=2)

When K=3, K-means clustering splits up the two clusters.

10.5.2 Hierarchical Clustering

# dist() compute the 50*50 inter-observation Euclidean distance matrix.
hc.complete = hclust(dist(x),method="complete")
hc.average = hclust(dist(x),method="average")
hc.single = hclust(dist(x),method="single")

complete:最长距离法
single:最短距离法
average:平均距离法

We can plot the dendrograms obtained using the usual plot().

par(mfrow=c(1,3))
plot(hc.complete,main="Complete Linkage",xlab="",sub="",cex=.9)
plot(hc.average,main="Average Linkage",xlab="",sub="",cex=.9)
plot(hc.single,main="Single Linkage",xlab="",sub="",cex=.9)

To determine the cluster labels for each observations associated with a given cut of the dendrogram, we can use the cutree() function.

cutree(hc.complete,2)

##  [1] 1 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 1 1 1 1 1 1 1 1 1 1

cutree(hc.average,2)

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 1 2 2 2 2 2
## [39] 2 2 2 2 2 1 2 1 2 2 2 2

cutree(hc.single,2)

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [39] 1 1 1 1 1 1 1 1 1 1 1 1

For this data, complete and average linkage generally separate the observations into their correct groups. However, single linkage identifies one point as belonging to its own cluster. A more sensible answer is obtained when four clusters are selected, although there are still two singletons.

cutree(hc.single,4)

##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 3 3 3
## [39] 3 3 3 4 3 1 3 1 3 3 3 3

To scale the variavles before performing hierarchical clustering of the observations, we use the scale()function.

xsc = scale(x)
plot(hclust(dist(xsc),method="complete"),main="Hierarchical Clustering with Scaled Features")

Correlation-based distance can be computed using the as.dist() function, which converts an arbitrary square symmetric matrix into a form that the hclust()function regonizes as a distance matrix. However, this only makes sense ofr data with as least three features since the absolute correlation between any two observations with measurements on two features is always 1. Hence,we will cluster a three-dimensional data set.

x = matrix(rnorm(30*30),ncol=3)
dd = as.dist(1-cor(t(x)))
plot(hclust(dd,method="complete"),main="Complete Linkage with Correlation-Based Distance",xlab="",sub="")

10.6 NCI60 Data Example

We illustrate PCA and hierarchical clustering on the NCI60 cancer cell line microarray data, which consists of 6830 gene expression measurements on 64 cancer cell lines.

library(ISLR)
nci.labs = NCI60$labs
nci.data = NCI60$data
dim(nci.data)

## [1]   64 6830

The data has 64 rows and 6830 columns.

We begin by examining the cancer types for the cell lines.

nci.labs[1:4]

## [1] "CNS"   "CNS"   "CNS"   "RENAL"

table(nci.labs)

## nci.labs
##      BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA 
##           7           5           7           1           1           6 
## MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE 
##           1           1           8           9           6           2 
##       RENAL     UNKNOWN 
##           9           1

10.6.1 PCA on the NCI60 Data

We fist perform PCA on the data after scaling the variables(genes) to have standard deviation one.

pr.out = prcomp(nci.data,scale=TRUE)

We now plot the first few principal component score vectors, in order to visualize the data. The observations corresponding to a given cancer type will be plotted in the same color, so that we can see to what extent the observations within a cancer type are similar to each other. We first create a simple function that assigns a distinct color to each element of a numeric vector. The function will be used to assign a color to each of the 64 cell lines, based on the cancer type to which it corresponds.

Cols = function(vec){
  cols = rainbow(length(unique(vec)))
  return(cols[as.numeric(as.factor(vec))])
}

rainbow() function takes as its argument a positive integer,and returns a vector containing that number of distinct colors.

We now can plot the principal component score vectors.

par(mfrow=c(1,2))
plot(pr.out$x[,1:2],col=Cols(nci.labs),pch=19,xlab="Z1",ylab="Z2")
plot(pr.out$x[,c(1,3)],col=Cols(nci.labs),pch=19,xlab="Z1",ylab="Z3")

summary(pr.out)

## Importance of components:
##                            PC1      PC2      PC3      PC4      PC5      PC6
## Standard deviation     27.8535 21.48136 19.82046 17.03256 15.97181 15.72108
## Proportion of Variance  0.1136  0.06756  0.05752  0.04248  0.03735  0.03619
## Cumulative Proportion   0.1136  0.18115  0.23867  0.28115  0.31850  0.35468
##                             PC7      PC8      PC9     PC10     PC11     PC12
## Standard deviation     14.47145 13.54427 13.14400 12.73860 12.68672 12.15769
## Proportion of Variance  0.03066  0.02686  0.02529  0.02376  0.02357  0.02164
## Cumulative Proportion   0.38534  0.41220  0.43750  0.46126  0.48482  0.50646
##                            PC13     PC14     PC15     PC16     PC17     PC18
## Standard deviation     11.83019 11.62554 11.43779 11.00051 10.65666 10.48880
## Proportion of Variance  0.02049  0.01979  0.01915  0.01772  0.01663  0.01611
## Cumulative Proportion   0.52695  0.54674  0.56590  0.58361  0.60024  0.61635
##                            PC19    PC20     PC21    PC22    PC23    PC24
## Standard deviation     10.43518 10.3219 10.14608 10.0544 9.90265 9.64766
## Proportion of Variance  0.01594  0.0156  0.01507  0.0148 0.01436 0.01363
## Cumulative Proportion   0.63229  0.6479  0.66296  0.6778 0.69212 0.70575
##                           PC25    PC26    PC27   PC28    PC29    PC30    PC31
## Standard deviation     9.50764 9.33253 9.27320 9.0900 8.98117 8.75003 8.59962
## Proportion of Variance 0.01324 0.01275 0.01259 0.0121 0.01181 0.01121 0.01083
## Cumulative Proportion  0.71899 0.73174 0.74433 0.7564 0.76824 0.77945 0.79027
##                           PC32    PC33    PC34    PC35    PC36    PC37    PC38
## Standard deviation     8.44738 8.37305 8.21579 8.15731 7.97465 7.90446 7.82127
## Proportion of Variance 0.01045 0.01026 0.00988 0.00974 0.00931 0.00915 0.00896
## Cumulative Proportion  0.80072 0.81099 0.82087 0.83061 0.83992 0.84907 0.85803
##                           PC39    PC40    PC41   PC42    PC43   PC44    PC45
## Standard deviation     7.72156 7.58603 7.45619 7.3444 7.10449 7.0131 6.95839
## Proportion of Variance 0.00873 0.00843 0.00814 0.0079 0.00739 0.0072 0.00709
## Cumulative Proportion  0.86676 0.87518 0.88332 0.8912 0.89861 0.9058 0.91290
##                          PC46    PC47    PC48    PC49    PC50    PC51    PC52
## Standard deviation     6.8663 6.80744 6.64763 6.61607 6.40793 6.21984 6.20326
## Proportion of Variance 0.0069 0.00678 0.00647 0.00641 0.00601 0.00566 0.00563
## Cumulative Proportion  0.9198 0.92659 0.93306 0.93947 0.94548 0.95114 0.95678
##                           PC53    PC54    PC55    PC56    PC57   PC58    PC59
## Standard deviation     6.06706 5.91805 5.91233 5.73539 5.47261 5.2921 5.02117
## Proportion of Variance 0.00539 0.00513 0.00512 0.00482 0.00438 0.0041 0.00369
## Cumulative Proportion  0.96216 0.96729 0.97241 0.97723 0.98161 0.9857 0.98940
##                           PC60    PC61    PC62    PC63      PC64
## Standard deviation     4.68398 4.17567 4.08212 4.04124 2.148e-14
## Proportion of Variance 0.00321 0.00255 0.00244 0.00239 0.000e+00
## Cumulative Proportion  0.99262 0.99517 0.99761 1.00000 1.000e+00

Using the plot()function, we can also plot the variance explained by the first few principal components.

plot(pr.out)

However, it is more informative to plot the PVE of each principal component and the cumulative PVE of each principal component. This can be done with just a litter work.

pve = 100*pr.out$sdev^2/sum(pr.out$sdev^2)
par(mfrow=c(1,2))
plot(pve,type="o",ylab="PVE",xlab="Principal Component",col="blue")
plot(cumsum(pve),type="o",ylab="Cumulative PVE",xlab="Principal Component",col="brown3")

pve can also be computed directly from thesummary(pr.out)$importance[2,]
’cumsum(pve)are given bysummary(pr.out)$importance[3,]`

We can see, the first seven principal components explain around 40% of the variance in the data. This is not a huge amount of the variance. However, looking at the scree plot, we see that while each of the first seven principal components explain a substantial amount of variance, there is a marked decrease in the variance explained by further principal components.

That is, there is an elbow in the plot after approximately the seventh principal component. This suggests that there may be little benefit to examining more than seven or so principal components(though even examining seven principal components may be difficult)

10.6.2 Clustering the Observations of the NCI60 Data

The goal:to find out whether or not the observations cluster into distinct types of cancer.

sd.data = scale(nci.data) # standardize the variables
data.dist = dist(sd.data)
plot(hclust(data.dist),labels=nci.labs,main="Complete Linkage",xlab="",sub="",ylab="")

plot(hclust(data.dist,method="average"),labels=nci.labs,main="Average Linkage",xlab="",sub="",ylab="")

plot(hclust(data.dist,method="single"),labels=nci.labs,main="Single Linkage",xlab="",sub="",ylab="")

We see that the choice of linkage certainly does affect the results obtained. Typically,single linkage will tend to yield trailing clusters: very large clusters onto which individual observations attach one-by-one.

On the other had, complete and average linkage tend to yield more balanced, attractive clusters.

For this reason, complete and average linkage are generally preferred to single linkage. Clearly cell lines within a single cancer type do tend to cluster together, although the clustering is not perfect.

We will use complete linkage hierarchical clustering for the analysis that follows.

We can cut the dendrogram at the height that will yield a particular number of clusters, say four:

hc.out = hclust(dist(sd.data))
hc.clusters = cutree(hc.out,4)
table(hc.clusters,nci.labs)

##            nci.labs
## hc.clusters BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro
##           1      2   3     2           0           0        0           0
##           2      3   2     0           0           0        0           0
##           3      0   0     0           1           1        6           0
##           4      2   0     5           0           0        0           1
##            nci.labs
## hc.clusters MCF7D-repro MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
##           1           0        8     8       6        2     8       1
##           2           0        0     1       0        0     1       0
##           3           0        0     0       0        0     0       0
##           4           1        0     0       0        0     0       0

All the leukemia cell lines fall in cluster 3, while the breast cancer cell lines are spared out over three different clusters. We can plot the cut on the dendrogram that produces these four clusters:

par(mfrow=c(1,1))
plot(hc.out,labels=nci.labs)
abline(h=139,col="red")

Printing the output of hclust gives a useful brief summary of the object:

hc.out

## 
## Call:
## hclust(d = dist(sd.data))
## 
## Cluster method   : complete 
## Distance         : euclidean 
## Number of objects: 64

We perform K-means clustering whit K=4.

set.seed(2)
km.out = kmeans(sd.data,4,nstart = 20)
km.clusters = km.out$cluster
table(km.clusters,hc.clusters)

##            hc.clusters
## km.clusters  1  2  3  4
##           1 11  0  0  9
##           2 20  7  0  0
##           3  9  0  0  0
##           4  0  0  8  0

We see that the four clusters obtained using hierarchical clustering and K-means clusters are somewhat different.Cluster 2 in K-means clustering is identical to cluster 3 in hierarchical clustering. However, the other clusters differ: For instance, cluster 4 in K-means clustering contains a portion of the observations assigned to cluster 1 by hierarchical clustering, as well as all of the observations assigned to cluster 2 by hierarchical clustering.

Rather than performing hierarchical clustering on the entire data matrix, we can simply perform hierarchical clustering on the first few principal component score vectors, as follows:

hc.out = hclust(dist(pr.out$x[,1:5]))
plot(hc.out,labels=nci.labs,main="Hier.Clust. on First Five Score Vectors")

table(cutree(hc.out,4),nci.labs)

##    nci.labs
##     BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro MCF7D-repro
##   1      0   2     7           0           0        2           0           0
##   2      5   3     0           0           0        0           0           0
##   3      0   0     0           1           1        4           0           0
##   4      2   0     0           0           0        0           1           1
##    nci.labs
##     MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
##   1        1     8       5        2     7       0
##   2        7     1       1        0     2       1
##   3        0     0       0        0     0       0
##   4        0     0       0        0     0       0

Not surprisingly, these results are different form the ones that we obtained when we performed hierarchical clustering on the full data set.

Unsupervised Learning

Zlgee