It is used to summarize the high-dimensional set with a smaller number of representative variables. PCA is an unsupervised approach. Each of the dimensions found by PCA is a linear combination of the \(p\) features.
The first principal component of a set of features \(X_1,X_2, . . . ,X_p\) is the normalized linear combination of the features \(Z_1=\phi^T X\) that has the largest variance. \(\phi\) is called the loadings of the first PC. After the first principal component Z1 of the features has been determined, we can find the second principal component \(Z_2\)
states <- row.names(USArrests)
states
## [1] "Alabama" "Alaska" "Arizona" "Arkansas"
## [5] "California" "Colorado" "Connecticut" "Delaware"
## [9] "Florida" "Georgia" "Hawaii" "Idaho"
## [13] "Illinois" "Indiana" "Iowa" "Kansas"
## [17] "Kentucky" "Louisiana" "Maine" "Maryland"
## [21] "Massachusetts" "Michigan" "Minnesota" "Mississippi"
## [25] "Missouri" "Montana" "Nebraska" "Nevada"
## [29] "New Hampshire" "New Jersey" "New Mexico" "New York"
## [33] "North Carolina" "North Dakota" "Ohio" "Oklahoma"
## [37] "Oregon" "Pennsylvania" "Rhode Island" "South Carolina"
## [41] "South Dakota" "Tennessee" "Texas" "Utah"
## [45] "Vermont" "Virginia" "Washington" "West Virginia"
## [49] "Wisconsin" "Wyoming"
names(USArrests)
## [1] "Murder" "Assault" "UrbanPop" "Rape"
We first briefly examine the data. We notice that the variables have vastly different means.
apply(USArrests,2,mean)
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
apply(USArrests,2,var)
## Murder Assault UrbanPop Rape
## 18.97047 6945.16571 209.51878 87.72916
The UrbanPop variable measures the percentage of the population in each state living in an urban area, which is not a comparable number to the number of rapes in each state per 100,000 individuals.
So we shall standardize the variables to have mean zero and standard deviation one before performing PCA.
x <- scale(USArrests,center = TRUE,scal=TRUE)
apply(x,2,mean)
## Murder Assault UrbanPop Rape
## -7.663087e-17 1.112408e-16 -4.332808e-16 8.942391e-17
apply(x,2,var)
## Murder Assault UrbanPop Rape
## 1 1 1 1
We now perform principal components analysis using the prcomp() function, which centers the variables to have zero mean and scale the variabls to have standard deviation 1.
pr.out <- prcomp(USArrests , scale = TRUE)
names(pr.out)
## [1] "sdev" "rotation" "center" "scale" "x"
pr.out$center
## Murder Assault UrbanPop Rape
## 7.788 170.760 65.540 21.232
pr.out$scale
## Murder Assault UrbanPop Rape
## 4.355510 83.337661 14.474763 9.366385
each column of pr.out$rotation contains the corresponding principal component loading vector. - we see that the first loading vector places approximately equal weight on Assault, Murder, and Rape, but with much less weight on UrbanPop. Hence this component roughly corresponds to a measure of overall rates of serious crimes.
pr.out$rotation
## PC1 PC2 PC3 PC4
## Murder -0.5358995 -0.4181809 0.3412327 0.64922780
## Assault -0.5831836 -0.1879856 0.2681484 -0.74340748
## UrbanPop -0.2781909 0.8728062 0.3780158 0.13387773
## Rape -0.5434321 0.1673186 -0.8177779 0.08902432
Overall, we see that the crime-related variables (Murder, Assault, and Rape) are located close to each other, and that the UrbanPop variable is far from the other three. This indicates that the crime-related variables are correlated with each other—states with high murder rates tend to have high assault and rape rates—and that the UrbanPop variable is less correlated with the other three.
dim(pr.out$x)
## [1] 50 4
matrix x has as its columns the principal component score vectors.
biplot(pr.out , scale = 0)
we would like to use just the first few principal components in order to visualize or interpret the data. In fact, we would like to use the smallest number of principal components required to get a good understanding of the data.
plot(pr.out,type="l")
To compute the proportion of variance explained by each principal
component, we simply divide the variance explained by each principal
component by the total variance explained by all four principal
components:
pve <- pr.out$sdev^2 /sum(pr.out$sdev^2)
pve
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
cumsum(pve)
## [1] 0.6200604 0.8675017 0.9566425 1.0000000
PCA are often used in the analysis of genomic data high-dimensional data.
library(ISLR2)
nci.labs <- NCI60$labs
nci.data <- NCI60$data
Cancer types are not involving training the model, only used to see the exten to which these agree with the results of our analysis.
We first perform PCA on the data after scaling the variables.
pr.out <- prcomp(nci.data , scale = TRUE)
summary(pr.out)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 27.8535 21.48136 19.82046 17.03256 15.97181 15.72108
## Proportion of Variance 0.1136 0.06756 0.05752 0.04248 0.03735 0.03619
## Cumulative Proportion 0.1136 0.18115 0.23867 0.28115 0.31850 0.35468
## PC7 PC8 PC9 PC10 PC11 PC12
## Standard deviation 14.47145 13.54427 13.14400 12.73860 12.68672 12.15769
## Proportion of Variance 0.03066 0.02686 0.02529 0.02376 0.02357 0.02164
## Cumulative Proportion 0.38534 0.41220 0.43750 0.46126 0.48482 0.50646
## PC13 PC14 PC15 PC16 PC17 PC18
## Standard deviation 11.83019 11.62554 11.43779 11.00051 10.65666 10.48880
## Proportion of Variance 0.02049 0.01979 0.01915 0.01772 0.01663 0.01611
## Cumulative Proportion 0.52695 0.54674 0.56590 0.58361 0.60024 0.61635
## PC19 PC20 PC21 PC22 PC23 PC24
## Standard deviation 10.43518 10.3219 10.14608 10.0544 9.90265 9.64766
## Proportion of Variance 0.01594 0.0156 0.01507 0.0148 0.01436 0.01363
## Cumulative Proportion 0.63229 0.6479 0.66296 0.6778 0.69212 0.70575
## PC25 PC26 PC27 PC28 PC29 PC30 PC31
## Standard deviation 9.50764 9.33253 9.27320 9.0900 8.98117 8.75003 8.59962
## Proportion of Variance 0.01324 0.01275 0.01259 0.0121 0.01181 0.01121 0.01083
## Cumulative Proportion 0.71899 0.73174 0.74433 0.7564 0.76824 0.77945 0.79027
## PC32 PC33 PC34 PC35 PC36 PC37 PC38
## Standard deviation 8.44738 8.37305 8.21579 8.15731 7.97465 7.90446 7.82127
## Proportion of Variance 0.01045 0.01026 0.00988 0.00974 0.00931 0.00915 0.00896
## Cumulative Proportion 0.80072 0.81099 0.82087 0.83061 0.83992 0.84907 0.85803
## PC39 PC40 PC41 PC42 PC43 PC44 PC45
## Standard deviation 7.72156 7.58603 7.45619 7.3444 7.10449 7.0131 6.95839
## Proportion of Variance 0.00873 0.00843 0.00814 0.0079 0.00739 0.0072 0.00709
## Cumulative Proportion 0.86676 0.87518 0.88332 0.8912 0.89861 0.9058 0.91290
## PC46 PC47 PC48 PC49 PC50 PC51 PC52
## Standard deviation 6.8663 6.80744 6.64763 6.61607 6.40793 6.21984 6.20326
## Proportion of Variance 0.0069 0.00678 0.00647 0.00641 0.00601 0.00566 0.00563
## Cumulative Proportion 0.9198 0.92659 0.93306 0.93947 0.94548 0.95114 0.95678
## PC53 PC54 PC55 PC56 PC57 PC58 PC59
## Standard deviation 6.06706 5.91805 5.91233 5.73539 5.47261 5.2921 5.02117
## Proportion of Variance 0.00539 0.00513 0.00512 0.00482 0.00438 0.0041 0.00369
## Cumulative Proportion 0.96216 0.96729 0.97241 0.97723 0.98161 0.9857 0.98940
## PC60 PC61 PC62 PC63 PC64
## Standard deviation 4.68398 4.17567 4.08212 4.04124 1.951e-14
## Proportion of Variance 0.00321 0.00255 0.00244 0.00239 0.000e+00
## Cumulative Proportion 0.99262 0.99517 0.99761 1.00000 1.000e+00
pe <- pr.out$sdev^2 / sum(pr.out$sdev^2)
plot(pr.out)
plot(cumsum(pe),type="o")
We going into Hierarchical clustering
sd.data <- scale(nci.data)
data.dist <- dist(sd.data)
hc.out <- hclust(data.dist)
table(cutree(hc.out , 4), nci.labs)
## nci.labs
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro MCF7D-repro
## 1 2 3 2 0 0 0 0 0
## 2 3 2 0 0 0 0 0 0
## 3 0 0 0 1 1 6 0 0
## 4 2 0 5 0 0 0 1 1
## nci.labs
## MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
## 1 8 8 6 2 8 1
## 2 0 1 0 0 1 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
plot(hclust(data.dist), xlab = "", sub = "", ylab = "", labels = nci.labs)
hc.out <- hclust(dist(pr.out$x[, 1:5]))
plot(hc.out , labels = nci.labs,)
table(cutree(hc.out , 4), nci.labs)
## nci.labs
## BREAST CNS COLON K562A-repro K562B-repro LEUKEMIA MCF7A-repro MCF7D-repro
## 1 0 2 7 0 0 2 0 0
## 2 5 3 0 0 0 0 0 0
## 3 0 0 0 1 1 4 0 0
## 4 2 0 0 0 0 0 1 1
## nci.labs
## MELANOMA NSCLC OVARIAN PROSTATE RENAL UNKNOWN
## 1 1 8 5 2 7 0
## 2 7 1 1 0 2 1
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0