Principal component analysis (PCA) is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The first i-th principal component can be taken as a direction orthogonal to the first i-1 principal components that maximizes the variance of the projected data.
In this study, I will apply PCA to the built-in R data set USArrests(We used this dataset for Hcluster), which contains statistics in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973 it covers also the % the population living in urban areas.
The first step will be import the dataset from stats package and clean the data.
library(stats)
head(USArrests)
## Murder Assault UrbanPop Rape
## Alabama 13.2 236 58 21.2
## Alaska 10.0 263 48 44.5
## Arizona 8.1 294 80 31.0
## Arkansas 8.8 190 50 19.5
## California 9.0 276 91 40.6
## Colorado 7.9 204 78 38.7
df <- USArrests
remove missing values and standardize the data
df <- na.omit(df)
df <- scale(df)
head(df)
## Murder Assault UrbanPop Rape
## Alabama 1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska 0.50786248 1.1068225 -1.2117642 2.484202941
## Arizona 0.07163341 1.4788032 0.9989801 1.042878388
## Arkansas 0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144 1.7589234 2.067820292
## Colorado 0.02571456 0.3988593 0.8608085 1.864967207
Calculate Covariance matrix
df.cov <- cov(df)
Calculate the eigenvalues of the covariance
df.eigen <- eigen(df.cov)
df.eigen$values
## [1] 2.4802416 0.9897652 0.3565632 0.1734301
df.eigen$vectors
## [,1] [,2] [,3] [,4]
## [1,] -0.5358995 0.4181809 -0.3412327 0.64922780
## [2,] -0.5831836 0.1879856 -0.2681484 -0.74340748
## [3,] -0.2781909 -0.8728062 -0.3780158 0.13387773
## [4,] -0.5434321 -0.1673186 0.8177779 0.08902432
Calculate variance contribution rate
(df.pov <- (df.eigen$values/sum(df.eigen$values)))
## [1] 0.62006039 0.24744129 0.08914080 0.04335752
And cumulative contribution rate
(df.cp <- cumsum(df.pov))
## [1] 0.6200604 0.8675017 0.9566425 1.0000000
Standard deviation of principal components
(df.z <- sqrt(df.eigen$values))
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
test.pr <- princomp(df, cor=TRUE)
summary(test.pr,loadings=TRUE)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 1.5748783 0.9948694 0.5971291 0.41644938
## Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
## Cumulative Proportion 0.6200604 0.8675017 0.9566425 1.00000000
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4
## Murder 0.536 0.418 0.341 0.649
## Assault 0.583 0.188 0.268 -0.743
## UrbanPop 0.278 -0.873 0.378 0.134
## Rape 0.543 -0.167 -0.818
Obviously Comp.1 + Comp2 accounts for more than 85% of total variance, so the first two principal components can be used to represent the entire data set, and the data can be reduced from 4 dimensions to 2 dimensions.
What’s more, it is easy to get the following equation
Comp1 = 0.536 * Murder + 0.583 * Assault + 0.278* UrbanPop + 0.543 * Rape Comp2 = 0.418 * Murder + 0.188 * Assault + (-0.873)* UrbanPop + (-0.167) * Rape
screeplot(test.pr, type="lines")
We can also predict
p <- predict(test.pr)
p
## Comp.1 Comp.2 Comp.3 Comp.4
## Alabama 0.98556588 1.13339238 0.44426879 0.156267145
## Alaska 1.95013775 1.07321326 -2.04000333 -0.438583440
## Arizona 1.76316354 -0.74595678 -0.05478082 -0.834652924
## Arkansas -0.14142029 1.11979678 -0.11457369 -0.182810896
## California 2.52398013 -1.54293399 -0.59855680 -0.341996478
## Colorado 1.51456286 -0.98755509 -1.09500699 0.001464887
## Connecticut -1.35864746 -1.08892789 0.64325757 -0.118469414
## Delaware 0.04770931 -0.32535892 0.71863294 -0.881977637
## Florida 3.01304227 0.03922851 0.57682949 -0.096284752
## Georgia 1.63928304 1.27894240 0.34246008 1.076796812
## Hawaii -0.91265715 -1.57046001 -0.05078189 0.902806864
## Idaho -1.63979985 0.21097292 -0.25980134 -0.499104101
## Illinois 1.37891072 -0.68184119 0.67749564 -0.122021292
## Indiana -0.50546136 -0.15156254 -0.22805484 0.424665700
## Iowa -2.25364607 -0.10405407 -0.16456432 0.017555916
## Kansas -0.79688112 -0.27016470 -0.02555331 0.206496428
## Kentucky -0.75085907 0.95844029 0.02836942 0.670556671
## Louisiana 1.56481798 0.87105466 0.78348036 0.454728038
## Maine -2.39682949 0.37639158 0.06568239 -0.330459817
## Maryland 1.76336939 0.42765519 0.15725013 -0.559069521
## Massachusetts -0.48616629 -1.47449650 0.60949748 -0.179598963
## Michigan 2.10844115 -0.15539682 -0.38486858 0.102372019
## Minnesota -1.69268181 -0.63226125 -0.15307043 0.067316885
## Mississippi 0.99649446 2.39379599 0.74080840 0.215508013
## Missouri 0.69678733 -0.26335479 -0.37744383 0.225824461
## Montana -1.18545191 0.53687437 -0.24688932 0.123742227
## Nebraska -1.26563654 -0.19395373 -0.17557391 0.015892888
## Nevada 2.87439454 -0.77560020 -1.16338049 0.314515476
## New Hampshire -2.38391541 -0.01808229 -0.03685539 -0.033137338
## New Jersey 0.18156611 -1.44950571 0.76445355 0.243382700
## New Mexico 1.98002375 0.14284878 -0.18369218 -0.339533597
## New York 1.68257738 -0.82318414 0.64307509 -0.013484369
## North Carolina 1.12337861 2.22800338 0.86357179 -0.954381667
## North Dakota -2.99222562 0.59911882 -0.30127728 -0.253987327
## Ohio -0.22596542 -0.74223824 0.03113912 0.473915911
## Oklahoma -0.31178286 -0.28785421 0.01530979 0.010332321
## Oregon 0.05912208 -0.54141145 -0.93983298 -0.237780688
## Pennsylvania -0.88841582 -0.57110035 0.40062871 0.359061124
## Rhode Island -0.86377206 -1.49197842 1.36994570 -0.613569430
## South Carolina 1.32072380 1.93340466 0.30053779 -0.131466685
## South Dakota -1.98777484 0.82334324 -0.38929333 -0.109571764
## Tennessee 0.99974168 0.86025130 -0.18808295 0.652864291
## Texas 1.35513821 -0.41248082 0.49206886 0.643195491
## Utah -0.55056526 -1.47150461 -0.29372804 -0.082314047
## Vermont -2.80141174 1.40228806 -0.84126309 -0.144889914
## Virginia -0.09633491 0.19973529 -0.01171254 0.211370813
## Washington -0.21690338 -0.97012418 -0.62487094 -0.220847793
## West Virginia -2.10858541 1.42484670 -0.10477467 0.131908831
## Wisconsin -2.07971417 -0.61126862 0.13886500 0.184103743
## Wyoming -0.62942666 0.32101297 0.24065923 -0.166651801
We can extract the first two Comps and convert them to data.frame to facilitate the drawing
test.pr.new <- data.frame(test.pr$scores[, -c(3:4)])
head(test.pr.new)
## Comp.1 Comp.2
## Alabama 0.9855659 1.1333924
## Alaska 1.9501378 1.0732133
## Arizona 1.7631635 -0.7459568
## Arkansas -0.1414203 1.1197968
## California 2.5239801 -1.5429340
## Colorado 1.5145629 -0.9875551
We can draw a plot to show the distribution of states, observe which states are unusual and which can be clustered.
library(ggplot2)
ggplot(test.pr.new, aes(x = Comp.1, y = Comp.2)) +
xlab("First Component") + ylab("Second Component") +
geom_text(alpha = 0.75, label = rownames(test.pr.new), size = 4)