Principal component analysis (PCA) is used in exploratory data analysis and for making predictive models. It is commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data’s variation as possible. The first principal component can equivalently be defined as a direction that maximizes the variance of the projected data. The first i-th principal component can be taken as a direction orthogonal to the first i-1 principal components that maximizes the variance of the projected data.

In this study, I will apply PCA to the built-in R data set USArrests(We used this dataset for Hcluster), which contains statistics in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states in 1973 it covers also the % the population living in urban areas.

The first step will be import the dataset from stats package and clean the data.

library(stats)  
head(USArrests)
##            Murder Assault UrbanPop Rape
## Alabama      13.2     236       58 21.2
## Alaska       10.0     263       48 44.5
## Arizona       8.1     294       80 31.0
## Arkansas      8.8     190       50 19.5
## California    9.0     276       91 40.6
## Colorado      7.9     204       78 38.7
df <- USArrests

remove missing values and standardize the data

df <- na.omit(df)
df <- scale(df)
head(df)
##                Murder   Assault   UrbanPop         Rape
## Alabama    1.24256408 0.7828393 -0.5209066 -0.003416473
## Alaska     0.50786248 1.1068225 -1.2117642  2.484202941
## Arizona    0.07163341 1.4788032  0.9989801  1.042878388
## Arkansas   0.23234938 0.2308680 -1.0735927 -0.184916602
## California 0.27826823 1.2628144  1.7589234  2.067820292
## Colorado   0.02571456 0.3988593  0.8608085  1.864967207

Calculate Covariance matrix

df.cov <- cov(df)

Calculate the eigenvalues of the covariance

df.eigen <- eigen(df.cov) 
df.eigen$values
## [1] 2.4802416 0.9897652 0.3565632 0.1734301
df.eigen$vectors
##            [,1]       [,2]       [,3]        [,4]
## [1,] -0.5358995  0.4181809 -0.3412327  0.64922780
## [2,] -0.5831836  0.1879856 -0.2681484 -0.74340748
## [3,] -0.2781909 -0.8728062 -0.3780158  0.13387773
## [4,] -0.5434321 -0.1673186  0.8177779  0.08902432

Calculate variance contribution rate

(df.pov <- (df.eigen$values/sum(df.eigen$values)))
## [1] 0.62006039 0.24744129 0.08914080 0.04335752

And cumulative contribution rate

(df.cp <- cumsum(df.pov))
## [1] 0.6200604 0.8675017 0.9566425 1.0000000

Standard deviation of principal components

(df.z <- sqrt(df.eigen$values))
## [1] 1.5748783 0.9948694 0.5971291 0.4164494
test.pr <- princomp(df, cor=TRUE)
summary(test.pr,loadings=TRUE)  
## Importance of components:
##                           Comp.1    Comp.2    Comp.3     Comp.4
## Standard deviation     1.5748783 0.9948694 0.5971291 0.41644938
## Proportion of Variance 0.6200604 0.2474413 0.0891408 0.04335752
## Cumulative Proportion  0.6200604 0.8675017 0.9566425 1.00000000
## 
## Loadings:
##          Comp.1 Comp.2 Comp.3 Comp.4
## Murder    0.536  0.418  0.341  0.649
## Assault   0.583  0.188  0.268 -0.743
## UrbanPop  0.278 -0.873  0.378  0.134
## Rape      0.543 -0.167 -0.818

Obviously Comp.1 + Comp2 accounts for more than 85% of total variance, so the first two principal components can be used to represent the entire data set, and the data can be reduced from 4 dimensions to 2 dimensions.

What’s more, it is easy to get the following equation

Comp1 = 0.536 * Murder + 0.583 * Assault + 0.278* UrbanPop + 0.543 * Rape Comp2 = 0.418 * Murder + 0.188 * Assault + (-0.873)* UrbanPop + (-0.167) * Rape

screeplot(test.pr, type="lines")

We can also predict

p <- predict(test.pr)
p
##                     Comp.1      Comp.2      Comp.3       Comp.4
## Alabama         0.98556588  1.13339238  0.44426879  0.156267145
## Alaska          1.95013775  1.07321326 -2.04000333 -0.438583440
## Arizona         1.76316354 -0.74595678 -0.05478082 -0.834652924
## Arkansas       -0.14142029  1.11979678 -0.11457369 -0.182810896
## California      2.52398013 -1.54293399 -0.59855680 -0.341996478
## Colorado        1.51456286 -0.98755509 -1.09500699  0.001464887
## Connecticut    -1.35864746 -1.08892789  0.64325757 -0.118469414
## Delaware        0.04770931 -0.32535892  0.71863294 -0.881977637
## Florida         3.01304227  0.03922851  0.57682949 -0.096284752
## Georgia         1.63928304  1.27894240  0.34246008  1.076796812
## Hawaii         -0.91265715 -1.57046001 -0.05078189  0.902806864
## Idaho          -1.63979985  0.21097292 -0.25980134 -0.499104101
## Illinois        1.37891072 -0.68184119  0.67749564 -0.122021292
## Indiana        -0.50546136 -0.15156254 -0.22805484  0.424665700
## Iowa           -2.25364607 -0.10405407 -0.16456432  0.017555916
## Kansas         -0.79688112 -0.27016470 -0.02555331  0.206496428
## Kentucky       -0.75085907  0.95844029  0.02836942  0.670556671
## Louisiana       1.56481798  0.87105466  0.78348036  0.454728038
## Maine          -2.39682949  0.37639158  0.06568239 -0.330459817
## Maryland        1.76336939  0.42765519  0.15725013 -0.559069521
## Massachusetts  -0.48616629 -1.47449650  0.60949748 -0.179598963
## Michigan        2.10844115 -0.15539682 -0.38486858  0.102372019
## Minnesota      -1.69268181 -0.63226125 -0.15307043  0.067316885
## Mississippi     0.99649446  2.39379599  0.74080840  0.215508013
## Missouri        0.69678733 -0.26335479 -0.37744383  0.225824461
## Montana        -1.18545191  0.53687437 -0.24688932  0.123742227
## Nebraska       -1.26563654 -0.19395373 -0.17557391  0.015892888
## Nevada          2.87439454 -0.77560020 -1.16338049  0.314515476
## New Hampshire  -2.38391541 -0.01808229 -0.03685539 -0.033137338
## New Jersey      0.18156611 -1.44950571  0.76445355  0.243382700
## New Mexico      1.98002375  0.14284878 -0.18369218 -0.339533597
## New York        1.68257738 -0.82318414  0.64307509 -0.013484369
## North Carolina  1.12337861  2.22800338  0.86357179 -0.954381667
## North Dakota   -2.99222562  0.59911882 -0.30127728 -0.253987327
## Ohio           -0.22596542 -0.74223824  0.03113912  0.473915911
## Oklahoma       -0.31178286 -0.28785421  0.01530979  0.010332321
## Oregon          0.05912208 -0.54141145 -0.93983298 -0.237780688
## Pennsylvania   -0.88841582 -0.57110035  0.40062871  0.359061124
## Rhode Island   -0.86377206 -1.49197842  1.36994570 -0.613569430
## South Carolina  1.32072380  1.93340466  0.30053779 -0.131466685
## South Dakota   -1.98777484  0.82334324 -0.38929333 -0.109571764
## Tennessee       0.99974168  0.86025130 -0.18808295  0.652864291
## Texas           1.35513821 -0.41248082  0.49206886  0.643195491
## Utah           -0.55056526 -1.47150461 -0.29372804 -0.082314047
## Vermont        -2.80141174  1.40228806 -0.84126309 -0.144889914
## Virginia       -0.09633491  0.19973529 -0.01171254  0.211370813
## Washington     -0.21690338 -0.97012418 -0.62487094 -0.220847793
## West Virginia  -2.10858541  1.42484670 -0.10477467  0.131908831
## Wisconsin      -2.07971417 -0.61126862  0.13886500  0.184103743
## Wyoming        -0.62942666  0.32101297  0.24065923 -0.166651801

We can extract the first two Comps and convert them to data.frame to facilitate the drawing

test.pr.new <- data.frame(test.pr$scores[, -c(3:4)])
head(test.pr.new)
##                Comp.1     Comp.2
## Alabama     0.9855659  1.1333924
## Alaska      1.9501378  1.0732133
## Arizona     1.7631635 -0.7459568
## Arkansas   -0.1414203  1.1197968
## California  2.5239801 -1.5429340
## Colorado    1.5145629 -0.9875551

We can draw a plot to show the distribution of states, observe which states are unusual and which can be clustered.

library(ggplot2)
ggplot(test.pr.new, aes(x = Comp.1, y = Comp.2)) +
  xlab("First Component") + ylab("Second Component") +
  geom_text(alpha = 0.75, label = rownames(test.pr.new), size = 4)