Load data

This code chunk loads and cleans the PCA data. It also creates a basic correlation matrix of the variables you’re using for the PCA.

#load the tidyverse, then read in the data and drop missing values. Note that
#read_csv() comes from the readr package and drop_na() from tidyr, both part of the tidyverse
library(tidyverse)
pca <- read_csv("example_pca_data.csv")
pca_na_omit <- drop_na(pca)

#select the columns with the variables that you're conducting the PCA on
pca.active <- pca_na_omit[, 3:7]

#print the general correlations among the variables
cor(pca.active)
##                          voice_accountability political_stability
## voice_accountability                1.0000000           0.4422135
## political_stability                 0.4422135           1.0000000
## government_effectiveness            0.8473619           0.5987618
## regulatory_quality                  0.8171273           0.5728140
## rule_law                            0.7758457           0.7344944
##                          government_effectiveness regulatory_quality
## voice_accountability                    0.8473619          0.8171273
## political_stability                     0.5987618          0.5728140
## government_effectiveness                1.0000000          0.9220522
## regulatory_quality                      0.9220522          1.0000000
## rule_law                                0.9283160          0.9191059
##                           rule_law
## voice_accountability     0.7758457
## political_stability      0.7344944
## government_effectiveness 0.9283160
## regulatory_quality       0.9191059
## rule_law                 1.0000000
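
The variables are all strongly, positively correlated, which is exactly the situation PCA is built for. For a quick visual check of the matrix, here is a minimal sketch using the corrplot package (loaded again in the graphics section below):

#visualize the correlation matrix; larger, darker circles indicate stronger correlations
library(corrplot)
corrplot(cor(pca.active))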

PCA in R

This section uses two different functions, princomp() and prcomp(), to derive a set of principal components. They are alternative routes to the same result. The first method works well on its own, but the second pairs with the factoextra package to create some useful graphics, so I recommend running both. Check out this article for an overview of the differences between the two approaches. Here is another article from the same site that covers some general PCA background.

# Principal Components Analysis with princomp

# note the use of na.omit: if your data still have missing values at this point
# (which they shouldn't), this call removes them. cor = TRUE bases the PCA on
# the correlation matrix rather than the covariance matrix.
fit <- princomp(na.omit(pca.active), cor = TRUE)

summary(fit) # print summary of components
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4
## Standard deviation     2.0154529 0.7878424 0.44413505 0.27328090
## Proportion of Variance 0.8124101 0.1241391 0.03945119 0.01493649
## Cumulative Proportion  0.8124101 0.9365492 0.97600041 0.99093690
##                             Comp.5
## Standard deviation     0.212874424
## Proportion of Variance 0.009063104
## Cumulative Proportion  1.000000000
loadings(fit) # pc loadings
## 
## Loadings:
##                          Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## voice_accountability     -0.432 -0.444  0.769 -0.117  0.102
## political_stability      -0.360  0.855  0.304        -0.199
## government_effectiveness -0.478 -0.166 -0.197  0.702 -0.461
## regulatory_quality       -0.471 -0.184 -0.435 -0.683 -0.298
## rule_law                 -0.482  0.104 -0.296  0.142  0.805
## 
##                Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## SS loadings       1.0    1.0    1.0    1.0    1.0
## Proportion Var    0.2    0.2    0.2    0.2    0.2
## Cumulative Var    0.2    0.4    0.6    0.8    1.0
# scree plot. Note where the 'elbow' (kink) in the graph is: this turning point
# in explained variance indicates how many principal components to retain. Here
# there is just one, given the precipitous fall-off in explained variance when
# moving from component one to component two.
plot(fit, type = "lines")
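# If you prefer a numeric version of the scree plot, the proportion of variance
# explained can be computed directly from the component standard deviations
# (a small sketch using the 'fit' object created above):
pve <- fit$sdev^2 / sum(fit$sdev^2)
round(pve, 3)          # proportion of variance per component
round(cumsum(pve), 3)  # cumulative proportion, matching summary(fit) above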

# fit$scores contains the principal component scores for each observation

# Add the 1st principal component to the dataframe (fit$scores is a matrix,
# so take its first column)
pca_na_omit$pc_scores <- fit$scores[, 1]

pca_final <- pca_na_omit
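
As a quick sanity check on the stored scores, the first principal component should correlate strongly with each of the original variables (the sign is arbitrary in PCA). A minimal sketch using the pc_scores column created above:

#inspect the first few component scores
head(pca_final$pc_scores)

#the first PC should correlate strongly with every input variable
cor(pca_final$pc_scores, pca.active)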

Here is a code chunk that creates additional graphics for the PCA, using the factoextra and corrplot packages.

#load more libraries
library(factoextra)
library(corrplot)

# Initialize the PCA with the prcomp() function. Note that the scaling argument
# is named 'scale.' and must be passed to prcomp() itself, not to na.omit()
pca_prcomp <- prcomp(na.omit(pca.active), scale. = TRUE)

# Visualize the eigenvalues with a scree plot. 
fviz_eig(pca_prcomp, addlabels = TRUE, ylim = c(0, 85))

# Get the eigenvalues
eig.val <- get_eigenvalue(pca_prcomp)
eig.val
##       eigenvalue variance.percent cumulative.variance.percent
## Dim.1  3.0347217       81.4618102                    81.46181
## Dim.2  0.4601117       12.3508961                    93.81271
## Dim.3  0.1389554        3.7300165                    97.54272
## Dim.4  0.0587996        1.5783727                    99.12110
## Dim.5  0.0327421        0.8789045                   100.00000
# Results for Variables
res.var <- get_pca_var(pca_prcomp)
res.var$coord          # Coordinates
##                              Dim.1       Dim.2       Dim.3       Dim.4
## voice_accountability     0.6910543  0.26856554  0.29989116  0.03885501
## political_stability      0.6347820 -0.58870837  0.09252937  0.03152814
## government_effectiveness 0.8082002  0.11702369 -0.03833692 -0.16833881
## regulatory_quality       0.9091777  0.15581164 -0.17377656  0.15492045
## rule_law                 0.8212344 -0.05860807 -0.09376055 -0.06290917
##                                Dim.5
## voice_accountability     -0.02061529
## political_stability       0.03407123
## government_effectiveness  0.09270466
## regulatory_quality        0.04020076
## rule_law                 -0.14472734
res.var$contrib        # Contributions to the PCs
##                             Dim.1      Dim.2     Dim.3     Dim.4     Dim.5
## voice_accountability     15.73640 15.6760728 64.721974  2.567555  1.297993
## political_stability      13.27793 75.3246523  6.161460  1.690528  3.545433
## government_effectiveness 21.52380  2.9763521  1.057691 48.194131 26.248023
## regulatory_quality       27.23822  5.2763854 21.732357 40.817191  4.935851
## rule_law                 22.22365  0.7465373  6.326518  6.730596 63.972699
res.var$cos2           # Quality of representation
##                              Dim.1       Dim.2       Dim.3        Dim.4
## voice_accountability     0.4775561 0.072127447 0.089934708 0.0015097119
## political_stability      0.4029481 0.346577546 0.008561685 0.0009940237
## government_effectiveness 0.6531875 0.013694545 0.001469719 0.0283379560
## regulatory_quality       0.8266040 0.024277267 0.030198294 0.0240003448
## rule_law                 0.6744259 0.003434906 0.008791041 0.0039575633
##                                 Dim.5
## voice_accountability     0.0004249903
## political_stability      0.0011608490
## government_effectiveness 0.0085941533
## regulatory_quality       0.0016161013
## rule_law                 0.0209460036
# Visualize Quality of Representation (cos2)
corrplot(res.var$cos2, is.corr=FALSE)

# Visualize total quality of representation (cos2) of variables on Dim.1 and Dim.2 (axes=1:2)
fviz_cos2(pca_prcomp, choice = "var", axes = 1:2)

# Color by quality of representation values: quality on the factor map
fviz_pca_var(pca_prcomp, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"), 
             repel = TRUE # Avoid text overlapping
             )
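
# The same factor map can instead be colored by each variable's contribution
# to the components (col.var = "contrib" is also supported by factoextra):
fviz_pca_var(pca_prcomp, col.var = "contrib",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE)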

# Results for individual units
res.ind <- get_pca_ind(pca_prcomp)
#res.ind$coord          # Coordinates
#res.ind$contrib        # Contributions to the PCs

#visualize individual unit contributions (this can be messy)
corrplot(res.ind$contrib, is.corr=FALSE)

#res.ind$cos2           # Quality of representation at the unit level

# Biplot Graphic for individual units (potentially messy)
fviz_pca_ind(pca_prcomp,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE     # Avoid text overlapping
             )
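
factoextra can also overlay the variables and the individual units on a single plot. Here is a minimal sketch using its fviz_pca_biplot() function; with many units this, too, can get messy:

# Biplot combining individual units (points) and variables (arrows)
fviz_pca_biplot(pca_prcomp,
                col.var = "#FC4E07",  # variable arrows
                col.ind = "#00AFBB",  # individual points
                repel = TRUE)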

Notes on Preprocessing with Principal Component Analysis (PCA)

Here are some old class notes on implementing PCAs.

The prcomp Function

  • pr <- prcomp(data) = performs PCA on all variables and returns a prcomp object containing the component standard deviations and the rotation matrix
    • pr$rotation = returns the eigenvectors for the linear combinations of all variables (the coefficients the variables are multiplied by to come up with the principal components) \(\rightarrow\) how the principal components are created
    • it is often useful to take the log transformation of the variables, adding 1 first, before performing PCA (see the sketch after this list)
      • helps to reduce skewness or otherwise strange distributions in the data
      • log(0) = -Inf, so we add 1 to account for zero values
      • makes the data more Gaussian
    • plot(pr) = plots the variance explained by each of the leading (up to 10) principal components (PCs)
      • can be used to find the PCs that capture the most variation
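
A minimal sketch of that log-transform workflow, using a hypothetical data frame df of non-negative, skewed variables:

#hypothetical skewed, non-negative data frame 'df'
pr <- prcomp(log10(df + 1))  # the +1 guards against log10(0) = -Inf
pr$rotation                  # eigenvectors: how each PC combines the variables
plot(pr)                     # variances of the leading (up to 10) components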

Using the caret Package

If you’re using PCA within a machine learning application, then you may want to consider this method.

  • pp <- preProcess(log10(training[,-58]+1), method="pca", pcaComp=2, thresh=0.8) = performs PCA with the preProcess function and returns a recipe for the principal components that capture the bulk of the variation
    • creates a preProcess object that can be applied using the predict function
    • pcaComp=2 = specifies the exact number of principal components to keep (2 in this case); when set, it overrides thresh
    • thresh=0.8 = threshold for the share of variation the principal components must capture
      • thresh=0.95 is the default value, which keeps however many principal components are needed to capture 95% of the variation in the data
  • predict(pp, training) = computes the new PC variables (2 in this case) for the training data set (see the sketch after this list)
    • the results from predict can then be used as data for the prediction model
    • Note: the same PCA transformation (the same pp object) must be applied to the test set
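
A minimal sketch of the full caret workflow, assuming hypothetical training and testing data frames whose 58th column is the outcome:

library(caret)

#learn the PCA transformation from the training predictors only
pp <- preProcess(log10(training[, -58] + 1), method = "pca", thresh = 0.8)

#apply the SAME transformation to both data sets
train_pcs <- predict(pp, log10(training[, -58] + 1))
test_pcs  <- predict(pp, log10(testing[, -58] + 1))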