This code chunk loads the data, drops rows with missing values, and builds a basic correlation matrix of the variables that will go into the PCA.
#load the tidyverse; read_csv() and drop_na() below both come from it
library(tidyverse)
#read in the data and omit rows with missing values
pca <- read_csv("example_pca_data.csv")
pca_na_omit <- drop_na(pca)
#select the columns with the variables that you're conducting the PCA on
pca.active <- pca_na_omit[, 3:7]
#print the general correlations among the variables
cor(pca.active)
## voice_accountability political_stability
## voice_accountability 1.0000000 0.4422135
## political_stability 0.4422135 1.0000000
## government_effectiveness 0.8473619 0.5987618
## regulatory_quality 0.8171273 0.5728140
## rule_law 0.7758457 0.7344944
## government_effectiveness regulatory_quality
## voice_accountability 0.8473619 0.8171273
## political_stability 0.5987618 0.5728140
## government_effectiveness 1.0000000 0.9220522
## regulatory_quality 0.9220522 1.0000000
## rule_law 0.9283160 0.9191059
## rule_law
## voice_accountability 0.7758457
## political_stability 0.7344944
## government_effectiveness 0.9283160
## regulatory_quality 0.9191059
## rule_law 1.0000000
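If you'd rather inspect these correlations visually, the same matrix can be plotted; here is an optional sketch (my addition) using the corrplot package, which is also loaded later in this walkthrough.
#optional: visualize the correlation matrix (assumes corrplot is installed)
library(corrplot)
corrplot(cor(pca.active), method = "number")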
This section uses two different functions to derive a set of principal components: princomp(), which works from an eigendecomposition of the correlation matrix, and prcomp(), which uses a singular value decomposition and pairs with the factoextra package's graphics. They are alternative routes to the same result, so I recommend running both. Check out this article for an overview of the differences in approaches. Here is another article from the same site that goes over some general PCA information.
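As a quick illustrative check (my addition, not part of the original walkthrough), the two functions recover the same components up to the sign of each column:
#illustrative: princomp (eigendecomposition) vs. prcomp (SVD) on the same data
p1 <- princomp(pca.active, cor = TRUE)
p2 <- prcomp(pca.active, scale. = TRUE)
round(abs(unclass(p1$loadings)) - abs(p2$rotation), 3) #should be ~0 throughout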
# Principal Components Analysis with princomp
# note the use of na.omit: if your data still has missing values at this point (it shouldn't), this will drop them
fit <- princomp(na.omit(pca.active), cor = TRUE)
summary(fit) # print summary of components
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4
## Standard deviation 2.0154529 0.7878424 0.44413505 0.27328090
## Proportion of Variance 0.8124101 0.1241391 0.03945119 0.01493649
## Cumulative Proportion 0.8124101 0.9365492 0.97600041 0.99093690
## Comp.5
## Standard deviation 0.212874424
## Proportion of Variance 0.009063104
## Cumulative Proportion 1.000000000
loadings(fit) # pc loadings
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## voice_accountability -0.432 -0.444 0.769 -0.117 0.102
## political_stability -0.360 0.855 0.304 -0.199
## government_effectiveness -0.478 -0.166 -0.197 0.702 -0.461
## regulatory_quality -0.471 -0.184 -0.435 -0.683 -0.298
## rule_law -0.482 0.104 -0.296 0.142 0.805
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## SS loadings 1.0 1.0 1.0 1.0 1.0
## Proportion Var 0.2 0.2 0.2 0.2 0.2
## Cumulative Var 0.2 0.4 0.6 0.8 1.0
# scree plot: note where the 'kink' (elbow) in the graph is. This turning
# point in the level of explained variance indicates the number of principal
# components present in the data. In this case, there is just one, given the
# precipitous fall-off in explained variance when moving from one to two.
plot(fit, type = "lines")
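# A complementary rule of thumb (an addition, not from the original notes):
# the Kaiser criterion retains components whose eigenvalue exceeds 1. With
# cor = TRUE, the eigenvalues are the squared component standard deviations.
fit$sdev^2           # eigenvalues of the correlation matrix
sum(fit$sdev^2 >= 1) # number of components the Kaiser rule would keep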
# fit$scores holds the principal component scores for each observation
# Add the 1st principal component to the dataframe
pca_na_omit$pc_scores <- fit$scores[, 1]
pca_final <- pca_na_omit
#load more libraries
library(factoextra)
library(corrplot)
# Initialize the PCA with the prcomp() function
pca_prcomp <- prcomp(na.omit(pca.active), scale. = TRUE) # scale. = TRUE standardizes the variables before the decomposition
# Visualize the eigenvalues with a scree plot.
fviz_eig(pca_prcomp, addlabels = TRUE, ylim = c(0, 85))
# Get the eigenvalues
eig.val <- get_eigenvalue(pca_prcomp)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 3.0347217 81.4618102 81.46181
## Dim.2 0.4601117 12.3508961 93.81271
## Dim.3 0.1389554 3.7300165 97.54272
## Dim.4 0.0587996 1.5783727 99.12110
## Dim.5 0.0327421 0.8789045 100.00000
# Results for Variables
res.var <- get_pca_var(pca_prcomp)
res.var$coord # Coordinates
## Dim.1 Dim.2 Dim.3 Dim.4
## voice_accountability 0.6910543 0.26856554 0.29989116 0.03885501
## political_stability 0.6347820 -0.58870837 0.09252937 0.03152814
## government_effectiveness 0.8082002 0.11702369 -0.03833692 -0.16833881
## regulatory_quality 0.9091777 0.15581164 -0.17377656 0.15492045
## rule_law 0.8212344 -0.05860807 -0.09376055 -0.06290917
## Dim.5
## voice_accountability -0.02061529
## political_stability 0.03407123
## government_effectiveness 0.09270466
## regulatory_quality 0.04020076
## rule_law -0.14472734
res.var$contrib # Contributions to the PCs
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## voice_accountability 15.73640 15.6760728 64.721974 2.567555 1.297993
## political_stability 13.27793 75.3246523 6.161460 1.690528 3.545433
## government_effectiveness 21.52380 2.9763521 1.057691 48.194131 26.248023
## regulatory_quality 27.23822 5.2763854 21.732357 40.817191 4.935851
## rule_law 22.22365 0.7465373 6.326518 6.730596 63.972699
res.var$cos2 # Quality of representation
## Dim.1 Dim.2 Dim.3 Dim.4
## voice_accountability 0.4775561 0.072127447 0.089934708 0.0015097119
## political_stability 0.4029481 0.346577546 0.008561685 0.0009940237
## government_effectiveness 0.6531875 0.013694545 0.001469719 0.0283379560
## regulatory_quality 0.8266040 0.024277267 0.030198294 0.0240003448
## rule_law 0.6744259 0.003434906 0.008791041 0.0039575633
## Dim.5
## voice_accountability 0.0004249903
## political_stability 0.0011608490
## government_effectiveness 0.0085941533
## regulatory_quality 0.0016161013
## rule_law 0.0209460036
# Visualize Quality of Representation (cos2)
corrplot(res.var$cos2, is.corr=FALSE)
# Visualize total quality of representation (cos2) of variables on Dim.1 and Dim.2 (axes=1:2)
fviz_cos2(pca_prcomp, choice = "var", axes = 1:2)
# Color by quality of representation values: quality on the factor map
fviz_pca_var(pca_prcomp, col.var = "cos2",
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
)
# Results for individual units
res.ind <- get_pca_ind(pca_prcomp)
#res.ind$coord # Coordinates
#res.ind$contrib # Contributions to the PCs
#visualize individual unit contributions (this can be messy)
corrplot(res.ind$contrib, is.corr=FALSE)
#res.ind$cos2 # Quality of representation at the unit level
# Biplot Graphic for individual units (potentially messy)
fviz_pca_ind(pca_prcomp,
             col.ind = "cos2", # Color by the quality of representation
             gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
             repel = TRUE # Avoid text overlapping
)
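If you want the variables and the individual units on a single graphic, factoextra also offers a combined biplot; here is a minimal sketch (my addition, not part of the original walkthrough).
# Combined biplot of individual units and variable arrows (potentially messy)
fviz_pca_biplot(pca_prcomp,
                col.var = "#FC4E07", # variables in one color
                col.ind = "#00AFBB", # individual units in another
                repel = TRUE # Avoid text overlapping
)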
Here are some old class notes on implementing PCAs.
prcomp Function
pr <- prcomp(data) = performs PCA on all variables and returns a prcomp object that contains information about standard deviations and rotations
pr$rotation = returns the eigenvectors for the linear combinations of all variables (the coefficients that the variables are multiplied by to come up with the principal components) \(\rightarrow\) how the principal components are created. A log transformation of the variables (adding 1 first, so zeros don't become undefined) is often applied before performing PCA.
plot(pr) = plots the percent variation explained by the first 10 principal components (PCs)
caret Package
If you're using PCA within a machine learning application, then you may want to consider this method.
pp <- preProcess(log10(training[,-58]+1), method="pca", pcaComp=2, thresh=0.8) = performs PCA with the preProcess function and returns a preProcess object that can be applied to new data with the predict function
pcaComp=2 = specifies the number of principal components to compute (2 in this case)
thresh=0.8 = threshold for the variation captured by the principal components
thresh=0.95 = default value, which returns the number of principal components needed to capture 95% of the variation in the data
predict(pp, training) = computes the new PC variables (2 in this case) for the training data set; the output of predict can then be used as data for the prediction model
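A self-contained illustration of that caret workflow, using the built-in iris measurements in place of the course's training set (my example, not from the notes):
# hedged sketch: PCA preprocessing with caret's preProcess/predict pair
library(caret)
pp <- preProcess(iris[, 1:4], method = "pca", pcaComp = 2)
pcs <- predict(pp, iris[, 1:4]) # data frame with new columns PC1 and PC2
head(pcs)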