Required packages
library(tidyverse)
library(plotly)
library(factoextra)
library(FactoMineR)
library(corrplot)

This tutorial draws from content in Practical Guide to Principal Component Methods in R. The full book provides far greater detail. The purpose of this tutorial is to highlight a limited set of functions from the factoextra library and provide a general overview of overarching PCA concepts using interactive data based on research in Investigating higher education productivity and its measurement in Australia.
Complex, multivariate data sets can leave us scratching our heads about where to begin an investigation. Which variables are related? Which relationships are key? How can we visualize and communicate insights in a simple way?
One possible answer is to proceed with dimensionality reduction (DR) on your set of data. Principal component analysis (PCA) is a common DR method for continuous variables. So what is PCA? To answer that, we need to build a conceptual understanding of what PCA does to a set of data.
First, let’s begin with some data. A quick look at some indicative multi-dimensional data will help show the power and relevance of DR and PCA.
mock_data <- tibble(point = c("P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8", "P9", "P10",
                              "P11", "P12", "P13", "P14", "P15"),
                    var_1 = ceiling(runif(15, 0, 10)),
                    var_2 = ceiling(runif(15, 0, 10)),
                    var_3 = var_1 + var_2) %>%
  column_to_rownames(var = "point")
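# note: runif() draws random values, so without a set.seed() call your numbers will differ from the output shown below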
head(mock_data)

## var_1 var_2 var_3
## P1 6 7 13
## P2 2 4 6
## P3 7 2 9
## P4 10 1 11
## P5 6 2 8
## P6 10 3 13
The nature of this data indicates why and when DR can be so powerful. Click and drag on the plot below to explore different perspectives. What do you notice about the shape of the data?
plot_ly(mock_data, x = ~var_1, y = ~var_2, z = ~var_3,
        type = "scatter3d",
        mode = "markers",
        text = row.names(mock_data),
        hoverinfo = "text")

The data above exist along a 2-D plane in 3-D space. So, a 2-D representation of the points should suffice in capturing all the information contained in the data set. That's where PCA comes in. PCA can project these points onto a 2-D plane while maximizing the amount of variation preserved in the data set.
After you run the code below, play around with the orientation of the 3-D visual, and see if you can orient the 3-D plot to resemble the 2-D plot produced by PCA. In doing so, you can see how PCA projects the points onto the new space, while dropping an entire dimension.
mock_pca <- PCA(mock_data, scale.unit = F, graph = F)
fviz_pca_ind(mock_pca, repel = T)
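As a quick check on that intuition, you can inspect the eigenvalues of mock_pca (a small sketch using the same get_eigenvalue() helper that appears again later in this tutorial). Because var_3 is an exact sum of var_1 and var_2, the third eigenvalue should be essentially zero, and two dimensions should capture roughly 100% of the variance.

get_eigenvalue(mock_pca)  # the third eigenvalue should be (numerically) near zero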
Create some indicative university operational/performance data.

uni_data <- tibble(institution = c("Uni A", "Uni B", "Uni C", "Uni D", "Uni E", "Uni F"),
                   group = c("G2", "G2", "G1", "G2", "G1", "G1"),
                   year = c("2015", "2015", "2015", "2015", "2015", "2015"),
                   academic_staff = c(1000, 2000, 1500, 500, 3000, 2000),
                   non_academic_staff = c(750, 2250, 1500, 750, 2600, 2100),
                   ug_completions = c(3000, 6200, 4400, 1600, 8900, 5000),
                   research_pubs = c(1750, 1500, 2050, 1200, 3050, 2800),
                   cap_ex = c(2000000, 1000000, 5000000, 250000, 4000000, 3000000),
                   attrition_rate = c(0.2, 0.12, 0.15, 0.25, 0.1, 0.13))

The PCA function we're using takes a data frame with row names because the input columns must contain only continuous variables; moving the institution names into the row names keeps the identifiers without treating them as a variable.
pca_data <- uni_data %>%
  column_to_rownames(var = "institution") %>% # set the institution variable to row names
  select(-c(year, group)) # keep only continuous variables

'Eigenvalues' emerge as a result of running PCA on the data and creating new dimensions. We won't get into the linear algebra, but each eigenvalue measures the amount of variance in the data captured by its corresponding new dimension, and can be expressed as a percentage of the total variance.
uni_pca <- PCA(pca_data, scale.unit = T, graph = F)
eig_val <- get_eigenvalue(uni_pca)
print(eig_val)

## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 4.84701924 80.783654 80.78365
## Dim.2 0.80924706 13.487451 94.27110
## Dim.3 0.23205659 3.867610 98.13871
## Dim.4 0.08345689 1.390948 99.52966
## Dim.5 0.02822022 0.470337 100.00000
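Because we ran PCA with scale.unit = T, the analysis is effectively based on the correlation matrix of the data. As a rough sanity check (not something you need in practice), the eigenvalues above should match the eigenvalues of cor(pca_data); note they sum to the number of variables, and eigen() will return a sixth value that is essentially zero because there are only six observations.

eigen(cor(pca_data))$values  # should line up with the eigenvalue column above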
A 'scree plot' is a bar chart of the new dimensions and their corresponding eigenvalues. This plot tells you how much information you retain (or lose) when reducing your data to fewer dimensions.
fviz_eig(uni_pca, addlabels = T, ylim = c(0, 100))

In this example we can reduce the data to only two dimensions (Dim 1 and Dim 2) and still retain over 94% of the information contained in the data set. You can see that the PCA algorithm does produce additional dimensions for expressing the data, but the first two are enough to capture nearly the full story behind our data.
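If you want that cumulative figure programmatically rather than reading it off the plot, a one-line sketch using the eig_val object created above:

eig_val["Dim.2", "cumulative.variance.percent"]  # ~94.3% retained by the first two dimensions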
But what do these new dimensions actually mean? It's helpful to plot the new PCA dimensions with the original variables also projected onto the new space. This will help indicate:

1. how each original variable relates to the new dimensions,
2. how the original variables relate to one another, and
3. how completely each original variable is captured by the new space.

So let's plot and interpret.
fviz_pca_var(uni_pca, col.var = "black", repel = T)

The plot above shows the directions and magnitudes of the original variables when projected onto the new space defined by the principal components, Dim 1 and Dim 2. Following the ordered list above, let's interpret the plot.
1. attrition_rate is captured thoroughly by Dim 1, and attrition_rate increases as Dim 1 becomes more negative. Movement in cap_ex, alternatively, is captured almost equally by Dim 1 and Dim 2: as Dim 1 and Dim 2 become more positive, cap_ex also becomes more positive.

2. academic_staff and ug_completions are positively correlated, whereas academic_staff and attrition_rate are negatively correlated. cap_ex and non_academic_staff show almost no correlation to one another.

3. academic_staff extends almost all the way to the outside circle, meaning the variation in that variable is captured nearly completely in our new space. research_pubs, though, is not fully captured in the space, meaning that some information about this variable has been lost in this new projection.

If you're curious about exactly how well the new PCA dimensions represent the original variables, you can check. Let's visualise the results.
var <- get_pca_var(uni_pca) #this stores all key outputs created in the pca analysis
var$cos2 # the cos2 values indicate the 'quality' of the new dimensions in representing the initial variables

## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## academic_staff 0.9688451 0.02211828 0.0023588102 0.0065834862 9.433844e-05
## non_academic_staff 0.8733939 0.09659751 0.0005992045 0.0150921560 1.431726e-02
## ug_completions 0.9123414 0.04570512 0.0014412542 0.0404786815 3.356190e-05
## research_pubs 0.7361508 0.11476252 0.1463952253 0.0023014713 3.900196e-04
## cap_ex 0.4339695 0.51395749 0.0501285295 0.0004908478 1.453673e-03
## attrition_rate 0.9223187 0.01610614 0.0311335652 0.0185102507 1.193137e-02
corrplot(var$cos2, is.corr = FALSE)

Based on the information above, we can see that about 15% of the variation in research_pubs is actually captured by Dim 3, which has not been plotted. But because Dim 3 captures such a small amount of information overall, relative to the first two dimensions, interrogating that dimension further or creating an alternative plot using that dimension is not really necessary.
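If you did want to inspect Dim 3 anyway, the variable plot takes an axes argument; a quick sketch, swapping Dim 2 for Dim 3:

fviz_pca_var(uni_pca, axes = c(1, 3), col.var = "black", repel = T)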
Now let’s look at how we can further visualize PCA results.
fviz_pca_ind(uni_pca, repel = TRUE)

A great default option of this PCA package is that when you have a grouping variable, it automatically locates the centroid of each group. Cluster analysis is a natural complement to PCA because multi-dimensional clusters become much more apparent when plotted against your principal components (a small clustering sketch follows the plots below). The confidence ellipses below can be thought of as indicating the certainty about whether individuals truly belong in the specified group. Whether through grouping variables or full cluster analyses, you can begin to predict key characteristics about individual observations.
fviz_pca_ind(uni_pca,
col.ind = uni_data$group,
addEllipses = T,
ellipse.type = "confidence",
ellipse.level = 0.95,
repel = TRUE)

fviz_pca_biplot(uni_pca,
col.ind = uni_data$group,
addEllipses = T,
ellipse.type = "confidence",
ellipse.level = 0.95,
col.var = "black",
repel = TRUE)
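As mentioned above, cluster analysis pairs naturally with PCA. Here is a minimal sketch of the idea, clustering the institutions on their first two principal component coordinates with k-means; the choice of two clusters is arbitrary, and with only six institutions this is purely illustrative.

ind_coords <- get_pca_ind(uni_pca)$coord[, 1:2] # individual coordinates on Dim 1 and Dim 2
set.seed(42) # k-means uses random starting centroids
km <- kmeans(ind_coords, centers = 2, nstart = 25)
fviz_cluster(km, data = ind_coords, repel = TRUE) # plot the clusters in the reduced space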