Learning objectives
All of this material will appear on the exam. Take notes on the
workflow, functions, and concepts.
Main objectives
- Work through a full analysis of a dataset with PCA
- Understand the connection between scree plots and the amount of
variation explained by each PC
- Learn how to make a scree plot in terms of explained variation and
how to assess it using a simple rule of thumb
Review
By the end of this lesson you will know how to..
- set a working directory in RStudio
- confirm the location of the working directory with
getwd()
- confirm a file is present with and
list.files(pattern = ...)
- load typical R data file in spreadsheet format with
read.csv()
- Use basic R functions to check data you’ve loaded
(e.g.
dim()
, summary()
, etc.)
- Create a basic PCA, make a screeplot, and make and interpret a
simple biplot.
Introduction
This Portfolio is a complete worked example of a PCA analysis of a
dataset. It covers all relevant issues
- Loading data
- Dealing with NAs
- Scaling
- Running PCA
- Building and interpreting scree plots
- Extracting PCA scores
- Creating custom scatterplots of PCA scores
This Portfolio also expands on our understanding of PCA by
introducing the concept of PCs explain a proportion of the variation of
the data. This is useful for understanding what PC does and what the PCs
mean, and also allow us to interpret screeplots more thoroughly.
Preliminaries
Check the location of your working directory with
getwd()
# run getwd()
getwd() # TODO
## [1] "/Users/james/Documents/Comp Bio Fall 2022/FinalProject"
Check for the presence of the “walsh2017morphology.csv” file in the
working directory with list.files()
# run list.files()
list.files() # TODO
## [1] "07-mean_imputation.docx"
## [2] "07-mean_imputation.html"
## [3] "07-mean_imputation.Rmd"
## [4] "08-PCA_worked.Rmd"
## [5] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
## [6] "8.21122256-21362256.ALL.chr8_GRCh38.genotypes.20170504.vcf"
## [7] "all_loci.vcf"
## [8] "bird_snps_remove_NAs.html"
## [9] "bird_snps_remove_NAs.Rmd"
## [10] "fst_exploration_in_class-STUDENT.html"
## [11] "fst_exploration_in_class-STUDENT.Rmd"
## [12] "removing_fixed_alleles.html"
## [13] "removing_fixed_alleles.Rmd"
## [14] "rsconnect"
## [15] "transpose_VCF_data.html"
## [16] "transpose_VCF_data.Rmd"
## [17] "walsh2017morphology.csv"
If you have lots of files in the working directory, you can search
for the file specifically with list.files(pattern = “walsh”)
# Run list.files() with pattern = "walsh"
list.files(pattern = "walsh") # TODO
## [1] "walsh2017morphology.csv"
Load the .csv file
CSV files can be read in with the read.csv()
function.
# add read.csv() to load the file
df <- read.csv(file = "walsh2017morphology.csv") # TODO
Always check to make sure the data looks like what you expected with
head()
, summary()
and other functions.
# run head(), summary(), and dim() on the data
# TODO
head(df)
## spp wing bill weight
## 1 NESP 56 8.5 18.2
## 2 NESP 56 8.5 20.7
## 3 NESP 59 8.0 17.6
## 4 NESP 59 8.2 16.0
## 5 NESP 60 8.3 16.5
## 6 NESP 58 8.5 16.0
summary(df)
## spp wing bill weight
## Length:73 Min. :53.00 Min. :7.900 Min. :14.5
## Class :character 1st Qu.:56.00 1st Qu.:8.400 1st Qu.:16.0
## Mode :character Median :57.00 Median :8.600 Median :17.0
## Mean :57.01 Mean :8.782 Mean :17.4
## 3rd Qu.:58.00 3rd Qu.:9.240 3rd Qu.:18.9
## Max. :60.00 Max. :9.900 Max. :21.7
## NA's :10 NA's :10 NA's :12
dim(df)
## [1] 73 4
PCA data preparation
Invariant columns
When doing PCA on SNPs you have to consider the possibility that a
SNP is fixed and all values in a column are identical. This is not
typically an issue for conventional datasets like the morphology data we
are working with here. When doing large-scale analyses with 1000s of
columns and limits on computational power it would be advisable to check
for invariant columns, which could be due to some feature of the data
you aren’t aware of or an error in data processing.
Data scaling
We always scale data for PCA. This could be done before or after
dealing with NAs and other data preparation issues.
The first column is character data so we’ll drop that when scaling
using df[, -1]
.
# make a copy of the dataframe
df2_scale <- df
# add scale() to scale the data
df2_scale[,-1] <- scale(df[,-1])
Dealing with NAs
We need to remove NAs with na.omit()
. We could impute
the missing data, but there aren’t too many values that are missing. (We
could use a for() loop to do the imputation if we wanted, or my function
mean_imputation())
# add na.omit() to remove the NAs
# assign the output to df2_scale_noNA
df2_scale_noNA <- na.omit(df2_scale) #TODO
PCA
We can run a PCA with prcomp()
. The first column is
character data so [,-1] is included to that prcomp() doesn’t get it.
# add prcomp() and assign it to an object called
## morpho_pca
morpho_pca <- prcomp(df2_scale_noNA[,-1]) #TODO
PCA Diagnostics
Its important to assess some things from the PCA output before moving
forward with interpreting a plot of the PCA results such as a biplot.
These are generally called performance diagnostics.
Scree plot
It is important when doing PCA to decide how many of the new,
re-engineered dimensions should be retained. PCA does not inherently
reduce the dimensionality of a dataset. If you put in a dataframe with
10 columns into prcomp(), you get 10 PCs (10 dimensions) out of it. Its
the job of the data analyst to decide how many of the new features (PCs)
should be considered, plot, and/or used in further analyses.
A scree plot is the typical tool for this. The base R scree plot is
pretty basic. In general, we’re looking for a steep drop between two
PCs. The PCs before the drop are most worthwhile to examine.
# add screeplot() to make the scree plot
screeplot(morpho_pca) #TODO

Explained variation
The bars on the default R scree plot represent the relative
importance of the PCs in representing the data.
More specifically, they are proportional to the amount of variation
in the data captured by each PC. The more variation represented by a PC,
the more important it is.
We can get the information on the percentage of variation in the data
using the summary()
command.
First, we get the summary information and store it to an object.
# call summary()
summary_out_morpho <- summary(morpho_pca) #TODO
Then we use a function to extract the information on explained
variation (importance) we want.
# add return(var_explained) to get output
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained) # TODO
}
This function allows us to specify how many PCs2 we want. The default
is 2 PCs, since that is what is usually plotted. Let’s get the first 3
PCs because that’s all there are in these morphological data.
# call PCA_variation(), with PCs = 3
var_out <- PCA_variation(summary_out_morpho, # TODO
PCs = 3)
var_out
## PC1 PC2 PC3
## 55.1 32.0 12.8
This means that PC1 captures 55.1% of the variation in the data. PC2
captures only 32%, and PC3 captures 12.8%.
These percentages are often reported as labels on biplots and
scatterplots of PCA results (See below).
They can also be used to decide how many PCs to work with.
Instead of the default screeplot, we can make a screeplot where the
y-axis is the percent of variation captured by each PC. We’ll use the
barplot() function to do this.
# number of dimensions in the data
N_columns <- ncol(df2_scale_noNA)
# make barplot
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

The horizontal line in the barplot is calculated as 100/(# of
dimensions in the data) If all the PCs were equally important, then the
amount of variation they would explain would be (100)/(# of columns).
These data have 3 columns, so if all the PCs were equal, they would each
explain 33% of the variation. A general rule of them is to focus on PCs
that explain more than this percentage.
PCA Biplot
Now we’ll make the biplot. Look at the biplot and interpret the
relationship between the 3 features bill, weight, and wing. Then read
the information below.
# add biplot() to see the biplot
biplot(morpho_pca) #TODO

Custom PCA Plot
Note that if the arrows aren’t plotted, its not a biplot.
Get the scores:
# call vegan::scores()
morpho_scores <- vegan::scores(morpho_pca) # TODO
Combine the scores with the species information into a dataframe.
# call data.frame()
morpho_scores2 <- data.frame(spp = df2_scale_noNA$spp,
morpho_scores)
Plot the scores, with species color-coded
# make color and shape = "spp"
ggpubr::ggscatter(data = morpho_scores2,
y = "PC2",
x = "PC1",
color = "spp", # TODO
shape = "spp", # TODO
main = "PCA Scatterplot",
xlab = "PC1 (55.1% of variation)",
ylab = "PC2 (32% of variation)")

Note how in the plot the amount of variation explained by each PC is
shown in the axis labels.
How to interpret the biplot
In the biplot created above, the “bill” and “weight” vectors point to
the left, and “wing” points straight down.
This means that bill and weight are correlated with PC1, which is
always the horizontal axis. Wing is correlated with PC2, the vertical
axis.
Bill and weight are very close to each other, so the raw data of
these features are going to be highly correlated with each other.
The “Wing” vector points straight down at a about a 90 degree (right)
angle to not only PC1, but also bill and weight. We can therefore say
that the wing vector is orthogonal to PC1, bill, and
weight.