Learning objectives

All of this material will appear on the exam. Take notes on the workflow, functions, and concepts.

Main objectives

  • Work through a full analysis of a dataset with PCA
  • Understand the connection between scree plots and the amount of variation explained by each PC
  • Learn how to make a scree plot in terms of explained variation and how to assess it using a simple rule of thumb

Review

By the end of this lesson you will know how to:

  • set a working directory in RStudio
  • confirm the location of the working directory with getwd()
  • confirm a file is present with list.files(pattern = ...)
  • load a typical R data file in spreadsheet format with read.csv()
  • use basic R functions to check data you’ve loaded (e.g. dim(), summary(), etc.)
  • create a basic PCA, make a scree plot, and make and interpret a simple biplot.

Introduction

This Portfolio is a complete worked example of a PCA analysis of a dataset. It covers all the relevant steps:

  1. Loading data
  2. Dealing with NAs
  3. Scaling
  4. Running PCA
  5. Building and interpreting scree plots
  6. Extracting PCA scores
  7. Creating custom scatterplots of PCA scores

This Portfolio also expands on our understanding of PCA by introducing the idea that each PC explains a proportion of the variation in the data. This is useful for understanding what PCA does and what the PCs mean, and it also allows us to interpret scree plots more thoroughly.

Preliminaries

Check the location of your working directory with getwd()

# run getwd()
getwd()                # TODO
## [1] "/Users/james/Documents/Comp Bio Fall 2022/FinalProject"

Check for the presence of the “walsh2017morphology.csv” file in the working directory with list.files()

# run list.files()
list.files()           # TODO
##  [1] "07-mean_imputation.docx"                                        
##  [2] "07-mean_imputation.html"                                        
##  [3] "07-mean_imputation.Rmd"                                         
##  [4] "08-PCA_worked.Rmd"                                              
##  [5] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
##  [6] "8.21122256-21362256.ALL.chr8_GRCh38.genotypes.20170504.vcf"     
##  [7] "all_loci.vcf"                                                   
##  [8] "bird_snps_remove_NAs.html"                                      
##  [9] "bird_snps_remove_NAs.Rmd"                                       
## [10] "fst_exploration_in_class-STUDENT.html"                          
## [11] "fst_exploration_in_class-STUDENT.Rmd"                           
## [12] "removing_fixed_alleles.html"                                    
## [13] "removing_fixed_alleles.Rmd"                                     
## [14] "rsconnect"                                                      
## [15] "transpose_VCF_data.html"                                        
## [16] "transpose_VCF_data.Rmd"                                         
## [17] "walsh2017morphology.csv"

If you have lots of files in the working directory, you can search for the file specifically with list.files(pattern = "walsh")

# Run list.files() with pattern = "walsh"
list.files(pattern = "walsh")                  # TODO
## [1] "walsh2017morphology.csv"

Load the .csv file

CSV files can be read in with the read.csv() function.

# add read.csv() to load the file
df <- read.csv(file = "walsh2017morphology.csv") # TODO

Always check to make sure the data look like what you expect with head(), summary(), and other functions.

# run head(), summary(), and dim() on the data
head(df)
##    spp wing bill weight
## 1 NESP   56  8.5   18.2
## 2 NESP   56  8.5   20.7
## 3 NESP   59  8.0   17.6
## 4 NESP   59  8.2   16.0
## 5 NESP   60  8.3   16.5
## 6 NESP   58  8.5   16.0
summary(df)
##      spp                 wing            bill           weight    
##  Length:73          Min.   :53.00   Min.   :7.900   Min.   :14.5  
##  Class :character   1st Qu.:56.00   1st Qu.:8.400   1st Qu.:16.0  
##  Mode  :character   Median :57.00   Median :8.600   Median :17.0  
##                     Mean   :57.01   Mean   :8.782   Mean   :17.4  
##                     3rd Qu.:58.00   3rd Qu.:9.240   3rd Qu.:18.9  
##                     Max.   :60.00   Max.   :9.900   Max.   :21.7  
##                     NA's   :10      NA's   :10      NA's   :12
dim(df)
## [1] 73  4

PCA data preparation

Invariant columns

When doing PCA on SNPs you have to consider the possibility that a SNP is fixed and all values in a column are identical. This is not typically an issue for conventional datasets like the morphology data we are working with here. When doing large-scale analyses with 1000s of columns and limits on computational power, though, it is advisable to check for invariant columns, which could be due to some feature of the data you aren’t aware of or an error in data processing.
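
As a quick sanity check, here is one way you could screen for invariant columns in the numeric part of a dataframe (a minimal sketch using the morphology data; the [, -1] assumes the first column is the character column, as it is here):

# flag numeric columns where every non-NA value is identical
invariant <- apply(df[, -1], 2, function(x) length(unique(na.omit(x))) <= 1)

# show the names of any invariant columns (none are expected here)
names(invariant)[invariant]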

Data scaling

We always scale data for PCA. This could be done before or after dealing with NAs and other data preparation issues.

The first column is character data so we’ll drop that when scaling using df[, -1].

# make a copy of the dataframe
df2_scale       <- df

# add scale() to scale the data
df2_scale[,-1] <- scale(df[,-1])  

Dealing with NAs

We need to remove NAs with na.omit(). We could impute the missing data instead, but there aren’t many values missing. (We could use a for() loop to do the imputation if we wanted, or my function mean_imputation(); a sketch of that approach appears after the code below.)

# add na.omit() to remove the NAs
# assign the output to df2_scale_noNA

df2_scale_noNA <- na.omit(df2_scale) #TODO
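
If we did want to impute instead, a mean-imputation loop might look something like this sketch (an illustration only; this is not the mean_imputation() function mentioned above):

# replace the NAs in each numeric column with that column's mean
df2_imputed <- df2_scale
for(j in 2:ncol(df2_imputed)){               # skip the character species column
  column_mean             <- mean(df2_imputed[, j], na.rm = TRUE)
  missing                 <- is.na(df2_imputed[, j])
  df2_imputed[missing, j] <- column_mean
}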

PCA

We can run a PCA with prcomp(). The first column is character data, so [,-1] is included so that prcomp() doesn’t get it.

# add prcomp() and assign it to an object called
## morpho_pca
 morpho_pca <- prcomp(df2_scale_noNA[,-1]) #TODO

PCA Diagnostics

It’s important to assess a few things from the PCA output before moving forward with interpreting a plot of the PCA results, such as a biplot. These checks are generally called performance diagnostics.

Scree plot

It is important when doing PCA to decide how many of the new, re-engineered dimensions should be retained. PCA does not inherently reduce the dimensionality of a dataset: if you put a dataframe with 10 columns into prcomp(), you get 10 PCs (10 dimensions) out. It’s the job of the data analyst to decide how many of the new features (PCs) should be considered, plotted, and/or used in further analyses.
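
You can confirm this with our data: the three numeric columns given to prcomp() produce three PCs (a quick check using the morpho_pca object created above):

# the number of PCs returned equals the number of columns given to prcomp()
ncol(morpho_pca$x)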

A scree plot is the typical tool for this. The base R scree plot is pretty basic. In general, we’re looking for a steep drop between two PCs. The PCs before the drop are most worthwhile to examine.

# add screeplot() to make the scree plot
screeplot(morpho_pca) #TODO

Explained variation

The bars on the default R scree plot show the relative importance of the PCs in representing the data.

More specifically, they are proportional to the amount of variation in the data captured by each PC. The more variation represented by a PC, the more important it is.

We can get the information on the percentage of variation in the data using the summary() command.

First, we get the summary information and store it to an object.

# call summary()
summary_out_morpho <- summary(morpho_pca) #TODO
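
The object returned by summary() stores this information in an importance matrix whose rows are "Standard deviation", "Proportion of Variance", and "Cumulative Proportion". You can inspect it directly before extracting anything:

# row 2 ("Proportion of Variance") is the row the function below uses
summary_out_morpho$importance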

Then we use a function to extract the information on explained variation (importance) we want.

# add return(var_explained) to get output
PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)   # TODO
}

This function allows us to specify how many PCs we want. The default is 2 PCs, since that is what is usually plotted. Let’s get the first 3 PCs because that’s all there are in these morphological data.

# call PCA_variation(), with PCs = 3
var_out <- PCA_variation(summary_out_morpho, # TODO
                         PCs = 3)
var_out
##  PC1  PC2  PC3 
## 55.1 32.0 12.8

This means that PC1 captures 55.1% of the variation in the data. PC2 captures only 32%, and PC3 captures 12.8%.

These percentages are often reported as labels on biplots and scatterplots of PCA results (See below).

They can also be used to decide how many PCs to work with.

Instead of the default screeplot, we can make a screeplot where the y-axis is the percent of variation captured by each PC. We’ll use the barplot() function to do this.

# number of dimensions (numeric columns) used in the PCA
N_columns <- ncol(df2_scale_noNA[,-1])

# make barplot
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

The horizontal line in the barplot is calculated as 100/(# of dimensions in the data). If all the PCs were equally important, each would explain 100/(# of columns) percent of the variation. These data have 3 numeric columns, so if all the PCs were equal, they would each explain about 33% of the variation. A general rule of thumb is to focus on PCs that explain more than this percentage.
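
As a quick check, you can compare each PC’s explained variation against this threshold directly (a small sketch using the objects created above):

# TRUE for PCs that explain more than the "equal importance" cutoff
var_out > 100/N_columns

Only PC1 clears the cutoff, which matches what the scree plot shows.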

PCA Biplot

Now we’ll make the biplot. Look at the biplot and interpret the relationship between the 3 features bill, weight, and wing. Then read the information below.

# add biplot() to see the biplot
biplot(morpho_pca) #TODO

Custom PCA Plot

Note that if the arrows aren’t plotted, it’s not a biplot.

Get the scores:

# call vegan::scores()
morpho_scores  <- vegan::scores(morpho_pca) # TODO
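
(Aside: for a prcomp object, the raw scores are also stored in the $x component, so you could pull them out by hand; vegan::scores() is simply a convenient way to extract them.)

# the raw PC scores also live inside the prcomp object itself
head(morpho_pca$x)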

Combine the scores with the species information into a dataframe.

# call data.frame()
morpho_scores2 <- data.frame(spp = df2_scale_noNA$spp,
                             morpho_scores)

Plot the scores, with species color-coded

# make color and shape = "spp"
ggpubr::ggscatter(data = morpho_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "spp",   # TODO
                  shape = "spp",   # TODO
                  main = "PCA Scatterplot",
                  xlab = "PC1 (55.1% of variation)",
                  ylab = "PC2 (32% of variation)")
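
The percentages in the axis labels above were typed by hand; they could also be built programmatically from var_out (a small sketch):

# build axis labels from the explained-variation values
xlab_text <- paste0("PC1 (", var_out[1], "% of variation)")
ylab_text <- paste0("PC2 (", var_out[2], "% of variation)")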

Note how in the plot the amount of variation explained by each PC is shown in the axis labels.

How to interpret the biplot

In the biplot created above, the “bill” and “weight” vectors point to the left, and “wing” points straight down.

This means that bill and weight are correlated with PC1, which is always the horizontal axis. Wing is correlated with PC2, the vertical axis.

Bill and weight are very close to each other, so the raw data of these features are going to be highly correlated with each other.

The “wing” vector points straight down, at about a 90 degree (right) angle not only to PC1, but also to bill and weight. We can therefore say that the wing vector is orthogonal to PC1, bill, and weight.