The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
## Warning: package 'vcfR' was built under R version 4.2.2
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-2
library(ggplot2)
library(ggpubr)
Set the working directory
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
By centering the data, it makes sure that variance of each variable is comparable and not influenced by the scale of the original values.
SNPs_scaled <- scale(SNPs_cleaned)
Now we can run a PCA analysis on the data. The prcomp takes the SNPs we provided in a data frame and turns it into a list of objects with the PCA results.
pca_scaled <- prcomp(SNPs_scaled)
A screeplot can help visualize the amount of variance explained by each component. This will help decide the number of PCs to retain for further analysis
TODO: UPDATE PLOT WITH TITLE
screeplot(pca_scaled,
ylab = "Relative importance",
main = "PCA variance")
TODO: INTERPRET SCREEPLOT
Given the elbow at PC2, we should only retain the first PC. However, given the fact that 1 dimension is not useful for analysis, we will take 2.
The summary function in R provide useful information about our PCA results. The item we are interested in is the proportion of variance explained by the PCs. We can then visualize the proportion of variance in our data explained with a bar chart.
summary_out_scaled <- summary(pca_scaled)
summary(pca_scaled)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 16.2981 5.5721 5.41909 5.26487 5.19656 5.17558 5.11325
## Proportion of Variance 0.1985 0.0232 0.02195 0.02072 0.02018 0.02002 0.01954
## Cumulative Proportion 0.1985 0.2217 0.24368 0.26439 0.28458 0.30460 0.32414
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 5.01144 4.94492 4.86900 4.8524 4.79022 4.69654 4.67504
## Proportion of Variance 0.01877 0.01828 0.01772 0.0176 0.01715 0.01649 0.01633
## Cumulative Proportion 0.34291 0.36118 0.37890 0.3965 0.41365 0.43013 0.44647
## PC15 PC16 PC17 PC18 PC19 PC20 PC21
## Standard deviation 4.64717 4.62267 4.59217 4.55181 4.51365 4.48088 4.44317
## Proportion of Variance 0.01614 0.01597 0.01576 0.01549 0.01523 0.01501 0.01475
## Cumulative Proportion 0.46261 0.47858 0.49434 0.50983 0.52505 0.54006 0.55481
## PC22 PC23 PC24 PC25 PC26 PC27 PC28
## Standard deviation 4.39156 4.35229 4.32199 4.27752 4.25784 4.22305 4.19721
## Proportion of Variance 0.01441 0.01416 0.01396 0.01368 0.01355 0.01333 0.01317
## Cumulative Proportion 0.56923 0.58339 0.59735 0.61102 0.62457 0.63790 0.65107
## PC29 PC30 PC31 PC32 PC33 PC34 PC35
## Standard deviation 4.17971 4.11221 4.05455 4.03023 4.01265 3.98857 3.96761
## Proportion of Variance 0.01306 0.01264 0.01229 0.01214 0.01203 0.01189 0.01177
## Cumulative Proportion 0.66412 0.67676 0.68905 0.70119 0.71322 0.72511 0.73688
## PC36 PC37 PC38 PC39 PC40 PC41 PC42
## Standard deviation 3.93394 3.90733 3.89238 3.82540 3.79295 3.75287 3.73895
## Proportion of Variance 0.01157 0.01141 0.01132 0.01094 0.01075 0.01053 0.01045
## Cumulative Proportion 0.74844 0.75985 0.77118 0.78211 0.79287 0.80339 0.81384
## PC43 PC44 PC45 PC46 PC47 PC48 PC49
## Standard deviation 3.70877 3.68193 3.64483 3.6210 3.57055 3.52563 3.5082
## Proportion of Variance 0.01028 0.01013 0.00993 0.0098 0.00953 0.00929 0.0092
## Cumulative Proportion 0.82412 0.83425 0.84418 0.8540 0.86351 0.87280 0.8820
## PC50 PC51 PC52 PC53 PC54 PC55 PC56
## Standard deviation 3.43804 3.37418 3.35409 3.31966 3.3121 3.25229 3.17153
## Proportion of Variance 0.00883 0.00851 0.00841 0.00824 0.0082 0.00791 0.00752
## Cumulative Proportion 0.89083 0.89934 0.90775 0.91598 0.9242 0.93209 0.93961
## PC57 PC58 PC59 PC60 PC61 PC62 PC63
## Standard deviation 3.1679 3.07460 3.03509 2.91990 2.88622 2.7606 2.58794
## Proportion of Variance 0.0075 0.00707 0.00688 0.00637 0.00623 0.0057 0.00501
## Cumulative Proportion 0.9471 0.95417 0.96106 0.96743 0.97365 0.9794 0.98436
## PC64 PC65 PC66 PC67 PC68
## Standard deviation 2.5064 2.39421 2.16921 2.05232 1.1e-14
## Proportion of Variance 0.0047 0.00428 0.00352 0.00315 0.0e+00
## Cumulative Proportion 0.9890 0.99334 0.99685 1.00000 1.0e+00
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
TODO: INTERPRET VARIANCE EXPLAINED
Variance of the data explained by the first PC is far higher than the rest. This means the data can be represented efficiently with just 1 dimension.
Biplots are very helpful in interpreting the results of a PCA and can help us reach the conclusions we reached in our previous section with just one function. So why do we not use this?
biplot(pca_scaled)
TODO: EXPLAIN WHY THIS IS A BAD IDEA
Due to the large number of observations or variables, the biplot becomes very cluttered. This makes it far more difficult to interpret.
The score function from the vegan library returns the coordinates of the observations, which will help use visualize this data.
pca_scores <- vegan::scores(pca_scaled)
Creating a vector which stores the ids of the population
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
This is used to categorize our observations into individual groups.
pca_scores2 <- data.frame(pop_id,
pca_scores)
After obtaining our pca scores, we can visualize them in a scatter plot. The positions of each observation reflects the the relationship between each observation in the original data set. This will allow us to identify clusters/patterns within the data
TODO: UPDATE PLOT WITH TITLE TODO: UPDATE X and Y AXES WITH AMOUNT OF VARIATION EXPLAINED
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (19.85% variation)",
ylab = "PC2 (2.32% variation)",
main = "PC1 vs PC2")
TODO: INTERPRET PLOT
Looks like Cau, Div seem to be in one cluster, which indicates they are similar with each other in terms of genetic variation they capture. This means both SNPs are common in the population and they have similar frequencies of different nucleotide variations.