The example is split into two parts. Part 1 must be completed first: it creates a file, SNPs_cleaned.csv, that has been fully prepared for analysis.
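For context, here is a minimal sketch of the idea behind Part 1: reading genotypes from a VCF with vcfR and recoding them numerically. The file name all_loci.vcf is taken from the directory listing below, but the recoding function and cleaning steps here are assumptions, not a reproduction of Part 1.
library(vcfR)
vcf <- read.vcfR("all_loci.vcf")        # read the VCF into a vcfR object
gt <- extract.gt(vcf, element = "GT")   # genotype matrix: SNPs x samples
# Hypothetical recoding: count alternate alleles (0, 1, or 2) per genotype
gt_num <- apply(gt, 2, function(x) {
  x <- gsub("\\|", "/", x)              # treat phased and unphased alike
  c("0/0" = 0, "0/1" = 1, "1/0" = 1, "1/1" = 2)[x]
})
# Transpose so rows are samples and columns are SNPs, then save for Part 2
SNPs_cleaned <- as.data.frame(t(gt_num))
write.csv(SNPs_cleaned, "SNPs_cleaned.csv", row.names = FALSE)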
Now, in Part 2, you will analyze the data with PCA. The steps will be:
1. scale()
2. prcomp()
3. screeplot()
4. vegan::scores()
All of the code is provided below; your task is to work through these four steps.
Load the vcfR package (and the other required packages) with library()
library(vcfR) # KEY
## Warning: package 'vcfR' was built under R version 4.2.2
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
setwd("C:/Users/mikaw/OneDrive/Pitt/Computational Biology/my_snps")
list.files()
## [1] "1000Genome.gz"
## [2] "1000genomes_people_info2-1.csv"
## [3] "6.1803432-2043432.ALL.chr6_GRCh38.genotypes.20170504.vcf.gz"
## [4] "all_loci-1.vcf"
## [5] "all_loci.vcf"
## [6] "project snps.Rmd"
## [7] "project_snps_num_df"
## [8] "Screenshot 2022-12-03 105306.png"
## [9] "vcf_num.csv"
## [10] "walsh2017morphology.csv"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
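Optionally, confirm what was loaded; the rows should be samples and the columns should be SNPs:
dim(SNPs_cleaned)   # number of samples, number of SNPs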
Center and scale the data by subtracting each column's mean (centering it around 0) and dividing by its standard deviation
SNPs_scaled <- scale(SNPs_cleaned)
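As a quick optional check, the scaled columns should now have means of about 0 and standard deviations of 1:
head(round(colMeans(SNPs_scaled), 10))   # ~0 after centering
head(apply(SNPs_scaled, 2, sd))          # 1 after scaling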
Use prcomp() to perform the PCA
pca_scaled <- prcomp(SNPs_scaled)
Create a screeplot from the PCA output
screeplot(pca_scaled,
          ylab = "Relative importance",
          main = "SNPs Screeplot")
PC1 is the only component with relatively high importance; all the others have lower, roughly equal importance.
Look at the summary of the PCA.
Extract the PCA variation and use it to calculate the percentage of variation explained.
Calculate the rule-of-thumb cutoff based on the number of dimensions in the data.
Plot the percentage variation.
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
  # row 2 of importance = proportion of variance; convert to percent
  var_explained <- pca_summary$importance[2, 1:PCs]*100
  var_explained <- round(var_explained, 1)
  return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled, PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation scree plot",
        ylab = "Percent variation explained")
# Cutoff: the percent each PC would explain if all contributed equally
abline(h = 1/N_columns*100, col = 2, lwd = 2)
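To compare the values numerically rather than by eye (an optional check):
var_out                       # percent variation for the first 10 PCs
1/N_columns*100               # the rule-of-thumb cutoff
var_out > 1/N_columns*100     # TRUE for PCs above the cutoff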
None of the first 10 PCs fall below the rule-of-thumb cutoff.
This plots the data based on all of the SNPs at once:
biplot(pca_scaled)
There are too many dimensions for this plot to be readable.
Extract the PC scores using vegan::scores()
pca_scores <- vegan::scores(pca_scaled)
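Optionally, peek at the scores to confirm the structure (one row per sample):
head(pca_scores)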
Create vector with the population IDS
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Create a dataframe combining the population IDs and the PC scores
pca_scores2 <- data.frame(pop_id,
                          pca_scores)
Create a scatter plot of the PCA scores
ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 amount of variation explained",
                  ylab = "PC2 amount of variation explained",
                  main = "PCA scores")
You can see groups and trends forming based on population ID: samples from the same population tend to cluster together.
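As an optional refinement (a sketch reusing the var_out percentages computed earlier), you can embed the percent variation explained in the axis labels:
ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = paste0("PC1 (", var_out[1], "% of variation)"),
                  ylab = paste0("PC2 (", var_out[2], "% of variation)"),
                  main = "PCA scores")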