This project summarizes the analysis workflow and results of an analysis of SNPs from the 1000 Genomes Project.
The VCF file was loaded into R using the vcfR package (function read.vcfR) and converted to counts of the minor allele using the function vcfR::extract.gt().
Single Nucleotide Polymorphism (SNPs) data in VCF format were obtained from the 1000 Genomes Project.SNPs were downloaded using the Ensembl Data Slicer from chromosome 4 between genomic coordinates 13132463 and 13372463. This represents about 0.3% of the chromosome. A total of 7497 variants genotyped in 2513 individuals were downloaded.
Because of the large size of the data file 10000 random SNPs were
selected using R’s sample() function. This allowed the data
to be analyzed much more efficiently while representing key features of
the data. A full analysis will require use of all of the SNPs.
TODO: Change this with 1 to 2 sentences to apply to the meta data you used for your project. Refer to the original portfolio files for more information.
Metadata gives the information of the other data. I used the metadata for superpopulation for regions like America and Europeans, Africans, and etc. The self-identified ethnicity of each person is indicated in the row names of the dataframe. This information was extracted, cleaned with regular expressions and added as a column to the dataframe.
These SNPs were then screened for any SNPs that were invariant
(fixed), resulting in removal of 1877 SNPs (features). This was done
using the invar_omit() function by Nathan Brouwer.
NOTE: The original workflow code for removing
invariant SNPs contained and error that resulted in no
columns actually being removed (Brouwer, personal
communication). The code was updated and a reduction in the size of
the dataframe after omitting invariant columns confirmed by checking the
dimensions of the dataframes before and after this process using
dim().
The data were then screened for rows (people) with >50% NAs. There were no NAs in the data, so no rows were removed due to the presence of excessive NAs. Similarly, because no NAs were present no imputation was required.
The data were then centered and scaled using R’s
scale() function. (Alternatively a SNP-specific centering
technique common in other studies could have been applied).
The data were then saved in .csv format using
write.csv() for PCA analysis.
After final processing the data contained 5620 SNPS and 2504 samples (people).
The code below carries out a PCA on the data and presents the results. The key steps are:
read.csv().prcomp().The following packages were used in this analysis:
# plotting:
library(ggplot2)
library(ggpubr)
# scores() function
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
# 3D scatter plot
library(scatterplot3d)
Load the fully processed data:
NOTE: with 5620 SNPs, this CSV is ~276.2 megabytes. There are more specialized packages for doing PCA with datasets this big. I do not recommend working with more than 10,000 SNPs with basic R functions as we have done in class.
## TODO: Change to your .csv file name
vcf_scaled <- read.csv(file = "datascaled.csv")
Check the dimensions of the data to confirm this is the correct data:
dim(vcf_scaled)
## [1] 2504 5626
TODO: Update this paragraph to reflect how many columns of character data are in your file.
The data are scaled and ready for analysis. Only the very first column contains character data and needs to be omitted.
head(vcf_scaled[,1:10])
## sample pop super_pop sex lat lng X1 X2
## 1 HG00096 GBR EUR male 52.48624 -1.890401 -0.07496829 -0.03462717
## 2 HG00097 GBR EUR female 52.48624 -1.890401 -0.07496829 -0.03462717
## 3 HG00099 GBR EUR female 52.48624 -1.890401 -0.07496829 -0.03462717
## 4 HG00100 GBR EUR female 52.48624 -1.890401 -0.07496829 -0.03462717
## 5 HG00101 GBR EUR male 52.48624 -1.890401 -0.07496829 -0.03462717
## 6 HG00102 GBR EUR female 52.48624 -1.890401 -0.07496829 -0.03462717
## X3 X4
## 1 -0.0600481 -1.0787738
## 2 -0.0600481 0.9266082
## 3 -0.0600481 0.9266082
## 4 -0.0600481 -1.0787738
## 5 -0.0600481 0.9266082
## 6 -0.0600481 -1.0787738
Principal Components Analysis was run using
prcomp().
# TODO: change the -1 to code that
## omits the proper number of columns
vcf_pca <- prcomp(vcf_scaled[, -c(1:6)])
warning("If this didn't work, you may not have omitted the proper number of columns. ")
## Warning: If this didn't work, you may not have omitted the proper number of
## columns.
Get the PCA scores, which will be plotted.:
vcf_pca_scores <- vegan::scores(vcf_pca)
Combine the scores with the sample information into a dataframe.
TODO Be sure to change
vcf_scaled$sample to whatever is the name of the column
with your superpopulation information.
# call data.frame()
vcf_pca_scores2 <- data.frame(population = vcf_scaled$super_pop, #TODO
vcf_pca_scores)
# set as a factor
vcf_pca_scores2$population <- factor(vcf_pca_scores2$population)
warning("If this didn't work, you may not have set of the column for sample data correctly")
## Warning: If this didn't work, you may not have set of the column for sample data
## correctly
The following steps help us understand the PCA output and determine how many PCs should be plotted and/or used in further analyses such as scans for natural selection, cluster analysis, and GWAS.
A default R scree plot was created with screeplot().
This plot does not provide extra information for assessing the
importance of the PCs.
screeplot(vcf_pca,
xlab = "Principal Components")
The original workflow and function for making a more advanced scree plot lacked flexibility (Brouwer, personal communication). The following function and workflow simplifies things
PCA_variation() (below) on PCA
output.screeplot_snps() on the output of
PCA_variation() to make an advanced scree plotPCA_cumulative_var_plot() to show the
cumulative variation explained as more PCs are consideredThis function extacts information needed to make a more advanced, annotated scree plot.
# This is a NEW function
PCA_variation <- function(pca){
# get summary information from PCA
pca_summary <- summary(pca)
# extract information from summary
## raw variance for each PC
variance <- pca_summary$importance[1,]
## % variance explained by each PC
var_explained <- pca_summary$importance[2,]*100
var_explained <- round(var_explained,3)
## cumulative % variance
var_cumulative <- pca_summary$importance[3,]*100
var_cumulative <- round(var_cumulative,3)
# prepare output
N.PCs <- length(var_explained)
var_df <- data.frame(PC = 1:N.PCs,
var_raw = variance,
var_percent = var_explained,
cumulative_percent = var_cumulative)
# return output
return(var_df)
}
This functions makes a more advanced scree plot better suited for PCS on for SNPs.
# This is a NEW function
screeplot_snps <- function(var_df){
total_var <- sum(var_df$var_raw)
N <- length(var_df$var_raw)
var_cutoff <- total_var/N
var_cut_percent <- var_cutoff/total_var*100
var_cut_percent_rnd <- round(var_cut_percent,2)
i_above_cut <- which(var_df$var_percent > var_cut_percent)
i_cut <- max(i_above_cut)
ti <- paste0("Cutoff = ",
var_cut_percent_rnd,
"%\n","Useful PCs = ",i_cut)
plot(var_df$var_percent,
main =ti, type = "l",
xlab = "PC",
ylab = "Percent variation",
col = 0)
segments(x0 = var_df$PC,
x1 = var_df$PC,
y0 = 0,
y1 = var_df$var_percent,
col = 1)
segments(x0 = 0,
x1 = N,
y0 = var_cut_percent,
y1 = var_cut_percent,
col = 2)
}
This makes a plot complementary to a scree plot. A scree plot plots the amount of variation explained by each PC. This plot plots a curve of cumulative amount of variation explained by the PCs.
# This is a NEW function
PCA_cumulative_var_plot <- function(var_df){
plot(cumulative_percent ~ PC,
data = var_out,
main = "Cumulative percent variation\n explained by PCs",
xlab = "PC",
ylab = "Cumulative %",
type = "l")
total_var <- sum(var_df$var_raw)
N <- length(var_df$var_raw)
var_cutoff <- total_var/N
var_cut_percent <- var_cutoff/total_var*100
var_cut_percent_rnd <- round(var_cut_percent,2)
i_above_cut <- which(var_df$var_percent > var_cut_percent)
i_cut <- max(i_above_cut)
percent_cut_i <- which(var_out$PC == i_cut )
percent_cut <- var_out$cumulative_percent[percent_cut_i]
segments(x0 = i_cut,
x1 = i_cut,
y0 = 0,
y1 = 100,
col = 2)
segments(x0 = -10,
x1 = N,
y0 = percent_cut,
y1 = percent_cut,
col = 2)
}
Extract information on the variance explained by each PC.
var_out <- PCA_variation(vcf_pca)
Look at the output of PCA_variation()
head(var_out)
## PC var_raw var_percent cumulative_percent
## PC1 1 13.941255 3.458 3.458
## PC2 2 11.202242 2.233 5.691
## PC3 3 9.846852 1.725 7.417
## PC4 4 8.935021 1.421 8.837
## PC5 5 8.678657 1.340 10.177
## PC6 6 8.464895 1.275 11.452
This advanced scree plot shows the amount of variation explained by all PCs. It marks with a horizontal line what the cutoff is for the amount of Percent variation explained that is useful, and a vertical line for where that line interacts the curve of the scree plot. The title indicates the percentage value of the cutoff and which PC is the last PC below that value. Though only the first few PCs can be plotted, PCs below the cut off value (“useful PCs) should probably used for further machine learning algorithms.
Make the scree plot with screeplot_snps()
screeplot_snps(var_out)
The cumulative variation plot shows how much variation in the data explained in total as more and more PCs are considered. The vertical red line shows the cutoff value from the scree plot (above). The horizontal line indicates what the total percentage of variation explained by these useful PCs is.
Make cumulative variation plot with
PCA_cumulative_var_plot()
PCA_cumulative_var_plot(var_out)
The object created above var_out indicates how much
variation is explained by each of the Principal components. This
information is often added to the axes of scatterplots of PCA
output.
head(var_out)
## PC var_raw var_percent cumulative_percent
## PC1 1 13.941255 3.458 3.458
## PC2 2 11.202242 2.233 5.691
## PC3 3 9.846852 1.725 7.417
## PC4 4 8.935021 1.421 8.837
## PC5 5 8.678657 1.340 10.177
## PC6 6 8.464895 1.275 11.452
TODO: and the number of PCs the scree plot indicates should be used in any further analyses, such as GWAS or cluster analysis.
PC 1 explains 3.458 percent of the variation, PC2 explains. 2.233%, and PC3 explains 1.725. In total, the first 3 PCs explain only 7.416 of the variability in the data. The scree plot indicate that the first ~635 PCs are useful explain ~78% of the variation in the data. In further analysis such as GWAS the first 635 PCs should therefore be used.
TODO 01 If necessary Change the color =
and shape = variables to be appropriate to your data,
e.g. whatever superpopulation is called in your
dataframe.
TODO 02 Put the correct amount of variation explained by each PC in the plots below.
Plot the scores, with super-population color-coded
# make color and shape = "super_pop"
ggpubr::ggscatter(data = vcf_pca_scores2,
y = "PC2",
x = "PC1",
color = "population", # TODO
shape = "population", # TODO
main = "PCA Scatterplot",
ylab = "PC2 (2.233% of variation)",
xlab = "PC1 (3.458% of variation")
warning("If this didn't work, you make need to change the color = and shape = code")
## Warning: If this didn't work, you make need to change the color = and shape =
## code
message("Be sure to also update the amount of variation explained by the PCs")
## Be sure to also update the amount of variation explained by the PCs
Note how in the plot the amount of variation explained by each PC is shown in the axis labels.
Plot the scores, with super population color-coded
# make color and shape = "super_pop"
ggpubr::ggscatter(data = vcf_pca_scores2,
y = "PC3",
x = "PC2",
color = "population", # todo
shape = "population", # todo
main = "PCA Scatterplot",
ylab = "PC3 (1.725% of variation)",
xlab = "PC2 (2.233% of variation")
warning("Be sure to update the amount of variation explained by the PCs")
## Warning: Be sure to update the amount of variation explained by the PCs
Note how in the plot the amount of variation explained by each PC is shown in the axis labels.
Plot the scores, with super population color-coded
# make color and shape = "population"
ggpubr::ggscatter(data = vcf_pca_scores2,
y = "PC3",
x = "PC1",
ellipse = T,
color = "population", # todo
shape = "population", # todo
main = "PCA Scatterplot",
ylab = "PC3 (1.725% of variation)",
xlab = "PC1 (3.458% of variation")
warning("Be sure to update the amount of variation explained by the PCs")
## Warning: Be sure to update the amount of variation explained by the PCs
Note how in the plot the amount of variation explained by each PC is shown in the axis labels.
The first 3 principal components can be presented as a 3D scatterplot.
colors_use <- as.numeric(vcf_pca_scores2$population)
scatterplot3d(x = vcf_pca_scores2$PC1,
y = vcf_pca_scores2$PC2,
z = vcf_pca_scores2$PC3,
color = colors_use,
xlab = "PC1 (3.458%)",
ylab = "PC2 (2.233%)",
zlab = "PC3 (1.725%)")
warning("Be sure to update the amount of variation explained by the PCs")
## Warning: Be sure to update the amount of variation explained by the PCs
#Conclusion Based on the PC1 vs PC3 scatterplot, the graph is showing an overlap in many populations, which means many populations have similarities in their genomes. Superpopulations seemed to be unrelated to the populations.According to the 3d scatterplot, we are able to conclude that AFR EAS EUR shared some simliarities among all PCs.