Grace Wang Final Report: Analysis of 1000 Genomes Data with PCA

Introduction

This report summarizes the analysis workflow and results of an analysis of SNPs from the 1000 Genomes Project.

The raw data imported for analysis consisted of 9,786 SNPs from 2,513 individuals, but was reduced to 7,261 SNPs from 2,504 individuals after data cleaning. This process consisted of omitting invariant SNPs that were unconducive to the analysis (2,525 removed), removing low-quality samples with excess NAs (my analysis had none for this), and using mean imputation to replace remaining NAs (242).

Applying PCA analysis resulted in the top 3 PCs accounting for 3.10%, 2.13%, and 1.57% of the data’s variation respectively. These top 3 PCs thus amounted to explaining approximately 6.80% of variation, and only the first 679 PCs were determined to be within the threshold of usefulness (% variation explained) for future analysis.

Data preparation

Obtaining and loading data

Single Nucleotide Polymorphism (SNPs) data in VCF format were obtained from the 1000 Genomes Project.

SNPs were downloaded using the Ensembl Data Slicer from chromosome 11 between genomic coordinates 494,585 and 734,585. This represents 0.178% of the chromosome. A total of 9,786 variants genotyped in 2,513 individuals were downloaded.

The VCF file was loaded into R using the vcfR package (function read.vcfR) and converted to counts of the minor allele using the function vcfR::extract.gt().

Data subsetting

Because of the large size of the data file 10,000 random SNPs were selected using R’s sample() function. This allowed the data to be analyzed much more efficiently while representing key features of the data. A full analysis will require use of all of the SNPs.

Meta-data and sample information

We were able to merge our SNP datasets with meta-data from the 1000 Genomes Project by using our samples’ IDs. This allowed us to incorporate population, super population, sex, and latitude/longitude as columns in our dataframes.

Data cleaning

These SNPs were then screened for any SNPs that were invariant (fixed), resulting in removal of 2,525 SNPs (features). This was done using the invar_omit() function by Nathan Brouwer.

NOTE: The original workflow code for removing invariant SNPs contained and error that resulted in no columns actually being removed (Brouwer, personal communication). The code was updated and a reduction in the size of the dataframe after omitting invariant columns confirmed by checking the dimensions of the dataframes before and after this process using dim().

The data were then screened for rows (people) with >50% NAs. There were 242 NAs in the data, so overall no rows needed to be removed on the basis of excessive NAs. Mean imputation was still applied to substitute values for the few NAs present.

The data were then centered and scaled using R’s scale() function. (Alternatively a SNP-specific centering technique common in other studies could have been applied).

The data were then saved in .csv format using write.csv() for PCA analysis.

After final processing the data contained 7,261 SNPS and 2,504 samples (people).

Data Analysis

The code below carries out a PCA on the data and presents the results. The key steps are:

Load the data with read.csv().
Process the data with prcomp().
Extract PCA scores.
Carry out PCA diagnostics, including construction of a scree plot.
Plot PCs 1 through 3 as scatterplots as pairwise scatterplot.
Plots PCS 1 through 3 as a 3D scatterplot.

Packages

The following packages were used in this analysis:

# plotting:
library(ggplot2)
library(ggpubr)

# scores() function
library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

# 3D scatter plot
library(scatterplot3d)

Loading data

Load the fully processed data:

NOTE: with 7,261 SNPs, this CSV is 369.3 megabytes. There are more specialized packages for doing PCA with datasets this big. I do not recommend working with more than 10,000 SNPs with basic R functions as we have done in class.

# load in file with processed SNP data
vcf_scaled <- read.csv(file = "vcf_data_processed.csv")

Check the dimensions of the data to confirm this is the correct data:

# check dimensions of loaded data
dim(vcf_scaled)

## [1] 2504 7267

Principal Components Analysis

The data are scaled and ready for analysis. The first six columns contain character data and thus need to be omitted.

# examine preview of the loaded data 
head(vcf_scaled[,1:10])

##    sample pop super_pop    sex      lat       lng          X2          X3
## 1 HG00096 GBR       EUR   male 52.48624 -1.890401 -0.01998402 -0.01998402
## 2 HG00097 GBR       EUR female 52.48624 -1.890401 -0.01998402 -0.01998402
## 3 HG00099 GBR       EUR female 52.48624 -1.890401 -0.01998402 -0.01998402
## 4 HG00100 GBR       EUR female 52.48624 -1.890401 -0.01998402 -0.01998402
## 5 HG00101 GBR       EUR   male 52.48624 -1.890401 -0.01998402 -0.01998402
## 6 HG00102 GBR       EUR female 52.48624 -1.890401 -0.01998402 -0.01998402
##            X4         X6
## 1 -0.03999201  4.1240264
## 2 -0.03999201 -0.2423846
## 3 -0.03999201 -0.2423846
## 4 -0.03999201 -0.2423846
## 5 -0.03999201 -0.2423846
## 6 -0.03999201 -0.2423846

PCA

Principal Components Analysis was run using prcomp().

# run PCA on numeric columns only
vcf_pca <- prcomp(vcf_scaled[,-c(1:6)])

Get the PCA scores, which will be plotted.

# use vegan package to get PCA scores
vcf_pca_scores  <- vegan::scores(vcf_pca)

Combine the scores with the sample information into a dataframe.

# add into a dataframe
vcf_pca_scores2 <- data.frame(population = vcf_scaled$super_pop, vcf_pca_scores)

# set as a factor
vcf_pca_scores2$population <- factor(vcf_pca_scores2$population)

PCA diagnostics

The following steps help us understand the PCA output and determine how many PCs should be plotted and/or used in further analyses such as scans for natural selection, cluster analysis, and GWAS.

Default scree plot

A default R scree plot was created with screeplot(). This plot does not provide extra information for assessing the importance of the PCs.

screeplot(vcf_pca, xlab = "Principal Components", main = "PCA Screeplot")

Advanced scree plot

The original workflow and function for making a more advanced scree plot lacked flexibility (Brouwer, personal communication). The following function and workflow simplifies things

Run PC (done above)
Call function PCA_variation() (below) on PCA output.
(Do NOT call summary() on PCA output)
Call function screeplot_snps() on the output of PCA_variation() to make an advanced scree plot
Call function PCA_cumulative_var_plot() to show the cumulative variation explained as more PCs are considered

NEW Functions

PCA_variation() function

This function extacts information needed to make a more advanced, annotated scree plot.

# NEW function
PCA_variation <- function(pca){
  
  # get summary information from PCA
  pca_summary <- summary(pca)
  
  # extract information from summary
  ## raw variance for each PC
  variance <- pca_summary$importance[1,]
  
  ## % variance explained by each PC
  var_explained <- pca_summary$importance[2,]*100
  var_explained <- round(var_explained,3)
  
  ## cumulative % variance  
  var_cumulative <- pca_summary$importance[3,]*100
  var_cumulative <- round(var_cumulative,3)
  
  # prepare output
  N.PCs <- length(var_explained)
  var_df <- data.frame(PC = 1:N.PCs,
            var_raw  = variance,
            var_percent = var_explained, 
            cumulative_percent = var_cumulative)
  
  # return output
  return(var_df)   
}

screeplot_snps() function

This functions makes a more advanced scree plot better suited for PCS on for SNPs.

# NEW function
screeplot_snps <- function(var_df){
total_var <- sum(var_df$var_raw)
N <- length(var_df$var_raw)
var_cutoff <- total_var/N
var_cut_percent <- var_cutoff/total_var*100
var_cut_percent_rnd <- round(var_cut_percent,2)
i_above_cut <- which(var_df$var_percent > var_cut_percent)
i_cut <- max(i_above_cut) 
ti <- paste0("Cutoff = ",
            var_cut_percent_rnd,
            "%\n","Useful PCs = ",i_cut)
plot(var_df$var_percent,
        main =ti, type = "l",
     xlab = "PC",
     ylab = "Percent variation",
     col = 0)

segments(x0 = var_df$PC,
         x1 = var_df$PC,
         y0 = 0, 
         y1 = var_df$var_percent,
         col = 1)

segments(x0 = 0,
         x1 = N,
         y0 = var_cut_percent, 
         y1 = var_cut_percent,
         col = 2)
}

PCA_cumulative_var_plot() function

This makes a plot complementary to a scree plot. A scree plot plots the amount of variation explained by each PC. This plot plots a curve of cumulative amount of variation explained by the PCs.

# NEW function
PCA_cumulative_var_plot <- function(var_df){
  plot(cumulative_percent ~ PC, 
       data = var_out,
       main = "Cumulative percent variation\n explained by PCs",
       xlab = "PC",
       ylab = "Cumulative %",
       type = "l")
  
  total_var <- sum(var_df$var_raw)
N <- length(var_df$var_raw)
var_cutoff <- total_var/N
var_cut_percent <- var_cutoff/total_var*100
var_cut_percent_rnd <- round(var_cut_percent,2)
i_above_cut <- which(var_df$var_percent > var_cut_percent)
i_cut <- max(i_above_cut) 

percent_cut_i <- which(var_out$PC == i_cut )
percent_cut <- var_out$cumulative_percent[percent_cut_i]
segments(x0 = i_cut,
         x1 = i_cut,
         y0 = 0, 
         y1 = 100,
         col = 2)

segments(x0 = -10,
         x1 = N,
         y0 = percent_cut, 
         y1 = percent_cut,
         col = 2)

return(percent_cut)         # prints cumulative % variation explained by PCs above cutoff
}

Advanced screeplot analysis

Extract information

Extract information on the variance explained by each PC.

var_out <- PCA_variation(vcf_pca)

Look at the output of PCA_variation()

head(var_out)

##     PC   var_raw var_percent cumulative_percent
## PC1  1 14.999084       3.098              3.098
## PC2  2 12.433852       2.129              5.228
## PC3  3 10.687443       1.573              6.801
## PC4  4  9.393030       1.215              8.016
## PC5  5  8.940202       1.101              9.117
## PC6  6  8.716726       1.046             10.163

Advanced screeplot

This advanced scree plot shows the amount of variation explained by all PCs. It marks with a horizontal line what the cutoff is for the amount of Percent variation explained that is useful, and a vertical line for where that line interacts the curve of the scree plot. The title indicates the percentage value of the cutoff and which PC is the last PC below that value. Though only the first few PCs can be plotted, PCs below the cut off value (“useful PCs) should probably used for further machine learning algorithms.

Make the scree plot with screeplot_snps()

screeplot_snps(var_out)

Cumulative variation plot

The cumulative variation plot shows how much variation in the data explained in total as more and more PCs are considered. The vertical red line shows the cutoff value from the scree plot (above). The horizontal line indicates what the total percentage of variation explained by these useful PCs is.

Make cumulative variation plot with PCA_cumulative_var_plot()

PCA_cumulative_var_plot(var_out)

## [1] 74.566

PCA Scatterplots

The object created above var_out indicates how much variation is explained by each of the Principal components. This information is often added to the axes of scatterplots of PCA output.

head(var_out)

##     PC   var_raw var_percent cumulative_percent
## PC1  1 14.999084       3.098              3.098
## PC2  2 12.433852       2.129              5.228
## PC3  3 10.687443       1.573              6.801
## PC4  4  9.393030       1.215              8.016
## PC5  5  8.940202       1.101              9.117
## PC6  6  8.716726       1.046             10.163

PC 1 explains 3.10% percent of the variation, PC2 explains 2.13%, and PC3 explains 1.57%. In total, the first 3 PCs explain only ~6.8% of the variability in the data. The scree plot indicate that the first ~679 PCs are useful explain ~75% of the variation in the data. In further analysis such as GWAS the first 679 PCs should therefore be used.

Plot PC1 versus PC2

Plot PC1 against PC2 with super-population color-coded

# Plot PCA scores in scatterplot
ggpubr::ggscatter(data = vcf_pca_scores2,
                  y = "PC2",
                  x = "PC1",
              color = "population",
              shape = "population",
              main = "PC1 vs. PC2 Scatterplot",
         ylab = "PC2 (2.13% of variation)",
         xlab = "PC1 (3.10% of variation)")

Note how in the plot the amount of variation explained by each PC is shown in the axis labels.

Plot PC2 versus PC3

Plot PC2 against PC3 with super population color-coded

# Plot PCA scores in scatterplot
ggpubr::ggscatter(data = vcf_pca_scores2,
                  y = "PC3",
                  x = "PC2",
                  color = "population",   # todo
                  shape = "population",   # todo
                  main = "PC2 vs. PC3 Scatterplot",
          ylab = "PC3 (1.57% of variation)",
          xlab = "PC2 (2.13% of variation)")

Note how in the plot the amount of variation explained by each PC is shown in the axis labels.

Plot PC1 versus PC3

Plot PC1 against PC3 with super population color-coded

# Plot PCA scores in scatterplot
ggpubr::ggscatter(data = vcf_pca_scores2,
                  y = "PC3",
                  x = "PC1",
                  ellipse = T,
            color = "population",   # todo
            shape = "population",   # todo
            main = "PC1 vs. PC3 Scatterplot",
      ylab = "PC3 (1.57% of variation)",
      xlab = "PC1 (3.10% of variation")

Note how in the plot the amount of variation explained by each PC is shown in the axis labels.

3D scatterplot

The first 3 principal components can be presented as a 3D scatterplot.

colors_use <- as.numeric(factor(vcf_pca_scores2$population))
scatterplot3d(x = vcf_pca_scores2$PC1,
              y = vcf_pca_scores2$PC2,
              z = vcf_pca_scores2$PC3,
              color = colors_use,
              xlab = "PC1 (3.10%)",
              ylab = "PC2 (2.13%)",
              zlab = "PC3 (1.57%)",
              main = "3D Scatterplot of PC1, PC2, PC3")

Conclusion

Looking at the final plots generated by our PCA analysis, we can visualize a slight distinction in groups between the superpopulations included in the SNP dataset. This is primarily possible in the scatterplots comparing PC1 vs. PC2 and PC2 vs. PC3, where there is a clear separation between the datapoints for the African populations and all the others (a minimal amount of AMR datapoints are dispersed in the Africa groupings as well). The PC1 vs. PC3 graph and 3D scatterplot do not offer any information on possible groupings. Thus we can use these PCA results to further explore the reasons behind the African superpopulation’s clear separation from the others in terms of SNPs from this segment of Chromosome 11.

Grace Wang Final Report: Analysis of 1000 Genomes Data with PCA

Nathan Brouwer

2022-12-05

Introduction

Data preparation

Obtaining and loading data

Data subsetting

Meta-data and sample information

Data cleaning

Data Analysis

Packages

Loading data

Principal Components Analysis

PCA

PCA diagnostics

Default scree plot

Advanced scree plot

NEW Functions

PCA_variation() function

screeplot_snps() function

PCA_cumulative_var_plot() function

Advanced screeplot analysis

Extract information

Advanced screeplot

Cumulative variation plot

PCA Scatterplots

Plot PC1 versus PC2

Plot PC2 versus PC3

Plot PC1 versus PC3

3D scatterplot

Conclusion