Load necessary R packages.

library(vcfR)
## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)

Confirm working directory and location of files.

getwd()
## [1] "/Users/rheapremanand/Documents/BIOSC 1540 Code/Final_Project"

Look for the vcf files.

list.files(pattern = "vcf")
## [1] "7.7000-247000.ALL.chr7_GRCh38.genotypes.20170504.vcf.gz"
## [2] "all_loci-1.vcf"                                         
## [3] "all_loci.vcf"                                           
## [4] "vcf_num_df.csv"                                         
## [5] "vcf_num.csv"                                            
## [6] "vcf_scaled_final.csv"

##Set SNP data up for R

Load the (compressed) vcf data.

#save the name of my vcf file to a shorter name
my_vcf <- "7.7000-247000.ALL.chr7_GRCh38.genotypes.20170504.vcf.gz"

#load vcf file
vcf <- vcfR::read.vcfR(my_vcf,
                       convertNA = T)
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 130
##   header_line: 131
##   variant count: 8981
##   column count: 2513
## 
Meta line 130 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 8981
##   Character matrix gt cols: 2513
##   skip: 0
##   nrows: 8981
##   row_num: 0
## 
Processed variant 1000
Processed variant 2000
Processed variant 3000
Processed variant 4000
Processed variant 5000
Processed variant 6000
Processed variant 7000
Processed variant 8000
Processed variant: 8981
## All variants processed

Call head() on vcf to examine data.

head(vcf)
## [1] "***** Object of class 'vcfR' *****"
## [1] "***** Meta section *****"
## [1] "##fileformat=VCFv4.1"
## [1] "##FILTER=<ID=PASS,Description=\"All filters passed\">"
## [1] "##fileDate=20150218"
## [1] "##reference=ftp://ftp.1000genomes.ebi.ac.uk//vol1/ftp/technical/refe [Truncated]"
## [1] "##source=1000GenomesPhase3Pipeline"
## [1] "##contig=<ID=1,assembly=b37,length=249250621>"
## [1] "First 6 rows."
## [1] 
## [1] "***** Fixed section *****"
##      CHROM POS     ID            REF ALT QUAL  FILTER
## [1,] "7"   "14808" "rs555283805" "T" "C" "100" "PASS"
## [2,] "7"   "15064" "rs576737504" "T" "C" "100" "PASS"
## [3,] "7"   "16454" "rs544026442" "C" "T" "100" "PASS"
## [4,] "7"   "16692" "rs370739206" "G" "C" "100" "PASS"
## [5,] "7"   "16712" "rs373250171" "T" "G" "100" "PASS"
## [6,] "7"   "16731" "rs541070747" "T" "G" "100" "PASS"
## [1] 
## [1] "***** Genotype section *****"
##      FORMAT HG00096 HG00097 HG00099 HG00100 HG00101
## [1,] "GT"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
## [2,] "GT"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
## [3,] "GT"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
## [4,] "GT"   "1|1"   "1|1"   "1|1"   "1|1"   "1|1"  
## [5,] "GT"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
## [6,] "GT"   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
## [1] "First 6 columns only."
## [1] 
## [1] "Unique GT formats:"
## [1] "GT"
## [1]
dim(vcf)
## variants fix_cols  gt_cols 
##     8981        8     2505

Convert raw VCF file to genotype scores with extract.gt().

#get genotype score (allele counts)

vcf_num <- vcfR::extract.gt(vcf, 
                element = "GT",
                IDtoRowNames = F,
                as.numeric = T,
                convertNA = T)

Save the csv.

write.csv(vcf_num, file = "vcf_num.csv", row.names = F)

Confirm presence of file.

list.files(pattern = "csv")
## [1] "1000genomes_people_info2-1.csv" "SNPs_cleaned.csv"              
## [3] "vcf_num_df.csv"                 "vcf_num.csv"                   
## [5] "vcf_scaled_final.csv"           "walsh2017morphology.csv"

Transpose original VCF orientation to R datafram orientation (make SNPs columns!).

vcf_num_t <-t(vcf_num)

Make into a dataframe.

vcf_num_df <- data.frame(vcf_num_t)

Get sample(person) names and add sample info to dataframe.

sample <- row.names(vcf_num_df)

vcf_num_df <- data.frame(sample, vcf_num_df)

Check working directory and save csv.

getwd()
## [1] "/Users/rheapremanand/Documents/BIOSC 1540 Code/Final_Project"
write.csv(vcf_num_df,
          file = "vcf_num_df.csv",
          row.names = F)

#confirm file presence
list.files()
##  [1] "07-mean_imputation.html"                                
##  [2] "07-mean_imputation.Rmd"                                 
##  [3] "08-PCA_worked.html"                                     
##  [4] "08-PCA_worked.Rmd"                                      
##  [5] "09-PCA_worked_example-SNPs-part1.html"                  
##  [6] "09-PCA_worked_example-SNPs-part1.Rmd"                   
##  [7] "10-PCA_worked_example-SNPs-part2.html"                  
##  [8] "10-PCA_worked_example-SNPs-part2.Rmd"                   
##  [9] "1000genomes_people_info2-1.csv"                         
## [10] "7.7000-247000.ALL.chr7_GRCh38.genotypes.20170504.vcf.gz"
## [11] "all_loci-1.vcf"                                         
## [12] "all_loci.vcf"                                           
## [13] "bird_snps_remove_NAs.html"                              
## [14] "bird_snps_remove_NAs.Rmd"                               
## [15] "final_proj_workflow.Rmd"                                
## [16] "Final_Project.Rproj"                                    
## [17] "final_report_template.html"                             
## [18] "final_report_template.Rmd"                              
## [19] "final_workflow.Rmd"                                     
## [20] "removing_fixed_alleles.html"                            
## [21] "removing_fixed_alleles.Rmd"                             
## [22] "rsconnect"                                              
## [23] "SNPs_cleaned.csv"                                       
## [24] "Software_Download.png"                                  
## [25] "transpose_VCF_data.html"                                
## [26] "transpose_VCF_data.Rmd"                                 
## [27] "vcf_num_df.csv"                                         
## [28] "vcf_num.csv"                                            
## [29] "vcf_scaled_final.csv"                                   
## [30] "walsh2017morphology.csv"

Clean data

Merge data with population meta data

Load population meta data.

pop_meta <- read.csv(file = "1000genomes_people_info2-1.csv")

Merge meta data with SNP data.

names(pop_meta)
## [1] "pop"       "super_pop" "sample"    "sex"       "lat"       "lng"
names(vcf_num_df)[1:10]
##  [1] "sample" "X1"     "X2"     "X3"     "X4"     "X5"     "X6"     "X7"    
##  [9] "X8"     "X9"

Merge the two sets of data into a new dataframe vcf_num_df2.

vcf_num_df2 <- merge(pop_meta,
                     vcf_num_df,
                     by = "sample")

Checking the dimensions before and after the merge

nrow(vcf_num_df) == nrow(vcf_num_df2)
## [1] TRUE

Check names of new dataframe.

names(vcf_num_df2)[1:15]
##  [1] "sample"    "pop"       "super_pop" "sex"       "lat"       "lng"      
##  [7] "X1"        "X2"        "X3"        "X4"        "X5"        "X6"       
## [13] "X7"        "X8"        "X9"

Check the working directory and save the csv.

getwd()
## [1] "/Users/rheapremanand/Documents/BIOSC 1540 Code/Final_Project"
write.csv(vcf_num_df2, file = "vcf_num_df.csv", row.names = F)

Confirm presence of file:

list.files(pattern = "csv")
## [1] "1000genomes_people_info2-1.csv" "SNPs_cleaned.csv"              
## [3] "vcf_num_df.csv"                 "vcf_num.csv"                   
## [5] "vcf_scaled_final.csv"           "walsh2017morphology.csv"

Subset data

Omit invariant features

Load invar_omit() function and run

invar_omit <- function(x){
  cat("Dataframe of dim",dim(x), "processed...\n")
  sds <- apply(x, 2, sd, na.rm = TRUE)
  i_var0 <- which(sds == 0)
 
  
  cat(length(i_var0),"columns removed\n")
  
  if(length(i_var0) > 0){
     x <- x[, -i_var0]
  }
  
  ## add return()  with x in it
  return(x)                      
}

Check which columns have character data

names(vcf_num_df2)[1:10]
##  [1] "sample"    "pop"       "super_pop" "sex"       "lat"       "lng"      
##  [7] "X1"        "X2"        "X3"        "X4"

Use negative indexing to skip columns with character data

#new dataframe to store output
vcf_noinvar <- vcf_num_df2[, -c(1:6)]

#run invar_omit() on numeric data
vcf_noinvar<- invar_omit(vcf_noinvar)
## Dataframe of dim 2504 8981 processed...
## 2085 columns removed
vcf_noinvar <- data.frame(vcf_num_df2[, c(1:6)],
                          vcf_noinvar)

#get new dimensions
dim(vcf_noinvar)
## [1] 2504 6902

Remove low-quality data

Load find_NAs()

find_Nas <- function(x){
  NAs_TF <- is.na(x)
  i_NA <- which(NAs_TF == TRUE)
  N_NA <- length(i_NA)
  
  return(i_NA)
}

Use a for() loop to search for NAs

#N_rows
#number of rows (individuals)
N_rows <- nrow(vcf_noinvar)

#N_NA
#vector to hold output (number of NAs)
N_NA <- rep(x = 0, times = N_rows)

#N_SNPs 
#total number of columns (SNPs)
N_SNPs <- ncol(vcf_noinvar)

# for() loop

for(i in 1:N_rows){
  
  #for each row, find the location of NAs
  i_NA <- find_Nas(vcf_noinvar[i,])
  
  #determine how many NAs with length()
  N_NA_i <- length(i_NA)
  
  #save the output to our storage vector
  N_NA[i] <- N_NA_i
  
}

Check if any row has >50% NAs.

cutoff50 <- N_SNPs*0.5
percent_NA <- N_NA/N_SNPs*100
any(percent_NA > 50)
## [1] FALSE

Find the average number of NAs per row and save.

my_meta_N_meanNA_rows <- mean(percent_NA)
my_meta_N_meanNA_rows
## [1] 0

Imputation of NAs

Mean Imputation

Load the mean_imputation() function

mean_imputation <- function(df){
  
  cat("This may take some time :)")
  n_cols <- ncol(df)
  
  for(i in 1:n_cols){
    #get the current column
    column_i <- df[, i]
    
    #get the mean of the current column
    mean_i <- mean(column_i, na.rm = TRUE)
    
    #get the NAs in the current column
    NAs_i <- which(is.na(column_i))
    
    #report the number of NAs
    N_NAs <-  length(NAs_i)
    
    #replace the NAs in the current column
    column_i[NAs_i] <- mean_i
    
    #replace the original column with the updated columns
    df[, i] <- column_i
    
  }
  
  return(df)
}

Run the function only on numeric columns.

names(vcf_noinvar)[1:10]
##  [1] "sample"    "pop"       "super_pop" "sex"       "lat"       "lng"      
##  [7] "X1"        "X2"        "X3"        "X5"

Make a new copy of data

vcf_noNA <- vcf_noinvar[, -c(1:6)]
vcf_noNA <- mean_imputation(vcf_noNA)
## This may take some time :)
vcf_noNA <- data.frame(vcf_noinvar[, c(1:6)],
                                   vcf_noNA)

dim(vcf_noNA)
## [1] 2504 6902

PCA Preparation

Scale the data

Center the data around the mean by using the scale() on the SNPs_cleaned dataframe. This will make the mean of the data set 0 and the standard deviation 1.

#new copy of data
vcf_scaled <- vcf_noNA[, -c(1:6)]

#scale
vcf_scaled <- scale(vcf_scaled)

vcf_scaled <- data.frame(vcf_noNA[, c(1:6)],
                         vcf_scaled)

Save cleaned up scaled data to a csv INCLUDES ALL PARTS OF DATAFRAME Proceed to final report template now.

write.csv(vcf_scaled, # no brackets!
          file = "vcf_scaled_final.csv",
          row.names = F) # row.names = F, no column.names!

#check for file
list.files(pattern = "csv")
## [1] "1000genomes_people_info2-1.csv" "SNPs_cleaned.csv"              
## [3] "vcf_num_df.csv"                 "vcf_num.csv"                   
## [5] "vcf_scaled_final.csv"           "walsh2017morphology.csv"

PCA

Principal Components Analysis was run using prcomp().

vcf_pca <- prcomp(vcf_scaled[,-c(1:6)])

warning("If this didn't work, you may not have omitted the proper number of columns.  ")
## Warning: If this didn't work, you may not have omitted the proper number of
## columns.

Get the PCA scores, which will be plotted.:

vcf_pca_scores  <- vegan::scores(vcf_pca) 

Combine the scores with the sample information into a dataframe.

# call data.frame()
vcf_pca_scores2 <- data.frame(population = vcf_scaled$super_pop, #TODO
                              vcf_pca_scores)

# set as a factor
vcf_pca_scores2$population <- factor(vcf_pca_scores2$population)

warning("If this didn't work, you may not have set of the column for sample data correctly")
## Warning: If this didn't work, you may not have set of the column for sample data
## correctly

PCA diagnostics

The following steps help us understand the PCA output and determine how many PCs should be plotted and/or used in further analyses such as scans for natural selection, cluster analysis, and GWAS.

Default scree plot

A default R scree plot was created with screeplot(). This plot does not provide extra information for assessing the importance of the PCs.

screeplot(vcf_pca, 
          xlab = "Principal Components")

Advanced scree plot

The original workflow and function for making a more advanced scree plot lacked flexibility (Brouwer, personal communication). The following function and workflow simplifies things

  1. Run PC (done above)
  2. Call function PCA_variation() (below) on PCA output.
  3. (Do NOT call summary() on PCA output)
  4. Call function screeplot_snps() on the output of PCA_variation() to make an advanced scree plot
  5. Call function PCA_cumulative_var_plot() to show the cumulative variation explained as more PCs are considered
NEW Functions
PCA_variation() function

This function extracts information needed to make a more advanced, annotated scree plot.

# This is a NEW function
PCA_variation <- function(pca){
  
  # get summary information from PCA
  pca_summary <- summary(pca)
  
  # extract information from summary
  ## raw variance for each PC
  variance <- pca_summary$importance[1,]
  
  ## % variance explained by each PC
  var_explained <- pca_summary$importance[2,]*100
  var_explained <- round(var_explained,3)
  
  ## cumulative % variance  
  var_cumulative <- pca_summary$importance[3,]*100
  var_cumulative <- round(var_cumulative,3)
  
  # prepare output
  N.PCs <- length(var_explained)
  var_df <- data.frame(PC = 1:N.PCs,
            var_raw  = variance,
            var_percent = var_explained, 
            cumulative_percent = var_cumulative)
  
  # return output
  return(var_df)   
}
screeplot_snps() function

This functions makes a more advanced scree plot better suited for PCS on for SNPs.

# This is a NEW function
screeplot_snps <- function(var_df){
total_var <- sum(var_df$var_raw)
N <- length(var_df$var_raw)
var_cutoff <- total_var/N
var_cut_percent <- var_cutoff/total_var*100
var_cut_percent_rnd <- round(var_cut_percent,2)
i_above_cut <- which(var_df$var_percent > var_cut_percent)
i_cut <- max(i_above_cut) 
ti <- paste0("Cutoff = ",
            var_cut_percent_rnd,
            "%\n","Useful PCs = ",i_cut)
plot(var_df$var_percent,
        main =ti, type = "l",
     xlab = "PC",
     ylab = "Percent variation",
     col = 0)

segments(x0 = var_df$PC,
         x1 = var_df$PC,
         y0 = 0, 
         y1 = var_df$var_percent,
         col = 1)


segments(x0 = 0,
         x1 = N,
         y0 = var_cut_percent, 
         y1 = var_cut_percent,
         col = 2)

}
PCA_cumulative_var_plot() function

This makes a plot complementary to a scree plot. A scree plot plots the amount of variation explained by each PC. This plot plots a curve of cumulative amount of variation explained by the PCs.

# This is a NEW function
PCA_cumulative_var_plot <- function(var_df){
  plot(cumulative_percent ~ PC, 
       data = var_out,
       main = "Cumulative percent variation\n explained by PCs",
       xlab = "PC",
       ylab = "Cumulative %",
       type = "l")
  
  total_var <- sum(var_df$var_raw)
N <- length(var_df$var_raw)
var_cutoff <- total_var/N
var_cut_percent <- var_cutoff/total_var*100
var_cut_percent_rnd <- round(var_cut_percent,2)
i_above_cut <- which(var_df$var_percent > var_cut_percent)
i_cut <- max(i_above_cut) 

percent_cut_i <- which(var_out$PC == i_cut )
percent_cut <- var_out$cumulative_percent[percent_cut_i]
segments(x0 = i_cut,
         x1 = i_cut,
         y0 = 0, 
         y1 = 100,
         col = 2)

segments(x0 = -10,
         x1 = N,
         y0 = percent_cut, 
         y1 = percent_cut,
         col = 2)

}
Advanced screeplot analysis
Extract information

Extract information on the variance explained by each PC.

var_out <- PCA_variation(vcf_pca)

Look at the output of PCA_variation()

head(var_out)
##     PC   var_raw var_percent cumulative_percent
## PC1  1 10.869964       1.713              1.713
## PC2  2 10.421917       1.575              3.288
## PC3  3  9.784612       1.388              4.677
## PC4  4  8.912234       1.152              5.829
## PC5  5  8.454771       1.037              6.865
## PC6  6  8.240014       0.985              7.850
Advanced screeplot

This advanced scree plot shows the amount of variation explained by all PCs. It marks with a horizontal line what the cutoff is for the amount of Percent variation explained that is useful, and a vertical line for where that line interacts the curve of the scree plot. The title indicates the percentage value of the cutoff and which PC is the last PC below that value. Though only the first few PCs can be plotted, PCs below the cut off value (“useful PCs) should probably used for further machine learning algorithms.

Make the scree plot with screeplot_snps()

screeplot_snps(var_out)

Cumulative variation plot

The cumulative variation plot shows how much variation in the data explained in total as more and more PCs are considered. The vertical red line shows the cutoff value from the scree plot (above). The horizontal line indicates what the total percentage of variation explained by these useful PCs is.

Make cumulative variation plot with PCA_cumulative_var_plot()

PCA_cumulative_var_plot(var_out)

PCA Scatterplots

The object created above var_out indicates how much variation is explained by each of the Principal components. This information is often added to the axes of scatterplots of PCA output.

head(var_out)
##     PC   var_raw var_percent cumulative_percent
## PC1  1 10.869964       1.713              1.713
## PC2  2 10.421917       1.575              3.288
## PC3  3  9.784612       1.388              4.677
## PC4  4  8.912234       1.152              5.829
## PC5  5  8.454771       1.037              6.865
## PC6  6  8.240014       0.985              7.850
library(scatterplot3d)

PC 1 explains 1.71% percent of the variation, PC2 explains. 1.58%, and PC3 explains 1.39%. In total, the first 3 PCs explain only ~4.7% of the variability in the data. The scree plot indicate that the first ~644 PCs are useful explain ~74.71% of the variation in the data. In further analysis such as GWAS the first 644 PCs should therefore be used.

Plot PC1 versus PC2

Plot the scores, with super-population color-coded

ggpubr::ggscatter(data = vcf_pca_scores2,
                  y = "PC2",
                  x = "PC1",
              color = "population",
              shape = "population",  
              main = "PCA Scatterplot",
         ylab = "PC2 (1.58% of variation)",
         xlab = "PC1 (1.71% of variation)")

Note how in the plot the amount of variation explained by each PC is shown in the axis labels.

Plot PC2 versus PC3

Plot the scores, with super population color-coded

ggpubr::ggscatter(data = vcf_pca_scores2,
                  y = "PC3",
                  x = "PC2",
                  color = "population", 
                  shape = "population", 
                  main = "PCA Scatterplot",
          ylab = "PC3 (1.39% of variation)",
          xlab = "PC2 1.58% of variation)")

Note how in the plot the amount of variation explained by each PC is shown in the axis labels.

Plot PC1 versus PC3

Plot the scores, with super population color-coded

ggpubr::ggscatter(data = vcf_pca_scores2,
                  y = "PC3",
                  x = "PC1",
                  ellipse = T,
            color = "population", 
            shape = "population",
            main = "PCA Scatterplot",
      ylab = "PC3 (1.39% of variation)",
      xlab = "PC1 (1.71% of variation)")

Note how in the plot the amount of variation explained by each PC is shown in the axis labels.

3D scatterplot

The first 3 principal components can be presented as a 3D scatterplot.

colors_use <- as.numeric(vcf_pca_scores2$population)
scatterplot3d(x = vcf_pca_scores2$PC1,
              y = vcf_pca_scores2$PC2,
              z = vcf_pca_scores2$PC3,
              color = colors_use,
              xlab = "PC1 (1.71%)",
              ylab = "PC2 (1.58%)",
              zlab = "PC3 (1.39%)")

Conclusion

In general, most of the super populations were not readily identifiable amongst the others. Looking at PC2 vs. PC1, the EAS population was the most apparent on the left side of the plot, even though parts of it were mixed in with the other populations all around. In PC3 vs. PC2, the AFR population was the most distinguishable, and tended to cluster together in contrast to the other populations which were distributed everywhere. For PC3 vs. PC1, again the AFR super population was the most apparent, as shown in a little red circle towards the middle bottom of the plot while the other populations were all on top of each other. Looking at the 3D scatterplot of PC1, PC2, and PC3, there was no apparent population indicated. There was, however, a large black group towards the bottom of the plot, but I am unclear as to which population it corresponds to - it may also represent a mixture of multiple populations. This may be attributed to there not being large differences in variability between these PCs.