Worked Example: PCA on SNPs data from a vcf file Part 1

Introduction

In this worked example you will replicate a PCA on a published dataset.

The example is split into 2 Parts:

Part 1: Data Preparation (this file)
Part 2: Data analysis with PCA

In this Data Preparation phase, you will do the following things:

Load the SNP genotypes in .vcf format (vcfR::read.vcfR())
Extract the genotypes into an R-compatible format (vcfR::extract.gt())
Rotate the data into the standard R analysis format (t())
Remove individuals (rows) from the data set that have >50% NAs (using a function I wrote)
Remove SNPs (columns) that are fixed
Impute remaining NAs (using a for() loop)
Save the prepared data as a .csv file for the next step (write.csv())

Biological background

This worked example is based on a paper in the journal Molecular Ecology from 2017 by Jennifer Walsh titled Subspecies delineation amid phenotypic, geographic and genetic discordance in a songbird.

The study investigated variation between two bird species in the genus Ammodramus: A. nenlsoni and A. caudacutus.

The species A. nenlsoni has been divided into 3 sub-species: A. n. nenlsoni, A.n. alterus, and A n. subvirgatus. The other species, A. caudacutus, has been divided into two subspecies, A.c. caudacutus and A.c. diversus.

The purpose of this study was to investigate to what extent these five subspecies recognized by taxonomists are supported by genetic data. The author’s collected DNA from 75 birds (15 per subspecies) and genotyped 1929 SNPs. They then analyzed the data with Principal Components Analysis (PCA), among other genetic analyzes.

This tutorial will work through all of the steps necessary to re-analyze Walsh et al.s data

Tasks

In the code below all code is provided. Your tasks will be to do 2 things:

Give a meaningful title to all sections marked “TODO: TITLE”
Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”

Preliminaries

Load the vcfR and other packages with library().

library(vcfR)

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.6-4

library(ggplot2)
library(ggpubr)

Make sure that your working directory is set to the location of the file all_loci.vcf.

getwd()

## [1] "/Users/adetayoadenekan/Downloads"

list.files()

##   [1] "_ada86_project1.py (1).zip"                                                      
##   [2] "_ada86_project1.py.zip"                                                          
##   [3] "06.07.2022_41_13_3_AA_40x_DHam#3.jpg"                                            
##   [4] "06.07.2022_41_13_3_AA_40x_DHam#3(1).jpg"                                         
##   [5] "06.07.2022_41_13_3_AA_40x_DHam#3(2).jpg"                                         
##   [6] "06.07.2022_41_13_3_AA_40x_DHam#3(3).jpg"                                         
##   [7] "06.07.2022_41_13_3_AA_40x_Ham_1.jpg"                                             
##   [8] "06.07.2022_41_13_3_AA_40x_Ham#1.jpg"                                             
##   [9] "06.07.2022_41_13_3_AA_40x_Ham#2.jpg"                                             
##  [10] "06.21.22_5A_3_.jpg"                                                              
##  [11] "06.21.22_5a_4_pinching.jpg"                                                      
##  [12] "07-mean_imputation.html"                                                         
##  [13] "07-mean_imputation.Rmd"                                                          
##  [14] "08-PCA_worked.html"                                                              
##  [15] "08-PCA_worked.Rmd"                                                               
##  [16] "09-PCA_worked_example-SNPs-part1.Rmd"                                            
##  [17] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"                 
##  [18] "10-PCA_worked_example-SNPs-part2.Rmd"                                            
##  [19] "1000genomes_people_info2-1.csv"                                                  
##  [20] "11.47816134-48056134.ALL.chr11_GRCh38.genotypes.20170504.vcf.gz"                 
##  [21] "1540_week14_PCA_SNP_workflow.pdf"                                                
##  [22] "1663709605.650343.jpg"                                                           
##  [23] "2.136483646-136733646.ALL.chr2_GRCh38.genotypes.20170504.vcf.gz"                 
##  [24] "20221121005702.ics"                                                              
##  [25] "20221121015257.ics"                                                              
##  [26] "5.24.22_2_10_5_BS_40xham.jpg"                                                    
##  [27] "5.24.22_2_5_3_BS_40xdham.jpg"                                                    
##  [28] "5.24.22_2_5_3_BS_40xdham2.jpg"                                                   
##  [29] "5.24.22_2_5_3_BS_40xham.jpg"                                                     
##  [30] "5.24.22_2_5_3_BS_40xham2.jpg"                                                    
##  [31] "5.24.22_2_7_5_BS_40xham.jpg"                                                     
##  [32] "5.24.22_2_7_5_BS_40xham2.jpg"                                                    
##  [33] "5.24.22_2_7_5_BS_d16hooks_wormbody.jpg"                                          
##  [34] "5.24.22_2_9_3_BS_40x.jpg"                                                        
##  [35] "5.24.22_2_9_3_BS_40xham.jpg"                                                     
##  [36] "5.26.22_2_11_3_AA_40x_DHam.jpg"                                                  
##  [37] "5.26.22_2_11_AA_40x (1).jpg"                                                     
##  [38] "5.31.22_2_12_ND_40X__Parasite1__dapiham.jpg"                                     
##  [39] "Adenekan_resume (1).pdf"                                                         
##  [40] "Adenekan_Resume.pdf"                                                             
##  [41] "Adetayo_Adenekan_BIOSC_1903.pdf"                                                 
##  [42] "all_loci-1.vcf"                                                                  
##  [43] "all_loci.vcf"                                                                    
##  [44] "allomtery_3_scatterplot3d (1).Rmd"                                               
##  [45] "ApE_OSX_modern_current (1).dmg"                                                  
##  [46] "ApE_OSX_modern_current.dmg"                                                      
##  [47] "BIOSC0067 Practice introduction (GROUP) (1).pptx"                                
##  [48] "BIOSC0067 Practice introduction (GROUP).pdf"                                     
##  [49] "BIOSC0067 Practice introduction (GROUP).pptx"                                    
##  [50] "bird_snps_remove_NAs.html"                                                       
##  [51] "bird_snps_remove_NAs.Rmd"                                                        
##  [52] "booleans (1).ppt"                                                                
##  [53] "booleans.ppt"                                                                    
##  [54] "BSrecursive.java"                                                                
##  [55] "casey 2017 NYT Climate Change Claims a Lake, and an Identity.pdf"                
##  [56] "center_function (1).R"                                                           
##  [57] "center_function.R"                                                               
##  [58] "Certification_Form_Completed.pdf"                                                
##  [59] "Chapter 02 - Programs 4"                                                         
##  [60] "Chapter 03- programs"                                                            
##  [61] "Chapter 03- programs.zip"                                                        
##  [62] "Chapter 04-programs"                                                             
##  [63] "Chapter 04-programs.zip"                                                         
##  [64] "Chapter 05-programs"                                                             
##  [65] "Chapter 05-programs (1).zip"                                                     
##  [66] "Chapter 05-programs.zip"                                                         
##  [67] "Chapter 06-programs"                                                             
##  [68] "Chapter 06-programs.zip"                                                         
##  [69] "Chapter 07-programs"                                                             
##  [70] "Chapter 07-programs.zip"                                                         
##  [71] "Chapter 09-programs.zip"                                                         
##  [72] "Chapter 11-programs"                                                             
##  [73] "Chapter 11-programs.zip"                                                         
##  [74] "Chapter-worksheet2.docx"                                                         
##  [75] "Chapter6-Worksheet.docx"                                                         
##  [76] "Chapter7.WorkProblems.docx"                                                      
##  [77] "Chromosome_IX_features.txt"                                                      
##  [78] "cluster_analysis_portfolio.Rmd"                                                  
##  [79] "code_checkpoint_vcfR.html"                                                       
##  [80] "code_checkpoint_vcfR.Rmd"                                                        
##  [81] "CODE_CHECKPOINT-first_rstudio_script (1).R"                                      
##  [82] "CODE_CHECKPOINT-first_rstudio_script.R"                                          
##  [83] "CodeResources"                                                                   
##  [84] "ComparableFraction.java"                                                         
##  [85] "DNA Reg 0067 syllabus Fall 2022 Barbieri.pdf"                                    
##  [86] "download.html"                                                                   
##  [87] "Fall Term 2022-2023 (1).ics"                                                     
##  [88] "Fall Term 2022-2023 (2).ics"                                                     
##  [89] "Fall Term 2022-2023 (3).ics"                                                     
##  [90] "Fall Term 2022-2023.ics"                                                         
##  [91] "feature_engineering (1).Rmd"                                                     
##  [92] "feature_engineering_intro_2_functions-part2.Rmd"                                 
##  [93] "feature_engineering.Rmd"                                                         
##  [94] "final essay- SC.pdf"                                                             
##  [95] "Final Project Genomes.vcf.gz"                                                    
##  [96] "final_report_template.Rmd"                                                       
##  [97] "Fraction.java"                                                                   
##  [98] "fst_exploration_in_class-STUDENT.Rmd"                                            
##  [99] "fst_exploration_in_class.Rmd"                                                    
## [100] "GroupC_Thur830.jpg"                                                              
## [101] "Horner.java"                                                                     
## [102] "How to access CS 401 Textbook.docx"                                              
## [103] "IMG_3493.HEIC"                                                                   
## [104] "IMG_3660.jp2"                                                                    
## [105] "IMG_3694 Small.jpeg"                                                             
## [106] "IMG_3694.HEIC"                                                                   
## [107] "IMG_3695 Small.jpeg"                                                             
## [108] "IMG_3695.HEIC"                                                                   
## [109] "IMG_3696 Small.jpeg"                                                             
## [110] "IMG_3696.HEIC"                                                                   
## [111] "IMG_3825 Small.jpeg"                                                             
## [112] "IMG_3825.HEIC"                                                                   
## [113] "IMG_4941.jpg"                                                                    
## [114] "INCLAS_1 (1).APE"                                                                
## [115] "INCLAS_1.APE"                                                                    
## [116] "James Stewart - Essential Calculus_ Early Transcendentals-Brooks Cole (2012).pdf"
## [117] "Lab3.java"                                                                       
## [118] "Lab4Tayo (1).java"                                                               
## [119] "Lab4Tayo.java"                                                                   
## [120] "Library for Students.pptx"                                                       
## [121] "Microsoft_Office_16.57.22011101_BusinessPro_Installer.pkg"                       
## [122] "Mu.Editor.1.1.0b7 (1).dmg"                                                       
## [123] "Mu.Editor.1.1.0b7 (2).dmg"                                                       
## [124] "Mu.Editor.1.1.0b7.dmg"                                                           
## [125] "Mult.java"                                                                       
## [126] "Nasal Anomaly- SC.pdf"                                                           
## [127] "OpenJDK17U-jdk_aarch64_mac_hotspot_17.0.4.1_1.pkg"                               
## [128] "parasite_45_pics"                                                                
## [129] "parasite_45_pics.zip"                                                            
## [130] "Part_2_analyze_course_pop_data_student_JR.pptx"                                  
## [131] "PCA-missing_data-KEY.Rmd"                                                        
## [132] "PCA-missing_data.Rmd"                                                            
## [133] "portfolio_ggpubr_intro-2 (1).Rmd"                                                
## [134] "portfolio_ggpubr_intro-2.Rmd"                                                    
## [135] "portfolio_ggpubr_log_transformation (1).Rmd"                                     
## [136] "portfolio_ggpubr_log_transformation.Rmd"                                         
## [137] "Project One- Knight (complete)"                                                  
## [138] "project One.zip"                                                                 
## [139] "Project3 (1).java"                                                               
## [140] "Project3 (2).java"                                                               
## [141] "Project3 (3).java"                                                               
## [142] "Project3.java"                                                                   
## [143] "Project5 (1).java"                                                               
## [144] "Project5 (2).java"                                                               
## [145] "Project5.java"                                                                   
## [146] "ProtonVPN.dmg"                                                                   
## [147] "R-4.2.1-arm64.pkg"                                                               
## [148] "removing_fixed_alleles (1).Rmd"                                                  
## [149] "removing_fixed_alleles (2).Rmd"                                                  
## [150] "removing_fixed_alleles.Rmd"                                                      
## [151] "restaurant.py"                                                                   
## [152] "Result.pdf"                                                                      
## [153] "Roblox.dmg"                                                                      
## [154] "RPB1 (1).ape"                                                                    
## [155] "RPB1.ape"                                                                        
## [156] "Rplot (cluster).jpeg"                                                            
## [157] "Rplot-1.jpeg"                                                                    
## [158] "Rplot.jpeg"                                                                      
## [159] "rsconnect"                                                                       
## [160] "RStudio-2022.07.1-554.dmg"                                                       
## [161] "RStudio-2022.07.2-576.dmg"                                                       
## [162] "S288C_chrIX_BK006942.2 (1).fsa"                                                  
## [163] "S288C_chrIX_BK006942.2.fsa"                                                      
## [164] "SCIGRESS_V3.5.0.dmg"                                                             
## [165] "Sequence 2 (1).ape"                                                              
## [166] "Sequence 2.ape"                                                                  
## [167] "Sequence 5 (1).ape"                                                              
## [168] "Sequence 5.ape"                                                                  
## [169] "Sequence 8 (1).ape"                                                              
## [170] "Sequence 8.ape"                                                                  
## [171] "SignedFraction.java"                                                             
## [172] "Silverman 2015_BrandingPeru.pdf"                                                 
## [173] "SNPs_cleaned.csv"                                                                
## [174] "SSL2.ape"                                                                        
## [175] "Standard_Normal_Table_Z-Values.pdf"                                              
## [176] "transpose_VCF_data.html"                                                         
## [177] "transpose_VCF_data.Rmd"                                                          
## [178] "Unconfirmed 31776.crdownload"                                                    
## [179] "Untitled 1.R"                                                                    
## [180] "Untitled.Rmd"                                                                    
## [181] "vcfR_test.vcf"                                                                   
## [182] "vcfR_test.vcf.gz"                                                                
## [183] "vegan_PCA_amino_acids-STUDENT (1).Rmd"                                           
## [184] "vegan_PCA_amino_acids-STUDENT--1-.html"                                          
## [185] "vegan_PCA_amino_acids-STUDENT.Rmd"                                               
## [186] "vegan_pca_with_msleep-STUDENT.html"                                              
## [187] "vegan_pca_with_msleep-STUDENT.Rmd"                                               
## [188] "Visual Studio Code 2.app"                                                        
## [189] "Visual Studio Code.app"                                                          
## [190] "VSCode-darwin-universal (1).zip"                                                 
## [191] "VSCode-darwin-universal.zip"                                                     
## [192] "walsh2017morphology.csv"                                                         
## [193] "Week 1 class slides.pptx"                                                        
## [194] "working_directory_practice.html"                                                 
## [195] "working_directory_practice.Rmd"                                                  
## [196] "World History through Case Studies (Dave Eaton) (z-lib.org).epub"

list.files(pattern = "vcf")

##  [1] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
##  [2] "11.47816134-48056134.ALL.chr11_GRCh38.genotypes.20170504.vcf.gz"
##  [3] "2.136483646-136733646.ALL.chr2_GRCh38.genotypes.20170504.vcf.gz"
##  [4] "all_loci-1.vcf"                                                 
##  [5] "all_loci.vcf"                                                   
##  [6] "code_checkpoint_vcfR.html"                                      
##  [7] "code_checkpoint_vcfR.Rmd"                                       
##  [8] "Final Project Genomes.vcf.gz"                                   
##  [9] "vcfR_test.vcf"                                                  
## [10] "vcfR_test.vcf.gz"

Data preparation

TODO: Load loci into an R data object called snps

TODO: the vcf file all loci is being loaded into an object called snps. read.vcfR is used to call the file and load it into the object

snps <- vcfR::read.vcfR("all_loci.vcf", convertNA  = TRUE)

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 8
##   header_line: 9
##   variant count: 1929
##   column count: 81
## 
Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 1929
##   Character matrix gt cols: 81
##   skip: 0
##   nrows: 1929
##   row_num: 0
## 
Processed variant 1000
Processed variant: 1929
## All variants processed

TODO: Extract the genotype scores

TODO: extract.gt() is used as a way to get the genotype scores from the loaded data. The generated genotype scores are then loaded into another object called snps_num.

snps_num <- vcfR::extract.gt(snps, 
           element = "GT",
           IDtoRowNames  = F,
           as.numeric = T,
           convertNA = T,
           return.alleles = F)

TODO: Transpose snps_num to get the proper orientation

TODO: transposing the data lets you calculate the matrix

snps_num_t <- t(snps_num)

TODO: the object snps_num is then made into a data frame and saved to another object called snps_num_df

snps_num_df <- data.frame(snps_num_t)

TODO: Create a function to remove NAs from the data frame

TODO: the function is used to remove NAs from the data frame and creates invariant columns. Then the function is used to check if numbers are NAs and where they are/ how many there are.

find_NAs <- function(x){
  NAs_TF <- is.na(x)
  i_NA <- which(NAs_TF == TRUE)
  N_NA <- length(i_NA)
  
  cat("Results:",N_NA, "NAs present\n.")
  return(i_NA)
}

TODO: the NAs are grouped together in a vector called N_rows/ This makes the data into rows, so the NAs can be checked and deleted within each row without having to delete the whole row

# N_rows
# number of rows (individuals)
N_rows <- nrow(snps_num_t)

# N_NA
# vector to hold output (number of NAs)
N_NA   <- rep(x = 0, times = N_rows)

# N_SNPs
# total number of columns (SNPs)
N_SNPs <- ncol(snps_num_t)

# the for() loop
for(i in 1:N_rows){
  
  # for each row, find the location of
  ## NAs with snps_num_t()
  i_NA <- find_NAs(snps_num_t[i,]) 
  
  # then determine how many NAs
  ## with length()
  N_NA_i <- length(i_NA)
  
  # then save the output to 
  ## our storage vector
  N_NA[i] <- N_NA_i
}

## Results: 28 NAs present
## .Results: 20 NAs present
## .Results: 28 NAs present
## .Results: 24 NAs present
## .Results: 23 NAs present
## .Results: 63 NAs present
## .Results: 51 NAs present
## .Results: 38 NAs present
## .Results: 34 NAs present
## .Results: 24 NAs present
## .Results: 48 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 78 NAs present
## .Results: 45 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 34 NAs present
## .Results: 66 NAs present
## .Results: 54 NAs present
## .Results: 59 NAs present
## .Results: 52 NAs present
## .Results: 47 NAs present
## .Results: 31 NAs present
## .Results: 63 NAs present
## .Results: 40 NAs present
## .Results: 40 NAs present
## .Results: 22 NAs present
## .Results: 60 NAs present
## .Results: 48 NAs present
## .Results: 961 NAs present
## .Results: 478 NAs present
## .Results: 59 NAs present
## .Results: 26 NAs present
## .Results: 285 NAs present
## .Results: 409 NAs present
## .Results: 1140 NAs present
## .Results: 600 NAs present
## .Results: 1905 NAs present
## .Results: 25 NAs present
## .Results: 1247 NAs present
## .Results: 23 NAs present
## .Results: 750 NAs present
## .Results: 179 NAs present
## .Results: 433 NAs present
## .Results: 123 NAs present
## .Results: 65 NAs present
## .Results: 49 NAs present
## .Results: 192 NAs present
## .Results: 433 NAs present
## .Results: 66 NAs present
## .Results: 597 NAs present
## .Results: 1891 NAs present
## .Results: 207 NAs present
## .Results: 41 NAs present
## .Results: 268 NAs present
## .Results: 43 NAs present
## .Results: 110 NAs present
## .Results: 130 NAs present
## .Results: 90 NAs present
## .Results: 271 NAs present
## .Results: 92 NAs present
## .Results: 103 NAs present
## .Results: 175 NAs present
## .Results: 31 NAs present
## .Results: 66 NAs present
## .Results: 64 NAs present
## .Results: 400 NAs present
## .Results: 192 NAs present
## .Results: 251 NAs present
## .Results: 69 NAs present
## .Results: 58 NAs present
## .

TODO: only half of the NAs in rows will be used in the histogram. The histogram also serves as a cut off for where the data becomes redundant and no needed.

# 50% of N_SNPs
cutoff50 <- N_SNPs*0.5

hist(N_NA)            
abline(v = cutoff50, 
       col = 2, 
       lwd = 2, 
       lty = 2)

TODO: the rows with no NAs are divided by the percentage of the snps. after, the percent_NA is then saved and used to calculate which snps

percent_NA <- N_NA/N_SNPs*100

# Call which() on percent_NA
i_NA_50percent <- which(percent_NA > 50) 

snps_num_t02 <- snps_num_t[-i_NA_50percent, ]

TODO: Find the Population Code

TODO: gsub is used to find the population. first the row names are saved

row_names <- row.names(snps_num_t02) # Key

row_names02 <- gsub("sample_","",row_names)

sample_id <- gsub("^([ATCG]*)(_)(.*)",
                  "\\3",
                  row_names02)
pop_id <- gsub("[01-9]*",    
               "",
               sample_id)

table(pop_id)

## pop_id
## Alt Cau Div Nel Sub 
##  15  12  15  15  11

TODO: Create a function to process and remove

TODO: the function is used to calculate the dimensions of the data

invar_omit <- function(x){
  cat("Dataframe of dim",dim(x), "processed...\n")
  sds <- apply(x, 2, sd, na.rm = TRUE)
  i_var0 <- which(sds == 0)
 
  
  cat(length(i_var0),"columns removed\n")
  
  if(length(i_var0) > 0){
     x <- x[, -i_var0]
  }
  
  ## add return()  with x in it
  return(x)                      
}


snps_no_invar <- invar_omit(snps_num_t02)

## Dataframe of dim 68 1929 processed...
## 591 columns removed

TODO: Remove NAs in the columns

TODO: the data in the columns have to be replaced by the mean if there are any NAs left. after,the number of NAs are counted then replaced by the mean.

snps_noNAs <- snps_no_invar

N_col <- ncol(snps_no_invar)
for(i in 1:N_col){
  
  # get the current column
  column_i <- snps_noNAs[, i]
  
  # get the mean of the current column
  mean_i <- mean(column_i, na.rm = TRUE)
  
  # get the NAs in the current column
  NAs_i <- which(is.na(column_i))
  
  # record the number of NAs
  N_NAs <- length(NAs_i)

  # replace the NAs in the current column
  column_i[NAs_i] <- mean_i
  
  # replace the original column with the
  ## updated columns
  snps_noNAs[, i] <- column_i
  
}

Save the data

Save the data as a .csv file which can be loaded again later.

write.csv(snps_noNAs, file = "SNPs_cleaned.csv",
          row.names = F)

Check for the presence of the file with list.files()

list.files(pattern = ".csv")

## [1] "1000genomes_people_info2-1.csv" "SNPs_cleaned.csv"              
## [3] "walsh2017morphology.csv"

Next steps:

In Part 2, we will re-load the SNPs_cleaned.csv file and carry an an analysis with PCA.

Worked Example: PCA on SNPs data from a vcf file Part 1 - Data Preparation

2022-11-19