Introduction

In this worked example you will replicate a PCA on a published dataset.

The example is split into 2 Parts:

In this Data Preparation phase, you will do the following things:

  1. Load the SNP genotypes in .vcf format (vcfR::read.vcfR())
  2. Extract the genotypes into an R-compatible format (vcfR::extract.gt())
  3. Rotate the data into the standard R analysis format (t())
  4. Remove individuals (rows) from the data set that have >50% NAs (using a function I wrote)
  5. Remove SNPs (columns) that are fixed
  6. Impute remaining NAs (using a for() loop)
  7. Save the prepared data as a .csv file for the next step (write.csv())

Biological background

This worked example is based on a paper in the journal Molecular Ecology from 2017 by Jennifer Walsh titled Subspecies delineation amid phenotypic, geographic and genetic discordance in a songbird.

The study investigated variation between two bird species in the genus Ammodramus: A. nenlsoni and A. caudacutus.

The species A. nenlsoni has been divided into 3 sub-species: A. n. nenlsoni, A.n. alterus, and A n. subvirgatus. The other species, A. caudacutus, has been divided into two subspecies, A.c. caudacutus and A.c. diversus.

The purpose of this study was to investigate to what extent these five subspecies recognized by taxonomists are supported by genetic data. The author’s collected DNA from 75 birds (15 per subspecies) and genotyped 1929 SNPs. They then analyzed the data with Principal Components Analysis (PCA), among other genetic analyzes.

This tutorial will work through all of the steps necessary to re-analyze Walsh et al.s data

Tasks

In the code below all code is provided. Your tasks will be to do 2 things:

  1. Give a meaningful title to all sections marked “TODO: TITLE”
  2. Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”

Preliminaries

Load the vcfR and other packages with library().

library(vcfR)    
## Warning: package 'vcfR' was built under R version 4.2.2
## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-2
library(ggplot2)
library(ggpubr)

Make sure that your working directory is set to the location of the file all_loci.vcf.

getwd()
## [1] "C:/Users/sn0wR/Downloads"
list.files()
##   [1] "~$citation 3 Fall 2022 updated.docx"                                              
##   [2] "07-mean_imputation.docx"                                                          
##   [3] "07-mean_imputation.html"                                                          
##   [4] "07-mean_imputation.Rmd"                                                           
##   [5] "08-PCA_worked.html"                                                               
##   [6] "08-PCA_worked.Rmd"                                                                
##   [7] "08_extended_euclidean (1).pdf"                                                    
##   [8] "08_extended_euclidean.pdf"                                                        
##   [9] "09-PCA_worked_example-SNPs-part1.Rmd"                                             
##  [10] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"                  
##  [11] "10-PCA_worked_example-SNPs-part2.Rmd"                                             
##  [12] "100 percent sure to obtain Anomaly Investigations-1744-1-0-1664560538.rar"        
##  [13] "100 percent sure to obtain Anomaly Investigations (PAK)-1744-1-0-1664600511.rar"  
##  [14] "1501Final_Practice_Questions.pdf"                                                 
##  [15] "1501Final_Practice_Questions_KEY.pdf"                                             
##  [16] "1501Midterm_Practice_Completed.pdf"                                               
##  [17] "1501Midterm_Practice_Questions_KEY.pdf"                                           
##  [18] "1501Midterm_Practice_Questions_ZID8.pdf"                                          
##  [19] "1540_week14_PCA_SNP_workflow.pdf"                                                 
##  [20] "1953-nature-papers-watson-crick-wilkins-franklin.pdf"                             
##  [21] "2.136483646-136733646.ALL.chr2_GRCh38.genotypes.20170504.vcf.gz"                  
##  [22] "22.sav-18-1-1648918704.rar"                                                       
##  [23] "6dc10fab-1f0a-4b9e-ba08-566289371ee6.pdf"                                         
##  [24] "6dc10fab-1f0a-4b9e-ba08-566289371ee6_Page_1.png"                                  
##  [25] "6dc10fab-1f0a-4b9e-ba08-566289371ee6_Page_2.png"                                  
##  [26] "77e36712-1127-4561-992f-5714914a3c19"                                             
##  [27] "77e36712-1127-4561-992f-5714914a3c19.pdf"                                         
##  [28] "77e36712-1127-4561-992f-5714914a3c19.zip"                                         
##  [29] "84ea412b-e1c5-4ad3-878a-52cb05420b91.pdf"                                         
##  [30] "92e11f20-7d1b-4464-8453-a8f7f29f5603.pdf"                                         
##  [31] "all_loci-1.vcf"                                                                   
##  [32] "all_loci.vcf"                                                                     
##  [33] "Analyzing_Practice_Sequencing_Flowers_F22.pptx"                                   
##  [34] "ApE_win_current"                                                                  
##  [35] "ApE_win_current.zip"                                                              
##  [36] "Article Portfolio W1to4 ZID8.docx"                                                
##  [37] "Article Portfolio w9to12 ZHD.docx"                                                
##  [38] "articulation_points.pdf"                                                          
##  [39] "Assessing_Shannon_Diversity_StudentPrelab_Flowers_F22_.pptx"                      
##  [40] "Background_Research_Assignment_Flowers_F22_-_SG_Edits (1) (1).pptx"               
##  [41] "Background_Research_Assignment_Flowers_F22_-_SG_Edits (1).pptx"                   
##  [42] "Bacterial_isolation_protocol_Student_version_Flowers_F22.pptx"                    
##  [43] "BenchC_170V_40mins_PCR.jpg"                                                       
##  [44] "bird_snps_remove_NAs.html"                                                        
##  [45] "bird_snps_remove_NAs.Rmd"                                                         
##  [46] "boosteroid-install-x64.zip"                                                       
##  [47] "BPB_spectrophotometry.xlsx"                                                       
##  [48] "business-model-canvas - blank.pdf"                                                
##  [49] "business-model (1).pdf"                                                           
##  [50] "business-model.pdf"                                                               
##  [51] "Business Model Canvas-1.pdf"                                                      
##  [52] "CFU Practice Calculation_Flowers_F22.xlsx"                                        
##  [53] "cluster_analysis_worksheet.pdf"                                                   
##  [54] "CODE_CHECKPOINT-first_rstudio_script.R"                                           
##  [55] "code_checkpoint_vcfR.html"                                                        
##  [56] "code_checkpoint_vcfR.Rmd"                                                         
##  [57] "Collated Data_Flowers_Data_Excel_F22_10.2.22_DIVERSITY_ZHD.xlsx"                  
##  [58] "Collated Flowers_Data_Excel_Sheet_F22_10.19.22.xlsx"                              
##  [59] "Config 1.rewasd"                                                                  
##  [60] "Customer Discovery Plan-1 (1).pdf"                                                
##  [61] "Customer Discovery Plan-1.pdf"                                                    
##  [62] "desktop.ini"                                                                      
##  [63] "DouyuLive_8.6.0.5_Server_1.1.1.4.exe"                                             
##  [64] "elife-72072-v1.pdf"                                                               
##  [65] "Exam 3 practice Fall 2022.pdf"                                                    
##  [66] "Excel Practice_Flowers F22.pptx"                                                  
##  [67] "FA22-Flower_Collection_and_Press_Protocol_BR.pdf"                                 
##  [68] "FA22-Flower_Collection_and_Press_Protocol_BR.pptx"                                
##  [69] "Figure Assignment #2.pptx"                                                        
##  [70] "Figure Assignment #3.pptx"                                                        
##  [71] "Figure Assignment #4.pptx"                                                        
##  [72] "Final Figure.pptx"                                                                
##  [73] "Final Presentation planning Flowers F22 (1).docx"                                 
##  [74] "Final Presentation planning Flowers F22.docx"                                     
##  [75] "Final Presentation.pdf"                                                           
##  [76] "Flow chart assignment_Flowers_F22.pptx"                                           
##  [77] "flower.jpeg"                                                                      
##  [78] "Flowers_Data_Excel_Sheet_F22_9.9.22 (1).xlsx"                                     
##  [79] "Flowers_Data_Excel_Sheet_F22_9.9.22.xlsx"                                         
##  [80] "Fluffy Manager 5000-7-2-255-1664883438.zip"                                       
##  [81] "For Any Controller.rewasd"                                                        
##  [82] "fst_exploration_in_class-STUDENT.html"                                            
##  [83] "fst_exploration_in_class-STUDENT.Rmd"                                             
##  [84] "fst_exploration_in_class.Rmd"                                                     
##  [85] "Gel Electro Results.pptx"                                                         
##  [86] "Gel_Electrophoresis_flowchart_Student_Flowers_F22.pptx"                           
##  [87] "General Figure Guidelines and Checklist_Flowers_F22_9.2.22-2.png"                 
##  [88] "General Figure Guidelines and Checklist_Flowers_F22_9.2.22 (1).docx"              
##  [89] "General Figure Guidelines and Checklist_Flowers_F22_9.2.22 (1).pdf"               
##  [90] "General Figure Guidelines and Checklist_Flowers_F22_9.2.22 (2).docx"              
##  [91] "General Figure Guidelines and Checklist_Flowers_F22_9.2.22 (2).pdf"               
##  [92] "General Figure Guidelines and Checklist_Flowers_F22_9.2.22.docx"                  
##  [93] "General Figure Guidelines and Checklist_Flowers_F22_9.2.22.pdf"                   
##  [94] "General Figure Guidelines and Checklist_Flowers_F22_Revision-1.docx"              
##  [95] "General Figure Guidelines and Checklist_Flowers_F22_Revision-1.pdf"               
##  [96] "Genetics_from_Genes_to_Genomes_----_(Part_II_What_Genes_Are_and_What_They_Do).pdf"
##  [97] "ggplot2ggpubrIntro.png"                                                           
##  [98] "ggpubrlog.png"                                                                    
##  [99] "Git-2.37.3-64-bit.exe"                                                            
## [100] "GK HIST Schedule.pdf"                                                             
## [101] "Hendry_aphids_CB_2018[34].pdf"                                                    
## [102] "How to keep a lab notebook_F22.docx"                                              
## [103] "HunterPie-181-2-5-0-1667440563.zip"                                               
## [104] "Indentifying_features_of_sunflowers_Flowers_F22.pptx"                             
## [105] "INI files - Recommended 1.2-76466-1-2-1666602728.rar"                             
## [106] "Introduction Figure Slides.pptx"                                                  
## [107] "JavaCodingPack-0.4.1.exe"                                                         
## [108] "jdk-18_windows-x64_bin.exe"                                                       
## [109] "Journal Club 1 Flowers F22 Zi Han Ding.pptx"                                      
## [110] "Journal Club 2 Flowers F22-1 Ding.pptx"                                           
## [111] "Kb&Mouse for Xbox Cloud.rewasd"                                                   
## [112] "Lab Meeting #2 Graphing_Flowers_F22.pptx"                                         
## [113] "Lab Meeting #2_Flowers_F22.pptx"                                                  
## [114] "Lab Meeting 1 Diversity_Directions_Flowers F22.pptx"                              
## [115] "LabMathPipettingPresentation_Flowers_F22.pptx"                                    
## [116] "LayeredWeaponProgression-1442-2-1-0-1661926735.zip"                               
## [117] "Lecture J Chromosomes Fall 2022.pptx"                                             
## [118] "LeetCode 101 - A LeetCode Grinding Guide (C++ Version).pdf"                       
## [119] "Market Analysis Zi Han Ding.docx"                                                 
## [120] "MO33_ZiHan-27F_forward.seq"                                                       
## [121] "Mod Organizer 2-6194-2-4-4-1640622655.exe"                                        
## [122] "ModdedStuff 3-22-0-3-1649973598.7z"                                               
## [123] "OBS-Studio-28.1.1-Full-Installer-x64.exe"                                         
## [124] "OperaGXSetup.exe"                                                                 
## [125] "PCA-missing_data.Rmd"                                                             
## [126] "PCA.drawio (1).png"                                                               
## [127] "PCA.drawio.pdf"                                                                   
## [128] "PCA.drawio.png"                                                                   
## [129] "PCA_with_SNPs_handout_worksheet_031122.pdf"                                       
## [130] "PCR_Set_up_Flowchart_STUDENT_Flowers_F21.pptx"                                    
## [131] "pdf2png.zip"                                                                      
## [132] "Personal Entrepreneurship Journey Zi Han Ding (1).docx"                           
## [133] "Personal Entrepreneurship Journey Zi Han Ding.docx"                               
## [134] "PIT353937751_auth_letter.pdf"                                                     
## [135] "Practice_Midterm_Attempt.pdf"                                                     
## [136] "Prelab for Bacterial Diversity and Isolation_F22.pptx"                            
## [137] "Prelab Identifying Isolated Bacteria_Flowers_F22.pptx"                            
## [138] "Prelab_Barcode_SNP_Sequencing_set_up.pptx"                                        
## [139] "Prelab_for_Petal_pressing_bacteria_from_sunflower_Flowers_F22-3.pptx"             
## [140] "Prelab_UV_Petal_Pattern_Flowers_F22-1.pptx"                                       
## [141] "Primary Literature Review 1.docx"                                                 
## [142] "Priority_Queues.pdf"                                                              
## [143] "Problem_Statement_Worksheet (1).docx"                                             
## [144] "Problem_Statement_Worksheet.docx"                                                 
## [145] "Problems 1 Fall 2022.pdf"                                                         
## [146] "Problems 2 Fall 2022 updated2.pdf"                                                
## [147] "Problems 2 Fall 2022.pdf"                                                         
## [148] "Problems 3 Fall 2022 2 updated.docx"                                              
## [149] "Problems 3 Fall 2022 2.pdf"                                                       
## [150] "Problems 3 Fall 2022 key 2 updated.pdf"                                           
## [151] "Problems 4 Fall 2022.pdf"                                                         
## [152] "Problems 5 Fall 2022.pdf"                                                         
## [153] "Problems 6 Fall 2022.pdf"                                                         
## [154] "PROJECT Skyrim - 0.6.1.6-76466-0-6-1-6-1668460668.7z"                             
## [155] "R-4.2.1-win.exe"                                                                  
## [156] "R_data_structures_vectors_intro.pdf"                                              
## [157] "Rec2worksheet.docx"                                                               
## [158] "Recitation 1 Fall 2022.docx"                                                      
## [159] "Recitation 10 Fall 2022.docx"                                                     
## [160] "Recitation 12 Fall 2022.docx"                                                     
## [161] "Recitation 13 Fall 2022.docx"                                                     
## [162] "Recitation 14 Fall 2022.docx"                                                     
## [163] "Recitation 2 Fall 2022a.docx"                                                     
## [164] "Recitation 3 Fall 2022 updated.docx"                                              
## [165] "Recitation 4 Fall 2022.docx"                                                      
## [166] "Recitation 5 Fall 2022 NEW.docx"                                                  
## [167] "Recitation 6 Fall 2022.docx"                                                      
## [168] "Recitation 7 Fall 2022.docx"                                                      
## [169] "Recitation 8 Fall 2022.docx"                                                      
## [170] "Recitation 9 Fall 2022.docx"                                                      
## [171] "REFramework-26-1-3-5-1664161771.zip"                                              
## [172] "removing_fixed_alleles.html"                                                      
## [173] "removing_fixed_alleles.Rmd"                                                       
## [174] "Rplot (1).png"                                                                    
## [175] "Rplot.png"                                                                        
## [176] "Rplot01.png"                                                                      
## [177] "Rplot02.png"                                                                      
## [178] "Rplot03.png"                                                                      
## [179] "Rplot04.png"                                                                      
## [180] "rsconnect"                                                                        
## [181] "RStudio-2022.07.1-554.exe"                                                        
## [182] "RStudio-2022.07.2-576.exe"                                                        
## [183] "Screenshot 2022-08-30 161304.png"                                                 
## [184] "Screenshot 2022-09-04 170310.png"                                                 
## [185] "Screenshot 2022-09-15 155913.png"                                                 
## [186] "Screenshot 2022-10-03 162320.png"                                                 
## [187] "Screenshot 2022-10-03 162957.png"                                                 
## [188] "Screenshot 2022-10-06 191254.png"                                                 
## [189] "Screenshot 2022-10-08 130304.png"                                                 
## [190] "Screenshot 2022-10-27 113305.png"                                                 
## [191] "Screenshot 2022-11-14 171840.png"                                                 
## [192] "Screenshot 2022-11-30 171043.png"                                                 
## [193] "Screenshot 2022-12-02 195706.png"                                                 
## [194] "Screenshot 2022-12-04 145221.png"                                                 
## [195] "Screenshot 2022-12-06 133151.png"                                                 
## [196] "Screenshot 2022-12-06 214917.png"                                                 
## [197] "Screenshot 2022-12-06 215457.png"                                                 
## [198] "SEC 1002 - Monday 615PM - Student Numbers.xlsx"                                   
## [199] "Shannon_Diversity_Excel_F22_Flowers_9.7.22.xlsx"                                  
## [200] "SodaPDF-resized-FA22-Flower_Collection_and_Press_Protocol_BR.pdf"                 
## [201] "Special Settings V2 (bugs fixed).rewasd"                                          
## [202] "Spot plate and UV tolerance Prelab_Flowers_F22.pptx"                              
## [203] "Spot_plate_practice_flowchart_STUDENT_Flowers_F21.pptx"                           
## [204] "Spot_test_plate_Excel_Data_Flowers_F21.xlsx"                                      
## [205] "Teamwork semester evaluation_Flowers_F22.docx"                                    
## [206] "Technology_Preparation_Flowers_F2022_JR.docx"                                     
## [207] "transpose_VCF_data.html"                                                          
## [208] "transpose_VCF_data.Rmd"                                                           
## [209] "ttwl-cli-saveedit-master.zip"                                                     
## [210] "UV tolerance Collated Data_Flowers_Data_Excel_Sheet_F22.xlsx"                     
## [211] "UV_Petal_Pattern_Flowchart_STUDENT_Flowers_F21.pptx"                              
## [212] "vcfR_test.vcf"                                                                    
## [213] "vcfR_test.vcf.gz"                                                                 
## [214] "vegan_pca_with_msleep-STUDENT.html"                                               
## [215] "vegan_pca_with_msleep-STUDENT.Rmd"                                                
## [216] "vlc-3.0.18-win64.exe"                                                             
## [217] "walsh2017morphology.csv"                                                          
## [218] "WeChatSetup.exe"                                                                  
## [219] "week11_handout.pdf"                                                               
## [220] "WeMod-Setup.exe"                                                                  
## [221] "WetFunctionRedux SE OStim.7z"                                                     
## [222] "ZHD_UV tolerance analysis_10312022.xlsx"                                          
## [223] "Zi Han Ding 01 10 2022.xlsx"                                                      
## [224] "ZID, UV tolerance Analysis LM2, 11 12 2022.xlsx"
list.files(pattern = "vcf")
## [1] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
## [2] "2.136483646-136733646.ALL.chr2_GRCh38.genotypes.20170504.vcf.gz"
## [3] "all_loci-1.vcf"                                                 
## [4] "all_loci.vcf"                                                   
## [5] "code_checkpoint_vcfR.html"                                      
## [6] "code_checkpoint_vcfR.Rmd"                                       
## [7] "vcfR_test.vcf"                                                  
## [8] "vcfR_test.vcf.gz"

Data preparation

Loading the SNPS from the VCF file

Loading the raw data from the vcf file. This is the basis of analysis. We need to use the read.vcfR function to read data from the .vcf file, which we will later use for analysis.

snps <- vcfR::read.vcfR("all_loci.vcf", convertNA  = TRUE)
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 8
##   header_line: 9
##   variant count: 1929
##   column count: 81
## 
Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 1929
##   Character matrix gt cols: 81
##   skip: 0
##   nrows: 1929
##   row_num: 0
## 
Processed variant 1000
Processed variant: 1929
## All variants processed

Extract the genotypes

Even though the raw data has been read into R, they are still not usable for our functions such as PCA. Think of them as untranslated text, we need to translate them into R-compatible data to being our data processing.

snps_num <- vcfR::extract.gt(snps, 
           element = "GT",
           IDtoRowNames  = F,
           as.numeric = T,
           convertNA = T,
           return.alleles = F)

Rotating the data

Even though we have extracted the data and converted it into R compatible form to work with, we still need to transpose them for our analysis. This rotates the columns and the rows, which makes it easier for our population genetic analysis.

snps_num_t <- t(snps_num) 

transpose function gives its output in matrix, however we need data frames, not only for some of our functions, but to also modify the values within the data structure.

snps_num_df <- data.frame(snps_num_t) 

Removing NAs

The default method for removing invalid values is na.omit(), which works by removing the rows/columns that contains them. An easy but disruptive method, which may sometimes not suit the data. This is why we need our own ways of dealing with invalid values. The following function finds the NAs present in a data structure.

find_NAs <- function(x){
  NAs_TF <- is.na(x)
  i_NA <- which(NAs_TF == TRUE)
  N_NA <- length(i_NA)
  
  cat("Results:",N_NA, "NAs present\n.")
  return(i_NA)
}

We then use the function created above to iterate through our transposed data structure to identify the locations of NA values.

# N_rows
# number of rows (individuals)
N_rows <- nrow(snps_num_t)

# N_NA
# vector to hold output (number of NAs)
N_NA   <- rep(x = 0, times = N_rows)

# N_SNPs
# total number of columns (SNPs)
N_SNPs <- ncol(snps_num_t)

# the for() loop
for(i in 1:N_rows){
  
  # for each row, find the location of
  ## NAs with snps_num_t()
  i_NA <- find_NAs(snps_num_t[i,]) 
  
  # then determine how many NAs
  ## with length()
  N_NA_i <- length(i_NA)
  
  # then save the output to 
  ## our storage vector
  N_NA[i] <- N_NA_i
}
## Results: 28 NAs present
## .Results: 20 NAs present
## .Results: 28 NAs present
## .Results: 24 NAs present
## .Results: 23 NAs present
## .Results: 63 NAs present
## .Results: 51 NAs present
## .Results: 38 NAs present
## .Results: 34 NAs present
## .Results: 24 NAs present
## .Results: 48 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 78 NAs present
## .Results: 45 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 34 NAs present
## .Results: 66 NAs present
## .Results: 54 NAs present
## .Results: 59 NAs present
## .Results: 52 NAs present
## .Results: 47 NAs present
## .Results: 31 NAs present
## .Results: 63 NAs present
## .Results: 40 NAs present
## .Results: 40 NAs present
## .Results: 22 NAs present
## .Results: 60 NAs present
## .Results: 48 NAs present
## .Results: 961 NAs present
## .Results: 478 NAs present
## .Results: 59 NAs present
## .Results: 26 NAs present
## .Results: 285 NAs present
## .Results: 409 NAs present
## .Results: 1140 NAs present
## .Results: 600 NAs present
## .Results: 1905 NAs present
## .Results: 25 NAs present
## .Results: 1247 NAs present
## .Results: 23 NAs present
## .Results: 750 NAs present
## .Results: 179 NAs present
## .Results: 433 NAs present
## .Results: 123 NAs present
## .Results: 65 NAs present
## .Results: 49 NAs present
## .Results: 192 NAs present
## .Results: 433 NAs present
## .Results: 66 NAs present
## .Results: 597 NAs present
## .Results: 1891 NAs present
## .Results: 207 NAs present
## .Results: 41 NAs present
## .Results: 268 NAs present
## .Results: 43 NAs present
## .Results: 110 NAs present
## .Results: 130 NAs present
## .Results: 90 NAs present
## .Results: 271 NAs present
## .Results: 92 NAs present
## .Results: 103 NAs present
## .Results: 175 NAs present
## .Results: 31 NAs present
## .Results: 66 NAs present
## .Results: 64 NAs present
## .Results: 400 NAs present
## .Results: 192 NAs present
## .Results: 251 NAs present
## .Results: 69 NAs present
## .Results: 58 NAs present
## .

Creating a 50% value vector and then constructing a histogram of our NAs. This allows us to determine which row is the cutoff point.

# 50% of N_SNPs
cutoff50 <- N_SNPs*0.5

hist(N_NA)            
abline(v = cutoff50, 
       col = 2, 
       lwd = 2, 
       lty = 2)

Calculating the percentage of NAs in our total number of SNPs. which function is used to determine the location of rows that have more than 50% NA values and store a removed version in another data structure.

percent_NA <- N_NA/N_SNPs*100

# Call which() on percent_NA
i_NA_50percent <- which(percent_NA > 50) 

snps_num_t02 <- snps_num_t[-i_NA_50percent, ]

Modifying row names for readability

The ids of individuals are replaced with our format of ids. This not only increases readability, but also allows up to find individuals easily.

row_names <- row.names(snps_num_t02) # Key

row_names02 <- gsub("sample_","",row_names)

sample_id <- gsub("^([ATCG]*)(_)(.*)",
                  "\\3",
                  row_names02)
pop_id <- gsub("[01-9]*",    
               "",
               sample_id)

table(pop_id)  
## pop_id
## Alt Cau Div Nel Sub 
##  15  12  15  15  11

Creating a function for removing fixed SNPs

If a SNP is fixed, all individuals within a population have the same variant at that specific locus. This is not useful in our analysis, so we will create a function to remove them.

invar_omit <- function(x){
  cat("Dataframe of dim",dim(x), "processed...\n")
  sds <- apply(x, 2, sd, na.rm = TRUE)
  i_var0 <- which(sds == 0)
 
  
  cat(length(i_var0),"columns removed\n")
  
  if(length(i_var0) > 0){
     x <- x[, -i_var0]
  }
  
  ## add return()  with x in it
  return(x)                      
}


snps_no_invar <- invar_omit(snps_num_t02) 
## Dataframe of dim 68 1929 processed...
## 591 columns removed

Removing remaining NAs

This is imputation. By replacing NAs with the mean values of each column, we can make the data workable without disturbing the mean of the data. This allows us to analyze columns with NAs present.

snps_noNAs <- snps_no_invar

N_col <- ncol(snps_no_invar)
for(i in 1:N_col){
  
  # get the current column
  column_i <- snps_noNAs[, i]
  
  # get the mean of the current column
  mean_i <- mean(column_i, na.rm = TRUE)
  
  # get the NAs in the current column
  NAs_i <- which(is.na(column_i))
  
  # record the number of NAs
  N_NAs <- length(NAs_i)

  # replace the NAs in the current column
  column_i[NAs_i] <- mean_i
  
  # replace the original column with the
  ## updated columns
  snps_noNAs[, i] <- column_i
  
}

Save the data

Save the data as a .csv file which can be loaded again later.

write.csv(snps_noNAs, file = "SNPs_cleaned.csv",
          row.names = F)

Check for the presence of the file with list.files()

list.files(pattern = ".csv")
## [1] "SNPs_cleaned.csv"        "walsh2017morphology.csv"

Next steps:

In Part 2, we will re-load the SNPs_cleaned.csv file and carry an an analysis with PCA.