Introduction

In this worked example you will replicate a PCA on a published dataset.

The example is split into 2 Parts:

In this Data Preparation phase, you will do the following things:

  1. Load the SNP genotypes in .vcf format (vcfR::read.vcfR())
  2. Extract the genotypes into an R-compatible format (vcfR::extract.gt())
  3. Rotate the data into the standard R analysis format (t())
  4. Remove individuals (rows) from the data set that have >50% NAs (using a function I wrote)
  5. Remove SNPs (columns) that are fixed
  6. Impute remaining NAs (using a for() loop)
  7. Save the prepared data as a .csv file for the next step (write.csv())

Biological background

This worked example is based on a paper in the journal Molecular Ecology from 2017 by Jennifer Walsh titled Subspecies delineation amid phenotypic, geographic and genetic discordance in a songbird.

The study investigated variation between two bird species in the genus Ammodramus: A. nenlsoni and A. caudacutus.

The species A. nenlsoni has been divided into 3 sub-species: A. n. nenlsoni, A.n. alterus, and A n. subvirgatus. The other species, A. caudacutus, has been divided into two subspecies, A.c. caudacutus and A.c. diversus.

The purpose of this study was to investigate to what extent these five subspecies recognized by taxonomists are supported by genetic data. The author’s collected DNA from 75 birds (15 per subspecies) and genotyped 1929 SNPs. They then analyzed the data with Principal Components Analysis (PCA), among other genetic analyzes.

This tutorial will work through all of the steps necessary to re-analyze Walsh et al.s data

Tasks

In the code below all code is provided. Your tasks will be to do 2 things:

  1. Give a meaningful title to all sections marked “TODO: TITLE”
  2. Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”

Preliminaries

Load the vcfR and other packages with library().

library(vcfR)    
## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)

Make sure that your working directory is set to the location of the file all_loci.vcf.

getwd()
## [1] "/Users/sumedhasripada/Downloads"
list.files()
##   [1] " pittsburghpa.txt"                                                                            
##   [2] " project"                                                                                     
##   [3] " project.txt"                                                                                 
##   [4] "__pycache__"                                                                                  
##   [5] "_tmp__lease_documents_20221008033035.pdf"                                                     
##   [6] "07-mean_imputation.docx"                                                                      
##   [7] "07-mean_imputation.html"                                                                      
##   [8] "07-mean_imputation.Rmd"                                                                       
##   [9] "08-PCA_worked (1).Rmd"                                                                        
##  [10] "08-PCA_worked.html"                                                                           
##  [11] "08-PCA_worked.Rmd"                                                                            
##  [12] "09-PCA_worked_example-SNPs-part1.Rmd"                                                         
##  [13] "2217 Project 2 - Lab 1.pdf"                                                                   
##  [14] "3.26537733-26777733.ALL.chr3_GRCh38.genotypes.20170504.vcf.gz"                                
##  [15] "30072210_20220511031119.pdf"                                                                  
##  [16] "30072211_20220511031120.pdf"                                                                  
##  [17] "30072212_20220511031121.pdf"                                                                  
##  [18] "41852558_20221008023640.pdf"                                                                  
##  [19] "41852559_20221008023640.pdf"                                                                  
##  [20] "41852560_20221008023641.pdf"                                                                  
##  [21] "6:28.1.py"                                                                                    
##  [22] "6:28.2.py"                                                                                    
##  [23] "6:28.3.py"                                                                                    
##  [24] "6:28.4.py"                                                                                    
##  [25] "A2.class"                                                                                     
##  [26] "A2.java"                                                                                      
##  [27] "A3.class"                                                                                     
##  [28] "A3.java"                                                                                      
##  [29] "A3input.txt"                                                                                  
##  [30] "all_loci (1).vcf"                                                                             
##  [31] "all_loci (2).vcf"                                                                             
##  [32] "all_loci (3).vcf"                                                                             
##  [33] "all_loci.vcf"                                                                                 
##  [34] "ApE_OSX_modern_current.dmg"                                                                   
##  [35] "Assignment 3 - Sciences (1).pdf"                                                              
##  [36] "Assignment 3 - Sciences.pdf"                                                                  
##  [37] "Assignment1.class"                                                                            
##  [38] "Assignment1.java"                                                                             
##  [39] "atom-mac.zip"                                                                                 
##  [40] "Atom.app"                                                                                     
##  [41] "bio lab 2 presentation.pptx"                                                                  
##  [42] "bird_snps_remove_NAs.html"                                                                    
##  [43] "bird_snps_remove_NAs.Rmd"                                                                     
##  [44] "bmp (1).py"                                                                                   
##  [45] "bmp.py"                                                                                       
##  [46] "brightness.bmp"                                                                               
##  [47] "center_function (1).R"                                                                        
##  [48] "center_function.R"                                                                            
##  [49] "Certificate.pdf"                                                                              
##  [50] "Chapter 02 - Programs"                                                                        
##  [51] "Chapter 02 - Programs.zip"                                                                    
##  [52] "Close reading 2.pdf"                                                                          
##  [53] "cluster_analysis_portfolio.Rmd"                                                               
##  [54] "code_checkpoint_vcfR (1).Rmd"                                                                 
##  [55] "code_checkpoint_vcfR (2).Rmd"                                                                 
##  [56] "code_checkpoint_vcfR (3).Rmd"                                                                 
##  [57] "code_checkpoint_vcfR--3-.html"                                                                
##  [58] "code_checkpoint_vcfR.Rmd"                                                                     
##  [59] "CODE_CHECKPOINT-first_rstudio_script.R"                                                       
##  [60] "comp bio notes.pdf"                                                                           
##  [61] "CompBio Tracker.pdf"                                                                          
##  [62] "ConfidentialityAgreementStudents (1).pdf"                                                     
##  [63] "ConfidentialityAgreementStudents.pdf"                                                         
##  [64] "contrast.bmp"                                                                                 
##  [65] "conversion_table.txt"                                                                         
##  [66] "CS0011_Project1_Lab2.pdf"                                                                     
##  [67] "des217project2.py"                                                                            
##  [68] "des217project2.py.zip"                                                                        
##  [69] "des217project3.py"                                                                            
##  [70] "des217project3.py.zip"                                                                        
##  [71] "displaychannel.bmp"                                                                           
##  [72] "distance.html"                                                                                
##  [73] "distance.py"                                                                                  
##  [74] "Englit 560 Syllabus Fall 2022.docx"                                                           
##  [75] "Exam 1 practice Fall 2022 key updated no pedigrees.pdf"                                       
##  [76] "Exam 1 practice Fall 2022 updated no pedigrees.pdf"                                           
##  [77] "Exam 2 Fall 2022 (1).pdf"                                                                     
##  [78] "Exam 2 Fall 2022.pdf"                                                                         
##  [79] "feature_engineering (1).Rmd"                                                                  
##  [80] "feature_engineering_files"                                                                    
##  [81] "feature_engineering.Rmd"                                                                      
##  [82] "FileReadSample.class"                                                                         
##  [83] "FileReadSample.java"                                                                          
##  [84] "FiletoArray.class"                                                                            
##  [85] "FiletoArray.java"                                                                             
##  [86] "First Close Reading Paper.docx"                                                               
##  [87] "flip.bmp"                                                                                     
##  [88] "Genetics Group PLR 1 (1).docx"                                                                
##  [89] "Genetics Group PLR 1.docx"                                                                    
##  [90] "googlechrome.dmg"                                                                             
##  [91] "GPA-CALCULATOR.xls"                                                                           
##  [92] "gray.bmp"                                                                                     
##  [93] "Homework_ Introduction research and PowerPoint slides for presentation_ Sumedha Sripada_files"
##  [94] "Homework_ Introduction research and PowerPoint slides for presentation_ Sumedha Sripada.html" 
##  [95] "IMG_7175.heic"                                                                                
##  [96] "IMG_8965.HEIC"                                                                                
##  [97] "img1 (1).bmp"                                                                                 
##  [98] "img1.bmp"                                                                                     
##  [99] "imgproc"                                                                                      
## [100] "imgproc.zip"                                                                                  
## [101] "Install Respondus LockDown Browser (x64c) 346685215.pkg"                                      
## [102] "InstallLDBPackage64c-2-0-8-05.zip"                                                            
## [103] "invert.bmp"                                                                                   
## [104] "IP Rights Assignment (1).pdf"                                                                 
## [105] "IP Rights Assignment.pdf"                                                                     
## [106] "jdk-18_macos-aarch64_bin.dmg"                                                                 
## [107] "jdk-18_macos-aarch64_bin.tar.gz"                                                              
## [108] "jdk-18.0.2.1.jdk"                                                                             
## [109] "Lab1.class"                                                                                   
## [110] "Lab1.java"                                                                                    
## [111] "Lab2 (1).java"                                                                                
## [112] "Lab2.1.java"                                                                                  
## [113] "Lab2.class"                                                                                   
## [114] "Lab2.java"                                                                                    
## [115] "Lab2.javac"                                                                                   
## [116] "Lab3.class"                                                                                   
## [117] "Lab3.java"                                                                                    
## [118] "Lab4-1.java"                                                                                  
## [119] "Lab4.class"                                                                                   
## [120] "Lab4.java"                                                                                    
## [121] "Lab41.class"                                                                                  
## [122] "lab41.java"                                                                                   
## [123] "Lab5 (1).java"                                                                                
## [124] "Lab5.class"                                                                                   
## [125] "Lab5.java"                                                                                    
## [126] "Lab51.java"                                                                                   
## [127] "Lab6.class"                                                                                   
## [128] "Lab6.java"                                                                                    
## [129] "Lab7Instructions.java"                                                                        
## [130] "Lecture A Mendel Fall Fall 2022 pre.pdf"                                                      
## [131] "Lecture E1 Fall 2022 for exam 3 (1).pptx"                                                     
## [132] "Lecture E1 Fall 2022 for exam 3.pptx"                                                         
## [133] "Lecture E2 Fall 2022 pre (1).pptx"                                                            
## [134] "Lecture E2 Fall 2022 pre.pptx"                                                                
## [135] "matlab_R2022a_maci64.dmg"                                                                     
## [136] "matlab_R2022a_maci64.dmg.zip"                                                                 
## [137] "Midterm Close Reading (1).docx"                                                               
## [138] "Midterm Close Reading.docx"                                                                   
## [139] "Monitored Withdrawal Request_0.pdf"                                                           
## [140] "MoralityLarksOwls-1.docx"                                                                     
## [141] "MoralityLarksOwls-1.pdf"                                                                      
## [142] "MoralityLarksOwls.pdf"                                                                        
## [143] "MuEditor-osx-1.1.1.dmg"                                                                       
## [144] "MusicIQ-1.pdf"                                                                                
## [145] "MusicIQ.pdf"                                                                                  
## [146] "my_snps"                                                                                      
## [147] "MyClass.class"                                                                                
## [148] "MyClass.java"                                                                                 
## [149] "name.txt"                                                                                     
## [150] "Notability Notes.pdf"                                                                         
## [151] "Note Mar 21, 2022 (2)-1.pdf"                                                                  
## [152] "Note Mar 21, 2022 (2).pdf"                                                                    
## [153] "Note Sep 27, 2022-1.pdf"                                                                      
## [154] "Note Sep 27, 2022-2.pdf"                                                                      
## [155] "Note Sep 27, 2022.pdf"                                                                        
## [156] "ocean"                                                                                        
## [157] "owls.Rmd"                                                                                     
## [158] "parrots.bmp"                                                                                  
## [159] "PCA-missing_data (1).Rmd"                                                                     
## [160] "PCA-missing_data-KEY.Rmd"                                                                     
## [161] "PCA-missing_data.Rmd"                                                                         
## [162] "pitt (1).bmp"                                                                                 
## [163] "pitt (2).bmp"                                                                                 
## [164] "Pitt employee UPMC competency and Clearance Flow Chart.docx"                                  
## [165] "Pitt employee UPMC competency and Clearance Flow Chart.pdf"                                   
## [166] "pitt.bmp"                                                                                     
## [167] "pitt.bmp project.py"                                                                          
## [168] "pittsburghpa.txt"                                                                             
## [169] "pittxburghpa.txt"                                                                             
## [170] "portfolio .docx"                                                                              
## [171] "portfolio_ggpubr_intro-2 (1).Rmd"                                                             
## [172] "portfolio_ggpubr_intro-2 (2).Rmd"                                                             
## [173] "portfolio_ggpubr_intro-2 (3).Rmd"                                                             
## [174] "portfolio_ggpubr_intro-2.Rmd"                                                                 
## [175] "portfolio_ggpubr_log_transformation.Rmd"                                                      
## [176] "Problems 1 Fall 2022.pdf"                                                                     
## [177] "Problems 2 Fall 2022.pdf"                                                                     
## [178] "Problems 5 Fall 2022.pdf"                                                                     
## [179] "Problems 6 Fall 2022.pdf"                                                                     
## [180] "project"                                                                                      
## [181] "Project 2 - Lab 3-2.pdf"                                                                      
## [182] "project 2 pt. 1.py"                                                                           
## [183] "project 3.1.py"                                                                               
## [184] "projext.txt"                                                                                  
## [185] "QC_2116_fMRI"                                                                                 
## [186] "QC_2116_fMRI.zip"                                                                             
## [187] "QC_HCPreg2dwi_1629.zip"                                                                       
## [188] "QualityChecking"                                                                              
## [189] "QualityChecking.zip"                                                                          
## [190] "Quiz 2 9-8.pptx"                                                                              
## [191] "R-4.2.1.pkg"                                                                                  
## [192] "randomnumbers.txt"                                                                            
## [193] "Recitation 10 Fall 2022.docx"                                                                 
## [194] "Recitation 10 Information Sheet 2022.docx"                                                    
## [195] "Recitation 5 Fall 2022 NEW (1).docx"                                                          
## [196] "Recitation 5 Fall 2022 NEW.docx"                                                              
## [197] "reg_examples"                                                                                 
## [198] "reg_examples.zip"                                                                             
## [199] "religion essay and portfolio 1 (1).pdf"                                                       
## [200] "religion essay and portfolio 1.pdf"                                                           
## [201] "religion essay.docx"                                                                          
## [202] "ResponseSummary (1).pdf"                                                                      
## [203] "ResponseSummary (2).pdf"                                                                      
## [204] "ResponseSummary (3).pdf"                                                                      
## [205] "ResponseSummary.pdf"                                                                          
## [206] "Resume Sumedha Sripada..docx"                                                                 
## [207] "Resume Sumedha Sripada.docx"                                                                  
## [208] "Resume+Sumedha+Sripada..docx"                                                                 
## [209] "RM EC Scenario Threats IntValidity S2022.docx"                                                
## [210] "RM Exam 2 Review terms S2022 (1).pdf"                                                         
## [211] "RM Exam 2 Review terms S2022.pdf"                                                             
## [212] "RM_S22_06_Cognitive_Therapy Qs POST (1).docx"                                                 
## [213] "RM_S22_06_Cognitive_Therapy Qs POST.docx"                                                     
## [214] "RM_S22_08_Resource_Scarcity POST (1).docx"                                                    
## [215] "RM_S22_08_Resource_Scarcity POST.docx"                                                        
## [216] "rsconnect"                                                                                    
## [217] "RStudio-2022.07.1-554.dmg"                                                                    
## [218] "Self Assessment.docx"                                                                         
## [219] "SimpleLoop.class"                                                                             
## [220] "SimpleLoop.java"                                                                              
## [221] "SNPs_cleaned.csv"                                                                             
## [222] "spm12"                                                                                        
## [223] "spm12 (1).zip"                                                                                
## [224] "spm12.zip"                                                                                    
## [225] "SPSS_St_Cl_28.0.1.1_Mac_ISO.iso"                                                              
## [226] "SripadaSumedha_FirstProgressTracker_10:2022.pdf"                                              
## [227] "sss103_project1.py"                                                                           
## [228] "sss103_project2.py"                                                                           
## [229] "sss103_project3.py"                                                                           
## [230] "Starting Out with JavaTM - Gaddis, Tony.pdf"                                                  
## [231] "Starting Out with JavaTM - Gaddis, Tony.zip"                                                  
## [232] "StringManipulator.class"                                                                      
## [233] "StringManipulator.java"                                                                       
## [234] "study.py"                                                                                     
## [235] "temp.html"                                                                                    
## [236] "temperature.py"                                                                               
## [237] "The-Lion-the-Witch-and-the-Wardrobe.pdf"                                                      
## [238] "transpose_VCF_data (1).Rmd"                                                                   
## [239] "transpose_VCF_data (2).Rmd"                                                                   
## [240] "transpose_VCF_data.html"                                                                      
## [241] "transpose_VCF_data.Rmd"                                                                       
## [242] "TWH1W3UL.pdf"                                                                                 
## [243] "Unconfirmed 637729.crdownload"                                                                
## [244] "Untitled.docx"                                                                                
## [245] "Untitled.Rmd"                                                                                 
## [246] "vcfR_test.vcf"                                                                                
## [247] "vcfR_test.vcf.gz"                                                                             
## [248] "vegan_PCA_amino_acids-STUDENT.Rmd"                                                            
## [249] "vegan_pca_with_msleep-STUDENT (1).Rmd"                                                        
## [250] "vegan_pca_with_msleep-STUDENT--1-.html"                                                       
## [251] "vegan_pca_with_msleep-STUDENT.Rmd"                                                            
## [252] "volume.py"                                                                                    
## [253] "walsh2017morphology (1).csv"                                                                  
## [254] "walsh2017morphology.csv"                                                                      
## [255] "Working Group Activity #2.docx"                                                               
## [256] "working_directory_practice (1).Rmd"                                                           
## [257] "working_directory_practice (2).Rmd"                                                           
## [258] "working_directory_practice.html"                                                              
## [259] "working_directory_practice.Rmd"                                                               
## [260] "Zoom (1).pkg"                                                                                 
## [261] "Zoom.pkg"
list.files(pattern = "vcf")
##  [1] "3.26537733-26777733.ALL.chr3_GRCh38.genotypes.20170504.vcf.gz"
##  [2] "all_loci (1).vcf"                                             
##  [3] "all_loci (2).vcf"                                             
##  [4] "all_loci (3).vcf"                                             
##  [5] "all_loci.vcf"                                                 
##  [6] "code_checkpoint_vcfR (1).Rmd"                                 
##  [7] "code_checkpoint_vcfR (2).Rmd"                                 
##  [8] "code_checkpoint_vcfR (3).Rmd"                                 
##  [9] "code_checkpoint_vcfR--3-.html"                                
## [10] "code_checkpoint_vcfR.Rmd"                                     
## [11] "vcfR_test.vcf"                                                
## [12] "vcfR_test.vcf.gz"

Data preparation

Load the data into an object called snps

Using vcf::read.vcfR() allows you to load data that is stored as a .vcf. Renaming this data as “snps” allows it to be easily identified for extraction and the following steps

snps <- vcfR::read.vcfR("all_loci.vcf", convertNA  = TRUE)
## Scanning file to determine attributes.
## File attributes:
##   meta lines: 8
##   header_line: 9
##   variant count: 1929
##   column count: 81
## 
Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 1929
##   Character matrix gt cols: 81
##   skip: 0
##   nrows: 1929
##   row_num: 0
## 
Processed variant 1000
Processed variant: 1929
## All variants processed

Extract the genotype scores into an object “snps_num”

vcfT::extract.gt() allows you to extract the genotype scores from “snps”

snps_num <- vcfR::extract.gt(snps, 
           element = "GT",
           IDtoRowNames  = F,
           as.numeric = T,
           convertNA = T,
           return.alleles = F)

Transpose the data

t() allows you to transpose the data in the matrix to have proper orientation and is easily used for later commands

snps_num_t <- t(snps_num) 

data.frame() is used when converting the matrix to a data frame

snps_num_df <- data.frame(snps_num_t) 

Remove NAs from data frame

In order to remove the NAs from the data frame, you need to first locate them. This function will look for NAs in single column or vector, and tell us the index values. We use is.na() to find the NAs, which() to get a vector of index values, and length() to find the length of the vector

find_NAs <- function(x){
  NAs_TF <- is.na(x)
  i_NA <- which(NAs_TF == TRUE)
  N_NA <- length(i_NA)
  
  cat("Results:",N_NA, "NAs present\n.")
  return(i_NA)
}

We will use a “for” loop to go through each row of the data and see if more than half of it consists of NAs. Find the number of rows, make a vector to store how many NAs are in each row, and find the number of SNPs by calling the function to find # of columns

# N_rows
# number of rows (individuals)
N_rows <- nrow(snps_num_t)

# N_NA
# vector to hold output (number of NAs)
N_NA   <- rep(x = 0, times = N_rows)

# N_SNPs
# total number of columns (SNPs)
N_SNPs <- ncol(snps_num_t)

# the for() loop
for(i in 1:N_rows){
  
  # for each row, find the location of
  ## NAs with snps_num_t()
  i_NA <- find_NAs(snps_num_t[i,]) 
  
  # then determine how many NAs
  ## with length()
  N_NA_i <- length(i_NA)
  
  # then save the output to 
  ## our storage vector
  N_NA[i] <- N_NA_i
}
## Results: 28 NAs present
## .Results: 20 NAs present
## .Results: 28 NAs present
## .Results: 24 NAs present
## .Results: 23 NAs present
## .Results: 63 NAs present
## .Results: 51 NAs present
## .Results: 38 NAs present
## .Results: 34 NAs present
## .Results: 24 NAs present
## .Results: 48 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 78 NAs present
## .Results: 45 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 34 NAs present
## .Results: 66 NAs present
## .Results: 54 NAs present
## .Results: 59 NAs present
## .Results: 52 NAs present
## .Results: 47 NAs present
## .Results: 31 NAs present
## .Results: 63 NAs present
## .Results: 40 NAs present
## .Results: 40 NAs present
## .Results: 22 NAs present
## .Results: 60 NAs present
## .Results: 48 NAs present
## .Results: 961 NAs present
## .Results: 478 NAs present
## .Results: 59 NAs present
## .Results: 26 NAs present
## .Results: 285 NAs present
## .Results: 409 NAs present
## .Results: 1140 NAs present
## .Results: 600 NAs present
## .Results: 1905 NAs present
## .Results: 25 NAs present
## .Results: 1247 NAs present
## .Results: 23 NAs present
## .Results: 750 NAs present
## .Results: 179 NAs present
## .Results: 433 NAs present
## .Results: 123 NAs present
## .Results: 65 NAs present
## .Results: 49 NAs present
## .Results: 192 NAs present
## .Results: 433 NAs present
## .Results: 66 NAs present
## .Results: 597 NAs present
## .Results: 1891 NAs present
## .Results: 207 NAs present
## .Results: 41 NAs present
## .Results: 268 NAs present
## .Results: 43 NAs present
## .Results: 110 NAs present
## .Results: 130 NAs present
## .Results: 90 NAs present
## .Results: 271 NAs present
## .Results: 92 NAs present
## .Results: 103 NAs present
## .Results: 175 NAs present
## .Results: 31 NAs present
## .Results: 66 NAs present
## .Results: 64 NAs present
## .Results: 400 NAs present
## .Results: 192 NAs present
## .Results: 251 NAs present
## .Results: 69 NAs present
## .Results: 58 NAs present
## .

Make a histogram using hist() to get a visual of how many NAs are in each row. We will also set the cutoff to .5 to only plot data where NAs are present over 50%

# 50% of N_SNPs
cutoff50 <- N_SNPs*0.5

hist(N_NA)            
abline(v = cutoff50, 
       col = 2, 
       lwd = 2, 
       lty = 2)

Now, using which(), we can find the index value of each row consisting of over 50% NAs

percent_NA <- N_NA/N_SNPs*100

# Call which() on percent_NA
i_NA_50percent <- which(percent_NA > 50) 

snps_num_t02 <- snps_num_t[-i_NA_50percent, ]

Retrieve sampling locations

First call row.names() to get the names of each row, and then clean them up using gsub(). Then, clean up the character data again using gsub(). Then, create a vector with population code using gsub() to get rid of the other numbers and name is pop_id.

row_names <- row.names(snps_num_t02) # Key

row_names02 <- gsub("sample_","",row_names)

sample_id <- gsub("^([ATCG]*)(_)(.*)",
                  "\\3",
                  row_names02)
pop_id <- gsub("[01-9]*",    
               "",
               sample_id)

table(pop_id)  
## pop_id
## Alt Cau Div Nel Sub 
##  15  12  15  15  11

Remove columns

These steps are used to set and process the dimension of the matrix and remove unwanted columns

invar_omit <- function(x){
  cat("Dataframe of dim",dim(x), "processed...\n")
  sds <- apply(x, 2, sd, na.rm = TRUE)
  i_var0 <- which(sds == 0)
 
  
  cat(length(i_var0),"columns removed\n")
  
  if(length(i_var0) > 0){
     x <- x[, -i_var0]
  }
  
  ## add return()  with x in it
  return(x)                      
}


snps_no_invar <- invar_omit(snps_num_t02) 
## Dataframe of dim 68 1929 processed...
## 591 columns removed

Replace the NAs

Here, locate the columns and replace the NAs in them. Then replace the original columns with the new columns

snps_noNAs <- snps_no_invar

N_col <- ncol(snps_no_invar)
for(i in 1:N_col){
  
  # get the current column
  column_i <- snps_noNAs[, i]
  
  # get the mean of the current column
  mean_i <- mean(column_i, na.rm = TRUE)
  
  # get the NAs in the current column
  NAs_i <- which(is.na(column_i))
  
  # record the number of NAs
  N_NAs <- length(NAs_i)

  # replace the NAs in the current column
  column_i[NAs_i] <- mean_i
  
  # replace the original column with the
  ## updated columns
  snps_noNAs[, i] <- column_i
  
}

Save the data

Save the data as a .csv file which can be loaded again later.

write.csv(snps_noNAs, file = "SNPs_cleaned.csv",
          row.names = F)

Check for the presence of the file with list.files()

list.files(pattern = ".csv")
## [1] "SNPs_cleaned.csv"            "walsh2017morphology (1).csv"
## [3] "walsh2017morphology.csv"

Next steps:

In Part 2, we will re-load the SNPs_cleaned.csv file and carry an an analysis with PCA.