Portfolio: Working with VCF files 3: Removing samples with many NAsusing for loops

Learning objectives

Review the problem of missing data in SNP datasets
Introduce the concept of researcher degrees of freedom
Review how to locate NAs in R using is.na() and which()
Outline the use of for() loops in R to carry out repetitive tasks.
Review the use of regular expressions to clean text data.

Introduction

The data we are using to practice working with SNPs in VCF files comes from a paper in Molecular Ecology by Jennifer Walsh called “Subspecies delineation amid phenotypic, geographic and genetic discordance in a songbird.” The goal of the paper is to compare traditional ways of classifying different populations and subspecies of Ammodramus sparrow with morphology and genetic data. In order to do this analysis, though, we need to prepare the data.

Missing data is common in SNP datasets, especially in samples from wild animals where the amount of DNA collected may be small. We can fill in (impute) missing data through methods such as mean imputation. However, if there’s lots of NAs we may want to consider removing samples (in this case birds) for which we’d have to impute many - if not most - of their SNPs. That is to say - if we have to impute a lot of the data for the sample, we’d be better off just removing the sample.

When to remove and when to impute?

There are no strict rules for many of the decisions you have to make when doing data analysis, especially when it comes to preparation and cleaning steps like removing or imputing NAs.

In this study, the authors state:

“we removed any individuals that were missing data at more than 50% of the SNPs.”

This is an arbitrary decision - they could have chosen 60% or 25%. It is therefore called a researcher degree of freedom because they are free to chose what to do, and someone else may have made a different choice.

This type of decision is very common in data analysis, and needs to be documented and justified. Posting the raw data, as the authors have, allows other researchers to explore the consequences of a different choice.

In this exercise we will show how you can remove rows of data if it is necessary. We’ll use the handy functions which() and is.na() and a common programming tool called a for() loop.

Note we are removing individual samples, which are represented by rows in the data. Elsewhere in our protocol we remove columns, for example if all the columns have the same genotype.

Preliminaries

Load the vcfR package with library()`:

library(vcfR)

## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****

Make sure that your working directory is set to the location of the file all_loci.vcf.

getwd()

## [1] "/Users/sadhikayanamadala/Downloads"

list.files()

##   [1] "08-PCA_worked.Rmd"                                                                
##   [2] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504 3.vcf"                   
##   [3] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf"                     
##   [4] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"                  
##   [5] "10.24 Find the Differences-2.docx"                                                
##   [6] "10.24 Find the Differences-3.docx"                                                
##   [7] "10.24 Find the Differences-4.docx"                                                
##   [8] "10.24 Find the Differences.docx"                                                  
##   [9] "13.29656372-29896372.ALL.chr13_GRCh38.genotypes.20170504.vcf"                     
##  [10] "1540_cluster_analysis.pdf"                                                        
##  [11] "2022_18_d.JPG"                                                                    
##  [12] "2022RaiseEveryVoicePresentation.pptx"                                             
##  [13] "2231 Exam 2 KEY.pdf"                                                              
##  [14] "AGH AC - Patient and Family.doc"                                                  
##  [15] "all_loci-1.vcf"                                                                   
##  [16] "all_loci-1.vcf.txt"                                                               
##  [17] "all_loci.vcf"                                                                     
##  [18] "all_loci.vcf.txt"                                                                 
##  [19] "allomtery_3_scatterplot3d (1).Rmd"                                                
##  [20] "Analyzing Sanger Sequencing_Flowers_F22.pptx"                                     
##  [21] "Analyzing_Practice_Sequencing_Flowers_F22_JR-1.pptx"                              
##  [22] "Analyzing_Practice_Sequencing_Flowers_F22_JR-2.pptx"                              
##  [23] "Analyzing_Practice_Sequencing_Flowers_F22_JR.pptx"                                
##  [24] "Assessing_Shannon_Diversity_StudentPrelab_Flowers_F22_.pptx"                      
##  [25] "Background_Research_Assignment_Flowers_F22_-_SG_Edits (1).pptx"                   
##  [26] "Bacterial_isolation_protocol_Student_version_Flowers_F22.pptx"                    
##  [27] "bird_snps_remove_NAs.Rmd"                                                         
##  [28] "BPB_spectrophotometry.xlsx"                                                       
##  [29] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-2.docx"                  
##  [30] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-3-2.docx"                
##  [31] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-3-3.docx"                
##  [32] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-3-4.docx"                
##  [33] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-3-5.docx"                
##  [34] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-3.docx"                  
##  [35] "BR_General Figure Guidelines and Checklist_Flowers_F22-2.docx"                    
##  [36] "BR_General Figure Guidelines and Checklist_Flowers_F22.docx"                      
##  [37] "BR_Lab Meeting 1-Diversity_Flowers FA22.pptx"                                     
##  [38] "BR_UV-tolerance-Analysis_Flowers_FA22.pptx"                                       
##  [39] "BR-Excel Practice_Flowers-FA22.pptx"                                              
##  [40] "BR-Figure-3_Flowers_FA22.docx"                                                    
##  [41] "BR-Figure-4_Flowers_FA22-1.docx"                                                  
##  [42] "BR-In_class_template_Journal_Club_2_Flowers_F22.pptx"                             
##  [43] "BR-Journal Club 1_Flowers_FA22.pptx"                                              
##  [44] "BR-Prelab_Bacterial Diversity and Isolation_FA22.pptx"                            
##  [45] "BR-Spot plate and UV tolerance Prelab_Flowers_F22.pptx"                           
##  [46] "center_function.R"                                                                
##  [47] "CFU Practice Calculation_Flowers_F22-2.xlsx"                                      
##  [48] "CFU Practice Calculation_Flowers_F22-3.xlsx"                                      
##  [49] "CFU Practice Calculation_Flowers_F22.xlsx"                                        
##  [50] "Chapter 5 Stereoisomers.docx"                                                     
##  [51] "chem 0310 exam 1 2221 answers.pdf"                                                
##  [52] "chem 0310 exam 1 2221.pdf"                                                        
##  [53] "chem 0310 exam 2 2221.pdf"                                                        
##  [54] "chem 0310 exam 2 key 2221.pdf"                                                    
##  [55] "Chem 310 Recitation Problem Set I-2211 (002).docx"                                
##  [56] "Chem 310 Recitation Problem Set III-2211.docx"                                    
##  [57] "Chem 310 Recitation Problem Set IV-2211.docx"                                     
##  [58] "Chemistry 310 Recitation Problem Set II-2211.docx"                                
##  [59] "cluster_analysis_portfolio.Rmd"                                                   
##  [60] "code_checkpoint_vcfR.Rmd"                                                         
##  [61] "CODE_CHECKPOINT-first_rstudio_script.R"                                           
##  [62] "College Resume-2.pdf"                                                             
##  [63] "Cover letter (1).docx"                                                            
##  [64] "Cover Letter.pdf"                                                                 
##  [65] "CV-writingclass.docx"                                                             
##  [66] "d, 11-08-2022, Reed, 10 sec.jpg"                                                  
##  [67] "d, 11-08-2022, Reed, 20 sec.jpg"                                                  
##  [68] "Daddy B-day Photos.pdf"                                                           
##  [69] "dino.csv"                                                                         
##  [70] "Email Examples-2.docx"                                                            
##  [71] "Email Examples.docx"                                                              
##  [72] "Excel Practice Data Set_Flowers F22.xlsx"                                         
##  [73] "FA6_SB_50_B-16sForward.ab1"                                                       
##  [74] "Fadi Resume -final[1].docx"                                                       
##  [75] "feature_engineering_intro_2_functions-part2.Rmd"                                  
##  [76] "feature_engineering-2.Rmd"                                                        
##  [77] "feature_engineering.Rmd"                                                          
##  [78] "figure(II)_BR_General Figure Guidelines and Checklist_Flowers_F22-2-3 copy.docx"  
##  [79] "Fiji.app"                                                                         
##  [80] "Final Presentation planning Flowers F22 copy.docx"                                
##  [81] "Final Presentation planning Flowers F22.docx"                                     
##  [82] "Final Presentation Planning Slides_Flowers_F22.pptx"                              
##  [83] "Flow chart assignment_Flowers_F22-2.pptx"                                         
##  [84] "Flow chart assignment_Flowers_F22.pptx"                                           
##  [85] "Flower Microbiome Slides.pptx"                                                    
##  [86] "Flowers_Data_Excel_Sheet_F22_9.9.22-2.xlsx"                                       
##  [87] "Flowers_Data_Excel_Sheet_F22_9.9.22-3.xlsx"                                       
##  [88] "Flowers_Data_Excel_Sheet_F22_9.9.22.xlsx"                                         
##  [89] "fst_exploration_in_class-STUDENT.Rmd"                                             
##  [90] "Gel_Electrophoresis_flowchart_Student_Flowers_F22.pptx"                           
##  [91] "Graphs Presentation.pptx"                                                         
##  [92] "HansikaDOB-2.pdf"                                                                 
##  [93] "HealthcareCover.doc"                                                              
##  [94] "Hendry_aphids_CB_2018[34].pdf"                                                    
##  [95] "How-to-Answer-the-64-Toughest-Interview-Questions.pdf"                            
##  [96] "Indentifying_features_of_sunflowers_Flowers_F22-2.pptx"                           
##  [97] "Indentifying_features_of_sunflowers_Flowers_F22.pptx"                             
##  [98] "Lab Meeting #2 Graphing_Flowers_F22.pptx"                                         
##  [99] "Lab Meeting #3-1.pptx"                                                            
## [100] "lecture-introd2RStudio-with_scripts.pdf"                                          
## [101] "LinkedIn Profile-3.pdf"                                                           
## [102] "Math0220_M2_review.pdf"                                                           
## [103] "Oath Ceremony Scheduled.pdf"                                                      
## [104] "ParentsMarriageCertificate.pdf"                                                   
## [105] "PCR_Set_up_Flowchart_STUDENT_Flowers_F21.pptx"                                    
## [106] "penguins.csv"                                                                     
## [107] "pitch perfect .pdf"                                                               
## [108] "portfolio_ggpubr_intro-2.Rmd"                                                     
## [109] "portfolio_ggpubr_log_transformation.Rmd"                                          
## [110] "Practice CFU Calculation_Flowers_F22.pptx"                                        
## [111] "Prelab Identifying Isolated Bacteria_Flowers_F22-2.pptx"                          
## [112] "Prelab Identifying Isolated Bacteria_Flowers_F22.pptx"                            
## [113] "Prelab_Barcode_SNP_Sequencing_set_up.pptx"                                        
## [114] "Prelab_UV_Petal_Pattern_Flowers_F22-1.pptx"                                       
## [115] "PXL_20220920_175337658.MP.jpg"                                                    
## [116] "PXL_20221101_211536673.jpg"                                                       
## [117] "PXL_20221104_190523393.jpg"                                                       
## [118] "PXL_20221111_000004240.jpg"                                                       
## [119] "PXL_20221111_000319013.jpg"                                                       
## [120] "PXL_20221115_155938237.MP.jpg"                                                    
## [121] "PXL_20221118_003205920.jpg"                                                       
## [122] "PXL_20221118_003633265.jpg"                                                       
## [123] "r_help_hclust_intro-vs2.pdf"                                                      
## [124] "removing_fixed_alleles.Rmd"                                                       
## [125] "REV-BR_Lab Meeting 1-Diversity_Flowers FA22.pptx"                                 
## [126] "Rev-LabMathPipettingPresentation_Flowers_F22.pptx"                                
## [127] "rsconnect"                                                                        
## [128] "S-Yanamadala_10-02-2022_Practice.xlsx"                                            
## [129] "Scholarly Background.pdf"                                                         
## [130] "ScienceCover.doc"                                                                 
## [131] "ScienceCoverLetterTemplate.doc"                                                   
## [132] "Screen Shot 2022-10-28 at 7.29.19 PM.png"                                         
## [133] "Screen Shot 2022-11-04 at 8.55.19 PM.png"                                         
## [134] "Screen Shot 2022-12-04 at 1.02.53 PM.png"                                         
## [135] "Shamita_BirthCerti.pdf"                                                           
## [136] "Shannon_Diversity_Excel_F22_Flowers_9.7.22.xlsx"                                  
## [137] "Shared_data_Lab_Meeting_1_Diversity_SY.xlsx"                                      
## [138] "Soc 0010 Fall2022 Syllabus .docx"                                                 
## [139] "Spot_plate_practice_flowchart_STUDENT_Flowers_F21.pptx"                           
## [140] "Spot_test_plate_Excel_Data_Flowers_F21.xlsx"                                      
## [141] "Statement_Aug_30_2022.pdf"                                                        
## [142] "STIT & UCONN Resume .pdf"                                                         
## [143] "STIT Resume-2.pdf"                                                                
## [144] "STIT Resume-3.pdf"                                                                
## [145] "STIT Resume.pdf"                                                                  
## [146] "STUDENT-BR-Week 4 Bacterial Diversity and Isolation_Flowers_FA22.pdf"             
## [147] "SY, UV tolerance Analysis LM2, 11-14-2022-2.xlsx"                                 
## [148] "SY, UV tolerance Analysis LM2, 11-14-2022.xlsx"                                   
## [149] "Table-D_Tuesday1145_PCR.jpg"                                                      
## [150] "Title.html"                                                                       
## [151] "transpose_VCF_data.Rmd"                                                           
## [152] "TU27_Shamita-27F_forward.seq"                                                     
## [153] "Undergraduate-Student-Resume-Examples.pdf"                                        
## [154] "UV tolerance Collated Data_Flowers_Data_Excel_Sheet_F22.xlsx"                     
## [155] "UV_Pattern_quantification_Fiji_Flowers_FA22.pptx"                                 
## [156] "UV_Petal_Pattern_Flowchart_STUDENT_Flowers_F21.pptx"                              
## [157] "UV_tolerance_Analysis_Flowers_F22.pptx"                                           
## [158] "UV_tolerance_data_spreadsheet_Flowers_F22_copy.xlsx"                              
## [159] "UV_tolerance_data_spreadsheet_Flowers_F22-2.xlsx"                                 
## [160] "UV_tolerance_data_spreadsheet_Flowers_F22.xlsx"                                   
## [161] "vcfR_test.vcf"                                                                    
## [162] "vcfR_test.vcf.gz"                                                                 
## [163] "vegan_PCA_amino_acids-STUDENT.Rmd"                                                
## [164] "vegan_pca_with_msleep-STUDENT.html"                                               
## [165] "vegan_pca_with_msleep-STUDENT.Rmd"                                                
## [166] "VM^JSY^JKC^JGR_BR_General Figure Guidelines and Checklist_Flowers_F22-2 copy.docx"
## [167] "walsh2017morphology-2.csv"                                                        
## [168] "walsh2017morphology-3.csv"                                                        
## [169] "walsh2017morphology-4.csv"                                                        
## [170] "walsh2017morphology-5.csv"                                                        
## [171] "walsh2017morphology-6.csv"                                                        
## [172] "walsh2017morphology-7.csv"                                                        
## [173] "walsh2017morphology-8.csv"                                                        
## [174] "walsh2017morphology.csv"                                                          
## [175] "working_directory_practice.Rmd"                                                   
## [176] "WPC MySpace Survey.docx"                                                          
## [177] "Yanamadala_Journal Club 2 Flowers F22-1.pptx"                                     
## [178] "yanamadala.pdf"

list.files(pattern = "vcf")

##  [1] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504 3.vcf" 
##  [2] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf"   
##  [3] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
##  [4] "13.29656372-29896372.ALL.chr13_GRCh38.genotypes.20170504.vcf"   
##  [5] "all_loci-1.vcf"                                                 
##  [6] "all_loci-1.vcf.txt"                                             
##  [7] "all_loci.vcf"                                                   
##  [8] "all_loci.vcf.txt"                                               
##  [9] "code_checkpoint_vcfR.Rmd"                                       
## [10] "vcfR_test.vcf"                                                  
## [11] "vcfR_test.vcf.gz"

Data preparation

Load data

Load the all_loci.vcf file into an R data object with vcfR::read.vcfR().

bird_snps <- vcfR::read.vcfR("all_loci.vcf")

## Scanning file to determine attributes.
## File attributes:
##   meta lines: 8
##   header_line: 9
##   variant count: 1929
##   column count: 81
## 
Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
##   Character matrix gt rows: 1929
##   Character matrix gt cols: 81
##   skip: 0
##   nrows: 1929
##   row_num: 0
## 
Processed variant 1000
Processed variant: 1929
## All variants processed

cat("Note - if this didn't work you may not have your working directory set")

## Note - if this didn't work you may not have your working directory set

Get genotype scores (allele counts)

Use vcfR::extract.gt() to get the genotype scores.

bird_snps_num <- vcfR::extract.gt(bird_snps,
           element = "GT",
           IDtoRowNames  = F,
           as.numeric = T,
           convertNA = T,
           return.alleles = F)

Transpose data

Transpose the data with t() so that it has the proper orientation.

# add t()
bird_snps_num_t <- t(bird_snps_num)

Convert the matrix to a dataframe.

# add data.frame()
bird_snps_num_df <- data.frame(bird_snps_num_t)

Removing NAs

In order to deal with NAs you must first locate them.

Locating NAs in rows

In this paper, the author’s state that they removed from their analysis data an individual (row) that had missing values (NAs) for >50% of the SNPs. First we need to find them.

NAs can be detected in R using is.na().

Let’s take a look at how many NAs are in the first row of the data. We’ll use bracket notation of [1, ] to look at the row.

# Add is.na() and select the first row 
## using [1, ]
NAs_row_01 <- is.na(bird_snps_num_t[1,])

is.na() returns a logical vector of TRUE and FALSE values.

Look at the output of is.na() with head().

# call head() on the vector NAs_row_01
head(NAs_row_01)

## [1] FALSE FALSE FALSE FALSE FALSE FALSE

In this vector, TRUE means “yes, there was an NA in this position”, and FALSE means “No, no NA there.”

The length of the vector is the length of the entire row we put into is.na(), meaning we have a TRUE or FALSE answer for every single value in the row.

We can check this with length() and logical comparisons.

First, the length of our vector of TRUE/FALSE responses.

# Call length() on NAs_row_01
N_NAs <- length(NAs_row_01)
N_NAs

## [1] 1929

Now, the length of our original row

# Call length() on the first row of bird_snps_num_t
## use bracket notation of [1, ] to get the 
## first row
length_row <- length(bird_snps_num_t[1,])

Now check that they are identical using ==

# Use a logical comparison with ==
## to confirm they are the same length
N_NAs == length_row

## [1] TRUE

We can work directly with a vector of TRUE and FALSE value, but I find its easiest to first convert this logical vector into a vector of index values (indices) that tell us exactly where the NAs are in the dataframe.

We can get these indices this with which(... == TRUE), because in the vector TRUE is saying “Yes, its TRUE there was an NA there.”

# Add which()
which(NAs_row_01 == TRUE)

##  [1]  664  665  666  667  668  669  693  744  983  984  985  986  987  988  989
## [16]  990 1158 1159 1470 1471 1537 1901 1902 1925 1926 1927 1928 1929

This gives us an vector of index values. We’ll save the vector for later use.

# Assign the output to an object called
## i_NA_row_01
i_NA_row_01 <- (NAs_row_01 == TRUE)

We can confirm that these parts of row 1 of our dataframe contain NAs using the vector we made i_NA_row_01 and bracket notation. Let’s look at the rows with NAs in column 1, and also see what’s in rows 2 and 3.

bird_snps_num_t[c(1:3), i_NA_row_01]

##                    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## sample_ACAAA_Nel3    NA   NA   NA   NA   NA   NA   NA   NA   NA    NA    NA
## sample_ACAGTG_Nel5    0    0    0    0    0    0    0    1    0     0     0
## sample_AGCAT_Nel8     0    0    0    0    0    0    0    1   NA    NA    NA
##                    [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## sample_ACAAA_Nel3     NA    NA    NA    NA    NA    NA    NA    NA    NA    NA
## sample_ACAGTG_Nel5     0     0     0     0     0    NA    NA     1     0    NA
## sample_AGCAT_Nel8     NA    NA    NA    NA    NA     0     1     1     0     0
##                    [,22] [,23] [,24] [,25] [,26] [,27] [,28]
## sample_ACAAA_Nel3     NA    NA    NA    NA    NA    NA    NA
## sample_ACAGTG_Nel5     0     0     0     0     0     0     0
## sample_AGCAT_Nel8      0     0    NA    NA    NA    NA    NA

A NA-seeking function

Here’s a function that will look for NAs in single column or vector, and tell us the index values.

find_NAs <- function(x){
  NAs_TF <- is.na(x)
  i_NA <- which(NAs_TF == TRUE)
  N_NA <- length(i_NA)
  
  cat("Results:",N_NA, "NAs present\n.")
  return(i_NA)
}

Let’s test it on the first row of our data.

# run find_NAs() on the first row of
## bird_snps_num_t, using bracket notation [1,]
find_NAs(bird_snps_num_t[1,])

## Results: 28 NAs present
## .

##  [1]  664  665  666  667  668  669  693  744  983  984  985  986  987  988  989
## [16]  990 1158 1159 1470 1471 1537 1901 1902 1925 1926 1927 1928 1929

Remove individuals with >50% NAs

For our workflow, we’re going to want to find the NAs in each row of data, count them up, and if >50% of the SNPs are NA, then remove the entire row.

We could find the NAs in each row like this. First, find the NAs in row 1 and save it to a vector:

i_NAs01 <- find_NAs(bird_snps_num_t[1,])

## Results: 28 NAs present
## .

And figure how how many NAs there are with length()

length(i_NAs01)

## [1] 28

Then continue this many times:

i_NAs02 <- find_NAs(bird_snps_num_t[2,])

## Results: 20 NAs present
## .

i_NAs03 <- find_NAs(bird_snps_num_t[3,])

## Results: 28 NAs present
## .

i_NAs04 <- find_NAs(bird_snps_num_t[4,])

## Results: 24 NAs present
## .

by making a new vector for each row and updating the row number in the brackets. This would take a lot of time and be very prone to errors.

for() loops

This process of working on many rows is most easily done with a common programming approach called a for loop. The name at first doesn’t make sense; what it should be called is a “do something a bunch of times” loop.

It is called a “for” loop because you tell it something along the lines of: “FOR every row in this dataframe frame, do this …” In our case we’ll work on rows, but it can also work on columns, or anything else that can exist in R.

I’ll write a for() loop that will go through each row of our SNP data and determine if >50% of the values are NA. I’ll use our find_NA() function we just made.

To do this I’m going to want to have a few things:

N_rows: The total number of rows of data in our dataframe
N_NA: A vector to hold how many NAs are in each row
N_SNPs: The total number of columns, so we can determine if >50% of the columns (SNPs) are NAs.

We can get the number of rows with nrow():

# call nrow() on bird_snps_num_t
N_rows <- nrow(bird_snps_num_t)

We can make a vector to store how many NAs are in each row like this, where rep() repeats 0 for us as many times as we want.

N_NA <- rep(x = 0, times = N_rows)

I can get the number of SNPs withncol()

# call ncol() on bird_snps_num_t
N_SNPs <- ncol(bird_snps_num_t)

The percentage of SNPs that are NA can be found as

length(i_NAs01)/N_SNPs*100

## [1] 1.451529

If we were doing this by hand we’d have to fill in the vector of the number of NAs (N_NA) like this:

# Number of NAs in row 1
i_NAs01 <- find_NAs(bird_snps_num_t[1,])

## Results: 28 NAs present
## .

N_NA[1] <- length(i_NAs01)

# Number of NAs in row 2:
i_NAs02 <- find_NAs(bird_snps_num_t[2,])

## Results: 20 NAs present
## .

N_NA[2] <- length(i_NAs02)

# Number of NAs in row 3:.
# ... etc

That would not be fun. So we automate the process with a for() loop. I won’t explain right now how whole thing work, but take a look to get the general sense of what it is.

I’ll repeat the previous preparation code so its all in one place.

# N_rows
# number of rows (individuals)
N_rows <- nrow(bird_snps_num_t)

# N_NA
# vector to hold output (number of NAs)
N_NA   <- rep(x = 0, times = N_rows)

# N_SNPs
# total number of columns (SNPs)
N_SNPs <- ncol(bird_snps_num_t)

# the for() loop
for(i in 1:N_rows){
  
  # for each row, find the location of
  ## NAs with bird_snps_num_t()
  i_NA <- find_NAs(bird_snps_num_t[i,]) 
  
  # then determine how many NAs
  ## with length()
  N_NA_i <- length(i_NA)
  
  # then save the output to 
  ## our storage vector
  N_NA[i] <- N_NA_i
}

## Results: 28 NAs present
## .Results: 20 NAs present
## .Results: 28 NAs present
## .Results: 24 NAs present
## .Results: 23 NAs present
## .Results: 63 NAs present
## .Results: 51 NAs present
## .Results: 38 NAs present
## .Results: 34 NAs present
## .Results: 24 NAs present
## .Results: 48 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 78 NAs present
## .Results: 45 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 34 NAs present
## .Results: 66 NAs present
## .Results: 54 NAs present
## .Results: 59 NAs present
## .Results: 52 NAs present
## .Results: 47 NAs present
## .Results: 31 NAs present
## .Results: 63 NAs present
## .Results: 40 NAs present
## .Results: 40 NAs present
## .Results: 22 NAs present
## .Results: 60 NAs present
## .Results: 48 NAs present
## .Results: 961 NAs present
## .Results: 478 NAs present
## .Results: 59 NAs present
## .Results: 26 NAs present
## .Results: 285 NAs present
## .Results: 409 NAs present
## .Results: 1140 NAs present
## .Results: 600 NAs present
## .Results: 1905 NAs present
## .Results: 25 NAs present
## .Results: 1247 NAs present
## .Results: 23 NAs present
## .Results: 750 NAs present
## .Results: 179 NAs present
## .Results: 433 NAs present
## .Results: 123 NAs present
## .Results: 65 NAs present
## .Results: 49 NAs present
## .Results: 192 NAs present
## .Results: 433 NAs present
## .Results: 66 NAs present
## .Results: 597 NAs present
## .Results: 1891 NAs present
## .Results: 207 NAs present
## .Results: 41 NAs present
## .Results: 268 NAs present
## .Results: 43 NAs present
## .Results: 110 NAs present
## .Results: 130 NAs present
## .Results: 90 NAs present
## .Results: 271 NAs present
## .Results: 92 NAs present
## .Results: 103 NAs present
## .Results: 175 NAs present
## .Results: 31 NAs present
## .Results: 66 NAs present
## .Results: 64 NAs present
## .Results: 400 NAs present
## .Results: 192 NAs present
## .Results: 251 NAs present
## .Results: 69 NAs present
## .Results: 58 NAs present
## .

My vector N_SNPs now how the number of NAs in each row of the dataframe.

head(N_NA)

## [1] 28 20 28 24 23 63

We can get a sense of the how many NAs there are in the rows of the dataset by making a histogram. Most rows have very few, and a few have a lot:

# Call hist() on N_NA
hist(N_NA)

The authors of the bird speciation paper decided to remove any row that was >50% NAs. There are 1929 SNPs, so 50% is about 964 SNPs.

# total number of columns
N_SNPs

## [1] 1929

# 50% of N_SNPs
cutoff50 <- N_SNPs*0.5

I can add a line for the cutoff to the plot with abline()

# Call hist() on N_NA
## add a vertical  line at the cutoff value
## using abline()
hist(N_NA)          
abline(v = cutoff50, 
       col = 2, 
       lwd = 2, 
       lty = 2)

After figuring out how many NAs there in each row, I can convert this to a percent.

percent_NA <- N_NA/N_SNPs*100

I can plot these percentages and set the cutoff at 50 for 50%

# Call hist() on N_NA
## add a vertical  line at 50%
## using abline()
hist(percent_NA)
abline(v = 50 ,     
       col = 2, 
       lwd = 2, 
       lty = 2)

I can determine the index value of each row with >50% NAs using which()

# Call which() on percent_NA
i_NA_50percent <- which(percent_NA > 50)

I use length() to see how many there are

# call length() on i_NA_50percent
length(i_NA_50percent)

## [1] 4

The index values happen to be:

i_NA_50percent

## [1] 37 39 41 53

There are 4 rows where 50% or more of the columns contain an NA

In the paper they say they removed 6, and I’m not sure where the discrepancy comes from. In order get up to 6 birds, I need to decrease the threshold to 38% missing.

which(percent_NA > 38)

## [1] 31 37 39 41 43 53

length(which(percent_NA > 38))

## [1] 6

Remove rows with >50% missing

I can remove the rows of data with >50% missing using negative indexing.

bird_snps_num_t02 <- bird_snps_num_t[-i_NA_50percent, ]

I always need to check to make sure the previous and current data make sense.

dim(bird_snps_num_t)

## [1]   72 1929

dim(bird_snps_num_t02)

## [1]   68 1929

Determine sampling locations

In our workflow information about the samples like their population of origin is getting embedded in the row names of the dataframe. (In contrast to this, VCF files from the 1000 Genomes Project have a separate file with all the information).

Its going to become necessary in subsequent assignments to access this information, for example to color-code plots. It takes a little bit of code to get the information from these row names, so I’m not going to dig into it here except to say I’m using functions known as regular expressions to be able to edit the text in the row names.

First, let’s look at the row names.

# call row.names() on  bird_snps_num_t
row_names <- row.names(bird_snps_num_t)

# call head() on row_names
head(row_names)

## [1] "sample_ACAAA_Nel3"    "sample_ACAGTG_Nel5"   "sample_AGCAT_Nel8"   
## [4] "sample_ATGAAAC_Nel10" "sample_ATGAAAC_Nel15" "sample_CGATGT_Nel4"

The individual samples are called things like “Nel3” and “Cau10”. The numbers are ID numbers of individuals birds that DNA was collected from, and the letters are the populations.

I can use regular expressions (in this case a function called gsub()) to remove the stuff like “sample_ACAAA_” before the things I want. First I’ll remove the “sample_” (don’t worry about the exact details of how the function works).

# add gsub() to before ("sample_","",row_names)
row_names02 <- gsub("sample_","",row_names)

# look at the output using head()
head(row_names02)

## [1] "ACAAA_Nel3"    "ACAGTG_Nel5"   "AGCAT_Nel8"    "ATGAAAC_Nel10"
## [5] "ATGAAAC_Nel15" "CGATGT_Nel4"

Now I’ll get rid of the As, Cs, Ts and Gs (not yet sure what those are actually…). This gives me a unique combination of a population code and number for each sample. (Again, we won’t worry about the code withing gsub()).

# clean up the character data
sample_id <- gsub("^([ATCG]*)(_)(.*)",
                  "\\3",
                  row_names02)

# look at thee output
head(sample_id)

## [1] "Nel3"  "Nel5"  "Nel8"  "Nel10" "Nel15" "Nel4"

Now I want a vector just with the population code, so I’ll use gsub() to get rid of the numbers. (Again, don’t worry about the details of what’s in gsub().)

# add gsub() before the stuff in the parentheses
pop_id <- gsub("[01-9]*",    
               "",
               sample_id)

The function table() summarizes the output for me

# call table() on  pop_id
table(pop_id)

## pop_id
## Alt Cau Div Nel Sub 
##  15  13  15  15  14

Back to the issue of the number of rows with >50% NAs

As noted above, the authors say they removed 6 rows because of NAs, but I only got 4. Its actually pretty common for things reported in a paper to diverge from what you’re able to replicate from the data - somewhere, something minor got misreported or left off of a file. But I do always like to try to figure out what’s going on.

The author’s state:

“We obtained blood samples from 75 Ammodramus sparrows … 15 from each of five putative subspecies. … Due to missing data, we removed six individuals (four of which were from [the subspecies] subvirgatus ), resulting in the analysis of 69 individuals (11–15 individuals per population).

Now that I’ve extracted the information on the samples I can see what samples they provided in their data.

Again, I can summarize the vector of population names with table().

length(pop_id)

## [1] 72

table(pop_id)

## pop_id
## Alt Cau Div Nel Sub 
##  15  13  15  15  14

Right away I can see I have only 72 samples versus their 75, so 3 are missing. There’s also only 13 in the “Cau” category and 1 in “Sub”. So the .vcf file they provided is either missing a few birds, or something is happening when vcfR loads to kick out some rows, perhaps due to data quality. I could open up the vcfR file in a text editor to check this out if I wanted.

When locating rows with NAs I created a vector where there were >50% NAs:

i_NA_50percent

## [1] 37 39 41 53

I can compare this to my vector of population IDs using brackets:

sample_id[i_NA_50percent]

## [1] "Sub3"  "Sub8"  "Sub4"  "Cau10"

In the paper they say of the six samples they removed, “four of which were from [the subspecies] subvirgatus”. I have 3 samples called “Sub”, which are probably 3 of those 4 samples, and one “Cau.”

In total, I’m missing 1 Sub with >50% missing data, 1 Cau with >50% missing data, and 1 Cau with <50% missing data. I’m not sure what’s going on, but I doubt this will impact the analysis.