The data we are using to practice working with SNPs in VCF files comes from a paper in Molecular Ecology by Jennifer Walsh called “Subspecies delineation amid phenotypic, geographic and genetic discordance in a songbird.” The goal of the paper is to compare traditional ways of classifying different populations and subspecies of Ammodramus sparrow with morphology and genetic data. In order to do this analysis, though, we need to prepare the data.
Missing data is common in SNP datasets, especially in samples from wild animals where the amount of DNA collected may be small. We can fill in (impute) missing data through methods such as mean imputation. However, if there’s lots of NAs we may want to consider removing samples (in this case birds) for which we’d have to impute many - if not most - of their SNPs. That is to say - if we have to impute a lot of the data for the sample, we’d be better off just removing the sample.
There are no strict rules for many of the decisions you have to make when doing data analysis, especially when it comes to preparation and cleaning steps like removing or imputing NAs.
In this study, the authors state:
“we removed any individuals that were missing data at more than 50% of the SNPs.”
This is an arbitrary decision - they could have chosen 60% or 25%. It is therefore called a researcher degree of freedom because they are free to chose what to do, and someone else may have made a different choice.
This type of decision is very common in data analysis, and needs to be documented and justified. Posting the raw data, as the authors have, allows other researchers to explore the consequences of a different choice.
In this exercise we will show how you can remove rows of data if it
is necessary. We’ll use the handy functions which()
and
is.na()
and a common programming tool called a
for() loop.
Note we are removing individual samples, which are represented by rows in the data. Elsewhere in our protocol we remove columns, for example if all the columns have the same genotype.
Load the vcfR package with library
()`:
library(vcfR)
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
Make sure that your working directory is set to the location of the
file all_loci.vcf
.
getwd()
## [1] "/Users/sadhikayanamadala/Downloads"
list.files()
## [1] "08-PCA_worked.Rmd"
## [2] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504 3.vcf"
## [3] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf"
## [4] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
## [5] "10.24 Find the Differences-2.docx"
## [6] "10.24 Find the Differences-3.docx"
## [7] "10.24 Find the Differences-4.docx"
## [8] "10.24 Find the Differences.docx"
## [9] "13.29656372-29896372.ALL.chr13_GRCh38.genotypes.20170504.vcf"
## [10] "1540_cluster_analysis.pdf"
## [11] "2022_18_d.JPG"
## [12] "2022RaiseEveryVoicePresentation.pptx"
## [13] "2231 Exam 2 KEY.pdf"
## [14] "AGH AC - Patient and Family.doc"
## [15] "all_loci-1.vcf"
## [16] "all_loci-1.vcf.txt"
## [17] "all_loci.vcf"
## [18] "all_loci.vcf.txt"
## [19] "allomtery_3_scatterplot3d (1).Rmd"
## [20] "Analyzing Sanger Sequencing_Flowers_F22.pptx"
## [21] "Analyzing_Practice_Sequencing_Flowers_F22_JR-1.pptx"
## [22] "Analyzing_Practice_Sequencing_Flowers_F22_JR-2.pptx"
## [23] "Analyzing_Practice_Sequencing_Flowers_F22_JR.pptx"
## [24] "Assessing_Shannon_Diversity_StudentPrelab_Flowers_F22_.pptx"
## [25] "Background_Research_Assignment_Flowers_F22_-_SG_Edits (1).pptx"
## [26] "Bacterial_isolation_protocol_Student_version_Flowers_F22.pptx"
## [27] "bird_snps_remove_NAs.Rmd"
## [28] "BPB_spectrophotometry.xlsx"
## [29] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-2.docx"
## [30] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-3-2.docx"
## [31] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-3-3.docx"
## [32] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-3-4.docx"
## [33] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-3-5.docx"
## [34] "BR_General Figure Guidelines and Checklist_Flowers_F22-2-3.docx"
## [35] "BR_General Figure Guidelines and Checklist_Flowers_F22-2.docx"
## [36] "BR_General Figure Guidelines and Checklist_Flowers_F22.docx"
## [37] "BR_Lab Meeting 1-Diversity_Flowers FA22.pptx"
## [38] "BR_UV-tolerance-Analysis_Flowers_FA22.pptx"
## [39] "BR-Excel Practice_Flowers-FA22.pptx"
## [40] "BR-Figure-3_Flowers_FA22.docx"
## [41] "BR-Figure-4_Flowers_FA22-1.docx"
## [42] "BR-In_class_template_Journal_Club_2_Flowers_F22.pptx"
## [43] "BR-Journal Club 1_Flowers_FA22.pptx"
## [44] "BR-Prelab_Bacterial Diversity and Isolation_FA22.pptx"
## [45] "BR-Spot plate and UV tolerance Prelab_Flowers_F22.pptx"
## [46] "center_function.R"
## [47] "CFU Practice Calculation_Flowers_F22-2.xlsx"
## [48] "CFU Practice Calculation_Flowers_F22-3.xlsx"
## [49] "CFU Practice Calculation_Flowers_F22.xlsx"
## [50] "Chapter 5 Stereoisomers.docx"
## [51] "chem 0310 exam 1 2221 answers.pdf"
## [52] "chem 0310 exam 1 2221.pdf"
## [53] "chem 0310 exam 2 2221.pdf"
## [54] "chem 0310 exam 2 key 2221.pdf"
## [55] "Chem 310 Recitation Problem Set I-2211 (002).docx"
## [56] "Chem 310 Recitation Problem Set III-2211.docx"
## [57] "Chem 310 Recitation Problem Set IV-2211.docx"
## [58] "Chemistry 310 Recitation Problem Set II-2211.docx"
## [59] "cluster_analysis_portfolio.Rmd"
## [60] "code_checkpoint_vcfR.Rmd"
## [61] "CODE_CHECKPOINT-first_rstudio_script.R"
## [62] "College Resume-2.pdf"
## [63] "Cover letter (1).docx"
## [64] "Cover Letter.pdf"
## [65] "CV-writingclass.docx"
## [66] "d, 11-08-2022, Reed, 10 sec.jpg"
## [67] "d, 11-08-2022, Reed, 20 sec.jpg"
## [68] "Daddy B-day Photos.pdf"
## [69] "dino.csv"
## [70] "Email Examples-2.docx"
## [71] "Email Examples.docx"
## [72] "Excel Practice Data Set_Flowers F22.xlsx"
## [73] "FA6_SB_50_B-16sForward.ab1"
## [74] "Fadi Resume -final[1].docx"
## [75] "feature_engineering_intro_2_functions-part2.Rmd"
## [76] "feature_engineering-2.Rmd"
## [77] "feature_engineering.Rmd"
## [78] "figure(II)_BR_General Figure Guidelines and Checklist_Flowers_F22-2-3 copy.docx"
## [79] "Fiji.app"
## [80] "Final Presentation planning Flowers F22 copy.docx"
## [81] "Final Presentation planning Flowers F22.docx"
## [82] "Final Presentation Planning Slides_Flowers_F22.pptx"
## [83] "Flow chart assignment_Flowers_F22-2.pptx"
## [84] "Flow chart assignment_Flowers_F22.pptx"
## [85] "Flower Microbiome Slides.pptx"
## [86] "Flowers_Data_Excel_Sheet_F22_9.9.22-2.xlsx"
## [87] "Flowers_Data_Excel_Sheet_F22_9.9.22-3.xlsx"
## [88] "Flowers_Data_Excel_Sheet_F22_9.9.22.xlsx"
## [89] "fst_exploration_in_class-STUDENT.Rmd"
## [90] "Gel_Electrophoresis_flowchart_Student_Flowers_F22.pptx"
## [91] "Graphs Presentation.pptx"
## [92] "HansikaDOB-2.pdf"
## [93] "HealthcareCover.doc"
## [94] "Hendry_aphids_CB_2018[34].pdf"
## [95] "How-to-Answer-the-64-Toughest-Interview-Questions.pdf"
## [96] "Indentifying_features_of_sunflowers_Flowers_F22-2.pptx"
## [97] "Indentifying_features_of_sunflowers_Flowers_F22.pptx"
## [98] "Lab Meeting #2 Graphing_Flowers_F22.pptx"
## [99] "Lab Meeting #3-1.pptx"
## [100] "lecture-introd2RStudio-with_scripts.pdf"
## [101] "LinkedIn Profile-3.pdf"
## [102] "Math0220_M2_review.pdf"
## [103] "Oath Ceremony Scheduled.pdf"
## [104] "ParentsMarriageCertificate.pdf"
## [105] "PCR_Set_up_Flowchart_STUDENT_Flowers_F21.pptx"
## [106] "penguins.csv"
## [107] "pitch perfect .pdf"
## [108] "portfolio_ggpubr_intro-2.Rmd"
## [109] "portfolio_ggpubr_log_transformation.Rmd"
## [110] "Practice CFU Calculation_Flowers_F22.pptx"
## [111] "Prelab Identifying Isolated Bacteria_Flowers_F22-2.pptx"
## [112] "Prelab Identifying Isolated Bacteria_Flowers_F22.pptx"
## [113] "Prelab_Barcode_SNP_Sequencing_set_up.pptx"
## [114] "Prelab_UV_Petal_Pattern_Flowers_F22-1.pptx"
## [115] "PXL_20220920_175337658.MP.jpg"
## [116] "PXL_20221101_211536673.jpg"
## [117] "PXL_20221104_190523393.jpg"
## [118] "PXL_20221111_000004240.jpg"
## [119] "PXL_20221111_000319013.jpg"
## [120] "PXL_20221115_155938237.MP.jpg"
## [121] "PXL_20221118_003205920.jpg"
## [122] "PXL_20221118_003633265.jpg"
## [123] "r_help_hclust_intro-vs2.pdf"
## [124] "removing_fixed_alleles.Rmd"
## [125] "REV-BR_Lab Meeting 1-Diversity_Flowers FA22.pptx"
## [126] "Rev-LabMathPipettingPresentation_Flowers_F22.pptx"
## [127] "rsconnect"
## [128] "S-Yanamadala_10-02-2022_Practice.xlsx"
## [129] "Scholarly Background.pdf"
## [130] "ScienceCover.doc"
## [131] "ScienceCoverLetterTemplate.doc"
## [132] "Screen Shot 2022-10-28 at 7.29.19 PM.png"
## [133] "Screen Shot 2022-11-04 at 8.55.19 PM.png"
## [134] "Screen Shot 2022-12-04 at 1.02.53 PM.png"
## [135] "Shamita_BirthCerti.pdf"
## [136] "Shannon_Diversity_Excel_F22_Flowers_9.7.22.xlsx"
## [137] "Shared_data_Lab_Meeting_1_Diversity_SY.xlsx"
## [138] "Soc 0010 Fall2022 Syllabus .docx"
## [139] "Spot_plate_practice_flowchart_STUDENT_Flowers_F21.pptx"
## [140] "Spot_test_plate_Excel_Data_Flowers_F21.xlsx"
## [141] "Statement_Aug_30_2022.pdf"
## [142] "STIT & UCONN Resume .pdf"
## [143] "STIT Resume-2.pdf"
## [144] "STIT Resume-3.pdf"
## [145] "STIT Resume.pdf"
## [146] "STUDENT-BR-Week 4 Bacterial Diversity and Isolation_Flowers_FA22.pdf"
## [147] "SY, UV tolerance Analysis LM2, 11-14-2022-2.xlsx"
## [148] "SY, UV tolerance Analysis LM2, 11-14-2022.xlsx"
## [149] "Table-D_Tuesday1145_PCR.jpg"
## [150] "Title.html"
## [151] "transpose_VCF_data.Rmd"
## [152] "TU27_Shamita-27F_forward.seq"
## [153] "Undergraduate-Student-Resume-Examples.pdf"
## [154] "UV tolerance Collated Data_Flowers_Data_Excel_Sheet_F22.xlsx"
## [155] "UV_Pattern_quantification_Fiji_Flowers_FA22.pptx"
## [156] "UV_Petal_Pattern_Flowchart_STUDENT_Flowers_F21.pptx"
## [157] "UV_tolerance_Analysis_Flowers_F22.pptx"
## [158] "UV_tolerance_data_spreadsheet_Flowers_F22_copy.xlsx"
## [159] "UV_tolerance_data_spreadsheet_Flowers_F22-2.xlsx"
## [160] "UV_tolerance_data_spreadsheet_Flowers_F22.xlsx"
## [161] "vcfR_test.vcf"
## [162] "vcfR_test.vcf.gz"
## [163] "vegan_PCA_amino_acids-STUDENT.Rmd"
## [164] "vegan_pca_with_msleep-STUDENT.html"
## [165] "vegan_pca_with_msleep-STUDENT.Rmd"
## [166] "VM^JSY^JKC^JGR_BR_General Figure Guidelines and Checklist_Flowers_F22-2 copy.docx"
## [167] "walsh2017morphology-2.csv"
## [168] "walsh2017morphology-3.csv"
## [169] "walsh2017morphology-4.csv"
## [170] "walsh2017morphology-5.csv"
## [171] "walsh2017morphology-6.csv"
## [172] "walsh2017morphology-7.csv"
## [173] "walsh2017morphology-8.csv"
## [174] "walsh2017morphology.csv"
## [175] "working_directory_practice.Rmd"
## [176] "WPC MySpace Survey.docx"
## [177] "Yanamadala_Journal Club 2 Flowers F22-1.pptx"
## [178] "yanamadala.pdf"
list.files(pattern = "vcf")
## [1] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504 3.vcf"
## [2] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf"
## [3] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
## [4] "13.29656372-29896372.ALL.chr13_GRCh38.genotypes.20170504.vcf"
## [5] "all_loci-1.vcf"
## [6] "all_loci-1.vcf.txt"
## [7] "all_loci.vcf"
## [8] "all_loci.vcf.txt"
## [9] "code_checkpoint_vcfR.Rmd"
## [10] "vcfR_test.vcf"
## [11] "vcfR_test.vcf.gz"
Load the all_loci.vcf file into an R data object with
vcfR::read.vcfR()
.
bird_snps <- vcfR::read.vcfR("all_loci.vcf")
## Scanning file to determine attributes.
## File attributes:
## meta lines: 8
## header_line: 9
## variant count: 1929
## column count: 81
##
Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 1929
## Character matrix gt cols: 81
## skip: 0
## nrows: 1929
## row_num: 0
##
Processed variant 1000
Processed variant: 1929
## All variants processed
cat("Note - if this didn't work you may not have your working directory set")
## Note - if this didn't work you may not have your working directory set
Use vcfR::extract.gt()
to get the genotype scores.
bird_snps_num <- vcfR::extract.gt(bird_snps,
element = "GT",
IDtoRowNames = F,
as.numeric = T,
convertNA = T,
return.alleles = F)
Transpose the data with t()
so that it has the proper
orientation.
# add t()
bird_snps_num_t <- t(bird_snps_num)
Convert the matrix to a dataframe.
# add data.frame()
bird_snps_num_df <- data.frame(bird_snps_num_t)
In order to deal with NAs you must first locate them.
In this paper, the author’s state that they removed from their analysis data an individual (row) that had missing values (NAs) for >50% of the SNPs. First we need to find them.
NAs can be detected in R using is.na()
.
Let’s take a look at how many NAs are in the first row of the data.
We’ll use bracket notation of [1, ]
to
look at the row.
# Add is.na() and select the first row
## using [1, ]
NAs_row_01 <- is.na(bird_snps_num_t[1,])
is.na()
returns a logical vector of
TRUE
and FALSE
values.
Look at the output of is.na()
with
head()
.
# call head() on the vector NAs_row_01
head(NAs_row_01)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
In this vector, TRUE means “yes, there was an NA in this position”, and FALSE means “No, no NA there.”
The length of the vector is the length of the entire row we put into
is.na()
, meaning we have a TRUE or FALSE answer for every
single value in the row.
We can check this with length()
and logical
comparisons.
First, the length of our vector of TRUE/FALSE responses.
# Call length() on NAs_row_01
N_NAs <- length(NAs_row_01)
N_NAs
## [1] 1929
Now, the length of our original row
# Call length() on the first row of bird_snps_num_t
## use bracket notation of [1, ] to get the
## first row
length_row <- length(bird_snps_num_t[1,])
Now check that they are identical using ==
# Use a logical comparison with ==
## to confirm they are the same length
N_NAs == length_row
## [1] TRUE
We can work directly with a vector of TRUE
and
FALSE
value, but I find its easiest to first convert this
logical vector into a vector of index
values (indices) that tell us exactly where
the NAs
are in the dataframe.
We can get these indices this with which(... == TRUE)
,
because in the vector TRUE
is saying “Yes, its TRUE there
was an NA there.”
# Add which()
which(NAs_row_01 == TRUE)
## [1] 664 665 666 667 668 669 693 744 983 984 985 986 987 988 989
## [16] 990 1158 1159 1470 1471 1537 1901 1902 1925 1926 1927 1928 1929
This gives us an vector of index values. We’ll save the vector for later use.
# Assign the output to an object called
## i_NA_row_01
i_NA_row_01 <- (NAs_row_01 == TRUE)
We can confirm that these parts of row 1 of our dataframe contain NAs
using the vector we made i_NA_row_01
and bracket notation.
Let’s look at the rows with NAs in column 1, and also see what’s in rows
2 and 3.
bird_snps_num_t[c(1:3), i_NA_row_01]
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
## sample_ACAAA_Nel3 NA NA NA NA NA NA NA NA NA NA NA
## sample_ACAGTG_Nel5 0 0 0 0 0 0 0 1 0 0 0
## sample_AGCAT_Nel8 0 0 0 0 0 0 0 1 NA NA NA
## [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
## sample_ACAAA_Nel3 NA NA NA NA NA NA NA NA NA NA
## sample_ACAGTG_Nel5 0 0 0 0 0 NA NA 1 0 NA
## sample_AGCAT_Nel8 NA NA NA NA NA 0 1 1 0 0
## [,22] [,23] [,24] [,25] [,26] [,27] [,28]
## sample_ACAAA_Nel3 NA NA NA NA NA NA NA
## sample_ACAGTG_Nel5 0 0 0 0 0 0 0
## sample_AGCAT_Nel8 0 0 NA NA NA NA NA
Here’s a function that will look for NAs in single column or vector, and tell us the index values.
find_NAs <- function(x){
NAs_TF <- is.na(x)
i_NA <- which(NAs_TF == TRUE)
N_NA <- length(i_NA)
cat("Results:",N_NA, "NAs present\n.")
return(i_NA)
}
Let’s test it on the first row of our data.
# run find_NAs() on the first row of
## bird_snps_num_t, using bracket notation [1,]
find_NAs(bird_snps_num_t[1,])
## Results: 28 NAs present
## .
## [1] 664 665 666 667 668 669 693 744 983 984 985 986 987 988 989
## [16] 990 1158 1159 1470 1471 1537 1901 1902 1925 1926 1927 1928 1929
For our workflow, we’re going to want to find the NAs in each row of data, count them up, and if >50% of the SNPs are NA, then remove the entire row.
We could find the NAs in each row like this. First, find the NAs in row 1 and save it to a vector:
i_NAs01 <- find_NAs(bird_snps_num_t[1,])
## Results: 28 NAs present
## .
And figure how how many NAs there are with length()
length(i_NAs01)
## [1] 28
Then continue this many times:
i_NAs02 <- find_NAs(bird_snps_num_t[2,])
## Results: 20 NAs present
## .
i_NAs03 <- find_NAs(bird_snps_num_t[3,])
## Results: 28 NAs present
## .
i_NAs04 <- find_NAs(bird_snps_num_t[4,])
## Results: 24 NAs present
## .
by making a new vector for each row and updating the row number in the brackets. This would take a lot of time and be very prone to errors.
This process of working on many rows is most easily done with a common programming approach called a for loop. The name at first doesn’t make sense; what it should be called is a “do something a bunch of times” loop.
It is called a “for” loop because you tell it something along the lines of: “FOR every row in this dataframe frame, do this …” In our case we’ll work on rows, but it can also work on columns, or anything else that can exist in R.
I’ll write a for()
loop that will go through each row of
our SNP data and determine if >50% of the values are NA
.
I’ll use our find_NA()
function we just made.
To do this I’m going to want to have a few things:
N_rows
: The total number of rows of data in our
dataframeN_NA
: A vector to hold how many NAs are in each
rowN_SNPs
: The total number of columns, so we can
determine if >50% of the columns (SNPs) are NAs.We can get the number of rows with nrow()
:
# call nrow() on bird_snps_num_t
N_rows <- nrow(bird_snps_num_t)
We can make a vector to store how many NAs are in each row like this,
where rep()
repeats 0 for us as many times as we want.
N_NA <- rep(x = 0, times = N_rows)
I can get the number of SNPs withncol()
# call ncol() on bird_snps_num_t
N_SNPs <- ncol(bird_snps_num_t)
The percentage of SNPs that are NA can be found as
length(i_NAs01)/N_SNPs*100
## [1] 1.451529
If we were doing this by hand we’d have to fill in the vector of the
number of NAs (N_NA
) like this:
# Number of NAs in row 1
i_NAs01 <- find_NAs(bird_snps_num_t[1,])
## Results: 28 NAs present
## .
N_NA[1] <- length(i_NAs01)
# Number of NAs in row 2:
i_NAs02 <- find_NAs(bird_snps_num_t[2,])
## Results: 20 NAs present
## .
N_NA[2] <- length(i_NAs02)
# Number of NAs in row 3:.
# ... etc
That would not be fun. So we automate the process with a
for()
loop. I won’t explain right now how whole thing work,
but take a look to get the general sense of what it is.
I’ll repeat the previous preparation code so its all in one place.
# N_rows
# number of rows (individuals)
N_rows <- nrow(bird_snps_num_t)
# N_NA
# vector to hold output (number of NAs)
N_NA <- rep(x = 0, times = N_rows)
# N_SNPs
# total number of columns (SNPs)
N_SNPs <- ncol(bird_snps_num_t)
# the for() loop
for(i in 1:N_rows){
# for each row, find the location of
## NAs with bird_snps_num_t()
i_NA <- find_NAs(bird_snps_num_t[i,])
# then determine how many NAs
## with length()
N_NA_i <- length(i_NA)
# then save the output to
## our storage vector
N_NA[i] <- N_NA_i
}
## Results: 28 NAs present
## .Results: 20 NAs present
## .Results: 28 NAs present
## .Results: 24 NAs present
## .Results: 23 NAs present
## .Results: 63 NAs present
## .Results: 51 NAs present
## .Results: 38 NAs present
## .Results: 34 NAs present
## .Results: 24 NAs present
## .Results: 48 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 78 NAs present
## .Results: 45 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 34 NAs present
## .Results: 66 NAs present
## .Results: 54 NAs present
## .Results: 59 NAs present
## .Results: 52 NAs present
## .Results: 47 NAs present
## .Results: 31 NAs present
## .Results: 63 NAs present
## .Results: 40 NAs present
## .Results: 40 NAs present
## .Results: 22 NAs present
## .Results: 60 NAs present
## .Results: 48 NAs present
## .Results: 961 NAs present
## .Results: 478 NAs present
## .Results: 59 NAs present
## .Results: 26 NAs present
## .Results: 285 NAs present
## .Results: 409 NAs present
## .Results: 1140 NAs present
## .Results: 600 NAs present
## .Results: 1905 NAs present
## .Results: 25 NAs present
## .Results: 1247 NAs present
## .Results: 23 NAs present
## .Results: 750 NAs present
## .Results: 179 NAs present
## .Results: 433 NAs present
## .Results: 123 NAs present
## .Results: 65 NAs present
## .Results: 49 NAs present
## .Results: 192 NAs present
## .Results: 433 NAs present
## .Results: 66 NAs present
## .Results: 597 NAs present
## .Results: 1891 NAs present
## .Results: 207 NAs present
## .Results: 41 NAs present
## .Results: 268 NAs present
## .Results: 43 NAs present
## .Results: 110 NAs present
## .Results: 130 NAs present
## .Results: 90 NAs present
## .Results: 271 NAs present
## .Results: 92 NAs present
## .Results: 103 NAs present
## .Results: 175 NAs present
## .Results: 31 NAs present
## .Results: 66 NAs present
## .Results: 64 NAs present
## .Results: 400 NAs present
## .Results: 192 NAs present
## .Results: 251 NAs present
## .Results: 69 NAs present
## .Results: 58 NAs present
## .
My vector N_SNPs
now how the number of NAs in each row
of the dataframe.
head(N_NA)
## [1] 28 20 28 24 23 63
We can get a sense of the how many NAs there are in the rows of the dataset by making a histogram. Most rows have very few, and a few have a lot:
# Call hist() on N_NA
hist(N_NA)
The authors of the bird speciation paper decided to remove any row that was >50% NAs. There are 1929 SNPs, so 50% is about 964 SNPs.
# total number of columns
N_SNPs
## [1] 1929
# 50% of N_SNPs
cutoff50 <- N_SNPs*0.5
I can add a line for the cutoff to the plot with
abline()
# Call hist() on N_NA
## add a vertical line at the cutoff value
## using abline()
hist(N_NA)
abline(v = cutoff50,
col = 2,
lwd = 2,
lty = 2)
After figuring out how many NAs there in each row, I can convert this to a percent.
percent_NA <- N_NA/N_SNPs*100
I can plot these percentages and set the cutoff at 50 for 50%
# Call hist() on N_NA
## add a vertical line at 50%
## using abline()
hist(percent_NA)
abline(v = 50 ,
col = 2,
lwd = 2,
lty = 2)
I can determine the index value of each row with >50% NAs using
which()
# Call which() on percent_NA
i_NA_50percent <- which(percent_NA > 50)
I use length()
to see how many there are
# call length() on i_NA_50percent
length(i_NA_50percent)
## [1] 4
The index values happen to be:
i_NA_50percent
## [1] 37 39 41 53
There are 4 rows where 50% or more of the columns contain an NA
In the paper they say they removed 6, and I’m not sure where the discrepancy comes from. In order get up to 6 birds, I need to decrease the threshold to 38% missing.
which(percent_NA > 38)
## [1] 31 37 39 41 43 53
length(which(percent_NA > 38))
## [1] 6
I can remove the rows of data with >50% missing using negative indexing.
bird_snps_num_t02 <- bird_snps_num_t[-i_NA_50percent, ]
I always need to check to make sure the previous and current data make sense.
dim(bird_snps_num_t)
## [1] 72 1929
dim(bird_snps_num_t02)
## [1] 68 1929
In our workflow information about the samples like their population of origin is getting embedded in the row names of the dataframe. (In contrast to this, VCF files from the 1000 Genomes Project have a separate file with all the information).
Its going to become necessary in subsequent assignments to access this information, for example to color-code plots. It takes a little bit of code to get the information from these row names, so I’m not going to dig into it here except to say I’m using functions known as regular expressions to be able to edit the text in the row names.
First, let’s look at the row names.
# call row.names() on bird_snps_num_t
row_names <- row.names(bird_snps_num_t)
# call head() on row_names
head(row_names)
## [1] "sample_ACAAA_Nel3" "sample_ACAGTG_Nel5" "sample_AGCAT_Nel8"
## [4] "sample_ATGAAAC_Nel10" "sample_ATGAAAC_Nel15" "sample_CGATGT_Nel4"
The individual samples are called things like “Nel3” and “Cau10”. The numbers are ID numbers of individuals birds that DNA was collected from, and the letters are the populations.
I can use regular expressions (in this case a function called
gsub(
)) to remove the stuff like “sample_ACAAA_” before the
things I want. First I’ll remove the “sample_” (don’t worry about the
exact details of how the function works).
# add gsub() to before ("sample_","",row_names)
row_names02 <- gsub("sample_","",row_names)
# look at the output using head()
head(row_names02)
## [1] "ACAAA_Nel3" "ACAGTG_Nel5" "AGCAT_Nel8" "ATGAAAC_Nel10"
## [5] "ATGAAAC_Nel15" "CGATGT_Nel4"
Now I’ll get rid of the As, Cs, Ts and Gs (not yet sure what those
are actually…). This gives me a unique combination of a population code
and number for each sample. (Again, we won’t worry about the code
withing gsub()
).
# clean up the character data
sample_id <- gsub("^([ATCG]*)(_)(.*)",
"\\3",
row_names02)
# look at thee output
head(sample_id)
## [1] "Nel3" "Nel5" "Nel8" "Nel10" "Nel15" "Nel4"
Now I want a vector just with the population code, so I’ll use
gsub()
to get rid of the numbers. (Again, don’t worry about
the details of what’s in gsub()
.)
# add gsub() before the stuff in the parentheses
pop_id <- gsub("[01-9]*",
"",
sample_id)
The function table() summarizes the output for me
# call table() on pop_id
table(pop_id)
## pop_id
## Alt Cau Div Nel Sub
## 15 13 15 15 14
As noted above, the authors say they removed 6 rows because of NAs, but I only got 4. Its actually pretty common for things reported in a paper to diverge from what you’re able to replicate from the data - somewhere, something minor got misreported or left off of a file. But I do always like to try to figure out what’s going on.
The author’s state:
“We obtained blood samples from 75 Ammodramus sparrows … 15 from each of five putative subspecies. … Due to missing data, we removed six individuals (four of which were from [the subspecies] subvirgatus ), resulting in the analysis of 69 individuals (11–15 individuals per population).
Now that I’ve extracted the information on the samples I can see what samples they provided in their data.
Again, I can summarize the vector of population names with
table()
.
length(pop_id)
## [1] 72
table(pop_id)
## pop_id
## Alt Cau Div Nel Sub
## 15 13 15 15 14
Right away I can see I have only 72 samples versus their 75, so 3 are missing. There’s also only 13 in the “Cau” category and 1 in “Sub”. So the .vcf file they provided is either missing a few birds, or something is happening when vcfR loads to kick out some rows, perhaps due to data quality. I could open up the vcfR file in a text editor to check this out if I wanted.
When locating rows with NAs I created a vector where there were >50% NAs:
i_NA_50percent
## [1] 37 39 41 53
I can compare this to my vector of population IDs using brackets:
sample_id[i_NA_50percent]
## [1] "Sub3" "Sub8" "Sub4" "Cau10"
In the paper they say of the six samples they removed, “four of which were from [the subspecies] subvirgatus”. I have 3 samples called “Sub”, which are probably 3 of those 4 samples, and one “Cau.”
In total, I’m missing 1 Sub with >50% missing data, 1 Cau with >50% missing data, and 1 Cau with <50% missing data. I’m not sure what’s going on, but I doubt this will impact the analysis.