In this worked example you will replicate a PCA on a published dataset.

The example is split into two parts. In this Data Preparation phase, you will do the following things:

Load the all_loci.vcf file into R (vcfR::read.vcfR())
Extract the genotype scores (vcfR::extract.gt())
Transpose the data (t())
Remove samples with >50% missing data (a for() loop)
Save the cleaned data as a .csv file for the next step (write.csv())

This worked example is based on a paper in the journal Molecular Ecology from 2017 by Jennifer Walsh and colleagues titled Subspecies delineation amid phenotypic, geographic and genetic discordance in a songbird.
The study investigated variation between two bird species in the genus Ammodramus: A. nelsoni and A. caudacutus.

The species A. nelsoni has been divided into three subspecies: A. n. nelsoni, A. n. alterus, and A. n. subvirgatus. The other species, A. caudacutus, has been divided into two subspecies: A. c. caudacutus and A. c. diversus.

The purpose of this study was to investigate to what extent these five subspecies recognized by taxonomists are supported by genetic data. The authors collected DNA from 75 birds (15 per subspecies) and genotyped 1929 SNPs. They then analyzed the data with Principal Components Analysis (PCA), among other genetic analyses.
This tutorial will work through all of the steps necessary to re-analyze Walsh et al.'s data.

All of the code is provided below. Your task is to work through it step by step, running each chunk and completing the TODO annotations as you go.

Load vcfR and the other required packages with library().
library(vcfR)
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Make sure that your working directory is set to the location of the
file all_loci.vcf.
getwd()
## [1] "/Users/mansiavunoori/Downloads"
list.files()
## [1] "02-2023 (1).pdf"
## [2] "02-2023 (2).pdf"
## [3] "02-2023.pdf"
## [4] "07-mean_imputation.docx"
## [5] "07-mean_imputation.html"
## [6] "07-mean_imputation.Rmd"
## [7] "08-PCA_worked.html"
## [8] "08-PCA_worked.Rmd"
## [9] "08C7A425-5969-4803-ADF1-5F36CC28DCDD.jpeg"
## [10] "09-PCA_worked_example-SNPs-part1.Rmd"
## [11] "1 .mp3"
## [12] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
## [13] "10-PCA_worked_example-SNPs-part2.Rmd"
## [14] "136662205-cartoon-butt-drawing-with-fart-cloud-funny-hand-drawn-doodle-of-gas-and-flatulence-simple-vector-cli (1).webp"
## [15] "1540_final_project_Final_Report_template.pdf"
## [16] "1540_final_report_flowchart (1).pdf"
## [17] "1540_final_report_flowchart.pdf"
## [18] "1540_in_class_exercise_random_numrbers.pdf"
## [19] "2.136483646-136733646.ALL.chr2_GRCh38.genotypes.20170504.vcf.gz"
## [20] "2.mp3"
## [21] "20210914-dhirana-029.jpg"
## [22] "20220914210450_001.pdf"
## [23] "2217 Project 2 - Lab 1.pdf"
## [24] "2283007F-E478-45FD-8DE6-A360F95199B7.jpeg"
## [25] "3.mp3"
## [26] "3470DAC8-F896-4F2F-BC74-B28F7096667E.jpeg"
## [27] "356 N Craig Avunoori Welcome Letter (1).pdf"
## [28] "356 N Craig Avunoori Welcome Letter.pdf"
## [29] "356 N Craig St, Pittsburgh, pa 15213 (1).png"
## [30] "356 N Craig St, Pittsburgh, pa 15213.png"
## [31] "4.mp3"
## [32] "5.mp3"
## [33] "566BDAC6-FA25-4E60-A645-3CBE821E341B.jpeg"
## [34] "6.mp3"
## [35] "65791521-44B0-4AEB-B8A0-6C30F12EA73F.jpeg"
## [36] "714297.vcf"
## [37] "A.R. Rahman - Radha Kaise Na Jale Best Video_Lagaan_Aamir Khan_Asha Bhosle_Udit Narayan.mp3"
## [38] "A2.class"
## [39] "A2.java"
## [40] "A2598A5C-179F-4059-8C60-9A7D6C8E7DA7.jpeg"
## [41] "A2copy.class"
## [42] "a2input.txt"
## [43] "a2output.txt"
## [44] "a2outputCOPY.txt"
## [45] "A3.class"
## [46] "A3.java"
## [47] "A3input.txt"
## [48] "all_loci-1.vcf"
## [49] "all_loci.vcf"
## [50] "allomtery_3_scatterplot3d (1).Rmd"
## [51] "ALLoopRemove.class"
## [52] "ALLoopRemove.java"
## [53] "ALRemove.class"
## [54] "ALRemove.java"
## [55] "anuya's bday (1).png"
## [56] "anuya's bday.png"
## [57] "Arhat.Rmd"
## [58] "ArrayExample.class"
## [59] "ArrayExample.java"
## [60] "arrayinput-1.txt"
## [61] "ArrayListDemo.class"
## [62] "ArrayListDemo.java"
## [63] "Arvin.mp3"
## [64] "AS1_7568.jpg"
## [65] "Assignment 3 - Sciences.pdf"
## [66] "Assignment1.class"
## [67] "Assignment1.java"
## [68] "atom-mac.zip"
## [69] "Atom.app"
## [70] "Avunoori Renters application - signed.pdf"
## [71] "Avunoori Renters application_encrypted_ (1).pdf"
## [72] "Avunoori Renters application_encrypted_.pdf"
## [73] "AvunooriMansi_CompBioTracker_3.13.22.pdf"
## [74] "BioSci Tracker_NEW Gen Eds_1.2022.pdf"
## [75] "bird_snps_remove_NAs.html"
## [76] "bird_snps_remove_NAs.Rmd"
## [77] "BrochureFinal7-21 - Sonali Dixit.pdf"
## [78] "CafeDhirana - Quote (1).pdf"
## [79] "CafeDhirana - Quote.pdf"
## [80] "CafeDhirana - QuoteV2.pdf"
## [81] "Catering Exemption Form Requirements 3.28.18.pdf"
## [82] "Chapter 02 - Programs"
## [83] "Chapter 02 - Programs.zip"
## [84] "Chapter 03"
## [85] "Chapter 03.zip"
## [86] "Chapter 04"
## [87] "Chapter 04.zip"
## [88] "Chapter 05"
## [89] "Chapter 05.zip"
## [90] "Chapter 06"
## [91] "Chapter 06.zip"
## [92] "Chapter 07"
## [93] "Chapter 07.zip"
## [94] "Chapter 09"
## [95] "Chapter 09.zip"
## [96] "Chapter 1.pdf"
## [97] "Cheat Sheet.pdf"
## [98] "CheatSheetExam4.pdf"
## [99] "cluster_analysis_portfolio.Rmd"
## [100] "code_checkpoint_vcfR.Rmd"
## [101] "CODE_CHECKPOINT-first_rstudio_script.R"
## [102] "Comp Piece 2022-23 FINAL.mp3"
## [103] "CompBio Cheat Sheet.docx"
## [104] "CompBio Cheat Sheet.pdf"
## [105] "CompBio Cheat Sheet2.pdf"
## [106] "CS0011_Project1_Lab1-1.pdf"
## [107] "CS0011_Project1_Lab2.pdf"
## [108] "D6E996E3-2DC4-422B-976C-E999CEC2BBE4.jpeg"
## [109] "des217project3.py"
## [110] "des217project3.py.zip"
## [111] "Dhirana 2022(Falll) - Basic Invoice.pdf"
## [112] "Dhirana Mangalam.mp3"
## [113] "donate blood save life.png"
## [114] "Englit 560 Close Reading Workshop Handout-1.docx"
## [115] "Event Description.docx"
## [116] "Events with Minors BLANK.xlsx"
## [117] "Excerpt from Fledgling.pdf"
## [118] "Fall 2022 Schedule - Mansi .png"
## [119] "final_report_template.Rmd"
## [120] "fst_exploration_in_class-STUDENT.Rmd"
## [121] "fst_exploration_in_class.Rmd"
## [122] "Full.pdf"
## [123] "Getting Started in Research.pdf"
## [124] "GL_ACORD_UniversityofPittsburgh_09_14_2022_P100123633-519030125556.PDF"
## [125] "Goffman_Erving_The_Presentation_of_Self_in_Everyday_Life (1).pdf"
## [126] "Goffman_Erving_The_Presentation_of_Self_in_Everyday_Life.pdf"
## [127] "Group Project Written Response.docx"
## [128] "Harassment MEMO_Mansi Avunoori_dnr comments.pdf"
## [129] "Harassment MEMO_Mansi Avunoori-3.pdf"
## [130] "hindu student council presents - Amrutha Selvaraja.mp4"
## [131] "IMG_0112.JPG"
## [132] "IMG_02DBE07B4868-1.jpeg"
## [133] "IMG_0921 (2).JPEG"
## [134] "IMG_7363.jpg"
## [135] "IMG_7365.jpg"
## [136] "IMG_7570.jpg"
## [137] "IMG_7596.jpg"
## [138] "IMG_9125.jpg"
## [139] "IMG_9428.HEIC"
## [140] "IMG_9597.HEIC"
## [141] "IMG_9809 (2).jpg"
## [142] "IMG_9964.heic"
## [143] "IMG_9965.heic"
## [144] "InClassExerciseGGPLOT2.heic"
## [145] "introduciNG NRITYAMALA's 2022-2023 EXECUTIVE BOARD"
## [146] "introduciNG NRITYAMALA's 2022-2023 EXECUTIVE BOARD (1).png"
## [147] "introduciNG NRITYAMALA's 2022-2023 EXECUTIVE BOARD.mp4"
## [148] "introduciNG NRITYAMALA's 2022-2023 EXECUTIVE BOARD.png"
## [149] "introduciNG NRITYAMALA's 2022-2023 EXECUTIVE BOARD.zip"
## [150] "jdk-18_macos-x64_bin.tar.gz"
## [151] "jdk-18.0.2.1.jdk"
## [152] "KDB_4082.JPG"
## [153] "KDB_6431.JPG"
## [154] "KDB_6906.JPG"
## [155] "Lab1.class"
## [156] "Lab1.java"
## [157] "Lab2.class"
## [158] "Lab2.java"
## [159] "Lab3.class"
## [160] "Lab3.java"
## [161] "Lab4.class"
## [162] "Lab4.java"
## [163] "Lab5.class"
## [164] "Lab5.java"
## [165] "Lab6.class"
## [166] "Lab6.java"
## [167] "Lab6b.class"
## [168] "Lab6b.java"
## [169] "Lab7.class"
## [170] "Lab7.java"
## [171] "Lab8.class"
## [172] "Lab8.java"
## [173] "line_of_best_fit_example-tibet_allele_freq.pdf"
## [174] "LoadingVCFintoR.Rmd"
## [175] "Lyrical_ Mere Dholna _ Bhool Bhulaiyaa _ Vidya Balan _ Shreya Ghoshal, M.G. Sreekumar _ Pritam.mp3"
## [176] "maa368_project1.py"
## [177] "Mansi__Front Judges Ranking Template .xlsx"
## [178] "mansi_snps.vcf.gz"
## [179] "Master Course List_2.8.2022.pdf"
## [180] "MEMO DRAFT 2.pdf"
## [181] "MidtermPractice.java"
## [182] "MuEditor-osx-1.1.1.dmg"
## [183] "NameAgeAnnualIncome.class"
## [184] "NameAgeAnnualIncome.java"
## [185] "new-business-package.pdf"
## [186] "Nmala 2022 Hype.mp4"
## [187] "Nmala Critique Night 02:01.MOV"
## [188] "Nmala New Hype Video.mp4"
## [189] "Nrityamala_Performance.mp4"
## [190] "Nrityamala-Intro Video.mp4"
## [191] "Nrityapriya Mangalam.mp4"
## [192] "PCA-missing_data.Rmd"
## [193] "Pitt_CafeDelhi_Cert_2022 (1).pdf"
## [194] "Pitt_CafeDelhi_Cert_2022.pdf"
## [195] "Poems by Wordsworth, Thompson, and Edwards.pdf"
## [196] "portfolio_ggpubr_intro-2.Rmd"
## [197] "portfolio_ggpubr_log_transformation.Rmd"
## [198] "Pranav.mp3"
## [199] "Pranavalaya - Rough Cut 2.mp4"
## [200] "Pranavalaya - ROUGH.mp4"
## [201] "Pranavalaya.mp3"
## [202] "project_3_partial.py"
## [203] "python-3.10.4-macos11.pkg"
## [204] "Quiz 2 Review theory 1070.docx"
## [205] "r_help_hclust_intro-vs2.pdf"
## [206] "R-4.2.1.pkg"
## [207] "Racial_Innocence_Performing_American_Childhood_fro..._----_(4_The_Black-and-Whiteness_of_Raggedy_Ann).pdf"
## [208] "Recursion.java"
## [209] "RecursiveRemove.java"
## [210] "removing_fixed_alleles.html"
## [211] "removing_fixed_alleles.Rmd"
## [212] "Resume -- Template.docx"
## [213] "Resume_MansiAvunoori.pdf"
## [214] "rough logo idea.png"
## [215] "rsconnect"
## [216] "SalesTax.class"
## [217] "SalesTax.java"
## [218] "Simba.mp3"
## [219] "Spacer.mp3"
## [220] "Spring-2023---option-1.csmo"
## [221] "Spring-2023---option-1.png"
## [222] "ST Second Writing Prompt.docx"
## [223] "staples_scan (6).pdf"
## [224] "StringManipulator.class"
## [225] "StringManipulator.java"
## [226] "Taaza 4.0 NDA 2022-2023.pdf"
## [227] "Tech Confirm - South Asian Activities Fair.pdf"
## [228] "telugu.png"
## [229] "Test.docx"
## [230] "test9.java"
## [231] "Thillana2023.mpeg"
## [232] "transpose_VCF_data.html"
## [233] "transpose_VCF_data.Rmd"
## [234] "UPitt_CafeDelhi_Cert_2022.pdf"
## [235] "vcfR_test.vcf"
## [236] "vcfR_test.vcf.gz"
## [237] "vegan_PCA_amino_acids-STUDENT.docx"
## [238] "vegan_PCA_amino_acids-STUDENT.html"
## [239] "vegan_PCA_amino_acids-STUDENT.Rmd"
## [240] "vegan_pca_with_msleep-STUDENT (1).Rmd"
## [241] "vegan_pca_with_msleep-STUDENT.html"
## [242] "vegan_pca_with_msleep-STUDENT.Rmd"
## [243] "walsh2017morphology (1).csv"
## [244] "walsh2017morphology.csv"
## [245] "week08_cluster_analysis-1.pdf"
## [246] "week11_handout.pdf"
## [247] "WELCOME-TO-EMS.pptx"
## [248] "working_directory_practice.html"
## [249] "working_directory_practice.Rmd"
## [250] "YTMp3_YTMP3WEB_v4.1.1.apk"
list.files(pattern = "vcf")
## [1] "1.159051856-159301856.ALL.chr1_GRCh38.genotypes.20170504.vcf.gz"
## [2] "2.136483646-136733646.ALL.chr2_GRCh38.genotypes.20170504.vcf.gz"
## [3] "714297.vcf"
## [4] "all_loci-1.vcf"
## [5] "all_loci.vcf"
## [6] "code_checkpoint_vcfR.Rmd"
## [7] "mansi_snps.vcf.gz"
## [8] "vcfR_test.vcf"
## [9] "vcfR_test.vcf.gz"
TODO: Load the all_loci.vcf file into an R data object using vcfR::read.vcfR().
snps <- vcfR::read.vcfR("all_loci.vcf", convertNA = TRUE)
## Scanning file to determine attributes.
## File attributes:
## meta lines: 8
## header_line: 9
## variant count: 1929
## column count: 81
##
## Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 1929
## Character matrix gt cols: 81
## skip: 0
## nrows: 1929
## row_num: 0
##
## Processed variant 1000
## Processed variant: 1929
## All variants processed
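Before extracting the genotypes, it can help to confirm that the file loaded as expected. The quick check below is not part of the original workflow; printing the object gives a summary, and the @gt slot holds the genotype matrix (its first column is FORMAT, the remaining columns are the samples).

# optional sanity check (not in the original script)
snps          # print a summary of the vcfR object
dim(snps@gt)  # variants x (samples + 1 FORMAT column)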
TODO: Use vcfR::extract.gt() to get the genotype scores. This extracts the genotype ("GT") element for each SNP and each sample, converting the calls to numeric scores (as.numeric = T) and coding missing genotypes as NA (convertNA = T).
snps_num <- vcfR::extract.gt(snps,
                             element = "GT",
                             IDtoRowNames = F,
                             as.numeric = T,
                             convertNA = T,
                             return.alleles = F)
TODO: Transpose the data with t() so that it has the proper orientation for PCA: individuals (samples) in the rows and SNPs in the columns.
snps_num_t <- t(snps_num)
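If you want to verify the orientation, a quick dim() check (not part of the original code) should show that the rows and columns have swapped after the transpose.

# optional check: dimensions should be swapped after t()
dim(snps_num)    # SNPs x individuals
dim(snps_num_t)  # individuals x SNPs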
TODO: Convert the matrix to a dataframe.
snps_num_df <- data.frame(snps_num_t)
TODO: The function below looks for NAs in a single vector (e.g. one row of our data), reports how many there are, and returns their index values.
find_NAs <- function(x){
  NAs_TF <- is.na(x)
  i_NA <- which(NAs_TF == TRUE)
  N_NA <- length(i_NA)
  
  cat("Results:", N_NA, "NAs present.\n")
  
  return(i_NA)
}
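To see how find_NAs() behaves, you can try it on a small made-up vector first. This toy example is only for illustration and is not part of the original analysis.

# toy example: the NAs sit at positions 2 and 5
test_vector <- c(10, NA, 30, 40, NA)
find_NAs(test_vector)  # reports 2 NAs and returns the indices 2 and 5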
TODO: Write a for() loop that will go through each row of our SNP data and determine if >50% of the values are NA. We first set up three objects:

N_rows: the total number of rows of data in our dataframe
N_NA: a vector to hold how many NAs are in each row
N_SNPs: the total number of columns, so we can determine if >50% of the columns (SNPs) are NAs

# N_rows
# number of rows (individuals)
N_rows <- nrow(snps_num_t)
# N_NA
# vector to hold output (number of NAs)
N_NA <- rep(x = 0, times = N_rows)
# N_SNPs
# total number of columns (SNPs)
N_SNPs <- ncol(snps_num_t)
# the for() loop
for(i in 1:N_rows){
  
  # for each row, find the location of
  ## the NAs with find_NAs()
  i_NA <- find_NAs(snps_num_t[i,])
  
  # then determine how many NAs
  ## with length()
  N_NA_i <- length(i_NA)
  
  # then save the output to
  ## our storage vector
  N_NA[i] <- N_NA_i
}
## Results: 28 NAs present.
## Results: 20 NAs present.
## Results: 28 NAs present.
## Results: 24 NAs present.
## Results: 23 NAs present.
## Results: 63 NAs present.
## Results: 51 NAs present.
## Results: 38 NAs present.
## Results: 34 NAs present.
## Results: 24 NAs present.
## Results: 48 NAs present.
## Results: 21 NAs present.
## Results: 42 NAs present.
## Results: 78 NAs present.
## Results: 45 NAs present.
## Results: 21 NAs present.
## Results: 42 NAs present.
## Results: 34 NAs present.
## Results: 66 NAs present.
## Results: 54 NAs present.
## Results: 59 NAs present.
## Results: 52 NAs present.
## Results: 47 NAs present.
## Results: 31 NAs present.
## Results: 63 NAs present.
## Results: 40 NAs present.
## Results: 40 NAs present.
## Results: 22 NAs present.
## Results: 60 NAs present.
## Results: 48 NAs present.
## Results: 961 NAs present.
## Results: 478 NAs present.
## Results: 59 NAs present.
## Results: 26 NAs present.
## Results: 285 NAs present.
## Results: 409 NAs present.
## Results: 1140 NAs present.
## Results: 600 NAs present.
## Results: 1905 NAs present.
## Results: 25 NAs present.
## Results: 1247 NAs present.
## Results: 23 NAs present.
## Results: 750 NAs present.
## Results: 179 NAs present.
## Results: 433 NAs present.
## Results: 123 NAs present.
## Results: 65 NAs present.
## Results: 49 NAs present.
## Results: 192 NAs present.
## Results: 433 NAs present.
## Results: 66 NAs present.
## Results: 597 NAs present.
## Results: 1891 NAs present.
## Results: 207 NAs present.
## Results: 41 NAs present.
## Results: 268 NAs present.
## Results: 43 NAs present.
## Results: 110 NAs present.
## Results: 130 NAs present.
## Results: 90 NAs present.
## Results: 271 NAs present.
## Results: 92 NAs present.
## Results: 103 NAs present.
## Results: 175 NAs present.
## Results: 31 NAs present.
## Results: 66 NAs present.
## Results: 64 NAs present.
## Results: 400 NAs present.
## Results: 192 NAs present.
## Results: 251 NAs present.
## Results: 69 NAs present.
## Results: 58 NAs present.
TODO: Calculate 50% of the number of SNPs. Then make a histogram of N_NA. Finally, add a line for the cutoff to the plot with abline().
# 50% of N_SNPs
cutoff50 <- N_SNPs*0.5

hist(N_NA)

abline(v = cutoff50,
       col = 2,
       lwd = 2,
       lty = 2)
TODO: After figuring out how many NAs there are in each row, convert this number to a percentage. Then determine the index value of each row with >50% NAs using which(), and remove those rows of data using negative indexing.
percent_NA <- N_NA/N_SNPs*100
# Call which() on percent_NA
i_NA_50percent <- which(percent_NA > 50)
snps_num_t02 <- snps_num_t[-i_NA_50percent, ]
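The checks below are not part of the original code, but they make it easy to see what the filtering did: how many individuals were flagged, the dimensions before and after, and how negative indexing drops elements from a small vector.

# optional checks
length(i_NA_50percent)  # number of individuals flagged
dim(snps_num_t)         # before filtering
dim(snps_num_t02)       # after filtering: fewer rows, same number of columns

# negative indexing on a toy vector: drop elements 1 and 3
x <- c("a", "b", "c", "d")
x[-c(1, 3)]  # returns "b" "d"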
TODO: First grab the row names. Use the regular-expression function gsub() to remove the "sample_" prefix. Then get rid of the As, Cs, Ts and Gs to leave a unique combination of a population code and number for each sample. Use gsub() again to strip the numbers, giving a vector of population codes. The function table() summarizes the output.
# get the row names (sample IDs)
row_names <- row.names(snps_num_t02)

row_names02 <- gsub("sample_", "", row_names)

sample_id <- gsub("^([ATCG]*)(_)(.*)",
                  "\\3",
                  row_names02)

pop_id <- gsub("[0-9]*",
               "",
               sample_id)
table(pop_id)
## pop_id
## Alt Cau Div Nel Sub
## 15 12 15 15 11
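If the regular expressions look opaque, trace them on a single sample name. The name below is hypothetical (the real row names come from the VCF and may be formatted differently), but it shows how the pattern keeps only the text after the leading run of A/C/G/T characters, and how the final gsub() strips the digits to leave the population code.

# hypothetical sample name, for illustration only
example_name <- "sample_ACGTAC_Nel01"
example_name <- gsub("sample_", "", example_name)  # "ACGTAC_Nel01"
gsub("^([ATCG]*)(_)(.*)", "\\3", example_name)     # "Nel01"
gsub("[0-9]*", "", "Nel01")                        # "Nel"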
TODO: The function below will remove invariant columns (columns with zero variance) from a dataframe, as long as you add return(x) to the end.
invar_omit <- function(x){
  cat("Dataframe of dim", dim(x), "processed...\n")
  sds <- apply(x, 2, sd, na.rm = TRUE)
  i_var0 <- which(sds == 0)
  
  cat(length(i_var0), "columns removed\n")
  
  if(length(i_var0) > 0){
    x <- x[, -i_var0]
  }
  
  ## add return() with x in it
  return(x)
}
snps_no_invar <- invar_omit(snps_num_t02)
## Dataframe of dim 68 1929 processed...
## 591 columns removed
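To see what invar_omit() is doing, you can try it on a tiny made-up dataframe in which one column never varies. This example is purely illustrative and not part of the original script.

# toy dataframe: snp2 is invariant (sd = 0), so it should be dropped
toy <- data.frame(snp1 = c(0, 1, 2, 1),
                  snp2 = c(1, 1, 1, 1),
                  snp3 = c(2, 0, NA, 1))
invar_omit(toy)  # reports 1 column removed and returns snp1 and snp3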
TODO: Loop over the columns. For each column, get the mean (ignoring NAs) and the positions of any NAs, record how many NAs there are, replace the NAs with the column mean, and then write the updated column back into the dataframe.
snps_noNAs <- snps_no_invar

N_col <- ncol(snps_no_invar)

for(i in 1:N_col){
  
  # get the current column
  column_i <- snps_noNAs[, i]
  
  # get the mean of the current column
  mean_i <- mean(column_i, na.rm = TRUE)
  
  # get the NAs in the current column
  NAs_i <- which(is.na(column_i))
  
  # record the number of NAs
  N_NAs <- length(NAs_i)
  
  # replace the NAs in the current column
  column_i[NAs_i] <- mean_i
  
  # replace the original column with the
  ## updated column
  snps_noNAs[, i] <- column_i
}
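After the loop runs, a quick check (not in the original code) confirms that every NA has been replaced by its column mean.

# optional check: no NAs should remain after mean imputation
sum(is.na(snps_no_invar))  # NAs before imputation
sum(is.na(snps_noNAs))     # should be 0 after imputation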
Save the data as a .csv file which can be loaded again later.
write.csv(snps_noNAs, file = "SNPs_cleaned.csv",
          row.names = F)
Check for the presence of the file with list.files().
list.files(pattern = ".csv")
## [1] "SNPs_cleaned.csv" "walsh2017morphology (1).csv"
## [3] "walsh2017morphology.csv"
In Part 2, we will re-load the SNPs_cleaned.csv file and carry out an analysis with PCA.