In this worked example you will replicate a PCA on a published dataset.
The example is split into 2 Parts:
In this Data Preparation phase, you will do the following things:
vcfR::read.vcfR())vcfR::extract.gt())t())for()
loop).csv file
for the next step (write.csv())This worked example is based on a paper in the journal Molecular Ecology from 2017 by Jennifer Walsh titled Subspecies delineation amid phenotypic, geographic and genetic discordance in a songbird.
The study investigated variation between two bird species in the genus Ammodramus: A. nenlsoni and A. caudacutus.
The species A. nenlsoni has been divided into 3 sub-species: A. n. nenlsoni, A.n. alterus, and A n. subvirgatus. The other species, A. caudacutus, has been divided into two subspecies, A.c. caudacutus and A.c. diversus.
The purpose of this study was to investigate to what extent these five subspecies recognized by taxonomists are supported by genetic data. The author’s collected DNA from 75 birds (15 per subspecies) and genotyped 1929 SNPs. They then analyzed the data with Principal Components Analysis (PCA), among other genetic analyzes.
This tutorial will work through all of the steps necessary to re-analyze Walsh et al.s data
In the code below all code is provided. Your tasks will be to do 2 things:
Load the vcfR and other packages with library().
library(vcfR)
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Make sure that your working directory is set to the location of the
file all_loci.vcf.
getwd()
## [1] "/Users/nithinkodali/Downloads"
list.files()
## [1] "_Deck.java"
## [2] "_tmp__lease_documents_20221003060021.pdf"
## [3] "_tmp__lease_documents_20221003060050.pdf"
## [4] "07-mean_imputation.html"
## [5] "07-mean_imputation.Rmd"
## [6] "08-PCA_worked.html"
## [7] "08-PCA_worked.Rmd"
## [8] "09-PCA_worked_example-SNPs-part1.html"
## [9] "09-PCA_worked_example-SNPs-part1.Rmd"
## [10] "10-PCA_worked_example-SNPs-part2.Rmd"
## [11] "10.11-cluster-analysis.pdf"
## [12] "1666239815.868967.jpg"
## [13] "2022 April Grad App.docx"
## [14] "2022 Summer & Fall Landmark Fundraising Schedule_as of 5.26.22.pdf"
## [15] "22.1059792-1299792.ALL.chr22_GRCh38.genotypes.20170504.vcf"
## [16] "7.11.pdf"
## [17] "770d265a-3c0b-4234-9faa-18212e9f5318.pdf"
## [18] "990-EZ_DesiDanceNetworkInc 2020.pdf"
## [19] "abc123_Assignment1_CMPINF0401.zip"
## [20] "abc123_Assignment2_CMPINF0401.zip"
## [21] "abc123_CommunityFoodVolunteeringManager"
## [22] "abc123_CommunityFoodVolunteeringManager 2"
## [23] "Act-GAL4_ Shroom Presentation.pptx"
## [24] "admin_20861-ch-4-nucleic-acids-and-the-rna-world.pdf"
## [25] "admin_22090-ch-12-solutions.pdf"
## [26] "admin_22092-ch-14-chemical-equilibrium.pdf"
## [27] "admin_22097-ch-18-thermodynamics-and-equilibrium.pdf"
## [28] "admin2_biology-12-biological-science-freeman-93-ch-15-dna-and-the-gene-synthesis-and-repair-3467.pdf"
## [29] "all_loci-1.vcf"
## [30] "all_loci.vcf"
## [31] "ALL.chr22_GRCh38.genotypes.20170504.vcf"
## [32] "ALL.chr22_GRCh38.genotypes.20170504.vcf.gz-2.download"
## [33] "allomtery_3_scatterplot3d (1).Rmd"
## [34] "â\u0080\u0094Pngtreeâ\u0080\u0094graduate class of 2022 with_6310583.png"
## [35] "Application-2022-Hybrid-Program.docx"
## [36] "Bench_E_Ashley_Adynn_Nithin_Mar_22_2022.jpg"
## [37] "Bench_E_Ashley_Adynn_Nithin_Mar_29_2022.jpg"
## [38] "bestcnn.sdf"
## [39] "BIOSC0160_R1.docx"
## [40] "bird_snps_remove_NAs.html"
## [41] "bird_snps_remove_NAs.Rmd"
## [42] "bolinger JC2 part B_edit.pptx"
## [43] "c0120_expt3_datasheets.pdf"
## [44] "c0120_expt4_datasheets.pdf"
## [45] "CA Offer Letter - Kodali.pdf"
## [46] "center_function.R"
## [47] "Chapter2.pptx"
## [48] "CHEM0110 Cheat Sheet.pdf"
## [49] "clown-39861.png"
## [50] "cluster_analysis_portfolio.Rmd"
## [51] "cluster_analysis_with_Higgs_aa.R"
## [52] "code_checkpoint_vcfR.html"
## [53] "code_checkpoint_vcfR.Rmd"
## [54] "CODE_CHECKPOINT-first_rstudio_script.R"
## [55] "combinepdf-2.pdf"
## [56] "combinepdf.pdf"
## [57] "compile_actgal4_viability_11_6_22-2.xlsx"
## [58] "Copy of TaazaLogoSmall-04.png"
## [59] "DDN 501c3 Determination Letter.pdf"
## [60] "Dead_Pupae_ShroomB.8.jpg"
## [61] "Dead_Pupae_ShroomB.8x2.jpg"
## [62] "documents.zip"
## [63] "eclipse-inst-jre-mac64.dmg"
## [64] "Exam 2 Review.pdf"
## [65] "Exam 2.pdf"
## [66] "Exam Review 2 CHEM120 Spring 2022.docx"
## [67] "Exam_1_Practice (3).pdf"
## [68] "Exam_1_Practice_KEY.pdf"
## [69] "F48B4F30-FFA2-465D-BB4C-99EA57D44EC3.HEIC"
## [70] "feature_engineering.Rmd"
## [71] "Final 2.pdf"
## [72] "FInal Bio Lab 1 Presentation.pptx"
## [73] "Final.pdf"
## [74] "Flyer Eyewear Discount.pdf"
## [75] "flyer.pdf"
## [76] "Giant Eagle Proposal Management System.pdf"
## [77] "gnina"
## [78] "GPA-CALCULATOR.xls"
## [79] "Graph.java"
## [80] "Hardwig(1).pdf"
## [81] "HashCodeTester.java"
## [82] "hildebrand soriano assertion evidence templates f22.pdf"
## [83] "hildebrand soriano assertion evidence templates f22.pptx"
## [84] "Hoffman"
## [85] "Hoffman.zip"
## [86] "HPS 0612 essay topics 2.23.22.docx"
## [87] "I-9-Acceptable-Documents-List.pdf"
## [88] "IMG_0013.PNG"
## [89] "IMG_3721.jpg"
## [90] "IMG_4537.HEIC"
## [91] "IMG_4543.HEIC"
## [92] "IMG_4544.HEIC"
## [93] "IMG_4545.HEIC"
## [94] "IMG_4546.HEIC"
## [95] "IMG_4547.HEIC"
## [96] "IMG_4554 2.jpg"
## [97] "IMG_4554.jpg"
## [98] "IMG_4555.PNG"
## [99] "IMG_4758.HEIC"
## [100] "IMG_4807.HEIC"
## [101] "IMG_4809.HEIC"
## [102] "IMG_4811.HEIC"
## [103] "IMG_6020.heic"
## [104] "IMG_6081.HEIC"
## [105] "IMG_6145.HEIC"
## [106] "IMG_6149.HEIC"
## [107] "IMG_6156.JPG"
## [108] "IMG_6162.JPG"
## [109] "IMG_6165.JPG"
## [110] "IMG_6189.JPG"
## [111] "IMG_6460.JPG"
## [112] "IMG_6477 2.JPG"
## [113] "IMG_6477.JPG"
## [114] "IMG_6612.JPG"
## [115] "IMG_AE7E91570F83-1.jpeg"
## [116] "IMG_C50E5F1B5F81-1.jpeg"
## [117] "Imported File-1.pdf"
## [118] "Imported File.pdf"
## [119] "Intro (1).pptx"
## [120] "Intro (2).pptx"
## [121] "Intro-2.pptx"
## [122] "Intro.pptx"
## [123] "Invoice - 0003.pdf"
## [124] "ja4052153_si_001.pdf"
## [125] "jab464_CommunityFoodVolunteeringManager"
## [126] "jab464_CommunityFoodVolunteeringManager.zip"
## [127] "jc2 bolinger et al f22_Martin edit.pptx"
## [128] "jm100112j.pdf"
## [129] "Journal Club1_DWS_SP22-2.pptx"
## [130] "Journal Club1_DWS_SP22.pptx"
## [131] "Journal club2_Hitsman_sp22_post.pptx"
## [132] "last slide lecture 3.docx"
## [133] "Late Withdrawal Form.pdf"
## [134] "LinkedList-2.java"
## [135] "LinkedList.java"
## [136] "LUX_2022_AE_Guide_PT_09_28_21.pdf"
## [137] "LuxotticaDigitalEmployeeGuide_09102018.pdf"
## [138] "Macro final.pdf"
## [139] "MALWARE_BYTES-setup-1.80.2.1012.exe"
## [140] "MATLAB_Book_2E_Solution_Manual.doc"
## [141] "micrometer1.25.jpg"
## [142] "micrometer2.5.jpg"
## [143] "Mid_term_2.pdf"
## [144] "Midterm Questions - Mind and Medicine Spring 2022.docx"
## [145] "midterm_exam_1.pdf"
## [146] "Mind and Med First Paper HW.pdf"
## [147] "Mini exam 1.pdf"
## [148] "Movie revisions.docx"
## [149] "MyHashCode.java"
## [150] "nk fine slip1.pdf"
## [151] "Note Jun 21, 2022.pdf"
## [152] "Oct25.docx"
## [153] "Oct25.Rmd"
## [154] "Online Survey Software _ Qualtrics Survey Solutions.pdf"
## [155] "Patel (A.4) & Kodali (B.8) Pyramid Poster Planning.pptx"
## [156] "Patel (A.4) & Kodali (B.8)- FINAL SLIDES.pdf"
## [157] "Patel (A.4) & Kodali (B.8)- FINAL SLIDES.pptx"
## [158] "Patel + Kodali Slides.pptx"
## [159] "PCR Prelab Homework_3-10-22 EDITED.pptx"
## [160] "Peer_Presentation_Evaluation_Duckweed_SP22_nk.docx"
## [161] "PHOTO-2022-05-24-14-29-48.jpg"
## [162] "Pollutant Research.pptx"
## [163] "pollutant websearch_1.7.22-2.docx"
## [164] "pollutant websearch_1.7.22.docx"
## [165] "portfolio_ggpubr_intro-2.Rmd"
## [166] "portfolio_ggpubr_log_transformation.Rmd"
## [167] "Presentation+planning+DWS+sp22%281%29 copy.docx"
## [168] "Quantification_graph.png"
## [169] "Question 1.pdf"
## [170] "Question 2.pdf"
## [171] "R2_DNARep.docx"
## [172] "readingVcf.html"
## [173] "readingVcf.Rmd"
## [174] "Rec 1 micro.pdf"
## [175] "Rec 2 micro.pdf"
## [176] "Rec 3.pdf"
## [177] "Rec 6.pdf"
## [178] "Recitation feb 21.pdf"
## [179] "removing_fixed_alleles.html"
## [180] "removing_fixed_alleles.Rmd"
## [181] "Resume Template.docx"
## [182] "rsconnect"
## [183] "RStudio-2022.07.1-554.dmg"
## [184] "rstudio-ide.pdf"
## [185] "Scanned Document 20.pdf"
## [186] "Screen Shot 2022-01-19 at 5.18.58 PM.png"
## [187] "Screen%20Shot%202022-01-10%20at%208.47.26%20PM.png"
## [188] "Screen%20Shot%202022-03-22%20at%2011.25.02%20PM.png"
## [189] "sept 19 2022 final slide.docx"
## [190] "SNPs_cleaned.csv"
## [191] "Sponsorship Packet 4.0.docx.pdf"
## [192] "SS Econ Exam 1.pdf"
## [193] "Student_SP22anaylzing_course_population_2_20.pptx"
## [194] "Subject.png"
## [195] "T185_WIWNINLM66_rbcLF_IS-rbcL_forward.ab1"
## [196] "Taaza Fiscal Sponsorship Agreement .docx.pdf"
## [197] "Target_GiftCard-Request-Form.pdf"
## [198] "Teamwork semester evaluation_Duckweed_SP22_nk.docx"
## [199] "traj_0.pdb"
## [200] "transaction_history-2.csv"
## [201] "transaction_history.csv"
## [202] "transpose_VCF_data.html"
## [203] "transpose_VCF_data.Rmd"
## [204] "Tree.pdf"
## [205] "Untitled design-2.jpg"
## [206] "Untitled design-3.jpg"
## [207] "Untitled design.jpg"
## [208] "UPMC CFEI Sponsorship Application 2021.docx"
## [209] "vcfR_test.vcf"
## [210] "vcfR_test.vcf.gz"
## [211] "vegan_PCA_amino_acids-STUDENT.html"
## [212] "vegan_PCA_amino_acids-STUDENT.Rmd"
## [213] "vegan_pca_with_msleep-STUDENT.Rmd"
## [214] "viability_shroom_f22.xlsx"
## [215] "Vial1_parents.JPG"
## [216] "VitalSource-Bookshelf_10.1.1.1411.dmg"
## [217] "walsh2017morphology.csv"
## [218] "Warehouse_Donation_Request_090214.pdf"
## [219] "WDTS Offers Accepted_SULI - Fall FY22.xlsx"
## [220] "week08_cluster_analysis-1.pdf"
## [221] "WIWNINLM66_2_NaCl_D14_021522_NSK.jpg"
## [222] "WIWNINLM66_Al2_D14_021522_EW.jpg"
## [223] "WIWNINLM66_CuSO4_D14_021522_SG_1.jpg"
## [224] "WIWNINLM66_Fe1_D14_021522_RV.jpg"
## [225] "WIWNINLM66_Mn_D14_021522_NK_2.jpg"
## [226] "WIWNINLM66_Mn_D14_021522_NK_Cont.jpg"
## [227] "XC1.java"
## [228] "Zoom.pkg.download"
list.files(pattern = "vcf")
## [1] "22.1059792-1299792.ALL.chr22_GRCh38.genotypes.20170504.vcf"
## [2] "all_loci-1.vcf"
## [3] "all_loci.vcf"
## [4] "ALL.chr22_GRCh38.genotypes.20170504.vcf"
## [5] "ALL.chr22_GRCh38.genotypes.20170504.vcf.gz-2.download"
## [6] "code_checkpoint_vcfR.html"
## [7] "code_checkpoint_vcfR.Rmd"
## [8] "vcfR_test.vcf"
## [9] "vcfR_test.vcf.gz"
read the Vcf file into the page
snps <- vcfR::read.vcfR("all_loci.vcf", convertNA = TRUE)
## Scanning file to determine attributes.
## File attributes:
## meta lines: 8
## header_line: 9
## variant count: 1929
## column count: 81
##
Meta line 8 read in.
## All meta lines processed.
## gt matrix initialized.
## Character matrix gt created.
## Character matrix gt rows: 1929
## Character matrix gt cols: 81
## skip: 0
## nrows: 1929
## row_num: 0
##
Processed variant 1000
Processed variant: 1929
## All variants processed
Use vcfR::extract.gt() to get the genotype scores.
snps_num <- vcfR::extract.gt(snps,
element = "GT",
IDtoRowNames = F,
as.numeric = T,
convertNA = T,
return.alleles = F)
Transpose the data with t() so that it has the proper orientation.
snps_num_t <- t(snps_num)
convert the matrix to a data frame
snps_num_df <- data.frame(snps_num_t)
create a function that finds Na’s throughout the data
find_NAs <- function(x){
NAs_TF <- is.na(x)
i_NA <- which(NAs_TF == TRUE)
N_NA <- length(i_NA)
cat("Results:",N_NA, "NAs present\n.")
return(i_NA)
}
create a for loop that will determine NA’s throughout each row of data
# N_rows
# number of rows (individuals)
N_rows <- nrow(snps_num_t)
# N_NA
# vector to hold output (number of NAs)
N_NA <- rep(x = 0, times = N_rows)
# N_SNPs
# total number of columns (SNPs)
N_SNPs <- ncol(snps_num_t)
# the for() loop
for(i in 1:N_rows){
# for each row, find the location of
## NAs with snps_num_t()
i_NA <- find_NAs(snps_num_t[i,])
# then determine how many NAs
## with length()
N_NA_i <- length(i_NA)
# then save the output to
## our storage vector
N_NA[i] <- N_NA_i
}
## Results: 28 NAs present
## .Results: 20 NAs present
## .Results: 28 NAs present
## .Results: 24 NAs present
## .Results: 23 NAs present
## .Results: 63 NAs present
## .Results: 51 NAs present
## .Results: 38 NAs present
## .Results: 34 NAs present
## .Results: 24 NAs present
## .Results: 48 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 78 NAs present
## .Results: 45 NAs present
## .Results: 21 NAs present
## .Results: 42 NAs present
## .Results: 34 NAs present
## .Results: 66 NAs present
## .Results: 54 NAs present
## .Results: 59 NAs present
## .Results: 52 NAs present
## .Results: 47 NAs present
## .Results: 31 NAs present
## .Results: 63 NAs present
## .Results: 40 NAs present
## .Results: 40 NAs present
## .Results: 22 NAs present
## .Results: 60 NAs present
## .Results: 48 NAs present
## .Results: 961 NAs present
## .Results: 478 NAs present
## .Results: 59 NAs present
## .Results: 26 NAs present
## .Results: 285 NAs present
## .Results: 409 NAs present
## .Results: 1140 NAs present
## .Results: 600 NAs present
## .Results: 1905 NAs present
## .Results: 25 NAs present
## .Results: 1247 NAs present
## .Results: 23 NAs present
## .Results: 750 NAs present
## .Results: 179 NAs present
## .Results: 433 NAs present
## .Results: 123 NAs present
## .Results: 65 NAs present
## .Results: 49 NAs present
## .Results: 192 NAs present
## .Results: 433 NAs present
## .Results: 66 NAs present
## .Results: 597 NAs present
## .Results: 1891 NAs present
## .Results: 207 NAs present
## .Results: 41 NAs present
## .Results: 268 NAs present
## .Results: 43 NAs present
## .Results: 110 NAs present
## .Results: 130 NAs present
## .Results: 90 NAs present
## .Results: 271 NAs present
## .Results: 92 NAs present
## .Results: 103 NAs present
## .Results: 175 NAs present
## .Results: 31 NAs present
## .Results: 66 NAs present
## .Results: 64 NAs present
## .Results: 400 NAs present
## .Results: 192 NAs present
## .Results: 251 NAs present
## .Results: 69 NAs present
## .Results: 58 NAs present
## .
Plot a histogram of NA’s and determine which data has over 50% NA’s
# 50% of N_SNPs
cutoff50 <- N_SNPs*0.5
hist(N_NA)
abline(v = cutoff50,
col = 2,
lwd = 2,
lty = 2)
Convert NA’s into a percent and remove rows with greater than 50% NA’s
percent_NA <- N_NA/N_SNPs*100
# Call which() on percent_NA
i_NA_50percent <- which(percent_NA > 50)
snps_num_t02 <- snps_num_t[-i_NA_50percent, ]
use gsub() to remove irrelevant information from the row names and use table() to view the result.
row_names <- row.names(snps_num_t02) # Key
row_names02 <- gsub("sample_","",row_names)
sample_id <- gsub("^([ATCG]*)(_)(.*)",
"\\3",
row_names02)
pop_id <- gsub("[01-9]*",
"",
sample_id)
table(pop_id)
## pop_id
## Alt Cau Div Nel Sub
## 15 12 15 15 11
remove data where the sd = 0, and omit them because they can’t be scaled
invar_omit <- function(x){
cat("Dataframe of dim",dim(x), "processed...\n")
sds <- apply(x, 2, sd, na.rm = TRUE)
i_var0 <- which(sds == 0)
cat(length(i_var0),"columns removed\n")
if(length(i_var0) > 0){
x <- x[, -i_var0]
}
## add return() with x in it
return(x)
}
snps_no_invar <- invar_omit(snps_num_t02)
## Dataframe of dim 68 1929 processed...
## 591 columns removed
Use mean imputation to replace NA’s with the mean of the entire column
snps_noNAs <- snps_no_invar
N_col <- ncol(snps_no_invar)
for(i in 1:N_col){
# get the current column
column_i <- snps_noNAs[, i]
# get the mean of the current column
mean_i <- mean(column_i, na.rm = TRUE)
# get the NAs in the current column
NAs_i <- which(is.na(column_i))
# record the number of NAs
N_NAs <- length(NAs_i)
# replace the NAs in the current column
column_i[NAs_i] <- mean_i
# replace the original column with the
## updated columns
snps_noNAs[, i] <- column_i
}
Save the data as a .csv file which can be loaded again later.
write.csv(snps_noNAs, file = "SNPs_cleaned.csv",
row.names = F)
Check for the presence of the file with list.files()
list.files(pattern = ".csv")
## [1] "SNPs_cleaned.csv" "transaction_history-2.csv"
## [3] "transaction_history.csv" "walsh2017morphology.csv"
In Part 2, we will re-load the SNPs_cleaned.csv file and
carry an an analysis with PCA.