The example is split into 2 Parts:
Part 1 must be completed first to create a file,
SNPs_cleaned.csv, that has been completely prepared for
analysis.
Now in Part 2, you will analyze the data with PCA. The steps here will be:
scale())prcomp())screeplot())vegan::scores())In the code below all code is provided. Your tasks will be to do 4 things:
Load the vcfR package with library()
library(vcfR) # KEY
##
## ***** *** vcfR *** *****
## This is vcfR 1.13.0
## browseVignettes('vcfR') # Documentation
## citation('vcfR') # Citation
## ***** ***** ***** *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)
Set the working directory
getwd()
## [1] "/Users/nithinkodali/Downloads"
list.files()
## [1] "_Deck.java"
## [2] "_tmp__lease_documents_20221003060021.pdf"
## [3] "_tmp__lease_documents_20221003060050.pdf"
## [4] "07-mean_imputation.html"
## [5] "07-mean_imputation.Rmd"
## [6] "08-PCA_worked.html"
## [7] "08-PCA_worked.Rmd"
## [8] "09-PCA_worked_example-SNPs-part1.html"
## [9] "09-PCA_worked_example-SNPs-part1.Rmd"
## [10] "10-PCA_worked_example-SNPs-part2.Rmd"
## [11] "10.11-cluster-analysis.pdf"
## [12] "1666239815.868967.jpg"
## [13] "2022 April Grad App.docx"
## [14] "2022 Summer & Fall Landmark Fundraising Schedule_as of 5.26.22.pdf"
## [15] "22.1059792-1299792.ALL.chr22_GRCh38.genotypes.20170504.vcf"
## [16] "7.11.pdf"
## [17] "770d265a-3c0b-4234-9faa-18212e9f5318.pdf"
## [18] "990-EZ_DesiDanceNetworkInc 2020.pdf"
## [19] "abc123_Assignment1_CMPINF0401.zip"
## [20] "abc123_Assignment2_CMPINF0401.zip"
## [21] "abc123_CommunityFoodVolunteeringManager"
## [22] "abc123_CommunityFoodVolunteeringManager 2"
## [23] "Act-GAL4_ Shroom Presentation.pptx"
## [24] "admin_20861-ch-4-nucleic-acids-and-the-rna-world.pdf"
## [25] "admin_22090-ch-12-solutions.pdf"
## [26] "admin_22092-ch-14-chemical-equilibrium.pdf"
## [27] "admin_22097-ch-18-thermodynamics-and-equilibrium.pdf"
## [28] "admin2_biology-12-biological-science-freeman-93-ch-15-dna-and-the-gene-synthesis-and-repair-3467.pdf"
## [29] "all_loci-1.vcf"
## [30] "all_loci.vcf"
## [31] "ALL.chr22_GRCh38.genotypes.20170504.vcf"
## [32] "ALL.chr22_GRCh38.genotypes.20170504.vcf.gz-2.download"
## [33] "allomtery_3_scatterplot3d (1).Rmd"
## [34] "â\u0080\u0094Pngtreeâ\u0080\u0094graduate class of 2022 with_6310583.png"
## [35] "Application-2022-Hybrid-Program.docx"
## [36] "Bench_E_Ashley_Adynn_Nithin_Mar_22_2022.jpg"
## [37] "Bench_E_Ashley_Adynn_Nithin_Mar_29_2022.jpg"
## [38] "bestcnn.sdf"
## [39] "BIOSC0160_R1.docx"
## [40] "bird_snps_remove_NAs.html"
## [41] "bird_snps_remove_NAs.Rmd"
## [42] "bolinger JC2 part B_edit.pptx"
## [43] "c0120_expt3_datasheets.pdf"
## [44] "c0120_expt4_datasheets.pdf"
## [45] "CA Offer Letter - Kodali.pdf"
## [46] "center_function.R"
## [47] "Chapter2.pptx"
## [48] "CHEM0110 Cheat Sheet.pdf"
## [49] "clown-39861.png"
## [50] "cluster_analysis_portfolio.Rmd"
## [51] "cluster_analysis_with_Higgs_aa.R"
## [52] "code_checkpoint_vcfR.html"
## [53] "code_checkpoint_vcfR.Rmd"
## [54] "CODE_CHECKPOINT-first_rstudio_script.R"
## [55] "combinepdf-2.pdf"
## [56] "combinepdf.pdf"
## [57] "compile_actgal4_viability_11_6_22-2.xlsx"
## [58] "Copy of TaazaLogoSmall-04.png"
## [59] "DDN 501c3 Determination Letter.pdf"
## [60] "Dead_Pupae_ShroomB.8.jpg"
## [61] "Dead_Pupae_ShroomB.8x2.jpg"
## [62] "documents.zip"
## [63] "eclipse-inst-jre-mac64.dmg"
## [64] "Exam 2 Review.pdf"
## [65] "Exam 2.pdf"
## [66] "Exam Review 2 CHEM120 Spring 2022.docx"
## [67] "Exam_1_Practice (3).pdf"
## [68] "Exam_1_Practice_KEY.pdf"
## [69] "F48B4F30-FFA2-465D-BB4C-99EA57D44EC3.HEIC"
## [70] "feature_engineering.Rmd"
## [71] "Final 2.pdf"
## [72] "FInal Bio Lab 1 Presentation.pptx"
## [73] "Final.pdf"
## [74] "Flyer Eyewear Discount.pdf"
## [75] "flyer.pdf"
## [76] "Giant Eagle Proposal Management System.pdf"
## [77] "gnina"
## [78] "GPA-CALCULATOR.xls"
## [79] "Graph.java"
## [80] "Hardwig(1).pdf"
## [81] "HashCodeTester.java"
## [82] "hildebrand soriano assertion evidence templates f22.pdf"
## [83] "hildebrand soriano assertion evidence templates f22.pptx"
## [84] "Hoffman"
## [85] "Hoffman.zip"
## [86] "HPS 0612 essay topics 2.23.22.docx"
## [87] "I-9-Acceptable-Documents-List.pdf"
## [88] "IMG_0013.PNG"
## [89] "IMG_3721.jpg"
## [90] "IMG_4537.HEIC"
## [91] "IMG_4543.HEIC"
## [92] "IMG_4544.HEIC"
## [93] "IMG_4545.HEIC"
## [94] "IMG_4546.HEIC"
## [95] "IMG_4547.HEIC"
## [96] "IMG_4554 2.jpg"
## [97] "IMG_4554.jpg"
## [98] "IMG_4555.PNG"
## [99] "IMG_4758.HEIC"
## [100] "IMG_4807.HEIC"
## [101] "IMG_4809.HEIC"
## [102] "IMG_4811.HEIC"
## [103] "IMG_6020.heic"
## [104] "IMG_6081.HEIC"
## [105] "IMG_6145.HEIC"
## [106] "IMG_6149.HEIC"
## [107] "IMG_6156.JPG"
## [108] "IMG_6162.JPG"
## [109] "IMG_6165.JPG"
## [110] "IMG_6189.JPG"
## [111] "IMG_6460.JPG"
## [112] "IMG_6477 2.JPG"
## [113] "IMG_6477.JPG"
## [114] "IMG_6612.JPG"
## [115] "IMG_AE7E91570F83-1.jpeg"
## [116] "IMG_C50E5F1B5F81-1.jpeg"
## [117] "Imported File-1.pdf"
## [118] "Imported File.pdf"
## [119] "Intro (1).pptx"
## [120] "Intro (2).pptx"
## [121] "Intro-2.pptx"
## [122] "Intro.pptx"
## [123] "Invoice - 0003.pdf"
## [124] "ja4052153_si_001.pdf"
## [125] "jab464_CommunityFoodVolunteeringManager"
## [126] "jab464_CommunityFoodVolunteeringManager.zip"
## [127] "jc2 bolinger et al f22_Martin edit.pptx"
## [128] "jm100112j.pdf"
## [129] "Journal Club1_DWS_SP22-2.pptx"
## [130] "Journal Club1_DWS_SP22.pptx"
## [131] "Journal club2_Hitsman_sp22_post.pptx"
## [132] "last slide lecture 3.docx"
## [133] "Late Withdrawal Form.pdf"
## [134] "LinkedList-2.java"
## [135] "LinkedList.java"
## [136] "LUX_2022_AE_Guide_PT_09_28_21.pdf"
## [137] "LuxotticaDigitalEmployeeGuide_09102018.pdf"
## [138] "Macro final.pdf"
## [139] "MALWARE_BYTES-setup-1.80.2.1012.exe"
## [140] "MATLAB_Book_2E_Solution_Manual.doc"
## [141] "micrometer1.25.jpg"
## [142] "micrometer2.5.jpg"
## [143] "Mid_term_2.pdf"
## [144] "Midterm Questions - Mind and Medicine Spring 2022.docx"
## [145] "midterm_exam_1.pdf"
## [146] "Mind and Med First Paper HW.pdf"
## [147] "Mini exam 1.pdf"
## [148] "Movie revisions.docx"
## [149] "MyHashCode.java"
## [150] "nk fine slip1.pdf"
## [151] "Note Jun 21, 2022.pdf"
## [152] "Oct25.docx"
## [153] "Oct25.Rmd"
## [154] "Online Survey Software _ Qualtrics Survey Solutions.pdf"
## [155] "Patel (A.4) & Kodali (B.8) Pyramid Poster Planning.pptx"
## [156] "Patel (A.4) & Kodali (B.8)- FINAL SLIDES.pdf"
## [157] "Patel (A.4) & Kodali (B.8)- FINAL SLIDES.pptx"
## [158] "Patel + Kodali Slides.pptx"
## [159] "PCR Prelab Homework_3-10-22 EDITED.pptx"
## [160] "Peer_Presentation_Evaluation_Duckweed_SP22_nk.docx"
## [161] "PHOTO-2022-05-24-14-29-48.jpg"
## [162] "Pollutant Research.pptx"
## [163] "pollutant websearch_1.7.22-2.docx"
## [164] "pollutant websearch_1.7.22.docx"
## [165] "portfolio_ggpubr_intro-2.Rmd"
## [166] "portfolio_ggpubr_log_transformation.Rmd"
## [167] "Presentation+planning+DWS+sp22%281%29 copy.docx"
## [168] "Quantification_graph.png"
## [169] "Question 1.pdf"
## [170] "Question 2.pdf"
## [171] "R2_DNARep.docx"
## [172] "readingVcf.html"
## [173] "readingVcf.Rmd"
## [174] "Rec 1 micro.pdf"
## [175] "Rec 2 micro.pdf"
## [176] "Rec 3.pdf"
## [177] "Rec 6.pdf"
## [178] "Recitation feb 21.pdf"
## [179] "removing_fixed_alleles.html"
## [180] "removing_fixed_alleles.Rmd"
## [181] "Resume Template.docx"
## [182] "rsconnect"
## [183] "RStudio-2022.07.1-554.dmg"
## [184] "rstudio-ide.pdf"
## [185] "Scanned Document 20.pdf"
## [186] "Screen Shot 2022-01-19 at 5.18.58 PM.png"
## [187] "Screen%20Shot%202022-01-10%20at%208.47.26%20PM.png"
## [188] "Screen%20Shot%202022-03-22%20at%2011.25.02%20PM.png"
## [189] "sept 19 2022 final slide.docx"
## [190] "SNPs_cleaned.csv"
## [191] "Sponsorship Packet 4.0.docx.pdf"
## [192] "SS Econ Exam 1.pdf"
## [193] "Student_SP22anaylzing_course_population_2_20.pptx"
## [194] "Subject.png"
## [195] "T185_WIWNINLM66_rbcLF_IS-rbcL_forward.ab1"
## [196] "Taaza Fiscal Sponsorship Agreement .docx.pdf"
## [197] "Target_GiftCard-Request-Form.pdf"
## [198] "Teamwork semester evaluation_Duckweed_SP22_nk.docx"
## [199] "traj_0.pdb"
## [200] "transaction_history-2.csv"
## [201] "transaction_history.csv"
## [202] "transpose_VCF_data.html"
## [203] "transpose_VCF_data.Rmd"
## [204] "Tree.pdf"
## [205] "Untitled design-2.jpg"
## [206] "Untitled design-3.jpg"
## [207] "Untitled design.jpg"
## [208] "UPMC CFEI Sponsorship Application 2021.docx"
## [209] "vcfR_test.vcf"
## [210] "vcfR_test.vcf.gz"
## [211] "vegan_PCA_amino_acids-STUDENT.html"
## [212] "vegan_PCA_amino_acids-STUDENT.Rmd"
## [213] "vegan_pca_with_msleep-STUDENT.Rmd"
## [214] "viability_shroom_f22.xlsx"
## [215] "Vial1_parents.JPG"
## [216] "VitalSource-Bookshelf_10.1.1.1411.dmg"
## [217] "walsh2017morphology.csv"
## [218] "Warehouse_Donation_Request_090214.pdf"
## [219] "WDTS Offers Accepted_SULI - Fall FY22.xlsx"
## [220] "week08_cluster_analysis-1.pdf"
## [221] "WIWNINLM66_2_NaCl_D14_021522_NSK.jpg"
## [222] "WIWNINLM66_Al2_D14_021522_EW.jpg"
## [223] "WIWNINLM66_CuSO4_D14_021522_SG_1.jpg"
## [224] "WIWNINLM66_Fe1_D14_021522_RV.jpg"
## [225] "WIWNINLM66_Mn_D14_021522_NK_2.jpg"
## [226] "WIWNINLM66_Mn_D14_021522_NK_Cont.jpg"
## [227] "XC1.java"
## [228] "Zoom.pkg.download"
list.files(pattern = "vcf")
## [1] "22.1059792-1299792.ALL.chr22_GRCh38.genotypes.20170504.vcf"
## [2] "all_loci-1.vcf"
## [3] "all_loci.vcf"
## [4] "ALL.chr22_GRCh38.genotypes.20170504.vcf"
## [5] "ALL.chr22_GRCh38.genotypes.20170504.vcf.gz-2.download"
## [6] "code_checkpoint_vcfR.html"
## [7] "code_checkpoint_vcfR.Rmd"
## [8] "vcfR_test.vcf"
## [9] "vcfR_test.vcf.gz"
Load the data
SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")
warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.
We always have to scale data when doing PCA
SNPs_scaled <- scale(SNPs_cleaned)
use prcomp to run the PCA test to transform the data into fewer dimensions
pca_scaled <- prcomp(SNPs_scaled)
create a screeplot to determine how many dimensions to use
screeplot(pca_scaled,
ylab = "Relative importance",
main = "PCA test")
We see that PC1 is the most important as it had the highest percentage compared to all the other PCs. It would be worth using both PC1 and PC2 as the dimensions to compare results.
use summary to determine what PC to use
summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
var_explained <- pca_summary$importance[2,1:PCs]*100
var_explained <- round(var_explained,1)
return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
main = "Percent variation Scree plot",
ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)
we see that all PCs 1-10 are all above the threshold are all above the threshold of variation and would be considered important
use biplot() to create a biplot to futher analyze the data
biplot(pca_scaled)
TODO: it is very messy and hard to understand
Obtaine PCA scores in order to remove any repetitions in the data
pca_scores <- vegan::scores(pca_scaled)
PCA can only process numeric data, so we should only add the label after the scores are obtained.
pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")
Organize the PCA data with categorical labels
pca_scores2 <- data.frame(pop_id,
pca_scores)
plot the data and organize it by shape and coloe
ggpubr::ggscatter(data = pca_scores2,
y = "PC2",
x = "PC1",
color = "pop_id",
shape = "pop_id",
xlab = "PC1 (20% variation)",
ylab = "PC2 (2% variation)",
main = "Species Variation")
PC1 accounts for 20% of the variation while PC2 accounts for 2% of the variation. There are 2 distinct clusters. Alt, Nel, and Sub on PC1 while PC2 sees Cau and Dv.