Introduction

The example is split into 2 Parts:

Part 1 must be completed first to create a file, SNPs_cleaned.csv, that has been completely prepared for analysis.

Now in Part 2, you will analyze the data with PCA. The steps here will be:

  1. Center the data (scale())
  2. Run a PCA analysis (prcomp())
  3. Evaluate the scree plot from the PCA (screeplot())
  4. Evaluate the amount of variation explained by the first 2 PCs.
  5. Extract the PCA scores for plotting (vegan::scores())
  6. Plot the data

Tasks

In the code below all code is provided. Your tasks will be to do 4 things:

  1. Give a meaningful title to all sections marked “TODO: TITLE”
  2. Write 1 to 2 sentences describing what is being done and why in all sections marked “TODO: EXPLAIN”
  3. Add titles and axes to plots in all sections marked “TODO: UPDATE PLOT”
  4. Write 1 or 2 sentences interpreting the output from R in all sections marked “TODO: INTERPRET”

Preliminaries

Load the vcfR package with library()

library(vcfR) # KEY
## 
##    *****       ***   vcfR   ***       *****
##    This is vcfR 1.13.0 
##      browseVignettes('vcfR') # Documentation
##      citation('vcfR') # Citation
##    *****       *****      *****       *****
library(vegan)
## Loading required package: permute
## Loading required package: lattice
## This is vegan 2.6-4
library(ggplot2)
library(ggpubr)

Set the working directory

getwd()
## [1] "/Users/nithinkodali/Downloads"
list.files()
##   [1] "_Deck.java"                                                                                          
##   [2] "_tmp__lease_documents_20221003060021.pdf"                                                            
##   [3] "_tmp__lease_documents_20221003060050.pdf"                                                            
##   [4] "07-mean_imputation.html"                                                                             
##   [5] "07-mean_imputation.Rmd"                                                                              
##   [6] "08-PCA_worked.html"                                                                                  
##   [7] "08-PCA_worked.Rmd"                                                                                   
##   [8] "09-PCA_worked_example-SNPs-part1.html"                                                               
##   [9] "09-PCA_worked_example-SNPs-part1.Rmd"                                                                
##  [10] "10-PCA_worked_example-SNPs-part2.Rmd"                                                                
##  [11] "10.11-cluster-analysis.pdf"                                                                          
##  [12] "1666239815.868967.jpg"                                                                               
##  [13] "2022 April Grad App.docx"                                                                            
##  [14] "2022 Summer & Fall Landmark Fundraising Schedule_as of 5.26.22.pdf"                                  
##  [15] "22.1059792-1299792.ALL.chr22_GRCh38.genotypes.20170504.vcf"                                          
##  [16] "7.11.pdf"                                                                                            
##  [17] "770d265a-3c0b-4234-9faa-18212e9f5318.pdf"                                                            
##  [18] "990-EZ_DesiDanceNetworkInc 2020.pdf"                                                                 
##  [19] "abc123_Assignment1_CMPINF0401.zip"                                                                   
##  [20] "abc123_Assignment2_CMPINF0401.zip"                                                                   
##  [21] "abc123_CommunityFoodVolunteeringManager"                                                             
##  [22] "abc123_CommunityFoodVolunteeringManager 2"                                                           
##  [23] "Act-GAL4_ Shroom Presentation.pptx"                                                                  
##  [24] "admin_20861-ch-4-nucleic-acids-and-the-rna-world.pdf"                                                
##  [25] "admin_22090-ch-12-solutions.pdf"                                                                     
##  [26] "admin_22092-ch-14-chemical-equilibrium.pdf"                                                          
##  [27] "admin_22097-ch-18-thermodynamics-and-equilibrium.pdf"                                                
##  [28] "admin2_biology-12-biological-science-freeman-93-ch-15-dna-and-the-gene-synthesis-and-repair-3467.pdf"
##  [29] "all_loci-1.vcf"                                                                                      
##  [30] "all_loci.vcf"                                                                                        
##  [31] "ALL.chr22_GRCh38.genotypes.20170504.vcf"                                                             
##  [32] "ALL.chr22_GRCh38.genotypes.20170504.vcf.gz-2.download"                                               
##  [33] "allomtery_3_scatterplot3d (1).Rmd"                                                                   
##  [34] "â\u0080\u0094Pngtreeâ\u0080\u0094graduate class of 2022 with_6310583.png"                            
##  [35] "Application-2022-Hybrid-Program.docx"                                                                
##  [36] "Bench_E_Ashley_Adynn_Nithin_Mar_22_2022.jpg"                                                         
##  [37] "Bench_E_Ashley_Adynn_Nithin_Mar_29_2022.jpg"                                                         
##  [38] "bestcnn.sdf"                                                                                         
##  [39] "BIOSC0160_R1.docx"                                                                                   
##  [40] "bird_snps_remove_NAs.html"                                                                           
##  [41] "bird_snps_remove_NAs.Rmd"                                                                            
##  [42] "bolinger JC2 part B_edit.pptx"                                                                       
##  [43] "c0120_expt3_datasheets.pdf"                                                                          
##  [44] "c0120_expt4_datasheets.pdf"                                                                          
##  [45] "CA Offer Letter - Kodali.pdf"                                                                        
##  [46] "center_function.R"                                                                                   
##  [47] "Chapter2.pptx"                                                                                       
##  [48] "CHEM0110 Cheat Sheet.pdf"                                                                            
##  [49] "clown-39861.png"                                                                                     
##  [50] "cluster_analysis_portfolio.Rmd"                                                                      
##  [51] "cluster_analysis_with_Higgs_aa.R"                                                                    
##  [52] "code_checkpoint_vcfR.html"                                                                           
##  [53] "code_checkpoint_vcfR.Rmd"                                                                            
##  [54] "CODE_CHECKPOINT-first_rstudio_script.R"                                                              
##  [55] "combinepdf-2.pdf"                                                                                    
##  [56] "combinepdf.pdf"                                                                                      
##  [57] "compile_actgal4_viability_11_6_22-2.xlsx"                                                            
##  [58] "Copy of TaazaLogoSmall-04.png"                                                                       
##  [59] "DDN 501c3 Determination Letter.pdf"                                                                  
##  [60] "Dead_Pupae_ShroomB.8.jpg"                                                                            
##  [61] "Dead_Pupae_ShroomB.8x2.jpg"                                                                          
##  [62] "documents.zip"                                                                                       
##  [63] "eclipse-inst-jre-mac64.dmg"                                                                          
##  [64] "Exam 2 Review.pdf"                                                                                   
##  [65] "Exam 2.pdf"                                                                                          
##  [66] "Exam Review  2 CHEM120 Spring 2022.docx"                                                             
##  [67] "Exam_1_Practice (3).pdf"                                                                             
##  [68] "Exam_1_Practice_KEY.pdf"                                                                             
##  [69] "F48B4F30-FFA2-465D-BB4C-99EA57D44EC3.HEIC"                                                           
##  [70] "feature_engineering.Rmd"                                                                             
##  [71] "Final 2.pdf"                                                                                         
##  [72] "FInal Bio Lab 1 Presentation.pptx"                                                                   
##  [73] "Final.pdf"                                                                                           
##  [74] "Flyer Eyewear Discount.pdf"                                                                          
##  [75] "flyer.pdf"                                                                                           
##  [76] "Giant Eagle Proposal Management System.pdf"                                                          
##  [77] "gnina"                                                                                               
##  [78] "GPA-CALCULATOR.xls"                                                                                  
##  [79] "Graph.java"                                                                                          
##  [80] "Hardwig(1).pdf"                                                                                      
##  [81] "HashCodeTester.java"                                                                                 
##  [82] "hildebrand soriano assertion evidence templates f22.pdf"                                             
##  [83] "hildebrand soriano assertion evidence templates f22.pptx"                                            
##  [84] "Hoffman"                                                                                             
##  [85] "Hoffman.zip"                                                                                         
##  [86] "HPS 0612 essay  topics 2.23.22.docx"                                                                 
##  [87] "I-9-Acceptable-Documents-List.pdf"                                                                   
##  [88] "IMG_0013.PNG"                                                                                        
##  [89] "IMG_3721.jpg"                                                                                        
##  [90] "IMG_4537.HEIC"                                                                                       
##  [91] "IMG_4543.HEIC"                                                                                       
##  [92] "IMG_4544.HEIC"                                                                                       
##  [93] "IMG_4545.HEIC"                                                                                       
##  [94] "IMG_4546.HEIC"                                                                                       
##  [95] "IMG_4547.HEIC"                                                                                       
##  [96] "IMG_4554 2.jpg"                                                                                      
##  [97] "IMG_4554.jpg"                                                                                        
##  [98] "IMG_4555.PNG"                                                                                        
##  [99] "IMG_4758.HEIC"                                                                                       
## [100] "IMG_4807.HEIC"                                                                                       
## [101] "IMG_4809.HEIC"                                                                                       
## [102] "IMG_4811.HEIC"                                                                                       
## [103] "IMG_6020.heic"                                                                                       
## [104] "IMG_6081.HEIC"                                                                                       
## [105] "IMG_6145.HEIC"                                                                                       
## [106] "IMG_6149.HEIC"                                                                                       
## [107] "IMG_6156.JPG"                                                                                        
## [108] "IMG_6162.JPG"                                                                                        
## [109] "IMG_6165.JPG"                                                                                        
## [110] "IMG_6189.JPG"                                                                                        
## [111] "IMG_6460.JPG"                                                                                        
## [112] "IMG_6477 2.JPG"                                                                                      
## [113] "IMG_6477.JPG"                                                                                        
## [114] "IMG_6612.JPG"                                                                                        
## [115] "IMG_AE7E91570F83-1.jpeg"                                                                             
## [116] "IMG_C50E5F1B5F81-1.jpeg"                                                                             
## [117] "Imported File-1.pdf"                                                                                 
## [118] "Imported File.pdf"                                                                                   
## [119] "Intro (1).pptx"                                                                                      
## [120] "Intro (2).pptx"                                                                                      
## [121] "Intro-2.pptx"                                                                                        
## [122] "Intro.pptx"                                                                                          
## [123] "Invoice - 0003.pdf"                                                                                  
## [124] "ja4052153_si_001.pdf"                                                                                
## [125] "jab464_CommunityFoodVolunteeringManager"                                                             
## [126] "jab464_CommunityFoodVolunteeringManager.zip"                                                         
## [127] "jc2 bolinger et al f22_Martin edit.pptx"                                                             
## [128] "jm100112j.pdf"                                                                                       
## [129] "Journal Club1_DWS_SP22-2.pptx"                                                                       
## [130] "Journal Club1_DWS_SP22.pptx"                                                                         
## [131] "Journal club2_Hitsman_sp22_post.pptx"                                                                
## [132] "last slide lecture 3.docx"                                                                           
## [133] "Late Withdrawal Form.pdf"                                                                            
## [134] "LinkedList-2.java"                                                                                   
## [135] "LinkedList.java"                                                                                     
## [136] "LUX_2022_AE_Guide_PT_09_28_21.pdf"                                                                   
## [137] "LuxotticaDigitalEmployeeGuide_09102018.pdf"                                                          
## [138] "Macro final.pdf"                                                                                     
## [139] "MALWARE_BYTES-setup-1.80.2.1012.exe"                                                                 
## [140] "MATLAB_Book_2E_Solution_Manual.doc"                                                                  
## [141] "micrometer1.25.jpg"                                                                                  
## [142] "micrometer2.5.jpg"                                                                                   
## [143] "Mid_term_2.pdf"                                                                                      
## [144] "Midterm Questions - Mind and Medicine  Spring 2022.docx"                                             
## [145] "midterm_exam_1.pdf"                                                                                  
## [146] "Mind and Med First Paper HW.pdf"                                                                     
## [147] "Mini exam 1.pdf"                                                                                     
## [148] "Movie revisions.docx"                                                                                
## [149] "MyHashCode.java"                                                                                     
## [150] "nk fine slip1.pdf"                                                                                   
## [151] "Note Jun 21, 2022.pdf"                                                                               
## [152] "Oct25.docx"                                                                                          
## [153] "Oct25.Rmd"                                                                                           
## [154] "Online Survey Software _ Qualtrics Survey Solutions.pdf"                                             
## [155] "Patel (A.4) & Kodali (B.8) Pyramid Poster Planning.pptx"                                             
## [156] "Patel (A.4) & Kodali (B.8)- FINAL SLIDES.pdf"                                                        
## [157] "Patel (A.4) & Kodali (B.8)- FINAL SLIDES.pptx"                                                       
## [158] "Patel + Kodali Slides.pptx"                                                                          
## [159] "PCR Prelab Homework_3-10-22 EDITED.pptx"                                                             
## [160] "Peer_Presentation_Evaluation_Duckweed_SP22_nk.docx"                                                  
## [161] "PHOTO-2022-05-24-14-29-48.jpg"                                                                       
## [162] "Pollutant Research.pptx"                                                                             
## [163] "pollutant websearch_1.7.22-2.docx"                                                                   
## [164] "pollutant websearch_1.7.22.docx"                                                                     
## [165] "portfolio_ggpubr_intro-2.Rmd"                                                                        
## [166] "portfolio_ggpubr_log_transformation.Rmd"                                                             
## [167] "Presentation+planning+DWS+sp22%281%29 copy.docx"                                                     
## [168] "Quantification_graph.png"                                                                            
## [169] "Question 1.pdf"                                                                                      
## [170] "Question 2.pdf"                                                                                      
## [171] "R2_DNARep.docx"                                                                                      
## [172] "readingVcf.html"                                                                                     
## [173] "readingVcf.Rmd"                                                                                      
## [174] "Rec 1 micro.pdf"                                                                                     
## [175] "Rec 2 micro.pdf"                                                                                     
## [176] "Rec 3.pdf"                                                                                           
## [177] "Rec 6.pdf"                                                                                           
## [178] "Recitation feb 21.pdf"                                                                               
## [179] "removing_fixed_alleles.html"                                                                         
## [180] "removing_fixed_alleles.Rmd"                                                                          
## [181] "Resume Template.docx"                                                                                
## [182] "rsconnect"                                                                                           
## [183] "RStudio-2022.07.1-554.dmg"                                                                           
## [184] "rstudio-ide.pdf"                                                                                     
## [185] "Scanned Document 20.pdf"                                                                             
## [186] "Screen Shot 2022-01-19 at 5.18.58 PM.png"                                                            
## [187] "Screen%20Shot%202022-01-10%20at%208.47.26%20PM.png"                                                  
## [188] "Screen%20Shot%202022-03-22%20at%2011.25.02%20PM.png"                                                 
## [189] "sept 19 2022 final slide.docx"                                                                       
## [190] "SNPs_cleaned.csv"                                                                                    
## [191] "Sponsorship Packet 4.0.docx.pdf"                                                                     
## [192] "SS Econ Exam 1.pdf"                                                                                  
## [193] "Student_SP22anaylzing_course_population_2_20.pptx"                                                   
## [194] "Subject.png"                                                                                         
## [195] "T185_WIWNINLM66_rbcLF_IS-rbcL_forward.ab1"                                                           
## [196] "Taaza Fiscal Sponsorship Agreement .docx.pdf"                                                        
## [197] "Target_GiftCard-Request-Form.pdf"                                                                    
## [198] "Teamwork semester evaluation_Duckweed_SP22_nk.docx"                                                  
## [199] "traj_0.pdb"                                                                                          
## [200] "transaction_history-2.csv"                                                                           
## [201] "transaction_history.csv"                                                                             
## [202] "transpose_VCF_data.html"                                                                             
## [203] "transpose_VCF_data.Rmd"                                                                              
## [204] "Tree.pdf"                                                                                            
## [205] "Untitled design-2.jpg"                                                                               
## [206] "Untitled design-3.jpg"                                                                               
## [207] "Untitled design.jpg"                                                                                 
## [208] "UPMC CFEI Sponsorship Application 2021.docx"                                                         
## [209] "vcfR_test.vcf"                                                                                       
## [210] "vcfR_test.vcf.gz"                                                                                    
## [211] "vegan_PCA_amino_acids-STUDENT.html"                                                                  
## [212] "vegan_PCA_amino_acids-STUDENT.Rmd"                                                                   
## [213] "vegan_pca_with_msleep-STUDENT.Rmd"                                                                   
## [214] "viability_shroom_f22.xlsx"                                                                           
## [215] "Vial1_parents.JPG"                                                                                   
## [216] "VitalSource-Bookshelf_10.1.1.1411.dmg"                                                               
## [217] "walsh2017morphology.csv"                                                                             
## [218] "Warehouse_Donation_Request_090214.pdf"                                                               
## [219] "WDTS Offers Accepted_SULI - Fall FY22.xlsx"                                                          
## [220] "week08_cluster_analysis-1.pdf"                                                                       
## [221] "WIWNINLM66_2_NaCl_D14_021522_NSK.jpg"                                                                
## [222] "WIWNINLM66_Al2_D14_021522_EW.jpg"                                                                    
## [223] "WIWNINLM66_CuSO4_D14_021522_SG_1.jpg"                                                                
## [224] "WIWNINLM66_Fe1_D14_021522_RV.jpg"                                                                    
## [225] "WIWNINLM66_Mn_D14_021522_NK_2.jpg"                                                                   
## [226] "WIWNINLM66_Mn_D14_021522_NK_Cont.jpg"                                                                
## [227] "XC1.java"                                                                                            
## [228] "Zoom.pkg.download"
list.files(pattern = "vcf")
## [1] "22.1059792-1299792.ALL.chr22_GRCh38.genotypes.20170504.vcf"
## [2] "all_loci-1.vcf"                                            
## [3] "all_loci.vcf"                                              
## [4] "ALL.chr22_GRCh38.genotypes.20170504.vcf"                   
## [5] "ALL.chr22_GRCh38.genotypes.20170504.vcf.gz-2.download"     
## [6] "code_checkpoint_vcfR.html"                                 
## [7] "code_checkpoint_vcfR.Rmd"                                  
## [8] "vcfR_test.vcf"                                             
## [9] "vcfR_test.vcf.gz"

Load the data

SNPs_cleaned <- read.csv(file = "SNPs_cleaned.csv")

warning("If this didn't work, its may be because you didn't set your working directory.")
## Warning: If this didn't work, its may be because you didn't set your working
## directory.

Data analysis

Scale data

We always have to scale data when doing PCA

SNPs_scaled <- scale(SNPs_cleaned)

Perform PCA

use prcomp to run the PCA test to transform the data into fewer dimensions

pca_scaled <- prcomp(SNPs_scaled)

Create a Screeplot

create a screeplot to determine how many dimensions to use

screeplot(pca_scaled, 
          ylab  = "Relative importance",
          main = "PCA test")

We see that PC1 is the most important as it had the highest percentage compared to all the other PCs. It would be worth using both PC1 and PC2 as the dimensions to compare results.

Calculate the Variation

use summary to determine what PC to use

summary_out_scaled <- summary(pca_scaled)
PCA_variation <- function(pca_summary, PCs = 2){
  var_explained <- pca_summary$importance[2,1:PCs]*100
  var_explained <- round(var_explained,1)
  return(var_explained)
}
var_out <- PCA_variation(summary_out_scaled,PCs = 10)
N_columns <- ncol(SNPs_scaled)
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

we see that all PCs 1-10 are all above the threshold are all above the threshold of variation and would be considered important

Create a Biplot

use biplot() to create a biplot to futher analyze the data

biplot(pca_scaled)

TODO: it is very messy and hard to understand

Extract PCA scores

Obtaine PCA scores in order to remove any repetitions in the data

pca_scores <- vegan::scores(pca_scaled)

PCA can only process numeric data, so we should only add the label after the scores are obtained.

pop_id <- c("Nel","Nel","Nel","Nel","Nel","Nel","Nel","Nel",
"Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Nel", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Alt",
"Alt", "Alt", "Alt", "Alt", "Alt", "Alt", "Sub", "Sub",
"Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub", "Sub",
"Sub", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau", "Cau",
"Cau", "Cau", "Cau", "Cau", "Cau", "Div", "Div", "Div",
"Div", "Div", "Div", "Div", "Div", "Div", "Div", "Div",
"Div", "Div", "Div", "Div")

Organize the PCA data with categorical labels

pca_scores2 <- data.frame(pop_id,
                              pca_scores)

Plot the data

plot the data and organize it by shape and coloe

ggpubr::ggscatter(data = pca_scores2,
                  y = "PC2",
                  x = "PC1",
                  color = "pop_id",
                  shape = "pop_id",
                  xlab = "PC1 (20% variation)",
                  ylab = "PC2 (2% variation)",
                  main = "Species Variation")

PC1 accounts for 20% of the variation while PC2 accounts for 2% of the variation. There are 2 distinct clusters. Alt, Nel, and Sub on PC1 while PC2 sees Cau and Dv.