Learning objectives
All of this material will appear on the exam. Take notes on the
workflow, functions, and concepts.
Main objectives
- Work through a full analysis of a dataset with PCA
- Understand the connection between scree plots and the amount of
variation explained by each PC
- Learn how to make a scree plot in terms of explained variation and
how to assess it using a simple rule of thumb
Review
By the end of this lesson you will know how to:
- set a working directory in RStudio
- confirm the location of the working directory with
getwd()
- confirm a file is present with list.files() and
list.files(pattern = ...)
- load a typical data file in spreadsheet format with
read.csv()
- use basic R functions to check data you’ve loaded
(e.g. dim(), summary())
- create a basic PCA, make a scree plot, and make and interpret a
simple biplot.
Introduction
This Portfolio is a complete worked example of a PCA of a
dataset. It covers the relevant steps:
- Loading data
- Dealing with NAs
- Scaling
- Running PCA
- Building and interpreting scree plots
- Extracting PCA scores
- Creating custom scatterplots of PCA scores
This Portfolio also expands on our understanding of PCA by
introducing the idea that each PC explains a proportion of the variation
in the data. This is useful for understanding what PCA does and what the
PCs mean, and it also allows us to interpret scree plots more thoroughly.
Preliminaries
Check the location of your working directory with
getwd()
# run getwd()
# TODO
getwd()
## [1] "C:/Users/slmg/Downloads"
Check for the presence of the “walsh2017morphology.csv” file in the
working directory with list.files()
# run list.files()
# TODO
list.files()
## [1] "__LSRHS_High_School_Report_Card.pdf"
## [2] "__LSRHS_Report_Card (1).pdf"
## [3] "__LSRHS_Report_Card (2).pdf"
## [4] "__LSRHS_Report_Card (3).pdf"
## [5] "__LSRHS_Report_Card (4).pdf"
## [6] "__LSRHS_Report_Card (5).pdf"
## [7] "__LSRHS_Report_Card (6).pdf"
## [8] "__LSRHS_Report_Card (7).pdf"
## [9] "__LSRHS_Report_Card (8).pdf"
## [10] "__LSRHS_Report_Card (9).pdf"
## [11] "__LSRHS_Report_Card.pdf"
## [12] "__LSRHS_Transcript (1).pdf"
## [13] "__LSRHS_Transcript (2).pdf"
## [14] "__LSRHS_Transcript (3).pdf"
## [15] "__LSRHS_Transcript (4).pdf"
## [16] "__LSRHS_Transcript.pdf"
## [17] "__MACOSX"
## [18] "__pycache__"
## [19] "_Thumbs.db"
## [20] "03-Stoichiometry-student notes (1).pdf"
## [21] "03-Stoichiometry-student notes.pdf"
## [22] "03-Stoichiometry-student notes.pdf (to print).pdf"
## [23] "04-09b Lab - -8 Dyn Equil Part II Data and graph (Jan 19- 2021 at 4-22 PM).png"
## [24] "04-Reactions-student notes (1).pdf"
## [25] "04-Reactions-student notes.pdf"
## [26] "05-ThermochemistryV2-student notes.pdf"
## [27] "06-AtomicStructure & PT trends-student notes.pdf"
## [28] "07-Lewis Structures & Molecular Geometry-student notes.pdf"
## [29] "07-mean_imputation.Rmd"
## [30] "08-Advanced Theories of Covalent Bonding-student notes.pdf"
## [31] "08-PCA_worked.Rmd"
## [32] "09-Gases-student notes.pdf"
## [33] "1_3_Electrical+Communication+Notes+1+Page.pdf"
## [34] "1_4_Synapses+Notes+1+Page.pdf"
## [35] "1_5_Pharmacology+Notes+1+Page.pdf"
## [36] "10-IMFs and Phases-student notes.pdf"
## [37] "1021688factor1by8.tif channel 1_1021688factor1by8.tif channel 1_xfm_0 (1).tif"
## [38] "1021688factor1by8.tif channel 1_1021688factor1by8.tif channel 1_xfm_0 (2).tif"
## [39] "1021688factor1by8.tif channel 1_1021688factor1by8.tif channel 1_xfm_0.tif"
## [40] "110Q9S22.pdf"
## [41] "1255156.jpg"
## [42] "1288814 (1).jpg"
## [43] "1288814.jpg"
## [44] "1540_cluster_analysis.pdf"
## [45] "172822Words.zip"
## [46] "1BHZ (1).png"
## [47] "1BHZ.png"
## [48] "1st Amendment Review.doc"
## [49] "20.17246903-17486903.ALL.chr20_GRCh38.genotypes.20170504.vcf.gz"
## [50] "2018-2019_Q1.pdf"
## [51] "2020-02-03_17832_2019-2020_Q2.pdf"
## [52] "2022 Pitt Honors Application Essay Prompt (1).pdf"
## [53] "2022 Pitt Honors Application Essay Prompt (2).pdf"
## [54] "2022 Pitt Honors Application Essay Prompt.pdf"
## [55] "2201 - 0120 - Exam 1.pdf"
## [56] "2201 - 0120 - Exam 2.pdf"
## [57] "2201 - 0120 - Exam 3.pdf"
## [58] "2211 Yellow PPE Form.pdf"
## [59] "38. Layli Long Soldier (1).pdf"
## [60] "38. Layli Long Soldier.pdf"
## [61] "50MillionWords.zip"
## [62] "5MilionWords (1).zip"
## [63] "5MilionWords.zip"
## [64] "6 pm August 15 Fall 2021 EngLit 625 Carol M Bove Syllabus Detective Fiction.docx"
## [65] "Acting French. Ta Nehisi Coates.docx"
## [66] "ADE_4.5_Installer.exe"
## [67] "ALL.chr20_GRCh38.genotypes.20170504 (1).vcf.gz"
## [68] "ALL.chr20_GRCh38.genotypes.20170504.vcf.gz"
## [69] "all_loci-1.vcf"
## [70] "all_loci.vcf"
## [71] "allomtery_3_scatterplot3d (1).Rmd"
## [72] "Anaconda3-2021.11-Windows-x86_64.exe"
## [73] "Apache-NetBeans-12.0-bin-windows-x64.exe"
## [74] "at3_1m4_01.tif"
## [75] "AtomSetup-x64.exe"
## [76] "background.jpeg"
## [77] "Bergdoll.pdf"
## [78] "bio rec 13.pdf"
## [79] "BIO150Week1SV.pdf"
## [80] "bird_snps_remove_NAs.html"
## [81] "bird_snps_remove_NAs.Rmd"
## [82] "c0110_expt10_datasheets.pdf"
## [83] "c0110_expt11_datasheets.pdf"
## [84] "c0110_expt12_datasheets.pdf"
## [85] "c0110_expt2_datasheets.pdf"
## [86] "c0110_expt3_datasheets.pdf"
## [87] "c0110_expt4_datasheets.pdf"
## [88] "c0110_expt5_datasheets (1).pdf"
## [89] "c0110_expt5_datasheets.pdf"
## [90] "c0110_expt6_datasheets.pdf"
## [91] "c0110_expt7_datasheets.pdf"
## [92] "c0110_expt8_datasheets.pdf"
## [93] "c0110_expt9_datasheets.pdf"
## [94] "c0120_expt1_partII.pdf"
## [95] "c0120_expt10_datasheets.pdf"
## [96] "c0120_expt11_datasheets.pdf"
## [97] "c0120_expt12_datasheets (1).pdf"
## [98] "c0120_expt12_datasheets.pdf"
## [99] "c0120_expt13_datasheets.pdf"
## [100] "c0120_expt2_datasheets.pdf"
## [101] "c0120_expt3_datasheets.pdf"
## [102] "c0120_expt4_datasheets.pdf"
## [103] "c0120_expt5_datasheets.pdf"
## [104] "c0120_expt6_datasheets.pdf"
## [105] "c0120_expt7_datasheets.pdf"
## [106] "c0120_expt8_datasheets.pdf"
## [107] "c0120_expt9_datasheets.pdf"
## [108] "candygrams tabling poster .jpg"
## [109] "CB_OrderReceipt (1).pdf"
## [110] "CB_OrderReceipt.pdf"
## [111] "CDLL_Node.class"
## [112] "center_function.R"
## [113] "Ch13_DNA&Heredity.pdf"
## [114] "Ch14_GeneExpression.pdf"
## [115] "Ch15_GeneMutation.pdf"
## [116] "Ch16_RegulationGeneExpression.pdf"
## [117] "Ch17_Genomes.pdf"
## [118] "Ch18_RecombinantDNA&Biotechnology.pdf"
## [119] "Ch19_Evolution.pdf"
## [120] "Ch20_Phylogenies.pdf"
## [121] "Ch21_EvolutionGenesGenomes.pdf"
## [122] "Ch22_Speciation.pdf"
## [123] "Ch52_PhysicalEnvironment&Biogeography.pdf"
## [124] "Ch53_Populations.pdf"
## [125] "Ch54_SpeciesInteractions (1).pdf"
## [126] "Ch54_SpeciesInteractions.pdf"
## [127] "Ch55_Communities.pdf"
## [128] "Ch56_Ecosystems.pdf"
## [129] "Chapter 02 - Programs"
## [130] "Chapter 02 - Programs (1).zip"
## [131] "Chapter 02 - Programs.zip"
## [132] "Chapter 03- programs"
## [133] "Chapter 03- programs.zip"
## [134] "Chapter 04-programs"
## [135] "Chapter 04-programs.zip"
## [136] "Chapter 05-programs"
## [137] "Chapter 06-programs.zip"
## [138] "Chapter 07-programs.zip"
## [139] "Chapter 09-programs"
## [140] "Chapter 09-programs.zip"
## [141] "Chapter 10-programs.zip"
## [142] "Chapter 21 Study Guide.docx.pdf"
## [143] "Chapter 22 Study Guide.pdf"
## [144] "Chapter 52 Study Guide.pdf"
## [145] "chem 0120 experiement 12 (1).png"
## [146] "chem 0120 experiement 12.png"
## [147] "chem 0120 experiment 1 introduction to graphing (1).pdf"
## [148] "chem 0120 experiment 1 introduction to graphing (2).pdf"
## [149] "chem 0120 experiment 1 introduction to graphing .pdf"
## [150] "chem 0120 lab 1 part 2 (1).pdf"
## [151] "chem 0120 lab 1 part 2 .pdf"
## [152] "chem 0120 lab 1 part 3 (1).pdf"
## [153] "chem 0120 lab 1 part 3 (2).pdf"
## [154] "chem 0120 lab 1 part 3.pdf"
## [155] "chem 0120 lab 10 .pdf"
## [156] "chem 0120 lab 11.pdf"
## [157] "chem 0120 lab 12.pdf"
## [158] "chem 0120 lab 13.pdf"
## [159] "chem 0120 lab 2 chromatography (1).pdf"
## [160] "chem 0120 lab 2 chromatography .pdf"
## [161] "chem 0120 lab 2 data sheets (1).pdf"
## [162] "chem 0120 lab 2 data sheets .pdf"
## [163] "chem 0120 lab 3 .pdf"
## [164] "chem 0120 lab 4 .pdf"
## [165] "chem 0120 lab 5 .pdf"
## [166] "chem 0120 lab 6.pdf"
## [167] "chem 0120 lab 7.pdf"
## [168] "chem 0120 lab 8.pdf"
## [169] "chem 0120 lab 9 data .pdf"
## [170] "chem 0120 lab 9 part ii.pdf"
## [171] "chem 0120 lab graph 9.png"
## [172] "chem hw 1 .pdf"
## [173] "chem hw 11 .pdf"
## [174] "chem hw 12.pdf"
## [175] "chem hw 13.pdf"
## [176] "chem hw 2.pdf"
## [177] "chem hw 3 .pdf"
## [178] "chem hw 4.pdf"
## [179] "chem hw 5.pdf"
## [180] "chem hw 6.pdf"
## [181] "chem hw 7.pdf"
## [182] "chem hw 8.pdf"
## [183] "chem hw 9.pdf"
## [184] "chem lab 1 flow chart (1).pdf"
## [185] "chem lab 1 flow chart .pdf"
## [186] "chem lab 10 part 1.pdf"
## [187] "chem lab 10 part 2 .pdf"
## [188] "chem lab 11 part 1.pdf"
## [189] "chem lab 11 part 2.pdf"
## [190] "chem lab 12 part 1 (1).pdf"
## [191] "chem lab 12 part 1.pdf"
## [192] "chem lab 12 part 2 .pdf"
## [193] "chem lab 12 part 3 .pdf"
## [194] "chem lab 2 data charts.pdf"
## [195] "chem lab 2 part 3 (1).pdf"
## [196] "chem lab 2 part 3 .pdf"
## [197] "chem lab 3.pdf"
## [198] "chem lab 4 .pdf"
## [199] "chem lab 5 .pdf"
## [200] "chem lab 6.pdf"
## [201] "chem lab 7.pdf"
## [202] "chem lab 8 part 1 (1).pdf"
## [203] "chem lab 8 part 1.pdf"
## [204] "chem lab 8 part 2 (1).pdf"
## [205] "chem lab 8 part 2(2).pdf"
## [206] "chem lab 8 part 2.pdf"
## [207] "chem lab 9.pdf"
## [208] "chem quiz 1.pdf"
## [209] "chem tutoring.pdf"
## [210] "Chem0110S22infosheet.pdf"
## [211] "cluster_analysis_portfolio (1).Rmd"
## [212] "cluster_analysis_portfolio.Rmd"
## [213] "CODE_CHECKPOINT-first_rstudio_script (1).R"
## [214] "CODE_CHECKPOINT-first_rstudio_script.R"
## [215] "code_checkpoint_vcfR.html"
## [216] "code_checkpoint_vcfR.Rmd"
## [217] "Community Scholarship App 2021 (my copy).pdf"
## [218] "comp bio in class activity.jpg"
## [219] "comp bio in class pca on 1000 genomes .jpg"
## [220] "connected_comp.zip"
## [221] "convert-output-to-sorted.pl"
## [222] "Copy of sd-calculator.xlsx"
## [223] "create2dlist.py"
## [224] "cs3400-09-JavaFXPart1.pptx"
## [225] "cyclopentene_3D.PDB"
## [226] "desktop.ini"
## [227] "detective fiction summary (1) (1).docx"
## [228] "detective fiction summary (1) (2).docx"
## [229] "detective fiction summary (1).docx"
## [230] "detective fiction summary (2).docx"
## [231] "detective fiction summary (2.1) (1).docx"
## [232] "detective fiction summary (2.1).docx"
## [233] "DH final presentation.pptx"
## [234] "difference.py"
## [235] "DigitalImageProcessing_Gonzalez Woods (1).pdf"
## [236] "DigitalImageProcessing_Gonzalez Woods (2).pdf"
## [237] "DigitalImageProcessing_Gonzalez Woods.pdf"
## [238] "Disc.4.pdf"
## [239] "distance.py"
## [240] "DropboxInstaller.exe"
## [241] "english composition graded essay 2 FINAL REVISION (1).doc"
## [242] "english composition graded essay 2 FINAL REVISION .doc"
## [243] "english composition graded essay 2 FINAL REVISION .docx"
## [244] "english composition graded essay 2 for review.doc"
## [245] "english composition graded essay 2 for review.docx"
## [246] "english composition writing exploration 1 .docx"
## [247] "EPD_7018.jpg"
## [248] "exam rework graph (portfolio 2).docx"
## [249] "experiement 4 graph part 2 (4-7) (1).cmbl"
## [250] "experiement 4 graph part 2 (4-7) (2).cmbl"
## [251] "experiement 4 graph part 2 (4-7) (3).cmbl"
## [252] "experiement 4 graph part 2 (4-7) (4).cmbl"
## [253] "experiement 4 graph part 2 (4-7).cmbl"
## [254] "feature_engineering.Rmd"
## [255] "feature_engineering_intro_2_functions-part2.Rmd"
## [256] "February Survey Data_ LSSC 2-24-21.pdf"
## [257] "ferrari.png"
## [258] "fiji-win64"
## [259] "fiji-win64.zip"
## [260] "file.pdf"
## [261] "Firefox Installer.exe"
## [262] "flashplayer32pp_xa_install.exe"
## [263] "foreground.ckpt (1).data-00000-of-00001"
## [264] "foreground.ckpt.data-00000-of-00001"
## [265] "Forum_Issue_7.pdf"
## [266] "fr 0088 reading response 1 (1).docx"
## [267] "fr 0088 reading response 1 .docx"
## [268] "FR88-FinalReadingAssignment.pdf"
## [269] "FR88Fall2022Syllabus.docx"
## [270] "Fraser and Hazelwood arguments.doc"
## [271] "GannonDec.14.docx"
## [272] "Graded Essay 1 (1).pdf"
## [273] "Graded Essay 1.pdf"
## [274] "Hashing-1 (1).pptx"
## [275] "Hashing-1.pptx"
## [276] "heating-copper-sulphate-hydrate-lab.pdf"
## [277] "homevideosearch (1).MOV"
## [278] "homevideosearch.MOV"
## [279] "Homework Problems.pdf"
## [280] "House.class"
## [281] "How to access CS 445 Textbook.docx"
## [282] "image of boston .jpg"
## [283] "IMG_0427.jpg"
## [284] "IMG_2532.JPG"
## [285] "IMG_3193.JPG"
## [286] "IMG_3922.HEIC"
## [287] "IMG_3923 (1).HEIC"
## [288] "IMG_3923.HEIC"
## [289] "IMG_7158.jpg"
## [290] "IMG_7164.jpg"
## [291] "IMG_7166.jpg"
## [292] "IMG_7408.jpg"
## [293] "IMG_7462.jpg"
## [294] "In-class exercise - strings.py"
## [295] "in class cluster diagram .pdf"
## [296] "in class random numbers comp bio .pdf"
## [297] "In the Company of Men (Véronique Tadjo) (z-lib.org) (1).epub"
## [298] "In the Company of Men (Véronique Tadjo) (z-lib.org) (1).pdf"
## [299] "In the Company of Men (Véronique Tadjo) (z-lib.org) (2).epub"
## [300] "In the Company of Men (Véronique Tadjo) (z-lib.org).epub"
## [301] "In the Company of Men (Véronique Tadjo) (z-lib.org).pdf"
## [302] "Informed Consent Form - reflex.docx"
## [303] "intersection.py"
## [304] "J-vl-cp7.epub.part"
## [305] "JavaSetup8u241.exe"
## [306] "jdk-14.0.2_windows-x64_bin (1).exe"
## [307] "jdk-14.0.2_windows-x64_bin.exe"
## [308] "jdk-17_windows-x64_bin.exe"
## [309] "jdk-17_windows-x64_bin.msi"
## [310] "JSSs3d.pdf"
## [311] "KindleForPC-installer-1.32.61109.exe"
## [312] "Kw_of_Water_part_III.ppt"
## [313] "Lab01_sug52 (1).java"
## [314] "Lab01_sug52.java"
## [315] "lamborghini_1.gif"
## [316] "Latin Project Movie slgr.mp4"
## [317] "latin project.mp4"
## [318] "lecture-introd2RStudio-with_scripts.pdf"
## [319] "Lime Green Safety Rules origional.pdf"
## [320] "Literary Devices in “The Garden of Forking Paths”.pptx"
## [321] "loading vcf into R.docx"
## [322] "LoggerPro3_16_2_Demo.exe"
## [323] "Looking for Ugrad Research 04-19.docx"
## [324] "LS 73120 Preliminary Submittal to DESE.pdf"
## [325] "LS_Student_Schedule_Sheet (1).pdf"
## [326] "LS_Student_Schedule_Sheet (2).pdf"
## [327] "LS_Student_Schedule_Sheet.pdf"
## [328] "LS_Student_Schedules_(Matrix)_BY_TERM (1).pdf"
## [329] "LS_Student_Schedules_(Matrix)_BY_TERM.pdf"
## [330] "main.py"
## [331] "MALWARE_BYTES-setup-1.80.2.1012 (1).exe"
## [332] "MALWARE_BYTES-setup-1.80.2.1012.exe"
## [333] "mastering-periodic-trends-infographic (1).pdf"
## [334] "mastering-periodic-trends-infographic.pdf"
## [335] "MATLAB_Runtime_R2022a_win64.zip"
## [336] "MCAS.pdf"
## [337] "menu.py"
## [338] "Microsoft Teams (1).download"
## [339] "Microsoft Teams (2).download"
## [340] "Microsoft Teams (3).download"
## [341] "Microsoft Teams.download"
## [342] "Miniconda3-latest-Windows-x86_64.exe"
## [343] "Module 12 lecture.pdf"
## [344] "Module 12.pdf"
## [345] "movies.py"
## [346] "Mu-Editor-Win64-1.1.0b5.msi"
## [347] "nf70iz4l.epub.part"
## [348] "NIS_Viewer_5.21.00_b1483_64bit.zip"
## [349] "numbers.txt"
## [350] "NuSeT-master.zip"
## [351] "NuSeTDeep elarning for reliably separating and analyzing crowded cells_ Computational Biology.pdf"
## [352] "OneDrive_1_5-16-2022"
## [353] "OneDrive_1_5-16-2022.zip"
## [354] "OpenJDK16U-jdk_x64_windows_hotspot_16.0.2_7.msi"
## [355] "OpenJDK17U-jdk_x64_windows_hotspot_17.0.4.1_1.msi"
## [356] "openjfx-17.0.2_monocle-linux-x64_bin-sdk"
## [357] "openjfx-17.0.2_monocle-linux-x64_bin-sdk.zip"
## [358] "original, thresholding, maximum intensity on thresholded (1).zip"
## [359] "original, thresholding, maximum intensity on thresholded (2).zip"
## [360] "original, thresholding, maximum intensity on thresholded.zip"
## [361] "P&H Fall 2022.doc"
## [362] "p3-skeleton-gannon-ols9.py (1).zip"
## [363] "p3-skeleton-gannon-ols9.py.zip"
## [364] "PCA-missing_data-KEY.Rmd"
## [365] "PCA-missing_data.Rmd"
## [366] "PCA on SNPS worksheet.jpg"
## [367] "PCA worksheet in class.jpg"
## [368] "PeriodicTableMuted2018 (1).pdf"
## [369] "PeriodicTableMuted2018.pdf"
## [370] "pic for video.heic"
## [371] "Pitt_Printing_Client_Win64_1930.exe"
## [372] "PITT_transcript Spring 2022 (1).pdf"
## [373] "PITT_transcript Spring 2022.pdf"
## [374] "Please_DocuSign_Associate_Chapter_Bid_Accept.pdf"
## [375] "plot.docx"
## [376] "plot_labels.py"
## [377] "pol_0402 (1).mov"
## [378] "pol_0402.mov"
## [379] "portfolio_ggpubr_intro-2.Rmd"
## [380] "portfolio_ggpubr_log_transformation.Rmd"
## [381] "pptlab.swf"
## [382] "practice chem upload quiz.pdf"
## [383] "Practice Quiz 1.pdf"
## [384] "Practice Quiz 2-1 (1).pdf"
## [385] "Practice Quiz 2-1.pdf"
## [386] "Practice Quiz 3-2.pdf"
## [387] "R-4.2.1-win.exe"
## [388] "R scatter plot for exam 3.png"
## [389] "R studio checkpoint (first).docx"
## [390] "r_help_hclust_intro-vs2.pdf"
## [391] "readingimage"
## [392] "readingimage (1).zip"
## [393] "readingimage (2).zip"
## [394] "readingimage (3).zip"
## [395] "readingimage (4).zip"
## [396] "readingimage.zip"
## [397] "real_estate_listing.txt"
## [398] "recording for video essay.m4a"
## [399] "removing_fixed_alleles.Rmd"
## [400] "Research Links.docx"
## [401] "research meeting dec 1.docx"
## [402] "resume 1031.docx"
## [403] "revised proposal detective fiction (1).docx"
## [404] "revised proposal detective fiction (2).docx"
## [405] "revised proposal detective fiction.docx"
## [406] "Rplot for allomtry 3 scatterplot 3d.png"
## [407] "Rplot for cluster analysis portfolio part 1.png"
## [408] "Rplot for cluster analysis portfolio part 2.png"
## [409] "Rplot for ggpubr log transformation.png"
## [410] "Rplot for portfolio on ggpubr intro.png"
## [411] "rsconnect"
## [412] "RStudio-2022.07.1-554.exe"
## [413] "SATStudentScoreReport_1573872323137.pdf"
## [414] "SATStudentScoreReport_1611952785519.pdf"
## [415] "SceneBuilder-17.0.0.msi"
## [416] "Schedule spring 2022.pdf"
## [417] "screengab of R installation.docx"
## [418] "screengrab of RStudio Cloud .docx"
## [419] "screengrab of Rstudio swirl package (1).docx"
## [420] "screengrab of Rstudio swirl package .docx"
## [421] "screengrab of RStudio.docx"
## [422] "sending_app_4_12"
## [423] "sending_app_4_12.zip"
## [424] "sending_app2"
## [425] "sending_app2.zip"
## [426] "sendingapp_4_12"
## [427] "sendingapp_4_12.zip"
## [428] "Sequences_Rubisco.docx"
## [429] "Small Group Discussion Questions (2).docx"
## [430] "software checkpoint data slicer and vcf download (1).docx"
## [431] "software checkpoint data slicer and vcf download.docx"
## [432] "SpotifySetup.exe"
## [433] "spring-sale.jpg"
## [434] "spring candygrams fundraiser! .jpg"
## [435] "spring candygrams fundraising .jpg"
## [436] "spring candygrams! .jpg"
## [437] "Spring Term 2021-2022.ics"
## [438] "spyder-kernels-0.2.4-py27_0.tar.bz2"
## [439] "Spyder_64bit_full.exe"
## [440] "subset.py"
## [441] "sudoku_games.txt"
## [442] "Sue-Ling-Gannon (1).pdf"
## [443] "Sue-Ling-Gannon (2).pdf"
## [444] "Sue-Ling-Gannon (3).pdf"
## [445] "Sue-Ling-Gannon (4).pdf"
## [446] "Sue-Ling-Gannon (5).docx"
## [447] "Sue-Ling-Gannon (5).pdf"
## [448] "Sue-Ling-Gannon.pdf"
## [449] "Sue-Ling Gannon .pdf"
## [450] "Sue-Ling Gannon.pdf"
## [451] "sueling_test.zip"
## [452] "sug52project2.py"
## [453] "SUG54-Deck.java"
## [454] "sugar-and-salt-solutions_en (1).jar"
## [455] "symmetric_dif.py"
## [456] "TaxForms.pdf"
## [457] "TegRunner_myclasses.tegrity.com&XnIpVaHcBkqScqcc2zUVQg.exe"
## [458] "temperature.py"
## [459] "test.cpython-38.pyc"
## [460] "test.docx"
## [461] "test.Rmd"
## [462] "text.txt"
## [463] "tgXpZNUp.epub.part"
## [464] "the-crip-poetics-of-pain.pdf"
## [465] "the-eatery.htm"
## [466] "The Science of Acupuncture.docx"
## [467] "tiff-viewer.exe"
## [468] "Tips for Project 1 part 1.py"
## [469] "toyota.gif"
## [470] "toyota_logo.png"
## [471] "transpose_VCF_data (1).Rmd"
## [472] "transpose_VCF_data.Rmd"
## [473] "TrendMicro_16.0_HE_Full.exe"
## [474] "TSIS.The Art of Quoting. Signal verbs.pdf"
## [475] "tutoring_2224.pdf"
## [476] "Unit 2.1.18.2022A.pdf"
## [477] "Unit 2.1.20.2022 copy.pdf"
## [478] "Unit 3.1.25.2022 copy.pdf"
## [479] "Unit 3.1.27.2022 copy.pdf"
## [480] "Unit1.1.2022.Jan.11. copy.pdf"
## [481] "Unit1.2022.Jan.13. copy.pdf"
## [482] "university of pittsburgh bid acceptance phi rho 2022 spring.pdf"
## [483] "vcfR_test.vcf"
## [484] "vcfR_test.vcf.gz"
## [485] "vegan_PCA_amino_acids-STUDENT.html"
## [486] "vegan_PCA_amino_acids-STUDENT.Rmd"
## [487] "vegan_pca_with_msleep-STUDENT.Rmd"
## [488] "video for project1.MOV"
## [489] "video for project10.MOV"
## [490] "video for project11.MOV"
## [491] "video for project12.MOV"
## [492] "video for project13.MOV"
## [493] "video for project2.MOV"
## [494] "video for project3.MOV"
## [495] "video for project4.MOV"
## [496] "video for project5.MOV"
## [497] "video for project6.MOV"
## [498] "video for project7.MOV"
## [499] "video for project9.MOV"
## [500] "video project8.MOV"
## [501] "vocab in class comp bio.jpg"
## [502] "VSCodeUserSetup-x64-1.70.2.exe"
## [503] "walsh2017morphology.csv"
## [504] "week08_cluster_analysis-1.pdf"
## [505] "Weimar Syllabus.2022.pdf"
## [506] "working_directory_practice.Rmd"
## [507] "Your Name.pdf"
## [508] "zip file for final project"
## [509] "Zoom_cm_fe5wwZ9vvrZo4_mSAKJbgYjxmlgFRdNV7BbvWij7TdboNUdRLm1@pT9YaNMrUoUo-qX3_k0a92dbba7b6a12b5_.exe"
## [510] "Zoom_cm_fne5Z9vvrZo4_mxnN2Jfhb6zLqhothyeCO3x6DWZDnQcHZGvQU@ayEgUznsgdfgmjIM_kdd77f79fc8b20957_.exe"
## [511] "ZoomInstaller.exe"
If you have lots of files in the working directory, you can search
for the file specifically with list.files(pattern = "walsh")
# Run list.files() with pattern = "walsh"
# TODO
list.files(pattern="walsh")
## [1] "walsh2017morphology.csv"
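In a script it can be handy to test for the file directly rather than reading the printed listing. The sketch below wraps base R's file.exists() and file.path() in a small helper; check_file() is a hypothetical name used only for illustration and is not part of the lesson.

```r
# Hypothetical helper (not from the lesson): TRUE if `fname` is in `dir`
check_file <- function(fname, dir = getwd()) {
  file.exists(file.path(dir, fname))
}

# File name taken from the text above; the result depends on
# what is actually in your working directory
check_file("walsh2017morphology.csv")
```

If this returns FALSE, the usual cause is that the working directory is not the folder where the file was saved.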
Load the .csv file
CSV files can be read in with the read.csv()
function.
# add read.csv() to load the file
df <- read.csv(file = "walsh2017morphology.csv") # TODO
Always check to make sure the data looks like what you expected with
head(), summary() and other functions.
# run head(), summary(), and dim() on the data
# TODO
head(df)
## spp wing bill weight
## 1 NESP 56 8.5 18.2
## 2 NESP 56 8.5 20.7
## 3 NESP 59 8.0 17.6
## 4 NESP 59 8.2 16.0
## 5 NESP 60 8.3 16.5
## 6 NESP 58 8.5 16.0
summary(df)
##      spp                 wing            bill           weight
##  Length:73          Min.   :53.00   Min.   :7.900   Min.   :14.5
##  Class :character   1st Qu.:56.00   1st Qu.:8.400   1st Qu.:16.0
##  Mode  :character   Median :57.00   Median :8.600   Median :17.0
##                     Mean   :57.01   Mean   :8.782   Mean   :17.4
##                     3rd Qu.:58.00   3rd Qu.:9.240   3rd Qu.:18.9
##                     Max.   :60.00   Max.   :9.900   Max.   :21.7
##                     NA's   :10      NA's   :10      NA's   :12
dim(df)
## [1] 73 4
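The summary() output above reports NAs per column. A quick way to get the same counts on their own, plus the number of fully complete rows, is sketched below. The toy data frame is made up for illustration and only mimics the column layout of the morphology data.

```r
# Toy data with the same column layout as the lesson's data (values made up)
toy <- data.frame(spp    = c("A", "A", "B", "B"),
                  wing   = c(56, NA, 59, 60),
                  bill   = c(8.5, 8.0, NA, 8.3),
                  weight = c(18.2, NA, NA, 16.5))

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
na_counts <- colSums(is.na(toy))
na_counts

# complete.cases() flags rows that contain no NAs; sum() counts them
n_complete <- sum(complete.cases(toy))
n_complete
```

The number of complete rows is what will be left after na.omit(), so it is worth knowing before any rows are dropped.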
PCA data preparation
Invariant columns
When doing PCA on SNPs you have to consider the possibility that a
SNP is fixed and all values in a column are identical. This is not
typically an issue for conventional datasets like the morphology data we
are working with here. When doing large-scale analyses with 1000s of
columns and limits on computational power it would be advisable to check
for invariant columns, which could be due to some feature of the data
you aren’t aware of or an error in data processing.
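A minimal sketch of such a check is shown here, using a made-up SNP-like data frame (the column names and values are assumptions for illustration). A column is treated as invariant when it has at most one unique non-missing value. Scaling such a column would divide by a standard deviation of zero, which is one reason to remove it first.

```r
# Toy SNP-style data (made up); snp2 is fixed
toy <- data.frame(snp1 = c(0, 1, 2, 1),
                  snp2 = c(2, 2, 2, 2),   # every value identical
                  snp3 = c(0, 0, 1, 2))

# TRUE for columns with at most one unique non-missing value
is_invariant <- sapply(toy, function(x) length(unique(na.omit(x))) <= 1)
is_invariant

# Keep only the columns that vary; drop = FALSE keeps a data frame
toy_var <- toy[, !is_invariant, drop = FALSE]
names(toy_var)
```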
Data scaling
We always scale data for PCA. This could be done before or after
dealing with NAs and other data preparation issues.
The first column is character data so we’ll drop that when scaling
using df[, -1].
# make a copy of the dataframe
df2_scale <- df
# add scale() to scale the data
df2_scale[,-1] <- scale(df[,-1])
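To see what scale() does, the sketch below scales a small made-up data frame and checks the result. After scaling, each column should have a mean of about 0 and a standard deviation of 1, and any NAs stay NA because scale() ignores them when computing the column mean and standard deviation.

```r
# Toy numeric data with some missing values (values made up)
toy <- data.frame(wing = c(56, 59, NA, 60),
                  bill = c(8.5, 8.0, 8.2, NA))

# scale() centers each column on its mean and divides by its sd
toy_scaled <- scale(toy)

# Column means after scaling (NAs skipped); should be ~0
col_means <- colMeans(toy_scaled, na.rm = TRUE)
round(col_means, 10)

# Column standard deviations after scaling; should be 1
col_sds <- apply(toy_scaled, 2, sd, na.rm = TRUE)
col_sds

# The missing values are still missing
sum(is.na(toy_scaled))
```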
Dealing with NAs
We need to remove NAs with na.omit(). We could impute
the missing data instead, but there aren’t too many missing values. (If
we wanted to impute, we could use a for() loop or my function
mean_imputation().)
# add na.omit() to remove the NAs
# assign the output to df2_scale_noNA
df2_scale_noNA<- na.omit(df2_scale) #TODO
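For comparison, here is a rough sketch of what imputing with a for() loop could look like. The function impute_col_means() is a hypothetical stand-in written for this example; it is not the mean_imputation() function mentioned above, whose code is not shown here. It replaces each NA in a numeric column with that column's mean and leaves other columns alone.

```r
# Hypothetical helper: replace NAs in numeric columns with the column mean
impute_col_means <- function(dat) {
  for (j in seq_len(ncol(dat))) {
    if (is.numeric(dat[[j]])) {
      col_mean <- mean(dat[[j]], na.rm = TRUE)
      dat[[j]][is.na(dat[[j]])] <- col_mean
    }
  }
  dat
}

# Toy data (values made up)
toy <- data.frame(spp  = c("A", "A", "B"),
                  wing = c(56, NA, 60),
                  bill = c(8, 9, NA))

toy_imputed <- impute_col_means(toy)
toy_imputed

# na.omit() keeps only complete rows; imputation keeps every row
nrow(na.omit(toy))
nrow(toy_imputed)
```

The trade-off is that na.omit() throws away whole rows, while imputation keeps them at the cost of filling in values that were never measured.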
PCA
We can run a PCA with prcomp(). The first column is
character data, so [,-1] is included so that prcomp() does not receive it.
# add prcomp() and assign it to an object called
## morpho_pca
morpho_pca<- prcomp(df2_scale_noNA[,-1]) #TODO
PCA Diagnostics
It’s important to assess some things from the PCA output before moving
forward with interpreting a plot of the PCA results such as a biplot.
These are generally called performance diagnostics.
Scree plot
It is important when doing PCA to decide how many of the new,
re-engineered dimensions should be retained. PCA does not inherently
reduce the dimensionality of a dataset. If you put a dataframe with
10 columns into prcomp(), you get 10 PCs (10 dimensions) out of it. It’s
the job of the data analyst to decide how many of the new features (PCs)
should be considered, plotted, and/or used in further analyses.
A scree plot is the typical tool for this. The base R scree plot is
pretty basic. In general, we’re looking for a steep drop between two
PCs. The PCs before the drop are most worthwhile to examine.
# add screeplot() to make the scree plot
screeplot(morpho_pca) #TODO
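The point that PCA returns as many PCs as there are input columns can be checked directly. The sketch below uses simulated data (random numbers, not the morphology data) with 10 columns and more rows than columns, which is the situation described above.

```r
# Simulated data: 50 rows, 10 numeric columns (random values)
set.seed(1)
toy <- as.data.frame(matrix(rnorm(50 * 10), nrow = 50, ncol = 10))

pca <- prcomp(toy, scale. = TRUE)

# Number of input columns and number of PCs returned
n_in  <- ncol(toy)
n_out <- ncol(pca$x)
c(columns_in = n_in, PCs_out = n_out)
```

Nothing has been reduced yet. The reduction only happens when the analyst chooses to keep a subset of the columns in pca$x.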

Explained variation
The bars on the default R scree plot represent the relative
importance of the PCs in representing the data.
More specifically, they are proportional to the amount of variation
in the data captured by each PC. The more variation represented by a PC,
the more important it is.
We can get the information on the percentage of variation in the data
using the summary() command.
First, we get the summary information and store it to an object.
# call summary()
summary_out_morpho <- summary(morpho_pca) #TODO
Then we use a function to extract the information on explained
variation (importance) we want.
# add return(var_explained) to get output
PCA_variation <- function(pca_summary, PCs = 2){
  # row 2 of $importance is "Proportion of Variance"; convert to percent
  var_explained <- pca_summary$importance[2, 1:PCs]*100
  var_explained <- round(var_explained, 1)
  return(var_explained) # TODO
}
This function allows us to specify how many PCs we want. The default
is 2 PCs, since that is what is usually plotted. Let’s get the first 3
PCs because that’s all there are in these morphological data.
# call PCA_variation(), with PCs = 3
var_out <- PCA_variation(summary_out_morpho, # TODO
PCs =3 )
var_out
## PC1 PC2 PC3
## 55.1 32.0 12.8
This means that PC1 captures 55.1% of the variation in the data. PC2
captures only 32%, and PC3 captures 12.8%.
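These percentages come from the standard deviations that prcomp() stores in $sdev: squaring them gives the variance along each PC, and dividing by the total gives the proportion. The sketch below shows the arithmetic using R's built-in iris data as a stand-in, since it needs no file; the numbers will differ from the morphology results above.

```r
# Built-in iris data used as a stand-in; columns 1-4 are numeric
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Variance along each PC is the squared standard deviation
var_each <- pca$sdev^2

# Proportion of the total variance captured by each PC
prop <- var_each / sum(var_each)
names(prop) <- colnames(pca$x)
round(prop * 100, 1)

# The same row that PCA_variation() pulls out of summary()
from_summary <- summary(pca)$importance[2, ]
round(from_summary * 100, 1)
```

Because the proportions share a common total, they always add up to 100%.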
These percentages are often reported as labels on biplots and
scatterplots of PCA results (see below).
They can also be used to decide how many PCs to work with.
Instead of the default scree plot, we can make a scree plot where the
y-axis is the percent of variation captured by each PC. We’ll use the
barplot() function to do this.
# number of dimensions in the data
# (drop the first column, spp, which is not part of the PCA)
N_columns <- ncol(df2_scale_noNA[, -1])
# make barplot
barplot(var_out,
        main = "Percent variation Scree plot",
        ylab = "Percent variation explained")
abline(h = 1/N_columns*100, col = 2, lwd = 2)

The horizontal line in the barplot is drawn at 100/(# of
dimensions in the data). If all the PCs were equally important, each
would explain 100/(# of columns) percent of the variation.
These data have 3 numeric columns, so if all the PCs were equal, they
would each explain 33% of the variation. A general rule of thumb is to
focus on PCs that explain more than this percentage. Here PC1 (55.1%)
clears that bar, while PC2 (32.0%) falls just short of it.
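The rule of thumb can also be applied in code rather than by eye. This sketch again uses the built-in iris data as a stand-in (an assumption for the sake of a runnable example), computes the equal-share threshold, and lists the PCs that exceed it.

```r
# Built-in iris data used as a stand-in; columns 1-4 are numeric
pca <- prcomp(iris[, 1:4], scale. = TRUE)

# Percent of variation explained by each PC
pct <- 100 * pca$sdev^2 / sum(pca$sdev^2)
names(pct) <- colnames(pca$x)

# Equal-share threshold: 100 / number of dimensions
n_dims <- length(pca$sdev)
threshold <- 100 / n_dims
threshold

# PCs that explain more than their equal share
keep <- names(pct)[pct > threshold]
keep
```

PC1 always passes this test, because the largest share can never be smaller than the average share. The rule is a rough guide, so a PC that falls just under the line may still be worth plotting.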
PCA Biplot
Now we’ll make the biplot. Look at the biplot and interpret the
relationship between the 3 features bill, weight, and wing. Then read
the information below.
# add biplot() to see the biplot
biplot(morpho_pca) #TODO

Custom PCA Plot
Note that if the arrows aren’t plotted, it’s not a biplot.
Get the scores:
# call vegan::scores()
morpho_scores <- vegan::scores(morpho_pca) # TODO
Combine the scores with the species information into a dataframe.
# call data.frame()
morpho_scores2 <- data.frame(spp = df2_scale_noNA$spp,
morpho_scores)
Plot the scores, with species color-coded
# make color and shape = "spp"
ggpubr::ggscatter(data = morpho_scores2,
y = "PC2",
x = "PC1",
color ="spp" , # TODO
shape = "spp", # TODO
main = "PCA Scatterplot",
xlab = "PC1 (55.1% of variation)",
ylab = "PC2 (32% of variation)")

Note how in the plot the amount of variation explained by each PC is
shown in the axis labels.
How to interpret the biplot
In the biplot created above, the “bill” and “weight” vectors point to
the left, and “wing” points straight down.
This means that bill and weight are correlated with PC1, which
biplot() draws on the horizontal axis by default. Wing is correlated
with PC2, the vertical axis.
Bill and weight are very close to each other, so the raw data of
these features are going to be highly correlated with each other.
The “Wing” vector points straight down at about a 90 degree (right)
angle to not only PC1, but also bill and weight. We can therefore say
that the wing vector is approximately orthogonal to PC1, bill, and
weight, which suggests that wing is only weakly correlated with them.
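A biplot reading like this can be cross-checked against the raw correlations and the PCA loadings. The sketch below shows the idea on the built-in iris data, used here as a stand-in for the morphology data: features whose arrows lie close together should show a high correlation in cor(), and the PC axes themselves are orthogonal by construction.

```r
# Built-in iris data used as a stand-in; columns 1-4 are numeric
feat <- iris[, 1:4]

# Pairwise correlations among the raw features
cor_mat <- cor(feat)
round(cor_mat, 2)

pca <- prcomp(feat, scale. = TRUE)

# Loadings: how strongly each feature contributes to PC1 and PC2.
# These set the direction of the arrows in the biplot.
round(pca$rotation[, 1:2], 2)

# The PC axes are orthogonal: this product is the identity matrix
axes_check <- crossprod(pca$rotation)
round(axes_check, 10)
```

For the morphology data, the same calls on the scaled, NA-free numeric columns would show whether bill and weight are as strongly correlated as their arrows suggest.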