Principal Component Analysis in R

Introductory Code

At the beginning of any R script, you may want to first set your working directory. This allows you to load datasets from, and save datasets to, this folder without having to type the entire file path.

## Setting the working directory
setwd("~/Dad's PFAs")
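If you want to confirm that the working directory is what you expect, a minimal sketch follows. It uses tempdir() as a stand-in for your own folder, and the file name example.csv is made up for illustration.

```r
# Hypothetical folder: tempdir() stands in for your own project folder
project_dir = tempdir()

old_dir = setwd(project_dir)   # setwd() invisibly returns the previous directory
getwd()                        # confirm where R is now looking

# A file saved with a bare name lands in the working directory
write.csv(data.frame(a = 1:3), "example.csv", row.names = FALSE)
file.exists("example.csv")     # TRUE when the file is in the working directory

setwd(old_dir)                 # restore the original working directory
```

Running getwd() right after setwd() is a quick way to catch a mistyped path before any data are read in.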

Any line of code that begins with the # symbol is a comment (more generally, R ignores everything that follows a # on a line). Commented code is not run, and comments are helpful if you or anyone else intends to use the script later. Hadley Wickham, the author of ggplot2 (a wonderful package for data visualization) and of many R textbooks, once tweeted: “Dear past-Hadley: PLEASE COMMENT YOUR CODE BETTER. Love present-Hadley.” Comments can be thought of as notes to the future user of the R script.

Next, any extra packages are installed and loaded into your working environment. The most commonly used functions are preloaded when R is opened; packages add further functions, usually geared towards a single purpose. The factoextra package contains many functions for data visualization after principal component analysis is conducted.

#install.packages("factoextra")
library(factoextra)

Note that the first line of code in the above block, #install.packages("factoextra"), is commented out. This is because I had to install this package to a different directory to create this .html file. If you would like to rerun this and have not yet installed the package, you will need to remove the # symbol.
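A common alternative to commenting the install line in and out is to install the package only when it is missing. The sketch below assumes you have an internet connection and are happy to use the CRAN cloud mirror; the repos argument is my choice here, not something the original script sets.

```r
# Install factoextra only if it is not already available on this machine
if (!requireNamespace("factoextra", quietly = TRUE)) {
  # Assumption: the CRAN cloud mirror is reachable from your computer
  install.packages("factoextra", repos = "https://cloud.r-project.org")
}

# Load the package into the current session
library(factoextra)
```

With this pattern the script can be rerun on any computer without editing the comment symbol.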

The next order of business is reading in your datafiles. R can handle many types of data files: .txt, .xlsx, .nc4, .csv, etc. It is much easier to deal with .csv and .txt files, so before reading in the data, I saved each sheet as a separate .csv file. When saving Excel (.xlsx) sheets as .csv files, only the open sheet is saved (so you have to do this seven times in this case).

## Reading in the data
Raw_Data = read.csv('RawData.csv')
#View(Raw_Data)

In the code chunk above that starts with the commented line ## Reading in the data, we are reading in the raw datafile. Note that we only had to reference the datafile by the name 'RawData.csv' because we set our working directory to the folder containing this file. Had we not done this, we would have had to type the entire file path, which can be a burden. Also note that #View(Raw_Data) is commented out. If you would like to view the datafile after you have read it in, run this line of code without the #. It is always advisable to look at the dataframe after it has been read into R and verify that everything looks ok.
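Besides View(), a few base R functions give a quick text summary of a freshly read dataframe. The sketch below uses a small made-up table written to a temporary file in place of RawData.csv; the name Example_Data and its columns are hypothetical.

```r
# Made-up data written to a temporary file, standing in for 'RawData.csv'
tmp = tempfile(fileext = ".csv")
write.csv(data.frame(Site = c("A", "B", "C"), Conc = c(1.2, 3.4, NA)),
          tmp, row.names = FALSE)

Example_Data = read.csv(tmp)

dim(Example_Data)       # number of rows and number of columns
str(Example_Data)       # column names, types, and the first few values
head(Example_Data, 10)  # the first rows (up to 10)
summary(Example_Data)   # per-column summaries; numeric columns report NA counts
```

Checking dim() against the size of the original spreadsheet is often the fastest way to notice an extra or missing column.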

Below, we include the first 10 rows of the dataframe Raw_Data. This is not how it will look in R; it is just how it appears in an .html file, and I have hidden the code that creates this data table. Let’s check it to see if it was read in properly. If you scroll all the way to the right in this viewer, you will see that the last column was read in incorrectly. All other columns are fine, but R added a column of NA values with the column name X. This can sometimes occur when you save .xlsx files as .csv. I think this is because you once had data in this column and it appears as a “ghost” column (empty values) in the .csv file, but I am not sure. Let’s delete this column (see below).

## Deleting the last (26th) column
Raw_Data = Raw_Data[,-26]

You can reference the dataframe by the name we saved it as earlier: Raw_Data. You can reference a specific element with the code Raw_Data[1,2]: this will be the element in the first row and second column. You can reference an entire row with the code Raw_Data[1,]: this will select the first row. And, likewise, you can reference an entire column with the code Raw_Data[,26]: this will print the 26th column (the empty one).
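The indexing patterns described above can be tried on a small made-up dataframe. Example_Data and its columns are hypothetical stand-ins for Raw_Data, with an empty third column playing the role of the 26th.

```r
# Small made-up dataframe with an empty third column
Example_Data = data.frame(Site = c("A", "B", "C"),
                          Conc = c(1.2, 3.4, 5.6),
                          X    = c(NA, NA, NA))

Example_Data[1, 2]      # single element: first row, second column
Example_Data[1, ]       # the entire first row (still a dataframe)
Example_Data[, 3]       # the entire third column (returned as a vector)
Example_Data[, -3]      # a negative index drops that column instead
Example_Data[, "Conc"]  # a column can also be selected by name
```

The rule is always [row, column]; leaving one side of the comma blank means "all of them".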

In the code above, we have overwritten the original Raw_Data object with the same dataset minus the 26th column. Observe below that the column is no longer present. All of the other datafiles had the same “ghost” column, but you should check yours if you do this on your own (maybe send me your code to check).
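If you are unsure which column number the “ghost” column has in another file, one option is to find columns that contain only NA values rather than hard-coding the position. This is a sketch on made-up data, not the method used in the script above; Example_Data, empty_cols, and Cleaned_Data are hypothetical names.

```r
# Made-up dataframe with one completely empty column
Example_Data = data.frame(Site = c("A", "B"),
                          Conc = c(1.2, 3.4),
                          X    = c(NA, NA))

# Flag columns in which every value is NA
empty_cols = colSums(!is.na(Example_Data)) == 0
names(Example_Data)[empty_cols]   # names of the empty columns

# Keep only the columns that contain at least one value
Cleaned_Data = Example_Data[, !empty_cols, drop = FALSE]

names(Cleaned_Data)   # the empty column should no longer be listed
ncol(Cleaned_Data)    # one fewer column than before
```

Be aware that this removes every all-NA column, so check the flagged names first in case a real variable simply has no measurements yet.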