At the beginning of any R script, you may want to first set your working directory. This will allow you to load datasets and save datasets directly into this folder without having to type the entire file path.
## Setting the working directory
setwd("~/Dad's PFAs")
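As a minimal sketch of what the working directory does, the following shows that a bare file name is resolved relative to the folder reported by getwd(). The folder and file names here (wd_demo, example.csv) are hypothetical and are created in a temporary location purely for illustration.

```r
# Hypothetical folder created in a temporary location, for illustration only
demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)

old_wd <- setwd(demo_dir)   # setwd() returns the previous working directory
getwd()                     # confirm where R is now looking for files

# A bare file name is now written inside demo_dir
write.csv(data.frame(x = 1:3), "example.csv", row.names = FALSE)
file_found <- file.exists(file.path(demo_dir, "example.csv"))

setwd(old_wd)               # restore the original working directory
```

Running getwd() right after setwd() is a quick way to confirm that R is pointed at the folder you intended.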
Any line of code that has the # symbol as the first character is a comment. Commented code is not run and is helpful for the future if you or anyone else intends to use the script later. The author of ggplot2 (a wonderful package for data visualization) and many R textbooks once tweeted: “Dear past-Hadley: PLEASE COMMENT YOUR CODE BETTER. Love present-Hadley.” Comments can be thought of as notes to the future user of the R script.
Next, any extra packages are installed and loaded into your working environment. The most commonly used functions are preloaded when R is opened; packages add further functions, usually geared towards a single purpose. The factoextra package contains many functions for visualizing the results of a principal component analysis (PCA).
#install.packages("factoextra")
library(factoextra)
Note that the first line of code in the block above, #install.packages("factoextra"), is commented out. This is because I had to install this package to a different directory to create this .html file. If you would like to rerun this, you will need to remove the hashtag symbol (#).
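One common alternative to commenting the install line in and out is to install a package only when it is missing. Here is a minimal sketch of that idea; ensure_package is a hypothetical helper name, not a function from base R or factoextra.

```r
# ensure_package is a hypothetical helper, not part of base R or factoextra
ensure_package <- function(pkg) {
  # requireNamespace() returns FALSE (instead of erroring) if pkg is missing
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  invisible(requireNamespace(pkg, quietly = TRUE))
}

# Demonstrated with "stats", which ships with R, so nothing is downloaded here
stats_available <- ensure_package("stats")

# For this script the call would be: ensure_package("factoextra")
```

You would still call library(factoextra) afterwards, since this helper only checks that the package is installed.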
The next order of business is reading in your datafiles. R can handle many types of data files: .txt, .xlsx, .nc4, .csv, etc. It is much easier to deal with .csv and .txt files, so before reading in your datafiles, I saved each sheet as a separate .csv file. When saving Excel (.xlsx) sheets as .csv files, only the open sheet is saved (so you have to do this seven times in this case).
## Reading in the data
Raw_Data = read.csv('RawData.csv')
#View(Raw_Data)
In the code chunk above that starts with the commented line ## Reading in the data, we are reading in the raw datafile. Note that we only had to reference the datafile by the name 'RawData.csv' because we set our working directory to the folder containing this file. If this had not been done, we would have had to reference the entire file path, which can be a burden. Also note that #View(Raw_Data) is commented out. If you would like to view the datafile after you have read it in, run this line of code without the #. It is always advisable to look at the dataframe after it has been read into R and verify that everything looks ok.
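Besides View(), a few base R functions give a quick text summary of a freshly read dataframe. The sketch below writes a small made-up CSV to a temporary file (the column names Var1 and Var2 and all values are invented for illustration) and then inspects it.

```r
# Small made-up CSV written to a temporary file, standing in for 'RawData.csv'
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Sample,Var1,Var2",
             "S1,1.2,3.4",
             "S2,2.5,0.8",
             "S3,0.9,1.1"), csv_path)

Demo_Data <- read.csv(csv_path)

dim(Demo_Data)      # number of rows and columns
head(Demo_Data, 2)  # first two rows
str(Demo_Data)      # column names and types
```

Checking str() is especially useful before a PCA, because a numeric column that was accidentally read in as text will show up there.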
Below, we include the first 10 rows of the dataframe Raw_Data. This is not how it will look in R; it is just how it appears in an .html file, and I have hidden the code that creates this data table. Let’s check it to see if it was read in properly. If you scroll all the way to the right in this viewer, you will see that the last column was read in incorrectly. All other columns are fine, but R added a column of NA values with the column name X. This can sometimes occur when you save .xlsx files as .csv. I think this is because the column once held data and now appears as a “ghost” column (empty values) in the .csv file, but I am not sure. Let’s delete this column (see below).
## Deleting the last (26th) column
Raw_Data = Raw_Data[,-26]
You can reference the dataframe by the name we saved it as earlier: Raw_Data. You can reference a specific element with the code Raw_Data[1,2]: this will be the element in the first row and second column. You can reference an entire row with the code Raw_Data[1,]: this will select the first row. And, likewise, you can reference an entire column with the code Raw_Data[,26]: this will print the 26th column (the empty one).
In the code above we have overwritten our original Raw_Data object, saving it as the same dataset but without the 26th column. Observe below that the column is no longer present. All of the other datafiles had the same “ghost” column, but you should check yours if you do this on your own (maybe send me your code to check).
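The indexing rules and the “ghost” column can be reproduced on a tiny example. This is only a sketch under stated assumptions: the CSV below is made up, and it uses a trailing comma on every line as one possible way to get an empty extra column (the actual cause in RawData.csv is not known). It also shows dropping empty columns by their content rather than by a hard-coded position such as 26.

```r
# Made-up CSV with a trailing comma on every line, which is one way a
# spreadsheet export can produce an empty extra column
csv_path <- tempfile(fileext = ".csv")
writeLines(c("Sample,Var1,Var2,",
             "S1,1.2,3.4,",
             "S2,2.5,0.8,"), csv_path)
Demo_Data <- read.csv(csv_path)
names(Demo_Data)                 # the unnamed last column is read in as "X"

Demo_Data[1, 2]                  # single element: row 1, column 2
Demo_Data[1, ]                   # entire first row
Demo_Data[, 4]                   # entire fourth column (all NA)

# Keep only columns that contain at least one non-missing value
keep <- colSums(!is.na(Demo_Data)) > 0
Demo_Clean <- Demo_Data[, keep, drop = FALSE]
```

Selecting by content has the advantage that the same line works for a datafile where the empty column is not in position 26.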
Before we get started, note that I am closely following this tutorial: http://www.sthda.com/english/articles/31-principal-component-methods-in-r-practical-guide/118-principal-component-analysis-in-r-prcomp-vs-princomp/. The tutorial goes into greater detail and could be a useful reference. There are two main functions for PCA in R: 1) prcomp() and 2) princomp(). They correspond to the two general methods for performing PCA: prcomp() uses a singular value decomposition of the centered (and optionally scaled) data matrix, while princomp() uses an eigen (spectral) decomposition of the covariance or correlation matrix.
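To get a feel for how closely prcomp() and princomp() agree, here is a small sketch on USArrests, a dataset that ships with R and is used here only as a stand-in for the real data. It compares the component standard deviations and the loadings from the two functions when both work on standardized variables.

```r
# USArrests is a dataset that ships with R; it stands in for Raw_Data here
pca_svd   <- prcomp(USArrests, scale. = TRUE)   # singular value decomposition
pca_eigen <- princomp(USArrests, cor = TRUE)    # eigen decomposition of the correlation matrix

# Standard deviations of the principal components agree
sdev_match <- isTRUE(all.equal(unname(pca_svd$sdev), unname(pca_eigen$sdev)))

# Loadings agree up to the sign of each component
load_match <- isTRUE(all.equal(abs(unname(pca_svd$rotation)),
                               abs(unname(unclass(pca_eigen$loadings)))))
```

The sign of a component is arbitrary, which is why the loadings are compared in absolute value. The rest of this script uses prcomp().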
First, we will select our data of interest. If you would like to use one of the other datasets, put its name in place of Raw_Data. Note that we did not take the first column because it contains the sample names.
## Selecting data of interest
Data_of_Int = Raw_Data[,2:ncol(Raw_Data)]
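As a small illustration of this selection step, the sketch below uses a made-up dataframe (Demo_Data, with invented column names and values) to show two equivalent ways of dropping the first column, an optional way to keep the sample names as row names, and a check that everything left is numeric.

```r
# Made-up stand-in for Raw_Data: first column holds sample names
Demo_Data <- data.frame(Sample = c("S1", "S2", "S3"),
                        Var1 = c(1.2, 2.5, 0.9),
                        Var2 = c(3.4, 0.8, 1.1))

# Equivalent ways to drop the first column
by_range    <- Demo_Data[, 2:ncol(Demo_Data)]
by_negative <- Demo_Data[, -1]
identical(by_range, by_negative)

# Optionally carry the sample names along as row names
Demo_of_Int <- Demo_Data[, -1]
rownames(Demo_of_Int) <- Demo_Data[, 1]

# PCA needs numeric input, so confirm every remaining column is numeric
all_numeric <- all(sapply(Demo_of_Int, is.numeric))
```

Setting row names is optional, but it lets plots of individuals label each point with its sample name instead of a row number.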
I have given up taking notes here, but I can explain in more detail if needed. There are 3 plots for each of 12 combinations (2 × 2 × 3): Log vs. No Log, Standardized vs. Not Standardized, and All Variables vs. All Minus Linear vs. All Minus Linear and Distance, so 36 plots in total. In each group, the first plot is a scree plot, which shows how much variance is explained by each PC; the second is a plot of the PC loadings; and the third shows the samples plotted in PC space.
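The count of combinations can be checked with expand.grid(), and the repeated code could in principle be wrapped in a function. The sketch below is one possible way to do that, not the approach used in this script: run_pca and its draw argument are hypothetical names, and USArrests (a dataset that ships with R) stands in for the real data.

```r
# Enumerate the combinations described above
combos <- expand.grid(
  logged       = c(FALSE, TRUE),
  standardized = c(TRUE, FALSE),
  variables    = c("All", "Minus Linear", "Minus Linear & Distances"),
  stringsAsFactors = FALSE
)
nrow(combos)       # number of combinations
nrow(combos) * 3   # number of plots

# run_pca is a hypothetical helper: fit a PCA on the chosen columns and,
# if requested and factoextra is installed, draw the three plots
run_pca <- function(data, cols, standardized, draw = FALSE) {
  fit <- prcomp(data[, cols], scale. = standardized)
  if (draw && requireNamespace("factoextra", quietly = TRUE)) {
    print(factoextra::fviz_eig(fit))
    print(factoextra::fviz_pca_var(fit, col.var = "contrib", repel = TRUE))
    print(factoextra::fviz_pca_ind(fit, col.ind = "cos2", repel = TRUE))
  }
  invisible(fit)
}

# Demonstrated on USArrests rather than Raw_Data; set draw = TRUE for plots
fit <- run_pca(USArrests, cols = 1:4, standardized = TRUE)
```

The code that follows spells out each combination in full instead, which is longer but easier to run one block at a time.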
## Standardized raw data: Full
Data_of_Int = Raw_Data[,2:ncol(Raw_Data)]
## Standardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = TRUE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
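The three plots are all built from pieces of the object returned by prcomp(). As a rough sketch, again using USArrests (which ships with R) as a stand-in for Data_of_Int, the underlying numbers can be looked at directly. Note that the variable plot is derived from the loadings rather than showing fit$rotation unchanged.

```r
# USArrests (ships with R) stands in for Data_of_Int
fit <- prcomp(USArrests, scale. = TRUE)

# Scree plot: proportion of variance explained by each PC
var_explained <- fit$sdev^2 / sum(fit$sdev^2)
round(100 * var_explained, 1)

# Loadings: how each variable enters each PC
fit$rotation[, 1:2]

# Individuals plot: coordinates of each sample on the first two PCs
head(fit$x[, 1:2])
```

Calling summary(fit) prints the same variance proportions along with their cumulative totals.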
## Standardized raw data: Minus Linear
Data_of_Int = Raw_Data[,c(2:14,17:ncol(Raw_Data))]
## Standardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = TRUE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Standardized raw data: Minus Linear & Distances
Data_of_Int = Raw_Data[,c(4:14,17:ncol(Raw_Data))]
## Standardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = TRUE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
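The blocks that follow repeat the analysis without standardizing. To illustrate why that choice matters, here is a sketch on USArrests (a dataset that ships with R, used only as a stand-in): when variables are on very different scales and are not standardized, the variable with the largest variance tends to dominate the first PC.

```r
# USArrests (ships with R) has variables on very different scales
sapply(USArrests, var)                         # per-variable variances

fit_scaled   <- prcomp(USArrests, scale. = TRUE)
fit_unscaled <- prcomp(USArrests, scale. = FALSE)

# Variable with the largest absolute loading on PC1 in the unscaled fit,
# and the variable with the largest variance
top_unscaled <- names(which.max(abs(fit_unscaled$rotation[, 1])))
top_variance <- names(which.max(sapply(USArrests, var)))

# Share of total variance captured by PC1 in each fit
share_scaled   <- fit_scaled$sdev[1]^2 / sum(fit_scaled$sdev^2)
share_unscaled <- fit_unscaled$sdev[1]^2 / sum(fit_unscaled$sdev^2)
```

In this example the same variable comes out on top in both checks, so the unstandardized PC1 is largely a restatement of that single variable. Whether the same happens with the data here depends on how different the column variances are.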
## Unstandardized raw data: Full
Data_of_Int = Raw_Data[,2:ncol(Raw_Data)]
## Unstandardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = FALSE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Unstandardized raw data: Minus Linear
Data_of_Int = Raw_Data[,c(2:14,17:ncol(Raw_Data))]
## Unstandardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = FALSE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Unstandardized raw data: Minus Linear & Distances
Data_of_Int = Raw_Data[,c(4:14,17:ncol(Raw_Data))]
## Unstandardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = FALSE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Logged Data
Raw_Logs = read.csv('Logs_RawData.csv')
Raw_Logs = Raw_Logs[,-26]
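Here the logged values are read from a separate file. As an alternative sketch, a log transform could also be applied inside R. The example below uses a made-up dataframe, and the choice of log10() is an assumption, since the base used to create 'Logs_RawData.csv' is not stated.

```r
# Made-up positive values standing in for Raw_Data (first column = sample names)
Demo_Data <- data.frame(Sample = c("S1", "S2", "S3"),
                        Var1 = c(1, 10, 100),
                        Var2 = c(0.5, 5, 50))

Demo_Logs <- Demo_Data
# log10 is an assumption; the base used for 'Logs_RawData.csv' is not stated
Demo_Logs[, -1] <- log10(Demo_Data[, -1])

# Zeros or negative values would give -Inf or NaN, which prcomp() cannot handle
all_finite <- all(is.finite(as.matrix(Demo_Logs[, -1])))
```

If any column contains zeros or negative values, the finite check will fail and those values need to be dealt with before running the PCA.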
## Standardized raw data: Full
Data_of_Int = Raw_Logs[,2:ncol(Raw_Logs)]
## Standardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = TRUE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Standardized raw data: Minus Linear
Data_of_Int = Raw_Logs[,c(2:14,17:ncol(Raw_Logs))]
## Standardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = TRUE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Standardized raw data: Minus Linear & Distances
Data_of_Int = Raw_Logs[,c(4:14,17:ncol(Raw_Logs))]
## Standardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = TRUE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Unstandardized raw data: Full
Data_of_Int = Raw_Logs[,2:ncol(Raw_Logs)]
## Unstandardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = FALSE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Unstandardized raw data: Minus Linear
Data_of_Int = Raw_Logs[,c(2:14,17:ncol(Raw_Logs))]
## Unstandardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = FALSE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Unstandardized raw data: Minus Linear & Distances
Data_of_Int = Raw_Logs[,c(4:14,17:ncol(Raw_Logs))]
## Unstandardized
PRcomp_Stand <- prcomp(Data_of_Int, scale. = FALSE)
fviz_eig(PRcomp_Stand)
## Plot of the loadings
fviz_pca_var(PRcomp_Stand,
col.var = "contrib", # Color by contributions to the PC
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
## Plot of individuals
fviz_pca_ind(PRcomp_Stand,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)