The goal of this lab is to have you all recreate the coding and analysis behind a typical Principal Component Analysis.

Please follow the tutorial linked here: https://bioinfo4all.wordpress.com/2021/01/31/tutorial-6-how-to-do-principal-component-analysis-pca-in-r/

This tutorial does a good job of describing why we use PCA and applies it to common data types for bioinformatics. As you go through the tutorial, copy and paste code below to perform the tasks. Additionally, I have a few questions you need to answer.

NOTE: I have already altered the .csv file you need for this. It is on Canvas as “cell.csv”. Bring in this file using the code below (your working directory will need to be set correctly):

cell <- read.csv("cell.csv", row.names = 1)  

The code above calls the data frame “cell”. If for some reason you end up calling your data something else, you will need to alter the rest of the code accordingly. Add code chunks below this, starting the tutorial with the “install.packages” code on the website:

cell <- read.csv("cell.csv", row.names = 1)  
library("factoextra")
## Loading required package: ggplot2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library("FactoMineR")
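# PCA on every column except the first; scale.unit = TRUE scales each variable to unit variance
pca.data <- PCA(cell[, -1], scale.unit = TRUE, graph = FALSE)
# Scree plot: percentage of variance explained by each dimension
fviz_eig(pca.data, addlabels = TRUE, ylim = c(0, 70))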

fviz_pca_var(pca.data, col.var = "cos2",
             gradient.cols = c("#FFCC00", "#CC9933", "#660033", "#330033"),
             repel = TRUE) 

# Transpose so that the columns of cell become the individuals in this PCA
pca.data <- PCA(t(cell[, -1]), scale.unit = TRUE, graph = FALSE)
fviz_pca_ind(pca.data, col.ind = "cos2", 
                  gradient.cols = c("#FFCC00", "#CC9933", "#660033", "#330033"), 
                  repel = TRUE)

library(ggpubr) 
a <- fviz_pca_ind(pca.data, col.ind = "cos2", 
                  gradient.cols = c("#FFCC00", "#CC9933", "#660033", "#330033"), 
                  repel = TRUE)
ggpar(a,
      title = "Principal Component Analysis",
      xlab = "PC1", ylab = "PC2",
      legend.title = "Cos2", legend.position = "top",
      ggtheme = theme_minimal())

# Back to the untransposed data; ncp = 2 keeps only the first two dimensions in the results
pca.data <- PCA(cell[, -1], scale.unit = TRUE, ncp = 2, graph = FALSE)
cell$lineage <- as.factor(cell$lineage)
library(RColorBrewer)
nb.cols <- 3
mycolors <- colorRampPalette(brewer.pal(3, "Set1"))(nb.cols)
a <- fviz_pca_ind(pca.data, col.ind = cell$lineage,
                  palette = mycolors, addEllipses = TRUE)
ggpar(a,
      title = "Principal Component Analysis",
      xlab = "PC1", ylab = "PC2",
      legend.title = "Cell type", legend.position = "top",
      ggtheme = theme_minimal())

QUESTIONS:

  1. Why does the author of this tutorial suggest PCA is a powerful analysis tool?

The author suggests PCA is a powerful analysis tool because it reduces the dimensions of the data, helps make sense of big data, gives an overall shape of the data, and identifies which samples are similar and which are different.
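As a rough illustration of the "reduces data dimensions" point, the sketch below runs a PCA on R's built-in iris data. Both the data set and the function are assumptions made for the example: iris stands in for cell.csv, and base R's prcomp() is used in place of FactoMineR's PCA() so the code runs without extra packages.

```r
# Stand-in data: iris ships with R; it is NOT the lab's cell.csv.
measurements <- iris[, 1:4]   # four numeric variables, 150 rows

# prcomp() is base R's PCA. scale. = TRUE standardises each variable,
# similar in spirit to scale.unit = TRUE in FactoMineR::PCA().
pca_demo <- prcomp(measurements, center = TRUE, scale. = TRUE)

# Keep only the scores on the first two principal components.
scores_2d <- pca_demo$x[, 1:2]

dim(measurements)   # rows x original variables
dim(scores_2d)      # same rows, now described by two components
```

Each row keeps its identity, but it is now described by two component scores instead of four original measurements, which is what makes a 2D plot of the samples possible.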

  2. Why do you think this data makes sense for working with PCAs?

This data makes sense for PCA because it has multiple numerical variables (dimensions), and the first two principal components stand out very clearly. There are also multiple types of PCA plots here that look at the data from different angles to find patterns.
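PCA tends to be most useful when the variables are numeric and at least some of them are correlated, because correlated variables can be summarised by fewer components. A minimal sketch of how one might check both conditions is below. It again uses iris as a stand-in for cell.csv, and the names demo_data, all_numeric, and max_abs_cor are hypothetical.

```r
# Stand-in data (assumption): the four numeric columns of iris.
demo_data <- iris[, 1:4]

# 1. Are all variables numeric? PCA needs numeric input.
all_numeric <- all(sapply(demo_data, is.numeric))

# 2. How strongly are the variables correlated with each other?
cor_matrix <- cor(demo_data)
off_diagonal <- cor_matrix[upper.tri(cor_matrix)]  # each pair once
max_abs_cor <- max(abs(off_diagonal))

all_numeric
round(cor_matrix, 2)
max_abs_cor
```

For the lab data, the same check could be run on cell[, -1], assuming those columns are the numeric ones passed to PCA().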

  3. How do you interpret the Scree plot?

The scree plot shows the dimensions (principal components) of the sample data set and the percentage of variance each one explains. It shows which dimensions explain the most variation; the first two, which explain the largest share, are the first two principal components plotted as PC1 and PC2.
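The percentages on a scree plot come from the eigenvalues of the PCA: each eigenvalue divided by their sum. The sketch below computes them by hand with base R's prcomp() on iris (a stand-in for cell.csv, and a swap for FactoMineR's PCA(), both assumptions for illustration). For the FactoMineR object used in this lab, the equivalent table should be available as pca.data$eig.

```r
# Stand-in PCA (assumption): base R prcomp() on the numeric iris columns.
pca_demo <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Eigenvalues are the squared standard deviations of the components.
eigenvalues <- pca_demo$sdev^2

# Percentage of variance explained by each component (the scree plot bars).
pct_var <- 100 * eigenvalues / sum(eigenvalues)

# Running total: how much variance the first k components explain together.
cum_pct <- cumsum(pct_var)

data.frame(component = paste0("PC", seq_along(pct_var)),
           pct_var = round(pct_var, 1),
           cum_pct = round(cum_pct, 1))
```

The bars always decrease from left to right, and the cumulative column shows how much is retained when only the first two components are plotted.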

  4. How do you interpret the final PCA plot you make?

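The final plot shows each individual (each row of cell) as a point colored by its cell type (lineage), with an ellipse drawn around each group. The more tightly points of the same color cluster, and the less the groups' ellipses overlap, the more clearly the cell types are separated along PC1 and PC2.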
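One way to put numbers behind a visual reading of group separation is to compute each group's centre on PC1 and PC2 and the distances between those centres. The sketch below does this with base R on iris, where Species plays the role of lineage. The stand-in data, the use of prcomp() instead of FactoMineR's PCA(), and the names scores and centroids are all assumptions for illustration.

```r
# Stand-in PCA (assumption): prcomp() on iris; Species stands in for lineage.
pca_demo <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Scores on the first two components, with the grouping factor attached.
scores <- as.data.frame(pca_demo$x[, 1:2])
scores$group <- iris$Species

# Mean PC1 and PC2 per group: the centre of each cloud of points.
centroids <- aggregate(cbind(PC1, PC2) ~ group, data = scores, FUN = mean)
centroids

# Euclidean distances between group centres in the PC1/PC2 plane.
centroid_dist <- dist(centroids[, c("PC1", "PC2")])
centroid_dist
```

Larger distances between centres, relative to the spread of points within each group, correspond to ellipses that overlap less on the plot.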