The goal of this lab is to have you all recreate the coding and analysis behind a typical Principal Component Analysis.

Please follow the tutorial linked here: https://bioinfo4all.wordpress.com/2021/01/31/tutorial-6-how-to-do-principal-component-analysis-pca-in-r/

Load libraries

library("factoextra")
library("FactoMineR")

Read in the cell CSV file

data <- read.csv("cell.csv", row.names = 1)

Creating the scree plot and the variable correlation plot:

pca.data <- PCA(data[,-1], scale.unit = TRUE, graph = FALSE)
fviz_eig(pca.data, addlabels = TRUE, ylim = c(0, 70))
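The percentages shown on a scree plot can also be computed by hand. The following is a minimal sketch, assuming base R's prcomp as a stand-in for FactoMineR's PCA and the built-in iris data as a stand-in for cell.csv (both are assumptions made only for illustration; the names p, var_explained and cum_explained are hypothetical).

```r
# Stand-in data (assumption): the four numeric columns of the built-in iris set.
X <- iris[, 1:4]

# Scaled PCA in base R, comparable in spirit to scale.unit = TRUE.
p <- prcomp(X, center = TRUE, scale. = TRUE)

# Each component's variance is its squared standard deviation.
# Dividing by the total gives the share plotted on a scree plot.
var_explained <- p$sdev^2 / sum(p$sdev^2) * 100

# Running total: how much variance the first k components capture together.
cum_explained <- cumsum(var_explained)

print(round(data.frame(component = seq_along(var_explained),
                       percent = var_explained,
                       cumulative = cum_explained), 1))
```

The cumulative column is the number to read when deciding how many components are worth keeping.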

fviz_pca_var(pca.data, col.var = "cos2",
             gradient.cols = c("#FFCC00", "#CC9933", "#660033", "#330033"),
             repel = TRUE) 
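The cos2 value used to color the variables measures how well each variable is represented by a component. A rough sketch of the idea, assuming base R's prcomp in place of FactoMineR's PCA and the built-in iris data in place of cell.csv (both assumptions; the variable names are hypothetical):

```r
# Stand-in data (assumption): the four numeric columns of iris.
X <- iris[, 1:4]

# Scaled PCA in base R.
p <- prcomp(X, center = TRUE, scale. = TRUE)

# Correlation between each original variable and each component's scores.
var_cor <- cor(X, p$x)

# cos2 is the squared correlation: the share of a variable's variance
# that a given component captures.
var_cos2 <- var_cor^2

# Quality of representation on the PC1/PC2 plane: sum over the first two components.
cos2_plane <- rowSums(var_cos2[, 1:2])
print(round(cos2_plane, 3))
```

A variable whose cos2 on the PC1/PC2 plane is close to 1 is almost fully described by those two components; across all components, the cos2 values of a variable sum to 1.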

Creating the initial comparison plot

pca.data <- PCA(t(data[,-1]), scale.unit = TRUE, graph = FALSE)
fviz_pca_ind(pca.data, col.ind = "cos2", 
                  gradient.cols = c("#FFCC00", "#CC9933", "#660033", "#330033"), 
                  repel = TRUE)
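Transposing the data with t() changes what counts as an individual: the rows passed to the PCA are the points that get plotted. A small sketch of that effect, assuming base R's prcomp and the built-in iris data as stand-ins (assumptions for illustration only):

```r
# Stand-in data (assumption): 150 rows (flowers) by 4 columns (measurements).
X <- as.matrix(iris[, 1:4])

# PCA on the data as given: one point per row of X.
p_rows <- prcomp(X, center = TRUE, scale. = TRUE)

# PCA on the transpose: one point per column of X.
p_cols <- prcomp(t(X), center = TRUE, scale. = TRUE)

print(dim(p_rows$x))       # scores for the rows of X
print(rownames(p_cols$x))  # scores are now labeled by the columns of X
```

In the transposed analysis the plotted points carry the original column names, which is why the individuals plot changes after t() is applied.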

library(ggpubr) 

Changing the labels of the initial PCA plot

a <- fviz_pca_ind(pca.data, col.ind = "cos2", 
                  gradient.cols = c("#FFCC00", "#CC9933", "#660033", "#330033"), 
                  repel = TRUE)
ggpar(a,
      title = "Principal Component Analysis",
      xlab = "PC1", ylab = "PC2",
      legend.title = "Cos2", legend.position = "top",
      ggtheme = theme_minimal())

Reformatting the PCA data and changing the lineage column to a factor

pca.data <- PCA(data[,-1], scale.unit = TRUE, ncp = 2, graph = FALSE)
data$lineage <- as.factor(data$lineage)

Loading the package for the color scheme

library(RColorBrewer)
nb.cols <- 3
mycolors <- colorRampPalette(brewer.pal(3, "Set1"))(nb.cols)
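colorRampPalette interpolates between the colors it is given, so it mainly matters when more colors are needed than the base palette provides. A base-R sketch of that behavior; the three hex codes are written out by hand and are assumed to match brewer.pal(3, "Set1"), so RColorBrewer is not required here:

```r
# Assumed to equal brewer.pal(3, "Set1"); typed by hand for this sketch.
set1_3 <- c("#E41A1C", "#377EB8", "#4DAF4A")

# Build an interpolating palette function from the three base colors.
ramp <- colorRampPalette(set1_3)

# Asking for as many colors as were supplied: nothing new to interpolate.
same_size <- ramp(3)

# Asking for more: intermediate colors are inserted between the originals.
more <- ramp(5)

print(same_size)
print(more)
```

With nb.cols set to 3, the interpolation step should leave the palette essentially unchanged; it becomes useful if the number of groups grows beyond what the base palette offers.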

Creating the final PCA plot

a <- fviz_pca_ind(pca.data, col.ind = data$lineage,
                  palette = mycolors, addEllipses = TRUE)
ggpar(a,
      title = "Principal Component Analysis",
      xlab = "PC1", ylab = "PC2",
      legend.title = "Cell type", legend.position = "top",
      ggtheme = theme_minimal())

QUESTIONS:

  1. Why does the author of this tutorial suggest PCA is a powerful analysis tool? The author states that PCA in R is useful because it reduces the dimensionality of the data, which makes the data easier to interpret, gives the data an overall shape, and indicates which samples are similar and which are different.

  2. Why do you think this data makes sense for working with PCAs? This data set looks at lineage-specific genes for 3 different cell types. Plotting PC1 (the direction with the maximum amount of variation in the data set) against PC2 (the direction with the second-largest amount of variation) ensures that we are looking at the factors responsible for the largest changes in the data. I think this makes sense for this data set because, once the values are standardized, we can use PCA to examine how these genes relate to one another, how they correlate with the different cell types, and to what degree.

  3. How do you interpret the Scree plot? The scree plot shows how much of the total variance each principal component explains. The first two components (PC1 and PC2) together account for 81.6% of the explained variance in this data set. The third component accounts for a further 7.3%, but we did not use it in our analysis because our plots only use PC1 and PC2.

  4. How do you interpret the final PCA plot you make? This plot feels a little chaotic, but it gives a good visual representation of the genes for the different cell lineages and of how they relate to PC1 and PC2. Some of the TE genes, such as KRT8 and KRT18, have a much more positive relationship with PC1 than with PC2, so if I were to investigate these genes further I would know that another covariate was influencing the data, and those genes in particular. The ovals are good visual markers for the trend of the data: the green oval is tilted downwards, which follows the trend of its genes, and the pink/red oval follows the trend of the EPI genes, which seem to have a relationship with both PC1 and PC2.
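
The description of PC1 and PC2 in question 2 (standardized values, with PC1 carrying the most variation and PC2 the second most) can be checked numerically. This is a minimal sketch, assuming base R's prcomp as a stand-in for FactoMineR's PCA and the built-in iris data as a stand-in for cell.csv; all variable names are hypothetical.

```r
# Stand-in data (assumption): the four numeric columns of iris.
X <- iris[, 1:4]

# Standardizing gives every variable mean 0 and standard deviation 1,
# so no variable dominates just because of its units.
Z <- scale(X)
print(round(colMeans(Z), 10))
print(apply(Z, 2, sd))

# Scaled PCA in base R.
p <- prcomp(X, center = TRUE, scale. = TRUE)

# Variance of the scores along each component, largest first.
pc_var <- apply(p$x, 2, var)
print(round(pc_var, 3))

# The same numbers are the eigenvalues of the correlation matrix.
eig <- eigen(cor(X))
print(round(eig$values, 3))
```

The variances come out in decreasing order, which is what makes PC1 and PC2 the natural pair of axes for a two-dimensional plot.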