Principle Component Analysis

Unsupervised techniques are often used in the analysis of genomic data. For example, we’ve seen that heatmaps can organize data into meaningful patterns and show us which patients have similar gene expression profiles.

In addition, principle components analysis (PCA) is another popular tool for genomics data. PCA is a dimensionality reduction method that reduces the dimensionality of large data sets by transforming a large set of variables into a smaller set that still contains most of the information in the large set.

Both heatmaps and PCA are unsupervised techniques. This means that we don’t make use of identifying information, for example clinical data, until after the analysis when we check whether our results agree with this information.


The NCI-60 Data Set

NCI-60 is a collection of data, including genomic, transcriptomic, and proteomic information, for 60 different human cancer cell lines representing nine major cancer types: breast, central nervous system, colon, kidney, leukemia, lung, melanoma, ovary, and prostate.


Loading the data

NCI60 is included in R but we have to check out the package.

# the package ISLR contains the NCI data
library(ISLR)

# The cancer types that the cell lines represent
nci.labs=NCI60$labs

# Expression levels on 6830 genes from 64 cancer cell lines
nci.data=NCI60$data

Visualizing the data

In the next code block, use tools you know to find out what the nci.labs and nci.data objects look like:

  • one-dimensional arrays and matrices?
  • numeric or character data?
  • length or dimensions of the data?
  • what does the data look like?
  • can you make tables?
# The dimensions of nci.data

print(dim(nci.labs))
## NULL

A NULL means that the object is likely not a data structure that supports dimensions. This means that the data does not have rows and columns and instead could be a matrix, data fram, or array.

# to see what the nci.labs data looks like
print((nci.labs))
##  [1] "CNS"         "CNS"         "CNS"         "RENAL"       "BREAST"     
##  [6] "CNS"         "CNS"         "BREAST"      "NSCLC"       "NSCLC"      
## [11] "RENAL"       "RENAL"       "RENAL"       "RENAL"       "RENAL"      
## [16] "RENAL"       "RENAL"       "BREAST"      "NSCLC"       "RENAL"      
## [21] "UNKNOWN"     "OVARIAN"     "MELANOMA"    "PROSTATE"    "OVARIAN"    
## [26] "OVARIAN"     "OVARIAN"     "OVARIAN"     "OVARIAN"     "PROSTATE"   
## [31] "NSCLC"       "NSCLC"       "NSCLC"       "LEUKEMIA"    "K562B-repro"
## [36] "K562A-repro" "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"    "LEUKEMIA"   
## [41] "LEUKEMIA"    "COLON"       "COLON"       "COLON"       "COLON"      
## [46] "COLON"       "COLON"       "COLON"       "MCF7A-repro" "BREAST"     
## [51] "MCF7D-repro" "BREAST"      "NSCLC"       "NSCLC"       "NSCLC"      
## [56] "MELANOMA"    "BREAST"      "BREAST"      "MELANOMA"    "MELANOMA"   
## [61] "MELANOMA"    "MELANOMA"    "MELANOMA"    "MELANOMA"
# place the data in a table 
table(nci.labs)
## nci.labs
##      BREAST         CNS       COLON K562A-repro K562B-repro    LEUKEMIA 
##           7           5           7           1           1           6 
## MCF7A-repro MCF7D-repro    MELANOMA       NSCLC     OVARIAN    PROSTATE 
##           1           1           8           9           6           2 
##       RENAL     UNKNOWN 
##           9           1

It looks like the nci.labs data gives us categorical data of the amount of patients that have certain types of cancer. Now lets take a look at nci.data

# dimensions of nci.data

dim(nci.data)
## [1]   64 6830
# nci.data is two dimensional

dim_nci.data <- dim(nci.data)

print(paste("nci.data has", dim_nci.data[1], 
            "rows and", dim_nci.data[2], "columns."))
## [1] "nci.data has 64 rows and 6830 columns."

This aligns with the fact that the data represents expression levels on 6830 genes from 64 cancer cell lines.

# to look at the first 5 rows and columns of the data

print(nci.data[1:5,1:5])
##           1         2         3         4         5
## V1 0.300000  1.180000  0.550000  1.140000 -0.265000
## V2 0.679961  1.289961  0.169961  0.379961  0.464961
## V3 0.940000 -0.040000 -0.170000 -0.040000 -0.605000
## V4 0.280000 -0.310000  0.680000 -0.810000  0.625000
## V5 0.485000 -0.465000  0.395000  0.905000  0.200000

The fact that the data has 6830 columns means that there are 6830 dimensions (variables) to consider. Thus, preforming PCA on this dataset can help us reduce the dimensions for easier visulization and analysis.


Perform PCA

We first perform PCA on the data after scaling the variables (genes) to have standard deviation one, although one could reasonably argue that it is better not to scale the genes.

# prcomp() is the function for Principal Component Analysis (PCA)
pr.out=prcomp(nci.data , scale=TRUE)

# the scale=TRUE means that the data is being scaled 

What is the output?

summary(pr.out)

How many principal components are obtained? Why 64?

There are a total of 64 prinipal components because there were originally 64 rows (observations) in my data. Beacuase of this, our cumulative proportion is 100 (meaning our data is the explaination in the variation in our data). Looking at our output we can also see that the first 7 principal components explains around 50% of our data variation.

You learn more about the output of prcomp() in this Geeks for Geeks tutorial.


Proportion of variance explained

It is more informative to plot 1) the proportion of variance explained (PVE) of each principal component, called a scree plot and 2) the cumulative PVE of each principal component.

This can be done with just a little work.

#  Add your comments

pve =100*pr.out$sdev ^2/sum(pr.out$sdev ^2)

par(mfrow=c(1,2))

plot(cumsum(pve), type="o", ylab="Cumulative PVE", xlab="Principal Component ", col="brown3", main = "Cummulative PVE")

plot(pve , type="o", ylab="PVE", xlab=" Principal Component ", col="blue", main = "Scree plot")

From the cummulative PVE plot, the first seven principal components explain around 40% of the variance in the data. This isn’t a huge amount of variance.

However, in the scree plot, there is a marked decrease in the variance explained by considering additional principal components.

That is, there is an elbow in the plot. This suggests that there may be little benefit to examining more than seven or so principal components. Every other principal component only adds a small bit of variance to our data, meaning we can only focus on the first 7.

It looks like we’ve successfully reduced the dimensionality of our data from 64 to 7.


Visualize the results

To visualize the data, we plot the first few principal component score vectors. The observations (cell lines) corresponding to a given cancer type will be plotted in the same color, so that we can see to what extent the observations within a cancer type are similar to each other.


Create a new function

We create a simple function that assigns a distinct color to each element of a numeric vector. The function will be used to assign a color to each of the 64 cell lines, based on the cancer type to which it corresponds.

# Add your own comments here
# What is the rainbow function? It creates a vector of n contiguous colors

Cols=function (vec){
 cols=rainbow (length(unique(vec)))
 return(cols[as.numeric (as.factor(vec))])
}

Plot PC score vectors

We will create plots to examine whether there first few principal component vectors contain useful (discriminating) information.

# Add you comments here

par(mfrow=c(1,2))
plot(pr.out$x[,1:2], col=Cols(nci.labs), pch=19, xlab="Z1",ylab="Z2")
plot(pr.out$x[,c(1,3)], col=Cols(nci.labs), pch=19, xlab="Z1",ylab="Z3")

If the different colors refer to different cancer types that the cell lines represent, do you think there has been any successful data clustering? In other words, do cell lines corresponding to a single cancer type cluster together? The graph represents the correlation between the different principal components for each type of cancer from our data set.


Challenge

Learn more about doing PCA analysis in R. Can you make a legend on the colored plots that describe which cell lines are represented by the colors?