Heatmaps and clustering

About this activity

Last time, we examined the mRNA levels for 18,351 genes across 1,082 breast cancer patients.

We saw that the average expression of the genes (across patient samples) varied greatly. By taking the log of the expression values, we compressed the data (or reduced the range of variation) so that the genes were more comparable. This also made the data more “normal.” In other words, the histogram (or distribution) of the log-tranformed average expression values looked more like a bell curve.

Each column in the log-transformed expression matrix corresponds to a different patient, and each patient has their own set of gene expression values. We can consider the expression matrix as a set of 1,082 objects, one for each patient.

Each row in the log-transformed expression matrix corresponds to a different gene, and, from this perspective, we can consider the expression matrix as a set of 18,351 objects, one for each genes.

As we saw last time, heatmap() can reorder objects in our dataset in such a way that objects in the same group are more similar to each other than to those in other groups. This reordering (more technically called cluster analysis) is a main task of exploratory data analysis, and a common technique for statistical data analysis in bioinformatics.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

The data directory

We created an object that holds the name of the directory where the TCGA data resides. This is good R coding practice because we can apply all we do below to a data set in a different directory by changing a single variable (data_dir).

Loading the data

Throughout the course, we have been loading data from R files with the extension .RData. An example is the file TCGA_brca.RData. In this Module, we’ll load a different object, TCGA_brca_log.RData that contains the log-transformed expression matrix we created.

.RData files are binary files (i.e., not human readable) that can store multiple R objects, such as vectors, lists, matrices, and data frames.

# We `load()` the file "TCGA_brca_log.RData".
# The file.path() function tells `load()` where our data lives
# The argument "verbose = TRUE" makes `load()` tell us what R objects are being loaded
# The objects will also appear in our "Environment" tab.

load(file.path(data_dir, "TCGA_brca_log.RData"),verbose=TRUE)

## Loading objects:
##   brca_clin_df
##   brca_expr_mat.log

Objects were named to be be descriptive.

“brca” stands for “TCGA breast cancer”
“clin” is short for “clinical”
“df” is short for “data frame,” a table of data
“expr” stands for “gene expression data”
“mat” means the expression data is in matrix form
“log” reminds us that the data was log-transformed

Viewing the data

The Environment tab gives us basic information on the object that was loaded.

We will use some of the R functions we learned previously to examine the data in more detail.

dim() tells us the dimensions (# row, # columns) of the objects.
head() shows us the top several rows of the objects.
indexing allows us to look at the rows and columns we choose.

dim_mat <- dim(brca_expr_mat.log)  # We should get a result consistent
                                        # with what is in the Environment tab

print(paste("The log-transformed gene expression matrix has",dim_mat[1],
              "rows and",dim_mat[2],"columns."))

## [1] "The log-transformed gene expression matrix has 18351 rows and 1082 columns."

# The function paste() allows us to print text and numbers together.

Let’s see what’s in the rows and columns.

brca_expr_mat.log[1:5,1:5]

##          TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## TSPAN6      7.5636463     7.705439     9.975045    10.110718     9.881960
## TNMD        0.4272843     1.061776     5.353718     1.164271     2.506120
## DPM1        9.0367945     9.662019     9.919403     8.859174     8.928912
## SCYL3       8.3493520    10.607275     8.395877     9.103957     8.704889
## C1orf112    6.9691735     8.106778     7.922947     7.776269     7.495535

Rows are named with the genes whose expression values were measured. Columns are named with the patient barcode identifier.

“TSPAN6” is the name of the gene that codes for the the protein Tetraspanin-6. “TCGA-3C-AAAU” is the anonymized identifier for one patient’s sample.

Heatmaps

A heatmap is a graphical representation of data that uses a system of color-coding to represent different values. Heat maps make it easy to visualize complex data and understand it at a glance. We won’t be able to make a heatmap of the entire data set — there are just too many genes — so we’ll look at a subset of genes.

Most variable genes

To reduce the number of genes to consider, we will select the most important ones for our exploratory analysis. There are many different ways to do this, and our choice here can lead to profoundly different views of the data.

One common approach to select genes in an unbiased way is to find the ones with the biggest differences across patient samples, or in statistics language, the ones with the highest variance.

Let’s consider the 500 genes with the highest variance.

We’ll order our matrix accordingly and make a matrix for the top 500 variable genes.

NUM.GENES <- 500                        # How many genes we want to consider.
V <- apply(brca_expr_mat.log, 1, var)   # Use the function `apply()`
                                        # with the function `var()`

ORD <- order(V, decreasing=TRUE)   # `order()` ranks the values for us
                                 # decreasing=TRUE for largest first

brca.log.sub <- brca_expr_mat.log[ORD[1:NUM.GENES],]

# `brca.log.sub` is a subset of the `brca_expr_mat.log` matrix 
# and contains data only for the 500 most variable genes

Let’s take a look at the data, first in matrix form:

brca.log.sub[1:5,1:5]

##         TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## CLEC3A      2.454334    11.768726    0.0000000     5.376641     7.493047
## CPB1       16.161348     4.243707   11.1077402     6.928501     2.707503
## SCGB2A2    15.969238    16.161817    0.0000000     8.744171    14.983354
## SCGB1D2    12.231416    10.534089    0.9310022     6.549621     9.335627
## TFF1        6.326816     7.546671    7.8249460    13.595630    12.354572

Boxplots

Boxplots provide a summary view of the expression values for different genes.

ngenes <- 25
boxplot(t(brca_expr_mat.log[1:ngenes,]), ylab="Log_2 Expression",
        cex.names=0.2, las=2, pch=19, cex=1)

In a boxplot, the gray box shows the range of most of the data. Also note the scale on the vertical axis (from 0 to 15).

Now, let’s look at boxplots for 25 most variable genes:

ngenes <- 25
boxplot(t(brca.log.sub[1:ngenes,]), ylab="Log_2 Expression",
        cex.names=0.2, las=2, pch=19, cex=1)

First, the scale of the vertical axis is much larger (0 to 20). Second, the gray boxes, which give the range of the bulk of the data, are also much larger. There’s a pretty big difference between the two groups!

Subset of samples

When we’re doing exploratory analysis, we’ll often work with just a subset of the data, such as the 500 most variable genes. It makes things go faster, and the plots are easier to interpret.

We’re also going to consider only a subset of the patient samples! To pull out 25% of the samples, let’s create a vector our_samples that stores the indices of every 4th sample. We’ll use this subset for now, but for the final analysis, we’ll certainly want to use the whole data set.

our_samples <- seq(1, ncol(brca.log.sub), 4)  # Every fourth sample of 
                                              # `brca.log.sub`.
length(our_samples)

## [1] 271

length(our_samples)/ncol(brca.log.sub)

## [1] 0.2504621

We’ve chose 271 samples, which amount to 25% of the samples (one out of every four).

Let’s make a heatmap of our reduced matrix of log-transformed expression values. Remember, we’re considering the 500 most variable genes and 25% of the samples.

library("RColorBrewer")  # We call a library that has nice colors

heatmap(brca.log.sub[,our_samples],
        Rowv=NA, Colv=NA, scale="none", labRow="",  # arguments for plotting
        labCol="", col=brewer.pal(10,"RdBu"), margins = c(1, 0))

Scaling

Each row is a gene, and each column is a sample. It doesn’t quite look right, though. There are a bunch of horizontal stripes and we see almost exclusively one color (red). This happens because some of the genes are expressed more highly than others.

While this is interesting, we want to know, whether the expression of each gene is higher in one group of samples than another. So we are more interested in the relative expression of the genes, rather than the absolute expression.

To get the relative gene expression, we will scale each of the genes to put them on equal footing. We used scaling in our previous activity with the function scale() for the mtcars and colleges datasets.

The absolute gene expression values will be changed, but the relative expression (whether it is higher or lower in a particular sample) will be preserved.

First, check out the help file:

help(scale)

Notice that scale() will scale the columns of our matrix. But we want to scale the genes, which are in the rows!

This is not a problem, because we can use the transpose function t(). When you transpose a matrix, you convert rows to columns and columns to rows.

Here is what brca.log.sub looks like when it is transposed:

t(brca.log.sub)[1:5, 1:5]

##                 CLEC3A      CPB1   SCGB2A2    SCGB1D2      TFF1
## TCGA-3C-AAAU  2.454334 16.161348 15.969238 12.2314162  6.326816
## TCGA-3C-AALI 11.768726  4.243707 16.161817 10.5340888  7.546671
## TCGA-3C-AALJ  0.000000 11.107740  0.000000  0.9310022  7.824946
## TCGA-3C-AALK  5.376641  6.928501  8.744171  6.5496214 13.595630
## TCGA-4H-AAAK  7.493047  2.707503 14.983354  9.3356271 12.354572

If we scale the transposed matrix, then we have to transpose it back!

We will use three nested functions:

t() to transpose our matrix so the genes are columns.
scale() to scale the transposed matrix to get the relative expression.
another t() to transpose back so that genes are rows again.

brca.log.sub.scale <- t(scale(t(brca.log.sub)))

Let’s make things easier by creating a new object with just the samples we want:

our_matrix <- brca.log.sub.scale[,our_samples]

our_matrix[1:5,1:5]

##         TCGA-3C-AAAU TCGA-4H-AAAK TCGA-A1-A0SD TCGA-A1-A0SH TCGA-A1-A0SM
## CLEC3A    -0.4444731    0.5273528   0.16677510   -0.9178450   0.01391312
## CPB1       1.8630076   -0.7426437   2.34079027   -0.5521063  -0.50657065
## SCGB2A2    1.2860995    1.0899717   0.11776738    1.5142496   1.70208586
## SCGB1D2    1.0984689    0.4513522   0.06435826    1.8327443   1.58792690
## TFF1      -0.4898101    0.8657281   0.91149541    0.3064546   1.15536664

We’ll plot both the original and reordered heatmaps for comparison.

heatmap(our_matrix,scale="none", labRow="",  
   Rowv=NA, Colv=NA, 
   labCol="", col=brewer.pal(10,"RdBu"), zlim=c(-2, 2), margins = c(1, 0))

heatmap(our_matrix,scale="none", labRow="",     
   labCol="", col=brewer.pal(10,"RdBu"), zlim=c(-2, 2), margins = c(1, 0))

Toggle back and forth between the two images in the upper left corner of the output by clicking on them.

Looking at the heatmap with reordering, we see that the tumors split up into different groups. Is there any biological significance to these groups? Are these groups seen only in the gene expression patterns, or is it related to any known biology?

Heatmap Groups and Biology

As we learned in class, breast cancer is split up into ER+ and ER- tumors, depending on whether the tumor expresses high levels of receptors for estrogen. This is really important to know because it tells us what’s driving the disease, and also how to treat it. To see whether ER status is associated with the groups we see in the reordered heatmap, let’s label the samples with a "+" for ER+ breast cancers, and "." for negative.

clin <- brca_clin_df[our_samples,]  # Index the clinical data 
                                    # for our subset of samples.

x <- clin[["estrogen_receptor_status"]]   # `x` contains the contents 
                                          # of the column we choose
er <- rep("", length(x))                  # We create an object that 
er[x == "Positive"] <- "+"                # is + for ER+, and 
er[x == "Negative"] <- "."                # - for ER-
table(er)

## er
##       .   + 
##  13  57 201

ER status

heatmap(
  our_matrix,                            # The matrix of expression values
  scale="none", labRow="", labCol=er,    # NOTICE labCol=er
                                         # We use `er` to label the 
                                         # columns of the heatmap
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0) 
)

In this plot, we see about four group of samples. Most of them have a lot of ER+ tumors (marked by the row of pluses on the bottom), while one is nearly all ER-.

Beautiful!!

PR and HER2 status

We will create the objects pr and her2 to annotate the columns with receptor status:

The symbol “+” for “Positive”
The symbol “.” for “Negative”

y <- clin[["progesterone_receptor_status"]]  # Grab the column we want
pr <- rep("", length(y))                     # Create an object with
pr[y == "Positive"] <- "+"                   # + for PR+, and            
pr[y == "Negative"] <- "."                   # - for PR-
table(pr)

## pr
##       .   + 
##  13  81 177

z <- clin[["her2_receptor_status"]]  # Grab the column we want
her2 <- rep("", length(z))           # Create an object with  
her2[z == "Positive"] <- "+"         # + for HER2+, and                    
her2[z == "Negative"] <- "."         # - for HER2-
table(her2)

## her2
##       .   + 
## 108 125  38

We use the code chunk for creating the heatmap annotated with “ER status” but we label the columns with “PR status” and “HER2” status, respectively.

For PR status:

heatmap(
  our_matrix,                            # The matrix of expression values
  scale="none", labRow="", labCol=pr,    # NOTICE labCol=pr
                                         # We use `pr` to label the 
                                         # columns of the heatmap
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0) 
)

For HER2 status:

heatmap(
  our_matrix,                            # The matrix of expression values
  scale="none", labRow="", labCol=her2,  # NOTICE labCol=her2
                                         # We use `her2` to label the 
                                         # columns of the heatmap
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0) 
)

To make it easier to compare, we can plot all three together.

par(cex.main=0.6)
# ER status
heatmap(
  our_matrix,                            # The matrix of expression values
  scale="none", labRow="", labCol=er,    # NOTICE labCol=er
                                         # We use `er` to label the 
                                         # columns of the heatmap
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0), main = "ER status"
)

#PR status
heatmap(
  our_matrix,                            # The matrix of expression values
  scale="none", labRow="", labCol=pr,    # NOTICE labCol=pr
                                         # We use `pr` to label the 
                                         # columns of the heatmap
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0), main = "PR status"
)

# HER2 status
heatmap(
  our_matrix,                            # The matrix of expression values
  scale="none", labRow="", labCol=her2,  # NOTICE labCol=her2
                                         # We use `her2` to label the 
                                         # columns of the heatmap
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0), main = "HER2 status"
)

In each of the three maps, there is a group of samples in the middle that have ER-, PR-, and HER2-. These samples are the triple negative breast cancer samples, and have the worst prognosis. We see a distinctive patter of overexpressed and underexpressed genes for these samples.