TCGA heatmaps and clustering

About this activity
Loading the data
- The data directory
Viewing the data
Subsetting the data
Heatmaps
- Dendrograms
- Color schemes
Annotating Heatmap Clusters
- ER status
DREAM-High Challenge
Create your R Markdown Report
Congratulations

About this activity

In the earlier module Breast_Cancer_Expression_Data, we examined the mRNA levels for 18,351 genes across 1,082 breast cancer patients.

We saw that the average expression of the genes (across patient samples) varied greatly. By taking the log of the expression values, we compressed the data (or reduced the range of variation) so the genes were more comparable.

Each column in the log-transformed expression matrix corresponds to a different patient, and each patient has their own set of gene expression values.

Each row in the log-transformed expression matrix corresponds to a different gene, and each gene has its own set of expression values across patients.

In another activity Heatmaps, we worked with the smaller data set mtcar and saw how the function heatmap() can reorder objects in our data set to reveal patterns in the data: Objects and features in the same clusters are more similar to each other than to those in other clusters.

The reordering of data by heatmap clustering provides insight into why objects may be related.

We will apply these concepts and skills to the TCGA clinical and gene expression features.

Loading the data

Last time, we learned why it’s important to log transform gene expression values. You can review the procedure in Breast_Cancer_Expression_Data.Rmd.

We will begin this activity where we left off last time: with the matrix of log-transformed data.

The data directory

We will create an object that holds the name of the directory where the TCGA data resides. This is good R coding practice because we can apply all we do below to a data set in a different directory by changing a single variable (data_dir).

We will use both the clinical and gene expression data. The clinical data will help us correlate information about the receptor status of patients’ tumors with patterns in gene expression.

# We `load()` the file "TCGA_brca_log.RData".
# The file.path() function tells `load()` where our data resides
# The objects will also appear in our "Environment" tab.
load(file.path(data_dir, "TCGA_brca_log.RData"),verbose=TRUE)

## Loading objects:
##   brca_clin_df
##   brca_expr_mat.log

Objects were named to be be descriptive.

“brca” stands for “TCGA breast cancer”
“clin” is short for “clinical”
“df” is short for “data frame,” a table of data
“expr” stands for “gene expression data”
“mat” means the expression data is in matrix form
“log” reminds us that the data was log-transformed

Viewing the data

The Environment tab gives us basic information on the object that was loaded.

We will use some of the R functions we learned previously to examine the data in more detail.

dim() tells us the dimensions (# row, # columns) of the objects.
head() shows us the top several rows of the objects.
indexing allows us to look at the rows and columns we choose.

# the function `dim()` returns the number of rows and columns in our matrix
dim_mat <- dim(brca_expr_mat.log)  
                                        
# Let's add a little more code to display our results in an informative way
print(paste("The log-transformed gene expression matrix has",dim_mat[1],
              "rows and",dim_mat[2],"columns."))

## [1] "The log-transformed gene expression matrix has 18351 rows and 1082 columns."

Let’s see what’s in the rows and columns.

# This is an example of indexing
brca_expr_mat.log[1:5,1:5]

##          TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## TSPAN6      7.5636463     7.705439     9.975045    10.110718     9.881960
## TNMD        0.4272843     1.061776     5.353718     1.164271     2.506120
## DPM1        9.0367945     9.662019     9.919403     8.859174     8.928912
## SCYL3       8.3493520    10.607275     8.395877     9.103957     8.704889
## C1orf112    6.9691735     8.106778     7.922947     7.776269     7.495535

Rows are named with the genes whose expression values (mRNA levels) were measured. Columns are named with the patient bar code identifier.

“TSPAN6” is the name of the gene that codes for the protein Tetraspanin-6. “TCGA-3C-AAAU” is the anonymized identifier for one patient’s sample.

Subsetting the data

In exploratory analysis the work is often done first with a subset of the data. The code will run faster, the plots are easier to interpret, and it is easy to change parameters and re-check the analysis. We can return to the full data set later.

Most variable genes

We can make a heatmap for the full matrix of 18351 genes and 1,082 patients, but it is good practice to work with a smaller prioritized set of data.

To reduce the number of genes to consider, we will select the “most important” ones for our exploratory analysis.

One common approach is to find the genes with the biggest differences or highest variance across patient samples.

Variance is a measure of dispersion, meaning it tells us how far a set of numbers is spread out from their average value.

We will use the apply() function again, this time to find the variance of each gene across patient samples.

#  `apply()` to get the variance with `var()`  
# The argument "1" applies `var()` to the rows
var_genes <- apply(brca_expr_mat.log, 1, var)   

hist(var_genes,
     main = "Variance of log-transformed gene expression",
     xlab = "Variance")

It looks like we can choose genes with a variance larger than 5 to get a smaller set.

Since the estrogen receptor (gene name ESR1) is an important gene in the breast cancer patient cohort, let’s see what its variance is.

var_genes[names(var_genes)=="ESR1"]

##     ESR1 
## 10.18163

The value of 10 for the variance of ESR1 suggests that variance > 5 is a reasonable cutoff.

To check our results, let’s calculate the variance for ESR1 explicitly.

esr1<- brca_expr_mat.log["ESR1",]
hist(esr1, 
     main=paste("ESR1 has mean of", round(mean(esr1)), 
                "and variance of", round(var(esr1))),
                xlab = "Log expression")

The spread of log-transformed ESR1 expression shows that most genes have values within +/-10 of the mean (12).

How many genes have variance > 5?

length(var_genes[var_genes>5])

## [1] 690

We’ll consider the 690 genes with variance > 5 (out of 18,351 total genes) in our analysis.

Let’s order our matrix so that the most variable genes appear in the top rows and create a sub-matrix for these genes.

The function order() returns the order of each gene that will rank the list of variances from highest to lowest. We can use it to re-order the entire matrix so that the most variable genes appear at the top.

# `order()` ranks the values 
# The argument decreasing=TRUE orders from large to small values
order_var <- order(var_genes, decreasing=TRUE)   

# We choose 690 genes, but you can change this value!
num_genes <- 690

                
# Create a matrix that consists of the 500 genes (rows) with the highest variance 
brca.log.sub <- brca_expr_mat.log[order_var[1:num_genes],]

The new matrix will have the log-transformed gene expression values, but it will contain only the most variable genes. Let’s take a look at the new matrix:

# Indexing
dim(brca.log.sub)

## [1]  690 1082

brca.log.sub[1:5,1:5]

##         TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## CLEC3A      2.454334    11.768726    0.0000000     5.376641     7.493047
## CPB1       16.161348     4.243707   11.1077402     6.928501     2.707503
## SCGB2A2    15.969238    16.161817    0.0000000     8.744171    14.983354
## SCGB1D2    12.231416    10.534089    0.9310022     6.549621     9.335627
## TFF1        6.326816     7.546671    7.8249460    13.595630    12.354572

Let’s examine this a bit further…

Boxplots

As we learned in our previous activities, boxplots provide a summary view of the range of expression values for different genes.

We’ll check the first 10 genes from our original matrix and then the 10 most variable genes from our matrix re-ordered by variance.

# the function par() allows us to control how plots are shown
par(mfrow=c(1,2))

# a boxplot for our original matrix
boxplot(t(brca_expr_mat.log[1:10,]), 
        main="Sample of genes",
        ylab="Log_2 Expression",ylim=c(0,20),
        cex.names=0.2, las=2, pch=20, cex=0.6)

# a boxplot for the sub-matrix
boxplot(t(brca.log.sub[1:10,]), 
        main="Most variable genes",
        ylab="Log_2 Expression",ylim=c(0,20),
        cex.names=0.2, las=2, pch=20, cex=0.6)

The range of values for the most variable genes is much greater!

In a boxplot, the gray box shows the range of most (50%) of the data. Outliers, the genes with extreme values, are depicted as dots. Note the scale on the vertical axis (from 0 to 20) is the same in both plots to enable a direct comparison.

Remember, we are assuming that genes with greater variance across patient samples are more important.

Subset of samples

Similarly, we’ll consider only a subset of the patient samples, too. To pull out a third of the samples, let’s create a vector our_samples that stores the indices of every 3rd sample.

# The seq() function creates a series of numbers 
# between the first two arguments.
# The third argument says we will get every third column 
our_samples <- seq(1, ncol(brca.log.sub), 3)   

#Let's check that we did this correctly!                                            
length(our_samples)

## [1] 361

length(our_samples)/ncol(brca.log.sub)

## [1] 0.3336414

We’ve chosen 361 samples, one out of every three.

Create a new object, our_matrix that contains only the samples and genes we want to consider:

# The matrix with the 690 genes and 361 samples 
our_matrix <- brca.log.sub[,our_samples]

our_matrix[1:5,1:5]

##         TCGA-3C-AAAU TCGA-3C-AALK TCGA-5T-A9QA TCGA-A1-A0SE TCGA-A1-A0SH
## CLEC3A      2.454334     5.376641     6.603393     4.041497     0.000000
## CPB1       16.161348     6.928501    10.279785     6.254743     3.691311
## SCGB2A2    15.969238     8.744171    16.523099    10.872521    17.116090
## SCGB1D2    12.231416     6.549621    12.412250     9.936669    15.517233
## TFF1        6.326816    13.595630    14.611388    13.128586     9.867615

What are the dimensions of our_matrix and what does it contain?

dim_sub <- dim(our_matrix)
print(paste("The subset matrix for log-transformed gene expression data has",dim_sub[1], "rows and",dim_sub[2],"columns."))

## [1] "The subset matrix for log-transformed gene expression data has 690 rows and 361 columns."

To summarize our_matrix contains the 690 most variable genes across ALL 1,082 patients, and then we chose every third patient for our exploratory data analysis.

Heatmaps

A heatmap is a graphical representation of data that uses a system of color-coding to represent different ranges of values. Heat maps make it easy to visualize complex data and understand it at a glance.

We will create the heatmap with the same function call and arguments we used in R_Heatmaps.Rmd for mtcars.

First, we’ll scale the data so the genes are on a similar scale while preserving their relative order.

our_matrix_scaled <- scale(our_matrix)

The heatmap() function is very powerful and allows us to do many things at once. Let’s examine the color-coded matrix:

# We invoke `heatmap()` the same way we did for `mtcars`
# The last two arguments suppress labeling the rows and columns
heatmap(our_matrix, Rowv=NA, Colv=NA, 
        labRow="", labCol="", margins = c(2, 2),
        xlab="Patients", ylab="Genes",
        zlim=c(-2, 2))

We have more data points here than with the mtcars dataset, but the result is similar: Data values have been color-coded with the yellow-orange-red palette (“YlOrRd”).

Dendrograms

The dendrogram is a summary of the the pairwise distances among columns and among rows.

The expression values for a given gene across patients comprise a point for the gene with 361 “dimensions.”

There are many ways to calculate the distance between two points, for example, via roads or “as the crow flies.” By default, heatmap() uses the Euclidean distance or the length of a line between two points (“as the crow flies”).

# by default, heatmap will calculate the dendrograms and reorder
heatmap(our_matrix_scaled, 
        labRow="", labCol="",  margins = c(2, 2),
        xlab="Patients", ylab="Genes",
        zlim=c(-2, 2))

By following the dendrogram branches and inspecting the patterns in our color-coded matrix, we observe that even with a relatively small amount of data, the tumors (columns) split up into different groups, each with similar gene expression profiles. In addition, different genes cluster together based on their variation across tumors.

For example, the rightmost columns seem to form a patient cluster with some genes very highly expressed and some with very low expression.

Color schemes

Let’s try another color palette.

# Load the palette package
library(RColorBrewer)
# Parameters for plotting 
par(cex = 0.5)
# Get a graphic for all color schemes
display.brewer.all()

Often in publications with gene expression data, a red-blue palette is used, “RdBu”.

# the `col` argument allows us to choose a different palette.
heatmap(our_matrix_scaled, 
        labRow="", labCol="", margins = c(2, 2),
        zlim=c(-2, 2),
        xlab="Patients", ylab="Genes",
        col=brewer.pal(8,"RdBu"))

This color scheme makes the patterns even easier to see.

Is there any biological significance to these groups? Are these groups seen only in the gene expression patterns, or is it related to biology we already know?

Annotating Heatmap Clusters

We saw in the activity Breast_Cancer_Patient_Data that most samples are either ER+ tumors, depending on whether the tumor expresses high levels of receptors for estrogen. This is really important to know because it tells us what’s driving the disease and also how to treat it.

To see whether ER status is associated with the groups in the reordered heatmap, we will label the samples with a “+” for ER+ breast cancers, and “.” for negative.

To get the information on ER+/ER-, we go back to the clinical data, brca_clin_df.

View(brca_clin_df)

As we saw last time, the patient samples are in the same order in both brca_clin_df and brca_expr_mat.log. Let’s create a subset of the clinical data frame that corresponds to our gene expression matrix our_matrix.

We set our_samples to 361, every third patient sample.

#Extract the rows corresponding to our_samples 
clin <- brca_clin_df[our_samples,]  

#Pull out the "estrogen_receptor_status" feature                              
x <- clin$estrogen_receptor_status 

head(x, 10)

##  [1] "Positive" "Positive" "Positive" "Positive" "Negative" "Negative"
##  [7] "Negative" "Positive" "Positive" "Positive"

Assign the labels:

# `er` will have labels + for ER positive and . for ER negative
# Note we are using boolean expressions to assign the symbols
er <- rep("", length(x))                  
er[x == "Positive"] <- "+"                
er[x == "Negative"] <- "."                
table(er)

## er
##       .   + 
##  19  75 267

In our matrix with 691 most variable genes and 361 patients, 75 samples are ER- and 267 are ER+. Note than 19 samples have missing values.

head(er,10)

##  [1] "+" "+" "+" "+" "." "." "." "+" "+" "+"

The symbols match the values above!

ER status

We can add more arguments to the heatmap() function to include the new ER labels we created.

# NOTE: we are giving `er` to the argument labCol
heatmap(our_matrix_scaled,
        labRow="", labCol=er, cexCol = 0.4,
        margins = c(2, 2),
        zlim=c(-2, 2),
        xlab="Patients", ylab="Genes",
        col=brewer.pal(8,"RdBu"))

In this plot, we see about four group of samples. Most of them have a lot of ER+ tumors (marked by the row of pluses on the bottom), while one (on the right) is nearly all ER-.

Beautiful!! We see that the clustering on gene expression data seems to recapitulate the ER status of the tumors. A great deal of research has been done on this and other associations between gene expression profiles and the characteristics of patient tumors.

DREAM-High Challenge

Heatmaps for PR and HER2

In the code chunks below, create heatmaps labeled with PR and HER2 status. The code follows what was done for ER status above.

PR status

Create the object pr to annotate the columns with progesterone receptor status:

The symbol “+” for “Positive”
The symbol “.” for “Negative”

# This code block is similar to the one for ER status
# replace yyyyyyyy with feature for PR status
# look at `brca_clin_df` to find the feature name

y <- clin$progesterone_receptor_status

pr <- rep("", length(y))                     

pr[y == "Positive"] <- "+"                              
pr[y == "Negative"] <- "."                   

table(pr)

## pr
##       .   + 
##  20 109 232

pr[1:10]

##  [1] "+" "+" "." "+" "+" "." "." "+" "+" "+"

How many samples are PR+ and PR-? Replace MMM and NNN to assign your answers to the variables PRpos and PRneg.

# enter the value you obtained from the previous code chunk
PRpos <- 232
  
PRneg <- 109

print(paste("Our matrix has",PRpos, 
            "progesterone positive samples and",
            PRneg,"progesterone negative samples."))

## [1] "Our matrix has 232 progesterone positive samples and 109 progesterone negative samples."

Create the heatmap annotated with PR status:

# Replace YY in the argument labCol to label the columns with PR status

heatmap(our_matrix_scaled,
        labRow="", labCol=pr, cexCol = 0.4,
         margins = c(2, 2),
        zlim=c(-2, 2),
        xlab="Patients", ylab="Genes",
        col=brewer.pal(8,"RdBu"))

Is there overlap in the clusters for ER- and PR-? Replace XXX to assign your answer (YES/NO) to the variable PR_ER_neg_overlap.

# Place YES or NO between the quotes
PR_ER_neg_overlap <- " YES "

print(paste("Do ER- and PR- samples overlap in the heatmap?", 
            PR_ER_neg_overlap))

## [1] "Do ER- and PR- samples overlap in the heatmap?  YES "

HER2 status

Create the object her2 to annotate the columns with HER2 status:

The symbol “+” for “Positive”
The symbol “.” for “Negative”

# This code block is similar to the one for ER status
# replace zzzzzzzz with feature for PR status
# look at `brca_clin_df` to find the feature name

z <- clin$her2_receptor_status

her2 <- rep("", length(z))  

# Complete the logical statements to assign "+" and "."
# Follow the code we used for ER and PR status

her2[z == "Positive"   ] <- "+"                              
her2[z == "Negative"   ] <- "."                   

table(her2)

## her2
##       .   + 
## 127 183  51

How many samples are HER2+ and HER2-? Replace MMM and NNN to assign your answers to the variables HER2pos and HER2neg.

# enter the value you obtained from the previous code chunk
HER2pos <- 51
  
HER2neg <- 183

print(paste("Our matrix has",HER2pos, 
            "progesterone positive samples and",
            HER2neg,"progesterone negative samples."))

## [1] "Our matrix has 51 progesterone positive samples and 183 progesterone negative samples."

Create the heatmap annotated with HER2 status:

heatmap(our_matrix_scaled,
        labRow="", labCol=er, cexCol = 0.4,
         margins = c(2, 2),
        zlim=c(-2, 2),
        xlab="Patients", ylab="Genes",
        col=brewer.pal(8,"RdBu"))

heatmap(our_matrix_scaled,
        labRow="", labCol=pr, cexCol = 0.4,
         margins = c(2, 2),
        zlim=c(-2, 2),
        xlab="Patients", ylab="Genes",
        col=brewer.pal(8,"RdBu"))

heatmap(our_matrix_scaled,
        labRow="", labCol=her2, cexCol=0.4,
        zlim=c(-2, 2),
        xlab="Patients", ylab="Genes",
        col=brewer.pal(8,"RdBu"))

Look at all three labeled heatmaps: ER, PR, and HER2.

Triple negative breast cancers have ER-, PR-, and HER2-, and should correspond to the rightmost cluster.

Create your R Markdown Report

Let’s save our work as an html file!

Click the “Knit” button at the top of this window and select “Knit to html.”

Congratulations

We have analyzed clinical and gene expression data from over 1,000 patients in the TCGA Breast Cancer cohort.