About this activity

In Module 4, we used clustering to look for clinically relevant patterns in the TCGA breast cancer gene expression data matrix. We found a cluster that corresponds to patient samples with Triple Negative Breast Cancer.

In this Module, we will look at the composition of clusters on the other dimension of our data matrix: The genes. We will learn about Enrichment Analysis and how to use it to look for biological themes in our data.


Preliminaries

The knitr R package

knitr() is the R package that generates the report from R Markdown. We can create reports as Word doc, PDF, and HTML files.

An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.

The data directory

We created an object that holds the name of the directory where the TCGA data resides. This is good R coding practice because we can apply all we do below to a data set in a different directory by changing a single variable (data_dir).


Loading the data

Throughout the course, we will be loading data from R files with the extension .RData. An example is the file TCGA_brca.RData which we examine below.

.RData files are binary files (i.e., not human readable) that can store multiple R objects, such as vectors, lists, matrices, and data frames.

The R objects in the TCGA_brca.RData file are the results of previous R analyses, and the data is in tidy form. Tidy data sets have consistent structure and are easy to manipulate, visualize, and model.

We could spend the entire course on learning how to put data in tidy form, but our goal is the understand and analyze the data, which occurs after tidying the data.

# We `load()` the file "brca_expr_mat.RData".
# The file.path() function tells `load()` where our data lives
# The argument "verbose = TRUE" makes `load()` tell us what R objects are being loaded
# The objects will also appear in our "Environment" tab.

load(file.path(data_dir, "brca_expr_mat.RData"),verbose=TRUE)   
## Loading objects:
##   brca_expr_mat
##   brca_expr_mat.log

Objects were named to be be descriptive.

  • “brca” stands for “TCGA breast cancer”
  • “expr” stands for “gene expression data”
  • “mat” means the expression data is in matrix form
  • “log” reminds us that the data was log-transformed in Module 2

Viewing the data

In this activity, we will go directly to the log-transformed matix brca_expr_mat.log and create the same clusters we did in Module 4.

The Environment tab gives us basic information on the object that was loaded.

We can use some of the R functions we learned previously to remind us what is in brca_expr_mat.log.

  • dim() tells us the dimensions (# row, # columns) of the objects.
  • head() shows us the top several rows of the objects.
  • indexing allows us to look at the rows and columns we choose.
dim_mat <- dim(brca_expr_mat.log)  # We should get a result consistent
                                        # with what is in the Environment tab

print(paste("The gene expression matrix has",dim_mat[1],
              "rows and",dim_mat[2],"columns."))
## [1] "The gene expression matrix has 18351 rows and 1082 columns."
# The function paste() allows us to print text and numbers together.

Let’s see what’s in the rows and columns.

brca_expr_mat.log[1:5,1:5] 
##          TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## TSPAN6      7.5636463     7.705439     9.975045    10.110718     9.881960
## TNMD        0.4272843     1.061776     5.353718     1.164271     2.506120
## DPM1        9.0367945     9.662019     9.919403     8.859174     8.928912
## SCYL3       8.3493520    10.607275     8.395877     9.103957     8.704889
## C1orf112    6.9691735     8.106778     7.922947     7.776269     7.495535

Rows are named with the genes whose expression values were measured. Columns are named with the patient barcode identifier.

“TSPAN6” is the name of the gene that codes for the the protein Tetraspanin-6. “TCGA-3C-AAAU” is the anonymized identifier for one patient’s sample.


Heatmaps revisited

The next few code chunks are taken directly from Modules 3 and 4.

A heatmap is a graphical representation of data that uses a system of color-coding to represent different values. Heat maps make it easy to visualize complex data and understand it at a glance. We won’t be able to make a heatmap of the entire data set — there are just too many genes — so we’ll look at a subset of genes.

Most variable genes

To reduce the number of genes, we will try to select the most important ones for an exploratory analysis. There are many different ways to do this, and our choice here can lead to profoundly different views of the data.

One common approach to select genes in an unbiased way is to find the ones with the biggest differences across the samples, or in statistics language, the ones with the highest variance

Let’s consider the 500 genes with the highest variance.

We’ll order our matrix accordingly and make a matrix for the top 500 variable genes.

NUM.GENES <- 500                          # How many genes we want to consider.
V <- apply(brca_expr_mat, 1, var)         # Here, we use the function apply
                                          # with the function var (for variance).

O <- order(V, decreasing=TRUE)   # The function order will rank the values for us.

var_genes <- V[O]                         

brca.log.sub <- brca_expr_mat.log[O[1:NUM.GENES],]

# `brca.log.sub` is a subset of the `brca_expr_mat.log` matrix 
# and contains data only for the 500 most variable genes

The gene names will be different from the ones we’re used to seeing.

brca.log.sub[1:5,1:5]
##        TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## CPB1       16.16135     4.243707     11.10774     6.928501     2.707503
## COL1A1     15.76215    17.451710     17.46421    18.729189    18.659219
## COL1A2     15.08131    16.591093     16.53135    17.853364    18.187168
## FN1        15.49303    16.703795     16.88043    17.056871    17.901097
## COL3A1     14.94762    16.132576     16.08866    17.581760    17.871713

We scaled the expression matrix so we could examine relative gene expression. We used the funcion scale() to change each gene such that the mean expression is 0, and the variance is 1.

brca.log.sub.scale <- t(scale(t(brca.log.sub)))   

Now we can recreate the heatmap.

library("RColorBrewer")             # We call a library that has nice colors

I.sample <- seq(1, ncol(brca.log.sub), 4)         # We take every fourth sample.

heatmap(brca.log.sub.scale[,I.sample],
   Rowv=NA, Colv=NA, scale="none", labRow="",     # The rest of the arguments are for plotting.
   labCol="", col=brewer.pal(10,"RdBu"), zlim=c(-2, 2), margins = c(1, 0))

In the heatmap, the relatively overexpressed genes are red and the relatively underexpressed genes are blue.


Clustering revisited

To find patterns in the heat map, we performed clustering to find the genes that are most similar to each other and the samples that are most similar to each other.

our_matrix <- brca.log.sub.scale[,I.sample]

row.dist <- dist(our_matrix)       # Use dist() on our matrix
col.dist <- dist(t(our_matrix))    # Use distr() on the transpose of our matrix

We cluster the columns and rows based on the distances we calculated and create the dendrograms, i.e. the distance trees, with the function hclust().

column_clusters <- hclust(col.dist)     # Apply `hclust()` to the columns (patient samples)
row_clusters <- hclust(row.dist)        # Apply `hclust()` to the rows (genes)

We used the row and column clustering to reorganize the heat map.

heatmap(
  our_matrix,                           # The matrix of expression values
  Rowv=as.dendrogram(row_clusters),     # Create row dendrogram from `row_clusters`
  Colv=as.dendrogram(column_clusters),  # Create column dendogram from `column_clusters`
  
  scale="none", labRow="", labCol="", col=brewer.pal(10,"RdBu"),   # plotting arguments
  zlim=c(-2, 2), margins = c(1, 0)      # more plotting arguments
)


Dendrograms revisited

This is a plot of the dendrogram tree for gene (row) clustering.

plot(row_clusters,cex=0.3)

Cutting the tree

Let’s cut the tree so that we get five clusters.

cluster <- cutree(row_clusters, k=5)     # `cutree()` cuts our dendrogram into 
                                         # `k` branches or clusters.

cluster[1:25]
##    CPB1  COL1A1  COL1A2     FN1  COL3A1 SCGB2A2  IGFBP5     LTF     MGP   SPARC 
##       1       2       2       2       2       1       2       3       3       2 
##    CD24   RPL19    CSN2     PIP   GAPDH    MMP9    CHGA    ACTB    APOD   ERBB2 
##       4       3       4       3       4       4       3       4       3       3 
##     C4A SLC39A6 SCGB1D2  EEF1A1   MUCL1 
##       3       1       1       3       3

The object cluster contains the genes as names and the cluster membership as numbers, 1 through 4.

Let’s check that we do indeed have four clusters.

unique(cluster)      # `unique()` gives us a list of the unique values
## [1] 1 2 3 4 5

We can find the genes in each cluster by taking the names() of each set. Let’s consider cluster 2, the second one from the right in our cluster dendogram.

names(cluster[cluster==2])
##  [1] "COL1A1"   "COL1A2"   "FN1"      "COL3A1"   "IGFBP5"   "SPARC"   
##  [7] "IGF2"     "COL6A2"   "CXCL14"   "POSTN"    "COL6A3"   "COL6A1"  
## [13] "TIMP3"    "APP"      "BGN"      "LUM"      "AEBP1"    "KRT1"    
## [19] "MYH9"     "COL12A1"  "GJA1"     "DCN"      "CTSB"     "SLC40A1" 
## [25] "MMP11"    "NOTCH3"   "COL5A2"   "SFRP2"    "THBS1"    "COL5A1"  
## [31] "MMP2"     "SPP1"     "VCAN"     "TAGLN"    "MMP14"    "THBS2"   
## [37] "CYBRD1"   "COL4A2"   "MXRA5"    "TPM4"     "ACTA2"    "SH3BGRL" 
## [43] "CTSK"     "SULF1"    "COL4A1"   "COL11A1"  "TIMP2"    "TGFBI"   
## [49] "LGALS1"   "FBN1"     "LRP1"     "ITGB1"    "PCOLCE"   "FSTL1"   
## [55] "SLITRK6"  "FMOD"     "CLIC6"    "HDLBP"    "ACAN"     "HSPG2"   
## [61] "SERPINH1" "ACTG2"    "SULF2"    "ANTXR1"   "SPARCL1"  "MRC2"    
## [67] "CILP"     "ELN"      "HTRA1"    "THBS4"    "FBLN1"    "MYL9"

We write these gene lists to a file for further analysis.

write.csv(names(cluster[cluster==2]),"cluster2.csv")

Let’s but some biology to the gene names Gene Ontology site….