In Module 4, we used clustering to look for clinically relevant patterns in the TCGA breast cancer gene expression data matrix. We found a cluster that corresponds to patient samples with Triple Negative Breast Cancer.
In this Module, we will look at the composition of clusters on the other dimension of our data matrix: The genes. We will learn about Enrichment Analysis and how to use it to look for biological themes in our data.
knitr() is the R package that generates the report from R Markdown. We can create reports as Word doc, PDF, and HTML files.
An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.
We created an object that holds the name of the directory where the TCGA data resides. This is good R coding practice because we can apply all we do below to a data set in a different directory by changing a single variable (data_dir).
Throughout the course, we will be loading data from R files with the extension .RData. An example is the file TCGA_brca.RData which we examine below.
.RData files are binary files (i.e., not human readable) that can store multiple R objects, such as vectors, lists, matrices, and data frames.
The R objects in the TCGA_brca.RData file are the results of previous R analyses, and the data is in tidy form. Tidy data sets have consistent structure and are easy to manipulate, visualize, and model.
We could spend the entire course on learning how to put data in tidy form, but our goal is the understand and analyze the data, which occurs after tidying the data.
# We `load()` the file "brca_expr_mat.RData".
# The file.path() function tells `load()` where our data lives
# The argument "verbose = TRUE" makes `load()` tell us what R objects are being loaded
# The objects will also appear in our "Environment" tab.
load(file.path(data_dir, "brca_expr_mat.RData"),verbose=TRUE)
## Loading objects:
## brca_expr_mat
## brca_expr_mat.log
Objects were named to be be descriptive.
In this activity, we will go directly to the log-transformed matix brca_expr_mat.log and create the same clusters we did in Module 4.
The Environment tab gives us basic information on the object that was loaded.
We can use some of the R functions we learned previously to remind us what is in brca_expr_mat.log.
dim() tells us the dimensions (# row, # columns) of the objects.head() shows us the top several rows of the objects.dim_mat <- dim(brca_expr_mat.log) # We should get a result consistent
# with what is in the Environment tab
print(paste("The gene expression matrix has",dim_mat[1],
"rows and",dim_mat[2],"columns."))
## [1] "The gene expression matrix has 18351 rows and 1082 columns."
# The function paste() allows us to print text and numbers together.
Let’s see what’s in the rows and columns.
brca_expr_mat.log[1:5,1:5]
## TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## TSPAN6 7.5636463 7.705439 9.975045 10.110718 9.881960
## TNMD 0.4272843 1.061776 5.353718 1.164271 2.506120
## DPM1 9.0367945 9.662019 9.919403 8.859174 8.928912
## SCYL3 8.3493520 10.607275 8.395877 9.103957 8.704889
## C1orf112 6.9691735 8.106778 7.922947 7.776269 7.495535
Rows are named with the genes whose expression values were measured. Columns are named with the patient barcode identifier.
“TSPAN6” is the name of the gene that codes for the the protein Tetraspanin-6. “TCGA-3C-AAAU” is the anonymized identifier for one patient’s sample.
The next few code chunks are taken directly from Modules 3 and 4.
A heatmap is a graphical representation of data that uses a system of color-coding to represent different values. Heat maps make it easy to visualize complex data and understand it at a glance. We won’t be able to make a heatmap of the entire data set — there are just too many genes — so we’ll look at a subset of genes.
To reduce the number of genes, we will try to select the most important ones for an exploratory analysis. There are many different ways to do this, and our choice here can lead to profoundly different views of the data.
One common approach to select genes in an unbiased way is to find the ones with the biggest differences across the samples, or in statistics language, the ones with the highest variance
Let’s consider the 500 genes with the highest variance.
We’ll order our matrix accordingly and make a matrix for the top 500 variable genes.
NUM.GENES <- 500 # How many genes we want to consider.
V <- apply(brca_expr_mat, 1, var) # Here, we use the function apply
# with the function var (for variance).
O <- order(V, decreasing=TRUE) # The function order will rank the values for us.
var_genes <- V[O]
brca.log.sub <- brca_expr_mat.log[O[1:NUM.GENES],]
# `brca.log.sub` is a subset of the `brca_expr_mat.log` matrix
# and contains data only for the 500 most variable genes
The gene names will be different from the ones we’re used to seeing.
brca.log.sub[1:5,1:5]
## TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## CPB1 16.16135 4.243707 11.10774 6.928501 2.707503
## COL1A1 15.76215 17.451710 17.46421 18.729189 18.659219
## COL1A2 15.08131 16.591093 16.53135 17.853364 18.187168
## FN1 15.49303 16.703795 16.88043 17.056871 17.901097
## COL3A1 14.94762 16.132576 16.08866 17.581760 17.871713
We scaled the expression matrix so we could examine relative gene expression. We used the funcion scale() to change each gene such that the mean expression is 0, and the variance is 1.
brca.log.sub.scale <- t(scale(t(brca.log.sub)))
Now we can recreate the heatmap.
library("RColorBrewer") # We call a library that has nice colors
I.sample <- seq(1, ncol(brca.log.sub), 4) # We take every fourth sample.
heatmap(brca.log.sub.scale[,I.sample],
Rowv=NA, Colv=NA, scale="none", labRow="", # The rest of the arguments are for plotting.
labCol="", col=brewer.pal(10,"RdBu"), zlim=c(-2, 2), margins = c(1, 0))
In the heatmap, the relatively overexpressed genes are red and the relatively underexpressed genes are blue.
To find patterns in the heat map, we performed clustering to find the genes that are most similar to each other and the samples that are most similar to each other.
our_matrix <- brca.log.sub.scale[,I.sample]
row.dist <- dist(our_matrix) # Use dist() on our matrix
col.dist <- dist(t(our_matrix)) # Use distr() on the transpose of our matrix
We cluster the columns and rows based on the distances we calculated and create the dendrograms, i.e. the distance trees, with the function hclust().
column_clusters <- hclust(col.dist) # Apply `hclust()` to the columns (patient samples)
row_clusters <- hclust(row.dist) # Apply `hclust()` to the rows (genes)
We used the row and column clustering to reorganize the heat map.
heatmap(
our_matrix, # The matrix of expression values
Rowv=as.dendrogram(row_clusters), # Create row dendrogram from `row_clusters`
Colv=as.dendrogram(column_clusters), # Create column dendogram from `column_clusters`
scale="none", labRow="", labCol="", col=brewer.pal(10,"RdBu"), # plotting arguments
zlim=c(-2, 2), margins = c(1, 0) # more plotting arguments
)
This is a plot of the dendrogram tree for gene (row) clustering.
plot(row_clusters,cex=0.3)
Let’s cut the tree so that we get five clusters.
cluster <- cutree(row_clusters, k=5) # `cutree()` cuts our dendrogram into
# `k` branches or clusters.
cluster[1:25]
## CPB1 COL1A1 COL1A2 FN1 COL3A1 SCGB2A2 IGFBP5 LTF MGP SPARC
## 1 2 2 2 2 1 2 3 3 2
## CD24 RPL19 CSN2 PIP GAPDH MMP9 CHGA ACTB APOD ERBB2
## 4 3 4 3 4 4 3 4 3 3
## C4A SLC39A6 SCGB1D2 EEF1A1 MUCL1
## 3 1 1 3 3
The object cluster contains the genes as names and the cluster membership as numbers, 1 through 4.
Let’s check that we do indeed have four clusters.
unique(cluster) # `unique()` gives us a list of the unique values
## [1] 1 2 3 4 5
We can find the genes in each cluster by taking the names() of each set. Let’s consider cluster 2, the second one from the right in our cluster dendogram.
names(cluster[cluster==2])
## [1] "COL1A1" "COL1A2" "FN1" "COL3A1" "IGFBP5" "SPARC"
## [7] "IGF2" "COL6A2" "CXCL14" "POSTN" "COL6A3" "COL6A1"
## [13] "TIMP3" "APP" "BGN" "LUM" "AEBP1" "KRT1"
## [19] "MYH9" "COL12A1" "GJA1" "DCN" "CTSB" "SLC40A1"
## [25] "MMP11" "NOTCH3" "COL5A2" "SFRP2" "THBS1" "COL5A1"
## [31] "MMP2" "SPP1" "VCAN" "TAGLN" "MMP14" "THBS2"
## [37] "CYBRD1" "COL4A2" "MXRA5" "TPM4" "ACTA2" "SH3BGRL"
## [43] "CTSK" "SULF1" "COL4A1" "COL11A1" "TIMP2" "TGFBI"
## [49] "LGALS1" "FBN1" "LRP1" "ITGB1" "PCOLCE" "FSTL1"
## [55] "SLITRK6" "FMOD" "CLIC6" "HDLBP" "ACAN" "HSPG2"
## [61] "SERPINH1" "ACTG2" "SULF2" "ANTXR1" "SPARCL1" "MRC2"
## [67] "CILP" "ELN" "HTRA1" "THBS4" "FBLN1" "MYL9"
We write these gene lists to a file for further analysis.
write.csv(names(cluster[cluster==2]),"cluster2.csv")
Let’s but some biology to the gene names Gene Ontology site….