In the earlier module Breast_Cancer_Expression_Data, we examined the mRNA levels for 18,351 genes across 1,082 breast cancer patients.
We saw that the average expression of the genes (across patient samples) varied greatly. By taking the log of the expression values, we compressed the data (or reduced the range of variation) so the genes were more comparable.
Each column in the log-transformed expression matrix corresponds to a different patient, and each patient has their own set of gene expression values.
Each row in the log-transformed expression matrix corresponds to a different gene, and each gene has its own set of expression values across patients.
In another activity R_Heatmaps, we worked with the
smaller iconic data set mtcar and saw how the function
heatmap() can reorder objects in our data set to reveal
patterns in the data: Objects and features in the same
clusters are more similar to each other than to those
in other clusters.
The reordering of data by heatmap clustering provides insight into why objects may be related.
We will apply these concepts and skills to the TCGA clinical and gene expression features.
Last time, we learned why it’s important to log transform gene expression values. You can review the procedure in Breast_Cancer_Expression_Data.Rmd.
We will begin this activity where we left off last time: with the matrix of log-transformed data.
We will create an object that holds the name of the directory where
the TCGA data resides. This is good R coding practice because we can
apply all we do below to a data set in a different directory by changing
a single variable (data_dir).
We will use both the clinical and gene expression data.
# We `load()` the file "TCGA_brca_log.RData".
# The file.path() function tells `load()` where our data resides
# The objects will also appear in our "Environment" tab.
load(file.path("C:","Users","abowen","Downloads","TCGA_brca_log.RData"),verbose=TRUE)
## Loading objects:
## brca_expr_mat.log
## brca_clin_df
Objects were named to be be descriptive.
The Environment tab gives us basic information on the object that was loaded.
We will use some of the R functions we learned previously to examine the data in more detail.
dim() tells us the dimensions (# row, # columns) of the
objects.head() shows us the top several rows of the
objects.# the function `dim()` returns the number of rows and columns in our matrix
dim_mat <- dim(brca_expr_mat.log)
# Let's add a little more code to display our results in an informative way
print(paste("The log-transformed gene expression matrix has",dim_mat[1],
"rows and",dim_mat[2],"columns."))
## [1] "The log-transformed gene expression matrix has 18351 rows and 1082 columns."
Let’s see what’s in the rows and columns.
# This is an example of indexing
brca_expr_mat.log[1:5,1:5]
## TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## TSPAN6 7.5636463 7.705439 9.975045 10.110718 9.881960
## TNMD 0.4272843 1.061776 5.353718 1.164271 2.506120
## DPM1 9.0367945 9.662019 9.919403 8.859174 8.928912
## SCYL3 8.3493520 10.607275 8.395877 9.103957 8.704889
## C1orf112 6.9691735 8.106778 7.922947 7.776269 7.495535
Rows are named with the genes whose expression values were measured. Columns are named with the patient bar code identifier.
“TSPAN6” is the name of the gene that codes for the protein Tetraspanin-6. “TCGA-3C-AAAU” is the anonymized identifier for one patient’s sample.
A heatmap is a graphical representation of data that uses a system of color-coding to represent different ranges of values. Heat maps make it easy to visualize complex data and understand it at a glance.
To reduce the number of genes to consider, we will select the “most important” ones for our exploratory analysis.
One common approach is to find the genes with the biggest differences or highest variance across patient samples.
We’ll order our matrix accordingly and create a sub-matrix for the top 500 variable genes. This is a large code chunk, but we’re getting good at this!
# We choose 500 genes, but you can change this value!
num_genes <- 500
# `apply()` is used to get the variance of each row
# The argument "1" applies `var()` to the rows
var_genes <- apply(brca_expr_mat.log, 1, var)
# `order()` ranks the values
# The argument decreasing=TRUE orders from large to small values
order_var <- order(var_genes, decreasing=TRUE)
# This line of code does multiple operations
# We create a matrix that consists of the 500 genes (rows) with the highest variance
brca.log.sub <- brca_expr_mat.log[order_var[1:num_genes],]
Let’s take a look at the new matrix:
# Indexing
brca.log.sub[1:5,1:10]
## TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## CLEC3A 2.454334 11.768726 0.0000000 5.376641 7.493047
## CPB1 16.161348 4.243707 11.1077402 6.928501 2.707503
## SCGB2A2 15.969238 16.161817 0.0000000 8.744171 14.983354
## SCGB1D2 12.231416 10.534089 0.9310022 6.549621 9.335627
## TFF1 6.326816 7.546671 7.8249460 13.595630 12.354572
## TCGA-5L-AAT0 TCGA-5T-A9QA TCGA-A1-A0SB TCGA-A1-A0SD TCGA-A1-A0SE
## CLEC3A 2.352250 6.603393 1.701549 5.623527 4.041497
## CPB1 7.560035 10.279785 6.612615 18.628300 6.254743
## SCGB2A2 11.942258 16.523099 8.347892 10.096333 10.872521
## SCGB1D2 9.266208 12.412250 7.798614 7.603864 9.936669
## TFF1 9.097263 14.611388 2.574223 12.558088 13.128586
Let’s examine this a bit further…
Boxplots provide a summary view of the range of expression values for different genes.
We’ll check the first 25 genes from our original matrix and then from our re-ordered matrix.
Task Use help() to see what the arguments to
boxplot() mean in the next two code chunks.
ngenes <- 25
# the arguments to `boxplot()` in the second line are plotting parameters
boxplot(t(brca_expr_mat.log[1:ngenes,]), ylab="Log_2 Expression",
cex.names=0.2, las=2, pch=20, cex=0.6)
In a boxplot, the gray box shows the range of most (50%) of the data. Outliers, the genes with extreme values are depicted as dots. Note the scale on the vertical axis (from 0 to 15).
Now, let’s look at boxplots for 25 most variable genes:
ngenes <- 25
# the arguments to `boxplot()` in the second line are plotting parameters
boxplot(t(brca.log.sub[1:ngenes,]), ylab="Log_2 Expression",
cex.names=0.2, las=2, pch=20, cex=0.6)
The range of the values is much greater!
To see this more clearly, we’ll plot both side-by-side on the same scale.
# the function par() allows us to control how plots are shown
par(mfrow=c(1,2))
# the argument ylim=c(0,20) sets the scale of the y-axis
boxplot(t(brca_expr_mat.log[1:ngenes,]), ylab="Log_2 Expression",ylim=c(0,20),
cex.names=0.2, las=2, pch=20, cex=0.6)
boxplot(t(brca.log.sub[1:ngenes,]), ylab="Log_2 Expression",ylim=c(0,20),
cex.names=0.2, las=2, pch=20, cex=0.6)
Remember, we are assuming that genes with greater variance across patient samples are more important.
When we’re doing exploratory analysis, we’ll often work with a subset of the data, such as the 500 most variable genes. Our code will run faster, and the plots are easier to interpret. We can return to the full data set later.
In this spirit, we’ll consider only a subset of the patient samples.
To pull out 25% of the samples, let’s create a vector
our_samples that stores the indices of every 4th sample.
We’ll use this subset for now, but for the final analysis, we’ll
certainly want to use the whole data set.
# The seq() function creates a series of numbers between the first two argument
# With the third argument, we get every fourth column
our_samples <- seq(1, ncol(brca.log.sub), 4)
#Let's check that we did this correctly!
length(our_samples)
## [1] 271
length(our_samples)/ncol(brca.log.sub)
## [1] 0.2504621
We’ve chosen 271 samples, which amount to 25% of the samples (one out of every four).
Create a new object, our_matrix that contains only the
samples we want to consider for now:
# The matrix with the 500 genes and 271 samples
our_matrix <- brca.log.sub[,our_samples]
our_matrix[1:5,1:5]
## TCGA-3C-AAAU TCGA-4H-AAAK TCGA-A1-A0SD TCGA-A1-A0SH TCGA-A1-A0SM
## CLEC3A 2.454334 7.493047 5.623527 0.000000 4.830970
## CPB1 16.161348 2.707503 18.628300 3.691311 3.926426
## SCGB2A2 15.969238 14.983354 10.096333 17.116090 18.060295
## SCGB1D2 12.231416 9.335627 7.603864 15.517233 14.421697
## TFF1 6.326816 12.354572 12.558088 9.867615 13.642526
We will create the heatmap with the same function call and arguments
we used in R_Heatmaps.Rmd for mtcars.
# We invoke `heatmap()` the same way we did for `mtcars`
# The last two arguments suppress labeling the rows and columns
heatmap(our_matrix,
Rowv=NA, Colv=NA, scale="none",
labRow="", labCol="")
Each row is a gene, and each column is a sample. It doesn’t look
right, though. There are a bunch of horizontal stripes with almost
exclusively one color. This is reminiscent of what we saw with the
mtcars features. This happens because some genes are more
highly expressed than others.
To get the relative gene expression, we will scale each of the genes to put them on equal footing.
# We invoke `heatmap()` the same way we did for `mtcars`
# The last two arguments suppress labeling the rows and columns
heatmap(our_matrix,scale="row", Rowv=NA, Colv=NA,
labRow="", labCol="",
zlim=c(-2, 2))
This representation is much richer. Features (genes in the rows) and observations (patient samples in the columns) exhibit informative relative variation with scaling.
The dendrogram is a summary of the distance matrix that gives the pairwise distances between among all points.
There are many ways to calculate the distance between two points, for
example, via roads or “as the crow flies.” By default,
heatmap() uses the Euclidean distance or
the length of a line between two points. For more on Euclidean distance,
see the Extra Work in R_Heatmaps.Rmd.
# by default, heatmap will calculate the dendrograms and reorder
heatmap(our_matrix,scale="row",
labRow="", labCol="",
zlim=c(-2, 2))
Wow! We see that the tumors (columns) split up into different groups, each with similar gene expression profiles. Similarly, different genes cluster together based on their variation across tumors.
Is there any biological significance to these groups? Are these groups seen only in the gene expression patterns, or is it related to biology we already know?
We same in the activity Breast_Cancer_Patient_Data that most samples are either ER+ or ER- tumors, depending on whether the tumor expresses high levels of receptors for estrogen. This is really important to know because it tells us what’s driving the disease and also how to treat it.
To see whether ER status is associated with the groups we see in the reordered heatmap, we’ll label the samples with a “+” for ER+ breast cancers, and “.” for negative.
To get the information on ER+/ER-, we go back to the clinical data,
brca_clin_df.
View(brca_clin_df)
Luckily, the patient samples are in the same order in both
brca_clin_df and brca_expr_mat.log. Let’s
create a subset of the clinical data frame that corresponds to our gene
expression matrix our_matrix.
#Extract the rows corresponding to our_samples
clin <- brca_clin_df[our_samples,]
#Pull out the "estrogen_receptor_status" feature
x <- clin$estrogen_receptor_status
head(x, 10)
## [1] "Positive" "Positive" "Positive" "Negative" "Positive" "Positive"
## [7] "Positive" "Negative" "Positive" "Positive"
Assign the labels:
# `er` will have labels + for ER positive and . for ER negative
# Note we are using boolean expressions to assign the symbols
er <- rep("", length(x))
er[x == "Positive"] <- "+"
er[x == "Negative"] <- "."
table(er)
## er
## . +
## 13 57 201
In our matrix with 500 genes and 271 patients, 57 samples are ER- and 201 are ER+. Note than 13 samples have missing values.
head(er,10)
## [1] "+" "+" "+" "." "+" "+" "+" "." "+" "+"
The symbols match the values above!
We can add more arguments to the heatmap() function to include the new ER labels we created.
# by default, heatmap will calculate the dendrograms and reorder
# NOTE: we are giving `er` to the argument labCol
heatmap(our_matrix,scale="row",
labRow="", labCol=er,
zlim=c(-2, 2))
In this plot, we see about four group of samples. Most of them have a lot of ER+ tumors (marked by the row of pluses on the bottom), while one is nearly all ER-.
Beautiful!! We see that the clustering on gene expression data seems to recapitulate the ER status of the tumors. A great deal of research has been done on this and other associations between gene expression profiles and the characteristics of patient tumors.
In the code chunks below, create heatmaps labeled with PR and HER2 status. The code follows exactly what was done for ER status above.
Create the object pr to annotate the columns with
progesterone receptor status:
# This code block is similar to the one for ER status
# replace yyyyyyyy with feature for PR status
y <- clin$yyyyyyyy
pr <- rep("", length(y))
pr[y == "Positive"] <- "+"
pr[y == "Negative"] <- "."
table(pr)
How many samples are PR+ and PR-? Assign your
answers to the variables PRpos and PRneg.
# enter the value you obtained from the previous code chunk
PRpos <-
PRneg <-
Create the heatmap annotated with PR status:
# Replace XX in the argument labCol to label the columns with PR status
heatmap(our_matrix,scale="row",
labRow="", labCol=pr,
zlim=c(-2, 2))
Is there overlap in the clusters for ER- and PR-?
Assign your answer (YES/NO) to the variable
PR_ER_neg_overlap.
# Place YES or NO between the quotes
PR_ER_neg_overlap <- " "
Is there overlap in the clusters for ER+ and PR+?
Assign your answer (YES/NO) to the variable
PR_ER_pos_overlap.
# Place YES or NO between the quotes
PR_ER_pos_overlap <- " "
Create the object her2 to annotate the columns with HER2
status:
# This code block is similar to the one for ER status
# replace zzzzzzzz with feature for PR status
z <- clin$zzzzzzzz
her2 <- rep("", length(y))
# Complete the logical statements to assign "+" and "."
# Follow the code we used for ER and PR status
her2[z ] <- "+"
her2[z ] <- "."
table(her2)
Create the heatmap annotated with HER2 status:
# Replace zzzz in the argument labCol to label the columns with HER2 status
heatmap(our_matrix,scale="row",
labRow="", labCol=zzzz,
zlim=c(-2, 2))
Follow the dendrogram over the columns of our heatmap to four main clusters. Compare the ER, PR, and HER2 annotations for the clusters. Which cluster corresponds to triple negative breast cancer: 1, 2, 3, or 4.
# assign your choice for cluster number
cluster_number <- N
Let’s save our work as an html file!
knitr() is the R package that generates reports from
R Markdown. R Markdown is the syntax we are using in this document,
denoted by the .Rmd file type. With knitr() and R Markdown,
we can create reports of our work in the form of Word doc, PDF, and HTML
files.
An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.
Click the “Knit” button at the top of this window and select “Knit to html.”
# Load function to submitting answers to the leaderboard.
scripts_dir <- "/home/shared/R"
source(file.path(scripts_dir, "submission_helpers.R"))
# Log into Synapse
synLoginSecure()
# Submit answers
submission <- submit_module_answers(module = 3)
We have analyzed clinical and gene expression data from over 1,000 patients in the TCGA Breast Cancer cohort. Next time we will apply our R skills to analyze data from human cancer cell lines and calculate differential gene expression for models of normal breast tissue cells versus models of aggressive cancer cells.