DREAM-High 2021- Correlation between Age and Gene Expression

Introduction

The function ‘heatmap()’ is a great way to look at data in a more interesting and fun way.

In this activity we will use the gene expression data from breast cancer patients in The Cancer Genome Atlas (TCGA) and we explore whether clinical features of these patients(mainly age_at_diagnosis) correlate with gene expression patterns.

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document.

When you Knit a Rmd(R markdown file) it turn into an HTML, a PDF, or even a Word document. Overall this can make an Rmd with a bunch of code into a neat looking page.

The Data Directory

We created an object that holds the name of the directory where the TCGA data resides. This is good R coding practice because we can apply all we do below to a data set in a different directory by changing a single variable (data_dir).

Coloring the Heatmap

In gene expression analysis, the convention is to use a red, white, blue color scale where + red corresponds to genes that are relatively overexpressed, + blue corresponds to genes that are relatively underexpressed, and + white are genes with average expression levels.

To use a color scheme to reproduce this convention we load a library that contains the palette we want.

library("RColorBrewer")

Extra Libraries

We load a extra 2 libraries so we can have different types/looking Heatmaps.

Gene Expression Data

We are examining a subset of the TCGA breast cancer expression data. Patient samples are the columns of our gene expression matrix, and the mRNA levels of genes are the rows.

Previously, the gene expression values were log-transformed. Then we chose: + The 500 most variable genes + Every fourth patient sample

We saved this data as a matrix object brca_expr_sub

load(file.path(data_dir, "brca_expr_sub.RData"),verbose=TRUE)

## Loading objects:
##   brca_expr_sub

Let’s have a look at this Data:

dim_mat <- dim(brca_expr_sub)

print(paste("The log-transformed gene expression matrix has",dim_mat[1],
              "rows and",dim_mat[2],"columns."))

## [1] "The log-transformed gene expression matrix has 500 rows and 271 columns."

brca_expr_sub[1:10,1:10]

##          TCGA-3C-AAAU TCGA-4H-AAAK TCGA-A1-A0SD TCGA-A1-A0SH TCGA-A1-A0SM
## CLEC3A     -0.4444731   0.52735276   0.16677510   -0.9178450   0.01391312
## CPB1        1.8630076  -0.74264370   2.34079027   -0.5521063  -0.50657065
## SCGB2A2     1.2860995   1.08997172   0.11776738    1.5142496   1.70208586
## SCGB1D2     1.0984689   0.45135225   0.06435826    1.8327443   1.58792690
## TFF1       -0.4898101   0.86572807   0.91149541    0.3064546   1.15536664
## GSTM1       1.2050592   1.32794322  -0.93515772    0.5912250  -0.98400281
## PIP         1.0955774  -0.18334300   0.81964515   -0.1133463   1.16477436
## S100A7     -0.8548111  -0.85481114   0.05561040   -0.7490921   1.23274950
## MUCL1       0.2655204   0.01909743   1.55128156    0.5953695   0.14529979
## ANKRD30A    0.9432439   0.61670044   0.33181421    0.2513654   0.49356835
##          TCGA-A1-A0SQ TCGA-A2-A04R TCGA-A2-A04W TCGA-A2-A0CL TCGA-A2-A0CQ
## CLEC3A     -0.9178450   -0.9178450   1.49248571   1.41220392   -0.9178450
## CPB1        2.2647191    2.5915996  -0.68515082  -1.05813006   -0.2917495
## SCGB2A2     1.7966174   -0.7901945   1.60427635  -1.03776225   -0.5091813
## SCGB1D2     1.6129304   -0.6078331   0.90008564  -1.34170609   -0.2266103
## TFF1        1.2535206    0.9320101   0.05716392   0.21282911    1.0908511
## GSTM1       1.1410190    0.1197914  -0.91600571   0.92233293    1.6026970
## PIP         1.1542239   -0.6350037   0.33822210  -0.03328558    2.0589881
## S100A7     -0.8548111   -0.8548111   0.43581699  -0.74673270   -0.3336430
## MUCL1       0.4100594   -0.2858546   0.74592674  -0.67289520   -0.4109986
## ANKRD30A    0.6982975   -1.0272372   0.10020537  -1.08966091    1.0892899

Heatmap

A heatmap is a graphical representation of data that uses a system of color-coding to represent different values. Heat maps make it easy to visualize complex data and understand it at a glance.

heatmap(brca_expr_sub,                            
  scale="none", labRow="", labCol="",
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0) )

We create a new function object called “my_colors”, so we can be able to give our heatmap new colors.

my_colors <- colorRampPalette(c("cyan","deeppink3"))

Examples

Here we have more examples of Heatmaps with different colors and different styles as well.

Normal Heatmap

First up, we have a normal heatmap() but we did some basic color change to the heatmap.

Compared to the original(the heatmap before it) this one is a little unclear due to the color choice.

heatmap(brca_expr_sub, col= my_colors(100))

Heatmaply Heatmap

Finally, we have Heatmaply which is very similar to Plotly. I personally would use Heatmaply instead of Plotly because I’m much more familiar with it, and the colors they provide for the heatmaps are very good. Heatmaply also has the same function like Plotly, when you hover over the heatmap and it’ll show you the gene, and the patient but Heatmaply actually gives the name of the gene and also the patient barcode.

heatmaply(brca_expr_sub, colors= inferno(n=20, alpha = 1, begin = 0, end = 1, direction = 1),
fontsize_row = 1, fontsize_col= 1)

Clinical Information

We call in the data which has clinical information for 271 patients. This is necessary because our main focus has to do with “age_at_diagnosis”.

clin_sub<-read.csv(file.path(data_dir,"clin_sub.csv"),row.names=1)

dim(clin_sub)

## [1] 271  27

As we know, the rows come in first. This means that the rows are the patient samples, while the columns contain the features related to the patients. For example, “Gender” and “Tumor Status” are only 2 out of the 27 features that are listed in this file.

Let’s have a look at the rest of the features:

colnames(clin_sub)

##  [1] "bcr_patient_barcode"                    
##  [2] "gender"                                 
##  [3] "race"                                   
##  [4] "ethnicity"                              
##  [5] "age_at_diagnosis"                       
##  [6] "year_of_initial_pathologic_diagnosis"   
##  [7] "vital_status"                           
##  [8] "menopause_status"                       
##  [9] "tumor_status"                           
## [10] "margin_status"                          
## [11] "days_to_last_followup"                  
## [12] "prior_dx"                               
## [13] "new_tumor_event_after_initial_treatment"
## [14] "radiation_therapy"                      
## [15] "histological_type"                      
## [16] "pathologic_T"                           
## [17] "pathologic_M"                           
## [18] "pathologic_N"                           
## [19] "pathologic_stage_sub"                   
## [20] "pathologic_stage"                       
## [21] "lymph_node_examined_count"              
## [22] "number_of_lymphnodes_positive"          
## [23] "initial_diagnosis_method"               
## [24] "surgical_procedure"                     
## [25] "estrogen_receptor_status"               
## [26] "progesterone_receptor_status"           
## [27] "her2_receptor_status"

Age Grouping

I decided to start out at 30 and then I’ll start to move my way up through the ages.

From this, we can interpret that there aren’t many patients who were diagnosed at 30 or younger.

age<-30
x <- clin_sub[["age_at_diagnosis"]]
ag <- rep("", length(x)) 
ag[x <=age] <- "+"
ag[x >age] <- "-"
table(ag)

## ag
##   -   + 
## 266   5

We create a heatmap to see how many of the “+” are there, and to see the expression levels of the genes.

The “+” is in only one spot, and the expression levels of the genes are high towards the bottom, but as you move the column the colors start to change to more blue which means the starting genes or the genes on the top of the list are less expressed.

heatmap(
  brca_expr_sub,                            
  scale="none", labRow="", labCol=ag,
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0) 
)

Now the age has increased and gone up to 50. (We’re looking to see who’s been diagnosed at 50 and under, and who’s been diagnosed at an age higher than 50)

Since the age increased, we should see more of “+” in our upcoming heatmap.

I say that because we’re basically increasing our range of numbers from having a limit on the number 30. Our main focus is **X<=age which is the “+”. I have another symbol which is “-”, but this is to only show the difference between the people who are 50 or younger and above 50 years old.

age<- 50
x <- clin_sub[["age_at_diagnosis"]]
ag <- rep("", length(x)) 
ag[x <=age] <- "+"
ag[x >age] <- "-"
table(ag)

## ag
##   -   + 
## 187  84

heatmap(
  brca_expr_sub,                            
  scale="none", labRow="", labCol=ag,
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0) 
)

Now the age has reached 70. (We’re looking to see who’s been diagnosed at 70 and under, and who’s been diagnosed at an age higher than 70)

We’ve the past the median age which was around 58 years old, which means that our upcoming heat map should be mostly “+”.

age<- 70
x <- clin_sub[["age_at_diagnosis"]]
ag <- rep("", length(x)) 
ag[x <=age] <- "+"
ag[x >age] <- "-"
table(ag)

## ag
##   -   + 
##  49 222

heatmap(
  brca_expr_sub,                            
  scale="none", labRow="", labCol=ag,
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0) 
)

Age Ranges

To allow us to easily pick age ranges, we will assign the ages to each sample in the expression matrix.

age_matrix<-rbind(clin_sub$age_at_diagnosis, brca_expr_sub)

age_matrix[1:5,1:5]

##         TCGA-3C-AAAU TCGA-4H-AAAK TCGA-A1-A0SD TCGA-A1-A0SH TCGA-A1-A0SM
##           55.0000000   50.0000000  59.00000000   39.0000000  77.00000000
## CLEC3A    -0.4444731    0.5273528   0.16677510   -0.9178450   0.01391312
## CPB1       1.8630076   -0.7426437   2.34079027   -0.5521063  -0.50657065
## SCGB2A2    1.2860995    1.0899717   0.11776738    1.5142496   1.70208586
## SCGB1D2    1.0984689    0.4513522   0.06435826    1.8327443   1.58792690

We can use the ages in the first row to make a subset of the expression matrix. Then we have to remove the first row because it isn’t the right data type!

Creating this will help us create a range of patients who fall in the ages between 25 and 45. Then we can compare to the original ‘brca_expr_sub’ to see the difference.

age1<-25
age2<-45


matrix<-age_matrix[ ,((age_matrix[1,]>age1)&
                          (age_matrix[1,]<=age2))]

heatmap_matrix<-matrix[-1,]

clin_age<- clin_sub[((clin_sub$age_at_diagnosis>age1)&
                     (clin_sub$age_at_diagnosis<=age2)),]

Let’s the check the dimensions first, so we can get an idea of how many numbers we’ll be working with.

dim(heatmap_matrix)

## [1] 500  42

dim(clin_age)

## [1] 42 27

Let’s make a heatmap for ‘heatmap_matrix’:

When observing the heatmap, you can see that the amount of patients(columns) isn’t that large when compared to previous heatmaps.

heatmap(heatmap_matrix,                            
  scale="none", labRow="", labCol="",
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0) )

Now let’s make a heatmap for brca_expr_sub so we can compare both heatmaps.

The heatmap for ‘heatmap_matrix’ does look fairly similar to the ‘brca_expr_sub’ heatmap, but since the ‘heatmap_matrix’ has a certain age range to it, it makes it smaller when compared to the ‘brca_expr_sub’ heatmap.

The bottom left for each heatmap is fairly similar because in both ‘heatmap_matrix’ and ‘heatmap_matrix’ heatmaps they have a good amount of red, and a little of blue mixed within, but it’s mostly on the red side for both even though one is much smaller.

When it comes to the ‘brca_expr_sub’ heatmap it’s much more darker towards the center at the top which could lead to us thinking that’s where most triple negative patients reside.

heatmap(brca_expr_sub,                            
  scale="none", labRow="", labCol="",
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0) )

Let’s create a different age range and compare them.

age1<-50
age2<-75


matrix<-age_matrix[ ,((age_matrix[1,]>age1)&
                          (age_matrix[1,]<=age2))]

heatmap_matrix<-matrix[-1,]

clin_age<- clin_sub[((clin_sub$age_at_diagnosis>age1)&
                     (clin_sub$age_at_diagnosis<=age2)),]

let’s also check our dimensions for this new age range.

dim(heatmap_matrix)

## [1] 500 155

dim(clin_age)

## [1] 155  27

Create a new heatmap for our new age range:

-As we can see, our new heatmap with the range of 50 yrs-75yrs has a few minor differences when compared with the heatmap with the age range of 25 yrs-45yrs.

On the right-side of the heatmap with the age range of 25 yrs-45yrs, it’s more on the lighter side and most of the genes aren’t highly expressed.

Towards the right-side of the heatmap with the age range of 50yrs-75yrs it’s much darker(higher expression(Trip-Neg)) when compared to the heatmap with the age range of 25yrs-45yrs.

In both heatmaps, they have a similar expression towards the bottom right. The genes there are a mixture of both highly and low expressed genes.

heatmap(heatmap_matrix,                            
  scale="none", labRow="", labCol="",
  col=brewer.pal(10,"RdBu"), 
  zlim=c(-2, 2), margins = c(1, 0) )