About this activity

We will load and examine R dataframe objects that contain data from over 1,000 breast cancer (BRCA) patients from The Cancer Genome Atlas (TCGA).

The objects include:


The data directory

We create an object that holds the name of the directory where the TCGA data resides (data_dir).


Loading the data

Like we did in the activity with TCGA breast cancer clinical data, we will load data from R files with the extension .RData. .RData files are binary files (i.e., not human readable) that can store multiple R objects, such as vectors, lists, matrices, and data frames. Usually, the data objects have already been cleaned up from the raw data and ready to use in exploratory data analysis. Though, of course, we should inspect the data first!

# We `load()` TCGA_brca.RData
# The file.path() function tells `load()` where our data resides
# The objects will also appear in our "Environment" tab.

load(file.path(data_dir, "TCGA_brca.RData"),verbose=TRUE)   
## Loading objects:
##   brca_clin_df
##   brca_expr_df

Revisiting the clinical data frame

During our last meeting, we spent some time exploring the clinical dataframe, brca_clin_df. Let’s review its properties.

View(brca_clin_df)

Histograms

Another way to look at the values of a feature is a histogram.

hist(brca_clin_df$age_at_diagnosis,
     xlab="Age at Diagnosis",
     main="Distribution of Ages")

A third way to look at the values of a feature is a boxplot.

boxplot(brca_clin_df$age_at_diagnosis,
     xlab="Age at Diagnosis",
     main="Distribution of Ages")

Review:

  1. What features does the data frame contain?
# uncomment code line below and replace `function` with the function 
# that will return the column names of the data frame

colnames(brca_clin_df)
##  [1] "bcr_patient_barcode"                    
##  [2] "gender"                                 
##  [3] "race"                                   
##  [4] "ethnicity"                              
##  [5] "age_at_diagnosis"                       
##  [6] "year_of_initial_pathologic_diagnosis"   
##  [7] "vital_status"                           
##  [8] "menopause_status"                       
##  [9] "tumor_status"                           
## [10] "margin_status"                          
## [11] "days_to_last_followup"                  
## [12] "prior_dx"                               
## [13] "new_tumor_event_after_initial_treatment"
## [14] "radiation_therapy"                      
## [15] "histological_type"                      
## [16] "pathologic_T"                           
## [17] "pathologic_M"                           
## [18] "pathologic_N"                           
## [19] "pathologic_stage_sub"                   
## [20] "pathologic_stage"                       
## [21] "lymph_node_examined_count"              
## [22] "number_of_lymphnodes_positive"          
## [23] "initial_diagnosis_method"               
## [24] "surgical_procedure"                     
## [25] "estrogen_receptor_status"               
## [26] "progesterone_receptor_status"           
## [27] "her2_receptor_status"
  1. How many total values are in the column for estrogen receptor (ER) status? HINT: You can find this out in the “Enivronment” tab.
# uncomment code line below and use one of the ways for finding  
# the that will return estrogen_receptor_status column

length(brca_clin_df$estrogen_receptor_status   )
## [1] 1082
  1. What are the unique values in the estrogen receptor (ER) status column?
# uncomment code line below and use one of the ways for finding  
# the that will return estrogen_receptor_status column

table(brca_clin_df$estrogen_receptor_status   )
## 
## [Not Evaluated]   Indeterminate        Negative        Positive 
##              48               2             236             796
  1. Tables provide both the unique values and how many samples contain the values.
table(brca_clin_df$radiation_therapy)
## 
##   [Discrepancy] [Not Available]       [Unknown]              NO             YES 
##               1              80              18             435             548

For example, we see that 796 patients have ER Positive tumors, and 236 have ER negative tumors.


The gene expression data frame

Another important source of information in The Cancer Genome Atlas (TCGA) breast cancer set is the gene expression data.

# Check the Environment tab to see the dimensions, too!
dim_df <- dim(brca_expr_df)  

print(paste("The gene expression data frame has",dim_df[1],
              "rows and",dim_df[2],"columns."))
## [1] "The gene expression data frame has 18351 rows and 1083 columns."
# The function `paste()` allows us to `print()` text and numbers together.

Let’s see what’s in the rows and columns using indexing to get

brca_expr_df[1:10,1:5] 
##      symbol TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK
## 1    TSPAN6       188.18        207.7         1005       1104.7
## 2      TNMD         0.34          1.1           40          1.2
## 3      DPM1       524.23        809.1          967        463.4
## 4     SCYL3       325.14       1558.9          336        549.3
## 5  C1orf112       124.29        274.7          242        218.2
## 6       FGR        75.14        182.7          284        144.0
## 7       CFH       348.11       1119.6         1031       1262.3
## 8     FUCA2       563.17       1099.5          872        862.6
## 9      GCLC       578.68        903.2          633        602.0
## 10     NFYA      1408.62        756.9          730       1150.6

The first column, “symbol”, provides the names of genes whose mRNA expression levels were measured in the experiments.

The remaining columns are the expression levels of the messenger RNAs (mRNAs) from different patient samples. The column names are anonymized indentifiers for patient samples.

Remember that the clinical data is organized so that patient sample identifiers are the rows of the data frame:

brca_clin_df[1:5,1:10] 
##   bcr_patient_barcode gender                      race              ethnicity
## 1        TCGA-3C-AAAU FEMALE                     WHITE NOT HISPANIC OR LATINO
## 2        TCGA-3C-AALI FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 3        TCGA-3C-AALJ FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 4        TCGA-3C-AALK FEMALE BLACK OR AFRICAN AMERICAN NOT HISPANIC OR LATINO
## 5        TCGA-4H-AAAK FEMALE                     WHITE NOT HISPANIC OR LATINO
##   age_at_diagnosis year_of_initial_pathologic_diagnosis vital_status
## 1               55                                 2004        Alive
## 2               50                                 2003        Alive
## 3               62                                 2011        Alive
## 4               52                                 2011        Alive
## 5               50                                 2013        Alive
##                                                                               menopause_status
## 1 Pre (<6 months since LMP AND no prior bilateral ovariectomy AND not on estrogen replacement)
## 2            Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## 3            Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
## 4                                                                                    [Unknown]
## 5            Post (prior bilateral ovariectomy OR >12 mo since LMP with no prior hysterectomy)
##   tumor_status margin_status
## 1   WITH TUMOR      Negative
## 2   TUMOR FREE      Negative
## 3   TUMOR FREE      Negative
## 4   TUMOR FREE         Close
## 5   TUMOR FREE      Negative

Notice that both the expression and clinical dataframes have information for the same patient samples: TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK

In the next Module, we will combine clinical and expression data for each patient.

# Are the column names of the expression data frame the same 
# as the patient IDs in the first column of the clinical dataframe?
# 'identical()` is a new function for us!

identical(colnames(brca_expr_df)[-1],brca_clin_df[,1])
## [1] TRUE

Matrix form

To do analysis with the expression data frame, we have to create a matrix that contains only numerical data.

The first column of brca_expr_df contains gene names, which are text characters.

typeof(brca_expr_df[,1])
## [1] "character"

The remaining columns are numeric. For example:

typeof(brca_expr_df[,1001])
## [1] "double"

“double” stands for “double-precision floating-point format.” In other words, a decimal number.

We want to create a matrix that contains only the numerical gene expression data. But we want to keep the information from the “symbol” column so we know what genes (mRNAs) the expression levels refer to.

# We remove the first column ("symbol"),
# use the function as.matrix() for the expression data
# and assign the result to a new object ("brca_expr_mat").

# Indexing with -1 removes the first column
brca_expr_mat <- as.matrix(brca_expr_df[,-1])  

# We want the row names to correspond to gene symbols. 
# We can do this with the row.names() function.
row.names(brca_expr_mat) <- brca_expr_df$symbol

# check the result
brca_expr_mat[1:5, 1:5]
##          TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## TSPAN6         188.18        207.7         1005       1104.7        942.6
## TNMD             0.34          1.1           40          1.2          4.7
## DPM1           524.23        809.1          967        463.4        486.4
## SCYL3          325.14       1558.9          336        549.3        416.3
## C1orf112       124.29        274.7          242        218.2        179.5

Average expression

Let’s look at the average value of the expression for the first gene (TSPAN6) across the patient samples.

# The function `mean()` takes the average of a list of numbers
# We are looking at the first row of values 
# There are two ways to identify the row we want

mean(brca_expr_mat[1,])
## [1] 996
mean(brca_expr_mat["TSPAN6",])
## [1] 996

Low and high mRNA levels

What is the average expression of the second gene (TNMD)?

# The function `mean()` takes the average of a list of numbers
mean(brca_expr_mat[2,])
## [1] 24
mean(brca_expr_mat["TNMD",])
## [1] 24

The average of the expression is much smaller for the second gene (24) than the first gene (996). Does this mean anything?

The gene TSPAN6 codes for the protein Tetraspanin-6. This protein helps signal events that play a role in cell growth and motility, and its higher level suggests motility of breast cancer cells may be important.

The gene TNMD codes for the protein Tenomodulin, which is important for the formation of tendons. TNMD is highly expressed in tendons, but lowly expressed in other parts of the body. Its lower level here suggests it does not play a role in breast cancer.

It would be tedious to examine each of the 18,351 genes in the expression matrix this way!!


The apply() function

We can use the function apply() to apply a function like mean() to all rows or columns of our matrix at once.

# We use `apply()` to "apply" the function mean() to all rows.
# The first argument is the matrix.
# The second argument is 1 for "rows".
# The third argument is the function name, `mean()`

mean.expr <- apply(brca_expr_mat, 1, mean)  
                   
mean.expr[1:25]                            
##   TSPAN6     TNMD     DPM1    SCYL3 C1orf112      FGR      CFH    FUCA2 
##      996       24      699      624      320      187     1246     1432 
##     GCLC     NFYA    STPG1   NIPAL3    LAS1L    ENPP4   SEMA3F     CFTR 
##      869     1326      215      985     1143      642     2180       19 
##   ANKIB1  CYP51A1    KRIT1    RAD52      BAD     LAP3     CD99   HS3ST1 
##     1614     1945      738      130      853     2646     2751       91 
##     AOC1 
##       84

Now that we have the mean expression for all 18K genes across the 1,082 patient samples, how else may we use this information?


Plotting average expression

Let’s try plotting the mean expression values (we saved these values in the object mean.expr) as a histogram.

hist(mean.expr, 
     main="Distribution of Gene Expression Values",
     xlab="Mean Expression")

Wow! Most of the average expression values are relatively small but those values are very compressed because of a few VERY large values.

How many values in mean.expr are greater than 10,000? We can find this out by choosing only those values in mean.expr that are greater than 10,000.

# This is a logical expression
greater_than_10000 <- (mean.expr > 10000)

sample(greater_than_10000,25)
##   SLC44A1      GPR3      LCTL      AMOT     KDM1B   L3MBTL3     NXNL1 C20orf196 
##     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE 
##    PPP1R8      NAAA    FIBCD1     AGAP5    S100A1     INHBB     VCAM1    EPS8L2 
##     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE 
##  SERPINB6   FAM124A   ADAMTS4      OGFR     ZPBP2    ZYG11B    ANGPT1    MAGED2 
##     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE     FALSE      TRUE 
##     CDK19 
##     FALSE

When we use an expression like “mean.expr > 10000” we get a vector that tells us if the condition is TRUE or FALSE for that gene. Note that most genes do NOT have expression level greater than 10,000, i.e. are FALSE.

We can use the logical condition to create an objective that has only genes with expression level > 10000.

# Create an object for expression values > 10,000
mean.expr.high <- mean.expr[mean.expr > 10000]

length(mean.expr.high)
## [1] 201
mean.expr.high[1:10]
##  RPS20  CSDE1    DCN    LTF   CD74    VIM   CD44 TMSB10   CDH1  HSPA5 
##  15894  13524  12904  19975  41389  22440  10251  20686  10912  17166

Out of 18,351 mean values, only 201 are greater than 10,000!

Let’s zoom in on the average expression values less than 10,000 with the logical condition “mean.expr < 10,000”.

# Let's look at the "distribution" of mean expression values
# that are less than 10000.

mean.expr.low <- mean.expr[mean.expr < 10000]  

hist(mean.expr.low, 
     breaks=50, 
     main="Distribution of Gene Expression Values",
     xlab="Mean Expression")

By “zooming” in, we observe that most genes actually have average expression < 2,000.

There are only a few genes that have really large average expression (201 out of 18351 genes). Nearly all genes have relatively low expression, and very few genes have high expression.

Our data has a long right tail or, in other words, is highly skewed.

Many functions in R assume the data are normally distributed, which means their histogram is expected to be bell-shaped.

For example, in the clinical data frame we saw that the ages of the patients follows a normal distribution.

hist(brca_clin_df$age_at_diagnosis,
     xlab="Age at Diagnosis",
     main="Distribution of Ages")

But this is not the situation with the expression data!

The log transformation can be used to make highly skewed distributions less skewed and sometimes normally distributed. And this can be valuable both for

  • making patterns in the data more interpretable, and
  • helping to meet the assumptions of the statistics underlying our analysis.

Log transformation

Expression data analysis is usually done with log base 2.

So the values (8, 4, 2) are log transformed to (3, 2, 1).

Let’s consider some larger numbers:

a <- c(4096, 1024, 512, 64)

b <- log(a,2)

plot(a,b,pch=19,col="red",
     ylab="Log transformed values",
     xlab="Raw values")

Notice how the scale for the log-transformed values (vertical axis) is much more compressed than for the raw values (horizontal axis).

There is one caveat. The log2 of 0 is infinity!

log(0,2)
## [1] -Inf

But the log2 of 1 is “well-behaved”.

log(1,2)
## [1] 0

So before taking the log2 of our expression data, we will add 1 to each value to avoid the occurence of “-Inf” values. This is standard practice and, as evidenced from the histogram of expression values, it is a small change.

# Add 1 to each element, then take the log base 2.
brca_expr_mat.log <- log(brca_expr_mat+1, 2)  

# Look at some values:
brca_expr_mat.log[1:5,1:5]
##          TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## TSPAN6           7.56          7.7         10.0         10.1          9.9
## TNMD             0.43          1.1          5.4          1.2          2.5
## DPM1             9.04          9.7          9.9          8.9          8.9
## SCYL3            8.35         10.6          8.4          9.1          8.7
## C1orf112         6.97          8.1          7.9          7.8          7.5

Let’s check out the distrubution (histogram) for the log-transformed expression values:

mean.expr <- apply(brca_expr_mat.log, 1, mean)  
                                                

hist(mean.expr, breaks=100, main="Distribution of Log_2 Gene Expression Values",
  xlab="Mean Log_2 Expression")

The log-transformed data is less scewed and even has some bell-shaped character.

While this isn’t a true normal distribution of data (especially at the lower range), it is much closer, and importantly, close enough for some exploratory analyses.

In our future analyses, we will be working with the log-transformed data matrix brca_expr_mat.log.


Practice: mRNA levels of oncogenes

Learn about genes related to breast cancer at this site: https://www.nationalbreastcancer.org/other-breast-cancer-genes and look at their average values across patients.

STK11 codes for the protein Serine/Threonine Kinase 11, which is a tumor suppressor, a type of protein that helps control cell growth. STK11 gene mutations cause Peutz-Jeghers syndrome. Peutz-Jeghers syndrome carries an increased risk for multiple types of cancer, including breast cancer.

mean(brca_expr_mat["STK11",])
## [1] 1026

What is the mean expression of CHEK2?

# Uncomment the code line below and replace XXXXX with the correct gene name

mean(brca_expr_mat["CHEK2",])
## [1] 259

What is the distribution of CHEK2 across patients?

# Uncomment the code lines below and replace XXXXX with the correct gene name

hist(brca_expr_mat["CHEK2",], xlab="CHEK2 expression", main="Distribution of CHEK2 expression")

abline(v = mean(brca_expr_mat["CHEK2",]), col = 'blue')

Most patients have relatively low expression of CHEK2.

Find the average expression of PTEN and create a distribution of mRNA levels across patients.

# Uncomment the code lines below, and replace function with 
# a function to calculate histogram and replace XXXXX with the correct gene name

hist(brca_expr_mat["PTEN",],xlab="PTEN expression", main="Distribution of PTEN expression")

abline(v = mean(brca_expr_mat["PTEN",]), col = 'blue')

STK11, CHEK2, and PTEN are normally tumor suppressor genes. Mutations in tumor suppressor genes can lead to malignancies.

Let’s save our work as an html file!


The knitr R package

knitr() is the R package that generates reports from R Markdown. R Markdown is the syntax we are using in this document, denoted by the .Rmd file type. With knitr() and R Markdown, we can create reports of our work in the form of Word doc, PDF, and HTML files.

An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.

Click the “Knit” button at the top of this window and select “Knit to html.”

Congratulations on completing another activity!

Now that we’ve gained some experience working with expression data, we can move onto the next activity and look for interesting patterns.