TCGA Breast Cancer Expression Data

About this activity

We will load and examine R dataframe objects that contain data from over 1,000 breast cancer (BRCA) patients from The Cancer Genome Atlas (TCGA).

The objects include:

clinical measurements on the patients and the patients’ tumors, such as gender, age, estrogen, progesterone, and her2 receptor status. We examined this data in detail in our previous activity, Breast_Cancer_Patient_Data.Rmd.
gene expression data which tells us how many messenger RNAs (mRNAs) per gene are present in a patient sample. The amount of a gene’s mRNA corresponds (roughly) to the amount of protein in the sample.

Loading the data

Like we did in the activity with TCGA breast cancer clinical data, we will load data from R files with the extension .RData. .RData files are binary files (i.e., not human readable) that can store multiple R objects, such as vectors, lists, matrices, and data frames. Usually, the data objects have already been cleaned up from the raw data and are ready to use in exploratory data analysis. However, we should inspect the data first!

# We `load()` TCGA_brca.RData
# The objects will also appear in our "Environment" tab.
load("/home/data/TCGA_brca.RData",verbose=TRUE)

## Loading objects:
##   brca_clin_df
##   brca_expr_df

The clinical data frame

Last time, we worked with the table brca_clin_df and examined various clinical features of the patients samples, for example receptor status (ER, PR, HER2).

# The table() function provides a summary of columns
table(brca_clin_df$estrogen_receptor_status)

## 
## [Not Evaluated]   Indeterminate        Negative        Positive 
##              48               2             236             796

The gene expression data frame

Another important source of information in The Cancer Genome Atlas (TCGA) breast cancer set is the gene expression data.

# Check the Environment tab to see the dimensions, too!
dim_df <- dim(brca_expr_df)  

# The function `paste()` allows us to `print()` text and numbers together.
print(paste("The gene expression data frame has",dim_df[1],
              "rows and",dim_df[2],"columns."))

## [1] "The gene expression data frame has 18351 rows and 1083 columns."

The expression data frame is HUGE, but we can still View() it.

# Indexing gives us a subset of the data
# brca_expr_df[1:10,1:5] 
View(brca_clin_df)

The first column, “symbol”, provides the names of genes whose mRNA expression levels were measured.

The remaining columns are the expression levels of the messenger RNAs (mRNAs) from different patient samples. The column names are anonymized indentifiers for patient samples.

The clinical data is organized so that patient sample identifiers appear in the first column and the features are the rows of the data frame.

Are the column names of brca_expr_df the same as the patient IDs in the first column of brca_clin_df?

# identical()` is a new function for us!
# Let's compare the column names of brca_expr_df
# with the first column in brca_clin_df

identical(colnames(brca_expr_df)[-1],brca_clin_df[,1])

## [1] TRUE

So there is a direct one-to-one correspondence between patient ID in the rows of brca_clin_df and the columns of brc_expr_df.

Matrix form

To do analysis with the expression data frame, we have to create a matrix that contains only numerical data. We did this with the mtcars heatmap activity.

The first column of brca_expr_df contains gene names, which are text characters.

# What type of data is contained in the first column?
typeof(brca_expr_df[,1])

## [1] "character"

The remaining columns are numeric. For example:

# What type of data is contained in the remaining columns?
typeof(brca_expr_df[,2])

## [1] "double"

“double” stands for “double-precision floating-point format.” In other words, a decimal number.

We want to create a matrix that contains only the numerical gene expression data. But we also want to keep the information from the “symbol” column (genes). We will set the first column as the row names of our matrix.

# remove the first column ("symbol"),
# use the function as.matrix() for the expression data
# and assign the result to a new object ("brca_expr_mat").

# Indexing with -1 removes the first column
brca_expr_mat <- as.matrix(brca_expr_df[,-1])  

# use the row.names() function use the gene symbols
row.names(brca_expr_mat) <- brca_expr_df[,1]

# check the result
brca_expr_mat[1:5, 1:5]

##          TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## TSPAN6       188.1840     207.7220    1005.4400    1104.6800     942.5530
## TNMD           0.3447       1.0875      39.8912       1.2412       4.6809
## DPM1         524.2260     809.1350     967.3620     463.3840     486.3830
## SCYL3        325.1410    1558.9400     335.8300     549.2550     416.2850
## C1orf112     124.2940     274.6660     241.6860     218.2250     179.4600

Let’s doublecheck that the first element of the matrix is what we expect:

# Value from first row, first column:
brca_expr_mat[1, 1]

## [1] 188.184

Let’s keep in mind that the rows of brca_expr_mat correspond to genes, and the columns correspond to patients.

Average expression

Let’s look at the average value of the expression for the first gene (TSPAN6) across the patient samples.

# The function `mean()` takes the average of a list of numbers
# We are looking at the first row of values 
# There are two ways to identify the row we want
mean(brca_expr_mat[1,])

## [1] 996.1939

mean(brca_expr_mat["TSPAN6",])

## [1] 996.1939

Low and high mRNA levels

What is the average expression of the second gene (TNMD)?

# The function `mean()` takes the average of a list of numbers
mean(brca_expr_mat[2,])

## [1] 23.68886

mean(brca_expr_mat["TNMD",])

## [1] 23.68886

The average of TNMD expression (24) is much smaller than that of TSPAN6 (996). Does this mean anything?

The gene TSPAN6 codes for the protein Tetraspanin-6. This protein helps signal events that play a role in cell growth and motility, and its higher level suggests motility of breast cancer cells may be important.

The gene TNMD codes for the protein Tenomodulin, which is important for the formation of tendons. TNMD is highly expressed in tendons, but lowly expressed in other parts of the body. Its lower level here suggests it does not play a role in breast cancer.

You can find more information on their proteins by searching for them at the UniProt Knowledgebase.

This is fun, but it would be tedious to examine each of the 18,351 genes in the expression matrix this way!

The `apply()` function

We can use the function apply() to apply a function like mean() to all rows or columns of our matrix at once.

# We use `apply()` to "apply" the function mean() to all rows.
# The first argument is the matrix.
# The second argument is 1 for "rows".
# The third argument is the function name, `mean()`
mean.expr <- apply(brca_expr_mat, 1, mean)  

# Show the first 25 values and round them to whole numbers
round(mean.expr[1:25],0)

##   TSPAN6     TNMD     DPM1    SCYL3 C1orf112      FGR      CFH    FUCA2 
##      996       24      699      624      320      187     1246     1432 
##     GCLC     NFYA    STPG1   NIPAL3    LAS1L    ENPP4   SEMA3F     CFTR 
##      869     1326      215      985     1143      642     2180       19 
##   ANKIB1  CYP51A1    KRIT1    RAD52      BAD     LAP3     CD99   HS3ST1 
##     1614     1945      738      130      853     2646     2751       91 
##     AOC1 
##       84

Now that we have the mean expression for all 18K genes across the 1,082 patient samples, how else may we use this information?

Plotting average expression

Let’s try plotting the mean expression values (we saved these values in the object mean.expr) as a histogram.

# the hist() function gives the distribution of values
hist(mean.expr, 
     main="Distribution of Gene Expression Values",
     xlab="Mean Expression")

Wow! Most of the average expression values are relatively small but those values are very compressed because of a few VERY large values.

How many values in mean.expr are greater than 10,000? We can find this out by choosing only those values in mean.expr that are greater than 10,000.

We can use the logical condition to create an objective that has only genes with expression level > 10000.

# Create an object for expression values > 10,000
mean.expr.high <- mean.expr[mean.expr > 10000]

length(mean.expr.high)

## [1] 201

Out of 18,351 mean values, only 201 are greater than 10,000!

Let’s zoom in on the average expression values less than 10,000 with the logical condition “mean.expr < 10000”.

# Let's look at the "distribution" of mean expression values
# that are less than 10000.
mean.expr.low <- mean.expr[mean.expr < 10000]  

hist(mean.expr.low, 
     breaks=50, 
     main="Distribution of Gene Expression Values",
     xlab="Mean Expression")

By “zooming” in, we observe that most genes actually have average expression < 2,000.

There are only a few genes that have really large average expression (201 out of 18351 genes). Nearly all genes have relatively low expression, and very few genes have high expression.

Our data has a long right tail or, in other words, is highly skewed.

Many functions in R assume the data are normally distributed, which means their histogram is expected to be bell-shaped.

Log transformation

The log transformation can be used to make highly skewed distributions less skewed and sometimes normally distributed. And this can be valuable both for

making patterns in the data more interpretable, and
helping to meet the assumptions of the statistics underlying our analysis.

Expression data analysis is usually done with log base 2.

Log base 2 of 8 is 3 because 2 * 2 * 2 = 8.
Log base 2 of 4 is 2 because 2 * 2 = 4.
Log base 2 of 2 is 1 because 2 = 2.

So the values (8, 4, 2) are log transformed to (3, 2, 1).

Let’s consider some larger numbers:

# Create an array of values with the combine function
a <- c(4096, 1024, 512, 64)

# Take the log of the values in a
b <- log(a,2)

# Scatterplot
plot(a,b,pch=19,col="red",
     ylab="Log transformed values",
     xlab="Raw values")

Notice how the scale for the log-transformed values (vertical axis) is much more compressed than for the raw values (horizontal axis).

There is one caveat. The log2 of 0 is infinity!

log(0,2)

## [1] -Inf

But the log2 of 1 is “well-behaved” and very small.

log(1,2)

## [1] 0

So before taking the log2 of our expression data, we will add 1 to each value to avoid the occurence of “-Inf” values. This is standard practice and, as evidenced from the histogram of expression values, it is a small change.

# Add 1 to each element, then take the log base 2.
brca_expr_mat.log <- log(brca_expr_mat+1, 2)  

# Look at some values and keep two decimal places
round(brca_expr_mat.log[1:5,1:5],2)

##          TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## TSPAN6           7.56         7.71         9.98        10.11         9.88
## TNMD             0.43         1.06         5.35         1.16         2.51
## DPM1             9.04         9.66         9.92         8.86         8.93
## SCYL3            8.35        10.61         8.40         9.10         8.70
## C1orf112         6.97         8.11         7.92         7.78         7.50

Let’s check out the distrubution (histogram) for the log-transformed expression values:

# use `apply()` to "apply" the function mean() to all rows.
mean.log.expr <- apply(brca_expr_mat.log, 1, mean)  
                                                
hist(mean.log.expr, breaks=100, main="Distribution of Log_2 Gene Expression Values",
  xlab="Mean Log_2 Expression")

The log-transformed data is less scewed and even has some bell-shaped character.

While this isn’t a true normal distribution of data (especially at the lower range), it is much closer, and importantly, close enough for some exploratory analyses.

We will continue working with the log-transformed data matrix brca_expr_mat.log.

Subsetting the data

In exploratory analysis the work is often done first with a subset of the data. The code will run faster, the plots are easier to interpret, and it is easy to change parameters and re-check the analysis. We can return to the full data set later.

Most variable genes

We can make a heatmap for the full matrix of 18351 genes and 1,082 patients, but it is good practice to work with a smaller prioritized set of data.

To reduce the number of genes to consider, we will select the “most important” ones for our exploratory analysis.

One common approach is to find the genes with the biggest differences or highest variance across patient samples.

Variance is a measure of dispersion, meaning it tells us how far a set of numbers is spread out from their average value.

We will use the apply() function again, this time to find the variance of each gene across patient samples.

#  `apply()` to get the variance with `var()`  
# The argument "1" applies `var()` to the rows
var_genes <- apply(brca_expr_mat.log, 1, var)   

hist(var_genes,
     main = "Variance of log-transformed gene expression",
     xlab = "Variance")

It looks like we can choose genes with a variance larger than 5 to get a smaller set.

Since the estrogen receptor (gene name ESR1) is an important gene in the breast cancer patient cohort, let’s see what its variance is.

var_genes[names(var_genes)=="ESR1"]

##     ESR1 
## 10.18163

The value of 10 for the variance of ESR1 suggests that variance > 5 is a reasonable cutoff.

How many genes have variance > 5?

# length() tells how many elements in an array
length(var_genes[var_genes>5])

## [1] 690

We’ll consider the 690 genes with variance > 5 (out of 18,351 total genes) in our analysis.

Let’s order our matrix so that the most variable genes appear in the top rows and create a sub-matrix for these genes.

The function order() returns the order of each gene that will rank the list of variances from highest to lowest. We can use it to re-order the entire matrix so that the most variable genes appear at the top.

# `order()` ranks the values 
# The argument decreasing=TRUE orders from large to small values
order_var <- order(var_genes, decreasing=TRUE)   

# We choose 690 genes, but you can change this value!
num_genes <- 690

                
# Create a matrix that consists of the 500 genes (rows) with the highest variance 
brca.log.var <- brca_expr_mat.log[order_var[1:num_genes],]

The new matrix will have the log-transformed gene expression values, but it will contain only the most variable genes. Let’s take a look at the new matrix:

# Indexing
dim(brca.log.var)

## [1]  690 1082

round(brca.log.var[1:5,1:5],2)

##         TCGA-3C-AAAU TCGA-3C-AALI TCGA-3C-AALJ TCGA-3C-AALK TCGA-4H-AAAK
## CLEC3A          2.45        11.77         0.00         5.38         7.49
## CPB1           16.16         4.24        11.11         6.93         2.71
## SCGB2A2        15.97        16.16         0.00         8.74        14.98
## SCGB1D2        12.23        10.53         0.93         6.55         9.34
## TFF1            6.33         7.55         7.82        13.60        12.35

Let’s examine this a bit further…

Boxplots

As we’ve learned, boxplots provide a summary view of the range of expression values for different genes.

We’ll check the first 10 genes from our original matrix and then the 10 most variable genes from our matrix re-ordered by variance.

# the function par() allows us to control how plots are shown
par(mfrow=c(1,2))

# the last set of arguments to boxplot() control how text appears

# a boxplot for our original matrix
boxplot(t(brca_expr_mat.log[1:10,]), 
        main="Sample of genes",
        ylab="Log_2 Expression",ylim=c(0,20),
        cex.names=0.2, las=2, pch=20, cex=0.6)

# a boxplot for the sub-matrix
boxplot(t(brca.log.var[1:10,]), 
        main="Most variable genes",
        ylab="Log_2 Expression",ylim=c(0,20),
        cex.names=0.2, las=2, pch=20, cex=0.6)

The range of values for the most variable genes is much greater!

Remember, we are assuming that genes with greater variance across patient samples are more important.

Subset of samples

Similarly, we’ll consider only a subset of the patient samples, too. To pull out a fourth of the samples, let’s create a vector our_samples that stores the indices of every 4th sample.

# The seq() function creates a series of numbers 
# between the first two arguments.
# The third argument says we will get every fourth column 
our_samples <- seq(1, ncol(brca.log.var), 4)   

#Let's check that we did this correctly!                                            
length(our_samples)

## [1] 271

length(our_samples)/ncol(brca.log.var)

## [1] 0.2504621

We’ve chosen 271 samples, one out of every four.

Create a new object, our_matrix that contains only the samples and genes we want to consider:

# The matrix with the 690 genes and 361 samples 
our_matrix <- brca.log.var[,our_samples]

round(our_matrix[1:5,1:5],2)

##         TCGA-3C-AAAU TCGA-4H-AAAK TCGA-A1-A0SD TCGA-A1-A0SH TCGA-A1-A0SM
## CLEC3A          2.45         7.49         5.62         0.00         4.83
## CPB1           16.16         2.71        18.63         3.69         3.93
## SCGB2A2        15.97        14.98        10.10        17.12        18.06
## SCGB1D2        12.23         9.34         7.60        15.52        14.42
## TFF1            6.33        12.35        12.56         9.87        13.64

dim(our_matrix)

## [1] 690 271

What are the dimensions of our_matrix and what does it contain?

# information on the dimensions of our new matrix
dim_sub <- dim(our_matrix)

print(paste("The subset matrix for log-transformed gene expression data has",dim_sub[1], "rows and",dim_sub[2],"columns."))

## [1] "The subset matrix for log-transformed gene expression data has 690 rows and 271 columns."

To summarize our_matrix contains the 690 most variable genes across ALL 1,082 patients, and then we chose every fourth patient for our exploratory data analysis.

Gene expression heatmap

A heatmap is a graphical representation of data that uses a system of color-coding to represent different ranges of values. Heat maps make it easy to visualize complex data and understand it at a glance.

We will create the heatmap with the same function call and arguments we used in R_Heatmaps.Rmd for mtcars.

First, we’ll scale the data so the genes are on a similar scale while preserving their relative order.

# scale() "normalizes" our data
our_matrix_scaled <- scale(our_matrix)

The heatmap() function is very powerful and allows us to do many things at once. Let’s examine the color-coded matrix:

# Use `heatmap()` the same way as for `mtcars`
heatmap(our_matrix, 
        labRow="", labCol="", margins = c(2, 2),
        xlab="Patients", ylab="Genes",
        zlim=c(-2, 2))

We have more data points here than with the mtcars dataset, but the result is similar: Data values have been color-coded with the yellow-orange-red palette (“YlOrRd”).

Color schemes

Let’s try another color palette.

# Load the palette package
library(RColorBrewer)

# Parameters for plotting 
par(cex = 0.5)

# Get a graphic for all color schemes
display.brewer.all()

Often in publications with gene expression data, a red-blue palette is used, “RdBu”.

# the `col` argument allows us to choose a different palette.
heatmap(our_matrix_scaled,
        labRow="", labCol="", margins = c(2, 2),
        zlim=c(-2, 2),
        xlab="Patients", ylab="Genes",
        col=brewer.pal(8,"RdBu"))

This color scheme makes the patterns even easier to see.

Annotating Heatmap Clusters

We saw in the activity Breast_Cancer_Patient_Data that most samples are either ER+ tumors, depending on whether the tumor expresses high levels of receptors for estrogen. This is really important to know because it tells us what’s driving the disease and also how to treat it.

To see whether ER status is associated with the groups in the reordered heatmap, we will label the samples with a “+” for ER+ breast cancers, and “.” for negative.

To get the information on ER+/ER-, we go back to the clinical data, brca_clin_df.

table(brca_clin_df$estrogen_receptor_status)

## 
## [Not Evaluated]   Indeterminate        Negative        Positive 
##              48               2             236             796

ER status

Let’s look at the values of

#Extract the rows corresponding to our_samples 
clin <- brca_clin_df[our_samples,]  

#Pull out the "estrogen_receptor_status" feature                              
x <- clin$estrogen_receptor_status 

head(x, 10)

##  [1] "Positive" "Positive" "Positive" "Negative" "Positive" "Positive"
##  [7] "Positive" "Negative" "Positive" "Positive"

Assign the labels:

# `er` will have labels + for ER positive and . for ER negative
# Note we are using boolean expressions to assign the symbols
er <- rep("", length(x))                  
er[x == "Positive"] <- "+"                
er[x == "Negative"] <- "."                
table(er)

## er
##       .   + 
##  13  57 201

In our matrix with 691 most variable genes and 271 patients, 57 samples are ER- and 201 are ER+. Note than 13 samples have missing values.

Heatmap with ER status

We can add more arguments to the heatmap() function to include the new ER labels we created.

# NOTE: we are giving `er` to the argument labCol
heatmap(our_matrix_scaled,
        labRow="", labCol=er, 
        cexCol = 0.4, margins = c(2, 2),
        zlim=c(-2, 2),
        xlab="Patients", ylab="Genes",
        col=brewer.pal(8,"RdBu"))

In this plot, we see about four group of samples. Most of them have a lot of ER+ tumors (marked by the row of pluses on the bottom), while one (on the right) is nearly all ER-.

Beautiful!! We see that the clustering on gene expression data seems to recapitulate the ER status of the tumors. A great deal of research has been done on this and other associations between gene expression profiles and the characteristics of patient tumors.

DREAM-High Challenge 1

Perform heatmap analyses for PR status and HER2 status

Heatmap with PR status

Create the object pr to annotate the columns with progesterone receptor status:

The symbol “+” for “Positive”
The symbol “.” for “Negative”

Above, we created the table clin with clinical data for the 271 patients. We’ll use that object again.

# This code block is similar to the one for ER status
# replace FEATURE with feature for PR status
# look at `brca_clin_df` to find the feature name
# Don't forget to uncomment your code lines

# y <- clin$FEATURE
# pr <- rep("", length(y))                     

# pr[y == "Positive"] <- "+"                              
# pr[y == "Negative"] <- "."    

#table(pr)

How many samples are PR+ and PR-? Replace MMM and NNN to assign your answers to the variables PRpos and PRneg.

# Enter the value you obtained from the previous code chunk
# Don't forget to uncomment your code lines

#PRpos <- MMM
  
#PRneg <- NNN

# print(paste("Our matrix has",PRpos, 
#            "progesterone positive samples and",
#            PRneg,"progesterone negative samples."))

Create the heatmap annotated with PR status. We already have the matrix we want to map, our_matrix_scaled.

# Replace OBJECT for the data to plot 
# Replace YY in the argument labCol to label the columns with PR status
# Don't forget to uncomment your code lines

# heatmap(OBJECT,
#        labRow="", labCol=YY, 
#        cexCol = 0.4, margins = c(2, 2),
#        zlim=c(-2, 2),
#        xlab="Patients", ylab="Genes",
#        col=brewer.pal(8,"RdBu"))

Is there overlap in the clusters for ER- and PR-? Replace XXX to assign your answer (YES/NO) to the variable PR_ER_neg_overlap.

# Replace XXX with YES or NO between the quotes
# Don't forget to uncomment your code lines

#PR_ER_neg_overlap <- " XXX "
# print(paste("Do ER- and PR- samples overlap in the heatmap?", 
#            PR_ER_neg_overlap))

Heatmap with HER2 status

Create the object her2 to annotate the columns with HER2 status:

The symbol “+” for “Positive”
The symbol “.” for “Negative”

# This code block is similar to the one for ER status
# replace FEATURE with feature for HER2 status
# Don't forget to uncomment your code lines

# z <- clin$FEATURE
# her2 <- rep("", length(z))  

# The logical statements assign "+" and "."

# her2[z == "Positive"   ] <- "+"                              
# her2[z == "Negative"   ] <- "."                   

# table(her2)

Create the heatmap annotated with HER2 status:

# Replace OBJECT for the data to plot 
# Replace ZZ in the argument labCol to label the columns with PR status
# Don't forget to uncomment your code lines

# heatmap(OBJECT,
#         labRow="", labCol=her2, 
#         cexCol=0.4, margins = c(2, 2),
#         zlim=c(-2, 2),
#         xlab="Patients", ylab="Genes",
#         col=brewer.pal(8,"RdBu"))

Is there overlap in the clusters for ER-, PR-, HER2-? Replace XXX to assign your answer (YES/NO) to the variable PR_ER_HER2_neg_overlap.

# Replace XXX with YES or NO between the quotes
# Don't forget to uncomment your code lines

#PR_ER_HER2_neg_overlap <- " XXX "
# print(paste("Do ER- and PR- samples overlap in the heatmap?", 
#              PR_ER_HER2_neg_overlap))

The leftmost cluster of patients correspond to those with triple negative breast cancer. Now we see that the clustering on gene expression data seems to recapitulate the ER, PR, and HER2 status of the tumors. Our result is important because we find that the patients with triple negative breast cancer share a distinct pattern of expression in their genes.

DREAM-High Challenge 2

Learn about genes related to breast cancer at this site: https://www.nationalbreastcancer.org/other-breast-cancer-genes and look at their average values across patients.

STK11

STK11 codes for the protein Serine/Threonine Kinase 11, which is a tumor suppressor, a type of protein that helps control cell growth. STK11 gene mutations cause Peutz-Jeghers syndrome. Peutz-Jeghers syndrome carries an increased risk for multiple types of cancer, including breast cancer.

mean(brca_expr_mat["STK11",])

## [1] 1026.01

CHEK2

What is the mean expression of CHEK2?

# Uncomment the code line below and replace XXXXX with the correct gene name

# mean(brca_expr_mat["XXXXX",])

What is the distribution of CHEK2 across patients?

# Uncomment the code lines below and replace XXXXX with the correct gene name

# hist(brca_expr_mat["XXXXX",], xlab="CHEK2 expression", main="Distribution of CHEK2 expression")

# abline(v = mean(brca_expr_mat["XXXXX",]), col = 'blue')

Most patients have relatively low expression of CHEK2.

PTEN

Find the average expression of PTEN and create a distribution of mRNA levels across patients.

# Uncomment the code lines below, and replace FUNCTION with 
# a function to calculate histogram and 
# replace XXXXX with the correct gene name

# FUNCTION(brca_expr_mat["XXXX",],
#          xlab="PTEN expression", 
#          main="Distribution of PTEN expression")

# abline(v = mean(brca_expr_mat["XXXXX",]), col = 'blue')

STK11, CHEK2, and PTEN are normally tumor suppressor genes. Mutations in tumor suppressor genes can lead to malignancies.

Relative expression levels

How many genes have mean expression less than 2500?

# How many genes have mean expression values less than 2500
mean.expr.most <- mean.expr[mean.expr < 2500]  
ngenes <- length(mean.expr.most)
per_ngenes <- round( (ngenes/18351)*100, 0)

print(paste(per_ngenes, "% of genes have mean expression less than 2500",sep=""))

## [1] "92% of genes have mean expression less than 2500"

# Let's look at the distribution of mean expression values 

hist(mean.expr.most, 
     breaks=50, 
     main="Distribution of Gene Expression Values",
     xlab="Mean Expression")

Given the distribution of the mean expression values for almost all of the genes we are considering, are the values of STK11, CHEK2, and PTEN relatively high or relatively low?

# Replace MEAN1, MEAN2, MEAN3
# Create the object ANS by concatenating the gene names with c()
# Remember, the gene names have to be in quotes
# For example, c("cat", "dog", "hamster")
# Don't forget to uncomment your code lines

# print(paste("The mean expression of STK11 is", MEAN1))
# print(paste("The mean expression of CHEK2 is", MEAN2))
# print(paste("The mean expression of PTEN is", MEAN3))

# ANS <- c( )

# print(paste("On average", ANS, "is highly expressed compared to most other genes."))

Let’s save our work as an html file and post the report on Rpubs!

Click the “Knit” button at the top of this window and select “Knit to html.”

Congratulations on completing another activity!

We’ve done a lot of work with TCGA breast cancer data. Next week, we will examine data on cancer cell lines and their physical properties.