About this activity

Until now, we worked with expression data and clinical information from breast cancer patient samples (TCGA). We found patterns in genes and in samples by color-coding the expression data in a heat map and then clustering the samples and the genes in the heat map based on how similar they are to each other.

We will continue working with data from experiments with human cancer cell lines from the Physical Sciences in Oncology Cell Line Characterization Study, which includes imaging- and microscopy-based measurements of physical properties of the cells, such as morphology (shape) and motility (movement).

In this activity, we will explore the function of the genes most differentially expressed between two cancer cell lines: one that moves relatively quickly on a hyaluronic acid collagen substrate and one that moves more slowly.

The fast cell line (MDA-MB-231) is our model for aggressive, triple-negative (or basal) breast cancer. The slow cell line (T-47D) is a model for ER+ breast cancer, which has a better prognosis.


Preliminaries

The knitr R package

knitr is the R package that generates the report from R Markdown. We can create reports as Word, PDF, and HTML files.

An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.

Loading images

The imager package allows us to view images within RStudio.

library(imager)  
## Loading required package: magrittr
## 
## Attaching package: 'imager'
## The following object is masked from 'package:magrittr':
## 
##     add
## The following objects are masked from 'package:stats':
## 
##     convolve, spectrum
## The following object is masked from 'package:graphics':
## 
##     frame
## The following object is masked from 'package:base':
## 
##     save.image

The data directory
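
The code below uses a variable called data_dir, but the chunk that sets it is not shown in this report. Here is a minimal sketch, assuming the data files sit in a folder named "data" inside the current working directory. The folder name is a placeholder, not the actual course location.

```r
# Hypothetical location of the data files; change "data" to match your setup.
data_dir <- file.path(getwd(), "data")

# file.path() joins the pieces of a path with "/" on every platform.
example_path <- file.path(data_dir, "VIF_KIF.JPG")

# Check that the folder exists before trying to load anything from it.
if (!dir.exists(data_dir)) {
  message("Folder not found: ", data_dir)
}
```

If the message appears, point data_dir at the folder that actually holds VIF_KIF.JPG and DGE_mat.RData.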


Vimentin and keratin in motility

Last time, we found that VIM, which codes for the protein vimentin, is at the very top of the expression matrix ranked by dge, and KRT23, which codes for a keratin protein (keratin 23), is near the bottom.

This image comes from the paper “Vimentin induces changes in cell shape, motility, and adhesion during the epithelial to mesenchymal transition”.

Cells that express vimentin filaments (VIF) but not keratin filaments (KIF) are elongated in shape and move better (more motile, panel A), whereas cells with KIF and not VIF are round and undergo fewer changes in morphology (shape) and position (panel B).

VIF_KIF_img<-load.image(file.path(data_dir,"VIF_KIF.JPG"))

plot(VIF_KIF_img,axes=FALSE)

The cells in A are elongated and move more quickly than the cells in B. A key difference between them is the expression level of VIM versus KRT23.

We discovered this by considering genes individually. Today, we’ll see how to use a list of genes to look for clues about biology.

Loading the data

We will work with the matrix of differentially expressed genes that we compiled last time for the PSON breast cancer cell line data.

# The objects will also appear in our "Environment" tab.
load(file.path(data_dir, "DGE_mat.RData"),verbose=TRUE) 
## Loading objects:
##   DGE_mat

There are three columns:

+ slow: log-transformed mRNA expression levels of the genes in the “slow” breast cancer cell line, T-47D
+ fast: log-transformed mRNA expression levels of the genes in the “fast” breast cancer cell line, MDA-MB-231
+ dge: differential gene expression, calculated as “fast” - “slow”


head(DGE_mat)
##                slow      fast      dge
## VIM      1.23878686 10.959234 9.720447
## LDHB     0.28688115  9.076067 8.789186
## SERPINE1 0.01435529  8.718704 8.704349
## MSN      0.01435529  8.709945 8.695590
## GPX1     0.04264434  8.720347 8.677703
## CAV1     0.00000000  8.457955 8.457955
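
As a quick check on how the dge column is defined, the sketch below rebuilds a toy matrix from the first three rows printed above and recomputes dge as fast - slow. The names toy, recomputed, and max_diff are made up for this illustration.

```r
# Toy matrix copied from the first three rows of the head(DGE_mat) output.
toy <- matrix(c(1.23878686, 10.959234, 9.720447,
                0.28688115,  9.076067, 8.789186,
                0.01435529,  8.718704, 8.704349),
              ncol = 3, byrow = TRUE,
              dimnames = list(c("VIM", "LDHB", "SERPINE1"),
                              c("slow", "fast", "dge")))

# Recompute dge as fast - slow and compare it with the stored column.
recomputed <- toy[, "fast"] - toy[, "slow"]
max_diff <- max(abs(recomputed - toy[, "dge"]))

print(round(recomputed, 6))
print(max_diff < 1e-5)   # TRUE when the two agree up to rounding of the printout
```

Because the values are log-transformed, a difference of logs corresponds to a ratio of expression levels. If the logs are base 2 (an assumption; the base is not stated here), a dge of 4 would mean a 2^4 = 16-fold difference between the cell lines.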

We saw last time that there are many genes that are preferentially expressed in one cell line versus the other.

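# x-axis: "slow" (column 1); y-axis: "fast" (column 2)
plot(DGE_mat[, "slow"], DGE_mat[, "fast"], pch = 20,
     xlab = "Log expression in slow cell line",
     ylab = "Log expression in fast cell line",
     xlim = c(0, 15), ylim = c(0, 15))
# Red line y = x: equal expression in both cell lines
abline(0, 1, col = "red")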

If each gene had the same expression level in both cell lines, the plot would be a straight line with a slope of 1 (the red line added to the plot). Many of the points do lie along this line, but most do not, and some are quite far away from the line.

The genes at the top and bottom of the DGE matrix, those most preferentially expressed in one cell line versus the other, correspond to the points in our scatter plot that are far away from the red line.

Last time, we visualized the distribution of “dge” values with a histogram. This time, let’s take the genes in the “tails” of the distribution.

hist(DGE_mat[,3], xlab = "Differential gene expression, dge",
                  main = "Histogram of dge values")
abline(v = c(-4, 4), lty = 2)

Let’s make a list of the most differentially expressed genes for an “off-site” exploration.

cutoff <- 4     # We can change this value to consider more or fewer genes

geneList <- DGE_mat[, 3]                 
names(geneList) <- row.names(DGE_mat)

head(geneList)
##      VIM     LDHB SERPINE1      MSN     GPX1     CAV1 
## 9.720447 8.789186 8.704349 8.695590 8.677703 8.457955

geneList contains the genes ranked by their differential expression, that is, the difference in their expression between the “fast” and “slow” cancer cell lines.

We’ll use the value of cutoff to get the genes in the tails of the histogram.

cutoff <- 4 

fast_genes <- names(geneList[geneList > cutoff])      
slow_genes <- names(geneList[geneList < -cutoff])   

length(fast_genes)
## [1] 291
length(slow_genes)
## [1] 206
# View(fast_genes)
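
To see how the choice of cutoff changes the number of genes in each tail, here is a small sketch. The dge values and the helper count_tails() are made up for illustration and are not taken from DGE_mat.

```r
# Made-up dge values for illustration only; not taken from DGE_mat.
toy_dge <- c(geneA = 9.7, geneB = 5.2, geneC = 3.1, geneD = 0.3,
             geneE = -0.8, geneF = -2.5, geneG = -4.5, geneH = -6.5)

# Hypothetical helper: count the genes in each tail for a given cutoff.
count_tails <- function(dge, cutoff) {
  c(cutoff = cutoff,
    fast = sum(dge > cutoff),     # upper tail: higher in the fast cell line
    slow = sum(dge < -cutoff))    # lower tail: higher in the slow cell line
}

# One row per cutoff value.
tail_counts <- t(sapply(c(2, 4, 6), count_tails, dge = toy_dge))
print(tail_counts)
```

A larger cutoff keeps fewer genes in each tail. With the real data, replacing toy_dge with geneList would give the counts for the cell lines.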

Let’s write the list of fast_genes to a file so we may do explorations with them off-line.

write.csv(fast_genes,"fast_genes.csv",
      quote=FALSE, row.names=FALSE)
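
Note that write.csv() turns the vector into a one-column table, so the file starts with a header line (x) above the gene names. If the tool we use off-line expects one gene symbol per line and nothing else (an assumption about the tool), writeLines() is one alternative. The sketch below uses the first six gene names shown earlier and writes to a temporary file; the names example_genes and out_file are made up for this illustration.

```r
# First six gene names, copied from the head(geneList) output above.
example_genes <- c("VIM", "LDHB", "SERPINE1", "MSN", "GPX1", "CAV1")

# Write to a temporary file so nothing in the working directory is overwritten.
out_file <- file.path(tempdir(), "fast_genes_example.txt")
writeLines(example_genes, out_file)

# Read the file back to confirm it holds one gene symbol per line, no header.
read_back <- readLines(out_file)
print(read_back)
```

The same call with slow_genes would save the other tail of the distribution.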

The genes with large differential expression (those in the tails of the histogram) are the most interesting to consider because they may provide us with clues as to why the cell lines behave so differently.