About this activity

Until now, we worked with expression data and clinical information from breast cancer patient samples (TCGA). We found patterns in genes and in samples by color-coding the expression data in a heat map and then clustering the samples and the genes in the heat map based on how similar they are to each other.

In this module, we will work with data from experiments with human cancer cell lines from the Physical Sciences in Oncology Cell Line Characterization Study, which includes imaging- and microscopy-based measurements of physical properties of the cells, such as morphology (shape) and motility (movement). We will examine:

  • the expression levels of genes, and
  • how fast the cells move.

Preliminaries

The knitr R package

knitr() is the R package that generates the report from R Markdown. We can create reports as Word doc, PDF, and HTML files.

An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.

The data directory


Loading the data

PSON.RData contains gene expression and cell speed data from the Physical Sciences in Oncology project.

# The objects will also appear in our "Environment" tab.
load(file.path(data_dir, "PSON.RData"),verbose=TRUE) 
## Loading objects:
##   pson_expr_df
##   cell_speeds_df

Objects were named to be be descriptive.

  • “pson” stands for “Physical Sciences in Oncology”
  • “expr” stands for “gene expression data”
  • “df” stands for “data frame”

Expression data

The Environment tab gives us basic information on the object that was loaded.

You can use some of the R functions we learned previously to remind us what is in the data frames we loaded.

  • dim() tells us the dimensions (# row, # columns) of the objects.
  • head() shows us the top several rows of the objects.
  • indexing allows us to look at the rows and columns we choose.
pson_expr_df[1:5,1:5]
##     symbol mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19
## 1   TSPAN6    33.56    45.10    39.42    35.61
## 2     TNMD     0.00     0.00     0.00     0.00
## 3     DPM1   169.46   129.88   132.06   113.73
## 4    SCYL3     1.85     1.85     1.77     1.93
## 5 C1orf112     5.73    11.85    10.16     9.28

Rows are named with the genes whose expression values were measured. Columns are named with identifiers for the experiments performed. More about this below.

QUESTION 1: What are the names of the first two genes in the expression data frame?

Gene names (“symbol”) are in the first column. We’ll create a numerical matrix of the mRNA levels:

# Remove the first column because it contains gene names
pson_expr_mat <- as.matrix(pson_expr_df[, -1])                
# Make the gene names into row names
rownames(pson_expr_mat) <- pson_expr_df$symbol               

# Use indexing to take a look
pson_expr_mat[1:5, 1:5]       
##          mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18
## TSPAN6      33.56    45.10    39.42    35.61    32.10
## TNMD         0.00     0.00     0.00     0.00     0.00
## DPM1       169.46   129.88   132.06   113.73   123.37
## SCYL3        1.85     1.85     1.77     1.93     2.30
## C1orf112     5.73    11.85    10.16     9.28    10.01

Transcript per Million

There are many ways of summarizing mRNA expression levels. The mRNA expression values in our matrix have units of TPM or “transcripts per million.” Each column in the expression data should add up to 1,000,000.

To learn more about the different ways RNA-seq levels are measured, check out StatQuest’s video.

colSums(pson_expr_mat)   # Add up the values in each column
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18 mRNA_R16 mRNA_R15 mRNA_R38 
## 925249.2 939584.8 936328.0 939342.9 937213.3 932558.1 930438.3 941026.1 
## mRNA_R42 mRNA_R41 mRNA_R40 mRNA_R39 mRNA_R37 mRNA_R36 mRNA_R24 mRNA_R28 
## 934102.2 943195.8 928026.6 937107.5 935499.5 941930.0 940546.9 941475.4 
## mRNA_R27 mRNA_R26 mRNA_R25 mRNA_R23 mRNA_R22 mRNA_R45 mRNA_R49 mRNA_R48 
## 945065.0 934940.7 938330.5 927800.4 939314.6 928315.2 930200.8 927158.0 
## mRNA_R47 mRNA_R46 mRNA_R44 mRNA_R43 mRNA_R10 mRNA_R14 mRNA_R13 mRNA_R12 
## 929514.4 929947.4 933376.2 931928.5 950605.1 953271.7 952507.0 950621.5 
## mRNA_R11  mRNA_R9  mRNA_R8  mRNA_R3  mRNA_R7  mRNA_R6  mRNA_R5  mRNA_R4 
## 934212.8 950774.9 952000.5 936577.1 937718.2 932324.8 922843.2 935792.9 
##  mRNA_R2  mRNA_R1 mRNA_R52 mRNA_R56 mRNA_R55 mRNA_R54 mRNA_R53 mRNA_R51 
## 937852.1 930950.6 944337.2 945542.5 944524.4 940821.8 946705.3 941959.7 
## mRNA_R50 mRNA_R31 mRNA_R35 mRNA_R34 mRNA_R33 mRNA_R32 mRNA_R30 mRNA_R29 
## 944849.8 949522.1 950227.3 951114.6 952487.6 949740.4 952105.8 946548.6 
## mRNA_R59 mRNA_R63 mRNA_R62 mRNA_R61 mRNA_R60 mRNA_R58 mRNA_R57 
## 951240.0 951305.3 950446.6 951255.0 946777.2 951215.8 949333.9

Most of the columns don’t add up to a million. The original data table included many more rows than the 18,682 we see here, but in the tidying process we removed the genes that do not code for proteins.

QUESTION 2: What is the sum of mRNA values (in units of TPM) for the experiment “mRNA_R20”?

Log transformation

As we did with the TCGA breast cancer patient expression data, we’ll log-transform the PSON expression data to compress the range of the values.

pson_log_mat <- log2(1 + pson_expr_mat)
pson_log_mat[1:5, 1:5]
##          mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18
## TSPAN6   5.111031 5.526695 5.336997 5.194166 5.048759
## TNMD     0.000000 0.000000 0.000000 0.000000 0.000000
## DPM1     7.413289 7.032101 7.055933 6.842099 6.958495
## SCYL3    1.510962 1.510962 1.469886 1.550901 1.722466
## C1orf112 2.750607 3.683696 3.480265 3.361768 3.460743

Motility data

The object cell_speeds_df contains data that is completely new to us.

# View(cell_speeds_df)

Let’s get the column names of this data frame:

colnames(cell_speeds_df)
## [1] "sample"                        "cellLine"                     
## [3] "diagnosis"                     "experimentalCondition"        
## [5] "summary_metric"                "average_value"                
## [7] "total_number_of_cells_tracked"
  • “sample” is the name of the experiment
  • “cellLine” is the human cancer cell line name
  • “diagnosis” is the cancer type that the cell line models
  • “experimentalCondition” gives the substrate on which the cells were grown
  • “summary_metric” tells us that cell speed in microns per hour was measured
  • “average_value” is the average cell speed for that experiment
  • “total_number_of_cells_tracked” tells us how many cells were watch in the imaging microscope

Cancer cell lines

What cell lines (cellLine) were examined?
And what cancer types do the cell lines represent (diagnosis)?

unique(cell_speeds_df$cellLine)
## [1] "SW620"      "SW480"      "RWPE-1"     "A375"       "T98G"      
## [6] "22Rv1"      "T-47D"      "U-87"       "MDA-MB-231"
unique(cell_speeds_df$diagnosis)
## [1] "Colon Cancer"    "Not Applicable"  "Skin Cancer"     "Brain Cancer"   
## [5] "Prostate Cancer" "Breast Cancer"

We can use the function unique() on a table as well!

unique(cell_speeds_df[,c(2,3)])
##      cellLine       diagnosis
## 1       SW620    Colon Cancer
## 8       SW480    Colon Cancer
## 14     RWPE-1  Not Applicable
## 21       A375     Skin Cancer
## 28       T98G    Brain Cancer
## 35      22Rv1 Prostate Cancer
## 42      T-47D   Breast Cancer
## 49       U-87    Brain Cancer
## 56 MDA-MB-231   Breast Cancer

We are interested in the two breast cancer cell lines: + T47-D + MDA-MB-231

QUESTION 3: What is the name of one of the Colon Cancer cell lines?

Each experiment was conducted under different conditions. “experimentalCondition” gives the type of substrate on which cell speed was measured.

unique(cell_speeds_df$experimentalCondition)
## [1] "Glass"                             "HyaluronicAcid Collagen"          
## [3] "HyaluronicAcid Fibronectin"        "30 kPa polyacrylamide Collagen"   
## [5] "30 kPa polyacrylamide Fibronectin" "500 Pa polyacrylamide Collagen"   
## [7] "500 Pa polyacrylamide Fibronectin"

We will use the data for the experimental condition “HyaluronicAcid Collagen”.


Hyaluronic acid and collagen

Hyaluronic acid, collagen, and fibronectin are all common components of the extracellular matrix for biological cells.

substr_img<-load.image(file.path(data_dir,"substrate.jpg"))

plot(substr_img,axes=FALSE)

Let’s look in more detail at one condition, “HyaluronicAcid Collagen.”

# Make a smaller data frame for this condition
hyal_coll_df <- subset(cell_speeds_df, 
                       experimentalCondition == "HyaluronicAcid Collagen")
hyal_coll_df
##      sample   cellLine       diagnosis   experimentalCondition summary_metric
## 2  mRNA_R21      SW620    Colon Cancer HyaluronicAcid Collagen    speed_um_hr
## 8  mRNA_R42      SW480    Colon Cancer HyaluronicAcid Collagen    speed_um_hr
## 15 mRNA_R28     RWPE-1  Not Applicable HyaluronicAcid Collagen    speed_um_hr
## 22 mRNA_R49       A375     Skin Cancer HyaluronicAcid Collagen    speed_um_hr
## 29 mRNA_R14       T98G    Brain Cancer HyaluronicAcid Collagen    speed_um_hr
## 36  mRNA_R7      22Rv1 Prostate Cancer HyaluronicAcid Collagen    speed_um_hr
## 43 mRNA_R56      T-47D   Breast Cancer HyaluronicAcid Collagen    speed_um_hr
## 50 mRNA_R35       U-87    Brain Cancer HyaluronicAcid Collagen    speed_um_hr
## 57 mRNA_R63 MDA-MB-231   Breast Cancer HyaluronicAcid Collagen    speed_um_hr
##    average_value total_number_of_cells_tracked
## 2      18.477557                            51
## 8       8.814832                            40
## 15     13.225218                           108
## 22     43.955352                            52
## 29     31.561848                            75
## 36     19.510723                            50
## 43     16.515478                            81
## 50     39.380265                            58
## 57     36.196133                            12

Nine experiments were done on the “HyaluronicAcid Collagen” substrate.

Two of the experiments were done with the breast cancer cell lines we are interested in.

# Make an even smaller data frame that includes only the breast cancer cell lines
hyal_brca_df <- subset(hyal_coll_df, 
                       diagnosis == "Breast Cancer")
head(hyal_brca_df)
##      sample   cellLine     diagnosis   experimentalCondition summary_metric
## 43 mRNA_R56      T-47D Breast Cancer HyaluronicAcid Collagen    speed_um_hr
## 57 mRNA_R63 MDA-MB-231 Breast Cancer HyaluronicAcid Collagen    speed_um_hr
##    average_value total_number_of_cells_tracked
## 43      16.51548                            81
## 57      36.19613                            12

The “summary_metric” is cell speed in microns per hour. From this table, we see that one cell line (MDA-MB-231) moves twice as fast as the other cell line (T-47D) on the “HyaluronicAcid Collagen” substrate: 36 u/hr versus 16 u/hr.

We’ll call MDA-MB-231 the “fast” cell line and T-47D the “slow” cell line.


Expression and motility data

Even though the expression matrix contains information from 63 experiments, we want the expression data from the two experiments for the breast cancer cell lines measured on the “HyaluronicAcid Collagen” substrate: mRNA_R56 and mRNA_R63.

Let’s extract these two experiments from the expression matrix.

# We match the experiments in the column of our motility sub-table
# with the column names of our expression matrix

exps <- match(hyal_brca_df$sample, colnames(pson_log_mat))

# Make a sub-matrix with the expression data for only these two expriments
hyal_brca_log_mat <- pson_log_mat[, exps]

# Call experiments according to the relative speed of the cells
colnames(hyal_brca_log_mat) <- c("slow","fast")

# Look at the first 10 rows
hyal_brca_log_mat[1:10,]
##                slow       fast
## TSPAN6   4.22419466 4.06091205
## TNMD     0.00000000 0.00000000
## DPM1     6.95965425 6.73890291
## SCYL3    3.28835856 2.63459327
## C1orf112 3.49313492 4.09592442
## FGR      0.00000000 0.07038933
## CFH      0.05658353 1.64154603
## FUCA2    5.21916852 6.51506901
## GCLC     4.44625623 5.21761849
## NFYA     3.32481060 5.13504229

Some of the genes have a similar expression level in both cell lines, but some genes are quite different.

Genes that have very different levels in the “fast” versus “slow” cell lines may be informative about why the cell lines behave differently.

By subtracting the expression in the “slow” cell line from the expression in the fast cell line, we create a differential gene expression profile, or DGE profile.

Differential Gene Expression

# Subtract the first and second columns
dge <- hyal_brca_log_mat[,2] - hyal_brca_log_mat[,1]

# Add the column to a new matrix 
DGE_mat <- cbind(hyal_brca_log_mat,dge)

# Sort the differential gene expression values from high to low
order_dge <- order(dge, decreasing = TRUE)

# Re-order the rows of the matrices so we can easily see which
# genes are most expressed in "fast" versus "slow"
DGE_mat <- DGE_mat[order_dge,]

# Look at the first 10 rows
head(DGE_mat, 10)
##                slow      fast      dge
## VIM      1.23878686 10.959234 9.720447
## LDHB     0.28688115  9.076067 8.789186
## SERPINE1 0.01435529  8.718704 8.704349
## MSN      0.01435529  8.709945 8.695590
## GPX1     0.04264434  8.720347 8.677703
## CAV1     0.00000000  8.457955 8.457955
## GSTP1    0.59454855  8.961392 8.366843
## FOSL1    0.68706069  8.750841 8.063780
## AXL      0.18903382  8.201781 8.012747
## F3       0.69599381  8.651411 7.955417

We can also look at the bottom of the matrix to see which genes are more highly expressed in the “slow” cell line versus the “fast” cell line.

QUESTION 4: Give one gene with high expression in the fast cell line and low expression in the slow cell line.

# Look at the last 10 rows
tail(DGE_mat,10)
##               slow       fast        dge
## RAB25     8.054957 0.09761080  -7.957346
## AGR3      8.373170 0.11103131  -8.262138
## IGFBP5    8.417684 0.02856915  -8.389115
## CDH1      8.760354 0.17632277  -8.584031
## STC2      9.032679 0.41142625  -8.621253
## CRABP2   11.248586 2.49057013  -8.758016
## SERPINA6  9.490229 0.04264434  -9.447585
## CRIP1     9.928415 0.31034012  -9.618075
## PIP      11.931051 0.00000000 -11.931051
## MGP      12.594747 0.27500705 -12.319740

There are many genes that are preferrentially expressed in one cell line versus the other.

hist(DGE_mat[,3], xlab = "Differential gene expression, dge",
                  main = "Histogram of dge values")

Most of the dge values are around zero, so most genes have very similar expression in the “fast” and “slow” cell lines.

The genes with large differential expression (those in the tail of the histogram) are the most interesting to consider because they may provide us with clues as to why the cell lines behave so differently.

We’ll look at this in more detail next time, but here is a flavor of how a single gene can make a big difference.

QUESTION 5: Approximately, how many genes have dge close to zero?


Vimentin and keratin in motility

VIM, which codes for the protein vimentin, is at the very top of the expression matrix ranked by dge, and KRT23, which codes for the protein keratin, is near the bottom.

This image comes from the paper, “Vimentin induces changes in cell shape, motility, and adhesion during the epithelial to mesenchymal transition”.

Cells that express vimentin filaments (VIF) but not keratin filaments (KIF) are elongated in shape and move better (more motile, panel A), whereas cells with KIF and not VIF are round and undergo fewer changes in morphology (shape) and position (panel B).

VIF_KIF_img<-load.image(file.path(data_dir,"VIF_KIF.JPG"))

plot(VIF_KIF_img,axes=FALSE)

The cells in A are elongated and move more quickly than the cells in B. A main difference between the cells is the different expression levels of VIM versus KRT23.