About this activity

In this module, we will switch gears and work with data from experiments with human cancer cell lines from the Physical Sciences in Oncology (PS-ON) Cell Line Characterization Study.

Cancer cell lines are cancer cells that keep dividing and growing over time, under certain conditions in a laboratory. Cancer cell lines are used in research to study the biology of cancer and to test cancer treatments.

The PS-ON Study includes imaging- and microscopy-based measurements of physical properties of the cells, such as morphology (shape) and motility (movement). We will examine:


Preliminaries

We will load the imager library so we may look at some images in this activity.

The data directory

data_dir <- "/home/data"    # Where the data live

Loading the data

PSON.RData contains gene expression and cell speed data from the Physical Sciences in Oncology project.

# The objects will also appear in our "Environment" tab.
load(file.path(data_dir, "PSON.RData"),verbose=TRUE) 
## Loading objects:
##   pson_expr_df
##   cell_speeds_df

Objects were named to be be descriptive.


Expression data

The Environment tab gives us basic information on the object that was loaded.

You can use some of the R functions we learned previously to remind us what is in the data frames we loaded.

pson_expr_df[1:5,1:5]
##     symbol mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19
## 1   TSPAN6    33.56    45.10    39.42    35.61
## 2     TNMD     0.00     0.00     0.00     0.00
## 3     DPM1   169.46   129.88   132.06   113.73
## 4    SCYL3     1.85     1.85     1.77     1.93
## 5 C1orf112     5.73    11.85    10.16     9.28

Rows are named with the genes whose expression (mRNA) values were measured. Columns are named with identifiers for the experiments performed. More about this below.

Gene names (“symbol”) are in the first column. We’ll create a numerical matrix of the mRNA levels and designate the gene symbols as names for the rows of the matrix.

# Remove the first column because it contains gene names
pson_expr_mat <- as.matrix(pson_expr_df[, -1])                
# Make the gene names into row names
rownames(pson_expr_mat) <- pson_expr_df$symbol               

# Use indexing to take a look
pson_expr_mat[1:5, 1:5]       
##          mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18
## TSPAN6      33.56    45.10    39.42    35.61    32.10
## TNMD         0.00     0.00     0.00     0.00     0.00
## DPM1       169.46   129.88   132.06   113.73   123.37
## SCYL3        1.85     1.85     1.77     1.93     2.30
## C1orf112     5.73    11.85    10.16     9.28    10.01

Transcript per Million

There are many ways of summarizing mRNA expression levels. The mRNA expression values in our matrix have units of TPM or “transcripts per million.” Each column in the expression data should add up to 1,000,000.

To learn more about the different ways RNA-seq levels are measured, check out StatQuest’s video.

colSums(pson_expr_mat)   # Add up the values in each column
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18 mRNA_R16 mRNA_R15 mRNA_R38 
## 925249.2 939584.8 936328.0 939342.9 937213.3 932558.1 930438.3 941026.1 
## mRNA_R42 mRNA_R41 mRNA_R40 mRNA_R39 mRNA_R37 mRNA_R36 mRNA_R24 mRNA_R28 
## 934102.2 943195.8 928026.6 937107.5 935499.5 941930.0 940546.9 941475.4 
## mRNA_R27 mRNA_R26 mRNA_R25 mRNA_R23 mRNA_R22 mRNA_R45 mRNA_R49 mRNA_R48 
## 945065.0 934940.7 938330.5 927800.4 939314.6 928315.2 930200.8 927158.0 
## mRNA_R47 mRNA_R46 mRNA_R44 mRNA_R43 mRNA_R10 mRNA_R14 mRNA_R13 mRNA_R12 
## 929514.4 929947.4 933376.2 931928.5 950605.1 953271.7 952507.0 950621.5 
## mRNA_R11  mRNA_R9  mRNA_R8  mRNA_R3  mRNA_R7  mRNA_R6  mRNA_R5  mRNA_R4 
## 934212.8 950774.9 952000.5 936577.1 937718.2 932324.8 922843.2 935792.9 
##  mRNA_R2  mRNA_R1 mRNA_R52 mRNA_R56 mRNA_R55 mRNA_R54 mRNA_R53 mRNA_R51 
## 937852.1 930950.6 944337.2 945542.5 944524.4 940821.8 946705.3 941959.7 
## mRNA_R50 mRNA_R31 mRNA_R35 mRNA_R34 mRNA_R33 mRNA_R32 mRNA_R30 mRNA_R29 
## 944849.8 949522.1 950227.3 951114.6 952487.6 949740.4 952105.8 946548.6 
## mRNA_R59 mRNA_R63 mRNA_R62 mRNA_R61 mRNA_R60 mRNA_R58 mRNA_R57 
## 951240.0 951305.3 950446.6 951255.0 946777.2 951215.8 949333.9

Most of the columns don’t add up to a million. The original data table included many more rows than the 18,682 we see here, but in the tidying process we removed the genes that do not code for proteins.


Log transformation

We’ll discuss in more detail why we take the log of the mRNA data when we look at TCGA breast cancer patient gene expression data. We log-transform the PS-ON expression data to compress the range of the values.

pson_log_mat <- log2(1 + pson_expr_mat)
pson_log_mat[1:5, 1:5]
##          mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18
## TSPAN6   5.111031 5.526695 5.336997 5.194166 5.048759
## TNMD     0.000000 0.000000 0.000000 0.000000 0.000000
## DPM1     7.413289 7.032101 7.055933 6.842099 6.958495
## SCYL3    1.510962 1.510962 1.469886 1.550901 1.722466
## C1orf112 2.750607 3.683696 3.480265 3.361768 3.460743

To see the effect of the transformation, let’s inspect the distribution of the mRNA levels for the first 10 genes with and without the conversion.

ngenes <- 10

# the function par() allows us to control how plots are shown
par(mfrow=c(1,2))
par(cex.axis=0.7)

# the argument ylim=c(0,20) sets the scale of the y-axis
boxplot(t(pson_expr_mat[1:ngenes,]), main="Original data", 
        ylab="mRNA (TPM)",ylim=c(0,30),
        las=2, pch=20)

boxplot(t(pson_log_mat[1:ngenes,]), main="Log-transformed data", 
        ylab="Log base2 of mRNA",ylim=c(0,30),
        las=2, pch=20)

The log-transformed data are less variable. We will examine the effect on data distributions with the TCGA data.


Motility data

The object cell_speeds_df contains data that is completely new to us.

View(cell_speeds_df)

Let’s get the column names of this data frame:

colnames(cell_speeds_df)
## [1] "sample"                        "cellLine"                     
## [3] "diagnosis"                     "experimentalCondition"        
## [5] "summary_metric"                "average_value"                
## [7] "total_number_of_cells_tracked"

Cancer cell lines

What cell lines (cellLine) were examined?
And what cancer types do the cell lines represent (diagnosis)?

unique(cell_speeds_df$cellLine)
## [1] "SW620"      "SW480"      "RWPE-1"     "A375"       "T98G"      
## [6] "22Rv1"      "T-47D"      "U-87"       "MDA-MB-231"
unique(cell_speeds_df$diagnosis)
## [1] "Colon Cancer"    "Not Applicable"  "Skin Cancer"     "Brain Cancer"   
## [5] "Prostate Cancer" "Breast Cancer"

We can use the function unique() on a table as well!

unique(cell_speeds_df[,c(2,3)])
##      cellLine       diagnosis
## 1       SW620    Colon Cancer
## 8       SW480    Colon Cancer
## 14     RWPE-1  Not Applicable
## 21       A375     Skin Cancer
## 28       T98G    Brain Cancer
## 35      22Rv1 Prostate Cancer
## 42      T-47D   Breast Cancer
## 49       U-87    Brain Cancer
## 56 MDA-MB-231   Breast Cancer

We are interested in the two breast cancer cell lines:

  • T47-D
  • MDA-MB-231

Each experiment was conducted under different conditions. The column “experimentalCondition” gives the type of substrate on which cell speed was measured.

unique(cell_speeds_df$experimentalCondition)
## [1] "Glass"                             "HyaluronicAcid Collagen"          
## [3] "HyaluronicAcid Fibronectin"        "30 kPa polyacrylamide Collagen"   
## [5] "30 kPa polyacrylamide Fibronectin" "500 Pa polyacrylamide Collagen"   
## [7] "500 Pa polyacrylamide Fibronectin"

We will use the data for the experimental condition “HyaluronicAcid Collagen”.


Hyaluronic acid and collagen

Hyaluronic acid, collagen, and fibronectin are all common components of the extracellular matrix for biological cells.

substr_img<-load.image(file.path(data_dir,"substrate.jpg"))

plot(substr_img,axes=FALSE)

Let’s look in more detail at one condition, “HyaluronicAcid Collagen.”

# Make a smaller data frame for this condition
hyal_coll_df <- subset(cell_speeds_df, 
                       experimentalCondition == "HyaluronicAcid Collagen")
hyal_coll_df
##      sample   cellLine       diagnosis   experimentalCondition summary_metric
## 2  mRNA_R21      SW620    Colon Cancer HyaluronicAcid Collagen    speed_um_hr
## 8  mRNA_R42      SW480    Colon Cancer HyaluronicAcid Collagen    speed_um_hr
## 15 mRNA_R28     RWPE-1  Not Applicable HyaluronicAcid Collagen    speed_um_hr
## 22 mRNA_R49       A375     Skin Cancer HyaluronicAcid Collagen    speed_um_hr
## 29 mRNA_R14       T98G    Brain Cancer HyaluronicAcid Collagen    speed_um_hr
## 36  mRNA_R7      22Rv1 Prostate Cancer HyaluronicAcid Collagen    speed_um_hr
## 43 mRNA_R56      T-47D   Breast Cancer HyaluronicAcid Collagen    speed_um_hr
## 50 mRNA_R35       U-87    Brain Cancer HyaluronicAcid Collagen    speed_um_hr
## 57 mRNA_R63 MDA-MB-231   Breast Cancer HyaluronicAcid Collagen    speed_um_hr
##    average_value total_number_of_cells_tracked
## 2      18.477557                            51
## 8       8.814832                            40
## 15     13.225218                           108
## 22     43.955352                            52
## 29     31.561848                            75
## 36     19.510723                            50
## 43     16.515478                            81
## 50     39.380265                            58
## 57     36.196133                            12

Nine experiments were done on the “HyaluronicAcid Collagen” substrate.

Two of the experiments were done with the breast cancer cell lines we are interested in, i.e. T-47D and MDA-MB-231.

# Make an even smaller data frame that includes only the breast cancer cell lines
hyal_brca_df <- subset(hyal_coll_df, 
                       diagnosis == "Breast Cancer")
print(hyal_brca_df)
##      sample   cellLine     diagnosis   experimentalCondition summary_metric
## 43 mRNA_R56      T-47D Breast Cancer HyaluronicAcid Collagen    speed_um_hr
## 57 mRNA_R63 MDA-MB-231 Breast Cancer HyaluronicAcid Collagen    speed_um_hr
##    average_value total_number_of_cells_tracked
## 43      16.51548                            81
## 57      36.19613                            12

The “summary_metric” is cell speed in microns per hour. From this table, we see that one cell line (MDA-MB-231) moves twice as fast as the other cell line (T-47D) on the “HyaluronicAcid Collagen” substrate: 36 u/hr versus 16 u/hr.

In research, the faster cell line MDA-MB-231 is used as a model of aggressive breast cancer, and the slower cell line T-47D is used as a model of normal breast tissue.

hyal_brca_df[,c(2,6)]
##      cellLine average_value
## 43      T-47D      16.51548
## 57 MDA-MB-231      36.19613

We’ll refer to MDA-MB-231 as the “fast” cell line and T-47D as the “slow” cell line.


Expression and motility data

Even though the expression matrix contains information from 63 experiments, we want the expression data from the two experiments for the breast cancer cell lines measured on the “HyaluronicAcid Collagen” substrate: mRNA_R56 and mRNA_R63.

Let’s extract these two experiments from the expression matrix.

First we’ll remind ourselves what is in our objects. Note how indexing is used.

t(hyal_brca_df[,1])
##      [,1]       [,2]      
## [1,] "mRNA_R56" "mRNA_R63"
colnames(pson_log_mat)[1:5]
## [1] "mRNA_R17" "mRNA_R21" "mRNA_R20" "mRNA_R19" "mRNA_R18"

Both objects contain the sample names. The function match() will return the positions of where the items in the first array appear in the second array.

# Match the experiments both objects
experiments <- match(hyal_brca_df$sample, colnames(pson_log_mat))
print(experiments)
## [1] 44 58

We will extract columns 44 and 58 from the expression matrix to create a matrix of genes with mRNA levels the fast and slow cell lines on the hyaluronic acid collagen substrate.

# Make a sub-matrix with the expression data for only these two expriments
hyal_brca_log_mat <- pson_log_mat[, experiments]

# Call experiments according to the relative speed of the cells
colnames(hyal_brca_log_mat) <- c("slow","fast")

# Look at some of the rows
hyal_brca_log_mat[35:50,]
##               slow      fast
## ICA1    6.14771372 2.6644828
## DBNDD1  4.99954909 1.4059924
## ALS2    3.11436702 4.2532329
## CASP10  0.05658353 1.8278190
## CFLAR   4.11769504 6.1430260
## TFPI    0.11103131 7.1371961
## NDUFAF7 3.69041710 3.6926504
## RBM5    5.60139939 5.9680908
## MTMR7   0.59454855 0.9183862
## SLC7A2  0.01435529 3.6052573
## ARF5    7.14913740 7.1826923
## SARM1   1.52105074 0.9560567
## POLDIP2 6.95256654 6.6918133
## PLXND1  5.02502879 5.7032115
## AK2     7.22631585 7.3775576
## CD38    0.00000000 0.2509616

Some of the genes have a similar expression level in both cell lines, but some genes are quite different.


Differential Gene Expression

Genes that have very different mRNA levels in the “fast” versus “slow” cell lines may be informative about why the cell lines behave differently.

By subtracting the expression in the “slow” cell line from the expression in the fast cell line, we create a differential gene expression profile, or DGE profile.

# Subtract the first column ("slow") from the second column ("fast")
dge <- hyal_brca_log_mat[,2] - hyal_brca_log_mat[,1]

# Add dge as a column to a new matrix 
DGE_mat <- cbind(hyal_brca_log_mat,dge)

DGE_mat[35:50,]
##               slow      fast          dge
## ICA1    6.14771372 2.6644828 -3.483230882
## DBNDD1  4.99954909 1.4059924 -3.593556728
## ALS2    3.11436702 4.2532329  1.138865914
## CASP10  0.05658353 1.8278190  1.771235496
## CFLAR   4.11769504 6.1430260  2.025330961
## TFPI    0.11103131 7.1371961  7.026164786
## NDUFAF7 3.69041710 3.6926504  0.002233275
## RBM5    5.60139939 5.9680908  0.366691362
## MTMR7   0.59454855 0.9183862  0.323837685
## SLC7A2  0.01435529 3.6052573  3.590901970
## ARF5    7.14913740 7.1826923  0.033554896
## SARM1   1.52105074 0.9560567 -0.564994084
## POLDIP2 6.95256654 6.6918133 -0.260753244
## PLXND1  5.02502879 5.7032115  0.678182673
## AK2     7.22631585 7.3775576  0.151241754
## CD38    0.00000000 0.2509616  0.250961574

We can calculate a histogram to see the distribution of differential expression values.

hist(DGE_mat[,3],main="Differential gene expression",
     xlab="DGE")

For most genes, there is little change in expression between the “fast” and “slow” cell lines. But there are many for which expression is higher in one cell line over the others. These genes are those that appear in the tails of the histogram. The genes with large differential expression are the most interesting to consider because they may provide us with clues as to why the cell lines behave so differently. Let’s find them!

# Sort the differential gene expression values from high to low
order_dge <- order(dge, decreasing = TRUE)
DGE_mat_ordered <- DGE_mat[order_dge,]
head(DGE_mat_ordered,15)
##                slow      fast      dge
## VIM      1.23878686 10.959234 9.720447
## LDHB     0.28688115  9.076067 8.789186
## SERPINE1 0.01435529  8.718704 8.704349
## MSN      0.01435529  8.709945 8.695590
## GPX1     0.04264434  8.720347 8.677703
## CAV1     0.00000000  8.457955 8.457955
## GSTP1    0.59454855  8.961392 8.366843
## FOSL1    0.68706069  8.750841 8.063780
## AXL      0.18903382  8.201781 8.012747
## F3       0.69599381  8.651411 7.955417
## CST1     0.00000000  7.920532 7.920532
## PLAT     0.09761080  7.883315 7.785705
## MMP14    0.61353165  8.398744 7.785212
## AKR1B1   0.08406426  7.820881 7.736817
## PLAU     0.51601515  8.167368 7.651353

The genes with higher expression in the “fast” cell line are those that appear in the right tail of the histogram for the differential gene expression.

Similarly, genes whose expression are highest in the “slow” cell line appear in the left tail of the histogram for the differential gene expression.

tail(DGE_mat_ordered,15)
##               slow       fast        dge
## ST14      7.778734 0.87184365  -6.906891
## AZGP1     8.022312 0.57531233  -7.447000
## KRT23     7.504700 0.00000000  -7.504700
## FOXA1     7.829723 0.04264434  -7.787078
## OLFM1     7.868699 0.00000000  -7.868699
## RAB25     8.054957 0.09761080  -7.957346
## AGR3      8.373170 0.11103131  -8.262138
## IGFBP5    8.417684 0.02856915  -8.389115
## CDH1      8.760354 0.17632277  -8.584031
## STC2      9.032679 0.41142625  -8.621253
## CRABP2   11.248586 2.49057013  -8.758016
## SERPINA6  9.490229 0.04264434  -9.447585
## CRIP1     9.928415 0.31034012  -9.618075
## PIP      11.931051 0.00000000 -11.931051
## MGP      12.594747 0.27500705 -12.319740

Vimentin and keratin in motility

VIM, which codes for the protein vimentin, is at the very top of the expression matrix ranked by dge, and KRT23, which codes for the protein keratin, is near the bottom.

This image comes from the paper, “Vimentin induces changes in cell shape, motility, and adhesion during the epithelial to mesenchymal transition”.

Cells that express vimentin filaments (VIF) but not keratin filaments (KIF) are elongated in shape and move better (more motile, panel A), whereas cells with KIF and not VIF are round and undergo fewer changes in morphology (shape) and position (panel B).

VIF_KIF_img<-load.image(file.path(data_dir,"VIF_KIF.JPG"))

plot(VIF_KIF_img,axes=FALSE)

The cells in A are elongated and move more quickly than the cells in B. A main difference between the cells is the different expression levels of VIM (relatively high in fast cell line) versus KRT23 (relatively high in slow cell line).

Biological processes

Even though this is just an exploratory data analysis, we can still extract biological meaning from our results.

The Gene Ontology (GO) knowledgebase is the world’s largest source of information on the functions of genes. It provides an easy to use web server for inputting a list of genes and telling us if the genes collectively contribute to particular biological processes versus sets of randomly selected genes.

To test our hypothesis that the genes most differentially expressed are those that may contribute to cell properties associated with cancer, we’ll do a quick Gene Ontology analysis of the genes with highest differential expression in the fast cell line.

N <- 100
fast_genes <- rownames(DGE_mat_ordered)[1:N]
head(fast_genes)
## [1] "VIM"      "LDHB"     "SERPINE1" "MSN"      "GPX1"     "CAV1"
write.csv(fast_genes,"fast_genes.csv",row.names=FALSE,quote=FALSE)

Look in your working directory to find the file “fast_genes.csv”. Open it, highlight and copy the gene names. We’ll input them into the “GO Enrichment Analysis” window at the Gene Ontology Resource.

Look to see if any of the results include biological processes related to motility and migration.


CHALLENGE QUESTIONS

QUESTION 1

What is the sum of mRNA values (in units of TPM) for the experiment “mRNA_R20”?

# assign your answer to mRNA_R20_TPM
mRNA_R20_TPM <- 936328

print(paste("The number of gene transcripts in experiment mRNA_20 is",
            mRNA_R20_TPM,"Transcripts per Million (TPM)"))
## [1] "The number of gene transcripts in experiment mRNA_20 is 936328 Transcripts per Million (TPM)"

QUESTION 2

What is the name of one of the Colon Cancer cell lines?

# assign your answer to colon_cancer_line
colon_cancer_line <- "SW620"

print(paste(colon_cancer_line,"is one of the colon cancer cell lines in the PS-ON Cell Line Study."))
## [1] "SW620 is one of the colon cancer cell lines in the PS-ON Cell Line Study."

QUESTION 3

Approximately how many genes have dge close to zero?

# look back at the histogram titled "Differential gene expression"
# assign your answer to zero_dge
zero_dge <- 14000

print(paste("Approximately",zero_dge,
            "genes have DGE close to zero."))
## [1] "Approximately 14000 genes have DGE close to zero."

QUESTION 4

What biological processes are associated with genes more strongly expressed in the slow cell line?

# create and write a list of genes at the bottom of 
# DGE_mat_ordered

N <- 100
num_rows<-nrow(DGE_mat_ordered)
slow_genes <- rownames(DGE_mat_ordered)[(num_rows-N):num_rows]
head(slow_genes)
write.csv(slow_genes,"slow_genes.csv",row.names=FALSE,quote=FALSE)

Input the genes in the file “slow_genes.csv” into the “GO Enrichment Analysis” window at the Gene Ontology Resource.

Do you see any biological processes in the results that suggest these genes are not related to breast cancer? For example, there is a term near the end of the out regulation of cell adhesion. This may indicate that these genes help keep cells together in comparison to the fast genes that are associated with migration.

What other terms do you see that similarly indicate these genes do not contribute to breast cancer?

# assign your answer to `BP`
BP <- "epithelial cell morphogenesis"

print(paste("The gene ontology biological process -", BP,
            "- indicates the slow genes may not be involved in breast cancer."))
## [1] "The gene ontology biological process - epithelial cell morphogenesis - indicates the slow genes may not be involved in breast cancer."

The knitr R package

knitr() is the R package that generates reports from R Markdown. R Markdown is the syntax we are using in this document, denoted by the .Rmd file type. With knitr() and R Markdown, we can create reports of our work in the form of Word doc, PDF, and HTML files.

An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.

Click the “Knit” button at the top of this window and select “Knit to html.”

Congratulations

We have successfully applied our R skills to analyze data from human cancer cell lines and to calculate differential gene expression for models of normal breast tissue cells (“slow” cell line) versus models of aggressive cancer cells (“fast” cell line). We found interesting biological functions for the groups of genes most differentially expressed!