In this module, we will switch gears and work with data from experiments with human cancer cell lines from the Physical Sciences in Oncology (PS-ON) Cell Line Characterization Study.
Cancer cell lines are cancer cells that keep dividing and growing over time, under certain conditions in a laboratory. Cancer cell lines are used in research to study the biology of cancer and to test cancer treatments.
The PS-ON Study includes imaging- and microscopy-based measurements of physical properties of the cells, such as morphology (shape) and motility (movement). We will examine:
We will load the imager library so we may look at some images in this activity.
data_dir <- "/home/data" # Where the data live
PSON.RData contains gene expression and cell speed data from the Physical Sciences in Oncology project.
# The objects will also appear in our "Environment" tab.
load(file.path(data_dir, "PSON.RData"),verbose=TRUE)
## Loading objects:
## pson_expr_df
## cell_speeds_df
Objects were named to be be descriptive.
The Environment tab gives us basic information on the object that was loaded.
You can use some of the R functions we learned previously to remind us what is in the data frames we loaded.
dim() tells us the dimensions (# row, # columns) of the objects.head() shows us the top several rows of the objects.pson_expr_df[1:5,1:5]
## symbol mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19
## 1 TSPAN6 33.56 45.10 39.42 35.61
## 2 TNMD 0.00 0.00 0.00 0.00
## 3 DPM1 169.46 129.88 132.06 113.73
## 4 SCYL3 1.85 1.85 1.77 1.93
## 5 C1orf112 5.73 11.85 10.16 9.28
Rows are named with the genes whose expression (mRNA) values were measured. Columns are named with identifiers for the experiments performed. More about this below.
Gene names (“symbol”) are in the first column. We’ll create a numerical matrix of the mRNA levels and designate the gene symbols as names for the rows of the matrix.
# Remove the first column because it contains gene names
pson_expr_mat <- as.matrix(pson_expr_df[, -1])
# Make the gene names into row names
rownames(pson_expr_mat) <- pson_expr_df$symbol
# Use indexing to take a look
pson_expr_mat[1:5, 1:5]
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18
## TSPAN6 33.56 45.10 39.42 35.61 32.10
## TNMD 0.00 0.00 0.00 0.00 0.00
## DPM1 169.46 129.88 132.06 113.73 123.37
## SCYL3 1.85 1.85 1.77 1.93 2.30
## C1orf112 5.73 11.85 10.16 9.28 10.01
There are many ways of summarizing mRNA expression levels. The mRNA expression values in our matrix have units of TPM or “transcripts per million.” Each column in the expression data should add up to 1,000,000.
To learn more about the different ways RNA-seq levels are measured, check out StatQuest’s video.
colSums(pson_expr_mat) # Add up the values in each column
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18 mRNA_R16 mRNA_R15 mRNA_R38
## 925249.2 939584.8 936328.0 939342.9 937213.3 932558.1 930438.3 941026.1
## mRNA_R42 mRNA_R41 mRNA_R40 mRNA_R39 mRNA_R37 mRNA_R36 mRNA_R24 mRNA_R28
## 934102.2 943195.8 928026.6 937107.5 935499.5 941930.0 940546.9 941475.4
## mRNA_R27 mRNA_R26 mRNA_R25 mRNA_R23 mRNA_R22 mRNA_R45 mRNA_R49 mRNA_R48
## 945065.0 934940.7 938330.5 927800.4 939314.6 928315.2 930200.8 927158.0
## mRNA_R47 mRNA_R46 mRNA_R44 mRNA_R43 mRNA_R10 mRNA_R14 mRNA_R13 mRNA_R12
## 929514.4 929947.4 933376.2 931928.5 950605.1 953271.7 952507.0 950621.5
## mRNA_R11 mRNA_R9 mRNA_R8 mRNA_R3 mRNA_R7 mRNA_R6 mRNA_R5 mRNA_R4
## 934212.8 950774.9 952000.5 936577.1 937718.2 932324.8 922843.2 935792.9
## mRNA_R2 mRNA_R1 mRNA_R52 mRNA_R56 mRNA_R55 mRNA_R54 mRNA_R53 mRNA_R51
## 937852.1 930950.6 944337.2 945542.5 944524.4 940821.8 946705.3 941959.7
## mRNA_R50 mRNA_R31 mRNA_R35 mRNA_R34 mRNA_R33 mRNA_R32 mRNA_R30 mRNA_R29
## 944849.8 949522.1 950227.3 951114.6 952487.6 949740.4 952105.8 946548.6
## mRNA_R59 mRNA_R63 mRNA_R62 mRNA_R61 mRNA_R60 mRNA_R58 mRNA_R57
## 951240.0 951305.3 950446.6 951255.0 946777.2 951215.8 949333.9
Most of the columns don’t add up to a million. The original data table included many more rows than the 18,682 we see here, but in the tidying process we removed the genes that do not code for proteins.
We’ll discuss in more detail why we take the log of the mRNA data when we look at TCGA breast cancer patient gene expression data. We log-transform the PS-ON expression data to compress the range of the values.
pson_log_mat <- log2(1 + pson_expr_mat)
pson_log_mat[1:5, 1:5]
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18
## TSPAN6 5.111031 5.526695 5.336997 5.194166 5.048759
## TNMD 0.000000 0.000000 0.000000 0.000000 0.000000
## DPM1 7.413289 7.032101 7.055933 6.842099 6.958495
## SCYL3 1.510962 1.510962 1.469886 1.550901 1.722466
## C1orf112 2.750607 3.683696 3.480265 3.361768 3.460743
To see the effect of the transformation, let’s inspect the distribution of the mRNA levels for the first 10 genes with and without the conversion.
ngenes <- 10
# the function par() allows us to control how plots are shown
par(mfrow=c(1,2))
par(cex.axis=0.7)
# the argument ylim=c(0,20) sets the scale of the y-axis
boxplot(t(pson_expr_mat[1:ngenes,]), main="Original data",
ylab="mRNA (TPM)",ylim=c(0,30),
las=2, pch=20)
boxplot(t(pson_log_mat[1:ngenes,]), main="Log-transformed data",
ylab="Log base2 of mRNA",ylim=c(0,30),
las=2, pch=20)
The log-transformed data are less variable. We will examine the effect on data distributions with the TCGA data.
The object cell_speeds_df contains data that is completely new to us.
View(cell_speeds_df)
Let’s get the column names of this data frame:
colnames(cell_speeds_df)
## [1] "sample" "cellLine"
## [3] "diagnosis" "experimentalCondition"
## [5] "summary_metric" "average_value"
## [7] "total_number_of_cells_tracked"
What cell lines (cellLine) were examined?
And what cancer types do the cell lines represent (diagnosis)?
unique(cell_speeds_df$cellLine)
## [1] "SW620" "SW480" "RWPE-1" "A375" "T98G"
## [6] "22Rv1" "T-47D" "U-87" "MDA-MB-231"
unique(cell_speeds_df$diagnosis)
## [1] "Colon Cancer" "Not Applicable" "Skin Cancer" "Brain Cancer"
## [5] "Prostate Cancer" "Breast Cancer"
We can use the function unique() on a table as well!
unique(cell_speeds_df[,c(2,3)])
## cellLine diagnosis
## 1 SW620 Colon Cancer
## 8 SW480 Colon Cancer
## 14 RWPE-1 Not Applicable
## 21 A375 Skin Cancer
## 28 T98G Brain Cancer
## 35 22Rv1 Prostate Cancer
## 42 T-47D Breast Cancer
## 49 U-87 Brain Cancer
## 56 MDA-MB-231 Breast Cancer
We are interested in the two breast cancer cell lines:
Each experiment was conducted under different conditions. The column “experimentalCondition” gives the type of substrate on which cell speed was measured.
unique(cell_speeds_df$experimentalCondition)
## [1] "Glass" "HyaluronicAcid Collagen"
## [3] "HyaluronicAcid Fibronectin" "30 kPa polyacrylamide Collagen"
## [5] "30 kPa polyacrylamide Fibronectin" "500 Pa polyacrylamide Collagen"
## [7] "500 Pa polyacrylamide Fibronectin"
We will use the data for the experimental condition “HyaluronicAcid Collagen”.
Hyaluronic acid, collagen, and fibronectin are all common components of the extracellular matrix for biological cells.
substr_img<-load.image(file.path(data_dir,"substrate.jpg"))
plot(substr_img,axes=FALSE)
Let’s look in more detail at one condition, “HyaluronicAcid Collagen.”
# Make a smaller data frame for this condition
hyal_coll_df <- subset(cell_speeds_df,
experimentalCondition == "HyaluronicAcid Collagen")
hyal_coll_df
## sample cellLine diagnosis experimentalCondition summary_metric
## 2 mRNA_R21 SW620 Colon Cancer HyaluronicAcid Collagen speed_um_hr
## 8 mRNA_R42 SW480 Colon Cancer HyaluronicAcid Collagen speed_um_hr
## 15 mRNA_R28 RWPE-1 Not Applicable HyaluronicAcid Collagen speed_um_hr
## 22 mRNA_R49 A375 Skin Cancer HyaluronicAcid Collagen speed_um_hr
## 29 mRNA_R14 T98G Brain Cancer HyaluronicAcid Collagen speed_um_hr
## 36 mRNA_R7 22Rv1 Prostate Cancer HyaluronicAcid Collagen speed_um_hr
## 43 mRNA_R56 T-47D Breast Cancer HyaluronicAcid Collagen speed_um_hr
## 50 mRNA_R35 U-87 Brain Cancer HyaluronicAcid Collagen speed_um_hr
## 57 mRNA_R63 MDA-MB-231 Breast Cancer HyaluronicAcid Collagen speed_um_hr
## average_value total_number_of_cells_tracked
## 2 18.477557 51
## 8 8.814832 40
## 15 13.225218 108
## 22 43.955352 52
## 29 31.561848 75
## 36 19.510723 50
## 43 16.515478 81
## 50 39.380265 58
## 57 36.196133 12
Nine experiments were done on the “HyaluronicAcid Collagen” substrate.
Two of the experiments were done with the breast cancer cell lines we are interested in, i.e. T-47D and MDA-MB-231.
# Make an even smaller data frame that includes only the breast cancer cell lines
hyal_brca_df <- subset(hyal_coll_df,
diagnosis == "Breast Cancer")
print(hyal_brca_df)
## sample cellLine diagnosis experimentalCondition summary_metric
## 43 mRNA_R56 T-47D Breast Cancer HyaluronicAcid Collagen speed_um_hr
## 57 mRNA_R63 MDA-MB-231 Breast Cancer HyaluronicAcid Collagen speed_um_hr
## average_value total_number_of_cells_tracked
## 43 16.51548 81
## 57 36.19613 12
The “summary_metric” is cell speed in microns per hour. From this table, we see that one cell line (MDA-MB-231) moves twice as fast as the other cell line (T-47D) on the “HyaluronicAcid Collagen” substrate: 36 u/hr versus 16 u/hr.
In research, the faster cell line MDA-MB-231 is used as a model of aggressive breast cancer, and the slower cell line T-47D is used as a model of normal breast tissue.
hyal_brca_df[,c(2,6)]
## cellLine average_value
## 43 T-47D 16.51548
## 57 MDA-MB-231 36.19613
We’ll refer to MDA-MB-231 as the “fast” cell line and T-47D as the “slow” cell line.
Even though the expression matrix contains information from 63 experiments, we want the expression data from the two experiments for the breast cancer cell lines measured on the “HyaluronicAcid Collagen” substrate: mRNA_R56 and mRNA_R63.
Let’s extract these two experiments from the expression matrix.
First we’ll remind ourselves what is in our objects. Note how indexing is used.
t(hyal_brca_df[,1])
## [,1] [,2]
## [1,] "mRNA_R56" "mRNA_R63"
colnames(pson_log_mat)[1:5]
## [1] "mRNA_R17" "mRNA_R21" "mRNA_R20" "mRNA_R19" "mRNA_R18"
Both objects contain the sample names. The function match() will return the positions of where the items in the first array appear in the second array.
# Match the experiments both objects
experiments <- match(hyal_brca_df$sample, colnames(pson_log_mat))
print(experiments)
## [1] 44 58
We will extract columns 44 and 58 from the expression matrix to create a matrix of genes with mRNA levels the fast and slow cell lines on the hyaluronic acid collagen substrate.
# Make a sub-matrix with the expression data for only these two expriments
hyal_brca_log_mat <- pson_log_mat[, experiments]
# Call experiments according to the relative speed of the cells
colnames(hyal_brca_log_mat) <- c("slow","fast")
# Look at some of the rows
hyal_brca_log_mat[35:50,]
## slow fast
## ICA1 6.14771372 2.6644828
## DBNDD1 4.99954909 1.4059924
## ALS2 3.11436702 4.2532329
## CASP10 0.05658353 1.8278190
## CFLAR 4.11769504 6.1430260
## TFPI 0.11103131 7.1371961
## NDUFAF7 3.69041710 3.6926504
## RBM5 5.60139939 5.9680908
## MTMR7 0.59454855 0.9183862
## SLC7A2 0.01435529 3.6052573
## ARF5 7.14913740 7.1826923
## SARM1 1.52105074 0.9560567
## POLDIP2 6.95256654 6.6918133
## PLXND1 5.02502879 5.7032115
## AK2 7.22631585 7.3775576
## CD38 0.00000000 0.2509616
Some of the genes have a similar expression level in both cell lines, but some genes are quite different.
Genes that have very different mRNA levels in the “fast” versus “slow” cell lines may be informative about why the cell lines behave differently.
By subtracting the expression in the “slow” cell line from the expression in the fast cell line, we create a differential gene expression profile, or DGE profile.
# Subtract the first column ("slow") from the second column ("fast")
dge <- hyal_brca_log_mat[,2] - hyal_brca_log_mat[,1]
# Add dge as a column to a new matrix
DGE_mat <- cbind(hyal_brca_log_mat,dge)
DGE_mat[35:50,]
## slow fast dge
## ICA1 6.14771372 2.6644828 -3.483230882
## DBNDD1 4.99954909 1.4059924 -3.593556728
## ALS2 3.11436702 4.2532329 1.138865914
## CASP10 0.05658353 1.8278190 1.771235496
## CFLAR 4.11769504 6.1430260 2.025330961
## TFPI 0.11103131 7.1371961 7.026164786
## NDUFAF7 3.69041710 3.6926504 0.002233275
## RBM5 5.60139939 5.9680908 0.366691362
## MTMR7 0.59454855 0.9183862 0.323837685
## SLC7A2 0.01435529 3.6052573 3.590901970
## ARF5 7.14913740 7.1826923 0.033554896
## SARM1 1.52105074 0.9560567 -0.564994084
## POLDIP2 6.95256654 6.6918133 -0.260753244
## PLXND1 5.02502879 5.7032115 0.678182673
## AK2 7.22631585 7.3775576 0.151241754
## CD38 0.00000000 0.2509616 0.250961574
We can calculate a histogram to see the distribution of differential expression values.
hist(DGE_mat[,3],main="Differential gene expression",
xlab="DGE")
For most genes, there is little change in expression between the “fast” and “slow” cell lines. But there are many for which expression is higher in one cell line over the others. These genes are those that appear in the tails of the histogram. The genes with large differential expression are the most interesting to consider because they may provide us with clues as to why the cell lines behave so differently. Let’s find them!
# Sort the differential gene expression values from high to low
order_dge <- order(dge, decreasing = TRUE)
DGE_mat_ordered <- DGE_mat[order_dge,]
head(DGE_mat_ordered,15)
## slow fast dge
## VIM 1.23878686 10.959234 9.720447
## LDHB 0.28688115 9.076067 8.789186
## SERPINE1 0.01435529 8.718704 8.704349
## MSN 0.01435529 8.709945 8.695590
## GPX1 0.04264434 8.720347 8.677703
## CAV1 0.00000000 8.457955 8.457955
## GSTP1 0.59454855 8.961392 8.366843
## FOSL1 0.68706069 8.750841 8.063780
## AXL 0.18903382 8.201781 8.012747
## F3 0.69599381 8.651411 7.955417
## CST1 0.00000000 7.920532 7.920532
## PLAT 0.09761080 7.883315 7.785705
## MMP14 0.61353165 8.398744 7.785212
## AKR1B1 0.08406426 7.820881 7.736817
## PLAU 0.51601515 8.167368 7.651353
The genes with higher expression in the “fast” cell line are those that appear in the right tail of the histogram for the differential gene expression.
Similarly, genes whose expression are highest in the “slow” cell line appear in the left tail of the histogram for the differential gene expression.
tail(DGE_mat_ordered,15)
## slow fast dge
## ST14 7.778734 0.87184365 -6.906891
## AZGP1 8.022312 0.57531233 -7.447000
## KRT23 7.504700 0.00000000 -7.504700
## FOXA1 7.829723 0.04264434 -7.787078
## OLFM1 7.868699 0.00000000 -7.868699
## RAB25 8.054957 0.09761080 -7.957346
## AGR3 8.373170 0.11103131 -8.262138
## IGFBP5 8.417684 0.02856915 -8.389115
## CDH1 8.760354 0.17632277 -8.584031
## STC2 9.032679 0.41142625 -8.621253
## CRABP2 11.248586 2.49057013 -8.758016
## SERPINA6 9.490229 0.04264434 -9.447585
## CRIP1 9.928415 0.31034012 -9.618075
## PIP 11.931051 0.00000000 -11.931051
## MGP 12.594747 0.27500705 -12.319740
VIM, which codes for the protein vimentin, is at the very top of the expression matrix ranked by dge, and KRT23, which codes for the protein keratin, is near the bottom.
This image comes from the paper, “Vimentin induces changes in cell shape, motility, and adhesion during the epithelial to mesenchymal transition”.
Cells that express vimentin filaments (VIF) but not keratin filaments (KIF) are elongated in shape and move better (more motile, panel A), whereas cells with KIF and not VIF are round and undergo fewer changes in morphology (shape) and position (panel B).
VIF_KIF_img<-load.image(file.path(data_dir,"VIF_KIF.JPG"))
plot(VIF_KIF_img,axes=FALSE)
The cells in A are elongated and move more quickly than the cells in B. A main difference between the cells is the different expression levels of VIM (relatively high in fast cell line) versus KRT23 (relatively high in slow cell line).
Even though this is just an exploratory data analysis, we can still extract biological meaning from our results.
The Gene Ontology (GO) knowledgebase is the world’s largest source of information on the functions of genes. It provides an easy to use web server for inputting a list of genes and telling us if the genes collectively contribute to particular biological processes versus sets of randomly selected genes.
To test our hypothesis that the genes most differentially expressed are those that may contribute to cell properties associated with cancer, we’ll do a quick Gene Ontology analysis of the genes with highest differential expression in the fast cell line.
N <- 100
fast_genes <- rownames(DGE_mat_ordered)[1:N]
head(fast_genes)
## [1] "VIM" "LDHB" "SERPINE1" "MSN" "GPX1" "CAV1"
write.csv(fast_genes,"fast_genes.csv",row.names=FALSE,quote=FALSE)
Look in your working directory to find the file “fast_genes.csv”. Open it, highlight and copy the gene names. We’ll input them into the “GO Enrichment Analysis” window at the Gene Ontology Resource.
Look to see if any of the results include biological processes related to motility and migration.
QUESTION 1
What is the sum of mRNA values (in units of TPM) for the experiment “mRNA_R20”?
# assign your answer to mRNA_R20_TPM
mRNA_R20_TPM <- 936328
print(paste("The number of gene transcripts in experiment mRNA_20 is",
mRNA_R20_TPM,"Transcripts per Million (TPM)"))
## [1] "The number of gene transcripts in experiment mRNA_20 is 936328 Transcripts per Million (TPM)"
QUESTION 2
What is the name of one of the Colon Cancer cell lines?
# assign your answer to colon_cancer_line
colon_cancer_line <- "SW620"
print(paste(colon_cancer_line,"is one of the colon cancer cell lines in the PS-ON Cell Line Study."))
## [1] "SW620 is one of the colon cancer cell lines in the PS-ON Cell Line Study."
QUESTION 3
Approximately how many genes have dge close to zero?
# look back at the histogram titled "Differential gene expression"
# assign your answer to zero_dge
zero_dge <- 14000
print(paste("Approximately",zero_dge,
"genes have DGE close to zero."))
## [1] "Approximately 14000 genes have DGE close to zero."
QUESTION 4
What biological processes are associated with genes more strongly expressed in the slow cell line?
# create and write a list of genes at the bottom of
# DGE_mat_ordered
N <- 100
num_rows<-nrow(DGE_mat_ordered)
slow_genes <- rownames(DGE_mat_ordered)[(num_rows-N):num_rows]
head(slow_genes)
write.csv(slow_genes,"slow_genes.csv",row.names=FALSE,quote=FALSE)
Input the genes in the file “slow_genes.csv” into the “GO Enrichment Analysis” window at the Gene Ontology Resource.
Do you see any biological processes in the results that suggest these genes are not related to breast cancer? For example, there is a term near the end of the out regulation of cell adhesion. This may indicate that these genes help keep cells together in comparison to the fast genes that are associated with migration.
What other terms do you see that similarly indicate these genes do not contribute to breast cancer?
# assign your answer to `BP`
BP <- "epithelial cell morphogenesis"
print(paste("The gene ontology biological process -", BP,
"- indicates the slow genes may not be involved in breast cancer."))
## [1] "The gene ontology biological process - epithelial cell morphogenesis - indicates the slow genes may not be involved in breast cancer."
knitr() is the R package that generates reports from R Markdown. R Markdown is the syntax we are using in this document, denoted by the .Rmd file type. With knitr() and R Markdown, we can create reports of our work in the form of Word doc, PDF, and HTML files.
An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.
Click the “Knit” button at the top of this window and select “Knit to html.”
We have successfully applied our R skills to analyze data from human cancer cell lines and to calculate differential gene expression for models of normal breast tissue cells (“slow” cell line) versus models of aggressive cancer cells (“fast” cell line). We found interesting biological functions for the groups of genes most differentially expressed!