In this module, we will switch gears and work with data from experiments with human cancer cell lines from the Physical Sciences in Oncology (PS-ON) Cell Line Characterization Study.
Cancer cell lines are cancer cells that keep dividing and growing over time, under certain conditions in a laboratory. Cancer cell lines are used in research to study the biology of cancer and to test cancer treatments.
The PS-ON Study includes imaging- and microscopy-based measurements of physical properties of the cells, such as morphology (shape) and motility (movement). We will examine:
Save this file with a different name and add your name as author.
We will load the `imager` library so we may look at some images in this activity.
`PSON.RData` contains gene expression and cell speed data from the Physical Sciences in Oncology project.
# The objects will also appear in our "Environment" tab.
load("/home/data/PSON.RData",verbose=TRUE)
## Loading objects:
## pson_expr_df
## cell_speeds_df
Objects were named to be descriptive.
The Environment tab gives us basic information on the object that was loaded.
You can use some of the R functions we learned previously to remind us what is in the data frames we loaded.
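As a reminder of what such an inspection might look like, here is a minimal sketch on a small made-up data frame. The object `toy_df` and its values are hypothetical and are not taken from `PSON.RData`; the same calls can be applied to `pson_expr_df` and `cell_speeds_df`.

```r
# Hypothetical toy data frame standing in for a loaded object
toy_df <- data.frame(
  symbol = c("GENE_A", "GENE_B", "GENE_C"),
  exp_1  = c(0.5, 12.0, 3.2),
  exp_2  = c(1.5, 10.0, 0.0)
)

dim(toy_df)       # number of rows and columns
head(toy_df, 2)   # first two rows
str(toy_df)       # column names and types
```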
In `pson_expr_df`, rows are named with the genes whose expression (mRNA) values were measured. Columns are named with identifiers for the experiments performed.
Gene names (“symbol”) are in the first column. We’ll create a numerical matrix of the mRNA levels and designate the gene symbols as names for the rows of the matrix.
# Remove the first column because it contains gene names
# and assign the result to a new object
# Replace N with the number needed to do this
# Remember, we want to remove something
pson_expr_mat <- as.matrix(pson_expr_df[, -1])
# Make the gene names into row names.
# Replace FEATURE with the correct column name
rownames(pson_expr_mat) <- pson_expr_df$symbol
Use indexing to examine your new matrix `pson_expr_mat`. Does it match what you expect?
pson_expr_mat[1:5, 1:5]
# now instead of the symbols being a column, they are the names for the rows.
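The conversion above can be sketched on a small made-up data frame. The object `toy_expr_df` and its values are hypothetical and only illustrate the pattern; they are not part of the PS-ON data.

```r
# Hypothetical toy data frame: gene symbols in the first column
toy_expr_df <- data.frame(
  symbol = c("GENE_A", "GENE_B", "GENE_C"),
  exp_1  = c(0.5, 12.0, 3.2),
  exp_2  = c(1.5, 10.0, 0.0)
)

# Drop the first (character) column so the matrix is purely numeric
toy_expr_mat <- as.matrix(toy_expr_df[, -1])
rownames(toy_expr_mat) <- toy_expr_df$symbol

is.numeric(toy_expr_mat)          # check that no character column was carried over
toy_expr_mat["GENE_B", "exp_1"]   # look up a value by gene and experiment name
```

Keeping the gene symbols as row names rather than as a column is what allows arithmetic such as `log2()` to be applied to the whole matrix at once.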
The object `cell_speeds_df` contains data that is completely new to us.
# Check out this new table
print(cell_speeds_df)
Get the column names of this data frame:
# Replace FUNCTION with a function that extracts the column names
colnames(cell_speeds_df)
## [1] "sample" "cellLine"
## [3] "diagnosis" "experimentalCondition"
## [5] "summary_metric" "average_value"
## [7] "total_number_of_cells_tracked"
A little background!
What cell lines (`cellLine`) were examined? And what cancer types do the cell lines represent (`diagnosis`)?
We can use the function `unique()` on the table to get the information for both features.
# for cell lines and the cancers they model
unique(cell_speeds_df[,c(2,3)])
## cellLine diagnosis
## 1 SW620 Colon Cancer
## 8 SW480 Colon Cancer
## 14 RWPE-1 Not Applicable
## 21 A375 Skin Cancer
## 28 T98G Brain Cancer
## 35 22Rv1 Prostate Cancer
## 42 T-47D Breast Cancer
## 49 U-87 Brain Cancer
## 56 MDA-MB-231 Breast Cancer
We are interested in the two breast cancer cell lines: T-47D and MDA-MB-231.
Each experiment was conducted under different conditions. The column “experimentalCondition” gives the type of substrate on which cell speed was measured.
# Replace FEATURE with the column name to get the experimental conditions
table(cell_speeds_df$experimentalCondition)
##
##    30 kPa polyacrylamide Collagen 30 kPa polyacrylamide Fibronectin
##                                 9                                 9
##    500 Pa polyacrylamide Collagen 500 Pa polyacrylamide Fibronectin
##                                 9                                 9
##                             Glass           HyaluronicAcid Collagen
##                                 8                                 9
##        HyaluronicAcid Fibronectin
##                                 9
We will use the data for the experimental condition HyaluronicAcid Collagen.
Hyaluronic acid, collagen, and fibronectin are all common components of the extracellular matrix for biological cells. The extracellular matrix is a complex network of proteins and other molecules that surrounds and supports cells and tissues.
It is a biologically meaningful material on which to measure the movement of cells!
substr_img<-load.image("/home/data/substrate.jpg")
plot(substr_img,axes=FALSE)
Hyaluronic acid is a type of sugar molecule that maintains tissue structure. Hyaluronic acid’s unique properties, such as its ability to bind water and interact with other components of the extracellular matrix, contribute to its diverse functions in the body.
Let’s look in more detail at the condition HyaluronicAcid Collagen.
# Make a smaller data frame for this condition with a new function `subset()`
hyal_coll_df <- subset(cell_speeds_df,
experimentalCondition == "HyaluronicAcid Collagen")
Nine experiments were done on the “HyaluronicAcid Collagen” substrate.
Two of the experiments were done with the breast cancer cell lines we are interested in, i.e. T-47D and MDA-MB-231.
# Make an even smaller data frame that includes only the breast cancer cell lines
# Replace CANCER TYPE with the `diagnosis` we're interested in
hyal_brca_df <- subset(hyal_coll_df,
diagnosis == "Breast Cancer")
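The two filtering steps above could also be written as a single `subset()` call with both conditions joined by `&`. The sketch below uses a made-up table (`toy_speeds_df` and its cell line names are hypothetical) to compare the two approaches.

```r
# Hypothetical toy table with the same kind of columns as cell_speeds_df
toy_speeds_df <- data.frame(
  cellLine  = c("LINE_1", "LINE_2", "LINE_3", "LINE_1"),
  diagnosis = c("Breast Cancer", "Colon Cancer", "Breast Cancer", "Breast Cancer"),
  experimentalCondition = c("Glass", "Glass", "HyaluronicAcid Collagen",
                            "HyaluronicAcid Collagen")
)

# One step: both conditions joined with &
toy_one_step <- subset(toy_speeds_df,
                       experimentalCondition == "HyaluronicAcid Collagen" &
                         diagnosis == "Breast Cancer")

# Two steps: filter by condition first, then by diagnosis
toy_two_step <- subset(subset(toy_speeds_df,
                              experimentalCondition == "HyaluronicAcid Collagen"),
                       diagnosis == "Breast Cancer")

all.equal(toy_one_step, toy_two_step)   # compare the two results
```

Filtering in two steps, as in the activity, has the advantage of leaving the intermediate table available for inspection.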
The `summary_metric` is cell speed in microns per hour. From the table below, we see that one cell line (MDA-MB-231) moves more than twice as fast as the other cell line (T-47D) on the “HyaluronicAcid Collagen” substrate: 36.2 µm/hr versus 16.5 µm/hr.
# Get speed information on our two cell lines
hyal_brca_df[,c(1,2,6)]
## sample cellLine average_value
## 43 mRNA_R56 T-47D 16.51548
## 57 mRNA_R63 MDA-MB-231 36.19613
In research, the faster cell line MDA-MB-231 is used as a model of aggressive breast cancer, and the slower cell line T-47D is used as a model of less aggressive breast cancer.
We’ll refer to MDA-MB-231 as the fast cell line and T-47D as the slow cell line.
Let’s first prepare the expression data frame.
As we did with the TCGA expression data, we log-transform the PS-ON expression data to compress the range of the values.
# The same log transformation we did for TCGA data
# Replace N with the number we add to each element to avoid infinity values
pson_log_mat <- log2(pson_expr_mat + 1)
The log-transformed data are less variable.
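To see why 1 is added before taking the logarithm, consider a short sketch with made-up values (the vector `toy_tpm` is hypothetical and not drawn from the PS-ON matrix).

```r
# Hypothetical expression values spanning several orders of magnitude
toy_tpm <- c(0, 1, 3, 1023)

log2(toy_tpm)       # the zero becomes -Inf, which breaks later arithmetic
log2(toy_tpm + 1)   # 0 1 2 10: zeros stay at 0 and the range is compressed
```

A range of 0 to 1023 on the original scale becomes 0 to 10 on the log scale, which is the compression referred to above.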
Even though the expression matrix contains information from 63 experiments, we want the expression data from only two of them: the breast cancer cell lines T-47D and MDA-MB-231 measured on the HyaluronicAcid Collagen substrate (samples mRNA_R56 and mRNA_R63, respectively).
Let’s extract these two experiments from the expression matrix.
We find the sample names (experiments) in two places: the `sample` column of `hyal_brca_df`, and the column names of `pson_log_mat`. The function `match()` returns the positions at which the elements of its first argument are found in its second argument, i.e. which columns in `pson_log_mat` correspond to our experiments.
# Match the experiments in both objects
experiments <- match(hyal_brca_df$sample, colnames(pson_log_mat))
print(experiments)
## [1] 44 58
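The behaviour of `match()` is easiest to see on a tiny example. The identifiers below are hypothetical and only illustrate how positions are returned.

```r
# Hypothetical sample identifiers
toy_wanted  <- c("S3", "S1")
toy_columns <- c("S1", "S2", "S3", "S4")

# For each element of the first argument, its position in the second
toy_pos <- match(toy_wanted, toy_columns)
toy_pos                    # 3 1
toy_columns[toy_pos]       # "S3" "S1": the order of the first argument is kept
match("S9", toy_columns)   # NA when there is no match
```

Because the result follows the order of the first argument, the extracted columns come out in the same order as the rows of `hyal_brca_df`, which matters when the columns are renamed afterwards.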
Extract these columns from the expression matrix to create a matrix of genes with mRNA levels for the fast and slow cell lines on the hyaluronic acid collagen substrate.
# Make a sub-matrix with the expression data for only these two experiments
# Replace OBJECT with the one you created in the previous code chunk
hyal_brca_log_mat <- pson_log_mat[, experiments]
# Call experiments according to the relative speed of the cells
colnames(hyal_brca_log_mat) <- c("slow","fast")
# Look at some of the rows
round(hyal_brca_log_mat[35:45,],1)
## slow fast
## ICA1 6.1 2.7
## DBNDD1 5.0 1.4
## ALS2 3.1 4.3
## CASP10 0.1 1.8
## CFLAR 4.1 6.1
## TFPI 0.1 7.1
## NDUFAF7 3.7 3.7
## RBM5 5.6 6.0
## MTMR7 0.6 0.9
## SLC7A2 0.0 3.6
## ARF5 7.1 7.2
Some of the genes have a similar expression level in both cell lines, but some genes are quite different.
Genes that have very different mRNA levels in the fast versus slow cell lines may be informative about why the cell lines behave differently.
By subtracting the expression values in the slow cell line from the expression values in the fast cell line, we create a differential gene expression profile, or DGE profile.
# Subtract the first column ("slow") from the second column ("fast")
dge <- hyal_brca_log_mat[,2] - hyal_brca_log_mat[,1]
# Add dge as a column to a new matrix
DGE_mat <- cbind(hyal_brca_log_mat,dge)
round(DGE_mat[35:45,],1)
## slow fast dge
## ICA1 6.1 2.7 -3.5
## DBNDD1 5.0 1.4 -3.6
## ALS2 3.1 4.3 1.1
## CASP10 0.1 1.8 1.8
## CFLAR 4.1 6.1 2.0
## TFPI 0.1 7.1 7.0
## NDUFAF7 3.7 3.7 0.0
## RBM5 5.6 6.0 0.4
## MTMR7 0.6 0.9 0.3
## SLC7A2 0.0 3.6 3.6
## ARF5 7.1 7.2 0.0
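Because both columns hold `log2(x + 1)` values, their difference is the log2 of a ratio, so a `dge` of 1 corresponds to roughly a doubling. The sketch below checks this with made-up raw values (`toy_slow` and `toy_fast` are hypothetical).

```r
# Hypothetical raw expression values for one gene
toy_slow <- 3
toy_fast <- 15

# Difference of the log-transformed values
toy_dge <- log2(toy_fast + 1) - log2(toy_slow + 1)
toy_dge                                   # 2

# The same quantity written as the log of a ratio
log2((toy_fast + 1) / (toy_slow + 1))     # 2

# Undo the log to get the fold change of the shifted values
2^toy_dge                                 # 4
```

Read this way, a `dge` of about 7, as for TFPI in the table above, corresponds to a difference of roughly 2^7 = 128-fold between the two cell lines.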
Plot a histogram to see the distribution of differential expression values (`dge`).
# create a histogram from DGE_mat for the DGE profile
# Add a title name to the plot and label the x axis
# We've made histograms in previous activities and you can use help
hist(DGE_mat[, "dge"],
main = "Distribution of differential expression values",
xlab = "dge values between fast and slow cell lines")
What values of differential gene expression do most genes have?
How many values of dge are in the tails of the histogram?
Let’s look first at the left tail.
# The number of genes with dge less than -5
length(dge[dge<(-5)])
## [1] 94
Now let’s look at the right tail:
# Find the number of genes with dge GREATER THAN 5
length(dge[dge>5])
## [1] 134
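Counting values that pass a threshold can also be done by summing a logical vector, since `TRUE` counts as 1. A small sketch with made-up values (`toy_dge_vals` is hypothetical):

```r
# Hypothetical differential expression values
toy_dge_vals <- c(-6.2, -0.4, 0.0, 1.3, 5.5, 7.1)

length(toy_dge_vals[toy_dge_vals > 5])   # 2, the pattern used above
sum(toy_dge_vals > 5)                    # 2, TRUE counts as 1
sum(abs(toy_dge_vals) > 5)               # 3, both tails together
```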
You found the number of genes whose expression is much higher in one cell line than in the other. They appear in the tails of the histogram. The genes with large differential expression are the most interesting to consider because they may provide us with clues as to why the two cancer cell lines behave so differently. Let’s find them!
# Sort the differential gene expression values from high to low
order_dge <- order(dge, decreasing = TRUE)
# Create a new matrix with the dge values sorted by replacing OBJECT
DGE_mat_ordered <- DGE_mat[order_dge,]
# Check the 15 highest values
head(DGE_mat_ordered,15)
## slow fast dge
## VIM 1.23878686 10.959234 9.720447
## LDHB 0.28688115 9.076067 8.789186
## SERPINE1 0.01435529 8.718704 8.704349
## MSN 0.01435529 8.709945 8.695590
## GPX1 0.04264434 8.720347 8.677703
## CAV1 0.00000000 8.457955 8.457955
## GSTP1 0.59454855 8.961392 8.366843
## FOSL1 0.68706069 8.750841 8.063780
## AXL 0.18903382 8.201781 8.012747
## F3 0.69599381 8.651411 7.955417
## CST1 0.00000000 7.920532 7.920532
## PLAT 0.09761080 7.883315 7.785705
## MMP14 0.61353165 8.398744 7.785212
## AKR1B1 0.08406426 7.820881 7.736817
## PLAU 0.51601515 8.167368 7.651353
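Note that `order()` returns row positions rather than sorted values, which is why its result is used as a row index. A minimal sketch on a made-up matrix (`toy_mat` and its gene names are hypothetical):

```r
# Hypothetical matrix with slow, fast and dge columns
toy_mat <- cbind(slow = c(5, 1, 3), fast = c(1, 6, 3))
toy_mat <- cbind(toy_mat, dge = toy_mat[, "fast"] - toy_mat[, "slow"])
rownames(toy_mat) <- c("GENE_A", "GENE_B", "GENE_C")

# Positions of the rows, from highest to lowest dge
toy_idx <- order(toy_mat[, "dge"], decreasing = TRUE)
toy_idx                  # 2 3 1

# Use the positions to reorder whole rows, keeping the row names attached
toy_sorted <- toy_mat[toy_idx, ]
rownames(toy_sorted)     # "GENE_B" "GENE_C" "GENE_A"
```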
What genes are at the bottom of the list?
# Replace FUNCTION with the companion function to `head`
# to look at the lowest 15 dge values
tail(DGE_mat_ordered,15)
## slow fast dge
## ST14 7.778734 0.87184365 -6.906891
## AZGP1 8.022312 0.57531233 -7.447000
## KRT23 7.504700 0.00000000 -7.504700
## FOXA1 7.829723 0.04264434 -7.787078
## OLFM1 7.868699 0.00000000 -7.868699
## RAB25 8.054957 0.09761080 -7.957346
## AGR3 8.373170 0.11103131 -8.262138
## IGFBP5 8.417684 0.02856915 -8.389115
## CDH1 8.760354 0.17632277 -8.584031
## STC2 9.032679 0.41142625 -8.621253
## CRABP2 11.248586 2.49057013 -8.758016
## SERPINA6 9.490229 0.04264434 -9.447585
## CRIP1 9.928415 0.31034012 -9.618075
## PIP 11.931051 0.00000000 -11.931051
## MGP 12.594747 0.27500705 -12.319740
The genes with higher expression in the fast cell line (MDA-MB-231) are those that appear in the right tail of the histogram for the differential gene expression.
Similarly, genes whose expression is highest in the slow cell line (T-47D) appear in the left tail of the histogram for the differential gene expression.
VIM, which codes for the protein vimentin, is at the very top of the expression matrix ranked by `dge`, and KRT23, which codes for the protein keratin, is near the bottom.
This image comes from the paper, “Vimentin induces changes in cell shape, motility, and adhesion during the epithelial to mesenchymal transition”.
Cells that express vimentin filaments (VIF) but not keratin filaments (KIF) are elongated in shape and more motile (panel A), whereas cells with KIF and not VIF are round and undergo fewer changes in morphology (shape) and position (panel B).
VIF_KIF_img<-load.image("/home/data/VIF_KIF.JPG")
plot(VIF_KIF_img,axes=FALSE)
The main difference between the cells is the different expression levels of the gene VIM (relatively high in fast cell line) versus the gene KRT23 (relatively high in slow cell line).
The VIF only cells are similar to MDA-MB-231 cells (the fast cell line) and the KIF only cells are similar to T-47D cells (the slow cell line).
We saw from `cell_speeds_df` that the MDA-MB-231 cell line is more motile than the T-47D cell line, which is consistent with their use as models for more and less aggressive breast cancer, respectively.
Even though this is just an exploratory data analysis, we can still extract biological meaning from our results.
The Gene Ontology (GO) knowledgebase is the world’s largest source of information on the functions of genes. Several easy-to-use web servers accept a list of genes and test whether those genes collectively contribute to particular biological processes more often than sets of randomly selected genes would.
We can test our hypothesis that the most differentially expressed genes are those that contribute to the cell properties associated with each cancer type.
Let’s first do a quick Gene Ontology analysis of the genes with highest differential expression in the fast cell line.
# In DGE_mat_ordered, the top genes are more highly expressed
# in the fast versus slow cell line
N <- 50
fast_genes <- rownames(DGE_mat_ordered)[1:N]
write.table(fast_genes,"fast_genes.csv",
row.names=FALSE, col.names=FALSE, quote=FALSE)
Look in your working directory to find the file `fast_genes.csv`. Open it (when you click on the file name, select the View File option), then highlight and copy the gene names. Input them into the Try a gene set query window at Gene Set AI.
What functional theme is detected?
Name: Extracellular matrix remodeling and cell migration
Create a list of some of the groups of proteins you find:
There are many ways of summarizing mRNA expression levels. The mRNA expression values in our matrix have units of TPM, or “transcripts per million.” For each experiment, the gene expression values should add up to 1,000,000. In other words, each column of the matrix should sum to 1,000,000.
In the code chunk below, use the function `colSums` to find the sum of TPM values for each experiment.
# Use the colSums function on your expression matrix `pson_expr_mat`
# You can look up the help page for colSums
colSums(pson_expr_mat, na.rm = FALSE, dims = 1L)
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18 mRNA_R16 mRNA_R15 mRNA_R38
## 925249.2 939584.8 936328.0 939342.9 937213.3 932558.1 930438.3 941026.1
## mRNA_R42 mRNA_R41 mRNA_R40 mRNA_R39 mRNA_R37 mRNA_R36 mRNA_R24 mRNA_R28
## 934102.2 943195.8 928026.6 937107.5 935499.5 941930.0 940546.9 941475.4
## mRNA_R27 mRNA_R26 mRNA_R25 mRNA_R23 mRNA_R22 mRNA_R45 mRNA_R49 mRNA_R48
## 945065.0 934940.7 938330.5 927800.4 939314.6 928315.2 930200.8 927158.0
## mRNA_R47 mRNA_R46 mRNA_R44 mRNA_R43 mRNA_R10 mRNA_R14 mRNA_R13 mRNA_R12
## 929514.4 929947.4 933376.2 931928.5 950605.1 953271.7 952507.0 950621.5
## mRNA_R11 mRNA_R9 mRNA_R8 mRNA_R3 mRNA_R7 mRNA_R6 mRNA_R5 mRNA_R4
## 934212.8 950774.9 952000.5 936577.1 937718.2 932324.8 922843.2 935792.9
## mRNA_R2 mRNA_R1 mRNA_R52 mRNA_R56 mRNA_R55 mRNA_R54 mRNA_R53 mRNA_R51
## 937852.1 930950.6 944337.2 945542.5 944524.4 940821.8 946705.3 941959.7
## mRNA_R50 mRNA_R31 mRNA_R35 mRNA_R34 mRNA_R33 mRNA_R32 mRNA_R30 mRNA_R29
## 944849.8 949522.1 950227.3 951114.6 952487.6 949740.4 952105.8 946548.6
## mRNA_R59 mRNA_R63 mRNA_R62 mRNA_R61 mRNA_R60 mRNA_R58 mRNA_R57
## 951240.0 951305.3 950446.6 951255.0 946777.2 951215.8 949333.9
QUESTION: Do any of the columns add up to 1,000,000? Provide a possible explanation for your observations:
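None of the columns add up to 1,000,000; the sums range from about 923,000 to about 953,000. A shortfall of roughly 5 to 8% is too large to be explained by rounding errors alone. One possible explanation is that the matrix contains only a subset of the genes over which the TPM values were originally calculated.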
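One way column sums could fall below 1,000,000 is if some genes were dropped after the TPM values were computed. This is only an assumption about how such a matrix might be prepared, not a statement about the PS-ON pipeline. The sketch below illustrates the effect with made-up values (`toy_tpm_mat` is hypothetical).

```r
# Hypothetical TPM values for five genes in two experiments;
# each full column sums to 1,000,000 by construction
toy_tpm_mat <- cbind(
  exp_1 = c(400000, 300000, 200000, 60000, 40000),
  exp_2 = c(500000, 250000, 150000, 70000, 30000)
)
rownames(toy_tpm_mat) <- paste0("GENE_", 1:5)

colSums(toy_tpm_mat)          # 1e+06 for both columns

# Keep only the first four genes, as if the fifth had been filtered out
colSums(toy_tpm_mat[1:4, ])   # 960000 and 970000
```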
Extra: To learn more about the different ways RNA-seq levels are measured, check out StatQuest’s video.
Now let’s do a quick Gene Ontology analysis of the genes with the highest differential expression in the slow cell line.
# create and write a list of genes at the bottom of DGE_mat_ordered
# follow the code above but now we need the last 50 genes
# Replace N1 and N2 to find the gene names for the lowest 50 dge values
slow_genes <- rownames(DGE_mat_ordered)[18633:18682]
# Does your result make sense?
print(slow_genes)
## [1] "SLC7A8" "TGFB3" "RERG" "PRSS8" "GREB1" "KIAA1324"
## [7] "SPTSSB" "QPRT" "PDZK1" "SEPP1" "TTC39A" "VAV3"
## [13] "YBX2" "PGR" "TSPAN13" "CKMT1A" "TFF3" "BNIPL"
## [19] "DEGS2" "CXCL12" "FKBP10" "SYCP2" "PRLR" "ELF5"
## [25] "ESRP1" "OLFML3" "GATA3" "LURAP1L" "APOE" "ABCC11"
## [31] "PREX1" "CLDN3" "HOXC10" "FXYD3" "IGFBP2" "ST14"
## [37] "AZGP1" "KRT23" "FOXA1" "OLFM1" "RAB25" "AGR3"
## [43] "IGFBP5" "CDH1" "STC2" "CRABP2" "SERPINA6" "CRIP1"
## [49] "PIP" "MGP"
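The positions `18633:18682` work because the matrix has 18,682 rows. As a sketch of how the same selection might be written without hard-coding the row count, consider the made-up vector below (`toy_ordered_genes` and `toy_N` are hypothetical names).

```r
# Hypothetical ordered gene names, highest dge first
toy_ordered_genes <- paste0("GENE_", 1:10)
toy_N <- 3

# Hard-coded positions work only if the number of elements is known
toy_ordered_genes[8:10]

# The same selection computed from the length
toy_n_genes <- length(toy_ordered_genes)
toy_ordered_genes[(toy_n_genes - toy_N + 1):toy_n_genes]

# Or with the companion function to head()
tail(toy_ordered_genes, toy_N)
```

All three expressions select the last three names; the second and third keep working if the number of genes changes.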
When you are happy with your output, write it to a file:
# Create a file as we did for fast genes
write.table(slow_genes,"slow_genes.csv",
row.names=FALSE, col.names=FALSE, quote=FALSE)
Look in your working directory to find the file `slow_genes.csv`. Open it (when you click on the file name, select the View File option), then highlight and copy the gene names. Input them into the Try a gene set query window at Gene Set AI.
What functional theme is detected?
Name: Hormone Regulation and Epithelial Cell Differentiation
Create a list of some of the groups of proteins you find:
Do a similar analysis for the colon cancer cell lines in the activity `Colon_Cancer_Cell_Lines.Rmd`.
Click the “Knit” button at the top of this window and select “Knit to html.”
We have successfully applied our R skills to analyze data from human cancer cell lines and to calculate differential gene expression for models of less aggressive breast cancer cells (“slow” cell line) versus models of aggressive breast cancer cells (“fast” cell line). We found interesting biological functions for the groups of genes most differentially expressed!