We previously looked at breast cancer cell lines from the Physical Sciences in Oncology (PS-ON) Cell Line Characterization Study. In this activity, we will consider data on colon cancer cell lines.
Cancer cell lines are cancer cells that keep dividing and growing over time, under certain conditions in a laboratory. Cancer cell lines are used in research to study the biology of cancer and to test cancer treatments.
The PS-ON Study includes imaging- and microscopy-based measurements of physical properties of the cells, such as morphology (shape) and motility (movement). We will examine:
Save this file with a different name and add your name as author.
Clean your Global Environment.
PSON.RData contains gene expression and cell speed data from the Physical Sciences in Oncology project.
# Replace FUNCTION to load in the data
# The objects will also appear in our "Environment" tab.
load("/home/data/PSON.RData",verbose=TRUE)
## Loading objects:
## pson_expr_df
## cell_speeds_df
Objects were named to be descriptive.
Create a matrix containing just the mRNA values of the genes.
# Remove the first column because it contains gene names
# by replacing N
pson_expr_mat <- as.matrix(pson_expr_df[, -1])
# Make the gene names into row names
# by replacing FEATURE
rownames(pson_expr_mat) <- pson_expr_df$symbol
# Use indexing to take a look at the first few rows and columns
pson_expr_mat[1:5, 1:5]
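To see why the gene-name column is dropped before the conversion, here is a minimal sketch on an invented two-gene data frame (the names toy_expr_df and toy_expr_mat are hypothetical, not part of PSON.RData). A matrix holds a single data type, so keeping a text column would turn every value into text.

```r
# Toy data frame (invented values): one text column plus two numeric columns
toy_expr_df <- data.frame(symbol = c("G1", "G2"),
                          s1 = c(1.5, 2.5),
                          s2 = c(3, 4))

# Converting everything, text column included, gives a character matrix
is.character(as.matrix(toy_expr_df))

# Dropping the first column keeps the matrix numeric
toy_expr_mat <- as.matrix(toy_expr_df[, -1])
is.numeric(toy_expr_mat)

# Gene names become row names, so rows can be looked up by name
rownames(toy_expr_mat) <- toy_expr_df$symbol
toy_expr_mat["G2", "s1"]
```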
As we did with the TCGA expression data, log-transform the PS-ON expression data to compress the range of the values.
# Replace FUNCTION to do log transformation
pson_log_mat <- log(1 + pson_expr_mat)
# Index the first few rows and columns
pson_log_mat[1:5,1:5]
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18
## TSPAN6 3.542697 3.830813 3.699325 3.600321 3.499533
## TNMD 0.000000 0.000000 0.000000 0.000000 0.000000
## DPM1 5.138501 4.874281 4.890800 4.742582 4.823261
## SCYL3 1.047319 1.047319 1.018847 1.075002 1.193922
## C1orf112 1.906575 2.553344 2.412336 2.330200 2.398804
The log-transformed data are less variable. We will examine the effect on data distributions with the TCGA data.
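As a small illustration of why 1 is added before taking the log, the sketch below uses invented values rather than PS-ON data. It also shows that base R's log1p() computes the same quantity as log(1 + x).

```r
# Toy expression values (invented for illustration), including a zero
toy_expr <- c(0, 1, 10, 100, 1000)

# log(0) is -Inf, so 1 is added first; zeros then map to 0
toy_log <- log(1 + toy_expr)

# log1p(x) is equivalent to log(1 + x)
all.equal(toy_log, log1p(toy_expr))

# The raw values span 0 to 1000; the log values span 0 to about 6.9
range(toy_expr)
range(toy_log)
```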
What cell lines (cellLine) were examined?
And what cancer types do the cell lines represent (diagnosis)?
# List the unique combinations of cell line and the cancer each one models
unique(cell_speeds_df[,c(2,3)])
## cellLine diagnosis
## 1 SW620 Colon Cancer
## 8 SW480 Colon Cancer
## 14 RWPE-1 Not Applicable
## 21 A375 Skin Cancer
## 28 T98G Brain Cancer
## 35 22Rv1 Prostate Cancer
## 42 T-47D Breast Cancer
## 49 U-87 Brain Cancer
## 56 MDA-MB-231 Breast Cancer
We are interested in the two colon cancer cell lines:
SW620 cells are considered more aggressive than SW480 cells. Both cell lines were derived from the same patient, but at different times and locations:
This makes them a valuable pair for studying colon cancer progression and metastasis in vitro.
Let’s see if there are major differences in the expression of their genes.
# You can use the table function to do something similar
unique(cell_speeds_df$experimentalCondition)
## [1] "Glass" "HyaluronicAcid Collagen"
## [3] "HyaluronicAcid Fibronectin" "30 kPa polyacrylamide Collagen"
## [5] "30 kPa polyacrylamide Fibronectin" "500 Pa polyacrylamide Collagen"
## [7] "500 Pa polyacrylamide Fibronectin"
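The comment above mentions table() as an alternative to unique(). A minimal sketch on an invented data frame (toy_cond_df is a hypothetical name) shows the difference: unique() lists each value once, while table() also counts how often each value occurs.

```r
# Toy data frame (invented values) with repeated conditions
toy_cond_df <- data.frame(
  cellLine = c("A", "A", "B", "B", "B"),
  experimentalCondition = c("Glass", "Gel", "Glass", "Gel", "Gel")
)

# unique() lists each distinct value once, in order of first appearance
unique(toy_cond_df$experimentalCondition)

# table() lists each distinct value with its count, sorted by name
cond_counts <- table(toy_cond_df$experimentalCondition)
cond_counts
```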
Previously, we looked at the speed and expression data for breast cancer cell lines on the HyaluronicAcid Collagen substrate. Here, we will use the colon cancer data for the experimental condition HyaluronicAcid Fibronectin.
# Make a smaller data frame for the `HyaluronicAcid Fibronectin` condition
# by replacing CONDITION
fibro_colon_df <- subset(cell_speeds_df,
experimentalCondition == "HyaluronicAcid Fibronectin")
fibro_colon_df
## sample cellLine diagnosis experimentalCondition
## 3 mRNA_R20 SW620 Colon Cancer HyaluronicAcid Fibronectin
## 9 mRNA_R41 SW480 Colon Cancer HyaluronicAcid Fibronectin
## 16 mRNA_R27 RWPE-1 Not Applicable HyaluronicAcid Fibronectin
## 23 mRNA_R48 A375 Skin Cancer HyaluronicAcid Fibronectin
## 30 mRNA_R13 T98G Brain Cancer HyaluronicAcid Fibronectin
## 37 mRNA_R6 22Rv1 Prostate Cancer HyaluronicAcid Fibronectin
## 44 mRNA_R55 T-47D Breast Cancer HyaluronicAcid Fibronectin
## 51 mRNA_R34 U-87 Brain Cancer HyaluronicAcid Fibronectin
## 58 mRNA_R62 MDA-MB-231 Breast Cancer HyaluronicAcid Fibronectin
## summary_metric average_value total_number_of_cells_tracked
## 3 speed_um_hr 53.006023 36
## 9 speed_um_hr 5.679561 65
## 16 speed_um_hr 16.205186 105
## 23 speed_um_hr 40.729492 51
## 30 speed_um_hr 28.483277 86
## 37 speed_um_hr 31.806179 56
## 44 speed_um_hr 11.288006 27
## 51 speed_um_hr 45.791985 47
## 58 speed_um_hr 55.548778 26
# Make a smaller data frame that includes only the colon cancer cell lines
# by replacing CANCER
fibro_colon_df <- subset(fibro_colon_df,
diagnosis == "Colon Cancer")
fibro_colon_df[,c(1,2,6)]
## sample cellLine average_value
## 3 mRNA_R20 SW620 53.006023
## 9 mRNA_R41 SW480 5.679561
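The two subset() calls above could also be combined into one by joining the conditions with &. The sketch below shows this on an invented data frame (toy_speeds and its values are hypothetical), under the assumption that both filters are simple equality tests as in the activity.

```r
# Toy data frame (invented values) with the same column names as cell_speeds_df
toy_speeds <- data.frame(
  sample = c("s1", "s2", "s3", "s4"),
  diagnosis = c("Colon Cancer", "Colon Cancer", "Skin Cancer", "Colon Cancer"),
  experimentalCondition = c("Glass", "Gel", "Gel", "Gel"),
  average_value = c(10, 20, 30, 40)
)

# Keep rows that satisfy both conditions at once
toy_subset <- subset(toy_speeds,
                     experimentalCondition == "Gel" & diagnosis == "Colon Cancer")
toy_subset
```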
The SW620 cell line (from the metastasis) moves almost ten times as fast as the SW480 cell line (from the primary tumor) on the “HyaluronicAcid Fibronectin” substrate: 53 µm/hr versus 5.7 µm/hr.
We’ll refer to SW620 as the fast cell line and SW480 as the slow cell line.
Let’s extract the two colon cancer cell experiments (mRNA_R20 and mRNA_R41) from the expression matrix.
The sample names (experiments) are the first column in fibro_colon_df and the column names in pson_log_mat.
# Match the experiments in both objects
# by replacing FUNCTION
exp_colon <- match(fibro_colon_df$sample, colnames(pson_log_mat))
print(exp_colon)
## [1] 3 10
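For readers who have not used match() before, a minimal sketch with invented sample names: match(x, table) returns, for each element of x, its position in table, or NA when it is absent.

```r
# Invented sample names for illustration
wanted <- c("mRNA_C", "mRNA_A")
available <- c("mRNA_A", "mRNA_B", "mRNA_C", "mRNA_D")

# Position of each wanted name within the available names
idx <- match(wanted, available)
idx

# Indexing with those positions returns the names in the order requested
available[idx]

# A name that is not present gives NA
match("mRNA_Z", available)
```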
Extract columns 3 and 10 from the expression matrix to create a matrix of genes with mRNA levels for the fast and slow cell lines on the HyaluronicAcid Fibronectin substrate.
# Subset the expression data for the two colon cancer cell lines
# by replacing COLUMNS
fibro_colon_log_mat <- pson_log_mat[, exp_colon]
# Rename the columns according to the relative speed of the cells
colnames(fibro_colon_log_mat) <- c("fast","slow")
Take a look at your new expression matrix:
print(fibro_colon_log_mat)
Genes that have very different mRNA levels in the “fast” versus “slow” cell lines may be informative about why the cell lines behave differently.
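By subtracting the expression in the “fast” cell line from the expression in the “slow” cell line, we create a differential gene expression profile, or DGE profile. With this convention, positive values mean higher expression in the slow cell line and negative values mean higher expression in the fast cell line.
# Subtract the first column ("fast") from the second column ("slow")
# Replace N1 and N2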
dge_colon <- fibro_colon_log_mat[,2] - fibro_colon_log_mat[,1]
# Add dge as a column to a new matrix
# by replacing FUNCTION
DGE_mat_colon <- cbind(fibro_colon_log_mat,dge_colon)
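Because the values are natural logs of (1 + expression), a difference between two columns is the log of a ratio. The sketch below uses invented numbers (raw_a and raw_b are hypothetical) to show how a log difference converts back to a fold change.

```r
# Invented raw expression values for one gene in two samples
raw_a <- 400
raw_b <- 6

# Log-transform as in the activity
log_a <- log(1 + raw_a)
log_b <- log(1 + raw_b)

# The difference of logs equals the log of the ratio of (1 + expression)
log_diff <- log_a - log_b
all.equal(log_diff, log((1 + raw_a) / (1 + raw_b)))

# Undo the log to get the fold change in (1 + expression)
fold_change <- exp(log_diff)
fold_change

# A log difference of 4 corresponds to roughly a 55-fold change
exp(4)
```

The sign of the difference only tells you which sample is higher, so it helps to note which column was subtracted from which before interpreting the values.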
Plot a histogram to see the distribution of differential expression values.
# Create a plot as you did for the breast cancer cell lines
# Use only the dge_colon column so the "fast" and "slow" columns are not pooled in
hist(DGE_mat_colon[, "dge_colon"],
main="Distribution of differential expression values",
xlab="dge values between fast and slow cell lines")
The genes with large differential expression may provide us with clues as to why the cell lines behave so differently. You should find that there are fewer genes that are highly overexpressed for the colon cancer cell lines than for the breast cancer cell lines.
Why do you think this is so?
# How many genes have dge > 4
# How many genes have dge < -4
length(dge_colon[dge_colon<(-4)])
## [1] 4
length(dge_colon[dge_colon>(4)])
## [1] 9
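An equivalent way to count values past a threshold is to sum the logical comparison, since TRUE counts as 1. This is a sketch on an invented named vector (toy_dge is hypothetical). Note that the comparison needs a space or parentheses before the negative number, because `x<-4` without spaces is read as an assignment.

```r
# Invented differential expression values for five genes
toy_dge <- c(g1 = -5.2, g2 = 0.3, g3 = 4.7, g4 = -0.1, g5 = 6.0)

# Count values below -4 and above 4
n_low <- sum(toy_dge < -4)
n_high <- sum(toy_dge > 4)
n_low
n_high

# The same logical vector can pull out the gene names
names(toy_dge)[toy_dge > 4]
```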
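There are more genes that are highly expressed in the SW480 (slow) cell line than in the SW620 (fast) cell line: because dge_colon is slow minus fast in the code above, the 9 genes above 4 are higher in SW480 and the 4 genes below -4 are higher in SW620. Let’s sort the genes to look at one end of this distribution.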
# Order the differential gene expression values from LOW to HIGH
# This is different from the sorting for the breast cancer cell lines
# Note that we put decreasing = FALSE
# Because dge_colon is slow minus fast, this puts the most negative values,
# i.e. the genes more highly expressed in the "fast" cell line, at the top
# Replace FUNCTION
order_dge_colon <- order(dge_colon, decreasing = FALSE)
# Replace ORDERED_ROWS
DGE_mat_ordered_colon <- DGE_mat_colon[order_dge_colon,]
head(DGE_mat_ordered_colon,15)
## fast slow dge_colon
## S100P 5.949444 1.80005827 -4.149386
## MFNG 4.412798 0.28517894 -4.127619
## LCP1 4.439942 0.33647224 -4.103469
## PEG10 4.141387 0.08617770 -4.055209
## C3orf14 3.918005 0.00000000 -3.918005
## CNN3 4.164803 0.57661336 -3.588190
## AKR1B1 4.041647 0.50681760 -3.534829
## MT1G 3.326833 0.09531018 -3.231523
## WIPF1 3.170106 0.02955880 -3.140547
## HBE1 3.065725 0.11332869 -2.952396
## SPON2 3.434310 0.52472853 -2.909581
## PLLP 3.367641 0.60976557 -2.757875
## MAGEB2 2.644045 0.00000000 -2.644045
## TFF3 4.503580 1.88706965 -2.616511
## PLAC8 2.602690 0.00000000 -2.602690
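To make the sorting step concrete, here is a minimal sketch on an invented three-gene matrix (toy_mat and its values are hypothetical). order() returns the row positions that would sort the vector, and indexing the matrix with those positions rearranges whole rows.

```r
# Toy matrix (invented values): two samples, three genes
toy_mat <- cbind(a = c(5, 1, 3), b = c(1, 4, 3))
rownames(toy_mat) <- c("geneX", "geneY", "geneZ")

# Difference b minus a: negative where a is larger, positive where b is larger
toy_diff <- toy_mat[, "b"] - toy_mat[, "a"]

# Row positions that sort the differences from low to high
ord <- order(toy_diff, decreasing = FALSE)
ord

# Reorder the rows; the most negative difference (a larger than b) comes first
toy_sorted <- cbind(toy_mat, toy_diff)[ord, ]
toy_sorted
```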
In the output above, the top genes are S100P, MFNG, LCP1, and PEG10. Their dge_colon values are the most negative, which means their expression is greater in the fast cell line than in the slow cell line.
# In DGE_mat_ordered_colon, the top genes are
# more highly expressed in the fast versus slow cell line
N <- 50
colon_genes <- rownames(DGE_mat_ordered_colon)[1:N]
write.table(colon_genes,"colon_genes.csv",
row.names=FALSE, col.names=FALSE, quote=FALSE)
Look in your working directory to find the file colon_genes.csv. Open it (when you click on the file name, select the View File option), then highlight and copy the gene names. Paste them into the “Try a gene set query” window at Gene Set AI.
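If you want to confirm what was written before copying it, one option is to read the file back into R. The sketch below writes an invented gene list to a temporary file (toy_genes and out_file are hypothetical names) so that it does not touch your working directory.

```r
# Invented gene names for illustration
toy_genes <- c("GENE1", "GENE2", "GENE3")

# Write one name per line, without row names, a header, or quotes
out_file <- tempfile(fileext = ".csv")
write.table(toy_genes, out_file,
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# Read the lines back and compare them with the original vector
read_back <- readLines(out_file)
identical(read_back, toy_genes)
```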
What functional theme is detected?
Name: Cellular Stress Response and Metabolic Regulation
Create a list of some of the groups of proteins you find: