We previously looked at breast cancer cell lines from the Physical Sciences in Oncology (PS-ON) Cell Line Characterization Study. In this activity, we will consider data on colon cancer cell lines.
Cancer cell lines are cancer cells that keep dividing and growing over time, under certain conditions in a laboratory. Cancer cell lines are used in research to study the biology of cancer and to test cancer treatments.
The PS-ON Study includes imaging- and microscopy-based measurements of physical properties of the cells, such as morphology (shape) and motility (movement). We will examine:
Save this file with a different name and add your name as author.
Clean your Global Environment.
PSON.RData contains gene expression and cell speed data from the Physical Sciences in Oncology project.
# Replace FUNCTION to load in the data
# The objects will also appear in our "Environment" tab.
load("/home/data/PSON.RData",verbose=TRUE)
## Loading objects:
## pson_expr_df
## cell_speeds_df
The object names are descriptive: pson_expr_df holds the gene expression data and cell_speeds_df holds the cell speed data.
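To double-check what load() brought in, you can capture the names it returns and inspect each object. The sketch below is only an illustration: it uses a temporary file and two made-up data frames (toy_expr_df and toy_speeds_df are hypothetical stand-ins), not the real PSON.RData.

```r
# Toy stand-ins for the real objects (hypothetical values)
toy_expr_df <- data.frame(symbol = c("GENE1", "GENE2"),
                          mRNA_R1 = c(1.5, 0),
                          mRNA_R2 = c(2.5, 3))
toy_speeds_df <- data.frame(sample = c("mRNA_R1", "mRNA_R2"),
                            average_value = c(10, 20))

# Save both objects to a temporary .RData file, then remove them
tmp_file <- tempfile(fileext = ".RData")
save(toy_expr_df, toy_speeds_df, file = tmp_file)
rm(toy_expr_df, toy_speeds_df)

# load() invisibly returns the names of the objects it restored
loaded_names <- load(tmp_file)
print(loaded_names)

# Check the size and structure of each restored object
dim(toy_expr_df)
str(toy_speeds_df)
```

The same dim() and str() calls should work on pson_expr_df and cell_speeds_df once the real file is loaded.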
Create a matrix containing just the mRNA values of the genes.
# Remove the first column because it contains gene names
# by replacing N
pson_expr_mat <- as.matrix(pson_expr_df[, -1])
# Make the gene names into row names
# by replacing FEATURE
rownames(pson_expr_mat) <- pson_expr_df$symbol
# Use indexing to take a look
head(pson_expr_mat)
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18 mRNA_R16 mRNA_R15
## TSPAN6 33.56 45.10 39.42 35.61 32.10 39.49 29.90
## TNMD 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## DPM1 169.46 129.88 132.06 113.73 123.37 136.29 137.98
## SCYL3 1.85 1.85 1.77 1.93 2.30 1.54 2.00
## C1orf112 5.73 11.85 10.16 9.28 10.01 8.93 7.47
## FGR 0.01 0.00 0.00 0.02 0.01 0.01 0.00
## mRNA_R38 mRNA_R42 mRNA_R41 mRNA_R40 mRNA_R39 mRNA_R37 mRNA_R36
## TSPAN6 20.80 19.63 24.60 12.37 18.25 21.11 17.54
## TNMD 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## DPM1 185.06 135.75 173.80 188.45 173.70 179.63 143.49
## SCYL3 1.75 1.41 1.76 0.84 1.54 2.35 1.83
## C1orf112 7.92 5.50 8.55 4.33 5.73 5.16 7.34
## FGR 0.05 0.06 0.17 0.04 0.04 0.08 0.06
## mRNA_R24 mRNA_R28 mRNA_R27 mRNA_R26 mRNA_R25 mRNA_R23 mRNA_R22
## TSPAN6 4.95 4.01 3.61 3.61 3.45 4.35 2.98
## TNMD 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## DPM1 182.84 179.46 177.32 161.67 209.87 171.66 155.91
## SCYL3 3.49 3.81 4.87 5.07 3.90 2.99 4.27
## C1orf112 6.03 6.40 7.31 11.12 5.64 6.01 6.16
## FGR 0.44 0.74 0.69 0.00 0.55 0.02 0.41
## mRNA_R45 mRNA_R49 mRNA_R48 mRNA_R47 mRNA_R46 mRNA_R44 mRNA_R43
## TSPAN6 10.35 9.95 6.47 9.83 5.07 11.59 2.96
## TNMD 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## DPM1 124.89 152.52 97.52 148.47 61.81 132.04 14.48
## SCYL3 5.14 4.29 2.12 5.34 1.28 8.40 0.45
## C1orf112 14.63 11.95 9.32 10.10 9.42 9.22 6.05
## FGR 0.01 0.01 0.09 0.02 0.00 0.03 0.00
## mRNA_R10 mRNA_R14 mRNA_R13 mRNA_R12 mRNA_R11 mRNA_R9 mRNA_R8 mRNA_R3
## TSPAN6 6.22 7.73 5.59 7.72 8.73 7.69 7.28 5.32
## TNMD 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## DPM1 98.06 81.59 100.83 86.01 87.73 75.97 96.31 75.27
## SCYL3 4.46 5.01 5.81 6.16 2.24 5.24 3.83 6.71
## C1orf112 8.48 14.43 10.52 7.09 1.85 7.89 3.84 12.05
## FGR 0.07 0.04 0.11 0.02 0.04 0.09 0.07 0.00
## mRNA_R7 mRNA_R6 mRNA_R5 mRNA_R4 mRNA_R2 mRNA_R1 mRNA_R52 mRNA_R56
## TSPAN6 5.00 4.42 3.22 4.13 4.48 4.64 17.47 17.69
## TNMD 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## DPM1 72.06 68.96 75.23 80.33 87.15 72.45 122.39 123.47
## SCYL3 7.07 5.51 10.49 7.89 8.51 8.67 9.73 8.77
## C1orf112 10.21 8.96 7.03 10.64 10.74 10.77 9.19 10.26
## FGR 0.00 0.00 0.00 0.00 0.00 0.00 0.02 0.00
## mRNA_R55 mRNA_R54 mRNA_R53 mRNA_R51 mRNA_R50 mRNA_R31 mRNA_R35
## TSPAN6 17.81 11.23 13.34 14.64 13.59 19.42 19.69
## TNMD 0.00 0.00 0.00 0.00 0.00 0.00 0.04
## DPM1 121.18 111.65 127.87 116.58 124.17 118.91 124.45
## SCYL3 8.84 12.03 10.22 11.08 11.11 2.92 3.78
## C1orf112 8.29 9.24 11.03 10.10 9.93 4.11 4.16
## FGR 0.02 0.00 0.09 0.04 0.10 0.28 0.30
## mRNA_R34 mRNA_R33 mRNA_R32 mRNA_R30 mRNA_R29 mRNA_R59 mRNA_R63
## TSPAN6 18.56 17.74 16.08 18.37 17.42 16.09 15.69
## TNMD 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## DPM1 124.31 110.52 134.33 107.29 134.78 101.79 105.81
## SCYL3 3.34 2.53 3.27 3.66 2.96 4.33 5.21
## C1orf112 6.52 4.97 4.38 3.99 2.83 16.98 16.10
## FGR 0.32 0.20 0.57 0.24 0.72 0.03 0.05
## mRNA_R62 mRNA_R61 mRNA_R60 mRNA_R58 mRNA_R57
## TSPAN6 16.39 11.46 9.38 14.81 9.84
## TNMD 0.00 0.00 0.00 0.00 0.00
## DPM1 110.63 91.56 85.66 100.57 70.69
## SCYL3 5.92 3.56 3.49 4.09 4.40
## C1orf112 11.90 17.67 13.37 19.29 12.16
## FGR 0.09 0.04 0.03 0.00 0.03
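Why remove the gene-name column before calling as.matrix()? A small sketch may help. It uses three genes and two samples copied from the output above; toy_df and toy_mat are illustrative names, not objects from the activity.

```r
# Toy expression data frame; gene names and values copied from the output above
toy_df <- data.frame(symbol = c("TSPAN6", "TNMD", "DPM1"),
                     mRNA_R17 = c(33.56, 0.00, 169.46),
                     mRNA_R21 = c(45.10, 0.00, 129.88))

# Drop the gene-name column so the matrix is purely numeric
toy_mat <- as.matrix(toy_df[, -1])
rownames(toy_mat) <- toy_df$symbol

# A numeric matrix can be indexed by row and column names
toy_mat["DPM1", "mRNA_R21"]
is.numeric(toy_mat)

# Keeping the symbol column would coerce every value to character
is.character(as.matrix(toy_df))
```

A matrix holds a single data type, so mixing gene names with numbers would turn the numbers into text and block any arithmetic on them.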
As we did with the TCGA expression data, log-transform the PS-ON expression data to compress the range of the values.
# Replace FUNCTION to do log transformation
pson_log_mat <- log2(1 + pson_expr_mat)
# Look at the first few rows (head() prints all columns)
head(pson_log_mat)
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18 mRNA_R16 mRNA_R15
## TSPAN6 5.11103131 5.526695 5.336997 5.19416587 5.04875931 5.33949374 4.949535
## TNMD 0.00000000 0.000000 0.000000 0.00000000 0.00000000 0.00000000 0.000000
## DPM1 7.41328943 7.032101 7.055933 6.84209887 6.95849472 7.10108274 7.118733
## SCYL3 1.51096192 1.510962 1.469886 1.55090066 1.72246602 1.34482850 1.584963
## C1orf112 2.75060650 3.683696 3.480265 3.36176836 3.46074256 3.31179372 3.082362
## FGR 0.01435529 0.000000 0.000000 0.02856915 0.01435529 0.01435529 0.000000
## mRNA_R38 mRNA_R42 mRNA_R41 mRNA_R40 mRNA_R39 mRNA_R37
## TSPAN6 4.44625623 4.36667192 4.6780719 3.74092756 4.26678654 4.4666271
## TNMD 0.00000000 0.00000000 0.0000000 0.00000000 0.00000000 0.0000000
## DPM1 7.53962412 7.09539702 7.4495614 7.56567333 7.44873580 7.4968937
## SCYL3 1.45943162 1.26903315 1.4646683 0.87970577 1.34482850 1.7441611
## C1orf112 3.15704371 2.70043972 3.2555007 2.41413553 2.75060650 2.6229304
## FGR 0.07038933 0.08406426 0.2265085 0.05658353 0.05658353 0.1110313
## mRNA_R36 mRNA_R24 mRNA_R28 mRNA_R27 mRNA_R26 mRNA_R25 mRNA_R23
## TSPAN6 4.21256934 2.5728897 2.3248106 2.2047668 2.204767 2.1538053 2.41953889
## TNMD 0.00000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.00000000
## DPM1 7.17482584 7.5223069 7.4955353 7.4783247 7.345804 7.7202101 7.43179008
## SCYL3 1.50080205 2.1667154 2.2660369 2.5533605 2.601697 2.2927817 1.99638875
## C1orf112 3.06004738 2.8135247 2.8875253 3.0548485 3.599318 2.7311832 2.80941444
## FGR 0.08406426 0.5260688 0.7990873 0.7570232 0.000000 0.6322682 0.02856915
## mRNA_R22 mRNA_R45 mRNA_R49 mRNA_R48 mRNA_R47 mRNA_R46
## TSPAN6 1.9927684 3.50462039 3.45285896 2.9011082 3.43696134 2.601697
## TNMD 0.0000000 0.00000000 0.00000000 0.0000000 0.00000000 0.000000
## DPM1 7.2937935 6.97601988 7.26228281 6.6223447 7.22371214 5.972922
## SCYL3 2.3978030 2.61823866 2.40326772 1.6415460 2.66448284 1.189034
## C1orf112 2.8399596 3.96624587 3.69488019 3.3673711 3.47248777 3.381283
## FGR 0.4956952 0.01435529 0.01435529 0.1243281 0.02856915 0.000000
## mRNA_R44 mRNA_R43 mRNA_R10 mRNA_R14 mRNA_R13 mRNA_R12
## TSPAN6 3.65420638 1.9855004 2.8519988 3.12598165 2.7202785 3.12432814
## TNMD 0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000
## DPM1 7.05571626 3.9523336 6.6302307 6.36789521 6.6700188 6.44310931
## SCYL3 3.23266076 0.5360529 2.4489010 2.58736499 2.7676548 2.83995959
## C1orf112 3.35332329 2.8176233 3.2448871 3.94766616 3.5260688 3.01613970
## FGR 0.04264434 0.0000000 0.0976108 0.05658353 0.1505597 0.02856915
## mRNA_R11 mRNA_R9 mRNA_R8 mRNA_R3 mRNA_R7 mRNA_R6 mRNA_R5
## TSPAN6 3.28243981 3.1193562 3.0496308 2.659925 2.584963 2.438293 2.077243
## TNMD 0.00000000 0.0000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## DPM1 6.47135006 6.2662243 6.6045162 6.253044 6.191010 6.128458 6.252287
## SCYL3 1.69599381 2.6415460 2.2720232 2.946731 3.012569 2.702658 3.522307
## C1orf112 1.51096192 3.1521834 2.2750070 3.705978 3.486714 3.316146 3.005400
## FGR 0.05658353 0.1243281 0.0976108 0.000000 0.000000 0.000000 0.000000
## mRNA_R4 mRNA_R2 mRNA_R1 mRNA_R52 mRNA_R56 mRNA_R55 mRNA_R54
## TSPAN6 2.358959 2.454176 2.495695 4.20711196 4.224195 4.23342794 3.612352
## TNMD 0.000000 0.000000 0.000000 0.00000000 0.000000 0.00000000 0.000000
## DPM1 6.345716 6.461889 6.198691 6.94708167 6.959654 6.93286434 6.815704
## SCYL3 3.152183 3.249445 3.273516 3.42357817 3.288359 3.29865832 3.703765
## C1orf112 3.541019 3.553361 3.557042 3.34908215 3.493135 3.21567860 3.356144
## FGR 0.000000 0.000000 0.000000 0.02856915 0.000000 0.02856915 0.000000
## mRNA_R53 mRNA_R51 mRNA_R50 mRNA_R31 mRNA_R35 mRNA_R34
## TSPAN6 3.8419731 3.96716861 3.8669080 4.3519110 4.37086174 4.2898345
## TNMD 0.0000000 0.00000000 0.0000000 0.0000000 0.05658353 0.0000000
## DPM1 7.0097726 6.87749887 6.9677450 6.9058082 6.97096866 6.9693577
## SCYL3 3.4880008 3.59454855 3.5981270 1.9708537 2.25701062 2.1176950
## C1orf112 3.5885647 3.47248777 3.4502215 2.3533233 2.36737107 2.9107327
## FGR 0.1243281 0.05658353 0.1375035 0.3561438 0.37851162 0.4005379
## mRNA_R33 mRNA_R32 mRNA_R30 mRNA_R29 mRNA_R59 mRNA_R63
## TSPAN6 4.2280490 4.0942361 4.2757520 4.2032012 4.09508049 4.06091205
## TNMD 0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000
## DPM1 6.8011587 7.0803379 6.7587562 7.0851272 6.68355611 6.73890291
## SCYL3 1.8196682 2.0942361 2.2203300 1.9855004 2.41413553 2.63459327
## C1orf112 2.5777309 2.4276062 2.3190398 1.9373444 4.16832112 4.09592442
## FGR 0.2630344 0.6507646 0.3103401 0.7824086 0.04264434 0.07038933
## mRNA_R62 mRNA_R61 mRNA_R60 mRNA_R58 mRNA_R57
## TSPAN6 4.1201860 3.63923216 3.37573454 3.982765 3.43829285
## TNMD 0.0000000 0.00000000 0.00000000 0.000000 0.00000000
## DPM1 6.8025810 6.53231696 6.43729433 6.666331 6.16369999
## SCYL3 2.7907720 2.18903382 2.16671544 2.347666 2.43295941
## C1orf112 3.6892992 4.22265002 3.84498816 4.342697 3.71808758
## FGR 0.1243281 0.05658353 0.04264434 0.000000 0.04264434
The log-transformed data are less variable. We will examine the effect on data distributions with the TCGA data.
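The effect of the transformation can be checked by hand on a few values. This sketch uses four raw values for sample mRNA_R17 copied from the first table; the vector names raw and logged are just for illustration.

```r
# Raw expression values for mRNA_R17, copied from the first table above
raw <- c(TSPAN6 = 33.56, TNMD = 0.00, DPM1 = 169.46, FGR = 0.01)

# Adding 1 before taking the log keeps zeros at zero instead of -Inf
logged <- log2(1 + raw)
round(logged, 4)

# The spread between the largest and smallest value shrinks
diff(range(raw))
diff(range(logged))

# Without the added 1, a zero would become -Inf
log2(0)
```

The added 1 is often called a pseudocount; it matters here because genes such as TNMD have an expression of exactly zero in most samples.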
What cell lines (cellLine) were examined? And what cancer types do the cell lines represent (diagnosis)?
# Columns 2 and 3 hold the cell lines and the cancers they model
unique(cell_speeds_df[,c(2,3)])
## cellLine diagnosis
## 1 SW620 Colon Cancer
## 8 SW480 Colon Cancer
## 14 RWPE-1 Not Applicable
## 21 A375 Skin Cancer
## 28 T98G Brain Cancer
## 35 22Rv1 Prostate Cancer
## 42 T-47D Breast Cancer
## 49 U-87 Brain Cancer
## 56 MDA-MB-231 Breast Cancer
We are interested in the two colon cancer cell lines, SW620 and SW480.
SW620 cells are considered more aggressive than SW480 cells. Both cell lines were derived from the same patient, but at different times and locations: SW480 from the primary tumor and SW620 from a metastasis.
This makes them a valuable pair for studying colon cancer progression and metastasis in vitro.
Let’s see if there are major differences in the expression of their genes.
# You can use the table function to do something similar
unique(cell_speeds_df$experimentalCondition)
## [1] "Glass" "HyaluronicAcid Collagen"
## [3] "HyaluronicAcid Fibronectin" "30 kPa polyacrylamide Collagen"
## [5] "30 kPa polyacrylamide Fibronectin" "500 Pa polyacrylamide Collagen"
## [7] "500 Pa polyacrylamide Fibronectin"
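The comment in the code above mentions table() as an alternative to unique(). Here is a minimal sketch of the difference, using a made-up five-row data frame (toy_speeds is hypothetical and much smaller than cell_speeds_df).

```r
# Toy version of the cell speed table (made-up rows, only two conditions)
toy_speeds <- data.frame(
  cellLine = c("SW620", "SW620", "SW480", "SW480", "A375"),
  diagnosis = c("Colon Cancer", "Colon Cancer", "Colon Cancer",
                "Colon Cancer", "Skin Cancer"),
  experimentalCondition = c("Glass", "HyaluronicAcid Collagen",
                            "Glass", "HyaluronicAcid Collagen", "Glass")
)

# unique() lists the distinct values
unique(toy_speeds$experimentalCondition)

# table() lists them too and also counts the rows for each value
condition_counts <- table(toy_speeds$experimentalCondition)
condition_counts

# A two-way table shows which cell lines have which diagnosis
table(toy_speeds$cellLine, toy_speeds$diagnosis)
```

The counts are a quick way to spot a condition that is missing for some cell lines.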
Previously, we looked at the speed data for breast cancer cell lines on the HyaluronicAcid Collagen substrate. Here, we will use the colon cancer data for the experimental condition HyaluronicAcid Fibronectin.
# Make a smaller data frame for the `HyaluronicAcid Fibronectin` condition
# by replacing CONDITION
fibro_colon_df <- subset(cell_speeds_df,
experimentalCondition == "HyaluronicAcid Fibronectin")
fibro_colon_df
## sample cellLine diagnosis experimentalCondition
## 3 mRNA_R20 SW620 Colon Cancer HyaluronicAcid Fibronectin
## 9 mRNA_R41 SW480 Colon Cancer HyaluronicAcid Fibronectin
## 16 mRNA_R27 RWPE-1 Not Applicable HyaluronicAcid Fibronectin
## 23 mRNA_R48 A375 Skin Cancer HyaluronicAcid Fibronectin
## 30 mRNA_R13 T98G Brain Cancer HyaluronicAcid Fibronectin
## 37 mRNA_R6 22Rv1 Prostate Cancer HyaluronicAcid Fibronectin
## 44 mRNA_R55 T-47D Breast Cancer HyaluronicAcid Fibronectin
## 51 mRNA_R34 U-87 Brain Cancer HyaluronicAcid Fibronectin
## 58 mRNA_R62 MDA-MB-231 Breast Cancer HyaluronicAcid Fibronectin
## summary_metric average_value total_number_of_cells_tracked
## 3 speed_um_hr 53.006023 36
## 9 speed_um_hr 5.679561 65
## 16 speed_um_hr 16.205186 105
## 23 speed_um_hr 40.729492 51
## 30 speed_um_hr 28.483277 86
## 37 speed_um_hr 31.806179 56
## 44 speed_um_hr 11.288006 27
## 51 speed_um_hr 45.791985 47
## 58 speed_um_hr 55.548778 26
# Make a smaller data frame that includes only the colon cancer cell lines
# by replacing CANCER
fibro_colon_df <- subset(fibro_colon_df,
diagnosis == "Colon Cancer")
fibro_colon_df[,c(1,2,6)]
## sample cellLine average_value
## 3 mRNA_R20 SW620 53.006023
## 9 mRNA_R41 SW480 5.679561
The SW620 cell line (from the metastasis) moves almost ten times as fast as the SW480 cell line (from the primary tumor) on the “HyaluronicAcid Fibronectin” substrate: 53 µm/hr versus 5.7 µm/hr.
We’ll refer to SW620 as the fast cell line and SW480 as the slow cell line.
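The two subset() steps and the speed comparison can also be written compactly. The following is a sketch on three rows copied from the fibronectin output above; toy_df, colon_df, and speed_ratio are illustrative names only.

```r
# Toy rows copied from the fibronectin output above
toy_df <- data.frame(
  sample = c("mRNA_R20", "mRNA_R41", "mRNA_R48"),
  cellLine = c("SW620", "SW480", "A375"),
  diagnosis = c("Colon Cancer", "Colon Cancer", "Skin Cancer"),
  experimentalCondition = "HyaluronicAcid Fibronectin",
  average_value = c(53.006023, 5.679561, 40.729492)
)

# Both filters in one call, joined with &
colon_df <- subset(toy_df,
                   experimentalCondition == "HyaluronicAcid Fibronectin" &
                     diagnosis == "Colon Cancer")

# Named vector of speeds, then the fast-to-slow ratio
speeds <- setNames(colon_df$average_value, colon_df$cellLine)
speed_ratio <- speeds[["SW620"]] / speeds[["SW480"]]
round(speed_ratio, 1)
```

Filtering in two steps, as in the activity, gives the same rows; the single call simply avoids the intermediate data frame.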
Let’s extract the two colon cancer cell experiments (mRNA_R20 and mRNA_R41) from the expression matrix.
The sample names (experiments) are the first column in fibro_colon_df and the column names in pson_log_mat.
# Match the experiments in both objects
# by replacing FUNCTION
exp_colon <- match(fibro_colon_df$sample, colnames(pson_log_mat))
print(exp_colon)
## [1] 3 10
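If match() is new to you, this small sketch shows what it returns. The column names are the first four from the expression matrix; mRNA_R99 is a made-up name included to show what happens when there is no match.

```r
# First four column names from the matrix, and three samples to look up
all_columns <- c("mRNA_R17", "mRNA_R21", "mRNA_R20", "mRNA_R19")
wanted <- c("mRNA_R20", "mRNA_R17", "mRNA_R99")  # mRNA_R99 is hypothetical

# match() returns, for each wanted name, its position in all_columns
positions <- match(wanted, all_columns)
positions

# A name with no match gives NA, so check before indexing
anyNA(positions)
all_columns[positions[!is.na(positions)]]
```

The positions come back in the order of the first argument, which is why the fast cell line ends up in the first column of the subset matrix.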
Extract columns 3 and 10 from the expression matrix to create a matrix of genes with mRNA levels for the fast and slow cell lines on the HyaluronicAcid Fibronectin substrate.
# Subset the expression data for the two colon cancer cell lines
# by replacing COLUMNS
fibro_colon_log_mat <- pson_log_mat[, exp_colon]
# Name the columns according to the relative speed of the cells
colnames(fibro_colon_log_mat) <- c("fast","slow")
Take a look at your new expression matrix:
round(fibro_colon_log_mat[35:45,],1)
## fast slow
## ICA1 3.9 5.2
## DBNDD1 1.8 3.1
## ALS2 2.4 3.4
## CASP10 0.2 0.7
## CFLAR 3.1 3.9
## TFPI 0.3 0.2
## NDUFAF7 3.1 4.2
## RBM5 4.4 5.0
## MTMR7 0.5 0.6
## SLC7A2 0.2 2.0
## ARF5 7.4 7.7
Genes that have very different mRNA levels in the “fast” versus “slow” cell lines may be informative about why the cell lines behave differently.
By subtracting the expression in the “slow” cell line from the expression in the “fast” cell line, we create a differential gene expression profile, or DGE profile.
# Subtract the second column ("slow") from the first column ("fast")
# Replace N1 and N2
dge_colon <- fibro_colon_log_mat[,1] - fibro_colon_log_mat[,2]
# Add dge as a column to a new matrix
# by replacing FUNCTION
DGE_mat_colon <- cbind(fibro_colon_log_mat,dge_colon)
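Because the values are on a log2 scale, a difference of two values is a log2 ratio, often called a log2 fold change. The sketch below works through three genes using the rounded values from the table above; toy_log_mat, dge, and fold_change are illustrative names.

```r
# Rounded log2 values for three genes, copied from the table above
toy_log_mat <- cbind(fast = c(ICA1 = 3.9, SLC7A2 = 0.2, ARF5 = 7.4),
                     slow = c(ICA1 = 5.2, SLC7A2 = 2.0, ARF5 = 7.7))

# Difference of logs = log of the ratio (log2 fold change)
dge <- toy_log_mat[, "fast"] - toy_log_mat[, "slow"]

# Undo the log to get the approximate fast/slow ratio of (1 + expression)
fold_change <- 2^dge

round(cbind(toy_log_mat, dge, fold_change), 2)
```

A dge of -1 means the gene is at roughly half the level in the fast cell line, and a dge of -4 means roughly one sixteenth.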
Plot a histogram to see the distribution of differential expression values.
# Create a plot as you did for the breast cancer cell lines
# Plot only the dge values, not the whole matrix
hist(dge_colon, main="Distribution of Differential Expression Values", xlab="Differential expression (fast - slow, log2)", ylab="Frequency")
The genes with large differential expression may provide us with clues as to why the cell lines behave so differently. You should find that there are fewer genes that are highly overexpressed for the colon cancer cell lines than for the breast cancer cell lines.
Why do you think this is so?
# How many genes have dge > 4
length(dge_colon[dge_colon>(4)])
## [1] 11
# How many genes have dge < -4
length(dge_colon[dge_colon<(-4)])
## [1] 49
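Another common way to count values that pass a threshold is to sum a logical vector. The sketch below uses made-up values (toy_dge is hypothetical), including one NA to show where the two approaches differ.

```r
# Toy differential expression values (made up), including one missing value
toy_dge <- c(5.1, -4.5, 0.3, -6.2, NA, 4.0, -9.3)

# A logical vector sums to the number of TRUE entries
n_up <- sum(toy_dge > 4, na.rm = TRUE)
n_down <- sum(toy_dge < -4, na.rm = TRUE)
c(up = n_up, down = n_down)

# length() on a logical subset also counts the NA entry
length(toy_dge[toy_dge > 4])
```

If the data contain no missing values, both approaches give the same count.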
More genes are much more highly expressed in the SW480 (slow) cell line than in the SW620 (fast) cell line: 49 genes have dge < -4, but only 11 have dge > 4. Let’s fish out this set.
# Order the differential gene expression values from LOW to HIGH
# This is different from the sorting for the breast cancer cell lines
# Note that we put decreasing = FALSE
# This will put the "slow" genes at the top of our matrix
# Replace FUNCTION
order_dge_colon <- order(dge_colon, decreasing = FALSE)
# Replace ORDERED_ROWS
DGE_mat_ordered_colon <- DGE_mat_colon[order_dge_colon,]
head(DGE_mat_ordered_colon,15)
## fast slow dge_colon
## KRT5 0.61353165 9.907507 -9.293975
## KRT13 0.29865832 7.581728 -7.283070
## TACSTD2 0.12432814 6.795585 -6.671257
## KRT23 1.85598970 8.523601 -6.667611
## MIA 0.36737107 6.889108 -6.521737
## MAL2 0.08406426 6.190022 -6.105958
## IGFBP3 1.46466827 7.517748 -6.053080
## LCN2 5.00898878 10.927000 -5.918011
## PHLDB2 0.16349873 5.978424 -5.814926
## HPGD 0.13750352 5.890447 -5.752943
## ANKRD1 0.04264434 5.671010 -5.628366
## WFDC2 1.81966818 7.306244 -5.486576
## CD74 0.11103131 5.548128 -5.437097
## TRIP6 0.83187724 6.140779 -5.308901
## WNT5A 0.00000000 5.130107 -5.130107
You should find KRT5, KRT13, and TACSTD2 as the top genes with expression greater in the slow cell line.
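The ordering step has two parts: order() returns row positions, and indexing by those positions rearranges the matrix. Here is a sketch on four genes with rounded values copied from the tables above; toy_mat, row_order, and toy_sorted are illustrative names.

```r
# Toy matrix with rounded values copied from the tables above
toy_mat <- cbind(fast = c(ARF5 = 7.4, KRT5 = 0.6, TFPI = 0.3, KRT13 = 0.3),
                 slow = c(ARF5 = 7.7, KRT5 = 9.9, TFPI = 0.2, KRT13 = 7.6))
toy_mat <- cbind(toy_mat, dge = toy_mat[, "fast"] - toy_mat[, "slow"])

# order() returns row positions, smallest dge first
row_order <- order(toy_mat[, "dge"], decreasing = FALSE)
row_order

# Indexing rows by those positions sorts the whole matrix
toy_sorted <- toy_mat[row_order, ]
toy_sorted

# The genes most biased toward the slow cell line are now the first row names
rownames(toy_sorted)[1:2]
```

Setting decreasing = TRUE instead would put the genes that are higher in the fast cell line at the top.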
# In DGE_mat_ordered_colon, the top genes are
# more highly expressed in the slow versus fast cell line
N <- 50
colon_genes <- rownames(DGE_mat_ordered_colon)[1:N]
write.table(colon_genes,"colon_genes.csv",
row.names=FALSE, col.names=FALSE, quote=FALSE)
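Before copying the gene names, it can be worth confirming what was written. This sketch writes three gene names to a temporary folder and reads them back; the file name toy_genes.csv and the use of tempdir() are assumptions made for the example, not part of the activity.

```r
# Three gene names from the ordered table above
genes <- c("KRT5", "KRT13", "TACSTD2")

# Write one gene per line, with no header, row numbers, or quotes
out_file <- file.path(tempdir(), "toy_genes.csv")
write.table(genes, out_file,
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# Read the file back to confirm its contents
readLines(out_file)

# list.files() shows whether the file landed where expected
"toy_genes.csv" %in% list.files(tempdir())
```

In the activity itself, readLines("colon_genes.csv") should print the 50 gene names written by the code above.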
Look in your working directory to find the file colon_genes.csv. Open it (click the file name and select the View File option), then highlight and copy the gene names. Paste them into the “Try a gene set query” window at Gene Set AI.
What functional theme is detected?
Name: Extracellular matrix remodeling and cell adhesion regulation
Create a list of some of the groups of proteins you find: