Challenge: Colon Cancer Cell Lines

About this activity
Preliminaries
Loading the data
Prepare the gene expression data
- Log transformation for gene expression data
Motility data
- Colon cancer cell lines
Expression and motility data
- Differential Gene Expression
Annotation of genes

About this activity

We previously looked at breast cancer cell lines from the Physical Sciences in Oncology (PS-ON) Cell Line Characterization Study. In this activity, we will consider data on colon cancer cell lines.

Cancer cell lines are cancer cells that keep dividing and growing over time, under certain conditions in a laboratory. Cancer cell lines are used in research to study the biology of cancer and to test cancer treatments.

The PS-ON Study includes imaging- and microscopy-based measurements of physical properties of the cells, such as morphology (shape) and motility (movement). We will examine:

the expression levels of genes, and
how fast the cells move.

Preliminaries

Save this file with a different name and add your name as author.

Clean your Global Environment.

Loading the data

PSON.RData contains gene expression and cell speed data from the Physical Sciences in Oncology project.

# Replace FUNCTION to load in the data
# The objects will also appear in our "Environment" tab.

load("/home/data/PSON.RData",verbose=TRUE)

## Loading objects:
##   pson_expr_df
##   cell_speeds_df

Objects were named to be descriptive.

“pson” stands for “Physical Sciences in Oncology”
“expr” stands for “gene expression data”
“df” stands for “data frame”
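To see what `load()` does with object names, here is a minimal sketch using toy objects and a temporary file. The names `toy_expr_df`, `toy_speeds_df`, and `toy_file`, and all values in them, are made up for illustration; they are not part of PSON.RData.

```r
# Toy stand-ins for the real objects (hypothetical values, not PS-ON data)
toy_expr_df <- data.frame(symbol  = c("GENE_A", "GENE_B"),
                          mRNA_X1 = c(1.5, 0),
                          mRNA_X2 = c(2.5, 0.1))
toy_speeds_df <- data.frame(sample        = c("mRNA_X1", "mRNA_X2"),
                            average_value = c(10, 20))

# Save both objects to a temporary .RData file, then remove them
toy_file <- tempfile(fileext = ".RData")
save(toy_expr_df, toy_speeds_df, file = toy_file)
rm(toy_expr_df, toy_speeds_df)

# load() restores the objects under their saved names and
# invisibly returns those names as a character vector
loaded_names <- load(toy_file)
print(loaded_names)

# Check the size and structure of what was loaded
dim(toy_expr_df)
str(toy_speeds_df)
```

The same `dim()` and `str()` calls can be applied to `pson_expr_df` and `cell_speeds_df` after loading the real file.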

Prepare the gene expression data

Create a matrix containing just the mRNA values of the genes.

# Remove the first column because it contains gene names
# by replacing N
pson_expr_mat <- as.matrix(pson_expr_df[, -1])                


# Make the gene names into row names
# by replacing FEATURE
rownames(pson_expr_mat) <- pson_expr_df$symbol  

# Use head() to look at the first six rows
head(pson_expr_mat)

##          mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18 mRNA_R16 mRNA_R15
## TSPAN6      33.56    45.10    39.42    35.61    32.10    39.49    29.90
## TNMD         0.00     0.00     0.00     0.00     0.00     0.00     0.00
## DPM1       169.46   129.88   132.06   113.73   123.37   136.29   137.98
## SCYL3        1.85     1.85     1.77     1.93     2.30     1.54     2.00
## C1orf112     5.73    11.85    10.16     9.28    10.01     8.93     7.47
## FGR          0.01     0.00     0.00     0.02     0.01     0.01     0.00
##          mRNA_R38 mRNA_R42 mRNA_R41 mRNA_R40 mRNA_R39 mRNA_R37 mRNA_R36
## TSPAN6      20.80    19.63    24.60    12.37    18.25    21.11    17.54
## TNMD         0.00     0.00     0.00     0.00     0.00     0.00     0.00
## DPM1       185.06   135.75   173.80   188.45   173.70   179.63   143.49
## SCYL3        1.75     1.41     1.76     0.84     1.54     2.35     1.83
## C1orf112     7.92     5.50     8.55     4.33     5.73     5.16     7.34
## FGR          0.05     0.06     0.17     0.04     0.04     0.08     0.06
##          mRNA_R24 mRNA_R28 mRNA_R27 mRNA_R26 mRNA_R25 mRNA_R23 mRNA_R22
## TSPAN6       4.95     4.01     3.61     3.61     3.45     4.35     2.98
## TNMD         0.00     0.00     0.00     0.00     0.00     0.00     0.00
## DPM1       182.84   179.46   177.32   161.67   209.87   171.66   155.91
## SCYL3        3.49     3.81     4.87     5.07     3.90     2.99     4.27
## C1orf112     6.03     6.40     7.31    11.12     5.64     6.01     6.16
## FGR          0.44     0.74     0.69     0.00     0.55     0.02     0.41
##          mRNA_R45 mRNA_R49 mRNA_R48 mRNA_R47 mRNA_R46 mRNA_R44 mRNA_R43
## TSPAN6      10.35     9.95     6.47     9.83     5.07    11.59     2.96
## TNMD         0.00     0.00     0.00     0.00     0.00     0.00     0.00
## DPM1       124.89   152.52    97.52   148.47    61.81   132.04    14.48
## SCYL3        5.14     4.29     2.12     5.34     1.28     8.40     0.45
## C1orf112    14.63    11.95     9.32    10.10     9.42     9.22     6.05
## FGR          0.01     0.01     0.09     0.02     0.00     0.03     0.00
##          mRNA_R10 mRNA_R14 mRNA_R13 mRNA_R12 mRNA_R11 mRNA_R9 mRNA_R8 mRNA_R3
## TSPAN6       6.22     7.73     5.59     7.72     8.73    7.69    7.28    5.32
## TNMD         0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00
## DPM1        98.06    81.59   100.83    86.01    87.73   75.97   96.31   75.27
## SCYL3        4.46     5.01     5.81     6.16     2.24    5.24    3.83    6.71
## C1orf112     8.48    14.43    10.52     7.09     1.85    7.89    3.84   12.05
## FGR          0.07     0.04     0.11     0.02     0.04    0.09    0.07    0.00
##          mRNA_R7 mRNA_R6 mRNA_R5 mRNA_R4 mRNA_R2 mRNA_R1 mRNA_R52 mRNA_R56
## TSPAN6      5.00    4.42    3.22    4.13    4.48    4.64    17.47    17.69
## TNMD        0.00    0.00    0.00    0.00    0.00    0.00     0.00     0.00
## DPM1       72.06   68.96   75.23   80.33   87.15   72.45   122.39   123.47
## SCYL3       7.07    5.51   10.49    7.89    8.51    8.67     9.73     8.77
## C1orf112   10.21    8.96    7.03   10.64   10.74   10.77     9.19    10.26
## FGR         0.00    0.00    0.00    0.00    0.00    0.00     0.02     0.00
##          mRNA_R55 mRNA_R54 mRNA_R53 mRNA_R51 mRNA_R50 mRNA_R31 mRNA_R35
## TSPAN6      17.81    11.23    13.34    14.64    13.59    19.42    19.69
## TNMD         0.00     0.00     0.00     0.00     0.00     0.00     0.04
## DPM1       121.18   111.65   127.87   116.58   124.17   118.91   124.45
## SCYL3        8.84    12.03    10.22    11.08    11.11     2.92     3.78
## C1orf112     8.29     9.24    11.03    10.10     9.93     4.11     4.16
## FGR          0.02     0.00     0.09     0.04     0.10     0.28     0.30
##          mRNA_R34 mRNA_R33 mRNA_R32 mRNA_R30 mRNA_R29 mRNA_R59 mRNA_R63
## TSPAN6      18.56    17.74    16.08    18.37    17.42    16.09    15.69
## TNMD         0.00     0.00     0.00     0.00     0.00     0.00     0.00
## DPM1       124.31   110.52   134.33   107.29   134.78   101.79   105.81
## SCYL3        3.34     2.53     3.27     3.66     2.96     4.33     5.21
## C1orf112     6.52     4.97     4.38     3.99     2.83    16.98    16.10
## FGR          0.32     0.20     0.57     0.24     0.72     0.03     0.05
##          mRNA_R62 mRNA_R61 mRNA_R60 mRNA_R58 mRNA_R57
## TSPAN6      16.39    11.46     9.38    14.81     9.84
## TNMD         0.00     0.00     0.00     0.00     0.00
## DPM1       110.63    91.56    85.66   100.57    70.69
## SCYL3        5.92     3.56     3.49     4.09     4.40
## C1orf112    11.90    17.67    13.37    19.29    12.16
## FGR          0.09     0.04     0.03     0.00     0.03
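The sketch below shows why the gene-name column is dropped before calling `as.matrix()`. It uses a three-gene toy data frame (`toy_df`, a hypothetical name) filled with the first two columns of values printed above.

```r
# Toy data frame with a gene-name column and two expression columns
toy_df <- data.frame(symbol   = c("TSPAN6", "TNMD", "DPM1"),
                     mRNA_R17 = c(33.56, 0.00, 169.46),
                     mRNA_R21 = c(45.10, 0.00, 129.88))

# Drop the first column so that only numbers remain, then convert
toy_mat <- as.matrix(toy_df[, -1])

# Use the gene symbols as row names
rownames(toy_mat) <- toy_df$symbol

# The matrix is numeric and can be indexed by gene and sample name
is.numeric(toy_mat)
toy_mat["DPM1", "mRNA_R21"]

# Contrast: keeping the symbol column turns every entry into text
is.character(as.matrix(toy_df))
```

A numeric matrix is needed for the arithmetic in the following steps; a character matrix would make `log2()` fail.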

Log transformation for gene expression data

As we did with the TCGA expression data, log-transform the PS-ON expression data to compress the range of the values.

# Replace FUNCTION to do log transformation 

pson_log_mat <- log2(1 + pson_expr_mat)

# Look at the first few rows
head(pson_log_mat)

##            mRNA_R17 mRNA_R21 mRNA_R20   mRNA_R19   mRNA_R18   mRNA_R16 mRNA_R15
## TSPAN6   5.11103131 5.526695 5.336997 5.19416587 5.04875931 5.33949374 4.949535
## TNMD     0.00000000 0.000000 0.000000 0.00000000 0.00000000 0.00000000 0.000000
## DPM1     7.41328943 7.032101 7.055933 6.84209887 6.95849472 7.10108274 7.118733
## SCYL3    1.51096192 1.510962 1.469886 1.55090066 1.72246602 1.34482850 1.584963
## C1orf112 2.75060650 3.683696 3.480265 3.36176836 3.46074256 3.31179372 3.082362
## FGR      0.01435529 0.000000 0.000000 0.02856915 0.01435529 0.01435529 0.000000
##            mRNA_R38   mRNA_R42  mRNA_R41   mRNA_R40   mRNA_R39  mRNA_R37
## TSPAN6   4.44625623 4.36667192 4.6780719 3.74092756 4.26678654 4.4666271
## TNMD     0.00000000 0.00000000 0.0000000 0.00000000 0.00000000 0.0000000
## DPM1     7.53962412 7.09539702 7.4495614 7.56567333 7.44873580 7.4968937
## SCYL3    1.45943162 1.26903315 1.4646683 0.87970577 1.34482850 1.7441611
## C1orf112 3.15704371 2.70043972 3.2555007 2.41413553 2.75060650 2.6229304
## FGR      0.07038933 0.08406426 0.2265085 0.05658353 0.05658353 0.1110313
##            mRNA_R36  mRNA_R24  mRNA_R28  mRNA_R27 mRNA_R26  mRNA_R25   mRNA_R23
## TSPAN6   4.21256934 2.5728897 2.3248106 2.2047668 2.204767 2.1538053 2.41953889
## TNMD     0.00000000 0.0000000 0.0000000 0.0000000 0.000000 0.0000000 0.00000000
## DPM1     7.17482584 7.5223069 7.4955353 7.4783247 7.345804 7.7202101 7.43179008
## SCYL3    1.50080205 2.1667154 2.2660369 2.5533605 2.601697 2.2927817 1.99638875
## C1orf112 3.06004738 2.8135247 2.8875253 3.0548485 3.599318 2.7311832 2.80941444
## FGR      0.08406426 0.5260688 0.7990873 0.7570232 0.000000 0.6322682 0.02856915
##           mRNA_R22   mRNA_R45   mRNA_R49  mRNA_R48   mRNA_R47 mRNA_R46
## TSPAN6   1.9927684 3.50462039 3.45285896 2.9011082 3.43696134 2.601697
## TNMD     0.0000000 0.00000000 0.00000000 0.0000000 0.00000000 0.000000
## DPM1     7.2937935 6.97601988 7.26228281 6.6223447 7.22371214 5.972922
## SCYL3    2.3978030 2.61823866 2.40326772 1.6415460 2.66448284 1.189034
## C1orf112 2.8399596 3.96624587 3.69488019 3.3673711 3.47248777 3.381283
## FGR      0.4956952 0.01435529 0.01435529 0.1243281 0.02856915 0.000000
##            mRNA_R44  mRNA_R43  mRNA_R10   mRNA_R14  mRNA_R13   mRNA_R12
## TSPAN6   3.65420638 1.9855004 2.8519988 3.12598165 2.7202785 3.12432814
## TNMD     0.00000000 0.0000000 0.0000000 0.00000000 0.0000000 0.00000000
## DPM1     7.05571626 3.9523336 6.6302307 6.36789521 6.6700188 6.44310931
## SCYL3    3.23266076 0.5360529 2.4489010 2.58736499 2.7676548 2.83995959
## C1orf112 3.35332329 2.8176233 3.2448871 3.94766616 3.5260688 3.01613970
## FGR      0.04264434 0.0000000 0.0976108 0.05658353 0.1505597 0.02856915
##            mRNA_R11   mRNA_R9   mRNA_R8  mRNA_R3  mRNA_R7  mRNA_R6  mRNA_R5
## TSPAN6   3.28243981 3.1193562 3.0496308 2.659925 2.584963 2.438293 2.077243
## TNMD     0.00000000 0.0000000 0.0000000 0.000000 0.000000 0.000000 0.000000
## DPM1     6.47135006 6.2662243 6.6045162 6.253044 6.191010 6.128458 6.252287
## SCYL3    1.69599381 2.6415460 2.2720232 2.946731 3.012569 2.702658 3.522307
## C1orf112 1.51096192 3.1521834 2.2750070 3.705978 3.486714 3.316146 3.005400
## FGR      0.05658353 0.1243281 0.0976108 0.000000 0.000000 0.000000 0.000000
##           mRNA_R4  mRNA_R2  mRNA_R1   mRNA_R52 mRNA_R56   mRNA_R55 mRNA_R54
## TSPAN6   2.358959 2.454176 2.495695 4.20711196 4.224195 4.23342794 3.612352
## TNMD     0.000000 0.000000 0.000000 0.00000000 0.000000 0.00000000 0.000000
## DPM1     6.345716 6.461889 6.198691 6.94708167 6.959654 6.93286434 6.815704
## SCYL3    3.152183 3.249445 3.273516 3.42357817 3.288359 3.29865832 3.703765
## C1orf112 3.541019 3.553361 3.557042 3.34908215 3.493135 3.21567860 3.356144
## FGR      0.000000 0.000000 0.000000 0.02856915 0.000000 0.02856915 0.000000
##           mRNA_R53   mRNA_R51  mRNA_R50  mRNA_R31   mRNA_R35  mRNA_R34
## TSPAN6   3.8419731 3.96716861 3.8669080 4.3519110 4.37086174 4.2898345
## TNMD     0.0000000 0.00000000 0.0000000 0.0000000 0.05658353 0.0000000
## DPM1     7.0097726 6.87749887 6.9677450 6.9058082 6.97096866 6.9693577
## SCYL3    3.4880008 3.59454855 3.5981270 1.9708537 2.25701062 2.1176950
## C1orf112 3.5885647 3.47248777 3.4502215 2.3533233 2.36737107 2.9107327
## FGR      0.1243281 0.05658353 0.1375035 0.3561438 0.37851162 0.4005379
##           mRNA_R33  mRNA_R32  mRNA_R30  mRNA_R29   mRNA_R59   mRNA_R63
## TSPAN6   4.2280490 4.0942361 4.2757520 4.2032012 4.09508049 4.06091205
## TNMD     0.0000000 0.0000000 0.0000000 0.0000000 0.00000000 0.00000000
## DPM1     6.8011587 7.0803379 6.7587562 7.0851272 6.68355611 6.73890291
## SCYL3    1.8196682 2.0942361 2.2203300 1.9855004 2.41413553 2.63459327
## C1orf112 2.5777309 2.4276062 2.3190398 1.9373444 4.16832112 4.09592442
## FGR      0.2630344 0.6507646 0.3103401 0.7824086 0.04264434 0.07038933
##           mRNA_R62   mRNA_R61   mRNA_R60 mRNA_R58   mRNA_R57
## TSPAN6   4.1201860 3.63923216 3.37573454 3.982765 3.43829285
## TNMD     0.0000000 0.00000000 0.00000000 0.000000 0.00000000
## DPM1     6.8025810 6.53231696 6.43729433 6.666331 6.16369999
## SCYL3    2.7907720 2.18903382 2.16671544 2.347666 2.43295941
## C1orf112 3.6892992 4.22265002 3.84498816 4.342697 3.71808758
## FGR      0.1243281 0.05658353 0.04264434 0.000000 0.04264434

The log-transformed data are less variable. We will examine the effect on data distributions with the TCGA data.
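As a small worked example of the transformation, the sketch below applies `log2(1 + x)` to four values taken from the mRNA_R17 column shown earlier. The vector names `raw_values` and `log_values` are introduced here for illustration only.

```r
# Four raw expression values from the mRNA_R17 column
raw_values <- c(TNMD = 0.00, FGR = 0.01, TSPAN6 = 33.56, DPM1 = 169.46)

# Adding 1 before taking the log keeps zeros at zero
# (log2(0) would be -Inf)
log_values <- log2(1 + raw_values)
round(log_values, 3)

# Width of the range before and after the transformation
diff(range(raw_values))
diff(range(log_values))

# log1p() computes log(1 + x); dividing by log(2) converts to base 2
all.equal(log_values, log1p(raw_values) / log(2))
```

The raw values span about 170 units, while the transformed values span fewer than 8.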

Motility data

What cell lines (cellLine) were examined?
And what cancer types do the cell lines represent (diagnosis)?

# for cell lines and the cancers they model
unique(cell_speeds_df[,c(2,3)])

##      cellLine       diagnosis
## 1       SW620    Colon Cancer
## 8       SW480    Colon Cancer
## 14     RWPE-1  Not Applicable
## 21       A375     Skin Cancer
## 28       T98G    Brain Cancer
## 35      22Rv1 Prostate Cancer
## 42      T-47D   Breast Cancer
## 49       U-87    Brain Cancer
## 56 MDA-MB-231   Breast Cancer

We are interested in the two colon cancer cell lines:

SW480
SW620

Colon cancer cell lines

SW620 cells are considered more aggressive than SW480 cells. Both cell lines were derived from the same patient, but at different times and locations:

SW480 cells are from the primary tumor, and
SW620 cells are from a lymph node metastasis.

This makes them a valuable pair for studying colon cancer progression and metastasis in vitro.

Let’s see if there are major differences in the expression of their genes.

# List the experimental conditions (substrates); table() would also give the counts
unique(cell_speeds_df$experimentalCondition)

## [1] "Glass"                             "HyaluronicAcid Collagen"          
## [3] "HyaluronicAcid Fibronectin"        "30 kPa polyacrylamide Collagen"   
## [5] "30 kPa polyacrylamide Fibronectin" "500 Pa polyacrylamide Collagen"   
## [7] "500 Pa polyacrylamide Fibronectin"

Previously, we looked at the speed and expression data for breast cancer cell lines on the HyaluronicAcid Collagen substrate. Here, we will use the colon cancer data for the experimental condition HyaluronicAcid Fibronectin.

# Make a smaller data frame for the `HyaluronicAcid Fibronectin` condition
# by replacing CONDITION
fibro_colon_df <- subset(cell_speeds_df, 
                       experimentalCondition == "HyaluronicAcid Fibronectin")
fibro_colon_df

##      sample   cellLine       diagnosis      experimentalCondition
## 3  mRNA_R20      SW620    Colon Cancer HyaluronicAcid Fibronectin
## 9  mRNA_R41      SW480    Colon Cancer HyaluronicAcid Fibronectin
## 16 mRNA_R27     RWPE-1  Not Applicable HyaluronicAcid Fibronectin
## 23 mRNA_R48       A375     Skin Cancer HyaluronicAcid Fibronectin
## 30 mRNA_R13       T98G    Brain Cancer HyaluronicAcid Fibronectin
## 37  mRNA_R6      22Rv1 Prostate Cancer HyaluronicAcid Fibronectin
## 44 mRNA_R55      T-47D   Breast Cancer HyaluronicAcid Fibronectin
## 51 mRNA_R34       U-87    Brain Cancer HyaluronicAcid Fibronectin
## 58 mRNA_R62 MDA-MB-231   Breast Cancer HyaluronicAcid Fibronectin
##    summary_metric average_value total_number_of_cells_tracked
## 3     speed_um_hr     53.006023                            36
## 9     speed_um_hr      5.679561                            65
## 16    speed_um_hr     16.205186                           105
## 23    speed_um_hr     40.729492                            51
## 30    speed_um_hr     28.483277                            86
## 37    speed_um_hr     31.806179                            56
## 44    speed_um_hr     11.288006                            27
## 51    speed_um_hr     45.791985                            47
## 58    speed_um_hr     55.548778                            26

# Make a smaller data frame that includes only the colon cancer cell lines
# by replacing CANCER
fibro_colon_df <- subset(fibro_colon_df, 
                       diagnosis == "Colon Cancer")
fibro_colon_df[,c(1,2,6)]

##     sample cellLine average_value
## 3 mRNA_R20    SW620     53.006023
## 9 mRNA_R41    SW480      5.679561
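The filtering step can also be written with logical indexing and column names instead of column numbers. This is a minimal sketch on a three-row toy data frame (`toy_speeds`, a hypothetical name) that copies three rows of the table above.

```r
# Three rows copied from the HyaluronicAcid Fibronectin table
toy_speeds <- data.frame(
  sample        = c("mRNA_R20", "mRNA_R41", "mRNA_R27"),
  cellLine      = c("SW620", "SW480", "RWPE-1"),
  diagnosis     = c("Colon Cancer", "Colon Cancer", "Not Applicable"),
  average_value = c(53.006023, 5.679561, 16.205186)
)

# subset() and logical row indexing keep the same rows
by_subset <- subset(toy_speeds, diagnosis == "Colon Cancer")
by_index  <- toy_speeds[toy_speeds$diagnosis == "Colon Cancer", ]
all.equal(by_subset, by_index)

# Selecting columns by name is easier to read than c(1, 2, 6)
by_subset[, c("sample", "cellLine", "average_value")]

# Ratio of the two speeds (SW620 divided by SW480)
speed_ratio <- by_subset$average_value[by_subset$cellLine == "SW620"] /
  by_subset$average_value[by_subset$cellLine == "SW480"]
round(speed_ratio, 1)
```

Selecting by name keeps working if the column order of the data frame ever changes.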

The SW620 cell line (from the metastasis) moves about nine times as fast as the SW480 cell line (from the primary tumor) on the “HyaluronicAcid Fibronectin” substrate: 53 µm/hr versus 5.7 µm/hr.

We’ll refer to SW620 as the fast cell line and SW480 as the slow cell line.

Expression and motility data

Let’s extract the two colon cancer cell experiments (mRNA_R20 and mRNA_R41) from the expression matrix.

The sample names (experiments) are the first column in fibro_colon_df and the column names in pson_log_mat.

# Match the experiments in both objects
# by replacing FUNCTION

exp_colon <- match(fibro_colon_df$sample, colnames(pson_log_mat))
print(exp_colon)

## [1]  3 10
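To see how `match()` arrives at these positions, the sketch below runs it on the first ten column names of the expression matrix as printed earlier. The vectors `expr_cols` and `wanted` are illustrative names, and `mRNA_R99` is a made-up sample name used to show what happens when there is no match.

```r
# First ten column names of the expression matrix, in printed order
expr_cols <- c("mRNA_R17", "mRNA_R21", "mRNA_R20", "mRNA_R19", "mRNA_R18",
               "mRNA_R16", "mRNA_R15", "mRNA_R38", "mRNA_R42", "mRNA_R41")

# Samples we want to locate
wanted <- c("mRNA_R20", "mRNA_R41")

# match() returns, for each element of `wanted`,
# the position of its first occurrence in `expr_cols`
idx <- match(wanted, expr_cols)
idx

# Indexing with the result recovers the requested names in order
expr_cols[idx]

# A name that is absent gives NA rather than an error
match("mRNA_R99", expr_cols)
```

Because a missing name yields `NA`, it can be worth checking `anyNA()` on the result before using it to index a matrix.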

Extract columns 3 and 10 from the expression matrix to create a matrix of genes with mRNA levels for the fast and slow cell lines on the HyaluronicAcid Fibronectin substrate.

# Subset the expression data for the two colon cancer cell lines
# by replacing COLUMNS
fibro_colon_log_mat <- pson_log_mat[, exp_colon]

# Call experiments according to the relative speed of the cells
colnames(fibro_colon_log_mat) <- c("fast","slow")

Take a look at your new expression matrix:

round(fibro_colon_log_mat[35:45,],1)

##         fast slow
## ICA1     3.9  5.2
## DBNDD1   1.8  3.1
## ALS2     2.4  3.4
## CASP10   0.2  0.7
## CFLAR    3.1  3.9
## TFPI     0.3  0.2
## NDUFAF7  3.1  4.2
## RBM5     4.4  5.0
## MTMR7    0.5  0.6
## SLC7A2   0.2  2.0
## ARF5     7.4  7.7

Differential Gene Expression

Genes that have very different mRNA levels in the “fast” versus “slow” cell lines may be informative about why the cell lines behave differently.

By subtracting the expression in the “slow” cell line from the expression in the “fast” cell line, we create a differential gene expression profile, or DGE profile. Because the values are log2-transformed, each difference is a log2 fold change.

# Subtract the second column ("slow") from the first column ("fast")
# Replace N1 and N2

dge_colon <- fibro_colon_log_mat[,1] - fibro_colon_log_mat[,2]

# Add dge as a column to a new matrix 
# by replacing FUNCTION

DGE_mat_colon <- cbind(fibro_colon_log_mat,dge_colon)
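A difference of log2 values corresponds to a ratio on the original scale. The sketch below illustrates this with the KRT5 values that appear in the ordered table later in this activity; the variable names are illustrative. Note that the ratio refers to the shifted values (1 + expression), since 1 was added before taking the log.

```r
# Log2-transformed KRT5 expression in the two cell lines
krt5 <- c(fast = 0.61353165, slow = 9.907507)

# Differential expression: fast minus slow
dge_krt5 <- krt5[["fast"]] - krt5[["slow"]]
dge_krt5

# Back-transforming the difference gives the ratio of shifted values
ratio_from_dge <- 2^dge_krt5
ratio_direct   <- (2^krt5[["fast"]]) / (2^krt5[["slow"]])
all.equal(ratio_from_dge, ratio_direct)

# A cutoff of |dge| > 4 therefore means a more than 16-fold difference
2^4
```

A negative value means higher expression in the slow cell line, and a positive value means higher expression in the fast cell line.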

Plot a histogram to see the distribution of differential expression values.

# Create a plot as you did for the breast cancer cell lines
# Plot only the dge values, not the whole matrix
hist(dge_colon, main="Distribution of Differential Expression Values", xlab="Differential expression (fast - slow, log2)", ylab="Frequency")
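Marking the cutoffs on the histogram can make the tails easier to judge. The following is only a sketch on simulated values: `toy_dge` is random data generated for illustration and says nothing about the real colon cancer results.

```r
# Simulated differential expression values (NOT the PS-ON data)
set.seed(1)
toy_dge <- rnorm(1000, mean = 0, sd = 1.5)

# hist() draws the plot and invisibly returns the bin counts
h <- hist(toy_dge, breaks = 40,
          main = "Toy distribution of differential expression",
          xlab = "Differential expression (fast - slow, log2)",
          ylab = "Frequency")

# Dashed vertical lines at the -4 and +4 cutoffs
abline(v = c(-4, 4), lty = 2)

# Every value falls into exactly one bin
sum(h$counts)
```

With the real data, `toy_dge` would be replaced by the vector of differential expression values.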

The genes with large differential expression may provide us with clues as to why the cell lines behave so differently. You should find that there are fewer genes that are highly overexpressed for the colon cancer cell lines than for the breast cancer cell lines.

Why do you think this is so?

# How many genes have dge > 4
length(dge_colon[dge_colon>(4)])

## [1] 11

# How many genes have dge < -4
length(dge_colon[dge_colon<(-4)])

## [1] 49
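Counting with `sum()` on a logical vector is a common alternative to `length(x[condition])`. The sketch below uses a five-gene toy vector (`toy_dge`, with made-up gene names and values) that includes one missing value to show where the two approaches differ.

```r
# Toy differential expression values, including one missing value
toy_dge <- c(G1 = 5.2, G2 = -4.5, G3 = 0.3, G4 = -6.1, G5 = NA)

# TRUE counts as 1, so sum() counts the genes that pass the cutoff
sum(toy_dge > 4, na.rm = TRUE)   # 1
sum(toy_dge < -4, na.rm = TRUE)  # 2

# Logical indexing keeps an NA entry, so length() counts it too
length(toy_dge[toy_dge > 4])     # 2

# which() drops NA and returns the positions (with names) that pass
names(which(toy_dge < -4))
```

When the vector has no missing values, both approaches give the same count.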

More genes are much more highly expressed in the SW480 (slow) cell line (49 genes with dge < -4) than in the SW620 (fast) cell line (11 genes with dge > 4). Let’s fish out this set.

# Order the differential gene expression values from LOW to HIGH
# This is different from the sorting for the breast cancer cell lines
# Note that we put decreasing = FALSE 
# This will put the "slow" genes at the top of our matrix
# Replace FUNCTION

order_dge_colon <- order(dge_colon, decreasing = FALSE)

# Replace ORDERED_ROWS
DGE_mat_ordered_colon <- DGE_mat_colon[order_dge_colon,]

head(DGE_mat_ordered_colon,15)

##               fast      slow dge_colon
## KRT5    0.61353165  9.907507 -9.293975
## KRT13   0.29865832  7.581728 -7.283070
## TACSTD2 0.12432814  6.795585 -6.671257
## KRT23   1.85598970  8.523601 -6.667611
## MIA     0.36737107  6.889108 -6.521737
## MAL2    0.08406426  6.190022 -6.105958
## IGFBP3  1.46466827  7.517748 -6.053080
## LCN2    5.00898878 10.927000 -5.918011
## PHLDB2  0.16349873  5.978424 -5.814926
## HPGD    0.13750352  5.890447 -5.752943
## ANKRD1  0.04264434  5.671010 -5.628366
## WFDC2   1.81966818  7.306244 -5.486576
## CD74    0.11103131  5.548128 -5.437097
## TRIP6   0.83187724  6.140779 -5.308901
## WNT5A   0.00000000  5.130107 -5.130107

You should find KRT5, KRT13, and TACSTD2 as the top genes with expression greater in the slow cell line.
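The ordering step can be traced on a small example. This sketch builds a four-gene toy matrix (`toy_mat`, a hypothetical name) from rounded values shown earlier in this activity and sorts its rows from the most negative to the most positive difference.

```r
# Rounded log2 expression for four genes (fast and slow cell lines)
toy_mat <- cbind(fast = c(0.6, 7.4, 0.3, 0.2),
                 slow = c(9.9, 7.7, 0.2, 2.0))
rownames(toy_mat) <- c("KRT5", "ARF5", "TFPI", "SLC7A2")

# Differential expression: fast minus slow
toy_dge <- toy_mat[, "fast"] - toy_mat[, "slow"]

# order() returns the row positions that sort the values;
# increasing order (the default) puts the "slow" genes first
ord <- order(toy_dge, decreasing = FALSE)
ord

# Reorder all columns of the matrix using those positions
toy_mat_ordered <- cbind(toy_mat, dge = toy_dge)[ord, ]
rownames(toy_mat_ordered)

# decreasing = TRUE would put the "fast" genes first instead
rownames(toy_mat)[order(toy_dge, decreasing = TRUE)]
```

Unlike `sort()`, which returns the sorted values themselves, `order()` returns positions, so the same positions can reorder every column of the matrix at once.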

Annotation of genes

# In DGE_mat_ordered_colon, the top genes are 
# more highly expressed in the slow versus fast cell line 

N <- 50

colon_genes <- rownames(DGE_mat_ordered_colon)[1:N]
write.table(colon_genes,"colon_genes.csv",
          row.names=FALSE, col.names=FALSE, quote=FALSE)

Look in your working directory to find the file colon_genes.csv. Open it (when you click on the file name, select the View File option), then highlight and copy the gene names. Input them into the Try a gene set query window at Gene Set AI.
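The file can also be checked from R before copying the gene names. The following sketch writes three gene names to a temporary file and reads them back; `toy_genes` and `out_file` are illustrative names, and the temporary path is used only so the example does not touch your working directory.

```r
# Three of the top "slow" genes, used here as a small example
toy_genes <- c("KRT5", "KRT13", "TACSTD2")

# Write one gene per line, without row names, header, or quotes
out_file <- file.path(tempdir(), "toy_genes.csv")
write.table(toy_genes, out_file,
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# Read the file back as plain lines to confirm its contents
genes_in_file <- readLines(out_file)
genes_in_file

# Print the names one per line, ready to copy and paste
cat(genes_in_file, sep = "\n")
```

Reading the file back this way confirms that it holds one gene name per line with no quotes or header.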

What functional theme is detected?

Name: Extracellular matrix remodeling and cell adhesion regulation

Create a list of some of the groups of proteins you find:

Matrix Metalloproteinases (MMPs) and Tissue Inhibitors
Cell Adhesion and Structural Proteins
Integrin and ECM Interactions
Protease and Inhibitor Dynamics