About this activity

We previously looked at breast cancer cell lines from the Physical Sciences in Oncology (PS-ON) Cell Line Characterization Study. In this activity, we will consider data on colon cancer cell lines.

Cancer cell lines are cancer cells that keep dividing and growing over time, under certain conditions in a laboratory. Cancer cell lines are used in research to study the biology of cancer and to test cancer treatments.

The PS-ON Study includes imaging- and microscopy-based measurements of physical properties of the cells, such as morphology (shape) and motility (movement). We will examine:


Preliminaries

Save this file with a different name and add your name as author.

Clean your Global Environment.


Loading the data

PSON.RData contains gene expression and cell speed data from the Physical Sciences in Oncology project.

# Replace FUNCTION to load in the data
# The objects will also appear in our "Environment" tab.

load("/home/data/PSON.RData",verbose=TRUE) 
## Loading objects:
##   pson_expr_df
##   cell_speeds_df

Objects were named to be descriptive.
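To see what load() does with named objects, here is a minimal sketch. It saves a made-up data frame to a temporary .RData file and loads it into a separate environment. The names toy_expr_df, tmp_file, and toy_env are hypothetical and are not part of PSON.RData.

```r
# Toy data frame standing in for a real dataset (hypothetical values)
toy_expr_df <- data.frame(symbol = c("GENE1", "GENE2"),
                          mRNA_S1 = c(10, 0),
                          mRNA_S2 = c(12, 3))

# Save it to a temporary .RData file
tmp_file <- tempfile(fileext = ".RData")
save(toy_expr_df, file = tmp_file)

# Load into a separate environment so the Global Environment is untouched;
# load() invisibly returns the names of the objects it restored
toy_env <- new.env()
loaded_names <- load(tmp_file, envir = toy_env, verbose = TRUE)

print(loaded_names)
print(dim(toy_env$toy_expr_df))
```

Capturing the return value of load() is a convenient way to check which objects a file contains before you start working with them.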


Prepare the gene expression data

Create a matrix containing just the mRNA values of the genes.

# Remove the first column because it contains gene names
# by replacing N
pson_expr_mat <- as.matrix(pson_expr_df[, -1])                


# Make the gene names into row names
# by replacing FEATURE
rownames(pson_expr_mat) <- pson_expr_df$symbol  
# Use indexing to take a look at the first few rows and columns
pson_expr_mat[1:5, 1:5]
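The same two steps can be tried on a small made-up table. This is only a sketch: toy_df and its values are hypothetical, chosen so the result is easy to inspect.

```r
# Toy expression table: first column holds gene symbols (hypothetical values)
toy_df <- data.frame(symbol = c("GENE1", "GENE2", "GENE3"),
                     mRNA_S1 = c(33.5, 0, 169.4),
                     mRNA_S2 = c(45.1, 0, 129.9))

# Drop the symbol column and convert the numeric columns to a matrix
toy_mat <- as.matrix(toy_df[, -1])

# Use the gene symbols as row names
rownames(toy_mat) <- toy_df$symbol

# Index by row name and column name
toy_mat["GENE3", "mRNA_S1"]
```

Removing the symbol column first matters: if it were kept, as.matrix() would return a character matrix and the values could not be log-transformed.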

Log transformation for gene expression data

As we did with the TCGA expression data, log-transform the PS-ON expression data to compress the range of the values.

# Replace FUNCTION to do log transformation 

pson_log_mat <- log(1 + pson_expr_mat)

# Index the first few rows and columns

pson_log_mat[1:5,1:5]
##          mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18
## TSPAN6   3.542697 3.830813 3.699325 3.600321 3.499533
## TNMD     0.000000 0.000000 0.000000 0.000000 0.000000
## DPM1     5.138501 4.874281 4.890800 4.742582 4.823261
## SCYL3    1.047319 1.047319 1.018847 1.075002 1.193922
## C1orf112 1.906575 2.553344 2.412336 2.330200 2.398804

The log-transformed data are less variable. We will examine the effect on data distributions with the TCGA data.
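A quick sketch of why 1 is added before taking the log, and how much the range shrinks. The vector x is made up for illustration. log1p() is a base R function that computes log(1 + x).

```r
# Toy expression values spanning several orders of magnitude (hypothetical)
x <- c(0, 1, 10, 100, 1000)

# log(1 + x) keeps zeros at zero and avoids log(0) = -Inf
log_x <- log(1 + x)

# log1p() computes the same quantity
all.equal(log_x, log1p(x))

# Compare the spread before and after the transformation
diff(range(x))
diff(range(log_x))
```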


Motility data

What cell lines (cellLine) were examined?
And what cancer types do the cell lines represent (diagnosis)?

# Cell lines and the cancers they model
unique(cell_speeds_df[, c("cellLine", "diagnosis")])
##      cellLine       diagnosis
## 1       SW620    Colon Cancer
## 8       SW480    Colon Cancer
## 14     RWPE-1  Not Applicable
## 21       A375     Skin Cancer
## 28       T98G    Brain Cancer
## 35      22Rv1 Prostate Cancer
## 42      T-47D   Breast Cancer
## 49       U-87    Brain Cancer
## 56 MDA-MB-231   Breast Cancer

We are interested in the two colon cancer cell lines:

Colon cancer cell lines

SW620 cells are considered more aggressive than SW480 cells. Both cell lines were derived from the same patient, but at different times and locations:

  • SW480 cells are from the primary tumor, and
  • SW620 cells are from a lymph node metastasis.

This makes them a valuable pair for studying colon cancer progression and metastasis in vitro.

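Let’s see if there are major differences in the expression of their genes. First, check which experimental conditions (substrates) were used when cell speed was measured.

# List the distinct experimental conditions
# (the table function would list them along with their counts)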
unique(cell_speeds_df$experimentalCondition)
## [1] "Glass"                             "HyaluronicAcid Collagen"          
## [3] "HyaluronicAcid Fibronectin"        "30 kPa polyacrylamide Collagen"   
## [5] "30 kPa polyacrylamide Fibronectin" "500 Pa polyacrylamide Collagen"   
## [7] "500 Pa polyacrylamide Fibronectin"
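table() is a close relative of unique(). The sketch below shows the difference on a short made-up vector; the values in conditions are hypothetical and are not the PS-ON data.

```r
# Toy vector of conditions (hypothetical; not the PS-ON data)
conditions <- c("Glass", "Glass", "HyaluronicAcid Collagen",
                "Glass", "HyaluronicAcid Collagen")

# unique() returns each distinct value once, in order of first appearance
unique(conditions)

# table() returns the distinct values together with their counts
condition_counts <- table(conditions)
condition_counts
```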

Previously, we looked at the speed and expression data for breast cancer cell lines on the HyaluronicAcid Collagen substrate. Here, we will use the colon cancer data for the experimental condition HyaluronicAcid Fibronectin.

# Make a smaller data frame for the `HyaluronicAcid Fibronectin` condition
# by replacing CONDITION
fibro_colon_df <- subset(cell_speeds_df, 
                       experimentalCondition == "HyaluronicAcid Fibronectin")
fibro_colon_df
##      sample   cellLine       diagnosis      experimentalCondition
## 3  mRNA_R20      SW620    Colon Cancer HyaluronicAcid Fibronectin
## 9  mRNA_R41      SW480    Colon Cancer HyaluronicAcid Fibronectin
## 16 mRNA_R27     RWPE-1  Not Applicable HyaluronicAcid Fibronectin
## 23 mRNA_R48       A375     Skin Cancer HyaluronicAcid Fibronectin
## 30 mRNA_R13       T98G    Brain Cancer HyaluronicAcid Fibronectin
## 37  mRNA_R6      22Rv1 Prostate Cancer HyaluronicAcid Fibronectin
## 44 mRNA_R55      T-47D   Breast Cancer HyaluronicAcid Fibronectin
## 51 mRNA_R34       U-87    Brain Cancer HyaluronicAcid Fibronectin
## 58 mRNA_R62 MDA-MB-231   Breast Cancer HyaluronicAcid Fibronectin
##    summary_metric average_value total_number_of_cells_tracked
## 3     speed_um_hr     53.006023                            36
## 9     speed_um_hr      5.679561                            65
## 16    speed_um_hr     16.205186                           105
## 23    speed_um_hr     40.729492                            51
## 30    speed_um_hr     28.483277                            86
## 37    speed_um_hr     31.806179                            56
## 44    speed_um_hr     11.288006                            27
## 51    speed_um_hr     45.791985                            47
## 58    speed_um_hr     55.548778                            26
# Make a smaller data frame that includes only the colon cancer cell lines
# by replacing CANCER
fibro_colon_df <- subset(fibro_colon_df, 
                       diagnosis == "Colon Cancer")
fibro_colon_df[, c("sample", "cellLine", "average_value")]
##     sample cellLine average_value
## 3 mRNA_R20    SW620     53.006023
## 9 mRNA_R41    SW480      5.679561

The SW620 cell line (from the metastasis) moves more than nine times as fast as the SW480 cell line (from the primary tumor) on the “HyaluronicAcid Fibronectin” substrate: 53 µm/hr versus 5.7 µm/hr.

We’ll refer to SW620 as the fast cell line and SW480 as the slow cell line.
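The fold difference in speed can be checked directly from the two average_value entries shown above. This is a small sketch; the vector name speeds is introduced here for illustration only.

```r
# Average speeds (um/hr) reported above for the two colon cancer cell lines
speeds <- c(SW620 = 53.006023, SW480 = 5.679561)

# Fold difference in speed between the fast and slow cell lines
speed_ratio <- speeds[["SW620"]] / speeds[["SW480"]]
round(speed_ratio, 1)
```

The ratio comes out at about 9.3.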


Expression and motility data

Let’s extract the two colon cancer cell experiments (mRNA_R20 and mRNA_R41) from the expression matrix.

The sample names (experiments) are the first column in fibro_colon_df and the column names in pson_log_mat.

# Match the experiments in both objects
# by replacing FUNCTION

exp_colon <- match(fibro_colon_df$sample, colnames(pson_log_mat))
print(exp_colon)
## [1]  3 10
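To see what match() returns, here is a small sketch using the first five column names printed earlier. The sample name mRNA_R99 is made up to show what happens when there is no matching column.

```r
# First five column names of the expression matrix, as shown earlier
expr_cols <- c("mRNA_R17", "mRNA_R21", "mRNA_R20", "mRNA_R19", "mRNA_R18")

# Samples to look up; "mRNA_R99" is a made-up name with no matching column
wanted <- c("mRNA_R20", "mRNA_R18", "mRNA_R99")

# match() gives the position of each sample in expr_cols, or NA if absent
idx <- match(wanted, expr_cols)
idx

# Drop unmatched samples before using the positions as column indices
idx[!is.na(idx)]
```

The positions come back in the order of the first argument, which is why the order of the rows in fibro_colon_df determines the order of the extracted columns.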

Extract columns 3 and 10 from the expression matrix to create a matrix of genes with mRNA levels for the fast and slow cell lines on the HyaluronicAcid Fibronectin substrate.

# Subset the expression data for the two colon cancer cell lines
# by replacing COLUMNS
fibro_colon_log_mat <- pson_log_mat[, exp_colon]

# Name the columns according to the relative speed of the cells
colnames(fibro_colon_log_mat) <- c("fast","slow")

Take a look at your new expression matrix:

print(fibro_colon_log_mat)

Differential Gene Expression

Genes that have very different mRNA levels in the “fast” versus “slow” cell lines may be informative about why the cell lines behave differently.

By subtracting the expression in the “slow” cell line from the expression in the fast cell line, we create a differential gene expression profile, or DGE profile.
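Because the values are on the log(1 + x) scale, a difference of two values is the log of a ratio. The sketch below illustrates this with made-up expression values (fast_raw and slow_raw are hypothetical) and shows what a difference of 4 means in fold-change terms.

```r
# Toy untransformed expression values for three genes (hypothetical)
fast_raw <- c(GENE1 = 400, GENE2 = 10, GENE3 = 0)
slow_raw <- c(GENE1 = 5,   GENE2 = 10, GENE3 = 60)

# Differential expression on the log scale: fast minus slow
dge <- log(1 + fast_raw) - log(1 + slow_raw)

# The same quantity written as the log of a ratio
log_ratio <- log((1 + fast_raw) / (1 + slow_raw))
all.equal(dge, log_ratio)

# Positive: higher in fast; zero: no difference; negative: higher in slow
round(dge, 2)

# A dge of 4 on the natural-log scale is roughly a 55-fold difference
exp(4)
```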

# Subtract the second column ("slow") from the first column ("fast")
# Replace N1 and N2

dge_colon <- fibro_colon_log_mat[,1] - fibro_colon_log_mat[,2]

# Add dge as a column to a new matrix 
# by replacing FUNCTION

DGE_mat_colon <- cbind(fibro_colon_log_mat,dge_colon)

Plot a histogram to see the distribution of differential expression values.

# Create a plot as you did for the breast cancer cell lines
# Plot only the dge values, not the "fast" and "slow" columns

hist(dge_colon,
     main="Distribution of differential expression values",
     xlab="dge values (fast minus slow)")

The genes with large differential expression may provide us with clues as to why the cell lines behave so differently. You should find that there are fewer genes that are highly overexpressed for the colon cancer cell lines than for the breast cancer cell lines.

Why do you think this is so?

# How many genes have dge < -4 (higher in the slow cell line)
# How many genes have dge > 4 (higher in the fast cell line)

length(dge_colon[dge_colon<(-4)])
## [1] 9
length(dge_colon[dge_colon>(4)])
## [1] 4

There are more genes that are highly expressed in the SW480 (slow) cell line than in the SW620 (fast) cell line. Let’s fish out this set.

# Order the differential gene expression values from LOW to HIGH
# This is different from the sorting for the breast cancer cell lines
# Note that we put decreasing = FALSE 
# This will put the "slow" genes at the top of our matrix
# Replace FUNCTION

order_dge_colon <- order(dge_colon, decreasing = FALSE)

# Replace ORDERED_ROWS
DGE_mat_ordered_colon <- DGE_mat_colon[order_dge_colon,]

head(DGE_mat_ordered_colon,15)

You should find KRT5, KRT13, and TACSTD2 as the top genes with expression greater in the slow cell line.
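The counting and ordering steps can be tried on a few made-up values. This sketch assumes the differences are computed as fast minus slow; toy_dge and its gene names are hypothetical.

```r
# Toy differential expression values, fast minus slow (hypothetical)
toy_dge <- c(GENE1 = 4.2, GENE2 = 0.0, GENE3 = -4.1, GENE4 = -5.3, GENE5 = 1.7)

# Count genes beyond each threshold; sum() counts the TRUE values
sum(toy_dge < -4)   # higher in the slow cell line
sum(toy_dge > 4)    # higher in the fast cell line

# order() returns the positions that sort the values from low to high
toy_order <- order(toy_dge, decreasing = FALSE)

# Most negative values (higher in slow) come first
names(toy_dge)[toy_order]
```

Using sum() on a logical vector is a compact alternative to taking the length of a subset, and it gives the same count when there are no missing values.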

Annotation of genes

# In DGE_mat_ordered_colon, the top genes are 
# more highly expressed in the slow versus fast cell line 

N <- 50

colon_genes <- rownames(DGE_mat_ordered_colon)[1:N]
write.table(colon_genes,"colon_genes.csv",
          row.names=FALSE, col.names=FALSE, quote=FALSE)

Look in your working directory to find the file colon_genes.csv. Open it (click on the file name and select the View File option), then highlight and copy the gene names. Paste them into the Try a gene set query window at Gene Set AI.
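If you want to check what the gene list file looks like before copying from it, the sketch below writes a made-up gene list to a temporary file and reads it back. The names toy_genes and gene_file are hypothetical.

```r
# Toy gene list (hypothetical symbols)
toy_genes <- c("GENE1", "GENE2", "GENE3")

# Write one gene per line to a temporary file
gene_file <- file.path(tempdir(), "toy_genes.csv")
write.table(toy_genes, gene_file,
            row.names = FALSE, col.names = FALSE, quote = FALSE)

# Read the file back to confirm its contents
readLines(gene_file)

# Alternatively, print the names one per line in the console for copying
cat(toy_genes, sep = "\n")
```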

What functional theme is detected?

Name: Cellular Stress Response and Metabolic Regulation

Create a list of some of the groups of proteins you find: