Breast Cancer Cell Lines

About this activity

In this module, we will switch gears and work with data from experiments with human cancer cell lines from the Physical Sciences in Oncology (PS-ON) Cell Line Characterization Study.

Cancer cell lines are cancer cells that keep dividing and growing over time, under certain conditions in a laboratory. Cancer cell lines are used in research to study the biology of cancer and to test cancer treatments.

The PS-ON Study includes imaging- and microscopy-based measurements of physical properties of the cells, such as morphology (shape) and motility (movement). We will examine:

the expression levels of genes, and
how fast the cells move.

Preliminaries

Save this file with a different name and add your name as author.

We will load the imager library so we may look at some images in this activity.

Loading the data

PSON.RData contains gene expression and cell speed data from the Physical Sciences in Oncology project.

# The objects will also appear in our "Environment" tab.
load("/home/data/PSON.RData",verbose=TRUE)

## Loading objects:
##   pson_expr_df
##   cell_speeds_df

Objects were named to be descriptive.

“pson” stands for “Physical Sciences in Oncology”
“expr” stands for “gene expression data”
“df” stands for “data frame”

Compare the data frames

The Environment tab gives us basic information on the objects that were loaded.

We can use some of the R functions we learned previously to remind ourselves what is in the data frames we loaded.

The expression data frame

In pson_expr_df, each row holds the expression (mRNA) values measured for one gene. Columns are named with identifiers for the experiments performed.

Gene names (“symbol”) are in the first column. We’ll create a numerical matrix of the mRNA levels and designate the gene symbols as names for the rows of the matrix.

# Remove the first column because it contains gene names
# and assign the result to a new object.
# A negative index (-1) drops that column.

pson_expr_mat <- as.matrix(pson_expr_df[, -1])                

# Make the gene names into row names.
# The gene names are in the column `symbol`.

rownames(pson_expr_mat) <- pson_expr_df$symbol

Use indexing to examine your new matrix pson_expr_mat. Does it match what you expect?

# Look at the first 5 rows and the first 5 columns
pson_expr_mat[1:5, 1:5]

# Now, instead of being a column, the symbols are the names of the rows.
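The same two steps can be tried on a toy data frame. This is only a minimal sketch: `toy_df`, its gene names, and its values are made up for illustration and are not part of PSON.RData.

```r
# Hypothetical toy data frame: gene symbols in column 1, experiments after it
toy_df <- data.frame(symbol = c("GENE_A", "GENE_B", "GENE_C"),
                     exp1   = c(0, 10.5, 3),
                     exp2   = c(1, 9.5, 300))

# Drop column 1 with a negative index and convert the rest to a numeric matrix
toy_mat <- as.matrix(toy_df[, -1])

# Use the gene symbols as row names
rownames(toy_mat) <- toy_df$symbol

# Rows can now be selected by gene name
toy_mat["GENE_B", "exp2"]
```

Once the symbols are row names, a gene can be looked up by name instead of by row number.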

The motility data frame

The object cell_speeds_df contains data that is completely new to us.

# Check out this new table 

print(cell_speeds_df)

Get the column names of this data frame:

# colnames() extracts the column names

colnames(cell_speeds_df)

## [1] "sample"                        "cellLine"                     
## [3] "diagnosis"                     "experimentalCondition"        
## [5] "summary_metric"                "average_value"                
## [7] "total_number_of_cells_tracked"

A little background!

sample is the name of the experiment
cellLine is the human cancer cell line name
diagnosis is the cancer type that the cell line models
experimentalCondition is the substrate on which the cells were grown
summary_metric has cell speed in microns per hour
average_value is the average cell speed
total_number_of_cells_tracked is how many cells were tracked under the imaging microscope

Cancer cell lines

What cell lines (cellLine) were examined?
And what cancer types do the cell lines represent (diagnosis)?

We can use the function unique() on the table to get the information for both features.

# for cell lines and the cancers they model
unique(cell_speeds_df[,c(2,3)])

##      cellLine       diagnosis
## 1       SW620    Colon Cancer
## 8       SW480    Colon Cancer
## 14     RWPE-1  Not Applicable
## 21       A375     Skin Cancer
## 28       T98G    Brain Cancer
## 35      22Rv1 Prostate Cancer
## 42      T-47D   Breast Cancer
## 49       U-87    Brain Cancer
## 56 MDA-MB-231   Breast Cancer
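The row labels in the output above (1, 8, 14, ...) are not consecutive because unique() keeps the first row in which each combination appears. A minimal sketch with a made-up data frame (`toy_lines` and its contents are hypothetical) shows this, and also shows selecting columns by name rather than by position:

```r
# Hypothetical toy table with repeated cell line / diagnosis pairs
toy_lines <- data.frame(
  cellLine      = c("LINE_1", "LINE_1", "LINE_2", "LINE_3"),
  diagnosis     = c("Colon Cancer", "Colon Cancer", "Colon Cancer", "Skin Cancer"),
  average_value = c(1, 2, 3, 4)
)

# Keep one row per distinct (cellLine, diagnosis) pair
toy_unique <- unique(toy_lines[, c("cellLine", "diagnosis")])

# The row names record where each pair first appeared
rownames(toy_unique)
```

Selecting by name gives the same result as c(2,3) would on the real table, but keeps working if the column order ever changes.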

We are interested in the two breast cancer cell lines:

T-47D
MDA-MB-231

Experimental conditions

Each experiment was conducted under one of several conditions. The column “experimentalCondition” gives the type of substrate on which cell speed was measured.

# Count the experiments done under each experimental condition

table(cell_speeds_df$experimentalCondition)

## 
##    30 kPa polyacrylamide Collagen 30 kPa polyacrylamide Fibronectin 
##                                 9                                 9 
##    500 Pa polyacrylamide Collagen 500 Pa polyacrylamide Fibronectin 
##                                 9                                 9 
##                             Glass           HyaluronicAcid Collagen 
##                                 8                                 9 
##        HyaluronicAcid Fibronectin 
##                                 9

We will use the data for the experimental condition HyaluronicAcid Collagen.

Hyaluronic acid and collagen

Hyaluronic acid, collagen, and fibronectin are all common components of the extracellular matrix for biological cells. The extracellular matrix is a complex network of proteins and other molecules that

surrounds and supports cells,
influences cell behavior, and
facilitates communication between cells.

It is a biologically meaningful material on which to measure the movement of cells!

substr_img<-load.image("/home/data/substrate.jpg")

plot(substr_img,axes=FALSE)

Hyaluronic acid is a type of sugar molecule that maintains tissue structure. Hyaluronic acid’s unique properties, such as its ability to bind water and interact with other components of the extracellular matrix, contribute to its diverse functions in the body.

Let’s look in more detail at the condition HyaluronicAcid Collagen.

# Make a smaller data frame for this condition with a new function `subset()`

hyal_coll_df <- subset(cell_speeds_df, 
                       experimentalCondition == "HyaluronicAcid Collagen")

Nine experiments were done on the “HyaluronicAcid Collagen” substrate.

Two of the experiments were done with the breast cancer cell lines we are interested in, i.e. T-47D and MDA-MB-231.

# Make an even smaller data frame that includes only the breast cancer cell lines
# Keep only the rows whose `diagnosis` is the cancer type we're interested in

hyal_brca_df <- subset(hyal_coll_df, 
                       diagnosis == "Breast Cancer")
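The two subset() calls could also be written as one call that combines both conditions with `&`. The sketch below uses a made-up table (`toy_speeds` and its values are hypothetical) to check that both routes keep the same rows:

```r
# Hypothetical toy table with the same column names as cell_speeds_df
toy_speeds <- data.frame(
  cellLine              = c("LINE_1", "LINE_2", "LINE_3", "LINE_1"),
  diagnosis             = c("Breast Cancer", "Colon Cancer",
                            "Breast Cancer", "Breast Cancer"),
  experimentalCondition = c("Glass", "Glass",
                            "HyaluronicAcid Collagen", "HyaluronicAcid Collagen"),
  average_value         = c(10, 20, 30, 40)
)

# Two steps, as in the activity
step1 <- subset(toy_speeds, experimentalCondition == "HyaluronicAcid Collagen")
step2 <- subset(step1, diagnosis == "Breast Cancer")

# One step: a row is kept only if both conditions are TRUE
one_step <- subset(toy_speeds,
                   experimentalCondition == "HyaluronicAcid Collagen" &
                     diagnosis == "Breast Cancer")

# Compare the two results
all.equal(step2, one_step)
```

The two-step version has the advantage that the intermediate data frame can be inspected on its own.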

The summary_metric is cell speed in microns per hour. From the table below, we see that one cell line (MDA-MB-231) moves about twice as fast as the other cell line (T-47D) on the “HyaluronicAcid Collagen” substrate: about 36 µm/hr versus about 17 µm/hr.

# Get speed information on our two cell lines
hyal_brca_df[,c(1,2,6)]

##      sample   cellLine average_value
## 43 mRNA_R56      T-47D      16.51548
## 57 mRNA_R63 MDA-MB-231      36.19613

In research, the faster cell line MDA-MB-231 is used as a model of aggressive breast cancer, and the slower cell line T-47D is used as a model of less aggressive breast cancer.

We’ll refer to MDA-MB-231 as the fast cell line and T-47D as the slow cell line.

Combining the expression and motility data

Let’s first prepare the expression data frame.

Log transformation

As we did with the TCGA expression data, we log-transform the PS-ON expression data to compress the range of the values.

# The same log transformation we did for TCGA data
# Adding 1 to each element avoids -Inf values, because log2(0) is -Inf
pson_log_mat <- log2(pson_expr_mat + 1)

The log-transformed values span a much narrower range.
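To see why we add 1 and how strongly the range is compressed, here is a small sketch with made-up values (`toy_tpm` is hypothetical and not taken from the PS-ON data):

```r
# Hypothetical expression values, including a zero
toy_tpm <- c(0, 1, 15, 1023, 65535)

# Without the +1, a zero becomes -Inf
log2(toy_tpm)[1]

# With the +1, a zero stays at 0 and large values are pulled in
toy_log <- log2(toy_tpm + 1)
toy_log

# Width of the range before and after the transformation
diff(range(toy_tpm))
diff(range(toy_log))
```

In this toy vector the raw values span 65535 units, while the transformed values span only 16.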

Get expression data for our cell lines

Although the expression matrix contains data from 63 experiments, we want only the two experiments for the breast cancer cell lines, T-47D and MDA-MB-231, measured on the HyaluronicAcid Collagen substrate:

mRNA_R56 and
mRNA_R63.

Let’s extract these two experiments from the expression matrix.

We find the sample names (experiments):

In the first column of hyal_brca_df, and
As the column names in pson_log_mat.

The function match() looks up each element of its first argument in its second argument and returns the position of the first match, i.e. which columns in pson_log_mat correspond to our experiments.

# Match the experiments in both objects
experiments <- match(hyal_brca_df$sample, colnames(pson_log_mat))
print(experiments)

## [1] 44 58
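A small sketch of how match() behaves, using made-up sample names (`toy_samples` and `toy_columns` are hypothetical):

```r
# Hypothetical names: the samples we want, and the column names to search
toy_samples <- c("exp_B", "exp_D")
toy_columns <- c("exp_A", "exp_B", "exp_C", "exp_D")

# Position of each wanted sample within toy_columns
toy_pos <- match(toy_samples, toy_columns)
toy_pos

# Indexing with those positions recovers the wanted names, in the same order
toy_columns[toy_pos]

# A name that is not present gives NA rather than an error
match("exp_Z", toy_columns)
```

Because a missing name silently returns NA, it is worth printing the result, as we did above, before using it as an index.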

Extract these columns from the expression matrix to create a matrix of genes with mRNA levels for the fast and slow cell lines on the hyaluronic acid collagen substrate.

# Make a sub-matrix with the expression data for only these two experiments
# Index the columns with the positions found in the previous code chunk

hyal_brca_log_mat <- pson_log_mat[, experiments]

# Name the columns according to the relative speed of the cells
colnames(hyal_brca_log_mat) <- c("slow","fast")

# Look at some of the rows
round(hyal_brca_log_mat[35:45,],1)

##         slow fast
## ICA1     6.1  2.7
## DBNDD1   5.0  1.4
## ALS2     3.1  4.3
## CASP10   0.1  1.8
## CFLAR    4.1  6.1
## TFPI     0.1  7.1
## NDUFAF7  3.7  3.7
## RBM5     5.6  6.0
## MTMR7    0.6  0.9
## SLC7A2   0.0  3.6
## ARF5     7.1  7.2

Some of the genes have a similar expression level in both cell lines, but some genes are quite different.

Differential Gene Expression

Genes that have very different mRNA levels in the fast versus slow cell lines may be informative about why the cell lines behave differently.

By subtracting the expression values in the slow cell line from the expression values in the fast cell line, we create a differential gene expression profile, or DGE profile.

# Subtract the first column ("slow") from the second column ("fast")
dge <- hyal_brca_log_mat[,2] - hyal_brca_log_mat[,1]

# Add dge as a column to a new matrix 
DGE_mat <- cbind(hyal_brca_log_mat,dge)

round(DGE_mat[35:45,],1)

##         slow fast  dge
## ICA1     6.1  2.7 -3.5
## DBNDD1   5.0  1.4 -3.6
## ALS2     3.1  4.3  1.1
## CASP10   0.1  1.8  1.8
## CFLAR    4.1  6.1  2.0
## TFPI     0.1  7.1  7.0
## NDUFAF7  3.7  3.7  0.0
## RBM5     5.6  6.0  0.4
## MTMR7    0.6  0.9  0.3
## SLC7A2   0.0  3.6  3.6
## ARF5     7.1  7.2  0.0
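Because the values are on a log2 scale, a difference of logs is the log of a ratio: dge = log2(fast + 1) - log2(slow + 1) = log2((fast + 1) / (slow + 1)). The sketch below checks this on made-up values (`toy_slow` and `toy_fast` are hypothetical):

```r
# Hypothetical untransformed expression values for three genes
toy_slow <- c(GENE_A = 7, GENE_B = 3,  GENE_C = 0)
toy_fast <- c(GENE_A = 7, GENE_B = 15, GENE_C = 3)

# Differential expression on the log2 scale, as in the activity
toy_diff <- log2(toy_fast + 1) - log2(toy_slow + 1)
toy_diff

# Undoing the log gives the ratio of the shifted values
2^toy_diff
(toy_fast + 1) / (toy_slow + 1)
```

Read this way, a dge of 1 means roughly a doubling, and a dge of -5 means the value in the fast cell line is roughly 32 times lower than in the slow one.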

Plot a histogram to see the distribution of differential expression values (dge).

# Create a histogram of the DGE profile (the dge column of DGE_mat)
# Add a title to the plot and label the x axis
# We've made histograms in previous activities and you can use help

hist(DGE_mat[, "dge"],
     main = "Distribution of differential expression values",
     xlab = "dge values between fast and slow cell lines")
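One thing to watch for: hist() treats a matrix as one long vector, so a matrix with several columns has all of its columns pooled into a single histogram. For the DGE profile we want only the dge values. A minimal sketch on a made-up matrix (`toy_dge_mat` is hypothetical) makes the difference visible without drawing anything:

```r
# Hypothetical 3-gene matrix with the same column names as DGE_mat
toy_dge_mat <- cbind(slow = c(8, 9, 10),
                     fast = c(9, 9, 9),
                     dge  = c(1, 0, -1))

# plot = FALSE returns the bin counts instead of drawing the plot
h_all <- hist(toy_dge_mat, plot = FALSE)           # all 9 values pooled
h_dge <- hist(toy_dge_mat[, "dge"], plot = FALSE)  # only the 3 dge values

# Number of values that went into each histogram
sum(h_all$counts)
sum(h_dge$counts)
```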

What values of differential gene expression do most genes have?

How many values of dge are in the tails of the histogram?

Let’s look first at the left tail.

# The number of genes with dge less than -5

length(dge[dge<(-5)])

## [1] 94

Now let’s look at the right tail:

# Find the number of genes with dge GREATER THAN 5

length(dge[dge>5])

## [1] 134
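Another common way to count is to sum a logical vector, since TRUE counts as 1 and FALSE as 0. This is only an alternative sketch on made-up values (`toy_values` is hypothetical):

```r
# Hypothetical differential expression values
toy_values <- c(-6.2, -0.3, 0.1, 5.5, 7.0, 0.0)

# The approach used above: subset, then take the length
n_by_length <- length(toy_values[toy_values > 5])

# Equivalent: sum the TRUE/FALSE results of the comparison
n_by_sum <- sum(toy_values > 5)

# Both tails at once, using the absolute value
n_both_tails <- sum(abs(toy_values) > 5)

c(n_by_length, n_by_sum, n_both_tails)
```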

You found the number of genes for which expression is much higher in one cell line than in the other. They appear in the tails of the histogram. The genes with large differential expression are the most interesting to consider because they may provide us with clues as to why the two cancer cell lines behave so differently. Let’s find them!

# Get the row order that sorts the differential gene expression values from high to low
order_dge <- order(dge, decreasing = TRUE)

# Create a new matrix with the rows sorted by dge
DGE_mat_ordered <- DGE_mat[order_dge,]

# Check the 15 highest values
head(DGE_mat_ordered,15)

##                slow      fast      dge
## VIM      1.23878686 10.959234 9.720447
## LDHB     0.28688115  9.076067 8.789186
## SERPINE1 0.01435529  8.718704 8.704349
## MSN      0.01435529  8.709945 8.695590
## GPX1     0.04264434  8.720347 8.677703
## CAV1     0.00000000  8.457955 8.457955
## GSTP1    0.59454855  8.961392 8.366843
## FOSL1    0.68706069  8.750841 8.063780
## AXL      0.18903382  8.201781 8.012747
## F3       0.69599381  8.651411 7.955417
## CST1     0.00000000  7.920532 7.920532
## PLAT     0.09761080  7.883315 7.785705
## MMP14    0.61353165  8.398744 7.785212
## AKR1B1   0.08406426  7.820881 7.736817
## PLAU     0.51601515  8.167368 7.651353
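Note that order() does not return sorted values; it returns the row positions that would put the values in order, and those positions are then used to rearrange the whole matrix. A minimal sketch on a made-up matrix (`toy_rank_mat` and its gene names are hypothetical):

```r
# Hypothetical 3-gene matrix with slow, fast, and dge columns
toy_rank_mat <- cbind(slow = c(5, 1, 3), fast = c(1, 6, 3))
rownames(toy_rank_mat) <- c("GENE_A", "GENE_B", "GENE_C")
toy_rank_mat <- cbind(toy_rank_mat,
                      dge = toy_rank_mat[, "fast"] - toy_rank_mat[, "slow"])

# Positions of the rows from highest dge to lowest
toy_order <- order(toy_rank_mat[, "dge"], decreasing = TRUE)
toy_order

# Rearranging the rows keeps each gene's slow, fast, and dge values together
toy_sorted <- toy_rank_mat[toy_order, ]
rownames(toy_sorted)
```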

What genes are at the bottom of the list?

# tail() is the companion function to `head`
# Use it to look at the lowest 15 dge values

tail(DGE_mat_ordered,15)

##               slow       fast        dge
## ST14      7.778734 0.87184365  -6.906891
## AZGP1     8.022312 0.57531233  -7.447000
## KRT23     7.504700 0.00000000  -7.504700
## FOXA1     7.829723 0.04264434  -7.787078
## OLFM1     7.868699 0.00000000  -7.868699
## RAB25     8.054957 0.09761080  -7.957346
## AGR3      8.373170 0.11103131  -8.262138
## IGFBP5    8.417684 0.02856915  -8.389115
## CDH1      8.760354 0.17632277  -8.584031
## STC2      9.032679 0.41142625  -8.621253
## CRABP2   11.248586 2.49057013  -8.758016
## SERPINA6  9.490229 0.04264434  -9.447585
## CRIP1     9.928415 0.31034012  -9.618075
## PIP      11.931051 0.00000000 -11.931051
## MGP      12.594747 0.27500705 -12.319740

The genes with higher expression in the fast cell line (MDA-MB-231) are those that appear in the right tail of the histogram for the differential gene expression.

Similarly, genes whose expression is highest in the slow cell line (T-47D) appear in the left tail of the histogram for the differential gene expression.

Vimentin and keratin in motility

VIM, which codes for the protein vimentin, is at the very top of the expression matrix ranked by dge, and KRT23, which codes for a keratin protein (keratin 23), is near the bottom.

The image below comes from the paper “Vimentin induces changes in cell shape, motility, and adhesion during the epithelial to mesenchymal transition”.

Cells that express vimentin filaments (VIF) but not keratin filaments (KIF) are elongated in shape and more motile (panel A), whereas cells with KIF and not VIF are round and undergo fewer changes in morphology (shape) and position (panel B).

VIF_KIF_img<-load.image("/home/data/VIF_KIF.JPG")

plot(VIF_KIF_img,axes=FALSE)

Our two cell lines differ in the same way: the gene VIM is expressed at a relatively high level in the fast cell line, whereas the gene KRT23 is expressed at a relatively high level in the slow cell line.

The VIF only cells are similar to MDA-MB-231 cells (the fast cell line) and the KIF only cells are similar to T-47D cells (the slow cell line).

We saw from cell_speeds_df that the MDA-MB-231 cell line is more motile than the T-47D cell line, which is consistent with their use as models of more and less aggressive breast cancer, respectively.

Biological processes

Even though this is just an exploratory data analysis, we can still extract biological meaning from our results.

The Gene Ontology (GO) knowledgebase is the world’s largest source of information on the functions of genes. Several easy-to-use web servers accept a list of genes and test whether those genes collectively contribute to particular biological processes more often than randomly selected sets of genes would.

We can test our hypothesis that the most differentially expressed genes contribute to the cell properties associated with each cancer type.

Genes most expressed in MDA-MB-231

Let’s first do a quick Gene Ontology analysis of the genes with highest differential expression in the fast cell line.

# In DGE_mat_ordered, the top genes are more highly expressed 
# in the fast versus slow cell line 

N <- 50
fast_genes <- rownames(DGE_mat_ordered)[1:N]
write.table(fast_genes,"fast_genes.csv",
          row.names=FALSE, col.names=FALSE, quote=FALSE)

Look in your working directory to find the file fast_genes.csv. Open it (click on the file name and select the View File option), then highlight and copy the gene names. Paste them into the “Try a gene set query” window at Gene Set AI.

What functional theme is detected?

Name: Extracellular matrix remodeling and cell migration

Create a list of some of the groups of proteins you find:

Matrix Metalloproteinases (MMPs) and Tissue Inhibitors
Adhesion and Structural Proteins
Integrin and ECM Interactions
Protease Systems

CHALLENGE QUESTIONS

Transcripts per Million

There are many ways of summarizing mRNA expression levels. The mRNA expression values in our matrix have units of TPM, or “transcripts per million.” For each experiment, the gene expression values should sum to 1,000,000. In other words, each column of the matrix should sum to 1,000,000.

In the code chunk below, use the function colSums to find the sum of TPM values for each experiment.

# Use the colSums function on your expression matrix `pson_expr_mat`
# You can look up the help page for colSums

colSums(pson_expr_mat)

## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18 mRNA_R16 mRNA_R15 mRNA_R38 
## 925249.2 939584.8 936328.0 939342.9 937213.3 932558.1 930438.3 941026.1 
## mRNA_R42 mRNA_R41 mRNA_R40 mRNA_R39 mRNA_R37 mRNA_R36 mRNA_R24 mRNA_R28 
## 934102.2 943195.8 928026.6 937107.5 935499.5 941930.0 940546.9 941475.4 
## mRNA_R27 mRNA_R26 mRNA_R25 mRNA_R23 mRNA_R22 mRNA_R45 mRNA_R49 mRNA_R48 
## 945065.0 934940.7 938330.5 927800.4 939314.6 928315.2 930200.8 927158.0 
## mRNA_R47 mRNA_R46 mRNA_R44 mRNA_R43 mRNA_R10 mRNA_R14 mRNA_R13 mRNA_R12 
## 929514.4 929947.4 933376.2 931928.5 950605.1 953271.7 952507.0 950621.5 
## mRNA_R11  mRNA_R9  mRNA_R8  mRNA_R3  mRNA_R7  mRNA_R6  mRNA_R5  mRNA_R4 
## 934212.8 950774.9 952000.5 936577.1 937718.2 932324.8 922843.2 935792.9 
##  mRNA_R2  mRNA_R1 mRNA_R52 mRNA_R56 mRNA_R55 mRNA_R54 mRNA_R53 mRNA_R51 
## 937852.1 930950.6 944337.2 945542.5 944524.4 940821.8 946705.3 941959.7 
## mRNA_R50 mRNA_R31 mRNA_R35 mRNA_R34 mRNA_R33 mRNA_R32 mRNA_R30 mRNA_R29 
## 944849.8 949522.1 950227.3 951114.6 952487.6 949740.4 952105.8 946548.6 
## mRNA_R59 mRNA_R63 mRNA_R62 mRNA_R61 mRNA_R60 mRNA_R58 mRNA_R57 
## 951240.0 951305.3 950446.6 951255.0 946777.2 951215.8 949333.9

QUESTION: Do any of the columns add up to 1,000,000? Provide a possible explanation for your observations:

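None of the columns add up to 1,000,000; the sums range from about 923,000 to about 953,000. A shortfall of roughly 5 to 8 percent is too large to be explained by rounding errors alone. A more likely explanation is that the matrix contains only a subset of the transcripts that were included when TPM was computed, for example if some genes were filtered out.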
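TPM can be sketched on toy numbers to see both why each column should sum to 1,000,000 and one way a sum could fall short. The read counts, the gene lengths, and the idea that rows were removed are all assumptions made for illustration; we do not know how the PS-ON matrix was actually prepared.

```r
# Hypothetical read counts: 4 genes (rows) in 2 experiments (columns)
toy_counts <- matrix(c(100, 200, 300, 400,
                        50,  50, 500, 400),
                     ncol = 2,
                     dimnames = list(paste0("GENE_", 1:4), c("exp1", "exp2")))

# Hypothetical gene lengths in kilobases, one per row
toy_length_kb <- c(1, 2, 3, 4)

# Step 1: reads per kilobase (each row is divided by its gene length)
toy_rate <- toy_counts / toy_length_kb

# Step 2: rescale each column so that it sums to one million
toy_tpm_mat <- sweep(toy_rate, 2, colSums(toy_rate), "/") * 1e6

# With every gene present, each column sums to 1,000,000
colSums(toy_tpm_mat)

# If a gene is dropped after TPM was computed, the sums fall below 1,000,000
colSums(toy_tpm_mat[-4, ])
```

The key point is that TPM values are proportions of a whole, so removing rows after normalization lowers the column totals.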

Extra: To learn more about the different ways RNA-seq levels are measured, check out StatQuest’s video.

Genes most expressed in T-47D

Now let’s do a quick Gene Ontology analysis of the genes with highest differential expression in the slow cell line.

# Create and write a list of genes at the bottom of DGE_mat_ordered
# Follow the code above, but now we need the last 50 genes

# Rows 18633 through 18682 (the last row) hold the 50 lowest dge values
slow_genes <- rownames(DGE_mat_ordered)[18633:18682]

# Does your result make sense?
print(slow_genes)

##  [1] "SLC7A8"   "TGFB3"    "RERG"     "PRSS8"    "GREB1"    "KIAA1324"
##  [7] "SPTSSB"   "QPRT"     "PDZK1"    "SEPP1"    "TTC39A"   "VAV3"    
## [13] "YBX2"     "PGR"      "TSPAN13"  "CKMT1A"   "TFF3"     "BNIPL"   
## [19] "DEGS2"    "CXCL12"   "FKBP10"   "SYCP2"    "PRLR"     "ELF5"    
## [25] "ESRP1"    "OLFML3"   "GATA3"    "LURAP1L"  "APOE"     "ABCC11"  
## [31] "PREX1"    "CLDN3"    "HOXC10"   "FXYD3"    "IGFBP2"   "ST14"    
## [37] "AZGP1"    "KRT23"    "FOXA1"    "OLFM1"    "RAB25"    "AGR3"    
## [43] "IGFBP5"   "CDH1"     "STC2"     "CRABP2"   "SERPINA6" "CRIP1"   
## [49] "PIP"      "MGP"
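Typing the row numbers by hand works, but the last N names can also be taken without knowing how many rows there are. This is only an alternative sketch on made-up names (`toy_names` is hypothetical):

```r
# Hypothetical ordered gene names
toy_names <- paste0("GENE_", 1:10)
N <- 3

# Computing the index range from the length of the vector
by_index <- toy_names[(length(toy_names) - N + 1):length(toy_names)]

# tail() with a count does the same thing directly
by_tail <- tail(toy_names, N)

by_tail
identical(by_index, by_tail)
```

On the real data, the equivalent call would be tail(rownames(DGE_mat_ordered), 50).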

When you are happy with your output, write it to a file:

# Create a file as we did for fast genes

write.table(slow_genes,"slow_genes.csv",
            row.names=FALSE, col.names=FALSE, quote=FALSE)

What functional theme is detected?

Name: Hormone Regulation and Epithelial Cell Differentiation

Create a list of some of the groups of proteins you find:

Hormone Regulation
Epithelial Cell Differentiation and Function
Extracellular Matrix and Cell Signaling
Metabolism and Transport

Colon Cancer Cell Lines

Do a similar analysis for the colon cancer cell lines in the activity Colon_Cancer_Cell_Lines.Rmd.

Create your report

Click the “Knit” button at the top of this window and select “Knit to HTML.”

Congratulations

We have successfully applied our R skills to analyze data from human cancer cell lines and to calculate differential gene expression for a model of less aggressive breast cancer (“slow” cell line) versus a model of aggressive breast cancer (“fast” cell line). We found interesting biological functions for the groups of genes most differentially expressed!

Jessica Jin

July 30, 2025
