Until now, we worked with expression data and clinical information from breast cancer patient samples (TCGA). We found patterns in genes and in samples by color-coding the expression data in a heat map and then clustering the samples and the genes in the heat map based on how similar they are to each other.
In this module, we will work with data from experiments with human cancer cell lines from the Physical Sciences in Oncology Cell Line Characterization Study, which includes imaging- and microscopy-based measurements of physical properties of the cells, such as morphology (shape) and motility (movement). We will examine:
knitr() is the R package that generates the report from R Markdown. We can create reports as Word doc, PDF, and HTML files.
An R package bundles together code, data, documentation, and tests, and is easy to download and share with others.
PSON.RData contains gene expression and cell speed data from the Physical Sciences in Oncology project.
# The objects will also appear in our "Environment" tab.
load(file.path(data_dir, "PSON.RData"),verbose=TRUE)
## Loading objects:
## pson_expr_df
## cell_speeds_df
Objects were named to be be descriptive.
The Environment tab gives us basic information on the object that was loaded.
You can use some of the R functions we learned previously to remind us what is in the data frames we loaded.
dim() tells us the dimensions (# row, # columns) of the objects.head() shows us the top several rows of the objects.pson_expr_df[1:5,1:5]
## symbol mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19
## 1 TSPAN6 33.56 45.10 39.42 35.61
## 2 TNMD 0.00 0.00 0.00 0.00
## 3 DPM1 169.46 129.88 132.06 113.73
## 4 SCYL3 1.85 1.85 1.77 1.93
## 5 C1orf112 5.73 11.85 10.16 9.28
Rows are named with the genes whose expression values were measured. Columns are named with identifiers for the experiments performed. More about this below.
QUESTION 1: What are the names of the first two genes in the expression data frame?
Gene names (“symbol”) are in the first column. We’ll create a numerical matrix of the mRNA levels:
# Remove the first column because it contains gene names
pson_expr_mat <- as.matrix(pson_expr_df[, -1])
# Make the gene names into row names
rownames(pson_expr_mat) <- pson_expr_df$symbol
# Use indexing to take a look
pson_expr_mat[1:5, 1:5]
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18
## TSPAN6 33.56 45.10 39.42 35.61 32.10
## TNMD 0.00 0.00 0.00 0.00 0.00
## DPM1 169.46 129.88 132.06 113.73 123.37
## SCYL3 1.85 1.85 1.77 1.93 2.30
## C1orf112 5.73 11.85 10.16 9.28 10.01
There are many ways of summarizing mRNA expression levels. The mRNA expression values in our matrix have units of TPM or “transcripts per million.” Each column in the expression data should add up to 1,000,000.
To learn more about the different ways RNA-seq levels are measured, check out StatQuest’s video.
colSums(pson_expr_mat) # Add up the values in each column
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18 mRNA_R16 mRNA_R15 mRNA_R38
## 925249.2 939584.8 936328.0 939342.9 937213.3 932558.1 930438.3 941026.1
## mRNA_R42 mRNA_R41 mRNA_R40 mRNA_R39 mRNA_R37 mRNA_R36 mRNA_R24 mRNA_R28
## 934102.2 943195.8 928026.6 937107.5 935499.5 941930.0 940546.9 941475.4
## mRNA_R27 mRNA_R26 mRNA_R25 mRNA_R23 mRNA_R22 mRNA_R45 mRNA_R49 mRNA_R48
## 945065.0 934940.7 938330.5 927800.4 939314.6 928315.2 930200.8 927158.0
## mRNA_R47 mRNA_R46 mRNA_R44 mRNA_R43 mRNA_R10 mRNA_R14 mRNA_R13 mRNA_R12
## 929514.4 929947.4 933376.2 931928.5 950605.1 953271.7 952507.0 950621.5
## mRNA_R11 mRNA_R9 mRNA_R8 mRNA_R3 mRNA_R7 mRNA_R6 mRNA_R5 mRNA_R4
## 934212.8 950774.9 952000.5 936577.1 937718.2 932324.8 922843.2 935792.9
## mRNA_R2 mRNA_R1 mRNA_R52 mRNA_R56 mRNA_R55 mRNA_R54 mRNA_R53 mRNA_R51
## 937852.1 930950.6 944337.2 945542.5 944524.4 940821.8 946705.3 941959.7
## mRNA_R50 mRNA_R31 mRNA_R35 mRNA_R34 mRNA_R33 mRNA_R32 mRNA_R30 mRNA_R29
## 944849.8 949522.1 950227.3 951114.6 952487.6 949740.4 952105.8 946548.6
## mRNA_R59 mRNA_R63 mRNA_R62 mRNA_R61 mRNA_R60 mRNA_R58 mRNA_R57
## 951240.0 951305.3 950446.6 951255.0 946777.2 951215.8 949333.9
Most of the columns don’t add up to a million. The original data table included many more rows than the 18,682 we see here, but in the tidying process we removed the genes that do not code for proteins.
QUESTION 2: What is the sum of mRNA values (in units of TPM) for the experiment “mRNA_R20”?
As we did with the TCGA breast cancer patient expression data, we’ll log-transform the PSON expression data to compress the range of the values.
pson_log_mat <- log2(1 + pson_expr_mat)
pson_log_mat[1:5, 1:5]
## mRNA_R17 mRNA_R21 mRNA_R20 mRNA_R19 mRNA_R18
## TSPAN6 5.111031 5.526695 5.336997 5.194166 5.048759
## TNMD 0.000000 0.000000 0.000000 0.000000 0.000000
## DPM1 7.413289 7.032101 7.055933 6.842099 6.958495
## SCYL3 1.510962 1.510962 1.469886 1.550901 1.722466
## C1orf112 2.750607 3.683696 3.480265 3.361768 3.460743
The object cell_speeds_df contains data that is completely new to us.
# View(cell_speeds_df)
Let’s get the column names of this data frame:
colnames(cell_speeds_df)
## [1] "sample" "cellLine"
## [3] "diagnosis" "experimentalCondition"
## [5] "summary_metric" "average_value"
## [7] "total_number_of_cells_tracked"
What cell lines (cellLine) were examined?
And what cancer types do the cell lines represent (diagnosis)?
unique(cell_speeds_df$cellLine)
## [1] "SW620" "SW480" "RWPE-1" "A375" "T98G"
## [6] "22Rv1" "T-47D" "U-87" "MDA-MB-231"
unique(cell_speeds_df$diagnosis)
## [1] "Colon Cancer" "Not Applicable" "Skin Cancer" "Brain Cancer"
## [5] "Prostate Cancer" "Breast Cancer"
We can use the function unique() on a table as well!
unique(cell_speeds_df[,c(2,3)])
## cellLine diagnosis
## 1 SW620 Colon Cancer
## 8 SW480 Colon Cancer
## 14 RWPE-1 Not Applicable
## 21 A375 Skin Cancer
## 28 T98G Brain Cancer
## 35 22Rv1 Prostate Cancer
## 42 T-47D Breast Cancer
## 49 U-87 Brain Cancer
## 56 MDA-MB-231 Breast Cancer
We are interested in the two breast cancer cell lines: + T47-D + MDA-MB-231
QUESTION 3: What is the name of one of the Colon Cancer cell lines?
Each experiment was conducted under different conditions. “experimentalCondition” gives the type of substrate on which cell speed was measured.
unique(cell_speeds_df$experimentalCondition)
## [1] "Glass" "HyaluronicAcid Collagen"
## [3] "HyaluronicAcid Fibronectin" "30 kPa polyacrylamide Collagen"
## [5] "30 kPa polyacrylamide Fibronectin" "500 Pa polyacrylamide Collagen"
## [7] "500 Pa polyacrylamide Fibronectin"
We will use the data for the experimental condition “HyaluronicAcid Collagen”.
Hyaluronic acid, collagen, and fibronectin are all common components of the extracellular matrix for biological cells.
substr_img<-load.image(file.path(data_dir,"substrate.jpg"))
plot(substr_img,axes=FALSE)
Let’s look in more detail at one condition, “HyaluronicAcid Collagen.”
# Make a smaller data frame for this condition
hyal_coll_df <- subset(cell_speeds_df,
experimentalCondition == "HyaluronicAcid Collagen")
hyal_coll_df
## sample cellLine diagnosis experimentalCondition summary_metric
## 2 mRNA_R21 SW620 Colon Cancer HyaluronicAcid Collagen speed_um_hr
## 8 mRNA_R42 SW480 Colon Cancer HyaluronicAcid Collagen speed_um_hr
## 15 mRNA_R28 RWPE-1 Not Applicable HyaluronicAcid Collagen speed_um_hr
## 22 mRNA_R49 A375 Skin Cancer HyaluronicAcid Collagen speed_um_hr
## 29 mRNA_R14 T98G Brain Cancer HyaluronicAcid Collagen speed_um_hr
## 36 mRNA_R7 22Rv1 Prostate Cancer HyaluronicAcid Collagen speed_um_hr
## 43 mRNA_R56 T-47D Breast Cancer HyaluronicAcid Collagen speed_um_hr
## 50 mRNA_R35 U-87 Brain Cancer HyaluronicAcid Collagen speed_um_hr
## 57 mRNA_R63 MDA-MB-231 Breast Cancer HyaluronicAcid Collagen speed_um_hr
## average_value total_number_of_cells_tracked
## 2 18.477557 51
## 8 8.814832 40
## 15 13.225218 108
## 22 43.955352 52
## 29 31.561848 75
## 36 19.510723 50
## 43 16.515478 81
## 50 39.380265 58
## 57 36.196133 12
Nine experiments were done on the “HyaluronicAcid Collagen” substrate.
Two of the experiments were done with the breast cancer cell lines we are interested in.
# Make an even smaller data frame that includes only the breast cancer cell lines
hyal_brca_df <- subset(hyal_coll_df,
diagnosis == "Breast Cancer")
head(hyal_brca_df)
## sample cellLine diagnosis experimentalCondition summary_metric
## 43 mRNA_R56 T-47D Breast Cancer HyaluronicAcid Collagen speed_um_hr
## 57 mRNA_R63 MDA-MB-231 Breast Cancer HyaluronicAcid Collagen speed_um_hr
## average_value total_number_of_cells_tracked
## 43 16.51548 81
## 57 36.19613 12
The “summary_metric” is cell speed in microns per hour. From this table, we see that one cell line (MDA-MB-231) moves twice as fast as the other cell line (T-47D) on the “HyaluronicAcid Collagen” substrate: 36 u/hr versus 16 u/hr.
We’ll call MDA-MB-231 the “fast” cell line and T-47D the “slow” cell line.
Even though the expression matrix contains information from 63 experiments, we want the expression data from the two experiments for the breast cancer cell lines measured on the “HyaluronicAcid Collagen” substrate: mRNA_R56 and mRNA_R63.
Let’s extract these two experiments from the expression matrix.
# We match the experiments in the column of our motility sub-table
# with the column names of our expression matrix
exps <- match(hyal_brca_df$sample, colnames(pson_log_mat))
# Make a sub-matrix with the expression data for only these two expriments
hyal_brca_log_mat <- pson_log_mat[, exps]
# Call experiments according to the relative speed of the cells
colnames(hyal_brca_log_mat) <- c("slow","fast")
# Look at the first 10 rows
hyal_brca_log_mat[1:10,]
## slow fast
## TSPAN6 4.22419466 4.06091205
## TNMD 0.00000000 0.00000000
## DPM1 6.95965425 6.73890291
## SCYL3 3.28835856 2.63459327
## C1orf112 3.49313492 4.09592442
## FGR 0.00000000 0.07038933
## CFH 0.05658353 1.64154603
## FUCA2 5.21916852 6.51506901
## GCLC 4.44625623 5.21761849
## NFYA 3.32481060 5.13504229
Some of the genes have a similar expression level in both cell lines, but some genes are quite different.
Genes that have very different levels in the “fast” versus “slow” cell lines may be informative about why the cell lines behave differently.
By subtracting the expression in the “slow” cell line from the expression in the fast cell line, we create a differential gene expression profile, or DGE profile.
# Subtract the first and second columns
dge <- hyal_brca_log_mat[,2] - hyal_brca_log_mat[,1]
# Add the column to a new matrix
DGE_mat <- cbind(hyal_brca_log_mat,dge)
# Sort the differential gene expression values from high to low
order_dge <- order(dge, decreasing = TRUE)
# Re-order the rows of the matrices so we can easily see which
# genes are most expressed in "fast" versus "slow"
DGE_mat <- DGE_mat[order_dge,]
# Look at the first 10 rows
head(DGE_mat, 10)
## slow fast dge
## VIM 1.23878686 10.959234 9.720447
## LDHB 0.28688115 9.076067 8.789186
## SERPINE1 0.01435529 8.718704 8.704349
## MSN 0.01435529 8.709945 8.695590
## GPX1 0.04264434 8.720347 8.677703
## CAV1 0.00000000 8.457955 8.457955
## GSTP1 0.59454855 8.961392 8.366843
## FOSL1 0.68706069 8.750841 8.063780
## AXL 0.18903382 8.201781 8.012747
## F3 0.69599381 8.651411 7.955417
We can also look at the bottom of the matrix to see which genes are more highly expressed in the “slow” cell line versus the “fast” cell line.
QUESTION 4: Give one gene with high expression in the fast cell line and low expression in the slow cell line.
# Look at the last 10 rows
tail(DGE_mat,10)
## slow fast dge
## RAB25 8.054957 0.09761080 -7.957346
## AGR3 8.373170 0.11103131 -8.262138
## IGFBP5 8.417684 0.02856915 -8.389115
## CDH1 8.760354 0.17632277 -8.584031
## STC2 9.032679 0.41142625 -8.621253
## CRABP2 11.248586 2.49057013 -8.758016
## SERPINA6 9.490229 0.04264434 -9.447585
## CRIP1 9.928415 0.31034012 -9.618075
## PIP 11.931051 0.00000000 -11.931051
## MGP 12.594747 0.27500705 -12.319740
There are many genes that are preferrentially expressed in one cell line versus the other.
hist(DGE_mat[,3], xlab = "Differential gene expression, dge",
main = "Histogram of dge values")
Most of the dge values are around zero, so most genes have very similar expression in the “fast” and “slow” cell lines.
The genes with large differential expression (those in the tail of the histogram) are the most interesting to consider because they may provide us with clues as to why the cell lines behave so differently.
We’ll look at this in more detail next time, but here is a flavor of how a single gene can make a big difference.
QUESTION 5: Approximately, how many genes have dge close to zero?
VIM, which codes for the protein vimentin, is at the very top of the expression matrix ranked by dge, and KRT23, which codes for the protein keratin, is near the bottom.
This image comes from the paper, “Vimentin induces changes in cell shape, motility, and adhesion during the epithelial to mesenchymal transition”.
Cells that express vimentin filaments (VIF) but not keratin filaments (KIF) are elongated in shape and move better (more motile, panel A), whereas cells with KIF and not VIF are round and undergo fewer changes in morphology (shape) and position (panel B).
VIF_KIF_img<-load.image(file.path(data_dir,"VIF_KIF.JPG"))
plot(VIF_KIF_img,axes=FALSE)
The cells in A are elongated and move more quickly than the cells in B. A main difference between the cells is the different expression levels of VIM versus KRT23.