This module will introduce you to loading and manipulating the data from the breast cancer METABRIC dataset, visualizing and working with gene expression measurements, and building predictive models based on the expression of many different genes and the associated clinical data.
The METABRIC dataset contains more patient survivial information than we had with the TCGA data.
Let’s load the data — both the expression matrix (already log-transformed) and a table of clinical information.
load("/home/data/metabric_disc_clin_df.RData")
load("/home/data/metabric_disc_expr_mat.RData")
ls()
## [1] "metabric_disc_clin_df" "metabric_disc_expr_mat"
Check out the clinical data frame. Many of the features are similar to those in the TCGA breast cancer clinical data frame.
# What's in the clinical data frame?
View(metabric_disc_clin_df)
Note that the expression matrix has already been log-transformed. How many rows and columns does metabric_disc_expr_mat
have?
Reminder: The dimensions of a matrix can be found with the dim()
function.
# How many rows and columns in the expression matrix?
# Find out: uncomment the lines below and replace FUNCTION and OBJECT
dim(metabric_disc_expr_mat)
## [1] 15834 997
What’s in the expression matrix?
Reminder: Indexing can be used to get specific rows and some columns of a matrix.
# The expression matrix is already log-transformed
# Uncomment and use indexing to view the first 5 rows and 5 columns
metabric_disc_expr_mat[1:5 , 1:5 ]
## MB_0362 MB_0346 MB_0386 MB_0574 MB_0185
## TSPAN6 7.579 7.819 7.366 8.110 5.993
## TNMD 6.211 5.557 8.108 5.789 5.484
## DPM1 9.150 9.794 9.229 9.672 10.159
## SCYL3 7.193 7.845 6.777 6.863 6.692
## C1orf112 7.835 8.084 6.895 6.987 7.173
Each row of the clinical data frame is one patient, and each column of the expression data frame is for one patient.
Check whether the patient IDs in the expression and clinical data frames are in the same order:
# Are the patient IDs the same and in the same order?
identical(metabric_disc_clin_df$metabric_id,colnames(metabric_disc_expr_mat))
## [1] TRUE
Now we are ready to do some actual analysis!
We happen to know that high AURKA expression in a tumor is bad for the patient’s outcome.
AURKA (Aurora kinase A) encodes for a protein that is involved in regulation of the cell cycle. Various studies have associated it with tumour progression and metastasis.
# Row number for AURKA in the expression data
aurka_row <- which(rownames(metabric_disc_expr_mat) == "AURKA")
aurka_row
## [1] 1595
Make an object that contains the expression of AURKA across patients:
# as.vector turns a row into a one-dimensional array
aurka <- as.vector(metabric_disc_expr_mat[aurka_row, ])
# Uncomment and use indexing to look at the first 50 values
aurka[1:50]
## [1] 7.918 9.626 7.689 7.738 8.612 7.083 6.769 8.211 9.697 8.645
## [11] 7.236 6.220 8.678 7.500 6.362 7.416 11.767 8.489 7.208 7.195
## [21] 6.991 6.999 6.183 8.220 7.532 7.805 8.173 8.831 9.731 9.217
## [31] 7.931 7.292 8.078 8.921 8.413 6.577 7.508 8.768 7.927 7.248
## [41] 8.261 7.996 6.849 6.447 9.380 7.629 7.582 7.513 8.531 8.504
Since high AURKA expression is associated with tumor progression, we are interested in clinical features for which AURKA expression may be predictive.
The feature last_follow_up_status
tells whether patients being followed in this study were alive or had passed at that time.
Reminder: There is a function that returns a feature in table form.
# What status can patients have at the last time they were checked?
# Uncomment and replace FUNCTION to find each category and its frequency
table(metabric_disc_clin_df$last_follow_up_status)
##
## a d d-d.s. d-o.c.
## 548 46 259 126
What do each of the categories mean?
At time of last follow up, a patient can have a status:
a
: alived
: deadd-d.s.
: dead from diseased-o.c
: dead of other causes259 patients were specifically indicated to have died from breast cancer (d-d.s.). There are 46 patients labelled as d
. We think they died of cancer, but the cause for these patients could be unknown.
Even for the patients who died of unknown cause d-o.c.
, the cause may have been indirectly caused by having cancer, so we usually want to be careful about how we treat these cases.
Another perspective is to consider only those patients who died of disease and examine how long since diagnosis that death occurred.
We have to exclude the rows and columns for patients who are alive and those who died of other causes. These cases are known as censored events.
Rather than filtering out patients with last_follow_up_status
equal to a
or d-o.c.
, we can use the censored
column in the data frame.
# How many patients are alive or have died?
table(metabric_disc_clin_df$censored)
##
## FALSE TRUE
## 323 674
Compare these number to the numbers for a
, d
, d-d.s.
and d-o.c
. What do TRUE
and FALSE
correspond to? Living status
Why do we want to remove patients with censored
equal to TRUE
? At the last follow-up, they were alive
# Remove the censored patients from the data
# The function which() gives the row number for censored = TRUE
drop_pats <- which(metabric_disc_clin_df$censored)
# Create clincial dataframe and expression matrix for patients who have died
dead_clin <- metabric_disc_clin_df[-drop_pats, ]
dead_expr <- metabric_disc_expr_mat[, -drop_pats]
dim(dead_clin)
## [1] 323 32
dim(dead_expr)
## [1] 15834 323
To test whether the code did what you expected, check out the new clinical data frame:
View(dead_clin)
The time-to-death (the feature T
) is measured in days since cancer diagnosis.
How is time-to-death related to the expression level of AURKA? As AURKA expression increases, time to death decreases We’ll create a plot that includes a trend line representing the linear model from lm()
.
# make sure to remove censored patients from the `aurka` vector
aurka_dead = aurka[-drop_pats]
# model time to death as a function of AURKA expression
aurka_model <- lm(dead_clin$T ~ aurka_dead)
# create a scatterplot to check your result
plot(dead_clin$T ~ aurka_dead,
main = "Time to death as a function of AURKA expression",
ylab = "Time to death, days",
xlab = "AURKA expression")
abline(aurka_model, col = "red")
Time-to-death is inversely correlation with AURKA expression.
As we saw in the mtcars_linear_regression.Rmd
activity, there are several things in the output of lm()
that are interesting. We’ll store the summary information as an object and extract the values we want.
# Store the summary of the model you already made
aurka_model_summary <- summary(aurka_model)
# Uncomment and replace FUNCTION to get the coefficients of your model
coef_aurka_model = coefficients(aurka_model_summary)
# Get slope and p-value
slope = coef_aurka_model[2, 1]
pval = coef_aurka_model[2, 4]
# Uncomment and replace FEATURE tp get R-squared
r_squared <- aurka_model_summary$r.squared
slope
## [1] -711.7943
r_squared
## [1] 0.1723047
pval
## [1] 8.426028e-15
The time to death decreases as AURKA expression increases.
In the code chunk above, you should have found that AURKA expression explains 17% of the variance (r.squares) in time to death (dead_clin$T
).
Let’s include more factors in the model.
AURKA expression in aurka_dead
is a continuous variable. You can include other continuous variables in your model by adding them. For example:
The number of lymph nodes found to be positive for cancer cells at the time of diagnosis, lymph_nodes_positive
, can be a strong predictor of time until the person dies. We would probably want to allow for variables like this in a survival model.
# Add lymph_nodes_positive as a feature to the model
# Uncomment and replace FUNCTION to create the linear model
summary(lm(dead_clin$T ~ dead_clin$lymph_nodes_positive + aurka_dead))
##
## Call:
## lm(formula = dead_clin$T ~ dead_clin$lymph_nodes_positive + aurka_dead)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2400.1 -857.5 -179.4 679.9 4308.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7599.93 676.37 11.236 < 2e-16 ***
## dead_clin$lymph_nodes_positive -62.72 14.07 -4.459 1.14e-05 ***
## aurka_dead -678.83 85.19 -7.968 2.90e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1214 on 318 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.221, Adjusted R-squared: 0.2161
## F-statistic: 45.11 on 2 and 318 DF, p-value: < 2.2e-16
Does time-to-death increase or decrease with lymph_nodes_positive
? How much variance is explained with the new model?
As more lymph nodes are expressed, then time to death decreases.
Try the same with NPI
. The Nottingham Prognostic Index (NPI) is a tool used to predict the prognosis of breast cancer patients after surgery. It combines three key factors: tumor size, lymph node involvement, and tumor grade. A higher score corresponds to poorer prognosis.
# Uncomment and replace DEPENDENT_FEATURE with the feature you want to predict
# and replace FEATURE with feature in the clinical data for NPI
summary(lm(dead_clin$T ~ dead_clin$NPI + aurka_dead))
##
## Call:
## lm(formula = dead_clin$T ~ dead_clin$NPI + aurka_dead)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2830.8 -782.4 -260.5 605.3 4298.4
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7971.96 677.71 11.763 < 2e-16 ***
## dead_clin$NPI -294.20 63.74 -4.616 5.69e-06 ***
## aurka_dead -579.69 89.40 -6.484 3.40e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1212 on 318 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.2243, Adjusted R-squared: 0.2194
## F-statistic: 45.97 on 2 and 318 DF, p-value: < 2.2e-16
Are your results consistent with the definition of the NPI score? Yes —
Categorical variables, such as ER.Expr
and cellularity
have only two or a few discrete values. Let’s see how they are handled by lm()
.
With the TCGA data, we learned how ER expression status can be an indicator of survival.
# ER.Expr values
table(dead_clin$ER.Expr)
##
## - +
## 91 230
Include ER.Expr
as an independent feature in your model:
# Combine AURKA expression with ER status in a regression model
# Uncomment the line below and replace FEATURE
summary(lm(dead_clin$T ~ dead_clin$ER.Expr + aurka_dead))
##
## Call:
## lm(formula = dead_clin$T ~ dead_clin$ER.Expr + aurka_dead)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2498.6 -808.0 -203.8 674.1 4241.8
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6111.53 770.71 7.930 3.76e-14 ***
## dead_clin$ER.Expr+ 687.18 160.87 4.272 2.57e-05 ***
## aurka_dead -576.20 90.81 -6.345 7.63e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1217 on 318 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.2172, Adjusted R-squared: 0.2123
## F-statistic: 44.12 on 2 and 318 DF, p-value: < 2.2e-16
Your first observation here might be: What happened to ER -
?
When given categorical variables, R will treat one of the values as the intercept for the line it fits.
We can therefore interpret the remaining coefficient ER.Expr+
as the change in survival time when ER status changes from -
to +
. As we might expect, patients who are ER+ are predicted to survive longer. We can see this by looking at the slope (aka Estimate Std.) of dead_clin$ER.Expr+
.
Cellularity is the percentage of the tumor volume occupied by invasive tumor cells. We expect this may be important to our model as well.
Make a table for the feature cellularity
.
table(dead_clin$cellularity)
##
## high moderate
## 178 143
Include cellularity
as an independent feature in your model:
# Combine AURKA expression with cellularity in a regression model
# Look at the summary of the model
summary(lm(dead_clin$T ~ dead_clin$cellularity + aurka_dead))
##
## Call:
## lm(formula = dead_clin$T ~ dead_clin$cellularity + aurka_dead)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3045.2 -829.8 -281.4 756.7 4439.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7532.85 719.79 10.465 < 2e-16 ***
## dead_clin$cellularitymoderate 113.84 142.46 0.799 0.425
## aurka_dead -699.79 88.68 -7.891 4.87e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1250 on 318 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.174, Adjusted R-squared: 0.1688
## F-statistic: 33.49 on 2 and 318 DF, p-value: 6.351e-14
What do your results tell you about the effect of cellularity on time-to-death?
Putting all the bits together, we can write a function, gene_lm()
that takes the time-to-death (y
) and fits it to some variable (x
), such as the expression of AURKA, and extracts the slope (direction of the association) and statistical significance (pval).
More generally, yes, we can write our own functions in R!
# Do a linear fit of y ~ x and return the slope and p-value
gene_lm <- function(y, x) {
gene_model_summary <- NULL
pval <- NA
slope <- NA
try(gene_model_summary <- summary(lm(y ~ x)))
if (!is.null(gene_model_summary)) {
coef_gene_model <- coefficients(gene_model_summary)
pval = coef_gene_model[2, 4]
slope = coef_gene_model[2, 1]
}
return(c(pval, slope))
}
Check that you get the same result as before.
# Run your new function
gene_lm(dead_clin$T, aurka_dead)
## [1] 8.426028e-15 -7.117943e+02
We could check every gene in the expression matrix to see if it is significantly correlated with time-to-death of the METABRIC patients. Fortunately with R, we can easily loop over all the genes. We’ll create an empty matrix of results for storing the p-value and slope for each gene. Altogether, 15,834 genes were measured and included in the expression matrix.
# Try the prediction first for 5000 genes
# Later, try it for all genes, i.e. nrow(dead_expr)
ngenes <- nrow(dead_expr)
ngenes <- 5000
lm_results = matrix(nrow = ngenes, ncol = 2, data = NA)
colnames(lm_results) <- c("pval", "slope")
rownames(lm_results) <- rownames(dead_expr)[1:ngenes]
for (i in 1:ngenes) {
lm_results[i, ] <- gene_lm(dead_clin$T, dead_expr[i,])
}
And we check on AURKA again.
lm_results[rownames(lm_results) == "AURKA", ]
## pval slope
## 8.426028e-15 -7.117943e+02
That’s it. We now have the regression of every gene against time-to-death. There are some other things we need to do in practice, like correcting the p-values for the number of tests we did. If you check 1000 genes, then for p = 0.01 you would expect roughly 10 genes to be that significant by chance (0.01*1000).
From the results table lm_results
, we can order by p-value to see which genes are most significantly associated with death (either favourably or unfavourably).
lm_results = lm_results[order(lm_results[, 1]), ]
head(lm_results,20)
## pval slope
## AVP 5.637819e-18 1678.9654
## CENPA 1.130745e-16 -855.0454
## ANLN 7.649051e-16 -995.2773
## RABEP1 2.573654e-15 627.6337
## AURKA 8.426028e-15 -711.7943
## FAM83D 4.566357e-14 -733.2090
## MCM10 7.459059e-14 -751.7670
## TRIP13 9.654274e-14 -710.3487
## STMN1 4.681205e-13 -807.2726
## SEC14L2 7.382927e-13 501.6181
## KIAA0141 8.697885e-13 1449.0480
## MXD4 9.772895e-13 934.6374
## KARS 1.139240e-12 -1169.7592
## ESR1 1.412970e-12 236.0715
## TPX2 1.464180e-12 -732.2328
## MZF1 1.700396e-12 822.2763
## CYBRD1 2.220941e-12 499.1818
## CDC20 2.256791e-12 -485.7147
## HPN 2.285301e-12 482.4860
## CPEB3 2.874603e-12 1068.1870
AURKA is near the top. The ANLN gene codes for the protein Anillin, an actin-binding protein required for cytokinesis, is a prognostic marker panel in breast cancer..
Several of the genes seem to be favorable for survival, such as MXD4 and RABEP1. You can learn more about their functions at the UniProt Knowledgebase.
Let’s use your model for prediction. You know already that your model accounts for only ~20% of the variance in time-to-death, so we don’t expect stellar performance.
If you look at dead_clin
, you’ll see there are a couple of cases where dead_clin$T
has a value of NA
:
which(is.na(dead_clin$T))
## [1] 87 316
Predictions, which are number values for time-of-death, will be made for all patients. To compare the predicted values with the real values, we should remove the patients with NA
values.
new_dead_clin<-dead_clin[-c(87,316),]
new_dead_expr<-dead_expr[,-c(87,316)]
Let’s create a model based on expression of a gene that may be important for survival. We’ve been focussing on AURKA, but you can choose another- perhaps one of the most significant genes from lm_result
when you modeled survival for all genes.
For prediction, it is easiest if we add the column for gene expression to the clinical data frame.
# pick the gene to use in your model here!
my_model_gene <- "AVP"
# add gene expression to the training clinical table above
gene_exp <- as.vector(new_dead_expr[my_model_gene, ])
new_dead_clin$gene_exp <- gene_exp
Now train your model for survival as a function of gene expression…
gene_model <- lm(T ~ gene_exp, data = new_dead_clin)
summary(gene_model)
##
## Call:
## lm(formula = T ~ gene_exp, data = new_dead_clin)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2986.5 -860.2 -243.7 561.4 4617.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -8270.8 1125.1 -7.351 1.66e-12 ***
## gene_exp 1679.0 182.9 9.179 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1222 on 319 degrees of freedom
## Multiple R-squared: 0.2089, Adjusted R-squared: 0.2064
## F-statistic: 84.24 on 1 and 319 DF, p-value: < 2.2e-16
Look at the coefficient named gene_exp
. Based on this coefficient, is the relationship between the expression of your gene and survival time positive or negative?
You see that your gene has some explanatory value (Multiple R-squared
), let’s see if it has predictive value.
The function predict.lm
returns predicted values for the dependent feature, in our case time-to-death T
.
# The object my_prediction will hold your predicted values for T
my_prediction <- data.frame(metabric_id = new_dead_clin$metabric_id,
T = predict.lm(gene_model,
newdata = new_dead_clin))
head(my_prediction)
## metabric_id T
## 1 MB_0362 1569.578
## 2 MB_0346 1223.711
## 5 MB_0185 1369.781
## 11 MB_0189 1530.962
## 19 MB_0223 2118.599
## 26 MB_0906 1379.855
To get a sense of how well you model did, let’s compare the predicted values with the real values:
# correlation
cor(my_prediction$T,new_dead_clin$T)
## [1] 0.4570749
What do you think?
Take it a step further! Include ER.expr
in a predictive model.
First, create the model:
# Uncomment and replace Y ~ X + Z with the correct variables
er_gene_model <- lm(T ~ gene_exp + ER.Expr, data = new_dead_clin)
Then, perform the prediction:
# Uncomment and replace FUNCTION with what need for prediction
my_prediction <- data.frame(metabric_id = new_dead_clin$metabric_id,
T = predict.lm(er_gene_model,
newdata = new_dead_clin))
Finally, calculate the correlation between your prediction and the real values:
cor(my_prediction$T,new_dead_clin$T)
## [1] 0.4864656
How does your new model compare?
Congratulations on accomplishing another activity!! You’ve done a tremendous amount of work this summmer and learned a wide range of data analysis methods with many different data types. Great work.
Don’t forget to knit and publish to Rpubs!