Predictive modeling for cancer prognosis

About this activity
Preliminaries
Loading & inspecting data
Exploring AURKA expression & survival
Modeling survival for all genes

About this activity

This module will introduce you to loading and manipulating the data from the breast cancer METABRIC dataset, visualizing and working with gene expression measurements, and building predictive models based on the expression of many different genes (and clinical data too).

The METABRIC dataset is from another large breast cancer patient study and was actually used for the original DREAM Breast Cancer Prognosis Challenge. After taking an initial look at modeling survival in the TCGA BRCA cohort, we realized there wasn’t much interesting to see. This is a useful lesson in making sure you have the right data for the analysis! In a study published earlier this year, a group of researchers carefully assessed which clinical outcomes can be reliably used for modeling across TCGA cancer types — this study also produced the TCGA Pan-Cancer Clinical Data Resource (CDR) dataset that we’ve seen in a couple modules.

For now, we’ve switched to the METABRIC study, where we know from previous experience there are at least some strong relationships between gene expression and survival.

Preliminaries

Loading & inspecting data

Let’s load the data — both the expression matrix and a table of clinical information. They are saved conveniently in files of type .RData as with previous modules. These are binary files (i.e., not human readable) that store multiple R objects (results) that you create in the process of an R analysis. We’ll load the datasets, and list what we loaded with the ls() command.

load(file.path(data_dir, "metabric_disc_expr_df.RData"))
load(file.path(data_dir, "metabric_disc_clin_df.RData"))
load(file.path(data_dir, "metabric_expr_gene_info.RData"))
load(file.path(data_dir, "metabric_disc_expr_mat.RData"))
ls()

## [1] "data_dir"               "gene_df"                "metabric_disc_clin_df" 
## [4] "metabric_disc_expr_df"  "metabric_disc_expr_mat"

The clinical data frame is a bit unwieldy — we can use View() to take a look. In RStudio, you can also click on the variable under the Environment tab in the upper right panel. Reminder: running the chunk below will open the viewer in a new tab; you can switch back to this tab to resume the activity.

View(metabric_disc_clin_df)

Reminder: We can find the dimensions of a matrix or data frame with the dim() command, to see just how large the expression data matrix is…

dim(metabric_disc_clin_df)

## [1] 997  32

dim(metabric_disc_expr_df)

## [1] 15834   999

Each row of the clinical data frame is one patient, and each column of the expression data frame is for one patient (except for the first two, which store the gene information for each row). Let’s check whether they’re both in the same order:

metabric_disc_clin_df$metabric_id[1:10]

##  [1] "MB_0362" "MB_0346" "MB_0386" "MB_0574" "MB_0185" "MB_0503" "MB_0641"
##  [8] "MB_0201" "MB_0218" "MB_0316"

# uncomment the line below and replace "function" with 
# the actual function to get column names

# function(metabric_disc_expr_df)[3:12]

Let’s convert the expression data frame into a matrix for more convenient use below:

metabric_disc_expr_mat[1:5, 1:5]

##          MB_0362 MB_0346 MB_0386 MB_0574 MB_0185
## TSPAN6     7.579   7.819   7.366   8.110   5.993
## TNMD       6.211   5.557   8.108   5.789   5.484
## DPM1       9.150   9.794   9.229   9.672  10.159
## SCYL3      7.193   7.845   6.777   6.863   6.692
## C1orf112   7.835   8.084   6.895   6.987   7.173

And it never hurts to do yet another check to make sure nothing went astray… The match() function tells us the position in the 2nd vector where each element in the 1st vector can be found — ideally, we want these numbers to be increasing from 1 to the length of the vector.

dim(metabric_disc_clin_df)

## [1] 997  32

dim(metabric_disc_expr_mat)

## [1] 15834   997

match(metabric_disc_clin_df$metabric_id, colnames(metabric_disc_expr_mat))[1:50]

##  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
## [26] 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50

Now we are ready to do some actual analysis!

Exploring AURKA expression & survival

We happen to know that high AURKA expression in a tumor is bad for the patient’s outcome (https://doi.org/10.1093/bioinformatics/btn374). AURKA (Aurora kinase A) encodes for a protein that is involved in regulation of progression through the cell cycle. Various studies have associated it with tumour progression and it is thought to promote metastasis.

aurka_row <- which(rownames(metabric_disc_expr_mat) == "AURKA")
aurka_row

## [1] 1595

aurka <- as.vector(metabric_disc_expr_mat[aurka_row, ])
aurka[1:50]

##  [1]  7.918  9.626  7.689  7.738  8.612  7.083  6.769  8.211  9.697  8.645
## [11]  7.236  6.220  8.678  7.500  6.362  7.416 11.767  8.489  7.208  7.195
## [21]  6.991  6.999  6.183  8.220  7.532  7.805  8.173  8.831  9.731  9.217
## [31]  7.931  7.292  8.078  8.921  8.413  6.577  7.508  8.768  7.927  7.248
## [41]  8.261  7.996  6.849  6.447  9.380  7.629  7.582  7.513  8.531  8.504

How is the expression of AURKA related to being alive at 5 years? Here are a few different ways to look at the relationship.

boxplot(aurka ~ metabric_disc_clin_df$survival_5y, 
        xlab = "Alive at 5 yrs", ylab = "AURKA expression level")

stripchart(aurka ~ metabric_disc_clin_df$survival_5y, 
           vertical = TRUE, method = "jitter", pch = '.', 
           cex = 3, add = TRUE)

summary(lm(aurka ~ metabric_disc_clin_df$survival_5y))

## 
## Call:
## lm(formula = aurka ~ metabric_disc_clin_df$survival_5y)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7165 -0.6235 -0.1070  0.5700  3.9225 
## 
## Coefficients:
##                                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                            7.84454    0.04539 172.813  < 2e-16 ***
## metabric_disc_clin_df$survival_5yTRUE -0.22350    0.05595  -3.995 6.95e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.837 on 993 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.01582,    Adjusted R-squared:  0.01483 
## F-statistic: 15.96 on 1 and 993 DF,  p-value: 6.952e-05

Removing censored patients

Another way of looking at things is to just take the patients who died of disease, and to look at how long it was before that occurred. To do that, we’ll want to exclude the rows and columns for patients who alive at the last time they were followed OR who died of other causes. These patients are known as “censored” events.

table(metabric_disc_clin_df$last_follow_up_status)

## 
##      a      d d-d.s. d-o.c. 
##    548     46    259    126

Rather than filtering out patients with last_follow_up_status equal to a or d-o.c., we can actually use the censored column in the data frame. You’ll see that the number of censored patients (674) is equal to the total of alive (548) and died of other causes (126) patients.

table(metabric_disc_clin_df$censored)

## 
## FALSE  TRUE 
##   323   674

So, now we’ll drop the censored patients from the data.

drop_pats <- which(metabric_disc_clin_df$censored)
dead_clin <- metabric_disc_clin_df[-drop_pats, ]
dead_expr <- metabric_disc_expr_mat[, -drop_pats]

dim(dead_clin)

## [1] 323  32

dim(dead_expr)

## [1] 15834   323

That’s less than half of patients in total — though note that only 259 were specifically indicated to have died from breast cancer; the 46 patients labelled as d we think died of cancer, but the cause for these patients could be unknown. This is often a problem — the unknown or “other cause”" might have been something completely unrelated, or something that was indirectly caused by having cancer, so we usually want to be careful about how we treat these cases.

Modeling survival and AURKA expression

Let’s see how the expression level of AURKA is related to time-to-death with a plot that includes a trend line representing the linear model from lm().

# make sure to remove censored patients from the `aurka` vector
aurka_dead = aurka[-drop_pats]

plot(dead_clin$T ~ aurka_dead)

aurka_model <- lm(dead_clin$T ~ aurka_dead)
abline(aurka_model, col = "red")

There are several things in the output of lm() that might be interesting, and the summary() of it has even more detail, such as p-values. Here we store it as an object and then see what it includes.

aurka_model_summary <- summary(aurka_model)
aurka_model_summary

## 
## Call:
## lm(formula = dead_clin$T ~ aurka_dead)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3003.3  -844.4  -256.2   773.7  4508.6 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  7678.72     695.86  11.035  < 2e-16 ***
## aurka_dead   -711.79      87.35  -8.149 8.43e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1250 on 319 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.1723, Adjusted R-squared:  0.1697 
## F-statistic: 66.41 on 1 and 319 DF,  p-value: 8.426e-15

Specifically, we can pull out the estimate (slope) for how AURKA is associated with time to death, and the p-value.

coef_aurka_model = coefficients(aurka_model_summary)

slope = coef_aurka_model[2, 1]
pval = coef_aurka_model[2, 4]

slope

## [1] -711.7943

pval

## [1] 8.426028e-15

Including clinical variables with expression

Continuous variables

The number of lymph nodes found to be positive for cancer cells at the time of diagnosis can also be a strong predictor of time until the person dies. We would probably want to allow for these variables like these in a survival model.

There are many ways to include multiple variables in a model — one would be to fit the time to death as a function of both positive lymph node coudnt and other variables (e.g., AURKA expression). We can do this easily for the combination of positive lymph nodes and AURKA with lm().

# uncomment the line below and replace "function" with 
# the function we use to fit a linear model

# summary(function(dead_clin$T ~ dead_clin$lymph_nodes_positive + aurka_dead))

By “controlling” for survival differences related to positive lymph nodes, we’re able to focus our model on differences that might be more specifically related to AURKA expression. If, alternatively, we included lymph_nodes_positive and the effect of aurka_dead disappeared, we might deduce that these variables were confounded in some way.

Categorical variables

We’ve also learned about how ER expression status can be an indicator of survival. In this case, we have to “categories” among our patients: “positive” for ER (+) and “negative” for ER (-).

table(dead_clin$ER.Expr)

## 
##   -   + 
##  91 230

summary(lm(dead_clin$T ~ dead_clin$ER.Expr + aurka_dead))

## 
## Call:
## lm(formula = dead_clin$T ~ dead_clin$ER.Expr + aurka_dead)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2498.6  -808.0  -203.8   674.1  4241.8 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6111.53     770.71   7.930 3.76e-14 ***
## dead_clin$ER.Expr+   687.18     160.87   4.272 2.57e-05 ***
## aurka_dead          -576.20      90.81  -6.345 7.63e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1217 on 318 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.2172, Adjusted R-squared:  0.2123 
## F-statistic: 44.12 on 2 and 318 DF,  p-value: < 2.2e-16

Your first observation here might be: “what happened to ER -?” When given categorical variables, R will treat one of the values as the intercept for the line it fits. We can therefore interpret the remaining coefficient ER.Expr+ as “the change in survival time when ER status changes from - to +”. As we might expect, patients who are ER+ are predicted to survive longer.

Categorical variables …?

How about another common clinical variable, such as tumor “grade”?

table(dead_clin$grade)

## 
##   1   2   3 
##  15 107 199

Including grade with AURKA expression would look like this:

summary(lm(dead_clin$T ~ dead_clin$grade + aurka_dead))

## 
## Call:
## lm(formula = dead_clin$T ~ dead_clin$grade + aurka_dead)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2711.3  -829.8  -269.3   703.7  4610.4 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      7725.19     687.70  11.233  < 2e-16 ***
## dead_clin$grade  -388.02     130.92  -2.964  0.00327 ** 
## aurka_dead       -591.69      95.34  -6.206 1.69e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1235 on 318 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.1946, Adjusted R-squared:  0.1895 
## F-statistic: 38.41 on 2 and 318 DF,  p-value: 1.147e-15

We see that increasing grade has a negative relationship with survival time (as we might expect), and that relationship is significant — great! But wait… something is funny about how lm() treated the grade variable. Can you tell what’s wrong? Think about how the model looked when we considered the number of positive lymph nodes compared to ER status.

Cue Jeopardy music…

The problem here is that R doesn’t know grade numbers (1, 2, 3) represent different categories, so it just treats them as continuous values. Let’s see what that looks like with a plot of time versus grade.

plot(dead_clin$T ~ dead_clin$grade)

What we need to do is convert the grade variable into a factor — this tells R to treat each “level” of the factor as a distinct category. Try the plot now.

plot(dead_clin$T ~ as.factor(dead_clin$grade))

Better! How does our model look?

summary(lm(dead_clin$T ~ as.factor(dead_clin$grade) + aurka_dead))

## 
## Call:
## lm(formula = dead_clin$T ~ as.factor(dead_clin$grade) + aurka_dead)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2642.2  -818.3  -238.0   655.5  4638.3 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  6951.95     741.12   9.380  < 2e-16 ***
## as.factor(dead_clin$grade)2    91.47     343.70   0.266    0.790    
## as.factor(dead_clin$grade)3  -430.85     347.36  -1.240    0.216    
## aurka_dead                   -590.26      95.15  -6.203 1.73e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1232 on 317 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.2003, Adjusted R-squared:  0.1927 
## F-statistic: 26.47 on 3 and 317 DF,  p-value: 2.655e-15

Note that for each increasing grade above “1” (denoted as the Intercept term), the relationship with survival time becomes more negative. While grade 3 shows a negative relationship relative to grade 2, the lack of statistical significance (p > 0.05) suggests that the evidence for this relationship isn’t super strong.

We won’t actually include grade in our models below. Long story short, interpreting the relationship between survival time and tumor grade (at least for this set of patients) is a bit confusing.

Modeling survival for all genes

Putting all these bits together, we can write a function that takes the time to death (y below) does a fit to some variable x, such as the expression of AURKA, and extracts the slope (direction of the association) and statistical significance.

# Do a linear fit of y ~ x and return the slope and p-value
gene_lm <- function(y, x) {
    gene_model_summary <- NULL
    pval <- NA
    slope <- NA
    try(gene_model_summary <- summary(lm(y ~ x)))
    if (!is.null(gene_model_summary)) {
        coef_gene_model <- coefficients(gene_model_summary)
        pval = coef_gene_model[2, 4]
        slope = coef_gene_model[2, 1]
    }
    return(c(pval, slope))
}

Check this got the same as we had before.

gene_lm(dead_clin$T, aurka_dead)

## [1]  8.426028e-15 -7.117943e+02

Now, we could check every gene in the expression matrix to see if it is significantly correlated with time to death of the METABRIC patients. Fortunately with R, we can easily loop over all the genes. Altogether, there are 24,368 genes measured. For now, let’s just take the first 5000 for speed. We will create an empty matrix of results that we can store the p-value and slope in for each gene.

Note that these models aren’t including any information about stage or grade.

# If we were doing all genes, `ngenes` would be nrow(dead_expr), i.e. the number of rows of genes
ngenes <- 5000
lm_results = matrix(nrow = ngenes, ncol = 2, data = NA)
colnames(lm_results) <- c("pval", "slope")
rownames(lm_results) <- rownames(dead_expr)[1:ngenes]
for (i in 1:ngenes) {
    lm_results[i, ] <- gene_lm(dead_clin$T, dead_expr[i,])
}

And we check on AURKA again.

lm_results[rownames(lm_results) == "AURKA", ]

##          pval         slope 
##  8.426028e-15 -7.117943e+02

That’s it. We now have the regression of every gene against time to death. There are some other things we need to do in practice, like correcting the p-values for the number of tests we did. If you check 1000 genes, then for p = 0.01 you would expect roughly 10 genes to be that significant by chance (0.01*1000). From the table generated, we can order by p-value to see which genes are most statistically significantly associated with death (either favourable or unfavourably).

lm_results = lm_results[order(lm_results[, 1]), ]

head(lm_results,20)

##                  pval      slope
## AVP      5.637819e-18  1678.9654
## CENPA    1.130745e-16  -855.0454
## ANLN     7.649051e-16  -995.2773
## RABEP1   2.573654e-15   627.6337
## AURKA    8.426028e-15  -711.7943
## FAM83D   4.566357e-14  -733.2090
## MCM10    7.459059e-14  -751.7670
## TRIP13   9.654274e-14  -710.3487
## STMN1    4.681205e-13  -807.2726
## SEC14L2  7.382927e-13   501.6181
## KIAA0141 8.697885e-13  1449.0480
## MXD4     9.772895e-13   934.6374
## KARS     1.139240e-12 -1169.7592
## ESR1     1.412970e-12   236.0715
## TPX2     1.464180e-12  -732.2328
## MZF1     1.700396e-12   822.2763
## CYBRD1   2.220941e-12   499.1818
## CDC20    2.256791e-12  -485.7147
## HPN      2.285301e-12   482.4860
## CPEB3    2.874603e-12  1068.1870