About this activity

This module will introduce you to loading and manipulating the data from the breast cancer METABRIC dataset, visualizing and working with gene expression measurements, and building predictive models based on the expression of many different genes and the associated clinical data.

The METABRIC dataset contains more patient survivial information than we had with the TCGA data.


Loading & inspecting data

Let’s load the data — both the expression matrix (already log-transformed) and a table of clinical information.

load("/home/data/metabric_disc_clin_df.RData")
load("/home/data/metabric_disc_expr_mat.RData")
ls()
## [1] "metabric_disc_clin_df"  "metabric_disc_expr_mat"

Check out the clinical data frame. Many of the features are similar to those in the TCGA breast cancer clinical data frame.

# What's in the clinical data frame?

View(metabric_disc_clin_df)

Note that the expression matrix has already been log-transformed. How many rows and columns does metabric_disc_expr_mat have?

Reminder: The dimensions of a matrix can be found with the dim() function.

# How many rows and columns in the expression matrix?
# Find out: uncomment the lines below and replace FUNCTION and OBJECT

 dim(metabric_disc_expr_mat)
## [1] 15834   997

What’s in the expression matrix?

Reminder: Indexing can be used to get specific rows and some columns of a matrix.

# The expression matrix is already log-transformed
# Uncomment and use indexing to view the first 5 rows and 5 columns

 metabric_disc_expr_mat[1:5 , 1:5 ]
##          MB_0362 MB_0346 MB_0386 MB_0574 MB_0185
## TSPAN6     7.579   7.819   7.366   8.110   5.993
## TNMD       6.211   5.557   8.108   5.789   5.484
## DPM1       9.150   9.794   9.229   9.672  10.159
## SCYL3      7.193   7.845   6.777   6.863   6.692
## C1orf112   7.835   8.084   6.895   6.987   7.173

Each row of the clinical data frame is one patient, and each column of the expression data frame is for one patient.

Check whether the patient IDs in the expression and clinical data frames are in the same order:

# Are the patient IDs the same and in the same order?

identical(metabric_disc_clin_df$metabric_id,colnames(metabric_disc_expr_mat))
## [1] TRUE

Now we are ready to do some actual analysis!


Exploring AURKA expression & survival

We happen to know that high AURKA expression in a tumor is bad for the patient’s outcome.

AURKA (Aurora kinase A) encodes for a protein that is involved in regulation of the cell cycle. Various studies have associated it with tumour progression and metastasis.

# Row number for AURKA in the expression data

aurka_row <- which(rownames(metabric_disc_expr_mat) == "AURKA")
aurka_row
## [1] 1595

Make an object that contains the expression of AURKA across patients:

# as.vector turns a row into a one-dimensional array
aurka <- as.vector(metabric_disc_expr_mat[aurka_row, ])

# Uncomment and use indexing to look at the first 50 values

aurka[1:50]
##  [1]  7.918  9.626  7.689  7.738  8.612  7.083  6.769  8.211  9.697  8.645
## [11]  7.236  6.220  8.678  7.500  6.362  7.416 11.767  8.489  7.208  7.195
## [21]  6.991  6.999  6.183  8.220  7.532  7.805  8.173  8.831  9.731  9.217
## [31]  7.931  7.292  8.078  8.921  8.413  6.577  7.508  8.768  7.927  7.248
## [41]  8.261  7.996  6.849  6.447  9.380  7.629  7.582  7.513  8.531  8.504

Patient survival status

Since high AURKA expression is associated with tumor progression, we are interested in clinical features for which AURKA expression may be predictive.

The feature last_follow_up_status tells whether patients being followed in this study were alive or had passed at that time.

Reminder: There is a function that returns a feature in table form.

# What status can patients have at the last time they were checked?
# Uncomment and replace FUNCTION to find each category and its frequency

table(metabric_disc_clin_df$last_follow_up_status)
## 
##      a      d d-d.s. d-o.c. 
##    548     46    259    126

What do each of the categories mean?

At time of last follow up, a patient can have a status:

  • a: alive
  • d: dead
  • d-d.s.: dead from disease
  • d-o.c: dead of other causes

259 patients were specifically indicated to have died from breast cancer (d-d.s.). There are 46 patients labelled as d. We think they died of cancer, but the cause for these patients could be unknown.

Even for the patients who died of unknown cause d-o.c., the cause may have been indirectly caused by having cancer, so we usually want to be careful about how we treat these cases.


Censored records

Another perspective is to consider only those patients who died of disease and examine how long since diagnosis that death occurred.

We have to exclude the rows and columns for patients who are alive and those who died of other causes. These cases are known as censored events.

Rather than filtering out patients with last_follow_up_status equal to a or d-o.c., we can use the censored column in the data frame.

# How many patients are alive or have died?

table(metabric_disc_clin_df$censored)
## 
## FALSE  TRUE 
##   323   674

Compare these number to the numbers for a, d, d-d.s. and d-o.c. What do TRUE and FALSE correspond to? Living status

Why do we want to remove patients with censored equal to TRUE? At the last follow-up, they were alive

# Remove the censored patients from the data
# The function which() gives the row number for censored = TRUE

drop_pats <- which(metabric_disc_clin_df$censored)

# Create clincial dataframe and expression matrix for patients who have died
dead_clin <- metabric_disc_clin_df[-drop_pats, ]
dead_expr <- metabric_disc_expr_mat[, -drop_pats]

dim(dead_clin)
## [1] 323  32
dim(dead_expr)
## [1] 15834   323

To test whether the code did what you expected, check out the new clinical data frame:

View(dead_clin)

The time-to-death (the feature T) is measured in days since cancer diagnosis.

Modeling survival and AURKA expression

How is time-to-death related to the expression level of AURKA? As AURKA expression increases, time to death decreases We’ll create a plot that includes a trend line representing the linear model from lm().

# make sure to remove censored patients from the `aurka` vector
aurka_dead = aurka[-drop_pats]

# model time to death as a function of AURKA expression
aurka_model <- lm(dead_clin$T ~ aurka_dead)

# create a scatterplot to check your result
plot(dead_clin$T ~ aurka_dead,
     main = "Time to death as a function of AURKA expression",
     ylab = "Time to death, days",
     xlab = "AURKA expression")

abline(aurka_model, col = "red")

Time-to-death is inversely correlation with AURKA expression.

As we saw in the mtcars_linear_regression.Rmd activity, there are several things in the output of lm() that are interesting. We’ll store the summary information as an object and extract the values we want.

# Store the summary of the model you already made
aurka_model_summary <- summary(aurka_model)

# Uncomment and replace FUNCTION to get the coefficients of your model
 coef_aurka_model = coefficients(aurka_model_summary)

# Get slope and p-value
slope = coef_aurka_model[2, 1]
pval = coef_aurka_model[2, 4]

# Uncomment and replace FEATURE tp get R-squared
r_squared <- aurka_model_summary$r.squared

slope
## [1] -711.7943
r_squared
## [1] 0.1723047
pval
## [1] 8.426028e-15

The time to death decreases as AURKA expression increases.


Add clinical variables to your model

In the code chunk above, you should have found that AURKA expression explains 17% of the variance (r.squares) in time to death (dead_clin$T).

Let’s include more factors in the model.


Continuous variables

AURKA expression in aurka_dead is a continuous variable. You can include other continuous variables in your model by adding them. For example:

  • For one independent variable: y ~ x
  • For two independent variables: y ~ x + z

Number of positive lymph nodes

The number of lymph nodes found to be positive for cancer cells at the time of diagnosis, lymph_nodes_positive, can be a strong predictor of time until the person dies. We would probably want to allow for variables like this in a survival model.

# Add lymph_nodes_positive as a feature to the model
# Uncomment and replace FUNCTION to create the linear model 

summary(lm(dead_clin$T ~ dead_clin$lymph_nodes_positive + aurka_dead))
## 
## Call:
## lm(formula = dead_clin$T ~ dead_clin$lymph_nodes_positive + aurka_dead)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2400.1  -857.5  -179.4   679.9  4308.6 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                     7599.93     676.37  11.236  < 2e-16 ***
## dead_clin$lymph_nodes_positive   -62.72      14.07  -4.459 1.14e-05 ***
## aurka_dead                      -678.83      85.19  -7.968 2.90e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1214 on 318 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.221,  Adjusted R-squared:  0.2161 
## F-statistic: 45.11 on 2 and 318 DF,  p-value: < 2.2e-16

Does time-to-death increase or decrease with lymph_nodes_positive? How much variance is explained with the new model?
As more lymph nodes are expressed, then time to death decreases.


NPI

Try the same with NPI. The Nottingham Prognostic Index (NPI) is a tool used to predict the prognosis of breast cancer patients after surgery. It combines three key factors: tumor size, lymph node involvement, and tumor grade. A higher score corresponds to poorer prognosis.

# Uncomment and replace DEPENDENT_FEATURE with the feature you want to predict
# and replace FEATURE with feature in the clinical data for NPI

 summary(lm(dead_clin$T ~ dead_clin$NPI + aurka_dead))
## 
## Call:
## lm(formula = dead_clin$T ~ dead_clin$NPI + aurka_dead)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2830.8  -782.4  -260.5   605.3  4298.4 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    7971.96     677.71  11.763  < 2e-16 ***
## dead_clin$NPI  -294.20      63.74  -4.616 5.69e-06 ***
## aurka_dead     -579.69      89.40  -6.484 3.40e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1212 on 318 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.2243, Adjusted R-squared:  0.2194 
## F-statistic: 45.97 on 2 and 318 DF,  p-value: < 2.2e-16

Are your results consistent with the definition of the NPI score? Yes —

Categorical variables

Categorical variables, such as ER.Expr and cellularity have only two or a few discrete values. Let’s see how they are handled by lm().


ER status

With the TCGA data, we learned how ER expression status can be an indicator of survival.

# ER.Expr values
table(dead_clin$ER.Expr)
## 
##   -   + 
##  91 230

Include ER.Expr as an independent feature in your model:

# Combine AURKA expression with ER status in a regression model
# Uncomment the line below and replace FEATURE

summary(lm(dead_clin$T ~ dead_clin$ER.Expr + aurka_dead))
## 
## Call:
## lm(formula = dead_clin$T ~ dead_clin$ER.Expr + aurka_dead)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2498.6  -808.0  -203.8   674.1  4241.8 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         6111.53     770.71   7.930 3.76e-14 ***
## dead_clin$ER.Expr+   687.18     160.87   4.272 2.57e-05 ***
## aurka_dead          -576.20      90.81  -6.345 7.63e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1217 on 318 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.2172, Adjusted R-squared:  0.2123 
## F-statistic: 44.12 on 2 and 318 DF,  p-value: < 2.2e-16

Your first observation here might be: What happened to ER -?

When given categorical variables, R will treat one of the values as the intercept for the line it fits.

We can therefore interpret the remaining coefficient ER.Expr+ as the change in survival time when ER status changes from - to +. As we might expect, patients who are ER+ are predicted to survive longer. We can see this by looking at the slope (aka Estimate Std.) of dead_clin$ER.Expr+.


Cellularity

Cellularity is the percentage of the tumor volume occupied by invasive tumor cells. We expect this may be important to our model as well.

Make a table for the feature cellularity.

table(dead_clin$cellularity)
## 
##     high moderate 
##      178      143

Include cellularity as an independent feature in your model:

# Combine AURKA expression with cellularity in a regression model
# Look at the summary of the model

summary(lm(dead_clin$T ~ dead_clin$cellularity + aurka_dead))
## 
## Call:
## lm(formula = dead_clin$T ~ dead_clin$cellularity + aurka_dead)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3045.2  -829.8  -281.4   756.7  4439.1 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    7532.85     719.79  10.465  < 2e-16 ***
## dead_clin$cellularitymoderate   113.84     142.46   0.799    0.425    
## aurka_dead                     -699.79      88.68  -7.891 4.87e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1250 on 318 degrees of freedom
##   (2 observations deleted due to missingness)
## Multiple R-squared:  0.174,  Adjusted R-squared:  0.1688 
## F-statistic: 33.49 on 2 and 318 DF,  p-value: 6.351e-14

What do your results tell you about the effect of cellularity on time-to-death?

As cellularity increases, time to death decreases.

Modeling survival for all genes

Putting all the bits together, we can write a function, gene_lm() that takes the time-to-death (y) and fits it to some variable (x), such as the expression of AURKA, and extracts the slope (direction of the association) and statistical significance (pval).

More generally, yes, we can write our own functions in R!

# Do a linear fit of y ~ x and return the slope and p-value
gene_lm <- function(y, x) {
    gene_model_summary <- NULL
    pval <- NA
    slope <- NA
    try(gene_model_summary <- summary(lm(y ~ x)))
    if (!is.null(gene_model_summary)) {
        coef_gene_model <- coefficients(gene_model_summary)
        pval = coef_gene_model[2, 4]
        slope = coef_gene_model[2, 1]
    }
    return(c(pval, slope))
}

Check that you get the same result as before.

# Run your new function

gene_lm(dead_clin$T, aurka_dead)
## [1]  8.426028e-15 -7.117943e+02

We could check every gene in the expression matrix to see if it is significantly correlated with time-to-death of the METABRIC patients. Fortunately with R, we can easily loop over all the genes. We’ll create an empty matrix of results for storing the p-value and slope for each gene. Altogether, 15,834 genes were measured and included in the expression matrix.

# Try the prediction first for 5000 genes
# Later, try it for all genes, i.e. nrow(dead_expr)

ngenes <- nrow(dead_expr)
ngenes <- 5000
lm_results = matrix(nrow = ngenes, ncol = 2, data = NA)
colnames(lm_results) <- c("pval", "slope")
rownames(lm_results) <- rownames(dead_expr)[1:ngenes]
for (i in 1:ngenes) {
    lm_results[i, ] <- gene_lm(dead_clin$T, dead_expr[i,])
}

And we check on AURKA again.

lm_results[rownames(lm_results) == "AURKA", ]
##          pval         slope 
##  8.426028e-15 -7.117943e+02

That’s it. We now have the regression of every gene against time-to-death. There are some other things we need to do in practice, like correcting the p-values for the number of tests we did. If you check 1000 genes, then for p = 0.01 you would expect roughly 10 genes to be that significant by chance (0.01*1000).

From the results table lm_results, we can order by p-value to see which genes are most significantly associated with death (either favourably or unfavourably).

lm_results = lm_results[order(lm_results[, 1]), ]

head(lm_results,20)
##                  pval      slope
## AVP      5.637819e-18  1678.9654
## CENPA    1.130745e-16  -855.0454
## ANLN     7.649051e-16  -995.2773
## RABEP1   2.573654e-15   627.6337
## AURKA    8.426028e-15  -711.7943
## FAM83D   4.566357e-14  -733.2090
## MCM10    7.459059e-14  -751.7670
## TRIP13   9.654274e-14  -710.3487
## STMN1    4.681205e-13  -807.2726
## SEC14L2  7.382927e-13   501.6181
## KIAA0141 8.697885e-13  1449.0480
## MXD4     9.772895e-13   934.6374
## KARS     1.139240e-12 -1169.7592
## ESR1     1.412970e-12   236.0715
## TPX2     1.464180e-12  -732.2328
## MZF1     1.700396e-12   822.2763
## CYBRD1   2.220941e-12   499.1818
## CDC20    2.256791e-12  -485.7147
## HPN      2.285301e-12   482.4860
## CPEB3    2.874603e-12  1068.1870

AURKA is near the top. The ANLN gene codes for the protein Anillin, an actin-binding protein required for cytokinesis, is a prognostic marker panel in breast cancer..

Several of the genes seem to be favorable for survival, such as MXD4 and RABEP1. You can learn more about their functions at the UniProt Knowledgebase.


Prediction

Let’s use your model for prediction. You know already that your model accounts for only ~20% of the variance in time-to-death, so we don’t expect stellar performance.

If you look at dead_clin, you’ll see there are a couple of cases where dead_clin$T has a value of NA:

which(is.na(dead_clin$T))
## [1]  87 316

Predictions, which are number values for time-of-death, will be made for all patients. To compare the predicted values with the real values, we should remove the patients with NA values.

new_dead_clin<-dead_clin[-c(87,316),]
new_dead_expr<-dead_expr[,-c(87,316)]

Let’s create a model based on expression of a gene that may be important for survival. We’ve been focussing on AURKA, but you can choose another- perhaps one of the most significant genes from lm_result when you modeled survival for all genes.

For prediction, it is easiest if we add the column for gene expression to the clinical data frame.

# pick the gene to use in your model here!
my_model_gene <- "AVP"

# add gene expression to the training clinical table above
gene_exp <- as.vector(new_dead_expr[my_model_gene, ])
new_dead_clin$gene_exp <- gene_exp

Now train your model for survival as a function of gene expression…

gene_model <- lm(T ~ gene_exp, data = new_dead_clin)
summary(gene_model)
## 
## Call:
## lm(formula = T ~ gene_exp, data = new_dead_clin)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2986.5  -860.2  -243.7   561.4  4617.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -8270.8     1125.1  -7.351 1.66e-12 ***
## gene_exp      1679.0      182.9   9.179  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1222 on 319 degrees of freedom
## Multiple R-squared:  0.2089, Adjusted R-squared:  0.2064 
## F-statistic: 84.24 on 1 and 319 DF,  p-value: < 2.2e-16

Look at the coefficient named gene_exp. Based on this coefficient, is the relationship between the expression of your gene and survival time positive or negative?

You see that your gene has some explanatory value (Multiple R-squared), let’s see if it has predictive value.

The function predict.lm returns predicted values for the dependent feature, in our case time-to-death T.

# The object my_prediction will hold your predicted values for T

my_prediction <- data.frame(metabric_id = new_dead_clin$metabric_id, 
                            T = predict.lm(gene_model, 
                                           newdata = new_dead_clin))
head(my_prediction)
##    metabric_id        T
## 1      MB_0362 1569.578
## 2      MB_0346 1223.711
## 5      MB_0185 1369.781
## 11     MB_0189 1530.962
## 19     MB_0223 2118.599
## 26     MB_0906 1379.855

To get a sense of how well you model did, let’s compare the predicted values with the real values:

# correlation
cor(my_prediction$T,new_dead_clin$T)
## [1] 0.4570749

What do you think?


Take it a step further! Include ER.expr in a predictive model.

First, create the model:

# Uncomment and replace Y ~ X + Z with the correct variables

er_gene_model <- lm(T ~ gene_exp + ER.Expr, data = new_dead_clin)

Then, perform the prediction:

# Uncomment and replace FUNCTION with what need for prediction

my_prediction <- data.frame(metabric_id = new_dead_clin$metabric_id, 
                            T = predict.lm(er_gene_model, 
                                           newdata = new_dead_clin))

Finally, calculate the correlation between your prediction and the real values:

 cor(my_prediction$T,new_dead_clin$T)
## [1] 0.4864656

How does your new model compare?


Wrap-up

Congratulations on accomplishing another activity!! You’ve done a tremendous amount of work this summmer and learned a wide range of data analysis methods with many different data types. Great work.

Don’t forget to knit and publish to Rpubs!