TCGA Glioblastoma Predictive Modeling

As part of the the Synapse metaGenomic project, TCGA data relating to Glioblastoma multiforme was normalized using the SNM algoithm and stored in a Synapse project (syn1418075).

We'll try a little data analysis, predicting clinical or phenotype data based on gene expression, copy number variation, micro-RNAs and proteomics using elastic net and the glmnet package.

Premininaries, getting libraries and data

First, load required packages.

library(synapseClient)
library(Biobase)
library(predictiveModeling)
library(venneuler)
source('overlap.R')
source('markdown_table.R')

synapseLogin("myaccount@nowhere.com", "secret")

Load Synapse entities

syn_copy <- loadEntity('syn313614')
syn_expr <- loadEntity('syn313583')
syn_mirna <- loadEntity('syn313603')
syn_prot <- loadEntity('syn361966')
syn_meta <- loadEntity('syn673127')

Extract 4 matrices and 1 data.frame from entities.

expr <- assayData(syn_expr$objects$eset)$exprs
mirna <- assayData(syn_mirna$objects$eset)$exprs
prot <- assayData(syn_prot$objects$eset)$exprs
copy <- assayData(syn_copy$objects$eset)$exprs
meta <- syn_meta$objects$metadata

Put data together into a list.

data <- list(expr=expr, mirna=mirna, prot=prot, copy=copy)

Show sizes of input data.

dims <- lapply(data, dim)
message(sprintf("dim(%s): %d x %d\n", names(data), sapply(dims, `[`, 1), sapply(dims, `[`, 2)))

## dim(expr): 12185 x 559 dim(mirna): 1510 x 253 dim(prot): 171 x 214
## dim(copy): 16095 x 512

Exploring the data

TCGA Barcodes

Samples are identified by TCGA barcodes. The first 12 characters of the barcode identifies an individual patient. For example:

TCGA-02-0001-01C-01R-0177-01

Define a function to truncate TCGA barcodes at the patient level.

participant <- function(barcodes) {
  substr(barcodes, 1, 12)
}

Find unique participants

Get prefixes of sample names from each of the 4 data types and find unique patients across all 4 data types.

samples <- sort(unique(do.call(c, lapply(data, function(d) colnames(d)))))
participants <- lapply(data, function(d) participant(colnames(d)))
unique_participants <- sort(unique(do.call(c, participants)))
message(sprintf("unique patients: %d\n", length(unique_participants)))

## unique patients: 590

Overlap among the 4 data types

Compute samples with overlapping data from among the 4 data types.

overlap_participants <- overlap(participants)
overlap_participant_counts <- sapply(overlap_participants, length)

Plot venn-euler diagram of the overlap among data types.

ve <- venneuler(overlap_participant_counts)
plot(ve)

plot of chunk venneuler-diagram

##            data.type |  patients 
## -------------------- | --------- 
##                 expr |       539
##                mirna |       253
##                 prot |       214
##                 copy |       502
##           expr&mirna |       244
##            expr&prot |       184
##            expr&copy |       456
##           mirna&prot |        52
##           mirna&copy |       179
##            prot&copy |       203
##      expr&mirna&prot |        50
##      expr&mirna&copy |       170
##       expr&prot&copy |       178
##      mirna&prot&copy |        51
## expr&mirna&prot&copy |        49

While we have 1356 samples and 590 unique participants, the overlap of participants with all 4 data types is only 49. We'll do our modeling with a subset that has more samples. Later, we might think about imputing missing data.

Prepare data

Feature matrix and response vector

Survival will be our response. We'll try to model that in terms of gene expression and copy number variation, as this is where we have the most samples.

## extract survival data for patients
patient_survival <- unique(na.omit(meta[, c('bcr_patient_barcode', 'days_to_death'), drop=FALSE]))
responseData <- patient_survival[,'days_to_death', drop=FALSE]
rownames(responseData) <- patient_survival$bcr_patient_barcode

Filter and scale

## truncate TCGA barcodes at the participant level
data_by_participant <- lapply(data, function(d) {
  colnames(d) <- participant(colnames(d))
  return(d)
})

## combine together expression and copy number variation, as this
## is where we have the most overlap
featureData <- createAggregateFeatureDataSet(data_by_participant[c('expr','copy')])

## filter out NAs and low variance features
featureData_filtered <- filterNasFromMatrix(featureData, filterBy = "rows")

## filter out NAs and low variance features
FeatureAndResponseData <- createFeatureAndResponseDataList(t(featureData_filtered), responseData)                                             
filteredData <- filterPredictiveModelData(FeatureAndResponseData$featureData,
                                          FeatureAndResponseData$responseData,
                                          featureVarianceThreshold = 0.01,
                                          corPValThresh = 0.1)

## uniquify and scale feature data
filteredFeatureDataScaled  <- scale(unique(filteredData$featureData, MARGIN=2))

## scale response data
filteredResponseDataScaled <- scale(filteredData$responseData)

Train

Elastic Net

Randomly select training and test set.

training <- sample.int(nrow(filteredFeatureDataScaled), nrow(filteredFeatureDataScaled)*0.85)
testing <- setdiff(1:nrow(filteredFeatureDataScaled), training)

Fit a model using cross-validated elastic net, using an alpha of 0.85.

m <- cv.glmnet(filteredFeatureDataScaled[training,],
               filteredResponseDataScaled[training,],
               alpha=0.45)

How many non-zero coefficients did we get?

sum(coef(m)[,1] !=0)

## [1] 44

Glmnet returns a lambda.1se, which is the largest value of lambda such that error is within 1 standard error of the minimum.

lambda.1se = 0.3622

Predict

Training

Predict response on training data and compute correlation between predicted and training data.

p.train <- predict(m, filteredFeatureDataScaled[training,])
ord.by.p.train <- order(p.train)
corr.train <- cor(p.train, filteredResponseDataScaled[training,1])

plot of chunk plot-corr-train

Testing

Now, try to predict on a hold-out set not used for training.

p.test <- predict(m, filteredFeatureDataScaled[testing,])
ord.by.p.test <- order(p.test)
corr.test <- cor(p.test, filteredResponseDataScaled[testing,1])

plot of chunk plot-corr-test

Results

The testing correlation seems to depend a lot on which samples are included in the training and test sets. In general, testing correlations are not good.
plot of chunk cross-validate-on-correlations

These show that the predictors found for training data have weak predictive ability on testing data. This might be indicative of overfitting or a result of the know difficulties in modeling large p, small n problems.