As part of the the Synapse metaGenomic project, TCGA data relating to Glioblastoma multiforme was normalized using the SNM algoithm and stored in a Synapse project (syn1418075).
We'll try a little data analysis, predicting clinical or phenotype data based on gene expression, copy number variation, micro-RNAs and proteomics using elastic net and the glmnet package.
First, load required packages.
library(synapseClient)
library(Biobase)
library(predictiveModeling)
library(venneuler)
source('overlap.R')
source('markdown_table.R')
synapseLogin("myaccount@nowhere.com", "secret")
Load Synapse entities
syn_copy <- loadEntity('syn313614')
syn_expr <- loadEntity('syn313583')
syn_mirna <- loadEntity('syn313603')
syn_prot <- loadEntity('syn361966')
syn_meta <- loadEntity('syn673127')
Extract 4 matrices and 1 data.frame from entities.
expr <- assayData(syn_expr$objects$eset)$exprs
mirna <- assayData(syn_mirna$objects$eset)$exprs
prot <- assayData(syn_prot$objects$eset)$exprs
copy <- assayData(syn_copy$objects$eset)$exprs
meta <- syn_meta$objects$metadata
Put data together into a list.
data <- list(expr=expr, mirna=mirna, prot=prot, copy=copy)
Show sizes of input data.
dims <- lapply(data, dim)
message(sprintf("dim(%s): %d x %d\n", names(data), sapply(dims, `[`, 1), sapply(dims, `[`, 2)))
## dim(expr): 12185 x 559 dim(mirna): 1510 x 253 dim(prot): 171 x 214
## dim(copy): 16095 x 512
Samples are identified by TCGA barcodes. The first 12 characters of the barcode identifies an individual patient. For example:
TCGA-02-0001-01C-01R-0177-01
Define a function to truncate TCGA barcodes at the patient level.
participant <- function(barcodes) {
substr(barcodes, 1, 12)
}
Get prefixes of sample names from each of the 4 data types and find unique patients across all 4 data types.
samples <- sort(unique(do.call(c, lapply(data, function(d) colnames(d)))))
participants <- lapply(data, function(d) participant(colnames(d)))
unique_participants <- sort(unique(do.call(c, participants)))
message(sprintf("unique patients: %d\n", length(unique_participants)))
## unique patients: 590
Compute samples with overlapping data from among the 4 data types.
overlap_participants <- overlap(participants)
overlap_participant_counts <- sapply(overlap_participants, length)
Plot venn-euler diagram of the overlap among data types.
ve <- venneuler(overlap_participant_counts)
plot(ve)
## data.type | patients
## -------------------- | ---------
## expr | 539
## mirna | 253
## prot | 214
## copy | 502
## expr&mirna | 244
## expr&prot | 184
## expr© | 456
## mirna&prot | 52
## mirna© | 179
## prot© | 203
## expr&mirna&prot | 50
## expr&mirna© | 170
## expr&prot© | 178
## mirna&prot© | 51
## expr&mirna&prot© | 49
While we have 1356
samples and 590
unique participants, the overlap of participants with all 4 data types is only 49
. We'll do our modeling with a subset that has more samples. Later, we might think about imputing missing data.
Survival will be our response. We'll try to model that in terms of gene expression and copy number variation, as this is where we have the most samples.
## extract survival data for patients
patient_survival <- unique(na.omit(meta[, c('bcr_patient_barcode', 'days_to_death'), drop=FALSE]))
responseData <- patient_survival[,'days_to_death', drop=FALSE]
rownames(responseData) <- patient_survival$bcr_patient_barcode
## truncate TCGA barcodes at the participant level
data_by_participant <- lapply(data, function(d) {
colnames(d) <- participant(colnames(d))
return(d)
})
## combine together expression and copy number variation, as this
## is where we have the most overlap
featureData <- createAggregateFeatureDataSet(data_by_participant[c('expr','copy')])
## filter out NAs and low variance features
featureData_filtered <- filterNasFromMatrix(featureData, filterBy = "rows")
## filter out NAs and low variance features
FeatureAndResponseData <- createFeatureAndResponseDataList(t(featureData_filtered), responseData)
filteredData <- filterPredictiveModelData(FeatureAndResponseData$featureData,
FeatureAndResponseData$responseData,
featureVarianceThreshold = 0.01,
corPValThresh = 0.1)
## uniquify and scale feature data
filteredFeatureDataScaled <- scale(unique(filteredData$featureData, MARGIN=2))
## scale response data
filteredResponseDataScaled <- scale(filteredData$responseData)
Randomly select training and test set.
training <- sample.int(nrow(filteredFeatureDataScaled), nrow(filteredFeatureDataScaled)*0.85)
testing <- setdiff(1:nrow(filteredFeatureDataScaled), training)
Fit a model using cross-validated elastic net, using an alpha of 0.85.
m <- cv.glmnet(filteredFeatureDataScaled[training,],
filteredResponseDataScaled[training,],
alpha=0.45)
How many non-zero coefficients did we get?
sum(coef(m)[,1] !=0)
## [1] 44
Glmnet returns a lambda.1se, which is the largest value of lambda such that error is within 1 standard error of the minimum.
lambda.1se = 0.3622
Predict response on training data and compute correlation between predicted and training data.
p.train <- predict(m, filteredFeatureDataScaled[training,])
ord.by.p.train <- order(p.train)
corr.train <- cor(p.train, filteredResponseDataScaled[training,1])
Now, try to predict on a hold-out set not used for training.
p.test <- predict(m, filteredFeatureDataScaled[testing,])
ord.by.p.test <- order(p.test)
corr.test <- cor(p.test, filteredResponseDataScaled[testing,1])
The testing correlation seems to depend a lot on which samples are included in the training and test sets. In general, testing correlations are not good.
These show that the predictors found for training data have weak predictive ability on testing data. This might be indicative of overfitting or a result of the know difficulties in modeling large p, small n problems.