analysis_presentation

Ian Donaldson
Tuesday, March 24th, 2015

Task

A 40-minute presentation outlining the candidate's skills and expertise relevant to the position.

If possible, this should contain at least one example of using machine learning in a predictive modelling (classification/regression) setting.

Goals

  • Try a machine learning task with expression data
  • Use the MLInterfaces package and its MLearn function (similar to caret)
  • Provide some reproducible sample code showing how I work
  • Try collapsing probe-set data by GO complexes

Following this protocol

Bioconductor case studies
(Hahne, Huber, Gentleman, Falcon)

  • Chapter 6 - Easy Differential Expression
  • Chapter 8 - Supervised machine learning

  • Then my own code

The Data

biocLite("ALL")
library(ALL)
data(ALL)
?ALL

The data consist of microarrays from 128 different individuals with acute lymphoblastic leukemia (ALL). A number of additional covariates are available. The data have been normalized (using rma) and it is the jointly normalized data that are available here. The data are presented in the form of an ExpressionSet object.

Set up environment for presentation

load(file="analysis.Rdata")
#ls()

Working with a subset of the data

bcell = grep("^B", as.character(ALL$BT)) 
#bcell is a vector of indices of individuals with B-cell type leukemia

moltyp = which(as.character(ALL$mol.biol) %in% c("NEG", "BCR/ABL")) 
#moltyp is a vector of indices of individuals with BCR/ABL translocations or no cytogenetic abnormalities (NEG)

ALL_bcrneg = ALL[, intersect(bcell, moltyp)] 
#subset to the individuals meeting both criteria 
# ALL_bcrneg is an ExpressionSet

A peek at the expression data

exprs(ALL_bcrneg)[1:3,1:3]
             01005    01010    03002
1000_at   7.597323 7.479445 7.567593
1001_at   5.046194 4.932537 4.799294
1002_f_at 3.900466 4.208155 3.886169

Feature filtering (selection)

  • Remove probe data with little information.

  • Filter on the per-probe standard deviations, using the "shorth"

  • The "shorth" is the shortest interval that covers half the values in a vector x.

  • Implemented in the genefilter package

  • The midpoint of the shorth is a robust estimator of location, well suited to non-normal data

  • How stringently we filter has implications for how many results survive multiple-hypothesis correction

Feature filter code

#library("genefilter")
?genefilter #<- a package for filtering gene expression probe data
expressionData <- exprs(ALL_bcrneg)
str(expressionData) #<-matrix of 12625 probes by 79 samples (individuals) 
sds = rowSds(expressionData) #<-the standard deviation of each row (rowSds is from genefilter package)
sh = shorth(sds)
sds_alternative <- apply(expressionData, 1, sd) #alternative way of coding the same thing
sds_alternative[1:5]
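
The filtering step itself is not shown above; a minimal sketch of how the shorth threshold might be applied (the name ALLsfilt matches the object used in the t-test slide below; the cutoff rule is an assumption):
#Sketch: keep probes whose standard deviation is at least the shorth (assumed cutoff)
ALLsfilt = ALL_bcrneg[sds >= sh, ]
dim(exprs(ALLsfilt)) #<- fewer probes, same 79 samples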

Feature filter results

plot of chunk feature_selection_plot

Alternative methods for feature selection

  1. discard all probes with consistently low expression values
  2. remove non-specific probes using the nsFilter function from the genefilter package

Testing for differential expression

  • the rowttests function from the genefilter package is used to test for differential expression between the two mol.biol groups
  • i.e., BCR/ABL translocation (37 samples) vs. NEG (42 samples)
  • tt = rowttests(ALLsfilt, "mol.biol")

Visualization of t-test results

plot of chunk t_test_hist

Zooming in on t-test results

plot of chunk t_test_hist_zoom

What are we throwing away?

plot of chunk t_rest_hist

What are we throwing away? Zoom.

plot of chunk t_rest_hist_zoom

Most of these won't survive multiple-hypothesis testing anyway, but a few may bear further examination.

Multiple-hypothesis testing correction

  • there are 8812 probes (t-tests)
  • so we expect about 440 false positives (8812 × 0.05) at the 0.05 significance level
  • 1155 p-values are less than 0.05 - are they all significant?
  • we need to correct for doing so many tests

  • the Benjamini-Hochberg procedure controls the false-discovery rate (FDR) and produces adjusted p-values

Correction code

p_adjust <- p.adjust(tt$p.value, method="BH") #BH-adjusted p-values
num_significant <- sum(tt$p.value < 0.05) #count of unadjusted p-values below 0.05
num_significant #<-- 1155
sum(p_adjust < 0.05) #<-- how many survive the BH correction
#functions in the multtest package give identical results

Then look at the genes ranked by "significance"

     probe_id  symbol
1     1635_at    ABL1
2   1636_g_at    ABL1
3     1674_at    YES1
4    32434_at  MARCKS
5    37015_at ALDH1A1
6    37027_at   AHNAK
7    39730_at    ABL1
8  39837_s_at  ZNF467
9    40202_at    KLF9
10   40504_at    PON2

What can we do with a ranked list?

What does significant mean anyway?

Such a ranked list might be used as input for further analysis including

  • ORA - over-representation analysis
  • GSEA - Gene set enrichment analysis
  • as features in a machine learning model

Machine learning task

Predict the presence or absence of a BCR/ABL translocation given an expression profile.

table(ALL_bcrneg$mol.biol)

BCR/ABL     NEG 
     37      42 

Machine learning pipeline

  • Form a question
  • Gather data, clean and normalize
  • Select features or make them
  • Select a ML method
  • Tune parameters on training/test set
  • Assess performance on validation set

Filtering (cleaning) data

  • nsFilter function from the genefilter package
  • remove non-informative probes
  • also removes control probes on Affymetrix arrays (identified by their AFFX prefix)
  • exclude genes without Entrez Gene identifiers
  • select the top 25% of genes on the basis of variability across samples
  • nsFilter specifically works with ExpressionSet data and returns the same class type
ALLfilt_bcrneg = nsFilter(ALL_bcrneg, var.cutoff=0.75)$eset

Data standardization (normalization)

Visualizing Euclidean distance between samples

  • distance between samples in the expression matrix
  • rows x columns = probes x individuals
  • call the dist function on the transpose of this matrix (individuals x probes)
  • the result is an individuals x individuals matrix of Euclidean distances between each pair of individuals
  • visualized using the heatmap() function
  • the RColorBrewer package is used to construct a color palette for the visualization
#calculate Euclidean distances between samples
eucD = dist(t(exprs(ALLfilt_bcrneg)))
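
The heat-map drawing code is not shown; a minimal sketch of one way to do it (the palette choice is an assumption):
#Sketch: draw the sample-by-sample distance matrix as a heat map
library("RColorBrewer")
hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(255) #construct a color palette (assumed choice)
heatmap(as.matrix(eucD), symm=TRUE, col=hmcol, distfun=function(x) as.dist(x))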

Hierarchical clustering of samples viewed as a heat map

plot of chunk euclid_vis

Using the MLInterfaces package for machine learning

  • MLInterfaces provides a consistent interface to a number of machine learning packages in R (much like caret but adapted specifically for Bioconductor containers)

Construct a training set and a test set

#create sample indices for the training and test sets
set.seed(1234)
Negs = which(ALLfilt_bcrneg$mol.biol == "NEG")
Bcr = which(ALLfilt_bcrneg$mol.biol == "BCR/ABL")
S1 = sample(Negs, 20, replace=FALSE)
S2 = sample(Bcr, 20, replace = FALSE)
TrainInd = c(S1, S2)
TestInd = setdiff(1:79, TrainInd) #<-- the remaining 39 samples (79 samples in total)

KNN

Use KNN to predict the presence or absence of the BCR/ABL translocation

kans = MLearn( mol.biol ~ ., data=ALLfilt_bcrneg, knnI(k=1,l=0), TrainInd)
kans.cm <- confuMat(kans)
kans.cm
         predicted
given     BCR/ABL NEG
  BCR/ABL      16   1
  NEG           4  18

Assessing performance

cm <- as.table(kans.cm)
precision(cm)
  BCR/ABL       NEG 
0.9411765 0.8181818 
recall(cm)
  BCR/ABL       NEG 
0.8000000 0.9473684 
acc(cm)
[1] 0.8717949

Explain the confusion matrix

(figure explaining the layout of a confusion matrix)
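
As a worked example, the summary statistics can be recomputed by hand from the KNN confusion matrix shown above (rows are the given labels, columns the predictions):
#Worked example: recompute the summaries from the KNN confusion matrix above
cm_example <- matrix(c(16, 1,
                        4, 18), nrow=2, byrow=TRUE,
                     dimnames=list(given=c("BCR/ABL", "NEG"), predicted=c("BCR/ABL", "NEG")))
sum(diag(cm_example)) / sum(cm_example) #accuracy: (16 + 18) / 39, about 0.872
diag(cm_example) / rowSums(cm_example)  #per-class proportion correct among the true (given) labels
diag(cm_example) / colSums(cm_example)  #per-class proportion correct among the predictions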

Feature selection

  • select gene probes that are likely to be indicative of the presence or absence of the BCR/ABL translocation
  • use the t-test to do this
  • performed on the training set only - to avoid overestimating performance
  • repeat KNN using the new data set containing only the selected features (a sketch of this step follows)
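
The construction of the feature-selected data set (BNf, used below) is not shown in the slides; a minimal sketch under the assumptions above (the cutoff of 50 probes is hypothetical):
#Hypothetical sketch: rank probes by t-test on the TRAINING samples only
tt_train = rowttests(ALLfilt_bcrneg[, TrainInd], "mol.biol")
ord = order(tt_train$p.value)
BNf = ALLfilt_bcrneg[ord[1:50], ] #keep the 50 most significant probes (cutoff is an assumption)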

Does this impact KNN model accuracy?

#redo knn
#library("MLInterfaces")
kans.bf = MLearn( mol.biol ~ ., data=BNf, knnI(k=1,l=0), TrainInd)
kans.bf.cm <- confuMat(kans.bf)
kans.bf.cm
         predicted
given     BCR/ABL NEG
  BCR/ABL      14   3
  NEG           1  21

Does this impact KNN model accuracy?

cm<-as.table(kans.bf.cm)
precision(cm)
  BCR/ABL       NEG 
0.8235294 0.9545455 
recall(cm)
  BCR/ABL       NEG 
0.9333333 0.8750000 
acc(cm)
[1] 0.8974359

Support for cross-validation in MLInterfaces

  • by passing an xvalSpec object to the MLearn function (see the sketch below)
  • only three cross-validation types are supported:
  • LOO = leave-one-out,
  • LOG = leave-out-group and
  • NOTEST = use the entire data set for training
  • there is no built-in support for bootstrapping
  • feature selection can be built into the cross-validation object - see the fsFun parameter
  • the package's example code illustrates how a single round of train/test can overestimate performance and why cross-validation should always be used.
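
A minimal sketch of leave-one-out cross-validation of the KNN model, using the xvalSpec interface described above:
#Sketch: leave-one-out cross-validation of the KNN model
knn.loo = MLearn( mol.biol ~ ., data=ALLfilt_bcrneg, knnI(k=1,l=0), xvalSpec("LOO"))
confuMat(knn.loo)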

Random forest in MLearn

set.seed(123)
#the guideline rule for mtry is the sqrt of the number of features
dim(ALLfilt_bcrneg) #<-- 2160 features (probes) by 79 samples after filtering
sqrt(2160) #<-- about 46; mtry=55 is of the same order
rf1 = MLearn( mol.biol~., data=ALLfilt_bcrneg, randomForestI, TrainInd, ntree=1000, mtry=55, importance=TRUE)
confuMat(rf1)
acc(as.table(confuMat(rf1)))

Random forest performance

confuMat(rf1)
         predicted
given     BCR/ABL NEG
  BCR/ABL      13   4
  NEG           0  22
acc(as.table(confuMat(rf1)))
[1] 0.8974359

A form of cross-validation (the out-of-bag error estimate) is built into this method, and the results are more interpretable.

Examining importance of features from a random forest model

  • importance parameter of the random forest method allows us to examine which features had the largest impact on performance

  • data about feature importance is retrieved using the getVarImp function of the MLInterfaces package

Importance

opar = par(no.readonly=TRUE, mar=c(7,5,4,2)) #opar=original parameters - keep these so they can be restored
par(las=2) #<-- style of the axis label - parallel to the axis
?getVarImp #<-- from the MLInterfaces package
impV1 = getVarImp(rf1)
plot(impV1, n=15, plat="hgu95av2", toktype="SYMBOL") #<-- implementation of plot allows direct mapping of gene symbols
par(opar) #<-- restore the original graphics parameters

Importance

(figure: top 15 probes ranked by random forest variable importance, labelled with gene symbols)

Next, let's try collapsing features by complex membership

Retrieve complexes from GO

library("AnnotationDbi")
library(GO.db)
searchTerms <- c("complex", "some", "mere")
#...
query <- paste("select go_id from go_term where term like '%", searchTerm, "%'", sep = "")
#...
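
A minimal sketch of how one such query might be run against GO.db's SQLite connection and the matching term details collected (the direct SQL approach and the variable names are assumptions):
#Sketch: run one query against GO.db's SQLite database (assumed approach)
library("DBI")
con <- GO_dbconn() #database connection shipped with GO.db
goids <- dbGetQuery(con, query)$go_id #GO ids whose term name matched the search term
goidDetails <- AnnotationDbi::select(GO.db, keys=goids,
                                     columns=c("TERM", "ONTOLOGY"), keytype="GOID")
head(goidDetails) #compare with allGoidDetails on the next slide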

GO complexes

allGoidDetails[c(1:3), c(2,3)]
                                         TERM ONTOLOGY
1           phosphopyruvate hydratase complex       CC
2          nucleotide-excision repair complex       CC
3 nucleotide-excision repair factor 1 complex       CC

Skipping some steps here

  • retrieve Entrez Gene Ids for these complexes
  • map probe Ids to Gene Ids and then to complex Ids, if any (a hypothetical sketch follows)
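
These skipped steps might look roughly like the following; everything other than allGoidDetails and ALLfilt_bcrneg is an assumption:
#Hypothetical sketch of the skipped mapping steps
library("hgu95av2.db") #the ALL data come from Affymetrix HG-U95Av2 arrays
library("org.Hs.eg.db")
#probe Id -> Entrez Gene Id
probe2gene <- AnnotationDbi::select(hgu95av2.db, keys=featureNames(ALLfilt_bcrneg),
                                    columns="ENTREZID", keytype="PROBEID")
#complex GO Id -> Entrez Gene Ids annotated at that term or its children
complexIds <- allGoidDetails[[1]] #assumes the first column holds the GO ids
complex2gene <- mget(complexIds, org.Hs.egGO2ALLEGS, ifnotfound=NA)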

Collapse using collapseRows

# ...
library("WGCNA") #collapseRows comes from the WGCNA package
#collapse function: for each sample (column), keep the value with the largest absolute magnitude
myCollapse <- function(someMatrix) apply(someMatrix, 2, function(x) x[which.max(abs(x))])
collapseRows.results <- collapseRows(datET=tmp, rowGroup=rowGroup, rowID=rowID,
                                     method="function", methodFunction=myCollapse) #collapse probes to complexes
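
The later slides use an object called tmpCollapsed; presumably it is the collapsed matrix returned by collapseRows. A minimal sketch, assuming WGCNA's standard return value:
#Assumption: the collapsed complexes-by-samples matrix is the datETcollapsed element
tmpCollapsed <- collapseRows.results$datETcollapsed
dim(tmpCollapsed) #complexes x samples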

Prepare the collapsed expression data for random forest

This was the hard part

# prepare expression set for ml
eData <- as.data.frame(t(tmpCollapsed))#transpose and cast as a data frame
newColNames <- paste("AAA", colnames(eData), sep="") #prefix the complex (GO id) column names
newColNames <- gsub(":", "", newColNames) #":" is not allowed in R variable names, so strip it
learnThis <- ALLfilt_bcrneg$mol.biol ##finally, add the outcome variable separately
newColNames <- c("learnThis", newColNames)
eData <- cbind(learnThis, eData) 
colnames(eData) <- newColNames

Learn

rf1e = MLearn( learnThis~., data=eData, randomForestI, TrainInd, ntree=1000, mtry=55, importance=TRUE)

Assess performance

rf1e.cm<-confuMat(rf1e)
rf1e.cm
         predicted
given     BCR/ABL NEG
  BCR/ABL      13   4
  NEG           0  22
acc(as.table(rf1e.cm))
[1] 0.8974359

Importance

(figure: variable importance for the complex-level random forest model)

GO:0034673

Future work

  • this has only established a workflow
  • explore these results more
  • are the results relevant?
  • could this have been found without collapsing?
  • what's the best way to collapse rows (max, mean)?
  • use only complex membership as features
  • use simulations to establish the efficacy of the method
  • extend to pathways and larger collections of complexes

End