analysis_presentation

Ian Donaldson
Tuesday, March 24th, 2015

Task

A 40-minute presentation outlining the candidate's skills and expertise relevant to the position.

If possible, this should contain at least one example of using machine learning in a predictive modelling (classification/regression) setting.

Goals

  • Try a machine learning task with expression data
  • Use the MLInterfaces package and its MLearn function (similar to caret)
  • Provide some reproducible sample code showing how I work
  • Try collapsing probe-set data by GO complexes

Following this protocol

Bioconductor case studies
(Hahne, Huber, Gentleman, Falcon)

  • Chapter 6 - Easy Differential Expression
  • Chapter 8 - Supervised machine learning

  • Then my own code

The Data

biocLite("ALL")
library(ALL)
data(ALL)
?ALL

The data consist of microarrays from 128 different individuals with acute lymphoblastic leukemia (ALL). A number of additional covariates are available. The data have been normalized (using rma) and it is the jointly normalized data that are available here. The data are presented in the form of an ExpressionSet object.

Set up environment for presentation

load(file="analysis.Rdata")
#ls()

Working with a subset of the data

bcell = grep("^B", as.character(ALL$BT)) 
#bcell is a vector of indices of individuals with B-cell type leukemia

moltyp = which(as.character(ALL$mol.biol) %in% c("NEG", "BCR/ABL")) 
#moltyp is a vector of indices of individuals with BCR/ABL translocations or no cytogenetic abnormalities (NEG)

ALL_bcrneg = ALL[, intersect(bcell, moltyp)] 
#subset to the individuals meeting both criteria 
# ALL_bcrneg is an ExpressionSet

A peek at the expression data

exprs(ALL_bcrneg)[1:3,1:3]
             01005    01010    03002
1000_at   7.597323 7.479445 7.567593
1001_at   5.046194 4.932537 4.799294
1002_f_at 3.900466 4.208155 3.886169

Feature filtering (selection)

  • Remove probe data with little information.

  • Filter on the per-probe standard deviations, using the "shorth"

  • The "shorth" is the shortest interval that covers half the values in a vector x.

  • Implemented in the genefilter package

  • The midpoint of the shorth is a robust estimator of location, well suited to non-normal data

  • How stringently we filter has implications for how many results survive multiple-hypothesis correction

Feature filter code

#library("genefilter")
?genefilter #<- a package for filtering gene expression probe data
expressionData <- exprs(ALL_bcrneg)
str(expressionData) #<-matrix of 12625 probes by 79 samples (individuals) 
sds = rowSds(expressionData) #<-the standard deviation of each row (rowSds is from genefilter package)
sh = shorth(sds)
sds_alternative <- apply(expressionData, 1, sd) #alternative way of coding the same thing
sds_alternative[1:5]
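
The filtering step itself is not shown above; a minimal sketch of how the shorth threshold might be applied (the name ALLsfilt matches the object used in the t-test slide below; the cutoff rule is an assumption):
#Sketch: keep probes whose standard deviation is at least the shorth (assumed cutoff)
ALLsfilt = ALL_bcrneg[sds >= sh, ]
dim(exprs(ALLsfilt)) #<- fewer probes, same 79 samples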

Feature filter results

plot of chunk feature_selection_plot

Alternative methods for feature selection

  1. discard all probes with consistently low expression values
  2. remove non-specific probes using the nsFilter function from the genefilter package

Testing for differential expression

  • the rowttests function from the genefilter package is used to test for differential expression between the two mol.biol groups
  • i.e., BCR/ABL translocation (37 samples) vs. NEG (42 samples)
  • tt = rowttests(ALLsfilt, "mol.biol")

Visualization of t-test results

plot of chunk t_test_hist

Zooming in on t-test results

plot of chunk t_test_hist_zoom

What are we throwing away?

plot of chunk t_rest_hist

What are we throwing away? Zoom.

plot of chunk t_rest_hist_zoom

Most of these won't survive multiple-hypothesis testing anyway, but a few may bear further examination.

Multiple-hypothesis testing correction

  • there are 8812 probes (t-tests)
  • so we expect about 440 false positives (8812 × 0.05) at the 0.05 significance level
  • 1155 p-values are less than 0.05 - are they all significant?
  • we need to correct for doing so many tests

  • the Benjamini-Hochberg procedure controls the false-discovery rate (FDR) and produces adjusted p-values

Correction code

p_adjust <- p.adjust(tt$p.value, method="BH") #BH-adjusted p-values
num_significant <- sum(tt$p.value < 0.05) #count of unadjusted p-values below 0.05
num_significant #<-- 1155
sum(p_adjust < 0.05) #<-- how many survive the BH correction
#functions in the multtest package give identical results

Then look at the genes ranked by "significance"

     probe_id  symbol
1     1635_at    ABL1
2   1636_g_at    ABL1
3     1674_at    YES1
4    32434_at  MARCKS
5    37015_at ALDH1A1
6    37027_at   AHNAK
7    39730_at    ABL1
8  39837_s_at  ZNF467
9    40202_at    KLF9
10   40504_at    PON2

What can we do with a ranked list?

What does significant mean anyway?

Such a ranked list might be used as input for further analysis including

  • ORA - over-representation analysis
  • GSEA - Gene set enrichment analysis
  • as features in a machine learning model

Machine learning task

Predict the presence or absence of a BCR/ABL translocation given an expression profile.

table(ALL_bcrneg$mol.biol)

BCR/ABL     NEG 
     37      42 

Machine learning pipeline

  • Form a question
  • Gather data, clean and normalize
  • Select features or make them
  • Select a ML method
  • Tune parameters on training/test set
  • Assess performance on validation set

Filtering (cleaning) data

  • nsFilter function from the genefilter package
  • remove non-informative probes
  • also removes control probes on Affymetrix arrays (identified by their AFFX prefix)
  • exclude genes without Entrez Gene identifiers
  • select the top 25% of genes on the basis of variability across samples
  • nsFilter specifically works with ExpressionSet data and returns the same class type
ALLfilt_bcrneg = nsFilter(ALL_bcrneg, var.cutoff=0.75)$eset

Data standardization (normalization)

Visualizing Euclidean distance between samples

  • distance between samples in the expression matrix
  • rows x columns = probes x individuals
  • call the dist function on the transpose of this matrix (individuals x probes)
  • the result is an individuals x individuals matrix of Euclidean distances between each pair of individuals
  • visualized using the heatmap() function
  • the RColorBrewer package is used to construct a color palette for the visualization
#calculate Euclidean distances between samples
eucD = dist(t(exprs(ALLfilt_bcrneg)))
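
The heat-map drawing code is not shown; a minimal sketch of one way to do it (the palette choice is an assumption):
#Sketch: draw the sample-by-sample distance matrix as a heat map
library("RColorBrewer")
hmcol = colorRampPalette(brewer.pal(9, "GnBu"))(255) #construct a color palette (assumed choice)
heatmap(as.matrix(eucD), symm=TRUE, col=hmcol, distfun=function(x) as.dist(x))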

Hierarchical clustering of samples viewed as a heat map

plot of chunk euclid_vis

Using the MLInterfaces package for machine learning

  • MLInterfaces provides a consistent interface to a number of machine learning packages in R (much like caret but adapted specifically for Bioconductor containers)

Construct a training set and a test set

#create sample indices for the training and test sets
set.seed(1234)
Negs = which(ALLfilt_bcrneg$mol.biol == "NEG")
Bcr = which(ALLfilt_bcrneg$mol.biol == "BCR/ABL")
S1 = sample(Negs, 20, replace=FALSE)
S2 = sample(Bcr, 20, replace = FALSE)
TrainInd = c(S1, S2)
TestInd = setdiff(1:79, TrainInd) #<-- the remaining 39 samples (79 samples in total)

KNN

Use KNN to predict the presence or absence of the BCR/ABL translocation

kans = MLearn( mol.biol ~ ., data=ALLfilt_bcrneg, knnI(k=1,l=0), TrainInd)
kans.cm <- confuMat(kans)
kans.cm
         predicted
given     BCR/ABL NEG
  BCR/ABL      16   1
  NEG           4  18

Assessing performance

cm <- as.table(kans.cm)
precision(cm)
  BCR/ABL       NEG 
0.9411765 0.8181818 
recall(cm)
  BCR/ABL       NEG 
0.8000000 0.9473684 
acc(cm)
[1] 0.8717949

Explain the confusion matrix

(figure explaining the layout of a confusion matrix)
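
As a worked example, the summary statistics can be recomputed by hand from the KNN confusion matrix shown above (rows are the given labels, columns the predictions):
#Worked example: recompute the summaries from the KNN confusion matrix above
cm_example <- matrix(c(16, 1,
                        4, 18), nrow=2, byrow=TRUE,
                     dimnames=list(given=c("BCR/ABL", "NEG"), predicted=c("BCR/ABL", "NEG")))
sum(diag(cm_example)) / sum(cm_example) #accuracy: (16 + 18) / 39, about 0.872
diag(cm_example) / rowSums(cm_example)  #per-class proportion correct among the true (given) labels
diag(cm_example) / colSums(cm_example)  #per-class proportion correct among the predictions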

Feature selection

  • select gene probes that are likely to be indicative of the presence or absence of the BCR/ABL translocation
  • use the t-test to do this
  • performed on the training set only - to avoid overestimating performance
  • repeat KNN using the new data set containing only the selected features (a sketch of this step follows)
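
The construction of the feature-selected data set (BNf, used below) is not shown in the slides; a minimal sketch under the assumptions above (the cutoff of 50 probes is hypothetical):
#Hypothetical sketch: rank probes by t-test on the TRAINING samples only
tt_train = rowttests(ALLfilt_bcrneg[, TrainInd], "mol.biol")
ord = order(tt_train$p.value)
BNf = ALLfilt_bcrneg[ord[1:50], ] #keep the 50 most significant probes (cutoff is an assumption)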

Does this impact KNN model accuracy?

#redo knn
#library("MLInterfaces")
kans.bf = MLearn( mol.biol ~ ., data=BNf, knnI(k=1,l=0), TrainInd)
kans.bf.cm <- confuMat(kans.bf)
kans.bf.cm
         predicted
given     BCR/ABL NEG
  BCR/ABL      14   3
  NEG           1  21

Does this impact KNN model accuracy?

cm<-as.table(kans.bf.cm)
precision(cm)
  BCR/ABL       NEG 
0.8235294 0.9545455 
recall(cm)
  BCR/ABL       NEG 
0.9333333 0.8750000 
acc(cm)
[1] 0.8974359

Support for cross-validation in MLInterfaces

  • by passing an xvalSpec object to the MLearn function (see the sketch below)
  • only three cross-validation types are supported:
  • LOO = leave-one-out,
  • LOG = leave-out-group and
  • NOTEST = use the entire data set for training
  • there is no built-in support for bootstrapping
  • feature selection can be built into the cross-validation object - see the fsFun parameter
  • the package's example code illustrates how a single round of train/test can overestimate performance and why cross-validation should always be used.
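
A minimal sketch of leave-one-out cross-validation of the KNN model, using the xvalSpec interface described above:
#Sketch: leave-one-out cross-validation of the KNN model
knn.loo = MLearn( mol.biol ~ ., data=ALLfilt_bcrneg, knnI(k=1,l=0), xvalSpec("LOO"))
confuMat(knn.loo)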

Random forest in MLearn

set.seed(123)
#the guideline rule for mtry is the sqrt of the number of features
dim(ALLfilt_bcrneg) #<-- 2160 features (probes) by 79 samples after filtering
sqrt(2160) #<-- about 46; mtry=55 is of the same order
rf1 = MLearn( mol.biol~., data=ALLfilt_bcrneg, randomForestI, TrainInd, ntree=1000, mtry=55, importance=TRUE)
confuMat(rf1)
acc(as.table(confuMat(rf1)))

Random forest performance

confuMat(rf1)
         predicted
given     BCR/ABL NEG
  BCR/ABL      13   4
  NEG           0  22
acc(as.table(confuMat(rf1)))
[1] 0.8974359

A form of cross-validation (the out-of-bag error estimate) is built into this method, and the results are more interpretable.

Examining importance of features from a random forest model

  • importance parameter of the random forest method allows us to examine which features had the largest impact on performance

  • data about feature importance is retrieved using the getVarImp function of the MLInterfaces package

Importance

opar = par(no.readonly=TRUE, mar=c(7,5,4,2)) #opar=original parameters - keep these so they can be restored
par(las=2) #<-- style of the axis label - parallel to the axis
?getVarImp #<-- from the MLInterfaces package
impV1 = getVarImp(rf1)
plot(impV1, n=15, plat="hgu95av2", toktype="SYMBOL") #<-- implementation of plot allows direct mapping of gene symbols
par(opar) #<-- restore the original graphics parameters

Importance

(figure: top 15 probes ranked by random forest variable importance, labelled with gene symbols)

Next, let's try collapsing features by complex membership

Retrieve complexes from GO

library("AnnotationDbi")
library(GO.db)
searchTerms <- c("complex", "some", "mere")
#...
query <- paste("select go_id from go_term where term like '%", searchTerm, "%'", sep = "")
#...
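
A minimal sketch of how one such query might be run against GO.db's SQLite connection and the matching term details collected (the direct SQL approach and the variable names are assumptions):
#Sketch: run one query against GO.db's SQLite database (assumed approach)
library("DBI")
con <- GO_dbconn() #database connection shipped with GO.db
goids <- dbGetQuery(con, query)$go_id #GO ids whose term name matched the search term
goidDetails <- AnnotationDbi::select(GO.db, keys=goids,
                                     columns=c("TERM", "ONTOLOGY"), keytype="GOID")
head(goidDetails) #compare with allGoidDetails on the next slide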

GO complexes

allGoidDetails[c(1:3), c(2,3)]
                                         TERM ONTOLOGY
1           phosphopyruvate hydratase complex       CC
2          nucleotide-excision repair complex       CC
3 nucleotide-excision repair factor 1 complex       CC

Skipping some steps here

  • retrieve Entrez Gene Ids for these complexes
  • map probe Ids to Gene Ids and then to complex Ids, if any (a hypothetical sketch follows)
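
These skipped steps might look roughly like the following; everything other than allGoidDetails and ALLfilt_bcrneg is an assumption:
#Hypothetical sketch of the skipped mapping steps
library("hgu95av2.db") #the ALL data come from Affymetrix HG-U95Av2 arrays
library("org.Hs.eg.db")
#probe Id -> Entrez Gene Id
probe2gene <- AnnotationDbi::select(hgu95av2.db, keys=featureNames(ALLfilt_bcrneg),
                                    columns="ENTREZID", keytype="PROBEID")
#complex GO Id -> Entrez Gene Ids annotated at that term or its children
complexIds <- allGoidDetails[[1]] #assumes the first column holds the GO ids
complex2gene <- mget(complexIds, org.Hs.egGO2ALLEGS, ifnotfound=NA)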

Collapse using collapseRows

# ...
library("WGCNA") #collapseRows comes from the WGCNA package
#collapse function: for each sample (column), keep the value with the largest absolute magnitude
myCollapse <- function(someMatrix) apply(someMatrix, 2, function(x) x[which.max(abs(x))])
collapseRows.results <- collapseRows(datET=tmp, rowGroup=rowGroup, rowID=rowID,
                                     method="function", methodFunction=myCollapse) #collapse probes to complexes
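
The later slides use an object called tmpCollapsed; presumably it is the collapsed matrix returned by collapseRows. A minimal sketch, assuming WGCNA's standard return value:
#Assumption: the collapsed complexes-by-samples matrix is the datETcollapsed element
tmpCollapsed <- collapseRows.results$datETcollapsed
dim(tmpCollapsed) #complexes x samples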

Prepare the collapsed expression data for random forest

This was the hard part

# prepare expression set for ml
eData <- as.data.frame(t(tmpCollapsed))#transpose and cast as a data frame
newColNames <- paste("AAA", colnames(eData), sep="") #prefix the complex (GO id) column names
newColNames <- gsub(":", "", newColNames) #":" is not allowed in R variable names, so strip it
learnThis <- ALLfilt_bcrneg$mol.biol ##finally, add the outcome variable separately
newColNames <- c("learnThis", newColNames)
eData <- cbind(learnThis, eData) 
colnames(eData) <- newColNames

Learn

rf1e = MLearn( learnThis~., data=eData, randomForestI, TrainInd, ntree=1000, mtry=55, importance=TRUE)

Assess performance

rf1e.cm<-confuMat(rf1e)
rf1e.cm
         predicted
given     BCR/ABL NEG
  BCR/ABL      13   4
  NEG           0  22
acc(as.table(rf1e.cm))
[1] 0.8974359

Importance

(figure: variable importance for the complex-level random forest model)

GO:0034673

Future work

  • this has only established a workflow
  • explore these results more
  • are the results relevant?
  • could this have been found without collapsing?
  • what's the best way to collapse rows (max, mean)?
  • use only complex membership as features
  • use simulations to establish the efficacy of the method
  • extend to pathways and larger collections of complexes

End