tag: INFO523_HW6B / Code for HW6 Pt. B / ELT 2021-12-16

Goal: The goal of this assignment is to improve the code from Case Study 4 in Torgo (2011) Chapter 5.

The Available Data

Note: Instructions in the book on pg. 325 are outdated. See Bioconductor website for details

Install BiocManager to access packages in Bioconductor

if (!require("BiocManager", quietly = TRUE))  
  install.packages("BiocManager")  
BiocManager::install(version = "3.14")  

Install Biobase package and ALL data set

BiocManager::install(c("Biobase"))  
BiocManager::install(c("ALL"))  

Call the Biobase package and ALL data. Ask R about the ALL object

>> Loading the ALL data set creates an object of a special class called Expression Set.
>> The output should look like this:
Show in New Window
[1] "/Users/elliot/Desktop/INFO523/Homework/Homework 6 Part B & C"
Show in New Window
ExpressionSet (storageMode: lockedEnvironment)
assayData: 12625 features, 128 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: 01005 01010 ... LAL4 (128 total)
  varLabels: cod diagnosis ... date last seen (21 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
  pubMedIds: 14684422 16243790 
Annotation: hgu95av2  

We want to access information from the ExpressionSets object. Here are some ways:

Access the phenoData and variable metadata

# Access `ExpressionSets` Info with phenoData data frame
pD <- phenoData(ALL)

# Connect variables with their metadata (e.g., labels, names)
varMetadata(pD)
##                                                                   labelDescription
## cod                                                                     Patient ID
## diagnosis                                                        Date of diagnosis
## sex                                                          Gender of the patient
## age                                                    Age of the patient at entry
## BT                                      does the patient have B-cell or T-cell ALL
## remission           Complete remission(CR), refractory(REF) or NA. Derived from CR
## CR                                                          Original remisson data
## date.cr                                        Date complete remission if achieved
## t(4;11)             did the patient have t(4;11) translocation. Derived from citog
## t(9;22)             did the patient have t(9;22) translocation. Derived from citog
## cyto.normal                        Was cytogenetic test normal? Derived from citog
## citog             original citogenetics data, deletions or t(4;11), t(9;22) status
## mol.biol                                                         molecular biology
## fusion protein                        which of p190, p210 or p190/210 for bcr/able
## mdr                                                           multi-drug resistant
## kinet                                            ploidy: either diploid or hyperd.
## ccr                                Continuous complete remission? Derived from f.u
## relapse                                                  Relapse? Derived from f.u
## transplant      did the patient receive a bone marrow transplant? Derived from f.u
## f.u                                                       follow up data available
## date last seen                                          date patient was last seen

Make tables for BT cells and the abnormality data

# Make a table showing whether a patient has B or T cells
table(ALL$BT)
## 
##  B B1 B2 B3 B4  T T1 T2 T3 T4 
##  5 19 36 23 12  5  1 15 10  2
# Make table of the cytogeneic abormality found on each sample
table(ALL$mol.biol)
## 
## ALL1/AF4  BCR/ABL E2A/PBX1      NEG   NUP-98  p15/p16 
##       10       37        5       74        1        1

Combine the two tables

# Combine BT cells table with molecular bio table
table(ALL$BT, ALL$mol.biol) # NEG = no abnormality
##     
##      ALL1/AF4 BCR/ABL E2A/PBX1 NEG NUP-98 p15/p16
##   B         0       2        1   2      0       0
##   B1       10       1        0   8      0       0
##   B2        0      19        0  16      0       1
##   B3        0       8        1  14      0       0
##   B4        0       7        3   2      0       0
##   T         0       0        0   5      0       0
##   T1        0       0        0   1      0       0
##   T2        0       0        0  15      0       0
##   T3        0       0        0   9      1       0
##   T4        0       0        0   2      0       0

Take a look at names of some individuals and genes

# Obtain information on the gene samples
featureNames(ALL)[1:10] # names of first 10 genes
##  [1] "1000_at"   "1001_at"   "1002_f_at" "1003_s_at" "1004_at"   "1005_at"  
##  [7] "1006_at"   "1007_s_at" "1008_f_at" "1009_at"
sampleNames(ALL)[1:5] # names of first 5 individuals
## [1] "01005" "01010" "03002" "04006" "04007"

The analysis in this case study focuses specifically on the B-cell cases and on samples with a subset of gene mutations. Those parameters represent the ‘target class’. Below, we obtain the subset of data for our analysis.

Obtain the target cases

# Obtain the target casesfor cells and mutations
tgt.cases <- which(ALL$BT %in% levels(ALL$BT)[1:5] &
                   ALL$mol.biol %in% levels(ALL$mol.biol)[1:4])

Subset the original ALL data set

# Subset the original `ALL` set
ALLb <- ALL[,tgt.cases]

Check out information about the subset

# Print `ALLb`
ALLb
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 12625 features, 94 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: 01005 01010 ... LAL5 (94 total)
##   varLabels: cod diagnosis ... date last seen (21 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 14684422 16243790 
## Annotation: hgu95av2

This looks good, but we need to make sure the subset contains all expected values for the BT cells and cytogenetic abnormalities. To do this, we need to update the levels of our factors in the ALLb object.

Update the levels of the factors

# Update levels for BT cells
ALLb$BT <- factor(ALLb$BT)

# Updsate levels for mutations
ALLb$mol.biol <- factor(ALLb$mol.biol)

Now, save the new data set to be used throughout the case study.

# Save this ALLb data set as it will be used throughout the case study
save(ALLb, file = "myALL.Rdata")

Now, let’s dive in.

Exploring the Data

Look at the gene expression levels matrix

# Look at the gene expression levels matrix:
es <- exprs(ALLb)
dim(es) 
## [1] 12625    94

There are 12,625 factors (genes) in rows, and 94 samples (individuals) in columns. That’s pretty large data set, and we’ll want to reduce its size to make it easier to work with.

Summarize the expression levels data

# Summarize `es`
summary(as.vector(es))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.985   4.122   5.469   5.624   6.829  14.045

Here, we can see that the middle 50% of our data ranges from between 4 (1Q) and 7 (3Q).

Next, we want to get an overview of the distribution of expression levels. We’ll do this graphically.

Install the genefilter package from Bioconductor

BiocManager::install(c("genefilter"))

Load the package

#Call package
library(genefilter)

Generate a histogram to show distribution of expression levels

>> Add lines indicating the placement of the median, shorth, and quantiles 1 and 3
>> Add a legend indicating what the lines represent
# Gen hist()
hist(as.vector(es),breaks=80,prob=T,
     xlab='Expression Levels',
     main='Histogram of Overall Expression Levels')

# Add lines
abline(v=c(median(as.vector(es)),
           shorth(as.vector(es)),
           quantile(as.vector(es),c(0.25,0.75))),
       lty=2, col=c(2,3,4,4))

# Add legend
legend('topright', c('Median','Shorth','1stQ','3rdQ'),
       lty=2, col=c(2,3,4,4))

For this histogram, we changed the default number of intervals of the function hist(). Using the value of 80 for the breaks parameter, we’re able to get a detailed look at the distribution of expression levels.

On top of the histogram are lines showing the median, first quantile, 3rd quantile, and shorth. The short is a robust estimator of the centrality of a continuous distribution. We implement it into the histogram through the shorth() funciton in genefilter.

The shorth represents the mean of the values in a central interval containing 50% of observations. That central interval is the Interquartile Range (IQR).

Next, we’ll investigate whether the distributions of gene expression levels for samples with a given mutation are similar or different.

Generate distributions of gene expression levels by mutation

# Distributions by abnormality
sapply(levels(ALLb$mol.bio),function(x) summary(as.vector(es[,
    which(ALLb$mol.bio == x)])))
##          ALL1/AF4   BCR/ABL  E2A/PBX1       NEG
## Min.     2.266403  2.194858  2.267802  1.984919
## 1st Qu.  4.140797  4.124414  4.152339  4.111435
## Median   5.454388  5.467834  5.496825  5.469887
## Mean     5.620976  5.627016  5.630445  5.621616
## 3rd Qu.  6.805440  6.832862  6.818587  6.832056
## Max.    14.031235 14.044896 13.812578 13.949564

Across the mutations in the subset of samples, and relative to the neg control group, the distributions are similar.

Now, let’s shift to

Feature (Gene) Selection

We’d like to get an idea of the distribution of the expression levels of each gene across the sample of individuals. We’ll do that graphically using the median and IQR for each gene and plotting those values.

Plot the scores for each gene

# Calcs for IQR 
rowIQRs <- function(em) 
  rowQ(em,ceiling(0.75*ncol(em))) - rowQ(em,floor(0.25*ncol(em)))

# Plot medians and IQR
plot(rowMedians(es),rowIQRs(es),
     xlab='Median expression level',
     ylab='IQR expression level',
     main='Main Characteristics of Genes Expression Levels')

From this graph, we see that a large number of the genes have low variabiity. When a gene has low variability across samples, it is unlikely to be useful in deciphering among mutation types of B-cells in our sample.

At this point, it makes sense for us to eliminate some of the genes with low variability as they will not serve our model. We’ll do this using a heuristic threshold based on IQR values.

First, install the hgu95av2.db package from Bioconductor

BiocManager::install("hgu95av2.db")

Filter genes with variability < 0.20(global IQR)

>> Note: An error message is produced and I cannot fix the issue.

Display filtered ALLb

#Print
ALLb
## $eset
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 4025 features, 94 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: 01005 01010 ... LAL5 (94 total)
##   varLabels: cod diagnosis ... date last seen (21 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 14684422 16243790 
## Annotation: hgu95av2 
## 
## $filter.log
## $filter.log$numDupsRemoved
## [1] 2888
## 
## $filter.log$numLowVar
## [1] 4751
## 
## $filter.log$numRemoved.ENTREZID
## [1] 942
## 
## $filter.log$feature.exclude
## [1] 19

After this filtering effort, we’re left with only 4025 features across our 94 samples.

Now, let’s update our ALLb and es objects to reflect the filtering

ALLb <- ALLb$eset
es <- exprs(ALLb)
dim(es)
## [1] 4025   94

Next, we’ll compare the mean expression level of genes across the subsets of samples by mutation. This represents the mean conditioned on our target values. We’ll do this using ANOVA filters.
Compare means across groups

# Conduct ANOVA filtering
f <- Anova(ALLb$mol.bio,p=0.01)
ff <- filterfun(f)
selGenes <- genefilter(exprs(ALLb),ff)

# Output results
sum(selGenes) 
## [1] 751
ALLb <- ALLb[selGenes,]
ALLb
## ExpressionSet (storageMode: lockedEnvironment)
## assayData: 751 features, 94 samples 
##   element names: exprs 
## protocolData: none
## phenoData
##   sampleNames: 01005 01010 ... LAL5 (94 total)
##   varLabels: cod diagnosis ... date last seen (21 total)
##   varMetadata: labelDescription
## featureData: none
## experimentData: use 'experimentData(object)'
##   pubMedIds: 14684422 16243790 
## Annotation: hgu95av2

Next, let’s regerate the plot of median and IQR values for each gene, but with our filtered data set.

Generate plot for final set of genes

# ANOVA filtered plot of median, IQR per gene
es <- exprs(ALLb)
plot(rowMedians(es),rowIQRs(es),
     xlab='Median expression level',
     ylab='IQR expression level',
     main='Distribution Properties of the Selected Genes')

This plot shows the genes of use to us after we eliminated genes with low variability.

Now, we’ll move on to filtering using random forests, but we need to change the names of the genes before proceeding as they are currently non-standard for what is expected in data frames used in modeling.

Change names of genes

# Change names for modeling standards
featureNames(ALLb) <- make.names(featureNames(ALLb))
es <- exprs(ALLb)

Install randomForest

install.packages("randomForest")

Obtain a ranking of genes using random forests

Here, we’ll construct a training set by adding mutation information to the expression matrix. We’ll also obtain the genes that appear at the top 30 positions of ranking based on their scores.

>> Note: There are error messages I cannot fix.

But what is the expression levels distribution for these top 30 genes? We’ll find out below.

Obtain median level for top 30 genes

# Obtain median levels for top 30 genes from randomForest
sapply(rf.genes,function(g) tapply(dt[,g],dt$Mut,median))
##          X1635_at X40795_at X37015_at X1467_at X40504_at X1674_at X34699_at
## ALL1/AF4 7.302814  3.867134  3.752649 3.708985  3.218079 3.745752  4.253504
## BCR/ABL  8.693082  4.544239  4.857105 4.239306  4.924310 5.833510  6.315966
## E2A/PBX1 7.562676  4.151637  6.579530 3.411696  3.455316 3.808258  6.102031
## NEG      7.324691  3.909532  3.765741 3.515020  3.541651 4.244791  6.092511
##          X34850_at X35162_s_at X39837_s_at X37225_at X41470_at X40202_at
## ALL1/AF4  5.426653    4.398885    6.633188  5.220668  9.616743  8.550639
## BCR/ABL   6.898979    4.924553    7.374046  3.460902  5.205797  9.767293
## E2A/PBX1  5.928574    4.380962    6.708400  7.445655  3.931191  7.414635
## NEG       6.327281    4.236335    6.878846  3.387552  4.157748  7.655605
##          X36638_at X37363_at X37351_at X32378_at X38323_at X40480_s_at
## ALL1/AF4  9.811828  3.910130  5.919309  8.703860  4.195967    6.414368
## BCR/ABL   8.486446  5.583184  6.959287  9.694933  4.866452    8.208263
## E2A/PBX1  6.259730  3.800260  6.125676 10.066073  4.317550    6.722296
## NEG       5.856580  3.926599  6.255333  9.743168  4.286104    7.362318
##          X37105_at X36617_at X1249_at X39070_at X36275_at X1616_at X41274_at
## ALL1/AF4  6.845719  6.438007 3.582763  7.679571  3.618819 4.059859  5.501107
## BCR/ABL   6.493001  7.480436 4.477659  8.543137  6.259073 3.643486  5.261901
## E2A/PBX1  6.740213  6.627934 3.257649  6.671786  3.635956 6.033233  4.888560
## NEG       6.298859  6.561701 3.791764  7.614008  3.749953 3.747773  4.860075
##          X2062_at X41136_s_at X39631_at X35831_at
## ALL1/AF4 9.409753    4.060797  4.412177  6.488396
## BCR/ABL  7.530185    4.855383  5.021266  5.729676
## E2A/PBX1 7.935259    4.880618  4.901999  5.527489
## NEG      7.086033    5.009292  4.537882  5.992724

There are some interesting differences between median expression level across types of mutations in this set of 30 genes. That indicates there is good discriminative power in the selection, and that will help with our modeling efforts.

We can also view this graphically by inspecting the concrete expression values of these top 30 genes for the 94 samples.

Graph the concrete expression values of genes from our samples

# Graph the expression level of the 30 genes across 94 samples
library(lattice)
ordMut <- order(dt$Mut)
levelplot(as.matrix(dt[ordMut,rf.genes]),
          aspect='fill', xlab='', ylab='',
          scales=list(
            x=list(
              labels=c('+','-','*','|')[as.integer(dt$Mut[ordMut])],
              cex=0.7,
              tck=0)
            ),
          main=paste(paste(c('"+"','"-"','"*"','"|"'),
                           levels(dt$Mut)
                          ),
                     collapse='; '),
          col.regions=colorRampPalette(c('white','orange','blue'))
          )

We can see there are several genes with differences in expression level across the mutations (e.g., gene X36275_at between ALL1/AF4 and BCR/ABL).

Now, we’ll move on to filtering using feature clustering ensembles, which involes using a clustering algorithm to obstain 30 groups of variables that are intended to be similar. The 30 variable cluters are then used to obtain an ensemble classification model where m models will be obtained with 30 variables each through random selection from among the 30 clusters.

We’ll do this using the Hmisc() package that uses a hierarchical clustering algorithm to cluster variables in our set.

Install Hmisc()

packages.install("Hmisc")  

Cluster the variables and display the resulting table

# display table
table(clus30)
## clus30
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 
## 21 16 37 18 44 19 28 38 28 30 35 54 20 47 18 45 17 14 25 18 22 31 28 15  7 15 
## 27 28 29 30 
## 11 18 19 13

Based on this clustering, we can create a set of predictors through random selection of one variable per cluster. We’ll use a function to achieve this.

Generate one set of variables via random sampling from clusters

# Function for set of predictors
getVarsSet <- function(cluster,nvars=30,seed=NULL,verb=F) 
{
  if (!is.null(seed)) set.seed(seed)

  cls <- cutree(cluster,nvars)
  tots <- table(cls)
  vars <- c()
  vars <- sapply(1:nvars,function(clID)
    {
      if (!length(tots[clID])) stop('Empty cluster! (',clID,')')
      x <- sample(1:tots[clID],1)
      names(cls[cls==clID])[x]
    })
  if (verb)  structure(vars,clusMemb=cls,clusTots=tots)
  else       vars
}

#Print outcome once
getVarsSet(vc$hclust)
##  [1] "X35816_at"   "X33047_at"   "X32773_at"   "X35831_at"   "X38980_at"  
##  [6] "X35418_at"   "X38370_at"   "X39650_s_at" "X38063_at"   "X35693_at"  
## [11] "X37875_at"   "X35735_at"   "X34283_at"   "X40054_at"   "X34180_at"  
## [16] "X1467_at"    "X41237_at"   "X34936_at"   "X33263_at"   "X41872_at"  
## [21] "X38285_at"   "X36897_at"   "X40441_g_at" "X1943_at"    "X336_at"    
## [26] "X905_at"     "X2002_s_at"  "X33304_at"   "X37280_at"   "X39070_at"

Now, call the function again and observe that it returns a different result. This will happen each time.

Call function again

# Call function
getVarsSet(vc$hclust)
##  [1] "X266_s_at"   "X37657_at"   "X1039_s_at"  "X40087_at"   "X38518_at"  
##  [6] "X36312_at"   "X33528_at"   "X34168_at"   "X37558_at"   "X32696_at"  
## [11] "X37684_at"   "X36637_at"   "X1640_at"    "X1453_at"    "X39058_at"  
## [16] "X41257_at"   "X40771_at"   "X34936_at"   "X37147_at"   "X41225_at"  
## [21] "X36203_at"   "X40587_s_at" "X34816_at"   "X32618_at"   "X32612_at"  
## [26] "X41417_at"   "X37027_at"   "X35369_at"   "X37280_at"   "X39771_at"

Now, let’s shift to

Predicting Cytogenetic Abnormalities

At this point, we shift to build models with 3 different modeling techniques: random forests, support vector machines, and k-Nearest Neighbors. Predictions are obtained through random forests by a system of decision trees and a voting mechanism. Support vector machines support predictions by mapping out original data onto a high-dimensional space and obtaining a “separating hyper plane” (Torgo 2011:127) through linear modeling. The hyper plane separates the classes of the problem when working with classification tasks. The k-Nearest Neighbors algorithm does not actually obtain a model from training data. When it is time for prediction, k-Nearest Neighbors searches for similar cases in the training data it has stored. The k-most similar training cases, identified through a voting mechanism, are then used to obtain predictions for a given test case.

Later, we compare our selected models using an iterative Leave-One-Out Cross-Validation approach that obtains predictions through a voting mechanism comparing variants. Variants are ranked by “best” result. We use the function rankSystems() to obtain the best performing results for each model and join them together in a compEx object called “all.trials.” Next, we use a function (e.g., bestknn.loocv) to determine the predictions of the best model on each iteration of the LOOCV process. We use the attr() attribute function to learn more about the predictions, including their actual values, and then we obtain a confusion matrix. These matrices allow us to inspect errors made by the model and to observe when the model correctly predicts information.

The associated code for this work is visible below:

###################################################
### Predicting Cytogenetic Abnormalities
###################################################

# LOOCV Method
  data(iris)
  rpart.loocv <- function(form,train,test,...) {
    require(rpart,quietly=T)
    m <- rpart(form,train,...)
    p <- predict(m,test,type='class')
    c(accuracy=ifelse(p == resp(form,test),100,0))
  }
  exp <- loocv(learner('rpart.loocv',list()),
               dataset(Species~.,iris),
               loocvSettings(seed=1234,verbose=F))
  
# Summary of LOOCV method
  summary(exp)

# k-Nearest Neighbors
  library(class)
  data(iris)
  idx <- sample(1:nrow(iris),as.integer(0.7*nrow(iris)))
  tr <- iris[idx,]
  ts <- iris[-idx,]
  preds <- knn(tr[,-5],ts[,-5],tr[,5],k=3)
  table(preds,ts[,5])
  
  
  kNN <- function(form,train,test,norm=T,norm.stats=NULL,...) {
    require(class,quietly=TRUE)
    tgtCol <- which(colnames(train)==as.character(form[[2]]))
    if (norm) {
      if (is.null(norm.stats)) tmp <- scale(train[,-tgtCol],center=T,scale=T)
      else tmp <- scale(train[,-tgtCol],center=norm.stats[[1]],scale=norm.stats[[2]])
      train[,-tgtCol] <- tmp
      ms <- attr(tmp,"scaled:center")
      ss <- attr(tmp,"scaled:scale")
      test[,-tgtCol] <- scale(test[,-tgtCol],center=ms,scale=ss)
    }
    knn(train[,-tgtCol],test[,-tgtCol],train[,tgtCol],...)
  }
  preds.norm <- kNN(Species ~ .,tr,ts,k=3)
  table(preds.norm,ts[,5])
  preds.notNorm <- kNN(Species ~ .,tr,ts,norm=F,k=3)
  table(preds.notNorm,ts[,5])


# Comparing the models
vars <- list()
vars$randomForest <- list(ntree=c(500,750,100),
                          mtry=c(5,15,30),
                          fs.meth=list(list('all'),
                                       list('rf',30),
                                       list('varclus',30,50)))
vars$svm <- list(cost=c(1,100,500),
                 gamma=c(0.01,0.001,0.0001),
                 fs.meth=list(list('all'),
                              list('rf',30),
                              list('varclus',30,50)))
vars$knn <- list(k=c(3,5,7,11),
                 norm=c(T,F),
                 fs.meth=list(list('all'),
                              list('rf',30),
                              list('varclus',30,50)))


varsEnsembles <- function(tgt,train,test,
                          varsSets,
                          baseLearner,blPars,
                          verb=F)
{
  preds <- matrix(NA,ncol=length(varsSets),nrow=NROW(test))
  for(v in seq(along=varsSets)) {
    if (baseLearner=='knn')
      preds[,v] <- knn(train[,varsSets[[v]]],
                       test[,varsSets[[v]]],
                       train[,tgt],blPars)
    else {
      m <- do.call(baseLearner,
                   c(list(as.formula(paste(tgt,
                                           paste(varsSets[[v]],
                                                 collapse='+'),
                                           sep='~')),
                          train[,c(tgt,varsSets[[v]])]),
                     blPars)
                   )
      if (baseLearner == 'randomForest')
        preds[,v] <- do.call('predict',
                             list(m,test[,c(tgt,varsSets[[v]])],
                                  type='response'))
      else
        preds[,v] <- do.call('predict',
                             list(m,test[,c(tgt,varsSets[[v]])]))
    }
  }
  ps <- apply(preds,1,function(x)
                 levels(factor(x))[which.max(table(factor(x)))])
  ps <- factor(ps,
               levels=1:nlevels(train[,tgt]),
               labels=levels(train[,tgt]))
  if (verb) structure(ps,ensemblePreds=preds) else ps
}

#Generic Model
genericModel <- function(form,train,test,
                         learner,
                         fs.meth,
                         ...)
  {
    cat('=')
    tgt <- as.character(form[[2]])
    tgtCol <- which(colnames(train)==tgt)

    # Anova filtering  
    f <- Anova(train[,tgt],p=0.01)
    ff <- filterfun(f)
    genes <- genefilter(t(train[,-tgtCol]),ff)
    genes <- names(genes)[genes]
    train <- train[,c(tgt,genes)]
    test <- test[,c(tgt,genes)]
    tgtCol <- 1

    # Specific filtering 
    if (fs.meth[[1]]=='varclus') {
      require(Hmisc,quietly=T)
      v <- varclus(as.matrix(train[,-tgtCol]))
      VSs <- lapply(1:fs.meth[[3]],function(x)
                    getVarsSet(v$hclust,nvars=fs.meth[[2]]))
      pred <- varsEnsembles(tgt,train,test,VSs,learner,list(...))

    } else {
      if (fs.meth[[1]]=='rf') {
        require(randomForest,quietly=T)
        rf <- randomForest(form,train,importance=T)
        imp <- importance(rf)
        imp <- imp[,ncol(imp)-1]
        rf.genes <- names(imp)[order(imp,decreasing=T)[1:fs.meth[[2]]]]
        train <- train[,c(tgt,rf.genes)]
        test <- test[,c(tgt,rf.genes)]
      }

      if (learner == 'knn') 
        pred <- kNN(form,
             train,
             test,
             norm.stats=list(rowMedians(t(as.matrix(train[,-tgtCol]))),
                             rowIQRs(t(as.matrix(train[,-tgtCol])))),
             ...)
      else {
        model <- do.call(learner,c(list(form,train),list(...)))
        pred <- if (learner != 'randomForest') predict(model,test)
                else predict(model,test,type='response')
      }

    }

    c(accuracy=ifelse(pred == resp(form,test),100,0))
  }


require(class,quietly=TRUE)
require(randomForest,quietly=TRUE)
require(e1071,quietly=TRUE)
load('myALL.Rdata')
es <- exprs(ALLb)
# simple filtering
ALLb <- nsFilter(ALLb,
                 var.func=IQR,var.cutoff=IQR(as.vector(es))/5, 
                 feature.exclude="^AFFX")
ALLb <- ALLb$eset
# the data set
featureNames(ALLb) <- make.names(featureNames(ALLb))
dt <- data.frame(t(exprs(ALLb)),Mut=ALLb$mol.bio)
DSs <- list(dataset(Mut ~ .,dt,'ALL'))
# The learners to evaluate
TODO <- c('knn','svm','randomForest')
for(td in TODO) {
  assign(td,
         experimentalComparison(
              DSs,
              c(
                do.call('variants',
                        c(list('genericModel',learner=td),
                          vars[[td]],
                          varsRootName=td))
                ),
               loocvSettings(seed=1234,verbose=F)
                                )
         )
  save(list=td,file=paste(td,'Rdata',sep='.'))
}


load('knn.Rdata')
load('svm.Rdata')
load('randomForest.Rdata')


rankSystems(svm,max=T)


all.trials <- join(svm,knn,randomForest,by='variants')


rankSystems(all.trials,top=10,max=T)


getVariant('knn.v2',all.trials)


bestknn.loocv <- function(form,train,test,...) {
  require(Biobase,quietly=T)
  require(randomForest,quietly=T)
  cat('=')
  tgt <- as.character(form[[2]])
  tgtCol <- which(colnames(train)==tgt)
  # Anova filtering
  f <- Anova(train[,tgt],p=0.01)
  ff <- filterfun(f)
  genes <- genefilter(t(train[,-tgtCol]),ff)
  genes <- names(genes)[genes]
  train <- train[,c(tgt,genes)]
  test <- test[,c(tgt,genes)]
  tgtCol <- 1
  # Random Forest filtering
  rf <- randomForest(form,train,importance=T)
  imp <- importance(rf)
  imp <- imp[,ncol(imp)-1]
  rf.genes <- names(imp)[order(imp,decreasing=T)[1:30]]
  train <- train[,c(tgt,rf.genes)]
  test <- test[,c(tgt,rf.genes)]
  # knn prediction
  ps <- kNN(form,train,test,norm=T, 
            norm.stats=list(rowMedians(t(as.matrix(train[,-tgtCol]))),
                            rowIQRs(t(as.matrix(train[,-tgtCol])))),
            k=5,...)
  structure(c(accuracy=ifelse(ps == resp(form,test),100,0)),
            itInfo=list(ps)
           )
}
resTop <- loocv(learner('bestknn.loocv',pars=list()),
                dataset(Mut~.,dt),
                loocvSettings(seed=1234,verbose=F),
                itsInfo=T)


attr(resTop,'itsInfo')[1:4]


dt <- data.frame(t(exprs(ALLb)),Mut=ALLb$mol.bio)


preds <- unlist(attr(resTop,'itsInfo'))
table(preds,dt$Mut)

Conclusions of our Analysis

Our compEx object, “all.trials,” reveals that the top score in tests comes from a variant of the k-Nearest Neighbor approach. The variant uses 30 genes filtered by random forest, 5 neighbors, and normalized gene expression values. Among the top 10 in all.trials, accuracy scores are very similar. Across the 94 cases in our tests, the top score made 11 errors while the 10th place score made 13. Looking at the confusion matrix for our trials, we can conclude that all cases of the ALL11/AF4 mutation are correctly predicted. We can also observe that most errors with the model relate to predicting NEG for a case when actually there is some mutation. In 5 cases, a mutation is predicted when there isn’t one. As I shift gears from writing about the chapter to diving into their code, one way I think I’ll be able to easily improve on their analytical and methodological approach is with the use of packages like tidyverse, ggplot, and plotly.