# load packages
library(glmnet)
library(randomForest)
library(dplyr)
library(doMC)
library(foreach)
library(ggplot2)
library(plsgenomics) # for PLS-LDA
# devtools::install_github("ad1729/akshat")
#library(akshat) # my own package with useful functions
Update: An older version of this script ran the nested for loops sequentially which had relatively long run times. The latest version uses the foreach package to run the resampling inner loops in parallel thereby leading to a reduction in computation time.
Introduction
This short project aims to build a classifier for accurately detecting whether patients with a particular gene expression profile have prostate cancer or not. Starting with a brief description of the dataset, statistical methods for building classifiers that are used in this project are mentioned along with the reasons for their selection. Next, R output from the modelling is provided. This is followed by a discussion of the output and some comments.
Data Description
The data are obtained from the study in Singh et al (2002) where the authors provided access to two datasets. The first dataset is taken as the training set with 102 observations, 12600 predictors, and 1 categorical response column indicating “Tumor” or “Normal”. 50 patients belonged to the category “Normal” and 52 patients had tumorous growth. A test dataset was also provided with 34 observations (9 Normal and 25 Tumor) and (12600 + 1) columns in the data matrix.
The code for naming the columns, reading in the training and test data is given in the chunk below.
# read in colnames from the names file
col_name = read.table("prostate/prostate_TumorVSNormal.names", skip = 4)
col_name = col_name[,1]
head(col_name)
[1] AFFX-MurIL2_at AFFX-MurIL10_at AFFX-MurIL4_at AFFX-MurFAS_at AFFX-BioB-5_at
[6] AFFX-BioB-M_at
12600 Levels: 1000_at 1001_at 1002_f_at 1003_s_at 1004_at 1005_at 1006_at 1007_s_at ... AFFX-YEL024w/RIP1_at
col_name = gsub("-", ".", col_name)
col_name = gsub("_", ".", col_name)
head(col_name)
[1] "AFFX.MurIL2.at" "AFFX.MurIL10.at" "AFFX.MurIL4.at" "AFFX.MurFAS.at" "AFFX.BioB.5.at"
[6] "AFFX.BioB.M.at"
# read training data
tr = read.table("prostate/prostate_TumorVSNormal_train.data", header = F, sep = ",",
blank.lines.skip = T, col.names = c(col_name, "response"),
colClasses = c(rep("numeric", length(col_name)), "factor"))
#head(tr[, 1:5]) # last col is the response
str(tr[,1:10])
'data.frame': 102 obs. of 10 variables:
$ AFFX.MurIL2.at : num -9 -2 -6 0 -1 0 -5 -3 -8 -12 ...
$ AFFX.MurIL10.at: num 1 1 17 9 0 17 5 1 -2 11 ...
$ AFFX.MurIL4.at : num 1 1 6 4 1 1 -1 1 -1 -3 ...
$ AFFX.MurFAS.at : num 15 4 29 19 5 20 9 5 -32 21 ...
$ AFFX.BioB.5.at : num -2 -2 4 -10 0 -20 -10 -2 -20 -10 ...
$ AFFX.BioB.M.at : num -3 -5 -11 -18 -4 -18 -17 -6 -41 -9 ...
$ AFFX.BioB.3.at : num 4 0 -8 -18 1 -2 -8 0 1 -9 ...
$ AFFX.BioC.5.at : num 8 8 10 5 6 15 13 0 30 12 ...
$ AFFX.BioC.3.at : num -12 -5 -24 -33 -4 -9 -15 -4 -23 -29 ...
$ AFFX.BioDn.5.at: num -12 -9 -32 -31 -9 -23 -26 -8 -36 -43 ...
#dim(tr)
# read test data
test = read.table("prostate/prostate_TumorVSNormal_test.data", header = F, sep = ",",
blank.lines.skip = T, col.names = c(col_name, "response"),
colClasses = c(rep("numeric", length(col_name)), "factor"))
#head(test[1:10, 1:10]) # last col is the response
#dim(test)
Y = tr[, "response"]
X = tr[, -ncol(tr)]
The following output shows a summary of the variability in the variances across the predictors.
summary(apply(tr[,-ncol(tr)], 2, var))
Min. 1st Qu. Median Mean 3rd Qu. Max.
7 88 259 5332 1024 8031000
This extreme difference in the variances between predictors indicates that standardizing our dataset will be necessary, to put variables on a similar footing, so to speak. We can also get an idea about the multicollinearity among the variables by running the following code chunk.
cor_mat = cor(tr[,-ncol(tr)])
cor_vec = cor_mat[upper.tri(cor_mat, diag = FALSE)]
foo = summary(cor_vec) # summary of pairwise correlations between predictors
rm(cor_mat, cor_vec) # remove the above two objects since they're consuming a lot of memory
foo
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.989800 -0.306100 0.006571 0.004139 0.313800 0.994900
It appears that most of the genes are uncorrelated with each other, as indicated by the mean/median but some genes are very strongly correlated with each other, as indicated by the min/max values.
Modelling
Since we find ourselves in a situation where \(p \gg n\), a lot of the classical methods will fail. Thus, relatively modern methods are required for the classification task. To this end, the following three methods were used: PLS-LDA (partial least squares - linear discriminant analysis), random forests, and penalized logistic regression.
The classifiers are assessed based on the following metrics: misclassification rate, precision and recall which are defined as follows:
misclassification rate: \(1 - \frac{TP + TN}{N}\)
precision (PPV): \(\frac{TP}{TP + FP}\)
recall (sensitivity): \(\frac{TP}{TP + FN}\)
where TP: true positive, TN: true negative, FP: false positive, FN: false negatives.
PLS-(L)DA
PLS-LDA first performs PLS on the data and then performs classification using LDA on the scores (lower-dimensional projections of the data) that are created by PLS. LDA cannot be used here without preprocessing the data since \(p\) is much larger than \(n\). PCA (principal component analysis) could be performed instead of PLS as an input to LDA, however, PCA creates projected variables that ignore the covariance between the predictors and the response as opposed to the PLS method which takes this into account.
Number of scores/components \((n_{comp})\) to retain from the PLS procedure are selected using cross-validation, by trying a sequence of values. For each \((n_{comp})\), the performance (misclassification rate, precision, recall) is assessed by running 120 repeated 60-40 (train-validation sets) splits of the data.
## PLSDA - train the model and assess performance using repeated CV
score = seq(1, 30, 3) # PLS scores - number of components to retain
mcr.df = c()
ppv.df = c()
rec.df = c()
system.time({
registerDoMC(cores = 8)
for (ctr in seq_along(score)) {
#ctr = 1
#print(sprintf("#comp = %s", score[ctr]))
ncomp = score[ctr]
foo <- foreach(i = 1:120, .combine = 'rbind', .packages = c("plsgenomics", "magrittr")) %dopar% {
# split data into 60-40
indx = unname(unlist(caret::createDataPartition(Y, p = 0.6)))
temp.tr = X[indx, ] %>% apply(., 2, scale)
temp.te = X[-indx, ] %>% apply(., 2, scale)
# fit the model
temp.fit = pls.lda(Xtrain = temp.tr, Ytrain = as.numeric(Y[indx]),
ncomp = ncomp, Xtest = temp.te)
# predict the validation set class
temp.pred = temp.fit$predclass
# convert predict number to label (1: normal, 2: tumor)
temp.class = ifelse(temp.pred == "1", "Normal", "Tumor")
# compute metrics
temp.conf = table(actual = Y[-indx], predicted = temp.class)
temp.mcr = (1 - (sum(base::diag(temp.conf)) / sum(temp.conf))) * 100
temp.ppv = (temp.conf[2,2] / (temp.conf[2,2] + temp.conf[1,2])) * 100
temp.rec = (temp.conf[2,2] / (temp.conf[2,2] + temp.conf[2,1])) * 100
c(ncomp, temp.mcr, temp.ppv, temp.rec)
}
# save the results
mcr.df = rbind(mcr.df, foo[, c(1,2)])
ppv.df = rbind(ppv.df, foo[, c(1,3)])
rec.df = rbind(rec.df, foo[, c(1,4)])
}
})
user system elapsed
10598.412 2264.540 1668.038
beepr::beep(3)
colnames(mcr.df) = c("ncomp", "misclass")
colnames(ppv.df) = c("ncomp", "PosPredVal")
colnames(rec.df) = c("ncomp", "recall")
mcr.df = as.data.frame(mcr.df)
ppv.df = as.data.frame(ppv.df)
rec.df = as.data.frame(rec.df)
The runtime was 28 minutes.
Next, we can visualize the results from the training process. The plots show boxplots for misclassification rate, precision, and recall across the \(n_{comp}\) values. We can pick \(n_{comp}\) based on the value that had the highest (or smallest) median value, or the value at which the boxplot is the narrowest.
mcr.df %>%
ggplot(aes(x = as.factor(ncomp), y = misclass)) +
geom_boxplot() + theme_bw() + xlab("# Components") +
ylab("% Misclassified (120 random splits of 60-40)") +
stat_summary(fun.y = "mean", geom = "point", size = 5, color = "red", shape = 16)

ppv.df %>%
ggplot(aes(x = as.factor(ncomp), y = PosPredVal)) +
geom_boxplot() + theme_bw() + xlab("# Components") +
ylab("Precision (%) (120 random splits of 60-40)") +
stat_summary(fun.y = "mean", geom = "point", size = 5, color = "red", shape = 16)

rec.df %>%
ggplot(aes(x = as.factor(ncomp), y = recall)) +
geom_boxplot() + theme_bw() + xlab("# Components") +
ylab("Sensitivity/Recall (%) (120 random splits of 60-40)") +
stat_summary(fun.y = "mean", geom = "point", size = 5, color = "red", shape = 16)

Based on the figures, we select \(n_{comp}\) as 13. The model will be assessed on the test set in one of the later sections.
Random Forest
Next, a random forest is used as a classifier on the standardized data with 1000 trees. The rest of the parameters are left at their default values (Eg: \(m_{try} \approx \sqrt p\), etc). The following output shows the result of training the random forest classifier on the standardized training data.
X_std = X %>% apply(., 2, scale)
rf_fit = randomForest(x = X_std, y = Y, ntree = 1000, importance = TRUE, proximity = TRUE)
Call:
randomForest(x = X_std, y = Y, ntree = 1000, importance = TRUE, proximity = TRUE)
Type of random forest: classification
Number of trees: 1000
No. of variables tried at each split: 112
OOB estimate of error rate: 8.82%
Confusion matrix:
Normal Tumor class.error
Normal 48 2 0.0400000
Tumor 7 45 0.1346154
Based on the results, we see that the classifier had an error rate of 9.8%. Cross-validation was not explicitly performed here since random forest assesses error rate using out-of-bag samples, i.e., all the cases that were not in the bootstrap sample used to train the model. As the authors of ESL point out in section 15.3.1, the OOB error rate is approximately similar to the LOO-CV rate.
The confusion matrix shows that the classifier has a higher false negative rate rather than the false positive rate. This indicates that the precision will be higher as compared to the recall in this case. Next, the OOB error rate vs the number of trees in the ensemble is plotted below. The plot indicates that the OOB error stabilizes around 200 trees.
#plot(1:1000, rf_fit$err.rate[,1], type = "l", xlab = "# Trees", ylab = "OOB Error Rate", main = "Random Forest Error Rate")
# old code
# data.frame(x = 1:1000, y = rf_fit$err.rate[,1]) %>%
# ggplot(aes(x, y)) +
# geom_line(color = "blue") + theme_bw() +
# xlab("# Trees") + ylab("OOB Error Rate")
# new code that uses the oob_error() function in my package
akshat::oob_error(rf_fit)

# plotting by class
akshat::oob_error(rf_fit, by_class = TRUE)

#plot(rf_fit, main = "")
#varImpPlot(rf_fit)
# proximity plot
#MDSplot(rf_fit, Y)
#title("MDS Proximity Plot For Random Forest")
# MCR, PPV and recall
rf.conf = rf_fit$confusion[,1:2]
rf.mcr = (1 - (sum(base::diag(rf.conf)) / sum(rf.conf))) * 100
rf.ppv = (rf.conf[2,2] / (rf.conf[2,2] + rf.conf[1,2])) * 100
rf.rec = (rf.conf[2,2] / (rf.conf[2,2] + rf.conf[2,1])) * 100
data.frame(misclass = round(rf.mcr, 2), precision = round(rf.ppv, 2), recall = round(rf.rec, 2))
Penalized Linear Regression
Finally, a penalized logistic regression (using the elastic-net penalty) is performed. This fits a penalized linear model which simultaneously performs parameter estimation and variable selection. It rests on the (informal) “bet on sparsity” principle, which states that only a small subset of the predictor have a non-zero effect on the response variable. The final model can easily be written down in a simple prediction equation and only has a small subset of genes which have non-zero coefficients. Model assessment is performed by computing performance criteria on 1000 random splits of the training data into training set (60%) and validation set (40%).
en.mcr = c()
en.ppv = c()
en.rec = c()
registerDoMC(cores = 8)
system.time({
res <- foreach(i = 1:1000, .combine = 'rbind', .packages = c("glmnet")) %dopar% {
# split data into 60-40
indx = unname(unlist(caret::createDataPartition(Y, p = 0.6)))
temp.tr = X[indx, ] #%>% apply(., 2, scale)
temp.te = X[-indx, ] #%>% apply(., 2, scale)
# fit model (standardizes data internally)
temp.las = cv.glmnet(x = data.matrix(temp.tr), y = Y[indx], family = "binomial",
type.measure = "class", nfolds = 5, alpha = 0.5)
# get pred
temp.class = predict(temp.las, data.matrix(temp.te), s = "lambda.1se", type = "class")
# compute metrics
temp.conf = table(actual = Y[-indx], predicted = temp.class)
mcr = (1 - (sum(base::diag(temp.conf)) / sum(temp.conf))) * 100
ppv = (temp.conf[2,2] / (temp.conf[2,2] + temp.conf[1,2])) * 100
rec = (temp.conf[2,2] / (temp.conf[2,2] + temp.conf[2,1])) * 100
c(mcr, ppv, rec)
}
})
user system elapsed
1384.528 20.496 699.936
# remove the rownames
rownames(res) = c()
en.mcr = res[, 1, drop = T]
en.ppv = res[, 2, drop = T]
en.rec = res[, 3, drop = T]
en.res = data.frame(
rbind(unname(summary(en.mcr)), unname(summary(en.ppv)), unname(summary(en.rec))), row.names = c("misclass", "precision", "recall"))
colnames(en.res) = names(summary(en.rec))
#en.res %>% xtable::xtable(digits = 2, align = rep("c", ncol(en.res) + 1), caption = "1000 repeats")
Training time was about 700/60 \(\approx\) 12 minutes.
The table above shows the summary of the misclassification rate, precision, and recall across the 1000 random splits. The median misclassification rate is 7.5%, precision: 95% and recall: 90%. Ideally, the recall should be 100% since it would be catastrophic if the classifier assigned the ‘non-tumor’ label to someone with cancer.
# predict using full data
lasso_fit = cv.glmnet(x = data.matrix(X), y = Y, family = "binomial", type.measure = "class", nfolds = 5, alpha = 0.4)
plot(lasso_fit)

#title("Elastic Net (alpha = 0.5)", line = 2.5)
Selecting \(\lambda\) based on the “1-standard error” rule results in a lambda value of 0.12 and that about 54 genes (out of 12600) have a non-zero coefficient at this lambda value, as seen from the figure above.
Test Set Performance
We can assess the performance of the three classifiers on the test set to get a measure of the generalization error of the classifier. The confusion matrices for the PLS-LDA, random forest, and lasso are given below.
### PLS-LDA
# fit the PLS model to the full training data with ncomp selected from the previous plots
# obtain confusion matrix and metrics from the final model on the test set
testY = test[, c("response")]
Xt = test[, -ncol(test)]
# PLS
fit.pls.lda = pls.lda(Xtrain = X_std, Ytrain = as.numeric(Y), ncomp = 13, Xtest = Xt)
test.class = ifelse(fit.pls.lda$predclass == "1", "Normal", "Tumor")
test.pls.conf = table(actual = testY, predicted = test.class)
pls_test = cbind(
(1 - (sum(base::diag(test.pls.conf)) / sum(test.pls.conf))) * 100, # mcr
(test.pls.conf[2,2] / (test.pls.conf[2,2] + test.pls.conf[1,2])) * 100, # ppv
(test.pls.conf[2,2] / (test.pls.conf[2,2] + test.pls.conf[2,1])) * 100 # rec
)
test.pls.conf
predicted
actual Normal Tumor
Normal 1 8
Tumor 0 25
### Random Forest
# get rf prediction
rf.test = predict(rf_fit, data.frame(apply(Xt, 2, scale)), "response")
# compute metrics
rf.conf.test = table(actual = testY, predicted = rf.test)
rf_test = cbind(
(1 - (base::sum(base::diag(rf.conf.test)) / sum(rf.conf.test))) * 100, #mcr
(rf.conf.test[2,2] / (rf.conf.test[2,2] + rf.conf.test[1,2])) * 100, # ppv/precision
(rf.conf.test[2,2] / (rf.conf.test[2,2] + rf.conf.test[2,1])) * 100 # recall
)
rf.conf.test
predicted
actual Normal Tumor
Normal 9 0
Tumor 2 23
### Lasso
lasso.test = predict(lasso_fit, data.matrix(Xt), s = "lambda.1se", type = "class")
# compute metrics
lasso.conf = table(actual = testY, predicted = lasso.test)
lasso_test = cbind(
(1 - (sum(base::diag(lasso.conf)) / sum(lasso.conf))) * 100, #mcr
(lasso.conf[2,2] / (lasso.conf[2,2] + lasso.conf[1,2])) * 100, # ppv/precision
(lasso.conf[2,2] / (lasso.conf[2,2] + lasso.conf[2,1])) * 100 # recall
)
lasso.conf
predicted
actual Normal Tumor
Normal 9 0
Tumor 1 24
The following table contrasts the various metrics for the three models on the test data.
test_df = data.frame(cbind(t(pls_test), t(rf_test), t(lasso_test)), row.names = c("Misclassification rate", "Precision", "Recall"))
colnames(test_df) = c("PLS-LDA", "RF", "Lasso")
round(test_df, 2)
There was also substantial variability observed in the test set performance of the lasso model across different reruns of the command for training the final lasso model using all the training data.
References
Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., … & Lander, E. S. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer cell, 1(2), 203-209. (Dataset Link)
Nguyen, D. V., & Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18(1), 39-50.
Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Springer New York.
---
title: "Detecting Prostate Cancer Using Gene Expression Profiles"
author: "Akshat Dwivedi"
date: '`r Sys.Date()`'
output:
  html_notebook:
    highlight: tango
    theme: flatly
    toc: yes
    toc_float: yes
  html_document:
    code_folding: hide
    number_sections: yes
    toc: yes
    toc_float: yes
---

```{r, include=FALSE}
knitr::opts_chunk$set(cache = TRUE)
```


```{r, echo=TRUE, message=FALSE}
# load packages
library(glmnet)
library(randomForest)
library(dplyr)
library(doMC)
library(foreach)
library(ggplot2)
library(plsgenomics) # for PLS-LDA
# devtools::install_github("ad1729/akshat")
#library(akshat) # my own package with useful functions
```

**Update: An older version of this script ran the nested for loops sequentially which had relatively long run times. The latest version uses the foreach package to run the resampling inner loops in parallel thereby leading to a reduction in computation time.**

# Introduction

This short project aims to build a classifier for accurately detecting whether patients with a particular gene expression profile have prostate cancer or not. Starting with a brief description of the dataset, statistical methods for building classifiers that are used in this project are mentioned along with the reasons for their selection. Next, R output from the modelling is provided. This is followed by a discussion of the output and some comments.

# Data Description

The data are obtained from the study in Singh et al (2002) where the authors provided access to two datasets. The first dataset is taken as the training set with 102 observations, 12600 predictors, and 1 categorical response column indicating "Tumor" or "Normal". 50 patients belonged to the category "Normal" and 52 patients had tumorous growth. A test dataset was also provided with 34 observations (9 Normal and 25 Tumor) and (12600 + 1) columns in the data matrix.

The code for naming the columns, reading in the training and test data is given in the chunk below.

```{r}
# read in colnames from the names file
col_name = read.table("prostate/prostate_TumorVSNormal.names", skip = 4)
col_name = col_name[,1]
head(col_name)
col_name = gsub("-", ".", col_name)
col_name = gsub("_", ".", col_name)
head(col_name)

# read training data
tr = read.table("prostate/prostate_TumorVSNormal_train.data", header = F, sep = ",", 
                blank.lines.skip = T, col.names = c(col_name, "response"), 
                colClasses = c(rep("numeric", length(col_name)), "factor"))
#head(tr[, 1:5]) # last col is the response
str(tr[,1:10])
#dim(tr)

# read test data
test = read.table("prostate/prostate_TumorVSNormal_test.data", header = F, sep = ",", 
                  blank.lines.skip = T, col.names = c(col_name, "response"), 
                  colClasses = c(rep("numeric", length(col_name)), "factor"))
#head(test[1:10, 1:10]) # last col is the response
#dim(test)

Y = tr[, "response"]
X = tr[, -ncol(tr)]
```

The following output shows a summary of the variability in the variances across the predictors.

```{r}
summary(apply(tr[,-ncol(tr)], 2, var))
```

This extreme difference in the variances between predictors indicates that standardizing our dataset will be necessary, to put variables on a similar footing, so to speak. We can also get an idea about the multicollinearity among the variables by running the following code chunk.

```{r}
cor_mat = cor(tr[,-ncol(tr)])
cor_vec = cor_mat[upper.tri(cor_mat, diag = FALSE)]
foo = summary(cor_vec) # summary of pairwise correlations between predictors
rm(cor_mat, cor_vec) # remove the above two objects since they're consuming a lot of memory
foo
```

It appears that most of the genes are uncorrelated with each other, as indicated by the mean/median but some genes are very strongly correlated with each other, as indicated by the min/max values.

# Modelling

Since we find ourselves in a situation where $p \gg n$, a lot of the classical methods will fail. Thus, relatively modern methods are required for the classification task. To this end, the following three methods were used: PLS-LDA (partial least squares - linear discriminant analysis), random forests, and penalized logistic regression.

The classifiers are assessed based on the following metrics: misclassification rate, precision and recall which are defined as follows:

misclassification rate: $1 - \frac{TP + TN}{N}$

precision (PPV): $\frac{TP}{TP + FP}$

recall (sensitivity): $\frac{TP}{TP + FN}$

where TP: true positive, TN: true negative, FP: false positive, FN: false negatives.

## PLS-(L)DA

PLS-LDA first performs PLS on the data and then performs classification using LDA on the scores (lower-dimensional projections of the data) that are created by PLS. LDA cannot be used here without preprocessing the data since $p$ is much larger than $n$. PCA (principal component analysis) could be performed instead of PLS as an input to LDA, however, PCA creates projected variables that ignore the covariance between the predictors and the response as opposed to the PLS method which takes this into account. 

Number of scores/components $(n_{comp})$ to retain from the PLS procedure are selected using cross-validation, by trying a sequence of values. For each $(n_{comp})$, the performance (misclassification rate, precision, recall) is assessed by running 120 repeated 60-40 (train-validation sets) splits of the data.

```{r plslda, message=FALSE, warning=FALSE}
## PLSDA - train the model and assess performance using repeated CV
score = seq(1, 30, 3) # PLS scores - number of components to retain

mcr.df = c()
ppv.df = c()
rec.df = c()

system.time({
  registerDoMC(cores = 8)
  for (ctr in seq_along(score)) {
    
    #ctr = 1
    #print(sprintf("#comp = %s", score[ctr]))
    
    ncomp = score[ctr]
    
    foo <- foreach(i = 1:120, .combine = 'rbind', .packages = c("plsgenomics", "magrittr")) %dopar% {
      
      # split data into 60-40
      indx = unname(unlist(caret::createDataPartition(Y, p = 0.6)))
      temp.tr = X[indx, ] %>% apply(., 2, scale)
      temp.te = X[-indx, ] %>% apply(., 2, scale)
      
      # fit the model
      temp.fit = pls.lda(Xtrain = temp.tr, Ytrain = as.numeric(Y[indx]), 
                         ncomp = ncomp, Xtest = temp.te)
      
      # predict the validation set class
      temp.pred = temp.fit$predclass
      
      # convert predict number to label (1: normal, 2: tumor)
      temp.class = ifelse(temp.pred == "1", "Normal", "Tumor")
      
      # compute metrics
      temp.conf = table(actual = Y[-indx], predicted = temp.class)
      temp.mcr = (1 - (sum(base::diag(temp.conf)) / sum(temp.conf))) * 100
      temp.ppv = (temp.conf[2,2] / (temp.conf[2,2] + temp.conf[1,2])) * 100
      temp.rec = (temp.conf[2,2] / (temp.conf[2,2] + temp.conf[2,1])) * 100
      
      c(ncomp, temp.mcr, temp.ppv, temp.rec)
    }
    
    # save the results
    mcr.df = rbind(mcr.df, foo[, c(1,2)])
    ppv.df = rbind(ppv.df, foo[, c(1,3)])
    rec.df = rbind(rec.df, foo[, c(1,4)])
    
  }
})
  
colnames(mcr.df) = c("ncomp", "misclass")
colnames(ppv.df) = c("ncomp", "PosPredVal")
colnames(rec.df) = c("ncomp", "recall")

mcr.df = as.data.frame(mcr.df)
ppv.df = as.data.frame(ppv.df)
rec.df = as.data.frame(rec.df)
```

The runtime was `r round(1668/60)` minutes.

Next, we can visualize the results from the training process. The plots show boxplots for misclassification rate, precision, and recall across the $n_{comp}$ values. We can pick $n_{comp}$ based on the value that had the highest (or smallest) median value, or the value at which the boxplot is the narrowest.

```{r}
mcr.df %>% 
  ggplot(aes(x = as.factor(ncomp), y = misclass)) +
  geom_boxplot() + theme_bw() + xlab("# Components") + 
  ylab("% Misclassified (120 random splits of 60-40)") +
  stat_summary(fun.y = "mean", geom = "point", size = 5, color = "red", shape = 16)
```

```{r}
ppv.df %>% 
  ggplot(aes(x = as.factor(ncomp), y = PosPredVal)) +
  geom_boxplot() + theme_bw() + xlab("# Components") + 
  ylab("Precision (%) (120 random splits of 60-40)") +
  stat_summary(fun.y = "mean", geom = "point", size = 5, color = "red", shape = 16)
```

```{r}
rec.df %>% 
  ggplot(aes(x = as.factor(ncomp), y = recall)) +
  geom_boxplot() + theme_bw() + xlab("# Components") + 
  ylab("Sensitivity/Recall (%) (120 random splits of 60-40)") +
  stat_summary(fun.y = "mean", geom = "point", size = 5, color = "red", shape = 16)
```

Based on the figures, we select $n_{comp}$ as 13. The model will be assessed on the test set in one of the later sections.

## Random Forest

Next, a random forest is used as a classifier on the standardized data with 1000 trees. The rest of the parameters are left at their default values (Eg: $m_{try} \approx \sqrt p$, etc). The following output shows the result of training the random forest classifier on the standardized training data.

```{r}
X_std = X %>% apply(., 2, scale)

rf_fit = randomForest(x = X_std, y = Y, ntree = 1000, importance = TRUE, proximity = TRUE)
```

```{r}
rf_fit
```

Based on the results, we see that the classifier had an error rate of 9.8%. Cross-validation was not explicitly performed here since random forest assesses error rate using out-of-bag samples, i.e., all the cases that were not in the bootstrap sample used to train the model. As the authors of ESL point out in section 15.3.1, the OOB error rate is approximately similar to the LOO-CV rate.

The confusion matrix shows that the classifier has a higher false negative rate rather than the false positive rate. This indicates that the precision will be higher as compared to the recall in this case. Next, the OOB error rate vs the number of trees in the ensemble is plotted below. The plot indicates that the OOB error stabilizes around 200 trees.

```{r, warning=FALSE}
#plot(1:1000, rf_fit$err.rate[,1], type = "l", xlab = "# Trees", ylab = "OOB Error Rate", main = "Random Forest Error Rate")

# old code
# data.frame(x = 1:1000, y = rf_fit$err.rate[,1]) %>%
#   ggplot(aes(x, y)) + 
#   geom_line(color = "blue") + theme_bw() + 
#   xlab("# Trees") + ylab("OOB Error Rate")

# new code that uses the oob_error() function in my package
akshat::oob_error(rf_fit)

# plotting by class
akshat::oob_error(rf_fit, by_class = TRUE)

#plot(rf_fit, main = "")

#varImpPlot(rf_fit)

# proximity plot
#MDSplot(rf_fit, Y)
#title("MDS Proximity Plot For Random Forest")

# MCR, PPV and recall
rf.conf = rf_fit$confusion[,1:2]
rf.mcr = (1 - (sum(base::diag(rf.conf)) / sum(rf.conf))) * 100
rf.ppv = (rf.conf[2,2] / (rf.conf[2,2] + rf.conf[1,2])) * 100
rf.rec = (rf.conf[2,2] / (rf.conf[2,2] + rf.conf[2,1])) * 100
data.frame(misclass = round(rf.mcr, 2), precision = round(rf.ppv, 2), recall = round(rf.rec, 2))
```

## Penalized Linear Regression

Finally, a penalized logistic regression (using the elastic-net penalty) is performed. This fits a penalized linear model which simultaneously performs parameter estimation and variable selection. It rests on the (informal) "bet on sparsity" principle, which states that only a small subset of the predictor have a non-zero effect on the response variable. The final model can easily be written down in a simple prediction equation and only has a small subset of genes which have non-zero coefficients. Model assessment is performed by computing performance criteria on 1000 random splits of the training data into training set (60\%) and validation set (40\%).

```{r}
en.mcr = c()
en.ppv = c()
en.rec = c()

registerDoMC(cores = 8)
system.time({
  res <- foreach(i = 1:1000, .combine = 'rbind', .packages = c("glmnet")) %dopar% {
    
    # split data into 60-40
    indx = unname(unlist(caret::createDataPartition(Y, p = 0.6)))
    temp.tr = X[indx, ] #%>% apply(., 2, scale)
    temp.te = X[-indx, ] #%>% apply(., 2, scale)
    
    # fit model (standardizes data internally)
    temp.las = cv.glmnet(x = data.matrix(temp.tr), y = Y[indx], family = "binomial", 
                         type.measure = "class", nfolds = 5, alpha = 0.5)
    
    # get pred
    temp.class = predict(temp.las, data.matrix(temp.te), s = "lambda.1se", type = "class")
    
    # compute metrics
    temp.conf = table(actual = Y[-indx], predicted = temp.class)
    mcr = (1 - (sum(base::diag(temp.conf)) / sum(temp.conf))) * 100
    ppv = (temp.conf[2,2] / (temp.conf[2,2] + temp.conf[1,2])) * 100
    rec = (temp.conf[2,2] / (temp.conf[2,2] + temp.conf[2,1])) * 100
    
    c(mcr, ppv, rec)
  }
})

# remove the rownames
rownames(res) = c()

en.mcr = res[, 1, drop = T]
en.ppv = res[, 2, drop = T]
en.rec = res[, 3, drop = T]

en.res = data.frame(
  rbind(unname(summary(en.mcr)), unname(summary(en.ppv)), unname(summary(en.rec))), row.names = c("misclass", "precision", "recall"))
colnames(en.res) = names(summary(en.rec))
#en.res %>% xtable::xtable(digits = 2, align = rep("c", ncol(en.res) + 1), caption = "1000 repeats")
```

Training time was about 700/60 $\approx$ `r round(700/60)` minutes.

```{r}
en.res
```

The table above shows the summary of the misclassification rate, precision, and recall across the 1000 random splits.  The median misclassification rate is 7.5%, precision: 95% and recall: 90%. Ideally, the recall should be 100% since it would be catastrophic if the classifier assigned the 'non-tumor' label to someone with cancer.

```{r}
# predict using full data
lasso_fit = cv.glmnet(x = data.matrix(X), y = Y, family = "binomial", type.measure = "class", nfolds = 5, alpha = 0.4)

plot(lasso_fit)
#title("Elastic Net (alpha = 0.5)", line = 2.5)
```

```{r, include=FALSE}
a = coef(lasso_fit, s = "lambda.1se")
```

Selecting $\lambda$ based on the "1-standard error" rule results in a lambda value of `r round(lasso_fit$lambda.1se, 2)` and that about `r length(a[which(a != 0)]) - 1` genes (out of 12600) have a non-zero coefficient at this lambda value, as seen from the figure above.

# Test Set Performance

We can assess the performance of the three classifiers on the test set to get a measure of the generalization error of the classifier. The confusion matrices for the PLS-LDA, random forest, and lasso are given below.

```{r}
### PLS-LDA
# fit the PLS model to the full training data with ncomp selected from the previous plots
# obtain confusion matrix and metrics from the final model on the test set
testY = test[, c("response")]
Xt = test[, -ncol(test)]

# PLS
fit.pls.lda = pls.lda(Xtrain = X_std, Ytrain = as.numeric(Y), ncomp = 13, Xtest = Xt)
test.class = ifelse(fit.pls.lda$predclass == "1", "Normal", "Tumor")
test.pls.conf = table(actual = testY, predicted = test.class)
pls_test = cbind(
  (1 - (sum(base::diag(test.pls.conf)) / sum(test.pls.conf))) * 100, # mcr
  (test.pls.conf[2,2] / (test.pls.conf[2,2] + test.pls.conf[1,2])) * 100, # ppv
  (test.pls.conf[2,2] / (test.pls.conf[2,2] + test.pls.conf[2,1])) * 100 # rec
)

test.pls.conf

### Random Forest
# get rf prediction
rf.test = predict(rf_fit, data.frame(apply(Xt, 2, scale)), "response")
# compute metrics
rf.conf.test = table(actual = testY, predicted = rf.test)
rf_test = cbind(
  (1 - (base::sum(base::diag(rf.conf.test)) / sum(rf.conf.test))) * 100, #mcr
  (rf.conf.test[2,2] / (rf.conf.test[2,2] + rf.conf.test[1,2])) * 100, # ppv/precision
  (rf.conf.test[2,2] / (rf.conf.test[2,2] + rf.conf.test[2,1])) * 100 # recall
)

rf.conf.test

### Lasso
lasso.test = predict(lasso_fit, data.matrix(Xt), s = "lambda.1se", type = "class")
# compute metrics
lasso.conf = table(actual = testY, predicted = lasso.test)
lasso_test = cbind(
  (1 - (sum(base::diag(lasso.conf)) / sum(lasso.conf))) * 100, #mcr
  (lasso.conf[2,2] / (lasso.conf[2,2] + lasso.conf[1,2])) * 100, # ppv/precision
  (lasso.conf[2,2] / (lasso.conf[2,2] + lasso.conf[2,1])) * 100 # recall
)
lasso.conf
```

The following table contrasts the various metrics for the three models on the test data.

```{r}
test_df = data.frame(cbind(t(pls_test), t(rf_test), t(lasso_test)), row.names = c("Misclassification rate", "Precision", "Recall"))
colnames(test_df) = c("PLS-LDA", "RF", "Lasso")
round(test_df, 2)
```

There was also substantial variability observed in the test set performance of the lasso model across different reruns of the command for training the final lasso model using all the training data.

# Comments

In the next part, I will modify the loss function to penalize the misclassification from 'tumour' to 'normal' more greatly than the misclassfication in the other direction. 

Furthermore, in the PLS-LDA case, cross-validation was used for model selection which had the problem of reduced sample size for the model building. Only 60 out of 100 data points were used at each step to train the model, and in this case where the sample size is very small, it may be more useful to use the bootstrap as the resampling method for model selection, where the out-of-bag (OOB) error can be used to obtain an estimate of the expected predicition error.

# References

Singh, D., Febbo, P. G., Ross, K., Jackson, D. G., Manola, J., Ladd, C., ... & Lander, E. S. (2002). Gene expression correlates of clinical prostate cancer behavior. Cancer cell, 1(2), 203-209. [(Dataset Link)](http://datam.i2r.a-star.edu.sg/datasets/krbd/ProstateCancer/ProstateCancer.html)

[Nguyen, D. V., & Rocke, D. M. (2002). Tumor classification by partial least squares using microarray gene expression data. Bioinformatics, 18(1), 39-50.](https://academic.oup.com/bioinformatics/article/18/1/39/243615/Tumor-classification-by-partial-least-squares)

Hastie, T., Tibshirani, R., & Friedman, J. (2009). The elements of statistical learning. Springer New York.

Comments
In the next part, I will modify the loss function to penalize the misclassification from ‘tumour’ to ‘normal’ more greatly than the misclassfication in the other direction.
Furthermore, in the PLS-LDA case, cross-validation was used for model selection which had the problem of reduced sample size for the model building. Only 60 out of 100 data points were used at each step to train the model, and in this case where the sample size is very small, it may be more useful to use the bootstrap as the resampling method for model selection, where the out-of-bag (OOB) error can be used to obtain an estimate of the expected predicition error.