Abstract
In this project, we analyze the Urine-Citrate-Hippurate database, with emphasis on regression and classification models since the spectrums have already been preprocessed. Aware of the huge variety of methods that exist in the R world for such analyses, of the significant loss of time caused by scattered information, and of the dispersion of results when adopting different model-building strategies, we decided to condense this project as much as possible and to provide the most homogeneous methodology, relying on the caret infrastructure. This should accommodate both user and reader convenience. The random forest ensembling of the individual models (PLS, lm, XGBoost, ...) vastly outperformed the others, with nearly zero error.
# data<-read.csv("Data/X_UCH.csv", sep=';',header=T)
load("Data/X_UCH.RData")
load("Data/D_UCH.RData")
dataD <- as_tibble(D_UCH); dataX <- as_tibble(X_UCH)
rm(D_UCH, X_UCH)
# Let's use tibble objects as they are more efficient than standard dataFrames.
rownames(dataD) <- substr( rownames(dataD), 1, nchar(rownames(dataD)[1])-3 )
sum(rownames(dataD) == rownames(dataX)) # 269 : OK
## Double the columns of hippurate concentration (see end of chapter 3 for explanation)
dataD[dataD$Medium == 'diluted urine',]$Hip_CONC <- 2 * dataD[dataD$Medium == 'diluted urine',]$Hip_CONC
dataAll <- as_tibble(cbind(dataD, dataX))
sum(is.na(dataAll)) # No NA !
Whereas the metabolome represents the collection of all metabolites in a biological cell, tissue, organ or organism, i.e. the end products of cellular processes, metabolomic data analysis is concerned with the statistical study of these measurements.
Section 7 will summarize the results of all the models considered and will compare their performances.
The Urine-Citrate-Hippurate database that we use here is well documented in (Rousseau 2011). These data follow an experimental plan carefully designed to study the ability of statistical methods to detect biomarkers, that is, the spectral descriptors (ppm) whose variability is linked to the experimentally controlled factors.
It is therefore always interesting to first look at the data that we will analyze.
You can scroll the x-axis to see all features.
DT::datatable(dataD, style = "bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))
Click on the up (down) arrow next to each variable's name to show the observations in descending order for this variable. Change page to see the other observations.
You can scroll the x-axis to see all features.
DT::datatable(dataX, style = "bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))
Click on the up (down) arrow next to each variable's name to show the observations in descending order for this variable. Change page to see the other observations.
The objective of the analysis will then be to detect the biomarkers whose variability was experimentally controlled. Our analysis will (essentially) be built around the different experimental factors of hippurate, that is:
The aim will be to select the best method (algorithm), i.e. the one that predicts these quantities at best.
To achieve this, we will mainly make use of the library caret.
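As a hedged orientation for the reader, here is a minimal sketch of the caret workflow relied upon throughout this report; the predictor/response objects (training, training_c_small, test) are those built in later chunks, and "pls" is just an example method tag, not the only model considered.
library(caret)
## Minimal sketch of the caret workflow used throughout this report.
ctrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
fit <- train(y = training$Hip_CONC,      # response (hippurate concentration)
             x = training_c_small,       # predictors (selected ppm intensities)
             method = "pls",             # example caret method tag
             metric = "RMSE",
             trControl = ctrl)
predict(fit, newdata = test)             # predictions on the held-out test set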
We decide to display 15 randomly chosen spectrums to inspect their behaviour. This is a compromise between visibility and having enough information to detect possible problems (e.g. outliers).
tdataX <- t(dataX)
set.seed(123)
spectr_samp10 <- sample(colnames(tdataX), 15)
df_samp10 <- tdataX[,spectr_samp10]
df_samp10 <- cbind.data.frame(df_samp10, spec = as.factor(1:nrow(df_samp10)))
df_samp10 <- melt(df_samp10, id.vars = "spec" )
df_samp10 <- cbind.data.frame(df_samp10, ppm = as.numeric(rownames(tdataX)))
dimnames(df_samp10)[[2]][2] <- "Spectrum"
gg_labs <- labs(title = "15 random Spectrums representation", y = "Spectrums intensities", caption = "By redoing several times with the other spectrums, we saw that the shape is always very similar." )
gg_spectre10 <- ggplot(df_samp10) + geom_line(aes(x = ppm, y = value, col = Spectrum), size = 0.45) + gg_labs + scale_color_hue(l = 20, c = 200) + theme_piss() + scale_x_reverse() +
annotate(geom = "text", label = "Hippurate",
x = 7.1,
y = .035, col = "#33666C" , size = 5, fontface = "bold") +
geom_segment(aes(x = 7, xend = 7.5, y = 0.033, yend = .027),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C") +
geom_segment(aes(x = 6.2, xend = 4.35, y = 0.034, yend = .0328),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C") +
annotate(geom = "text", label = "Citrate",
x = 1.5,
y = .038, col = "#33666C" , size = 5, fontface = "italic") +
geom_segment(aes(x = 1.7, xend = 2.3, y = 0.036, yend = .0325),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C")
#ggplotly(gg_spectre10) # Too slow
gg_spectre10
library("scales")
library('ggthemes')
library("animation")
library("gganimate")
library("tweenr")
dfSampgg <- cbind(df_samp10, .frame = 1:ncol(df_samp10))
ggspec <- ggplot(dfSampgg, aes(frame = .frame)) + geom_line(aes(x = ppm, y = value, col = Spectrum), size = 0.45) + gg_labs + scale_color_hue(l = 20, c = 200) + theme_piss() + scale_x_reverse() +
annotate(geom = "text", label = "Hippurate",
x = 7.1,
y = .035, col = "#33666C" , size = 5, fontface = "bold") +
geom_segment(aes(x = 7, xend = 7.5, y = 0.033, yend = .027),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C") +
geom_segment(aes(x = 6.2, xend = 4.35, y = 0.034, yend = .0328),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C") +
annotate(geom = "text", label = "Citrate",
x = 1.5,
y = .038, col = "#33666C" , size = 5, fontface = "italic") +
geom_segment(aes(x = 1.7, xend = 2.3, y = 0.036, yend = .0325),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C")
gganimate(ggspec)
We clearly see that the displayed spectrums have the same shape, corresponding to the hippurate (left) and citrate (right) peaks, which are located in distant ppm domains. Hippurate has three peaks, two of which lie in the region of high ppm with a low noise level, while it is the opposite for the citrate peaks.
We also note the region of ppm where no intensity is recorded: between \(\approx\) 6 and 4.6. As these features are constant (= 0), we will remove them for the rest of the analysis since they provide no information regarding our purpose.
sd_X <- apply(dataAll[10:ncol(dataAll)], 2, sd)
X_NoVariance <- names(sd_X[sd_X<1e-6])
# Remove no variance columns
dataAll <- dataAll[, colnames(dataAll) %in% X_NoVariance == F]
These zero-variance features were removed, representing \(\approx 10\%\) of the total feature space. This zone is actually due to the pretreatment correction, where the values have been set to 0 in the water zone that is present in the urine solutions and is not of interest for the analysis.
Concerning the meta-data, that is the experimental conditions, we have the following hippurate proportions in the dataset :
pander(table(dataD$Hip_prop))
| 0 | Qh | Qh/2 | Qh/4 |
|---|---|---|---|
| 110 | 47 | 64 | 48 |
showing that there are more observations with no hippurate.
We also have 14 levels for the combination Citrate-hippurate, with the following entries
#table(dataD$Ci_prop, dataD$Ci_CONC)
pander(table(dataAll$Hip_prop, dataAll$Ci_prop))
| | 0 | QC | QC/2 | QC/4 |
|---|---|---|---|---|
| 0 | 63 | 16 | 15 | 16 |
| Qh | 16 | 15 | 16 | 0 |
| Qh/2 | 16 | 16 | 16 | 16 |
| Qh/4 | 16 | 0 | 16 | 16 |
Regarding the PPM
To make the best algorithm selection possible with the data at hand, we must find the best suited methodology. To reduce bias and to avoid selection problems, we will sample 2 different sets of data: a training set and a test set.
If we do not divide the data into training and test sets like this, the error estimate on validation data will be biased (that is, in general smaller than the true error rate), since the validation set is used to select the final model. In microarray applications, (Varma and Simon 2006) also showed that using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error. A nested cross-validation would be preferable.
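For concreteness, here is a hedged sketch of what such a nested cross-validation could look like with caret: an outer loop of held-out folds for error estimation, and the usual inner CV (inside train()) for hyperparameter tuning. It is only illustrative and is not run in this report; the training/training_c objects are those created in the chunks below, and glmnet is an arbitrary choice of model.
## Illustrative nested CV sketch (not run): outer folds estimate the error,
## the inner CV inside train() tunes the hyperparameters.
set.seed(12)
outer_folds <- createFolds(training$Hip_CONC, k = 5)   # held-out indices per fold
outer_rmse <- sapply(outer_folds, function(idx) {
  fit <- train(y = training$Hip_CONC[-idx],
               x = training_c[-idx, ],
               method = "glmnet",                       # arbitrary example model
               trControl = trainControl(method = "cv", number = 10))
  sqrt(mean((training$Hip_CONC[idx] - predict(fit, training_c[idx, ]))^2))
})
mean(outer_rmse)   # nearly unbiased estimate of the generalisation error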
The 80-20 decomposition gives 215 observations for training and 54 for testing. This strikes a good balance between "enough training data to build effective models" and "enough test data to make accurate predictions". We could also carry out a sensitivity analysis by inspecting how the results vary when changing this 80-20 proportion. Moreover, one could also consider a (kind of) bootstrap procedure and redo the analysis several times (thus with several random samplings of the test and training sets), giving more accurate estimates of the true errors. This is interesting for such small samples... but at a computational cost!
For the classification task(s), it is important to check whether the distribution of the binary (response) variable is similar in the two sets, otherwise it could have severe impacts on the estimation of the algorithms' performances. When the data are 'stratified', it is also recommended that the proportions of the different 'strata' be (approximately) the same in the sample and in both the training and test sets.
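One hedged way to enforce this stratification directly is caret's createDataPartition(), which samples within the levels of the supplied factor; the sketch below is an alternative to the plain random split used in the next chunk (whose proportions we then verify by hand), and the *_strat names are only illustrative.
## Sketch: stratified 80/20 split on the hippurate proportion factor.
## createDataPartition() samples within each level, so the class
## proportions are (approximately) preserved in both sets.
set.seed(1234)
in_train <- createDataPartition(factor(dataAll$Hip_prop), p = 0.8, list = FALSE)[, 1]
training_strat <- dataAll[in_train, ]
test_strat     <- dataAll[-in_train, ]
round(prop.table(table(training_strat$Hip_prop)), 3)
round(prop.table(table(test_strat$Hip_prop)), 3)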
rownames(dataAll) <- rownames(dataD)
set.seed(1234)
index <- sample(rownames(dataAll), round(0.8*nrow(dataX), 0))
training <- dataAll[rownames(dataAll) %in% index , ]
test <- dataAll[rownames(dataAll) %in% index == F, ]
rownames(training) <- index
rownames(test) <- rownames(dataAll)[rownames(dataAll) %in% index == F ]
# Control the proportions :
# table(training$Hip_prop)
# table(test$Ci_CONC)
#pander(table(training$Hip_prop, training$Ci_prop))
#pander(table(test$Hip_prop, test$Ci_prop))
Original dataset (%):
pander(round(table(dataAll$Hip_prop, dataAll$Ci_prop) / sum(table(dataAll$Hip_prop, dataAll$Ci_prop)) * 100, 3), justify = "center")
| | 0 | QC | QC/2 | QC/4 |
|---|---|---|---|---|
| 0 | 23.42 | 5.948 | 5.576 | 5.948 |
| Qh | 5.948 | 5.576 | 5.948 | 0 |
| Qh/2 | 5.948 | 5.948 | 5.948 | 5.948 |
| Qh/4 | 5.948 | 0 | 5.948 | 5.948 |
pct_train <- round(table(training$Hip_prop, training$Ci_prop) / sum(table(training$Hip_prop, training$Ci_prop)) * 100, 2)
#pander(pct_train)
pct_test <- round(table(test$Hip_prop, test$Ci_prop) / sum(table(test$Hip_prop, test$Ci_prop))*100, 2)
# pander(pct_test)
# pander(list(pct_train, pct_test))
Training set (left) and test set (right), both expressed in %:
t1 <- tableGrob(pct_train)
t2 <- tableGrob(pct_test)
grid.arrange(t1, t2, ncol = 2, left = "Training", right = "Test")
# pushViewport(viewport(layout = grid.layout(1, 1)))
# print(t1 + ggtitle("bar"),
# vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
# print(t2 + labs(title = "point"),
# vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
We see that the 'distribution' of the observations with respect to the experimental design Hippurate-Citrate is very similar across the two sets. This is a good point for our aim of building reliable models and reliable estimates of the errors made. Moreover, regarding the repartition of the days in the 3 sets:
## Retrieve the day numbers to check the proportions in the different sets
t1 <- table(substr(rownames(dataAll), 7, 7))
t2 <- table(substr(rownames(training), 7, 7))
t3 <- table(substr(rownames(test), 7, 7))
tab <- rbind(t1,t2,t3)
rownames(tab) <- c("Whole set", "Training set", "Test set")
colnames(tab) <- c("day 1", "day 2", "day 3", "day 5")
pander(tab)
| | day 1 | day 2 | day 3 | day 5 |
|---|---|---|---|---|
| Whole set | 65 | 68 | 68 | 68 |
| Training set | 53 | 55 | 54 | 53 |
| Test set | 12 | 13 | 14 | 15 |
The proportions of the days are very similar across the sets, so we can go on: this particular random indexing provided a reliable training-test separation.
Note that we also had to adjust the column of hippurate concentration: when the urine was diluted with 50% of water, we had to double the values in this column (see chunk 1).
Regarding the preprocessing, we will not a priori normalize the data: each of the following models has its own properties and will need its own preprocessing.
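As a hedged illustration of how this can be handled, caret lets each model request its own preprocessing through the preProcess argument of train(), so that centring/scaling parameters are re-estimated inside each cross-validation fold; the PLS method tag below is only an example and the fit is not used elsewhere.
## Sketch: model-specific preprocessing done inside train(), so the
## centring/scaling is re-estimated within each resample.
fit_scaled <- train(y = training$Hip_CONC,
                    x = training[ , -(1:9)],             # ppm intensities only
                    method = "pls",                       # example method tag
                    preProcess = c("center", "scale"),
                    trControl = trainControl(method = "cv", number = 10))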
First, regarding the linear correlations between the input variables (ppm) :
# corrX <- c()
# for (i in 0:(ncol(dataAll_c)-2)){
# corrX[i] <- cor(dataAll_c[,i+1], dataAll_c[,i+2])
# names(corrX)[i] <- paste0(substring(names(dataAll_c)[i+1], 1, 6),"&", substring(names(dataAll_c)[i+2],1,6))
# }
corrX <- cor(training[,-(1:9)])
dimnames(corrX)[[1]] <- dimnames(corrX)[[2]] <- c(rep("", 24), "0", rep("", 24), rep("", 30), "1", rep("", 29), rep("", 30), "2", rep("", 29),rep("", 31), "3", rep("", 30),rep("", 15), "4", rep("", 14), rep("", 29), "6", rep("", 29),rep("", 30), "7",rep("",31), rep("", 30), "8", rep("", 30), rep("", 30), "9", rep("", 29))
## Plot it
corrplot::corrplot(corrX, method="color", type="lower" )
# sum(is.na(corrX)) # Weird
highlyCorrelated <- findCorrelation(corrX, cutoff=0.99)
highlyCorrelated <- colnames(training_c[,highlyCorrelated])
#cor(training_c[,colnames(training_c[,highlyCorrelated])])
training_c <- training[,-(1:9)] # Just keep it uncentered for now. We will center this object a few chunks later
## Retrieve number of ppm in each class : 0,1,...,9.
tabPPM <- table(substr(colnames(training_c), 1, 1))
## Sample one in each class
set.seed(12345)
sampl_distByPPM <- sapply(tabPPM, sample, 1)
## Create the dataFrame for ggplo, with corresponding column (ppm) full names
df_ppmDist <- data.frame(
training_c[ ,substr(colnames(training_c), 1, 1) == "0"][, sampl_distByPPM['0']],
training_c[ ,substr(colnames(training_c), 1, 1) == "1"][, sampl_distByPPM['1']],
training_c[ ,substr(colnames(training_c), 1, 1) == "2"][, sampl_distByPPM['2']],
training_c[ ,substr(colnames(training_c), 1, 1) == "3"][, sampl_distByPPM['3']],
training_c[ ,substr(colnames(training_c), 1, 1) == "4"][, sampl_distByPPM['4']],
#training_c[ ,substr(colnames(training_c), 1, 1) == "5"][, sampl_distByPPM['5']],
training_c[ ,substr(colnames(training_c), 1, 1) == "6"][, sampl_distByPPM['6']],
training_c[ ,substr(colnames(training_c), 1, 1) == "7"][, sampl_distByPPM['7']],
training_c[ ,substr(colnames(training_c), 1, 1) == "8"][, sampl_distByPPM['8']],
training_c[ ,substr(colnames(training_c), 1, 1) == "9"][, sampl_distByPPM['9']] )
colnames(df_ppmDist) <- c(colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "0"])[ sampl_distByPPM['0']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "1"])[ sampl_distByPPM['1']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "2"])[ sampl_distByPPM['2']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "3"])[ sampl_distByPPM['3']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "4"])[ sampl_distByPPM['4']],
#colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "5"])[ sampl_distByPPM['5']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "6"])[ sampl_distByPPM['6']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "7"])[ sampl_distByPPM['7']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "8"])[ sampl_distByPPM['8']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "9"])[ sampl_distByPPM['9']])
## Melt it for ggplot : represent the classes
dfgg_ppmDist <- melt(df_ppmDist)
colnames(dfgg_ppmDist) <- c("ppm.value", "value")
dodge <- position_dodge(width = 0.4)
gpp1 <- ggplot(dfgg_ppmDist[1:(4*nrow(training)),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', size = .6, alpha=0.99,width = 0.2) + geom_violin(fill = "lightseagreen", alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.06, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss() #+ scale_fill_brewer(palette="Blues") #+ scale_color_brewer(palette = "Dark2")
gpp2 <- ggplot(dfgg_ppmDist[(4*nrow(training)+1):nrow(dfgg_ppmDist),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', alpha=0.99,width = 0.2, size = .6) + geom_violin(fill='lightseagreen', alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.06, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss()
grid.arrange(gpp1, gpp2, nrow = 2, top = textGrob(expression( italic("Violin plots of random ppm for each ´class´")), gp = gpar(fontsize = 19, font = 2, col = "#33666C")))dfgg_ppmDistc <- melt(as.data.frame(scale(df_ppmDist, center = T, scale = F)))
colnames(dfgg_ppmDistc) <- c("ppm.value", "value")
gpp1 <- ggplot(dfgg_ppmDistc[1:(4*nrow(training)),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', size = .6, alpha=0.99,width = 0.2) + geom_violin(fill = "lightseagreen", alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.06, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss() #+ scale_fill_brewer(palette="Blues") #+ scale_color_brewer(palette = "Dark2")
gpp2 <- ggplot(dfgg_ppmDistc[(4*nrow(training)+1):nrow(dfgg_ppmDistc),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', alpha=0.99,width = 0.2, size = .6) + geom_violin(fill='lightseagreen', alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.06, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss()
grid.arrange(gpp1, gpp2, nrow = 2, top = textGrob(expression( italic(paste("Logged Violin plots of ", underline("centered")," random ppm for each ´class´"))), gp = gpar(fontsize = 19, font = 2, col = "#33666C")))dfgg_ppmDistLog <- dfgg_ppmDist
dfgg_ppmDistLog$value <- log(dfgg_ppmDist$value)
gpplog1 <- ggplot(dfgg_ppmDistLog[1:(4*nrow(training)),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', alpha=0.99, size = .6, width = .3) + geom_violin(fill='lightseagreen', alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.04, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss()
gpplog2 <- ggplot(dfgg_ppmDistLog[(4*nrow(training)+1):nrow(dfgg_ppmDistLog),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', alpha=0.99, size = .6, width = .3) + geom_violin(fill='lightseagreen', alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.04, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss()
grid.arrange(gpplog1, gpplog2, nrow = 2, top = textGrob(expression( italic(paste(underline("Log-"), "Violin plots of random ppm for each ´class´"))), gp = gpar(fontsize = 19, font = 2, col = "#33666C")))
From the graph of the spectrums above, we decide to select one ppm in a peak region and to plot its densities with respect to the hippurate proportions. We also select a random ppm to make the comparison.
## Chosen
df_densHip <- data.frame(value = unname(training_c[, "7.656082"]), Hippurate = training$Hip_prop)
#ggplot(df_densHip, aes(x= value , fill=Hippurate, color=Hippurate)) + geom_line(stat = "density", alpha=0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1" )+ scale_color_brewer(palette= "Set1")
df_densHip$Hippurate <- factor(df_densHip$Hippurate, levels = c("0", "Qh/4", "Qh/2", "Qh")) # Change order of levels in ggplot legend
g1ppdens <- ggplot(df_densHip, aes(x = value , fill = Hippurate, color=Hippurate)) + geom_density(alpha = 0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1" )+ scale_color_brewer(palette= "Set1") + geom_hline(yintercept=0, colour="white", size=1.1) + labs(title = "Density of a Chosen high-peaked ppm", x = "value for ppm 7.656082")
## Random
set.seed(12)
ind <- sample(as.data.frame(training_c), 1)
df_densHipRandom <- data.frame(value = unname(as.vector(ind)), Hippurate = training$Hip_prop)
df_densHipRandom$Hippurate <- factor(df_densHipRandom$Hippurate, levels = c("0", "Qh/4", "Qh/2", "Qh")) # Change order of levels in ggplot legend
g2ppdens <- ggplot(df_densHipRandom, aes(x= value , fill=Hippurate, color=Hippurate)) + geom_density(alpha=0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1" )+ scale_color_brewer(palette= "Set1") + geom_hline(yintercept=0, colour="white", size=1.1) + labs(title = "Density of a Random ppm", x = paste("value for ppm ", names(ind)))
grid_arrange_legend(g1ppdens, g2ppdens, ncol = 2, position = "bottom")
#### TEST SET
## Chosen
df_densHipTest <- data.frame(value = unname(test_c[,"7.656082"]), Hippurate = test$Hip_prop)
df_densHipTest$Hippurate <- factor(df_densHipTest$Hippurate, levels = c("0", "Qh/4", "Qh/2", "Qh")) # Change order of levels in ggplot legend
g1ppdensT <- ggplot(df_densHipTest, aes(x= value , fill=Hippurate, color=Hippurate)) + geom_density(alpha=0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1" )+ scale_color_brewer(palette= "Set1") + geom_hline(yintercept=0, colour="white", size=1.1) + labs(title = "Density of a Chosen high-peaked ppm", x = "value for ppm 7.656082") + theme(legend.position="none")
## Random (same as for training set)
indT <- test[,names(ind)]
df_densHipRandomTest <- data.frame(value = unname(as.vector(indT)), Hippurate = test$Hip_prop)
df_densHipRandomTest$Hippurate <- factor(df_densHipRandomTest$Hippurate, levels = c("0", "Qh/4", "Qh/2", "Qh")) # Change order of levels in ggplot legend
g2ppdensT <- ggplot(df_densHipRandomTest, aes(x= value , fill=Hippurate, color=Hippurate)) + geom_density(alpha=0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1" )+ scale_color_brewer(palette= "Set1") + geom_hline(yintercept=0, colour="white", size=1.1) + labs(title = "Density of a Random ppm", x = paste("value for ppm ", names(ind))) + theme(legend.position="none")
grid.arrange(g1ppdensT, g2ppdensT, ncol = 2, top = textGrob(expression(italic(paste("For the ", underline("test"), " set"))), gp = gpar(fontsize = 21, font = 2, col = "darkblue")))
Then, we will do a PCA to represent our data in a smaller space. With such values of Pearson's correlations, we expect to get good results from the PCA.
ncomp = 8
PCA.res = SVDforPCA(training_c, ncomp = ncomp)
# EigenValues
eig.res=rbind(PCA.res$var,PCA.res$var*100/sum(PCA.res$var),PCA.res$cumvar)
rownames(eig.res)=c("Variances","Prop Var","Cum. Eigen Values")
## No need to display the frame : Pareto will summarize it
res.fin.scale <- prcomp(training_c)
#fviz_eig(res.fin.scale)+ggtitle("Scree plot of information kept by components")+ geom_vline(aes(xintercept=3),color="red")
NOMSG( pareto.chart(PCA.res$eigval[1:8], main = "Pareto chart - Explained Variance", ylab = "Variances",plot = T) ) # pareto chart
Indeed, we remark that very few components are enough to explain the whole (remaining) ppm space. This actually means that the feature space can easily be (linearly) reduced. We had some prior knowledge that the process should be linear, and this result is consistent with that prior. In the following, it tells us that it is not necessary to consider overly complex models.
We decided to colour the points with respect to the hippurate concentration (thus as a class here) to better visualize if there is a straightforward separation in the score plot with respect to the quantity of hippurate. Note that the colour legend is not in increasing order.
Hippurate <- training$Hip_prop
## Note that we have changed the functions of MBX to get more flexibility, and we stored these function in a .R file that we source() in the beginning of this project. Here, the size_p parameter controls the size of the points (that were too big with the original function of M.Martin).
g1 <- DrawScores(PCA.res, type.obj = "PCA", drawNames=F,createWindow=FALSE, main = paste("PCA score plots for PC1 vs PC2, and for PC3 vs PC4"), axes =c(1,2), color = Hippurate, size_p = .6) + theme_piss(size_p=12, size_c = 10)
g2 <- DrawScores(PCA.res, type.obj = "PCA", drawNames=F,createWindow=FALSE, main =paste("PCA score plots for PC1 vs PC2, and for PC3 vs PC4"), axes =c(3,4), color = Hippurate, size_p = .6) + theme_piss(size_p = 12, size_c = 10)
#grid.arrange(g1,g2, nrow = 1)
plotly::subplot(hide_legend(ggplotly(g1)), ggplotly(g2), shareX = T, shareY= T, which_layout = 2)
You can select classes of Hippurate by clicking on the legend entries on the right.
gg_load <- DrawLoadings(PCA.res, type.obj = "PCA", createWindow=FALSE, main = paste("PCA loading plots for PC1 vs PC2"), axes =c(1,2))[[1]] + theme_piss(size_p = 14) + labs(x = "ppm")
ggplotly(gg_load)
We see that there is actually a clear separation of the PCA scores in the 2-dimensional space with respect to the quantity of hippurate. This subspace actually preserves a relatively good amount of variance (\(\approx 65\%\)).
The PCA being a parametric mapping model relying on linear combinations of the original variables, we can already state with relatively good confidence that the relationship between the spectrums and the hippurate quantity is (mostly) linear.
Moreover, we see in the loadings plot that the peaks are related to the peaks of the spectrums displayed in the figure above.
First of all, we want to be able to detect the biomarkers with a "simple" method. We choose the t-test because it relies on fewer assumptions than, for example, the ANOVA.
Here, we decide to use all the data: that is, we think that the training-test division is not relevant here.
p_value <- c()
dataAll <- as.data.frame(dataAll)
dataX <- as.data.frame(dataX)
for(i in 1:(ncol(dataX))) {
#print(i)
p_value[i] <- t.test(dataX[dataAll$Hip_prop == "0",i], dataX[dataAll$Hip_prop == "Qh/2",i], var.equal = F)$p.value
}
p_val_Bonf0 <- p_value * (ncol(dataAll)-9)
p_val_Bonf <- p_val_Bonf0 ; p_val_Bonf[p_val_Bonf>1] = 1
names(p_val_Bonf) <- colnames(dataX)
# NA are obviously coming from the "no-variance" columns that we pointed out earlier
For such tests, which are applied multiple times, it is important to make a correction for multiple testing. Here, we choose to present only the Bonferroni method, which does the following correction:
\[\text{Pvalue}_{(\text{Bonf})} = \begin{cases} N\cdot \text{Pvalue}, & \text{if } \text{Pvalue}< N^{-1}; \\ 1, & \text{otherwise}, \end{cases}\]
where \(N\) is the number of tests.
Other methods for multiple testing correction are available, such as those of Sidak, Benjamini-Hochberg, ...
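These corrections are also available in base R via p.adjust(); a small hedged sketch on the vector of raw p-values computed above (p_bonf and p_bh are illustrative names, not used elsewhere):
## Built-in multiple-testing corrections applied to the raw p-values above.
p_bonf <- p.adjust(p_value, method = "bonferroni")  # same as the manual N * p, capped at 1
p_bh   <- p.adjust(p_value, method = "BH")          # Benjamini-Hochberg, controls the FDR
sum(p_bonf < 0.05, na.rm = TRUE)  # features still significant after Bonferroni
sum(p_bh   < 0.05, na.rm = TRUE)  # usually less conservative than Bonferroni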
library(RColorBrewer)
myPalette <- colorRampPalette(rev(brewer.pal(4, "Spectral")))
g1 <- ggplot(as.data.frame(p_value)) + geom_histogram(aes(x = p_value,fill=..count..),bins = 25, col="red") + theme_piss() + labs(title = "Without correction") + scale_fill_gradientn("Count",limits = c(0,400),colours = myPalette(10) )
g2 <- ggplot(as.data.frame(p_val_Bonf)) + geom_histogram(aes(x = p_val_Bonf,fill=..count..),bins = 25, col="red") + theme_piss() + labs(title = "Using Bonferroni correction") + scale_fill_gradientn("Count",limits = c(0,400), colours = myPalette(10))
PissoortThesis::grid_arrange_legend(g1, g2, position = "bottom")
df_pval <- data.frame(Standard = p_value, Bonferroni = p_val_Bonf)
df_pvalgg <- melt(df_pval,variable.name = "Method" )
ggplot(df_pvalgg, aes(x = value, fill = Method, color=Method)) + geom_density(alpha=0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1") + scale_color_brewer(palette = "Set1")
cat("Number of significant p-values (<5%) is ",sum(p_value<0.05, na.rm = T)," out of ", ncol(dataAll)-9," --> ",100 *round(sum(p_value<0.05, na.rm = T)/(ncol(dataAll)-9),2),"%. \n")
Number of significant p-values (<5%) is 199 out of 503 --> 40 %.
cat("Number of significant test after Bonferroni Correction : ",sum(p_val_Bonf<0.05, na.rm = T)," out of ",(ncol(dataAll)-9)," --> ",100*round(sum(p_val_Bonf<0.05, na.rm = T)/(ncol(dataAll)-9),2),"%.")Number of significant test after Bonferroni Correction : 88 out of 503 --> 17 %.
Indeed, we see that the correction is useful: we get far fewer significant features (88 instead of 199).
#cat("Biomarkers : ", names(p_val_Bonf[order(p_val_Bonf)][1:8]))
#pander(p_val_Bonf[order(p_val_Bonf)][1:7]
df_tests <- as.data.frame(t(p_val_Bonf[order(p_val_Bonf)]))
rownames(df_tests) <- "P-val.:"
# cat(" .dataTables_paginate {
# display: none; } ") # Hide page numbering
#as.datatable(formattable(df_tests, list(p_val_Bonf = color_tile("lightblue", "lightpink"))), style = 'bootstrap', options = list(dom = 'tp',scrollX = TRUE), caption = htmltools::tags$caption(style = 'caption-side: bottom; text-align: center;', 'Table: ', htmltools::em("P-values in decreasing order of significance for t-tests with bonferroni correction. NA's ")))
as.datatable(formattable(df_tests, list(p_val_Bonf = color_tile("lightblue", "lightpink"))), style = 'bootstrap', options = list(dom = 'tp',scrollX = TRUE))
The table above displays, in decreasing order of significance, the features (biomarkers) that are most relevant according to the Bonferroni-corrected t-tests.
Several algorithms do not embed automatic feature selection and hence it is important to do it manually, in order not to feed these algorithms a feature space that is too large. Just as for the features which do not vary between the spectrums, it is important to remove other unimportant features, relying on several criteria.
We assume that we do not know a priori that the process is globally linear (see PCA) and hence we want to take all forms of relationship into account. Moreover, even if this procedure is more 'complex' to build, it will also handle 'simpler' relationships quite well.
1. Correlation methods which allow not only the Pearson but also the rank correlation.
data_FSlect <- cbind.data.frame(Hip_CONC = training$Hip_CONC, training_c)
weights_lin <- linear.correlation(formula = Hip_CONC ~ ., data_FSlect)
subset_lin <- cutoff.k(weights_lin, 20)
# Same procedure for the rank correlation
weights_rank <- rank.correlation(formula = Hip_CONC ~ ., data_FSlect)
subset_rank <- cutoff.k(weights_rank, 20)
df_FSelect <- rbind(subset_lin, subset_rank)
length(feat_FSelect <- unique(as.vector(df_FSelect))) # 20
These two correlation methods (linear and rank-based) select nearly the same features: only 1 differs if we subset 20 features for each, and 4 if we subset 40.
2. Regarding a recursive feature elimination (RFE) where a random forest is used at each iteration to evaluate the model, we have:
set.seed(123)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number= 5) #5 cv folds
# run the RFE algorithm
results <- rfe(training_c, training$Hip_prop, sizes= c(1:20), rfeControl=control, metric = "Accuracy")
## list the chosen features
feat_rfe <- predictors(results)
## plot the results
#plot(results, type=c("g", "o"))
feat_rfe <- results$variables$var[1:20]
The algorithm is configured to explore subsets of 1 to 20 attributes.
3. Finally, a wrapper combining glmnet and XGBoost importance:
set.seed(13)
params_glmnet = list(alpha = 1, family = 'gaussian', nfolds = 5, parallel = TRUE) # 5 folds
params_xgboost = list( params = list("objective" = "reg:linear", "bst:eta" = 0.001, "subsample" = 0.75, "max_depth" = 5, "colsample_bytree" = 0.75, "nthread" = 6),
nrounds = 1000, print.every.n = 1000, maximize = FALSE)
# params_ranger1 = list(probability = FALSE, num.trees = 1000, verbose = TRUE, mtry = 5, min.node.size = 10, num.threads = 6, classifiction = FALSE, importance = 'permutation')
params_features = list(keep_number_feat = 20, union = TRUE) # keep at most 20 features per method
feat = wrapper_feat_select(X = training_c, y = training$Hip_CONC, params_glmnet = params_glmnet, params_xgboost = params_xgboost, xgb_sort = 'Gain', CV_folds = 5, stratified_regr = FALSE, scale_coefs_glmnet = FALSE, cores_glmnet = 8, params_features = params_features, verbose = F)
pander(feat$union_feat[1:10,])
feat_GlmnetXGB <- as.vector(feat$union_feat$feature[1:20])It can be seen as a bit weird to consider such a complex model (XGBoost) for doing feature selection but it is necessary to optimize our selection process, and not to discard features (ppm) that could have more complex relationships with hippurate.
Now, let’s do a summary of all our methods and make them votes. The idea is to make the selection the most robust and valid as possible. If we arbitrarily discard a feature now because our method is too ‘poor’, then this feature will be removed for all the future algorithms for which the feature selection will be applied. Then, it would induce a big and so it is crucial to build a reliable feature selection scheme.
feat_all <- c(feat_FSelect, feat_rfe, feat_GlmnetXGB)
# length(unique(feat_all)) # 39 retained features in total (<60 due to duplications)
sum(dupl <- duplicated(feat_all)) # 21
( kept_features <- unique(feat_all[dupl]) ) # 16
training_c_small <- training_c[, kept_features]
test_small <- test[, kept_features] # For prediction. But this is not always mandatory
We decided to keep a feature if it appeared in at least 2 of the 4 different methods, each of which selects 20 features. This procedure leaves us with 16 most important features.
With the feature selection made in the previous subsection, we can already identify biomarkers. As the models used to select the relevant features are actually quite 'complex' and diversified, these could already be our final biomarkers...
gg_spectre10 + geom_vline(xintercept = as.numeric(kept_features[1:10]), linetype = 2, size = .5) +
labs(title = "15 random Spectrums with 10 selected features", y = "Spectrums intensities" )We can compare these results with the ones coming from the corrected t-tet, bot in increasing order of selection :
#cat("Biomarkers from corrected t-tests : ", names(p_val_Bonf[order(p_val_Bonf)][1:8]))
#pander(t(data.frame(t.test.corrected = names(p_val_Bonf[order(p_val_Bonf)][1:8]), feature.selection = kept_features[1:8])))
df_pvalAll <- data.frame(t.test.corrected = names(p_val_Bonf[order(p_val_Bonf)][1:length(kept_features)]), feature.selection = kept_features)
dfff <- as.data.frame(t(df_pvalAll))
colnames(dfff) <- seq(1, length(kept_features))
DT::datatable(dfff, style = "bootstrap", class="table-condensed", options = list(dom = 'tp', scrollX = TRUE))
We see that the selected features are very similar, which is already a good sign of 'robustness'.
pct_keptfeat <- sum(duplicated(c(as.vector(df_pvalAll$t.test.corrected), as.vector(df_pvalAll$feature.selection ) ))) / length(kept_features) *100
cat(pct_keptfeat, "% of the selected features are also the most significants wrt Bonferroni t-tests")68.75 % of the selected features are also the most significants wrt Bonferroni t-tests
#@ These are all the features (21) selected by the methods
# unique(c(as.vector(df_pvalAll$t.test.corrected), as.vector(df_pvalAll$feature.selection) ) )
The metric we will use is the RMSE. Its use is very common and it makes an excellent general-purpose error metric for numerical predictions. For example, compared to the Mean Absolute Error (MAE), the RMSE amplifies and severely punishes larger errors. This is often desirable, as larger errors are more problematic: here, being off by 200 on the hippurate quantity is more than twice as bad as being off by 100.
'RMSE' <- function(x.true, x.fit) sqrt( (1/length(x.true)) * sum((x.true - x.fit)^2) )
\[\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N \big( y_i-\hat{y}_i\big)^2 }, \qquad\qquad N:=\text{test set size} = 54.\]
It computes the squared differences between the predicted values of the model of interest and the real (true) values, averages them over all observations and takes the square root. It will be our basis for comparing models, and for tuning the models' hyperparameters by cross-validation.
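A small hedged illustration of why the RMSE punishes large errors more than the MAE (the MAE helper below is only defined for this toy example):
## Toy comparison: two error patterns with the same MAE but different RMSE.
'MAE' <- function(x.true, x.fit) mean(abs(x.true - x.fit))
truth <- rep(0, 4)
spread_errors <- c(100, 100, 100, 100)   # four moderate errors
one_big_error <- c(0, 0, 0, 400)         # same total error, concentrated
MAE(truth, spread_errors);  MAE(truth, one_big_error)    # both equal 100
RMSE(truth, spread_errors); RMSE(truth, one_big_error)   # 100 versus 200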
Now, we will explore some models and their characteristics, do some insightful visualization and compute some diagnostics if needed. We will save their cross-validation RMSE but also their 'true' RMSE (predicted on the test set). Finally, we will compare them all (and possibly try some stacking).
Note that for the (stepwise) linear regression models, there are no real tuning hyperparameters but we will make the cross-validation anyway to obtain estimates of this error (RMSE-cv). If we only wanted to obtain the parameters’ coefficients, it would have been better to compute these using the whole training set \(\rightarrow\) more points.
We use the standard lm() function from stats R, with the feature selection applied to ensure \(p<n\).
library(doMC)
registerDoMC(cores = 8) # All the subsequent simulations will be run in parallel
n_folds <- 10
tc_reglin <- trainControl("cv", n_folds, savePredictions = T )
set.seed(12)
regLin <- train(y = training$Hip_CONC,
x = training_c_small,
metric = "RMSE",
method = "lm",
trControl = tc_reglin)
cv_rmse_regLin <- min(regLin$results$RMSE)
pred_regLin <- unname(predict(regLin, test))
rmse_reglin <- RMSE(test$Hip_CONC, pred_regLin) # 19.99
pred_regLinc <- unname(predict(regLin, test_c))
rmse_reglinc <- RMSE(test$Hip_CONC, pred_regLinc)
cat("Cross-validated RMSE is ", cv_rmse_regLin , "while the True RMSE is ", rmse_reglinc, " with the ", length(kept_features), " selected features")Cross-validated RMSE is 9.30199 while the True RMSE is 8.237091 with the 16 selected features
A simple cross-validation is enough here (no hyperparameters to tune).
We make use of the method lmStepAIC which comes from the MASS package.
n_folds <- 10
tc_reglin <- trainControl("cv", n_folds, savePredictions = T)
set.seed(12)
regLin_stpw <- train(y = training$Hip_CONC,
x = training_c_small,
metric = "RMSE",
method = "lmStepAIC",
trace = 0,
trControl = tc_reglin)
cat("Retained variables are " , colnames(regLin_stpw$finalModel$model)[-1])Retained variables are 7.6397634 3.9811536 7.5743654 2.0538744 8.5544156 0.894194 1.5801986 0.926832 0.9430896 1.9230788 3.0501212 1.5638796 3.1644144 3.6054524
cv_rmse_regLinsw <- min(regLin_stpw$results$RMSE)
pred_regLinsw <- unname(predict(regLin_stpw, test))
rmse_reglinsw <- RMSE(test$Hip_CONC, pred_regLinsw)
pred_regLinswc <- unname(predict(regLin_stpw, test_c))
rmse_reglinswc <- RMSE(test$Hip_CONC, pred_regLinswc)
The final model retains 14 variables (out of the initial 16).
Remark that the RMSE is lower for the linear regression with the preliminary feature selection than for the linear regression with this stepwise selection. This can be seen in the following plot.
Now, we try a last model using more variables. It did not work on the complete dataset because of multicollinearity, so we had to remove features that are perfectly collinear. Finally, we decided to redo the same feature selection method, now keeping \(\approx\) 100 variables. As it will be useful to tune the number of features to keep (this number is often somewhat arbitrarily fixed) for other algorithms, we decided to create a function that retrieves any number of relevant features.
## n.keepByAlgo controls the number of selected features to keep by each of the 4 methods used.
'featureSelect' <- function(n.keepByAlgo = 100) {
# Linear and rank correlation
subset_lin2 <- cutoff.k(weights_lin, n.keepByAlgo)
subset_rank2 <- cutoff.k(weights_rank, n.keepByAlgo)
df_FSelect2 <- rbind(subset_lin2, subset_rank2)
feat_FSelect2 <- unique(as.vector(df_FSelect2))
# XGboost and random forest importance
feat_GlmnetXGB2 <- as.vector(feat$union_feat$feature[1:n.keepByAlgo])
feat_rfe2 <- results$variables$var[1:n.keepByAlgo]
# all
feat_all2 <- c(feat_FSelect2, feat_rfe2, feat_GlmnetXGB2)
sum(dupl2 <- duplicated(feat_all2))
return(unique(feat_all2[dupl2]))
}
kept_featuresBig <- featureSelect()
# See section 3 (feature selection on correlation)
# training_c_nocor <- training_c[, colnames(training_c) %in% highlyCorrelated ==F]
tc_reglin <- trainControl("cv", n_folds, savePredictions = T )
set.seed(12)
regLin_stpw_all <- train(y = training$Hip_CONC,
x = training_c[,kept_featuresBig],
metric = "RMSE",
method = "lmStepAIC",
#direction = "forward",
trace = 0,
trControl = tc_reglin)
cat(ncol(regLin_stpw_all$finalModel$model)-1, "Retained variables now :" , colnames(regLin_stpw_all$finalModel$model)[-1])
cv_rmse_regLinsw_all <- min(regLin_stpw_all$results$RMSE)
pred_regLinsw_all <- unname(predict(regLin_stpw_all, test))
rmse_reglinsw_all <- RMSE(test$Hip_CONC, pred_regLinsw_all) # 41.3
pred_regLinsw_allc <- unname(predict(regLin_stpw_all, test_c))
rmse_reglinsw_allc <- RMSE(test$Hip_CONC, pred_regLinsw_allc)
dfstep <- data.frame( True.RMSE = c(rmse_reglinswc, rmse_reglinsw_allc), RMSE.cv = c(cv_rmse_regLinsw, cv_rmse_regLinsw_all), size = c( paste0(length(regLin_stpw$finalModel$model)-1, "(", length(kept_features), ")"), paste0(length(regLin_stpw_all$finalModel$model)-1, "(", length(kept_featuresBig), ")")) )
rownames(dfstep) <- c( 'Linear regression SW small', 'Linear regression SW big')
pander(t(dfstep))
| | Linear regression SW small | Linear regression SW big |
|---|---|---|
| True.RMSE | 8.248378 | 6.992543 |
| RMSE.cv | 9.484595 | 6.157128 |
| size | 14(16) | 43(84) |
The number in parentheses is the initial number of variables in the model before the stepwise selection stage.
43 variables are now retained by this larger model, which reaches a lower RMSE on the test set (see the table above).
Now, we create a function that we aim to use for all our models. The goal is to compare the predicted values with the true values, for both the cross-validation and the test set. Thus, note that there are 215 values for the cross-validation and 54 values for the test set. We print in red the observation numbers of the problematic observations to facilitate further diagnostics.
We define the function that makes the prediction plots for the rest of the project, for both cross-validation and testing.
'PredVsObsPlots' <- function(df_cv, obs.test = test$Hip_CONC, pred.test, threshold = 30, rmse.cv) {
#browser()
df_cv$error <- abs(df_cv$obs - df_cv$pred)
ProblematicObs.cv <- df_cv$error > threshold
names.cv <- row.names(df_cv)
names.cv[with(df_cv, !ProblematicObs.cv)] <- ""
lines <- data.frame(x1 = c(-30, 45, 120, 270), x2 = c(30, 105, 180, 330), y1 = c(0, 75, 150, 300), y2 = c(0, 75, 150, 300))
g1 <- ggplot(df_cv) + geom_point(aes(x = obs, y = pred), size = .95 ) + theme_piss() + labs(title = "Cross-Validation", subtitle = "Predicted values vs observed values") + geom_text(aes(obs, pred+12, label = names.cv, size = error), color = "red") + geom_segment(aes(x = x1, xend = x2, y = y1, yend = y2), col = "blue", linetype = 2, data = lines) + # Annotate whole value of RMSE
annotate(geom = "text", label = paste("RMSE-cv", " is : \n", round(rmse.cv, 4)),
x = 250,
y = 45, col = "#33666C" , size = 4)
df <- data.frame(obs.test = obs.test, pred.test = pred.test)
df$error <- abs(obs.test - pred.test)
ProblematicObsTest <- df$error > threshold
g2 <- ggplot(df, aes(x = obs.test, y = pred.test))
names <- row.names(df)
names[with(df, !ProblematicObsTest)] <- ""
true.rmse <- RMSE(obs.test, pred.test)
g2 <- g2 + geom_point(size = .95) + geom_segment(aes(x = x1, xend = x2, y = y1, yend = y2), col = "blue", linetype = 2, data = lines) + theme_piss() + labs(title = "Test set", subtitle = "Predicted values vs observed values") + geom_text(aes(obs.test, pred.test+12, label = names, size = error), color = "red") + # Annotate whole value of RMSE
annotate(geom = "text", label = paste("RMSE", " is : \n", round(true.rmse, 4)),
x = 260,
y = 30, col = "#33666C" , size = 4)
#grid <- grid.arrange(g1, g2, nrow = 1)
#grid.arrange(g1, g2, nrow = 1)
return(list(gcv = g1, gtest = g2 ))
}
Note that on these plots, the size of the observation number printed in red is proportional to the magnitude of its error. The dotted blue lines represent the true (observed) values on the prediction scale. Note also that when we do repeated cross-validation to tune hyperparameters, we only take the predicted values of the first repetition (and not the average over the repeated CV).
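As a hedged sketch, such filtering can be done through caret's fit$pred data frame, which tags every resample in a Resample column (e.g. "Fold01.Rep1" for repeated CV); the helper name firstRepeatPreds is hypothetical and is not used elsewhere in this report.
## Hypothetical helper: keep only the first CV repetition of a repeated-CV
## caret fit, at the selected hyperparameter values, before plotting.
'firstRepeatPreds' <- function(fit) {
  preds <- merge(fit$pred, fit$bestTune)      # rows at the tuned values only
  preds[grepl("Rep1$", preds$Resample), ]     # first repetition only
}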
df_cv_reglin <- regLin$pred[order(regLin$pred$rowIndex),]
# Retrieve correct indexing of observations
rownames(df_cv_reglin) <- df_cv_reglin$rowIndex
plots_lin <- PredVsObsPlots(df_cv_reglin, pred.test = pred_regLinc, threshold = 25, rmse.cv = cv_rmse_regLin)
# the threshold can be changed to manage the number of problematic obsverations displayed.
grid.arrange(plots_lin$gcv, plots_lin$gtest, nrow = 1)
df_cv_reglinStp <- regLin_stpw$pred[order(regLin_stpw$pred$rowIndex),]
rownames(df_cv_reglinStp) <- df_cv_reglinStp$rowIndex
plots_sw <- PredVsObsPlots(df_cv_reglinStp, pred.test = pred_regLinswc, threshold = 25, rmse.cv = cv_rmse_regLinsw)
grid.arrange(plots_sw$gcv, plots_sw$gtest, nrow = 1)
df_cv_reglinStpAll <- regLin_stpw_all$pred[order(regLin_stpw_all$pred$rowIndex),]
rownames(df_cv_reglinStpAll) <- df_cv_reglinStpAll$rowIndex
plots_swAll <- PredVsObsPlots(df_cv_reglinStpAll, pred.test = pred_regLinsw_allc, threshold = 25, rmse.cv = cv_rmse_regLinsw_all)
grid.arrange(plots_swAll$gcv, plots_swAll$gtest, nrow = 1)
We remark the same shape as for the regression with the 16 original variables.
More specific diagnostics could be made…
Here, the predicted values are too low compared to the true values (except for observation 27, which is still way too high). This can be explained by overfitting: we introduced too much noise, as the stepwise procedure selected too many variables on this wider set.
Note that the two stepwise models differ both in their size and in their test performance.
As statisticians, we have to notice that the assumptions of the linear regression are likely to be poor here, as we have a discrete variable with only 4 categories. However, as our goal is only prediction and not 'global' inference, this is not really important, and that is the reason why we do not check these assumptions here. But it can explain part of the prediction errors observed above.
Instead of fitting the LASSO and the Ridge regressions separately, we decided to fit only the Elastic Net model, which takes both penalties into account and allows for more flexibility.
We use the glmnet package with 2 tuning parameters: alpha (mixing) and lambda (regularization). The criterion can be conveniently written as
\[\hat{\boldsymbol{\beta}} = \text{argmin}_{\boldsymbol{\beta}} \Big\{ \sum_{i=1}^n(y_i-\boldsymbol{x}_i'\boldsymbol{\beta})^2 + \lambda\Big[(1-\alpha)\sum_{j=1}^p\beta_j^2 + \alpha\sum_{j=1}^p|\beta_j|\Big]\Big\}.\] See (Zou and Hastie 2005) for more information.
We make use of 10-fold cross-validation repeated 5 times to improve the accuracy of the results. The optimal hyperparameters will be computed through this cross-validation.
n_folds <- 10
# We tune the hyperparameters together in a 2-dimensional grid.
eGrid2 <- expand.grid(.alpha = (0:10) * 0.1,
.lambda = (0:10) * 0.1)
rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 5, savePredictions = T)
set.seed(12)
glmnet_cv_all <- train(y = training$Hip_CONC,
x = training_c,
method = "glmnet",
trControl = rctrl1,
tuneGrid = eGrid2) # alpha = 0.9, lambda=1
cv_rmse_reglmnet1 <- min(glmnet_cv_all$results$RMSE)
set.seed(12)
glmnet_cv_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "glmnet",
trControl = rctrl1,
tuneGrid = eGrid2) # alpha = 0.1, lambda= 0.3
cv_rmse_reglmnet2 <- min(glmnet_cv_small$results$RMSE)
eGrid3 <- expand.grid(.alpha = seq(.1, .3, by = .03),
.lambda = seq(.1, .5, by = .06))
#rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 5) # repeated cv 10 times
set.seed(12)
glmnet_cv_small2 <- train(y = training$Hip_CONC,
x = training_c_small,
method = "glmnet",
trControl = rctrl1,
tuneGrid = eGrid3)
cv_rmse_reglmnet3 <- min(glmnet_cv_small2$results$RMSE)
pred_glmnet1 <- unname(predict(glmnet_cv_all, test))
pred_glmnet2 <- unname(predict(glmnet_cv_small, test))
pred_glmnet3 <- unname(predict(glmnet_cv_small2, test))
rmse_reglmnet1 <- RMSE(test$Hip_CONC, pred_glmnet1)
rmse_reglmnet2 <- RMSE(test$Hip_CONC, pred_glmnet2)
rmse_reglmnet3 <- RMSE(test$Hip_CONC, pred_glmnet3)
#c(rmse_reglmnet1, rmse_reglmnet2, rmse_reglmnet3)
pred_glmnet1c <- unname(predict(glmnet_cv_all, test_cVar))
## Important to take the right number of features in the test set !!!! 'kept_features' !!!!!!!!
pred_glmnet2c <- unname(predict(glmnet_cv_small, test_c[, kept_features]))
pred_glmnet3c <- unname(predict(glmnet_cv_small2, test_c[ ,kept_features]))
rmse_reglmnet1c <- RMSE(test$Hip_CONC, pred_glmnet1c)
rmse_reglmnet2c <- RMSE(test$Hip_CONC, pred_glmnet2c)
rmse_reglmnet3c <- RMSE(test$Hip_CONC, pred_glmnet3c)
dfglmnet <- as.data.frame(t(data.frame( True.RMSE = c(rmse_reglmnet1c, rmse_reglmnet2c, rmse_reglmnet3c), RMSE.cv = c(cv_rmse_reglmnet1, cv_rmse_reglmnet2, cv_rmse_reglmnet3), lambda = c(glmnet_cv_all$bestTune$lambda, glmnet_cv_small$bestTune$lambda, glmnet_cv_small2$bestTune$lambda))) )
colnames(dfglmnet) <- c( 'Elastic Net full', 'Elastic Net small (-)', 'Elastic Net small (+)')
#pander(dfglmnet)
## Example
# colnames(Beta_2b) <- c("A","B","C")
# Beta_2b<- data.frame(Beta_2b)
# Font.Col1 <- c("bold",NA,NA,"bold","bold","bold")
# formattable(Beta_2b, list(
# A = formatter("span", style = x ~ style(font.weight = Font.Col1))))
Font.Col1 <- c("bold", NA, NA)
formattable(dfglmnet, list(`Elastic Net full` = formatter("span", style = x ~ formattable::style(font.weight = Font.Col1))))
| | Elastic Net full | Elastic Net small (-) | Elastic Net small (+) |
|---|---|---|---|
| True.RMSE | 6.976341 | 8.092806 | 8.102530 |
| RMSE.cv | 8.055671 | 9.334340 | 9.235442 |
| lambda | 1.000000 | 0.300000 | 0.220000 |
It actually shows that it is better not to apply a feature selection before the elastic net: with the Ridge penalty in particular, the model reduces the dimensionality itself by shrinking the coefficients.
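A hedged sketch of how the coefficients actually retained (non-zero) by the tuned full model could be inspected, to see how many ppm the penalty effectively keeps; the object names coefs/kept are only illustrative.
## Sketch: non-zero coefficients of the tuned elastic net (full model).
coefs <- as.matrix(coef(glmnet_cv_all$finalModel, s = glmnet_cv_all$bestTune$lambda))
kept  <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")
length(kept)   # number of ppm effectively kept by the penalty
head(kept)     # a few of the corresponding ppm names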
set.seed(12)
glmnet_fitOptim <- cv.glmnet(y = training$Hip_CONC,
x = training_c, alpha = glmnet_cv_all$bestTune$alpha,
lambda = seq(0, 100, by = .1))
# tidied_cv <- tidy(glmnet_cv_all$results)
# glance_cv <- glance(glmnet_cv_all$results)
#
# tidied_cv <- tidy(glmnet_cv_all$pred)
# glance_cv <- glance(glmnet_cv_all$pred)
#
# ggplot(tidied_cv, aes(lambda, estimate)) + geom_line(color = "red") +
# geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .2) +
# scale_x_log10() +
# geom_vline(xintercept = glance_cv$lambda.min) +
# geom_vline(xintercept = glance_cv$lambda.1se, lty = 2)
# glmnet_cv_all$resample
# glmnet_cv_all$control
set.seed(12)
glmnet_fitRidge <- cv.glmnet(y = training$Hip_CONC,
x = training_c, alpha = 0,
lambda = seq(0, 100, by = .1))
set.seed(12)
glmnet_fitLasso <- cv.glmnet(y = training$Hip_CONC,
x = training_c, alpha = 1,
lambda = seq(0, 100, by = .1))
# --> compare multiple LASSO cross-validations in the same plot.
g1 <- ggplot(glmnet_cv_all) + labs(x = 'mixing percentage (alpha)', linetype = 'Regu. lambda', title = "Full") + theme_piss() + guides(fill=guide_legend(title="lambda"), shape=guide_legend(title="lambda"), color=guide_legend(title="lambda"))
g2 <- ggplot(glmnet_cv_small) + labs(x = 'mixing percentage (alpha)', fill = 'Regul. lambda', title = "Small") + theme_piss() + guides(fill=guide_legend(title="lambda"), shape=guide_legend(title="lambda"), color=guide_legend(title="lambda"))
#grid.arrange(g1, g2, nrow = 1)
PissoortThesis::grid_arrange_legend(g1, g2, ncol = 2, position = "bottom")
## RIDGE
tidied_cvRidge <- tidy(glmnet_fitRidge)
glance_cvRidge <- glance(glmnet_fitRidge)
# We are provided the MSE, hence we square root it to have the RMSE
tidied_cvRidge$RMSE <- sqrt(tidied_cvRidge$estimate)
tidied_cvRidge$conf.lowsq <- sqrt(tidied_cvRidge$conf.low)
tidied_cvRidge$conf.highsq <- sqrt(tidied_cvRidge$conf.high)
rmse_ridge <- RMSE(test$Hip_CONC, unname(predict(glmnet_fitRidge, test_c)))
gr1 <- ggplot(tidied_cvRidge, aes(lambda, RMSE)) + geom_line(color = "red") +
geom_ribbon(aes(ymin = conf.lowsq, ymax = conf.highsq), alpha = .2) +
scale_x_log10() + theme_piss() + labs(y = "estimate (RMSE)") +
geom_vline(xintercept = glance_cvRidge$lambda.min) +
geom_vline(xintercept = glance_cvRidge$lambda.1se, lty = 2) + # largest value of lambda such that error is within 1 standard error of the minimum.
annotate(geom = "text", label = paste("True RMSE", " is : \n", round(rmse_ridge, 4)),
x = 30,
y = 15, col = "#33666C" , size = 5)
## LASSO
tidied_cvLasso <- tidy(glmnet_fitLasso)
glance_cvLasso <- glance(glmnet_fitLasso)
# We are provided the MSE, hence we square root it to have the RMSE
tidied_cvLasso$RMSE <- sqrt(tidied_cvLasso$estimate)
tidied_cvLasso$conf.lowsq <- sqrt(tidied_cvLasso$conf.low)
tidied_cvLasso$conf.highsq <- sqrt(tidied_cvLasso$conf.high)
rmse_lasso <- RMSE(test$Hip_CONC, unname(predict(glmnet_fitLasso, test_c)))
gl2 <- ggplot(tidied_cvLasso, aes(lambda, RMSE)) + geom_line(color = "red") +
geom_ribbon(aes(ymin = conf.lowsq, ymax = conf.highsq), alpha = .2) +
scale_x_log10() + theme_piss() + labs(y = "estimate (RMSE)") +
geom_vline(xintercept = glance_cvLasso$lambda.min) +
geom_vline(xintercept = glance_cvLasso$lambda.1se, lty = 2) + # largest value of lambda such that error is within 1 standard error of the minimum.
annotate(geom = "text", label = paste("True RMSE", " is : \n", round(rmse_lasso, 4), "\n ....."),
x = 1,
y = 85, col = "#33666C" , size = 5)
grid.arrange(gr1, gl2, ncol = 2)
## Fit (to obtain S3 object for further plotting)
fit_glmnetRid <- glmnet(y = training$Hip_CONC,
x = training_c, alpha = 0,
lambda = seq(0,3, by = .1) )
fit_glmnetLass <- glmnet(y = training$Hip_CONC,
x = training_c, alpha = 1,
lambda = seq(0,2, by = .1) )
#plot(fit_glmnetRid, xvar="lambda", main="")
plot(fit_glmnetLass, xvar="lambda", main="")
df_cv_regPenAll <- glmnet_cv_all$pred[glmnet_cv_all$pred$alpha == glmnet_cv_all$bestTune[,1] & glmnet_cv_all$pred$lambda == glmnet_cv_all$bestTune[,2] ,][1:nrow(training),]
# Retrieve correct indexing of observations
rownames(df_cv_regPenAll) <- df_cv_regPenAll$rowIndex
df_cv_regPenSmall <- glmnet_cv_small$pred[glmnet_cv_small$pred$alpha == glmnet_cv_small$bestTune[,1] & glmnet_cv_small$pred$lambda == glmnet_cv_small$bestTune[,2] ,][1:nrow(training),]
rownames(df_cv_regPenSmall) <- df_cv_regPenSmall$rowIndex
df_cv_regPenSmall2 <- glmnet_cv_small2$pred[glmnet_cv_small2$pred$alpha == glmnet_cv_small2$bestTune[,1] & glmnet_cv_small2$pred$lambda == glmnet_cv_small2$bestTune[,2] ,][1:nrow(training),]
rownames(df_cv_regPenSmall2) <- df_cv_regPenSmall2$rowIndex
NOMSG( g1 <- PredVsObsPlots(df_cv_regPenAll, pred.test = pred_glmnet1c, threshold = 20, rmse.cv = cv_rmse_reglmnet1))
NOMSG( g2 <- PredVsObsPlots(df_cv_regPenSmall, pred.test = pred_glmnet2c, threshold = 20, rmse.cv = cv_rmse_reglmnet2))
g3 <- PredVsObsPlots(df_cv_regPenSmall2, pred.test = pred_glmnet3c, threshold = 20, rmse.cv = cv_rmse_reglmnet3)
grid.arrange(g1$gcv, g1$gtest, nrow = 1, top = textGrob(expression( italic("Full")), gp = gpar(fontsize = 19, font = 2, col = "blue")) )
grid.arrange( g2$gcv, g2$gtest,nrow = 1, top = textGrob(expression( italic("Small (-)")), gp = gpar(fontsize = 19, font = 2, col = "blue")) )
grid.arrange( g3$gcv, g3$gtest, nrow = 1, top = textGrob(expression( italic("Small (+)")), gp = gpar(fontsize = 19, font = 2, col = "blue")) )
#grid.arrange(g3$gcv, g3$gtest, nrow=1)
Hyperparameter tuning: interestingly, we see that the value of \(\lambda\) has no impact on the full model (left).
Prediction plots: for the first, it seems that we overestimate the true values, while for the second we really underestimate them...
In Principal Component Regression (PCR), we keep the principal components that preserve the most information from the ppm matrix, and then we build the linear regression model with the scores of the selected components as features. This number of components is the only hyperparameter to tune. Note that this model is especially interesting to overcome the issue of multicollinearity pointed out in section X, as some of our features are highly correlated.
We make use of the pls package to construct our model. The only tuning parameter of this model is the number of components ncomp.
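To make the PCR idea concrete, here is a hedged sketch of the same construction done by hand (PCA scores fed to lm()); the number of components k and the object names are chosen arbitrarily for illustration.
## Sketch of PCR "by hand": PCA on the ppm matrix, then a linear regression
## of the hippurate concentration on the first k score vectors.
k <- 5                                    # illustrative number of components
pca    <- prcomp(training_c, center = TRUE)
scores <- as.data.frame(pca$x[, 1:k])
pcr_by_hand <- lm(training$Hip_CONC ~ ., data = scores)
summary(pcr_by_hand)$r.squared            # variance explained by these k scores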
# Explore manually Values for controlling the training process
rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 5, savePredictions = T) # repeated cv 5 times
set.seed(12)
pcr_cv_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "pcr",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:15)))
cv_rmse_pcrSmall <- min(pcr_cv_small$results$RMSE )
pcr_train_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "pcr",
.ncomp = 3)
set.seed(12)
pcr_cv_all <- train(y = training$Hip_CONC,
x = training_c,
method = "pcr",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:100)))
cv_rmse_pcrAll1 <- min(pcr_cv_all$results$RMSE )
plot(pcr_cv_all) # Take rather 17 comp actually!
set.seed(12)
pcr_cv_all2 <- train(y = training$Hip_CONC,
x = training_c,
method = "pcr",
trControl = rctrl1,
tuneLength = 17)
#tuneGrid = expand.grid(.ncomp = 17))
cv_rmse_pcrAll2 <- min(pcr_cv_all2$results$RMSE )
pred_pcrsmall <- unname(predict(pcr_cv_small, test))
pred_pcrall <- unname(predict(pcr_cv_all, test))
pred_pcrall2 <- unname(predict(pcr_cv_all2, test))
rmse_pcrSmall <- RMSE(test$Hip_CONC,pred_pcrsmall)
rmse_pcrAll <- RMSE(test$Hip_CONC, pred_pcrall)
rmse_pcrAll2 <- RMSE(test$Hip_CONC, pred_pcrall2)
pred_pcrsmallc <- unname(predict(pcr_cv_small, test_c[, kept_features]))
pred_pcrallc <- unname(predict(pcr_cv_all, test_c))
pred_pcrall2c <- unname(predict(pcr_cv_all2, test_c))
rmse_pcrSmallc <- RMSE(test$Hip_CONC,pred_pcrsmallc)
rmse_pcrAllc <- RMSE(test$Hip_CONC, pred_pcrallc)
rmse_pcrAll2c <- RMSE(test$Hip_CONC, pred_pcrall2c)
# ?
rmse_pcrSmallcomp3 <- RMSE(test$Hip_CONC, unname(predict(pcr_train_small, test)))
rmse_pcrSmallcomp3c <- RMSE(test$Hip_CONC, unname(predict(pcr_train_small, test_c)))
We first tried a small number of values for ncomp and remarked that the optimal value was the highest one. Hence, we extended the grid to 100 and retrieved 66 components (very variable) as optimal.
Independent Component Regression (ICR) is analogous to Principal Component Regression (PCR) but uses Independent Component Analysis (ICA) to produce the scores. We use the fastICA package, and the model has the same single tuning parameter, n.comp, as the PCR.
# Explore manually Values for controlling the training process
rctrl1 <- trainControl(method = "cv", number = 10, savePredictions = T, verboseIter = T) # Here the algorithm is very slow so we decrease the number of folds and of repetitions for the cv.
set.seed(12)
icr_cv_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "icr",
trControl = rctrl1,
tuneGrid = expand.grid(.n.comp = c(1:16)))
cv_rmse_icr_small <- min(icr_cv_small$results$RMSE)
set.seed(12)
icr_cv_all <- train(y = training$Hip_CONC,
x = training_c,
method = "icr",
trControl = rctrl1,
tuneGrid = expand.grid(.n.comp = c(1:35)), verbose = F)
cv_rmse_icr_all <- min(icr_cv_all$results$RMSE)
#
pred_icr_small <- predict(icr_cv_small, test)
rmse_icrSmall <- RMSE(test$Hip_CONC, unname(pred_icr_small) ) # 23.55
pred_icr_smallc <- predict(icr_cv_small, test_c[,kept_features])
rmse_icrSmallc <- RMSE(test$Hip_CONC, unname(pred_icr_smallc) )
pred_icr_allc <- predict(icr_cv_all, test_c)
rmse_icrAll <- RMSE(test$Hip_CONC, unname(pred_icr_allc))
c(rmse_icrSmallc, rmse_icrAll)An error occurred when running the ICR on the complete training set. (?)
dfpicr <- data.frame( True.RMSE = c(rmse_pcrSmallc, rmse_pcrAllc, rmse_pcrAll2c, rmse_icrSmallc, rmse_icrAll), RMSE.cv = c(cv_rmse_pcrSmall, cv_rmse_pcrAll1, cv_rmse_pcrAll2, cv_rmse_icr_small, cv_rmse_icr_all), ncomp = c(pcr_cv_small$bestTune$ncomp, pcr_cv_all$bestTune$ncomp, pcr_cv_all2$bestTune$ncomp, icr_cv_small$bestTune$n.comp, icr_cv_all$bestTune$n.comp ))
rownames(dfpicr) <- c( 'PCR small', 'PCR all (+)', 'PCR all (-)', "ICR small", "ICR all")
pander(t(dfpicr))
| | PCR small | PCR all (+) | PCR all (-) | ICR small | ICR all |
|---|---|---|---|---|---|
| True.RMSE | 9.647 | 6.57 | 6.725 | 8.243 | 13.07 |
| RMSE.cv | 9.653 | 5.306 | 6.042 | 9.002 | 12.98 |
| ncomp | 7 | 35 | 16 | 12 | 30 |
We see with the “true” RMSE that taking, a priori, too high a number of (irrelevant) features is very bad for performance. Moreover, the cross-validated tuning of the number of components becomes less reliable (high variance) and is thus more likely to overfit, as is the case here. Hence, we should either make the CV scheme more reliable (increase the number of repetitions, play with the number of folds, use bootstrap resampling, …) or reduce the feature space beforehand.
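For instance (a sketch only, with illustrative values), the resampling scheme could be made more stable either with many more repetitions of the k-fold CV or with bootstrap resampling, both via trainControl():
# More repetitions of the k-fold CV (slower but more stable error estimates)
rctrl_more_reps <- trainControl(method = "repeatedcv", number = n_folds, repeats = 20, savePredictions = T)
# Bootstrap resampling as an alternative
rctrl_boot <- trainControl(method = "boot", number = 50, savePredictions = T)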
ncoptall <- pcr_cv_all$bestTune$ncomp
library(grid)
vp <- viewport(width = 0.45,
height = 0.48,
x = 0.59,
y = 0.53)
gBigAll <- ggplot(pcr_cv_all) + geom_vline(aes(xintercept = ncoptall, color = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "All (+) ") + geom_vline(aes(xintercept = 12, color = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
gZOOMall <- ggplot(pcr_cv_all) + geom_vline(aes(xintercept = 4, col = "advised(-)"), linetype = 2) + theme_piss(size_c = 9, size_p = 13, legend.position = c(.83, .45)) + labs(title = "Zoom ", y = NULL, x = NULL) + geom_vline(xintercept = 12, col = 3, linetype = 2) + coord_cartesian(xlim = c(0, 20)) + scale_color_manual(name = "# comp.", values = c("advised(-)" = "blue")) + scale_size(range=c(2,10)) + theme(legend.title = element_text(size=10, face="bold")) + theme(legend.text = element_text(size = 8, face = "italic"))
gBigAll
print(gZOOMall, vp = vp)ncoptsmall <- pcr_cv_small$bestTune$ncomp
gpcSmall1 <- ggplot(pcr_cv_small) + geom_vline(aes(xintercept = ncoptsmall, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "Small ") + geom_vline(aes(xintercept = 3, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
ncoptall2 <- pcr_cv_all2$bestTune$ncomp
gpcall2 <- ggplot(pcr_cv_all2) + geom_vline(aes(xintercept = ncoptall2, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "All (-) ") + geom_vline(aes(xintercept = 4, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
grid_arrange_legend(gpcSmall1, gpcall2, ncol = 2, position = "bottom" )ncoptsmalli <- icr_cv_small$bestTune$n.comp
gicSmall1 <- ggplot(icr_cv_small) + geom_vline(aes(xintercept = ncoptsmalli, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "Small ") + geom_vline(aes(xintercept = 6, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
ncoptalli <- icr_cv_all$bestTune$n.comp
gicall2 <- ggplot(icr_cv_all) + geom_vline(aes(xintercept = ncoptalli, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "All ") + geom_vline(aes(xintercept = 11, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
grid_arrange_legend(gicSmall1, gicall2, ncol = 2, position = "bottom" )Based on these plots, we will now refit the models with the more parsimonious numbers of components “advised” above.
fit.pcr.small3 <- pcr(training$Hip_CONC ~., data = as.data.frame(training_c_small), ncomp = 3, validation = "none")
fit.pcr.all12 <- pcr(training$Hip_CONC ~., data = as.data.frame(training_c), ncomp = 12, validation = "none")
fit.pcr.all4 <- pcr(training$Hip_CONC ~., data = as.data.frame(training_c), ncomp = 4,validation = "none")
pred.fitpcrSmall3 <- unname(predict(fit.pcr.small3, test_c[, kept_features], type = "response", ncomp = 1:3))
pred.fitpcrAll12 <- unname(predict(fit.pcr.all12, test_c, type = "response", ncomp = 1:12))
pred.fitpcrAll4 <- unname(predict(fit.pcr.all4, test_c, type = "response", ncomp = 1:4))
#pred.fitpcrSmall3
rmse.fitpcrSmall3 <- RMSE(test$Hip_CONC, pred.fitpcrSmall3[,,3])
rmse.fitpcrAll12 <- RMSE(test$Hip_CONC, pred.fitpcrAll12[,,12])
rmse.fitpcrAll4 <- RMSE(test$Hip_CONC, pred.fitpcrAll4[,,4])
cat("For PCR Small, PCR All (+) and PCR All (-), we have RMSE : ", c(rmse.fitpcrSmall3, rmse.fitpcrAll12, rmse.fitpcrAll4))For PCR Small, PCR All (+) and PCR All (-), we have RMSE : 10.09472 7.629864 9.89201
Somewhat surprisingly, when comparing these values with those in the table above, we remark that the error actually increases when we reduce the number of components. Hence, there was no real concern of overfitting here.
All the observations labelled in the plots have an absolute residual \(\geq 20\).
## PCR
df_cv_pcrall <- pcr_cv_all$pred[pcr_cv_all$pred$ncomp == pcr_cv_all$bestTune[,1],][1:nrow(training),]
rownames(df_cv_pcrall) <- df_cv_pcrall$rowIndex
df_cv_pcrsmall <- pcr_cv_small$pred[pcr_cv_small$pred$ncomp == pcr_cv_small$bestTune[,1],][1:nrow(training),]
rownames(df_cv_pcrsmall) <- df_cv_pcrsmall$rowIndex
df_cv_pcrall2 <- pcr_cv_all2$pred[pcr_cv_all2$pred$ncomp == pcr_cv_all2$bestTune[,1],][1:nrow(training),]
rownames(df_cv_pcrall2) <- df_cv_pcrall2$rowIndex
NOMSG( gpcrall <- PredVsObsPlots(df_cv_pcrall, pred.test = pred_pcrallc, threshold = 20,rmse.cv = cv_rmse_pcrAll1))
NOMSG( gpcrsmall <- PredVsObsPlots(df_cv_pcrsmall, pred.test = pred_pcrsmallc, threshold = 20 , rmse.cv = cv_rmse_pcrSmall))
NOMSG( gpcrall2 <- PredVsObsPlots(df_cv_pcrall2, pred.test = pred_pcrall2c, threshold = 20, rmse.cv = cv_rmse_pcrAll2))
## ICR
df_cv_icrSmall <- icr_cv_small$pred[icr_cv_small$pred$n.comp == icr_cv_small$bestTune[,1],][1:nrow(training),]
rownames(df_cv_icrSmall) <- df_cv_icrSmall$rowIndex
NOMSG( g_icr1 <- PredVsObsPlots(df_cv_icrSmall, pred.test = pred_icr_smallc, threshold = 20, rmse.cv = cv_rmse_icr_small))
df_cv_icrAll <- icr_cv_all$pred[icr_cv_all$pred$n.comp == icr_cv_all$bestTune[,1],][1:nrow(training),]
rownames(df_cv_icrAll) <- df_cv_icrAll$rowIndex
NOMSG( g_icr2 <- PredVsObsPlots(df_cv_icrAll, pred.test = pred_icr_allc, threshold = 20, rmse.cv = cv_rmse_icr_all))grid.arrange(gpcrsmall$gcv, gpcrsmall$gtest, nrow = 1)grid.arrange(gpcrall$gcv, gpcrall$gtest, nrow = 1)grid.arrange(gpcrall2$gcv, gpcrall2$gtest, nrow = 1)grid.arrange(g_icr1$gcv, g_icr1$gtest, nrow = 1)grid.arrange(g_icr2$gcv, g_icr2$gtest, nrow = 1)One observation (design: urine, Qh/2 + 0, Qh/2, 0, 150, 0, 450, 1, 3) appears problematic.
The reason there seem to be more observations deviating from the true value in the cross-validation scheme, while the CV RMSE is still lower, is that the error is averaged over more points in CV (215) than in the test scheme (54).
For prediction purposes, the principal components with low variance may also be important, and in some cases even more important. We therefore now introduce information about the relationship between our target (hippurate concentration) and the ppm's.
We use the original library pls.
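For reference, and as we did with pcr() above, the pls package could also be called directly; a hedged sketch (the number of components is illustrative) would be:
fit_plsr <- plsr(training$Hip_CONC ~ ., data = as.data.frame(training_c), ncomp = 15, validation = "CV") # "CV" uses 10 segments by default
# RMSEP(fit_plsr) would then report the cross-validated error per number of components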
# Explore manually Values for controlling the training process
rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 2, savePredictions = T)
set.seed(12)
pls_cv_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "pls",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:15)))
cv_rmse_pls_small <- min(pls_cv_small$results$RMSE)
set.seed(12)
pls_cv_all <- train(y = training$Hip_CONC,
x = training_c,
method = "pls",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:100)))
cv_rmse_pls_all <- min(pls_cv_all$results$RMSE)
pred_plsSmall <- unname(predict(pls_cv_small, test))
pred_plsall <- unname(predict(pls_cv_all, test))
pred_plsSmallc <- unname(predict(pls_cv_small, test_c[,kept_features]))
pred_plsallc <- unname(predict(pls_cv_all, test_c))
# rmse_plsSmall <- RMSE(test$Hip_CONC, pred_plsSmall)
# rmse_plsAll <- RMSE(test$Hip_CONC, pred_plsall)
# c(rmse_plsSmall, rmse_plsAll)
rmse_plsSmallc <- RMSE(test$Hip_CONC, pred_plsSmallc)
rmse_plsAllc <- RMSE(test$Hip_CONC, pred_plsallc)
c(rmse_plsSmallc, rmse_plsAllc)An interesting thing to remark is that the number of components kept is nearly the same for both methods (around 13); is this natural? However, we see that the true error is significantly smaller for the model on the smaller feature space, while the cross-validated RMSE is actually higher for this model. This shows how important it is to do the evaluation on an independent test set.
Larger gaps for the extreme classes “0” and “300” !
Indeed, it is
The function that performs OPLS in the MBXUCL package, OPLSDA(), states in its help page that it works only for (two-class) classification problems. Hence, we had to implement our own cross-validation scheme to use it for regression (see the kfoldCV() function below).
# oplsc <- OPLSDA(as.matrix(training_class_c), as.factor(training_class$Hip_prop), no = 3)
#
# oplsreg <- OPLSDA(as.matrix(training_c), training$Hip_CONC, no = 3)
# str(oplsreg)
#
# OPLSDA_pred(oplsreg, as.matrix(test_cVar))
# OPLSDA_pred(oplsc, as.matrix(test_class_c))
# test_class$Hip_prop
#
# # With other package
# library(ropls)
#
# plsda <- opls(as.matrix(training_class_c), training_class$Hip_prop)
# olsda2 <- opls(as.matrix(training_class_c), training_class$Hip_prop, orthoI = 2)
# olsda2reg <- opls(as.matrix(training_c), training$Hip_CONC, orthoI = 2)
#
# table(test_class$Hip_prop, predict(olsda2, test_class_c))
# predict(olsda2, test_class_c, type="prob")
#
# predict(olsda2reg, test_cVar)
# test$Hip_CONC
YC <- scale(training$Hip_CONC, center = T, scale =F)
YtestC <- scale(test$Hip_CONC, center = T, scale =F)
## If we let the same seed here as we put for the models with caret, we should retrieve the same data partition : the random sampling should be the same... (To check !!!!)
'kfoldCV' <- function(X, Y, k, argrm = 0, seed = 12){
  # Manual k-fold cross-validation: fctreg() (defined below) must return the
  # regression coefficients of the model fitted on the training folds.
  n = dim(X)[1] ; nbseg = k ; set.seed(seed)
  seg_i <- sample(rep(1:k, length.out = n)) ; yfit.cv = numeric()
  for (i in 1:nbseg){
    val_i <- which(seg_i == i)                     # indices of the held-out fold
    Y.train = Y[-val_i] ; X.train = X[-val_i,] ; X.valid = X[val_i,]
    coefcv <- fctreg(X.train, Y.train, argrm)      # fit on the remaining folds
    yfit.cv[val_i] = X.valid %*% coefcv            # predict the held-out fold
    names(yfit.cv)[val_i] <- val_i
  }
  RMSE.cv = sqrt((1/n) * sum((Y - yfit.cv)^2))     # RMSE over all out-of-fold predictions
  return(list(yfit.cv = yfit.cv, RMSE.cv = RMSE.cv))
}
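For clarity, the cross-validated error returned by kfoldCV() is simply
\[ \mathrm{RMSE}_{cv} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i^{\,cv}\right)^2}, \]
where \(\hat{y}_i^{\,cv}\) is the prediction of observation \(i\) obtained from the model fitted without the fold containing \(i\).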
# fctreg <- function(X,Y,argrm) {
# opls(X, Y, orthoI = argrm-1)@coefficientMN
# }
fctreg <- function(X,Y,argrm) {OPLSDA(X,Y,no=argrm-1)$b}
RMSECV_opls <- numeric()
yfitCV_opls <- list() ; NCmax <- 25 # tune it
for(i in 2:NCmax){
cat(i, "\n")
rescv <- kfoldCV(as.matrix(training_c), YC, k = 10, argrm = i)
RMSECV_opls[i] <- rescv$RMSE.cv
yfitCV_opls[[i]] <- rescv$yfit.cv
}
cv_rmse_opls_all <- min(RMSECV_opls, na.rm = T)
ncoptOplsAll <- which.min(RMSECV_opls)
df_opls_reg <- data.frame(indi = 2:NCmax, RMSEcv = RMSECV_opls[-1])
## Optimal model
fit_oplsOpt <- OPLSDA(as.matrix(training_c), YC, no = ncoptOplsAll-1)
pred_oplsAll <- OPLSDA_pred(fit_oplsOpt, test_c)
rmse_oplsAll <- RMSE(YtestC, pred_oplsAll)
## Advised
fit_oplsAdvised <- OPLSDA(as.matrix(training_c), YC, no = 12)
pred_oplsAllAdvis <- OPLSDA_pred(fit_oplsAdvised, test_c)
rmse_oplsAllAdvis <- RMSE(YtestC, pred_oplsAllAdvis)ggoplsAll <- ggplot(df_opls_reg, aes(x = indi, y = RMSEcv)) + geom_point() + geom_line() + geom_vline(aes(xintercept = ncoptOplsAll, col = "optimal"), linetype = 2) + geom_vline(aes(xintercept = 12, col = "advised"), linetype = 2) +
scale_color_manual(name = "# comp.", values = c("advised" = "green", "optimal" = "red")) + theme_piss(size_c = 12, legend.position = c(.92, .75)) + labs(title = "RMSE cross-validation vs complexity", x = "Number of components", y = "RMSE-cv") +
annotate(geom = "text", label = paste("True RMSE's", " are : \n", round(rmse_oplsAll, 4)),
x = 21, y = 9.5, col = "red" , size = 4) +
annotate(geom = "text", label = paste(round(rmse_oplsAllAdvis, 4)),
x = 24, y = 9.1, col = "green" , size = 4)
ggplotly(ggoplsAll)RMSECV_oplsSmall <- numeric()
yfitCV_oplsSmall <- list() ; NCmaxRed <- 14 # tune it
for(i in 2:NCmaxRed){
cat(i, "\n")
rescv <- kfoldCV(as.matrix(training_c_small), YC, k = 10, argrm = i)
RMSECV_oplsSmall[i] <- rescv$RMSE.cv
yfitCV_oplsSmall[[i]] <- rescv$yfit.cv
}
cv_rmse_opls_Small <- min(RMSECV_oplsSmall, na.rm = T)
ncoptOplsSmall <- which.min(RMSECV_oplsSmall)
## Optimal model
fit_oplsOptSmall <- OPLSDA(as.matrix(training_c), YC, no = ncoptOplsSmall-1) # NB: the full training_c is used here, whereas the CV above used training_c_small
pred_oplsSmall <- OPLSDA_pred(fit_oplsOptSmall, test_c)
rmse_oplsSmall <- RMSE(YtestC, pred_oplsSmall)df_opls_regSmall <- data.frame(indi = 2:NCmaxRed, RMSEcv = RMSECV_oplsSmall[-1])
ggoplsSmall <- ggplot(df_opls_regSmall, aes(x = indi, y = RMSEcv)) + geom_point() + geom_line() + geom_vline(xintercept = ncoptOplsSmall, col = "red", linetype = 2) + theme_piss(size_c = 12, legend.position = c(.92, .75)) + labs(title = "RMSE cross-validation vs complexity", x = "Number of components", y = "RMSE-cv")
ggplotly(ggoplsSmall)Hippurate <- as.factor(training$Hip_prop)
gsopls1 <- DrawScores(fit_oplsAdvised, type.obj = "OPLSDA", drawNames=F, createWindow=FALSE, main = paste("OPLS score plots PC1 and PC1-Ortho (left) PC2 and PC2-Ortho (right) "), axes =c(1,2), color = Hippurate, size_p = 0.5) + theme_piss(size_p = 11)
gsopls2 <- DrawScores(fit_oplsAdvised, type.obj = "OPLSDA", drawNames=F, createWindow=FALSE, main = paste("OPLS score plots PC1 and PC1-Ortho (left) PC2 and PC2-Ortho (right) "), axes =c(2,3), color = Hippurate, size_p = 0.5) + theme_piss(size_p = 11)
plotly::subplot(hide_legend(ggplotly(gsopls1)), ggplotly(gsopls2), shareX = T, shareY= T, which_layout = 2)gloadopls1 <- DrawLoadings(fit_oplsAdvised, type.obj = "OPLSDA", createWindow=FALSE, main = paste("OPLS score plots PC1 and PC1-Ortho (left) PC2 and PC2-Ortho (right) "), loadingstype="l", axes =c(1,2), xlab = "ppm")[[1]] + theme_piss(size_p = 11)
# gloadopls2 <- DrawLoadings(fit_oplsOpt, type.obj = "OPLSDA", createWindow=FALSE, main = paste("OPLS score plots PC1 and PC1-Ortho (left) PC2 and PC2-Ortho (right) "), loadingstype="l", axes =c(3,4), xlab = "ppm")[[1]] + theme_piss(size_p = 11)
#
# plotly::subplot(hide_legend(ggplotly(gloadopls1)), ggplotly(gloadopls2), shareX = T, shareY= T, which_layout = 2,nrows = 2)
ggplotly(gloadopls1) Sparse PLS (SPLS) performs simultaneous dimension reduction and variable selection; it is explained in detail by (Chun and Keles 2007).
With the library spls.
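For reference, the spls package also provides its own tuning helper; a hedged sketch (grid values illustrative, and assuming cv.spls() returns the selected values as eta.opt and K.opt) could be:
cv_spls_direct <- cv.spls(as.matrix(training_c_small), training$Hip_CONC, fold = 10, K = 1:10, eta = seq(0.1, 0.9, by = 0.2)) # cross-validated error surface over (K, eta)
# fit_spls <- spls(as.matrix(training_c_small), training$Hip_CONC, K = cv_spls_direct$K.opt, eta = cv_spls_direct$eta.opt)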
n_folds <- 10
rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 1, verboseIter = T, savePredictions = T)
set.seed(12)
spls_cv_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "spls",
trControl = rctrl1,
tuneLength = 4)
spls_cv_small
cv_rmse_spls_small <- min(spls_cv_small$results$RMSE)
pred_splsSmall <- predict(spls_cv_small, test)
pred_splsSmallc <- predict(spls_cv_small, test_c[, kept_features])
# All strongly negative.... weird
plot(spls_cv_small)
grid_spls <- expand.grid(K = seq(1, 100, by = 4),
eta = c(.1, .3, .6),
kappa = .5)
set.seed(12)
spls_cv_all <- train(y = training$Hip_CONC,
x = training_c,
method = "spls",
trControl = rctrl1,
tuneGrid = grid_spls)
#tuneLength = 2)
cv_rmse_spls_all <- min(spls_cv_all$results$RMSE)
pred_spls_allc <- unname(predict(spls_cv_all, test_c))
rmse_splsSmall <- RMSE(test$Hip_CONC, unname(pred_splsSmall))
rmse_splsSmallc <- RMSE(test$Hip_CONC, unname(pred_splsSmallc))
rmse_splsAll <- RMSE(test$Hip_CONC, unname(predict(spls_cv_all, test)))
rmse_splsAllc <- RMSE(test$Hip_CONC,pred_spls_allc)
c(rmse_splsSmall, rmse_splsAll)
c(rmse_splsSmallc, rmse_splsAllc)dfpls <- data.frame( True.RMSE = c(rmse_plsSmallc, rmse_plsAllc, rmse_oplsAll, rmse_oplsSmall,rmse_splsSmallc, rmse_splsAllc), RMSE.cv = c(cv_rmse_pls_small, cv_rmse_pls_all, cv_rmse_opls_all,cv_rmse_opls_Small, cv_rmse_spls_small, cv_rmse_spls_all), ncomp = c(pls_cv_small$bestTune$ncomp, pls_cv_all$bestTune$ncomp, 12, ncoptOplsSmall, spls_cv_small$bestTune$K, spls_cv_all$bestTune$K))
rownames(dfpls) <- c('PLS small', 'PLS all', "OPLS all", "OPLS Small", 'SPLS small','SPLS all')
pander(t(dfpls))
| | PLS small | PLS all | OPLS all | OPLS Small | SPLS small | SPLS all |
|---|---|---|---|---|---|---|
| True.RMSE | 8.236 | 6.599 | 4.215 | 7.491 | 8.223 | 9.136 |
| RMSE.cv | 9.459 | 5.265 | 5.267 | 9.896 | 9.059 | 10.12 |
| ncomp | 15 | 12 | 12 | 4 | 9 | 21 |
ncoptplsAll <- pls_cv_all$bestTune$ncomp
gPlsAll <- ggplot(pls_cv_all) + geom_vline(aes(xintercept = ncoptplsAll, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "All ") + geom_vline(aes(xintercept = ncoptplsAll-1, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
ncoptplsSmall <- pls_cv_small$bestTune$ncomp
gPlsSmall <- ggplot(pls_cv_small) + geom_vline(aes(xintercept = ncoptplsSmall, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "Small ") + geom_vline(aes(xintercept = 4, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
grid_arrange_legend(gPlsAll, gPlsSmall, ncol = 2, position = "bottom" )df_cv_plsSmall <- pls_cv_small$pred[pls_cv_small$pred$ncomp == pls_cv_small$bestTune[,1],][1:nrow(training),]
rownames(df_cv_plsSmall) <- df_cv_plsSmall$rowIndex
NOMSG( g_pls1 <- PredVsObsPlots(df_cv_plsSmall, pred.test = pred_plsSmallc, threshold = 20, rmse.cv = cv_rmse_pls_small))
df_cv_plsall <- pls_cv_all$pred[pls_cv_all$pred$ncomp == pls_cv_all$bestTune[,1],][1:nrow(training),]#[order(pls_cv_all$pred$rowIndex),]
rownames(df_cv_plsall) <- df_cv_plsall$rowIndex
NOMSG( g_pls2 <- PredVsObsPlots(df_cv_plsall, pred.test = pred_plsallc, threshold = 30, rmse.cv = cv_rmse_pls_all ))
grid.arrange(g_pls1$gcv, g_pls1$gtest,g_pls2$gcv, g_pls2$gtest, nrow = 2, top = textGrob(expression( italic(" `Small´ (top) vs `All´ (below)")), gp = gpar(fontsize = 19, font = 2, col = "blue")) ) As we centered both the data and the target to run the model, it was important to back-transform the predictions here.
df_cv_OplsAll <- data.frame( obs = YC + mean(training$Hip_CONC), pred = yfitCV_opls[[12]] + mean(training$Hip_CONC)) # 12 is the advised number of comp.
NOMSG( g_OplsAll <- PredVsObsPlots(df_cv_OplsAll, pred.test = pred_oplsAll + mean(training$Hip_CONC), threshold = 20, rmse.cv = cv_rmse_opls_all))
df_cv_OplsSmall <- data.frame( obs = YC + mean(training$Hip_CONC), pred = yfitCV_oplsSmall[[ncoptOplsSmall]] + mean(training$Hip_CONC))
NOMSG( g_OplsSmall <- PredVsObsPlots(df_cv_OplsSmall, pred.test = pred_oplsSmall + mean(training$Hip_CONC), threshold = 20, rmse.cv = cv_rmse_opls_Small ))
grid.arrange(g_OplsAll$gcv, g_OplsAll$gtest, g_OplsSmall$gcv, g_OplsSmall$gtest, nrow = 2, top = textGrob(expression( italic(" `All´ (top) vs `Small´ (below)")), gp = gpar(fontsize = 19, font = 2, col = "blue")) ) df_cv_SplsSmall <- spls_cv_small$pred[spls_cv_small$pred$K == spls_cv_small$bestTune$K[1] & spls_cv_small$pred$eta == spls_cv_small$bestTune$eta[1] & spls_cv_small$pred$kappa == spls_cv_small$bestTune$kappa[1],][1:nrow(training),]
rownames(df_cv_SplsSmall) <- df_cv_SplsSmall$rowIndex
NOMSG( g_Spls1 <- PredVsObsPlots(df_cv_SplsSmall, pred.test = pred_splsSmallc, threshold = 20, rmse.cv = cv_rmse_spls_small))
df_cv_SplsAll <- spls_cv_all$pred[spls_cv_all$pred$K == spls_cv_all$bestTune$K[1] & spls_cv_all$pred$eta == spls_cv_all$bestTune$eta[1] & spls_cv_all$pred$kappa == spls_cv_all$bestTune$kappa[1],][1:nrow(training),]
rownames(df_cv_SplsAll) <- df_cv_SplsAll$rowIndex
NOMSG( g_Spls2 <- PredVsObsPlots(df_cv_SplsAll, pred.test = pred_spls_allc, threshold = 30, rmse.cv = cv_rmse_spls_all ))
grid.arrange(g_Spls1$gcv, g_Spls1$gtest, g_Spls2$gcv, g_Spls2$gtest, nrow = 2, top = textGrob(expression( italic(" `Small´ (top) vs `All´ (below)")), gp = gpar(fontsize = 19, font = 2, col = "blue")) ) (Investigate why this particular plot shows here ? Weird… Run the code to see the actual plot)
df_coef_reglin <- data.frame( ppm = as.numeric(substr(names(regLin$finalModel$coefficients)[-1], 2, nchar(names(regLin$finalModel$coefficients)[3])-2)), coefficients = unname(regLin$finalModel$coefficients[-1]) )
myPalette <- colorRampPalette(rev(brewer.pal(10, "RdGy")))
#ggplot(df_coef_reglin, aes(x = ppm, y = coefficients, col = coefficients)) + geom_line() + geom_point( size = 2) + theme_piss() + scale_colour_gradientn("Coefficient", limits = c(min(df_coef_reglin$coefficients), max(df_coef_reglin$coefficients) ), colours = myPalette(10))
ggplot(df_coef_reglin, aes(x = ppm, y = coefficients, col = coefficients)) + geom_line() + geom_point( size = 2) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_coef_reglin$coefficients), max(df_coef_reglin$coefficients) ), low = "darkred", mid = "gray80", high = "darkred") + scale_x_reverse() df_coef_reglinstpw <- data.frame( ppm = as.numeric(substr(names(regLin_stpw$finalModel$coefficients)[-1], 2, nchar(names(regLin_stpw$finalModel$coefficients)[3])-2)), coefficients = unname(regLin_stpw$finalModel$coefficients[-1]) )
ggplot(df_coef_reglinstpw, aes(x = ppm, y = coefficients, col = coefficients)) + geom_line() +geom_point( size = 2) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_coef_reglinstpw$coefficients), max(df_coef_reglinstpw$coefficients) ), low = "darkred", mid = "gray80", high = "darkred") + scale_x_reverse()df_coef_reglinstpwAll <- data.frame( ppm = as.numeric(substr(names(regLin_stpw_all$finalModel$coefficients)[-1], 2, nchar(names(regLin_stpw_all$finalModel$coefficients)[3])-2)), coefficients = unname(regLin_stpw_all$finalModel$coefficients[-1]) )
ggplot(df_coef_reglinstpwAll, aes(x = ppm, y = coefficients, col = coefficients)) + geom_line() +geom_point( size = 2) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_coef_reglinstpwAll$coefficients), max(df_coef_reglinstpwAll$coefficients) ), low = "darkred", mid = "gray80", high = "darkred") + scale_x_reverse()We take the object coming from cv.glmnet above with the optimal value for hyperparameters. With this, we can extract the coefficients.
\(\qquad\qquad\qquad\qquad\) \(\alpha\) = 0.4
tmp_coeffs <- coef(glmnet_fitOptim, s = "lambda.min")
df_coeffs <- data.frame(ppm = as.numeric(tmp_coeffs@Dimnames[[1]][tmp_coeffs@i][-1]), coefficient = tmp_coeffs@x[-c(1,100)])
ggplot(df_coeffs, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(tmp_coeffs@x), max(tmp_coeffs@x)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_continuous(breaks = c(rev(as.numeric(unique(substr(tmp_coeffs@Dimnames[[1]][tmp_coeffs@i][-1], 1, 1)))))) + scale_x_reverse()tmp_coeffsRid <- coef(glmnet_fitRidge, s = "lambda.min")
df_coeffsRid <- data.frame(ppm = as.numeric(tmp_coeffsRid@Dimnames[[1]][tmp_coeffsRid@i][-1]), coefficient = tmp_coeffsRid@x[-c(1,100)])
gr1 <- ggplot(df_coeffsRid, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(tmp_coeffsRid@x), max(tmp_coeffsRid@x)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_continuous(breaks = c(rev(as.numeric(unique(substr(tmp_coeffsRid@Dimnames[[1]][tmp_coeffs@i][-1], 1, 1)))))) + scale_x_reverse() + labs(title = "Ridge")
tmp_coeffsLasso <- coef(glmnet_fitLasso, s = "lambda.min")
df_coeffsRidLasso <- data.frame(ppm = as.numeric(tmp_coeffsLasso@Dimnames[[1]][tmp_coeffsLasso@i][-1]), coefficient = tmp_coeffsLasso@x[-c(1,100)])
gr2 <- ggplot(df_coeffsRidLasso, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point(size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(tmp_coeffsLasso@x), max(tmp_coeffsLasso@x)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_continuous(breaks = c(rev(as.numeric(unique(substr(tmp_coeffsLasso@Dimnames[[1]][tmp_coeffs@i][-1], 1, 1)))))) + scale_x_reverse() + labs(title = "LASSO")
grid_arrange_legend(gr1, gr2, ncol = 2, bottom = "right")We decide to represent only the coefficients associated with the first 5 components, as it is …
You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
ppmpcr <- as.numeric(substr(dimnames(pcr_cv_all$finalModel$coefficients)[[1]], 2, nchar(dimnames(pcr_cv_all$finalModel$coefficients)[[1]][3])-2))
df_pcrAllComp <- melt(pcr_cv_all$finalModel$coefficients)
df_pcrAllComp <- df_pcrAllComp[, colnames(df_pcrAllComp) %in% c("Var2") == F ]
colnames(df_pcrAllComp) <- c("ppm", "Comp", "coefficients")
df_pcrAllComp$ppm_num <- as.numeric(substr(as.character(df_pcrAllComp$ppm), 2, nchar(as.character(df_pcrAllComp$ppm)[3])-2))
## PLOTLY
gppAll <- ggplot(df_pcrAllComp[1:(5*ncol(training_c)),], aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + theme_piss() + scale_x_reverse() + labs(title = "Visualize First Components")
as.widget(ggplotly(gppAll))## tweenr PACKAGE
# ggg <- ggplot(df_pcrAllComp, aes(x = ppm_num, y = coefficients, frame = Comp)) + geom_line() + geom_point( size = 2) + theme_piss() + scale_x_reverse()
#
# library(gganimate)
#
# gganimate(ggg, "exxx.gif")Look at the different y-scales between the two graphs !!!
df_pcr_opt <- data.frame(ppm = as.numeric(substr(as.character(names(pcr_cv_all$finalModel$coefficient[,,35])), 2, nchar(as.character(df_pcrAllComp$ppm)[3])-2)), coefficient = unname(pcr_cv_all$finalModel$coefficient[,,35]))
ggplot(df_pcr_opt, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss(size_p = 15) + scale_colour_gradient2("Value", limits = c(min(df_pcr_opt$coefficient), max(df_pcr_opt$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (35)")You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
df_pcrSmallComp <- melt(pcr_cv_small$finalModel$coefficients)
df_pcrSmallComp <- df_pcrSmallComp[, colnames(df_pcrSmallComp) %in% c("Var2") == F ]
colnames(df_pcrSmallComp) <- c("ppm", "Comp", "coefficients")
df_pcrSmallComp$ppm_num <- as.numeric(substr(as.character(df_pcrSmallComp$ppm), 2, nchar(as.character(df_pcrSmallComp$ppm)[3])-2))
gppSmall <- ggplot(df_pcrSmallComp[1:(5*length(kept_features)),], aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + theme_piss() + scale_x_reverse() + labs(title = "Visualize First Components")
as.widget(ggplotly(gppSmall))Look at the different y-scales between the two graphs !!!
df_pcr_optSmall <- data.frame(ppm = as.numeric(substr(as.character(names(pcr_cv_small$finalModel$coefficient[,,ncoptsmall])), 2, nchar(as.character(df_pcrAllComp$ppm)[3])-2)), coefficient = unname(pcr_cv_small$finalModel$coefficient[,,ncoptsmall]))
ggplot(df_pcr_optSmall, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss(size_p = 15) + scale_colour_gradient2("Value", limits = c(min(df_pcr_optSmall$coefficient), max(df_pcr_optSmall$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (7)")You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
df_pcrall22Comp <- melt(pcr_cv_all2$finalModel$coefficients)
df_pcrall22Comp <- df_pcrall22Comp[, colnames(df_pcrall22Comp) %in% c("Var2") == F ]
colnames(df_pcrall22Comp) <- c("ppm", "Comp", "coefficients")
df_pcrall22Comp$ppm_num <- as.numeric(substr(as.character(df_pcrall22Comp$ppm), 2, nchar(as.character(df_pcrall22Comp$ppm)[3])-2))
gppAll2 <- ggplot(df_pcrall22Comp[1:(5*ncol(training_c)),], aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + theme_piss() + scale_x_reverse() + labs(title = "Visualize First Components") + coord_cartesian(ylim = c(-3000, 9000))
as.widget(ggplotly(gppAll2 ))Same y-scales between the two graphs
df_pcr_opt2 <- data.frame(ppm = as.numeric(substr(as.character(names(pcr_cv_all2$finalModel$coefficient[,,ncoptall2])), 2, nchar(as.character(df_pcrAllComp$ppm)[3])-2)), coefficient = unname(pcr_cv_all2$finalModel$coefficient[,,ncoptall2]))
ggplot(df_pcr_opt2, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss(size_p = 15) + scale_colour_gradient2("Value", limits = c(min(df_pcr_opt2$coefficient), max(df_pcr_opt2$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (16)") + coord_cartesian(ylim = c(-3000, 9000))Just for “visualization” …: (top) Graph of the coefficients associated with the selected independent components in the (linear) regression model. (below) Heatmap of these components against the ‘individuals’ to see their respective contributions. We put the hippurate concentrations of the ‘individuals’ as y-labels to add relevant information; we could use other meta-data instead.
# icr_cv_small$finalModel$model$coefficients
# icr_cv_small$finalModel$model$terms
#
# icr_cv_small$finalModel$model$model
# icr_cv_small$finalMode$names
# icr_cv_small$finalMode$param
# df_icrSmallComp <- melt(icr_cv_small$finalModel$coefficients)
# df_icrSmallComp <- df_icrSmallComp[, colnames(df_icrSmallComp) %in% c("Var2") == F ]
# colnames(df_icrSmallComp) <- c("ppm", "Comp", "coefficients")
# df_icrSmallComp$ppm_num <- as.numeric(substr(as.character(df_icrSmallComp$ppm), 2, nchar(as.character(df_icrSmallComp$ppm)[3])-2))
df_icrSmallICA <- data.frame(ICA = as.numeric(substr(names(icr_cv_small$finalModel$model$coefficients)[-1], 4, 5)), coefficients = icr_cv_small$finalModel$model$coefficients[-1] )
ggicrSmall <- ggplot(df_icrSmallICA, aes(x = ICA, y = coefficients, col = coefficients)) + geom_line(size = 0.55) + geom_point( size = 1.9) + theme_piss() + labs(title = "Visualize Components") + scale_colour_gradient2("Value", limits = c(min(df_icrSmallICA$coefficient), max(df_icrSmallICA$coefficient)), low = "darkred", mid = "gray80", high = "darkred" )
ggicrSmall
#ggplotly(ggicrSmall )
mat.Coeff.ICR <- as.matrix(icr_cv_small$finalModel$model$model)
# basic heatmap : orange and white show higher values
namesSpec <- rep(c( rep("", 4), "0", rep("", 4), "75", rep("", 4), "150", rep("", 4), "300", rep("", 4)), 9)
hm_icrSmall <- heatmap(mat.Coeff.ICR[,-1], Rowv=NA, Colv=NA,
col = heat.colors(256),
margins=c(5, 0) , labRow = namesSpec[-1] ,
main = 'Heatmap of Hippurate Conc. VS components')# ggplot( as.data.frame(mat.Coeff.ICR[,-1]), aes(x = y, y = teams)) +
# geom_tile(aes(fill = performance))
#
# ggplotly(sfo_hmap1)
#
# resH <- plot_heatmapsOnSel(icaSet = mat.Coeff.ICR, selCutoff = 3, level = "features", keepVar = keepVar,
# doSamplesDendro = FALSE, doGenesDendro = FALSE, keepComp = 2,
# heatmapCol = maPalette(low = "blue", high = "red", mid = "yellow", k=44),
# file = "heatmapWithoutDendro", annot2col=annot2col(params))
plotHeaticr1 <-plot_ly(x=namesSpec, y=colnames(mat.Coeff.ICR[,-1]), z = mat.Coeff.ICR[,-1], type="heatmap")
namesObsPlotly <- make.names(rownames(mat.Coeff.ICR[,-1]))
xlab = list(categoryorder = "array", categoryarray = namesObsPlotly)
ylab = list(categoryorder = "array", categoryarray = c("d", "f", "z","g"))
plotHeaticr1 <-plot_ly(x=colnames(mat.Coeff.ICR[,-1]), y = namesObsPlotly, z = mat.Coeff.ICR[,-1], type="heatmap") %>%
layout(yaxis = xlab)
plotHeaticr1#embed_notebook(plotHeaticr1)
# m <- matrix(rnorm(12), nrow = 4, ncol = 3)
# xlab = list(categoryorder = "array", categoryarray = c("a", "b", "c","d"))
# ylab = list(categoryorder = "array", categoryarray = c("d", "f", "z","g"))
# plot_ly(x = c("a1", "b1", "c1","d1"), y = c("d", "f", "z","g"), z = m, type = "heatmap") %>%
# layout(xaxis = xlab, yaxis = ylab)(top) Graph of the coefficients associated with the selected independent components in the (linear) regression model. (below) Heatmap of these components against the ‘individuals’ to see their respective contributions. We put the hippurate concentrations of the ‘individuals’ as y-labels to add relevant information; we could use other meta-data instead.
df_icrAllICA <- data.frame(ICA = as.numeric(substr(names(icr_cv_all$finalModel$model$coefficients)[-1], 4, 5)), coefficients = icr_cv_all$finalModel$model$coefficients[-1] )
ggicrAll <- ggplot(df_icrAllICA, aes(x = ICA, y = coefficients, col = coefficients)) + geom_line(size = 0.55) + geom_point( size = 1.9) + theme_piss() + scale_x_reverse() + labs(title = "Visualize Components") + scale_colour_gradient2("Value", limits = c(min(df_icrAllICA$coefficient), max(df_icrAllICA$coefficient)), low = "darkred", mid = "gray80", high = "darkred" )
ggicrAll
mat.Coeff.ICRAll <- as.matrix(icr_cv_all$finalModel$model$model)
# basic heatmap : orange and white show higher values
hm_icrAll <- heatmap(mat.Coeff.ICRAll[,-1], Rowv=NA, Colv=NA,
col = heat.colors(256),
margins=c(5, 0) , labRow = namesSpec[-1] ,
main = 'Heatmap of Hippurate Conc. VS components')plotHeaticr2 <-plot_ly(x=namesSpec[-1], y=colnames(mat.Coeff.ICRAll[,-1]), z = mat.Coeff.ICRAll[,-1], type="heatmap")
plotHeaticr2 <-plot_ly(x=colnames(mat.Coeff.ICRAll[,-1]), y = namesObsPlotly, z = mat.Coeff.ICRAll[,-1], type="heatmap") %>%
layout(yaxis = xlab)
plotHeaticr2#embed_notebook(plotHeaticr2)You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
#pls_cv_small$finalModel$coefficients
# plot(pls_cv_small$finalModel, plottype = "coef", ncomp=1:3, legendpos = "bottomleft", labels = "numbers",
# xlab = "nm")
# plot(pls_cv_small$finalModel, type = "coefficients", ...)
df_plsSmallComp <- melt(pls_cv_small$finalModel$coefficients)
df_plsSmallComp <- df_plsSmallComp[, colnames(df_plsSmallComp) %in% c("Var2") == F ]
colnames(df_plsSmallComp) <- c("ppm", "Comp", "coefficients")
df_plsSmallComp$ppm_num <- as.numeric(substr(as.character(df_plsSmallComp$ppm), 2, nchar(as.character(df_plsSmallComp$ppm)[3])-2))
## PLOTLY
gplsSmall <- ggplot(df_plsSmallComp[1:(5*length(kept_features)),], aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + labs(title = "Visualize First Components") + theme_piss() + scale_x_reverse()
ggplotly(gplsSmall)Look at the different y-scales between the two graphs !!!
df_pls_opt <- data.frame(ppm = as.numeric(substr(as.character(names(pls_cv_small$finalModel$coefficient[,,pls_cv_small$bestTune$ncomp])), 2, nchar(as.character(df_plsSmallComp$ppm)[3])-2)), coefficient = unname(pls_cv_small$finalModel$coefficient[,,pls_cv_small$bestTune$ncomp]))
ggplot(df_pls_opt, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_pls_opt$coefficient), max(df_pls_opt$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (16)")You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
df_plsAllComp <- melt(pls_cv_all$finalModel$coefficients)
df_plsAllComp <- df_plsAllComp[, colnames(df_plsAllComp) %in% c("Var2") == F ]
colnames(df_plsAllComp) <- c("ppm", "Comp", "coefficients")
df_plsAllComp$ppm_num <- as.numeric(substr(as.character(df_plsAllComp$ppm), 2, nchar(as.character(df_plsAllComp$ppm)[3])-2))
## PLOTLY
gplsAll <- ggplot(df_plsAllComp[1:(5*ncol(training_c)),], aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + labs(title = "Visualize First Components") + theme_piss() + scale_x_reverse() + coord_cartesian(ylim = c(-5000, 9e3))
ggplotly(gplsAll)Scales are the same
df_pls_opt2 <- data.frame(ppm = as.numeric(substr(as.character(names(pls_cv_all$finalModel$coefficient[,,pls_cv_all$bestTune$ncomp])), 2, nchar(as.character(df_plsSmallComp$ppm)[3])-2)), coefficient = unname(pls_cv_all$finalModel$coefficient[,,pls_cv_all$bestTune$ncomp]))
ggplot(df_pls_opt2, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_pls_opt2$coefficient), max(df_pls_opt2$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (16)") + coord_cartesian(ylim = c(-5000, 9e3))Total of the 9 components
You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
#spls_cv_small$finalModel$betahat
df_SplsSmallComp <- melt(spls_cv_small$finalModel$betamat)
df_SplsSmallComp <- df_SplsSmallComp[, colnames(df_SplsSmallComp) %in% c("Var2") == F ]
colnames(df_SplsSmallComp) <- c("ppm", "coefficients", "Comp")
df_SplsSmallComp$ppm_num <- as.numeric(substr(as.character(dimnames(spls_cv_small$finalModel$betahat)[[1]]), 1, nchar(as.character(dimnames(spls_cv_small$finalModel$betahat)[[1]])[3])-1))
df_SplsSmallComp$Comp <- as.factor(df_SplsSmallComp$Comp)
## PLOTLY
gSplsSmAll <- ggplot(df_SplsSmallComp, aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + labs(title = "Visualize All Components") + theme_piss() + scale_x_reverse() + coord_cartesian(ylim = c(-12, 25))
ggplotly(gSplsSmAll)Scales are the same
df_splsSmall_opt <- data.frame(ppm = as.numeric(substr(dimnames(spls_cv_small$finalModel$betahat)[[1]], 1, nchar(as.character(dimnames(spls_cv_small$finalModel$betahat)[[1]][3]))-1)), coefficient = unname(spls_cv_small$finalModel$betamat[[spls_cv_small$bestTune$K]]))
ggplot(df_splsSmall_opt, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_splsSmall_opt$coefficient), max(df_splsSmall_opt$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (9)") + coord_cartesian(ylim = c(-12, 25))You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
df_SplsAllComp <- melt(spls_cv_all$finalModel$betamat)
df_SplsAllComp <- df_SplsAllComp[, colnames(df_SplsAllComp) %in% c("Var2") == F ]
colnames(df_SplsAllComp) <- c("ppm", "coefficients", "Comp")
df_SplsAllComp$ppm_num <- as.numeric(substr(as.character(dimnames(spls_cv_all$finalModel$betahat)[[1]]), 1, nchar(as.character(dimnames(spls_cv_all$finalModel$betahat)[[1]])[3])-1))
## Visualize only the first 5 components
df_SplsAllComp5 <- df_SplsAllComp[df_SplsAllComp$Comp < 6,]
df_SplsAllComp5$Comp <- as.factor(df_SplsAllComp5$Comp)
## PLOTLY
gSplsAll <- ggplot(df_SplsAllComp5, aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + labs(title = "Visualize First Components") + theme_piss() + scale_x_reverse() + coord_cartesian(ylim = c(-4, 8))
ggplotly(gSplsAll)#pspls <- ggplotly(gSplsAll)
#link_pspls <- plotly_POST(p, filename="ggplot2/intro-2")Scales are the same
df_splsAll_opt <- data.frame(ppm = as.numeric(substr(dimnames(spls_cv_all$finalModel$betahat)[[1]], 1, nchar(as.character(dimnames(spls_cv_all$finalModel$betahat)[[1]][3]))-1)), coefficient = unname(spls_cv_all$finalModel$betamat[[spls_cv_all$bestTune$K]]))
ggplot(df_splsAll_opt, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_splsAll_opt$coefficient), max(df_splsAll_opt$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (25)") + coord_cartesian(ylim = c(-4, 8))Interestingly, it is rather the low ppm values that seem to have the largest effects.
As for PCR, the first component has coefficients of much lower magnitude than the other components; we can see this for both PLS and SPLS. For SPLS
Now, we will explore some models that are known in the machine learning literature to have among the best performance across a wide range of contexts. These methods are thus not oriented towards omics data analysis like some of those seen above, but since our objective is to build the most accurate model, it is interesting to see whether one of them could outperform the preceding models. We will not explain them in detail as this is not part of the course and they are rather complex.
These methods are random forest and XGBoost; their theoretical background goes beyond the scope of this project. They are really “off-the-shelf” methods, meaning that they do not really need preprocessing. Moreover, the feature selection process is done ‘inside’ the algorithms through variable importance metrics (actually, we used these metrics to select our features), that is, how much each feature contributes to the predictive performance of the model.
Outlined by (Breiman 2001), random forest is an algorithm that fits several trees on bootstrap samples of the training set and aggregates their predictions (bootstrap aggregating, or bagging). Hence, it is often preferred to ‘individual’ trees as it reduces their tendency to overfit.
We use the randomForest package (together with e1071 and foreach) and run the random forest algorithm in parallel, as the process is computationally demanding. This allows us to consider a finer grid for the tuning hyperparameter mtry, the number of predictors sampled randomly as split candidates at each node.
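As a point of reference (a sketch only, outside caret, with an illustrative mtry value), the randomForest package could be called directly, and its variable importance measures are what we alluded to above:
library(randomForest)
set.seed(12)
rf_direct <- randomForest(x = as.matrix(training_c), y = training$Hip_CONC, mtry = 100, ntree = 500, importance = TRUE) # mtry value illustrative
head(sort(importance(rf_direct)[, "%IncMSE"], decreasing = TRUE)) # most 'important' ppm's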
We first fit the model on the raw (non-centered) features, and then on the centered ones.
n_folds <- 10
rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 1, verbose = T, savePredictions = T)
# Without Centering the Data !!!!
time <- proc.time()
set.seed(12)
rf_all_nocenter <- train(y = training$Hip_CONC,
x = training[, 9:ncol(training)],
method = "parRF",
trControl = rctrl1,
tuneLength = 15,
verbose = T)
(proc.time() - time)[3] # 637 sec
plot(rf_all_nocenter)
cv_rmse_rf_all_nocenter <- min(rf_all_nocenter$results$RMSE)
pred_rf_all_noCenter <- predict(rf_all_nocenter, test)
rmse_rf_all_nocenter <- RMSE(test$Hip_CONC, unname(pred_rf_all_noCenter))
#### CENTERING the training features !
time <- proc.time()
set.seed(12)
rf_all <- train(y = training$Hip_CONC,
x = training_c,
method = "parRF",
trControl = rctrl1,
tuneLength = 23,
verbose = T)
(proc.time() - time)[3] # 943 sec
cv_rmse_rf_all <- min(rf_all$results$RMSE)
#pred_rf_all <- predict(rf_all, test)
pred_rf_allc <- predict(rf_all, test_c)
#rmse_rf_all <- RMSE(test$Hip_CONC, unname(pred_rf_all))
rmse_rf_allc <- RMSE(test$Hip_CONC, unname(pred_rf_allc))
cat("RMSE from ", rmse_rf_all_nocenter, " to", rmse_rf_allc, "by centering features")gained nearly 3 points of ´true´ RMSE only by centering the training features.
To run !!!!!!!!
time <- proc.time()
set.seed(12)
rf_allMore <- train(y = training$Hip_CONC,
x = training_c,
method = "parRF",
trControl = rctrl1,
tuneGrid = expand.grid(.mtry = c(10, 50, 100, 140, seq(150, 450, by = 10 ))),
verbose = T)
(proc.time() - time)[3] # 943 sec
cv_rmse_rf_allMore <- min(rf_allMore$results$RMSE)
#pred_rf_all <- predict(rf_all, test)
pred_rf_allMore <- predict(rf_allMore, test_c)
rmse_rf_allMore <- RMSE(test$Hip_CONC, unname(pred_rf_allMore))Note that when we have no idea which hyperparameter values to try, we can let caret choose them automatically (via tuneLength); it makes a crude guess as to what values to use. The idea we will then follow is to start from this automatic grid and refine it around the best values found, as sketched below.
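A minimal sketch of this two-step strategy (hypothetical object names, shown with pls for speed; the XGBoost code below applies the same idea):
# Step 1: let caret pick a crude grid automatically
coarse_fit <- train(y = training$Hip_CONC, x = training_c_small, method = "pls", trControl = rctrl1, tuneLength = 5)
# Step 2: refine the grid around the best value found
refined_grid <- expand.grid(.ncomp = pmax(1, coarse_fit$bestTune$ncomp + (-2:2)))
refined_fit <- train(y = training$Hip_CONC, x = training_c_small, method = "pls", trControl = rctrl1, tuneGrid = refined_grid)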
From the xgboost package
## Guess the grid. Then update values
xgb_grid_1 = expand.grid(
nrounds = c(50, 150),
max_depth = c(20, 12, 30),
eta = c(.3, .1),
gamma = c(1, 2),
colsample_bytree = c(.85),
min_child_weight = c(1),
subsample = c(.9) )
xgb_trcontrol_1 = trainControl(
method = "repeatedcv",
number = 10,
repeats = 1, # Very Very slow ...... So no repetitions here
verboseIter = T,
allowParallel = T,
savePredictions = T
)
set.seed(12)
time <- proc.time()
xgb_train_4 <- train(y = training$Hip_CONC,
x = training_c,
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneGrid = xgb_grid_1)
(proc.time() - time)[3]
# Approx 30min with repeats =2.
cv_rmse_xgbAll <- min(xgb_train_4$results$RMSE)
pred_xgb_reg <- predict(xgb_train_4, test_c)
rmse_xgb_reg <- RMSE(test$Hip_CONC, unname(pred_xgb_reg)) # 21
### ========================
# Automatic Grid (try)
set.seed(12)
time <- proc.time()
xgb_train_AutoGrid2 <- train(y = training$Hip_CONC,
x = training_c,
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneLength = 2) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 85 sec.
set.seed(12)
time <- proc.time()
xgb_train_AutoGrid3 <- train(y = training$Hip_CONC,
x = training_c,
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneLength = 3) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 323 sec
set.seed(12)
time <- proc.time()
xgb_train_AutoGrid4 <- train(y = training$Hip_CONC,
x = training_c,
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneLength = 4) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 790 sec
cv_rmse_xgbAuto2 <- min(xgb_train_AutoGrid2$results$RMSE)
cv_rmse_xgbAuto3 <- min(xgb_train_AutoGrid3$results$RMSE)
cv_rmse_xgbAuto4 <- min(xgb_train_AutoGrid4$results$RMSE)
pred_xgb_regAuto2c <- predict(xgb_train_AutoGrid2, test_c)
pred_xgb_regAuto3c <- predict(xgb_train_AutoGrid3, test_c)
pred_xgb_regAuto4c <- predict(xgb_train_AutoGrid4, test_c)
rmse_xgb_regAuto2 <- RMSE(test$Hip_CONC, unname(pred_xgb_regAuto2c))
rmse_xgb_regAuto3 <- RMSE(test$Hip_CONC, unname(pred_xgb_regAuto3c))
rmse_xgb_regAuto4 <- RMSE(test$Hip_CONC, unname(pred_xgb_regAuto4c))### Without centering training set #####################
###############################################
# Control the cross-validation
xgb_trcontrol_1 = trainControl(
method = "repeatedcv",
number = 10,
repeats = 1, # Very Very slow ...... So no repetitions here
verboseIter = T,
allowParallel = T,
savePredictions = T
)
## Tune with a grid of length 2 for each hyperparameters
time <- proc.time()
set.seed(12)
xgb_train_AutoGrid2_nocenter <- train(y = training$Hip_CONC,
x = training[, 9:ncol(training)],
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneLength = 2) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 131sec
cv_rmse_xgbNocenter <- min(xgb_train_AutoGrid2_nocenter$results$RMSE)
pred_xgb_nocenter <- predict(xgb_train_AutoGrid2_nocenter, test)
rmse_xgb_regNocenter <- RMSE(test$Hip_CONC, unname(pred_xgb_nocenter))
## Tune with a grid of length 5 for each hyperparameters
time <- proc.time()
set.seed(12)
xgb_train_AutoGrid5_nocenter <- train(y = training$Hip_CONC,
x = training[, 9:ncol(training)],
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneLength = 5) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 29 min
cv_rmse_xgbNocenter5 <- min(xgb_train_AutoGrid5_nocenter$results$RMSE)
pred_xgb_nocenter5 <- predict(xgb_train_AutoGrid5_nocenter, test)
rmse_xgb_regNocenter5 <- RMSE(test$Hip_CONC, unname(pred_xgb_nocenter5))
## Improve tuning. Keep the optimal parameters found above and tune the two that were held constant.
xgb_train_AutoGrid5_nocenter$bestTune
xgb_grid = expand.grid(
nrounds = 50,
max_depth = 3,
eta = .4,
gamma = c(1, 2, 4, 6),
colsample_bytree = .8,
min_child_weight = c(.5, 1, 2, 4),
subsample = c(.875) )
time <- proc.time()
set.seed(12)
xgb_noCenter_fin <- train(y = training$Hip_CONC,
x = training[, 9:ncol(training)],
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneGrid = xgb_grid) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 94sec
cv_rmse_xgbNocenterFin <- min(xgb_noCenter_fin$results$RMSE)
pred_xgb_nocenterFin <- predict(xgb_noCenter_fin, test)
rmse_xgb_regNocenterFin <- RMSE(test$Hip_CONC, unname(pred_xgb_nocenterFin))
time <- proc.time()
set.seed(12)
xgb_Center_fin <- train(y = training$Hip_CONC,
x = training_c,
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneGrid = xgb_grid)
(proc.time() - time)[3] # 74sec
cv_rmse_xgbCenterFin <- min(xgb_Center_fin$results$RMSE)
pred_xgb_CenterFin <- predict(xgb_Center_fin, test_c)
rmse_xgb_regCenterFin <- RMSE(test$Hip_CONC, unname(pred_xgb_CenterFin))We now take the most “trained” model as the final model. Centering the features actually decreases the RMSE.
Notice that we did not tune it as thoroughly; the hyperparameter space has been only sparsely explored. EDIT: actually, performance even seems to decay when we search a larger hyperparameter space.
We cannot fully trust this model, as we could not tune the hyperparameters thoroughly due to time constraints, and we did not have any real prior information for their starting values (…)
ggplot(rf_all) + geom_vline(xintercept = rf_all$bestTune$mtry, color = "green", linetype = 2) + theme_piss(size_c = 12,size_p = 15) #+ labs(title = "All (+) ") #plot(xgb_train_AutoGrid4)
plot(xgb_train_AutoGrid5_nocenter)We can see how complex the tuning is with so many hyperparameters…
df_cv_rf <- rf_all$pred[rf_all$pred$mtry == rf_all$bestTune[,1],][1:nrow(training),]
rownames(df_cv_rf) <- df_cv_rf$rowIndex
g_rf <- PredVsObsPlots(df_cv_rf, pred.test = pred_rf_allc, threshold = 15, rmse.cv = cv_rmse_rf_all)
df_cv_rf_nocenter <- rf_all_nocenter$pred[rf_all_nocenter$pred$mtry == rf_all_nocenter$bestTune[,1],][1:nrow(training),] %>% arrange(rowIndex)
g_rf_nocenter <- PredVsObsPlots(df_cv_rf_nocenter, pred.test = pred_rf_all_noCenter, threshold = 30, rmse.cv = cv_rmse_rf_all_nocenter)
## Show only the second (best) RF model
#grid.arrange(g_rf_nocenter$gcv, g_rf_nocenter$gtest, ncol = 2)
grid.arrange(g_rf$gcv, g_rf$gtest, ncol = 2, top = textGrob(expression( italic("Random Forest")), gp = gpar(fontsize = 19, font = 2, col = "blue")))## XGBOOST
# df_cv_xgb <- xgb_noCenter_fin$pred %>%
# dplyr::filter(eta == xgb_noCenter_fin$bestTune$eta[1] &
# gamma == xgb_noCenter_fin$bestTune$gamma[1] &
# colsample_bytree == xgb_noCenter_fin$bestTune$colsample_bytree[1] &
# min_child_weight == xgb_noCenter_fin$bestTune$min_child_weight[1] &
# subsample == xgb_noCenter_fin$bestTune$subsample[1] &
# nrounds == xgb_noCenter_fin$bestTune$nrounds[1])
df_cv_xgb <- xgb_noCenter_fin$pred[xgb_noCenter_fin$pred$eta == xgb_noCenter_fin$bestTune$eta[1] & xgb_noCenter_fin$pred$max_depth == xgb_noCenter_fin$bestTune$max_depth[1] & xgb_noCenter_fin$pred$gamma == xgb_noCenter_fin$bestTune$gamma[1] & xgb_noCenter_fin$pred$colsample_bytree == xgb_noCenter_fin$bestTune$colsample_bytree[1] & xgb_noCenter_fin$pred$min_child_weight == xgb_noCenter_fin$bestTune$min_child_weight[1] & xgb_noCenter_fin$pred$subsample == xgb_noCenter_fin$bestTune$subsample & xgb_noCenter_fin$pred$nrounds == xgb_noCenter_fin$bestTune$nrounds[1][1],]#[1:nrow(training),]
rownames(df_cv_xgb) <- df_cv_xgb$rowIndex
g_xxgb <- PredVsObsPlots(df_cv_xgb, pred.test = pred_xgb_nocenterFin, threshold = 20, rmse.cv = cv_rmse_xgbNocenterFin)
grid.arrange(g_xxgb$gcv, g_xxgb$gtest, ncol = 2, top = textGrob(expression( italic("XGboost")), gp = gpar(fontsize = 19, font = 2, col = "blue")))We have already seen two of the three ensemble methods that are widely used in machine learning, namely bagging (used in random forest) and boosting (used in XGBoost). The last ensemble method is “stacking”, i.e. combining the predictions of several (diverse) base models by fitting a meta-model on top of them; a minimal sketch follows.
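A minimal, self-contained illustration of the stacking idea on simulated data (illustrative only; in practice the base predictions would be the out-of-fold predictions of our tuned models):
set.seed(1)
n <- 100
y_toy <- rnorm(n)
pred_base1 <- y_toy + rnorm(n, sd = 0.5)          # stand-in for out-of-fold predictions of base model 1
pred_base2 <- y_toy + rnorm(n, sd = 0.8)          # stand-in for out-of-fold predictions of base model 2
meta_model <- lm(y_toy ~ pred_base1 + pred_base2) # the "stack": a simple linear meta-learner
coef(meta_model)                                  # weights the meta-learner gives to each base model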
We make use of the caretEnsemble package, following its vignette.
## STACKING
trcontrol = trainControl(
method = "cv",
number = 10,
repeats = 1, # Very Very slow ...... So no repetitions here
verboseIter = T,
allowParallel = T,
savePredictions = 'all'
)
t <- proc.time()
model_list <- caretList( # 2 learners
y = training$Hip_CONC,
x = as.data.frame(training[,9:ncol(training)]),
trControl = trcontrol,
#methodList=c("glmboost", "gbm"),
tuneList = list(
rf1 = caretModelSpec(method="parRF", tuneGrid = data.frame(.mtry= rf_all$bestTune[,1])),
#rf2=caretModelSpec(method="rf", tuneGrid=data.frame(.mtry=10), preProcess="pca"),
#nn=caretModelSpec(method="nnet", tuneLength=2, trace=FALSE),
xgbTree=caretModelSpec(method="xgbTree", tuneGrid = xgb_noCenter_fin$bestTune, trace=F)
)
)
print( (proc.time() - t)[3] ) # 109 sec
t <- proc.time()
model_list3 <- caretList( # 3 learners
y = training$Hip_CONC,
x = as.data.frame(training[,9:ncol(training)]),
trControl = trcontrol,
#methodList=c("glmboost", "gbm"),
tuneList=list(
rf1 = caretModelSpec(method="parRF", tuneGrid = data.frame(.mtry= rf_all$bestTune[,1])),
pls1 = caretModelSpec(method="pls", tuneGrid = pls_cv_all$bestTune),
#nn=caretModelSpec(method="nnet", tuneLength=2, trace=FALSE),
xgbTree=caretModelSpec(method="xgbTree", tuneGrid = xgb_noCenter_fin$bestTune, trace=F)
)
)
print( (proc.time() - t)[3] ) # 212 sec.
# Combine all models using simple linear model
set.seed(222)
stack.glm <- caretStack(model_list3,
metric = "RMSE",
method="glm",
trControl=trainControl(
method="cv",
number=10,
savePredictions= 'all'
)) # tune trControl !
print(stack.glm) # RMSE decreases compared to the lowest RMSE !
stack.glm$ens_model$bestTune
greedy_ensemble <- caretEnsemble(
model_list3,
metric="RMSE",
trControl = trainControl(
method="cv",
number=10,
savePredictions='all'
))
str(greedy_ensemble)
#varImp(greedy_ensemble)
model_preds <- lapply(model_list3, predict, newdata = test)
model_predsFin <- data.frame( lapply(model_preds, function(x) x[,"M"]) )
summary(greedy_ensemble)
summary(stack.glm)
ens_preds <- predict(greedy_ensemble, newdata=test[,9:ncol(test)])
ens_preds <- predict(stack.glm, test[,9:ncol(test)])
## The predict() does not work due to the issue.Unfortunately, there is an unresolved issue in caretEnsemble, and hence we cannot directly plug in the optimized hyperparameter values found in the previous sections for each model individually. Instead, we will let caretList retune each model on its default grid (via methodList) and stack those models.
## =======================================================
set.seed(12)
try_models <- caretList(
y = training$Hip_CONC,
x = as.data.frame(training[,9:ncol(training)]),
methodList=c("parRF", "glmnet", "pls", "xgbTree"),
trControl = trcontrol
)
# Some visualization diagnostics between our models
summary(resamples(try_models))
dotplot(resamples(try_models))
modelCor(resamples(try_models)) # look for low corr when stacking
# If the predictions for the sub-models were highly correlated, they would lead to similar predictions, reducing the benefit of stacking.
set.seed(12)
try_greedy_ensemble <- caretEnsemble(
try_models,
metric="RMSE",
trControl = trainControl(
method="cv",
number=10,
savePredictions='all'
))
tryens_preds <- predict(try_models, test[,9:ncol(test)])
tryensAll_preds <- predict(try_greedy_ensemble, test[,9:ncol(test)])
try_cv_rmse_ensemble <- min(try_greedy_ensemble$ens_model$results$RMSE)
try_rmse_ensemble <- RMSE(test$Hip_CONC, unname(tryensAll_preds))
# stack using glm
stackControl <- trainControl(method="repeatedcv", number=10, repeats=2, savePredictions=TRUE, classProbs=TRUE)
set.seed(12)
stack.glm <- caretStack(try_models, method="glm", trControl=stackControl)
summary(stack.glm)
try_cv_rmse_stack <- min(stack.glm$ens_model$results$RMSE)
tryStackAll_preds <- predict(stack.glm, test[,9:ncol(test)])
RMSE(test$Hip_CONC, unname(tryStackAll_preds))
# stack using random forest
set.seed(123)
stack.rf <- caretStack(try_models, method="rf",trControl=stackControl)
print(stack.rf)
try_cv_rmse_stackrf <- min(stack.rf$ens_model$results$RMSE)
tryStackrfAll_preds <- predict(stack.rf, test[,9:ncol(test)])
try_rmse_stackrf <- RMSE(test$Hip_CONC, unname(tryStackrfAll_preds))
## =======================================================
set.seed(12)
try_models2 <- caretList(
y = training$Hip_CONC,
x = as.data.frame(training[,9:ncol(training)]),
methodList=c("parRF", "glmnet", "pls", "pcr"),
trControl = trcontrol
)
summary(resamples(try_models2))
dotplot(resamples(try_models2))
modelCor(resamples(try_models2))
try_greedy_ensemble2 <- caretEnsemble(
try_models2,
metric="RMSE",
trControl = trainControl(
method="cv",
number=10,
savePredictions='all'
))
tryens_preds2 <- predict(try_models2, test[,9:ncol(test)]) # Predictions for each models individually
tryensAll_preds2 <- predict(try_greedy_ensemble2, test[,9:ncol(test)])
try_cv_rmse_ensemble2 <- min(try_greedy_ensemble2$ens_model$results$RMSE)
try_rmse_ensemble2 <- RMSE(test$Hip_CONC, unname(tryensAll_preds2))
# stack using glm
set.seed(123)
stack.glm2 <- caretStack(try_models2, method="glm", trControl=stackControl)
summary(stack.glm2)
try_cv_rmse_stack2 <- min(stack.glm2$ens_model$results$RMSE)
tryStackAll_preds2 <- predict(stack.glm2, test[,9:ncol(test)])
RMSE(test$Hip_CONC, unname(tryStackAll_preds2))
# stack using random forest
set.seed(123)
stack.rf2 <- caretStack(try_models2, method="rf",trControl=stackControl)
print(stack.rf2)
try_cv_rmse_stackrf2 <- min(stack.rf2$ens_model$results$RMSE)
tryStackrfAll_preds2 <- predict(stack.rf2, test[,9:ncol(test)])
try_rmse_stackrf2 <- RMSE(test$Hip_CONC, unname(tryStackrfAll_preds2))
## =======================================================
c(try_rmse_ensemble, try_rmse_stackrf, try_rmse_ensemble2, try_rmse_stackrf2)
This ensemble combines PLS, Random Forest and XGBoost, each with the optimal (tuned) hyperparameter values found individually in the previous sections (see the PLS, RF and XGB sections).
# Some visualization diagnostics between our models
pander(summary(resamples(model_list3))[[c("statistics")]])
RMSE:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s | |
|---|---|---|---|---|---|---|---|
| rf1 | 4.684 | 7.044 | 9.497 | 9.46 | 10.33 | 16.73 | 0 |
| pls1 | 3.581 | 4.317 | 4.874 | 5.09 | 5.652 | 7.005 | 0 |
| xgbTree | 6.029 | 7.731 | 12.35 | 13.82 | 16.98 | 28.54 | 0 |
Rsquared:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s | |
|---|---|---|---|---|---|---|---|
| rf1 | 0.9779 | 0.9903 | 0.9951 | 0.9928 | 0.9974 | 0.9988 | 0 |
| pls1 | 0.9958 | 0.998 | 0.9981 | 0.9979 | 0.9983 | 0.9993 | 0 |
| xgbTree | 0.9448 | 0.974 | 0.986 | 0.9809 | 0.9957 | 0.9972 | 0 |
dotplot(resamples(model_list3))
splom(resamples(model_list3))
pander(modelCor(resamples(model_list3)), caption = "Correlations") # look for low corr when stacking
| rf1 | pls1 | xgbTree | |
|---|---|---|---|
| rf1 | 1 | -0.1707 | 0.876 |
| pls1 | -0.1707 | 1 | -0.1668 |
| xgbTree | 0.876 | -0.1668 | 1 |
# If the predictions of the sub-models were highly correlated, they would lead to similar predictions, reducing the benefit of stacking.
pander(summary(resamples(try_models2))[[c("statistics")]])
RMSE:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s | |
|---|---|---|---|---|---|---|---|
| parRF | 0.7879 | 5.585 | 7.694 | 11.85 | 15.27 | 39.97 | 0 |
| glmnet | 5.816 | 6.78 | 8.561 | 9.242 | 9.525 | 19.75 | 0 |
| pls | 6.784 | 9.2 | 10.18 | 9.661 | 10.46 | 11.07 | 0 |
| pcr | 11.12 | 12.51 | 13.99 | 15.84 | 16.3 | 30.49 | 0 |
Rsquared:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s | |
|---|---|---|---|---|---|---|---|
| parRF | 0.8599 | 0.9775 | 0.995 | 0.9769 | 0.9978 | 1 | 0 |
| glmnet | 0.9654 | 0.9945 | 0.9957 | 0.9927 | 0.9968 | 0.9971 | 0 |
| pls | 0.9906 | 0.9914 | 0.9921 | 0.9926 | 0.9937 | 0.9961 | 0 |
| pcr | 0.9369 | 0.978 | 0.9855 | 0.9787 | 0.9881 | 0.989 | 0 |
dotplot(resamples(try_models2))
pander(modelCor(resamples(try_models2)), caption = "Correlations")
| parRF | glmnet | pls | pcr | |
|---|---|---|---|---|
| parRF | 1 | 0.6251 | -0.5924 | 0.2447 |
| glmnet | 0.6251 | 1 | -0.4056 | -0.1596 |
| pls | -0.5924 | -0.4056 | 1 | -0.6341 |
| pcr | 0.2447 | -0.1596 | -0.6341 | 1 |
# Select only the observations from the first replication
df_cv_stackBest <- stack.rf2$ens_model$pred[stack.rf2$ens_model$pred$mtry == stack.rf2$ens_model$bestTune$mtry,][1:nrow(training),] #%>% dplyr::filter("mtry" == stack.rf2$ens_model$bestTune$mtry)
rownames(df_cv_stackBest) <- df_cv_stackBest$rowIndex
gStack <- PredVsObsPlots(df_cv_stackBest, pred.test = tryStackrfAll_preds2, threshold = 10, rmse.cv = try_cv_rmse_stackrf2)
grid.arrange(gStack$gcv, gStack$gtest, nrow = 1, top = textGrob(expression( italic("RF Stacking")), gp = gpar(fontsize = 19, font = 2, col = "blue")) )
Stacking aims at ensembling models that are not correlated. If the predictions of the sub-models were highly correlated, they would lead to similar predictions, reducing the benefit of ensembling the models.
Hence, by ensembling Random Forest, Penalized Regression, PLS and PCR, which are not strongly correlated, we benefit from stacking: the resulting RMSE is more than 10 times lower…
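To make the mechanism explicit, here is a minimal, self-contained sketch of the stacking idea on toy data (not our spectra; every name below is illustrative): the out-of-fold predictions of two base learners become the inputs of a linear meta-model, which is conceptually what caretStack(..., method = "glm") does under the hood.
set.seed(1)
toy <- data.frame(matrix(rnorm(100 * 5), nrow = 100))          # toy predictors X1..X5
toy$y <- rowSums(toy[, 1:3]) + rnorm(100, sd = 0.5)            # toy response
folds <- sample(rep(1:5, length.out = nrow(toy)))
oof <- data.frame(m1 = rep(NA_real_, nrow(toy)), m2 = rep(NA_real_, nrow(toy)))
for (k in 1:5) {
  tr <- toy[folds != k, ]; va <- toy[folds == k, ]
  oof$m1[folds == k] <- predict(lm(y ~ X1 + X2, data = tr), newdata = va)   # base learner 1 (out-of-fold)
  oof$m2[folds == k] <- predict(lm(y ~ X3 + X4, data = tr), newdata = va)   # base learner 2 (out-of-fold)
}
meta <- lm(toy$y ~ oof$m1 + oof$m2)   # linear meta-model weighting the two base learners
coef(meta)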
As we said in the descriptive analysis, we have to subset our data to retain only two classes for binary classification. The aim is to keep as many observations as possible so as not to deteriorate too much the quality of assessment of the subsequent models. Hence, we keep the classes \(0\) and \(Q_{h/2}\) of Hippurate proportion. We redo the random sampling in order to obtain the same \(80-20\) decomposition as for the regression models.
rownames(dataAll) <- rownames(dataD)
data_classif <- dataAll[dataAll$Hip_prop %in% c("0", "Qh/2") == T ,]
data_classif <- data_classif[, colnames(data_classif) %in% X_NoVariance == F]
set.seed(99)
index <- sample(rownames(data_classif), round(0.8*nrow(data_classif), 0))
# Faster in tibble objects
training_class <- as_tibble(data_classif[rownames(data_classif) %in% index , ])
test_class <- as_tibble(data_classif[rownames(data_classif) %in% index == F, ])
# Transform the factor's names into valid names : needed for future models
training_class$Hip_prop <- make.names(factor(training_class$Hip_prop, levels = c("0", "Qh/2")))
test_class$Hip_prop <- make.names(factor(test_class$Hip_prop, levels = c("0", "Qh/2")))
# rownames(training_class) <- index
# rownames(test_class) <- rownames(data_classif)[rownames(data_classif) %in% index == F ]
training_class_c <- as_tibble(scale(training_class[,-(1:9)], center = T, scale = F))
test_class_c <- as_tibble(scale(test_class[,-(1:9)], center = T, scale = F))
training_class_c_small <- training_class_c[, kept_features]
It is important for classification models to keep (roughly) the same proportion of the two classes in the training and test sets; otherwise we would need to make ‘corrections’ for class imbalance.
prop_all <- sum(data_classif$Hip_prop == "Qh/2")/nrow(data_classif)
prop_training <- sum(training_class$Hip_prop == "Qh.2")/nrow(training_class)
prop_test <- sum(test_class$Hip_prop == "Qh.2")/nrow(test_class)
df_prop <- data.frame(Original = prop_all, Training = prop_training, Test = prop_test )
rownames(df_prop) <- "Proportion of the class Qh/2"
#pander(df_prop)
formattable(df_prop, list(~percent))
| Original | Training | Test | |
|---|---|---|---|
| Proportion of the class Qh/2 | 36.78% | 35.97% | 40.00% |
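As an aside, a stratified split would enforce nearly identical class proportions by construction; a minimal sketch with caret::createDataPartition() (not used above, where simple random sampling already gave comparable proportions):
set.seed(99)
y_strat <- droplevels(factor(data_classif$Hip_prop))                      # the two retained classes
idx_strat <- caret::createDataPartition(y_strat, p = 0.8, list = FALSE)   # stratified 80-20 split
prop.table(table(y_strat[idx_strat]))    # training proportions
prop.table(table(y_strat[-idx_strat]))   # test proportions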
Concerning the classification metric, accuracy is a reasonable comparison metric. But we will see that accuracy is not sufficient to discriminate between the models (it is 100% most of the time) because discriminating these two classes is too easy. The ROC curve and its AUC are an alternative, but they are not very straightforward, especially if we aim at doing multi-class classification. (Hand 2009) emphasized the fact that “it is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers. This means that using the AUC is equivalent to using different metrics to evaluate different classification rules. It is equivalent to saying that, using one classifier, misclassifying a class 1 point is p times as serious as misclassifying a class 0 point, but, using another classifier, misclassifying a class 1 point is P times as serious, where p ≠ P.”
It can thus cause ambiguities when comparing classifiers.
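As a reminder of what the AUC actually measures, here is a toy computation (made-up scores, not our models): the AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative.
score_pos <- c(0.9, 0.8, 0.7, 0.4)   # hypothetical scores of class-1 observations
score_neg <- c(0.6, 0.3, 0.2)        # hypothetical scores of class-0 observations
mean(outer(score_pos, score_neg, ">") + 0.5 * outer(score_pos, score_neg, "=="))   # ~0.92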
Another metric is often preferred: the log loss. Its extension to the multi-class case is straightforward: \[-N^{-1}\sum_{i=1}^N\sum_{j=1}^M y_{ij}\cdot \log p_{ij}\] For binary classification, i.e. where \(M=2\), we simply retrieve \[-N^{-1}\sum_{i=1}^N\Big[y_i\cdot\log p_i+(1-y_i)\cdot\log (1-p_i)\Big]\] One can recognize the similarity with the maximization of the log-likelihood of a binomial distribution. A relationship with the cross-entropy or the Kullback-Leibler divergence can also be shown. It is now the classification loss function that is the most appreciated in Machine Learning competitions such as Kaggle.
'logLossBinary' = function(actual, predicted, eps = 1e-15) {
predicted = pmin(pmax(predicted, eps), 1-eps)
-(sum(actual * log(predicted) + (1 - actual) * log(1-predicted))) / length(actual)
}
# More compact way, but same outcome
'logLossBinary' <- function(actual,pred){
-1*mean(log(pred[model.matrix(~ actual + 0) - pred > 0]))
}
cat("Predict with probability 0.5 the true class :" , logLossBinary(1, .5))Predict with probability 0.5 the true class : 0.6931472
cat("Predict with probability 0.05 the true class :" , logLossBinary(1, .05))Predict with probability 0.05 the true class : 2.995732
cat("Predict with probability 0.95 the true class :" , logLossBinary(1, .95))Predict with probability 0.95 the true class : 0.05129329
Minimizing the log loss generally goes hand in hand with maximizing the accuracy, but there is a subtle difference: the log loss heavily penalises classifiers that are confident about an incorrect classification.
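A small illustration, reusing the logLossBinary() helper defined above: accuracy counts both predictions below as one error each, while the log loss punishes the confident mistake far more heavily.
logLossBinary(1, 0.45)   # wrong (below a 0.5 cutoff) but hesitant : ~0.80
logLossBinary(1, 0.01)   # wrong and very confident               : ~4.61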
Tuning the models with respect to one metric and then computing the test-set performance with another can make a difference. Hence, we will use the log loss both for cross-validation and for testing, for all the models we compare.
Regarding the preliminary feature selection, we can keep the same features as for the regression: the methodology we adopted to select the variables is relevant in this configuration too.
Here, we must apply the preliminary feature selection for the model. We used the basic function glm() and the MASS package.
## We will use the seed 99 for all models !!!!!
n_folds <- 10
tc <- trainControl("repeatedcv", number = n_folds, repeats = 1, classProbs = TRUE, summaryFunction = mnLogLoss, savePredictions = T, verboseIter = T)
set.seed(99)
fit_Small <- train(y = training_class$Hip_prop,
x = training_class_c_small,
method = "glm" ,
family = binomial ,
trace = 0,
metric = "LogLoss",
trControl = tc)
cv_logl_glm <- min(fit_Small$results$logLoss)
pred_log_smallc_p <- predict(fit_Small, test_class_c, type = "prob")
pred_log_smallc <- predict(fit_Small, test_class_c)
logl_log_smallc <- logLossBinary(test_class$Hip_prop, pred_log_smallc_p)
set.seed(99)
fit_all <- train(y = training_class$Hip_prop,
x = training_class_c[, kept_featuresBig],
method = "glm" ,
family = binomial ,
trace = 0,
metric = "logloss",
trControl = tc)
cv_logl_glmAll <- min(fit_all$results$logLoss)
pred_log_allc_p <- predict(fit_all, test_class_c, type = "prob")
pred_log_allc <- predict(fit_all, test_class_c)
logl_log_allc <- logLossBinary(test_class$Hip_prop, pred_log_allc_p)
# by doing the stepwise feature selection
set.seed(99)
fit_SmallStep <- train(y = training_class$Hip_prop,
x = training_class_c_small,
method = "glmStepAIC" ,
family = binomial ,
trace = 0,
metric = "logloss",
trControl = tc)
cv_logl_glmSmallsw <- min(fit_SmallStep$results$logLoss)
pred_log_small2c <- predict(fit_SmallStep, test_class_c)
pred_log_small2c_p <- predict(fit_SmallStep, test_class_c, type = "prob")
logl_log_small2c <- logLossBinary(test_class$Hip_prop, pred_log_small2c_p)
tc <- trainControl("repeatedcv", number = n_folds, repeats = 1, classProbs = TRUE, summaryFunction = mnLogLoss, verboseIter = T)
kept_features2 <- featureSelect(60) ## Larger subset of selected features.
set.seed(99)
t <- proc.time()
fit_bigStep <- train(y = training_class$Hip_prop, # VERY SLOW !!
x = as.matrix(training_class_c[, kept_features2]),
method = "glmStepAIC" ,
family = binomial ,
direction = "both",
trace = 0,
metric = "logloss",
trControl = tc)
print( (proc.time() - t)[3] ) # 212 sec.
fit_bigStep$finalModel
cv_logl_glmAllsw <- min(fit_bigStep$results$logLoss)
pred_log_bigc <- predict(fit_bigStep, test_class_c)
pred_log_bigc_p <- predict(fit_bigStep, test_class_c, type = "prob")
logl_log_bigswc <- logLossBinary(test_class$Hip_prop, pred_log_bigc_p)
## Explore the influence of the number of folds on the results: to draw e.g. learning curves
# seq_folds <- 1:(floor(nrow(data_classif)/2)-1)
# rmse <- numeric(length = (nrow(data_classif)/2)-1)
# for(i in seq_folds){
# t <- proc.time()
# tc <- trainControl("repeatedcv", number = (i+1), repeats = 10)
# fitt <- train(y = training_class$Hip_prop,
# x = training_class_c_small,
# method = "glm" ,
# family = binomial ,
# metric = "logloss",
# trControl = tc)
#
# rmse[i] <- fitt$results$Accuracy
# cat((proc.time() - t)[3], "\n") ; print (i)
# }
# rmse# confusionMatrix(table(pred_log_smallc, test_class$Hip_prop))
# confusionMatrix(table(predict(fit_all, test_class_c), test_class$Hip_prop))
### Draw the results
df_logl <- data.frame('on selected features' = c(logl_log_smallc, length(coef(fit_Small$finalModel))-1),
'on more features' = c(logl_log_allc, length(coef(fit_all$finalModel))-1), 'Stepwise small' = c(logl_log_small2c, length(coef(fit_SmallStep$finalModel))-1), 'Stepwise big' = c(logl_log_bigswc, length(coef(fit_bigStep$finalModel))-1))
rownames(df_logl) <- c('logloss', 'size')
#pander(df_logl)
# formattable(df_logl, list(
# Stepwise.small = formatter("span",
# style = ~ style("font-weight" = ifelse(df_logl <= 1e-13 , "bold", NA)))))
#above_avg_bold <- formatter("span", style = x ~ style("font-weight" = ifelse(x < 1e-12, "bold", NA)))
formattable(df_logl)
| on.selected.features | on.more.features | Stepwise.small | Stepwise.big | |
|---|---|---|---|---|
| logloss | 7.085589e-12 | 2.167976e-11 | 1.034945e-10 | 8.034843e-15 |
| size | 1.600000e+01 | 8.400000e+01 | 2.000000e+00 | 1.000000e+00 |
Indeed, for logistic regression, it is important to apply feature selection (so that \(p<n\)). Keeping 18 or 46 features and then doing a stepwise logistic regression leads to very sparse models (one or two selected variables), and the selected variables differ! If we check the accuracy criterion, we see that it is indeed \(100\%\) for all models. Regarding the log loss, the table above shows that the stepwise models are the best.
Actually, it is hardly necessary to consider more complex models, as the performance is (nearly) perfect already…
We now use the packages stepPlr (tuning parameters lambda and cp) or glmnet (tuning parameters alpha and lambda), as explained in section x.x.
n_folds <- 10
# We tune the two hyperparameters together over a 2-dimensional grid.
eGrid <- expand.grid(.cp = (0:10) * 0.1,
.lambda = (0:10) * 0.1)
tc_plr <- trainControl("repeatedcv", n_folds, repeats = 1, classProbs = T, savePredictions = T, verboseIter = T, summaryFunction = mnLogLoss) # mnLogLoss so that the logLoss metric is available in the results
set.seed(99)
fit_plr <- train( y = training_class$Hip_prop ,
x = training_class[, -(1:9)],
method = "plr",
trControl = tc_plr,
metric = "logloss",
tuneGrid = eGrid)
## ==============================================================================
eGrid_glm <- expand.grid(.alpha = (0:10) * 0.1,
.lambda = (0:10) * 0.1)
set.seed(99)
glmnet_class <- train(y = as.factor(training_class$Hip_prop) ,
x = as.matrix(training_class[, -(1:9)]),
method = "glmnet",
trControl = tc_plr,
metric = "ROC",
# metric = "logloss",
tuneGrid = eGrid_glm)
cv_logl_glmnetAll <- min(glmnet_class$results$logLoss)
# pred_glmnet_all_p <- predict(glmnet_class, test_class, type = "prob")
# pred_glmnet_all <- predict(glmnet_class, test_class)
# logl_glmnet_all <- logLossBinary(test_class$Hip_prop, pred_glmnet_all_p)
pred_glmnet_allc_p <- predict(glmnet_class, test_class[, -(1:9)], type = "prob")
pred_glmnet_allc <- predict(glmnet_class, test_class[, -(1:9)])
logl_glmnet_allc <- logLossBinary(test_class$Hip_prop, pred_glmnet_allc_p)
set.seed(99)
glmnet_class_small <- train(y = as.factor(training_class$Hip_prop) ,
x = as.matrix(training_class[, kept_features]),
method = "glmnet",
trControl = tc_plr,
#metric = "ROC",
# metric = "logloss",
family = "binomial",
tuneGrid = eGrid_glm)
cv_logl_glmnetSmall <- min(glmnet_class_small$results$logLoss)
# pred_glmnet_all_p_small <- predict(glmnet_class_small, test_class, type = "prob")
# pred_glmnet_all_small <- predict(glmnet_class_small, test_class)
# logl_glmnet_all_small <- logLossBinary(test_class$Hip_prop, pred_glmnet_all_p_small)
pred_glmnet_all_p_smallc <- predict(glmnet_class_small, test_class[, kept_features], type = "prob")
pred_glmnet_all_smallc <- predict(glmnet_class_small, test_class[, kept_features])
logl_glmnet_all_smallc <- logLossBinary(test_class$Hip_prop, pred_glmnet_all_p_smallc)
## ==============================================================================
eGrid_pen <- expand.grid(.lambda1 = (0:10) * 0.1,
.lambda2 = (0:10) * 0.1)
set.seed(99)
penal_class <- train(y = training_class$Hip_prop ,
x = training_class_c,
# Hip_prop ~ .,
# data = cbind.data.frame(Hip_prop = training$Hip_prop, training_c_small),
method = "penalized",
metric = "logloss",
trControl = tc_plr,
tuneGrid = eGrid_pen)
glmnet_accur <- train(y = training_class$Hip_prop ,
x = training_class[, -(1:9)],
method = "glmnet_h2o",
trControl = tc_plr,
metric = "logloss",
tuneGrid = eGrid_glm)
g1 <- ggplot(glmnet_class) + labs(x = 'mixing percentage (alpha)', linetype = 'Regu. lambda') + theme_piss() + guides(fill=guide_legend(title="lambda"), shape=guide_legend(title="lambda"), color=guide_legend(title="lambda"))
g2 <- ggplot(glmnet_class_small) + labs(x = 'mixing percentage (alpha)', fill = 'Regul. lambda') + theme_piss() + guides(fill=guide_legend(title="lambda"), shape=guide_legend(title="lambda"), color=guide_legend(title="lambda"))
grid.arrange(g1, g2, nrow = 1)
We now fit PLS-DA with the original pls library (through caret).
Note: are class probabilities available for PLS-DA or OPLS-DA? (course p. 49)
# Explore manually Values for controlling the training process
rctrl1 <- trainControl("cv", n_folds, classProbs = T, summaryFunction = mnLogLoss, savePredictions = T, verboseIter = T)
set.seed(99)
pls_cv_small_class <- train(y = training_class$Hip_prop ,
x = training_class_c_small,
method = "pls",
metric = "logloss",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:15)))
cv_logl_plsSmall <- min(pls_cv_small_class$results$logLoss)
set.seed(99)
pls_cv_all_class <- train(y = training_class$Hip_prop ,
x = training_class_c,
method = "pls",
metric = "logloss",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:100)))
cv_logl_plsAll <- min(pls_cv_all_class$results$logLoss)
ggplot(pls_cv_small_class)
# pred_pls_small_p_small <- predict(pls_cv_small_class, test_class, type = "prob")
# pred_pls_small_small <- predict(pls_cv_small_class, test_class)
# logl_pls_small_small <- logLossBinary(test_class$Hip_prop, pred_pls_small_p_small)
pred_pls_small_p_smallc <- predict(pls_cv_small_class, test_class_c, type = "prob")
pred_pls_small_smallc <- predict(pls_cv_small_class, test_class_c)
logl_pls_small_smallc <- logLossBinary(test_class$Hip_prop, pred_pls_small_p_smallc)
pred_pls_small_p_allc <- predict(pls_cv_all_class, test_class_c, type = "prob")
pred_pls_small_allc <- predict(pls_cv_all_class, test_class_c)
logl_pls_small_allc <- logLossBinary(test_class$Hip_prop, pred_pls_small_p_allc)
See the “S-plot” to detect biomarkers with OPLS(-DA).
Orthogonal [Partial] Least Squares (OPLS) is a variant of PLS which uses orthogonal signal correction to maximize the explained covariance between the predictors X and the variable to predict Y on the first latent variable, while components >1 capture variance in X which is orthogonal (or unrelated) to Y.
“Diagnostics such as the Q2Y metrics and permutation testing are of high importance to avoid overfitting and assess the statistical significance of the model. The Variable Importance in Projection (VIP), which reflects both the loading weights for each component and the variability of the response explained by this component (Pinto, Trygg, and Gottfries 2012; Mehmood et al. 2012), can be used for feature selection (J. Trygg and Wold 2002; Pinto, Trygg, and Gottfries 2012).”
As there does not seem to be an already implemented method for OPLS in the caret package, we decided to rely on the implementation by M. Martin in MBXUCL.
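As an optional cross-check only (and assuming the Bioconductor package ropls is installed), OPLS-DA is also available outside caret; the results reported below still rely on the MBXUCL implementation.
library(ropls)   # Bioconductor package, used here only as a sanity check
oplsda_check <- opls(as.matrix(training_class_c), as.factor(training_class$Hip_prop),
                     predI = 1, orthoI = NA)   # 1 predictive component, orthogonal components chosen automatically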
Here, we will use the RMSE as cross-validation criterion. Since we transformed the two-class target into a numeric score ("Qh.2" = 1 and "X0" = 0), it is possible to treat the problem as a regression on this score and compute an RMSE.
Note: the response variable must be centered (see the code below).
oplsc <- OPLSDA(as.matrix(training_class_c), as.factor(training_class$Hip_prop), no = 3, impT = T)
oplsreg <- OPLSDA(as.matrix(training_c), training$Hip_CONC, no = 3)
test_cVar <- test_c[, colnames(test_c) %in% X_NoVariance ==F]
#
# OPLSDA_pred(oplsreg, as.matrix(test_cVar))
#
# OPLSDA_pred(oplsc, as.matrix(test_class_c))
#
# For OPLS it is IMPORTANT TO CENTER THE RESPONSE
classY.CV <- ifelse(training_class$Hip_prop == "Qh.2", 1, 0)
# classY.CV is kept uncentered (its mean is used later); the centered version classY is created below
test_classYAll <- ifelse(test_class$Hip_prop == "Qh.2", 1, 0)
test_classY <- scale(test_classYAll, center=T, scale = F)
############# CHANGE FOR CLASSIF
'kfoldCV' <- function(X ,Y, k, argrm = 0, seed = 99){
#browser()
n=dim(X)[1] ; nbseg = k ; sn=ceiling(n/k) ; set.seed(seed)
seg_i <- sample(rep(1:k, length.out = n)) ; yfit.cv = numeric()
for (i in 1:nbseg){
val_i <- which(seg_i == i)
Y.train = Y[-val_i] ; X.train =X[-val_i,] ; X.valid=X[val_i,]
coefcv =fctreg(X.train,Y.train,argrm)
yfit.cv[val_i] =X.valid%*%coefcv
names(yfit.cv)[val_i] <- val_i
}
RMSE.cv=sqrt((1/n)*sum((Y-yfit.cv)^2))
return(list(yfit.cv=yfit.cv,RMSE.cv=RMSE.cv))
}
fctreg <- function(X,Y,argrm) {OPLSDA(X,Y, no=argrm-1)$b}
classY <- ifelse(training_class$Hip_prop == "Qh.2", 1, 0)
# IMPORTANT TO CENTER THE RESPONSE
classY <- scale(classY, center=T, scale = F)
# classY is the Transformed target to handle RMSE
RMSECV_oplsda <- numeric()
yfitCV_oplsda <- list() ; NCmaxda <- 25 # tune it
for(i in 2:NCmaxda){
#cat(i, "\n")
rescv <- kfoldCV(as.matrix(training_class_c), as.numeric(classY), k = 10, argrm = i)
RMSECV_oplsda[i] <- rescv$RMSE.cv
yfitCV_oplsda[[i]] <- rescv$yfit.cv
}
cv_rmse_opls_small <- min(RMSECV_oplsda, na.rm = T)
ncoptda <- which.min(RMSECV_oplsda)
df_oplsda <- data.frame(indi = 2:NCmaxda, RMSEcv = RMSECV_oplsda[-1])
gg <- ggplot(df_oplsda, aes(x = indi, y = RMSEcv)) + geom_point() + geom_line() + geom_vline(aes(xintercept = ncoptda, col = "optimal"), linetype = 2) + theme_piss(size_c = 12) + labs(title = "RMSE cross-validation vs complexity", x = "Number of components", y = "RMSE-cv") + geom_vline(aes(xintercept = 15, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c("optimal" = "red", "advised" = "green"))
ggplotly(gg)
roplsda_opt <- OPLSDA(as.matrix(training_class_c), classY, no = ncoptda-1)
#pred_ropls_small <- OPLSDA_pred(ropls_opt_small, as.matrix(test[,kept_features]))
pred_roplsda_smallc <- OPLSDA_pred(roplsda_opt, as.matrix(test_class_c))
pred_df_oplsda <- data.frame("Qh.2" = pred_roplsda_smallc + mean(test_classYAll), "X0" = 1 - (pred_roplsda_smallc + mean(test_classYAll)))
pred_df_oplsda$Qh.2 <- ifelse(pred_df_oplsda$Qh.2 > 1, 1, pred_df_oplsda$Qh.2)
pred_df_oplsda$X0 <- ifelse(pred_df_oplsda$X0 > 1, 1, pred_df_oplsda$X0)
pred_df_oplsda$Qh.2 <- ifelse(pred_df_oplsda$Qh.2 < 0, 0, pred_df_oplsda$Qh.2)
pred_df_oplsda$X0 <- ifelse(pred_df_oplsda$X0 < 0, 0, pred_df_oplsda$X0)
logl_oplsda <- logLossBinary(test_class$Hip_prop, pred_df_oplsda)
We thus kept x components and made predictions with them.
It is interesting to note that the RMSE estimated by cross-validation seems much higher when it is not computed with caret… and is perhaps closer to the ‘true’ RMSE (?)
As the model is not implemented in the caret infrastructure and the R implementation comes from UCL, we think it would be very interesting to implement it in caret to profit from the flexibility and the tuning facilities that this package offers. Moreover, it allows us to understand more deeply the ‘black box’ which sometimes lies behind the caret methodology.
## Custom caret model specification for OPLS-DA (based on the MBXUCL implementation)
opls_MBX <- list(label = "OPLS-DA",
library = "MBXUCL",
type = "Classification",
#type = c("Regression", "Classification"),
## Tune over both parameters at the same time
parameters = data.frame(parameter = 'ncomp',
class = "numeric",
label = '#Components'),
grid = function(x, y, len = NULL, search = "grid"){
if(search == "grid") {
out <- data.frame(ncomp = seq(1, min(ncol(x) - 1, len), by = 1))
} else {
out <- data.frame(ncomp = unique(sample(1:ncol(x), replace = TRUE)))
}
out
},
loop = function(grid) {
grid <- grid[order(grid$ncomp, decreasing = TRUE),, drop = FALSE]
loop <- grid[1, ,drop = FALSE]
submodels <- list(grid[-1,,drop = FALSE])
list(loop = loop, submodels = submodels)
},
#loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
#dat <- if(is.data.frame(x)) x else as.data.frame(x)
OPLSDA(x, y, no = param$ncomp, ...)$b
},
predict = function(modelFit, newdata, submodels = NULL) {
y.pred <- newdata %*% modelFit
names(y.pred) <- rownames(newdata)
y.pred
},
prob = NULL,
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
#levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),]
)
rctrl1 <- trainControl(method = "cv", number = n_folds)
set.seed(100)
opls1 <- train(y = training$Hip_CONC,
x = training_c_small,
method = opls_MBX,
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:4)))
!! Produce the “S-plots”: characteristic of (O)PLS.
Support Vector Machines (SVM) are well-known classification models, widely used in omics data analysis; see e.g. (Kim et al. 2017) for a recent application to microarrays. We will explore three families of SVM: a linear kernel, a Gaussian (radial) kernel and a polynomial kernel. There is only one hyperparameter to tune for the linear and the radial kernels (we fix the scale parameter), but three for the polynomial kernel. For each model, we take the whole feature space, because this is one of the key strengths of SVM: it can handle a very high-dimensional feature space.
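To double-check which hyperparameters caret actually exposes for each kernel, one can inspect the model information (a quick, optional check):
getModelInfo("svmLinear2", regex = FALSE)[[1]]$parameters   # cost
getModelInfo("svmRadial",  regex = FALSE)[[1]]$parameters   # sigma, C
getModelInfo("svmPoly",    regex = FALSE)[[1]]$parameters   # degree, scale, C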
n_folds <- 10
ctrl <- trainControl("repeatedcv", n_folds, repeats = 1, # do 5 repititions of cv
classProbs = T, summaryFunction = mnLogLoss, savePredictions = T, verboseIter = T )
#Train and Tune the SVM
set.seed(99)
svm_linear <- train( y = as.factor(training_class$Hip_prop) ,
x = training_class_c,
method = "svmLinear2",
metric = "logloss",
tuneLength = 10, # 10 values of the cost function
trControl=ctrl)
predict(svm_linear, test_class)
cv_logl_svmLin <- min(svm_linear$results$logLoss)
set.seed(99)
svm_radial <- train( y = training_class$Hip_prop ,
x = training_class_c,
method = "svmRadial", # Radial kernel
tuneLength = 9, # 9 values of the cost parameter C; sigma is held constant
metric = "logloss",
trControl=ctrl) # use ctrl (with savePredictions) so that svm_radial$pred is available later
predict(svm_radial, test_class_c)
cv_logl_svmRad <- min(svm_radial$results$logLoss)
set.seed(99)
# Use the expand.grid to specify the search space
grid <- expand.grid(scale = c(.01, .015, 0.2),
C = c(0.75, 0.9, 1, 1.1, 1.25),
degree = 0:4)
#Train and Tune the SVM
svm_polyn <- train( y = training_class$Hip_prop ,
x = training_class_c,
method = "svmPoly",
# tuneGrid = grid,
metric = "logloss",
tuneLength = 6,
trControl=ctrl) # c(.01, .75, 1)
predict(svm_polyn, test_class_c)
cv_logl_svmPol <- min(svm_polyn$results$logLoss)
pred_svmRad <- predict(svm_radial, test_class_c)
pred_svmLin <- predict(svm_linear, test_class_c)
pred_svmPoly <- predict(svm_polyn, test_class_c)
pred_svmLinp <- predict(svm_linear, test_class_c, type = "prob")
pred_svmRadp <- predict(svm_radial, test_class_c, type = "prob")
pred_svmPolyp <- predict(svm_polyn, test_class_c, type = "prob")
logloss_svmLin <- logLossBinary(test_class$Hip_prop, predict(svm_linear, test_class_c, type = "prob")
)
logloss_svmRad <- logLossBinary(test_class$Hip_prop, predict(svm_radial, test_class_c, type = "prob")
)
logloss_svmPol <- logLossBinary(test_class$Hip_prop, predict(svm_polyn, test_class_c, type = "prob")
)
c(logloss_svmLin, logloss_svmRad, logloss_svmPol)
df_svm <- data.frame('Linear' = c(logloss_svmLin, cv_logl_svmLin, svm_linear$bestTune$cost),
'Radial' = c(logloss_svmRad, cv_logl_svmRad, svm_radial$bestTune$C), 'Polynomial' = c(logloss_svmPol, cv_logl_svmPol, svm_polyn$bestTune$C) )
rownames(df_svm) <- c('logloss', "logloss CV", 'Tuned value of C')
pander(df_svm)
| Linear | Radial | Polynomial | |
|---|---|---|---|
| logloss | 0.01667 | 0.2817 | 0.632 |
| logloss CV | 5.794 | 2.911 | 1.391 |
| Tuned value of C | 1 | 0.25 | 1 |
We tried many different values for the hyperparameters, but none of them provided good results for the SVM with polynomial kernel:
ggplot(svm_linear) + theme_piss()
ggplot(svm_radial) + theme_piss()
ggplot(svm_polyn) + theme_piss()
## Plot the probabilities
PredMat <- data.frame(mod1 = pred_log_smallc_p[,"Qh.2"], mod2 = pred_log_allc_p[,"Qh.2"], mod3 = pred_log_small2c_p[,"Qh.2"] , mod4 = pred_log_bigc_p[,"Qh.2"], m5 = pred_glmnet_allc_p[,'Qh.2'] , m6 = pred_glmnet_all_p_smallc[, 'Qh.2'], m7 = pred_pls_small_p_smallc[,'Qh.2'], m8 = pred_pls_small_p_allc[,'Qh.2'], m88 = pred_df_oplsda$Qh.2, m9 = pred_svmLinp[,"Qh.2"], m10 = pred_svmRadp[,"Qh.2"] , m11 = pred_svmPolyp[,"Qh.2"])
colnames(PredMat) <- c(colnames(df_logl), "Penalized log.reg. all", "Penalized log.reg. small", "PLS-DA small", "PLS-DA all", "OPLS-DA all", "SVM linear", "SVM radial", "SVM polyn.") # names must match those used to index cutoffopt below
cutoff <- seq(min(PredMat)-0.1,max(PredMat)+0.1,length = nrow(test_class))
nco <- length(cutoff)
truepositive <- matrix(nrow=nco,ncol= ncol(PredMat))
falsepositive <- matrix(nrow=nco,ncol= ncol(PredMat))
npos <- sum(test_class$Hip_prop == "Qh.2")
nneg <- sum(test_class$Hip_prop == "X0")
listpos <- (test_class$Hip_prop == "Qh.2")
listneg <- (test_class$Hip_prop == "X0")
nm <- ncol(PredMat)
PredMat <- as.matrix(PredMat)
for(i in 1:nco){
for(j in 1:nm) {
# Rate of true positives in the set of positive
truepositive[i,j] <- sum(PredMat[listpos,j]>=cutoff[i])/npos
# Rate of false positives in the set of negatives
falsepositive[i,j] <- sum(PredMat[listneg,j]>=cutoff[i])/nneg
}
}
par(mfrow=c(1,1))
plot(falsepositive[,1],truepositive[,1],type="l",main="ROC Curves : Logistic Models", xlab="False Positive Rate",ylab="True Positive Rate")
for(j in 2:nm){
lines(falsepositive[,j],truepositive[,j],type="l",col=j,lty=j)}
legend(0.4,0.75,dimnames(PredMat)[[2]],col=1:nm,lty=1)
# Find, for each model, the optimal cutoff that minimizes the distance to the point (0,1)
cutoffopt=PredMat[1,]
for(i in 1:nm){
distances2=falsepositive[,i]^2+(1-truepositive[,i])^2
cutoffopt[i]=cutoff[which(distances2==min(distances2))][1]
}
pander(cutoffopt)
nm <- nrow(listr$RMSE) # number of models
## Line plot to the Predictions
par(mfrow=c(1,1))
PredMat=listr$PredMat
plot(1:(nrow(PredMat)+30),c(PredMat[,1],rep(NA,30)),lty=1,type="l",main="Predictions",xlab="Observation",ylab="Prediction",ylim=c(min(PredMat),max(PredMat)))
for(i in 2:nm) { lines(1:(nrow(PredMat)+30),c(PredMat[,i],rep(NA,30)),lty=i,col=i)}
legend(nrow(Xdata)+5,1,dimnames(PredMat)[[2]],col=1:4,lty=1,cex=0.5)
abline(v=92)
We will now produce a new function to plot the predictions, similar to the one used for regression. The printed observations are those that are misclassified.
## Print the value of an observation if it it misclassified
## We assume the value of the cutoff is the same for the validation and test set... (we take it as the test set)
'PredVsObsPlotsClassif' <- function(df_cv, obs.test = test_class$Hip_prop, predp.testQh.2, cutoff, threshold = .5) {
#browser()
df_cv$true_obs <- ifelse(df_cv$obs == 'Qh.2', 1, 0)
# df_cv$error <- logLoss(true_obs, df_cv[, 'Qh.2'])
# df_cv$error <- sapply(true_obs, function(x) logLoss(x, df_cv[, 'Qh.2']))
#ProblematicObs.cv <- df_cv$error > threshold
ProblematicObs.cv <- ifelse(df_cv$pred != df_cv$obs, T, F)
names.cv <- row.names(df_cv)
names.cv[with(df_cv, !ProblematicObs.cv)] <- ""
# cutoff
g1 <- ggplot(df_cv) + geom_point(aes(x = true_obs, y = Qh.2), size = .95 ) + theme_piss() + labs(title = "Cross-Validation", subtitle = "Predicted values vs observed values", x = 'Observed (0= X0; 1=Qh/2)', y = 'pred. proba.') + geom_text(aes(true_obs, Qh.2, label = names.cv), color = "red") + geom_hline(yintercept = cutoff, col = "blue", linetype = 2) + geom_text(aes(.5, cutoff + .05, label = 'cutoff'), col = 'blue', data = df_cv, size = 1.5)
df <- data.frame(obs.test = obs.test, predp.testQh.2 = predp.testQh.2)
df$obs.test0 <- ifelse(df$obs.test == 'Qh.2', 1, 0)
# We classify wrt to the value of the cutoff.
df$pred.Qh.2 <- ifelse(df$predp.testQh.2 > cutoff, 'Qh.2', 'X0')
ProblematicObsTest <- ifelse(df$pred.Qh.2 != df$obs.test, T, F)
g2 <- ggplot(df, aes(x = obs.test0, y = predp.testQh.2))
names <- row.names(df)
names[with(df, !ProblematicObsTest)] <- ""
g2 <- g2 + geom_point(size = .95) + geom_hline(yintercept = cutoff, col = "blue", linetype = 2) + geom_text(aes(.5, cutoff + .05, label = 'cutoff'), col = 'blue', data = df_cv) + theme_piss() + labs(title = "Test set", subtitle = "Predicted values vs observed values", x = 'Observed (0= X0; 1=Qh/2)', y = 'pred. proba.') + geom_text(aes(obs.test0, predp.testQh.2, label = names), color = "red", size = 1.5)
#grid <- grid.arrange(g1, g2, nrow = 1)
#grid.arrange(g1, g2, nrow = 1)
return(list(gcv = g1, gtest = g2 ))
}
We will also compute some other insightful plots, such as the scatterplot matrix of predictions and the line plot of predictions. For these two graphs we only represent the results on the test set, because this is the set used for the final performance assessment. Looking at the same plots for the cross-validation would nevertheless be useful, e.g. to check that the validation procedure has been correctly constructed (not shown here).
df_cv_logRegSmall <- fit_Small$pred[order(fit_Small$pred$rowIndex),]
plots_logReg_small <- PredVsObsPlotsClassif(df_cv_logRegSmall, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_log_smallc_p$Qh.2, threshold = .05, cutoff = cutoffopt[1])
#grid.arrange(plots_logReg_small$gcv, plots_logReg_small$gtest, nrow = 1)
df_cv_logRegall <- fit_all$pred[order(fit_all$pred$rowIndex),]
plots_logReg_all <- PredVsObsPlotsClassif(df_cv_logRegall, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_log_allc_p$Qh.2, threshold = .05, cutoff = cutoffopt[2])
#grid.arrange(plots_logReg_all$gcv, plots_logReg_all$gtest, nrow = 1)
df_cv_logRegSmallstep <- fit_SmallStep$pred[order(fit_SmallStep$pred$rowIndex),]
plots_logReg_smallstep <- PredVsObsPlotsClassif(df_cv_logRegSmallstep, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_log_small2c_p$Qh.2, threshold = .05, cutoff = cutoffopt[3])
#grid.arrange(plots_logReg_smallstep$gcv, plots_logReg_smallstep$gtest, nrow = 1)
grid.arrange(plots_logReg_small$gcv, plots_logReg_small$gtest, top = textGrob(expression( italic("Small")), gp = gpar(fontsize = 19, font = 2, col = "blue")), nrow = 1)
grid.arrange(plots_logReg_all$gcv, plots_logReg_all$gtest, top = textGrob(expression( italic("Big")), gp = gpar(fontsize = 19, font = 2, col = "blue")), nrow = 1)
grid.arrange(plots_logReg_smallstep$gcv, plots_logReg_smallstep$gtest, top = textGrob(expression( italic("Stepwise")), gp = gpar(fontsize = 19, font = 2, col = "blue")), nrow = 1)
# df_cv_logRegbigstep <- fit_bigStep$pred[order(fit_bigStep$pred$rowIndex),]
#
# plots_logReg_bigstep <- PredVsObsPlotsClassif(df_cv_logRegbigstep, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_log_bigc_p$Qh.2, threshold = .05, cutoff = cutoffopt[4])
#
# grid.arrange(plots_logReg_bigstep$gcv, plots_logReg_bigstep$gtest, nrow = 1)
df_cv_glmnetsAll <- glmnet_class$pred[glmnet_class$pred$alpha == glmnet_class$bestTune$alpha[1] & glmnet_class$pred$lambda == glmnet_class$bestTune$lambda[1] ,][1:nrow(training_class),]
plots_glmnetAllclass <- PredVsObsPlotsClassif(df_cv_glmnetsAll, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_glmnet_allc_p$Qh.2, threshold = .05, cutoff = cutoffopt["Penalized log.reg. all"])
#grid.arrange(plots_logReg_smallstep$gcv, plots_logReg_smallstep$gtest, nrow = 1)
df_cv_glmnetsmall <- glmnet_class_small$pred[glmnet_class_small$pred$alpha == glmnet_class_small$bestTune$alpha[1] & glmnet_class_small$pred$lambda == glmnet_class_small$bestTune$lambda[1] ,][1:nrow(training_class),]
plots_glmnetSmallclass <- PredVsObsPlotsClassif(df_cv_glmnetsmall, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_glmnet_all_p_smallc$Qh.2, threshold = .05, cutoff = cutoffopt['Penalized log.reg. small'])
grid.arrange(plots_glmnetAllclass$gcv, plots_glmnetAllclass$gtest,
plots_glmnetSmallclass$gcv, plots_glmnetSmallclass$gtest,
ncol = 2)
df_cv_plsdaAll <- pls_cv_all_class$pred[pls_cv_all_class$pred$ncomp == pls_cv_all_class$bestTune$ncomp[1] ,][1:nrow(training_class),]
plots_plsdaAll <- PredVsObsPlotsClassif(df_cv_plsdaAll, predp.testQh.2 = pred_pls_small_p_allc$Qh.2, threshold = .05, cutoff = cutoffopt["PLS-DA all"])
df_cv_plsdaSmall <- pls_cv_small_class$pred[pls_cv_small_class$pred$ncomp == pls_cv_small_class$bestTune$ncomp[1] ,][1:nrow(training_class),]
plots_plsdaSmall <- PredVsObsPlotsClassif(df_cv_plsdaSmall, predp.testQh.2 = pred_pls_small_p_smallc$Qh.2, threshold = .05, cutoff = cutoffopt["PLS-DA small"])
grid.arrange(plots_plsdaAll$gcv, plots_plsdaAll$gtest, top = textGrob(expression( italic("PLS-DA All")), gp = gpar(fontsize = 19, font = 2, col = "blue")), nrow = 1)
grid.arrange(plots_plsdaSmall$gcv, plots_plsdaSmall$gtest, top = textGrob(expression( italic("PLS-DA Small")), gp = gpar(fontsize = 19, font = 2, col = "blue")), nrow = 1)
df_cv_oplsda <- data.frame(obs = training_class$Hip_prop, Qh.2 = yfitCV_oplsda[[ncoptda]] + mean(classY.CV) )
df_cv_oplsda$pred <- ifelse(df_cv_oplsda$Qh.2 > unname(cutoffopt["OPLS-DA all"]), "Qh.2", "X0")
g_oplsda <- PredVsObsPlotsClassif(df_cv_oplsda, predp.testQh.2 = pred_df_oplsda$Qh.2, threshold = .05, cutoff = unname(cutoffopt["OPLS-DA all"]))
grid.arrange(g_oplsda$gcv, g_oplsda$gtest, nrow = 1, top = textGrob(expression( italic("OPLS-DA all")), gp = gpar(fontsize = 19, font = 2, col = "blue")))
library("GGally")
colnames(PredMat) <- c("LogReg.small", "LogReg.big", "SW.small","SW.big", "Pen.all", "Pen.small", 'PLSDA.small', "PLSDA.all", "OPLSDA.all", "SVM.linear", "SVM.radial", "SVM.polyn")
ggpairs(as.data.frame(PredMat) ) + theme_piss()
This kind of matrix could be a good starting point if we wanted to build an ensemble model, i.e. to pick some individual models that are poorly correlated (e.g. PLS-DA or Penalized Logistic Regression with the SVMs).
We decide to represent only the “best” model in each family. Dashed vertical black lines mark the observations that actually belong to the target class (“Qh.2”).
PredMatRed <- cbind.data.frame(Observation = 1:nrow(test_class), PredMat[, c("LogReg.small", "SW.big", "Pen.all", "PLSDA.all", "OPLSDA.all", "SVM.linear" )])
indexQh.2 <- PredMatRed[test_class$Hip_prop == "Qh.2",]$Observation
ggplot(PredMatRed, aes(x = Observation)) + geom_line(aes(x = Observation, y = LogReg.small, col = "Logistic"), linetype = 2) + geom_line(aes(y = SW.big, col = "logistic.SW")) + geom_line(aes(y = Pen.all, col = "Penalized")) + geom_line(aes(y = PLSDA.all, col = "PLS-DA")) + geom_line(aes(y = OPLSDA.all, col = "OPLS-DA")) + geom_line(aes(y = SVM.linear, col = "SVM.lin")) + scale_color_manual(name = "Model", values = c("Logistic" = brewer.pal(1, "Set1")[1], "logistic.SW" = brewer.pal(1, "Set1")[2], "Penalized" = brewer.pal(1, "Set1")[3], "PLS-DA" = brewer.pal(5, "Set1")[4], "OPLS-DA" = brewer.pal(5, "Set1")[5], "SVM.lin" = brewer.pal(8, "Set1")[6])) + geom_vline(xintercept = indexQh.2, colour ="black", linetype = 2, size = 0.2) + theme_piss()
We note the “perfect” path of all the displayed models, as their predicted probabilities are always > 0.5 at the “Qh.2” observations (indeed, 0.5 is an arbitrary cutoff).
Relying on the plotROC package and its excellent vignette, we were able to compute insightful ROC curves, grouped by family of classifiers.
## For individual plots ! But this is too voluminous...
#gc_prob_ex <- extractProb(list(pls_cv_small_class), training_class_c %>% select(-Class))
plotdfRoc <- data.frame(D = ifelse(pred_pls_small_allc == "Qh.2", 1, 0), D.str = pred_pls_small_allc, M1 = pred_pls_small_p_allc$Qh.2)
basicplot <- ggplot(plotdfRoc, aes(d = D, m = M1)) + geom_roc(n.cuts = 15)
basicplot + style_roc() + annotate("text", x = .85, y = .15,
label = paste("AUC =", round(calc_auc(basicplot)$AUC, 2)), col = "#33666C") + theme_piss()
"ggplotROCModel" <- function(classObs, predClass, ConfRegions = T, interac = F, interactiveHtml = F) {
plotdfRoc <- data.frame(D = ifelse(classObs == "Qh.2", 1, 0), D.str = classObs, M1 = predClass)
basicplot <- ggplot(plotdfRoc, aes(d = D, m = M1)) + geom_roc(n.cuts = 15)
basicplot <- basicplot + style_roc() + annotate("text", x = .85, y = .15, label = paste("AUC =", round(calc_auc(basicplot)$AUC, 2)), col = "#33666C") + theme_piss() + theme(axis.text = element_text(colour = "#33666C"))
if (ConfRegions == T) basicplot <- basicplot + geom_rocci(linetype = 1)
if (interac == T) {
basicplot <- plot_interactive_roc(basicplot)
if (interactiveHtml == T) cat( export_interactive_roc(basicplot, prefix = "a") )
else basicplot
}
else basicplot
}
## PLS
# Testing
#ggplotROCModel(pred_pls_small_allc, pred_pls_small_p_allc$Qh.2, interac = T, interactiveHtml = T)
# CV
## ROC on cross-validation predictions and on the test set for a few models
ggplotROCModel(df_cv_plsdaAll$obs, df_cv_plsdaAll$Qh.2)
ggplotROCModel(test_class$Hip_prop, pred_log_small2c_p$Qh.2)
ggplotROCModel(df_cv_logRegSmallstep$obs, df_cv_logRegSmallstep$Qh.2)
ggplotROCModel(test_class$Hip_prop, pred_svmRadp$Qh.2)
ggplotROCModel(svm_radial$pred$obs, svm_radial$pred$Qh.2)
confusionMatrix(table(pred_log_smallc, test_class$Hip_prop))
confusionMatrix(table(predict(fit_all, test_class_c), test_class$Hip_prop))
roc_logreg <- roc(test_class$Hip_prop, predict(fit_Small, test_class, type = "prob")[,1])
roc_logreg
From (Clopper and Pearson 1934) and based on result 2.4 of (Pepe 2004), we can compute exact (individual) confidence intervals for the TPF (True Positive Fraction) and the FPF (False Positive Fraction) separately, each at an adjusted individual level chosen so that the rectangular confidence region obtained by taking their cross-product has level \(100\cdot(1-\alpha)\%\) for the pair. This is achieved by the geom_rocci() layer of the package.
We decided to display only 2 confidence regions along each curve, at the \(40^{th}\) and \(60^{th}\) quantiles of the model’s probabilities, in order not to overload the graphs. We also changed the significance level \(\alpha\) to \(\boldsymbol{25\%}\)(!) in order to decrease the size of the regions and to see something, because otherwise the regions would be too large… This already points out the fact that our models are highly variable and difficult to compare on the basis of a test set that has only 35 observations (…)
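For intuition, here is a toy illustration with hypothetical counts (20 true positives out of 25 positives in a test set): an exact Clopper-Pearson interval for the TPF at the same 75% level used above.
binom.test(x = 20, n = 25, conf.level = 1 - 0.25)$conf.int   # exact binomial CI for the TPF (made-up counts)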
## Logistic Regression
plotdfRocClass <- data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),
D.str = pred_log_small2c,
M1 = pred_log_small2c_p$Qh.2,
Type = rep("Small", nrow(test_class)) )
plotdfRocClass <- rbind.data.frame(plotdfRocClass,
data.frame( D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),
D.str = pred_log_allc,
M1 = pred_log_allc_p$Qh.2,
Type = rep("Big", nrow(test_class))
))
plotdfRocClass <- rbind.data.frame(plotdfRocClass,
data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),
D.str = pred_log_bigc,
M1 = pred_log_bigc_p $Qh.2,
Type = rep("Big SW", nrow(test_class))
))
basicplot <- ggplot(plotdfRocClass, aes(d = D, m = M1, color = Type)) + geom_roc(n.cuts = 15) + style_roc()
ind.auc <- round(calc_auc(basicplot)$AUC, 2)
basicplot <- basicplot + geom_rocci(linetype = 1, aes(fill = Type), ci.at = quantile(plotdfRocClass$M1, c( .4,.6)), sig.level = .25) + scale_color_manual(values = brewer.pal(1, "Set1")[1:3]) +
annotate("text", x = .93, y = .19, label = paste("AUC =", ind.auc[1]), col = brewer.pal(1, "Set1")[1]) + theme_piss() +
annotate("text", x = .93, y = .11, label = paste("AUC =", ind.auc[2]), col = brewer.pal(1, "Set1")[2]) +
annotate("text", x = .93, y = .03, label = paste("AUC =", ind.auc[3]), col = brewer.pal(1, "Set1")[3]) + labs(title = "Logistic Regression")+ theme(legend.position = c(.7, .17), axis.text = element_text(colour = "black"), axis.title = element_text(colour = "turquoise4"))
## Penalized
plotdfRocClassGlmnet <- data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),
D.str = pred_glmnet_all_smallc ,
M1 = pred_glmnet_all_p_smallc$Qh.2,
Type = rep("Small", nrow(test_class)) )
plotdfRocClassGlmnet <- rbind.data.frame(plotdfRocClassGlmnet,
data.frame( D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),
D.str = pred_glmnet_allc,
M1 = pred_glmnet_allc_p$Qh.2,
Type = rep("All", nrow(test_class))
))
basicplotGlmnet <- ggplot(plotdfRocClassGlmnet, aes(d = D, m = M1, color = Type)) + geom_roc(n.cuts = 15) + style_roc()
ind.auc <- round(calc_auc(basicplotGlmnet)$AUC, 2)
basicplotGlmnet <- basicplotGlmnet + geom_rocci(linetype = 1, aes(fill = Type),sig.level = .25, ci.at = quantile(plotdfRocClassGlmnet$M1, c( .4, .6))) + scale_color_manual(values = brewer.pal(1, "Set1")[1:3]) +
annotate("text", x = .93, y = .19, label = paste("AUC =", ind.auc[1]), col = brewer.pal(1, "Set1")[1]) +
annotate("text", x = .93, y = .11, label = paste("AUC =", ind.auc[2]), col = brewer.pal(1, "Set1")[2]) + theme_piss()+ labs(title = "Penalized Log. Regression")+ theme(legend.position = c(.7, .17), axis.text = element_text(colour = "black"), axis.title = element_text(colour = "turquoise4"))
## PLS-DA
plotdfRocClassPLS <- data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),#ifelse(pred_svmLin == "Qh.2", 1, 0),
D.str = pred_pls_small_smallc ,
M1 = pred_pls_small_p_smallc $Qh.2,
Type = rep("Small", nrow(test_class)) )
plotdfRocClassPLS <- rbind.data.frame(plotdfRocClassPLS,
data.frame( D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0), #ifelse(pred_svmRad == "Qh.2", 1, 0),
D.str = pred_pls_small_allc ,
M1 = pred_pls_small_p_allc $Qh.2,
Type = rep("All", nrow(test_class))
))
plotdfRocClassPLS <- rbind.data.frame(plotdfRocClassPLS,
data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0), #ifelse(pred_svmPoly == "Qh.2", 1, 0),
D.str = ifelse(pred_df_oplsda$Qh.2 > cutoffopt['OPLS-DA all'], "Qh.2", "X0"),
M1 = pred_df_oplsda$Qh.2,
Type = rep("OPLS-DA", nrow(test_class))
))
basicplotPLS <- ggplot(plotdfRocClassPLS, aes(d = D, m = M1, color = Type)) + geom_roc(n.cuts = 15) + style_roc()
ind.aucPLS <- round(calc_auc(basicplotPLS)$AUC, 2)
basicplotPLS <- basicplotPLS + geom_rocci(linetype = 1, aes(fill = Type),sig.level = .25, ci.at = quantile(plotdfRocClassPLS$M1, c( .4, .6))) + scale_color_manual(values = brewer.pal(1, "Set1")[1:3]) +
annotate("text", x = .93, y = .19, label = paste("AUC =", ind.aucPLS[1]), col = brewer.pal(1, "Set1")[1]) +
annotate("text", x = .93, y = .11, label = paste("AUC =", ind.aucPLS[2]), col = brewer.pal(1, "Set1")[2]) + annotate("text", x = .93, y = .03, label = paste("AUC =", ind.auc[3]), col = brewer.pal(1, "Set1")[3]) + labs(title = "Partial Least Squares") + theme_piss() + theme(legend.position = c(.68, .17), axis.text = element_text(colour = "black"), axis.title = element_text(colour = "turquoise4"))
## SVM
plotdfRocClassSVM <- data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),#ifelse(pred_svmLin == "Qh.2", 1, 0),
D.str = pred_svmLin,
M1 = pred_svmLinp$Qh.2,
Type = rep("Linear", nrow(test_class)) )
plotdfRocClassSVM <- rbind.data.frame(plotdfRocClassSVM,
data.frame( D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0), #ifelse(pred_svmRad == "Qh.2", 1, 0),
D.str = pred_svmRad,
M1 = pred_svmRadp$Qh.2,
Type = rep("Radial", nrow(test_class))
))
plotdfRocClassSVM <- rbind.data.frame(plotdfRocClassSVM,
data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0), #ifelse(pred_svmPoly == "Qh.2", 1, 0),
D.str = pred_svmPoly,
M1 = pred_svmPolyp$Qh.2,
Type = rep("Polyn.", nrow(test_class))
))
basicplotSVM <- ggplot(plotdfRocClassSVM, aes(d = D, m = M1, color = Type)) + geom_roc(n.cuts = 15) + style_roc()
ind.auc <- round(calc_auc(basicplotSVM)$AUC, 2)
basicplotSVM <- basicplotSVM + geom_rocci(linetype = 1, aes(fill = Type),sig.level = .25, ci.at = quantile(plotdfRocClassSVM$M1, c( .4, .6))) + scale_color_manual(values = brewer.pal(1, "Set1")[1:3]) +
annotate("text", x = .93, y = .19, label = paste("AUC =", ind.auc[1]), col = brewer.pal(1, "Set1")[1]) +
annotate("text", x = .93, y = .11, label = paste("AUC =", ind.auc[2]), col = brewer.pal(1, "Set1")[2]) +
annotate("text", x = .93, y = .03, label = paste("AUC =", ind.auc[3]), col = brewer.pal(1, "Set1")[3]) + labs(title = "SVM") + theme_piss() + theme(legend.position = c(.7, .17), axis.text = element_text(colour = "black"), axis.title = element_text(colour = "turquoise4"))#, axis.text = element_text(col = "#33666C"), axis.title = element_text(col = "blue"))
grid.arrange(basicplot, basicplotGlmnet,
basicplotPLS, basicplotSVM,
ncol = 2)
df_classFin <- data.frame(logloss = , AUC = , accuracy = , size = )
rownames(df_classFin) <-
pander(df_classFin)
Note that we could have done multi-class classification to improve the comparison (i.e. keeping all the samples).
The problem with the binary classification models is that predicting these two hippurate classes is too easy. The discrimination between the classes “0” and “Qh/2” is too strong, as we have seen on the densities in Section 4.1, where the separation is quasi-perfect.
To do that, we come back to the complete sample, i.e. the one that was used for the regression models.
Here we optimize with respect to the multi-class log loss criterion.
'LogLossMulti' <- function(actual, predicted, eps=1e-15) {
# actual: observed classes (factor/character) or a one-hot indicator matrix; predicted: matrix of class probabilities
if (is.factor(actual) | is.character(actual)) actual <- model.matrix(~ factor(actual) + 0)
predicted[predicted < eps] <- eps;
predicted[predicted > 1 - eps] <- 1 - eps;
-1/nrow(actual)*(sum(actual*log(predicted)))
}
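## Quick sanity check of LogLossMulti() on made-up values (3 observations, 2 classes),
## independent of the real data: -(log(.9) + log(.8) + log(.6)) / 3, i.e. about 0.28.
toy_actual <- matrix(c(1, 0,  0, 1,  1, 0), ncol = 2, byrow = TRUE)              # one-hot observed classes
toy_pred   <- matrix(c(0.9, 0.1,  0.2, 0.8,  0.6, 0.4), ncol = 2, byrow = TRUE)  # predicted probabilities
LogLossMulti(toy_actual, toy_pred)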
## Function to compute measures such as precision, recall and F-score. For multi-class problems they are computed for each class.
'check.model.accuracy' <- function(predicted.class, actual.class){
result.tbl <- as.data.frame(table(predicted.class,actual.class ) )
result.tbl$Var1 <- as.character(result.tbl$predicted.class)
result.tbl$Var2 <- as.character(result.tbl$actual.class)
colnames(result.tbl)[1:2] <- c("Pred","Actual")
cntr <- 0
for (pred.class in unique(result.tbl$Pred) ){
cntr <- cntr+ 1
tp <- sum(result.tbl[result.tbl$Pred == pred.class & result.tbl$Act==pred.class, "Freq"])
tp.fp <- sum(result.tbl[result.tbl$Pred == pred.class , "Freq" ])
tp.fn <- sum(result.tbl[result.tbl$Act == pred.class , "Freq" ])
precision <- tp/tp.fp
recall <- tp/tp.fn
F.score <- 2*precision*recall/(precision+recall)
if (cntr == 1 ) F.score.row <- cbind(pred.class, precision,recall,F.score)
if (cntr > 1 ) F.score.row <- rbind(F.score.row,cbind(pred.class,precision,recall,F.score))
}
F.score.row <- as.data.frame(F.score.row)
return(F.score.row)
}
# Transform the factor's names into valid names : needed for future models
# We keep the same decomposition train-test as for binary classif. Faster in tibble objects
training_class <- as_tibble(data_classif[rownames(data_classif) %in% index , ])
test_class <- as_tibble(data_classif[rownames(data_classif) %in% index == F, ])
trainClassYmulti <- make.names(factor(training_class$Hip_prop, levels = c("0", "Qh/2", "Qh/4", "Qh")))
testClassYmulti <- make.names(factor(test_class$Hip_prop, levels = c("0", "Qh/2")))
We will also make use of the MLmetrics package to compute some metrics.
Penalized Multinomial Regression
model = multinom(Hip_prop ~ ., data = cbind.data.frame(Hip_prop = training$Hip_prop, training_c_small))
summary(model)
set.seed(134)
multiLogPen <- train(y = training$Hip_prop ,
x = training_c_small,
method = "multinom",
metric = "mnLogLoss",
tuneLength = 15,
trControl = ctrl)
pred_multiLog <- predict(multiLogPen, test_c[, kept_features])#, type = "prob")
pred_multiLogp <- predict(multiLogPen, test_c[, kept_features], type = "prob")
pred_logl_multi <- numeric(length = nrow(pred_multiLogp))
for(i in 1:nrow(pred_multiLogp)){
pred_logl_multi[i] <- pred_multiLogp[i, test$Hip_prop[i]]
}
#pred_multiLogp[ ,test$Hip_prop == colnames(pred_multiLogp)]
MultiLogLoss(pred_multiLogp, test$Hip_prop)
# LogLossMulti(test$Hip_prop, pred_multiLogp)
MultiLogLoss(pred_logl_multi, test$Hip_prop)
LogLossMulti(test$Hip_prop, pred_logl_multi)
Accuracy(test$Hip_prop, predict(multiLogPen, test_c))
check.model.accuracy(test$Hip_prop, predict(multiLogPen, test_c))
set.seed(134)
multi_pls <- train(y = training$Hip_prop ,
x = training_c,
method = "pls",
metric = "mnLogLoss",
tuneLength = 15,
trControl = ctrl)
predict(multi_pls, test_c)#, type = "prob")
MultiLogLoss(predict(multi_pls, test_c, type = "prob"), test$Hip_prop)
LogLossMulti(test$Hip_prop, predict(multi_pls, test_c, type = "prob"))
Accuracy(test$Hip_prop, predict(multi_pls, test_c))
check.model.accuracy(test$Hip_prop, predict(multi_pls, test_c))
SPLS
ctrl <- trainControl(method = "repeatedcv", repeats = 1, classProbs = T, summaryFunction = mnLogLoss, savePredictions = T, verboseIter = T )
grid_spls2 <- expand.grid(K = seq(1, 100, by = 6),
eta = c(.1, .3, .6),
kappa = .5)
set.seed(134)
time <- proc.time()
multi_Spls <- train(y = training$Hip_prop ,
x = training_c,
method = "spls",
metric = "mnLogLoss",
#tuneLength = 5,
tuneGrid = grid_spls2,
trControl = ctrl)
(proc.time() - time)[3] # 1952 sec
pred_multi_spls <- predict(multi_Spls, test_c)#, type = "prob")
pred_multi_splsp <- predict(multi_Spls, test_c, type = "prob")
MultiLogLoss(pred_multi_splsp, test$Hip_prop)
LogLossMulti(test$Hip_prop, pred_multi_splsp)
Accuracy(test$Hip_prop, pred_multi_spls)
check.model.accuracy(test$Hip_prop, pred_multi_spls)
ctrl <- trainControl(method = "repeatedcv", repeats = 1, # one repetition of 10-fold CV; very slow
classProbs = T, summaryFunction = mnLogLoss, verboseIter = T, savePredictions = T )
set.seed(134)
time <- proc.time()
svm_multiClass_rad <- train(y = make.names(training$Hip_prop),
x = training_c,
method = "svmLinear2", # Radial kernel
tuneLength = 10, # 9 values of the cost function
metric = "mnLogLoss",
trControl = ctrl)
(proc.time() - time)[3] # 52 sec
pred_multi_svm <- predict(svm_multiClass_rad, test_c)
pred_multisvmp <- predict(svm_multiClass_rad, test_c, type = "prob")
LogLossMulti(test$Hip_prop, pred_multisvmp)
MultiLogLoss(pred_multisvmp, test$Hip_prop)
Accuracy(test$Hip_prop, pred_multi_svm)
check.model.accuracy(test$Hip_prop, pred_multi_svm)
test[test$Hip_prop != pred_multi_svm,]
As always, observation number x is problematic!
You can sort them in ascending order of RMSE by clicking on the arrows next to the column names.
final_rmse <- data.frame("True.RMSE" = c(rmse_reglinc, rmse_reglinswc, rmse_reglmnet1c, rmse_reglmnet2c, rmse_reglmnet3c, rmse_pcrAllc, rmse_pcrAll2c, rmse_pcrSmallc, rmse_icrSmallc, rmse_plsAllc, rmse_plsSmallc, rmse_oplsAll, rmse_oplsSmall, rmse_splsSmallc,rmse_splsAllc, rmse_rf_allc, rmse_xgb_regNocenterFin, try_rmse_ensemble, try_rmse_stackrf2), "RMSE-cv" = c(cv_rmse_regLin, cv_rmse_regLinsw, cv_rmse_reglmnet1, cv_rmse_reglmnet2, cv_rmse_reglmnet3, cv_rmse_pcrAll1, cv_rmse_pcrAll2,cv_rmse_pcrSmall, cv_rmse_icr_small, cv_rmse_pls_all, cv_rmse_pls_small, cv_rmse_opls_all,cv_rmse_opls_Small, cv_rmse_spls_small, cv_rmse_spls_all, cv_rmse_rf_all_nocenter, cv_rmse_xgbNocenterFin, try_cv_rmse_ensemble, try_cv_rmse_stackrf2) )
rownames(final_rmse) <- c('Linear regression', 'Linear regression SW', 'Elastic Net full', 'Elastic Net small (-)', 'Elastic Net small (+)', 'PCR all (+)', 'PCR all (-)', 'PCR small', 'ICR small', 'PLS all', 'PLS small', "OPLS all", "OPLS Small", 'SPLS small','SPLS all', 'Random Forest all', 'XGBoost all', "Ensemble Linear", "Ensemble random forest")
#formattable(final_rmse, list(area(col = True.RMSE) ~ color_tile("lightblue", "lightpink")))
as.datatable(formattable(final_rmse, list(True.RMSE = color_tile("lightblue", "lightpink"))), style = 'bootstrap', options = list(dom = 'tp'), width = "100%")
‘(+)’ means a broader hyperparameter space search, while ‘small’ means only the 18 selected features are used.
models <- rownames(final_rmse)
models.RMSE <- final_rmse$True.RMSE
rmse.df <- data.frame(models = models,
RMSE = models.RMSE)
#Creating bar plot to compare the all models
ggplot(rmse.df, aes(x = models, y = RMSE)) +
geom_col(aes(fill = RMSE)) +
geom_text(aes(x = models, y = RMSE/2, label = RMSE), color = "white") +
coord_flip() +
scale_fill_gradient(low = "cyan", high = "darkred", limits = c(min(rmse.df$RMSE), max(rmse.df$RMSE))) +
labs(title = "True RMSE Values Of Each Model") + theme_piss(theme = theme_grey()) + scale_x_discrete(limits = rev(models))You can class them in ascending order of LogLoss by clicking on the arrows right to the column names.
final_logl <- data.frame("True.LogLoss" = c( logl_log_smallc, logl_log_allc, logl_log_small2c, logl_glmnet_allc, logl_glmnet_all_smallc, logl_pls_small_smallc, logl_pls_small_allc, 0, logloss_svmLin, logloss_svmRad, logloss_svmPol), "LogLoss.cv" = c(cv_logl_glmnetSmall, cv_logl_glmAll, cv_logl_glmSmallsw, cv_logl_glmnetAll, cv_logl_glmnetSmall, cv_logl_plsSmall, cv_logl_plsAll, "XX ?", cv_logl_svmLin, cv_logl_svmRad, cv_logl_svmPol))
rownames(final_logl) <- c( 'Logistic regression small', 'Logistic regression `all`', "Logistic regression SW small", 'Elastic Net all', 'Elastic Net small', 'PLS-DA small', 'PLS-DA all', 'OPLS-DA small', "SVM linear", "SVM radial", "SVM polynomial")
as.datatable(formattable(final_logl, list(True.LogLoss = color_tile("lightblue", "lightpink"))), style = 'bootstrap', options = list(dom = 'tp'))
‘(+)’ means a broader hyperparameter space search, while ‘small’ means only the 18 selected features are used.
modelsC <- rownames(final_logl)
models.logl <- final_logl$True.LogLoss
logl.df <- data.frame(models = modelsC,
logloss = models.logl)
#Creating bar plot to compare the all models
ggplot(logl.df, aes(x = models, y = logloss)) +
geom_col(mapping = aes(fill = logloss)) +
geom_text(mapping = aes(x = models, y = logloss/2, label = logloss), color = "white") +
coord_flip() +
scale_fill_gradient(low = "cyan", high = "darkred", limits = c(min(logl.df$logloss), max(logl.df$logloss))) +
labs(title = "True LogLoss Values Of Each Model") + theme_piss(theme = theme_grey()) + scale_x_discrete(limits = rev(modelsC))You can class them in ascending order of LogLoss by clicking on the arrows right to the column names.
final_loglMult <- data.frame("True.LogLoss" = c( logl_log_smallc, logl_log_allc, logl_log_small2c, logl_glmnet_allc, logl_glmnet_all_smallc, logl_pls_small_smallc, logl_pls_small_allc, "XX?", logloss_svmLin, logloss_svmRad, logloss_svmPol) )
rownames(final_loglMult) <- c( 'Logistic regression small', 'Logistic regression `all`', "Logistic regression SW small", 'Elastic Net all', 'Elastic Net small', 'PLS small', 'PLS all', 'OPLS small', "SVM linear", "SVM radial", "SVM polynomial")
as.datatable(formattable(final_loglMult, list(True.LogLoss = color_tile("lightblue", "lightpink"))), style = 'bootstrap', options = list(dom = 'tp'))
‘(+)’ means a broader hyperparameter space search, while ‘small’ means only the 18 selected features are used.
Some points remain to be improved. Redoing the analysis with 5-fold cross-validation might reduce the extent to which the cross-validation estimates underestimate the error. Graphs comparing the test error with the cross-validation error (e.g. learning curves) would also be informative.
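A minimal sketch of such a comparison, reusing the final_rmse table built above (data.frame() renames the “RMSE-cv” column to RMSE.cv):
rmse_compare <- data.frame(model = rownames(final_rmse),
test_rmse = final_rmse$True.RMSE,
cv_rmse = as.numeric(final_rmse$RMSE.cv)) # coercion in case the CV column was stored as character
ggplot(rmse_compare, aes(x = cv_rmse, y = test_rmse, label = model)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") + # points above the dashed line: CV underestimates the test error
geom_point(color = "darkred") +
geom_text(size = 2.5, vjust = -0.8) +
labs(title = "Test RMSE versus cross-validation RMSE", x = "Cross-validation RMSE", y = "Test RMSE")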
We have seen that ensemble methods indeed outperformed all the other methods in the regression scheme. However, their drawback is that they are hard to interpret, as they ‘aggregate’ different single models. Together with the high computational cost, they may not be ten times more valuable than other methods for such a ‘simple’ problem. OPLS, for example, is a much simpler alternative.
We only worked with the original data, but as we saw in the distribution (violin) plots in section 4.1, it could be valuable to redo the analysis on the log-transformed data. This could in particular reduce the ‘bad’ effect of outliers for linear models.
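As a minimal sketch (assuming the spectral intensities in dataX are non-negative), the log transformation could be applied before rebuilding the training and test sets:
dataX_log <- log1p(dataX) # log(1 + x) avoids problems with zero intensities
# the train/test split and the caret models above could then be rerun on dataX_log instead of dataX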
Moreover, another drawback of the more complex models (which here led to poorer performance) is their lack of interpretability.
Regarding biomarker detection, we did not think it useful to display the features selected by every method we considered. To summarize across methods, we could instead take a final “vote” and display the ppm values that were selected most often:
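A minimal sketch of such a vote, assuming each method’s selection is stored as a character vector of ppm names (sel_enet, sel_spls and sel_rf below are hypothetical placeholders):
selections <- list(enet = sel_enet, spls = sel_spls, rf = sel_rf) # hypothetical per-method selections
votes <- sort(table(unlist(selections)), decreasing = TRUE) # count how many methods kept each ppm
head(votes, 18) # the ppm values selected most often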
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. doi:10.1023/A:1010933404324.
Chun, Hyonho, and Sündüz Keles. 2007. “Sparse Partial Least Squares Regression with an Application to Genome Scale Transcription Factor Analysis.” Department of Statistics, University of Wisconsin, Madison. http://pages.stat.wisc.edu/~keles/Papers/old_SPLS_Nov07.pdf.
Clopper, C. J., and E. S. Pearson. 1934. “The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial.” Biometrika 26 (4): 404–13. https://www.jstor.org/stable/2331986.
Hand, David J. 2009. “Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve.” Machine Learning 77 (1): 103–23. doi:10.1007/s10994-009-5119-5.
Kim, SungHwan, Jae-Hwan Jhong, JungJun Lee, and Ja-Yong Koo. 2017. “Meta-Analytic Support Vector Machine for Integrating Multiple Omics Data.” BioData Mining 10: 2. doi:10.1186/s13040-017-0126-8.
Pepe, Margaret Sullivan. 2004. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
Rousseau, Réjane. 2011. “Statistical Contribution to the Analysis of Metabonomic Data in 1H-NMR Spectroscopy.” PhD thesis, UCL - Université Catholique de Louvain. https://dial.uclouvain.be/pr/boreal/object/boreal:75532.
Varma, Sudhir, and Richard Simon. 2006. “Bias in Error Estimation When Using Cross-Validation for Model Selection.” BMC Bioinformatics 7 (February): 91. doi:10.1186/1471-2105-7-91.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20. http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2005.00503.x/full.
Comments
There are very large linear correlations between some input features and the target. We will take advantage of this in the feature selection subsection.
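A minimal sketch of such a correlation-based screening, assuming the numeric target is the hippurate concentration dataAll$Hip_CONC and the spectral features are in dataX:
cor_with_target <- apply(dataX, 2, function(ppm) cor(ppm, dataAll$Hip_CONC)) # linear correlation of each ppm with the target
head(sort(abs(cor_with_target), decreasing = TRUE), 18) # the 18 ppm most linearly related to the target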
The distributions seem very different across features, and some have very fat tails; this is easier to see on the violin plots of the centered features. Hence, one solution would be to take their logarithm to reduce the strong influence of the values located in the tails, which could be seen as outliers and would particularly harm linear models.
By arbitrarily choosing a ppm located in one of the “peaks” of the spectrum graph above, we see that the hippurate concentration strongly influences the horizontal location of the distribution of this ppm for each class. However, by sampling a random ppm, we see that there are not many differences across the classes of hippurate concentration. Hence, a single ppm can be very discriminative for predicting hippurate while others seem useless: that is the reason why we perform feature selection in Section 4.4. Actually, we will see that this particular ppm is the feature selected by most of the algorithms. Looking more closely at the left graph, we note that the densities overlap a little, and hence it is not possible to perfectly predict the quantities of hippurate in the training set. For the TEST SET the picture is the same; the separation even seems a bit stronger (also because there are fewer observations). Note that we could not have this information in real applications because we do not know the target.
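A minimal sketch of this kind of comparison; the ppm name below is a hypothetical placeholder for the discriminative peak, and any column of training_c can be substituted:
ppm_chosen <- "3.9679273" # hypothetical: replace by the peak ppm actually inspected above
train_df <- as.data.frame(training_c) # plain data.frame, whatever class training_c has
df_dens <- data.frame(intensity = train_df[[ppm_chosen]], Hip_prop = training$Hip_prop)
ggplot(df_dens, aes(x = intensity, fill = Hip_prop)) +
geom_density(alpha = 0.4) +
labs(title = paste("Densities of ppm", ppm_chosen, "by hippurate class"), x = "Centered intensity", y = "Density")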