Abstract
In this project, we analyze the Urine-Citrate-Hippurate database, with emphasis on regression and classification models since the spectrums have already been preprocessed. Aware of the huge variety of methods that exist in the R world for such analyses, of the significant loss of time caused by scattered information, and of the dispersion of results when adopting different model-building strategies, we decided to condense this project as much as possible and to provide the most homogeneous methodology, relying on the caret infrastructure. This should accommodate both user and reader convenience. The random forest ensembling of the individual models (PLS, lm, XGBoost, ...) vastly outperformed the others, with nearly zero error.
# data<-read.csv("Data/X_UCH.csv", sep=';',header=T)
load("Data/X_UCH.RData")
load("Data/D_UCH.RData")
dataD <- as_tibble(D_UCH); dataX <- as_tibble(X_UCH)
rm(D_UCH, X_UCH)
# Let's use tibble objects as they are more efficient than standard dataFrames.
rownames(dataD) <- substr( rownames(dataD), 1, nchar(rownames(dataD)[1])-3 )
sum(rownames(dataD) == rownames(dataX)) # 269 : OK
## Double the columns of hippurate concentration (see end of chapter 3 for explanation)
dataD[dataD$Medium == 'diluted urine',]$Hip_CONC <- 2 * dataD[dataD$Medium == 'diluted urine',]$Hip_CONC
dataAll <- as_tibble(cbind(dataD, dataX))
sum(is.na(dataAll)) # No NA !
Whereas the metabolome represents the collection of all metabolites in a biological cell, tissue, organ or organism, i.e. the end products of cellular processes, metabolomic data analysis is concerned with the statistical study of these measurements.
Section 7 will summarize the results of all the models considered and will compare their performances.
The Urine-Citrate-Hippurate database that we use here is well documented in (Rousseau 2011). These data follow an experimental plan carefully designed to study the ability of statistical methods to detect biomarkers, that is, the spectral descriptors (ppm) whose variability is linked to the experimentally controlled factors.
It is therefore always interesting to first look at the data that we will analyze.
You can scroll the x-axis to see all features.
DT::datatable(dataD, style = "bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))
Click on the up (down) arrow next to each variable's name to show the observations in descending order for this variable. Change page to see the other observations.
You can scroll the x-axis to see all features.
DT::datatable(dataX, style = "bootstrap", class="table-condensed", options = list(dom = 'tp',scrollX = TRUE))
Click on the up (down) arrow next to each variable's name to show the observations in descending order for this variable. Change page to see the other observations.
The objective of the analysis will then be to detect the biomarkers whose variability was experimentally controlled. Our analysis will (essentially) be built around the different experimental factors of hippurate, that is:
The aim will be to select the best method (algorithm), i.e. the one that predicts these quantities at best.
To achieve this, we will mainly make use of the library caret.
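As a hedged orientation for the reader, here is a minimal sketch of the caret workflow relied upon throughout this report; the predictor/response objects (training, training_c_small, test) are those built in later chunks, and "pls" is just an example method tag, not the only model considered.
library(caret)
## Minimal sketch of the caret workflow used throughout this report.
ctrl <- trainControl(method = "cv", number = 10, savePredictions = TRUE)
fit <- train(y = training$Hip_CONC,      # response (hippurate concentration)
             x = training_c_small,       # predictors (selected ppm intensities)
             method = "pls",             # example caret method tag
             metric = "RMSE",
             trControl = ctrl)
predict(fit, newdata = test)             # predictions on the held-out test set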
We decide to display 15 randomly chosen spectrums to inspect their behaviour. This is a compromise between visibility and having enough information to detect possible problems (e.g. outliers).
tdataX <- t(dataX)
set.seed(123)
spectr_samp10 <- sample(colnames(tdataX), 15)
df_samp10 <- tdataX[,spectr_samp10]
df_samp10 <- cbind.data.frame(df_samp10, spec = as.factor(1:nrow(df_samp10)))
df_samp10 <- melt(df_samp10, id.vars = "spec" )
df_samp10 <- cbind.data.frame(df_samp10, ppm = as.numeric(rownames(tdataX)))
dimnames(df_samp10)[[2]][2] <- "Spectrum"
gg_labs <- labs(title = "15 random Spectrums representation", y = "Spectrums intensities", caption = "By redoing several times with the other spectrums, we saw that the shape is always very similar." )
gg_spectre10 <- ggplot(df_samp10) + geom_line(aes(x = ppm, y = value, col = Spectrum), size = 0.45) + gg_labs + scale_color_hue(l = 20, c = 200) + theme_piss() + scale_x_reverse() +
annotate(geom = "text", label = "Hippurate",
x = 7.1,
y = .035, col = "#33666C" , size = 5, fontface = "bold") +
geom_segment(aes(x = 7, xend = 7.5, y = 0.033, yend = .027),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C") +
geom_segment(aes(x = 6.2, xend = 4.35, y = 0.034, yend = .0328),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C") +
annotate(geom = "text", label = "Citrate",
x = 1.5,
y = .038, col = "#33666C" , size = 5, fontface = "italic") +
geom_segment(aes(x = 1.7, xend = 2.3, y = 0.036, yend = .0325),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C")
#ggplotly(gg_spectre10) # Too slow
gg_spectre10
library("scales")
library('ggthemes')
library("animation")
library("gganimate")
library("tweenr")
dfSampgg <- cbind(df_samp10, .frame = 1:ncol(df_samp10))
ggspec <- ggplot(dfSampgg, aes(frame = .frame)) + geom_line(aes(x = ppm, y = value, col = Spectrum), size = 0.45) + gg_labs + scale_color_hue(l = 20, c = 200) + theme_piss() + scale_x_reverse() +
annotate(geom = "text", label = "Hippurate",
x = 7.1,
y = .035, col = "#33666C" , size = 5, fontface = "bold") +
geom_segment(aes(x = 7, xend = 7.5, y = 0.033, yend = .027),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C") +
geom_segment(aes(x = 6.2, xend = 4.35, y = 0.034, yend = .0328),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C") +
annotate(geom = "text", label = "Citrate",
x = 1.5,
y = .038, col = "#33666C" , size = 5, fontface = "italic") +
geom_segment(aes(x = 1.7, xend = 2.3, y = 0.036, yend = .0325),
arrow = arrow(length = unit(0.5, "cm")), col = "#33666C")
gganimate(ggspec)
We clearly see that the displayed spectrums have the same shape, corresponding to the hippurate (left) and citrate (right) peaks, which are located in distant ppm domains. Hippurate has three peaks, two of which lie in the region of high ppm with a low noise level, while it is the opposite for the citrate peaks.
We also note the region of ppm where no intensity is recorded: between \(\approx\) 6 and 4.6. As these features are constant (= 0), we will remove them for the rest of the analysis since they provide no information regarding our purpose.
sd_X <- apply(dataAll[10:ncol(dataAll)], 2, sd)
X_NoVariance <- names(sd_X[sd_X<1e-6])
# Remove no variance columns
dataAll <- dataAll[, colnames(dataAll) %in% X_NoVariance == F]
These zero-variance features were removed, representing \(\approx 10\%\) of the total feature space. This zone is actually due to the pretreatment correction, where the values have been set to 0 in the water zone that is present in the urine solutions and is not of interest for the analysis.
Concerning the meta-data, that is the experimental conditions, we have the following hippurate proportions in the dataset :
pander(table(dataD$Hip_prop))
| 0 | Qh | Qh/2 | Qh/4 |
|---|---|---|---|
| 110 | 47 | 64 | 48 |
showing that there are more observations with no hippurate.
We also have 14 levels for the combination Citrate-hippurate, with the following entries
#table(dataD$Ci_prop, dataD$Ci_CONC)
pander(table(dataAll$Hip_prop, dataAll$Ci_prop))
| | 0 | QC | QC/2 | QC/4 |
|---|---|---|---|---|
| 0 | 63 | 16 | 15 | 16 |
| Qh | 16 | 15 | 16 | 0 |
| Qh/2 | 16 | 16 | 16 | 16 |
| Qh/4 | 16 | 0 | 16 | 16 |
Regarding the PPM
To make the best algorithm selection possible with the data at hand, we must find the best suited methodology. To reduce bias and to avoid selection problems, we will sample 2 different sets of data: a training set and a test set.
If we do not divide the data into training and test sets like this, the error estimate on validation data will be biased (that is, in general smaller than the true error rate), since the validation set is used to select the final model. In microarray applications, (Varma and Simon 2006) also showed that using CV to compute an error estimate for a classifier that has itself been tuned using CV gives a significantly biased estimate of the true error. A nested cross-validation would be preferable.
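For concreteness, here is a hedged sketch of what such a nested cross-validation could look like with caret: an outer loop of held-out folds for error estimation, and the usual inner CV (inside train()) for hyperparameter tuning. It is only illustrative and is not run in this report; the training/training_c objects are those created in the chunks below, and glmnet is an arbitrary choice of model.
## Illustrative nested CV sketch (not run): outer folds estimate the error,
## the inner CV inside train() tunes the hyperparameters.
set.seed(12)
outer_folds <- createFolds(training$Hip_CONC, k = 5)   # held-out indices per fold
outer_rmse <- sapply(outer_folds, function(idx) {
  fit <- train(y = training$Hip_CONC[-idx],
               x = training_c[-idx, ],
               method = "glmnet",                       # arbitrary example model
               trControl = trainControl(method = "cv", number = 10))
  sqrt(mean((training$Hip_CONC[idx] - predict(fit, training_c[idx, ]))^2))
})
mean(outer_rmse)   # nearly unbiased estimate of the generalisation error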
The 80-20 decomposition gives 215 observations for training and 54 for testing. This strikes a good balance between "enough training data to build effective models" and "enough test data to make accurate predictions". We could also carry out a sensitivity analysis by inspecting how the results vary when changing this 80-20 proportion. Moreover, one could also consider a (kind of) bootstrap procedure and redo the analysis several times (thus with several random samplings of the test and training sets), giving more accurate estimates of the true errors. This is interesting for such small samples... but at a computational cost!
For the classification task(s), it is important to check whether the distribution of the binary (response) variable is similar in the two sets, otherwise it could have severe impacts on the estimation of the algorithms' performances. When the data are 'stratified', it is also recommended that the proportions of the different 'strata' be (approximately) the same in the sample and in both the training and test sets.
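One hedged way to enforce this stratification directly is caret's createDataPartition(), which samples within the levels of the supplied factor; the sketch below is an alternative to the plain random split used in the next chunk (whose proportions we then verify by hand), and the *_strat names are only illustrative.
## Sketch: stratified 80/20 split on the hippurate proportion factor.
## createDataPartition() samples within each level, so the class
## proportions are (approximately) preserved in both sets.
set.seed(1234)
in_train <- createDataPartition(factor(dataAll$Hip_prop), p = 0.8, list = FALSE)[, 1]
training_strat <- dataAll[in_train, ]
test_strat     <- dataAll[-in_train, ]
round(prop.table(table(training_strat$Hip_prop)), 3)
round(prop.table(table(test_strat$Hip_prop)), 3)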
rownames(dataAll) <- rownames(dataD)
set.seed(1234)
index <- sample(rownames(dataAll), round(0.8*nrow(dataX), 0))
training <- dataAll[rownames(dataAll) %in% index , ]
test <- dataAll[rownames(dataAll) %in% index == F, ]
rownames(training) <- index
rownames(test) <- rownames(dataAll)[rownames(dataAll) %in% index == F ]
# Control the proportions :
# table(training$Hip_prop)
# table(test$Ci_CONC)
#pander(table(training$Hip_prop, training$Ci_prop))
#pander(table(test$Hip_prop, test$Ci_prop))
Original dataset (%):
pander(round(table(dataAll$Hip_prop, dataAll$Ci_prop) / sum(table(dataAll$Hip_prop, dataAll$Ci_prop)) * 100, 3), justify = "center")
| | 0 | QC | QC/2 | QC/4 |
|---|---|---|---|---|
| 0 | 23.42 | 5.948 | 5.576 | 5.948 |
| Qh | 5.948 | 5.576 | 5.948 | 0 |
| Qh/2 | 5.948 | 5.948 | 5.948 | 5.948 |
| Qh/4 | 5.948 | 0 | 5.948 | 5.948 |
pct_train <- round(table(training$Hip_prop, training$Ci_prop) / sum(table(training$Hip_prop, training$Ci_prop)) * 100, 2)
#pander(pct_train)
pct_test <- round(table(test$Hip_prop, test$Ci_prop) / sum(table(test$Hip_prop, test$Ci_prop))*100, 2)
# pander(pct_test)
# pander(list(pct_train, pct_test))
Training set (left) and test set (right), both expressed in %:
t1 <- tableGrob(pct_train)
t2 <- tableGrob(pct_test)
grid.arrange(t1, t2, ncol = 2, left = "Training", right = "Test")
# pushViewport(viewport(layout = grid.layout(1, 1)))
# print(t1 + ggtitle("bar"),
# vp = viewport(layout.pos.row = 1, layout.pos.col = 1))
# print(t2 + labs(title = "point"),
# vp = viewport(layout.pos.row = 1, layout.pos.col = 2))
We see that the 'distribution' of the observations with respect to the experimental design Hippurate-Citrate is very similar across the two sets. This is a good point for our aim of building reliable models and reliable estimates of the errors made. Moreover, regarding the repartition of the days in the 3 sets:
## Retrieve the day numbers to check the proportions in the different sets
t1 <- table(substr(rownames(dataAll), 7, 7))
t2 <- table(substr(rownames(training), 7, 7))
t3 <- table(substr(rownames(test), 7, 7))
tab <- rbind(t1,t2,t3)
rownames(tab) <- c("Whole set", "Training set", "Test set")
colnames(tab) <- c("day 1", "day 2", "day 3", "day 5")
pander(tab)
| | day 1 | day 2 | day 3 | day 5 |
|---|---|---|---|---|
| Whole set | 65 | 68 | 68 | 68 |
| Training set | 53 | 55 | 54 | 53 |
| Test set | 12 | 13 | 14 | 15 |
The proportions of the days are very similar across the sets, so we can go on: this particular random indexing provided a reliable training-test separation.
Note that we also had to adjust the column of hippurate concentration: when the urine was diluted with 50% of water, we had to double the values in this column (see chunk 1).
Regarding the preprocessing, we will not a priori normalize the data: each of the following models has its own properties and will need its own preprocessing.
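As a hedged illustration of how this can be handled, caret lets each model request its own preprocessing through the preProcess argument of train(), so that centring/scaling parameters are re-estimated inside each cross-validation fold; the PLS method tag below is only an example and the fit is not used elsewhere.
## Sketch: model-specific preprocessing done inside train(), so the
## centring/scaling is re-estimated within each resample.
fit_scaled <- train(y = training$Hip_CONC,
                    x = training[ , -(1:9)],             # ppm intensities only
                    method = "pls",                       # example method tag
                    preProcess = c("center", "scale"),
                    trControl = trainControl(method = "cv", number = 10))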
First, regarding the linear correlations between the input variables (ppm) :
# corrX <- c()
# for (i in 0:(ncol(dataAll_c)-2)){
# corrX[i] <- cor(dataAll_c[,i+1], dataAll_c[,i+2])
# names(corrX)[i] <- paste0(substring(names(dataAll_c)[i+1], 1, 6),"&", substring(names(dataAll_c)[i+2],1,6))
# }
corrX <- cor(training[,-(1:9)])
dimnames(corrX)[[1]] <- dimnames(corrX)[[2]] <- c(rep("", 24), "0", rep("", 24), rep("", 30), "1", rep("", 29), rep("", 30), "2", rep("", 29),rep("", 31), "3", rep("", 30),rep("", 15), "4", rep("", 14), rep("", 29), "6", rep("", 29),rep("", 30), "7",rep("",31), rep("", 30), "8", rep("", 30), rep("", 30), "9", rep("", 29))
## Plot it
corrplot::corrplot(corrX, method="color", type="lower" )
# sum(is.na(corrX)) # Weird
highlyCorrelated <- findCorrelation(corrX, cutoff=0.99)
highlyCorrelated <- colnames(training_c[,highlyCorrelated])
#cor(training_c[,colnames(training_c[,highlyCorrelated])])
training_c <- training[,-(1:9)] # Just keep it uncentered for now. We will center this object a few chunks later
## Retrieve number of ppm in each class : 0,1,...,9.
tabPPM <- table(substr(colnames(training_c), 1, 1))
## Sample one in each class
set.seed(12345)
sampl_distByPPM <- sapply(tabPPM, sample, 1)
## Create the dataFrame for ggplo, with corresponding column (ppm) full names
df_ppmDist <- data.frame(
training_c[ ,substr(colnames(training_c), 1, 1) == "0"][, sampl_distByPPM['0']],
training_c[ ,substr(colnames(training_c), 1, 1) == "1"][, sampl_distByPPM['1']],
training_c[ ,substr(colnames(training_c), 1, 1) == "2"][, sampl_distByPPM['2']],
training_c[ ,substr(colnames(training_c), 1, 1) == "3"][, sampl_distByPPM['3']],
training_c[ ,substr(colnames(training_c), 1, 1) == "4"][, sampl_distByPPM['4']],
#training_c[ ,substr(colnames(training_c), 1, 1) == "5"][, sampl_distByPPM['5']],
training_c[ ,substr(colnames(training_c), 1, 1) == "6"][, sampl_distByPPM['6']],
training_c[ ,substr(colnames(training_c), 1, 1) == "7"][, sampl_distByPPM['7']],
training_c[ ,substr(colnames(training_c), 1, 1) == "8"][, sampl_distByPPM['8']],
training_c[ ,substr(colnames(training_c), 1, 1) == "9"][, sampl_distByPPM['9']] )
colnames(df_ppmDist) <- c(colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "0"])[ sampl_distByPPM['0']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "1"])[ sampl_distByPPM['1']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "2"])[ sampl_distByPPM['2']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "3"])[ sampl_distByPPM['3']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "4"])[ sampl_distByPPM['4']],
#colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "5"])[ sampl_distByPPM['5']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "6"])[ sampl_distByPPM['6']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "7"])[ sampl_distByPPM['7']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "8"])[ sampl_distByPPM['8']],
colnames(training_c[ ,substr(colnames(training_c), 1, 1) == "9"])[ sampl_distByPPM['9']])
## Melt it for ggplot : represent the classes
dfgg_ppmDist <- melt(df_ppmDist)
colnames(dfgg_ppmDist) <- c("ppm.value", "value")
dodge <- position_dodge(width = 0.4)
gpp1 <- ggplot(dfgg_ppmDist[1:(4*nrow(training)),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', size = .6, alpha=0.99,width = 0.2) + geom_violin(fill = "lightseagreen", alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.06, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss() #+ scale_fill_brewer(palette="Blues") #+ scale_color_brewer(palette = "Dark2")
gpp2 <- ggplot(dfgg_ppmDist[(4*nrow(training)+1):nrow(dfgg_ppmDist),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', alpha=0.99,width = 0.2, size = .6) + geom_violin(fill='lightseagreen', alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.06, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss()
grid.arrange(gpp1, gpp2, nrow = 2, top = textGrob(expression( italic("Violin plots of random ppm for each ´class´")), gp = gpar(fontsize = 19, font = 2, col = "#33666C")))dfgg_ppmDistc <- melt(as.data.frame(scale(df_ppmDist, center = T, scale = F)))
colnames(dfgg_ppmDistc) <- c("ppm.value", "value")
gpp1 <- ggplot(dfgg_ppmDistc[1:(4*nrow(training)),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', size = .6, alpha=0.99,width = 0.2) + geom_violin(fill = "lightseagreen", alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.06, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss() #+ scale_fill_brewer(palette="Blues") #+ scale_color_brewer(palette = "Dark2")
gpp2 <- ggplot(dfgg_ppmDistc[(4*nrow(training)+1):nrow(dfgg_ppmDistc),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', alpha=0.99,width = 0.2, size = .6) + geom_violin(fill='lightseagreen', alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.06, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss()
grid.arrange(gpp1, gpp2, nrow = 2, top = textGrob(expression( italic(paste("Logged Violin plots of ", underline("centered")," random ppm for each ´class´"))), gp = gpar(fontsize = 19, font = 2, col = "#33666C")))dfgg_ppmDistLog <- dfgg_ppmDist
dfgg_ppmDistLog$value <- log(dfgg_ppmDist$value)
gpplog1 <- ggplot(dfgg_ppmDistLog[1:(4*nrow(training)),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', alpha=0.99, size = .6, width = .3) + geom_violin(fill='lightseagreen', alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.04, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss()
gpplog2 <- ggplot(dfgg_ppmDistLog[(4*nrow(training)+1):nrow(dfgg_ppmDistLog),], aes(x = ppm.value, y = value)) + geom_jitter(color='red', alpha=0.99, size = .6, width = .3) + geom_violin(fill='lightseagreen', alpha=0.7, draw_quantiles = T, position = dodge, width = 1.8) + geom_boxplot(width=.04, outlier.colour=NA, position = dodge) + ggtitle('') + theme_piss()
grid.arrange(gpplog1, gpplog2, nrow = 2, top = textGrob(expression( italic(paste(underline("Log-"), "Violin plots of random ppm for each ´class´"))), gp = gpar(fontsize = 19, font = 2, col = "#33666C")))
From the graph of the spectrums above, we decide to select one ppm in a peak region and to plot its densities with respect to the hippurate proportions. We also select a random ppm to make the comparison.
## Chosen
df_densHip <- data.frame(value = unname(training_c[, "7.656082"]), Hippurate = training$Hip_prop)
#ggplot(df_densHip, aes(x= value , fill=Hippurate, color=Hippurate)) + geom_line(stat = "density", alpha=0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1" )+ scale_color_brewer(palette= "Set1")
df_densHip$Hippurate <- factor(df_densHip$Hippurate, levels = c("0", "Qh/4", "Qh/2", "Qh")) # Change order of levels in ggplot legend
g1ppdens <- ggplot(df_densHip, aes(x = value , fill = Hippurate, color=Hippurate)) + geom_density(alpha = 0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1" )+ scale_color_brewer(palette= "Set1") + geom_hline(yintercept=0, colour="white", size=1.1) + labs(title = "Density of a Chosen high-peaked ppm", x = "value for ppm 7.656082")
## Random
set.seed(12)
ind <- sample(as.data.frame(training_c), 1)
df_densHipRandom <- data.frame(value = unname(as.vector(ind)), Hippurate = training$Hip_prop)
df_densHipRandom$Hippurate <- factor(df_densHipRandom$Hippurate, levels = c("0", "Qh/4", "Qh/2", "Qh")) # Change order of levels in ggplot legend
g2ppdens <- ggplot(df_densHipRandom, aes(x= value , fill=Hippurate, color=Hippurate)) + geom_density(alpha=0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1" )+ scale_color_brewer(palette= "Set1") + geom_hline(yintercept=0, colour="white", size=1.1) + labs(title = "Density of a Random ppm", x = paste("value for ppm ", names(ind)))
grid_arrange_legend(g1ppdens, g2ppdens, ncol = 2, position = "bottom")
#### TEST SET
## Chosen
df_densHipTest <- data.frame(value = unname(test_c[,"7.656082"]), Hippurate = test$Hip_prop)
df_densHipTest$Hippurate <- factor(df_densHipTest$Hippurate, levels = c("0", "Qh/4", "Qh/2", "Qh")) # Change order of levels in ggplot legend
g1ppdensT <- ggplot(df_densHipTest, aes(x= value , fill=Hippurate, color=Hippurate)) + geom_density(alpha=0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1" )+ scale_color_brewer(palette= "Set1") + geom_hline(yintercept=0, colour="white", size=1.1) + labs(title = "Density of a Chosen high-peaked ppm", x = "value for ppm 7.656082") + theme(legend.position="none")
## Random (same as for training set)
indT <- test[,names(ind)]
df_densHipRandomTest <- data.frame(value = unname(as.vector(indT)), Hippurate = test$Hip_prop)
df_densHipRandomTest$Hippurate <- factor(df_densHipRandomTest$Hippurate, levels = c("0", "Qh/4", "Qh/2", "Qh")) # Change order of levels in ggplot legend
g2ppdensT <- ggplot(df_densHipRandomTest, aes(x= value , fill=Hippurate, color=Hippurate)) + geom_density(alpha=0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1" )+ scale_color_brewer(palette= "Set1") + geom_hline(yintercept=0, colour="white", size=1.1) + labs(title = "Density of a Random ppm", x = paste("value for ppm ", names(ind))) + theme(legend.position="none")
grid.arrange(g1ppdensT, g2ppdensT, ncol = 2, top = textGrob(expression(italic(paste("For the ", underline("test"), " set"))), gp = gpar(fontsize = 21, font = 2, col = "darkblue")))
Then, we will do a PCA to represent our data in a smaller space. With such values of Pearson's correlations, we expect to get good results from the PCA.
ncomp = 8
PCA.res = SVDforPCA(training_c, ncomp = ncomp)
# EigenValues
eig.res=rbind(PCA.res$var,PCA.res$var*100/sum(PCA.res$var),PCA.res$cumvar)
rownames(eig.res)=c("Variances","Prop Var","Cum. Eigen Values")
## No need to display the frame : Pareto will summarize it
res.fin.scale <- prcomp(training_c)
#fviz_eig(res.fin.scale)+ggtitle("Scree plot of information kept by components")+ geom_vline(aes(xintercept=3),color="red")
NOMSG( pareto.chart(PCA.res$eigval[1:8], main = "Pareto chart - Explained Variance", ylab = "Variances",plot = T) ) # pareto chart
Indeed, we remark that very few components are enough to explain the whole (remaining) ppm space. This actually means that the feature space can easily be (linearly) reduced. We had some prior knowledge that the process should be linear, and this result is consistent with that prior. In the following, it tells us that it is not necessary to consider overly complex models.
We decided to colour the points with respect to the hippurate concentration (thus as a class here) to better visualize if there is a straightforward separation in the score plot with respect to the quantity of hippurate. Note that the colour legend is not in increasing order.
Hippurate <- training$Hip_prop
## Note that we have changed the functions of MBX to get more flexibility, and we stored these function in a .R file that we source() in the beginning of this project. Here, the size_p parameter controls the size of the points (that were too big with the original function of M.Martin).
g1 <- DrawScores(PCA.res, type.obj = "PCA", drawNames=F,createWindow=FALSE, main = paste("PCA score plots for PC1 vs PC2, and for PC3 vs PC4"), axes =c(1,2), color = Hippurate, size_p = .6) + theme_piss(size_p=12, size_c = 10)
g2 <- DrawScores(PCA.res, type.obj = "PCA", drawNames=F,createWindow=FALSE, main =paste("PCA score plots for PC1 vs PC2, and for PC3 vs PC4"), axes =c(3,4), color = Hippurate, size_p = .6) + theme_piss(size_p = 12, size_c = 10)
#grid.arrange(g1,g2, nrow = 1)
plotly::subplot(hide_legend(ggplotly(g1)), ggplotly(g2), shareX = T, shareY= T, which_layout = 2)
You can select classes of Hippurate by clicking on the legend entries on the right.
gg_load <- DrawLoadings(PCA.res, type.obj = "PCA", createWindow=FALSE, main = paste("PCA loading plots for PC1 vs PC2"), axes =c(1,2))[[1]] + theme_piss(size_p = 14) + labs(x = "ppm")
ggplotly(gg_load)
We see that there is actually a clear separation of the PCA scores in the 2-dimensional space with respect to the quantity of hippurate. This subspace actually preserves a relatively good amount of variance (\(\approx 65\%\)).
The PCA being a parametric mapping model relying on linear combinations of the original variables, we can already state with relatively good confidence that the relationship between the spectrums and the hippurate quantity is (mostly) linear.
Moreover, we see in the loadings plot that the peaks are related to the peaks of the spectrums displayed in the figure above.
First of all, we want to be able to detect the biomarkers with a "simple" method. We choose the t-test because it relies on fewer assumptions than, for example, the ANOVA.
Here, we decide to use all the data: that is, we think that the training-test division is not relevant here.
p_value <- c()
dataAll <- as.data.frame(dataAll)
dataX <- as.data.frame(dataX)
for(i in 1:(ncol(dataX))) {
#print(i)
p_value[i] <- t.test(dataX[dataAll$Hip_prop == "0",i], dataX[dataAll$Hip_prop == "Qh/2",i], var.equal = F)$p.value
}
p_val_Bonf0 <- p_value * (ncol(dataAll)-9)
p_val_Bonf <- p_val_Bonf0 ; p_val_Bonf[p_val_Bonf>1] = 1
names(p_val_Bonf) <- colnames(dataX)
# NA are obviously coming from the "no-variance" columns that we pointed out earlier
For such tests, which are applied multiple times, it is important to make a correction for multiple testing. Here, we choose to present only the Bonferroni method, which does the following correction:
\[\text{Pvalue}_{(\text{Bonf})} = \begin{cases} N\cdot \text{Pvalue}, & \text{if } \text{Pvalue}< N^{-1}; \\ 1, & \text{otherwise}, \end{cases}\]
where \(N\) is the number of tests.
Other methods for multiple testing correction are available, such as those of Sidak, Benjamini-Hochberg, ...
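These corrections are also available in base R via p.adjust(); a small hedged sketch on the vector of raw p-values computed above (p_bonf and p_bh are illustrative names, not used elsewhere):
## Built-in multiple-testing corrections applied to the raw p-values above.
p_bonf <- p.adjust(p_value, method = "bonferroni")  # same as the manual N * p, capped at 1
p_bh   <- p.adjust(p_value, method = "BH")          # Benjamini-Hochberg, controls the FDR
sum(p_bonf < 0.05, na.rm = TRUE)  # features still significant after Bonferroni
sum(p_bh   < 0.05, na.rm = TRUE)  # usually less conservative than Bonferroni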
library(RColorBrewer)
myPalette <- colorRampPalette(rev(brewer.pal(4, "Spectral")))
g1 <- ggplot(as.data.frame(p_value)) + geom_histogram(aes(x = p_value,fill=..count..),bins = 25, col="red") + theme_piss() + labs(title = "Without correction") + scale_fill_gradientn("Count",limits = c(0,400),colours = myPalette(10) )
g2 <- ggplot(as.data.frame(p_val_Bonf)) + geom_histogram(aes(x = p_val_Bonf,fill=..count..),bins = 25, col="red") + theme_piss() + labs(title = "Using Bonferroni correction") + scale_fill_gradientn("Count",limits = c(0,400), colours = myPalette(10))
PissoortThesis::grid_arrange_legend(g1, g2, position = "bottom")
df_pval <- data.frame(Standard = p_value, Bonferroni = p_val_Bonf)
df_pvalgg <- melt(df_pval,variable.name = "Method" )
ggplot(df_pvalgg, aes(x = value, fill = Method, color=Method)) + geom_density(alpha=0.1, size=1.2) + theme_piss() + scale_fill_brewer(palette = "Set1") + scale_color_brewer(palette = "Set1")
cat("Number of significant p-values (<5%) is ",sum(p_value<0.05, na.rm = T)," out of ", ncol(dataAll)-9," --> ",100 *round(sum(p_value<0.05, na.rm = T)/(ncol(dataAll)-9),2),"%. \n")
Number of significant p-values (<5%) is 199 out of 503 --> 40 %.
cat("Number of significant test after Bonferroni Correction : ",sum(p_val_Bonf<0.05, na.rm = T)," out of ",(ncol(dataAll)-9)," --> ",100*round(sum(p_val_Bonf<0.05, na.rm = T)/(ncol(dataAll)-9),2),"%.")Number of significant test after Bonferroni Correction : 88 out of 503 --> 17 %.
Indeed, we see that the correction is useful: we get far fewer significant features (88 instead of 199).
#cat("Biomarkers : ", names(p_val_Bonf[order(p_val_Bonf)][1:8]))
#pander(p_val_Bonf[order(p_val_Bonf)][1:7]
df_tests <- as.data.frame(t(p_val_Bonf[order(p_val_Bonf)]))
rownames(df_tests) <- "P-val.:"
# cat(" .dataTables_paginate {
# display: none; } ") # Hide page numbering
#as.datatable(formattable(df_tests, list(p_val_Bonf = color_tile("lightblue", "lightpink"))), style = 'bootstrap', options = list(dom = 'tp',scrollX = TRUE), caption = htmltools::tags$caption(style = 'caption-side: bottom; text-align: center;', 'Table: ', htmltools::em("P-values in decreasing order of significance for t-tests with bonferroni correction. NA's ")))
as.datatable(formattable(df_tests, list(p_val_Bonf = color_tile("lightblue", "lightpink"))), style = 'bootstrap', options = list(dom = 'tp',scrollX = TRUE))
The table above displays, in decreasing order of significance, the features (biomarkers) that are most relevant according to the Bonferroni-corrected t-tests.
Several algorithms do not embed automatic feature selection and hence it is important to do it manually, in order not to feed these algorithms a feature space that is too large. Just as for the features which do not vary between the spectrums, it is important to remove other unimportant features, relying on several criteria.
We assume that we do not know a priori that the process is globally linear (see PCA) and hence we want to take all forms of relationship into account. Moreover, even if this procedure is more 'complex' to build, it will also handle 'simpler' relationships quite well.
1. Correlation methods which allow not only the Pearson but also the rank correlation.
data_FSlect <- cbind.data.frame(Hip_CONC = training$Hip_CONC, training_c)
weights_lin <- linear.correlation(formula = Hip_CONC ~ ., data_FSlect)
subset_lin <- cutoff.k(weights_lin, 20)
# Same procedure for the rank correlation
weights_rank <- rank.correlation(formula = Hip_CONC ~ ., data_FSlect)
subset_rank <- cutoff.k(weights_rank, 20)
df_FSelect <- rbind(subset_lin, subset_rank)
length(feat_FSelect <- unique(as.vector(df_FSelect))) # 20
These two correlation methods (linear and rank-based) select nearly the same features: only 1 differs if we subset 20 features for each, and 4 if we subset 40.
2. Regarding a recursive feature elimination (RFE) where a random forest is used at each iteration to evaluate the model, we have:
set.seed(123)
# define the control using a random forest selection function
control <- rfeControl(functions=rfFuncs, method="cv", number= 5) #5 cv folds
# run the RFE algorithm
results <- rfe(training_c, training$Hip_prop, sizes= c(1:20), rfeControl=control, metric = "Accuracy")
## list the chosen features
feat_rfe <- predictors(results)
## plot the results
#plot(results, type=c("g", "o"))
feat_rfe <- results$variables$var[1:20]
The algorithm is configured to explore subsets of 1 to 20 attributes.
3. Finally, a wrapper combining glmnet and XGBoost importance:
set.seed(13)
params_glmnet = list(alpha = 1, family = 'gaussian', nfolds = 5, parallel = TRUE) # 5 folds
params_xgboost = list( params = list("objective" = "reg:linear", "bst:eta" = 0.001, "subsample" = 0.75, "max_depth" = 5, "colsample_bytree" = 0.75, "nthread" = 6),
nrounds = 1000, print.every.n = 1000, maximize = FALSE)
# params_ranger1 = list(probability = FALSE, num.trees = 1000, verbose = TRUE, mtry = 5, min.node.size = 10, num.threads = 6, classifiction = FALSE, importance = 'permutation')
params_features = list(keep_number_feat = 20, union = TRUE) # keep at most 20 features per method
feat = wrapper_feat_select(X = training_c, y = training$Hip_CONC, params_glmnet = params_glmnet, params_xgboost = params_xgboost, xgb_sort = 'Gain', CV_folds = 5, stratified_regr = FALSE, scale_coefs_glmnet = FALSE, cores_glmnet = 8, params_features = params_features, verbose = F)
pander(feat$union_feat[1:10,])
feat_GlmnetXGB <- as.vector(feat$union_feat$feature[1:20])It can be seen as a bit weird to consider such a complex model (XGBoost) for doing feature selection but it is necessary to optimize our selection process, and not to discard features (ppm) that could have more complex relationships with hippurate.
Now, let’s do a summary of all our methods and make them votes. The idea is to make the selection the most robust and valid as possible. If we arbitrarily discard a feature now because our method is too ‘poor’, then this feature will be removed for all the future algorithms for which the feature selection will be applied. Then, it would induce a big and so it is crucial to build a reliable feature selection scheme.
feat_all <- c(feat_FSelect, feat_rfe, feat_GlmnetXGB)
# length(unique(feat_all)) # 39 retained features in total (<60 due to duplications)
sum(dupl <- duplicated(feat_all)) # 21
( kept_features <- unique(feat_all[dupl]) ) # 16
training_c_small <- training_c[, kept_features]
test_small <- test[, kept_features] # For prediction. But this is not always mandatory
We decided to keep a feature if it appeared in at least 2 of the 4 different methods, each of which selects 20 features. This procedure leaves us with 16 most important features.
With the feature selection made in the previous subsection, we can already identify biomarkers. As the models used to select the relevant features are actually quite 'complex' and diversified, these could already be our final biomarkers...
gg_spectre10 + geom_vline(xintercept = as.numeric(kept_features[1:10]), linetype = 2, size = .5) +
labs(title = "15 random Spectrums with 10 selected features", y = "Spectrums intensities" )We can compare these results with the ones coming from the corrected t-tet, bot in increasing order of selection :
#cat("Biomarkers from corrected t-tests : ", names(p_val_Bonf[order(p_val_Bonf)][1:8]))
#pander(t(data.frame(t.test.corrected = names(p_val_Bonf[order(p_val_Bonf)][1:8]), feature.selection = kept_features[1:8])))
df_pvalAll <- data.frame(t.test.corrected = names(p_val_Bonf[order(p_val_Bonf)][1:length(kept_features)]), feature.selection = kept_features)
dfff <- as.data.frame(t(df_pvalAll))
colnames(dfff) <- seq(1, length(kept_features))
DT::datatable(dfff, style = "bootstrap", class="table-condensed", options = list(dom = 'tp', scrollX = TRUE))
We see that the selected features are very similar, which is already a good sign of 'robustness'.
pct_keptfeat <- sum(duplicated(c(as.vector(df_pvalAll$t.test.corrected), as.vector(df_pvalAll$feature.selection ) ))) / length(kept_features) *100
cat(pct_keptfeat, "% of the selected features are also the most significants wrt Bonferroni t-tests")68.75 % of the selected features are also the most significants wrt Bonferroni t-tests
#@ These are all the features (21) selected by the methods
# unique(c(as.vector(df_pvalAll$t.test.corrected), as.vector(df_pvalAll$feature.selection) ) )
The metric we will use is the RMSE. Its use is very common and it makes an excellent general-purpose error metric for numerical predictions. For example, compared to the Mean Absolute Error (MAE), the RMSE amplifies and severely punishes larger errors. This is often desirable, as larger errors are more problematic: here, being off by 200 on the hippurate quantity is more than twice as bad as being off by 100.
'RMSE' <- function(x.true, x.fit) sqrt( (1/length(x.true)) * sum((x.true - x.fit)^2) )
\[\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^N \big( y_i-\hat{y}_i\big)^2 }, \qquad\qquad N:=\text{test set size} = 54.\]
It computes the squared differences between the predicted values of the model of interest and the real (true) values, averages them over all observations and takes the square root. It will be our basis for comparing models, and for tuning the models' hyperparameters by cross-validation.
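A small hedged illustration of why the RMSE punishes large errors more than the MAE (the MAE helper below is only defined for this toy example):
## Toy comparison: two error patterns with the same MAE but different RMSE.
'MAE' <- function(x.true, x.fit) mean(abs(x.true - x.fit))
truth <- rep(0, 4)
spread_errors <- c(100, 100, 100, 100)   # four moderate errors
one_big_error <- c(0, 0, 0, 400)         # same total error, concentrated
MAE(truth, spread_errors);  MAE(truth, one_big_error)    # both equal 100
RMSE(truth, spread_errors); RMSE(truth, one_big_error)   # 100 versus 200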
Now, we will explore some models and their characteristics, do some insightful visualization and compute some diagnostics if needed. We will save their cross-validation RMSE but also their 'true' RMSE (predicted on the test set). Finally, we will compare them all (and possibly try some stacking).
Note that for the (stepwise) linear regression models, there are no real tuning hyperparameters but we will make the cross-validation anyway to obtain estimates of this error (RMSE-cv). If we only wanted to obtain the parameters’ coefficients, it would have been better to compute these using the whole training set \(\rightarrow\) more points.
We use the standard lm() function from stats R, with the feature selection applied to ensure \(p<n\).
library(doMC)
registerDoMC(cores = 8) # All the subsequent simulations will be run in parallel
n_folds <- 10
tc_reglin <- trainControl("cv", n_folds, savePredictions = T )
set.seed(12)
regLin <- train(y = training$Hip_CONC,
x = training_c_small,
metric = "RMSE",
method = "lm",
trControl = tc_reglin)
cv_rmse_regLin <- min(regLin$results$RMSE)
pred_regLin <- unname(predict(regLin, test))
rmse_reglin <- RMSE(test$Hip_CONC, pred_regLin) # 19.99
pred_regLinc <- unname(predict(regLin, test_c))
rmse_reglinc <- RMSE(test$Hip_CONC, pred_regLinc)
cat("Cross-validated RMSE is ", cv_rmse_regLin , "while the True RMSE is ", rmse_reglinc, " with the ", length(kept_features), " selected features")Cross-validated RMSE is 9.30199 while the True RMSE is 8.237091 with the 16 selected features
A simple cross-validation is enough here (no hyperparameters to tune).
We make use of the method lmStepAIC which comes from the MASS package.
n_folds <- 10
tc_reglin <- trainControl("cv", n_folds, savePredictions = T)
set.seed(12)
regLin_stpw <- train(y = training$Hip_CONC,
x = training_c_small,
metric = "RMSE",
method = "lmStepAIC",
trace = 0,
trControl = tc_reglin)
cat("Retained variables are " , colnames(regLin_stpw$finalModel$model)[-1])Retained variables are 7.6397634 3.9811536 7.5743654 2.0538744 8.5544156 0.894194 1.5801986 0.926832 0.9430896 1.9230788 3.0501212 1.5638796 3.1644144 3.6054524
cv_rmse_regLinsw <- min(regLin_stpw$results$RMSE)
pred_regLinsw <- unname(predict(regLin_stpw, test))
rmse_reglinsw <- RMSE(test$Hip_CONC, pred_regLinsw)
pred_regLinswc <- unname(predict(regLin_stpw, test_c))
rmse_reglinswc <- RMSE(test$Hip_CONC, pred_regLinswc)
The final model retains 14 variables (out of the initial 16).
Remark that the RMSE is lower for the linear regression with the preliminary feature selection than for the linear regression with this stepwise selection. This can be seen in the following plot.
Now, we try a last model using more variables. It did not work on the complete dataset because of multicollinearity, so we had to remove features that are perfectly collinear. Finally, we decided to redo the same feature selection method, now keeping \(\approx\) 100 variables. As it will be useful to tune the number of features to keep (this number is often somewhat arbitrarily fixed) for other algorithms, we decided to create a function that retrieves any number of relevant features.
## n.keepByAlgo controls the number of selected features to keep by each of the 4 methods used.
'featureSelect' <- function(n.keepByAlgo = 100) {
# Linear and rank correlation
subset_lin2 <- cutoff.k(weights_lin, n.keepByAlgo)
subset_rank2 <- cutoff.k(weights_rank, n.keepByAlgo)
df_FSelect2 <- rbind(subset_lin2, subset_rank2)
feat_FSelect2 <- unique(as.vector(df_FSelect2))
# XGboost and random forest importance
feat_GlmnetXGB2 <- as.vector(feat$union_feat$feature[1:n.keepByAlgo])
feat_rfe2 <- results$variables$var[1:n.keepByAlgo]
# all
feat_all2 <- c(feat_FSelect2, feat_rfe2, feat_GlmnetXGB2)
sum(dupl2 <- duplicated(feat_all2))
return(unique(feat_all2[dupl2]))
}
kept_featuresBig <- featureSelect()
# See section 3 (feature selection on correlation)
# training_c_nocor <- training_c[, colnames(training_c) %in% highlyCorrelated ==F]
tc_reglin <- trainControl("cv", n_folds, savePredictions = T )
set.seed(12)
regLin_stpw_all <- train(y = training$Hip_CONC,
x = training_c[,kept_featuresBig],
metric = "RMSE",
method = "lmStepAIC",
#direction = "forward",
trace = 0,
trControl = tc_reglin)
cat(ncol(regLin_stpw_all$finalModel$model)-1, "Retained variables now :" , colnames(regLin_stpw_all$finalModel$model)[-1])
cv_rmse_regLinsw_all <- min(regLin_stpw_all$results$RMSE)
pred_regLinsw_all <- unname(predict(regLin_stpw_all, test))
rmse_reglinsw_all <- RMSE(test$Hip_CONC, pred_regLinsw_all) # 41.3
pred_regLinsw_allc <- unname(predict(regLin_stpw_all, test_c))
rmse_reglinsw_allc <- RMSE(test$Hip_CONC, pred_regLinsw_allc)
dfstep <- data.frame( True.RMSE = c(rmse_reglinswc, rmse_reglinsw_allc), RMSE.cv = c(cv_rmse_regLinsw, cv_rmse_regLinsw_all), size = c( paste0(length(regLin_stpw$finalModel$model)-1, "(", length(kept_features), ")"), paste0(length(regLin_stpw_all$finalModel$model)-1, "(", length(kept_featuresBig), ")")) )
rownames(dfstep) <- c( 'Linear regression SW small', 'Linear regression SW big')
pander(t(dfstep))
| | Linear regression SW small | Linear regression SW big |
|---|---|---|
| True.RMSE | 8.248378 | 6.992543 |
| RMSE.cv | 9.484595 | 6.157128 |
| size | 14(16) | 43(84) |
The number in parentheses is the initial number of variables in the model before the stepwise selection stage.
43 variables are now retained by this larger model, which reaches a lower RMSE on the test set (see the table above).
Now, we create a function that we aim to use for all our models. The goal is to compare the predicted values with the true values, for both the cross-validation and the test set. Thus, note that there are 215 values for the cross-validation and 54 values for the test set. We print in red the observation numbers of the problematic observations to facilitate further diagnostics.
We define the function that makes the prediction plots for the rest of the project, for both cross-validation and testing.
'PredVsObsPlots' <- function(df_cv, obs.test = test$Hip_CONC, pred.test, threshold = 30, rmse.cv) {
#browser()
df_cv$error <- abs(df_cv$obs - df_cv$pred)
ProblematicObs.cv <- df_cv$error > threshold
names.cv <- row.names(df_cv)
names.cv[with(df_cv, !ProblematicObs.cv)] <- ""
lines <- data.frame(x1 = c(-30, 45, 120, 270), x2 = c(30, 105, 180, 330), y1 = c(0, 75, 150, 300), y2 = c(0, 75, 150, 300))
g1 <- ggplot(df_cv) + geom_point(aes(x = obs, y = pred), size = .95 ) + theme_piss() + labs(title = "Cross-Validation", subtitle = "Predicted values vs observed values") + geom_text(aes(obs, pred+12, label = names.cv, size = error), color = "red") + geom_segment(aes(x = x1, xend = x2, y = y1, yend = y2), col = "blue", linetype = 2, data = lines) + # Annotate whole value of RMSE
annotate(geom = "text", label = paste("RMSE-cv", " is : \n", round(rmse.cv, 4)),
x = 250,
y = 45, col = "#33666C" , size = 4)
df <- data.frame(obs.test = obs.test, pred.test = pred.test)
df$error <- abs(obs.test - pred.test)
ProblematicObsTest <- df$error > threshold
g2 <- ggplot(df, aes(x = obs.test, y = pred.test))
names <- row.names(df)
names[with(df, !ProblematicObsTest)] <- ""
true.rmse <- RMSE(obs.test, pred.test)
g2 <- g2 + geom_point(size = .95) + geom_segment(aes(x = x1, xend = x2, y = y1, yend = y2), col = "blue", linetype = 2, data = lines) + theme_piss() + labs(title = "Test set", subtitle = "Predicted values vs observed values") + geom_text(aes(obs.test, pred.test+12, label = names, size = error), color = "red") + # Annotate whole value of RMSE
annotate(geom = "text", label = paste("RMSE", " is : \n", round(true.rmse, 4)),
x = 260,
y = 30, col = "#33666C" , size = 4)
#grid <- grid.arrange(g1, g2, nrow = 1)
#grid.arrange(g1, g2, nrow = 1)
return(list(gcv = g1, gtest = g2 ))
}
Note that on these plots, the size of the observation number printed in red is proportional to the magnitude of its error. The dotted blue lines represent the true (observed) values on the prediction scale. Note also that when we do repeated cross-validation to tune hyperparameters, we only take the predicted values of the first repetition (and not the average over the repeated CV).
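As a hedged sketch, such filtering can be done through caret's fit$pred data frame, which tags every resample in a Resample column (e.g. "Fold01.Rep1" for repeated CV); the helper name firstRepeatPreds is hypothetical and is not used elsewhere in this report.
## Hypothetical helper: keep only the first CV repetition of a repeated-CV
## caret fit, at the selected hyperparameter values, before plotting.
'firstRepeatPreds' <- function(fit) {
  preds <- merge(fit$pred, fit$bestTune)      # rows at the tuned values only
  preds[grepl("Rep1$", preds$Resample), ]     # first repetition only
}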
df_cv_reglin <- regLin$pred[order(regLin$pred$rowIndex),]
# Retrieve correct indexing of observations
rownames(df_cv_reglin) <- df_cv_reglin$rowIndex
plots_lin <- PredVsObsPlots(df_cv_reglin, pred.test = pred_regLinc, threshold = 25, rmse.cv = cv_rmse_regLin)
# the threshold can be changed to manage the number of problematic obsverations displayed.
grid.arrange(plots_lin$gcv, plots_lin$gtest, nrow = 1)
df_cv_reglinStp <- regLin_stpw$pred[order(regLin_stpw$pred$rowIndex),]
rownames(df_cv_reglinStp) <- df_cv_reglinStp$rowIndex
plots_sw <- PredVsObsPlots(df_cv_reglinStp, pred.test = pred_regLinswc, threshold = 25, rmse.cv = cv_rmse_regLinsw)
grid.arrange(plots_sw$gcv, plots_sw$gtest, nrow = 1)
df_cv_reglinStpAll <- regLin_stpw_all$pred[order(regLin_stpw_all$pred$rowIndex),]
rownames(df_cv_reglinStpAll) <- df_cv_reglinStpAll$rowIndex
plots_swAll <- PredVsObsPlots(df_cv_reglinStpAll, pred.test = pred_regLinsw_allc, threshold = 25, rmse.cv = cv_rmse_regLinsw_all)
grid.arrange(plots_swAll$gcv, plots_swAll$gtest, nrow = 1)
We remark the same shape as for the regression with the 16 original variables.
More specific diagnostics could be made…
Here, the predicted values are too low compared to the true values (except for observation 27, which is still way too high). This can be explained by overfitting: we introduced too much noise, as the stepwise procedure selected too many variables on this wider set.
Note that the two stepwise models differ both in their size and in their test performance.
As statisticians, we have to notice that the assumptions of the linear regression are likely to be poor here, as we have a discrete variable with only 4 categories. However, as our goal is only prediction and not 'global' inference, this is not really important, and that is the reason why we do not check these assumptions here. But it can explain part of the prediction errors observed above.
Instead of fitting the LASSO and the Ridge regressions separately, we decided to fit only the Elastic Net model, which takes both penalties into account and allows for more flexibility.
We use the glmnet package with 2 tuning parameters: alpha (mixing) and lambda (regularization). The criterion can be conveniently written as
\[\hat{\boldsymbol{\beta}} = \text{argmin}_{\boldsymbol{\beta}} \Big\{ \sum_{i=1}^n(y_i-\boldsymbol{x}_i'\boldsymbol{\beta})^2 + \lambda\Big[(1-\alpha)\sum_{j=1}^p\beta_j^2 + \alpha\sum_{j=1}^p|\beta_j|\Big]\Big\}.\] See (Zou and Hastie 2005) for more information.
We make use of 10-fold cross-validation repeated 5 times to improve the accuracy of the results. The optimal hyperparameters will be computed through this cross-validation.
n_folds <- 10
# We tune the hyperparameters together in a 2-dimensional grid.
eGrid2 <- expand.grid(.alpha = (0:10) * 0.1,
.lambda = (0:10) * 0.1)
rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 5, savePredictions = T)
set.seed(12)
glmnet_cv_all <- train(y = training$Hip_CONC,
x = training_c,
method = "glmnet",
trControl = rctrl1,
tuneGrid = eGrid2) # alpha = 0.9, lambda=1
cv_rmse_reglmnet1 <- min(glmnet_cv_all$results$RMSE)
set.seed(12)
glmnet_cv_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "glmnet",
trControl = rctrl1,
tuneGrid = eGrid2) # alpha = 0.1, lambda= 0.3
cv_rmse_reglmnet2 <- min(glmnet_cv_small$results$RMSE)
eGrid3 <- expand.grid(.alpha = seq(.1, .3, by = .03),
.lambda = seq(.1, .5, by = .06))
#rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 5) # repeated cv 10 times
set.seed(12)
glmnet_cv_small2 <- train(y = training$Hip_CONC,
x = training_c_small,
method = "glmnet",
trControl = rctrl1,
tuneGrid = eGrid3)
cv_rmse_reglmnet3 <- min(glmnet_cv_small2$results$RMSE)
pred_glmnet1 <- unname(predict(glmnet_cv_all, test))
pred_glmnet2 <- unname(predict(glmnet_cv_small, test))
pred_glmnet3 <- unname(predict(glmnet_cv_small2, test))
rmse_reglmnet1 <- RMSE(test$Hip_CONC, pred_glmnet1)
rmse_reglmnet2 <- RMSE(test$Hip_CONC, pred_glmnet2)
rmse_reglmnet3 <- RMSE(test$Hip_CONC, pred_glmnet3)
#c(rmse_reglmnet1, rmse_reglmnet2, rmse_reglmnet3)
pred_glmnet1c <- unname(predict(glmnet_cv_all, test_cVar))
## Important to take the right number of features in the test set !!!! 'kept_features' !!!!!!!!
pred_glmnet2c <- unname(predict(glmnet_cv_small, test_c[, kept_features]))
pred_glmnet3c <- unname(predict(glmnet_cv_small2, test_c[ ,kept_features]))
rmse_reglmnet1c <- RMSE(test$Hip_CONC, pred_glmnet1c)
rmse_reglmnet2c <- RMSE(test$Hip_CONC, pred_glmnet2c)
rmse_reglmnet3c <- RMSE(test$Hip_CONC, pred_glmnet3c)
dfglmnet <- as.data.frame(t(data.frame( True.RMSE = c(rmse_reglmnet1c, rmse_reglmnet2c, rmse_reglmnet3c), RMSE.cv = c(cv_rmse_reglmnet1, cv_rmse_reglmnet2, cv_rmse_reglmnet3), lambda = c(glmnet_cv_all$bestTune$lambda, glmnet_cv_small$bestTune$lambda, glmnet_cv_small2$bestTune$lambda))) )
colnames(dfglmnet) <- c( 'Elastic Net full', 'Elastic Net small (-)', 'Elastic Net small (+)')
#pander(dfglmnet)
## Example
# colnames(Beta_2b) <- c("A","B","C")
# Beta_2b<- data.frame(Beta_2b)
# Font.Col1 <- c("bold",NA,NA,"bold","bold","bold")
# formattable(Beta_2b, list(
# A = formatter("span", style = x ~ style(font.weight = Font.Col1))))
Font.Col1 <- c("bold", NA, NA)
formattable(dfglmnet, list(`Elastic Net full` = formatter("span", style = x ~ formattable::style(font.weight = Font.Col1))))
| | Elastic Net full | Elastic Net small (-) | Elastic Net small (+) |
|---|---|---|---|
| True.RMSE | 6.976341 | 8.092806 | 8.102530 |
| RMSE.cv | 8.055671 | 9.334340 | 9.235442 |
| lambda | 1.000000 | 0.300000 | 0.220000 |
It actually shows that it is better not to apply a feature selection before the elastic net: with the Ridge penalty in particular, the model reduces the dimensionality itself by shrinking the coefficients.
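A hedged sketch of how the coefficients actually retained (non-zero) by the tuned full model could be inspected, to see how many ppm the penalty effectively keeps; the object names coefs/kept are only illustrative.
## Sketch: non-zero coefficients of the tuned elastic net (full model).
coefs <- as.matrix(coef(glmnet_cv_all$finalModel, s = glmnet_cv_all$bestTune$lambda))
kept  <- setdiff(rownames(coefs)[coefs[, 1] != 0], "(Intercept)")
length(kept)   # number of ppm effectively kept by the penalty
head(kept)     # a few of the corresponding ppm names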
set.seed(12)
glmnet_fitOptim <- cv.glmnet(y = training$Hip_CONC,
x = training_c, alpha = glmnet_cv_all$bestTune$alpha,
lambda = seq(0, 100, by = .1))
# tidied_cv <- tidy(glmnet_cv_all$results)
# glance_cv <- glance(glmnet_cv_all$results)
#
# tidied_cv <- tidy(glmnet_cv_all$pred)
# glance_cv <- glance(glmnet_cv_all$pred)
#
# ggplot(tidied_cv, aes(lambda, estimate)) + geom_line(color = "red") +
# geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .2) +
# scale_x_log10() +
# geom_vline(xintercept = glance_cv$lambda.min) +
# geom_vline(xintercept = glance_cv$lambda.1se, lty = 2)
# glmnet_cv_all$resample
# glmnet_cv_all$control
set.seed(12)
glmnet_fitRidge <- cv.glmnet(y = training$Hip_CONC,
x = training_c, alpha = 0,
lambda = seq(0, 100, by = .1))
set.seed(12)
glmnet_fitLasso <- cv.glmnet(y = training$Hip_CONC,
x = training_c, alpha = 1,
lambda = seq(0, 100, by = .1))
# --> compare multiple LASSO cross-validations in the same plot.
g1 <- ggplot(glmnet_cv_all) + labs(x = 'mixing percentage (alpha)', linetype = 'Regu. lambda', title = "Full") + theme_piss() + guides(fill=guide_legend(title="lambda"), shape=guide_legend(title="lambda"), color=guide_legend(title="lambda"))
g2 <- ggplot(glmnet_cv_small) + labs(x = 'mixing percentage (alpha)', fill = 'Regul. lambda', title = "Small") + theme_piss() + guides(fill=guide_legend(title="lambda"), shape=guide_legend(title="lambda"), color=guide_legend(title="lambda"))
#grid.arrange(g1, g2, nrow = 1)
PissoortThesis::grid_arrange_legend(g1, g2, ncol = 2, position = "bottom")
## RIDGE
tidied_cvRidge <- tidy(glmnet_fitRidge)
glance_cvRidge <- glance(glmnet_fitRidge)
# We are provided the MSE, hence we square root it to have the RMSE
tidied_cvRidge$RMSE <- sqrt(tidied_cvRidge$estimate)
tidied_cvRidge$conf.lowsq <- sqrt(tidied_cvRidge$conf.low)
tidied_cvRidge$conf.highsq <- sqrt(tidied_cvRidge$conf.high)
rmse_ridge <- RMSE(test$Hip_CONC, unname(predict(glmnet_fitRidge, test_c)))
gr1 <- ggplot(tidied_cvRidge, aes(lambda, RMSE)) + geom_line(color = "red") +
geom_ribbon(aes(ymin = conf.lowsq, ymax = conf.highsq), alpha = .2) +
scale_x_log10() + theme_piss() + labs(y = "estimate (RMSE)") +
geom_vline(xintercept = glance_cvRidge$lambda.min) +
geom_vline(xintercept = glance_cvRidge$lambda.1se, lty = 2) + # largest value of lambda such that error is within 1 standard error of the minimum.
annotate(geom = "text", label = paste("True RMSE", " is : \n", round(rmse_ridge, 4)),
x = 30,
y = 15, col = "#33666C" , size = 5)
## LASSO
tidied_cvLasso <- tidy(glmnet_fitLasso)
glance_cvLasso <- glance(glmnet_fitLasso)
# We are provided the MSE, hence we square root it to have the RMSE
tidied_cvLasso$RMSE <- sqrt(tidied_cvLasso$estimate)
tidied_cvLasso$conf.lowsq <- sqrt(tidied_cvLasso$conf.low)
tidied_cvLasso$conf.highsq <- sqrt(tidied_cvLasso$conf.high)
rmse_lasso <- RMSE(test$Hip_CONC, unname(predict(glmnet_fitLasso, test_c)))
gl2 <- ggplot(tidied_cvLasso, aes(lambda, RMSE)) + geom_line(color = "red") +
geom_ribbon(aes(ymin = conf.lowsq, ymax = conf.highsq), alpha = .2) +
scale_x_log10() + theme_piss() + labs(y = "estimate (RMSE)") +
geom_vline(xintercept = glance_cvLasso$lambda.min) +
geom_vline(xintercept = glance_cvLasso$lambda.1se, lty = 2) + # largest value of lambda such that error is within 1 standard error of the minimum.
annotate(geom = "text", label = paste("True RMSE", " is : \n", round(rmse_lasso, 4), "\n ....."),
x = 1,
y = 85, col = "#33666C" , size = 5)
grid.arrange(gr1, gl2, ncol = 2)
## Fit (to obtain S3 object for further plotting)
fit_glmnetRid <- glmnet(y = training$Hip_CONC,
x = training_c, alpha = 0,
lambda = seq(0,3, by = .1) )
fit_glmnetLass <- glmnet(y = training$Hip_CONC,
x = training_c, alpha = 1,
lambda = seq(0,2, by = .1) )
#plot(fit_glmnetRid, xvar="lambda", main="")
plot(fit_glmnetLass, xvar="lambda", main="")
df_cv_regPenAll <- glmnet_cv_all$pred[glmnet_cv_all$pred$alpha == glmnet_cv_all$bestTune[,1] & glmnet_cv_all$pred$lambda == glmnet_cv_all$bestTune[,2] ,][1:nrow(training),]
# Retrieve correct indexing of observations
rownames(df_cv_regPenAll) <- df_cv_regPenAll$rowIndex
df_cv_regPenSmall <- glmnet_cv_small$pred[glmnet_cv_small$pred$alpha == glmnet_cv_small$bestTune[,1] & glmnet_cv_small$pred$lambda == glmnet_cv_small$bestTune[,2] ,][1:nrow(training),]
rownames(df_cv_regPenSmall) <- df_cv_regPenSmall$rowIndex
df_cv_regPenSmall2 <- glmnet_cv_small2$pred[glmnet_cv_small2$pred$alpha == glmnet_cv_small2$bestTune[,1] & glmnet_cv_small2$pred$lambda == glmnet_cv_small2$bestTune[,2] ,][1:nrow(training),]
rownames(df_cv_regPenSmall2) <- df_cv_regPenSmall2$rowIndex
NOMSG( g1 <- PredVsObsPlots(df_cv_regPenAll, pred.test = pred_glmnet1c, threshold = 20, rmse.cv = cv_rmse_reglmnet1))
NOMSG( g2 <- PredVsObsPlots(df_cv_regPenSmall, pred.test = pred_glmnet2c, threshold = 20, rmse.cv = cv_rmse_reglmnet2))
g3 <- PredVsObsPlots(df_cv_regPenSmall2, pred.test = pred_glmnet3c, threshold = 20, rmse.cv = cv_rmse_reglmnet3)
grid.arrange(g1$gcv, g1$gtest, nrow = 1, top = textGrob(expression( italic("Full")), gp = gpar(fontsize = 19, font = 2, col = "blue")) )
grid.arrange( g2$gcv, g2$gtest,nrow = 1, top = textGrob(expression( italic("Small (-)")), gp = gpar(fontsize = 19, font = 2, col = "blue")) )
grid.arrange( g3$gcv, g3$gtest, nrow = 1, top = textGrob(expression( italic("Small (+)")), gp = gpar(fontsize = 19, font = 2, col = "blue")) )
#grid.arrange(g3$gcv, g3$gtest, nrow=1)
Hyperparameter tuning: interestingly, we see that the value of \(\lambda\) has no impact on the full model (left).
Prediction plots: for the first, it seems that we overestimate the true values, while for the second we really underestimate them...
In Principal Component Regression (PCR), we keep the principal components that preserve the most information from the ppm matrix, and then we build the linear regression model with the scores of the selected components as features. This number of components is the only hyperparameter to tune. Note that this model is especially interesting to overcome the issue of multicollinearity pointed out in section X, as some of our features are highly correlated.
We make use of the pls package to construct our model. The only tuning parameter of this model is the number of components ncomp.
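To make the PCR idea concrete, here is a hedged sketch of the same construction done by hand (PCA scores fed to lm()); the number of components k and the object names are chosen arbitrarily for illustration.
## Sketch of PCR "by hand": PCA on the ppm matrix, then a linear regression
## of the hippurate concentration on the first k score vectors.
k <- 5                                    # illustrative number of components
pca    <- prcomp(training_c, center = TRUE)
scores <- as.data.frame(pca$x[, 1:k])
pcr_by_hand <- lm(training$Hip_CONC ~ ., data = scores)
summary(pcr_by_hand)$r.squared            # variance explained by these k scores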
# Explore manually Values for controlling the training process
rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 5, savePredictions = T) # repeated cv 5 times
set.seed(12)
pcr_cv_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "pcr",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:15)))
cv_rmse_pcrSmall <- min(pcr_cv_small$results$RMSE )
pcr_train_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "pcr",
.ncomp = 3)
set.seed(12)
pcr_cv_all <- train(y = training$Hip_CONC,
x = training_c,
method = "pcr",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:100)))
cv_rmse_pcrAll1 <- min(pcr_cv_all$results$RMSE )
plot(pcr_cv_all) # Take rather 17 comp actually!
set.seed(12)
pcr_cv_all2 <- train(y = training$Hip_CONC,
x = training_c,
method = "pcr",
trControl = rctrl1,
tuneLength = 17)
#tuneGrid = expand.grid(.ncomp = 17))
cv_rmse_pcrAll2 <- min(pcr_cv_all2$results$RMSE )
pred_pcrsmall <- unname(predict(pcr_cv_small, test))
pred_pcrall <- unname(predict(pcr_cv_all, test))
pred_pcrall2 <- unname(predict(pcr_cv_all2, test))
rmse_pcrSmall <- RMSE(test$Hip_CONC,pred_pcrsmall)
rmse_pcrAll <- RMSE(test$Hip_CONC, pred_pcrall)
rmse_pcrAll2 <- RMSE(test$Hip_CONC, pred_pcrall2)
pred_pcrsmallc <- unname(predict(pcr_cv_small, test_c[, kept_features]))
pred_pcrallc <- unname(predict(pcr_cv_all, test_c))
pred_pcrall2c <- unname(predict(pcr_cv_all2, test_c))
rmse_pcrSmallc <- RMSE(test$Hip_CONC,pred_pcrsmallc)
rmse_pcrAllc <- RMSE(test$Hip_CONC, pred_pcrallc)
rmse_pcrAll2c <- RMSE(test$Hip_CONC, pred_pcrall2c)
# ?
rmse_pcrSmallcomp3 <- RMSE(test$Hip_CONC, unname(predict(pcr_train_small, test)))
rmse_pcrSmallcomp3c <- RMSE(test$Hip_CONC, unname(predict(pcr_train_small, test_c)))
We first tried a small number of values for ncomp and remarked that the optimal value was the highest one. Hence, we extended the grid to 100 and retrieved 66 components (very variable) as optimal.
Independent Component Regression (ICR) is analogous to Principal Component Regression (PCR) but uses Independent Component Analysis (ICA) to produce the scores. We use the fastICA package, and the model has the same single tuning parameter, n.comp, as the PCR.
# Explore manually Values for controlling the training process
rctrl1 <- trainControl(method = "cv", number = 10, savePredictions = T, verboseIter = T) # Here the algorithm is very slow so we decrease the number of folds and of repetitions for the cv.
set.seed(12)
icr_cv_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "icr",
trControl = rctrl1,
tuneGrid = expand.grid(.n.comp = c(1:16)))
cv_rmse_icr_small <- min(icr_cv_small$results$RMSE)
set.seed(12)
icr_cv_all <- train(y = training$Hip_CONC,
x = training_c,
method = "icr",
trControl = rctrl1,
tuneGrid = expand.grid(.n.comp = c(1:35)), verbose = F)
cv_rmse_icr_all <- min(icr_cv_all$results$RMSE)
#
pred_icr_small <- predict(icr_cv_small, test)
rmse_icrSmall <- RMSE(test$Hip_CONC, unname(pred_icr_small) ) # 23.55
pred_icr_smallc <- predict(icr_cv_small, test_c[,kept_features])
rmse_icrSmallc <- RMSE(test$Hip_CONC, unname(pred_icr_smallc) )
pred_icr_allc <- predict(icr_cv_all, test_c)
rmse_icrAll <- RMSE(test$Hip_CONC, unname(pred_icr_allc))
c(rmse_icrSmallc, rmse_icrAll)An error occurred when running the ICR on the complete training set. (?)
dfpicr <- data.frame( True.RMSE = c(rmse_pcrSmallc, rmse_pcrAllc, rmse_pcrAll2c, rmse_icrSmallc, rmse_icrAll), RMSE.cv = c(cv_rmse_pcrSmall, cv_rmse_pcrAll1, cv_rmse_pcrAll2, cv_rmse_icr_small, cv_rmse_icr_all), ncomp = c(pcr_cv_small$bestTune$ncomp, pcr_cv_all$bestTune$ncomp, pcr_cv_all2$bestTune$ncomp, icr_cv_small$bestTune$n.comp, icr_cv_all$bestTune$n.comp ))
rownames(dfpicr) <- c( 'PCR small', 'PCR all (+)', 'PCR all (-)', "ICR small", "ICR all")
pander(t(dfpicr))
| | PCR small | PCR all (+) | PCR all (-) | ICR small | ICR all |
|---|---|---|---|---|---|
| True.RMSE | 9.647 | 6.57 | 6.725 | 8.243 | 13.07 |
| RMSE.cv | 9.653 | 5.306 | 6.042 | 9.002 | 12.98 |
| ncomp | 7 | 35 | 16 | 12 | 30 |
We see with the “true” RMSE that taking, a priori, too high a number of (irrelevant) features is very bad for performance. Moreover, the cross-validated tuning of the number of components becomes less reliable (high variance) and is thus more likely to overfit, as is the case here. Hence, we should either make the CV scheme more reliable (increase the number of repetitions, play with the number of folds, use bootstrap resampling, …) or reduce the feature space beforehand.
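For instance (a sketch only, with illustrative values), the resampling scheme could be made more stable either with many more repetitions of the k-fold CV or with bootstrap resampling, both via trainControl():
# More repetitions of the k-fold CV (slower but more stable error estimates)
rctrl_more_reps <- trainControl(method = "repeatedcv", number = n_folds, repeats = 20, savePredictions = T)
# Bootstrap resampling as an alternative
rctrl_boot <- trainControl(method = "boot", number = 50, savePredictions = T)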
ncoptall <- pcr_cv_all$bestTune$ncomp
library(grid)
vp <- viewport(width = 0.45,
height = 0.48,
x = 0.59,
y = 0.53)
gBigAll <- ggplot(pcr_cv_all) + geom_vline(aes(xintercept = ncoptall, color = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "All (+) ") + geom_vline(aes(xintercept = 12, color = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
gZOOMall <- ggplot(pcr_cv_all) + geom_vline(aes(xintercept = 4, col = "advised(-)"), linetype = 2) + theme_piss(size_c = 9, size_p = 13, legend.position = c(.83, .45)) + labs(title = "Zoom ", y = NULL, x = NULL) + geom_vline(xintercept = 12, col = 3, linetype = 2) + coord_cartesian(xlim = c(0, 20)) + scale_color_manual(name = "# comp.", values = c("advised(-)" = "blue")) + scale_size(range=c(2,10)) + theme(legend.title = element_text(size=10, face="bold")) + theme(legend.text = element_text(size = 8, face = "italic"))
gBigAll
print(gZOOMall, vp = vp)ncoptsmall <- pcr_cv_small$bestTune$ncomp
gpcSmall1 <- ggplot(pcr_cv_small) + geom_vline(aes(xintercept = ncoptsmall, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "Small ") + geom_vline(aes(xintercept = 3, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
ncoptall2 <- pcr_cv_all2$bestTune$ncomp
gpcall2 <- ggplot(pcr_cv_all2) + geom_vline(aes(xintercept = ncoptall2, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "All (-) ") + geom_vline(aes(xintercept = 4, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
grid_arrange_legend(gpcSmall1, gpcall2, ncol = 2, position = "bottom" )ncoptsmalli <- icr_cv_small$bestTune$n.comp
gicSmall1 <- ggplot(icr_cv_small) + geom_vline(aes(xintercept = ncoptsmalli, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "Small ") + geom_vline(aes(xintercept = 6, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
ncoptalli <- icr_cv_all$bestTune$n.comp
gicall2 <- ggplot(icr_cv_all) + geom_vline(aes(xintercept = ncoptalli, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "All ") + geom_vline(aes(xintercept = 11, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
grid_arrange_legend(gicSmall1, gicall2, ncol = 2, position = "bottom" )Based on these plots, we will now refit the models with the more parsimonious numbers of components “advised” above.
fit.pcr.small3 <- pcr(training$Hip_CONC ~., data = as.data.frame(training_c_small), ncomp = 3, validation = "none")
fit.pcr.all12 <- pcr(training$Hip_CONC ~., data = as.data.frame(training_c), ncomp = 12, validation = "none")
fit.pcr.all4 <- pcr(training$Hip_CONC ~., data = as.data.frame(training_c), ncomp = 4,validation = "none")
pred.fitpcrSmall3 <- unname(predict(fit.pcr.small3, test_c[, kept_features], type = "response", ncomp = 1:3))
pred.fitpcrAll12 <- unname(predict(fit.pcr.all12, test_c, type = "response", ncomp = 1:12))
pred.fitpcrAll4 <- unname(predict(fit.pcr.all4, test_c, type = "response", ncomp = 1:4))
#pred.fitpcrSmall3
rmse.fitpcrSmall3 <- RMSE(test$Hip_CONC, pred.fitpcrSmall3[,,3])
rmse.fitpcrAll12 <- RMSE(test$Hip_CONC, pred.fitpcrAll12[,,12])
rmse.fitpcrAll4 <- RMSE(test$Hip_CONC, pred.fitpcrAll4[,,4])
cat("For PCR Small, PCR All (+) and PCR All (-), we have RMSE : ", c(rmse.fitpcrSmall3, rmse.fitpcrAll12, rmse.fitpcrAll4))For PCR Small, PCR All (+) and PCR All (-), we have RMSE : 10.09472 7.629864 9.89201
Somewhat surprisingly, when comparing these values with those in the table above, we remark that the error actually increases when we reduce the number of components. Hence, there was no real concern of overfitting here.
All the observations labelled in the plots have an absolute residual \(\geq 20\).
## PCR
df_cv_pcrall <- pcr_cv_all$pred[pcr_cv_all$pred$ncomp == pcr_cv_all$bestTune[,1],][1:nrow(training),]
rownames(df_cv_pcrall) <- df_cv_pcrall$rowIndex
df_cv_pcrsmall <- pcr_cv_small$pred[pcr_cv_small$pred$ncomp == pcr_cv_small$bestTune[,1],][1:nrow(training),]
rownames(df_cv_pcrsmall) <- df_cv_pcrsmall$rowIndex
df_cv_pcrall2 <- pcr_cv_all2$pred[pcr_cv_all2$pred$ncomp == pcr_cv_all2$bestTune[,1],][1:nrow(training),]
rownames(df_cv_pcrall2) <- df_cv_pcrall2$rowIndex
NOMSG( gpcrall <- PredVsObsPlots(df_cv_pcrall, pred.test = pred_pcrallc, threshold = 20,rmse.cv = cv_rmse_pcrAll1))
NOMSG( gpcrsmall <- PredVsObsPlots(df_cv_pcrsmall, pred.test = pred_pcrsmallc, threshold = 20 , rmse.cv = cv_rmse_pcrSmall))
NOMSG( gpcrall2 <- PredVsObsPlots(df_cv_pcrall2, pred.test = pred_pcrall2c, threshold = 20, rmse.cv = cv_rmse_pcrAll2))
## ICR
df_cv_icrSmall <- icr_cv_small$pred[icr_cv_small$pred$n.comp == icr_cv_small$bestTune[,1],][1:nrow(training),]
rownames(df_cv_icrSmall) <- df_cv_icrSmall$rowIndex
NOMSG( g_icr1 <- PredVsObsPlots(df_cv_icrSmall, pred.test = pred_icr_smallc, threshold = 20, rmse.cv = cv_rmse_icr_small))
df_cv_icrAll <- icr_cv_all$pred[icr_cv_all$pred$n.comp == icr_cv_all$bestTune[,1],][1:nrow(training),]
rownames(df_cv_icrAll) <- df_cv_icrAll$rowIndex
NOMSG( g_icr2 <- PredVsObsPlots(df_cv_icrAll, pred.test = pred_icr_allc, threshold = 20, rmse.cv = cv_rmse_icr_all))grid.arrange(gpcrsmall$gcv, gpcrsmall$gtest, nrow = 1)grid.arrange(gpcrall$gcv, gpcrall$gtest, nrow = 1)grid.arrange(gpcrall2$gcv, gpcrall2$gtest, nrow = 1)grid.arrange(g_icr1$gcv, g_icr1$gtest, nrow = 1)grid.arrange(g_icr2$gcv, g_icr2$gtest, nrow = 1)One observation (design: urine, Qh/2 + 0, Qh/2, 0, 150, 0, 450, 1, 3) appears problematic.
The reason there seem to be more observations deviating from the true value in the cross-validation scheme, while the CV RMSE is still lower, is that the error is averaged over more points in CV (215) than in the test scheme (54).
For prediction purposes, the principal components with low variance may also be important, and in some cases even more important. We therefore now introduce information about the relationship between our target (hippurate concentration) and the ppm's.
We use the original library pls.
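For reference, and as we did with pcr() above, the pls package could also be called directly; a hedged sketch (the number of components is illustrative) would be:
fit_plsr <- plsr(training$Hip_CONC ~ ., data = as.data.frame(training_c), ncomp = 15, validation = "CV") # "CV" uses 10 segments by default
# RMSEP(fit_plsr) would then report the cross-validated error per number of components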
# Explore manually Values for controlling the training process
rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 2, savePredictions = T)
set.seed(12)
pls_cv_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "pls",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:15)))
cv_rmse_pls_small <- min(pls_cv_small$results$RMSE)
set.seed(12)
pls_cv_all <- train(y = training$Hip_CONC,
x = training_c,
method = "pls",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:100)))
cv_rmse_pls_all <- min(pls_cv_all$results$RMSE)
pred_plsSmall <- unname(predict(pls_cv_small, test))
pred_plsall <- unname(predict(pls_cv_all, test))
pred_plsSmallc <- unname(predict(pls_cv_small, test_c[,kept_features]))
pred_plsallc <- unname(predict(pls_cv_all, test_c))
# rmse_plsSmall <- RMSE(test$Hip_CONC, pred_plsSmall)
# rmse_plsAll <- RMSE(test$Hip_CONC, pred_plsall)
# c(rmse_plsSmall, rmse_plsAll)
rmse_plsSmallc <- RMSE(test$Hip_CONC, pred_plsSmallc)
rmse_plsAllc <- RMSE(test$Hip_CONC, pred_plsallc)
c(rmse_plsSmallc, rmse_plsAllc)An interesting thing to remark is that the number of components kept is nearly the same for both methods (around 13); is this natural? However, we see that the true error is significantly smaller for the model on the smaller feature space, while the cross-validated RMSE is actually higher for this model. This shows how important it is to do the evaluation on an independent test set.
Larger gaps for the extreme classes “0” and “300” !
Indeed, it is
The function that performs OPLS in the MBXUCL package, OPLSDA(), states in its help page that it works only for (two-class) classification problems. Hence, we had to implement our own cross-validation scheme to use it for regression (see the kfoldCV() function below).
# oplsc <- OPLSDA(as.matrix(training_class_c), as.factor(training_class$Hip_prop), no = 3)
#
# oplsreg <- OPLSDA(as.matrix(training_c), training$Hip_CONC, no = 3)
# str(oplsreg)
#
# OPLSDA_pred(oplsreg, as.matrix(test_cVar))
# OPLSDA_pred(oplsc, as.matrix(test_class_c))
# test_class$Hip_prop
#
# # With other package
# library(ropls)
#
# plsda <- opls(as.matrix(training_class_c), training_class$Hip_prop)
# olsda2 <- opls(as.matrix(training_class_c), training_class$Hip_prop, orthoI = 2)
# olsda2reg <- opls(as.matrix(training_c), training$Hip_CONC, orthoI = 2)
#
# table(test_class$Hip_prop, predict(olsda2, test_class_c))
# predict(olsda2, test_class_c, type="prob")
#
# predict(olsda2reg, test_cVar)
# test$Hip_CONC
YC <- scale(training$Hip_CONC, center = T, scale =F)
YtestC <- scale(test$Hip_CONC, center = T, scale =F)
## If we let the same seed here as we put for the models with caret, we should retrieve the same data partition : the random sampling should be the same... (To check !!!!)
'kfoldCV' <- function(X, Y, k, argrm = 0, seed = 12){
  # Manual k-fold cross-validation: fctreg() (defined below) must return the
  # regression coefficients of the model fitted on the training folds.
  n = dim(X)[1] ; nbseg = k ; set.seed(seed)
  seg_i <- sample(rep(1:k, length.out = n)) ; yfit.cv = numeric()
  for (i in 1:nbseg){
    val_i <- which(seg_i == i)                     # indices of the held-out fold
    Y.train = Y[-val_i] ; X.train = X[-val_i,] ; X.valid = X[val_i,]
    coefcv <- fctreg(X.train, Y.train, argrm)      # fit on the remaining folds
    yfit.cv[val_i] = X.valid %*% coefcv            # predict the held-out fold
    names(yfit.cv)[val_i] <- val_i
  }
  RMSE.cv = sqrt((1/n) * sum((Y - yfit.cv)^2))     # RMSE over all out-of-fold predictions
  return(list(yfit.cv = yfit.cv, RMSE.cv = RMSE.cv))
}
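For clarity, the cross-validated error returned by kfoldCV() is simply
\[ \mathrm{RMSE}_{cv} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i^{\,cv}\right)^2}, \]
where \(\hat{y}_i^{\,cv}\) is the prediction of observation \(i\) obtained from the model fitted without the fold containing \(i\).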
# fctreg <- function(X,Y,argrm) {
# opls(X, Y, orthoI = argrm-1)@coefficientMN
# }
fctreg <- function(X,Y,argrm) {OPLSDA(X,Y,no=argrm-1)$b}
RMSECV_opls <- numeric()
yfitCV_opls <- list() ; NCmax <- 25 # tune it
for(i in 2:NCmax){
cat(i, "\n")
rescv <- kfoldCV(as.matrix(training_c), YC, k = 10, argrm = i)
RMSECV_opls[i] <- rescv$RMSE.cv
yfitCV_opls[[i]] <- rescv$yfit.cv
}
cv_rmse_opls_all <- min(RMSECV_opls, na.rm = T)
ncoptOplsAll <- which.min(RMSECV_opls)
df_opls_reg <- data.frame(indi = 2:NCmax, RMSEcv = RMSECV_opls[-1])
## Optimal model
fit_oplsOpt <- OPLSDA(as.matrix(training_c), YC, no = ncoptOplsAll-1)
pred_oplsAll <- OPLSDA_pred(fit_oplsOpt, test_c)
rmse_oplsAll <- RMSE(YtestC, pred_oplsAll)
## Advised
fit_oplsAdvised <- OPLSDA(as.matrix(training_c), YC, no = 12)
pred_oplsAllAdvis <- OPLSDA_pred(fit_oplsAdvised, test_c)
rmse_oplsAllAdvis <- RMSE(YtestC, pred_oplsAllAdvis)ggoplsAll <- ggplot(df_opls_reg, aes(x = indi, y = RMSEcv)) + geom_point() + geom_line() + geom_vline(aes(xintercept = ncoptOplsAll, col = "optimal"), linetype = 2) + geom_vline(aes(xintercept = 12, col = "advised"), linetype = 2) +
scale_color_manual(name = "# comp.", values = c("advised" = "green", "optimal" = "red")) + theme_piss(size_c = 12, legend.position = c(.92, .75)) + labs(title = "RMSE cross-validation vs complexity", x = "Number of components", y = "RMSE-cv") +
annotate(geom = "text", label = paste("True RMSE's", " are : \n", round(rmse_oplsAll, 4)),
x = 21, y = 9.5, col = "red" , size = 4) +
annotate(geom = "text", label = paste(round(rmse_oplsAllAdvis, 4)),
x = 24, y = 9.1, col = "green" , size = 4)
ggplotly(ggoplsAll)RMSECV_oplsSmall <- numeric()
yfitCV_oplsSmall <- list() ; NCmaxRed <- 14 # tune it
for(i in 2:NCmaxRed){
cat(i, "\n")
rescv <- kfoldCV(as.matrix(training_c_small), YC, k = 10, argrm = i)
RMSECV_oplsSmall[i] <- rescv$RMSE.cv
yfitCV_oplsSmall[[i]] <- rescv$yfit.cv
}
cv_rmse_opls_Small <- min(RMSECV_oplsSmall, na.rm = T)
ncoptOplsSmall <- which.min(RMSECV_oplsSmall)
## Optimal model
fit_oplsOptSmall <- OPLSDA(as.matrix(training_c), YC, no = ncoptOplsSmall-1) # NB: the full training_c is used here, whereas the CV above used training_c_small
pred_oplsSmall <- OPLSDA_pred(fit_oplsOptSmall, test_c)
rmse_oplsSmall <- RMSE(YtestC, pred_oplsSmall)df_opls_regSmall <- data.frame(indi = 2:NCmaxRed, RMSEcv = RMSECV_oplsSmall[-1])
ggoplsSmall <- ggplot(df_opls_regSmall, aes(x = indi, y = RMSEcv)) + geom_point() + geom_line() + geom_vline(xintercept = ncoptOplsSmall, col = "red", linetype = 2) + theme_piss(size_c = 12, legend.position = c(.92, .75)) + labs(title = "RMSE cross-validation vs complexity", x = "Number of components", y = "RMSE-cv")
ggplotly(ggoplsSmall)Hippurate <- as.factor(training$Hip_prop)
gsopls1 <- DrawScores(fit_oplsAdvised, type.obj = "OPLSDA", drawNames=F, createWindow=FALSE, main = paste("OPLS score plots PC1 and PC1-Ortho (left) PC2 and PC2-Ortho (right) "), axes =c(1,2), color = Hippurate, size_p = 0.5) + theme_piss(size_p = 11)
gsopls2 <- DrawScores(fit_oplsAdvised, type.obj = "OPLSDA", drawNames=F, createWindow=FALSE, main = paste("OPLS score plots PC1 and PC1-Ortho (left) PC2 and PC2-Ortho (right) "), axes =c(2,3), color = Hippurate, size_p = 0.5) + theme_piss(size_p = 11)
plotly::subplot(hide_legend(ggplotly(gsopls1)), ggplotly(gsopls2), shareX = T, shareY= T, which_layout = 2)gloadopls1 <- DrawLoadings(fit_oplsAdvised, type.obj = "OPLSDA", createWindow=FALSE, main = paste("OPLS score plots PC1 and PC1-Ortho (left) PC2 and PC2-Ortho (right) "), loadingstype="l", axes =c(1,2), xlab = "ppm")[[1]] + theme_piss(size_p = 11)
# gloadopls2 <- DrawLoadings(fit_oplsOpt, type.obj = "OPLSDA", createWindow=FALSE, main = paste("OPLS score plots PC1 and PC1-Ortho (left) PC2 and PC2-Ortho (right) "), loadingstype="l", axes =c(3,4), xlab = "ppm")[[1]] + theme_piss(size_p = 11)
#
# plotly::subplot(hide_legend(ggplotly(gloadopls1)), ggplotly(gloadopls2), shareX = T, shareY= T, which_layout = 2,nrows = 2)
ggplotly(gloadopls1) Sparse PLS (SPLS) performs simultaneous dimension reduction and variable selection; it is explained in detail by (Chun and Keles 2007).
With the library spls.
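For reference, the spls package also provides its own tuning helper; a hedged sketch (grid values illustrative, and assuming cv.spls() returns the selected values as eta.opt and K.opt) could be:
cv_spls_direct <- cv.spls(as.matrix(training_c_small), training$Hip_CONC, fold = 10, K = 1:10, eta = seq(0.1, 0.9, by = 0.2)) # cross-validated error surface over (K, eta)
# fit_spls <- spls(as.matrix(training_c_small), training$Hip_CONC, K = cv_spls_direct$K.opt, eta = cv_spls_direct$eta.opt)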
n_folds <- 10
rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 1, verboseIter = T, savePredictions = T)
set.seed(12)
spls_cv_small <- train(y = training$Hip_CONC,
x = training_c_small,
method = "spls",
trControl = rctrl1,
tuneLength = 4)
spls_cv_small
cv_rmse_spls_small <- min(spls_cv_small$results$RMSE)
pred_splsSmall <- predict(spls_cv_small, test)
pred_splsSmallc <- predict(spls_cv_small, test_c[, kept_features])
# All strongly negative.... weird
plot(spls_cv_small)
grid_spls <- expand.grid(K = seq(1, 100, by = 4),
eta = c(.1, .3, .6),
kappa = .5)
set.seed(12)
spls_cv_all <- train(y = training$Hip_CONC,
x = training_c,
method = "spls",
trControl = rctrl1,
tuneGrid = grid_spls)
#tuneLength = 2)
cv_rmse_spls_all <- min(spls_cv_all$results$RMSE)
pred_spls_allc <- unname(predict(spls_cv_all, test_c))
rmse_splsSmall <- RMSE(test$Hip_CONC, unname(pred_splsSmall))
rmse_splsSmallc <- RMSE(test$Hip_CONC, unname(pred_splsSmallc))
rmse_splsAll <- RMSE(test$Hip_CONC, unname(predict(spls_cv_all, test)))
rmse_splsAllc <- RMSE(test$Hip_CONC,pred_spls_allc)
c(rmse_splsSmall, rmse_splsAll)
c(rmse_splsSmallc, rmse_splsAllc)dfpls <- data.frame( True.RMSE = c(rmse_plsSmallc, rmse_plsAllc, rmse_oplsAll, rmse_oplsSmall,rmse_splsSmallc, rmse_splsAllc), RMSE.cv = c(cv_rmse_pls_small, cv_rmse_pls_all, cv_rmse_opls_all,cv_rmse_opls_Small, cv_rmse_spls_small, cv_rmse_spls_all), ncomp = c(pls_cv_small$bestTune$ncomp, pls_cv_all$bestTune$ncomp, 12, ncoptOplsSmall, spls_cv_small$bestTune$K, spls_cv_all$bestTune$K))
rownames(dfpls) <- c('PLS small', 'PLS all', "OPLS all", "OPLS Small", 'SPLS small','SPLS all')
pander(t(dfpls))
| | PLS small | PLS all | OPLS all | OPLS Small | SPLS small | SPLS all |
|---|---|---|---|---|---|---|
| True.RMSE | 8.236 | 6.599 | 4.215 | 7.491 | 8.223 | 9.136 |
| RMSE.cv | 9.459 | 5.265 | 5.267 | 9.896 | 9.059 | 10.12 |
| ncomp | 15 | 12 | 12 | 4 | 9 | 21 |
ncoptplsAll <- pls_cv_all$bestTune$ncomp
gPlsAll <- ggplot(pls_cv_all) + geom_vline(aes(xintercept = ncoptplsAll, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "All ") + geom_vline(aes(xintercept = ncoptplsAll-1, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
ncoptplsSmall <- pls_cv_small$bestTune$ncomp
gPlsSmall <- ggplot(pls_cv_small) + geom_vline(aes(xintercept = ncoptplsSmall, col = "optimal"), linetype = 2) + theme_piss(size_c = 12,size_p = 15) + labs(title = "Small ") + geom_vline(aes(xintercept = 4, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c(optimal = "red", advised = "green"))
grid_arrange_legend(gPlsAll, gPlsSmall, ncol = 2, position = "bottom" )df_cv_plsSmall <- pls_cv_small$pred[pls_cv_small$pred$ncomp == pls_cv_small$bestTune[,1],][1:nrow(training),]
rownames(df_cv_plsSmall) <- df_cv_plsSmall$rowIndex
NOMSG( g_pls1 <- PredVsObsPlots(df_cv_plsSmall, pred.test = pred_plsSmallc, threshold = 20, rmse.cv = cv_rmse_pls_small))
df_cv_plsall <- pls_cv_all$pred[pls_cv_all$pred$ncomp == pls_cv_all$bestTune[,1],][1:nrow(training),]#[order(pls_cv_all$pred$rowIndex),]
rownames(df_cv_plsall) <- df_cv_plsall$rowIndex
NOMSG( g_pls2 <- PredVsObsPlots(df_cv_plsall, pred.test = pred_plsallc, threshold = 30, rmse.cv = cv_rmse_pls_all ))
grid.arrange(g_pls1$gcv, g_pls1$gtest,g_pls2$gcv, g_pls2$gtest, nrow = 2, top = textGrob(expression( italic(" `Small´ (top) vs `All´ (below)")), gp = gpar(fontsize = 19, font = 2, col = "blue")) ) As we centered both the data and the target to run the model, it was important to back-transform the predictions here.
df_cv_OplsAll <- data.frame( obs = YC + mean(training$Hip_CONC), pred = yfitCV_opls[[12]] + mean(training$Hip_CONC)) # 12 is the advised number of comp.
NOMSG( g_OplsAll <- PredVsObsPlots(df_cv_OplsAll, pred.test = pred_oplsAll + mean(training$Hip_CONC), threshold = 20, rmse.cv = cv_rmse_opls_all))
df_cv_OplsSmall <- data.frame( obs = YC + mean(training$Hip_CONC), pred = yfitCV_oplsSmall[[ncoptOplsSmall]] + mean(training$Hip_CONC))
NOMSG( g_OplsSmall <- PredVsObsPlots(df_cv_OplsSmall, pred.test = pred_oplsSmall + mean(training$Hip_CONC), threshold = 20, rmse.cv = cv_rmse_opls_Small ))
grid.arrange(g_OplsAll$gcv, g_OplsAll$gtest, g_OplsSmall$gcv, g_OplsSmall$gtest, nrow = 2, top = textGrob(expression( italic(" `All´ (top) vs `Small´ (below)")), gp = gpar(fontsize = 19, font = 2, col = "blue")) ) df_cv_SplsSmall <- spls_cv_small$pred[spls_cv_small$pred$K == spls_cv_small$bestTune$K[1] & spls_cv_small$pred$eta == spls_cv_small$bestTune$eta[1] & spls_cv_small$pred$kappa == spls_cv_small$bestTune$kappa[1],][1:nrow(training),]
rownames(df_cv_SplsSmall) <- df_cv_SplsSmall$rowIndex
NOMSG( g_Spls1 <- PredVsObsPlots(df_cv_SplsSmall, pred.test = pred_splsSmallc, threshold = 20, rmse.cv = cv_rmse_spls_small))
df_cv_SplsAll <- spls_cv_all$pred[spls_cv_all$pred$K == spls_cv_all$bestTune$K[1] & spls_cv_all$pred$eta == spls_cv_all$bestTune$eta[1] & spls_cv_all$pred$kappa == spls_cv_all$bestTune$kappa[1],][1:nrow(training),]
rownames(df_cv_SplsAll) <- df_cv_SplsAll$rowIndex
NOMSG( g_Spls2 <- PredVsObsPlots(df_cv_SplsAll, pred.test = pred_spls_allc, threshold = 30, rmse.cv = cv_rmse_spls_all ))
grid.arrange(g_Spls1$gcv, g_Spls1$gtest, g_Spls2$gcv, g_Spls2$gtest, nrow = 2, top = textGrob(expression( italic(" `Small´ (top) vs `All´ (below)")), gp = gpar(fontsize = 19, font = 2, col = "blue")) ) (Investigate why this particular plot shows here ? Weird… Run the code to see the actual plot)
df_coef_reglin <- data.frame( ppm = as.numeric(substr(names(regLin$finalModel$coefficients)[-1], 2, nchar(names(regLin$finalModel$coefficients)[3])-2)), coefficients = unname(regLin$finalModel$coefficients[-1]) )
myPalette <- colorRampPalette(rev(brewer.pal(10, "RdGy")))
#ggplot(df_coef_reglin, aes(x = ppm, y = coefficients, col = coefficients)) + geom_line() + geom_point( size = 2) + theme_piss() + scale_colour_gradientn("Coefficient", limits = c(min(df_coef_reglin$coefficients), max(df_coef_reglin$coefficients) ), colours = myPalette(10))
ggplot(df_coef_reglin, aes(x = ppm, y = coefficients, col = coefficients)) + geom_line() + geom_point( size = 2) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_coef_reglin$coefficients), max(df_coef_reglin$coefficients) ), low = "darkred", mid = "gray80", high = "darkred") + scale_x_reverse() df_coef_reglinstpw <- data.frame( ppm = as.numeric(substr(names(regLin_stpw$finalModel$coefficients)[-1], 2, nchar(names(regLin_stpw$finalModel$coefficients)[3])-2)), coefficients = unname(regLin_stpw$finalModel$coefficients[-1]) )
ggplot(df_coef_reglinstpw, aes(x = ppm, y = coefficients, col = coefficients)) + geom_line() +geom_point( size = 2) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_coef_reglinstpw$coefficients), max(df_coef_reglinstpw$coefficients) ), low = "darkred", mid = "gray80", high = "darkred") + scale_x_reverse()df_coef_reglinstpwAll <- data.frame( ppm = as.numeric(substr(names(regLin_stpw_all$finalModel$coefficients)[-1], 2, nchar(names(regLin_stpw_all$finalModel$coefficients)[3])-2)), coefficients = unname(regLin_stpw_all$finalModel$coefficients[-1]) )
ggplot(df_coef_reglinstpwAll, aes(x = ppm, y = coefficients, col = coefficients)) + geom_line() +geom_point( size = 2) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_coef_reglinstpwAll$coefficients), max(df_coef_reglinstpwAll$coefficients) ), low = "darkred", mid = "gray80", high = "darkred") + scale_x_reverse()We take the object coming from cv.glmnet above with the optimal value for hyperparameters. With this, we can extract the coefficients.
\(\qquad\qquad\qquad\qquad\) \(\alpha\) = 0.4
tmp_coeffs <- coef(glmnet_fitOptim, s = "lambda.min")
df_coeffs <- data.frame(ppm = as.numeric(tmp_coeffs@Dimnames[[1]][tmp_coeffs@i][-1]), coefficient = tmp_coeffs@x[-c(1,100)])
ggplot(df_coeffs, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(tmp_coeffs@x), max(tmp_coeffs@x)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_continuous(breaks = c(rev(as.numeric(unique(substr(tmp_coeffs@Dimnames[[1]][tmp_coeffs@i][-1], 1, 1)))))) + scale_x_reverse()tmp_coeffsRid <- coef(glmnet_fitRidge, s = "lambda.min")
df_coeffsRid <- data.frame(ppm = as.numeric(tmp_coeffsRid@Dimnames[[1]][tmp_coeffsRid@i][-1]), coefficient = tmp_coeffsRid@x[-c(1,100)])
gr1 <- ggplot(df_coeffsRid, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(tmp_coeffsRid@x), max(tmp_coeffsRid@x)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_continuous(breaks = c(rev(as.numeric(unique(substr(tmp_coeffsRid@Dimnames[[1]][tmp_coeffs@i][-1], 1, 1)))))) + scale_x_reverse() + labs(title = "Ridge")
tmp_coeffsLasso <- coef(glmnet_fitLasso, s = "lambda.min")
df_coeffsRidLasso <- data.frame(ppm = as.numeric(tmp_coeffsLasso@Dimnames[[1]][tmp_coeffsLasso@i][-1]), coefficient = tmp_coeffsLasso@x[-c(1,100)])
gr2 <- ggplot(df_coeffsRidLasso, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point(size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(tmp_coeffsLasso@x), max(tmp_coeffsLasso@x)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_continuous(breaks = c(rev(as.numeric(unique(substr(tmp_coeffsLasso@Dimnames[[1]][tmp_coeffs@i][-1], 1, 1)))))) + scale_x_reverse() + labs(title = "LASSO")
grid_arrange_legend(gr1, gr2, ncol = 2, bottom = "right")We decide to represent only the coefficients associated with the first 5 components, as it is …
You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
ppmpcr <- as.numeric(substr(dimnames(pcr_cv_all$finalModel$coefficients)[[1]], 2, nchar(dimnames(pcr_cv_all$finalModel$coefficients)[[1]][3])-2))
df_pcrAllComp <- melt(pcr_cv_all$finalModel$coefficients)
df_pcrAllComp <- df_pcrAllComp[, colnames(df_pcrAllComp) %in% c("Var2") == F ]
colnames(df_pcrAllComp) <- c("ppm", "Comp", "coefficients")
df_pcrAllComp$ppm_num <- as.numeric(substr(as.character(df_pcrAllComp$ppm), 2, nchar(as.character(df_pcrAllComp$ppm)[3])-2))
## PLOTLY
gppAll <- ggplot(df_pcrAllComp[1:(5*ncol(training_c)),], aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + theme_piss() + scale_x_reverse() + labs(title = "Visualize First Components")
as.widget(ggplotly(gppAll))## tweenr PACKAGE
# ggg <- ggplot(df_pcrAllComp, aes(x = ppm_num, y = coefficients, frame = Comp)) + geom_line() + geom_point( size = 2) + theme_piss() + scale_x_reverse()
#
# library(gganimate)
#
# gganimate(ggg, "exxx.gif")Look at the different y-scales between the two graphs !!!
df_pcr_opt <- data.frame(ppm = as.numeric(substr(as.character(names(pcr_cv_all$finalModel$coefficient[,,35])), 2, nchar(as.character(df_pcrAllComp$ppm)[3])-2)), coefficient = unname(pcr_cv_all$finalModel$coefficient[,,35]))
ggplot(df_pcr_opt, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss(size_p = 15) + scale_colour_gradient2("Value", limits = c(min(df_pcr_opt$coefficient), max(df_pcr_opt$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (35)")You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
df_pcrSmallComp <- melt(pcr_cv_small$finalModel$coefficients)
df_pcrSmallComp <- df_pcrSmallComp[, colnames(df_pcrSmallComp) %in% c("Var2") == F ]
colnames(df_pcrSmallComp) <- c("ppm", "Comp", "coefficients")
df_pcrSmallComp$ppm_num <- as.numeric(substr(as.character(df_pcrSmallComp$ppm), 2, nchar(as.character(df_pcrSmallComp$ppm)[3])-2))
gppSmall <- ggplot(df_pcrSmallComp[1:(5*length(kept_features)),], aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + theme_piss() + scale_x_reverse() + labs(title = "Visualize First Components")
as.widget(ggplotly(gppSmall))Look at the different y-scales between the two graphs !!!
df_pcr_optSmall <- data.frame(ppm = as.numeric(substr(as.character(names(pcr_cv_small$finalModel$coefficient[,,ncoptsmall])), 2, nchar(as.character(df_pcrAllComp$ppm)[3])-2)), coefficient = unname(pcr_cv_small$finalModel$coefficient[,,ncoptsmall]))
ggplot(df_pcr_optSmall, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss(size_p = 15) + scale_colour_gradient2("Value", limits = c(min(df_pcr_optSmall$coefficient), max(df_pcr_optSmall$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (7)")You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
df_pcrall22Comp <- melt(pcr_cv_all2$finalModel$coefficients)
df_pcrall22Comp <- df_pcrall22Comp[, colnames(df_pcrall22Comp) %in% c("Var2") == F ]
colnames(df_pcrall22Comp) <- c("ppm", "Comp", "coefficients")
df_pcrall22Comp$ppm_num <- as.numeric(substr(as.character(df_pcrall22Comp$ppm), 2, nchar(as.character(df_pcrall22Comp$ppm)[3])-2))
gppAll2 <- ggplot(df_pcrall22Comp[1:(5*ncol(training_c)),], aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + theme_piss() + scale_x_reverse() + labs(title = "Visualize First Components") + coord_cartesian(ylim = c(-3000, 9000))
as.widget(ggplotly(gppAll2 ))Same y-scales between the two graphs
df_pcr_opt2 <- data.frame(ppm = as.numeric(substr(as.character(names(pcr_cv_all2$finalModel$coefficient[,,ncoptall2])), 2, nchar(as.character(df_pcrAllComp$ppm)[3])-2)), coefficient = unname(pcr_cv_all2$finalModel$coefficient[,,ncoptall2]))
ggplot(df_pcr_opt2, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss(size_p = 15) + scale_colour_gradient2("Value", limits = c(min(df_pcr_opt2$coefficient), max(df_pcr_opt2$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (16)") + coord_cartesian(ylim = c(-3000, 9000))Just for “visualization” …: (top) Graph of the coefficients associated with the selected independent components in the (linear) regression model. (below) Heatmap of these components against the ‘individuals’ to see their respective contributions. We put the hippurate concentrations of the ‘individuals’ as y-labels to add relevant information; we could use other meta-data instead.
# icr_cv_small$finalModel$model$coefficients
# icr_cv_small$finalModel$model$terms
#
# icr_cv_small$finalModel$model$model
# icr_cv_small$finalMode$names
# icr_cv_small$finalMode$param
# df_icrSmallComp <- melt(icr_cv_small$finalModel$coefficients)
# df_icrSmallComp <- df_icrSmallComp[, colnames(df_icrSmallComp) %in% c("Var2") == F ]
# colnames(df_icrSmallComp) <- c("ppm", "Comp", "coefficients")
# df_icrSmallComp$ppm_num <- as.numeric(substr(as.character(df_icrSmallComp$ppm), 2, nchar(as.character(df_icrSmallComp$ppm)[3])-2))
df_icrSmallICA <- data.frame(ICA = as.numeric(substr(names(icr_cv_small$finalModel$model$coefficients)[-1], 4, 5)), coefficients = icr_cv_small$finalModel$model$coefficients[-1] )
ggicrSmall <- ggplot(df_icrSmallICA, aes(x = ICA, y = coefficients, col = coefficients)) + geom_line(size = 0.55) + geom_point( size = 1.9) + theme_piss() + labs(title = "Visualize Components") + scale_colour_gradient2("Value", limits = c(min(df_icrSmallICA$coefficient), max(df_icrSmallICA$coefficient)), low = "darkred", mid = "gray80", high = "darkred" )
ggicrSmall
#ggplotly(ggicrSmall )
mat.Coeff.ICR <- as.matrix(icr_cv_small$finalModel$model$model)
# basic heatmap : orange and white show higher values
namesSpec <- rep(c( rep("", 4), "0", rep("", 4), "75", rep("", 4), "150", rep("", 4), "300", rep("", 4)), 9)
hm_icrSmall <- heatmap(mat.Coeff.ICR[,-1], Rowv=NA, Colv=NA,
col = heat.colors(256),
margins=c(5, 0) , labRow = namesSpec[-1] ,
main = 'Heatmap of Hippurate Conc. VS components')# ggplot( as.data.frame(mat.Coeff.ICR[,-1]), aes(x = y, y = teams)) +
# geom_tile(aes(fill = performance))
#
# ggplotly(sfo_hmap1)
#
# resH <- plot_heatmapsOnSel(icaSet = mat.Coeff.ICR, selCutoff = 3, level = "features", keepVar = keepVar,
# doSamplesDendro = FALSE, doGenesDendro = FALSE, keepComp = 2,
# heatmapCol = maPalette(low = "blue", high = "red", mid = "yellow", k=44),
# file = "heatmapWithoutDendro", annot2col=annot2col(params))
plotHeaticr1 <-plot_ly(x=namesSpec, y=colnames(mat.Coeff.ICR[,-1]), z = mat.Coeff.ICR[,-1], type="heatmap")
namesObsPlotly <- make.names(rownames(mat.Coeff.ICR[,-1]))
xlab = list(categoryorder = "array", categoryarray = namesObsPlotly)
ylab = list(categoryorder = "array", categoryarray = c("d", "f", "z","g"))
plotHeaticr1 <-plot_ly(x=colnames(mat.Coeff.ICR[,-1]), y = namesObsPlotly, z = mat.Coeff.ICR[,-1], type="heatmap") %>%
layout(yaxis = xlab)
plotHeaticr1#embed_notebook(plotHeaticr1)
# m <- matrix(rnorm(12), nrow = 4, ncol = 3)
# xlab = list(categoryorder = "array", categoryarray = c("a", "b", "c","d"))
# ylab = list(categoryorder = "array", categoryarray = c("d", "f", "z","g"))
# plot_ly(x = c("a1", "b1", "c1","d1"), y = c("d", "f", "z","g"), z = m, type = "heatmap") %>%
# layout(xaxis = xlab, yaxis = ylab)(top) Graph of the coefficients associated with the selected independent components in the (linear) regression model. (below) Heatmap of these components against the ‘individuals’ to see their respective contributions. We put the hippurate concentrations of the ‘individuals’ as y-labels to add relevant information; we could use other meta-data instead.
df_icrAllICA <- data.frame(ICA = as.numeric(substr(names(icr_cv_all$finalModel$model$coefficients)[-1], 4, 5)), coefficients = icr_cv_all$finalModel$model$coefficients[-1] )
ggicrAll <- ggplot(df_icrAllICA, aes(x = ICA, y = coefficients, col = coefficients)) + geom_line(size = 0.55) + geom_point( size = 1.9) + theme_piss() + scale_x_reverse() + labs(title = "Visualize Components") + scale_colour_gradient2("Value", limits = c(min(df_icrAllICA$coefficient), max(df_icrAllICA$coefficient)), low = "darkred", mid = "gray80", high = "darkred" )
ggicrAll
mat.Coeff.ICRAll <- as.matrix(icr_cv_all$finalModel$model$model)
# basic heatmap : orange and white show higher values
hm_icrAll <- heatmap(mat.Coeff.ICRAll[,-1], Rowv=NA, Colv=NA,
col = heat.colors(256),
margins=c(5, 0) , labRow = namesSpec[-1] ,
main = 'Heatmap of Hippurate Conc. VS components')plotHeaticr2 <-plot_ly(x=namesSpec[-1], y=colnames(mat.Coeff.ICRAll[,-1]), z = mat.Coeff.ICRAll[,-1], type="heatmap")
plotHeaticr2 <-plot_ly(x=colnames(mat.Coeff.ICRAll[,-1]), y = namesObsPlotly, z = mat.Coeff.ICRAll[,-1], type="heatmap") %>%
layout(yaxis = xlab)
plotHeaticr2#embed_notebook(plotHeaticr2)You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
#pls_cv_small$finalModel$coefficients
# plot(pls_cv_small$finalModel, plottype = "coef", ncomp=1:3, legendpos = "bottomleft", labels = "numbers",
# xlab = "nm")
# plot(pls_cv_small$finalModel, type = "coefficients", ...)
df_plsSmallComp <- melt(pls_cv_small$finalModel$coefficients)
df_plsSmallComp <- df_plsSmallComp[, colnames(df_plsSmallComp) %in% c("Var2") == F ]
colnames(df_plsSmallComp) <- c("ppm", "Comp", "coefficients")
df_plsSmallComp$ppm_num <- as.numeric(substr(as.character(df_plsSmallComp$ppm), 2, nchar(as.character(df_plsSmallComp$ppm)[3])-2))
## PLOTLY
gplsSmall <- ggplot(df_plsSmallComp[1:(5*length(kept_features)),], aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + labs(title = "Visualize First Components") + theme_piss() + scale_x_reverse()
ggplotly(gplsSmall)Look at the different y-scales between the two graphs !!!
df_pls_opt <- data.frame(ppm = as.numeric(substr(as.character(names(pls_cv_small$finalModel$coefficient[,,pls_cv_small$bestTune$ncomp])), 2, nchar(as.character(df_plsSmallComp$ppm)[3])-2)), coefficient = unname(pls_cv_small$finalModel$coefficient[,,pls_cv_small$bestTune$ncomp]))
ggplot(df_pls_opt, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_pls_opt$coefficient), max(df_pls_opt$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (16)")You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
df_plsAllComp <- melt(pls_cv_all$finalModel$coefficients)
df_plsAllComp <- df_plsAllComp[, colnames(df_plsAllComp) %in% c("Var2") == F ]
colnames(df_plsAllComp) <- c("ppm", "Comp", "coefficients")
df_plsAllComp$ppm_num <- as.numeric(substr(as.character(df_plsAllComp$ppm), 2, nchar(as.character(df_plsAllComp$ppm)[3])-2))
## PLOTLY
gplsAll <- ggplot(df_plsAllComp[1:(5*ncol(training_c)),], aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + labs(title = "Visualize First Components") + theme_piss() + scale_x_reverse() + coord_cartesian(ylim = c(-5000, 9e3))
ggplotly(gplsAll)Scales are the same
df_pls_opt2 <- data.frame(ppm = as.numeric(substr(as.character(names(pls_cv_all$finalModel$coefficient[,,pls_cv_all$bestTune$ncomp])), 2, nchar(as.character(df_plsSmallComp$ppm)[3])-2)), coefficient = unname(pls_cv_all$finalModel$coefficient[,,pls_cv_all$bestTune$ncomp]))
ggplot(df_pls_opt2, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_pls_opt2$coefficient), max(df_pls_opt2$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (16)") + coord_cartesian(ylim = c(-5000, 9e3))Total of the 9 components
You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
#spls_cv_small$finalModel$betahat
df_SplsSmallComp <- melt(spls_cv_small$finalModel$betamat)
df_SplsSmallComp <- df_SplsSmallComp[, colnames(df_SplsSmallComp) %in% c("Var2") == F ]
colnames(df_SplsSmallComp) <- c("ppm", "coefficients", "Comp")
df_SplsSmallComp$ppm_num <- as.numeric(substr(as.character(dimnames(spls_cv_small$finalModel$betahat)[[1]]), 1, nchar(as.character(dimnames(spls_cv_small$finalModel$betahat)[[1]])[3])-1))
df_SplsSmallComp$Comp <- as.factor(df_SplsSmallComp$Comp)
## PLOTLY
gSplsSmAll <- ggplot(df_SplsSmallComp, aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + labs(title = "Visualize All Components") + theme_piss() + scale_x_reverse() + coord_cartesian(ylim = c(-12, 25))
ggplotly(gSplsSmAll)Scales are the same
df_splsSmall_opt <- data.frame(ppm = as.numeric(substr(dimnames(spls_cv_small$finalModel$betahat)[[1]], 1, nchar(as.character(dimnames(spls_cv_small$finalModel$betahat)[[1]][3]))-1)), coefficient = unname(spls_cv_small$finalModel$betamat[[spls_cv_small$bestTune$K]]))
ggplot(df_splsSmall_opt, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_splsSmall_opt$coefficient), max(df_splsSmall_opt$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (9)") + coord_cartesian(ylim = c(-12, 25))You can click on the legend (right) to deselect one component in the plot to better visualize component(s) of interest
df_SplsAllComp <- melt(spls_cv_all$finalModel$betamat)
df_SplsAllComp <- df_SplsAllComp[, colnames(df_SplsAllComp) %in% c("Var2") == F ]
colnames(df_SplsAllComp) <- c("ppm", "coefficients", "Comp")
df_SplsAllComp$ppm_num <- as.numeric(substr(as.character(dimnames(spls_cv_all$finalModel$betahat)[[1]]), 1, nchar(as.character(dimnames(spls_cv_all$finalModel$betahat)[[1]])[3])-1))
## Visualize only the first 5 components
df_SplsAllComp5 <- df_SplsAllComp[df_SplsAllComp$Comp < 6,]
df_SplsAllComp5$Comp <- as.factor(df_SplsAllComp5$Comp)
## PLOTLY
gSplsAll <- ggplot(df_SplsAllComp5, aes(x = ppm_num, y = coefficients, col = Comp)) + geom_line(size = 0.15) + geom_point( size = 0.6) + labs(title = "Visualize First Components") + theme_piss() + scale_x_reverse() + coord_cartesian(ylim = c(-4, 8))
ggplotly(gSplsAll)#pspls <- ggplotly(gSplsAll)
#link_pspls <- plotly_POST(p, filename="ggplot2/intro-2")Scales are the same
df_splsAll_opt <- data.frame(ppm = as.numeric(substr(dimnames(spls_cv_all$finalModel$betahat)[[1]], 1, nchar(as.character(dimnames(spls_cv_all$finalModel$betahat)[[1]][3]))-1)), coefficient = unname(spls_cv_all$finalModel$betamat[[spls_cv_all$bestTune$K]]))
ggplot(df_splsAll_opt, aes(x = ppm, y = coefficient, col = coefficient)) + geom_line() + geom_point( size = 1.9) + theme_piss() + scale_colour_gradient2("Value", limits = c(min(df_splsAll_opt$coefficient), max(df_splsAll_opt$coefficient)), low = "darkred", mid = "gray80", high = "darkred" ) + scale_x_reverse() + labs(title = "With optimal Number of Components (25)") + coord_cartesian(ylim = c(-4, 8))Interestingly, it is rather the low ppm values that seem to have the largest effects.
As for PCR, the first component has coefficients of much lower magnitude than the other components; we can see this for both PLS and SPLS. For SPLS
Now, we will explore some models that are known in the machine learning literature to have among the best performance across a wide range of contexts. These methods are thus not oriented towards omics data analysis like some of those seen above, but since our objective is to build the most accurate model, it is interesting to see whether one of them could outperform the preceding models. We will not explain them in detail as this is not part of the course and they are rather complex.
These methods are random forest and XGBoost; their theoretical background goes beyond the scope of this project. They are really “off-the-shelf” methods, meaning that they do not really need preprocessing. Moreover, the feature selection process is done ‘inside’ the algorithms through variable importance metrics (actually, we used these metrics to select our features), that is, how much each feature contributes to the predictive performance of the model.
Outlined by (Breiman 2001), random forest is an algorithm that fits several trees on bootstrap samples of the training set and aggregates their predictions (bootstrap aggregating, or bagging). Hence, it is often preferred to ‘individual’ trees as it reduces their tendency to overfit.
We use the randomForest package (together with e1071 and foreach) and run the random forest algorithm in parallel, as the process is computationally demanding. This allows us to consider a finer grid for the tuning hyperparameter mtry, the number of predictors sampled randomly as split candidates at each node.
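As a point of reference (a sketch only, outside caret, with an illustrative mtry value), the randomForest package could be called directly, and its variable importance measures are what we alluded to above:
library(randomForest)
set.seed(12)
rf_direct <- randomForest(x = as.matrix(training_c), y = training$Hip_CONC, mtry = 100, ntree = 500, importance = TRUE) # mtry value illustrative
head(sort(importance(rf_direct)[, "%IncMSE"], decreasing = TRUE)) # most 'important' ppm's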
We first fit the model on the raw (non-centered) features, and then on the centered ones.
n_folds <- 10
rctrl1 <- trainControl(method = "repeatedcv", number = n_folds, repeats = 1, verbose = T, savePredictions = T)
# Without Centering the Data !!!!
time <- proc.time()
set.seed(12)
rf_all_nocenter <- train(y = training$Hip_CONC,
x = training[, 9:ncol(training)],
method = "parRF",
trControl = rctrl1,
tuneLength = 15,
verbose = T)
(proc.time() - time)[3] # 637 sec
plot(rf_all_nocenter)
cv_rmse_rf_all_nocenter <- min(rf_all_nocenter$results$RMSE)
pred_rf_all_noCenter <- predict(rf_all_nocenter, test)
rmse_rf_all_nocenter <- RMSE(test$Hip_CONC, unname(pred_rf_all_noCenter))
#### CENTERING the training features !
time <- proc.time()
set.seed(12)
rf_all <- train(y = training$Hip_CONC,
x = training_c,
method = "parRF",
trControl = rctrl1,
tuneLength = 23,
verbose = T)
(proc.time() - time)[3] # 943 sec
cv_rmse_rf_all <- min(rf_all$results$RMSE)
#pred_rf_all <- predict(rf_all, test)
pred_rf_allc <- predict(rf_all, test_c)
#rmse_rf_all <- RMSE(test$Hip_CONC, unname(pred_rf_all))
rmse_rf_allc <- RMSE(test$Hip_CONC, unname(pred_rf_allc))
cat("RMSE from ", rmse_rf_all_nocenter, " to", rmse_rf_allc, "by centering features")gained nearly 3 points of ´true´ RMSE only by centering the training features.
To run !!!!!!!!
time <- proc.time()
set.seed(12)
rf_allMore <- train(y = training$Hip_CONC,
x = training_c,
method = "parRF",
trControl = rctrl1,
tuneGrid = expand.grid(.mtry = c(10, 50, 100, 140, seq(150, 450, by = 10 ))),
verbose = T)
(proc.time() - time)[3] # 943 sec
cv_rmse_rf_allMore <- min(rf_allMore$results$RMSE)
#pred_rf_all <- predict(rf_all, test)
pred_rf_allMore <- predict(rf_allMore, test_c)
rmse_rf_allMore <- RMSE(test$Hip_CONC, unname(pred_rf_allMore))Note that when we have no idea which hyperparameter values to try, we can let caret choose them automatically (via tuneLength); it makes a crude guess as to what values to use. The idea we will then follow is to start from this automatic grid and refine it around the best values found, as sketched below.
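A minimal sketch of this two-step strategy (hypothetical object names, shown with pls for speed; the XGBoost code below applies the same idea):
# Step 1: let caret pick a crude grid automatically
coarse_fit <- train(y = training$Hip_CONC, x = training_c_small, method = "pls", trControl = rctrl1, tuneLength = 5)
# Step 2: refine the grid around the best value found
refined_grid <- expand.grid(.ncomp = pmax(1, coarse_fit$bestTune$ncomp + (-2:2)))
refined_fit <- train(y = training$Hip_CONC, x = training_c_small, method = "pls", trControl = rctrl1, tuneGrid = refined_grid)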
From the xgboost package
## Guess the grid. Then update values
xgb_grid_1 = expand.grid(
nrounds = c(50, 150),
max_depth = c(20, 12, 30),
eta = c(.3, .1),
gamma = c(1, 2),
colsample_bytree = c(.85),
min_child_weight = c(1),
subsample = c(.9) )
xgb_trcontrol_1 = trainControl(
method = "repeatedcv",
number = 10,
repeats = 1, # Very Very slow ...... So no repetitions here
verboseIter = T,
allowParallel = T,
savePredictions = T
)
set.seed(12)
time <- proc.time()
xgb_train_4 <- train(y = training$Hip_CONC,
x = training_c,
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneGrid = xgb_grid_1)
(proc.time() - time)[3]
# Approx 30min with repeats =2.
cv_rmse_xgbAll <- min(xgb_train_4$results$RMSE)
pred_xgb_reg <- predict(xgb_train_4, test_c)
rmse_xgb_reg <- RMSE(test$Hip_CONC, unname(pred_xgb_reg)) # 21
### ========================
# Automatic Grid (try)
set.seed(12)
time <- proc.time()
xgb_train_AutoGrid2 <- train(y = training$Hip_CONC,
x = training_c,
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneLength = 2) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 85 sec.
set.seed(12)
time <- proc.time()
xgb_train_AutoGrid3 <- train(y = training$Hip_CONC,
x = training_c,
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneLength = 3) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 323 sec
set.seed(12)
time <- proc.time()
xgb_train_AutoGrid4 <- train(y = training$Hip_CONC,
x = training_c,
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneLength = 4) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 790 sec
cv_rmse_xgbAuto2 <- min(xgb_train_AutoGrid2$results$RMSE)
cv_rmse_xgbAuto3 <- min(xgb_train_AutoGrid3$results$RMSE)
cv_rmse_xgbAuto4 <- min(xgb_train_AutoGrid4$results$RMSE)
pred_xgb_regAuto2c <- predict(xgb_train_AutoGrid2, test_c)
pred_xgb_regAuto3c <- predict(xgb_train_AutoGrid3, test_c)
pred_xgb_regAuto4c <- predict(xgb_train_AutoGrid4, test_c)
rmse_xgb_regAuto2 <- RMSE(test$Hip_CONC, unname(pred_xgb_regAuto2c))
rmse_xgb_regAuto3 <- RMSE(test$Hip_CONC, unname(pred_xgb_regAuto3c))
rmse_xgb_regAuto4 <- RMSE(test$Hip_CONC, unname(pred_xgb_regAuto4c))### Without centering training set #####################
###############################################
# Control the cross-validation
xgb_trcontrol_1 = trainControl(
method = "repeatedcv",
number = 10,
repeats = 1, # Very Very slow ...... So no repetitions here
verboseIter = T,
allowParallel = T,
savePredictions = T
)
## Tune with a grid of length 2 for each hyperparameters
time <- proc.time()
set.seed(12)
xgb_train_AutoGrid2_nocenter <- train(y = training$Hip_CONC,
x = training[, 9:ncol(training)],
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneLength = 2) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 131sec
cv_rmse_xgbNocenter <- min(xgb_train_AutoGrid2_nocenter$results$RMSE)
pred_xgb_nocenter <- predict(xgb_train_AutoGrid2_nocenter, test)
rmse_xgb_regNocenter <- RMSE(test$Hip_CONC, unname(pred_xgb_nocenter))
## Tune with a grid of length 5 for each hyperparameters
time <- proc.time()
set.seed(12)
xgb_train_AutoGrid5_nocenter <- train(y = training$Hip_CONC,
x = training[, 9:ncol(training)],
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneLength = 5) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 29 min
cv_rmse_xgbNocenter5 <- min(xgb_train_AutoGrid5_nocenter$results$RMSE)
pred_xgb_nocenter5 <- predict(xgb_train_AutoGrid5_nocenter, test)
rmse_xgb_regNocenter5 <- RMSE(test$Hip_CONC, unname(pred_xgb_nocenter5))
## Improve tuning. Keep the optimal parameters found above and tune the two that were held constant.
xgb_train_AutoGrid5_nocenter$bestTune
xgb_grid = expand.grid(
nrounds = 50,
max_depth = 3,
eta = .4,
gamma = c(1, 2, 4, 6),
colsample_bytree = .8,
min_child_weight = c(.5, 1, 2, 4),
subsample = c(.875) )
time <- proc.time()
set.seed(12)
xgb_noCenter_fin <- train(y = training$Hip_CONC,
x = training[, 9:ncol(training)],
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneGrid = xgb_grid) # tuneLength is length of grid PER hyperparameters. 2 hyperparameters are held constant (gamma and min_child_weight). TUNE it
(proc.time() - time)[3] # 94sec
cv_rmse_xgbNocenterFin <- min(xgb_noCenter_fin$results$RMSE)
pred_xgb_nocenterFin <- predict(xgb_noCenter_fin, test)
rmse_xgb_regNocenterFin <- RMSE(test$Hip_CONC, unname(pred_xgb_nocenterFin))
time <- proc.time()
set.seed(12)
xgb_Center_fin <- train(y = training$Hip_CONC,
x = training_c,
method = "xgbTree",
trControl = xgb_trcontrol_1,
tuneGrid = xgb_grid)
(proc.time() - time)[3] # 74sec
cv_rmse_xgbCenterFin <- min(xgb_Center_fin$results$RMSE)
pred_xgb_CenterFin <- predict(xgb_Center_fin, test_c)
rmse_xgb_regCenterFin <- RMSE(test$Hip_CONC, unname(pred_xgb_CenterFin))We now take the most “trained” model as the final model. Centering the features actually decreases the RMSE.
Notice that we did not tune it as thoroughly; the hyperparameter space has been only sparsely explored. EDIT: actually, performance even seems to decay when we search a larger hyperparameter space.
We cannot fully trust this model, as we could not tune the hyperparameters thoroughly due to time constraints, and we did not have any real prior information for their starting values (…)
ggplot(rf_all) + geom_vline(xintercept = rf_all$bestTune$mtry, color = "green", linetype = 2) + theme_piss(size_c = 12,size_p = 15) #+ labs(title = "All (+) ") #plot(xgb_train_AutoGrid4)
plot(xgb_train_AutoGrid5_nocenter)We can see how complex the tuning is with so many hyperparameters…
df_cv_rf <- rf_all$pred[rf_all$pred$mtry == rf_all$bestTune[,1],][1:nrow(training),]
rownames(df_cv_rf) <- df_cv_rf$rowIndex
g_rf <- PredVsObsPlots(df_cv_rf, pred.test = pred_rf_allc, threshold = 15, rmse.cv = cv_rmse_rf_all)
df_cv_rf_nocenter <- rf_all_nocenter$pred[rf_all_nocenter$pred$mtry == rf_all_nocenter$bestTune[,1],][1:nrow(training),] %>% arrange(rowIndex)
g_rf_nocenter <- PredVsObsPlots(df_cv_rf_nocenter, pred.test = pred_rf_all_noCenter, threshold = 30, rmse.cv = cv_rmse_rf_all_nocenter)
## Show only the second (best) RF model
#grid.arrange(g_rf_nocenter$gcv, g_rf_nocenter$gtest, ncol = 2)
grid.arrange(g_rf$gcv, g_rf$gtest, ncol = 2, top = textGrob(expression( italic("Random Forest")), gp = gpar(fontsize = 19, font = 2, col = "blue")))## XGBOOST
# df_cv_xgb <- xgb_noCenter_fin$pred %>%
# dplyr::filter(eta == xgb_noCenter_fin$bestTune$eta[1] &
# gamma == xgb_noCenter_fin$bestTune$gamma[1] &
# colsample_bytree == xgb_noCenter_fin$bestTune$colsample_bytree[1] &
# min_child_weight == xgb_noCenter_fin$bestTune$min_child_weight[1] &
# subsample == xgb_noCenter_fin$bestTune$subsample[1] &
# nrounds == xgb_noCenter_fin$bestTune$nrounds[1])
df_cv_xgb <- xgb_noCenter_fin$pred[xgb_noCenter_fin$pred$eta == xgb_noCenter_fin$bestTune$eta[1] & xgb_noCenter_fin$pred$max_depth == xgb_noCenter_fin$bestTune$max_depth[1] & xgb_noCenter_fin$pred$gamma == xgb_noCenter_fin$bestTune$gamma[1] & xgb_noCenter_fin$pred$colsample_bytree == xgb_noCenter_fin$bestTune$colsample_bytree[1] & xgb_noCenter_fin$pred$min_child_weight == xgb_noCenter_fin$bestTune$min_child_weight[1] & xgb_noCenter_fin$pred$subsample == xgb_noCenter_fin$bestTune$subsample & xgb_noCenter_fin$pred$nrounds == xgb_noCenter_fin$bestTune$nrounds[1][1],]#[1:nrow(training),]
rownames(df_cv_xgb) <- df_cv_xgb$rowIndex
g_xxgb <- PredVsObsPlots(df_cv_xgb, pred.test = pred_xgb_nocenterFin, threshold = 20, rmse.cv = cv_rmse_xgbNocenterFin)
grid.arrange(g_xxgb$gcv, g_xxgb$gtest, ncol = 2, top = textGrob(expression( italic("XGboost")), gp = gpar(fontsize = 19, font = 2, col = "blue")))We have already seen two of the three ensemble methods that are widely used in machine learning, namely bagging (used in random forest) and boosting (used in XGBoost). The last ensemble method is “stacking”, i.e. combining the predictions of several (diverse) base models by fitting a meta-model on top of them; a minimal sketch follows.
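A minimal, self-contained illustration of the stacking idea on simulated data (illustrative only; in practice the base predictions would be the out-of-fold predictions of our tuned models):
set.seed(1)
n <- 100
y_toy <- rnorm(n)
pred_base1 <- y_toy + rnorm(n, sd = 0.5)          # stand-in for out-of-fold predictions of base model 1
pred_base2 <- y_toy + rnorm(n, sd = 0.8)          # stand-in for out-of-fold predictions of base model 2
meta_model <- lm(y_toy ~ pred_base1 + pred_base2) # the "stack": a simple linear meta-learner
coef(meta_model)                                  # weights the meta-learner gives to each base model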
We make use of the caretEnsemble package, following its vignette.
## STACKING
trcontrol = trainControl(
method = "cv",
number = 10,
repeats = 1, # Very Very slow ...... So no repetitions here
verboseIter = T,
allowParallel = T,
savePredictions = 'all'
)
t <- proc.time()
model_list <- caretList( # 2 learners
y = training$Hip_CONC,
x = as.data.frame(training[,9:ncol(training)]),
trControl = trcontrol,
#methodList=c("glmboost", "gbm"),
tuneList = list(
rf1 = caretModelSpec(method="parRF", tuneGrid = data.frame(.mtry= rf_all$bestTune[,1])),
#rf2=caretModelSpec(method="rf", tuneGrid=data.frame(.mtry=10), preProcess="pca"),
#nn=caretModelSpec(method="nnet", tuneLength=2, trace=FALSE),
xgbTree=caretModelSpec(method="xgbTree", tuneGrid = xgb_noCenter_fin$bestTune, trace=F)
)
)
print( (proc.time() - t)[3] ) # 109 sec
t <- proc.time()
model_list3 <- caretList( # 3 learners
y = training$Hip_CONC,
x = as.data.frame(training[,9:ncol(training)]),
trControl = trcontrol,
#methodList=c("glmboost", "gbm"),
tuneList=list(
rf1 = caretModelSpec(method="parRF", tuneGrid = data.frame(.mtry= rf_all$bestTune[,1])),
pls1 = caretModelSpec(method="pls", tuneGrid = pls_cv_all$bestTune),
#nn=caretModelSpec(method="nnet", tuneLength=2, trace=FALSE),
xgbTree=caretModelSpec(method="xgbTree", tuneGrid = xgb_noCenter_fin$bestTune, trace=F)
)
)
print( (proc.time() - t)[3] ) # 212 sec.
# Combine all models using simple linear model
set.seed(222)
stack.glm <- caretStack(model_list3,
metric = "RMSE",
method="glm",
trControl=trainControl(
method="cv",
number=10,
savePredictions= 'all'
)) # tune trControl !
print(stack.glm) # RMSE decreases compared to the lowest RMSE !
stack.glm$ens_model$bestTune
greedy_ensemble <- caretEnsemble(
model_list3,
metric="RMSE",
trControl = trainControl(
method="cv",
number=10,
savePredictions='all'
))
str(greedy_ensemble)
#varImp(greedy_ensemble)
model_preds <- lapply(model_list3, predict, newdata = test)
model_predsFin <- data.frame( lapply(model_preds, function(x) x[,"M"]) )
summary(greedy_ensemble)
summary(stack.glm)
ens_preds <- predict(greedy_ensemble, newdata=test[,9:ncol(test)])
ens_preds <- predict(stack.glm, test[,9:ncol(test)])
## The predict() does not work due to the issue.Unfortunately, there is an unresolved issue in caretEnsemble, and hence we cannot directly plug in the optimized hyperparameter values found in the previous sections for each model individually. Instead, we will let caretList retune each model on its default grid (via methodList) and stack those models.
## =======================================================
set.seed(12)
try_models <- caretList(
y = training$Hip_CONC,
x = as.data.frame(training[,9:ncol(training)]),
methodList=c("parRF", "glmnet", "pls", "xgbTree"),
trControl = trcontrol
)
# Some visualization diagnostics between our models
summary(resamples(try_models))
dotplot(resamples(try_models))
modelCor(resamples(try_models)) # look for low corr when stacking
# If the predictions for the sub-models were highly correlated, they would lead to similar predictions, reducing the benefit of stacking.
set.seed(12)
try_greedy_ensemble <- caretEnsemble(
try_models,
metric="RMSE",
trControl = trainControl(
method="cv",
number=10,
savePredictions='all'
))
tryens_preds <- predict(try_models, test[,9:ncol(test)])
tryensAll_preds <- predict(try_greedy_ensemble, test[,9:ncol(test)])
try_cv_rmse_ensemble <- min(try_greedy_ensemble$ens_model$results$RMSE)
try_rmse_ensemble <- RMSE(test$Hip_CONC, unname(tryensAll_preds))
# stack using glm
stackControl <- trainControl(method="repeatedcv", number=10, repeats=2, savePredictions=TRUE, classProbs=TRUE)
set.seed(12)
stack.glm <- caretStack(try_models, method="glm", trControl=stackControl)
summary(stack.glm)
try_cv_rmse_stack <- min(stack.glm$ens_model$results$RMSE)
tryStackAll_preds <- predict(stack.glm, test[,9:ncol(test)])
RMSE(test$Hip_CONC, unname(tryStackAll_preds))
# stack using random forest
set.seed(123)
stack.rf <- caretStack(try_models, method="rf",trControl=stackControl)
print(stack.rf)
try_cv_rmse_stackrf <- min(stack.rf$ens_model$results$RMSE)
tryStackrfAll_preds <- predict(stack.rf, test[,9:ncol(test)])
try_rmse_stackrf <- RMSE(test$Hip_CONC, unname(tryStackrfAll_preds))
## =======================================================
set.seed(12)
try_models2 <- caretList(
y = training$Hip_CONC,
x = as.data.frame(training[,9:ncol(training)]),
methodList=c("parRF", "glmnet", "pls", "pcr"),
trControl = trcontrol
)
summary(resamples(try_models2))
dotplot(resamples(try_models2))
modelCor(resamples(try_models2))
try_greedy_ensemble2 <- caretEnsemble(
try_models2,
metric="RMSE",
trControl = trainControl(
method="cv",
number=10,
savePredictions='all'
))
tryens_preds2 <- predict(try_models2, test[,9:ncol(test)]) # Predictions for each models individually
tryensAll_preds2 <- predict(try_greedy_ensemble2, test[,9:ncol(test)])
try_cv_rmse_ensemble2 <- min(try_greedy_ensemble2$ens_model$results$RMSE)
try_rmse_ensemble2 <- RMSE(test$Hip_CONC, unname(tryensAll_preds2))
# stack using glm
set.seed(123)
stack.glm2 <- caretStack(try_models2, method="glm", trControl=stackControl)
summary(stack.glm2)
try_cv_rmse_stack2 <- min(stack.glm2$ens_model$results$RMSE)
tryStackAll_preds2 <- predict(stack.glm2, test[,9:ncol(test)])
RMSE(test$Hip_CONC, unname(tryStackAll_preds2))
# stack using random forest
set.seed(123)
stack.rf2 <- caretStack(try_models2, method="rf",trControl=stackControl)
print(stack.rf2)
try_cv_rmse_stackrf2 <- min(stack.rf2$ens_model$results$RMSE)
tryStackrfAll_preds2 <- predict(stack.rf2, test[,9:ncol(test)])
try_rmse_stackrf2 <- RMSE(test$Hip_CONC, unname(tryStackrfAll_preds2))
## =======================================================
c(try_rmse_ensemble, try_rmse_stackrf, try_rmse_ensemble2, try_rmse_stackrf2)
This ensemble combines PLS, Random Forest and XGBoost, each with the optimal (tuned) hyperparameter values found individually in the previous sections (see the PLS, RF and XGB sections).
# Some visualization diagnostics between our models
pander(summary(resamples(model_list3))[[c("statistics")]])
RMSE:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s | |
|---|---|---|---|---|---|---|---|
| rf1 | 4.684 | 7.044 | 9.497 | 9.46 | 10.33 | 16.73 | 0 |
| pls1 | 3.581 | 4.317 | 4.874 | 5.09 | 5.652 | 7.005 | 0 |
| xgbTree | 6.029 | 7.731 | 12.35 | 13.82 | 16.98 | 28.54 | 0 |
Rsquared:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s | |
|---|---|---|---|---|---|---|---|
| rf1 | 0.9779 | 0.9903 | 0.9951 | 0.9928 | 0.9974 | 0.9988 | 0 |
| pls1 | 0.9958 | 0.998 | 0.9981 | 0.9979 | 0.9983 | 0.9993 | 0 |
| xgbTree | 0.9448 | 0.974 | 0.986 | 0.9809 | 0.9957 | 0.9972 | 0 |
dotplot(resamples(model_list3))
splom(resamples(model_list3))
pander(modelCor(resamples(model_list3)), caption = "Correlations") # look for low corr when stacking
| rf1 | pls1 | xgbTree | |
|---|---|---|---|
| rf1 | 1 | -0.1707 | 0.876 |
| pls1 | -0.1707 | 1 | -0.1668 |
| xgbTree | 0.876 | -0.1668 | 1 |
# If the predictions of the sub-models were highly correlated, they would lead to similar predictions, reducing the benefit of stacking.
pander(summary(resamples(try_models2))[[c("statistics")]])
RMSE:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s | |
|---|---|---|---|---|---|---|---|
| parRF | 0.7879 | 5.585 | 7.694 | 11.85 | 15.27 | 39.97 | 0 |
| glmnet | 5.816 | 6.78 | 8.561 | 9.242 | 9.525 | 19.75 | 0 |
| pls | 6.784 | 9.2 | 10.18 | 9.661 | 10.46 | 11.07 | 0 |
| pcr | 11.12 | 12.51 | 13.99 | 15.84 | 16.3 | 30.49 | 0 |
Rsquared:
| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. | NA’s | |
|---|---|---|---|---|---|---|---|
| parRF | 0.8599 | 0.9775 | 0.995 | 0.9769 | 0.9978 | 1 | 0 |
| glmnet | 0.9654 | 0.9945 | 0.9957 | 0.9927 | 0.9968 | 0.9971 | 0 |
| pls | 0.9906 | 0.9914 | 0.9921 | 0.9926 | 0.9937 | 0.9961 | 0 |
| pcr | 0.9369 | 0.978 | 0.9855 | 0.9787 | 0.9881 | 0.989 | 0 |
dotplot(resamples(try_models2))
pander(modelCor(resamples(try_models2)), caption = "Correlations")
| parRF | glmnet | pls | pcr | |
|---|---|---|---|---|
| parRF | 1 | 0.6251 | -0.5924 | 0.2447 |
| glmnet | 0.6251 | 1 | -0.4056 | -0.1596 |
| pls | -0.5924 | -0.4056 | 1 | -0.6341 |
| pcr | 0.2447 | -0.1596 | -0.6341 | 1 |
# Select only the observations from the first replication
df_cv_stackBest <- stack.rf2$ens_model$pred[stack.rf2$ens_model$pred$mtry == stack.rf2$ens_model$bestTune$mtry,][1:nrow(training),] #%>% dplyr::filter("mtry" == stack.rf2$ens_model$bestTune$mtry)
rownames(df_cv_stackBest) <- df_cv_stackBest$rowIndex
gStack <- PredVsObsPlots(df_cv_stackBest, pred.test = tryStackrfAll_preds2, threshold = 10, rmse.cv = try_cv_rmse_stackrf2)
grid.arrange(gStack$gcv, gStack$gtest, nrow = 1, top = textGrob(expression( italic("RF Stacking")), gp = gpar(fontsize = 19, font = 2, col = "blue")) )
Stacking aims at ensembling models that are not correlated. If the predictions of the sub-models were highly correlated, they would lead to similar predictions, reducing the benefit of ensembling the models.
Hence, by ensembling Random Forest, Penalized Regression, PLS and PCR, which are not strongly correlated, we benefit from stacking: the resulting RMSE is more than 10 times lower…
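To make the mechanism explicit, here is a minimal, self-contained sketch of the stacking idea on toy data (not our spectra; every name below is illustrative): the out-of-fold predictions of two base learners become the inputs of a linear meta-model, which is conceptually what caretStack(..., method = "glm") does under the hood.
set.seed(1)
toy <- data.frame(matrix(rnorm(100 * 5), nrow = 100))          # toy predictors X1..X5
toy$y <- rowSums(toy[, 1:3]) + rnorm(100, sd = 0.5)            # toy response
folds <- sample(rep(1:5, length.out = nrow(toy)))
oof <- data.frame(m1 = rep(NA_real_, nrow(toy)), m2 = rep(NA_real_, nrow(toy)))
for (k in 1:5) {
  tr <- toy[folds != k, ]; va <- toy[folds == k, ]
  oof$m1[folds == k] <- predict(lm(y ~ X1 + X2, data = tr), newdata = va)   # base learner 1 (out-of-fold)
  oof$m2[folds == k] <- predict(lm(y ~ X3 + X4, data = tr), newdata = va)   # base learner 2 (out-of-fold)
}
meta <- lm(toy$y ~ oof$m1 + oof$m2)   # linear meta-model weighting the two base learners
coef(meta)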
As we said in the descriptive analysis, we have to subset our data to retain only two classes for binary classification. The aim is to keep as many observations as possible so as not to deteriorate too much the quality of assessment of the subsequent models. Hence, we keep the classes \(0\) and \(Q_{h/2}\) of Hippurate proportion. We redo the random sampling in order to obtain the same \(80-20\) decomposition as for the regression models.
rownames(dataAll) <- rownames(dataD)
data_classif <- dataAll[dataAll$Hip_prop %in% c("0", "Qh/2") == T ,]
data_classif <- data_classif[, colnames(data_classif) %in% X_NoVariance == F]
set.seed(99)
index <- sample(rownames(data_classif), round(0.8*nrow(data_classif), 0))
# Faster in tibble objects
training_class <- as_tibble(data_classif[rownames(data_classif) %in% index , ])
test_class <- as_tibble(data_classif[rownames(data_classif) %in% index == F, ])
# Transform the factor's names into valid names : needed for future models
training_class$Hip_prop <- make.names(factor(training_class$Hip_prop, levels = c("0", "Qh/2")))
test_class$Hip_prop <- make.names(factor(test_class$Hip_prop, levels = c("0", "Qh/2")))
# rownames(training_class) <- index
# rownames(test_class) <- rownames(data_classif)[rownames(data_classif) %in% index == F ]
training_class_c <- as_tibble(scale(training_class[,-(1:9)], center = T, scale = F))
test_class_c <- as_tibble(scale(test_class[,-(1:9)], center = T, scale = F))
training_class_c_small <- training_class_c[, kept_features]
It is important for classification models to keep (roughly) the same proportion of the two classes in the training and test sets; otherwise we would need to make ‘corrections’ for class imbalance.
prop_all <- sum(data_classif$Hip_prop == "Qh/2")/nrow(data_classif)
prop_training <- sum(training_class$Hip_prop == "Qh.2")/nrow(training_class)
prop_test <- sum(test_class$Hip_prop == "Qh.2")/nrow(test_class)
df_prop <- data.frame(Original = prop_all, Training = prop_training, Test = prop_test )
rownames(df_prop) <- "Proportion of the class Qh/2"
#pander(df_prop)
formattable(df_prop, list(~percent))
| Original | Training | Test | |
|---|---|---|---|
| Proportion of the class Qh/2 | 36.78% | 35.97% | 40.00% |
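As an aside, a stratified split would enforce nearly identical class proportions by construction; a minimal sketch with caret::createDataPartition() (not used above, where simple random sampling already gave comparable proportions):
set.seed(99)
y_strat <- droplevels(factor(data_classif$Hip_prop))                      # the two retained classes
idx_strat <- caret::createDataPartition(y_strat, p = 0.8, list = FALSE)   # stratified 80-20 split
prop.table(table(y_strat[idx_strat]))    # training proportions
prop.table(table(y_strat[-idx_strat]))   # test proportions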
Concerning the classification metric, accuracy is a reasonable comparison metric. But we will see that accuracy is not sufficient to discriminate between the models (it is 100% most of the time) because discriminating these two classes is too easy. The ROC curve and its AUC are an alternative, but they are not very straightforward, especially if we aim at doing multi-class classification. (Hand 2009) emphasized the fact that “it is fundamentally incoherent in terms of misclassification costs: the AUC uses different misclassification cost distributions for different classifiers. This means that using the AUC is equivalent to using different metrics to evaluate different classification rules. It is equivalent to saying that, using one classifier, misclassifying a class 1 point is p times as serious as misclassifying a class 0 point, but, using another classifier, misclassifying a class 1 point is P times as serious, where p ≠ P.”
It can thus cause ambiguities when comparing classifiers.
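As a reminder of what the AUC actually measures, here is a toy computation (made-up scores, not our models): the AUC is the probability that a randomly chosen positive is scored above a randomly chosen negative.
score_pos <- c(0.9, 0.8, 0.7, 0.4)   # hypothetical scores of class-1 observations
score_neg <- c(0.6, 0.3, 0.2)        # hypothetical scores of class-0 observations
mean(outer(score_pos, score_neg, ">") + 0.5 * outer(score_pos, score_neg, "=="))   # ~0.92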
Another metric is often preferred: the log loss. Its extension to the multi-class case is straightforward: \[-N^{-1}\sum_{i=1}^N\sum_{j=1}^M y_{ij}\cdot \log p_{ij}\] For binary classification, i.e. where \(M=2\), we simply retrieve \[-N^{-1}\sum_{i=1}^N\Big[y_i\cdot\log p_i+(1-y_i)\cdot\log (1-p_i)\Big]\] One can recognize the similarity with the maximization of the log-likelihood of a binomial distribution. A relationship with the cross-entropy or the Kullback-Leibler divergence can also be shown. It is now the classification loss function that is the most appreciated in Machine Learning competitions such as Kaggle.
'logLossBinary' = function(actual, predicted, eps = 1e-15) {
predicted = pmin(pmax(predicted, eps), 1-eps)
-(sum(actual * log(predicted) + (1 - actual) * log(1-predicted))) / length(actual)
}
# More compact way, but same outcome
'logLossBinary' <- function(actual,pred){
-1*mean(log(pred[model.matrix(~ actual + 0) - pred > 0]))
}
cat("Predict with probability 0.5 the true class :" , logLossBinary(1, .5))Predict with probability 0.5 the true class : 0.6931472
cat("Predict with probability 0.05 the true class :" , logLossBinary(1, .05))Predict with probability 0.05 the true class : 2.995732
cat("Predict with probability 0.95 the true class :" , logLossBinary(1, .95))Predict with probability 0.95 the true class : 0.05129329
Minimizing the log loss generally goes hand in hand with maximizing the accuracy, but there is a subtle difference: the log loss heavily penalises classifiers that are confident about an incorrect classification.
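A small illustration, reusing the logLossBinary() helper defined above: accuracy counts both predictions below as one error each, while the log loss punishes the confident mistake far more heavily.
logLossBinary(1, 0.45)   # wrong (below a 0.5 cutoff) but hesitant : ~0.80
logLossBinary(1, 0.01)   # wrong and very confident               : ~4.61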
Tuning the models with respect to one metric and then computing the test-set performance with another can make a difference. Hence, we will use the log loss both for cross-validation and for testing, for all the models we compare.
Regarding the preliminary feature selection, we can keep the same features as for the regression: the methodology we adopted to select the variables is relevant in this configuration too.
Here, we must apply the preliminary feature selection for the model. We used the basic function glm() and the MASS package.
## We will use the seed 99 for all models !!!!!
n_folds <- 10
tc <- trainControl("repeatedcv", number = n_folds, repeats = 1, classProbs = TRUE, summaryFunction = mnLogLoss, savePredictions = T, verboseIter = T)
set.seed(99)
fit_Small <- train(y = training_class$Hip_prop,
x = training_class_c_small,
method = "glm" ,
family = binomial ,
trace = 0,
metric = "LogLoss",
trControl = tc)
cv_logl_glm <- min(fit_Small$results$logLoss)
pred_log_smallc_p <- predict(fit_Small, test_class_c, type = "prob")
pred_log_smallc <- predict(fit_Small, test_class_c)
logl_log_smallc <- logLossBinary(test_class$Hip_prop, pred_log_smallc_p)
set.seed(99)
fit_all <- train(y = training_class$Hip_prop,
x = training_class_c[, kept_featuresBig],
method = "glm" ,
family = binomial ,
trace = 0,
metric = "logloss",
trControl = tc)
cv_logl_glmAll <- min(fit_all$results$logLoss)
pred_log_allc_p <- predict(fit_all, test_class_c, type = "prob")
pred_log_allc <- predict(fit_all, test_class_c)
logl_log_allc <- logLossBinary(test_class$Hip_prop, pred_log_allc_p)
# by doing the stepwise feature selection
set.seed(99)
fit_SmallStep <- train(y = training_class$Hip_prop,
x = training_class_c_small,
method = "glmStepAIC" ,
family = binomial ,
trace = 0,
metric = "logloss",
trControl = tc)
cv_logl_glmSmallsw <- min(fit_SmallStep$results$logLoss)
pred_log_small2c <- predict(fit_SmallStep, test_class_c)
pred_log_small2c_p <- predict(fit_SmallStep, test_class_c, type = "prob")
logl_log_small2c <- logLossBinary(test_class$Hip_prop, pred_log_small2c_p)
tc <- trainControl("repeatedcv", number = n_folds, repeats = 1, classProbs = TRUE, summaryFunction = mnLogLoss, verboseIter = T)
kept_features2 <- featureSelect(60) ## Larger subset of selected features.
set.seed(99)
t <- proc.time()
fit_bigStep <- train(y = training_class$Hip_prop, # VERY SLOW !!
x = as.matrix(training_class_c[, kept_features2]),
method = "glmStepAIC" ,
family = binomial ,
direction = "both",
trace = 0,
metric = "logloss",
trControl = tc)
print( (proc.time() - t)[3] ) # 212 sec.
fit_bigStep$finalModel
cv_logl_glmAllsw <- min(fit_bigStep$results$logLoss)
pred_log_bigc <- predict(fit_bigStep, test_class_c)
pred_log_bigc_p <- predict(fit_bigStep, test_class_c, type = "prob")
logl_log_bigswc <- logLossBinary(test_class$Hip_prop, pred_log_bigc_p)
## Explore the influence of the number of folds on the results: to draw e.g. learning curves
# seq_folds <- 1:(floor(nrow(data_classif)/2)-1)
# rmse <- numeric(length = (nrow(data_classif)/2)-1)
# for(i in seq_folds){
# t <- proc.time()
# tc <- trainControl("repeatedcv", number = (i+1), repeats = 10)
# fitt <- train(y = training_class$Hip_prop,
# x = training_class_c_small,
# method = "glm" ,
# family = binomial ,
# metric = "logloss",
# trControl = tc)
#
# rmse[i] <- fitt$results$Accuracy
# cat((proc.time() - t)[3], "\n") ; print (i)
# }
# rmse# confusionMatrix(table(pred_log_smallc, test_class$Hip_prop))
# confusionMatrix(table(predict(fit_all, test_class_c), test_class$Hip_prop))
### Draw the results
df_logl <- data.frame('on selected features' = c(logl_log_smallc, length(coef(fit_Small$finalModel))-1),
'on more features' = c(logl_log_allc, length(coef(fit_all$finalModel))-1), 'Stepwise small' = c(logl_log_small2c, length(coef(fit_SmallStep$finalModel))-1), 'Stepwise big' = c(logl_log_bigswc, length(coef(fit_bigStep$finalModel))-1))
rownames(df_logl) <- c('logloss', 'size')
#pander(df_logl)
# formattable(df_logl, list(
# Stepwise.small = formatter("span",
# style = ~ style("font-weight" = ifelse(df_logl <= 1e-13 , "bold", NA)))))
#above_avg_bold <- formatter("span", style = x ~ style("font-weight" = ifelse(x < 1e-12, "bold", NA)))
formattable(df_logl)
| on.selected.features | on.more.features | Stepwise.small | Stepwise.big | |
|---|---|---|---|---|
| logloss | 7.085589e-12 | 2.167976e-11 | 1.034945e-10 | 8.034843e-15 |
| size | 1.600000e+01 | 8.400000e+01 | 2.000000e+00 | 1.000000e+00 |
Indeed, for logistic regression, it is important to apply feature selection (so that \(p<n\)). Keeping 18 or 46 features and then doing a stepwise logistic regression leads to very sparse models (one or two selected variables), and the selected variables differ! If we check the accuracy criterion, we see that it is indeed \(100\%\) for all models. Regarding the log loss, the table above shows that the stepwise models are the best.
Actually, it is hardly necessary to consider more complex models, as the performance is (nearly) perfect already…
We now use the packages stepPlr (tuning parameters lambda and cp) or glmnet (tuning parameters alpha and lambda), as explained in section x.x.
n_folds <- 10
# We tune the two hyperparameters together over a 2-dimensional grid.
eGrid <- expand.grid(.cp = (0:10) * 0.1,
.lambda = (0:10) * 0.1)
tc_plr <- trainControl("repeatedcv", n_folds, repeats = 1, classProbs = T, savePredictions = T, verboseIter = T, summaryFunction = mnLogLoss) # mnLogLoss so that the logLoss metric is available in the results
set.seed(99)
fit_plr <- train( y = training_class$Hip_prop ,
x = training_class[, -(1:9)],
method = "plr",
trControl = tc_plr,
metric = "logloss",
tuneGrid = eGrid)
## ==============================================================================
eGrid_glm <- expand.grid(.alpha = (0:10) * 0.1,
.lambda = (0:10) * 0.1)
set.seed(99)
glmnet_class <- train(y = as.factor(training_class$Hip_prop) ,
x = as.matrix(training_class[, -(1:9)]),
method = "glmnet",
trControl = tc_plr,
metric = "ROC",
# metric = "logloss",
tuneGrid = eGrid_glm)
cv_logl_glmnetAll <- min(glmnet_class$results$logLoss)
# pred_glmnet_all_p <- predict(glmnet_class, test_class, type = "prob")
# pred_glmnet_all <- predict(glmnet_class, test_class)
# logl_glmnet_all <- logLossBinary(test_class$Hip_prop, pred_glmnet_all_p)
pred_glmnet_allc_p <- predict(glmnet_class, test_class[, -(1:9)], type = "prob")
pred_glmnet_allc <- predict(glmnet_class, test_class[, -(1:9)])
logl_glmnet_allc <- logLossBinary(test_class$Hip_prop, pred_glmnet_allc_p)
set.seed(99)
glmnet_class_small <- train(y = as.factor(training_class$Hip_prop) ,
x = as.matrix(training_class[, kept_features]),
method = "glmnet",
trControl = tc_plr,
#metric = "ROC",
# metric = "logloss",
family = "binomial",
tuneGrid = eGrid_glm)
cv_logl_glmnetSmall <- min(glmnet_class_small$results$logLoss)
# pred_glmnet_all_p_small <- predict(glmnet_class_small, test_class, type = "prob")
# pred_glmnet_all_small <- predict(glmnet_class_small, test_class)
# logl_glmnet_all_small <- logLossBinary(test_class$Hip_prop, pred_glmnet_all_p_small)
pred_glmnet_all_p_smallc <- predict(glmnet_class_small, test_class[, kept_features], type = "prob")
pred_glmnet_all_smallc <- predict(glmnet_class_small, test_class[, kept_features])
logl_glmnet_all_smallc <- logLossBinary(test_class$Hip_prop, pred_glmnet_all_p_smallc)
## ==============================================================================
eGrid_pen <- expand.grid(.lambda1 = (0:10) * 0.1,
.lambda2 = (0:10) * 0.1)
set.seed(99)
penal_class <- train(y = training_class$Hip_prop ,
x = training_class_c,
# Hip_prop ~ .,
# data = cbind.data.frame(Hip_prop = training$Hip_prop, training_c_small),
method = "penalized",
metric = "logloss",
trControl = tc_plr,
tuneGrid = eGrid_pen)
glmnet_accur <- train(y = training_class$Hip_prop ,
x = training_class[, -(1:9)],
method = "glmnet_h2o",
trControl = tc_plr,
metric = "logloss",
tuneGrid = eGrid_glm)
g1 <- ggplot(glmnet_class) + labs(x = 'mixing percentage (alpha)', linetype = 'Regu. lambda') + theme_piss() + guides(fill=guide_legend(title="lambda"), shape=guide_legend(title="lambda"), color=guide_legend(title="lambda"))
g2 <- ggplot(glmnet_class_small) + labs(x = 'mixing percentage (alpha)', fill = 'Regul. lambda') + theme_piss() + guides(fill=guide_legend(title="lambda"), shape=guide_legend(title="lambda"), color=guide_legend(title="lambda"))
grid.arrange(g1, g2, nrow = 1)
We now fit PLS-DA with the original pls library (through caret).
Note: are class probabilities available for PLS-DA or OPLS-DA? (course p. 49)
# Explore manually Values for controlling the training process
rctrl1 <- trainControl("cv", n_folds, classProbs = T, summaryFunction = mnLogLoss, savePredictions = T, verboseIter = T)
set.seed(99)
pls_cv_small_class <- train(y = training_class$Hip_prop ,
x = training_class_c_small,
method = "pls",
metric = "logloss",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:15)))
cv_logl_plsSmall <- min(pls_cv_small_class$results$logLoss)
set.seed(99)
pls_cv_all_class <- train(y = training_class$Hip_prop ,
x = training_class_c,
method = "pls",
metric = "logloss",
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:100)))
cv_logl_plsAll <- min(pls_cv_all_class$results$logLoss)
ggplot(pls_cv_small_class)
# pred_pls_small_p_small <- predict(pls_cv_small_class, test_class, type = "prob")
# pred_pls_small_small <- predict(pls_cv_small_class, test_class)
# logl_pls_small_small <- logLossBinary(test_class$Hip_prop, pred_pls_small_p_small)
pred_pls_small_p_smallc <- predict(pls_cv_small_class, test_class_c, type = "prob")
pred_pls_small_smallc <- predict(pls_cv_small_class, test_class_c)
logl_pls_small_smallc <- logLossBinary(test_class$Hip_prop, pred_pls_small_p_smallc)
pred_pls_small_p_allc <- predict(pls_cv_all_class, test_class_c, type = "prob")
pred_pls_small_allc <- predict(pls_cv_all_class, test_class_c)
logl_pls_small_allc <- logLossBinary(test_class$Hip_prop, pred_pls_small_p_allc)
See the “S-plot” to detect biomarkers with OPLS(-DA).
Orthogonal [Partial] Least Squares (OPLS) is a variant of PLS which uses orthogonal signal correction to maximize the explained covariance between the predictors X and the variable to predict Y on the first latent variable, while components >1 capture variance in X which is orthogonal (or unrelated) to Y.
“Diagnostics such as the Q2Y metrics and permutation testing are of high importance to avoid overfitting and assess the statistical significance of the model. The Variable Importance in Projection (VIP), which reflects both the loading weights for each component and the variability of the response explained by this component (Pinto, Trygg, and Gottfries 2012; Mehmood et al. 2012), can be used for feature selection (J. Trygg and Wold 2002; Pinto, Trygg, and Gottfries 2012).”
As there does not seem to be an already implemented method for OPLS in the caret package, we decided to rely on the implementation by M. Martin in MBXUCL.
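As an optional cross-check only (and assuming the Bioconductor package ropls is installed), OPLS-DA is also available outside caret; the results reported below still rely on the MBXUCL implementation.
library(ropls)   # Bioconductor package, used here only as a sanity check
oplsda_check <- opls(as.matrix(training_class_c), as.factor(training_class$Hip_prop),
                     predI = 1, orthoI = NA)   # 1 predictive component, orthogonal components chosen automatically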
Here, we will use the RMSE as cross-validation criterion. Since we transformed the two-class target into a numeric score ("Qh.2" = 1 and "X0" = 0), it is possible to treat the problem as a regression on this score and compute an RMSE.
Note: the response variable must be centered (see the code below).
oplsc <- OPLSDA(as.matrix(training_class_c), as.factor(training_class$Hip_prop), no = 3, impT = T)
oplsreg <- OPLSDA(as.matrix(training_c), training$Hip_CONC, no = 3)
test_cVar <- test_c[, colnames(test_c) %in% X_NoVariance ==F]
#
# OPLSDA_pred(oplsreg, as.matrix(test_cVar))
#
# OPLSDA_pred(oplsc, as.matrix(test_class_c))
#
# For OPLS it is IMPORTANT TO CENTER THE RESPONSE
classY.CV <- ifelse(training_class$Hip_prop == "Qh.2", 1, 0)
# classY.CV is kept uncentered (its mean is used later); the centered version classY is created below
test_classYAll <- ifelse(test_class$Hip_prop == "Qh.2", 1, 0)
test_classY <- scale(test_classYAll, center=T, scale = F)
############# CHANGE FOR CLASSIF
'kfoldCV' <- function(X ,Y, k, argrm = 0, seed = 99){
#browser()
n=dim(X)[1] ; nbseg = k ; sn=ceiling(n/k) ; set.seed(seed)
seg_i <- sample(rep(1:k, length.out = n)) ; yfit.cv = numeric()
for (i in 1:nbseg){
val_i <- which(seg_i == i)
Y.train = Y[-val_i] ; X.train =X[-val_i,] ; X.valid=X[val_i,]
coefcv =fctreg(X.train,Y.train,argrm)
yfit.cv[val_i] =X.valid%*%coefcv
names(yfit.cv)[val_i] <- val_i
}
RMSE.cv=sqrt((1/n)*sum((Y-yfit.cv)^2))
return(list(yfit.cv=yfit.cv,RMSE.cv=RMSE.cv))
}
fctreg <- function(X,Y,argrm) {OPLSDA(X,Y, no=argrm-1)$b}
classY <- ifelse(training_class$Hip_prop == "Qh.2", 1, 0)
# IMPORTANT TO CENTER THE RESPONSE
classY <- scale(classY, center=T, scale = F)
# classY is the Transformed target to handle RMSE
RMSECV_oplsda <- numeric()
yfitCV_oplsda <- list() ; NCmaxda <- 25 # tune it
for(i in 2:NCmaxda){
#cat(i, "\n")
rescv <- kfoldCV(as.matrix(training_class_c), as.numeric(classY), k = 10, argrm = i)
RMSECV_oplsda[i] <- rescv$RMSE.cv
yfitCV_oplsda[[i]] <- rescv$yfit.cv
}
cv_rmse_opls_small <- min(RMSECV_oplsda, na.rm = T)
ncoptda <- which.min(RMSECV_oplsda)
df_oplsda <- data.frame(indi = 2:NCmaxda, RMSEcv = RMSECV_oplsda[-1])
gg <- ggplot(df_oplsda, aes(x = indi, y = RMSEcv)) + geom_point() + geom_line() + geom_vline(aes(xintercept = ncoptda, col = "optimal"), linetype = 2) + theme_piss(size_c = 12) + labs(title = "RMSE cross-validation vs complexity", x = "Number of components", y = "RMSE-cv") + geom_vline(aes(xintercept = 15, col = "advised"), linetype = 2) + scale_color_manual(name = "# comp.", values = c("optimal" = "red", "advised" = "green"))
ggplotly(gg)
roplsda_opt <- OPLSDA(as.matrix(training_class_c), classY, no = ncoptda-1)
#pred_ropls_small <- OPLSDA_pred(ropls_opt_small, as.matrix(test[,kept_features]))
pred_roplsda_smallc <- OPLSDA_pred(roplsda_opt, as.matrix(test_class_c))
pred_df_oplsda <- data.frame("Qh.2" = pred_roplsda_smallc + mean(test_classYAll), "X0" = 1 - (pred_roplsda_smallc + mean(test_classYAll)))
pred_df_oplsda$Qh.2 <- ifelse(pred_df_oplsda$Qh.2 > 1, 1, pred_df_oplsda$Qh.2)
pred_df_oplsda$X0 <- ifelse(pred_df_oplsda$X0 > 1, 1, pred_df_oplsda$X0)
pred_df_oplsda$Qh.2 <- ifelse(pred_df_oplsda$Qh.2 < 0, 0, pred_df_oplsda$Qh.2)
pred_df_oplsda$X0 <- ifelse(pred_df_oplsda$X0 < 0, 0, pred_df_oplsda$X0)
logl_oplsda <- logLossBinary(test_class$Hip_prop, pred_df_oplsda)
We thus kept x components and made predictions with them.
It is interesting to note that the RMSE estimated by cross-validation seems much higher when it is not computed with caret… and is perhaps closer to the ‘true’ RMSE (?)
As the model is not implemented in the caret infrastructure and the R implementation comes from UCL, we think it would be very interesting to implement it in caret to profit from the flexibility and the tuning facilities that this package offers. Moreover, it allows us to understand more deeply the ‘black box’ which sometimes lies behind the caret methodology.
## Custom caret model specification for OPLS-DA (based on the MBXUCL implementation)
opls_MBX <- list(label = "OPLS-DA",
library = "MBXUCL",
type = "Classification",
#type = c("Regression", "Classification"),
## Tune over both parameters at the same time
parameters = data.frame(parameter = 'ncomp',
class = "numeric",
label = '#Components'),
grid = function(x, y, len = NULL, search = "grid"){
if(search == "grid") {
out <- data.frame(ncomp = seq(1, min(ncol(x) - 1, len), by = 1))
} else {
out <- data.frame(ncomp = unique(sample(1:ncol(x), replace = TRUE)))
}
out
},
loop = function(grid) {
grid <- grid[order(grid$ncomp, decreasing = TRUE),, drop = FALSE]
loop <- grid[1, ,drop = FALSE]
submodels <- list(grid[-1,,drop = FALSE])
list(loop = loop, submodels = submodels)
},
#loop = NULL,
fit = function(x, y, wts, param, lev, last, classProbs, ...) {
#dat <- if(is.data.frame(x)) x else as.data.frame(x)
OPLSDA(x, y, no = param$ncomp, ...)$b
},
predict = function(modelFit, newdata, submodels = NULL) {
y.pred <- newdata %*% modelFit
names(y.pred) <- rownames(newdata)
y.pred
},
prob = NULL,
varImp = NULL,
predictors = function(x, ...) rownames(x$projection),
#levels = function(x) x$obsLevels,
sort = function(x) x[order(x[,1]),]
)
rctrl1 <- trainControl(method = "cv", number = n_folds)
set.seed(100)
opls1 <- train(y = training$Hip_CONC,
x = training_c_small,
method = opls_MBX,
trControl = rctrl1,
tuneGrid = expand.grid(.ncomp = c(1:4)))
!! Produce the “S-plots”: characteristic of (O)PLS.
Support Vector Machines (SVM) are well-known classification models, widely used in omics data analysis; see e.g. (Kim et al. 2017) for a recent application to microarrays. We will explore three families of SVM: a linear kernel, a Gaussian (radial) kernel and a polynomial kernel. There is only one hyperparameter to tune for the linear and the radial kernels (we fix the scale parameter), but three for the polynomial kernel. For each model, we take the whole feature space, because this is one of the key strengths of SVM: it can handle a very high-dimensional feature space.
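To double-check which hyperparameters caret actually exposes for each kernel, one can inspect the model information (a quick, optional check):
getModelInfo("svmLinear2", regex = FALSE)[[1]]$parameters   # cost
getModelInfo("svmRadial",  regex = FALSE)[[1]]$parameters   # sigma, C
getModelInfo("svmPoly",    regex = FALSE)[[1]]$parameters   # degree, scale, C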
n_folds <- 10
ctrl <- trainControl("repeatedcv", n_folds, repeats = 1, # do 5 repititions of cv
classProbs = T, summaryFunction = mnLogLoss, savePredictions = T, verboseIter = T )
#Train and Tune the SVM
set.seed(99)
svm_linear <- train( y = as.factor(training_class$Hip_prop) ,
x = training_class_c,
method = "svmLinear2",
metric = "logloss",
tuneLength = 10, # 10 values of the cost function
trControl=ctrl)
predict(svm_linear, test_class)
cv_logl_svmLin <- min(svm_linear$results$logLoss)
set.seed(99)
svm_radial <- train( y = training_class$Hip_prop ,
x = training_class_c,
method = "svmRadial", # Radial kernel
tuneLength = 9, # 9 values of the cost parameter C; sigma is held constant
metric = "logloss",
trControl=ctrl) # use ctrl (with savePredictions) so that svm_radial$pred is available later
predict(svm_radial, test_class_c)
cv_logl_svmRad <- min(svm_radial$results$logLoss)
set.seed(99)
# Use the expand.grid to specify the search space
grid <- expand.grid(scale = c(.01, .015, 0.2),
C = c(0.75, 0.9, 1, 1.1, 1.25),
degree = 0:4)
#Train and Tune the SVM
svm_polyn <- train( y = training_class$Hip_prop ,
x = training_class_c,
method = "svmPoly",
# tuneGrid = grid,
metric = "logloss",
tuneLength = 6,
trControl=ctrl) # c(.01, .75, 1)
predict(svm_polyn, test_class_c)
cv_logl_svmPol <- min(svm_polyn$results$logLoss)
pred_svmRad <- predict(svm_radial, test_class_c)
pred_svmLin <- predict(svm_linear, test_class_c)
pred_svmPoly <- predict(svm_polyn, test_class_c)
pred_svmLinp <- predict(svm_linear, test_class_c, type = "prob")
pred_svmRadp <- predict(svm_radial, test_class_c, type = "prob")
pred_svmPolyp <- predict(svm_polyn, test_class_c, type = "prob")
logloss_svmLin <- logLossBinary(test_class$Hip_prop, predict(svm_linear, test_class_c, type = "prob")
)
logloss_svmRad <- logLossBinary(test_class$Hip_prop, predict(svm_radial, test_class_c, type = "prob")
)
logloss_svmPol <- logLossBinary(test_class$Hip_prop, predict(svm_polyn, test_class_c, type = "prob")
)
c(logloss_svmLin, logloss_svmRad, logloss_svmPol)
df_svm <- data.frame('Linear' = c(logloss_svmLin, cv_logl_svmLin, svm_linear$bestTune$cost),
'Radial' = c(logloss_svmRad, cv_logl_svmRad, svm_radial$bestTune$C), 'Polynomial' = c(logloss_svmPol, cv_logl_svmPol, svm_polyn$bestTune$C) )
rownames(df_svm) <- c('logloss', "logloss CV", 'Tuned value of C')
pander(df_svm)
| Linear | Radial | Polynomial | |
|---|---|---|---|
| logloss | 0.01667 | 0.2817 | 0.632 |
| logloss CV | 5.794 | 2.911 | 1.391 |
| Tuned value of C | 1 | 0.25 | 1 |
We tried many different values for the hyperparameters, but none of them provided good results for the SVM with polynomial kernel:
ggplot(svm_linear) + theme_piss()
ggplot(svm_radial) + theme_piss()
ggplot(svm_polyn) + theme_piss()
## Plot the probabilities
PredMat <- data.frame(mod1 = pred_log_smallc_p[,"Qh.2"], mod2 = pred_log_allc_p[,"Qh.2"], mod3 = pred_log_small2c_p[,"Qh.2"] , mod4 = pred_log_bigc_p[,"Qh.2"], m5 = pred_glmnet_allc_p[,'Qh.2'] , m6 = pred_glmnet_all_p_smallc[, 'Qh.2'], m7 = pred_pls_small_p_smallc[,'Qh.2'], m8 = pred_pls_small_p_allc[,'Qh.2'], m88 = pred_df_oplsda$Qh.2, m9 = pred_svmLinp[,"Qh.2"], m10 = pred_svmRadp[,"Qh.2"] , m11 = pred_svmPolyp[,"Qh.2"])
colnames(PredMat) <- c(colnames(df_logl), "Penalized log.reg. all", "Penalized log.reg. small", "PLS-DA small", "PLS-DA all", "OPLS-DA all", "SVM linear", "SVM radial", "SVM polyn.") # names must match those used to index cutoffopt below
cutoff <- seq(min(PredMat)-0.1,max(PredMat)+0.1,length = nrow(test_class))
nco <- length(cutoff)
truepositive <- matrix(nrow=nco,ncol= ncol(PredMat))
falsepositive <- matrix(nrow=nco,ncol= ncol(PredMat))
npos <- sum(test_class$Hip_prop == "Qh.2")
nneg <- sum(test_class$Hip_prop == "X0")
listpos <- (test_class$Hip_prop == "Qh.2")
listneg <- (test_class$Hip_prop == "X0")
nm <- ncol(PredMat)
PredMat <- as.matrix(PredMat)
for(i in 1:nco){
for(j in 1:nm) {
# Rate of true positives in the set of positive
truepositive[i,j] <- sum(PredMat[listpos,j]>=cutoff[i])/npos
# Rate of false positives in the set of negatives
falsepositive[i,j] <- sum(PredMat[listneg,j]>=cutoff[i])/nneg
}
}
par(mfrow=c(1,1))
plot(falsepositive[,1],truepositive[,1],type="l",main="ROC Curves : Logistic Models", xlab="False Positive Rate",ylab="True Positive Rate")
for(j in 2:nm){
lines(falsepositive[,j],truepositive[,j],type="l",col=j,lty=j)}
legend(0.4,0.75,dimnames(PredMat)[[2]],col=1:nm,lty=1)
# Find, for each model, the optimal cutoff that minimizes the distance to the point (0,1)
cutoffopt=PredMat[1,]
for(i in 1:nm){
distances2=falsepositive[,i]^2+(1-truepositive[,i])^2
cutoffopt[i]=cutoff[which(distances2==min(distances2))][1]
}
pander(cutoffopt)
nm <- nrow(listr$RMSE) # number of models
## Line plot to the Predictions
par(mfrow=c(1,1))
PredMat=listr$PredMat
plot(1:(nrow(PredMat)+30),c(PredMat[,1],rep(NA,30)),lty=1,type="l",main="Predictions",xlab="Observation",ylab="Prediction",ylim=c(min(PredMat),max(PredMat)))
for(i in 2:nm) { lines(1:(nrow(PredMat)+30),c(PredMat[,i],rep(NA,30)),lty=i,col=i)}
legend(nrow(Xdata)+5,1,dimnames(PredMat)[[2]],col=1:4,lty=1,cex=0.5)
abline(v=92)
We will now produce a new function to plot the predictions, similar to the one used for regression. The printed observations are those that are misclassified.
## Print the value of an observation if it it misclassified
## We assume the value of the cutoff is the same for the validation and test set... (we take it as the test set)
'PredVsObsPlotsClassif' <- function(df_cv, obs.test = test_class$Hip_prop, predp.testQh.2, cutoff, threshold = .5) {
#browser()
df_cv$true_obs <- ifelse(df_cv$obs == 'Qh.2', 1, 0)
# df_cv$error <- logLoss(true_obs, df_cv[, 'Qh.2'])
# df_cv$error <- sapply(true_obs, function(x) logLoss(x, df_cv[, 'Qh.2']))
#ProblematicObs.cv <- df_cv$error > threshold
ProblematicObs.cv <- ifelse(df_cv$pred != df_cv$obs, T, F)
names.cv <- row.names(df_cv)
names.cv[with(df_cv, !ProblematicObs.cv)] <- ""
# cutoff
g1 <- ggplot(df_cv) + geom_point(aes(x = true_obs, y = Qh.2), size = .95 ) + theme_piss() + labs(title = "Cross-Validation", subtitle = "Predicted values vs observed values", x = 'Observed (0= X0; 1=Qh/2)', y = 'pred. proba.') + geom_text(aes(true_obs, Qh.2, label = names.cv), color = "red") + geom_hline(yintercept = cutoff, col = "blue", linetype = 2) + geom_text(aes(.5, cutoff + .05, label = 'cutoff'), col = 'blue', data = df_cv, size = 1.5)
df <- data.frame(obs.test = obs.test, predp.testQh.2 = predp.testQh.2)
df$obs.test0 <- ifelse(df$obs.test == 'Qh.2', 1, 0)
# We classify wrt to the value of the cutoff.
df$pred.Qh.2 <- ifelse(df$predp.testQh.2 > cutoff, 'Qh.2', 'X0')
ProblematicObsTest <- ifelse(df$pred.Qh.2 != df$obs.test, T, F)
g2 <- ggplot(df, aes(x = obs.test0, y = predp.testQh.2))
names <- row.names(df)
names[with(df, !ProblematicObsTest)] <- ""
g2 <- g2 + geom_point(size = .95) + geom_hline(yintercept = cutoff, col = "blue", linetype = 2) + geom_text(aes(.5, cutoff + .05, label = 'cutoff'), col = 'blue', data = df_cv) + theme_piss() + labs(title = "Test set", subtitle = "Predicted values vs observed values", x = 'Observed (0= X0; 1=Qh/2)', y = 'pred. proba.') + geom_text(aes(obs.test0, predp.testQh.2, label = names), color = "red", size = 1.5)
#grid <- grid.arrange(g1, g2, nrow = 1)
#grid.arrange(g1, g2, nrow = 1)
return(list(gcv = g1, gtest = g2 ))
}
We will also compute some other insightful plots, such as the scatterplot matrix of predictions and the line plot of predictions. For these two graphs we only represent the results on the test set, because this is the set used for the final performance assessment. Looking at the same plots for the cross-validation would nevertheless be useful, e.g. to check that the validation procedure has been correctly constructed (not shown here).
df_cv_logRegSmall <- fit_Small$pred[order(fit_Small$pred$rowIndex),]
plots_logReg_small <- PredVsObsPlotsClassif(df_cv_logRegSmall, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_log_smallc_p$Qh.2, threshold = .05, cutoff = cutoffopt[1])
#grid.arrange(plots_logReg_small$gcv, plots_logReg_small$gtest, nrow = 1)
df_cv_logRegall <- fit_all$pred[order(fit_all$pred$rowIndex),]
plots_logReg_all <- PredVsObsPlotsClassif(df_cv_logRegall, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_log_allc_p$Qh.2, threshold = .05, cutoff = cutoffopt[2])
#grid.arrange(plots_logReg_all$gcv, plots_logReg_all$gtest, nrow = 1)
df_cv_logRegSmallstep <- fit_SmallStep$pred[order(fit_SmallStep$pred$rowIndex),]
plots_logReg_smallstep <- PredVsObsPlotsClassif(df_cv_logRegSmallstep, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_log_small2c_p$Qh.2, threshold = .05, cutoff = cutoffopt[3])
#grid.arrange(plots_logReg_smallstep$gcv, plots_logReg_smallstep$gtest, nrow = 1)
grid.arrange(plots_logReg_small$gcv, plots_logReg_small$gtest, top = textGrob(expression( italic("Small")), gp = gpar(fontsize = 19, font = 2, col = "blue")), nrow = 1)
grid.arrange(plots_logReg_all$gcv, plots_logReg_all$gtest, top = textGrob(expression( italic("Big")), gp = gpar(fontsize = 19, font = 2, col = "blue")), nrow = 1)
grid.arrange(plots_logReg_smallstep$gcv, plots_logReg_smallstep$gtest, top = textGrob(expression( italic("Stepwise")), gp = gpar(fontsize = 19, font = 2, col = "blue")), nrow = 1)
# df_cv_logRegbigstep <- fit_bigStep$pred[order(fit_bigStep$pred$rowIndex),]
#
# plots_logReg_bigstep <- PredVsObsPlotsClassif(df_cv_logRegbigstep, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_log_bigc_p$Qh.2, threshold = .05, cutoff = cutoffopt[4])
#
# grid.arrange(plots_logReg_bigstep$gcv, plots_logReg_bigstep$gtest, nrow = 1)
df_cv_glmnetsAll <- glmnet_class$pred[glmnet_class$pred$alpha == glmnet_class$bestTune$alpha[1] & glmnet_class$pred$lambda == glmnet_class$bestTune$lambda[1] ,][1:nrow(training_class),]
plots_glmnetAllclass <- PredVsObsPlotsClassif(df_cv_glmnetsAll, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_glmnet_allc_p$Qh.2, threshold = .05, cutoff = cutoffopt["Penalized log.reg. all"])
#grid.arrange(plots_logReg_smallstep$gcv, plots_logReg_smallstep$gtest, nrow = 1)
df_cv_glmnetsmall <- glmnet_class_small$pred[glmnet_class_small$pred$alpha == glmnet_class_small$bestTune$alpha[1] & glmnet_class_small$pred$lambda == glmnet_class_small$bestTune$lambda[1] ,][1:nrow(training_class),]
plots_glmnetSmallclass <- PredVsObsPlotsClassif(df_cv_glmnetsmall, obs.test = test_class$Hip_prop, predp.testQh.2 = pred_glmnet_all_p_smallc$Qh.2, threshold = .05, cutoff = cutoffopt['Penalized log.reg. small'])
grid.arrange(plots_glmnetAllclass$gcv, plots_glmnetAllclass$gtest,
plots_glmnetSmallclass$gcv, plots_glmnetSmallclass$gtest,
ncol = 2)
df_cv_plsdaAll <- pls_cv_all_class$pred[pls_cv_all_class$pred$ncomp == pls_cv_all_class$bestTune$ncomp[1] ,][1:nrow(training_class),]
plots_plsdaAll <- PredVsObsPlotsClassif(df_cv_plsdaAll, predp.testQh.2 = pred_pls_small_p_allc$Qh.2, threshold = .05, cutoff = cutoffopt["PLS-DA all"])
df_cv_plsdaSmall <- pls_cv_small_class$pred[pls_cv_small_class$pred$ncomp == pls_cv_small_class$bestTune$ncomp[1] ,][1:nrow(training_class),]
plots_plsdaSmall <- PredVsObsPlotsClassif(df_cv_plsdaSmall, predp.testQh.2 = pred_pls_small_p_smallc$Qh.2, threshold = .05, cutoff = cutoffopt["PLS-DA small"])
grid.arrange(plots_plsdaAll$gcv, plots_plsdaAll$gtest, top = textGrob(expression( italic("PLS-DA All")), gp = gpar(fontsize = 19, font = 2, col = "blue")), nrow = 1)
grid.arrange(plots_plsdaSmall$gcv, plots_plsdaSmall$gtest, top = textGrob(expression( italic("PLS-DA Small")), gp = gpar(fontsize = 19, font = 2, col = "blue")), nrow = 1)
df_cv_oplsda <- data.frame(obs = training_class$Hip_prop, Qh.2 = yfitCV_oplsda[[ncoptda]] + mean(classY.CV) )
df_cv_oplsda$pred <- ifelse(df_cv_oplsda$Qh.2 > unname(cutoffopt["OPLS-DA all"]), "Qh.2", "X0")
g_oplsda <- PredVsObsPlotsClassif(df_cv_oplsda, predp.testQh.2 = pred_df_oplsda$Qh.2, threshold = .05, cutoff = unname(cutoffopt["OPLS-DA all"]))
grid.arrange(g_oplsda$gcv, g_oplsda$gtest, nrow = 1, top = textGrob(expression( italic("OPLS-DA all")), gp = gpar(fontsize = 19, font = 2, col = "blue")))
library("GGally")
colnames(PredMat) <- c("LogReg.small", "LogReg.big", "SW.small","SW.big", "Pen.all", "Pen.small", 'PLSDA.small', "PLSDA.all", "OPLSDA.all", "SVM.linear", "SVM.radial", "SVM.polyn")
ggpairs(as.data.frame(PredMat) ) + theme_piss()
This kind of matrix could be a good starting point if we wanted to build an ensemble model, i.e. to pick some individual models that are poorly correlated (e.g. PLS-DA or Penalized Logistic Regression with the SVMs).
We decide to represent only the “best” model in each family. Dashed vertical black lines mark the observations that actually belong to the target class (“Qh.2”).
PredMatRed <- cbind.data.frame(Observation = 1:nrow(test_class), PredMat[, c("LogReg.small", "SW.big", "Pen.all", "PLSDA.all", "OPLSDA.all", "SVM.linear" )])
indexQh.2 <- PredMatRed[test_class$Hip_prop == "Qh.2",]$Observation
ggplot(PredMatRed, aes(x = Observation)) + geom_line(aes(x = Observation, y = LogReg.small, col = "Logistic"), linetype = 2) + geom_line(aes(y = SW.big, col = "logistic.SW")) + geom_line(aes(y = Pen.all, col = "Penalized")) + geom_line(aes(y = PLSDA.all, col = "PLS-DA")) + geom_line(aes(y = OPLSDA.all, col = "OPLS-DA")) + geom_line(aes(y = SVM.linear, col = "SVM.lin")) + scale_color_manual(name = "Model", values = c("Logistic" = brewer.pal(1, "Set1")[1], "logistic.SW" = brewer.pal(1, "Set1")[2], "Penalized" = brewer.pal(1, "Set1")[3], "PLS-DA" = brewer.pal(5, "Set1")[4], "OPLS-DA" = brewer.pal(5, "Set1")[5], "SVM.lin" = brewer.pal(8, "Set1")[6])) + geom_vline(xintercept = indexQh.2, colour ="black", linetype = 2, size = 0.2) + theme_piss()
We note the “perfect” path of all the displayed models, as their predicted probabilities are always > 0.5 at the “Qh.2” observations (indeed, 0.5 is an arbitrary cutoff).
Relying on the plotROC package and its excellent vignette, we were able to compute insightful ROC curves, grouped by family of classifiers.
## For individual plots ! But this is too voluminous...
#gc_prob_ex <- extractProb(list(pls_cv_small_class), training_class_c %>% select(-Class))
plotdfRoc <- data.frame(D = ifelse(pred_pls_small_allc == "Qh.2", 1, 0), D.str = pred_pls_small_allc, M1 = pred_pls_small_p_allc$Qh.2)
basicplot <- ggplot(plotdfRoc, aes(d = D, m = M1)) + geom_roc(n.cuts = 15)
basicplot + style_roc() + annotate("text", x = .85, y = .15,
label = paste("AUC =", round(calc_auc(basicplot)$AUC, 2)), col = "#33666C") + theme_piss()
"ggplotROCModel" <- function(classObs, predClass, ConfRegions = T, interac = F, interactiveHtml = F) {
plotdfRoc <- data.frame(D = ifelse(classObs == "Qh.2", 1, 0), D.str = classObs, M1 = predClass)
basicplot <- ggplot(plotdfRoc, aes(d = D, m = M1)) + geom_roc(n.cuts = 15)
basicplot <- basicplot + style_roc() + annotate("text", x = .85, y = .15, label = paste("AUC =", round(calc_auc(basicplot)$AUC, 2)), col = "#33666C") + theme_piss() + theme(axis.text = element_text(colour = "#33666C"))
if (ConfRegions == T) basicplot <- basicplot + geom_rocci(linetype = 1)
if (interac == T) {
basicplot <- plot_interactive_roc(basicplot)
if (interactiveHtml == T) cat( export_interactive_roc(basicplot, prefix = "a") )
else basicplot
}
else basicplot
}
## PLS
# Testing
#ggplotROCModel(pred_pls_small_allc, pred_pls_small_p_allc$Qh.2, interac = T, interactiveHtml = T)
# CV
## ROC on cross-validation predictions and on the test set for a few models
ggplotROCModel(df_cv_plsdaAll$obs, df_cv_plsdaAll$Qh.2)
ggplotROCModel(test_class$Hip_prop, pred_log_small2c_p$Qh.2)
ggplotROCModel(df_cv_logRegSmallstep$obs, df_cv_logRegSmallstep$Qh.2)
ggplotROCModel(test_class$Hip_prop, pred_svmRadp$Qh.2)
ggplotROCModel(svm_radial$pred$obs, svm_radial$pred$Qh.2)
confusionMatrix(table(pred_log_smallc, test_class$Hip_prop))
confusionMatrix(table(predict(fit_all, test_class_c), test_class$Hip_prop))
roc_logreg <- roc(test_class$Hip_prop, predict(fit_Small, test_class, type = "prob")[,1])
roc_logreg
From (Clopper and Pearson 1934) and based on result 2.4 of (Pepe 2004), we can compute exact (individual) confidence intervals for the TPF (True Positive Fraction) and the FPF (False Positive Fraction) separately, each at an adjusted individual level chosen so that the rectangular confidence region obtained by taking their cross-product has level \(100\cdot(1-\alpha)\%\) for the pair. This is achieved by the geom_rocci() layer of the package.
We decided to display only 2 confidence regions along each curve, at the \(40^{th}\) and \(60^{th}\) quantiles of the model’s probabilities, in order not to overload the graphs. We also changed the significance level \(\alpha\) to \(\boldsymbol{25\%}\)(!) in order to decrease the size of the regions and to see something, because otherwise the regions would be too large… This already points out the fact that our models are highly variable and difficult to compare on the basis of a test set that has only 35 observations (…)
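For intuition, here is a toy illustration with hypothetical counts (20 true positives out of 25 positives in a test set): an exact Clopper-Pearson interval for the TPF at the same 75% level used above.
binom.test(x = 20, n = 25, conf.level = 1 - 0.25)$conf.int   # exact binomial CI for the TPF (made-up counts)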
## Logistic Regression
plotdfRocClass <- data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),
D.str = pred_log_small2c,
M1 = pred_log_small2c_p$Qh.2,
Type = rep("Small", nrow(test_class)) )
plotdfRocClass <- rbind.data.frame(plotdfRocClass,
data.frame( D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),
D.str = pred_log_allc,
M1 = pred_log_allc_p$Qh.2,
Type = rep("Big", nrow(test_class))
))
plotdfRocClass <- rbind.data.frame(plotdfRocClass,
data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),
D.str = pred_log_bigc,
M1 = pred_log_bigc_p $Qh.2,
Type = rep("Big SW", nrow(test_class))
))
basicplot <- ggplot(plotdfRocClass, aes(d = D, m = M1, color = Type)) + geom_roc(n.cuts = 15) + style_roc()
ind.auc <- round(calc_auc(basicplot)$AUC, 2)
basicplot <- basicplot + geom_rocci(linetype = 1, aes(fill = Type), ci.at = quantile(plotdfRocClass$M1, c( .4,.6)), sig.level = .25) + scale_color_manual(values = brewer.pal(1, "Set1")[1:3]) +
annotate("text", x = .93, y = .19, label = paste("AUC =", ind.auc[1]), col = brewer.pal(1, "Set1")[1]) + theme_piss() +
annotate("text", x = .93, y = .11, label = paste("AUC =", ind.auc[2]), col = brewer.pal(1, "Set1")[2]) +
annotate("text", x = .93, y = .03, label = paste("AUC =", ind.auc[3]), col = brewer.pal(1, "Set1")[3]) + labs(title = "Logistic Regression")+ theme(legend.position = c(.7, .17), axis.text = element_text(colour = "black"), axis.title = element_text(colour = "turquoise4"))
## Penalized
plotdfRocClassGlmnet <- data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),
D.str = pred_glmnet_all_smallc ,
M1 = pred_glmnet_all_p_smallc$Qh.2,
Type = rep("Small", nrow(test_class)) )
plotdfRocClassGlmnet <- rbind.data.frame(plotdfRocClassGlmnet,
data.frame( D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),
D.str = pred_glmnet_allc,
M1 = pred_glmnet_allc_p$Qh.2,
Type = rep("All", nrow(test_class))
))
basicplotGlmnet <- ggplot(plotdfRocClassGlmnet, aes(d = D, m = M1, color = Type)) + geom_roc(n.cuts = 15) + style_roc()
ind.auc <- round(calc_auc(basicplotGlmnet)$AUC, 2)
basicplotGlmnet <- basicplotGlmnet + geom_rocci(linetype = 1, aes(fill = Type),sig.level = .25, ci.at = quantile(plotdfRocClassGlmnet$M1, c( .4, .6))) + scale_color_manual(values = brewer.pal(1, "Set1")[1:3]) +
annotate("text", x = .93, y = .19, label = paste("AUC =", ind.auc[1]), col = brewer.pal(1, "Set1")[1]) +
annotate("text", x = .93, y = .11, label = paste("AUC =", ind.auc[2]), col = brewer.pal(1, "Set1")[2]) + theme_piss()+ labs(title = "Penalized Log. Regression")+ theme(legend.position = c(.7, .17), axis.text = element_text(colour = "black"), axis.title = element_text(colour = "turquoise4"))
## PLS-DA
plotdfRocClassPLS <- data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),#ifelse(pred_svmLin == "Qh.2", 1, 0),
D.str = pred_pls_small_smallc ,
M1 = pred_pls_small_p_smallc $Qh.2,
Type = rep("Small", nrow(test_class)) )
plotdfRocClassPLS <- rbind.data.frame(plotdfRocClassPLS,
data.frame( D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0), #ifelse(pred_svmRad == "Qh.2", 1, 0),
D.str = pred_pls_small_allc ,
M1 = pred_pls_small_p_allc $Qh.2,
Type = rep("All", nrow(test_class))
))
plotdfRocClassPLS <- rbind.data.frame(plotdfRocClassPLS,
data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0), #ifelse(pred_svmPoly == "Qh.2", 1, 0),
D.str = ifelse(pred_df_oplsda$Qh.2 > cutoffopt['OPLS-DA all'], "Qh.2", "X0"),
M1 = pred_df_oplsda$Qh.2,
Type = rep("OPLS-DA", nrow(test_class))
))
basicplotPLS <- ggplot(plotdfRocClassPLS, aes(d = D, m = M1, color = Type)) + geom_roc(n.cuts = 15) + style_roc()
ind.aucPLS <- round(calc_auc(basicplotPLS)$AUC, 2)
basicplotPLS <- basicplotPLS + geom_rocci(linetype = 1, aes(fill = Type),sig.level = .25, ci.at = quantile(plotdfRocClassPLS$M1, c( .4, .6))) + scale_color_manual(values = brewer.pal(1, "Set1")[1:3]) +
annotate("text", x = .93, y = .19, label = paste("AUC =", ind.aucPLS[1]), col = brewer.pal(1, "Set1")[1]) +
annotate("text", x = .93, y = .11, label = paste("AUC =", ind.aucPLS[2]), col = brewer.pal(1, "Set1")[2]) + annotate("text", x = .93, y = .03, label = paste("AUC =", ind.auc[3]), col = brewer.pal(1, "Set1")[3]) + labs(title = "Partial Least Squares") + theme_piss() + theme(legend.position = c(.68, .17), axis.text = element_text(colour = "black"), axis.title = element_text(colour = "turquoise4"))
## SVM
plotdfRocClassSVM <- data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0),#ifelse(pred_svmLin == "Qh.2", 1, 0),
D.str = pred_svmLin,
M1 = pred_svmLinp$Qh.2,
Type = rep("Linear", nrow(test_class)) )
plotdfRocClassSVM <- rbind.data.frame(plotdfRocClassSVM,
data.frame( D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0), #ifelse(pred_svmRad == "Qh.2", 1, 0),
D.str = pred_svmRad,
M1 = pred_svmRadp$Qh.2,
Type = rep("Radial", nrow(test_class))
))
plotdfRocClassSVM <- rbind.data.frame(plotdfRocClassSVM,
data.frame(D = ifelse(test_class$Hip_prop == "Qh.2", 1, 0), #ifelse(pred_svmPoly == "Qh.2", 1, 0),
D.str = pred_svmPoly,
M1 = pred_svmPolyp$Qh.2,
Type = rep("Polyn.", nrow(test_class))
))
basicplotSVM <- ggplot(plotdfRocClassSVM, aes(d = D, m = M1, color = Type)) + geom_roc(n.cuts = 15) + style_roc()
ind.auc <- round(calc_auc(basicplotSVM)$AUC, 2)
basicplotSVM <- basicplotSVM + geom_rocci(linetype = 1, aes(fill = Type),sig.level = .25, ci.at = quantile(plotdfRocClassSVM$M1, c( .4, .6))) + scale_color_manual(values = brewer.pal(1, "Set1")[1:3]) +
annotate("text", x = .93, y = .19, label = paste("AUC =", ind.auc[1]), col = brewer.pal(1, "Set1")[1]) +
annotate("text", x = .93, y = .11, label = paste("AUC =", ind.auc[2]), col = brewer.pal(1, "Set1")[2]) +
annotate("text", x = .93, y = .03, label = paste("AUC =", ind.auc[3]), col = brewer.pal(1, "Set1")[3]) + labs(title = "SVM") + theme_piss() + theme(legend.position = c(.7, .17), axis.text = element_text(colour = "black"), axis.title = element_text(colour = "turquoise4"))#, axis.text = element_text(col = "#33666C"), axis.title = element_text(col = "blue"))
grid.arrange(basicplot, basicplotGlmnet,
basicplotPLS, basicplotSVM,
ncol = 2)
df_classFin <- data.frame(logloss = , AUC = , accuracy = , size = )
rownames(df_classFin) <-
pander(df_classFin)
Note that we could have done multi-class classification to improve the comparison (i.e. keeping all the samples).
The problem with the binary classification models is that predicting these two hippurate classes is too easy. The discrimination between the classes “0” and “Qh/2” is too strong, as we have seen on the densities in Section 4.1, where the separation is quasi-perfect.
To do that, we come back to the complete sample, i.e. the one that was used for the regression models.
Here we optimize with respect to the multi-class log loss criterion.
'LogLossMulti' <- function(actual, predicted, eps=1e-15) {
# actual: observed classes (factor/character) or a one-hot indicator matrix; predicted: matrix of class probabilities
if (is.factor(actual) | is.character(actual)) actual <- model.matrix(~ factor(actual) + 0)
predicted[predicted < eps] <- eps;
predicted[predicted > 1 - eps] <- 1 - eps;
-1/nrow(actual)*(sum(actual*log(predicted)))
}
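## Quick sanity check of LogLossMulti() on made-up values (3 observations, 2 classes),
## independent of the real data: -(log(.9) + log(.8) + log(.6)) / 3, i.e. about 0.28.
toy_actual <- matrix(c(1, 0,  0, 1,  1, 0), ncol = 2, byrow = TRUE)              # one-hot observed classes
toy_pred   <- matrix(c(0.9, 0.1,  0.2, 0.8,  0.6, 0.4), ncol = 2, byrow = TRUE)  # predicted probabilities
LogLossMulti(toy_actual, toy_pred)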
## Function to compute measures such as precision, recall and F-score. For multi-class problems they are computed for each class.
'check.model.accuracy' <- function(predicted.class, actual.class){
result.tbl <- as.data.frame(table(predicted.class,actual.class ) )
result.tbl$Var1 <- as.character(result.tbl$predicted.class)
result.tbl$Var2 <- as.character(result.tbl$actual.class)
colnames(result.tbl)[1:2] <- c("Pred","Actual")
cntr <- 0
for (pred.class in unique(result.tbl$Pred) ){
cntr <- cntr+ 1
tp <- sum(result.tbl[result.tbl$Pred == pred.class & result.tbl$Act==pred.class, "Freq"])
tp.fp <- sum(result.tbl[result.tbl$Pred == pred.class , "Freq" ])
tp.fn <- sum(result.tbl[result.tbl$Act == pred.class , "Freq" ])
precision <- tp/tp.fp
recall <- tp/tp.fn
F.score <- 2*precision*recall/(precision+recall)
if (cntr == 1 ) F.score.row <- cbind(pred.class, precision,recall,F.score)
if (cntr > 1 ) F.score.row <- rbind(F.score.row,cbind(pred.class,precision,recall,F.score))
}
F.score.row <- as.data.frame(F.score.row)
return(F.score.row)
}
# Transform the factor's names into valid names : needed for future models
# We keep the same decomposition train-test as for binary classif. Faster in tibble objects
training_class <- as_tibble(data_classif[rownames(data_classif) %in% index , ])
test_class <- as_tibble(data_classif[rownames(data_classif) %in% index == F, ])
trainClassYmulti <- make.names(factor(training_class$Hip_prop, levels = c("0", "Qh/2", "Qh/4", "Qh")))
testClassYmulti <- make.names(factor(test_class$Hip_prop, levels = c("0", "Qh/2")))
We will also make use of the MLmetrics package to compute some metrics.
Penalized Multinomial Regression
model = multinom(Hip_prop ~ ., data = cbind.data.frame(Hip_prop = training$Hip_prop, training_c_small))
summary(model)
set.seed(134)
multiLogPen <- train(y = training$Hip_prop ,
x = training_c_small,
method = "multinom",
metric = "mnLogLoss",
tuneLength = 15,
trControl = ctrl)
pred_multiLog <- predict(multiLogPen, test_c[, kept_features])#, type = "prob")
pred_multiLogp <- predict(multiLogPen, test_c[, kept_features], type = "prob")
pred_logl_multi <- numeric(length = nrow(pred_multiLogp))
for(i in 1:nrow(pred_multiLogp)){
pred_logl_multi[i] <- pred_multiLogp[i, test$Hip_prop[i]]
}
#pred_multiLogp[ ,test$Hip_prop == colnames(pred_multiLogp)]
MultiLogLoss(pred_multiLogp, test$Hip_prop)
# LogLossMulti(test$Hip_prop, pred_multiLogp)
MultiLogLoss(pred_logl_multi, test$Hip_prop)
LogLossMulti(test$Hip_prop, pred_logl_multi)
Accuracy(test$Hip_prop, predict(multiLogPen, test_c))
check.model.accuracy(test$Hip_prop, predict(multiLogPen, test_c))
set.seed(134)
multi_pls <- train(y = training$Hip_prop ,
x = training_c,
method = "pls",
metric = "mnLogLoss",
tuneLength = 15,
trControl = ctrl)
predict(multi_pls, test_c)#, type = "prob")
MultiLogLoss(predict(multi_pls, test_c, type = "prob"), test$Hip_prop)
LogLossMulti(test$Hip_prop, predict(multi_pls, test_c, type = "prob"))
Accuracy(test$Hip_prop, predict(multi_pls, test_c))
check.model.accuracy(test$Hip_prop, predict(multi_pls, test_c))
SPLS
ctrl <- trainControl(method = "repeatedcv", repeats = 1, classProbs = T, summaryFunction = mnLogLoss, savePredictions = T, verboseIter = T )
grid_spls2 <- expand.grid(K = seq(1, 100, by = 6),
eta = c(.1, .3, .6),
kappa = .5)
set.seed(134)
time <- proc.time()
multi_Spls <- train(y = training$Hip_prop ,
x = training_c,
method = "spls",
metric = "mnLogLoss",
#tuneLength = 5,
tuneGrid = grid_spls2,
trControl = ctrl)
(proc.time() - time)[3] # 1952 sec
pred_multi_spls <- predict(multi_Spls, test_c)#, type = "prob")
pred_multi_splsp <- predict(multi_Spls, test_c, type = "prob")
MultiLogLoss(pred_multi_splsp, test$Hip_prop)
LogLossMulti(test$Hip_prop, pred_multi_splsp)
Accuracy(test$Hip_prop, pred_multi_spls)
check.model.accuracy(test$Hip_prop, pred_multi_spls)
ctrl <- trainControl(method = "repeatedcv", repeats = 1, # one repetition of 10-fold CV; very slow
classProbs = T, summaryFunction = mnLogLoss, verboseIter = T, savePredictions = T )
set.seed(134)
time <- proc.time()
svm_multiClass_rad <- train(y = make.names(training$Hip_prop),
x = training_c,
method = "svmLinear2", # Radial kernel
tuneLength = 10, # 9 values of the cost function
metric = "mnLogLoss",
trControl = ctrl)
(proc.time() - time)[3] # 52 sec
pred_multi_svm <- predict(svm_multiClass_rad, test_c)
pred_multisvmp <- predict(svm_multiClass_rad, test_c, type = "prob")
LogLossMulti(test$Hip_prop, pred_multisvmp)
MultiLogLoss(pred_multisvmp, test$Hip_prop)
Accuracy(test$Hip_prop, pred_multi_svm)
check.model.accuracy(test$Hip_prop, pred_multi_svm)
test[test$Hip_prop != pred_multi_svm,]
As always, observation number x is problematic!
You can sort them in ascending order of RMSE by clicking on the arrows next to the column names.
final_rmse <- data.frame("True.RMSE" = c(rmse_reglinc, rmse_reglinswc, rmse_reglmnet1c, rmse_reglmnet2c, rmse_reglmnet3c, rmse_pcrAllc, rmse_pcrAll2c, rmse_pcrSmallc, rmse_icrSmallc, rmse_plsAllc, rmse_plsSmallc, rmse_oplsAll, rmse_oplsSmall, rmse_splsSmallc,rmse_splsAllc, rmse_rf_allc, rmse_xgb_regNocenterFin, try_rmse_ensemble, try_rmse_stackrf2), "RMSE-cv" = c(cv_rmse_regLin, cv_rmse_regLinsw, cv_rmse_reglmnet1, cv_rmse_reglmnet2, cv_rmse_reglmnet3, cv_rmse_pcrAll1, cv_rmse_pcrAll2,cv_rmse_pcrSmall, cv_rmse_icr_small, cv_rmse_pls_all, cv_rmse_pls_small, cv_rmse_opls_all,cv_rmse_opls_Small, cv_rmse_spls_small, cv_rmse_spls_all, cv_rmse_rf_all_nocenter, cv_rmse_xgbNocenterFin, try_cv_rmse_ensemble, try_cv_rmse_stackrf2) )
rownames(final_rmse) <- c('Linear regression', 'Linear regression SW', 'Elastic Net full', 'Elastic Net small (-)', 'Elastic Net small (+)', 'PCR all (+)', 'PCR all (-)', 'PCR small', 'ICR small', 'PLS all', 'PLS small', "OPLS all", "OPLS Small", 'SPLS small','SPLS all', 'Random Forest all', 'XGBoost all', "Ensemble Linear", "Ensemble random forest")
#formattable(final_rmse, list(area(col = True.RMSE) ~ color_tile("lightblue", "lightpink")))
as.datatable(formattable(final_rmse, list(True.RMSE = color_tile("lightblue", "lightpink"))), style = 'bootstrap', options = list(dom = 'tp'), width = "100%")
‘(+)’ means a broader hyperparameter space search, while ‘small’ means only the 18 selected features are used.
models <- rownames(final_rmse)
models.RMSE <- final_rmse$True.RMSE
rmse.df <- data.frame(models = models,
RMSE = models.RMSE)
#Creating bar plot to compare the all models
ggplot(rmse.df, aes(x = models, y = RMSE)) +
geom_col(aes(fill = RMSE)) +
geom_text(aes(x = models, y = RMSE/2, label = RMSE), color = "white") +
coord_flip() +
scale_fill_gradient(low = "cyan", high = "darkred", limits = c(min(rmse.df$RMSE), max(rmse.df$RMSE))) +
labs(title = "True RMSE Values Of Each Model") + theme_piss(theme = theme_grey()) + scale_x_discrete(limits = rev(models))You can class them in ascending order of LogLoss by clicking on the arrows right to the column names.
final_logl <- data.frame("True.LogLoss" = c( logl_log_smallc, logl_log_allc, logl_log_small2c, logl_glmnet_allc, logl_glmnet_all_smallc, logl_pls_small_smallc, logl_pls_small_allc, 0, logloss_svmLin, logloss_svmRad, logloss_svmPol), "LogLoss.cv" = c(cv_logl_glmnetSmall, cv_logl_glmAll, cv_logl_glmSmallsw, cv_logl_glmnetAll, cv_logl_glmnetSmall, cv_logl_plsSmall, cv_logl_plsAll, "XX ?", cv_logl_svmLin, cv_logl_svmRad, cv_logl_svmPol))
rownames(final_logl) <- c( 'Logistic regression small', 'Logistic regression `all`', "Logistic regression SW small", 'Elastic Net all', 'Elastic Net small', 'PLS-DA small', 'PLS-DA all', 'OPLS-DA small', "SVM linear", "SVM radial", "SVM polynomial")
as.datatable(formattable(final_logl, list(True.LogLoss = color_tile("lightblue", "lightpink"))), style = 'bootstrap', options = list(dom = 'tp'))
‘(+)’ means a broader hyperparameter space search, while ‘small’ means only the 18 selected features are used.
modelsC <- rownames(final_logl)
models.logl <- final_logl$True.LogLoss
logl.df <- data.frame(models = modelsC,
logloss = models.logl)
#Creating bar plot to compare the all models
ggplot(logl.df, aes(x = models, y = logloss)) +
geom_col(mapping = aes(fill = logloss)) +
geom_text(mapping = aes(x = models, y = logloss/2, label = logloss), color = "white") +
coord_flip() +
scale_fill_gradient(low = "cyan", high = "darkred", limits = c(min(logl.df$logloss), max(logl.df$logloss))) +
labs(title = "True LogLoss Values Of Each Model") + theme_piss(theme = theme_grey()) + scale_x_discrete(limits = rev(modelsC))You can class them in ascending order of LogLoss by clicking on the arrows right to the column names.
final_loglMult <- data.frame("True.LogLoss" = c( logl_log_smallc, logl_log_allc, logl_log_small2c, logl_glmnet_allc, logl_glmnet_all_smallc, logl_pls_small_smallc, logl_pls_small_allc, "XX?", logloss_svmLin, logloss_svmRad, logloss_svmPol) )
rownames(final_loglMult) <- c( 'Logistic regression small', 'Logistic regression `all`', "Logistic regression SW small", 'Elastic Net all', 'Elastic Net small', 'PLS small', 'PLS all', 'OPLS small', "SVM linear", "SVM radial", "SVM polynomial")
as.datatable(formattable(final_loglMult, list(True.LogLoss = color_tile("lightblue", "lightpink"))), style = 'bootstrap', options = list(dom = 'tp'))
‘(+)’ means a broader hyperparameter space search, while ‘small’ means only the 18 selected features are used.
Some points remain to be improved. Redoing the analysis with 5-fold cross-validation might reduce the extent to which the cross-validation estimates underestimate the error. Graphs comparing the test error with the cross-validation error (e.g. learning curves) would also be informative.
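A minimal sketch of such a comparison, reusing the final_rmse table built above (data.frame() renames the “RMSE-cv” column to RMSE.cv):
rmse_compare <- data.frame(model = rownames(final_rmse),
test_rmse = final_rmse$True.RMSE,
cv_rmse = as.numeric(final_rmse$RMSE.cv)) # coercion in case the CV column was stored as character
ggplot(rmse_compare, aes(x = cv_rmse, y = test_rmse, label = model)) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") + # points above the dashed line: CV underestimates the test error
geom_point(color = "darkred") +
geom_text(size = 2.5, vjust = -0.8) +
labs(title = "Test RMSE versus cross-validation RMSE", x = "Cross-validation RMSE", y = "Test RMSE")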
We have seen that ensemble methods indeed outperformed all the other methods in the regression scheme. However, their drawback is that they are hard to interpret, as they ‘aggregate’ different single models. Together with the high computational cost, they may not be ten times more valuable than other methods for such a ‘simple’ problem. OPLS, for example, is a much simpler alternative.
We only worked with the original data, but as we saw in the distribution (violin) plots in section 4.1, it could be valuable to redo the analysis on the log-transformed data. This could in particular reduce the ‘bad’ effect of outliers for linear models.
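As a minimal sketch (assuming the spectral intensities in dataX are non-negative), the log transformation could be applied before rebuilding the training and test sets:
dataX_log <- log1p(dataX) # log(1 + x) avoids problems with zero intensities
# the train/test split and the caret models above could then be rerun on dataX_log instead of dataX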
Moreover, another drawback of the more complex models (which here led to poorer performance) is their lack of interpretability.
Regarding biomarker detection, we did not think it useful to display the features selected by every method we considered. To summarize across methods, we could instead take a final “vote” and display the ppm values that were selected most often:
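A minimal sketch of such a vote, assuming each method’s selection is stored as a character vector of ppm names (sel_enet, sel_spls and sel_rf below are hypothetical placeholders):
selections <- list(enet = sel_enet, spls = sel_spls, rf = sel_rf) # hypothetical per-method selections
votes <- sort(table(unlist(selections)), decreasing = TRUE) # count how many methods kept each ppm
head(votes, 18) # the ppm values selected most often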
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. doi:10.1023/A:1010933404324.
Chun, Hyonho, and Sündüz Keles. 2007. “Sparse Partial Least Squares Regression with an Application to Genome Scale Transcription Factor Analysis.” Department of Statistics, University of Wisconsin, Madison. http://pages.stat.wisc.edu/~keles/Papers/old_SPLS_Nov07.pdf.
Clopper, C. J., and E. S. Pearson. 1934. “The Use of Confidence or Fiducial Limits Illustrated in the Case of the Binomial.” Biometrika 26 (4): 404–13. https://www.jstor.org/stable/2331986.
Hand, David J. 2009. “Measuring Classifier Performance: A Coherent Alternative to the Area Under the ROC Curve.” Machine Learning 77 (1): 103–23. doi:10.1007/s10994-009-5119-5.
Kim, SungHwan, Jae-Hwan Jhong, JungJun Lee, and Ja-Yong Koo. 2017. “Meta-Analytic Support Vector Machine for Integrating Multiple Omics Data.” BioData Mining 10: 2. doi:10.1186/s13040-017-0126-8.
Pepe, Margaret Sullivan. 2004. The Statistical Evaluation of Medical Tests for Classification and Prediction. Oxford University Press.
Rousseau, Réjane. 2011. “Statistical Contribution to the Analysis of Metabonomic Data in 1H-NMR Spectroscopy.” PhD thesis, UCL - Université Catholique de Louvain. https://dial.uclouvain.be/pr/boreal/object/boreal:75532.
Varma, Sudhir, and Richard Simon. 2006. “Bias in Error Estimation When Using Cross-Validation for Model Selection.” BMC Bioinformatics 7 (February): 91. doi:10.1186/1471-2105-7-91.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20. http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2005.00503.x/full.
Comments
There are very large linear correlations between some input features and the target. We will take advantage of this in the feature selection subsection.
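A minimal sketch of such a correlation-based screening, assuming the numeric target is the hippurate concentration dataAll$Hip_CONC and the spectral features are in dataX:
cor_with_target <- apply(dataX, 2, function(ppm) cor(ppm, dataAll$Hip_CONC)) # linear correlation of each ppm with the target
head(sort(abs(cor_with_target), decreasing = TRUE), 18) # the 18 ppm most linearly related to the target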
The distributions seem very different across features, and some have very fat tails; this is easier to see on the violin plots of the centered features. Hence, one solution would be to take their logarithm to reduce the strong influence of the values located in the tails, which could be seen as outliers and would particularly harm linear models.
By arbitrarily choosing a ppm located in one of the “peaks” of the spectrum graph above, we see that the hippurate concentration strongly influences the horizontal location of the distribution of this ppm for each class. However, by sampling a random ppm, we see that there are not many differences across the classes of hippurate concentration. Hence, a single ppm can be very discriminative for predicting hippurate while others seem useless: that is the reason why we perform feature selection in Section 4.4. Actually, we will see that this particular ppm is the feature selected by most of the algorithms. Looking more closely at the left graph, we note that the densities overlap a little, and hence it is not possible to perfectly predict the quantities of hippurate in the training set. For the TEST SET the picture is the same; the separation even seems a bit stronger (also because there are fewer observations). Note that we could not have this information in real applications because we do not know the target.
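A minimal sketch of this kind of comparison; the ppm name below is a hypothetical placeholder for the discriminative peak, and any column of training_c can be substituted:
ppm_chosen <- "3.9679273" # hypothetical: replace by the peak ppm actually inspected above
train_df <- as.data.frame(training_c) # plain data.frame, whatever class training_c has
df_dens <- data.frame(intensity = train_df[[ppm_chosen]], Hip_prop = training$Hip_prop)
ggplot(df_dens, aes(x = intensity, fill = Hip_prop)) +
geom_density(alpha = 0.4) +
labs(title = paste("Densities of ppm", ppm_chosen, "by hippurate class"), x = "Centered intensity", y = "Density")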