Classification and Regression Study on Metabolomic data

Model Comparisons

Antoine Pissoort
LSTAT2340 OMICS

Introduction

Dataset with 269 observations and 600 variables (ppm), plus 9 variables that controlled the experiment
Goal: predict the hippurate concentration:
1. For regression models, we
2. For classification models, we will

Representation of the spectra

[Figure: the spectra]

Methodology

Supervised Learning

Training set \(\Rightarrow\) 80% (215 obs.)
Test set \(\Rightarrow\) 20% (54 obs.)
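
The 80/20 split above can be sketched in base R as follows. This is a minimal illustration, not the code used for the slides: the matrix `X` is simulated stand-in data, and the seed and the use of `sample()` are assumptions.

```r
# Hypothetical stand-in for the real data: 269 observations, 600 variables.
set.seed(1)                       # arbitrary seed, for reproducibility only
n <- 269
X <- matrix(rnorm(n * 600), nrow = n)

# Draw 80% of the row indices for training; the remaining rows form the test set.
n_train   <- floor(0.8 * n)
train_idx <- sample(seq_len(n), size = n_train)
test_idx  <- setdiff(seq_len(n), train_idx)

X_train <- X[train_idx, , drop = FALSE]
X_test  <- X[test_idx, , drop = FALSE]

dim(X_train)
dim(X_test)
```

With 269 observations, `floor(0.8 * n)` gives the 215 training rows stated above and leaves 54 rows for testing.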

Visualization : PCA

[Figure: PCA of the observations]

\(\Longrightarrow\) Linear separation
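
A PCA view of this kind can be obtained with `prcomp()` from base R. The sketch below uses simulated toy data with two artificial groups (an assumption made only so the example runs on its own); it is not the analysis behind the figure.

```r
set.seed(1)  # arbitrary seed

# Hypothetical toy data: two groups of 20 observations over 10 variables,
# the second group shifted so that the groups are well separated.
X <- rbind(matrix(rnorm(200, mean = 0), nrow = 20),
           matrix(rnorm(200, mean = 3), nrow = 20))
group <- rep(c("A", "B"), each = 20)

# Centre and scale the variables, then project onto the principal components.
pca    <- prcomp(X, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Proportion of total variance carried by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained[1:2], 3)
```

Plotting `scores` coloured by `group`, for instance with `plot(scores, col = as.integer(factor(group)))`, shows whether the classes can be separated by a straight line in the first two components.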

Feature Selection

For models that cannot handle too many variables (\(p \gg n\))
Better overall performance when the feature space is reliably reduced ...

Some models do support many variables (e.g. penalized regression), or they integrate the feature selection process in their

##  [1] "7.656082"  "7.6397634" "7.852031"  "3.9811536" "7.5580466" "7.5743654"
##  [7] "2.0538744" "8.5380354" "1.5638796" "1.3187906" "0.894194"  "2.2007444"
## [13] "3.6054524" "2.0047344" "3.0501212" "1.9230788" "1.5801986" "0.926832"

Remove features with no variance. Then, by a 'meta'-procedure, we

We assume the same selection for the regression and classification models.
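
The first step, dropping features with no variance, can be sketched in base R. The small matrix below is made-up illustration data, and the 'meta'-procedure that follows this step is not reproduced here.

```r
# Hypothetical toy matrix: column "b" is constant and carries no information.
X <- cbind(a = c(1, 2, 3, 4),
           b = c(5, 5, 5, 5),
           c = c(0.1, 0.4, 0.2, 0.9))

# Variance of every column; keep only the columns whose variance is positive.
col_var   <- apply(X, 2, var)
keep      <- col_var > 0
X_reduced <- X[, keep, drop = FALSE]

colnames(X_reduced)
```

`drop = FALSE` keeps the result a matrix even if only one column survives.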

Representation of the spectra + selected features

[Figure: the spectra with the selected features]

Regression Models

Predict the hippurate concentration

Linear Regression

Partial Least Squares

PLS

##             49      54      10      3     29    31     33      1
## True     0.000 300.000 300.000  0.000  0.000 0.000 75.000  0.000
## Predict 11.079 312.068 305.355 20.716 10.413 8.577 88.276 16.911

OPLS

##             49      54     10       3      29      31     33       1
## True      0.00 300.000 300.00   0.000   0.000   0.000 75.000   0.000
## Predict -50.13 245.737 240.66 -41.549 -51.556 -54.849 28.371 -44.753

SPLS (3 hyperparameters, very slow)

##               49       54       10        3      29       31       33        1
## True       0.000  300.000  300.000    0.000    0.00    0.000   75.000    0.000
## Predict -667.296 -808.474 -545.018 -441.751 -820.34 -798.326 -895.176 -444.809

Final Results : Regression

| Model | True.RMSE | RMSE.cv |
|---|---|---|
| Linear regression SW | 20.00 | 7.29 |
| Elastic Net full | 46.96 | 6.58 |
| Elastic Net small (-) | 1734.71 | 7.99 |
| Elastic Net small (+) | 998.93 | 8.10 |
| PCR all (+) | 105.77 | 4.70 |
| PCR all (-) | 417.86 | 5.59 |
| PCR small | 23.12 | 8.03 |
| ICR small | 22.67 | 7.10 |
| PLS all | 159.76 | 4.65 |
| PLS small | 215.62 | 8.08 |
| OPLS small | 45.54 | 166.90 |
| SPLS small | 857.05 | 7.07 |
| XGBoost all | 215.62 | 5.52 |

(+): larger hyperparameter search space. "small": fitted with the 18 selected features.
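
For reference, the RMSE can be computed with a small helper. The sketch below applies it to the eight PLS test predictions displayed earlier, so the value it returns covers only those eight observations and is not one of the table entries. The function name `rmse` is an assumption for illustration.

```r
# Root mean squared error between observed and predicted values.
rmse <- function(truth, pred) {
  sqrt(mean((truth - pred)^2))
}

# The eight PLS predictions shown earlier (a subset of the test set).
truth <- c(0, 300, 300, 0, 0, 0, 75, 0)
pred  <- c(11.079, 312.068, 305.355, 20.716, 10.413, 8.577, 88.276, 16.911)

rmse(truth, pred)
```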

Classification Models

The log loss is our metric for comparing models

Binary \[ \min\; -\frac{1}{N}\sum_{i=1}^{N}\Big[y_i\cdot\log p_i+(1-y_i)\cdot\log (1-p_i)\Big]\]
Multi-class \[ \min\; -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij}\cdot \log p_{ij}\]
- We will consider
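
The two log-loss formulas above can be written as short R functions. This is a sketch under stated assumptions: the function names are hypothetical, and the clipping constant `eps`, which keeps `log()` away from zero, is an addition that the formulas themselves do not contain.

```r
# Binary log loss: y is a 0/1 vector, p the predicted probability of class 1.
binary_logloss <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)   # clip probabilities (assumption, see above)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Multi-class log loss: Y is an N x M one-hot matrix, P the N x M matrix
# of predicted class probabilities.
multiclass_logloss <- function(Y, P, eps = 1e-15) {
  P <- pmin(pmax(P, eps), 1 - eps)
  -sum(Y * log(P)) / nrow(Y)
}

# Toy check: with M = 2 the multi-class form reduces to the binary one.
y <- c(1, 0, 1, 1)
p <- c(0.9, 0.2, 0.6, 0.7)
binary_logloss(y, p)
multiclass_logloss(cbind(1 - y, y), cbind(1 - p, p))
```

Lower values are better, and a model that predicts 0.5 for every observation in the binary case scores \(\log 2\).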

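```{r}
# Publish the deck to GitHub (publish() is assumed to come from slidify)
library(slidify)
publish(user = "proto4426", repo = "OMICS", host = "github")
# publish(title = 'Classification and Regression Study on Metabolomic data', 'Slidify.html', host = 'rpubs')

# Alternative ways to render the deck to HTML
library(knitr)
library(markdown)
rmarkdown::render("Slidify.Rmd")
knit2html("Slidify.Rmd", options = "")
markdownToHTML("Slidify.Rmd", options = c('skip_images'))

# E-mail the rendered HTML file
library(mailR)
send.mail(from = "antoine.pissoort@student.uclouvain.be",
          to = "antoine.pissoort@student.uclouvain.be",
          subject = "MyMail",
          html = TRUE,
          inline = TRUE,
          body = "Slidify.html",
          # The password is read from an environment variable rather than hard-coded;
          # the variable name SMTP_PASSWORD is a placeholder.
          smtp = list(host.name = "smtp.gmail.com", port = 465, user.name = "Antoine Pissoort", passwd = Sys.getenv("SMTP_PASSWORD"), ssl = TRUE),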