Classification and Regression Study on Metabolomic data

Model Comparisons

Antoine Pissoort
LSTAT2340 OMICS

Introduction

  • Data set with 269 observations and 600 spectral variables (ppm), plus 9 experimental control variables

  • Predict the hippurate concentration:

    1. For regression models, we
    2. For classification models, we will

Representation of the spectra

[Figure: spectra]

Methodology

Supervised Learning

  • Training set \(\Rightarrow\) 80% (215 obs.)
  • Test set \(\Rightarrow\) 20% (54 obs.)
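
As a minimal sketch of the 80/20 split above, the following base R code draws the training indices at random. The seed and the variable names (`train_idx`, `test_idx`) are assumptions made for illustration; the slides do not say how the split was actually drawn.

```r
set.seed(1)                                  # arbitrary seed, for reproducibility only
n <- 269                                     # number of observations in the data set
n_train <- floor(0.8 * n)                    # 80% of 269, rounded down: 215
train_idx <- sample(seq_len(n), size = n_train)
test_idx  <- setdiff(seq_len(n), train_idx)  # the remaining 54 observations
c(train = length(train_idx), test = length(test_idx))
```

The test indices are only used for the final evaluation; any tuning would be done on the training part.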

Visualization: PCA

[Figure: PCA plot]

\(\Longrightarrow\) Linear separation
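
A PCA view of this kind could be produced roughly as follows. This is only a sketch on simulated data: the matrix `X`, the two-level factor `group`, and the artificial shift that creates the separation are all assumptions, not the actual spectra or classes.

```r
set.seed(2)
# Simulated stand-in for the spectral matrix: 269 observations x 600 ppm variables
X <- matrix(rnorm(269 * 600), nrow = 269, ncol = 600)
# Hypothetical class labels, used only to colour the points
group <- factor(sample(c("low", "high"), 269, replace = TRUE))
# Shift 20 variables in one class so that a separation exists to be seen
X[group == "high", 1:20] <- X[group == "high", 1:20] + 3

pca <- prcomp(X, center = TRUE, scale. = FALSE)
scores <- pca$x[, 1:2]                       # coordinates on the first two components
explained <- pca$sdev^2 / sum(pca$sdev^2)    # proportion of variance per component

plot(scores, col = as.integer(group), pch = 19,
     xlab = sprintf("PC1 (%.1f%%)", 100 * explained[1]),
     ylab = sprintf("PC2 (%.1f%%)", 100 * explained[2]))
```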

Feature Selection

  1. For models that cannot handle too many variables (\(p \gg n\))
  2. Better overall performance when the feature space is reliably reduced ...

Some models do support many variables (e.g. penalized regression), or they integrate the feature selection process in their

##  [1] "7.656082"  "7.6397634" "7.852031"  "3.9811536" "7.5580466" "7.5743654"
##  [7] "2.0538744" "8.5380354" "1.5638796" "1.3187906" "0.894194"  "2.2007444"
## [13] "3.6054524" "2.0047344" "3.0501212" "1.9230788" "1.5801986" "0.926832"

Remove features with no variance. Then, by a 'meta'-procedure, we

We assume the same feature selection for the regression and classification models.
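
The first step, removing features with no variance, can be sketched in base R as below. The toy matrix and its column names are hypothetical; the subsequent 'meta'-procedure is not reproduced here because the slides do not describe it.

```r
set.seed(3)
# Toy matrix with 5 features (hypothetical names), two of them constant
X <- matrix(rnorm(20 * 5), nrow = 20,
            dimnames = list(NULL, paste0("ppm_", 1:5)))
X[, 2] <- 0      # constant column
X[, 4] <- 1.5    # another constant column

col_var <- apply(X, 2, var)             # variance of each feature
keep <- col_var > 0                     # TRUE for features that vary
X_filtered <- X[, keep, drop = FALSE]   # drop = FALSE keeps the matrix shape
colnames(X_filtered)
```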

Representation of the spectra and selected features

[Figure: spectra with the selected features]

Regression Models

Predict the hippurate concentration

Linear Regression

Partial Least Squares

PLS

##             49      54      10      3     29    31     33      1
## True     0.000 300.000 300.000  0.000  0.000 0.000 75.000  0.000
## Predict 11.079 312.068 305.355 20.716 10.413 8.577 88.276 16.911

OPLS

##             49      54     10       3      29      31     33       1
## True      0.00 300.000 300.00   0.000   0.000   0.000 75.000   0.000
## Predict -50.13 245.737 240.66 -41.549 -51.556 -54.849 28.371 -44.753

SPLS (3 hyperparameters, very slow)

##               49       54       10        3      29       31       33        1
## True       0.000  300.000  300.000    0.000    0.00    0.000   75.000    0.000
## Predict -667.296 -808.474 -545.018 -441.751 -820.34 -798.326 -895.176 -444.809

Final Results: Regression

Model                    True.RMSE   RMSE.cv
Linear regression SW         20.00      7.29
Elastic Net full             46.96      6.58
Elastic Net small (-)      1734.71      7.99
Elastic Net small (+)       998.93      8.10
PCR all (+)                 105.77      4.70
PCR all (-)                 417.86      5.59
PCR small                    23.12      8.03
ICR small                    22.67      7.10
PLS all                     159.76      4.65
PLS small                   215.62      8.08
OPLS small                   45.54    166.90
SPLS small                  857.05      7.07
XGBoost all                 215.62      5.52
  • (+): larger hyperparameter search space. "small": fitted with the 18 selected features.
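
The slides do not define the two columns. The sketch below assumes that RMSE.cv is a k-fold cross-validated RMSE computed on the training set, and that True.RMSE is the RMSE on held-out observations. The helper `rmse`, the simulated data, the choice of 10 folds, and the use of `lm` are all illustrative assumptions; only the eight true/predicted PLS values are copied from the output shown earlier.

```r
# Root mean squared error between observed and predicted values
rmse <- function(truth, pred) sqrt(mean((truth - pred)^2))

# RMSE on the eight PLS predictions displayed earlier in the slides
true_pls <- c(0, 300, 300, 0, 0, 0, 75, 0)
pred_pls <- c(11.079, 312.068, 305.355, 20.716, 10.413, 8.577, 88.276, 16.911)
rmse_pls8 <- rmse(true_pls, pred_pls)

# k-fold cross-validated RMSE on simulated training data (assumed setup)
set.seed(4)
n <- 215
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 100 + 50 * dat$x1 - 20 * dat$x2 + rnorm(n, sd = 5)

k <- 10
fold <- sample(rep(seq_len(k), length.out = n))   # random fold label per row
fold_rmse <- vapply(seq_len(k), function(f) {
  fit <- lm(y ~ x1 + x2, data = dat[fold != f, ]) # fit on the other k - 1 folds
  rmse(dat$y[fold == f], predict(fit, newdata = dat[fold == f, ]))
}, numeric(1))
rmse_cv <- mean(fold_rmse)                        # average over the k folds
```

Under this reading, a large gap between the two columns would point to a model that fits the cross-validation folds well but generalizes poorly to the held-out data.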

Classification Models

The log loss will be our metric for comparing models

  • Binary \[ \min -N^{-1}\sum_{i=1}^N\Big[y_i\cdot\log p_i+(1-y_i)\cdot\log (1-p_i)\Big]\]
  • Multi-class \[ \min -N^{-1}\sum_{i=1}^N\sum_{j=1}^M y_{ij}\cdot \log p_{ij}\]
    • We will consider
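
The two log loss formulas above can be written in base R as follows. The function names and the clipping constant `eps` (which keeps probabilities away from 0 and 1 so that the logarithm stays finite) are assumptions added for this sketch and are not part of the formulas.

```r
# Binary log loss: y is a 0/1 vector, p the predicted probability of class 1
logloss_binary <- function(y, p, eps = 1e-15) {
  p <- pmin(pmax(p, eps), 1 - eps)       # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}

# Multi-class log loss: Y is an N x M one-hot indicator matrix,
# P the N x M matrix of predicted class probabilities
logloss_multi <- function(Y, P, eps = 1e-15) {
  P <- pmin(pmax(P, eps), 1 - eps)       # clip to avoid log(0)
  -sum(Y * log(P)) / nrow(Y)
}

# With M = 2 classes, the multi-class form reduces to the binary one
y <- c(1, 0, 1, 1)
p <- c(0.9, 0.2, 0.6, 0.5)
logloss_binary(y, p)
logloss_multi(cbind(1 - y, y), cbind(1 - p, p))
```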

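```{r}
# Publishing helpers for the slide deck (not part of the analysis).
library(slidify)   # publish(); assumed to come from slidify
library(knitr)     # knit2html()
library(markdown)  # markdownToHTML()
library(mailR)     # send.mail()

# Publish the deck to GitHub; `host` must be a character string.
publish(user = "proto4426", repo = "OMICS", host = "github")
# publish(title = 'Classification and Regression Study on Metabolomic data', 'Slidify.html', host = 'rpubs')

# Alternative ways of rendering the deck to HTML.
rmarkdown::render("Slidify.Rmd")
knit2html("Slidify.Rmd", options = "")
markdownToHTML("Slidify.Rmd", options = c("skip_images"))

# Send the rendered deck by e-mail. The password is read from an environment
# variable (the name GMAIL_PASSWORD is an assumption) rather than hard-coded.
send.mail(from = "antoine.pissoort@student.uclouvain.be",
          to = "antoine.pissoort@student.uclouvain.be",
          subject = "MyMail",
          html = TRUE,
          inline = TRUE,
          body = "Slidify.html",
          smtp = list(host.name = "smtp.gmail.com", port = 465,
                      user.name = "Antoine Pissoort",
                      passwd = Sys.getenv("GMAIL_PASSWORD"), ssl = TRUE),
          authenticate = TRUE,
          send = TRUE)
```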