Classification and Regression Study on Metabolomic data

Model Comparisons

Antoine Pissoort
LSTAT2340 OMICS

Introduction

  • Metabolomic study

  • Database with 269 observations and 600 variables (ppm descriptors) + 9 meta-variables that controlled the experiment

  • Our aim is to predict Hippurate's concentration (which we controlled!):

    1. For regression models, we will take it as numeric and predict it as continuous
    2. For classification models, we will take 2 classes of concentration
    3. We will also try multi-class classification
  • Use the caret infrastructure \(\Rightarrow\) a homogeneous workflow across models.

Representation of the spectra

[Figure: spectra of the samples]

Methodology

  • Supervised Learning

Divide the dataset into

  • Training set \(\Rightarrow\) 80% (215 obs.)
  • Test set \(\Rightarrow\) 20% (54 obs.)
  1. Train the model on the training set \(\Rightarrow\) choose the optimal (hyper)parameters by cross-validation and estimate the error.
  2. With the model selected, test it on the new (independent) test set to assess its accuracy ('true' error)
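A minimal caret sketch of this split/CV/test workflow (the data frame data, the outcome column Hippurate and the seed are assumed names/values, not necessarily the ones used here):

library(caret)
set.seed(123)                                          # reproducible split (seed value assumed)
idx       <- createDataPartition(data$Hippurate, p = 0.8, list = FALSE)
train_set <- data[idx, ];  test_set <- data[-idx, ]    # 80% / 20% split
ctrl      <- trainControl(method = "cv", number = 10)  # k-fold CV on the training set
fit       <- train(Hippurate ~ ., data = train_set, method = "pls", trControl = ctrl)
pred      <- predict(fit, newdata = test_set)          # 'true' error assessed on the held-out set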

Visualization : PCA

[Figure: PCA of the samples]

\(\Longrightarrow\) Linear separation
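A hedged sketch of how such a PCA view can be produced (X = matrix of the 600 spectral variables, concentration = controlled hippurate level; both names are assumptions):

pca <- prcomp(X, center = TRUE, scale. = FALSE)        # PCA on the spectra
plot(pca$x[, 1:2],                                     # scores on the first two components
     col = as.numeric(as.factor(concentration)),      # colour by hippurate level
     xlab = "PC1", ylab = "PC2")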

Multiple Testing Comparison

  • t-tests comparing the "0" and "Qh/2" hippurate concentration groups for each variable.

Bonferroni correction :

\[Pvalue_{(\text{Bonf})} = \min\big(N\cdot Pvalue,\ 1\big), \qquad N = \text{number of tests}.\]

[Figure: Bonferroni-adjusted p-values]

## Biomarkers :  7.509151 7.4926486 7.705161 1.9067596 1.171921 2.90319 0.7471408 1.1390996 1.4167652
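A sketch of this multiple-testing step, assuming X0 and Xh2 hold the spectra of the "0" and "Qh/2" groups (one column per ppm variable):

pvals  <- sapply(seq_len(ncol(X0)),
                 function(j) t.test(X0[, j], Xh2[, j])$p.value)   # one t-test per variable
p_bonf <- p.adjust(pvals, method = "bonferroni")                  # = min(N * p, 1)
biomarkers <- colnames(X0)[p_bonf < 0.05]                         # variables kept as biomarkers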

Feature Selection

  1. For models that cannot handle too many variables (\(p \gg n\))
  2. Better global performance when the feature space is reliably reduced ...
  • Remove features with no variance. Then, by a 'meta'-procedure, we
    • Select 20 features by several methods: linear/rank correlation, recursive feature elimination by random forest and XGBoost (see the code sketch below).
    • Put them together and keep features that are selected by at least 2 methods.
##  [1] "7.656082"  "7.6397634" "7.852031"  "3.9811536" "7.5580466" "7.5743654"
##  [7] "2.0538744" "8.5380354" "1.5638796" "1.3187906" "0.894194"  "2.2007444"
## [13] "3.6054524" "2.0047344" "3.0501212" "1.9230788" "1.5801986" "0.926832"

\(\Longrightarrow\) 18 selected features

  • Some models do support many variables (e.g. penalized regression) or integrate feature selection into their algorithm (LASSO)

  • We keep the same selection for the regression and classification models.
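A rough sketch of this 'meta'-selection with caret (X = spectra matrix, y = hippurate concentration; the object names and the exact ranking rules are assumptions):

library(caret)
nzv <- nearZeroVar(X); if (length(nzv) > 0) X <- X[, -nzv]        # drop no-variance features
top_lin  <- names(sort(abs(cor(X, y))[, 1], decreasing = TRUE))[1:20]                       # linear correlation
top_rank <- names(sort(abs(cor(X, y, method = "spearman"))[, 1], decreasing = TRUE))[1:20]  # rank correlation
rf_rfe   <- rfe(X, y, sizes = 20,
                rfeControl = rfeControl(functions = rfFuncs, method = "cv"))
top_rf   <- head(predictors(rf_rfe), 20)                          # RFE with random forest
votes    <- table(c(top_lin, top_rank, top_rf))
selected <- names(votes[votes >= 2])                              # keep features chosen by >= 2 methods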

Representation of spectra + selected features

[Figure: spectra with the selected features highlighted]

1) Regression Models

  • Predict Hippurate's concentration as a continuous variable

  • Predictive criterion to minimize :

\[\text{RMSE} = \sqrt{\frac{1}{N}\sum_i^N \big( y_i-\hat{y}_i\big)^2 }, \qquad\qquad N:=\text{test set size} = 54. \]

RMSE <- function(x.true, x.fit) sqrt(mean((x.true - x.fit)^2))

  • Several models; from the most simple regressions :
    • Linear, stepwise linear, penalized, principal component, independent component, ...
  • To more 'complex' models :
    • (sparse or orthogonal) partial least squares, random forest, XGBoost, ...

Linear Regressions

  • With the 18 selected features (RMSE = 20)
##            31      27     41      28       5     21      47     16
## True    0.000 150.000 75.000 300.000 150.000 75.000 300.000  0.000
## Predict 3.325 243.507 94.307 298.138 175.612 90.555 314.993 12.106
  • With stepwise selection on the 18 selected features (RMSE = 22)
## Retained variables are  7.656 7.64 7.852 3.981 2.054 1.319 2.201 3.605 2.005 3.05 1.58

\(\qquad\Rightarrow\) Left with 11 features

  • Stepwise selection on a larger subset (an alternative feature selection with 80 variables)

\(\qquad\Rightarrow\) Left with 31 selected variables. (RMSE = 41)
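A possible caret version of these two fits, assuming train_set / test_set from the split above, selected holding the 18 feature names, and the RMSE function defined earlier:

ctrl     <- trainControl(method = "cv", number = 10)
fit_lm   <- train(x = train_set[, selected], y = train_set$Hippurate,
                  method = "lm", trControl = ctrl)
fit_step <- train(x = train_set[, selected], y = train_set$Hippurate,
                  method = "lmStepAIC", trControl = ctrl, trace = FALSE)   # stepwise (AIC)
RMSE(test_set$Hippurate, predict(fit_lm,   newdata = test_set[, selected]))
RMSE(test_set$Hippurate, predict(fit_step, newdata = test_set[, selected]))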

Penalized Regression : Elastic Nets

\[\text{argmin}_{\boldsymbol{\beta}} \Big\{ \sum_{i=1}^n(y_i-\boldsymbol{x}_i'\boldsymbol{\beta})^2 + \lambda\Big[(1-\alpha)\sum_{j=1}^p\beta_j^2 + \alpha\sum_{j=1}^p|\beta_j|\Big]\Big\}\]

\(\qquad\qquad\qquad\qquad\Longrightarrow\) LASSO (\(\alpha=1\)) vs RIDGE (\(\alpha=0\))

  • On the whole training set (RMSE = 47) \(\quad\) or \(\quad\) On the 18 selected variables (RMSE = 146)

[Figure: elastic net results]
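A sketch of the elastic-net tuning over \((\alpha, \lambda)\) with caret/glmnet (train_x / train_y are assumed training objects; the grid is illustrative):

grid <- expand.grid(alpha  = seq(0, 1, by = 0.1),                 # 0 = ridge ... 1 = LASSO
                    lambda = 10^seq(-3, 2, length.out = 20))
fit_enet <- train(x = as.matrix(train_x), y = train_y, method = "glmnet",
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = grid)
fit_enet$bestTune                                                 # selected (alpha, lambda)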

Principal Component Regression

  • With all the variables \(\Rightarrow\) 66 components kept. (RMSE = 106)

  • With the feature selection \(\Rightarrow\) 13 components kept. (RMSE = 23)

[Figure: PCR results]

Trying a more relevant number of components (green) did not improve the error... (to check)

  • We also did Independent Component Regression on the 18 selected variables: RMSE = 23!
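Both can be fitted homogeneously through caret (methods "pcr" and "icr"); a sketch with assumed object names:

ctrl    <- trainControl(method = "cv", number = 10)
fit_pcr <- train(x = train_x, y = train_y, method = "pcr", trControl = ctrl,
                 tuneLength = 20)                                 # number of components tuned by CV
fit_icr <- train(x = train_x, y = train_y, method = "icr", trControl = ctrl,
                 tuneLength = 20)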

Partial Least Squares

PLS : RMSE = 28.5 on the small dataset. ( = 50 on the big dataset)

##            49      54      10      3     29     31     33      1
## True     0.00 300.000 300.000  0.000  0.000  0.000 75.000  0.000
## Predict 13.87 314.626 307.609 22.032 15.278 11.078 90.057 17.965

OPLS : RMSE = 45.5 on the small dataset. Not in caret... (yet?)

##             49      54     10       3      29      31     33       1
## True      0.00 300.000 300.00   0.000   0.000   0.000 75.000   0.000
## Predict -50.13 245.737 240.66 -41.549 -51.556 -54.849 28.371 -44.753

SPLS : 3 hyperparameters (\(\nearrow\) complexity) \(\rightarrow\) very slow (RMSE = 376)... but we did not do enough CV steps

##               49      54       10      3       29       31       33     1
## True       0.000  300.00  300.000  0.000    0.000    0.000   75.000  0.00
## Predict -142.381 -347.07 -123.848 49.587 -241.695 -239.295 -329.542 60.66
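Sketch of the PLS and sparse-PLS fits with caret ("pls" tunes ncomp; "spls" tunes K, eta, kappa; grids and object names are illustrative):

ctrl     <- trainControl(method = "cv", number = 10)
fit_pls  <- train(x = train_x, y = train_y, method = "pls",  trControl = ctrl, tuneLength = 15)
fit_spls <- train(x = train_x, y = train_y, method = "spls", trControl = ctrl,
                  tuneGrid = expand.grid(K = 1:5, eta = c(0.1, 0.5, 0.9), kappa = 0.5))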

Random Forest, XGBoost

  • Too much complexity \(\Rightarrow\) Difficult to tune (6 hyperparameters for XGBoost), very (very) slow...
  • Nonparametric and nonlinear models
  • Weird results.
##               4      44      50      14       9       2      52      31
## True     75.000 150.000  75.000   0.000   0.000   0.000 150.000   0.000
## Predict 284.071 284.058 284.041 284.077 284.096 284.075 284.096 284.097
##           4  44  50  14   9   2  52  31
## True     75 150  75   0   0   0 150   0
## Predict 300 300 300 300 300 300 300 300

\(\qquad\Rightarrow\) Too much complexity for such a simple process can lead to weird results
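For reference, a sketch of how both can be tuned through caret (small grids shown here; the real tuning was much heavier):

ctrl    <- trainControl(method = "cv", number = 5)
fit_rf  <- train(x = train_x, y = train_y, method = "rf",      trControl = ctrl, tuneLength = 3)
fit_xgb <- train(x = train_x, y = train_y, method = "xgbTree", trControl = ctrl, tuneLength = 3)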

Final Results : Regression

Model                   True.RMSE   RMSE.cv
Linear regression           20.00      7.15
Linear regression SW        21.99      7.15
Elastic Net full            46.96      6.58
Elastic Net small (-)      146.06      7.14
PCR all (+)                105.77      4.66
PCR all (-)                417.86      5.30
PCR small                   22.03      7.09
ICR small                   23.56      7.10
PLS all                     50.43      4.64
PLS small                   28.50      7.09
OPLS small                  45.54    166.90
SPLS small                 857.05      7.07
XGBoost all                215.62      5.52

  • (+) : larger hyperparameter space search. 'small' is with the 18 selected features, 'all' with the full set of variables.

2) Classification Models

The logloss will be our metric to compare models

  • Binary \[ \min \ -N^{-1}\sum_{i}^N\Big[y_i\cdot\log p_i+(1-y_i)\cdot\log (1-p_i)\Big]\]
  • Multi-class \[ \min \ -N^{-1}\sum_i^N\sum_j^M y_{ij}\cdot \log p_{ij}\]

    • Accuracy is less informative and ROC can be ambiguous. (...)
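A minimal sketch of the binary log-loss above, and of how caret can use it as the CV metric (mnLogLoss is caret's built-in summary function):

logloss <- function(y, p, eps = 1e-15) {            # y in {0,1}, p = predicted P(y = 1)
  p <- pmin(pmax(p, eps), 1 - eps)                  # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
ctrl_ll <- trainControl(method = "cv", number = 10,
                        classProbs = TRUE, summaryFunction = mnLogLoss)
# then: train(..., metric = "logLoss", trControl = ctrl_ll)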

Logistic Regression

  • On the small dataset: it predicts the two classes perfectly.
## logloss is  2.734654e-12
  • On the "big" dataset ( 80 features), it only predicts one class \(\forall\) observations ... Hence
## logloss is  12.35782
  • It is the same when we do stepwise selection ...
##      1 2 3 4 5 6 7
## Qh.2 1 1 1 1 1 1 1
## X0   0 0 0 0 0 0 0
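A sketch of the logistic fit through caret (train_x / test_x, the factor train_class and the column subsetting are assumed names; class labels as in the output above):

ctrl_cls <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                         summaryFunction = mnLogLoss)
fit_glm  <- train(x = train_x[, selected], y = train_class, method = "glm",
                  family = binomial, metric = "logLoss", trControl = ctrl_cls)
prob <- predict(fit_glm, newdata = test_x[, selected], type = "prob")   # P(Qh.2), P(X0)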

Penalized logistic Regression

Problem: some packages could not be made to work (plr and penalized). With glmnet, we obtain...

  • For small dataset
##              1         2         3         4         5         6         7
## Qh.2 0.3597122 0.3597122 0.3597122 0.3597122 0.3597122 0.3597122 0.3597122
## X0   0.6402878 0.6402878 0.6402878 0.6402878 0.6402878 0.6402878 0.6402878
  • For whole dataset

[Figure: penalized logistic regression (glmnet) results on the whole dataset]

Support Vector Machine

  • Linear kernel

  • Radial Kernel

  • Polynomial Kernel

\(\qquad\Longrightarrow\) Always the same result: the classes are predicted perfectly in CV (100% accuracy over a wide range of hyperparameters), but then only one class is predicted, with probability \(\to\) 1, on the test evaluation...
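Sketch of the three kernels through caret/kernlab (object names assumed; same log-loss control as above):

ctrl_svm <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                         summaryFunction = mnLogLoss)
fit_lin  <- train(x = train_x, y = train_class, method = "svmLinear",
                  metric = "logLoss", trControl = ctrl_svm)
fit_rad  <- train(x = train_x, y = train_class, method = "svmRadial",
                  metric = "logLoss", trControl = ctrl_svm)
fit_poly <- train(x = train_x, y = train_class, method = "svmPoly",
                  metric = "logLoss", trControl = ctrl_svm)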

(O)PLS-DA

Same (type of) problem...

##                1           2          3           4           5           6
## Qh.2 0.993977226 0.997078446 0.99357819 0.993883989 0.993578587 0.997561093
## X0   0.006022774 0.002921554 0.00642181 0.006116011 0.006421413 0.002438907
##                7          8          9         10
## Qh.2 0.997502736 0.98483209 0.98344112 0.98328997
## X0   0.002497264 0.01516791 0.01655888 0.01671003

Multi-class

Thank You

  • Questions ?