Classification and Regression Study on Metabolomic data

Model Comparisons

Antoine Pissoort
LSTAT2340 OMICS

Introduction

  • Metabolomic study

  • Database with 269 observations and 600 variables (ppm descriptors) + 9 meta-variables that controlled the experiment

  • Our aim is to predict Hippurate's concentration (which we controlled!):

    1. For regression models, we will take it as numeric and predict it as continuous
    2. For classification models, we will take 2 classes of concentration
    3. We will also try multi-class classification
  • Use the caret infrastructure \(\Rightarrow\) a homogeneous workflow across models.

Representation of the spectra

[Figure: spectra of the samples]

Methodology

  • Supervised Learning

Divide the dataset into

  • Training set \(\Rightarrow\) 80% (215 obs.)
  • Test set \(\Rightarrow\) 20% (54 obs.)
  1. Train the model on the training set \(\Rightarrow\) choose the optimal (hyper)parameters by cross-validation and estimate the error.
  2. With the model selected, test it on the new (independent) test set to assess its accuracy ('true' error)
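A minimal caret sketch of this split/CV/test workflow (the data frame data, the outcome column Hippurate and the seed are assumed names/values, not necessarily the ones used here):

library(caret)
set.seed(123)                                          # reproducible split (seed value assumed)
idx       <- createDataPartition(data$Hippurate, p = 0.8, list = FALSE)
train_set <- data[idx, ];  test_set <- data[-idx, ]    # 80% / 20% split
ctrl      <- trainControl(method = "cv", number = 10)  # k-fold CV on the training set
fit       <- train(Hippurate ~ ., data = train_set, method = "pls", trControl = ctrl)
pred      <- predict(fit, newdata = test_set)          # 'true' error assessed on the held-out set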

Visualization : PCA

[Figure: PCA of the samples]

\(\Longrightarrow\) Linear separation
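A hedged sketch of how such a PCA view can be produced (X = matrix of the 600 spectral variables, concentration = controlled hippurate level; both names are assumptions):

pca <- prcomp(X, center = TRUE, scale. = FALSE)        # PCA on the spectra
plot(pca$x[, 1:2],                                     # scores on the first two components
     col = as.numeric(as.factor(concentration)),      # colour by hippurate level
     xlab = "PC1", ylab = "PC2")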

Multiple Testing Comparison

  • t-tests comparing the "0" and "Qh/2" hippurate concentration groups for each variable.

Bonferroni correction :

\[Pvalue_{(\text{Bonf})} = \min\big(N\cdot Pvalue,\ 1\big), \qquad N = \text{number of tests}.\]

[Figure: Bonferroni-adjusted p-values]

## Biomarkers :  7.509151 7.4926486 7.705161 1.9067596 1.171921 2.90319 0.7471408 1.1390996 1.4167652
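A sketch of this multiple-testing step, assuming X0 and Xh2 hold the spectra of the "0" and "Qh/2" groups (one column per ppm variable):

pvals  <- sapply(seq_len(ncol(X0)),
                 function(j) t.test(X0[, j], Xh2[, j])$p.value)   # one t-test per variable
p_bonf <- p.adjust(pvals, method = "bonferroni")                  # = min(N * p, 1)
biomarkers <- colnames(X0)[p_bonf < 0.05]                         # variables kept as biomarkers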

Feature Selection

  1. For models that cannot handle too many variables (\(p \gg n\))
  2. Better global performance when the feature space is reliably reduced ...
  • Remove features with no variance. Then, by a 'meta'-procedure, we
    • Select 20 features by several methods: linear/rank correlation, recursive feature elimination by random forest and XGBoost (see the code sketch below).
    • Put them together and keep features that are selected by at least 2 methods.
##  [1] "7.656082"  "7.6397634" "7.852031"  "3.9811536" "7.5580466" "7.5743654"
##  [7] "2.0538744" "8.5380354" "1.5638796" "1.3187906" "0.894194"  "2.2007444"
## [13] "3.6054524" "2.0047344" "3.0501212" "1.9230788" "1.5801986" "0.926832"

\(\Longrightarrow\) 18 selected features

  • Some models do support many variables (e.g. penalized regression) or integrate feature selection into their algorithm (LASSO)

  • We keep the same selection for the regression and classification models.
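A rough sketch of this 'meta'-selection with caret (X = spectra matrix, y = hippurate concentration; the object names and the exact ranking rules are assumptions):

library(caret)
nzv <- nearZeroVar(X); if (length(nzv) > 0) X <- X[, -nzv]        # drop no-variance features
top_lin  <- names(sort(abs(cor(X, y))[, 1], decreasing = TRUE))[1:20]                       # linear correlation
top_rank <- names(sort(abs(cor(X, y, method = "spearman"))[, 1], decreasing = TRUE))[1:20]  # rank correlation
rf_rfe   <- rfe(X, y, sizes = 20,
                rfeControl = rfeControl(functions = rfFuncs, method = "cv"))
top_rf   <- head(predictors(rf_rfe), 20)                          # RFE with random forest
votes    <- table(c(top_lin, top_rank, top_rf))
selected <- names(votes[votes >= 2])                              # keep features chosen by >= 2 methods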

Representation of spectra + selected features

[Figure: spectra with the selected features highlighted]

1) Regression Models

  • Predict Hippurate's concentration as a continuous variable

  • Predictive criterion to minimize :

\[\text{RMSE} = \sqrt{\frac{1}{N}\sum_i^N \big( y_i-\hat{y}_i\big)^2 }, \qquad\qquad N:=\text{test set size} = 54. \]

RMSE <- function(x.true, x.fit) sqrt(mean((x.true - x.fit)^2))

  • Several models; from the most simple regressions :
    • Linear, stepwise linear, penalized, principal component, independent component, ...
  • To more 'complex' models :
    • (sparse or orthogonal) partial least squares, random forest, XGBoost, ...

Linear Regressions

  • With the 18 selected features (RMSE = 20)
##            31      27     41      28       5     21      47     16
## True    0.000 150.000 75.000 300.000 150.000 75.000 300.000  0.000
## Predict 3.325 243.507 94.307 298.138 175.612 90.555 314.993 12.106
  • With stepwise selection on the 18 selected features (RMSE = 22)
## Retained variables are  7.656 7.64 7.852 3.981 2.054 1.319 2.201 3.605 2.005 3.05 1.58

\(\qquad\Rightarrow\) Left with 11 features

  • Stepwise selection on a larger subset (an alternative feature selection with 80 variables)

\(\qquad\Rightarrow\) Left with 31 selected variables. (RMSE = 41)
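A possible caret version of these two fits, assuming train_set / test_set from the split above, selected holding the 18 feature names, and the RMSE function defined earlier:

ctrl     <- trainControl(method = "cv", number = 10)
fit_lm   <- train(x = train_set[, selected], y = train_set$Hippurate,
                  method = "lm", trControl = ctrl)
fit_step <- train(x = train_set[, selected], y = train_set$Hippurate,
                  method = "lmStepAIC", trControl = ctrl, trace = FALSE)   # stepwise (AIC)
RMSE(test_set$Hippurate, predict(fit_lm,   newdata = test_set[, selected]))
RMSE(test_set$Hippurate, predict(fit_step, newdata = test_set[, selected]))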

Penalized Regression : Elastic Nets

\[\text{argmin}_{\boldsymbol{\beta}} \Big\{ \sum_{i=1}^n(y_i-\boldsymbol{x}_i'\boldsymbol{\beta})^2 + \lambda\Big[(1-\alpha)\sum_{j=1}^p\beta_j^2 + \alpha\sum_{j=1}^p|\beta_j|\Big]\Big\}\]

\(\qquad\qquad\qquad\qquad\Longrightarrow\) LASSO (\(\alpha=1\)) vs RIDGE (\(\alpha=0\))

  • On the whole training set (RMSE = 47) \(\quad\) or \(\quad\) On the 18 selected variables (RMSE = 146)

[Figure: elastic net results]
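A sketch of the elastic-net tuning over \((\alpha, \lambda)\) with caret/glmnet (train_x / train_y are assumed training objects; the grid is illustrative):

grid <- expand.grid(alpha  = seq(0, 1, by = 0.1),                 # 0 = ridge ... 1 = LASSO
                    lambda = 10^seq(-3, 2, length.out = 20))
fit_enet <- train(x = as.matrix(train_x), y = train_y, method = "glmnet",
                  trControl = trainControl(method = "cv", number = 10),
                  tuneGrid = grid)
fit_enet$bestTune                                                 # selected (alpha, lambda)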

Principal Component Regression

  • With all the variables \(\Rightarrow\) 66 components kept. (RMSE = 106)

  • With the feature selection \(\Rightarrow\) 13 components kept. (RMSE = 23)

[Figure: PCR results]

Trying a more relevant number of components (green) did not improve the error... (to check)

  • We also did Independent Component Regression on the 18 selected variables: RMSE = 23!
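Both can be fitted homogeneously through caret (methods "pcr" and "icr"); a sketch with assumed object names:

ctrl    <- trainControl(method = "cv", number = 10)
fit_pcr <- train(x = train_x, y = train_y, method = "pcr", trControl = ctrl,
                 tuneLength = 20)                                 # number of components tuned by CV
fit_icr <- train(x = train_x, y = train_y, method = "icr", trControl = ctrl,
                 tuneLength = 20)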

Partial Least Squares

PLS : RMSE = 28.5 on the small dataset. ( = 50 on the big dataset)

##            49      54      10      3     29     31     33      1
## True     0.00 300.000 300.000  0.000  0.000  0.000 75.000  0.000
## Predict 13.87 314.626 307.609 22.032 15.278 11.078 90.057 17.965

OPLS : RMSE = 45.5 on the small dataset. Not in caret... (yet?)

##             49      54     10       3      29      31     33       1
## True      0.00 300.000 300.00   0.000   0.000   0.000 75.000   0.000
## Predict -50.13 245.737 240.66 -41.549 -51.556 -54.849 28.371 -44.753

SPLS : 3 hyperparameters (\(\nearrow\) complexity) \(\rightarrow\) very slow (RMSE = 376)... but we did not do enough CV steps

##               49      54       10      3       29       31       33     1
## True       0.000  300.00  300.000  0.000    0.000    0.000   75.000  0.00
## Predict -142.381 -347.07 -123.848 49.587 -241.695 -239.295 -329.542 60.66
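Sketch of the PLS and sparse-PLS fits with caret ("pls" tunes ncomp; "spls" tunes K, eta, kappa; grids and object names are illustrative):

ctrl     <- trainControl(method = "cv", number = 10)
fit_pls  <- train(x = train_x, y = train_y, method = "pls",  trControl = ctrl, tuneLength = 15)
fit_spls <- train(x = train_x, y = train_y, method = "spls", trControl = ctrl,
                  tuneGrid = expand.grid(K = 1:5, eta = c(0.1, 0.5, 0.9), kappa = 0.5))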

Random Forest, XGBoost

  • Too much complexity \(\Rightarrow\) Difficult to tune (6 hyperparameters for XGBoost), very (very) slow...
  • Nonparametric and nonlinear models
  • Weird results.
##               4      44      50      14       9       2      52      31
## True     75.000 150.000  75.000   0.000   0.000   0.000 150.000   0.000
## Predict 284.071 284.058 284.041 284.077 284.096 284.075 284.096 284.097
##           4  44  50  14   9   2  52  31
## True     75 150  75   0   0   0 150   0
## Predict 300 300 300 300 300 300 300 300

\(\qquad\Rightarrow\) Too much complexity for such a simple process can lead to weird results
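For reference, a sketch of how both can be tuned through caret (small grids shown here; the real tuning was much heavier):

ctrl    <- trainControl(method = "cv", number = 5)
fit_rf  <- train(x = train_x, y = train_y, method = "rf",      trControl = ctrl, tuneLength = 3)
fit_xgb <- train(x = train_x, y = train_y, method = "xgbTree", trControl = ctrl, tuneLength = 3)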

Final Results : Regression

Model                   True.RMSE   RMSE.cv
Linear regression           20.00      7.15
Linear regression SW        21.99      7.15
Elastic Net full            46.96      6.58
Elastic Net small (-)      146.06      7.14
PCR all (+)                105.77      4.66
PCR all (-)                417.86      5.30
PCR small                   22.03      7.09
ICR small                   23.56      7.10
PLS all                     50.43      4.64
PLS small                   28.50      7.09
OPLS small                  45.54    166.90
SPLS small                 857.05      7.07
XGBoost all                215.62      5.52

  • (+) : larger hyperparameter space search. 'small' is with the 18 selected features, 'all' with the full set of variables.

2) Classification Models

The logloss will be our metric to compare models

  • Binary \[ \min \ -N^{-1}\sum_{i}^N\Big[y_i\cdot\log p_i+(1-y_i)\cdot\log (1-p_i)\Big]\]
  • Multi-class \[ \min \ -N^{-1}\sum_i^N\sum_j^M y_{ij}\cdot \log p_{ij}\]

    • Accuracy is less informative and ROC can be ambiguous. (...)
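A minimal sketch of the binary log-loss above, and of how caret can use it as the CV metric (mnLogLoss is caret's built-in summary function):

logloss <- function(y, p, eps = 1e-15) {            # y in {0,1}, p = predicted P(y = 1)
  p <- pmin(pmax(p, eps), 1 - eps)                  # clip to avoid log(0)
  -mean(y * log(p) + (1 - y) * log(1 - p))
}
ctrl_ll <- trainControl(method = "cv", number = 10,
                        classProbs = TRUE, summaryFunction = mnLogLoss)
# then: train(..., metric = "logLoss", trControl = ctrl_ll)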

Logistic Regression

  • On the small dataset: it predicts the two classes perfectly.
## logloss is  2.734654e-12
  • On the "big" dataset ( 80 features), it only predicts one class \(\forall\) observations ... Hence
## logloss is  12.35782
  • It is the same when we do stepwise selection ...
##      1 2 3 4 5 6 7
## Qh.2 1 1 1 1 1 1 1
## X0   0 0 0 0 0 0 0
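A sketch of the logistic fit through caret (train_x / test_x, the factor train_class and the column subsetting are assumed names; class labels as in the output above):

ctrl_cls <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                         summaryFunction = mnLogLoss)
fit_glm  <- train(x = train_x[, selected], y = train_class, method = "glm",
                  family = binomial, metric = "logLoss", trControl = ctrl_cls)
prob <- predict(fit_glm, newdata = test_x[, selected], type = "prob")   # P(Qh.2), P(X0)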

Penalized logistic Regression

Problem: some packages could not be made to work (plr and penalized). With glmnet, we obtain...

  • For small dataset
##              1         2         3         4         5         6         7
## Qh.2 0.3597122 0.3597122 0.3597122 0.3597122 0.3597122 0.3597122 0.3597122
## X0   0.6402878 0.6402878 0.6402878 0.6402878 0.6402878 0.6402878 0.6402878
  • For whole dataset

[Figure: penalized logistic regression (glmnet) results on the whole dataset]

Support Vector Machine

  • Linear kernel

  • Radial Kernel

  • Polynomial Kernel

\(\qquad\Longrightarrow\) Always the same result: the classes are predicted perfectly in CV (100% accuracy over a wide range of hyperparameters), but then only one class is predicted, with probability \(\to\) 1, on the test evaluation...
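Sketch of the three kernels through caret/kernlab (object names assumed; same log-loss control as above):

ctrl_svm <- trainControl(method = "cv", number = 10, classProbs = TRUE,
                         summaryFunction = mnLogLoss)
fit_lin  <- train(x = train_x, y = train_class, method = "svmLinear",
                  metric = "logLoss", trControl = ctrl_svm)
fit_rad  <- train(x = train_x, y = train_class, method = "svmRadial",
                  metric = "logLoss", trControl = ctrl_svm)
fit_poly <- train(x = train_x, y = train_class, method = "svmPoly",
                  metric = "logLoss", trControl = ctrl_svm)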

(O)PLS-DA

Same (type of) problem...

##                1           2          3           4           5           6
## Qh.2 0.993977226 0.997078446 0.99357819 0.993883989 0.993578587 0.997561093
## X0   0.006022774 0.002921554 0.00642181 0.006116011 0.006421413 0.002438907
##                7          8          9         10
## Qh.2 0.997502736 0.98483209 0.98344112 0.98328997
## X0   0.002497264 0.01516791 0.01655888 0.01671003

Multi-class

Thank You

  • Questions ?