R/Pharma 2020 Conference for practitioners of R in the Pharmaceutical Industry • Oct. 15th 2020

Disclaimer

Background

  • Identifying drug candidates is expensive!

  • Performing biochemical screening assays in the laboratory is time consuming

  • In cases where the number of potential candidates is large, predictive modeling can help prioritise candidates for screening

  • This greatly reduces the search space and thus the costs

Source: Original file on wikipedia | Author | CC BY-SA 4.0

Setting the Scene

  • You work as a data scientist in a pharmaceutical company

  • You have taken delivery of a predictive model for candidate prioritisation

  • The documentation states that the final model was created by expanding an initial simple model:

    1. A simple naive baseline model
    2. A more sophisticated high-complexity model
  • The documentation also contains visualisations quantifying performance metrics for the final model

Source: Original file on kissclipart

Model Performance

Metrics:

  • mse = mean squared error (low = good)

  • pcc = Pearson’s correlation coefficient (high = good)

  • scc = Spearman’s correlation coefficient (high = good)
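
For reference, a minimal R sketch of how these three metrics could be computed; the vector names obs and pred are illustrative, not taken from the delivered documentation:

```r
# Minimal sketch: the three metrics for observed (obs) vs. predicted (pred) values.
# obs and pred are assumed to be numeric vectors of equal length.
mse <- function(obs, pred) mean((obs - pred)^2)                  # low  = good
pcc <- function(obs, pred) cor(obs, pred, method = "pearson")    # high = good
scc <- function(obs, pred) cor(obs, pred, method = "spearman")   # high = good
```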

Great! All is good!

Conclusion:

  • Evidently, the complex model captures the more subtle information better

  • Therefore, it is decided to put the complex model into production

  • You create a Shiny app wrapping the model predictions and continue with other tasks

Great! All is good!

  • However…

  • Time goes by…

  • People using the model for prioritisation in the wet lab start complaining that only very few of the prioritised candidates are found to be relevant downstream

What now!?

  • You are worried and you communicate your worries to your boss

  • She authorises spending a large sum of money on generating new data, and using your time to get to the bottom of this!

  • Great! Finally you can see whether the wet-lab scientists are right or whether something has been lost in translation

You finally receive the new data

  • First things first, revisit the original performance plot

  • and compare it to the one you created for the new data

What on earth is going on?

  • Same models and same data source!

  • But for all performance metrics the simple model now outperforms the complex!

Wait - What? Tuning?

  • Now, you re-read the fine print of the documentation for the models and apparently the complex model was trained using a hyper-parameter \(\alpha\)

  • \(\alpha\) was tuned for optimal predictive performance and the tuned value is \(\alpha = 0.12\)

  • The details on how this parameter was tuned are scarce

  • You decide to take a closer look at this parameter

  • You call the training function in the delivered package with 3 values of \(\alpha = (0.12, 0.24, 0.48)\)

Source: Original file on losangle.com
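
The delivered package and its training function are not shown in the deck, so the sketch below is purely illustrative: loess() with the span argument standing in for \(\alpha\), applied to simulated stand-in data:

```r
# Illustrative only: the delivered package's training function is not shown,
# so loess() with span standing in for alpha is fitted to simulated data here.
set.seed(1)
n <- 50
train_dat <- data.frame(x = runif(n))
train_dat$y <- 3 * train_dat$x + 2 + 3/4 * rnorm(n)

alphas <- c(0.12, 0.24, 0.48)
fits <- lapply(alphas, function(a) loess(y ~ x, data = train_dat, span = a))
names(fits) <- paste0("alpha_", alphas)
```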

Taking a look at \(\alpha = (0.12, 0.24, 0.48)\)

  • Left: When \(\alpha\) increases, the performance drops (Low \(\alpha\) yields high performance)

  • Right: When \(\alpha\) increases, the performance increases (Low \(\alpha\) yields low performance)

  • Evidently, something is up with this \(\alpha\)-parameter

Coffee… Coffee to the rescue!

  • At the coffee machine you chat with a colleague, who talks about heuristic hyper-parameter optimisation using cross-validation

  • You get the gist and decide to try it out

  • You decide to forget about the new data for now and figure out what’s what

  • So, you return to the old case to take a look at this elusive \(\alpha\)

  • The documentation mentions something about \(\alpha\) being tuned between 0 and 2

Source: Original file on hiclipart.com

Method: Heuristic hyper-parameter optimisation using 5-fold cross validation

  1. Randomly split your data set into 5 partitions
  2. For each \(\alpha \in [0.12;2]\) in steps of \(0.01\):
  3.    For each partition \(i \in [1;5]\):
  4.       Train the model on the remaining 4/5 of the data
  5.       Predict on the \(i\)'th 1/5
  6.       Record the performance metrics
  7. For each value of \(\alpha\), calculate the mean and se of the performance metrics
  8. Identify the optimal \(\alpha\)
  9. Apply the optimised \(\alpha\)-value to train the model on all the available data
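
A rough R rendering of this procedure, not the code from the delivered package: it assumes simulated toy data, loess() with span standing in for \(\alpha\), and mean squared error as the recorded metric (very small span values may trigger loess warnings):

```r
set.seed(1)
dat <- data.frame(x = runif(50))                   # simulated stand-in data set
dat$y <- 3 * dat$x + 2 + 3/4 * rnorm(50)

k      <- 5
folds  <- sample(rep(1:k, length.out = nrow(dat))) # 1. random 5-fold split
alphas <- seq(0.12, 2, by = 0.01)                  # 2. grid of alpha values

cv_res <- do.call(rbind, lapply(alphas, function(a) {
  fold_mse <- sapply(1:k, function(i) {            # 3. loop over the 5 partitions
    train <- dat[folds != i, ]
    test  <- dat[folds == i, ]
    fit   <- loess(y ~ x, data = train, span = a,  # 4. train on 4/5
                   control = loess.control(surface = "direct"))
    pred  <- predict(fit, newdata = test)          # 5. predict on the i'th 1/5
    mean((test$y - pred)^2)                        # 6. record the metric (mse)
  })
  data.frame(alpha    = a,                         # 7. mean and se per alpha
             mse_mean = mean(fold_mse),
             mse_se   = sd(fold_mse) / sqrt(k))
}))

best_alpha <- cv_res$alpha[which.min(cv_res$mse_mean)]     # 8. optimal alpha
final_fit  <- loess(y ~ x, data = dat, span = best_alpha)  # 9. refit on all data
```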

5-fold cross validation - Results

Taking a closer look at \(\alpha\)

  • Ok, that’s interesting!

  • It seems that your original tuned hyper-parameter \(\alpha=0.12\) was a pretty poor choice

  • Let us have a closer look at that

  • From your cross validation, it seems that a good choice would be \(\alpha=1.1\)

  • You choose that and re-train a new complex model on the original data using the new \(\alpha=1.1\)

See how the model is doing

  • Comparing the performance on the original and new data

  • It is evident that the original complex model fails

  • and the new complex model with the tuned \(\alpha\) is comparable with the original simple model

Ok, so what was really going on?

  • When \(\alpha\) is tuned to \(1.1\), our new optimised complex model (loess(y~x, span=alpha)) approaches the original simple model (lm(y~x))

  • Data generating process (dashed line): \(x \in [0;1]\), \(y = 3x + 2 + \frac{3}{4} \cdot rnorm(n,0,1)\), \(n = 50\) (a reproducible sketch follows after this list)

  • In this case, this plot was easy to make, but what if you have 20, 30, 40, 100 or 20,000 explanatory variables?
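
A small reproducible R sketch of that toy setup, matching the stated data-generating process; the plotting details are illustrative rather than the original figure code:

```r
set.seed(1)
n <- 50
x <- runif(n)                                  # x in [0;1]
y <- 3 * x + 2 + 3/4 * rnorm(n)                # dashed line: y = 3x + 2, plus noise
dat <- data.frame(x = x, y = y)

simple     <- lm(y ~ x, data = dat)                   # original simple model
overfitted <- loess(y ~ x, data = dat, span = 0.12)   # original complex model
tuned      <- loess(y ~ x, data = dat, span = 1.1)    # re-tuned complex model

# With span = 1.1 the loess fit is close to the straight line from lm()
ord <- order(x)
plot(x, y)
abline(2, 3, lty = 2)                                 # true data-generating line
lines(x[ord], fitted(simple)[ord],     col = "blue")
lines(x[ord], fitted(overfitted)[ord], col = "grey50")
lines(x[ord], fitted(tuned)[ord],      col = "red")
```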

Summary - What was the trap?

  • Overfitting! Overfitting was the trap!

  • Evidently, the original \(\alpha=0.12\) was overfitting

  • Modern high-level data science modeling frameworks, e.g. stacks or keras, enable complex model creation in a few lines of code (which is awesome!)

  • However, overfitting leads to inflated model performance that does not extrapolate to unseen data

  • It can potentially be very expensive if a model is put into production without properly testing for this

  • It is a very real problem and finds its way into the scientific literature

  • What to do about it then?

    • Use cross-validation, partition pre-clustering, and ideally a completely external data set for model evaluation

    • Be as harsh as you can on your model: try hard to make it fall apart, and if you honestly fail, then there is a good chance that you have a robust model capable of predicting on unseen data!

Acknowledgements