R/Pharma 2020 Conference for practitioners of R in the Pharmaceutical Industry • Oct. 15th 2020
If you attended my workshop on neural networks in R using TensorFlow via Keras, then parts of the following may seem familiar!
Now that that’s out of the way, let’s get to it!
Identifying drug candidates is expensive!
Performing biochemical screening assays in the laboratory is time-consuming
In cases where the number of potential candidates is large, predictive modeling can help prioritise candidates for screening
This greatly reduces the search space and thus the costs
Source: Original file on wikipedia | Author | CC BY-SA 4.0
You work as a data scientist in a pharmaceutical company
You have taken delivery of a predictive model for candidate prioritisation
The documentation says that the final model was created by expanding an initial simple model:
The documentation also includes visualisations quantifying performance metrics for the final model
Source: Original file on kissclipart
Metrics:
mse = mean-squared-error (low = good)
pcc = Pearson’s correlation coefficient (high = good)
scc = Spearman’s correlation coefficient (high = good)
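For reference, these three metrics are simple to compute in base R; a minimal sketch with made-up observed and predicted values (not part of the original slides):

```r
# Made-up observed values y and model predictions y_hat, just for illustration
y     <- c(1.2, 2.3, 3.1, 4.8, 5.0)
y_hat <- c(1.0, 2.5, 2.9, 4.5, 5.3)

mse <- mean((y - y_hat)^2)                 # mean-squared-error (low = good)
pcc <- cor(y, y_hat, method = "pearson")   # Pearson's correlation coefficient (high = good)
scc <- cor(y, y_hat, method = "spearman")  # Spearman's correlation coefficient (high = good)
c(mse = mse, pcc = pcc, scc = scc)
```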
Conclusion:
Evidently, the complex model captures the more subtle information better
Therefore, it is decided to put the complex model into production
You create a Shiny app wrapping the model predictions and continue with other tasks
However…
Time goes by…
People using the model for prioritisation in the wet lab start complaining: despite prioritising candidates with the model, only very few of the prioritised candidates turn out to be relevant downstream
You are worried and you communicate your worries to your boss
She authorises spending a large sum of money on generating new data, and your time on getting to the bottom of this!
Great! Finally, you can see if the wet-lab scientists are right or if something has been lost in translation
Same models and same data source!
But on all performance metrics, the simple model now outperforms the complex one!
Now you re-read the fine print of the model documentation, and apparently the complex model was trained using a hyper-parameter \(\alpha\)
\(\alpha\) was tuned for optimal predictive performance and the tuned value is \(\alpha = 0.12\)
The details on how this parameter was tuned are scarce
You decide to take a closer look at this parameter
You call the training function in the delivered package with 3 values of \(\alpha = (0.12, 0.24, 0.48)\)
Source: Original file on losangle.com
Left: When \(\alpha\) increases, the performance drops (Low \(\alpha\) yields high performance)
Right: When \(\alpha\) increases, the performance increases (Low \(\alpha\) yields low performance)
Evidently, something is up with this \(\alpha\)-parameter
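A minimal sketch of how this check could be reproduced. The delivered training function is not shown in the slides, so as an assumption (supported by a later slide) I treat the complex model as loess(y ~ x, span = alpha); old_data and new_data are hypothetical data frames with columns x and y:

```r
# Hypothetical data frames: old_data (original delivery) and new_data (newly generated)
alphas <- c(0.12, 0.24, 0.48)

perf <- sapply(alphas, function(a) {
  # Assumption: the complex model is a loess fit with span = alpha
  fit <- loess(y ~ x, data = old_data, span = a,
               control = loess.control(surface = "direct"))  # allow prediction outside the training range
  c(mse_old = mean((old_data$y - predict(fit, newdata = old_data))^2),
    mse_new = mean((new_data$y - predict(fit, newdata = new_data))^2))
})
colnames(perf) <- paste0("alpha_", alphas)
perf  # small alpha: low mse_old (looks great), but high mse_new (fails on new data)
```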
At the coffee machine, you discuss the problem with a colleague, who talks about data and heuristic hyper-parameter optimisation using cross-validation
You get the gist and decide to try it out
You decide to forget about the new data for now and figure out what’s what
So, you return to the old case to take a look at this elusive \(\alpha\)
The documentation mentions something about \(\alpha\) being tuned between 0 and 2
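A sketch of what such a cross-validation could look like, again assuming the complex model is loess(y ~ x, span = alpha) and that old_data is a hypothetical data frame with columns x and y; the grid over \(\alpha\) and the number of folds are my own choices:

```r
# k-fold cross-validation over a grid of alpha values in (0, 2]
set.seed(42)
k      <- 5
alphas <- seq(0.2, 2, by = 0.1)  # span must be strictly positive; very small spans make loess unstable
folds  <- sample(rep(1:k, length.out = nrow(old_data)))

cv_mse <- sapply(alphas, function(a) {
  mean(sapply(1:k, function(i) {
    train <- old_data[folds != i, ]
    test  <- old_data[folds == i, ]
    fit   <- loess(y ~ x, data = train, span = a,
                   control = loess.control(surface = "direct"))  # avoid NA outside training range
    mean((test$y - predict(fit, newdata = test))^2)              # held-out mse for this fold
  }))
})

plot(alphas, cv_mse, type = "b", xlab = expression(alpha), ylab = "cross-validated mse")
alphas[which.min(cv_mse)]  # the alpha with the lowest cross-validated error
```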
Source: Original file on hiclipart.com
Ok, that’s interesting!
It seems that your original tuned hyper-parameter \(\alpha=0.12\) was a pretty poor choice
Let us have a closer look at that
From your cross validation, it seems that a good choice of alpha could be \(\alpha=1.1\)
You choose that and re-train a new complex model on the original data using the new \(\alpha=1.1\)
Comparing the performance on the original and new data
It is evident that the original complex model fails
and the new complex model with the tuned \(\alpha\) is comparable with the original simple model
As we tune \(\alpha\) to 1.1, our new optimised complex model (loess(y ~ x, span = alpha)) approaches the original simple model (lm(y ~ x))
Data generating process (dashed line): \(x \in [0;1]\), \(y = 3x + 2 + \frac{3}{4}\,\varepsilon\), \(\varepsilon \sim N(0,1)\), \(n = 50\)
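The whole scenario can be reproduced in a few lines; a minimal sketch of the data generating process from the slide and the three fits (the seed and the plotting details are my own choices):

```r
# Data generating process: x in [0;1], y = 3x + 2 + 3/4 * rnorm(n, 0, 1), n = 50
set.seed(1)
n <- 50
d <- data.frame(x = runif(n, 0, 1))
d$y <- 3 * d$x + 2 + (3 / 4) * rnorm(n, 0, 1)

simple_model    <- lm(y ~ x, data = d)                 # original simple model
complex_overfit <- loess(y ~ x, data = d, span = 0.12) # complex model, original alpha
complex_tuned   <- loess(y ~ x, data = d, span = 1.1)  # complex model, re-tuned alpha

# The tuned loess fit hugs the straight lm() line, while span = 0.12 chases the noise
xs <- seq(min(d$x), max(d$x), length.out = 200)
plot(d$x, d$y, xlab = "x", ylab = "y")
abline(2, 3, lty = 2)                                             # dashed: true process
lines(xs, predict(simple_model,    newdata = data.frame(x = xs)))
lines(xs, predict(complex_overfit, newdata = data.frame(x = xs)), lty = 3)
lines(xs, predict(complex_tuned,   newdata = data.frame(x = xs)), lty = 4)
```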
In this case, this plot would have been easy to make, but what if you have 20, 30, 40, 100, or 20,000 explanatory variables?
Overfitting! Overfitting was the trap!
Evidently, the original \(\alpha=0.12\) was overfitting
Modern high-level data science modeling frameworks, e.g. stacks or keras, enable complex model creation in a few lines of code (which is awesome!)
However, overfitting leads to inflated model performance, which does not extrapolate to unseen data
Putting a model into production without properly testing for this can potentially be very expensive
It is a very real problem and finds its way into the scientific literature
What to do about it then?
Use cross-validation, cluster your data before partitioning (so that highly similar observations do not end up in both training and test partitions), and ideally keep a completely external set for model evaluation, as sketched below
Be as harsh as you can on your model: try hard to make it fall apart, and if you honestly fail, then there is a good chance that you have a robust model capable of predicting on unseen data!
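As a small illustration of that principle, a sketch of holding back a completely external evaluation set before any tuning takes place; here d and the tuned span are carried over from the sketch above:

```r
# Set aside an external evaluation set BEFORE any model development
set.seed(2)
ext_idx  <- sample(nrow(d), size = round(0.2 * nrow(d)))
external <- d[ext_idx, ]    # never touched during fitting or tuning
devel    <- d[-ext_idx, ]   # used for all fitting and cross-validation

final_fit <- loess(y ~ x, data = devel, span = 1.1,
                   control = loess.control(surface = "direct"))

mse_devel    <- mean((devel$y    - predict(final_fit, newdata = devel))^2)
mse_external <- mean((external$y - predict(final_fit, newdata = external))^2)

# A large gap between development and external error is a red flag for overfitting
c(devel = mse_devel, external = mse_external)
```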
Thank you for your attention!
Thank you to the R/Pharma organisers for inviting me!
Thank you to the Google TensorFlow team for supporting my research
Thank you to my excellent colleagues at the Department of Health Tech, Section for Bioinformatics, Technical University of Denmark
For more bioinformatics, data science and applied machine learning, find me here:
This presentation: rpubs.com/leonjessen/rpharma2020