R/Pharma 2020 Conference for practitioners of R in the Pharmaceutical Industry • Oct. 15th 2020
If you attended my workshop on neural networks in R using TensorFlow via Keras, then parts of the following may seem familiar!
Now that that’s out of the way, let’s get to it!
Identifying drug candidates is expensive!
Performing biochemical screening assays in the laboratory is time-consuming
In cases where the number of potential candidates is large, predictive modeling can help prioritise candidates for screening
This greatly reduces the search space and thus the costs
Source: Original file on wikipedia | Author | CC BY-SA 4.0
You work as a data scientist in a pharmaceutical company
You have taken delivery of a predictive model for candidate prioritisation
The documentation says that the final model was created by expanding an initial simple model:
The documentation also includes visualisations quantifying performance metrics for the final model
Source: Original file on kissclipart
Metrics:
mse = mean-squared-error (low = good)
pcc = Pearson’s correlation coefficient (high = good)
scc = Spearman’s correlation coefficient (high = good)
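For reference, these three metrics are simple to compute in base R; a minimal sketch with made-up observed and predicted values (not part of the original slides):

```r
# Made-up observed values y and model predictions y_hat, just for illustration
y     <- c(1.2, 2.3, 3.1, 4.8, 5.0)
y_hat <- c(1.0, 2.5, 2.9, 4.5, 5.3)

mse <- mean((y - y_hat)^2)                 # mean-squared-error (low = good)
pcc <- cor(y, y_hat, method = "pearson")   # Pearson's correlation coefficient (high = good)
scc <- cor(y, y_hat, method = "spearman")  # Spearman's correlation coefficient (high = good)
c(mse = mse, pcc = pcc, scc = scc)
```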
Conclusion:
Evidently, the complex model captures the more subtle information better
Therefore, it is decided to put the complex model into production
You create a Shiny app wrapping the model predictions and continue with other tasks
However…
Time goes by…
People using the model for prioritisation in the wet lab start complaining: despite prioritising candidates with the model, only very few of the prioritised candidates turn out to be relevant downstream
You are worried and you communicate your worries to your boss
She authorises spending a large sum of money on generating new data, and your time on getting to the bottom of this!
Great! Finally, you can see if the wet-lab scientists are right or if something has been lost in translation
Same models and same data source!
But on all performance metrics, the simple model now outperforms the complex one!
Now you re-read the fine print of the model documentation, and apparently the complex model was trained using a hyper-parameter \(\alpha\)
\(\alpha\) was tuned for optimal predictive performance and the tuned value is \(\alpha = 0.12\)
The details on how this parameter was tuned are scarce
You decide to take a closer look at this parameter
You call the training function in the delivered package with 3 values of \(\alpha = (0.12, 0.24, 0.48)\)
Source: Original file on losangle.com
Left: When \(\alpha\) increases, the performance drops (Low \(\alpha\) yields high performance)
Right: When \(\alpha\) increases, the performance increases (Low \(\alpha\) yields low performance)
Evidently, something is up with this \(\alpha\)-parameter
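A minimal sketch of how this check could be reproduced. The delivered training function is not shown in the slides, so as an assumption (supported by a later slide) I treat the complex model as loess(y ~ x, span = alpha); old_data and new_data are hypothetical data frames with columns x and y:

```r
# Hypothetical data frames: old_data (original delivery) and new_data (newly generated)
alphas <- c(0.12, 0.24, 0.48)

perf <- sapply(alphas, function(a) {
  # Assumption: the complex model is a loess fit with span = alpha
  fit <- loess(y ~ x, data = old_data, span = a,
               control = loess.control(surface = "direct"))  # allow prediction outside the training range
  c(mse_old = mean((old_data$y - predict(fit, newdata = old_data))^2),
    mse_new = mean((new_data$y - predict(fit, newdata = new_data))^2))
})
colnames(perf) <- paste0("alpha_", alphas)
perf  # small alpha: low mse_old (looks great), but high mse_new (fails on new data)
```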
At the coffee machine, you discuss the problem with a colleague, who talks about data and heuristic hyper-parameter optimisation using cross-validation
You get the gist and decide to try it out
You decide to forget about the new data for now and figure out what’s what
So, you return to the old case to take a look at this elusive \(\alpha\)
The documentation mentions something about \(\alpha\) being tuned between 0 and 2
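A sketch of what such a cross-validation could look like, again assuming the complex model is loess(y ~ x, span = alpha) and that old_data is a hypothetical data frame with columns x and y; the grid over \(\alpha\) and the number of folds are my own choices:

```r
# k-fold cross-validation over a grid of alpha values in (0, 2]
set.seed(42)
k      <- 5
alphas <- seq(0.2, 2, by = 0.1)  # span must be strictly positive; very small spans make loess unstable
folds  <- sample(rep(1:k, length.out = nrow(old_data)))

cv_mse <- sapply(alphas, function(a) {
  mean(sapply(1:k, function(i) {
    train <- old_data[folds != i, ]
    test  <- old_data[folds == i, ]
    fit   <- loess(y ~ x, data = train, span = a,
                   control = loess.control(surface = "direct"))  # avoid NA outside training range
    mean((test$y - predict(fit, newdata = test))^2)              # held-out mse for this fold
  }))
})

plot(alphas, cv_mse, type = "b", xlab = expression(alpha), ylab = "cross-validated mse")
alphas[which.min(cv_mse)]  # the alpha with the lowest cross-validated error
```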
Source: Original file on hiclipart.com
Ok, that’s interesting!
It seems that your original tuned hyper-parameter \(\alpha=0.12\) was a pretty poor choice
Let us have a closer look at that
From your cross validation, it seems that a good choice of alpha could be \(\alpha=1.1\)
You choose that and re-train a new complex model on the original data using the new \(\alpha=1.1\)
Comparing the performance on the original and new data
It is evident that the original complex model fails
and the new complex model with the tuned \(\alpha\) is comparable with the original simple model
As we tune \(\alpha\) to 1.1, our new optimised complex model (loess(y ~ x, span = alpha)) approaches the original simple model (lm(y ~ x))
Data generating process (dashed line): \(x \in [0;1]\), \(y = 3x + 2 + \frac{3}{4}\,\varepsilon\), \(\varepsilon \sim N(0,1)\), \(n = 50\)
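The whole scenario can be reproduced in a few lines; a minimal sketch of the data generating process from the slide and the three fits (the seed and the plotting details are my own choices):

```r
# Data generating process: x in [0;1], y = 3x + 2 + 3/4 * rnorm(n, 0, 1), n = 50
set.seed(1)
n <- 50
d <- data.frame(x = runif(n, 0, 1))
d$y <- 3 * d$x + 2 + (3 / 4) * rnorm(n, 0, 1)

simple_model    <- lm(y ~ x, data = d)                 # original simple model
complex_overfit <- loess(y ~ x, data = d, span = 0.12) # complex model, original alpha
complex_tuned   <- loess(y ~ x, data = d, span = 1.1)  # complex model, re-tuned alpha

# The tuned loess fit hugs the straight lm() line, while span = 0.12 chases the noise
xs <- seq(min(d$x), max(d$x), length.out = 200)
plot(d$x, d$y, xlab = "x", ylab = "y")
abline(2, 3, lty = 2)                                             # dashed: true process
lines(xs, predict(simple_model,    newdata = data.frame(x = xs)))
lines(xs, predict(complex_overfit, newdata = data.frame(x = xs)), lty = 3)
lines(xs, predict(complex_tuned,   newdata = data.frame(x = xs)), lty = 4)
```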
In this case, this plot would have been easy to make, but what if you have 20, 30, 40, 100, or 20,000 explanatory variables?
Overfitting! Overfitting was the trap!
Evidently, the original \(\alpha=0.12\) was overfitting
Modern high-level data science modeling frameworks, e.g. stacks or keras, enable complex model creation in a few lines of code (which is awesome!)
However, overfitting leads to inflated model performance, which does not extrapolate to unseen data
Putting a model into production without properly testing for this can potentially be very expensive
It is a very real problem and finds its way into the scientific literature
What to do about it then?
Use cross-validation, cluster your data before partitioning (so that highly similar observations do not end up in both training and test partitions), and ideally keep a completely external set for model evaluation, as sketched below
Be as harsh as you can on your model: try hard to make it fall apart, and if you honestly fail, then there is a good chance that you have a robust model capable of predicting on unseen data!
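As a small illustration of that principle, a sketch of holding back a completely external evaluation set before any tuning takes place; here d and the tuned span are carried over from the sketch above:

```r
# Set aside an external evaluation set BEFORE any model development
set.seed(2)
ext_idx  <- sample(nrow(d), size = round(0.2 * nrow(d)))
external <- d[ext_idx, ]    # never touched during fitting or tuning
devel    <- d[-ext_idx, ]   # used for all fitting and cross-validation

final_fit <- loess(y ~ x, data = devel, span = 1.1,
                   control = loess.control(surface = "direct"))

mse_devel    <- mean((devel$y    - predict(final_fit, newdata = devel))^2)
mse_external <- mean((external$y - predict(final_fit, newdata = external))^2)

# A large gap between development and external error is a red flag for overfitting
c(devel = mse_devel, external = mse_external)
```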
Thank you for your attention!
Thank you to the R/Pharma organisers for inviting me!
Thank you to the Google TensorFlow team for supporting my research
Thank you to my excellent colleagues at the Department of Health Tech, Section for Bioinformatics, Technical University of Denmark
For more bioinformatics, data science and applied machine learning, find me here:
This presentation: rpubs.com/leonjessen/rpharma2020