R/Pharma 2020 Conference for practitioners of R in the Pharmaceutical Industry • Oct. 15th 2020



  • Identifying drug candidates is expensive!

  • Performing biochemical screening assays in the laboratory is time-consuming

  • Where the number of potential candidates is large, predictive modelling can help prioritise candidates for screening

  • This narrows the search space and thereby greatly reduces costs

Source: Original file on Wikipedia | Author | CC BY-SA 4.0

Setting the Scene

  • You work as a data scientist in a pharmaceutical company

  • You have taken delivery of a predictive model for candidate prioritisation

  • The documentation says that the final model was created by expanding an initial simple model:

    1. A simple naive baseline model
    2. A more sophisticated high-complexity model
  • The documentation also contains visualisations of performance metrics for the final model

Source: Original file on kissclipart

Model Performance


  • mse = mean squared error (low = good)

  • pcc = Pearson’s correlation coefficient (high = good)

  • scc = Spearman’s correlation coefficient (high = good)
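
To make the three metrics concrete, here is a minimal sketch in plain Python. It is not the code behind the talk's figures. The function names and the `observed` / `predicted` values are hypothetical and chosen purely for illustration.

```python
from math import sqrt


def mse(y_true, y_pred):
    """Mean squared error: average squared difference (low = good)."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def pcc(x, y):
    """Pearson's correlation coefficient: strength of linear association."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def scc(x, y):
    """Spearman's correlation coefficient: Pearson's correlation of the ranks."""
    return pcc(ranks(x), ranks(y))


# Hypothetical measured vs. predicted values, for illustration only.
observed = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.0, 4.0, 9.0, 16.0, 25.0]

print("mse:", mse(observed, predicted))
print("pcc:", pcc(observed, predicted))
print("scc:", scc(observed, predicted))
```

In this toy example the predictions preserve the ordering of the observations exactly, so scc is 1, while pcc is slightly below 1 because the relationship is not linear. The mse is large because the predicted values sit far from the observed ones.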

Great! All is good!


  • Evidently, the complex model captures subtle information better than the simple baseline

  • Therefore, it is decided to put the complex model into production

  • You create a Shiny app wrapping the model predictions and continue with other tasks

Great! All is good!

  • However…

  • Time goes by…

  • People using the model for prioritisation in the wet lab start complaining: despite prioritising targets with the model, very few of the prioritised candidates turn out to be relevant downstream