Preliminary note

This workbook is available from my public bitbucket. The version at bitbucket does not contain solutions to the challenges.

This workbook uses an R package called drc, which we haven’t used so far in the code clubs. Ensure you have this package installed (either using install.packages(“drc”) or using rstudio’s “Install Packages” dialogues) before attempting to compile or run this notebook.

library(drc)
library(ggplot2)
library(magrittr)

IC₅₀s, EC₅₀s and all that

IC₅₀ is the “half-maximal inhibitory concentration” for an entity (typically a drug) against a biological process or function (eg, enzyme activity, cell number, metabolic activity). If you have an inhibitor of a given enzyme and an in vitro assay for that enzyme, the IC50 is the concentration of inhibitor at which that enzyme runs at 50% of its normal activity in that assay.

EC₅₀ is, comparably, the “half-maximal effective concentration”. The effective is in there because not all entities will completely wipe out the biological process that you are assaying. Suppose you are assessing the effects of drug-treatment on cell number under a certain growth regime. Further, suppose that the drug reduces cell number by at most 80% (relative to an untreated sample). Then the EC₅₀ is the concentration when cell number is reduced by 40% relative to the control (so the EC₅₀ can be different from the IC₅₀).

In the latter assay, suppose the drug could induce at most a 45% reduction in cell number. Does it make sense to discuss IC₅₀ in that setting?

Both IC₅₀ and EC₅₀ are measured in mol / L.

You can visualise the difference between IC₅₀ and EC₅₀ in the following graph.

In the above graph, we have a drug that can reduce cell count by at most 80%.

The red lines illustrate EC₅₀: the dotted line gives the maximal possible effect, the solid line lies halfway between the number of cells in the untreated state and the maximal possible effect. The dashed line gives the EC₅₀ (which I’d set to be 0.1 mM beforehand).

Similarly the blue lines illustrate IC₅₀.

In all of the above, there is an implicit model underlying the dose-responses. When working with raw data, we use this implicit model to estimate four parameters:

the maximum value (the solid black line in the above);
the minimum value (the dotted red line);
the EC₅₀ (the dashed red line); and
the Hill coefficient (which affects the steepness of the slope).

Given these four parameters, we can estimate the IC₅₀.

The model looks like this ¹. Here x is the drug concentration. If you compare to wikipedia, the sign of the Hill coef is reversed; but it’s sign doesn’t affect inference for max, min and EC₅₀).

\[ fitted\_value = min\_value + \frac{ max\_value - min\_value }{ 1 + (x / EC_{50})^{Hill} } \]

\[ observed\_value = fitted\_value + NOISE! \]

Assume a positive Hill coefficient ². The model implies some sensible stuff:

When x is equal to EC₅₀, the denominator in the fitted_value equation is 2, and the fitted value becomes \((max\_value + min\_value)/2\) (ie, the solid red line).
When x gets really large, the fitted_value approaches \(min\_value\). This occurs because the denominator in the fitted_value equation gets really large, and so the fraction tends to 0.
When x is 0, the fitted value is the same as \(max\_value\).

It’s possible to invert the above equation algebraically, to find the value of x that gives a particular fitted_value. The IC₅₀ is given when the fitted_value is equal to max_value / 2. The IC₅₀ is then given by:³

\[ \log(IC_{50}) = \log(EC_{50}) + \frac{1}{Hill}.\log(\frac{ max\_value }{ max\_value - 2min\_value }) \]

Importantly, this is only a model. It may not be appropriate for the particular assay or treatment that you are using and if you use a different model, the above connection between IC₅₀ and EC₅₀ needn’t hold.

Data for subsequent use

Imagine we’ve ran a dose-response analysis of a drug on cells from a patient. We want to identify the IC₅₀ and EC₅₀ for the drug.

Experimentally, you’d try to use a few concentrations of drug that are above the EC₅₀, a few that are below this value, and some samples from untreated controls.

The values we use will be randomly generated, so these are not from a real experiment. But we will use the EC₅₀ equation above to generate the random numbers.

To define a function in R you use syntax that looks like

my_function_name <- function(argument_1, argument_2){
  commands
}

I’ve written some functions to help sample from the EC₅₀ equation. The first one returns the fitted values for a set of drug-concentrations when provided the minimum, maximum. EC₅₀ and Hill coefficients. You don’t need to understand code like this yet.

get_fitted_values <- function(concs, min_val, max_val, ec_50, hill){
  # Sanity check the input
  stopifnot(length(min_val) == 1 && length(max_val) == 1 && max_val > min_val)
  stopifnot(length(hill) == 1 && hill > 0)
  stopifnot(length(ec_50) == 1)
  stopifnot(all(c(concs, ec_50) >= 0))

  # Note that `concs` may contain multiple values,
  #   whereas min, max, hill, ec_50 are each of length 1.
  #
  # But, the results of this function are of the same length as `concs` because
  #   R will duplicate the entries of min, max, ec_50, hill to match the length
  #   of concs.
  #
  # Similarly, 1 + c(2, 3, 4, 5) is equal to c(3, 4, 5, 6)
  #
  min_val + (max_val - min_val) / (1 + (concs / ec_50) ^ hill)
}

The second one adds a bit of random noise around the fitted values, so the data we are working with looks a bit more like experimentally-derived data.

add_some_noise <- function(fitted_values, sd){
  # `rnorm` adds some normally-distributed noise around the fitted values
  #
  # The user defines the standard deviation of the noise that is added.
  #
  noisy_vals <- rnorm(
    n = length(fitted_values), mean = fitted_values, sd = sd
    )

  # Check that the returned values are all positive
  stopifnot(all(noisy_vals > 0))

  noisy_vals
}

The third function wraps around the above two functions. You can use this to sample some experimental data.

sample_ec50_dataset <- function(concs, min_val, max_val, ec_50, hill, sd){
  fitted_vals <- get_fitted_values(concs, min_val, max_val, ec_50, hill)

  add_some_noise(fitted_vals, sd = sd)
}

So, our first problem is to generate some (sampled) experimental data.

# Challenge 1: Sample some data from the sigmoid model using
#   the function `sample_ec50_dataset`
#
# Use the following data:
#   concs: use concentrations 0, 0.1, 0.3, 1, 3, 10
#   min_val: use a minimum value of 10
#   max_val: use a maximum value of 90
#   ec_50: use an EC_50 value of 0.5 (note this is in the middle of the range
#          of concentrations)
#   hill: use a hill coefficient of 1
#   sd:   use a standard deviation of 3

# YOUR CODE GOES HERE:

# First define a vector called `concs` that stores the required concentrations

# concs <- ...

# Store the values in a vector called `cell_counts`

# cell_counts <- ...

# Challenge 1 - Solution

concs <- c(0, 0.1, 0.3, 1, 3, 10)

cell_counts <- sample_ec50_dataset(
  concs, min_val = 10, max_val = 90, ec_50 = 0.5, hill = 1, sd = 3
  )

To use the data effectively in ggplot, you should convert it into a data-frame

# Challenge 2

# Combine the concentrations and the sampled cell-counts into a data-frame
#   called `dose_response`
#
# The columns of `dose_response` should be called `conc` and `cell_count`

# YOUR CODE GOES HERE:

# dose_response <- data.frame(
#   # FILL IN THE COLUMN DEFINITIONS:
#   )

# Challenge 2 - Solution

dose_response <- data.frame(
  conc = concs,
  cell_count = cell_counts
  )

Rough and ready plotting

You have some experience making basic plots in ggplot2. So you should be able to plot the sample values out using geom_point() or geom_line().

# Challenge 3
# - Plot out the data in `dose_response` using ggplot2
# - try overlaying a line graph on a scatter plot
# - try and add sensible axis labels
# - try using ylim() to ensure that the line y=0 is present on the graph
# - try storing the resulting graph in a variable called `p` and then plotting
#     it by evaluating `p`

# To store a graph in a variable:
# p <- ggplot(....) + ... + ...
#
# Then plot it out:
# p

# YOUR CODE GOES HERE:

# Challenge 3 - Solution

p <- ggplot(
    data = dose_response, aes(x = conc, y = cell_count)
  ) +
  geom_point() +
  geom_line() +
  labs(x = "Drug Concentration", y = "Cell Count") +
  ylim(0, NA)

p

There’s a slight issue with that graph. The concentrations aren’t evenly spread across the horizontal span of the graph, so we ended up with half of the concentrations (0, 0.1, 0.3, 1.0) lying in the left-most quarter of the graph. We can rectify this by log-transforming the x-axis.

In ggplot2 you make axis transformations using ’scale_*’ functions. Since the x-axis represents a continuous variable, we use scale_x_continuous for transformations. scale_x_continuous can be used for more than just transforming the axis though, for example it can be used to specify where tick marks occur, and what they are labelled with. To log-transform the x axis on the above plot, use the argument trans = "log10" in scale_x_continuous.⁴

# Challenge 4:
# Use scale_x_continuous to log-transform the x-axis
#
# p + ...()

# YOUR CODE GOES HERE:

# Challenge 4 - Solution
# - Alternative: p + scale_x_log10()

p + scale_x_continuous(trans = "log10")

## Warning: Transformation introduced infinite values in continuous x-axis

## Warning: Transformation introduced infinite values in continuous x-axis

The points are more evenly spread out within the body of the graph. But, Uh-oh! That zero concentration datapoint has thrown a warning. Log-transforming it sent it to minus-infinity and in the graph above, it is represented using a half-point at the left-most position on the graph. You can try and fix this if you want but for the rest of the workshop, we’re going to work with untransformed axes.

Prediction from a model using `predict`

In the above, geom_line is just connecting the observed points, rather than adding a model-predicted regression line to the data. We’ll learn how to fit the data to the EC₅₀ equation so that we can overplot a more appropriate regression line on the dose-response data.

R provides various tools to i) perform model-fitting, ii) obtain fitted values from a given model, and iii) plot out those fitted values.

Although it isn’t entirely appropriate since the dataset is curved, we could model the dataset using linear regression - this would fit a straight line that minimises it’s distance from the observed datapoints using least-squares.

To fit a linear regression model, you use lm(dependent_var ~ independent_var, data = my_dataframe)

Try this now using the EC₅₀ dataset: with conc as the independent variable (ie, the predictor) and cell_count as the dependent (response) variable. Store the result in a variable called fit.

# Challenge 5:
# - Use the `lm` function to run linear regression of `cell_count` against `conc`
# in your `dose_response` data-frame
# - Store the result in a variable called `fit`

# YOUR CODE GOES HERE:
# fit <- ...

# Challenge 5 - Solution

fit <- lm(cell_count ~ conc, data = dose_response)

fit contains details about the intercept and gradient of the best-fitting regression line. You can look at these by running summary(fit).

summary(fit)

## 
## Call:
## lm(formula = cell_count ~ conc, data = dose_response)
## 
## Residuals:
##       1       2       3       4       5       6 
##  25.524   7.963   2.420 -20.554 -24.652   9.299 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)   62.613     10.563   5.927  0.00406 **
## conc          -5.527      2.466  -2.241  0.08849 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 21.43 on 4 degrees of freedom
## Multiple R-squared:  0.5567, Adjusted R-squared:  0.4459 
## F-statistic: 5.024 on 1 and 4 DF,  p-value: 0.08849

[Aside] The coefficient for the gradient has a p-value < 0.1, so you shouldn’t be surprised that even poorly-chosen models can give significant results.

Given a model fit, you can also predict the fitted values for some additional data points by calling the function predict. Call predict on fit now:

predict(fit)

##         1         2         3         4         5         6 
## 62.612595 62.059893 60.954490 57.085579 46.031546  7.342433

This gives you the fitted values corresponding to the concentrations in your input data (assuming a linear model). You can provide additional concentrations as well, and obtain predictions for concentrations that you didn’t include in your experiment. To do this, you need to provide a data-frame as the newdata argument to predict that contains a column for each of the predictor/independent variables that you used during modelling.

So if I set up a data-frame called additional_data, you can use this to predict fitted values for unobserved concentrations:

additional_data <- data.frame(
  # This gives a sequence of 100 evenly spaced values between 0 and 10
  conc = seq(0, 10, length.out = 100)
)

# Challenge 6:
# - use `predict` with `fit` as the model, and `additional_data` as the
# `newdata` argument
# - store the returned values in a vector called `linear_counts`

# YOUR CODE GOES HERE:
# linear_counts <- ...

# Challenge 6 - Solution

linear_counts <- predict(fit, newdata = additional_data)

We can now plot the fitted linear_counts over the observed data, to see both the experimental data and the model-fit.

linear_fitted_data <- data.frame(
  additional_data,
  cell_count = linear_counts
  )

ggplot(dose_response, aes(x = conc, y = cell_count)) +
  geom_point() +
  geom_line(data = linear_fitted_data) +
  ylim(0, NA)

There’s actually a short-hand for the above that’s useful if you’re working with simple linear regression models and ggplot2. The stat_smooth function will compute a linear model and overlay it’s fitted line on your datapoints. You call it as follows (the confidence intervals can be removed by using se = FALSE in the call to stat_smooth):

ggplot(dose_response, aes(x = conc, y = cell_count)) +
  geom_point() +
  stat_smooth(method = "lm", se = TRUE) +
  ylim(0, NA)

So you have seen how to model data using linear regression, how to predict fitted points from a model and how to overlay the fitted model on a ggplot scatterplot. predict can be used for more complicated models, as we see next.

Fitting data to the EC₅₀ equation

To determine IC50s / EC50s using our sampled data, we need a way to perform non-linear regression. To do this we use the package drc (which is specifically for analysis of dose-response curves).⁵

The functions you need to know are called drm (which fits the regression model) and LL.4 (which defines the structure of the regression model).

The documentation for LL.4 defines the log-logistic function used in regression. This is shown below, but it’s actually identical to our definition of the \(fitted\_value\) equation above with

b = the Hill coefficient;
c = \(min\_value\);
d = \(max\_value\);
e = \(EC_{50}\).

\[ f(x) = c + \frac{ d - c }{ 1 + \exp(b(\log{x} - \log{e})) } \]

It’s important to name your variables so that they’re self-describing. I hope the authors of drc don’t mind if we continue to use hill, min_value etc instead of b, c etc.

drm does the hard-work. Similarly to linear-regression in lm, you pass the following to drm:

a model formula (as the formula argument: eg, formula = cell_count ~ conc)
a dataset (as the data argument: eg, data = dose_response)

In addition you have to provide a function that defines the model you’re trying to fit (here this is defined by a call to LL.4).

To run the regression model we use:

curved_fit <- drm(
  formula = cell_count ~ conc,
  data = dose_response,
  fct = LL.4(names = c("hill", "min_value", "max_value", "ec_50"))
  )

In setting up the sampled data, we used min_val = 10, max_val = 90, ec_50 = 0.5, hill = 1

Compare these values to those that were fitted by drm, by calling summary() on the returned curved_fit.

# Challenge 7: Summarise the values in `curved_fit`

# YOUR CODE GOES HERE:

# Challenge 7 - Solution

summary(curved_fit)

## 
## Model fitted: Log-logistic (ED50 as parameter) (4 parms)
## 
## Parameter estimates:
## 
##                       Estimate Std. Error  t-value p-value
## hill:(Intercept)       0.95056    0.27896  3.40755  0.0764
## min_value:(Intercept) 10.77102    7.70553  1.39783  0.2970
## max_value:(Intercept) 86.91633    4.82055 18.03036  0.0031
## ec_50:(Intercept)      0.53356    0.17779  3.00111  0.0954
## 
## Residual standard error:
## 
##  4.658618 (2 degrees of freedom)

The values should be pretty close to those we used in setting up the model

We can convert the inferred values to an IC₅₀.

coefs <- setNames(
  curved_fit$coefficients,
  c("hill", "min_value", "max_value", "ec_50")
  )

ic_50 <- with(
  as.list(coefs),
  exp(
    log(ec_50) + (1 / hill) * log(max_value / (max_value - 2 * min_value))
  )
)

ic_50

## [1] 0.7199612

We can predict values using predict on the newly defined curved_fit variable, just as we did for the linear model.

# Challenge 8:
# - Run `predict` on the nonlinear regression model stored in `curved_fit`
# - this gives the model-fitted values for the experimental data
# - Use the additional_data data-frame that we set up for linear-prediction as
# the `newdata` argument
# - Store the values in the variable `curved_counts`

# YOUR CODE GOES HERE:
# curved_counts <- ...

# Challenge 8 - Solution

curved_counts <- predict(curved_fit, newdata = additional_data)

Then we can plot the fitted model over our original data, just as we did for the linear fit; just copy the code we used there over to here, and replace linear_counts with curved_counts.

# Challenge 9:
# - Overlay the drm-fitted regression model (as a line) on a scatter plot of the
# original cell counts

# YOUR CODE GOES HERE:

# Challenge 9 - Solution

curved_fitted_data <- data.frame(
  additional_data,
  cell_count = curved_counts
)

ggplot(dose_response, aes(x = conc, y = cell_count)) +
  geom_point() +
  geom_line(data = curved_fitted_data) +
  ylim(0, NA)

Similarly to lm there is also a shortcut whereby ggplot2 will run the regression model and plot out the fitted curve (although getting this to work took a lot of hacking).

ggplot(dose_response, aes(x = conc, y = cell_count)) +
  geom_point() +
  stat_smooth(
    method = "drm",
    method.args = list(
      fct = LL.4(names = c("hill", "min_value", "max_value", "ec_50"))
      ),
    se = FALSE,
    ) +
  ylim(0, NA)

… and if you want a complete shortcut, you can plot the output from drm directly (but with great ease comes great limitations; I’ll continue to stick with ggplot2):

# Note the logged x-axis
plot(curved_fit)

Footnotes

“Fitting Models to Biological Data Using Linear and Nonlinear Regression: A practical guide to curve fitting”. Motulsky H. and Christopoulos A., Oxford Uni Press (2004).↩
If the Hill coef was zero, the model would flatline; if it was negative you’d be modelling activation rather than inhibition.↩
I think↩
Because log-transformation of axes is such a common operation, ggplot2 provides a shorthand function scale_x_log10() that could be substituted for scale_x_continuous(trans = "log10").↩
Code in this section was borrowed from Eduardo GÃ³mez-CastaÃ±eda.↩

Dose Response Curve Analysis

Russ Hyde, University of Glasgow

9 April 2018

Preliminary note

IC₅₀s, EC₅₀s and all that

Data for subsequent use

Rough and ready plotting

Prediction from a model using `predict`

Fitting data to the EC₅₀ equation

Footnotes

Dose Response Curve Analysis

Russ Hyde, University of Glasgow

9 April 2018

Preliminary note

IC50s, EC50s and all that

Data for subsequent use

Rough and ready plotting

Prediction from a model using predict

Fitting data to the EC50 equation

Footnotes

IC₅₀s, EC₅₀s and all that

Prediction from a model using `predict`

Fitting data to the EC₅₀ equation