This workbook is available from my public bitbucket. The version at bitbucket does not contain solutions to the challenges.
This workbook uses an R package called drc
, which we haven’t used so far in the code clubs. Ensure you have this package installed (either using install.packages(“drc”) or using rstudio’s “Install Packages” dialogues) before attempting to compile or run this notebook.
library(drc)
library(ggplot2)
library(magrittr)
IC50 is the “half-maximal inhibitory concentration” for an entity (typically a drug) against a biological process or function (eg, enzyme activity, cell number, metabolic activity). If you have an inhibitor of a given enzyme and an in vitro assay for that enzyme, the IC50 is the concentration of inhibitor at which that enzyme runs at 50% of its normal activity in that assay.
EC50 is, comparably, the “half-maximal effective concentration”. The effective is in there because not all entities will completely wipe out the biological process that you are assaying. Suppose you are assessing the effects of drug-treatment on cell number under a certain growth regime. Further, suppose that the drug reduces cell number by at most 80% (relative to an untreated sample). Then the EC50 is the concentration when cell number is reduced by 40% relative to the control (so the EC50 can be different from the IC50).
In the latter assay, suppose the drug could induce at most a 45% reduction in cell number. Does it make sense to discuss IC50 in that setting?
Both IC50 and EC50 are measured in mol / L.
You can visualise the difference between IC50 and EC50 in the following graph.
In the above graph, we have a drug that can reduce cell count by at most 80%.
The red lines illustrate EC50: the dotted line gives the maximal possible effect, the solid line lies halfway between the number of cells in the untreated state and the maximal possible effect. The dashed line gives the EC50 (which I’d set to be 0.1 mM beforehand).
Similarly the blue lines illustrate IC50.
In all of the above, there is an implicit model underlying the dose-responses. When working with raw data, we use this implicit model to estimate four parameters:
the maximum value (the solid black line in the above);
the minimum value (the dotted red line);
the EC50 (the dashed red line); and
the Hill coefficient (which affects the steepness of the slope).
Given these four parameters, we can estimate the IC50.
The model looks like this 1. Here x
is the drug concentration. If you compare to wikipedia, the sign of the Hill coef is reversed; but it’s sign doesn’t affect inference for max, min and EC50).
\[ fitted\_value = min\_value + \frac{ max\_value - min\_value }{ 1 + (x / EC_{50})^{Hill} } \]
\[ observed\_value = fitted\_value + NOISE! \]
Assume a positive Hill coefficient 2. The model implies some sensible stuff:
When x
is equal to EC50, the denominator in the fitted_value equation is 2, and the fitted value becomes \((max\_value + min\_value)/2\) (ie, the solid red line).
When x
gets really large, the fitted_value approaches \(min\_value\). This occurs because the denominator in the fitted_value equation gets really large, and so the fraction tends to 0.
When x
is 0, the fitted value is the same as \(max\_value\).
It’s possible to invert the above equation algebraically, to find the value of x
that gives a particular fitted_value
. The IC50 is given when the fitted_value
is equal to max_value
/ 2. The IC50 is then given by:3
\[ \log(IC_{50}) = \log(EC_{50}) + \frac{1}{Hill}.\log(\frac{ max\_value }{ max\_value - 2min\_value }) \]
Importantly, this is only a model. It may not be appropriate for the particular assay or treatment that you are using and if you use a different model, the above connection between IC50 and EC50 needn’t hold.
Imagine we’ve ran a dose-response analysis of a drug on cells from a patient. We want to identify the IC50 and EC50 for the drug.
Experimentally, you’d try to use a few concentrations of drug that are above the EC50, a few that are below this value, and some samples from untreated controls.
The values we use will be randomly generated, so these are not from a real experiment. But we will use the EC50 equation above to generate the random numbers.
To define a function in R you use syntax that looks like
my_function_name <- function(argument_1, argument_2){
commands
}
I’ve written some functions to help sample from the EC50 equation. The first one returns the fitted values for a set of drug-concentrations when provided the minimum, maximum. EC50 and Hill coefficients. You don’t need to understand code like this yet.
get_fitted_values <- function(concs, min_val, max_val, ec_50, hill){
# Sanity check the input
stopifnot(length(min_val) == 1 && length(max_val) == 1 && max_val > min_val)
stopifnot(length(hill) == 1 && hill > 0)
stopifnot(length(ec_50) == 1)
stopifnot(all(c(concs, ec_50) >= 0))
# Note that `concs` may contain multiple values,
# whereas min, max, hill, ec_50 are each of length 1.
#
# But, the results of this function are of the same length as `concs` because
# R will duplicate the entries of min, max, ec_50, hill to match the length
# of concs.
#
# Similarly, 1 + c(2, 3, 4, 5) is equal to c(3, 4, 5, 6)
#
min_val + (max_val - min_val) / (1 + (concs / ec_50) ^ hill)
}
The second one adds a bit of random noise around the fitted values, so the data we are working with looks a bit more like experimentally-derived data.
add_some_noise <- function(fitted_values, sd){
# `rnorm` adds some normally-distributed noise around the fitted values
#
# The user defines the standard deviation of the noise that is added.
#
noisy_vals <- rnorm(
n = length(fitted_values), mean = fitted_values, sd = sd
)
# Check that the returned values are all positive
stopifnot(all(noisy_vals > 0))
noisy_vals
}
The third function wraps around the above two functions. You can use this to sample some experimental data.
sample_ec50_dataset <- function(concs, min_val, max_val, ec_50, hill, sd){
fitted_vals <- get_fitted_values(concs, min_val, max_val, ec_50, hill)
add_some_noise(fitted_vals, sd = sd)
}
So, our first problem is to generate some (sampled) experimental data.
# Challenge 1: Sample some data from the sigmoid model using
# the function `sample_ec50_dataset`
#
# Use the following data:
# concs: use concentrations 0, 0.1, 0.3, 1, 3, 10
# min_val: use a minimum value of 10
# max_val: use a maximum value of 90
# ec_50: use an EC_50 value of 0.5 (note this is in the middle of the range
# of concentrations)
# hill: use a hill coefficient of 1
# sd: use a standard deviation of 3
# YOUR CODE GOES HERE:
# First define a vector called `concs` that stores the required concentrations
# concs <- ...
# Store the values in a vector called `cell_counts`
# cell_counts <- ...
# Challenge 1 - Solution
concs <- c(0, 0.1, 0.3, 1, 3, 10)
cell_counts <- sample_ec50_dataset(
concs, min_val = 10, max_val = 90, ec_50 = 0.5, hill = 1, sd = 3
)
To use the data effectively in ggplot, you should convert it into a data-frame
# Challenge 2
# Combine the concentrations and the sampled cell-counts into a data-frame
# called `dose_response`
#
# The columns of `dose_response` should be called `conc` and `cell_count`
# YOUR CODE GOES HERE:
# dose_response <- data.frame(
# # FILL IN THE COLUMN DEFINITIONS:
# )
# Challenge 2 - Solution
dose_response <- data.frame(
conc = concs,
cell_count = cell_counts
)
You have some experience making basic plots in ggplot2
. So you should be able to plot the sample values out using geom_point() or geom_line().
# Challenge 3
# - Plot out the data in `dose_response` using ggplot2
# - try overlaying a line graph on a scatter plot
# - try and add sensible axis labels
# - try using ylim() to ensure that the line y=0 is present on the graph
# - try storing the resulting graph in a variable called `p` and then plotting
# it by evaluating `p`
# To store a graph in a variable:
# p <- ggplot(....) + ... + ...
#
# Then plot it out:
# p
# YOUR CODE GOES HERE:
# Challenge 3 - Solution
p <- ggplot(
data = dose_response, aes(x = conc, y = cell_count)
) +
geom_point() +
geom_line() +
labs(x = "Drug Concentration", y = "Cell Count") +
ylim(0, NA)
p
There’s a slight issue with that graph. The concentrations aren’t evenly spread across the horizontal span of the graph, so we ended up with half of the concentrations (0, 0.1, 0.3, 1.0) lying in the left-most quarter of the graph. We can rectify this by log-transforming the x-axis.
In ggplot2
you make axis transformations using ’scale_*’ functions. Since the x-axis represents a continuous variable, we use scale_x_continuous
for transformations. scale_x_continuous
can be used for more than just transforming the axis though, for example it can be used to specify where tick marks occur, and what they are labelled with. To log-transform the x axis on the above plot, use the argument trans = "log10"
in scale_x_continuous
.4
# Challenge 4:
# Use scale_x_continuous to log-transform the x-axis
#
# p + ...()
# YOUR CODE GOES HERE:
# Challenge 4 - Solution
# - Alternative: p + scale_x_log10()
p + scale_x_continuous(trans = "log10")
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
The points are more evenly spread out within the body of the graph. But, Uh-oh! That zero concentration datapoint has thrown a warning. Log-transforming it sent it to minus-infinity and in the graph above, it is represented using a half-point at the left-most position on the graph. You can try and fix this if you want but for the rest of the workshop, we’re going to work with untransformed axes.
predict
In the above, geom_line
is just connecting the observed points, rather than adding a model-predicted regression line to the data. We’ll learn how to fit the data to the EC50 equation so that we can overplot a more appropriate regression line on the dose-response data.
R provides various tools to i) perform model-fitting, ii) obtain fitted values from a given model, and iii) plot out those fitted values.
Although it isn’t entirely appropriate since the dataset is curved, we could model the dataset using linear regression - this would fit a straight line that minimises it’s distance from the observed datapoints using least-squares.
To fit a linear regression model, you use lm(dependent_var ~ independent_var, data = my_dataframe)
Try this now using the EC50 dataset: with conc
as the independent variable (ie, the predictor) and cell_count
as the dependent (response) variable. Store the result in a variable called fit
.
# Challenge 5:
# - Use the `lm` function to run linear regression of `cell_count` against `conc`
# in your `dose_response` data-frame
# - Store the result in a variable called `fit`
# YOUR CODE GOES HERE:
# fit <- ...
# Challenge 5 - Solution
fit <- lm(cell_count ~ conc, data = dose_response)
fit
contains details about the intercept and gradient of the best-fitting regression line. You can look at these by running summary(fit)
.
summary(fit)
##
## Call:
## lm(formula = cell_count ~ conc, data = dose_response)
##
## Residuals:
## 1 2 3 4 5 6
## 25.524 7.963 2.420 -20.554 -24.652 9.299
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 62.613 10.563 5.927 0.00406 **
## conc -5.527 2.466 -2.241 0.08849 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.43 on 4 degrees of freedom
## Multiple R-squared: 0.5567, Adjusted R-squared: 0.4459
## F-statistic: 5.024 on 1 and 4 DF, p-value: 0.08849
[Aside] The coefficient for the gradient has a p-value < 0.1, so you shouldn’t be surprised that even poorly-chosen models can give significant results.
Given a model fit, you can also predict the fitted values for some additional data points by calling the function predict
. Call predict
on fit
now:
predict(fit)
## 1 2 3 4 5 6
## 62.612595 62.059893 60.954490 57.085579 46.031546 7.342433
This gives you the fitted values corresponding to the concentrations in your input data (assuming a linear model). You can provide additional concentrations as well, and obtain predictions for concentrations that you didn’t include in your experiment. To do this, you need to provide a data-frame as the newdata
argument to predict
that contains a column for each of the predictor/independent variables that you used during modelling.
So if I set up a data-frame called additional_data
, you can use this to predict fitted values for unobserved concentrations:
additional_data <- data.frame(
# This gives a sequence of 100 evenly spaced values between 0 and 10
conc = seq(0, 10, length.out = 100)
)
# Challenge 6:
# - use `predict` with `fit` as the model, and `additional_data` as the
# `newdata` argument
# - store the returned values in a vector called `linear_counts`
# YOUR CODE GOES HERE:
# linear_counts <- ...
# Challenge 6 - Solution
linear_counts <- predict(fit, newdata = additional_data)
We can now plot the fitted linear_counts
over the observed data, to see both the experimental data and the model-fit.
linear_fitted_data <- data.frame(
additional_data,
cell_count = linear_counts
)
ggplot(dose_response, aes(x = conc, y = cell_count)) +
geom_point() +
geom_line(data = linear_fitted_data) +
ylim(0, NA)
There’s actually a short-hand for the above that’s useful if you’re working with simple linear regression models and ggplot2. The stat_smooth
function will compute a linear model and overlay it’s fitted line on your datapoints. You call it as follows (the confidence intervals can be removed by using se = FALSE
in the call to stat_smooth
):
ggplot(dose_response, aes(x = conc, y = cell_count)) +
geom_point() +
stat_smooth(method = "lm", se = TRUE) +
ylim(0, NA)
So you have seen how to model data using linear regression, how to predict fitted points from a model and how to overlay the fitted model on a ggplot scatterplot. predict
can be used for more complicated models, as we see next.
To determine IC50s / EC50s using our sampled data, we need a way to perform non-linear regression. To do this we use the package drc
(which is specifically for analysis of dose-response curves).5
The functions you need to know are called drm
(which fits the regression model) and LL.4
(which defines the structure of the regression model).
The documentation for LL.4
defines the log-logistic function used in regression. This is shown below, but it’s actually identical to our definition of the \(fitted\_value\) equation above with
b = the Hill coefficient;
c = \(min\_value\);
d = \(max\_value\);
e = \(EC_{50}\).
\[ f(x) = c + \frac{ d - c }{ 1 + \exp(b(\log{x} - \log{e})) } \]
It’s important to name your variables so that they’re self-describing. I hope the authors of drc
don’t mind if we continue to use hill
, min_value
etc instead of b
, c
etc.
drm
does the hard-work. Similarly to linear-regression in lm
, you pass the following to drm
:
a model formula (as the formula argument: eg, formula = cell_count ~ conc)
a dataset (as the data argument: eg, data = dose_response)
In addition you have to provide a function that defines the model you’re trying to fit (here this is defined by a call to LL.4
).
To run the regression model we use:
curved_fit <- drm(
formula = cell_count ~ conc,
data = dose_response,
fct = LL.4(names = c("hill", "min_value", "max_value", "ec_50"))
)
In setting up the sampled data, we used min_val = 10, max_val = 90, ec_50 = 0.5, hill = 1
Compare these values to those that were fitted by drm
, by calling summary()
on the returned curved_fit
.
# Challenge 7: Summarise the values in `curved_fit`
# YOUR CODE GOES HERE:
# Challenge 7 - Solution
summary(curved_fit)
##
## Model fitted: Log-logistic (ED50 as parameter) (4 parms)
##
## Parameter estimates:
##
## Estimate Std. Error t-value p-value
## hill:(Intercept) 0.95056 0.27896 3.40755 0.0764
## min_value:(Intercept) 10.77102 7.70553 1.39783 0.2970
## max_value:(Intercept) 86.91633 4.82055 18.03036 0.0031
## ec_50:(Intercept) 0.53356 0.17779 3.00111 0.0954
##
## Residual standard error:
##
## 4.658618 (2 degrees of freedom)
The values should be pretty close to those we used in setting up the model
We can convert the inferred values to an IC50.
coefs <- setNames(
curved_fit$coefficients,
c("hill", "min_value", "max_value", "ec_50")
)
ic_50 <- with(
as.list(coefs),
exp(
log(ec_50) + (1 / hill) * log(max_value / (max_value - 2 * min_value))
)
)
ic_50
## [1] 0.7199612
We can predict values using predict
on the newly defined curved_fit
variable, just as we did for the linear model.
# Challenge 8:
# - Run `predict` on the nonlinear regression model stored in `curved_fit`
# - this gives the model-fitted values for the experimental data
# - Use the additional_data data-frame that we set up for linear-prediction as
# the `newdata` argument
# - Store the values in the variable `curved_counts`
# YOUR CODE GOES HERE:
# curved_counts <- ...
# Challenge 8 - Solution
curved_counts <- predict(curved_fit, newdata = additional_data)
Then we can plot the fitted model over our original data, just as we did for the linear fit; just copy the code we used there over to here, and replace linear_counts with curved_counts.
# Challenge 9:
# - Overlay the drm-fitted regression model (as a line) on a scatter plot of the
# original cell counts
# YOUR CODE GOES HERE:
# Challenge 9 - Solution
curved_fitted_data <- data.frame(
additional_data,
cell_count = curved_counts
)
ggplot(dose_response, aes(x = conc, y = cell_count)) +
geom_point() +
geom_line(data = curved_fitted_data) +
ylim(0, NA)
Similarly to lm
there is also a shortcut whereby ggplot2
will run the regression model and plot out the fitted curve (although getting this to work took a lot of hacking).
ggplot(dose_response, aes(x = conc, y = cell_count)) +
geom_point() +
stat_smooth(
method = "drm",
method.args = list(
fct = LL.4(names = c("hill", "min_value", "max_value", "ec_50"))
),
se = FALSE,
) +
ylim(0, NA)
… and if you want a complete shortcut, you can plot the output from drm
directly (but with great ease comes great limitations; I’ll continue to stick with ggplot2):
# Note the logged x-axis
plot(curved_fit)
“Fitting Models to Biological Data Using Linear and Nonlinear Regression: A practical guide to curve fitting”. Motulsky H. and Christopoulos A., Oxford Uni Press (2004).↩
If the Hill coef was zero, the model would flatline; if it was negative you’d be modelling activation rather than inhibition.↩
I think↩
Because log-transformation of axes is such a common operation, ggplot2 provides a shorthand function scale_x_log10()
that could be substituted for scale_x_continuous(trans = "log10")
.↩
Code in this section was borrowed from Eduardo Gómez-Castañeda.↩