In biology many processes occur in a non-linear way: population growth or enzyme concentration during a reaction, for example. Several options exist to model these processes and estimate their coefficients; non-linear regression and Generalized Additive Modelling are two of them. In this post I will show some examples of fitting non-linear regressions in R using the function 'nls', even though for certain processes there are functions designed specifically for them ('micmen' in the package VGAM, for example).
Let's jump right into some code with the power function:
#### the power function ####
# simulate some data from y = a + b * x^z with a = 1, b = 1, z = 0.15
x <- 1:100
y <- 1 + x^0.15 + rnorm(100, 0, 0.01)
# fit the model; 'start' supplies initial guesses for the optimizer
m <- nls(y ~ a + b * I(x^z), start = list(a = 1, b = 1, z = 1))
m
## Nonlinear regression model
## model: y ~ a + b * I(x^z)
## data: parent.frame()
## a b z
## 0.916 1.077 0.143
## residual sum-of-squares: 0.0105
##
## Number of iterations to convergence: 18
## Achieved convergence tolerance: 6.39e-08
plot(y ~ x)
lines(x, fitted(m), lty = 2, col = "red", lwd = 2)
# now let's look at the coefficients when we have 10 times more data points
x <- 1:1000
y <- 1 + x^0.15 + rnorm(1000, 0, 0.01)
m <- nls(y ~ a + b * I(x^z), start = list(a = 1, b = 1, z = 1))
m
## Nonlinear regression model
## model: y ~ a + b * I(x^z)
## data: parent.frame()
## a b z
## 0.992 1.005 0.150
## residual sum-of-squares: 0.103
##
## Number of iterations to convergence: 42
## Achieved convergence tolerance: 4.15e-07
plot(y ~ x)
lines(x, fitted(m), lty = 2, col = "red", lwd = 2)
# some model evaluation: residual sum of squares and R²
RSS <- sum(residuals(m)^2)
TSS <- sum((y - mean(y))^2)
R.square <- 1 - (RSS/TSS)
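For a more complete picture of the fit, summary reports standard errors and approximate t-tests for each coefficient. Keep in mind that for non-linear models R² loses its usual interpretation, so the value computed above should be read as a rough diagnostic only:
summary(m)  # standard errors and approximate t-tests for the coefficients
R.square    # the pseudo-R² computed above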
The power function can be written as y = a + b * x^z. There are three parameters to estimate in this equation, so all three must appear in the formula given to the nls call. The I operator tells R to treat ^ as an arithmetic power rather than as the factor-crossing operator it normally is in a model formula. In this case the I is not strictly necessary, but it becomes important when such an equation is used in a multifactorial context. As seen above, with 1000 points instead of 100 the coefficients are much closer to their true values: the more data you have, the closer your coefficient estimates will be to reality. Note that this equation does not reach an asymptote but keeps increasing towards infinity.
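To see what I guards against, here is a quick illustration with lm, where the formula interpretation of ^ does kick in:
# inside a model formula '^' means factor crossing, so x^2 collapses to x
coef(lm(y ~ x^2))     # identical to coef(lm(y ~ x))
coef(lm(y ~ I(x^2)))  # a genuine quadratic term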
Now let's move on to another example, from the biochemistry world: the Michaelis-Menten equation.
### Michaelis-Menten equation ###
# generate some data
dat <- data.frame(x = 1:100,
    y = ((5 * 1:100)/(10 + 1:100)) + rnorm(100, 0, 0.1))
# use a self-starting model to find good starting values
print(getInitial(y ~ SSmicmen(x, max(y), 1), data = dat), digits = 3)
## max(y) 1
## 5.01 10.29
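SSmicmen is only one of several self-starting models shipped with R (SSlogis, SSasymp, SSgompertz, ...); a quick, if rough, way to list the ones on the search path:
# list self-starting model functions available in the search path
apropos("^SS")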
# store the self-start estimates (computed once rather than twice)
init <- getInitial(y ~ SSmicmen(x, max(y), 1), data = dat)
Vm <- init[1]
K <- init[2]
# fit an nls model; here Vm and K act as parameter names, and the
# self-starting model computes its own starting values
m.m <- nls(y ~ SSmicmen(x, Vm, K), data = dat)
m.m
## Nonlinear regression model
## model: y ~ SSmicmen(x, Vm, K)
## data: dat
## Vm K
## 5.01 10.29
## residual sum-of-squares: 0.846
##
## Number of iterations to convergence: 0
## Achieved convergence tolerance: 2.13e-07
plot(y ~ x, dat)
lines(1:100, fitted(m.m), lty = 2, col = "red")
With such a function an asymptote is reached at the value Vm; K is the x value at which y(x) = Vm/2 (the Michaelis constant).
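As a quick sanity check on the fit above, the fitted curve evaluated at x = K should indeed be close to Vm/2:
# the predicted value at x = K should be about half of Vm
co <- coef(m.m)
predict(m.m, newdata = data.frame(x = co["K"]))
co["Vm"]/2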
Now let's turn to the sigmoid function:
### Sigmoid function ###
# create a function to generate a sigmoid pattern
sigmoid <- function(x, lower_asymptote, carrying_capacity, growth_rate, time_max) {
return(lower_asymptote + ((carrying_capacity - lower_asymptote)/(1 + exp(-growth_rate *
(x - time_max)))))
}
x <- 1:100
y <- sigmoid(1:100, 1, 50, 0.1, 50) + rnorm(100, 0, 2)
# starting values are guessed from the data; trace = TRUE prints the
# residual sum-of-squares and coefficients at each iteration
m.s <- nls(y ~ a + ((b - a)/(1 + exp(-c * (x - d)))), start = list(a = min(y),
    b = max(y), c = 1, d = round(median(x))), trace = TRUE)
## 9897 : -0.354 53.096 1.000 50.000
## 2929 : 2.8016 48.9636 0.3473 50.0169
## 851.2 : 3.4203 47.9010 0.1789 50.0981
## 706 : 2.21233 48.67143 0.09053 50.22824
## 379.8 : 1.4130 49.5346 0.1064 50.3018
## 378.1 : 1.2234 49.6943 0.1063 50.2672
## 378.1 : 1.2235 49.6944 0.1063 50.2676
m.s
## Nonlinear regression model
## model: y ~ a + ((b - a)/(1 + exp(-c * (x - d))))
## data: parent.frame()
## a b c d
## 1.224 49.694 0.106 50.268
## residual sum-of-squares: 378
##
## Number of iterations to convergence: 6
## Achieved convergence tolerance: 5.79e-07
plot(y ~ x)
lines(x, fitted(m.s), lty = 2, lwd = 2, col = "red")
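Since the data were simulated, we can also compare the fitted coefficients with the true values used above (a = 1, b = 50, c = 0.1, d = 50):
# compare the estimates with the values used in the simulation
rbind(estimate = coef(m.s), truth = c(a = 1, b = 50, c = 0.1, d = 50))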
# adding the 95% confidence intervals around the fitted coefficients
conf <- confint(m.s)
## Waiting for profiling to be done...
## ... (lengthy profiling trace omitted; it is printed here because the
## model was fitted with trace = TRUE)
# curves built from the lower (column 1) and upper (column 2) bounds
lines(x, sigmoid(1:100, conf[1, 1], conf[2, 1], conf[3, 1], conf[4, 1]), lty = 2,
    lwd = 1, col = "blue")
lines(x, sigmoid(1:100, conf[1, 2], conf[2, 2], conf[3, 2], conf[4, 2]), lty = 2,
    lwd = 1, col = "blue")
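Note that plugging all the lower (or all the upper) bounds into the curve only gives a crude envelope, since the coefficient estimates are correlated. A pointwise band can instead be obtained by bootstrapping; here is a minimal sketch, assuming the simulated x and y from above:
# resample the observations, refit, and collect the predicted curves
set.seed(42)
boot_pred <- replicate(200, {
    i <- sample(length(x), replace = TRUE)
    xb <- x[i]
    yb <- y[i]
    fit <- try(nls(yb ~ a + ((b - a)/(1 + exp(-c * (xb - d)))),
        start = as.list(coef(m.s))), silent = TRUE)
    if (inherits(fit, "try-error")) rep(NA_real_, length(x))
    else predict(fit, newdata = list(xb = x))
})
# pointwise 95% band from the bootstrap distribution
lines(x, apply(boot_pred, 1, quantile, 0.025, na.rm = TRUE), lty = 3, col = "blue")
lines(x, apply(boot_pred, 1, quantile, 0.975, na.rm = TRUE), lty = 3, col = "blue")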
The sigmoid function appears in many examples in population ecology. The form I presented here is slightly simplified; in the more general form there are two more coefficients to estimate.
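For reference, here is a sketch of that more general form, the generalized logistic (Richards) curve; the parameter names follow the Wikipedia page linked below, and the two extra coefficients are the scaling term Q and the asymmetry exponent nu:
# generalized logistic (Richards) curve: two extra coefficients (Q, nu)
# on top of the four used above
richards <- function(x, A, K, B, nu, Q, M) {
    A + (K - A)/(1 + Q * exp(-B * (x - M)))^(1/nu)
}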
In a later post I will extend these non-linear models to cases with a correlation structure and/or non-equal variances.
Interesting references:
A pdf on curve fitting: http://www.itc.nl/~rossiter/teach/R/R_CurveFit.pdf
Some blog posts: http://davetang.org/muse/2013/05/09/on-curve-fitting/ and http://kyrcha.info/tutorials/fitting-a-sigmoid-function-in-r/
And have a look at the Wikipedia pages: https://en.wikipedia.org/wiki/Generalized_logistic_curve and https://en.wikipedia.org/wiki/Michaelis_menten