Carvalho data analysis

Version 2013-02-04 08:47:48

This is an example of fitting a power-law model with Normal errors on the original scale (\( Y \sim \text{Normal}(a X^b, \sigma^2) \)) to data.

Get data and fit a log-log linear regression: this is not exactly the model we want to end up with (it is instead \( \log Y \sim \text{Normal}(a' + b \log X,\sigma^2) \), with \( a' = \log a \)), but it should be a good start.
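Taking logs of the mean function shows why: \( \log(a X^b) = \log a + b \log X = a' + b \log X \), so regressing \( \log Y \) on \( \log X \) estimates \( a' \) and \( b \) directly; the catch is that it assumes Normal errors on the log scale rather than on the original scale.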

dados <- read.csv("Carvalho_dat.csv")
require(bbmle)
packageVersion("bbmle")  ## newest version required for convergence warnings
## [1] '1.0.7'
lm1 <- lm(log(He) ~ log(forest_pct_500m), data = dados)
(cc <- coef(lm1))
##          (Intercept) log(forest_pct_500m) 
##            -0.240177             0.002215

Plot with ggplot

library(ggplot2)
theme_set(theme_bw())
p1 <- qplot(forest_pct_500m, He, data = dados, log = "xy")
p1 + stat_function(fun = function(x) cc[1]/log(10) + cc[2] * x, colour = "red")

[Figure: log-log scatterplot of He vs. forest_pct_500m with the fitted log-log regression line in red]

(since qplot(...log="xy") plots on a log-10 scale, we need to scale the intercept coefficient …)
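To spell out the rescaling: the fitted line is \( \log y = \hat{a}' + \hat{b} \log x \), and dividing through by \( \log 10 \) gives \( \log_{10} y = \hat{a}'/\log 10 + \hat{b} \log_{10} x \), so the slope is unchanged and only the intercept needs to be divided by \( \log 10 \).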

Or base graphics …

par(las = 1, bty = "l")
with(dados, plot(He ~ forest_pct_500m, log = "xy"))
curve(exp(cc[1]) * x^cc[2], col = 2, add = TRUE)

[Figure: the same data and fitted power-law curve in base graphics, on log-log axes]

(note that curve() plots on the original scale, stat_function() plots on the transformed scale)

Fitting via bbmle::mle2(). First use a custom negative log-likelihood function, starting from the estimates based on the log-log fit and from the raw standard deviation of the data (the latter will be an overestimate of the residual standard deviation, although in this case, because the data are so noisy and the fitted line is nearly flat, it might not be too far off).

mlefun <- function(a1, a2, dp) {
    esp <- a1 * forest_pct_500m^a2  ## expected value under the power-law model
    ## negative log-likelihood (mle2 minimizes this)
    -sum(dnorm(He, mean = esp, sd = dp, log = TRUE))
}
start1 <- list(a1 = exp(cc[1]), a2 = cc[2], dp = sd(dados$He))
mlefit1 <- mle2(mlefun, start = start1, data = dados)

We get lots of warnings (suppressed here) because we are not doing anything to prevent the standard deviation parameter from going negative (e.g. by setting a lower-bound constraint via method="L-BFGS-B" or one of the optimx optimizer choices, or by fitting the log of the standard deviation instead; see the sketch below).
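For example, a minimal sketch of both fixes (the objects mlefit1b, mlefit1c and the function mlefun_log are illustrative names, not part of the original analysis):

## option 1: box constraints keep dp away from zero/negative values
mlefit1b <- mle2(mlefun, start = start1, data = dados,
    method = "L-BFGS-B", lower = c(a1 = -Inf, a2 = -Inf, dp = 1e-04))
## option 2: reparameterize so that sd = exp(logdp) is always positive
mlefun_log <- function(a1, a2, logdp) {
    esp <- a1 * forest_pct_500m^a2
    -sum(dnorm(He, mean = esp, sd = exp(logdp), log = TRUE))
}
mlefit1c <- mle2(mlefun_log,
    start = list(a1 = start1$a1, a2 = start1$a2, logdp = log(start1$dp)),
    data = dados)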

The actual coefficients aren't that different from the starting values:

cbind(mlefit1 = coef(mlefit1), start1 = unlist(start1))
##     mlefit1   start1
## a1 0.789161 0.786489
## a2 0.001617 0.002215
## dp 0.043414 0.044442

What if we (mistakenly) try starting from the raw intercept of the log-log regression, i.e. cc[1] (which is on the log scale) rather than exp(cc[1])?

start2 <- transform(start1, a1 = cc[1])
mlefit2 <- update(mlefit1, start = start2)
## Warning: convergence failure: code=1 (iteration limit 'maxit' reached)

In the newest version of bbmle we get a warning about convergence failure here (the default R behaviour is to leave this for the user to check …)
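If we want to check by hand (a sketch, not shown in the original), the raw optimizer result is stored in the details slot of the fit; a nonzero convergence code signals a problem (here it is 1, matching the warning above):

mlefit2@details$convergence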

Furthermore, the parameters are silly.

coef(mlefit2)
##     a1     a2     dp 
##  3.487 -2.805 56.085
print(vcov(mlefit2), digits = 3)
##           a1        a2      dp
## a1 158939.82 -29526.33    3.96
## a2 -29526.33  -3149.77   -5.21
## dp      3.96     -5.21 -143.06

What has happened here is that we have slid out of the sensible parameter regime into one where the estimated standard deviation is so large that the likelihood is nearly flat with respect to any possible change in the other coefficients (the negative diagonal elements of the variance-covariance matrix are another sign that the optimizer did not stop at a true maximum).
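As a quick sanity check (not part of the original output), we can compare the log-likelihoods of the two fits; the failed fit should be much worse:

logLik(mlefit1)
logLik(mlefit2)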

For what it's worth, there are a few other ways we could fit this model slightly more conveniently.

mlefit3 <- mle2(He ~ dnorm(mean = a1 * forest_pct_500m^a2, sd = dp), start = start1, 
    data = dados)

This gives us access to additional tools such as predict(), residuals() etc.
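For example (a minimal sketch; output not shown):

head(predict(mlefit3))    ## fitted means on the response scale
head(residuals(mlefit3))  ## observed minus fitted values

Another convenience is to estimate the standard deviation from the residuals inside a custom density function, so that only the mean parameters need to be optimized: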

dnorm2 <- function(x, mean, log = FALSE) {
    ## estimate the standard deviation from the residual sum of squares
    ## at the current mean parameters, instead of optimizing over it
    rss <- sum((x - mean)^2)
    n <- length(x)
    dnorm(x, mean = mean, sd = sqrt(rss/(n - 1)), log = log)
}
mlefit4 <- mle2(He ~ dnorm2(mean = a1 * forest_pct_500m^a2), start = start1[1:2], 
    data = dados)
coef(mlefit4)
##       a1       a2 
## 0.789163 0.001619
mlefit5 <- mle2(He ~ dnorm2(mean = exp(logmu)), parameters = list(logmu ~ log(forest_pct_500m)), 
    start = list(logmu = log(start1[[1]])), data = dados)
coef(mlefit5)
##          logmu.(Intercept) logmu.log(forest_pct_500m) 
##                  -0.236788                   0.001621

The parameters argument makes it easier for us to fit more complex response models (although now we get the intercept on the log scale again).
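For instance, a minimal sketch of a richer model for the log-mean (the quadratic term and the name mlefit6 are illustrative, not part of the original analysis):

mlefit6 <- mle2(He ~ dnorm2(mean = exp(logmu)),
    parameters = list(logmu ~ poly(log(forest_pct_500m), 2)),
    start = list(logmu = log(start1[[1]])), data = dados)
coef(mlefit6)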

nlsfit1 <- nls(He ~ a1 * forest_pct_500m^a2, start = start1[1:2], data = dados)
coef(nlsfit1)
##       a1       a2 
## 0.789119 0.001636
glmfit1 <- glm(He ~ log(forest_pct_500m), family = gaussian(link = "log"), data = dados)
coef(glmfit1)
##          (Intercept) log(forest_pct_500m) 
##            -0.236838             0.001636

This way we don't even have to specify starting parameters.

We can see that there's not much difference between the first crude model (the log-log linear fit) and the desired model predictions in this case …

pp <- data.frame(forest_pct_500m = 1:100)
pp$He <- predict(glmfit1, newdata = pp, type = "response")
p1 + geom_line(data = pp, colour = "blue") + stat_function(fun = function(x) cc[1]/log(10) + 
    cc[2] * x, colour = "red")

[Figure: the data with the glm prediction (blue) and the log-log regression line (red) overlaid]