In a study of ‘Swissmeteo’, pollen counts of different species (trees, grasses, etc) have been measured… albeit with some percentage (say 2–3%) of missing data. In our “Statistics Lab” at ETH, we were looking into imputation of such counts. In order to judge “goodness of fit” of models or imputations for such counts, we will want to summarize differences between observed counts \(N_i\) and a fitted (e.g., imputed) counts \(\hat{N_i}\).
Poisson regression, i.e., a generalized linear model with \(\log(.)\) link and (in R) family = poisson()
makes sense as a first approximation to modeling counts, given other station counts, weather conditions, previous years experience etc. A more sophisticated model would use “zero-inflated” (ZI) poisson or also ZI negative binomal models. The usual poisson regression model for counts \(N_i\) may be written as \[ \log(E[N_i]) = {\beta}^T \mathbf{x_i}, \] or more precisely, \(N_i \sim {\cal P}(\lambda_i)\) with \(\log(\lambda_i) = {\beta}^T \mathbf{x_i}\). In R, one would use
fm <- glm(N ~ ., data, family = poisson())
where family
is often specified as character string ("family"
) instead of function call. However that emphasizes that these are functions creating family objects in R:
poisF <- poisson() ; poisF
Family: poisson
Link function: log
## inspecting it a bit more:
str(poisF)
List of 12
$ family : chr "poisson"
$ link : chr "log"
$ linkfun :function (mu)
$ linkinv :function (eta)
$ variance :function (mu)
$ dev.resids:function (y, mu, wt)
$ aic :function (y, n, mu, wt, dev)
$ mu.eta :function (eta)
$ initialize: expression({ if (any(y < 0)) stop("negative values not allowed for the 'Poisson' family") n <- rep.int(1, nobs) mustart <- y + 0.1 })
$ validmu :function (mu)
$ valideta :function (eta)
$ simulate :function (object, nsim)
- attr(*, "class")= chr "family"
and we see that such R family
objects contain a dev.resids
function for computing “deviance residuals”, and we can use indeed to compare count discrepancies.
resid.d <- poisF$dev.resids; resid.d ## 'wt' allows **weights** to be specified
function (y, mu, wt)
{
r <- mu * wt
p <- which(y > 0)
r[p] <- (wt * (y * log(y/mu) - (y - mu)))[p]
2 * r
}
which defines (for constant weights, say \(1\)), the deviance function, using \(\hat Y\) here, instead of \(\mu\), \[ D(Y, \hat Y) = 2(Y \log(Y/{\hat Y}) - (Y - \hat Y)), \] where (for \(Y = 0\), the first term is set to zero (in short “\(0 \cdot \log(0) := 0\)”). Note that \(D(x,y)\) is not symmetric in \(x\) and \(y\), we have \(Y \ge 0\) and in theory, \(\mu > 0\), but in the above situation \(\hat Y \ge 0\), i.e., \(\hat Y = 0\) is possible and leads to \(D(Y, 0) = \infty\) (apart from \(D(0,0) = 0\)).
resid.d(1, mu = 0, wt = 1) # Y.hat = mu = 0
[1] Inf
Y <- c(0:3, 20:23)
Y.h <- Y + 1 # artificially assume we fitted '+1' every where
print( resid.d(Y, Y.h, wt = 1), digits = 3)
[1] 2.0000 0.6137 0.3781 0.2739 0.0484 0.0462 0.0441 0.0423
so, a “miss count” of one is more severe for small counts than for large ones, as it should be.
Applied statisticians for a long time have used square root transformed counts instead of counts, as the \(\sqrt{\cdot}\) is the “first aid” transformation for counts (John W. Tukey), because it is the variance stabilizing transformation for the poisson distribution. Also, for our application, larger scale imputation methods are often geared towards approximately normal data, sometimes even only indirectly by using least squares loss.
From this perspective, it is measure losses “normally”, i.e., squared deviation on the transformed scale, i.e., \[ S(Y, \hat Y) = 4 (\sqrt{Y} - \sqrt{{\hat Y}}) ^2, \] where the factor \(4\) is arbitrary for now.
Now, I found that indeed, the squared \(\sqrt\)-difference \(S()\) and the deviance residual \(D()\) are very similar numerically notably in the realistic situation that \(Y\) and \(\hat Y\) differ not by more than say two standard deviations, i.e., approximately \(2 \sqrt{\hat Y}\), e.g., first for the case above,
## Difference is '1' but residuals are *not* the same:
rbind(Y, resid.d = resid.d(Y, Y.h, wt=1),
sqrt.d = 4 *(sqrt(Y) - sqrt(Y.h))^2)
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
Y 0 1.00000 2.00000 3.00000 20.000000 21.000000 22.000000
resid.d 2 0.61371 0.37814 0.27391 0.048393 0.046159 0.044122
sqrt.d 4 0.68629 0.40408 0.28719 0.048788 0.046518 0.044450
[,8]
Y 23.000000
resid.d 0.042258
sqrt.d 0.042558
and indeed, the number are quite close. Here a visual “proof”:
p.pois.loss <- function(Y, Y.hat, type = "b", log = "xy", ...) {
stopifnot(length(Y) == length(Y.hat), Y >= 0, Y.hat >= 0, Y == as.integer(Y))
dMat <- cbind(resid.d = poisson()$dev.resids(y = Y, mu = Y.hat, wt=1),
sqrt.d = abs(sqrt(Y) - sqrt(Y.hat)),
sqrt.d2 = (sqrt(Y) - sqrt(Y.hat))^2)
matplot(Y, dMat, type=type, log=log,
ylab = "Poisson D(.,.)", main = "Poisson Losses", ...)
legend("topright", expression("deviance resid.",
abs(sqrt(Y) -sqrt(hat(Y))),
(sqrt(Y) - sqrt(hat(Y)))^2), lty=1:3, col=1:3, bty="n")
mtext(deparse(match.call()), side=3)
invisible(cbind(Y=Y, Y.hat = Y.hat, dMat))
}
Y <- c(0:3, 10:12, 20, 30,50, 100, 200, 500, 1000, 10000)
(d.1 <- p.pois.loss(Y, Y+1))
Warning in xy.coords(x, y, xlabel, ylabel, log = log): 3 x values <= 0
omitted from logarithmic plot
Warning in xy.coords(x, y, xlabel, ylabel, log): 1 x value <= 0
omitted from logarithmic plot
Y Y.hat resid.d sqrt.d sqrt.d2
[1,] 0 1 2.0000e+00 1.0000000 1.0000e+00
[2,] 1 2 6.1371e-01 0.4142136 1.7157e-01
[3,] 2 3 3.7814e-01 0.3178372 1.0102e-01
[4,] 3 4 2.7391e-01 0.2679492 7.1797e-02
[5,] 10 11 9.3796e-02 0.1543471 2.3823e-02
[6,] 11 12 8.5750e-02 0.1474768 2.1749e-02
[7,] 12 13 7.8975e-02 0.1414497 2.0008e-02
[8,] 20 21 4.8393e-02 0.1104397 1.2197e-02
[9,] 30 31 3.2611e-02 0.0905388 8.1973e-03
[10,] 50 51 1.9737e-02 0.0703606 4.9506e-03
[11,] 100 101 9.9338e-03 0.0498756 2.4876e-03
[12,] 200 201 4.9834e-03 0.0353113 1.2469e-03
[13,] 500 501 1.9973e-03 0.0223495 4.9950e-04
[14,] 1000 1001 9.9933e-04 0.0158074 2.4988e-04
[15,] 10000 10001 9.9993e-05 0.0049999 2.4999e-05
(d10 <- p.pois.loss(Y + 10, Y))
Y Y.hat resid.d sqrt.d sqrt.d2
[1,] 10 0 Inf 3.162278 10.0000000
[2,] 11 1 32.7536960 2.316625 5.3667504
[3,] 12 2 23.0022273 2.049888 4.2020410
[4,] 13 3 18.1247638 1.873500 3.5100040
[5,] 20 10 7.7258872 1.309858 1.7157288
[6,] 21 11 7.1583409 1.265951 1.6026317
[7,] 22 12 6.6699754 1.226314 1.5038464
[8,] 30 20 4.3279065 1.005090 1.0102051
[9,] 40 30 3.0145658 0.847330 0.7179677
[10,] 60 50 1.8785868 0.674899 0.4554885
[11,] 110 100 0.9682396 0.488088 0.2382304
[12,] 210 200 0.4918690 0.349241 0.1219694
[13,] 510 500 0.1986798 0.222500 0.0495062
[14,] 1010 1000 0.0996683 0.157721 0.0248758
[15,] 10010 10000 0.0099967 0.049988 0.0024988
(dd <- p.pois.loss(Y, pmax(0, Y + c(-2:-1, 1:2))))
Warning in Y + c(-2:-1, 1:2): longer object length is not a multiple
of shorter object length
Warning in xy.coords(x, y, xlabel, ylabel, log = log): 3 x values <= 0
omitted from logarithmic plot
Warning in xy.coords(x, y, xlabel, ylabel, log = log): 3 y values <= 0
omitted from logarithmic plot
Warning in xy.coords(x, y, xlabel, ylabel, log): 1 x value <= 0
omitted from logarithmic plot
Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0
omitted from logarithmic plot
Y Y.hat resid.d sqrt.d sqrt.d2
[1,] 0 0 0.0000e+00 0.0000000 0.0000e+00
[2,] 1 0 Inf 1.0000000 1.0000e+00
[3,] 2 3 3.7814e-01 0.3178372 1.0102e-01
[4,] 3 5 9.3505e-01 0.5040172 2.5403e-01
[5,] 10 8 4.6287e-01 0.3338505 1.1146e-01
[6,] 11 10 9.6824e-02 0.1543471 2.3823e-02
[7,] 12 13 7.8975e-02 0.1414497 2.0008e-02
[8,] 20 22 1.8759e-01 0.2182798 4.7646e-02
[9,] 30 28 1.3957e-01 0.1857230 3.4493e-02
[10,] 50 49 2.0271e-02 0.0710678 5.0506e-03
[11,] 100 101 9.9338e-03 0.0498756 2.4876e-03
[12,] 200 202 1.9868e-02 0.0705348 4.9752e-03
[13,] 500 498 8.0214e-03 0.0447662 2.0040e-03
[14,] 1000 999 1.0007e-03 0.0158153 2.5013e-04
[15,] 10000 10001 9.9993e-05 0.0049999 2.4999e-05
(d0 <- p.pois.loss(c(0,0,0,1,1,1),
c(0,1,2,0,1,2)))
Warning in xy.coords(x, y, xlabel, ylabel, log = log): 9 x values <= 0
omitted from logarithmic plot
Warning in xy.coords(x, y, xlabel, ylabel, log = log): 6 y values <= 0
omitted from logarithmic plot
Warning in xy.coords(x, y, xlabel, ylabel, log): 3 x values <= 0
omitted from logarithmic plot
Warning in xy.coords(x, y, xlabel, ylabel, log): 2 y values <= 0
omitted from logarithmic plot
Y Y.hat resid.d sqrt.d sqrt.d2
[1,] 0 0 0.00000 0.00000 0.00000
[2,] 0 1 2.00000 1.00000 1.00000
[3,] 0 2 4.00000 1.41421 2.00000
[4,] 1 0 Inf 1.00000 1.00000
[5,] 1 1 0.00000 0.00000 0.00000
[6,] 1 2 0.61371 0.41421 0.17157
(d.prop <- p.pois.loss(Y, 2*Y))
Warning in xy.coords(x, y, xlabel, ylabel, log = log): 3 x values <= 0
omitted from logarithmic plot
Warning in xy.coords(x, y, xlabel, ylabel, log = log): 3 y values <= 0
omitted from logarithmic plot
Warning in xy.coords(x, y, xlabel, ylabel, log): 1 x value <= 0
omitted from logarithmic plot
Warning in xy.coords(x, y, xlabel, ylabel, log): 1 y value <= 0
omitted from logarithmic plot
Y Y.hat resid.d sqrt.d sqrt.d2
[1,] 0 0 0.00000 0.00000 0.00000
[2,] 1 2 0.61371 0.41421 0.17157
[3,] 2 4 1.22741 0.58579 0.34315
[4,] 3 6 1.84112 0.71744 0.51472
[5,] 10 20 6.13706 1.30986 1.71573
[6,] 11 22 6.75076 1.37379 1.88730
[7,] 12 24 7.36447 1.43488 2.05887
[8,] 20 40 12.27411 1.85242 3.43146
[9,] 30 60 18.41117 2.26874 5.14719
[10,] 50 100 30.68528 2.92893 8.57864
[11,] 100 200 61.37056 4.14214 17.15729
[12,] 200 400 122.74113 5.85786 34.31458
[13,] 500 1000 306.85282 9.26210 85.78644
[14,] 1000 2000 613.70564 13.09858 171.57288
[15,] 10000 20000 6137.05639 41.42136 1715.72875
library("Ryacas")
yacas("S(x,y):= 4(Sqrt(x) - Sqrt(y))^2")
expression(TRUE)
## Now set y := x / (1 + delta)
yacas("S(x, x/(1+d))")
expression(`4`(root(x, 2) - root(x/(d + 1), 2))^2)
## cannot get 'yacas' to see it
yacas("S:= 4*Taylor(d,0,5) 1 + 1/(1+d) - 2/Sqrt(1+d)")
expression(4 * (d^2/4 + -3 * d^3/8 + 29 * d^4/64 + -65 * d^5/128))
yacas("S:= Factor(Simplify(%))")
expression((-3 * d/2 + 29 * d^2/16 + -65 * d^3/32 + 1) * d^2)