EXAMPLE 1

We model the number of awards earned by students at one senior high school. Predictors of the number of awards earned include the type of program in which the student was enrolled (e.g., HUMSS, ABM, or STEM) and the score on their final exam in statistics.

In this example, num_awards is the outcome variable and indicates the number of awards earned by students at a senior high school in a year, stat is a continuous predictor variable and represents students’ scores on their statistics final exam, and prog is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled. It is coded as 1 = “ABM”, 2 = “STEM”, and 3 = “HUMSS”. Let’s start by loading the data and looking at some descriptive statistics.

## load the data
poisson_awards <- read.delim("D:/UP Cebu Statistics Courses/1st Sem 2023-2024/STAT 197 (ST - Count Data Analysis)/MCD_Rdata/POIS/poisson_awards.txt")

## convert prog and id to factors
poisson_awards <- within(poisson_awards, {
  prog <- factor(prog, levels = 1:3, labels = c("ABM", "STEM", "HUMSS"))
  id <- factor(id)
})
summary(poisson_awards)
##        id        num_awards      prog          stat      
##  1      :  1   Min.   :0.00   ABM  : 45   Min.   :33.00  
##  2      :  1   1st Qu.:0.00   STEM :105   1st Qu.:45.00  
##  3      :  1   Median :0.00   HUMSS: 50   Median :52.00  
##  4      :  1   Mean   :0.63               Mean   :52.65  
##  5      :  1   3rd Qu.:1.00               3rd Qu.:59.00  
##  6      :  1   Max.   :6.00               Max.   :75.00  
##  (Other):194

Each variable has 200 valid observations and their distributions seem quite reasonable. The unconditional mean and variance of our outcome variable are not extremely different. Our model assumes that these values, conditioned on the predictor variables, will be equal (or at least roughly so).
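As a quick check of this assumption, we can compute the unconditional mean and variance of the outcome directly:

## unconditional mean and variance of num_awards
with(poisson_awards, c(mean = mean(num_awards), variance = var(num_awards)))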

We can use the tapply function to display the summary statistics by program type. The table below shows the average number of awards by program type and seems to suggest that program type is a good candidate for predicting the number of awards, our outcome variable, because the mean value of the outcome appears to vary by prog. Additionally, the means and variances within each level of prog (the conditional means and variances) are similar. A conditional histogram separated out by program type is plotted to show the distribution.

with(poisson_awards, tapply(num_awards, prog, function(x) {
  sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))
##                    ABM                   STEM                  HUMSS 
## "M (SD) = 0.20 (0.40)" "M (SD) = 1.00 (1.28)" "M (SD) = 0.24 (0.52)"
library(ggplot2)
ggplot(poisson_awards, aes(num_awards, fill = prog)) +
  geom_histogram(binwidth=.5, position="dodge")

Poisson Regression

At this point, we are ready to perform our Poisson regression analysis using the glm function. We fit the model, store it in the object model_1, and then request a summary of the fitted model.

model_1 <- glm(num_awards ~ prog + stat, family="poisson", data=poisson_awards)
summary(model_1)
## 
## Call:
## glm(formula = num_awards ~ prog + stat, family = "poisson", data = poisson_awards)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -5.24712    0.65845  -7.969 1.60e-15 ***
## progSTEM     1.08386    0.35825   3.025  0.00248 ** 
## progHUMSS    0.36981    0.44107   0.838  0.40179    
## stat         0.07015    0.01060   6.619 3.63e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 287.67  on 199  degrees of freedom
## Residual deviance: 189.45  on 196  degrees of freedom
## AIC: 373.5
## 
## Number of Fisher Scoring iterations: 6

Cameron and Trivedi (2009) recommended using robust standard errors for the parameter estimates to control for mild violation of the distribution assumption that the variance equals the mean.

We use the R package sandwich below to obtain the robust standard errors and calculate the p-values accordingly. Together with the p-values, we also calculate the 95% confidence intervals using the parameter estimates and their robust standard errors.

library(sandwich)
cov.model_1 <- vcovHC(model_1, type = "HC0")
std.err <- sqrt(diag(cov.model_1))
r.est <- cbind(Estimate = coef(model_1),
               "Robust SE" = std.err,
               "Pr(>|z|)" = 2 * pnorm(abs(coef(model_1)/std.err), lower.tail = FALSE),
               LL = coef(model_1) - 1.96 * std.err,
               UL = coef(model_1) + 1.96 * std.err)

r.est
##               Estimate  Robust SE     Pr(>|z|)          LL          UL
## (Intercept) -5.2471244 0.64599839 4.566630e-16 -6.51328124 -3.98096756
## progSTEM     1.0838591 0.32104816 7.354745e-04  0.45460476  1.71311353
## progHUMSS    0.3698092 0.40041731 3.557157e-01 -0.41500870  1.15462716
## stat         0.0701524 0.01043516 1.783975e-11  0.04969947  0.09060532
We can also run a goodness-of-fit test for the overall model by comparing the residual deviance with its degrees of freedom (this test is discussed in the bullet points below).

## goodness-of-fit test based on the residual deviance
with(model_1, cbind(res.deviance = deviance, df = df.residual,
  p = pchisq(deviance, df.residual, lower.tail = FALSE)))
##      res.deviance  df         p
## [1,]     189.4496 196 0.6182274

Now let’s look at the output of function glm more closely.

  • The output begins by echoing the function call. In older versions of R, a summary of the deviance residuals was displayed next; newer versions no longer print this block, but the residuals can still be inspected directly, as shown in the sketch after this list. Deviance residuals are approximately normally distributed if the model is specified correctly.

    • In our example, the deviance residuals show a little bit of skewness since their median is not quite zero.
  • Next come the Poisson regression coefficients for each of the variables, along with their standard errors, z-scores and p-values. The coefficient for stat is .07: the expected log count increases by .07 for each one-unit increase in stat.

    • The indicator variable progSTEM compares prog = “STEM” with prog = “ABM”: the expected log count for prog = “STEM” is about 1.1 higher than that for the reference group.

    • The indicator variable progHUMSS is the expected difference in log count (approximately .37) between prog = “HUMSS” and the reference group (prog = “ABM”).

  • The information on deviance is also provided. We can use the residual deviance to perform a goodness of fit test for the overall model.

    • The residual deviance is the difference between the deviance of the current model and the maximum deviance of the ideal model in which the predicted values are identical to the observed. Therefore, if the residual deviance is small enough, the goodness-of-fit test will not be significant, indicating that the model fits the data.
    • We conclude that the model fits reasonably well because the goodness-of-fit chi-squared test is not statistically significant. If the test had been statistically significant, it would indicate that the data do not fit the model well.
    • In that situation, we may try to determine if there are omitted predictor variables, if our linearity assumption holds and/or if there is an issue of over-dispersion.
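Since newer versions of R no longer print the deviance residuals in the summary output, here is a minimal sketch for inspecting them directly:

## five-number summary of the deviance residuals;
## a median away from zero hints at skewness
summary(residuals(model_1, type = "deviance"))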

We can also test the overall effect of prog by comparing the deviance of the full model with the deviance of the model excluding prog.

## update model_1 by dropping prog
model_2 <- update(model_1, . ~ . - prog)

## test model differences with chi square test
anova(model_2, model_1, test="Chisq")

The two degree-of-freedom chi-square test indicates that prog, taken together, is a statistically significant predictor of num_awards.
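Equivalently, the likelihood-ratio statistic is the drop in residual deviance when prog is added, referred to a chi-square distribution with 2 degrees of freedom; a minimal sketch using the two fitted models:

## LR test by hand: difference in deviances between the reduced and full models
lr <- deviance(model_2) - deviance(model_1)
pchisq(lr, df = 2, lower.tail = FALSE)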

Sometimes, we might want to present the regression results as incident rate ratios and their standard errors, together with their confidence intervals.

To compute the standard errors for the incident rate ratios, we will use the delta method. To this end, we make use of the function deltamethod implemented in the R package msm.

library(msm)
s <- deltamethod(list(~ exp(x1), ~ exp(x2), ~ exp(x3), ~ exp(x4)), coef(model_1), cov.model_1)

## exponentiate old estimates dropping the p values
rexp.est <- exp(r.est[, -3])

## replace SEs with estimates for exponentiated coefficients
rexp.est[, "Robust SE"] <- s

rexp.est
##               Estimate  Robust SE          LL         UL
## (Intercept) 0.00526263 0.00339965 0.001483604 0.01866757
## progSTEM    2.95606545 0.94903937 1.575550534 5.54620292
## progHUMSS   1.44745846 0.57958742 0.660334536 3.17284023
## stat        1.07267164 0.01119351 1.050955210 1.09483681

The output above indicates that the incident rate for prog = “STEM” is 2.96 times the incident rate for the reference group (prog = “ABM”). Likewise, the incident rate for prog = “HUMSS” is 1.45 times the incident rate for the reference group, holding the other variables constant. Finally, the incident rate of num_awards increases by 7% for every unit increase in stat.
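The 7% figure is just the exponentiated coefficient expressed as a percent change: exp(0.0702) is approximately 1.07. A quick sketch to compute this for every coefficient:

## percent change in the incident rate per one-unit increase in each predictor
100 * (exp(coef(model_1)) - 1)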

Sometimes, we might want to look at the expected marginal means.

For example, what are the expected counts for each program type, holding the stat score at its overall mean? To answer this question, we can make use of the predict function. First off, we will make a small data set to apply the predict function to.

(score_1 <- data.frame(stat = mean(poisson_awards$stat),
  prog = factor(1:3, levels = 1:3, labels = levels(poisson_awards$prog))))
predict(model_1, score_1, type="response", se.fit=TRUE)
## $fit
##         1         2         3 
## 0.2114109 0.6249446 0.3060086 
## 
## $se.fit
##          1          2          3 
## 0.07050108 0.08628117 0.08833706 
## 
## $residual.scale
## [1] 1

In the output above, we see that the predicted number of events for level 1 of prog is about .21, holding stat at its mean. The predicted number of events for level 2 of prog is higher at .62, and the predicted number of events for level 3 of prog is about .31. The ratios of these predicted counts (\(\frac{.625}{.211} = 2.96\), \(\frac{.306}{.211} = 1.45\)) match what we saw looking at the IRR.
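We can verify these ratios directly from the predicted values; a quick sketch (p is just a temporary vector):

## ratios of predicted counts relative to the reference level (ABM)
p <- predict(model_1, score_1, type = "response")
p / p[1]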

We can also graph the predicted number of events with the commands below.

## calculate and store predicted values
poisson_awards$phat <- predict(model_1, type="response")

## order by program and then by stat score
poisson_awards <- poisson_awards[with(poisson_awards, order(prog, stat)), ]

## create the plot
ggplot(poisson_awards, aes(x = stat, y = phat, colour = prog)) +
  geom_point(aes(y = num_awards), alpha = .5, position = position_jitter(h = .2)) +
  geom_line(linewidth = 1) +
  labs(x = "Stat Score", y = "Expected Number of Awards")

The graph indicates that the most awards are predicted for students in the STEM program, especially those with high stat scores. The lowest number of predicted awards is for students in the ABM program. The graph overlays the lines of expected values onto the actual points, although a small amount of random vertical jitter was added to lessen overplotting.

EXAMPLE 2

Each female horseshoe crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing near her. Explanatory variables that are thought to affect this included the female crab’s color (C), spine condition (S), weight (Wt), and carapace width (W). The response outcome for each female crab is her number of satellites (Sa). There are 173 females in this study.

crab_pois <- read.table("D:/UP Cebu Statistics Courses/1st Sem 2023-2024/STAT 197 (ST - Count Data Analysis)/MCD_Rdata/POIS/crab_pois.txt", quote = "\"", comment.char = "")
colnames(crab_pois) <- c("Obs", "C", "S", "W", "Wt", "Sa")

#### remove the column labeled "Obs"

crab_pois <- crab_pois[, -1]

Below is the R code for fitting a Poisson regression model with only one predictor, carapace width (W).

#### Poisson regression of Sa on W

model_2 <- glm(crab_pois$Sa ~ 1 + crab_pois$W, family = poisson(link = log))

Note that a Poisson distribution is specified in R with family = poisson and the log link with link = log. You can also get the predicted count for each observation and the linear predictor values from the R output using statements such as:

#### to get the predicted count for each observation:
#### e.g., for the first observation, E(y1) = 3.810

pred_table <- data.frame(crab_pois, pred = fitted(model_2))
pred_table
#### note the linear predictor values
#### e.g., for the first observation, exp(1.3378)=3.810

model_2$linear.predictors
##         1         2         3         4         5         6         7         8 
## 1.3377187 0.9604150 0.8947970 0.1401896 1.4525503 0.7963699 0.9932240 0.7799654 
##         9        10        11        12        13        14        15        16 
## 0.9112015 1.2064827 0.9768195 1.4361458 1.6658089 0.4518753 0.9932240 0.7143474 
##        17        18        19        20        21        22        23        24 
## 1.6165954 0.9932240 0.8619880 0.8619880 1.2064827 1.1244601 0.6323249 1.4033368 
##        25        26        27        28        29        30        31        32 
## 1.0424376 0.7143474 1.1736736 1.0424376 0.7963699 0.3042347 1.6494044 0.8619880 
##        33        34        35        36        37        38        39        40 
## 0.7799654 0.9276060 1.1572691 1.6986179 0.7963699 1.6165954 0.4518753 0.6159203 
##        41        42        43        44        45        46        47        48 
## 0.9604150 0.9276060 1.4525503 1.0424376 0.3862572 0.5995158 0.6815384 0.9604150 
##        49        50        51        52        53        54        55        56 
## 0.7471564 0.3862572 1.4033368 1.5017638 1.0752466 0.5338978 1.2392917 1.3213142 
##        57        58        59        60        61        62        63        64 
## 0.7471564 0.9112015 1.2556962 1.1244601 1.4525503 0.8947970 0.6651339 0.9112015 
##        65        66        67        68        69        70        71        72 
## 0.4846843 1.3705277 1.5673818 0.4846843 0.7143474 1.2064827 1.0096286 1.2556962 
##        73        74        75        76        77        78        79        80 
## 1.9282810 0.7963699 0.9932240 1.3541232 0.7143474 1.2721007 0.7963699 1.4525503 
##        81        82        83        84        85        86        87        88 
## 1.8954720 1.2228872 0.7143474 0.5995158 1.3213142 0.6487294 1.2885052 0.9604150 
##        89        90        91        92        93        94        95        96 
## 0.7471564 0.9276060 1.1408646 1.1900781 1.0752466 1.0916511 0.9276060 0.5831113 
##        97        98        99       100       101       102       103       104 
## 1.2721007 1.6165954 0.7963699 1.2392917 1.3377187 0.8783925 0.9604150 0.9932240 
##       105       106       107       108       109       110       111       112 
## 0.4682798 0.4518753 0.8127744 0.9440105 0.8783925 1.0916511 1.4525503 1.3705277 
##       113       114       115       116       117       118       119       120 
## 0.7471564 1.4525503 1.1244601 0.5831113 1.1244601 0.6651339 0.3862572 0.8127744 
##       121       122       123       124       125       126       127       128 
## 0.7799654 1.2064827 0.6815384 1.5345728 0.9932240 0.7471564 1.5837864 0.9112015 
##       129       130       131       132       133       134       135       136 
## 0.9932240 1.1244601 0.7635609 0.5831113 1.3213142 0.8291790 0.5010888 0.9276060 
##       137       138       139       140       141       142       143       144 
## 1.2064827 0.9112015 1.0916511 1.2064827 1.3705277 1.3705277 1.1900781 1.1572691 
##       145       146       147       148       149       150       151       152 
## 1.1408646 1.2885052 1.0424376 0.4682798 0.9604150 0.7143474 0.9276060 0.5503023 
##       153       154       155       156       157       158       159       160 
## 1.0752466 0.8783925 1.3213142 0.8291790 0.8455835 0.9112015 1.5017638 0.5995158 
##       161       162       163       164       165       166       167       168 
## 1.1900781 0.9932240 1.2885052 1.3541232 2.1907532 0.9276060 0.6323249 0.4846843 
##       169       170       171       172       173 
## 1.3377187 1.0424376 1.0424376 0.9768195 0.7143474
exp(model_2$linear.predictors)
##        1        2        3        4        5        6        7        8 
## 3.810341 2.612781 2.446839 1.150492 4.274001 2.217477 2.699925 2.181397 
##        9       10       11       12       13       14       15       16 
## 2.487309 3.341710 2.655995 4.204460 5.289951 1.571256 2.699925 2.042853 
##       17       18       19       20       21       22       23       24 
## 5.035916 2.699925 2.367863 2.367863 3.341710 3.078554 1.881981 4.068754 
##       25       26       27       28       29       30       31       32 
## 2.836122 2.042853 3.233851 2.836122 2.217477 1.355587 5.203879 2.367863 
##       33       34       35       36       37       38       39       40 
## 2.181397 2.528449 3.181234 5.466387 2.217477 5.035916 1.571256 1.851360 
##       41       42       43       44       45       46       47       48 
## 2.612781 2.528449 4.274001 2.836122 1.471463 1.821237 1.976917 2.612781 
##       49       50       51       52       53       54       55       56 
## 2.110989 1.471463 4.068754 4.489601 2.930715 1.705567 3.453167 3.748344 
##       57       58       59       60       61       62       63       64 
## 2.110989 2.487309 3.510281 3.078554 4.274001 2.446839 1.944751 2.487309 
##       65       66       67       68       69       70       71       72 
## 1.623662 3.937428 4.794080 1.623662 2.042853 3.341710 2.744581 3.510281 
##       73       74       75       76       77       78       79       80 
## 6.877678 2.217477 2.699925 3.873363 2.042853 3.568341 2.217477 4.274001 
##       81       82       83       84       85       86       87       88 
## 6.655689 3.396981 2.042853 1.821237 3.748344 1.913108 3.627360 2.612781 
##       89       90       91       92       93       94       95       96 
## 2.110989 2.528449 3.129473 3.287338 2.930715 2.979189 2.528449 1.791604 
##       97       98       99      100      101      102      103      104 
## 3.568341 5.035916 2.217477 3.453167 3.810341 2.407027 2.612781 2.699925 
##      105      106      107      108      109      110      111      112 
## 1.597244 1.571256 2.254153 2.570269 2.407027 2.979189 4.274001 3.937428 
##      113      114      115      116      117      118      119      120 
## 2.110989 4.274001 3.078554 1.791604 3.078554 1.944751 1.471463 2.254153 
##      121      122      123      124      125      126      127      128 
## 2.181397 3.341710 1.976917 4.639343 2.699925 2.110989 4.873373 2.487309 
##      129      130      131      132      133      134      135      136 
## 2.699925 3.078554 2.145904 1.791604 3.748344 2.291437 1.650517 2.528449 
##      137      138      139      140      141      142      143      144 
## 3.341710 2.487309 2.979189 3.341710 3.937428 3.937428 3.287338 3.181234 
##      145      146      147      148      149      150      151      152 
## 3.129473 3.627360 2.836122 1.597244 2.612781 2.042853 2.528449 1.733777 
##      153      154      155      156      157      158      159      160 
## 2.930715 2.407027 3.748344 2.291437 2.329337 2.487309 4.489601 1.821237 
##      161      162      163      164      165      166      167      168 
## 3.287338 2.699925 3.627360 3.873363 8.941945 2.528449 1.881981 1.623662 
##      169      170      171      172      173 
## 3.810341 2.836122 2.836122 2.655995 2.042853

In the output below, you should be able to identify the relevant parts:

summary(model_2)
## 
## Call:
## glm(formula = crab_pois$Sa ~ 1 + crab_pois$W, family = poisson(link = log))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.30476    0.54224  -6.095  1.1e-09 ***
## crab_pois$W  0.16405    0.01997   8.216  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 632.79  on 172  degrees of freedom
## Residual deviance: 567.88  on 171  degrees of freedom
## AIC: 927.18
## 
## Number of Fisher Scoring iterations: 6

The estimated model is \(\log(\hat{\mu}_i) = -3.30476 + 0.16405\,W_i\).

The asymptotic standard error (ASE) of the estimate \(\hat{\beta} = 0.164\) is 0.01997, which is small, and the slope is statistically significant given its z-value of 8.216 and its low p-value.

Interpretation: Since \(\hat{\beta} > 0\), the wider the female crab, the greater the expected number of male satellites, on the multiplicative order of \(\exp(0.1640) = 1.18\). More specifically, for a one-unit increase in width, the expected number of satellites is multiplied by 1.18.
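This multiplicative factor, together with a Wald-type 95% confidence interval for it, can be pulled directly from the fitted model; a minimal sketch (confint.default gives Wald intervals):

#### rate ratio for a one-unit increase in width
exp(coef(model_2))

#### Wald 95% CI on the rate-ratio scale
exp(confint.default(model_2))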

If we look at the scatter plot of W vs. Sa (see below), we may suspect some outliers, e.g., observations #48, #101 and #165. For example, #165 has W = 33.5 and Sa = 7.

#### creating a scatter plot of Sa vs. W

plot(crab_pois$W, crab_pois$Sa)

#### identify() is interactive: click on points in the plot window to label them;
#### in a non-interactive session it returns integer(0), as below
identify(crab_pois$W, crab_pois$Sa)

## integer(0)
#### points identified on the screen and the plot: #48, #101, #165
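To see how these suspected outliers sit relative to the model, we can overlay the estimated Poisson mean curve on the scatter plot; a minimal sketch in base graphics:

#### scatter plot of Sa vs. W with the fitted mean curve overlaid
plot(crab_pois$W, crab_pois$Sa, xlab = "W", ylab = "Sa")
curve(exp(coef(model_2)[1] + coef(model_2)[2] * x), add = TRUE)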