The number of awards earned by students at one senior high school. Predictors of the number of awards earned include the type of program in which the student was enrolled (e.g., HUMSS, ABM or STEM) and the score on their final exam in statistics.
In this example, num_awards is the outcome variable and indicates the number of awards earned by students at a senior high school in a year, stat is a continuous predictor variable and represents students’ scores on their statistics final exam, and prog is a categorical predictor variable with three levels indicating the type of program in which the students were enrolled. It is coded as 1 = “ABM”, 2 = “STEM” and 3 = “HUMSS”. Let’s start with loading the data and looking at some descriptive statistics.
poisson_awards <- read.delim("D:/UP Cebu Statistics Courses/1st Sem 2023-2024/STAT 197 (ST - Count Data Analysis)/MCD_Rdata/POIS/poisson_awards.txt")
attach(poisson_awards)
poisson_awards <- within(poisson_awards, {
prog <- factor(prog, levels=1:3, labels=c("ABM", "STEM", "HUMSS"))
id <- factor(id)
})
summary(poisson_awards)
## id num_awards prog stat
## 1 : 1 Min. :0.00 ABM : 45 Min. :33.00
## 2 : 1 1st Qu.:0.00 STEM :105 1st Qu.:45.00
## 3 : 1 Median :0.00 HUMSS: 50 Median :52.00
## 4 : 1 Mean :0.63 Mean :52.65
## 5 : 1 3rd Qu.:1.00 3rd Qu.:59.00
## 6 : 1 Max. :6.00 Max. :75.00
## (Other):194
Each variable has 200 valid observations and their distributions seem quite reasonable. The unconditional mean and variance of our outcome variable are not extremely different. Our model assumes that these values, conditioned on the predictor variables, will be equal (or at least roughly so).
We can use the tapply function to display the summary statistics by program type. The table below shows the average numbers of awards by program type and seems to suggest that program type is a good candidate for predicting the number of awards, our outcome variable, because the mean value of the outcome appears to vary by prog. Additionally, the means and variances within each level of prog–the conditional means and variances–are similar. A conditional histogram separated out by program type is plotted to show the distribution.
with(poisson_awards, tapply(num_awards, prog, function(x) {
sprintf("M (SD) = %1.2f (%1.2f)", mean(x), sd(x))
}))
## ABM STEM HUMSS
## "M (SD) = 0.20 (0.40)" "M (SD) = 1.00 (1.28)" "M (SD) = 0.24 (0.52)"
library(ggplot2)
ggplot(poisson_awards, aes(num_awards, fill = prog)) +
geom_histogram(binwidth=.5, position="dodge")
At this point, we are ready to perform our Poisson model analysis using the glm function. We fit the model and store it in the object model_1 and get a summary of the model at the same time.
model_1 <- glm(num_awards ~ prog + stat, family="poisson", data=poisson_awards)
summary(model_1)
##
## Call:
## glm(formula = num_awards ~ prog + stat, family = "poisson", data = poisson_awards)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -5.24712 0.65845 -7.969 1.60e-15 ***
## progSTEM 1.08386 0.35825 3.025 0.00248 **
## progHUMSS 0.36981 0.44107 0.838 0.40179
## stat 0.07015 0.01060 6.619 3.63e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 287.67 on 199 degrees of freedom
## Residual deviance: 189.45 on 196 degrees of freedom
## AIC: 373.5
##
## Number of Fisher Scoring iterations: 6
Cameron and Trivedi (2009) recommended using robust standard errors for the parameter estimates to control for mild violation of the distribution assumption that the variance equals the mean.
We use R package sandwich below to obtain the robust standard errors and calculated the p-values accordingly. Together with the p-values, we have also calculated the 95% confidence interval using the parameter estimates and their robust standard errors.
library(sandwich)
cov.model_1 <- vcovHC(model_1, type="HC0")
std.err <- sqrt(diag(cov.model_1))
r.est <- cbind(Estimate= coef(model_1), "Robust SE" = std.err,
"Pr(>|z|)" = 2 * pnorm(abs(coef(model_1)/std.err), lower.tail=FALSE),
LL = coef(model_1) - 1.96 * std.err,
UL = coef(model_1) + 1.96 * std.err)
r.est
## Estimate Robust SE Pr(>|z|) LL UL
## (Intercept) -5.2471244 0.64599839 4.566630e-16 -6.51328124 -3.98096756
## progSTEM 1.0838591 0.32104816 7.354745e-04 0.45460476 1.71311353
## progHUMSS 0.3698092 0.40041731 3.557157e-01 -0.41500870 1.15462716
## stat 0.0701524 0.01043516 1.783975e-11 0.04969947 0.09060532
with(model_1, cbind(res.deviance = deviance, df = df.residual,
p = pchisq(deviance, df.residual, lower.tail=FALSE)))
## res.deviance df p
## [1,] 189.4496 196 0.6182274
Now let’s look at the output of function glm more closely.
The output begins with echoing the function call. The information on deviance residuals is displayed next. Deviance residuals are approximately normally distributed if the model is specified correctly.
Next come the Poisson regression coefficients for each of the variables along with the standard errors, z-scores, p-values and 95% confidence intervals for the coefficients. The coefficient for stat is .07. This means that the expected log count for a one-unit increase in stat is .07.
The indicator variable progSTEM compares between prog = “STEM” and prog = “ABM”, the expected log count for prog = “STEM” increases by about 1.1.
The indicator variable progHUMSS is the expected difference in log count ((approx .37)) between prog = “HUMSS” and the reference group (prog = “ABM”).
The information on deviance is also provided. We can use the residual deviance to perform a goodness of fit test for the overall model.
We can also test the overall effect of prog by comparing the deviance of the full model with the deviance of the model excluding prog.
## update m1 model dropping prog
model_2 <- update(model_1, . ~ . - prog)
## test model differences with chi square test
anova(model_2, model_1, test="Chisq")
The two degree-of-freedom chi-square test indicates that prog, taken together, is a statistically significant predictor of num_awards.
Sometimes, we might want to present the regression results as incident rate ratios and their standard errors, together with the confidence interval.
To compute the standard error for the incident rate ratios, we will use the Delta method. To this end, we make use the function deltamethod implemented in R package msm.
library(msm)
s <- deltamethod(list(~ exp(x1), ~ exp(x2), ~ exp(x3), ~ exp(x4)), coef(model_1), cov.model_1)
## exponentiate old estimates dropping the p values
rexp.est <- exp(r.est[, -3])
## replace SEs with estimates for exponentiated coefficients
rexp.est[, "Robust SE"] <- s
rexp.est
## Estimate Robust SE LL UL
## (Intercept) 0.00526263 0.00339965 0.001483604 0.01866757
## progSTEM 2.95606545 0.94903937 1.575550534 5.54620292
## progHUMSS 1.44745846 0.57958742 0.660334536 3.17284023
## stat 1.07267164 0.01119351 1.050955210 1.09483681
The output above indicates that the incident rate for prog = “STEM” is 2.96 times the incident rate for the reference group (prog = “ABM”). Likewise, the incident rate for prog = “HUMSS” is 1.45 times the incident rate for the reference group holding the other variables at constant. The percent change in the incident rate of num_awards is by 7% for every unit increase in stat.
Sometimes, we might want to look at the expected marginal means.
For example, what are the expected counts for each program type holding stat score at its overall mean? To answer this question, we can make use of the predict function. First off, we will make a small data set to apply the predict function to it.
(score_1 <- data.frame(stat = mean(poisson_awards$stat),
prog = factor(1:3, levels = 1:3, labels = levels(poisson_awards$prog))))
predict(model_1, score_1, type="response", se.fit=TRUE)
## $fit
## 1 2 3
## 0.2114109 0.6249446 0.3060086
##
## $se.fit
## 1 2 3
## 0.07050108 0.08628117 0.08833706
##
## $residual.scale
## [1] 1
In the output above, we see that the predicted number of events for level 1 of prog is about .21, holding stat at its mean. The predicted number of events for level 2 of prog is higher at .62, and the predicted number of events for level 3 of prog is about .31. The ratios of these predicted counts (\(\frac{.625}{.211} = 2.96\), \(\frac{.306}{.211} = 1.45\)) match what we saw looking at the IRR.
We can also graph the predicted number of events with the commands below.
## calculate and store predicted values
poisson_awards$phat <- predict(model_1, type="response")
## order by program and then by math
poisson_awards <- poisson_awards[with(poisson_awards, order(prog, stat)), ]
## create the plot
ggplot(poisson_awards, aes(x = stat, y = phat, colour = prog)) +
geom_point(aes(y = num_awards), alpha=.5, position=position_jitter(h=.2)) +
geom_line(size = 1) +
labs(x = "Stat Score", y = "Expected Number of Awards")
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
The graph indicates that the most awards are predicted for those in the academic program (prog = 2), especially if the student has a high stat score. The lowest number of predicted awards is for those students in the general program (prog = 1). The graph overlays the lines of expected values onto the actual points, although a small amount of random noise was added vertically to lessen overplotting.
Each female horseshoe crab in the study had a male crab attached to her in her nest. The study investigated factors that affect whether the female crab had any other males, called satellites, residing near her. Explanatory variables that are thought to affect this included the female crab’s color (C), spine condition (S), weight (Wt), and carapace width (W). The response outcome for each female crab is her number of satellites (Sa). There are 173 females in this study.
crab_pois <- read.table("D:/UP Cebu Statistics Courses/1st Sem 2023-2024/STAT 197 (ST - Count Data Analysis)/MCD_Rdata/POIS/crab_pois.txt", quote="\"", comment.char="")
colnames(crab_pois)=c("Obs","C","S","W","Wt","Sa")
#### to remove the column labeled "Obs"
crab_pois=crab_pois[,-1]
Below is the part of R code that corresponds for fitting a Poisson regression model with only one predictor, carapace width (W).
#### Poisson Regression of Sa on W
model_2=glm(crab_pois$Sa~1+crab_pois$W,family=poisson(link=log))
Note that the specification of a Poisson distribution in R is “family=poisson” and “link=log”. You can also get the predicted count for each observation and the linear predictor values from R output by using specific statements such as:
#### to get the predicted count for each observation:
#### e.g. for the first observation E(y1)=3.810
print=data.frame(crab_pois,pred=model_2$fitted)
print
#### note the linear predictor values
#### e.g., for the first observation, exp(1.3378)=3.810
model_2$linear.predictors
## 1 2 3 4 5 6 7 8
## 1.3377187 0.9604150 0.8947970 0.1401896 1.4525503 0.7963699 0.9932240 0.7799654
## 9 10 11 12 13 14 15 16
## 0.9112015 1.2064827 0.9768195 1.4361458 1.6658089 0.4518753 0.9932240 0.7143474
## 17 18 19 20 21 22 23 24
## 1.6165954 0.9932240 0.8619880 0.8619880 1.2064827 1.1244601 0.6323249 1.4033368
## 25 26 27 28 29 30 31 32
## 1.0424376 0.7143474 1.1736736 1.0424376 0.7963699 0.3042347 1.6494044 0.8619880
## 33 34 35 36 37 38 39 40
## 0.7799654 0.9276060 1.1572691 1.6986179 0.7963699 1.6165954 0.4518753 0.6159203
## 41 42 43 44 45 46 47 48
## 0.9604150 0.9276060 1.4525503 1.0424376 0.3862572 0.5995158 0.6815384 0.9604150
## 49 50 51 52 53 54 55 56
## 0.7471564 0.3862572 1.4033368 1.5017638 1.0752466 0.5338978 1.2392917 1.3213142
## 57 58 59 60 61 62 63 64
## 0.7471564 0.9112015 1.2556962 1.1244601 1.4525503 0.8947970 0.6651339 0.9112015
## 65 66 67 68 69 70 71 72
## 0.4846843 1.3705277 1.5673818 0.4846843 0.7143474 1.2064827 1.0096286 1.2556962
## 73 74 75 76 77 78 79 80
## 1.9282810 0.7963699 0.9932240 1.3541232 0.7143474 1.2721007 0.7963699 1.4525503
## 81 82 83 84 85 86 87 88
## 1.8954720 1.2228872 0.7143474 0.5995158 1.3213142 0.6487294 1.2885052 0.9604150
## 89 90 91 92 93 94 95 96
## 0.7471564 0.9276060 1.1408646 1.1900781 1.0752466 1.0916511 0.9276060 0.5831113
## 97 98 99 100 101 102 103 104
## 1.2721007 1.6165954 0.7963699 1.2392917 1.3377187 0.8783925 0.9604150 0.9932240
## 105 106 107 108 109 110 111 112
## 0.4682798 0.4518753 0.8127744 0.9440105 0.8783925 1.0916511 1.4525503 1.3705277
## 113 114 115 116 117 118 119 120
## 0.7471564 1.4525503 1.1244601 0.5831113 1.1244601 0.6651339 0.3862572 0.8127744
## 121 122 123 124 125 126 127 128
## 0.7799654 1.2064827 0.6815384 1.5345728 0.9932240 0.7471564 1.5837864 0.9112015
## 129 130 131 132 133 134 135 136
## 0.9932240 1.1244601 0.7635609 0.5831113 1.3213142 0.8291790 0.5010888 0.9276060
## 137 138 139 140 141 142 143 144
## 1.2064827 0.9112015 1.0916511 1.2064827 1.3705277 1.3705277 1.1900781 1.1572691
## 145 146 147 148 149 150 151 152
## 1.1408646 1.2885052 1.0424376 0.4682798 0.9604150 0.7143474 0.9276060 0.5503023
## 153 154 155 156 157 158 159 160
## 1.0752466 0.8783925 1.3213142 0.8291790 0.8455835 0.9112015 1.5017638 0.5995158
## 161 162 163 164 165 166 167 168
## 1.1900781 0.9932240 1.2885052 1.3541232 2.1907532 0.9276060 0.6323249 0.4846843
## 169 170 171 172 173
## 1.3377187 1.0424376 1.0424376 0.9768195 0.7143474
exp(model_2$linear.predictors)
## 1 2 3 4 5 6 7 8
## 3.810341 2.612781 2.446839 1.150492 4.274001 2.217477 2.699925 2.181397
## 9 10 11 12 13 14 15 16
## 2.487309 3.341710 2.655995 4.204460 5.289951 1.571256 2.699925 2.042853
## 17 18 19 20 21 22 23 24
## 5.035916 2.699925 2.367863 2.367863 3.341710 3.078554 1.881981 4.068754
## 25 26 27 28 29 30 31 32
## 2.836122 2.042853 3.233851 2.836122 2.217477 1.355587 5.203879 2.367863
## 33 34 35 36 37 38 39 40
## 2.181397 2.528449 3.181234 5.466387 2.217477 5.035916 1.571256 1.851360
## 41 42 43 44 45 46 47 48
## 2.612781 2.528449 4.274001 2.836122 1.471463 1.821237 1.976917 2.612781
## 49 50 51 52 53 54 55 56
## 2.110989 1.471463 4.068754 4.489601 2.930715 1.705567 3.453167 3.748344
## 57 58 59 60 61 62 63 64
## 2.110989 2.487309 3.510281 3.078554 4.274001 2.446839 1.944751 2.487309
## 65 66 67 68 69 70 71 72
## 1.623662 3.937428 4.794080 1.623662 2.042853 3.341710 2.744581 3.510281
## 73 74 75 76 77 78 79 80
## 6.877678 2.217477 2.699925 3.873363 2.042853 3.568341 2.217477 4.274001
## 81 82 83 84 85 86 87 88
## 6.655689 3.396981 2.042853 1.821237 3.748344 1.913108 3.627360 2.612781
## 89 90 91 92 93 94 95 96
## 2.110989 2.528449 3.129473 3.287338 2.930715 2.979189 2.528449 1.791604
## 97 98 99 100 101 102 103 104
## 3.568341 5.035916 2.217477 3.453167 3.810341 2.407027 2.612781 2.699925
## 105 106 107 108 109 110 111 112
## 1.597244 1.571256 2.254153 2.570269 2.407027 2.979189 4.274001 3.937428
## 113 114 115 116 117 118 119 120
## 2.110989 4.274001 3.078554 1.791604 3.078554 1.944751 1.471463 2.254153
## 121 122 123 124 125 126 127 128
## 2.181397 3.341710 1.976917 4.639343 2.699925 2.110989 4.873373 2.487309
## 129 130 131 132 133 134 135 136
## 2.699925 3.078554 2.145904 1.791604 3.748344 2.291437 1.650517 2.528449
## 137 138 139 140 141 142 143 144
## 3.341710 2.487309 2.979189 3.341710 3.937428 3.937428 3.287338 3.181234
## 145 146 147 148 149 150 151 152
## 3.129473 3.627360 2.836122 1.597244 2.612781 2.042853 2.528449 1.733777
## 153 154 155 156 157 158 159 160
## 2.930715 2.407027 3.748344 2.291437 2.329337 2.487309 4.489601 1.821237
## 161 162 163 164 165 166 167 168
## 3.287338 2.699925 3.627360 3.873363 8.941945 2.528449 1.881981 1.623662
## 169 170 171 172 173
## 3.810341 2.836122 2.836122 2.655995 2.042853
In the output below, you should be able to identify the relevant parts:
summary(model_2)
##
## Call:
## glm(formula = crab_pois$Sa ~ 1 + crab_pois$W, family = poisson(link = log))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.30476 0.54224 -6.095 1.1e-09 ***
## crab_pois$W 0.16405 0.01997 8.216 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 632.79 on 172 degrees of freedom
## Residual deviance: 567.88 on 171 degrees of freedom
## AIC: 927.18
##
## Number of Fisher Scoring iterations: 6
The estimated model is:\(\log(\hat{\mu}_i)=-3.30476+0.16405(W_i)\).
The ASE of estimated \(\beta= 0.164\) is 0.01997 which is small, and the slope is statistically significant given its z-value of 8.216 and its low p-value.
Interpretation: Since estimate of \(\beta > 0\), the wider the female crab, the greater expected number of male satellites on the multiplicative order as \(\exp(0.1640) = 1.18\). More specifically, for one unit of increase in the width, the number of Sa will increase and it will be multiplied by 1.18.
If we look at the scatter plot of W vs. Sa (see further below) we may suspect some outliers, e.g., observations #48, #101 and #165. For example, #165 has W = 33.5, and Sa = 7.
#### creating a scatter plot of Sa vs. W
plot(crab_pois$W,crab_pois$Sa)
identify(crab_pois$W, crab_pois$Sa)
## integer(0)
#### click on the plot to identify individual values
#### identified on the screen and the plot, \#48,101,165