[http://www.nuigalway.ie/maths/jh/npmlreg/html/irlsuicide.html]
A data frame consists of 104 observations on the following 8 variables.
Region : 13 different health regions of Ireland.
ID : factor with levels 1 2 3 … 12 13 corresponding to Regions.
pop : a numeric vector giving the population size of each suppopulation group.
death :a numeric vector giving the total number of deaths of each suppopulation group.
sex : a factor for gender with levels 0 (female) and 1 (male).
age : a factor for age with levels 1 (0-29), 2 (30-39), 3 (40-59), 4(60+ years).
smr : a numeric vector with standardized mortality ratios (SMRs)
expected : a numeric vector with `expected’ number of cases obtained from a reference population (Ahlbom, 1993).
The Poisson Mixed-Effect model is used for the analysis of Irish Suicide Data. First, This data has 2 level hierarchical data structure. The upper level of this data consists of 13 classified health regions of the Republic of Ireland and the lower level is comprised of 8 suppopulation groups with the combinaton of age and sex, standardized mortality ratio, pop, expected number of cases, and the number of death. Assuming that the random effects are differnet in each health regions and the fixed effects are identical for all groups, the random effects model is considered for this data. The following box plots reflects the possibility of the random effects from 13 different regions on the response variable ‘death’.
library(npmlreg)
library(car)
data(irlsuicide)
str(irlsuicide)
## 'data.frame': 104 obs. of 8 variables:
## $ Region : Factor w/ 13 levels "Cork ","Dublin ",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ ID : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ pop : int 31923 31907 8517 8127 13527 12965 11958 8299 117575 114626 ...
## $ death : int 6 52 8 37 17 33 14 22 33 114 ...
## $ sex : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 1 2 ...
## $ age : Factor w/ 4 levels "1","2","3","4": 1 1 2 2 3 3 4 4 1 1 ...
## $ smr : num 97.7 144 181.7 193.4 182.4 ...
## $ expected: num 6.14 36.11 4.4 19.13 9.32 ...
irlsuicide$Region = with(irlsuicide, reorder(Region, death))
plot(death ~ Region, data=irlsuicide, pch=16,col="lightblue")
Second, the response variable death is the total counts of suicides in each subpopulation defined by age and gender combination over the 10 years (1989~1998). The Nature of count data is discrete and a skewed distribution. The shape of the distribution is checked by the plot below. The follwoing plot visualizes the frequency of each number of deaths using vertical lines. The numbers on top of the frequencies are the different values of death. Thus, there are 4 small suppopulation groups with 1 death, 5 with 2 deaths, 2 with 3 deaths, and so on. The shape of the response variable death is skewed to the right since death can’t be negative numbers. So Poisson distribution conforms to this data.
Therefore, the Poisson Mixed-Effect model is considered for this analysis.
(tab <- xtabs(~ death, data=irlsuicide))
## death
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
## 4 5 2 2 2 3 4 1 5 2 1 2 3 5 1 2 2 4
## 19 21 22 25 26 27 30 31 33 34 35 37 38 42 45 48 52 55
## 1 2 3 2 1 2 1 1 2 2 1 2 1 2 2 2 1 1
## 58 61 67 70 74 77 81 82 87 92 97 101 104 106 108 112 114 132
## 3 2 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1
## 142 162 198
## 1 1 1
x<- as.numeric(names(tab))
plot(x, tab, type="h", xlab= "Numberof deaths", ylab="Frequency")
points(x, tab, pch=16)
To show the structure of the data, the following discriptive table analyses are used.
data(irlsuicide)
scatterplotMatrix(~ death + Region + age + sex + pop + smr + expected, span =0.7, data = irlsuicide)
(xtabs(death~ Region, data = irlsuicide))
## Region
## Cork Dublin EHB - Dub. Galway Lim. Mid HB
## 189 485 603 51 67 213
## MWHB - Lim. NEHB NWHB SEHB - Wat. SHB - Cork Waterf.
## 305 331 226 462 510 36
## WHB - Gal.
## 346
(xtabs(death~ sex, data = irlsuicide))
## sex
## 0 1
## 742 3082
(xtabs(death~ age, data= irlsuicide))
## age
## 1 2 3 4
## 1292 777 1151 604
(xtabs(death~ Region+sex, data = irlsuicide))
## sex
## Region 0 1
## Cork 45 144
## Dublin 127 358
## EHB - Dub. 127 476
## Galway 10 41
## Lim. 11 56
## Mid HB 36 177
## MWHB - Lim. 54 251
## NEHB 55 276
## NWHB 37 189
## SEHB - Wat. 77 385
## SHB - Cork 97 413
## Waterf. 10 26
## WHB - Gal. 56 290
(xtabs(death~ Region+age, data= irlsuicide))
## age
## Region 1 2 3 4
## Cork 58 45 50 36
## Dublin 147 131 143 64
## EHB - Dub. 233 137 177 56
## Galway 25 11 10 5
## Lim. 26 20 17 4
## Mid HB 67 35 70 41
## MWHB - Lim. 93 58 102 52
## NEHB 110 54 98 69
## NWHB 76 41 75 34
## SEHB - Wat. 160 92 134 76
## SHB - Cork 179 88 152 91
## Waterf. 8 8 15 5
## WHB - Gal. 110 57 108 71
The first poisson mixed effect model has fixed effect term ‘sex’ and the random intercpet term and the second model has the fixted effect term ‘sex’ and the random slope term added to measure the effect on the explantory variable ‘sex’. The next 2 models are set up in the same way for checking the random effect on the slope of ‘age’ variable. Then the following 3 models are used to check the random effect from the Regional groups on the slope of ‘age’ and ‘sex’ respectively when there are explantory variables ‘sex’ and ‘age’.
The other explanatory variables ‘smr’(with standardized mortality ratios) and ‘expected’(expected number of cases obtained from a reference population) are not considered for model fitting since ‘smr’ is just the standardized ratio of death to expected and so they are highly correlated to each other.
library(lme4)
library(Zelig)
m1_res <-glmer(death ~sex +(1|Region), family= poisson, data = irlsuicide)
m2_res <-glmer(death ~sex +(sex|Region), family= poisson, data = irlsuicide)
m1_rea <-glmer(death ~ age + (1|Region), family= poisson, data = irlsuicide)
m2_rea <-glmer(death ~ age + (age|Region), family=poisson, data = irlsuicide)
m1_resa <-glmer(death ~sex + age + (1|Region), family= poisson, data = irlsuicide)
m2_resa <-glmer(death ~sex + age + (sex|Region), family= poisson, data = irlsuicide)
m3_resa <-glmer(death ~sex + age + (age|Region), family= poisson, data = irlsuicide)
Then anova() uses Maximum Likelihood method on these models and it returns the chi-squre-based p-value for the significance of the random effect on the slope of ‘sex’. The p-value is 0.03009 and which implies that the 13 regions’s random effect on the response variable ‘death’ and the explanatory variable ‘sex’ is sgnificant. The second Anova returns the p-value 0.0002305 which is much lower than the p-value for the effect on ‘sex’. The third Anova shows that the random effect on the slope of ‘age’ is most significant with the p-value 0.0008843. This model shows the lowest deviance value(maximum likelihoood estimation:779.64) as the best model.
anova(m1_res, m2_res)
## Data: irlsuicide
## Models:
## m1_res: death ~ sex + (1 | Region)
## m2_res: death ~ sex + (sex | Region)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## m1_res 3 1146 1153.9 -570.00 1140
## m2_res 5 1143 1156.2 -566.49 1133 7.0073 2 0.03009 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m1_rea, m2_rea)
## Data: irlsuicide
## Models:
## m1_rea: death ~ age + (1 | Region)
## m2_rea: death ~ age + (age | Region)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## m1_rea 5 2359.4 2372.7 -1174.7 2349.4
## m2_rea 14 2345.8 2382.8 -1158.9 2317.8 31.632 9 0.0002305 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(m1_resa, m2_resa, m3_resa)
## Data: irlsuicide
## Models:
## m1_resa: death ~ sex + age + (1 | Region)
## m2_resa: death ~ sex + age + (sex | Region)
## m3_resa: death ~ sex + age + (age | Region)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## m1_resa 6 823.27 839.14 -405.64 811.27
## m2_resa 8 820.26 841.42 -402.13 804.26 7.0073 2 0.0300869 *
## m3_resa 15 809.64 849.30 -389.82 779.64 24.6247 7 0.0008842 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The next step is to investigate the fixed effect terms. The general Poisson model with no regional random effect for complete pooling is compared to the poisson mixed effect model for partial pooling of observations within the same regions. The results shows that the random effects coming from regional differeces changes the slopes of the explanatory variable ‘age’ groups but not the slope of ‘sex’. This results exactly conform to the above random effect mainly on the varialbe ‘age’, but not on ‘sex’.
library(stargazer)
cpooling <-glm(death~ sex+age, family=poisson, data = irlsuicide)
m3_resa <-glmer(death ~sex + age +(age|Region), family= poisson, data = irlsuicide)
stargazer(cpooling, m3_resa, type = "html")
| Dependent variable: | ||
| death | ||
| Poisson | generalized linear | |
| mixed-effects | ||
| (1) | (2) | |
| sex1 | 1.424*** | 1.424*** |
| (0.041) | (0.041) | |
| age2 | -0.509*** | -0.493*** |
| (0.045) | (0.075) | |
| age3 | -0.116*** | -0.114* |
| (0.041) | (0.061) | |
| age4 | -0.760*** | -0.807*** |
| (0.049) | (0.114) | |
| Constant | 2.959*** | 2.665*** |
| (0.043) | (0.248) | |
| Observations | 104 | 104 |
| Log Likelihood | -1,160.509 | -389.819 |
| Akaike Inf. Crit. | 2,331.018 | 809.638 |
| Bayesian Inf. Crit. | 849.304 | |
| Note: | p<0.1; p<0.05; p<0.01 | |
The standard link function of Poisson model is the natural logarithm. So from the fitted Poisson mixed effect model, E(y) = exp(2.666+ 1.424 x sex1 - 0.493 x age2 -0.114 x age3- 0.807x age4) and all estimated parameters for the explanatory variables are statistically significant at alpha level 5%.
Which implies that the average number of suicides of the female group(sex=0) with age between 0-29 for the 10 years is exp(2.666) = 14.3823. The estimated parameter for sex is 1.424. So exp(1.424) = 4.1537 means that male group has approximately 4 times higher average number of suicides than femal group for the 10 years. The estimated parameter for age groups are -0.493, -0.114, -0.807 and so The age group 2(30~39) has about 61.08% of the average number of suicides comaring to the base age group 1(20~29) and the age group3(40~59) has about 89.23% of the average number of suicides comparing to the base age group 1(20~29). In a summary, age group1 has the highest risk of suicides and then age group3, age group 2, and age group 4(60+ ) are following in order. The following plot shows this result. These fixed effect coefficients are the same for all observations regardless of the 13 health regions.
library(effects)
plot(allEffects(m3_resa, default.levels=50), ask=FALSE)
The assumption of Poisson distribution is that the variance is equal to the constant mean. So the residuals have the same constant homogeneous assumption. If the data fits to a Poisson distribution, then dispersion should be 1. However, dispersion can be differ from 1 when there is a random effect or there are clusters of observations.
plot(residuals(m3_resa) ~ jitter(fitted(m3_resa),0.2),
xlab="Fitted Values",ylab="Residuals")
abline(h=0, lty=2)
resP = residuals(m3_resa, type="pearson")
dispersion = sum(resP^2)/(103-4)
dispersion
## [1] 1.661781
The residual plot shows the nonhomogeneous variace property of the Irish Suicide data. The dispersion calculaton also returns 1.662, much greater than 1.
Therefore, further analyses such as examining over-dispersion, centering the variables, or transformation of variables can be considered for better model fitting without overdispersion problem.