Poisson Mixed Effect model of Irish Suicide Data

1. Introduction of Irish Suicide Data.

[http://www.nuigalway.ie/maths/jh/npmlreg/html/irlsuicide.html]
A data frame consists of 104 observations on the following 8 variables.
Region : 13 different health regions of Ireland.
ID : factor with levels 1 2 3 … 12 13 corresponding to Regions.
pop : a numeric vector giving the population size of each suppopulation group.
death :a numeric vector giving the total number of deaths of each suppopulation group.
sex : a factor for gender with levels 0 (female) and 1 (male).
age : a factor for age with levels 1 (0-29), 2 (30-39), 3 (40-59), 4(60+ years).
smr : a numeric vector with standardized mortality ratios (SMRs)
expected : a numeric vector with `expected’ number of cases obtained from a reference population (Ahlbom, 1993).

2. Mixed-Effect and Poisson distribution of Irish Suicide Data

The Poisson Mixed-Effect model is used for the analysis of Irish Suicide Data. First, This data has 2 level hierarchical data structure. The upper level of this data consists of 13 classified health regions of the Republic of Ireland and the lower level is comprised of 8 suppopulation groups with the combinaton of age and sex, standardized mortality ratio, pop, expected number of cases, and the number of death. Assuming that the random effects are differnet in each health regions and the fixed effects are identical for all groups, the random effects model is considered for this data. The following box plots reflects the possibility of the random effects from 13 different regions on the response variable ‘death’.

library(npmlreg)
library(car)
data(irlsuicide)
str(irlsuicide)

## 'data.frame':    104 obs. of  8 variables:
##  $ Region  : Factor w/ 13 levels "Cork ","Dublin ",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ ID      : Factor w/ 13 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ pop     : int  31923 31907 8517 8127 13527 12965 11958 8299 117575 114626 ...
##  $ death   : int  6 52 8 37 17 33 14 22 33 114 ...
##  $ sex     : Factor w/ 2 levels "0","1": 1 2 1 2 1 2 1 2 1 2 ...
##  $ age     : Factor w/ 4 levels "1","2","3","4": 1 1 2 2 3 3 4 4 1 1 ...
##  $ smr     : num  97.7 144 181.7 193.4 182.4 ...
##  $ expected: num  6.14 36.11 4.4 19.13 9.32 ...

irlsuicide$Region = with(irlsuicide, reorder(Region, death))
plot(death ~ Region, data=irlsuicide, pch=16,col="lightblue")

Second, the response variable death is the total counts of suicides in each subpopulation defined by age and gender combination over the 10 years (1989~1998). The Nature of count data is discrete and a skewed distribution. The shape of the distribution is checked by the plot below. The follwoing plot visualizes the frequency of each number of deaths using vertical lines. The numbers on top of the frequencies are the different values of death. Thus, there are 4 small suppopulation groups with 1 death, 5 with 2 deaths, 2 with 3 deaths, and so on. The shape of the response variable death is skewed to the right since death can’t be negative numbers. So Poisson distribution conforms to this data.

Therefore, the Poisson Mixed-Effect model is considered for this analysis.

(tab <- xtabs(~ death, data=irlsuicide))

## death
##   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18 
##   4   5   2   2   2   3   4   1   5   2   1   2   3   5   1   2   2   4 
##  19  21  22  25  26  27  30  31  33  34  35  37  38  42  45  48  52  55 
##   1   2   3   2   1   2   1   1   2   2   1   2   1   2   2   2   1   1 
##  58  61  67  70  74  77  81  82  87  92  97 101 104 106 108 112 114 132 
##   3   2   1   1   1   1   1   1   1   2   1   1   1   1   1   1   1   1 
## 142 162 198 
##   1   1   1

x<- as.numeric(names(tab)) 
plot(x, tab, type="h", xlab= "Numberof deaths", ylab="Frequency")
points(x, tab, pch=16)

3. Descriptive Statistics of data

To show the structure of the data, the following discriptive table analyses are used.

data(irlsuicide)
scatterplotMatrix(~ death + Region + age + sex + pop + smr + expected, span =0.7, data = irlsuicide)

(xtabs(death~ Region, data = irlsuicide))

## Region
##       Cork      Dublin   EHB - Dub.     Galway        Lim.       Mid HB 
##         189         485         603          51          67         213 
## MWHB - Lim.        NEHB        NWHB SEHB - Wat.  SHB - Cork     Waterf. 
##         305         331         226         462         510          36 
##  WHB - Gal. 
##         346

(xtabs(death~ sex, data = irlsuicide))

## sex
##    0    1 
##  742 3082

(xtabs(death~ age, data= irlsuicide))

## age
##    1    2    3    4 
## 1292  777 1151  604

(xtabs(death~ Region+sex, data = irlsuicide))

##              sex
## Region          0   1
##   Cork         45 144
##   Dublin      127 358
##   EHB - Dub.  127 476
##   Galway       10  41
##   Lim.         11  56
##   Mid HB       36 177
##   MWHB - Lim.  54 251
##   NEHB         55 276
##   NWHB         37 189
##   SEHB - Wat.  77 385
##   SHB - Cork   97 413
##   Waterf.      10  26
##   WHB - Gal.   56 290

(xtabs(death~ Region+age, data= irlsuicide))

##              age
## Region          1   2   3   4
##   Cork         58  45  50  36
##   Dublin      147 131 143  64
##   EHB - Dub.  233 137 177  56
##   Galway       25  11  10   5
##   Lim.         26  20  17   4
##   Mid HB       67  35  70  41
##   MWHB - Lim.  93  58 102  52
##   NEHB        110  54  98  69
##   NWHB         76  41  75  34
##   SEHB - Wat. 160  92 134  76
##   SHB - Cork  179  88 152  91
##   Waterf.       8   8  15   5
##   WHB - Gal.  110  57 108  71

4. Poisson Random Effect Model Fitting

The first poisson mixed effect model has fixed effect term ‘sex’ and the random intercpet term and the second model has the fixted effect term ‘sex’ and the random slope term added to measure the effect on the explantory variable ‘sex’. The next 2 models are set up in the same way for checking the random effect on the slope of ‘age’ variable. Then the following 3 models are used to check the random effect from the Regional groups on the slope of ‘age’ and ‘sex’ respectively when there are explantory variables ‘sex’ and ‘age’.
The other explanatory variables ‘smr’(with standardized mortality ratios) and ‘expected’(expected number of cases obtained from a reference population) are not considered for model fitting since ‘smr’ is just the standardized ratio of death to expected and so they are highly correlated to each other.

library(lme4)
library(Zelig)

m1_res <-glmer(death ~sex +(1|Region), family= poisson, data = irlsuicide)
m2_res <-glmer(death ~sex +(sex|Region), family= poisson, data = irlsuicide)


m1_rea <-glmer(death ~ age + (1|Region), family= poisson, data = irlsuicide)
m2_rea <-glmer(death ~ age + (age|Region), family=poisson, data = irlsuicide)


m1_resa <-glmer(death ~sex + age + (1|Region), family= poisson, data = irlsuicide)
m2_resa <-glmer(death ~sex + age + (sex|Region), family= poisson, data = irlsuicide)
m3_resa <-glmer(death ~sex + age + (age|Region), family= poisson, data = irlsuicide)

Then anova() uses Maximum Likelihood method on these models and it returns the chi-squre-based p-value for the significance of the random effect on the slope of ‘sex’. The p-value is 0.03009 and which implies that the 13 regions’s random effect on the response variable ‘death’ and the explanatory variable ‘sex’ is sgnificant. The second Anova returns the p-value 0.0002305 which is much lower than the p-value for the effect on ‘sex’. The third Anova shows that the random effect on the slope of ‘age’ is most significant with the p-value 0.0008843. This model shows the lowest deviance value(maximum likelihoood estimation:779.64) as the best model.

anova(m1_res, m2_res)

## Data: irlsuicide
## Models:
## m1_res: death ~ sex + (1 | Region)
## m2_res: death ~ sex + (sex | Region)
##        Df  AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## m1_res  3 1146 1153.9 -570.00     1140                           
## m2_res  5 1143 1156.2 -566.49     1133 7.0073      2    0.03009 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(m1_rea, m2_rea)

## Data: irlsuicide
## Models:
## m1_rea: death ~ age + (1 | Region)
## m2_rea: death ~ age + (age | Region)
##        Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## m1_rea  5 2359.4 2372.7 -1174.7   2349.4                             
## m2_rea 14 2345.8 2382.8 -1158.9   2317.8 31.632      9  0.0002305 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(m1_resa, m2_resa, m3_resa)

## Data: irlsuicide
## Models:
## m1_resa: death ~ sex + age + (1 | Region)
## m2_resa: death ~ sex + age + (sex | Region)
## m3_resa: death ~ sex + age + (age | Region)
##         Df    AIC    BIC  logLik deviance   Chisq Chi Df Pr(>Chisq)    
## m1_resa  6 823.27 839.14 -405.64   811.27                              
## m2_resa  8 820.26 841.42 -402.13   804.26  7.0073      2  0.0300869 *  
## m3_resa 15 809.64 849.30 -389.82   779.64 24.6247      7  0.0008842 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5. Fixed Effect Terms of Poisson Model

The next step is to investigate the fixed effect terms. The general Poisson model with no regional random effect for complete pooling is compared to the poisson mixed effect model for partial pooling of observations within the same regions. The results shows that the random effects coming from regional differeces changes the slopes of the explanatory variable ‘age’ groups but not the slope of ‘sex’. This results exactly conform to the above random effect mainly on the varialbe ‘age’, but not on ‘sex’.

library(stargazer)
cpooling <-glm(death~ sex+age, family=poisson, data = irlsuicide)
m3_resa <-glmer(death ~sex + age +(age|Region), family= poisson, data = irlsuicide)

stargazer(cpooling, m3_resa, type = "html")


	Dependent variable:

	death
	Poisson	generalized linear
		mixed-effects
	(1)	(2)

sex1	1.424^***	1.424^***
	(0.041)	(0.041)

age2	-0.509^***	-0.493^***
	(0.045)	(0.075)

age3	-0.116^***	-0.114^*
	(0.041)	(0.061)

age4	-0.760^***	-0.807^***
	(0.049)	(0.114)

Constant	2.959^***	2.665^***
	(0.043)	(0.248)


Observations	104	104
Log Likelihood	-1,160.509	-389.819
Akaike Inf. Crit.	2,331.018	809.638
Bayesian Inf. Crit.		849.304

Note:	p<0.1; p<0.05; p<0.01

The standard link function of Poisson model is the natural logarithm. So from the fitted Poisson mixed effect model, E(y) = exp(2.666+ 1.424 x sex1 - 0.493 x age2 -0.114 x age3- 0.807x age4) and all estimated parameters for the explanatory variables are statistically significant at alpha level 5%.
Which implies that the average number of suicides of the female group(sex=0) with age between 0-29 for the 10 years is exp(2.666) = 14.3823. The estimated parameter for sex is 1.424. So exp(1.424) = 4.1537 means that male group has approximately 4 times higher average number of suicides than femal group for the 10 years. The estimated parameter for age groups are -0.493, -0.114, -0.807 and so The age group 2(30~39) has about 61.08% of the average number of suicides comaring to the base age group 1(20~29) and the age group3(40~59) has about 89.23% of the average number of suicides comparing to the base age group 1(20~29). In a summary, age group1 has the highest risk of suicides and then age group3, age group 2, and age group 4(60+ ) are following in order. The following plot shows this result. These fixed effect coefficients are the same for all observations regardless of the 13 health regions.

library(effects)
plot(allEffects(m3_resa, default.levels=50), ask=FALSE)

6. Residual plot and fitted curve

The assumption of Poisson distribution is that the variance is equal to the constant mean. So the residuals have the same constant homogeneous assumption. If the data fits to a Poisson distribution, then dispersion should be 1. However, dispersion can be differ from 1 when there is a random effect or there are clusters of observations.

plot(residuals(m3_resa) ~ jitter(fitted(m3_resa),0.2),
xlab="Fitted Values",ylab="Residuals")
abline(h=0, lty=2)

resP = residuals(m3_resa, type="pearson")
dispersion = sum(resP^2)/(103-4)
dispersion

## [1] 1.661781

The residual plot shows the nonhomogeneous variace property of the Irish Suicide data. The dispersion calculaton also returns 1.662, much greater than 1.

Therefore, further analyses such as examining over-dispersion, centering the variables, or transformation of variables can be considered for better model fitting without overdispersion problem.