I. Data

str(data)
## 'data.frame':    219 obs. of  3 variables:
##  $ Gender  : Factor w/ 2 levels "1","2": 2 2 2 2 2 2 1 1 1 1 ...
##  $ MB_score: int  0 0 4 11 76 10 6 0 8 9 ...
##  $ REE     : num  932 1388 1772 986 1217 ...

II. Poisson Regression (MB_score ~ REE + Gender)

1. Regression

summary(m1<-glm(MB_score~REE+Gender,family="poisson",data=data))
## 
## Call:
## glm(formula = MB_score ~ REE + Gender, family = "poisson", data = data)
## 
## Deviance Residuals: 
##    Min      1Q  Median      3Q     Max  
## -7.869  -4.199  -2.227   1.243  17.330  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.380e+00  1.163e-01  37.653   <2e-16 ***
## REE         -9.115e-04  6.496e-05 -14.031   <2e-16 ***
## Gender2     -9.827e-02  4.232e-02  -2.322   0.0202 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 5331.9  on 218  degrees of freedom
## Residual deviance: 5034.4  on 216  degrees of freedom
## AIC: 5852.1
## 
## Number of Fisher Scoring iterations: 5
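
A quick informal check before the formal test in step 2 (this refit is a sketch added for illustration, not part of the original output) is to fit the same model with a quasi-Poisson family and inspect the estimated dispersion parameter; values far above 1 already hint at overdispersion.

m1_quasi <- glm(MB_score ~ REE + Gender, family = quasipoisson, data = data)  # same mean model, free dispersion
summary(m1_quasi)$dispersion                         # close to 1 if the Poisson assumption holds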

2. Test for over-dispersion

# Overdispersion test from the AER package (only applicable to Poisson GLMs)
dispersiontest(m1,trafo=1)
## 
##  Overdispersion test
## 
## data:  m1
## z = 5.2157, p-value = 9.157e-08
## alternative hypothesis: true alpha is greater than 0
## sample estimates:
##   alpha 
## 26.8795
# Comment: p is highly significant --> the data are overdispersed relative to the Poisson assumption
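
An equivalent hand-rolled cross-check (a sketch, not part of the original analysis) compares the sum of squared Pearson residuals with the residual degrees of freedom; a ratio well above 1 again indicates overdispersion.

sum(residuals(m1, type = "pearson")^2) / df.residual(m1)   # Pearson dispersion estimate; ~1 under Poisson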

3. Assumption check (plot the estimated variance against the mean)

plot(log(fitted(m1)), log((data$MB_score - fitted(m1))^2),
      xlab = expression(log(hat(mu))), ylab = expression(log((y - hat(mu))^2)), ylim = c(-5, 10), pch = 16, col = "grey")
abline(0,1)
text(2.6, -3, "Variance (y) is not proportional")
text(2.6, -4, "but larger than the mean (x)")

4. Check goodness-of-fit (does the model fit the data?)

deviance(m1)
## [1] 5034.371
df.residual(m1)
## [1] 216
# deviance and df.residual are very far apart --> poor fit
# Compare these two values formally with a chi-square test
with(m1, cbind(res.deviance = deviance, df = df.residual,
p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance  df p
## [1,]     5034.371 216 0
# Comment: p is highly significant ---> the model deviates strongly from the data

—>>> Poisson regression is definitely not appropriate <<<—

III. Negative Binomial Regression

1. Regression

summary(m2 <- glm.nb(MB_score ~ REE + Gender, data = data))
## 
## Call:
## glm.nb(formula = MB_score ~ REE + Gender, data = data, init.theta = 0.6179190548, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2189  -0.9848  -0.4579   0.2409   2.4114  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.4008470  0.6340909   6.940 3.91e-12 ***
## REE         -0.0009492  0.0003399  -2.792  0.00523 ** 
## Gender2     -0.0330037  0.2364783  -0.140  0.88901    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.6179) family taken to be 1)
## 
##     Null deviance: 271.04  on 218  degrees of freedom
## Residual deviance: 259.76  on 216  degrees of freedom
## AIC: 1694.7
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.6179 
##           Std. Err.:  0.0606 
## 
##  2 x log-likelihood:  -1686.7350

2. Assumption check

Negative binomial regression assumes the conditional variance is not equal to (in practice, exceeds) the conditional mean, unlike Poisson regression, where the two are forced to be equal.
This inequality is captured by an extra dispersion parameter (reported as Theta at the bottom of the glm.nb output above), which the Poisson model effectively constrains away. The Poisson model is therefore nested in the negative binomial model, so a likelihood ratio test comparing the two fits tests this assumption; the Poisson fit m1 from section II serves as the reference model.
Loglikelihood_dif<-2*(logLik(m2)-logLik(m1))
Loglikelihood_dif
## 'log Lik.' 4159.337 (df=4)
pchisq(Loglikelihood_dif, df = 1, lower.tail = F) # Compute the p-value for the likelihood ratio test
## 'log Lik.' 0 (df=4)
# Comment: This very large chi-square (very low p-value) strongly suggests the negative binomial model, which estimates the dispersion parameter, is more appropriate than the Poisson model
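
The same comparison can be cross-checked with lrtest() from the lmtest package (a sketch; note that because theta lies on the boundary of its parameter space, the reported p-value is conservative and is sometimes halved).

library(lmtest)
lrtest(m1, m2)                                       # likelihood ratio test: Poisson vs negative binomial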

3. Check goodness-of-fit (does the model fit the data?)

deviance(m2)
## [1] 259.7574
df.residual(m2)
## [1] 216
# deviance and df.residual are of similar magnitude --> the model may fit
# Compare these two values formally with a chi-square test
with(m2, cbind(res.deviance = deviance, df = df.residual,
p = pchisq(deviance, df.residual, lower.tail=FALSE)))
##      res.deviance  df          p
## [1,]     259.7574 216 0.02224544
# Comment: p is only marginally significant ---> the remaining lack of fit is small
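
Since both models are fitted to the same data, their AICs (5852.1 for the Poisson fit, 1694.7 for the negative binomial fit, as reported above) can also be compared directly; a one-line sketch:

AIC(m1, m2)                                          # the negative binomial model has a far lower AIC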

4. Compute incidence rate ratios (IRR)

(est <- cbind(Estimate = coef(m2), confint(m2)))
## Waiting for profiling to be done...
##                  Estimate        2.5 %        97.5 %
## (Intercept)  4.4008470180  3.236266598  5.6025507155
## REE         -0.0009491715 -0.001596397 -0.0003007758
## Gender2     -0.0330036864 -0.464153232  0.3880722325
exp(est)                                            # Compute IRR
##               Estimate      2.5 %      97.5 %
## (Intercept) 81.5198882 25.4385718 271.1170687
## REE          0.9990513  0.9984049   0.9996993
## Gender2      0.9675350  0.6286672   1.4741363
# Comment 1: For each one-unit increase in REE, the expected MB_score decreases by about 0.1% (IRR ≈ 0.999)
# Comment 2: There is no evidence of a gender effect on MB_score (the IRR confidence interval includes 1)
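
Because a one-unit change in REE is tiny on the original scale, it can be easier to report the IRR per 100-unit increase in REE (a sketch, assuming REE is kept on its original measurement scale):

exp(100 * est["REE", ])                              # IRR and 95% CI per 100-unit increase (about a 9% decrease)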

5. Compute predicted values

newdata2 <- data.frame(
     REE = rep(seq(from = min(data$REE), to = max(data$REE), length.out = 100), 2),
     Gender = factor(rep(1:2, each = 100), levels = 1:2, labels =levels(data$Gender)))
newdata2 <- cbind(newdata2, predict(m2, newdata2, type = "link", se.fit=TRUE))
newdata2 <- within(newdata2, {
     MB_score <- exp(fit)
     LL <- exp(fit - 1.96 * se.fit)
     UL <- exp(fit + 1.96 * se.fit)
 })                                                  # Compute predicted values and 95% confidence bands

ggplot(newdata2, aes(REE, MB_score)) +
     geom_ribbon(aes(ymin = LL, ymax = UL, fill = Gender), alpha = .25) +
     geom_line(aes(colour = Gender), size = 2) +
     labs(x = "REE", y = "Predicted MB_score")       # Plot the predictions with ggplot2

# Comment 1: Higher REE is associated with lower MB_score
# Comment 2: No gender effect
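
As a quick numeric illustration of the same predictions (a sketch; the exact values depend on the data), the expected MB_score at the median REE for each gender can be obtained directly with predict():

new_obs <- data.frame(REE = median(data$REE),
                      Gender = factor(levels(data$Gender), levels = levels(data$Gender)))
predict(m2, newdata = new_obs, type = "response")    # predicted MB_score at median REE, by gender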

IV. Things to consider

1. Is there a more appropriate model using REE to predict MB_score?

2. If MB_score is high when REE is either high or low, and low at intermediate REE, what model should be tested? (A model with a non-linear term in REE is one candidate; see the sketch below.)
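
For question 2, a U-shaped (or inverted-U) relationship would call for a non-linear term in REE. A minimal sketch, reusing glm.nb from MASS and adding a quadratic term, with AIC used to judge whether it improves on m2:

library(MASS)
m3 <- glm.nb(MB_score ~ poly(REE, 2) + Gender, data = data)   # quadratic effect of REE
summary(m3)
AIC(m2, m3)                                          # does the quadratic term improve the fit?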