BIOS621 / 821 Session 5

Levi Waldron

loglinear regression part 2

Welcome and outline - session 5

Components of GLM

Motivating example: Choice of Distribution

Poisson model: the GLM

The systematic part of the GLM is: \[ log(\lambda_i) = \beta_0 + \beta_1 \textrm{RACE}_i + \beta_2 \textrm{TRT}_i + \beta_3 \textrm{ALCH}_i + \beta_4 \textrm{DRUG}_i \] Or alternatively: \[ \lambda_i = exp \left( \beta_0 + \beta_1 \textrm{RACE}_i + \beta_2 \textrm{TRT}_i + \beta_3 \textrm{ALCH}_i + \beta_4 \textrm{DRUG}_i \right) \]

The random part is (Recall the \(\lambda_i\) is both the mean and variance of a Poisson distribution): \[ y_i \sim Poisson(\lambda_i) \]

Example: Risky Drug Use Behavior

needledat = read.csv("needle_sharing.csv")
needledat2 <- needledat[needledat$sex %in% c("M", "F") & 
    needledat$ethn %in% c("White", "AA", "Hispanic"), ]
summary(needledat2$shared_syr)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   3.068   0.000  60.000       4
var(needledat2$shared_syr, na.rm=TRUE)
## [1] 111.5815

Example: Risky Drug Use Behavior

Risky Drug Use Behavior: fitting a Poisson model

fit.pois <- glm(shared_syr ~ sex + ethn + homeless, 
           data=needledat2, family=poisson(link="log"))

Risky Drug Use Behavior: residuals plots

* Poisson model is definitely not a good fit.

When the Poisson model doesn’t fit

Negative binomial distribution

Negative binomial GLM

One way to parametrize a NB model is with a systematic part equivalent to the Poisson model: \[ log(\lambda_i) = \beta_0 + \beta_1 \textrm{RACE}_i + \beta_2 \textrm{TRT}_i + \beta_3 \textrm{ALCH}_i + \beta_4 \textrm{DRUG}_i \] Or: \[ \lambda_i = exp \left( \beta_0 + \beta_1 \textrm{RACE}_i + \beta_2 \textrm{TRT}_i + \beta_3 \textrm{ALCH}_i + \beta_4 \textrm{DRUG}_i \right) \]

And a random part: \[ y_i \sim NB(\lambda_i, \theta) \]

Negative Binomial Random Distribution

Compare Poisson vs. Negative Binomial

Negative Binomial Distribution has two parameters: # of trials n, and probability of success p

Risky drug behavior: Negative Binomial Regression

library(MASS)
fit.negbin <- glm.nb(shared_syr ~ sex + ethn + homeless, 
                     data=needledat2)
summary(fit.negbin)
## 
## Call:
## glm.nb(formula = shared_syr ~ sex + ethn + homeless, data = needledat2, 
##     init.theta = 0.07743871374, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.8801  -0.7787  -0.6895  -0.5748   1.5675  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    0.4641     0.8559   0.542   0.5876  
## sexM          -1.0148     0.8294  -1.224   0.2211  
## ethnHispanic   1.3424     1.3201   1.017   0.3092  
## ethnWhite      0.2429     0.7765   0.313   0.7544  
## homelessyes    1.6445     0.7073   2.325   0.0201 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(0.0774) family taken to be 1)
## 
##     Null deviance: 62.365  on 114  degrees of freedom
## Residual deviance: 56.232  on 110  degrees of freedom
##   (6 observations deleted due to missingness)
## AIC: 306.26
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  0.0774 
##           Std. Err.:  0.0184 
## 
##  2 x log-likelihood:  -294.2550

Likelihood ratio test

Recall from class 2 the Deviance:

\(\Delta (\textrm{D}) = -2 * \Delta (\textrm{log likelihood})\)

And recall the difference in deviance under \(H_0\) (no improvement in fit) is chi-square distributed, with df equal to the difference in df of the two models:

(ll.negbin <- logLik(fit.negbin))
## 'log Lik.' -147.1277 (df=6)
(ll.pois <- logLik(fit.pois))
## 'log Lik.' -730.0133 (df=5)
pchisq(2 * (ll.negbin - ll.pois), df=1, lower.tail=FALSE)
## 'log Lik.' 1.675949e-255 (df=6)

Risky Drug Use Behavior: NB regression residuals plots

Zero Inflation

Poisson Distribution with Zero Inflation

Risky drug behavior: Zero-inflated Poisson regression

library(pscl)
fit.ZIpois <- zeroinfl(shared_syr ~ sex+ethn+homeless, 
                        dist="poisson",data=needledat2)

Zero-inflated Poisson regression - the model

summary(fit.ZIpois)
## 
## Call:
## zeroinfl(formula = shared_syr ~ sex + ethn + homeless, data = needledat2, 
##     dist = "poisson")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -1.0761 -0.5784 -0.4030 -0.3341 10.6835 
## 
## Count model coefficients (poisson with log link):
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)    3.2169     0.1796  17.909  < 2e-16 ***
## sexM          -1.4725     0.1442 -10.212  < 2e-16 ***
## ethnHispanic  -0.1525     0.1576  -0.968 0.333223    
## ethnWhite     -0.5236     0.1464  -3.577 0.000347 ***
## homelessyes    1.2034     0.1455   8.268  < 2e-16 ***
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   2.06262    0.65227   3.162  0.00157 **
## sexM         -0.05067    0.58252  -0.087  0.93068   
## ethnHispanic -1.76120    0.81177  -2.170  0.03004 * 
## ethnWhite    -0.50187    0.56919  -0.882  0.37792   
## homelessyes  -0.53013    0.48108  -1.102  0.27048   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Number of iterations in BFGS optimization: 18 
## Log-likelihood: -299.8 on 10 Df

Risky drug behavior: Zero-inflated Negative Binomial regression

fit.ZInegbin <- zeroinfl(shared_syr ~ sex+ethn+homeless, 
                        dist="negbin", data=needledat2)

Zero-inflated Negative Binomial regression - model 1

summary(fit.ZInegbin)
## 
## Call:
## zeroinfl(formula = shared_syr ~ sex + ethn + homeless, data = needledat2, 
##     dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.5401 -0.3255 -0.2715 -0.1926  5.1489 
## 
## Count model coefficients (negbin with log link):
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    2.8401     1.1845   2.398  0.01649 * 
## sexM          -2.2278     0.9350  -2.382  0.01720 * 
## ethnHispanic  -0.4116     0.9832  -0.419  0.67545   
## ethnWhite     -0.4294     0.8647  -0.497  0.61949   
## homelessyes    1.9461     0.7103   2.740  0.00615 **
## Log(theta)    -1.1972     0.5159  -2.320  0.02032 * 
## 
## Zero-inflation model coefficients (binomial with logit link):
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)    1.6863     0.8466   1.992   0.0464 *
## sexM          -0.9919     0.8016  -1.237   0.2159  
## ethnHispanic -11.3556   112.8675  -0.101   0.9199  
## ethnWhite     -0.7452     0.7304  -1.020   0.3076  
## homelessyes    0.3555     0.7397   0.481   0.6308  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 0.302 
## Number of iterations in BFGS optimization: 37 
## Log-likelihood: -142.8 on 11 Df

Zero-inflated Negative Binomial regression - simplified ZI model

needledat2$hiv <- factor(ifelse(needledat2$hivstat==0,
                                "negative","positive"))
fit.ZInb2<-zeroinfl(shared_syr~sex+ethn+homeless+hiv|hiv, 
                        dist="negbin", data=needledat2)

Zero-inflated Negative Binomial regression - model 2

summary(fit.ZInb2)
## 
## Call:
## zeroinfl(formula = shared_syr ~ sex + ethn + homeless + hiv | hiv, 
##     data = needledat2, dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4419 -0.3295 -0.3206 -0.3015  6.0894 
## 
## Count model coefficients (negbin with log link):
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept)    3.7521     1.2267   3.059  0.00222 **
## sexM          -1.8727     0.7635  -2.453  0.01418 * 
## ethnHispanic  -1.2466     0.9693  -1.286  0.19841   
## ethnWhite     -1.2869     0.8436  -1.526  0.12712   
## homelessyes    0.9184     0.6822   1.346  0.17827   
## hivpositive    1.7342     0.8175   2.121  0.03388 * 
## Log(theta)    -0.4561     0.5337  -0.854  0.39287   
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   1.0418     0.3515   2.964  0.00304 **
## hivpositive  -0.7252     0.7342  -0.988  0.32327   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 0.6338 
## Number of iterations in BFGS optimization: 65 
## Log-likelihood: -127.9 on 9 Df

Intercept-only zero-inflation model

fit.ZInb3 <- zeroinfl(shared_syr~sex+ethn+homeless|1, 
                        dist="negbin", data=needledat2)
summary(fit.ZInb3)
## 
## Call:
## zeroinfl(formula = shared_syr ~ sex + ethn + homeless | 1, data = needledat2, 
##     dist = "negbin")
## 
## Pearson residuals:
##     Min      1Q  Median      3Q     Max 
## -0.3159 -0.3123 -0.3040 -0.2953  5.2941 
## 
## Count model coefficients (negbin with log link):
##              Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   2.08551    1.42665   1.462   0.1438  
## sexM         -1.43812    0.89188  -1.612   0.1069  
## ethnHispanic  0.48126    1.16639   0.413   0.6799  
## ethnWhite    -0.07421    0.81066  -0.092   0.9271  
## homelessyes   1.62076    0.67705   2.394   0.0167 *
## Log(theta)   -1.12533    0.89365  -1.259   0.2079  
## 
## Zero-inflation model coefficients (binomial with logit link):
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.5211     0.7599   0.686    0.493
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
## 
## Theta = 0.3245 
## Number of iterations in BFGS optimization: 37 
## Log-likelihood: -146.8 on 7 Df

Residuals vs. fitted values

I invisibly define functions plotpanel1 and plotpanel2 that will work for all types of models (see .R or .Rmd file for functions). These use Pearson residuals.

Quantile-quantile plots for residuals

still over-dispersed - ideas?

Inference from the different models

Dependent variable:
shared_syr
Poisson negative zero-inflated
binomial count data
(1) (2) (3) (4) (5)
sexM -0.925*** -1.015 -1.473*** -2.228** -1.438
(0.121) (0.829) (0.144) (0.935) (0.892)
ethnHispanic 1.465*** 1.342 -0.152 -0.412 0.481
(0.160) (1.320) (0.158) (0.983) (1.166)
ethnWhite 0.061 0.243 -0.524*** -0.429 -0.074
(0.133) (0.776) (0.146) (0.865) (0.811)
homelessyes 1.285*** 1.644** 1.203*** 1.946*** 1.621**
(0.127) (0.707) (0.146) (0.710) (0.677)
Constant 0.723*** 0.464 3.217*** 2.840** 2.086
(0.145) (0.856) (0.180) (1.184) (1.427)
Observations 115 115 115 115 115
Log Likelihood -730.013 -148.128 -299.790 -142.750 -146.768
theta 0.077*** (0.018)
Akaike Inf. Crit. 1,470.027 306.255
Note: p<0.1; p<0.05; p<0.01

Zero-inflated models are 3) Poisson, 4) Negative Binomial, and 5) Negative Binomial with intercept-only zero inflation model.

Example of plotting observed and predicted counts

Lab exercises

Resources for R and SAS