This example will cover the use of R functions for fitting count data models to complex survey data and to aggregate data at the county level. Specifically, we focus on the Poisson and Negative Binomial models to individual level survey data as well as for aggregate data.

For this example I am using 2016 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART county data. Link

We will also use data from the CDC Compressed mortality file on mortality in US counties.

Poisson model for counts

For the Poisson model, you observe some count of events. The Poisson distribution has only one parameter, \((\lambda)\), which is the average count (y).

The Poisson GLM assumes: \[Y \sim Poisson (\lambda),\text{ } E(Y) = \lambda,\text{ } var(Y) = \lambda\] \[\lambda = log(\eta)\]

\[\eta = \beta_0 + \beta_1 x_1+ log(n)\] Which is the log linear model for a count, with the potential for adding an offset term to them model. We have several ways of modeling the Poisson count:

Pure count If each area has the same risk set. This means each observation is exposed to the “process” for the same amount of time, this is the exposure level. If observations do not all have the same level of exposure, then we may express the count as a Rate. If we do this, we need to include an offset term in your model to incorporate unequal risk. I.e. people who are exposed to a disease for a year are more likely to experience the disease than someone who is exposed for a week.

\[log(y)= X' \beta + log(n)\]

, where n is the period of risk. This is called the offset term in the model.

Standardized ratio incorporate differential exposure as an expected count

\[log(y)= X' \beta + log(E)\].

In Epidemiology, the ratio \(\frac{y}{E}\) is referred to as the standardized mortality (or incidence) ratio. In order to get the expected counts for an area, typically the population of the area is multiplied by the “global rate”. For US county mortality, we would multiply the population of each county by the US mortality rate.

\[E = r * n\] \[r = \frac{\sum y}{\sum n}\]

Interpreting the model parameters

All interpretation of parameters is done on a log scale, so

\(exp(\beta) = \text{% change in the mean}\)

, or % change in the rate, for a 1 unit change in X. All testing is done in the same manner.

#load brfss
library(car)
library(stargazer)
library(survey)
library(sjPlot)
library(ggplot2)
load("D:/GoogleDrive/classes/dem7283/class18/data/brfss16_mmsa.Rdata")

#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss16m)
head(nams, n=10)
##  [1] "DISPCODE" "HHADULT"  "GENHLTH"  "PHYSHLTH" "MENTHLTH" "POORHLTH"
##  [7] "HLTHPLN1" "PERSDOC2" "MEDCOST"  "CHECKUP1"
#we see some names are lower case, some are upper and some have a little _ 
#in the first position. This is a nightmare.
newnames<-tolower(gsub(pattern = "_",replacement =  "",x =  nams))
names(brfss16m)<-newnames

Our outcome here is the number of days the respondent reported poor physical health in the past month: Q: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?

brfss16m$healthdays<-recode(brfss16m$physhlth, recodes = "88=0; 77=NA; 99=NA")
hist(brfss16m$healthdays)

summary(brfss16m$healthdays)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.000   0.000   4.103   3.000  30.000    4922

Other variables:

#brfss16m$badhealth<-ifelse(brfss16m$genhlth %in% c(4,5),1,0)
brfss16m$badhealth<-recode(brfss16m$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss16m$black<-recode(brfss16m$racegr3, recodes="2=1; 9=NA; else=0")
brfss16m$white<-recode(brfss16m$racegr3, recodes="1=1; 9=NA; else=0")
brfss16m$other<-recode(brfss16m$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss16m$hispanic<-recode(brfss16m$racegr3, recodes="5=1; 9=NA; else=0")

brfss16m$race_eth<-recode(brfss16m$racegr3, 
recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
as.factor.result = T)
brfss16m$race_eth<-relevel(brfss16m$race_eth, ref = "nhwhite")

#insurance
brfss16m$ins<-recode(brfss16m$hlthpln1, recodes ="7:9=NA; 1=1;2=0")

#income grouping
brfss16m$inc<-ifelse(brfss16m$incomg==9, NA, brfss16m$incomg)

#education level
brfss16m$educ<-recode(brfss16m$educa,
recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
as.factor.result=T)
brfss16m$educ<-relevel(brfss16m$educ, ref='2hsgrad')

#employment
brfss16m$employ<-recode(brfss16m$employ1,
recodes="1:2='Employed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA",
as.factor.result=T)
brfss16m$employ<-relevel(brfss16m$employ, ref='Employed')

#marital status
brfss16m$marst<-recode(brfss16m$marital,
recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
as.factor.result=T)
brfss16m$marst<-relevel(brfss16m$marst, ref='married')

#Age cut into intervals
brfss16m$agec<-cut(brfss16m$age80, breaks=c(0,24,39,59,79,99))

#BMI, in the brfss16ma the bmi variable has 2 implied decimal places,
#so we must divide by 100 to get real bmi's

brfss16m$bmi<-brfss16m$bmi5/100

#smoking currently
brfss16m$smoke<-recode(brfss16m$smoker3, 
recodes="1:2='Current'; 3='Former';4='NeverSmoked'; else=NA", 
as.factor.result=T)
brfss16m$smoke<-relevel(brfss16m$smoke, ref = "NeverSmoked")

Analysis

First, we will subset our data to have complete cases for our variables in our model and make our survey design object

#Here I keep complete cases on my key variables,
#just for speed (the suvey procedures can run for a long time)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
sub<-brfss16m%>%
  select(healthdays, mmsaname, bmi,
         agec,race_eth, marst, educ,white, black, hispanic,
         other, smoke, ins, mmsawt, ststr) %>%
  filter( complete.cases(.))

#First we tell R our survey design
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~1, strata=~ststr, weights=~mmsawt, data =sub )
#OR THE BRFSS, R GAVE ME A WARNING AND I NEEDED TO ADD:
#YOU MAY NOT NEED TO DO THIS!!!!
#First we tell R our survey design
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~1, strata=~ststr, 
               weights=~mmsawt,
               data = brfss16m[is.na(brfss16m$mmsawt)==F,])

Poisson regression example

To fit a Poisson GLM to survey data in R, we use the svyglm function in the survey library.

#First I do some simple descriptives
svyhist(~healthdays, des)

svyby(~healthdays, ~race_eth+educ, des, svymean, na.rm=T)
##                           race_eth     educ healthdays         se
## nhwhite.2hsgrad            nhwhite  2hsgrad   4.365827 0.07420115
## hispanic.2hsgrad          hispanic  2hsgrad   3.206603 0.15492460
## nh black.2hsgrad          nh black  2hsgrad   4.113492 0.16816281
## nh multirace.2hsgrad  nh multirace  2hsgrad   5.862313 0.63475001
## nh other.2hsgrad          nh other  2hsgrad   3.007511 0.28359931
## nhwhite.0Prim              nhwhite    0Prim   9.007095 0.51999873
## hispanic.0Prim            hispanic    0Prim   4.912313 0.26856336
## nh black.0Prim            nh black    0Prim   6.594188 0.78846956
## nh multirace.0Prim    nh multirace    0Prim  10.309898 2.32094212
## nh other.0Prim            nh other    0Prim   9.846208 1.81204689
## nhwhite.1somehs            nhwhite  1somehs   6.987239 0.25016489
## hispanic.1somehs          hispanic  1somehs   4.631390 0.27979154
## nh black.1somehs          nh black  1somehs   6.767867 0.41691489
## nh multirace.1somehs  nh multirace  1somehs  11.448505 3.48273874
## nh other.1somehs          nh other  1somehs   5.927376 1.23162936
## nhwhite.3somecol           nhwhite 3somecol   4.040074 0.07162744
## hispanic.3somecol         hispanic 3somecol   3.580861 0.16896535
## nh black.3somecol         nh black 3somecol   3.443287 0.16556281
## nh multirace.3somecol nh multirace 3somecol   5.749841 0.50997327
## nh other.3somecol         nh other 3somecol   2.110460 0.20104799
## nhwhite.4colgrad           nhwhite 4colgrad   2.382518 0.03683385
## hispanic.4colgrad         hispanic 4colgrad   2.499118 0.12417816
## nh black.4colgrad         nh black 4colgrad   2.221732 0.10449605
## nh multirace.4colgrad nh multirace 4colgrad   3.961559 0.37102485
## nh other.4colgrad         nh other 4colgrad   1.730583 0.11003042
svyby(~healthdays, ~agec, des, svymean, na.rm=T)
##            agec healthdays         se
## (0,24]   (0,24]   2.063722 0.07168843
## (24,39] (24,39]   2.486118 0.05289680
## (39,59] (39,59]   4.009304 0.05757186
## (59,79] (59,79]   5.054443 0.07093562
## (79,99] (79,99]   6.099186 0.17453877
#Poisson glm fit to survey data
fit1<-svyglm(healthdays~factor(race_eth)+factor(educ)+factor(agec), design=des, family=poisson)
summary(fit1)
## 
## Call:
## svyglm(formula = healthdays ~ factor(race_eth) + factor(educ) + 
##     factor(agec), design = des, family = poisson)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = brfss16m[is.na(brfss16m$mmsawt) == 
##     F, ])
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   0.79842    0.03724  21.437  < 2e-16 ***
## factor(race_eth)hispanic     -0.11224    0.02850  -3.938 8.23e-05 ***
## factor(race_eth)nh black     -0.01705    0.02627  -0.649  0.51638    
## factor(race_eth)nh multirace  0.52808    0.06696   7.886 3.12e-15 ***
## factor(race_eth)nh other     -0.23628    0.04726  -5.000 5.74e-07 ***
## factor(educ)0Prim             0.34526    0.04464   7.734 1.04e-14 ***
## factor(educ)1somehs           0.41598    0.03251  12.797  < 2e-16 ***
## factor(educ)3somecol         -0.05704    0.02147  -2.657  0.00788 ** 
## factor(educ)4colgrad         -0.57466    0.02075 -27.692  < 2e-16 ***
## factor(agec)(24,39]           0.27266    0.04126   6.608 3.91e-11 ***
## factor(agec)(39,59]           0.73292    0.03801  19.281  < 2e-16 ***
## factor(agec)(59,79]           0.93081    0.03781  24.619  < 2e-16 ***
## factor(agec)(79,99]           1.04760    0.04552  23.013  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 16.37206)
## 
## Number of Fisher Scoring iterations: 6
#here are the poisson model "risk ratios", which just show the change in the mean
round(exp(summary(fit1)$coef[-1,1]), 3)
##     factor(race_eth)hispanic     factor(race_eth)nh black 
##                        0.894                        0.983 
## factor(race_eth)nh multirace     factor(race_eth)nh other 
##                        1.696                        0.790 
##            factor(educ)0Prim          factor(educ)1somehs 
##                        1.412                        1.516 
##         factor(educ)3somecol         factor(educ)4colgrad 
##                        0.945                        0.563 
##          factor(agec)(24,39]          factor(agec)(39,59] 
##                        1.313                        2.081 
##          factor(agec)(59,79]          factor(agec)(79,99] 
##                        2.537                        2.851

So, we interpret this as follows. NH Multirace respondents had higher mean counts of poor health days than NH Whites, while NH Others had a lower mean count of poor health days. As education increases, the mean count of poor health days decreases. Also, as age increase, the mean count of poor health days increases.

In terms of the risk ratios \(exp(\beta)\) NH multirace respondents had 32% higher number of days when their health was poor, compared to NH whites. In practice, this translates into : NA days for NH multirace, and NA for NH whites.

Overdispersion

When using the Poisson GLM, you often run into overdispersion * What’s overdispersion? For the Poisson distribution, the mean and the variance are functions of one another (variance = mean for Poisson). So when you have more variability than you expect in your data, you have overdispersion. This basically says your data do not fit your model, and is a problem because overdispersion leads to standard errors for our model parameters that are too small. But, we can fit other models that do not make such assumptions, or allow there to be more variability.

An easy check on this is to compare the residual deviance to the residual degrees of freedom. They ratio should be 1 if the model fits the data.

NOTE

The svyglm() function includes a scaling term for overdispersion, so this is already taken into account. But if you have data that aren’t a complex survey, we can measure this ourselves using the residual deviance.

fit2<-glm(healthdays~factor(race_eth)+factor(educ)+factor(agec), data=brfss16m, family=poisson)
summary(fit2)
## 
## Call:
## glm(formula = healthdays ~ factor(race_eth) + factor(educ) + 
##     factor(agec), family = poisson, data = brfss16m)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4610  -3.0607  -2.3949  -0.5225  11.5766  
## 
## Coefficients:
##                               Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)                   0.801437   0.006104  131.289  < 2e-16 ***
## factor(race_eth)hispanic     -0.009393   0.003713   -2.530  0.01141 *  
## factor(race_eth)nh black      0.017711   0.003348    5.290 1.22e-07 ***
## factor(race_eth)nh multirace  0.474632   0.006741   70.407  < 2e-16 ***
## factor(race_eth)nh other     -0.018913   0.005905   -3.203  0.00136 ** 
## factor(educ)0Prim             0.325068   0.005713   56.899  < 2e-16 ***
## factor(educ)1somehs           0.395359   0.004091   96.639  < 2e-16 ***
## factor(educ)3somecol         -0.083288   0.002615  -31.854  < 2e-16 ***
## factor(educ)4colgrad         -0.573885   0.002670 -214.908  < 2e-16 ***
## factor(agec)(24,39]           0.323062   0.006652   48.567  < 2e-16 ***
## factor(agec)(39,59]           0.825983   0.006133  134.684  < 2e-16 ***
## factor(agec)(59,79]           0.943281   0.006092  154.846  < 2e-16 ***
## factor(agec)(79,99]           1.030676   0.006706  153.689  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 2901363  on 238812  degrees of freedom
## Residual deviance: 2743149  on 238800  degrees of freedom
##   (10198 observations deleted due to missingness)
## AIC: 3062226
## 
## Number of Fisher Scoring iterations: 6
scale<-sqrt(fit2$deviance/fit2$df.residual)
scale
## [1] 3.38928

The deviance can also be a test of model fit, using a chi square distribution, with degrees of freedom equal to the residual d.f. (n-p):

1-pchisq(fit2$deviance, df = fit2$df.residual)
## [1] 0

So, this p value is 0, which means the model does not fit the data.

Modeling Overdispersion via a Quasi distribution

For the Poisson , we can fit a “quasi” distribution that adds an extra parameter to allow the mean-variance relationship to not be constant. For Poisson we get:

\[var(Y) = \lambda * \phi\], instead of \[var(Y) = \lambda \]

This allows us to include a rough proxy for a dispersion parameter for the distribution. Naturally this is fixed at 1 for basic models, and estimated in the quasi models, we can look to see if is much bigger than 1. If overdispersion is present and not accounted for you could identify a relationship as being significant when it is not!

fit3<-glm(healthdays~factor(race_eth)+factor(educ)+factor(agec), data=brfss16m, family=quasipoisson)
summary(fit3)
## 
## Call:
## glm(formula = healthdays ~ factor(race_eth) + factor(educ) + 
##     factor(agec), family = quasipoisson, data = brfss16m)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -5.4610  -3.0607  -2.3949  -0.5225  11.5766  
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   0.801437   0.025229  31.767  < 2e-16 ***
## factor(race_eth)hispanic     -0.009393   0.015346  -0.612    0.540    
## factor(race_eth)nh black      0.017711   0.013838   1.280    0.201    
## factor(race_eth)nh multirace  0.474632   0.027861  17.036  < 2e-16 ***
## factor(race_eth)nh other     -0.018913   0.024404  -0.775    0.438    
## factor(educ)0Prim             0.325068   0.023612  13.767  < 2e-16 ***
## factor(educ)1somehs           0.395359   0.016908  23.383  < 2e-16 ***
## factor(educ)3somecol         -0.083288   0.010806  -7.708 1.29e-14 ***
## factor(educ)4colgrad         -0.573885   0.011036 -51.999  < 2e-16 ***
## factor(agec)(24,39]           0.323062   0.027491  11.751  < 2e-16 ***
## factor(agec)(39,59]           0.825983   0.025346  32.588  < 2e-16 ***
## factor(agec)(59,79]           0.943281   0.025177  37.466  < 2e-16 ***
## factor(agec)(79,99]           1.030676   0.027716  37.187  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 17.08095)
## 
##     Null deviance: 2901363  on 238812  degrees of freedom
## Residual deviance: 2743149  on 238800  degrees of freedom
##   (10198 observations deleted due to missingness)
## AIC: NA
## 
## Number of Fisher Scoring iterations: 6

Other count models - Negative binomial

  • Of course, we could just fit other distributional models to our data, popular choices are:

  • Negative binomial
  • Effectively adds a shape parameter to Poisson

\[Y \sim NB (\lambda, \lambda+\lambda^2/\theta),\text{ } E(Y) = \lambda,\text{ } var(Y) = \lambda+\lambda^2/\theta\] \[\lambda = log(\eta)\] \[\eta = \beta_0 + \beta_1 x_1+ log(n)\]

Now, R will not fit negative binomial models using survey design, so, we will fit them using sample weights only, then calculate the robust standard errors. We standardize the weights to equal the sample size, as opposed to the population size by dividing each person’s weight by the mean weight.

#First, I define a function to get the clustered, or robust standard errors.
#This function effectively controls for the within-strata homogeneity when
#calculateing the se's for the betas. 

#I stole this from: http://drewdimmery.com/robust-ses-in-r/
#and http://people.su.se/~ma/clustering.pdf
#I also added a correction to use this with the hurdle and zero-inflated models
#This is how stata gets robust se's

clx2 <-   function(fm, dfcw,  cluster){
    # R-codes (www.r-project.org) for computing
    # clustered-standard errors. Mahmood Arai, Jan 26, 2008.
    
    # The arguments of the function are:
    # fitted model, cluster1
    # You need to install libraries `sandwich' and `lmtest'
    
    # reweighting the var-cov matrix for the within model
    require(sandwich);require(lmtest)
    if(class(fm)=="zeroinfl"|class(fm)=="hurdle") {
    M <- length(unique(cluster))   
    N <- length(cluster)           
    K <- dim(fm$vcov)[1]        #here is the rank from the zero inflated fits             
    dfc <- (M/(M-1))*((N-1)/(N-K))  
    uj  <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum, na.rm=T));
    vcovCL <- dfc[1]*sandwich(fm, meat=crossprod(uj)/N)*dfcw #fix a length problem in dfc
    list(summary=coeftest(fm, vcovCL))}
    else if(class(fm)!="zeroinfl"){
    M <- length(unique(cluster))
    N <- length(cluster)
    K <- fm$rank
    dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
    uj <- apply(estfun(fm), 2, function(x) tapply(x, cluster, sum, na.rm=T));
    rcse.cov <- dfc * sandwich(fm, meat = crossprod(uj)/N)
    rcse.se <- coeftest(fm, rcse.cov)
    return(list( rcse.se))}
}

Now, i’ll illustrate that this works:

#Fit poisson, and compare it to the fit from the survey design
#Fit the Poisson GLM
sub$wts<-sub$mmsawt/mean(sub$mmsawt, na.rm=T)
fit.pois<-glm(healthdays~factor(race_eth)+factor(educ)+factor(agec),
              data=sub,
              weights=wts, family=poisson)

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(sandwich)
clx2(fit.pois,cluster =sub$ststr)
## Warning in if (class(fm) == "zeroinfl" | class(fm) == "hurdle") {: the
## condition has length > 1 and only the first element will be used
## Warning in if (class(fm) != "zeroinfl") {: the condition has length > 1 and
## only the first element will be used
## [[1]]
## 
## z test of coefficients:
## 
##                               Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept)                   0.817725   0.033340  24.5271 < 2.2e-16 ***
## factor(race_eth)hispanic     -0.090198   0.032177  -2.8032  0.005060 ** 
## factor(race_eth)nh black     -0.027048   0.028917  -0.9354  0.349595    
## factor(race_eth)nh multirace  0.507678   0.063901   7.9447 1.946e-15 ***
## factor(race_eth)nh other     -0.216965   0.069227  -3.1341  0.001724 ** 
## factor(educ)0Prim             0.373432   0.045712   8.1693 3.102e-16 ***
## factor(educ)1somehs           0.430531   0.039717  10.8401 < 2.2e-16 ***
## factor(educ)3somecol         -0.060751   0.029546  -2.0561  0.039770 *  
## factor(educ)4colgrad         -0.586394   0.033470 -17.5202 < 2.2e-16 ***
## factor(agec)(24,39]           0.274159   0.037137   7.3824 1.555e-13 ***
## factor(agec)(39,59]           0.724831   0.034624  20.9341 < 2.2e-16 ***
## factor(agec)(59,79]           0.922134   0.033949  27.1626 < 2.2e-16 ***
## factor(agec)(79,99]           1.023305   0.043268  23.6502 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fit.pois, vcov=vcovHC(fit.pois, type="HC1",cluster="ststr"))
## 
## z test of coefficients:
## 
##                               Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept)                   0.817725   0.039512  20.6958 < 2.2e-16 ***
## factor(race_eth)hispanic     -0.090198   0.029819  -3.0248  0.002488 ** 
## factor(race_eth)nh black     -0.027048   0.027789  -0.9733  0.330383    
## factor(race_eth)nh multirace  0.507678   0.070495   7.2017 5.949e-13 ***
## factor(race_eth)nh other     -0.216965   0.050194  -4.3225 1.543e-05 ***
## factor(educ)0Prim             0.373432   0.046253   8.0737 6.821e-16 ***
## factor(educ)1somehs           0.430531   0.034090  12.6293 < 2.2e-16 ***
## factor(educ)3somecol         -0.060751   0.022537  -2.6956  0.007026 ** 
## factor(educ)4colgrad         -0.586394   0.021783 -26.9196 < 2.2e-16 ***
## factor(agec)(24,39]           0.274159   0.043513   6.3006 2.965e-10 ***
## factor(agec)(39,59]           0.724831   0.040180  18.0396 < 2.2e-16 ***
## factor(agec)(59,79]           0.922134   0.039936  23.0903 < 2.2e-16 ***
## factor(agec)(79,99]           1.023305   0.047504  21.5417 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#same as survey!

summary(fit1)
## 
## Call:
## svyglm(formula = healthdays ~ factor(race_eth) + factor(educ) + 
##     factor(agec), design = des, family = poisson)
## 
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = brfss16m[is.na(brfss16m$mmsawt) == 
##     F, ])
## 
## Coefficients:
##                              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                   0.79842    0.03724  21.437  < 2e-16 ***
## factor(race_eth)hispanic     -0.11224    0.02850  -3.938 8.23e-05 ***
## factor(race_eth)nh black     -0.01705    0.02627  -0.649  0.51638    
## factor(race_eth)nh multirace  0.52808    0.06696   7.886 3.12e-15 ***
## factor(race_eth)nh other     -0.23628    0.04726  -5.000 5.74e-07 ***
## factor(educ)0Prim             0.34526    0.04464   7.734 1.04e-14 ***
## factor(educ)1somehs           0.41598    0.03251  12.797  < 2e-16 ***
## factor(educ)3somecol         -0.05704    0.02147  -2.657  0.00788 ** 
## factor(educ)4colgrad         -0.57466    0.02075 -27.692  < 2e-16 ***
## factor(agec)(24,39]           0.27266    0.04126   6.608 3.91e-11 ***
## factor(agec)(39,59]           0.73292    0.03801  19.281  < 2e-16 ***
## factor(agec)(59,79]           0.93081    0.03781  24.619  < 2e-16 ***
## factor(agec)(79,99]           1.04760    0.04552  23.013  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 16.37206)
## 
## Number of Fisher Scoring iterations: 6
#Fit the Negative Binomial GLM
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
fit.nb2<-glm.nb(healthdays~factor(race_eth)+factor(educ)+factor(agec),
              data=sub,
              weights=wts)
clx2(fit.nb2,cluster =sub$ststr)
## Warning in if (class(fm) == "zeroinfl" | class(fm) == "hurdle") {: the
## condition has length > 1 and only the first element will be used

## Warning in if (class(fm) == "zeroinfl" | class(fm) == "hurdle") {: the
## condition has length > 1 and only the first element will be used
## [[1]]
## 
## z test of coefficients:
## 
##                               Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept)                   0.827033   0.032587  25.3796 < 2.2e-16 ***
## factor(race_eth)hispanic     -0.046654   0.032776  -1.4234  0.154611    
## factor(race_eth)nh black     -0.045410   0.030626  -1.4827  0.138152    
## factor(race_eth)nh multirace  0.505841   0.053859   9.3919 < 2.2e-16 ***
## factor(race_eth)nh other     -0.250848   0.065963  -3.8029  0.000143 ***
## factor(educ)0Prim             0.351842   0.047721   7.3729 1.669e-13 ***
## factor(educ)1somehs           0.411224   0.041172   9.9880 < 2.2e-16 ***
## factor(educ)3somecol         -0.052103   0.031220  -1.6689  0.095132 .  
## factor(educ)4colgrad         -0.562438   0.035198 -15.9792 < 2.2e-16 ***
## factor(agec)(24,39]           0.270551   0.035607   7.5983 3.001e-14 ***
## factor(agec)(39,59]           0.686307   0.033953  20.2135 < 2.2e-16 ***
## factor(agec)(59,79]           0.904456   0.031943  28.3147 < 2.2e-16 ***
## factor(agec)(79,99]           1.043586   0.041398  25.2083 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fit.nb2, vcov=vcovHC(fit.nb2, type="HC1",cluster="ststr"))
## 
## z test of coefficients:
## 
##                               Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept)                   0.827033   0.038423  21.5246 < 2.2e-16 ***
## factor(race_eth)hispanic     -0.046654   0.031529  -1.4797   0.13896    
## factor(race_eth)nh black     -0.045410   0.029690  -1.5295   0.12615    
## factor(race_eth)nh multirace  0.505841   0.061123   8.2758 < 2.2e-16 ***
## factor(race_eth)nh other     -0.250848   0.048894  -5.1304 2.891e-07 ***
## factor(educ)0Prim             0.351842   0.051348   6.8521 7.279e-12 ***
## factor(educ)1somehs           0.411224   0.037323  11.0181 < 2.2e-16 ***
## factor(educ)3somecol         -0.052103   0.025270  -2.0619   0.03922 *  
## factor(educ)4colgrad         -0.562438   0.023666 -23.7660 < 2.2e-16 ***
## factor(agec)(24,39]           0.270551   0.042085   6.4287 1.287e-10 ***
## factor(agec)(39,59]           0.686307   0.038817  17.6806 < 2.2e-16 ***
## factor(agec)(59,79]           0.904456   0.038385  23.5625 < 2.2e-16 ***
## factor(agec)(79,99]           1.043586   0.046109  22.6332 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Aggregate data

Typically when we get data aggregate data, they have a spatial component, meaning they are aggregated to some geography

  • Tract, block, county, ZIP code, etc.

  • These are typically either raw counts

    • Count of deaths, crimes, births, people with some characteristic (eg poverty) > numerator
  • Hopefully with a population at risk > denominator, or rates that someone has calculated

    • Mortality rates from NCHS or some census rate

GLMs are a class of statistical models that link the mean of the distribution to a linear combination of covariates by some form of link function.

\[g(u) = g(E(Y)) = X ' \beta\]

Where g() is the link function. The link function for a binomial distribution is typically the logit, or:

\[g(u) = log(\frac{\pi}{1- \pi})\]

Modeling options

  • Inherently, we want to see what influences the rate in our areas

    • This is for aggregate data
  • So we want regression models for our area-based counts that properly model the independent variable

  • You have a few options: Binomial, Poisson, Normal, Negative binomial

#I subset to only US counties this time, for the sake of computational time.
library(dplyr)
library(tidycensus)
library(acs)
## Loading required package: stringr
## Loading required package: XML
## 
## Attaching package: 'acs'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:base':
## 
##     apply
library(lmtest)
library(sandwich)
githubURL <- "https://github.com/coreysparks/data/blob/master/cmf06_10_age_sex_std.Rdata?raw=true"
load(url(githubURL))

dim(cmf)
githubURL <- "https://github.com/coreysparks/data/blob/master/arf1516.Rdata?raw=true"
load(url(githubURL))

dim(arf15)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
#Gini coefficent for income inquality
gini<-get_acs(geography = "county", variables = "B19083_001", year = 2009 )
## Please note: `get_acs()` now defaults to a year or endyear of 2016.
usdata<-merge(x = arf15, y=cmf, by.x="f00004", by.y="County.Code")
library(tigris)
## To enable 
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
## 
## Attaching package: 'tigris'
## The following object is masked from 'package:graphics':
## 
##     plot
## Warning: '*' is not a valid FIPS code or state name/abbreviation
usco<-counties(state = "*",  year = 2010,cb = T )
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
library(ggplot2) #note you must do: devtools::install_github("tidyverse/ggplot2") to get geom_sf() geometries
usco<- st_as_sf(usco)
usco<-mutate(usco, geoid=paste(STATEFP, COUNTYFP, sep=""))
usco<-inner_join(usco, y=usdata, by=c("geoid"="f00004") ) 
usco<-inner_join(usco,y=gini, c("geoid"="GEOID") )

usco<-mutate(usco,pblack=100*( (f1391005 + f1391105)/ f1198405  ),
             phisp=100*( (f1392005+ f1392105)/ f1198405 ),
             prural = 100*( f1367900/ f0453000 ),
             pfemhh =  f0874600 ,
             unemp= f0679505,
             medhouseval=f1461306,
             popdensity=f1198405/(as.numeric(CENSUSAREA)/1000000),
             gini=estimate )
usco<-  filter(usco, !STATEFP %in% c("02", "15", "72"), is.na(Age.Adjusted.Rate)==F&is.na(pblack)==F&is.na(prural)==F&is.na(pfemhh)==F) 

usco<-  mutate( usco, 
                mortquant = cut( Age.Adjusted.Rate, breaks= quantile(Age.Adjusted.Rate, p=seq(0, 1, .2), na.rm=T)),
                pblackq=cut( pblack, breaks= quantile(pblack, p=seq(0, 1, .2), na.rm=T)), 
                giniq=cut(gini, breaks=quantile(gini, p=seq(0,1,.2), na.rm=T)))

usco %>%
  ggplot()+geom_sf(aes( fill=mortquant))+coord_sf(crs = 102009)+ scale_colour_brewer(palette = "Blues" )+scale_fill_brewer(palette = "Blues", na.value="grey")+guides(fill=guide_legend(title="Mortality Quartile"))

usco %>%
  ggplot()+geom_sf(aes( fill=pblackq))+coord_sf(crs = 102009)+ scale_colour_brewer(palette = "Blues" )+scale_fill_brewer(palette = "Blues", na.value="grey")+guides(fill=guide_legend(title="Percent Black Population Quartile"))

usco %>%
  ggplot()+geom_sf(aes( fill=giniq))+coord_sf(crs = 102009)+ scale_colour_brewer(palette = "Blues" )+scale_fill_brewer(palette = "Blues", na.value="grey")+guides(fill=guide_legend(title="Gini Coefficient Quartile"))

Binomial model for counts

To use the Binomial model, you have two options Your observations are 1/0, we have done this before. This says that each area has the event or not OR Your observations are a combination of y and n, where y is a count of some event which has happened to some portion of the population at risk, n. For either of these, you are modeling the probability of observing an event, \(\pi\) or really the logit of the probability: \[log(\frac{\pi}{1- \pi}) = X' \beta\] * The good thing is that you have already done this! * All interpretation of parameters is the same as logistic regression

\[exp(\beta)\]

  • All model testing is the same too!
usfitbin<-glm(cbind(Deaths, Population)~scale(pblack)+scale(phisp)+scale(prural)+scale(pfemhh)+scale(unemp)+scale(medhouseval)+scale(popdensity)+scale(gini) ,family=binomial,
            data=usco)

summary(usfitbin)
## 
## Call:
## glm(formula = cbind(Deaths, Population) ~ scale(pblack) + scale(phisp) + 
##     scale(prural) + scale(pfemhh) + scale(unemp) + scale(medhouseval) + 
##     scale(popdensity) + scale(gini), family = binomial, data = usco)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -84.916   -3.345    1.293    5.331   85.896  
## 
## Coefficients:
##                      Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)        -4.6614580  0.0004832 -9647.01   <2e-16 ***
## scale(pblack)      -0.1176577  0.0006187  -190.17   <2e-16 ***
## scale(phisp)       -0.0991019  0.0002986  -331.90   <2e-16 ***
## scale(prural)       0.0963676  0.0004370   220.50   <2e-16 ***
## scale(pfemhh)       0.1157800  0.0007022   164.87   <2e-16 ***
## scale(unemp)        0.0242070  0.0004342    55.76   <2e-16 ***
## scale(medhouseval) -0.0431336  0.0002387  -180.73   <2e-16 ***
## scale(popdensity)  -0.0051121  0.0001066   -47.97   <2e-16 ***
## scale(gini)         0.0659669  0.0003770   174.98   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 862832  on 3093  degrees of freedom
## Residual deviance: 401550  on 3085  degrees of freedom
## AIC: 429698
## 
## Number of Fisher Scoring iterations: 4
round(exp(coef(usfitbin)), 3)
##        (Intercept)      scale(pblack)       scale(phisp) 
##              0.009              0.889              0.906 
##      scale(prural)      scale(pfemhh)       scale(unemp) 
##              1.101              1.123              1.025 
## scale(medhouseval)  scale(popdensity)        scale(gini) 
##              0.958              0.995              1.068
coeftest(usfitbin, vcov.=vcovHC(usfitbin, type="HC1"))
## 
## z test of coefficients:
## 
##                      Estimate Std. Error   z value  Pr(>|z|)    
## (Intercept)        -4.6614580  0.0050820 -917.2412 < 2.2e-16 ***
## scale(pblack)      -0.1176577  0.0158820   -7.4082 1.280e-13 ***
## scale(phisp)       -0.0991019  0.0090527  -10.9472 < 2.2e-16 ***
## scale(prural)       0.0963676  0.0092186   10.4536 < 2.2e-16 ***
## scale(pfemhh)       0.1157800  0.0195351    5.9268 3.090e-09 ***
## scale(unemp)        0.0242070  0.0079075    3.0613  0.002204 ** 
## scale(medhouseval) -0.0431336  0.0064707   -6.6660 2.629e-11 ***
## scale(popdensity)  -0.0051121  0.0017827   -2.8676  0.004136 ** 
## scale(gini)         0.0659669  0.0099643    6.6203 3.584e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1-usfitbin$deviance/usfitbin$null.deviance
## [1] 0.5346143
#This should be the crude death rate, per 1,000 persons
1000 * mean(fitted(usfitbin))
## [1] 9.509964
hist(1000*fitted(usfitbin), main="US Crude Death Rate Distribution from Binomial Model")

Poisson model for counts

For the Poisson model, you observe some count of events in each area (crimes, deaths, etc), and you also have some population at risk in each area. This is generally not the same for each area, but could be, you want a Poisson rate \((\lambda)\), which is really the average count (y). We have several ways of modeling the Poisson count:

Pure count If each area has the same risk set, Rate, include an offset term in your model to incorporate unequal risk

\[log(y)= X' \beta + log(n)\]

, where n is the population at risk in each area. This is called the offset term in the model. Standardized ratio incorporate differential exposure as an expected count

\[log(y)= X' \beta + log(E)\].

Again, all interpretation of parameters is done on a log scale, so

\[exp(\beta) = \text{ % change in the mean}\] , or % change in the rate, for a 1 unit change in X. All testing is done in the same manner.

#Form the expected count
usco$E<-usco$Population * (sum(usco$Deaths)/sum(usco$Population))
fit.poi.us<-glm(Deaths~offset(log(E))+scale(pblack)+scale(phisp)+scale(prural)+scale(pfemhh)+scale(unemp)+scale(medhouseval)+scale(popdensity)+scale(gini) ,family=poisson,
            data=usco)
summary(fit.poi.us)
## 
## Call:
## glm(formula = Deaths ~ offset(log(E)) + scale(pblack) + scale(phisp) + 
##     scale(prural) + scale(pfemhh) + scale(unemp) + scale(medhouseval) + 
##     scale(popdensity) + scale(gini), family = poisson, data = usco)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -85.228   -3.359    1.302    5.354   86.305  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.1660918  0.0004808  345.43   <2e-16 ***
## scale(pblack)      -0.1175107  0.0006157 -190.85   <2e-16 ***
## scale(phisp)       -0.0990723  0.0002975 -333.05   <2e-16 ***
## scale(prural)       0.0962742  0.0004350  221.34   <2e-16 ***
## scale(pfemhh)       0.1155893  0.0006989  165.39   <2e-16 ***
## scale(unemp)        0.0242254  0.0004322   56.05   <2e-16 ***
## scale(medhouseval) -0.0431726  0.0002378 -181.53   <2e-16 ***
## scale(popdensity)  -0.0050917  0.0001062  -47.95   <2e-16 ***
## scale(gini)         0.0659110  0.0003754  175.58   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 869840  on 3093  degrees of freedom
## Residual deviance: 404890  on 3085  degrees of freedom
## AIC: 433069
## 
## Number of Fisher Scoring iterations: 4
lmtest::coeftest(fit.poi.us, vcov.=vcovHC(fit.poi.us, type="HC1"))
## 
## z test of coefficients:
## 
##                      Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept)         0.1660918  0.0050833  32.6742 < 2.2e-16 ***
## scale(pblack)      -0.1175107  0.0158543  -7.4119 1.245e-13 ***
## scale(phisp)       -0.0990723  0.0090510 -10.9460 < 2.2e-16 ***
## scale(prural)       0.0962742  0.0092066  10.4570 < 2.2e-16 ***
## scale(pfemhh)       0.1155893  0.0195027   5.9268 3.088e-09 ***
## scale(unemp)        0.0242254  0.0078982   3.0672  0.002161 ** 
## scale(medhouseval) -0.0431726  0.0064692  -6.6736 2.497e-11 ***
## scale(popdensity)  -0.0050917  0.0017812  -2.8585  0.004256 ** 
## scale(gini)         0.0659110  0.0099520   6.6229 3.523e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1-fit.poi.us$deviance/fit.poi.us$null.deviance
## [1] 0.5345239
scale<-sqrt(fit.poi.us$deviance/fit.poi.us$df.residual)
scale
## [1] 11.45621
1-pchisq(fit.poi.us$deviance, df = fit.poi.us$df.residual)
## [1] 0
round(exp(coef(fit.poi.us)), 3)
##        (Intercept)      scale(pblack)       scale(phisp) 
##              1.181              0.889              0.906 
##      scale(prural)      scale(pfemhh)       scale(unemp) 
##              1.101              1.123              1.025 
## scale(medhouseval)  scale(popdensity)        scale(gini) 
##              0.958              0.995              1.068
hist(fitted(fit.poi.us)/usco$E,main="US SMR Distribution from Poisson Model" )

Other 2 - parameter distributions

  • Of course, we could just fit other distributional models to our data, popular choices are:
  • Normal
  • y has mean and variance
  • Effectively adds a shape parameter to Poisson
library(MASS)
fit.nb.us<-glm.nb(Deaths~offset(log(E))+scale(pblack)+scale(phisp)+scale(prural)+scale(pfemhh)+scale(unemp)+scale(medhouseval)+scale(popdensity)+scale(gini),
            data=usco)
summary(fit.nb.us)
## 
## Call:
## glm.nb(formula = Deaths ~ offset(log(E)) + scale(pblack) + scale(phisp) + 
##     scale(prural) + scale(pfemhh) + scale(unemp) + scale(medhouseval) + 
##     scale(popdensity) + scale(gini), data = usco, init.theta = 27.08878204, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -4.9318  -0.6185  -0.0112   0.5614   3.8671  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         0.211208   0.003523  59.953  < 2e-16 ***
## scale(pblack)      -0.050047   0.006809  -7.350 1.98e-13 ***
## scale(phisp)       -0.067109   0.003822 -17.558  < 2e-16 ***
## scale(prural)       0.060118   0.004449  13.513  < 2e-16 ***
## scale(pfemhh)       0.020949   0.008118   2.581 0.009864 ** 
## scale(unemp)        0.010826   0.004326   2.503 0.012331 *  
## scale(medhouseval) -0.112580   0.004148 -27.140  < 2e-16 ***
## scale(popdensity)   0.013567   0.003755   3.613 0.000303 ***
## scale(gini)         0.058649   0.004103  14.293  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(27.0888) family taken to be 1)
## 
##     Null deviance: 5463.5  on 3093  degrees of freedom
## Residual deviance: 3141.5  on 3085  degrees of freedom
## AIC: 43737
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  27.089 
##           Std. Err.:  0.715 
## 
##  2 x log-likelihood:  -43716.987
 1-fit.nb.us$deviance/fit.nb.us$null.deviance
## [1] 0.4250012
lmtest::coeftest(fit.nb.us, vcov.=vcovHC(fit.nb.us, type="HC1"))
## 
## z test of coefficients:
## 
##                      Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept)         0.2112076  0.0034518  61.1879 < 2.2e-16 ***
## scale(pblack)      -0.0500470  0.0081986  -6.1043 1.032e-09 ***
## scale(phisp)       -0.0671094  0.0040376 -16.6213 < 2.2e-16 ***
## scale(prural)       0.0601181  0.0052639  11.4209 < 2.2e-16 ***
## scale(pfemhh)       0.0209488  0.0104343   2.0077  0.044676 *  
## scale(unemp)        0.0108258  0.0042265   2.5614  0.010425 *  
## scale(medhouseval) -0.1125799  0.0078156 -14.4044 < 2.2e-16 ***
## scale(popdensity)   0.0135672  0.0035312   3.8421  0.000122 ***
## scale(gini)         0.0586487  0.0049607  11.8226 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
round(exp(coef(fit.nb.us)), 3)
##        (Intercept)      scale(pblack)       scale(phisp) 
##              1.235              0.951              0.935 
##      scale(prural)      scale(pfemhh)       scale(unemp) 
##              1.062              1.021              1.011 
## scale(medhouseval)  scale(popdensity)        scale(gini) 
##              0.894              1.014              1.060
hist(fitted(fit.nb.us)/usco$E,main="US SMR Distribution from Negative Binomial Model" )