This example covers the use of R functions for fitting count data models to complex survey data and to aggregate data at the county level. Specifically, we focus on fitting Poisson and Negative Binomial models to individual-level survey data as well as to aggregate data.
For this example I am using 2016 CDC Behavioral Risk Factor Surveillance System (BRFSS) SMART county data.
We will also use data on mortality in US counties from the CDC Compressed Mortality File.
For the Poisson model, you observe some count of events. The Poisson distribution has only one parameter, \(\lambda\), which is the mean of the count \(y\).
The Poisson GLM assumes: \[Y \sim Poisson (\lambda),\text{ } E(Y) = \lambda,\text{ } var(Y) = \lambda\] \[log(\lambda) = \eta\]
\[\eta = \beta_0 + \beta_1 x_1+ log(n)\] which is the log-linear model for a count, with the potential for adding an offset term to the model. We have several ways of modeling the Poisson count:
Pure count: if each area has the same risk set. This means each observation is exposed to the “process” for the same amount of time; this is the exposure level.
Rate: if observations do not all have the same level of exposure, then we may express the count as a rate. If we do this, we need to include an offset term in the model to incorporate unequal risk. For example, people who are exposed to a disease for a year are more likely to experience the disease than someone who is exposed for only a week.
\[log(y)= X' \beta + log(n)\]
where n is the period of risk. This is called the offset term in the model.
Standardized ratio: incorporates differential exposure as an expected count:
\[log(y)= X' \beta + log(E)\].
In Epidemiology, the ratio \(\frac{y}{E}\) is referred to as the standardized mortality (or incidence) ratio. In order to get the expected counts for an area, typically the population of the area is multiplied by the “global rate”. For US county mortality, we would multiply the population of each county by the US mortality rate.
\[E = r * n\] \[r = \frac{\sum y}{\sum n}\]
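A small sketch of these calculations with hypothetical county data (not the CMF data used later):
#hypothetical counties: observed deaths and population
county<-data.frame(deaths=c(20, 55, 10), pop=c(10000, 25000, 8000))
r<-sum(county$deaths)/sum(county$pop) #the "global rate"
county$E<-r*county$pop                #expected deaths
county$SMR<-county$deaths/county$E    #standardized mortality ratio
county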
All interpretation of parameters is done on the log scale, so \(exp(\beta)\) gives the multiplicative change (often read as a % change) in the mean, or in the rate, for a 1 unit change in x. All testing is done in the same manner.
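To make this concrete, here is a minimal simulated sketch (hypothetical data, not the BRFSS): generate counts from a log-linear rate model, fit the Poisson GLM with an offset, and exponentiate the coefficient.
#simulated sketch of a Poisson GLM with an offset
set.seed(1115)
x<-rbinom(5000, 1, .5)            #a binary predictor
n<-runif(5000, .5, 2)             #unequal exposure periods
y<-rpois(5000, exp(-1 + .4*x)*n)  #counts from a log-linear rate model
simfit<-glm(y~x+offset(log(n)), family=poisson)
exp(coef(simfit)) #the coefficient on x should be near exp(.4) = 1.49,
                  #i.e., about a 49% higher rate when x = 1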
#load brfss
library(car)
library(stargazer)
library(survey)
library(sjPlot)
library(ggplot2)
load("D:/GoogleDrive/classes/dem7283/class18/data/brfss16_mmsa.Rdata")
#The names in the data are very ugly, so I make them less ugly
nams<-names(brfss16m)
head(nams, n=10)
## [1] "DISPCODE" "HHADULT" "GENHLTH" "PHYSHLTH" "MENTHLTH" "POORHLTH"
## [7] "HLTHPLN1" "PERSDOC2" "MEDCOST" "CHECKUP1"
#we see some names are lower case, some are upper and some have a little _
#in the first position. This is a nightmare.
newnames<-tolower(gsub(pattern = "_",replacement = "",x = nams))
names(brfss16m)<-newnames
Our outcome here is the number of days the respondent reported poor physical health in the past month: Q: Now thinking about your physical health, which includes physical illness and injury, for how many days during the past 30 days was your physical health not good?
brfss16m$healthdays<-recode(brfss16m$physhlth, recodes = "88=0; 77=NA; 99=NA")
hist(brfss16m$healthdays)
summary(brfss16m$healthdays)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.000 0.000 0.000 4.103 3.000 30.000 4922
Other variables:
#brfss16m$badhealth<-ifelse(brfss16m$genhlth %in% c(4,5),1,0)
brfss16m$badhealth<-recode(brfss16m$genhlth, recodes="4:5=1; 1:3=0; else=NA")
#race/ethnicity
brfss16m$black<-recode(brfss16m$racegr3, recodes="2=1; 9=NA; else=0")
brfss16m$white<-recode(brfss16m$racegr3, recodes="1=1; 9=NA; else=0")
brfss16m$other<-recode(brfss16m$racegr3, recodes="3:4=1; 9=NA; else=0")
brfss16m$hispanic<-recode(brfss16m$racegr3, recodes="5=1; 9=NA; else=0")
brfss16m$race_eth<-recode(brfss16m$racegr3,
recodes="1='nhwhite'; 2='nh black'; 3='nh other';4='nh multirace'; 5='hispanic'; else=NA",
as.factor.result = T)
brfss16m$race_eth<-relevel(brfss16m$race_eth, ref = "nhwhite")
#insurance
brfss16m$ins<-recode(brfss16m$hlthpln1, recodes ="7:9=NA; 1=1;2=0")
#income grouping
brfss16m$inc<-ifelse(brfss16m$incomg==9, NA, brfss16m$incomg)
#education level
brfss16m$educ<-recode(brfss16m$educa,
recodes="1:2='0Prim'; 3='1somehs'; 4='2hsgrad'; 5='3somecol'; 6='4colgrad';9=NA",
as.factor.result=T)
brfss16m$educ<-relevel(brfss16m$educ, ref='2hsgrad')
#employment
brfss16m$employ<-recode(brfss16m$employ1,
recodes="1:2='Employed'; 2:6='nilf'; 7='retired'; 8='unable'; else=NA",
as.factor.result=T)
brfss16m$employ<-relevel(brfss16m$employ, ref='Employed')
#marital status
brfss16m$marst<-recode(brfss16m$marital,
recodes="1='married'; 2='divorced'; 3='widowed'; 4='separated'; 5='nm';6='cohab'; else=NA",
as.factor.result=T)
brfss16m$marst<-relevel(brfss16m$marst, ref='married')
#Age cut into intervals
brfss16m$agec<-cut(brfss16m$age80, breaks=c(0,24,39,59,79,99))
#BMI: in the BRFSS, the bmi variable has 2 implied decimal places,
#so we must divide by 100 to get real BMIs
brfss16m$bmi<-brfss16m$bmi5/100
#smoking currently
brfss16m$smoke<-recode(brfss16m$smoker3,
recodes="1:2='Current'; 3='Former';4='NeverSmoked'; else=NA",
as.factor.result=T)
brfss16m$smoke<-relevel(brfss16m$smoke, ref = "NeverSmoked")
First, we will subset our data to have complete cases for the variables in our model, and make our survey design object.
#Here I keep complete cases on my key variables,
#just for speed (the survey procedures can run for a long time)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
##
## recode
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
sub<-brfss16m%>%
select(healthdays, mmsaname, bmi,
agec,race_eth, marst, educ,white, black, hispanic,
other, smoke, ins, mmsawt, ststr) %>%
filter( complete.cases(.))
#First we tell R our survey design
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~1, strata=~ststr, weights=~mmsawt, data =sub )
#FOR THE BRFSS, R GAVE ME A WARNING AND I NEEDED TO ADD:
#YOU MAY NOT NEED TO DO THIS!!!!
#Here the design is re-made on the full data, dropping missing weights
options(survey.lonely.psu = "adjust")
des<-svydesign(ids=~1, strata=~ststr,
weights=~mmsawt,
data = brfss16m[is.na(brfss16m$mmsawt)==F,])
To fit a Poisson GLM to survey data in R, we use the svyglm function in the survey library.
#First I do some simple descriptives
svyhist(~healthdays, des)
svyby(~healthdays, ~race_eth+educ, des, svymean, na.rm=T)
## race_eth educ healthdays se
## nhwhite.2hsgrad nhwhite 2hsgrad 4.365827 0.07420115
## hispanic.2hsgrad hispanic 2hsgrad 3.206603 0.15492460
## nh black.2hsgrad nh black 2hsgrad 4.113492 0.16816281
## nh multirace.2hsgrad nh multirace 2hsgrad 5.862313 0.63475001
## nh other.2hsgrad nh other 2hsgrad 3.007511 0.28359931
## nhwhite.0Prim nhwhite 0Prim 9.007095 0.51999873
## hispanic.0Prim hispanic 0Prim 4.912313 0.26856336
## nh black.0Prim nh black 0Prim 6.594188 0.78846956
## nh multirace.0Prim nh multirace 0Prim 10.309898 2.32094212
## nh other.0Prim nh other 0Prim 9.846208 1.81204689
## nhwhite.1somehs nhwhite 1somehs 6.987239 0.25016489
## hispanic.1somehs hispanic 1somehs 4.631390 0.27979154
## nh black.1somehs nh black 1somehs 6.767867 0.41691489
## nh multirace.1somehs nh multirace 1somehs 11.448505 3.48273874
## nh other.1somehs nh other 1somehs 5.927376 1.23162936
## nhwhite.3somecol nhwhite 3somecol 4.040074 0.07162744
## hispanic.3somecol hispanic 3somecol 3.580861 0.16896535
## nh black.3somecol nh black 3somecol 3.443287 0.16556281
## nh multirace.3somecol nh multirace 3somecol 5.749841 0.50997327
## nh other.3somecol nh other 3somecol 2.110460 0.20104799
## nhwhite.4colgrad nhwhite 4colgrad 2.382518 0.03683385
## hispanic.4colgrad hispanic 4colgrad 2.499118 0.12417816
## nh black.4colgrad nh black 4colgrad 2.221732 0.10449605
## nh multirace.4colgrad nh multirace 4colgrad 3.961559 0.37102485
## nh other.4colgrad nh other 4colgrad 1.730583 0.11003042
svyby(~healthdays, ~agec, des, svymean, na.rm=T)
## agec healthdays se
## (0,24] (0,24] 2.063722 0.07168843
## (24,39] (24,39] 2.486118 0.05289680
## (39,59] (39,59] 4.009304 0.05757186
## (59,79] (59,79] 5.054443 0.07093562
## (79,99] (79,99] 6.099186 0.17453877
#Poisson glm fit to survey data
fit1<-svyglm(healthdays~factor(race_eth)+factor(educ)+factor(agec), design=des, family=poisson)
summary(fit1)
##
## Call:
## svyglm(formula = healthdays ~ factor(race_eth) + factor(educ) +
## factor(agec), design = des, family = poisson)
##
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = brfss16m[is.na(brfss16m$mmsawt) ==
## F, ])
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79842 0.03724 21.437 < 2e-16 ***
## factor(race_eth)hispanic -0.11224 0.02850 -3.938 8.23e-05 ***
## factor(race_eth)nh black -0.01705 0.02627 -0.649 0.51638
## factor(race_eth)nh multirace 0.52808 0.06696 7.886 3.12e-15 ***
## factor(race_eth)nh other -0.23628 0.04726 -5.000 5.74e-07 ***
## factor(educ)0Prim 0.34526 0.04464 7.734 1.04e-14 ***
## factor(educ)1somehs 0.41598 0.03251 12.797 < 2e-16 ***
## factor(educ)3somecol -0.05704 0.02147 -2.657 0.00788 **
## factor(educ)4colgrad -0.57466 0.02075 -27.692 < 2e-16 ***
## factor(agec)(24,39] 0.27266 0.04126 6.608 3.91e-11 ***
## factor(agec)(39,59] 0.73292 0.03801 19.281 < 2e-16 ***
## factor(agec)(59,79] 0.93081 0.03781 24.619 < 2e-16 ***
## factor(agec)(79,99] 1.04760 0.04552 23.013 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 16.37206)
##
## Number of Fisher Scoring iterations: 6
#here are the poisson model "risk ratios", which just show the change in the mean
round(exp(summary(fit1)$coef[-1,1]), 3)
## factor(race_eth)hispanic factor(race_eth)nh black
## 0.894 0.983
## factor(race_eth)nh multirace factor(race_eth)nh other
## 1.696 0.790
## factor(educ)0Prim factor(educ)1somehs
## 1.412 1.516
## factor(educ)3somecol factor(educ)4colgrad
## 0.945 0.563
## factor(agec)(24,39] factor(agec)(39,59]
## 1.313 2.081
## factor(agec)(59,79] factor(agec)(79,99]
## 2.537 2.851
So, we interpret this as follows: NH Multirace respondents had higher mean counts of poor health days than NH Whites, while NH Others had a lower mean count of poor health days. As education increases, the mean count of poor health days decreases. Also, as age increases, the mean count of poor health days increases.
In terms of the risk ratios \(exp(\beta)\), NH multirace respondents had about a 70% higher mean number of days when their health was poor, compared to NH whites (\(exp(0.528) = 1.696\)).
When using the Poisson GLM, you often run into overdispersion. What is overdispersion? For the Poisson distribution, the mean and the variance are functions of one another (variance = mean for the Poisson). So when you have more variability in your data than you expect, you have overdispersion. This basically says your data do not fit your model, and it is a problem because overdispersion leads to standard errors for our model parameters that are too small. But we can fit other models that do not make such assumptions, or that allow there to be more variability.
An easy check on this is to compare the residual deviance to the residual degrees of freedom. Their ratio should be about 1 if the model fits the data.
The svyglm() function includes a scaling term for overdispersion, so this is already taken into account. But if your data are not from a complex survey, we can measure this ourselves using the residual deviance.
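Besides the deviance-based check used below, a common alternative is the Pearson chi-square statistic over the residual degrees of freedom; here is a small helper (a sketch, not part of the original analysis) that works for any fitted glm:
#sketch: Pearson chi-square / residual d.f.; should be near 1 if
#there is no overdispersion
dispersion.check<-function(fit){
  sum(residuals(fit, type="pearson")^2)/fit$df.residual
}
#e.g., dispersion.check(fit2) after fitting the model below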
fit2<-glm(healthdays~factor(race_eth)+factor(educ)+factor(agec), data=brfss16m, family=poisson)
summary(fit2)
##
## Call:
## glm(formula = healthdays ~ factor(race_eth) + factor(educ) +
## factor(agec), family = poisson, data = brfss16m)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.4610 -3.0607 -2.3949 -0.5225 11.5766
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.801437 0.006104 131.289 < 2e-16 ***
## factor(race_eth)hispanic -0.009393 0.003713 -2.530 0.01141 *
## factor(race_eth)nh black 0.017711 0.003348 5.290 1.22e-07 ***
## factor(race_eth)nh multirace 0.474632 0.006741 70.407 < 2e-16 ***
## factor(race_eth)nh other -0.018913 0.005905 -3.203 0.00136 **
## factor(educ)0Prim 0.325068 0.005713 56.899 < 2e-16 ***
## factor(educ)1somehs 0.395359 0.004091 96.639 < 2e-16 ***
## factor(educ)3somecol -0.083288 0.002615 -31.854 < 2e-16 ***
## factor(educ)4colgrad -0.573885 0.002670 -214.908 < 2e-16 ***
## factor(agec)(24,39] 0.323062 0.006652 48.567 < 2e-16 ***
## factor(agec)(39,59] 0.825983 0.006133 134.684 < 2e-16 ***
## factor(agec)(59,79] 0.943281 0.006092 154.846 < 2e-16 ***
## factor(agec)(79,99] 1.030676 0.006706 153.689 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 2901363 on 238812 degrees of freedom
## Residual deviance: 2743149 on 238800 degrees of freedom
## (10198 observations deleted due to missingness)
## AIC: 3062226
##
## Number of Fisher Scoring iterations: 6
scale<-sqrt(fit2$deviance/fit2$df.residual)
scale
## [1] 3.38928
The deviance can also be a test of model fit, using a chi square distribution, with degrees of freedom equal to the residual d.f. (n-p):
1-pchisq(fit2$deviance, df = fit2$df.residual)
## [1] 0
So, this p value is effectively 0, which means the model does not fit the data.
For the Poisson, we can fit a “quasi” distribution that adds an extra parameter to allow the mean-variance relationship to not be constant. For the Poisson we get:
\[var(Y) = \lambda * \phi\], instead of \[var(Y) = \lambda \]
This allows us to include a rough proxy for a dispersion parameter for the distribution. Naturally, this is fixed at 1 for basic models and estimated in the quasi models, so we can look to see if it is much bigger than 1. If overdispersion is present and not accounted for, you could identify a relationship as being significant when it is not!
fit3<-glm(healthdays~factor(race_eth)+factor(educ)+factor(agec), data=brfss16m, family=quasipoisson)
summary(fit3)
##
## Call:
## glm(formula = healthdays ~ factor(race_eth) + factor(educ) +
## factor(agec), family = quasipoisson, data = brfss16m)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.4610 -3.0607 -2.3949 -0.5225 11.5766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.801437 0.025229 31.767 < 2e-16 ***
## factor(race_eth)hispanic -0.009393 0.015346 -0.612 0.540
## factor(race_eth)nh black 0.017711 0.013838 1.280 0.201
## factor(race_eth)nh multirace 0.474632 0.027861 17.036 < 2e-16 ***
## factor(race_eth)nh other -0.018913 0.024404 -0.775 0.438
## factor(educ)0Prim 0.325068 0.023612 13.767 < 2e-16 ***
## factor(educ)1somehs 0.395359 0.016908 23.383 < 2e-16 ***
## factor(educ)3somecol -0.083288 0.010806 -7.708 1.29e-14 ***
## factor(educ)4colgrad -0.573885 0.011036 -51.999 < 2e-16 ***
## factor(agec)(24,39] 0.323062 0.027491 11.751 < 2e-16 ***
## factor(agec)(39,59] 0.825983 0.025346 32.588 < 2e-16 ***
## factor(agec)(59,79] 0.943281 0.025177 37.466 < 2e-16 ***
## factor(agec)(79,99] 1.030676 0.027716 37.187 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 17.08095)
##
## Null deviance: 2901363 on 238812 degrees of freedom
## Residual deviance: 2743149 on 238800 degrees of freedom
## (10198 observations deleted due to missingness)
## AIC: NA
##
## Number of Fisher Scoring iterations: 6
Of course, we could just fit other distributional models to our data. A popular choice is the Negative Binomial, which effectively adds a shape parameter to the Poisson:
\[Y \sim NB (\lambda, \lambda+\lambda^2/\theta),\text{ } E(Y) = \lambda,\text{ } var(Y) = \lambda+\lambda^2/\theta\] \[log(\lambda) = \eta\] \[\eta = \beta_0 + \beta_1 x_1+ log(n)\]
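As a quick simulated sketch of what \(\theta\) does (hypothetical values, not our data), Negative Binomial counts have variance \(\lambda + \lambda^2/\theta\), which exceeds the Poisson variance:
#sketch: NB counts with mean lambda = 5 and shape theta = 2
set.seed(1115)
ynb<-rnbinom(10000, mu=5, size=2)
mean(ynb) #about 5
var(ynb)  #about 5 + 5^2/2 = 17.5, well above the Poisson variance of 5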
Now, R will not fit negative binomial models using a survey design, so we will fit them using sample weights only, and then calculate the robust standard errors. We standardize the weights to sum to the sample size, as opposed to the population size, by dividing each person’s weight by the mean weight.
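As a quick check (a sketch using the sub data frame created above), weights divided by their mean sum to the sample size rather than the population total:
#standardized weights sum to the sample size, not the population total
wtcheck<-sub$mmsawt/mean(sub$mmsawt)
all.equal(sum(wtcheck), nrow(sub))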
#First, I define a function to get the clustered, or robust standard errors.
#This function effectively controls for the within-strata homogeneity when
#calculating the se's for the betas.
#I stole this from: http://drewdimmery.com/robust-ses-in-r/
#and http://people.su.se/~ma/clustering.pdf
#I also added a correction to use this with the hurdle and zero-inflated models
#This is how stata gets robust se's
clx2 <- function(fm, dfcw, cluster){
# R-codes (www.r-project.org) for computing
# clustered-standard errors. Mahmood Arai, Jan 26, 2008.
# The arguments of the function are:
# fitted model, cluster1
# You need to install libraries `sandwich' and `lmtest'
# reweighting the var-cov matrix for the within model
require(sandwich);require(lmtest)
if(class(fm)=="zeroinfl"|class(fm)=="hurdle") {
M <- length(unique(cluster))
N <- length(cluster)
K <- dim(fm$vcov)[1] #here is the rank from the zero inflated fits
dfc <- (M/(M-1))*((N-1)/(N-K))
uj <- apply(estfun(fm),2, function(x) tapply(x, cluster, sum, na.rm=T));
vcovCL <- dfc[1]*sandwich(fm, meat=crossprod(uj)/N)*dfcw #fix a length problem in dfc
list(summary=coeftest(fm, vcovCL))}
else {
M <- length(unique(cluster))
N <- length(cluster)
K <- fm$rank
dfc <- (M/(M - 1)) * ((N - 1)/(N - K))
uj <- apply(estfun(fm), 2, function(x) tapply(x, cluster, sum, na.rm=T));
rcse.cov <- dfc * sandwich(fm, meat = crossprod(uj)/N)
rcse.se <- coeftest(fm, rcse.cov)
return(list( rcse.se))}
}
Now, I’ll illustrate that this works:
#Fit poisson, and compare it to the fit from the survey design
#Fit the Poisson GLM
sub$wts<-sub$mmsawt/mean(sub$mmsawt, na.rm=T)
fit.pois<-glm(healthdays~factor(race_eth)+factor(educ)+factor(agec),
data=sub,
weights=wts, family=poisson)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(sandwich)
clx2(fit.pois,cluster =sub$ststr)
## [[1]]
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.817725 0.033340 24.5271 < 2.2e-16 ***
## factor(race_eth)hispanic -0.090198 0.032177 -2.8032 0.005060 **
## factor(race_eth)nh black -0.027048 0.028917 -0.9354 0.349595
## factor(race_eth)nh multirace 0.507678 0.063901 7.9447 1.946e-15 ***
## factor(race_eth)nh other -0.216965 0.069227 -3.1341 0.001724 **
## factor(educ)0Prim 0.373432 0.045712 8.1693 3.102e-16 ***
## factor(educ)1somehs 0.430531 0.039717 10.8401 < 2.2e-16 ***
## factor(educ)3somecol -0.060751 0.029546 -2.0561 0.039770 *
## factor(educ)4colgrad -0.586394 0.033470 -17.5202 < 2.2e-16 ***
## factor(agec)(24,39] 0.274159 0.037137 7.3824 1.555e-13 ***
## factor(agec)(39,59] 0.724831 0.034624 20.9341 < 2.2e-16 ***
## factor(agec)(59,79] 0.922134 0.033949 27.1626 < 2.2e-16 ***
## factor(agec)(79,99] 1.023305 0.043268 23.6502 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
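#note: vcovHC() has no 'cluster' argument (here and below it is silently
#ignored), so these are heteroskedasticity-robust (HC1) rather than
#cluster-robust SEs; for clustering, use clx2() above or vcovCL() in
#newer versions of the sandwich package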
coeftest(fit.pois, vcov=vcovHC(fit.pois, type="HC1",cluster="ststr"))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.817725 0.039512 20.6958 < 2.2e-16 ***
## factor(race_eth)hispanic -0.090198 0.029819 -3.0248 0.002488 **
## factor(race_eth)nh black -0.027048 0.027789 -0.9733 0.330383
## factor(race_eth)nh multirace 0.507678 0.070495 7.2017 5.949e-13 ***
## factor(race_eth)nh other -0.216965 0.050194 -4.3225 1.543e-05 ***
## factor(educ)0Prim 0.373432 0.046253 8.0737 6.821e-16 ***
## factor(educ)1somehs 0.430531 0.034090 12.6293 < 2.2e-16 ***
## factor(educ)3somecol -0.060751 0.022537 -2.6956 0.007026 **
## factor(educ)4colgrad -0.586394 0.021783 -26.9196 < 2.2e-16 ***
## factor(agec)(24,39] 0.274159 0.043513 6.3006 2.965e-10 ***
## factor(agec)(39,59] 0.724831 0.040180 18.0396 < 2.2e-16 ***
## factor(agec)(59,79] 0.922134 0.039936 23.0903 < 2.2e-16 ***
## factor(agec)(79,99] 1.023305 0.047504 21.5417 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#very close to the survey design results!
summary(fit1)
##
## Call:
## svyglm(formula = healthdays ~ factor(race_eth) + factor(educ) +
## factor(agec), design = des, family = poisson)
##
## Survey design:
## svydesign(ids = ~1, strata = ~ststr, weights = ~mmsawt, data = brfss16m[is.na(brfss16m$mmsawt) ==
## F, ])
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.79842 0.03724 21.437 < 2e-16 ***
## factor(race_eth)hispanic -0.11224 0.02850 -3.938 8.23e-05 ***
## factor(race_eth)nh black -0.01705 0.02627 -0.649 0.51638
## factor(race_eth)nh multirace 0.52808 0.06696 7.886 3.12e-15 ***
## factor(race_eth)nh other -0.23628 0.04726 -5.000 5.74e-07 ***
## factor(educ)0Prim 0.34526 0.04464 7.734 1.04e-14 ***
## factor(educ)1somehs 0.41598 0.03251 12.797 < 2e-16 ***
## factor(educ)3somecol -0.05704 0.02147 -2.657 0.00788 **
## factor(educ)4colgrad -0.57466 0.02075 -27.692 < 2e-16 ***
## factor(agec)(24,39] 0.27266 0.04126 6.608 3.91e-11 ***
## factor(agec)(39,59] 0.73292 0.03801 19.281 < 2e-16 ***
## factor(agec)(59,79] 0.93081 0.03781 24.619 < 2e-16 ***
## factor(agec)(79,99] 1.04760 0.04552 23.013 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 16.37206)
##
## Number of Fisher Scoring iterations: 6
#Fit the Negative Binomial GLM
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
fit.nb2<-glm.nb(healthdays~factor(race_eth)+factor(educ)+factor(agec),
data=sub,
weights=wts)
clx2(fit.nb2,cluster =sub$ststr)
## [[1]]
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.827033 0.032587 25.3796 < 2.2e-16 ***
## factor(race_eth)hispanic -0.046654 0.032776 -1.4234 0.154611
## factor(race_eth)nh black -0.045410 0.030626 -1.4827 0.138152
## factor(race_eth)nh multirace 0.505841 0.053859 9.3919 < 2.2e-16 ***
## factor(race_eth)nh other -0.250848 0.065963 -3.8029 0.000143 ***
## factor(educ)0Prim 0.351842 0.047721 7.3729 1.669e-13 ***
## factor(educ)1somehs 0.411224 0.041172 9.9880 < 2.2e-16 ***
## factor(educ)3somecol -0.052103 0.031220 -1.6689 0.095132 .
## factor(educ)4colgrad -0.562438 0.035198 -15.9792 < 2.2e-16 ***
## factor(agec)(24,39] 0.270551 0.035607 7.5983 3.001e-14 ***
## factor(agec)(39,59] 0.686307 0.033953 20.2135 < 2.2e-16 ***
## factor(agec)(59,79] 0.904456 0.031943 28.3147 < 2.2e-16 ***
## factor(agec)(79,99] 1.043586 0.041398 25.2083 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
coeftest(fit.nb2, vcov=vcovHC(fit.nb2, type="HC1",cluster="ststr"))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.827033 0.038423 21.5246 < 2.2e-16 ***
## factor(race_eth)hispanic -0.046654 0.031529 -1.4797 0.13896
## factor(race_eth)nh black -0.045410 0.029690 -1.5295 0.12615
## factor(race_eth)nh multirace 0.505841 0.061123 8.2758 < 2.2e-16 ***
## factor(race_eth)nh other -0.250848 0.048894 -5.1304 2.891e-07 ***
## factor(educ)0Prim 0.351842 0.051348 6.8521 7.279e-12 ***
## factor(educ)1somehs 0.411224 0.037323 11.0181 < 2.2e-16 ***
## factor(educ)3somecol -0.052103 0.025270 -2.0619 0.03922 *
## factor(educ)4colgrad -0.562438 0.023666 -23.7660 < 2.2e-16 ***
## factor(agec)(24,39] 0.270551 0.042085 6.4287 1.287e-10 ***
## factor(agec)(39,59] 0.686307 0.038817 17.6806 < 2.2e-16 ***
## factor(agec)(59,79] 0.904456 0.038385 23.5625 < 2.2e-16 ***
## factor(agec)(79,99] 1.043586 0.046109 22.6332 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Typically, when we get aggregate data, they have a spatial component, meaning they are aggregated to some geography: tract, block, county, ZIP code, etc. These are typically either raw counts, hopefully with a population at risk (a denominator), or rates that someone has already calculated.
GLMs are a class of statistical models that link the mean of the distribution to a linear combination of covariates by some form of link function.
\[g(\mu) = g(E(Y)) = X' \beta\]
where \(g()\) is the link function. The link function for a binomial distribution is typically the logit:
\[g(\mu) = log(\frac{\pi}{1- \pi})\]
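In R, qlogis() and plogis() implement the logit and its inverse, which makes the link concrete (a quick sketch):
#the logit link and its inverse
p<-.25
qlogis(p)         #log(.25/.75) = -1.0986
plogis(qlogis(p)) #back to .25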
Inherently, we want to see what influences the rate in our areas, so we want regression models for our area-based counts that properly model the outcome variable. You have a few options: Binomial, Poisson, Normal, or Negative Binomial.
#I subset to only US counties this time, for the sake of computational time.
library(dplyr)
library(tidycensus)
library(acs)
## Loading required package: stringr
## Loading required package: XML
##
## Attaching package: 'acs'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:base':
##
## apply
library(lmtest)
library(sandwich)
githubURL <- "https://github.com/coreysparks/data/blob/master/cmf06_10_age_sex_std.Rdata?raw=true"
load(url(githubURL))
dim(cmf)
githubURL <- "https://github.com/coreysparks/data/blob/master/arf1516.Rdata?raw=true"
load(url(githubURL))
dim(arf15)
## To install your API key for use in future sessions, run this function with `install = TRUE`.
#Gini coefficient for income inequality
gini<-get_acs(geography = "county", variables = "B19083_001", year = 2009 )
## Please note: `get_acs()` now defaults to a year or endyear of 2016.
usdata<-merge(x = arf15, y=cmf, by.x="f00004", by.y="County.Code")
library(tigris)
## To enable
## caching of data, set `options(tigris_use_cache = TRUE)` in your R script or .Rprofile.
##
## Attaching package: 'tigris'
## The following object is masked from 'package:graphics':
##
## plot
## Warning: '*' is not a valid FIPS code or state name/abbreviation
usco<-counties(state = "*", year = 2010,cb = T )
library(sf)
## Linking to GEOS 3.6.1, GDAL 2.2.0, proj.4 4.9.3
library(ggplot2) #note you must do: devtools::install_github("tidyverse/ggplot2") to get geom_sf() geometries
usco<- st_as_sf(usco)
usco<-mutate(usco, geoid=paste(STATEFP, COUNTYFP, sep=""))
usco<-inner_join(usco, y=usdata, by=c("geoid"="f00004") )
usco<-inner_join(usco,y=gini, c("geoid"="GEOID") )
usco<-mutate(usco,pblack=100*( (f1391005 + f1391105)/ f1198405 ),
phisp=100*( (f1392005+ f1392105)/ f1198405 ),
prural = 100*( f1367900/ f0453000 ),
pfemhh = f0874600 ,
unemp= f0679505,
medhouseval=f1461306,
popdensity=f1198405/(as.numeric(CENSUSAREA)/1000000),
gini=estimate )
usco<- filter(usco, !STATEFP %in% c("02", "15", "72"), is.na(Age.Adjusted.Rate)==F&is.na(pblack)==F&is.na(prural)==F&is.na(pfemhh)==F)
usco<- mutate( usco,
mortquant = cut( Age.Adjusted.Rate, breaks= quantile(Age.Adjusted.Rate, p=seq(0, 1, .2), na.rm=T)),
pblackq=cut( pblack, breaks= quantile(pblack, p=seq(0, 1, .2), na.rm=T)),
giniq=cut(gini, breaks=quantile(gini, p=seq(0,1,.2), na.rm=T)))
usco %>%
ggplot()+geom_sf(aes( fill=mortquant))+coord_sf(crs = 102009)+ scale_colour_brewer(palette = "Blues" )+scale_fill_brewer(palette = "Blues", na.value="grey")+guides(fill=guide_legend(title="Mortality Quintile"))
usco %>%
ggplot()+geom_sf(aes( fill=pblackq))+coord_sf(crs = 102009)+ scale_colour_brewer(palette = "Blues" )+scale_fill_brewer(palette = "Blues", na.value="grey")+guides(fill=guide_legend(title="Percent Black Population Quintile"))
usco %>%
ggplot()+geom_sf(aes( fill=giniq))+coord_sf(crs = 102009)+ scale_colour_brewer(palette = "Blues" )+scale_fill_brewer(palette = "Blues", na.value="grey")+guides(fill=guide_legend(title="Gini Coefficient Quintile"))
To use the Binomial model, you have two options. Either your observations are 1/0 (we have done this before; this says that each area has the event or not), OR your observations are a combination of y and n, where y is a count of some event which has happened to some portion of the population at risk, n. For either of these, you are modeling the probability of observing an event, \(\pi\), or really the logit of the probability: \[log(\frac{\pi}{1- \pi}) = X' \beta\] The good thing is that you have already done this! All interpretation of parameters is the same as in logistic regression, via odds ratios, \(exp(\beta)\).
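Here is a minimal sketch of the aggregated form with a hypothetical data frame. Note that glm() formally wants cbind(events, non-events), i.e. cbind(y, n - y); with rare events such as deaths, using the whole population as the second column (as in the model below) is a very close approximation.
#sketch with hypothetical aggregate data: y events out of n at risk
agg<-data.frame(y=c(12, 30, 7), n=c(500, 900, 400), x=c(1.2, -.3, .8))
binfit<-glm(cbind(y, n-y)~x, family=binomial, data=agg)
exp(coef(binfit)) #odds ratios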
usfitbin<-glm(cbind(Deaths, Population)~scale(pblack)+scale(phisp)+scale(prural)+scale(pfemhh)+scale(unemp)+scale(medhouseval)+scale(popdensity)+scale(gini) ,family=binomial,
data=usco)
summary(usfitbin)
##
## Call:
## glm(formula = cbind(Deaths, Population) ~ scale(pblack) + scale(phisp) +
## scale(prural) + scale(pfemhh) + scale(unemp) + scale(medhouseval) +
## scale(popdensity) + scale(gini), family = binomial, data = usco)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -84.916 -3.345 1.293 5.331 85.896
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.6614580 0.0004832 -9647.01 <2e-16 ***
## scale(pblack) -0.1176577 0.0006187 -190.17 <2e-16 ***
## scale(phisp) -0.0991019 0.0002986 -331.90 <2e-16 ***
## scale(prural) 0.0963676 0.0004370 220.50 <2e-16 ***
## scale(pfemhh) 0.1157800 0.0007022 164.87 <2e-16 ***
## scale(unemp) 0.0242070 0.0004342 55.76 <2e-16 ***
## scale(medhouseval) -0.0431336 0.0002387 -180.73 <2e-16 ***
## scale(popdensity) -0.0051121 0.0001066 -47.97 <2e-16 ***
## scale(gini) 0.0659669 0.0003770 174.98 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 862832 on 3093 degrees of freedom
## Residual deviance: 401550 on 3085 degrees of freedom
## AIC: 429698
##
## Number of Fisher Scoring iterations: 4
round(exp(coef(usfitbin)), 3)
## (Intercept) scale(pblack) scale(phisp)
## 0.009 0.889 0.906
## scale(prural) scale(pfemhh) scale(unemp)
## 1.101 1.123 1.025
## scale(medhouseval) scale(popdensity) scale(gini)
## 0.958 0.995 1.068
coeftest(usfitbin, vcov.=vcovHC(usfitbin, type="HC1"))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.6614580 0.0050820 -917.2412 < 2.2e-16 ***
## scale(pblack) -0.1176577 0.0158820 -7.4082 1.280e-13 ***
## scale(phisp) -0.0991019 0.0090527 -10.9472 < 2.2e-16 ***
## scale(prural) 0.0963676 0.0092186 10.4536 < 2.2e-16 ***
## scale(pfemhh) 0.1157800 0.0195351 5.9268 3.090e-09 ***
## scale(unemp) 0.0242070 0.0079075 3.0613 0.002204 **
## scale(medhouseval) -0.0431336 0.0064707 -6.6660 2.629e-11 ***
## scale(popdensity) -0.0051121 0.0017827 -2.8676 0.004136 **
## scale(gini) 0.0659669 0.0099643 6.6203 3.584e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1-usfitbin$deviance/usfitbin$null.deviance
## [1] 0.5346143
#This should be the crude death rate, per 1,000 persons
1000 * mean(fitted(usfitbin))
## [1] 9.509964
hist(1000*fitted(usfitbin), main="US Crude Death Rate Distribution from Binomial Model")
For the Poisson model, you observe some count of events in each area (crimes, deaths, etc.), and you also have some population at risk in each area. This is generally not the same for each area, but it could be. You want a Poisson rate \(\lambda\), which is really the average count \(y\). We have several ways of modeling the Poisson count:
Pure count: if each area has the same risk set.
Rate: include an offset term in your model to incorporate unequal risk:
\[log(y)= X' \beta + log(n)\]
where n is the population at risk in each area. This is called the offset term in the model.
Standardized ratio: incorporates differential exposure as an expected count:
\[log(y)= X' \beta + log(E)\].
Again, all interpretation of parameters is done on the log scale, so \(exp(\beta)\) gives the multiplicative change (often read as a % change) in the mean, or in the rate, for a 1 unit change in x. All testing is done in the same manner.
#Form the expected count
usco$E<-usco$Population * (sum(usco$Deaths)/sum(usco$Population))
fit.poi.us<-glm(Deaths~offset(log(E))+scale(pblack)+scale(phisp)+scale(prural)+scale(pfemhh)+scale(unemp)+scale(medhouseval)+scale(popdensity)+scale(gini) ,family=poisson,
data=usco)
summary(fit.poi.us)
##
## Call:
## glm(formula = Deaths ~ offset(log(E)) + scale(pblack) + scale(phisp) +
## scale(prural) + scale(pfemhh) + scale(unemp) + scale(medhouseval) +
## scale(popdensity) + scale(gini), family = poisson, data = usco)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -85.228 -3.359 1.302 5.354 86.305
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1660918 0.0004808 345.43 <2e-16 ***
## scale(pblack) -0.1175107 0.0006157 -190.85 <2e-16 ***
## scale(phisp) -0.0990723 0.0002975 -333.05 <2e-16 ***
## scale(prural) 0.0962742 0.0004350 221.34 <2e-16 ***
## scale(pfemhh) 0.1155893 0.0006989 165.39 <2e-16 ***
## scale(unemp) 0.0242254 0.0004322 56.05 <2e-16 ***
## scale(medhouseval) -0.0431726 0.0002378 -181.53 <2e-16 ***
## scale(popdensity) -0.0050917 0.0001062 -47.95 <2e-16 ***
## scale(gini) 0.0659110 0.0003754 175.58 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 869840 on 3093 degrees of freedom
## Residual deviance: 404890 on 3085 degrees of freedom
## AIC: 433069
##
## Number of Fisher Scoring iterations: 4
lmtest::coeftest(fit.poi.us, vcov.=vcovHC(fit.poi.us, type="HC1"))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.1660918 0.0050833 32.6742 < 2.2e-16 ***
## scale(pblack) -0.1175107 0.0158543 -7.4119 1.245e-13 ***
## scale(phisp) -0.0990723 0.0090510 -10.9460 < 2.2e-16 ***
## scale(prural) 0.0962742 0.0092066 10.4570 < 2.2e-16 ***
## scale(pfemhh) 0.1155893 0.0195027 5.9268 3.088e-09 ***
## scale(unemp) 0.0242254 0.0078982 3.0672 0.002161 **
## scale(medhouseval) -0.0431726 0.0064692 -6.6736 2.497e-11 ***
## scale(popdensity) -0.0050917 0.0017812 -2.8585 0.004256 **
## scale(gini) 0.0659110 0.0099520 6.6229 3.523e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1-fit.poi.us$deviance/fit.poi.us$null.deviance
## [1] 0.5345239
scale<-sqrt(fit.poi.us$deviance/fit.poi.us$df.residual)
scale
## [1] 11.45621
1-pchisq(fit.poi.us$deviance, df = fit.poi.us$df.residual)
## [1] 0
round(exp(coef(fit.poi.us)), 3)
## (Intercept) scale(pblack) scale(phisp)
## 1.181 0.889 0.906
## scale(prural) scale(pfemhh) scale(unemp)
## 1.101 1.123 1.025
## scale(medhouseval) scale(popdensity) scale(gini)
## 0.958 0.995 1.068
hist(fitted(fit.poi.us)/usco$E,main="US SMR Distribution from Poisson Model" )
library(MASS)
fit.nb.us<-glm.nb(Deaths~offset(log(E))+scale(pblack)+scale(phisp)+scale(prural)+scale(pfemhh)+scale(unemp)+scale(medhouseval)+scale(popdensity)+scale(gini),
data=usco)
summary(fit.nb.us)
##
## Call:
## glm.nb(formula = Deaths ~ offset(log(E)) + scale(pblack) + scale(phisp) +
## scale(prural) + scale(pfemhh) + scale(unemp) + scale(medhouseval) +
## scale(popdensity) + scale(gini), data = usco, init.theta = 27.08878204,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.9318 -0.6185 -0.0112 0.5614 3.8671
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.211208 0.003523 59.953 < 2e-16 ***
## scale(pblack) -0.050047 0.006809 -7.350 1.98e-13 ***
## scale(phisp) -0.067109 0.003822 -17.558 < 2e-16 ***
## scale(prural) 0.060118 0.004449 13.513 < 2e-16 ***
## scale(pfemhh) 0.020949 0.008118 2.581 0.009864 **
## scale(unemp) 0.010826 0.004326 2.503 0.012331 *
## scale(medhouseval) -0.112580 0.004148 -27.140 < 2e-16 ***
## scale(popdensity) 0.013567 0.003755 3.613 0.000303 ***
## scale(gini) 0.058649 0.004103 14.293 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(27.0888) family taken to be 1)
##
## Null deviance: 5463.5 on 3093 degrees of freedom
## Residual deviance: 3141.5 on 3085 degrees of freedom
## AIC: 43737
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 27.089
## Std. Err.: 0.715
##
## 2 x log-likelihood: -43716.987
1-fit.nb.us$deviance/fit.nb.us$null.deviance
## [1] 0.4250012
lmtest::coeftest(fit.nb.us, vcov.=vcovHC(fit.nb.us, type="HC1"))
##
## z test of coefficients:
##
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2112076 0.0034518 61.1879 < 2.2e-16 ***
## scale(pblack) -0.0500470 0.0081986 -6.1043 1.032e-09 ***
## scale(phisp) -0.0671094 0.0040376 -16.6213 < 2.2e-16 ***
## scale(prural) 0.0601181 0.0052639 11.4209 < 2.2e-16 ***
## scale(pfemhh) 0.0209488 0.0104343 2.0077 0.044676 *
## scale(unemp) 0.0108258 0.0042265 2.5614 0.010425 *
## scale(medhouseval) -0.1125799 0.0078156 -14.4044 < 2.2e-16 ***
## scale(popdensity) 0.0135672 0.0035312 3.8421 0.000122 ***
## scale(gini) 0.0586487 0.0049607 11.8226 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
round(exp(coef(fit.nb.us)), 3)
## (Intercept) scale(pblack) scale(phisp)
## 1.235 0.951 0.935
## scale(prural) scale(pfemhh) scale(unemp)
## 1.062 1.021 1.011
## scale(medhouseval) scale(popdensity) scale(gini)
## 0.894 1.014 1.060
hist(fitted(fit.nb.us)/usco$E,main="US SMR Distribution from Negative Binomial Model" )