Aggregate data

Typically when we get data with a spatial component, they are aggregated to some geography * Tract, block, county, ZIP code, etc. * These are typically either raw counts + Count of deaths, crimes, births, people with some characteristic (eg poverty) > numerator * Hopefully with a population at risk > denominator, or rates that someone has calculated + Mortality rates from NCHS or some census rate

GLM’s are a class of statitical models that link the mean of the distribuion to a linear combination of covariates by some form of link function.

\[g(u) = g(E(Y)) = X ' \beta\]

Where g() is the link function. The link function for a binomial distribution is typically the logit, or:

\[g(u) = log(\frac{\pi}{1- \pi})\]

Modeling options

#I subset to only US counties this time, for the sake of computational time.
spdat<-readOGR("/Users/ozd504//Google Drive/classes/dem7263/class17/data/usdata_mort.shp")
## OGR data source with driver: ESRI Shapefile 
## Source: "/Users/ozd504//Google Drive/classes/dem7263/class17/data/usdata_mort.shp", layer: "usdata_mort"
## with 3067 features
## It has 175 fields
## Integer64 fields read as strings:  pubassist SSI socsec poptot CountyCode
#spdat<-spdat[substr(spdat$COFIPS,1,2)=="48",]
spdat$population<-as.numeric(as.character(spdat$Population))
spdat$deaths<-as.numeric(as.character(spdat$Deaths))
#Create a k=4 nearest neighbor set
us.nb4<-knearneigh(coordinates(spdat), k=4)
us.nb4<-knn2nb(us.nb4)
us.nb4<-make.sym.nb(us.nb4)
us.wt4<-nb2listw(us.nb4, style="W")

Binomial model for counts

To use the Binomial model, you have two options Your observations are 1/0, we have done this before. This says that each area has the event or not OR Your observations are a combination of y and n, where y is a count of some event which has happened to some portion of the population at risk, n. For either of these, you are modeling the probability of observing an event, \(\pi\) or really the logit of the probability: \[log(\frac{\pi}{1- \pi}) = X' \beta\] * The good thing is that you have already done this! * All interpretation of parameters is the same as logistic regression

\[exp(\beta)\]

  • All model testing is the same too!
spdat$std<-fitted(lm(I(deaths/population)~ppopkids+ppop65plus+ppopwrknga-1, spdat))
fit.bin.us<-glm(cbind(deaths, population)~scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat, family = binomial)
summary(fit.bin.us)
## 
## Call:
## glm(formula = cbind(deaths, population) ~ scale(ppersonspo) + 
##     scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 
##     7), family = binomial, data = spdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -79.567   -1.739    0.970    3.895   42.784  
## 
## Coefficients:
##                     Estimate Std. Error    z value Pr(>|z|)    
## (Intercept)       -4.6538250  0.0004283 -10865.054   <2e-16 ***
## scale(ppersonspo)  0.1001728  0.0004527    221.298   <2e-16 ***
## scale(p65plus)     0.1835270  0.0003114    589.301   <2e-16 ***
## scale(pblack_1)    0.0001039  0.0003830      0.271    0.786    
## scale(phisp)      -0.0905159  0.0003067   -295.087   <2e-16 ***
## I(RUCC >= 7)TRUE   0.0296354  0.0011100     26.697   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 816361  on 3066  degrees of freedom
## Residual deviance: 167335  on 3061  degrees of freedom
## AIC: 195182
## 
## Number of Fisher Scoring iterations: 4
round(exp(coef(fit.bin.us)), 3)
##       (Intercept) scale(ppersonspo)    scale(p65plus)   scale(pblack_1) 
##             0.010             1.105             1.201             1.000 
##      scale(phisp)  I(RUCC >= 7)TRUE 
##             0.913             1.030
lmtest::coeftest(fit.bin.us, vcov.=vcovHC(fit.bin.us, type="HC5"))
## 
## z test of coefficients:
## 
##                      Estimate  Std. Error   z value  Pr(>|z|)    
## (Intercept)       -4.65382502  0.01017737 -457.2717 < 2.2e-16 ***
## scale(ppersonspo)  0.10017275  0.01320576    7.5855 3.311e-14 ***
## scale(p65plus)     0.18352698  0.01345639   13.6386 < 2.2e-16 ***
## scale(pblack_1)    0.00010392  0.00948725    0.0110    0.9913    
## scale(phisp)      -0.09051587  0.01499597   -6.0360 1.580e-09 ***
## I(RUCC >= 7)TRUE   0.02963539  0.01505388    1.9686    0.0490 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1-fit.bin.us$deviance/fit.bin.us$null.deviance
## [1] 0.7950239
#This should be the crude death rate, per 1,000 persons
1000 * mean(fitted(fit.bin.us))
## [1] 9.819147
hist(1000*fitted(fit.bin.us), main="US Crude Death Rate Distribution from Binomial Model")

spdat$fittedbin<-1000*fitted(fit.bin.us)
spplot(spdat, "fittedbin", at=quantile(spdat$fittedbin), col.regions=brewer.pal(5, "Blues"))

Poisson model for counts

For the Poisson model, you observe some count of events in each area (crimes, deaths, etc), and you also have some population at risk in each area. This is generally not the same for each area, but could be, you want a Poisson rate \((\lambda)\), which is really the average count (y). We have several ways of modeling the Poisson count:

Pure count If each area has the same risk set, Rate, include an offset term in your model to incorporate unequal risk

\[log(y)= X' \beta + log(n)\]

, where n is the population at risk in each area. This is called the offset term in the model. Standardized ratio incorporate differential exposure as an expected count

\[log(y)= X' \beta + log(E)\].

Again, all interpretation of parameters is done on a log scale, so

\[exp(\beta) = \text{ % change in the mean}\] , or % change in the rate, for a 1 unit change in X. All testing is done in the same manner.

#Form the expected count
spdat$E<-spdat$population * (sum(spdat$deaths)/sum(spdat$population))
fit.poi.us<-glm(deaths~offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat, family = poisson)
summary(fit.poi.us)
## 
## Call:
## glm(formula = deaths ~ offset(log(E)) + scale(ppersonspo) + scale(p65plus) + 
##     scale(pblack_1) + scale(phisp) + I(RUCC >= 7), family = poisson, 
##     data = spdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -79.884   -1.752    0.991    3.908   43.043  
## 
## Coefficients:
##                     Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)        1.602e-01  4.262e-04  375.777   <2e-16 ***
## scale(ppersonspo)  1.002e-01  4.505e-04  222.432   <2e-16 ***
## scale(p65plus)     1.832e-01  3.093e-04  592.466   <2e-16 ***
## scale(pblack_1)    5.043e-05  3.812e-04    0.132    0.895    
## scale(phisp)      -9.061e-02  3.055e-04 -296.555   <2e-16 ***
## I(RUCC >= 7)TRUE   2.976e-02  1.104e-03   26.953   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 823082  on 3066  degrees of freedom
## Residual deviance: 168839  on 3061  degrees of freedom
## AIC: 196718
## 
## Number of Fisher Scoring iterations: 4
lmtest::coeftest(fit.poi.us, vcov.=vcovHC(fit.poi.us, type="HC5"))
## 
## z test of coefficients:
## 
##                      Estimate  Std. Error z value  Pr(>|z|)    
## (Intercept)        1.6015e-01  1.0180e-02 15.7318 < 2.2e-16 ***
## scale(ppersonspo)  1.0021e-01  1.3184e-02  7.6005 2.951e-14 ***
## scale(p65plus)     1.8324e-01  1.3458e-02 13.6158 < 2.2e-16 ***
## scale(pblack_1)    5.0431e-05  9.4555e-03  0.0053   0.99574    
## scale(phisp)      -9.0609e-02  1.4942e-02 -6.0642 1.326e-09 ***
## I(RUCC >= 7)TRUE   2.9764e-02  1.5064e-02  1.9758   0.04817 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
1-fit.poi.us$deviance/fit.poi.us$null.deviance
## [1] 0.7948695
scale<-sqrt(fit.poi.us$deviance/fit.poi.us$df.residual)
round(exp(coef(fit.poi.us)), 3)
##       (Intercept) scale(ppersonspo)    scale(p65plus)   scale(pblack_1) 
##             1.174             1.105             1.201             1.000 
##      scale(phisp)  I(RUCC >= 7)TRUE 
##             0.913             1.030
hist(fitted(fit.poi.us)/spdat$E,main="US SMR Distribution from Poisson Model" )

Normal (Gaussian) Model

In both the Binomial and Poisson case, if both n and y are large, the Normal model will often work This is generally not the case for rare events To do this, we assume the rate is our outcome, not the count, and assume Gaussian errors So our model is:

\[\frac{y}{n} = X ' \beta + e_i\]

This is just OLS! Or the more exotic SAR specifications we saw last week. Of course, you need to evaluate the Normal model’s assumptions and see if you can use it. The other models also need to be examined for linearity, but since they don’t have the same structure as the Normal model, they have fewer things to check.

fit.g.us<-glm(I(deaths/population)~scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat, family = gaussian)
summary(fit.g.us)
## 
## Call:
## glm(formula = I(deaths/population) ~ scale(ppersonspo) + scale(p65plus) + 
##     scale(pblack_1) + scale(phisp) + I(RUCC >= 7), family = gaussian, 
##     data = spdat)
## 
## Deviance Residuals: 
##        Min          1Q      Median          3Q         Max  
## -0.0061977  -0.0006043   0.0000686   0.0006903   0.0063190  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.015e-02  3.191e-05 318.267  < 2e-16 ***
## scale(ppersonspo)  7.225e-04  2.737e-05  26.396  < 2e-16 ***
## scale(p65plus)     2.020e-03  2.441e-05  82.728  < 2e-16 ***
## scale(pblack_1)    1.066e-04  2.628e-05   4.055 5.13e-05 ***
## scale(phisp)      -4.947e-04  2.402e-05 -20.594  < 2e-16 ***
## I(RUCC >= 7)TRUE  -1.597e-04  5.031e-05  -3.175  0.00151 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 1.477215e-06)
## 
##     Null deviance: 0.0190688  on 3066  degrees of freedom
## Residual deviance: 0.0045218  on 3061  degrees of freedom
## AIC: -32464
## 
## Number of Fisher Scoring iterations: 2
1-fit.g.us$deviance/fit.g.us$null.deviance
## [1] 0.7628718
lmtest::coeftest(fit.g.us, vcov.=vcovHC(fit.g.us, type="HC5"))
## 
## z test of coefficients:
## 
##                      Estimate  Std. Error  z value  Pr(>|z|)    
## (Intercept)        1.0155e-02  3.4037e-05 298.3479 < 2.2e-16 ***
## scale(ppersonspo)  7.2245e-04  3.4551e-05  20.9097 < 2.2e-16 ***
## scale(p65plus)     2.0198e-03  4.0002e-05  50.4919 < 2.2e-16 ***
## scale(pblack_1)    1.0657e-04  2.7674e-05   3.8509 0.0001177 ***
## scale(phisp)      -4.9468e-04  3.0838e-05 -16.0414 < 2.2e-16 ***
## I(RUCC >= 7)TRUE  -1.5972e-04  5.6931e-05  -2.8056 0.0050229 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
round(1000*coef(fit.g.us), 3)
##       (Intercept) scale(ppersonspo)    scale(p65plus)   scale(pblack_1) 
##            10.155             0.722             2.020             0.107 
##      scale(phisp)  I(RUCC >= 7)TRUE 
##            -0.495            -0.160
hist(1000*fitted(fit.g.us),main="US Crude Death Rate Distribution from Gaussian Model" )

When using the Binomial and Poisson GLM on areal data, you often run into overdispersion * What’s overdispersion? Class? For Binomial and Poisson distributions, the mean and the variance are functions of one another (variance = mean for Poisson). So when you have more variability than you expect in your data, you have overdispersion. This basically says your data don’t fit your model, and is a problem because overdispersion leads to standard errors for our model parameters that are too small. But, we can fit other models that don’t make such assumptions, or allow there to be more variability.

An easy check on this is to compare the residual deviance to the residual degrees of freedom. They ratio should be 1 if the model fits the data.

Modeling Overdispersion via a “Quasi” distribution

For the Poisson and Binomial, we can fit a “quasi” distribution that adds an extra parameter to allow the mean-variance relationship to not be constant. For e.g. in the Binomial we get:

\[var(Y) = n*p*(1-p)*\phi\], instead of \[var(Y) = n*p*(1-p)\]

For Poisson we get:

\[var(Y) = \lambda * \phi\], instead of \[var(Y) = λ\]

This allows us to include a rough proxy for a “dispersion parameter” for the distribution. Naturally this is fixed at 1 for basic models, and estimated in the quasi models, we can look to see if is much bigger than 1. If overdispersion is present and not accounted for you could identify a relationship as being significant when it is not!

fit.qbin.us<-glm(cbind(deaths, population)~scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat, family = quasibinomial)
summary(fit.qbin.us)
## 
## Call:
## glm(formula = cbind(deaths, population) ~ scale(ppersonspo) + 
##     scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 
##     7), family = quasibinomial, data = spdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -79.567   -1.739    0.970    3.895   42.784  
## 
## Coefficients:
##                     Estimate Std. Error   t value Pr(>|t|)    
## (Intercept)       -4.6538250  0.0031174 -1492.845  < 2e-16 ***
## scale(ppersonspo)  0.1001728  0.0032945    30.406  < 2e-16 ***
## scale(p65plus)     0.1835270  0.0022666    80.969  < 2e-16 ***
## scale(pblack_1)    0.0001039  0.0027872     0.037 0.970260    
## scale(phisp)      -0.0905159  0.0022325   -40.545  < 2e-16 ***
## I(RUCC >= 7)TRUE   0.0296354  0.0080790     3.668 0.000248 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 52.97053)
## 
##     Null deviance: 816361  on 3066  degrees of freedom
## Residual deviance: 167335  on 3061  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
  1-fit.qbin.us$deviance/fit.qbin.us$null.deviance
## [1] 0.7950239
lmtest::coeftest(fit.g.us, vcov.=vcovHC(fit.g.us, type="HC5"))
## 
## z test of coefficients:
## 
##                      Estimate  Std. Error  z value  Pr(>|z|)    
## (Intercept)        1.0155e-02  3.4037e-05 298.3479 < 2.2e-16 ***
## scale(ppersonspo)  7.2245e-04  3.4551e-05  20.9097 < 2.2e-16 ***
## scale(p65plus)     2.0198e-03  4.0002e-05  50.4919 < 2.2e-16 ***
## scale(pblack_1)    1.0657e-04  2.7674e-05   3.8509 0.0001177 ***
## scale(phisp)      -4.9468e-04  3.0838e-05 -16.0414 < 2.2e-16 ***
## I(RUCC >= 7)TRUE  -1.5972e-04  5.6931e-05  -2.8056 0.0050229 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
fit.qpoi.us<-glm(deaths~offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat, family = quasipoisson)
summary(fit.qpoi.us)
## 
## Call:
## glm(formula = deaths ~ offset(log(E)) + scale(ppersonspo) + scale(p65plus) + 
##     scale(pblack_1) + scale(phisp) + I(RUCC >= 7), family = quasipoisson, 
##     data = spdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -79.884   -1.752    0.991    3.908   43.043  
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.602e-01  3.115e-03  51.411  < 2e-16 ***
## scale(ppersonspo)  1.002e-01  3.293e-03  30.431  < 2e-16 ***
## scale(p65plus)     1.832e-01  2.261e-03  81.056  < 2e-16 ***
## scale(pblack_1)    5.043e-05  2.786e-03   0.018  0.98556    
## scale(phisp)      -9.061e-02  2.233e-03 -40.572  < 2e-16 ***
## I(RUCC >= 7)TRUE   2.976e-02  8.071e-03   3.688  0.00023 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 53.42588)
## 
##     Null deviance: 823082  on 3066  degrees of freedom
## Residual deviance: 168839  on 3061  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4
 1-fit.qpoi.us$deviance/fit.qpoi.us$null.deviance
## [1] 0.7948695
lmtest::coeftest(fit.g.us, vcov.=vcovHC(fit.g.us, type="HC5"))
## 
## z test of coefficients:
## 
##                      Estimate  Std. Error  z value  Pr(>|z|)    
## (Intercept)        1.0155e-02  3.4037e-05 298.3479 < 2.2e-16 ***
## scale(ppersonspo)  7.2245e-04  3.4551e-05  20.9097 < 2.2e-16 ***
## scale(p65plus)     2.0198e-03  4.0002e-05  50.4919 < 2.2e-16 ***
## scale(pblack_1)    1.0657e-04  2.7674e-05   3.8509 0.0001177 ***
## scale(phisp)      -4.9468e-04  3.0838e-05 -16.0414 < 2.2e-16 ***
## I(RUCC >= 7)TRUE  -1.5972e-04  5.6931e-05  -2.8056 0.0050229 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
round(exp(coef(fit.poi.us)), 3)
##       (Intercept) scale(ppersonspo)    scale(p65plus)   scale(pblack_1) 
##             1.174             1.105             1.201             1.000 
##      scale(phisp)  I(RUCC >= 7)TRUE 
##             0.913             1.030

Notice the dispersion parameters in both of these fits, both were around 50, they are assumed to be 1, although substantively, there was no difference in the interpretation of the model effects.

Other 2 - parameter distributions

  • Of course, we could just fit other distributional models to our data, popular choices are:
  • Normal – y has mean and variance
  • Negative binomial – Effectively adds a shape parameter to Poisson
  • Beta binomial – Effectively adds a shape parameter to binomial – p has a mean and a variance
library(MASS)
fit.nb.us<-glm.nb(deaths~offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat)
summary(fit.nb.us)
## 
## Call:
## glm.nb(formula = deaths ~ offset(log(E)) + scale(ppersonspo) + 
##     scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 
##     7), data = spdat, init.theta = 65.9681801, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.4349  -0.5437   0.0729   0.6057   5.6327  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        0.205725   0.003305  62.244  < 2e-16 ***
## scale(ppersonspo)  0.085633   0.002926  29.268  < 2e-16 ***
## scale(p65plus)     0.214550   0.002607  82.298  < 2e-16 ***
## scale(pblack_1)    0.008369   0.002777   3.013  0.00259 ** 
## scale(phisp)      -0.059229   0.002601 -22.769  < 2e-16 ***
## I(RUCC >= 7)TRUE  -0.025314   0.005338  -4.742 2.11e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(65.9682) family taken to be 1)
## 
##     Null deviance: 12374.2  on 3066  degrees of freedom
## Residual deviance:  3164.4  on 3061  degrees of freedom
## AIC: 40724
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  65.97 
##           Std. Err.:  1.85 
## 
##  2 x log-likelihood:  -40709.96
 1-fit.nb.us$deviance/fit.nb.us$null.deviance
## [1] 0.7442778
lmtest::coeftest(fit.nb.us, vcov.=vcovHC(fit.nb.us, type="HC5"))
## 
## z test of coefficients:
## 
##                     Estimate Std. Error  z value  Pr(>|z|)    
## (Intercept)        0.2057253  0.0034945  58.8709 < 2.2e-16 ***
## scale(ppersonspo)  0.0856332  0.0035158  24.3569 < 2.2e-16 ***
## scale(p65plus)     0.2145501  0.0043119  49.7578 < 2.2e-16 ***
## scale(pblack_1)    0.0083686  0.0027369   3.0576  0.002231 ** 
## scale(phisp)      -0.0592291  0.0034669 -17.0841 < 2.2e-16 ***
## I(RUCC >= 7)TRUE  -0.0253141  0.0062090  -4.0770 4.562e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
round(exp(coef(fit.nb.us)), 3)
##       (Intercept) scale(ppersonspo)    scale(p65plus)   scale(pblack_1) 
##             1.228             1.089             1.239             1.008 
##      scale(phisp)  I(RUCC >= 7)TRUE 
##             0.942             0.975
hist(fitted(fit.nb.us)/spdat$E,main="US SMR Distribution from Negative Binomial Model" )

library(aod)
fit.bbin.us<-betabin(cbind(deaths, population)~scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), random=~1, spdat@data, method="SANN")
summary(fit.bbin.us)
## Beta-binomial model
## -------------------
## betabin(formula = cbind(deaths, population) ~ scale(ppersonspo) + 
##     scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 
##     7), random = ~1, data = spdat@data, ... = pairlist(method = "SANN"))
## 
## Convergence was obtained after 2000 iterations.
## 
## Fixed-effect coefficients:
##                     Estimate Std. Error    z value Pr(> |z|)
## (Intercept)       -4.686e+00  8.332e-03 -5.623e+02     0e+00
## scale(ppersonspo) -1.651e-01  7.316e-03 -2.257e+01     0e+00
## scale(p65plus)     1.419e-01  5.007e-03  2.835e+01     0e+00
## scale(pblack_1)    1.043e-01  6.133e-03  1.700e+01     0e+00
## scale(phisp)      -1.381e-01  9.706e-03 -1.423e+01     0e+00
## I(RUCC >= 7)TRUE   2.939e-01  1.120e-02  2.625e+01     0e+00
## 
## Overdispersion coefficients:
##                  Estimate Std. Error z value Pr(> z)
## phi.(Intercept) 8.439e-04      2e-13   0e+00   1e+00
## 
## Log-likelihood statistics
##    Log-lik      nbpar    df res.   Deviance        AIC       AICc 
## -2.273e+04          7       3060  1.763e+04  4.548e+04  4.548e+04
1-((-2*fit.bbin.us@logL.max)/(-2*fit.bbin.us@logL))
## [1] 0.3878044
hist(1000*fitted(fit.bbin.us),main="US Crude Death Rate Distribution from Beta Binomial Model" )

Autocorrelation in GLMs

While we don’t have the same assumption about residuals for GLMs as we do for the Gaussian (OLS) model, we still may be curious if our model’s residuals display spatial clustering. While it’s not recommended, we can use the lm.morantest() function to examine correlation among the residuals.

lm.morantest(fit.poi.us, listw = us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm(formula = deaths ~ offset(log(E)) + scale(ppersonspo) +
## scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 7),
## family = poisson, data = spdat)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 24.72, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     2.974003e-01    -5.767635e-07     1.447360e-04
lm.morantest(fit.bin.us, listw = us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm(formula = cbind(deaths, population) ~ scale(ppersonspo)
## + scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 7),
## family = binomial, data = spdat)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 5.5531, p-value = 1.403e-08
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.297511889     -0.119432111      0.005637416
lm.morantest(fit.nb.us, listw = us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = deaths ~ offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), data = spdat, init.theta =
## 65.9681801, link = log)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 27.829, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     3.347412e-01    -2.216496e-05     1.447065e-04

Spatial nuisance and GLMs

Inherently, we are kind of doing spatial analysis since we are using areal units as our observations, right Paul? There are other options that have been explored that promote not modeling space explicityly but instead filtering out the nuisance of spatial autocorrelation. Spatial Filtering is a method Griffith and Peres-Neto (2006) and Dray S, Legendre P and Peres-Neto PR (2005) both describe this method. The Moran eigenvector filtering function is intended to remove spatial autocorrelation from the residuals of (generalised) linear models. Tiefelsdorf M, Griffith DA. (2007) also describe a method of using eigenvectors of the spatial weight matrix to filter out autocorrelation in a model.

Typically this is done by 1) forming the spatial weight matrix, 2)compute the eigenvectors of the centered weight matrix: \[(I-1 1 ' /n)W (I-1 1 ' /n)\], where I is the identity matrix, and 1 is a vector of 1’s. 3) select the eigenvectors to be included in a linear model as spatial predictors, so that the level of autocorrelation in the model is minimized, until Moran’s I is nonsigificant. We can do this using the ME() in the spdep library.

ME() example:

#you have to provide the theta value for the NB family, I got this from the model fit above (fit.nb.us).
me.fit<-ME(deaths~offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat@data, family=negative.binomial(65.97), listw = us.wt4, verbose=T,alpha=.05 )
## eV[,2115], I: 0.3765507 ZI: NA, pr(ZI): 0.01
## eV[,6], I: 0.3412072 ZI: NA, pr(ZI): 0.01
## eV[,5], I: 0.306686 ZI: NA, pr(ZI): 0.01
## eV[,4], I: 0.2617026 ZI: NA, pr(ZI): 0.01
## eV[,2742], I: 0.2285889 ZI: NA, pr(ZI): 0.01
## eV[,2762], I: 0.1952729 ZI: NA, pr(ZI): 0.01
## eV[,2619], I: 0.1699339 ZI: NA, pr(ZI): 0.01
## eV[,2794], I: 0.1477504 ZI: NA, pr(ZI): 0.01
## eV[,54], I: 0.126919 ZI: NA, pr(ZI): 0.01
## eV[,2653], I: 0.1075138 ZI: NA, pr(ZI): 0.01
## eV[,287], I: 0.09318322 ZI: NA, pr(ZI): 0.01
## eV[,51], I: 0.07981999 ZI: NA, pr(ZI): 0.01
## eV[,2196], I: 0.06946673 ZI: NA, pr(ZI): 0.01
## eV[,19], I: 0.05997612 ZI: NA, pr(ZI): 0.02
## eV[,13], I: 0.04839149 ZI: NA, pr(ZI): 0.01
## eV[,229], I: 0.04000454 ZI: NA, pr(ZI): 0.02
## eV[,31], I: 0.03112747 ZI: NA, pr(ZI): 0.06
me.fit
##    Eigenvector ZI pr(ZI)
## 0           NA NA   0.01
## 1         2115 NA   0.01
## 2            6 NA   0.01
## 3            5 NA   0.01
## 4            4 NA   0.01
## 5         2742 NA   0.01
## 6         2762 NA   0.01
## 7         2619 NA   0.01
## 8         2794 NA   0.01
## 9           54 NA   0.01
## 10        2653 NA   0.01
## 11         287 NA   0.01
## 12          51 NA   0.01
## 13        2196 NA   0.01
## 14          19 NA   0.02
## 15          13 NA   0.01
## 16         229 NA   0.02
## 17          31 NA   0.06
fits<-data.frame(fitted(me.fit))
spdat$me2115<-fits$vec2115
spdat$me6<-fits$vec6

spplot(spdat, c("me2115", "me6"), at=quantile(c(spdat$me2115, spdat$me6), p=seq(0,1,length.out = 7)), col.regions=brewer.pal(n=7, "Greys"), main="Spatial Distribution Moran Eigenvectors 161 & 234")

Then I plug the scores of each county on these Moran eigenvectors into the model to “clean” it:

clean.nb<-glm.nb(deaths~offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7)+fitted(me.fit), spdat@data)
summary(clean.nb)
## 
## Call:
## glm.nb(formula = deaths ~ offset(log(E)) + scale(ppersonspo) + 
##     scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 
##     7) + fitted(me.fit), data = spdat@data, init.theta = 81.00951534, 
##     link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.3819  -0.5514   0.0630   0.5911   6.4110  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            0.202559   0.003035  66.745  < 2e-16 ***
## scale(ppersonspo)      0.072568   0.002816  25.772  < 2e-16 ***
## scale(p65plus)         0.220639   0.002434  90.648  < 2e-16 ***
## scale(pblack_1)        0.000429   0.002589   0.166 0.868380    
## scale(phisp)          -0.039670   0.002545 -15.587  < 2e-16 ***
## I(RUCC >= 7)TRUE      -0.022254   0.004960  -4.486 7.24e-06 ***
## fitted(me.fit)vec2115  0.375614   0.118712   3.164 0.001556 ** 
## fitted(me.fit)vec6     2.184548   0.131734  16.583  < 2e-16 ***
## fitted(me.fit)vec5    -2.100059   0.123904 -16.949  < 2e-16 ***
## fitted(me.fit)vec4    -0.573031   0.116869  -4.903 9.43e-07 ***
## fitted(me.fit)vec2742  0.146056   0.116149   1.257 0.208577    
## fitted(me.fit)vec2762 -0.155979   0.117466  -1.328 0.184224    
## fitted(me.fit)vec2619 -0.356876   0.121360  -2.941 0.003275 ** 
## fitted(me.fit)vec2794  0.113667   0.118743   0.957 0.338439    
## fitted(me.fit)vec54    0.453222   0.117731   3.850 0.000118 ***
## fitted(me.fit)vec2653  0.235081   0.119567   1.966 0.049288 *  
## fitted(me.fit)vec287  -0.287122   0.114618  -2.505 0.012244 *  
## fitted(me.fit)vec51   -0.333599   0.117563  -2.838 0.004545 ** 
## fitted(me.fit)vec2196 -0.146617   0.117441  -1.248 0.211874    
## fitted(me.fit)vec19   -0.296070   0.115217  -2.570 0.010180 *  
## fitted(me.fit)vec13    0.386518   0.118299   3.267 0.001086 ** 
## fitted(me.fit)vec229  -0.170353   0.115092  -1.480 0.138834    
## fitted(me.fit)vec31    0.442780   0.119074   3.719 0.000200 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(81.0095) family taken to be 1)
## 
##     Null deviance: 14926.7  on 3066  degrees of freedom
## Residual deviance:  3157.7  on 3044  degrees of freedom
## AIC: 40176
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  81.01 
##           Std. Err.:  2.30 
## 
##  2 x log-likelihood:  -40127.59
lm.morantest(clean.nb, listw=us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = deaths ~ offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7) + fitted(me.fit), data = spdat@data,
## init.theta = 81.00951534, link = log)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 18.888, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     2.283919e-01    -4.441347e-05     1.462704e-04
AICs<-c(AIC(fit.bin.us),AIC(fit.poi.us),AIC(fit.g.us),  AIC(fit.nb.us), test@istats$AIC, AIC(clean.nb))
## Error in eval(expr, envir, enclos): object 'test' not found
plot(AICs, type="l", lwd=1.5, xaxt="n", xlab="")
## Error in plot(AICs, type = "l", lwd = 1.5, xaxt = "n", xlab = ""): object 'AICs' not found
axis(1, at=1:6,labels=F) #7= number of models
## Error in axis(1, at = 1:6, labels = F): plot.new has not been called yet
labels<-c("Binomial", "Poisson","Gaussian", "NBinomial","BetaBin", "clean NB")
text(1:6, par("usr")[3]-.25, srt=45, adj=1, labels=labels, xpd=T)
## Error in text.default(1:6, par("usr")[3] - 0.25, srt = 45, adj = 1, labels = labels, : plot.new has not been called yet
mtext(side=1, text="Model Specification", line=3)
## Error in mtext(side = 1, text = "Model Specification", line = 3): plot.new has not been called yet
symbols(x= which.min(AICs), y=AICs[which.min(AICs)], circles=1, fg=2,lwd=2,add=T)
## Error in xy.coords(x, y, xlab = deparse(substitute(x)), ylab = deparse(substitute(y))): object 'AICs' not found

Auto-models

One model that has been proposed is the auto - model. This includes an auto-covariate in the model, which is an average of the neighboring values outcome: \[log(y) = X ' \beta + W y\]. This can be used for binomial or Poisson but people don’t really seem to like this.

#Creating an auto- model
#this involves creating an "auto-covariate"
spdat$lag_rate<-lag.listw(x=us.wt4, var=(spdat$deaths/spdat$population))
#binomial response
fit.bn.auto<-glm(cbind(deaths, population)~lag_rate+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), family=binomial, data=spdat)
summary(fit.bn.auto)
## 
## Call:
## glm(formula = cbind(deaths, population) ~ lag_rate + scale(ppersonspo) + 
##     scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 
##     7), family = binomial, data = spdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -65.262   -1.815    0.827    3.417   49.714  
## 
## Coefficients:
##                     Estimate Std. Error  z value Pr(>|z|)    
## (Intercept)       -4.9169098  0.0022164 -2218.38   <2e-16 ***
## lag_rate          27.4582212  0.2264485   121.26   <2e-16 ***
## scale(ppersonspo)  0.0750184  0.0004999   150.07   <2e-16 ***
## scale(p65plus)     0.1673719  0.0003404   491.75   <2e-16 ***
## scale(pblack_1)    0.0143084  0.0004011    35.67   <2e-16 ***
## scale(phisp)      -0.0685444  0.0003547  -193.22   <2e-16 ***
## I(RUCC >= 7)TRUE   0.0114674  0.0011193    10.24   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 816361  on 3066  degrees of freedom
## Residual deviance: 152677  on 3060  degrees of freedom
## AIC: 180527
## 
## Number of Fisher Scoring iterations: 4
#likewise for a poisson
spdat$lagcount<-lag.listw(x=us.wt4, var=spdat$deaths/spdat$E)
fit.poi.auto<-glm(deaths~lagcount+offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), family=poisson, data=spdat)
summary(fit.poi.auto)
## 
## Call:
## glm(formula = deaths ~ lagcount + offset(log(E)) + scale(ppersonspo) + 
##     scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 
##     7), family = poisson, data = spdat)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -65.506   -1.823    0.833    3.435   49.930  
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.1031135  0.0022067  -46.73   <2e-16 ***
## lagcount           0.2229784  0.0018295  121.88   <2e-16 ***
## scale(ppersonspo)  0.0750372  0.0004976  150.80   <2e-16 ***
## scale(p65plus)     0.1671073  0.0003381  494.22   <2e-16 ***
## scale(pblack_1)    0.0142652  0.0003993   35.72   <2e-16 ***
## scale(phisp)      -0.0686067  0.0003534 -194.16   <2e-16 ***
## I(RUCC >= 7)TRUE   0.0115633  0.0011135   10.38   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 823082  on 3066  degrees of freedom
## Residual deviance: 154032  on 3060  degrees of freedom
## AIC: 181913
## 
## Number of Fisher Scoring iterations: 4
#likewise for a negative binomial
fit.nb.auto<-glm.nb(deaths~lagcount+offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7),  data=spdat)
summary(fit.nb.auto)
## 
## Call:
## glm.nb(formula = deaths ~ lagcount + offset(log(E)) + scale(ppersonspo) + 
##     scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 
##     7), data = spdat, init.theta = 71.40563795, link = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -6.1425  -0.5415   0.0725   0.5995   5.7853  
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.057980   0.017359  -3.340 0.000838 ***
## lagcount           0.214177   0.013892  15.417  < 2e-16 ***
## scale(ppersonspo)  0.067834   0.003050  22.242  < 2e-16 ***
## scale(p65plus)     0.191950   0.002902  66.144  < 2e-16 ***
## scale(pblack_1)    0.013309   0.002697   4.936 7.99e-07 ***
## scale(phisp)      -0.045138   0.002680 -16.843  < 2e-16 ***
## I(RUCC >= 7)TRUE  -0.031785   0.005168  -6.151 7.71e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(71.4056) family taken to be 1)
## 
##     Null deviance: 13305.9  on 3066  degrees of freedom
## Residual deviance:  3164.4  on 3060  degrees of freedom
## AIC: 40503
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  71.41 
##           Std. Err.:  2.01 
## 
##  2 x log-likelihood:  -40486.87
lm.morantest(fit.bn.auto, listw=us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm(formula = cbind(deaths, population) ~ lag_rate +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), family = binomial, data = spdat)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 3.9955, p-value = 3.227e-05
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##      0.194127566     -0.128202728      0.006508063
lm.morantest(fit.poi.auto, listw=us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm(formula = deaths ~ lagcount + offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), family = poisson, data = spdat)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 16.1, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     1.937512e-01    -6.164148e-07     1.448306e-04
lm.morantest(fit.nb.auto, listw=us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = deaths ~ lagcount + offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), data = spdat, init.theta =
## 71.40563795, link = log)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 15.858, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     1.907982e-01    -2.208591e-05     1.447995e-04
spdat$resnb<-rstudent(fit.nb.auto)
spdat$resnb1<-rstudent(fit.nb.us)
spdat$resclean<-rstudent(clean.nb)
spplot(spdat,c("resnb1"),at=quantile(c(spdat$resnb1,spdat$resnb, spdat$resclean), p=seq(0,1,length.out = 7) ), col.regions=brewer.pal(n=7, "Greys"), main="Spatial Distribution Neg. Bin Residuals - GLM", names.attr=c("NB Regresssion"))

spplot(spdat,c("resnb"),at=quantile(c(spdat$resnb1,spdat$resnb, spdat$resclean), p=seq(0,1,length.out = 7) ), col.regions=brewer.pal(n=7, "Greys"), main="Spatial Distribution Neg. Bin Residuals - Auto count model", names.attr=c("Auto NB Regression"))

spplot(spdat,c("resclean"),at=quantile(c(spdat$resnb1,spdat$resnb, spdat$resclean), p=seq(0,1,length.out = 7) ), col.regions=brewer.pal(n=7, "Greys"), main="Spatial Distribution Neg. Bin Residuals - Filtered count model", names.attr=c( "Filtered NB"))

lm.morantest(fit.nb.us, listw=us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = deaths ~ offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), data = spdat, init.theta =
## 65.9681801, link = log)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 27.829, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     3.347412e-01    -2.216496e-05     1.447065e-04
lm.morantest(fit.nb.auto, listw=us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = deaths ~ lagcount + offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), data = spdat, init.theta =
## 71.40563795, link = log)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 15.858, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     1.907982e-01    -2.208591e-05     1.447995e-04
lm.morantest(clean.nb, listw=us.wt4)
## 
##  Global Moran I for regression residuals
## 
## data:  
## model: glm.nb(formula = deaths ~ offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7) + fitted(me.fit), data = spdat@data,
## init.theta = 81.00951534, link = log)
## weights: us.wt4
## 
## Moran I statistic standard deviate = 18.888, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Observed Moran I      Expectation         Variance 
##     2.283919e-01    -4.441347e-05     1.462704e-04

So we decrease the autocorrelation in the model, but not entirely.