Typically when we get data with a spa+al component, they are aggregated to some geography * Tract, block, county, ZIP code, etc. * These are typically either raw counts + Count of deaths, crimes, births, people with some characteristic (eg poverty) > numerator * Hopefully with a population at risk > denominator, or rates that someone has calculated + Mortality rates from NCHS or some census rate
GLM’s are a class of statitical models that link the mean of the distribuion to a linear combination of covariates by some form of link func)on. \(g(u) = g(E(Y)) = X ' \beta\) Where g() is the link function. The link function for a binomial distribution is typically the logit, or: \(g(u) = log(\frac{\pi}{1- \pi})\), or for the Poisson model, \(g(u) = log(\mu)\)
#I subset to only Texas counties this time, for the sake of computational time.
spdat<-readShapePoly("~/Google Drive/dem7263/data/usdata_mort.shp")
spdat<-spdat[substr(spdat$COFIPS,1,2)=="48",]
spdat$population<-as.numeric(as.character(spdat$Population))
spdat$deaths<-as.numeric(as.character(spdat$Deaths))
#Create a k=4 nearest neighbor set
us.nb4<-knearneigh(coordinates(spdat), k=4)
us.nb4<-knn2nb(us.nb4)
us.nb4<-make.sym.nb(us.nb4)
us.wt4<-nb2listw(us.nb4, style="W")
To use the Binomial model, you have two options Your observations are 1/0, we have done this before. This says that each area has the event or not OR Your observations are a combination of y and n, where y is a count of some event which has happened to some portion of the population at risk, n. For either of these, you are modeling the probability of observing an event, \(\pi\) or really the logit of the probability: \(log(\frac{\pi}{1- \pi}) = X' \beta\) * The good thing is that you have already done this! * All interpretation of parameters is the same as logistic regression! \(exp(\beta)\) * All model testing is the same too!
fit.bin.us<-glm(cbind(deaths, population)~scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat, family = binomial)
summary(fit.bin.us)
##
## Call:
## glm(formula = cbind(deaths, population) ~ scale(ppersonspo) +
## scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >=
## 7), family = binomial, data = spdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -27.6665 -1.7024 0.9846 4.0827 17.2912
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.631185 0.002016 -2296.70 <2e-16 ***
## scale(ppersonspo) 0.024078 0.002151 11.20 <2e-16 ***
## scale(p65plus) 0.343010 0.001851 185.29 <2e-16 ***
## scale(pblack_1) 0.034587 0.001286 26.90 <2e-16 ***
## scale(phisp) -0.059135 0.002439 -24.24 <2e-16 ***
## I(RUCC >= 7)TRUE -0.128607 0.005381 -23.90 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 79455.8 on 251 degrees of freedom
## Residual deviance: 9901.4 on 246 degrees of freedom
## AIC: 12098
##
## Number of Fisher Scoring iterations: 4
round(exp(coef(fit.bin.us)), 3)
## (Intercept) scale(ppersonspo) scale(p65plus) scale(pblack_1)
## 0.010 1.024 1.409 1.035
## scale(phisp) I(RUCC >= 7)TRUE
## 0.943 0.879
#This should be the crude death rate, per 1,000 persons
1000 * mean(fitted(fit.bin.us))
## [1] 9.701682
hist(1000*fitted(fit.bin.us), main="Texas Crude Death Rate Distribution from Binomial Model")
For the Poisson model, you observe some count of events in each area (crimes, deaths, etc), and you also have some population at risk in each area. This is generally not the same for each area, but could be, you want a Poisson rate \((\lambda)\), which is really the average count (y). We have several ways of modeling the Poisson count:
Pure count If each area has the same risk set, Rate, include an offset term in your model to incorporate unequal risk \(log(y)= X' \beta + log(n)\), where n is the population at risk in each area. This is called the offset term in the model. Standardized ratio incorporate differential exposure as an expected count \(log(y)= X' \beta + log(E)\). Again, all interpretation of parameters is done on a log scale, so \(exp(\beta)\) = % change in the mean, or % change in the rate, for a 1 unit change in X. All testing is done in the same manner.
#Form the expected count
spdat$E<-spdat$population * (sum(spdat$deaths)/sum(spdat$population))
fit.poi.us<-glm(deaths~offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat, family = poisson)
summary(fit.poi.us)
##
## Call:
## glm(formula = deaths ~ offset(log(E)) + scale(ppersonspo) + scale(p65plus) +
## scale(pblack_1) + scale(phisp) + I(RUCC >= 7), family = poisson,
## data = spdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -27.7942 -1.6952 0.9956 4.1290 17.3567
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.368526 0.002005 183.76 <2e-16 ***
## scale(ppersonspo) 0.024405 0.002142 11.39 <2e-16 ***
## scale(p65plus) 0.342493 0.001841 185.99 <2e-16 ***
## scale(pblack_1) 0.034490 0.001281 26.92 <2e-16 ***
## scale(phisp) -0.059474 0.002430 -24.48 <2e-16 ***
## I(RUCC >= 7)TRUE -0.128789 0.005355 -24.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 80038.6 on 251 degrees of freedom
## Residual deviance: 9990.2 on 246 degrees of freedom
## AIC: 12189
##
## Number of Fisher Scoring iterations: 4
round(exp(coef(fit.poi.us)), 3)
## (Intercept) scale(ppersonspo) scale(p65plus) scale(pblack_1)
## 1.446 1.025 1.408 1.035
## scale(phisp) I(RUCC >= 7)TRUE
## 0.942 0.879
hist(fitted(fit.poi.us)/spdat$E,main="Texas SMR Distribution from Poisson Model" )
In both the Binomial and Poisson case, if both n and y are large, the Normal model will often work This is generally not the case for rare events To do this, we assume the rate is our outcome, not the count, and assume Gaussian errors So our model is:\(frac{y}{n} = X ' \beta + e_i\) This is just OLS! Or the more exotic SAR specifications we saw last week. Of course, you need to evaluate the Normal model’s assumptions and see if you can use it. The other models also need to be examined for linearity, but since they don’t have the same structure as the Normal model, they have fewer things to check.
fit.g.us<-glm(I(deaths/population)~scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat, family = gaussian)
summary(fit.g.us)
##
## Call:
## glm(formula = I(deaths/population) ~ scale(ppersonspo) + scale(p65plus) +
## scale(pblack_1) + scale(phisp) + I(RUCC >= 7), family = gaussian,
## data = spdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.0042947 -0.0005506 0.0000912 0.0006799 0.0029490
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.006e-02 9.931e-05 101.332 < 2e-16 ***
## scale(ppersonspo) 2.757e-04 1.210e-04 2.278 0.023568 *
## scale(p65plus) 2.437e-03 9.636e-05 25.292 < 2e-16 ***
## scale(pblack_1) 2.196e-04 9.112e-05 2.410 0.016702 *
## scale(phisp) -5.250e-04 1.444e-04 -3.636 0.000337 ***
## I(RUCC >= 7)TRUE -3.663e-04 1.676e-04 -2.186 0.029786 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 1.232979e-06)
##
## Null deviance: 0.00196081 on 251 degrees of freedom
## Residual deviance: 0.00030331 on 246 degrees of freedom
## AIC: -2705.7
##
## Number of Fisher Scoring iterations: 2
round(1000*coef(fit.g.us), 3)
## (Intercept) scale(ppersonspo) scale(p65plus) scale(pblack_1)
## 10.064 0.276 2.437 0.220
## scale(phisp) I(RUCC >= 7)TRUE
## -0.525 -0.366
hist(1000*fitted(fit.g.us),main="Texas Crude Death Rate Distribution from Gaussian Model" )
When using the Binomial and Poisson GLM on areal data, you often run into overdispersion * What’s overdispersion? Class? For Binomial and Poisson distributions, the mean and the variance are functions of one another (variance = mean for Poisson). So when you have more variability than you expect in your data, you have overdispersion. This basically says your data don’t fit your model, and is a problem because overdispersion leads to standard errors for our model parameters that are too small. But, we can fit other models that don’t make such assumptions, or allow there to be more variability.
An easy check on this is to compare the residual deviance to the residual degrees of freedom. They ratio should be 1 if the model fits the data.
For the Poisson and Binomial, we can fit a “quasi” distribution that adds an extra parameter to allow the mean-variance relationship to not be constant. For e.g. in the Binomial we get: \(var(Y) = n*p*(1-p)*\phi\), instead of \(var(Y) = n*p*(1-p)\) For Poisson we get: \(var(Y) = \lambda * \phi\), instead of \(var(Y) = λ\)
This allows us to include a rough proxy for a “dispersion parameter” for the distribution. Naturally this is fixed at 1 for basic models, and estimated in the quasi models, we can look to see if is much bigger than 1. If overdispersion is present and not accounted for you could identify a relationship as being significant when it is not!
fit.qbin.us<-glm(cbind(deaths, population)~scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat, family = quasibinomial)
summary(fit.qbin.us)
##
## Call:
## glm(formula = cbind(deaths, population) ~ scale(ppersonspo) +
## scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >=
## 7), family = quasibinomial, data = spdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -27.6665 -1.7024 0.9846 4.0827 17.2912
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.631185 0.012714 -364.269 < 2e-16 ***
## scale(ppersonspo) 0.024078 0.013561 1.776 0.077033 .
## scale(p65plus) 0.343010 0.011672 29.387 < 2e-16 ***
## scale(pblack_1) 0.034587 0.008108 4.266 2.84e-05 ***
## scale(phisp) -0.059135 0.015380 -3.845 0.000154 ***
## I(RUCC >= 7)TRUE -0.128607 0.033929 -3.790 0.000189 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 39.7523)
##
## Null deviance: 79455.8 on 251 degrees of freedom
## Residual deviance: 9901.4 on 246 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
fit.qpoi.us<-glm(deaths~offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat, family = quasipoisson)
summary(fit.qpoi.us)
##
## Call:
## glm(formula = deaths ~ offset(log(E)) + scale(ppersonspo) + scale(p65plus) +
## scale(pblack_1) + scale(phisp) + I(RUCC >= 7), family = quasipoisson,
## data = spdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -27.7942 -1.6952 0.9956 4.1290 17.3567
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.368526 0.012699 29.021 < 2e-16 ***
## scale(ppersonspo) 0.024405 0.013561 1.800 0.073141 .
## scale(p65plus) 0.342493 0.011660 29.374 < 2e-16 ***
## scale(pblack_1) 0.034490 0.008111 4.252 3.01e-05 ***
## scale(phisp) -0.059474 0.015386 -3.866 0.000142 ***
## I(RUCC >= 7)TRUE -0.128789 0.033904 -3.799 0.000183 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 40.09362)
##
## Null deviance: 80038.6 on 251 degrees of freedom
## Residual deviance: 9990.2 on 246 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
round(exp(coef(fit.poi.us)), 3)
## (Intercept) scale(ppersonspo) scale(p65plus) scale(pblack_1)
## 1.446 1.025 1.408 1.035
## scale(phisp) I(RUCC >= 7)TRUE
## 0.942 0.879
Notice the dispersion parameters in both of these fits, both were around 50, they are assumed to be 1, although substantively, there was no difference in the interpretation of the model effects.
library(MASS)
fit.nb.us<-glm.nb(deaths~offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat)
summary(fit.nb.us)
##
## Call:
## glm.nb(formula = deaths ~ offset(log(E)) + scale(ppersonspo) +
## scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >=
## 7), data = spdat, init.theta = 70.16980546, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.3010 -0.6124 0.0721 0.6028 2.2726
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.36759 0.01111 33.092 < 2e-16 ***
## scale(ppersonspo) 0.03208 0.01433 2.238 0.025192 *
## scale(p65plus) 0.25781 0.01124 22.929 < 2e-16 ***
## scale(pblack_1) 0.02178 0.01036 2.102 0.035571 *
## scale(phisp) -0.05766 0.01719 -3.354 0.000796 ***
## I(RUCC >= 7)TRUE -0.03978 0.01966 -2.023 0.043096 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(70.1698) family taken to be 1)
##
## Null deviance: 1312.84 on 251 degrees of freedom
## Residual deviance: 251.57 on 246 degrees of freedom
## AIC: 3144.8
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 70.17
## Std. Err.: 6.99
##
## 2 x log-likelihood: -3130.805
round(exp(coef(fit.nb.us)), 3)
## (Intercept) scale(ppersonspo) scale(p65plus) scale(pblack_1)
## 1.444 1.033 1.294 1.022
## scale(phisp) I(RUCC >= 7)TRUE
## 0.944 0.961
hist(fitted(fit.nb.us)/spdat$E,main="Texas SMR Distribution from Negative Binomial Model" )
library(aod)
fit.bbin.us<-betabin(cbind(deaths, population)~scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), random=~1, spdat@data, method="SANN")
summary(fit.bbin.us)
## Beta-binomial model
## -------------------
## betabin(formula = cbind(deaths, population) ~ scale(ppersonspo) +
## scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >=
## 7), random = ~1, data = spdat@data, ... = pairlist(method = "SANN"))
##
## Convergence was obtained after 2000 iterations.
##
## Fixed-effect coefficients:
## Estimate Std. Error z value Pr(> |z|)
## (Intercept) -4.484e+00 2.309e-02 -1.942e+02 0.000e+00
## scale(ppersonspo) 8.867e-02 2.443e-02 3.629e+00 2.843e-04
## scale(p65plus) 8.415e-03 2.022e-02 4.162e-01 6.773e-01
## scale(pblack_1) -1.971e-01 2.386e-02 -8.263e+00 2.220e-16
## scale(phisp) -1.050e-01 3.055e-02 -3.436e+00 5.914e-04
## I(RUCC >= 7)TRUE 1.259e-01 3.627e-02 3.471e+00 5.183e-04
##
## Overdispersion coefficients:
## Estimate Std. Error z value Pr(> z)
## phi.(Intercept) 8.394e-04 2e-13 0e+00 1e+00
##
## Log-likelihood statistics
## Log-lik nbpar df res. Deviance AIC AICc
## -1.856e+03 7 245 1.528e+03 3.726e+03 3.727e+03
hist(1000*fitted(fit.bbin.us),main="Texas Crude Death Rate Distribution from Beta Binomial Model" )
We can also consider using a random-effect model to help with overdispersion. One way of doing this is to add an observation-level random effect to our model, so now we have: \(log(\frac{\pi}{1- \pi}) = X' \beta + u_i\) with \(u_i \sim N (0, \sigma^2)\) or \(log(\lambda) = X' \beta + u_i\) with \(u_i \sim N (0, \sigma^2)\).
library(lme4)
fit.hpoi.us<-glmer(deaths~offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7)+(1|COFIPS), spdat@data, family=poisson)
summary(fit.hpoi.us)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: poisson ( log )
## Formula: deaths ~ offset(log(E)) + scale(ppersonspo) + scale(p65plus) +
## scale(pblack_1) + scale(phisp) + I(RUCC >= 7) + (1 | COFIPS)
## Data: spdat@data
##
## AIC BIC logLik deviance df.resid
## 3152.1 3176.8 -1569.0 3138.1 245
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.47313 -0.11097 0.02923 0.16553 1.18443
##
## Random effects:
## Groups Name Variance Std.Dev.
## COFIPS (Intercept) 0.01475 0.1215
## Number of obs: 252, groups: COFIPS, 252
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.36071 0.01128 31.96 < 2e-16 ***
## scale(ppersonspo) 0.03168 0.01458 2.17 0.029754 *
## scale(p65plus) 0.25972 0.01145 22.69 < 2e-16 ***
## scale(pblack_1) 0.02317 0.01053 2.20 0.027869 *
## scale(phisp) -0.05786 0.01746 -3.31 0.000922 ***
## I(RUCC >= 7)TRUE -0.04065 0.01995 -2.04 0.041570 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) scl(pp) sc(65) sc(_1) scl(ph)
## scl(pprsns) 0.057
## scl(p65pls) 0.283 -0.432
## scl(pblc_1) -0.149 -0.475 0.349
## scale(phsp) 0.011 -0.821 0.578 0.609
## I(RUCC>=7)T -0.688 -0.087 -0.380 0.180 -0.003
round(exp(fixef(fit.hpoi.us)), 3)
## (Intercept) scale(ppersonspo) scale(p65plus) scale(pblack_1)
## 1.434 1.032 1.297 1.023
## scale(phisp) I(RUCC >= 7)TRUE
## 0.944 0.960
hist(fitted(fit.hpoi.us)/spdat$E,main="Texas SMR Distribution from Hierarchical Poisson Model" )
fit.hbin.us<-glmer(cbind(deaths, population)~scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7)+(1|COFIPS), spdat@data,family=binomial )
summary(fit.hbin.us)
## Generalized linear mixed model fit by maximum likelihood (Laplace
## Approximation) [glmerMod]
## Family: binomial ( logit )
## Formula: cbind(deaths, population) ~ scale(ppersonspo) + scale(p65plus) +
## scale(pblack_1) + scale(phisp) + I(RUCC >= 7) + (1 | COFIPS)
## Data: spdat@data
##
## AIC BIC logLik deviance df.resid
## 3147.2 3171.9 -1566.6 3133.2 245
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -1.47874 -0.11176 0.02936 0.16631 1.18769
##
## Random effects:
## Groups Name Variance Std.Dev.
## COFIPS (Intercept) 0.01474 0.1214
## Number of obs: 252, groups: COFIPS, 252
##
## Fixed effects:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.63933 0.01128 -411.1 < 2e-16 ***
## scale(ppersonspo) 0.03171 0.01458 2.2 0.029617 *
## scale(p65plus) 0.25976 0.01145 22.7 < 2e-16 ***
## scale(pblack_1) 0.02315 0.01053 2.2 0.027948 *
## scale(phisp) -0.05789 0.01746 -3.3 0.000917 ***
## I(RUCC >= 7)TRUE -0.04064 0.01995 -2.0 0.041675 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Correlation of Fixed Effects:
## (Intr) scl(pp) sc(65) sc(_1) scl(ph)
## scl(pprsns) 0.057
## scl(p65pls) 0.283 -0.432
## scl(pblc_1) -0.149 -0.475 0.349
## scale(phsp) 0.011 -0.821 0.578 0.609
## I(RUCC>=7)T -0.688 -0.087 -0.380 0.180 -0.003
hist(1000*fitted(fit.hbin.us),main="Texas Crude Death Rate Distribution from Hierarchical Binomial Model" )
So, who’s best here? AIC is a good rule:
test<-AIC(fit.bbin.us)
AICs<-c(AIC(fit.bin.us),AIC(fit.poi.us),AIC(fit.g.us), AIC(fit.nb.us), test@istats$AIC, AIC(fit.hbin.us),AIC(fit.hpoi.us))
plot(AICs, type="l", lwd=1.5, xaxt="n", xlab="")
axis(1, at=1:7,labels=F) #7= number of models
labels<-c("Bin", "Poi","Gau", "NBin","BetaBin", "HierBin", "HierPoi" )
text(1:7, par("usr")[3]-.25, srt=45, adj=1, labels=labels, xpd=T)
mtext(side=1, text="Model Specification", line=3)
symbols(x= which.min(AICs), y=AICs[which.min(AICs)], circles=1, fg=2,lwd=2,add=T)
While we don’t have the same assumption about residuals for GLMs as we do for the Gaussian (OLS) model, we still may be curious if our model’s residuals display spatial clustering. While it’s not recommended, we can use the lm.morantest() function to examine correlation among the residuals.
lm.morantest(fit.poi.us, listw = us.wt4)
##
## Global Moran's I for regression residuals
##
## data:
## model: glm(formula = deaths ~ offset(log(E)) + scale(ppersonspo) +
## scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 7),
## family = poisson, data = spdat)
## weights: us.wt4
##
## Moran I statistic standard deviate = 3.8068, p-value = 7.039e-05
## alternative hypothesis: greater
## sample estimates:
## Observed Moran's I Expectation Variance
## 1.625924e-01 -8.222115e-06 1.824432e-03
lm.morantest(fit.bin.us, listw = us.wt4)
##
## Global Moran's I for regression residuals
##
## data:
## model: glm(formula = cbind(deaths, population) ~ scale(ppersonspo)
## + scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >= 7),
## family = binomial, data = spdat)
## weights: us.wt4
##
## Moran I statistic standard deviate = 1.6679, p-value = 0.04766
## alternative hypothesis: greater
## sample estimates:
## Observed Moran's I Expectation Variance
## 0.1623571 -1.2045497 0.6716097
lm.morantest(fit.nb.us, listw = us.wt4)
##
## Global Moran's I for regression residuals
##
## data:
## model: glm.nb(formula = deaths ~ offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), data = spdat, init.theta =
## 70.16980546, link = log)
## weights: us.wt4
##
## Moran I statistic standard deviate = 3.9806, p-value = 3.437e-05
## alternative hypothesis: greater
## sample estimates:
## Observed Moran's I Expectation Variance
## 0.1696496093 -0.0002054652 0.0018208068
Inherently, we are kind of doing spatial analysis since we are using areal units as our observations, right Paul? There are other options that have been explored that promote not modeling space explicityly but instead filtering out the nuisance of spatial autocorrelation. Spatial Filtering is a method Griffith and Peres-Neto (2006) and Dray S, Legendre P and Peres-Neto PR (2005) both describe this method. The Moran eigenvector filtering function is intended to remove spatial autocorrelation from the residuals of (generalised) linear models. Tiefelsdorf M, Griffith DA. (2007) also describe a method of using eigenvectors of the spatial weight matrix to filter out autocorrelation in a model.
Typically this is done by 1) forming the spatial weight matrix, 2)compute the eigenvectors of the centered weight matrix: \((I-1 1 ' /n)W (I-1 1 ' /n)\), where I is the identity matrix, and 1 is a vector of 1’s. 3) select the eigenvectors to be included in a linear model as spatial predictors, so that the level of autocorrelation in the model is minimized, until Moran’s I is nonsigificant. We can do this using the ME() in the spdep library.
ME() example
#you have to provide the theta value for the NB family, I got this from the model fit above.
me.fit<-ME(deaths~offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), spdat@data, family=negative.binomial(70.17), listw = us.wt4, verbose=T,alpha=.05 )
## eV[,161], I: 0.1411555 ZI: NA, pr(ZI): 0.01
## eV[,234], I: 0.1165919 ZI: NA, pr(ZI): 0.01
## eV[,39], I: 0.0982866 ZI: NA, pr(ZI): 0.06
me.fit
## Eigenvector ZI pr(ZI)
## 0 NA NA 0.01
## 1 161 NA 0.01
## 2 234 NA 0.01
## 3 39 NA 0.06
fits<-data.frame(fitted(me.fit))
spdat$me161<-fits$vec161
spdat$me234<-fits$vec234
spplot(spdat, c("me161", "me234"), at=quantile(c(spdat$me161, spdat$me234), p=seq(0,1,length.out = 7)), col.regions=brewer.pal(n=7, "Greys"), main="Spatial Distribution Moran Eigenvectors 161 & 234")
clean.nb<-glm.nb(deaths~offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7)+fitted(me.fit), spdat@data)
summary(clean.nb)
##
## Call:
## glm.nb(formula = deaths ~ offset(log(E)) + scale(ppersonspo) +
## scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >=
## 7) + fitted(me.fit), data = spdat@data, init.theta = 73.40428524,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1490 -0.5935 0.1117 0.6388 2.2814
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.36388 0.01095 33.222 < 2e-16 ***
## scale(ppersonspo) 0.03401 0.01409 2.414 0.015781 *
## scale(p65plus) 0.25258 0.01121 22.523 < 2e-16 ***
## scale(pblack_1) 0.02145 0.01020 2.103 0.035491 *
## scale(phisp) -0.06034 0.01699 -3.553 0.000381 ***
## I(RUCC >= 7)TRUE -0.03219 0.01951 -1.650 0.099005 .
## fitted(me.fit)vec161 -0.12309 0.12241 -1.006 0.314644
## fitted(me.fit)vec234 0.19407 0.12088 1.606 0.108374
## fitted(me.fit)vec39 0.33911 0.12335 2.749 0.005974 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(73.4043) family taken to be 1)
##
## Null deviance: 1367.15 on 251 degrees of freedom
## Residual deviance: 251.14 on 243 degrees of freedom
## AIC: 3140.5
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 73.40
## Std. Err.: 7.34
##
## 2 x log-likelihood: -3120.455
lm.morantest(clean.nb, listw=us.wt4)
##
## Global Moran's I for regression residuals
##
## data:
## model: glm.nb(formula = deaths ~ offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7) + fitted(me.fit), data = spdat@data,
## init.theta = 73.40428524, link = log)
## weights: us.wt4
##
## Moran I statistic standard deviate = 3.8382, p-value = 6.198e-05
## alternative hypothesis: greater
## sample estimates:
## Observed Moran's I Expectation Variance
## 0.1655784395 -0.0001879006 0.0018652724
One model that has been proposed is the auto - model. This includes an auto-covariate in the model, which is an average of the neighboring values outcome: \(log(y) = X ' \beta + W y\). This can be used for binomial or Poisson but people don’t really seem to like this.
#Creating an auto- model
#this involves creating an "auto-covariate"
spdat$lag_rate<-lag.listw(x=us.wt4, var=(spdat$deaths/spdat$population))
#binomial response
fit.bn.auto<-glm(cbind(deaths, population)~lag_rate+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), family=binomial, data=spdat)
summary(fit.bn.auto)
##
## Call:
## glm(formula = cbind(deaths, population) ~ lag_rate + scale(ppersonspo) +
## scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >=
## 7), family = binomial, data = spdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -28.156 -1.863 1.055 3.645 18.616
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.776658 0.009746 -490.136 < 2e-16 ***
## lag_rate 14.901955 0.975323 15.279 < 2e-16 ***
## scale(ppersonspo) 0.011381 0.002305 4.938 7.88e-07 ***
## scale(p65plus) 0.328584 0.002081 157.927 < 2e-16 ***
## scale(pblack_1) 0.038738 0.001315 29.465 < 2e-16 ***
## scale(phisp) -0.040886 0.002712 -15.076 < 2e-16 ***
## I(RUCC >= 7)TRUE -0.131294 0.005379 -24.407 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 79455.8 on 251 degrees of freedom
## Residual deviance: 9668.5 on 245 degrees of freedom
## AIC: 11867
##
## Number of Fisher Scoring iterations: 4
#likewise for a poisson
spdat$lagcount<-lag.listw(x=us.wt4, var=spdat$deaths/spdat$E)
fit.poi.auto<-glm(deaths~lagcount+offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), family=poisson, data=spdat)
summary(fit.poi.auto)
##
## Call:
## glm(formula = deaths ~ lagcount + offset(log(E)) + scale(ppersonspo) +
## scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >=
## 7), family = poisson, data = spdat)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -28.249 -1.855 1.072 3.659 18.694
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.222329 0.009703 22.912 < 2e-16 ***
## lagcount 0.100923 0.006544 15.421 < 2e-16 ***
## scale(ppersonspo) 0.011612 0.002296 5.058 4.23e-07 ***
## scale(p65plus) 0.328030 0.002069 158.538 < 2e-16 ***
## scale(pblack_1) 0.038660 0.001310 29.518 < 2e-16 ***
## scale(phisp) -0.041102 0.002702 -15.211 < 2e-16 ***
## I(RUCC >= 7)TRUE -0.131483 0.005353 -24.564 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 80039 on 251 degrees of freedom
## Residual deviance: 9753 on 245 degrees of freedom
## AIC: 11954
##
## Number of Fisher Scoring iterations: 4
#likewise for a negative binomial
fit.nb.auto<-glm.nb(deaths~lagcount+offset(log(E))+scale(ppersonspo)+scale(p65plus)+scale(pblack_1)+scale(phisp)+I(RUCC>=7), data=spdat)
summary(fit.nb.auto)
##
## Call:
## glm.nb(formula = deaths ~ lagcount + offset(log(E)) + scale(ppersonspo) +
## scale(p65plus) + scale(pblack_1) + scale(phisp) + I(RUCC >=
## 7), data = spdat, init.theta = 72.068416, link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1730 -0.5644 0.0412 0.6208 2.1977
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.22353 0.06025 3.710 0.000207 ***
## lagcount 0.09948 0.04096 2.429 0.015158 *
## scale(ppersonspo) 0.02508 0.01451 1.729 0.083893 .
## scale(p65plus) 0.24730 0.01190 20.784 < 2e-16 ***
## scale(pblack_1) 0.02231 0.01024 2.178 0.029393 *
## scale(phisp) -0.04223 0.01824 -2.315 0.020624 *
## I(RUCC >= 7)TRUE -0.04668 0.01965 -2.376 0.017511 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(72.0684) family taken to be 1)
##
## Null deviance: 1344.77 on 251 degrees of freedom
## Residual deviance: 251.61 on 245 degrees of freedom
## AIC: 3141
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 72.07
## Std. Err.: 7.20
##
## 2 x log-likelihood: -3124.969
lm.morantest(fit.bn.auto, listw=us.wt4)
##
## Global Moran's I for regression residuals
##
## data:
## model: glm(formula = cbind(deaths, population) ~ lag_rate +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), family = binomial, data = spdat)
## weights: us.wt4
##
## Moran I statistic standard deviate = 1.5315, p-value = 0.06283
## alternative hypothesis: greater
## sample estimates:
## Observed Moran's I Expectation Variance
## 0.1490254 -1.1968593 0.7723265
lm.morantest(fit.poi.auto, listw=us.wt4)
##
## Global Moran's I for regression residuals
##
## data:
## model: glm(formula = deaths ~ lagcount + offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), family = poisson, data = spdat)
## weights: us.wt4
##
## Moran I statistic standard deviate = 3.4757, p-value = 0.0002548
## alternative hypothesis: greater
## sample estimates:
## Observed Moran's I Expectation Variance
## 1.490526e-01 -8.185909e-06 1.839284e-03
lm.morantest(fit.nb.auto, listw=us.wt4)
##
## Global Moran's I for regression residuals
##
## data:
## model: glm.nb(formula = deaths ~ lagcount + offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), data = spdat, init.theta = 72.068416,
## link = log)
## weights: us.wt4
##
## Moran I statistic standard deviate = 3.0207, p-value = 0.001261
## alternative hypothesis: greater
## sample estimates:
## Observed Moran's I Expectation Variance
## 0.1291953881 -0.0002095377 0.0018352684
spdat$resnb<-rstudent(fit.nb.auto)
spdat$resnb1<-rstudent(fit.nb.us)
spdat$resclean<-rstudent(clean.nb)
spplot(spdat,c("resnb1", "resnb", "resclean"),at=quantile(c(spdat$resnb1,spdat$resnb, spdat$resclean), p=seq(0,1,length.out = 7) ), col.regions=brewer.pal(n=7, "Greys"), main="Spatial Distribution Neg. Bin Residuals", names.attr=c("NB Regresssion","Auto NB Regression", "Filtered NB"))
lm.morantest(fit.nb.us, listw=us.wt4)
##
## Global Moran's I for regression residuals
##
## data:
## model: glm.nb(formula = deaths ~ offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), data = spdat, init.theta =
## 70.16980546, link = log)
## weights: us.wt4
##
## Moran I statistic standard deviate = 3.9806, p-value = 3.437e-05
## alternative hypothesis: greater
## sample estimates:
## Observed Moran's I Expectation Variance
## 0.1696496093 -0.0002054652 0.0018208068
lm.morantest(fit.nb.auto, listw=us.wt4)
##
## Global Moran's I for regression residuals
##
## data:
## model: glm.nb(formula = deaths ~ lagcount + offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7), data = spdat, init.theta = 72.068416,
## link = log)
## weights: us.wt4
##
## Moran I statistic standard deviate = 3.0207, p-value = 0.001261
## alternative hypothesis: greater
## sample estimates:
## Observed Moran's I Expectation Variance
## 0.1291953881 -0.0002095377 0.0018352684
lm.morantest(clean.nb, listw=us.wt4)
##
## Global Moran's I for regression residuals
##
## data:
## model: glm.nb(formula = deaths ~ offset(log(E)) +
## scale(ppersonspo) + scale(p65plus) + scale(pblack_1) +
## scale(phisp) + I(RUCC >= 7) + fitted(me.fit), data = spdat@data,
## init.theta = 73.40428524, link = log)
## weights: us.wt4
##
## Moran I statistic standard deviate = 3.8382, p-value = 6.198e-05
## alternative hypothesis: greater
## sample estimates:
## Observed Moran's I Expectation Variance
## 0.1655784395 -0.0001879006 0.0018652724
We almost get rid of the autocorrelation in the binomial model using the autoregressive rate.