This document explains the application of Spatial Autoregressive (SAR) and Conditional Autoregressive (CAR) models to analyze North Carolina birth rate data. We examine the relationship between non-white birth rates and overall birth rates using spatial econometric techniques.
library(spdep)
library(sf)
library(sp)
library(ggplot2)
library(gridExtra)
library(tmap)
library(spatialreg)
# Load North Carolina SIDS data
nc.sids <- st_read(system.file("shapes/sids.gpkg", package = "spData"), quiet = TRUE)
# Check the class of the object
class(nc.sids)
## [1] "sf" "data.frame"
# View the data
head(nc.sids, 3)
## Simple feature collection with 3 features and 22 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -81.74107 ymin: 36.23388 xmax: -80.43531 ymax: 36.58965
## Geodetic CRS: NAD27
## CNTY_ID AREA PERIMETER CNTY_ NAME FIPS FIPSNO CRESS_ID BIR74 SID74
## 1 1825 0.114 1.442 1825 Ashe 37009 37009 5 1091 1
## 2 1827 0.061 1.231 1827 Alleghany 37005 37005 3 487 0
## 3 1828 0.143 1.630 1828 Surry 37171 37171 86 3188 5
## NWBIR74 BIR79 SID79 NWBIR79 east north x y lon lat L_id
## 1 10 1364 0 19 164 176 -81.67 4052.29 -81.48594 36.43940 1
## 2 10 542 3 12 183 182 -50.06 4059.70 -81.14061 36.52443 1
## 3 208 3616 6 260 204 174 -16.14 4043.76 -80.75312 36.40033 1
## M_id geom
## 1 2 MULTIPOLYGON (((-81.47276 3...
## 2 2 MULTIPOLYGON (((-81.23989 3...
## 3 2 MULTIPOLYGON (((-80.45634 3...
# Check column names
names(nc.sids)
## [1] "CNTY_ID" "AREA" "PERIMETER" "CNTY_" "NAME" "FIPS"
## [7] "FIPSNO" "CRESS_ID" "BIR74" "SID74" "NWBIR74" "BIR79"
## [13] "SID79" "NWBIR79" "east" "north" "x" "y"
## [19] "lon" "lat" "L_id" "M_id" "geom"
The Freeman-Tukey transformation is applied to stabilize variance and normalize the distribution of non-white birth rates:
\[ \text{nwbir.FT} = \sqrt{1000} \left( \sqrt{\frac{\text{NWBIR79}}{\text{BIR79}}} + \sqrt{\frac{\text{NWBIR79} + 1}{\text{BIR79}}} \right) \]
# Apply Freeman-Tukey transformation for non-white birth rates
nc.sids$nwbir.FT <- sqrt(1000) * (sqrt(nc.sids$NWBIR79/nc.sids$BIR79) +
sqrt((nc.sids$NWBIR79 + 1)/nc.sids$BIR79))
Similarly, we need to transform the overall birth rates
(rates.FT). Based on the original code,
rates.FT is the transformed overall birth rates:
\[ \text{rates.FT} = \sqrt{1000} \left( \sqrt{\frac{\text{SID79}}{\text{BIR79}}} + \sqrt{\frac{\text{SID79} + 1}{\text{BIR79}}} \right) \]
# Create the overall birth rates transformation
# Using SID79 as the count variable for overall rates
nc.sids$rates.FT <- sqrt(1000) * (sqrt(nc.sids$SID79/nc.sids$BIR79) +
sqrt((nc.sids$SID79 + 1)/nc.sids$BIR79))
# Check the created variables
head(nc.sids[, c("BIR79", "SID79", "NWBIR79", "rates.FT", "nwbir.FT")])
## Simple feature collection with 6 features and 5 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -81.74107 ymin: 36.07282 xmax: -75.77316 ymax: 36.58965
## Geodetic CRS: NAD27
## BIR79 SID79 NWBIR79 rates.FT nwbir.FT geom
## 1 1364 0 19 0.8562347 7.561438 MULTIPOLYGON (((-81.47276 3...
## 2 542 3 12 5.0692990 9.602812 MULTIPOLYGON (((-81.23989 3...
## 3 3616 6 260 2.6794800 16.975378 MULTIPOLYGON (((-80.45634 3...
## 4 830 2 145 3.4534738 26.480233 MULTIPOLYGON (((-76.00897 3...
## 5 1606 3 1197 2.9449286 54.612867 MULTIPOLYGON (((-77.21767 3...
## 6 1838 5 1237 3.4561178 51.895516 MULTIPOLYGON (((-76.74506 3...
# Convert to sf object for spatial operations
nc.sids.sf <- st_as_sf(nc.sids)
# Check the result
class(nc.sids.sf)
## [1] "sf" "data.frame"
head(nc.sids.sf, 3)
## Simple feature collection with 3 features and 24 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -81.74107 ymin: 36.23388 xmax: -80.43531 ymax: 36.58965
## Geodetic CRS: NAD27
## CNTY_ID AREA PERIMETER CNTY_ NAME FIPS FIPSNO CRESS_ID BIR74 SID74
## 1 1825 0.114 1.442 1825 Ashe 37009 37009 5 1091 1
## 2 1827 0.061 1.231 1827 Alleghany 37005 37005 3 487 0
## 3 1828 0.143 1.630 1828 Surry 37171 37171 86 3188 5
## NWBIR74 BIR79 SID79 NWBIR79 east north x y lon lat L_id
## 1 10 1364 0 19 164 176 -81.67 4052.29 -81.48594 36.43940 1
## 2 10 542 3 12 183 182 -50.06 4059.70 -81.14061 36.52443 1
## 3 208 3616 6 260 204 174 -16.14 4043.76 -80.75312 36.40033 1
## M_id geom nwbir.FT rates.FT
## 1 2 MULTIPOLYGON (((-81.47276 3... 7.561438 0.8562347
## 2 2 MULTIPOLYGON (((-81.23989 3... 9.602812 5.0692990
## 3 2 MULTIPOLYGON (((-80.45634 3... 16.975378 2.6794800
# Verify rates.FT is in the sf object
names(nc.sids.sf)
## [1] "CNTY_ID" "AREA" "PERIMETER" "CNTY_" "NAME" "FIPS"
## [7] "FIPSNO" "CRESS_ID" "BIR74" "SID74" "NWBIR74" "BIR79"
## [13] "SID79" "NWBIR79" "east" "north" "x" "y"
## [19] "lon" "lat" "L_id" "M_id" "geom" "nwbir.FT"
## [25] "rates.FT"
# Create queen contiguity weights from the sf object
ncCC89.nb <- poly2nb(nc.sids.sf, queen = TRUE)
# Convert to listw object with row standardization (style = "W")
ncCC89.listw <- nb2listw(ncCC89.nb, style = "W")
# View the spatial weights object
ncCC89.listw
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 490
## Percentage nonzero weights: 4.9
## Average number of links: 4.9
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 100 10000 100 44.65023 410.4746
# Summary information
summary(ncCC89.listw)
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 100
## Number of nonzero links: 490
## Percentage nonzero weights: 4.9
## Average number of links: 4.9
## Link number distribution:
##
## 2 3 4 5 6 7 8 9
## 8 15 17 23 19 14 2 2
## 8 least connected regions:
## 4 21 45 56 77 80 90 99 with 2 links
## 2 most connected regions:
## 39 67 with 9 links
##
## Weights style: W
## Weights constants summary:
## n nn S0 S1 S2
## W 100 10000 100 44.65023 410.4746
The SAR error model (estimated using spautolm() with
family = "SAR") is specified as:
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} \]
\[ \mathbf{u} = \lambda \mathbf{W} \mathbf{u} + \boldsymbol{\varepsilon} \]
\[ \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}) \]
where: - \(\mathbf{y}\) = vector of transformed overall birth rates (\(\text{rates.FT}\)) - \(\mathbf{X}\) = design matrix containing the intercept and \(\text{nwbir.FT}\) - \(\boldsymbol{\beta}\) = vector of regression coefficients - \(\mathbf{u}\) = spatial error term - \(\lambda\) = spatial autoregressive parameter (measures strength of spatial dependence) - \(\mathbf{W}\) = spatial weights matrix (\(\text{ncCC89.listw}\)) - \(\boldsymbol{\varepsilon}\) = i.i.d. normally distributed error term
The model can be expressed in reduced form as:
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + (\mathbf{I} - \lambda \mathbf{W})^{-1} \boldsymbol{\varepsilon} \]
The log-likelihood function for the SAR error model is:
\[ \ell(\boldsymbol{\beta}, \lambda, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) + \ln|\mathbf{I} - \lambda\mathbf{W}| - \frac{1}{2\sigma^2} \left[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{I} - \lambda\mathbf{W})^\top (\mathbf{I} - \lambda\mathbf{W}) (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right] \]
# Estimate SAR error model
nc.sids.sar.out <- spautolm(rates.FT ~ nwbir.FT,
data = nc.sids,
family = "SAR",
listw = ncCC89.listw,
zero.policy = TRUE)
# View results
summary(nc.sids.sar.out)
##
## Call:
## spautolm(formula = rates.FT ~ nwbir.FT, data = nc.sids, listw = ncCC89.listw,
## family = "SAR", zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.310193 -0.456265 0.042259 0.512063 2.555381
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.5481233 0.2615757 9.7414 <2e-16
## nwbir.FT 0.0110715 0.0073343 1.5095 0.1312
##
## Lambda: 0.24579 LR test value: 3.0723 p-value: 0.079635
## Numerical Hessian standard error of lambda: 0.1356
##
## Log likelihood: -123.2626
## ML residual variance (sigma squared): 0.67958, (sigma: 0.82437)
## Number of observations: 100
## Number of parameters estimated: 4
## AIC: 254.53
# Extract fitted values and add to both data frames
nc.sids$fitted.sar <- fitted(nc.sids.sar.out)
nc.sids.sf$fitted.sar <- fitted(nc.sids.sar.out)
The SAR error model accounts for spatial autocorrelation in the error term:
The fitted values are:
\[ \hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}} + \hat{\lambda} \mathbf{W} (\mathbf{y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \]
The CAR model (estimated using spautolm() with
family = "CAR") is specified as:
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{u} \]
\[ \mathbf{u} \sim \mathcal{N}(\mathbf{0}, \sigma^2 (\mathbf{I} - \rho \mathbf{W})^{-1}) \]
or equivalently:
\[ (\mathbf{I} - \rho \mathbf{W})\mathbf{u} = \boldsymbol{\varepsilon} \]
\[ \boldsymbol{\varepsilon} \sim \mathcal{N}(\mathbf{0}, \sigma^2 \mathbf{I}) \]
where: - \(\rho\) = spatial autocorrelation parameter (CAR parameter) - All other parameters are as defined in the SAR model
For each county \(i\), the conditional distribution of its error term given all other counties’ errors is:
\[ u_i | \mathbf{u}_{-i} \sim \mathcal{N}\left(\rho \sum_{j=1}^n w_{ij} u_j,\ \sigma^2\right) \]
This means: - The expected value of \(u_i\) given neighboring errors is \(\rho\) times the weighted average of neighboring errors - The conditional variance is constant (\(\sigma^2\)) for all counties - The spatial dependence is local: each county’s error depends directly on its neighbors
The log-likelihood for the CAR model is:
\[ \ell(\boldsymbol{\beta}, \rho, \sigma^2) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) + \frac{1}{2}\ln|\mathbf{I} - \rho\mathbf{W}| - \frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})^\top (\mathbf{I} - \rho\mathbf{W}) (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \]
Note the Jacobian term is \(\frac{1}{2}\ln|\mathbf{I} - \rho\mathbf{W}|\) (half the log-determinant), unlike the SAR model which has \(\ln|\mathbf{I} - \lambda\mathbf{W}|\).
# Estimate CAR model
nc.sids.car.out <- spautolm(rates.FT ~ nwbir.FT,
data = nc.sids,
family = "CAR",
listw = ncCC89.listw,
zero.policy = TRUE)
# View results
summary(nc.sids.car.out)
##
## Call:
## spautolm(formula = rates.FT ~ nwbir.FT, data = nc.sids, listw = ncCC89.listw,
## family = "CAR", zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.334312 -0.423111 0.031383 0.502847 2.718650
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.4234930 0.2605823 9.3003 < 2e-16
## nwbir.FT 0.0148474 0.0073079 2.0317 0.04218
##
## Lambda: 0.43308 LR test value: 2.9066 p-value: 0.088217
## Numerical Hessian standard error of lambda: 0.24786
##
## Log likelihood: -123.3455
## ML residual variance (sigma squared): 0.67449, (sigma: 0.82127)
## Number of observations: 100
## Number of parameters estimated: 4
## AIC: 254.69
# Extract fitted values and add to both data frames
nc.sids$fitted.car <- fitted(nc.sids.car.out)
nc.sids.sf$fitted.car <- fitted(nc.sids.car.out)
| Aspect | SAR Error | CAR |
|---|---|---|
| Error Structure | \(\mathbf{u} = \lambda \mathbf{W}\mathbf{u} + \boldsymbol{\varepsilon}\) | \((\mathbf{I} - \rho \mathbf{W})\mathbf{u} = \boldsymbol{\varepsilon}\) |
| Variance-Covariance | \(\sigma^2 (\mathbf{I} - \lambda \mathbf{W})^{-1} (\mathbf{I} - \lambda \mathbf{W}^\top)^{-1}\) | \(\sigma^2 (\mathbf{I} - \rho \mathbf{W})^{-1}\) |
| Symmetric Covariance? | No (unless \(\mathbf{W}\) is symmetric) | Yes (if \(\mathbf{W}\) is symmetric) |
| Interpretation | Global spatial spillover | Local conditional dependence |
| Jacobian Term | \(\ln\|\mathbf{I} - \lambda\mathbf{W}\|\) | \(\frac{1}{2}\ln\|\mathbf{I} - \rho\mathbf{W}\|\) |
| Model | Best Used When |
|---|---|
| SAR | • Global spatial spillovers are expected • Shocks propagate through the entire system • Asymmetric or row-standardized weights are used • Spatial dependence in omitted variables |
| CAR | • Local spatial dependence is expected • Neighboring areas directly influence each other • Symmetric weights are appropriate • Conditional interpretation is meaningful |
# Compare coefficients
sar_coef <- summary(nc.sids.sar.out)$Coef
car_coef <- summary(nc.sids.car.out)$Coef
print("SAR Coefficients:")
## [1] "SAR Coefficients:"
print(sar_coef)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.54812325 0.261575735 9.741436 0.0000000
## nwbir.FT 0.01107147 0.007334329 1.509541 0.1311605
print("CAR Coefficients:")
## [1] "CAR Coefficients:"
print(car_coef)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.42349295 0.260582294 9.300298 0.00000000
## nwbir.FT 0.01484736 0.007307885 2.031691 0.04218499
# Compare spatial parameters
cat("SAR lambda:", nc.sids.sar.out$lambda, "\n")
## SAR lambda: 0.2457948
cat("CAR rho:", nc.sids.car.out$lambda, "\n")
## CAR rho: 0.4330803
# Compare log-likelihoods
cat("SAR log-likelihood:", logLik(nc.sids.sar.out), "\n")
## SAR log-likelihood: -123.2626
cat("CAR log-likelihood:", logLik(nc.sids.car.out), "\n")
## CAR log-likelihood: -123.3455
# Define color schemes
color.code.raw <- colorRampPalette(c("lightyellow", "orange", "red"))(20)
color.code.sar <- colorRampPalette(c("lightblue", "steelblue", "darkblue"))(20)
color.code.car <- colorRampPalette(c("lightgreen", "forestgreen", "darkgreen"))(20)
# Set up layout
par(mfrow = c(3, 1), oma = c(0, 0, 4, 0) + 0.1, mar = c(0, 0, 1, 0) + 0.1)
# Plot 1: Raw rates
plot(nc.sids.sf[, "rates.FT"],
pal = colorRampPalette(c("lightyellow", "orange", "red")),
main = "Raw Birth Rates (FT Transformation)",
border = "gray40",
lwd = 0.5,
key.pos = 4)
# Plot 2: Fitted SAR values
plot(nc.sids.sf[, "fitted.sar"],
pal = colorRampPalette(c("lightblue", "steelblue", "darkblue")),
main = "Fitted SAR Values",
border = "gray40",
lwd = 0.5,
key.pos = 4)
# Plot 3: Fitted CAR values
plot(nc.sids.sf[, "fitted.car"],
pal = colorRampPalette(c("lightgreen", "forestgreen", "darkgreen")),
main = "Fitted CAR Values",
border = "gray40",
lwd = 0.5,
key.pos = 4)
# Add overall title
mtext("North Carolina SIDS Analysis: SAR vs CAR Models",
outer = TRUE, cex = 1.5, line = 1)
# Create individual plots
p1 <- ggplot(nc.sids.sf) +
geom_sf(aes(fill = rates.FT), color = "black", size = 0.1) +
scale_fill_gradientn(colors = c("lightyellow", "orange", "red"),
name = "Rates.FT") +
theme_minimal() +
ggtitle("Raw Birth Rates (FT Transformation)")
p2 <- ggplot(nc.sids.sf) +
geom_sf(aes(fill = fitted.sar), color = "black", size = 0.1) +
scale_fill_gradientn(colors = c("lightblue", "steelblue", "darkblue"),
name = "Fitted SAR") +
theme_minimal() +
ggtitle("Fitted SAR Values")
p3 <- ggplot(nc.sids.sf) +
geom_sf(aes(fill = fitted.car), color = "black", size = 0.1) +
scale_fill_gradientn(colors = c("lightgreen", "forestgreen", "darkgreen"),
name = "Fitted CAR") +
theme_minimal() +
ggtitle("Fitted CAR Values")
# Arrange plots
grid.arrange(p1, p2, p3, ncol = 1,
top = "North Carolina SIDS Analysis: SAR vs CAR Models")
# Estimate Spatial Lag Model
nc.sids.lagsar.out <- lagsarlm(rates.FT ~ nwbir.FT,
data = nc.sids,
listw = ncCC89.listw,
zero.policy = TRUE)
summary(nc.sids.lagsar.out)
##
## Call:
## lagsarlm(formula = rates.FT ~ nwbir.FT, data = nc.sids, listw = ncCC89.listw,
## zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.310957 -0.469092 0.051219 0.499487 2.553309
##
## Type: lag
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.9179079 0.4110306 4.6661 3.07e-06
## nwbir.FT 0.0088023 0.0061509 1.4311 0.1524
##
## Rho: 0.24179, LR test value: 2.9108, p-value: 0.087987
## Asymptotic standard error: 0.13349
## z-value: 1.8113, p-value: 0.070096
## Wald statistic: 3.2808, p-value: 0.070096
##
## Log likelihood: -123.3433 for lag model
## ML residual variance (sigma squared): 0.68099, (sigma: 0.82522)
## Number of observations: 100
## Number of parameters estimated: 4
## AIC: 254.69, (AIC for lm: 255.6)
## LM test for residual autocorrelation
## test value: 0.82615, p-value: 0.36339
# Estimate Spatial Error Model using errorsarlm
nc.sids.errorsar.out <- errorsarlm(rates.FT ~ nwbir.FT,
data = nc.sids,
listw = ncCC89.listw,
zero.policy = TRUE)
summary(nc.sids.errorsar.out)
##
## Call:
## errorsarlm(formula = rates.FT ~ nwbir.FT, data = nc.sids, listw = ncCC89.listw,
## zero.policy = TRUE)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.310193 -0.456265 0.042259 0.512063 2.555381
##
## Type: error
## Coefficients: (asymptotic standard errors)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.5481233 0.2615757 9.7414 <2e-16
## nwbir.FT 0.0110715 0.0073343 1.5095 0.1312
##
## Lambda: 0.24579, LR test value: 3.0723, p-value: 0.079635
## Asymptotic standard error: 0.13382
## z-value: 1.8367, p-value: 0.066256
## Wald statistic: 3.3734, p-value: 0.066256
##
## Log likelihood: -123.2626 for error model
## ML residual variance (sigma squared): 0.67958, (sigma: 0.82437)
## Number of observations: 100
## Number of parameters estimated: 4
## AIC: 254.53, (AIC for lm: 255.6)
# Compare AIC values
models <- list(
SAR = nc.sids.sar.out,
CAR = nc.sids.car.out,
Lag = nc.sids.lagsar.out,
Error = nc.sids.errorsar.out
)
# Extract AIC for each model
aic_values <- sapply(models, AIC)
aic_values
## SAR CAR Lag Error
## 254.5252 254.6909 254.6867 254.5252
# Best model (lowest AIC)
best_model <- names(which.min(aic_values))
cat("Best model by AIC:", best_model)
## Best model by AIC: Error
# Create summary table
parameter_summary <- data.frame(
Model = c("SAR", "CAR", "Lag", "Error"),
Spatial_Parameter = c(nc.sids.sar.out$lambda,
nc.sids.car.out$lambda,
nc.sids.lagsar.out$rho,
nc.sids.errorsar.out$lambda),
AIC = aic_values,
LogLik = sapply(models, logLik)
)
parameter_summary
## Model Spatial_Parameter AIC LogLik
## SAR SAR 0.2457948 254.5252 -123.2626
## CAR CAR 0.4330803 254.6909 -123.3455
## Lag Lag 0.2417853 254.6867 -123.3433
## Error Error 0.2457948 254.5252 -123.2626
Both SAR and CAR models provide spatial alternatives to ordinary least squares regression for analyzing North Carolina birth rate data. The key distinction is:
The choice between them depends on the nature of the spatial process being modeled and the structure of the spatial weights matrix. Model selection criteria (AIC) can help guide this choice.