ERGM

Markdown showing my process to run the ERGM. This just shows a sample CZ, Philadelphia.

Get data

Wrote a wrapper function to get all the associated data:

  • Polygon subgroup
  • regressors & controls from previous models
  • safegraph origin-destination data

Turns all of this into an undirected graph, with trips in between as “edges” and all the many census-tract attributes as nodes.

Descriptives

gh %>%
  get_nodes()
## # A tibble: 1,465 x 52
##    name                 geometry int.poly hwy.poly county.poly plc.poly
##    <chr>             <POINT [m]> <fct>    <fct>    <fct>       <fct>   
##  1 34001000100 (1325333 4776632) 6        37       5           27      
##  2 34001000200 (1325872 4775231) 6        37       5           27      
##  3 34001000300 (1326119 4776374) 6        37       5           27      
##  4 34001000400 (1326744 4775883) 6        37       5           27      
##  5 34001000500 (1326631 4776888) 6        37       5           27      
##  6 34001001100 (1327622 4777999) 6        37       5           27      
##  7 34001001200 (1327011 4778364) 6        37       5           27      
##  8 34001001300 (1325070 4779912) 6        37       5           27      
##  9 34001001400 (1327865 4779420) 6        37       5           27      
## 10 34001001500 (1328258 4778378) 6        37       5           27      
## # ... with 1,455 more rows, and 46 more variables: school.dist.poly <fct>,
## #   touches.int <lgl>, touches.hwy <lgl>, touches.county <lgl>,
## #   touches.plc <lgl>, touches.school.dist <lgl>, countyfp <chr>, cz <chr>,
## #   cz_name <chr>, cbsa <dbl>, cbsa_name <chr>, acres_tot_tract <dbl>,
## #   acres_res_tract <dbl>, acres_sfr_nec_tract <dbl>, acres_sfr_tract <dbl>,
## #   perc.acres.sfz <dbl>, awater <dbl>, pop <dbl>, perc.whincl <dbl>,
## #   perc.blincl <dbl>, perc.wh <dbl>, perc.bl <dbl>, perc.hsp <dbl>,
## #   hh.median.income <dbl>, pop.below.poverty <dbl>, manufactur.emp <dbl>,
## #   single.headed.fams <dbl>, under18 <dbl>, aland <dbl>,
## #   total.incoming.visitors <dbl>, incoming.n_wh <dbl>, incoming.perc_wh <dbl>,
## #   incoming.n_bl <dbl>, incoming.perc_bl <dbl>, normalized.incoming_wh <dbl>,
## #   normalized.incoming_bl <dbl>, inflow.dissim_wh.bl <dbl>,
## #   total.visited <dbl>, perc.visited_wh <dbl>, perc.visited_bl <dbl>,
## #   normalized.visited_wh <dbl>, normalized.visited_bl <dbl>,
## #   outflow.dissim_wh.bl <dbl>, pop.dens <dbl>, ln.pop <dbl>, ln.aland <dbl>
# `n` is daily average flows between tracts
gh %>%
  get_edges()
## # A tibble: 134,838 x 3
##     from    to     n
##    <int> <int> <int>
##  1     1     2   118
##  2     1     3   213
##  3     2     3   240
##  4     1     4   270
##  5     2     4   315
##  6     3     4   581
##  7     1     5   117
##  8     2     5    81
##  9     3     5   430
## 10     4     5   458
## # ... with 134,828 more rows
# quantiles of trips btwn tract pairs
with(
  get_edges(gh),
  quantile(n, seq(0,1,.1))
)
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##    5    6    6    7    9   11   14   19   29   66 3043
gh %>%
  get_nodes() %>%
  st_sf() %>%
  select(int.poly, plc.poly,
         perc.bl, incoming.perc_bl) %>%
  plot()

Distance & spatial weights matrices

I build crosswise distance and spatial weight matrices between all tracts. I use tract centroids to calculate distances.

The matrices are used for an edgecov term in the ergm model – based on explanation here: http://mjh4.blogspot.com/2012/09/ergm-edgecov-and-dyadcov-specifications.html

I also build a spatial weights matrix with a bi-square distance decay, where weights turn to 0 when tract centroids are >= 20 miles apart. (Although the ERGM based on just distance seemed to run better)

ctrs <- gh %>% get_nodes() %>% st_sf() %>% select(geoid = name)

dst.mat <- st_distance(ctrs)
dst.dim <- dim(dst.mat)

# drop explicit units tag and convert to KM
dst.mat <- as.numeric(dst.mat) / 1000
# also get bisquare spatial weights, as an alternative to raw distances
# I use 32.2 km, about 20 miles for distance at which spatial weight is fully decayed
spwM <-
  spgwr::gwr.bisquare(dst.mat^2, 32.2)

spwM <-
  matrix(spwM,
            ncol = dst.dim[1])
dst.mat <-
  matrix(dst.mat,
         ncol = dst.dim[1])

Tie-strength measure

From project sheet:

Outcome measure: Create a measure of the strength of the ties between each tract (or cbg) in a commuting zone (e.g. look at networks literature to figure out best way to measure, but one simple way is the # of people moving btwn tracts in either direction scaled by # moving into and out of both tracts).

I use this function to create that measure of tie strength:

get.normalized.undirected.connectedness <- function(x, y,
                                                    graph, flow.colm = "n") {

  # subset to ods containing either of the two tracts
  total.trips <-
    graph[with(graph,
               from %in% c(x, y) |
                 to %in% c(x, y) ), ]

  # vs flow between just the two tracts
  xy.link <-
    graph[with(graph,
               (from == x & to == y |
                  from == y & to == x)), ]

  # get ratio
  flow.btwn <- pull(xy.link, flow.colm) %>% sum()
  total.trips <- pull(total.trips, flow.colm) %>% sum()

  flow.btwn / total.trips
}

Statnet in-progress caveats

It seems like the current form of the statnet libraries don’t handle missing values for ERGMs well. I worry this could be a significant problem for us.

All our demographic % characteristics will be NAs for tracts for which population is 0. Many of these tracts are airports, large parks, or other areas that could have a lot of visitors while having 0 population.

For now, I replace these NAs with 0s so the model can run and they will be included in some way.

# For now, set to 0--- definitely isn't ideal
ghc <- gh %>%
  activate("nodes") %>%
  mutate(across(matches("^perc"),
                ~if_else(is.na(.x),
                         0, .x)))

ghc <- ghc %>%
  activate("nodes") %>%
  mutate(across(matches("hh.median.income"),
                ~if_else(is.na(.x),
                         0, .x)))

I then switch to using the statnet library for my graph object instead of igraph/tidygraph, other packages in R made for network analysis (I find them easier for doing the data cleans and descriptives)

library(statnet)

# alter classes to network can accommodate graph setup
# (remove factors & geometries)
ghc <- ghc %>%
  activate("nodes") %>%
  select(-geometry) %>%
  mutate(across(matches("poly$"),
                ~as.integer(.x)))

# coerce to statnet network object
ergh <- intergraph::asNetwork(ghc)

Running ERGMs

Here’s my setup for the ERGMs. I still have a long way to go to wrap my head around all aspects of this! But showing my setup.

Summary statistics

# summary statistics with distance
summary(ergh ~ edges +
          nodematch("int.poly") +
          edgecov(dst.mat, "dst") +
          absdiff("perc.wh", pow=1) +
          absdiff("perc.hsp", pow=1) +
          absdiff("hh.median.income", pow=1)
        , response = "tie.strength") # , response = "n")


# or with the distance-decayed spatial weights
summary(ergh ~ edges +
          nodematch("int.poly") +
          edgecov(spwM, "spw") +
          absdiff("perc.wh", pow=1) +
          absdiff("perc.hsp", pow=1) +
          absdiff("hh.median.income", pow=1)
        , response = "tie.strength") # , response = "n")

Calls

## (not run; were saved from elsewhere and loaded into workspace)

# set seed for reproducability
set.seed(1234)

ergm.phl.dst <-
  ergm(ergh ~ edges +
         nodematch("int.poly") +
         edgecov(dst.mat, "dst") +
         absdiff("perc.wh", pow=1) +
         absdiff("perc.hsp", pow=1) +
         absdiff("hh.median.income", pow=1)
       , response = "tie.strength" # edge characteristic / tie definition to predict
       , reference = ~Poisson # reference distribution
  )


ergm.phl.spw <-
  ergm(ergh ~ edges +
         nodematch("int.poly") +
         edgecov(spwM, "spw") +
         absdiff("perc.wh", pow=1) +
         absdiff("perc.hsp", pow=1) +
         absdiff("hh.median.income", pow=1)
       , response = "tie.strength" # edge characteristic / tie definition to predict
       , reference = ~Poisson # reference distribution
  )

Summaries

# distance-based model
ergm.phl.dst %>%
  summary()
## Call:
## ergm(formula = ergh ~ edges + nodematch("int.poly") + edgecov(dst.mat, 
##     "dst") + absdiff("perc.wh", pow = 1) + absdiff("perc.hsp", 
##     pow = 1) + absdiff("hh.median.income", pow = 1), response = "tie.strength", 
##     reference = ~Poisson)
## 
## Iterations:  20 out of 20 
## 
## Monte Carlo MLE Results:
##                                Estimate Std. Error MCMC %   z value Pr(>|z|)
## nonzero                      -1.593e+01  2.083e-06    100  -7647868   <1e-04
## nodematch.sum.int.poly        1.859e+00  6.975e-07    100   2664990   <1e-04
## edgecov.sum.dst              -1.592e-01  3.375e-05    100     -4718   <1e-04
## absdiff.sum.perc.wh          -4.975e+00  2.141e-07    100 -23237027   <1e-04
## absdiff.sum.perc.hsp          1.258e+01  3.126e-07    100  40235743   <1e-04
## absdiff.sum.hh.median.income  2.713e-04  1.492e-08    100     18189   <1e-04
##                                 
## nonzero                      ***
## nodematch.sum.int.poly       ***
## edgecov.sum.dst              ***
## absdiff.sum.perc.wh          ***
## absdiff.sum.perc.hsp         ***
## absdiff.sum.hh.median.income ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance:       0  on 1072380  degrees of freedom
##  Residual Deviance: -175835  on 1072374  degrees of freedom
##  
## Note that the null model likelihood and deviance are defined to be 0.
## This means that all likelihood-based inference (LRT, Analysis of
## Deviance, AIC, BIC, etc.) is only valid between models with the same
## reference distribution and constraints.
## 
## AIC: -175823    BIC: -175752    (Smaller is better.)
# spatial decay model
ergm.phl.spw %>%
  summary()
## Call:
## ergm(formula = ergh ~ edges + nodematch("int.poly") + edgecov(spwM, 
##     "spw") + absdiff("perc.wh", pow = 1) + absdiff("perc.hsp", 
##     pow = 1) + absdiff("hh.median.income", pow = 1), response = "tie.strength", 
##     reference = ~Poisson)
## 
## Iterations:  20 out of 20 
## 
## Monte Carlo MLE Results:
##                               Estimate Std. Error MCMC % z value Pr(>|z|)
## nonzero                       -9.10269         NA     NA      NA       NA
## nodematch.sum.int.poly       -43.84300         NA     NA      NA       NA
## edgecov.sum.spw               -0.29251         NA     NA      NA       NA
## absdiff.sum.perc.wh            0.46469         NA     NA      NA       NA
## absdiff.sum.perc.hsp          -0.62834         NA     NA      NA       NA
## absdiff.sum.hh.median.income  -0.00114         NA     NA      NA       NA
## 
##      Null Deviance:       0  on 1072380  degrees of freedom
##  Residual Deviance: -738579  on 1072374  degrees of freedom
##  
## Note that the null model likelihood and deviance are defined to be 0.
## This means that all likelihood-based inference (LRT, Analysis of
## Deviance, AIC, BIC, etc.) is only valid between models with the same
## reference distribution and constraints.
## 
## AIC: -738567    BIC: -738495    (Smaller is better.)