Markdown showing my process to run the ERGM. This just shows a sample CZ, Philadelphia.
Wrote a wrapper function to get all the associated data:
Turns all of this into an undirected graph, with trips in between as “edges” and all the many census-tract attributes as nodes.
## # A tibble: 1,465 x 52
## name geometry int.poly hwy.poly county.poly plc.poly
## <chr> <POINT [m]> <fct> <fct> <fct> <fct>
## 1 34001000100 (1325333 4776632) 6 37 5 27
## 2 34001000200 (1325872 4775231) 6 37 5 27
## 3 34001000300 (1326119 4776374) 6 37 5 27
## 4 34001000400 (1326744 4775883) 6 37 5 27
## 5 34001000500 (1326631 4776888) 6 37 5 27
## 6 34001001100 (1327622 4777999) 6 37 5 27
## 7 34001001200 (1327011 4778364) 6 37 5 27
## 8 34001001300 (1325070 4779912) 6 37 5 27
## 9 34001001400 (1327865 4779420) 6 37 5 27
## 10 34001001500 (1328258 4778378) 6 37 5 27
## # ... with 1,455 more rows, and 46 more variables: school.dist.poly <fct>,
## # touches.int <lgl>, touches.hwy <lgl>, touches.county <lgl>,
## # touches.plc <lgl>, touches.school.dist <lgl>, countyfp <chr>, cz <chr>,
## # cz_name <chr>, cbsa <dbl>, cbsa_name <chr>, acres_tot_tract <dbl>,
## # acres_res_tract <dbl>, acres_sfr_nec_tract <dbl>, acres_sfr_tract <dbl>,
## # perc.acres.sfz <dbl>, awater <dbl>, pop <dbl>, perc.whincl <dbl>,
## # perc.blincl <dbl>, perc.wh <dbl>, perc.bl <dbl>, perc.hsp <dbl>,
## # hh.median.income <dbl>, pop.below.poverty <dbl>, manufactur.emp <dbl>,
## # single.headed.fams <dbl>, under18 <dbl>, aland <dbl>,
## # total.incoming.visitors <dbl>, incoming.n_wh <dbl>, incoming.perc_wh <dbl>,
## # incoming.n_bl <dbl>, incoming.perc_bl <dbl>, normalized.incoming_wh <dbl>,
## # normalized.incoming_bl <dbl>, inflow.dissim_wh.bl <dbl>,
## # total.visited <dbl>, perc.visited_wh <dbl>, perc.visited_bl <dbl>,
## # normalized.visited_wh <dbl>, normalized.visited_bl <dbl>,
## # outflow.dissim_wh.bl <dbl>, pop.dens <dbl>, ln.pop <dbl>, ln.aland <dbl>
## # A tibble: 134,838 x 3
## from to n
## <int> <int> <int>
## 1 1 2 118
## 2 1 3 213
## 3 2 3 240
## 4 1 4 270
## 5 2 4 315
## 6 3 4 581
## 7 1 5 117
## 8 2 5 81
## 9 3 5 430
## 10 4 5 458
## # ... with 134,828 more rows
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 5 6 6 7 9 11 14 19 29 66 3043
I build crosswise distance and spatial weight matrices between all tracts. I use tract centroids to calculate distances.
The matrices are used for an edgecov term in the ergm model – based on explanation here: http://mjh4.blogspot.com/2012/09/ergm-edgecov-and-dyadcov-specifications.html
I also build a spatial weights matrix with a bi-square distance decay, where weights turn to 0 when tract centroids are >= 20 miles apart. (Although the ERGM based on just distance seemed to run better)
ctrs <- gh %>% get_nodes() %>% st_sf() %>% select(geoid = name)
dst.mat <- st_distance(ctrs)
dst.dim <- dim(dst.mat)
# drop explicit units tag and convert to KM
dst.mat <- as.numeric(dst.mat) / 1000
# also get bisquare spatial weights, as an alternative to raw distances
# I use 32.2 km, about 20 miles for distance at which spatial weight is fully decayed
spwM <-
spgwr::gwr.bisquare(dst.mat^2, 32.2)
spwM <-
matrix(spwM,
ncol = dst.dim[1])
dst.mat <-
matrix(dst.mat,
ncol = dst.dim[1])From project sheet:
Outcome measure: Create a measure of the strength of the ties between each tract (or cbg) in a commuting zone (e.g. look at networks literature to figure out best way to measure, but one simple way is the # of people moving btwn tracts in either direction scaled by # moving into and out of both tracts).
I use this function to create that measure of tie strength:
get.normalized.undirected.connectedness <- function(x, y,
graph, flow.colm = "n") {
# subset to ods containing either of the two tracts
total.trips <-
graph[with(graph,
from %in% c(x, y) |
to %in% c(x, y) ), ]
# vs flow between just the two tracts
xy.link <-
graph[with(graph,
(from == x & to == y |
from == y & to == x)), ]
# get ratio
flow.btwn <- pull(xy.link, flow.colm) %>% sum()
total.trips <- pull(total.trips, flow.colm) %>% sum()
flow.btwn / total.trips
}It seems like the current form of the statnet libraries don’t handle missing values for ERGMs well. I worry this could be a significant problem for us.
All our demographic % characteristics will be NAs for tracts for which population is 0. Many of these tracts are airports, large parks, or other areas that could have a lot of visitors while having 0 population.
For now, I replace these NAs with 0s so the model can run and they will be included in some way.
# For now, set to 0--- definitely isn't ideal
ghc <- gh %>%
activate("nodes") %>%
mutate(across(matches("^perc"),
~if_else(is.na(.x),
0, .x)))
ghc <- ghc %>%
activate("nodes") %>%
mutate(across(matches("hh.median.income"),
~if_else(is.na(.x),
0, .x)))I then switch to using the statnet library for my graph object instead of igraph/tidygraph, other packages in R made for network analysis (I find them easier for doing the data cleans and descriptives)
Here’s my setup for the ERGMs. I still have a long way to go to wrap my head around all aspects of this! But showing my setup.
# summary statistics with distance
summary(ergh ~ edges +
nodematch("int.poly") +
edgecov(dst.mat, "dst") +
absdiff("perc.wh", pow=1) +
absdiff("perc.hsp", pow=1) +
absdiff("hh.median.income", pow=1)
, response = "tie.strength") # , response = "n")
# or with the distance-decayed spatial weights
summary(ergh ~ edges +
nodematch("int.poly") +
edgecov(spwM, "spw") +
absdiff("perc.wh", pow=1) +
absdiff("perc.hsp", pow=1) +
absdiff("hh.median.income", pow=1)
, response = "tie.strength") # , response = "n")## (not run; were saved from elsewhere and loaded into workspace)
# set seed for reproducability
set.seed(1234)
ergm.phl.dst <-
ergm(ergh ~ edges +
nodematch("int.poly") +
edgecov(dst.mat, "dst") +
absdiff("perc.wh", pow=1) +
absdiff("perc.hsp", pow=1) +
absdiff("hh.median.income", pow=1)
, response = "tie.strength" # edge characteristic / tie definition to predict
, reference = ~Poisson # reference distribution
)
ergm.phl.spw <-
ergm(ergh ~ edges +
nodematch("int.poly") +
edgecov(spwM, "spw") +
absdiff("perc.wh", pow=1) +
absdiff("perc.hsp", pow=1) +
absdiff("hh.median.income", pow=1)
, response = "tie.strength" # edge characteristic / tie definition to predict
, reference = ~Poisson # reference distribution
)## Call:
## ergm(formula = ergh ~ edges + nodematch("int.poly") + edgecov(dst.mat,
## "dst") + absdiff("perc.wh", pow = 1) + absdiff("perc.hsp",
## pow = 1) + absdiff("hh.median.income", pow = 1), response = "tie.strength",
## reference = ~Poisson)
##
## Iterations: 20 out of 20
##
## Monte Carlo MLE Results:
## Estimate Std. Error MCMC % z value Pr(>|z|)
## nonzero -1.593e+01 2.083e-06 100 -7647868 <1e-04
## nodematch.sum.int.poly 1.859e+00 6.975e-07 100 2664990 <1e-04
## edgecov.sum.dst -1.592e-01 3.375e-05 100 -4718 <1e-04
## absdiff.sum.perc.wh -4.975e+00 2.141e-07 100 -23237027 <1e-04
## absdiff.sum.perc.hsp 1.258e+01 3.126e-07 100 40235743 <1e-04
## absdiff.sum.hh.median.income 2.713e-04 1.492e-08 100 18189 <1e-04
##
## nonzero ***
## nodematch.sum.int.poly ***
## edgecov.sum.dst ***
## absdiff.sum.perc.wh ***
## absdiff.sum.perc.hsp ***
## absdiff.sum.hh.median.income ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 0 on 1072380 degrees of freedom
## Residual Deviance: -175835 on 1072374 degrees of freedom
##
## Note that the null model likelihood and deviance are defined to be 0.
## This means that all likelihood-based inference (LRT, Analysis of
## Deviance, AIC, BIC, etc.) is only valid between models with the same
## reference distribution and constraints.
##
## AIC: -175823 BIC: -175752 (Smaller is better.)
## Call:
## ergm(formula = ergh ~ edges + nodematch("int.poly") + edgecov(spwM,
## "spw") + absdiff("perc.wh", pow = 1) + absdiff("perc.hsp",
## pow = 1) + absdiff("hh.median.income", pow = 1), response = "tie.strength",
## reference = ~Poisson)
##
## Iterations: 20 out of 20
##
## Monte Carlo MLE Results:
## Estimate Std. Error MCMC % z value Pr(>|z|)
## nonzero -9.10269 NA NA NA NA
## nodematch.sum.int.poly -43.84300 NA NA NA NA
## edgecov.sum.spw -0.29251 NA NA NA NA
## absdiff.sum.perc.wh 0.46469 NA NA NA NA
## absdiff.sum.perc.hsp -0.62834 NA NA NA NA
## absdiff.sum.hh.median.income -0.00114 NA NA NA NA
##
## Null Deviance: 0 on 1072380 degrees of freedom
## Residual Deviance: -738579 on 1072374 degrees of freedom
##
## Note that the null model likelihood and deviance are defined to be 0.
## This means that all likelihood-based inference (LRT, Analysis of
## Deviance, AIC, BIC, etc.) is only valid between models with the same
## reference distribution and constraints.
##
## AIC: -738567 BIC: -738495 (Smaller is better.)