Setup & Intro

I set up five model specifications and ran ERGMs over these variations. I ran each ERGM at the CZ level and included only CZs that are divided into at least two parts by an interstate, so that the hypothesis about interstate divisions is meaningful. This left 408 eligible CZs.

My Della job iterated over the eligible CZs and the 5 model variations, saving the prepped network data, the ergm model objects, and a visual for each CZ. A separate script loaded all model variations for all regions, extracted summary information and coefficient estimates, and saved these to a sparse .csv, which makes it easier to compare model performance across regions and variations.
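That extraction step looks roughly like the sketch below (the directory, file naming, and the use of broom's ergm tidier are illustrative assumptions, not the exact script):

# sketch: collect coefficient estimates from saved ergm fits into one table
library(dplyr)
library(purrr)

model.files <- list.files("/scratch/gpfs/km31/ergms/models/",   # hypothetical output directory
                          pattern = "\\.rds$", full.names = TRUE)

mss <- map_dfr(model.files, function(f) {
  fit <- readRDS(f)
  broom::tidy(fit) %>%                  # term, estimate, std.error, statistic, p.value
    mutate(aic  = AIC(fit),
           file = basename(f))          # region / cz / fo can be parsed back out of the filename
})

readr::write_csv(mss, "ergm-model-summaries.csv")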

The model specifications frequently error out. I haven’t gone through these cases in much detail yet, but it often seems to be because multiple terms are too similar (near-collinear) and the models can’t converge.
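One way to keep the job moving past these failures is to wrap each fit in tryCatch; a minimal sketch (assuming the network ergh and covariate matrices like dst.mat are already in scope for the formula):

# sketch: attempt one CZ x specification, logging the error instead of stopping the loop
fit <- tryCatch(
  ergm::ergm(fo1_base),
  error = function(e) {
    message("fo1_base failed: ", conditionMessage(e))
    NULL    # returning NULL lets the loop over CZs and specifications continue
  }
)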

Additionally, the Della job ran out of time. I figured it would be good to do some analysis and possibly revise the model specifications before resuming and fitting more CZs. So far there are 56 CZs with fitted models (and another 64 where models were attempted but failed).

The Model Variations

The model variations I used are below. In these formulas:

  • int.poly is the “interstate polygon” effect. With nodematch, this term estimates the effect of two nodes/tracts being in the same interstate polygon (i.e., not being across an interstate from one another). None of the current model variations include non-interstate highways.

  • dst is distance. It is raw distance in km for all models except model 5, which uses decayed proximity instead (so the sign of this term has the opposite interpretation in model 5).

  • in.cc is whether or not a node/tract is in the “city center,” i.e., the largest Place within the commuting zone. With nodematch, this term captures the effect of both nodes being on the same side of the city-center boundary (both inside or both outside) versus across it; with nodefactor, it attributes an effect to a node being within the Place bounds.

Notably, for distance, there are at least a couple of papers discussing the “spatial interaction function” to use for this kind of geographic network analysis. I haven’t used any of the variations suggested in these papers yet, but this is discussed in:

  • Butts and Acton (2011), “Spatial Modeling of Social Networks”

  • Daraganova et al. (2012), “Modelling community network structures as the outcome of both spatial and network processes”

I plan to review these and implement one of the suggested functional forms; a rough sketch of a decay-based placeholder is below.
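In the meantime, the decayed-proximity matrix used by model 5 can be thought of as a simple transformation of the raw distance matrix. The sketch below uses an exponential form with an arbitrary range parameter; a power-law interaction function of the kind discussed by Butts and Acton could be swapped in later.

# sketch: build a decayed-proximity matrix from the raw distance matrix (km).
# Exponential decay with an arbitrary 10 km range; a power-law form such as
# 1 / (1 + alpha * d)^gamma would be a drop-in alternative.
make_decay_mat <- function(dst.mat, range.km = 10) {
  exp(-dst.mat / range.km)    # 1 at distance 0, ~0.37 at range.km, approaching 0 beyond
}

decay.mat <- make_decay_mat(dst.mat)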

# define formula variations -----------------------------------------------

fo1_base <- formula(
  ergh ~ edges +
    nodematch("int.poly", diff = FALSE) +
    edgecov(dst.mat, "dst") +
    #nodecov("pop", form="sum") +
    absdiff("perc.wh", pow=1) +
    absdiff("perc.hsp", pow=1) +
    absdiff("hh.median.income", pow=1)
)

fo2_city.centers <- formula(
  ergh ~ edges +
    nodematch("int.poly", diff = FALSE) +
    nodematch("in.cc") +
    nodefactor("in.cc") +
    edgecov(dst.mat, "dst") +
    #nodecov("pop", form="sum") +
    absdiff("perc.wh", pow=1) +
    absdiff("perc.hsp", pow=1) +
    absdiff("hh.median.income", pow=1)
)


fo3_multiple.cc <- formula(  
  ergh ~ edges +
    nodematch("int.poly", diff = FALSE) +
    nodematch("in.cc") +
    # nodefactor("in.cc") +
    edgecov(dst.mat, "dst") +
    #nodecov("pop", form="sum") +
    absdiff("perc.wh", pow=1) +
    absdiff("perc.hsp", pow=1) +
    absdiff("hh.median.income", pow=1)
)

fo4_ccs.pop.dens <-
  formula(
    ergh ~ edges +
      nodematch("int.poly", diff = FALSE) +
      nodematch("in.cc") +
      nodefactor("in.cc") +
      edgecov(dst.mat, "dst") +
      nodecov("ln.pop.dens", form="sum") +
      absdiff("perc.wh", pow=1) +
      absdiff("perc.hsp", pow=1) +
      absdiff("hh.median.income", pow=1)
  )

fo5_ccs.pop.dens.decay <-
  formula(
    ergh ~ edges +
      nodematch("int.poly", diff = FALSE) +
      nodematch("in.cc") +
      nodefactor("in.cc") +
      edgecov(decay.mat, "dst") +
      nodecov("ln.pop.dens", form="sum") +
      absdiff("perc.wh", pow=1) +
      absdiff("perc.hsp", pow=1) +
      absdiff("hh.median.income", pow=1)
  )

Meta analysis

This section does some mass analysis of all the models that have run without errors so far.

# what's generated so far
mss %>%
  count(region, cz, fo) %>% 
  head() %>% kable(digits = 0)
region                cz      fo                        n
-------------------   -----   ----------------------   --
00100-Johnson City    00100   fo2_city.centers           8
00302-Knoxville       00302   fo1_base                   6
00302-Knoxville       00302   fo3_multiple.cc            7
00302-Knoxville       00302   fo4_ccs.pop.dens           9
00302-Knoxville       00302   fo5_ccs.pop.dens.decay     9
00401-Winston-Salem   00401   fo2_city.centers           8
# some mass analysis ------------------------------------------------------

mss %>%
  group_by(term) %>%
  tidyaux::quantiles.across.groups("estimate",
                          ntile = 1/6) %>% 
  kable(digits = 2)
term                                 0%    16.7%    33.3%     50%    66.7%    83.3%    100%
-----------------------------  --------  -------  -------  ------  -------  -------  ------
absdiff.sum.hh.median.income      -0.05     0.00     0.00    0.00     0.00     0.00    1.19
absdiff.sum.perc.hsp             -30.85    -2.55    -0.64    0.20     0.88     2.45   22.70
absdiff.sum.perc.wh              -15.80    -2.56    -0.37    0.05     0.64     2.44   56.14
edgecov.sum.dst                  -21.51    -0.10    -0.02    0.00     0.03     0.25   10.61
nodecov.sum.ln.pop.dens          -16.29    -0.34    -0.08    0.13     0.28     0.77    7.20
nodefactor.sum.in.cc.TRUE        -14.55    -1.13    -0.09    0.35     1.73     3.59   11.21
nodematch.sum.in.cc              -12.30    -2.34    -0.92    0.01     1.48     3.57   19.06
nodematch.sum.int.poly         -5131.57    -8.05    -3.69   -0.32     1.17     4.97   51.88
nonzero                         -202.55   -25.95   -14.39   -6.47    -4.63    -3.84    4.06
# in addition to models that just had an error, there are also many models where
# many coefficients had NA standard errors and p-values:

mss %>%
    group_by(region, fo) %>%
    mutate(na.p.vals =
               sum(is.na(p.value))) %>%
  ungroup() %>% select(1:2, na.p.vals) %>% distinct() %>% 
  head() %>% kable()
region                fo                 na.p.vals
-------------------   ----------------   ---------
00302-Knoxville       fo1_base                   0
02700-Biloxi          fo1_base                   6
32801-Waco            fo1_base                   1
24200-Kankakee        fo1_base                   6
00401-Winston-Salem   fo2_city.centers           2
09800-Auburn          fo1_base                   1
# I need to think more about this, but I think models where all the std. errors /
# p-values are NA did not converge in the estimation and could just need more
# iterations. For now I leave them out of the meta-analysis here.
ams <-
  mss %>%
  group_by(region, fo) %>%
  filter(!all(is.na(p.value))) %>%
  ungroup()
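
# If those non-converged fits are re-run later, one option is simply to allow more MCMLE
# iterations through ergm's control argument (sketch only; this belongs in the fitting
# script and the value here is arbitrary):
refit <- ergm::ergm(fo1_base, control = ergm::control.ergm(MCMLE.maxit = 60))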

ams %>%
  group_by(term) %>%
  tidyaux::quantiles.across.groups("estimate",
                          ntile = 1/6) %>% 
  kable(digits = 2)
term                                 0%    16.7%    33.3%     50%    66.7%    83.3%    100%
-----------------------------  --------  -------  -------  ------  -------  -------  ------
absdiff.sum.hh.median.income      -0.05     0.00     0.00    0.00     0.00     0.00    0.03
absdiff.sum.perc.hsp             -25.87    -2.22    -0.15    0.39     0.91     2.80    9.96
absdiff.sum.perc.wh              -15.80    -2.66    -0.35    0.02     0.64     2.19   14.69
edgecov.sum.dst                   -6.94    -0.06    -0.02    0.00     0.03     0.17   10.61
nodecov.sum.ln.pop.dens           -6.80    -0.23    -0.02    0.17     0.45     0.81    3.21
nodefactor.sum.in.cc.TRUE        -14.55    -1.37    -0.25    0.01     0.81     2.33   10.21
nodematch.sum.in.cc               -8.73    -2.81    -1.03   -0.19     0.77     3.32    8.48
nodematch.sum.int.poly         -5131.57    -8.00    -4.09   -0.58     1.30     4.94   51.88
nonzero                         -202.55   -25.93   -14.56   -7.09    -4.84    -4.28    4.06
#' There seems to be a lot of variation across regions, such that no coefficient is
#' consistently in the same direction.
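
#' A quick way to quantify that (sketch, not run here): the share of fits in which each
#' term's estimate is positive.
ams %>%
  group_by(term) %>%
  summarise(share.positive = mean(estimate > 0, na.rm = TRUE)) %>%
  kable(digits = 2)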

#' Distribution of terms:
ams %>%
  filter(term != "nonzero") %>%
  ggplot() +
  geom_histogram(aes(x = estimate,
                     fill = term),
                 position = "dodge"
                 ,bins = 10) +
  scale_x_continuous(limits = c(-40,40))
## Warning: Removed 2 rows containing non-finite values (stat_bin).

# just the interstate coefficient, across variations
ams %>%
  filter(grepl("int.poly", term)) %>%
  group_by(fo) %>%
  tidyaux::quantiles.across.groups("estimate", 1/6)
## # A tibble: 5 x 8
##   fo         `0%` `16.66667%` `33.33333%`   `50%` `66.66667%` `83.33333%` `100%`
##   <chr>     <dbl>       <dbl>       <dbl>   <dbl>       <dbl>       <dbl>  <dbl>
## 1 fo1_ba… -5.13e3      -10.4       -6.44  -1.18         1.46         3.31   7.86
## 2 fo2_ci… -2.27e1       -4.87      -4.31  -1.75        -0.521        3.97  51.9 
## 3 fo3_mu… -9.82e0       -4.10      -0.675  0.452        3.03         9.07  13.7 
## 4 fo4_cc… -1.50e1       -6.32      -3.97  -0.0767       1.07         3.65   6.98
## 5 fo5_cc… -1.19e1      -10.7       -7.37  -1.87         0.127        4.53   9.67

City-level analyses

For the visuals in each city below, the first map shows flows within the Place boundaries; the second is zoomed out to show all flows across the CZ.

Tampa

# tampa -------------------------------------------------------------------

tmp <- mss %>%
  filter(cz == "06700")

# all 5 model variations available for tampa
tmp %>% count(fo)
## # A tibble: 5 x 2
##   fo                         n
##   <chr>                  <int>
## 1 fo1_base                   6
## 2 fo2_city.centers           8
## 3 fo3_multiple.cc            7
## 4 fo4_ccs.pop.dens           9
## 5 fo5_ccs.pop.dens.decay     9
# get visuals
gdir <-
  "/scratch/gpfs/km31/ergms/cz-graphs/"
gh <- paste0(gdir, "cz-", tmp$region[1], ".rds")
gh <- readRDS(gh)

plc.map <- divseg::Wrapper_make.graph.map(gh,
                               cz = tmp$cz[1],
                               trim2plc = T)
## Loading required namespace: xwalks
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
plc.map

cz.map <- divseg::Wrapper_make.graph.map(gh,
                                          cz = tmp$cz[1],
                                          trim2plc = F)
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
cz.map

tmp %>% select(fo, aic) %>% distinct() %>% arrange(aic)
## # A tibble: 5 x 2
##   fo                          aic
##   <chr>                     <dbl>
## 1 fo5_ccs.pop.dens.decay  -79027.
## 2 fo2_city.centers        -76719.
## 3 fo3_multiple.cc         -71443.
## 4 fo4_ccs.pop.dens        345709.
## 5 fo1_base               6671792.
# coefficient estimates across models for Tampa
coef.scatter <-
  tmp %>%
  ggplot() +
  geom_point(aes(x = fo,
                 y = estimate,
                 color = term),
             alpha = .8, size = 3)

plotly::ggplotly(coef.scatter)
#' Some things I notice:
#'
#' - absdiff.income is consistently near 0 -- though that is likely because differences in
#' income are numerically so large compared to differences in percentages and
#' T/F dummies. Probably similar for distance in models 1-4
#'
#' - distance is only meaningfully away from zero in the first model, when no
#' Place/city-center characteristics are included
#'
#' - the int.poly coefficient tends to be ~negative~ in all of the first 4 models --
#' implying travel is more likely between places across the interstate from one
#' another...

# Model 5 is the only one where the int.poly coefficient has the expected sign. Also note
# that the edgecov distance term has a different meaning (and opposite expected
# sign) in model 5, because it uses decayed proximity rather than raw distance.
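
# A possible follow-up (sketch only; this belongs in the data-prep/fitting script, assumes
# the vertex attribute name used in the formulas above, and that the network package is
# attached): rescale income so its coefficient sits on a scale comparable to the
# percentage terms, e.g. measure household income in $10k units.
ergh %v% "hh.median.income.10k" <- (ergh %v% "hh.median.income") / 1e4
# absdiff("hh.median.income.10k") would then replace absdiff("hh.median.income")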

Minneapolis

# minneapolis -------------------------------------------------------------

tmp <- mss %>%
  filter(cz == "21501")

# all 5 model variations available 
tmp %>% count(fo)
## # A tibble: 5 x 2
##   fo                         n
##   <chr>                  <int>
## 1 fo1_base                   6
## 2 fo2_city.centers           8
## 3 fo3_multiple.cc            7
## 4 fo4_ccs.pop.dens           9
## 5 fo5_ccs.pop.dens.decay     9
# get visuals
gdir <-
  "/scratch/gpfs/km31/ergms/cz-graphs/"
gh <- paste0(gdir, "cz-", tmp$region[1], ".rds")
gh <- readRDS(gh)

plc.map <- divseg::Wrapper_make.graph.map(gh,
                                          cz = tmp$cz[1],
                                          trim2plc = T)
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
plc.map

cz.map <- divseg::Wrapper_make.graph.map(gh,
                                         cz = tmp$cz[1],
                                         trim2plc = F)
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
cz.map

# The maps seem to support our hypothesis, specifically about highways (the models, less so?)

tmp %>%
  group_by(term) %>%
  tidyaux::quantiles.across.groups("estimate", 1/6) %>% 
  kable(digits= 2)
term                                0%   16.7%   33.3%    50%   66.7%   83.3%   100%
-----------------------------  -------  ------  ------  -----  ------  ------  -----
absdiff.sum.hh.median.income      0.00    0.00    0.00   0.00    0.00    0.00   0.00
absdiff.sum.perc.hsp             -4.90   -4.38   -2.67   0.21    1.11    2.29   3.75
absdiff.sum.perc.wh              -0.30   -0.18    0.61   2.07    2.30    2.96   4.03
edgecov.sum.dst                  -0.04   -0.03   -0.01   0.00    0.01    0.58   1.72
nodecov.sum.ln.pop.dens           0.03    0.05    0.06   0.08    0.10    0.11   0.13
nodefactor.sum.in.cc.TRUE        -1.91   -1.38   -0.85  -0.32    0.27    0.87   1.47
nodematch.sum.in.cc              -2.10   -1.35   -0.61  -0.45   -0.30    1.83   3.97
nodematch.sum.int.poly            0.17    1.17    1.98   2.60    9.99   26.41  51.88
nonzero                         -44.28  -33.35  -21.69  -9.30   -8.12   -6.91  -5.68
# This time the int.poly coefficients are consistently positive (as expected)
coef.scatter <-
  tmp %>%
  ggplot() +
  geom_point(aes(x = fo,
                 y = estimate,
                 color = term),
             alpha = .8, size = 3)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:igraph':
## 
##     groups
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(coef.scatter)
#' - the int.polygon tends to be ~positive~ in all models (as expected, but
#' unlike Tampa)

tmp %>%
  filter(p.value > .05 |
           is.na(p.value) )
## # A tibble: 2 x 10
##   region  fo     term  estimate std.error statistic p.value     aic cz       pop
##   <chr>   <chr>  <chr>    <dbl>     <dbl>     <dbl>   <dbl>   <dbl> <chr>  <int>
## 1 21501-… fo3_m… nonz…  -44.3     NA          NA    NA      -42574. 21501 3.19e6
## 2 21501-… fo5_c… node…    0.166    0.0992      1.67  0.0950 -40332. 21501 3.19e6
tmp %>% select(fo, aic) %>% distinct() %>% arrange(aic)
## # A tibble: 5 x 2
##   fo                         aic
##   <chr>                    <dbl>
## 1 fo3_multiple.cc        -42574.
## 2 fo5_ccs.pop.dens.decay -40332.
## 3 fo4_ccs.pop.dens       -27133.
## 4 fo1_base                -6886.
## 5 fo2_city.centers       911033.
# All models for Minneapolis avoid NA estimates and very high standard errors, and model
# 3 has the lowest AIC.

Memphis

# Memphis -----------------------------------------------------------------

tmp <- mss %>%
  filter(cz == "05202")

# all 5 model variations available
tmp %>% count(fo)
## # A tibble: 5 x 2
##   fo                         n
##   <chr>                  <int>
## 1 fo1_base                   6
## 2 fo2_city.centers           8
## 3 fo3_multiple.cc            7
## 4 fo4_ccs.pop.dens           9
## 5 fo5_ccs.pop.dens.decay     9
# get visuals
gdir <-
  "/scratch/gpfs/km31/ergms/cz-graphs/"
gh <- paste0(gdir, "cz-", tmp$region[1], ".rds")
gh <- readRDS(gh)

plc.map <- divseg::Wrapper_make.graph.map(gh,
                                          cz = tmp$cz[1],
                                          trim2plc = T)
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
plc.map

cz.map <- divseg::Wrapper_make.graph.map(gh,
                                         cz = tmp$cz[1],
                                         trim2plc = F)
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
cz.map

# Maps for Memphis seem to show the cordoning-off effect of interstates. The
# relative wastelands of travel in the north & south areas bounded by the
# peripheral interstate seem dramatic.


tmp %>%
  filter(p.value > .05 |
           is.na(p.value) )
## # A tibble: 0 x 10
## # … with 10 variables: region <chr>, fo <chr>, term <chr>, estimate <dbl>,
## #   std.error <dbl>, statistic <dbl>, p.value <dbl>, aic <dbl>, cz <chr>,
## #   pop <int>
tmp %>% select(fo, aic) %>% distinct() %>% arrange(aic)
## # A tibble: 5 x 2
##   fo                          aic
##   <chr>                     <dbl>
## 1 fo4_ccs.pop.dens         -4027.
## 2 fo3_multiple.cc           8312.
## 3 fo5_ccs.pop.dens.decay   44622.
## 4 fo2_city.centers         49633.
## 5 fo1_base               1258659.
# Estimation is consistent, with low p-values and no NA ones. Model 4 had the
# lowest AIC, followed by model 3.

coef.scatter <-
  tmp %>%
  ggplot() +
  geom_point(aes(x = fo,
                 y = estimate,
                 color = term),
             alpha = .8, size = 3)

plotly::ggplotly(coef.scatter)
# Models with a negative int.poly coefficient also seem to have worse AIC
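
# A quick cross-region check of that pattern (sketch, not run here): compare the sign of
# the int.poly estimate with each model's AIC rank within its CZ.
ams %>%
  filter(grepl("int.poly", term)) %>%
  group_by(cz) %>%
  mutate(aic.rank = rank(aic)) %>%
  ungroup() %>%
  mutate(negative.int.poly = estimate < 0) %>%
  count(negative.int.poly, aic.rank)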