Setup & Intro

I set up five model specifications and ran ERGMs over these variations. I ran each ERGM at the CZ level and included only CZs that are divided into at least two parts by an interstate, so that the hypothesis about interstate divisions is meaningful. This left 408 eligible CZs.

My Della job iterated over the eligible CZs and the 5 model variations, saving the prepped network data, the ergm model objects, and a visual for each CZ. A separate script loaded all model variations for all regions, extracted summary information and coefficient estimates, and saved these to a sparse .csv, which makes it easier to compare model performance across regions and variations.
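That extraction step looks roughly like the sketch below (the directory, file naming, and the use of broom's ergm tidier are illustrative assumptions, not the exact script):

# sketch: collect coefficient estimates from saved ergm fits into one table
library(dplyr)
library(purrr)

model.files <- list.files("/scratch/gpfs/km31/ergms/models/",   # hypothetical output directory
                          pattern = "\\.rds$", full.names = TRUE)

mss <- map_dfr(model.files, function(f) {
  fit <- readRDS(f)
  broom::tidy(fit) %>%                  # term, estimate, std.error, statistic, p.value
    mutate(aic  = AIC(fit),
           file = basename(f))          # region / cz / fo can be parsed back out of the filename
})

readr::write_csv(mss, "ergm-model-summaries.csv")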

The model specifications frequently error out. I haven’t gone through these cases in much detail yet, but it often seems to be because multiple terms are too similar (near-collinear) and the models can’t converge.
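One way to keep the job moving past these failures is to wrap each fit in tryCatch; a minimal sketch (assuming the network ergh and covariate matrices like dst.mat are already in scope for the formula):

# sketch: attempt one CZ x specification, logging the error instead of stopping the loop
fit <- tryCatch(
  ergm::ergm(fo1_base),
  error = function(e) {
    message("fo1_base failed: ", conditionMessage(e))
    NULL    # returning NULL lets the loop over CZs and specifications continue
  }
)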

Additionally, the Della job ran out of time. I figured it would be good to do some analysis and possibly revise the model specifications before resuming and fitting more CZs. So far there are 56 CZs with fitted models (and another 64 where models were attempted but failed).

The Model Variations

The model variations I used are below. In these formulas:

  • int.poly is the “interstate polygon” effect. With nodematch, this term estimates the effect of two nodes/tracts being in the same interstate polygon (i.e., not being across an interstate from one another). None of the current model variations include non-interstate highways.

  • dst is distance. It is raw distance in km for all models except model 5, which uses decayed proximity instead (so the sign of this term has the opposite interpretation in model 5).

  • in.cc is whether or not a node/tract is in the “city center,” i.e., the largest Place within the commuting zone. With nodematch, this term captures the effect of both nodes being on the same side of the city-center boundary (both inside or both outside) versus across it; with nodefactor, it attributes an effect to a node being within the Place bounds.

Notably, for distance, there are at least a couple of papers discussing the “spatial interaction function” to use for this kind of geographic network analysis. I haven’t used any of the variations suggested in these papers yet, but this is discussed in:

  • Butts and Acton (2011), “Spatial Modeling of Social Networks”

  • Daraganova et al. (2012), “Modelling community network structures as the outcome of both spatial and network processes”

I plan to review these and implement one of the suggested functional forms; a rough sketch of a decay-based placeholder is below.
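In the meantime, the decayed-proximity matrix used by model 5 can be thought of as a simple transformation of the raw distance matrix. The sketch below uses an exponential form with an arbitrary range parameter; a power-law interaction function of the kind discussed by Butts and Acton could be swapped in later.

# sketch: build a decayed-proximity matrix from the raw distance matrix (km).
# Exponential decay with an arbitrary 10 km range; a power-law form such as
# 1 / (1 + alpha * d)^gamma would be a drop-in alternative.
make_decay_mat <- function(dst.mat, range.km = 10) {
  exp(-dst.mat / range.km)    # 1 at distance 0, ~0.37 at range.km, approaching 0 beyond
}

decay.mat <- make_decay_mat(dst.mat)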

# define formula variations -----------------------------------------------

fo1_base <- formula(
  ergh ~ edges +
    nodematch("int.poly", diff = FALSE) +
    edgecov(dst.mat, "dst") +
    #nodecov("pop", form="sum") +
    absdiff("perc.wh", pow=1) +
    absdiff("perc.hsp", pow=1) +
    absdiff("hh.median.income", pow=1)
)

fo2_city.centers <- formula(
  ergh ~ edges +
    nodematch("int.poly", diff = FALSE) +
    nodematch("in.cc") +
    nodefactor("in.cc") +
    edgecov(dst.mat, "dst") +
    #nodecov("pop", form="sum") +
    absdiff("perc.wh", pow=1) +
    absdiff("perc.hsp", pow=1) +
    absdiff("hh.median.income", pow=1)
)


fo3_multiple.cc <- formula(  
  ergh ~ edges +
    nodematch("int.poly", diff = FALSE) +
    nodematch("in.cc") +
    # nodefactor("in.cc") +
    edgecov(dst.mat, "dst") +
    #nodecov("pop", form="sum") +
    absdiff("perc.wh", pow=1) +
    absdiff("perc.hsp", pow=1) +
    absdiff("hh.median.income", pow=1)
)

fo4_ccs.pop.dens <-
  formula(
    ergh ~ edges +
      nodematch("int.poly", diff = FALSE) +
      nodematch("in.cc") +
      nodefactor("in.cc") +
      edgecov(dst.mat, "dst") +
      nodecov("ln.pop.dens", form="sum") +
      absdiff("perc.wh", pow=1) +
      absdiff("perc.hsp", pow=1) +
      absdiff("hh.median.income", pow=1)
  )

fo5_ccs.pop.dens.decay <-
  formula(
    ergh ~ edges +
      nodematch("int.poly", diff = FALSE) +
      nodematch("in.cc") +
      nodefactor("in.cc") +
      edgecov(decay.mat, "dst") +
      nodecov("ln.pop.dens", form="sum") +
      absdiff("perc.wh", pow=1) +
      absdiff("perc.hsp", pow=1) +
      absdiff("hh.median.income", pow=1)
  )

Meta analysis

This section does some mass analysis of all the models that have run without errors so far.

# what's generated so far
mss %>%
  count(region, cz, fo) %>% 
  head() %>% kable(digits = 0)
region                cz      fo                        n
-------------------   -----   ----------------------   --
00100-Johnson City    00100   fo2_city.centers           8
00302-Knoxville       00302   fo1_base                   6
00302-Knoxville       00302   fo3_multiple.cc            7
00302-Knoxville       00302   fo4_ccs.pop.dens           9
00302-Knoxville       00302   fo5_ccs.pop.dens.decay     9
00401-Winston-Salem   00401   fo2_city.centers           8
# some mass analysis ------------------------------------------------------

mss %>%
  group_by(term) %>%
  tidyaux::quantiles.across.groups("estimate",
                          ntile = 1/6) %>% 
  kable(digits = 2)
term                                 0%    16.7%    33.3%     50%    66.7%    83.3%    100%
-----------------------------  --------  -------  -------  ------  -------  -------  ------
absdiff.sum.hh.median.income      -0.05     0.00     0.00    0.00     0.00     0.00    1.19
absdiff.sum.perc.hsp             -30.85    -2.55    -0.64    0.20     0.88     2.45   22.70
absdiff.sum.perc.wh              -15.80    -2.56    -0.37    0.05     0.64     2.44   56.14
edgecov.sum.dst                  -21.51    -0.10    -0.02    0.00     0.03     0.25   10.61
nodecov.sum.ln.pop.dens          -16.29    -0.34    -0.08    0.13     0.28     0.77    7.20
nodefactor.sum.in.cc.TRUE        -14.55    -1.13    -0.09    0.35     1.73     3.59   11.21
nodematch.sum.in.cc              -12.30    -2.34    -0.92    0.01     1.48     3.57   19.06
nodematch.sum.int.poly         -5131.57    -8.05    -3.69   -0.32     1.17     4.97   51.88
nonzero                         -202.55   -25.95   -14.39   -6.47    -4.63    -3.84    4.06
# in addition to models that just had an error, there are also many models where
# many coefficients had NA standard errors and p-values:

mss %>%
    group_by(region, fo) %>%
    mutate(na.p.vals =
               sum(is.na(p.value))) %>%
  ungroup() %>% select(1:2, na.p.vals) %>% distinct() %>% 
  head() %>% kable()
region                fo                 na.p.vals
-------------------   ----------------   ---------
00302-Knoxville       fo1_base                   0
02700-Biloxi          fo1_base                   6
32801-Waco            fo1_base                   1
24200-Kankakee        fo1_base                   6
00401-Winston-Salem   fo2_city.centers           2
09800-Auburn          fo1_base                   1
# I need to think more about this, but I think models where all the std. errors /
# p-values are NA did not converge in the estimation and could just need more
# iterations. For now I leave them out of the meta-analysis here.
ams <-
  mss %>%
  group_by(region, fo) %>%
  filter(!all(is.na(p.value))) %>%
  ungroup()
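
# If those non-converged fits are re-run later, one option is simply to allow more MCMLE
# iterations through ergm's control argument (sketch only; this belongs in the fitting
# script and the value here is arbitrary):
refit <- ergm::ergm(fo1_base, control = ergm::control.ergm(MCMLE.maxit = 60))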

ams %>%
  group_by(term) %>%
  tidyaux::quantiles.across.groups("estimate",
                          ntile = 1/6) %>% 
  kable(digits = 2)
term                                 0%    16.7%    33.3%     50%    66.7%    83.3%    100%
-----------------------------  --------  -------  -------  ------  -------  -------  ------
absdiff.sum.hh.median.income      -0.05     0.00     0.00    0.00     0.00     0.00    0.03
absdiff.sum.perc.hsp             -25.87    -2.22    -0.15    0.39     0.91     2.80    9.96
absdiff.sum.perc.wh              -15.80    -2.66    -0.35    0.02     0.64     2.19   14.69
edgecov.sum.dst                   -6.94    -0.06    -0.02    0.00     0.03     0.17   10.61
nodecov.sum.ln.pop.dens           -6.80    -0.23    -0.02    0.17     0.45     0.81    3.21
nodefactor.sum.in.cc.TRUE        -14.55    -1.37    -0.25    0.01     0.81     2.33   10.21
nodematch.sum.in.cc               -8.73    -2.81    -1.03   -0.19     0.77     3.32    8.48
nodematch.sum.int.poly         -5131.57    -8.00    -4.09   -0.58     1.30     4.94   51.88
nonzero                         -202.55   -25.93   -14.56   -7.09    -4.84    -4.28    4.06
#' There seems to be a lot of variation across regions, such that no coefficient is
#' consistently in the same direction.
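
#' A quick way to quantify that (sketch, not run here): the share of fits in which each
#' term's estimate is positive.
ams %>%
  group_by(term) %>%
  summarise(share.positive = mean(estimate > 0, na.rm = TRUE)) %>%
  kable(digits = 2)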

#' Distribution of terms:
ams %>%
  filter(term != "nonzero") %>%
  ggplot() +
  geom_histogram(aes(x = estimate,
                     fill = term),
                 position = "dodge"
                 ,bins = 10) +
  scale_x_continuous(limits = c(-40,40))
## Warning: Removed 2 rows containing non-finite values (stat_bin).

# just the interstate coefficient, across variations
ams %>%
  filter(grepl("int.poly", term)) %>%
  group_by(fo) %>%
  tidyaux::quantiles.across.groups("estimate", 1/6)
## # A tibble: 5 x 8
##   fo         `0%` `16.66667%` `33.33333%`   `50%` `66.66667%` `83.33333%` `100%`
##   <chr>     <dbl>       <dbl>       <dbl>   <dbl>       <dbl>       <dbl>  <dbl>
## 1 fo1_ba… -5.13e3      -10.4       -6.44  -1.18         1.46         3.31   7.86
## 2 fo2_ci… -2.27e1       -4.87      -4.31  -1.75        -0.521        3.97  51.9 
## 3 fo3_mu… -9.82e0       -4.10      -0.675  0.452        3.03         9.07  13.7 
## 4 fo4_cc… -1.50e1       -6.32      -3.97  -0.0767       1.07         3.65   6.98
## 5 fo5_cc… -1.19e1      -10.7       -7.37  -1.87         0.127        4.53   9.67

City-level analyses

For the visuals in each city below, the first map shows flows within the Place boundaries; the second is zoomed out to show all flows across the CZ.

Tampa

# tampa -------------------------------------------------------------------

tmp <- mss %>%
  filter(cz == "06700")

# all 5 model variations available for tampa
tmp %>% count(fo)
## # A tibble: 5 x 2
##   fo                         n
##   <chr>                  <int>
## 1 fo1_base                   6
## 2 fo2_city.centers           8
## 3 fo3_multiple.cc            7
## 4 fo4_ccs.pop.dens           9
## 5 fo5_ccs.pop.dens.decay     9
# get visuals
gdir <-
  "/scratch/gpfs/km31/ergms/cz-graphs/"
gh <- paste0(gdir, "cz-", tmp$region[1], ".rds")
gh <- readRDS(gh)

plc.map <- divseg::Wrapper_make.graph.map(gh,
                               cz = tmp$cz[1],
                               trim2plc = T)
## Loading required namespace: xwalks
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
plc.map

cz.map <- divseg::Wrapper_make.graph.map(gh,
                                          cz = tmp$cz[1],
                                          trim2plc = F)
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
cz.map

tmp %>% select(fo, aic) %>% distinct() %>% arrange(aic)
## # A tibble: 5 x 2
##   fo                          aic
##   <chr>                     <dbl>
## 1 fo5_ccs.pop.dens.decay  -79027.
## 2 fo2_city.centers        -76719.
## 3 fo3_multiple.cc         -71443.
## 4 fo4_ccs.pop.dens        345709.
## 5 fo1_base               6671792.
# coefficient estimates across models for Tampa
coef.scatter <-
  tmp %>%
  ggplot() +
  geom_point(aes(x = fo,
                 y = estimate,
                 color = term),
             alpha = .8, size = 3)

plotly::ggplotly(coef.scatter)
#' Some things I notice:
#'
#' - absdiff.income is consistently near 0 -- though that is likely because differences in
#' income are numerically so large compared to differences in percentages and
#' T/F dummies. Probably similar for distance in models 1-4
#'
#' - distance is only meaningfully away from zero in the first model, when no
#' Place/city-center characteristics are included
#'
#' - the int.poly coefficient tends to be ~negative~ in all of the first 4 models --
#' implying travel is more likely between places across the interstate from one
#' another...

# Model 5 is the only one where the int.poly coefficient has the expected sign. Also note
# that the edgecov distance term has a different meaning (and opposite expected
# sign) in model 5, because it uses decayed proximity rather than raw distance.
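
# A possible follow-up (sketch only; this belongs in the data-prep/fitting script, assumes
# the vertex attribute name used in the formulas above, and that the network package is
# attached): rescale income so its coefficient sits on a scale comparable to the
# percentage terms, e.g. measure household income in $10k units.
ergh %v% "hh.median.income.10k" <- (ergh %v% "hh.median.income") / 1e4
# absdiff("hh.median.income.10k") would then replace absdiff("hh.median.income")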

Minneapolis

# minneapolis -------------------------------------------------------------

tmp <- mss %>%
  filter(cz == "21501")

# all 5 model variations available 
tmp %>% count(fo)
## # A tibble: 5 x 2
##   fo                         n
##   <chr>                  <int>
## 1 fo1_base                   6
## 2 fo2_city.centers           8
## 3 fo3_multiple.cc            7
## 4 fo4_ccs.pop.dens           9
## 5 fo5_ccs.pop.dens.decay     9
# get visuals
gdir <-
  "/scratch/gpfs/km31/ergms/cz-graphs/"
gh <- paste0(gdir, "cz-", tmp$region[1], ".rds")
gh <- readRDS(gh)

plc.map <- divseg::Wrapper_make.graph.map(gh,
                                          cz = tmp$cz[1],
                                          trim2plc = T)
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
plc.map

cz.map <- divseg::Wrapper_make.graph.map(gh,
                                         cz = tmp$cz[1],
                                         trim2plc = F)
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
cz.map

# The maps seem to support our hypothesis, specifically about highways (the models, less so?)

tmp %>%
  group_by(term) %>%
  tidyaux::quantiles.across.groups("estimate", 1/6) %>% 
  kable(digits= 2)
term                                0%   16.7%   33.3%    50%   66.7%   83.3%   100%
-----------------------------  -------  ------  ------  -----  ------  ------  -----
absdiff.sum.hh.median.income      0.00    0.00    0.00   0.00    0.00    0.00   0.00
absdiff.sum.perc.hsp             -4.90   -4.38   -2.67   0.21    1.11    2.29   3.75
absdiff.sum.perc.wh              -0.30   -0.18    0.61   2.07    2.30    2.96   4.03
edgecov.sum.dst                  -0.04   -0.03   -0.01   0.00    0.01    0.58   1.72
nodecov.sum.ln.pop.dens           0.03    0.05    0.06   0.08    0.10    0.11   0.13
nodefactor.sum.in.cc.TRUE        -1.91   -1.38   -0.85  -0.32    0.27    0.87   1.47
nodematch.sum.in.cc              -2.10   -1.35   -0.61  -0.45   -0.30    1.83   3.97
nodematch.sum.int.poly            0.17    1.17    1.98   2.60    9.99   26.41  51.88
nonzero                         -44.28  -33.35  -21.69  -9.30   -8.12   -6.91  -5.68
# This time the int.poly coefficients are consistently positive (as expected)
coef.scatter <-
  tmp %>%
  ggplot() +
  geom_point(aes(x = fo,
                 y = estimate,
                 color = term),
             alpha = .8, size = 3)
library(plotly)
## 
## Attaching package: 'plotly'
## The following object is masked from 'package:igraph':
## 
##     groups
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## The following object is masked from 'package:stats':
## 
##     filter
## The following object is masked from 'package:graphics':
## 
##     layout
ggplotly(coef.scatter)
#' - the int.polygon tends to be ~positive~ in all models (as expected, but
#' unlike Tampa)

tmp %>%
  filter(p.value > .05 |
           is.na(p.value) )
## # A tibble: 2 x 10
##   region  fo     term  estimate std.error statistic p.value     aic cz       pop
##   <chr>   <chr>  <chr>    <dbl>     <dbl>     <dbl>   <dbl>   <dbl> <chr>  <int>
## 1 21501-… fo3_m… nonz…  -44.3     NA          NA    NA      -42574. 21501 3.19e6
## 2 21501-… fo5_c… node…    0.166    0.0992      1.67  0.0950 -40332. 21501 3.19e6
tmp %>% select(fo, aic) %>% distinct() %>% arrange(aic)
## # A tibble: 5 x 2
##   fo                         aic
##   <chr>                    <dbl>
## 1 fo3_multiple.cc        -42574.
## 2 fo5_ccs.pop.dens.decay -40332.
## 3 fo4_ccs.pop.dens       -27133.
## 4 fo1_base                -6886.
## 5 fo2_city.centers       911033.
# All models for Minneapolis avoid NA estimates and very high standard errors, and model
# 3 has the lowest AIC.

Memphis

# Memphis -----------------------------------------------------------------

tmp <- mss %>%
  filter(cz == "05202")

# all 5 model variations available
tmp %>% count(fo)
## # A tibble: 5 x 2
##   fo                         n
##   <chr>                  <int>
## 1 fo1_base                   6
## 2 fo2_city.centers           8
## 3 fo3_multiple.cc            7
## 4 fo4_ccs.pop.dens           9
## 5 fo5_ccs.pop.dens.decay     9
# get visuals
gdir <-
  "/scratch/gpfs/km31/ergms/cz-graphs/"
gh <- paste0(gdir, "cz-", tmp$region[1], ".rds")
gh <- readRDS(gh)

plc.map <- divseg::Wrapper_make.graph.map(gh,
                                          cz = tmp$cz[1],
                                          trim2plc = T)
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
plc.map

cz.map <- divseg::Wrapper_make.graph.map(gh,
                                         cz = tmp$cz[1],
                                         trim2plc = F)
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
cz.map

# Maps for Memphis seem to show the cordoning-off effect of interstates. The
# relative wastelands of travel in the north & south areas bounded by the
# peripheral interstate seem dramatic.


tmp %>%
  filter(p.value > .05 |
           is.na(p.value) )
## # A tibble: 0 x 10
## # … with 10 variables: region <chr>, fo <chr>, term <chr>, estimate <dbl>,
## #   std.error <dbl>, statistic <dbl>, p.value <dbl>, aic <dbl>, cz <chr>,
## #   pop <int>
tmp %>% select(fo, aic) %>% distinct() %>% arrange(aic)
## # A tibble: 5 x 2
##   fo                          aic
##   <chr>                     <dbl>
## 1 fo4_ccs.pop.dens         -4027.
## 2 fo3_multiple.cc           8312.
## 3 fo5_ccs.pop.dens.decay   44622.
## 4 fo2_city.centers         49633.
## 5 fo1_base               1258659.
# Estimation is consistent, with low p-values and no NA ones. Model 4 had the
# lowest AIC, followed by model 3.

coef.scatter <-
  tmp %>%
  ggplot() +
  geom_point(aes(x = fo,
                 y = estimate,
                 color = term),
             alpha = .8, size = 3)

plotly::ggplotly(coef.scatter)
# Models with a negative int.poly coefficient also seem to have worse AIC
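
# A quick cross-region check of that pattern (sketch, not run here): compare the sign of
# the int.poly estimate with each model's AIC rank within its CZ.
ams %>%
  filter(grepl("int.poly", term)) %>%
  group_by(cz) %>%
  mutate(aic.rank = rank(aic)) %>%
  ungroup() %>%
  mutate(negative.int.poly = estimate < 0) %>%
  count(negative.int.poly, aic.rank)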