I setup five varying model specifications and ran ERGMs over these variations. I ran each ERGM at the CZ level, and included all CZs that were divided into at least 2 parts by an interstate, so the hypothesis about interstate divisions would make sense. This left 408 eligible CZs.
My Della job iterated over eligible CZs and the 5 model variations, saving the prepped network data, the ergm model objects, and a visual generated for each one CZ. A separate script loaded all the model variations for all regions, extracted summary information and coefficient estimates, and saved this to a sparse .csv. This makes it easier to compare the model performance across regions and variations.
Frequently the model specifications cause an error. I haven’t gone through these cases in too much detail yet, but often seems to be because multiple arguments can be too similar and the models can’t converge.
Additionally, the Della job ran out of time. I figured it’d be good to do some analysis and maybe revise the model specifications before resuming and generating more CZs. Therefore, there’s currently 56 CZs for which models generated (another 64 where models were attempted but failed.)
The model variations I used are below. In these formula:
int.poly is the “interstate polygon” effect. Nodematch means this term estimates the effect of nodes/tracts being in the same interstate polygon (not being across an interstate from one another). None of the current model variations included non-interstate highways.
dst is distance. It’s raw distance in KM for all models, except model 5 which uses decayed proximity. (Signs for this term for model 5 have opposite meaning)
in.cc is whether or not a node/tract is in the “city center”, or the largest Place within the Commuting Zone. Nodematch with this term means the effect of both nodes being within/outside the city center vs across its boundaries, while nodefactor attributes an effect to being within the Place bounds.
Notably, for distance, there are at least a couple papers discussing the “spatial interaction function” to use for this kind of geographic network analysis. I haven’t used any of the variations suggested in thse papers yet, but this is discussed in:
Butts and Acton 2011 “Spatial Modeling of Social Networks” &
Daraganova et al 2012 " Modelling community network structures as the outcome of both spatial and network processes"
Will plan to consider more and implement.
# define formula variations -----------------------------------------------
fo1_base <- formula(
ergh ~ edges +
nodematch("int.poly", diff = FALSE) +
edgecov(dst.mat, "dst") +
#nodecov("pop", form="sum") +
absdiff("perc.wh", pow=1) +
absdiff("perc.hsp", pow=1) +
absdiff("hh.median.income", pow=1)
)
fo2_city.centers <- formula(
ergh ~ edges +
nodematch("int.poly", diff = FALSE) +
nodematch("in.cc") +
nodefactor("in.cc") +
edgecov(dst.mat, "dst") +
#nodecov("pop", form="sum") +
absdiff("perc.wh", pow=1) +
absdiff("perc.hsp", pow=1) +
absdiff("hh.median.income", pow=1)
)
fo3_multiple.cc <- formula(
ergh ~ edges +
nodematch("int.poly", diff = FALSE) +
nodematch("in.cc") +
# nodefactor("in.cc") +
edgecov(dst.mat, "dst") +
#nodecov("pop", form="sum") +
absdiff("perc.wh", pow=1) +
absdiff("perc.hsp", pow=1) +
absdiff("hh.median.income", pow=1)
)
fo4_ccs.pop.dens <-
formula(
ergh ~ edges +
nodematch("int.poly", diff = FALSE) +
nodematch("in.cc") +
nodefactor("in.cc") +
edgecov(dst.mat, "dst") +
nodecov("ln.pop.dens", form="sum") +
absdiff("perc.wh", pow=1) +
absdiff("perc.hsp", pow=1) +
absdiff("hh.median.income", pow=1)
)
fo5_ccs.pop.dens.decay <-
formula(
ergh ~ edges +
nodematch("int.poly", diff = FALSE) +
nodematch("in.cc") +
nodefactor("in.cc") +
edgecov(decay.mat, "dst") +
nodecov("ln.pop.dens", form="sum") +
absdiff("perc.wh", pow=1) +
absdiff("perc.hsp", pow=1) +
absdiff("hh.median.income", pow=1)
)This section does some mass analysis of all the models that have run without errors so far.
| region | cz | fo | n |
|---|---|---|---|
| 00100-Johnson City | 00100 | fo2_city.centers | 8 |
| 00302-Knoxville | 00302 | fo1_base | 6 |
| 00302-Knoxville | 00302 | fo3_multiple.cc | 7 |
| 00302-Knoxville | 00302 | fo4_ccs.pop.dens | 9 |
| 00302-Knoxville | 00302 | fo5_ccs.pop.dens.decay | 9 |
| 00401-Winston-Salem | 00401 | fo2_city.centers | 8 |
# some mass analysis ------------------------------------------------------
mss %>%
group_by(term) %>%
tidyaux::quantiles.across.groups("estimate",
ntile = 1/6) %>%
kable(digits = 2)| term | 0% | 16.66667% | 33.33333% | 50% | 66.66667% | 83.33333% | 100% |
|---|---|---|---|---|---|---|---|
| absdiff.sum.hh.median.income | -0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 1.19 |
| absdiff.sum.perc.hsp | -30.85 | -2.55 | -0.64 | 0.20 | 0.88 | 2.45 | 22.70 |
| absdiff.sum.perc.wh | -15.80 | -2.56 | -0.37 | 0.05 | 0.64 | 2.44 | 56.14 |
| edgecov.sum.dst | -21.51 | -0.10 | -0.02 | 0.00 | 0.03 | 0.25 | 10.61 |
| nodecov.sum.ln.pop.dens | -16.29 | -0.34 | -0.08 | 0.13 | 0.28 | 0.77 | 7.20 |
| nodefactor.sum.in.cc.TRUE | -14.55 | -1.13 | -0.09 | 0.35 | 1.73 | 3.59 | 11.21 |
| nodematch.sum.in.cc | -12.30 | -2.34 | -0.92 | 0.01 | 1.48 | 3.57 | 19.06 |
| nodematch.sum.int.poly | -5131.57 | -8.05 | -3.69 | -0.32 | 1.17 | 4.97 | 51.88 |
| nonzero | -202.55 | -25.95 | -14.39 | -6.47 | -4.63 | -3.84 | 4.06 |
# in addition to models that just had an error, there are also many models where
# many coefficients had NA standard errors and p-values:
mss %>%
group_by(region, fo) %>%
mutate(na.p.vals =
sum(is.na(p.value))) %>%
ungroup() %>% select(1:2, na.p.vals) %>% distinct() %>%
head() %>% kable()| region | fo | na.p.vals |
|---|---|---|
| 00302-Knoxville | fo1_base | 0 |
| 02700-Biloxi | fo1_base | 6 |
| 32801-Waco | fo1_base | 1 |
| 24200-Kankakee | fo1_base | 6 |
| 00401-Winston-Salem | fo2_city.centers | 2 |
| 09800-Auburn | fo1_base | 1 |
# i need to think more about this, but I think models where all the
# std.errors/p.values are NA did not converge in the estimation--- and could
# just need more iterations. But I can leave out of meta analysis here
ams <-
mss %>%
group_by(region, fo) %>%
filter(!all(is.na(p.value))) %>%
ungroup()
ams %>%
group_by(term) %>%
tidyaux::quantiles.across.groups("estimate",
ntile = 1/6) %>%
kable(digits = 2)| term | 0% | 16.66667% | 33.33333% | 50% | 66.66667% | 83.33333% | 100% |
|---|---|---|---|---|---|---|---|
| absdiff.sum.hh.median.income | -0.05 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 |
| absdiff.sum.perc.hsp | -25.87 | -2.22 | -0.15 | 0.39 | 0.91 | 2.80 | 9.96 |
| absdiff.sum.perc.wh | -15.80 | -2.66 | -0.35 | 0.02 | 0.64 | 2.19 | 14.69 |
| edgecov.sum.dst | -6.94 | -0.06 | -0.02 | 0.00 | 0.03 | 0.17 | 10.61 |
| nodecov.sum.ln.pop.dens | -6.80 | -0.23 | -0.02 | 0.17 | 0.45 | 0.81 | 3.21 |
| nodefactor.sum.in.cc.TRUE | -14.55 | -1.37 | -0.25 | 0.01 | 0.81 | 2.33 | 10.21 |
| nodematch.sum.in.cc | -8.73 | -2.81 | -1.03 | -0.19 | 0.77 | 3.32 | 8.48 |
| nodematch.sum.int.poly | -5131.57 | -8.00 | -4.09 | -0.58 | 1.30 | 4.94 | 51.88 |
| nonzero | -202.55 | -25.93 | -14.56 | -7.09 | -4.84 | -4.28 | 4.06 |
#' seems to be a lot of variation across regions, so that no coefficients are
#' consistently in the same direction.
#' Distribution of terms:
ams %>%
filter(term != "nonzero") %>%
ggplot() +
geom_histogram(aes(x = estimate,
fill = term),
position = "dodge"
,bins = 10) +
scale_x_continuous(limits = c(-40,40))## Warning: Removed 2 rows containing non-finite values (stat_bin).
# just the interstate coefficient, across variations
ams %>%
filter(grepl("int.poly", term)) %>%
group_by(fo) %>%
tidyaux::quantiles.across.groups("estimate", 1/6)## # A tibble: 5 x 8
## fo `0%` `16.66667%` `33.33333%` `50%` `66.66667%` `83.33333%` `100%`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 fo1_ba… -5.13e3 -10.4 -6.44 -1.18 1.46 3.31 7.86
## 2 fo2_ci… -2.27e1 -4.87 -4.31 -1.75 -0.521 3.97 51.9
## 3 fo3_mu… -9.82e0 -4.10 -0.675 0.452 3.03 9.07 13.7
## 4 fo4_cc… -1.50e1 -6.32 -3.97 -0.0767 1.07 3.65 6.98
## 5 fo5_cc… -1.19e1 -10.7 -7.37 -1.87 0.127 4.53 9.67
For the visuals here, the first shows flows within the Place boundaries; the second is zoomed out to show all CZ flows.
# tampa -------------------------------------------------------------------
tmp <- mss %>%
filter(cz == "06700")
# all 5 model variations available for tampa
tmp %>% count(fo)## # A tibble: 5 x 2
## fo n
## <chr> <int>
## 1 fo1_base 6
## 2 fo2_city.centers 8
## 3 fo3_multiple.cc 7
## 4 fo4_ccs.pop.dens 9
## 5 fo5_ccs.pop.dens.decay 9
# get visuals
gdir <-
"/scratch/gpfs/km31/ergms/cz-graphs/"
gh <- paste0(gdir, "cz-", tmp$region[1], ".rds")
gh <- readRDS(gh)
plc.map <- divseg::Wrapper_make.graph.map(gh,
cz = tmp$cz[1],
trim2plc = T)## Loading required namespace: xwalks
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
## # A tibble: 5 x 2
## fo aic
## <chr> <dbl>
## 1 fo5_ccs.pop.dens.decay -79027.
## 2 fo2_city.centers -76719.
## 3 fo3_multiple.cc -71443.
## 4 fo4_ccs.pop.dens 345709.
## 5 fo1_base 6671792.
# coefficient estimates across models for Tampa
coef.scatter <-
tmp %>%
ggplot() +
geom_point(aes(x = fo,
y = estimate,
color = term),
alpha = .8, size = 3)
plotly::ggplotly(coef.scatter)#' some things i notice:
#'
#' - absdiff.income consistently near 0 -- though likely b/c differences in
#' income are so large compared to differences in percent and
#' T/F dummies. Probably similar for distance in models 1-4
#'
#' - distance is only not near zero in first model, when no Place/inner city
#' characteristics are considered
#'
#' - the int.polygon tends to be ~negative~ in all of the first 4 models--
#' meaning travel is more likely to places across the interstate from one
#' another...
# model 5 is only one that has expected value for the int poly coef. Also note
# that the edgecov.distance term has different meaning (and opposite expected
# sign) for model #5, because it is decayed proximity, rather than raw distance# minneapolis -------------------------------------------------------------
tmp <- mss %>%
filter(cz == "21501")
# all 5 model variations available
tmp %>% count(fo)## # A tibble: 5 x 2
## fo n
## <chr> <int>
## 1 fo1_base 6
## 2 fo2_city.centers 8
## 3 fo3_multiple.cc 7
## 4 fo4_ccs.pop.dens 9
## 5 fo5_ccs.pop.dens.decay 9
# get visuals
gdir <-
"/scratch/gpfs/km31/ergms/cz-graphs/"
gh <- paste0(gdir, "cz-", tmp$region[1], ".rds")
gh <- readRDS(gh)
plc.map <- divseg::Wrapper_make.graph.map(gh,
cz = tmp$cz[1],
trim2plc = T)## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
# feel like the maps support our hypothesis, specifically about hwys (less so the models..?)
tmp %>%
group_by(term) %>%
tidyaux::quantiles.across.groups("estimate", 1/6) %>%
kable(digits= 2)| term | 0% | 16.66667% | 33.33333% | 50% | 66.66667% | 83.33333% | 100% |
|---|---|---|---|---|---|---|---|
| absdiff.sum.hh.median.income | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
| absdiff.sum.perc.hsp | -4.90 | -4.38 | -2.67 | 0.21 | 1.11 | 2.29 | 3.75 |
| absdiff.sum.perc.wh | -0.30 | -0.18 | 0.61 | 2.07 | 2.30 | 2.96 | 4.03 |
| edgecov.sum.dst | -0.04 | -0.03 | -0.01 | 0.00 | 0.01 | 0.58 | 1.72 |
| nodecov.sum.ln.pop.dens | 0.03 | 0.05 | 0.06 | 0.08 | 0.10 | 0.11 | 0.13 |
| nodefactor.sum.in.cc.TRUE | -1.91 | -1.38 | -0.85 | -0.32 | 0.27 | 0.87 | 1.47 |
| nodematch.sum.in.cc | -2.10 | -1.35 | -0.61 | -0.45 | -0.30 | 1.83 | 3.97 |
| nodematch.sum.int.poly | 0.17 | 1.17 | 1.98 | 2.60 | 9.99 | 26.41 | 51.88 |
| nonzero | -44.28 | -33.35 | -21.69 | -9.30 | -8.12 | -6.91 | -5.68 |
# this time the int polys are consistently positive (expected)
coef.scatter <-
tmp %>%
ggplot() +
geom_point(aes(x = fo,
y = estimate,
color = term),
alpha = .8, size = 3)
library(plotly)##
## Attaching package: 'plotly'
## The following object is masked from 'package:igraph':
##
## groups
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#' - the int.polygon tends to be ~positive~ in all models (as expected, but
#' unlike Tampa)
tmp %>%
filter(p.value > .05 |
is.na(p.value) )## # A tibble: 2 x 10
## region fo term estimate std.error statistic p.value aic cz pop
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <int>
## 1 21501-… fo3_m… nonz… -44.3 NA NA NA -42574. 21501 3.19e6
## 2 21501-… fo5_c… node… 0.166 0.0992 1.67 0.0950 -40332. 21501 3.19e6
## # A tibble: 5 x 2
## fo aic
## <chr> <dbl>
## 1 fo3_multiple.cc -42574.
## 2 fo5_ccs.pop.dens.decay -40332.
## 3 fo4_ccs.pop.dens -27133.
## 4 fo1_base -6886.
## 5 fo2_city.centers 911033.
# Memphis -----------------------------------------------------------------
tmp <- mss %>%
filter(cz == "05202")
# all 5 model variations available
tmp %>% count(fo)## # A tibble: 5 x 2
## fo n
## <chr> <int>
## 1 fo1_base 6
## 2 fo2_city.centers 8
## 3 fo3_multiple.cc 7
## 4 fo4_ccs.pop.dens 9
## 5 fo5_ccs.pop.dens.decay 9
# get visuals
gdir <-
"/scratch/gpfs/km31/ergms/cz-graphs/"
gh <- paste0(gdir, "cz-", tmp$region[1], ".rds")
gh <- readRDS(gh)
plc.map <- divseg::Wrapper_make.graph.map(gh,
cz = tmp$cz[1],
trim2plc = T)## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
## Checking if spatial network structure is valid...
## Spatial network structure is valid
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
# Maps for Memphis seem to show cordoning-off effect of interstates. The
# relative wastelands of travel in the north & south areas bounded by the
# peripheral interstate seem dramatic.
tmp %>%
filter(p.value > .05 |
is.na(p.value) )## # A tibble: 0 x 10
## # … with 10 variables: region <chr>, fo <chr>, term <chr>, estimate <dbl>,
## # std.error <dbl>, statistic <dbl>, p.value <dbl>, aic <dbl>, cz <chr>,
## # pop <int>
## # A tibble: 5 x 2
## fo aic
## <chr> <dbl>
## 1 fo4_ccs.pop.dens -4027.
## 2 fo3_multiple.cc 8312.
## 3 fo5_ccs.pop.dens.decay 44622.
## 4 fo2_city.centers 49633.
## 5 fo1_base 1258659.