Statistical_Network_Final

Statistical Network Final Project

First we call the required libraries:

library(readxl)
library(igraph)

## 
## Attaching package: 'igraph'

## The following objects are masked from 'package:stats':
## 
##     decompose, spectrum

## The following object is masked from 'package:base':
## 
##     union

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:igraph':
## 
##     as_data_frame, groups, union

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ergm)

## Loading required package: network

## 
## 'network' 1.18.2 (2023-12-04), part of the Statnet Project
## * 'news(package="network")' for changes since last version
## * 'citation("network")' for citation information
## * 'https://statnet.org' for help, support, and other information

## 
## Attaching package: 'network'

## The following objects are masked from 'package:igraph':
## 
##     %c%, %s%, add.edges, add.vertices, delete.edges, delete.vertices,
##     get.edge.attribute, get.edges, get.vertex.attribute, is.bipartite,
##     is.directed, list.edge.attributes, list.vertex.attributes,
##     set.edge.attribute, set.vertex.attribute

## 
## 'ergm' 4.6.0 (2023-12-17), part of the Statnet Project
## * 'news(package="ergm")' for changes since last version
## * 'citation("ergm")' for citation information
## * 'https://statnet.org' for help, support, and other information

## 'ergm' 4 is a major update that introduces some backwards-incompatible
## changes. Please type 'news(package="ergm")' for a list of major
## changes.

library(network)
library(intergraph)
library(ggplot2)

library(broom)

Read Data

The data describes commuter mobility between regions in Brazil. Each observation represents a commuting flow from one municipality or region (the source) to another (the target). The variable Commuters captures the size of this flow, indicating how many people travel between the two regions.

In addition to the number of commuters, the dataset includes socioeconomic characteristics for both the origin and destination. These are recorded as paired variables, such as GDP source and GDP target, which describe the economic conditions of each region involved in the flow.Other variables such as sanitation, water access, waste collection, income, unemployment, population, and geographic location are also available for each region.

Below you can see the columns we have in the dataset:

data <- read_excel("~/Statistical_Network_Project/data.xlsx")

colnames(data)

##  [1] "source"                                
##  [2] "target"                                
##  [3] "Commuters"                             
##  [4] "GDP source"                            
##  [5] "GDP Target"                            
##  [6] "Regular waste collection source"       
##  [7] "Regular waste collection target"       
##  [8] "Irregular waste collection source"     
##  [9] "Irregular waste collection target"     
## [10] "Unsafe water source"                   
## [11] "Unsafe water target"                   
## [12] "inadequate sanitation source"          
## [13] "inadequate sanitation target"          
## [14] "Latitude source"                       
## [15] "Longitude source"                      
## [16] "Latitude target"                       
## [17] "Longitude target"                      
## [18] "Distance Haversine (km)"               
## [19] "Mean household income per capit source"
## [20] "Mean household income per capit target"
## [21] "income < 0.5 minimum wage source"      
## [22] "income < 0.5 minimum wage target"      
## [23] "Unemployment rate source"              
## [24] "Unemployment rate target"              
## [25] "Population source"                     
## [26] "Population target"

We first categorize some of the attributes:

To see the distribution of GDP in different regions, we plotted the histograms of GDP at the source and target nodes. The distributions are highly right-skewed, indicating that very small number of regions have very high GDP values, while the majority of regions have very low GDP:

par(mfrow = c(1, 2))  # 1 row, 2 columns

hist(data$`GDP Target`, main = "GDP Target")
hist(data$`GDP source`, main = "GDP Source")

Given the strong right-skewness observed in the GDP distributions, a logarithmic transformation was first applied to reduce the influence of extreme values. Specifically, GDP values from both source and target regions were combined and transformed using a base-10 logarithm.

To ensure consistency between origin and destination nodes, quantile breaks were computed on the pooled (combined) log-transformed GDP values. This guarantees that both GDP source and GDP target are categorized using the same thresholds, making them directly comparable within the network.

The log-transformed GDP values were then divided into four categories based on quartiles: Low, Lower-Medium, Upper-Medium, and High. This results in balanced groups while preserving the relative ranking of regions in terms of economic activity.

This preprocessing step,resulting into categorical variables improve model stability and interpretability.

all_gdp <- c(data$`GDP source`, data$`GDP Target`)

log_all_gdp <- log10(all_gdp)
breaks <- quantile(log_all_gdp, probs = seq(0, 1, 0.25), na.rm = TRUE)


data$GDP_source_cat <- cut(log10(data$`GDP source`), breaks = breaks, include.lowest = TRUE,
  labels = c("Low", "Lower-Medium", "Upper-Medium", "High"))

data$GDP_target_cat <- cut( log10(data$`GDP Target`), breaks = breaks, include.lowest = TRUE,
  labels = c("Low", "Lower-Medium", "Upper-Medium", "High"))

We apply the same pre-processing step for the population variable. The distribution of population across regions is highly right-skewed. To reduce this skewness, population values from both source and target regions were first combined and transformed using a base-10 logarithm.

Quantile breaks were then calculated and based on these breaks, population was divided into four categories: Small, Medium, Large, and Very Large.

This approach ensures balanced group sizes while maintaining comparability between origin and destination nodes.

all_pop <- c(data$`Population source`, data$`Population target`)

log_pop <- log10(all_pop)
pop_breaks <- quantile(log_pop, probs = seq(0, 1, 0.25), na.rm = TRUE)

data$pop_source_cat <- cut( log10(data$`Population source`), breaks = pop_breaks,
  labels = c("Small", "Medium", "Large", "Very Large"), include.lowest = TRUE)

data$pop_target_cat <- cut( log10(data$`Population target`), breaks = pop_breaks,
  labels = c("Small", "Medium", "Large", "Very Large"), include.lowest = TRUE)

Mean household income per capita was processed analogously to GDP and population:

all_income <- c( data$`Mean household income per capit source`, data$`Mean household income per capit target`)
log_income <- log10(all_income)

income_breaks <- quantile(log_income, probs = seq(0, 1, 0.25), na.rm = TRUE)

data$income_source_cat <- cut(
  log10(data$`Mean household income per capit source`), breaks = income_breaks,
  labels = c("Low", "Lower-Medium", "Upper-Medium", "High"), include.lowest = TRUE)

data$income_target_cat <- cut( log10(data$`Mean household income per capit target`),
  breaks = income_breaks, labels = c("Low", "Lower-Medium", "Upper-Medium", "High"), include.lowest = TRUE)

To reduce model instability and prevent computational issues in the ERGM, the continuous service-access variables (regular and irregular waste collection, unsafe water, and inadequate sanitation) were transformed into categorical variables. Since all variables were proportions bounded between 0 and 1, they were discretized using substantively meaningful thresholds rather than data-driven quantiles. Specifically, values were grouped into three categories—Low (0–0.33), Medium (0.33–0.66), and High (0.66–1)—to ensure interpretability and improve model convergence. This categorization reduces the number of unique values, stabilizes estimation, and facilitates the analysis of homophily and mixing patterns in the commuter network using ERGM.

cuts <- c(0, 0.25, 0.5, 0.75, 2)
labels <- c("Low", "Moderate", "High", "Very High")

# Regular waste
data$regw_source_cat <- cut(data$`Regular waste collection source`, breaks = cuts, include.lowest = TRUE, labels = labels)

data$regw_target_cat <- cut(data$`Regular waste collection target`, breaks = cuts, include.lowest = TRUE, labels = labels)

# Irregular waste
data$irregw_source_cat <- cut(data$`Irregular waste collection source`, breaks = cuts, include.lowest = TRUE, labels = labels)

data$irregw_target_cat <- cut(data$`Irregular waste collection target`, breaks = cuts, include.lowest = TRUE, labels = labels)

# Unsafe water
data$uw_source_cat <- cut(data$`Unsafe water source`, breaks = cuts, include.lowest = TRUE, labels = labels)

data$uw_target_cat <- cut(data$`Unsafe water target`, breaks = cuts, include.lowest = TRUE, labels = labels)

# Sanitation
data$san_source_cat <- cut(data$`inadequate sanitation source`, breaks = cuts, include.lowest = TRUE, labels = labels)

data$san_target_cat <- cut(data$`inadequate sanitation target`, breaks = cuts, include.lowest = TRUE, labels = labels)

After preprocessing these node attributes, the commuter network was constructed as a directed weighted graph. First, the edge list was created by selecting the variables source, target, and number of commuters (as the edge weight).

Next, node-level attributes were assembled separately for each region. For each node, the dataset includes categorized GDP, categorized population, categorized household income per capita, and several infrastructure-related indicators, including regular waste collection, irregular waste collection, unsafe water, and inadequate sanitation.

Finally, the network was created where each node represents a region and each directed edge represents a commuter flow from a source region to a target region, and each edge’s weight corresponds to the total number of commuters traveling along that connection.

The resulting object, network_All, is therefore a directed and weighted commuter network that can be used for descriptive network analysis and as the basis for subsequent ERGM modeling.

edges <- data %>%
  select(source, target, Commuters) %>%
  group_by(source, target) %>%
  summarise(weight = sum(Commuters, na.rm = TRUE), .groups = "drop") %>%
  mutate( source = as.character(source), target = as.character(target))


nodes_source <- data %>%
  select(node = source,
    GDP_cat = GDP_source_cat,
    regular_waste = `regw_source_cat`,
    irregular_waste = `irregw_source_cat`,
    unsafe_water = `uw_source_cat`,
    inadequate_sanitation = `san_source_cat`,
    population = `pop_source_cat`,
    household_income = `income_source_cat`
    ) %>% mutate(node = as.character(node)) %>% distinct()



nodes_target <- data %>%
  select(node = target,
    GDP_cat = GDP_target_cat,
    regular_waste = `regw_target_cat`,
    irregular_waste = `irregw_target_cat`,
    unsafe_water = `uw_target_cat`,
    inadequate_sanitation = `san_target_cat`,
    population = `pop_target_cat`,
    household_income = `income_target_cat`
    ) %>% mutate(node = as.character(node)) %>% distinct()


nodes <- bind_rows(nodes_source, nodes_target) %>%
  group_by(node) %>%
  summarise(GDP_cat = first(na.omit(GDP_cat)),
    regular_waste = first(na.omit(regular_waste)),
    irregular_waste = first(na.omit(irregular_waste)),
    unsafe_water = first(na.omit(unsafe_water)),
    inadequate_sanitation = first(na.omit(inadequate_sanitation)),
    population = first(na.omit(population)),
    household_income = first(na.omit(household_income)),
    .groups = "drop")


network_All <- graph_from_data_frame( d = edges, vertices = nodes,directed = TRUE)


E(network_All)$weight <- edges$weight

Basic structural properties of the resulting network were then examined. We confirm that the network is directed, meaning that commuter flows have a defined origin and destination. We make sure that all relevant variables—such as socioeconomic categories and edge weights—were successfully transferred during the conversion.

This table summarizes node-level attributes in the commuter network, where each node (region) is characterized by population size and categorized socioeconomic and service-access indicators that will be used in the ERGM analysis:

net <- intergraph::asNetwork(network_All)
net

##  Network attributes:
##   vertices = 5551 
##   directed = TRUE 
##   hyper = FALSE 
##   loops = FALSE 
##   multiple = FALSE 
##   bipartite = FALSE 
##   total edges= 55244 
##     missing edges= 0 
##     non-missing edges= 55244 
## 
##  Vertex attribute names: 
##     GDP_cat household_income inadequate_sanitation irregular_waste population regular_waste unsafe_water vertex.names 
## 
##  Edge attribute names not shown

#is.directed(net)
#list.vertex.attributes(net)
#network.size(net)
#network.edgecount(net)

head(data.frame(
  node = net %v% "vertex.names",
  population = net %v% "population",
  GDP_cat = net %v% "GDP_cat",
  household_income = net %v% "household_income",
  regular_waste = net %v% "regular_waste",
  irregular_waste = net %v% "irregular_waste",
  unsafe_water = net %v% "unsafe_water",
  inadequate_sanitation = net %v% "inadequate_sanitation"))

##      node population      GDP_cat household_income regular_waste
## 1 1100015      Large Upper-Medium     Lower-Medium          High
## 2 1100023      Large Upper-Medium     Upper-Medium     Very High
## 3 1100031      Small          Low     Lower-Medium      Moderate
## 4 1100049      Large Upper-Medium     Upper-Medium     Very High
## 5 1100056     Medium Lower-Medium     Lower-Medium     Very High
## 6 1100064     Medium Lower-Medium     Lower-Medium          High
##   irregular_waste unsafe_water inadequate_sanitation
## 1        Moderate    Very High             Very High
## 2             Low         High             Very High
## 3            High    Very High             Very High
## 4             Low     Moderate              Moderate
## 5             Low         High             Very High
## 6        Moderate     Moderate             Very High

We start with the baseline ERGM, which includes only the edges term. This term captures the overall tendency for edges (i.e., commuting connections) to exist in the network, without considering any node attributes or structural effects.

The estimated coefficient for edges is -6.32, which is statistically significant (p < 0.001). This large negative value indicates that the probability of a commuting link between any two regions is very low, meaning the network is extremely sparse.

In ERGM terms, the edges coefficient can be interpreted as the log-odds of a tie existing between two regions. This implies that, on average, the probability of a commuting connection between any randomly selected pair of regions is less than 0.2%, confirming that only a very small fraction of all possible connections are present.

m0 <- ergm(net ~ edges)

## Starting maximum pseudolikelihood estimation (MPLE):

## Obtaining the responsible dyads.

## Evaluating the predictor and response matrix.

## Maximizing the pseudolikelihood.

## Finished MPLE.

## Evaluating log-likelihood at the estimate.

summary(m0)

## Call:
## ergm(formula = net ~ edges)
## 
## Maximum Likelihood Results:
## 
##        Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges -6.321977   0.004258      0   -1485   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:   809090  on 30808049  degrees of freedom
##  
## AIC: 809092  BIC: 809107  (Smaller is better. MC Std. Err. = 0)

In the next step, we include other variable, one step at a time:

GDP

# This asks whether nodes in some GDP categories are generally more likely to be involved in ties.
m1 <- ergm(net ~ edges + nodefactor("GDP_cat"))

## Starting maximum pseudolikelihood estimation (MPLE):

## Obtaining the responsible dyads.

## Evaluating the predictor and response matrix.

## Maximizing the pseudolikelihood.

## Finished MPLE.

## Evaluating log-likelihood at the estimate.

summary(m1)

## Call:
## ergm(formula = net ~ edges + nodefactor("GDP_cat"))
## 
## Maximum Likelihood Results:
## 
##                                  Estimate Std. Error MCMC % z value Pr(>|z|)
## edges                           -3.285975   0.011404      0  -288.1   <1e-04
## nodefactor.GDP_cat.Low          -2.118783   0.008580      0  -246.9   <1e-04
## nodefactor.GDP_cat.Lower-Medium -1.735069   0.008578      0  -202.3   <1e-04
## nodefactor.GDP_cat.Upper-Medium -1.184662   0.008571      0  -138.2   <1e-04
##                                    
## edges                           ***
## nodefactor.GDP_cat.Low          ***
## nodefactor.GDP_cat.Lower-Medium ***
## nodefactor.GDP_cat.Upper-Medium ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:   752359  on 30808046  degrees of freedom
##  
## AIC: 752367  BIC: 752428  (Smaller is better. MC Std. Err. = 0)

# This asks whether ties are more likely between nodes with the same GDP category.
m2 <- ergm(net ~ edges + nodematch("GDP_cat"))

## Starting maximum pseudolikelihood estimation (MPLE):
## Obtaining the responsible dyads.
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Evaluating log-likelihood at the estimate.

summary(m2)

## Call:
## ergm(formula = net ~ edges + nodematch("GDP_cat"))
## 
## Maximum Likelihood Results:
## 
##                    Estimate Std. Error MCMC %  z value Pr(>|z|)    
## edges             -6.266108   0.005091      0 -1230.72   <1e-04 ***
## nodematch.GDP_cat -0.175026   0.009288      0   -18.84   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:   808727  on 30808048  degrees of freedom
##  
## AIC: 808731  BIC: 808761  (Smaller is better. MC Std. Err. = 0)

m3 <- ergm(net ~ edges + nodefactor("GDP_cat") + nodematch("GDP_cat"))

## Starting maximum pseudolikelihood estimation (MPLE):
## Obtaining the responsible dyads.
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Evaluating log-likelihood at the estimate.

summary(m3)

## Call:
## ergm(formula = net ~ edges + nodefactor("GDP_cat") + nodematch("GDP_cat"))
## 
## Maximum Likelihood Results:
## 
##                                  Estimate Std. Error MCMC % z value Pr(>|z|)
## edges                           -3.353318   0.011421      0  -293.6   <1e-04
## nodefactor.GDP_cat.Low          -2.121583   0.008312      0  -255.2   <1e-04
## nodefactor.GDP_cat.Lower-Medium -1.737940   0.008311      0  -209.1   <1e-04
## nodefactor.GDP_cat.Upper-Medium -1.187682   0.008306      0  -143.0   <1e-04
## nodematch.GDP_cat                0.263028   0.009327      0    28.2   <1e-04
##                                    
## edges                           ***
## nodefactor.GDP_cat.Low          ***
## nodefactor.GDP_cat.Lower-Medium ***
## nodefactor.GDP_cat.Upper-Medium ***
## nodematch.GDP_cat               ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:   751592  on 30808045  degrees of freedom
##  
## AIC: 751602  BIC: 751679  (Smaller is better. MC Std. Err. = 0)

Model M1 shows that GDP is strongly related to tie formation. All GDP categories (Low, Lower-Medium, Upper-Medium) have negative and significant coefficients relative to the reference group (High GDP). Regions with lower GDP are less likely to be involved in commuter flows:

Low GDP (-2.12): much less likely to form ties, Lower-Medium (-1.74): also less likely, Upper-Medium (-1.18): less likely, but closer to High GDP regions.

This pattern shows a clear gradient: the higher the GDP, the more connected a region is in the commuter network. High-GDP regions act as central hubs, while lower-GDP regions are more peripheral and participate in fewer commuting connections.

Model M2 tests whether regions with the same GDP category are more likely to be connected. The negative coefficient (-6.26) suggests no homophily, in fact, ties are slightly less likely between regions with similar GDP levels. and regions with different GDP categories are connecting together (this makes sense as high-low is more common than low-low)

Model M3 combines both effects. The results confirm that:

Lower-GDP regions remain less likely to form ties (as in M1), but now the nodematch term becomes positive and significant, indicating homophily: regions with similar GDP are more likely to be connected, once differences in overall activity by GDP level are accounted for (I dont know why this is the case, maybe we should only pick 1 model from these 3).

Overall, this suggests that high-GDP regions dominate connectivity, and that similar-GDP regions tend to connect, but this similarity effect is only visible after controlling for differences in node activity.

We summarize M1 in this plot:

df_model1 <- data.frame(
  term = c("Low", "Lower-Medium", "Upper-Medium"),
  estimate = c(-2.118783, -1.735069, -1.184662),
  se = c(0.008580, 0.008578, 0.008571)
)

# 95% confidence intervals
df_model1$lower <- df_model1$estimate - 1.96 * df_model1$se
df_model1$upper <- df_model1$estimate + 1.96 * df_model1$se

# p-values
df_model1$pval <- c(0.00001, 0.00001, 0.00001)  # replace with real p-values

# significance stars
df_model1$stars <- ifelse(df_model1$pval < 0.001, "***",
                          ifelse(df_model1$pval < 0.01, "**",
                                 ifelse(df_model1$pval < 0.05, "*", "")))

# labels
df_model1$label <- paste0(round(df_model1$estimate, 2), df_model1$stars)

ggplot(df_model1, aes(x = term, y = estimate)) +
  geom_point(size = 3) +
  geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.2) +
  geom_text(aes(label = label), vjust = -1.2, size = 4) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  theme_minimal() +
  labs(
    x = "",
    y = "Log-odds (vs. High GDP reference)",
    title = "GDP Category Effects (ERGM)",
    caption = "Reference category: High GDP | *** p < 0.001, ** p < 0.01, * p < 0.05"
  )

AIC(m1, m2, m3)

##    df      AIC
## m1  4 752367.1
## m2  2 808730.7
## m3  5 751602.3

BIC(m1, m2, m3)

##    df      BIC
## m1  4 752428.1
## m2  2 808761.2
## m3  5 751678.5

Based on AIC, Model 3 (m3) is clearly the best model to report.

It has the lowest AIC (751,602.3) and lowest BIC (751,678.5) It improves substantially over both m1 and m2 The improvement over m1 shows that adding homophily (nodematch) provides additional explanatory power beyond node-level differences The large gap with m2 shows that node-level effects (nodefactor) are essential

Regular and Irregular waste

Model 4 examines the role of waste collection in shaping commuter ties. Municipalities with very high levels of service are significantly more likely to both send (Estimate = 0.27, p < 0.001) and, especially, receive commuters (Estimate = 0.83, p < 0.001), indicating their central role in the network. In contrast, municipalities with low (Estimate = -0.30, p < 0.001) and moderate (Estimate = -0.18, p < 0.001) service levels are significantly less likely to attract incoming commuters. For outgoing ties, moderate levels also show a small negative effect (Estimate = -0.04, p < 0.01), while low levels are only marginally significant (Estimate = -0.06, p ≈ 0.07). Additionally, the positive and significant homophily effect (Estimate = 0.70, p < 0.001) suggests that municipalities tend to form commuting ties with others that have similar levels of waste collection. Overall, the findings indicate that both service inequality and similarity jointly shape the structure of commuter networks.

Model 5 examines the role of irregular waste collection in shaping commuter ties. The results indicate that commuter flows are strongly associated with levels of irregular waste collection, but in a pattern that contrasts with regular service provision. Municipalities with low levels of irregular waste collection are significantly more likely to both send (Estimate = 0.31, p < 0.001) and, especially, receive commuters (Estimate = 1.02, p < 0.001), suggesting that these areas are highly central in the network. Municipalities with moderate levels also show positive effects for both sending (Estimate = 0.04, p < 0.01) and receiving (Estimate = 0.18, p < 0.001), although these effects are smaller. In contrast, municipalities with very high levels of irregular waste collection do not significantly differ in their likelihood of sending commuters (Estimate = -0.01, p = 0.69) and are slightly less likely to attract incoming commuters (Estimate = -0.11, p < 0.01). Additionally, the positive and highly significant homophily effect (Estimate = 0.70, p < 0.001) indicates that municipalities are more likely to form commuter ties with others that share similar levels of irregular waste collection. Overall, the findings suggest that municipalities with lower levels of irregular waste collection are more central in the commuter network, while similarity in service conditions continues to structure commuting patterns.

m4 <- ergm(net ~ edges +
       nodeofactor("regular_waste") +   # sender effect (source)
       nodeifactor("regular_waste") +   # receiver effect (target)
       nodematch("regular_waste"))

## Starting maximum pseudolikelihood estimation (MPLE):

## Obtaining the responsible dyads.

## Evaluating the predictor and response matrix.

## Maximizing the pseudolikelihood.

## Finished MPLE.

## Evaluating log-likelihood at the estimate.

summary(m4)

## Call:
## ergm(formula = net ~ edges + nodeofactor("regular_waste") + nodeifactor("regular_waste") + 
##     nodematch("regular_waste"))
## 
## Maximum Likelihood Results:
## 
##                                     Estimate Std. Error MCMC %  z value
## edges                               -7.28538    0.01239      0 -587.968
## nodeofactor.regular_waste.Low       -0.05661    0.03121      0   -1.814
## nodeofactor.regular_waste.Moderate  -0.04364    0.01569      0   -2.782
## nodeofactor.regular_waste.Very High  0.26691    0.01197      0   22.307
## nodeifactor.regular_waste.Low       -0.29939    0.04085      0   -7.328
## nodeifactor.regular_waste.Moderate  -0.18477    0.01885      0   -9.800
## nodeifactor.regular_waste.Very High  0.83489    0.01199      0   69.615
## nodematch.regular_waste              0.70053    0.01106      0   63.363
##                                     Pr(>|z|)    
## edges                                 <1e-04 ***
## nodeofactor.regular_waste.Low         0.0697 .  
## nodeofactor.regular_waste.Moderate    0.0054 ** 
## nodeofactor.regular_waste.Very High   <1e-04 ***
## nodeifactor.regular_waste.Low         <1e-04 ***
## nodeifactor.regular_waste.Moderate    <1e-04 ***
## nodeifactor.regular_waste.Very High   <1e-04 ***
## nodematch.regular_waste               <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:   782412  on 30808042  degrees of freedom
##  
## AIC: 782428  BIC: 782550  (Smaller is better. MC Std. Err. = 0)

m5 <- ergm(net ~ edges +
             nodeofactor("irregular_waste") +   # sender effect (source)
             nodeifactor("irregular_waste") +   # receiver effect (target)
             nodematch("irregular_waste"))

## Starting maximum pseudolikelihood estimation (MPLE):
## Obtaining the responsible dyads.
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Evaluating log-likelihood at the estimate.

summary(m5)

## Call:
## ergm(formula = net ~ edges + nodeofactor("irregular_waste") + 
##     nodeifactor("irregular_waste") + nodematch("irregular_waste"))
## 
## Maximum Likelihood Results:
## 
##                                       Estimate Std. Error MCMC %  z value
## edges                                 -7.51347    0.01830      0 -410.597
## nodeofactor.irregular_waste.Low        0.31009    0.01525      0   20.331
## nodeofactor.irregular_waste.Moderate   0.04328    0.01569      0    2.759
## nodeofactor.irregular_waste.Very High -0.01301    0.03224      0   -0.403
## nodeifactor.irregular_waste.Low        1.01933    0.01675      0   60.869
## nodeifactor.irregular_waste.Moderate   0.18427    0.01886      0    9.771
## nodeifactor.irregular_waste.Very High -0.11463    0.04226      0   -2.713
## nodematch.irregular_waste              0.70108    0.01106      0   63.415
##                                       Pr(>|z|)    
## edges                                  < 1e-04 ***
## nodeofactor.irregular_waste.Low        < 1e-04 ***
## nodeofactor.irregular_waste.Moderate   0.00580 ** 
## nodeofactor.irregular_waste.Very High  0.68663    
## nodeifactor.irregular_waste.Low        < 1e-04 ***
## nodeifactor.irregular_waste.Moderate   < 1e-04 ***
## nodeifactor.irregular_waste.Very High  0.00667 ** 
## nodematch.irregular_waste              < 1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:   782406  on 30808042  degrees of freedom
##  
## AIC: 782422  BIC: 782544  (Smaller is better. MC Std. Err. = 0)

Unsafe water and Inadequate sanitation

Model 6 examines the role of unsafe water access in shaping commuter ties. The results indicate that commuter flows are strongly associated with levels of unsafe water exposure. Municipalities with low levels of unsafe water are significantly more likely to both send (Estimate = 0.46, p < 0.001) and, especially, receive commuters (Estimate = 0.91, p < 0.001), suggesting that these areas are highly central in the network. Municipalities with moderate levels also show positive effects for both sending (Estimate = 0.10, p < 0.001) and receiving (Estimate = 0.13, p < 0.001), although these effects are smaller in magnitude. In contrast, municipalities with very high levels of unsafe water are less likely to send commuters (Estimate = -0.17, p < 0.001) and show no statistically significant difference in their likelihood of receiving commuters (Estimate = -0.06, p = 0.10). Additionally, the positive and highly significant homophily effect (Estimate = 0.46, p < 0.001) indicates that municipalities are more likely to form commuter ties with others that have similar levels of unsafe water exposure. Overall, the findings suggest that municipalities with better water conditions (lower unsafe water levels) are more central in the commuter network, while similarity in water access conditions continues to play an important role in structuring commuting patterns.

Model 7 examines the role of inadequate sanitation in shaping commuter ties, distinguishing between sender (origin), receiver (destination), and homophily effects. The results show that commuter flows are strongly structured by sanitation conditions. Municipalities with low levels of inadequate sanitation (i.e., better sanitation) are significantly more likely to both send (Estimate = 0.44, p < 0.001) and, especially, receive commuters (Estimate = 1.00, p < 0.001), indicating a central position in the network. Municipalities with moderate levels also show positive effects for sending (Estimate = 0.22, p < 0.001) and receiving (Estimate = 0.37, p < 0.001), though these are smaller in magnitude. In contrast, municipalities with very high levels of inadequate sanitation are significantly less likely to both send (Estimate = -0.19, p < 0.001) and receive commuters (Estimate = -0.44, p < 0.001), suggesting a more peripheral role.

Additionally, the positive and highly significant homophily effect (Estimate = 0.84, p < 0.001) indicates that municipalities are substantially more likely to form commuter ties with others that have similar levels of sanitation. This is the strongest homophily effect observed among the service variables, suggesting that sanitation similarity is a particularly important factor structuring the network.

Overall, the findings indicate that municipalities with better sanitation conditions are more central in the commuter network, while those with poor sanitation are more isolated, and that similarity in sanitation levels strongly reinforces commuting connections.

m6 <- ergm(net ~ edges +
                   nodeofactor("unsafe_water") +   # sender effect (source)
                   nodeifactor("unsafe_water") +   # receiver effect (target)
                   nodematch("unsafe_water"))

## Starting maximum pseudolikelihood estimation (MPLE):

## Obtaining the responsible dyads.

## Evaluating the predictor and response matrix.

## Maximizing the pseudolikelihood.

## Finished MPLE.

## Evaluating log-likelihood at the estimate.

summary(m6)

## Call:
## ergm(formula = net ~ edges + nodeofactor("unsafe_water") + nodeifactor("unsafe_water") + 
##     nodematch("unsafe_water"))
## 
## Maximum Likelihood Results:
## 
##                                    Estimate Std. Error MCMC %  z value Pr(>|z|)
## edges                              -7.34993    0.02012      0 -365.235   <1e-04
## nodeofactor.unsafe_water.Low        0.46086    0.01614      0   28.548   <1e-04
## nodeofactor.unsafe_water.Moderate   0.09723    0.01620      0    6.002   <1e-04
## nodeofactor.unsafe_water.Very High -0.17056    0.03472      0   -4.912   <1e-04
## nodeifactor.unsafe_water.Low        0.91276    0.01730      0   52.747   <1e-04
## nodeifactor.unsafe_water.Moderate   0.12743    0.01842      0    6.918   <1e-04
## nodeifactor.unsafe_water.Very High -0.06273    0.03814      0   -1.645      0.1
## nodematch.unsafe_water              0.46019    0.01025      0   44.887   <1e-04
##                                       
## edges                              ***
## nodeofactor.unsafe_water.Low       ***
## nodeofactor.unsafe_water.Moderate  ***
## nodeofactor.unsafe_water.Very High ***
## nodeifactor.unsafe_water.Low       ***
## nodeifactor.unsafe_water.Moderate  ***
## nodeifactor.unsafe_water.Very High .  
## nodematch.unsafe_water             ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:   789261  on 30808042  degrees of freedom
##  
## AIC: 789277  BIC: 789399  (Smaller is better. MC Std. Err. = 0)

m7 <- ergm(net ~ edges +
             nodeofactor("inadequate_sanitation") +   # sender effect (source)
             nodeifactor("inadequate_sanitation") +   # receiver effect (target)
             nodematch("inadequate_sanitation"))

## Starting maximum pseudolikelihood estimation (MPLE):
## Obtaining the responsible dyads.
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Evaluating log-likelihood at the estimate.

summary(m7)

## Call:
## ergm(formula = net ~ edges + nodeofactor("inadequate_sanitation") + 
##     nodeifactor("inadequate_sanitation") + nodematch("inadequate_sanitation"))
## 
## Maximum Likelihood Results:
## 
##                                              Estimate Std. Error MCMC % z value
## edges                                       -6.976177   0.013818      0 -504.87
## nodeofactor.inadequate_sanitation.Low        0.442399   0.013418      0   32.97
## nodeofactor.inadequate_sanitation.Moderate   0.216558   0.014363      0   15.08
## nodeofactor.inadequate_sanitation.Very High -0.190392   0.013522      0  -14.08
## nodeifactor.inadequate_sanitation.Low        0.995318   0.013292      0   74.88
## nodeifactor.inadequate_sanitation.Moderate   0.372303   0.014988      0   24.84
## nodeifactor.inadequate_sanitation.Very High -0.438764   0.015098      0  -29.06
## nodematch.inadequate_sanitation              0.842814   0.008962      0   94.04
##                                             Pr(>|z|)    
## edges                                         <1e-04 ***
## nodeofactor.inadequate_sanitation.Low         <1e-04 ***
## nodeofactor.inadequate_sanitation.Moderate    <1e-04 ***
## nodeofactor.inadequate_sanitation.Very High   <1e-04 ***
## nodeifactor.inadequate_sanitation.Low         <1e-04 ***
## nodeifactor.inadequate_sanitation.Moderate    <1e-04 ***
## nodeifactor.inadequate_sanitation.Very High   <1e-04 ***
## nodematch.inadequate_sanitation               <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:   775103  on 30808042  degrees of freedom
##  
## AIC: 775119  BIC: 775241  (Smaller is better. MC Std. Err. = 0)

Population and Hosehold income

Model m8 examines how population size affects tie formation and whether regions with similar population sizes are more likely to connect.

The nodefactor results show that, compared to the reference group (Large population), all other categories have negative and significant coefficients:

Small (-5.27) and Medium (-4.89) regions are much less likely to form ties, Very Large (-3.16) regions are also less likely, but still more connected than smaller regions.

This indicates that larger regions (especially “Large”) are the most central in the commuter network, while smaller regions are more peripheral.

The nodematch.population term is negative (-4.22), meaning that regions with the same population category are less likely to be connected. This suggests heterophily: commuter flows are more likely to occur between regions of different sizes (e.g., small → large).

Model m9 is for household income effects, A similar pattern appears for income. Compared to the reference group (High income), all other categories are less likely to form ties:

Low (-4.52) and Lower-Medium (-4.30) show the strongest negative effects, Upper-Medium (-3.88) is closer but still less connected.

This suggests that high-income regions are the most connected in the network.

The nodematch.household_income coefficient is also negative (-2.52), indicating that regions with similar income levels are less likely to be connected. Again, this points to heterophily, where commuting tends to occur between regions with different income levels.

m8 <- ergm(net ~ nodefactor("population") + nodematch("population"))

## Starting maximum pseudolikelihood estimation (MPLE):

## Obtaining the responsible dyads.

## Evaluating the predictor and response matrix.

## Maximizing the pseudolikelihood.

## Finished MPLE.

## Evaluating log-likelihood at the estimate.

summary(m8)

## Call:
## ergm(formula = net ~ nodefactor("population") + nodematch("population"))
## 
## Maximum Likelihood Results:
## 
##                                   Estimate Std. Error MCMC % z value Pr(>|z|)
## nodefactor.population.Medium     -4.890011   0.006040      0  -809.6   <1e-04
## nodefactor.population.Small      -5.273194   0.006031      0  -874.3   <1e-04
## nodefactor.population.Very Large -3.155108   0.006145      0  -513.5   <1e-04
## nodematch.population             -4.217652   0.007852      0  -537.2   <1e-04
##                                     
## nodefactor.population.Medium     ***
## nodefactor.population.Small      ***
## nodefactor.population.Very Large ***
## nodematch.population             ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:  1072739  on 30808046  degrees of freedom
##  
## AIC: 1072747  BIC: 1072808  (Smaller is better. MC Std. Err. = 0)

m9 <- ergm(net ~ nodefactor("household_income") + nodematch("household_income"))

## Starting maximum pseudolikelihood estimation (MPLE):
## Obtaining the responsible dyads.
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Evaluating log-likelihood at the estimate.

summary(m9)

## Call:
## ergm(formula = net ~ nodefactor("household_income") + nodematch("household_income"))
## 
## Maximum Likelihood Results:
## 
##                                           Estimate Std. Error MCMC % z value
## nodefactor.household_income.Low          -4.520312   0.006046      0  -747.6
## nodefactor.household_income.Lower-Medium -4.300754   0.006054      0  -710.4
## nodefactor.household_income.Upper-Medium -3.884398   0.006069      0  -640.0
## nodematch.household_income               -2.523247   0.006907      0  -365.3
##                                          Pr(>|z|)    
## nodefactor.household_income.Low            <1e-04 ***
## nodefactor.household_income.Lower-Medium   <1e-04 ***
## nodefactor.household_income.Upper-Medium   <1e-04 ***
## nodematch.household_income                 <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:  1023259  on 30808046  degrees of freedom
##  
## AIC: 1023267  BIC: 1023328  (Smaller is better. MC Std. Err. = 0)

GDP, Hosehold income and Population

This model examines how GDP, population, and household income jointly influence the likelihood of commuter ties. The negative and significant edges coefficient (-4.09) confirms that the network remains sparse, even after accounting for node characteristics.

Across all three variables, a consistent pattern emerges:

GDP: Compared to high-GDP regions (reference), all lower categories have negative coefficients, meaning they are less likely to be involved in ties. However, the effects are smaller than in earlier models, indicating that part of GDP’s effect overlaps with population and income. Population: Smaller regions (Small, Medium) are less likely to form ties, while Very Large regions have a positive coefficient (0.67), meaning they are more likely than large regions (reference) to be highly connected. This highlights the strong role of population size in driving connectivity. Household income: Lower-income regions are consistently less likely to form ties compared to high-income regions, again showing a gradient where more affluent regions are more connected.

Overall, the results show that more developed and larger regions (high GDP, high income, large population) are the most central in the commuter network. The reduced magnitude of the coefficients compared to simpler models suggests that these factors are correlated and jointly explain connectivity, rather than acting independently.

m10 <- ergm(net ~ edges + nodefactor("GDP_cat") +  nodefactor("population") + nodefactor("household_income") )

## Starting maximum pseudolikelihood estimation (MPLE):

## Obtaining the responsible dyads.

## Evaluating the predictor and response matrix.

## Maximizing the pseudolikelihood.

## Finished MPLE.

## Evaluating log-likelihood at the estimate.

summary(m10)

## Call:
## ergm(formula = net ~ edges + nodefactor("GDP_cat") + nodefactor("population") + 
##     nodefactor("household_income"))
## 
## Maximum Likelihood Results:
## 
##                                           Estimate Std. Error MCMC % z value
## edges                                    -4.089998   0.026450      0 -154.63
## nodefactor.GDP_cat.Low                   -0.752133   0.020698      0  -36.34
## nodefactor.GDP_cat.Lower-Medium          -0.630834   0.017406      0  -36.24
## nodefactor.GDP_cat.Upper-Medium          -0.454104   0.013713      0  -33.12
## nodefactor.population.Medium             -0.235760   0.010320      0  -22.84
## nodefactor.population.Small              -0.552149   0.013820      0  -39.95
## nodefactor.population.Very Large          0.665007   0.013221      0   50.30
## nodefactor.household_income.Low          -0.584784   0.011568      0  -50.55
## nodefactor.household_income.Lower-Medium -0.482806   0.009744      0  -49.55
## nodefactor.household_income.Upper-Medium -0.282481   0.009153      0  -30.86
##                                          Pr(>|z|)    
## edges                                      <1e-04 ***
## nodefactor.GDP_cat.Low                     <1e-04 ***
## nodefactor.GDP_cat.Lower-Medium            <1e-04 ***
## nodefactor.GDP_cat.Upper-Medium            <1e-04 ***
## nodefactor.population.Medium               <1e-04 ***
## nodefactor.population.Small                <1e-04 ***
## nodefactor.population.Very Large           <1e-04 ***
## nodefactor.household_income.Low            <1e-04 ***
## nodefactor.household_income.Lower-Medium   <1e-04 ***
## nodefactor.household_income.Upper-Medium   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:   745749  on 30808040  degrees of freedom
##  
## AIC: 745769  BIC: 745921  (Smaller is better. MC Std. Err. = 0)

nodecov.irregular_waste = NA meaning that there is perfect (or near-perfect) collinearity.

irregular_waste can be explained almost exactly by the other variables (especially regular_waste, and likely also unsafe_water / sanitation).

Since variables are all infrastructure indicators, they are likely strongly related:

High regular waste collection → low irregular waste High irregular waste → poor sanitation / unsafe water

This creates perfect linear dependence, which breaks estimation.

m11 <- ergm(net ~ edges + nodefactor("regular_waste") + nodefactor("irregular_waste") +
              nodefactor("unsafe_water") +  nodefactor("inadequate_sanitation"))

## Starting maximum pseudolikelihood estimation (MPLE):

## Obtaining the responsible dyads.

## Evaluating the predictor and response matrix.

## Maximizing the pseudolikelihood.

## Finished MPLE.

## Warning: Model statistics 'nodefactor.irregular_waste.Low' and
## 'nodefactor.irregular_waste.Very High' are linear combinations of some set of
## preceding statistics at the current stage of the estimation. This may indicate
## that the model is nonidentifiable.

## Evaluating log-likelihood at the estimate.

##

summary(m11)

## Call:
## ergm(formula = net ~ edges + nodefactor("regular_waste") + nodefactor("irregular_waste") + 
##     nodefactor("unsafe_water") + nodefactor("inadequate_sanitation"))
## 
## Maximum Likelihood Results:
## 
##                                             Estimate Std. Error MCMC % z value
## edges                                      -7.477216   0.633459      0 -11.804
## nodefactor.regular_waste.Low               -0.111029   0.317616      0  -0.350
## nodefactor.regular_waste.Moderate          -0.024845   0.316553      0  -0.078
## nodefactor.regular_waste.Very High          0.387335   0.316762      0   1.223
## nodefactor.irregular_waste.Low                    NA   0.000000      0      NA
## nodefactor.irregular_waste.Moderate         0.030654   0.316649      0   0.097
## nodefactor.irregular_waste.Very High              NA   0.000000      0      NA
## nodefactor.unsafe_water.Low                 0.366846   0.012928      0  28.377
## nodefactor.unsafe_water.Moderate            0.061673   0.012700      0   4.856
## nodefactor.unsafe_water.Very High          -0.071526   0.025929      0  -2.759
## nodefactor.inadequate_sanitation.Low        0.511921   0.010214      0  50.121
## nodefactor.inadequate_sanitation.Moderate   0.166681   0.010507      0  15.865
## nodefactor.inadequate_sanitation.Very High -0.179559   0.009984      0 -17.984
##                                            Pr(>|z|)    
## edges                                       < 1e-04 ***
## nodefactor.regular_waste.Low                0.72666    
## nodefactor.regular_waste.Moderate           0.93744    
## nodefactor.regular_waste.Very High          0.22141    
## nodefactor.irregular_waste.Low                   NA    
## nodefactor.irregular_waste.Moderate         0.92288    
## nodefactor.irregular_waste.Very High             NA    
## nodefactor.unsafe_water.Low                 < 1e-04 ***
## nodefactor.unsafe_water.Moderate            < 1e-04 ***
## nodefactor.unsafe_water.Very High           0.00581 ** 
## nodefactor.inadequate_sanitation.Low        < 1e-04 ***
## nodefactor.inadequate_sanitation.Moderate   < 1e-04 ***
## nodefactor.inadequate_sanitation.Very High  < 1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:   778620  on 30808037  degrees of freedom
##  
## AIC: 778646  BIC: 778844  (Smaller is better. MC Std. Err. = 0)

Model 11 examines the joint effect of multiple service-related factors—regular waste collection, irregular waste collection, unsafe water, and inadequate sanitation—on the formation of commuter ties, focusing on node-level differences in tie propensity.

The results show that regular waste collection and irregular waste collection do not have statistically significant effects once other variables are included. All coefficients for regular waste are insignificant (p > 0.2), suggesting that its effect observed in earlier models is absorbed by other correlated service indicators. For irregular waste, some categories are not estimated (NA), indicating issues such as redundancy or lack of variation, and the remaining category is also not significant (Estimate = 0.03, p = 0.92). This suggests that irregular waste does not independently explain commuter tie formation in the presence of other variables.

In contrast, unsafe water and inadequate sanitation remain strong and significant predictors. Municipalities with low levels of unsafe water are more likely to form ties (Estimate = 0.37, p < 0.001), while those with very high levels are less likely (Estimate = -0.07, p < 0.01). Similarly, municipalities with low inadequate sanitation (i.e., better sanitation) show a strong positive effect (Estimate = 0.51, p < 0.001), and those with very high inadequate sanitation are less likely to form ties (Estimate = -0.18, p < 0.001). Moderate levels of both variables also have positive and significant effects, though smaller in magnitude.

Overall, the findings indicate that when multiple service conditions are considered simultaneously, water and sanitation variables dominate in explaining commuter connectivity, while waste collection variables lose their independent explanatory power. This suggests that broader infrastructure and living condition factors, particularly water and sanitation, play a more central role in structuring commuting networks.

Final Model

m12 <- ergm(net ~ edges + nodefactor("GDP_cat") +  nodefactor("population") + nodefactor("household_income") +
            nodefactor("regular_waste") + 
              nodefactor("unsafe_water") +  nodefactor("inadequate_sanitation"))

## Starting maximum pseudolikelihood estimation (MPLE):

## Obtaining the responsible dyads.

## Evaluating the predictor and response matrix.

## Maximizing the pseudolikelihood.

## Finished MPLE.

## Evaluating log-likelihood at the estimate.

summary(m12)

## Call:
## ergm(formula = net ~ edges + nodefactor("GDP_cat") + nodefactor("population") + 
##     nodefactor("household_income") + nodefactor("regular_waste") + 
##     nodefactor("unsafe_water") + nodefactor("inadequate_sanitation"))
## 
## Maximum Likelihood Results:
## 
##                                             Estimate Std. Error MCMC %  z value
## edges                                      -5.217927   0.041481      0 -125.791
## nodefactor.GDP_cat.Low                     -0.567346   0.020947      0  -27.084
## nodefactor.GDP_cat.Lower-Medium            -0.477023   0.017533      0  -27.207
## nodefactor.GDP_cat.Upper-Medium            -0.397710   0.013726      0  -28.976
## nodefactor.population.Medium               -0.240646   0.010391      0  -23.159
## nodefactor.population.Small                -0.546197   0.013921      0  -39.235
## nodefactor.population.Very Large            0.648615   0.013178      0   49.219
## nodefactor.household_income.Low            -0.231155   0.013240      0  -17.458
## nodefactor.household_income.Lower-Medium   -0.350725   0.010112      0  -34.684
## nodefactor.household_income.Upper-Medium   -0.271974   0.009194      0  -29.582
## nodefactor.regular_waste.Low               -0.165211   0.025960      0   -6.364
## nodefactor.regular_waste.Moderate          -0.099971   0.012964      0   -7.712
## nodefactor.regular_waste.Very High          0.166190   0.010783      0   15.412
## nodefactor.unsafe_water.Low                 0.234061   0.013159      0   17.786
## nodefactor.unsafe_water.Moderate            0.102700   0.012724      0    8.071
## nodefactor.unsafe_water.Very High          -0.161593   0.025950      0   -6.227
## nodefactor.inadequate_sanitation.Low        0.213341   0.010852      0   19.658
## nodefactor.inadequate_sanitation.Moderate   0.096572   0.010625      0    9.089
## nodefactor.inadequate_sanitation.Very High -0.086048   0.010132      0   -8.493
##                                            Pr(>|z|)    
## edges                                        <1e-04 ***
## nodefactor.GDP_cat.Low                       <1e-04 ***
## nodefactor.GDP_cat.Lower-Medium              <1e-04 ***
## nodefactor.GDP_cat.Upper-Medium              <1e-04 ***
## nodefactor.population.Medium                 <1e-04 ***
## nodefactor.population.Small                  <1e-04 ***
## nodefactor.population.Very Large             <1e-04 ***
## nodefactor.household_income.Low              <1e-04 ***
## nodefactor.household_income.Lower-Medium     <1e-04 ***
## nodefactor.household_income.Upper-Medium     <1e-04 ***
## nodefactor.regular_waste.Low                 <1e-04 ***
## nodefactor.regular_waste.Moderate            <1e-04 ***
## nodefactor.regular_waste.Very High           <1e-04 ***
## nodefactor.unsafe_water.Low                  <1e-04 ***
## nodefactor.unsafe_water.Moderate             <1e-04 ***
## nodefactor.unsafe_water.Very High            <1e-04 ***
## nodefactor.inadequate_sanitation.Low         <1e-04 ***
## nodefactor.inadequate_sanitation.Moderate    <1e-04 ***
## nodefactor.inadequate_sanitation.Very High   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 42709026  on 30808050  degrees of freedom
##  Residual Deviance:   742138  on 30808031  degrees of freedom
##  
## AIC: 742176  BIC: 742465  (Smaller is better. MC Std. Err. = 0)

Model 12 examines the joint influence of economic, demographic, and service-related factors on the formation of commuter ties by including GDP, population size, household income, and infrastructure variables in a single specification. The negative and significant edges term (Estimate = -5.22, p < 0.001) indicates that ties remain relatively sparse, although less so than in previous models, reflecting improved explanatory power.

The results show strong and consistent effects of economic and demographic factors. Municipalities with lower GDP levels are significantly less likely to form ties (Low: -0.57; Lower-Medium: -0.48; Upper-Medium: -0.40; all p < 0.001), indicating that higher-GDP municipalities are more connected. Similarly, population size plays a key role: small (-0.55, p < 0.001) and medium (-0.24, p < 0.001) municipalities are less connected, while very large municipalities are substantially more likely to form ties (0.65, p < 0.001), highlighting their role as major hubs. A similar gradient is observed for household income, where lower-income categories are associated with fewer ties (all negative and significant), indicating that wealthier municipalities are more integrated into the commuter network.

Service-related variables remain important but with more moderate effects. Municipalities with very high regular waste collection are more likely to form ties (0.17, p < 0.001), while low (-0.17) and moderate (-0.10) levels reduce connectivity. For unsafe water, municipalities with better conditions (low: 0.23; moderate: 0.10; p < 0.001) are more connected, whereas very high unsafe water reduces tie formation (-0.16, p < 0.001). A similar pattern is observed for inadequate sanitation, where better sanitation (low: 0.21; moderate: 0.10; p < 0.001) increases connectivity, while very high inadequate sanitation decreases it (-0.09, p < 0.001).

Overall, the findings indicate that economic development and population size are the dominant drivers of commuter network structure, with larger, wealthier municipalities acting as central hubs. Infrastructure variables related to water and sanitation also play a significant but secondary role, reinforcing the importance of living conditions in shaping connectivity. The substantially improved model fit (lower AIC and residual deviance) suggests that combining economic, demographic, and service factors provides a more comprehensive explanation of commuting patterns.

df_12 <- tidy(m12)

df_12 <- df_12 %>%
  filter(term != "edges") %>%  # remove edges if you want
  mutate(term = gsub("nodefactor.", "", term))



ggplot(df_12, aes(x = estimate, y = reorder(term, estimate))) +
  geom_point() +
  geom_errorbarh(aes(xmin = estimate - 1.96*std.error,
                     xmax = estimate + 1.96*std.error),
                 height = 0.2) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  labs(
    title = "ERGM Coefficients (Model 12)",
    x = "Estimate (log-odds)",
    y = ""
  ) +
  theme_minimal()

Statistical_Network_Final_Project

2026-04-03