We’re back at it again with our district leader data from the Social Network Analysis and Education companion site! I was interested in the relationship between some factors that we didn’t see in combination during our case study: reciprocal ties between actors (i.e., mutual confidential exchanges between individuals), whether an individual worked at the district or school level, and efficacy.
Controlling for edges and reciprocity, are leaders who work at the district level, or who have high efficacy scores, more likely to send or receive a confidential exchange within this social network?
library(tidyverse)
library(readxl)
library(igraph)
library(tidygraph)
library(ggraph)
library(skimr)
library(janitor)
library(statnet)
I began by importing the edge matrix from School Leaders Data Chapter 9_d, then converting it into a matrix object:
leader_matrix <- read_excel("data/School Leaders Data Chapter 9_d.xlsx",
                            col_names = FALSE) |>
  clean_names()
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
leader_matrix <- leader_matrix |>
as.matrix()
No edge list is complete without its nodes, and in this case the node attributes are what we’ll draw on for our modeling:
leader_nodes <- read_excel("data/School Leaders Data Chapter 9_e.xlsx",
                           col_types = c("text", "numeric", "numeric",
                                         "numeric", "numeric")) |>
  clean_names()
I then dichotomized the values in the matrix, “flattening” ties into binary form based on their strength: values of 2 or less become 0 (no tie) and values of 3 or more become 1 (tie):
leader_matrix[leader_matrix <= 2] <- 0
leader_matrix[leader_matrix >= 3] <- 1
I then added row and column names to the matrix, drawing from the freshly-imported node attributes data:
rownames(leader_matrix) <- leader_nodes$id
colnames(leader_matrix) <- leader_nodes$id
The matrix then needed to be converted into a graph object for use with igraph:
adjacency_matrix <- graph_from_adjacency_matrix(leader_matrix,
                                                diag = FALSE)
I then made my edge list from that graph object:
leader_edges <- igraph::as_data_frame(adjacency_matrix) |>
  mutate(from = as.character(from),
         to = as.character(to))
These two objects came together to make the final graph object ready for analysis:
leader_graph <- tbl_graph(edges = leader_edges,
                          nodes = leader_nodes,
                          node_key = "id",
                          directed = FALSE)
I used mutate() to create new columns for in-, out-, and total degree, describing exchanges between actors in this network:
leader_measures <- leader_graph |>
activate(nodes) |>
mutate(in_degree = centrality_degree(mode = "in")) |>
mutate(out_degree = centrality_degree(mode = "out")) |>
mutate(all_degree = centrality_degree(mode = "all"))
I then converted these measures into a tibble for easier statistical analysis…
node_measures <- leader_measures |>
activate(nodes) |>
as_tibble()
summary(node_measures)
## id efficacy trust district_site
## Length:43 Min. :4.610 Min. :3.630 Min. :0.0000
## Class :character 1st Qu.:5.670 1st Qu.:4.130 1st Qu.:0.0000
## Mode :character Median :6.780 Median :4.780 Median :0.0000
## Mean :6.649 Mean :4.783 Mean :0.4186
## 3rd Qu.:7.470 3rd Qu.:5.440 3rd Qu.:1.0000
## Max. :8.500 Max. :5.880 Max. :1.0000
## male in_degree out_degree all_degree
## Min. :0.0000 Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:0.0000 1st Qu.: 3.000 1st Qu.: 3.000 1st Qu.: 3.000
## Median :0.0000 Median : 4.000 Median : 4.000 Median : 4.000
## Mean :0.4419 Mean : 6.651 Mean : 6.651 Mean : 6.651
## 3rd Qu.:1.0000 3rd Qu.: 9.500 3rd Qu.: 9.500 3rd Qu.: 9.500
## Max. :1.0000 Max. :27.000 Max. :27.000 Max. :27.000
…which let me compute descriptive stats for in-, out-, and total degree by district or school level, as well as for the node attribute of efficacy, since I planned to use it in my model:
node_measures |>
  group_by(district_site) |>
  summarise(n = n(),
            mean = mean(in_degree),
            sd = sd(in_degree))
## # A tibble: 2 × 4
## district_site n mean sd
## <dbl> <int> <dbl> <dbl>
## 1 0 25 4.8 4.34
## 2 1 18 9.22 6.67
node_measures |>
  group_by(district_site) |>
  summarise(n = n(),
            mean = mean(out_degree),
            sd = sd(out_degree))
## # A tibble: 2 × 4
## district_site n mean sd
## <dbl> <int> <dbl> <dbl>
## 1 0 25 4.8 4.34
## 2 1 18 9.22 6.67
node_measures |>
  group_by(district_site) |>
  summarise(n = n(),
            mean = mean(all_degree),
            sd = sd(all_degree))
## # A tibble: 2 × 4
## district_site n mean sd
## <dbl> <int> <dbl> <dbl>
## 1 0 25 4.8 4.34
## 2 1 18 9.22 6.67
node_measures |>
  group_by(district_site) |>
  summarise(n = n(),
            mean = mean(efficacy),
            sd = sd(efficacy))
## # A tibble: 2 × 4
## district_site n mean sd
## <dbl> <int> <dbl> <dbl>
## 1 0 25 7.02 0.953
## 2 1 18 6.13 1.12
leader_graph |>
  ggraph(layout = "kk") +
  geom_node_point(aes(colour = district_site, size = efficacy)) +
  geom_edge_link(colour = "grey", width = 0.2,
                 arrow = arrow(type = "closed", length = unit(1, 'mm'))) +
  theme_void()
This visualization highlighted a few of the properties I planned to examine with my ERGM: reciprocity (shown by the arrows), district- or school-level status, and efficacy.
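As a quick optional sanity check on reciprocity, igraph can also report the share of directed ties that are reciprocated (a small sketch using the igraph object created above):
igraph::reciprocity(adjacency_matrix)  # proportion of ties that are mutual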
I then converted my edge and node data into a network object so they could be used by statnet:
leader_network <- as.network(leader_edges,
                             vertices = leader_nodes)
Next, I looked at what a basic ERGM controlling for reciprocal ties would look like (running summary() on the formula first, to keep with best practices):
summary(leader_network ~ edges + mutual)
## edges mutual
## 143 36
set.seed(589)
ergm_1 <- ergm(leader_network ~ edges + mutual)
## Starting maximum pseudolikelihood estimation (MPLE):
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Starting Monte Carlo maximum likelihood estimation (MCMLE):
## Iteration 1 of at most 60:
## Optimizing with step length 1.0000.
## The log-likelihood improved by 0.2915.
## Estimating equations are not within tolerance region.
## Iteration 2 of at most 60:
## Optimizing with step length 1.0000.
## The log-likelihood improved by 0.0136.
## Convergence test p-value: 0.1862. Not converged with 99% confidence; increasing sample size.
## Iteration 3 of at most 60:
## Optimizing with step length 1.0000.
## The log-likelihood improved by 0.0049.
## Convergence test p-value: 0.0968. Not converged with 99% confidence; increasing sample size.
## Iteration 4 of at most 60:
## Optimizing with step length 1.0000.
## The log-likelihood improved by 0.0079.
## Convergence test p-value: < 0.0001. Converged with 99% confidence.
## Finished MCMLE.
## Evaluating log-likelihood at the estimate. Fitting the dyad-independent submodel...
## Bridging between the dyad-independent submodel and the full model...
## Setting up bridge sampling...
## Using 16 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 .
## Bridging finished.
## This model was fit using MCMC. To examine model diagnostics and check
## for degeneracy, use the mcmc.diagnostics() function.
summary(ergm_1)
## Call:
## ergm(formula = leader_network ~ edges + mutual)
##
## Monte Carlo Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -3.1101 0.1283 0 -24.23 <1e-04 ***
## mutual 3.1246 0.3003 0 10.40 <1e-04 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 2503.6 on 1806 degrees of freedom
## Residual Deviance: 891.8 on 1804 degrees of freedom
##
## AIC: 895.8 BIC: 906.8 (Smaller is better. MC Std. Err. = 0.7447)
It looks like the structural features of this network significantly influence its formation, which makes sense.
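To make those coefficients more concrete, the log-odds can be converted into tie probabilities (a small sketch using base R’s plogis() and the estimates above):
# baseline probability of a confidential tie
plogis(coef(ergm_1)["edges"])
# probability of a tie that would reciprocate an existing nomination
plogis(coef(ergm_1)["edges"] + coef(ergm_1)["mutual"])
With these estimates, a reciprocating tie is far more likely (around 0.5) than a baseline tie (under 0.05). But we’re here to look at how this plays out with district or school-level participation and efficacy!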
District site & efficacy
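The output below comes from a second model, ergm_2, which adds a GWESP term for transitivity along with the district_site and efficacy attributes. The fitting code isn’t shown above, but based on the call in the output it looks roughly like this (the seed here is an assumption, not the original):
set.seed(589)  # assumed seed; the original isn't shown
ergm_2 <- ergm(leader_network ~ edges + mutual +
                 gwesp(0.25, fixed = TRUE) +
                 nodefactor("district_site") +
                 nodecov("efficacy"))
summary(ergm_2)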
## Call:
## ergm(formula = leader_network ~ edges + mutual + gwesp(0.25,
## fixed = T) + nodefactor("district_site") + nodecov("efficacy"))
##
## Monte Carlo Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -5.10654 0.51634 0 -9.890 < 1e-04 ***
## mutual 2.28904 0.31177 0 7.342 < 1e-04 ***
## gwesp.fixed.0.25 0.99580 0.13837 0 7.197 < 1e-04 ***
## nodefactor.district_site.1 0.26537 0.08488 0 3.126 0.00177 **
## nodecov.efficacy 0.07121 0.03513 0 2.027 0.04268 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 2503.6 on 1806 degrees of freedom
## Residual Deviance: 794.8 on 1801 degrees of freedom
##
## AIC: 804.8 BIC: 832.2 (Smaller is better. MC Std. Err. = 0.5324)
To see how well this model fits, I used gof() to plot how the simulated networks compare with the observed data. With some variation, the model fits the sample data pretty well. A good sign!
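The check itself is short; a minimal sketch of it (the resulting plots aren’t reproduced here):
gof_2 <- gof(ergm_2)  # simulate from the model and compare to observed stats
plot(gof_2)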
I also wanted to check how “balanced” the Markov chain Monte Carlo (MCMC) diagnostics looked, to make sure the model wasn’t drifting toward degenerate or unlikely results. These turned out well too.
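The diagnostics below come from mcmc.diagnostics(), which the ergm output itself recommends (the trace and density plots it also draws aren’t reproduced here):
mcmc.diagnostics(ergm_2)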
## Sample statistics summary:
##
## Iterations = 1697792:6926336
## Thinning interval = 2048
## Number of chains = 1
## Sample size per chain = 2554
##
## 1. Empirical mean and standard deviation for each variable,
## plus standard error of the mean:
##
## Mean SD Naive SE Time-series SE
## edges 4.701 38.77 0.7671 2.893
## mutual 1.457 13.15 0.2601 1.068
## gwesp.fixed.0.25 5.883 49.32 0.9759 3.700
## nodefactor.district_site.1 4.662 47.06 0.9312 3.714
## nodecov.efficacy 69.447 525.90 10.4062 39.272
##
## 2. Quantiles for each variable:
##
## 2.5% 25% 50% 75% 97.5%
## edges -74.00 -22.00 6.000 31.00 80.0
## mutual -25.00 -8.00 1.000 10.00 27.0
## gwesp.fixed.0.25 -90.85 -28.46 5.969 39.78 102.1
## nodefactor.district_site.1 -92.00 -27.00 7.000 37.00 93.0
## nodecov.efficacy -997.65 -297.88 80.605 427.94 1086.3
##
##
## Are sample statistics significantly different from observed?
## edges mutual gwesp.fixed.0.25 nodefactor.district_site.1
## diff. 4.7008614 1.4565388 5.8834055 4.6620987
## test stat. 1.6250332 1.3639116 1.5899004 1.2553877
## P-val. 0.1041555 0.1725954 0.1118573 0.2093381
## nodecov.efficacy Overall (Chi^2)
## diff. 69.44720830 NA
## test stat. 1.76834507 18.286183761
## P-val. 0.07700323 0.002968828
##
## Sample statistics cross-correlations:
## edges mutual gwesp.fixed.0.25
## edges 1.0000000 0.9640384 0.9859629
## mutual 0.9640384 1.0000000 0.9680490
## gwesp.fixed.0.25 0.9859629 0.9680490 1.0000000
## nodefactor.district_site.1 0.9532156 0.9269091 0.9511353
## nodecov.efficacy 0.9979521 0.9626429 0.9849214
## nodefactor.district_site.1 nodecov.efficacy
## edges 0.9532156 0.9979521
## mutual 0.9269091 0.9626429
## gwesp.fixed.0.25 0.9511353 0.9849214
## nodefactor.district_site.1 1.0000000 0.9417990
## nodecov.efficacy 0.9417990 1.0000000
##
## Sample statistics auto-correlation:
## Chain 1
## edges mutual gwesp.fixed.0.25 nodefactor.district_site.1
## Lag 0 1.0000000 1.0000000 1.0000000 1.0000000
## Lag 2048 0.8446890 0.8585567 0.8550006 0.8547406
## Lag 4096 0.7391046 0.7502635 0.7466117 0.7515170
## Lag 6144 0.6517800 0.6623309 0.6561519 0.6672657
## Lag 8192 0.5796237 0.5956158 0.5838357 0.5935643
## Lag 10240 0.5164856 0.5366813 0.5217593 0.5296499
## nodecov.efficacy
## Lag 0 1.0000000
## Lag 2048 0.8460447
## Lag 4096 0.7400769
## Lag 6144 0.6524216
## Lag 8192 0.5804269
## Lag 10240 0.5178714
##
## Sample statistics burn-in diagnostic (Geweke):
## Chain 1
##
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5
##
## edges mutual
## 0.29159 0.11655
## gwesp.fixed.0.25 nodefactor.district_site.1
## 0.18536 -0.06909
## nodecov.efficacy
## 0.30234
##
## Individual P-values (lower = worse):
## edges mutual
## 0.7706030 0.9072177
## gwesp.fixed.0.25 nodefactor.district_site.1
## 0.8529432 0.9449172
## nodecov.efficacy
## 0.7623914
## Joint P-value (lower = worse): 0.3020204 .
##
## MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).
To answer my research question, let’s take another look at the ERGM that accounted for district_site and efficacy:
summary(ergm_2)
## Call:
## ergm(formula = leader_network ~ edges + mutual + gwesp(0.25,
## fixed = T) + nodefactor("district_site") + nodecov("efficacy"))
##
## Monte Carlo Maximum Likelihood Results:
##
## Estimate Std. Error MCMC % z value Pr(>|z|)
## edges -5.10654 0.51634 0 -9.890 < 1e-04 ***
## mutual 2.28904 0.31177 0 7.342 < 1e-04 ***
## gwesp.fixed.0.25 0.99580 0.13837 0 7.197 < 1e-04 ***
## nodefactor.district_site.1 0.26537 0.08488 0 3.126 0.00177 **
## nodecov.efficacy 0.07121 0.03513 0 2.027 0.04268 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Null Deviance: 2503.6 on 1806 degrees of freedom
## Residual Deviance: 794.8 on 1801 degrees of freedom
##
## AIC: 804.8 BIC: 832.2 (Smaller is better. MC Std. Err. = 0.5324)
It appears that the p-value for being a district-level leader is < .01, and for efficacy it is < .05. While both demonstrate statistical significance, the most influential factors for predicting ties in this network remain the mutual and transitive ties formed between actors.
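To put those estimates on a more interpretable scale, the coefficients can be exponentiated into odds ratios (a small sketch):
# odds ratios for each term in the final model
exp(coef(ergm_2))
The nodefactor.district_site.1 estimate of about 0.27, for example, means that each district-level leader involved in a dyad multiplies the odds of a confidential tie by roughly exp(0.27) ≈ 1.3, holding the other terms constant.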