Intro

We’re back at it again with our district leader data from the Social Network Analysis and Education companion site! I was interested in seeing the relationship between some factors that we didn’t see in combination during our case study: Reciprocal ties between actors, i.e. confidential and mutual exchanges between individuals, whether or not an individual worked at the district or school level, and efficacy.

Research Question

Controlling for edges, reciprocity, are leaders who are at the district level or have high efficacy scores more likely to either send or receive a confidential exchange within this social network?

Prepare

Install Libraries

library(tidyverse)
library(readxl)
library(igraph)
library(tidygraph)
library(ggraph)
library(skimr)
library(janitor)
library(statnet)

Wrangle

I began by importing the edge matrix from School Leaders Data Chapter 9_d, then converting it into a matrix object:

leader_matrix <- read_excel("data/School Leaders Data Chapter 9_d.xlsx", 
                            col_names = FALSE) |>
  clean_names()
## New names:
## * `` -> ...1
## * `` -> ...2
## * `` -> ...3
## * `` -> ...4
## * `` -> ...5
## * ...
leader_matrix <- leader_matrix |>
  as.matrix()

No edges are complete without their nodes, and in this case the attributes that we’ll be drawing from for our modeling:

leader_nodes <- read_excel("data/School Leaders Data Chapter 9_e.xlsx", 
                           col_types = c("text", 
                                         "numeric", 
                                         "numeric", 
                                         "numeric", 
                                         "numeric")) |>
  clean_names()

I then took the values from the matrix and dichotomized them, “flattening” ties into binary based on their strength:

leader_matrix[leader_matrix <= 2] <- 0
leader_matrix[leader_matrix >= 3] <- 1

I then added row and column names to the matrix, drawing from the freshly-imported node attributes data:

rownames(leader_matrix) <- leader_nodes$id
colnames(leader_matrix) <- leader_nodes$id

The matrix needed to then be converted into a network object for use with igraph:

adjacency_matrix <- graph.adjacency(leader_matrix,
                                    diag = FALSE)

I then went on to make my edge list form that network object:

leader_edges <- get.data.frame(adjacency_matrix) |>
  mutate(from = as.character(from)) |>
  mutate(to = as.character(to))

These two objects came together to make the final graph object ready for analysis:

leader_graph <- tbl_graph(edges = leader_edges,
                          nodes = leader_nodes,
                          node_key = "id",
                          directed = FALSE)

Explore

Descriptives

I used mutate to create new columns describing the in, out, and total exchanges between actors in this network:

leader_measures <- leader_graph |>
  activate(nodes) |>
  mutate(in_degree = centrality_degree(mode = "in")) |>
  mutate(out_degree = centrality_degree(mode = "out")) |>
  mutate(all_degree = centrality_degree(mode = "all"))

I then converted these measures into a tibble for easier statistical analysis…

node_measures <- leader_measures |> 
  activate(nodes) |>
  as_tibble()
summary(node_measures)
##       id               efficacy         trust       district_site   
##  Length:43          Min.   :4.610   Min.   :3.630   Min.   :0.0000  
##  Class :character   1st Qu.:5.670   1st Qu.:4.130   1st Qu.:0.0000  
##  Mode  :character   Median :6.780   Median :4.780   Median :0.0000  
##                     Mean   :6.649   Mean   :4.783   Mean   :0.4186  
##                     3rd Qu.:7.470   3rd Qu.:5.440   3rd Qu.:1.0000  
##                     Max.   :8.500   Max.   :5.880   Max.   :1.0000  
##       male          in_degree        out_degree       all_degree    
##  Min.   :0.0000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.: 3.000   1st Qu.: 3.000   1st Qu.: 3.000  
##  Median :0.0000   Median : 4.000   Median : 4.000   Median : 4.000  
##  Mean   :0.4419   Mean   : 6.651   Mean   : 6.651   Mean   : 6.651  
##  3rd Qu.:1.0000   3rd Qu.: 9.500   3rd Qu.: 9.500   3rd Qu.: 9.500  
##  Max.   :1.0000   Max.   :27.000   Max.   :27.000   Max.   :27.000

…which led to these descriptive stats on in, out, and total degree, respectively, as well as for the node attribute of efficacy since I planned to use it in my model:

node_measures |>
  group_by(district_site) |>
  summarise(n = n(),
            mean = mean(in_degree), 
            sd = sd(in_degree))
## # A tibble: 2 × 4
##   district_site     n  mean    sd
##           <dbl> <int> <dbl> <dbl>
## 1             0    25  4.8   4.34
## 2             1    18  9.22  6.67
node_measures |>
  group_by(district_site) |>
  summarise(n = n(),
            mean = mean(out_degree), 
            sd = sd(out_degree))
## # A tibble: 2 × 4
##   district_site     n  mean    sd
##           <dbl> <int> <dbl> <dbl>
## 1             0    25  4.8   4.34
## 2             1    18  9.22  6.67
node_measures |>
  group_by(district_site) |>
  summarise(n = n(),
            mean = mean(all_degree), 
            sd = sd(all_degree))
## # A tibble: 2 × 4
##   district_site     n  mean    sd
##           <dbl> <int> <dbl> <dbl>
## 1             0    25  4.8   4.34
## 2             1    18  9.22  6.67
node_measures |>
  group_by(district_site) |>
  summarise(n = n(),
            mean = mean(efficacy), 
            sd = sd(efficacy))
## # A tibble: 2 × 4
##   district_site     n  mean    sd
##           <dbl> <int> <dbl> <dbl>
## 1             0    25  7.02 0.953
## 2             1    18  6.13 1.12

Sociogram

leader_graph |>
  ggraph(layout = "kk") + 
  geom_node_point(aes(colour = district_site, size = efficacy)) +
  geom_edge_link(colour = "grey", width=0.2, arrow = arrow(type = "closed", length = unit(1, 'mm')))+
  theme_void()

This visualization highlighted a few of the properties that I planned on looking at using my ERGM model, including reciprocity (shown by the arrows), district or school-level status, and efficacy.

Model

I first loaded my network data so it could be used by statnet:

leader_network <- as.network(leader_edges,
                             vertices = leader_nodes)

ERGM

I looked to see what a basic ERGM model controlling for reciprocal ties would look like (running a summary beforehand to keep with best practices):

summary(leader_network ~ edges + mutual)
##  edges mutual 
##    143     36
set.seed(589)
ergm_1 <-ergm(leader_network ~ edges + mutual)
## Starting maximum pseudolikelihood estimation (MPLE):
## Evaluating the predictor and response matrix.
## Maximizing the pseudolikelihood.
## Finished MPLE.
## Starting Monte Carlo maximum likelihood estimation (MCMLE):
## Iteration 1 of at most 60:
## Optimizing with step length 1.0000.
## The log-likelihood improved by 0.2915.
## Estimating equations are not within tolerance region.
## Iteration 2 of at most 60:
## Optimizing with step length 1.0000.
## The log-likelihood improved by 0.0136.
## Convergence test p-value: 0.1862. Not converged with 99% confidence; increasing sample size.
## Iteration 3 of at most 60:
## Optimizing with step length 1.0000.
## The log-likelihood improved by 0.0049.
## Convergence test p-value: 0.0968. Not converged with 99% confidence; increasing sample size.
## Iteration 4 of at most 60:
## Optimizing with step length 1.0000.
## The log-likelihood improved by 0.0079.
## Convergence test p-value: < 0.0001. Converged with 99% confidence.
## Finished MCMLE.
## Evaluating log-likelihood at the estimate. Fitting the dyad-independent submodel...
## Bridging between the dyad-independent submodel and the full model...
## Setting up bridge sampling...
## Using 16 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 .
## Bridging finished.
## This model was fit using MCMC.  To examine model diagnostics and check
## for degeneracy, use the mcmc.diagnostics() function.
summary(ergm_1)
## Call:
## ergm(formula = leader_network ~ edges + mutual)
## 
## Monte Carlo Maximum Likelihood Results:
## 
##        Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges   -3.1101     0.1283      0  -24.23   <1e-04 ***
## mutual   3.1246     0.3003      0   10.40   <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 2503.6  on 1806  degrees of freedom
##  Residual Deviance:  891.8  on 1804  degrees of freedom
##  
## AIC: 895.8  BIC: 906.8  (Smaller is better. MC Std. Err. = 0.7447)

It looks like the structural features of this network have significant influence on the formation of the network itself, which makes sense. But we’re here too look at how this might play with district or school-level participation and efficacy!

District site & efficacy

## Call:
## ergm(formula = leader_network ~ edges + mutual + gwesp(0.25, 
##     fixed = T) + nodefactor("district_site") + nodecov("efficacy"))
## 
## Monte Carlo Maximum Likelihood Results:
## 
##                            Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges                      -5.10654    0.51634      0  -9.890  < 1e-04 ***
## mutual                      2.28904    0.31177      0   7.342  < 1e-04 ***
## gwesp.fixed.0.25            0.99580    0.13837      0   7.197  < 1e-04 ***
## nodefactor.district_site.1  0.26537    0.08488      0   3.126  0.00177 ** 
## nodecov.efficacy            0.07121    0.03513      0   2.027  0.04268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 2503.6  on 1806  degrees of freedom
##  Residual Deviance:  794.8  on 1801  degrees of freedom
##  
## AIC: 804.8  BIC: 832.2  (Smaller is better. MC Std. Err. = 0.5324)

Model Fit

To see how this model fits, I used gof to plot how the simulations fit around the observed data. With some variations, it looks like the model fits pretty well around the sample data. A good sign!

I also wanted to check to see how “balanced” the Monte Carlo Markov Chain diagnostics looked, to make sure that the model wasn’t running away to produce variable or unlikely results. These turned out well also.

## Sample statistics summary:
## 
## Iterations = 1697792:6926336
## Thinning interval = 2048 
## Number of chains = 1 
## Sample size per chain = 2554 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                              Mean     SD Naive SE Time-series SE
## edges                       4.701  38.77   0.7671          2.893
## mutual                      1.457  13.15   0.2601          1.068
## gwesp.fixed.0.25            5.883  49.32   0.9759          3.700
## nodefactor.district_site.1  4.662  47.06   0.9312          3.714
## nodecov.efficacy           69.447 525.90  10.4062         39.272
## 
## 2. Quantiles for each variable:
## 
##                               2.5%     25%    50%    75%  97.5%
## edges                       -74.00  -22.00  6.000  31.00   80.0
## mutual                      -25.00   -8.00  1.000  10.00   27.0
## gwesp.fixed.0.25            -90.85  -28.46  5.969  39.78  102.1
## nodefactor.district_site.1  -92.00  -27.00  7.000  37.00   93.0
## nodecov.efficacy           -997.65 -297.88 80.605 427.94 1086.3
## 
## 
## Are sample statistics significantly different from observed?
##                edges    mutual gwesp.fixed.0.25 nodefactor.district_site.1
## diff.      4.7008614 1.4565388        5.8834055                  4.6620987
## test stat. 1.6250332 1.3639116        1.5899004                  1.2553877
## P-val.     0.1041555 0.1725954        0.1118573                  0.2093381
##            nodecov.efficacy Overall (Chi^2)
## diff.           69.44720830              NA
## test stat.       1.76834507    18.286183761
## P-val.           0.07700323     0.002968828
## 
## Sample statistics cross-correlations:
##                                edges    mutual gwesp.fixed.0.25
## edges                      1.0000000 0.9640384        0.9859629
## mutual                     0.9640384 1.0000000        0.9680490
## gwesp.fixed.0.25           0.9859629 0.9680490        1.0000000
## nodefactor.district_site.1 0.9532156 0.9269091        0.9511353
## nodecov.efficacy           0.9979521 0.9626429        0.9849214
##                            nodefactor.district_site.1 nodecov.efficacy
## edges                                       0.9532156        0.9979521
## mutual                                      0.9269091        0.9626429
## gwesp.fixed.0.25                            0.9511353        0.9849214
## nodefactor.district_site.1                  1.0000000        0.9417990
## nodecov.efficacy                            0.9417990        1.0000000
## 
## Sample statistics auto-correlation:
## Chain 1 
##               edges    mutual gwesp.fixed.0.25 nodefactor.district_site.1
## Lag 0     1.0000000 1.0000000        1.0000000                  1.0000000
## Lag 2048  0.8446890 0.8585567        0.8550006                  0.8547406
## Lag 4096  0.7391046 0.7502635        0.7466117                  0.7515170
## Lag 6144  0.6517800 0.6623309        0.6561519                  0.6672657
## Lag 8192  0.5796237 0.5956158        0.5838357                  0.5935643
## Lag 10240 0.5164856 0.5366813        0.5217593                  0.5296499
##           nodecov.efficacy
## Lag 0            1.0000000
## Lag 2048         0.8460447
## Lag 4096         0.7400769
## Lag 6144         0.6524216
## Lag 8192         0.5804269
## Lag 10240        0.5178714
## 
## Sample statistics burn-in diagnostic (Geweke):
## Chain 1 
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##                      edges                     mutual 
##                    0.29159                    0.11655 
##           gwesp.fixed.0.25 nodefactor.district_site.1 
##                    0.18536                   -0.06909 
##           nodecov.efficacy 
##                    0.30234 
## 
## Individual P-values (lower = worse):
##                      edges                     mutual 
##                  0.7706030                  0.9072177 
##           gwesp.fixed.0.25 nodefactor.district_site.1 
##                  0.8529432                  0.9449172 
##           nodecov.efficacy 
##                  0.7623914 
## Joint P-value (lower = worse):  0.3020204 .

## 
## MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).

Communicate

To answer my research question, let’s take a look again at the ERGM model that accounted for district_level and efficacy:

summary(ergm_2)
## Call:
## ergm(formula = leader_network ~ edges + mutual + gwesp(0.25, 
##     fixed = T) + nodefactor("district_site") + nodecov("efficacy"))
## 
## Monte Carlo Maximum Likelihood Results:
## 
##                            Estimate Std. Error MCMC % z value Pr(>|z|)    
## edges                      -5.10654    0.51634      0  -9.890  < 1e-04 ***
## mutual                      2.28904    0.31177      0   7.342  < 1e-04 ***
## gwesp.fixed.0.25            0.99580    0.13837      0   7.197  < 1e-04 ***
## nodefactor.district_site.1  0.26537    0.08488      0   3.126  0.00177 ** 
## nodecov.efficacy            0.07121    0.03513      0   2.027  0.04268 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 2503.6  on 1806  degrees of freedom
##  Residual Deviance:  794.8  on 1801  degrees of freedom
##  
## AIC: 804.8  BIC: 832.2  (Smaller is better. MC Std. Err. = 0.5324)

It appears that the p value for being a district-level leader is <.01, and efficiency is <.05. While these demonstrate some statistical significance, the most influential factors for network prediction remain mutual and transitive ties formed between actors.