Lab 3: Exponential Random Graph Modeling

An Introduction in R

Dawei Wang

2017-11-15

Introduction

In this lab, we will be testing a number of hypotheses about a network’s structure using exponential random graph modeling (ERGM) techniques using the statnet package in R.1 Mark S. Handcock, David R. Hunter, Carter T. Butts, Steven M. Goodreau, and Martina Morris (2003). statnet: Software tools for the Statistical Modeling of Network Data. statnetproject.org; see also ?? statnet. For more information about ERGMs, see generally D. Lusher, J. Koskinen, & G. Robins (2012) Exponential Random Graph Models for Social Networks. statnet provides a comprehensive framework for ERGM-based network modeling, including tools for model estimation, model evaluation, model-based network simulation, and network visualization. This functionality is powered by a central Markov chain Monte Carlo Maximum Likelihood Estimation (MCMCMLE) algorithm.2 For a great introduction to MCMC grounded in graph theory, see Jeremy Kun, Markov Chain Monte Carlo Without All the Bullshit.

statnet resources:

Developer website

User guide

Tutorial

Data

We will analyze the communication behaviors within a team of seventeen members who were involved in designing military installations.

Hypotheses

We will test various hypotheses based on the Theory of Transactive Memory.3 See Monge & Contractor (2003) Theories of Communication Networks, 198—203.

Hypothesis 1: Individuals are less likely to retrieve information from those who retrieve information from them.

Hypothesis 2a: Information retrieval tends to be transitive. That is, if individual i retrieves information from individual k, and individual k retrieves information from individual j, then individual i is more likely to retrieve information from individual j.

Hypothesis 2b: Transitivity increases at a sub-linear rate as a function of the number of ties.

Hypothesis 3a: Individuals tend to retrieve information from other members with high expertise.

Hypothesis 3b: Individuals with low expertise tend to retrieve information from many others.

Hypothesis 4: Individuals tend to retrieve information from members to whom they allocate information.

Part I. Building & Visualizing the Networks (30 pts)

The analysis will use three files: the CRIeq.txt as the network file, EXeq_cons.txt as the attribute file, and CAIeq.txt as the co-variate network file. To begin, we must convert the data files into matrices, transform those matrices into networks, and attach the attribute file to our base network.

Understanding the Base Network

Let’s begin by looking at the summary of our base network.

Analysis

In your own words, explain what this network respresents and its relationship to our attribute information and the other network.

The base network is the CRI network, which is who communicates to others to retrieve information from which other member in the network (CAI is who communicates to allocate information). Since this network represents the communication pattern between 17 agents who are involved in designing military installations, the attribute information is related to this network because the information being retrieved is about the environment and the attribute is the level of expertise of the environment.

Without looking at the results, I feel that a person with a high expertise in environment information will be sought out more by others (high indegree centrality). The person with the least expertise would seek out others more (therefore high outdegree centrality). The base network is related to the information-allocation network in the sense that the person with the highest expertise would have high out-degree centrality. In other words, I would expect the opposite direction of the nodes in the allocation network compared to the retrieval network.

Visualization

Before we conduct further analysis, let’s visualize our base network. Similar to our approach in Lab 2, we will begin by establishing set coordinates for our nodes in order to simplify visual comparisons.

Base Graph Structure

Base Graph Structure

Next we can visualize our base network.

Base Network: Retrieval of Environmental Quality

Base Network: Retrieval of Environmental Quality

Next, we will size the nodes by the their expertise value.

Base Network: Retrieval of Environmental Quality, Nodes Sized by Expertise Score

Base Network: Retrieval of Environmental Quality, Nodes Sized by Expertise Score

Let’s compare this visualization to sizing by in-degree centrality.

Base Network: Nodes by Indegree Centrality

Base Network: Nodes by Indegree Centrality

Analysis

Consider hypothesis 3a from Part I. Do these visualizations prove or disprove the hypothesis? In your own words, interpret the graphs and explain how they support or reject the hypothesis.

The hypothesis is supported. This is given by the above two graphs, in particular the size of nodes in the two graphs are similar. This shows that people with high expertise do indeed have high in-degree centrality. For example, nodes 9 and 6 have high in-degree centrality (bigger size nodes) and they also have high expertise.

Understanding the Covariate Network

Let’s explore the summary statistics of our co-variate network.

## Network attributes:
##   vertices = 17
##   directed = TRUE
##   hyper = FALSE
##   loops = FALSE
##   multiple = FALSE
##   bipartite = FALSE
##  total edges = 23 
##    missing edges = 0 
##    non-missing edges = 23 
##  density = 0.08455882 
## 
## Vertex attributes:
##   vertex.names:
##    character valued attribute
##    17 valid vertex names
## 
## No edge attributes
## 
## Network edgelist matrix:
##       [,1] [,2]
##  [1,]    7    2
##  [2,]    6    4
##  [3,]    7    4
##  [4,]    9    4
##  [5,]   14    4
##  [6,]    6    5
##  [7,]   17    5
##  [8,]    2    6
##  [9,]   14    6
## [10,]    7    8
## [11,]    2    9
## [12,]    4    9
## [13,]    5    9
## [14,]    6    9
## [15,]   14    9
## [16,]   17    9
## [17,]    4   10
## [18,]    7   11
## [19,]    4   13
## [20,]    5   16
## [21,]    6   16
## [22,]    4   17
## [23,]    5   17

Analysis

In your own words, explain what this network respresents and its relationship to the other network and the attribute information.

This network represents who allocates information to whom. This network is related to the other network in the sense that the arcs should have the same direction and same patterns as those in the information retrieval network. This is because Hypothesis 4 proposes that people should retrieve information from those whom they allocate information. Thus, the visualization of the base network should be very similar to the visualization of the co-variate network. It is also related to the expertise information in terms of who allocates information to whom. A person with higher expertise should allocate information to those with less expertise.

Visualization

We will repeat the visualization process for our co-variate network. Observe the location and distribution of edges in the following visualization.

Covariate Network: Allocation of Environmental Quality

Covariate Network: Allocation of Environmental Quality

Next, we will size the nodes by their expertise scores.

Base Network: Allocation of Environmental Quality, Nodes Sized by Expertise Score

Base Network: Allocation of Environmental Quality, Nodes Sized by Expertise Score

Allocation

Allocation

Analysis

Consider hypothesis 4 from Part I. Do the visualizations of the retrieval and allocation networks sized by expertise support or disprove the hypothesis based on visual inspection? In your own words, interpret the graphs and explain how they support or reject the hypothesis.

The visualizations do not support the hypothesis. Using the second graph in this section, the distribution of the arcs are very different from that of the base network. For example, agent 9 gets requests for information very often but he or she is not allocated information in the rate that he or she is asked for information retrieval.

Part II: Constructing & Analyzing the ERGM Model (70 pts)

Next, we’re going to construct an Exponential Random Graph Model. Note that model construction is integral to this process; ERGM is not a single method of analysis, but a type of modelling that requires a theoretical grounding specific to the network and the hypotheses posed by the researcher. While the base ERGM simulation is set to take up to twenty iterations of simulations to fit the model by estimating parameters, it will stop at fewer if the generated networks converge on estimates of our coefficient values or paramters. ERGMs are used to predict ties as a function of individual covariates (i.e. attribute data, like “EX” in our example) or network structure.

We will start with a very basic model, looking only at the probability of edge formation, otherwise known as density, to demonstrate how co-efficients can be translated into odds ratios.4 The term for tie density (edges) is often used similarly to an intercept term in a linear regression or other linear model such as R’s glm. Keep in mind that the base network is about information retrieval. Our model will primarily allow us to ask what the probability is that an information retrieval relationship will form between two nodes.

model0 <- ergm(CRIeq ~ edges)
## Evaluating log-likelihood at the estimate.
summary(model0)
## 
## ==========================
## Summary of model fit
## ==========================
## 
## Formula:   CRIeq ~ edges
## 
## Iterations:  5 out of 20 
## 
## Monte Carlo MLE Results:
##       Estimate Std. Error MCMC % p-value    
## edges  -1.7288     0.1695      0  <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 377.1  on 272  degrees of freedom
##  Residual Deviance: 230.6  on 271  degrees of freedom
##  
## AIC: 232.6    BIC: 236.3    (Smaller is better.)

What do coefficients mean? Coefficients are the change in the (log-odds) likelihood of a tie for a unit change in a predictor. In our basic model above, our only predictor is information on the number of edges in the network. We can see that the coefficient estimate is negative, suggesting that a tie is more likely not to form than form (i.e. density is less than .5.) To get a better sense of how less likely it is for a tie to form, we can translate our log-odds into a probability.

To translate our estimated coefficient for the edges parameter into a probability, we can take the inverse-logit. We are thereby finding the probability that a tie will form in the network, looking only at the number of ties in the base network to build our model. We can see below that this probability is equal to our network density.

plogis(coef(model0)[[1]])
## [1] 0.1507353
network.density(CRIeq)
## [1] 0.1507353

Conceptually, this should be fairly easy to follow: if all we know about a network is the number of ties we have and we’re attempting to predict the probability that an edge will exist solely based on that information, then the probability of an edge existing at any point in the network equals the density of the network as a whole.5 As we move on from this toy model, keep in mind that the estimate for our edges parameter will change as we add additional predictors or network statistics as additional terms will partially explain tie formation.

Base Network ERGM

As we build a network, we can evaluate whether individual network statistics or node attributes prove our hypotheses and whether they do so in a way that is significantly different from random chance. Later, we will evaluate whether the model does a good job of explaining our observed network.

model1 <- ergm(CRIeq ~ edges      # Set the base term based on density/edge formation. 
               + mutual           # H1
               + transitive       # H2a: Transitive triads ( type 120D, 030T, 120U, or 300) # What if the tie is part of a transitive triad?
               + nodeicov("EX")   # H3a
               + nodeocov("EX")   # H3b
               + edgecov(CAIeq)   # H4
) 
## Starting maximum likelihood estimation via MCMLE:
## Iteration 1 of at most 20:
## Optimizing with step length 1.
## The log-likelihood improved by 0.2307.
## Step length converged once. Increasing MCMC sample size.
## Iteration 2 of at most 20:
## Optimizing with step length 1.
## The log-likelihood improved by 0.00728.
## Step length converged twice. Stopping.
## Evaluating log-likelihood at the estimate. Using 20 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 .
## This model was fit using MCMC.  To examine model diagnostics and check for degeneracy, use the mcmc.diagnostics() function.
summary(model1)
## 
## ==========================
## Summary of model fit
## ==========================
## 
## Formula:   CRIeq ~ edges + mutual + transitive + nodeicov("EX") + nodeocov("EX") + 
##     edgecov(CAIeq)
## 
## Iterations:  2 out of 20 
## 
## Monte Carlo MLE Results:
##               Estimate Std. Error MCMC % p-value    
## edges          -6.9744     1.1490      0 < 1e-04 ***
## mutual         -1.4515     0.9915      0 0.14436    
## transitive      0.2817     0.1200      0 0.01969 *  
## nodeicov.EX     9.5449     2.0638      0 < 1e-04 ***
## nodeocov.EX     1.2598     1.5984      0 0.43131    
## edgecov.CAIeq   2.0994     0.6816      0 0.00229 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 377.1  on 272  degrees of freedom
##  Residual Deviance: 123.8  on 266  degrees of freedom
##  
## AIC: 135.8    BIC: 157.4    (Smaller is better.)
kable(plogis(coef(model1)))
edges 0.0009346
mutual 0.1897633
transitive 0.5699572
nodeicov.EX 0.9999284
nodeocov.EX 0.7789891
edgecov.CAIeq 0.8908474

Analysis

Take a look at the ERGM equation discussed in the week 5 slides, reproduced below:

\[ \Pr(Y=y)=\exp[\theta'g(y)]/k(\theta) \]

What term in the equation do the ERGM terms or network statistics correspond to?

g(y) is the term that correspond to network statistics.

Explain each ERGM term and its relationship to your hypotheses. Test your hypotheses (think about whether the sign of your coefficient suggests a tie is more or less likely) and report whether your results are significant.6 A parameter is significant if its absolute value is more than twice its Standard Error.

Hypothesis 1: Individuals are less likely to retrieve information from those who retrieve information from them.

This hypothesis is not supported. As can be seen from the table above (Note that I modified the code slightly because the table was not reported in the script), the coefficient for mutual is -1.4289, which means that it is highly unlikely for agents to form ties that is described in this hypothesis. In other words, the logged odds of individuals less likely to retrieve information from those who retrieve from them is very low. This is also shown in the data visualization in the previous section. Indeed, the probability is also very low: 0.193.

Hypothesis 2a: Information retrieval tends to be transitive. That is, if individual i retrieves information from individual k, and individual k retrieves information from individual j, then individual i is more likely to retrieve information from individual j.

This hypothesis is supported. This is because the coefficient is 0.2783 which is twice as much as the standard error. P-value is also significant at 0.05. The probability is 0.569 which is higher than chance.

Hypothesis 3a: Individuals tend to retrieve information from other members with high expertise.

This hypothesis is also supported given that the coefficient is 9.16, which shows that individuals tend to retrieve information from other members with high expertise (this is also supported in our previous visualization). The probability is almost 1.

Hypothesis 3b: Individuals with low expertise tend to retrieve information from many others.

This hypothesis is not supported because the coefficient is 1.2387 which is even less than the standard error, which shows that the probability of a tie forming between a low-expert agent with many others is close to random.

Hypothesis 4: Individuals tend to retrieve information from members to whom they allocate information.

This hypothesis is supported because the coefficient is 2.09 and the standard error is 0.6828 (so the coefficient is twice the standard error) and the p-value is small. Also the probability is high, i.e. 0.89.

ERGM Model 2

Now we will change our model slightly to avoid convergence problems that lead to degeneracy. The terms transitive and dgwesp both rely on triangle formations so including both of them in the model leads to a situation similar to colinearity in a generalized linear model. While they measure slightly different things (take a look at the ergm-terms documentation to understand more about what’s happening under the hood), they’re typically used interchangeably. As a result, we need to use two separate models to test the relevance of these parameters.

model2 <- ergm(CRIeq ~ edges 
               + mutual                           # H1
               + dgwesp(0.5, fixed=T, type="OTP") # H2b: OTP "transitive shared partner" ordered pair (i,j) iff i->k->j.
               + nodeicov("EX")                   # H3a
               + nodeocov("EX")                   # H3b
               + edgecov(CAIeq)                   # H4
) 
## Starting maximum likelihood estimation via MCMLE:
## Iteration 1 of at most 20:
## Optimizing with step length 1.
## The log-likelihood improved by 0.1652.
## Step length converged once. Increasing MCMC sample size.
## Iteration 2 of at most 20:
## Optimizing with step length 1.
## The log-likelihood improved by 0.002943.
## Step length converged twice. Stopping.
## Evaluating log-likelihood at the estimate. Using 20 bridges: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 .
## This model was fit using MCMC.  To examine model diagnostics and check for degeneracy, use the mcmc.diagnostics() function.
summary(model2)
## 
## ==========================
## Summary of model fit
## ==========================
## 
## Formula:   CRIeq ~ edges + mutual + dgwesp(0.5, fixed = T, type = "OTP") + 
##     nodeicov("EX") + nodeocov("EX") + edgecov(CAIeq)
## 
## Iterations:  2 out of 20 
## 
## Monte Carlo MLE Results:
##                     Estimate Std. Error MCMC %  p-value    
## edges                -6.8740     1.2136      0  < 1e-04 ***
## mutual               -1.5142     1.0369      0 0.145398    
## gwesp.OTP.fixed.0.5   0.6104     0.3613      0 0.092309 .  
## nodeicov.EX           8.9890     2.3300      0 0.000144 ***
## nodeocov.EX           1.3097     1.7263      0 0.448719    
## edgecov.CAIeq         2.1134     0.6710      0 0.001821 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 377.1  on 272  degrees of freedom
##  Residual Deviance: 124.3  on 266  degrees of freedom
##  
## AIC: 136.3    BIC: 157.9    (Smaller is better.)
kable(plogis(coef(model2)))
edges 0.0010333
mutual 0.1803157
gwesp.OTP.fixed.0.5 0.6480383
nodeicov.EX 0.9998752
nodeocov.EX 0.7874616
edgecov.CAIeq 0.8922022

Analysis

Evaluate the remaining hypothesis with your model.

Hypothesis 2b: Transitivity increases at a sub-linear rate as a function of the number of ties.

The hypothesis is marginally supported. This is shown in the coefficient of 0.6104 which is almost twice that of the standard error. The p-value is 0.09 which is also larger than the significant level of 0.05.

Model Diagnostics

Next, judge convergence of the MCMC processes of the first model, using the mcmc.diagnostics() function. The function will plot the change of model statistics during the last iteration of the MCMC estimation procedure.7 Note that although the edge graphs appear to be periodic, the dips between whole numbers are due to the fact that edges are always whole numbers. For each model statistic, the left hand side plot gives the change of the statistic with iterations, and the right hand side plot is a histogram of the statistic values. Both are normalized, so the observed data locate at 0.

mcmc.diagnostics(model1)      # Performs the markov chain monte carlo diagnostics
## Sample statistics summary:
## 
## Iterations = 16384:4209664
## Thinning interval = 1024 
## Number of chains = 1 
## Sample size per chain = 4096 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                  Mean     SD Naive SE Time-series SE
## edges         0.63086  7.375  0.11523        0.17067
## mutual        0.22803  2.473  0.03863        0.05567
## transitive    3.00269 31.115  0.48617        0.71710
## nodeicov.EX   0.32909  3.927  0.06136        0.09246
## nodeocov.EX   0.19260  2.483  0.03880        0.05832
## edgecov.CAIeq 0.04272  1.784  0.02787        0.03580
## 
## 2. Quantiles for each variable:
## 
##                  2.5%     25%        50%    75%  97.5%
## edges         -12.000  -4.000  0.000e+00  5.000 17.000
## mutual         -3.000  -2.000  0.000e+00  2.000  6.000
## transitive    -40.000 -19.000 -4.000e+00 19.000 82.000
## nodeicov.EX    -6.706  -2.412  5.882e-02  2.824  8.919
## nodeocov.EX    -4.294  -1.529 -2.000e-09  1.824  5.471
## edgecov.CAIeq  -4.000  -1.000  0.000e+00  1.000  3.000
## 
## 
## Sample statistics cross-correlations:
##                   edges    mutual transitive nodeicov.EX nodeocov.EX
## edges         1.0000000 0.7440785  0.9183834   0.9909533   0.9222568
## mutual        0.7440785 1.0000000  0.8654203   0.7483071   0.8720016
## transitive    0.9183834 0.8654203  1.0000000   0.9245335   0.9371334
## nodeicov.EX   0.9909533 0.7483071  0.9245335   1.0000000   0.9110763
## nodeocov.EX   0.9222568 0.8720016  0.9371334   0.9110763   1.0000000
## edgecov.CAIeq 0.5273759 0.4952026  0.5245660   0.5121310   0.5631972
##               edgecov.CAIeq
## edges             0.5273759
## mutual            0.4952026
## transitive        0.5245660
## nodeicov.EX       0.5121310
## nodeocov.EX       0.5631972
## edgecov.CAIeq     1.0000000
## 
## Sample statistics auto-correlation:
## Chain 1 
##               edges     mutual transitive nodeicov.EX nodeocov.EX
## Lag 0    1.00000000 1.00000000 1.00000000  1.00000000  1.00000000
## Lag 1024 0.29602714 0.28999608 0.37011059  0.30805949  0.30786156
## Lag 2048 0.13494980 0.11044553 0.15565527  0.14686320  0.13138072
## Lag 3072 0.07822541 0.07382388 0.08220402  0.08408586  0.07803388
## Lag 4096 0.01954890 0.03750356 0.03831785  0.01996449  0.02589681
## Lag 5120 0.03131580 0.02918752 0.03555072  0.03433520  0.03801852
##          edgecov.CAIeq
## Lag 0      1.000000000
## Lag 1024   0.199076878
## Lag 2048   0.086168510
## Lag 3072   0.026845574
## Lag 4096  -0.005624284
## Lag 5120  -0.004873216
## 
## Sample statistics burn-in diagnostic (Geweke):
## Chain 1 
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##         edges        mutual    transitive   nodeicov.EX   nodeocov.EX 
##        0.7790       -0.1910        0.5870        0.6246        0.7619 
## edgecov.CAIeq 
##       -0.3076 
## 
## Individual P-values (lower = worse):
##         edges        mutual    transitive   nodeicov.EX   nodeocov.EX 
##     0.4360008     0.8485307     0.5572247     0.5322518     0.4461489 
## edgecov.CAIeq 
##     0.7584113 
## Joint P-value (lower = worse):  0.04277244 .
## Package latticeExtra is not installed. Falling back on coda's default MCMC diagnostic plots.
MCMC Diagnostics, Model 1.

MCMC Diagnostics, Model 1.

MCMC Diagnostics, Model 1.

MCMC Diagnostics, Model 1.

## 
## MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).

Repeat the process for the second model.

mcmc.diagnostics(model2)      # Performs the markov chain monte carlo diagnostics
## Sample statistics summary:
## 
## Iterations = 16384:4209664
## Thinning interval = 1024 
## Number of chains = 1 
## Sample size per chain = 4096 
## 
## 1. Empirical mean and standard deviation for each variable,
##    plus standard error of the mean:
## 
##                          Mean     SD Naive SE Time-series SE
## edges               -0.310791  5.357  0.08371        0.08845
## mutual              -0.005127  1.711  0.02674        0.02896
## gwesp.OTP.fixed.0.5 -0.478587 10.028  0.15669        0.17049
## nodeicov.EX         -0.164206  2.864  0.04474        0.04765
## nodeocov.EX         -0.078800  1.689  0.02639        0.02858
## edgecov.CAIeq       -0.079346  1.684  0.02631        0.02908
## 
## 2. Quantiles for each variable:
## 
##                        2.5%    25%      50%   75%  97.5%
## edges               -10.000 -4.000  0.00000 3.000 10.000
## mutual               -3.000 -1.000  0.00000 1.000  3.000
## gwesp.OTP.fixed.0.5 -19.321 -7.343 -0.82945 6.107 19.445
## nodeicov.EX          -5.684 -2.118 -0.11765 1.765  5.529
## nodeocov.EX          -3.353 -1.191 -0.05882 1.059  3.176
## edgecov.CAIeq        -4.000 -1.000  0.00000 1.000  3.000
## 
## 
## Sample statistics cross-correlations:
##                         edges    mutual gwesp.OTP.fixed.0.5 nodeicov.EX
## edges               1.0000000 0.5313516           0.8982144   0.9821594
## mutual              0.5313516 1.0000000           0.7194566   0.5381319
## gwesp.OTP.fixed.0.5 0.8982144 0.7194566           1.0000000   0.9165649
## nodeicov.EX         0.9821594 0.5381319           0.9165649   1.0000000
## nodeocov.EX         0.8614649 0.7380090           0.8902859   0.8316911
## edgecov.CAIeq       0.4102323 0.3498729           0.4224946   0.3871053
##                     nodeocov.EX edgecov.CAIeq
## edges                 0.8614649     0.4102323
## mutual                0.7380090     0.3498729
## gwesp.OTP.fixed.0.5   0.8902859     0.4224946
## nodeicov.EX           0.8316911     0.3871053
## nodeocov.EX           1.0000000     0.4554200
## edgecov.CAIeq         0.4554200     1.0000000
## 
## Sample statistics auto-correlation:
## Chain 1 
##                 edges      mutual gwesp.OTP.fixed.0.5  nodeicov.EX
## Lag 0     1.000000000  1.00000000         1.000000000  1.000000000
## Lag 1024  0.054919942  0.07963608         0.084058581  0.062741563
## Lag 2048  0.002266933  0.01325322         0.007484444  0.012055499
## Lag 3072 -0.006670161 -0.01343285        -0.004961040 -0.014162831
## Lag 4096 -0.005378409  0.01433994        -0.009229691 -0.006125102
## Lag 5120 -0.007475345 -0.01042801        -0.016320508 -0.015197072
##           nodeocov.EX edgecov.CAIeq
## Lag 0     1.000000000  1.0000000000
## Lag 1024  0.079578621  0.0995903768
## Lag 2048 -0.013145788  0.0244695122
## Lag 3072 -0.001161323  0.0005197621
## Lag 4096  0.005133018 -0.0272918852
## Lag 5120  0.009118882 -0.0020513647
## 
## Sample statistics burn-in diagnostic (Geweke):
## Chain 1 
## 
## Fraction in 1st window = 0.1
## Fraction in 2nd window = 0.5 
## 
##               edges              mutual gwesp.OTP.fixed.0.5 
##           -0.348152           -0.915231           -0.353861 
##         nodeicov.EX         nodeocov.EX       edgecov.CAIeq 
##           -0.109243           -0.511664           -0.003743 
## 
## Individual P-values (lower = worse):
##               edges              mutual gwesp.OTP.fixed.0.5 
##           0.7277261           0.3600702           0.7234432 
##         nodeicov.EX         nodeocov.EX       edgecov.CAIeq 
##           0.9130100           0.6088865           0.9970136 
## Joint P-value (lower = worse):  0.8045628 .
## Package latticeExtra is not installed. Falling back on coda's default MCMC diagnostic plots.
MCMC Diagnostics, Model 2.

MCMC Diagnostics, Model 2.

MCMC Diagnostics, Model 2.

MCMC Diagnostics, Model 2.

## 
## MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).

Analysis

Has the MCMC process converged to a desired state for each ERGM term? Explain how you interpret the plots.8 D. Lusher et al, ERGM, 12.3 (“[T]he inferential goal is to center the distribution of statistics over those of the observed network, thus fitting a model that we say gives maximal support to the data.”)

Yes, according to the console output, the MCMC process converged to a desired state for each of the ERGM term in the models. The plots basically tell the same story. The plots on the left show the variations around the central tendency and since the variations are centered around the baseline (zero), it shows that the baseline model follows the same pattern as the hypothesized model. In more detail, each line on the left plot represents the parameter of each round of simulation. Zero is the base model as mentioned. The plots on the right are also saying the same story, just visualizing the data differently. The plot on the right are the distribution of the output of the simulations and since all the graphs follows a normal distribution around mean of zero, we can conclude that the MCMC process converged to the desired state.

Model Evaluation

To evaluate the goodness-of-fit for our model, we need to simulate many variations of the model.9 See ?? simulate for more information.

Let’s visually inspect two of our random networks based on our first model.

ggnet2(sim[[1]], mode = baseLayout, label = TRUE, arrow.size = 8, edge.color="blue", arrow.gap = 0.03) + theme(panel.background = element_rect(fill = "#fffff8")) + guides(color = FALSE, size = FALSE)
Random Graph Variant, Example 1

Random Graph Variant, Example 1

ggnet2(sim[[10]], mode = baseLayout, label = TRUE, arrow.size = 8, edge.color="orange", arrow.gap = 0.03) + theme(panel.background = element_rect(fill = "#fffff8")) + guides(color = FALSE, size = FALSE)
Random Graph Variant, Example 1

Random Graph Variant, Example 1

Next we’re going to extract the number of triangles from each of the 100 samples, create a histogram of that model, and place a red arrow at the value of the observed network.

model.tridist <- sapply(1:1000, function(x) summary(sim[[x]] ~triangle)) # Extracts the tiangle data from the simulated networks
hist(model.tridist, xlim=c(0,200), breaks = 50)                         # Plots that triangle distribution as a histogram
CRIeq.tri <- summary(CRIeq ~ triangle)                                  # Saves the CRIeq triangle data from the summary to the CRI.eq variable
arrows(CRIeq.tri,20, CRIeq.tri, 5, col="red", lwd=3)                    # Adds an arrow to the plotted histogram
Triangle Distribution

Triangle Distribution

Analysis

Is the distribution of triangles in your simulation a good match with the distribution of triangles in your observed network? This graph is interpreted similarly to the MCMC diagnostics, above.

So for this question, I ran plots for simulation = 100, 1000 and 10000 (the last one takes too long so the above graph shows a sample of 1000 simulations) and got the same results. I also changed the formatting of the histogram because the previous settings were not good representation and can’t help me to interpret clearly. The conclusion I form from the above histogram is that the distribution of triangles in the simulated networks is not a good match with the observed network. The observed network has approximately 60 triangles but the simulated networks are centered around 50 triangles.

Goodness of Fit

Next, we will calculate the Goodness of Fit for several of the parameters in our model. A p-value closer to one is better; this represents the difference between the observed networks and simulations.

gof
## 
## Goodness-of-fit for in-degree 
## 
##    obs min mean max MC p-value
## 0    9   6 8.98  12       1.00
## 1    3   0 1.96   4       0.62
## 2    0   0 0.66   3       1.00
## 3    0   0 0.43   3       1.00
## 4    0   0 0.45   2       1.00
## 5    2   0 0.68   3       0.28
## 6    0   0 0.77   3       0.96
## 7    0   0 0.79   3       0.84
## 8    1   0 0.69   3       1.00
## 9    1   0 0.67   4       1.00
## 10   0   0 0.49   3       1.00
## 11   1   0 0.25   2       0.44
## 12   0   0 0.15   2       1.00
## 13   0   0 0.02   1       1.00
## 14   0   0 0.01   1       1.00
## 
## Goodness-of-fit for out-degree 
## 
##   obs min mean max MC p-value
## 0   4   0 1.88   5       0.18
## 1   1   0 3.53   8       0.26
## 2   3   0 3.52   8       1.00
## 3   4   0 3.43   9       0.90
## 4   3   0 3.07   9       1.00
## 5   2   0 1.29   6       0.70
## 6   0   0 0.23   2       1.00
## 7   0   0 0.05   1       1.00
## 
## Goodness-of-fit for edgewise shared partner 
## 
##      obs min  mean max MC p-value
## esp0  10   4 10.76  18       0.98
## esp1  13   2 11.93  20       0.86
## esp2  11   0 10.41  24       1.00
## esp3   5   0  6.12  32       1.00
## esp4   2   0  1.79  15       0.80
## esp5   0   0  0.30   4       1.00
## esp6   0   0  0.01   1       1.00
## 
## Goodness-of-fit for minimum geodesic distance 
## 
##     obs min   mean max MC p-value
## 1    41  24  41.32  64       1.00
## 2    19  14  38.20  65       0.06
## 3     4   0  12.67  31       0.32
## 4     0   0   2.48  18       0.78
## 5     0   0   0.33   6       1.00
## 6     0   0   0.06   3       1.00
## 7     0   0   0.01   1       1.00
## Inf 208 128 176.93 225       0.10
# -------------------------------------------------------------------------------------------------
# Test the goodness of fit of the model
# Compiles statistics for these simulations as well as the observed network, and calculates p-values 
# -------------------------------------------------------------------------------------------------

par(mfrow=c(2,2))   # Separate the plot window into a 2 by 2 orientation
plot(gof)           # Plot the goodness of fit
Goodness of Fit

Goodness of Fit

Analysis

Evaluate the plots and summary statitistics of the Goodness of Fit measures for Model 1. Are the four terms evaluated show a good fit between the simulated networks and the observed network?10 In general, for configurations in the model, the fit is considered good if │t│≤ 0.1. For configurations not included in the model, the fit is considered good if 0.1<│t│≤ 1, and not extreme if 1< │t│≤ 2. For your plot, the dark black line represents the data for the observed network. The boxplots represent the distribution of corresponding degrees across the simulated networks, and the soft lines are the 95% confidence intervals.

The interpretation is as follows. First the summary statistics show the possible number of in degree ties on the left column (i.e. 0, 1, 2, …). The second column is the actual observed network statistics (i.e. the frequency of nodes have 0 tie, 1 tie, etc.). The third row is the minimum frequency of that node having that number of indegree ties for the simulated networks. The forth is the simulated network’s average frequency for that number of node. The maximum follows the same logic. The p-value is better if it is higher. So for indegree, the model is not a good fit for predicting 5 and 11 indegree ties. The logic follows for other measures of ties. The model is only good for edge-wise shared partners.

Submitting the Lab (5 pts)

After knitting your file to RPubs, copy the URL and paste it into the comment field of the Lab 2 Assignment on Canvas. Save this .Rmd file and submit it in the file portion of your Canvas assignment. Make sure to review your file and its formatting. Run spell check (built into RStudio) and proofread your answers before submitting. If you can’t publish to RPubs, save your HTML file as a PDF and submit that instead.11 There are many different ways to do this with different browsers. Google it.