This report gives an introductory summary of the formulation and application of exponential random graph models (ERGMs) to networks of collaboration between countries. In these networks, nodes are countries and an edge between two countries represents a joint paper. We use income level, cases level and coverage as node attributes.

First, let's look at the distribution of income levels across antigens and year intervals.

Since the node attributes for incidence level and coverage level have missing values, we present this report only for income level.

Number of nodes for each income level and each antigen

The figures below summarize the number of nodes with each income level as their node attribute.

We are interested in the effect of income level, cases level and coverage level on tie formation in the networks for different antigens in different time intervals. Therefore, we use exponential random graph models, which model networks as a function of network statistics.

ERGM

ERGMs treat the observed network as just one instantiation of a set of possible networks with similar features, that is, as the outcome of a stochastic process that is unknown and must therefore be inferred. The observed network is the network data that we created and are interested in modeling. It is regarded as one realization from a set of possible networks with similar important characteristics (the same number of nodes, the same number of edges, the same number of countries with high (H), upper-middle (UM), lower-middle (LM) or low (L) income, and …). In other words, the observed network is seen as one particular pattern of ties out of a large set of possible patterns.

In general, we do not know what stochastic process generated the observed network. In simple words, ERGM estimation takes the observed network, adds and removes edges, and sees how that changes the network statistics. It uses those changes to build an understanding of how the terms in the model specification interact and affect the overall network. Our goal in formulating a model is to propose a plausible and theoretically principled hypothesis for this process.

Let \(Y\) denote an \(n \times n\) sociomatrix where \(y_{ij} = 1\) if countries \(i\) and \(j\) have a tie and \(y_{ij} = 0\) otherwise. Let \(X\) denote a matrix of covariates, which includes structural measures of the network as well as nodal and possibly edge-level attributes. A generic ERGM can be written as:

\[ P_{\theta, \tilde{Y}} (Y = y \mid X) = \frac{\exp(\theta^T g(y,X))}{k(\theta , \tilde{Y})} \]

where \(\theta\) is a vector of coefficients, \(g(y,X)\) is a vector of sufficient statistics, \(\tilde{Y}\) is the space of all possible graphs, and \(k(\theta , \tilde{Y})\) is a normalizing constant, that is, the numerator summed across all possible graphs in \(\tilde{Y}\). The ERGM equation can be rewritten in terms of change statistics. The log-odds of a tie \(y_{ij}\) is:

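\[ \operatorname{logit}\big(P(Y_{ij} = 1 \mid y_{ij}^c)\big) = \theta^T \delta(y_{ij}) \]

Here \(y_{ij}^c\) denotes the rest of the network (all dyads other than \((i,j)\)) and \(\delta(y_{ij})\) is the vector of change statistics, i.e. the change in \(g(y,X)\) when \(y_{ij}\) is toggled from 0 to 1. We write \(Y_{ij}\) in upper case because it refers to the random variable rather than a specific realization.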
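To make the two formulas above concrete, the following is a minimal sketch that enumerates every graph on a very small node set. It assumes an edges-only model, and the values of `theta` and `n` are purely illustrative (they are not estimates from this report). It shows that the probabilities sum to one and that the log-odds of one extra tie equals the coefficient times the change statistic (which is 1 for the edges term).

```python
from itertools import combinations, product
from math import exp, log

def all_graphs(n):
    """Yield every undirected simple graph on n nodes as a set of dyads."""
    dyads = list(combinations(range(n), 2))
    for pattern in product([0, 1], repeat=len(dyads)):
        yield {d for d, present in zip(dyads, pattern) if present}

def edge_count(graph):
    """Sufficient statistic g(y): the number of edges."""
    return len(graph)

def normalizing_constant(theta, n):
    """k(theta): the numerator exp(theta * g(y)) summed over all graphs."""
    return sum(exp(theta * edge_count(g)) for g in all_graphs(n))

def graph_probability(graph, theta, n):
    """P(Y = y) for an edges-only ERGM on n nodes."""
    return exp(theta * edge_count(graph)) / normalizing_constant(theta, n)

theta = -1.0  # illustrative coefficient (assumption, not a fitted value)
n = 4         # 6 dyads, so 2**6 = 64 possible graphs

k = normalizing_constant(theta, n)
total = sum(graph_probability(g, theta, n) for g in all_graphs(n))

# Two graphs that differ only in the dyad (2, 3): the change statistic is 1.
without_tie = {(0, 1)}
with_tie = {(0, 1), (2, 3)}
log_odds_of_extra_tie = log(
    graph_probability(with_tie, theta, n) / graph_probability(without_tie, theta, n)
)
```

Enumeration is only feasible for tiny networks, since the number of graphs doubles with every additional dyad; this is why real ERGM software relies on approximation instead.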

All ERG models and goodness-of-fit plots in this report were generated using ergm, a cornerstone of the statnet suite of packages for statistical network analysis (Handcock et al. 2003). All models assume dyadic independence and thus can be estimated straightforwardly using pseudo-likelihood estimation.

Model 1: number of Edges

We start by building up from some basic terms. The first is the edges term, a statistic that counts how many edges there are in the network (on its own, this is not very informative).

Coefficients in ERGMs represent the change in the log-odds of a tie for a unit change in a predictor. To be consistent with the standard errors, we report the coefficients as log-odds. We show each coefficient value as a circle: a filled circle means the coefficient is statistically significant, and a hollow circle means it is not.

Negative coefficients indicate that the formation of edges is less likely than would be expected by chance (a 50% chance per dyad), while positive coefficients indicate a higher likelihood of edge formation. It is important to note that the edges term in any ERGM is almost always negative. In the simplest terms, this means that the networks are sparse: fewer than half of all possible ties are present.
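For a model that contains only the edges term, the coefficient has a closed form: it is the log-odds of the network density. The sketch below illustrates this on a hypothetical toy network (the country labels and ties are invented for illustration and are not taken from the data in this report).

```python
from math import log

# Hypothetical toy network: 6 countries, 4 joint-paper ties (assumption).
nodes = ["A", "B", "C", "D", "E", "F"]
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "E")]

# Number of possible undirected ties between distinct countries.
n_dyads = len(nodes) * (len(nodes) - 1) // 2

# Density: share of possible ties that are actually present.
density = len(edges) / n_dyads

# Edges-only coefficient: the log-odds of a tie.
edges_coef = log(density / (1 - density))

print(f"density = {density:.3f}, edges coefficient = {edges_coef:.3f}")
```

Because the density is below 0.5, the coefficient is negative, which is the usual situation for sparse collaboration networks.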

However, a model with only the edges term is rarely a good model (so far we have only learned that ties are less likely than not in every network). As you add terms to the model, it gains explanatory power regarding the formation of ties (this is also the reason that the edges term decreases in the following models).

Model 2: number of Edges and node factor:

Node factor: Income

Now let's also include the information we have about each node (i.e. country). The idea of nodal attributes is straightforward: these are what we would call (socio-demographic) attributes (e.g., income, geographic location, vaccination coverage level, …) in more standard regression models. In an ERGM, we add this information in the form of a node factor term:

The node factor statistic counts the number of times that nodes with a given attribute value appear as an endpoint of an edge. It captures the propensity of nodes with a specific attribute value to form ties, but it does not require both nodes in a tie to share the same attribute value.

The node factor term is particularly useful because it allows us to compare log-odds to a reference level (in our case, High income). This means each coefficient represents the difference in the log-odds of an edge for nodes of the specified income level compared to nodes with high income.
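As an illustration of what the node factor statistic counts, here is a minimal sketch on a hypothetical toy network. The country labels, income assignments and ties are assumptions made for this example only; H is treated as the reference level and therefore gets no count of its own.

```python
from collections import Counter

# Hypothetical income levels for six countries (assumption).
income = {"A": "H", "B": "H", "C": "UM", "D": "LM", "E": "L", "F": "L"}
# Hypothetical joint-paper ties (assumption).
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E")]

def nodefactor_counts(edges, attr, reference="H"):
    """Count edge endpoints per attribute level, skipping the reference level."""
    counts = Counter()
    for u, v in edges:
        for node in (u, v):
            if attr[node] != reference:
                counts[attr[node]] += 1
    return dict(counts)

stats = nodefactor_counts(edges, income)
print(stats)
```

An edge whose two endpoints share a non-reference level contributes twice to that level's count, so each statistic equals the summed degree of the nodes at that level.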

Focusing on the period before the COVID pandemic, for Measles, L has the lowest log-odds, indicating a lower likelihood of an edge for L than for LM and UM. Furthermore, the log-odds of an edge for L, compared to H, is -2.23719. Post-pandemic, the log-odds of the different income levels are closer to each other and have increased relative to H, but they keep the same order of likelihood.
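Log-odds differences are easier to read as odds ratios. The short sketch below exponentiates the Measles coefficient for L quoted above; the variable names are illustrative.

```python
from math import exp

# Pre-pandemic Measles node factor coefficient for L relative to H (from the text).
coef_low_vs_high = -2.23719

# Exponentiating a log-odds difference gives an odds ratio.
odds_ratio = exp(coef_low_vs_high)

print(f"odds ratio (L endpoint vs. H endpoint) = {odds_ratio:.3f}")
```

Under this reading, replacing an H endpoint of a potential tie with an L endpoint multiplies the odds of the tie by roughly one tenth, holding the other terms in the model fixed.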

While the likelihood of an edge is lowest for L for both HPV and Influenza (pre-pandemic), the pattern is different for Measles and Polio.

For Polio, the differences in log-odds between L, UM and LM and the reference H are the smallest (the values are close to 0), which means that the likelihood of an edge is similar for L and H.

The plot below shows all the coefficients and confidence intervals:

Looking at Polio, the node factor coefficients are not significant, except for L pre-Covid, where a tie is less likely than for H.

Looking at HPV, all coefficients are significant, and the log-odds for UM relative to H are higher than for LM and also lower than for L, which shows the dominance of H income in this network.

Looking at Measles, pre-Covid the likelihood of tie formation was higher for H than for UM, higher for UM than for LM, and higher for LM than for L. This pattern changes after Covid: the likelihood of tie formation for H is still higher than for UM (though by a smaller margin), and the likelihoods for UM and LM are similar.

Looking at Influenza, LM and L are not significant. The likelihoods for H, UM and LM are similar, while the log-odds of tie formation for L is 2.5 lower than the log-odds for H.

Node factor: Geographical location

Now, we are interested in considering geographical location as the node attribute.

Model 3: NodeMatch (Homophily):

Income

Of greater interest from our point of view, the node match statistic counts the number of edges whose two endpoint countries have the same income level. So we fit the model with number of edges and node match:

Node match is a measure of homophily, the tendency of nodes with similar attributes to form ties with each other. It assesses whether ties are more likely to occur between nodes that share the same attribute value than between nodes that do not.
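The node match statistic can be sketched in a few lines. The toy network below is hypothetical (labels, income levels and ties are assumptions for illustration); the second function splits the count by income level, which corresponds to modeling a separate homophily effect per level.

```python
from collections import Counter

# Hypothetical income levels and ties (assumptions for illustration).
income = {"A": "H", "B": "H", "C": "UM", "D": "LM", "E": "L", "F": "L"}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]

def nodematch_count(edges, attr):
    """Number of edges whose two endpoints share the same attribute value."""
    return sum(1 for u, v in edges if attr[u] == attr[v])

def nodematch_by_level(edges, attr):
    """Same count, reported separately for each attribute level."""
    return dict(Counter(attr[u] for u, v in edges if attr[u] == attr[v]))

uniform_homophily = nodematch_count(edges, income)
differential_homophily = nodematch_by_level(edges, income)
print(uniform_homophily, differential_homophily)
```

A positive coefficient on this statistic means that, all else equal, a same-income dyad has higher odds of a tie than a cross-income dyad.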

First of all, all the coefficients are positive, indicating that having the same attribute value for both nodes in a dyad increases the likelihood of a tie for all antigens, but the degree of such homophily varies across antigens. The lowest log-odds belong to Polio, suggesting more cross-income tie formation (heterophily) or, more precisely, a weaker tendency for countries with the same income level to form ties.

Influenza shows a decrease in its homophily log-odds, suggesting that countries with different income levels become more involved over time.

Apart from Polio, Measles and Influenza show a decrease in their homophily statistic post-pandemic.

Below you can see these measures on the same scale in a single figure for easier comparison.

The plot is based on the model in which we include node Match (Homophily) for different antigens in different time intervals.

Homophily is always significant and positive, apart from Polio in the first time interval. The log-odds of a tie between same-income countries is greater than 1 in all time intervals.

Looking at Influenza, we see an increasing trend for heterophily with time, meaning opening to cross-income collaboration.

Looking at HPV, the log-odds for homophily stays around the same value (1.4) in all time periods.

Geographical location

The plot below shows the homophily based on geographic location (i.e. continent):

Model 4: NodeMix:

Then, we fit the model with number of edges and node mix:

nodemix captures the propensity of nodes with each combination of attribute values to form edges. It evaluates mixing patterns between different attribute levels, similar to what we have seen in the mixing matrix.
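The mixing counts behind this term can be sketched as follows, again on a hypothetical toy network (labels, income levels and ties are assumptions for illustration). Each edge is assigned to the unordered pair of income levels of its endpoints.

```python
from collections import Counter

# Hypothetical income levels and ties (assumptions for illustration).
income = {"A": "H", "B": "H", "C": "UM", "D": "LM", "E": "L", "F": "L"}
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]

def mixing_counts(edges, attr):
    """Count edges for every unordered combination of attribute levels."""
    counts = Counter()
    for u, v in edges:
        pair = tuple(sorted((attr[u], attr[v])))
        counts[pair] += 1
    return dict(counts)

mixing = mixing_counts(edges, income)
for pair, count in sorted(mixing.items()):
    print(pair, count)
```

The counts on the diagonal (same level on both sides) add up to the node match statistic, and all counts together add up to the number of edges. In a fitted model one combination serves as the reference, which in this report is H-H.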

For Polio, almost all the combinations of ties are not statistically significant, suggesting that the income attribute is not as important as it is for the other antigens. For Measles, reflecting what we have already seen in the previous plots (and also the mixing matrix), the pattern of tie formation changes between the pre- and post-Covid periods. Pre-Covid, for example, the log-odds for L-UM or L-LM are about 3 lower than for H-H, and those for H-L or H-LM are about 1 lower than for H-H, while the trends change post-pandemic.

For Influenza, we see an increase in collaboration between different combinations post-pandemic (reflecting the decreasing trend in homophily, i.e. more openness to collaboration across income levels).

For HPV, the likelihood of tie formation for the other combinations is always lower than for H-H.

More Models:

nodematch(“income_group”) + nodematch(“location”)

Now we would like to consider both Geographic location and income attributes as a measure of homophily:

nodematch(“income_group”) + nodefactor(“income_group”)

The plot is based on the model in which we include node factor and node match for different antigens in different time intervals. The plot below is for networks with self-loops:

This plot is for networks without self loops:

nodefactor(“income_group”) + nodematch(“income_group”) + nodefactor(“location”) + nodematch(“location”):

nodemix(“income_group”) + nodefactor(“income_group”)

Warning: Model statistics ‘nodefactor.income_group.Low income’, ‘nodefactor.income_group.Lower middle income’, and ‘nodefactor.income_group.Upper middle income’ are linear combinations of some set of preceding statistics at the current stage of the estimation. This may indicate that the model is nonidentifiable. Evaluating log-likelihood at the estimate.

NodeFactor(Income) + NodeFactor(coverage)

Coverage data for Influenza is not available.

NodeFactor(Income) + NodeFactor(cases)

Edgewise shared partners: GWESP

Two nodes i and j have an edgewise shared partner when they are connected to each other and both i and j are also connected to a third individual k. If i and j were also connected to node l, then i and j would have two edgewise shared partners. In other words, when nodes have edgewise shared partnerships, they form triangles!

Adding one tie has a different effect on the number of edgewise shared partnerships in the network depending on the number of triangles that the tie closes, and the existing number of edgewise shared partnerships that the nodes involved in the triangles already belong to.

If a tie being modelled would not close a triangle, then after adding the tie the nodes will still have the same number of edgewise shared partners, so the GWESP change statistic is zero.
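The behaviour described above can be checked numerically. The sketch below uses the commonly cited geometrically weighted form of the statistic, in which an edge with \(k\) shared partners contributes \(e^{\alpha}\,(1 - (1 - e^{-\alpha})^{k})\). The decay value \(\alpha = 0.5\) and the toy network are assumptions chosen for illustration, not settings used in this report.

```python
from math import exp

def build_adjacency(nodes, edges):
    """Map each node to the set of its neighbours."""
    adj = {node: set() for node in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj

def gwesp(nodes, edges, decay):
    """Geometrically weighted edgewise shared partner statistic."""
    adj = build_adjacency(nodes, edges)
    total = 0.0
    for u, v in edges:
        k = len(adj[u] & adj[v])  # shared partners of this edge
        total += exp(decay) * (1 - (1 - exp(-decay)) ** k)
    return total

nodes = ["A", "B", "C", "D", "E"]
base_edges = [("A", "B"), ("B", "C"), ("D", "E")]  # hypothetical, no triangles
decay = 0.5                                        # illustrative decay (assumption)

base = gwesp(nodes, base_edges, decay)
# Adding A-C closes the triangle A-B-C.
change_closing = gwesp(nodes, base_edges + [("A", "C")], decay) - base
# Adding C-D connects two components without closing any triangle.
change_not_closing = gwesp(nodes, base_edges + [("C", "D")], decay) - base

print(change_closing, change_not_closing)
```

Closing the first triangle gives each of its three edges one shared partner, so the statistic rises by three, whereas the tie that closes no triangle leaves it unchanged.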

Goodness-of-Fit

When we use ordinary least-squares regression, for example, we are used to calculating residuals, which are the differences between the observed and the predicted values for specific values of the independent variable. While an ERGM has no simple analog to the residual of a linear model, we can ask whether our observed network is consistent with the family of networks implied by our estimated model parameters.

In problems for which maximum likelihood estimation is possible, a troubling empirical fact has emerged: when ERGM parameters are estimated and a large number of networks are simulated from the resulting model, these networks frequently bear little resemblance to the observed network. This seemingly paradoxical fact arises because, even though the MLE makes the probability of the observed network as large as possible, this probability might still be extremely small relative to other networks. In such a case, the ERGM does not fit the data well.

The blue points in the plot represent the mean of each statistic across the simulated networks. The black line shows the observed statistics in the actual network.
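The idea of comparing simulated and observed statistics can be sketched without any network library. The example below is a simplified illustration under stated assumptions: it uses a hypothetical observed network, simulates from an edges-only model (each dyad is tied independently with probability equal to the observed density), and compares edge and triangle counts. The real goodness-of-fit plots in this report come from ergm and use other statistics and fitted models.

```python
import random
from itertools import combinations

random.seed(0)  # fixed seed so the sketch is reproducible

nodes = list(range(8))
# Hypothetical observed network with two triangles (assumption).
observed_edges = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (4, 6)]

dyads = list(combinations(nodes, 2))
density = len(observed_edges) / len(dyads)

def triangle_count(edges):
    """Number of node triples in which all three ties are present."""
    edge_set = set(edges)
    return sum(
        1
        for a, b, c in combinations(nodes, 3)
        if (a, b) in edge_set and (a, c) in edge_set and (b, c) in edge_set
    )

def simulate_edges_only():
    """Draw one network from the edges-only model with the observed density."""
    return [d for d in dyads if random.random() < density]

simulated = [simulate_edges_only() for _ in range(1000)]
mean_edges = sum(len(g) for g in simulated) / len(simulated)
mean_triangles = sum(triangle_count(g) for g in simulated) / len(simulated)
observed_triangles = triangle_count(observed_edges)

print(f"edges: observed {len(observed_edges)}, simulated mean {mean_edges:.2f}")
print(f"triangles: observed {observed_triangles}, simulated mean {mean_triangles:.2f}")
```

The simulated networks reproduce the edge count on average because the model was built to match it, but a statistic that was not in the model (here, triangles) can differ from its observed value. A large gap of this kind is what signals poor fit in a goodness-of-fit plot.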

First model, with only edges

Second model: edges + nodefactor(“income_group”)

Third model: edges + nodematch(“income_group”)

Fourth model: edges + nodefactor(“income_group”) + nodematch(“income_group”)

Fifth model: edges + nodemix(“income_group”)

Model Selection:

Based on the BIC of the 4 models mentioned, the one with node match has the lowest BIC and therefore represents our networks better than the others.
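To show how such a BIC comparison works, the sketch below compares an edges-only model with an edges plus node match model on a hypothetical network. It relies on two assumptions: dyads are independent (so the maximized log-likelihood has a closed form based on tie proportions), and the number of dyads is used as the sample size in the BIC penalty. The data are invented and the numbers are not those of this report.

```python
from itertools import combinations
from math import log

# Hypothetical network with strong income homophily (assumption).
income = {"H1": "H", "H2": "H", "H3": "H", "H4": "H",
          "L1": "L", "L2": "L", "L3": "L", "L4": "L"}
edges = [("H1", "H2"), ("H1", "H3"), ("H1", "H4"), ("H2", "H3"), ("H2", "H4"),
         ("L1", "L2"), ("L1", "L3"), ("L2", "L3"), ("L3", "L4"),
         ("H1", "L1"), ("H4", "L4")]

edge_set = {frozenset(e) for e in edges}
dyads = list(combinations(income, 2))
same = [d for d in dyads if income[d[0]] == income[d[1]]]
cross = [d for d in dyads if income[d[0]] != income[d[1]]]

def tied(dyad_list):
    """Number of dyads in the list that carry a tie."""
    return sum(1 for d in dyad_list if frozenset(d) in edge_set)

def bernoulli_loglik(successes, trials):
    """Maximized log-likelihood of independent ties with a common probability."""
    p = successes / trials
    loglik = 0.0
    if successes > 0:
        loglik += successes * log(p)
    if trials > successes:
        loglik += (trials - successes) * log(1 - p)
    return loglik

def bic(loglik, n_params, n_obs):
    """Bayesian information criterion: lower values indicate a better trade-off."""
    return -2 * loglik + n_params * log(n_obs)

# Model A: edges only (one tie probability for all dyads).
loglik_edges = bernoulli_loglik(tied(dyads), len(dyads))
# Model B: edges + nodematch (separate probabilities for same and cross dyads).
loglik_match = (bernoulli_loglik(tied(same), len(same))
                + bernoulli_loglik(tied(cross), len(cross)))

bic_edges = bic(loglik_edges, 1, len(dyads))
bic_match = bic(loglik_match, 2, len(dyads))
print(f"BIC edges only: {bic_edges:.2f}, BIC edges + nodematch: {bic_match:.2f}")
```

In this toy example the extra parameter is worth its penalty because same-income dyads are tied far more often than cross-income dyads, so the node match model has the lower BIC.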