Last updated 2021-06-27
This tutorial is a joint product of the Statnet Development Team:
Pavel N. Krivitsky (University of New South Wales)
Martina Morris (University of Washington)
Mark S. Handcock (University of California, Los Angeles)
Carter T. Butts (University of California, Irvine)
David R. Hunter (Penn State University)
Steven M. Goodreau (University of Washington)
Chad Klumb (University of Washington)
Skye Bender-deMoll (Oakland, CA)
The network modeling software demonstrated in this tutorial is authored by Pavel Krivitsky (ergm.ego), with contributions from Michał Bojanowski.
statnet Project
All statnet packages are open-source, written for the R computing environment, and published on CRAN. The source repositories are hosted on GitHub. Our website is statnet.org.
Need help? For general questions and comments, please email the statnet users group at statnet_help@uw.edu. You’ll need to join the listserv if you’re not already a member. You can do that here: statnet_help listserv.
Found a bug in our software? Please let us know by filing an issue in the appropriate package GitHub repository, with a reproducible example.
Want to request new functionality? We welcome suggestions – you can make a request by filing an issue on the appropriate package GitHub repository. The chances that this functionality will be developed are substantially improved if the requests are accompanied by some proposed code (we are happy to review pull requests).
For all other issues, please email us at contact@statnet.org.
This tutorial provides an introduction to statistical modeling of egocentrically sampled network data with Exponential-family Random Graph Models (ERGMs). The primary package we will be demonstrating is ergm.ego, but we will make use of utilities from other statnet packages at various points. As of version 1.0, ergm.ego depends on the egor package for egocentric network data management.
This workshop assumes basic familiarity with R, experience with network concepts, terminology and data, and familiarity with the general framework for statistical modeling and inference.
The workshops are conducted using RStudio.
Open an R session, and set your working directory to the location where you would like to save this work.
To install the ergm.ego package:
install.packages('ergm.ego')
This will install all of the “dependencies” – the other R packages that ergm.ego needs. Then load the package into R and verify the package version:
library('ergm.ego')
Loading required package: ergm
Loading required package: network
'network' 1.17.1 (2021-06-12), part of the Statnet Project
* 'news(package="network")' for changes since last version
* 'citation("network")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4.0.1 (2021-06-20), part of the Statnet Project
* 'news(package="ergm")' for changes since last version
* 'citation("ergm")' for citation information
* 'https://statnet.org' for help, support, and other information
'ergm' 4 is a major update that introduces some backwards-incompatible
changes. Please type 'news(package="ergm")' for a list of major
changes.
Loading required package: egor
Loading required package: dplyr
Attaching package: 'dplyr'
The following objects are masked from 'package:stats':
filter, lag
The following objects are masked from 'package:base':
intersect, setdiff, setequal, union
Loading required package: tibble
'ergm.ego' 1.0-654 (2021-06-22), part of the Statnet Project
* 'news(package="ergm.ego")' for changes since last version
* 'citation("ergm.ego")' for citation information
* 'https://statnet.org' for help, support, and other information
Attaching package: 'ergm.ego'
The following objects are masked from 'package:ergm':
COLLAPSE_SMALLEST, snctrl
The following object is masked from 'package:base':
sample
packageVersion('ergm.ego')
[1] '1.0.654'
The ergm.ego package is designed to provide principled estimation of and statistical inference for Exponential-family Random Graph Models (“ERGMs”) from egocentrically sampled network data.
This dramatically lowers the bar for empirical research on networks. In many (most?) empirical contexts, it is not possible to collect a network census or even an adaptive (link-traced) sample. Even when one of these may be possible in practice, egocentrically sampled data are typically cheaper and easier to collect.
Long regarded as the poor country cousin in the network data family, egocentric data contain a remarkable amount of information. With the right statistical methods, such data can be used to explore the properties of the complete networks in which they are embedded. The basic idea here will be familiar to anyone who has worked with survey data: you combine what is observed (the data) with assumptions (the model terms and their sampling distributions), to define a class of models (the range of coefficient values).
Once estimated, the fitted model represents a distribution of networks that is centered on the observed properties of the (appropriately scaled) sampled network. The stochastic variation in the network distribution quantifies some of the uncertainty introduced by the assumptions.
The ergm.ego package comprises:
ergm package, but include the specific modifications needed in the egocentric data context.
ergm.ego is designed to work with the other statnet packages. So, for example, once you have fit a model, you can use the summary and diagnostic functions from ergm to evaluate the model fit, the ergm simulate function to simulate complete network realizations from the model, and the network descriptives from sna to explore the properties of networks simulated from the model. You can also use other R functions and packages after converting the network data structure into a data frame.
Putting this all together, you can start with egocentric data, estimate a model, test the coefficients for statistical significance, assess the model goodness of fit, and simulate complete networks of any size from the model. The statistics in your simulated networks will be consistent with the appropriately scaled statistics from your sample for all of the terms that are represented in the model.
The full technical details on ERGM estimation and inference from egocentrically sampled data have been published in Krivitsky and Morris (2017). This section of the tutorial provides a brief introduction to the key concepts.
ERGMs represent a general class of models based in exponential-family theory for specifying the probability distribution for a set of random graphs or networks. Within this framework, one can, among other tasks, obtain maximum-likelihood estimates for the parameters of a specified model for a given data set; test individual models for goodness-of-fit; perform various types of model comparison; and simulate additional networks with the underlying probability distribution implied by that model.
The general form for an ERGM can be written as: \[ P(Y=y;\theta,x)=\frac{\exp(\theta^{\top}g(y,x))}{\kappa(\theta,x)}\qquad (1) \] where \(Y\) is the random variable for the state of the network (with realization \(y\)), \(g(y,x)\) is a vector of model statistics for network \(y\), \(\theta\) is the vector of coefficients for those statistics, and \(\kappa(\theta,x)\) represents the quantity in the numerator summed over all possible networks (typically constrained to be all networks with the same node set as \(y\)).
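To make equation (1) concrete, here is a minimal base R sketch that enumerates every undirected graph on three nodes and computes its probability under an edges-only model. The coefficient value and the choice of a single statistic are assumptions made purely for illustration; realistic networks are far too large to enumerate this way.

```r
# Enumerate all undirected graphs on 3 nodes (3 dyads, so 2^3 = 8 graphs).
dyads <- expand.grid(y12 = 0:1, y13 = 0:1, y23 = 0:1)

# Model statistic g(y): the number of edges in each graph.
g <- rowSums(dyads)

# Hypothetical coefficient value, chosen only for illustration.
theta <- -1

# Numerator exp(theta * g(y)) for each graph, and the normalizing constant kappa.
numerator <- exp(theta * g)
kappa <- sum(numerator)

# ERGM probability of each graph; these sum to 1.
prob <- numerator / kappa
cbind(dyads, edges = g, prob = round(prob, 4))

# Expected number of edges under the model.
expected_edges <- sum(g * prob)
expected_edges
```

Because the only statistic here is the edge count, every graph with the same number of edges receives the same probability.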
The model terms \(g(y,x)\) are functions of network statistics that we hypothesize may be more or less common than what would be expected in a simple random graph (where all ties have the same probability). When working with egocentrically sampled network data, the statistics one can include in the model are limited by the requirement that they can be observed in the sample data (more details in section 4.2).
A key distinction in model terms is whether they are dyad independent or dyad dependent. Dyad independent terms (like nodematch for attribute homophily) imply no dependence between dyads: the presence or absence of a tie may depend on nodal attributes, but not on the state of other ties. Dyad dependent terms (like degree for nodal degree, or triad-related terms like gwesp) imply dependence between dyads. The design of an egocentric sample means that most observable statistics are dyad independent, but there are a few, like degree, that are dyad dependent.
Network data are distinguished by having two units of analysis: the actors and the links between the actors. This gives rise to a range of sampling designs that can be classified into two groups: link tracing designs (e.g., snowball and respondent driven sampling) and egocentric designs.
Link-trace designs have traditionally been used to sample hard-to-reach populations. Sampling begins with a set of seed nodes. The seeds are asked to nominate alters, the alters are then recruited into the sample, asked to nominate their alters, and so on. Each new set of alters is called a wave or a generation, and the number of waves of sampling is a study design variable. At each wave, a census or a sample of the alters may be elicited and/or recruited; this too is a study design variable. Alter recruitment may be investigator-driven or respondent-driven. This gives rise to a wide range of possible link-trace designs. When the decision to elicit a new wave of alters depends on an attribute of the current node (e.g., whether this node is an injection drug user), the design is called an adaptive sample.
Egocentric network sampling comprises a range of designs developed specifically for the collection of network data in social science survey research. The design is (ideally) based on a probability sample of respondents (“egos”) who, via interview, are asked to nominate a list of persons (“alters”) with whom they have a specific type of relationship (“tie”), and are then asked to provide information on the characteristics of the alters and/or the ties. The alters are not recruited or directly observed. Depending on the study design, alters may or may not be uniquely identifiable, and respondents may or may not be asked to provide information on one or more ties among alters (the “alter” matrices). Alters could, in theory, also be present in the data as an ego or as an alter of a different ego; the likelihood of this depends on the sampling fraction.
Egocentric designs sample egos using standard sampling methods, and the sampling of links is implemented through the survey instrument. As a result, these methods are easily integrated into population-based surveys, and, as we show below, inherit many of the inferential benefits.
For the moment ergm.ego uses the minimal egocentric network study design, in which alters cannot be uniquely identified and alter matrices are not collected. The minimal design is more common, and the data are more widely available, largely because it is less invasive and less time-consuming than designs which include identifiable alter matrices. However, development of estimation where alter–alter matrices are available is being planned.
Handcock and Gile (2010): Likelihood inference for partially observed networks, has egocentric data as a special case.
Koskinen and Robins (2010): Bayesian inference for partially observed networks, has egocentric data as a special case.
Krivitsky and Morris (2017): Uses design-based estimators for the sufficient statistics of the ERGM of interest, and then transfers their properties to the ERGM estimate.
EgoStats”: currently does not support alter–alter statistics or directed or bipartite networks.
Even if the whole population is egocentrically observed (i.e., \(S=N\), a census), the alters are still not uniquely identifiable. This limits the kinds of network statistics that can be observed, and the ERGM terms that can be fit to such data. We turn to the notion of sufficiency to identify the terms amenable to egocentric inference.
The framework for estimation and inference relies on two basic properties of exponential family models:
For MLEs, the expected value of a sufficient statistic (\(g(y,x)\)) in the model is equal to its observed value.
The MLE is a smooth function of the sufficient statistic, and is defined for “in between” values of the statistics as well (e.g., fractional edges).
MLEs uniquely maximize the probability of the observed statistics under the model, and any network with the same observed statistics will have the same probability.
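These properties can be checked by hand in the one case where the MLE has a closed form: an edges-only model, where each dyad is tied independently with probability plogis(theta). The base R sketch below is an illustration under that assumption, with helper names that are hypothetical; models with dyad dependent terms require MCMC rather than a formula.

```r
# Edges-only ERGM on N nodes: every dyad has a tie with probability plogis(theta).
N <- 205                      # number of nodes (the size of faux.mesa.high)
dyads <- N * (N - 1) / 2      # number of undirected dyads

# Closed-form MLE of the edges coefficient for a given target edge count
# (hypothetical helper, valid only for the edges-only model).
edges_mle <- function(target_edges) qlogis(target_edges / dyads)

# Expected edge count implied by a coefficient value.
expected_edges <- function(theta) dyads * plogis(theta)

theta_int  <- edges_mle(203)     # integer target statistic
theta_frac <- edges_mle(202.5)   # fractional "in between" target statistic

expected_edges(theta_int)    # reproduces the integer target
expected_edges(theta_frac)   # reproduces the fractional target
```

The fractional target is handled just as smoothly as the integer one, which is what allows scaled (non-integer) sample statistics to be used as estimation targets.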
Design-based estimation of ERGMs is done in three steps:
Together, these allow us to use any statistics that can be observed in an egocentric sample as a term in an ERG model, to estimate the model from a complete pseudo-network that has the same (or appropriately scaled) sufficient statistics, and the networks simulated from the fitted model will be centered on the (scaled) observed statistics.
In practice, egocentric sample statistics generally need to be adjusted for network size and some types of observable discrepancies. This is one of the key differences between working with sampled and unsampled network data.
The treatment of network size is perhaps the most obvious way that egocentric estimation differs from a standard ERGM estimation on a completely observed network. With a network census, the network size is known; by contrast, with a network sample, we don’t typically know the size of the network from which it is drawn.
If the statistics we observe in the sample scale in a known way with network size, then we can adjust for this in the estimation, and the resulting parameter estimates (with the exception of the edges term) will be “size invariant”.
Here we will follow Krivitsky, Handcock, and Morris (2011), who showed that one can obtain a “per capita” size invariant parameterization for dyad-independent statistics in any network by using an offset, approximately equal to \(-\log(N)\), where \(N\) is the number of nodes in the network. The intuition is that this transforms the density-based parameterization (ties per dyad) that is the natural scale for ERGMs into a mean degree-based parameterization (ties per node), where \(T\) is the number of ties: \[ \text{Mean Degree} = \frac{2\times\text{ties}}{\text{nodes}} = \frac{2T}{N} \] \[ \text{Density} = \frac{\text{ties}}{\text{dyads}} = \frac{T}{\frac{N(N-1)}{2}} = \frac{\text{Mean Degree}}{(N-1)} \]
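As a quick numeric check of the two formulas, the base R sketch below plugs in the edge and node counts of the faux.mesa.high network used later in this tutorial (203 ties, 205 nodes). The larger network sizes in the last step are hypothetical and are there only to show how density falls when mean degree is held fixed.

```r
n_ties  <- 203   # edges in faux.mesa.high
n_nodes <- 205   # nodes in faux.mesa.high

mean_degree <- 2 * n_ties / n_nodes
density     <- n_ties / (n_nodes * (n_nodes - 1) / 2)

mean_degree                    # ties per node
density                        # ties per dyad
mean_degree / (n_nodes - 1)    # identical to density

# Holding mean degree fixed, density shrinks roughly like 1/N as N grows,
# which is what an offset of about -log(N) compensates for.
sizes <- c(205, 2050, 20500)   # hypothetical network sizes
data.frame(N = sizes,
           density = mean_degree / (sizes - 1),
           offset  = -log(sizes))
```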
Once the number of edges is adjusted to preserve the mean degree, Krivitsky et al. show that all of the dyad independent terms are properly scaled. For degree-based terms, we would want, by analogy, for per-capita invariance to preserve the degree probability distribution.
Experimental results suggest that the mean-degree preserving offset has this property, but a mathematical proof is elusive.
What we mean by discrepancy is: undirected tie subtotals that are required to balance in theory, but are observed not to balance in the sample. This can happen when ties are broken down by nodal attributes, and the number of ties that group 1 reports to group 2 is not equal to the number that group 2 reports to group 1.
This is another unique feature of egocentrically sampled network data. With a network census, you would have the complete edgelist, with the appropriate nodal attributes for each member of the dyad, so the reports would always balance. For an egocentrically sampled network, and even for an egocentric census, a discrepancy can arise, either from sampling variability, or from measurement error.
In ergm.ego, we assume that any discrepancy is due to sampling variation, and effectively take the average of the discrepant reports to estimate the number of ties for that ego-alter tie configuration. If you know the source of the discrepancy, or want to make a different assumption you may want to address this before fitting the data in ergm.ego.
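The averaging idea can be illustrated with a toy calculation. The report counts below are hypothetical, and the sketch only mirrors the description above; it is not a copy of the package's internal computation.

```r
# Hypothetical tie reports between two groups in an undirected network.
reports_1_to_2 <- 48   # ties that group 1 egos report with group 2 alters
reports_2_to_1 <- 40   # ties that group 2 egos report with group 1 alters

# In theory both subtotals count the same set of ties, so they should match.
discrepancy <- reports_1_to_2 - reports_2_to_1
discrepancy

# Treating the mismatch as sampling variation, average the two reports
# to estimate the number of group 1 to group 2 ties.
balanced_ties <- (reports_1_to_2 + reports_2_to_1) / 2
balanced_ties
```

If you believed instead that one group systematically under-reports, you would adjust the data on that basis before fitting rather than rely on the average.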
Once the network size-invariant parameterization and consistency issues are addressed we have a simple way to construct the target statistics needed for ERGM estimation: we instruct ergm.ego to scale the values of the sample statistics to the desired network size.
The way we do this is by specifying an offset term in the model to scale the estimates to the network size we want. The offset used will depend on the context.
For unweighted samples: To obtain population estimates from ergm.ego from an unweighted sample of size \(|S|\) for a population with a known (or specified) size \(N\), fit the model with an offset of \(\log(N/|S|)=\log(N) - \log(|S|)\).
For weighted samples: To obtain population estimates from ergm.ego from a weighted sample for a population with a known (or specified) size \(N\), first choose a network size, \(|N'|\), to be used for estimation (a pseudo-population that will have the correct nodal attribute distribution specified by the weights), and then fit the model with an offset of \(\log(N/|N'|)=\log(N) - \log(|N'|)\). The criteria for choosing a good value of \(|N'|\) are discussed in the Example section below.
If the population network size is unknown: This is the most general case. If we do not know \(N\), or do not wish to specify it, we often fit with an offset of \(-\log(|S|)\) (for the unweighted sample) or \(-\log(|N'|)\) (for the weighted sample). This will return per-capita estimates that can be easily rescaled to any value post-estimation, e.g., for simulation purposes.
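The three offsets are simple to compute by hand. In the base R sketch below, all of the sizes are hypothetical values chosen for illustration, and the variable names are not ergm.ego arguments.

```r
S       <- 205     # unweighted sample size |S| (hypothetical)
N       <- 10000   # known or specified population size N (hypothetical)
N_prime <- 1025    # pseudo-population size |N'| for a weighted sample (hypothetical)

offset_unweighted <- log(N / S)         # same as log(N) - log(S)
offset_weighted   <- log(N / N_prime)   # same as log(N) - log(N_prime)
offset_percapita  <- -log(S)            # population size unknown, unweighted sample

c(unweighted = offset_unweighted,
  weighted   = offset_weighted,
  percapita  = offset_percapita)
```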
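The standard errors for coefficients in an ergm.ego fit are designed to represent the uncertainty in our estimate. For ERGMs, this uncertainty can be thought of as coming from three possible sources:
(1) the superpopulation process of which the network is a single realization,
(2) the sampling of egos from the network, and
(3) the stochastic MCMC algorithm used in estimation.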
Most treatments of ERGM estimation treat the coefficient \(\theta\) as a parameter of a superpopulation process of which \(y\) is a single realization. The variance of the MLE of \(\theta\) is then conceived as coming from (1) and (3) above.
In contrast, in ergm.ego we treat the network as a fixed, unknown, finite population, so it is not a source of uncertainty. Rather, uncertainty comes from sampling from this network, and from the MCMC algorithm, (2) and (3) above.
This makes ergm.ego inference much more like traditional (frequentist) statistical inference: we imagine repeatedly drawing an egocentric sample, and estimating the ERGM on each replicate. The sampling distribution of the estimate reflects how our estimate will vary from sample to sample.
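The repeated-sampling idea can be simulated directly. The base R sketch below uses a hypothetical fixed population of node degrees and a simple statistic (mean degree) in place of a full ERGM fit, so it illustrates the logic of the inference rather than the package's estimator.

```r
set.seed(42)

# Hypothetical fixed, finite population: one degree value per node.
population_degree <- rpois(5000, lambda = 2)

# Draw one egocentric sample of egos (without replacement) and
# estimate the mean degree from it.
estimate_mean_degree <- function(sample_size) {
  egos <- base::sample(population_degree, size = sample_size, replace = FALSE)
  mean(egos)
}

# Repeat the sampling many times to approximate the sampling distribution.
estimates <- replicate(1000, estimate_mean_degree(200))

mean(population_degree)   # the fixed population value
mean(estimates)           # the estimates center on the population value
sd(estimates)             # sample-to-sample variability, i.e. a standard error
```

The population never changes between replicates; all of the variability in the estimates comes from which egos happen to be drawn.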
The ergm.ego package can be used with weighted survey data and complex sampling designs. In that context, the egor package transforms the ego tibble into a srvyr object. The srvyr package can be used for descriptive statistics, and ergm.ego will incorporate the survey design into its estimation and inference.
This topic is beyond the scope of this introductory workshop but the ergm.ego package has an example you can run for more information:
example(sample.egor)
ergm.ego
Since ergm.ego is essentially a wrapper around ergm, there are relatively few functions in the ergm.ego package itself. The functions that are there deal with the specific requirements associated with data management, estimation and inference for egocentrically sampled data.
To get a list of documented functions, type:
library(help='ergm.ego')
The main R objects unique to ergm.ego are:
egor objects for storing the original data (the analog to network objects in ergm; using the package egor),
ergm.ego objects, which store the model fit results (the analog to ergm objects in ergm).
Once you simulate from the fit, the resulting objects are just network objects.
The functionality can be divided into groups as follows:
Stripped down to the basics, egocentric network data comprise:
The egor object has a simple, analogous structure for storing this information: a list object with 3 components:
ego - data frame of egos and their attributes
alter - a data frame of alters and their attributes nominated by egos (by default identified by column egoID) or a list of data frames (one for each ego).
aatie - a data frame with edge list of alter-alter ties or a list of data frames (one for each ego).
In addition, one can specify:
ego_design - a list of arguments passed to srvyr::as_survey_design() specifying the sample design for egos. For example: probs for unequal probability independent sample, strata for stratified samples etc.
alter_design - currently a list with one element, max= providing the maximal number of alters ego could nominate for a Fixed Choice Design
The capacity to represent survey design elements makes egor a flexible and powerful foundation for network data analysis.
The simplicity of the data structure makes it easy to construct egor objects from external data read into R, and there are transformation utilities for working with other data formats (like network and igraph objects), which we will demonstrate in the Example section below.
For more information:
?as.egor
The possible terms in an ergm.ego model are inherently limited to those that are egocentrically observable: statistics that can be inferred from an egocentric sample. In general, these will include terms that are functions of nodal attributes and attribute mixing, degree distribution terms, and triadic terms (when the alter–alter ties are observed). The ergm.ego terms have the same names and arguments as their ergm counterparts; there are just far fewer (n=14) of them available in this context.
Dyad independent terms include density and nodal attribute based measures:
edges
nodefactor (for discrete/nominal vars) and nodecov (for continuous)
nodematch (for homophily), nodemix (for general mixing patterns) and absdiff
Dyad dependent terms include degree- and triad-based measures:
degree, degrange, gwdegree, and degreepopularity
concurrent and concurrentties
esp, gwesp, transitiveties and cyclicalties, but these can only be used if alter–alter ties have been observed.
For the full list of ergm.ego terms and their syntax, type:
help('ergm.ego-terms')
As in ergm, these terms can be used on the right-hand side of formulas in calls to model and simulation functions.
We will work with the faux.mesa.high dataset that is included with the ergm package, using the as.egor function to transform it into an egocentric dataset. In essence, this creates an egocentric census of the network: a census of all nodes in the network, not a sample.
In this egocentric census, every node is the center of their own egonet – we know their alters, and the ties between their alters, but we cannot match the alters across the egonets because they are not uniquely identified. We can still compare the fits we get from ergm.ego (from the ego data) and ergm (from the original network) for models with the same terms.
Preliminaries:
library(ergm.ego)
Check package versions
sessionInfo()
Set seed for simulations – this is not necessary, but it ensures that we all get the same results (if we execute the same commands in the same order).
set.seed(1)
We’ll show 2 examples of how to create an egor object here.
network object
Read in the faux.mesa.high data:
data(faux.mesa.high)
mesa <- faux.mesa.high
Take a quick look at the complete network
plot(mesa, vertex.col="Grade")
legend('bottomleft',fill=7:12,legend=paste('Grade',7:12),cex=0.75)
Now, let’s turn this into an egor object:
mesa.ego <- as.egor(mesa)
Take a look at this object – there are several ways to do this:
names(mesa.ego) # what are the components of this object?
[1] "ego" "alter" "aatie"
mesa.ego # prints a few lines for each component
# EGO data (active): 205 x 4
.egoID Grade Race Sex
<int> <dbl> <chr> <chr>
1 1 7 Hisp F
2 2 7 Hisp F
3 3 11 NatAm M
4 4 8 Hisp M
5 5 10 White F
# ALTER data: 406 x 5
.altID .egoID Grade Race Sex
<int> <int> <dbl> <chr> <chr>
1 174 1 7 Hisp F
2 161 1 7 Hisp F
3 151 1 7 Hisp F
# AATIE data: 372 x 3
.egoID .srcID .tgtID
<int> <int> <int>
1 1 151 127
2 1 127 52
3 1 127 87
#View(mesa.ego) # opens the component in the RStudio source window
class(mesa.ego) # what type of "object" is this?
[1] "egor" "list"
Each of the components of the egor object is a simple table, or data.frame.
class(mesa.ego$ego) # and what type of objects are the components?
[1] "tbl_df" "tbl" "data.frame"
class(mesa.ego$alter)
[1] "tbl_df" "tbl" "data.frame"
class(mesa.ego$aatie)
[1] "tbl_df" "tbl" "data.frame"
The ego table contains the ego ID (.egoID), and the nodal attributes Race, Grade and Sex. This is equivalent to a standard person-based survey sample flat file format.
mesa.ego$ego # first few rows of the ego table
# A tibble: 205 x 4
.egoID Grade Race Sex
<int> <dbl> <chr> <chr>
1 1 7 Hisp F
2 2 7 Hisp F
3 3 11 NatAm M
4 4 8 Hisp M
5 5 10 White F
6 6 10 Hisp F
7 7 8 NatAm M
8 8 11 NatAm M
9 9 9 White M
10 10 9 NatAm F
# ... with 195 more rows
The alter table is a type of edgelist: it lists the edges for each ego. It contains the alter ID (.altID), the corresponding ego ID, and a set of alter nodal attributes. Note that this is a slightly different data structure than a standard network edgelist.
The standard network edgelist contains one unique record for each edge; both ego and alter ID may appear more than once (depending on their degree), but each link is only represented once.
The alter table in this egor object is a different type of edgelist, as it is “egocentric.”
The number of times a node’s ID appears in the .altID list is equal to their degree, as is the number of times their ID will appear in the .egoID list.
mesa.ego$alter # first few rows of the alter table
# A tibble: 406 x 5
.altID .egoID Grade Race Sex
<int> <int> <dbl> <chr> <chr>
1 174 1 7 Hisp F
2 161 1 7 Hisp F
3 151 1 7 Hisp F
4 127 1 7 Hisp F
5 110 1 7 Hisp F
6 100 1 7 Hisp F
7 96 1 7 NatAm F
8 92 1 7 NatAm F
9 87 1 7 White F
10 70 1 7 NatAm F
# ... with 396 more rows
# ties show up twice, but alter info is linked to .altID
mesa.ego$alter %>% filter((.altID==1 & .egoID==25) | (.egoID==1 & .altID==25))
# A tibble: 2 x 5
.altID .egoID Grade Race Sex
<int> <int> <dbl> <chr> <chr>
1 25 1 7 White F
2 1 25 7 Hisp F
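The relationship between a standard edgelist and the egocentric alter table can be reproduced on a toy network. The base R sketch below uses a hypothetical four-node network and plain data frames; the column names imitate those in the egor object but nothing here calls the egor package.

```r
# A hypothetical undirected network with 4 nodes and 3 edges (node 4 is an isolate).
edgelist <- data.frame(from = c(1, 1, 2), to = c(2, 3, 3))

# Egocentric view: each edge is reported once by each of its endpoints.
alter_table <- rbind(
  data.frame(.egoID = edgelist$from, .altID = edgelist$to),
  data.frame(.egoID = edgelist$to,   .altID = edgelist$from)
)
alter_table <- alter_table[order(alter_table$.egoID), ]

nrow(edgelist)      # 3 edges
nrow(alter_table)   # 6 rows, two per edge

# Each node's degree is the number of rows in which it appears as the ego.
degree <- table(factor(alter_table$.egoID, levels = 1:4))
degree
```

Counting appearances in the .altID column instead gives the same degrees, because every tie is recorded from both ends.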
The aatie table lists the .egoID, and the IDs of the two alters that have a tie. The alters are distinguished as .srcID and .tgtID to allow for the possibility of directed tie data. In the case of undirected tie data, as we have here, each alter-alter tie will be represented twice, just swapping the target and source IDs.
mesa.ego$aatie # first few rows of the aatie table
# A tibble: 372 x 3
.egoID .srcID .tgtID
<int> <int> <int>
1 1 151 127
2 1 127 52
3 1 127 87
4 1 127 151
5 1 110 87
6 1 110 92
7 1 110 96
8 1 100 96
9 1 96 87
10 1 96 110
# ... with 362 more rows
Since each of the egor components is a simple rectangular table, it’s easy to read in external data and use it to construct an egor object. You just need to make sure that the structure of the external file is consistent with the structure of the tables we looked at above.
To demonstrate, we will construct an egor object derived from our mesa.ego data that has the features of an egocentrically sampled data set: the alters are not uniquely identified. We will also ignore the alter–alter ties.
First, we write out the first two tables in our mesa.ego into external data files, deleting the .altID from the alter file.
# egos
write.csv(mesa.ego$ego, file="mesa.ego.table.csv", row.names = FALSE)
# alters
write.csv(mesa.ego$alter[,-1], file="mesa.alter.table.csv", row.names = FALSE)
Now read them back in:
mesa.egos <- read.csv("mesa.ego.table.csv")
head(mesa.egos)
.egoID Grade Race Sex
1 1 7 Hisp F
2 2 7 Hisp F
3 3 11 NatAm M
4 4 8 Hisp M
5 5 10 White F
6 6 10 Hisp F
mesa.alts <- read.csv("mesa.alter.table.csv")
head(mesa.alts)
.egoID Grade Race Sex
1 1 7 Hisp F
2 1 7 Hisp F
3 1 7 Hisp F
4 1 7 Hisp F
5 1 7 Hisp F
6 1 7 Hisp F
To create an egor object from data frames, we use the egor() function:
my.egodata <- egor(egos = mesa.egos,
                   alters = mesa.alts,
                   ID.vars = list(ego = ".egoID"))
my.egodata
# EGO data (active): 205 x 4
.egoID Grade Race Sex
<chr> <int> <chr> <chr>
1 1 7 Hisp F
2 2 7 Hisp F
3 3 11 NatAm M
4 4 8 Hisp M
5 5 10 White F
# ALTER data: 406 x 4
.egoID Grade Race Sex
<chr> <int> <chr> <chr>
1 1 7 Hisp F
2 1 7 Hisp F
3 1 7 Hisp F
# AATIE data: 0 x 3
Note how the alter data no longer have a unique alter identifier.
For another example that uses the alter–alter ties, see:
example("egor")
We will explore some of the other functions available for manipulating the egor object in a later section.
Prior to model specification, we can explore the data using descriptive statistics observable in the original egocentric sample. In general, the observable statistics are the same as those that ergm.ego can estimate.
We can use standard R commands to view nodal attribute frequencies:
# to reduce typing, we'll pull the ego and alter data frames
egos <- mesa.ego$ego
alters <- mesa.ego$alter
table(egos$Sex, exclude=NULL)
F M
99 106
table(egos$Race, exclude=NULL)
Black Hisp NatAm Other White
6 109 68 4 18
barplot(table(egos$Grade), ylab="frequency")
# compare egos and alters...
par(mfrow=c(1,2))
barplot(table(egos$Race)/nrow(egos),
        main="Ego Race Distn", ylab="proportion",
        ylim = c(0,0.5))
barplot(table(alters$Race)/nrow(alters),
        main="Alter Race Distn", ylab="proportion",
        ylim = c(0,0.5))
To look at the mixing matrix, we’ll use the mixingmatrix() function on the egor object, and we’ll compare the output to what we would get from using this function on the original network object.
Note how the ties on the diagonal are counted twice in the ergm.ego data, compared with the original network data, but the off-diagonal tie counts are the same. Note also, though, that these off-diagonal counts are symmetric, because this is undirected data. So, in both cases, the off-diagonal ties are actually being counted twice (once above and once below the diagonal), but in the original network version, the ties on the diagonal are only counted once.
# to get the crosstabulated counts of ties:
mixingmatrix(mesa.ego,"Grade")
     7   8   9  10  11  12
7  150   0   0   1   1   1
8    0  66   2   4   2   1
9    0   2  46   7   6   4
10   1   4   7  18   1   5
11   1   2   6   1  34   5
12   1   1   4   5   5  12
# contrast with the original network crosstab:
mixingmatrix(mesa, "Grade")
     7   8   9  10  11  12
7   75   0   0   1   1   1
8    0  33   2   4   2   1
9    0   2  23   7   6   4
10   1   4   7   9   1   5
11   1   2   6   1  17   5
12   1   1   4   5   5   6
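The doubling of the diagonal follows directly from the egocentric bookkeeping, and a toy example makes this visible. The base R sketch below uses a hypothetical four-node network and plain cross-tabulation; it does not call mixingmatrix().

```r
# Hypothetical undirected network: a grade for each of 4 nodes, and 3 edges.
grade <- c(7, 7, 8, 8)
edgelist <- data.frame(from = c(1, 3, 1), to = c(2, 4, 3))
# Edges: 1-2 (within grade 7), 3-4 (within grade 8), 1-3 (between grades).

# Egocentric reports: every edge appears once from each endpoint.
ego_id <- c(edgelist$from, edgelist$to)
alt_id <- c(edgelist$to, edgelist$from)

# Cross-tabulate ego grade by alter grade over all ego-alter reports.
ego_mix <- table(ego = grade[ego_id], alter = grade[alt_id])
ego_mix
```

Each within-grade edge contributes two reports to the same diagonal cell, while the single between-grade edge contributes one report to each of the two off-diagonal cells.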
You can also use this function to calculate the row probabilities of the mixing matrix:
# to get the row conditional probabilities:
round(mixingmatrix(mesa.ego, "Grade", rowprob=TRUE), 2)
      7    8    9   10   11   12
7  0.98 0.00 0.00 0.01 0.01 0.01
8  0.00 0.88 0.03 0.05 0.03 0.01
9  0.00 0.03 0.71 0.11 0.09 0.06
10 0.03 0.11 0.19 0.50 0.03 0.14
11 0.02 0.04 0.12 0.02 0.69 0.10
12 0.04 0.04 0.14 0.18 0.18 0.43
round(mixingmatrix(mesa.ego, "Race", rowprob=TRUE), 2)
      Black Hisp NatAm Other White
Black  0.00 0.31  0.50  0.00  0.19
Hisp   0.04 0.60  0.23  0.01  0.12
NatAm  0.08 0.26  0.59  0.00  0.06
Other  0.00 1.00  0.00  0.00  0.00
White  0.11 0.49  0.22  0.00  0.18
We can also examine the observed number of ties, mean degree, and degree distributions.
# first, using the original network
network.edgecount(faux.mesa.high)
[1] 203
# compare to the egor data
# note that the ties are double counted, so we need to divide by 2.
nrow(mesa.ego$alter)/2
[1] 203
# mean degree -- here we want to count each "stub", so we don't divide by 2
nrow(mesa.ego$alter)/nrow(mesa.ego$ego)
[1] 1.980488
# overall degree distribution
summary(mesa.ego ~ degree(0:20))
scaled mean SE
degree0 57 6.4306
degree1 51 6.2048
degree2 30 5.0730
degree3 28 4.9289
degree4 18 4.0620
degree5 10 3.0917
degree6 2 1.4107
degree7 4 1.9852
degree8 1 1.0000
degree9 2 1.4107
degree10 1 1.0000
degree11 0 0.0000
degree12 0 0.0000
degree13 1 1.0000
degree14 0 0.0000
degree15 0 0.0000
degree16 0 0.0000
degree17 0 0.0000
degree18 0 0.0000
degree19 0 0.0000
degree20 0 0.0000
# and stratified by sex
summary(mesa.ego ~ degree(0:13, by="Sex"))
scaled mean SE
deg0.SexF 23 4.5299
deg1.SexF 23 4.5299
deg2.SexF 10 3.0917
deg3.SexF 17 3.9581
deg4.SexF 12 3.3694
deg5.SexF 7 2.6066
deg6.SexF 1 1.0000
deg7.SexF 3 1.7235
deg8.SexF 1 1.0000
deg9.SexF 0 0.0000
deg10.SexF 1 1.0000
deg11.SexF 0 0.0000
deg12.SexF 0 0.0000
deg13.SexF 1 1.0000
deg0.SexM 34 5.3385
deg1.SexM 28 4.9289
deg2.SexM 20 4.2588
deg3.SexM 11 3.2343
deg4.SexM 6 2.4193
deg5.SexM 3 1.7235
deg6.SexM 1 1.0000
deg7.SexM 1 1.0000
deg8.SexM 0 0.0000
deg9.SexM 2 1.4107
deg10.SexM 0 0.0000
deg11.SexM 0 0.0000
deg12.SexM 0 0.0000
deg13.SexM 0 0.0000
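The tie count and mean degree computed at the start of this block rest on the fact that each undirected tie appears once in each endpoint's alter list. A minimal base R sketch with a made-up four-node network (the edge list and object names are hypothetical) shows the bookkeeping:

```r
# Hypothetical undirected network on 4 nodes with 3 ties.
n_nodes <- 4
ties <- rbind(c(1, 2), c(1, 3), c(3, 4))

# An egocentric census records every tie from both ends ("stubs").
alters <- rbind(data.frame(ego = ties[, 1], alter = ties[, 2]),
                data.frame(ego = ties[, 2], alter = ties[, 1]))

n_ties      <- nrow(alters) / 2        # each tie is counted twice
mean_degree <- nrow(alters) / n_nodes  # stubs per node, so no halving
c(ties = n_ties, mean_degree = mean_degree)
```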
For the degree distribution we used the summary function in the same way that we would use it in ergm with a network object. But the summary function also has an egor-specific argument, scaleto, that allows you to scale the summary statistics to a network of arbitrary size. So, for example, we can obtain the degree distribution scaled to a network of size 100,000, or a network that is 100 times larger than the sample.
summary(mesa.ego ~ degree(0:10), scaleto=100000)
scaled mean SE
degree0 27804.88 3136.89
degree1 24878.05 3026.75
degree2 14634.15 2474.63
degree3 13658.54 2404.34
degree4 8780.49 1981.47
degree5 4878.05 1508.16
degree6 975.61 688.17
degree7 1951.22 968.41
degree8 487.80 487.80
degree9 975.61 688.17
degree10 487.80 487.80
summary(mesa.ego ~ degree(0:10), scaleto=nrow(mesa.ego$ego)*100)
scaled mean SE
degree0 5700 643.06
degree1 5100 620.48
degree2 3000 507.30
degree3 2800 492.89
degree4 1800 406.20
degree5 1000 309.17
degree6 200 141.07
degree7 400 198.52
degree8 100 100.00
degree9 200 141.07
degree10 100 100.00
Note that the first scaling results in fractional numbers of nodes at each degree, because the proportion at each degree level does not scale to an integer for this population size. Again, this is not a problem for estimation, but one should be careful with descriptive statistics that expect integer values. The second scaling does result in integer counts because it is a multiple of the sample size.
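The scaled values above can be reproduced by hand. The sketch below does so for degree0, assuming the egos are a simple random sample so that the usual standard error of a proportion applies (with an n/(n-1) correction). That assumption happens to match the printed output, but it is not taken from the ergm.ego source, and scale_stat is a hypothetical helper.

```r
n  <- 205   # number of egos in mesa.ego
k0 <- 57    # egos with degree 0 (from the summary output)
p  <- k0 / n

# Sample variance of the 0/1 indicator "ego has degree 0".
s2 <- p * (1 - p) * n / (n - 1)

# Hypothetical helper: statistic and SE scaled to a network of size N.
scale_stat <- function(N) {
  c(scaled_mean = N * p,             # expected count in a network of size N
    SE          = N * sqrt(s2 / n))  # SE of the sample mean, scaled by N
}

round(scale_stat(205), 4)
round(scale_stat(100000), 2)
round(scale_stat(205 * 100), 2)
```

Because both the count and its standard error are multiplied by the same factor, the relative precision does not change with the target network size.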
We can plot the degree distribution using another egor-specific function: degreedist. As with the mixingmatrix function, this can return either the counts or the proportions at each degree.
# to get the frequency counts
degreedist(mesa.ego, plot=T)
degreedist(mesa.ego, by="Sex", plot=T)
# to get the proportion at each degree level
degreedist(mesa.ego, by="Sex", plot=T, prob=T)
The degreedist method for egor objects also has an argument that lets you overplot the expected degree distribution for a Bernoulli random graph with the same expected density. This is the plot equivalent of a CUG test (“conditional uniform graph”).
degreedist(mesa.ego, brg=T)
degreedist(mesa.ego, by="Sex", prob=T, brg=T)
The brg overplot is based on 50 simulations of a Bernoulli random graph with the same number of nodes and expected density, implemented by using an ergm.ego simulation from an edges-only model with \(\theta=\mbox{logit}(\mbox{probability of a tie})\) from the observed data. The overplot shows the mean and 2 standard deviations obtained for each degree value from the 50 simulations. Note that the brg overplot automatically scales to proportions when prob=T.
What does the plot suggest about the distribution of degree in this network?
From the exploratory work, several characteristics emerged that we might want to capture in a model:
We can use ergm.ego to fit a sequence of nested models, both to estimate the parameters associated with these statistics and to test their significance. We can diagnose both the estimation process (to verify convergence and good mixing in the MCMC sampler) and the fit of the model to the data. In both cases, we will use functionality that will be familiar to ergm users: MCMC diagnostics and GOF.
One thing that is different from a standard ergm call is that we need to specify the scaling, both for the pseudo-population (\(N'\)) that will be used to set the target statistics during estimation, and for the population (N) size that the final rescaled coefficients will represent. Recall,
ergm.ego, it is controlled by the popsize top-level argument.
Bias – In general, estimation bias is reduced the closer \(N'\) is to \(N\) (usually larger).
Computing time – The larger the pseudo-population, the longer the estimation takes.
Sample weights – In general, it is good practice for the smallest sample weight to produce at least 1 observation in the pseudo-population network, though more is better.
This leads to different guidelines for data with and without weights.
Simulation studies in Krivitsky & Morris (2017) suggest that a good rule of thumb is to have a minimum pseudo-population size of 1,000 for unweighted data. For weighted data the pseudo-population size should be at least 1 * sampleSize/smallestWeight (or 3 * sampleSize/smallestWeight to be safe), or 1,000, whichever is larger.
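The rule of thumb can be written as a small helper. This is a sketch only: min_ppopsize and its arguments are hypothetical names, not part of ergm.ego, and the weights are assumed to be on the scale intended by the sampleSize/smallestWeight formula in the text.

```r
# Hypothetical helper (not part of ergm.ego): smallest recommended
# pseudo-population size under the rule of thumb quoted above.
min_ppopsize <- function(weights, safety = 1, min_size = 1000) {
  n <- length(weights)                    # sample size (number of egos)
  by_weight <- safety * n / min(weights)  # safety * sampleSize / smallestWeight
  max(min_size, ceiling(by_weight))
}

# Unweighted sample of 205 egos: the minimum of 1000 applies.
min_ppopsize(rep(1, 205))

# Made-up weights whose smallest value is 0.25, with the safety factor of 3.
w <- c(rep(0.25, 50), rep(1.25, 150))
min_ppopsize(w, safety = 3)
```

The resulting number could then be passed as ppopsize through control.ergm.ego, as the tutorial does with ppopsize=1000.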
In ergm.ego, \(|N'|\) is controlled by a combination of three factors:
popsize (\(|N|\) or 1) (default: 1),
control.ergm.ego control parameter ppopsize (default: "auto"),
control.ergm.ego control parameter ppopsize.mul (default: 1).
If ppopsize is left at its default ("auto"),
if popsize is left at 1, use \(|S|\times\)ppopsize.mul;
if popsize is specified, use \(|N|\times\)ppopsize.mul.
You can also force one of these two regimes by setting ppopsize to "samp" or "pop", respectively, or set it to a number to force a particular \(|N'|\), ignoring ppopsize.mul.
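The selection rules above can be summarized as a short decision function. This is an illustrative sketch: nominal_ppopsize is a hypothetical helper that only mirrors the rules as described in the text, it is not ergm.ego code, and the fitted object's ppopsize element remains the authoritative value for the pseudo-population actually used.

```r
# Hypothetical helper (not ergm.ego code): nominal |N'| implied by the
# rules described in the text.
nominal_ppopsize <- function(sample_size, popsize = 1,
                             ppopsize = "auto", ppopsize.mul = 1) {
  if (is.numeric(ppopsize)) return(ppopsize)  # explicit number, multiplier ignored
  if (ppopsize == "auto")
    ppopsize <- if (popsize == 1) "samp" else "pop"
  base <- switch(ppopsize,
                 samp = sample_size,  # |S|
                 pop  = popsize,      # |N|
                 stop("unknown ppopsize setting"))
  base * ppopsize.mul
}

nominal_ppopsize(205)                                       # defaults: |S|
nominal_ppopsize(205, popsize = 1461)                       # popsize given: |N|
nominal_ppopsize(205, ppopsize = "samp", ppopsize.mul = 5)  # forced "samp" regime
nominal_ppopsize(205, ppopsize = 1000, ppopsize.mul = 5)    # explicit number
```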
For more information, see
?control.ergm.ego
We will give an example below.
In both cases, the scaling will only affect the estimate of the edges term, and we will demonstrate this below.
Let’s start with a simple edges-only model to see what is the same and what is different from a call to ergm:
fit.edges <- ergm.ego(mesa.ego ~ edges)
summary(fit.edges)
Call:
ergm(formula = ergm.formula, constraints = constraints, offset.coef = ergm.offset.coef,
target.stats = m, eval.loglik = FALSE, control = control$ergm)
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
netsize.adj -5.32301 0.00000 0 -Inf <1e-04 ***
edges 0.69930 0.08257 1 8.469 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following terms are fixed by offset and are not estimated:
netsize.adj
This is a simple model, for a homogeneous tie probability – a Bernoulli random graph with the mean degree observed in our sampled data. The only difference in the syntax from standard ergm is the function call to ergm.ego. Let’s look under the hood at the components that are output to the fit.edges object:
names(fit.edges)
[1] "coefficients" "sample" "sample.obs" "iterations"
[5] "MCMCtheta" "loglikelihood" "gradient" "hessian"
[9] "covar" "failure" "network" "newnetworks"
[13] "newnetwork" "coef.init" "est.cov" "coef.hist"
[17] "stats.hist" "steplen.hist" "control" "etamap"
[21] "call" "ergm_version" "MPLE_is_MLE" "formula"
[25] "target.stats" "nw.stats" "target.esteq" "constrained"
[29] "constraints" "obs.constraints" "reference" "estimate"
[33] "estimate.desc" "offset" "drop" "estimable"
[37] "v" "m" "ergm.formula" "ergm.offset.coef"
[41] "egor" "ppopsize" "popsize" "netsize.adj"
[45] "ergm.covar" "DtDe"
fit.edges$ppopsize
[1] 205
fit.edges$popsize
[1] 1
Many of the elements of the object are the same as you would get from an ergm fit, but the last few elements are unique to ergm.ego. Here you can see the ppopsize – the pseudo-population size used to construct the target statistics, and popsize – the final scaled population size after network size adjustment is applied. The values that were used in the fit were the default values, since we did not specify otherwise. So, ppopsize\(=205\) (the sample size, or number of egos), and popsize\(= 1\), so the scaling returns the per capita estimates from the model parameters.
The summary shows netsize.adj \(= -5.32301\), which is \(-\log(205)\).
The summary function also reports that:
The following terms are fixed by offset and are not estimated:
netsize.adj
So what would happen if we fit the model instead with target statistics from a pseudo-population of size 1000? To do this, we explicitly change the value of the ppopsize parameter through the control argument:
summary(ergm.ego(mesa.ego ~ edges,
control = control.ergm.ego(ppopsize=1000)))
Call:
ergm(formula = ergm.formula, constraints = constraints, offset.coef = ergm.offset.coef,
target.stats = m, eval.loglik = FALSE, control = control$ergm)
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
netsize.adj -6.93245 0.00000 0 -Inf <1e-04 ***
edges 0.68124 0.08055 0 8.457 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following terms are fixed by offset and are not estimated:
netsize.adj
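Now the reported netsize.adj value is \(-6.93245 = -\log(1025)\), close to but not exactly \(-\log(1000) = -6.9077553\). This is consistent with a pseudo-population made of 5 copies of each of the 205 egos (\(5 \times 205 = 1025\)), the multiple of the sample size nearest to the requested 1000.
Note that the estimated edges coefficient is nearly the same in both models (0.699 and 0.681), a difference that is small relative to the standard error of about 0.08. This is the behavior we expect – the model is returning the same per capita value in both cases; it is just using a different scaling for the target statistics used in the fit. For this simple model, there may not be much difference in the properties of the estimates for these two different pseudo-population sizes.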
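For an edges-only model the fit can be checked by hand: the tie log-odds in a Bernoulli graph is the logit of the density, and subtracting the netsize.adj offset (that is, adding \(\log|N'|\)) gives the per capita edges coefficient. The sketch below follows that reasoning using only the sample's mean degree. It is an approximation for illustration, not how ergm.ego computes the estimate; edges_coef is a hypothetical helper, and 1025 is the pseudo-population size implied by the reported offset of -6.93245.

```r
mean_degree <- 406 / 205   # alters per ego in mesa.ego

# Hypothetical helper: closed-form per capita edges coefficient.
edges_coef <- function(ppop) {
  density <- mean_degree / (ppop - 1)  # tie probability in a network of size ppop
  offset  <- -log(ppop)                # the netsize.adj value
  qlogis(density) - offset             # logit(density) + log(ppop)
}

round(c(offset_205  = -log(205),
        offset_1025 = -log(1025),
        edges_205   = edges_coef(205),
        edges_1025  = edges_coef(1025)), 3)
```

The two closed-form values differ only in the second decimal place, which agrees with the MCMC-based estimates in the two summaries being close but not identical.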
We will examine the impact of modifying the popsize parameter in a later section below.
As the output shows, the model was fit using MCMC. This, too, is different from fitting the edges-only model with ergm. For ergm, models with only dyad-independent terms are fit using Newton-Raphson algorithms (the same algorithm used for logistic regression), not MCMC. For ergm.ego, estimation is always based on MCMC, regardless of the terms in the model.
Now let’s see what the MCMC diagnostics for this model look like:
mcmc.diagnostics(fit.edges, which ="plots")
MCMC diagnostics shown here are from the last round of simulation, prior to computation of final parameter estimates. Because the final estimates are refinements of those used for this simulation run, these diagnostics may understate model performance. To directly assess the performance of the final model on in-model statistics, please use the GOF command: gof(ergmFitObject, GOF=~model).
Again, this is a simple model, so the diagnostics suggest good mixing, and the distribution of the sample statistic deviations from the targets (on the right panel) suggest that simulations from the model will match the target stats. We can verify that with a call to gof, specifying the “model” for the comparison.
plot(gof(fit.edges, GOF="model"))
So, networks simulated from the model appear to be well centered around the values of the observed model terms.
Finally, we should evaluate the model fit. We can also use gof to do this, by comparing observed statistics that are not in the model, like the full degree distribution, with simulations from the fitted model. This is the same procedure that we use for ergm, but now with a more limited set of observed higher-order statistics to use for assessment.
plot(gof(fit.edges, GOF="degree"))
Here, finally, we see some bad behavior, but this is expected from such a simple model. The GOF plot shows there are almost twice as many isolates in the observed data as would be predicted from a simple edges-only model.
Of course we knew this from having looked at the degree distribution plots with the Bernoulli random graph overlay.
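The isolate shortfall can also be checked with a quick calculation. In a Bernoulli random graph each node's degree is binomial, so the expected degree counts follow directly from the observed density. The sketch below uses the counts reported earlier in this tutorial; it is a back-of-the-envelope check, not the simulation that gof or the brg overplot performs.

```r
n     <- 205                  # egos in mesa.ego
ties  <- 203                  # observed ties
p_tie <- ties / choose(n, 2)  # density of the matching Bernoulli graph

# Degree of a node in a Bernoulli graph is Binomial(n - 1, p_tie).
expected <- n * dbinom(0:4, size = n - 1, prob = p_tie)
observed <- c(57, 51, 30, 28, 18)  # degree 0-4 counts from the summary output

round(rbind(expected, observed), 1)
```

The first column compares the expected and observed numbers of isolates.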
Ok, so that’s a full cycle of description, estimation, and model assessment.
Let’s try fitting a degree(0) term to see how that changes the degree distribution assessment.
set.seed(1)
fit.deg0 <- ergm.ego(mesa.ego ~ edges + degree(0), control=control.ergm.ego(ppopsize=1000))
summary(fit.deg0)
Call:
ergm(formula = ergm.formula, constraints = constraints, offset.coef = ergm.offset.coef,
target.stats = m, eval.loglik = FALSE, control = control$ergm)
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
netsize.adj -6.9324 0.0000 0 -Inf <1e-04 ***
edges 1.1699 0.1055 0 11.087 <1e-04 ***
degree0 1.4833 0.2711 0 5.472 <1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following terms are fixed by offset and are not estimated:
netsize.adj
mcmc.diagnostics(fit.deg0, which = "plots")
plot(gof(fit.deg0, GOF="model"))
plot(gof(fit.deg0, GOF="degree"))
So, we’ve now fit the isolates exactly, and the overall fit is better, but the deviations suggest there are more nodes with just one tie than would be expected, given the mean degree and the number of isolates.
And just to round things off, let’s fit a relatively large model. Here we’ll specify the omitted category for Race as the largest group.
fit.full <- ergm.ego(mesa.ego ~ edges + degree(0:1)
+ nodefactor("Sex")
+ nodefactor("Race", levels = -LARGEST)
+ nodefactor("Grade")
+ nodematch("Sex")
+ nodematch("Race")
+ nodematch("Grade"))
summary(fit.full)
Call:
ergm(formula = ergm.formula, constraints = constraints, offset.coef = ergm.offset.coef,
target.stats = m, eval.loglik = FALSE, control = control$ergm)
Monte Carlo Maximum Likelihood Results:
Estimate Std. Error MCMC % z value Pr(>|z|)
netsize.adj -5.32301 0.00000 0 -Inf < 1e-04 ***
edges -1.39738 0.21432 0 -6.520 < 1e-04 ***
degree0 2.10700 0.36592 0 5.758 < 1e-04 ***
degree1 1.00960 0.29695 0 3.400 0.000674 ***
nodefactor.Sex.M -0.17765 0.06240 0 -2.847 0.004411 **
nodefactor.Race.Black 1.21477 0.20316 0 5.979 < 1e-04 ***
nodefactor.Race.NatAm 0.30945 0.06253 0 4.949 < 1e-04 ***
nodefactor.Race.Other -0.91183 0.69912 0 -1.304 0.192149
nodefactor.Race.White 0.57907 0.13165 0 4.399 < 1e-04 ***
nodefactor.Grade.8 0.13722 0.05514 0 2.489 0.012820 *
nodefactor.Grade.9 0.13959 0.05024 0 2.778 0.005467 **
nodefactor.Grade.10 0.31255 0.07529 0 4.151 < 1e-04 ***
nodefactor.Grade.11 0.40604 0.06412 0 6.333 < 1e-04 ***
nodefactor.Grade.12 0.77650 0.07451 0 10.422 < 1e-04 ***
nodematch.Sex 0.64757 0.12803 0 5.058 < 1e-04 ***
nodematch.Race 0.85331 0.14206 0 6.007 < 1e-04 ***
nodematch.Grade 3.06429 0.16358 0 18.733 < 1e-04 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The following terms are fixed by offset and are not estimated:
netsize.adj
mcmc.diagnostics(fit.full, which = "plots")
plot(gof(fit.full, GOF="model"))
plot(gof(fit.full, GOF="degree"))
In general the model diagnostics look good. If this were a genuine sample of 205 students from a larger school, we could infer the following:
there are many more isolates, and more degree 1 nodes than expected by chance;
there are significant differences in mean degree by race, with the largest group (Hispanics, the reference category) nominating fewer friends than most of the other groups;
7th graders nominate fewer friends than all other grades;
there are strong and significant homophily effects, for all three attributes.
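To put the homophily estimates on a more interpretable scale, the coefficients can be exponentiated. The sketch below copies the three nodematch estimates from the fit.full summary and converts them to conditional odds ratios, the standard reading of ERGM coefficients: the change in the odds of a tie when the statistic increases by one, holding the rest of the network fixed.

```r
# Estimates copied from the fit.full summary output.
homophily <- c(nodematch.Sex   = 0.64757,
               nodematch.Race  = 0.85331,
               nodematch.Grade = 3.06429)

# exp(coefficient): conditional odds ratio of a within-group tie relative to
# a cross-group tie, holding the rest of the network fixed.
round(exp(homophily), 2)
```

On this scale, grade homophily stands out as far stronger than homophily on sex or race.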
It is possible to simulate complete networks from this ergm.ego fit object – just as we would from an ergm fit object:
sim.full <- simulate(fit.full)
summary(mesa.ego ~ edges + degree(0:1)
+ nodefactor("Sex")
+ nodefactor("Race", levels = -LARGEST)
+ nodefactor("Grade")
+ nodematch("Sex") + nodematch("Race") + nodematch("Grade"))
scaled mean SE
edges 203 15.2022
degree0 57 6.4306
degree1 51 6.2048
nodefactor.Sex.M 171 17.1990
nodefactor.Race.Black 26 6.5507
nodefactor.Race.NatAm 156 19.7787
nodefactor.Race.Other 1 0.7054
nodefactor.Race.White 45 9.1943
nodefactor.Grade.8 75 17.3212
nodefactor.Grade.9 65 11.2475
nodefactor.Grade.10 36 8.0931
nodefactor.Grade.11 49 11.4861
nodefactor.Grade.12 28 7.2756
nodematch.Sex 132 12.1128
nodematch.Race 103 10.0369
nodematch.Grade 163 13.6309
summary(sim.full ~ edges + degree(0:1)
+ nodefactor("Sex")
+ nodefactor("Race", levels = -LARGEST)
+ nodefactor("Grade")
+ nodematch("Sex") + nodematch("Race") + nodematch("Grade"))
edges degree0 degree1
245 43 46
nodefactor.Sex.M nodefactor.Race.Black nodefactor.Race.NatAm
173 30 192
nodefactor.Race.Other nodefactor.Race.White nodefactor.Grade.8
1 57 79
nodefactor.Grade.9 nodefactor.Grade.10 nodefactor.Grade.11
93 34 50
nodefactor.Grade.12 nodematch.Sex nodematch.Race
52 168 121
nodematch.Grade
195
plot(sim.full, vertex.col="Grade")
legend('bottomleft',fill=7:12,legend=paste('Grade',7:12),cex=0.75)
(Note that we have implicitly used simulate already – it’s the basis of the GOF results.)
We can use network size invariance to simulate networks of a different size, albeit one has to be careful:
sim.full2 <- simulate(fit.full, popsize=network.size(mesa)*2)
summary(mesa~edges + degree(0:1)
+ nodefactor("Sex")
+ nodefactor("Race", levels = -LARGEST)
+ nodefactor("Grade")
+ nodematch("Sex") + nodematch("Race") + nodematch("Grade"))*2
edges degree0 degree1
406 114 102
nodefactor.Sex.M nodefactor.Race.Black nodefactor.Race.NatAm
342 52 312
nodefactor.Race.Other nodefactor.Race.White nodefactor.Grade.8
2 90 150
nodefactor.Grade.9 nodefactor.Grade.10 nodefactor.Grade.11
130 72 98
nodefactor.Grade.12 nodematch.Sex nodematch.Race
56 264 206
nodematch.Grade
326
summary(sim.full2~edges + degree(0:1)
+ nodefactor("Sex")
+ nodefactor("Race", levels = -LARGEST)
+ nodefactor("Grade")
+ nodematch("Sex") + nodematch("Race") + nodematch("Grade"))
edges degree0 degree1
475 98 104
nodefactor.Sex.M nodefactor.Race.Black nodefactor.Race.NatAm
380 68 381
nodefactor.Race.Other nodefactor.Race.White nodefactor.Grade.8
2 91 149
nodefactor.Grade.9 nodefactor.Grade.10 nodefactor.Grade.11
190 64 90
nodefactor.Grade.12 nodematch.Sex nodematch.Race
78 325 254
nodematch.Grade
380
We only demonstrate the functionality briefly here, but this kind of simulation is a powerful way to diagnose structural properties of the fitted model, and to identify and remedy systematic lack of fit.
We will leave this model here and go on to explore how the idea of sampling uncertainty is being used to produce the standard errors for our coefficients.
When we estimate parameters based on sampled data, the sampling uncertainty in our estimates comes from the differences in the observations we draw from sample to sample, and the magnitude of uncertainty is a function of our sample size. This is why we typically see something like \(\sqrt{n}\) in the denominator of the standard error of a sample mean or sample proportion. The same principle holds in the context of egocentric network sampling: the standard errors will depend on the number of egos sampled.
This is true despite the fact that we are rescaling first to pseudo-population size, then back down to per capita values. Neither of these influences the estimates of the standard errors – those are influenced only by the size of the egocentric sample.
So let’s use the sample function from ergm.ego to demonstrate this effect. For this section we will use the larger built-in network, faux.magnolia.high.
data(faux.magnolia.high)
faux.magnolia.high -> fmh
N <- network.size(fmh)
Let’s start by fitting an ERGM to the complete network, and looking at the coefficients:
fit.ergm <- ergm(fmh ~ degree(0:3)
+ nodefactor("Race", levels=TRUE) + nodematch("Race")
+ nodefactor("Sex") + nodematch("Sex")
+ absdiff("Grade"))
round(coef(fit.ergm), 3)
degree0 degree1 degree2
0.935 0.257 0.029
degree3 nodefactor.Race.Asian nodefactor.Race.Black
-0.245 -2.479 -3.045
nodefactor.Race.Hisp nodefactor.Race.NatAm nodefactor.Race.Other
-2.701 -2.279 -2.623
nodefactor.Race.White nodematch.Race nodefactor.Sex.M
-3.387 1.678 -0.089
nodematch.Sex absdiff.Grade
0.857 -2.112
Now, suppose we only observe an egocentric view of the data – as an egocentric census. With an egocentric census, it’s as though we give a survey to all of the students. Each student nominates her friends, but does not report the name of the friend, she only reports their sex, race and grade. How does the fit from ergm.ego to this egocentric census compare to the complete-network ergm estimates?
fmh.ego <- as.egor(fmh)
head(fmh.ego)
# EGO data (active): 3 x 5
.egoID Grade Race Sex vertex.names
<int> <dbl> <chr> <chr> <chr>
1 1 9 Black F 1
2 2 10 Black M 2
3 3 12 Black F 3
# ALTER data: 6 x 6
.altID .egoID Grade Race Sex vertex.names
<int> <int> <dbl> <chr> <chr> <chr>
1 669 1 9 Black F 669
2 963 2 10 White F 963
3 912 2 10 White M 912
# AATIE data: 0 x 3
egofit <- ergm.ego(fmh.ego ~ degree(0:3)
+ nodefactor("Race", levels=TRUE) + nodematch("Race")
+ nodefactor("Sex") + nodematch("Sex")
+ absdiff("Grade"), popsize=N,
control=control.ergm.ego(ppopsize=N))
# A convenience function.
model.se <- function(fit) sqrt(diag(vcov(fit)))
# Parameters recovered:
coef.compare <- data.frame(
"NW est" = coef(fit.ergm),
"Ego Cen est" = coef(egofit)[-1],
"diff Z" = (coef(fit.ergm)-coef(egofit)[-1])/model.se(egofit)[-1])
round(coef.compare, 3)
NW.est Ego.Cen.est diff.Z
degree0 0.935 0.941 -0.013
degree1 0.257 0.262 -0.013
degree2 0.029 0.033 -0.015
degree3 -0.245 -0.243 -0.011
nodefactor.Race.Asian -2.479 -2.481 0.019
nodefactor.Race.Black -3.045 -3.047 0.022
nodefactor.Race.Hisp -2.701 -2.703 0.016
nodefactor.Race.NatAm -2.279 -2.295 0.114
nodefactor.Race.Other -2.623 -2.670 0.169
nodefactor.Race.White -3.387 -3.384 -0.034
nodematch.Race 1.678 1.677 0.021
nodefactor.Sex.M -0.089 -0.089 0.020
nodematch.Sex 0.857 0.856 0.013
absdiff.Grade -2.112 -2.113 0.013
Again, we can diagnose the fitted egocentric model for proper convergence.
# MCMC diagnostics.
mcmc.diagnostics(egofit, which="plots")
And check whether the model converged to the right statistics:
plot(gof(egofit, GOF="model"))
Now let’s check whether the fitted model can be used to reconstruct the degree distribution.
plot(gof(egofit, GOF="degree"))
What if we only had an equally large sample, instead of an egocentric census? Here, we sample N students with replacement.
set.seed(1)
fmh.egosampN <- sample(fmh.ego, N, replace=TRUE)
Warning in `[.egor`(x, is, , unit = "ego"): Some ego indices have been selected
multiple times. They will be duplicated, and '.egoID's renumbered to preserve
uniqueness.
egofitN <- ergm.ego(fmh.egosampN ~ degree(0:3)
+ nodefactor("Race", levels=TRUE) + nodematch("Race")
+ nodefactor("Sex") + nodematch("Sex")
+ absdiff("Grade"),
popsize=N)
# compare the coef
coef.compare <- data.frame(
"NW est" = coef(fit.ergm),
"Ego SampN est" = coef(egofitN)[-1],
"diff Z" = (coef(fit.ergm)-coef(egofitN)[-1])/model.se(egofitN)[-1])
round(coef.compare, 3)
NW.est Ego.SampN.est diff.Z
degree0 0.935 1.397 -0.957
degree1 0.257 0.524 -0.693
degree2 0.029 0.368 -1.155
degree3 -0.245 -0.021 -1.076
nodefactor.Race.Asian -2.479 -2.405 -0.476
nodefactor.Race.Black -3.045 -2.911 -1.168
nodefactor.Race.Hisp -2.701 -2.529 -1.349
nodefactor.Race.NatAm -2.279 -2.100 -1.310
nodefactor.Race.Other -2.623 -2.620 -0.009
nodefactor.Race.White -3.387 -3.271 -1.098
nodematch.Race 1.678 1.609 0.919
nodefactor.Sex.M -0.089 -0.139 1.770
nodematch.Sex 0.857 0.880 -0.423
absdiff.Grade -2.112 -2.021 -1.415
# compare the s.e.'s
se.compare <- data.frame(
"NW SE" = model.se(fit.ergm),
"Ego census SE" =model.se(egofit)[-1],
"Ego SampN SE" = model.se(egofitN)[-1])
round(se.compare, 3)
NW.SE Ego.census.SE Ego.SampN.SE
degree0 0.458 0.455 0.483
degree1 0.366 0.366 0.385
degree2 0.274 0.273 0.294
degree3 0.202 0.207 0.208
nodefactor.Race.Asian 0.149 0.135 0.155
nodefactor.Race.Black 0.116 0.111 0.115
nodefactor.Race.Hisp 0.146 0.127 0.127
nodefactor.Race.NatAm 0.159 0.142 0.136
nodefactor.Race.Other 0.401 0.283 0.329
nodefactor.Race.White 0.110 0.105 0.106
nodematch.Race 0.102 0.078 0.075
nodefactor.Sex.M 0.033 0.030 0.029
nodematch.Sex 0.070 0.052 0.056
absdiff.Grade 0.071 0.072 0.064
What if we have a smaller sample? If we have a sample of \(N/4\approx 365\) students, how will our standard errors be affected?
set.seed(0) # Some samples have different sets of alter levels from ego levels.
fmh.egosampN4 <- sample(fmh.ego, round(N/4), replace=TRUE)
Warning in `[.egor`(x, is, , unit = "ego"): Some ego indices have been selected
multiple times. They will be duplicated, and '.egoID's renumbered to preserve
uniqueness.
egofitN4 <- ergm.ego(fmh.egosampN4 ~ degree(0:3)
+ nodefactor("Race", levels=TRUE) + nodematch("Race")
+ nodefactor("Sex") + nodematch("Sex")
+ absdiff("Grade"),
popsize=N)
# compare the coef
coef.compare <- data.frame(
"NW est" = coef(fit.ergm),
"Ego SampN4 est" = coef(egofitN4)[-1],
"diff Z" = (coef(fit.ergm)-coef(egofitN4)[-1])/model.se(egofitN4)[-1])
round(coef.compare, 3)
NW.est Ego.SampN4.est diff.Z
degree0 0.935 0.510 0.443
degree1 0.257 -0.259 0.675
degree2 0.029 -0.053 0.145
degree3 -0.245 -0.373 0.319
nodefactor.Race.Asian -2.479 -2.204 -1.081
nodefactor.Race.Black -3.045 -2.990 -0.230
nodefactor.Race.Hisp -2.701 -2.813 0.453
nodefactor.Race.NatAm -2.279 -2.498 0.847
nodefactor.Race.Other -2.623 -2.432 -0.429
nodefactor.Race.White -3.387 -3.435 0.206
nodematch.Race 1.678 1.581 0.662
nodefactor.Sex.M -0.089 -0.191 1.658
nodematch.Sex 0.857 0.887 -0.296
absdiff.Grade -2.112 -2.079 -0.243
# compare the s.e.'s
se.compare <- data.frame(
"NW SE" = model.se(fit.ergm),
"Ego census SE" =model.se(egofit)[-1],
"Ego SampN SE" = model.se(egofitN)[-1],
"Ego Samp4 SE" = model.se(egofitN4)[-1])
round(se.compare, 3)
NW.SE Ego.census.SE Ego.SampN.SE Ego.Samp4.SE
degree0 0.458 0.455 0.483 0.961
degree1 0.366 0.366 0.385 0.766
degree2 0.274 0.273 0.294 0.564
degree3 0.202 0.207 0.208 0.401
nodefactor.Race.Asian 0.149 0.135 0.155 0.254
nodefactor.Race.Black 0.116 0.111 0.115 0.237
nodefactor.Race.Hisp 0.146 0.127 0.127 0.246
nodefactor.Race.NatAm 0.159 0.142 0.136 0.259
nodefactor.Race.Other 0.401 0.283 0.329 0.445
nodefactor.Race.White 0.110 0.105 0.106 0.231
nodematch.Race 0.102 0.078 0.075 0.148
nodefactor.Sex.M 0.033 0.030 0.029 0.062
nodematch.Sex 0.070 0.052 0.056 0.104
absdiff.Grade 0.071 0.072 0.064 0.136
As with ordinary statistics, the standard error is inversely proportional to the square root of the sample size: cutting the sample to a quarter of its size roughly doubles the standard errors.
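The square-root relationship can be checked against the last table. The sketch below copies the two sample-based standard error columns (Ego.SampN.SE and Ego.Samp4.SE) and compares their ratio with \(\sqrt{4} = 2\), the value expected when the sample size drops from \(N\) to \(N/4\). The numbers come from a single pair of random samples, so individual ratios vary.

```r
# Standard errors copied from the se.compare table (same row order).
se_sampN  <- c(0.483, 0.385, 0.294, 0.208, 0.155, 0.115, 0.127,
               0.136, 0.329, 0.106, 0.075, 0.029, 0.056, 0.064)
se_sampN4 <- c(0.961, 0.766, 0.564, 0.401, 0.254, 0.237, 0.246,
               0.259, 0.445, 0.231, 0.148, 0.062, 0.104, 0.136)

# Ratio of SEs at sample size N/4 versus N; theory suggests sqrt(4) = 2.
ratio <- se_sampN4 / se_sampN
round(c(median_ratio = median(ratio), expected = sqrt(4)), 2)
```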
The ergm.ego package is under active development on GitHub at statnet/ergm.ego. This repository is the place to go to report bugs or request features (feature requests accompanied by a pull request are especially appreciated). If you are interested in contributing to the development of ergm.ego, please contact us through the GitHub interface.
Additional functionality is planned in the near future:
Support for directed relations.
Support for automatic fitting of tergms.
Krivitsky, P. N., & Morris, M. (2017). Inference for social network models from egocentrically sampled data, with application to understanding persistent racial disparities in HIV prevalence in the US. Annals of Applied Statistics, 11(1), 427-455.
Morris, M., Kurth, A. E., Hamilton, D. T., Moody, J., & Wakefield, S. (2009). Concurrent partnerships and HIV prevalence disparities by race: linking science and public health practice. American Journal of Public Health, 99(6), 1023-1031.
Motivation: Analyzing racial disparities in HIV in the US
The work on ergm.ego was originally motivated by a specific question in the field of HIV epidemiology—Does network structure help explain the persistent racial disparities in HIV prevalence in the United States?
An African American today is 10 times more likely than a white American to be living with HIV/AIDS. The disparity begins early in life, persists through to old age, and is evident among all risk groups: heterosexuals, men who have sex with men (MSM), and injection drug users. The disproportionate risks faced by heterosexual African-American women are especially steep. In 2010, an African-American woman was over 40 times more likely to be diagnosed with HIV than a heterosexual white man (Figure 1).
Figure 1
Empirical studies repeatedly find that these disparities cannot be explained by individual behavior, or biological differences.
A growing body of work is therefore focused on the role of the underlying transmission network. This network can channel the spread of infection in the same way that a transportation network channels the flow of traffic, with emergent patterns that reflect the connectivity of the system, rather than the behavior of any particular element.
Descriptive analyses and simulation studies have focused attention on two structural features: homophily and concurrency. Homophily is the strong propensity for within-group partner selection. Concurrency is non-monogamy—having partners that overlap in time–which increases network connectivity by allowing for the emergence of stable network connected components larger than dyads (pairs of individuals).
The hypothesis is that these two network properties together can produce the sustained HIV/STI prevalence differentials we observe: differences in concurrency between groups are the mechanism that generates the prevalence disparity, while homophily is the mechanism that sustains it.
We will never observe the complete dynamic sexual network that transmits HIV. But ergm.ego allows us to test the network hypothesis with egocentrically sampled data–and we will demonstrate that here using data collected by the National Health and Social Life Survey from 1994. The analysis comes from a recent paper (Krivitsky and Morris, 2017).
First, ergm.ego allows us to assess whether empirical patterns of homophily and concurrency are in the predicted directions and statistically significant. We do this in the usual way – comparing sequential model fits with terms that represent the hypotheses of interest, and t-tests for their coefficients. We will discuss these terms in more detail in later sections, but here we test the concurrency effects with “monogamy bias” terms.
Table 1
Result: Yes, the homophily and concurrency effects are in the predicted directions and statistically significant.
Next, we can assess the goodness of fit of each model in the way we usually do in ERGMs, by checking whether the models reproduce observed network properties that are not in the model. We do this here by simulating from each model and comparing the fits to the full observed degree distribution:
Figure 2
Result: Only Model 3 (with both hypothesized network effects) is able to reproduce the observed degree distribution.
The ability to simulate complete networks from the model, however, allows us to do much more–we can now examine the connectivity in the overall network that each of these models would generate. For example, we can examine the component size distributions under each model:
Figure 3
Result: Model 3, with its “monogamy bias”, dramatically reduces the right skew of the component size distribution, and places most people in components of size 2, or 3 if they are in a larger component.
Finally, we can define a measure of “network exposure” that represents the signature feature of a network effect: indirect exposure to HIV via a partner’s behavior, rather than direct exposure via one’s own behavior. One metric for network exposure is the probability of being in a component of size 3 or more. Because this is a node-level metric, we can break it down by race and sex for each of the three models:
Figure 4
Result: Only Model 3 produces a pattern of network exposure that is consistent with the observed disparities in HIV incidence.
ergm.ego provides a powerful analytic framework that uses extremely limited network data and testable models to investigate the unobservable patterns of complete network connectivity that are consistent with the sampled data.
The principles of egocentric inference can be extended to temporal ERGMs (TERGMs). While we will not cover that in this workshop, an example can be found in another paper that sought to evaluate the network hypothesis for racial disparities in HIV in the US (Morris et al. 2009). In that paper, egocentric data from the National Longitudinal Survey of Adolescent Health (AddHealth) was analyzed, and an example of the resulting dynamic complete network simulation (on 10,000 nodes) can be found in this “network movie”.
The movie below is another, simpler example – an epidemic spreading on a small dynamic contact network that is simulated with a STERGM estimated from egocentrically sampled network data. The movie was produced by the R packages EpiModel and ndtv, which are based on the statnet tools (and are also available on CRAN).
We’ll need some notation for this (sorry, and a warning that it will get hairier).
| Parameter | Meaning |
|---|---|
| \(N\) | the population being studied: a very large, but finite, set of actors whose relations are of interest |
| \(x _ i\) | attribute (e.g., age, sex, race) vector of actor \(i \in N\) |
| \(x_N\) | (or just \(x\), when there is no ambiguity) the attributes of actors in \(N\) |
| \(\mathbb{Y}(N)\) | the set of dyads (potential ties) in an undirected network of actors in \(N\) |
| \(y\subseteq \mathbb{Y}(N)\) | the population network: a fixed but unknown network (a set of relationships of interest). In particular, |
| \(y_{ij}\) | an indicator function of whether a tie between \(i\) and \(j\) is present in \(y\) |
| \(y _ i=\{j\in N: y _ {ij}=1\}\) | the set of \(i\)’s network neighbors. |
Egocentric sampling adds the following notation for what is observed:
| Parameter | Meaning |
|---|---|
| \(e_{N}\) | the egocentric census, the information retained by the minimal egocentric sampling design when all nodes are sampled |
| \(S\subseteq N\) | the set of egos in a sample |
| \(e_{S}\) | the data contained in an egocentric sample |
| \(e_i\) | the “egocentric” view of network \(y\) from the point of view of actor \(i\) (“ego”), with the following parts: |
| \(e^e_i \equiv x_i\) | \(i\)’s own attributes |
| \(e^a_i \equiv (x_{j})_{j\in y_i}\) | an unordered list of attribute vectors of \(i\)’s immediate neighbors (“alters”), but not their identities (indices in \(N\)) |
| \(e^e_{i,k}\equiv x_{i,k}\) | the \(k\)th attribute/covariate observed on ego \(i\) |
| \(e^a_{i,k}\equiv( x_{j,k})_{j\in y_i}\) | the \(k\)th attribute/covariate observed on each of ego \(i\)’s alters |
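To make the notation concrete, the following base-R sketch builds the egocentric views \(e_i\) from a small known network. Everything here (the five actors, the `age` attribute, and the helper names `neighbors()` and `ego_view()`) is hypothetical and chosen only for illustration; it is not how ergm.ego or egor store egocentric data.

```r
# Hypothetical population N of 5 actors with one attribute each (age).
x <- data.frame(id = 1:5, age = c(20, 25, 30, 35, 40))
# Population network y as an undirected edge list.
y <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5))

# y_i: the set of i's network neighbors.
neighbors <- function(i, y) {
  c(y[y[, 1] == i, 2], y[y[, 2] == i, 1])
}

# e_i: ego's own attributes plus the attributes of the alters.
# Alter identities are dropped and the values sorted, so only an
# unordered list of alter attributes is retained.
ego_view <- function(i, y, x) {
  alters <- neighbors(i, y)
  list(ego = x$age[x$id == i], alters = sort(x$age[x$id %in% alters]))
}

# Egocentric census e_N: one view per actor in N.
e_N <- lapply(x$id, function(i) ego_view(i, y, x))

# Egocentric sample e_S: the views of a sampled subset S of egos.
S <- c(1, 4)
e_S <- e_N[S]
str(e_S)
```

Note that `e_S` no longer contains any actor indices for the alters, which is the sense in which the egocentric view retains attributes but not identities.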
We call a network statistic \(g_{k}(\cdot,\cdot)\) egocentric if it can be expressed as \[ g_{k}(y,x)\equiv \textstyle\sum_{i\in N} h_{k}(e_i) \] for some function \(h_{k}(\cdot)\) of egocentric information associated with a single actor.
The space of egocentric statistics includes dyadic-independent statistics that can be expressed in the general form of \[ g_{k}(y,x)=\sum_{(i,j)\in y} f_k(x_i,x_j) \] for some symmetric function \(f_k(\cdot,\cdot)\) of two actors’ attributes; and some dyadic-dependent statistics that can be expressed as \[ g_{k}(y,x)=\sum_{i\in N} f_k\big\{x_{i},(x_j)_{j\in y_i}\big\} \] for some function \(f_k(\cdot,\dotsb)\) of the attributes of an actor and their network neighbors.
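As a short worked step, using only the notation defined above, here is why a sum over ties fits the egocentric definition. Each undirected tie between \(i\) and \(j\) appears in two egocentric views, once among \(i\)’s alters and once among \(j\)’s, and \(f_k\) is symmetric, so summing over egos counts every tie exactly twice: \[ \sum_{i\in N}\sum_{j\in y_i} f_k(x_i,x_j) = 2\sum_{(i,j)\in y} f_k(x_i,x_j). \] Dividing by two gives \[ g_{k}(y,x)=\sum_{(i,j)\in y} f_k(x_i,x_j)=\sum_{i\in N} \frac{1}{2}\sum_{j\in y_i} f_k(x_i,x_j), \] so one valid choice is \(h_{k}(e_i)=\frac{1}{2}\sum_{j\in y_i} f_k(x_i,x_j)\), which depends only on ego \(i\)’s attributes and the attributes of its alters. For instance, taking \(f_k\equiv 1\) gives the number of ties, with \(h_{k}(e_i)=\frac{1}{2}\lvert y_i\rvert\), half of ego \(i\)’s degree.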
The statistics that are identifiable in an egocentric sample depend on the specific egocentric study design.
The table below (from Krivitsky & Morris 2017) shows some examples of egocentric statistics and gives their representations in terms of \(h_{k}(\cdot)\).
| Statistic | \(g_{k}( y,x)\) | \(h _ {k}(e_i)\) |
|---|---|---|
| General sum over ties | \(\sum _ {(i,j)\in y} f _ k(x _ i,x _ j)\) | \(\frac{1}{2}\sum _ {j'\in e^\text{a} _ i} f _ k\big(e^\text{e}_i,e^\text{a}_{i,j'}\big)\) |
| Number of ties in the network | \(\lvert y \rvert\equiv \sum _ {(i,j) \in y} 1\) | \(\frac{1}{2}\lvert e^\text{a}_{i}\rvert\) |
| Number of ties weighted by actor covariate \(x _ {i,k}\) | \(\sum _ {(i,j) \in y} (x _ {i,k}+x _ {j,k})\) | \(\frac{1}{2} \big(e^\text{e}_{i,k} \lvert e^\text{a}_{i}\rvert + \sum _ {j'\in e^\text{a} _ i} e^\text{a}_{i,j',k} \big)\) |
| Number of ties weighted by difference in \(x _ {i,k}\) | \(\sum _ {(i,j) \in y} \lvert x _ {i,k}-x _ {j,k}\rvert\) | \(\frac{1}{2}\sum _ {j'\in e^\text{a} _ i} \lvert e^\text{e}_{i,k}-e^\text{a}_{i,j',k}\rvert\) |
| Number of ties within groups identified by \(x _ {i,k}\) | \(\sum _ {(i,j) \in y} 1_{x _ {i,k}=x _ {j,k}}\) | \(\frac{1}{2}\sum _ {j'\in e^\text{a} _ i} 1_{ e^\text{e}_{i,k}= e^\text{a}_{i,j',k}}\) |
| General sum over actors | \(\sum _ {i\in N} f _ k\big\{x _ {i},(x _ j) _ {j\in y_{i}}\big\}\) | \(f _ k\big(e^\text{e}_i,e^\text{a}_{i}\big)\) |
| Number of actors with \(d\) neighbors | \(\sum _ {i\in N} 1_{\lvert y_{i}\rvert=d}\) | \(1_{\lvert e^\text{a}_{i}\rvert=d}\) |
| Number of actors with \(d\) neighbors, weighted by actor covariate \(x _ {i,k}\) | \(\sum _ {i\in N} x _ {i,k} 1_{\lvert y_{i}\rvert=d}\) | \(e^\text{e}_{i,k}1_{\lvert e^\text{a}_{i}\rvert=d}\) |
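The identities in the table can be checked numerically on a small example. The base-R sketch below assumes a made-up population of five actors with hypothetical attributes `age` and `grp`, computes five of the statistics directly from the complete network, and compares them with the sums of the per-ego contributions \(h_k(e_i)\). The labels `edges`, `nodecov`, `absdiff`, `nodematch`, and `degree2` are just names chosen for this sketch (echoing familiar ERGM term names); no ergm or ergm.ego functions are called.

```r
# Hypothetical toy population: 5 actors, a numeric and a categorical attribute.
age <- c(20, 25, 30, 35, 40)
grp <- c("a", "a", "b", "b", "b")
y <- rbind(c(1, 2), c(1, 3), c(2, 3), c(4, 5))  # undirected edge list
n <- length(age)

# Indices of i's neighbors (used only to look up alter attributes).
nbrs <- function(i) c(y[y[, 1] == i, 2], y[y[, 2] == i, 1])

# Complete-network statistics g_k(y, x), computed tie by tie or actor by actor.
g <- c(
  edges     = nrow(y),
  nodecov   = sum(age[y[, 1]] + age[y[, 2]]),
  absdiff   = sum(abs(age[y[, 1]] - age[y[, 2]])),
  nodematch = sum(grp[y[, 1]] == grp[y[, 2]]),
  degree2   = sum(tabulate(c(y), nbins = n) == 2)
)

# Per-ego contributions h_k(e_i), using only ego and alter attributes.
h <- function(i) {
  a <- nbrs(i)
  c(
    edges     = length(a) / 2,
    nodecov   = (age[i] * length(a) + sum(age[a])) / 2,
    absdiff   = sum(abs(age[i] - age[a])) / 2,
    nodematch = sum(grp[i] == grp[a]) / 2,
    degree2   = as.numeric(length(a) == 2)
  )
}

# Summing h_k over the egocentric census recovers g_k.
h_total <- rowSums(sapply(seq_len(n), h))
print(rbind(g = g, sum_h = h_total))
stopifnot(isTRUE(all.equal(g, h_total)))
```

The tie-based rows need the factor of one half because each tie is seen from both of its endpoints, while the actor-based degree row does not.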
This does not mean that the mean degree itself cannot be estimated from egocentric data, only that our inferential results might not apply.