Network Data Analysis and Visualization

Introduction and Data Preparation

The network data sets selected for this project comprise four simulated social networks, each of which is based on actual data collected for the AddHealth Study, Wave I (Resnick, Bearman, Blum, et al., 1997). The data sets are included in the R ergm package under the names faux.desert.high (referred to as Desert High in this report), faux.dixon.high (Dixon High), faux.magnolia.high (Magnolia High) and faux.mesa.high (Mesa High) (Handcock, Hunter, Butts, Goodreau, & Morris, 2003). The vertices in each network represent students and carry three attributes: grade, sex, and race. The network edges represent friendship nominations between students, in the form of directed ties from the nominating student to the student nominated.

Network analysis began by loading each data set and formatting it as an igraph object. Preliminary inspection showed no missing or irregular data in any of the data sets, so major data cleaning was unnecessary. The network edges in the original data set were directed, but were collapsed to undirected edges to allow processing by community detection algorithms (see below). In addition, attribute names were standardized across all four data sets, including the deletion of the scode attribute, which was irrelevant to the analysis.
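The collapsing step can be illustrated with a minimal Python sketch. The analysis in this report was carried out in R with igraph, so this is only an illustration of the idea, not the code that was used. It assumes that a nomination in either direction is enough to create an undirected tie and that self-nominations are dropped; the function name and the toy nominations are hypothetical.

```python
def collapse_to_undirected(directed_edges):
    """Collapse directed (source, target) pairs into unique undirected edges.

    Assumption: a tie in either direction counts as one undirected edge,
    and self-loops are discarded.
    """
    undirected = set()
    for source, target in directed_edges:
        if source == target:
            continue  # ignore self-nominations
        # frozenset makes (a, b) and (b, a) the same edge
        undirected.add(frozenset((source, target)))
    return sorted(tuple(sorted(edge)) for edge in undirected)


# Hypothetical nominations: a and b nominate each other, a nominates c, d nominates c
nominations = [("a", "b"), ("b", "a"), ("a", "c"), ("d", "c")]
friendships = collapse_to_undirected(nominations)
print(friendships)  # [('a', 'b'), ('a', 'c'), ('c', 'd')]
```

The mutual nomination between a and b becomes a single edge, which is why the undirected edge count can be lower than the number of directed nominations.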

Visualization and Network Summary

Assuming a network data set is not too large or complex, visualizations of the network can give an overview of network characteristics such as graph density, size, and structure. Differences in characteristics between networks can be visualized in side-by-side comparison plots as in Figure 1. These plots show each of the “faux” high school networks along with the network density, which measures the proportion of observed connections compared to the maximum possible number of connections and can range from 0 (less dense) to 1 (more dense). Desert High has the highest network density, reflected in the relatively high number of connections between students and the small number of isolated nodes. This contrasts with Magnolia High, the network with the lowest density of the four schools. The plot of the Magnolia High network shows the large number of nodes and the relative sparsity of connections between them which together produce the low value of graph density.

Figure 1

The colors of the nodes in each plot of Figure 1 indicate the value of the grade attribute, giving a preliminary sense of how matches and mismatches in grade level affect the probability of a connection between students. For example, all the plots show that students in grades 7, 8, and 9 each tend to have more connections with students in their own grade and fewer connections with students in other grades. It also appears that this effect diminishes as students advance in grade level, to the point where connections between students in grades 11 and 12 are highly mixed.
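One way to put a number on this visual impression is to compute, for each grade, the share of ties touching that grade that stay within it. The Python sketch below is illustrative only: the student identifiers, grades, and edges are made up, and the function name is hypothetical. It is not a substitute for the network modeling discussed later.

```python
from collections import Counter


def same_grade_share_by_grade(edges, grade):
    """For each grade, return the fraction of edges touching that grade
    whose two endpoints are both in that grade."""
    touching = Counter()  # edges with at least one endpoint in the grade
    within = Counter()    # edges with both endpoints in the grade
    for u, v in edges:
        gu, gv = grade[u], grade[v]
        for g in {gu, gv}:
            touching[g] += 1
        if gu == gv:
            within[gu] += 1
    return {g: within[g] / touching[g] for g in sorted(touching)}


# Hypothetical toy network: a tight grade-7 cluster and mixed upper grades
grade = {"s1": 7, "s2": 7, "s3": 7, "s4": 11, "s5": 12, "s6": 12}
edges = [("s1", "s2"), ("s2", "s3"), ("s1", "s3"),
         ("s4", "s5"), ("s4", "s6"), ("s5", "s6"), ("s3", "s4")]

shares = same_grade_share_by_grade(edges, grade)
for g, share in shares.items():
    print(f"grade {g}: {share:.2f} of ties are within-grade")
```

In this toy data the grade-7 share is 0.75 while the upper grades are far more mixed, which is the kind of pattern the plots suggest.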

The Desert High plot in Figure 1 is reproduced in the upper-left panel of Figure 2, with node color again coded by grade attribute. The other plots in Figure 2 show the same Desert High network, with node color determined by sex (upper-right panel) and race (lower-left panel). Finally, a combined plot in the lower-right panel shows grade by color, sex by shape, and race by node label. These plots give additional evidence for the formation of preliminary hypotheses that will be tested in the network modeling phase. As already noted, grade level matching and mismatching both appear to influence the probability of forming a connection between students. In contrast, the plot of the sex attribute does not show any strong patterns, indicating a weaker relationship between sex and connection probability. Finally, the main takeaway from the plot of the race attribute is that most of the Desert High nodes are in the “White” category. With so few nodes in other categories, any effect of the race attribute on connection probability may not be detectable.

Figure 2

As described by Luke (2015, pp. 11-16), a set of five measurements can be used to summarize the characteristics of a network, adding numerical data to information derived from observation of the network plots. Table 1 lists the ‘five-number summary’ values for each of the ‘faux’ high school networks in the following categories:

“Size”: a count of the number of nodes;
“Edge count”: a count of the number of edges, not officially part of Luke’s ‘five-number summary’, but interesting nonetheless;
“Components”: number of subgroups in which all actors are connected, directly or indirectly;
“Diameter”: the longest of the shortest paths across all pairs of nodes for the largest network component;
“Density”: the proportion of observed connections compared to the maximum possible number of connections; and
“Transitivity”: the proportion of closed triads compared to the total number of open and closed triads.
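The definitions of components, diameter, and transitivity given above can be sketched in plain Python as follows. This is a minimal illustration on a made-up undirected graph, not the igraph code used for Table 1; all function names and the toy data are hypothetical.

```python
from collections import deque
from itertools import combinations


def build_adjacency(nodes, edges):
    """Undirected adjacency sets; isolated nodes get an empty set."""
    adj = {n: set() for n in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nb in adj[node]:
            if nb not in dist:
                dist[nb] = dist[node] + 1
                queue.append(nb)
    return dist


def connected_components(adj):
    """List of node sets, one per component (isolates count as components)."""
    seen, components = set(), []
    for start in adj:
        if start not in seen:
            component = set(bfs_distances(adj, start))
            seen |= component
            components.append(component)
    return components


def diameter_of_largest_component(adj):
    """Longest shortest path among nodes of the largest component."""
    largest = max(connected_components(adj), key=len)
    return max(max(bfs_distances(adj, n).values()) for n in largest)


def transitivity(adj):
    """Closed triads divided by all (open and closed) connected triads."""
    closed = triads = 0
    for neighbors in adj.values():
        for a, b in combinations(neighbors, 2):
            triads += 1
            if b in adj[a]:
                closed += 1
    return closed / triads if triads else 0.0


# Hypothetical graph: a triangle with a pendant node, a separate pair, one isolate
nodes = [1, 2, 3, 4, 5, 6, 7]
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (5, 6)]
adj = build_adjacency(nodes, edges)

print(len(connected_components(adj)))      # 3
print(diameter_of_largest_component(adj))  # 2
print(transitivity(adj))                   # 0.6
```

Note that an isolated node counts as its own component, which is why a sparse network can report hundreds of components.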

Table 1 - Comparison of Faux High School Network Model Characteristics
Network Name	Size (# of Nodes)	Edge count	Components	Diameter	Density	Transitivity
Desert High	107	348	9	6	0.0613648	0.26553
Dixon High	248	978	10	7	0.0319316	0.18127
Magnolia High	1461	974	661	40	0.0009132	0.27842
Mesa High	205	203	68	16	0.0097083	0.28225
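As a quick consistency check, the Density column can be recomputed from the Size and Edge count columns of Table 1, using the undirected-graph formula density = edges / (n(n - 1) / 2). The short Python sketch below is illustrative only; the variable and function names are hypothetical, and the counts are copied from Table 1.

```python
def density(n_nodes, n_edges):
    """Observed edges divided by the maximum possible number of
    undirected edges, n * (n - 1) / 2."""
    return n_edges / (n_nodes * (n_nodes - 1) / 2)


# (nodes, edges) taken from Table 1
table_1 = {
    "Desert High": (107, 348),
    "Dixon High": (248, 978),
    "Magnolia High": (1461, 974),
    "Mesa High": (205, 203),
}

densities = {school: density(n, m) for school, (n, m) in table_1.items()}
for school, value in densities.items():
    print(f"{school}: {value:.7f}")
```

The values agree with the Density column to seven decimal places, which confirms that the reported densities refer to the undirected (collapsed) networks.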

As Table 1 shows, the network characteristics of Desert High and Dixon High are similar. In both cases, the number of edges is roughly 3 to 4 times the number of nodes, the number of components is small, and the diameter of the largest component is less than 10 edges. Taken together, these measurements describe a relatively dense, compact network, which is reflected in the higher values of network density for these two schools.

By contrast, the Magnolia High network contains more nodes than edges. The network also has 661 components, indicating a large number of isolated or sparsely connected nodes. The sparsity of connections leads to a large diameter for the largest component as well as a low overall density: Magnolia High has the lowest density of all four networks. The characteristics of Mesa High are also notable. Like Magnolia High, Mesa High has a high number of components, but the number of edges in Mesa High is higher relative to the number of nodes, producing a higher density than Magnolia High. The diameter of the Mesa High network is also smaller than that of Magnolia High, which is logical given the lower node count and higher density.

Community Detection

Community detection algorithms are used to identify network subgroups based on the pattern of connections between group members as well as the pattern of connections between groups (Luke, 2015, p. 115). Multiple community detection algorithms are available, each with a different method of determining community membership. Hence, different algorithms applied to the same set of data can produce different results. This is shown in the visualizations included in Figure 3 below.

Each plot displays the results of a community detection algorithm applied to the Desert High network. The node positions in each plot are identical to enable comparisons, node labels show the value of the grade attribute for each node, and isolated nodes have been removed to declutter the visualizations. The color of the nodes and their surrounding “bubbles” represent the communities identified by each algorithm. The plots give a broad view of the differences between algorithms: the walktrap and edge-betweenness algorithms, for example, identify 11 and 13 communities respectively, while the label propagation and leading eigenvector algorithms each identify 5.

Figure 3, p 1

Figure 3, p 2

Each plot includes the value of modularity measured on the results of the community detection algorithm. Modularity is defined by Newman (as cited in Luke, 2015, p. 115) as the “extent to which nodes exhibit clustering where there is greater density within the clusters and less density between them.” A modularity value closer to 1 indicates that the algorithm has done a good job identifying subgroup structure (Luke, 2015, p. 118). By this measure, the infomap algorithm produced the best result on the Desert High data set, with the Louvain algorithm close behind.
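To make the modularity score concrete, the sketch below computes it for a given assignment of nodes to communities, using the standard Newman formulation for an unweighted, undirected graph: for each community, the fraction of edges that fall inside it minus the fraction expected from the degrees of its members. This is a minimal Python illustration on a made-up graph, not the igraph routine used for the figures; the function name and data are hypothetical.

```python
from collections import Counter


def modularity(edges, membership):
    """Q = sum over communities c of  L_c / m - (d_c / (2 m)) ** 2,
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the total degree of the nodes in c."""
    m = len(edges)
    internal = Counter()    # L_c
    degree_sum = Counter()  # d_c
    for u, v in edges:
        cu, cv = membership[u], membership[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            internal[cu] += 1
    return sum(internal[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


# Hypothetical graph: two triangles joined by a single bridge edge (3, 4)
edges = [(1, 2), (2, 3), (1, 3), (4, 5), (5, 6), (4, 6), (3, 4)]

two_groups = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
one_group = {node: "A" for node in range(1, 7)}

q_two = modularity(edges, two_groups)
q_one = modularity(edges, one_group)
print(round(q_two, 4))  # 0.3571
print(q_one)            # 0.0
```

Splitting the graph along the bridge scores well above the trivial single-community assignment, which scores zero by construction.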

Table 2 summarizes the modularity measurements of each available community detection algorithm as applied to each of the “faux” high school networks. As previously noted, the Desert High and Dixon High networks have similar characteristics, which may explain why the infomap and Louvain algorithms produce the highest modularity values for both data sets. For the Magnolia High network, modularity was highest for the edge-betweenness, fast-greedy, and Louvain algorithms, while the walktrap, edge-betweenness, and Louvain algorithms produced the best results on the Mesa High data.

Based on the modularity measurements, it appears that the Louvain algorithm may be well-suited to community detection within the type of social network considered here. In the plot of the Louvain results for the Desert High network in Figure 3, the algorithm detected the previously identified, relatively homogeneous clusters among students in lower grades (7, 8, 9) as well as the more mixed connections between students in higher grades. These results provide further evidence for the preliminary hypotheses discussed earlier, to be tested in the network modeling phase.

Table 2 - Modularity of Community Detection Algorithm Results
School Name	Walktrap	Edge-betweenness	InfoMap	Fast-greedy	Label propagation	Leading eigenvector	Louvain
Desert	0.52461	0.50271	0.53885	0.52734	0.52148	0.47830	0.53788
Dixon	0.45758	0.43817	0.48200	0.46677	0.36584	0.42085	0.48109
Magnolia	0.92534	0.95100	0.90300	0.95230	0.90045	0.93307	0.95046
Mesa	0.80194	0.80535	0.79172	0.79819	0.76884	0.76512	0.80355
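The rankings discussed above can be read directly off Table 2. As a small illustrative check, the Python sketch below lists the three highest-scoring algorithms for each school. The values are copied from Table 2; the variable and function names are hypothetical.

```python
algorithms = ["Walktrap", "Edge-betweenness", "InfoMap", "Fast-greedy",
              "Label propagation", "Leading eigenvector", "Louvain"]

# Modularity values from Table 2, in the column order listed above
table_2 = {
    "Desert":   [0.52461, 0.50271, 0.53885, 0.52734, 0.52148, 0.47830, 0.53788],
    "Dixon":    [0.45758, 0.43817, 0.48200, 0.46677, 0.36584, 0.42085, 0.48109],
    "Magnolia": [0.92534, 0.95100, 0.90300, 0.95230, 0.90045, 0.93307, 0.95046],
    "Mesa":     [0.80194, 0.80535, 0.79172, 0.79819, 0.76884, 0.76512, 0.80355],
}


def top_algorithms(scores, k=3):
    """Names of the k algorithms with the highest modularity."""
    ranked = sorted(zip(algorithms, scores), key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in ranked[:k]]


top_three = {school: top_algorithms(scores) for school, scores in table_2.items()}
for school, names in top_three.items():
    print(f"{school}: {', '.join(names)}")
```

Louvain is the only algorithm that appears in the top three for all four schools, which is consistent with the observation that it may be well-suited to this type of network.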

Bibliography

Handcock, M.S., Hunter, D.R., Butts, C.T., Goodreau, S.M., & Morris, M. (2003). statnet: Software tools for the Statistical Modeling of Network Data. https://statnet.org.

Luke, D.A. (2015). A user’s guide to network analysis in R. Switzerland: Springer International Publishing.

Resnick, M.D., Bearman, P.S., Blum, R.W., et al. (1997). Protecting adolescents from harm: Findings from the National Longitudinal Study on Adolescent Health. Journal of the American Medical Association, 278, 823-832.

Timothy Drexler

October 2019
