Project Description

In this project, you will explore a network of friends, classmates, and roommates. Your instructor will provide you with a data set (matrix) of various relationship types among individuals: friendship, classmate, roommate. Refer to the exponential random graph model (ERGM) discussed in class and read the tutorial on ERGM in R (included). Apply ERGM to the given data set to explain the friendship network in relation to the roommate and classmate matrices. Your instructor will provide additional guidance on what is expected for further analysis, interpretation, and visualization of your results.


Topic Explanation

In order to understand what one social network tells us about other, similar networks, inferential statistics are required. The four most commonly used tests of statistical significance (the chi-square test, z test, t test, and F-ratio test) cannot be applied directly to social network data. “Most significance tests presume that the units of analysis are independent of each other, which is exactly contrary to the assumption of interdependence between cases in social network data” (Social Network Analysis: Methods and Examples, 88). Because of this, other methods are needed to produce the sampling distribution for social network data and to infer from it the significance level of test statistics.


Data Explanation

My major at GCU, computer science, has multiple emphases: big data, game design, and business. There is one class that every CS major has to take regardless of emphasis, CST-405. I want to answer the following question:

“Do people sit at the same table with others from their same emphasis?”

In order to find this out I’ve created two adjacency matrices that connect everyone in CST-405 by major and by seating arrangement.
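
As a minimal sketch of how such an adjacency matrix can be built from a grouping variable, the block below assumes a hypothetical vector `table_id` that gives each student's table. The names and values are made up for illustration and are not the class data.

```r
# Hypothetical table assignment for five students (illustrative only).
table_id <- c(Ann = 1, Ben = 1, Cy = 2, Dee = 2, Eve = 3)

# Two students are tied (1) when they share a table, otherwise 0.
adj <- outer(table_id, table_id, FUN = "==") * 1

# A student is not tied to themselves, so zero the diagonal.
diag(adj) <- 0

adj
```

The same construction works for the emphasis matrix by replacing the table numbers with emphasis labels.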

This is the adjacency matrix that connects everyone by the table they sit at:

Matrix Representation
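library(knitr)  # provides kable()

# Node list (first name, last name, major) and the table-sharing adjacency matrix
class.nodes = read.csv("classNodes.csv")
class.tables = read.csv("classLinkedByTables.csv", row.names = 1)

# Label columns by first name and rows by full name for display
class.tables.df = as.data.frame(class.tables)
colnames(class.tables.df) = class.nodes$firstname
rownames(class.tables.df) = paste(class.nodes$firstname, class.nodes$lastname)

kable(class.tables.df)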
Aaron Alec Andrew Brian Connor Daniel Daniel Erik Garret Justen Kevin Kyle Lamarr Michael Nvart Shawn Timothy Jacob
Aaron Scirocco 0 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 1
Alec Ferko 0 0 0 0 1 0 0 0 0 0 1 0 1 0 0 0 0 0
Andrew Parasadayan 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 0 0 0
Brian Kurowski 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
Connor Segneri 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0
Daniel Briscoe 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0
Daniel Stagnaro 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0
Erik Weimer 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 0 1 1
Garret Grundeis 1 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1
Justen Johns 0 0 1 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0
Kevin Hoskins 0 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0
Kyle Bewley 1 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 1 1
Lamarr Pace 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
Michael Hesseltine 1 0 0 0 0 0 0 1 1 0 0 1 0 0 0 0 1 1
Nvart Kahkedjian 0 0 1 0 0 1 1 0 0 1 0 0 0 0 0 0 0 0
Shawn Kurowski 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Timothy Lowther 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 0 1
Jacob Slaton 1 0 0 0 0 0 0 1 1 0 0 1 0 1 0 0 1 0
Network Representation
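library(igraph)  # provides graph_from_adjacency_matrix() and the plot method

# Convert the adjacency matrix into an igraph graph object
class.tables.g = graph_from_adjacency_matrix(as.matrix(class.tables))

# Draw the graph with name labels instead of vertex shapes on a gray background
par(bg="gray70")
plot(class.tables.g, edge.arrow.size = 0.4, edge.color = "white", vertex.shape = "none",
     vertex.label = class.nodes$firstname, vertex.label.color = "black")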

This is the adjacency matrix that connects everyone by their major:

Matrix Representation
# Adjacency matrix linking students who share an emphasis
class.majors = read.csv("classLinkedByMajor.csv", row.names = 1)

# Label columns by first name and rows by full name for display
class.majors.df = as.data.frame(class.majors)
colnames(class.majors.df) = class.nodes$firstname
rownames(class.majors.df) = paste(class.nodes$firstname, class.nodes$lastname)

kable(class.majors.df)
Aaron Alec Andrew Brian Connor Daniel Daniel Erik Garret Justen Kevin Kyle Lamarr Michael Nvart Shawn Timothy Jacob
Aaron Scirocco 0 1 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0
Alec Ferko 1 0 1 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0
Andrew Parasadayan 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0
Brian Kurowski 0 0 0 0 0 0 0 1 0 0 0 1 1 1 1 1 1 1
Connor Segneri 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0
Daniel Briscoe 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
Daniel Stagnaro 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0
Erik Weimer 0 0 0 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1
Garret Grundeis 1 1 1 0 1 0 0 0 0 1 1 0 0 0 0 0 0 0
Justen Johns 1 1 1 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0
Kevin Hoskins 1 1 1 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0
Kyle Bewley 0 0 0 1 0 0 0 1 0 0 0 0 1 1 1 1 1 1
Lamarr Pace 0 0 0 1 0 0 0 1 0 0 0 1 0 1 1 1 1 1
Michael Hesseltine 0 0 0 1 0 0 0 1 0 0 0 1 1 0 1 1 1 1
Nvart Kahkedjian 0 0 0 1 0 0 0 1 0 0 0 1 1 1 0 1 1 1
Shawn Kurowski 0 0 0 1 0 0 0 1 0 0 0 1 1 1 1 0 1 1
Timothy Lowther 0 0 0 1 0 0 0 1 0 0 0 1 1 1 1 1 0 1
Jacob Slaton 0 0 0 1 0 0 0 1 0 0 0 1 1 1 1 1 1 0
Network Representation
# Convert the emphasis adjacency matrix into an igraph graph object
class.majors.g = graph_from_adjacency_matrix(as.matrix(class.majors))

# Same plot settings as for the table network
par(bg="gray70")
plot(class.majors.g, edge.arrow.size = 0.4, edge.color = "white", vertex.shape = "none",
     vertex.label = class.nodes$firstname, vertex.label.color = "black")


Network Analysis

Quadratic Assignment Procedure

Theory

“Quadratic Assignment Procedure (QAP) was developed by Krackhardt (1987), and its logic draws from that of the permutation test” (Social Network Analysis: Methods and Examples). This method is used to determine if a correlation between networks is statistically significant or not.

Below is the methodology of the quadratic assignment procedure:

  1. QAP calculates Pearson’s correlation coefficient (r) between the two original matrices, A and B, and treats it as the observed coefficient.
  2. QAP permutes one of the matrices, say B, by applying the same random reordering to both its rows and its columns.
  3. The permuted version of B is correlated with the original matrix A, producing a new Pearson’s correlation coefficient (r).
  4. Steps 2 and 3 are repeated at least a thousand times, or more if the researcher wants to be surer of the outcome.
  5. The observed coefficient from step 1 is compared with the distribution of coefficients generated in step 4 to determine the proportion of permuted coefficients that are equal to or greater than the observed coefficient. This proportion serves as the p-value.
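
The five steps can be sketched in base R as follows. This is an illustrative implementation under my own assumptions, not the routine used by the sna package; the function names `offdiag_cor` and `qap_cor` and the toy matrices are hypothetical.

```r
# Correlate the off-diagonal cells of two square matrices.
offdiag_cor <- function(a, b) {
  keep <- row(a) != col(a)              # ignore self-ties on the diagonal
  cor(a[keep], b[keep])
}

# QAP test: relabel the nodes of `b` at random and re-correlate with `a`.
qap_cor <- function(a, b, reps = 1000) {
  observed <- offdiag_cor(a, b)         # step 1: observed coefficient
  permuted <- replicate(reps, {
    p <- sample(nrow(b))                # step 2: one random node ordering
    offdiag_cor(a, b[p, p])             # step 3: same ordering on rows and columns
  })                                    # step 4: repeated `reps` times
  list(observed  = observed,
       p_greater = mean(permuted >= observed))  # step 5: proportion >= observed
}

# Toy 6-node matrices built from made-up group labels (not the class data).
a <- outer(c(1, 1, 1, 2, 2, 2), c(1, 1, 1, 2, 2, 2), "==") * 1
b <- outer(c(1, 1, 2, 2, 3, 3), c(1, 1, 2, 2, 3, 3), "==") * 1
diag(a) <- 0
diag(b) <- 0

set.seed(1)
result <- qap_cor(a, b, reps = 1000)
result
```

Permuting rows and columns with the same ordering keeps each node's row and column together, so the structure of the network is preserved while its alignment with the other matrix is randomized.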

Implementation

library(sna)  # provides netlm()
netlm(class.majors, class.tables)
## 
## OLS Network Model
## 
## Coefficients:
##             Estimate   Pr(<=b) Pr(>=b) Pr(>=|b|)
## (intercept) 0.35652174 1.000   0.000   0.000    
## x1          0.09084668 0.869   0.131   0.244    
## 
## Residual standard error: 0.4852 on 304 degrees of freedom
## F-statistic: 2.003 on 1 and 304 degrees of freedom, p-value: 0.158 
## Multiple R-squared: 0.006546     Adjusted R-squared: 0.003278

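The network linear regression function in the sna package, netlm, can be used to test the association between the matrices. With a single predictor matrix, the regression is equivalent to a correlation (the multiple R-squared is the squared correlation coefficient), netlm's default null hypothesis is based on QAP permutations, and the function reports the probability values (p-values) automatically.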

The multiple R-squared is 0.006546, so the seating matrix accounts for less than 1% of the variation in the emphasis matrix. The p-value of the F-test is 0.158, and the two-sided p-value reported for the table coefficient, Pr(>=|b|), is 0.244. Neither is below the conventional thresholds (e.g., p < .05, .01, or .001), so we fail to reject the null hypothesis that any correlation between the networks arises purely by chance. If the two matrices are correlated at all, the correlation is very small. With only 18 students the test has limited power, though, so the question warrants further investigation with a larger data set, which could give a different result.

Exponential Random Graph Modeling

Theory

“The exponential random graph model (ERGM)/P* attempts to explain how and why social network ties arise. The main goal of ERGMs is to understand a given observed network, that is, the empirical network measured by researchers, and to obtain insight into the underlying process that creates and sustains its ties” (Social Network Analysis: Methods and Examples, 93).

An ERGM assigns a probability to each graph according to the following formula:

\[P_\theta(G)=ce^{\theta_1Z_1(G)+\theta_2Z_2(G)+\cdots+\theta_pZ_p(G)}\]

  • G is the specific network that is being analyzed.
  • The Zs in the expression are network statistics.
  • The θs are the coefficients that weight each statistic; they are what the model estimates.
  • c is a normalizing constant.
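
One consequence of this form helps when reading ERGM output. It is stated here as a standard property of the model rather than something taken from the course text, and the symbols \(Y_{ij}\) (the tie from node i to node j) and \(\delta_k^{(ij)}\) are notation introduced only for this illustration. Because c appears in both the numerator and the denominator, it cancels when two graphs that differ in a single tie are compared:

\[\log\frac{P_\theta(Y_{ij}=1\mid\text{rest of }G)}{P_\theta(Y_{ij}=0\mid\text{rest of }G)}=\theta_1\delta_1^{(ij)}(G)+\theta_2\delta_2^{(ij)}(G)+\cdots+\theta_p\delta_p^{(ij)}(G)\]

Here \(\delta_k^{(ij)}(G)\) is the change in \(Z_k(G)\) when the tie is switched from absent to present. If the only statistic is the number of edges, adding one tie always changes it by 1, so the log-odds of any tie is simply \(\theta_1\).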

“ERGM includes those inter-dependencies of network ties, represented in various network configurations that explain the overall network structures, and measures their respective importance in the network formation process” (Social Network Analysis: Methods and Examples, 94). This is an advantage of ERGM: the overall network structure can be analyzed as the result of multiple processes operating simultaneously.

When explaining network ties in a full network, ERGM looks at three aspects:

  1. the network self-organization
    • “Network self-organization refers to the dependency of ties within the network analyzed, and for that reason is also called the endogenous effect or, purely, the structural effect” (Social Network Analysis: Methods and Examples, 95).
  2. the actor’s attributes
    • “Individual attributes operate to affect the tie formation in the overall social network via two mechanisms. First, the human propensity to select others similar to oneself is well documented in social network literature (McPherson, Smith-Lovin, & Cook, 2001). Second, individual attributes matter in the overall activity in tie formation” (Social Network Analysis: Methods and Examples, 95).
  3. the covariate variable(s)
    • “The third aspect of variation that ERGM can account for are covariate matrices that may interfere with the relation between the key independent variables and the dependent social network graph” (Social Network Analysis: Methods and Examples, 95).
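
As a rough sketch of how these three aspects map onto terms from the ergm package, the formula below uses the hypothetical names `net` (a network object with a vertex attribute "Major") and `cov_matrix` (a second adjacency matrix on the same nodes). It only builds the formula and does not fit a model; which terms are appropriate depends on the data.

```r
# Hypothetical objects: `net` and `cov_matrix` are placeholders, not defined here.
# Building a formula does not evaluate its terms, so this runs without ergm loaded.
model_formula <- net ~ edges +   # baseline tendency to form ties
  mutual +                       # self-organization: reciprocated ties
  nodematch("Major") +           # actor attributes: homophily on the attribute
  nodefactor("Major") +          # actor attributes: activity by attribute level
  edgecov(cov_matrix)            # covariate matrix measured on the same dyads

model_formula
```

A `mutual` term is only meaningful when ties can be unreciprocated, so it would not suit a network in which every tie is symmetric by construction.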

Implementation

Below is an ERGM of the class network in which ties are determined by the tables the students sit at.

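library(network)  # provides as.network() and set.vertex.attribute()
library(ergm)     # provides ergm()

# Build a directed network object from the table-sharing adjacency matrix
# (as.matrix() ensures the data frame is read as an adjacency matrix)
class.tables.net = as.network(x = as.matrix(class.tables),
                              directed = TRUE,
                              loops = FALSE,
                              matrix.type = "adjacency")
# Attach each student's emphasis as the vertex attribute "Major"
set.vertex.attribute(class.tables.net,
                     "Major",
                     as.character(class.nodes$major))

# Baseline model with a single edges term
class.tables.01 = ergm(class.tables.net ~ edges)
summary(class.tables.01)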
## 
## ==========================
## Summary of model fit
## ==========================
## 
## Formula:   class.tables.net ~ edges
## 
## Iterations:  5 out of 20 
## 
## Monte Carlo MLE Results:
##       Estimate Std. Error MCMC % p-value    
## edges  -1.1073     0.1323      0  <1e-04 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 424.2  on 306  degrees of freedom
##  Residual Deviance: 343.0  on 305  degrees of freedom
##  
## AIC: 345    BIC: 348.8    (Smaller is better.)

This first model estimates the density of the network. The log-odds of any tie existing are -1.1073. Using this, the density can be found:

\[\frac{e^{-1.1073}}{1+e^{-1.1073}}\]

This comes out to about 0.248, meaning that the overall density of the graph is about 24.8%: roughly one in four possible ties is present, which suggests lower than moderate connectivity. The small p-value indicates that the edges coefficient differs significantly from zero, that is, that the density differs significantly from the 50% that a coefficient of zero would imply.
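
The conversion can be checked with a few lines of R. The coefficient and the 306 dyads come from the model output above; the count of 76 ties is my own tally of the 1s in the table matrix and should be confirmed against the data.

```r
# Inverse logit of the edges coefficient reported in the model summary.
theta_edges <- -1.1073
density_from_model <- plogis(theta_edges)   # same as exp(x) / (1 + exp(x))

# Direct count: ties present divided by ordered pairs of distinct nodes.
n_nodes <- 18
n_ties  <- 76                               # assumed tally, see note above
density_from_count <- n_ties / (n_nodes * (n_nodes - 1))

round(c(model = density_from_model, count = density_from_count), 3)
```

The two values agree because an edges-only model reproduces the observed density.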

In the next model, homophily will also be evaluated. This is the idea that similar nodes are more likely to be connected than dissimilar nodes. The homophily of the table network is examined based on major.

class.tables.02 = ergm(class.tables.net ~ edges + nodematch("Major", diff = T))
summary(class.tables.02)
## 
## ==========================
## Summary of model fit
## ==========================
## 
## Formula:   class.tables.net ~ edges + nodematch("Major", diff = T)
## 
## Iterations:  13 out of 20 
## 
## Monte Carlo MLE Results:
##                                    Estimate Std. Error MCMC % p-value    
## edges                              -1.25954    0.17483      0  <1e-04 ***
## nodematch.Major.CS in big data      0.09639    0.40226      0   0.811    
## nodematch.Major.CS in business     16.16194  738.50272      0   0.983    
## nodematch.Major.CS in game design   0.43856    0.30987      0   0.158    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      Null Deviance: 424.2  on 306  degrees of freedom
##  Residual Deviance: 335.5  on 302  degrees of freedom
##  
## AIC: 343.5    BIC: 358.4    (Smaller is better.)

All three nodematch coefficients above are positive, which would indicate that students in the same emphasis are somewhat more likely to share a table (ties in this network are based on seating arrangements). The business coefficient is by far the largest, but its standard error (738.5) dwarfs the estimate (16.16), so it cannot be interpreted. In the emphasis matrix, Daniel Briscoe and Daniel Stagnaro are linked only to each other, which suggests they are the only two business students, and they sit at the same table, so the model has no variation from which to estimate that effect. None of the p-values is below 0.05, so the null hypothesis of no homophily cannot be rejected, and we cannot confidently say that these results reflect anything but chance. To state it clearly, the observations evaluated do not provide enough evidence that students sit with others from their own emphasis. The outcome is the same as in the QAP analysis. The smallest p-value (0.158, for game design) is far from conclusive, but with only 18 students the question warrants further investigation, and the results may differ if more nodes are observed and included in the analysis.
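
To read the coefficients on a probability scale, they can be passed through the inverse logit, as in the sketch below. The values are copied from the summary of class.tables.02, and the variable names are my own. The business coefficient is left out because its standard error (738.5) is far larger than the estimate.

```r
# Coefficients copied from the model summary above.
edges    <- -1.25954
big_data <-  0.09639
game     <-  0.43856

# Tie probability for a pair with no matching emphasis term (edges only).
p_baseline <- plogis(edges)

# Tie probability for a pair sharing an emphasis (edges + nodematch term).
p_big_data <- plogis(edges + big_data)
p_game     <- plogis(edges + game)

round(c(baseline = p_baseline, big_data = p_big_data, game_design = p_game), 3)
```

Under this model the tie probability is roughly 22% at baseline, 24% for two big data students, and 31% for two game design students, but none of these differences is statistically significant.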


References

Denny, M. (n.d.). Preparing network data in R. Retrieved February 21, 2018, from http://www.mjdenny.com/Preparing_Network_Data_In_R.html

Homophily. (2018, February 20). Retrieved February 21, 2018, from https://en.wikipedia.org/wiki/Homophily

Irvine, C. S. (n.d.). Retrieved February 21, 2018, from https://statnet.org/trac/raw-attachment/wiki/Sunbelt2016/ergm_tutorial.html#terms-provided-with-ergm

Yang, S., Zhang, L., & Keller, F. B. (2017). Social network analysis: methods and examples. Los Angeles: Sage.