Practicum 9 - Hypothesis Testing in Network

Practicum Setup

This practicum is still in its early stages, so you may notice that it is a little terse at the moment. That will be rectified in future iterations. For now, we will run through a few options that you are likely to find helpful if you plan to use statistical analysis to confirm any suspicions that you developed during background research or during your exploratory work.

This part is important. Please be sure to do this part again.

Create a folder labeled “Practicum 9” someplace on your computer, such as your Desktop or wherever you will be able to easily find it again. Then, set your working directory to that folder in RStudio:

Session
- Set Working Directory
- Choose Directory…
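
If you prefer the console to the menu, the same step can be sketched with setwd(). The path below is an assumption (a folder named "Practicum 9" on your Desktop); adjust it to wherever you actually created the folder.

```r
# Hypothetical location: a "Practicum 9" folder on the Desktop.
practicum_dir <- file.path(path.expand("~"), "Desktop", "Practicum 9")

if (dir.exists(practicum_dir)) {
  setwd(practicum_dir)   # make it the working directory
} else {
  message("Folder not found; create it first or edit practicum_dir.")
}

getwd()                  # confirm where R is currently pointed
```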

Getting Started

For the sake of familiarity and expediency, we’ll stick with Padgett’s Florentine families data. The object of this practicum is to familiarize you with a variety of statistical tools that should be helpful to one or more of your projects. For that reason, some of these analyses may seem a little trivial or odd. Please understand that the examples are being provided to demonstrate how you may use R to conduct these analyses. Bear with me, then use these with your own data for something a little more worthy of inference!

Data for this Practicum

Load the Padgett data. These have been saved in a format that is appropriate for statnet. To access them, go to the Network Data link on the course website and download the padbus.rda and padmar.rda files for business ties and marriage ties, respectively.

These will load data objects named “Padgett_Business” and “Padgett_Marriage” in R’s memory.

load("padbus.rda") # "Padgett_Business"
load("padmar.rda") # "Padgett_Marriage"

You will also need to convert an attribute that we used in the past to a network. Although this practicum will be conducted almost entirely in statnet, igraph offers the most efficient means of converting a two-mode network to one-mode. So, we’ll use igraph to preprocess our data, and then convert it to a statnet object, using the intergraph package.
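
To see what a two-mode to one-mode conversion does before handing the job to igraph, here is a minimal base R sketch. The family and party names are made up for illustration and are not the contents of party.csv.

```r
# Made-up two-mode data: rows are families, columns are parties.
incidence <- matrix(c(1, 0,
                      1, 1,
                      0, 1),
                    nrow = 3, byrow = TRUE,
                    dimnames = list(c("FamA", "FamB", "FamC"),
                                    c("Party1", "Party2")))

# Multiplying by the transpose counts shared parties for each pair of families.
shared <- incidence %*% t(incidence)
diag(shared) <- 0                 # ignore self-ties

# Dichotomize: a tie exists if two families share at least one party.
one_mode <- (shared > 0) * 1
one_mode
```

In this toy example FamA and FamB end up tied (both belong to Party1), while FamA and FamC do not share a party and remain unconnected.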

To begin, load the party.csv file. We have used this before to study brokerage in Practicum 6, so you can find it in the Practicum 6 folder on the Network Data page of the course website.

# Load igraph
library(igraph)

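party <- read.csv(file.choose(), header=FALSE) # Load "party.csv"
g <- graph_from_data_frame(party, directed=FALSE) # convert the edge list to an igraph object
plot(g)
V(g)$type <- bipartite_mapping(g)$type # flag which of the two modes each node belongs to
proj <- bipartite_projection(g) # build the one-mode projections (proj1 and proj2)
Party_Membership <- proj$proj1 # keep the first projection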

# Convert the igraph object to a statnet object
library(intergraph)
Padgett_Party <- asNetwork(Party_Membership)

# Make sure you got the right network
plot(Padgett_Party, displaylabels=TRUE)

# Finally, clean up your memory by detaching both packages.
detach("package:igraph")
detach("package:intergraph")

Now that your data are ready for use, move on to statnet.

Load statnet

All of the following analyses may be found in statnet. This package was developed on top of Carter Butts’ network and sna packages. Both of those packages are dependencies for statnet, so they load with it. Below, I may refer to various functions as residing in statnet, despite the fact that they are actually part of sna or network. The reference to statnet is therefore intended to encompass all three packages. When in doubt, just use the help function (?) to learn more about any individual command that we cover below.

library(statnet)

Measures of (Continuous) Autocorrelation

Autocorrelation is a measure of how much those near to us influence our outcomes. This is easiest to understand in terms of time. Over time, the value of a particular stock - or any other thing of value - is very likely related to the value it held yesterday. We can observe similar situations with respect to physical proximity. The wealth of particular cities can be related to the wealth of the cities that are adjacent to them. Although it is possible for a wealthy city to be adjacent to much poorer cities, it is also likely that any increases in the wealth of a particular city can cause spillovers into neighboring communities. The spillover may not always be direct, but there is likely to be a relationship there.

Network autocorrelation works in much the same way. Rather than using physical proximity, though, network autocorrelation uses network proximity to test whether individuals’ continuous attributes are related to those of their neighbors. In other words, we can ask questions about how a particular attribute appears to be related to the ties within that network. For example:

Does wealth attract wealth?
Do people with dissimilar wealth tend to form ties?
Or, does wealth appear to have little to do with one’s immediate neighborhood?

For the example below, we use the “wealth” attribute that is embedded in the Florentine marriage network. We are essentially asking whether Florentine families’ marriages had something to do with wealth. If you consider the network visualization, below, it does look as though the more wealthy families tended to marry less wealthy families, and vice versa. This is an example of negative autocorrelation.

Moran’s I and Geary’s C are measures of continuous autocorrelation that are available in statnet. Although they measure the same thing, they do so in different ways and each has a different scale. To assess the degree of negative autocorrelation, follow the examples below.

Moran’s I

Moran’s I ranges from -1 to 1 and is interpreted like Pearson’s Correlation Coefficient.

Values approaching -1 indicate negative autocorrelation
Values approaching 0 indicate that there is no autocorrelation (independence)
Values approaching 1 indicate positive autocorrelation

# Extract the "wealth" attribute from the Florentine business network.
family.wealth <- get.vertex.attribute(Padgett_Business, "wealth")

nacf(Padgett_Marriage, family.wealth, type="moran", mode="graph")[2]

##          1 
## -0.3107353

Note: Statnet’s nacf function produces a vector that gives measures for multiple steps out into the network, up to the theoretical maximum indicated by the order of the network. For our purposes, we consider only the immediate neighborhood around each node, meaning that we are interested in autocorrelation between nodes that are just one step from one another. The first element of the vector corresponds to 0 steps, the second to 1 step, and so on. Thus, the [2] in the scripts above and below refers to the second measure in the vector (one step). If you are still a little unclear about this after reading this explanation, just run the script above without the [2].

We can see that the Florentine marriage network is moderately negatively autocorrelated (I=-0.31), and now we have a measure to express that, in case we wish to compare it with other networks.
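
For intuition about what the statistic is measuring, here is a minimal base R sketch of the textbook Moran’s I formula on a made-up four-node network. It is an illustration under stated assumptions, not a copy of how nacf computes its values, so its normalization may differ slightly from statnet’s output.

```r
# Toy undirected network: a path 1 - 2 - 3 - 4 (made-up data).
adj <- matrix(0, 4, 4)
adj[cbind(1:3, 2:4)] <- 1
adj <- adj + t(adj)

# Made-up attribute that alternates between neighbors.
x <- c(1, 3, 1, 3)

morans_i <- function(adj, x) {
  n <- length(x)
  d <- x - mean(x)                   # deviations from the mean
  cross <- sum(adj * outer(d, d))    # products of deviations across ties
  (n / sum(adj)) * cross / sum(d^2)
}

morans_i(adj, x)   # -1: neighbors always sit on opposite sides of the mean
```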

For an alternative way to measure autocorrelation, see below.

Geary’s C

Geary’s C ranges from 0 to 2.

Values less than 1 indicate positive autocorrelation
Values close to 1 indicate that there is no autocorrelation (independence)
Values greater than 1 indicate negative autocorrelation

# Extract the "wealth" attribute from the Florentine business network.
family.wealth <- get.vertex.attribute(Padgett_Business, "wealth")

nacf(Padgett_Marriage, family.wealth, type="geary", mode="graph")[2]

##        1 
## 1.683607

Just as above, we can see that the Florentine marriage network is negatively autocorrelated.
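
As a companion illustration, the textbook Geary’s C formula can be sketched in base R on a made-up four-node network. Where Moran’s I multiplies deviations from the mean, Geary’s C works with the squared differences between tied nodes. Again, this is a sketch under stated assumptions and may not match nacf’s normalization exactly.

```r
# Toy undirected network: a path 1 - 2 - 3 - 4 (made-up data).
adj <- matrix(0, 4, 4)
adj[cbind(1:3, 2:4)] <- 1
adj <- adj + t(adj)

# Made-up attribute that alternates between neighbors.
x <- c(1, 3, 1, 3)

geary_c <- function(adj, x) {
  n <- length(x)
  d <- x - mean(x)                             # deviations from the mean
  sq_diff <- sum(adj * outer(x, x, "-")^2)     # squared differences across ties
  ((n - 1) / (2 * sum(adj))) * sq_diff / sum(d^2)
}

geary_c(adj, x)   # 1.5: greater than 1, so negative autocorrelation
```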

Feel free to use either of the above measures to express this. You should not need both.

Conditional Uniform Graphs (CUG)

Conditional uniform graphs (CUGs) are simple in their essence. By running the commands below, you are essentially running through a three-step process in one brief step. The process for running a CUG is:

Take a global measure on some network
Generate a large number of networks that share some aspect with the original network and run the global measure on each
Compare the measure from the initial network with those of the simulated networks

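I am pointing this out because the example below uses only betweenness centralization as the global measure. Centralization involves two items: the network being analyzed, and the name of the centrality measure being applied. The centrality measure is passed along through the FUN.args=list() argument (R’s partial argument matching also accepts the shortened FUN.arg). Other global measures that need only the network will, therefore, not require that argument.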

You will also note that I am conditioning the simulated networks on all three of the possible modes: size; number of edges; and the distribution of dyads. This is for demonstration purposes. Under normal circumstances, we would just condition the simulated networks on any one of the three. We are running all three to demonstrate how they differ.

Select the conditioning mode according to what you suspect about your own network of interest. If you are a little rusty on global measures, take a look back at Chapter 4 of Understanding Dark Networks or look back at Practicum 3.

cugBetSize <- cug.test(Padgett_Business,
                       centralization,
                       FUN.args=list(FUN=betweenness), # centrality measure handed to centralization
                       mode="graph", 
                       cmode="size")                   # condition on network size
cugBetEdges <- cug.test(Padgett_Business,
                        centralization,
                        FUN.args=list(FUN=betweenness), 
                        mode="graph", 
                        cmode="edges")                 # condition on number of edges
cugBetDyad <- cug.test(Padgett_Business,
                       centralization,
                       FUN.args=list(FUN=betweenness), 
                       mode="graph", 
                       cmode="dyad.census")            # condition on the dyad census

# Aggregate the findings...if you prefer.
Betweenness <- c(cugBetSize$obs.stat, cugBetEdges$obs.stat, cugBetDyad$obs.stat)
PctGreater <- c(cugBetSize$pgteobs, cugBetEdges$pgteobs, cugBetDyad$pgteobs)
PctLess <- c(cugBetSize$plteobs, cugBetEdges$plteobs, cugBetDyad$plteobs)
Betweenness <- cbind(Betweenness, PctGreater, PctLess)
rownames(Betweenness) <- c("Size", "Edges", "Dyads")
# write.csv(Betweenness, file="CUGoutput.csv")

Betweenness # Take a look

##       Betweenness PctGreater PctLess
## Size    0.2057143      0.001   0.999
## Edges   0.2057143      0.753   0.248
## Dyads   0.2057143      0.740   0.260

Consider the visualization of the network, below. Given the network’s structure, it is at least somewhat dominated by the Barbadori and Medici families (betweenness centralization = 0.21). But is that level of centralization special to the Florentine business network, or is this something that we would normally expect for a network this size? Is it something that we would normally expect for a network with this number of edges? Is it something that we would normally expect for a network with this distribution of dyads?

As we can see in the output above, this level of centralization is very uncommon in a network this size. But it is not at all uncommon in a network with the same number of edges, or the same distribution of dyads.

We can depict the same information graphically by displaying the distribution of betweenness centralization measures for the randomly generated networks, and indicating where the betweenness centralization measure of 0.21 lies in comparison to each distribution.

par(mfrow=c(1,3))
plot(cugBetSize, main="Betweenness \nConditioned on Size" )
plot(cugBetEdges, main="Betweenness \nConditioned on Edges" )
plot(cugBetDyad, main="Betweenness \nConditioned on Dyads" )
par(mfrow=c(1,1))
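
To make the three-step CUG logic concrete, here is a minimal base R sketch that conditions on the number of edges. It is an illustration only: the star network, the helper names, and the use of degree centralization (rather than betweenness) are all assumptions chosen to keep the example self-contained, and it does not reproduce how cug.test draws its graphs.

```r
set.seed(42)  # for reproducibility of the random draws

# Hypothetical observed network: an undirected star on 8 nodes.
n <- 8
obs <- matrix(0, n, n)
obs[1, -1] <- 1
obs[-1, 1] <- 1

# Step 1: a global measure (Freeman degree centralization).
degree_centralization <- function(adj) {
  deg <- rowSums(adj)
  k <- nrow(adj)
  sum(max(deg) - deg) / ((k - 1) * (k - 2))
}
obs_stat <- degree_centralization(obs)

# Step 2: random graphs that share the observed number of edges.
random_graph <- function(n, m) {
  adj <- matrix(0, n, n)
  slots <- which(upper.tri(adj))                 # possible undirected ties
  adj[slots[sample.int(length(slots), m)]] <- 1  # place m ties at random
  adj + t(adj)
}
m <- sum(obs) / 2
sim_stats <- replicate(1000, degree_centralization(random_graph(n, m)))

# Step 3: compare the observed value with the simulated distribution.
c(observed = obs_stat,
  p_greater_equal = mean(sim_stats >= obs_stat),
  p_less_equal = mean(sim_stats <= obs_stat))
```

The two proportions play the same role as the pgteobs and plteobs values extracted from the cug.test objects above.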

Quadratic Assignment Procedure (QAP)

Quadratic assignment procedure (QAP) is similar to CUG, in that it uses simulation to generate a distribution of hypothetical networks. But QAP controls for network structure, as compared with CUG, which controls for size, the number of edges, or the dyad census. (Note: CUG can be made to condition on other properties. But size, edges, and dyad census are the options available through statnet.)

QAP controls for network structure by repeatedly permuting the node labels of the network (that is, reordering the rows and columns of its matrix together) over a large number of iterations - usually around 1000. For more details, check out chapter 9 of the book.
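
The permutation idea can be sketched in a few lines of base R. This is a minimal illustration on two made-up random networks, with hypothetical helper names; it is not the code that statnet runs internally.

```r
set.seed(1)

# Two made-up undirected networks on 10 nodes.
n <- 10
random_sym <- function(n, p) {
  adj <- matrix(0, n, n)
  adj[upper.tri(adj)] <- rbinom(n * (n - 1) / 2, 1, p)
  adj + t(adj)
}
net_a <- random_sym(n, 0.3)
net_b <- random_sym(n, 0.3)

# Correlate the two networks cell by cell (each undirected dyad once).
graph_cor <- function(a, b) {
  off <- upper.tri(a)
  cor(a[off], b[off])
}
obs_cor <- graph_cor(net_a, net_b)

# Shuffle the node labels of one network; rows and columns move together,
# so its structure is preserved while its alignment with net_b is broken.
perm_cors <- replicate(1000, {
  idx <- sample(n)
  graph_cor(net_a[idx, idx], net_b)
})

c(observed = obs_cor,
  p_greater_equal = mean(perm_cors >= obs_cor),
  p_less_equal = mean(perm_cors <= obs_cor))
```

Because every permuted network has exactly the same ties as the original, just attached to different labels, the comparison distribution holds the structure of the network constant.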

QAP is useful for running a variety of statistical functions. Below are three of the options that you have available in statnet.

QAP Correlation

Did the Florentine families base their business dealings on the marriage ties? (Or maybe their marriages are based on their business ties?)

# Get the Correlation value
gcor(Padgett_Business, Padgett_Marriage)

## [1] 0.3718679

Marriage and business ties are moderately correlated (r=0.37).

# Is it significant?
Medici.Cor <- qaptest(list(Padgett_Business, Padgett_Marriage), gcor, g1=1, g2=2, reps=1000)
Medici.Cor

## 
## QAP Test Results
## 
## Estimated p-values:
##  p(f(perm) >= f(d)): 0.001 
##  p(f(perm) <= f(d)): 1

The correlation is significant at the 0.05 alpha level. We know this because fewer than 5% of the permuted networks - in this case, only about 0.1% of them - exhibited correlation coefficients greater than or equal to the value we calculated for the observed networks.

For a visual of this comparison, see below.

plot(Medici.Cor, xlim=c(-0.25, 0.4))

Note: I widened the x-axis a little so that the dotted line would show better. That is what the xlim= argument does in the script above.

QAP Linear Regression

We are finally ready to use the “Padgett_Party” network that you constructed from an attribute. If you have not already created this network, go back to the top and do that now. We’ll wait.

Okay, now that you have the networks ready, you may use them to run a multiple regression.

Strictly speaking, what we are about to do is inappropriate. It is inappropriate because the ties within this network are binary, and multiple linear regression is meant for continuous measures. Linear regression therefore works best if your dependent network has valued ties (e.g., trade networks, layered networks, etc.).

Because we are using a binary network as our dependent variable, we can interpret the estimates as explaining the probability of a tie in the dependent network, controlling for other variables in the model.

nl<-netlm(Padgett_Business,           # Dependent variable/network
          list(Padgett_Marriage, Padgett_Party), # List the independent variables/networks
          reps=1000)                  # select the number of permutations

#Examine the results
summary(nl)

## 
## OLS Network Model
## 
## Residuals:
##          0%         25%         50%         75%        100% 
## -0.40038536 -0.07067437 -0.07067437 -0.06874759  0.93125241 
## 
## Coefficients:
##             Estimate     Pr(<=b) Pr(>=b) Pr(>=|b|)
## (intercept)  0.070674374 0.827   0.173   0.173    
## x1           0.329710983 1.000   0.000   0.000    
## x2          -0.001926782 0.538   0.462   0.975    
## 
## Residual standard error: 0.3089 on 237 degrees of freedom
## Multiple R-squared: 0.1383   Adjusted R-squared: 0.131 
## F-statistic: 19.02 on 2 and 237 degrees of freedom, p-value: 2.188e-08 
## 
## 
## Test Diagnostics:
## 
##  Null Hypothesis: qap 
##  Replications: 1000 
##  Coefficient Distribution Summary:
## 
##        (intercept)       x1       x2
## Min       -1.88972 -3.03653 -3.42563
## 1stQ       0.99216 -1.24325 -1.04720
## Median     1.73141 -0.31408 -0.17409
## Mean       1.69421 -0.04708 -0.10520
## 3rdQ       2.40030  0.67373  0.73990
## Max        5.00838  5.94179  4.15936

In the output above, x1 represents the first variable in the model (marriage ties), and x2 represents the second variable that we entered (party ties).

The F-statistic tells us that the model is useful (p<0.05). When we consider the individual predictors, it is apparent from the two-tailed p-values (Pr(>=|b|)) that party ties do not explain business ties (p=0.975) and marriage ties do explain business ties (p<0.001). We can, therefore, interpret the model estimates as telling us that the presence of a marriage tie increases the probability of a business tie by 0.33. Conversely, we interpret party ties as not explaining business ties, since party was not significant.
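
The "increase in probability" reading can be checked with a minimal base R sketch on made-up dyad-level data. With a single binary predictor, the ordinary least squares slope is simply the difference in the proportion of ties between the two groups. The data and variable names here are assumptions for illustration, and the sketch leaves out the QAP permutations that netlm uses for its p-values.

```r
set.seed(7)

# Made-up dyad-level data: x marks a (hypothetical) marriage tie,
# y marks a (hypothetical) business tie for the same pair.
x <- rbinom(200, 1, 0.2)
y <- rbinom(200, 1, ifelse(x == 1, 0.4, 0.07))

fit <- lm(y ~ x)
slope <- unname(coef(fit)["x"])

# With one binary predictor, the OLS slope is the difference in tie proportions.
diff_in_props <- mean(y[x == 1]) - mean(y[x == 0])

c(slope = slope, diff_in_props = diff_in_props)
```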

Again, this example would have been much better if we had valued ties so that we could predict or explain the value of a tie, given the various independent variables. For a much better way to analyze a binary network, see below.

QAP Logistic Regression

If you have a network with binary ties, this is the regression method that you should use to predict ties in the outcome network (dependent variable).

Below, we run the same model. However, this time, we are using the appropriate distribution. The process is the same, but the interpretation is somewhat more involved.

nlog<-netlogit(Padgett_Business, list(Padgett_Marriage, Padgett_Party),reps=1000)

#Examine the results
summary(nlog)

## 
## Network Logit Model
## 
## Coefficients:
##             Estimate    Exp(b)    Pr(<=b) Pr(>=b) Pr(>=|b|)
## (intercept) -2.57893819 0.0758545 0.000   1.000   0.000    
## x1           2.17792200 8.8279428 0.999   0.001   0.001    
## x2          -0.02228475 0.9779617 0.517   0.483   0.972    
## 
## Goodness of Fit Statistics:
## 
## Null deviance: 332.7106 on 240 degrees of freedom
## Residual deviance: 155.2943 on 237 degrees of freedom
## Chi-Squared test of fit improvement:
##   177.4164 on 3 degrees of freedom, p-value 0 
## AIC: 161.2943    BIC: 171.7362 
## Pseudo-R^2 Measures:
##  (Dn-Dr)/(Dn-Dr+dfn): 0.4250345 
##  (Dn-Dr)/Dn: 0.5332452 
## Contingency Table (predicted (rows) x actual (cols)):
## 
##       0     1
## 0   210    30
## 1     0     0
## 
##  Total Fraction Correct: 0.875 
##  Fraction Predicted 1s Correct: NaN 
##  Fraction Predicted 0s Correct: 0.875 
##  False Negative Rate: 1 
##  False Positive Rate: 0 
## 
## Test Diagnostics:
## 
##  Null Hypothesis: qap 
##  Replications: 1000 
##  Distribution Summary:
## 
##        (intercept)        x1        x2
## Min      -7.240807 -2.540211 -2.995598
## 1stQ     -6.537942 -0.820302 -1.010798
## Median   -6.266030 -0.328922 -0.103392
## Mean     -6.230127  0.049667 -0.006711
## 3rdQ     -5.947601  0.813030  0.899497
## Max      -4.705737  5.194956  4.716553

The logistic model looks fairly similar to the linear regression model. The estimates, however, are interpreted very differently.

As above, party ties do not explain business ties (p=0.97) and marriage ties do explain business ties (p=0.001). In this output, the estimates for this test are given in two columns. The first column, labeled “Estimate”, gives the log odds. You can interpret this as a tendency towards forming ties (for positive values), or a tendency away from forming ties (for negative values). It is, however, much easier to read the second column, labeled “Exp(b)”. The values in the second column are the log odds that have been converted to odds ratios by exponentiating them. To interpret the odds ratios, think of them as the odds of a tie forming in the presence of some factor, as compared with the odds in the absence of that factor.

Using the second column of estimates, the output indicates that the odds of a business tie are 8.83 times higher in the presence of a marriage tie than in the absence of a marriage tie. If you are interested in betting, think of a marriage tie as multiplying the odds of a business tie by 8.83. Party is not significant (p=0.97), so we interpret party ties as not explaining business ties.
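
The conversion from log odds to odds ratios is easy to verify by hand. The sketch below copies the three estimates from the netlogit output above and exponentiates them. It also converts them into implied probabilities with plogis(), assuming no party tie is present; the vector names are labels added here for readability.

```r
# Estimates copied from the netlogit output above (log odds).
b <- c(intercept = -2.57893819, marriage = 2.17792200, party = -0.02228475)

exp(b)   # odds ratios; these reproduce the Exp(b) column

# Implied probability of a business tie (party tie absent):
p_without <- plogis(b["intercept"])                  # no marriage tie
p_with    <- plogis(b["intercept"] + b["marriage"])  # marriage tie present

round(c(without_marriage = unname(p_without),
        with_marriage = unname(p_with)), 2)
```

The probabilities come out at roughly 0.07 without a marriage tie and 0.40 with one. Note that the ratio of those two probabilities is not 8.83; the 8.83 figure is a ratio of odds, which is why the interpretation above is phrased in terms of odds.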