First, we need to load the necessary packages and load the data.
library(qgraph)
library(psych)
library(GPArotation)
library(lavaan)
library(psychTools)
data(bfi)
bfi <- bfi[,1:25]
bfi <- na.omit(bfi)
Next we run a confirmatory factor analysis on the Big Five Index data
using lavaan.
model <- 'N =~ N1 + N2 + N3 + N4 + N5
E =~ E1 + E2 + E3 + E4 + E5
O =~ O1 + O2 + O3 + O4 + O5
A =~ A1 + A2 + A3 + A4 + A5
C =~ C1 + C2 + C3 + C4 + C5'
cfa_fit <- cfa(model, data = bfi)
# Print the results
summary(cfa_fit)
## lavaan 0.6.15 ended normally after 55 iterations
##
## Estimator ML
## Optimization method NLMINB
## Number of model parameters 60
##
## Number of observations 2436
##
## Model Test User Model:
##
## Test statistic 4165.467
## Degrees of freedom 265
## P-value (Chi-square) 0.000
##
## Parameter Estimates:
##
## Standard errors Standard
## Information Expected
## Information saturated (h1) model Structured
##
## Latent Variables:
## Estimate Std.Err z-value P(>|z|)
## N =~
## N1 1.000
## N2 0.947 0.024 39.899 0.000
## N3 0.884 0.025 35.919 0.000
## N4 0.692 0.025 27.753 0.000
## N5 0.628 0.026 24.027 0.000
## E =~
## E1 1.000
## E2 1.226 0.051 23.899 0.000
## E3 -0.921 0.041 -22.431 0.000
## E4 -1.121 0.047 -23.977 0.000
## E5 -0.808 0.039 -20.648 0.000
## O =~
## O1 1.000
## O2 -1.020 0.068 -14.962 0.000
## O3 1.373 0.072 18.942 0.000
## O4 0.437 0.048 9.160 0.000
## O5 -0.960 0.060 -16.056 0.000
## A =~
## A1 1.000
## A2 -1.579 0.108 -14.650 0.000
## A3 -2.030 0.134 -15.093 0.000
## A4 -1.564 0.115 -13.616 0.000
## A5 -1.804 0.121 -14.852 0.000
## C =~
## C1 1.000
## C2 1.148 0.057 20.152 0.000
## C3 1.036 0.054 19.172 0.000
## C4 -1.421 0.065 -21.924 0.000
## C5 -1.489 0.072 -20.694 0.000
##
## Covariances:
## Estimate Std.Err z-value P(>|z|)
## N ~~
## E 0.292 0.032 9.131 0.000
## O -0.093 0.022 -4.138 0.000
## A 0.141 0.018 7.712 0.000
## C -0.250 0.025 -10.117 0.000
## E ~~
## O -0.265 0.021 -12.347 0.000
## A 0.304 0.025 12.293 0.000
## C -0.224 0.020 -11.121 0.000
## O ~~
## A -0.093 0.011 -8.446 0.000
## C 0.130 0.014 9.190 0.000
## A ~~
## C -0.110 0.012 -9.254 0.000
##
## Variances:
## Estimate Std.Err z-value P(>|z|)
## .N1 0.793 0.037 21.575 0.000
## .N2 0.836 0.036 23.458 0.000
## .N3 1.222 0.043 28.271 0.000
## .N4 1.654 0.052 31.977 0.000
## .N5 1.969 0.060 32.889 0.000
## .E1 1.814 0.058 31.047 0.000
## .E2 1.332 0.049 26.928 0.000
## .E3 1.108 0.038 29.522 0.000
## .E4 1.088 0.041 26.732 0.000
## .E5 1.251 0.040 31.258 0.000
## .O1 0.865 0.032 27.216 0.000
## .O2 1.990 0.063 31.618 0.000
## .O3 0.691 0.039 17.717 0.000
## .O4 1.346 0.040 34.036 0.000
## .O5 1.380 0.045 30.662 0.000
## .A1 1.745 0.052 33.725 0.000
## .A2 0.807 0.028 28.396 0.000
## .A3 0.754 0.032 23.339 0.000
## .A4 1.632 0.051 31.796 0.000
## .A5 0.852 0.032 26.800 0.000
## .C1 1.063 0.035 30.073 0.000
## .C2 1.130 0.039 28.890 0.000
## .C3 1.170 0.039 30.194 0.000
## .C4 0.960 0.040 24.016 0.000
## .C5 1.640 0.059 27.907 0.000
## N 1.689 0.073 23.034 0.000
## E 0.846 0.062 13.693 0.000
## O 0.404 0.033 12.156 0.000
## A 0.234 0.030 7.839 0.000
## C 0.463 0.036 12.810 0.000
Now that we have confirmed that a five-factor structure works for this data, we can start using this data to explore network psychometrics! In network psychometrics, it’s usually typical to analyze the individual items that make up a factor rather than the summary variables. But it is still possible to analyze summary variables as components in a network (e.g., depression).
To analyze the data, we will be using the qgraph package
in R. The main function is, surprise, qgraph(), and the
input it takes is an edge-list or a weights matrix. What are those you
ask?
An edge list looks like this:
| From | To | Weight |
|---|---|---|
| Var1 | Var2 | 0.6 |
| Var1 | Var3 | 0.4 |
| Var2 | Var3 | 0.5 |
| Var2 | Var4 | 0.7 |
| Var3 | Var4 | 0.3 |
An edge list calculates the similarity (or dissimilarity) between data points and represents the relationship as pairs of nodes with associated weights. Edge lists are also called similarity or adjacency matrices.
The easy way to do this is by using the igraph
package:
similarity_matrix <- cor(bfi)
graph <- igraph::graph_from_adjacency_matrix(similarity_matrix, mode = "undirected", weighted = TRUE, diag = FALSE)
edge_list <- igraph::get.data.frame(graph)
knitr::kable(head(edge_list), format = "simple", align = 'c', full_width = F)
| from | to | weight |
|---|---|---|
| A1 | A2 | -0.3509055 |
| A1 | A3 | -0.2736361 |
| A1 | A4 | -0.1567536 |
| A1 | A5 | -0.1926976 |
| A1 | C1 | 0.0146977 |
| A1 | C2 | 0.0129184 |
Looking at these results, it should be clear that, given we’re using correlations as our edge weight metric, an edge list is nothing more than a reformatted correlation matrix when creating edges based on correlations. See:
| A1 | A2 | A3 | |
|---|---|---|---|
| A1 | 1.0000000 | -0.3509055 | -0.2736361 |
| A2 | -0.3509055 | 1.0000000 | 0.5030411 |
| A3 | -0.2736361 | 0.5030411 | 1.0000000 |
A correlation table mirrors the general format for a weights table (also called an association table), which is an n x n matrix where the rows are the origin node and the columns are the destination node.
The values in a weights table can be made using any statistic that indicates the strength of the connection between nodes as long as 0 is used to indicate no connection, and the absolute negative values are similar in strength to the positive values.
If the data in the matrix is symmetrical, it will generate a network plot that is undirected. If the data in the matrix is asymmetrical, the network plot will be directed. Here are some plots illustrating the difference between the two:
| 0.0 | 0.2 | -0.3 | 0.1 | -0.4 |
| 0.2 | 0.0 | 0.5 | -0.2 | 0.0 |
| -0.3 | 0.5 | 0.0 | 0.4 | -0.1 |
| 0.1 | -0.2 | 0.4 | 0.0 | 0.2 |
| -0.4 | 0.0 | -0.1 | 0.2 | 0.0 |
This is an undirected graph; think of the edges of the nodes in these graphs as representing “mutual relationships”. If this was a social network, those nodes with a red edge would both agree they didn’t like each other and agree on the strength of that dislike, whereas those with a blue edge both agree they like each other and how much they like each other. The opacity and size of the line indicate the strength of the relationship, because the properties of these edges vary between nodes, we say that this graph is weighted.
Directed graphs, in contrast, show nodes that are considered to be a source and which are to be considered a destination. Hence the “to” and “from” wording in the edge list (even though this wording applies to edge lists used in both directed and undirected graphs). Below is an example asymmetric weight table and its corresponding directed graph:
| 0.0 | 0.5 | 0.4 | 0.0 | -0.2 | -0.1 |
| 0.0 | 0.0 | 0.3 | -0.3 | 0.0 | 0.0 |
| 0.4 | 0.3 | 0.0 | 0.0 | 0.0 | 0.0 |
| 0.0 | -0.3 | 0.0 | 0.0 | 0.4 | 0.3 |
| 0.0 | 0.0 | 0.0 | 0.4 | 0.0 | 0.2 |
| -0.1 | 0.0 | 0.0 | 0.3 | 0.2 | 0.0 |
Notice that in this example most nodes are directed at each other. However, node 1 has more outgoing edges (2,3,5,6) than incoming edges (3,6), and depending on the network we are analyzing, this could indicate that this node is relatively isolated from the rest.
What is important to note that while edge lists and weight matrices
are the inputs into qgraph and are used to create the
network graph, they only represent the network structure at a basic
level and more detailed analyses can still be done.
Back to our data. Let’s graph it as a network:
bfi_cor <- cor(bfi)
network <- qgraph(bfi_cor, layout = "spring", theme = "colorblind")
Already we can see the beginnings of a very interesting network structure. However, because we are using only correlations, we aren’t gaining much more insight than we would have with only factor analysis. Also, remember how I said a main feature and strength of network psychometrics is its ability to estimate model parameters; looking at correlations hardly does this. Let’s instead estimate our network based on our data. To do this, we will estimate the network using a pairwise Markov random field (PMRF) model.
However, before we get into network estimation, we need to understand a few core concepts.
First, remember that a PMRF model “is a class of undirected network models in which variables are represented by nodes connected by edges that indicate the (strength of) conditional association between two variables after controlling for all other variables in the network” (Network Psychometrics in R, p. 162).
Which brings us to a very cool feature of network estimation!
Because network psychometrics focuses so much on accurately representing and estimating the relationships between variables, it’s important to factor in that some variables might impact others, or not at all.
When the strength of a relationship between two variables does not depend on the value of any other variable, the variables are said to be conditionally independent, and the inverse occurs when the variables are conditionally dependent.
In general, most networks are considered to have some amount of dependency, otherwise you wouldn’t have much of a network! It’s also the analysis of how the variables interact which is also one of the main reasons for conducting network analysis. Additionally, this is why PMRF models are so popular, since they work well with data that is assumed to have some level of dependency.
Additionally, PMRF models have model sub-classes that are used for specific types of data, like an Ising model for binary data or a Gaussian for continuous data. But for our purposes, we will stick to a basic partial correlation.
To conduct our analysis, let’s use the bootnet package,
which is really just a wrapper for other estimation functions, to run a
PMRF to estimate the network structure of our measurement theory
assessment items.
library(bootnet)
network <- estimateNetwork(bfi, default = "pcor")
plot(network, layout = "spring", groups = groupings, theme = "colorblind")
Now we can start to notice interesting visual things about our network! According to the CFA, variables should cluster together by factor. In this graph, we can see that this is generally the case with the variables of similar factors being linked together more than with variables of other factors. However, there appears to be an interesting clustering around neuroticism items and the third agreeableness item.
This generated network looks a bit messy because it shows all the edges. In network analytics this is termed a saturated graph.
In the following code we prune the network to only have significant edges at p = .05. Using the pruning method, all non-significant values are set to 0 in the network and thus impacts analysis of the network data.
Thresholding in contrast would simply remove the non-significant values from the plot, but keep them included in the network and they are used in analysis.
network <- estimateNetwork(bfi, default = "pcor", threshold = "sig", alpha = 0.05)
# Further showing that we "pruned" rather than threshold
knitr::kable(head(network$results[1:3,1:3]),align = 'c', full_width = F)
| A1 | A2 | A3 | |
|---|---|---|---|
| A1 | 0.0000000 | -0.2406618 | -0.1324402 |
| A2 | -0.2406618 | 0.0000000 | 0.2674442 |
| A3 | -0.1324402 | 0.2674442 | 0.0000000 |
plot(network, groups = groupings, layout = "spring", theme = "colorblind", minimum = 0.1)
Here I also set the minimum to .1, so only edges with an absolute value greater than .1 are plotted. The minimum value is sometimes a subjective choice, in this case I think it balances interpretability while maintaining enough information.
However, reviewers won’t be happy if you just do a visual examination
of your network. They want to see the numbers! So let’s get some.
Network scientists have developed a few key metrics that we can use here
to examine our network. We can obtain some of these metrics using
qgraph’s centrality plots:
centralityPlot(network, include = "all", scale = "raw0")
## Warning: Removed 1 rows containing missing values (`geom_point()`).
Note: the scale used in the plot is the raw partial correlation values with the scales anchored at 0.
Strength - Is the sum of the absolute edge weights of all the edges connected to the node. In this graph, the partial correlation values are what are being summed. Strength can be considered a proxy measure of influence or importance of a node.
Closeness - Is the inverse of the average distance of a node to all other nodes in the network. The distance of a node is the shortest path that connects them together. In this case, all the nodes are considered far away from each other and none of them is a central node to the network.
Betweeness - Is a way to measure how often a node is between two other nodes. A node with high betweeness is considered a bridge between other parts of the network. In psychometrics, a variable with high betweeness might indicate that the variable plays a role in connecting clusters, or possibly represents a unique aspect of the construct being measured.
Expected Influence - This metric is similar to strength in that it is the sum of the edge weights extending from a node, but does not take the absolute value of these nodes when summing. Note, while negative influences can exist in a network, in the context of expected influence only the positive impact of the node is considered, therefore the lowest possible value for expected influence is 0. Higher influence indicates how much that node can impacts the other nodes in a network. This can be useful for network psychometrics because we can identify the symptoms that are having the greatest impact on the other nodes in the network.
Caveat: Because our data is being treated as
continuous, the qgraph function
clusteringPlots() doesn’t work well as it is designed to
work with categorical and binary data. Our data is being treated as
continuous in this case. Clustering metrics and plots are similar to
those of centrality metrics and plots, but with slightly different
interpretations, unfortunately we don’t have time today to review those
as well.
It is common to use bootstrapping methods when estimating network structures from data. Doing so gives better estimates of the “true” edge weights by considering variability between different samples. The primary way this is accomplished is by identifying the confidence intervals around the edge weights.
We can do this with bootnet:
# Warning! This step is fairly computationally expensive! Be sure to lower the number of cores to that which is appropriate for your device.
b1 <- bootnet(network, nboots = 1000, default = "pcor", ncores = 14,threshold = "sig", alpha = 0.05)
You’ll notice that the bootstrapped network is very similar to the one we estimated previously, however this is not usually the case.
Normally, because parameter estimation (e.g., of edge weights) is sensitive to sampling variation, the estimated parameters come with a level of uncertainty which is larger in smaller samples, resulting in bootstrapped networks sometimes looking very different from their estimated networks.
The important thing to remember is that estimated networks may not represent true network structures, and that networks will vary based on sampling.
But we’re here to look at the CI’s for the bootstrapped edge weights.
plot(b1, "edge", labels = FALSE, order = "sample")
To interpret this plot, we want to look for edges with narrow confidence intervals that do not cross the zero line. These represent stable and significant relationships in the network.
In general, most of the edges are 0 (see the long vertical red line and grey areas that cross over 0). And, in my opinion, the number of edges that are distinctly away from 0 matches the ones that we see in the network graph.
It should be noted that bootstrapping generally shouldn’t be used to estimate other network parameters (e.g., betweenness) because many of the centrality metrics use absolute values in their calculations this means 0 is the lowest bound of an estimated parameter range, and 0’s cause a lot of problems when bootstrapping. And we should expect a lot of 0 edges in our network models, especially if there are distinct and distant clusters of variables.
That’s pretty much a brief introduction to network psychometrics!
By now you hopefully have some ideas on how psychological constructs can be represented, estimated, and analyzed using networks. Network psychometrics is a powerful tool to assess the relationships between variables, especially when working in conjunction with other tools like FA and CFA.
Isvoranu, Adela-Maria, et al., editors. Network Psychometrics with R: A Guide for Behavioral and Social Scientists. New York, NY, 2022.