Before we begin, let’s install/load the packages we are going to use today. Perhaps the one you are going to use for the first time is statnet.
rm(list = ls())
if(!require(statnet)) install.packages("statnet",repos = "http://cran.us.r-project.org")
library("statnet")
if(!require(mice)) install.packages("mice",repos = "http://cran.us.r-project.org")
library("mice")
if(!require(ggplot2)) install.packages("ggplot2",repos = "http://cran.us.r-project.org")
library("ggplot2")
if(!require(GGally)) install.packages("GGally",repos = "http://cran.us.r-project.org")
library("GGally")
if(!require(scales)) install.packages("scales",repos = "http://cran.us.r-project.org")
library("scales")
options(width = 300)
Loading the statnet package also loads other important packages including sna, network, and ergm. Installing and loading statnet automatically installs and loads all the necessary packages for statistical network analysis.
Note: We are not going to use the igraph package. igraph and statnet do not play nicely with each other and there is a chance using them will interfere with an otherwise smooth workflow. In the case you want to use some plotting functions from igraph, make sure to always use igraph:: or network:: or sna:: or ergm:: in front of any command to specify from which package R should run the command.
We are going to look at a network coming from a Swiss high school with about 400 students. The data we are going to use comes from a only one of the classes in this high school. This is what the entire network looks like the figure below.
Friendship network of about 400 Swiss high school students
Today we are going to learn how to:
Today’s activity provides a template for a useful workflow when doing network analysis. In short, the workflow follows three simple steps:
Set your working directory to wherever you have the data:
setwd('')
Read in the data. Remember to use an easy variable name when loading the data.
# Two datasets:
# Students' edgelist
el <- read.table('Edgelist_3Classes_Fall.csv',
sep = ";",
header = TRUE)
# and student's attributes
dt <- read.table('Students_attributes_short.csv',
sep = ";",
header = TRUE)
Let’s look at the edgelist-data first:
head(el)
## studentID friend.ID.code best.friend go.out.friend school.friend free.time.friend
## 1 1894 1095 Yes Yes No Yes
## 2 2200 1109 No No Yes No
## 3 1880 1340 No No Yes Yes
## 4 1854 1340 No No Yes No
## 5 1858 1340 No No Yes No
## 6 1898 1384 No No No Yes
The first column contains the ID of the students who filled out the survey and nominated someone else as a friend. The ID of the nominated friends is stored in the second column. The other columns include information on what kind of friendship does the student have with the nominated friend: best friend; someone you go out and party with; school friend; and/or someone you spend your freetime with).
Before anything else we need to create a vector with all the unique ID-codes of the students.
# This creates a vector with all the unique IDs of the nominating students
# as well as all the nominated students.
students <- unique(c(el$studentID, el$friend.ID.code))
students
## [1] 1894 2200 1880 1854 1858 1898 1842 1683 1950 1681 1886 2206 1836 1410 1476 1844 1063 1448 1872 1109 2018 1384 1497 1902 1679 1468 1884 1653 1916 2202 1404 1340 1418 1675 1677 1671 1878 1651 2210 1095 1649 1701 1900
There are in total 43 students in the network.
To move forward we need to create a network object. There are two ways of moving forward: we can load it directly from the edgelist, or we can create an adjacency matrix. This version of the tutorial is going to create the network object from the edgelist because the homework for today is to load from a matrix. A later version of this document will include both versions.
Although edgelists are a very reliable and efficient way of storing your network data, adjacency matrices are more intuitive (i.e. homework material). Storing a network using an adjacency matrix also helps storing network attributes (by using the rownames of the adjacency matrix and assigning those rownames the corresponding attributes - we’ll see that later).
Following the workflow described above, we are going to create an empty matrix with the unique student IDs as the row and column names so that we can later populate them with the information from the edgelist.
mat <- matrix(0, nrow =length(students) , ncol = length(students))
colnames(mat) <- as.character(students)
rownames(mat) <- as.character(students)
# A glimpse of what this looks like
mat[1:4, 1:4]
## 1894 2200 1880 1854
## 1894 0 0 0 0
## 2200 0 0 0 0
## 1880 0 0 0 0
## 1854 0 0 0 0
We are now going to use a loop to fill the matrix: 1 whenever there is a friendship tie in students, 0 otherwise.
i <- 1
for (i in 1:nrow(el)) {
row.index <- which(students == el[i, 1])
col.index <- which(students == el[i, 2])
mat[row.index, col.index] <- 1
}
mat[1:4, 1:4]
## 1894 2200 1880 1854
## 1894 0 0 0 0
## 2200 0 0 0 0
## 1880 0 0 0 1
## 1854 0 0 1 0
Make sure you verify what you did after every step! For example, the first line from the imported edgelist suggests that the student with ID 1894 nominated 1095 as a friend:
el[1,]
## studentID friend.ID.code best.friend go.out.friend school.friend free.time.friend
## 1 1894 1095 Yes Yes No Yes
This means that the matrix should show a 1 in the row corresponding to 1894 and the column corresponding to 1095.
mat[row.names(mat) == '1894', colnames(mat) == '1095']
## [1] 1
Good! Let’s move on.
The network() function takes edgelists as input to create a network object. To do so, we need a simpler version of the el object. NOtice how we are redefining the names as characters, so that R can read them properly.
el_to_network <- el[, c("studentID", "friend.ID.code")]
el_to_network$studentID <- as.character(el_to_network$studentID)
el_to_network$friend.ID.code <- as.character(el_to_network$friend.ID.code)
We can construct a network object using the edgelist:
nw <- network(el_to_network, directed = TRUE, matrix.type = "edgelist")
Or we can do it by using the adjacency matrix:
nw <- network(mat, directed = TRUE)
We use the ggnet2()-command from the GGally-package. It uses ggplot2 and creates very nice network plots.
Note: This tutorial includes several network plotting instructions. They are here so that you can learn how to plot network objects and get an initial intuition of what you are looking at. However, what we would like you to take away from this particular worksheet is how to create networks from adjacency matrices, add attributes, and set everything up for applying inferential techniques. Visual aids are amazing at helping us understand, but they should only be a complement to rigurous statistical methodology.
ggnet2(nw)
Friendship network
ggnet2(net = nw,
label = TRUE,
label.size = 3,
arrow.size = 3,
arrow.gap = .03)
Friendship network - better
You can also verify what you did using the network plot by adding the vertex (or node) names. Check out this great tutorial on ggnet2 by the author of the function, Francois Briatte.
Once we have the network as a network object, we can add attributes or covariates to the nodes. Important: It is very important to be careful when doing this step: We are going run the analyises using network objects. A mismatch between the original data and that network object will result in erroneous conclusions.
The student names are stored in the students object we created earlier.
# The adjacency matrix
as.character(students)
## [1] "1894" "2200" "1880" "1854" "1858" "1898" "1842" "1683" "1950" "1681" "1886" "2206" "1836" "1410" "1476" "1844" "1063" "1448" "1872" "1109" "2018" "1384" "1497" "1902" "1679" "1468" "1884" "1653" "1916" "2202" "1404" "1340" "1418" "1675" "1677" "1671" "1878" "1651" "2210" "1095" "1649" "1701"
## [43] "1900"
Adding an attribute from the dt-data set, you cannot simply assign the data directly, as the IDs are not stored in the same order:
# The original dataframe names
dt$studentID
## [1] 1063 1095 1109 1340 1404 1410 1418 1448 1468 1476 1651 1653 1675 1677 1679 1681 1683 1836 1844 1854 1858 1872 1878 1880 1884 1886 1894 1898 1900 1902 1916 1950 2018 2200 2202 2206 2210 1842 1497 1384 1649 1671 1701
We can create a new data frame with the row names of the adjacency matrix,
# att for attributes
att <- data.frame('studentID' = as.character(students))
and match the variables you want as attributes to this new data frame:
att$sex <- as.character(dt$sex[match(att$studentID, dt$studentID)])
As before, let’s check that everything is in order:
att$sex[att$studentID == '1109']
## [1] "male"
dt$sex[dt$studentID == '1109']
## [1] male
## Levels: female male
A quick summary of the variable shows the distribution of the attribute:
table(att$sex, useNA="always")
##
## female male <NA>
## 17 22 4
With the auxiliary data frame att we can add the attribute to the network object we defined before nw. Only character or numeric vectors may be added (not factors).
set.vertex.attribute(nw, 'gender', as.character(att$sex))
Alternatively, you can set the attribute using nw %v% 'gender' = as.character(att$sex).
Just running the network name shows its structure. For example, we can see number of vertices and edgesnw has. But most importantly, that there is an attribute called gender!
nw
## Network attributes:
## vertices = 43
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 134
## missing edges= 0
## non-missing edges= 134
##
## Vertex attribute names:
## gender vertex.names
##
## No edge attributes
ggnet2(nw,
label = TRUE,
label.size = 3,
arrow.size = 3,
arrow.gap = .03,
node.color = 'gender',
palette = 'Spectral')
Friendship network of students from 3 classrooms, colored by gender
ggnet2 can use the ColorBrewer predefined palettes for better colour management.
Let’s also add the students’ grade:
table(dt$marks.overall.num, useNA = "always")
##
## 3.5 4 4.5 5 5.5 <NA>
## 1 1 12 22 3 4
att$grade.overall <- dt$marks.overall.num[match(att$studentID, dt$studentID)]
set.vertex.attribute(nw, 'grade', att$grade.overall)
ggnet2(nw,
label = TRUE,
label.size = 3,
arrow.size = 3,
arrow.gap = .03,
node.color = 'grade',
palette = 'Spectral')
We can even plot a graph that has the two attributes represented:
ggnet2(nw,
label = TRUE,
label.size = 3,
node.color = 'grade',
node.shape = 'gender',
palette = 'Spectral')
Friendship network of students from 3 classrooms. Colors by grade (higher = better). Shape by gender
Using colors to describe a continuous scale like a student’s grade is a good idea if we can show a relatable scale in the graph. For this, we have to define our own set of colours. The ColorBrewer2 website generates useful colour schemes which you can copy from the website. In our case, we know that there are 5 possible scores (3.5, 4, 4.5, 5, 5.5), so we need 5 colours:
colours <- c('#f0f9e8','#ccebc5','#a8ddb5','#7bccc4','#43a2ca')
The trick to get user defined colours that are accepted in the palette option of ggnet2 is to match the values we are looking to plot. This can easily be done telling R that the names from the colours vector are the grade values:
names(colours) <- sort(unique(att$grade.overall))
This is a much better looking plot:
ggnet2(nw,
label = TRUE,
label.size = 3,
node.shape = 'gender',
node.color = 'grade',
palette = colours)
Friendship network. (Better) colors by grade (higher = better). Shape by gender
We can use the mice package (Multivariate Imputations by Chained Equations) to impute the grades of the missing students. We are only doing this to make sure that all the students get an assigned grade. Missing data imputation has its benefits and downsides, just as much as excluding observations. You can read some more on how to use the mice package here). You can read more about MICE here.
Note: There is one important thing to be mentioned here. One of the main hypothesis for using MICE is that the variables on the left hand side of the regression equation are supposed to be independent to each other. However, one of the main reasons why we do network analysis is because we know that the observations are interdependent. Which in a way defeats the purpose of using MICE. The question of how to deal with missing observations in networks is still open. A possible solution when dealing with temporal networks (which we are going to look at into later) is to use past observations. Given that a deeper exploration of this problem and potential solutions are beyond the scope of this tutorial, we are just going to use MICE.
It is important to consider the difference between missing data, and nodes that are no longer part of the network. We are going to deal with this second situation later in the lecture.
Up next we are going to do a simple implementation of mice(), just for reference.
imp <- cbind(as.character(att$studentID), att$grade.overall, att$sex)
imp <- mice(imp)
##
## iter imp variable
## 1 1 V2 V3
## 1 2 V2 V3
## 1 3 V2 V3
## 1 4 V2 V3
## 1 5 V2 V3
## 2 1 V2 V3
## 2 2 V2 V3
## 2 3 V2 V3
## 2 4 V2 V3
## 2 5 V2 V3
## 3 1 V2 V3
## 3 2 V2 V3
## 3 3 V2 V3
## 3 4 V2 V3
## 3 5 V2 V3
## 4 1 V2 V3
## 4 2 V2 V3
## 4 3 V2 V3
## 4 4 V2 V3
## 4 5 V2 V3
## 5 1 V2 V3
## 5 2 V2 V3
## 5 3 V2 V3
## 5 4 V2 V3
## 5 5 V2 V3
## Warning: Number of logged events: 60
imp <- complete(imp)
summary(imp$V2)
## 3.5 4 4.5 5 5.5
## 2 1 13 23 4
summary(dt$marks.overall.num)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.500 4.500 5.000 4.821 5.000 5.500 4
summary(imp$V3)
## female male
## 20 23
Let’s replace the old grade variable with this new one we created:
# we use the table function to check everything is going alright:
table(imp$V2, useNA="always")
##
## 3.5 4 4.5 5 5.5 <NA>
## 2 1 13 23 4 0
att$grade.overall <- as.character(imp$V2[match(att$studentID, imp$V1)])
att$sex <- as.character(imp$V3[match(att$studentID, imp$V1)])
set.vertex.attribute(nw, 'grade', att$grade.overall)
set.vertex.attribute(nw, "gender", att$sex)
get.vertex.attribute(nw, attrname = "gender")
## [1] "male" "male" "female" "female" "female" "male" "male" "male" "male" "male" "female" "female" "male" "female" "female" "female" "female" "female" "male" "male" "female" "female" "female" "male" "male" "male" "male" "male" "male" "male" "female" "female"
## [33] "female" "male" "male" "female" "male" "female" "female" "male" "male" "female" "male"
ggnet2(nw,
label = TRUE,
label.size = 3,
node.shape = "gender",
node.color = 'grade',
palette = colours)
Friendship network. (Better) colors by grade imputed (higher = better). Shape by gender
You can also add covariates to edges, instead of to vertices. The example we are using in this tutorial follows the same logic as your homework, so we are not going to go through it in this version of the tutorial.
At this point in time we have a network object with a couple of vertex attributes allowing us to observe the distribution of test scores and gender. But what about the kinds of relationships between the students. We are now going to add an edge covariate according to the friendship intensities in the data we stored in el.
head(el)
## studentID friend.ID.code best.friend go.out.friend school.friend free.time.friend
## 1 1894 1095 Yes Yes No Yes
## 2 2200 1109 No No Yes No
## 3 1880 1340 No No Yes Yes
## 4 1854 1340 No No Yes No
## 5 1858 1340 No No Yes No
## 6 1898 1384 No No No Yes
The vector we are going to create looks like this:
Without more context this is probably the best way to allocate the information provided. Some students gave multiple labels to one friend. We are going to take the highest value. This is, if they labeled their friend as best.friend and as a ‘go.out.friend’, we are going to assume that being ‘best friends’ trumps being ‘going out friends’, so we keep the highest ranked label.
# 4 = best.friend
el$friendship.scale <- ifelse(el$best.friend == 'Yes',
yes = 4,
no = NA)
# 3 = free.time.friend
el$friendship.scale <- ifelse(is.na(el$friendship.scale) &
el$free.time.friend == 'Yes',
yes = 3,
no = el$friendship.scale)
# 2 = school.friend
el$friendship.scale <- ifelse(is.na(el$friendship.scale) &
el$school.friend == 'Yes',
yes = 2,
no = el$friendship.scale)
# 1 = go.out.friend and other
el$friendship.scale <- ifelse(is.na(el$friendship.scale),
yes = 1,
no = el$friendship.scale)
table(el$friendship.scale, useNA="always")
##
## 1 2 3 4 <NA>
## 14 67 29 24 0
We can create a new adjacency matrix that contains these friendship scales instead of a simple 0/1.
# Empty matrix to store the adjacency information
mat.fs <- matrix(0, nrow = length(students), ncol = length(students))
rownames(mat.fs) <- as.character(students)
colnames(mat.fs) <- as.character(students)
for(i in 1:nrow(el)){
row.index <- which(rownames(mat.fs) == el[i, 1])
col.index <- which(colnames(mat.fs) == el[i, 2])
mat.fs[row.index, col.index] <- el$friendship.scale[i]
}
Let’s see what the top-left corner of the matrix looks like:
mat.fs[1:4, 1:4]
## 1894 2200 1880 1854
## 1894 0 0 0 0
## 2200 0 0 0 0
## 1880 0 0 0 3
## 1854 0 0 2 0
Now we can add it to our network object. Presenting the information available nw we can see that now there is one edge attribute called f.scale.
set.edge.value(nw, 'f.scale', mat.fs)
nw
## Network attributes:
## vertices = 43
## directed = TRUE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 134
## missing edges= 0
## non-missing edges= 134
##
## Vertex attribute names:
## gender grade vertex.names
##
## Edge attribute names:
## f.scale
We can make a new plot that includes this information:
ggnet2(nw,
label = TRUE,
label.size = 4,
arrow.size = 3,
arrow.gap = .03,
edge.color = 'f.scale')
Friendship network. Edge colors by friendship intensities
Unfortunately we cannot specify a legend for the edge-colors yet. Plotting stronger relationship as thicker edges might be a better alternative:
ggnet2(nw,
label = TRUE,
label.size = 3,
edge.size = 'f.scale')
Friendship network. Edge width by friendship intensities
Finally, we can put everything we have learned so far together in a single figure:
ggnet2(nw,
label = TRUE,
label.size = 3,
node.shape = 'gender',
node.color = 'grade',
edge.size = 'f.scale' ,
palette = colours)
Friendship network. All info available!