We will use three new packages: the sna package, the network package, and the visNetwork package. If you do not already have these packages installed, first install them using the install.packages() function.
install.packages(c("sna", "network", "visNetwork"))
Next, we load these packages for use in the session.
library(sna)
library(network)
library(visNetwork)
In the lesson that follows, we use two CSV files: marvel_nodes1.csv and marvel_edge_data1.csv. The first dataset, marvel_nodes1.csv, contains node information: the top 4-5 actors from each of 5 Marvel movies.
We use the read.csv() function to import the marvel_nodes1.csv file into R as a dataframe named marvel_nodes. We set stringsAsFactors = FALSE to keep any character columns as-is and provide column header information using the col.names argument.
marvel_nodes <- read.csv(file = "marvel_nodes1.csv",
stringsAsFactors = FALSE, # import character variables as-is
col.names = c("id", "label"))
The second dataset, marvel_edge_data1.csv, contains information about actors having roles in common movies. The movies included are: The Avengers (2012), Captain America: The Winter Soldier (2014), Avengers: Age of Ultron (2015), Captain America: Civil War (2016), and Thor: Ragnarok (2017).
We use the read.csv() function to import the marvel_edge_data1.csv file into R as a dataframe named marvel_edges. We set stringsAsFactors = FALSE to keep any character columns as-is and provide column header information using the col.names argument.
marvel_edges <- read.csv(file = "marvel_edge_data1.csv",
stringsAsFactors = FALSE, # import character variables as-is
col.names = c("from", "to", "width"))
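Before plotting, it can help to confirm that every id used in the edge data refers to a row in the node data. The following is a minimal base R sketch of that check. The toy_nodes and toy_edges dataframes are made-up stand-ins, not the contents of the lesson's CSV files; the same logic could be applied to marvel_nodes and marvel_edges.

```r
# Hypothetical stand-ins for the imported dataframes (not the real CSV contents)
toy_nodes <- data.frame(id = 1:3,
                        label = c("A", "B", "C"),
                        stringsAsFactors = FALSE)
toy_edges <- data.frame(from = c(1, 1, 2),
                        to = c(2, 3, 3),
                        width = c(3, 1, 1))

# Collect every node id referenced by an edge
edge_ids <- unique(c(toy_edges$from, toy_edges$to))

# Ids used in the edge data that have no matching row in the node data
unmatched_ids <- setdiff(edge_ids, toy_nodes$id)

# TRUE when every edge endpoint is a known node
edges_are_valid <- length(unmatched_ids) == 0
edges_are_valid
```

If unmatched_ids is not empty, the listed ids point to nodes missing from the node file.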
First, we can view the marvel_nodes dataframe, since it is small.
marvel_nodes
## id label
## 1 1 Anthony Mackie
## 2 2 Chris Evans
## 3 3 Chris Hemsworth
## 4 4 Jeremy Renner
## 5 5 Mark Ruffalo
## 6 6 Robert Downey Jr.
## 7 7 Samuel L. Jackson
## 8 8 Scarlett Johansson
## 9 9 Sebastian Stan
## 10 10 Tom Hiddleston
## 11 11 Cate Blanchett
Since the marvel_edges dataframe is larger, we can view structure information using the str() function.
str(marvel_edges)
## 'data.frame': 24 obs. of 3 variables:
## $ from : int 2 2 2 2 2 2 2 2 3 3 ...
## $ to : int 8 1 3 4 5 6 7 9 5 10 ...
## $ width: int 3 1 1 1 1 3 1 1 2 1 ...
Each observation in the marvel_edges dataset represents a relationship between two actors, or nodes in the network. The actors are represented by their corresponding numbers in the marvel_nodes dataframe. The width variable holds the edge weight, which is the number of movies in which both actors appear.
Social network data can either be in the form of an adjacency matrix or separate dataframes of nodes and edges, as we are using in the analysis. Later, we will also view the data in adjacency matrix format.
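To make the two formats concrete, here is a minimal base R sketch that converts a small edge list into an adjacency matrix. The four-node toy_edges data is made up for illustration and is not taken from the Marvel files.

```r
# Hypothetical undirected edge list on 4 nodes (illustration only)
toy_edges <- data.frame(from = c(1, 1, 2),
                        to = c(2, 3, 4))
n_nodes <- 4

# Start with an all-zero matrix: no connections yet
adj <- matrix(0, nrow = n_nodes, ncol = n_nodes)

# A two-column matrix used as an index selects (row, column) cells
idx <- as.matrix(toy_edges)
adj[idx] <- 1            # mark from -> to
adj[idx[, 2:1]] <- 1     # mark to -> from, so the matrix is symmetric

adj
```

Because the network is undirected, each edge fills two cells, so the matrix is symmetric and its diagonal stays at 0.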
We use the visNetwork() function from the visNetwork package to visualize the social network. Since the width column is included in the marvel_edges data, the edge weight will be represented in the social network plot.
In the plot, nodes represent actors and edges represent actors appearing in the same movie. The plot displays weight information by increasing the width of the connection, or edge, between two nodes. As shown, Chris Evans appears most frequently with Scarlett Johansson and Robert Downey, Jr., with each of those edges having a weight of 3. Scarlett Johansson appears in 2 films with Robert Downey, Jr., and Mark Ruffalo appears with Chris Hemsworth in 2 films. All other pairs of actors appear together in either 1 or 0 movies.
visNetwork(nodes = marvel_nodes,
edges = marvel_edges)
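visNetwork takes edge thickness from a column named width in the edge data, which is why the third column is named width on import. As a sketch, if your own edge data stored the count under a different name, you could rename the column before plotting. The toy dataframes and the weight column name below are hypothetical, and the plotting call is guarded so the snippet still runs when visNetwork is not installed.

```r
# Hypothetical data: edge counts stored under the name "weight"
toy_nodes <- data.frame(id = 1:3,
                        label = c("A", "B", "C"))
toy_edges <- data.frame(from = c(1, 1, 2),
                        to = c(2, 3, 3),
                        weight = c(3, 1, 1))

# Rename "weight" to "width" so it is used as the edge thickness
names(toy_edges)[names(toy_edges) == "weight"] <- "width"

# Build the plot object only if the package is available
if (requireNamespace("visNetwork", quietly = TRUE)) {
  toy_plot <- visNetwork::visNetwork(nodes = toy_nodes,
                                     edges = toy_edges)
  # Print toy_plot in an interactive session to display the network
}
```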
To describe the network and compute centrality measures, we need to convert our data to a network object. Note: when we do this, we lose weight information.
We combine the two dataframes into a network object using the network() function in the network package. First, we create the object using the marvel_edges data. Since the network object does not retain weight information in a meaningful way, we use only the first two columns of marvel_edges. We are creating an undirected network, so we set directed = FALSE. The input data is in the form of an edgelist, so we set matrix.type = "edgelist". Other matrix.type options are adjacency and incidence. We set print.adj = TRUE to create an adjacency matrix from our data.
net <- network(x = marvel_edges[,1:2],
directed = FALSE,
matrix.type = "edgelist",
print.adj = TRUE)
Next, we can assign the actor names as labels in our network.
network.vertex.names(net) <- marvel_nodes$label
Next, we can view high-level information about our network by running the name of the network object.
net
## Network attributes:
## vertices = 11
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 24
## missing edges= 0
## non-missing edges= 24
##
## Vertex attribute names:
## vertex.names
##
## No edge attributes
Next, we can view our network as an adjacency matrix. In adjacency matrix format, the table displays a 1 if an edge exists between 2 actors and a 0 otherwise. Zeroes are displayed along the diagonal of the symmetric matrix, since actors cannot be connected to themselves in this network. As shown, Chris Evans is connected to nearly every actor in the network.
net[,]
## Anthony Mackie Chris Evans Chris Hemsworth Jeremy Renner
## Anthony Mackie 0 1 0 0
## Chris Evans 1 0 1 1
## Chris Hemsworth 0 1 0 0
## Jeremy Renner 0 1 0 0
## Mark Ruffalo 0 1 1 0
## Robert Downey Jr. 0 1 1 1
## Samuel L. Jackson 1 1 0 0
## Scarlett Johansson 1 1 0 1
## Sebastian Stan 0 1 0 0
## Tom Hiddleston 0 0 1 0
## Cate Blanchett 0 0 1 0
## Mark Ruffalo Robert Downey Jr. Samuel L. Jackson
## Anthony Mackie 0 0 1
## Chris Evans 1 1 1
## Chris Hemsworth 1 1 0
## Jeremy Renner 0 1 0
## Mark Ruffalo 0 1 0
## Robert Downey Jr. 1 0 0
## Samuel L. Jackson 0 0 0
## Scarlett Johansson 0 1 1
## Sebastian Stan 0 1 0
## Tom Hiddleston 1 0 0
## Cate Blanchett 1 0 0
## Scarlett Johansson Sebastian Stan Tom Hiddleston
## Anthony Mackie 1 0 0
## Chris Evans 1 1 0
## Chris Hemsworth 0 0 1
## Jeremy Renner 1 0 0
## Mark Ruffalo 0 0 1
## Robert Downey Jr. 1 1 0
## Samuel L. Jackson 1 0 0
## Scarlett Johansson 0 1 0
## Sebastian Stan 1 0 0
## Tom Hiddleston 0 0 0
## Cate Blanchett 0 0 1
## Cate Blanchett
## Anthony Mackie 0
## Chris Evans 0
## Chris Hemsworth 1
## Jeremy Renner 0
## Mark Ruffalo 1
## Robert Downey Jr. 0
## Samuel L. Jackson 0
## Scarlett Johansson 0
## Sebastian Stan 0
## Tom Hiddleston 1
## Cate Blanchett 0
We describe a social network with respect to its network size and density. We can obtain both using the summary() function on our net network object. We set print.adj = FALSE to suppress adjacency matrix output.
summary(object = net, # network object
print.adj = FALSE) # omit adjacency matrix
## Network attributes:
## vertices = 11
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges = 24
## missing edges = 0
## non-missing edges = 24
## density = 0.4363636
##
## Vertex attributes:
## vertex.names:
## character valued attribute
## 11 valid vertex names
##
## No edge attributes
We can also obtain this information directly.
Network size is defined as the number of nodes, or network vertices. We can isolate this information from the net object using the network.size() function.
nt_sz <- network.size(x = net)
nt_sz
## [1] 11
The network size is 11, since there are 11 actors.
Network density is defined as the number of connections in the network divided by the total number of possible connections.
The number of connections in the network is 24:
network.edgecount(x = net)
## [1] 24
The total possible connections in the network is found using the formula (n * (n-1))/2, where n is the network size.
(nt_sz * (nt_sz - 1))/2
## [1] 55
The total possible connections in the network is 55.
Finally, we can obtain the density by computing it manually (24/55), or by using the network.density() function from the network package.
network.density(x = net)
## [1] 0.4363636
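The manual calculation can be wrapped into a single step. This is a minimal sketch using the counts reported above (11 nodes and 24 edges); the function name undirected_density is hypothetical and is not part of the network package.

```r
# Density of an undirected network: observed edges / possible edges
undirected_density <- function(n_nodes, n_edges) {
  possible_edges <- n_nodes * (n_nodes - 1) / 2  # each pair counted once
  n_edges / possible_edges
}

# Counts reported for the Marvel network above
density_manual <- undirected_density(n_nodes = 11, n_edges = 24)
density_manual  # 24 / 55, about 0.4363636
```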
Based on the network density value, we have a moderately dense network: about 44% of all possible connections are present.
For each of the actors (nodes) in the network, we can compute centrality measures.
Degree centrality represents the sum of connections from or to an individual. We use the degree() function from the sna package and set gmode = "graph". We obtain the output as a dataframe, since the measures are for each node.
data.frame(Actor = marvel_nodes$label,
Degree = degree(dat = net,
gmode = "graph"))
## Actor Degree
## 1 Anthony Mackie 3
## 2 Chris Evans 8
## 3 Chris Hemsworth 5
## 4 Jeremy Renner 3
## 5 Mark Ruffalo 5
## 6 Robert Downey Jr. 6
## 7 Samuel L. Jackson 3
## 8 Scarlett Johansson 6
## 9 Sebastian Stan 3
## 10 Tom Hiddleston 3
## 11 Cate Blanchett 3
As shown, our most connected actor is Chris Evans, who is connected to 8 of the 10 other actors in the network, giving him a degree centrality of 8. Scarlett Johansson and Robert Downey, Jr. are the next most connected actors, each with a degree centrality of 6. Mark Ruffalo and Chris Hemsworth each have a degree centrality of 5. All other actors have fewer connections, each having a degree centrality of 3.
Closeness centrality measures the proximity of an individual to all other individuals in the network. We use the closeness() function from the sna package and set cmode = "undirected", since our graph is undirected. We obtain the output as a dataframe, since the measures are for each node. The value is normalized by using n-1 in the numerator, rather than 1, where n is the network size. Each score is therefore n-1 divided by the sum of the shortest-path distances from that node to every other node.
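In an undirected network, degree centrality is the row sum of the adjacency matrix. The base R sketch below rebuilds that matrix from the 24 actor pairs read off the adjacency matrix printed earlier and then sums each row. The node ids follow marvel_nodes; the ordering of the pairs is arbitrary and may differ from the CSV file.

```r
# Actor pairs read off the printed adjacency matrix (ids as in marvel_nodes)
edge_pairs <- matrix(c(1, 2,   1, 7,   1, 8,
                       2, 3,   2, 4,   2, 5,   2, 6,
                       2, 7,   2, 8,   2, 9,
                       3, 5,   3, 6,   3, 10,  3, 11,
                       4, 6,   4, 8,
                       5, 6,   5, 10,  5, 11,
                       6, 8,   6, 9,
                       7, 8,
                       8, 9,
                       10, 11),
                     ncol = 2, byrow = TRUE)

# Fill a symmetric 11 x 11 adjacency matrix
adj <- matrix(0, nrow = 11, ncol = 11)
adj[edge_pairs] <- 1
adj[edge_pairs[, 2:1]] <- 1

# Degree = number of 1s in each row
degree_manual <- rowSums(adj)
degree_manual  # 3 8 5 3 5 6 3 6 3 3 3
```

The result matches the Degree column above, row by row.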
data.frame(Actor = marvel_nodes$label,
Closeness = closeness(dat = net,
cmode = "undirected"))
## Actor Closeness
## 1 Anthony Mackie 0.5263158
## 2 Chris Evans 0.8333333
## 3 Chris Hemsworth 0.6666667
## 4 Jeremy Renner 0.5263158
## 5 Mark Ruffalo 0.6666667
## 6 Robert Downey Jr. 0.7142857
## 7 Samuel L. Jackson 0.5263158
## 8 Scarlett Johansson 0.6250000
## 9 Sebastian Stan 0.5263158
## 10 Tom Hiddleston 0.4545455
## 11 Cate Blanchett 0.4545455
As shown, Chris Evans is closest, or most central in the network, since he is directly connected to the majority of other nodes and therefore has the closeness centrality nearest to 1. Robert Downey, Jr. has the next highest closeness centrality measure, followed by Chris Hemsworth and Mark Ruffalo, who are tied.
Another way of measuring an individual's centrality is betweenness centrality, which is based on how frequently an individual lies on the shortest paths between other individuals. We use the betweenness() function from the sna package and set gmode = "graph" and cmode = "undirected". We obtain the output as a dataframe, since the measures are for each node.
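To see where these values come from, the closeness score can be reproduced by hand for two actors. In this sketch the shortest-path distances were counted manually from the adjacency matrix printed earlier, and the helper name closeness_from_distances is hypothetical. Chris Evans has 8 actors one step away and 2 actors (Tom Hiddleston and Cate Blanchett) two steps away. Tom Hiddleston has 3 actors one step away, 2 actors two steps away, and 5 actors three steps away.

```r
n_nodes <- 11

# Normalized closeness: (n - 1) / sum of distances to all other nodes
closeness_from_distances <- function(distances, n) {
  (n - 1) / sum(distances)
}

# Distances counted by hand from the adjacency matrix
evans_distances <- c(rep(1, 8), rep(2, 2))                    # sum = 12
hiddleston_distances <- c(rep(1, 3), rep(2, 2), rep(3, 5))    # sum = 22

evans_closeness <- closeness_from_distances(evans_distances, n_nodes)
hiddleston_closeness <- closeness_from_distances(hiddleston_distances, n_nodes)

evans_closeness       # 10 / 12, about 0.8333333
hiddleston_closeness  # 10 / 22, about 0.4545455
```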
data.frame(Actor = marvel_nodes$label,
Between = betweenness(dat = net,
gmode = "graph",
cmode = "undirected"))
## Actor Between
## 1 Anthony Mackie 0.000000
## 2 Chris Evans 17.333333
## 3 Chris Hemsworth 7.000000
## 4 Jeremy Renner 0.000000
## 5 Mark Ruffalo 7.000000
## 6 Robert Downey Jr. 6.333333
## 7 Samuel L. Jackson 0.000000
## 8 Scarlett Johansson 3.333333
## 9 Sebastian Stan 0.000000
## 10 Tom Hiddleston 0.000000
## 11 Cate Blanchett 0.000000
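The idea behind these scores can be illustrated on a very small graph. The following is a brute-force base R sketch, not the algorithm used by sna: for each node it adds up, over every pair of other nodes, the share of shortest paths that pass through it. The function names and the four-node star graph are hypothetical. Here node 1 is the hub and nodes 2 to 4 connect only to it.

```r
# Shortest-path lengths and counts via repeated matrix multiplication:
# the first k at which a walk of length k exists is the distance, and the
# number of such walks is the number of shortest paths.
shortest_paths <- function(adj) {
  n <- nrow(adj)
  dist <- matrix(Inf, n, n)
  diag(dist) <- 0
  sigma <- diag(n)   # one trivial path from each node to itself
  walks <- diag(n)
  for (k in seq_len(n - 1)) {
    walks <- walks %*% adj
    newly <- is.infinite(dist) & walks > 0
    dist[newly] <- k
    sigma[newly] <- walks[newly]
  }
  list(dist = dist, sigma = sigma)
}

# Undirected betweenness: each unordered pair (s, t) is counted once
betweenness_manual <- function(adj) {
  sp <- shortest_paths(adj)
  n <- nrow(adj)
  sapply(seq_len(n), function(v) {
    total <- 0
    for (s in seq_len(n - 1)) {
      for (t in (s + 1):n) {
        if (s == v || t == v) next
        on_path <- is.finite(sp$dist[s, t]) &&
          sp$dist[s, v] + sp$dist[v, t] == sp$dist[s, t]
        if (on_path) {
          total <- total + sp$sigma[s, v] * sp$sigma[v, t] / sp$sigma[s, t]
        }
      }
    }
    total
  })
}

# Hypothetical star graph: node 1 is connected to nodes 2, 3 and 4
star_adj <- matrix(0, 4, 4)
star_adj[1, 2:4] <- 1
star_adj[2:4, 1] <- 1

star_betweenness <- betweenness_manual(star_adj)
star_betweenness  # 3 0 0 0
```

The hub lies on the only shortest path between each of the 3 pairs of outer nodes, so it scores 3. The outer nodes lie between no one and score 0, just as the actors with a betweenness of 0 in the table above never sit on a shortest path between two other actors.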