1.0 Overview

In this hands-on exercise, you will learn how to visualise network data using R.

By the end of this hands-on exercise, you will be able to:

  • create graph object data frames and manipulate them using appropriate functions of dplyr, lubridate, and tidygraph,
  • build network graph visualisations using appropriate functions of ggraph,
  • compute network metrics using tidygraph,
  • build advanced graph visualisations by incorporating the network metrics, and
  • build interactive network visualisations using the visNetwork package.

1.1 GAStech Dataset

The data sets used in this hands-on exercise are from an oil exploration and extraction company. There are two data sets:

  • GAStech_email_edge-v2.csv, which consists of two weeks of 9,063 email correspondences between 54 employees, and

  • GAStech_email_node.csv, which consists of the names, departments and titles of the 54 employees.

2.0 Installing and Launching R Packages

Before we get started, it is important to ensure that tidygraph, ggraph, visNetwork, lubridate and tidyverse have been installed in R (igraph is installed automatically as a dependency of tidygraph). If any of them have yet to be installed, you are required to install them.

packages <- c('tidygraph', 'ggraph', 'visNetwork', 'lubridate', 'tidyverse')

# Install each package if it is missing, then attach it
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
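
Before running the loop, it can help to see which of the required packages are actually missing. The sketch below uses only base R; the variable name missing_pkgs is an illustrative assumption and is not part of the exercise.

```r
# Packages required by this exercise (same vector as above)
packages <- c("tidygraph", "ggraph", "visNetwork", "lubridate", "tidyverse")

# Names of required packages that are not found in any library path
missing_pkgs <- setdiff(packages, rownames(installed.packages()))

if (length(missing_pkgs) == 0) {
  message("All required packages are installed.")
} else {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```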

3.0 Data Wrangling

3.1 Importing network data from files

In this step, you will import GAStech_email_node.csv and GAStech_email_edge-v2.csv into the RStudio environment by using read_csv() of the readr package.

GAStech_nodes <- read_csv("data/GAStech_email_node.csv")
GAStech_edges <- read_csv("data/GAStech_email_edge-v2.csv")

Next, we will examine the structure of the data frame using glimpse() of dplyr.

glimpse(GAStech_edges)
## Rows: 9,063
## Columns: 8
## $ source      <dbl> 43, 43, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 26, 26,...
## $ target      <dbl> 41, 40, 51, 52, 53, 45, 44, 46, 48, 49, 47, 54, 27, 28,...
## $ SentDate    <chr> "6/1/2014", "6/1/2014", "6/1/2014", "6/1/2014", "6/1/20...
## $ SentTime    <time> 08:39:00, 08:39:00, 08:58:00, 08:58:00, 08:58:00, 08:5...
## $ Subject     <chr> "GT-SeismicProcessorPro Bug Report", "GT-SeismicProcess...
## $ MainSubject <chr> "Work related", "Work related", "Work related", "Work r...
## $ sourceLabel <chr> "Sven.Flecha", "Sven.Flecha", "Kanon.Herrero", "Kanon.H...
## $ targetLabel <chr> "Isak.Baza", "Lucas.Alcazar", "Felix.Resumir", "Hideki....

Warning: The output of glimpse() above reveals that SentDate is treated as character data instead of date data. This is an error! Before we continue, we need to convert the SentDate field to the Date data type.

3.2 Wrangling time

The code chunks below will be used to perform the changes.

GAStech_edges$SentDate  = dmy(GAStech_edges$SentDate)
GAStech_edges$Weekday = wday(GAStech_edges$SentDate, label = TRUE, abbr = FALSE)

Things to learn from the code chunk above:

  • both dmy() and wday() are functions of the lubridate package. lubridate is an R package that makes it easier to work with dates and times.
  • dmy() parses the SentDate text as day-month-year and converts it to the Date data type.
  • wday() returns the day of the week as a labelled value when label is TRUE. Setting abbr to FALSE spells the day in full, e.g. Monday. The output of wday() is saved in a new column of the data frame called Weekday.
  • the values in the Weekday field are on an ordinal scale.
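
To see why dmy() is the appropriate parser here, the sketch below parses the first SentDate value from the glimpse() output with both dmy() and mdy(). It is a minimal illustration on a single string rather than part of the exercise workflow; the variable names are hypothetical, and the weekday label text depends on the session locale.

```r
library(lubridate)

raw_date <- "6/1/2014"      # first SentDate value in the glimpse() output

as_dmy <- dmy(raw_date)     # read as day/month/year: 6 January 2014
as_mdy <- mdy(raw_date)     # read as month/day/year: 1 June 2014

# wday() with label = TRUE returns an ordered factor;
# abbr = FALSE spells the day name in full
weekday_dmy <- wday(as_dmy, label = TRUE, abbr = FALSE)

format(as_dmy)              # "2014-01-06"
format(as_mdy)              # "2014-06-01"
is.ordered(weekday_dmy)     # TRUE
```

The two parsers give dates almost five months apart from the same text, so the choice of parser has to match how the source file writes its dates.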

The output below shows the data structure of the reformatted GAStech_edges data frame.

glimpse(GAStech_edges)
## Rows: 9,063
## Columns: 9
## $ source      <dbl> 43, 43, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 26, 26,...
## $ target      <dbl> 41, 40, 51, 52, 53, 45, 44, 46, 48, 49, 47, 54, 27, 28,...
## $ SentDate    <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2014-0...
## $ SentTime    <time> 08:39:00, 08:39:00, 08:58:00, 08:58:00, 08:58:00, 08:5...
## $ Subject     <chr> "GT-SeismicProcessorPro Bug Report", "GT-SeismicProcess...
## $ MainSubject <chr> "Work related", "Work related", "Work related", "Work r...
## $ sourceLabel <chr> "Sven.Flecha", "Sven.Flecha", "Kanon.Herrero", "Kanon.H...
## $ targetLabel <chr> "Isak.Baza", "Lucas.Alcazar", "Felix.Resumir", "Hideki....
## $ Weekday     <ord> Monday, Monday, Monday, Monday, Monday, Monday, Monday,...

3.3 Wrangling attributes

A close examination of the GAStech_edges data frame reveals that it consists of individual e-mail flow records. This is not very useful for visualisation. In this section, we will keep only the work-related e-mails and aggregate the individual records by sender, receiver and day of the week.

To accomplish the task, the code chunks below will be used.

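GAStech_edges_aggregated <- GAStech_edges %>%
  filter(MainSubject == "Work related") %>%   # keep work-related e-mails only
  group_by(source, target, Weekday) %>%
  summarise(Weight = n()) %>%                 # count e-mails in each group
  filter(source != target) %>%                # drop e-mails sent to oneself
  filter(Weight > 1)                          # drop one-off exchanges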
GAStech_edges_aggregated
## # A tibble: 1,456 x 4
## # Groups:   source, target [665]
##    source target Weekday   Weight
##     <dbl>  <dbl> <ord>      <int>
##  1      1      2 Monday         4
##  2      1      2 Tuesday        3
##  3      1      2 Wednesday      5
##  4      1      2 Friday         8
##  5      1      3 Monday         4
##  6      1      3 Tuesday        3
##  7      1      3 Wednesday      5
##  8      1      3 Friday         8
##  9      1      4 Monday         4
## 10      1      4 Tuesday        3
## # ... with 1,446 more rows

Things to learn from the code chunk above:

  • four functions from the dplyr package are used. They are: filter(), group_by(), summarise(), and n().
  • The output data.frame is called GAStech_edges_aggregated.
  • A new field called Weight has been added in GAStech_edges_aggregated.
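
The aggregation steps above can be traced on a small example. The sketch below uses a hypothetical six-row edge list (toy_edges is not part of the GAStech data) and adds .groups = "drop" so that the result is returned ungrouped, which differs slightly from the exercise code.

```r
library(dplyr)

# Hypothetical toy edge list: six individual e-mail records
toy_edges <- tibble(
  source  = c(1, 1, 1, 2, 2, 2),
  target  = c(2, 2, 2, 2, 2, 1),
  Weekday = c("Monday", "Monday", "Tuesday", "Monday", "Monday", "Monday")
)

toy_aggregated <- toy_edges %>%
  group_by(source, target, Weekday) %>%
  summarise(Weight = n(), .groups = "drop") %>%  # one row per group, ungrouped
  filter(source != target) %>%                   # removes the 2 -> 2 self-loop
  filter(Weight > 1)                             # removes pairs seen only once

toy_aggregated
```

Only the 1 -> 2 pair on Monday survives, with a Weight of 2: the Tuesday record and the 2 -> 1 record occur once each, and the 2 -> 2 records are self-loops.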

3.4 Creating network objects using tidygraph

One commonly used function of the tidygraph package for creating network objects is:

  • tbl_graph(), which creates a network object from nodes and edges data.

In this section, you will use tbl_graph() of the tidygraph package to build a tbl_graph network object.

GAStech_graph <- tbl_graph(nodes = GAStech_nodes, edges = GAStech_edges_aggregated, directed = TRUE)
GAStech_graph
## # A tbl_graph: 54 nodes and 1456 edges
## #
## # A directed multigraph with 1 component
## #
## # Node Data: 54 x 4 (active)
##      id label               Department     Title                                
##   <dbl> <chr>               <chr>          <chr>                                
## 1     1 Mat.Bramar          Administration Assistant to CEO                     
## 2     2 Anda.Ribera         Administration Assistant to CFO                     
## 3     3 Rachel.Pantanal     Administration Assistant to CIO                     
## 4     4 Linda.Lagos         Administration Assistant to COO                     
## 5     5 Ruscella.Mies.Haber Administration Assistant to Engineering Group Manag~
## 6     6 Carla.Forluniau     Administration Assistant to IT Group Manager        
## # ... with 48 more rows
## #
## # Edge Data: 1,456 x 4
##    from    to Weekday   Weight
##   <int> <int> <ord>      <int>
## 1     1     2 Monday         4
## 2     1     2 Tuesday        3
## 3     1     2 Wednesday      5
## # ... with 1,453 more rows

Note:

  • The output above reveals that GAStech_graph is a tbl_graph object with 54 nodes and 1,456 edges.
  • The command also prints the first six rows of “Node Data” and the first three of “Edge Data”.
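
Notice that the edge data is printed with from and to columns rather than source and target. When an edge table has no from and to columns, tbl_graph() takes its first two columns as the node indices. The sketch below shows this on hypothetical toy data (toy_nodes and toy_edges are illustrative names, not part of the exercise).

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy data: three nodes and two directed, weighted edges
toy_nodes <- data.frame(id = 1:3, label = c("A", "B", "C"))
toy_edges <- data.frame(source = c(1, 2), target = c(2, 3), Weight = c(4, 3))

toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = TRUE)

# Switch to the edge table and extract it as a tibble;
# its first two columns are now named from and to
toy_edge_table <- toy_graph %>%
  activate(edges) %>%
  as_tibble()

names(toy_edge_table)
```

activate() controls whether subsequent verbs act on the node table or the edge table, which is why the printed graph marks one of them as "(active)".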

4.0 Plotting Network Data with ggraph package

ggraph is an extension of ggplot2, making it easier to carry over basic ggplot skills to the design of network graphs. As in all network graphs, there are three main aspects to a ggraph network graph: nodes, edges and layouts. For a comprehensive discussion of each of these aspects, please refer to the respective vignettes provided with the package.

4.1 Plotting a basic network graph

In this section, you will build a basic network graph by using ggraph(), geom_edge_link() and geom_node_point() functions.

ggraph(GAStech_graph) +
  geom_edge_link() +
  geom_node_point()

Things to learn from the code chunk above:

  • The basic plotting function is ggraph(), which takes the data to be used for the graph and the type of layout desired.

4.2 Changing the default network graph theme

The x and y axes are usually redundant in network visualisation and are just a distraction. In this section, you will use theme_graph() to remove the x and y axes.

g <- ggraph(GAStech_graph) + 
  geom_edge_link(aes()) +
  geom_node_point(aes())

g + theme_graph()

Things to learn from the code chunk above:

  • ggraph introduces a special ggplot theme that provides better defaults for network graphs than the normal ggplot defaults. theme_graph(), besides removing axes, grids, and border, changes the font to Arial Narrow.

Furthermore, theme_graph() makes it easy to change the colouring of the plot, as the code chunk below shows with the background and text_colour arguments. Please refer to the documentation for more details.

g <- ggraph(GAStech_graph) + 
  geom_edge_link(aes(colour = Weekday)) +
  geom_node_point(aes())

g + theme_graph(background = 'grey90', text_colour = 'blue')

4.3 Working with ggraph’s layouts

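ggraph supports many layouts, such as stress, circle, nicely, sphere, randomly and fr. The figure below shows the layouts supported by ggraph.

When no layout is specified, ggraph chooses one automatically; for a general network such as this one, it uses the stress layout.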

g <- ggraph(GAStech_graph) + 
  geom_edge_link(aes()) +
  geom_node_point(aes())

g + theme_graph()

4.3.1 Changing layout

The code chunk below plots the network graph using a different layout.

g <- ggraph(GAStech_graph, layout = "nicely") + 
  geom_edge_link(aes()) +
  geom_node_point(aes())

g + theme_graph()

Thing to learn from the code chunk above:

  • layout argument is used to define the layout to be used.
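
A layout is simply an assignment of x and y coordinates to the nodes. As a minimal sketch, create_layout() of ggraph can be used to inspect those coordinates directly. The five-node ring below is a hypothetical toy graph, not the GAStech network.

```r
library(tidygraph)
library(ggraph)

# Hypothetical toy graph: a ring of five nodes
toy_ring <- create_ring(5)

# create_layout() returns the node table with x and y coordinates added
circle_layout <- create_layout(toy_ring, layout = "circle")

# One row per node, with the position chosen by the circle layout
data.frame(x = circle_layout$x, y = circle_layout$y)
```

Swapping "circle" for another layout name changes only these coordinates; the nodes and edges themselves stay the same.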

4.4 Modifying network nodes

In this section, you will colour each node by referring to their respective departments.

g <- ggraph(GAStech_graph, layout = "nicely") + 
  geom_edge_link(aes()) +
  geom_node_point(aes(colour = Department), size = 3)

g + theme_graph()

Things to learn from the code chunks above:

  • geom_node_point() is equivalent in functionality to geom_point() of ggplot2. It allows for simple plotting of nodes in different shapes, colours and sizes. In the code chunk above, colour is mapped to Department and size is set to a constant.

4.5 Modifying edges

g <- ggraph(GAStech_graph, layout = "fr") + 
  geom_edge_link(aes(width=Weight), alpha=0.2) +
  scale_edge_width(range = c(0.1, 5)) +
  geom_node_point(aes(colour = Department), size = 3)

g + theme_graph()

Things to learn from the code chunks above:

  • geom_edge_link() draws edges in the simplest way, as straight lines between the start and end nodes. But it can do more than that. In the example above, the width aesthetic maps the width of each line in proportion to the Weight attribute, and the alpha argument makes the lines partially transparent.

5.0 Creating facet graphs

Another very useful feature of ggraph is facetting. In visualising network data, this technique can be used to reduce edge over-plotting in a very meaningful way by spreading nodes and edges out based on their attributes. In this section, you will learn how to use the facetting technique to visualise network data.

There are three functions in ggraph to implement facetting:

  • facet_nodes(), whereby edges are only drawn in a panel if both terminal nodes are present in that panel,
  • facet_edges(), whereby nodes are always drawn in all panels, even if the node data contains an attribute with the same name as the one used for the edge facetting, and
  • facet_graph(), which facets on two variables simultaneously.

5.1 Working with facet_edges()

In the code chunk below, facet_edges() is used.

set_graph_style()

g <- ggraph(GAStech_graph) + 
  geom_edge_link(aes()) +
  geom_node_point(aes(colour = Department))
  
g + facet_edges(~Weekday)

5.2 Working with facet_nodes()

In the code chunk below, facet_nodes() is used.

set_graph_style()

g <- ggraph(GAStech_graph) + 
  geom_edge_link(aes()) +
  geom_node_point(aes(colour = Department))
  
g + facet_nodes(~Department)+
  th_foreground(foreground = "grey80",  border = TRUE)

6.0 Network Metrics Analysis

6.1 Computing centrality indices

g <- GAStech_graph %>%
  mutate(betweenness_centrality = centrality_betweenness()) %>%
  mutate(closeness_centrality = centrality_closeness()) %>%
  ggraph(layout = "nicely") + 
  geom_edge_link(aes()) +
  geom_node_point(aes(colour = closeness_centrality, size=betweenness_centrality))

g + theme_graph()
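
The code chunk above computes the centrality measures inside mutate() calls on the node table and then maps them to colour and size. As a minimal sketch, the same pattern can be applied to a hypothetical three-node path graph, where the values are easy to check by hand (toy_path and toy_centrality are illustrative names, and centrality_degree() is added here only for comparison).

```r
library(dplyr)
library(tidygraph)

# Hypothetical toy graph: an undirected path a - b - c
toy_path <- tbl_graph(
  nodes = data.frame(name = c("a", "b", "c")),
  edges = data.frame(from = c(1, 2), to = c(2, 3)),
  directed = FALSE
)

# Centrality functions are evaluated on the active (node) table
toy_centrality <- toy_path %>%
  mutate(
    betweenness = centrality_betweenness(),
    degree      = centrality_degree()
  ) %>%
  as_tibble()

toy_centrality
```

Node b lies on the only shortest path between a and c, so it is the only node with a non-zero betweenness.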

7.0 Building Interactive Network Graph with visNetwork

visNetwork is an R package for network visualisation that uses the vis.js JavaScript library.

7.1 Building a basic interactive network graph

visNetwork() function uses a nodes list and edges list to create an interactive graph. The nodes list must include an “id” column, and the edge list must have “from” and “to” columns. The function also plots the labels for the nodes, using the names of the actors from the “label” column in the node list. The resulting graph is fun to play around with. You can move the nodes and the graph will use an algorithm to keep the nodes properly spaced. You can also zoom in and out on the plot and move it around to re-center it.
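
The column requirements described above can be illustrated with a minimal sketch. The toy data below is hypothetical and only shows the column names that visNetwork() looks for.

```r
library(visNetwork)

# Hypothetical toy data with the column names visNetwork() expects
toy_nodes <- data.frame(id = 1:3, label = c("A", "B", "C"))
toy_edges <- data.frame(from = c(1, 2), to = c(2, 3))

toy_widget <- visNetwork(toy_nodes, toy_edges)

# The result is an htmlwidget; printing it at the console or in
# an R Markdown document renders the interactive graph
class(toy_widget)
```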

7.1.1 Data preparation

GAStech_edges_aggregated <- GAStech_edges %>%
  left_join(GAStech_nodes, by = c("sourceLabel" = "label")) %>%
  rename(from = id) %>%
  left_join(GAStech_nodes, by = c("targetLabel" = "label")) %>%
  rename(to = id) %>%
  filter(MainSubject == "Work related") %>%
  group_by(from, to) %>%
    summarise(weight = n()) %>%
  filter(from!=to) %>%
  filter(weight > 1)
## `summarise()` regrouping output by 'from' (override with `.groups` argument)

7.1.2 Plotting the first interactive network graph

visNetwork(GAStech_nodes, GAStech_edges_aggregated)

7.2 Working with layout

visNetwork(GAStech_nodes, GAStech_edges_aggregated) %>%
  visIgraphLayout(layout = "layout_with_fr")

7.3 Working with visual attributes

7.3.1 Nodes

visNetwork() looks for a field called “group” in the nodes object and colours the nodes according to the values of the group field.

The code chunk below renames the Department field to group.

GAStech_nodes <- GAStech_nodes %>%
  rename(group = Department)

When we rerun the code chunk below, visNetwork colours the nodes by assigning a unique colour to each category in the group field.

visNetwork(GAStech_nodes, GAStech_edges_aggregated) %>%
  visIgraphLayout(layout = "layout_with_fr") 

7.3.2 Edges

In the code chunk below, visEdges() is used to symbolise the edges. The arrows argument defines where to place the arrow. The smooth argument plots the edges as smooth curves.

visNetwork(GAStech_nodes, GAStech_edges_aggregated) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visEdges(arrows = "to", smooth = list(type = "curvedCW"))

7.4 Interactivity

In the code chunk below, visOptions() is used to incorporate interactivity features in the data visualisation. The highlightNearest argument highlights the nearest nodes when a node is clicked. The nodesIdSelection argument adds a node selection list by id, created as an HTML select element.

visNetwork(GAStech_nodes, GAStech_edges_aggregated) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)