In this hands-on exercise, you will learn how to visualise network data using R.
By the end of this hands-on exercise, you will be able to:
The data sets used in this hands-on exercise are from an oil exploration and extraction company. There are two data sets:
Before we get started, it is important to ensure that the packages used in this exercise (tidygraph, ggraph, visNetwork, lubridate and tidyverse) have been installed in R. If any of them have yet to be installed, the code chunk below installs them before loading them.
packages = c('tidygraph', 'ggraph', 'visNetwork', 'lubridate', 'tidyverse')
for (p in packages) {
  # install the package only if it cannot be loaded
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
In this step, you will import GAStech_email_node.csv and GAStech_email_edge-v2.csv into the RStudio environment by using read_csv() of the readr package.
GAStech_nodes <- read_csv("data/GAStech_email_node.csv")
GAStech_edges <- read_csv("data/GAStech_email_edge-v2.csv")
Next, we will examine the structure of the data frame using glimpse() of dplyr.
glimpse(GAStech_edges)
## Rows: 9,063
## Columns: 8
## $ source <dbl> 43, 43, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 26, 26,...
## $ target <dbl> 41, 40, 51, 52, 53, 45, 44, 46, 48, 49, 47, 54, 27, 28,...
## $ SentDate <chr> "6/1/2014", "6/1/2014", "6/1/2014", "6/1/2014", "6/1/20...
## $ SentTime <time> 08:39:00, 08:39:00, 08:58:00, 08:58:00, 08:58:00, 08:5...
## $ Subject <chr> "GT-SeismicProcessorPro Bug Report", "GT-SeismicProcess...
## $ MainSubject <chr> "Work related", "Work related", "Work related", "Work r...
## $ sourceLabel <chr> "Sven.Flecha", "Sven.Flecha", "Kanon.Herrero", "Kanon.H...
## $ targetLabel <chr> "Isak.Baza", "Lucas.Alcazar", "Felix.Resumir", "Hideki....
Warning: The output of glimpse() above reveals that SentDate is treated as a character field instead of a date field. This is an error! Before we continue, it is important to change the data type of the SentDate field to Date.
The code chunks below will be used to perform the changes.
GAStech_edges$SentDate = dmy(GAStech_edges$SentDate)
GAStech_edges$Weekday = wday(GAStech_edges$SentDate, label = TRUE, abbr = FALSE)
Things to learn from the code chunk above:
The output below shows the data structure of the reformatted GAStech_edges data frame.
glimpse(GAStech_edges)
## Rows: 9,063
## Columns: 9
## $ source <dbl> 43, 43, 44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 26, 26,...
## $ target <dbl> 41, 40, 51, 52, 53, 45, 44, 46, 48, 49, 47, 54, 27, 28,...
## $ SentDate <date> 2014-01-06, 2014-01-06, 2014-01-06, 2014-01-06, 2014-0...
## $ SentTime <time> 08:39:00, 08:39:00, 08:58:00, 08:58:00, 08:58:00, 08:5...
## $ Subject <chr> "GT-SeismicProcessorPro Bug Report", "GT-SeismicProcess...
## $ MainSubject <chr> "Work related", "Work related", "Work related", "Work r...
## $ sourceLabel <chr> "Sven.Flecha", "Sven.Flecha", "Kanon.Herrero", "Kanon.H...
## $ targetLabel <chr> "Isak.Baza", "Lucas.Alcazar", "Felix.Resumir", "Hideki....
## $ Weekday <ord> Monday, Monday, Monday, Monday, Monday, Monday, Monday,...
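The effect of dmy() and wday() is easier to see on a small vector. The sketch below uses made-up date strings in the same day/month/year form as SentDate; toy_dates is a hypothetical name, and the mdy() call is included only to show what the wrong parser would produce.

```r
library(lubridate)

# Hypothetical sample strings in day/month/year order (not read from the CSV)
toy_dates <- c("6/1/2014", "7/1/2014")

# dmy() reads the first number as the day: 6 January and 7 January 2014
parsed <- dmy(toy_dates)
parsed

# mdy() would read the same strings as 1 June and 1 July 2014
wrong <- mdy(toy_dates)
wrong

# wday() with label = TRUE returns an ordered factor of weekday names
weekday <- wday(parsed, label = TRUE, abbr = FALSE)
weekday
```

Because the weekday is stored as an ordered factor, it sorts in calendar order rather than alphabetically when it is later used in plots or facets.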
A close examination of the GAStech_edges data frame reveals that it consists of individual e-mail flow records. This is not very useful for visualisation. In this section, we will keep the work-related e-mails and aggregate the individual records by sender, receiver and day of the week.
To accomplish the task, the code chunks below will be used.
GAStech_edges_aggregated <- GAStech_edges %>%
filter(MainSubject == "Work related") %>%
group_by(source, target, Weekday) %>%
summarise(Weight = n()) %>%
filter(source!=target) %>%
filter(Weight > 1)
GAStech_edges_aggregated
## # A tibble: 1,456 x 4
## # Groups: source, target [665]
## source target Weekday Weight
## <dbl> <dbl> <ord> <int>
## 1 1 2 Monday 4
## 2 1 2 Tuesday 3
## 3 1 2 Wednesday 5
## 4 1 2 Friday 8
## 5 1 3 Monday 4
## 6 1 3 Tuesday 3
## 7 1 3 Wednesday 5
## 8 1 3 Friday 8
## 9 1 4 Monday 4
## 10 1 4 Tuesday 3
## # ... with 1,446 more rows
Things to learn from the code chunk above:
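The same group, count and filter pattern can be tried on a toy edge list. This is a minimal sketch: toy_edges and its values are made up for illustration and are not part of the GAStech data. Setting .groups = "drop" is an optional addition that returns an ungrouped tibble.

```r
library(dplyr)

# Hypothetical toy edge list: one row per e-mail
toy_edges <- tibble(
  source  = c(1, 1, 1, 2, 2, 3),
  target  = c(2, 2, 3, 2, 1, 1),
  Weekday = c("Monday", "Monday", "Monday", "Monday", "Tuesday", "Tuesday")
)

toy_aggregated <- toy_edges %>%
  group_by(source, target, Weekday) %>%
  summarise(Weight = n(), .groups = "drop") %>%  # number of e-mails per group
  filter(source != target) %>%                   # drop e-mails sent to oneself
  filter(Weight > 1)                             # keep pairs with repeated contact

toy_aggregated
```

Only the pair (1, 2) on Monday survives, with a Weight of 2: every other combination either occurs once or is a self-addressed e-mail.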
One commonly used function of the tidygraph package for creating network objects is tbl_graph().
In this section, you will use tbl_graph() to build a network graph object from the nodes and the aggregated edges data.
GAStech_graph <- tbl_graph(nodes = GAStech_nodes, edges = GAStech_edges_aggregated, directed = TRUE)
GAStech_graph
## # A tbl_graph: 54 nodes and 1456 edges
## #
## # A directed multigraph with 1 component
## #
## # Node Data: 54 x 4 (active)
## id label Department Title
## <dbl> <chr> <chr> <chr>
## 1 1 Mat.Bramar Administration Assistant to CEO
## 2 2 Anda.Ribera Administration Assistant to CFO
## 3 3 Rachel.Pantanal Administration Assistant to CIO
## 4 4 Linda.Lagos Administration Assistant to COO
## 5 5 Ruscella.Mies.Haber Administration Assistant to Engineering Group Manag~
## 6 6 Carla.Forluniau Administration Assistant to IT Group Manager
## # ... with 48 more rows
## #
## # Edge Data: 1,456 x 4
## from to Weekday Weight
## <int> <int> <ord> <int>
## 1 1 2 Monday 4
## 2 1 2 Tuesday 3
## 3 1 2 Wednesday 5
## # ... with 1,453 more rows
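A tbl_graph object holds two tables, one for nodes and one for edges, and activate() selects which of them the dplyr verbs operate on. The sketch below illustrates this on a three-node toy graph; toy_nodes, toy_edges and their values are assumptions made for the example.

```r
library(tidygraph)
library(dplyr)

# Hypothetical toy data: edges refer to nodes by row position
toy_nodes <- tibble(id = 1:3, label = c("A", "B", "C"))
toy_edges <- tibble(
  from   = c(1L, 1L, 2L),
  to     = c(2L, 3L, 3L),
  Weight = c(4, 2, 5)
)

toy_graph <- tbl_graph(nodes = toy_nodes, edges = toy_edges, directed = TRUE)

# Switch to the edge table, sort it by weight and extract it as a tibble
heaviest_first <- toy_graph %>%
  activate(edges) %>%
  arrange(desc(Weight)) %>%
  as_tibble()

heaviest_first
```

The same approach can be applied to GAStech_graph to find the pairs of employees with the heaviest e-mail traffic.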
Note:
ggraph is an extension of ggplot2, making it easier to carry over basic ggplot skills to the design of network graphs. As in all network graphs, there are three main aspects to a ggraph network graph: nodes, edges and layouts. For a comprehensive discussion of each of these aspects, please refer to the respective vignettes provided.
In this section, you will build a basic network graph by using ggraph(), geom_edge_link() and geom_node_point() functions.
ggraph(GAStech_graph) +
geom_edge_link() +
geom_node_point()
Things to learn from the code chunk above:
The x and y axes are usually redundant in network visualisation and are just a distraction. In this section, you will use theme_graph() to remove them.
g <- ggraph(GAStech_graph) +
geom_edge_link(aes()) +
geom_node_point(aes())
g + theme_graph()
Things to learn from the code chunk above:
Furthermore, theme_graph() makes it easy to change the coloring of the plot. Please refer to the documentation for more details.
g <- ggraph(GAStech_graph) +
geom_edge_link(aes(colour = Weekday)) +
geom_node_point(aes())
g + theme_graph(background = 'grey90', text_colour = 'blue')
ggraph supports many layouts, such as stress, circle, nicely, sphere, randomly and fr. The figure below shows the layouts supported by ggraph.
The default layout used is called stress.
g <- ggraph(GAStech_graph) +
geom_edge_link(aes()) +
geom_node_point(aes())
g + theme_graph()
The code chunk below will be used to plot the network graph using a different layout.
g <- ggraph(GAStech_graph, layout = "nicely") +
geom_edge_link(aes()) +
geom_node_point(aes())
g + theme_graph()
Thing to learn from the code chunk above:
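A layout is simply a set of x and y coordinates assigned to the nodes. As a rough sketch, create_layout() of ggraph can be used to inspect those coordinates directly; the six-node ring built with create_ring() of tidygraph is a stand-in graph chosen for this example, and set.seed() is added because force-directed layouts such as fr start from random positions.

```r
library(tidygraph)
library(ggraph)

# Stand-in graph for illustration: six nodes joined in a ring
ring <- create_ring(6)

# Fix the random seed so the force-directed layout is reproducible
set.seed(42)
fr_layout <- create_layout(ring, layout = "fr")

# A deterministic layout needs no seed
circle_layout <- create_layout(ring, layout = "circle")

# Each layout is a data frame with one row per node and x/y columns
head(fr_layout)
head(circle_layout)
```

Calling set.seed() before ggraph() in the same way keeps a plot stable from one run to the next.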
In this section, you will colour each node according to its department.
g <- ggraph(GAStech_graph, layout = "nicely") +
geom_edge_link(aes()) +
geom_node_point(aes(colour = Department), size = 3)
g + theme_graph()
Things to learn from the code chunks above:
g <- ggraph(GAStech_graph, layout = "fr") +
geom_edge_link(aes(width=Weight), alpha=0.2) +
scale_edge_width(range = c(0.1, 5)) +
geom_node_point(aes(colour = Department), size = 3)
g + theme_graph()
Things to learn from the code chunks above:
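One point worth noting is the difference between mapping and setting: an argument inside aes() is mapped to a column of the data, while an argument outside aes() is a constant applied to every node or edge. The sketch below shows both on a toy graph; the node and edge values are made up for illustration.

```r
library(tidygraph)
library(ggraph)

# Hypothetical toy graph with a Department attribute and edge weights
toy_graph <- tbl_graph(
  nodes = data.frame(label = c("A", "B", "C"),
                     Department = c("IT", "IT", "Administration")),
  edges = data.frame(from = c(1L, 1L, 2L),
                     to = c(2L, 3L, 3L),
                     Weight = c(4, 2, 5)),
  directed = TRUE
)

set.seed(1234)  # the fr layout is randomised
p <- ggraph(toy_graph, layout = "fr") +
  geom_edge_link(aes(width = Weight), alpha = 0.2) +    # width mapped, alpha set
  scale_edge_width(range = c(0.1, 5)) +                 # thinnest and thickest edge
  geom_node_point(aes(colour = Department), size = 3) + # colour mapped, size set
  theme_graph()

p
```

Placing size = 3 inside aes() would instead treat 3 as a data value and add a meaningless size legend to the plot.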
Another very useful feature of ggraph is faceting. In visualising network data, this technique can be used to reduce edge over-plotting in a very meaningful way by spreading nodes and edges out based on their attributes. In this section, you will learn how to use the faceting technique to visualise network data.
There are three functions in ggraph to implement faceting: facet_nodes(), facet_edges() and facet_graph().
In the code chunk below, facet_edges() is used.
set_graph_style()
g <- ggraph(GAStech_graph) +
geom_edge_link(aes()) +
geom_node_point(aes(colour = Department))
g + facet_edges(~Weekday)
In the code chunk below, facet_nodes() is used.
set_graph_style()
g <- ggraph(GAStech_graph) +
geom_edge_link(aes()) +
geom_node_point(aes(colour = Department))
g + facet_nodes(~Department)+
th_foreground(foreground = "grey80", border = TRUE)
#ggsave("test.jpg", g, dpi = 300)
g <- GAStech_graph %>%
mutate(betweenness_centrality = centrality_betweenness()) %>%
mutate(closeness_centrality = centrality_closeness()) %>%
ggraph(layout = "nicely") +
geom_edge_link(aes()) +
geom_node_point(aes(colour = closeness_centrality, size=betweenness_centrality))
g + theme_graph()
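The centrality functions of tidygraph are called inside mutate() and add one value per node. As a minimal sketch, the values can be checked on a small graph where the answer is obvious; the three-node path A - B - C below is a made-up example in which B is the only node lying between two others.

```r
library(tidygraph)
library(dplyr)

# Hypothetical undirected path graph: A - B - C
path_graph <- tbl_graph(
  nodes = tibble(label = c("A", "B", "C")),
  edges = tibble(from = c(1L, 2L), to = c(2L, 3L)),
  directed = FALSE
)

centrality_tbl <- path_graph %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         degree_centrality = centrality_degree()) %>%
  as_tibble()   # extract the node table with the new columns

centrality_tbl
```

B sits on the only shortest path between A and C, so it is the only node with a non-zero betweenness.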
visNetwork is an R package for network visualisation that uses the vis.js JavaScript library.
The visNetwork() function uses a nodes list and an edges list to create an interactive graph. The nodes list must include an “id” column, and the edges list must have “from” and “to” columns. The function also plots the labels for the nodes, using the names of the actors from the “label” column in the nodes list. The resulting graph is fun to play around with. You can move the nodes and the graph will use an algorithm to keep the nodes properly spaced. You can also zoom in and out on the plot and move it around to re-centre it.
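The column requirements can be seen on the smallest possible input. The sketch below builds an interactive graph from two hand-made data frames; toy_nodes and toy_edges are hypothetical names and values used only for illustration.

```r
library(visNetwork)

# Nodes need an "id" column; "label" supplies the text shown on each node
toy_nodes <- data.frame(id = 1:3, label = c("A", "B", "C"))

# Edges need "from" and "to" columns holding node ids
toy_edges <- data.frame(from = c(1, 1, 2), to = c(2, 3, 3))

toy_widget <- visNetwork(toy_nodes, toy_edges)
toy_widget
```

This is why the edge data is prepared below with columns named from and to rather than source and target.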
GAStech_edges_aggregated <- GAStech_edges %>%
left_join(GAStech_nodes, by = c("sourceLabel" = "label")) %>%
rename(from = id) %>%
left_join(GAStech_nodes, by = c("targetLabel" = "label")) %>%
rename(to = id) %>%
filter(MainSubject == "Work related") %>%
group_by(from, to) %>%
summarise(weight = n()) %>%
filter(from!=to) %>%
filter(weight > 1)
## `summarise()` regrouping output by 'from' (override with `.groups` argument)
visNetwork(GAStech_nodes, GAStech_edges_aggregated)
visNetwork(GAStech_nodes, GAStech_edges_aggregated) %>%
visIgraphLayout(layout = "layout_with_fr")
visNetwork() looks for a field called “group” in the nodes object and colours the nodes according to the values of the group field.
The code chunk below renames the Department field to group.
GAStech_nodes <- GAStech_nodes %>%
rename(group = Department)
When we rerun the code chunk below, visNetwork shades the nodes by assigning a unique colour to each category in the group field.
visNetwork(GAStech_nodes, GAStech_edges_aggregated) %>%
visIgraphLayout(layout = "layout_with_fr")
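The group behaviour can also be checked on a toy node list. In the sketch below the node and edge values are made up, and visLegend() is an optional addition, not used elsewhere in this exercise, that draws a legend with one entry per group.

```r
library(visNetwork)

# Hypothetical nodes with a "group" column that drives the colouring
toy_nodes <- data.frame(
  id    = 1:4,
  label = c("A", "B", "C", "D"),
  group = c("Administration", "Administration", "Engineering", "Engineering")
)
toy_edges <- data.frame(from = c(1, 1, 3), to = c(2, 3, 4))

# Build the graph, then add a legend showing the colour of each group
base_widget <- visNetwork(toy_nodes, toy_edges)
grouped_widget <- visLegend(base_widget)
grouped_widget
```

Nodes that share a group value receive the same colour, so the two administration nodes and the two engineering nodes form two colour classes.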
In the code chunk below, visEdges() is used to symbolise the edges. The arrows argument defines where the arrowhead is placed. The smooth argument draws the edges as smooth curves.
visNetwork(GAStech_nodes, GAStech_edges_aggregated) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visEdges(arrows = "to", smooth = list(type = "curvedCW"))
In the code chunk below, visOptions() is used to incorporate interactivity features in the data visualisation. The highlightNearest argument highlights the nearest nodes when a node is clicked. The nodesIdSelection argument adds a node selection by id, rendered as an HTML select element.
visNetwork(GAStech_nodes, GAStech_edges_aggregated) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visOptions(highlightNearest = TRUE, nodesIdSelection = TRUE)