Network graphs are a great way to visualise relationships between objects and their importance within the network. This tutorial demonstrates the basics of the package ggraph.
It follows the principles of Leland Wilkinson’s The Grammar of Graphics by extending the well-known data visualisation package ggplot2, which on its own can be challenging to apply to networks. Data structures from igraph, an older network package whose core is written in C, are also supported.
Load the following packages once installed. We leverage igraph and tidyverse to handle some more complex data pre-processing.
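A minimal sketch of the loading step, assuming the three packages named in this tutorial (tidyverse, igraph and ggraph) are installed from CRAN. The load order shown is an assumption, not something the original specifies.

```r
# One-off installation, if needed:
# install.packages(c("tidyverse", "igraph", "ggraph"))

library(tidyverse)  # read_csv(), dplyr/tidyr verbs, %>% and ggplot2
library(igraph)     # graph objects, layouts, community detection
library(ggraph)     # grammar-of-graphics plotting for networks
```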
We explore two examples of networks using TV scripts from The Simpsons, one of America’s longest-running shows. The csv file containing the script lines is available at the github repository of sujanjoejacob. To simplify, the analysis focuses on characters appearing in at least 20 episodes.
# Load scripts and remove non-speaking lines
lines <- read_csv("./simpsons_script_lines.csv")
speaking_lines <- lines %>%
  filter(!is.na(raw_character_text))
# Limit analysis to recurring characters in 20 or more episodes
top_char <- speaking_lines %>%
  group_by(raw_character_text) %>%
  mutate(appearances = n_distinct(episode_id)) %>%
  filter(appearances >= 20) %>%
  ungroup()
# Count each character's lines per episode
lines_per_ep <- top_char %>%
  group_by(raw_character_text, episode_id) %>%
  summarise(lines = n()) %>%
  ungroup()
A network consists of objects referred to as nodes or vertices, and their connections, named edges. This example is only concerned with connections where direction does not matter. Real-world examples include social networks or road maps within a city.
The dataset is converted into a weighted N×N adjacency matrix of character connections. We count each character’s lines per episode, then use cosine similarity to weight connections based on episode appearances and amount of dialogue.
# Convert to matrix
char_df <- lines_per_ep %>%
  spread(episode_id, lines, fill = 0)
char_mat <- as.matrix(select(char_df, -raw_character_text))
rownames(char_mat) <- char_df$raw_character_text
# Calculate cosine similarity between characters
cosine_sim <- as.dist(char_mat %*% t(char_mat) / (sqrt(rowSums(char_mat^2) %*% t(rowSums(char_mat^2)))))
# Initial look at the network
autograph(as.matrix(cosine_sim))
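To see what the cosine similarity step computes, here is a small base R sketch using the same formula on invented line counts. The character rows and numbers are hypothetical and chosen only so the arithmetic is easy to follow.

```r
# Toy line counts: rows are characters, columns are episodes (invented numbers)
toy_mat <- rbind(
  Homer = c(3, 4, 0),
  Marge = c(6, 8, 0),
  Moe   = c(0, 4, 3)
)

# Dot products between every pair of rows
dot_products <- toy_mat %*% t(toy_mat)

# Euclidean length of each row
row_norms <- sqrt(rowSums(toy_mat^2))

# Cosine similarity = dot product / product of the two lengths
toy_sim <- dot_products / (row_norms %o% row_norms)

round(toy_sim, 2)
```

Homer and Marge score 1 because their rows are proportional: cosine similarity compares the pattern of lines across episodes, not the overall volume. Homer and Moe share only one episode and score 0.64.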
The sheer number of nodes and connections makes it difficult to extract valuable information without further customisation. This is where a network package shines. We remove weaker connections from the adjacency matrix and introduce igraph functions for layout and community detection.
Communities are densely connected clusters within the larger network. The Louvain method used here is an algorithm which first creates small clusters by optimising modularity locally, then iteratively builds a new network using those clusters as nodes. A random layout is used for now, with other options discussed later. Also of note are the attributes of vertices and edges, which can be accessed and set using V() and E().
# Filter weak connections. The threshold chosen here is arbitrary. Try different variations.
cs_strong <- cosine_sim
cs_strong[cs_strong < max(cs_strong) * 0.25] <- 0
# Create an igraph object
ig <- as.matrix(cs_strong) %>%
  graph_from_adjacency_matrix(mode = "undirected", weighted = TRUE)
# Community detection algorithm
community <- cluster_louvain(ig)
# Attach communities to relevant vertices
V(ig)$color <- community$membership
# Graph layout
layout <- layout_randomly(ig)
# igraph plot
plot(ig, layout = layout)
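The V() and E() accessors and the community detection step can be tried on a toy graph first. The sketch below assumes only igraph; the graph (two triangles joined by one bridge) and the attribute name label_size are hypothetical.

```r
library(igraph)

# Two triangles (a-b-c and d-e-f) joined by a single bridge c-d
toy_edges <- data.frame(
  from = c("a", "a", "b", "c", "d", "d", "e"),
  to   = c("b", "c", "c", "d", "e", "f", "f")
)
toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# V() and E() read or set attributes on vertices and edges
E(toy_graph)$weight <- 1
V(toy_graph)$label_size <- 2  # hypothetical attribute, for illustration only

# Louvain community detection; $membership holds one cluster id per vertex
toy_comm <- cluster_louvain(toy_graph)
V(toy_graph)$color <- toy_comm$membership

vertex_attr_names(toy_graph)  # lists the attributes now attached to vertices
modularity(toy_comm)          # quality score of the detected partition
```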
Given the number of labels and nodes, extensive customisation is still required for meaningful insight. Using ggraph allows adjustments to be added layer by layer. Following the ggraph() function to initialise the graph, layers to control network attributes include:
geom_edge_link(): Straight line edge connections between nodes
geom_node_point(): Vertex markers
geom_node_text(): Vertex labels
Lastly, layers affecting overall features of the graph such as theme or labels carry over from ggplot2.
# Plot with same aesthetic adjustments as previous
ggraph(ig, layout = "fr") +
  geom_edge_link() +
  geom_node_point(aes(color = factor(color))) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none")
Further adjustments can be performed on the parameters directly or through the aes() aesthetic mapping function. We replace geom_edge_link() with geom_edge_arc() to avoid overlapping, and reduce the boldness of edges. The vertices use aes() to link the size of markers and text with degree centrality. The degree() function calculates the importance of vertices by their number of connections.
The code layout = "fr" refers to Fruchterman-Reingold, one of the most widely used layout algorithms. It is a force-directed method which attempts to avoid edges crossing or differing significantly in length.
# Set size to degree centrality
V(ig)$size <- degree(ig)
# Additional customisation for better legibility
ggraph(ig, layout = "fr") +
  geom_edge_arc(strength = 0.2, width = 0.5, alpha = 0.15) +
  geom_node_point(aes(size = size, color = factor(color))) +
  geom_node_text(aes(label = name, size = size), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none")
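As a quick check of what degree() returns, here is a sketch on a hypothetical star-shaped graph, where one central vertex is connected to every other vertex.

```r
library(igraph)

# A star: vertex 1 in the centre, connected to four outer vertices
star <- make_star(5, mode = "undirected")

# degree() counts the edges touching each vertex
star_degree <- degree(star)
star_degree
# The centre has degree 4; every outer vertex has degree 1

# Store the result as a vertex attribute, as done for the main graph
V(star)$size <- star_degree
```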
There is no best layout as different algorithms have their strengths and purposes. Here are examples of algorithms interpreting the same dataset. Experimentation is needed to determine the right fit.
library(gridExtra)
# Test different layouts
g1 <- ggraph(ig, layout = "mds") +
  geom_edge_arc(strength = 0.2, width = 0.5, alpha = 0.15) +
  geom_node_point(aes(size = size, color = factor(color))) +
  theme_void() +
  theme(legend.position = "none") +
  labs(title = "Multi-Dimensional Scaling")
g2 <- ggraph(ig, layout = "kk") +
  geom_edge_arc(strength = 0.2, width = 0.5, alpha = 0.15) +
  geom_node_point(aes(size = size, color = factor(color))) +
  theme_void() +
  theme(legend.position = "none") +
  labs(title = "Kamada-Kawai")
g3 <- ggraph(ig, layout = "lgl") +
  geom_edge_arc(strength = 0.2, width = 0.5, alpha = 0.15) +
  geom_node_point(aes(size = size, color = factor(color))) +
  theme_void() +
  theme(legend.position = "none") +
  labs(title = "Large Graph Layout")
g4 <- ggraph(ig, layout = "graphopt") +
  geom_edge_arc(strength = 0.2, width = 0.5, alpha = 0.15) +
  geom_node_point(aes(size = size, color = factor(color))) +
  theme_void() +
  theme(legend.position = "none") +
  labs(title = "GraphOPT")
grid.arrange(g1, g2, g3, g4, nrow = 2)
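Whatever the algorithm, a layout boils down to one (x, y) position per vertex. The sketch below, assuming only igraph and a hypothetical six-vertex ring, calls two of the layout functions directly to show the coordinates they produce.

```r
library(igraph)

# A small ring of six vertices
ring <- make_ring(6)

set.seed(42)  # fix the seed so any randomised starting positions are repeatable
fr_coords <- layout_with_fr(ring)  # Fruchterman-Reingold
kk_coords <- layout_with_kk(ring)  # Kamada-Kawai

# Each layout is a matrix with one row per vertex and columns for x and y
dim(fr_coords)
head(kk_coords, 3)
```

Comparing the two matrices for the same graph shows why the plots differ: the edges are identical, only the positions change.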
Finally, we introduce filtering within the aesthetic mapping to highlight areas of interest. The first chart filters on vertex size to remove color and labels from less important characters. The second chart shows how the colors can be set directly.
# Filter example
cs_weak <- cosine_sim
cs_weak[cs_weak < max(cs_weak) * 0.1] <- 0
ig2 <- graph_from_adjacency_matrix(as.matrix(cs_weak), weighted = TRUE, mode = "undirected")
V(ig2)$size <- degree(ig2)
community2 <- cluster_louvain(ig2)
V(ig2)$color <- community2$membership
g5 <- ggraph(ig2, layout = "graphopt") +
  geom_edge_link(alpha = 0.15) +
  # Less connected characters: faded points, no color or label
  geom_node_point(aes(filter = size <= 50, size = size), alpha = 0.5) +
  # Highly connected characters: colored by community and labelled
  geom_node_point(aes(filter = size > 50, size = size, color = factor(color))) +
  geom_node_text(aes(filter = size > 50, label = name, size = size), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none") +
  labs(title = "Degree Centrality")
# Change colors for Simpsons Family
ig_simp <- ig
V(ig_simp)$color <- "grey20"
V(ig_simp)$color[grepl("Simpson", V(ig_simp)$name)] <- "gold"
g6 <- ggraph(ig_simp, layout = "fr") +
  geom_edge_link(alpha = 0.15) +
  geom_node_point(aes(size = size), color = V(ig_simp)$color) +
  geom_node_text(aes(filter = grepl("Simpson", name), size = size, label = name), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none") +
  labs(title = "Simpson Family")
grid.arrange(g5, g6, nrow = 2)
A directed network makes a distinction between the source and target of a connection, e.g. an electric circuit board. For our dataset, we examine how often characters visit prominent locations of the show.
We limit the focus to the top locations, ranked by the number of unique characters appearing there per episode.
# Count number of unique characters at location per episode
loc_visits <- top_char %>%
  group_by(raw_location_text, episode_id) %>%
  summarise(count = n_distinct(raw_character_text)) %>%
  ungroup()
# Aggregate episode counts per location and extract top ranks
top_loc <- loc_visits %>%
  group_by(raw_location_text) %>%
  summarise(sum = sum(count)) %>%
  top_n(15, sum) %>%
  ungroup()
# Limit to major location/character combinations
graph_data <- top_char %>%
  filter(raw_location_text %in% top_loc$raw_location_text) %>%
  group_by(raw_location_text, raw_character_text) %>%
  summarise(ep_count = n_distinct(episode_id)) %>%
  ungroup() %>%
  top_n(75, ep_count)
The newly formatted data (from, to, weight) is passed to graph_from_data_frame() instead of the graph_from_adjacency_matrix() used for the undirected network. We also specify vertex colors directly to separate locations from characters.
# Create igraph object from data frame
ig_loc <- graph_data %>%
  select(raw_character_text, raw_location_text, ep_count) %>%
  graph_from_data_frame()
# Use episode appearances for edge weights
ig_loc <- set_edge_attr(ig_loc, "weight", value = graph_data$ep_count)
# Define colors of locations and characters
V(ig_loc)$color <- "grey20"
V(ig_loc)$color[V(ig_loc)$name %in% top_loc$raw_location_text] <- "red"
# Graph layout
layout <- layout_with_fr(ig_loc)
# igraph plot
plot(ig_loc, layout = layout)
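A minimal sketch of how graph_from_data_frame() reads an edge list, assuming only igraph. The visits and counts below are invented for illustration.

```r
library(igraph)

# Hypothetical visits: character -> location, with an episode count
visits <- data.frame(
  from   = c("Homer", "Homer", "Bart"),
  to     = c("Moe's Tavern", "Power Plant", "School"),
  weight = c(30, 25, 40)
)

# The first two columns become edge endpoints; the rest become edge attributes
toy_loc <- graph_from_data_frame(visits, directed = TRUE)

is_directed(toy_loc)  # TRUE
E(toy_loc)$weight     # 30 25 40
V(toy_loc)$name       # every name that appears in either endpoint column
```

Because the third column is already called weight in this sketch, no separate set_edge_attr() call is needed.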
Switching to ggraph produces a more legible plot, with less overlap between nodes and labels.
# Plot
ggraph(ig_loc, layout = "fr") +
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none")
With more control over the edges, here are some adjustment examples:
to/from: Direction between nodes
start_cap/end_cap: Length of edges
weight: Provided in the dataframe (number of episodes)
scale_edge_width(): Limit edge width
# Plot
ggraph(ig_loc, layout = "fr") +
  geom_edge_link(aes(color = factor(to), width = log(weight)), alpha = 0.5,
                 start_cap = circle(2, 'mm'), end_cap = circle(2, 'mm')) +
  scale_edge_width(range = c(0.5, 2.5)) +
  geom_node_point(color = V(ig_loc)$color, size = 5, alpha = 0.5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none")
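Since the network is directed, arrowheads are one further option for showing which way each edge points. The sketch below is not part of the original analysis: it uses a hypothetical three-edge graph and the arrow argument of geom_edge_link(), with the caps leaving a small gap so the arrowheads are not hidden under the node markers.

```r
library(igraph)
library(ggraph)

# Hypothetical three-edge directed graph
toy_directed <- graph_from_data_frame(data.frame(
  from = c("Homer", "Bart", "Lisa"),
  to   = c("Moe's Tavern", "School", "School")
))

p <- ggraph(toy_directed, layout = "kk") +
  geom_edge_link(
    arrow     = arrow(length = unit(3, "mm")),  # arrowhead at the target end
    start_cap = circle(3, "mm"),                # gap between source node and edge
    end_cap   = circle(3, "mm")                 # gap between edge and target node
  ) +
  geom_node_point(size = 3) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void()

p
```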
The ggraph package contains a huge number of customisations to test out, while the layout and community detection algorithms used in this tutorial are deep subjects in their own right.