Introduction to Networks with ggraph

Introduction

Network graphs are a great way to visualise relationships between objects and their importance within the network. This tutorial demonstrates the basics of the package ggraph.

ggraph follows the principles of Leland Wilkinson’s The Grammar of Graphics by extending the well-known data visualisation package ggplot2, which on its own can be challenging to apply to networks. Data structures from igraph, an older network package originally written in C, are also supported.

Install and load packages

Load the following packages once installed. We leverage igraph and tidyverse to handle some more complex data pre-processing.

# Load Packages 
library(ggraph)
library(igraph) 
library(tidyverse)

Data preparation

We explore two examples of networks using TV scripts from The Simpsons, one of America’s longest-running shows. The CSV file containing the script lines is available at the GitHub repository of sujanjoejacob. To simplify, the analysis focuses on characters appearing in at least 20 episodes.

# Load scripts and remove non-speaking lines
lines <- read_csv("./simpsons_script_lines.csv")

speaking_lines <- lines %>% 
  filter(!is.na(raw_character_text))

# Limit analysis to recurring characters appearing in 20 or more episodes
top_char <- speaking_lines %>% 
  group_by(raw_character_text) %>% 
  mutate(appearances=n_distinct(episode_id)) %>% 
  filter(appearances >= 20) %>%
  ungroup()

# Count each character's lines per episode
lines_per_ep <- top_char %>% 
  group_by(raw_character_text, episode_id) %>% 
  summarise(lines=n()) %>% 
  ungroup()
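
To see what the filtering and counting steps do, here is a minimal sketch on a made-up table. The names toy_lines and min_eps and all values are hypothetical; only the column names follow the real dataset.

```r
library(dplyr)

# Made-up script lines: character A speaks in episodes 1 and 2,
# B only in episode 1, C only in episode 3
toy_lines <- tibble(
  raw_character_text = c("A", "A", "A", "B", "B", "C"),
  episode_id         = c(1, 2, 2, 1, 1, 3)
)

min_eps <- 2  # hypothetical threshold (the tutorial uses 20)

# Keep characters that appear in at least min_eps distinct episodes
toy_top <- toy_lines %>%
  group_by(raw_character_text) %>%
  mutate(appearances = n_distinct(episode_id)) %>%
  filter(appearances >= min_eps) %>%
  ungroup()

# Count lines per character and episode
toy_counts <- toy_top %>%
  count(raw_character_text, episode_id, name = "lines")

toy_counts
```

Only character A survives the filter, with one line in episode 1 and two lines in episode 2. Note that mutate() keeps every row, so the appearance count is attached to each line before filtering.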

Example 1: Undirected Network

A network consists of objects, referred to as nodes or vertices, and the connections between them, called edges. This example is concerned only with connections where direction does not matter. Real-world examples include social networks and road maps within a city.

Adjacency matrix

The dataset is converted into a weighted $N \times N$ adjacency matrix of character connections, where $N$ is the number of characters. We count each character’s lines per episode, then use cosine similarity to weight connections based on episode appearances and amount of dialogue.

# Convert to matrix
char_df <- lines_per_ep %>% 
  spread(episode_id, lines, fill=0)

char_mat <- as.matrix(select(char_df, -raw_character_text))
rownames(char_mat) <- char_df$raw_character_text

# Calculate cosine similarity between characters (as.dist() keeps the lower triangle only)
cosine_sim <- as.dist(char_mat %*% t(char_mat) / (sqrt(rowSums(char_mat^2) %*% t(rowSums(char_mat^2)))))

# Initial look at the network 
autograph(as.matrix(cosine_sim))
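
The cosine similarity step can be checked by hand on a small matrix. The following is a minimal sketch using base R only; the matrix toy and its values are made up for illustration.

```r
# Made-up line counts: rows are characters, columns are episodes
toy <- matrix(c(1, 0, 2,
                2, 0, 4,
                0, 3, 0),
              nrow = 3, byrow = TRUE,
              dimnames = list(c("A", "B", "C"), c("ep1", "ep2", "ep3")))

# Euclidean length of each character's row vector
norms <- sqrt(rowSums(toy^2))

# Dot products between all pairs of rows, divided by the product of their lengths
toy_sim <- (toy %*% t(toy)) / (norms %o% norms)

round(toy_sim, 2)
```

A and B speak in the same episodes in the same proportions, so their similarity is 1 even though B has twice as many lines. A and C never share an episode, so their similarity is 0.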

igraph network

The sheer number of nodes and connections makes it difficult to extract valuable information without further customisation. This is where a network package shines. We remove weaker connections from the adjacency matrix and introduce igraph functions for layout and community detection.

Communities are densely connected clusters within the larger network. The Louvain method used here first forms small clusters by optimising modularity locally, then iteratively repeats the process on a network whose nodes are those clusters. A random layout is used for now, with other options discussed later. Also of note are the attributes of vertices and edges, which can be read and set using V() and E().

# Filter weak connections. The threshold chosen here is arbitrary; try different values.
cs_strong <- cosine_sim
cs_strong[cs_strong < max(cs_strong) * 0.25] <- 0 

# Create an igraph object 
ig <- as.matrix(cs_strong) %>% 
  graph_from_adjacency_matrix(mode = "undirected", weighted = TRUE)

# Community detection algorithm
community <- cluster_louvain(ig) 

# Attach communities to relevant vertices
V(ig)$color <- community$membership 

# Graph layout
layout <- layout_randomly(ig)

# igraph plot 
plot(ig, layout = layout)
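
As a smaller illustration of V(), E() and cluster_louvain(), here is a sketch on a toy graph made of two triangles joined by one edge. The graph and the names toy_g and toy_comm are hypothetical and not part of the Simpsons data.

```r
library(igraph)

# Toy undirected graph: two triangles joined by a single bridge edge (C-D)
toy_g <- graph_from_literal(A-B-C-A, D-E-F-D, C-D)

# Edge attributes are read and written through E()
E(toy_g)$weight <- 1

# Louvain community detection; the seed only makes repeated runs comparable
set.seed(1)
toy_comm <- cluster_louvain(toy_g)

# Vertex attributes are read and written through V()
V(toy_g)$color <- toy_comm$membership

# One row per vertex: its name and the community it was assigned to
data.frame(name = V(toy_g)$name, community = V(toy_g)$color)
```

The membership vector has one entry per vertex, in the same order as V(toy_g), which is why it can be assigned directly as a vertex attribute.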

Implement ggraph

Given the number of labels and nodes, extensive customisation is still required to gain meaningful insight. ggraph allows adjustments to be added layer by layer. After the ggraph() function initialises the graph, layers that control network attributes include:

geom_edge_link(): Straight line edge connections between nodes
geom_node_point(): Vertex markers
geom_node_text(): Vertex labels

Lastly, layers affecting overall features of the graph such as theme or labels carry over from ggplot2.

# Plot with same aesthetic adjustments as previous
ggraph(ig, layout = "fr") +
  geom_edge_link() + 
  geom_node_point(aes(color = factor(color))) + 
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none")

Adjusting aesthetics and layout

Further adjustments can be made by setting parameters directly or through the aes() aesthetic mapping function. We replace geom_edge_link() with geom_edge_arc() to reduce overlapping, and lower the width and alpha of the edges to make them less prominent. The vertices use aes() to map point size and label size to degree centrality. The degree() function measures the importance of a vertex by its number of connections.

The code layout = "fr" refers to Fruchterman-Reingold, one of the most widely used layout algorithms. It is a force-directed method that attempts to avoid edge crossings and edges that differ significantly in length.

# Set size to degree centrality 
V(ig)$size <- degree(ig)

# Additional customisation for better legibility 
ggraph(ig, layout = "fr") +
  geom_edge_arc(strength = 0.2, width = 0.5, alpha = 0.15) + 
  geom_node_point(aes(size = size, color = factor(color))) + 
  geom_node_text(aes(label = name, size = size), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none")
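
To make degree centrality concrete, here is a minimal sketch on a star-shaped toy graph. The graph and the vertex names are made up for illustration.

```r
library(igraph)

# Star graph: vertex 1 is the centre and is connected to the four other vertices
toy_star <- make_star(5, mode = "undirected")
V(toy_star)$name <- c("hub", "a", "b", "c", "d")

# degree() counts the edges attached to each vertex
V(toy_star)$size <- degree(toy_star)

data.frame(name = V(toy_star)$name, degree = V(toy_star)$size)
```

The hub has degree 4 and every other vertex has degree 1, so mapping size to this attribute would draw the hub largest.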

Layout options

There is no best layout, as different algorithms have their own strengths and purposes. Here are four algorithms applied to the same dataset. Experimentation is needed to determine the right fit.

library(gridExtra)

# Test different layouts 
g1 <- ggraph(ig, layout = "mds") +
  geom_edge_arc(strength=0.2, width=0.5, alpha=.15) + 
  geom_node_point(aes(size=size, color=factor(color))) + 
  theme_void() +
  theme(legend.position = "none") + 
  labs(title = "Multi-Dimensional Scaling")

g2 <- ggraph(ig, layout = "kk") +
  geom_edge_arc(strength=0.2, width=0.5, alpha=.15) + 
  geom_node_point(aes(size=size, color=factor(color))) + 
  theme_void() +
  theme(legend.position = "none") + 
  labs(title = "Kamada-Kawai")

g3 <- ggraph(ig, layout = "lgl") +
  geom_edge_arc(strength=0.2, width=0.5, alpha=.15) + 
  geom_node_point(aes(size=size, color=factor(color))) + 
  theme_void() +
  theme(legend.position = "none") + 
  labs(title = "Large Graph Layout") 


g4 <- ggraph(ig, layout = "graphopt") +
  geom_edge_arc(strength=0.2, width=0.5, alpha=.15) + 
  geom_node_point(aes(size=size, color=factor(color))) + 
  theme_void() +
  theme(legend.position = "none") + 
  labs(title = "GraphOPT")


grid.arrange(g1, g2, g3, g4, nrow = 2)
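
A layout is ultimately a set of x and y coordinates per node. As a sketch of how to inspect them, ggraph's create_layout() returns the coordinates as a data frame. The ring graph and the names lay_fr and lay_kk are hypothetical.

```r
library(igraph)
library(ggraph)

# Toy graph: six vertices connected in a ring
toy_ring <- make_ring(6)

# Force-directed layouts start from random positions, so fix the seed
# to get the same coordinates on every run
set.seed(42)
lay_fr <- create_layout(toy_ring, layout = "fr")
lay_kk <- create_layout(toy_ring, layout = "kk")

# One row per node, with the computed coordinates
data.frame(x = lay_fr$x, y = lay_fr$y)

# A precomputed layout can be passed to ggraph() in place of the graph
p_ring <- ggraph(lay_fr) +
  geom_edge_link() +
  geom_node_point()
```

Computing the layout once and reusing it is one way to keep node positions identical across several plots of the same graph.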

Filtering

Finally, we introduce filtering within the aesthetic mapping to highlight areas of interest. The first chart filters on vertex size (degree centrality), removing color and labels from less important characters. The second chart shows how colors can be set directly.

# Filter example
cs_weak <- cosine_sim
cs_weak[cs_weak < max(cs_weak) * 0.1] <- 0 

ig2 <- graph_from_adjacency_matrix(as.matrix(cs_weak), weighted = TRUE, mode = "undirected")
V(ig2)$size <- degree(ig2) 

community2 <- cluster_louvain(ig2)
V(ig2)$color <- community2$membership

g5 <- ggraph(ig2, layout = "graphopt") +
  geom_edge_link(alpha = 0.15) + 
  geom_node_point(aes(filter = size <= 50, size = size), alpha = 0.5) + 
  geom_node_point(aes(filter = size > 50, size = size, color = factor(color))) + 
  geom_node_text(aes(filter = size > 50, label = name, size = size), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none") +
  labs(title = "Degree Centrality")

# Change colors for Simpsons Family 
ig_simp <- ig 
V(ig_simp)$color <- "grey20"
V(ig_simp)$color[grepl("Simpson", V(ig_simp)$name)] <- "gold"

g6 <- ggraph(ig_simp, layout = "fr") +
  geom_edge_link(alpha = 0.15) + 
  geom_node_point(aes(size = size), color = V(ig_simp)$color) + 
  geom_node_text(aes(filter = grepl("Simpson", name), size = size, label = name), repel=TRUE) +
  theme_void() +
  theme(legend.position = "none") + 
  labs(title = "Simpson Family")

grid.arrange(g5, g6, nrow = 2)
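
The filter aesthetic can be tried in isolation on a small graph. This is a minimal sketch, assuming a made-up star graph in which only the central vertex passes the filter; the names toy_star and p_filter are hypothetical.

```r
library(igraph)
library(ggraph)

# Toy star graph with degree stored as a vertex attribute
toy_star <- make_star(5, mode = "undirected")
V(toy_star)$name <- c("hub", "a", "b", "c", "d")
V(toy_star)$size <- degree(toy_star)

p_filter <- ggraph(toy_star, layout = "kk") +
  geom_edge_link(alpha = 0.3) +
  # Low-degree vertices: plain grey points, no label
  geom_node_point(aes(filter = size <= 1), color = "grey60") +
  # High-degree vertices: highlighted, sized by degree and labelled
  geom_node_point(aes(filter = size > 1, size = size), color = "gold") +
  geom_node_text(aes(filter = size > 1, label = name), vjust = -1) +
  theme_void() +
  theme(legend.position = "none")

p_filter
```

Each layer receives only the rows for which its filter expression is TRUE, so the same attribute can drive both the highlighted and the muted layers.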

Example 2: Directed Network

A directed network distinguishes between the source and target of a connection, e.g. an electric circuit board. For our dataset, we examine how often characters visit prominent locations in the show.

Data set up

We limit the focus to the 15 locations with the highest number of unique characters per episode, summed across episodes.

# Count number of unique characters at location per episode
loc_visits <- top_char %>% 
  group_by(raw_location_text, episode_id) %>% 
  summarise(count = n_distinct(raw_character_text)) %>% 
  ungroup()

# Aggregate episode counts per location and extract top ranks
top_loc <- loc_visits %>% 
  group_by(raw_location_text) %>% 
  summarise(sum = sum(count)) %>% 
  top_n(15, sum) %>% 
  ungroup()

# Limit to major location/character combinations 
graph_data <- top_char %>% 
  filter(raw_location_text %in% top_loc$raw_location_text) %>% 
  group_by(raw_location_text, raw_character_text) %>% 
  summarise(ep_count = n_distinct(episode_id)) %>% 
  ungroup() %>% 
  top_n(75, ep_count)

Initial network

The newly formatted data (from, to, weight) is passed to graph_from_data_frame() rather than graph_from_adjacency_matrix(), which we used for the undirected network. We also specify vertex colors directly to separate locations from characters.

# Create igraph object from data frame 
ig_loc <- graph_data %>% 
  select(raw_character_text, raw_location_text, ep_count) %>% 
  graph_from_data_frame()

# Use episode appearances for Edge weights 
ig_loc <- set_edge_attr(ig_loc, "weight", value = graph_data$ep_count)

# Define colors of locations and characters 
V(ig_loc)$color <- "grey20"
V(ig_loc)$color[V(ig_loc)$name %in% top_loc$raw_location_text] <- "red"

# Graph layout
layout <- layout_with_fr(ig_loc)

# igraph plot 
plot(ig_loc, layout = layout)
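
The edge-list format expected by graph_from_data_frame() is easiest to see on a tiny example. The sketch below uses a made-up data frame with placeholder names (char_1, loc_1, and so on) and invented weights.

```r
library(igraph)

# Made-up edge list: the first two columns are source and target,
# any further columns become edge attributes
toy_edges <- data.frame(
  from   = c("char_1", "char_1", "char_2"),
  to     = c("loc_1", "loc_2", "loc_3"),
  weight = c(12, 30, 25)
)

toy_dg <- graph_from_data_frame(toy_edges, directed = TRUE)

# Color locations and characters differently, based on the placeholder prefix
V(toy_dg)$color <- ifelse(grepl("^loc_", V(toy_dg)$name), "red", "grey20")

is_directed(toy_dg)
E(toy_dg)$weight
```

Because the third column is already named weight, it is picked up as the edge weight without a separate set_edge_attr() call.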

Implement ggraph

Plotting the same graph with ggraph, with repelled labels and a void theme, reduces overlap and improves legibility.

# Plot 
ggraph(ig_loc, layout = "fr") + 
  geom_edge_link() +
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() + 
  theme(legend.position = "none")

Additional aesthetics

With more control over the edges, here are some adjustment examples:

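from/to: Source and target node of each edge; here the edge color is mapped to the target
start_cap/end_cap: Gap left between each end of an edge and its node
weight: Edge attribute provided in the data frame (number of episodes)
scale_edge_width(): Range of edge widths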

# Plot 
ggraph(ig_loc, layout="fr") + 
  geom_edge_link(aes(color = factor(to), width = log(weight)), alpha = 0.5, 
                  start_cap = circle(2, 'mm'), end_cap = circle(2, 'mm')) +
  scale_edge_width(range = c(0.5, 2.5)) + 
  geom_node_point(color = V(ig_loc)$color, size = 5, alpha = 0.5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() + 
  theme(legend.position = "none")
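
The plot above encodes direction only through edge color. One possible extension, shown here as a sketch on a made-up graph rather than as part of the original analysis, is to draw arrowheads with the arrow argument of geom_edge_link(). The data and the names toy_dg and p_arrow are hypothetical.

```r
library(igraph)
library(ggraph)

# Made-up directed graph with placeholder names and invented weights
toy_dg <- graph_from_data_frame(
  data.frame(
    from   = c("char_1", "char_1", "char_2"),
    to     = c("loc_1", "loc_2", "loc_3"),
    weight = c(12, 30, 25)
  ),
  directed = TRUE
)

p_arrow <- ggraph(toy_dg, layout = "kk") +
  geom_edge_link(aes(width = weight), alpha = 0.5,
                 # Arrowhead at the target end of each edge
                 arrow = grid::arrow(length = grid::unit(3, "mm"), type = "closed"),
                 # Caps stop the edge short of the node so the arrowhead stays visible
                 start_cap = circle(3, "mm"), end_cap = circle(3, "mm")) +
  scale_edge_width(range = c(0.5, 2.5)) +
  geom_node_point(size = 5, alpha = 0.5) +
  geom_node_text(aes(label = name), repel = TRUE) +
  theme_void() +
  theme(legend.position = "none")

p_arrow
```

Without end_cap, the arrowhead would be drawn at the node centre and hidden underneath the point.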

Martin Neloe

15/08/2020