This script recreates a network diagram showing the genealogy of those involved in the development of Exploratory Data Analysis following John W. Tukey, Frederick Mosteller, … and others, together with some of their students. It appears in my article, Remembrances of Things EDA, to be published on Nightingale shortly. I’m creating this gist to allow others to work with the data and perhaps produce something better.

The source of the information used here was the Mathematics Genealogy Project. I selected only a handful of first-level students and some of their descendants. The base dataset I’m using here includes: advisor, student, institution, PhDyear, and MGD_id, the ID number of the student in the genealogy database.

Creating such a diagram is problematic for a variety of reasons:

Let’s get started. This is published at: https://rpubs.com/friendly/EDA-network

Load packages

library(openxlsx) # Read excel files; this supports using a URL
library(dplyr)    # A Grammar of Data Manipulation
library(ggraph)   # An Implementation of Grammar of Graphics for Graphs and Networks
library(igraph)   # Network Analysis and Visualization
library(ggplot2)  # Create Elegant Data Visualisations Using the Grammar of Graphics
library(grid)     # The Grid Graphics Package
library(rlang)    # Functions for Base Types and Core R and 'Tidyverse' Features
library(glue)     # Interpreted String Literals
library(here)      # A Simpler Way to Find Your Files
library(networkD3) # D3 JavaScript Network Graphs from R

Read the data set

To make this script reproducible, read from a cloud URL. It seems that only {openxlsx} supports this.

EDA_geneaology <- read.xlsx("https://www.dropbox.com/s/oq3jwvg8bto93ln/EDA-geneaology.xlsx?dl=1")
str(EDA_geneaology)
## 'data.frame':    40 obs. of  5 variables:
##  $ advisor    : chr  "Solomon Lefschetz" "John Tukey" "John Tukey" "John Tukey" ...
##  $ institution: chr  "Princeton" "Princeton" "Princeton" "Princeton" ...
##  $ student    : chr  "John Tukey" "Arthur Dempster" "Leo Goodman" "David Hoaglin" ...
##  $ PhDyear    : num  1939 1956 1950 1971 1946 ...
##  $ MGD_id     : num  15860 15981 35023 35266 35033 ...

Fix up a few things

igraph wants parent and child as the edge list variables. Make “Institution” appear in the legend, not “institution”.

EDA_gen <- EDA_geneaology %>%
  rename(parent = advisor, 
         child = student,
         Institution = institution) 

Remove some links that should not be shown. Add an explicit URL, built from MGD_id, that points to each student's entry in the genealogy database.

EDA_gen <- EDA_gen %>% 
  mutate(main = (child %in% c("John Tukey", "Harold Gulliksen")) ) %>% 
  filter( !(parent %in% c("Solomon Lefschetz", "James Angell")) ) %>% 
  filter( !(child %in% c("Clyde Coombs", "Charles Lewis"))) %>%
  mutate(link = glue::glue("https://www.genealogy.math.ndsu.nodak.edu/id.php?id={MGD_id}"))

Transform to igraph format

igraph wants parent, child in the first two columns. The other variables become attributes of the edges in the graph (marked e/ in the printout below), not of the vertices.

EDA_graph <- igraph::graph_from_data_frame(EDA_gen[,c(1,3,2,4:7)])
print(EDA_graph)
## IGRAPH 6f9cbf9 DN-- 37 35 -- 
## + attr: name (v/c), Institution (e/c), PhDyear (e/n), MGD_id (e/n),
## | main (e/l), link (e/c)
## + edges from 6f9cbf9 (vertex names):
##  [1] John Tukey         ->Arthur Dempster    
##  [2] John Tukey         ->Leo Goodman        
##  [3] John Tukey         ->David Hoaglin      
##  [4] John Tukey         ->Frederick Mosteller
##  [5] Samuel Wilks       ->Frederick Mosteller
##  [6] Frederick Mosteller->Persi Diaconis     
##  [7] Frederick Mosteller->Stephen Fienberg   
## + ... omitted several edges
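
The edge-versus-vertex distinction matters later, when node size or tooltips need per-person attributes. As a minimal sketch on a toy table (the data frames `toy_edges` and `toy_nodes` and their values are hypothetical, not part of the EDA data), extra columns of the edge table land on the edges, while a separate `vertices` data frame is one way to attach attributes to the nodes:

```r
library(igraph)

# Toy edge list (hypothetical): the first two columns are parent and child;
# every further column becomes an edge attribute
toy_edges <- data.frame(
  parent      = c("A", "A", "B"),
  child       = c("B", "C", "D"),
  Institution = c("Univ 1", "Univ 1", "Univ 2"),
  stringsAsFactors = FALSE
)

# Optional vertex table (hypothetical): the first column must hold the vertex names
toy_nodes <- data.frame(
  name  = c("A", "B", "C", "D"),
  label = c("root", "generation 1", "generation 1", "generation 2"),
  stringsAsFactors = FALSE
)

g_toy <- graph_from_data_frame(toy_edges, directed = TRUE, vertices = toy_nodes)

edge_attr_names(g_toy)    # "Institution"
vertex_attr_names(g_toy)  # "name" "label"
```

The same idea could presumably be applied to the real data by building a one-row-per-person table from `EDA_gen` and passing it as `vertices`.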

Examine the igraph object

See: https://kateto.net/wp-content/uploads/2018/06/Polnet%202018%20R%20Network%20Visualization%20Workshop.pdf for working with igraph

Extract edges as a data frame

edges <- igraph::as_data_frame(EDA_graph, what="edges")
glimpse(edges)
## Rows: 35
## Columns: 7
## $ from        <chr> "John Tukey", "John Tukey", "John Tukey", "John Tukey", "S~
## $ to          <chr> "Arthur Dempster", "Leo Goodman", "David Hoaglin", "Freder~
## $ Institution <chr> "Princeton", "Princeton", "Princeton", "Princeton", "Princ~
## $ PhDyear     <dbl> 1956, 1950, 1971, 1946, 1946, 1974, 1968, 1977, 1975, 1953~
## $ MGD_id      <dbl> 15981, 35023, 35266, 35033, 35033, 18747, 58815, 13739, 23~
## $ main        <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA~
## $ link        <chr> "https://www.genealogy.math.ndsu.nodak.edu/id.php?id=15981~

Get graph vertices: V() returns vertices; E() returns edges.

igraph::V(EDA_graph)
## + 37/37 vertices, named, from 6f9cbf9:
##  [1] John Tukey           Samuel Wilks         Frederick Mosteller 
##  [4] Robert Abelson       Harold Gulliksen     Arthur Dempster     
##  [7] John Hartigan        Louis Leon Thurstone Leo Goodman         
## [10] Ledyard Tucker       Andreas Buja         Dianne Cook         
## [13] Peter Huber          Gordon Foster        Antony Unwin        
## [16] Heike Hofmann        David Hoaglin        Persi Diaconis      
## [19] Stephen Fienberg     Stanley Wasserman    Lee Wilkinson       
## [22] Michael Friendly     Howard Wainer        Paul Velleman       
## [25] Richard Heiberger    Karen Kafadar        Jay Emerson         
## [28] Sanford Weisberg     James Ramsay         William Eddy        
## + ... omitted several vertices

Get node degrees. I use this to make Tukey/Mosteller most prominent, while sizing every node in proportion to its number of links.

deg <- igraph::degree(EDA_graph, mode="all")
V(EDA_graph)$degree <- deg
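
To make the choice of `mode` concrete, here is a small sketch on a toy directed graph (the graph `g_deg` and its node names are hypothetical). With `mode = "all"` a person's degree counts both the link from their advisor and the links to their students:

```r
library(igraph)

# Toy advisor -> student graph (hypothetical): A advises B and C; B advises D
g_deg <- graph_from_data_frame(
  data.frame(parent = c("A", "A", "B"),
             child  = c("B", "C", "D"),
             stringsAsFactors = FALSE)
)

degree(g_deg, mode = "out")  # students per person:  A = 2, B = 1, C = 0, D = 0
degree(g_deg, mode = "in")   # advisors per person:  A = 0, B = 1, C = 1, D = 1
degree(g_deg, mode = "all")  # in + out:             A = 2, B = 2, C = 1, D = 1
```

Using `mode = "out"` instead would size nodes by number of students only, which may or may not be preferable for this diagram.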

Plotting with ggraph

The ggraph package was proposed as a “Grammar of Graphics for Graphs and Networks”. Yet, I’m finding it somewhat incomplete (had to add $degree to the vertices using igraph::V()) and hard to totally understand.

I’m using filled arrows on the edges, but the fill did not appear in the legend. From a Stack Overflow post I found how to modify the legend key of GeomEdgePath by writing a custom key function.

draw_key_custom = function(data, params, size) {
  segmentsGrob(0.1, 0.5, 0.9, 0.5,
               gp = gpar(
                 col = alpha(data$edge_colour, data$edge_alpha),
                 fill = alpha(data$edge_colour, data$edge_alpha),  # <- add fill to arrow head!
                 lwd = data$edge_width * .pt,
                 lty = data$edge_linetype, 
                 lineend = 'butt'
               ),
               arrow = params$arrow
  )
}

Choice of layouts

There is a large variety of graph layouts that could be used here. Details and examples at: https://www.data-imaginist.com/2017/ggraph-introduction-layouts/

See: https://i.stack.imgur.com/3QAMW.png for pictorial examples. After some experimentation, I chose fr, the Fruchterman-Reingold force-directed layout.

igraph_layouts <- c('star', 'circle', 'gem', 'dh', 'graphopt', 'grid', 'mds', 
                    'randomly', 'fr', 'kk', 'drl', 'lgl')
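
The experimentation can be sketched as a loop that builds one plot per layout. This is only an illustration under stated assumptions: the toy graph `g_lay` stands in for `EDA_graph`, and `try_layouts` is a hypothetical subset of the names above, chosen to keep the sketch quick.

```r
library(igraph)
library(ggraph)
library(ggplot2)

# Toy directed graph standing in for EDA_graph (hypothetical data)
g_lay <- graph_from_data_frame(
  data.frame(parent = c("A", "A", "A", "B", "B", "C"),
             child  = c("B", "C", "D", "E", "F", "G"),
             stringsAsFactors = FALSE)
)

# A subset of the candidate layouts
try_layouts <- c("fr", "kk", "circle", "star")

set.seed(42)  # force-directed layouts start from random positions
layout_plots <- lapply(try_layouts, function(lay) {
  ggraph(g_lay, layout = lay) +
    geom_edge_link() +
    geom_node_point() +
    geom_node_text(aes(label = name), repel = TRUE) +
    ggtitle(paste("layout:", lay)) +
    theme_graph()
})
names(layout_plots) <- try_layouts

# Print one at a time to compare, e.g. layout_plots[["fr"]]
```

Setting the seed before the loop makes the force-directed results repeatable from run to run.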

Draw the diagram

This uses size = degree to set the vertex size. Edge links are colored by Institution. The legend key_glyph overrides the ggraph default, using my draw_key_custom() function to fill the arrow heads.

ggraph(EDA_graph, layout="fr") + 
  geom_edge_link(aes(color=Institution),
                 arrow = grid::arrow(type = "closed", 
                                     angle=15, 
                                     length = unit(0.15, "inches")),
                 key_glyph = "custom" ) +               # <- I'm new!
  geom_node_point(aes(size=degree), 
                  color = scales::alpha("black", .4),
                  show.legend = FALSE) +
  geom_node_text(aes(label = name), repel = TRUE) +
  ggtitle("Specimen of a Chart of Genealogy of EDA") + 
  theme_graph() +
  theme(legend.position = 'bottom') 

Going further

One nice thing to do would be to make this diagram interactive, with tool tips, so that hovering over a node would show the details of an individual.

Here’s what I tried, making a text variable that could be used as a tool tip.

add tooltip text to the dataset

EDA_gen2 <- EDA_gen %>% 
  mutate(text = glue(
    "name: {child}<br>
    {Institution}<br>
    PhD year: {PhDyear}<br>
    MGD_id: <a href='{link}'>{MGD_id}</a>"
  ))

add the text to the igraph representation

EDA_graph2 <- igraph::graph_from_data_frame(EDA_gen2[,c(1,3,2,4:8)])

Same plot as before, now built from EDA_graph2 so that the text attribute is present in the graph

p2 <-
  ggraph(EDA_graph2, layout="fr") + 
  geom_edge_link(aes(color=Institution),
                 arrow = grid::arrow(type = "closed", 
                                     angle=15, 
                                     length = unit(0.15, "inches")),
                 key_glyph = "custom" ) +               # <- I'm new!
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE) +
  ggtitle("Specimen of a Chart of Genealogy of EDA") + 
  theme_graph() +
  theme(legend.position = 'bottom') 

Ugh!

I thought I could just use plotly::ggplotly() here. However, most geoms used by ggraph here are not yet implemented in plotly. Many warnings. All I get is the points.

library(plotly)
ggplotly(p2, tooltip="text")

What I’d like

Here is one result I’d like to achieve, shown in a mock-up, but with the tooltip box displaced for readability.

Interactive JS visualization with networkD3 ?

One alternative, using the networkD3 package, was suggested by Udi Alter. It solves the problem of resolution in the graph via zoom and pan: drag/click to focus on a node, mouse wheel to zoom in/out.

This is only a beginning: it still lacks Institution as a grouping variable to color the nodes, hover options, and so on.

simpleNetwork(edges, height="100px", width="100px", zoom = TRUE)
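
One possible next step, sketched here on toy data rather than the real `edges` table, is `networkD3::forceNetwork()`, which accepts a node table with a grouping column used for color. The names `toy_edges`, `nodes`, `links` and the rule for assigning a group (the institution on the edge where the person is the child, with "unknown" for people who have no advisor in the data) are assumptions made for illustration.

```r
library(networkD3)

# Toy edge list standing in for `edges` (hypothetical rows)
toy_edges <- data.frame(
  from        = c("A", "A", "B"),
  to          = c("B", "C", "D"),
  Institution = c("Univ 1", "Univ 1", "Univ 2"),
  stringsAsFactors = FALSE
)

# Node table: one row per person; group = institution on the edge where the
# person appears as the child (first match only); roots get "unknown"
nodes <- data.frame(name = unique(c(toy_edges$from, toy_edges$to)),
                    stringsAsFactors = FALSE)
nodes$group <- toy_edges$Institution[match(nodes$name, toy_edges$to)]
nodes$group[is.na(nodes$group)] <- "unknown"

# forceNetwork() expects zero-based row indices into the node table
links <- data.frame(
  source = match(toy_edges$from, nodes$name) - 1,
  target = match(toy_edges$to,   nodes$name) - 1
)

net <- forceNetwork(
  Links = links, Nodes = nodes,
  Source = "source", Target = "target",
  NodeID = "name", Group = "group",
  opacity = 0.9, zoom = TRUE, legend = TRUE, arrows = TRUE
)
net
```

Because `match()` keeps only the first hit, a person with two advisors at different institutions would be colored by the first one listed; a real version would need a rule for such cases.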
