This script recreates a network diagram showing the genealogy of
those involved in the development of Exploratory
Data Analysis following John W. Tukey,
Frederick Mosteller, … and others, together with some of their students.
It appears in my article, Remembrances
of Things EDA, to be published on Nightingale shortly.
I’m creating this gist to allow others to work with the data and perhaps
produce something better.
The source of the information used here is the Mathematics Genealogy
Project. I selected only a handful of first-level students and some
of their descendants. The base dataset I’m using here includes:
advisor, student, institution,
PhDyear, and MGD_id, the ID number of the
student in the genealogy database.
Creating such a diagram is problematic for a variety of reasons:
- It follows advisor-student relations only a few
steps removed from the principal participants. Not everyone’s students
are shown.
- The connections among all those involved in this history are only
incompletely conveyed in a network diagram.
- There are cross-connections of influence, not indicated by direct
mentorship.
- The main package I used for the diagram at the end,
ggraph, gave messy results, so I had to prune some branches to make it
readable. More generally, this reflects the limitations of a
static graph of fixed resolution for
an article.
Let’s get started. This is published at: https://rpubs.com/friendly/EDA-network
Load packages
library(openxlsx) # Read excel files; this supports using a URL
library(dplyr) # A Grammar of Data Manipulation
library(ggraph) # An Implementation of Grammar of Graphics for Graphs and Networks
library(igraph) # Network Analysis and Visualization
library(ggplot2) # Create Elegant Data Visualisations Using the Grammar of Graphics
library(grid) # The Grid Graphics Package
library(rlang) # Functions for Base Types and Core R and 'Tidyverse' Features
library(glue) # Interpreted String Literals
library(here)
library(networkD3)
Read the data set
To make this script reproducible, read from a cloud URL. It seems
that only {openxlsx} supports this.
EDA_geneaology <- read.xlsx("https://www.dropbox.com/s/oq3jwvg8bto93ln/EDA-geneaology.xlsx?dl=1")
str(EDA_geneaology)
## 'data.frame': 40 obs. of 5 variables:
## $ advisor : chr "Solomon Lefschetz" "John Tukey" "John Tukey" "John Tukey" ...
## $ institution: chr "Princeton" "Princeton" "Princeton" "Princeton" ...
## $ student : chr "John Tukey" "Arthur Dempster" "Leo Goodman" "David Hoaglin" ...
## $ PhDyear : num 1939 1956 1950 1971 1946 ...
## $ MGD_id : num 15860 15981 35023 35266 35033 ...
Fixup a few things
igraph wants parent and child
as the edge list variables. Make “Institution” appear in the legend, not
“institution”.
EDA_gen <- EDA_geneaology %>%
  rename(parent = advisor,
         child = student,
         Institution = institution)
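As a minimal sketch of how igraph reads such an edge list: `graph_from_data_frame()` treats the first two columns as the two ends of each edge and keeps the remaining columns as edge attributes. The three rows below are taken from the `str()` output above; the object names `toy` and `toy_graph` are only for illustration.

```r
library(igraph)

# Three advisor-student rows copied from the str() output above
toy <- data.frame(
  parent      = c("Solomon Lefschetz", "John Tukey", "John Tukey"),
  Institution = c("Princeton", "Princeton", "Princeton"),
  child       = c("John Tukey", "Arthur Dempster", "Leo Goodman"),
  PhDyear     = c(1939, 1956, 1950),
  stringsAsFactors = FALSE
)

# Columns 1 and 2 become from/to, so parent and child must come first;
# Institution and PhDyear are carried along as edge attributes
toy_graph <- graph_from_data_frame(
  toy[, c("parent", "child", "Institution", "PhDyear")]
)

vcount(toy_graph)      # number of distinct people
ecount(toy_graph)      # number of advisor-student links
E(toy_graph)$PhDyear   # edge attribute taken from the data frame
```

This is why the column order matters more than the column names when the data frame is handed to igraph.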
Clean up some links not to be shown. Add an explicit
link from MGD_id to the genealogy database.
EDA_gen <- EDA_gen %>%
  mutate(main = (child %in% c("John Tukey", "Harold Gulliksen")) ) %>%
  filter( !(parent %in% c("Solomon Lefschetz", "James Angell")) ) %>%
  filter( !(child %in% c("Clyde Coombs", "Charles Lewis"))) %>%
  mutate(link = glue::glue("https://www.genealogy.math.ndsu.nodak.edu/id.php?id={MGD_id}"))
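To illustrate the link construction on its own, here is a small sketch using two MGD_id values that appear in the output above (15981 for Arthur Dempster, 35023 for Leo Goodman). The data frame `ids` is hypothetical; `glue_data()` is used so that `{MGD_id}` is looked up as a column, much as it is inside `mutate()`.

```r
library(glue)

# Two students and their IDs, as listed in the str() output above
ids <- data.frame(
  child  = c("Arthur Dempster", "Leo Goodman"),
  MGD_id = c(15981, 35023),
  stringsAsFactors = FALSE
)

# {MGD_id} is replaced row by row with the value from the data frame
ids$link <- glue_data(
  ids,
  "https://www.genealogy.math.ndsu.nodak.edu/id.php?id={MGD_id}"
)

ids$link
```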
Examine the igraph object
See: https://kateto.net/wp-content/uploads/2018/06/Polnet%202018%20R%20Network%20Visualization%20Workshop.pdf
for working with igraph
Extract edges as a data frame
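# EDA_graph is not created anywhere in the code shown; presumably it was built
# like EDA_graph2 below, with parent and child as the first two columns.
EDA_graph <- igraph::graph_from_data_frame(EDA_gen[, c(1, 3, 2, 4:7)])
edges <- igraph::as_data_frame(EDA_graph, what="edges")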
glimpse(edges)
## Rows: 35
## Columns: 7
## $ from <chr> "John Tukey", "John Tukey", "John Tukey", "John Tukey", "S~
## $ to <chr> "Arthur Dempster", "Leo Goodman", "David Hoaglin", "Freder~
## $ Institution <chr> "Princeton", "Princeton", "Princeton", "Princeton", "Princ~
## $ PhDyear <dbl> 1956, 1950, 1971, 1946, 1946, 1974, 1968, 1977, 1975, 1953~
## $ MGD_id <dbl> 15981, 35023, 35266, 35033, 35033, 18747, 58815, 13739, 23~
## $ main <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA~
## $ link <chr> "https://www.genealogy.math.ndsu.nodak.edu/id.php?id=15981~
Get graph vertices: V() returns vertices;
E() returns edges.
igraph::V(EDA_graph)
## + 37/37 vertices, named, from 6f9cbf9:
## [1] John Tukey Samuel Wilks Frederick Mosteller
## [4] Robert Abelson Harold Gulliksen Arthur Dempster
## [7] John Hartigan Louis Leon Thurstone Leo Goodman
## [10] Ledyard Tucker Andreas Buja Dianne Cook
## [13] Peter Huber Gordon Foster Antony Unwin
## [16] Heike Hofmann David Hoaglin Persi Diaconis
## [19] Stephen Fienberg Stanley Wasserman Lee Wilkinson
## [22] Michael Friendly Howard Wainer Paul Velleman
## [25] Richard Heiberger Karen Kafadar Jay Emerson
## [28] Sanford Weisberg James Ramsay William Eddy
## + ... omitted several vertices
Get node degrees. I use this to make Tukey/Mosteller most prominent,
but show all in relation to the number of links.
deg <- igraph::degree(EDA_graph, mode="all")
V(EDA_graph)$degree <- deg
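To make clear what `mode="all"` counts, here is a minimal sketch on a toy graph built from three advisor-student rows that appear in the data above. The names `toy_graph`, `deg_all`, `deg_out` and `deg_in` are only for illustration.

```r
library(igraph)

# Lefschetz -> Tukey, Tukey -> Dempster, Tukey -> Goodman
toy_graph <- graph_from_data_frame(data.frame(
  parent = c("Solomon Lefschetz", "John Tukey", "John Tukey"),
  child  = c("John Tukey", "Arthur Dempster", "Leo Goodman"),
  stringsAsFactors = FALSE
))

deg_all <- degree(toy_graph, mode = "all")  # advisor link + student links
deg_out <- degree(toy_graph, mode = "out")  # students only
deg_in  <- degree(toy_graph, mode = "in")   # advisors only

deg_all[["John Tukey"]]  # 3: one advisor plus two students
deg_out[["John Tukey"]]  # 2: the two students

# Store the degree as a vertex attribute so a plot can map size to it
V(toy_graph)$degree <- deg_all
```

With `mode = "all"`, a person's own advisor counts toward their size in the plot; `mode = "out"` would size nodes by number of students shown instead.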
Plotting with ggraph
The ggraph
package was proposed as a “Grammar of Graphics for Graphs and Networks”.
Yet, I’m finding it somewhat incomplete (had to add $degree
to the vertices using igraph::V()) and hard to totally
understand.
I’m using filled arrows for the edges, but these did not appear in the
legend. From this
Stack Overflow post I found how to modify GeomEdgePath()
to create a custom key function.
draw_key_custom = function(data, params, size) {
  segmentsGrob(0.1, 0.5, 0.9, 0.5,
    gp = gpar(
      col = alpha(data$edge_colour, data$edge_alpha),
      fill = alpha(data$edge_colour, data$edge_alpha), # <- add fill to arrow head!
      lwd = data$edge_width * .pt,
      lty = data$edge_linetype,
      lineend = 'butt'
    ),
    arrow = params$arrow
  )
}
Choice of layouts
There is a large variety of graph layouts that could be used here.
Details and examples at: https://www.data-imaginist.com/2017/ggraph-introduction-layouts/
See: https://i.stack.imgur.com/3QAMW.png for pictorial
examples. After some experimentation, I chose fr, the
Fruchterman-Reingold force-directed layout.
igraph_layouts <- c('star', 'circle', 'gem', 'dh', 'graphopt', 'grid', 'mds',
'randomly', 'fr', 'kk', 'drl', 'lgl')
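One practical point when experimenting with force-directed layouts such as fr: they start from random positions, so the picture changes from run to run unless the random seed is fixed. The sketch below shows this with igraph's `layout_with_fr()` on a toy graph; the seed value 42 is arbitrary, and the assumption is that ggraph's `layout = "fr"` relies on the same igraph routine.

```r
library(igraph)

toy_graph <- graph_from_data_frame(data.frame(
  parent = c("Solomon Lefschetz", "John Tukey", "John Tukey"),
  child  = c("John Tukey", "Arthur Dempster", "Leo Goodman"),
  stringsAsFactors = FALSE
))

# Same seed, same starting positions, same final coordinates
set.seed(42)
xy1 <- layout_with_fr(toy_graph)
set.seed(42)
xy2 <- layout_with_fr(toy_graph)

dim(xy1)             # one row per vertex, columns are x and y
all.equal(xy1, xy2)  # the two seeded runs agree
```

Calling `set.seed()` just before `ggraph()` should therefore give the same arrangement each time the script is run.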
Draw the diagram
This uses size = degree to set the vertex size. Edge
links are colored by Institution. The legend
key_glyph overrides the ggraph default, using
my draw_key_custom() function to fill the arrow heads.
ggraph(EDA_graph, layout="fr") +
  geom_edge_link(aes(color=Institution),
                 arrow = grid::arrow(type = "closed",
                                     angle=15,
                                     length = unit(0.15, "inches")),
                 key_glyph = "custom" ) + # <- I'm new!
  geom_node_point(aes(size=degree),
                  color = scales::alpha("black", .4),
                  show.legend = FALSE) +
  geom_node_text(aes(label = name), repel = TRUE) +
  ggtitle("Specimen of a Chart of Geneaology of EDA") +
  theme_graph() +
  theme(legend.position = 'bottom')

Going further
One nice thing to do would be to make this diagram interactive, with
tool tips, so that hovering over a node would show the details of an
individual.
Here’s what I tried, making a text variable that could
be used as a tool tip.
add tooltip text to the dataset
EDA_gen2 <- EDA_gen %>%
mutate(text = glue(
"name: {child}<br>
{Institution}<br>
PhD year: {PhDyear}<br>
MGD_id: <a href='{link}'>{MGD_id}</a>"
))
add the text to the igraph representation
EDA_graph2 <- igraph::graph_from_data_frame(EDA_gen2[,c(1,3,2,4:8)])
Same plot as before
p2 <-
  ggraph(EDA_graph2, layout="fr") +  # EDA_graph2 carries the tooltip text
  geom_edge_link(aes(color=Institution),
                 arrow = grid::arrow(type = "closed",
                                     angle=15,
                                     length = unit(0.15, "inches")),
                 key_glyph = "custom" ) + # <- I'm new!
  geom_node_point() +
  geom_node_text(aes(label = name), repel = TRUE) +
  ggtitle("Specimen of a Chart of Geneaology of EDA") +
  theme_graph() +
  theme(legend.position = 'bottom')
Ugh!
I thought I could just use plotly::ggplotly() here.
However, most geoms used by ggraph here are not yet
implemented in plotly. Many warnings. All I get is the
points.
library(plotly)
ggplotly(p2, tooltip="text")
What I’d like
Here is one result I’d like to achieve, shown in a mock-up, but with
the tooltip box displaced for readability.

Interactive JS visualization with networkD3?
One alternative, using the networkD3 package, was
suggested by Udi Alter. It solves the problem of resolution in the graph
via zoom and pan: drag/click to focus on a node, mouse wheel to zoom
in/out.
This is only a beginning: it still lacks Institution as a grouping variable to
color the nodes, hover options, etc.
simpleNetwork(edges, height="100px", width="100px", zoom = TRUE)
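One possible next step, sketched here on toy data rather than the full edge list, is networkD3's `forceNetwork()`, which accepts a `Group` column for coloring nodes. It needs a separate node table and zero-based source/target indices. The rule used below, taking a person's group from the row in which they appear as a child and labeling anyone else "(not in data)", is an assumption made for this sketch, not something from the original analysis.

```r
library(networkD3)

# Three rows taken from the data shown earlier
links <- data.frame(
  parent      = c("Solomon Lefschetz", "John Tukey", "John Tukey"),
  child       = c("John Tukey", "Arthur Dempster", "Leo Goodman"),
  Institution = c("Princeton", "Princeton", "Princeton"),
  stringsAsFactors = FALSE
)

# Node table: one row per person; group = institution of their own PhD row
node_names <- unique(c(links$parent, links$child))
nodes <- data.frame(
  name  = node_names,
  group = links$Institution[match(node_names, links$child)],
  stringsAsFactors = FALSE
)
nodes$group[is.na(nodes$group)] <- "(not in data)"  # roots have no PhD row

# forceNetwork() expects zero-based indices into the node table
links$source <- match(links$parent, nodes$name) - 1
links$target <- match(links$child,  nodes$name) - 1

net <- forceNetwork(
  Links = links, Nodes = nodes,
  Source = "source", Target = "target",
  NodeID = "name", Group = "group",
  opacity = 0.9, zoom = TRUE, legend = TRUE, arrows = TRUE
)
net  # prints the widget in an interactive session or knitted document
```

Hovering over a node shows its name; richer tooltips with year and link would need additional customization beyond this sketch.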
