Introduction

This code-through explores how to create a Sankey diagram using the networkD3 package in order to support data visualization for educational research.

Specifically, this code-through will cover:

What educational research is and how Sankey diagrams are useful for educational research
How my team’s work in educational research specifically has benefited from the use of Sankey diagrams
Examples of data collected in research exploring student roles during small group activities demonstrated through a sample data frame
How to create a Sankey diagram visualizing changes in student roles throughout the group activities using the sample data

Background Information

Educational research studies the factors that influence educational outcomes and learning processes. Research efforts in education can take the form of qualitative, quantitative, or mixed methods studies, depending on the goal of the study. Currently I work for Washington State University’s Department of Teaching and Learning on a National Science Foundation-funded project exploring deliberative argumentation in a large-lecture undergraduate biology course. One facet of interest for this project is a qualitative case study looking at how student roles change in small groups throughout in-class activities.

Since qualitative studies tend to go into more depth and provide relevant context, as opposed to specific statistics regarding a topic, data visualization can be a bit tricky. Sankey diagrams can be useful for both qualitative and quantitative data visualization. These diagrams effectively display weighted flows of data from one node to another, connected by a link (Holtz, n.d.). They can take on a few different data structures: evolution, which displays changes over time through a series of replicated nodes, or source-to-end flow, in which each node is unique and the flow depicts changes from start to end (Holtz, n.d.).

Sankey diagrams are composed of nodes and links, along with instructions dictating the positions of the nodes (Block, 2021). A node is a junction at which a link changes direction. Links are the lines that connect nodes and display the flow from one node to another. Link width is also important, as it represents the weighted value of the flow. Last but not least, instructions are necessary to indicate the relationships, or links, between individual nodes as well as each link’s relative position. Position can be determined automatically by an algorithm or may be specified directly through parameters (Block, 2021).
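
To make this vocabulary concrete, here is a minimal sketch of the two tables a Sankey diagram is built from. The node names and values are made up for illustration, and the zero-based IDs anticipate the format used later in Step 4.

```r
# Hypothetical toy example: three nodes and two weighted links.
toy.nodes <- data.frame(name = c("Start", "Middle", "End"),
                        stringsAsFactors = FALSE)

# Links refer to nodes by zero-based position in toy.nodes
# (0 = "Start", 1 = "Middle", 2 = "End"); value sets the link width.
toy.links <- data.frame(source = c(0, 1),
                        target = c(1, 2),
                        value  = c(3, 1))

# Look up the node names each link connects (add 1 because R indexes from 1).
toy.links$from <- toy.nodes$name[toy.links$source + 1]
toy.links$to   <- toy.nodes$name[toy.links$target + 1]
toy.links
```

Each row of toy.links describes one flow: which node it leaves, which node it enters, and how wide it should be drawn.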

For my team, a Sankey diagram depicting the evolution of students’ roles in small groups is an extremely effective way to visualize the data we have collected. It allows us to show flow for individual students across time while still signifying which students are in a group together. Additionally, if students are absent for certain activities the Sankey diagram helps visualize any potential role changes for the remaining students. While these Sankey diagrams won’t capture the full picture of all the data and associated analysis (i.e., dialogue, resources referenced, specific student interactions), they will help capture the evolution of group roles in response to the different activities and changes in group membership. Visually, they help represent our answer to our research question(s), such as:

“How do student roles in small groups change over the course of three in-class activities?”

Steps to Create a Sankey Diagram:

Complete the following steps to create a Sankey diagram:

Import/create sample data frame containing student role data.
Create a links data frame by transforming the original data frame used.
Create a nodes data frame by transforming the links data frame we created.
Provide instructions for how links and nodes relate to one another.
Plot the Sankey diagram and specify parameters.

Step 1: Adding Data

Disclaimer: The sample data provided below is inspired by my work with Washington State University’s Department of Teaching and Learning. The names and roles provided are not based on actual student data and were merely created to use as an example to support the exploration of how educational research benefits from data visualization using Sankey diagrams.

In this sample data there are four potential roles, two of which are leadership styles and two of which are styles of non-leading group members:

Tyrant: A domineering leader who assumes power without consulting other group members
Partnership: When two or more group members equally share leadership
Critic: A non-leading member who actively disagrees with or questions the leader(s) or other members
Follower: A non-leading member who does not actively disagree with or question the leader(s) or other members

First, let’s look at the data frame for which we will be making a Sankey diagram. In this data frame, we have data about the roles students assume while completing three consecutive activities in small groups of four. For simplicity, the only data included are the students’ names, groups, and their respective roles for activities 1-3. Notice that while we have a group vector, we will not include it in the role.df data frame. Instead, we will make a smaller data frame, students.df, containing only the student.names and student.group vectors. This will make more sense when we get to Step 5.

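library(dplyr)     # %>%, mutate(), group_by(), lead(), filter()
library(tidyr)     # pivot_longer()
library(networkD3) # sankeyNetwork()
library(pander)    # pander() for printing tables

student.names <- c( "Christine", "Mike", "Patrick", "Natalie", 
                    "Renee", "Jacob", "Kathy", "John")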
role.1 <- factor(c("Follower", "Tyrant", "Critic", "Follower", 
                   "Follower", "Critic", "Partnership", "Partnership"))
role.2 <- factor(c("Critic", "Tyrant", "Critic", "Follower", 
                   "Follower", "Critic", "Partnership", "Partnership"))
role.3 <- factor(c("Partnership", "Partnership", "Critic", "Follower", 
                   "Follower", "Critic", "Partnership", "Partnership"))

student.group <- c("A", "A", "A", "A", "B", "B", "B", "B")
students.df <- data.frame(student.names, student.group) # use later in Step 5

role.df <- data.frame(student.names, role.1, role.2, role.3)
role.df %>% pander()

student.names	role.1	role.2	role.3
Christine	Follower	Critic	Partnership
Mike	Tyrant	Tyrant	Partnership
Patrick	Critic	Critic	Critic
Natalie	Follower	Follower	Follower
Renee	Follower	Follower	Follower
Jacob	Critic	Critic	Critic
Kathy	Partnership	Partnership	Partnership
John	Partnership	Partnership	Partnership

Now that we have generated a role.df data frame containing all relevant student data, the second stage in creating a Sankey diagram is to transform this data into a format that the sankeyNetwork() function of the networkD3 package can use to generate our Sankey diagram. The data must be transformed so that the students’ names and roles become nodes with links connecting them in the proper order of activities. The links data frame is what specifies the correct order by identifying the source node, target node, and relative link value at each time period we have data for.

For the rest of this code-through we will follow the process described by CJ Yetman (2018) in this StackOverflow thread.

Step 2: Creating a Data Frame for Links

Let’s start with creating a data frame for the links that will be used in our Sankey diagram. To create a links data frame we will transform role.df in the following ways:

Create a “source” column
Create a “target” column
Specify the order the links should occur in
Differentiate between roles in each activity

Create a Source Column

2.1. Add row numbers for easy reference.

2.2. Gather role.df columns and pivot them into 3 total columns:

“row” with row numbers
“column” with the original role.df column names (“student.names”, “role.1”, etc.)
“source” with all the role and student names.

links.df <- role.df %>%
  mutate(row = row_number()) %>%                                       # 2.1
  pivot_longer(cols = -row, names_to = "column", values_to = "source") # 2.2

head(links.df, n = 4) %>% pander()

row	column	source
1	student.names	Christine
1	role.1	Follower
1	role.2	Critic
1	role.3	Partnership

Create a Target Column and Specify Link Order

2.3. Convert the “column” values (“student.names”, “role.1”, etc.) to numbers that match the order in which the columns appear in role.df (i.e., “student.names” becomes 1, “role.1” becomes 2, etc.), and then group by row. This will help define order.

2.4. Create a “target” column where each value is the target of that row’s source. The target is defined as the source in the next column, in numerical order, for each student. Thus, a link has been created between each row’s source and target, and the link order has been defined to flow from column 1 to 2 to 3 to 4 for all observations with a given row value.

2.5. Filter out any targets with NA values, such as those that try to go from “role.3” (column 4) to a non-existent fifth column.

links.df <- links.df %>%
  mutate(column = match(column, names(role.df))) %>%   # 2.3
  group_by(row) %>% 
  mutate(target = lead(source, order_by = column)) %>% # 2.4
  filter(!is.na(target)) %>%                           # 2.5
  ungroup()                                            # Always ungroup!

head(links.df, n = 4) %>% pander()

row	column	source	target
1	1	Christine	Follower
1	2	Follower	Critic
1	3	Critic	Partnership
2	1	Mike	Tyrant
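
To see what the grouped lead() call is doing in isolation, here is a small sketch on made-up data for two hypothetical students (Ann and Bob), already in the long format produced by step 2.2. It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical long-format data: one row per student (row) and column position.
toy <- data.frame(row    = c(1, 1, 1, 2, 2, 2),
                  column = c(1, 2, 3, 1, 2, 3),
                  source = c("Ann", "Critic", "Follower",
                             "Bob", "Tyrant", "Tyrant"),
                  stringsAsFactors = FALSE)

toy.links <- toy %>%
  group_by(row) %>%                                     # keep each student separate
  mutate(target = lead(source, order_by = column)) %>%  # next value for the same student
  ungroup()

toy.links$target
# "Critic" "Follower" NA "Tyrant" "Tyrant" NA
```

The last entry for each student has no following column, so lead() returns NA there. Those are the rows that step 2.5 filters out, and grouping by row is what stops Ann's last role from being linked to Bob's name.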

While we have defined the source and target columns, with one link per row, we still need to define a nodes data frame before we can produce a Sankey diagram. However, before we can move on to creating a nodes data frame, we first need to differentiate between sources and targets with the same value, or same role, by the activity they occur in. This is a necessary step because we want to create an evolutionary Sankey diagram showing how student roles change over different activities, meaning we will have nodes with the same name appear multiple times.

For example, the role of “Follower” shows up at least once in every activity. We need to make sure then that in our Sankey diagram there are three separate nodes each of which is called “Follower”. To differentiate between students who were “Followers” in activity one from those in activities two and three we will need to designate each node in our “source” and “target” columns by the activity in which they occurred.
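
A quick sketch of why this matters, using toy values rather than the real data: nodes are identified by name, so identical names collapse into a single node unless something distinguishes them. The underscore-plus-number naming here is one possible scheme.

```r
# The same role observed in three different activities (toy values).
roles    <- c("Follower", "Follower", "Follower")
activity <- c(1, 2, 3)

# Without a suffix, all three observations share one name, so one node.
n.plain <- length(unique(roles))

# With a suffix, each activity gets its own distinct node name.
suffixed   <- paste0(roles, "_", activity)
n.suffixed <- length(unique(suffixed))

c(plain = n.plain, suffixed = n.suffixed)
# plain = 1, suffixed = 3
```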

Differentiate Between Roles in Each Activity

2.6. Rename the “source” and “target” values to include the column they occur in, thereby differentiating values with the same role from one another.

links.df <-
  links.df %>%
  mutate(source = paste0(source, '_', column)) %>%
  mutate(target = paste0(target, '_', column + 1)) %>% # target is the same as the 
  select(row, column, source, target)                  # source for the next column 

head(links.df, n = 4) %>% pander()

row	column	source	target
1	1	Christine_1	Follower_2
1	2	Follower_2	Critic_3
1	3	Critic_3	Partnership_4
2	1	Mike_1	Tyrant_2

Now we have created a links data frame where each row represents a link and has columns identifying that link’s source and target.

Step 3: Creating a Data Frame for Nodes

The next step is to create a nodes data frame using the links.df source and target columns, and to change the node labels so they do not include the numeric suffix (e.g., the "_2" in "Follower_2"). The suffix ensures that sources, targets, and links are ordered correctly in the Sankey diagram; however, it does not need to appear in the labels on the diagram itself.

nodes.df <- data.frame(name = unique(c(links.df$source, links.df$target)))
nodes.df$label <- sub('_[0-9]*$', '', nodes.df$name) # what we use as NodeID

head(nodes.df, n = 4) %>% pander()

name	label
Christine_1	Christine
Follower_2	Follower
Critic_3	Critic
Mike_1	Mike
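
The pattern passed to sub() can be checked on its own. This sketch applies it to a few sample names; "Co_Leader_3" is a hypothetical name added to show that only the trailing suffix is removed, not underscores inside a label.

```r
# Sample node names; "Co_Leader_3" is hypothetical, not part of the data above.
node.names <- c("Christine_1", "Follower_2", "Partnership_4", "Co_Leader_3")

# '_[0-9]*$' matches an underscore followed by digits at the very end of the string.
node.labels <- sub("_[0-9]*$", "", node.names)
node.labels
# "Christine" "Follower" "Partnership" "Co_Leader"
```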

Great! Now we have created both of the data frames necessary to create a Sankey diagram. All that is left is to provide instructions and parameters for how the sankeyNetwork() function should display the data.

Step 4: Providing Instructions for the Sankey diagram

In order to finally diagram our data, there are a few final things we must do to the data before the sankeyNetwork() function can be used:

Create new 0-based-index ID columns for the “source” and “target” columns from links.df to be used as source and target node IDs.
Assign a value to each link, as this is required by the sankeyNetwork() function

links.df$source_id <- match(links.df$source, nodes.df$name) - 1 # Create source_id column
links.df$target_id <- match(links.df$target, nodes.df$name) - 1 # Create target_id column
links.df$value <- 1                                             # Assign link values

head(links.df, n = 4) %>% pander()

row	column	source	target	source_id	target_id	value
1	1	Christine_1	Follower_2	0	1	1
1	2	Follower_2	Critic_3	1	2	1
1	3	Critic_3	Partnership_4	2	16	1
2	1	Mike_1	Tyrant_2	3	4	1
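
Before plotting, it can be worth confirming that every link resolved to a valid node. The sketch below shows one way to do that on a small stand-in for nodes.df and links.df; the toy.* objects are hypothetical, and the same checks could be run on the real data frames.

```r
# Toy stand-ins for nodes.df and links.df.
toy.nodes <- data.frame(name = c("Christine_1", "Follower_2", "Critic_3"),
                        stringsAsFactors = FALSE)
toy.links <- data.frame(source = c("Christine_1", "Follower_2"),
                        target = c("Follower_2", "Critic_3"),
                        stringsAsFactors = FALSE)

# match() returns 1-based positions; subtract 1 for the zero-based IDs.
toy.links$source_id <- match(toy.links$source, toy.nodes$name) - 1
toy.links$target_id <- match(toy.links$target, toy.nodes$name) - 1

# Check 1: no link points at a name missing from the nodes table.
stopifnot(!anyNA(toy.links$source_id), !anyNA(toy.links$target_id))

# Check 2: every ID falls in the valid range 0 .. (number of nodes - 1).
valid.ids <- seq_len(nrow(toy.nodes)) - 1
stopifnot(all(c(toy.links$source_id, toy.links$target_id) %in% valid.ids))

# Check 3: converting IDs back to names recovers the original columns.
round.trip.ok <- identical(toy.nodes$name[toy.links$source_id + 1],
                           toy.links$source)
round.trip.ok
```

An NA in either ID column would mean a source or target name does not appear in the nodes table, which usually points to a typo or a mismatched suffix.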

Step 5: Plotting Our Sankey Diagram

Now it is time to finally plot our Sankey diagram! Before plotting it is helpful to think about what the most effective way to communicate the data will be. We know we want to show how each individual student’s role changed throughout the three different activities, so maybe we assign a different color to each unique observation (student/role), as is the default for the sankeyNetwork() function.

sankeyNetwork(Links = links.df,     # from Step 2
              Nodes = nodes.df,     # from Step 3
              Source = 'source_id', # from Step 4
              Target = 'target_id', # from Step 4
              Value = 'value',      # from Step 4
              NodeID = 'label',     # from Step 3
              fontSize = 16,        # change font size
              iterations = 0)       # prevent diagram layout changes

This diagram certainly helps depict flow and changes in students’ roles. However, what it doesn’t tell us is how the students relate to one another. We can see that Christine is a follower in the first activity, but we have no idea who she is following. Perhaps there is a better way to display the data.

You may recall that when we were looking at the student data, there was a vector for group. Though we didn’t include it in our role.df data frame, we can still reference it now to help improve our data visualization. Looking at the students.df data frame we can see that the first four students were in group A and the last four students were in group B.

student.names	student.group
Christine	A
Mike	A
Patrick	A
Natalie	A
Renee	B
Jacob	B
Kathy	B
John	B

When plotting our Sankey diagram, one helpful thing to do is use color to group members of the same groups or roles across activities. While we could have included the vector for groups in our original role.df, it would have added an extra column to our Sankey diagram and could visually clutter the diagram.

Instead, we can just use the information from the students.df data frame to specify which colors to use in our diagram based on group. To do this, we create a new sankey.colors character string containing a D3 ordinal scale that maps every student name and possible role (.domain) to an associated JavaScript color name (.range). Each of the students in Group A was assigned the same color (“deepskyblue”), and all of the students in Group B were assigned “mediumpurple”.

sankey.colors <- 'd3.scaleOrdinal() .domain(["Christine", "Mike","Patrick",
"Natalie", "Renee", "Jacob", "Kathy", "John", "Tyrant", "Critic", "Follower",
"Partnership"]) .range(["deepskyblue", "deepskyblue", "deepskyblue", 
"deepskyblue", "mediumpurple", "mediumpurple", "mediumpurple", "mediumpurple",
"red", "hotpink", "orange", "yellowgreen"])'

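Typing the scale by hand works for eight students, but the same string could also be assembled from the data. The following is a sketch under stated assumptions: group.colors, role.colors, js.array(), and sankey.colors.built are hypothetical names introduced here, and the color choices simply mirror the ones above.

```r
student.names <- c("Christine", "Mike", "Patrick", "Natalie",
                   "Renee", "Jacob", "Kathy", "John")
student.group <- c("A", "A", "A", "A", "B", "B", "B", "B")

# Hypothetical lookup tables: one color per group and one per role.
group.colors <- c(A = "deepskyblue", B = "mediumpurple")
role.colors  <- c(Tyrant = "red", Critic = "hotpink",
                  Follower = "orange", Partnership = "yellowgreen")

# Domain: every label that can appear on a node. Range: its color, in the same order.
scale.domain <- c(student.names, names(role.colors))
scale.range  <- c(unname(group.colors[student.group]), unname(role.colors))

# Helper (hypothetical): format a character vector as a JavaScript array literal.
js.array <- function(x) paste0('["', paste(x, collapse = '", "'), '"]')

sankey.colors.built <- paste0("d3.scaleOrdinal().domain(", js.array(scale.domain),
                              ").range(", js.array(scale.range), ")")
cat(sankey.colors.built)
```

Because each student's color is looked up from student.group, changing a group assignment or adding a student updates the scale without editing the JavaScript string directly.
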
Let’s see what our Sankey diagram looks like now if we specify our sankey.colors string for the colourScale parameter within the sankeyNetwork() function.

sankeyNetwork(Links = links.df,            # from Step 2
              Nodes = nodes.df,            # from Step 3
              Source = 'source_id',        # from Step 4
              Target = 'target_id',        # from Step 4
              Value = 'value',             # from Step 4
              NodeID = 'label',            # from Step 3
              colourScale = sankey.colors, # specify our color vector
              fontSize = 16,               # change font size
              iterations = 0)              # prevent diagram layout changes

Much better! Now we can trace the changes in individual students’ roles while seeing how their roles relate to one another. For example, we can see that while Christine started out as a follower in a tyrannical group led by Mike, by activity three, she and Mike were co-leaders. We have now successfully created a Sankey diagram that visualizes our data effectively and can help answer how student roles in small groups change over the course of three in-class activities.

Further Resources

Learn more about the research project I am a part of, the networkD3 package, and about customizing colors in Sankey diagrams by visiting the following:

National Science Foundation Research Grant Award Abstract # 1822490 nsf.gov/awardsearch
Info on NetworkD3 Package CRAN
Customizing Colors in Sankey diagrams R-Graph Gallery

Works Cited

This code-through references and cites the following sources:

Block, Tim (2021). Creating Custom Sankey diagrams Using R. DISPLAYR
Holtz, Yan (n.d.). Sankey diagram. from Data to Viz
JavaScripter.net (n.d.). Predefined Color Names - Alphabetical List. JavaScripter
Yetman, CJ (2018). Creating a Sankey diagram using NetworkD3 package in R. StackOverflow

Introduction to Creating Sankey Diagrams for Educational Research in R

Dana Roach

28 April 2022