Introduction

This code-through explores how to create a Sankey diagram using the networkD3 package in order to support data visualization for educational research.

Specifically, this code-through will cover:

  • What educational research is and how Sankey diagrams are useful for educational research
  • How my team’s work in educational research specifically has benefited from the use of Sankey diagrams
  • Examples of data collected in research exploring student roles during small group activities demonstrated through a sample data frame
  • How to create a Sankey diagram visualizing changes in student roles throughout the group activities using the sample data

Background Information


Educational research studies the factors that influence educational outcomes and learning processes. Research efforts in education can take the form of qualitative, quantitative, or mixed methods studies, depending on the goal of the study. Currently I work for Washington State University’s Department of Teaching and Learning on a National Science Foundation-funded project exploring deliberative argumentation in a large-lecture undergraduate biology course. One facet of interest for this project is a qualitative case study looking at how student roles change in small groups throughout in-class activities.

Since qualitative studies tend to delve more in-depth and provide relevant context, as opposed to specific statistics regarding a topic, data visualization can be a bit tricky. Sankey diagrams can be useful for both qualitative and quantitative data visualization. These diagrams effectively display weighted flows of data from one node to another, connected by a link (Holtz, n.d.). They can take on a few different data structures: evolution, which displays changes over time through a series of replicated nodes, or source-to-end flow, in which each node is unique and the flow depicts changes from start to end (Holtz, n.d.).

Sankey diagrams are composed of nodes and links, along with instructions dictating the positions of the nodes (Block, 2021). A node is a junction at which a link changes direction. Links are the lines that connect nodes and display the flow from one node to another. Link width is also important, as it represents the weighted value. Last but not least, instructions are necessary to indicate the relationships, or links, between individual nodes as well as each link’s relative position. Position can be determined automatically by an algorithm or may be specified directly through parameters (Block, 2021).
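To make the node and link structure concrete, here is a minimal sketch using made-up values. The object names (toy.nodes, toy.links, toy.plot), the node names, and the link weights are all hypothetical and are not part of the sample data used later. Each link refers to its source and target node by row position, counting from zero, which is the indexing networkD3 expects.

```r
# Toy nodes: one row per node (names are hypothetical)
toy.nodes <- data.frame(name = c("Start", "Middle", "End"))

# Toy links: source and target are 0-based row positions in toy.nodes;
# value sets the width of each link
toy.links <- data.frame(source = c(0, 0, 1),
                        target = c(1, 2, 2),
                        value  = c(2, 1, 2))

# Every link must point at an existing node
stopifnot(all(c(toy.links$source, toy.links$target) < nrow(toy.nodes)))

# Build the diagram only if networkD3 is installed
if (requireNamespace("networkD3", quietly = TRUE)) {
  toy.plot <- networkD3::sankeyNetwork(Links = toy.links, Nodes = toy.nodes,
                                       Source = "source", Target = "target",
                                       Value = "value", NodeID = "name")
}
```

Printing toy.plot in an interactive session displays the diagram. The link from "Start" to "Middle" should appear twice as wide as the link from "Start" to "End" because its value is twice as large.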

For my team, a Sankey diagram depicting the evolution of students’ roles in small groups is an extremely effective way to visualize the data we have collected. It allows us to show flow for individual students across time while still signifying which students are in a group together. Additionally, if students are absent for certain activities, the Sankey diagram helps visualize any potential role changes for the remaining students. While these Sankey diagrams won’t capture the full picture of all the data and associated analysis (e.g., dialogue, resources referenced, specific student interactions), they will help capture the evolution of group roles in response to the different activities and changes in group membership. Visually, they help represent our answer to our research question(s), such as:

“How do student roles in small groups change over the course of three in-class activities?”


Steps to Create a Sankey Diagram:

Complete the following steps to create a Sankey diagram:

  1. Import/create sample data frame containing student role data.
  2. Create a links data frame by transforming the original data frame used.
  3. Create a nodes data frame by transforming the links data frame we created.
  4. Provide instructions for how links and nodes relate to one another.
  5. Plot the Sankey diagram and specify parameters.

Step 1: Adding Data

Disclaimer: The sample data provided below is inspired by my work with Washington State University’s Department of Teaching and Learning. The names and roles provided are not based on actual student data and were merely created to use as an example to support the exploration of how educational research benefits from data visualization using Sankey diagrams.

In this sample data there are four potential roles, two of which are leadership styles and two of which are styles of non-leading group members:

  • Tyrant: A domineering leader who assumes power without consulting other group members
  • Partnership: When two or more group members equally share leadership
  • Critic: A non-leading member who actively disagrees with or questions the leader(s) or other members
  • Follower: A non-leading member who does not actively disagree with or question the leader(s) or other members

First, let’s look at the data frame for which we will be making a Sankey diagram. In this data frame, we have data about the roles students assume while completing three consecutive activities in small groups of four. For simplicity, the only data included are the students’ names, groups, and their respective roles for activities 1-3. Notice that while we have a group vector, we will not be including it in the role.df data frame. Instead we will make a smaller data frame, students.df, containing only the student.names and student.group vectors. This will make more sense when we get to Step 5.

# Packages assumed by this code-through
library(networkD3) # sankeyNetwork()
library(dplyr)     # %>% pipe
library(pander)    # pander() table output

student.names <- c("Christine", "Mike", "Patrick", "Natalie", 
                   "Renee", "Jacob", "Kathy", "John")
role.1 <- factor(c("Follower", "Tyrant", "Critic", "Follower", 
                   "Follower", "Critic", "Partnership", "Partnership"))
role.2 <- factor(c("Critic", "Tyrant", "Critic", "Follower", 
                   "Follower", "Critic", "Partnership", "Partnership"))
role.3 <- factor(c("Partnership", "Partnership", "Critic", "Follower", 
                   "Follower", "Critic", "Partnership", "Partnership"))

student.group <- c("A", "A", "A", "A", "B", "B", "B", "B")
students.df <- data.frame(student.names, student.group) # use later in Step 5

role.df <- data.frame(student.names, role.1, role.2, role.3)
role.df %>% pander()

student.names  role.1       role.2       role.3
Christine      Follower     Critic       Partnership
Mike           Tyrant       Tyrant       Partnership
Patrick        Critic       Critic       Critic
Natalie        Follower     Follower     Follower
Renee          Follower     Follower     Follower
Jacob          Critic       Critic       Critic
Kathy          Partnership  Partnership  Partnership
John           Partnership  Partnership  Partnership


Step 2: Creating a Data Frame for Links

Now that we have generated a role.df data frame containing all relevant student data, the second stage in creating a Sankey diagram is to transform this data into a format that the sankeyNetwork() function of the networkD3 package can use to generate our Sankey diagram. The data must be transformed so that the students’ names and roles become nodes, with links connecting them in the proper order of activities. The links data frame, links.df, is what specifies the correct order by identifying the source node, target node, and relative link value at each time period we have data for.

For the rest of this code-through we will follow the process described by CJ Yetman (2018) in this StackOverflow thread.
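One way to build links.df is sketched below in base R. It is an assumption about the construction, not a copy of the code in Yetman’s answer, but it reproduces the row, column, source, and target columns shown in the Step 4 output: each value is linked to the value in the column to its right, and a column-number suffix keeps nodes from different activities distinct. The helper names n.col and links.list are introduced here for illustration only.

```r
# Sample data from Step 1, stored as plain character columns
role.df <- data.frame(
  student.names = c("Christine", "Mike", "Patrick", "Natalie",
                    "Renee", "Jacob", "Kathy", "John"),
  role.1 = c("Follower", "Tyrant", "Critic", "Follower",
             "Follower", "Critic", "Partnership", "Partnership"),
  role.2 = c("Critic", "Tyrant", "Critic", "Follower",
             "Follower", "Critic", "Partnership", "Partnership"),
  role.3 = c("Partnership", "Partnership", "Critic", "Follower",
             "Follower", "Critic", "Partnership", "Partnership"),
  stringsAsFactors = FALSE
)

n.col <- ncol(role.df)

# For each student (row), link every column to the column on its right
links.list <- lapply(seq_len(nrow(role.df)), function(i) {
  values <- vapply(role.df[i, ], as.character, character(1))
  data.frame(row    = i,
             column = seq_len(n.col - 1),
             # the suffix is the column position of the node
             source = paste0(values[-n.col], "_", seq_len(n.col - 1)),
             target = paste0(values[-1], "_", seq_len(n.col)[-1]),
             stringsAsFactors = FALSE)
})

# Stack the per-student links into one data frame
links.df <- do.call(rbind, links.list)

head(links.df, n = 4)
```

With four columns per student, each of the eight students contributes three links, so links.df has 24 rows.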

Step 3: Creating a Data Frame for Nodes

The next step is to create a nodes data frame using the links.df source and target columns, and to change the node labels so that they do not include the activity suffix (e.g., the "_1" in "Follower_1"). The suffix keeps the sources and targets in the right order so that the Sankey diagram is laid out correctly; however, it does not need to appear in the labels on the diagram.

nodes.df <- data.frame(name = unique(c(links.df$source, links.df$target)))
nodes.df$label <- sub('_[0-9]*$', '', nodes.df$name) # what we use as NodeID

head(nodes.df, n = 4) %>% pander()

name         label
Christine_1  Christine
Follower_2   Follower
Critic_3     Critic
Mike_1       Mike
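As a quick illustration of what the sub() call does, the sketch below applies the same pattern to a few node names taken from the tables in this code-through. The vector names node.names and node.labels are used here only for the example.

```r
# A few node names in the "<label>_<column>" format used by links.df
node.names <- c("Christine_1", "Follower_2", "Critic_3", "Partnership_4")

# '_[0-9]*$' matches an underscore followed by digits at the end of the string,
# so only the trailing suffix is removed
node.labels <- sub("_[0-9]*$", "", node.names)

print(node.labels)
```

Because the pattern is anchored to the end of the string with $, an underscore elsewhere in a name would be left alone.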


Great! Now we have created both of the data frames necessary to create a Sankey diagram. All that is left is to provide instructions and parameters for how the sankeyNetwork() function should display the data.

Step 4: Providing Instructions for the Sankey Diagram

Before the sankeyNetwork() function can be used to diagram our data, there are a few final things we must do:

  • Create new 0-based index ID columns for the “source” and “target” columns of links.df, to be used as source and target node IDs.
  • Assign a value to each link, as this is required by the sankeyNetwork() function.

links.df$source_id <- match(links.df$source, nodes.df$name) - 1 # Create source_id column
links.df$target_id <- match(links.df$target, nodes.df$name) - 1 # Create target_id column
links.df$value <- 1                                             # Assign link values

head(links.df, n = 4) %>% pander()

row  column  source       target         source_id  target_id  value
1    1       Christine_1  Follower_2     0          1          1
1    2       Follower_2   Critic_3       1          2          1
1    3       Critic_3     Partnership_4  2          16         1
2    1       Mike_1       Tyrant_2       3          4          1
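The reason for subtracting 1 is that match() returns R’s 1-based positions, while networkD3 expects node IDs that count from zero. The sketch below checks that logic on a handful of names taken from the tables above. The vector names are hypothetical and stand in for the nodes.df and links.df columns.

```r
# A few node names and links taken from the tables above
node.names  <- c("Christine_1", "Follower_2", "Critic_3", "Mike_1", "Tyrant_2")
link.source <- c("Christine_1", "Follower_2", "Mike_1")
link.target <- c("Follower_2", "Critic_3", "Tyrant_2")

# match() returns 1-based positions; subtracting 1 gives the 0-based IDs
source.id <- match(link.source, node.names) - 1
target.id <- match(link.target, node.names) - 1

# Round trip: adding 1 back should recover the original names,
# and no link should point at a missing node
stopifnot(!anyNA(c(source.id, target.id)),
          identical(node.names[source.id + 1], link.source),
          identical(node.names[target.id + 1], link.target))
```

The same round-trip check can be run on the full links.df and nodes.df as a safeguard before plotting, since an NA or out-of-range ID typically results in a broken or empty diagram.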


Step 5: Plotting Our Sankey Diagram

Now it is time to finally plot our Sankey diagram! Before plotting, it is helpful to think about the most effective way to communicate the data. We know we want to show how each individual student’s role changed throughout the three activities, so we could start by assigning a different color to each unique node label (student or role), which is the default for the sankeyNetwork() function.

sankeyNetwork(Links = links.df,     # from Step 2
              Nodes = nodes.df,     # from Step 3
              Source = 'source_id', # from Step 4
              Target = 'target_id', # from Step 4
              Value = 'value',      # from Step 4
              NodeID = 'label',     # from Step 3
              fontSize = 16,        # change font size
              iterations = 0)       # prevent diagram layout changes


This diagram certainly helps depict flow and changes in students’ roles. However, what it doesn’t tell us is how the students relate to one another. We can see that Christine is a follower in the first activity, but we have no idea who she is following. Perhaps there is a better way to display the data.

You may recall that when we were looking at the student data, there was a vector for group. Though we didn’t include it in our role.df data frame, we can still reference it now to help improve our data visualization. Looking at the students.df data frame, we can see that the first four students were in group A and the last four students were in group B.

student.names  student.group
Christine      A
Mike           A
Patrick        A
Natalie        A
Renee          B
Jacob          B
Kathy          B
John           B


When plotting our Sankey diagram, one helpful thing to do is use color to group members of the same groups or roles across activities. While we could have included the vector for groups in our original role.df, it would have added an extra column to our Sankey diagram and could visually clutter the diagram.

Instead, we can use the information from the students.df data frame to choose colors based on group. To do this, we create a new sankey.colors object: a character string containing a JavaScript D3 ordinal scale that lists all the student names and possible roles (.domain) and their associated JavaScript color names (.range). Each student in Group A was assigned the same color (“deepskyblue”), and each student in Group B was assigned “mediumpurple”.

sankey.colors <- 'd3.scaleOrdinal() .domain(["Christine", "Mike","Patrick",
"Natalie", "Renee", "Jacob", "Kathy", "John", "Tyrant", "Critic", "Follower",
"Partnership"]) .range(["deepskyblue", "deepskyblue", "deepskyblue", 
"deepskyblue", "mediumpurple", "mediumpurple", "mediumpurple", "mediumpurple",
"red", "hotpink", "orange", "yellowgreen"])'

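Typing the domain and range by hand gets error-prone as the number of students grows. As an alternative, the sketch below builds an equivalent string from students.df in base R. The lookup vectors group.colors and role.colors, the helper js.array(), and the result name sankey.colors.built are all hypothetical names introduced for this example; the colors are the same ones used above.

```r
students.df <- data.frame(
  student.names = c("Christine", "Mike", "Patrick", "Natalie",
                    "Renee", "Jacob", "Kathy", "John"),
  student.group = c("A", "A", "A", "A", "B", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Lookup tables: one color per group and one per role
group.colors <- c(A = "deepskyblue", B = "mediumpurple")
role.colors  <- c(Tyrant = "red", Critic = "hotpink",
                  Follower = "orange", Partnership = "yellowgreen")

# Domain = node labels; range = the color for each label, in the same order
color.domain <- c(students.df$student.names, names(role.colors))
color.range  <- c(group.colors[students.df$student.group], role.colors)

# Wrap each entry in double quotes and join into a JavaScript array literal
js.array <- function(x) paste0('["', paste(x, collapse = '", "'), '"]')

sankey.colors.built <- paste0("d3.scaleOrdinal().domain(", js.array(color.domain),
                              ").range(", js.array(color.range), ")")

cat(sankey.colors.built)
```

The resulting string can be passed to colourScale in the same way as sankey.colors. Changing a group’s color then only requires editing one entry in group.colors.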

Let’s see what our Sankey diagram looks like now if we pass sankey.colors to the colourScale parameter of the sankeyNetwork() function.

sankeyNetwork(Links = links.df,            # from Step 2
              Nodes = nodes.df,            # from Step 3
              Source = 'source_id',        # from Step 4
              Target = 'target_id',        # from Step 4
              Value = 'value',             # from Step 4
              NodeID = 'label',            # from Step 3
              colourScale = sankey.colors, # specify our color vector
              fontSize = 16,               # change font size
              iterations = 0)              # prevent diagram layout changes


Much better! Now we can trace the changes in individual students’ roles while seeing how their roles relate to one another. For example, we can see that while Christine started out as a follower in a tyrannical group led by Mike, by activity three, she and Mike were co-leaders. We have now successfully created a Sankey diagram that visualizes our data effectively and can help answer how student roles in small groups change over the course of three in-class activities.

Further Resources

Learn more about the research project I am a part of, the networkD3 package, and about customizing colors in Sankey diagrams by visiting the following:

Works Cited

This code-through references and cites the following sources:

  • Block, Tim (2021). Creating Custom Sankey diagrams Using R. Displayr.

  • Holtz, Yan (n.d.). Sankey diagram. from Data to Viz.

  • JavaScripter.net (n.d.). Predefined Color Names - Alphabetical List. JavaScripter.

  • Yetman, CJ (2018). Creating a Sankey diagram using NetworkD3 package in R. StackOverflow.