This code-through explores how to create a Sankey diagram using the networkD3 package in order to support data visualization for educational research.
Specifically, this code-through will cover:
Educational research studies the factors that influence educational outcomes and learning processes. Research efforts in education can take the form of qualitative, quantitative, or mixed methods studies, depending on the goal of the study. Currently I work for Washington State University’s Department of Teaching and Learning on a National Science Foundation-funded project exploring deliberative argumentation in a large-lecture undergraduate biology course. One facet of interest for this project is a qualitative case study looking at how student roles change in small groups throughout in-class activities.
Since qualitative studies tend to go into depth and provide context rather than summary statistics, data visualization can be tricky. Sankey diagrams can be useful for both qualitative and quantitative data visualization. These diagrams display weighted flows of data from one node to another, connected by links (Holtz, n.d.). They can take one of two data structures: evolution, which displays changes over time through a series of replicated nodes, or source-to-end flow, in which each node is unique and the flow depicts changes from start to end (Holtz, n.d.).
Sankey diagrams are composed of nodes and links, with instructions dictating the positions of the nodes (Block, 2021). A node is a junction at which a link changes direction. Links are the lines that connect nodes and display the flow from one node to another. Link width matters because it represents the weighted value of the flow. Finally, instructions are necessary to indicate the relationships, or links, between individual nodes as well as each link’s relative position. Position can be determined automatically by an algorithm or specified directly through parameters (Block, 2021).
For my team, a Sankey diagram depicting the evolution of students’ roles in small groups is an extremely effective way to visualize the data we have collected. It allows us to show flow for individual students across time while still signifying which students are in a group together. Additionally, if students are absent for certain activities, the Sankey diagram helps visualize any potential role changes for the remaining students. While these Sankey diagrams won’t capture the full picture of all the data and associated analysis (e.g., dialogue, resources referenced, specific student interactions), they will help capture the evolution of group roles in response to the different activities and changes in group membership. Visually, they help represent our answer to our research question(s), such as:
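To make the node and link vocabulary concrete, here is a minimal sketch using toy data. The names `toy.nodes` and `toy.links` and their values are hypothetical and not part of the study; the zero-based indexing follows the convention that networkD3 expects for its source and target columns.

```r
# Hypothetical toy data, not taken from the study: three nodes, two links.
toy.nodes <- data.frame(name = c("Start", "Middle", "End"),
                        stringsAsFactors = FALSE)

# Each link identifies its source and target by zero-based row position in
# toy.nodes, and "value" is the weight that determines the link width.
toy.links <- data.frame(
  source = c(0, 1),   # Start -> Middle, Middle -> End
  target = c(1, 2),
  value  = c(3, 1)    # the first link carries three times the weight
)

# Translate the indices back to names to read the links in plain language.
data.frame(
  from  = toy.nodes$name[toy.links$source + 1],
  to    = toy.nodes$name[toy.links$target + 1],
  width = toy.links$value
)
```

Reading the result row by row gives one sentence per link, such as "Start flows to Middle with weight 3".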
“How do student roles in small groups change over the course of three in-class activities?”
Complete the following steps to create a Sankey diagram:
Disclaimer: The sample data provided below is inspired by my work with Washington State University’s Department of Teaching and Learning. The names and roles provided are not based on actual student data and were merely created to use as an example to support the exploration of how educational research benefits from data visualization using Sankey diagrams.
In this sample data there are four potential roles, two of which are leadership styles and two of which are styles of non-leading group members:
First, let’s look at the data frame for which we will be making a Sankey diagram. In this data frame, we have data about the roles students assume while completing three consecutive activities in small groups of four. For simplicity, the only data included are the students’ names, groups, and their respective roles for activities 1-3. Notice that while we have a group vector, we will not be including it in the role.df data frame. Instead, we will make a smaller data frame containing only the student.names and student.group vectors. This will make more sense when we get to Step 5.
library(dplyr)     # mutate(), group_by(), lead(), and the %>% pipe
library(tidyr)     # pivot_longer()
library(pander)    # pander() for printing tables
library(networkD3) # sankeyNetwork()

student.names <- c("Christine", "Mike", "Patrick", "Natalie",
                   "Renee", "Jacob", "Kathy", "John")
role.1 <- factor(c("Follower", "Tyrant", "Critic", "Follower",
                   "Follower", "Critic", "Partnership", "Partnership"))
role.2 <- factor(c("Critic", "Tyrant", "Critic", "Follower",
                   "Follower", "Critic", "Partnership", "Partnership"))
role.3 <- factor(c("Partnership", "Partnership", "Critic", "Follower",
                   "Follower", "Critic", "Partnership", "Partnership"))
student.group <- c("A", "A", "A", "A", "B", "B", "B", "B")
students.df <- data.frame(student.names, student.group) # use later in Step 5
role.df <- data.frame(student.names, role.1, role.2, role.3)
role.df %>% pander()

| student.names | role.1 | role.2 | role.3 |
|---|---|---|---|
| Christine | Follower | Critic | Partnership |
| Mike | Tyrant | Tyrant | Partnership |
| Patrick | Critic | Critic | Critic |
| Natalie | Follower | Follower | Follower |
| Renee | Follower | Follower | Follower |
| Jacob | Critic | Critic | Critic |
| Kathy | Partnership | Partnership | Partnership |
| John | Partnership | Partnership | Partnership |
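Before reshaping anything, it can help to check which students actually change roles, since those are the flows the diagram should make visible. The following is a small sketch on the same sample data; the names `roles` and `changed` are illustrative. Roles are stored as plain character vectors here because R refuses to compare factors whose level sets differ, and role.3 has no "Tyrant" level.

```r
# Sample data from above, stored as character vectors for easy comparison.
roles <- data.frame(
  student.names = c("Christine", "Mike", "Patrick", "Natalie",
                    "Renee", "Jacob", "Kathy", "John"),
  role.1 = c("Follower", "Tyrant", "Critic", "Follower",
             "Follower", "Critic", "Partnership", "Partnership"),
  role.2 = c("Critic", "Tyrant", "Critic", "Follower",
             "Follower", "Critic", "Partnership", "Partnership"),
  role.3 = c("Partnership", "Partnership", "Critic", "Follower",
             "Follower", "Critic", "Partnership", "Partnership"),
  stringsAsFactors = FALSE
)

# TRUE for any student whose role differs between consecutive activities.
changed <- roles$role.1 != roles$role.2 | roles$role.2 != roles$role.3

roles$student.names[changed]
```

In this sample only Christine and Mike change roles, so theirs are the only paths that should cross between role nodes in the finished diagram.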
Now that we have generated a role.df data frame containing all relevant student data, the second stage in creating a Sankey diagram is to build a working framework in which this data can be integrated and transformed into a format that the sankeyNetwork() function of the networkD3 package can use to generate our Sankey diagram. The data must be transformed so that the students’ names and roles become nodes, with links connecting them in the proper order of activities. The links data frame is what specifies the correct order by identifying the source node, target node, and relative link value at each time period we have data for.
For the rest of this code-through we will follow the process described by CJ Yetman (2018) in this StackOverflow thread.
Let’s start by creating a data frame for the links that will be used in our Sankey diagram. To create a links data frame we will transform role.df in the following ways:
2.1. Add row numbers for easy reference.
2.2. Gather role.df columns and pivot them into 3 total columns:
links.df <- role.df %>%
  mutate(row = row_number()) %>%                                        # 2.1
  pivot_longer(cols = -row, names_to = "column", values_to = "source")  # 2.2
head(links.df, n = 4) %>% pander()

| row | column | source |
|---|---|---|
| 1 | student.names | Christine |
| 1 | role.1 | Follower |
| 1 | role.2 | Critic |
| 1 | role.3 | Partnership |
2.3. Convert the “column” variable names (“student.names”, “role.1”, etc.) to numbers that match the order in which the columns appear in role.df (i.e., “student.names” becomes 1, “role.1” becomes 2, etc.), and then group by row. This will help define order.
2.4. Create a “target” column where each value is the target of that row’s source. The target is defined as the source of the next column, in numerical order, for each student. Thus, a link is created between each row’s source and target, and the link order flows from column 1 to 2 to 3 to 4 for all observations with the same row value.
2.5. Filter out any targets with NA values, such as those that try to go from “role.3” (column 4) to a non-existent fifth column.
links.df <- links.df %>%
  mutate(column = match(column, names(role.df))) %>%    # 2.3
  group_by(row) %>%
  mutate(target = lead(source, order_by = column)) %>%  # 2.4
  filter(!is.na(target)) %>%                            # 2.5
  ungroup()                                             # Always ungroup!
head(links.df, n = 4) %>% pander()

| row | column | source | target |
|---|---|---|---|
| 1 | 1 | Christine | Follower |
| 1 | 2 | Follower | Critic |
| 1 | 3 | Critic | Partnership |
| 2 | 1 | Mike | Tyrant |
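To see what lead() contributes in step 2.4, it helps to run it on a single student’s path in isolation. This is a minimal sketch; the `path` vector is an illustrative stand-in for one group of rows in links.df.

```r
library(dplyr)

# One student's values across the four columns, already in column order.
path <- c("Christine", "Follower", "Critic", "Partnership")

# lead() shifts the vector up by one position, so each element is paired
# with the element that follows it. The last element has no successor and
# gets NA, which is the row that step 2.5 filters out.
data.frame(source = path, target = lead(path))
```

Grouping by row before calling lead() is what keeps one student’s last role from being paired with the next student’s name.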
While we have defined the source and target columns, with one link per row, we still need to define a nodes data frame before we can produce a Sankey diagram. However, before we can move on to creating a nodes data frame, we first need to differentiate between sources and targets with the same value, or same role, by the activity they occur in. This is a necessary step because we want to create an evolutionary Sankey diagram showing how student roles change over different activities, meaning nodes with the same name will appear multiple times.
For example, the role of “Follower” shows up at least once in every activity. We need to make sure then that in our Sankey diagram there are three separate nodes each of which is called “Follower”. To differentiate between students who were “Followers” in activity one from those in activities two and three we will need to designate each node in our “source” and “target” columns by the activity in which they occurred.
2.6 Rename the “source” and “target” variables to include data about what column they occur in, therefore differentiating variables with the same role from one another.
links.df <- links.df %>%
  mutate(source = paste0(source, '_', column)) %>%
  # each target is the same node as the source for the next column
  mutate(target = paste0(target, '_', column + 1)) %>%
  select(row, column, source, target)
head(links.df, n = 4) %>% pander()

| row | column | source | target |
|---|---|---|---|
| 1 | 1 | Christine_1 | Follower_2 |
| 1 | 2 | Follower_2 | Critic_3 |
| 1 | 3 | Critic_3 | Partnership_4 |
| 2 | 1 | Mike_1 | Tyrant_2 |
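The effect of the suffix can be checked on a single student. The sketch below uses Natalie, who is a “Follower” in all three activities; the vector names are illustrative.

```r
# Natalie's role in activities 1-3, which sit in columns 2-4 of role.df.
roles   <- c("Follower", "Follower", "Follower")
columns <- c(2, 3, 4)

# Without a suffix, all three values collapse into a single node name.
length(unique(roles))

# With the column suffix, each activity gets its own node.
unique(paste0(roles, "_", columns))
```

Without the suffix, every “Follower” link would start and end at the same node, and the diagram could not show movement between activities.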
Now we have created a links data frame where each row represents a link and has columns with unique values for each link’s source and target.
The next step is to create a nodes data frame from the links.df source and target columns, and to change the node labels so they do not include the column suffix (e.g., the "_2" in "Follower_2"). We have specified the order in which sources, targets, and links occur so that the Sankey diagram is ordered correctly; however, the suffix does not need to appear in the labels on the diagram.
nodes.df <- data.frame(name = unique(c(links.df$source, links.df$target)))
nodes.df$label <- sub('_[0-9]*$', '', nodes.df$name) # what we use as NodeID
head(nodes.df, n = 4) %>% pander()

| name | label |
|---|---|
| Christine_1 | Christine |
| Follower_2 | Follower |
| Critic_3 | Critic |
| Mike_1 | Mike |
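The pattern passed to sub() only removes a trailing underscore and the digits after it. A quick sketch on a few node names shows the behavior, including one hypothetical name, "Group_2_3", that is not in the sample data and is included only to show that an earlier underscore is left alone.

```r
# Three node names from the sample data plus one hypothetical name.
node.names <- c("Christine_1", "Follower_2", "Partnership_4", "Group_2_3")

# '_[0-9]*$' is anchored to the end of the string, so only the final
# "_<digits>" is stripped.
sub("_[0-9]*$", "", node.names)
```

The flip side is that a student name or role that naturally ended in an underscore and digits would be shortened as well, so it is worth checking labels after this step when working with other data.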
Great! Now we have created both of the data frames necessary to create a Sankey diagram. All that is left is to provide instructions and parameters for how the sankeyNetwork() function should display the data.
Before we can finally diagram our data, there are a few last things we must do to the data so that the sankeyNetwork() function can use it:
links.df$source_id <- match(links.df$source, nodes.df$name) - 1 # Create source_id column
links.df$target_id <- match(links.df$target, nodes.df$name) - 1 # Create target_id column
links.df$value <- 1                                             # Assign link values
head(links.df, n = 4) %>% pander()

| row | column | source | target | source_id | target_id | value |
|---|---|---|---|---|---|---|
| 1 | 1 | Christine_1 | Follower_2 | 0 | 1 | 1 |
| 1 | 2 | Follower_2 | Critic_3 | 1 | 2 | 1 |
| 1 | 3 | Critic_3 | Partnership_4 | 2 | 16 | 1 |
| 2 | 1 | Mike_1 | Tyrant_2 | 3 | 4 | 1 |
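The subtraction of 1 is there because match() returns one-based row positions, while networkD3 expects the source and target columns to be zero-indexed. A small sketch with a few sanity checks follows; the `nodes` and `links` data frames are cut-down illustrations, not the full data from above.

```r
# A cut-down illustration of the nodes and links built earlier.
nodes <- data.frame(name = c("Christine_1", "Follower_2", "Critic_3"),
                    stringsAsFactors = FALSE)
links <- data.frame(source = c("Christine_1", "Follower_2"),
                    target = c("Follower_2", "Critic_3"),
                    stringsAsFactors = FALSE)

# match() gives the one-based row of each name in nodes; subtracting 1
# converts it to the zero-based index used on the JavaScript side.
links$source_id <- match(links$source, nodes$name) - 1
links$target_id <- match(links$target, nodes$name) - 1

# Sanity checks: every name was found, and every id points at a real row.
stopifnot(!anyNA(links$source_id), !anyNA(links$target_id))
stopifnot(all(c(links$source_id, links$target_id) %in% (seq_len(nrow(nodes)) - 1)))

links
```

An NA in either id column would mean a link refers to a node name that is missing from the nodes data frame, which is usually easier to catch here than in the rendered diagram.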
Now it is time to finally plot our Sankey diagram! Before plotting, it is helpful to think about the most effective way to communicate the data. We know we want to show how each individual student’s role changed throughout the three different activities, so maybe we assign a different color to each unique observation (student/role), as is the default for the sankeyNetwork() function.
sankeyNetwork(Links = links.df,     # from Step 2
              Nodes = nodes.df,     # from Step 3
              Source = 'source_id', # from Step 4
              Target = 'target_id', # from Step 4
              Value = 'value',      # from Step 4
              NodeID = 'label',     # from Step 3
              fontSize = 16,        # change font size
              iterations = 0)       # prevent diagram layout changes

This diagram certainly helps depict flow and changes in students’ roles. However, what it doesn’t tell us is how the students relate to one another. We can see that Christine is a follower in the first activity, but we have no idea who she is following. Perhaps there is a better way to display the data.
You may recall that when we were looking at the student data, there was a vector for group. Though we didn’t include it in our role.df data.frame, we can still reference it now to help improve our data visualization. Looking at the students.df data frame we can see that the first four students were in group A and the last four students were in group B.
| student.names | student.group |
|---|---|
| Christine | A |
| Mike | A |
| Patrick | A |
| Natalie | A |
| Renee | B |
| Jacob | B |
| Kathy | B |
| John | B |
When plotting our Sankey diagram, one helpful thing to do is use color to group members of the same groups or roles across activities. While we could have included the vector for groups in our original role.df, it would have added an extra column to our Sankey diagram and could visually clutter it.
Instead, we can use the information from the students.df data frame to specify which colors to use in our diagram based on group. To do this, we create a new sankey.colors character string containing JavaScript that defines a D3 ordinal scale: a list of all the student names and possible roles (.domain) and their associated color names (.range). Each of the students in Group A is assigned the same color (“deepskyblue”), and all of the students in Group B are assigned “mediumpurple”.
sankey.colors <- 'd3.scaleOrdinal() .domain(["Christine", "Mike", "Patrick",
"Natalie", "Renee", "Jacob", "Kathy", "John", "Tyrant", "Critic", "Follower",
"Partnership"]) .range(["deepskyblue", "deepskyblue", "deepskyblue",
"deepskyblue", "mediumpurple", "mediumpurple", "mediumpurple", "mediumpurple",
"red", "hotpink", "orange", "yellowgreen"])'

Let’s see what our Sankey diagram looks like now if we specify our sankey.colors string for the colourScale parameter within the sankeyNetwork() function.
sankeyNetwork(Links = links.df,            # from Step 2
              Nodes = nodes.df,            # from Step 3
              Source = 'source_id',        # from Step 4
              Target = 'target_id',        # from Step 4
              Value = 'value',             # from Step 4
              NodeID = 'label',            # from Step 3
              colourScale = sankey.colors, # specify our color scale
              fontSize = 16,               # change font size
              iterations = 0)              # prevent diagram layout changes

Much better! Now we can trace the changes in individual students’ roles while seeing how their roles relate to one another. For example, we can see that while Christine started out as a follower in a tyrannical group led by Mike, by activity three, she and Mike were co-leaders. We have now successfully created a Sankey diagram that visualizes our data effectively and can help answer how student roles in small groups change over the course of three in-class activities.
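The sankey.colors string above was typed by hand, which becomes tedious with more students. As a sketch of one alternative, the same kind of string could be assembled from students.df. The helper `js.array()` and the lookup vectors `group.colors` and `role.colors` are hypothetical names introduced here for illustration; the colors themselves are the ones used above.

```r
students.df <- data.frame(
  student.names = c("Christine", "Mike", "Patrick", "Natalie",
                    "Renee", "Jacob", "Kathy", "John"),
  student.group = c("A", "A", "A", "A", "B", "B", "B", "B"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup tables: one color per group and one per role.
group.colors <- c(A = "deepskyblue", B = "mediumpurple")
role.colors  <- c(Tyrant = "red", Critic = "hotpink",
                  Follower = "orange", Partnership = "yellowgreen")

# Domain: every student name, then every role.
# Range: each student's group color, then each role's color, in the same order.
color.domain <- c(students.df$student.names, names(role.colors))
color.range  <- c(unname(group.colors[students.df$student.group]),
                  unname(role.colors))

# Quote each entry and join the entries into a JavaScript array literal.
js.array <- function(x) paste0('["', paste(x, collapse = '", "'), '"]')

sankey.colors.built <- paste0(
  "d3.scaleOrdinal().domain(", js.array(color.domain),
  ").range(", js.array(color.range), ")"
)

cat(sankey.colors.built)
```

The resulting string can be passed to colourScale in the same way as the hand-written one. Note that this simple helper does not escape quotation marks, so it assumes names and colors contain none.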
Learn more about the research project I am a part of, the networkD3 package, and about customizing colors in Sankey diagrams by visiting the following:
National Science Foundation Research Grant Award Abstract # 1822490 nsf.gov/awardsearch
Info on NetworkD3 Package CRAN
Customizing Colors in Sankey diagrams R-Graph Gallery
This code-through references and cites the following sources:
Block, Tim (2021). Creating Custom Sankey Diagrams Using R. Displayr
Holtz, Yan (n.d.). Sankey diagram. from Data to Viz
JavaScripter.net (n.d.). Predefined Color Names - Alphabetical List. JavaScripter
Yetman, CJ (2018). Creating a Sankey diagram using NetworkD3 package in R. StackOverflow