https://www.youtube.com/c/TechAnswers88

YouTube video with explanations for these examples: https://youtu.be/XRu_Nb8hfIA

The easiest way to create a Sankey diagram from your own data in ggplot.

Use the ggsankey package to create your own data-driven Sankey chart. Full customisation is available because the plot is a ggplot object, so you can control the look and feel as you want.

When you have to create a Sankey diagram for a publication, an MS Word document or a PowerPoint presentation, this is the most practical and easy approach to use.

Create data labels with numbers and percentages at each node.

Just define the columns which you want to use, customise the colours using the themes and then save it as an image file on your desktop.

Packages used

This example uses the ggsankey package, which I think is a great package that does its job effectively and easily.

As this package is not on CRAN, you have to install it from the author's GitHub.

Install the remotes package first. Then use the install_github command to install the package.

#install.packages("remotes")
#remotes::install_github("davidsjoberg/ggsankey")

library(ggsankey)

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.1.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.1.3
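
Before running the install commands, it can help to check which packages are already present. The following is a minimal sketch using only base R; the variable names (`needed`, `installed`, `missing_pkgs`) are mine and not part of the original example, and the install calls are left commented out so nothing is installed without your say-so.

```r
# Packages this walkthrough relies on
needed <- c("remotes", "ggsankey", "ggplot2", "dplyr")

# TRUE for each package whose namespace can be loaded, FALSE otherwise
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!installed]

if (length(missing_pkgs) > 0) {
  message("Not installed: ", paste(missing_pkgs, collapse = ", "))
  # ggsankey comes from GitHub, the others from CRAN, for example:
  # install.packages(setdiff(missing_pkgs, "ggsankey"))
  # remotes::install_github("davidsjoberg/ggsankey")
} else {
  message("All packages are available.")
}
```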

Create the sample data

Note that there is no need to do any aggregation, counting or summing up before plotting. Each record in this sample is one answer option together with its number of responses.

#'How many pizzas you eat in a month'

d <- data.frame(Question  = c('How many pizzas'
                              ,'How many pizzas'
                              ,'How many pizzas')
              , Answer    = c('1 Pizza','2 Pizzas','3 Pizzas')
              , Responses = c(200,300,400))

d
##          Question   Answer Responses
## 1 How many pizzas  1 Pizza       200
## 2 How many pizzas 2 Pizzas       300
## 3 How many pizzas 3 Pizzas       400
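
In this sample each row carries a count in the Responses column, and the transformation below treats every input row as one unit. If you would rather have one row per individual response, one option is to repeat each row according to its count. This is a sketch of my own using base R only, not a step from the original example; `idx` and `d_expanded` are illustrative names.

```r
# Same sample data as above
d <- data.frame(Question  = 'How many pizzas',
                Answer    = c('1 Pizza', '2 Pizzas', '3 Pizzas'),
                Responses = c(200, 300, 400))

# Repeat each row index as many times as its response count
idx <- rep(seq_len(nrow(d)), times = d$Responses)

# Keep only the columns that describe the path; the count is now implicit
d_expanded <- d[idx, c("Question", "Answer")]
rownames(d_expanded) <- NULL

nrow(d_expanded)          # 900 rows, one per response
table(d_expanded$Answer)  # rows per answer option
```

With data in this shape, the number of rows passing through a node equals the number of responses for that answer.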

Transform the data to make it ready for the sankey chart creation

All you need is the make_long command (from the ggsankey package). Provide the data columns which you would like to see in your Sankey chart.

# Step 1
df <- d %>%
  make_long(Question,Answer,Responses)
df
## # A tibble: 9 x 4
##   x         node            next_x    next_node
##   <fct>     <chr>           <fct>     <chr>    
## 1 Question  How many pizzas Answer    1 Pizza  
## 2 Answer    1 Pizza         Responses 200      
## 3 Responses 200             <NA>      <NA>     
## 4 Question  How many pizzas Answer    2 Pizzas 
## 5 Answer    2 Pizzas        Responses 300      
## 6 Responses 300             <NA>      <NA>     
## 7 Question  How many pizzas Answer    3 Pizzas 
## 8 Answer    3 Pizzas        Responses 400      
## 9 Responses 400             <NA>      <NA>

Create the sankey chart

# Map the long-format columns to the Sankey aesthetics
pl <- ggplot(df, aes(x = x
                     , next_x = next_x
                     , node = node
                     , next_node = next_node
                     , fill = factor(node)
                     , label = node)
)
# Draw the nodes and the flows between them
pl <- pl + geom_sankey(flow.alpha = 0.5
                       , node.color = "black"
                       , show.legend = FALSE)
# Add a text label to each node
pl <- pl + geom_sankey_label(size = 3, color = "black", fill = "white", hjust = -0.5)
# Remove the legend, axis titles, y-axis text, ticks and grid lines
pl <- pl + theme_bw()
pl <- pl + theme(legend.position = "none")
pl <- pl + theme(axis.title = element_blank()
                 , axis.text.y = element_blank()
                 , axis.ticks = element_blank()
                 , panel.grid = element_blank())
# Colour the nodes with a viridis palette
pl <- pl + scale_fill_viridis_d(option = "inferno")
# Titles and caption
pl <- pl + labs(title = "Sankey diagram using ggplot")
pl <- pl + labs(subtitle = "Showing the responses to a multiple choice question")
pl <- pl + labs(caption = "@techanswers88")
pl <- pl + labs(fill = 'Nodes')
pl
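
The introduction mentions data labels with numbers and percentages at each node, and saving the chart as an image file. The sketch below shows one possible way to do both. It rests on assumptions of mine rather than on the original example: the sample data is first expanded to one row per response so that node counts are meaningful, the names `d_long`, `node_counts`, `d_lab`, `pl_labelled` and `out_file` are illustrative, and the file is written to a temporary folder rather than the desktop.

```r
library(ggsankey)
library(ggplot2)
library(dplyr)

# Same sample data as above
d <- data.frame(Question  = 'How many pizzas',
                Answer    = c('1 Pizza', '2 Pizzas', '3 Pizzas'),
                Responses = c(200, 300, 400))

# Assumption: expand to one row per response, then reshape for ggsankey
d_long <- d[rep(seq_len(nrow(d)), times = d$Responses), c("Question", "Answer")] %>%
  make_long(Question, Answer)

# Count rows per node and compute each node's share within its stage (x)
node_counts <- d_long %>%
  count(x, node) %>%
  group_by(x) %>%
  mutate(pct = round(100 * n / sum(n), 1)) %>%
  ungroup()

# Attach the counts and build the label text for every row
d_lab <- d_long %>%
  left_join(node_counts, by = c("x", "node")) %>%
  mutate(node_label = paste0(node, " n=", n, " (", pct, "%)"))

pl_labelled <- ggplot(d_lab, aes(x = x
                                 , next_x = next_x
                                 , node = node
                                 , next_node = next_node
                                 , fill = factor(node)
                                 , label = node_label)) +
  geom_sankey(flow.alpha = 0.5, node.color = "black", show.legend = FALSE) +
  geom_sankey_label(size = 3, color = "black", fill = "white") +
  theme_bw() +
  theme(legend.position = "none")

# Save the chart as a PNG; change the path to wherever you want the file
out_file <- file.path(tempdir(), "sankey.png")
ggsave(out_file, plot = pl_labelled, width = 8, height = 5, dpi = 300)
```

The percentages are computed within each stage, so the shares of the three answers add up to 100.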