This simple example shows how to generate Sankey network flow charts in R with the networkD3 package. Such charts are a useful way to illustrate preference flows.

We want to see how the voting results in the 12 UK regions contributed to the overall result of the 2016 EU referendum, in which the UK voted to leave the European Union by 17,410,742 votes to 16,141,241.
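As a quick check on those headline figures, the margin and the vote shares can be worked out directly from the two totals quoted above. This is only an illustrative calculation, and the variable names are our own.

```r
# national totals quoted in the text
leave  <- 17410742
remain <- 16141241

total_valid <- leave + remain                  # all valid votes
margin      <- leave - remain                  # size of the Leave majority
leave_pct   <- 100 * leave / total_valid       # Leave share of valid votes
remain_pct  <- 100 * remain / total_valid      # Remain share of valid votes

margin
round(c(Leave = leave_pct, Remain = remain_pct), 2)
```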

First, we load the libraries and read in the raw data, which can be obtained from the Electoral Commission website.

# load libraries

library(dplyr)
library(networkD3)
library(tidyr)

# read in EU referendum results dataset
refresults <- read.csv("EU-referendum-result-data.csv")
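Before going further, it can help to confirm that the file actually contains the columns used later on. The sketch below is an illustration under stated assumptions: `check_columns()` is a hypothetical helper, not part of any package, and the two-row CSV written to a temporary file merely stands in for the real download.

```r
# hypothetical helper: stop early if any required column is missing
check_columns <- function(df, required) {
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("Missing columns: ", paste(missing_cols, collapse = ", "))
  }
  invisible(TRUE)
}

# two-row stand-in for the real file, with only a few of its columns
toy_file <- tempfile(fileext = ".csv")
writeLines(c("Region,Area,Remain,Leave",
             "East,Peterborough,34176,53216",
             "East,Luton,36708,47773"), toy_file)

toy <- read.csv(toy_file)
check_columns(toy, c("Region", "Remain", "Leave"))
```

With the real data, the same call would be `check_columns(refresults, c("Region", "Remain", "Leave"))`.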

A quick look at the data shows that it contains far more columns than we need.

head(refresults)
##    id Region_Code Region Area_Code                 Area Electorate
## 1 108   E12000006   East E06000031         Peterborough     120892
## 2 109   E12000006   East E06000032                Luton     127612
## 3 112   E12000006   East E06000033      Southend-on-Sea     128856
## 4 113   E12000006   East E06000034             Thurrock     109897
## 5 110   E12000006   East E06000055              Bedford     119530
## 6 111   E12000006   East E06000056 Central Bedfordshire     204004
##   ExpectedBallots VerifiedBallotPapers Pct_Turnout Votes_Cast Valid_Votes
## 1           87474                87469       72.35      87469       87392
## 2           84633                84636       66.31      84616       84481
## 3           93948                93939       72.90      93939       93870
## 4           79969                79954       72.75      79950       79916
## 5           86136                86136       72.06      86135       86066
## 6          158904               158896       77.89     158894      158804
##   Remain Leave Rejected_Ballots No_official_mark Voting_for_both_answers
## 1  34176 53216               77                0                      32
## 2  36708 47773              135                0                      85
## 3  39348 54522               69                0                      21
## 4  22151 57765               34                0                       8
## 5  41497 44569               69                0                      26
## 6  69670 89134               90                0                      34
##   Writing_or_mark Unmarked_or_void Pct_Remain Pct_Leave Pct_Rejected
## 1               7               38      39.11     60.89         0.09
## 2               0               50      43.45     56.55         0.16
## 3               0               48      41.92     58.08         0.07
## 4               3               23      27.72     72.28         0.04
## 5               1               42      48.22     51.78         0.08
## 6               1               55      43.87     56.13         0.06

We therefore need to group the data by region, drop the columns we don't need, and reshape the result into a long format that makes it easy to construct a Sankey network.

# aggregate by region

results <- refresults %>% 
  dplyr::group_by(Region) %>% 
  dplyr::summarise(Remain = sum(Remain), Leave = sum(Leave))

# format in prep for sankey diagram

results <- tidyr::gather(results, result, vote, -Region)

head(results)
## # A tibble: 6 x 3
##   Region           result    vote
##   <fct>            <chr>    <int>
## 1 East             Remain 1448616
## 2 East Midlands    Remain 1033036
## 3 London           Remain 2263519
## 4 North East       Remain  562595
## 5 North West       Remain 1699020
## 6 Northern Ireland Remain  440707
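`gather()` still works, but recent versions of tidyr mark it as superseded in favour of `pivot_longer()`. As a sketch, here is the same wide-to-long reshape on a small made-up data frame; the region names and vote counts are invented for illustration.

```r
library(tidyr)

# made-up totals for two hypothetical regions
toy <- data.frame(Region = c("A", "B"),
                  Remain = c(10, 30),
                  Leave  = c(20, 5))

# one row per Region/result pair, the same shape that
# gather(toy, result, vote, -Region) produces
toy_long <- pivot_longer(toy, cols = -Region,
                         names_to = "result", values_to = "vote")
toy_long
```

The two functions return the rows in a different order (`gather()` stacks column by column, `pivot_longer()` works row by row), which does not matter here because the links are later matched to nodes by name.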

Finally, we generate the set of nodes and the set of links for the Sankey network.

# create nodes dataframe

regions <- unique(as.character(results$Region))
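node_names <- c(regions, "Leave", "Remain")
# networkD3 expects zero-based node indices, so count from 0
nodes <- data.frame(node = seq_along(node_names) - 1,
                    name = node_names)

# create links dataframe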

results <- merge(results, nodes, by.x = "Region", by.y = "name")
results <- merge(results, nodes, by.x = "result", by.y = "name")
links <- results[ , c("node.x", "node.y", "vote")]
colnames(links) <- c("source", "target", "value")
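The two `merge()` calls simply look up the zero-based index of each region and of each result. An equivalent lookup can be sketched with base R's `match()`, shown here on made-up data (the region names and counts are invented). Unlike `merge()`, this keeps the original row order.

```r
# made-up long-format results for two hypothetical regions
toy_results <- data.frame(
  Region = c("A", "B", "A", "B"),
  result = c("Remain", "Remain", "Leave", "Leave"),
  vote   = c(10, 30, 20, 5),
  stringsAsFactors = FALSE
)

toy_names <- c(unique(as.character(toy_results$Region)), "Leave", "Remain")
toy_nodes <- data.frame(node = seq_along(toy_names) - 1, name = toy_names)

# match() returns 1-based positions, so subtract 1 for networkD3
toy_links <- data.frame(
  source = match(toy_results$Region, toy_nodes$name) - 1,
  target = match(toy_results$result, toy_nodes$name) - 1,
  value  = toy_results$vote
)
toy_links
```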

We now have the nodes and links in the format we need.

head(nodes)
##   node             name
## 1    0             East
## 2    1    East Midlands
## 3    2           London
## 4    3       North East
## 5    4       North West
## 6    5 Northern Ireland
head(links)
##   source target   value
## 1      0     12 1880367
## 2      1     12 1475479
## 3      2     12 1513232
## 4      3     12  778103
## 5      4     12 1966925
## 6      5     12  349442
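`sankeyNetwork()` expects `source` and `target` to be zero-based row positions in the nodes data frame, so a quick sanity check before plotting can save some head-scratching. The function below, `check_sankey_inputs()`, is a hypothetical helper written for illustration, not part of networkD3, and it is demonstrated on made-up data.

```r
# hypothetical helper: TRUE if every link points at an existing node
check_sankey_inputs <- function(links, nodes) {
  idx <- c(links$source, links$target)
  all(idx == round(idx)) &&          # whole numbers only
    min(idx) >= 0 &&                 # zero-based, so no negatives
    max(idx) <= nrow(nodes) - 1 &&   # largest index is the last node
    all(links$value >= 0)            # flows cannot be negative
}

toy_nodes <- data.frame(name = c("A", "B", "Leave", "Remain"))
toy_links <- data.frame(source = c(0, 1, 0, 1),
                        target = c(3, 3, 2, 2),
                        value  = c(10, 30, 20, 5))

check_sankey_inputs(toy_links, toy_nodes)

# shifting the targets to 1-based indexing pushes one past the last node
bad_links <- transform(toy_links, target = target + 1)
check_sankey_inputs(bad_links, toy_nodes)
```

The first call passes and the second fails, because an index of 4 does not exist in a four-node table numbered from 0.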

We are now ready to draw the Sankey network. Hovering the mouse over the nodes and links reveals the detailed vote counts.

# draw sankey network

networkD3::sankeyNetwork(Links = links, Nodes = nodes, Source = 'source', 
                         Target = 'target', Value = 'value', NodeID = 'name',
                         units = 'votes')
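The call above uses the default appearance. As a hedged sketch, the example below passes a few of `sankeyNetwork()`'s optional arguments (`fontSize`, `nodeWidth`, `nodePadding`) and writes the chart to an HTML file with `htmlwidgets::saveWidget()`. The data are made up, the argument values are arbitrary, and htmlwidgets is a package not otherwise used in this post (it is installed as a dependency of networkD3).

```r
library(networkD3)

# made-up nodes and links for two hypothetical regions
toy_nodes <- data.frame(name = c("A", "B", "Leave", "Remain"))
toy_links <- data.frame(source = c(0, 1, 0, 1),
                        target = c(3, 3, 2, 2),
                        value  = c(10, 30, 20, 5))

p <- sankeyNetwork(Links = toy_links, Nodes = toy_nodes,
                   Source = "source", Target = "target",
                   Value = "value", NodeID = "name", units = "votes",
                   fontSize = 12, nodeWidth = 30, nodePadding = 15)

# write the chart to an HTML file; selfcontained = FALSE keeps the
# JavaScript in a side folder and avoids the need for pandoc
out_file <- file.path(tempdir(), "sankey.html")
htmlwidgets::saveWidget(p, out_file, selfcontained = FALSE)
```

Swapping `toy_links` and `toy_nodes` for the `links` and `nodes` built earlier would apply the same styling to the referendum chart.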