This simple example shows how to generate a Sankey network diagram in R with the networkD3 package. Diagrams of this kind are useful for illustrating preference flows.
We want to see how the voting results in the 12 UK regions contributed to the overall result of the 2016 EU referendum, in which the UK voted to leave the European Union by 17,410,742 votes to 16,141,241.
First we load the libraries and read the raw data, which can be obtained from the Electoral Commission website.
# load libraries
library(dplyr)
library(networkD3)
library(tidyr)
# read in EU referendum results dataset
refresults <- read.csv("EU-referendum-result-data.csv")
A quick look at the data shows that it contains far more than we need.
head(refresults)
## id Region_Code Region Area_Code Area Electorate
## 1 108 E12000006 East E06000031 Peterborough 120892
## 2 109 E12000006 East E06000032 Luton 127612
## 3 112 E12000006 East E06000033 Southend-on-Sea 128856
## 4 113 E12000006 East E06000034 Thurrock 109897
## 5 110 E12000006 East E06000055 Bedford 119530
## 6 111 E12000006 East E06000056 Central Bedfordshire 204004
## ExpectedBallots VerifiedBallotPapers Pct_Turnout Votes_Cast Valid_Votes
## 1 87474 87469 72.35 87469 87392
## 2 84633 84636 66.31 84616 84481
## 3 93948 93939 72.90 93939 93870
## 4 79969 79954 72.75 79950 79916
## 5 86136 86136 72.06 86135 86066
## 6 158904 158896 77.89 158894 158804
## Remain Leave Rejected_Ballots No_official_mark Voting_for_both_answers
## 1 34176 53216 77 0 32
## 2 36708 47773 135 0 85
## 3 39348 54522 69 0 21
## 4 22151 57765 34 0 8
## 5 41497 44569 69 0 26
## 6 69670 89134 90 0 34
## Writing_or_mark Unmarked_or_void Pct_Remain Pct_Leave Pct_Rejected
## 1 7 38 39.11 60.89 0.09
## 2 0 50 43.45 56.55 0.16
## 3 0 48 41.92 58.08 0.07
## 4 3 23 27.72 72.28 0.04
## 5 1 42 48.22 51.78 0.08
## 6 1 55 43.87 56.13 0.06
We therefore group the data by region, drop the columns we do not need, and reshape the result into long format, with one row per region and result, so that the Sankey network is easy to construct.
# aggregate by region
results <- refresults %>%
  dplyr::group_by(Region) %>%
  dplyr::summarise(Remain = sum(Remain), Leave = sum(Leave))
# format in prep for sankey diagram
results <- tidyr::gather(results, result, vote, -Region)
head(results)
## # A tibble: 6 x 3
## Region result vote
## <fct> <chr> <int>
## 1 East Remain 1448616
## 2 East Midlands Remain 1033036
## 3 London Remain 2263519
## 4 North East Remain 562595
## 5 North West Remain 1699020
## 6 Northern Ireland Remain 440707
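In recent versions of tidyr, `gather()` is superseded by `pivot_longer()`. As a minimal sketch, the same reshaping could be written as follows. The three-row `toy` data frame and the name `toy_long` are made up for illustration and are not the referendum data.

```r
library(tidyr)

# hypothetical toy data, standing in for the aggregated regional results
toy <- data.frame(Region = c("A", "B", "C"),
                  Remain = c(10, 20, 30),
                  Leave  = c(15, 5, 25))

# reshape to long format: one row per region and result
toy_long <- tidyr::pivot_longer(toy, cols = -Region,
                                names_to = "result", values_to = "vote")
toy_long
```

The rows come out grouped by region rather than by result, unlike with `gather()`. That should not matter here, since the later `merge()` calls reorder the rows anyway.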
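Finally we generate the set of nodes and the set of links for the Sankey network. Each region and each result becomes a node, and each region-result pair becomes a link weighted by its vote count. Note that networkD3 expects node indices to start at zero.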
# create nodes dataframe (node indices are zero-based)
regions <- unique(as.character(results$Region))
node_names <- c(regions, "Leave", "Remain")
nodes <- data.frame(node = seq_along(node_names) - 1,
                    name = node_names)
# create links dataframe
results <- merge(results, nodes, by.x = "Region", by.y = "name")
results <- merge(results, nodes, by.x = "result", by.y = "name")
links <- results[ , c("node.x", "node.y", "vote")]
colnames(links) <- c("source", "target", "value")
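We now have the nodes and links in the format we need. The regions take indices 0 to 11, "Leave" is node 12 and "Remain" is node 13, so the first rows of links shown below are each region's Leave vote.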
head(nodes)
## node name
## 1 0 East
## 2 1 East Midlands
## 3 2 London
## 4 3 North East
## 5 4 North West
## 6 5 Northern Ireland
head(links)
## source target value
## 1 0 12 1880367
## 2 1 12 1475479
## 3 2 12 1513232
## 4 3 12 778103
## 5 4 12 1966925
## 6 5 12 349442
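The two `merge()` calls look up the node index for each region and each result. As an alternative sketch, not the method used above, the same zero-based indices could be found with `match()`, which keeps the original row order. The toy data and the names `toy_long`, `toy_nodes` and `toy_links` are hypothetical.

```r
# hypothetical toy data in the same long format as `results`
toy_long <- data.frame(Region = c("A", "B", "A", "B"),
                       result = c("Remain", "Remain", "Leave", "Leave"),
                       vote   = c(10, 20, 15, 5),
                       stringsAsFactors = FALSE)

# one node per region, followed by the two results
toy_nodes <- data.frame(name = c(unique(toy_long$Region), "Leave", "Remain"),
                        stringsAsFactors = FALSE)

# zero-based position of each name in the nodes table
toy_links <- data.frame(source = match(toy_long$Region, toy_nodes$name) - 1,
                        target = match(toy_long$result, toy_nodes$name) - 1,
                        value  = toy_long$vote)
toy_links
```

Here "A" and "B" become nodes 0 and 1, "Leave" is node 2 and "Remain" is node 3, mirroring the layout of the real nodes table on a smaller scale.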
We are now ready to draw the Sankey network. Hovering over the nodes and links shows the detailed vote counts.
# draw Sankey network
networkD3::sankeyNetwork(Links = links, Nodes = nodes, Source = 'source',
                         Target = 'target', Value = 'value', NodeID = 'name',
                         units = 'votes')
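To make the Leave and Remain flows easier to tell apart, the links could be coloured by result using the `LinkGroup` argument of `sankeyNetwork()`. The following is a minimal sketch on made-up toy data. The `group` column and the `toy_*` names are assumptions introduced for illustration only.

```r
library(networkD3)

# hypothetical toy nodes and links (not the referendum data)
toy_nodes <- data.frame(name = c("A", "B", "Leave", "Remain"),
                        stringsAsFactors = FALSE)
toy_links <- data.frame(source = c(0, 1, 0, 1),
                        target = c(3, 3, 2, 2),
                        value  = c(10, 20, 15, 5))

# label each link with the name of its target node, used for colouring
toy_links$group <- toy_nodes$name[toy_links$target + 1]

# draw the network with links coloured by their group
p <- networkD3::sankeyNetwork(Links = toy_links, Nodes = toy_nodes,
                              Source = "source", Target = "target",
                              Value = "value", NodeID = "name",
                              LinkGroup = "group", units = "votes",
                              fontSize = 12)
p
```

With the real data, the same idea would amount to adding a group column to `links` before the call.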