Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: PLOS One website, https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0248268 (2021).


Objective

The objective of this mulit-variate visualisation is to show Australian public opinion during the 2019 Federal Election, on the question of Australia reducing greenhouse gas emissions. This opinion is segmented and influenced by which political party people voted for and data was collected via a phone survey of 2033 participants. The visualisation includes only the top four political parties in Australia with the interesting addition of the LNP Party in Queensland as a fifth voting category. The LNP Party in Queensland, is the ‘Liberal and National’ parties combined within Queensland. There is no separate Liberal or National party voting in Queensland!

Another objective of this visualisation is to show that Green voters put more importance on reducing emissions compared with particularly the Queensland LNP who see it as a less important priority. I think this is the main reason that the Queensland LNP have been added as a fifth voting category. I think this visualisation is misleading because I show later on that some of the original data has been incorrectly categorised.

The audience of this visualisation are political leaders but also the campaign and policy makers of these political parties, as suggested in the conclusion section of the published paper.

This visualisation has the following three main issues:

  • Issue #1 - Poor choice of colours. The colour choices of the bars are not optimal and could lead to a lack of clarity for colour-blind people. The colour scheme is also ‘Sequential’ which incorrectly suggests ordering of the political parties.

  • Issue #2 - Non-optimal visualisation type and layout. The layout of the bar chart is not ideal including the orientation of the labels, absence of values within the bars and missing sub-heading.

  • Issue #3 - Incorrect data categorisation. The data has been incorrectly stratified to separate out the Queensland LNP party. When I reviewed the original survey dataset, I believe that the original visualisation created misrepresents the data. The visualisation incorrectly bins Queensland voter opinions into Liberal and Nationals, whereas these are not separate parties in Queensland. They are together known as the Queensland LNP!

Reference

Code

The following code was used to fix the issues identified in the original.

  • Load required R packages
# load packages
library(ggplot2)         # grammar of graphics 
library(tidyverse)       # collection of useful packages
library(readxl)          # allows loading of xlsx files
library(RColorBrewer)    # utilise the colorbrewer colour palette
library(colourpicker)    # package to facilitate colour choice
  • Load in data file
# load in .xlsx data file
df1 <- read_xlsx('/Users/adriangacek/Desktop/data/assignment2/plos_data.xlsx')
glimpse(df1)
## Rows: 2,033
## Columns: 21
## $ ANU_Q1          <dbl> 2, 1, 1, 1, 1, 2, 2, 2, 1, 2, 4, 2, 2, 3, 1, 2, 2, 4, …
## $ ANU_Q2          <dbl> 3, 3, 3, 3, 4, 2, 3, 3, 4, 2, 4, 2, 3, 2, 1, 2, 5, 4, …
## $ ANU_Q3          <dbl> 1, 2, 4, 2, 1, 4, 2, 2, 2, 4, 1, 3, 4, 4, 3, 3, 2, 1, …
## $ ANU_Q4          <dbl> 2, 1, 2, 2, 2, 5, 2, 5, 1, 5, 1, 2, 4, 2, 5, 2, 5, 1, …
## $ CSES_Q15        <dbl> 2, 1, 2, 2, 3, 1, 2, -98, 1, 1, 4, 1, 1, 1, 1, 1, 1, 2…
## $ CSES_Q15_Oth    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ CSES_Q16        <dbl> 2, 1, 2, 2, 1, 1, 2, 2, 1, 1, 4, 8, 1, 6, 1, 1, 1, 2, …
## $ CSES_Q16_Oth    <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ p_state         <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ p_region        <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, …
## $ p_geography     <dbl> 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, …
## $ p_seifa         <dbl> 5, 3, 4, 3, 3, 2, 1, 1, 1, 1, 1, 4, 5, 5, 3, 5, 3, 5, …
## $ p_gender        <dbl> 2, 2, 2, 1, 2, 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, 2, 2, 2, …
## $ p_age_group     <dbl> 5, 6, 6, 6, 5, 5, 5, 7, 5, 4, 4, 3, 7, 4, 5, 6, 6, 7, …
## $ p_cob_group     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 1, 1, …
## $ p_citizen       <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ p_lote          <dbl> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 1, 2, 2, …
## $ p_income        <dbl> 6, 5, 10, 10, 2, 10, 7, 10, 10, 10, 9, 3, 9, 3, 10, 8,…
## $ p_atsi          <dbl> 4, 4, 4, 4, 4, 4, 4, 1, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ p_household_str <dbl> 2, 2, 3, 2, 3, 2, 2, 1, 1, 4, 1, 3, 1, 2, 3, 1, 3, 1, …
## $ p_education     <dbl> 1, 3, 4, 4, 4, 7, 5, 9, 7, 5, 6, 1, 6, 5, -97, 7, 9, 1…
  • Transform dataset including the correct data types
# convert the two important variables to factors as variables are actually categorical
df1$ANU_Q1 <- factor(df1$ANU_Q1)
df1$CSES_Q15 <- factor(df1$CSES_Q15)
glimpse(df1$ANU_Q1)
##  Factor w/ 6 levels "-99","-98","1",..: 4 3 3 3 3 4 4 4 3 4 ...
glimpse(df1$CSES_Q15)
##  Factor w/ 37 levels "-99","-98","-97",..: 6 5 6 6 7 5 6 2 5 5 ...
# drop all other variables except for ANU_Q1 and CSES_Q15
df2 <- df1 %>% select(ANU_Q1, CSES_Q15)

# rename the two important variables to be more relevant to the data
colnames(df2) <- c('response' , 'political_party')
glimpse(df2$response)
##  Factor w/ 6 levels "-99","-98","1",..: 4 3 3 3 3 4 4 4 3 4 ...
# REMOVE any irrelevent rows from 'response' variable
df2 <- filter(df2, df2$response != '-98')
df2 <- filter(df2, df2$response != '-99')

# KEEP only relevant rows from 'political_party'
df2 <- df2[df2$political_party %in% c("1", "2", "3", "4", "51"), ]

# drop all factor levels except for the ones that remain and check
df2$response <- factor(df2$response)
df2$political_party <- factor(df2$political_party)
str(df2)
## tibble [1,623 × 2] (S3: tbl_df/tbl/data.frame)
##  $ response       : Factor w/ 4 levels "1","2","3","4": 2 1 1 1 1 2 2 1 2 4 ...
##  $ political_party: Factor w/ 5 levels "1","2","3","4",..: 2 1 2 2 3 1 2 1 1 4 ...
# rename the factor levels for both variables 
levels(df2$response) <- c("Extremely important", "Important", "Unimportant", "Extremely unimportant")
levels(df2$political_party) <- c("Liberal", "Labor", "Nationals", "Greens", "LNP (QLD)")
str(df2)
## tibble [1,623 × 2] (S3: tbl_df/tbl/data.frame)
##  $ response       : Factor w/ 4 levels "Extremely important",..: 2 1 1 1 1 2 2 1 2 4 ...
##  $ political_party: Factor w/ 5 levels "Liberal","Labor",..: 2 1 2 2 3 1 2 1 1 4 ...
  • Continue data transformation including correct labels and percentage calculations
# return counts
df3 <- df2 %>% group_by(response, political_party) %>% summarise(n = n())

# calculate totals of responses by political party and store these as new variables
lib_only <- df3 %>% filter(political_party == 'Liberal')
lib_total <-  sum(lib_only$n)
lab_only <- df3 %>% filter(political_party == 'Labor')
lab_total <-  sum(lab_only$n)
gre_only <- df3 %>% filter(political_party == 'Greens')
gre_total <-  sum(gre_only$n)
nat_only <- df3 %>% filter(political_party == 'Nationals')
nat_total <-  sum(nat_only$n)
lnp_only <- df3 %>% filter(political_party == 'LNP (QLD)')
lnp_total <-  sum(lnp_only$n)

# add another column (numeric type) for percentage of voters for that party 'percent'
df_lib <- df3 %>% filter(political_party == 'Liberal') %>% mutate(new = 100 * n / lib_total)
df_lab <- df3 %>% filter(political_party == 'Labor') %>% mutate(new = 100 * n / lab_total)
df_gre <- df3 %>% filter(political_party == 'Greens') %>% mutate(new = 100 * n / gre_total)
df_nat <- df3 %>% filter(political_party == 'Nationals') %>% mutate(new = 100 * n / nat_total)
df_lnp <- df3 %>% filter(political_party == 'LNP (QLD)') %>% mutate(new = 100 * n / lnp_total)
df4 <- rbind(df_lib, df_lab, df_gre, df_nat, df_lnp)
str(df4)
## grouped_df [20 × 4] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ response       : Factor w/ 4 levels "Extremely important",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ political_party: Factor w/ 5 levels "Liberal","Labor",..: 1 1 1 1 2 2 2 2 4 4 ...
##  $ n              : int [1:20] 174 277 112 63 411 141 13 45 211 24 ...
##  $ new            : num [1:20] 27.8 44.2 17.9 10.1 67.4 ...
##  - attr(*, "groups")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ response: Factor w/ 4 levels "Extremely important",..: 1 2 3 4
##   ..$ .rows   : list<int> [1:4] 
##   .. ..$ : int [1:5] 1 5 9 13 17
##   .. ..$ : int [1:5] 2 6 10 14 18
##   .. ..$ : int [1:5] 3 7 11 15 19
##   .. ..$ : int [1:5] 4 8 12 16 20
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
df4$new <- round(df4$new, 1)
  • Plot a near replica of the original plot to confirm correctness of the dataset!
  • Yes - it looks good and like the original plot!
# reorder the categories within the 'political party' variable
df4$political_party <- factor(df4$political_party, levels = c("Greens", "Labor", "Liberal", "LNP (QLD)", "Nationals"))

# create plot
df4 %>% ggplot(ggplot2::aes(response, new)) + geom_bar(ggplot2::aes(fill = political_party), position = "dodge", stat = "identity") + scale_fill_brewer(palette="YlGnBu") + theme_classic() 

  • To address Issue #1 - Poor choice of colours

  • Choose new colours using the colorbrewer package. Ensure we choose a color-blind safe scheme

  • The four colours that will be used are as below: …‘#a6cee3 light blue’ …‘#1f78b4 dark blue’ …‘#b2df8a light green’ …‘#33a02c green’

# check the colorbrewer pallete for the colour set recommended by colorbrewer
# print out vectors of individual colours
brewer.pal(n = 8, name = "Paired")
## [1] "#A6CEE3" "#1F78B4" "#B2DF8A" "#33A02C" "#FB9A99" "#E31A1C" "#FDBF6F"
## [8] "#FF7F00"

Include image of Colorbrewer webpage with the sequential colour scheme

  • To address Issue #2 - Non-optimal visualisation type and layout

  • Create a first “draft” of what the new reconstructed visualisation will look like

  • NOTE: This is not the final visualisation and this is BEFORE the fix for Issue #3 which is a data fix. Use this plot to compare values/results to my final plot

  • Key data harvested from this plot for later comparison, shows: …Party,Extremely important,Important,Unimportant,Extremely unimportant …Nationals,20,39,31,11 …LNP (Qld),19,40,29,13 …Liberal,28,44,18,10

  • To address Issue #3 - Incorrect data categorisation

  • Create a new dataset the re-categorises voters into the correct categories, namely: …transform Liberal Queensland voters to LNP (QLD) …transform Nationals Queensland voters to LNP (QLD)

# starting with data frame df1, drop all except three variables 
df5 <- df1 %>% select(ANU_Q1, CSES_Q15, p_state)

# convert the three important variables to factors as variables are actually categorical
df5$ANU_Q1 <- factor(df5$ANU_Q1)
df5$CSES_Q15 <- factor(df5$CSES_Q15)
df5$p_state <- factor(df5$p_state)

# rename the three important variables to be more relevant to the data
colnames(df5) <- c('response' , 'political_party', 'state')

# REMOVE any irrelevent rows from 'response' variable
df5 <- filter(df5, df5$response != '-98')
df5 <- filter(df5, df5$response != '-99')

# KEEP only relevant rows from 'political_party'
df5 <- df5[df5$political_party %in% c("1", "2", "3", "4", "51"), ]
#write.csv(df5, 'df5.csv')

##### code is good and data is too up to here!!!!

# collect all LNP voters as it stands now
# note data has some LNP voters in the WRONG STATE outside of QLD!
df_lnp_orig <- filter(df5, df5$political_party == '51')
df_lnp_orig <- filter(df_lnp_orig, df_lnp_orig$state == '3')

# collect Liberal voters from QLD (equals 88 voters) and recategorise as LNP (QLD)
df_lib_qld <- filter(df5, df5$political_party == '1')
df_lib_qld <- filter(df_lib_qld, df_lib_qld$state == '3')
df_lib_qld$political_party <- '51'

# collect National voters from QLD (equals 6 voters) and recategorise as LNP (QLD)
df_nat_qld <- filter(df5, df5$political_party == '3')
df_nat_qld <- filter(df_nat_qld, df_nat_qld$state == '3')
df_nat_qld$political_party <- '51'

# recombine ALL recategorised LNP voters
df_lnp_all <- rbind(df_lib_qld, df_nat_qld, df_lnp_orig)

# strip Liberal (QLD) and National (QLD) from original dataframe and add back as LNP (QLD) voters
df_no_lnp <- filter(df5, df5$political_party != '51')
df_no_lnp_lab <- filter(df5, df5$political_party == '2') 
df_no_lnp_grn <- filter(df5, df5$political_party == '4') 
df_no_lnp_lib <- df_no_lnp %>% filter(df_no_lnp$political_party == '1' & df_no_lnp$state != '3')
df_no_lnp_nat <- df_no_lnp %>% filter(df_no_lnp$political_party == '3' & df_no_lnp$state != '3')

# combine data back together and drop state variable
df6 <- rbind(df_lnp_all, df_no_lnp_lib, df_no_lnp_lab, df_no_lnp_nat, df_no_lnp_grn)
df6 <- select(df6, -state)

# drop all factor levels except for the ones that remain and check
df6$response <- factor(df6$response)
df6$political_party <- factor(df6$political_party)
str(df6)
## tibble [1,621 × 2] (S3: tbl_df/tbl/data.frame)
##  $ response       : Factor w/ 4 levels "1","2","3","4": 2 2 4 4 3 1 1 2 2 1 ...
##  $ political_party: Factor w/ 5 levels "1","2","3","4",..: 5 5 5 5 5 5 5 5 5 5 ...
# rename the factor levels for both variables 
levels(df6$response) <- c("Extremely important", "Important", "Unimportant", "Extremely unimportant")
levels(df6$political_party) <- c("Liberal", "Labor", "Nationals", "Greens", "LNP (QLD)")
str(df6)
## tibble [1,621 × 2] (S3: tbl_df/tbl/data.frame)
##  $ response       : Factor w/ 4 levels "Extremely important",..: 2 2 4 4 3 1 1 2 2 1 ...
##  $ political_party: Factor w/ 5 levels "Liberal","Labor",..: 5 5 5 5 5 5 5 5 5 5 ...
  • Continue data transformation including correct labels and percentage calculations
# return counts
df7 <- df6 %>% group_by(response, political_party) %>% summarise(n = n())

# calculate totals of responses by political party and store these as new variables
lib_only <- df7 %>% filter(political_party == 'Liberal')
lib_total <-  sum(lib_only$n)
lab_only <- df7 %>% filter(political_party == 'Labor')
lab_total <-  sum(lab_only$n)
gre_only <- df7 %>% filter(political_party == 'Greens')
gre_total <-  sum(gre_only$n)
nat_only <- df7 %>% filter(political_party == 'Nationals')
nat_total <-  sum(nat_only$n)
lnp_only <- df7 %>% filter(political_party == 'LNP (QLD)')
lnp_total <-  sum(lnp_only$n)

# add another column (numeric type) for percentage of voters for that party 'percent'
df_lib <- df7 %>% filter(political_party == 'Liberal') %>% mutate(new = 100 * n / lib_total)
df_lab <- df7 %>% filter(political_party == 'Labor') %>% mutate(new = 100 * n / lab_total)
df_gre <- df7 %>% filter(political_party == 'Greens') %>% mutate(new = 100 * n / gre_total)
df_nat <- df7 %>% filter(political_party == 'Nationals') %>% mutate(new = 100 * n / nat_total)
df_lnp <- df7 %>% filter(political_party == 'LNP (QLD)') %>% mutate(new = 100 * n / lnp_total)
df8 <- rbind(df_lib, df_lab, df_gre, df_nat, df_lnp)
str(df8)
## grouped_df [20 × 4] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ response       : Factor w/ 4 levels "Extremely important",..: 1 2 3 4 1 2 3 4 1 2 ...
##  $ political_party: Factor w/ 5 levels "Liberal","Labor",..: 1 1 1 1 2 2 2 2 4 4 ...
##  $ n              : int [1:20] 144 239 99 56 411 141 13 45 211 24 ...
##  $ new            : num [1:20] 26.8 44.4 18.4 10.4 67.4 ...
##  - attr(*, "groups")= tibble [4 × 2] (S3: tbl_df/tbl/data.frame)
##   ..$ response: Factor w/ 4 levels "Extremely important",..: 1 2 3 4
##   ..$ .rows   : list<int> [1:4] 
##   .. ..$ : int [1:5] 1 5 9 13 17
##   .. ..$ : int [1:5] 2 6 10 14 18
##   .. ..$ : int [1:5] 3 7 11 15 19
##   .. ..$ : int [1:5] 4 8 12 16 20
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE
df8$new <- round(df8$new, 1)

Data Reference

Reconstruction

The following reconstructed visualisation fixes the main issues in the original.

Let’s step through these, one by one.

Issue #1 - Poor choice of colours

Fixes:

  • Changed the colour scheme to enhance legibility for colour-blind users

  • Changed the colour scheme away from a ‘Sequential’ scheme to a ‘Quantitative’ scheme. eg. we don’t have ordered categorical data

  • Flipped the Y and X axis variable such that I only have four variables to differentiate with colour rather than five, which does not have a ‘colour-only’ colour-bind safe scheme

Method:

  • I used the Colorbrewer website to find a colour scheme that would be suitable for colour blind audiences. The first thing I noticed whilst looking at colours that represented the original plot, is that the original colour scheme is for ‘Sequential’ data and not designed for ‘Categorical’ data. ‘Sequential’ suggests that there is an ORDER to the political parties which of course is not true. So I found a scheme that suited non-ordered ‘Quantitative’ variables like the data we have. Having five different political parties as categories, also created a dilemma as there was not a safe scheme for five categories only four, so to represent the fifth I would need to combine bar colour with a pattern! But what I ended up doing in the end was to flip the variables so that I only needed four colours to show the four categories I had (eg. I instead show the question responses (4 answers) instead of the political parties (5 parties) using the coloured bars).

Issue #2 - Non-optimal visualisation type and layout

Fixes:

  • Created a flipped, stacked bar chart to better fit the labels and values

  • Removed X and Y axis tickmarks and tickmark labels

  • Removed X and Y axis lines

  • Reduced the width of the legend key width to give some more room to stretch the graph width-wise

  • Adjusted the width of the plot using code chunk settings in R

  • Added percentage values to each segment within each bar and rounded to whole numbers

  • Increased and decreased font sizes as appropriate, including a hierarchy of sizes

  • Added an appropriately positioned heading and sub-title, which conveys more key information such as relating to voters in the 2019 Federal Election

Method:

  • I systematically played with different commands from the ggplot2 cheat sheet, to obtain the above changes. This was a lot of trial and error.

Issue #3 - Incorrect data categorisation

Fixes:

  • Data has been re-categorised to separate out the Queensland LNP (LNP QLD) party. 88 Queensland Liberal voters were re-categorised as Queensland LNP voters and 6 Queensland National voters were re-categorised as Queensland LNP voters.

  • Reordered the plot to now show LNP voters in between the Nationals and Liberal voters in terms of the total ‘revised’ responses once the visualisation was correctly aligned to the original dataset.

Method:

  • I used a very early data frame then kept the ‘state’ information. Then re-categorised as per the above and found that when I graphed the results that the percentages changed significantly for the LNP, Nationals and Liberal numbers!

  • So much change in fact that it now shows that the LNP voter opinions to the four questions asked, sit roughly in-between that of the Nationals (unimportant bias) and the Liberals (important bias). This finding make the visualisation match the original dataset and also makes more sense in the fact that the Queensland LNP is made up of Liberal AND Nationals voters!!!

  • The difference in percentage values, Reconstructed visualisation Vs Original visualisation are:

……LNP (Extremely important) = Now 27% WAS 19%

……LNP (Important) = Now 42% WAS 40%

……LNP (Unimportant) = Now 21% WAS 29%

……LNP (Extremely unimportant) = Now 10% WAS 13%

……And also smaller differences in most response categories for the Nationals and Liberal voters.

The reconstructed visualisation now looks like……