Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original

Source: Visual Capitalist (2022)

Objective

The overall objective of this data visualisation is to raise awareness of the challenges of drinking water access in Sub-Saharan countries. The website on which the visualisation is published suggests a target audience that is well educated and works in knowledge-based professions. These readers generally do not live in the region themselves, but because the designer posted the visualisation in this forum, the audience is likely aware of the issues and may even be able to help address them through their own professions. Within the broader objective, by using lines to join each country's urban and rural observations, the designer highlights the stark differences in drinking water access between rural and urban populations within the same country.

The visualisation chosen had the following three main issues:

  • Failure to answer a practical question - The subtitle of the plot and the adjoining article allude to poverty and its impact on population size and drinking water access. However, poverty is not visualised, so the designer is missing a key variable in this complex problem.
  • Perceptual issues, failure to answer a practical question - There is no explanation or justification for how the countries on the Y axis are ordered. If they are ordered by rural water access, the lack of grid lines makes the comparison between countries difficult.
  • Perceptual and data issues - Key information is missing for the reader. There is no size scale for relative population size, and the reader has no idea what year or interval the drinking water access measurements cover.

Reference

Code

The following code was used to fix the issues identified in the original.

library(readr)
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(forcats)
library(knitr)
library(png)
#load data
water <- read_csv("sdg6data_download-1Sep2022-2.csv")
water %>% str()
## spec_tbl_df [276 × 22] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ SDG                    : logi [1:276] NA NA NA NA NA NA ...
##  $ SDG target             : logi [1:276] NA NA NA NA NA NA ...
##  $ SDG indicator          : chr [1:276] "6.1.1" "6.1.1" "6.1.1" "6.1.1" ...
##  $ Indicator Code         : chr [1:276] NA NA NA NA ...
##  $ Indicator name         : chr [1:276] "Drinking water, On premises" "Drinking water, On premises" "Drinking water, On premises" "Drinking water, On premises" ...
##  $ Geographical area code : chr [1:276] NA NA NA NA ...
##  $ Geographical area name : chr [1:276] "Eritrea" "Eritrea" "Equatorial Guinea" "Equatorial Guinea" ...
##  $ Year                   : num [1:276] 2016 2016 2017 2017 2019 ...
##  $ Value                  : num [1:276] 74.34 8.42 22.52 2.02 71.47 ...
##  $ Time detail            : logi [1:276] NA NA NA NA NA NA ...
##  $ Source                 : chr [1:276] "WHO, UNICEF" "WHO, UNICEF" "WHO, UNICEF" "WHO, UNICEF" ...
##  $ Footnote               : logi [1:276] NA NA NA NA NA NA ...
##  $ Type of data           : logi [1:276] NA NA NA NA NA NA ...
##  $ Units                  : logi [1:276] NA NA NA NA NA NA ...
##  $ Age group              : logi [1:276] NA NA NA NA NA NA ...
##  $ Bounds                 : logi [1:276] NA NA NA NA NA NA ...
##  $ Frequency              : logi [1:276] NA NA NA NA NA NA ...
##  $ Level/Status           : logi [1:276] NA NA NA NA NA NA ...
##  $ Location               : chr [1:276] "Urban" "Rural" "Urban" "Rural" ...
##  $ Type of reporting      : logi [1:276] NA NA NA NA NA NA ...
##  $ Sex                    : logi [1:276] NA NA NA NA NA NA ...
##  $ SDG 6 Data portal level: chr [1:276] "6.1.1 Proportion of population using safely managed drinking water services > Safely managed service > On premises > Urban" "6.1.1 Proportion of population using safely managed drinking water services > Safely managed service > On premises > Rural" "6.1.1 Proportion of population using safely managed drinking water services > Safely managed service > On premises > Urban" "6.1.1 Proportion of population using safely managed drinking water services > Safely managed service > On premises > Rural" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   SDG = col_logical(),
##   ..   `SDG target` = col_logical(),
##   ..   `SDG indicator` = col_character(),
##   ..   `Indicator Code` = col_character(),
##   ..   `Indicator name` = col_character(),
##   ..   `Geographical area code` = col_character(),
##   ..   `Geographical area name` = col_character(),
##   ..   Year = col_double(),
##   ..   Value = col_double(),
##   ..   `Time detail` = col_logical(),
##   ..   Source = col_character(),
##   ..   Footnote = col_logical(),
##   ..   `Type of data` = col_logical(),
##   ..   Units = col_logical(),
##   ..   `Age group` = col_logical(),
##   ..   Bounds = col_logical(),
##   ..   Frequency = col_logical(),
##   ..   `Level/Status` = col_logical(),
##   ..   Location = col_character(),
##   ..   `Type of reporting` = col_logical(),
##   ..   Sex = col_logical(),
##   ..   `SDG 6 Data portal level` = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
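The filters in the following steps match 'Indicator name' against exact strings, so it can help to list the distinct indicator names first. The snippet below is a minimal sketch of that check on an invented stand-in for `water`; the data frame `toy_water` and its values are assumptions made for illustration, not part of the SDG 6 download.

```r
library(dplyr)

# Hypothetical stand-in for `water`; all values are invented
toy_water <- data.frame(
  `Indicator name` = c("Drinking water, On premises",
                       "Drinking water, On premises",
                       "Rural population (million people)",
                       "Urban population (million people)"),
  Value = c(50, 10, 2, 1),
  check.names = FALSE  # keep the space in the column name
)

# One row per distinct indicator name, most frequent first
indicator_counts <- toy_water %>% count(`Indicator name`, sort = TRUE)
indicator_counts
```

Running the same `count()` call on the real `water` object would show the exact spellings to copy into the filter targets.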
#create a vector of indicator names to use in the subsequent filter operation
#subset data to just the population observations using select() and filter()
#set 'Location' from the contents of the 'Indicator name' string
#Location will be used in the subsequent merge
#rename 'Value' and remove 'Indicator name' as shown
target2 <- c("Rural population (million people)",
             "Urban population (million people)")
Sub_saharan_pop <- water %>% 
  select(`Indicator name`, `Geographical area name`, Value, Location) %>% 
  filter(`Indicator name` %in% target2) %>% 
  mutate(Location = case_when(grepl("Rural", `Indicator name`) ~ "Rural",
                              grepl("Urban", `Indicator name`) ~ "Urban")) %>% 
  rename("Population (million people)" = Value) %>%
  select(- `Indicator name`)
                
#subset data for only drinking water observations 
df_water_1 <- water %>%
  select(`Indicator name`, `Geographical area name`, Value, Location) %>% 
  filter(`Indicator name` == "Drinking water, On premises") %>%
  rename("Drinking water, On premises" = Value) %>% 
  select(-`Indicator name`)

#subset data for only poverty observations using object created for filter 
target3 <- ("Proportion of population below international poverty line (%)")
poverty <- water %>%
  select(`Indicator name`, `Geographical area name`, Value) %>% 
  filter(`Indicator name` %in% target3) %>%
  rename("% of population below int. poverty line" = Value) %>% 
  select(- `Indicator name`)

#object is used to remove the aggregate Sub-Saharan region, as it is 
#irrelevant for the visualisation
target4 <- ("Sub-Saharan Africa")
#merge the three separate dfs back together
#na.omit() drops rows with missing data, which is acceptable here because:
  #"Somalia" does not have any recent poverty data (and I could not find
  #anything similar elsewhere);
  #"Eritrea", "Equatorial Guinea" & "Comoros" are the only countries
  #that do not have water access data for 2020 and, given this is the
  #key variable to be measured, they were left out of the visualisation
#countries are arranged in descending order according to poverty
#(ties are broken alphabetically by country name)
#a binary variable is created for subsequent faceting
#relative population % for each country location (rural vs urban) is calculated
#and stored in a new variable for plotting
df_new <- Sub_saharan_pop %>% 
  left_join(df_water_1, by = c("Geographical area name", "Location")) %>% 
  left_join(poverty, by = "Geographical area name") %>% na.omit() %>% 
  filter(!`Geographical area name` %in% target4) %>% 
  arrange(desc(`% of population below int. poverty line`),
          `Geographical area name`) %>% 
  mutate(Binary_poverty = ifelse(`% of population below int. poverty line`< 40,
                                 " Less than 40% below",
                                 " Greater than 40% below")) %>% 
  group_by(`Geographical area name`) %>% 
  mutate(pop_sum = sum(`Population (million people)`)) %>% 
  mutate(relative_pop_perc = `Population (million people)`/pop_sum*100)
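
The merge above combines two joins, a missing-data filter and a grouped percentage calculation. The sketch below reproduces that sequence on a small invented data set so each effect is easy to see; the column names `country`, `pop_m`, `access` and `poverty`, and all values, are assumptions for illustration only.

```r
library(dplyr)

# Invented toy inputs; country "B" deliberately has no poverty record
pop <- data.frame(
  country  = c("A", "A", "B", "B"),
  Location = c("Rural", "Urban", "Rural", "Urban"),
  pop_m    = c(6, 4, 1, 3)
)
water_access <- data.frame(
  country  = c("A", "A", "B", "B"),
  Location = c("Rural", "Urban", "Rural", "Urban"),
  access   = c(10, 60, 20, 80)
)
poverty_toy <- data.frame(country = "A", poverty = 45)

toy <- pop %>%
  left_join(water_access, by = c("country", "Location")) %>%
  left_join(poverty_toy, by = "country") %>%
  na.omit() %>%                      # drops both rows of country "B"
  group_by(country) %>%
  mutate(relative_pop_perc = pop_m / sum(pop_m) * 100) %>%
  ungroup()
toy
```

In this toy case only country "A" survives `na.omit()`, and its rural and urban shares add up to 100. Calling `ungroup()` at the end is optional, but it avoids carrying the grouping into later steps.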

#create factor variables
df_new$Location <- df_new$Location %>% as.factor()
df_new$`Geographical area name` <- 
  df_new$`Geographical area name` %>% as_factor()
df_new$Binary_poverty <- df_new$Binary_poverty %>% as_factor() 
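
Because the order of the factor levels decides the order of the countries on the Y axis, an alternative to sorting the rows first is to set the level order directly. The following is a minimal sketch using `fct_reorder()` from forcats on invented values; it is not the method used above, only an illustration of the same idea.

```r
library(forcats)

# Invented toy values
country <- factor(c("A", "B", "C"))
poverty_rate <- c(30, 70, 50)

# Levels sorted by poverty rate, highest first
ordered_country <- fct_reorder(country, poverty_rate, .desc = TRUE)
levels(ordered_country)
```

Note that ggplot2 places the first level of a discrete Y axis at the bottom, so the visual order should be checked against the order described in the plot caption.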


#plotting
p2.0 <-  df_new %>% ggplot(aes(x = `Drinking water, On premises`,
                      y = `Geographical area name`))

p2.1 <- p2.0 + 
  geom_point(aes(y=reorder(`Geographical area name`,
                           `% of population below int. poverty line`))) +
  geom_point(aes(size = `relative_pop_perc`,
                 colour = `Location`)) + 
  facet_grid(`Binary_poverty`~., scales = "free",
             space = "free") +
  labs(x = paste0("Proportion of population with drinking water accessible on\n",
                  "premises (%)"),
       y = "Geographical area name",
       title = "2020 Sub-Saharan drinking water access, on-premises",
       caption = ("NOTES:
Geographical areas are ranked in descending order of total proportion of
population below international poverty line")) + 
  scale_size(name = "Relative population size (%)") + 
  scale_color_discrete(name = "Location") + geom_line() + 
  theme(plot.caption=element_text(hjust = 0))
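
The core of the plot is a line joining each country's rural and urban points, with point size mapped to relative population and a labelled size legend. The sketch below isolates that pattern on invented data; the column names `access` and `share` and all values are assumptions, and the explicit `group = country` is one way to make the line grouping visible in the code rather than relying on ggplot2's default grouping.

```r
library(ggplot2)

# Invented toy values; `access` and `share` are hypothetical column names
toy_plot_data <- data.frame(
  country  = c("A", "A", "B", "B"),
  Location = c("Rural", "Urban", "Rural", "Urban"),
  access   = c(10, 60, 20, 80),
  share    = c(60, 40, 25, 75)
)

p <- ggplot(toy_plot_data, aes(x = access, y = country)) +
  geom_line(aes(group = country)) +                    # one line per country
  geom_point(aes(size = share, colour = Location)) +   # rural and urban points
  scale_size(name = "Relative population size (%)") +  # labelled size legend
  labs(x = "Population with drinking water on premises (%)", y = NULL)
p
```

Drawing the line layer before the point layer keeps the coloured points on top of the connecting lines.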

Data Reference

Reconstruction

The following plot fixes the main issues in the original.