Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: Data is Beautiful subreddit (2019).


Objective

The objective of the original data visualisation was to show the percentage of employed software engineers with a bachelor’s degree or less in each county of California as of 2017.

The target audience is, broadly, the software engineering community: students studying software engineering, prospective employees looking to work in the most competitive and exciting counties in California, and employers wanting to know where to recruit the most qualified software engineers in the state.

The visualisation chosen had the following three main issues:

  • The first and most obvious issue is colour. The colour gradient uses too many bright, saturated and contrasting colours, which introduces the ‘tennis match’ effect: the viewer has to look back and forth between the legend and the map. The colours are also hard to tell apart for viewers with red-green colour blindness.
  • The second major problem is deception. The percentage of employed software engineers with a bachelor’s degree or less gives the viewer little meaningful information on its own, because a county with a low percentage of such employees may have a very high percentage of employees with Master’s or PhD qualifications. This could mislead the viewer into thinking a particular county has a very high standard of software engineers when in fact it does not.
  • Lastly, many county borderlines are missing from the map, and the county names are small and badly formatted, which makes them difficult to read.
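
The second issue can be made concrete with a little arithmetic. The sketch below uses two hypothetical counties (the names and figures are invented for illustration and are not taken from the source data) and assumes every engineer falls into exactly one of two groups, "bachelor’s degree or less" or "postgraduate".

```r
# Hypothetical counties and percentages, for illustration only
degrees <- data.frame(
  county = c("County A", "County B"),
  bachelor_or_less = c(40, 85)
)

# Assumption: the two groups are exhaustive, so the shares sum to 100
degrees$postgraduate <- 100 - degrees$bachelor_or_less

print(degrees)
```

Under these assumptions County A has the lower "bachelor’s degree or less" share but the higher postgraduate share, so a viewer who reads low values on the original map as low qualification levels would rank the two counties the wrong way round.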

Reference

Code

The following code was used to fix the issues identified in the original.

library(ggplot2)
library(dplyr)
library(readxl)
library(urbnmapr)
library(grid)
library(gridExtra)
setwd("/Users/danielevans/Documents/Uni/Sem 2/Data Viz")

## Load Data
california <- get_urbn_map("counties", sf = TRUE) %>% filter(state_name == "California")
master <- read_excel("assignment 2/masters.xlsx")
master_spatial <- left_join(california, master, by = "county_name")

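## Plot Data
# Choropleth map: fill each county by its percentage, with thin white borders
p <- ggplot() + geom_sf(data = master_spatial,
                   mapping = aes(fill = perc),
                   color = "#ffffff", size = 0.05) +
  coord_sf(datum = NA) +  # hide the graticule and axis labels
  scale_fill_viridis_c("Percent(%)") +  # colour-blind-friendly sequential scale
  theme(legend.position = "left",
        legend.key.size = unit(.4, "cm"))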

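# Sort counties by percentage and drop those with a value of zero
ordered_master <- master[order(master$perc) , ]
hist <- ordered_master %>% filter(perc > 0)
# Fix the factor levels so the bars keep the sorted order
hist$county_name <- factor(hist$county_name, levels = hist$county_name)

# Horizontal bar chart of the same values, labelled with the percentage
x <- ggplot(hist, aes(x = county_name, y = perc, fill = perc)) +
  geom_bar(stat = 'identity', width = .4, colour = "#FFFFFF") +
  labs(y = "Percent(%)",
       x = " ") +
  # guide = "none" replaces the deprecated guide = FALSE; the map carries the legend
  scale_fill_viridis_c(guide = "none") +
  geom_text(aes(label = perc), nudge_y = 3, size = 2.5) +
  coord_flip()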
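
The listing loads grid and gridExtra but ends before the two plots are combined. One plausible way to place a map and a ranked bar chart side by side is gridExtra's arrangeGrob(). This is an assumption about the final step, not the author's confirmed code. The sketch below is self-contained: the toy data frame and the p_toy and x_toy plots are hypothetical stand-ins for master, p and x.

```r
library(ggplot2)
library(grid)
library(gridExtra)

# Hypothetical stand-in data; the real values come from masters.xlsx
toy <- data.frame(
  county_name = c("County A", "County B", "County C"),
  perc = c(12.5, 30.1, 45.8)
)
toy$county_name <- factor(toy$county_name, levels = toy$county_name)

# Stand-in for the map panel (p): a simple tile plot with the fill legend
p_toy <- ggplot(toy, aes(x = 1, y = county_name, fill = perc)) +
  geom_tile() +
  scale_fill_viridis_c("Percent(%)") +
  theme(legend.position = "left")

# Stand-in for the bar chart panel (x): same fill scale, legend suppressed
x_toy <- ggplot(toy, aes(x = county_name, y = perc, fill = perc)) +
  geom_col(width = .4) +
  scale_fill_viridis_c(guide = "none") +
  coord_flip()

# arrangeGrob() builds the two-column layout without drawing it
combined <- arrangeGrob(p_toy, x_toy, ncol = 2,
                        top = textGrob("Toy layout: map on the left, ranked bars on the right"))

# Draw the layout on the current graphics device
grid.newpage()
grid.draw(combined)
```

With the real objects, the same call would take p and x in place of p_toy and x_toy. Sharing one viridis scale across both panels lets the bar chart reuse the map's legend instead of repeating it.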

Data Reference

Reconstruction

The following plot fixes the main issues in the original.