Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original


Source: Mike Bostock at bl.ocks.org (2020).


Objective

The original visualisation shows how the U.S. population distribution changed from 1850 to 2000, at 10-year intervals, and the difference between genders within each age bracket. It is a new, dynamic version of the standard population pyramid, aimed at the general public.

The visualisation chosen had the following three main issues:

  • The x axis represents time from right (youngest) to left (oldest), contrary to the standard convention of time increasing from left to right.

  • The added visual effects increase the confusion, because the display shifts from right to left with each year, and they make it impossible to compare values between decades or within age groups.

  • Although blue and pink map intuitively onto the over-represented gender, the encoding is not clear: the overlap between the two is drawn in purple, so only the tops of the bars show the surplus of males or females, and the actual numbers cannot be read off. For viewers with the most common forms of colour blindness, the male-blue/female-pink association will not work, as they would not be able to tell the two colours apart, so the encoding would have to be explained in the graph.

Reference

Code

The following code was used to fix the issues identified in the original.

POPULATION PYRAMIDS

STACKOVERFLOW MODIFIED TEMPLATE by Andriy Gazin (2016)

library(dplyr)
library(tidyverse)
library(readxl)
library(ggplot2)

#population <- read.csv("population.tsv", sep = "\t")
population <- read_excel("population.xlsx")

# Recode the lower bound of each five-year bracket as an ordered, labelled factor.
population2 <- population %>%
  mutate(age = factor(age, ordered = TRUE,
                      levels = seq(0, 90, by = 5),
                      labels = c("0-4", "5-9", "10-14", "15-19", "20-24",
                                 "25-29", "30-34", "35-39", "40-44",
                                 "45-49", "50-54", "55-59", "60-64",
                                 "65-69", "70-74", "75-79", "80-84",
                                 "85-89", "90")))


population2 <- population2 %>% filter(year %in% c(1850, 1900, 1950,1970, 1990, 2000))

# Males (sex == 1) are drawn on the left and females (sex == 2) on the right.
# Assumption: the male bars only extend to the left if pop is stored as a
# negative value for males; if pop is positive for both sexes,
# use ymax = -0.3 - pop in the first layer instead.
p1 <- ggplot(population2, aes(x = age, color = gender)) +
  geom_linerange(data = population2[population2$sex == 1, ],
                 aes(ymin = -0.1, ymax = -0.3 + pop), size = 3.5, alpha = 0.8) +
  geom_linerange(data = population2[population2$sex == 2, ],
                 aes(ymin = 0.1, ymax = 0.3 + pop), size = 3.5, alpha = 0.8) +
  facet_wrap(~year, ncol = 2) +
  coord_flip() +
  labs(title = "U.S. Population Pyramids",
       subtitle = "Snapshot of the evolution 1850-2000",
       caption = "Data: Mike Bostock at bl.ocks.org (2020)") +
  ylab("Population in millions") +
  # breaks fixes the legend order so that each label matches its colour.
  scale_color_manual(name = "", breaks = c("M", "F"),
                     values = c("M" = "#3E606F", "F" = "#8C3F4D"),
                     labels = c("Male (Left)", "Female (Right)")) +
  # Tick positions carry the same 0.3 offset on both sides of the centre.
  scale_y_continuous(breaks = c(-(12:1) - 0.3, 1:12 + 0.3),
                     labels = as.character(c(12:1, 1:12)))
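The mirrored layout that a population pyramid relies on can be sketched in isolation. The example below is only a minimal sketch, not the code used for this reconstruction: the data frame `toy` and its numbers are invented for illustration (they are not census figures), `signed_pop` is a hypothetical helper column, and `geom_col()` is swapped in for `geom_linerange()` to keep the example short.

```r
library(ggplot2)

# Toy data, invented purely for illustration (not census figures).
# Column names mirror those used above: age, gender ("M"/"F"), pop (millions).
toy <- data.frame(
  age    = factor(rep(c("0-4", "5-9", "10-14"), times = 2),
                  levels = c("0-4", "5-9", "10-14")),
  gender = rep(c("M", "F"), each = 3),
  pop    = c(9.5, 9.1, 8.8, 9.0, 8.7, 8.5)
)

# Hypothetical helper column: male counts become negative so that
# their bars extend to the left of zero.
toy$signed_pop <- ifelse(toy$gender == "M", -toy$pop, toy$pop)

pyramid <- ggplot(toy, aes(x = age, y = signed_pop, fill = gender)) +
  geom_col(width = 0.8) +
  coord_flip() +
  # Print magnitudes rather than signed values on the axis.
  scale_y_continuous(labels = abs) +
  scale_fill_manual(name = "", breaks = c("M", "F"),
                    values = c("M" = "#3E606F", "F" = "#8C3F4D"),
                    labels = c("Male (left)", "Female (right)")) +
  ylab("Population in millions")
```

Printing `pyramid` draws the chart. Adding `facet_wrap(~year)` would give one panel per year, provided the data carries a `year` column.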

As I was not reaching any results with the pyramids, and following Kane’s advice, I decided to attempt a grouped bar chart. I did not reach the desired result, however, as I could not unstack the bars.

GROUPED BAR CHART

library(dplyr)
library(ggplot2)
#population <- read.csv("population.tsv", sep = "\t")


# Recode the lower bound of each five-year bracket as an ordered, labelled factor.
population4 <- population %>%
  mutate(age = factor(age, ordered = TRUE,
                      levels = seq(0, 90, by = 5),
                      labels = c("0-4", "5-9", "10-14", "15-19", "20-24",
                                 "25-29", "30-34", "35-39", "40-44",
                                 "45-49", "50-54", "55-59", "60-64",
                                 "65-69", "70-74", "75-79", "80-84",
                                 "85-89", "90")))

# Filter population4 (not population) so the recoded age factor is kept.
population4 <- population4 %>% filter(year %in% c(1850, 1900, 1950, 1970, 1990, 2000))

p3 <- ggplot(population4, aes(x = age, y = popm, fill = gender))
# Use one layer only: position is an argument of the geom, not an aesthetic,
# so it goes outside aes(). A second geom_col() without position = "dodge"
# would stack its bars over the dodged ones.
p3 + geom_col(position = "dodge") +
  facet_wrap(~year, ncol = 2) +
  labs(title = "U.S. Population Pyramids",
       subtitle = "Snapshot of the evolution 1850-2000",
       caption = "Data: Mike Bostock at bl.ocks.org (2020)") +
  # The bars use the fill aesthetic, so the manual scale must be a fill scale;
  # breaks fixes the legend order so that each label matches its colour.
  scale_fill_manual(name = "", breaks = c("M", "F"),
                    values = c("M" = "#3E606F", "F" = "#8C3F4D"),
                    labels = c("Male", "Female")) +
  scale_y_continuous(breaks = 1:12)

Data Reference

Reconstruction

The following plot fixes the main issues in the original.

Ideally, with stronger R coding skills, I would display two parallel y-axes: on the left, a standard population pyramid with enough detail to read the population values (tick marks scaled to best fit); and on the right, an axis showing the percentage difference between genders, colour coded like a stock-price line chart over time. Both would share an x-axis displaying population in millions and the percentage difference.
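The percentage difference between genders described above could be computed along the following lines. This is a sketch under stated assumptions rather than part of the reconstruction: the data frame `toy` and its numbers are invented for illustration, the column names `pct_diff` and `male_surplus` are hypothetical, and expressing the difference relative to the female count is only one possible definition.

```r
# Toy data, invented purely for illustration (not census figures).
toy <- data.frame(
  year   = rep(c(1900, 2000), each = 4),
  age    = rep(c("0-4", "0-4", "5-9", "5-9"), times = 2),
  gender = rep(c("M", "F"), times = 4),
  pop    = c(4.6, 4.5, 4.4, 4.5, 9.7, 9.3, 10.0, 9.6)
)

# One row per year and age bracket, with male and female counts side by side.
males   <- toy[toy$gender == "M", c("year", "age", "pop")]
females <- toy[toy$gender == "F", c("year", "age", "pop")]
wide <- merge(males, females, by = c("year", "age"), suffixes = c("_m", "_f"))

# Assumed definition: difference relative to the female count, in percent.
# Positive values mean more males, negative values mean more females.
wide$pct_diff <- 100 * (wide$pop_m - wide$pop_f) / wide$pop_f

# TRUE where males outnumber females; usable as a two-colour aesthetic,
# in the spirit of the rise/fall colouring of a stock-price chart.
wide$male_surplus <- wide$pct_diff > 0
```

The resulting `wide` table has one row per year and age bracket, so it could feed a second panel placed next to the pyramid.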

To allow comparison between decades, a multi-faceted approach could be the solution, although the amount of space required should be taken into consideration: there could be 15 visualisations after all, so focusing on certain periods of time could help.

To address the colour blindness issue while still using the intuitive colour association that most people make when they see pink and blue, adequate labelling is required on both vertical axes.