Overview

For this project, we chose a visualization on notable Leap Day (February 29) births and deaths, sourced from Wikipedia. The project consists of three parts:

  1. Replication: Reproducing the original graph as closely as possible using ggplot2.
  2. Redesign: Applying principles of data visualization to improve clarity, aesthetics, and accessibility.
  3. Written Summary: Evaluating the original graph and explaining the rationale behind our design choices.

Original Graph

The original graph depicts notable Leap Day births (blue bars above the x-axis) and deaths (red bars below the x-axis) spanning from the 400s CE to the present.

Key Issues Identified:

  • Overcrowded Labels: Recent centuries are cluttered with overlapping labels.
  • Ambiguous Scale: No clear y-axis or gridlines for reference.
  • Distracting Typography: Oversized, bold headings detract from the focus on data.
  • Irrelevant Elements: Side notes unrelated to the plot clutter the visual.
  • Small Key Text: Totals for “births” and “deaths” are difficult to read.

Recreation of the Original Plot

library(tidyverse)
library(here)
library(showtext)
library(ggimage)
library(fontawesome)
library(ggtext)
library(scales)
library(see)
library(patchwork)
library(colorspace)
library(glue)
tuesdata <- tidytuesdayR::tt_load('2024-02-27')

births <- tuesdata$births
deaths <- tuesdata$deaths

rm(tuesdata)
# Theme
theme_set(theme_minimal(base_size = 15))
theme_update(
  plot.title.position = "plot",
  axis.line.x = element_line(linewidth = 0.2, colour = "gray50"),
  axis.line.y = element_line(linewidth = 0.2, colour = "gray50"),
  panel.grid = element_blank()
)


# Data Summary --------------------------------------------------------

births_count <- births %>% 
  count(year_birth) %>% 
  pivot_longer(!n, names_to = "event", values_to = "year") %>% 
  select(year, event, n)

deaths_count <- deaths %>% 
  count(year_death) %>% 
  summarise(year_death, n = n * -1) %>% 
  pivot_longer(!n, names_to = "event", values_to = "year") %>% 
  select(year, event, n)

df_final <- full_join(births_count, deaths_count, by = NULL)


# Data Visualization --------------------------------------------------

# Basic colors 
birth_color <- "blue"
death_color <- "red"
line_color  <- "gray"
line_color_R <- "blue"
bg_plot <- "white"

plot <- df_final %>% 
  ggplot(aes(year, n, fill = event)) +
  geom_col(alpha = 0.9) +   
  # Horizontal axis line
  annotate(geom = "segment",
           x = min(df_final$year) - 100, 
           xend = max(df_final$year) + 100, 
           y = 0, yend = 0,
           color = line_color) +
  annotate(geom = "segment",
           x = 2000, xend = 2000,
           y = Inf, yend = 0,
           color = line_color_R) +
  coord_cartesian(expand = FALSE, clip = "off") +
  scale_x_continuous(breaks = unique(df_final$year)) +
  scale_fill_manual(name = "Event", 
                    values = c(birth_color, death_color),
                    labels = c("Births", "Deaths")) +
  scale_color_manual(values = c(birth_color, death_color)) +
  labs(title = "Leap Day Births and Deaths") +
  theme(axis.text.x = element_blank(),
        axis.line.y = element_blank(),
        axis.text.y = element_blank(),
        axis.title.y = element_blank(),
        axis.line.x = element_blank(),
        axis.title.x = element_blank(),
        plot.title = element_text(hjust = 0.5, face = "bold", size = 16),
        plot.caption = element_text(family = "sans",
                                    size = 6,
                                    face = "bold",
                                    color = "gray50",
                                    margin = margin(1, 0, 0, 0, "cm")),
        plot.margin = margin(0.5, 1, 1, 0.5, "cm"))

print(plot)

Changes Made

In our first attempt at redesigning the original plot, we created a visualization that combined a scatter plot and a line chart. The scatter points were displayed in two colors (red and cyan) to represent different datasets or categories, while a blue line was added to suggest a potential trend or model fit. However, this version faced significant readability challenges.

One of the key issues was the use of time as the x-axis. The timeline spanned a broad range but lacked clear labels, making it difficult to follow the progression of data over time. Additionally, the scatter points overlapped heavily in certain areas, which further obscured individual data values. The varying point sizes added unnecessary complexity without a clear explanation of their purpose.

The inclusion of the blue line also created confusion, as it was neither labeled nor described, leaving its role in the plot ambiguous. Furthermore, the axes lacked proper labels, and the legend provided minimal information, failing to adequately explain the categories or variables represented.

These issues, coupled with the visual clutter caused by overlapping points and the combination of two plot types, made the graph difficult to interpret. As a result, this initial redesign did not effectively communicate the data or its insights.

births_per_year <- table(births$year_birth)
deaths_per_year <- table(deaths$year_death)

# create data frames that have the variables categorized as numbers
births_df <- data.frame(year = as.numeric(names(births_per_year)), births = as.numeric(births_per_year))
deaths_df <- data.frame(year = as.numeric(names(deaths_per_year)), deaths = as.numeric(deaths_per_year))

# combine them, ensuring the years line up and fixing the issue of the number of years being different by including all the possible years
leap_year_counts <- merge(births_df, deaths_df, by = "year", all = TRUE)     

# add a new variable to the datafram: ratio
leap_year_counts$births_deaths_ratio <- leap_year_counts$births / leap_year_counts$deaths
# Create the plot with explicit smoothing formula
my_plot <- ggplot(
  leap_year_counts |> filter(!is.na(births_deaths_ratio))
) +
  # births points
  geom_point(aes(x = year, y = births, color = "births", size = births), 
             alpha = 0.5) +
  # deaths points
  geom_point(aes(x = year, y = deaths, color = "deaths", size = deaths), 
             alpha = 0.5) +
  # remove legend
  theme(legend.position = "none") +
  # add trend line for the births/deaths ratio with explicit formula
  geom_smooth(aes(x = year, y = births_deaths_ratio), 
              formula = y ~ x, method = "loess", se = FALSE) +
  # direct labeling for the ratio at the maximum year where ratio exists
  geom_text(
    data = leap_year_counts |> filter(!is.na(births_deaths_ratio)) |> 
      filter(year == max(year)),
    aes(x = year, y = births_deaths_ratio),
    label = "Births to Deaths Ratio",
    vjust = -2, hjust = -0.05, size = 2.5, color = "blue"
  ) +
  # labels and titles
  labs(
    title = "Leap Day Births and Deaths",
    subtitle = "Births and deaths are plotted by year, with a trend line for their ratio",
    x = "Year",
    y = "Count",
    color = "Event Type",  
    size = "Magnitude"
  ) +
  # avoid plot trimming
  coord_cartesian(clip = "off") + 
  # adjust theme 
  theme(
    plot.title = element_text(size = 22, face = "bold", hjust = 0.5),
    plot.subtitle = element_text(size = 16, hjust = 0.5),
    axis.title.x = element_text(size = 16, face = "bold"),
    axis.title.y = element_text(size = 16, face = "bold"),
    plot.margin = margin(0.1, 0.9, 0.1, 0.1, "in")
  )

print(my_plot)

Final Refined Plot

The final refined plot addresses the shortcomings of previous attempts by thoughtfully applying multiple layers of the grammar of graphics. The data layer is focused on Leap Day births and deaths from the 20th and 21st centuries, with a clear acknowledgment of the bias toward recent data.

The plot’s aesthetics incorporate distinct colors and a clean half-violin-half-dotplot design. This approach effectively displays both the distribution density and individual data points, eliminating visual clutter. Geometrically, births and deaths are separated into mirrored panels, ensuring clarity and facilitating easy comparisons.

Additionally, the minimalist theme, clear annotations, and a linear time scale significantly enhance readability. These design choices make the visualization accessible, visually appealing, and faithful to the underlying data, successfully addressing the issues of previous versions.

df <- bind_rows(
  births |> 
    rename(
      year = year_birth,
      event = person
    ) |> 
    mutate(type = "Births") |> 
    select(year, event, type),
  
  deaths |> 
    rename(
      year = year_death,
      event = person
    ) |> 
    mutate(type = "Deaths") |> 
    select(year, event, type)
) |> 
  mutate(
    type = fct(type, levels = c("Births", "Deaths"))
  )

# create a new dataframe for years after 1900 to use in graph
filtered_df <- df[df$year >= 1900, ]

# make new variables in `filtered_df` for birth and death counts each year
filtered_df <- filtered_df |> 
  group_by(year) |> 
  mutate(births_per_year = sum(type == "Births"),
         deaths_per_year = sum(type == "Deaths"))
# Load fonts
font_add_google("Rampart One",
                family = "title_font"
) # Font for titles
font_add_google("Yanone Kaffeesatz",
                family = "caption_font"
) # Font for the caption
font_add_google("Oregano",
                family = "body_font"
) # Font for plot text
showtext_auto()

# Choosing one Palette for the Visualization
mypal <- c(
  "#A1D6E2", # Light Blue for Births
  "#4B4B4B" # Dark Grey for Deaths
  
)
# Define colors
bg_col <- "white" 
text_col <- mypal[2] 
text_hil <- mypal[1] 

# Define Text Size
ts <- unit(20, units = "cm") # Text Size


# Add text to plot--------------------------------------------------------------
plot_title <- "Leap Day"

subtitle_text <- "Leap Day (February 29) births and deaths from the 20th and 21st centuries, sourced from Wikipedia. The data shows a bias toward more recent events, visualized below using a half-violin, half-dot plot. Each dot represents an individual birth or death."
plot_subtitle <- str_wrap(subtitle_text, width = 90)
# create a new dataframe for years after 1900 to use in graph
filtered_df <- df[df$year >= 1900, ]

# make new variables in `filtered_df` for birth and death counts each year
filtered_df <- filtered_df |> 
  group_by(year) |> 
  mutate(births_per_year = sum(type == "Births"),
         deaths_per_year = sum(type == "Deaths"))

# position for label
birth_labels <- filtered_df |>
  filter(type == "Births") |>
  group_by(year) |>
  summarise(
    label_b = 0.99 - 0.07 * max(births_per_year), 
    .groups = "drop"
  )


death_labels <- filtered_df |>
  filter(type == "Deaths") |>
  group_by(year) |>
  summarise(
    label_d = 1.94 - 0.07 * max(deaths_per_year), 
    .groups = "drop"
  )

filtered_df <- filtered_df |>
  left_join(
    birth_labels |> select(year, label_b),  
    by = "year"
  ) |>
  left_join(
    death_labels |> select(year, label_d), 
    by = "year"
  )
testg <- filtered_df |> 
  ggplot(aes(
    y = year,
    x = type,
    fill = type
  )) +
  geom_point() +
  geom_violindot(
    size_dots = 8,
    trim = FALSE,
    scale = "area",
    show.legend = FALSE,
    binwidth = 0.5,
    width = 1
  ) +
  geom_text(
    data = filtered_df |> filter(type == "Births" & !is.na(label_b)),  
    aes(x = label_b, y = year, label = births_per_year),  
    hjust = 2, size = 30, color = "#A1D6E2"
  ) +
  geom_text(
    data = filtered_df |> filter(type == "Deaths" & !is.na(label_d)),  
    aes(x = label_d, y = year, label = deaths_per_year),
    hjust = 0, size = 30, color =  "#4B4B4B"
  ) +
  scale_y_continuous(
    limits = c(1880, 2020),
    breaks = seq(1900, 2020, 20),
    expand = expansion(c(0, 0.02))
  ) +
  scale_x_discrete(
    expand = expansion(0),
    position = "top"
  ) +
  scale_fill_manual(values = mypal) +
  labs(
    title = plot_title,
    subtitle = plot_subtitle,
    x = NULL, y = NULL
  ) +
  coord_cartesian(clip = "off") +
  theme_minimal() +
  theme(
    plot.margin = margin(2, 2, 2, 2, unit = "cm"),
    legend.position = "none",
        axis.line.x = element_blank(),  
    axis.line.y = element_line(
      linewidth = 1,
      linetype = 2
    ),
    axis.ticks.y = element_blank(),
      axis.text.x = element_text( 
      size = 100,  
      colour = "black",
      face = "bold",  
      hjust = 0.5,
      vjust = 0.7
    ),

    axis.text.y = element_text(
      size = 5 * ts,
      colour = text_col,
      hjust = 0.5, 
      family = "caption_font",
      margin = margin(0, 0.2, 0, 1, unit = "cm")
    ),
    plot.title = element_text(
      size = 24 * ts,
      colour = text_hil,
      margin = margin(2, 0, 1, 0, unit = "cm"),
      family = "title_font",
      hjust = 0.5
    ),
    plot.subtitle = element_text(
      size = 5.7 * ts,
      colour = text_col,
      margin = margin(1, 0, 1, 0, unit = "cm"),
      family = "body_font",
      lineheight = 0.3,
      hjust = 0.5
    ), 
    plot.caption = element_textbox(
      size = 4 * ts,
      colour = text_hil,
      margin = margin(1, 0, 1, 0, unit = "cm"),
      family = "caption_font",
      hjust = 0.5
    ),
    panel.grid = element_blank(),
    plot.title.position = "plot"
  )


ggsave("tidy_leap_day.png", plot = testg, width = 40, height = 50, units = "cm", bg = "white")