Improving a Data Visualization

In 2014, FiveThirtyEight published this graph as part of this article outlining the relationship between Pulitzer prizes and Newspaper circulation:

knitr::include_graphics("~/Data Science: STAT220/Portfolio/proj1-krohaa/pulitzer_graph.jpeg")

When I first came across this graph, I was intrigued, but confused by the discrepancy between the title and the positive correlation indicated by the trend line. I was further confused by the two different time intervals mentioned in the subtitle. Reading the article, I understood that the graph was trying to show that while there is a direct relationship between Pulitzer Prizes and Circulation, that the relationship has not changed in the past decade, indicating that modern readers are not any more inclined by Pulitzer prizes than are past readers. In the rest of this write up, I re-create the original graph and make adjustments to hopefully to make the original conclusions more clear in the data visualization alone.

Recreating the Visualization

Using the following code, I recreated the original graph in R:

install.packages("ggtext")

## Installing package into '/Accounts/krohaa/R/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2

## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(patchwork)
library(readr)
library(ggtext)

pulitzer <- read_csv("pulitzer-circulation-data.csv")

## Rows: 50 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): Newspaper, Change in Daily Circulation, 2004-2013
## dbl (3): Pulitzer Prize Winners and Finalists, 1990-2003, Pulitzer Prize Win...
## num (2): Daily Circulation, 2004, Daily Circulation, 2013
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# change percent column to numeric values
pulitzer$`Change in Daily Circulation, 2004-2013` <- as.numeric(gsub("%", "", pulitzer$`Change in Daily Circulation, 2004-2013`)) / 100

# re-creation
options(repr.plot.width= 7, repr.plot.height= 7)

ggplot(data = pulitzer, aes(x = `Pulitzer Prize Winners and Finalists, 1990-2014`, y = `Change in Daily Circulation, 2004-2013`)) +
  geom_point(color = "white", size = 3) +
  geom_point(color = "red", size = 2) +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0) +
  geom_smooth(method = lm, se = FALSE, color = "darkgrey") +
  scale_x_continuous(
    limits = c(0, 125), 
    breaks = seq(0, 125, by = 25)
    ) +
  scale_y_continuous(
    limits = c(-1, 1),
    breaks = seq(-1, 1, by = 0.25),
    labels = scales::percent
    ) +
  theme_minimal() +
  theme(
    plot.background =   element_rect(fill = "gray94", linewidth = 0),
    panel.grid.minor = element_blank(),
    panel.grid.major = element_line(color = "lightgrey")
  ) +
  labs(
    x = "Pulitzer winners and finalists",
    y = "",
    title = "Pulitzer Prizes Don't Lure Readers",
    subtitle = "Change in print and digital circulation (2004-2013) vs. Pulitzer \nwinners and finalists (1990-2014) among the top 50 newspapers"
  ) +
  annotate("text", 
           x = 99, 
           y = .68, 
           label = "The New York Times", 
           size = 4
           ) +
  annotate("text",
           x = 37,
           y = -0.92,
           label = "Rocky Mountain News", 
           size = 4
           ) +
  annotate("text",
           x = 68,
           y = -0.33,
           label = "Los Angeles Times", 
           size = 4
           ) +
  annotate("text",
           x = 115,
           y = -0.4,
           label = "The Washington \nPost", 
           size = 4
           ) +
  annotate("curve",
           x = 18,
           y = -.84,
           xend = 6, 
           yend = -.94,
           linewidth = 1,
           curvature = .4,
           arrow = arrow(length = unit(0.3, "cm"))
           )

## `geom_smooth()` using formula = 'y ~ x'

Improving the Visualization

The following code uses the same data to re-create the plot and emphasize the conclusions written in the original article:

no_nyt <- pulitzer[-3, ]

cor(no_nyt$`Daily Circulation, 2004`, no_nyt$`Pulitzer Prize Winners and Finalists, 1990-2003`)

## [1] 0.3879928

cor(no_nyt$`Daily Circulation, 2013`, no_nyt$`Pulitzer Prize Winners and Finalists, 2004-2014`)

## [1] 0.3179446

ggplot() +
  geom_jitter(data = no_nyt,
              aes(x = `Pulitzer Prize Winners and Finalists, 1990-2003`,
                 y = `Daily Circulation, 2004`),
              width = 1, height = 0,
              size = 2.5, alpha = 0.5, color = "#22A884FF") +
  geom_jitter(data = no_nyt,
             aes(x = `Pulitzer Prize Winners and Finalists, 2004-2014`,
                 y = `Daily Circulation, 2013`),
             width = 1, height = 0,
             size = 2.5, alpha = 0.5, color = "#414487FF") +
  geom_smooth(data = no_nyt,
              aes(x = `Pulitzer Prize Winners and Finalists, 1990-2003`,
                 y = `Daily Circulation, 2004`), color = "#22A884FF",
              method = "lm", se = FALSE) +
  geom_smooth(data = no_nyt,
              aes(x = `Pulitzer Prize Winners and Finalists, 2004-2014`,
                 y = `Daily Circulation, 2013`), color = "#414487FF",
              method = "lm", se = FALSE) +
  geom_richtext(
    aes(x = 50, y = 2000000, label = 
          "<span style='color:black'>Correlation</span><br>
           <span style='color:#22A884FF'>2004 = 0.39</span><br>
           <span style='color:#414487FF'>2014 = 0.32</span>"),
    size = 4, label.size = 0) +
  theme_classic() +
  labs(
    title = "Modern readers are not any more attracted by Pulitzer Prizes than before",
    subtitle = "In fact, among the top 50 newspapers, not only has the relationship between Pulitzer Prizes\nand circulation not changed in the last decade, but circulations are down regardless of\nPulitzer Prizes",
    x = "Number of Pulitzer winners and finalists in the previous 10 years",
    y = "Daily circulation"
  ) +
  annotate("text",
           x = 54,
           y = 1075000,
           label = "2004", 
           size = 4,
           color = "#22A884FF"
           ) +
   annotate("text",
           x = 50,
           y = 910000,
           label = "2014", 
           size = 4,
           color = "#414487FF"
           )

## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'

My re-creation makes the following changes to improve clarity:

Removal of the New York Times newspaper, which the article mentions is an outlier that makes the relationship between circulation and Pulitzer prizes appear much stronger.
Use of two sets of points and trend lines to convey the relationship between circulation and Pulitzer Prizes at the ends of two different time periods (1990-2004 and 2004-1014). This showcases what is only mentioned in the text- that while the relationship between prizes and circulation exists, it has not changed between 2004 and 2014.
Addition of correlation coefficients for the relationships in each year on the statistics grammar of graphics. This addition allows the viewer to confirm that the relationship between prizes and circulation has not really changed within the last decade.
The y-axis aesthetic mapping is the daily circulation for each of the two years, rather than the change in the circulation between the two years. This allows for a direct comparison of how, on average, circulation has decreased even for papers with though the relationship between circulation and prizes has not changed.
x-axis adjustment to show only prizes that were accumulated in the time period preceding the circulations for each year. This allows for the data to be comparable within the two time frames as prizes are not cumulative between the two periods.
Jittering, resizing, and changing alpha values of points such that no data is obscured, as outlined by Wilke in Fundamentals of Data Visualization.
Change of title and subtitle to more accurately reflect the argument presented in the article text.
Change of theme to remove non-data ink that does not help to convey the argument, such as the chart background, minor grid lines, and labels for particular newspapers.
Change of color to reduce harshness on eyes. I also use colors that are part of the viridis color palette, which we have learned is colorblind friendly and perceptually uniform.

I hope that these changes make the point of the article more clear without the text associated with the original graph.

project1-writeup

Amberly Kroha Published on RPubs here: https://rpubs.com/krohaa/stat220-portfolio1-krohaa-pulitzer

Improving a Data Visualization

Recreating the Visualization

Improving the Visualization