library(tidyverse)  # attaches ggplot2, dplyr, tidyr, and the other core packages

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Combined file that I built by hand from the per-year Our World in Data csv files.
Gendergap <- read.csv("~/Desktop/RStudio/STAT_220/assignments/proj1-audreymoyer/data/Gendergap.csv")

Original Graph

The original graph is adequate, but it gives no historical context for the data. Its direct labeling makes the lines easy to identify, though I was slightly confused by which countries Our World in Data chose to include. The original also lacked alt text, and I felt that the title and subtitle did not describe the graph well enough. I think a scatterplot colored by country is the right way to display this data, since we want to see how the gender wage gap has changed over time in different countries. I also appreciated that, instead of a line of best fit (which is what I am used to seeing on scatterplots), they connected the points directly, because this shows the year-to-year changes better.

The data was provided as the gender pay gap (in percent) for each country, split into a separate csv file for each year, so I had to build my own csv file combining everything I wanted to include. As a result, my dataset has one column per country, and each country's points and line need their own lines of code. Some countries have missing data points, and geom_line() breaks the connecting line at a missing value, so the lines for France and Italy do not appear at all and the line for Australia has a break. I couldn't figure out how to code direct labeling other than by hand, so there are a lot of annotate() lines.
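One possible way around both the one-layer-per-country code and the broken lines is to reshape the wide table into long form and drop the missing rows before plotting. This is only a sketch: `toy_gap` holds made-up values rather than the real Gendergap.csv data, and the column names `Country` and `Gap` are my own choices.

```r
library(ggplot2)
library(tidyr)

# Made-up values for illustration only; not the real Gendergap.csv data.
toy_gap <- data.frame(
  Year          = c(2000, 2001, 2002, 2003),
  United.States = c(0.23, 0.22, NA, 0.20),
  Italy         = c(0.10, NA, NA, 0.08)
)

# One row per country and year. values_drop_na = TRUE removes the missing
# rows, so geom_line() joins the remaining points across the gaps.
toy_long <- pivot_longer(
  toy_gap,
  cols = -Year,
  names_to = "Country",     # hypothetical column name
  values_to = "Gap",        # hypothetical column name
  values_drop_na = TRUE
)

# A single point layer and a single line layer now cover every country.
p_long <- ggplot(toy_long, aes(x = Year, y = Gap, color = Country)) +
  geom_point(size = 0.5) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  theme_minimal()
```

Joining points across missing years is a design choice: it keeps every country visible, but it also hides where the data is missing, so the point markers matter more.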

# Keep the first 47 rows (1970 to 2016, assuming one row per year); row indices start at 1.
gendergap.2016 <- slice(Gendergap, 1:47)

ggplot(data = gendergap.2016) +
  geom_point(aes(x = Year, y = `United.States`), color = "#b1360a", size = 0.5) +
  geom_line(aes(x = Year, y = `United.States`),  color = "#b1360a") +
  geom_point(aes(x = Year, y = `UK`), color = "#4c6a9c", size = 0.5) +
  geom_line(aes(x = Year, y = `UK`), color = "#4c6a9c") +
  geom_point(aes(x = Year, y = `Japan`), color = "#2e8466", size = 0.5) +
  geom_line(aes(x = Year, y = `Japan`), color = "#2e8466") +
  geom_point(aes(x = Year, y = `Australia`), color = "#6e3e91", size = 0.5) +
  geom_line(aes(x = Year, y = `Australia`), color = "#6e3e91") +
  geom_point(aes(x = Year, y = `France`), color = "#01295b", size = 0.5) +
  geom_line(aes(x = Year, y = `France`), color = "#01295b") +
  geom_point(aes(x = Year, y = `Italy`), color = "#8d3a42", size = 0.5) +
  geom_path( aes(x = Year, y = `Italy`), na.rm = TRUE, color = "#8d3a42") +
  geom_point(aes(x = Year, y = `Sweden`), color = "#996d3a", size = 0.5) +
  geom_line(aes(x = Year, y = `Sweden`), color = "#996d3a") +
  theme_minimal() +
  coord_cartesian(clip = "off") +
  theme(
    plot.margin = margin(0.1, 0.9, 0.1, 0.1, "in"),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    panel.grid.minor.y = element_blank()
  ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Unadjusted gender gap in median earnings, 1970 to 2016", subtitle = "The gender wage gap is unadjusted and is defined as the difference between median earnings of 
men and women relative to median earnings of men. Estimates refer to full-time employees 
and to self-employed workers.", x = "", y = "") +
  annotate("text", x = 2021, y = .19, 
           label = "United States", 
           color = "#b1360a") + 
  annotate("text", x = 2021.5, y = .165, 
           label = "United Kingdom", 
           color = "#4c6a9c") +
  annotate("text", x = 2018.5, y = .25, 
           label = "Japan", 
           color = "#2e8466") +
  annotate("text", x = 2019.5, y = .14, 
           label = "Australia", 
           color = "#6e3e91") +
  annotate("text", x = 2017, y = .105, 
           label = "France", 
           color = "#01295b") +
  annotate("text", x = 2016, y = .06, 
           label = "Italy", 
           color = "#8d3a42") +
  annotate("text", x = 2019, y = .085, 
           label = "Sweden", 
           color = "#996d3a")

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 6 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 43 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 34 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 32 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 25 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 25 rows containing missing values or values outside the scale range
## (`geom_line()`).
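The hand-placed annotate() calls could possibly be replaced by labels computed from the data. The sketch below assumes the data is in long form (one row per country and year, with the hypothetical columns `Year`, `Country`, and `Gap`) and uses made-up values. It puts each label at the country's last available year.

```r
library(ggplot2)
library(dplyr)

# Made-up long-format values for illustration only.
toy_long <- data.frame(
  Year    = c(2014, 2015, 2016, 2014, 2015),
  Country = c("Japan", "Japan", "Japan", "Sweden", "Sweden"),
  Gap     = c(0.26, 0.25, 0.25, 0.09, 0.08)
)

# Last available observation for each country: the label goes here.
label_rows <- toy_long %>%
  group_by(Country) %>%
  slice_max(Year, n = 1) %>%
  ungroup()

p_labels <- ggplot(toy_long, aes(x = Year, y = Gap, color = Country)) +
  geom_point(size = 0.5) +
  geom_line() +
  geom_text(
    data = label_rows,
    aes(label = Country),
    hjust = 0,       # left-align the text at the end point
    nudge_x = 0.2    # small gap between the line end and the text
  ) +
  coord_cartesian(clip = "off") +   # let labels extend into the margin
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.margin = margin(0.1, 0.9, 0.1, 0.1, "in")
  )
```

Labels placed this way can still overlap when two countries end at similar values, so some manual nudging (or a package such as ggrepel) may still be needed.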

Improved Graph

For my improved graph, I included data through 2023, rather than only through 2016 as in the original. I also chose different countries that I thought would be more interesting to compare: the U.S. and the U.K., South Korea and Japan, Mexico and Colombia, and Sweden and Norway. I picked pairs of countries that I expected to be at least somewhat similar (or at least interesting to compare) because they are in similar parts of the world or have similar economies.

I changed some of the colors: I removed the color used for France in the original graph and added pink, orange, and grey, which I thought would be easier to tell apart in the final graph. To further differentiate the countries, I gave each one its own point shape. I did feel that the shape options were somewhat lacking, because filled-in shapes are the easiest to tell apart, but I had to use some non-filled shapes as well. I decided to keep the major and minor gridlines in this graph, because they make it easier to read the year and percentage for each point. I added a label to the y-axis because the original graph didn't have one, and I was slightly confused about what the percentage actually measured. I included alt text because I felt that the graph needed more description. Finally, I added some historical events (i.e., laws and legal decisions) that may have affected the countries' percentages, because I thought they would be helpful context. I only included three, because I didn't want to crowd the graph with extra lines and text, and because these three seemed the most relevant to the data.
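If the data were in long form, the per-country colors and shapes could be declared once as named vectors instead of being repeated in every layer. This is a sketch with made-up values and the hypothetical columns `Year`, `Country`, and `Gap`. Only two countries are shown, using the same color names as in my graph.

```r
library(ggplot2)

# Made-up long-format values for illustration only.
toy_long <- data.frame(
  Year    = rep(c(2020, 2021, 2022), times = 2),
  Country = rep(c("Mexico", "Norway"), each = 3),
  Gap     = c(0.17, 0.16, 0.15, 0.05, 0.05, 0.04)
)

# Named vectors keep each country's color and shape in one place.
country_colors <- c(Mexico = "deeppink3", Norway = "grey27")
country_shapes <- c(Mexico = 6, Norway = 3)  # 6 = open down-pointing triangle, 3 = plus

p_shapes <- ggplot(toy_long,
                   aes(x = Year, y = Gap, color = Country, shape = Country)) +
  geom_line() +
  geom_point() +
  scale_color_manual(values = country_colors) +
  scale_shape_manual(values = country_shapes) +
  theme_minimal()
```

Mapping both color and shape to the same variable also gives a single combined legend, which could stand in for the direct labels if those get too crowded.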

ggplot(data = Gendergap) +
  geom_point(aes(x = Year, y = `United.States`), color = "#b1360a") +
  geom_line(aes(x = Year, y = `United.States`), color = "#b1360a") +
  geom_point(aes(x = Year, y = `UK`), color = "#4c6a9c", shape = "square") +
  geom_line(aes(x = Year, y = `UK`), color = "#4c6a9c") +
  geom_point(aes(x = Year, y = `Japan`), color = "#2e8466", shape = "triangle") +
  geom_line(aes(x = Year, y = `Japan`), color = "#2e8466") +
  geom_point(aes(x = Year, y = `Korea`), color = "#6e3e91", shape = "diamond") +
  geom_line(aes(x = Year, y = `Korea`), color = "#6e3e91") +
  geom_point(aes(x = Year, y = `Mexico`), color = "deeppink3", shape = 6) +
  geom_line(aes(x = Year, y = `Mexico`), color = "deeppink3") +
  geom_point(aes(x = Year, y = `Colombia`), color = "orange", shape = 13) +
  geom_line(aes(x = Year, y = `Colombia`), color = "orange") +
  geom_point(aes(x = Year, y = `Sweden`), color = "#996d3a", shape = 8) +
  geom_line(aes(x = Year, y = `Sweden`), color = "#996d3a") +
  geom_point(aes(x = Year, y = `Norway`), color = "grey27", shape = "plus") +
  geom_line(aes(x = Year, y = `Norway`), color = "grey27") + 
  geom_vline(xintercept = 2009, color = "#b1360a") +
  geom_vline(xintercept = 2010, color = "#4c6a9c") + 
  geom_vline(xintercept = 1985, color = "#2e8466") +
  theme_minimal() +
  annotate("text", x = 2029, y = .17, 
           label = "United States", 
           color = "#b1360a", fontface = c("bold"), family = c("serif")) + 
  annotate("text", x = 2030, y = .13, 
           label = "United Kingdom", 
           color = "#4c6a9c", fontface = c("bold"), family = c("serif")) +
  annotate("text", x = 2026.5, y = .22, 
           label = "Japan", 
           color = "#2e8466", fontface = c("bold"), family = c("serif")) +
  annotate("text", x = 2029, y = .29, 
           label = "South Korea", 
           color = "#6e3e91", fontface = c("bold"), family = c("serif")) +
  annotate("text", x = 2027, y = .15, 
           label = "Mexico", 
           color = "deeppink3", fontface = c("bold"), family = c("serif")) +
  annotate("text", x = 2028, y = .02, 
           label = "Colombia", 
           color = "orange", fontface = c("bold"), family = c("serif")) +
  annotate("text", x = 2027.5, y = .075, 
           label = "Sweden", 
           color = "#996d3a", fontface = c("bold"), family = c("serif")) +
  annotate("text", x = 2027, y = .045, 
           label = "Norway", 
           color = "grey27", fontface = c("bold"), family = c("serif")) +
  annotate("text", x = 2002, y = .5,
           label = "Lilly Ledbetter Fair\nPay Act of 2009, U.S.",
           size = 3, color = "#b1360a", fontface = c("bold"), family = c("serif")) +
  annotate("text", x = 2018, y = .50,
           label = "Equality Act of 2010, UK", color = "#4c6a9c", size = 3, fontface = c("bold"), family = c("serif")) +
  annotate("text", x = 1976.6, y = .25,
           label = "Equal Employment\nOpportunity Act, Japan",
           color = "#2e8466", size = 3, fontface = c("bold"), family = c("serif")) +
  coord_cartesian(clip = "off") +
  theme(
    plot.margin = margin(0.1, 0.9, 0.1, 0.1, "in")
    ) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1)) +
  labs(title = "Unadjusted gender wage gap in median earnings, 1970 to 2023", x = "", y = "Pay gap as a percentage of men's median pay", subtitle = "Scatterplot of the difference between median earnings of men and women relative to median 
earnings of men in South Korea, Japan, the U.S., Mexico, the U.K., Sweden, Norway, and Colombia, 
where each country's gap gradually declines.")

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 5 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 21 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 15 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 38 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 35 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 37 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 25 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 25 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 27 rows containing missing values or values outside the scale range
## (`geom_point()`).

## Warning: Removed 27 rows containing missing values or values outside the scale range
## (`geom_line()`).
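The three event lines and their labels could also be driven from one small data frame, and alt text can be stored on the plot itself with labs(alt = ...). The sketch below uses the events, years, and colors from my graph. The data frame layout, the `hjust` column, and the alt text wording are illustrative assumptions, and no country data is drawn.

```r
library(ggplot2)

# Events, years, and colors as used in the graph above.
# hjust > 1 puts a label to the left of its line, hjust < 0 to the right.
events <- data.frame(
  year  = c(1985, 2009, 2010),
  label = c("Equal Employment\nOpportunity Act, Japan",
            "Lilly Ledbetter Fair\nPay Act of 2009, U.S.",
            "Equality Act of 2010, UK"),
  color = c("#2e8466", "#b1360a", "#4c6a9c"),
  hjust = c(1.05, 1.05, -0.05)
)

p_events <- ggplot() +
  geom_vline(data = events, aes(xintercept = year, color = color)) +
  geom_text(
    data = events,
    aes(x = year, y = 0.5, label = label, color = color, hjust = hjust),
    size = 3, fontface = "bold", family = "serif"
  ) +
  scale_color_identity() +   # use the hex codes in the data as-is
  scale_x_continuous(limits = c(1970, 2023)) +
  scale_y_continuous(limits = c(0, 0.55),
                     labels = scales::percent_format(accuracy = 1)) +
  labs(
    x = "", y = "",
    # Illustrative alt text only.
    alt = "Vertical lines mark three equal-pay laws: Japan 1985, U.S. 2009, UK 2010."
  ) +
  theme_minimal()

# get_alt_text() returns the alt text stored on the plot.
get_alt_text(p_events)
```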
