Assignment 2

Click the Original, Code and Reconstruction tabs to read about the issues and how they were fixed.

Original

Source: Visualizing the massive gender pay gap across U.S. industries (Irena, 2021).

Objective

The objective of the data visualisation is to highlight the discrepancy in salary between men and women, emphasising the trend that women’s median salary is lower than men’s. Furthermore, the visual aims to show the size of this discrepancy across industries, making clear which industries pay women less than men.

The visualisation was found on the website HowMuch.net, whose mission is to provide the public with more information on subjects relating to the United States’ economic and financial situation (HowMuch, n.d.). The visual is for those seeking information on the US economy for employment purposes, or potentially for individuals researching a new career in a different industry. Therefore, the visual is aimed at the 20-60 age group, with the majority of the audience likely to be female given the information the visualisation provides.

The visualisation chosen had the following three main issues:

Visual bombardment.
Ignoring conventions.
Conflicting colour schemes.

The first issue is visual bombardment. The visualisation combines several techniques: symbols in addition to text labels, an unconventional plot type, a continuous colour scale combined with a qualitative colour scale, and two legends that detail how the median salary is measured. Taken together, these overwhelm the viewer with information, so the viewer takes longer to understand the graph’s objective.

The second issue is that the graph ignores the data visualisation conventions that make a graph easy to understand. The original is a circular bar chart with no x-axis or y-axis; the qualitative variables are shown only by position around the circle, and this design choice reduces the accuracy of visual comparison. This makes it hard to compare values by Gender and Industry, and therefore difficult to determine trends in the data and answer questions such as which industry has the largest difference in salary between men and women. The chart therefore fails the Trifecta Checkup framework’s question “What does the visual say?” (Fung, 2014).

Lastly, the original visualisation uses a continuous colour scale to denote the value of the median salary, even though, as a bar chart, it already uses bar length to encode the median salary for men and women in each industry. In addition, colour is also used for the qualitative variable Gender to show which bars are for men and which are for women. The continuous colour scheme is therefore redundant, and the double use of colour confuses the reader about what colour represents in the visual.

References

Irena (2021). Visualizing the massive gender pay gap across U.S. industries. HowMuch. Available at: https://howmuch.net/articles/men-vs-women-comparing-income-by-industry [Accessed 26 Mar. 2023].

HowMuch (n.d.). HowMuch. Available at: https://howmuch.net/about-us.html.

Fung, K.F. (2014). Junk Charts Trifecta Checkup: The Definitive Guide. Junk Charts: Recycling chart junk as junk art. Available at: https://junkcharts.typepad.com/junk_charts/junk-charts-trifecta-checkup-the-definitive-guide.html [Accessed 27 Mar. 2023].

Code

The following code was used to fix the issues identified in the original.

# Packages used below; tidyr provides pivot_longer()
library(ggplot2)
library(here)
library(dplyr)
library(tidyr)
library(stringr)
library(scales)

#######################################################
################### Data Manipulation #################
#######################################################

# Read the CSV file
MedianEarnings_df <- read.csv(here('MedianEarningsByIndustryAndSex.csv'))

# Drop rows whose label contains ':' and the overall total row
MedianEarnings_df <- MedianEarnings_df %>%
  filter(!grepl(':', Label..Grouping.)) %>%
  filter(
    !grepl(
      'Civilian employed population 16 years and over with earnings',
      Label..Grouping.
    )
  )

# Trim leading and trailing whitespace from every character column
MedianEarnings_df <- MedianEarnings_df %>%
  mutate(across(where(is.character), str_trim))


# Rename multiple columns
MedianEarnings_df  <- MedianEarnings_df  %>%
  rename(
    "Industry" = "Label..Grouping." ,
    "Median_Earnings" = "United.States..Median.earnings..dollars...Estimate",
    "Male" = "United.States..Median.earnings..dollars..for.male..Estimate",
    "Female" = "United.States..Median.earnings..dollars..for.female..Estimate",
    "Womens_earnings_percentage_of_mens_earning" = "United.States..Women.s.earnings.as.a.percentage.of.men.s.earning..Estimate"
  )

# Remove thousands separators (commas) from the dollar columns
MedianEarnings_df$Median_Earnings <-
  gsub(',', '', MedianEarnings_df$Median_Earnings)
MedianEarnings_df$Male <- gsub(',', '', MedianEarnings_df$Male)
MedianEarnings_df$Female <- gsub(',', '', MedianEarnings_df$Female)

# Remove the percent sign from the percentage column
MedianEarnings_df$Womens_earnings_percentage_of_mens_earning <-
  gsub('%',
       '',
       MedianEarnings_df$Womens_earnings_percentage_of_mens_earning)

# Shorten a long industry label
MedianEarnings_df$Industry <-
  gsub(
    'Administrative and support and waste management services',
    'Administrative, support and waste management',
    MedianEarnings_df$Industry
  )

# Select the columns needed for the bar chart
MedianEarnings_df <- MedianEarnings_df %>%
  select(Industry, Male, Female)

# Convert the earnings columns from character to double
MedianEarnings_df$Male <- as.double(MedianEarnings_df$Male)
MedianEarnings_df$Female <- as.double(MedianEarnings_df$Female)

# Unpivot to long format (one row per industry and gender) for the bar chart
MedianEarnings_df <- MedianEarnings_df %>%
  pivot_longer(
    cols = c('Male', 'Female'),
    names_to = 'Gender',
    values_to = 'Median_Earnings'
  )

# Horizontal dodged bar chart: one pair of bars (female, male) per industry
p <-
  ggplot(data = MedianEarnings_df, aes(
    x = reorder(Industry, Median_Earnings),
    y = Median_Earnings,
    fill = Gender
  )) +
  # geom_col() is equivalent to geom_bar(stat = "identity")
  geom_col(
    width = 0.9,
    position = position_dodge(width = 0.9),
    color = 'white'
  ) +
  # Name the colours so each one is tied to a gender explicitly
  scale_fill_manual(values = c(Female = "#ca0020", Male = "#0571b0")) +
  coord_flip() +
  # Print each dollar value inside the start of its bar
  geom_text(
    aes(y = 0, label = dollar(Median_Earnings)),
    hjust = -0.2,
    size = 3.1,
    colour = "white",
    fontface = "bold",
    position = position_dodge(0.9)
  ) +
  labs(
    title = "Men vs. Women: Comparing Median Salary by Industry",
    x = "Industry",
    y = "Median Salary (Dollars)"
  ) +
  theme(axis.title = element_text(face = "bold"))
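
The critique of the original asks which industry has the largest difference in salary between men and women. As a rough sketch of how that question could be checked numerically, the example below ranks industries by pay gap. It assumes data in the same long format produced by the unpivot step (columns Industry, Gender, Median_Earnings). The data frame toy_df, its industry names, its dollar values, and the columns Gap and Female_pct_of_Male are all invented for illustration and are not the Census figures. It uses base R only so that it runs without extra packages.

```r
# Toy data in the same long format as the unpivoted earnings table.
# Industries and dollar values are hypothetical, for illustration only.
toy_df <- data.frame(
  Industry = rep(c("Industry A", "Industry B", "Industry C"), each = 2),
  Gender = rep(c("Male", "Female"), times = 3),
  Median_Earnings = c(60000, 45000, 52000, 50000, 80000, 56000)
)

# Split by gender and join back to one row per industry.
male <- toy_df[toy_df$Gender == "Male", c("Industry", "Median_Earnings")]
female <- toy_df[toy_df$Gender == "Female", c("Industry", "Median_Earnings")]
gap_df <- merge(male, female, by = "Industry", suffixes = c("_Male", "_Female"))

# Absolute gap in dollars, and women's earnings as a percentage of men's.
gap_df$Gap <- gap_df$Median_Earnings_Male - gap_df$Median_Earnings_Female
gap_df$Female_pct_of_Male <-
  round(100 * gap_df$Median_Earnings_Female / gap_df$Median_Earnings_Male, 1)

# Sort so the industry with the largest gap comes first.
gap_df <- gap_df[order(-gap_df$Gap), ]
print(gap_df)
```

With these toy values, "Industry C" is listed first because it has the largest dollar gap. Applied to the real table, the same steps would give a numeric ranking to read alongside the bar chart.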

Data Reference

U.S. Census Bureau (n.d.). INDUSTRY BY SEX AND MEDIAN EARNINGS IN THE PAST 12 MONTHS (IN 2019 INFLATION-ADJUSTED DOLLARS) FOR THE CIVILIAN EMPLOYED POPULATION 16 YEARS AND OVER. [data set] data.census.gov. Available at: https://data.census.gov/table?q=S2413&tid=ACSST1Y2019.S2413&hidePreview=true [Accessed 26 Mar. 2023].

Reconstruction

To address the visual bombardment, the new visual uses a single legend that identifies which bars are male and which are female. The symbols are removed, and the industry names from the dataset are used as the axis labels instead. The circular bar chart was replaced with a standard bar chart that has both an x-axis and a y-axis, which also addresses the issue of ignoring conventions. Finally, to address the conflicting colour schemes, two specific colours were chosen to represent male and female, and the continuous colour scheme was removed because bar length along the value axis already encodes the median salary.

The following plot fixes the main issues in the original.

Deconstruct, Reconstruct Web Report

Cody Lockery s3493469
