Gender Pay Gap in the United States

Introduction

Women make up nearly half of the workforce in the United States. However, they are widely perceived to be paid less than men for doing the same job. The so-called gender pay gap has been a thorny issue for policymakers, and there are ongoing debates about how to close it. The first step toward finding solutions is to establish the facts about the gap. The goal of this project is to study variations in the gender pay gap in the U.S. using historical data. In particular, I plan to examine how the pay gap varies over time and across age groups and industries/occupations.

Data for this project are from the Bureau of Labor Statistics (BLS) and the U.S. Census Bureau. The BLS data include women’s historical earnings and employment status and the Census data provide information on occupations and earnings. I will use exploratory data analysis such as summary statistics and visualizations to identify patterns underlying the data. For example, has the gap narrowed in recent years? Which age groups or industries have the smallest/largest gap? Statistical tests will also be performed to see if those changes/differences are significant.

By documenting the variations in the gender pay gap, we can provide policymakers and employers with insights on how to identify and narrow the gap. Our findings can also help female job seekers select an industry/occupation with fairer pay and compensation.

Packages Required

In this project, I use the following R packages (the list may change as the project progresses):

# Load R packages
library(tidyverse)  # Transform data to tidy format, manipulate and visualize data
library(readr)      # Read csv files into R
library(DT)         # Display tables on HTML pages
library(knitr)      # Create HTML table
library(psych)      # Produce summary stats in easy to read data.frame

Data Preparation

In this section, data sources and cleaning steps are described.

Data Sources

The datasets for this project are from two sources:

Bureau of Labor Statistics (BLS): The BLS is a governmental statistical agency that collects data in labor economics. For this project, I use two historical datasets, women’s earnings and employment status. Neither dataset contains missing values.
- Earnings from 1979 to 2011 (get data here): This dataset provides the women’s-to-men’s earnings ratio, using median weekly earnings of full-time workers. The original dataset includes three variables: year, age group, and women’s wage as percentage of men’s wage.
- Employment status from 1968 to 2016 (get data here): This dataset provides the percentage of employed people by full- and part-time status and gender. The original data include the following variables: year, percentage of all employed people working full-time or part-time, percentage of employed men working full-time or part-time, and percentage of employed women working full-time or part-time.
Census Bureau (get data here): This dataset provides information on specific occupations and earnings from 2013 to 2016 and includes the following variables: year, occupation, major job category, minor job category, the estimated number of total, male and female full-time workers, the percentage of female workers in each occupation, the estimated median earnings for total, male and female full-time workers, and the female wages as percentage of male wages. In the original census data, the median earnings are estimated from samples in the Census study. If the sample size is less than 100, the female wages as percentage of male wages are not calculated and are coded as missing.

Data Cleaning

Three datasets are imported into R, and then cleaned using the same procedure. First, we examine the structure of the dataset and check if there are any missing values. Second, we check the summary statistics for each variable and look for any outliers.
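The two checks described above can be sketched as follows. This is a minimal illustration in base R on a hypothetical toy data frame (`toy` and its values are made up and are not taken from the project files); the same calls could be applied to each imported dataset.

```r
# Hypothetical toy data; the values are invented for illustration only
toy <- data.frame(
  year = c(2013, 2014, 2015, 2016),
  occupation = c("Job A", "Job B", "Job C", "Job D"),
  wage_percent_of_male = c(85.2, NA, 90.1, 78.4)
)

# Step 1: examine the structure and count missing values per column
str(toy)
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Step 2: summary statistics (min, quartiles, max) as a first screen for outliers
print(summary(toy$wage_percent_of_male))
```

A column with a nonzero entry in `missing_counts` needs a decision (drop or keep), and a minimum or maximum far from the quartiles is a candidate outlier worth inspecting by hand.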

The first dataset (earnings_female) is clean and complete, and there are no outliers. I changed the variable name “Year” to lowercase so that it is consistent with the other variables and can be used as a key variable for merging. I also renamed “group” to “age_group” and “percent” to “wage_percent_of_male” to make the names more informative.

# Import data and update variable names for the "earnings_female" dataset
earnings_female <- read_csv("earnings_female.csv")
names(earnings_female) <- c("year","age_group","wage_percent_of_male")

The second dataset (employed_gender) is also clean and complete, and no outliers are identified. Similarly, I changed the variable names for “total_full_time” and “total_part_time” to “full_time_total” and “part_time_total” respectively, to be consistent with other columns.

# Import data and update variable names for the "employed_gender" dataset
employed_gender <- read_csv("employed_gender.csv")
names(employed_gender) <- c("year", "full_time_total","part_time_total","full_time_female","part_time_female","full_time_male", "part_time_male")

The last dataset (jobs_gender) contains some missing values for occupations with a sample size smaller than 100. Because small samples can lead to unreliable estimates of earnings, the female-to-male wage percentage for those occupations is coded as missing (‘NA’). I removed these observations from the dataset for the analysis.

# Import the "jobs_gender" dataset and drop rows with a missing wage ratio
jobs_gender <- read_csv("jobs_gender.csv") %>% 
  filter(!is.na(wage_percent_of_male))

Data Preview

Here is a preview of the three clean datasets.

head(earnings_female,100) %>%
  datatable(caption = "Table 1 Women's Earnings, 1979-2011")

head(employed_gender,100) %>%
  datatable(caption = "Table 2 Percentage of Employed People Working Full-time and Part-time, 1968-2016")

head(jobs_gender,100) %>%
  datatable(caption = "Table 3 Full-time Workers' Earnings by Occupation")

Data Summary

The following table shows the detailed information for variables in three datasets, including variable name, type, description, and summary statistics (for numeric variables only).

# Add variable description
var_des1 <- c("Year", "Age group", "Female wage as percent of male wage")
var_des2 <- c("Year", "Percent of total employed people working full time", "Percent of total employed people working part time", "Percent of female working full time","Percent of female working part time", "Percent of male working full time", "Percent of male working part time")
var_des3 <- c("Year", "Job name", "Major job category", "Minor job category", "Estimated total number of full-time workers", "Estimated number of male full-time workers", "Estimated number of female full-time workers", "Percent of females for specific job", "Total estimated median earnings for full-time workers", "Estimated median earnings for male full-time workers", "Estimated median earnings for female full-time workers", "Female wage as percent of male wage")

options(scipen = 999)

# Create summary stats in table format
# Dataset 1: earnings_female
var_sum1 <- earnings_female %>%
    select(year, wage_percent_of_male) %>%
    psych::describe() %>%
    as_tibble(rownames = "rowname")  %>%
    select(variable = rowname, min, median, max, mean, sd)

table_sum1 <- data.frame(dataset = "Earnings_female",
                       variable = names(earnings_female),
                       class = sapply(earnings_female, typeof),
                       description = var_des1,
                       row.names = NULL) %>% 
  left_join(var_sum1, by = "variable") 

# Dataset 2: employed_gender
var_sum2 <- employed_gender %>%
    psych::describe() %>%
    as_tibble(rownames = "rowname")  %>%
    select(variable = rowname, min, median, max, mean, sd)

table_sum2 <- data.frame(dataset = "Employed_gender",
                        variable = names(employed_gender),
                        class = sapply(employed_gender, typeof),
                        description = var_des2,
                        row.names = NULL) %>% 
   left_join(var_sum2, by = "variable") 

# Dataset 3: jobs_gender
var_sum3 <- jobs_gender %>%
    select(-(occupation:minor_category)) %>% 
    psych::describe() %>%
    as_tibble(rownames = "rowname")  %>%
    select(variable = rowname, min, median, max, mean, sd)

table_sum3 <- data.frame(dataset = "Jobs_gender",
                         variable = names(jobs_gender),
                         class = sapply(jobs_gender, typeof),
                         description = var_des3,
                         row.names = NULL) %>% 
   left_join(var_sum3, by = "variable") 

# Merge three tables  
rbind(table_sum1, table_sum2, table_sum3) %>% 
  mutate_if(is.numeric, round, digits = 1) %>% 
  kable()

dataset	variable	class	description	min	median	max	mean	sd
Earnings_female	year	double	Year	1979.0	1995.0	2011.0	1995.0	9.5
Earnings_female	age_group	character	Age group	NA	NA	NA	NA	NA
Earnings_female	wage_percent_of_male	double	Female wage as percent of male wage	56.8	75.5	95.4	76.9	10.4
Employed_gender	year	double	Year	1968.0	1992.0	2016.0	1992.0	14.3
Employed_gender	full_time_total	double	Percent of total employed people working full time	80.3	82.6	86.0	82.6	1.2
Employed_gender	part_time_total	double	Percent of total employed people working part time	14.0	17.4	19.7	17.4	1.2
Employed_gender	full_time_female	double	Percent of female working full time	71.9	73.9	75.4	73.9	1.0
Employed_gender	part_time_female	double	Percent of female working part time	24.6	26.1	28.1	26.1	1.0
Employed_gender	full_time_male	double	Percent of male working full time	86.6	89.5	92.2	89.5	1.4
Employed_gender	part_time_male	double	Percent of male working part time	7.8	10.5	13.4	10.5	1.4
Jobs_gender	year	double	Year	2013.0	2014.5	2016.0	2014.5	1.1
Jobs_gender	occupation	character	Job name	NA	NA	NA	NA	NA
Jobs_gender	major_category	character	Major job category	NA	NA	NA	NA	NA
Jobs_gender	minor_category	character	Minor job category	NA	NA	NA	NA	NA
Jobs_gender	total_workers	double	Estimated total number of full-time workers	11383.0	131104.0	3758629.0	309739.0	451172.9
Jobs_gender	workers_male	double	Estimated number of male full-time workers	5360.0	63437.5	2570385.0	170211.2	285446.1
Jobs_gender	workers_female	double	Estimated number of female full-time workers	1333.0	49108.5	2290818.0	139527.8	261826.0
Jobs_gender	percent_female	double	Percent of females for specific job	1.2	46.9	98.0	45.8	24.6
Jobs_gender	total_earnings	double	Total estimated median earnings for full-time workers	17266.0	46459.5	201542.0	50968.2	24567.6
Jobs_gender	total_earnings_male	double	Estimated median earnings for male full-time workers	17302.0	50250.5	231420.0	55456.6	26728.1
Jobs_gender	total_earnings_female	double	Estimated median earnings for female full-time workers	16771.0	41753.0	166388.0	46102.7	21620.1
Jobs_gender	wage_percent_of_male	double	Female wage as percent of male wage	50.9	85.2	117.4	84.0	9.4

Exploratory Data Analysis

In this project, we study the variations in the gender pay gap. I will examine how the gap varies over time and across age groups and occupations. Specifically, the following research questions are investigated.

Has the gender pay gap narrowed in recent years?
Does the gender pay gap vary by age groups?
Does the employment status have an effect on the gender pay gap?
How does the gender pay gap vary by industries/occupations?

In the following section, I will describe the proposed analysis for each question.

1. Has the gender pay gap narrowed in recent years?

Using the first dataset (earnings_female), select the “total, 16 years and older” group, and create a time-series line plot (“year” as x-axis and “wage_percent_of_male” as y-axis) to see if there is any time trend in the women’s-to-men’s earnings ratio. If the gender pay gap is narrowing, the ratio should increase over time. We can also formally estimate and test the significance of the time trend effect.
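One way to estimate and test the time trend is an ordinary linear regression of the earnings ratio on year. The sketch below is only an illustration under stated assumptions: `toy_total` is a hypothetical data frame standing in for the “total, 16 years and older” rows, and its values are made up rather than taken from the BLS data.

```r
# Hypothetical toy series; these ratios are invented, not BLS figures
toy_total <- data.frame(
  year = 2001:2010,
  wage_percent_of_male = c(76.0, 77.1, 78.3, 78.9, 80.2,
                           80.8, 81.5, 82.9, 83.4, 84.6)
)

# Linear time trend: ratio = intercept + slope * year
trend_fit <- lm(wage_percent_of_male ~ year, data = toy_total)

# Slope = average change in the ratio per year; p-value tests slope = 0
slope <- coef(trend_fit)[["year"]]
p_value <- summary(trend_fit)$coefficients["year", "Pr(>|t|)"]

print(slope)
print(p_value)
```

A positive and significant slope would be consistent with a narrowing gap. Note that a plain regression like this ignores autocorrelation in yearly data, so the p-value should be read with some caution.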

2. Does the gender pay gap vary by age groups?

Using the first dataset (earnings_female), select all the age groups except for the “total, 16 years and older” group. Calculate and compare the descriptive statistics by age group. Use line plots (again with “year” as x-axis and “wage_percent_of_male” as y-axis), broken down by age group, to show the patterns for different groups. We can check which age group has the smallest/largest gap on average. Also, analysis of variance (ANOVA) will be conducted to test whether the differences across age groups are significant.
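The comparison by age group could look like the following sketch. It assumes a one-way ANOVA with age group as the only factor; `toy_age`, its group labels, and its values are hypothetical and serve only to show the calls.

```r
# Hypothetical toy data: three age groups, four yearly ratios each (made up)
toy_age <- data.frame(
  age_group = rep(c("20-24 years", "35-44 years", "55-64 years"), each = 4),
  wage_percent_of_male = c(90.1, 91.3, 92.0, 93.2,
                           74.5, 75.8, 76.9, 77.4,
                           68.2, 70.1, 71.5, 72.3)
)

# Descriptive step: mean earnings ratio per age group
group_means <- aggregate(wage_percent_of_male ~ age_group,
                         data = toy_age, FUN = mean)
print(group_means)

# One-way ANOVA: do the group means differ more than chance would explain?
anova_fit <- aov(wage_percent_of_male ~ age_group, data = toy_age)
anova_table <- summary(anova_fit)[[1]]
p_value <- anova_table[["Pr(>F)"]][1]
print(p_value)
```

ANOVA only indicates that at least one group differs; a follow-up such as `TukeyHSD(anova_fit)` would be needed to see which pairs of age groups differ.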

3. Does the employment status have an effect on the gender pay gap?

First, I will use a line plot (year on the x-axis, percentage of men or women working full time on the y-axis) to see whether the share of full-time workers differs between men and women.

Second, I will study whether employment status has an effect on the gender pay gap. A scatter plot of the male-female difference in employment status against female wage as percent of male wage will be used to answer this question. For this analysis, the “employed_gender” dataset needs to be merged with the first dataset, “earnings_female”, by year.
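The merge step can be sketched as below. This is an assumption-laden illustration: the two toy data frames and their values are hypothetical, the earnings data are assumed to be already restricted to one row per year (for example, the total group), and “difference in employment status” is interpreted here as the male minus female share of full-time workers (`full_time_gap` is a name introduced for this sketch).

```r
# Hypothetical toy data; all values are made up for illustration
toy_earnings <- data.frame(
  year = c(2008, 2009, 2010, 2011),
  wage_percent_of_male = c(79.9, 80.2, 81.2, 82.2)
)
toy_employed <- data.frame(
  year = c(2009, 2010, 2011, 2012),
  full_time_female = c(73.5, 73.4, 73.5, 73.6),
  full_time_male = c(86.8, 86.6, 86.9, 87.3)
)

# Inner join on year: only years present in both datasets are kept
merged <- merge(toy_earnings, toy_employed, by = "year")

# Assumed measure of the employment-status difference (percentage points)
merged$full_time_gap <- merged$full_time_male - merged$full_time_female

print(merged)
```

The resulting `full_time_gap` and `wage_percent_of_male` columns are the two axes of the scatter plot. Because the two sources cover different year ranges, the inner join silently drops non-overlapping years, so it is worth checking `nrow(merged)` after the merge.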

4. How does the gender gap vary by industries/occupations?

I will use the last dataset (jobs_gender) for this analysis. First, since this dataset covers multiple years (2013-2016) but this question focuses on industries/occupations, I will aggregate the numbers across years before calculating the summary statistics by job category. Boxplots will also be used to show the distribution of the pay gap (i.e., female wage as percent of male wage) by major and minor job categories.
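The two-step aggregation could be sketched as follows. The aggregation rule is an assumption: here each occupation's ratio is simply averaged over years, and the median is then taken within each major category. `toy_jobs` and its values are hypothetical.

```r
# Hypothetical toy data: three occupations observed in two years (made up)
toy_jobs <- data.frame(
  year = rep(c(2013, 2014), each = 3),
  occupation = rep(c("Job A", "Job B", "Job C"), times = 2),
  major_category = rep(c("Cat X", "Cat X", "Cat Y"), times = 2),
  wage_percent_of_male = c(80, 90, 70, 82, 92, 74)
)

# Step 1: collapse years, giving one averaged ratio per occupation
by_occupation <- aggregate(wage_percent_of_male ~ occupation + major_category,
                           data = toy_jobs, FUN = mean)

# Step 2: summarize the occupation-level ratios within each major category
by_category <- aggregate(wage_percent_of_male ~ major_category,
                         data = by_occupation, FUN = median)

print(by_occupation)
print(by_category)
```

Averaging the yearly ratios is only one option. An alternative would be to average male and female earnings across years first and then recompute the ratio, which can give slightly different numbers.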

Second, I want to see whether the pay gap tends to be larger for male-dominated occupations and smaller for female-dominated occupations. We can use the percent of female workers among all workers to determine which occupations are dominated by women or men. Summary statistics and boxplots will be used to support the discussion.
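A possible way to classify occupations is to bin the percent of female workers. The cutoffs below (40% and 60%) are hypothetical choices made for this sketch and are not prescribed by the data; `toy_occ` and its values are likewise invented.

```r
# Hypothetical toy data; values are made up for illustration
toy_occ <- data.frame(
  occupation = c("Job A", "Job B", "Job C", "Job D", "Job E", "Job F"),
  percent_female = c(10, 25, 45, 55, 75, 90),
  wage_percent_of_male = c(78, 82, 85, 88, 86, 91)
)

# Assumed cutoffs: at most 40% female = male-dominated,
# above 60% female = female-dominated, anything in between = mixed
toy_occ$dominance <- cut(
  toy_occ$percent_female,
  breaks = c(0, 40, 60, 100),
  labels = c("male-dominated", "mixed", "female-dominated"),
  include.lowest = TRUE
)

# Mean female-to-male wage percentage within each group
gap_by_dominance <- aggregate(wage_percent_of_male ~ dominance,
                              data = toy_occ, FUN = mean)
print(gap_by_dominance)
```

The `dominance` factor can then serve as the grouping variable for the boxplots. Results may be sensitive to where the cutoffs are placed, so it would be sensible to try a few alternatives.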

I will use ggplot2 to create visualizations, and I hope to learn how to create interactive plots for the final report. For example, for the last research question, I plan to create interactive boxplots by job category, so that readers can select a major job category and see a corresponding boxplot by minor job category within the selected major category.

Given the limited number of variables in the dataset, I will not use machine learning techniques for this project.