Women make up nearly half of the workforce in the United States. However, they are generally viewed as being paid less than men for doing the same job. The so-called gender pay gap has been a thorny issue for policymakers, and there are ongoing debates about how to close it. The first step toward finding solutions is to understand the facts about the gap. The goal of this project is to study the variations of the gender pay gap in the U.S. using historical data. In particular, I plan to examine how the pay gap varies over time and across age groups and industries/occupations.
Data for this project are from the Bureau of Labor Statistics (BLS) and the U.S. Census Bureau. The BLS data include women’s historical earnings and employment status and the Census data provide information on occupations and earnings. I will use exploratory data analysis such as summary statistics and visualizations to identify patterns underlying the data. For example, has the gap narrowed in recent years? Which age groups or industries have the smallest/largest gap? Statistical tests will also be performed to see if those changes/differences are significant.
By documenting the variations in the gender pay gap, we can provide policymakers and employers with insights on how to identify and narrow it. Our findings can also help female job seekers choose an industry/occupation with fairer salary and compensation.
In this project, I use the following R packages (the list may change as the project progresses):
# Load R packages
library(tidyverse) # Transform data to tidy format, manipulate and visualize data
library(readr) # Read csv files into R
library(DT) # Display tables on HTML pages
library(knitr) # Create HTML table
library(psych) # Produce summary stats in easy to read data.frame
In this section, data sources and cleaning steps are described.
The datasets for this project are from two sources:
Earnings from 1979 to 2011: This dataset provides the women’s-to-men’s earnings ratio, using median weekly earnings of full-time workers. The original dataset includes three variables: year, age group, and women’s wage as a percentage of men’s wage.
Employment status from 1968 to 2016: This dataset provides the percentage of employed people, by full- and part-time status and gender. The original dataset includes the following variables: year, percentage of employed people working full-time or part-time, percentage of men working full-time or part-time, and percentage of women working full-time or part-time.
Three datasets are imported into R, and then cleaned using the same procedure. First, we examine the structure of the dataset and check if there are any missing values. Second, we check the summary statistics for each variable and look for any outliers.
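The two cleaning steps above can be sketched roughly as follows. This is a minimal illustration on a small made-up table, not the actual data; the helper `flag_outliers()` and the 1.5 × IQR rule are assumptions chosen for the example, since the text does not say which outlier criterion is used.

```r
library(dplyr)

# Toy data standing in for one imported dataset (values are made up)
toy <- tibble(
  year = c(2013, 2014, 2015, 2016, 2016),
  wage_percent_of_male = c(80.1, 82.4, NA, 81.0, 140.0)
)

# Step 1: inspect the structure and count missing values per column
glimpse(toy)
missing_counts <- colSums(is.na(toy))

# Step 2: summary statistics, then a simple outlier screen.
# flag_outliers() is a hypothetical helper using the 1.5 * IQR rule.
summary(toy)
flag_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  fence <- 1.5 * (q[2] - q[1])
  !is.na(x) & (x < q[1] - fence | x > q[2] + fence)
}
outlier_rows <- toy %>%
  filter(flag_outliers(wage_percent_of_male))
```

Running the same checks on each of the three datasets keeps the procedure consistent; flagged rows would still need a manual look before deciding whether they are errors or genuine extremes.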
The first dataset (earnings_female) is clean and complete, and there are no outliers. I changed the variable name “Year” to lowercase so that it is consistent with the rest of the variables and can be used as a key variable for merging. I also changed the variable name “group” to “age_group”, and “percent” to “wage_percent_of_male”, to make them more informative.
# Import data and update variable names for the "earnings_female" dataset
earnings_female <- read_csv("earnings_female.csv")
names(earnings_female) <- c("year","age_group","wage_percent_of_male")
The second dataset (employed_gender) is also clean and complete, and no outliers are identified. Similarly, I changed the variable names for “total_full_time” and “total_part_time” to “full_time_total” and “part_time_total” respectively, to be consistent with other columns.
# Import data and update variable names for the "employed_gender" dataset
employed_gender <- read_csv("employed_gender.csv")
names(employed_gender) <- c("year", "full_time_total","part_time_total","full_time_female","part_time_female","full_time_male", "part_time_male")
The last dataset (jobs_gender) contains some missing values for occupations with a sample size smaller than 100. As small samples can lead to unreliable estimates of earnings, observations for those occupations are coded as missing (‘NA’). I removed them from the dataset for the analysis.
# Import the "jobs_gender" dataset and drop rows with a missing earnings ratio
jobs_gender <- read_csv("jobs_gender.csv") %>%
filter(!is.na(wage_percent_of_male))
Here is a preview of the three clean datasets.
head(earnings_female,100) %>%
datatable(caption = "Table 1 Women's Earnings, 1979-2011")
head(employed_gender,100) %>%
datatable(caption = "Table 2 Percentage of Employed People Working Full-time and Part-time, 1968-2016")
head(jobs_gender,100) %>%
datatable(caption = "Table 3 Full-time Workers' Earnings by Occupation")
The following table shows detailed information for the variables in the three datasets, including variable name, type, description, and summary statistics (for numeric variables only).
# Add variable description
var_des1 <- c("Year", "Age group", "Female wage as percent of male wage")
var_des2 <- c("Year", "Percent of total employed people working full time", "Percent of total employed people working part time", "Percent of female working full time","Percent of female working part time", "Percent of male working full time", "Percent of male working part time")
var_des3 <- c("Year", "Job name", "Major job category", "Minor job category", "Estimated total number of full-time workers", "Estimated number of male full-time workers", "Estimated number of female full-time workers", "Percent of females for specific job", "Total estimated median earnings for full-time workers", "Estimated median earnings for male full-time workers", "Estimated median earnings for female full-time workers", "Female wage as percent of male wage")
options(scipen = 999)
# Create summary stats in table format
# Dataset 1: earnings_female
var_sum1 <- earnings_female %>%
select(year, wage_percent_of_male) %>%
psych::describe() %>%
as_tibble(rownames = "rowname") %>%
select(variable = rowname, min, median, max, mean, sd)
table_sum1 <- data.frame(dataset = "Earnings_female",
variable = names(earnings_female),
class = sapply(earnings_female, typeof),
description = var_des1,
row.names = NULL) %>%
left_join(var_sum1, by = "variable")
# Dataset 2: employed_gender
var_sum2 <- employed_gender %>%
psych::describe() %>%
as_tibble(rownames = "rowname") %>%
select(variable = rowname, min, median, max, mean, sd)
table_sum2 <- data.frame(dataset = "Employed_gender",
variable = names(employed_gender),
class = sapply(employed_gender, typeof),
description = var_des2,
row.names = NULL) %>%
left_join(var_sum2, by = "variable")
# Dataset 3: jobs_gender
var_sum3 <- jobs_gender %>%
select(-(occupation:minor_category)) %>%
psych::describe() %>%
as_tibble(rownames = "rowname") %>%
select(variable = rowname, min, median, max, mean, sd)
table_sum3 <- data.frame(dataset = "Jobs_gender",
variable = names(jobs_gender),
class = sapply(jobs_gender, typeof),
description = var_des3,
row.names = NULL) %>%
left_join(var_sum3, by = "variable")
# Merge three tables
rbind(table_sum1, table_sum2, table_sum3) %>%
mutate(across(where(is.numeric), ~ round(.x, digits = 1))) %>%
kable()
| dataset | variable | class | description | min | median | max | mean | sd |
|---|---|---|---|---|---|---|---|---|
| Earnings_female | year | double | Year | 1979.0 | 1995.0 | 2011.0 | 1995.0 | 9.5 |
| Earnings_female | age_group | character | Age group | NA | NA | NA | NA | NA |
| Earnings_female | wage_percent_of_male | double | Female wage as percent of male wage | 56.8 | 75.5 | 95.4 | 76.9 | 10.4 |
| Employed_gender | year | double | Year | 1968.0 | 1992.0 | 2016.0 | 1992.0 | 14.3 |
| Employed_gender | full_time_total | double | Percent of total employed people working full time | 80.3 | 82.6 | 86.0 | 82.6 | 1.2 |
| Employed_gender | part_time_total | double | Percent of total employed people working part time | 14.0 | 17.4 | 19.7 | 17.4 | 1.2 |
| Employed_gender | full_time_female | double | Percent of female working full time | 71.9 | 73.9 | 75.4 | 73.9 | 1.0 |
| Employed_gender | part_time_female | double | Percent of female working part time | 24.6 | 26.1 | 28.1 | 26.1 | 1.0 |
| Employed_gender | full_time_male | double | Percent of male working full time | 86.6 | 89.5 | 92.2 | 89.5 | 1.4 |
| Employed_gender | part_time_male | double | Percent of male working part time | 7.8 | 10.5 | 13.4 | 10.5 | 1.4 |
| Jobs_gender | year | double | Year | 2013.0 | 2014.5 | 2016.0 | 2014.5 | 1.1 |
| Jobs_gender | occupation | character | Job name | NA | NA | NA | NA | NA |
| Jobs_gender | major_category | character | Major job category | NA | NA | NA | NA | NA |
| Jobs_gender | minor_category | character | Minor job category | NA | NA | NA | NA | NA |
| Jobs_gender | total_workers | double | Estimated total number of full-time workers | 11383.0 | 131104.0 | 3758629.0 | 309739.0 | 451172.9 |
| Jobs_gender | workers_male | double | Estimated number of male full-time workers | 5360.0 | 63437.5 | 2570385.0 | 170211.2 | 285446.1 |
| Jobs_gender | workers_female | double | Estimated number of female full-time workers | 1333.0 | 49108.5 | 2290818.0 | 139527.8 | 261826.0 |
| Jobs_gender | percent_female | double | Percent of females for specific job | 1.2 | 46.9 | 98.0 | 45.8 | 24.6 |
| Jobs_gender | total_earnings | double | Total estimated median earnings for full-time workers | 17266.0 | 46459.5 | 201542.0 | 50968.2 | 24567.6 |
| Jobs_gender | total_earnings_male | double | Estimated median earnings for male full-time workers | 17302.0 | 50250.5 | 231420.0 | 55456.6 | 26728.1 |
| Jobs_gender | total_earnings_female | double | Estimated median earnings for female full-time workers | 16771.0 | 41753.0 | 166388.0 | 46102.7 | 21620.1 |
| Jobs_gender | wage_percent_of_male | double | Female wage as percent of male wage | 50.9 | 85.2 | 117.4 | 84.0 | 9.4 |
This project studies the variations in the gender pay gap: how the gap varies over time and across age groups and occupations. Four research questions are investigated.
Each question below is followed by a description of the proposed analysis.
1. Has the gender pay gap narrowed in recent years?
Using the first dataset (earnings_female), select the “total, 16 years and older” group, and create a time-series line plot (“year” as x-axis and “wage_percent_of_male” as y-axis) to see if there is any time trend in the women’s-to-men’s earnings ratio. If the gender pay gap is narrowing, the ratio should increase over time. We can also formally estimate and test the significance of the time trend effect.
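A minimal sketch of this step is shown below, using made-up ratios rather than the BLS figures. The exact spelling of the group label (`total_label`) is an assumption and should be matched to the values in `age_group`; a simple linear regression on `year` is one possible way to test the trend, not necessarily the final choice.

```r
library(dplyr)
library(ggplot2)

# Made-up ratios for illustration only (not BLS figures)
total_label <- "Total, 16 years and older"  # assumed spelling of the group label
toy_earnings <- tibble(
  year = rep(2001:2006, times = 2),
  age_group = rep(c(total_label, "25-34 years"), each = 6),
  wage_percent_of_male = c(76, 77, 79, 80, 81, 81, 85, 86, 88, 88, 89, 90)
)

# Keep only the all-ages series
total_group <- toy_earnings %>%
  filter(age_group == total_label)

# Line plot of the earnings ratio over time
trend_plot <- ggplot(total_group, aes(x = year, y = wage_percent_of_male)) +
  geom_line() +
  labs(x = "Year", y = "Women's earnings as % of men's")

# Linear time trend: a positive, significant slope suggests a narrowing gap
trend_fit <- lm(wage_percent_of_male ~ year, data = total_group)
trend_slope <- coef(trend_fit)[["year"]]
summary(trend_fit)$coefficients
```

Because yearly observations are likely to be serially correlated, the p-value from a plain `lm()` fit should be read with some caution.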
2. Does the gender pay gap vary by age group?
Using the first dataset (earnings_female), select all the age groups except for the “total, 16 years and older” group. Calculate and compare the descriptive statistics by age group. Use line plots (again with “year” as x-axis and “wage_percent_of_male” as y-axis), broken down by age group, to show the patterns for different groups. We can check which age group has the smallest/largest gap on average. Analysis of variance (ANOVA) will also be conducted to test whether the differences across age groups are significant.
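As a rough illustration of the group comparison, the sketch below computes descriptive statistics by age group and runs a one-way ANOVA with base R's `aov()`. The data and the age-group labels are made up for the example.

```r
library(dplyr)

# Made-up ratios for three hypothetical age groups (four years each)
toy_age <- tibble(
  age_group = factor(rep(c("16-24 years", "25-34 years", "35-44 years"), each = 4)),
  wage_percent_of_male = c(90, 92, 91, 93, 85, 86, 88, 87, 74, 76, 75, 77)
)

# Descriptive statistics by age group
age_summary <- toy_age %>%
  group_by(age_group) %>%
  summarise(
    mean_ratio = mean(wage_percent_of_male),
    sd_ratio = sd(wage_percent_of_male)
  )

# One-way ANOVA: do mean ratios differ across age groups?
anova_fit <- aov(wage_percent_of_male ~ age_group, data = toy_age)
anova_p <- summary(anova_fit)[[1]][["Pr(>F)"]][1]
summary(anova_fit)
```

A small p-value only indicates that at least one group differs; it does not say which one, so a follow-up pairwise comparison may be needed.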
3. Does the employment status have an effect on the gender pay gap?
First, I will use a line plot (year as x-axis, percentage of men or women working full time as y-axis) to see if there is any difference in the percentage of full-time workers between men and women.
Second, I will study whether employment status has an effect on the gender pay gap. A scatter plot of the difference in employment status against female wage as percent of male wage will be used to answer this question. For this analysis, the “employed_gender” dataset needs to be merged with the first dataset, “earnings_female”, by year.
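The merge could look roughly like the sketch below, again on made-up numbers. Since earnings_female has one row per year and age group, the sketch first keeps a single series per year so that the join is one-to-one; the group label (`total_label`) and the derived column `full_time_gap` are assumptions introduced for the example.

```r
library(dplyr)
library(ggplot2)

total_label <- "Total, 16 years and older"  # assumed spelling of the group label

# Made-up values for illustration only
toy_earnings <- tibble(
  year = 2001:2005,
  age_group = total_label,
  wage_percent_of_male = c(76, 78, 79, 80, 81)
)
toy_employed <- tibble(
  year = 2000:2005,
  full_time_female = c(73, 73, 74, 74, 75, 75),
  full_time_male = c(90, 90, 90, 89, 89, 89)
)

# Merge by year and compute the male-female difference in full-time share
merged <- toy_earnings %>%
  filter(age_group == total_label) %>%
  inner_join(toy_employed, by = "year") %>%
  mutate(full_time_gap = full_time_male - full_time_female)

# Scatter plot and a simple correlation test
gap_plot <- ggplot(merged, aes(x = full_time_gap, y = wage_percent_of_male)) +
  geom_point()
gap_cor <- cor.test(merged$full_time_gap, merged$wage_percent_of_male)
```

An inner join keeps only the years present in both datasets, which here means the overlap of the two time ranges.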
4. How does the gender pay gap vary by industry/occupation?
I will use the last dataset (jobs_gender) for this analysis. First, since this dataset covers multiple years (2013-2016) but this question focuses on industries/occupations, I will aggregate the numbers across years before calculating the summary statistics by job category. Boxplots will also be used to show the distribution of the pay gap (i.e., female wage as percent of male wage) by major and minor job categories.
Second, I want to see whether the pay gap tends to be larger for men-dominated occupations and smaller for women-dominated occupations. We can use the percentage of female workers among all workers as a measure to determine whether an occupation is dominated by women or men. Summary statistics and boxplots will be used to support the discussion.
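One way these two steps might be carried out is sketched below on a made-up table. Averaging the yearly values per occupation is only one possible way to aggregate across years, and the 40%/60% cutoffs used to label an occupation as men- or women-dominated are assumptions for illustration, not thresholds taken from the data source.

```r
library(dplyr)
library(ggplot2)

# Made-up occupations and values for illustration only
toy_jobs <- tibble(
  year = rep(2013:2014, each = 3),
  occupation = rep(c("Occupation A", "Occupation B", "Occupation C"), times = 2),
  major_category = rep(c("Category X", "Category X", "Category Y"), times = 2),
  percent_female = c(20, 50, 80, 22, 52, 82),
  wage_percent_of_male = c(78, 85, 92, 80, 87, 94)
)

# Aggregate across years, then label each occupation by its share of women
by_occupation <- toy_jobs %>%
  group_by(major_category, occupation) %>%
  summarise(
    percent_female = mean(percent_female),
    wage_percent_of_male = mean(wage_percent_of_male),
    .groups = "drop"
  ) %>%
  mutate(dominance = case_when(
    percent_female <= 40 ~ "men-dominated",    # assumed cutoff
    percent_female >= 60 ~ "women-dominated",  # assumed cutoff
    TRUE ~ "mixed"
  ))

# Distribution of the earnings ratio by dominance group
dominance_plot <- ggplot(by_occupation, aes(x = dominance, y = wage_percent_of_male)) +
  geom_boxplot()
```

Replacing `dominance` with `major_category` on the x-axis would give the boxplots by job category described in the first step.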
I will use ggplot2 to create visualizations, and hope to learn how to create interactive plots for the final report. For example, for the last research question, I plan to create interactive boxplots by job category, so that readers can select a major job category and see a corresponding boxplot by minor job category within the selected major category.
Given the limited number of variables in the dataset, I will not use machine learning techniques for this project.