Problem Statement
Gender bias in earnings is an ongoing issue. In 2019, the American Association of University Women (AAUW) stated that women who work full-time earn about 80 percent of what their male counterparts make. This report identified the following problems which we attempt to investigate in our study:
Data and Methodology
For this study, we would use the Women in Workforce data which is a historical data about womens’ earnings and employment status, specific occupation and earnings from 2013-2016, compiled from the Bureau of Labor Statistics and the Census Bureau. We intend to analyze the data using the following methodologies: Trend analysis, descriptive analysis, data visualization, and applicable statistical tests.
Proposed Analysis
We would investigate the gender pay gap over the years and the changes in the total female salary as a percentage of male salary by age group. We would conduct an exploratory analysis across the different fields/occupation groups to identify the male-dominated vs female-dominated occupation types. Some statistical tests would be employed to determine the differences in pay gap across fields and by age group. Finally, we hope to identify some factors contributing to the differences in wage gap using regression analysis.
Our Contribution
Our results would provide more insights on the gender pay gap across fields, occupations, and age groups. Our results would show the trend in gender pay gap over the years and some of the factors contributing to gender pay gap. We hope that our study would contribute to the discussion on gender pay gap.
We used the following packages:
library(readxl) #used to import Excel files into R
library(tidyverse) #used for data manipulation
library(dplyr) # used for data manipulation
library(DT) # used for displaying R data objects (matrices or data frames) as tables on HTML pages
library(knitr) #used to display an aligned table on the screen
library(kableExtra)#used to build with straightforward formatting options
For this study, we use the Women in Workforce data which is a historical data about womens’ earnings and employment status, specific occupation and earnings from 2013-2016, compiled from the Bureau of Labor Statistics and the Census Bureau. The data was provided in March 2019 as part of the #TidyTuesday project to celebrate the Women’s History month.
The entire data is spread into 3 files: jobs_gender.csv, earnings_female.csv, employed_gender.csv and are described in the next tab.
The three datasets are first imported from csv files into dataframes named jobs_gender, earnings_female,and employed_gender
This dataset contains information on the total number of male and female workers and the total estimated median earnings for all employees at various occupation levels, from 2013-2016. The dataset has 12 variables, with a total of 2,088 data points, across 8 major job categories, 23 minor job categories and 522 occupation types. There are some missing values recorded as “NA” and are taken care of during the data cleaning process.
More information on this dataset can be found here
#Import Dataset 1
jobs_gender <- read.csv("~/BANA/Coursework/BANA 7025 Data wrangling/Mid term assignment/women_in_the_workplace_data/jobs_gender.csv", header = TRUE)
#Check the names of the data columns
colnames(jobs_gender)
## [1] "year" "occupation"
## [3] "major_category" "minor_category"
## [5] "total_workers" "workers_male"
## [7] "workers_female" "percent_female"
## [9] "total_earnings" "total_earnings_male"
## [11] "total_earnings_female" "wage_percent_of_male"
#Check the dimension of the dataset
dim(jobs_gender)
## [1] 2088 12
#Count the number of distinct values/observations
library(dplyr)
jobs_gender %>%
summarise_each(n_distinct, "occupation","minor_category", "major_category")
## occupation minor_category major_category
## 1 522 23 8
#Check the number of missing values per column
colSums(is.na(jobs_gender))
## year occupation major_category
## 0 0 0
## minor_category total_workers workers_male
## 0 0 0
## workers_female percent_female total_earnings
## 0 0 0
## total_earnings_male total_earnings_female wage_percent_of_male
## 4 65 846
This dataset contains the historic information of female salary as a percent of male salary for various age groups, from year 1979 to 2011. The dataset has 3 variables with 264 observations. This dataset has no missing values.
The dataset can be found here
#Imports Dataset 2
earnings_female <- read.csv("~/BANA/Coursework/BANA 7025 Data wrangling/Mid term assignment/women_in_the_workplace_data/earnings_female.csv", header = TRUE)
#Check the names of the data columns
colnames(earnings_female)
## [1] "Year" "group" "percent"
#Check the dimension of the dataset
dim(earnings_female)
## [1] 264 3
#Count the number of missing values per column
colSums(is.na(earnings_female))
## Year group percent
## 0 0 0
This dataset shows the percentage of part-time and full-time employees for each year at the gender level. The dataset has 7 variables with 49 observations each, from year 1968 to 2016. This dataset has no missing values and can be accessed here
#Import Dataset 3
employed_gender <- read.csv("~/BANA/Coursework/BANA 7025 Data wrangling/Mid term assignment/women_in_the_workplace_data/employed_gender.csv", header = TRUE)
#Check the names of the data columns
colnames(employed_gender)
## [1] "year" "total_full_time" "total_part_time"
## [4] "full_time_female" "part_time_female" "full_time_male"
## [7] "part_time_male"
#Check the dimension of the dataset
dim(employed_gender)
## [1] 49 7
#Count the number of missing values per column
colSums(is.na(employed_gender))
## year total_full_time total_part_time full_time_female
## 0 0 0 0
## part_time_female full_time_male part_time_male
## 0 0 0
We investigate the variable types in our dataset to see if they are accurate or need to be changed.The codes and outputs below indicate that all the variable types in all three datasets are correct.
Dataset 1 - jobs_gender
glimpse(jobs_gender)
## Observations: 2,088
## Variables: 12
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013...
## $ occupation <fct> "Chief executives", "General and operati...
## $ major_category <fct> "Management, Business, and Financial", "...
## $ minor_category <fct> Management, Management, Management, Mana...
## $ total_workers <int> 1024259, 977284, 14815, 43015, 754514, 4...
## $ workers_male <int> 782400, 681627, 8375, 17775, 440078, 161...
## $ workers_female <int> 241859, 295657, 6440, 25240, 314436, 280...
## $ percent_female <dbl> 23.6, 30.3, 43.5, 58.7, 41.7, 63.5, 33.6...
## $ total_earnings <int> 120254, 73557, 67155, 61371, 78455, 7411...
## $ total_earnings_male <int> 126142, 81041, 71530, 75190, 91998, 9007...
## $ total_earnings_female <int> 95921, 60759, 65325, 55860, 65040, 66052...
## $ wage_percent_of_male <dbl> 76.04208, 74.97316, 91.32532, 74.29179, ...
Dataset 2 - earnings_female
glimpse(earnings_female)
## Observations: 264
## Variables: 3
## $ Year <int> 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, ...
## $ group <fct> "Total, 16 years and older", "Total, 16 years and olde...
## $ percent <dbl> 62.3, 64.2, 64.4, 65.7, 66.5, 67.6, 68.1, 69.5, 69.8, ...
Dataset 3- employed_gender
glimpse(employed_gender)
## Observations: 49
## Variables: 7
## $ year <int> 1968, 1969, 1970, 1971, 1972, 1973, 1974, 197...
## $ total_full_time <dbl> 86.0, 85.5, 84.8, 84.4, 84.3, 84.4, 84.2, 83....
## $ total_part_time <dbl> 14.0, 14.5, 15.2, 15.6, 15.7, 15.6, 15.8, 16....
## $ full_time_female <dbl> 75.1, 74.9, 73.9, 73.2, 73.1, 73.2, 73.2, 72....
## $ part_time_female <dbl> 24.9, 25.1, 26.1, 26.8, 26.9, 26.8, 26.8, 27....
## $ full_time_male <dbl> 92.2, 91.8, 91.5, 91.2, 91.1, 91.4, 91.2, 90....
## $ part_time_male <dbl> 7.8, 8.2, 8.5, 8.8, 8.9, 8.6, 8.8, 9.4, 9.4, ...
Next stage of the data cleaning process is to look at missing values and make decisions on removal or data imputation, where appropriate. As shown in the Data Import and Description tab, the only data set with some missing values is Dataset 1 - jobs_gender. So, we focus on this dataset.
First, we notice that there are 65 cases with missing values for the total earnings for female workers. Of this 65 cases, there are 19 cases where the total number of female workers in the field or occupation types is zero and the total earnings for female is missing “NA”. We impute these missing values as zero because no female workers implies no total female earnings.
#Impute missing values as zero
jobs_gender$total_earnings_female <- ifelse(jobs_gender$workers_female == 0, 0, jobs_gender$total_earnings_female)
For the remaining 46 observations with missing values for the total female earnings,we observe that the number of female workers is reported, indicating that these values are non-zero and missing. Hence, we remove the entire row from the dataset. We believe that this removal would not affect our analysis and goals for this study because, it’s just about 2 percent of the entire dataset.
#Remove rows with missing values for female earnings
jobs_gender_vector <- complete.cases(jobs_gender[,'total_earnings_female'])
jobs_gender <- jobs_gender[jobs_gender_vector,]
Next, we notice that there are 4 observations where the total earnings for male workers is missing. In 3 of these cases, the number of male workers is zero and the total earnings for male is missing “NA”. A closer look at these three observations revealed that in 2013, 2014, and 2016, there were no male workers who worked as “Nurse Midwives”. Therefore, we imputed the missing total male earnings as zero because no male workers implies no male earnings.
#Impute missing values as zero
jobs_gender$total_earnings_male <- ifelse(jobs_gender$workers_male == 0, 0, jobs_gender$total_earnings_male)
In the fourth case which was in 2015, a total of 53 male workers was reported for Nurse midwives. This is very different from what was reported in 2013, 2014, and 2016. This could mean that the occupation type (Nurse Midwives) had male workers in 2015 or an error occurred during the data reporting process. We decide to remove the entire row associated with this observation not because of the anomaly, but because the male total earnings is missing and reported as “NA”.
#Remove row with missing values for male earnings
jobs_gender_m_vector <- complete.cases(jobs_gender[,'total_earnings_male'])
jobs_gender <- jobs_gender[jobs_gender_m_vector,]
Finally, we notice that the column “wage_percent_of_male” showing the percentage of the total female earnings to total male earnings has a total of 846 missing values. Given that this is the percentage of the total female earnings to the total male earnings, we calculate and impute this variable using the total_earnings_male and total_earning_female columns.
#Calculate and impute results for wage_percent_of_male
jobs_gender$wage_percent_of_male<- ifelse(is.na(jobs_gender$wage_percent_of_male),jobs_gender$total_earnings_female/jobs_gender$total_earnings_male*100,jobs_gender$wage_percent_of_male)
We examine the summary of the datasets for unusual values and outliers. In all three datasets, there are no unusual vlaues like negative values. However, we observe some extreme values.These values are further examined in the next tab.
Dataset 1 - jobs_gender
data.frame(unclass(summary(jobs_gender[-c(2,3,4)])), check.names = FALSE, stringsAsFactors = FALSE)
## year total_workers workers_male workers_female
## X Min. :2013 Min. : 658 Min. : 0 Min. : 0
## X.1 1st Qu.:2014 1st Qu.: 20573 1st Qu.: 11396 1st Qu.: 2818
## X.2 Median :2014 Median : 61840 Median : 33470 Median : 16297
## X.3 Mean :2014 Mean : 200491 Mean : 113967 Mean : 86524
## X.4 3rd Qu.:2015 3rd Qu.: 194802 3rd Qu.: 105461 3rd Qu.: 65250
## X.5 Max. :2016 Max. :3758629 Max. :2570385 Max. :2290818
## percent_female total_earnings total_earnings_male
## X Min. : 0.00 Min. : 17266 Min. : 0
## X.1 1st Qu.: 11.47 1st Qu.: 32329 1st Qu.: 35630
## X.2 Median : 33.59 Median : 44371 Median : 46824
## X.3 Mean : 36.73 Mean : 49757 Mean : 53157
## X.4 3rd Qu.: 58.23 3rd Qu.: 61019 3rd Qu.: 65223
## X.5 Max. :100.00 Max. :201542 Max. :231420
## total_earnings_female wage_percent_of_male
## X Min. : 0 Min. : 0.00
## X.1 1st Qu.: 28600 1st Qu.:76.21
## X.2 Median : 40068 Median :84.68
## X.3 Mean : 44264 Mean : Inf
## X.4 3rd Qu.: 54615 3rd Qu.:91.80
## X.5 Max. :166388 Max. : Inf
Dataset 2 - earnings_female
data.frame(unclass(summary(earnings_female)), check.names = FALSE, stringsAsFactors = FALSE)
## Year group percent
## X Min. :1979 16-19 years:33 Min. :56.80
## X.1 1st Qu.:1987 20-24 years:33 1st Qu.:69.40
## X.2 Median :1995 25-34 years:33 Median :75.50
## X.3 Mean :1995 35-44 years:33 Mean :76.88
## X.4 3rd Qu.:2003 45-54 years:33 3rd Qu.:86.90
## X.5 Max. :2011 55-64 years:33 Max. :95.40
## X.6 <NA> (Other) :66 <NA>
Dataset 3- employed_gender
data.frame(unclass(summary(employed_gender)), check.names = FALSE, stringsAsFactors = FALSE)
## year total_full_time total_part_time full_time_female
## X Min. :1968 Min. :80.30 Min. :14.00 Min. :71.90
## X.1 1st Qu.:1980 1st Qu.:81.80 1st Qu.:16.80 1st Qu.:73.20
## X.2 Median :1992 Median :82.60 Median :17.40 Median :73.90
## X.3 Mean :1992 Mean :82.64 Mean :17.36 Mean :73.86
## X.4 3rd Qu.:2004 3rd Qu.:83.20 3rd Qu.:18.20 3rd Qu.:74.70
## X.5 Max. :2016 Max. :86.00 Max. :19.70 Max. :75.40
## part_time_female full_time_male part_time_male
## X Min. :24.60 Min. :86.60 Min. : 7.80
## X.1 1st Qu.:25.30 1st Qu.:89.00 1st Qu.: 9.60
## X.2 Median :26.10 Median :89.50 Median :10.50
## X.3 Mean :26.14 Mean :89.49 Mean :10.51
## X.4 3rd Qu.:26.80 3rd Qu.:90.40 3rd Qu.:11.00
## X.5 Max. :28.10 Max. :92.20 Max. :13.40
Dataset 1 - jobs_gender
Dataset 2 - earnings_female
Dataset 3- employed_gender
Dataset 1 - jobs_gender
| Variable | Description |
|---|---|
| year | Year |
| occupation | Specific job/career |
| major_category | Broad category of occupation |
| minor_category | Fine category of occupation |
| total_workers | Total estimated full-time workers > 16 years old |
| workers_male | Estimated MALE full-time workers > 16 years old |
| workers_female | Estimated FEMALE full-time workers > 16 years old |
| percent_female | The percent of females for specific occupation |
| total_earnings | Total estimated median earnings for full-time workers > 16 years old |
| total_earnings_male | Estimated MALE median earnings for full-time workers > 16 years old |
| total_earnings_female | Estimated FEMALE median earnings for full-time workers > 16 years old |
| wage_percent_of_male | Female wages as percent of male wages - NA for occupations with small sample size |
Dataset 2 - earnings_female
| Variable | Description |
|---|---|
| Year | Year |
| group | Age group |
| percent | Female salary percent of male salary |
Dataset 3 - employed_gender
| Variable | Description |
|---|---|
| year | Year |
| total_full_time | Percent of total employed people usually working full time |
| total_part_time | Percent of total employed people usually working part time |
| full_time_female | Percent of employed women usually working full time |
| part_time_female | Percent of employed women usually working part time |
| full_time_male | Percent of employed men usually working full time |
| part_time_male | Percent of employed men usually working part time |
To investigate and address the problem statement, we would explore the individual datasets and combine them where appropriate. Our proposed approach is as follows:
Descriptive Analysis
We would provide some descriptive statistics and plots such as bar charts, line charts, and ribbon charts showing the representation of male and female workers across the occupation types, minor and major job groups. We would show the major job groups that are male-dominated vs female-dominated; summary statistics of earnings by gender; and the job types with the widest, narrowest, or no gender pay gap.
Trend Analysis with Interactive Plots
We would examine the gender wage gap over time through a trend analysis. This would show the reported and the change in median earnings for male and female workers from 2013 to 2016. Using the combined dataset, we would explore the percentage change in the number of full-time and part-time workers across both genders, from 1968 to 2016. We would employ bar charts, ribbon charts, and interactive plots to show how the trend has evolved over the years.
Analysis of Variance (ANOVA) Test
We would investigate if there’s an increase in pay gap as women get older. To achieve this, we would create a line chart showing female salary as a percentage of male salary across different age groups, from 1979 to 2011. We would also consider an Analysis of Variance Test (ANOVA) to test the hypothesis that the gender pay gap gets wider as women get older. Is there a difference in the mean earnings of women as a percentage of male earnings across different age groups?
Further Data Slicing and Exploration
The data would be sliced to investigate the claims that women make more than men in the five occupations (wholesale and retail buyers except farm products; police and sheriffs patrol officers; bookkeeping, accounting, and auditing clerks; general office clerks; and data entry keepers) and less in pink-collar jobs. The results would be displayed using tables and some summary statistics. We would further explore our dataset to see if some job types changed from been male-dominated to female-dominated and vice versa. If yes, did the earnings change as a result?
Machine Learning Technique: Linear Regression Analysis
Finally, we would employ the linear regression analysis technique to identify some of the factors that contribute to the differences in wage gap.
Need: More Expertise and Information