Women in the Workplace

Introduction

Problem Statement

Gender bias in earnings is an ongoing issue. In 2019, the American Association of University Women (AAUW) stated that women who work full-time earn about 80 percent of what their male counterparts make. This report identified the following problems which we attempt to investigate in our study:

  • The gender wage gap continues to grow over time.
  • Women face wider pay gap as they get older.
  • Women earn less than men in all but five occupations - wholesale and retail buyers except farm products; police and sheriffs patrol officers; bookkeeping, accounting, and auditing clerks; general office clerks; and data entry keepers.
  • Women make less than men in pink-collar jobs.
  • As women enter an occupation previously dominated by men, pay decreases in that occupation.

Data and Methodology

For this study, we would use the Women in Workforce data which is a historical data about womens’ earnings and employment status, specific occupation and earnings from 2013-2016, compiled from the Bureau of Labor Statistics and the Census Bureau. We intend to analyze the data using the following methodologies: Trend analysis, descriptive analysis, data visualization, and applicable statistical tests.

Proposed Analysis

We would investigate the gender pay gap over the years and the changes in the total female salary as a percentage of male salary by age group. We would conduct an exploratory analysis across the different fields/occupation groups to identify the male-dominated vs female-dominated occupation types. Some statistical tests would be employed to determine the differences in pay gap across fields and by age group. Finally, we hope to identify some factors contributing to the differences in wage gap using regression analysis.

Our Contribution

Our results would provide more insights on the gender pay gap across fields, occupations, and age groups. Our results would show the trend in gender pay gap over the years and some of the factors contributing to gender pay gap. We hope that our study would contribute to the discussion on gender pay gap.

Packages Required

We used the following packages:

library(readxl) #used to import Excel files into R
library(tidyverse) #used for data manipulation 
library(dplyr) # used for data manipulation
library(DT) # used for displaying R data objects (matrices or data frames) as tables on HTML pages
library(knitr) #used to display an aligned table on the screen
library(kableExtra)#used to build with straightforward formatting options

Data Preparation

Data Source

For this study, we use the Women in Workforce data which is a historical data about womens’ earnings and employment status, specific occupation and earnings from 2013-2016, compiled from the Bureau of Labor Statistics and the Census Bureau. The data was provided in March 2019 as part of the #TidyTuesday project to celebrate the Women’s History month.

The entire data is spread into 3 files: jobs_gender.csv, earnings_female.csv, employed_gender.csv and are described in the next tab.

Data Import and Description

The three datasets are first imported from csv files into dataframes named jobs_gender, earnings_female,and employed_gender

Dataset 1 - jobs_gender

This dataset contains information on the total number of male and female workers and the total estimated median earnings for all employees at various occupation levels, from 2013-2016. The dataset has 12 variables, with a total of 2,088 data points, across 8 major job categories, 23 minor job categories and 522 occupation types. There are some missing values recorded as “NA” and are taken care of during the data cleaning process.

More information on this dataset can be found here

#Import Dataset 1
jobs_gender <- read.csv("~/BANA/Coursework/BANA 7025 Data wrangling/Mid term assignment/women_in_the_workplace_data/jobs_gender.csv", header = TRUE)
#Check the names of the data columns
colnames(jobs_gender)
##  [1] "year"                  "occupation"           
##  [3] "major_category"        "minor_category"       
##  [5] "total_workers"         "workers_male"         
##  [7] "workers_female"        "percent_female"       
##  [9] "total_earnings"        "total_earnings_male"  
## [11] "total_earnings_female" "wage_percent_of_male"
#Check the dimension of the dataset
dim(jobs_gender)
## [1] 2088   12
#Count the number of distinct values/observations
library(dplyr)
jobs_gender %>%
      summarise_each(n_distinct, "occupation","minor_category", "major_category")
##   occupation minor_category major_category
## 1        522             23              8
#Check the number of missing values per column
colSums(is.na(jobs_gender))
##                  year            occupation        major_category 
##                     0                     0                     0 
##        minor_category         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female  wage_percent_of_male 
##                     4                    65                   846

Dataset 2 - earnings_female

This dataset contains the historic information of female salary as a percent of male salary for various age groups, from year 1979 to 2011. The dataset has 3 variables with 264 observations. This dataset has no missing values.

The dataset can be found here

#Imports Dataset 2
earnings_female <- read.csv("~/BANA/Coursework/BANA 7025 Data wrangling/Mid term assignment/women_in_the_workplace_data/earnings_female.csv", header = TRUE)
#Check the names of the data columns
colnames(earnings_female)
## [1] "Year"    "group"   "percent"
#Check the dimension of the dataset
dim(earnings_female)
## [1] 264   3
#Count the number of missing values per column
colSums(is.na(earnings_female))
##    Year   group percent 
##       0       0       0

Dataset 3 - employed_gender

This dataset shows the percentage of part-time and full-time employees for each year at the gender level. The dataset has 7 variables with 49 observations each, from year 1968 to 2016. This dataset has no missing values and can be accessed here

#Import Dataset 3
employed_gender <- read.csv("~/BANA/Coursework/BANA 7025 Data wrangling/Mid term assignment/women_in_the_workplace_data/employed_gender.csv", header = TRUE)
#Check the names of the data columns
colnames(employed_gender)
## [1] "year"             "total_full_time"  "total_part_time" 
## [4] "full_time_female" "part_time_female" "full_time_male"  
## [7] "part_time_male"
#Check the dimension of the dataset
dim(employed_gender)
## [1] 49  7
#Count the number of missing values per column
colSums(is.na(employed_gender))
##             year  total_full_time  total_part_time full_time_female 
##                0                0                0                0 
## part_time_female   full_time_male   part_time_male 
##                0                0                0

Data Cleaning

Variable Types

We investigate the variable types in our dataset to see if they are accurate or need to be changed.The codes and outputs below indicate that all the variable types in all three datasets are correct.

Dataset 1 - jobs_gender

glimpse(jobs_gender)
## Observations: 2,088
## Variables: 12
## $ year                  <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013...
## $ occupation            <fct> "Chief executives", "General and operati...
## $ major_category        <fct> "Management, Business, and Financial", "...
## $ minor_category        <fct> Management, Management, Management, Mana...
## $ total_workers         <int> 1024259, 977284, 14815, 43015, 754514, 4...
## $ workers_male          <int> 782400, 681627, 8375, 17775, 440078, 161...
## $ workers_female        <int> 241859, 295657, 6440, 25240, 314436, 280...
## $ percent_female        <dbl> 23.6, 30.3, 43.5, 58.7, 41.7, 63.5, 33.6...
## $ total_earnings        <int> 120254, 73557, 67155, 61371, 78455, 7411...
## $ total_earnings_male   <int> 126142, 81041, 71530, 75190, 91998, 9007...
## $ total_earnings_female <int> 95921, 60759, 65325, 55860, 65040, 66052...
## $ wage_percent_of_male  <dbl> 76.04208, 74.97316, 91.32532, 74.29179, ...

Dataset 2 - earnings_female

glimpse(earnings_female)
## Observations: 264
## Variables: 3
## $ Year    <int> 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987, ...
## $ group   <fct> "Total, 16 years and older", "Total, 16 years and olde...
## $ percent <dbl> 62.3, 64.2, 64.4, 65.7, 66.5, 67.6, 68.1, 69.5, 69.8, ...

Dataset 3- employed_gender

glimpse(employed_gender)
## Observations: 49
## Variables: 7
## $ year             <int> 1968, 1969, 1970, 1971, 1972, 1973, 1974, 197...
## $ total_full_time  <dbl> 86.0, 85.5, 84.8, 84.4, 84.3, 84.4, 84.2, 83....
## $ total_part_time  <dbl> 14.0, 14.5, 15.2, 15.6, 15.7, 15.6, 15.8, 16....
## $ full_time_female <dbl> 75.1, 74.9, 73.9, 73.2, 73.1, 73.2, 73.2, 72....
## $ part_time_female <dbl> 24.9, 25.1, 26.1, 26.8, 26.9, 26.8, 26.8, 27....
## $ full_time_male   <dbl> 92.2, 91.8, 91.5, 91.2, 91.1, 91.4, 91.2, 90....
## $ part_time_male   <dbl> 7.8, 8.2, 8.5, 8.8, 8.9, 8.6, 8.8, 9.4, 9.4, ...

Missing Values & Data Imputation

Next stage of the data cleaning process is to look at missing values and make decisions on removal or data imputation, where appropriate. As shown in the Data Import and Description tab, the only data set with some missing values is Dataset 1 - jobs_gender. So, we focus on this dataset.

First, we notice that there are 65 cases with missing values for the total earnings for female workers. Of this 65 cases, there are 19 cases where the total number of female workers in the field or occupation types is zero and the total earnings for female is missing “NA”. We impute these missing values as zero because no female workers implies no total female earnings.

#Impute missing values as zero
jobs_gender$total_earnings_female <- ifelse(jobs_gender$workers_female == 0, 0, jobs_gender$total_earnings_female)

For the remaining 46 observations with missing values for the total female earnings,we observe that the number of female workers is reported, indicating that these values are non-zero and missing. Hence, we remove the entire row from the dataset. We believe that this removal would not affect our analysis and goals for this study because, it’s just about 2 percent of the entire dataset.

#Remove rows with missing values for female earnings
jobs_gender_vector <- complete.cases(jobs_gender[,'total_earnings_female'])
jobs_gender <- jobs_gender[jobs_gender_vector,]

Next, we notice that there are 4 observations where the total earnings for male workers is missing. In 3 of these cases, the number of male workers is zero and the total earnings for male is missing “NA”. A closer look at these three observations revealed that in 2013, 2014, and 2016, there were no male workers who worked as “Nurse Midwives”. Therefore, we imputed the missing total male earnings as zero because no male workers implies no male earnings.

#Impute missing values as zero
jobs_gender$total_earnings_male <- ifelse(jobs_gender$workers_male == 0, 0, jobs_gender$total_earnings_male)

In the fourth case which was in 2015, a total of 53 male workers was reported for Nurse midwives. This is very different from what was reported in 2013, 2014, and 2016. This could mean that the occupation type (Nurse Midwives) had male workers in 2015 or an error occurred during the data reporting process. We decide to remove the entire row associated with this observation not because of the anomaly, but because the male total earnings is missing and reported as “NA”.

#Remove row with missing values for male earnings
jobs_gender_m_vector <- complete.cases(jobs_gender[,'total_earnings_male'])
jobs_gender <- jobs_gender[jobs_gender_m_vector,]

Finally, we notice that the column “wage_percent_of_male” showing the percentage of the total female earnings to total male earnings has a total of 846 missing values. Given that this is the percentage of the total female earnings to the total male earnings, we calculate and impute this variable using the total_earnings_male and total_earning_female columns.

#Calculate and impute results for wage_percent_of_male
jobs_gender$wage_percent_of_male<- ifelse(is.na(jobs_gender$wage_percent_of_male),jobs_gender$total_earnings_female/jobs_gender$total_earnings_male*100,jobs_gender$wage_percent_of_male)

Data Summary

We examine the summary of the datasets for unusual values and outliers. In all three datasets, there are no unusual vlaues like negative values. However, we observe some extreme values.These values are further examined in the next tab.

Dataset 1 - jobs_gender

data.frame(unclass(summary(jobs_gender[-c(2,3,4)])), check.names = FALSE, stringsAsFactors = FALSE)
##               year     total_workers      workers_male    workers_female
## X   Min.   :2013   Min.   :    658   Min.   :      0   Min.   :      0  
## X.1 1st Qu.:2014   1st Qu.:  20573   1st Qu.:  11396   1st Qu.:   2818  
## X.2 Median :2014   Median :  61840   Median :  33470   Median :  16297  
## X.3 Mean   :2014   Mean   : 200491   Mean   : 113967   Mean   :  86524  
## X.4 3rd Qu.:2015   3rd Qu.: 194802   3rd Qu.: 105461   3rd Qu.:  65250  
## X.5 Max.   :2016   Max.   :3758629   Max.   :2570385   Max.   :2290818  
##       percent_female   total_earnings total_earnings_male
## X   Min.   :  0.00   Min.   : 17266      Min.   :     0  
## X.1 1st Qu.: 11.47   1st Qu.: 32329      1st Qu.: 35630  
## X.2 Median : 33.59   Median : 44371      Median : 46824  
## X.3 Mean   : 36.73   Mean   : 49757      Mean   : 53157  
## X.4 3rd Qu.: 58.23   3rd Qu.: 61019      3rd Qu.: 65223  
## X.5 Max.   :100.00   Max.   :201542      Max.   :231420  
##     total_earnings_female wage_percent_of_male
## X        Min.   :     0        Min.   : 0.00  
## X.1      1st Qu.: 28600        1st Qu.:76.21  
## X.2      Median : 40068        Median :84.68  
## X.3      Mean   : 44264        Mean   :  Inf  
## X.4      3rd Qu.: 54615        3rd Qu.:91.80  
## X.5      Max.   :166388        Max.   :  Inf

Dataset 2 - earnings_female

data.frame(unclass(summary(earnings_female)), check.names = FALSE, stringsAsFactors = FALSE)
##               Year            group         percent
## X   Min.   :1979   16-19 years:33   Min.   :56.80  
## X.1 1st Qu.:1987   20-24 years:33   1st Qu.:69.40  
## X.2 Median :1995   25-34 years:33   Median :75.50  
## X.3 Mean   :1995   35-44 years:33   Mean   :76.88  
## X.4 3rd Qu.:2003   45-54 years:33   3rd Qu.:86.90  
## X.5 Max.   :2011   55-64 years:33   Max.   :95.40  
## X.6           <NA> (Other)    :66              <NA>

Dataset 3- employed_gender

data.frame(unclass(summary(employed_gender)), check.names = FALSE, stringsAsFactors = FALSE)
##               year total_full_time total_part_time full_time_female
## X   Min.   :1968   Min.   :80.30   Min.   :14.00    Min.   :71.90  
## X.1 1st Qu.:1980   1st Qu.:81.80   1st Qu.:16.80    1st Qu.:73.20  
## X.2 Median :1992   Median :82.60   Median :17.40    Median :73.90  
## X.3 Mean   :1992   Mean   :82.64   Mean   :17.36    Mean   :73.86  
## X.4 3rd Qu.:2004   3rd Qu.:83.20   3rd Qu.:18.20    3rd Qu.:74.70  
## X.5 Max.   :2016   Max.   :86.00   Max.   :19.70    Max.   :75.40  
##     part_time_female  full_time_male  part_time_male
## X    Min.   :24.60   Min.   :86.60   Min.   : 7.80  
## X.1  1st Qu.:25.30   1st Qu.:89.00   1st Qu.: 9.60  
## X.2  Median :26.10   Median :89.50   Median :10.50  
## X.3  Mean   :26.14   Mean   :89.49   Mean   :10.51  
## X.4  3rd Qu.:26.80   3rd Qu.:90.40   3rd Qu.:11.00  
## X.5  Max.   :28.10   Max.   :92.20   Max.   :13.40

Outliers

Data preview

Dataset 1 - jobs_gender

Dataset 2 - earnings_female

Dataset 3- employed_gender

Summary of Variables

Dataset 1 - jobs_gender

Variable Description
year Year
occupation Specific job/career
major_category Broad category of occupation
minor_category Fine category of occupation
total_workers Total estimated full-time workers > 16 years old
workers_male Estimated MALE full-time workers > 16 years old
workers_female Estimated FEMALE full-time workers > 16 years old
percent_female The percent of females for specific occupation
total_earnings Total estimated median earnings for full-time workers > 16 years old
total_earnings_male Estimated MALE median earnings for full-time workers > 16 years old
total_earnings_female Estimated FEMALE median earnings for full-time workers > 16 years old
wage_percent_of_male Female wages as percent of male wages - NA for occupations with small sample size

Dataset 2 - earnings_female

Variable Description
Year Year
group Age group
percent Female salary percent of male salary

Dataset 3 - employed_gender

Variable Description
year Year
total_full_time Percent of total employed people usually working full time
total_part_time Percent of total employed people usually working part time
full_time_female Percent of employed women usually working full time
part_time_female Percent of employed women usually working part time
full_time_male Percent of employed men usually working full time
part_time_male Percent of employed men usually working part time

Proposed Exploratory Data Analysis

To investigate and address the problem statement, we would explore the individual datasets and combine them where appropriate. Our proposed approach is as follows:

Descriptive Analysis

We would provide some descriptive statistics and plots such as bar charts, line charts, and ribbon charts showing the representation of male and female workers across the occupation types, minor and major job groups. We would show the major job groups that are male-dominated vs female-dominated; summary statistics of earnings by gender; and the job types with the widest, narrowest, or no gender pay gap.

Trend Analysis with Interactive Plots

We would examine the gender wage gap over time through a trend analysis. This would show the reported and the change in median earnings for male and female workers from 2013 to 2016. Using the combined dataset, we would explore the percentage change in the number of full-time and part-time workers across both genders, from 1968 to 2016. We would employ bar charts, ribbon charts, and interactive plots to show how the trend has evolved over the years.

Analysis of Variance (ANOVA) Test

We would investigate if there’s an increase in pay gap as women get older. To achieve this, we would create a line chart showing female salary as a percentage of male salary across different age groups, from 1979 to 2011. We would also consider an Analysis of Variance Test (ANOVA) to test the hypothesis that the gender pay gap gets wider as women get older. Is there a difference in the mean earnings of women as a percentage of male earnings across different age groups?

Further Data Slicing and Exploration

The data would be sliced to investigate the claims that women make more than men in the five occupations (wholesale and retail buyers except farm products; police and sheriffs patrol officers; bookkeeping, accounting, and auditing clerks; general office clerks; and data entry keepers) and less in pink-collar jobs. The results would be displayed using tables and some summary statistics. We would further explore our dataset to see if some job types changed from been male-dominated to female-dominated and vice versa. If yes, did the earnings change as a result?

Machine Learning Technique: Linear Regression Analysis

Finally, we would employ the linear regression analysis technique to identify some of the factors that contribute to the differences in wage gap.

Need: More Expertise and Information

  • We intend getting some more information on the median earnings across the United States to create some more context and comparison. This would help us in identifying occupation groups that pay below or above the median earnings in the US.
  • In the analysis and visualization perspective, we would like to explore and learn more about the different ways of providing better visuals and graphics to communicate our findings.