Women in the Workforce

Introduction

Starting most dramatically in the 1950s, American women began to enter the workforce. Over the next half century, they increasingly left their homes to start careers. Since the 1990s, however, the United States has seen a decrease in women’s labor force participation rate, which has only started to pick up in the last few years. This report focuses mainly on the following questions:

How have women been doing, comparatively, over the last 20 years?
How are things now, and where are we headed?

Women have been making space for themselves, and headlines, in recent times by requiring that their workplace concerns be heard and addressed. This analysis aims to discover what career prospects for women look like today and how the components that characterize a woman’s career, her position and income, have changed.

To address this, we examined the pay gap trend over time, the differences between part-time and full-time work by gender, and how industry has affected women’s compensation in recent years. By approaching the question from several angles, we try to tell the story in multiple dimensions without jumping to conclusions about the complex social reasons why women’s labor force participation rate may have changed.

Because this is still in development, we are unsure if we want to add a predictive element that would help pinpoint what the trends will look like in the future for women. Either way, our analysis is important because of the huge impact that women have on the economic output in the United States. How women are being compensated and making decisions about work will tell employers what to expect in years to come.

The Bureau of Labor Statistics illustrates women’s participation rate in the graph below.

library(png)
library(grid)
img <- readPNG("civilian_women.png")
grid.raster(img)

Disclaimer: It is important to note that the overall labor force participation rate has also decreased.

Packages Required

The following packages are required to rerun the R script included in this report without error:

#Loading the required packages
library(readr)      # Read Data in csv format
library(tidyverse)  # For manipulating data
library(dplyr)      # Manipulating data in R and using pipe operator
library(DT)         # HTML display of data
library(knitr)      # Used to display an aligned table on screen
library(ggplot2)    # Used to plot different graphs
library(qwraps2)    # Used to make tables
library(formattable)# Markdown display of data
library(reshape2)   # Used to melt datasets

Introduction to the Data

About the Data

This data was taken from a Tidy Tuesday challenge. Please read below for a synopsis of the data provided by the Tidy Tuesday GitHub Repository. Information on each of the datasets used can be found in the following tabs.

Women in the Workforce

March is Women’s History month, as such we’re exploring data from the Bureau of Labor Statistics and the Census Bureau about women in the workforce. There are historical data about women’s earnings and employment status, as well as detailed information about specific occupation and earnings from 2013-2016.

According to the AAUW - “The gender pay gap is the gap between what men and women are paid. Most commonly, it refers to the median annual pay of all women who work full time and year-round, compared to the pay of a similar cohort of men.”

The specific jobs data came from the Census Bureau and the historical data comes from the Bureau of Labor. The data is provided as is, and you recognize the limitations and issues in defining gender as binary.

Limitations of the Datasets

There are limitations and assumptions in the datasets used, which should be noted at the beginning of this report.

First, the date ranges of the datasets vary greatly. This limited our view of trends across all time periods.
Second, gender in these datasets is a binary male/female classification.
Lastly, we assume the data obtained and used in our analysis is valid and complete.

Reading the Data

First we read in the semi-cleaned data. The data is held in the Tidy Tuesday GitHub repository. We read the csv file into this report directly from the repository.

jobs_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv")

## Parsed with column specification:
## cols(
##   year = col_double(),
##   occupation = col_character(),
##   major_category = col_character(),
##   minor_category = col_character(),
##   total_workers = col_double(),
##   workers_male = col_double(),
##   workers_female = col_double(),
##   percent_female = col_double(),
##   total_earnings = col_double(),
##   total_earnings_male = col_double(),
##   total_earnings_female = col_double(),
##   wage_percent_of_male = col_double()
## )

earnings_female <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/earnings_female.csv")

## Parsed with column specification:
## cols(
##   Year = col_double(),
##   group = col_character(),
##   percent = col_double()
## )

employed_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/employed_gender.csv")

## Parsed with column specification:
## cols(
##   year = col_double(),
##   total_full_time = col_double(),
##   total_part_time = col_double(),
##   full_time_female = col_double(),
##   part_time_female = col_double(),
##   full_time_male = col_double(),
##   part_time_male = col_double()
## )

Dataset: Earnings_Female

Purpose

The purpose of this dataset is to provide information on women’s salaries as a percentage of men’s salaries within the same age group. The time series component helps us see how that percentage changes over time.

Quick Clean

The first thing we should do is rename the variables to enhance clarity and understanding.

colnames(earnings_female) <- c("Year","Age_Group", "Percent_of_Male_Salary" )

Data Dictionary

This dataset includes three variables: Year, Age Group, and Percent of Male Salary.

datadict <- read_csv("data/earnings_female_datadict.csv")
colnames(datadict) <- c("Variable", "Class", "Description")
kable(datadict, format = "markdown")

Variable	Class	Description
Year	integer	Year
group	character	Age group
percent	double	Female salary percent of male salary

Technical Attributes

The dataset contains 264 observations (33 years × 8 age groups)
Year ranges from 1979 to 2011
Age_Group uses the following age ranges:
- 16-19, 20-24, 25-34, 35-44, 45-54, 55-64, 65 and older, plus a total for ages 16 and older

Missing Values

This dataset contains no missing values.

sum(is.na(earnings_female))

## [1] 0

Data Manipulation

It will be helpful in future visualizations to spread this dataset into wide format, with one column per age group.

# Spread earnings_female
earnings_female_byage <- spread(earnings_female, key = Age_Group, value = Percent_of_Male_Salary)
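For readers on a recent tidyr (1.0.0 or later), the same reshaping can be written with pivot_wider(), which supersedes spread(). The sketch below is only an illustration: the toy data frame earnings_long and its values are made up, and only the column names follow the renamed dataset.

```r
library(tidyr)

# Toy long-format data with the renamed columns (values are invented for illustration)
earnings_long <- data.frame(
  Year = c(1979, 1979, 1980, 1980),
  Age_Group = c("16-19 years", "20-24 years", "16-19 years", "20-24 years"),
  Percent_of_Male_Salary = c(10, 20, 12, 22)
)

# One row per year, one column per age group (the same shape spread() produces)
earnings_wide <- pivot_wider(
  earnings_long,
  names_from = Age_Group,
  values_from = Percent_of_Male_Salary
)

dim(earnings_wide)
```

Either function gives one row per year; pivot_wider() simply names its arguments more explicitly.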

Data Preview

datatable(earnings_female_byage, options = list(
  autoWidth = TRUE,
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20))
)

Summary Statistics

See below for a summary of the dataset and general structure:

options(qwraps2_markup = "markdown")
earnings_female <- as.data.frame(earnings_female)
summary_statistics <-
  list(
    "Female Earnings as % of Male" =
      list(
        "min" = ~min(.data$Percent_of_Male_Salary, na.rm = TRUE),
        "mean (sd)" = ~qwraps2::mean_sd(.data$Percent_of_Male_Salary, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(.data$Percent_of_Male_Salary, na_rm = TRUE),
        "max" = ~max(.data$Percent_of_Male_Salary, na.rm = TRUE),
        "Missing" = ~sum(is.na(.data$Percent_of_Male_Salary))
      )
  )

print(qwraps2::summary_table(
  dplyr::group_by(earnings_female, Age_Group),
  summary_statistics
),
rtitle = "Summary Statistics Table for the Earnings_Female Data Set")

Summary Statistics Table for the Earnings_Female Data Set	Age_Group: 16-19 years (N = 33)	Age_Group: 20-24 years (N = 33)	Age_Group: 25-34 years (N = 33)	Age_Group: 35-44 years (N = 33)	Age_Group: 45-54 years (N = 33)	Age_Group: 55-64 years (N = 33)	Age_Group: 65 years and older (N = 33)	Age_Group: Total, 16 years and older (N = 33)
Female Earnings as % of Male
min	85.2	76.3	67.5	58.3	56.8	58.9	65.9	62.3
mean (sd)	91.07 ± 2.42	90.00 ± 4.92	81.24 ± 6.42	70.44 ± 6.21	67.40 ± 6.45	66.96 ± 5.37	73.77 ± 3.91	74.18 ± 5.74
median (Q1, Q3)	91.40 (89.10, 92.90)	91.90 (88.00, 93.80)	82.40 (76.70, 86.90)	72.50 (66.10, 75.20)	67.70 (61.70, 73.50)	65.30 (62.20, 72.70)	74.40 (70.90, 76.40)	75.80 (69.80, 79.40)
max	94.6	95.4	92.3	79.9	76.5	75.4	80.9	82.2
Missing	0	0	0	0	0	0	0	0

Dataset: Jobs_Gender

Purpose

The purpose of this dataset is to provide quantitative information on the labor force within major and minor industries (categories). This includes the volume of participants in the workforce by gender, as well as the salary of these participants by gender. The time series component helps us see the change in volume and male/female ratios over time.

Data Dictionary

This dataset includes 12 variables: year, occupation, major_category, minor_category, total_workers, workers_male, workers_female, percent_female, total_earnings, total_earnings_male, total_earnings_female, wage_percent_of_male.

datadict <- read_csv("data/jobs_gender_datadict.csv")
colnames(datadict) <- c("Variable", "Class", "Description")
kable(datadict, format = "markdown")

Variable	Class	Description
year	integer	Year
occupation	character	Specific job/career
major_category	character	Broad category of occupation
minor_category	character	Fine category of occupation
total_workers	double	Total estimated full-time workers > 16 years old
workers_male	double	Estimated MALE full-time workers > 16 years old
workers_female	double	Estimated FEMALE full-time workers > 16 years old
percent_female	double	The percent of females for specific occupation
total_earnings	double	Total estimated median earnings for full-time workers > 16 years old
total_earnings_male	double	Estimated MALE median earnings for full-time workers > 16 years old
total_earnings_female	double	Estimated FEMALE median earnings for full-time workers > 16 years old
wage_percent_of_male	double	Female wages as percent of male wages - NA for occupations with small sample size

Technical Attributes

Observations

The dataset contains 2,088 observations
Year ranges from 2013 to 2016
Occupation includes 522 different job titles
The major category variable has 8 different categories
The minor category variable has 23 unique sub-categories

Missing values

There are 846 records that are missing data in the wage_percent_of_male variable. From the data dictionary, we learned that this is due to small sample sizes. For calculation purposes, we would like to remove as many NAs as possible without significantly misrepresenting our data. We can do this by calculating the percentage from the median values given in total_earnings_female and total_earnings_male in each observation and imputing the missing values with the observation’s calculated percentage.

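# Impute the percentage using the given earnings data.
# Note: this recomputes the value for every row, as a female/male ratio (not multiplied by 100).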
jobs_gender <- 
  jobs_gender %>% 
  mutate(wage_percent_of_male = total_earnings_female/total_earnings_male)

# Check our work
colSums(is.na(jobs_gender))

##                  year            occupation        major_category 
##                     0                     0                     0 
##        minor_category         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female  wage_percent_of_male 
##                     4                    65                    69
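The mutate() call above overwrites wage_percent_of_male in every row. If the intent is to fill only the missing entries and keep the reported values, one possible variant uses dplyr::coalesce(). This is a sketch on a toy data frame with invented values; it assumes the reported column is on a 0-100 percent scale, which should be checked against the data before use.

```r
library(dplyr)

# Toy rows with invented values; column names follow jobs_gender
jobs_toy <- data.frame(
  total_earnings_male   = c(50000, 40000, NA),
  total_earnings_female = c(45000, 30000, 35000),
  wage_percent_of_male  = c(90, NA, NA)
)

jobs_imputed <- jobs_toy %>%
  mutate(
    # Keep the reported percent where present; otherwise compute it.
    # Assumption: the reported column is a percent (0-100), hence the factor of 100.
    wage_percent_of_male = coalesce(
      wage_percent_of_male,
      100 * total_earnings_female / total_earnings_male
    )
  )

jobs_imputed$wage_percent_of_male
```

Rows where either earnings figure is missing stay NA, which matches the remaining missing values discussed in this section.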

The following are logical reasons for missing values and will not be corrected:

In 18 observations, there were no female workers. As a result, total_earnings_female and wage_percent_of_male are NA.
In 3 observations, there were no male workers. As a result, total_earnings_male and wage_percent_of_male are NA.

jobs_gender %>% 
  filter(workers_female == 0 |  workers_male == 0) %>% 
  select(-c("major_category", "minor_category"))

## # A tibble: 21 x 10
##     year occupation total_workers workers_male workers_female
##    <dbl> <chr>              <dbl>        <dbl>          <dbl>
##  1  2013 Nurse mid~          2817            0           2817
##  2  2013 Septic ta~          6191         6191              0
##  3  2013 Roof bolt~          3576         3576              0
##  4  2013 Roustabou~          9766         9766              0
##  5  2013 Mine shut~           847          847              0
##  6  2014 Nurse mid~          4490            0           4490
##  7  2014 Roof bolt~          2812         2812              0
##  8  2014 Helpers--~          4323         4323              0
##  9  2014 Electrica~           972          972              0
## 10  2014 Railroad ~          3808         3808              0
## # ... with 11 more rows, and 5 more variables: percent_female <dbl>,
## #   total_earnings <dbl>, total_earnings_male <dbl>,
## #   total_earnings_female <dbl>, wage_percent_of_male <dbl>

There are still some missing values because 48 observations lack total earnings data for men or for women. If these missing values cause issues in our Exploratory Data Analysis, we will remove them. For now, these observations provide enough other useful information to keep in the dataset. Regarding the missing values:

47 are missing ‘total_earnings_female’
1 is missing ‘total_earnings_male’
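The breakdown of missing values by cause can be tallied in one pass. The following is a minimal sketch on a made-up data frame (jobs_toy and the reason labels are hypothetical); the column names mirror jobs_gender.

```r
library(dplyr)

# Toy rows with invented values illustrating each situation
jobs_toy <- data.frame(
  workers_male          = c(100, 0, 250, 80),
  workers_female        = c(0, 120, 40, 60),
  total_earnings_male   = c(41000, NA, 52000, 39000),
  total_earnings_female = c(NA, 38000, NA, 36000)
)

missing_reasons <- jobs_toy %>%
  mutate(reason = case_when(
    workers_female == 0 ~ "no female workers",
    workers_male == 0 ~ "no male workers",
    is.na(total_earnings_female) | is.na(total_earnings_male) ~ "earnings not reported",
    TRUE ~ "complete"
  )) %>%
  count(reason)

missing_reasons
```

Because case_when() evaluates its conditions in order, rows with zero workers of one gender are labeled before the generic "earnings not reported" rule is applied.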

Data Preview

See below for a preview of the dataset.

datatable(jobs_gender, options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)
))

Summary Statistics

See below for a summary of the dataset and general structure:

options(qwraps2_markup = "markdown")
jobs_gender <- as.data.frame(jobs_gender)
summary_statistics <-
  list(
    "Total Workers" =
      list(
        "min" = ~min(.data$total_workers, na.rm = TRUE),
        "mean (sd)" = ~qwraps2::mean_sd(.data$total_workers, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_workers, na_rm = TRUE),
        "max" = ~max(.data$total_workers, na.rm = TRUE),
        "Missing" = ~sum(is.na(.data$total_workers))
      ),
    "Male Workers" =
      list(
        "min" = ~min(.data$workers_male, na.rm = TRUE),
        "mean (sd)" = ~qwraps2::mean_sd(.data$workers_male, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(.data$workers_male, na_rm = TRUE),
        "max" = ~max(.data$workers_male, na.rm = TRUE),
        "Missing" = ~sum(is.na(.data$workers_male))
      ),
    "Female Workers" =
      list(
        "min" = ~min(.data$workers_female, na.rm = TRUE),
        "mean (sd)" = ~qwraps2::mean_sd(.data$workers_female, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(.data$workers_female, na_rm = TRUE),
        "max" = ~max(.data$workers_female, na.rm = TRUE),
        "Missing" = ~sum(is.na(.data$workers_female))
      ),
    "Total Earnings" =
      list(
        "min" = ~min(.data$total_earnings, na.rm = TRUE),
        "mean (sd)" = ~qwraps2::mean_sd(.data$total_earnings, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_earnings, na_rm = TRUE),
        "max" = ~max(.data$total_earnings, na.rm = TRUE),
        "Missing" = ~sum(is.na(.data$total_earnings))
      ),
    "Male Earnings" =
      list(
        "min" = ~min(.data$total_earnings_male, na.rm = TRUE),
        "mean (sd)" = ~qwraps2::mean_sd(.data$total_earnings_male, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_earnings_male, na_rm = TRUE),
        "max" = ~max(.data$total_earnings_male, na.rm = TRUE),
        "Missing" = ~sum(is.na(.data$total_earnings_male))
      ),
    "Female Earnings" =
      list(
        "min" = ~min(.data$total_earnings_female, na.rm = TRUE),
        "mean (sd)" = ~qwraps2::mean_sd(.data$total_earnings_female, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_earnings_female, na_rm = TRUE),
        "max" = ~max(.data$total_earnings_female, na.rm = TRUE),
        "Missing" = ~sum(is.na(.data$total_earnings_female))
      )
  )

#table <- summary_table(jobs_gender, summary_statistics)
#print(table, rtitle = "Summary Statistics Table for the Jobs_Gender Data Set")

print(qwraps2::summary_table(
  dplyr::group_by(jobs_gender, major_category),
  summary_statistics
),
rtitle = "Summary Statistics Table for the Jobs_Gender Data Set")

Summary Statistics Table for the Jobs_Gender Data Set	major_category: Computer, Engineering, and Science (N = 236)	major_category: Education, Legal, Community Service, Arts, and Media (N = 168)	major_category: Healthcare Practitioners and Technical (N = 128)	major_category: Management, Business, and Financial (N = 232)	major_category: Natural Resources, Construction, and Maintenance (N = 328)	major_category: Production, Transportation, and Material Moving (N = 444)	major_category: Sales and Office (N = 280)	major_category: Service (N = 272)
Total Workers
min	836	5424	2817	5439	972	658	1413	747
mean (sd)	117,869.59 ± 176,014.30	244,239.31 ± 397,711.18	193,852.91 ± 395,191.26	327,296.72 ± 533,497.95	121,477.00 ± 202,629.98	111,801.25 ± 301,312.90	332,110.75 ± 528,970.43	210,631.98 ± 315,720.78
median (Q1, Q3)	50,892.00 (16,201.00, 136,846.50)	102,709.50 (45,149.25, 330,168.25)	84,206.00 (32,657.75, 157,746.00)	139,287.00 (46,871.00, 440,238.75)	34,171.50 (14,086.25, 119,757.00)	24,677.50 (7,328.00, 73,449.50)	94,830.00 (36,482.25, 386,638.75)	68,956.50 (27,538.50, 233,804.75)
max	1124661	2398445	2317493	3758629	1180144	2695081	2677578	1553626
Missing	0	0	0	0	0	0	0	0
Male Workers
min	219	1239	0	2824	972	296	499	747
mean (sd)	89,785.52 ± 141,174.31	93,467.14 ± 137,818.97	54,026.68 ± 92,402.62	185,876.49 ± 340,101.33	116,804.15 ± 197,077.30	90,149.27 ± 271,783.08	138,123.15 ± 263,065.40	106,253.38 ± 199,365.29
median (Q1, Q3)	37,386.00 (10,951.25, 102,952.25)	36,864.50 (19,942.75, 76,605.75)	25,475.00 (5,832.25, 67,886.50)	92,677.00 (17,769.25, 180,957.25)	33,094.50 (12,412.50, 118,083.25)	19,659.50 (5,990.50, 54,168.75)	34,221.00 (12,493.50, 116,202.75)	34,010.50 (11,032.25, 94,528.00)
max	918865	575443	495061	2472383	1150257	2570385	1537529	1156110
Missing	0	0	0	0	0	0	0	0
Female Workers
min	146	3220	953	1018	0	0	535	0
mean (sd)	28,084.07 ± 40,199.23	150,772.17 ± 293,549.25	139,826.23 ± 340,168.19	141,420.22 ± 218,741.15	4,672.85 ± 9,673.94	21,651.98 ± 48,458.16	193,987.60 ± 345,314.14	104,378.60 ± 188,375.28
median (Q1, Q3)	10,357.50 (3,141.00, 35,682.25)	44,728.50 (21,860.25, 190,418.25)	54,487.00 (10,984.75, 102,121.75)	52,883.00 (23,452.75, 159,442.25)	1,144.00 (438.00, 4,741.75)	2,942.00 (1,168.50, 12,378.00)	57,055.50 (21,173.25, 193,616.25)	28,604.50 (8,963.75, 90,688.25)
max	205796	1867475	2036445	1286246	83108	300040	2290818	1177169
Missing	0	0	0	0	0	0	0	0
Total Earnings
min	40464	21125	31530	36471	20420	20726	20251	17266
mean (sd)	76,561.88 ± 18,417.48	49,832.49 ± 16,935.33	74,912.03 ± 38,164.20	65,564.55 ± 17,041.79	43,898.75 ± 12,522.12	39,516.50 ± 15,143.63	40,359.07 ± 13,396.93	34,626.52 ± 15,006.65
median (Q1, Q3)	76,990.50 (62,014.75, 90,353.25)	47,479.50 (41,926.25, 52,127.25)	62,272.50 (46,511.75, 92,160.50)	62,191.50 (53,092.75, 73,425.00)	41,761.50 (35,451.50, 51,027.50)	35,788.50 (29,410.50, 45,357.75)	37,216.50 (31,841.50, 46,557.50)	30,432.50 (24,747.50, 41,139.25)
max	141359	122073	201542	130293	88901	102155	111522	90571
Missing	0	0	0	0	0	0	0	0
Male Earnings
min	23794	25873	35640	41164	22957	21536	21105	12147
mean (sd)	80,204.30 ± 19,379.94	54,402.95 ± 18,887.57	124; 81,486.81 ± 42,023.71	73,717.34 ± 19,208.39	44,270.04 ± 12,484.74	41,299.60 ± 14,915.35	44,986.89 ± 15,296.61	36,805.50 ± 15,157.04
median (Q1, Q3)	81,433.50 (67,259.75, 91,832.75)	50,893.00 (45,094.00, 57,843.25)	124; 71,213.50 (49,938.50, 101,071.50)	71,393.50 (60,927.75, 82,328.25)	41,984.50 (35,520.00, 51,208.75)	37,683.50 (31,160.00, 46,874.00)	41,365.50 (35,609.00, 52,830.50)	31,805.50 (26,347.00, 41,808.25)
max	150247	136043	231420	141108	88919	102479	115432	90912
Missing	0	0	4	0	0	0	0	0
Female Earnings
min	33376	20748	31126	25310	11080	7447	19688	16771
mean (sd)	235; 69,427.31 ± 17,804.86	46,257.66 ± 14,032.47	68,887.46 ± 31,559.21	59,070.36 ± 15,413.74	282; 38,549.11 ± 15,510.49	429; 32,437.85 ± 13,667.56	37,105.95 ± 10,377.21	269; 31,987.96 ± 13,382.61
median (Q1, Q3)	235; 68,925.00 (56,797.50, 80,889.00)	45,081.00 (38,248.75, 50,808.50)	60,984.00 (45,444.50, 91,083.25)	56,810.00 (49,981.00, 66,007.75)	282; 35,580.00 (29,752.25, 43,862.50)	429; 27,883.00 (24,268.00, 36,241.00)	35,631.00 (30,458.75, 41,380.50)	269; 28,384.00 (22,291.00, 38,088.00)
max	120253	102484	166388	131780	158929	130660	90274	100508
Missing	1	0	0	0	46	15	0	3

Dataset: Employed_Gender

Purpose

This dataset provides another piece of information regarding women in the workforce. Here, we are given the percentage of employed women and men, by year, who usually work full time versus part time.

Data Dictionary

This dataset includes 7 variables: year, total_full_time, total_part_time, full_time_female, part_time_female, full_time_male, part_time_male.

datadict <- read_csv("data/employed_gender_datadict.csv")
colnames(datadict) <- c("Variable", "Class", "Description")
kable(datadict, format = "markdown")

Variable	Class	Description
year	double	Year
total_full_time	double	Percent of total employed people usually working full time
total_part_time	double	Percent of total employed people usually working part time
full_time_female	double	Percent of employed women usually working full time
part_time_female	double	Percent of employed women usually working part time
full_time_male	double	Percent of employed men usually working full time
part_time_male	double	Percent of employed men usually working part time

Technical Attributes

Observations

The dataset contains 49 observations
Year ranges from 1968 to 2016

Missing Values

There are no missing values in this dataset.

Data Preview

See below for a preview of the dataset.

datatable(employed_gender, options = list(
  columnDefs = list(list(className = 'dt-center', targets = 5)),
  pageLength = 5,
  lengthMenu = c(5, 10, 15, 20)
))

Summary Statistics

See below for a summary of the dataset and general structure:

options(qwraps2_markup = "markdown")
employed_gender <- as.data.frame(employed_gender)
summary_statistics <-
  list(
    "Total Full Time" =
      list(
        "min" = ~min(.data$total_full_time, na.rm = TRUE),
        "mean (sd)" = ~qwraps2::mean_sd(.data$total_full_time, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_full_time, na_rm = TRUE),
        "max" = ~max(.data$total_full_time, na.rm = TRUE),
        "Missing" = ~sum(is.na(.data$total_full_time))
      ),
    "Total Part Time" =
      list(
        "min" = ~min(.data$total_part_time, na.rm = TRUE),
        "mean (sd)" = ~qwraps2::mean_sd(.data$total_part_time, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_part_time, na_rm = TRUE),
        "max" = ~max(.data$total_part_time, na.rm = TRUE),
        "Missing" = ~sum(is.na(.data$total_part_time))
      ),
    "Full Time - Female" =
      list(
        "min" = ~min(.data$full_time_female, na.rm = TRUE),
        "mean (sd)" = ~qwraps2::mean_sd(.data$full_time_female, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(.data$full_time_female, na_rm = TRUE),
        "max" = ~max(.data$full_time_female, na.rm = TRUE),
        "Missing" = ~sum(is.na(.data$full_time_female))
      ),
    "Part Time - Female" =
      list(
        "min" = ~min(.data$part_time_female, na.rm = TRUE),
        "mean (sd)" = ~qwraps2::mean_sd(.data$part_time_female, na_rm = TRUE),
        "median (Q1, Q3)" = ~qwraps2::median_iqr(.data$part_time_female, na_rm = TRUE),
        "max" = ~max(.data$part_time_female, na.rm = TRUE),
        "Missing" = ~sum(is.na(.data$part_time_female))
      )
  )

table <- summary_table(employed_gender, summary_statistics)
print(table, rtitle = "Summary Statistics Table for the Employed_Gender Data Set")

Summary Statistics Table for the Employed_Gender Data Set	employed_gender (N = 49)
Total Full Time
min	80.3
mean (sd)	82.64 ± 1.24
median (Q1, Q3)	82.60 (81.80, 83.20)
max	86
Missing	0
Total Part Time
min	14
mean (sd)	17.36 ± 1.24
median (Q1, Q3)	17.40 (16.80, 18.20)
max	19.7
Missing	0
Full Time - Female
min	71.9
mean (sd)	73.86 ± 0.97
median (Q1, Q3)	73.90 (73.20, 74.70)
max	75.4
Missing	0
Part Time - Female
min	24.6
mean (sd)	26.14 ± 0.97
median (Q1, Q3)	26.10 (25.30, 26.80)
max	28.1
Missing	0

Exploratory Data Analysis

Our Exploratory Data Analysis will begin with simple visualizations to gain a general understanding of the trends in our data. Due to the nature of the timelines given in the datasets, our approach to analyzing the data is twofold:

first, we attempt to identify overarching trends in the 3 decades following the initial introduction of women in the workforce;
second, we search for trends in the most recent years of our data to help us understand where women in the workforce are today.

Time Series Analysis

Wage Gap over the Decades

What has the pay gap looked like over these three decades?

From the time series of boxplots below, we see two important trends. Women’s salary as a percentage of men’s salary increases from 1979 to 2011. This indicates that women are earning wages increasingly comparable to those of their male counterparts. We also see a decrease in the variation of this salary ratio as time progresses.

For understanding: This boxplot shows the spread of women’s salary as a percentage of men’s salary over the years. Each box summarizes the spread of percentages across age groups in that year.

boxplot(Percent_of_Male_Salary ~ Year,
        data = earnings_female,
        main = "Distribution of Women's Salary as a percent of Males' Salary\n1979 - 2011",
        xlab = "Year",
        ylab = "Percent of Male Salary"
        )

Is age a factor when it comes to the pay gap between men and women?

We can look at the above chart in a different light by stratifying the data according to the age group of workers. In the plot below, younger age groups show a smaller wage gap. We hypothesize that this is due to the types of jobs held by workers between the ages of 16 and 24. As they are most likely still in school, these jobs may be temporary or part-time positions that pay a standard rate. As age increases, we see a wider gap between male and female salaries. These older age groups are also where the most progress is made over the years, which brings up the median and average percentage of male salary.

ggplot(data=earnings_female, aes(x = Year, y = Percent_of_Male_Salary, group = Age_Group, colour = Age_Group))+
  geom_line(size = 1)+
  labs(title = 'Female Salary as a Percentage of Male Salary, Grouped By Age')

The trends shown in the above graphs indicate improvement towards gender equality in the workforce.
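To put a number on the improvement for each age group, one option is to compare the first and last years directly. This is a rough sketch on invented values (earnings_toy is hypothetical); only the column names follow the renamed earnings_female dataset.

```r
library(dplyr)

# Toy data with invented values for two age groups
earnings_toy <- data.frame(
  Year = c(1979, 2011, 1979, 2011),
  Age_Group = c("25-34 years", "25-34 years", "45-54 years", "45-54 years"),
  Percent_of_Male_Salary = c(60, 80, 50, 65)
)

gap_change <- earnings_toy %>%
  group_by(Age_Group) %>%
  summarise(
    first_year = Percent_of_Male_Salary[which.min(Year)],  # value in the earliest year
    last_year  = Percent_of_Male_Salary[which.max(Year)],  # value in the latest year
    change     = last_year - first_year                    # percentage-point change
  )

gap_change
```

A positive change means the age group moved closer to parity over the period.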

Full-Time and Part-Time Employment over the Decades

What is the trend in part-time and full-time work for men and women over the years?

We created a graph in an attempt to answer this question. Overall, this chart does not provide much new insight. We can see that the percentage of workers who work part-time follows a similar trend for both genders, and the same goes for the percentage who work full-time. We would still like to know:

When the percentage for full-time decreases, are more people dropping out of the labor force or simply switching to a part-time position (and vice versa)?
What do these trends look like in the population (as counts and not percentages)?
What types of jobs are women working? Where do the opportunities lie for the increasing FT female workforce?

Because these figures are percentages, we do not get a true understanding of the population totals. Since the data were standardized to give only part-time and full-time percentages, we wanted to scale them by the size of the workforce. There may be a hidden insight here, so we decided to merge the jobs_gender dataset with the employed_gender dataset.

For understanding: The y-axis shows the percentage of the labor force within each gender. For example, the percentage of part-time women plus the percentage of full-time women adds up to 100%, which represents all employed women.
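The claim that the two shares sum to 100% within each gender is easy to verify before plotting. The sketch below uses a toy data frame with invented values (employed_toy is hypothetical); the column names follow employed_gender, and the 0.2 tolerance is an assumption to allow for rounding in published percentages.

```r
# Toy data with invented values; column names follow employed_gender
employed_toy <- data.frame(
  year = c(2015, 2016),
  full_time_female = c(74.0, 75.0),
  part_time_female = c(26.0, 25.0),
  full_time_male   = c(87.5, 88.0),
  part_time_male   = c(12.5, 12.0)
)

female_total <- employed_toy$full_time_female + employed_toy$part_time_female
male_total   <- employed_toy$full_time_male + employed_toy$part_time_male

# TRUE when every yearly total is within the tolerance of 100
shares_ok <- all(abs(female_total - 100) < 0.2, abs(male_total - 100) < 0.2)
shares_ok
```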

#percentage full time women vs full time men, over time
ggplot(data=employed_gender, aes(x = year, y = full_time_female, colour='full-time female'))+
  geom_line(size = 1.5)+
  geom_line(aes(y = full_time_male, colour='full-time male'), size = 1.5)+
  geom_line(aes(y = part_time_female, colour='part-time female'), size = 1.5)+
  geom_line(aes(y = part_time_male, colour='part-time male'), size = 1.5)+
  labs(title = 'Female & Male Full-Time/Part-Time Percentage of Respective Gender Labor Force',
         x = 'Year',
         y = 'Percentage of Workforce')+
  scale_colour_manual(values = c("deeppink", "blue", "deeppink4", "darkblue"))

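By merging the datasets as mentioned above, we are able to create new variables: female_ft, male_ft, female_pt and male_pt, which are estimated population sizes instead of percentages. Note that, per its data dictionary, jobs_gender counts full-time workers only, so these scaled figures are rough approximations. Below is the code and visualization from this merge.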

# From the jobs_gender dataset:
jobs_gender_sum <- jobs_gender %>% 
  group_by(year) %>% 
  summarise(sum(total_workers), sum(workers_male), sum(workers_female))

# Filter employed_gender for 2013 - 2016
emplyed_gender_sub <- employed_gender %>% filter(year %in% c(2013:2016))

# Combine columns to form new dataset (relies on both tables having rows in the same year order)
recent_levels <- cbind(jobs_gender_sum, emplyed_gender_sub[,-1])
colnames(recent_levels) <- c("year", "total_workers", "total_male_workers", "total_female_workers", "tot_ft_percent", "tot_pt_percent", "fem_ft_percent", "fem_pt_percent", "mal_ft_percent", "mal_pt_percent")

# Mutate
recent_levels <- recent_levels %>% mutate(
  total_full_time = (tot_ft_percent/100)*total_workers,
  total_part_time = (tot_pt_percent/100)*total_workers,
  female_ft = (fem_ft_percent/100)*total_female_workers,
  female_pt = (fem_pt_percent/100)*total_female_workers,
  male_ft = (mal_ft_percent/100)*total_male_workers,
  male_pt = (mal_pt_percent/100)*total_male_workers,
  )

recent_levels <- select(recent_levels, year, total_workers, total_male_workers, total_female_workers, total_full_time, total_part_time, female_ft, female_pt, male_ft, male_pt, everything())

ggplot(data=recent_levels, aes(x = year, y = female_ft, colour="female_ft"))+
  geom_line(size = 1.5)+
  geom_line(aes(y = male_ft, colour="male_ft" ), size = 1.5)+
  geom_line(aes(y = female_pt, colour="female_pt"), size = 1.5)+
  geom_line(aes(y = male_pt, colour="male_pt"), size = 1.5)+
  labs(title = 'Female & Male Full-Time/ Part-Time Distribution',
         x = 'Year',
         y = 'Volume of Workforce')+
  scale_y_continuous(labels = scales::comma)+
  scale_colour_manual(values = c("deeppink", "deeppink4", "blue", "darkblue"))

The time series of these four categories shows that the number of full-time workers has increased since 2013 for both men and women, while the number of part-time workers has remained flat. This is new information that the percentages did not offer: we now know that the full-time population is rising. This is helpful as we begin to examine the availability of jobs for career women, but beyond that, this graphic does not offer many new insights.
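The merge above binds columns side by side, which is only correct when both tables list the years in the same order. A join keyed on year avoids that dependence. The following is a sketch under assumptions: the toy tables worker_totals and ft_pt_shares and their values are invented, and only the female columns are shown.

```r
library(dplyr)

# Toy yearly worker totals, deliberately out of order (invented values)
worker_totals <- data.frame(
  year = c(2014, 2013),
  total_female_workers = c(1100, 1000)
)

# Toy full-time / part-time shares in percent (invented values)
ft_pt_shares <- data.frame(
  year = c(2013, 2014),
  fem_ft_percent = c(75, 80),
  fem_pt_percent = c(25, 20)
)

levels_by_year <- worker_totals %>%
  inner_join(ft_pt_shares, by = "year") %>%   # match rows on year, not on position
  mutate(
    female_ft = fem_ft_percent / 100 * total_female_workers,
    female_pt = fem_pt_percent / 100 * total_female_workers
  ) %>%
  arrange(year)

levels_by_year
```

With inner_join(), years present in only one table are dropped instead of being silently paired with the wrong row.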

Categorical Analysis

Now that we have analyzed our data over time, we can similarly look for patterns in our data throughout the many categories and job sectors.

Wages within Major Categories

How often are there observed instances where women earn more than men?

Regardless of the participation rate of men and women in certain sectors, we would like to see the spread of wages within each major category. Below we created two sets of box plots, one for women and one for men. The axis limits are set equal to make the two figures easier to compare. Additionally, the datapoints are color-coded to add another layer of information. Red indicates that compensation for a particular job within the category favors men over women. Blue indicates that women are paid more than men for that particular job.

These figures give us valuable insight.

Right away we see that median wages are lower for women in every single area. It’s also interesting to note that women’s earnings within each category are more concentrated than those of their male counterparts. Without making a gross assumption, we wonder whether women could increase their earning potential by negotiating higher wages, as this could be one reason for the greater variability in men’s earnings.
The overwhelming number of red datapoints shows that women not only earn less on average but are also paid less in the large majority of individual roles. This is an important distinction, because the average alone could be skewed by a few outliers where men’s salaries are drastically greater than women’s. Instead, we see that the wage gap is widespread among and within the major categories, down to the individual job title.
The area of Natural Resources, Construction, and Maintenance appears to have the most equal ratio of red to blue datapoints, meaning that this category has the smallest bias in wage discrepancy. We should look into this area as a potential opportunity for women’s involvement.
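The ratio of red to blue datapoints can also be put into numbers. The following is a minimal sketch, not part of the original analysis: it uses a small invented data frame with the same column names as `jobs_gender` and computes, per major category, the share of occupations in which women out-earn men.

```r
# Toy data with the same column names as jobs_gender; the values are invented.
jobs_demo <- data.frame(
  major_category = c("Service", "Service", "Service",
                     "Natural Resources, Construction, and Maintenance",
                     "Natural Resources, Construction, and Maintenance"),
  total_earnings_female = c(30000, 28000, 41000, 45000, 39000),
  total_earnings_male   = c(35000, 27000, 46000, 44000, 42000)
)

# TRUE where women out-earn men in an occupation (the "blue" points).
jobs_demo$women_earn_more <-
  jobs_demo$total_earnings_female > jobs_demo$total_earnings_male

# Share of such occupations per major category.
# na.rm = TRUE skips occupations where either earnings value is missing.
share_blue <- tapply(jobs_demo$women_earn_more,
                     jobs_demo$major_category,
                     mean, na.rm = TRUE)

print(round(share_blue, 2))
```

Applied to the real `jobs_gender` table, a summary like this would let the visual impression from the box plots be checked category by category.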

# Box plots of female and male wages by major category
# Note: par(mfrow) has no effect on ggplot2 output, so the two plots are simply drawn one after the other.
ggplot(jobs_gender, aes(major_category, total_earnings_female)) +
  geom_boxplot() +
  geom_jitter(width = 0.25, alpha = 0.5, aes(color = total_earnings_female > total_earnings_male), show.legend = FALSE) +
  scale_y_continuous(limits=c(0,250000), name = NULL, labels = scales::dollar) +
  labs(x = NULL) +
  ggtitle("Spread of Women's Total Earnings by Category") +
  coord_flip()

ggplot(jobs_gender, aes(major_category, total_earnings_male)) +
  geom_boxplot() +
  geom_jitter(width = 0.25, alpha = 0.5)+
  scale_y_continuous(limits=c(0,250000), name = NULL, labels = scales::dollar) +
  labs(x = NULL) +
  ggtitle("Spread of Men's Total Earnings by Category") +
  coord_flip()

Participation within Major Categories

How does the participation compare between men and women among the major categories?

The prior question helped us identify which categories offered comparable wages between men and women. The purpose of this question is to visualize the concentration of men or women within the same categories. We therefore plotted comparison graphs showing the total number of men versus women for each job within the category. Additionally, we added a grey baseline to illustrate equal participation. If the trend line (shown in blue) lies above the baseline, more men than women work in the sector on average.

From the graphs below, we see that more men than women work in the following areas:

Computer, Engineering, and Science
Natural Resources, Construction, and Maintenance
Production, Transportation, and Material Moving
Management, Business, and Financial

Women gravitate towards the following areas:

Healthcare
Sales and Office
Service

We also see in general that participation levels are greater in the following areas:

Management, Business, and Financial
Sales and Office
Service

The observations from these figures seem rather stereotypical: women focus on services and aid while men focus on mathematics and business. An exploration that is not available to us at this moment (due to limited data) is how these general tendencies have shifted over time. Have they become less or more concentrated and defined? At this point we can only see a snapshot in time. From our visualization, we see opportunities to cross more borders and enter fields such as engineering or construction.
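The groupings above are read off the scatterplots, but the same comparison can be expressed as a single number per category. The sketch below is an illustration under assumptions: the data frame is invented, and only the column names (`major_category`, `workers_female`, `workers_male`) are taken from `jobs_gender`. It sums workers per category and computes the female share of the total.

```r
# Toy data with the same column names as jobs_gender; the counts are invented.
jobs_demo <- data.frame(
  major_category = c("Service", "Service",
                     "Computer, Engineering, and Science",
                     "Computer, Engineering, and Science"),
  workers_female = c(800, 200, 100, 300),
  workers_male   = c(400, 100, 500, 700)
)

# Sum female and male workers within each major category.
# The formula interface drops rows with missing counts before summing.
totals <- aggregate(cbind(workers_female, workers_male) ~ major_category,
                    data = jobs_demo, FUN = sum)

# Female share of all workers in the category; 0.5 corresponds to the grey baseline.
totals$female_share <- totals$workers_female /
  (totals$workers_female + totals$workers_male)

print(totals)
```

A share above 0.5 corresponds to a trend below the grey baseline in the figures (more women than men), and a share below 0.5 to a trend above it.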

ggplot(jobs_gender, aes(workers_female, workers_male)) +
  geom_jitter(width = 0.25, alpha = 0.5) +
  coord_cartesian(xlim=c(0,1500000), ylim=c(0,1500000)) +
  geom_abline(aes(slope=1, intercept=0), color="grey") +
  geom_smooth(method="lm") +
  facet_wrap( ~ major_category, nrow=2) +
  scale_y_continuous(name = "Number of Male Workers", labels = scales::comma) +
  scale_x_continuous(name = "Number of Female Workers", labels = scales::comma) +
  ggtitle("Male and Female Participation Levels")

Further Exploration of Sales and Office Category

The scatterplots above indicate that women are highly involved in Sales and Office, but on average are paid much less than men. Our hypothesis is that women tend to have administrative roles versus men in sales roles, which leads to the pay disparity. We can dive deeper into this category to determine if there is a true pay gap within roles or if the difference is due to different roles.

From the violin graph, a difference in roles clearly does not explain the gap on its own. Even when we look at administrative support alone, we still see higher pay for men than for women. We also note that women’s earnings appear to be capped around $52,000 within administrative support, without any outliers.

jobs_gender_filter <- jobs_gender %>% filter(major_category == "Sales and Office")
melted <- melt(jobs_gender_filter[,c(4,10,11)], id.vars= 1)

melted %>% 
  ggplot(aes(x = minor_category, y = value)) + 
    geom_violin(aes(fill = variable)) + 
    scale_y_continuous(name = NULL, labels= scales::dollar) +
    labs(x = NULL, title = "Wages within Sales and Office Category")
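The apparent cap on women’s earnings can be checked numerically rather than read off the violin plot. The following is a minimal sketch with invented values; the column names follow `jobs_gender`, and the minor category labels are used here only for illustration. It compares the highest male and female earnings within each minor category.

```r
# Toy data with the same column names as jobs_gender; the earnings are invented.
sales_demo <- data.frame(
  minor_category = c("Office and Administrative Support",
                     "Office and Administrative Support",
                     "Sales and Related",
                     "Sales and Related"),
  total_earnings_male   = c(48000, 61000, 52000, 90000),
  total_earnings_female = c(41000, 52000, 38000, 70000)
)

# Highest earnings observed for each gender within each minor category.
# The formula interface drops rows with missing earnings before taking the max.
max_by_role <- aggregate(cbind(total_earnings_male, total_earnings_female) ~ minor_category,
                         data = sales_demo, FUN = max)

# Distance between the top of the male range and the top of the female range.
max_by_role$ceiling_gap <- max_by_role$total_earnings_male -
  max_by_role$total_earnings_female

print(max_by_role)
```

A consistently positive `ceiling_gap` across minor categories would support the observation that the upper end of women’s earnings sits below that of men, independent of how the violin shapes are smoothed.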

Further Exploration of Service Category

Women are also heavily involved in the Service sector. Again using compensation to measure success, how do women compare to men?

The total earnings appear to be more comparable; however, men simply earn more across the board, in all jobs and sectors. Rarely have we found an area or job where women are paid more.

From the prior violin graph and this graph, we expect to see a smooth distribution of salaries within each minor category. Instead we notice an abrupt cutoff of the range in women’s salaries. Comically, it is as if we are quite literally seeing the “glass ceiling.”

jobs_gender_filter <- jobs_gender %>% filter(major_category == "Service")
melted <- melt(jobs_gender_filter[,c(4,10,11)], id.vars= 1)

melted %>% 
  ggplot(aes(x = minor_category, y = value)) + 
    geom_violin(aes(fill = variable)) + 
    scale_y_continuous(name = NULL, labels= scales::dollar) +
    labs(x = NULL, title = "Wages within Service Category")

Final Conclusions

As stated in this report during our midterm progress check, our goal is to use EDA to identify opportunities for women. From this exploration, we have gathered some important information regarding women in the workforce, in the past and in the present. While we have explored these ideas in depth in the prior sections, we summarize the concepts here:

The wage gap between men and women has slowly decreased in size and variability. Additionally, participation increased up until 2016, a change driven by men and women entering the workforce as full-time employees.
There exists an opportunity for women in the field of Natural Resources, Construction, and Maintenance. While few women have entered this area, wages are comparable. This is shown in the boxplots where a fair percentage of observations indicated women are paid more than men for a particular job. While there are certainly structural and cultural obstacles to overcome for women to enter the field of construction, our analysis shows that those who have overcome those challenges (if applicable) have received fair pay.
Women are highly concentrated in the service and health care sectors. There exists an opportunity to use their numbers to push for equal pay, since it is extremely evident that they are not receiving such. Individually, we can all negotiate higher salaries and wages to decrease the gap.

While we have seen the pay gap decrease over the years, we still see opportunity for improvement.

Should we ever obtain more complete time-series data that is equally detailed by job description and industry, we could uncover the progress women may have made over the years. This exploratory analysis should serve as a starting point to spread awareness of gender inequality in the workforce and start the discussion as to how women can better access industries they are not currently in as well as leverage their numbers in other sectors to improve their pay.

Women in the Workforce

Samantha Riser and Cassidy Peebles

12/1/2019
