Starting most dramatically in the 1950s, American women began to enter the workforce. Over the next half century, they increasingly left their homes to start careers. Since the 1990s, however, the United States has seen a decline in women’s labor force participation rate, which has only started to pick up in the last few years. This report focuses mainly on the following questions:
How have women been doing, comparatively, over the last 20 years?
How are things now, and where are we headed?
Women have been making space for themselves, and headlines, in recent times by requiring that their workplace concerns be heard and addressed. This analysis aims to discover what career prospects for women look like today and how the components that characterize a woman’s career, her position and income, have changed.
To address this, we examined the pay gap trend over time, the differences in part-time and full-time work in relation to gender, as well as how industry impacts a woman’s compensation in recent years. By approaching it from different facets, we will try to tell the story with multiple dimensions and not jump to conclusions about the complex social reasons for why the women’s labor participation rate may have changed.
Because this is still in development, we are unsure if we want to add a predictive element that would help pinpoint what the trends will look like in the future for women. Either way, our analysis is important because of the huge impact that women have on the economic output in the United States. How women are being compensated and making decisions about work will tell employers what to expect in years to come.
The Bureau of Labor Statistics illustrates women’s participation rate in the graph below.
library(png)
library(grid)
img <- readPNG("civilian_women.png")
grid.raster(img)
Disclaimer: It is important to note that the overall labor force participation rate has also decreased.
The following packages are required to rerun the R script included in this report without error:
#Loading the required packages
library(readr) # Read Data in csv format
library(tidyverse) # For manipulating data
library(dplyr) # Manipulating data in R and using pipe operator
library(DT) # HTML display of data
library(knitr) # Used to display an aligned table on screen
library(ggplot2) # Used to plot different graphs
library(qwraps2) # Used to make tables
library(formattable)# Markdown display of data
library(reshape2) # Used to melt datasets
This data was taken from a Tidy Tuesday challenge. Please read below for a synopsis of the data provided by the Tidy Tuesday GitHub Repository. Information on each of the datasets used can be found in the following tabs.
Women in the Workforce
March is Women’s History month, as such we’re exploring data from the Bureau of Labor Statistics and the Census Bureau about women in the workforce. There are historical data about women’s earnings and employment status, as well as detailed information about specific occupation and earnings from 2013-2016.
According to the AAUW - “The gender pay gap is the gap between what men and women are paid. Most commonly, it refers to the median annual pay of all women who work full time and year-round, compared to the pay of a similar cohort of men.”
The specific jobs data came from the Census Bureau and the historical data comes from the Bureau of Labor. The data is provided as is, and you recognize the limitations and issues in defining gender as binary.
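The AAUW definition quoted above can be made concrete with a small worked calculation. This is only a sketch: the two median earnings figures are hypothetical values chosen for illustration, not numbers taken from the datasets.

```r
# Hypothetical median annual earnings for full-time, year-round workers
# (illustrative values only, not taken from the data).
median_female <- 40000
median_male   <- 50000

# Women's earnings expressed as a percent of men's earnings.
percent_of_male <- 100 * median_female / median_male

# The pay gap is the remaining distance to parity (100 percent).
pay_gap <- 100 - percent_of_male

percent_of_male
pay_gap
```

The datasets in this report store the first quantity (women's earnings as a percent of men's), so a higher value corresponds to a smaller gap.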
There are limitations and assumptions in the datasets used, which should be noted at the beginning of this report.
First, the date range of each dataset varies greatly. This limited our view of the trends found in the data across all time periods.
Second, gender in these datasets is a binary male and female classification.
Lastly, we assume the data obtained and used in our analysis is valid and complete.
First we read in the semi-cleaned data. The data is held in the Tidy Tuesday GitHub repository. We read the csv file into this report directly from the repository.
jobs_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv")
## Parsed with column specification:
## cols(
## year = col_double(),
## occupation = col_character(),
## major_category = col_character(),
## minor_category = col_character(),
## total_workers = col_double(),
## workers_male = col_double(),
## workers_female = col_double(),
## percent_female = col_double(),
## total_earnings = col_double(),
## total_earnings_male = col_double(),
## total_earnings_female = col_double(),
## wage_percent_of_male = col_double()
## )
earnings_female <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/earnings_female.csv")
## Parsed with column specification:
## cols(
## Year = col_double(),
## group = col_character(),
## percent = col_double()
## )
employed_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/employed_gender.csv")
## Parsed with column specification:
## cols(
## year = col_double(),
## total_full_time = col_double(),
## total_part_time = col_double(),
## full_time_female = col_double(),
## part_time_female = col_double(),
## full_time_male = col_double(),
## part_time_male = col_double()
## )
The purpose of this dataset is to provide information on women’s salary as a percentage of men’s salary within the same age group. The time series component helps us see the change in percentage over time.
The first thing we should do is rename the variables to enhance clarity and understanding.
colnames(earnings_female) <- c("Year","Age_Group", "Percent_of_Male_Salary" )
This dataset includes three variables: Year, Age Group, and Percent of Male Salary.
datadict <- read_csv("data/earnings_female_datadict.csv")
colnames(datadict) <- c("Variable", "Class", "Description")
kable(datadict, format = "markdown")
| Variable | Class | Description |
|---|---|---|
| Year | integer | Year |
| group | character | Age group |
| percent | double | Female salary percent of male salary |
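Year ranges from 1979 to 2011
Age_Group uses the following age ranges:
16-19 years
20-24 years
25-34 years
35-44 years
45-54 years
55-64 years
65 years and older
Total, 16 years and older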
This dataset contains no missing values.
sum(is.na(earnings_female))
## [1] 0
It will be helpful in future visualizations to spread this dataset into wide format, with one row per year and one column per age group.
# Spread earnings_female
earnings_female_byage <- spread(earnings_female, key = Age_Group, value = Percent_of_Male_Salary)
datatable(earnings_female_byage, options = list(
autoWidth = TRUE,
columnDefs = list(list(className = 'dt-center', targets = 5)),
pageLength = 5,
lengthMenu = c(5, 10, 15, 20))
)
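The reshaping step above can also be sketched with tidyr::pivot_wider(), which the tidyr documentation recommends over spread() for new code. The toy data frame below is an assumption made for illustration; it reuses the renamed column names but its values are made up.

```r
library(tidyr)

# Toy long-format data with the renamed columns (values are made up).
toy_long <- data.frame(
  Year = c(2010, 2010, 2011, 2011),
  Age_Group = c("16-19 years", "20-24 years", "16-19 years", "20-24 years"),
  Percent_of_Male_Salary = c(90, 92, 91, 93)
)

# Reshape to one row per year and one column per age group.
toy_wide <- pivot_wider(
  toy_long,
  names_from = Age_Group,
  values_from = Percent_of_Male_Salary
)

toy_wide
```

The wide layout makes it easy to compare age groups side by side within a year, while the long layout remains the more convenient input for ggplot2.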
See below for a summary of the dataset and general structure:
options(qwraps2_markup = "markdown")
earnings_female <- as.data.frame(earnings_female)
summary_statistics <-
list(
"Female Earnings as % of Male" =
list(
"min" = ~min(.data$Percent_of_Male_Salary, na.rm = TRUE),
"mean (sd)" = ~qwraps2::mean_sd(.data$Percent_of_Male_Salary, na_rm = TRUE),
"median (Q1, Q3)" = ~qwraps2::median_iqr(.data$Percent_of_Male_Salary, na_rm = TRUE),
"max" = ~max(.data$Percent_of_Male_Salary, na.rm = TRUE),
"Missing" = ~sum(is.na(.data$Percent_of_Male_Salary))
)
)
print(qwraps2::summary_table(
dplyr::group_by(earnings_female, Age_Group),
summary_statistics
),
rtitle = "Summary Statistics Table for the Earnings_Female Data Set")
| Summary Statistics Table for the Earnings_Female Data Set | Age_Group: 16-19 years (N = 33) | Age_Group: 20-24 years (N = 33) | Age_Group: 25-34 years (N = 33) | Age_Group: 35-44 years (N = 33) | Age_Group: 45-54 years (N = 33) | Age_Group: 55-64 years (N = 33) | Age_Group: 65 years and older (N = 33) | Age_Group: Total, 16 years and older (N = 33) |
|---|---|---|---|---|---|---|---|---|
| Female Earnings as % of Male | ||||||||
| min | 85.2 | 76.3 | 67.5 | 58.3 | 56.8 | 58.9 | 65.9 | 62.3 |
| mean (sd) | 91.07 ± 2.42 | 90.00 ± 4.92 | 81.24 ± 6.42 | 70.44 ± 6.21 | 67.40 ± 6.45 | 66.96 ± 5.37 | 73.77 ± 3.91 | 74.18 ± 5.74 |
| median (Q1, Q3) | 91.40 (89.10, 92.90) | 91.90 (88.00, 93.80) | 82.40 (76.70, 86.90) | 72.50 (66.10, 75.20) | 67.70 (61.70, 73.50) | 65.30 (62.20, 72.70) | 74.40 (70.90, 76.40) | 75.80 (69.80, 79.40) |
| max | 94.6 | 95.4 | 92.3 | 79.9 | 76.5 | 75.4 | 80.9 | 82.2 |
| Missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
The purpose of this dataset is to provide quantitative information on the labor force within major and minor industries (categories). This includes the volume of participants in the workforce by gender, as well as the salary of these participants by gender. The time series component helps us see the change in volume and male/female ratios over time.
This dataset includes 12 variables: year, occupation, major_category, minor_category, total_workers, workers_male, workers_female, percent_female, total_earnings, total_earnings_male, total_earnings_female, wage_percent_of_male.
datadict <- read_csv("data/jobs_gender_datadict.csv")
colnames(datadict) <- c("Variable", "Class", "Description")
kable(datadict, format = "markdown")
| Variable | Class | Description |
|---|---|---|
| year | integer | Year |
| occupation | character | Specific job/career |
| major_category | character | Broad category of occupation |
| minor_category | character | Fine category of occupation |
| total_workers | double | Total estimated full-time workers > 16 years old |
| workers_male | double | Estimated MALE full-time workers > 16 years old |
| workers_female | double | Estimated FEMALE full-time workers > 16 years old |
| percent_female | double | The percent of females for specific occupation |
| total_earnings | double | Total estimated median earnings for full-time workers > 16 years old |
| total_earnings_male | double | Estimated MALE median earnings for full-time workers > 16 years old |
| total_earnings_female | double | Estimated FEMALE median earnings for full-time workers > 16 years old |
| wage_percent_of_male | double | Female wages as percent of male wages - NA for occupations with small sample size |
Observations
The dataset contains 2,088 observations
Year ranges from 2013 to 2016
Occupation includes 522 different job titles
The major category variable has 8 different categories
The minor category variable has 23 unique sub-categories
There are 846 records that are missing data in the wage_percent_of_male variable. According to the data dictionary, this is due to small sample sizes. For calculation purposes, we would like to remove as many NAs as possible without significantly misrepresenting our data. Where both total_earnings_female and total_earnings_male are available for an observation, we calculate the percentage from these median values and impute the missing value with the calculated percent.
# Impute only the missing percentages, on the same 0-100 scale as the reported values
jobs_gender <-
jobs_gender %>%
mutate(wage_percent_of_male = coalesce(wage_percent_of_male, 100 * total_earnings_female / total_earnings_male))
# Check our work
colSums(is.na(jobs_gender))
## year occupation major_category
## 0 0 0
## minor_category total_workers workers_male
## 0 0 0
## workers_female percent_female total_earnings
## 0 0 0
## total_earnings_male total_earnings_female wage_percent_of_male
## 4 65 69
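To illustrate the imputation idea on a small scale, the sketch below fills only the missing percentages and leaves reported values untouched. The toy_jobs data frame and its values are hypothetical; only the column names mirror jobs_gender.

```r
library(dplyr)

# Toy rows: one percentage already reported, one missing but computable,
# and one that cannot be computed because female earnings are missing.
toy_jobs <- data.frame(
  total_earnings_male   = c(50000, 40000, 60000),
  total_earnings_female = c(45000, 30000, NA),
  wage_percent_of_male  = c(90, NA, NA)
)

# coalesce() keeps the first non-missing value, so reported percentages
# are preserved and only the gaps are filled with the computed percent.
toy_jobs <- toy_jobs %>%
  mutate(
    wage_percent_of_male = coalesce(
      wage_percent_of_male,
      100 * total_earnings_female / total_earnings_male
    )
  )

toy_jobs$wage_percent_of_male
```

Rows where either earnings figure is missing stay NA, which matches the remaining missing values discussed in this section.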
The following are logical reasons for missing values and will not be corrected:
In 18 observations, there were no female workers. As a result, total_earnings_female and wage_percent_of_male are NA.
In 3 observations, there were no male workers. As a result, total_earnings_male and wage_percent_of_male are NA.
jobs_gender %>%
filter(workers_female == 0 | workers_male == 0) %>%
select(-c("major_category", "minor_category"))
## # A tibble: 21 x 10
## year occupation total_workers workers_male workers_female
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 2013 Nurse mid~ 2817 0 2817
## 2 2013 Septic ta~ 6191 6191 0
## 3 2013 Roof bolt~ 3576 3576 0
## 4 2013 Roustabou~ 9766 9766 0
## 5 2013 Mine shut~ 847 847 0
## 6 2014 Nurse mid~ 4490 0 4490
## 7 2014 Roof bolt~ 2812 2812 0
## 8 2014 Helpers--~ 4323 4323 0
## 9 2014 Electrica~ 972 972 0
## 10 2014 Railroad ~ 3808 3808 0
## # ... with 11 more rows, and 5 more variables: percent_female <dbl>,
## # total_earnings <dbl>, total_earnings_male <dbl>,
## # total_earnings_female <dbl>, wage_percent_of_male <dbl>
There are still some missing values because we do not have total earnings data for either male or female or both for 48 observations. If these missing values cause issues in our Exploratory Data Analysis, we will remove them. For now, the observations provide enough other useful information to keep in the dataset. Regarding the missing values:
47 are records of ‘total_earnings_female’
1 is a record of ‘total_earnings_male’
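The breakdown of missing values by cause can be tallied in one pass. The following is a minimal sketch on hypothetical rows (the reason labels and toy values are assumptions made for illustration), using dplyr::case_when() to assign each observation a single reason.

```r
library(dplyr)

# Toy rows covering each situation described above (values are made up).
toy_jobs <- data.frame(
  workers_male          = c(100, 0, 200, 150),
  workers_female        = c(0, 80, 5, 120),
  total_earnings_male   = c(40000, NA, 45000, 50000),
  total_earnings_female = c(NA, 38000, NA, 47000)
)

# Conditions are checked in order, so structural reasons (no workers of
# one gender) take priority over earnings that were simply not reported.
missing_reasons <- toy_jobs %>%
  mutate(
    reason = case_when(
      workers_female == 0 ~ "no female workers",
      workers_male == 0 ~ "no male workers",
      is.na(total_earnings_female) | is.na(total_earnings_male) ~ "earnings not reported",
      TRUE ~ "complete"
    )
  ) %>%
  count(reason)

missing_reasons
```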
See below for a preview of the dataset.
datatable(jobs_gender, options = list(
columnDefs = list(list(className = 'dt-center', targets = 5)),
pageLength = 5,
lengthMenu = c(5, 10, 15, 20)
))
See below for a summary of the dataset and general structure:
options(qwraps2_markup = "markdown")
jobs_gender <- as.data.frame(jobs_gender)
summary_statistics <-
list(
"Total Workers" =
list(
"min" = ~min(.data$total_workers, na.rm = TRUE),
"mean (sd)" = ~qwraps2::mean_sd(.data$total_workers, na_rm = TRUE),
"median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_workers, na_rm = TRUE),
"max" = ~max(.data$total_workers, na.rm = TRUE),
"Missing" = ~sum(is.na(.data$total_workers))
),
"Male Workers" =
list(
"min" = ~min(.data$workers_male, na.rm = TRUE),
"mean (sd)" = ~qwraps2::mean_sd(.data$workers_male, na_rm = TRUE),
"median (Q1, Q3)" = ~qwraps2::median_iqr(.data$workers_male, na_rm = TRUE),
"max" = ~max(.data$workers_male, na.rm = TRUE),
"Missing" = ~sum(is.na(.data$workers_male))
),
"Female Workers" =
list(
"min" = ~min(.data$workers_female, na.rm = TRUE),
"mean (sd)" = ~qwraps2::mean_sd(.data$workers_female, na_rm = TRUE),
"median (Q1, Q3)" = ~qwraps2::median_iqr(.data$workers_female, na_rm = TRUE),
"max" = ~max(.data$workers_female, na.rm = TRUE),
"Missing" = ~sum(is.na(.data$workers_female))
),
"Total Earnings" =
list(
"min" = ~min(.data$total_earnings, na.rm = TRUE),
"mean (sd)" = ~qwraps2::mean_sd(.data$total_earnings, na_rm = TRUE),
"median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_earnings, na_rm = TRUE),
"max" = ~max(.data$total_earnings, na.rm = TRUE),
"Missing" = ~sum(is.na(.data$total_earnings))
),
"Male Earnings" =
list(
"min" = ~min(.data$total_earnings_male, na.rm = TRUE),
"mean (sd)" = ~qwraps2::mean_sd(.data$total_earnings_male, na_rm = TRUE),
"median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_earnings_male, na_rm = TRUE),
"max" = ~max(.data$total_earnings_male, na.rm = TRUE),
"Missing" = ~sum(is.na(.data$total_earnings_male))
),
"Female Earnings" =
list(
"min" = ~min(.data$total_earnings_female, na.rm = TRUE),
"mean (sd)" = ~qwraps2::mean_sd(.data$total_earnings_female, na_rm = TRUE),
"median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_earnings_female, na_rm = TRUE),
"max" = ~max(.data$total_earnings_female, na.rm = TRUE),
"Missing" = ~sum(is.na(.data$total_earnings_female))
)
)
#table <- summary_table(jobs_gender, summary_statistics)
#print(table, rtitle = "Summary Statistics Table for the Jobs_Gender Data Set")
print(qwraps2::summary_table(
dplyr::group_by(jobs_gender, major_category),
summary_statistics
),
rtitle = "Summary Statistics Table for the Jobs_Gender Data Set")
| Summary Statistics Table for the Jobs_Gender Data Set | major_category: Computer, Engineering, and Science (N = 236) | major_category: Education, Legal, Community Service, Arts, and Media (N = 168) | major_category: Healthcare Practitioners and Technical (N = 128) | major_category: Management, Business, and Financial (N = 232) | major_category: Natural Resources, Construction, and Maintenance (N = 328) | major_category: Production, Transportation, and Material Moving (N = 444) | major_category: Sales and Office (N = 280) | major_category: Service (N = 272) |
|---|---|---|---|---|---|---|---|---|
| Total Workers | ||||||||
| min | 836 | 5424 | 2817 | 5439 | 972 | 658 | 1413 | 747 |
| mean (sd) | 117,869.59 ± 176,014.30 | 244,239.31 ± 397,711.18 | 193,852.91 ± 395,191.26 | 327,296.72 ± 533,497.95 | 121,477.00 ± 202,629.98 | 111,801.25 ± 301,312.90 | 332,110.75 ± 528,970.43 | 210,631.98 ± 315,720.78 |
| median (Q1, Q3) | 50,892.00 (16,201.00, 136,846.50) | 102,709.50 (45,149.25, 330,168.25) | 84,206.00 (32,657.75, 157,746.00) | 139,287.00 (46,871.00, 440,238.75) | 34,171.50 (14,086.25, 119,757.00) | 24,677.50 (7,328.00, 73,449.50) | 94,830.00 (36,482.25, 386,638.75) | 68,956.50 (27,538.50, 233,804.75) |
| max | 1124661 | 2398445 | 2317493 | 3758629 | 1180144 | 2695081 | 2677578 | 1553626 |
| Missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Male Workers | ||||||||
| min | 219 | 1239 | 0 | 2824 | 972 | 296 | 499 | 747 |
| mean (sd) | 89,785.52 ± 141,174.31 | 93,467.14 ± 137,818.97 | 54,026.68 ± 92,402.62 | 185,876.49 ± 340,101.33 | 116,804.15 ± 197,077.30 | 90,149.27 ± 271,783.08 | 138,123.15 ± 263,065.40 | 106,253.38 ± 199,365.29 |
| median (Q1, Q3) | 37,386.00 (10,951.25, 102,952.25) | 36,864.50 (19,942.75, 76,605.75) | 25,475.00 (5,832.25, 67,886.50) | 92,677.00 (17,769.25, 180,957.25) | 33,094.50 (12,412.50, 118,083.25) | 19,659.50 (5,990.50, 54,168.75) | 34,221.00 (12,493.50, 116,202.75) | 34,010.50 (11,032.25, 94,528.00) |
| max | 918865 | 575443 | 495061 | 2472383 | 1150257 | 2570385 | 1537529 | 1156110 |
| Missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Female Workers | ||||||||
| min | 146 | 3220 | 953 | 1018 | 0 | 0 | 535 | 0 |
| mean (sd) | 28,084.07 ± 40,199.23 | 150,772.17 ± 293,549.25 | 139,826.23 ± 340,168.19 | 141,420.22 ± 218,741.15 | 4,672.85 ± 9,673.94 | 21,651.98 ± 48,458.16 | 193,987.60 ± 345,314.14 | 104,378.60 ± 188,375.28 |
| median (Q1, Q3) | 10,357.50 (3,141.00, 35,682.25) | 44,728.50 (21,860.25, 190,418.25) | 54,487.00 (10,984.75, 102,121.75) | 52,883.00 (23,452.75, 159,442.25) | 1,144.00 (438.00, 4,741.75) | 2,942.00 (1,168.50, 12,378.00) | 57,055.50 (21,173.25, 193,616.25) | 28,604.50 (8,963.75, 90,688.25) |
| max | 205796 | 1867475 | 2036445 | 1286246 | 83108 | 300040 | 2290818 | 1177169 |
| Missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Total Earnings | ||||||||
| min | 40464 | 21125 | 31530 | 36471 | 20420 | 20726 | 20251 | 17266 |
| mean (sd) | 76,561.88 ± 18,417.48 | 49,832.49 ± 16,935.33 | 74,912.03 ± 38,164.20 | 65,564.55 ± 17,041.79 | 43,898.75 ± 12,522.12 | 39,516.50 ± 15,143.63 | 40,359.07 ± 13,396.93 | 34,626.52 ± 15,006.65 |
| median (Q1, Q3) | 76,990.50 (62,014.75, 90,353.25) | 47,479.50 (41,926.25, 52,127.25) | 62,272.50 (46,511.75, 92,160.50) | 62,191.50 (53,092.75, 73,425.00) | 41,761.50 (35,451.50, 51,027.50) | 35,788.50 (29,410.50, 45,357.75) | 37,216.50 (31,841.50, 46,557.50) | 30,432.50 (24,747.50, 41,139.25) |
| max | 141359 | 122073 | 201542 | 130293 | 88901 | 102155 | 111522 | 90571 |
| Missing | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| Male Earnings | ||||||||
| min | 23794 | 25873 | 35640 | 41164 | 22957 | 21536 | 21105 | 12147 |
| mean (sd) | 80,204.30 ± 19,379.94 | 54,402.95 ± 18,887.57 | 124; 81,486.81 ± 42,023.71 | 73,717.34 ± 19,208.39 | 44,270.04 ± 12,484.74 | 41,299.60 ± 14,915.35 | 44,986.89 ± 15,296.61 | 36,805.50 ± 15,157.04 |
| median (Q1, Q3) | 81,433.50 (67,259.75, 91,832.75) | 50,893.00 (45,094.00, 57,843.25) | 124; 71,213.50 (49,938.50, 101,071.50) | 71,393.50 (60,927.75, 82,328.25) | 41,984.50 (35,520.00, 51,208.75) | 37,683.50 (31,160.00, 46,874.00) | 41,365.50 (35,609.00, 52,830.50) | 31,805.50 (26,347.00, 41,808.25) |
| max | 150247 | 136043 | 231420 | 141108 | 88919 | 102479 | 115432 | 90912 |
| Missing | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 |
| Female Earnings | ||||||||
| min | 33376 | 20748 | 31126 | 25310 | 11080 | 7447 | 19688 | 16771 |
| mean (sd) | 235; 69,427.31 ± 17,804.86 | 46,257.66 ± 14,032.47 | 68,887.46 ± 31,559.21 | 59,070.36 ± 15,413.74 | 282; 38,549.11 ± 15,510.49 | 429; 32,437.85 ± 13,667.56 | 37,105.95 ± 10,377.21 | 269; 31,987.96 ± 13,382.61 |
| median (Q1, Q3) | 235; 68,925.00 (56,797.50, 80,889.00) | 45,081.00 (38,248.75, 50,808.50) | 60,984.00 (45,444.50, 91,083.25) | 56,810.00 (49,981.00, 66,007.75) | 282; 35,580.00 (29,752.25, 43,862.50) | 429; 27,883.00 (24,268.00, 36,241.00) | 35,631.00 (30,458.75, 41,380.50) | 269; 28,384.00 (22,291.00, 38,088.00) |
| max | 120253 | 102484 | 166388 | 131780 | 158929 | 130660 | 90274 | 100508 |
| Missing | 1 | 0 | 0 | 0 | 46 | 15 | 0 | 3 |
This dataset provides another piece of information regarding women in the workforce. Here, we are given the percentage of employed women and men who usually work full time or part time, by year.
This dataset includes 7 variables: year, total_full_time, total_part_time, full_time_female, part_time_female, full_time_male, part_time_male.
datadict <- read_csv("data/employed_gender_datadict.csv")
colnames(datadict) <- c("Variable", "Class", "Description")
kable(datadict, format = "markdown")
| Variable | Class | Description |
|---|---|---|
| year | double | Year |
| total_full_time | double | Percent of total employed people usually working full time |
| total_part_time | double | Percent of total employed people usually working part time |
| full_time_female | double | Percent of employed women usually working full time |
| part_time_female | double | Percent of employed women usually working part time |
| full_time_male | double | Percent of employed men usually working full time |
| part_time_male | double | Percent of employed men usually working part time |
Observations
The dataset contains 49 observations
Year ranges from 1968 to 2016
There are no missing values in this dataset.
See below for a preview of the dataset.
datatable(employed_gender, options = list(
columnDefs = list(list(className = 'dt-center', targets = 5)),
pageLength = 5,
lengthMenu = c(5, 10, 15, 20)
))
See below for a summary of the dataset and general structure:
options(qwraps2_markup = "markdown")
employed_gender <- as.data.frame(employed_gender)
summary_statistics <-
list(
"Total Full Time" =
list(
"min" = ~min(.data$total_full_time, na.rm = TRUE),
"mean (sd)" = ~qwraps2::mean_sd(.data$total_full_time, na_rm = TRUE),
"median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_full_time, na_rm = TRUE),
"max" = ~max(.data$total_full_time, na.rm = TRUE),
"Missing" = ~sum(is.na(.data$total_full_time))
),
"Total Part Time" =
list(
"min" = ~min(.data$total_part_time, na.rm = TRUE),
"mean (sd)" = ~qwraps2::mean_sd(.data$total_part_time, na_rm = TRUE),
"median (Q1, Q3)" = ~qwraps2::median_iqr(.data$total_part_time, na_rm = TRUE),
"max" = ~max(.data$total_part_time, na.rm = TRUE),
"Missing" = ~sum(is.na(.data$total_part_time))
),
"Full Time - Female" =
list(
"min" = ~min(.data$full_time_female, na.rm = TRUE),
"mean (sd)" = ~qwraps2::mean_sd(.data$full_time_female, na_rm = TRUE),
"median (Q1, Q3)" = ~qwraps2::median_iqr(.data$full_time_female, na_rm = TRUE),
"max" = ~max(.data$full_time_female, na.rm = TRUE),
"Missing" = ~sum(is.na(.data$full_time_female))
),
"Part Time - Female" =
list(
"min" = ~min(.data$part_time_female, na.rm = TRUE),
"mean (sd)" = ~qwraps2::mean_sd(.data$part_time_female, na_rm = TRUE),
"median (Q1, Q3)" = ~qwraps2::median_iqr(.data$part_time_female, na_rm = TRUE),
"max" = ~max(.data$part_time_female, na.rm = TRUE),
"Missing" = ~sum(is.na(.data$part_time_female))
)
)
table <- summary_table(employed_gender, summary_statistics)
print(table, rtitle = "Summary Statistics Table for the Employed_Gender Data Set")
| Summary Statistics Table for the Employed_Gender Data Set | employed_gender (N = 49) |
|---|---|
| Total Full Time | |
| min | 80.3 |
| mean (sd) | 82.64 ± 1.24 |
| median (Q1, Q3) | 82.60 (81.80, 83.20) |
| max | 86 |
| Missing | 0 |
| Total Part Time | |
| min | 14 |
| mean (sd) | 17.36 ± 1.24 |
| median (Q1, Q3) | 17.40 (16.80, 18.20) |
| max | 19.7 |
| Missing | 0 |
| Full Time - Female | |
| min | 71.9 |
| mean (sd) | 73.86 ± 0.97 |
| median (Q1, Q3) | 73.90 (73.20, 74.70) |
| max | 75.4 |
| Missing | 0 |
| Part Time - Female | |
| min | 24.6 |
| mean (sd) | 26.14 ± 0.97 |
| median (Q1, Q3) | 26.10 (25.30, 26.80) |
| max | 28.1 |
| Missing | 0 |
Our Exploratory Data Analysis will begin with simple visualizations to gain a general understanding of the trends in our data. Due to the nature of the timelines given in the datasets, our approach to analysing the data is two-fold:
first, we attempt to identify overarching trends in the 3 decades following the initial introduction of women in the workforce;
second, we search for trends in the most recent years of our data to help us understand where women in the workforce are today.
What has the pay gap looked like over these three decades?
From the time series of boxplots below, we see two important trends. First, women’s salary as a percentage of men’s salary increases from 1979 to 2011, which indicates that women are earning increasingly comparable wages to their male counterparts. Second, the variation of this salary ratio across age groups decreases as time progresses.
For understanding: This boxplot shows the spread of women’s salary as a percentage of men’s salary over the years. Each box summarizes the spread of percentages across the age groups within one year.
boxplot(earnings_female$Percent_of_Male_Salary ~ earnings_female$Year,
main = "Distribution of Women's Salary as a percent of Males' Salary
1979 - 2011",
xlab = "Year",
ylab = "Percent of Male Salary"
)
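The two trends read off the boxplots can also be checked numerically by summarizing each year's center and spread. The sketch below uses a hypothetical toy data frame with made-up values; with the real data, the same pipeline could be applied to earnings_female.

```r
library(dplyr)

# Toy data: four age-group percentages for each of two years (made up).
toy_earnings <- data.frame(
  Year = rep(c(1979, 2011), each = 4),
  Percent_of_Male_Salary = c(60, 70, 80, 90, 80, 84, 88, 92)
)

# The median tracks the level of each box; the IQR tracks its height.
yearly_spread <- toy_earnings %>%
  group_by(Year) %>%
  summarise(
    median_percent = median(Percent_of_Male_Salary),
    iqr_percent = IQR(Percent_of_Male_Salary)
  )

yearly_spread
```

A rising median_percent together with a shrinking iqr_percent would correspond to the pattern described above.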
Is age a factor when it comes to the pay gap between men and women?
We can look at the above chart in a different light by stratifying the data according to the age group of workers. In the plot below, younger age groups show a smaller wage gap. We can hypothesize that this is due to the types of jobs that workers between the ages of 16 and 24 hold. As they are most likely still in school, these jobs may be temporary or part-time positions that pay a standard rate. As age increases, we see more of a gap between male and female salaries. These older age groups are also where the most progress is made over the years, which brings up the median and average percentage of male salary.
ggplot(data=earnings_female, aes(x = Year, y = Percent_of_Male_Salary, group = Age_Group, colour = Age_Group))+
geom_line(size = 1)+
labs(title = 'Female Salary as a Percentage of Male Salary, Grouped By Age')
The trends shown in the above graphs indicate improvement towards gender equality in the workforce.
What is the trend of Part-Time and Full-Time workers by both male and female over the years?
We created a graph in an attempt to answer this question. Overall, this chart does not provide much new insight. We can see that the percentage of the labor force that works part-time follows a similar trend regardless of gender, and the same goes for the percentage that works full-time. We would still like to know:
Because this is only in percentages, we do not get a true understanding of the population totals. Since the data were standardized to give only part-time and full-time percentages, we wanted to scale them by the size of the workforce. There may be a hidden insight here, so we decided to merge the jobs_gender dataset with the employed_gender dataset.
For understanding: The Y-axis refers to percentage of the labor force, but respective to gender. For example, % of part-time female + % full-time female adds up to 100%, which constitutes 100% of employed women.
#percentage full time women vs full time men, over time
ggplot(data=employed_gender, aes(x = year, y = full_time_female, colour='full-time female'))+
geom_line(size = 1.5)+
geom_line(aes(y = full_time_male, colour='full-time male'), size = 1.5)+
geom_line(aes(y = part_time_female, colour='part-time female'), size = 1.5)+
geom_line(aes(y = part_time_male, colour='part-time male'), size = 1.5)+
labs(title = 'Female & Male Full-Time/Part-Time Percentage of Respective Gender Labor Force',
x = 'Year',
y = 'Percentage of Workforce')+
scale_colour_manual(values = c("deeppink", "blue", "deeppink4", "darkblue"))
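The claim that the full-time and part-time percentages sum to 100% within each gender can be verified directly. This is a small sketch on hypothetical values (the toy_employed data frame is an assumption for illustration); the same check could be run on employed_gender.

```r
# Toy percentages using the employed_gender column names (values made up).
toy_employed <- data.frame(
  year = c(2015, 2016),
  full_time_female = c(74.6, 75.1),
  part_time_female = c(25.4, 24.9),
  full_time_male = c(87.6, 87.8),
  part_time_male = c(12.4, 12.2)
)

female_total <- toy_employed$full_time_female + toy_employed$part_time_female
male_total <- toy_employed$full_time_male + toy_employed$part_time_male

# all.equal() tolerates floating-point rounding in the sums.
isTRUE(all.equal(female_total, rep(100, nrow(toy_employed))))
isTRUE(all.equal(male_total, rep(100, nrow(toy_employed))))
```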
By merging the data sets as mentioned above, we are able to create new variables: female_ft, male_ft, female_pt and male_pt, which are the respective population sizes instead of percentages. Below is the code and visualization from this merge.
# From the jobs_gender dataset:
jobs_gender_sum <- jobs_gender %>%
group_by(year) %>%
summarise(sum(total_workers), sum(workers_male), sum(workers_female))
# Filter employed_gender for 2013 - 2016
employed_gender_sub <- employed_gender %>% filter(year %in% 2013:2016)
# Combine columns to form new dataset (both inputs have one row per year, in the same order)
recent_levels <- cbind(jobs_gender_sum, employed_gender_sub[,-1])
colnames(recent_levels) <- c("year", "total_workers", "total_male_workers", "total_female_workers", "tot_ft_percent", "tot_pt_percent", "fem_ft_percent", "fem_pt_percent", "mal_ft_percent", "mal_pt_percent")
# Mutate
recent_levels <- recent_levels %>% mutate(
total_full_time = (tot_ft_percent/100)*total_workers,
total_part_time = (tot_pt_percent/100)*total_workers,
female_ft = (fem_ft_percent/100)*total_female_workers,
female_pt = (fem_pt_percent/100)*total_female_workers,
male_ft = (mal_ft_percent/100)*total_male_workers,
male_pt = (mal_pt_percent/100)*total_male_workers,
)
recent_levels <- select(recent_levels, year, total_workers, total_male_workers, total_female_workers, total_full_time, total_part_time, female_ft, female_pt, male_ft, male_pt, everything())
ggplot(data=recent_levels, aes(x = year, y = female_ft, colour="female_ft"))+
geom_line(size = 1.5)+
geom_line(aes(y = male_ft, colour="male_ft" ), size = 1.5)+
geom_line(aes(y = female_pt, colour="female_pt"), size = 1.5)+
geom_line(aes(y = male_pt, colour="male_pt"), size = 1.5)+
labs(title = 'Female & Male Full-Time/ Part-Time Distribution',
x = 'Year',
y = 'Volume of Workforce')+
scale_y_continuous(labels = scales::comma)+
scale_colour_manual(values = c("deeppink", "deeppink4", "blue", "darkblue"))
The time series of these 4 categories demonstrates that the sheer quantity of full-time workers has increased since 2013, for both men and women, and that the quantity of part-time workers has stayed stagnant since 2013. This is new information that the percentages did not offer. We now know that the full-time population is rising. This is helpful as we begin to examine the availability of jobs for career women, but besides that, this graphic does not offer many new insights.
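The conversion from percentages to head counts used to build recent_levels can be sketched on toy data. As a design variation rather than the method used in this report, the sketch joins the two tables by year with dplyr::inner_join(), which does not rely on the rows being in the same order the way cbind() does. All values and the toy_* names are hypothetical.

```r
library(dplyr)

# Toy yearly worker counts and percentages (values are made up).
toy_counts <- data.frame(
  year = c(2013, 2014),
  total_female_workers = c(1000, 1200)
)
toy_percents <- data.frame(
  year = c(2014, 2013),  # deliberately out of order
  fem_ft_percent = c(75, 74),
  fem_pt_percent = c(25, 26)
)

# Match rows on year, then scale each percentage by the worker count.
toy_levels <- toy_counts %>%
  inner_join(toy_percents, by = "year") %>%
  mutate(
    female_ft = (fem_ft_percent / 100) * total_female_workers,
    female_pt = (fem_pt_percent / 100) * total_female_workers
  )

toy_levels
```

Because the two percentages sum to 100 within a year, female_ft and female_pt add back up to total_female_workers.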
Now that we have analyzed our data over time, we can similarly look for patterns in our data throughout the many categories and job sectors.
How often are there observed instances where women earn more than men?
Regardless of the participation rate of men and women in certain sectors, we would like to see the spread of wages within each major category. Below we created two sets of box plots, one for women and one for men. The axis limits are set equal to enhance comparability between the two figures. Additionally, the datapoints are color coordinated to add another layer of information. Red indicates that compensation for a particular job within the category favors men over women. Blue indicates that women are paid more than men for that particular job.
These figures give us valuable insight.
Right away we see that median wages are lower for women in every single area. It’s also interesting to note that the spread of women’s earnings within each category is more concentrated than that of their male counterparts. Without making a gross assumption, we wonder if women could increase their earning potential by negotiating higher wages, as this could be a reason for men’s significant variability in earnings.
The overwhelming number of red datapoints shows that not only do women earn less on average, but that they are paid less in the large majority of roles. This is an important distinction, because the average could be skewed by a few outliers where men’s salaries are drastically greater than women’s. Instead, we see that the wage gap is widespread among and within the major categories, down to the individual job title.
The area of Natural Resources, Construction, and Maintenance appears to have the most equal ratio of red versus blue datapoints, meaning that this category has the smallest bias in wage discrepancy. We should look into this area as a potential opportunity for women’s involvement.
# Box plots of female and male wages by major category.
# ggplot2 draws with grid, so par(mfrow) has no effect here;
# each plot is rendered as its own figure.

# Women's earnings: points are colored by whether women out-earn men in that job.
ggplot(jobs_gender, aes(major_category, total_earnings_female)) +
  geom_boxplot() +
  geom_jitter(width = 0.25, alpha = 0.5,
              aes(color = total_earnings_female > total_earnings_male),
              show.legend = FALSE) +
  scale_y_continuous(limits = c(0, 250000), name = NULL, labels = scales::dollar) +
  labs(x = NULL) +
  ggtitle("Spread of Women's Total Earnings by Category") +
  coord_flip()

# Men's earnings: same axis limits so the two figures are comparable.
ggplot(jobs_gender, aes(major_category, total_earnings_male)) +
  geom_boxplot() +
  geom_jitter(width = 0.25, alpha = 0.5) +
  scale_y_continuous(limits = c(0, 250000), name = NULL, labels = scales::dollar) +
  labs(x = NULL) +
  ggtitle("Spread of Men's Total Earnings by Category") +
  coord_flip()
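The ratio of red to blue points discussed above can also be checked numerically. The following is a minimal sketch, not part of the original analysis: it uses a small toy data frame with hypothetical values (only the column names mirror `jobs_gender`) and base R, and computes the share of jobs in each major category where women earn more than men.

```r
# Toy data with hypothetical values; only the column names mirror jobs_gender.
toy_jobs <- data.frame(
  major_category = c("Service", "Service",
                     "Sales and Office", "Sales and Office",
                     "Natural Resources, Construction, and Maintenance",
                     "Natural Resources, Construction, and Maintenance"),
  total_earnings_female = c(30000, 28000, 35000, 41000, 45000, 39000),
  total_earnings_male   = c(33000, 27000, 47000, 50000, 43000, 44000)
)

# TRUE where women out-earn men in a given job (the "blue" points).
toy_jobs$female_higher <-
  toy_jobs$total_earnings_female > toy_jobs$total_earnings_male

# Share of jobs per category in which women earn more.
# na.rm = TRUE skips jobs where either wage is missing.
share_female_higher <- tapply(toy_jobs$female_higher,
                              toy_jobs$major_category,
                              mean, na.rm = TRUE)
print(round(share_female_higher, 2))
```

Applied to the real data, a share close to 0.5 would correspond to a roughly even mix of red and blue points in a category, while a share near 0 would correspond to an almost entirely red one.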
How does the participation compare between men and women among the major categories?
The prior question helped us identify which categories offered comparable wages between men and women. The purpose of this question is to visualize the concentration of men or women within the same categories. We therefore plotted comparison graphs showing the total number of men versus women for each job within the category. Additionally, we added a grey baseline to illustrate equal participation. If the fitted trend line (shown in blue) lies above the grey line, more men than women work in the sector on average.
From the graphs below, we see that more men than women work in the following areas:
Women gravitate towards the following areas:
We also see in general that participation levels are greater in the following areas:
The observations from these figures seem rather stereotypical: women focus on services and aid while men focus on mathematics and business. An exploration that is not available to us at this moment (due to limited data) would be how these general tendencies have shifted over time. Have they become more or less concentrated and defined? At this point we can only see a snapshot in time. From our visualization, we see opportunities for women to cross more borders and enter fields such as engineering or construction.
# Male vs. female worker counts per job, one panel per major category.
# The grey line marks equal participation; the blue line is a linear fit.
ggplot(jobs_gender, aes(workers_female, workers_male)) +
  geom_jitter(width = 0.25, alpha = 0.5) +
  coord_cartesian(xlim = c(0, 1500000), ylim = c(0, 1500000)) +
  geom_abline(slope = 1, intercept = 0, color = "grey") +
  geom_smooth(method = "lm") +
  facet_wrap(~ major_category, nrow = 2) +
  scale_y_continuous(name = "Number of Male Workers", labels = scales::comma) +
  scale_x_continuous(name = "Number of Female Workers", labels = scales::comma) +
  ggtitle("Male and Female Participation Levels")
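To put a number on what the panels show, one option is to compute the share of female workers in each major category. The sketch below is illustrative only: the worker counts are hypothetical, the column names are assumed to match `jobs_gender`, and it uses base R so that it runs without extra packages.

```r
# Toy data with hypothetical worker counts; column names mirror jobs_gender.
toy_jobs <- data.frame(
  major_category = c("Service", "Service",
                     "Natural Resources, Construction, and Maintenance",
                     "Natural Resources, Construction, and Maintenance"),
  workers_female = c(800, 200, 100, 300),
  workers_male   = c(200, 300, 400, 700)
)

# Sum worker counts over all jobs within each major category.
totals <- aggregate(cbind(workers_female, workers_male) ~ major_category,
                    data = toy_jobs, FUN = sum)

# Share of workers who are women; values below 0.5 mean more men than women,
# which corresponds to points lying above the grey equal-participation line.
totals$share_female <- totals$workers_female /
  (totals$workers_female + totals$workers_male)
print(totals)
```

A share summarizes the whole category, whereas the fitted line in the plot is driven by individual jobs, so the two views can disagree when a few very large occupations dominate a category.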
The scatterplots above indicate that women are highly involved in Sales and Office, but on average are paid much less than men. Our hypothesis is that women tend to hold administrative roles while men hold sales roles, which could explain the pay disparity. We can dive deeper into this category to determine whether there is a true pay gap within roles or whether the difference is due to different roles.
The violin plot shows that the difference in roles does not explain the gap. Even when the category is broken down and we look only at administrative support, we still see higher pay for men than for women. We also note that women’s earnings appear to be capped around $52,000 within administrative support, without any outliers.
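# Keep only the Sales and Office jobs.
jobs_gender_filter <- jobs_gender %>% filter(major_category == "Sales and Office")
# Reshape to long format: one row per job and gender-specific wage.
# Columns are selected by name (positions 4, 10, 11 in the original) for robustness.
melted <- melt(
  jobs_gender_filter[, c("minor_category", "total_earnings_male", "total_earnings_female")],
  id.vars = "minor_category"
)
melted %>%
  ggplot(aes(x = minor_category, y = value)) +
  geom_violin(aes(fill = variable)) +
  scale_y_continuous(name = NULL, labels = scales::dollar) +
  labs(x = NULL, title = "Wages within Sales and Office Category")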
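The question of whether the gap persists within roles can also be checked with a simple ratio. The sketch below is a hedged illustration rather than the analysis behind the figure: the wages are hypothetical, the minor category labels are placeholders, and the column names are assumed to match `jobs_gender`. It computes the female-to-male earnings ratio for each job and takes the median within each minor category.

```r
# Toy data with hypothetical wages; minor category labels are placeholders.
toy_sales_office <- data.frame(
  minor_category = c("Sales", "Sales",
                     "Administrative Support", "Administrative Support"),
  total_earnings_male   = c(60000, 50000, 40000, 44000),
  total_earnings_female = c(45000, 40000, 36000, 33000)
)

# Female earnings as a fraction of male earnings for the same job.
toy_sales_office$female_to_male <-
  toy_sales_office$total_earnings_female / toy_sales_office$total_earnings_male

# Median ratio within each minor category. A value below 1 means women
# earn less than men in the typical job of that minor category.
ratio_by_minor <- aggregate(female_to_male ~ minor_category,
                            data = toy_sales_office, FUN = median)
print(ratio_by_minor)
```

If the role-mix hypothesis were the whole story, the within-category ratios would sit close to 1. Ratios well below 1 in every minor category would instead point to a gap within roles.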
Women are also heavily involved in the Service sector. Again using compensation to measure success, how do women compare to men?
The total earnings appear to be more comparable here; however, men still earn more across nearly all jobs and sectors. Rarely have we found an area or job where women are paid more.
From the prior violin graph and this graph, we expect to see a smooth distribution of salaries within each minor category. Instead we notice an abrupt cutoff of the range in women’s salaries. Comically, it is as if we are quite literally seeing the “glass ceiling.”
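# Keep only the Service jobs.
jobs_gender_filter <- jobs_gender %>% filter(major_category == "Service")
# Reshape to long format: one row per job and gender-specific wage.
# Columns are selected by name (positions 4, 10, 11 in the original) for robustness.
melted <- melt(
  jobs_gender_filter[, c("minor_category", "total_earnings_male", "total_earnings_female")],
  id.vars = "minor_category"
)
melted %>%
  ggplot(aes(x = minor_category, y = value)) +
  geom_violin(aes(fill = variable)) +
  scale_y_continuous(name = NULL, labels = scales::dollar) +
  labs(x = NULL, title = "Wages within Service Category")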
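The abrupt cutoff visible in the violins can be made concrete by comparing the highest observed wage for each gender within each minor category. This is a rough sketch under stated assumptions: the wages and minor category labels below are hypothetical, and the column names are assumed to match `jobs_gender`.

```r
# Toy data with hypothetical wages; minor category labels are placeholders.
toy_service <- data.frame(
  minor_category = c("Protective Service", "Protective Service",
                     "Food Preparation", "Food Preparation"),
  total_earnings_female = c(40000, 52000, 20000, 24000),
  total_earnings_male   = c(45000, 70000, 22000, 30000)
)

# Highest observed wage per gender within each minor category.
# The formula interface drops rows with missing wages by default.
top_by_minor <- aggregate(
  cbind(total_earnings_female, total_earnings_male) ~ minor_category,
  data = toy_service, FUN = max
)

# Distance between the top male wage and the top female wage.
top_by_minor$gap_at_top <-
  top_by_minor$total_earnings_male - top_by_minor$total_earnings_female
print(top_by_minor)
```

The maximum is sensitive to single occupations, so a high quantile (for example `quantile(x, 0.95)`) may be a steadier measure of the ceiling on the full data.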
As stated in this report during our midterm progress check, our goal is to use EDA to identify opportunities for women. From this exploration, we have gathered some important information regarding women in the workforce, in the past and in the present. While we have explored these ideas in depth in the prior sections, we summarize the concepts here:
The wage gap between men and women has slowly decreased in size and variability. Additionally, participation increased up until 2016, a change driven largely by men and women entering the workforce as full-time employees.
There exists an opportunity for women in the field of Natural Resources, Construction, and Maintenance. While few women have entered this area, wages are comparable. This is shown in the boxplots where a fair percentage of observations indicated women are paid more than men for a particular job. While there are certainly structural and cultural obstacles to overcome for women to enter the field of construction, our analysis shows that those who have overcome those challenges (if applicable) have received fair pay.
Women are highly concentrated in the service and health care sectors. There exists an opportunity to use their numbers to push for equal pay, since it is evident that they are not receiving it. Individually, we can all negotiate higher salaries and wages to decrease the gap.
While we have seen a decrease in the pay gap over the years, we still see opportunity for improvement.
Should we ever obtain more complete time-series data that is equally detailed by job description and industry, we could uncover the progress women may have made over the years. This exploratory analysis should serve as a starting point to spread awareness of gender inequality in the workforce and start the discussion as to how women can better access industries they are not currently in as well as leverage their numbers in other sectors to improve their pay.