Introduction

Is the Wage Gap Real?

The gender wage gap is a problem that has persisted for many years. In today’s economy, unemployment is extremely low, yet political tensions are nearing an all-time high. Despite the many other social issues across our country, the debate over whether a gender wage gap exists carries on.

Given how tense today’s political climate is, it is unfair to say that all females are being monetarily discriminated against in the workforce. Many companies, large and small, have incorporated the wage gap problem into their social responsibility initiatives. In recent years, as more companies have worked to close the wage gap, we have seen great strides in reducing it.

Despite these strides, the gender wage gap has not been eradicated. As females continue to play a larger and more important role in the workforce, it is important to understand the underlying factors that caused the wage gap so that it can be prevented in the future.

Our Focus

We are going to analyze the historical trends of the wage gap and identify the underlying forces that may have been the root cause. We will look at the trends of employment and salary differences throughout history across numerous industries and age groups for both males and females.

For our project we will use three different datasets: two historical datasets originating from the Bureau of Labor Statistics and one from the Census Bureau. All three datasets focus primarily on the earnings ratio, that is, how much females earn compared to males. The formula for the earnings ratio is below: \[ \text{Earnings Ratio} = \frac{\text{Female Median Earnings}}{\text{Male Median Earnings}} \] Additionally, the “employed_gender” dataset looks at the difference between full-time and part-time employment each year for males, females, and overall from 1968 - 2016.
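As a quick illustration of the earnings ratio formula, the sketch below applies it to two made-up median earnings figures. The dollar amounts are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from any of the three datasets.

```r
# Hypothetical median earnings, used only to illustrate the formula
female_median_earnings <- 40000
male_median_earnings <- 50000

# Earnings ratio = female median earnings / male median earnings
earnings_ratio <- female_median_earnings / male_median_earnings

# The same ratio expressed as a percentage, rounded to one decimal place
earnings_ratio_pct <- round(100 * earnings_ratio, 1)

print(earnings_ratio)      # 0.8
print(earnings_ratio_pct)  # 80
```

A ratio below 1 (or below 100 when written as a percentage) means that the female median is lower than the male median.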

Key Variables Across All Datasets
  • Year
  • Industry Broad
  • Industry Specific
  • Total Workers
  • Male Workers
  • Female Workers
  • Earnings Ratio
  • Age Group


Analytical Approach

Before identifying the trends of the gender wage gap, we must address the limitations presented by the data. The two major limitations are as follows:

  • Binary Genders: Our data is based solely on a binary gender identification. This leaves out individuals, including members of the LGBTQA community, who may classify their gender as non-binary.
  • Inconsistent Timelines: The three datasets that we are deriving our analyses from cover timelines that are not consistent with one another. Because of this, we decided to come up with three different approaches for dissecting the gender wage gap problem.

Our three approaches to understanding the gender wage gap are found below:

  1. Wage Gap by Industry: Our dataset “jobs_gender” provides great detail on the different occupations held by both males and females from 2013 - 2016. The data has been prepared so that each occupation is listed under an industry_broad and an industry_specific category to assist with the industry analysis. We will filter the industries to see which of them have a majority of male employment and which have a majority of female employment. From there, we plan to determine whether the wage gap in male dominated industries is higher than the wage gap in female dominated industries. In general, we want to find out whether certain industries present larger wage gaps than others and, if so, whether there are underlying causes.

  2. Wage Gap by Age: Our second approach is to analyze the wage gap by age group. Our dataset “earnings_female” breaks down the Earnings Ratio for seven different age groups from the years 1979 - 2011. We will test to see if certain age groups are more susceptible to the wage gap than others.

  3. Overall Female Employment: In our last analysis, we will pull together everything from the first two independent analyses as well as data from our last dataset, “employed_gender”. With all of this data, we will observe the overall trend of the wage gap as well as the increasing presence of females in the workforce. The goal of this analysis is to determine whether these trends track the social movements affecting females over the past 3-5 decades.

Our Mission

The mission of our project is to show the consumer how the wage gap has changed over time based on different factors (industry, age group, employment numbers, etc.) and whether the wage gap is improving. By looking at historical and current events, we want to allow the consumer of the data to decide whether they believe that the gender wage gap is actually improving.

In the end, we hope to provide clarity on the industries and age groups that have historically been affected the most by the gender wage gap. Hopefully, this will inspire the consumers of this data to make a positive change for the future of these affected groups.

Packages Required

For this project we will use many of the standard packages for cleaning and visualizing data. Most of these packages are common in other data manipulation and visualization work, so few of them will need to be installed strictly for this analysis.

A few of the packages that may need to be installed by the user include ggthemes, magick, and plotly. ggthemes is a separate package that adds extra themes to ggplot2; it is not included with ggplot2 itself. The other two packages are supplementary visualization tools that we will use for our analysis.
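One way to see which of these packages are still missing before loading them is sketched below. This is only a convenience check, not part of the original analysis; the vector name `needed` is our own, and the install step is left commented out so that nothing is installed without the user choosing to do so.

```r
# Packages named above that may not be installed yet
needed <- c("ggthemes", "magick", "plotly")

# requireNamespace() returns TRUE when a package is available to load
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed (may be empty)
missing_pkgs <- needed[!installed]
print(missing_pkgs)

# Uncomment to install whatever is missing:
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```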

## Load Required Packages ##
library(tidyverse) #Use to tidy data
library(readr) #Use to easily import delimited data
library(dplyr) #Use to manipulate data
library(tibble) #Use to manipulate data
library(magrittr) #Use to insert pipe operators
library(DT) #Use to create functional tables in HTML
library(knitr) #Use to create dynamic report generation
library(rmarkdown) #Use to convert R Markdown documents into a variety of formats
library(ggthemes) #Use to implement themes across report
library(ggrepel) #Use to label data
library(ggplot2) #Use to create visualizations
library(plotly) #Use to create dynamic plotting
library(gridExtra) #Use to arrange plots
library(reshape2) #Use to transform data frames

Data Preparation

We have three different sets of data that we are using to analyze the wage gap. However, the timelines of these datasets do not line up, so we will not be able to use the data in aggregate. Instead, we have elected to treat each dataset on its own and clean each of them individually. Our cleaning and preparation process for each can be found below:

earnings_female

The Data

As previously mentioned, this data originated from the Bureau of Labor Statistics. The original data contains 3 variables and 264 observations covering the years 1979 - 2011, and gives the earnings ratio for each year across eight groupings (an overall total plus seven age groups):

  • Total, 16 years and older
  • 16-19 years
  • 20-24 years
  • 25-34 years
  • 35-44 years
  • 45-54 years
  • 55-64 years
  • 65 years and older


The first thing we will want to do is import the original csv file using the read_csv function:

## Import the Data ##
earnings_female <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/earnings_female.csv") 

We examined and viewed the original data below:

datatable(head(earnings_female,50))



Cleaning the Data

After importing the data and doing the exploratory analysis we realized that we should change a few aspects to keep the data consistent.

  1. We renamed two of the columns to better represent the values that are being displayed.
  2. We reformatted each column name into snake_case to match the data from the other sets.
  3. We took the data in the age_group category and renamed the value of Total, 16 years and older to Total to make the data easier to understand.
names(earnings_female) <- c("year","age_group","earnings_ratio")
earnings_female$age_group[earnings_female$age_group == "Total, 16 years and older"] <- "Total"



Data Dictionary
Variable.type <- lapply(earnings_female,class)
Variable.desc <- c("Year", "Age group", "Female wages as a percent of male wages, which is the earnings ratio of females")
Variable.name1 <- colnames(earnings_female)
data.desc <- as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
| Variable Name | Data Type | Variable Description |
|:---|:---|:---|
| year | numeric | Year |
| age_group | character | Age group |
| earnings_ratio | numeric | Female wages as a percent of male wages, which is the earnings ratio of females |
library(DT)
datatable(head(earnings_female,50))



In the clean dataset, the range of years remains the same but the age groups are now:

  • Total
  • 16-19 years
  • 20-24 years
  • 25-34 years
  • 35-44 years
  • 45-54 years
  • 55-64 years
  • 65 years and older

The earnings ratio column ranges from 56.8% to 95.4%. Because every value is below 100%, female median earnings are lower than male median earnings in every group and year, reaffirming that a wage gap does in fact exist. We will analyze these values in our exploratory analysis. Additionally, each of the variables contained in this dataset is a primary variable for our analysis, so we will not remove any of the observations from this table.
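The range check and the by-age comparison described in our second approach can be sketched as follows. To keep the example self-contained it uses a small toy table with the same three column names as the cleaned data; the ratio values are hypothetical, not the real figures.

```r
# Toy stand-in for the cleaned earnings_female table (values are hypothetical)
toy_earnings <- data.frame(
  year = c(2010, 2011, 2010, 2011),
  age_group = c("20-24 years", "20-24 years", "45-54 years", "45-54 years"),
  earnings_ratio = c(93.8, 93.2, 76.5, 76.0),
  stringsAsFactors = FALSE
)

# Smallest and largest earnings ratio in the table
ratio_range <- range(toy_earnings$earnings_ratio)

# Mean earnings ratio per age group, sorted from largest gap to smallest
mean_by_age <- aggregate(earnings_ratio ~ age_group, data = toy_earnings, FUN = mean)
mean_by_age <- mean_by_age[order(mean_by_age$earnings_ratio), ]

print(ratio_range)
print(mean_by_age)
```

On the real data the same two steps would be run on `earnings_female` after excluding the "Total" rows, since those rows summarize the other groups.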


jobs_gender

The Data

As previously mentioned, this data originated from the Census Bureau. The original data contains 12 variables and 2088 observations with dates ranging from 2013 - 2016. The data is centered around employment numbers and earnings for males and females. The column names are below:

  • year
  • occupation
  • major_category
  • minor_category
  • total_workers
  • workers_male
  • workers_female
  • percent_female
  • total_earnings
  • total_earnings_male
  • total_earnings_female
  • wage_percent_of_male


The first thing we did is import the original csv file data that we were provided using read_csv:

## Import the Data ##
jobs_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv")

We examined and viewed the original data below:

datatable(head(jobs_gender,50))



Cleaning the Data

After viewing the data, we made a few changes to the data. The changes are found below:

  1. We changed the column names to better represent the values provided. Changes included:
  • major_category to industry_broad
  • minor_category to industry_specific
  • wage_percent_of_male to earnings_ratio_female
names(jobs_gender) <- c("year","occupation","industry_broad","industry_specific",
                        "total_workers","workers_male","workers_female","percent_female",
                        "total_earnings","total_earnings_male","total_earnings_female",
                        "earnings_ratio_female")
  2. Then we checked for duplicate and missing values that would affect the data:
colSums(is.na(jobs_gender))
##                  year            occupation        industry_broad 
##                     0                     0                     0 
##     industry_specific         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female earnings_ratio_female 
##                     4                    65                   846
  3. In the last three columns, we have values of “NA” present. The NA values for “earnings_ratio_female” were reportedly entered for rows whose sample sizes were too small. However, the minimum number of total workers among the rows without “NA” in the “earnings_ratio_female” column is 11,383, while the maximum among the rows with “NA” in that column is 441,982. Because some rows containing “NA” have far more workers than some rows that do not, the argument that the sample size is too small for the rows containing “NA” does not hold. Therefore, we used the earnings ratio formula to fill in the NA values for this variable.

When we examined the “total_earnings_female” and “total_earnings_male” columns, we realized that there were some negative values. There were also values of 0 or very low numbers in the earnings columns even though there were hundreds of workers in the occupation. Therefore, we removed every row with an NA value for female or male earnings so that these values do not skew our data later in the analysis.
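The comparison of worker counts between rows with and without a reported ratio, and the recalculation of the ratio itself, can be sketched on a small toy table. The occupations and numbers below are hypothetical and only mirror the column names used in this section.

```r
# Toy stand-in for jobs_gender (all values are hypothetical)
toy_jobs <- data.frame(
  occupation = c("A", "B", "C", "D"),
  total_workers = c(12000, 450000, 30000, 800),
  total_earnings_male = c(50000, 60000, 40000, NA),
  total_earnings_female = c(45000, 48000, 38000, 30000),
  earnings_ratio_female = c(90, NA, 95, NA),
  stringsAsFactors = FALSE
)

# Rows where the source already reports an earnings ratio
has_ratio <- !is.na(toy_jobs$earnings_ratio_female)

# Smallest occupation with a ratio vs. largest occupation without one
min_with_ratio <- min(toy_jobs$total_workers[has_ratio])
max_without_ratio <- max(toy_jobs$total_workers[!has_ratio])

# Recompute the ratio from the two earnings columns; it stays NA only
# where one of the earnings values is itself missing
toy_jobs$Earnings_Ratio <- toy_jobs$total_earnings_female / toy_jobs$total_earnings_male

print(c(min_with_ratio, max_without_ratio))
print(toy_jobs$Earnings_Ratio)
```

If the largest occupation without a ratio is bigger than the smallest occupation with one, as in this toy table, sample size alone cannot explain the missing ratios.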

## Mutate the New Column for Earnings Ratio

jobs_gender <- 
  jobs_gender %>% 
  mutate(Earnings_Ratio = total_earnings_female / total_earnings_male)

#Remove Original Column
jobs_gender <- select(jobs_gender,-c(earnings_ratio_female))
#Removing all observations with NA Values
jobs_gender <- na.omit(jobs_gender) 
colSums(is.na(jobs_gender))
##                  year            occupation        industry_broad 
##                     0                     0                     0 
##     industry_specific         total_workers          workers_male 
##                     0                     0                     0 
##        workers_female        percent_female        total_earnings 
##                     0                     0                     0 
##   total_earnings_male total_earnings_female        Earnings_Ratio 
##                     0                     0                     0
  4. Given the numbers used, we felt the data would be best presented if all decimals were kept to a maximum of one place.
## Rounding Percentages ##
# round() is vectorised, so the column is rounded directly and stays numeric
# (wrapping it in lapply() would turn it into a list column).
jobs_gender$percent_female <- round(jobs_gender$percent_female, 1)

# earnings_ratio_female was dropped above and percent_male is never created,
# so neither is rounded here. Earnings_Ratio is a fraction (female / male) and
# is left unrounded so that small differences between occupations are kept.



From this we are able to obtain our clean data set:

Data Dictionary
Variable.type <- lapply(jobs_gender,class)
Variable.desc <- c("Year", "Specific job/career", "Broad industry of occupation", "Specific industry of occupation", "Total estimated full-time workers above 16 years old", "Estimated full-time male workers above 16", "Estimated full-time female workers above 16","The percent of females in a specific occupation","Total estimated median earnings for full-time workers above 16 years old", "Estimated median earnings for males above 16 years old", "Estimated median earnings for females above 16 years old", "Female wages as a percent of male wages, which is the earnings ratio of females")
Variable.name1 <- colnames(jobs_gender)
data.desc <- as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
| Variable Name | Data Type | Variable Description |
|:---|:---|:---|
| year | numeric | Year |
| occupation | character | Specific job/career |
| industry_broad | character | Broad industry of occupation |
| industry_specific | character | Specific industry of occupation |
| total_workers | numeric | Total estimated full-time workers above 16 years old |
| workers_male | numeric | Estimated full-time male workers above 16 |
| workers_female | numeric | Estimated full-time female workers above 16 |
| percent_female | numeric | The percent of females in a specific occupation |
| total_earnings | numeric | Total estimated median earnings for full-time workers above 16 years old |
| total_earnings_male | numeric | Estimated median earnings for males above 16 years old |
| total_earnings_female | numeric | Estimated median earnings for females above 16 years old |
| Earnings_Ratio | numeric | Female wages as a percent of male wages, which is the earnings ratio of females |


datatable(head(jobs_gender,50))

Now with the clean data, we have a uniform naming style and a recalculated earnings ratio column. All of the data from this dataset will be used; however, we have classified each variable as either primary or secondary. The column names and classifications are listed below:

  • year - Secondary
  • occupation - Secondary
  • industry_broad - Primary
  • industry_specific - Primary
  • total_workers - Primary
  • workers_male - Primary
  • workers_female - Primary
  • percent_female - Secondary
  • total_earnings - Secondary
  • total_earnings_male - Secondary
  • total_earnings_female - Secondary
  • Earnings_Ratio - Primary

After cleaning the data, we have recalculated the earnings ratio as Earnings_Ratio and removed the rows with NA values in total_earnings_male or total_earnings_female, giving us 2019 complete observations and still 12 variables.
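The observation count can be cross-checked against the NA counts printed earlier. The short sketch below only uses numbers that already appear in this section (2088 rows before cleaning, 2019 after, and 4 and 65 missing values in the male and female earnings columns); the variable names are our own.

```r
# Figures reported in this section
rows_before <- 2088
rows_after <- 2019
na_male <- 4      # NA count in total_earnings_male
na_female <- 65   # NA count in total_earnings_female

# Rows dropped by na.omit()
rows_dropped <- rows_before - rows_after

# If the drop equals the sum of the two NA counts, no row was missing
# both earnings values at once
no_overlap <- rows_dropped == na_male + na_female

print(rows_dropped)  # 69
print(no_overlap)    # TRUE
```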



employed_gender

The Data

As previously mentioned, this data originated from the Bureau of Labor Statistics. The dataset shows the percentage of employed people working full time and part time, overall and by gender, for the years 1968 - 2016. The dataset contains 7 variables and 49 observations with the column names shown below:

  • year
  • total_full_time
  • total_part_time
  • full_time_female
  • part_time_female
  • full_time_male
  • part_time_male

The first thing we did is import the original csv file using the read_csv function:

## Import the Data ##
employed_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/employed_gender.csv")



Our clean data is below:

Data Dictionary
Variable.type <- lapply(employed_gender,class)
Variable.desc <- c("Year", "Percent of total employed people usually working full-time", "Percent of total employed people usually working part time", "Percent of employed females usually working full time", "Percent of employed females usually working part time", "Percent of employed males usually working full time", "Percent of employed men usually working part time")
Variable.name1 <- colnames(employed_gender)
data.desc <- as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
| Variable Name | Data Type | Variable Description |
|:---|:---|:---|
| year | numeric | Year |
| total_full_time | numeric | Percent of total employed people usually working full-time |
| total_part_time | numeric | Percent of total employed people usually working part time |
| full_time_female | numeric | Percent of employed females usually working full time |
| part_time_female | numeric | Percent of employed females usually working part time |
| full_time_male | numeric | Percent of employed males usually working full time |
| part_time_male | numeric | Percent of employed men usually working part time |


datatable(head(employed_gender,50))

Compared to our other datasets, this data is very useful without a lot of cleaning. We do not have any missing values or duplicate values, and all column names are written with consistent snake_case formatting.

For this reason, we elected not to make any initial changes to this data.
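The checks for missing and duplicate values mentioned above are not shown as code, so here is a minimal sketch of how they could be run. It uses a toy table with a few of the same column names; the percentages are hypothetical.

```r
# Toy stand-in for employed_gender (values are hypothetical)
toy_employed <- data.frame(
  year = c(2014, 2015, 2016),
  full_time_female = c(74.0, 74.5, 75.0),
  part_time_female = c(26.0, 25.5, 25.0)
)

# Total number of missing cells across the whole table
n_missing <- sum(is.na(toy_employed))

# Number of rows that exactly repeat an earlier row
n_duplicate_rows <- sum(duplicated(toy_employed))

# Number of years that appear more than once
n_duplicate_years <- sum(duplicated(toy_employed$year))

print(c(n_missing, n_duplicate_rows, n_duplicate_years))  # 0 0 0
```

Running the same three lines on `employed_gender` would confirm whether the table needs any cleaning.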



Analysis & Visualizations

1. Wage Gap By Industry

For the wage gap by industry analysis, we will study the breakdown of earnings for each of the eight industries to observe any patterns in the respective wage gaps. First, we are interested in observing which industries are dominated by females and which by males. Once we determine which industries fall into which category, we will analyze the differences in the average female and male median earnings for both female and male dominated industries. We will take this approach a step further and dissect the outliers, averages, minimums, maximums, etc. for each industry and draw conclusions on the underlying factors, if any, for these numbers and patterns. Our hypothesis before beginning this analysis is that even in female dominated industries, the average pay for females will be lower than that of males. Our goal is to identify whether this hypothesis is true, whether any industries are outliers, and whether there are underlying factors that explain the wage gap for each industry and as a whole.

Industry Size Overview

In this first graph we are looking at the 8 broad industries to see which of them have a majority of employment from women based on the avg_females field that we temporarily created. Any industry that has 50% or more women is a female dominated industry. From this graph, you can see that females dominate 3 industries:

  • Healthcare Practitioners and Technical
  • Education, Legal, Community Service, Arts, and Media
  • Sales and Office

On the other hand, males dominate 5 industries:

  • Service
  • Management, Business, and Financial
  • Computer, Engineering, and Science
  • Production, Transportation, and Material Moving
  • Natural Resources, Construction, and Maintenance

For the most part, the split between female and male dominated industries is not surprising given the historical stigma surrounding each industry. Women have traditionally held roles as medical assistants, nurses, teachers, childcare workers, and administrative assistants, while men have traditionally held roles as production workers, mechanics, analysts, engineers, and workers dealing with any type of natural resource. One industry that surprised us is the service industry. However, its percentage of females is 49%, so it is very close to the threshold. One reason for the smaller than expected percentage is that firefighters, police officers, and other justice system occupations are listed under service, and those careers are heavily dominated by men.

### Female vs. Male Dominated Industries ###

females_vs_males <- jobs_gender %>%
  group_by(industry_broad) %>%
  summarise(avg_females = sum(workers_female) / sum(total_workers), 
            avg_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(desc(avg_females))
ggplot(data = females_vs_males, 
       aes(x = reorder(industry_broad, +avg_females), 
           y = (avg_females))) + 
  geom_bar(stat = "identity", 
           aes(fill = avg_females >= 0.5)) + 
  scale_fill_discrete(name = "% Of Females", labels = c("< 50%", " >= 50%")) +
  scale_y_continuous(name = "% Female", labels = function(y) paste0(y*100,"%")) + 
  scale_x_discrete(name = "Broad Industry", labels = function(x) str_wrap(x, width = 30)) +
  theme(axis.text.x = element_text(hjust = 0)) +
  ggtitle("Female vs. Male Dominated Industries",
          subtitle = "Broad Industries having more than 50% of Females") +
  coord_flip()

Next, we wanted to look more closely at the specific industries in the same fashion, to identify any specific industries that females dominate even though the broad industry they belong to is male dominated. From the plot below you can see that females dominate the healthcare support and personal care and service categories, which fall under the Service broad industry. However, both of these categories are related to the medical field, which is an industry that females dominate. Females also dominate the business and financial operations field, which falls under the Management, Business, and Financial broad industry. After further analysis of the occupations within this category, we discovered that the majority of these roles include marketing analysts, event planners, and human resources workers. These roles are traditionally held by women. There are a few financial specialist and accounting roles that are predominantly held by women; these defy the trend, and we consider them outliers.

females_vs_males_1 <- jobs_gender %>%
  group_by(industry_specific) %>%
  summarise(avg_females = sum(workers_female) / sum(total_workers), 
            avg_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(desc(avg_females))
ggplot(data = females_vs_males_1, 
       aes(x = reorder(industry_specific, +avg_females), 
           y = (avg_females))) + 
  geom_bar(stat = "identity", aes(fill = avg_females >= 0.5)) + 
  scale_fill_discrete(name = "% Of Females", labels = c("< 50", ">= 50")) +
  scale_y_continuous(name = "% Female", labels = function(y) paste0(y*100,"%")) + 
  xlab("Specific Industry") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Female vs. Male Dominated Industries",
          subtitle = "Specific Industries having more than 50% of Females") +
  coord_flip()

To further confirm female versus male dominated industries, we wanted to look at the top 10 occupations for women across all industries and see if any of the occupations fell outside of their three main industries. Two of the occupations, medical transcriptionists and childcare workers, are in the service industry. But, their fields are closely related to the industries that females dominate, so we do not consider them outliers.

females_vs_males_2 <- jobs_gender %>%
  group_by(occupation) %>%
  summarise(avg_females = sum(workers_female) / sum(total_workers),
            avg_males = sum(workers_male) / sum(total_workers)) %>%
  top_n(10, avg_females)
ggplot(data = females_vs_males_2, 
       aes(x = reorder(occupation, +avg_females), 
           y = (avg_females))) + 
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "% Female", labels = function(y) paste0(y*100,"%")) + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("Top Female Occupations",
          subtitle = "10 Highest Female Dominated Occupations Across Industries") +
  coord_flip()

In addition to seeing the top 10 occupations for females, we also wanted to see the 10 occupations in which females have the lowest presence. From the plot below you can see that all of these occupations fall within male dominated industries.

females_vs_males_3 <- jobs_gender %>%
  group_by(occupation) %>%
  summarise(avg_females = sum(workers_female) / sum(total_workers),
            avg_males = sum(workers_male) / sum(total_workers)) %>%
  top_n(-10, avg_females)
ggplot(data = females_vs_males_3, 
       aes(x = reorder(occupation, -avg_females), 
           y = (avg_females))) + 
  geom_bar(stat = "identity") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  scale_y_continuous(name = "% Female", labels = function(y) paste0(y*100,"%"), limits = c(0,.1)) +
  ggtitle("Lowest Female Occupations", 
        subtitle = "10 Lowest Occupations Held by Females Across Industries") +
  coord_flip()

Industry Earnings Overview

To begin the analysis of the earnings per industry, we observed the density plots of the total earnings per industry category. We will analyze each industry’s summary statistics in our deep-dive industry section below the size and earnings overviews. The main conclusions from this plot are that:

  1. Computer, Engineering, and Science has the highest average median earnings, and it is a male dominated industry.
  2. The distributions for each industry are skewed to the right, meaning that the means are higher than the medians.

The skews are more severe for certain industries due to outliers, which we will examine further in our analysis below.
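To see why a right skew pushes the mean above the median, consider the small worked example below. The five earnings values are hypothetical; one very high value plays the role of an outlier occupation.

```r
# Hypothetical median earnings for five occupations, one of them an outlier
toy_earnings <- c(30000, 35000, 40000, 45000, 200000)

# The mean is pulled upward by the single large value
mean_earnings <- mean(toy_earnings)

# The median only depends on the middle value
median_earnings <- median(toy_earnings)

print(mean_earnings)    # 70000
print(median_earnings)  # 40000
```

The single outlier raises the mean by 30,000 over the median, while the median is unchanged by how large that outlier is.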

### Density Plot of Earnings ###

average_median_earnings <- jobs_gender %>%
  group_by(industry_broad) %>%
  summarise(avg_per_industry = mean(total_earnings)) %>%
  arrange(desc(avg_per_industry))
ggplot(data = jobs_gender,
       aes(x = total_earnings,
           color = industry_broad)) + 
  geom_density(aes(fill = industry_broad), alpha = 0.3) +
  xlab("Total Earnings") +
  ylab("Density") +
  ggtitle("Distribution of Total Earnings Per Industry",
          subtitle = "Density plot Earnings Across Broad Industry")

To provide an overview of the wage gap regardless of industry, we wanted to briefly show the overall trend in earnings between males and females. Females have average median earnings of $44,582 and males have average median earnings of $53,218, against $49,640 for all workers. Therefore, our dataset informs us that on average, men make about $8,600 more than women. There are outliers in each plot, but the outliers in the male earnings are the most significant. Also, there are about 56 million more men than women documented as working full time in our dataset. Even though this is a large difference, our sample size for the remaining occupations in our dataset is large enough for both men and women.

### Median Earnings Per Gender ###


x <- data.frame(total_median_earnings = jobs_gender$total_earnings, 
                female_median_earnings = jobs_gender$total_earnings_female, 
                male_median_earnings = jobs_gender$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) +
  geom_boxplot() + 
  theme_bw() +
  scale_y_continuous(name = "Total Income", labels = scales::dollar) +
  scale_x_discrete(name = "Class of Earnings") +
  ggtitle("Median Earnings Per Male & Female", 
          subtitle = "Distribution of Earnings Showing Outliers Per Gender") + 
  coord_flip()

summary(jobs_gender$total_earnings)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   17266   32310   44100   49640   60837  201542
summary(jobs_gender$total_earnings_female)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7447   28852   40154   44582   54715  166388
summary(jobs_gender$total_earnings_male)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12147   35609   46825   53218   65144  231420
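Using the means printed in the three summaries above, the overall dollar gap and the corresponding ratio can be worked out directly. The variable names are our own, and note that this is a ratio of averaged medians across occupations, which is not the same thing as the earnings ratio reported per occupation.

```r
# Means taken from the summary() output above
mean_female <- 44582
mean_male <- 53218

# Dollar difference between the male and female averages
gap_dollars <- mean_male - mean_female

# Female average as a percentage of the male average, to one decimal place
ratio_pct <- round(100 * mean_female / mean_male, 1)

print(gap_dollars)  # 8636
print(ratio_pct)    # 83.8
```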

After displaying the average median salaries of females and males overall, we wanted to show the difference between female dominated and male dominated industries. We previously defined female and male dominated industries, so now we will aggregate the industries that belong to each category. Then, we will calculate the average median earnings for males and females in both the female and male dominated industries. Our original hypothesis, that even in female dominated industries the average pay for females will be lower than that of males, proves to be true in our overview analyses. From the graph, we conclude that females on average make 10% more in female dominated industries than in male dominated industries. On the other hand, males on average make 10% less in male dominated industries than they do in female dominated industries. Regardless, males on average make more than females in both male and female dominated industries. Also, females make less in male dominated industries than they do in female dominated industries, which we suspected from the start.

### Female & Male Earnings Per Female & Male Dominated Industries ###


# Broad industries where at least 50% of workers are female
female_industries <- c("Healthcare Practitioners and Technical",
                       "Education, Legal, Community Service, Arts, and Media",
                       "Sales and Office")
# The remaining broad industries are male dominated
male_industries <- c("Service", "Management, Business, and Financial",
                     "Computer, Engineering, and Science",
                     "Production, Transportation, and Material Moving",
                     "Natural Resources, Construction, and Maintenance")

# %in% tests membership; == would recycle the vector and silently drop rows
female_dominated_industries <- jobs_gender %>%
  filter(industry_broad %in% female_industries) %>%
  summarise(Female_Earnings_F = mean(total_earnings_female))
female_dominated_industries_males <- jobs_gender %>%
  filter(industry_broad %in% female_industries) %>%
  summarise(Male_Earnings_F = mean(total_earnings_male))
male_dominated_industries <- jobs_gender %>%
  filter(industry_broad %in% male_industries) %>%
  summarise(Female_Earnings_M = mean(total_earnings_female))
male_dominated_industries_males <- jobs_gender %>%
  filter(industry_broad %in% male_industries) %>%
  summarise(Male_Earnings_M = mean(total_earnings_male))
Earnings <- data.frame(female_median_earnings_1 = female_dominated_industries, male_median_earnings_1 = female_dominated_industries_males, female_median_earnings_2 = male_dominated_industries, male_median_earnings_2 = male_dominated_industries_males)
Earnings_differnce <- melt(Earnings)
ggplot(Earnings_differnce, 
       aes(x = variable, 
           y = value,
           fill = variable)) + 
  geom_bar(stat = "identity") + 
  theme_bw() + 
  coord_cartesian(ylim = c(25000, 55000)) +
  xlab("Earnings by Gender") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Gender Earnings in Female vs Male Dominated Industries", 
          subtitle = "Median Earnings of Females and Males in different Industries")
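When a column is matched against several category names, `%in%` tests membership, whereas `==` recycles the right-hand vector position by position. The sketch below shows the difference on a small made-up table and then computes both category averages in one grouped pass. The `toy_jobs` table, its values, and the `dominated_by` column are assumptions for illustration only; just the column names `industry_broad`, `total_earnings_female`, and `total_earnings_male` come from the report.

```r
library(dplyr)

# Toy stand-in for jobs_gender; the values are made up for illustration only.
toy_jobs <- tibble(
  industry_broad = c("Sales and Office", "Service", "Sales and Office",
                     "Healthcare Practitioners and Technical", "Service",
                     "Healthcare Practitioners and Technical"),
  total_earnings_female = c(30000, 25000, 34000, 60000, 27000, 64000),
  total_earnings_male   = c(36000, 30000, 40000, 72000, 29000, 70000)
)

female_dominated <- c("Healthcare Practitioners and Technical", "Sales and Office")

# `==` compares row i with element i of the recycled vector, so rows that
# belong to the set can be missed entirely.
n_equal <- toy_jobs %>% filter(industry_broad == female_dominated) %>% nrow()

# `%in%` keeps every row whose industry is in the set.
n_in <- toy_jobs %>% filter(industry_broad %in% female_dominated) %>% nrow()

# One grouped summary instead of four separate filter + summarise calls.
summary_tbl <- toy_jobs %>%
  mutate(dominated_by = if_else(industry_broad %in% female_dominated,
                                "Female", "Male")) %>%
  group_by(dominated_by) %>%
  summarise(
    female_mean = mean(total_earnings_female, na.rm = TRUE),
    male_mean   = mean(total_earnings_male, na.rm = TRUE),
    gap         = male_mean - female_mean
  )

print(c(n_equal = n_equal, n_in = n_in))
print(summary_tbl)
```

On this toy table the `==` filter keeps no rows while `%in%` keeps four, which is why the choice of operator can change the averages that end up in the bar chart.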

Industry Breakdown

Now, we want to determine if there are any outlier industries. An industry would be an outlier if it strays from the overall pattern, that is, if females make more than males on average in that industry. We created a boxplot for each industry to provide a quick visual of the differences in male and female median earnings on average. Next, we created summary statistic tables to show the exact differences in earnings. It is important to note that we did not remove any outliers from this dataset. Each of the outliers is important to providing an overall view of the industry and the discrepancies in median earnings between males and females.
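The per-industry check described above can be sketched as a small helper that is applied to every industry in turn. This is a minimal illustration on made-up data, not the code used for the report: `toy_jobs`, `industry_gap()`, and the `females_earn_more` flag are hypothetical names, and only the three column names are taken from the report.

```r
# Toy stand-in for jobs_gender; all values are made up for illustration only.
toy_jobs <- data.frame(
  industry_broad = c("Service", "Service", "Sales and Office", "Sales and Office"),
  total_earnings_female = c(30000, 34000, 41000, 45000),
  total_earnings_male   = c(33000, 35000, 40000, 42000)
)

# Hypothetical helper: mean female and male earnings for one industry, plus a
# flag for the "outlier" case where females earn more than males on average.
industry_gap <- function(jobs, industry) {
  rows <- jobs[jobs$industry_broad %in% industry, ]
  female_mean <- mean(rows$total_earnings_female, na.rm = TRUE)
  male_mean   <- mean(rows$total_earnings_male, na.rm = TRUE)
  data.frame(
    industry = industry,
    female_mean = female_mean,
    male_mean = male_mean,
    gap = male_mean - female_mean,
    females_earn_more = female_mean > male_mean
  )
}

# Apply the helper to each industry and stack the one-row results.
industries <- unique(toy_jobs$industry_broad)
gap_table <- do.call(rbind, lapply(industries, industry_gap, jobs = toy_jobs))
print(gap_table)
```

A single helper of this kind would replace the repeated filter, `data.frame`, and `summary` steps in each industry tab, and the flag column makes any outlier industry easy to spot.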

Healthcare Practitioners and Technical

Healthcare Practitioners and Technical Industry

The Healthcare Practitioners and Technical industry agrees with our hypothesis. The average median earnings for females is $68,051 and the average median earnings for males is $81,487, which is about a $13,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the healthcare industry making more than others because of years of schooling required, difficulty of job, etc. The outliers are the most significant for males in this industry.

### Box Plot for Females and Males In Healthcare Industry ###

Healthcare_Industry <- jobs_gender %>%
  filter(industry_broad == c("Healthcare Practitioners and Technical"))
x <- data.frame(Total_Earnings = Healthcare_Industry$total_earnings, 
                Female_Earnings = Healthcare_Industry$total_earnings_female, 
                Male_Earnings = Healthcare_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value,
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Healthcare Industry") +
  coord_flip()

summary(x)  
##  Total_Earnings   Female_Earnings  Male_Earnings   
##  Min.   : 31530   Min.   : 31126   Min.   : 35640  
##  1st Qu.: 46446   1st Qu.: 44970   1st Qu.: 49939  
##  Median : 62022   Median : 60260   Median : 71214  
##  Mean   : 74269   Mean   : 68051   Mean   : 81487  
##  3rd Qu.: 90725   3rd Qu.: 81794   3rd Qu.:101072  
##  Max.   :201542   Max.   :166388   Max.   :231420

Sales and Office

The Sales and Office industry agrees with our hypothesis. The average median earnings for females is $37,106 and the average median earnings for males is $44,987, which is about an $8,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the sales industry making more than others because of years of schooling required, difficulty of job, etc. The outliers are the most significant for males in this industry.

### Box Plot for Females and Males In Sales Industry ###


Sales_Industry <- jobs_gender %>%
  filter(industry_broad == c("Sales and Office"))
x <- data.frame(Total_Earnings = Sales_Industry$total_earnings, 
                Female_Earnings = Sales_Industry$total_earnings_female, 
                Male_Earnings = Sales_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Sales Industry") +
  coord_flip()

summary(x)
##  Total_Earnings   Female_Earnings Male_Earnings   
##  Min.   : 20251   Min.   :19688   Min.   : 21105  
##  1st Qu.: 31842   1st Qu.:30459   1st Qu.: 35609  
##  Median : 37217   Median :35631   Median : 41366  
##  Mean   : 40359   Mean   :37106   Mean   : 44987  
##  3rd Qu.: 46558   3rd Qu.:41381   3rd Qu.: 52831  
##  Max.   :111522   Max.   :90274   Max.   :115432

Service

Service Industry

The Service industry agrees with our hypothesis. The average median earnings for females is $31,988 and the average median earnings for males is $36,644, which is about a $5,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the service industry making more than others because of years of schooling required, difficulty of job, etc. The outliers are the most significant for females in this industry.

### Box Plot for Females and Males In Service Industry ###

Service_Industry <- jobs_gender %>%
  filter(industry_broad == c("Service"))
x <- data.frame(Total_Earnings = Service_Industry$total_earnings, 
                Female_Earnings = Service_Industry$total_earnings_female, 
                Male_Earnings = Service_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Service Industry") +
  coord_flip()

summary(x)
##  Total_Earnings  Female_Earnings  Male_Earnings  
##  Min.   :17266   Min.   : 16771   Min.   :12147  
##  1st Qu.:24662   1st Qu.: 22291   1st Qu.:26320  
##  Median :30422   Median : 28384   Median :31799  
##  Mean   :34452   Mean   : 31988   Mean   :36644  
##  3rd Qu.:40748   3rd Qu.: 38088   3rd Qu.:41640  
##  Max.   :90571   Max.   :100508   Max.   :90912

Business, Management, and Financial

Business, Management, and Financial Industry

The Business, Management, and Financial industry agrees with our hypothesis. The average median earnings for females is $59,070 and the average median earnings for males is $73,717, which is about a $15,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the business industry making more than others because of years of schooling required, difficulty of job, etc. The outliers are the most significant for females in this industry due to there being outliers on both ends of the plot.

### Box Plot for Females and Males In Business Industry ###

Business_Industry <- jobs_gender %>%
  filter(industry_broad == c("Management, Business, and Financial"))
x <- data.frame(Total_Earnings = Business_Industry$total_earnings, 
                Female_Earnings = Business_Industry$total_earnings_female, 
                Male_Earnings = Business_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Business Industry") +
  coord_flip()

summary(x)
##  Total_Earnings   Female_Earnings  Male_Earnings   
##  Min.   : 36471   Min.   : 25310   Min.   : 41164  
##  1st Qu.: 53093   1st Qu.: 49981   1st Qu.: 60928  
##  Median : 62192   Median : 56810   Median : 71394  
##  Mean   : 65565   Mean   : 59070   Mean   : 73717  
##  3rd Qu.: 73425   3rd Qu.: 66008   3rd Qu.: 82328  
##  Max.   :130293   Max.   :131780   Max.   :141108

Computer, Engineering, and Science

Computer, Engineering, and Science Industry

The Computer, Engineering, and Science industry agrees with our hypothesis. The average median earnings for females is $69,427 and the average median earnings for males is $80,191, which is about a $11,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the science industry making more than others because of years of schooling required, difficulty of job, etc. The outliers are the most significant for males in this industry.

### Box Plot for Females and Males In Engineering Industry ###

Engineering_Industry <- jobs_gender %>%
  filter(industry_broad == c("Computer, Engineering, and Science"))
x <- data.frame(Total_Earnings = Engineering_Industry$total_earnings, 
                Female_Earnings = Engineering_Industry$total_earnings_female, 
                Male_Earnings = Engineering_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Engineering Industry") +
  coord_flip()

summary(x)
##  Total_Earnings   Female_Earnings  Male_Earnings   
##  Min.   : 40464   Min.   : 33376   Min.   : 23794  
##  1st Qu.: 61985   1st Qu.: 56798   1st Qu.: 67164  
##  Median : 76971   Median : 68925   Median : 81388  
##  Mean   : 76536   Mean   : 69427   Mean   : 80191  
##  3rd Qu.: 90354   3rd Qu.: 80889   3rd Qu.: 91855  
##  Max.   :141359   Max.   :120253   Max.   :150247

Production, Transportation, Material Moving

Production, Transportation, Material Moving Industry

The Production, Transportation, Material Moving industry agrees with our hypothesis. The average median earnings for females is $32,438 and the average median earnings for males is $40,769, which is about an $8,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the production industry making more than others because of years of schooling required, difficulty of job, etc. The outliers are the most significant for females in this industry.

### Box Plot for Females and Males In Production Industry ###

Production_Industry <- jobs_gender %>%
  filter(industry_broad == c("Production, Transportation, and Material Moving"))
x <- data.frame(Total_Earnings = Production_Industry$total_earnings, 
                Female_Earnings = Production_Industry$total_earnings_female, 
                Male_Earnings = Production_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Production Industry") +
  coord_flip()

summary(x)
##  Total_Earnings   Female_Earnings  Male_Earnings   
##  Min.   : 20726   Min.   :  7447   Min.   : 21536  
##  1st Qu.: 29066   1st Qu.: 24268   1st Qu.: 31002  
##  Median : 35408   Median : 27883   Median : 37346  
##  Mean   : 38894   Mean   : 32438   Mean   : 40769  
##  3rd Qu.: 44303   3rd Qu.: 36241   3rd Qu.: 45905  
##  Max.   :102155   Max.   :130660   Max.   :102479

Natural Resources, Construction, and Maintenance

Natural Resources, Construction, and Maintenance Industry

The Natural Resources, Construction, and Maintenance industry agrees with our hypothesis. The average median earnings for females is $38,549 and the average median earnings for males is $43,661, which is about a $5,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the construction industry making more than others because of years of schooling required, difficulty of job, etc. The outliers are the most significant for females in this industry.

### Box Plot for Females and Males In Construction Industry ###

Construction_Industry <- jobs_gender %>%
  filter(industry_broad == c("Natural Resources, Construction, and Maintenance"))
x <- data.frame(Total_Earnings = Construction_Industry$total_earnings, 
                Female_Earnings = Construction_Industry$total_earnings_female,
                Male_Earnings = Construction_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Construction Industry") +
  coord_flip()

summary(x)
##  Total_Earnings  Female_Earnings  Male_Earnings  
##  Min.   :20420   Min.   : 11080   Min.   :22957  
##  1st Qu.:34505   1st Qu.: 29752   1st Qu.:35032  
##  Median :41646   Median : 35580   Median :41945  
##  Mean   :43232   Mean   : 38549   Mean   :43661  
##  3rd Qu.:50537   3rd Qu.: 43863   3rd Qu.:50752  
##  Max.   :85914   Max.   :158929   Max.   :85807
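To compare the industries side by side, the means printed in the seven summary tables can be collected into one table and the gap and earnings ratio computed from them. The sketch below copies the `Mean` values for `Female_Earnings` and `Male_Earnings` from those tables and applies the earnings ratio formula from the introduction to the means. The table name `industry_means` is an assumption, and the Education, Legal, Community Service, Arts, and Media industry is left out because no summary table is shown for it.

```r
# Mean female and male earnings copied from the summary() output for each industry.
industry_means <- data.frame(
  industry = c("Healthcare Practitioners and Technical",
               "Sales and Office",
               "Service",
               "Management, Business, and Financial",
               "Computer, Engineering, and Science",
               "Production, Transportation, and Material Moving",
               "Natural Resources, Construction, and Maintenance"),
  female_mean = c(68051, 37106, 31988, 59070, 69427, 32438, 38549),
  male_mean   = c(81487, 44987, 36644, 73717, 80191, 40769, 43661)
)

# Dollar gap and female-to-male earnings ratio, both computed from the means.
industry_means$gap <- industry_means$male_mean - industry_means$female_mean
industry_means$earnings_ratio <-
  round(industry_means$female_mean / industry_means$male_mean, 3)

# Largest gap first.
industry_means <- industry_means[order(-industry_means$gap), ]
print(industry_means, row.names = FALSE)
```

Sorting by the dollar gap and by the ratio can give different orderings, since a given dollar gap is a smaller share of earnings in a higher-paying industry.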

2. Wage Gap by Age

Summary

The second aspect of the wage gap that we wanted to look at was whether the gap varies across different age groups. With the wage gap being a historical problem, we wanted to start by looking at how it has been trending across all age groups. From the plot below you can see that the wage gap is certainly present; however, it is trending in a positive direction.

ER_Overall <- ggplot(data = earnings_female, aes(x = year, y = earnings_ratio)) + 
  geom_point()+
  geom_smooth(se = FALSE) +
  scale_y_continuous(name = "Earnings Ratio") +
  scale_x_continuous(name = "Year") +
  ggtitle("Overall Earnings Ratio",
          subtitle = "Trend of all Age Groups from 1979 to 2011")
ER_Overall
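One way to put a number on the direction of the trend is the slope of a straight-line fit of the earnings ratio on the year, computed separately for each age group. The sketch below is only an illustration on made-up values: `toy_ratio`, the `age_group` column, and the ratio values (assumed here to be expressed in percent) are hypothetical, and a linear fit is a simplification of the smoothed curve drawn by `geom_smooth()`.

```r
# Made-up earnings ratios for two age groups; not values from the dataset.
toy_ratio <- data.frame(
  year = rep(c(1980, 1990, 2000, 2010), times = 2),
  age_group = rep(c("20-24 years", "55-64 years"), each = 4),
  earnings_ratio = c(78, 84, 90, 93, 58, 62, 68, 75)
)

# Slope of earnings_ratio on year for each age group: the average change in
# the ratio per year, where a positive value means the gap is narrowing.
slopes <- sapply(
  split(toy_ratio, toy_ratio$age_group),
  function(d) coef(lm(earnings_ratio ~ year, data = d))[["year"]]
)
print(round(slopes, 2))
```

Comparing the slopes across age groups would show which groups are closing the gap fastest, which the single overall plot cannot show.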

3. Overall Female Employment

Our last analysis looked at the employment status of males and females from 1968 to 2016 to see whether the ratio of part-time to full-time workers has changed. From the plot below you can see that the percentages of full-time and part-time females are the same in 2016 as they were in 1968, and they stayed relatively level during that period.

The one change that can be seen from the plot is the slight decrease in full-time male workers over the course of this period. We have found two main factors that may have caused this change. First, as females continue to take a more prominent role in society, some males are now playing the role of the stay-at-home parent. This is not to say that fewer males are working overall, but it could lead to more of them assuming part-time roles rather than full-time ones. Second, the biggest decrease in full-time male employment came around 2008 and the recession. During this time period, a lot of people, especially men, lost their jobs. We have seen through some of the other data that there are a lot more males working than females, so a downturn of this kind can be expected to have a greater effect on males than on females.

employed_gender %>% 
  ggplot(aes(x = year)) +
  geom_line(aes(y = full_time_female), color = "red2") +
  geom_line(aes(y = full_time_male), color = "blue") +
  geom_line(aes(y = part_time_female), color = "red2") +
  geom_line(aes(y = part_time_male), color = "blue") +
  scale_y_continuous(name = "Percent") +
  scale_x_continuous(name = "Year") +
  annotate("text", x = 1968, y = 82, label = "Full-time Male = 92.2%",
           color = "blue", hjust = 0, size = 3) +
   annotate("text", x = 1968, y = 68, label = "Full-time Female = 75.1%",
           color = "red2", hjust = 0, size = 3) +
   annotate("text", x = 1968, y = 32, label = "Part-time Female = 24.9%",
           color = "red2", hjust = 0, size = 3) +
   annotate("text", x = 1968, y = 14, label = "Part-time Male = 7.8%",
           color = "blue", hjust = 0, size = 3) +
   annotate("text", x = 2005, y = 82, label = "Full-time Male = 87.6%",
           color = "blue", hjust = 0, size = 3) +
   annotate("text", x = 2005, y = 68, label = "Full-time Female = 75.1%",
           color = "red2", hjust = 0, size = 3) +
   annotate("text", x = 2005, y = 30, label = "Part-time Female = 24.9%",
           color = "red2", hjust = 0, size = 3) +
   annotate("text", x = 2005, y = 17, label = "Part-time Male = 12.4%",
           color = "blue", hjust = 0, size = 3) +
  ggtitle("Male and Female Full-time & Part-time Employment",
          subtitle = "Change from 1968 to 2016")
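The change described above can be checked with simple arithmetic on the percentages shown in the plot annotations. The sketch below copies the 1968 and 2016 values from those annotation labels; the table name `employment` and its column names are assumptions for illustration.

```r
# Percentages copied from the annotation labels in the plot code above.
employment <- data.frame(
  series = c("full_time_male", "full_time_female",
             "part_time_female", "part_time_male"),
  pct_1968 = c(92.2, 75.1, 24.9, 7.8),
  pct_2016 = c(87.6, 75.1, 24.9, 12.4)
)

# Change from 1968 to 2016 in percentage points.
employment$change_pts <- round(employment$pct_2016 - employment$pct_1968, 1)
print(employment, row.names = FALSE)

# Sanity check: full-time and part-time shares for males sum to 100 in each year.
male_rows <- employment$series %in% c("full_time_male", "part_time_male")
print(colSums(employment[male_rows, c("pct_1968", "pct_2016")]))
```

The female shares do not move between the two endpoints, while the male full-time share falls by the same number of percentage points that the male part-time share rises.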

Conclusion

The main goal of this analysis was to analyze the wage gap in female salaries vs. male salaries. There is no denying that a wage gap has existed in the past and still exists today. There are systemic, societal reasons for this wage gap, such as the fact that women traditionally have been the caretakers of the house. Women only started entering the workforce in large numbers after World War I, and their numbers have been steadily increasing ever since. As our culture has become more progressive and many groups of people have fought for their rights, women have entered the workforce and demanded equal pay.

We wanted to analyze the wage gap in three facets: the wage gap by industry, the wage gap by age, and overall trends in female employment numbers. First, we studied the female vs. male dominated industries. The female dominated industries contain the roles that women first began assuming when they entered the workforce fifty years ago, such as nursing, secretarial work, and teaching. Society only recently began accepting women into other fields like business, math, and science. Diversity is now a cause that most companies champion, so the number of women in these fields is rising, especially for younger generations. There were some outliers in the specific industry category. There were a few occupations within the Business, Management, and Financial industry that were majority female and also had a higher average median salary for women than for men. The wage gap was higher in male dominated industries than in female dominated industries. While males made on average more than females in both categories, the difference in average wage between males and females was about $5,000 in female dominated industries and about $9,000 in male dominated industries.

Next, we analyzed each industry on a deeper level to see whether any industries were outliers in which females made more than males, and whether there were any underlying causes for the wage gap in certain industries. Among the male dominated industries, the largest differences in average median salaries for males and females exist in the Business, Management, and Financial industry and the Computer, Engineering, and Science industry, with gaps of about $15,000 and $11,000 respectively. Because these are male dominated industries, we would expect the gaps to be larger. The summary table for the Healthcare Practitioners and Technical industry, which is female dominated, also shows a gap of about $13,000. Many of the jobs in the business and engineering fields have been historically held by men, especially those of the baby boomer generation. We see in our other analyses that younger women have started asserting themselves more in the workforce right out of college. More women are joining these fields, but the people who have worked in those roles for a while, and who would be making the large salaries, are males.

The next analysis we completed was the wage gap by age. The main conclusion from these graphs is that the disparity in pay between females and males is getting better overall; however, some age groups have been affected more than others. The graphs revealed that the older generations are still seeing a larger gap than the young, college educated females who are entering the work force. This trend also coincides with the historic events that have happened in parallel for females, all the way from getting the right to vote to not being discriminated against because of their sex in the workplace. Based on these trends, the wage gap for females ages 20-34 years old may be nonexistent in the near future.

Finally, the last analysis we completed was for overall female employment. The main conclusion from this graph is that even though we see the wage gap between males and females decreasing, the ratio of full-time to part-time workers among females and males varied only slightly throughout the years. A big reason that these ratios may have stayed the same is that, as time has gone on, the need for jobs has only increased. Just because more females are now in the work force does not mean fewer males are working overall. With technology advancements and other inventions, the need for jobs is higher than ever.