The gender wage gap is a problem that has persisted for many years. In today’s economy, unemployment is extremely low, yet political tensions are nearing an all-time high. Despite the many other social issues across our country, the debate over the gender wage gap carries on.
Given today’s tense political climate, it is unfair to say that all females are being monetarily discriminated against in the workforce. Many companies, large and small, have incorporated the wage gap problem into their social responsibility initiatives. In recent years, as more and more companies work to close the wage gap, we have seen great strides in reducing it.
Despite these strides, the gender wage gap has not been eradicated. As females continue to play a larger and more important role in the workforce, it is pertinent to understand the underlying factors that caused the wage gap so that it can be prevented in the future.
We are going to analyze the historical trends of the wage gap and identify the underlying forces that may have been the root cause. We will look at trends in employment and salary differences throughout history across numerous industries and age groups for both males and females.
For our project we will use three different datasets: two historical datasets originating from the Bureau of Labor and one from the Census Bureau. All three datasets focus primarily on the earnings ratio of females compared to males. The formula for the earnings ratio is below: \[ \text{Earnings Ratio} = \frac{\text{Female Median Earnings}}{\text{Male Median Earnings}} \] Additionally, the “employed_gender” dataset looks at the difference between full-time and part-time employment each year for males, females, and overall from 1968 - 2016.
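As a quick illustration of the formula above, the sketch below computes the earnings ratio for two made-up median earnings figures. The helper name earnings_ratio and both input values are hypothetical and are not taken from any of the three datasets.

```r
# Hypothetical helper: ratio of female to male median earnings
earnings_ratio <- function(female_median, male_median) {
  female_median / male_median
}

# Made-up illustration values, not taken from the datasets
female_median <- 40000
male_median <- 50000

ratio <- earnings_ratio(female_median, male_median)
ratio        # 0.8, i.e. the female median is 80% of the male median
ratio * 100  # the same ratio expressed as a percentage
```

A ratio below 1 (or below 100 when expressed as a percentage) means the female median is lower than the male median.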
Before identifying the trends of the gender wage gap, we must address the limitations presented by the data. The two major limitations are as follows:
Our three approaches to understanding the gender wage gap are found below:
Wage Gap by Industry: Our dataset “jobs_gender” provides great detail on the different occupations held by both males and females from the years 2013 - 2016. The data has been prepared so that each occupation is listed under an Industry_Broad and an Industry_Specific category to assist with the industry analysis. We will filter each of the industries to see which of them have a majority of employment from males vs. females. From there, we plan to determine if the wage gap in male dominated industries is higher than the wage gap present in female dominated industries. In general, we want to find out if certain industries present larger wage gaps than others, and if so, whether there are underlying causes.
Wage Gap by Age: Our second approach is to analyze the wage gap by age group. Our dataset “earnings_female” breaks down the Earnings Ratio for seven different age groups from the years 1979 - 2011. We will test to see if certain age groups are more susceptible to the wage gap than others.
Overall Female Employment: In our last analysis, we will pull everything together from the first two independent analyses as well as data from our last dataset “employed_gender”. With all of this data, we will observe the overall trend of the wage gap as well as the increasing presence of females in the workforce. The goal of this analysis is to determine if these trends follow along with social movements that have happened over the past 3-5 decades for females.
The mission of our project is to show the consumer how the wage gap has changed over time based on different factors (industry, age group, employment numbers, etc.) and whether the wage gap is improving. By looking at historical and current events, we want to allow the consumer of the data to determine whether they believe that the gender wage gap is actually improving.
In the end, we hope to provide clarity on the industries and age groups that have historically been affected the most by the gender wage gap. Hopefully, this will inspire the consumers of this data to make a positive change for the future of these afflicted groups.
For this project we will use many of the standard packages for cleaning and visualizing data. Most of these packages are commonly used for other data manipulation and visualization work, so few of them will need to be installed strictly for this analysis.
A few of the packages that may need to be installed by the user include ggthemes, magick, and plotly. ggthemes extends ggplot2 but is a separate package that is not installed with it. The other two packages are supplementary visualization tools that we will use for our analysis.
## Load Required Packages ##
library(tidyverse) #Use to tidy data
library(readr) #Use to easily import delimited data
library(dplyr) #Use to manipulate data
library(tibble) #Use to manipulate data
library(magrittr) #Use to insert pipe operators
library(DT) #Use to create functional tables in HTML
library(knitr) #Use to create dynamic report generation
library(rmarkdown) #Use to convert R Markdown documents into a variety of formats
library(ggthemes) #Use to implement themes across report
library(ggrepel) #Use to label data
library(ggplot2) #Use to create visualizations
library(plotly) #Use to create dynamic plotting
library(gridExtra) #Use to arrange plots
library(reshape2) #Use to transform data frames
We have three different sets of data that we are using to analyze the wage gap. However, the timelines of these datasets do not overlap, so we will not be able to use the data in aggregate. Instead, we have elected to treat each dataset on its own and clean each of them individually. Our cleaning and preparation process for each can be found below:
As previously mentioned, this data originated from the Bureau of Labor. The original data contains 3 variables and 264 observations, spans the years 1979 - 2011, and contains the earnings ratio for seven unique age groups across each year:
To gather more information regarding the dataset, click here
The first thing we will want to do is import the original csv file using the read_csv function:
## Import the Data ##
earnings_female <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/earnings_female.csv")
We examined and viewed the original data below:
datatable(head(earnings_female,50))
After importing the data and doing the exploratory analysis we realized that we should change a few aspects to keep the data consistent.
We standardized the column names and, in the age_group category, renamed the value of Total, 16 years and older to Total to make the data easier to understand.
names(earnings_female) <- c("year","age_group","earnings_ratio")
earnings_female$age_group[earnings_female$age_group == "Total, 16 years and older"] <- "Total"
Variable.type <- lapply(earnings_female,class)
Variable.desc <- c("Year", "Age group", "Female wages as a percent of male wages, which is the earnings ratio of females")
Variable.name1 <- colnames(earnings_female)
data.desc <- as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| year | numeric | Year |
| age_group | character | Age group |
| earnings_ratio | numeric | Female wages as a percent of male wages, which is the earnings ratio of females |
library(DT)
datatable(head(earnings_female,50))
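A variable table like the one above can also be assembled with base R only. The sketch below is one possible approach, not the method used in this report: the toy data frame, its values, and the name data_dictionary are made up for illustration.

```r
# Toy stand-in for earnings_female (made-up values, for illustration only)
toy <- data.frame(
  year = c(2000, 2001),
  age_group = c("Total", "Example group"),
  earnings_ratio = c(70, 71),
  stringsAsFactors = FALSE
)

# vapply() returns a plain character vector with one class name per column
data_dictionary <- data.frame(
  variable = names(toy),
  type = vapply(toy, function(col) class(col)[1], character(1)),
  description = c("Year", "Age group", "Female wages as a percent of male wages"),
  row.names = NULL
)

data_dictionary
```

Because vapply() returns a character vector rather than a list, the resulting table holds ordinary character columns.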
In the clean dataset, the range of years remains the same but the age groups are now:
The earnings ratio column ranges from 56.8% to 95.4%, reaffirming that a wage gap does in fact exist. We will analyze these values in our exploratory analysis. Additionally, each of the variables contained in this dataset is a primary variable for our analysis, so we will not remove any of the observations from this table.
As previously mentioned, this data originated from the Census Bureau. The original data contains 12 variables and 2088 observations that have dates ranging from 2013 - 2016. The data is centered around employment numbers and earning percentages for male and female. The column names are below:
To view the original data, click here
The first thing we did was import the original csv file that we were provided using read_csv:
## Import the Data ##
jobs_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv")
We examined and viewed the original data below:
datatable(head(jobs_gender,50))
After viewing the data, we made a few changes, which are found below:
Columns affected: Industry_Broad, Industry_Specific, earnings_ratio_female
names(jobs_gender) <- c("year","occupation","industry_broad","industry_specific",
"total_workers","workers_male","workers_female","percent_female",
"total_earnings","total_earnings_male","total_earnings_female",
"earnings_ratio_female")
colSums(is.na(jobs_gender))
## year occupation industry_broad
## 0 0 0
## industry_specific total_workers workers_male
## 0 0 0
## workers_female percent_female total_earnings
## 0 0 0
## total_earnings_male total_earnings_female earnings_ratio_female
## 4 65 846
After we initially filled in the NA values for “total_earnings_female” and “total_earnings_male”, we realized that the filled values included some negative values. There were also values of 0 or very low numbers in the earnings columns even though there were hundreds of workers in the occupation. Therefore, we instead remove every observation with an NA value for female or male earnings to keep these values from skewing our data further into our analysis.
## Mutate the New Column for Earnings Ratio
jobs_gender <-
  jobs_gender %>%
  # Columns can be referenced by name inside mutate()
  mutate(Earnings_Ratio = total_earnings_female / total_earnings_male)
#Remove Original Column
jobs_gender <- select(jobs_gender,-c(earnings_ratio_female))
#Removing all observations with NA Values
jobs_gender <- na.omit(jobs_gender)
colSums(is.na(jobs_gender))
## year occupation industry_broad
## 0 0 0
## industry_specific total_workers workers_male
## 0 0 0
## workers_female percent_female total_earnings
## 0 0 0
## total_earnings_male total_earnings_female Earnings_Ratio
## 0 0 0
## Rounding Percentages ##
# round() is vectorized, so the column stays numeric.
# earnings_ratio_female was dropped above and percent_male is never created,
# so percent_female is the only percentage column left to round.
jobs_gender$percent_female <- round(jobs_gender$percent_female, 1)
From this we are able to obtain our clean data set:
Variable.type <- lapply(jobs_gender,class)
Variable.desc <- c("Year", "Specific job/career", "Broad industry of occupation", "Specific industry of occupation", "Total estimated full-time workers above 16 years old", "Estimated full-time male workers above 16", "Estimated full-time female workers above 16","The percent of females in a specific occupation","Total estimated median earnings for full-time workers above 16 years old", "Estimated median earnings for males above 16 years old", "Estimated median earnings for females above 16 years old", "Female wages as a percent of male wages, which is the earnings ratio of females")
Variable.name1 <- colnames(jobs_gender)
data.desc <- as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| year | numeric | Year |
| occupation | character | Specific job/career |
| industry_broad | character | Broad industry of occupation |
| industry_specific | character | Specific industry of occupation |
| total_workers | numeric | Total estimated full-time workers above 16 years old |
| workers_male | numeric | Estimated full-time male workers above 16 |
| workers_female | numeric | Estimated full-time female workers above 16 |
| percent_female | numeric | The percent of females in a specific occupation |
| total_earnings | numeric | Total estimated median earnings for full-time workers above 16 years old |
| total_earnings_male | numeric | Estimated median earnings for males above 16 years old |
| total_earnings_female | numeric | Estimated median earnings for females above 16 years old |
| Earnings_Ratio | numeric | Female wages as a percent of male wages, which is the earnings ratio of females |
datatable(head(jobs_gender,50))
Now with the clean data, we have a uniform naming style and additional columns. All of the data from this dataset will be used; however, we have classified each variable as either a primary or a secondary variable. The new column names and variables are listed below:
After cleaning the data, we have removed the observations with NA values for total_earnings_male and total_earnings_female and replaced earnings_ratio_female with the recalculated Earnings_Ratio, giving us 2019 complete observations and still 12 variables.
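The count of 2019 complete observations is consistent with the missing-value counts printed earlier: 2088 - 4 - 65 = 2019, which holds when no occupation is missing both male and female earnings. The sketch below shows the same bookkeeping on a toy data frame; all values are made up for illustration.

```r
# Toy data frame with made-up values; NA marks a missing median earning
toy <- data.frame(
  occupation = c("A", "B", "C", "D", "E"),
  total_earnings_male = c(50000, NA, 42000, 61000, 38000),
  total_earnings_female = c(45000, 39000, NA, 55000, NA)
)

# Count missing values per column, as colSums(is.na(...)) does in the report
na_counts <- colSums(is.na(toy))

# na.omit() drops every row that has an NA in any column
toy_complete <- na.omit(toy)

nrow(toy)           # 5 rows before
nrow(toy_complete)  # 2 rows after: only A and D are complete
```

Here 5 - 1 - 2 = 2 because no row is missing both earnings columns at once.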
As previously mentioned, this data originated from the Bureau of Labor. The dataset shows the percentage of employed people working full time and part time for the years 1968 - 2016. The dataset contains 7 variables and 49 observations with the column names shown below:
The first thing we did was import the original csv file using the read_csv function:
## Import the Data ##
employed_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/employed_gender.csv")
Our clean data is below:
Variable.type <- lapply(employed_gender,class)
Variable.desc <- c("Year", "Percent of total employed people usually working full-time", "Percent of total employed people usually working part time", "Percent of employed females usually working full time", "Percent of employed females usually working part time", "Percent of employed males usually working full time", "Percent of employed men usually working part time")
Variable.name1 <- colnames(employed_gender)
data.desc <- as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
| Variable Name | Data Type | Variable Description |
|---|---|---|
| year | numeric | Year |
| total_full_time | numeric | Percent of total employed people usually working full-time |
| total_part_time | numeric | Percent of total employed people usually working part time |
| full_time_female | numeric | Percent of employed females usually working full time |
| part_time_female | numeric | Percent of employed females usually working part time |
| full_time_male | numeric | Percent of employed males usually working full time |
| part_time_male | numeric | Percent of employed men usually working part time |
datatable(head(employed_gender,50))
Compared to our other datasets, this data is very usable without a lot of cleaning. We do not have any missing values or duplicate values, and all column names are written with consistent snake_case formatting.
For this reason, we elected not to make any initial changes to this data.
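The checks described above (missing values, duplicate rows, snake_case names) could be run along the following lines. This is a minimal sketch on a toy data frame with made-up percentages, not the exact checks we ran; the name toy_employed is hypothetical.

```r
# Toy stand-in for employed_gender with made-up percentages
toy_employed <- data.frame(
  year = c(1968, 1969, 1970),
  full_time_female = c(75, 74, 73),
  part_time_female = c(25, 26, 27)
)

has_missing <- anyNA(toy_employed)              # any NA anywhere in the table?
n_duplicates <- sum(duplicated(toy_employed))   # number of fully duplicated rows
snake_case <- all(grepl("^[a-z0-9_]+$", names(toy_employed)))  # lowercase/underscore names only

has_missing    # FALSE
n_duplicates   # 0
snake_case     # TRUE
```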
For the wage gap by industry analysis, we will study the breakdowns of earnings for each of the eight industries to observe any patterns in the respective wage gaps. First, we are interested in observing the industries that are dominated by females vs. males. Once we determine which industries fall into which category, we will analyze the differences in the average female and male median earnings for both female and male dominated industries. We will take this approach a step further and dissect the outliers, averages, minimums, maximums, etc. for each industry and draw conclusions on the underlying factors, if any, for these numbers and patterns. Our hypothesis before beginning this analysis is that even in female dominated industries, the average pay for females will be lower than that of males. Our goal is to identify if this hypothesis is true, if there are any industries that are outliers, and if there are underlying factors that explain the wage gap for each industry and as a whole.
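In this first graph we look at the 8 broad industries to see which of them have a majority of female employment, based on the avg_females field that we temporarily created. Any industry in which women make up 50% or more of workers is a female dominated industry. From this graph, you can see that females dominate 3 industries: Healthcare Practitioners and Technical; Education, Legal, Community Service, Arts, and Media; and Sales and Office.
On the other hand, males dominate 5 industries: Service; Management, Business, and Financial; Computer, Engineering, and Science; Production, Transportation, and Material Moving; and Natural Resources, Construction, and Maintenance.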
The split between female and male dominated industries is for the most part not surprising given the historical stigma surrounding each of the industries. Women have traditionally held roles as medical assistants, nurses, teachers, childcare workers, and administrative assistants, while men have traditionally held roles as production workers, mechanics, analysts, engineers, and workers dealing with any type of natural resource. One industry that surprised us is the service industry, but its percentage of females is 49%, so it is very close. One reason for the smaller than expected percentage is that firefighters, police officers, and other justice system occupations are listed under service, and those careers are heavily dominated by men.
### Female vs. Male Dominated Industries ###
females_vs_males <- jobs_gender %>%
group_by(industry_broad) %>%
summarise(avg_females = sum(workers_female) / sum(total_workers),
avg_males = sum(workers_male) / sum(total_workers)) %>%
arrange(desc(avg_females))
ggplot(data = females_vs_males,
aes(x = reorder(industry_broad, +avg_females),
y = (avg_females))) +
geom_bar(stat = "identity",
aes(fill = avg_females >= 0.5)) +
scale_fill_discrete(name = "% Of Females", labels = c("< 50%", " >= 50%")) +
scale_y_continuous(name = "% Female", labels = function(y) paste0(y*100,"%")) +
scale_x_discrete(name = "Broad Industry", labels = function(x) str_wrap(x, width = 30)) +
theme(axis.text.x = element_text(hjust = 0)) +
ggtitle("Female vs. Male Dominated Industries",
subtitle = "Broad Industries having more than 50% of Females") +
coord_flip()
Next, we wanted to look a little deeper at the specific industries in the same fashion to identify any specific industries where females dominate even though they do not dominate the broad industry. From the plot below you can see that females dominate the healthcare support and personal care and service categories, which fall under the Service broad industry. However, both of these categories are closely tied to the medical field, which is an industry that females dominate. Also, females dominate the business and financial operations field, which falls under the Management, Business, and Financial broad industry. After further analysis into the occupations within this category, we discovered that the majority of these roles include marketing analysts, event planners, and human resources workers. These roles are traditionally held by women. A few financial specialist and accounting roles that are predominantly held by women defy this trend, and we consider them outliers.
females_vs_males_1 <- jobs_gender %>%
group_by(industry_specific) %>%
summarise(avg_females = sum(workers_female) / sum(total_workers),
avg_males = sum(workers_male) / sum(total_workers)) %>%
arrange(desc(avg_females))
ggplot(data = females_vs_males_1,
aes(x = reorder(industry_specific, +avg_females),
y = (avg_females))) +
geom_bar(stat = "identity", aes(fill = avg_females >= 0.5)) +
scale_fill_discrete(name = "% Of Females", labels = c("< 50", ">= 50")) +
scale_y_continuous(name = "% Female", labels = function(y) paste0(y*100,"%")) +
xlab("Specific Industry") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ggtitle("Female vs. Male Dominated Industries",
subtitle = "Specific Industries having more than 50% of Females") +
coord_flip()
To further confirm female versus male dominated industries, we wanted to look at the top 10 occupations for women across all industries and see if any of the occupations fell outside of their three main industries. Two of the occupations, medical transcriptionists and childcare workers, are in the service industry. But, their fields are closely related to the industries that females dominate, so we do not consider them outliers.
females_vs_males_2 <- jobs_gender %>%
group_by(occupation) %>%
summarise(avg_females = sum(workers_female) / sum(total_workers),
avg_males = sum(workers_male) / sum(total_workers)) %>%
top_n(10, avg_females)
ggplot(data = females_vs_males_2,
aes(x = reorder(occupation, +avg_females),
y = (avg_females))) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(name = "% Female", labels = function(y) paste0(y*100,"%")) +
scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
ggtitle("Top Female Occupations",
subtitle = "10 Highest Female Dominated Occupations Across Industries") +
coord_flip()
In addition to seeing the top 10 occupations for females, we also wanted to see the 10 occupations in which females have the lowest presence. From the plot below you can see that all of these occupations fall within male dominated industries.
females_vs_males_3 <- jobs_gender %>%
group_by(occupation) %>%
summarise(avg_females = sum(workers_female) / sum(total_workers),
avg_males = sum(workers_male) / sum(total_workers)) %>%
top_n(-10, avg_females)
ggplot(data = females_vs_males_3,
aes(x = reorder(occupation, -avg_females),
y = (avg_females))) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
scale_y_continuous(name = "% Female", labels = function(y) paste0(y*100,"%"), limits = c(0,.1)) +
ggtitle("Lowest Female Occupations",
subtitle = "10 Lowest Occupations Held by Females Across Industries") +
coord_flip()
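The two blocks above use top_n(), which still works but is marked as superseded in current dplyr releases in favor of slice_max() and slice_min(). As a minimal sketch of the equivalent selection, assuming dplyr 1.0.0 or later, the example below uses a toy table with made-up worker counts; toy_jobs and share are hypothetical names.

```r
library(dplyr)

# Toy occupation-level data with made-up worker counts
toy_jobs <- tibble(
  occupation = c("A", "B", "C", "D"),
  workers_female = c(90, 10, 50, 30),
  total_workers = c(100, 100, 100, 100)
)

# Female share per occupation, computed as in the report
share <- toy_jobs %>%
  group_by(occupation) %>%
  summarise(avg_females = sum(workers_female) / sum(total_workers))

# Two occupations with the highest and the lowest female share
highest <- slice_max(share, avg_females, n = 2)
lowest <- slice_min(share, avg_females, n = 2)

highest$occupation  # "A" "C"
lowest$occupation   # "B" "D"
```

Both functions return rows already sorted by the ordering variable, which top_n() does not guarantee.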
To begin the analysis of the earnings per industry, we observed the density plots of the total earnings per industry category. We will analyze each industry’s summary statistics in our deep-dive industry section below the size and earnings overviews. One main conclusion from this plot is that the skews are more severe for certain industries due to outliers, which we will examine further in our analysis below.
### Density Plot of Earnings ###
average_median_earnings <- jobs_gender %>%
group_by(industry_broad) %>%
summarise(avg_per_industry = mean(total_earnings)) %>%
arrange(desc(avg_per_industry))
ggplot(data = jobs_gender,
aes(x = total_earnings,
color = industry_broad)) +
geom_density(aes(fill = industry_broad), alpha = 0.3) +
xlab("Total Earnings") +
ylab("Density") +
ggtitle("Distribution of Total Earnings Per Industry",
subtitle = "Density plot Earnings Across Broad Industry")
To provide an overview of the wage gap regardless of industry, we wanted to briefly show the overall trend in earnings between males and females. Females have an average median earnings of $44,582 and males have an average median earnings of $53,218, while the average across all workers is $49,640. Therefore, our dataset informs us that on average, men make about $8,600 more than women. There are outliers in each plot, but the outliers in the male earnings are the most significant. Also, there are about 56 million more men than women documented as working full time in our dataset. Even though this is a large difference, our sample size for the remaining occupations in our data set is large enough for both men and women.
### Median Earnings Per Gender ###
x <- data.frame(total_median_earnings = jobs_gender$total_earnings,
female_median_earnings = jobs_gender$total_earnings_female,
male_median_earnings = jobs_gender$total_earnings_male)
data <- melt(x)
ggplot(data,
aes(x = variable,
y = value,
fill = variable)) +
geom_boxplot() +
theme_bw() +
scale_y_continuous(name = "Total Income", labels = scales::dollar) +
scale_x_discrete(name = "Class of Earnings") +
ggtitle("Median Earnings Per Male & Female",
subtitle = "Distribution of Earnings Showing Outliers Per Gender") +
coord_flip()
summary(jobs_gender$total_earnings)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 17266 32310 44100 49640 60837 201542
summary(jobs_gender$total_earnings_female)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 7447 28852 40154 44582 54715 166388
summary(jobs_gender$total_earnings_male)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12147 35609 46825 53218 65144 231420
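Using the means printed in the summary() output above, the overall gap can be worked out directly. The short sketch below only repeats that arithmetic; the variable names are ours.

```r
# Means copied from the summary() output above
mean_female <- 44582
mean_male <- 53218

gap_dollars <- mean_male - mean_female  # absolute difference in dollars
gap_ratio <- mean_female / mean_male    # earnings ratio as defined earlier

gap_dollars          # 8636
round(gap_ratio, 3)  # 0.838
```

In other words, across the occupations in this dataset the average female median earnings are roughly 84% of the average male median earnings.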
After displaying the average median salaries of females and males overall, we wanted to show the difference in female dominated industries vs. male dominated industries. We previously defined female and male dominated industries, so now we will aggregate the industries that belong to each category. Then, we will calculate the average median earnings for males and females for both the female and male dominated industries. Our original hypothesis, that even in female dominated industries the average pay for females will be lower than that of males, proves to be true from our overview analyses. From the graph, we conclude that females on average make 10% more in female dominated industries. On the other hand, males on average make 10% less in male dominated industries than they do in female dominated industries. Regardless, males on average make more than females in both male and female dominated industries. Also, females make less in male dominated industries than they do in female dominated industries, which we suspected from the start.
### Female & Male Earnings Per Female & Male Dominated Industries ###
# %in% tests membership in the set of names; == would recycle the vector
# element by element and silently drop matching rows.
female_industries <- c("Healthcare Practitioners and Technical", "Education, Legal, Community Service, Arts, and Media", "Sales and Office")
male_industries <- c("Service", "Management, Business, and Financial", "Computer, Engineering, and Science",
                     "Production, Transportation, and Material Moving", "Natural Resources, Construction, and Maintenance")
female_dominated_industries <- jobs_gender %>%
  filter(industry_broad %in% female_industries) %>%
  summarise(Female_Earnings_F = mean(total_earnings_female))
female_dominated_industries_males <- jobs_gender %>%
  filter(industry_broad %in% female_industries) %>%
  summarise(Male_Earnings_F = mean(total_earnings_male))
male_dominated_industries <- jobs_gender %>%
  filter(industry_broad %in% male_industries) %>%
  summarise(Female_Earnings_M = mean(total_earnings_female))
male_dominated_industries_males <- jobs_gender %>%
  filter(industry_broad %in% male_industries) %>%
  summarise(Male_Earnings_M = mean(total_earnings_male))
Earnings <- data.frame(female_median_earnings_1 = female_dominated_industries, male_median_earnings_1 = female_dominated_industries_males, female_median_earnings_2 = male_dominated_industries, male_median_earnings_2 = male_dominated_industries_males)
Earnings_differnce <- melt(Earnings)
ggplot(Earnings_differnce,
aes(x = variable,
y = value,
fill = variable)) +
geom_bar(stat = "identity") +
theme_bw() +
coord_cartesian(ylim = c(25000, 55000)) +
xlab("Earnings by Gender") +
scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
ggtitle("Gender Earnings in Female vs Male Dominated Industries",
subtitle = "Median Earnings of Females and Males in different Industries")
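When a filter compares a column against several category names, element-wise == and membership testing with %in% can give different results, because == recycles the shorter vector along the longer one. The sketch below illustrates the difference on a made-up vector of labels; industry and wanted are hypothetical names.

```r
# Toy vector of industry labels (made-up, for illustration only)
industry <- c("Service", "Sales and Office", "Service", "Sales and Office")
wanted <- c("Sales and Office", "Service")

# == compares element by element, recycling `wanted` along `industry`
by_equality <- industry == wanted

# %in% tests whether each element appears anywhere in `wanted`
by_membership <- industry %in% wanted

by_equality    # FALSE FALSE FALSE FALSE
by_membership  # TRUE TRUE TRUE TRUE
```

Every label here belongs to the wanted set, yet == matches none of them because the positions happen not to line up.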
Now, we want to determine if there are any outlier industries. An industry would be an outlier if it strays from the overall pattern, that is, if females make more than males on average in that industry. We created a boxplot for each industry to provide quick visuals of the differences in male and female median earnings on average. Next, we created summary statistic tables to show the exact difference in earnings. It is important to note that we did not remove any outliers from this dataset. Each of the outliers is important to providing an overall view of the industry and the discrepancies in median earnings between males and females.
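The outlier test described above can also be sketched as a single grouped summary that compares mean female and male median earnings for every industry at once. This is an illustrative alternative, not the per-industry code used in this report; the toy data, its values, and the column names gap and females_earn_more are made up.

```r
library(dplyr)

# Toy stand-in for jobs_gender with made-up earnings
toy_jobs <- tibble(
  industry_broad = c("Service", "Service", "Sales and Office", "Sales and Office"),
  total_earnings_female = c(30000, 34000, 36000, 40000),
  total_earnings_male = c(35000, 37000, 44000, 46000)
)

gap_by_industry <- toy_jobs %>%
  group_by(industry_broad) %>%
  summarise(
    mean_female = mean(total_earnings_female),
    mean_male = mean(total_earnings_male)
  ) %>%
  # An industry would be an outlier if females earn more than males on average
  mutate(gap = mean_male - mean_female,
         females_earn_more = mean_female > mean_male)

gap_by_industry
```

With the real data, any row where females_earn_more is TRUE would be an outlier industry under the definition above.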
The Healthcare Practitioners and Technical industry agrees with our hypothesis. The average median earnings for females is $68,051 and the average median earnings for males is $81,487, which is about a $13,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the healthcare industry making more than others because of years of schooling required, difficulty of job, etc. The outliers are the most significant for males in this industry.
### Box Plot for Females and Males In Healthcare Industry ###
Healthcare_Industry <- jobs_gender %>%
filter(industry_broad == c("Healthcare Practitioners and Technical"))
x <- data.frame(Total_Earnings = Healthcare_Industry$total_earnings,
Female_Earnings = Healthcare_Industry$total_earnings_female,
Male_Earnings = Healthcare_Industry$total_earnings_male)
data <- melt(x)
ggplot(data,
aes(x = variable,
y = value,
fill = variable)) +
geom_boxplot() +
theme_bw() +
xlab("Earnings Group") +
scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
ggtitle("Median Earnings for Females and Males in Healthcare Industry") +
coord_flip()
summary(x)
## Total_Earnings Female_Earnings Male_Earnings
## Min. : 31530 Min. : 31126 Min. : 35640
## 1st Qu.: 46446 1st Qu.: 44970 1st Qu.: 49939
## Median : 62022 Median : 60260 Median : 71214
## Mean : 74269 Mean : 68051 Mean : 81487
## 3rd Qu.: 90725 3rd Qu.: 81794 3rd Qu.:101072
## Max. :201542 Max. :166388 Max. :231420
The Education, Legal, Community Service, Arts, and Media industry agrees with our hypothesis. The average median earnings for females is $46,258 and the average median earnings for males is $54,403, which is about an $8,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the legal industry making more than others because of years of schooling required, difficulty of job, etc. The outliers are the most significant for males in this industry.
### Box Plot for Females and Males In Education Industry ###
Education_Industry <- jobs_gender %>%
filter(industry_broad == c("Education, Legal, Community Service, Arts, and Media"))
x <- data.frame(Total_Earnings = Education_Industry$total_earnings,
Female_Earnings = Education_Industry$total_earnings_female,
Male_Earnings = Education_Industry$total_earnings_male)
data <- melt(x)
ggplot(data,
aes(x = variable,
y = value,
fill = variable)) +
geom_boxplot() +
theme_bw() +
xlab("Earnings Group") +
scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
ggtitle("Median Earnings for Females and Males in Education Industry") +
coord_flip()
summary(x)
## Total_Earnings Female_Earnings Male_Earnings
## Min. : 21125 Min. : 20748 Min. : 25873
## 1st Qu.: 41926 1st Qu.: 38249 1st Qu.: 45094
## Median : 47480 Median : 45081 Median : 50893
## Mean : 49832 Mean : 46258 Mean : 54403
## 3rd Qu.: 52127 3rd Qu.: 50809 3rd Qu.: 57843
## Max. :122073 Max. :102484 Max. :136043
The Sales and Office industry agrees with our hypothesis. The average median earnings for females is $37,106 and the average median earnings for males is $44,987, which is about an $8,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the sales industry making more than others because of years of schooling required, difficulty of job, etc. The outliers are the most significant for males in this industry.
### Box Plot for Females and Males In Sales Industry ###
Sales_Industry <- jobs_gender %>%
filter(industry_broad == c("Sales and Office"))
x <- data.frame(Total_Earnings = Sales_Industry$total_earnings,
Female_Earnings = Sales_Industry$total_earnings_female,
Male_Earnings = Sales_Industry$total_earnings_male)
data <- melt(x)
ggplot(data,
aes(x = variable,
y = value,
fill = variable)) +
geom_boxplot() +
theme_bw() +
xlab("Earnings Group") +
scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
ggtitle("Median Earnings for Females and Males in Sales Industry") +
coord_flip()
summary(x)
## Total_Earnings Female_Earnings Male_Earnings
## Min. : 20251 Min. :19688 Min. : 21105
## 1st Qu.: 31842 1st Qu.:30459 1st Qu.: 35609
## Median : 37217 Median :35631 Median : 41366
## Mean : 40359 Mean :37106 Mean : 44987
## 3rd Qu.: 46558 3rd Qu.:41381 3rd Qu.: 52831
## Max. :111522 Max. :90274 Max. :115432
The Service industry agrees with our hypothesis. The average of the median earnings is $31,988 for females and $36,644 for males, a difference of about $5,000. Each category has several outliers, likely because certain occupations in the service field pay more than others given the years of schooling required, the difficulty of the job, and similar factors. The outliers are most pronounced for females in this industry.
### Box Plot for Females and Males In Service Industry ###
Service_Industry <- jobs_gender %>%
filter(industry_broad == "Service")
x <- data.frame(Total_Earnings = Service_Industry$total_earnings,
Female_Earnings = Service_Industry$total_earnings_female,
Male_Earnings = Service_Industry$total_earnings_male)
data <- melt(x)
ggplot(data,
aes(x = variable,
y = value,
fill = variable)) +
geom_boxplot() +
theme_bw() +
xlab("Earnings Group") +
scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
ggtitle("Median Earnings for Females and Males in Service Industry") +
coord_flip()
summary(x)
## Total_Earnings Female_Earnings Male_Earnings
## Min. :17266 Min. : 16771 Min. :12147
## 1st Qu.:24662 1st Qu.: 22291 1st Qu.:26320
## Median :30422 Median : 28384 Median :31799
## Mean :34452 Mean : 31988 Mean :36644
## 3rd Qu.:40748 3rd Qu.: 38088 3rd Qu.:41640
## Max. :90571 Max. :100508 Max. :90912
The Management, Business, and Financial industry agrees with our hypothesis. The average of the median earnings is $59,070 for females and $73,717 for males, a difference of about $15,000. Each category has several outliers, likely because certain occupations in the business field pay more than others given the years of schooling required, the difficulty of the job, and similar factors. The outliers are most pronounced for females in this industry, since they appear at both ends of the plot.
### Box Plot for Females and Males In Business Industry ###
Business_Industry <- jobs_gender %>%
filter(industry_broad == "Management, Business, and Financial")
x <- data.frame(Total_Earnings = Business_Industry$total_earnings,
Female_Earnings = Business_Industry$total_earnings_female,
Male_Earnings = Business_Industry$total_earnings_male)
data <- melt(x)
ggplot(data,
aes(x = variable,
y = value,
fill = variable)) +
geom_boxplot() +
theme_bw() +
xlab("Earnings Group") +
scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
ggtitle("Median Earnings for Females and Males in Business Industry") +
coord_flip()
summary(x)
## Total_Earnings Female_Earnings Male_Earnings
## Min. : 36471 Min. : 25310 Min. : 41164
## 1st Qu.: 53093 1st Qu.: 49981 1st Qu.: 60928
## Median : 62192 Median : 56810 Median : 71394
## Mean : 65565 Mean : 59070 Mean : 73717
## 3rd Qu.: 73425 3rd Qu.: 66008 3rd Qu.: 82328
## Max. :130293 Max. :131780 Max. :141108
The Computer, Engineering, and Science industry agrees with our hypothesis. The average of the median earnings is $69,427 for females and $80,191 for males, a difference of about $11,000. Each category has several outliers, likely because certain occupations in the science field pay more than others given the years of schooling required, the difficulty of the job, and similar factors. The outliers are most pronounced for males in this industry.
### Box Plot for Females and Males In Engineering Industry ###
Engineering_Industry <- jobs_gender %>%
filter(industry_broad == "Computer, Engineering, and Science")
x <- data.frame(Total_Earnings = Engineering_Industry$total_earnings,
Female_Earnings = Engineering_Industry$total_earnings_female,
Male_Earnings = Engineering_Industry$total_earnings_male)
data <- melt(x)
ggplot(data,
aes(x = variable,
y = value,
fill = variable)) +
geom_boxplot() +
theme_bw() +
xlab("Earnings Group") +
scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
ggtitle("Median Earnings for Females and Males in Engineering Industry") +
coord_flip()
summary(x)
## Total_Earnings Female_Earnings Male_Earnings
## Min. : 40464 Min. : 33376 Min. : 23794
## 1st Qu.: 61985 1st Qu.: 56798 1st Qu.: 67164
## Median : 76971 Median : 68925 Median : 81388
## Mean : 76536 Mean : 69427 Mean : 80191
## 3rd Qu.: 90354 3rd Qu.: 80889 3rd Qu.: 91855
## Max. :141359 Max. :120253 Max. :150247
The Production, Transportation, and Material Moving industry agrees with our hypothesis. The average of the median earnings is $32,438 for females and $40,769 for males, a difference of about $8,000. Each category has several outliers, likely because certain occupations in the production field pay more than others given the years of schooling required, the difficulty of the job, and similar factors. The outliers are most pronounced for females in this industry.
### Box Plot for Females and Males In Production Industry ###
Production_Industry <- jobs_gender %>%
filter(industry_broad == "Production, Transportation, and Material Moving")
x <- data.frame(Total_Earnings = Production_Industry$total_earnings,
Female_Earnings = Production_Industry$total_earnings_female,
Male_Earnings = Production_Industry$total_earnings_male)
data <- melt(x)
ggplot(data,
aes(x = variable,
y = value,
fill = variable)) +
geom_boxplot() +
theme_bw() +
xlab("Earnings Group") +
scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
ggtitle("Median Earnings for Females and Males in Production Industry") +
coord_flip()
summary(x)
## Total_Earnings Female_Earnings Male_Earnings
## Min. : 20726 Min. : 7447 Min. : 21536
## 1st Qu.: 29066 1st Qu.: 24268 1st Qu.: 31002
## Median : 35408 Median : 27883 Median : 37346
## Mean : 38894 Mean : 32438 Mean : 40769
## 3rd Qu.: 44303 3rd Qu.: 36241 3rd Qu.: 45905
## Max. :102155 Max. :130660 Max. :102479
The Natural Resources, Construction, and Maintenance industry agrees with our hypothesis. The average of the median earnings is $38,549 for females and $43,661 for males, a difference of about $5,000. Each category has several outliers, likely because certain occupations in the construction field pay more than others given the years of schooling required, the difficulty of the job, and similar factors. The outliers are most pronounced for females in this industry.
### Box Plot for Females and Males In Construction Industry ###
Construction_Industry <- jobs_gender %>%
filter(industry_broad == "Natural Resources, Construction, and Maintenance")
x <- data.frame(Total_Earnings = Construction_Industry$total_earnings,
Female_Earnings = Construction_Industry$total_earnings_female,
Male_Earnings = Construction_Industry$total_earnings_male)
data <- melt(x)
ggplot(data,
aes(x = variable,
y = value,
fill = variable)) +
geom_boxplot() +
theme_bw() +
xlab("Earnings Group") +
scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
ggtitle("Median Earnings for Females and Males in Construction Industry") +
coord_flip()
summary(x)
## Total_Earnings Female_Earnings Male_Earnings
## Min. :20420 Min. : 11080 Min. :22957
## 1st Qu.:34505 1st Qu.: 29752 1st Qu.:35032
## Median :41646 Median : 35580 Median :41945
## Mean :43232 Mean : 38549 Mean :43661
## 3rd Qu.:50537 3rd Qu.: 43863 3rd Qu.:50752
## Max. :85914 Max. :158929 Max. :85807
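To compare the seven industries side by side, the means reported in the `summary()` outputs above can be collected into one table. The sketch below copies those reported means by hand and derives the dollar gap and a female-to-male ratio. Note that this ratio is taken on the means of occupation-level medians, so it is only an approximation of the earnings ratio defined earlier, not the same statistic.

```r
# Means of occupation-level median earnings, copied from the summary()
# outputs above (Female_Earnings and Male_Earnings "Mean" rows).
industry_means <- data.frame(
  industry = c("Education, Legal, Community Service, Arts, and Media",
               "Sales and Office",
               "Service",
               "Management, Business, and Financial",
               "Computer, Engineering, and Science",
               "Production, Transportation, and Material Moving",
               "Natural Resources, Construction, and Maintenance"),
  female = c(46258, 37106, 31988, 59070, 69427, 32438, 38549),
  male   = c(54403, 44987, 36644, 73717, 80191, 40769, 43661)
)

# Dollar gap and female-to-male ratio of the means.
industry_means$gap   <- industry_means$male - industry_means$female
industry_means$ratio <- round(industry_means$female / industry_means$male, 3)

# Largest gap first.
industry_means <- industry_means[order(-industry_means$gap), ]
print(industry_means, row.names = FALSE)
```

Sorted this way, Management, Business, and Financial comes first with a gap of $14,647 and Service comes last with a gap of $4,656.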
The second aspect of the wage gap that we wanted to examine was whether it varies across age groups. Because the wage gap is a historical problem, we first wanted to see how it has trended across all age groups combined. The plot below shows that the wage gap is certainly present; however, the earnings ratio is trending upward, which means the gap is narrowing.
ER_Overall <- ggplot(data = earnings_female, aes(x = year, y = earnings_ratio)) +
geom_point()+
geom_smooth(se = FALSE) +
scale_y_continuous(name = "Earnings Ratio") +
scale_x_continuous(name = "Year") +
ggtitle("Overall Earnings Ratio",
subtitle = "Trend of all Age Groups from 1979 to 2011")
ER_Overall
The next step was to see whether certain age groups were affected more than others. The interactive plot below shows a few things. First, the younger age groups have a higher earnings ratio than the older ones. Second, the earnings ratios for the 20-24 and 25-34 age groups are increasing much faster than those of the other age groups. While we cannot see it directly in our data, several historical events took place before 1980. In 1963 the Equal Pay Act was signed into law by President John F. Kennedy, and in 1964 Lyndon B. Johnson signed the Civil Rights Act into law. These monumental pieces of legislation allowed females to start entering occupations that had not been open to them before. They also encouraged younger females to continue pursuing education, and these groups moved more quickly toward an earnings ratio of 1. Older females, who had experienced gender wage discrimination most directly, also saw their pay improve under these and later reforms, just at a much slower rate. The plot shows the trend for each age group.
ER_AgeGroup <-
ggplot(data = earnings_female,
aes(x = year,
y = earnings_ratio,
color = age_group)) +
geom_point(size = 1, alpha = .8)+
geom_smooth(size = .8, se = FALSE) +
scale_y_continuous(name = "Earnings Ratio") +
scale_x_continuous(name = "Year") +
ggtitle("Earnings Ratio Per Age Group",
subtitle = "Strength of Upward Trend from 1979 to 2011") +
theme_stata()
ggplotly(ER_AgeGroup)
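The plot above compares age groups visually through smoothed curves. One way to put a number on "increasing faster" is to fit a straight line per age group and compare the slopes. This is a swapped-in simplification, not the smoother that `geom_smooth()` draws. The `toy_ratio` data frame is hypothetical: it reuses the column names of `earnings_female`, but the values and the "45-54 years" label are made up for illustration.

```r
# Hypothetical toy data with the same column names as earnings_female;
# the values are invented so the sketch runs on its own.
toy_ratio <- data.frame(
  year           = rep(1979:1983, times = 2),
  age_group      = rep(c("20-24 years", "45-54 years"), each = 5),
  earnings_ratio = c(70 + 1.00 * (0:4),   # rises 1 point per year
                     60 + 0.25 * (0:4))   # rises 0.25 points per year
)

# Fit earnings_ratio ~ year separately for each age group and keep the
# slope, i.e. the average change in the ratio per year.
slope_per_group <- sapply(
  split(toy_ratio, toy_ratio$age_group),
  function(d) coef(lm(earnings_ratio ~ year, data = d))[["year"]]
)

print(slope_per_group)
```

A larger slope means the earnings ratio for that group is rising faster. A straight line is a rough summary, so it is best read alongside the smoothed plot rather than instead of it.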
Our last analysis looked at the employment status of males and females from 1968 to 2016 to see whether the ratio of part-time to full-time workers has changed. The plot below shows that the percentages of full-time and part-time female workers are the same in 2016 as they were in 1968 (75.1% and 24.9%, respectively) and stayed relatively level throughout the period.
The one change that can be seen in the plot is the slight decrease in full-time male workers over the course of this period. We have found two main factors that may have caused this change. First, as more females take a more prominent role in society, some males are now playing the role of the stay-at-home parent. This is not to say that fewer males are working overall, but it could lead to more of them taking part-time roles rather than full-time ones. Second, the biggest decrease in full-time male employment came around 2008 and the recession. We have seen in some of the other data that there are many more males working than females, so the recession can be expected to have had a greater effect on males than on females. During this time period, a lot of people, especially men, lost their jobs.
employed_gender %>%
ggplot(aes(x = year)) +
geom_line(aes(y = full_time_female),color = "red2") +
geom_line(aes(y = full_time_male), color = "blue") +
geom_line(aes(y = part_time_female), color = "red2") +
geom_line(aes(y = part_time_male), color = "blue") +
scale_y_continuous(name = "Percent") +
scale_x_continuous(name = "Year") +
annotate("text", x = 1968, y = 82, label = "Full-time Male = 92.2%",
color = "blue", hjust = 0, size = 3) +
annotate("text", x = 1968, y = 68, label = "Full-time Female = 75.1%",
color = "red2", hjust = 0, size = 3) +
annotate("text", x = 1968, y = 32, label = "Part-time Female = 24.9%",
color = "red2", hjust = 0, size = 3) +
annotate("text", x = 1968, y = 14, label = "Part-time Male = 7.8%",
color = "blue", hjust = 0, size = 3) +
annotate("text", x = 2005, y = 82, label = "Full-time Male = 87.6%",
color = "blue", hjust = 0, size = 3) +
annotate("text", x = 2005, y = 68, label = "Full-time Female = 75.1%",
color = "red2", hjust = 0, size = 3) +
annotate("text", x = 2005, y = 30, label = "Part-time Female = 24.9%",
color = "red2", hjust = 0, size = 3) +
annotate("text", x = 2005, y = 17, label = "Part-time Male = 12.4%",
color = "blue", hjust = 0, size = 3) +
ggtitle("Male and Female Full-time & Part-time Employment",
subtitle = "Change from 1968 to 2016")
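The eight labels in the plot above give the 1968 and 2016 shares for each group. As a small worked check, the sketch below copies those label values by hand and computes the change in percentage points, which makes the shift in male employment explicit.

```r
# Shares (in percent) copied from the annotate() labels in the plot above.
shares <- data.frame(
  group = c("Full-time Male", "Full-time Female",
            "Part-time Female", "Part-time Male"),
  y1968 = c(92.2, 75.1, 24.9, 7.8),
  y2016 = c(87.6, 75.1, 24.9, 12.4)
)

# Change from 1968 to 2016 in percentage points, rounded to one decimal.
shares$change_pp <- round(shares$y2016 - shares$y1968, 1)

print(shares, row.names = FALSE)
```

The full-time male share falls by 4.6 percentage points and the part-time male share rises by the same amount, while both female shares are unchanged.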
The main goal of this analysis was to compare female salaries with male salaries and examine the wage gap between them. There is no denying that a wage gap has existed in the past and still exists today. There are systemic, societal reasons for this wage gap, such as the fact that women traditionally have been the caretakers of the house. Women only started entering the workforce in large numbers after World War I, and their numbers have been steadily increasing ever since. As our culture has become more progressive and many groups of people have fought for their rights, women have entered the workforce and demanded equal pay.
We wanted to analyze the wage gap in three facets: the wage gap by industry, the wage gap by age, and overall trends in female employment numbers. First, we studied the female-dominated vs. male-dominated industries. The female-dominated industries are built around the roles that women first assumed when they entered the workforce fifty years ago, such as nurses, secretaries, and teachers. Society only recently began accepting women into other fields like business, math, and science. Diversity is now a cause that most companies champion, so the number of women in these fields is rising, especially among younger generations. There were some outliers in the specific industry category. A few occupations within the Management, Business, and Financial industry were majority female and also had a higher average median salary for women than for men. The wage gap was larger in male-dominated industries than in female-dominated ones. While men made more on average than women in both categories, the difference in average wage between males and females was about $5,000 in female-dominated industries and about $9,000 in male-dominated industries.
Next, we analyzed each industry on a deeper level to see whether any industries were outliers in which females made more than males, and whether there were any underlying causes for the wage gap in certain industries. The largest differences in average median salaries for males and females exist in the Management, Business, and Financial industry and the Computer, Engineering, and Science industry, with gaps of about $15,000 and $11,000 respectively. Both are male-dominated industries, so we would expect the gaps to be larger. Also, many of the jobs in these business and engineering fields have historically been held by men, especially those of the baby boomer generation. We see in our other analyses that younger women have started asserting themselves more in the workforce right out of college. More women are joining these fields, but the people earning the largest salaries, who have worked in those roles for a long time, are mostly males.
The next analysis we completed was the wage gap by age. The main conclusion from these graphs is that the disparity in pay between females and males is getting better overall; however, some age groups have been affected more than others. The graphs revealed that the older generations are still seeing a larger gap than young, college-educated females who are entering the workforce. This graph also coincides nicely with the historic events that have happened in parallel for females, from getting the right to vote to not being discriminated against because of their sex in the workplace. Based on these trends, the wage gap for females ages 20-34 may be nonexistent in the near future.
Finally, the last analysis we completed was for overall female employment. The main conclusion from this graph is that even though the wage gap between males and females is decreasing, the ratio of females and males who work full-time and part-time varied only slightly over the years. A big reason these ratios may have stayed the same is that, as time has gone on, the need for jobs has only increased. Just because more females are now in the workforce does not mean fewer men are working overall. With technological advancements and other inventions, the need for jobs is higher than ever.