Project Background

Row

What is the Wage Gap? Is it Real?

The gender pay gap is the difference between what men and women are paid. Generally, it refers to the median annual pay of all women who work full-time and year-round versus the median annual pay of all men who work full-time and year-round. However, there are two types of pay gaps. The uncontrolled gender pay gap compares the overall median pay of men and women examined separately; variables of interest, such as education level or years of experience, are not controlled for. The controlled gender pay gap, on the other hand, is the amount that a woman earns for every dollar that a comparable man earns. The wage gap in this sense takes into account all measured compensable factors. Another form of gender gap is the opportunity gap: the notion that women are less likely than men to hold higher-level, high-paying jobs and that women advance in their roles at a slower pace than men. In today’s society, the opportunity gap is the main reason for the wage gap.

The wage gap is a problem that has persisted for many years. Despite the vast number of social issues at the forefront of the conversation, the debate over whether a gender wage gap exists and how to resolve it continues. Because many companies have made great strides toward equality, it is unfair to say that all women are being monetarily discriminated against in the workforce. Numerous companies, large and small, have incorporated the wage gap problem into their social responsibility initiatives. In recent years, as more companies have worked to close the wage gap, it has narrowed.

Despite these strides, the gender wage gap has not been eradicated. As women continue to play a larger and more important role in the workforce, it is important to understand the underlying factors that have caused the wage gap so that it can be prevented in the future.

Row

So Now What?

I analyzed the historical trends of the wage gap and identified the underlying forces that may have been the root cause. I looked at the trends in employment and salary differences over time across numerous industries and age groups for both males and females. I also viewed the wage gap by location throughout the United States. Finally, I analyzed the opportunity gap to highlight the differences in career progression between men and women.

My five approaches to understanding the gender wage gap are found below:

The mission of my project is to inform the reader how the wage gap has changed over time based on different factors (industry, age group, location, employment numbers, etc.). A second goal of my analysis is to inform the reader about the opportunity gap and explain its underlying causes. Overall, I want to determine whether the wage gap is truly improving. Finally, I hope to provide clarity on the industries, age groups, and locations that have been affected the most by the gender wage gap. Hopefully, this will inspire the consumers of this data to make positive changes for the future of these affected groups.

What About The Data?

For this project I will be utilizing four different datasets: two are historical datasets originating from the Bureau of Labor, and the other two originate from the Census Bureau. Three of the datasets primarily focus on the earnings ratio, or what females are making compared to males. The formula for the earnings ratio is below:
\[ \text{Earnings Ratio} = \frac{\text{Female Median Earnings}}{\text{Male Median Earnings}} \]
The remaining dataset compares full-time and part-time employment for males, females, and overall.
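As a quick illustration of the earnings ratio formula, the sketch below applies it to a few made-up median earnings values. The numbers and variable names are hypothetical and are not taken from the project datasets; the ratio is multiplied by 100 because the datasets express it as a percent.

```r
# Toy median earnings (made-up values, not from the project datasets)
female_median <- c(40000, 52000, 31000)
male_median   <- c(50000, 52000, 40000)

# Earnings ratio = female median earnings / male median earnings
earnings_ratio <- female_median / male_median

# Expressed as a percent and rounded to one decimal place
earnings_ratio_pct <- round(100 * earnings_ratio, 1)
print(earnings_ratio_pct)
```

A value of 100 means that the female and male medians are equal, and values below 100 indicate a gap in favor of males.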

Before continuing to the analysis, I must address the limitations presented by the data. The two major limitations are as follows:

Binary Genders: The data is based solely on binary gender identification. This excludes individuals, including some members of the LGBTQA community, who identify as non-binary.
Inconsistent Timelines: The datasets that I am deriving my analyses from have timelines that are not consistent with one another. Because of this, I decided to come up with five different approaches for dissecting the gender wage gap.

Importing the Data

Column

Required Packages

Before discussing importing the datasets used in this analysis, I wanted to mention the required R packages. For this analysis, I used a lot of the standard packages for cleaning and visualizing data. Most of these packages are used with other data manipulation/visualization techniques so hopefully not many of the packages need to be installed strictly for this analysis if you are following along with the code.

A few of the packages that may need to be installed by the user include ggthemes and plotly. The ggthemes package extends ggplot2 with additional themes, but it is a separate package and is not installed along with ggplot2. The plotly package is a supplementary visualization tool that I will use for this analysis.

## Load Required Packages ##
library(tidyverse) #Use to tidy data
library(readr) #Use to easily import delimited data
library(dplyr) #Use to manipulate data
library(tibble) #Use to manipulate data
library(magrittr) #Use to insert pipe operators
library(DT) #Use to create functional tables in HTML
library(knitr) #Use to create dynamic report generation
library(rmarkdown) #Use to convert R Markdown documents into a variety of formats
library(ggthemes) #Use to implement themes across report
library(ggrepel) #Use to label data
library(ggplot2) #Use to create visualizations
library(plotly) #Use to create dynamic plotting
library(gridExtra) #Use to arrange plots
library(reshape2) #Use to transform data frames

Before Importing the Data

There are four different datasets that I am using to analyze the wage gap. However, the timelines of these datasets do not overlap well, so I did not use the data in an aggregate analysis. Instead, I elected to treat each of the datasets on their own and clean each of them individually to develop separate analyses. The data importation process for each dataset can be found on the right.

Column

Earnings_female

Earnings_female Data

As previously mentioned, this data originated from the Bureau of Labor. The original data contains 3 variables and 264 observations, covers the years 1979 - 2011, and contains the earnings ratio for seven unique age groups across each year:

To gather more information regarding the dataset, click here

The first thing I did was import the original csv file using the read_csv function:

## Import the Data ##
earnings_female <- readr::read_csv("earnings_female.csv")

Jobs_gender

Jobs_gender Data

The jobs_gender data originated from the Census Bureau. The original data contains 12 variables and 2088 observations that have dates ranging from 2013 - 2016. The data is centered around employment numbers and earning percentages for males and females. The column names are below:

To view the original data, click here

The first thing I did was import the original csv file using the read_csv function:

## Import the Data ##
jobs_gender <- readr::read_csv("jobs_gender.csv")

Employed_gender

Employed_gender Data

As previously mentioned, this data originated from the Bureau of Labor. The dataset shows the percentage of employed people working full time and part time for the years 1968 - 2016. The dataset contains 7 variables and 49 observations with the column names shown below:

To view the original data, click here

The first thing I did was import the original csv file using the read_csv function:

## Import the Data ##
employed_gender <- readr::read_csv("employed_gender.csv")

Wagegap_map

Wagegap_map Data

The Wagegap_geo data is from the Census Bureau. I web scraped this website using Python in order to produce the dataset I needed for this analysis. The dataset displays each state and its gender pay ratio from 2018, which is the most recent data I am using in this analysis. The dataset contains 5 variables and 50 observations, one for each state, with the column names shown below:

State
Code
Gender Pay Ratio
National Rank
Equal Pay Laws

To view the original data, click here

The first thing I did was import the original csv file using the read_csv function:

## Import the Data ##
gpratio_map <- readr::read_csv("Wagegap_geo.csv")

Cleaning the Data

Column

Process of Cleaning the Data

Earnings_female

I examined and viewed the original data below:

Data Dictionary

Variable Name	Data Type	Variable Description
Year	numeric	Year
group	character	Age group
percent	numeric	Female wages as a percent of male wages, which is the earnings ratio of females

Cleaning the Data

After importing the data and doing the exploratory analysis I realized that I should change a few aspects to keep the data consistent.

I renamed two of the columns to better represent the values that are being displayed.
I reformatted each column name into snake_case to match the data from the other datasets.
I took the data in the age_group category and renamed the value of Total, 16 years and older to Total to make the data easier to understand.

names(earnings_female) <- c("year","age_group","earnings_ratio")
earnings_female$age_group[earnings_female$age_group == "Total, 16 years and older"] <- "Total"

Jobs_gender

I examined and viewed the original data below:

Data Dictionary

Variable Name	Data Type	Variable Description
year	numeric	Year
occupation	character	Specific job/career
major_category	character	Broad industry of occupation
minor_category	character	Specific industry of occupation
total_workers	numeric	Total estimated full-time workers above 16 years old
workers_male	numeric	Estimated full-time male workers above 16
workers_female	numeric	Estimated full-time female workers above 16
percent_female	numeric	The percent of females in a specific occupation
total_earnings	numeric	Total estimated median earnings for full-time workers above 16 years old
total_earnings_male	numeric	Estimated median earnings for males above 16 years old
total_earnings_female	numeric	Estimated median earnings for females above 16 years old
wage_percent_of_male	numeric	Female wages as a percent of male wages, which is the earnings ratio of females

Cleaning the Data

After viewing the data, I made a few changes to the data. The changes are found below:

I changed the column names to better represent the values provided. The changes included:

major_category to industry_broad
minor_category to industry_specific
wage_percent_of_male to earnings_ratio_female
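The renaming above could be done in several ways. The following is a minimal base R sketch on a one-row toy data frame; the data frame jobs_toy, its values, and the rename_map helper are hypothetical, and this is not necessarily the code used in the original analysis.

```r
# Toy data frame with the original column names (values are made up)
jobs_toy <- data.frame(
  year = 2016,
  occupation = "Example occupation",
  major_category = "Example broad industry",
  minor_category = "Example specific industry",
  wage_percent_of_male = 85.0,
  stringsAsFactors = FALSE
)

# Old name -> new name
rename_map <- c(
  major_category = "industry_broad",
  minor_category = "industry_specific",
  wage_percent_of_male = "earnings_ratio_female"
)

# Locate each old name and overwrite it with its replacement
idx <- match(names(rename_map), names(jobs_toy))
names(jobs_toy)[idx] <- unname(rename_map)
print(names(jobs_toy))
```

With dplyr loaded, dplyr::rename(jobs_toy, industry_broad = major_category) would be an equivalent alternative for a single column.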

Then I checked for duplicate and missing values that would affect the data:

                 year            occupation        industry_broad 
                    0                     0                     0 
    industry_specific         total_workers          workers_male 
                    0                     0                     0 
       workers_female        percent_female        total_earnings 
                    0                     0                     0 
  total_earnings_male total_earnings_female earnings_ratio_female 
                    4                    65                   846
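A per-column count of missing values like the one above can be produced with colSums(is.na(...)), and fully duplicated rows can be counted with duplicated(). The sketch below shows both on a made-up data frame; the name toy and its values are hypothetical.

```r
# Toy data with some missing values and one duplicated row (made-up values)
toy <- data.frame(
  occupation = c("A", "B", "C", "C"),
  total_earnings_male = c(50000, NA, 42000, 42000),
  total_earnings_female = c(45000, 39000, NA, NA),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))
print(na_counts)

# Number of rows that exactly repeat an earlier row
dup_rows <- sum(duplicated(toy))
print(dup_rows)
```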

There are “NA” values present in the last three columns. The NA values for “earnings_ratio_female” were entered for rows whose sample sizes were considered too small. However, the minimum total workers for the rows without “NA” in the “earnings_ratio_female” column is 11,383, while the maximum total workers for the rows with “NA” in that column is 441,982. Given that the maximum total workers for the rows containing “NA” is larger than the minimum total workers for the rows not containing “NA”, the argument that the sample size is too small for the rows containing “NA” is invalid. Therefore, I used the earnings ratio formula to fill in the NA values for this variable.
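The fill-in step described above might look like the following sketch, which recomputes the ratio only for rows where it is missing. The toy values are made up, and expressing the ratio as a percent is an assumption based on the data dictionary.

```r
# Toy rows: the ratio is missing for the last two (made-up values)
toy <- data.frame(
  total_earnings_male = c(50000, 40000, 60000),
  total_earnings_female = c(45000, 36000, 48000),
  earnings_ratio_female = c(90, NA, NA)
)

# Rows where the ratio needs to be filled in
missing_ratio <- is.na(toy$earnings_ratio_female)

# Earnings ratio = female median earnings / male median earnings, as a percent
toy$earnings_ratio_female[missing_ratio] <-
  100 * toy$total_earnings_female[missing_ratio] /
  toy$total_earnings_male[missing_ratio]

print(toy)
```

Rows where either earnings value is itself missing would still produce NA here, so they need separate handling.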

After I filled in the NA values for “total_earnings_female” and “total_earnings_male”, I realized that some of the filled-in values were negative. There were also values of 0 or very low numbers in the earnings columns even though there were hundreds of workers in that occupation. Therefore, I ended up removing every row with an NA value for female or male earnings so that these values would not skew the data later in the analysis.
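Dropping the rows with missing earnings can be sketched with complete.cases(), restricted to the two earnings columns. The data frame below is a made-up example, not the project data.

```r
# Toy data with missing earnings in two rows (made-up values)
toy <- data.frame(
  occupation = c("A", "B", "C", "D"),
  total_earnings_male = c(50000, NA, 42000, 61000),
  total_earnings_female = c(45000, 39000, NA, 55000),
  stringsAsFactors = FALSE
)

# Keep only rows where both earnings columns are present
earnings_cols <- c("total_earnings_male", "total_earnings_female")
keep <- complete.cases(toy[, earnings_cols])
toy_clean <- toy[keep, ]

print(toy_clean)
```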

                 year            occupation        industry_broad 
                    0                     0                     0 
    industry_specific         total_workers          workers_male 
                    0                     0                     0 
       workers_female        percent_female        total_earnings 
                    0                     0                     0 
  total_earnings_male total_earnings_female        Earnings_Ratio 
                    0                     0                     0

Given the numbers used, I felt it would be best presented if all decimals were kept at a maximum of one place.
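One way to keep every numeric column at one decimal place is to round all numeric columns in a single step. This is a sketch on made-up values; the original analysis may have rounded only selected columns.

```r
# Toy data with extra decimal places (made-up values)
toy <- data.frame(
  occupation = c("A", "B"),
  percent_female = c(49.04, 85.27),
  earnings_ratio_female = c(81.456, 92.111),
  stringsAsFactors = FALSE
)

# Identify numeric columns and round each to one decimal place
num_cols <- vapply(toy, is.numeric, logical(1))
toy[num_cols] <- lapply(toy[num_cols], round, digits = 1)

print(toy)
```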

Employed_gender

I examined and viewed the original data below:

There is no cleaning required for this dataset.

Data Dictionary

Variable Name	Data Type	Variable Description
year	numeric	Year
total_full_time	numeric	Percent of total employed people usually working full-time
total_part_time	numeric	Percent of total employed people usually working part time
full_time_female	numeric	Percent of employed females usually working full time
part_time_female	numeric	Percent of employed females usually working part time
full_time_male	numeric	Percent of employed males usually working full time
part_time_male	numeric	Percent of employed men usually working part time

Wagegap_map

I examined and viewed the original data below:

There is no cleaning required for this dataset.

Data Dictionary

Variable Name	Data Type	Variable Description
state	character	State in the US
code	character	State Code
Gender.Pay.Ratio	numeric	Earnings ratio of females to that of men in decimal form
National.Rank	character	The rank the state is from 1 to 50, number 1 means the state has the highest earnings ratio
Equal.Pay.Laws	character	The strength of the Equal Pay Laws the state has enacted

Column

Cleaned Data

Earnings_female

In the clean dataset, the range of years remains the same but the age groups are now:

The earnings ratio column ranges from 56.8% to 95.4%, reaffirming that a wage gap does in fact exist. I will analyze these values later in the exploratory analysis. Additionally, each of the variables contained in this dataset is important for the analysis, so I did not remove any of the observations from this table. The cleaned data is below:

Jobs_gender

The clean data now has a uniform naming convention and additional columns. All of the data from this dataset will be used; however, the variables most important to this analysis are listed below:

After cleaning the data, no NA values remain in the total_earnings_male, total_earnings_female, and earnings_ratio_female variables, resulting in 2019 complete observations and 12 variables. The cleaned data is below:

Employed_gender

Compared to the other datasets, this data is very useful without a lot of cleaning. There are no missing values or duplicate values, and all column names are written with consistent snake_case formatting. For this reason, I elected not to make any initial changes to this data. The cleaned data is below:

Wagegap_map

The wagegap_map data is also very useful without a lot of cleaning. The data is clean because I gathered it through web scraping, which allowed me to choose the specific variables I wanted. There are no missing values and the data is fairly straightforward, so I elected to leave this data as is. The cleaned data is below:

Size Overview

Column

Industry Analysis Overview

For the wage gap by industry analysis, I studied the breakdowns of earnings for each of the eight industries to observe any patterns in the respective wage gaps. First, I was interested in observing the industries that are dominated by females vs. males. Once I determined which industries fall into which category, I analyzed the differences in the average female and male median earnings for both female and male dominated industries. My hypothesis before beginning this analysis is that even in female dominated industries, the average pay for females will be lower than that of males. My goal was to identify if this hypothesis is true, if there are any outlier industries, and if there are underlying factors that explain the wage gap for each industry and as a whole.

Size Analysis

Before beginning this analysis, I wanted to display which occupations fell under which specific industries and which specific industries fell under which broad industries. The visualization is meant to give more clarification to the ensuing size analysis. See the industry breakdown to the right to see this information.

In the first graph, I showed the 8 broad industries to see which of them have a majority of employment from women based on the avg_females field that I temporarily created. Any industry that has 50% or more women is a female dominated industry. From this graph, you can see that females dominate 3 industries:

Healthcare Practitioners and Technical
Education, Legal, Community Service, Arts, and Media
Sales and Office

On the other hand, males dominate 5 industries:

Service
Management, Business, and Financial
Computer, Engineering, and Science
Production, Transportation, and Material Moving
Natural Resources, Construction, and Maintenance
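The classification above can be sketched as a group average followed by a 50% threshold. The toy industries and values below are made up, and taking a simple unweighted mean of percent_female per broad industry is an assumption about how the temporary avg_females field was computed.

```r
# Toy occupations in two made-up broad industries
toy <- data.frame(
  industry_broad = c("Industry X", "Industry X", "Industry Y", "Industry Y"),
  percent_female = c(80, 60, 20, 40),
  stringsAsFactors = FALSE
)

# Average percent of females per broad industry
avg <- aggregate(percent_female ~ industry_broad, data = toy, FUN = mean)
names(avg)[2] <- "avg_females"

# An industry with 50% or more women counts as female dominated
avg$female_dominated <- avg$avg_females >= 50

print(avg)
```

A mean weighted by total_workers would be a reasonable alternative if large occupations should count for more than small ones.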

For the most part, the industries that are dominated by females vs. males aren’t surprising given the historical stigma surrounding each of the industries. Women have traditionally held roles as medical assistants, nurses, teachers, childcare workers, and administrative assistants, while men have traditionally held roles as production workers, mechanics, analysts, engineers, and workers dealing with any type of natural resource. One industry that surprised me is the service industry. However, its percentage of females is 49%, so it is very close. One reason for the smaller than expected percentage is that firefighters, police officers, and other justice system occupations are listed under the service industry, and those careers are heavily dominated by men.

Next, I looked more closely at the specific industries in the same fashion to identify any specific industries that females dominate even though they do not dominate the corresponding broad industry. From the plot you can see that females dominate the healthcare support and personal care and service categories, which fall under the Service broad industry. However, both of these categories are in the medical field, which is an industry that females dominate. Females also dominate the business and financial operations field, which falls under the Management, Business, and Financial broad industry. After further analysis of the occupations within this category, I discovered that the majority of these roles include marketing analysts, event planners, and human resources workers. These roles are traditionally held by women. There are a few financial specialist and accounting roles that are predominantly held by women; these defy the trend, and I consider them to be outliers.

To further confirm female versus male dominated industries, I observed the top 10 occupations for women across all industries and noted if any of the occupations fell outside of their three dominated industries. Two of the occupations, medical transcriptionists and childcare workers, are in the service industry. But, their fields are closely related to the industries that females dominate, so I do not consider them to be outliers.

In addition to seeing the top 10 occupations for females, I also wanted to see the 10 occupations females have the lowest presence in, or the top 10 occupations for males. In the plot to the right, you can see that all of the occupations fall within male dominated industries.
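Ranking occupations by their share of female workers can be sketched by sorting on percent_female and keeping the first ten rows in each direction. The twelve occupations and their percentages below are made up for illustration.

```r
# Twelve made-up occupations with made-up percentages of female workers
toy <- data.frame(
  occupation = paste("Occupation", 1:12),
  percent_female = c(97, 12, 88, 5, 64, 33, 91, 47, 76, 21, 58, 9),
  stringsAsFactors = FALSE
)

# Ten occupations with the highest share of females
top_female <- head(toy[order(-toy$percent_female), ], 10)

# Ten occupations with the lowest share of females (highest share of males)
top_male <- head(toy[order(toy$percent_female), ], 10)

print(top_female$occupation)
print(top_male$occupation)
```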

Column

Industry Breakdown

Broad Industries

Specific Industries

Top 10 Female

Top 10 Male

Earnings Overview

Column

Industry Earnings Overview

To begin the analysis of the earnings per industry, I observed the density plots of the total earnings per industry category to determine the industry with the highest average earnings. The main conclusions from this plot are that:

Computer, Engineering, and Science has the highest average median earnings and it is a male dominated industry
The distributions for each industry are skewed to the right, meaning that the means are higher than the medians. The skews are more severe for certain industries due to outliers, which I examined further in the next analysis

To provide an overview of the wage gap regardless of industry, I wanted to briefly show the overall trend in earnings between males and females. Females have an average median earnings of $49,640 and males have an average median earnings of $53,218. Therefore, the dataset shows that, on average, men make about $4,000 more than women. The p-value for the two-sample t-test was 0.0257, which is lower than the alpha value of 0.05, so I can reject the null hypothesis and conclude that the difference in the means is significant. There are outliers in each plot, but the outliers in the male earnings are the most significant. Also, there are about 56 million more men than women documented as working full-time in the dataset. Even though this is a large difference, the sample size for the remaining occupations in the dataset is large enough for both men and women.

After displaying the average median salaries of females and males overall, I wanted to show the difference between female dominated industries and male dominated industries. I previously defined female and male dominated industries, so I now aggregated the industries that belong to each category. Then, I calculated the average median earnings for males and females in both the female and male dominated industries. Female dominated industry earnings for females and males are the two leftmost bars, and male dominated industry earnings for females and males are the two rightmost bars. My original hypothesis, that even in female dominated industries the average pay for females is lower than that of males, was supported. I conducted a two-sample t-test to support my analysis, which resulted in a p-value of <0.0001 for the female dominated industries and 0.0002 for the male dominated industries. Therefore, the difference in the means is significant in both cases. From the graph, it is evident that females on average make 10% more in female dominated industries than they do in male dominated industries, but they are still making less than men. On the other hand, males on average make 10% less in male dominated industries than they do in female dominated industries. Regardless, men on average make more than females in both male and female dominated industries.
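The four bars described above correspond to one mean per gender within each industry type. A minimal sketch of that aggregation follows; the dominated column and all values are hypothetical.

```r
# Toy occupations labeled by the type of industry they belong to (made-up values)
toy <- data.frame(
  dominated = c("female", "female", "male", "male"),
  total_earnings_female = c(40000, 50000, 36000, 44000),
  total_earnings_male = c(48000, 56000, 42000, 50000),
  stringsAsFactors = FALSE
)

# Average median earnings for females and males within each industry type
group_means <- aggregate(
  cbind(total_earnings_female, total_earnings_male) ~ dominated,
  data = toy,
  FUN = mean
)

print(group_means)
```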

Column

Distribution of Earnings

Overall Median Earnings

Female vs. Male Dominated Industry Earnings

Earnings Per Industry

Column

Earnings Per Industry Overview

Next, I wanted to determine if there are any outlier industries. An industry would be an outlier if it runs against the overall pattern, that is, if females make more than males on average in that industry. I created a boxplot for each industry to provide quick visuals of the differences in male and female median earnings on average. It is important to note that I did not remove any of the outliers from this dataset. Each of the outliers is important for providing an overall view of the industry and the discrepancies in median earnings between males and females. The following plots analyze the median earnings for males, females, and total for each broad industry.
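A side-by-side boxplot of female and male earnings for one industry can be sketched in base R as shown below. The values are made up, and the original report may well have used ggplot2 or plotly instead.

```r
# Toy median earnings for five occupations in one made-up industry
toy <- data.frame(
  Female_Earnings = c(30000, 32000, 35000, 38000, 60000),
  Male_Earnings = c(34000, 36000, 40000, 43000, 45000)
)

# Reshape to long format: one row per value, with a group label in 'ind'
long <- stack(toy)

# Draw one box per group; outliers are kept and drawn as individual points
bp <- boxplot(values ~ ind, data = long,
              xlab = "", ylab = "Median earnings (USD)")

# The returned object holds the plotted statistics
print(bp$stats[3, ])  # group medians
print(bp$out)         # points flagged as outliers
```

Keeping the outliers in the plot, rather than filtering them beforehand, matches the choice described above.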

Column

Healthcare

Healthcare Practitioners and Technical Industry

The Healthcare Practitioners and Technical industry agrees with my hypothesis. The average median earnings for females is $68,051 and the average median earnings for males is $74,269, which is about a $6,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the healthcare industry making more than others because of years of schooling required, difficulty of job, access to promotions, etc. The outliers are the most significant for men in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, normally distributed, and have unknown variances. The p-value of the test is 0.0049, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.
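A two-sided, two-sample t-test of this kind can be run with t.test() from base R's stats package. The sketch below uses simulated toy data rather than the project data, and it assumes the Welch version of the test (var.equal = FALSE), which does not require the two unknown variances to be equal; the exact options used in the original analysis are not shown in the report.

```r
set.seed(42)  # make the simulated toy data reproducible

# Simulated earnings for two groups (not the project data)
female_earnings <- rnorm(40, mean = 50000, sd = 9000)
male_earnings <- rnorm(40, mean = 56000, sd = 9000)

# H0: the two means are equal; H1: the two means differ
result <- t.test(female_earnings, male_earnings,
                 alternative = "two.sided",
                 var.equal = FALSE)

# Compare the p-value with the assumed significance level
alpha <- 0.05
significant <- result$p.value < alpha

print(result$p.value)
print(significant)
```

The null hypothesis is rejected only when the p-value is smaller than alpha.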

Education

Education, Legal, Community Service, Arts, & Media Industry

The Education, Legal, Community Service, Arts, & Media industry agrees with my hypothesis. The average median earnings for females is $46,258 and the average median earnings for males is $54,403, which is about an $8,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the legal industry making more than others because of years of schooling required, difficulty of job, access to promotions, etc. The outliers are the most significant for males in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, normally distributed, and have unknown variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Sales and Office

Sales and Office Industry

The Sales and Office industry agrees with my hypothesis. The average median earnings for females is $37,106 and the average median earnings for males is $44,987, which is about an $8,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the sales industry making more than others because of years of schooling required, difficulty of job, access to promotions, etc. The outliers are the most significant for males in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, normally distributed, and have unknown variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Service

Service Industry

The Service industry agrees with my hypothesis. The average median earnings for females is $31,988 and the average median earnings for males is $36,644, which is about a $5,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the service industry making more than others because of years of schooling required, difficulty of job, access to promotions, etc. The outliers are the most significant for females in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, normally distributed, and have unknown variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

 Total_Earnings  Female_Earnings  Male_Earnings  
 Min.   :17266   Min.   : 16771   Min.   :12147  
 1st Qu.:24662   1st Qu.: 22291   1st Qu.:26320  
 Median :30422   Median : 28384   Median :31799  
 Mean   :34452   Mean   : 31988   Mean   :36644  
 3rd Qu.:40748   3rd Qu.: 38088   3rd Qu.:41640  
 Max.   :90571   Max.   :100508   Max.   :90912
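A six-number table in this layout is what summary() prints for a data frame of numeric columns. The sketch below reproduces the layout on made-up values; only the column names are borrowed from the output above.

```r
# Toy earnings columns (made-up values)
service_toy <- data.frame(
  Total_Earnings = c(20000, 25000, 30000, 41000, 90000),
  Female_Earnings = c(18000, 22000, 28000, 38000, 95000),
  Male_Earnings = c(15000, 26000, 32000, 42000, 91000)
)

# Min, quartiles, median, mean, and max for each column
earnings_summary <- summary(service_toy)
print(earnings_summary)
```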

Business

Business, Management, and Financial Industry

The Business, Management, and Financial industry agrees with my hypothesis. The average median earnings for females is $59,070 and the average median earnings for males is $73,717, which is about a $15,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the business industry making more than others because of years of schooling required, difficulty of job, access to promotions, etc. The outliers are the most significant for females in this industry due to there being outliers on both ends of the plot.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, normally distributed, and have unknown variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Engineering

Computer, Engineering, and Science Industry

The Computer, Engineering, and Science industry agrees with my hypothesis. The average median earnings for females is $69,427 and the average median earnings for males is $80,191, which is about an $11,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the science industry making more than others because of years of schooling required, difficulty of job, access to promotions, etc. The outliers are the most significant for males in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, normally distributed, and have unknown variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Production

Production, Transportation, and Material Moving Industry

The Production, Transportation, and Material Moving industry agrees with my hypothesis. The average median earnings for females is $32,438 and the average median earnings for males is $40,769, which is about an $8,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the production industry making more than others because of years of schooling required, difficulty of job, access to promotions, etc. The outliers are the most significant for females in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, normally distributed, and have unknown variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Natural Resources

Natural Resources, Construction, and Maintenance Industry

The Natural Resources, Construction, and Maintenance industry agrees with my hypothesis. The average median earnings for females is $38,549 and the average median earnings for males is $43,661, which is about a $5,000 difference. There are several outliers for each of the categories, which is due to certain occupations in the construction industry making more than others because of years of schooling required, difficulty of job, access to promotions, etc. The outliers are the most significant for females in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, normally distributed, and have unknown variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.
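The two-sample t-tests described for each broad industry can be sketched in base R as follows. This is only an illustration under stated assumptions: the earnings vectors are made-up numbers, the names `female_earnings` and `male_earnings` are hypothetical, and the Welch (unequal variance) form of the test is assumed because the variances are treated as unknown.

```r
# Hypothetical median earnings per occupation (made-up values, illustration only).
female_earnings <- c(31000, 29500, 33800, 35200, 30100, 32900, 34400, 28700)
male_earnings   <- c(39800, 41200, 38500, 43100, 40400, 42600, 37900, 44000)

# Welch two-sample t-test (assumption: unequal variances).
# H0: the two means are equal; H1: the two means are not equal.
result <- t.test(female_earnings, male_earnings,
                 alternative = "two.sided", var.equal = FALSE)

# Compare the p-value with the assumed alpha value of 0.05.
alpha <- 0.05
reject_null <- result$p.value < alpha
```

If `reject_null` is `TRUE`, the difference in the means is treated as significant at the 0.05 level.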

Earnings Per Specific Industry

Column

Earnings Per Specific Industry Overview

Now, I dove deeper into the specific industry categories. My hypothesis is still that men will make more on average than females in each of the specific industries. A specific industry would be an outlier if it strays against the overall pattern and if females make more than males on average. I created boxplots for each specific industry to provide quick visuals of the differences in male and female median earnings on average. It is important to note that I did not remove any of the outliers from the dataset. Each of the outliers is important to providing an overall view of the industry and the discrepancies in median earnings between males and females. The following plots analyze the median earnings for males, females, and total for each specific industry.

Column

Healthcare Support

Healthcare Support Specific Industry

The Healthcare Support specific industry agrees with my hypothesis. The average median earnings for females is $31,956 and the average median earnings for males is $36,107, which is about a $4,000 difference. There are several outliers for each of the categories. There are more outliers for females in this industry, but the outlier for males has the highest earnings. The p-value of the two-sample t-test is 0.0096, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Personal Care

Personal Care and Service Specific Industry

The Personal Care and Service specific industry agrees with my hypothesis. The average median earnings for females is $28,080 and the average median earnings for males is $31,952, which is about a $4,000 difference. There are several outliers for each of the categories. There are more outliers for males in this industry and the outlier for females has the highest earnings, which is surprising. The p-value of the two-sample t-test is 0.0092, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Healthcare Practitioners

Healthcare Practitioners and Technical Specific Industry

The Healthcare Practitioners and Technical specific industry agrees with my hypothesis. The average median earnings for females is $68,051 and the average median earnings for males is $81,487, which is about a $13,000 difference. There are several outliers for each of the categories. There are more outliers for females in this industry, but an outlier for males has the highest earnings. The p-value of the two-sample t-test is 0.0049, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Office And Admin

Office and Administrative Support Specific Industry

The Office and Administrative specific industry agrees with my hypothesis. The average median earnings for females is $35,783 and the average median earnings for males is $41,762, which is about a $6,000 difference. There are several outliers for each of the categories. There are more outliers for males in this industry and an outlier for males has the highest earnings. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Education

Education, Training, and Library Specific Industry

The Education, Training, and Library specific industry agrees with my hypothesis. The average median earnings for females is $42,890 and the average median earnings for males is $49,460, which is about a $6,500 difference. There are outliers for only males. An outlier for males has the highest overall earnings. The p-value of the two-sample t-test is 0.0138, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Community Service

Business

Business and Financial Operations Specific Industry

The Business and Financial Operations specific industry agrees with my hypothesis. The average median earnings for females is $54,129 and the average median earnings for males is $68,540, which is about a $14,000 difference. There are no outliers for males or females individually, but there are outliers for the total earnings. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Legal

Legal Specific Industry

The Legal specific industry does not agree fully with my hypothesis. The average median earnings for females is $66,195 and the average median earnings for males is $83,839, which is about a $17,500 difference. There are no outliers in this industry and the maximum for males has the highest earnings. The p-value of the two-sample t-test is 0.0661, which is larger than the assumed alpha value of 0.05. Therefore, I can’t reject the null hypothesis and the difference in the means is not statistically significant. Because this is the largest difference I have observed, I thought there may have been some violation of assumptions. I transformed the earnings variable and ran the t-test again, which resulted in a p-value of 0.0882. The p-value was still larger than the alpha of 0.05. I think an assumption of the t-test may have been violated, which would keep the p-value above the alpha value. The distributions are very different for the earnings of males and females, which could have affected the results. So, males make more than females on average in this industry, but the difference in the means doesn’t appear to be significant.
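Checking the normality assumption and re-running the test on a transformed variable could look roughly like the sketch below. The data are made-up, the variable names are hypothetical, and the natural log is only an assumed choice of transformation, since the analysis does not state which one was applied.

```r
# Hypothetical right-skewed earnings (made-up values, illustration only).
female_earnings <- c(42000, 45500, 51000, 58000, 64000, 98000)
male_earnings   <- c(47000, 55000, 61000, 72000, 90000, 178000)

# Shapiro-Wilk test per group: a small p-value suggests the sample
# is not normally distributed.
normality_p <- c(female = shapiro.test(female_earnings)$p.value,
                 male   = shapiro.test(male_earnings)$p.value)

# Two-sample t-test on the raw earnings.
raw_test <- t.test(female_earnings, male_earnings)

# Same test after a log transformation (assumed transformation).
log_test <- t.test(log(female_earnings), log(male_earnings))

# Collect both p-values so they can be compared against alpha = 0.05.
p_values <- c(raw = raw_test$p.value, log = log_test$p.value)
```

If the raw and transformed p-values fall on opposite sides of 0.05, the result is sensitive to the normality assumption and further tests are warranted.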

Food Preparation

Arts and Media

Arts, Design, Entertainment, Sports, and Media Specific Industry

The Arts, Design, Entertainment, Sports, and Media specific industry agrees with my hypothesis. The average median earnings for females is $45,286 and the average median earnings for males is $53,642, which is about a $8,000 difference. There are no outliers in this industry and the maximum for males has the highest earnings. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Sales

Management

Management Specific Industry

The Management specific industry agrees with my hypothesis. The average median earnings for females is $63,683 and the average median earnings for males is $78,549, which is about a $15,000 difference. There are outliers for both males and females in this industry. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Maintenance

Building, Grounds Cleaning, and Maintenance Specific Industry

The Building and Grounds Cleaning and Maintenance specific industry agrees with my hypothesis. The average median earnings for females is $26,581 and the average median earnings for males is $32,787, which is about a $6,000 difference. There are no outliers for either males or females in this industry. The maximum for males has the highest median earnings overall. The p-value of the two-sample t-test is 0.0016, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Production

Production Specific Industry

The Production specific industry agrees with my hypothesis. The average median earnings for females is $30,373 and the average median earnings for males is $38,381, which is about a $8,000 difference. There are outliers for both males and females in this industry. The outlier for males has the highest median earnings overall. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Mathematics

Computer and Mathematical Specific Industry

The Computer and Mathematical industry agrees with my hypothesis. The average median earnings for females is $73,384 and the average median earnings for males is $85,772, which is about a $12,000 difference. There is an outlier for only males in this industry. The outlier for males has the highest median earnings overall. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Protective

Protective Service Specific Industry

The Protective Service specific industry agrees with my hypothesis. The average median earnings for females is $46,847 and the average median earnings for males is $53,431, which is about a $6,500 difference. There are no outliers in this industry and the maximum for males has the highest median earnings overall. The p-value of the two-sample t-test is 0.0118, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Material Moving

Material Moving Specific Industry

The Material Moving specific industry does not agree fully with my hypothesis. The average median earnings for females is $30,022 and the average median earnings for males is $37,364, which is about a $7,000 difference. There are outliers for both males and females in this industry. The outlier for females has the highest median earnings overall, which goes against the norm. The p-value of the two-sample t-test is 0.165, which is larger than the assumed alpha value of 0.05. Therefore, I can’t reject the null hypothesis and the difference in the means is not statistically significant. To check for a violation of assumptions, I transformed the earnings variable and ran the t-test again, which resulted in a p-value of 0.0115. The p-value is now less than the alpha value, so assumptions may have been violated in this case, causing the original test to show an insignificant difference in the means. So, males make more than females on average for this industry, but the t-test was a bit inconclusive, so I would need to conduct further tests.

Farming

Farming, Fishing, and Forestry Specific Industry

The Farming, Fishing, and Forestry specific industry agrees with my hypothesis. The average median earnings for females is $29,189 and the average median earnings for males is $34,020, which is about a $5,000 difference. There are outliers for only females in this industry. The outlier for females has the highest median earnings overall, which goes against the norm. The p-value of the two-sample t-test is 0.0339, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Engineering

Architecture and Engineering Specific Industry

The Architecture and Engineering specific industry agrees with my hypothesis. The average median earnings for females is $74,873 and the average median earnings for males is $84,004, which is about a $9,000 difference. There are outliers for both males and females in this industry. The outlier for males has the highest median earnings overall. The p-value of the two-sample t-test is 0.0022, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Transportation

Transportation Specific Industry

The Transportation specific industry agrees with my hypothesis. The average median earnings for females is $40,589 and the average median earnings for males is $52,899, which is about a $12,000 difference. There are no outliers in this industry. The maximum for males has the highest median earnings overall. The p-value of the two-sample t-test is 0.0005, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Installation

Installation, Maintenance, and Repair Specific Industry

The Installation, Maintenance, and Repair specific industry agrees with my hypothesis. The average median earnings for females is $39,102 and the average median earnings for males is $45,959, which is about a $7,000 difference. There are outliers for both males and females in this industry. The outlier for females has the highest median earnings overall, which goes against the norm. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

Construction

Construction and Extraction Specific Industry

The Construction and Extraction specific industry does not agree fully with my hypothesis. The average median earnings for females is $40,424 and the average median earnings for males is $43,779, which is about a $3,000 difference. There are outliers for both males and females in this industry. The outlier for females has the highest median earnings overall, which goes against the norm. The p-value of the two-sample t-test is 0.1141, which is larger than the assumed alpha value of 0.05. Therefore, I can’t reject the null hypothesis and the difference in the means is not statistically significant. To check for a violation of assumptions, I transformed the earnings variable and ran the t-test again, which resulted in a p-value of 0.0914. The p-value is still larger than the alpha value, so the difference in the means remains insignificant. So, males make more than females on average in this industry, but the difference in the means is not significant.

Age Analysis

Column

Overall Trend

Earnings Ratio Trend

The third aspect of the wage gap that I wanted to look at was whether the wage gap varied across different age groups. With the wage gap being a historical problem, I wanted to view how it trended across all age groups as a whole to start. From the plot below, you can see that the wage gap is certainly present; however, it is trending in a positive direction.

Age Group Trends

Earnings Ratio Per Age Group Trends

My next step was to determine if a certain age group(s) was being impacted more than others. In the interactive plot below, you can see a few things. First, the younger age groups have a higher earnings ratio than the older age groups. Second, you can see that the groups 20-24 years and 25-34 years are increasing drastically faster than other age groups. Also, many historical events were taking place just before 1980. In 1963 the Equal Pay Act was signed into law by President John F. Kennedy and in 1964 Lyndon B. Johnson signed the Civil Rights Act into law. With these monumental pieces of legislation enacted, it allowed females to start engaging in occupations that were not possible before. Additionally, it sparked younger females to continue to pursue education and fight for higher salaries and more promotion opportunities. As for older females who were in the true midst of gender wage discrimination, these new reforms and much more helped them improve their pay status, just at a much slower rate. This plot does a great job of showing the trends for each age group.

Location Analysis

Column

Location Analysis Overview

The next part of my analysis is dealing with the wage gap by location. Take a deeper look into the gender pay ratio for each state by hovering over it in the map to the right!

The map to the right displays the gender pay ratio, national rank, and equal pay laws for every state. The gender pay ratio is in decimal terms and represents the amount that women make compared to a man’s dollar. So, a higher gender pay ratio is better. The national rank displays how each state stacks up against the others in terms of the gender pay ratio. If the state has a high national rank, the gender pay ratio is higher and therefore better for women in those states. Finally, the equal pay law strength notes the laws the state has passed on the wage gap and equal pay. The levels are strong, moderate, and weak. An anomaly is Mississippi because it does not currently have any laws regarding equal pay in the workforce. The strength of the laws in each state was based on the Census Bureau rankings, where the data was sourced from. If a state, like California, has an Equal Pay Act and additional legislation supporting it, it was given an equal pay law score of strong. If a state, like Alabama, has an Equal Pay Act or other legislation on the topic but does not strongly enforce it, it was given an equal pay law score of weak. A state may also have an Equal Pay Act that is not very strong in its provisions, which would result in a score of weak.
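As a rough sketch of how a national rank can be derived from the gender pay ratio, the base R snippet below ranks states so that rank 1 has the highest ratio. The state labels, ratios, and column names are hypothetical placeholders, not values from the dataset.

```r
# Hypothetical states and pay ratios (made-up values, illustration only).
states <- data.frame(
  state     = c("State A", "State B", "State C", "State D"),
  pay_ratio = c(0.82, 0.88, 0.74, 0.79)
)

# Rank 1 goes to the highest pay ratio, so rank on the negated ratio.
states$national_rank <- rank(-states$pay_ratio, ties.method = "min")

# Sort from best (smallest gap) to worst (largest gap).
states <- states[order(states$national_rank), ]
```

Taking the first or last ten rows of a table sorted this way gives the top ten and bottom ten lists.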

Column

Gender Pay Ratio Map

Column

Top 10

The top ten states with the best gender pay ratios are below:

Most of these states align with my hypothesis that the lowest wage gaps would exist in states that are more urban and have larger cities. Most of the states are closer to the coasts, are areas with larger cities and more people, and have low rural populations. Larger cities tend to be more progressive and liberal in nature due to their citizens being more politically active and wanting to drive change. The largest cities in these states are Los Angeles (CA), New York City (NY), Baltimore (MD), Las Vegas (NV), Burlington (VT), Little Rock (AR), Jacksonville (FL), Portland (OR), Wilmington (DE), and Phoenix (AZ). Baltimore, Burlington, and Wilmington are not especially large cities in comparison to the others, but they are more politically active cities due to the fact that they are close to the capital, Washington D.C. The Census Bureau indicates that more people tend to work for the government or any kind of political offices in the Northeast due to the proximity to the capital. Arkansas was really the only state that shocked me. It has a lot of rural areas and few populous cities, so I assumed it would not rank in the top ten for lowest wage gap.

Bottom 10

The bottom ten states with the worst gender pay ratios are below:

Again, most of these states align with my hypothesis that the highest wage gaps would exist in states that are more rural and have smaller cities. Most of these states are in the Midwest or the south, which are areas that tend to have more rural populations and are less progressive as a whole. Louisiana, West Virginia, Alabama, North Dakota, Indiana, Mississippi, and Oklahoma are some of the states with the largest rural population. Also, Wyoming, Utah, and New Hampshire are some of the states with the lowest overall populations. Due to the rural and smaller populations in these states, they contain less progressive and politically active people who would fight for Equal Pay Acts.

Overall, the states with a better gender pay ratio tend to be more progressive and contain a large portion of urban populations, whereas the states with a lower gender pay ratio tend to be more conservative and contain a large portion of rural populations.

Overall Female Employment Analysis

Column

Overall Female Employment

Females in the Workforce Trend

My last analysis studied the employment status of males and females throughout history from 1968 to 2016 and determined if the ratio of part-time and full-time workers has changed. In the plot below, you can see that the percentage of full-time and part-time females is at the same position in 2016 as it was in 1968, respectively, and stayed relatively level during that time period.

The one change that can be seen from the plot is the slight decrease in full-time male workers over the course of this period. I found two main factors that may have caused this change. First, as more females continue to take a more prominent role in society, some males are now playing the role of the stay-at-home parent. It is not to say that fewer males are working overall, but it could lead to more of them assuming part-time roles rather than full-time ones. The second reason is that the biggest decrease of full-time male employment came around 2008 and the recession. I noticed through some of the other data that there were more males working during this time than females, so the trends have a greater effect on the males than the females. During this time period, a lot of people, especially men, lost their jobs. Therefore, their overall employment numbers decreased. Overall, more females are entering the workforce, so men do not have to be the sole breadwinners of the family and can work part-time or be a stay-at-home dad.

Opportunity Gap Analysis

Column

Opportunity Gap Overview

The final wage gap analysis I conducted is an analysis of the opportunity gap. The opportunity gap is essential to the wage gap conversation because it captures the aspects of the wage gap that the uncontrolled wage gap cannot. The opportunity gap is the gap in the opportunities that men are offered versus females. Males have more access to higher paying jobs and tend to advance faster in their careers than females.

First, to analytically explain the opportunity gap, I computed the top ten highest paying occupations for each industry. For each industry, I indicated whether these occupations were held by a majority of females versus males. Next, I showed what females make compared to males for each occupation. The data for this analysis is limited because it can’t capture all of the aspects of the opportunity gap. So, I included outside research in order to provide information on other factors that may affect the opportunity gap. In each industry, I researched how many hours a man works on average compared to how many hours a female works. For the top paying occupations in each industry, I researched the level of schooling and experience a man typically has versus a female. The hours worked by males and females in each industry on average and the levels of schooling and experience they received were not significantly different.

Finally, for the last piece of this analysis, I determined what percentage of executives for each industry are males. The percentages are astoundingly high, even though the percentage of females who are executives has been increasing. Each of the percentages is greater than 85%. Therefore, the opportunity gap is a real problem that afflicts the workforce. Males are offered opportunities for advancement and increased salaries more often than females, which contributes greatly to the wage gap.
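The top-paying-occupation step of this analysis can be sketched in base R as below. The data frame, its column names (`industry`, `total_earnings`, `workers_female`, `workers_male`), and all values are hypothetical, and `top_n` is set to 2 only to keep the toy example small where the analysis uses 10.

```r
# Hypothetical occupations (made-up values, illustration only).
jobs <- data.frame(
  industry       = c("X", "X", "X", "Y", "Y", "Y"),
  occupation     = paste("Occupation", 1:6),
  total_earnings = c(90000, 70000, 50000, 80000, 60000, 40000),
  workers_female = c(40, 60, 70, 10, 55, 30),
  workers_male   = c(60, 40, 30, 90, 45, 70)
)
top_n <- 2  # the analysis uses the top ten per industry

# Share of female workers and a majority flag for each occupation.
jobs$percent_female <- 100 * jobs$workers_female /
  (jobs$workers_female + jobs$workers_male)
jobs$female_majority <- jobs$percent_female > 50

# Keep the top_n highest paying occupations within each industry.
top_by_industry <- do.call(rbind, lapply(split(jobs, jobs$industry), function(d) {
  head(d[order(-d$total_earnings), ], top_n)
}))

# Count how many of those top occupations have a female majority.
majority_counts <- tapply(top_by_industry$female_majority,
                          top_by_industry$industry, sum)
```

Each entry of `majority_counts` corresponds to a statement of the form "females occupy the majority of a top paying occupation k out of n times" for that industry.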

Column

Healthcare

Healthcare Practitioners and Technical Industry

For this industry, females occupy the majority of a top ten paying occupation 6 out of 10 times. They also have a majority in two of the four highest paying occupations. The healthcare industry is an industry that overall is dominated by females, so I assumed that females would be the majority, but I thought they would be a majority in more than 6 of the occupations.

Education

Education, Legal, Community Service, Arts, and Media Industry

For this industry, females occupy the majority of a top ten paying occupation 5 out of 10 times. They also have a majority in one of the three highest paying occupations. The education industry is an industry that overall is dominated by females, so I assumed that they would have a majority in more than half of the top ten highest paying occupations.

Sales and Office

Sales and Office Industry

For this industry, females occupy the majority of a top ten paying occupation 2 out of 10 times. The sales industry is an industry that overall is dominated by females. But, this industry compared to healthcare and education does not have women as being as dominant of a force. Females don’t hold a majority of the highest paying positions and are only 51% for the occupations that they do have majority in. I assumed that females would have a majority in more than half of the top ten highest paying occupations because it is a female dominated industry.

Service

Service Industry

For this industry, females occupy the majority of a top ten paying occupation 0 out of 10 times. The service industry is an industry that overall is dominated by males, so this is not surprising. There are 49% females in this industry, though, and they do not occupy one of the highest paying occupations. As you can see, males play a more dominant role in the female dominated industries than females play in the male dominated industries.

Business

Management, Business, and Financial Industry

For this industry, females occupy the majority of a top ten paying occupation 0 out of 10 times. The business industry is an industry that overall is dominated by males, so this is not surprising. As you can see, males play a more dominant role in the female dominated industries than females play in the male dominated industries.

Engineering

Computer, Engineering, and Science Industry

For this industry, females occupy the majority of a top ten paying occupation 0 out of 10 times. The engineering industry is an industry that overall is dominated by males, so this is not surprising. The highest percentage of females in one of these occupations is 32%, which is very low. As you can see, males play a more dominant role in the female dominated industries than females play in the male dominated industries.

Production

Production, Transportation, and Material Moving Industry

For this industry, females occupy the majority of a top ten paying occupation 0 out of 10 times. The production industry is an industry that overall is dominated by males, so this is not surprising. The highest percentage of females in one of these occupations is 19%, which is extremely low. The percentages of females in each occupation are very low, the lowest at 3%. As you can see, males play a more dominant role in the female dominated industries than females play in the male dominated industries.

Natural Resources

Natural Resources, Construction, and Maintenance Industry

For this industry, females occupy the majority of a top ten paying occupation 0 out of 10 times. The natural resources industry is an industry that overall is dominated by males, so this is not surprising. The highest percentage of females in one of these occupations is 7%, which is extremely low. The percentages of females in each occupation are very low across the board, with the lowest being 1%. As you can see, males play a more dominant role in the female dominated industries than females play in the male dominated industries.

Executives Overview

Percentage of Female Executives Per Industry

The largest percentage of female executives is in the healthcare industry at 14%. The largest three percentages correspond to the female dominated industries. The lowest percentage of female executives is in the production industry. Many females in these industries are equally qualified as their male counterparts and have similar backgrounds, but they are not chosen to hold the largest roles at their companies. The number of female executives has definitely risen in the past few decades, but it is nowhere near where it should be. The opportunity gap of men getting the higher paying and more important jobs needs to be reduced as we move further into the 21st century.

---
title: "The Evolution of the Wage Gap"
output:
  flexdashboard::flex_dashboard:
    social: menu
    source_code: embed
    theme: journal
---

```{r setup, include=FALSE, warning = FALSE}
library(flexdashboard)
library(plotly)
library(knitr)
library(DT)
library(outliers) 
library(moments) 
library(stringr) 
library(highcharter) 
library(treemap) 
library(viridisLite) 
library(tidyverse)
```

Project Background {data-navmenu="Background" data-orientation=rows}
===============================================================================

Row {data-height=325}
-------------------------------------------------------------------------------

### What is the Wage Gap? Is it Real? 

The **gender pay gap** is the gap between what men and women are paid. Generally, it refers to the median annual pay of all women who work full-time and year-round versus the median annual pay of all men who work full-time and year-round. But, there are two types of pay gaps. There is the *uncontrolled gender pay gap*, which is the overall median pay for men and women that are examined separately. Variables of interest, such as education level or years of experience, are not controlled for. On the other hand, the *controlled gender pay gap* is the amount that a woman earns for every dollar that a comparable man earns. The wage gap in this sense takes into account all measured compensable factors. Another form of a gender gap is the opportunity gap. The *opportunity gap* is the notion that women are less likely to hold higher-level, high-paying jobs compared to men. It also holds that women advance in their roles at a slower pace than males. In today's society, the opportunity gap is the main reason for the wage gap.

The wage gap is a problem that has persisted for many years. Despite the vast amount of social issues at the forefront of the conversation, the argument of there being a gender wage gap and how to resolve the problem continues to persist. As many companies have made great strides for equality, it is unfair to say that *all* females are being monetarily discriminated against in the work force. Numerous companies, large and small, have incorporated the wage gap problem into their social responsibility initiatives. In recent years as more and more companies continue to battle the wage gap, we have seen improvements in reducing it.

Despite many great strides being taken, the problem of the gender wage gap has not been eradicated. As females continue playing a larger and more important role in the workforce, it is extremely pertinent to understand the underlying factors that have caused the wage gap so it can be prevented in the future.

Row
-----------------------------------------------------------

### So Now What?

I analyzed the historical trends of the wage gap and identified the underlying forces that may have been the root cause. I looked at the trends of employment and salary differences throughout history across numerous industries and age groups for both males and females. I also viewed the wage gap by location throughout the United States. Finally, I analyzed the opportunity gap to highlight the differences in career progression between men and women.

My five approaches to understanding the gender wage gap are found below:

  1. **Wage Gap by Industry**: The dataset *"jobs_gender"* provides great detail of different occupations that were held by both males and females from the years *2013 - 2016*. The data has been prepared in a fashion where each occupation is listed under an *Industry_Broad* and *Industry_Specific* category to assist with the industry analysis. I filtered each of the industries to see which of them have a majority of employment from males vs. females. From there, I determined if the wage gap in male dominated industries is higher than the wage gap present in female dominated industries. I found out if certain industries (both broad and specific) present larger wage gaps than others, and if so, what the underlying causes were.

  2. **Wage Gap by Age**: My second approach was to analyze the wage gap by age group. The dataset *"earnings_female"* breaks down the Earnings Ratio for seven different age groups from the years *1979 - 2011*. I tested to see if certain age groups were more susceptible to the wage gap than others.

  3. **Wage Gap by Location**: The dataset *"wagegap_map"* provides information on each state's earnings ratio and the strength of their equal pay laws. I discovered which states contained the largest wage gaps and uncovered the underlying reasons, if any, for the increased gaps. I also compared the strength of the equal pay laws to the national rank of the state for the gender pay gap.

  4. **Overall Female Employment**: In this analysis, I pulled everything together from the first two independent analyses as well as data from the last dataset *"employed_gender"*. With all of this data, I observed the overall trend of the wage gap as well as the increasing presence of females in the workforce. The goal of this analysis was to determine if these trends follow along with social movements that have happened over the past 3-5 decades for females.
  
  5. **Opportunity Gap**: Finally, I utilized parts of each dataset to analyze the opportunity gap. The opportunity gap analysis focuses on the percentage of executives across various industries that are men as opposed to women and shows if the top paying occupations for each industry are held by men or women. I also included research from other sources to comment on the access to promotions, higher paying jobs, etc. for men compared to women.
  
The mission of my project is to inform the reader how the wage gap has changed over time based on different factors (industry, age group, location, employment numbers, etc.). Also, the goal of my analysis is to enlighten the reader about the opportunity gap and explain its underlying causes. Overall, I want to determine if the wage gap is truly *improving or not*. In conclusion, I hope to provide clarity on the industries, age groups, and locations that have been affected by the gender wage gap the most. Hopefully, this will inspire the consumers of this data to make positive changes for the future of these afflicted groups.

### What About The Data?

For this project I will be utilizing four different datasets, two of which are historical datasets originating from the [Bureau of Labor Statistics](https://www.bls.gov/opub/ted/2012/ted_20121123.htm), and the others originating from the [Census Bureau](https://www.census.gov/data/tables/time-series/demo/industry-occupation/median-earnings.html). Three of the datasets primarily focus on the **earnings ratio**, or what females are making compared to males. The formula for the earnings ratio is below:
$$ \text{Earnings Ratio} = \frac{\text{Female Median Earnings}}{\text{Male Median Earnings}} $$
Additionally, the other dataset compares the differences between full-time and part-time employment for males, females, and overall.
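
As a quick worked example of the earnings ratio formula, the sketch below applies it to made-up median earnings. The function name `earnings_ratio` and the dollar amounts are illustrative assumptions and do not come from any of the datasets.

```r
# Hypothetical helper: earnings ratio = female median / male median
earnings_ratio <- function(female_median, male_median) {
  female_median / male_median
}

# Made-up medians, for illustration only
example_ratio <- earnings_ratio(female_median = 40000, male_median = 50000)
example_ratio      # 0.8, i.e. 80 cents earned for every dollar a man earns
1 - example_ratio  # the corresponding wage gap, about 0.2
```

A ratio of 1 would mean no gap, so the wage gap can be read as one minus the earnings ratio.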

Before continuing to the analysis, I must address the limitations presented by the data. The two major limitations are as follows:

* **Binary Genders**: The data is based solely on a binary gender identification. As a result, it omits individuals who classify their gender as non-binary, including some members of the LGBTQA community.
* **Inconsistent Timelines**: The datasets that I am deriving my analyses from have timelines that are not consistent with one another. Because of this, I decided to come up with five different approaches for dissecting the gender wage gap.
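
The timeline limitation can be checked with a few lines of R. This is only a sketch: the year ranges are the ones quoted in this report for each dataset, and the data frame `timelines` is a hypothetical helper object, not something read from the files.

```r
# Year ranges as described in this report (typed by hand, not read from the data)
timelines <- data.frame(
  dataset = c("earnings_female", "jobs_gender", "employed_gender", "wagegap_map"),
  start   = c(1979, 2013, 1968, 2018),
  end     = c(2011, 2016, 2016, 2018)
)

# A window shared by all datasets would run from the latest start to the earliest end
common_start <- max(timelines$start)  # 2018
common_end   <- min(timelines$end)    # 2011

has_common_years <- common_start <= common_end
has_common_years  # FALSE: no single year is covered by every dataset
```

Because the latest start year falls after the earliest end year, the datasets cannot be merged into one aggregate time series, which is why each one is analyzed separately.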


Importing the Data {data-navmenu="Background" data-orientation=columns}
===============================================================================

Column {.tabset .tabset data-width=500}
--------------------------------------------------------------------------------

### Required Packages 

Before discussing importing the datasets used in this analysis, I wanted to mention the required R packages. For this analysis, I used a lot of the standard packages for cleaning and visualizing data. Most of these packages are used with other data manipulation/visualization techniques so hopefully not many of the packages need to be installed strictly for this analysis if you are following along with the code. 

A few of the packages that may need to be installed by the user include **ggthemes** and **plotly**. The package ggthemes is an add-on to ggplot2 that supplies extra themes, and it is installed separately from ggplot2. The package plotly is a supplementary visualization tool that I will use for this analysis.

```{r, echo=TRUE, results = "hide", message=FALSE, warning=FALSE}
## Load Required Packages ##
library(tidyverse) #Use to tidy data
library(readr) #Use to easily import delimited data
library(dplyr) #Use to manipulate data
library(tibble) #Use to manipulate data
library(magrittr) #Use to insert pipe operators
library(DT)	#Use to create functional tables in HTML
library(knitr) #Use to create dynamic report generation
library(rmarkdown) #Use to convert R Markdown documents into a variety of formats
library(ggthemes) #Use to implement themes across report
library(ggrepel) #Use to label data
library(ggplot2) #Use to create visualizations
library(plotly)	#Use to create dynamic plotting
library(gridExtra) #Use to arrange plots
library(reshape2) #Use to transform data frames
```
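
If you are following along and are unsure which packages are already installed, something like the sketch below may help. The helper `missing_packages` is a hypothetical name introduced here for illustration, and the install step is left commented out so nothing is installed by accident.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Check the two packages most likely to be missing
to_install <- missing_packages(c("ggthemes", "plotly"))
to_install

# Uncomment to install whatever is missing:
# if (length(to_install) > 0) install.packages(to_install)
```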


### Before Importing the Data

There are four different datasets that I am using to analyze the wage gap. However, the timelines of these datasets do not overlap well, so I did not use the data in an aggregate analysis. Instead, I elected to treat each of the datasets on their own and clean each of them individually to develop separate analyses. The data importation process for each dataset can be found on the right.


Column {.tabset .tabset data-width=650}
--------------------------------------------------------------------------------

### Earnings_female

#### **Earnings_female Data**

As previously mentioned, this data originated from the [Bureau of Labor Statistics](https://www.bls.gov/opub/ted/2012/ted_20121123.htm). The original data contains 3 variables and 264 observations, covers the years 1979 - 2011, and contains the earnings ratio for each year across seven age groups plus an overall total:

  * Total, 16 years and older
  * 16-19 years
  * 20-24 years
  * 25-34 years
  * 35-44 years
  * 45-54 years
  * 55-64 years
  * 65 years and older
  
To gather more information regarding the dataset, click [here](https://www.bls.gov/opub/reports/womens-databook/archive/womenlaborforce_2011.pdf)

The first thing I did was import the original csv file using the *read_csv* function:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
## Import the Data ##
earnings_female <- readr::read_csv("earnings_female.csv") 
```
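
A simple sanity check on the import is to confirm that the row count matches the described structure. The sketch below assumes one row per year and group, which is an assumption about the file layout rather than something verified here.

```r
# Years and groups as described above
years  <- 1979:2011
groups <- c("Total, 16 years and older", "16-19 years", "20-24 years",
            "25-34 years", "35-44 years", "45-54 years", "55-64 years",
            "65 years and older")

# Assuming one row per (year, group) pair
expected_rows <- length(years) * length(groups)
expected_rows  # 33 years x 8 groups = 264

# After importing, the check could be run as:
# stopifnot(nrow(earnings_female) == expected_rows)
```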

### Jobs_gender

#### **Jobs_gender Data**

The jobs_gender data originated from the [Census Bureau](https://www.census.gov/data/tables/time-series/demo/industry-occupation/median-earnings.html). The original data contains 12 variables and 2088 observations that have dates ranging from 2013 - 2016. The data is centered around employment numbers and earning percentages for males and females. The column names are below:

  * year
  * occupation
  * major_category
  * minor_category
  * total_workers
  * workers_male
  * workers_female
  * percent_female
  * total_earnings
  * total_earnings_male
  * total_earnings_female
  * wage_percent_of_male
  
  
To view the original data, click [here](https://www.census.gov/data/tables/time-series/demo/industry-occupation/median-earnings.html)

The first thing I did was import the original csv file using the *read_csv* function:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
## Import the Data ##
jobs_gender <- readr::read_csv("jobs_gender.csv")
```

### Employed_gender

#### **Employed_gender Data**

As previously mentioned, this data originated from the [Bureau of Labor Statistics](https://www.bls.gov/opub/ted/2017/percentage-of-employed-women-working-full-time-little-changed-over-past-5-decades.htm). The dataset shows the percentage of employed people working full time and part time from the years 1968 - 2016. The dataset contains 7 variables and 49 observations with the column names shown below:

  * year
  * total_full_time
  * total_part_time
  * full_time_female
  * part_time_female
  * full_time_male
  * part_time_male

To view the original data, click [here](https://www.bls.gov/opub/ted/2017/percentage-of-employed-women-working-full-time-little-changed-over-past-5-decades.htm)

The first thing I did was import the original csv file using the *read_csv* function:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
## Import the Data ##
employed_gender <- readr::read_csv("employed_gender.csv")
```

### Wagegap_map

#### **Wagegap_map Data**

The Wagegap_geo data is from the [Census Bureau](https://www.census.gov/data/tables/time-series/demo/industry-occupation/median-earnings.html). I scraped this website using Python in order to produce the dataset I needed for this analysis. The dataset displays each state and its gender pay ratio from 2018, which is the most recent data I am using in this analysis. The dataset contains 5 variables and 50 observations, one for each state, with the column names shown below:

  * State
  * Code
  * Gender Pay Ratio
  * National Rank
  * Equal Pay Laws

To view the original data, click [here](https://www.census.gov/data/tables/time-series/demo/industry-occupation/median-earnings.html)

The first thing I did was import the original csv file using the *read_csv* function:

```{r, echo=TRUE, message=FALSE, warning=FALSE}
## Import the Data ##
gpratio_map <- readr::read_csv("Wagegap_geo.csv")
```


Cleaning the Data {data-navmenu="Background"}
===============================================================================

Column {.tabset .tabset-fade}
--------------------------------------------------------------------------------

Process of Cleaning the Data

### Earnings_female 

I examined and viewed the original data below:

```{r, warning=FALSE, message=FALSE}
datatable(head(earnings_female,50))
```

**Data Dictionary**

```{r, message=FALSE,warning=FALSE}
Variable.type <- lapply(earnings_female,class)
Variable.desc <- c("Year", "Age group", "Female wages as a percent of male wages, which is the earnings ratio of females")
Variable.name1 <- colnames(earnings_female)
data.desc <- as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
```
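
Since the same data dictionary steps are repeated for every dataset, they could be wrapped in a small function. The sketch below is one possible way to do that; `make_data_dictionary` and the toy data frame are hypothetical names introduced for illustration and are not part of the original analysis.

```r
# Hypothetical helper: build a data dictionary for any data frame
make_data_dictionary <- function(df, descriptions) {
  # One description is required per column
  stopifnot(length(descriptions) == ncol(df))
  data.frame(
    `Variable Name`        = colnames(df),
    `Data Type`            = unname(vapply(df, function(col) class(col)[1], character(1))),
    `Variable Description` = descriptions,
    check.names = FALSE,
    stringsAsFactors = FALSE
  )
}

# Toy data frame standing in for an imported dataset
toy <- data.frame(year = c(1979L, 1980L), ratio = c(62.3, 64.2))
toy_dictionary <- make_data_dictionary(toy, c("Year", "Earnings ratio"))
toy_dictionary
```

The result is a plain data frame, so it can be passed to `kable()` in the same way as the dictionary built above.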

##### **Cleaning the Data**

After importing the data and doing the exploratory analysis I realized that I should change a few aspects to keep the data consistent.

1. I renamed two of the columns to better represent the values that are being displayed.
2. I reformatted each column name into *snake_case* to match the data from the other datasets. 
3. I took the data in the age_group category and renamed the value of *Total, 16 years and older* to **Total** to make the data easier to understand.

```{r, echo=TRUE, message=FALSE, warning=FALSE}
names(earnings_female) <- c("year","age_group","earnings_ratio")
earnings_female$age_group[earnings_female$age_group == "Total, 16 years and older"] <- "Total"
```

### Jobs_gender

I examined and viewed the original data below:

```{r, message= FALSE, warning=FALSE}
datatable(head(jobs_gender,50))
```

**Data Dictionary**

```{r, message=FALSE, warning=FALSE}
Variable.type <- lapply(jobs_gender,class)
Variable.desc <- c("Year", "Specific job/career", "Broad industry of occupation", "Specific industry of occupation", "Total estimated full-time workers above 16 years old", "Estimated full-time male workers above 16", "Estimated full-time female workers above 16","The percent of females in a specific occupation","Total estimated median earnings for full-time workers above 16 years old", "Estimated median earnings for males above 16 years old", "Estimated median earnings for females above 16 years old", "Female wages as a percent of male wages, which is the earnings ratio of females")
Variable.name1 <- colnames(jobs_gender)
data.desc <- as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
```

##### **Cleaning the Data**

After viewing the data, I made a few changes to the data. The changes are found below: 

1. I changed the column names to better represent the values provided. The changes included:

  * *major_category* to **industry_broad**
  * *minor_category* to **industry_specific**
  * *wage_percent_of_male* to **earnings_ratio_female**
  
```{r, warning=FALSE, message=FALSE}
names(jobs_gender) <- c("year","occupation","industry_broad","industry_specific",
                        "total_workers","workers_male","workers_female","percent_female",
                        "total_earnings","total_earnings_male","total_earnings_female",
                        "earnings_ratio_female")
```

2. Then I checked for duplicate and missing values that would affect the data:

```{r, warning=FALSE, message=FALSE}
colSums(is.na(jobs_gender))
```

3. There are "NA" values present in the last three columns. The NA values for "earnings_ratio_female" were entered for rows whose *sample sizes are too small*. However, the minimum total workers for the rows without "NA" in the "earnings_ratio_female" column is **11,383**, while the maximum total workers for the rows with "NA" in that column is **441,982**. Given that the maximum total workers for the rows containing "NA" is larger than the minimum total workers for the rows not containing "NA", the argument that the sample size is too small for the rows containing "NA" is invalid. Therefore, I used the earnings ratio formula to fill in the NA values for this variable.

After I filled in the NA values for "total_earnings_female" and "total_earnings_male", I realized that there were some negative values. There also were values of 0 or very low numbers for the earnings columns even though there were hundreds of workers in that occupation. Therefore, I ended up removing every NA value for female and male earnings to avoid issues with these values skewing the data further into the analysis.
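
The sample-size comparison and the check for implausible earnings described in step 3 can be sketched as follows. The tibble `toy_jobs` is made-up data that only mimics the relevant columns of jobs_gender; the real figures quoted above come from the full dataset, not from this sketch.

```r
library(dplyr)

# Toy rows standing in for jobs_gender; all values are made up
toy_jobs <- tibble(
  total_workers         = c(12000, 450000, 30000, 900),
  total_earnings_female = c(40000, 52000, NA, 0),
  total_earnings_male   = c(50000, 60000, 45000, 38000),
  earnings_ratio_female = c(80.0, NA, NA, NA)
)

# Range of total_workers for rows with and without a reported ratio
workers_by_na <- toy_jobs %>%
  group_by(ratio_missing = is.na(earnings_ratio_female)) %>%
  summarise(min_workers = min(total_workers),
            max_workers = max(total_workers))
workers_by_na

# Rows with zero or negative median earnings are treated as suspect
suspect_rows <- toy_jobs %>%
  filter(total_earnings_female <= 0 | total_earnings_male <= 0)
suspect_rows
```

If the largest occupation with a missing ratio is bigger than the smallest occupation with a reported ratio, sample size alone cannot explain the missing values.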

```{r, warning=FALSE, message=FALSE}
## Mutate the New Column for Earnings Ratio
## (columns are referenced by name inside mutate(), without the jobs_gender$ prefix)
jobs_gender <- 
  jobs_gender %>% 
  mutate(Earnings_Ratio = total_earnings_female / total_earnings_male)

## Remove Original Column
jobs_gender <- select(jobs_gender, -earnings_ratio_female)
```


```{r, warning=FALSE, message=FALSE}
## Removing all observations with NA Values
jobs_gender <- na.omit(jobs_gender) 
colSums(is.na(jobs_gender))
```

4. Given the numbers used, I felt it would be best presented if all decimals were kept at a maximum of one place.

```{r, warning=FALSE, message=FALSE}
## Rounding Percentages
## round() is vectorised, so it is applied to the column directly; assigning the
## result of lapply() would turn the numeric column into a list column.
jobs_gender$percent_female <- round(jobs_gender$percent_female, 1)

## earnings_ratio_female was removed above and percent_male was never created,
## so those columns are not rounded here. Earnings_Ratio is a fraction
## (for example 0.85) and is left unrounded to keep its precision.
```

### Employed_gender

I examined and viewed the original data below:

```{r, message= FALSE, warning=FALSE}
datatable(head(employed_gender,50))
```

There is no cleaning required for this dataset.

##### **Data Dictionary**

```{r, message=FALSE, warning=FALSE}
Variable.type <- lapply(employed_gender,class)
Variable.desc <- c("Year", "Percent of total employed people usually working full-time", "Percent of total employed people usually working part time", "Percent of employed females usually working full time", "Percent of employed females usually working part time", "Percent of employed males usually working full time", "Percent of employed men usually working part time")
Variable.name1 <- colnames(employed_gender)
data.desc <- as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
```

### Wagegap_map

I examined and viewed the original data below:

```{r, message= FALSE, warning=FALSE}
datatable(head(gpratio_map,50))
```

There is no cleaning required for this dataset.

##### **Data Dictionary**

```{r, message=FALSE, warning=FALSE}
Variable.type <- lapply(gpratio_map, class)
Variable.desc <- c("State in the US", "State Code", "Earnings ratio of females to that of men in decimal form", "The rank the state is from 1 to 50, number 1 means the state has the highest earnings ratio", "The strength of the Equal Pay Laws the state has enacted")
Variable.name1 <- colnames(gpratio_map)
data.desc <- as_tibble(cbind(Variable.name1,Variable.type,Variable.desc))
colnames(data.desc) <- c("Variable Name","Data Type","Variable Description")
library(knitr)
kable(data.desc)
```

Column {.tabset .tabset-fade}
--------------------------------------------------------------------------------

Cleaned Data

### Earnings_female

In the clean dataset, the range of years remains the same but the age groups are now:

  * Total
  * 16-19 years
  * 20-24 years
  * 25-34 years
  * 35-44 years
  * 45-54 years
  * 55-64 years
  * 65 years and older
  
The earnings ratio column ranges from 56.8% to 95.4%, reaffirming that a wage gap does in fact exist. I will analyze these values later in the exploratory analysis. Additionally, each of the variables contained in this dataset is important for the analysis, so I did not remove any of the observations from this table. The cleaned data is below:

```{r, message=TRUE, warning=FALSE}
library(DT)
datatable(head(earnings_female,50))
```

### Jobs_gender

Now with the clean data, there is a uniform naming convention and a recomputed earnings ratio column. All of the data from this dataset will be used; however, the most important variables for this analysis are listed below:

  * industry_broad 
  * industry_specific 
  * workers_male 
  * workers_female 
  * total_earnings_male 
  * total_earnings_female 
  * Earnings_Ratio 
  
After cleaning the data, I have removed the rows with NA values for the *total_earnings_male* and *total_earnings_female* variables and replaced *earnings_ratio_female* with the recomputed *Earnings_Ratio*, resulting in 2019 complete observations and 12 variables. The cleaned data is below:

```{r, warning=FALSE, message=FALSE}
datatable(head(jobs_gender,50))
```

### Employed_gender

Compared to the other datasets, this data is very useful without a lot of cleaning. There are no missing values or duplicate values, and all column names are written with consistent *snake_case* formatting. I elected not to make any initial changes to this data for this reason. The cleaned data is below:

```{r, warning=FALSE, message=FALSE}
datatable(head(employed_gender,50))
```

### Wagegap_map

The wagegap_map data is very useful as well without a lot of cleaning. The data is so clean because I gathered it through web scraping, which allowed me to choose the specific variables I wanted. There are no missing values and the data is fairly straightforward. I also elected to leave this data as is. The cleaned data is below:

```{r, warning=FALSE, message=FALSE}
datatable(head(gpratio_map,50))
```


Size Overview {data-navmenu="Industry" data-orientation=columns}
==============================================================================

Column {.sidebar}
--------------------------------------------------------------------------------

#### Industry Analysis Overview

For the wage gap by industry analysis, I studied the breakdowns of earnings for each of the eight industries to observe any patterns in the respective wage gaps. First, I was interested in observing the industries that are dominated by females vs. males. Once I determined which industries fall into which category, I analyzed the differences in the average female and male median earnings for both female and male dominated industries. My hypothesis before beginning this analysis is that even in female dominated industries, the average pay for females will be lower than that of males. My goal was to identify if this hypothesis is true, if there are any outlier industries, and if there are underlying factors that explain the wage gap for each industry and as a whole.

#### Size Analysis

Before beginning this analysis, I wanted to display which occupations fell under which specific industries and which specific industries fell under which broad industries. The visualization is meant to give more clarification to the ensuing size analysis. See the industry breakdown to the right to see this information. 

In the first graph, I showed the 8 broad industries to see which of them have a majority of employment from women based on the *avg_females* field that I temporarily created. Any industry that has 50% or more women is a female dominated industry. From this graph, you can see that females dominate 3 industries:

* **Healthcare Practitioners and Technical**
* **Education, Legal, Community Service, Arts, and Media**
* **Sales and Office**

On the other hand, males dominate 5 industries:

* Service
* Management, Business, and Financial
* Computer, Engineering, and Science
* Production, Transportation, and Material Moving
* Natural Resources, Construction, and Maintenance
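
The 50% rule used for this classification can be sketched on toy data as follows. The industry labels and worker counts are made up, and `dominated_by` is a hypothetical column name introduced for illustration.

```r
library(dplyr)

# Toy worker counts; industry names and numbers are made up
toy_jobs <- tibble(
  industry_broad = c("A", "A", "B", "B"),
  workers_female = c(600, 300, 100, 250),
  total_workers  = c(800, 700, 500, 500)
)

# Share of female workers per industry, then the 50% classification rule
industry_share <- toy_jobs %>%
  group_by(industry_broad) %>%
  summarise(avg_females = sum(workers_female) / sum(total_workers)) %>%
  mutate(dominated_by = if_else(avg_females >= 0.5, "Female", "Male"))
industry_share
```

Dividing summed worker counts, rather than averaging each occupation's percentage, weights large occupations more heavily, so the share reflects the industry's workforce as a whole.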

The industries that are dominated by females vs. males for the most part aren't surprising due to the historical stigma surrounding each of the industries. Women have traditionally held roles as medical assistants, nurses, teachers, childcare workers, and administrative assistants, while men have traditionally held roles as production workers, mechanics, analysts, engineers, and workers dealing with any type of natural resource. One industry that I was surprised by is the service industry. However, its percentage of females is 49%, so it is very close. One reason for a smaller than expected percentage is that firefighters, police officers, and other justice system occupations are listed under the service industry, and those careers are heavily dominated by men.

Next, I looked a little deeper at the specific industries in the same fashion to identify any specific industries that females dominate even though they do not dominate the corresponding broad industry. From the plot you can see that females dominate the healthcare support and personal care and service categories, which fall under the Service broad industry. However, both of these categories are related to the medical field, which is an industry that females dominate. Also, females dominate the business and financial operations field, which falls under the Management, Business, and Financial broad industry. After further analysis into the occupations within this category, I discovered that the majority of these roles include marketing analysts, event planners, and human resources workers. These roles are traditionally held by women. A few financial specialist and accounting roles are predominantly held by women and defy this trend, so I consider them to be outliers.

To further confirm female versus male dominated industries, I observed the top 10 occupations for women across all industries and noted if any of the occupations fell outside of their three dominated industries. Two of the occupations, medical transcriptionists and childcare workers, are in the service industry. But, their fields are closely related to the industries that females dominate, so I do not consider them to be outliers.

In addition to seeing the top 10 occupations for females, I also wanted to see the 10 occupations females have the lowest presence in, or the top 10 occupations for males. In the plot to the right, you can see that all of the occupations fall within male dominated industries.
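
The ranking step behind both top 10 plots can be sketched in base R. The data frame below is made up, and it assumes every worker is recorded as either male or female, so the occupations with the lowest female share are also those with the highest male share.

```r
# Toy occupations with made-up female shares
toy_occ <- data.frame(
  occupation  = c("W", "X", "Y", "Z"),
  avg_females = c(0.95, 0.10, 0.60, 0.03),
  stringsAsFactors = FALSE
)

# Highest female share first; keep the top 2 (the dashboard keeps 10)
top_female <- head(toy_occ[order(-toy_occ$avg_females), ], 2)
top_female$occupation

# Lowest female share first, i.e. the most male dominated occupations
top_male <- head(toy_occ[order(toy_occ$avg_females), ], 2)
top_male$occupation
```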

Column {.tabset .tabset-fade}
-------------------------------------------------------------------------------

### Industry Breakdown

```{r, echo = FALSE, message = FALSE, warning = FALSE}
library(d3Tree)
library(treemap) # Provides the treemap() function used below

tm <- treemap(jobs_gender, 
              index = c("industry_broad", "industry_specific"),
              vSize = "total_earnings", 
              vColor = "total_earnings",
              type = "value", 
              title = "Specific Industries in Each Broad Industry",
              fontsize.title = 14, 
              palette = "Purples")

tm2 <- treemap(jobs_gender, 
               index = c("industry_specific", "occupation"),
               vSize = "total_earnings", 
               vColor = "total_earnings",
               type = "index", 
               title = "Occupations in Each Specific Industry",
               fontsize.title = 14, 
               palette = "Purples")

```


### Broad Industries

```{r, echo=FALSE, message=FALSE, warning=FALSE}
## Female vs. Male Dominated Industries

females_vs_males <- jobs_gender %>%
  group_by(industry_broad) %>%
  summarise(avg_females = sum(workers_female) / sum(total_workers), 
            avg_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(avg_females)

ggplot(data = females_vs_males, 
       aes(x = reorder(industry_broad, +avg_females), 
           y = (avg_females))) + 
  geom_bar(stat = "identity", 
           aes(fill = avg_females >= 0.5)) + 
  scale_fill_discrete(name = "% Of Females", labels = c("< 50%", " >= 50%")) +
  ylab("% Female") + 
  scale_x_discrete(name = "Industry", labels = function(x) str_wrap(x, width = 30)) +
  theme(axis.text.x = element_text(hjust = 0)) +
  ggtitle("Female vs. Male Dominated Industries",
          subtitle = "Broad Industries having more than 50% Females") +
  geom_text(aes(label=(avg_females)), label=round(females_vs_males$avg_females, digits = 2), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=4) + 
  coord_flip()
```

### Specific Industries

```{r, echo=FALSE, message=FALSE, warning=FALSE}
## Female vs. Male Dominated Specific Industries

females_vs_males_1 <- jobs_gender %>%
  group_by(industry_specific) %>%
  summarise(avg_females = sum(workers_female) / sum(total_workers), 
            avg_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(avg_females)
ggplot(data = females_vs_males_1, 
       aes(x = reorder(industry_specific, +avg_females), 
           y = (avg_females))) + 
  geom_bar(stat = "identity", aes(fill = avg_females >= 0.5)) + 
  scale_fill_discrete(name = "% Of Females", labels = c("< 50", ">= 50")) +
  scale_y_continuous(name = "% Female", labels = function(y) paste0(y*100,"%")) +   xlab("Specific Industry") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ggtitle("Female vs. Male Dominated Specific Industries",
          subtitle = "Specific Industries having more than 50% Females") +
  geom_text(aes(label=(avg_females)), label=round(females_vs_males_1$avg_females, digits = 2), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=3) +
  coord_flip()
```

### Top 10 Female

```{r,echo = FALSE, message=FALSE, warning=FALSE}
## Top 10 Female Occupations

females_vs_males_2 <- jobs_gender %>%
  group_by(occupation) %>%
  summarise(avg_females = sum(workers_female) / sum(total_workers),
            avg_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(avg_females) %>%
  top_n(10, avg_females)
ggplot(data = females_vs_males_2, 
       aes(x = reorder(occupation, +avg_females), 
           y = (avg_females))) + 
  geom_bar(stat = "identity", fill = "darkturquoise") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "% Female", labels = function(y) paste0(y*100,"%")) + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("Top Female Occupations",
          subtitle = "10 Highest Female Dominated Occupations Across All Industries") +
  geom_text(aes(label=avg_females), label=round(females_vs_males_2$avg_females, digits = 2), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=4) +
  coord_flip()
```

### Top 10 Male

```{r, echo = FALSE, warning = FALSE, message = FALSE}
females_vs_males_3 <- jobs_gender %>%
  group_by(occupation) %>%
  summarise(avg_females = sum(workers_female) / sum(total_workers),
            avg_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(avg_males) %>%
  top_n(10, avg_males)
ggplot(data = females_vs_males_3, 
       aes(x = reorder(occupation, +avg_males), 
           y = (avg_males))) + 
  geom_bar(stat = "identity", fill = "darkturquoise") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "% Male", labels = function(y) paste0(y*100,"%")) + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("Top Male Occupations",
          subtitle = "10 Highest Male Dominated Occupations Across Industries") +
  geom_text(aes(label=avg_males), label=round(females_vs_males_3$avg_males, digits = 2), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=4) +
  coord_flip()
```


Earnings Overview {data-navmenu="Industry" data-orientation=columns}
======================================================================================================

Column {.sidebar}
--------------------------------------------------------------------------------

#### Industry Earnings Overview

To begin the analysis of the earnings per industry, I observed the density plots of the total earnings per industry category to determine the industry with the highest average earnings. The main conclusions from this plot are that:

1. **Computer, Engineering, and Science** has the highest average median earnings and it is a male dominated industry
2. The distributions for each industry are *skewed to the right*, meaning that the means are higher than the medians. The skews are more severe for certain industries due to outliers, which I examined further in the next analysis
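
To illustrate why a right skew pulls the mean above the median, here is a minimal sketch with made-up earnings in which a single high value acts as the outlier.

```r
# Made-up earnings with one large outlier
toy_earnings <- c(30000, 35000, 40000, 45000, 150000)

mean(toy_earnings)    # 60000, pulled upward by the outlier
median(toy_earnings)  # 40000, unaffected by the size of the outlier
```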

To provide an overview of the wage gap regardless of industry, I wanted to briefly show the overall trend in earnings between males and females. Females have an average median earnings of *$49,640* and males have an average median earnings of *$53,218*. Therefore, our dataset informs us that on average, men make about **$3,600** more than women. The p-value for the two-sample t-test was 0.0257, which is lower than the alpha value of 0.05, so I can reject the null hypothesis and conclude that the difference in the means is significant. There are outliers in each plot, but the outliers in the male earnings are the most significant. Also, there are about 56 million more men than women documented as working full-time in our dataset. Even though this is a large difference, our sample size for the remaining occupations in our data set is large enough for both men and women.
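
The decision rule for the two-sample t-test can be sketched as follows. The earnings vectors are made up, and `t.test()` is used with its defaults, which run a Welch test; whether the reported p-value came from the Welch or the pooled-variance version is not stated, so this shows the procedure only.

```r
# Made-up median earnings for two groups
female_earnings <- c(40000, 42000, 44000, 46000, 48000)
male_earnings   <- c(60000, 62000, 64000, 66000, 68000)

alpha <- 0.05

# Two-sample t-test (Welch by default in R)
result <- t.test(female_earnings, male_earnings)
result$p.value

# Reject the null hypothesis of equal means when the p-value is below alpha
reject_null <- result$p.value < alpha
reject_null

# The same rule applied to the p-value reported in this section
reported_p <- 0.0257
reported_p < alpha  # TRUE, so the difference is significant at the 5% level
```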

After displaying the average median salaries of females and males overall, I wanted to show the difference in female dominated industries vs. male dominated industries. I previously defined female and male dominated industries, so now I aggregated the industries that belong to each category. Then, I calculated the average median earnings for males and females for both the female and male dominated industries. Female dominated industry earnings for both females and males are the two leftmost bars and the male dominated industry earnings for both females and males are the two rightmost bars. **My original hypothesis that even in female dominated industries the average pay for females is lower than that of males was supported by the data**. I conducted a two-sample t-test to support my analysis, which resulted in a p-value of <0.0001 for the female dominated industries and 0.0002 for the male dominated industries. Therefore, the difference in the means is significant in both cases. From the graph, it is evident that females on average make 10% more in female dominated industries than they do in male dominated industries, but they are still making less than men. On the other hand, males on average make 10% less in male dominated industries than they do in female dominated industries. Regardless, men on average make more than females in both male and female dominated industries.
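
When aggregating rows that belong to a set of industries, it is worth noting that `==` and `%in%` behave differently when the right-hand side holds several names. The short sketch below uses a made-up vector to show the difference.

```r
# Made-up industry column and the set of industries to keep
industries <- c("Service", "Sales and Office", "Service", "Sales and Office")
targets    <- c("Sales and Office", "Service")

# == compares position by position, recycling `targets` along the column
industries == targets    # FALSE FALSE FALSE FALSE

# %in% tests whether each value belongs to the set
industries %in% targets  # TRUE TRUE TRUE TRUE
```

With `==`, a row is kept only when its value happens to line up with the recycled position in `targets`, so `%in%` is the safer choice for membership filters.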

Column {.tabset .tabset-fade}
-------------------------------------------------------------------------------

### Distribution of Earnings

```{r, echo = FALSE, message=FALSE, warning=FALSE}
## Density Plot of Earnings

plot <- ggplot(jobs_gender,
       aes(total_earnings, fill = industry_broad)) + 
  geom_density(alpha = 0.3) +
  xlab("Total Earnings") +
  ylab("Density") +
  ggtitle("Distribution of Total Earnings Per Industry")

ggplotly(plot)
```

### Overall Median Earnings

```{r, echo = FALSE, message=FALSE, warning=FALSE}
## Median Earnings Per Gender

x <- data.frame(total_median_earnings = jobs_gender$total_earnings, 
                female_median_earnings = jobs_gender$total_earnings_female, 
                male_median_earnings = jobs_gender$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) +
  geom_boxplot() + 
  theme_bw() +
  scale_y_continuous(name = "Total Income", labels = scales::dollar) +
  scale_x_discrete(name = "Earnings Category") +
  ggtitle("Overall Median Earnings", 
          subtitle = "Distribution of Earnings Per Gender") + 
  coord_flip()
```

### Female vs. Male Dominated Industry Earnings

```{r, echo = FALSE, message=FALSE, warning=FALSE}
## Female & Male Earnings Per Female & Male Dominated Industries

# Industry groupings. %in% tests membership; == would recycle the vector
# against the rows and silently drop most of the matching occupations.
female_dom <- c("Healthcare Practitioners and Technical", "Education, Legal, Community Service, Arts, and Media", "Sales and Office")
male_dom <- c("Service", "Management, Business, and Financial", "Computer, Engineering, and Science",
  "Production, Transportation, and Material Moving", "Natural Resources, Construction, and Maintenance")

# Average of the median earnings for each gender within each group
female_dominated_industries <- jobs_gender %>%
  filter(industry_broad %in% female_dom) %>%
  summarise(F_dom_female = mean(total_earnings_female))
female_dominated_industries_males <- jobs_gender %>%
  filter(industry_broad %in% female_dom) %>%
  summarise(F_dom_male = mean(total_earnings_male))
male_dominated_industries <- jobs_gender %>%
  filter(industry_broad %in% male_dom) %>%
  summarise(M_dom_female = mean(total_earnings_female))
male_dominated_industries_males <- jobs_gender %>%
  filter(industry_broad %in% male_dom) %>%
  summarise(M_dom_male = mean(total_earnings_male))
Earnings <- data.frame(female_median_earnings_1 = female_dominated_industries, male_median_earnings_1 = female_dominated_industries_males, female_median_earnings_2 = male_dominated_industries, male_median_earnings_2 = male_dominated_industries_males)
Earnings_difference <- melt(Earnings)
ggplot(Earnings_difference, 
       aes(x = variable, 
           y = value,
           fill = variable)) + 
  geom_bar(stat = "identity") + 
  theme_bw() + 
  coord_cartesian(ylim = c(25000, 55000)) +
  xlab("Earnings by Gender") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Female vs Male Dominated Industries", 
          subtitle = "Gender Earnings in Female and Male Dominated Industries")

```

Earnings Per Industry {data-navmenu="Industry" data-orientation=columns}
======================================================================================================

Column {.sidebar}
------------------------------------------------------------

#### Earnings Per Industry Overview

Next, I wanted to determine whether there are any outlier industries. An industry would be an outlier if it strays from the overall pattern, that is, if females make more than males on average in that industry. I created a boxplot for each industry to provide a quick visual of the differences in male and female median earnings on average. It is important to note that I did not remove any of the outliers from this dataset. Each of the outliers is important for providing an overall view of the industry and the discrepancies in median earnings between males and females. The following plots analyze the median earnings for males, females, and total for each broad industry.
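
The outlier-industry check described above can be sketched as a per-industry comparison of means. This is only an illustration: the data frame `toy`, the industry labels `A`, `B`, `C`, and all earnings values are made up, and the `gap` and `outlier` columns are names introduced here.

```r
# Hypothetical toy data standing in for jobs_gender (made-up values).
toy <- data.frame(
  industry_broad = c("A", "A", "B", "B", "C", "C"),
  total_earnings_female = c(40000, 42000, 55000, 57000, 30000, 31000),
  total_earnings_male   = c(45000, 47000, 54000, 55000, 36000, 35000)
)

# Mean of the occupation-level median earnings per industry.
by_industry <- aggregate(
  cbind(total_earnings_female, total_earnings_male) ~ industry_broad,
  data = toy, FUN = mean
)

# Gap = male minus female. A negative gap marks an outlier industry
# in the sense used here (females earn more on average).
by_industry$gap <- by_industry$total_earnings_male - by_industry$total_earnings_female
by_industry$outlier <- by_industry$gap < 0
print(by_industry)
```

In this toy table only industry `B` is flagged, because it is the only one where the female mean exceeds the male mean.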


Column {.tabset .tabset-fade}
-------------------------------------------------------------------------------

### Healthcare

#### Healthcare Practitioners and Technical Industry

The Healthcare Practitioners and Technical industry agrees with my hypothesis. The average median earnings are *$68,051* for females and *$74,269* for males, a difference of about *$6,000*. There are several outliers in each of the categories because certain occupations in the healthcare industry pay more than others, due to the years of schooling required, difficulty of the job, access to promotions, etc. The outliers are the most pronounced for males in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, and normally distributed, with unknown and possibly unequal variances. The p-value of the test is 0.0049, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.
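
A minimal sketch of the test on made-up numbers may help show what the call does. The vectors `female` and `male` are hypothetical and are not values from `jobs_gender`. The sketch runs the test two ways: directly on the two samples, and through the formula interface on a long-format data frame. Both use Welch's test, the default for `t.test()`.

```r
# Hypothetical earnings for five occupations in one industry (made-up values).
female <- c(61000, 66000, 70500, 72000, 68000)
male   <- c(67000, 71500, 76000, 80000, 74500)

# Wide form: pass the two samples directly. Welch's test (var.equal = FALSE)
# is the default, so the variances are not assumed to be equal.
wide <- t.test(female, male)

# Long form: one earnings column plus a group label, for the formula interface.
long_df <- data.frame(
  earnings = c(female, male),
  group    = rep(c("female", "male"), each = length(female))
)
long <- t.test(earnings ~ group, data = long_df)

# Both calls give the same p-value.
print(c(wide = wide$p.value, long = long$p.value))

# The null of equal means is rejected only when the p-value is below alpha.
alpha <- 0.05
reject_null <- wide$p.value < alpha
print(reject_null)
```

The wide form avoids building and relabelling an intermediate data frame, at the cost of less descriptive group names in the printed output.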

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}

sample1 <- jobs_gender %>%
  filter(industry_broad == c("Healthcare Practitioners and Technical")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_broad, earnings)
sample1$industry_broad[sample1$industry_broad == "Healthcare Practitioners and Technical"] <- "HPT_female"

sample1.1 <- jobs_gender %>%
  filter(industry_broad == c("Healthcare Practitioners and Technical")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_broad, earnings)
sample1.1$industry_broad[sample1.1$industry_broad == "Healthcare Practitioners and Technical"] <- "HPT_male"

final1 <- rbind(sample1, sample1.1)

t.test(earnings ~ industry_broad, data = final1)
```


```{r, echo = FALSE, message = FALSE, warning = FALSE}
### Box Plot for Females and Males In Healthcare Industry ###

Healthcare_Industry <- jobs_gender %>%
  filter(industry_broad == c("Healthcare Practitioners and Technical"))
x1 <- data.frame(Total_Earnings = Healthcare_Industry$total_earnings, 
                Female_Earnings = Healthcare_Industry$total_earnings_female, 
                Male_Earnings = Healthcare_Industry$total_earnings_male)
data <- melt(x1)
ggplot(data, 
       aes(x = variable, 
           y = value,
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Healthcare Industry") +
  coord_flip()
```

### Education

#### Education, Legal, Community Service, Arts, & Media Industry

The Education, Legal, Community Service, Arts, & Media industry agrees with my hypothesis. The average median earnings are *$46,258* for females and *$54,403* for males, a difference of about *$8,000*. There are several outliers in each of the categories because certain occupations in the legal industry pay more than others, due to the years of schooling required, difficulty of the job, access to promotions, etc. The outliers are the most pronounced for males in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, and normally distributed, with unknown and possibly unequal variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}

sample2 <- jobs_gender %>%
  filter(industry_broad == c("Education, Legal, Community Service, Arts, and Media")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_broad, earnings)
sample2$industry_broad[sample2$industry_broad == "Education, Legal, Community Service, Arts, and Media"] <- "ELCAM_female"

sample2.2 <- jobs_gender %>%
  filter(industry_broad == c("Education, Legal, Community Service, Arts, and Media")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_broad, earnings)
sample2.2$industry_broad[sample2.2$industry_broad == "Education, Legal, Community Service, Arts, and Media"] <- "ELCAM_male"

final2 <- rbind(sample2, sample2.2)

t.test(earnings ~ industry_broad, data = final2)
```


```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Education Industry ###

Education_Industry <- jobs_gender %>%
  filter(industry_broad == c("Education, Legal, Community Service, Arts, and Media"))
x <- data.frame(Total_Earnings = Education_Industry$total_earnings, 
                Female_Earnings = Education_Industry$total_earnings_female, 
                Male_Earnings = Education_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() +
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Education Industry") +
  coord_flip()
```

### Sales and Office

#### Sales and Office Industry

The Sales and Office industry agrees with my hypothesis. The average median earnings are *$37,106* for females and *$44,987* for males, a difference of about *$8,000*. There are several outliers in each of the categories because certain occupations in the sales industry pay more than others, due to the years of schooling required, difficulty of the job, access to promotions, etc. The outliers are the most pronounced for males in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, and normally distributed, with unknown and possibly unequal variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}

sample3 <- jobs_gender %>%
  filter(industry_broad == c("Sales and Office")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_broad, earnings)
sample3$industry_broad[sample3$industry_broad == "Sales and Office"] <- "SC_female"

sample3.3 <- jobs_gender %>%
  filter(industry_broad == c("Sales and Office")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_broad, earnings)
sample3.3$industry_broad[sample3.3$industry_broad == "Sales and Office"] <- "SC_male"

final3 <- rbind(sample3, sample3.3)

t.test(earnings ~ industry_broad, data = final3)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Sales Industry ###

Sales_Industry <- jobs_gender %>%
  filter(industry_broad == c("Sales and Office"))
x <- data.frame(Total_Earnings = Sales_Industry$total_earnings, 
                Female_Earnings = Sales_Industry$total_earnings_female, 
                Male_Earnings = Sales_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Sales Industry") +
  coord_flip()
```

### Service

#### Service Industry

The Service industry agrees with my hypothesis. The average median earnings are *$31,988* for females and *$36,644* for males, a difference of about *$5,000*. There are several outliers in each of the categories because certain occupations in the service industry pay more than others, due to the years of schooling required, difficulty of the job, access to promotions, etc. The outliers are the most pronounced for females in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, and normally distributed, with unknown and possibly unequal variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample4 <- jobs_gender %>%
  filter(industry_broad == c("Service")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_broad, earnings)
sample4$industry_broad[sample4$industry_broad == "Service"] <- "S_female"

sample4.4 <- jobs_gender %>%
  filter(industry_broad == c("Service")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_broad, earnings)
sample4.4$industry_broad[sample4.4$industry_broad == "Service"] <- "S_male"

final4 <- rbind(sample4, sample4.4)

t.test(earnings ~ industry_broad, data = final4)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Service Industry ###

Service_Industry <- jobs_gender %>%
  filter(industry_broad == c("Service"))
x <- data.frame(Total_Earnings = Service_Industry$total_earnings, 
                Female_Earnings = Service_Industry$total_earnings_female, 
                Male_Earnings = Service_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Service Industry") +
  coord_flip()

```

### Business

#### Business, Management, and Financial Industry

The Business, Management, and Financial industry agrees with my hypothesis. The average median earnings are *$59,070* for females and *$73,717* for males, a difference of about *$15,000*. There are several outliers in each of the categories because certain occupations in the business industry pay more than others, due to the years of schooling required, difficulty of the job, access to promotions, etc. The outliers are the most pronounced for females in this industry, with outliers on both ends of the plot.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, and normally distributed, with unknown and possibly unequal variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample5 <- jobs_gender %>%
  filter(industry_broad == c("Management, Business, and Financial")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_broad, earnings)
sample5$industry_broad[sample5$industry_broad == "Management, Business, and Financial"] <- "MBF_female"

sample5.5 <- jobs_gender %>%
  filter(industry_broad == c("Management, Business, and Financial")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_broad, earnings)
sample5.5$industry_broad[sample5.5$industry_broad == "Management, Business, and Financial"] <- "MBF_male"

final5 <- rbind(sample5, sample5.5)

t.test(earnings ~ industry_broad, data = final5)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Business Industry ###

Business_Industry <- jobs_gender %>%
  filter(industry_broad == c("Management, Business, and Financial"))
x <- data.frame(Total_Earnings = Business_Industry$total_earnings, 
                Female_Earnings = Business_Industry$total_earnings_female, 
                Male_Earnings = Business_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Business Industry") +
  coord_flip()
```

### Engineering

#### Computer, Engineering, and Science Industry

The Computer, Engineering, and Science industry agrees with my hypothesis. The average median earnings are *$69,427* for females and *$80,191* for males, a difference of about *$11,000*. There are several outliers in each of the categories because certain occupations in the science industry pay more than others, due to the years of schooling required, difficulty of the job, access to promotions, etc. The outliers are the most pronounced for males in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, and normally distributed, with unknown and possibly unequal variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample6 <- jobs_gender %>%
  filter(industry_broad == c("Computer, Engineering, and Science")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_broad, earnings)
sample6$industry_broad[sample6$industry_broad == "Computer, Engineering, and Science"] <- "CES_female"

sample6.6 <- jobs_gender %>%
  filter(industry_broad == c("Computer, Engineering, and Science")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_broad, earnings)
sample6.6$industry_broad[sample6.6$industry_broad == "Computer, Engineering, and Science"] <- "CES_male"

final6 <- rbind(sample6, sample6.6)

t.test(earnings ~ industry_broad, data = final6)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Engineering Industry ###

Engineering_Industry <- jobs_gender %>%
  filter(industry_broad == c("Computer, Engineering, and Science"))
x <- data.frame(Total_Earnings = Engineering_Industry$total_earnings, 
                Female_Earnings = Engineering_Industry$total_earnings_female, 
                Male_Earnings = Engineering_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Engineering Industry") +
  coord_flip()
```

### Production

#### Production, Transportation, and Material Moving Industry

The Production, Transportation, and Material Moving industry agrees with my hypothesis. The average median earnings are *$32,438* for females and *$40,769* for males, a difference of about *$8,000*. There are several outliers in each of the categories because certain occupations in the production industry pay more than others, due to the years of schooling required, difficulty of the job, access to promotions, etc. The outliers are the most pronounced for females in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, and normally distributed, with unknown and possibly unequal variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample7 <- jobs_gender %>%
  filter(industry_broad == c("Production, Transportation, and Material Moving")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_broad, earnings)
sample7$industry_broad[sample7$industry_broad == "Production, Transportation, and Material Moving"] <- "PTMM_female"

sample7.7 <- jobs_gender %>%
  filter(industry_broad == c("Production, Transportation, and Material Moving")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_broad, earnings)
sample7.7$industry_broad[sample7.7$industry_broad == "Production, Transportation, and Material Moving"] <- "PTMM_male"

final7 <- rbind(sample7, sample7.7)

t.test(earnings ~ industry_broad, data = final7)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Production Industry ###

Production_Industry <- jobs_gender %>%
  filter(industry_broad == c("Production, Transportation, and Material Moving"))
x <- data.frame(Total_Earnings = Production_Industry$total_earnings, 
                Female_Earnings = Production_Industry$total_earnings_female, 
                Male_Earnings = Production_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Production Industry") +
  coord_flip()
```

### Natural Resources

#### Natural Resources, Construction, and Maintenance Industry

The Natural Resources, Construction, and Maintenance industry agrees with my hypothesis. The average median earnings are *$38,549* for females and *$43,661* for males, a difference of about *$5,000*. There are several outliers in each of the categories because certain occupations in the construction industry pay more than others, due to the years of schooling required, difficulty of the job, access to promotions, etc. The outliers are the most pronounced for females in this industry.

I conducted a two-sample t-test for the mean of female earnings versus the mean of male earnings. The null hypothesis is that the two means are equal, and the alternative is that they are not. The assumptions of the test are that both samples are random, independent, and normally distributed, with unknown and possibly unequal variances. The p-value of the test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.


```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample8 <- jobs_gender %>%
  filter(industry_broad == c("Natural Resources, Construction, and Maintenance")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_broad, earnings)
sample8$industry_broad[sample8$industry_broad == "Natural Resources, Construction, and Maintenance"] <- "NCM_female"

sample8.8 <- jobs_gender %>%
  filter(industry_broad == c("Natural Resources, Construction, and Maintenance")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_broad, earnings)
sample8.8$industry_broad[sample8.8$industry_broad == "Natural Resources, Construction, and Maintenance"] <- "NCM_male"

final8 <- rbind(sample8, sample8.8)

t.test(earnings ~ industry_broad, data = final8)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Construction Industry ###

Construction_Industry <- jobs_gender %>%
  filter(industry_broad == c("Natural Resources, Construction, and Maintenance"))
x <- data.frame(Total_Earnings = Construction_Industry$total_earnings, 
                Female_Earnings = Construction_Industry$total_earnings_female,
                Male_Earnings = Construction_Industry$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Median Earnings for Females and Males in Construction Industry") +
  coord_flip()
```


Earnings Per Specific Industry {data-navmenu="Industry" data-orientation=columns}
======================================================================================================

Column {.sidebar}
------------------------------------------------------------

#### Earnings Per Specific Industry Overview

Next, I dove deeper into the specific industry categories. My hypothesis is still that males will make more on average than females in each of the specific industries. A specific industry would be an outlier if it strays from the overall pattern, that is, if females make more than males on average. I created boxplots for each specific industry to provide quick visuals of the differences in male and female median earnings on average. It is important to note that I did not remove any of the outliers from the dataset. Each of the outliers is important for providing an overall view of the industry and the discrepancies in median earnings between males and females. The following plots analyze the median earnings for males, females, and total for each specific industry.
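
Because the same test is repeated for every specific industry, the comparison can be sketched as one loop that collects the means and p-values into a single table. The data frame `toy`, the labels `Cat A` and `Cat B`, the earnings values, and the helper `run_test()` are all hypothetical and introduced here for illustration only.

```r
# Hypothetical toy data; category labels and earnings are made up.
toy <- data.frame(
  industry_specific = rep(c("Cat A", "Cat B"), each = 4),
  total_earnings_female = c(30000, 31500, 32000, 33500, 41000, 47000, 52000, 60000),
  total_earnings_male   = c(35000, 36500, 37500, 38000, 43000, 46000, 55000, 61000)
)

# Run the Welch two-sample t-test for one category and return a one-row summary.
run_test <- function(df) {
  res <- t.test(df$total_earnings_female, df$total_earnings_male)
  data.frame(
    female_mean = mean(df$total_earnings_female),
    male_mean   = mean(df$total_earnings_male),
    p_value     = res$p.value
  )
}

# Split by category, test each piece, and stack the results.
results <- do.call(rbind, lapply(split(toy, toy$industry_specific), run_test))
results$significant <- results$p_value < 0.05
print(results)
```

With these made-up values the gap in `Cat A` is flagged as significant and the gap in `Cat B` is not, since the spread of earnings within `Cat B` is much larger than the difference between its means.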

Column {.tabset .tabset-fade}
-------------------------------------------------------------------------------

### Healthcare Support 

#### Healthcare Support Specific Industry

The Healthcare Support specific industry agrees with my hypothesis. The average median earnings are *$31,956* for females and *$36,107* for males, a difference of about *$4,000*. There are several outliers in each of the categories. There are more outliers for females in this industry, but the outlier for males has the highest earnings. The p-value of the two-sample t-test is 0.0096, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.
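
For readers who want to see which points a boxplot treats as outliers, here is a minimal sketch using base R's `boxplot.stats()`, which applies the common 1.5 × IQR rule. The `earnings` vector is made up. ggplot2's `geom_boxplot()` uses the same 1.5 × IQR convention by default, although its quartile computation may differ slightly from the hinges used here.

```r
# Hypothetical earnings for one category (made-up values).
earnings <- c(28000, 29500, 30000, 31000, 31500, 32000, 33000, 52000)

# boxplot.stats() flags points beyond 1.5 * IQR from the hinges (coef = 1.5).
stats <- boxplot.stats(earnings)
print(stats$out)  # values that would be drawn as individual outlier points

# Keeping the outliers versus dropping them shifts the mean.
mean_with_outliers    <- mean(earnings)
mean_without_outliers <- mean(earnings[!earnings %in% stats$out])
print(c(with = mean_with_outliers, without = mean_without_outliers))
```

The averages reported in the text keep the outliers, which corresponds to `mean_with_outliers` in this sketch.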

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample9 <- jobs_gender %>%
  filter(industry_specific == c("Healthcare Support")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample9$industry_specific[sample9$industry_specific == "Healthcare Support"] <- "HS_female"

sample9.9 <- jobs_gender %>%
  filter(industry_specific == c("Healthcare Support")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample9.9$industry_specific[sample9.9$industry_specific == "Healthcare Support"] <- "HS_male"

final9 <- rbind(sample9, sample9.9)

t.test(earnings ~ industry_specific, data = final9)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Healthcare Support ###

Healthcare_support <- jobs_gender %>%
  filter(industry_specific == c("Healthcare Support"))
x <- data.frame(Total_Earnings = Healthcare_support$total_earnings, 
                Female_Earnings = Healthcare_support$total_earnings_female,
                Male_Earnings = Healthcare_support$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Healthcare Support") +
  coord_flip()
```

### Personal Care

#### Personal Care and Service Specific Industry

The Personal Care and Service specific industry agrees with my hypothesis. The average median earnings are *$28,080* for females and *$31,952* for males, a difference of about *$4,000*. There are several outliers in each of the categories. There are more outliers for males in this industry, and the outlier for females has the highest earnings, which is surprising. The p-value of the two-sample t-test is 0.0092, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample10 <- jobs_gender %>%
  filter(industry_specific == c("Personal Care and Service")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample10$industry_specific[sample10$industry_specific == "Personal Care and Service"] <- "PS_female"

sample10.10 <- jobs_gender %>%
  filter(industry_specific == c("Personal Care and Service")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample10.10$industry_specific[sample10.10$industry_specific == "Personal Care and Service"] <- "PS_male"

final10 <- rbind(sample10, sample10.10)

t.test(earnings ~ industry_specific, data = final10)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Personal Care ###

Personal_care <- jobs_gender %>%
  filter(industry_specific == c("Personal Care and Service"))
x <- data.frame(Total_Earnings = Personal_care$total_earnings, 
                Female_Earnings = Personal_care$total_earnings_female,
                Male_Earnings = Personal_care$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Personal Care and Service") +
  coord_flip()
```

### Healthcare Practitioners

#### Healthcare Practitioners and Technical Specific Industry

The Healthcare Practitioners and Technical specific industry agrees with my hypothesis. The average median earnings are *$68,051* for females and *$81,487* for males, a difference of about *$13,000*. There are several outliers in each of the categories. There are more outliers for females in this industry, but an outlier for males has the highest earnings. The p-value of the two-sample t-test is 0.0049, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample11 <- jobs_gender %>%
  filter(industry_specific == c("Healthcare Practitioners and Technical")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample11$industry_specific[sample11$industry_specific == "Healthcare Practitioners and Technical"] <- "HT_female"

sample11.11 <- jobs_gender %>%
  filter(industry_specific == c("Healthcare Practitioners and Technical")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample11.11$industry_specific[sample11.11$industry_specific == "Healthcare Practitioners and Technical"] <- "HT_male"

final11 <- rbind(sample11, sample11.11)

t.test(earnings ~ industry_specific, data = final11)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Healthcare Practitioners ###
library(ggplot2)
library(dplyr)
Healthcare_pract <- jobs_gender %>%
  filter(industry_specific == c("Healthcare Practitioners and Technical"))
x <- data.frame(Total_Earnings = Healthcare_pract$total_earnings, 
                Female_Earnings = Healthcare_pract$total_earnings_female,
                Male_Earnings = Healthcare_pract$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Healthcare Practitioners and Technical") +
  coord_flip()
```

### Office And Admin

#### Office and Administrative Support Specific Industry

The Office and Administrative Support specific industry agrees with my hypothesis. The average median earnings are *$35,783* for females and *$41,762* for males, a difference of about *$6,000*. There are several outliers in each of the categories. There are more outliers for males in this industry, and an outlier for males has the highest earnings. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample12 <- jobs_gender %>%
  filter(industry_specific == c("Office and Administrative Support")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample12$industry_specific[sample12$industry_specific == "Office and Administrative Support"] <- "OA_female"

sample12.12 <- jobs_gender %>%
  filter(industry_specific == c("Office and Administrative Support")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample12.12$industry_specific[sample12.12$industry_specific == "Office and Administrative Support"] <- "OA_male"

final12 <- rbind(sample12, sample12.12)

t.test(earnings ~ industry_specific, data = final12)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Office and Admin ###

Office_Admin <- jobs_gender %>%
  filter(industry_specific == c("Office and Administrative Support"))
x <- data.frame(Total_Earnings = Office_Admin$total_earnings, 
                Female_Earnings = Office_Admin$total_earnings_female,
                Male_Earnings = Office_Admin$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Office and Administrative Support") +
  coord_flip()
```

### Education

#### Education, Training, and Library Specific Industry

The Education, Training, and Library specific industry agrees with my hypothesis. The average median earnings for females is *$42,890* and the average median earnings for males is *$49,460*, which is about a *$6,500* difference. There are outliers for only males. An outlier for males has the highest overall earnings. The p-value of the two-sample t-test is 0.0138, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.
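The outliers mentioned above are the points a box plot draws beyond its whiskers, usually values more than 1.5 times the interquartile range past the quartiles. The sketch below applies that rule in base R. The function name `find_outliers` and the numbers are invented for illustration and are not taken from `jobs_gender`.

```r
# Flag values beyond coef * IQR from the quartiles (the usual box plot rule).
find_outliers <- function(x, coef = 1.5) {
  x <- x[!is.na(x)]
  q <- quantile(x, c(0.25, 0.75))
  spread <- q[2] - q[1]  # interquartile range
  x[x < q[1] - coef * spread | x > q[2] + coef * spread]
}

# Invented earnings, for illustration only.
male_earnings <- c(38000, 42000, 45000, 47000, 50000, 52000, 55000, 110000)
find_outliers(male_earnings)
```

Here only the largest value falls outside the upper fence, so it is the single point that would be drawn as an outlier.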

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample13 <- jobs_gender %>%
  filter(industry_specific == c("Education, Training, and Library")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample13$industry_specific[sample13$industry_specific == "Education, Training, and Library"] <- "ETL_female"

sample13.13 <- jobs_gender %>%
  filter(industry_specific == c("Education, Training, and Library")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample13.13$industry_specific[sample13.13$industry_specific == "Education, Training, and Library"] <- "ETL_male"

final13 <- rbind(sample13, sample13.13)

t.test(earnings ~ industry_specific, data = final13)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Education ###

Education_Training <- jobs_gender %>%
  filter(industry_specific == c("Education, Training, and Library"))
x <- data.frame(Total_Earnings = Education_Training$total_earnings, 
                Female_Earnings = Education_Training$total_earnings_female,
                Male_Earnings = Education_Training$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Education, Training, and Library") +
  coord_flip()
```

### Community Service

#### Community and Social Service Specific Industry

The Community and Social Service specific industry agrees with my hypothesis. The average median earnings for females is *$40,613* and the average median earnings for males is *$44,515*, which is about a *$4,000* difference. There are no outliers for this specific industry. The maximum for males has the highest median earnings. The p-value of the two-sample t-test is 0.0002, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample14 <- jobs_gender %>%
  filter(industry_specific == c("Community and Social Service")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample14$industry_specific[sample14$industry_specific == "Community and Social Service"] <- "CS_female"

sample14.14 <- jobs_gender %>%
  filter(industry_specific == c("Community and Social Service")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample14.14$industry_specific[sample14.14$industry_specific == "Community and Social Service"] <- "CS_male"

final14 <- rbind(sample14, sample14.14)

t.test(earnings ~ industry_specific, data = final14)
```

```{r, echo = FALSE, message = FALSE, warning=FALSE}
### Box Plot for Females and Males In Community Service ###

Comm_service <- jobs_gender %>%
  filter(industry_specific == c("Community and Social Service"))
x <- data.frame(Total_Earnings = Comm_service$total_earnings, 
                Female_Earnings = Comm_service$total_earnings_female,
                Male_Earnings = Comm_service$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Community and Social Service") +
  coord_flip()
```

### Business

#### Business and Financial Operations Specific Industry

The Business and Financial Operations specific industry agrees with my hypothesis. The average median earnings for females is *$54,129* and the average median earnings for males is *$68,540*, which is about a *$14,000* difference. There are no outliers for males or females individually, but there are outliers for the total earnings. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.
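The averages and the dollar difference quoted above are simple arithmetic on the two earnings columns, and they could be tabulated for several industries in one pass. The sketch below is an assumption about how that might look in base R. The rows in `toy_jobs` are invented, and only the column names follow the analysis.

```r
# Invented rows, for illustration only; column names follow the analysis.
toy_jobs <- data.frame(
  industry_specific = c("A", "A", "A", "B", "B", "B"),
  total_earnings_female = c(50000, 54000, 58000, 20000, 21000, 22000),
  total_earnings_male = c(62000, 68000, 74000, 23000, 24000, 25000)
)

# Mean of the occupation-level medians per industry, by sex.
gap <- aggregate(
  cbind(total_earnings_female, total_earnings_male) ~ industry_specific,
  data = toy_jobs,
  FUN = mean
)

# Dollar difference between the male and female averages.
gap$difference <- gap$total_earnings_male - gap$total_earnings_female
gap
```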

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample15 <- jobs_gender %>%
  filter(industry_specific == c("Business and Financial Operations")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample15$industry_specific[sample15$industry_specific == "Business and Financial Operations"] <- "BF_female"

sample15.15 <- jobs_gender %>%
  filter(industry_specific == c("Business and Financial Operations")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample15.15$industry_specific[sample15.15$industry_specific == "Business and Financial Operations"] <- "BF_male"

final15 <- rbind(sample15, sample15.15)

t.test(earnings ~ industry_specific, data = final15)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Business ###

Business <- jobs_gender %>%
  filter(industry_specific == c("Business and Financial Operations"))
x <- data.frame(Total_Earnings = Business$total_earnings, 
                Female_Earnings = Business$total_earnings_female,
                Male_Earnings = Business$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Business and Financial Operations") +
  coord_flip()
```

### Legal

#### Legal Specific Industry

The Legal specific industry does not fully agree with my hypothesis. The average median earnings for females is *$66,195* and the average median earnings for males is *$83,839*, which is about a *$17,500* difference. There are no outliers in this industry and the maximum for males has the highest earnings. The p-value of the two-sample t-test is 0.0661, which is larger than the assumed alpha value of 0.05. Therefore, I can't reject the null hypothesis, and the difference in the means is not statistically significant. Because this is the largest difference I have observed, I suspected that an assumption of the t-test might have been violated. I log-transformed the earnings variable and ran the t-test again, which resulted in a p-value of 0.0882, still larger than the alpha of 0.05. I still think an assumption of the t-test may have been violated, because the distributions of earnings are very different for males and females, which could have affected the results. So, males make more than females on average in this industry, but the difference in the means doesn't appear to be significant.
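One way to probe the assumption concern raised above is to look at each group's distribution before and after a log transformation. The sketch below is only an illustration under stated assumptions: the numbers are made up and right-skewed, not the Legal data, and `shapiro.test()` (base R's Shapiro-Wilk normality test) is my choice of check rather than something used in the analysis.

```r
# Made-up, right-skewed earnings for illustration only (not the Legal data).
female <- c(41000, 45000, 48000, 52000, 60000, 95000)
male   <- c(47000, 52000, 58000, 66000, 90000, 190000)

# Shapiro-Wilk test per group: a small p-value suggests non-normality.
shapiro_raw <- c(female = shapiro.test(female)$p.value,
                 male   = shapiro.test(male)$p.value)
shapiro_log <- c(female = shapiro.test(log(female))$p.value,
                 male   = shapiro.test(log(male))$p.value)

# Welch t-test on the raw and log scales.
p_raw <- t.test(male, female)$p.value
p_log <- t.test(log(male), log(female))$p.value

round(rbind(raw = shapiro_raw, log = shapiro_log), 3)
c(raw = p_raw, log = p_log)
```

With groups this small, neither test has much power, so a non-significant result is weak evidence that there is no difference.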

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample16 <- jobs_gender %>%
  filter(industry_specific == c("Legal")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample16$industry_specific[sample16$industry_specific == "Legal"] <- "L_female"

sample16.16 <- jobs_gender %>%
  filter(industry_specific == c("Legal")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample16.16$industry_specific[sample16.16$industry_specific == "Legal"] <- "L_male"

final16 <- rbind(sample16, sample16.16)

t.test(earnings ~ industry_specific, data = final16)

t.test(log(earnings) ~ industry_specific, data = final16)

```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Legal ###

Legal <- jobs_gender %>%
  filter(industry_specific == c("Legal"))
x <- data.frame(Total_Earnings = Legal$total_earnings, 
                Female_Earnings = Legal$total_earnings_female,
                Male_Earnings = Legal$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Legal") +
  coord_flip()
```

### Food Preparation

#### Food Preparation and Serving Related Specific Industry

The Food Preparation and Serving Related specific industry agrees with my hypothesis. The average median earnings for females is *$20,445* and the average median earnings for males is *$23,409*, which is about a *$3,000* difference. There is an outlier for females in this industry. There are no outliers for males, but the maximum for males has the highest earnings. The p-value of the two-sample t-test is 0.0002, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample17 <- jobs_gender %>%
  filter(industry_specific == c("Food Preparation and Serving Related")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample17$industry_specific[sample17$industry_specific == "Food Preparation and Serving Related"] <- "FS_female"

sample17.17 <- jobs_gender %>%
  filter(industry_specific == c("Food Preparation and Serving Related")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample17.17$industry_specific[sample17.17$industry_specific == "Food Preparation and Serving Related"] <- "FS_male"

final17 <- rbind(sample17, sample17.17)

t.test(earnings ~ industry_specific, data = final17)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Food ###

Food <- jobs_gender %>%
  filter(industry_specific == c("Food Preparation and Serving Related"))
x <- data.frame(Total_Earnings = Food$total_earnings, 
                Female_Earnings = Food$total_earnings_female,
                Male_Earnings = Food$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Food Preparation and Serving Related") +
  coord_flip()
```

### Social Science

#### Life, Physical, and Social Science Specific Industry

The Life, Physical, and Social Science specific industry agrees with my hypothesis. The average median earnings for females is *$61,414* and the average median earnings for males is *$72,536*, which is about an *$11,000* difference. There are outliers for only females in this industry. There are no outliers for males and the maximum for females has the highest earnings, which goes against the norm. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample18 <- jobs_gender %>%
  filter(industry_specific == c("Life, Physical, and Social Science")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample18$industry_specific[sample18$industry_specific == "Life, Physical, and Social Science"] <- "LPS_female"

sample18.18 <- jobs_gender %>%
  filter(industry_specific == c("Life, Physical, and Social Science")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample18.18$industry_specific[sample18.18$industry_specific == "Life, Physical, and Social Science"] <- "LPS_male"

final18 <- rbind(sample18, sample18.18)

t.test(earnings ~ industry_specific, data = final18)
```

```{r, echo = FALSE, message = FALSE, warning=FALSE}
### Box Plot for Females and Males In Social Science ###

Social <- jobs_gender %>%
  filter(industry_specific == c("Life, Physical, and Social Science"))
x <- data.frame(Total_Earnings = Social$total_earnings, 
                Female_Earnings = Social$total_earnings_female,
                Male_Earnings = Social$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Life, Physical, and Social Science") +
  coord_flip()
```

### Arts and Media

#### Arts, Design, Entertainment, Sports, and Media Specific Industry

The Arts, Design, Entertainment, Sports, and Media specific industry agrees with my hypothesis. The average median earnings for females is *$45,286* and the average median earnings for males is *$53,642*, which is about an *$8,000* difference. There are no outliers in this industry and the maximum for males has the highest earnings. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample19 <- jobs_gender %>%
  filter(industry_specific == c("Arts, Design, Entertainment, Sports, and Media")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample19$industry_specific[sample19$industry_specific == "Arts, Design, Entertainment, Sports, and Media"] <- "ADESM_female"

sample19.19 <- jobs_gender %>%
  filter(industry_specific == c("Arts, Design, Entertainment, Sports, and Media")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample19.19$industry_specific[sample19.19$industry_specific == "Arts, Design, Entertainment, Sports, and Media"] <- "ADESM_male"

final19 <- rbind(sample19, sample19.19)

t.test(earnings ~ industry_specific, data = final19)
```


```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Arts ###

Arts_media <- jobs_gender %>%
  filter(industry_specific == c("Arts, Design, Entertainment, Sports, and Media"))
x <- data.frame(Total_Earnings = Arts_media$total_earnings, 
                Female_Earnings = Arts_media$total_earnings_female,
                Male_Earnings = Arts_media$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Arts, Design, Entertainment, Sports, and Media") +
  coord_flip()
```

### Sales

#### Sales and Related Specific Industry

The Sales and Related specific industry agrees with my hypothesis. The average median earnings for females is *$40,928* and the average median earnings for males is *$54,302*, which is about a *$13,000* difference. There are outliers for both males and females in this industry. The outlier for males has the highest median earnings overall. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample20 <- jobs_gender %>%
  filter(industry_specific == c("Sales and Related")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample20$industry_specific[sample20$industry_specific == "Sales and Related"] <- "SR_female"

sample20.20 <- jobs_gender %>%
  filter(industry_specific == c("Sales and Related")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample20.20$industry_specific[sample20.20$industry_specific == "Sales and Related"] <- "SR_male"

final20 <- rbind(sample20, sample20.20)

t.test(earnings ~ industry_specific, data = final20)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Sales ###

Sales <- jobs_gender %>%
  filter(industry_specific == c("Sales and Related"))
x <- data.frame(Total_Earnings = Sales$total_earnings, 
                Female_Earnings = Sales$total_earnings_female,
                Male_Earnings = Sales$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Sales and Related") +
  coord_flip()
```

### Management

#### Management Specific Industry

The Management specific industry agrees with my hypothesis. The average median earnings for females is *$63,683* and the average median earnings for males is *$78,549*, which is about a *$15,000* difference. There are outliers for both males and females in this industry. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample21 <- jobs_gender %>%
  filter(industry_specific == c("Management")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample21$industry_specific[sample21$industry_specific == "Management"] <- "M_female"

sample21.21 <- jobs_gender %>%
  filter(industry_specific == c("Management")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample21.21$industry_specific[sample21.21$industry_specific == "Management"] <- "M_male"

final21 <- rbind(sample21, sample21.21)

t.test(earnings ~ industry_specific, data = final21)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Management ###

Management <- jobs_gender %>%
  filter(industry_specific == c("Management"))
x <- data.frame(Total_Earnings = Management$total_earnings, 
                Female_Earnings = Management$total_earnings_female,
                Male_Earnings = Management$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Management") +
  coord_flip()
```

### Maintenance

#### Building and Grounds Cleaning and Maintenance Specific Industry

The Building and Grounds Cleaning and Maintenance specific industry agrees with my hypothesis. The average median earnings for females is *$26,581* and the average median earnings for males is *$32,787*, which is about a *$6,000* difference. There are no outliers for either males or females in this industry. The maximum for males has the highest median earnings overall. The p-value of the two-sample t-test is 0.0016, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample22 <- jobs_gender %>%
  filter(industry_specific == c("Building and Grounds Cleaning and Maintenance")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample22$industry_specific[sample22$industry_specific == "Building and Grounds Cleaning and Maintenance"] <- "BG_female"

sample22.22 <- jobs_gender %>%
  filter(industry_specific == c("Building and Grounds Cleaning and Maintenance")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample22.22$industry_specific[sample22.22$industry_specific == "Building and Grounds Cleaning and Maintenance"] <- "BG_male"

final22 <- rbind(sample22, sample22.22)

t.test(earnings ~ industry_specific, data = final22)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Maintenance ###

Maintenance <- jobs_gender %>%
  filter(industry_specific == c("Building and Grounds Cleaning and Maintenance"))
x <- data.frame(Total_Earnings = Maintenance$total_earnings, 
                Female_Earnings = Maintenance$total_earnings_female,
                Male_Earnings = Maintenance$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Building and Grounds Cleaning and Maintenance") +
  coord_flip()
```

### Production

#### Production Specific Industry

The Production specific industry agrees with my hypothesis. The average median earnings for females is *$30,373* and the average median earnings for males is *$38,381*, which is about an *$8,000* difference. There are outliers for both males and females in this industry. The outlier for males has the highest median earnings overall. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample23 <- jobs_gender %>%
  filter(industry_specific == c("Production")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample23$industry_specific[sample23$industry_specific == "Production"] <- "P_female"

sample23.23 <- jobs_gender %>%
  filter(industry_specific == c("Production")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample23.23$industry_specific[sample23.23$industry_specific == "Production"] <- "P_male"

final23 <- rbind(sample23, sample23.23)

t.test(earnings ~ industry_specific, data = final23)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Production ###

Production <- jobs_gender %>%
  filter(industry_specific == c("Production"))
x <- data.frame(Total_Earnings = Production$total_earnings, 
                Female_Earnings = Production$total_earnings_female,
                Male_Earnings = Production$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Production") +
  coord_flip()
```

### Mathematics

#### Computer and Mathematical Specific Industry

The Computer and Mathematical specific industry agrees with my hypothesis. The average median earnings for females is *$73,384* and the average median earnings for males is *$85,772*, which is about a *$12,000* difference. There is an outlier for only males in this industry. The outlier for males has the highest median earnings overall. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample24 <- jobs_gender %>%
  filter(industry_specific == c("Computer and mathematical")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample24$industry_specific[sample24$industry_specific == "Computer and mathematical"] <- "CM_female"

sample24.24 <- jobs_gender %>%
  filter(industry_specific == c("Computer and mathematical")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample24.24$industry_specific[sample24.24$industry_specific == "Computer and mathematical"] <- "CM_male"

final24 <- rbind(sample24, sample24.24)

t.test(earnings ~ industry_specific, data = final24)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Math ###

Comp_math <- jobs_gender %>%
  filter(industry_specific == c("Computer and mathematical"))
x <- data.frame(Total_Earnings = Comp_math$total_earnings, 
                Female_Earnings = Comp_math$total_earnings_female,
                Male_Earnings = Comp_math$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Computer and Mathematical") +
  coord_flip()
```

### Protective

#### Protective Service Specific Industry

The Protective Service specific industry agrees with my hypothesis. The average median earnings for females is *$46,847* and the average median earnings for males is *$53,431*, which is about a *$6,500* difference. There are no outliers in this industry and the maximum for males has the highest median earnings overall. The p-value of the two-sample t-test is 0.0118, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample25 <- jobs_gender %>%
  filter(industry_specific == c("Protective Service")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample25$industry_specific[sample25$industry_specific == "Protective Service"] <- "P_female"

sample25.25 <- jobs_gender %>%
  filter(industry_specific == c("Protective Service")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample25.25$industry_specific[sample25.25$industry_specific == "Protective Service"] <- "P_male"

final25 <- rbind(sample25, sample25.25)

t.test(earnings ~ industry_specific, data = final25)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Protective ###

Protective <- jobs_gender %>%
  filter(industry_specific == c("Protective Service"))
x <- data.frame(Total_Earnings = Protective$total_earnings, 
                Female_Earnings = Protective$total_earnings_female,
                Male_Earnings = Protective$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Protective Service") +
  coord_flip()
```

### Material Moving

#### Material Moving Specific Industry

The Material Moving specific industry does not fully agree with my hypothesis. The average median earnings for females is *$30,022* and the average median earnings for males is *$37,364*, which is about a *$7,000* difference. There are outliers for both males and females in this industry. The outlier for females has the highest median earnings overall, which goes against the norm. The p-value of the two-sample t-test is 0.165, which is larger than the assumed alpha value of 0.05. Therefore, I can't reject the null hypothesis, and the difference in the means is not statistically significant on the original scale. To check whether an assumption was violated, I log-transformed the earnings variable and ran the t-test again, which resulted in a p-value of 0.0115. That p-value is less than the alpha value, so a violated assumption may be what made the difference look insignificant in the first test. So, males make more than females on average in this industry, but the two t-tests disagree, so I would need to conduct further tests.
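As one possible form the further tests could take, the sketch below runs two alternatives on invented numbers: a Wilcoxon rank-sum test, which does not assume normality, and a paired t-test, which treats each occupation's female and male medians as a matched pair. Whether pairing is appropriate for `jobs_gender` is an assumption on my part, since it requires that each row describes one occupation with both values present.

```r
# Invented per-occupation medians, for illustration only.
female <- c(24000, 27000, 29000, 31000, 33000, 62000)
male   <- c(30000, 33000, 36000, 37000, 41000, 58000)

# Rank-based alternative to the two-sample t-test (no normality assumption).
# exact = FALSE uses the normal approximation, which tolerates tied values.
rank_sum <- wilcox.test(male, female, exact = FALSE)

# Paired t-test on the within-occupation differences (male minus female).
paired <- t.test(male, female, paired = TRUE)

rank_sum$p.value
paired$estimate   # mean of the within-pair differences
paired$p.value
```

If the two approaches agree with the log-scale t-test, that would support treating the difference as real rather than as a side effect of the skewed data.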

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample26 <- jobs_gender %>%
  filter(industry_specific == c("Material Moving")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample26$industry_specific[sample26$industry_specific == "Material Moving"] <- "MM_female"

sample26.26 <- jobs_gender %>%
  filter(industry_specific == c("Material Moving")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample26.26$industry_specific[sample26.26$industry_specific == "Material Moving"] <- "MM_male"

final26 <- rbind(sample26, sample26.26)

t.test(earnings ~ industry_specific, data = final26)

t.test(log(earnings) ~ industry_specific, data = final26)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Material Moving ###

Material <- jobs_gender %>%
  filter(industry_specific == c("Material Moving"))
x <- data.frame(Total_Earnings = Material$total_earnings, 
                Female_Earnings = Material$total_earnings_female,
                Male_Earnings = Material$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Material Moving") +
  coord_flip()
```

### Farming

#### Farming, Fishing, and Forestry Specific Industry

The Farming, Fishing, and Forestry specific industry agrees with my hypothesis. The average median earnings for females is *$29,189* and the average median earnings for males is *$34,020*, which is about a *$5,000* difference. There are outliers for only females in this industry. The outlier for females has the highest median earnings overall, which goes against the norm. The p-value of the two-sample t-test is 0.0339, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample27 <- jobs_gender %>%
  filter(industry_specific == c("Farming, Fishing, and Forestry")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample27$industry_specific[sample27$industry_specific == "Farming, Fishing, and Forestry"] <- "FFF_female"

sample27.27 <- jobs_gender %>%
  filter(industry_specific == c("Farming, Fishing, and Forestry")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample27.27$industry_specific[sample27.27$industry_specific == "Farming, Fishing, and Forestry"] <- "FFF_male"

final27 <- rbind(sample27, sample27.27)

t.test(earnings ~ industry_specific, data = final27)
```
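
The comparison described above can be sketched on a small made-up data set. The numbers below are hypothetical and are not taken from `jobs_gender`; only the column names `total_earnings_female` and `total_earnings_male` mirror the real data. The sketch stacks the two earnings columns into one long data frame, runs the default (Welch) two-sample t-test, and compares the p-value with the assumed alpha of 0.05.

```r
# Hypothetical median earnings per occupation (illustration only).
toy <- data.frame(
  occupation            = c("A", "B", "C", "D", "E", "F"),
  total_earnings_female = c(27000, 29500, 31000, 28000, 30500, 29000),
  total_earnings_male   = c(33000, 34500, 36000, 32500, 35000, 33500)
)

# Stack the two columns: one row per (occupation, group).
long <- data.frame(
  group    = rep(c("female", "male"), each = nrow(toy)),
  earnings = c(toy$total_earnings_female, toy$total_earnings_male)
)

# With a formula, t.test() runs Welch's two-sample test by default
# (var.equal = FALSE), so equal variances are not assumed.
result <- t.test(earnings ~ group, data = long)

alpha       <- 0.05
mean_gap    <- mean(toy$total_earnings_male) - mean(toy$total_earnings_female)
significant <- result$p.value < alpha

mean_gap
significant
```

Stacking the columns once, instead of building two filtered copies and relabeling them, gives `t.test()` the same two groups with less repeated code.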

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Farming ###

Farming <- jobs_gender %>%
  filter(industry_specific == c("Farming, Fishing, and Forestry"))
x <- data.frame(Total_Earnings = Farming$total_earnings, 
                Female_Earnings = Farming$total_earnings_female,
                Male_Earnings = Farming$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Farming, Fishing, and Forestry") +
  coord_flip()
```
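
The outliers mentioned above are the points a box plot draws beyond its whiskers. As a rough sketch of that rule, assuming the usual 1.5 × IQR convention and using hypothetical earnings values rather than the real data, the flagged points can be found by hand:

```r
# Hypothetical median earnings for eight occupations (illustration only).
earnings <- c(24000, 26000, 27500, 29000, 30000, 31500, 33000, 52000)

# First and third quartiles, then the interquartile range.
q   <- quantile(earnings, c(0.25, 0.75))
iqr <- unname(q[2] - q[1])

# Values more than 1.5 * IQR outside the quartiles are drawn as outliers.
lower <- unname(q[1]) - 1.5 * iqr
upper <- unname(q[2]) + 1.5 * iqr

outliers <- earnings[earnings < lower | earnings > upper]
outliers
```

Here only the largest value falls outside the fences, which is the situation described for the female group in this industry.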

### Engineering

#### Architecture and Engineering Specific Industry

The Architecture and Engineering specific industry agrees with my hypothesis. The average median earnings for females is *$74,873* and the average median earnings for males is *$84,004*, which is about a *$9,000* difference. There are outliers for both males and females in this industry. The outlier for males has the highest median earnings overall. The p-value of the two-sample t-test is 0.0022, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample28 <- jobs_gender %>%
  filter(industry_specific == c("Architecture and Engineering")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample28$industry_specific[sample28$industry_specific == "Architecture and Engineering"] <- "AE_female"

sample28.28 <- jobs_gender %>%
  filter(industry_specific == c("Architecture and Engineering")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample28.28$industry_specific[sample28.28$industry_specific == "Architecture and Engineering"] <- "AE_male"

final28 <- rbind(sample28, sample28.28)

t.test(earnings ~ industry_specific, data = final28)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Engineering ###

Engineering <- jobs_gender %>%
  filter(industry_specific == c("Architecture and Engineering"))
x <- data.frame(Total_Earnings = Engineering$total_earnings, 
                Female_Earnings = Engineering$total_earnings_female,
                Male_Earnings = Engineering$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Architecture and Engineering") +
  coord_flip()
```

### Transportation

#### Transportation Specific Industry

The Transportation specific industry agrees with my hypothesis. The average median earnings for females is *$40,589* and the average median earnings for males is *$52,899*, which is about a *$12,000* difference. There are no outliers in this industry. The maximum for males has the highest median earnings overall. The p-value of the two-sample t-test is 0.0005, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample29 <- jobs_gender %>%
  filter(industry_specific == c("Transportation")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample29$industry_specific[sample29$industry_specific == "Transportation"] <- "T_female"

sample29.29 <- jobs_gender %>%
  filter(industry_specific == c("Transportation")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample29.29$industry_specific[sample29.29$industry_specific == "Transportation"] <- "T_male"

final29 <- rbind(sample29, sample29.29)

t.test(earnings ~ industry_specific, data = final29)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Transportation ###

Transportation <- jobs_gender %>%
  filter(industry_specific == c("Transportation"))
x <- data.frame(Total_Earnings = Transportation$total_earnings, 
                Female_Earnings = Transportation$total_earnings_female,
                Male_Earnings = Transportation$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Transportation") +
  coord_flip()
```

### Installation

#### Installation, Maintenance, and Repair Specific Industry

The Installation, Maintenance, and Repair specific industry agrees with my hypothesis. The average median earnings for females is *$39,102* and the average median earnings for males is *$45,959*, which is about a *$7,000* difference. There are outliers for both males and females in this industry. The outlier for females has the highest median earnings overall, which goes against the norm. The p-value of the two-sample t-test is <0.0001, which is smaller than the assumed alpha value of 0.05. Therefore, the difference in the means is significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample30 <- jobs_gender %>%
  filter(industry_specific == c("Installation, Maintenance, and Repair")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample30$industry_specific[sample30$industry_specific == "Installation, Maintenance, and Repair"] <- "IMP_female"

sample30.30 <- jobs_gender %>%
  filter(industry_specific == c("Installation, Maintenance, and Repair")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample30.30$industry_specific[sample30.30$industry_specific == "Installation, Maintenance, and Repair"] <- "IMP_male"

final30 <- rbind(sample30, sample30.30)

t.test(earnings ~ industry_specific, data = final30)
```

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Installation ###

Installation <- jobs_gender %>%
  filter(industry_specific == c("Installation, Maintenance, and Repair"))
x <- data.frame(Total_Earnings = Installation$total_earnings, 
                Female_Earnings = Installation$total_earnings_female,
                Male_Earnings = Installation$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Installation, Maintenance, and Repair") +
  coord_flip()
```

### Construction

#### Construction and Extraction Specific Industry

The Construction and Extraction specific industry does not fully agree with my hypothesis. The average median earnings for females is *$40,424* and the average median earnings for males is *$43,779*, which is about a *$3,000* difference. There are outliers for both males and females in this industry. The outlier for females has the highest median earnings overall, which goes against the norm. The p-value of the two-sample t-test is 0.1141, which is larger than the assumed alpha value of 0.05. Therefore, I can't reject the null hypothesis, and the difference in the means is not statistically significant. To check that this result was not driven by a violation of the test's assumptions, I log-transformed the earnings variable and ran the t-test again, which resulted in a p-value of 0.0914. That p-value is still larger than the alpha value, so the conclusion does not change: males make more than females on average in this industry, but the difference in the means is not significant.

```{r, results = "hide", echo = FALSE, message = FALSE, warning = FALSE}
sample31 <- jobs_gender %>%
  filter(industry_specific == c("Construction and Extraction")) %>%
  mutate(earnings = total_earnings_female) %>%
  select(industry_specific, earnings)
sample31$industry_specific[sample31$industry_specific == "Construction and Extraction"] <- "CE_female"

sample31.31 <- jobs_gender %>%
  filter(industry_specific == c("Construction and Extraction")) %>%
  mutate(earnings = total_earnings_male) %>%
  select(industry_specific, earnings)
sample31.31$industry_specific[sample31.31$industry_specific == "Construction and Extraction"] <- "CE_male"

final31 <- rbind(sample31, sample31.31)

t.test(earnings ~ industry_specific, data = final31)

t.test(log(earnings) ~ industry_specific, data = final31)
```
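
To illustrate the re-test on transformed earnings, here is a minimal sketch with hypothetical, right-skewed values (each group contains one very large observation). It runs the t-test on the raw scale and on the log scale, as the chunk above does, and shows how a difference on the log scale can be read as a ratio of geometric means.

```r
# Hypothetical right-skewed earnings (illustration only).
long <- data.frame(
  group    = rep(c("female", "male"), each = 6),
  earnings = c(31000, 35000, 38000, 40000, 42000, 90000,
               33000, 37000, 41000, 44000, 46000, 95000)
)

raw_test <- t.test(earnings ~ group, data = long)
log_test <- t.test(log(earnings) ~ group, data = long)

# Groups are ordered alphabetically, so estimate[1] is female and
# estimate[2] is male. A difference of log means, once exponentiated,
# is a ratio of geometric means rather than a dollar gap.
log_diff  <- unname(log_test$estimate[2] - log_test$estimate[1])
geo_ratio <- exp(log_diff)

c(raw_p = raw_test$p.value, log_p = log_test$p.value,
  male_to_female_ratio = geo_ratio)
```

The log transform pulls in the large values, so a single extreme occupation has less influence on the test than it does on the raw scale.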

```{r, echo = FALSE, message=FALSE, warning=FALSE}
### Box Plot for Females and Males In Construction ###

Construction <- jobs_gender %>%
  filter(industry_specific == c("Construction and Extraction"))
x <- data.frame(Total_Earnings = Construction$total_earnings, 
                Female_Earnings = Construction$total_earnings_female,
                Male_Earnings = Construction$total_earnings_male)
data <- melt(x)
ggplot(data, 
       aes(x = variable, 
           y = value, 
           fill = variable)) + 
  geom_boxplot() + 
  theme_bw() +
  xlab("Earnings Group") +
  scale_y_continuous(name = "Median Earnings", labels = scales::dollar) +
  ggtitle("Construction and Extraction") +
  coord_flip()
```


Age Analysis {data-navmenu="Age" data-orientation=columns}
==============================================================================

Column {.tabset .tabset-fade}
-------------------------------------------------------------------------------

### Overall Trend

#### Earnings Ratio Trend 

The third aspect of the wage gap that I wanted to look at was whether the wage gap varies across age groups. Because the wage gap is a historical problem, I wanted to start by viewing how it trended across all age groups as a whole. From the plot below, you can see that the wage gap is certainly present; however, it is trending in a positive direction.

```{r, warning = FALSE, message = FALSE, echo = FALSE}
##Overall

ER_Overall <- ggplot(data = earnings_female, aes(x = year, y = earnings_ratio)) + 
  geom_point()+
  geom_smooth(se = FALSE) +
  scale_y_continuous(name = "Earnings Ratio") +
  scale_x_continuous(name = "Year") +
  ggtitle("Overall Earnings Ratio",
          subtitle = "Trend of all Age Groups from 1979 to 2011")

ER_Overall
```
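
One way to put a number on "trending in a positive direction" is the slope of a straight-line fit. The sketch below is an assumption-laden illustration: the values are hypothetical, the ratio is assumed to be female earnings as a percentage of male earnings, and a linear model is swapped in for the smoother used in the plot.

```r
# Hypothetical earnings ratios (illustration only).
toy <- data.frame(
  year           = c(1979, 1987, 1995, 2003, 2011),
  earnings_ratio = c(62, 70, 75, 79, 82)
)

# Least-squares line: the coefficient on year is the average change
# in the ratio per year over the period.
fit            <- lm(earnings_ratio ~ year, data = toy)
slope_per_year <- unname(coef(fit)["year"])

slope_per_year
```

A positive slope means the ratio is rising, that is, the gap is narrowing on average across the period.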

### Age Group Trends

#### Earnings Ratio Per Age Group Trends

My next step was to determine whether certain age groups were impacted more than others. In the interactive plot below, you can see a few things. First, the younger age groups have a higher earnings ratio than the older age groups. Second, you can see that the earnings ratios for the groups *20-24 years* and *25-34 years* are increasing much faster than those of the other age groups. Also, several historical events took place in the years before 1980. In 1963 the **Equal Pay Act** was signed into law by President John F. Kennedy, and in 1964 Lyndon B. Johnson signed the **Civil Rights Act** into law. These monumental pieces of legislation allowed females to start engaging in occupations that were not open to them before. Additionally, they encouraged younger females to continue to pursue education and fight for higher salaries and more promotion opportunities. As for older females who were in the true midst of gender wage discrimination, these new reforms and much more helped them improve their pay status, just at a much slower rate. This plot does a great job of showing the trends for each age group.

```{r, warning = FALSE, message = FALSE, echo = FALSE}
##Age Groups

ER_AgeGroup <- 
  ggplot(data = earnings_female, 
         aes(x = year, 
             y = earnings_ratio, 
             color = age_group)) + 
  geom_point(size = 1, alpha = .8)+
  geom_smooth(size = .8, se = FALSE) +
  scale_y_continuous(name = "Earnings Ratio") +
  scale_x_continuous(name = "Year") +
  ggtitle("Earnings Ratio Per Age Group",
          subtitle = "Strength of Upward Trend from 1979 to 2011") +
  theme_stata()

ER_AgeGroup
```


Location Analysis {data-navmenu="Location" data-orientation=columns}
=================================================================================

Column {.sidebar}
-----------------------------------------------------------

#### Location Analysis Overview

The next part of my analysis is dealing with the wage gap by location. Take a deeper look into the gender pay ratio for each state by hovering over it in the map to the right!

The map to the right displays the gender pay ratio, national rank, and equal pay laws for every state. The gender pay ratio is in decimal terms and represents the amount that women make for every dollar a man makes. So, a higher gender pay ratio is better. The national rank displays how each state stacks up against the others in terms of the gender pay ratio. If the state has a high national rank, the gender pay ratio is higher and therefore better for women in that state. Finally, the equal pay law strength notes the laws the state has passed on the wage gap and equal pay. The levels are strong, moderate, and weak. An anomaly is Mississippi because it does not currently have any laws regarding equal pay in the workforce. The strength of the laws in each state was based on the Census Bureau rankings, where the data was sourced from. If a state, like California, has an Equal Pay Act and additional legislation supporting it, it was given an equal pay law score of strong. If a state, like Alabama, has an Equal Pay Act or other legislation on the topic but does not strongly enforce it, it was given an equal pay law score of weak. A state may also have an Equal Pay Act that is not very strong in its provisions, which would result in a score of weak. 
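
As a small illustration of how the two numeric fields relate, the sketch below computes a pay ratio and a rank for three made-up states. The earnings figures and state names are hypothetical; the only assumption carried over from the text is that rank 1 goes to the highest ratio.

```r
# Hypothetical median earnings for three made-up states.
toy <- data.frame(
  state  = c("State A", "State B", "State C"),
  female = c(45000, 38000, 41000),
  male   = c(50000, 50000, 50000)
)

# Ratio in decimal terms: what women make for every dollar a man makes.
toy$gender_pay_ratio <- toy$female / toy$male

# Rank 1 = highest ratio, so rank on the negated ratio.
toy$national_rank <- rank(-toy$gender_pay_ratio, ties.method = "min")

toy[order(toy$national_rank), ]
```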

Column {data-height=600}
-----------------------------------------------------------

### Gender Pay Ratio Map

```{r, message = FALSE, warning = FALSE, echo = FALSE}
##Map

# Build the hover text; plotly renders '<br>' as a line break in hover labels.
gpratio_map$hover <- with(gpratio_map, paste(state, '<br>',
  "Gender Pay Ratio:", Gender.Pay.Ratio, "<br>",
  "National Rank:", National.Rank, "<br>",
  "Equal Pay Laws:", Equal.Pay.Laws, "<br>"))

l <- list(color = toRGB("white"), width = 2)
g <- list(scope = 'usa', projection = list(type = 'albers usa'), showlakes = TRUE, lakecolor = toRGB('white'))

fig <- plot_geo(gpratio_map, locationmode = 'USA-states')
fig <- fig %>% add_trace(text = ~hover, locations = ~code, color = ~Gender.Pay.Ratio, colors = 'Purples')

fig <- fig %>% colorbar(title = "Gender Pay Ratio")
fig <- fig %>% layout(
  title = 'Gender Pay Ratio by State<br>(Hover for Breakdown)', geo = g)

fig
```

Column {.tabset .tabset-fade data-width=400}
-----------------------------------------------------------

### Top 10

The top ten states with the best gender pay ratios are below:

 1. **California**
 2. **New York**
 3. **Maryland**
 4. **Nevada**
 5. **Vermont**
 6. **Arkansas**
 7. **Florida**
 8. **Oregon**
 9. **Delaware**
 10. **Arizona**

Most of these states align with my hypothesis that the lowest wage gaps would exist in states that are more urban and have larger cities. Most of the states are closer to the coasts, have larger cities and more people, and have low rural populations. Larger cities tend to be more progressive and liberal in nature because their citizens are more politically active and want to drive change. The largest cities in these states are Los Angeles (CA), New York City (NY), Baltimore (MD), Las Vegas (NV), Burlington (VT), Little Rock (AR), Jacksonville (FL), Portland (OR), Wilmington (DE), and Phoenix (AZ), and several of them are among the largest cities in the US. Baltimore, Burlington, and Wilmington are not that large in comparison to the others, but they are more politically active cities because they are close to the capital, Washington, D.C. The Census Bureau indicates that more people tend to work for the government or in political offices in the Northeast due to the proximity to the capital. Arkansas was really the only state that shocked me. It has a lot of rural areas and few populous cities, so I assumed it would not rank in the top ten for lowest wage gap.

### Bottom 10

The bottom ten states with the worst gender pay ratios are below:

 41. **Oklahoma**
 42. **Mississippi**
 43. **Indiana**
 44. **New Hampshire**
 45. **Utah**
 46. **North Dakota**
 47. **Alabama**
 48. **West Virginia**
 49. **Wyoming**
 50. **Louisiana**
 
Again, most of these states align with my hypothesis that the highest wage gaps would exist in states that are more rural and have smaller cities. Most of these states are in the Midwest or the South, which are areas that tend to have more rural populations and are less progressive as a whole. Louisiana, West Virginia, Alabama, North Dakota, Indiana, Mississippi, and Oklahoma are some of the states with the largest rural populations. Also, Wyoming, Utah, and New Hampshire are some of the states with the lowest overall populations. Because these states have rural and smaller populations, they contain fewer progressive and politically active people who would fight for Equal Pay Acts.

Overall, the states with a better gender pay ratio tend to be more progressive and contain a large portion of urban populations, whereas the states with a lower gender pay ratio tend to be more conservative and contain a large portion of rural populations.


Overall Female Employment Analysis {data-navmenu="Employment" data-orientation=columns}
==============================================================================

Column
-------------------------------------------------------------------------------

### Overall Female Employment

#### Females in the Workforce Trend

My last analysis studied the employment status of males and females from 1968 to 2016 to determine whether the ratio of part-time to full-time workers has changed. In the plot below, you can see that the percentages of full-time and part-time females are at the same levels in 2016 as they were in 1968 and stayed relatively level over that period.

The one change that can be seen from the plot is the slight decrease in full-time male workers over the course of this period. I found two main factors that may have caused this change. First, as more females take a more prominent role in society, some males are now playing the role of the *stay-at-home parent*. This is not to say that fewer males are working overall, but it could lead to more of them assuming part-time roles rather than full-time ones. Second, the biggest decrease in full-time male employment came around 2008 and the recession. I noticed through some of the other data that there were more males working during this time than females, so the trends have a greater effect on the males than the females. During this time period, a lot of people, especially men, lost their jobs. Therefore, their overall employment numbers decreased. Overall, more females are entering the workforce, so men do not have to be the sole breadwinners of the family and can work part-time or be a stay-at-home dad. 


```{r}
employed_gender %>% 
  ggplot(aes(x = year)) +
  geom_line(aes(y = full_time_female), color = "red2") +
  geom_line(aes(y = full_time_male), color = "blue") +
  geom_line(aes(y = part_time_female), color = "red2") +
  geom_line(aes(y = part_time_male), color = "blue") +
  scale_y_continuous(name = "Percent") +
  scale_x_continuous(name = "Year") +
  annotate("text", x = 1968, y = 82, label = "Full-time Male = 92.2%",
           color = "blue", hjust = 0, size = 3) +
   annotate("text", x = 1968, y = 68, label = "Full-time Female = 75.1%",
           color = "red2", hjust = 0, size = 3) +
   annotate("text", x = 1968, y = 32, label = "Part-time Female = 24.9%",
           color = "red2", hjust = 0, size = 3) +
   annotate("text", x = 1968, y = 14, label = "Part-time Male = 7.8%",
           color = "blue", hjust = 0, size = 3) +
   annotate("text", x = 2005, y = 82, label = "Full-time Male = 87.6%",
           color = "blue", hjust = 0, size = 3) +
   annotate("text", x = 2005, y = 68, label = "Full-time Female = 75.1%",
           color = "red2", hjust = 0, size = 3) +
   annotate("text", x = 2005, y = 30, label = "Part-time Female = 24.9%",
           color = "red2", hjust = 0, size = 3) +
   annotate("text", x = 2005, y = 17, label = "Part-time Male = 12.4%",
           color = "blue", hjust = 0, size = 3) +
  ggtitle("Male and Female Full-time & Part-time Employment",
          subtitle = "Change from 1968 to 2016")
```
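
The change over the period can also be read as a difference in percentage points. The sketch below copies the start and end values from the plot annotations above; treating the right-hand labels as the end-of-period values is an assumption based on the plot's subtitle.

```r
# Values copied from the plot annotations (percent of each gender's workers).
shares <- data.frame(
  series = c("full_time_male", "full_time_female",
             "part_time_female", "part_time_male"),
  start  = c(92.2, 75.1, 24.9, 7.8),
  end    = c(87.6, 75.1, 24.9, 12.4)
)

# Change in percentage points, rounded to avoid floating-point noise.
shares$change_pts <- round(shares$end - shares$start, 1)

shares
```

Because each gender's full-time and part-time shares sum to 100, a drop in the full-time male share shows up as an equal rise in the part-time male share.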


Opportunity Gap Analysis {data-navmenu="Opportunity Gap" data-orientation=columns}
==============================================================================

Column {.sidebar}
------------------------------------------------------------

#### Opportunity Gap Overview

The final wage gap analysis I conducted is an analysis of the opportunity gap. The opportunity gap is essential to the wage gap conversation because it captures the aspects of the wage gap that the uncontrolled wage gap cannot. The opportunity gap is the gap between the opportunities that males are offered and those that females are offered. Males have more access to higher-paying jobs and tend to advance faster in their careers than females.

First, to explain the opportunity gap analytically, I computed the ten highest-paying occupations for each industry. For each industry, I indicated whether these occupations were held by a majority of females or males. Next, I showed what females make compared to males for each occupation. The data for this analysis is limited because it can’t capture all of the aspects of the opportunity gap. So, I included outside research in order to provide information on other factors that may affect the opportunity gap. In each industry, I researched how many hours a man works on average compared to how many hours a female works. For the top-paying occupations in each industry, I researched the level of schooling and experience a man typically has versus a female. The average hours worked by males and females in each industry and the levels of schooling and experience they received were not significantly different. 

Finally, for the last piece of this analysis, I determined what percentage of executives for each industry are males. The percentages are astoundingly high, even though the percentage of females who are executives has been increasing. Each of the percentages is greater than 85%. Therefore, the opportunity gap is a real problem that afflicts the workforce. Males are offered opportunities for advancement and increased salaries more often than females, which contributes greatly to the wage gap. 


Column {.tabset .tabset-fade}
-------------------------------------------------------------------------------

### Healthcare

#### Healthcare Practitioners and Technical Industry

For this industry, females hold the majority in 6 of the ten highest-paying occupations. They also have a majority in two of the four highest-paying occupations. Healthcare is an industry that overall is dominated by females, so I assumed that females would be the majority, but I thought they would be a majority in more than 6 of the occupations.

```{r, echo = FALSE, warning = FALSE, message = FALSE}

pay1 <- jobs_gender %>%
  filter(industry_broad == "Healthcare Practitioners and Technical") %>%
  group_by(occupation) %>%
  summarise(totalpay = sum(total_earnings), per_females = sum(workers_female) / sum(total_workers), per_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(desc(totalpay)) %>%
  select(occupation, totalpay, per_females, per_males) %>%
  top_n(10, totalpay)

ggplot(data = pay1, 
       aes(x = reorder(occupation, +totalpay), 
           y = (totalpay))) + 
  geom_bar(stat = "identity", fill = "darkturquoise") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "Total Earnings") + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("Top 10 Occupations By Earnings",
          subtitle = "10 Highest Paying Occupations for Healthcare") +
  geom_text(aes(label = c(0.33,0.58,0.26,0.53,0.22,0.37,0.89,0.65,0.54,0.67)), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=4) +
  coord_flip()
```
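
The bar labels above are typed in by hand as the share of female workers per occupation. As a hedged sketch of how such labels could instead be derived from the summarised data, the example below uses hypothetical occupations and counts in base R; only the column names mirror `jobs_gender`.

```r
# Hypothetical occupation rows across two years (illustration only).
toy <- data.frame(
  occupation     = c("Occ A", "Occ A", "Occ B", "Occ B", "Occ C", "Occ C"),
  total_earnings = c(90000, 92000, 70000, 71000, 60000, 61000),
  workers_female = c(300, 320, 600, 640, 450, 470),
  total_workers  = c(1000, 1000, 1000, 1000, 1000, 1000)
)

# Sum earnings and worker counts per occupation.
sums <- aggregate(cbind(total_earnings, workers_female, total_workers) ~ occupation,
                  data = toy, FUN = sum)

# Share of female workers, rounded for display as a bar label.
sums$per_females <- sums$workers_female / sums$total_workers
sums$label       <- round(sums$per_females, 2)

# Highest total earnings first, then count female-majority occupations.
sums <- sums[order(-sums$total_earnings), ]
n_female_majority <- sum(sums$per_females > 0.5)

sums[, c("occupation", "total_earnings", "label")]
n_female_majority
```

Mapping a computed column such as `label` inside `geom_text(aes(label = label))` keeps each label attached to its own row, so the labels cannot drift out of order if the data or the sort changes.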


### Education

#### Education, Legal, Community Service, Arts, and Media Industry

For this industry, females hold the majority in 5 of the ten highest-paying occupations. They also have a majority in one of the three highest-paying occupations. Education is an industry that overall is dominated by females, so I assumed that they would have a majority in more than half of the ten highest-paying occupations.

```{r, echo = FALSE, warning = FALSE, message = FALSE}

pay2 <- jobs_gender %>%
  filter(industry_broad == "Education, Legal, Community Service, Arts, and Media") %>%
  group_by(occupation) %>%
  summarise(totalpay = sum(total_earnings), per_females = sum(workers_female) / sum(total_workers), per_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(desc(totalpay)) %>%
  select(occupation, totalpay, per_females, per_males) %>%
  top_n(10, totalpay)

ggplot(data = pay2, 
       aes(x = reorder(occupation, +totalpay), 
           y = (totalpay))) + 
  geom_bar(stat = "identity", fill = "darkturquoise") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "Total Earnings") + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("Top 10 Occupations By Earnings",
          subtitle = "10 Highest Paying Occupations for Education") +
  geom_text(aes(label = c(0.35,0.44,0.57,0.46,0.36,0.63,0.60,0.53,0.50,0.71)), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=4) +
  coord_flip()
```

### Sales and Office

#### Sales and Office Industry

For this industry, females hold the majority in 2 of the ten highest-paying occupations. The sales industry is an industry that overall is dominated by females. But, compared to healthcare and education, women are not as dominant a force in this industry. Females don't hold a majority in most of the highest-paying positions and make up only 51% of the workers in the occupations where they do have a majority. I assumed that females would have a majority in more than half of the ten highest-paying occupations because it is a female-dominated industry. 

```{r, echo = FALSE, warning = FALSE, message = FALSE}

pay3 <- jobs_gender %>%
  filter(industry_broad == "Sales and Office") %>%
  group_by(occupation) %>%
  summarise(totalpay = sum(total_earnings), per_females = sum(workers_female) / sum(total_workers), per_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(desc(totalpay)) %>%
  select(occupation, totalpay, per_females, per_males) %>%
  top_n(10, totalpay)

ggplot(data = pay3, 
       aes(x = reorder(occupation, +totalpay), 
           y = (totalpay))) + 
  geom_bar(stat = "identity", fill = "darkturquoise") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "Total Earnings") + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("Top 10 Occupations By Earnings",
          subtitle = "10 Highest Paying Occupations for Sales") +
  geom_text(aes(label = c(0.07, 0.29, 0.25, 0.31, 0.29, 0.50, 0.36, 0.51, 0.50, 0.51)), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=4) +
  coord_flip()
```

### Service

#### Service Industry

For this industry, females hold the majority in 0 of the ten highest-paying occupations. The service industry is an industry that overall is dominated by males, so this is not surprising. Females make up 49% of this industry, though, and they still do not hold a majority in any of the highest-paying occupations. As you can see, males play a more dominant role in the female-dominated industries than females play in the male-dominated industries. 

```{r, echo = FALSE, warning = FALSE, message = FALSE}

pay4 <- jobs_gender %>%
  filter(industry_broad == "Service") %>%
  group_by(occupation) %>%
  summarise(totalpay = sum(total_earnings), per_females = sum(workers_female) / sum(total_workers), per_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(desc(totalpay)) %>%
  select(occupation, totalpay, per_females, per_males) %>%
  top_n(10, totalpay)

ggplot(data = pay4, 
       aes(x = reorder(occupation, +totalpay), 
           y = (totalpay))) + 
  geom_bar(stat = "identity", fill = "darkturquoise") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "Total Earnings") + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("Top 10 Occupations By Earnings",
          subtitle = "10 Highest Paying Occupations for Service") +
  geom_text(aes(label = c(0.04, 0.15, 0.25, 0.04, 0.12, 0.11, 0.44, 0.29, 0.25, 0.26)), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=4) +
  coord_flip()
```

### Business

#### Management, Business, and Financial Industry

For this industry, females hold the majority in 0 of the ten highest-paying occupations. The business industry is an industry that overall is dominated by males, so this is not surprising. As you can see, males play a more dominant role in the female-dominated industries than females play in the male-dominated industries. 

```{r, echo = FALSE, warning = FALSE, message = FALSE}

pay5 <- jobs_gender %>%
  filter(industry_broad == "Management, Business, and Financial") %>%
  group_by(occupation) %>%
  summarise(totalpay = sum(total_earnings), per_females = sum(workers_female) / sum(total_workers), per_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(desc(totalpay)) %>%
  select(occupation, totalpay, per_females, per_males) %>%
  top_n(10, totalpay)

ggplot(data = pay5, 
       aes(x = reorder(occupation, +totalpay), 
           y = (totalpay))) + 
  geom_bar(stat = "identity", fill = "darkturquoise") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "Total Earnings") + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("Top 10 Occupations By Earnings",
          subtitle = "10 Highest Paying Occupations for Business") +
  geom_text(aes(label = c(0.09, 0.24, 0.27, 0.49, 0.30, 0.39, 0.41, 0.43, 0.48, 0.33)), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=4) +
  coord_flip()
```

### Engineering

#### Computer, Engineering, and Science Industry

For this industry, females hold the majority in 0 of the ten highest-paying occupations. The engineering industry is an industry that overall is dominated by males, so this is not surprising. The highest percentage of females in one of these occupations is 32%, which is very low. As you can see, males play a more dominant role in the female-dominated industries than females play in the male-dominated industries. 

```{r, echo = FALSE, warning = FALSE, message = FALSE}

pay6 <- jobs_gender %>%
  filter(industry_broad == "Computer, Engineering, and Science") %>%
  group_by(occupation) %>%
  summarise(totalpay = sum(total_earnings), per_females = sum(workers_female) / sum(total_workers), per_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(desc(totalpay)) %>%
  select(occupation, totalpay, per_females, per_males) %>%
  top_n(10, totalpay)

ggplot(data = pay6, 
       aes(x = reorder(occupation, +totalpay), 
           y = (totalpay))) + 
  geom_bar(stat = "identity", fill = "darkturquoise") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "Total Earnings") + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("Top 10 Occupations By Earnings",
          subtitle = "10 Highest Paying Occupations for Engineering") +
  geom_text(aes(label = c(0.11,0.32,0.32,0.11,0.16,0.18,0.26,0.13,0.19,0.08)), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=4) +
  coord_flip()
```
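
The two figures quoted in the text (how many of the top occupations are majority female, and the highest female share) can be read directly from the summarised table. Below is a minimal sketch on made-up data. The `toy_jobs` data frame and the `summarise_top_occupations()` helper are hypothetical names that are not part of the analysis, and `slice_max()` is used in place of `top_n()` on the assumption that dplyr 1.0.0 or later is installed.

```r
library(dplyr)

# Hypothetical toy data with the same column names as jobs_gender.
toy_jobs <- data.frame(
  occupation     = c("A", "A", "B", "B", "C", "C"),
  total_earnings = c(100, 110, 80, 90, 60, 65),
  workers_female = c(10, 12, 60, 62, 30, 28),
  workers_male   = c(90, 88, 40, 38, 70, 72),
  total_workers  = c(100, 100, 100, 100, 100, 100)
)

# Keep the n occupations with the largest summed earnings and
# compute the female share of workers in each.
summarise_top_occupations <- function(data, n = 10) {
  data %>%
    group_by(occupation) %>%
    summarise(
      totalpay    = sum(total_earnings),
      per_females = sum(workers_female) / sum(total_workers),
      .groups     = "drop"
    ) %>%
    slice_max(totalpay, n = n)
}

top_occ <- summarise_top_occupations(toy_jobs, n = 2)

# Number of top occupations in which females are the majority.
n_majority_female <- sum(top_occ$per_females > 0.5)

# Highest female share among the top occupations.
max_female_share <- max(top_occ$per_females)
```

Applied to a real industry subset, `n_majority_female` would give the "out of 10" count and `max_female_share` the highest percentage cited in the commentary.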

### Production

#### Production, Transportation, and Material Moving Industry

In this industry, females hold the majority in none of the ten highest-paying occupations. Production is male-dominated overall, so this is not surprising. The highest share of females in any of these occupations is 19%, and the lowest is 3%. Comparing these charts, males hold a larger share of the top occupations in the female-dominated industries than females hold in the male-dominated ones.

```{r, echo = FALSE, warning = FALSE, message = FALSE}

pay7 <- jobs_gender %>%
  filter(industry_broad == "Production, Transportation, and Material Moving") %>%
  group_by(occupation) %>%
  summarise(totalpay = sum(total_earnings), per_females = sum(workers_female) / sum(total_workers), per_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(desc(totalpay)) %>%
  select(occupation, totalpay, per_females, per_males) %>%
  top_n(10, totalpay)

ggplot(data = pay7, 
       aes(x = reorder(occupation, +totalpay), 
           y = (totalpay))) + 
  geom_bar(stat = "identity", fill = "darkturquoise") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "Total Earnings") + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("Top 10 Occupations By Earnings",
          subtitle = "10 Highest Paying Occupations for Production") +
  geom_text(aes(label = c(0.05, 0.19, 0.06, 0.04, 0.05, 0.06, 0.07, 0.11, 0.05, 0.03)), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=4) +
  coord_flip()
```
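
The bar labels in these chunks are typed in by hand, so they have to stay in the same order as the rows of the summarised data. One way to avoid that coupling is to build the label from the `per_females` column itself. The sketch below uses made-up values; `toy_pay` and its contents are hypothetical, and `geom_col()` stands in for `geom_bar(stat = "identity")`.

```r
library(ggplot2)

# Hypothetical summarised data in the same shape as pay7.
toy_pay <- data.frame(
  occupation  = c("Occupation A", "Occupation B", "Occupation C"),
  totalpay    = c(300, 250, 200),
  per_females = c(0.05, 0.19, 0.03)
)

# Build the label from the data so it cannot drift out of row order.
toy_pay$label <- sprintf("%.2f", toy_pay$per_females)

p <- ggplot(toy_pay, aes(x = reorder(occupation, totalpay), y = totalpay)) +
  geom_col(fill = "darkturquoise") +
  geom_text(aes(label = label), hjust = 1, color = "black", size = 4) +
  scale_x_discrete(name = "Occupation") +
  scale_y_continuous(name = "Total Earnings") +
  coord_flip()
```

Because the label lives in the same row as the bar it annotates, re-sorting or re-filtering the data keeps the two in step.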

### Natural Resources

#### Natural Resources, Construction, and Maintenance Industry

In this industry, females hold the majority in none of the ten highest-paying occupations. Natural resources is male-dominated overall, so this is not surprising. The highest share of females in any of these occupations is 7%, and the shares are very low across the board, with the lowest at 1%. Comparing these charts, males hold a larger share of the top occupations in the female-dominated industries than females hold in the male-dominated ones.

```{r, echo = FALSE, warning = FALSE, message = FALSE}

pay8 <- jobs_gender %>%
  filter(industry_broad == "Natural Resources, Construction, and Maintenance") %>%
  group_by(occupation) %>%
  summarise(totalpay = sum(total_earnings), per_females = sum(workers_female) / sum(total_workers), per_males = sum(workers_male) / sum(total_workers)) %>%
  arrange(desc(totalpay)) %>%
  select(occupation, totalpay, per_females, per_males) %>%
  top_n(10, totalpay)

ggplot(data = pay8, 
       aes(x = reorder(occupation, +totalpay), 
           y = (totalpay))) + 
  geom_bar(stat = "identity", fill = "darkturquoise") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "Total Earnings") + 
  scale_x_discrete(name = "Occupation", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("Top 10 Occupations By Earnings",
          subtitle = "10 Highest Paying Occupations for Natural Resources") +
  geom_text(aes(label = c(0.01, 0.02, 0.05, 0.03, 0.02, 0.02, 0.05, 0.07, 0.03, 0.02)), vjust=0.5, hjust=1, color="black", position = position_dodge(1), size=4) +
  coord_flip()
```

### Executives Overview

#### Percentage of Female Executives Per Industry

The highest percentage of female executives is in the healthcare industry, at 14%. The three highest percentages all belong to the female-dominated industries. The lowest is in the production industry, at 1%. Many females in these industries are as qualified as their male counterparts and have similar backgrounds, yet they are not chosen for the top roles at their companies. The number of female executives has risen over the past few decades, but it remains far below where it should be. The opportunity gap, in which men get the higher-paying and more senior jobs, needs to narrow as we move further into the 21st century.

```{r, echo = FALSE, warning = FALSE, message = FALSE}

# Industry names used as axis labels, paired by position with `percentages` below.
industries <- c("Healthcare Practitioners and Technical",
                "Education, Legal, Community Service, Arts, and Media",
                "Sales and Office",
                "Management, Business, and Financial",
                "Service",
                "Computer, Engineering, and Science",
                "Natural Resources, Construction, and Maintenance",
                "Production, Transportation, and Material Moving")
percentages <- c(0.14, 0.12, 0.10, 0.09, 0.07, 0.04, 0.02, 0.01)
exec <- data.frame(industries, percentages)

ggplot(data = exec, 
       aes(x = reorder(industries, +percentages), 
           y = percentages)) + 
  geom_bar(stat = "identity", fill = "purple") + 
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(name = "% Female Executives") + 
  scale_x_discrete(name = "Industry", labels = function(x) str_wrap(x, width = 30)) +
  ggtitle("% Female Executives Per Industry") +
  coord_flip()
```
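
The `percentages` vector stores proportions (0.14 means 14%), while the axis title reads "% Female Executives". A small base R sketch of ranking industries and formatting the values as percentages follows; `toy_exec` is a hypothetical three-row copy of values from the `exec` data frame, used for illustration only.

```r
# Three rows copied from the exec data frame, for illustration only.
toy_exec <- data.frame(
  industries  = c("Service",
                  "Healthcare Practitioners and Technical",
                  "Sales and Office"),
  percentages = c(0.07, 0.14, 0.10)
)

# Sort from highest to lowest share of female executives.
ranked <- toy_exec[order(toy_exec$percentages, decreasing = TRUE), ]

# Format proportions as whole-number percentages, e.g. 0.14 -> "14%".
ranked$label <- sprintf("%.0f%%", 100 * ranked$percentages)
```

The same `label` column could be passed to `geom_text()` if percentage labels on the bars are wanted.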