According to a recent Glassdoor survey, more than two-thirds (67 percent) of U.S. employees say they would not apply for jobs at employers where they believe a gender pay gap exists. Today, the gender pay gap is more than a social or legal issue: it can affect the ability of employers to attract and retain talent. The gender wage gap is also directly related to poverty among women.
Society now aspires to bridge the gender wage gap. It is therefore essential to look at the data to recognize patterns in the gender wage gap, the practices that help reduce it, and the practices that directly contribute to widening it.
This analysis aims to recognize patterns in the gender wage gap across industries and occupations. It looks for insights into how factors such as industry, occupation, pay scale, and age group correlate with the gender wage gap, and tries to identify possible causal relationships. We can also compare the hiring and human resource management structures of industries with a low wage gap to identify best practices that can be applied to industries with a high wage gap.
To reproduce the results of this project, you will need to load the following packages:
library(tidyverse) # Collection of R packages designed for data science
library(knitr) # A general-purpose tool for dynamic report generation in R
library(ggplot2) # Plotting library (already attached by tidyverse)
Prior to the analysis, it is essential to properly load, examine, and clean the data.
The data for this project originated from the following source.
jobs_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv")
earnings_female <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/earnings_female.csv")
employed_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/employed_gender.csv")
The source data consists of three files.
This table compares the salary of women with that of men, by year and by age group.
#### Variable Description
Year - Year
group - Age group
percent - Female salary percent of male salary
The data doesn’t have any missing values.
This table has the yearly percentages of part-time and full-time employment for men and for women. The variables are:
total_full_time - Percent of total employed people usually working full time
total_part_time - Percent of total employed people usually working part time
full_time_female - Percent of employed women usually working full time
part_time_female - Percent of employed women usually working part time
full_time_male - Percent of employed men usually working full time
part_time_male - Percent of employed men usually working part time
This table compares the salaries of men and women in different occupations across a range of major and minor categories of industries.
major_category - Broad category of occupation
minor_category - Fine category of occupation
total_workers - Total estimated full-time workers > 16 years old
workers_male - Estimated MALE full-time workers > 16 years old
workers_female - Estimated FEMALE full-time workers > 16 years old
percent_female - The percent of females for specific occupation
total_earnings - Total estimated median earnings for full-time workers > 16 years old
total_earnings_male - Estimated MALE median earnings for full-time workers > 16 years old
total_earnings_female - Estimated FEMALE median earnings for full-time workers > 16 years old
wage_percent_of_male - Female wages as percent of male wages - NA for occupations with small sample size
The variable types in each data file are appropriate.
Missing values in the data are encoded as NA. There are no missing values in the earnings_female and employed_gender files. The wage_percent_of_male column, which represents female wages as a percent of male wages, has about 40% of its entries missing. Since the column can be reproduced from the other columns in the data and so many of its entries are missing, it has been dropped.
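As a rough illustration of why the dropped column is recoverable, the sketch below recomputes it for two occupations copied from the jobs_gender preview in this report. It assumes the column is simply female median earnings divided by male median earnings, times 100, as the variable description suggests; the data frame name `jobs_preview` is made up for the example.

```r
# Two rows copied from the jobs_gender preview in this report
jobs_preview <- data.frame(
  occupation = c("Chief executives", "Legislators"),
  total_earnings_male = c(126142, 71530),
  total_earnings_female = c(95921, 65325)
)

# Assumed definition: female earnings as a percent of male earnings
jobs_preview$wage_percent_of_male <-
  jobs_preview$total_earnings_female / jobs_preview$total_earnings_male * 100

print(jobs_preview)
```

Occupations whose male or female earnings are NA would still produce NA here, so recomputing the column does not fill in the small-sample gaps.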
There are no duplicate entries in any of the 3 data files.
The wage gap is calculated by the following formula (the code expresses it as a percentage and uses the median earnings columns):
Wage Gap = (Average Male Pay - Average Female Pay) / (Average Male Pay)
sum_earnings_male = workers_male * total_earnings_male (total earnings of male workers in the occupation)
sum_earnings_female = workers_female * total_earnings_female (total earnings of female workers in the occupation)
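The definitions above can be written out as follows. This is only a notational sketch: the symbols are introduced here for illustration and are not part of the original report, while the factor of 100 and the group-level formula follow the R code used elsewhere in this report.

```latex
% i indexes occupations; c is a group of occupations (e.g. a major category)
% E = median earnings, W = number of workers; superscripts m / f = male / female
\begin{align*}
\text{wage\_gap}_i &= \frac{E^{m}_i - E^{f}_i}{E^{m}_i} \times 100 \\
S^{m}_i &= W^{m}_i \, E^{m}_i, \qquad S^{f}_i = W^{f}_i \, E^{f}_i \\
\bar{E}^{m}_c &= \frac{\sum_{i \in c} S^{m}_i}{\sum_{i \in c} W^{m}_i}, \qquad
\bar{E}^{f}_c = \frac{\sum_{i \in c} S^{f}_i}{\sum_{i \in c} W^{f}_i} \\
\text{wage\_gap}_c &= \frac{\bar{E}^{m}_c - \bar{E}^{f}_c}{\bar{E}^{m}_c} \times 100
\end{align*}
% Worked check with the "Chief executives" row of the data preview:
% (126142 - 95921) / 126142 * 100 \approx 23.96
```

Weighting by the number of workers means that large occupations dominate the group-level gap, which is why the sums of earnings are stored per occupation.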
Outlier analysis is not appropriate for this data: because it covers a variety of categories and industries, no definite range of salaries can be identified.
jobs_gender <- read.csv('women_in_the_workplace_data/jobs_gender.csv')
knitr::kable(head(jobs_gender))
# checking variable types
str(jobs_gender) # variable types do not need to be changed
unique(jobs_gender$occupation) # list the distinct occupations
dim(jobs_gender) # dimensions of the data frame
# Checking missing values
colSums(is.na(jobs_gender)) / nrow(jobs_gender) # share of missing values per column; wage_percent_of_male has many
jobs_gender1 <- jobs_gender[!duplicated(jobs_gender), ] # removing duplicate rows, if there are any
# Removing redundant columns
jobs_gender2 <- jobs_gender1[, setdiff(names(jobs_gender1), "wage_percent_of_male")] # drop wage_percent_of_male: it has many missing values and can be recomputed from other columns if necessary
head(jobs_gender2)
# Variable addition
jobs_gender2$wage_gap <- (jobs_gender2$total_earnings_male - jobs_gender2$total_earnings_female) / jobs_gender2$total_earnings_male * 100 # adding the wage gap column (in percent)
head(jobs_gender2)
colSums(is.na(jobs_gender2))
jobs_gender3 <- jobs_gender2[complete.cases(jobs_gender2), ] # remove rows with missing values, since the wage gap cannot be calculated for them
colSums(is.na(jobs_gender3))
# Sum of male and of female earnings per occupation, so that the wage gap by major category can be calculated
jobs_gender3$sum_earnings_male <- round(jobs_gender3$workers_male) * jobs_gender3$total_earnings_male
jobs_gender3$sum_earnings_female <- round(jobs_gender3$workers_female) * jobs_gender3$total_earnings_female
#checking the data after the changes
unique(jobs_gender3$year)
unique(jobs_gender3$minor_category)
unique(jobs_gender3$occupation)
head(jobs_gender3)
jobs_bymajor <- filter(jobs_gender3, major_category == "Management, Business, and Financial")
typeof(jobs_bymajor)
head(jobs_bymajor)
jobs_bymajor <- data.frame(jobs_bymajor)
typeof(jobs_bymajor)
head(jobs_gender3)
str(jobs_gender3)
unique(jobs_gender3$major_category)
| Year | group | percent |
|---|---|---|
| 1979 | Total, 16 years and older | 62.3 |
| 1980 | Total, 16 years and older | 64.2 |
| 1981 | Total, 16 years and older | 64.4 |
| 1982 | Total, 16 years and older | 65.7 |
| 1983 | Total, 16 years and older | 66.5 |
| 1984 | Total, 16 years and older | 67.6 |
Year - Year
group - Age group
percent - Female salary percent of male salary
The data doesn’t have any missing values.
| year | total_full_time | total_part_time | full_time_female | part_time_female | full_time_male | part_time_male |
|---|---|---|---|---|---|---|
| 1968 | 86.0 | 14.0 | 75.1 | 24.9 | 92.2 | 7.8 |
| 1969 | 85.5 | 14.5 | 74.9 | 25.1 | 91.8 | 8.2 |
| 1970 | 84.8 | 15.2 | 73.9 | 26.1 | 91.5 | 8.5 |
| 1971 | 84.4 | 15.6 | 73.2 | 26.8 | 91.2 | 8.8 |
| 1972 | 84.3 | 15.7 | 73.1 | 26.9 | 91.1 | 8.9 |
| 1973 | 84.4 | 15.6 | 73.2 | 26.8 | 91.4 | 8.6 |
year - Year
total_full_time - Percent of total employed people usually working full time
total_part_time - Percent of total employed people usually working part time
full_time_female - Percent of employed women usually working full time
part_time_female - Percent of employed women usually working part time
full_time_male - Percent of employed men usually working full time
part_time_male - Percent of employed men usually working part time
| X | year | occupation | major_category | minor_category | total_workers | workers_male | workers_female | percent_female | total_earnings | total_earnings_male | total_earnings_female | wage_gap | sum_earnings_male | sum_earnings_female |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 2013 | Chief executives | Management, Business, and Financial | Management | 1024259 | 782400 | 241859 | 23.6 | 120254 | 126142 | 95921 | 23.957920 | 98693500800 | 23199357139 |
| 2 | 2013 | General and operations managers | Management, Business, and Financial | Management | 977284 | 681627 | 295657 | 30.3 | 73557 | 81041 | 60759 | 25.026838 | 55239733707 | 17963823663 |
| 3 | 2013 | Legislators | Management, Business, and Financial | Management | 14815 | 8375 | 6440 | 43.5 | 67155 | 71530 | 65325 | 8.674682 | 599063750 | 420693000 |
| 4 | 2013 | Advertising and promotions managers | Management, Business, and Financial | Management | 43015 | 17775 | 25240 | 58.7 | 61371 | 75190 | 55860 | 25.708206 | 1336502250 | 1409906400 |
| 5 | 2013 | Marketing and sales managers | Management, Business, and Financial | Management | 754514 | 440078 | 314436 | 41.7 | 78455 | 91998 | 65040 | 29.302811 | 40486295844 | 20450917440 |
| 6 | 2013 | Public relations and fundraising managers | Management, Business, and Financial | Management | 44198 | 16141 | 28057 | 63.5 | 74114 | 90071 | 66052 | 26.666741 | 1453836011 | 1853220964 |
This table compares the salaries of men and women in different occupations across a range of major and minor categories of industries.
occupation - Specific job/career
major_category - Broad category of occupation
minor_category - Fine category of occupation
total_workers - Total estimated full-time workers > 16 years old
workers_male - Estimated MALE full-time workers > 16 years old
workers_female - Estimated FEMALE full-time workers > 16 years old
percent_female - The percent of females for specific occupation
total_earnings - Total estimated median earnings for full-time workers > 16 years old
total_earnings_male - Estimated MALE median earnings for full-time workers > 16 years old
total_earnings_female - Estimated FEMALE median earnings for full-time workers > 16 years old
wage_percent_of_male - Female wages as percent of male wages - NA for occupations with small sample size
sum_earnings_male = workers_male * total_earnings_male
sum_earnings_female = workers_female * total_earnings_female
The above variables were added to support the exploratory data analysis.
The graph indicates a steady increase in female salary as a percent of male salary, which means the wage gap has been declining.
There is a visible trend of an increasing wage gap in some of the age groups from 1990 to 2000. Such a sudden trend likely has a specific combination of socio-political causes, and it is worth investigating.
From the data, we observe that a larger share of employed women than of employed men work part time. One could argue that the wage gap exists because of this larger female part-time workforce, but the percent of women working part time has stayed almost flat over the years, even as the wage gap has declined.
## Analysis of Variance Table
##
## Model 1: female_earn5$`Total, 16 years and older` ~ female_earn5$Year
## Model 2: female_earn5$`Total, 16 years and older` ~ female_earn5$Year +
## (female_earn5$part_time_female)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 31 74.887
## 2 30 72.379 1 2.5079 1.0395 0.3161
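The low F statistic (F = 1.04, p = 0.32) shows that adding the percent of women working part time does not significantly improve the model. This analysis therefore gives no evidence that the gender wage gap is due to the higher percentage of part-time female workers.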
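To make the ANOVA table easier to read, the sketch below recomputes the F statistic and its p-value from the residual sums of squares and degrees of freedom printed in the table. The variable names are chosen for this illustration only; the numbers are copied from the output above.

```r
# Values copied from the Analysis of Variance Table above
rss_reduced <- 74.887  # Model 1: year only
rss_full    <- 72.379  # Model 2: year + part_time_female
df_full     <- 30      # residual degrees of freedom of Model 2
df_diff     <- 1       # number of extra terms in Model 2

# F compares the drop in RSS per added term with the residual variance of the full model
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```

A p-value this far above the usual 0.05 threshold means the extra term is not distinguishable from noise in this data. That is weaker than proving it has no effect.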
| major_category | avg_earnings_male | avg_earnings_female | total_wage_gap |
|---|---|---|---|
| Computer, Engineering, and Science | 81046.05 | 67619.32 | 16.56680 |
| Education, Legal, Community Service, Arts, and Media | 65979.61 | 47696.79 | 27.70980 |
| Healthcare Practitioners and Technical | 112765.64 | 62566.19 | 44.51662 |
| Management, Business, and Financial | 81058.16 | 59784.28 | 26.24521 |
| Natural Resources, Construction, and Maintenance | 42087.10 | 33805.00 | 19.67848 |
| Production, Transportation, and Material Moving | 40154.60 | 28335.56 | 29.43384 |
| Sales and Office | 48149.99 | 35162.98 | 26.97199 |
| Service | 34961.41 | 24806.43 | 29.04626 |
This analysis shows that the gender wage gap is disproportionately higher in the “Healthcare Practitioners and Technical” major category. Since there are no finer minor categories within this major category in the data, further analysis of it is not possible with the given data. For example, a factor such as a greater share of the female workforce working in lower-skilled jobs than the male workforce could contribute to the higher gender wage gap.
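The code that produced the major-category table is not shown in this report. The sketch below assumes it follows the same worker-weighted aggregation as the earnings-range tables, and applies it to three rows copied from the jobs_gender preview. The names `jobs_preview` and `gap_by_major` are hypothetical, and with only three rows the result will not match the table above.

```r
library(dplyr)

# Three rows copied from the jobs_gender preview; all belong to one major category
jobs_preview <- data.frame(
  major_category = "Management, Business, and Financial",
  occupation = c("Chief executives", "Legislators",
                 "Advertising and promotions managers"),
  workers_male = c(782400, 8375, 17775),
  workers_female = c(241859, 6440, 25240),
  total_earnings_male = c(126142, 71530, 75190),
  total_earnings_female = c(95921, 65325, 55860)
)

gap_by_major <- jobs_preview %>%
  mutate(
    sum_earnings_male = workers_male * total_earnings_male,
    sum_earnings_female = workers_female * total_earnings_female
  ) %>%
  group_by(major_category) %>%
  summarise(
    # Worker-weighted average earnings within each major category
    avg_earnings_male = sum(sum_earnings_male) / sum(workers_male),
    avg_earnings_female = sum(sum_earnings_female) / sum(workers_female),
    total_wage_gap = (avg_earnings_male - avg_earnings_female) / avg_earnings_male * 100
  )

print(gap_by_major)
```

Grouping by `minor_category` or `occupation` instead of `major_category` would give the finer breakdown, where the data provides one.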
| earnings_range | avg_earnings_male | avg_earnings_female | total_wage_gap |
|---|---|---|---|
| Greater than 150000 | 210107.86 | 149503.92 | 28.84420 |
| Between 50000 and 150000 | 76777.87 | 60835.89 | 20.76377 |
| Less than 50000 | 36725.40 | 31346.12 | 14.64730 |
From the above table we can infer that the gender wage gap is higher for people in the higher income bands. However, the regression analysis between wage_gap and average income indicates that the wage gap doesn’t necessarily depend on the income of the occupation.
| major_category | earnings_range | avg_earnings_male | avg_earnings_female | total_wage_gap |
|---|---|---|---|---|
| Computer, Engineering, and Science | Between 50000 and 150000 | 81721.19 | 68882.02 | 15.71094 |
| Computer, Engineering, and Science | Less than 50000 | 47114.04 | 40979.37 | 13.02088 |
| Education, Legal, Community Service, Arts, and Media | Between 50000 and 150000 | 79274.72 | 57593.45 | 27.34953 |
| Education, Legal, Community Service, Arts, and Media | Less than 50000 | 46891.33 | 40341.33 | 13.96848 |
| Healthcare Practitioners and Technical | Between 50000 and 150000 | 82297.56 | 65917.84 | 19.90304 |
| Healthcare Practitioners and Technical | Greater than 150000 | 210107.86 | 149503.92 | 28.84420 |
| Healthcare Practitioners and Technical | Less than 50000 | 43898.51 | 37966.68 | 13.51260 |
| Management, Business, and Financial | Between 50000 and 150000 | 85155.77 | 61857.30 | 27.35983 |
| Management, Business, and Financial | Less than 50000 | 43263.02 | 35802.90 | 17.24365 |
| Natural Resources, Construction, and Maintenance | Between 50000 and 150000 | 56113.52 | 48607.68 | 13.37616 |
| Natural Resources, Construction, and Maintenance | Less than 50000 | 37460.71 | 28955.85 | 22.70340 |
| Production, Transportation, and Material Moving | Between 50000 and 150000 | 61056.27 | 44019.22 | 27.90385 |
| Production, Transportation, and Material Moving | Less than 50000 | 36726.50 | 26713.49 | 27.26372 |
| Sales and Office | Between 50000 and 150000 | 66313.38 | 50103.75 | 24.44398 |
| Sales and Office | Less than 50000 | 38907.55 | 32775.06 | 15.76169 |
| Service | Between 50000 and 150000 | 66170.20 | 54894.34 | 17.04069 |
| Service | Less than 50000 | 29036.44 | 23930.70 | 17.58391 |
From the above table, the overall trend of a higher wage gap in a higher income bucket also holds within every major category except “Natural Resources, Construction, and Maintenance” and “Service”.
# Wage Gap Decreasing over years
female_earn <- read.csv("women_in_the_workplace_data/earnings_female.csv")
female_earn3 <- female_earn %>%
  filter(group %in% c("20-24 years","25-34 years","35-44 years","45-54 years","55-64 years","Total, 16 years and older")) # %in% keeps every row whose group is listed; == would recycle the vector and silently drop rows
ggplot(data = female_earn3, aes(x = Year, y = percent, color = group)) +
  geom_line() + # female salary percent vs. year, one line per age group
scale_y_continuous(name = "Female salary percent of male salary")+
ggtitle("Increase of Female salary percent of male salary over the years")
female_earn <- read.csv("women_in_the_workplace_data/earnings_female.csv")
employed_gender <- read.csv('women_in_the_workplace_data/employed_gender.csv')
female_earn2 <- female_earn %>% spread(group,percent)
# Regression Analysis(ANOVA) to show Wagegap is not dependent upon the percent of part time female work force
female_earn5 <- female_earn2 %>% left_join(employed_gender, by = c("Year" = "year"))
lmearn8 = lm( female_earn5$`Total, 16 years and older`~female_earn5$Year + (female_earn5$part_time_female))
lmearn9 = lm( female_earn5$`Total, 16 years and older`~female_earn5$Year)
ano = anova(lmearn9,lmearn8)
ano
# Gender wagegap with income group
jobs_gender3 <- read.csv('jobs_gender3.csv')
jobs_gender4 <- jobs_gender3 %>%
  mutate(earnings_range = case_when(
    total_earnings >= 150000 ~ "Greater than 150000",
    total_earnings >= 50000 & total_earnings < 150000 ~ "Between 50000 and 150000",
    TRUE ~ "Less than 50000"
  ))
table2 <- jobs_gender4 %>%
group_by(earnings_range) %>%
  dplyr::summarise(
    avg_earnings_male = sum(sum_earnings_male) / sum(workers_male),
    avg_earnings_female = sum(sum_earnings_female) / sum(workers_female),
    total_wage_gap = (avg_earnings_male - avg_earnings_female) / avg_earnings_male * 100
  ) %>%
  arrange(desc(total_wage_gap))
knitr::kable(table2) # kable() comes from knitr, not dplyr
# Wage Gap within major categories vs Income range
table1 <- jobs_gender4 %>%
group_by(major_category , earnings_range) %>%
  dplyr::summarise(
    avg_earnings_male = sum(sum_earnings_male) / sum(workers_male),
    avg_earnings_female = sum(sum_earnings_female) / sum(workers_female),
    total_wage_gap = (avg_earnings_male - avg_earnings_female) / avg_earnings_male * 100
  )
knitr::kable(table1) # kable() comes from knitr, not dplyr
The analysis goals listed below require additional data:
Gender wage gap with respect to educational qualification
Gender wage gap with respect to the race of the individual
Gender wage gap with respect to the region of the individual
Comparison of hiring and HR practices between low wage gap and high wage gap industries
The analysis indicates a steady increase in female salary as a percent of male salary, which means the wage gap has been declining. This gives hope that corporate society is aware of the problem and is self-correcting. Now we have to find ways to speed up the decline of the gender wage gap.
There is a visible trend of an increasing wage gap in some of the age groups from 1990 to 2000. Such a sudden trend likely has a specific combination of socio-political causes. There is value in investigating this trend by examining the major government and corporate policy changes that became effective between 1990 and 2000 across the business spectrum.
One could argue that the wage gap exists because a larger share of women work part time, but the percent of women working part time has stayed almost flat over the years, even as the wage gap has declined. The low F statistic (from the analysis of variance comparing models of the wage gap with and without the part-time female workforce) supports this: it gives no evidence that the gender wage gap is due to the higher percentage of part-time female workers.
The analysis pointed to a definite trend of the wage gap being higher in a higher income bucket in almost all major categories, but the regression analysis between wage gap and income didn’t yield any significant relation. The gap appears to depend instead on the income bucket in which the occupation falls. This suggests that the gender wage gap is higher in occupations with higher income. This trend, although non-intuitive, was identified through data analysis.