Wage Gap in the Workplace

1. Introduction

According to a recent Glassdoor survey, more than two-thirds (67 percent) of U.S. employees say they would not apply for jobs at employers where they believe a gender pay gap exists. Today, the gender pay gap is more than a social or legal issue. It is an issue that can affect the ability of employers to attract and retain talent. The gender wage gap is also directly related to women's poverty.

Society now aspires to bridge the gender wage gap. It is essential to look at the data to recognize gender wage gap patterns, the practices that help lessen the gender wage gap, and the practices that directly contribute to increasing it.

This analysis aims to identify patterns in the gender wage gap across industries and occupations. It examines how factors such as industry, occupation, pay scale, and age group correlate with the gender wage gap, and considers which of these correlations might reflect a causal relationship.

Future analysis can compare the hiring and human resource management structures of industries with a low wage gap to identify best practices that can be applied to industries with a high wage gap.

2. Packages Required

To reproduce the results of this project you will need to load the following packages:

library(tidyverse) # Collection of R packages designed for data science
library(knitr) # A general-purpose tool for dynamic report generation in R
library(ggplot2) # Data visualization library (also attached by tidyverse)

3. Data Preparation

Prior to the analysis, it is essential to properly load, examine, and clean the data.

Loading the data

The data for this project originated from the following sources:

  1. Census Bureau
  2. Bureau of Labor Statistics

The files are read from the TidyTuesday repository on GitHub:

jobs_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/jobs_gender.csv")
earnings_female <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/earnings_female.csv") 
employed_gender <- readr::read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2019/2019-03-05/employed_gender.csv") 

The source data consists of 3 files.

1. Earnings_Female

This table gives the salary of females as a percentage of the salary of males, by year and by age group.

Variable Description

  1. Year - Year

  2. group - Age group

  3. percent - Female salary percent of male salary

The data doesn’t have any missing values.

2. Employed_gender

This table has the year-wise part-time and full-time employment percentages for males as well as females.

Variable Description

  1. year - Year
  2. total_full_time - Percent of total employed people usually working full time

  3. total_part_time - Percent of total employed people usually working part time

  4. full_time_female - Percent of employed women usually working full time

  5. part_time_female - Percent of employed women usually working part time

  6. full_time_male - Percent of employed men usually working full time

  7. part_time_male - Percent of employed men usually working part time

3. Jobs_Gender

This table has a salary comparison between males and females in different occupations over a range of major and minor categories of industries.

Variable Description

  1. occupation - Specific job/career

  2. major_category - Broad category of occupation

  3. minor_category - Fine category of occupation

  4. total_workers - Total estimated full-time workers > 16 years old

  5. workers_male - Estimated MALE full-time workers > 16 years old

  6. workers_female - Estimated FEMALE full-time workers > 16 years old

  7. percent_female - The percent of females for specific occupation

  8. total_earnings - Total estimated median earnings for full-time workers > 16 years old

  9. total_earnings_male - Estimated MALE median earnings for full-time workers > 16 years old

  10. total_earnings_female - Estimated FEMALE median earnings for full-time workers > 16 years old

  11. wage_percent_of_male - Female wages as percent of male wages - NA for occupations with small sample size

4. Data Cleaning

Variable Types

The variable types in each data file are appropriate.

Missing Values

Missing values in the data are encoded as ‘NA’. There are no missing values in the Earnings_Female and Employed_gender files. In Jobs_Gender, the wage_percent_of_male column, which represents female wages as a percent of male wages, has about 40% of its entries missing. Since the column can be reproduced from the other columns in the data, and since so many of its entries are missing, the column has been deleted.
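
The missing-value check described above can be sketched in R as follows. This is a minimal illustration on a made-up data frame: the values in `toy` are hypothetical, and only the column names are borrowed from Jobs_Gender. It shows the share of `NA` entries per column and how the dropped column could be recomputed from the two earnings columns wherever both are present.

```r
# Toy data frame with hypothetical values; only the column names
# mirror those of jobs_gender
toy <- data.frame(
  total_earnings_male   = c(50000, 60000, 40000, NA),
  total_earnings_female = c(45000, NA, 38000, 30000),
  wage_percent_of_male  = c(90, NA, NA, NA)
)

# Share of missing entries in each column
na_share <- colMeans(is.na(toy))
print(na_share)

# Recompute female wages as a percent of male wages from the other columns;
# the result is NA wherever either earnings value is missing
recomputed <- toy$total_earnings_female / toy$total_earnings_male * 100
print(recomputed)
```

`colMeans(is.na(...))` avoids hard-coding the number of rows, which keeps the check valid if the data changes.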

Duplicate entries

There are no duplicate entries in any of the 3 data files.

Variable addition

The wage gap is calculated with the following formula and expressed as a percentage:

Wage Gap = (Average Male Pay - Average Female Pay) / (Average Male Pay) * 100

Two further variables are added so that the wage gap can be aggregated over groups of occupations:

Sum_Earnings_male = Workers_male * Total_earnings_male (total earnings of male workers in the occupation)

Sum_earnings_female = Workers_female * Total_earnings_female (total earnings of female workers in the occupation)
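
As a worked example of these definitions, the sketch below applies them to the 2013 "Chief executives" row shown in the Jobs_Gender table of Section 5. The helper name `wage_gap_pct` is illustrative and is not part of the original analysis code. Because the earnings columns hold medians, the product of workers and earnings is an approximation of total earnings, not an exact total.

```r
# Wage gap as a percentage of male pay (illustrative helper)
wage_gap_pct <- function(male_pay, female_pay) {
  (male_pay - female_pay) / male_pay * 100
}

# Chief executives, 2013: male and female median earnings
gap <- wage_gap_pct(126142, 95921)
print(round(gap, 4))

# Approximate total earnings of male chief executives:
# number of male workers times male median earnings
sum_earnings_male <- 782400 * 126142
print(sum_earnings_male)
```

Both results agree with the wage_gap and sum_earnings_male values listed for that row in the final Jobs_Gender table.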

Outlier Analysis

Outlier analysis is not appropriate for this data: because it covers a wide variety of categories and industries, no definite range of plausible salaries can be identified.

Code

jobs_gender <- read.csv('women_in_the_workplace_data/jobs_gender.csv')
knitr::kable(head(jobs_gender))

# Checking variable types
str(jobs_gender) # variable types need not be changed

unique(jobs_gender$occupation) # distinct occupations in the data
dim(jobs_gender) # dimensions of the data frame

# Checking missing values: share of NA entries per column
colMeans(is.na(jobs_gender)) # wage_percent_of_male has many missing values

# Removing duplicate rows, if there are any
jobs_gender1 <- jobs_gender[!duplicated(jobs_gender), ]

# Removing redundant columns: wage_percent_of_male has a lot of missing values
# and, if necessary, can be calculated from the other columns
jobs_gender2 <- jobs_gender1[, names(jobs_gender1) != "wage_percent_of_male"]
head(jobs_gender2)

# Variable addition: wage gap as a percentage of male earnings
jobs_gender2$wage_gap <- (jobs_gender2$total_earnings_male - jobs_gender2$total_earnings_female) /
  jobs_gender2$total_earnings_male * 100

head(jobs_gender2)
colSums(is.na(jobs_gender2))

# Remove rows with missing values, since the wage gap can't be calculated for them
jobs_gender3 <- jobs_gender2[complete.cases(jobs_gender2), ]
colSums(is.na(jobs_gender3))

# Sum of male earnings and sum of female earnings per occupation,
# added so that the wage gap by major category can be calculated
jobs_gender3$sum_earnings_male <- round(jobs_gender3$workers_male) * jobs_gender3$total_earnings_male
jobs_gender3$sum_earnings_female <- round(jobs_gender3$workers_female) * jobs_gender3$total_earnings_female

# Checking the data after the changes
unique(jobs_gender3$year)
unique(jobs_gender3$major_category)
unique(jobs_gender3$minor_category)
unique(jobs_gender3$occupation)
head(jobs_gender3)
str(jobs_gender3)

# Example subset: a single major category
jobs_bymajor <- dplyr::filter(jobs_gender3, major_category == "Management, Business, and Financial")
class(jobs_bymajor)
head(jobs_bymajor)

5. Final data sets

Earnings_Female

| Year | group | percent |
|------|-------|---------|
| 1979 | Total, 16 years and older | 62.3 |
| 1980 | Total, 16 years and older | 64.2 |
| 1981 | Total, 16 years and older | 64.4 |
| 1982 | Total, 16 years and older | 65.7 |
| 1983 | Total, 16 years and older | 66.5 |
| 1984 | Total, 16 years and older | 67.6 |

Variable Description

  1. Year - Year

  2. group - Age group

  3. percent - Female salary percent of male salary

The data doesn’t have any missing values.

Employed_Gender

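| year | total_full_time | total_part_time | full_time_female | part_time_female | full_time_male | part_time_male |
|------|-----------------|-----------------|------------------|------------------|----------------|----------------|
| 1968 | 86.0 | 14.0 | 75.1 | 24.9 | 92.2 | 7.8 |
| 1969 | 85.5 | 14.5 | 74.9 | 25.1 | 91.8 | 8.2 |
| 1970 | 84.8 | 15.2 | 73.9 | 26.1 | 91.5 | 8.5 |
| 1971 | 84.4 | 15.6 | 73.2 | 26.8 | 91.2 | 8.8 |
| 1972 | 84.3 | 15.7 | 73.1 | 26.9 | 91.1 | 8.9 |
| 1973 | 84.4 | 15.6 | 73.2 | 26.8 | 91.4 | 8.6 |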

Variable Description

  1. year - Year

  2. total_full_time - Percent of total employed people usually working full time

  3. total_part_time - Percent of total employed people usually working part time

  4. full_time_female - Percent of employed women usually working full time

  5. part_time_female - Percent of employed women usually working part time

  6. full_time_male - Percent of employed men usually working full time

  7. part_time_male - Percent of employed men usually working part time

Jobs_Gender

| X | year | occupation | major_category | minor_category | total_workers | workers_male | workers_female | percent_female | total_earnings | total_earnings_male | total_earnings_female | wage_gap | sum_earnings_male | sum_earnings_female |
|---|------|------------|----------------|----------------|---------------|--------------|----------------|----------------|----------------|---------------------|-----------------------|----------|-------------------|---------------------|
| 1 | 2013 | Chief executives | Management, Business, and Financial | Management | 1024259 | 782400 | 241859 | 23.6 | 120254 | 126142 | 95921 | 23.957920 | 98693500800 | 23199357139 |
| 2 | 2013 | General and operations managers | Management, Business, and Financial | Management | 977284 | 681627 | 295657 | 30.3 | 73557 | 81041 | 60759 | 25.026838 | 55239733707 | 17963823663 |
| 3 | 2013 | Legislators | Management, Business, and Financial | Management | 14815 | 8375 | 6440 | 43.5 | 67155 | 71530 | 65325 | 8.674682 | 599063750 | 420693000 |
| 4 | 2013 | Advertising and promotions managers | Management, Business, and Financial | Management | 43015 | 17775 | 25240 | 58.7 | 61371 | 75190 | 55860 | 25.708206 | 1336502250 | 1409906400 |
| 5 | 2013 | Marketing and sales managers | Management, Business, and Financial | Management | 754514 | 440078 | 314436 | 41.7 | 78455 | 91998 | 65040 | 29.302811 | 40486295844 | 20450917440 |
| 6 | 2013 | Public relations and fundraising managers | Management, Business, and Financial | Management | 44198 | 16141 | 28057 | 63.5 | 74114 | 90071 | 66052 | 26.666741 | 1453836011 | 1853220964 |

Variable Description

This table has a salary comparison between males and females in different occupations over a range of major and minor categories of industries. The wage_percent_of_male column was removed during cleaning, and the wage_gap, sum_earnings_male, and sum_earnings_female columns were added.

  1. X - Row index

  2. year - Year

  3. occupation - Specific job/career

  4. major_category - Broad category of occupation

  5. minor_category - Fine category of occupation

  6. total_workers - Total estimated full-time workers > 16 years old

  7. workers_male - Estimated MALE full-time workers > 16 years old

  8. workers_female - Estimated FEMALE full-time workers > 16 years old

  9. percent_female - The percent of females for specific occupation

  10. total_earnings - Total estimated median earnings for full-time workers > 16 years old

  11. total_earnings_male - Estimated MALE median earnings for full-time workers > 16 years old

  12. total_earnings_female - Estimated FEMALE median earnings for full-time workers > 16 years old

  13. wage_gap - Difference between male and female median earnings as a percent of male median earnings

  14. sum_earnings_male - workers_male multiplied by total_earnings_male

  15. sum_earnings_female - workers_female multiplied by total_earnings_female

6. Exploratory Data Analysis

New Variables Created

Total earnings of male workers in the occupation:

                Sum_Earnings_male = Workers_male * Total_earnings_male

Total earnings of female workers in the occupation:

                Sum_earnings_female = Workers_female * Total_earnings_female

These variables are added to support the exploratory data analysis.

Analysis

Wage gap decreasing over the years

The graph indicates a steady increase in female salary as a percent of male salary, which means the wage gap has been declining.

There is a visible trend of an increasing wage gap in some of the age groups from 1990 to 2000. Such a sudden trend is likely to have a specific combination of socio-political causes, and there is value in investigating it.
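
The link between the plotted percentage and the wage gap can be made explicit: since percent is female salary as a percent of male salary, the wage gap defined in Section 4 equals 100 minus percent. A minimal sketch, using the first six "Total, 16 years and older" values from the Earnings_Female table (the variable names are illustrative):

```r
# Female salary as a percent of male salary, 1979 to 1984,
# copied from the Earnings_Female table
percent <- c("1979" = 62.3, "1980" = 64.2, "1981" = 64.4,
             "1982" = 65.7, "1983" = 66.5, "1984" = 67.6)

# Wage gap in percent of male salary
wage_gap <- 100 - percent
print(wage_gap)

# Year-over-year change in the wage gap; negative values mean the gap narrowed
print(diff(wage_gap))
```

A rising percent series therefore always corresponds to a falling wage gap series.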

Can the existence of the wage gap be attributed to a higher part-time female workforce?

From the data, we observe that the part-time share of the female workforce is higher than that of the male workforce. One could argue that the wage gap exists because of this higher part-time share, but the percent of women working part time has remained almost flat over the years, even though the wage gap has been declining.

Regression analysis (ANOVA) to test whether the wage gap depends on the percent of the female workforce working part time
## Analysis of Variance Table
## 
## Model 1: female_earn5$`Total, 16 years and older` ~ female_earn5$Year
## Model 2: female_earn5$`Total, 16 years and older` ~ female_earn5$Year + 
##     (female_earn5$part_time_female)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     31 74.887                           
## 2     30 72.379  1    2.5079 1.0395 0.3161

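The low F statistic (F = 1.04, p = 0.32) shows that adding the part-time share of the female workforce does not significantly improve the model beyond the year trend. The data therefore give no evidence that the gender wage gap is due to the higher percentage of women working part time.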
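
The F statistic and p-value in the table can be reproduced by hand from the residual sums of squares, which makes the comparison of the two nested models easier to follow. This sketch only re-derives numbers printed in the ANOVA table above; the variable names are illustrative.

```r
# Values copied from the Analysis of Variance table
rss_full <- 72.379   # residual sum of squares, Model 2 (year + part_time_female)
df_full  <- 30       # residual degrees of freedom, Model 2
sum_sq   <- 2.5079   # reduction in RSS from adding part_time_female
df_diff  <- 1        # number of added terms

# F statistic for nested models:
# (reduction in RSS per added term) / (residual variance of the full model)
f_stat <- (sum_sq / df_diff) / (rss_full / df_full)

# Upper-tail probability of the F distribution with (1, 30) degrees of freedom
p_value <- pf(f_stat, df1 = df_diff, df2 = df_full, lower.tail = FALSE)

print(round(f_stat, 4))
print(round(p_value, 4))
```

A p-value this far above the usual 0.05 threshold means the null hypothesis, that the added term has no effect, cannot be rejected.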

Comparison of gender wage gap over major categories of businesses

This analysis shows that the gender wage gap is disproportionately higher in the “Healthcare Practitioners and Technical” major category. Since the data has no minor categories for this major category, further analysis of it is not possible with the given data. For example, a larger share of the female workforce working in lower-skilled jobs than the male workforce could contribute to the higher gender wage gap.

Gender wage gap by income group

| earnings_range | avg_earnings_male | avg_earnings_female | total_wage_gap |
|----------------|-------------------|---------------------|----------------|
| Greater than 150000 | 210107.86 | 149503.92 | 28.84420 |
| Between 50000 and 150000 | 76777.87 | 60835.89 | 20.76377 |
| Less than 50000 | 36725.40 | 31346.12 | 14.64730 |

From the above table we can infer that the gender wage gap is higher for people in the higher income bands. However, the regression analysis between wage_gap and average income indicates that the wage gap does not necessarily depend on the income of the occupation itself.
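
The averages in this table are worker-weighted: the summed earnings of a group are divided by its total number of workers, so large occupations count for more than small ones. The sketch below illustrates the idea on two rows taken from the Jobs_Gender table in Section 5 (Chief executives and Legislators, 2013). The grouping of just these two occupations is an illustration only and does not correspond to any row of the table above.

```r
# Two occupations copied from the Jobs_Gender table (2013)
toy <- data.frame(
  occupation            = c("Chief executives", "Legislators"),
  workers_male          = c(782400, 8375),
  workers_female        = c(241859, 6440),
  total_earnings_male   = c(126142, 71530),
  total_earnings_female = c(95921, 65325),
  stringsAsFactors      = FALSE
)

# Approximate total earnings per occupation (workers times median earnings)
sum_earnings_male   <- toy$workers_male * toy$total_earnings_male
sum_earnings_female <- toy$workers_female * toy$total_earnings_female

# Worker-weighted average earnings for the group
avg_male   <- sum(sum_earnings_male) / sum(toy$workers_male)
avg_female <- sum(sum_earnings_female) / sum(toy$workers_female)

# Pooled wage gap for the group, in percent
pooled_gap <- (avg_male - avg_female) / avg_male * 100

# For comparison: unweighted mean of the per-occupation wage gaps
simple_mean_gap <- mean(
  (toy$total_earnings_male - toy$total_earnings_female) / toy$total_earnings_male * 100
)

print(c(pooled = pooled_gap, simple_mean = simple_mean_gap))
```

The two figures differ because the pooled gap is dominated by the much larger occupation, which is the behavior wanted when summarizing a whole income band or category.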

Wage Gap within major categories vs Income range

| major_category | earnings_range | avg_earnings_male | avg_earnings_female | total_wage_gap |
|----------------|----------------|-------------------|---------------------|----------------|
| Computer, Engineering, and Science | Between 50000 and 150000 | 81721.19 | 68882.02 | 15.71094 |
| Computer, Engineering, and Science | Less than 50000 | 47114.04 | 40979.37 | 13.02088 |
| Education, Legal, Community Service, Arts, and Media | Between 50000 and 150000 | 79274.72 | 57593.45 | 27.34953 |
| Education, Legal, Community Service, Arts, and Media | Less than 50000 | 46891.33 | 40341.33 | 13.96848 |
| Healthcare Practitioners and Technical | Between 50000 and 150000 | 82297.56 | 65917.84 | 19.90304 |
| Healthcare Practitioners and Technical | Greater than 150000 | 210107.86 | 149503.92 | 28.84420 |
| Healthcare Practitioners and Technical | Less than 50000 | 43898.51 | 37966.68 | 13.51260 |
| Management, Business, and Financial | Between 50000 and 150000 | 85155.77 | 61857.30 | 27.35983 |
| Management, Business, and Financial | Less than 50000 | 43263.02 | 35802.90 | 17.24365 |
| Natural Resources, Construction, and Maintenance | Between 50000 and 150000 | 56113.52 | 48607.68 | 13.37616 |
| Natural Resources, Construction, and Maintenance | Less than 50000 | 37460.71 | 28955.85 | 22.70340 |
| Production, Transportation, and Material Moving | Between 50000 and 150000 | 61056.27 | 44019.22 | 27.90385 |
| Production, Transportation, and Material Moving | Less than 50000 | 36726.50 | 26713.49 | 27.26372 |
| Sales and Office | Between 50000 and 150000 | 66313.38 | 50103.75 | 24.44398 |
| Sales and Office | Less than 50000 | 38907.55 | 32775.06 | 15.76169 |
| Service | Between 50000 and 150000 | 66170.20 | 54894.34 | 17.04069 |
| Service | Less than 50000 | 29036.44 | 23930.70 | 17.58391 |

From the above table, it can be observed that the overall trend of a higher wage gap in higher income buckets also holds within every major category except “Natural Resources, Construction, and Maintenance” and “Service”.
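
The exceptions named above can be checked directly from the table. The sketch below copies the total_wage_gap values for the two income ranges that every major category has and flags the categories where the lower range shows the larger gap. The data frame and its column names are illustrative.

```r
# total_wage_gap values copied from the table above
gaps <- data.frame(
  major_category = c(
    "Computer, Engineering, and Science",
    "Education, Legal, Community Service, Arts, and Media",
    "Healthcare Practitioners and Technical",
    "Management, Business, and Financial",
    "Natural Resources, Construction, and Maintenance",
    "Production, Transportation, and Material Moving",
    "Sales and Office",
    "Service"
  ),
  gap_50k_150k  = c(15.71094, 27.34953, 19.90304, 27.35983,
                    13.37616, 27.90385, 24.44398, 17.04069),
  gap_under_50k = c(13.02088, 13.96848, 13.51260, 17.24365,
                    22.70340, 27.26372, 15.76169, 17.58391),
  stringsAsFactors = FALSE
)

# Categories where the lower income range has the larger wage gap
exceptions <- gaps$major_category[gaps$gap_50k_150k < gaps$gap_under_50k]
print(exceptions)
```

Note that the reversal is large for Natural Resources, Construction, and Maintenance but only about half a percentage point for Service, and that the trend in Production, Transportation, and Material Moving holds by a similarly small margin.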

Code for Analysis

# Wage Gap Decreasing over years

female_earn <- read.csv("women_in_the_workplace_data/earnings_female.csv")

# Keep the age groups of interest; %in% tests membership row by row
female_earn3 <- female_earn %>%
  filter(group %in% c("20-24 years", "25-34 years", "35-44 years", "45-54 years", "55-64 years", "Total, 16 years and older"))

# Female salary percent of male salary vs year, one line per age group
ggplot(data = female_earn3, aes(x = Year, y = percent, color = group)) +
  geom_line() +
  scale_y_continuous(name = "Female salary percent of male salary") +
  ggtitle("Increase of Female salary percent of male salary over the years")

# Regression Analysis (ANOVA) to test whether the wage gap depends on the percent of part time female work force

employed_gender <- read.csv('women_in_the_workplace_data/employed_gender.csv')

# One column per age group, then join the employment percentages by year
female_earn2 <- female_earn %>% spread(group, percent)
female_earn5 <- female_earn2 %>% left_join(employed_gender, by = c("Year" = "year"))

# Full model: year trend plus percent of women working part time
lmearn8 <- lm(female_earn5$`Total, 16 years and older` ~ female_earn5$Year + (female_earn5$part_time_female))

# Reduced model: year trend only
lmearn9 <- lm(female_earn5$`Total, 16 years and older` ~ female_earn5$Year)

# F test comparing the two nested models
ano <- anova(lmearn9, lmearn8)
ano

# Gender wage gap by income group

# Assumes the cleaned jobs_gender3 data was saved to this file beforehand
jobs_gender3 <- read.csv('jobs_gender3.csv')

# Assign each occupation to an income band based on its median earnings
jobs_gender4 <- jobs_gender3 %>%
  mutate(earnings_range = case_when(
    total_earnings >= 150000 ~ "Greater than 150000",
    total_earnings >= 50000 & total_earnings < 150000 ~ "Between 50000 and 150000",
    TRUE ~ "Less than 50000"
  ))

# Worker-weighted average earnings and wage gap per income band
table2 <- jobs_gender4 %>%
  group_by(earnings_range) %>%
  dplyr::summarise(
    avg_earnings_male = sum(sum_earnings_male) / sum(workers_male),
    avg_earnings_female = sum(sum_earnings_female) / sum(workers_female),
    total_wage_gap = (avg_earnings_male - avg_earnings_female) / avg_earnings_male * 100
  ) %>%
  arrange(desc(total_wage_gap))

knitr::kable(table2) # kable() comes from knitr, not dplyr

# Wage Gap within major categories vs Income range

table1 <- jobs_gender4 %>%
  group_by(major_category, earnings_range) %>%
  dplyr::summarise(
    avg_earnings_male = sum(sum_earnings_male) / sum(workers_male),
    avg_earnings_female = sum(sum_earnings_female) / sum(workers_female),
    total_wage_gap = (avg_earnings_male - avg_earnings_female) / avg_earnings_male * 100
  )

knitr::kable(table1)
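
One detail of the age-group filter in the code above is worth illustrating: when several labels are to be kept, membership should be tested with `%in%`. Comparing a column to a vector with `==` recycles the vector element by element, so a row is kept only if its label happens to line up with the recycled position. A small sketch with made-up rows (the vectors `group` and `keep` are hypothetical):

```r
# Hypothetical column of age-group labels and the labels to keep
group <- c("20-24 years", "25-34 years", "16-19 years",
           "25-34 years", "20-24 years", "16-19 years")
keep  <- c("25-34 years", "20-24 years")

# Element-wise comparison: keep is recycled to
# 25-34, 20-24, 25-34, 20-24, 25-34, 20-24 and compared position by position
by_equals <- group == keep

# Membership test: TRUE for every row whose label appears anywhere in keep
by_in <- group %in% keep

print(sum(by_equals))
print(sum(by_in))
```

With `==` none of the four matching rows is selected in this example, while `%in%` selects all of them.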

Analyses requiring additional data

The analysis goals listed below require additional data:

  1. Gender wage gap with respect to educational qualification

  2. Gender wage gap with respect to the race of the individual

  3. Gender wage gap with respect to the region of the individual

  4. Comparison of hiring and HR practices in low wage gap and high wage gap industries

7. Summary and Recommendations

The analysis indicates a steady increase in female salary as a percent of male salary, which means the wage gap has been declining. This gives hope that the corporate world is aware of the problem and is correcting itself. The next step is to find ways to accelerate the declining trend of the gender wage gap.

There is a visible trend of an increasing wage gap in some of the age groups from 1990 to 2000. Such a sudden trend is likely to have a specific combination of socio-political causes. There is value in investigating it by examining major government and corporate policy changes that became effective between 1990 and 2000 across the business spectrum.

One could argue that the wage gap exists because a higher share of women work part time, but the percent of women working part time has remained almost flat over the years, even though the wage gap has been declining. The low F statistic (from the analysis of variance between the wage gap and the part-time female workforce) shows no significant relationship, so the data give no evidence that the gender wage gap is due to the higher percentage of women working part time.

The analysis pointed to a definite trend of the wage gap being higher for higher income buckets in almost all major categories, but the regression analysis (between wage gap and income) did not yield a significant relationship. The wage gap appears to depend on the income bucket in which the occupation falls, not on the income of the occupation itself.