Introduction

The data I have used for my R project is a public data set from Kaggle.com called the “IBM HR Analytics Employee Attrition and Performance”. This is a fictional data set created by IBM data scientists. While this data set has a lot of attributes related to attrition of employees, I decided to use this data set to study the gender inequality in the corporate world. We will take a look at the job level, education level, attrition and income differences between men and women employees in this data set.

Dataset

We have the employee related details for 1470 employees that span across 35 different columns. Some of the columns that we have primarily used for our analysis are:

-Age -Attrition -Gender -Monthly Income -Education Level -Job Level -Business Travel

summary(hrdata)
##       Age         Attrition         BusinessTravel       DailyRate     
##  Min.   :18.00   Length:1470        Length:1470        Min.   : 102.0  
##  1st Qu.:30.00   Class :character   Class :character   1st Qu.: 465.0  
##  Median :36.00   Mode  :character   Mode  :character   Median : 802.0  
##  Mean   :36.92                                         Mean   : 802.5  
##  3rd Qu.:43.00                                         3rd Qu.:1157.0  
##  Max.   :60.00                                         Max.   :1499.0  
##   Department        DistanceFromHome   Education     EducationField    
##  Length:1470        Min.   : 1.000   Min.   :1.000   Length:1470       
##  Class :character   1st Qu.: 2.000   1st Qu.:2.000   Class :character  
##  Mode  :character   Median : 7.000   Median :3.000   Mode  :character  
##                     Mean   : 9.193   Mean   :2.913                     
##                     3rd Qu.:14.000   3rd Qu.:4.000                     
##                     Max.   :29.000   Max.   :5.000                     
##  EmployeeCount EmployeeNumber   EnvironmentSatisfaction    Gender         
##  Min.   :1     Min.   :   1.0   Min.   :1.000           Length:1470       
##  1st Qu.:1     1st Qu.: 491.2   1st Qu.:2.000           Class :character  
##  Median :1     Median :1020.5   Median :3.000           Mode  :character  
##  Mean   :1     Mean   :1024.9   Mean   :2.722                             
##  3rd Qu.:1     3rd Qu.:1555.8   3rd Qu.:4.000                             
##  Max.   :1     Max.   :2068.0   Max.   :4.000                             
##    HourlyRate     JobInvolvement    JobLevel       JobRole         
##  Min.   : 30.00   Min.   :1.00   Min.   :1.000   Length:1470       
##  1st Qu.: 48.00   1st Qu.:2.00   1st Qu.:1.000   Class :character  
##  Median : 66.00   Median :3.00   Median :2.000   Mode  :character  
##  Mean   : 65.89   Mean   :2.73   Mean   :2.064                     
##  3rd Qu.: 83.75   3rd Qu.:3.00   3rd Qu.:3.000                     
##  Max.   :100.00   Max.   :4.00   Max.   :5.000                     
##  JobSatisfaction MaritalStatus      MonthlyIncome    MonthlyRate   
##  Min.   :1.000   Length:1470        Min.   : 1009   Min.   : 2094  
##  1st Qu.:2.000   Class :character   1st Qu.: 2911   1st Qu.: 8047  
##  Median :3.000   Mode  :character   Median : 4919   Median :14236  
##  Mean   :2.729                      Mean   : 6503   Mean   :14313  
##  3rd Qu.:4.000                      3rd Qu.: 8379   3rd Qu.:20462  
##  Max.   :4.000                      Max.   :19999   Max.   :26999  
##  NumCompaniesWorked    Over18            OverTime         PercentSalaryHike
##  Min.   :0.000      Length:1470        Length:1470        Min.   :11.00    
##  1st Qu.:1.000      Class :character   Class :character   1st Qu.:12.00    
##  Median :2.000      Mode  :character   Mode  :character   Median :14.00    
##  Mean   :2.693                                            Mean   :15.21    
##  3rd Qu.:4.000                                            3rd Qu.:18.00    
##  Max.   :9.000                                            Max.   :25.00    
##  PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
##  Min.   :3.000     Min.   :1.000            Min.   :80    Min.   :0.0000  
##  1st Qu.:3.000     1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000  
##  Median :3.000     Median :3.000            Median :80    Median :1.0000  
##  Mean   :3.154     Mean   :2.712            Mean   :80    Mean   :0.7939  
##  3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000  
##  Max.   :4.000     Max.   :4.000            Max.   :80    Max.   :3.0000  
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000

Findings

The analysis that I am doing on this data set will help us understand or realize if some gender based myths or pre-notions that we have about corporate employment are true or not based on this data set.

Employees by Age range

This visualization helps us get a general idea of the spread of employee population by age. Age of employee data used for analysis ranges from 18 to 60 years.

As you can see from the visualization 73% of the employees are between the age of 26 and 45.

age_bracket1 <- nrow(hrdata[(hrdata$Age >= 18 & hrdata$Age <= 20),])
age_bracket2 <- nrow(hrdata[(hrdata$Age >= 21 & hrdata$Age <= 25),])
age_bracket3 <- nrow(hrdata[(hrdata$Age >= 26 & hrdata$Age <= 30),])
age_bracket4 <- nrow(hrdata[(hrdata$Age >= 31 & hrdata$Age <= 35),])
age_bracket5 <- nrow(hrdata[(hrdata$Age >= 36 & hrdata$Age <= 40),])
age_bracket6 <- nrow(hrdata[(hrdata$Age >= 41 & hrdata$Age <= 45),])
age_bracket7 <- nrow(hrdata[(hrdata$Age >= 46 & hrdata$Age <= 50),])
age_bracket8 <- nrow(hrdata[(hrdata$Age >= 51 & hrdata$Age <= 55),])
age_bracket9 <- nrow(hrdata[(hrdata$Age >= 56 & hrdata$Age <= 60),])

age_brackets <- c(age_bracket1, age_bracket2, age_bracket3, age_bracket4, age_bracket5, age_bracket6,
                  age_bracket7, age_bracket8, age_bracket9)
range <- c("18-20", "21-25", "26-30", "31-35", "36-40", "41-45", "45-50", "51-55", "56-60")

age_count <- data.frame(range, age_brackets)


fig1 <- ggplot(age_count, aes(x= range, y=age_brackets)) +
  geom_bar(colour="black", fill="lightblue", stat="identity") +
  labs(title = "Number of Employees by age range", x= "Age Range (18 to 60)", y = "Employee Count") +
  theme(plot.title = element_text(hjust=0.5)) +
  theme_replace()

fig1

Attrition

This visualization should help us understand who are more likely to leave the organization based on gender.

As you see from the the visualization men are more likely (17%) to leave the organization compared to women employees (14%).

attrition <- hrdata %>% 
  select(Attrition, Gender) %>%
  group_by(Attrition, Gender) %>%
  summarise(n = length(Attrition), .groups = 'keep') %>%
  data.frame()
attrition
##   Attrition Gender   n
## 1        No Female 501
## 2        No   Male 732
## 3       Yes Female  87
## 4       Yes   Male 150
tot_attrition <- hrdata %>% 
  select(Attrition, Gender) %>%
  group_by(Attrition) %>%
  summarise(n = length(Attrition)) %>%
  data.frame()


max_y <- round_any(max(tot_attrition$n),200, ceiling)

fig2 <- ggplot(attrition, aes(x = Attrition, y = n , fill = Gender)) +
  geom_bar(stat= "identity") + 
  coord_flip() +
  labs(title = "Attrition Count by Gender", x= "Atrition", y = "Count of Employees", fill ="Gender") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette="Paired", guide = guide_legend(reverse = TRUE)) +
  geom_text(data=tot_attrition,aes(x=Attrition, y=n, label=scales::comma(n), fill= NULL), hjust = -0.3, size=4) +
  scale_y_continuous(labels = comma,
                     breaks = seq(0, max_y, by = 200),
                     limits=c(0, max_y))  

fig2

Monthly Income

While modern day corporate companies are fighting hard and working to bridge the gender wage gap, I wanted to analyze if a gender wage exists in our data set.

So I tried comparing the mean monthly income between men and women employees across all job levels. This comparison revealed that, in 4 of the 5 job levels in the organization women employees’ average monthly income is less than men employees.

avg_income <- hrdata %>% 
  select(Gender, MonthlyIncome, JobLevel) %>%
  group_by(Gender, JobLevel) %>%
  summarise(a_income = mean(MonthlyIncome), .groups = 'keep') %>%
  data.frame()
avg_income
##    Gender JobLevel  a_income
## 1  Female        1  2780.487
## 2  Female        2  5435.327
## 3  Female        3  9962.702
## 4  Female        4 15431.373
## 5  Female        5 19129.917
## 6    Male        1  2790.634
## 7    Male        2  5549.185
## 8    Male        3  9706.992
## 9    Male        4 15570.927
## 10   Male        5 19224.844
max_y2 <- round_any(max(avg_income$a_income),2500, ceiling)


fig3 <- ggplot(avg_income, aes(x = JobLevel, y = a_income, fill=Gender)) +
  geom_bar(stat= "identity", position= "dodge") +
  coord_flip() +
  labs(title = "Average monthly income across joblevels by Gender",x= "JobLevel" , y = "Avg.Monthly Income in $", fill ="Gender") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette="Set2", guide = guide_legend(reverse = TRUE)) +
  scale_y_continuous(labels = comma,
                     breaks = seq(0, max_y2, by = 2500),
                     limits=c(0, max_y2))  

fig3

Average Age

Now that we know women employees make less compared to their peers in the opposite gender, we now want to understand if they have the same opportunity as men for career development in the organization.

In this visualization we are comparing the average age of men and women employees across each job level. As you can see, in 3 of the 5 job levels it takes longer for women to get to these positions compared to men. It is also to be noted that this gap is the highest at the top most job level.

avg_age <- hrdata %>% 
  select(Gender, Age, JobLevel) %>%
  group_by(Gender, JobLevel) %>%
  summarise(a_age = mean(Age), .groups = 'keep') %>%
  data.frame()
avg_age
##    Gender JobLevel    a_age
## 1  Female        1 32.57286
## 2  Female        2 36.99091
## 3  Female        3 39.84043
## 4  Female        4 47.29412
## 5  Female        5 48.87500
## 6    Male        1 32.49709
## 7    Male        2 36.32484
## 8    Male        3 40.27419
## 9    Male        4 47.76364
## 10   Male        5 47.15556
max_y2 <- round_any(max(avg_age$a_age),5, ceiling)

fig3a <- ggplot(avg_age, aes(x = JobLevel, y = a_age, fill=Gender)) +
  geom_bar(stat= "identity", position="dodge") +
#  coord_flip() +
  labs(title = "Average Age across joblevels by Gender",x= "JobLevel" , y = "Avg. Age in years", fill ="Gender") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5)) +
  scale_fill_brewer(palette="Set1", guide = guide_legend(reverse = FALSE)) +
  scale_y_continuous(breaks = seq(0, max_y2, by = 5),
                     limits=c(0, max_y2))  

fig3a

Educational Level

If women are underpaid and get less opportunities to go up the corporate ladder, is it possible that their educational qualification play a role in to this ? Let’s find out!

This visualization compares and contrasts the educational level of employees based on gender. It is important to note that % of women employees with a College degree or Bachelor degree or Doctorate is higher than their male peers.

In general, % of women employees with a college or higher degree (90%) is more than % of male employees with a college or higher degree(87.5%).

education_df <- hrdata %>%
  select(Education, Gender) %>%
  mutate(edu_description = ifelse(hrdata$Education == "1" , "Below College", 
                                  ifelse(hrdata$Education == "2", "College", 
                                         ifelse(hrdata$Education == "3", "Bachelor",
                                                ifelse(hrdata$Education == "4", "Master", "Doctor"))))) %>%
  group_by(Gender, Education, edu_description) %>%
  summarise(n= length(Education), .groups ='keep') %>%
  group_by(Gender) %>%
  mutate(percent_of_total = round(100*n/sum(n),1)) %>%
  ungroup() %>%
  data.frame()
education_df
##    Gender Education edu_description   n percent_of_total
## 1  Female         1   Below College  60             10.2
## 2  Female         2         College 117             19.9
## 3  Female         3        Bachelor 235             40.0
## 4  Female         4          Master 154             26.2
## 5  Female         5          Doctor  22              3.7
## 6    Male         1   Below College 110             12.5
## 7    Male         2         College 165             18.7
## 8    Male         3        Bachelor 337             38.2
## 9    Male         4          Master 244             27.7
## 10   Male         5          Doctor  26              2.9
education_df$edu_description = factor(education_df$edu_description, levels = c("Below College", "College", "Bachelor",
                                                            "Master", "Doctor"))

fig4 <- ggplot(data = education_df, aes(x="", y=n, fill = edu_description)) +
  geom_bar(stat="identity", position="fill") +
  coord_polar(theta="y", start=0) +
  labs(fill = "Education Level", x= NULL, y= NULL, title = "Education level of Employees by gender",
       caption = "Source: IBM HR Analytics Employee Attrition data") +
  theme_light() +
  theme(plot.title = element_text(hjust = 0.5),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        panel.grid =element_blank()) +
  facet_wrap(~Gender, ncol = 2, nrow=1) +
  scale_fill_brewer(palette = "Oranges") +
  geom_text(aes(x=1.7, label= paste0(percent_of_total,"%")),
            size=4,
            position=position_fill(vjust =0.5))


fig4

Travel Pattern

We just learned that women employees have better qualifications compared to their male peers. So that cannot be a reason for being underpaid or having less opportunities to grow in an organization. Could it be their business travel preferences ?

This visualization clearly highlights that higher % of women employees travel frequently compared to men in similar job profiles that need frequent traveling. On the flip side, it also indicates that there is a higher % of male employees who don’t travel at all for business compared to women employees in the same category.

travel_df <- hrdata %>%
  select(BusinessTravel, Gender) %>%
  group_by(Gender, BusinessTravel) %>%
  summarise(n= length(BusinessTravel), .groups ='keep') %>%
  group_by(Gender) %>%
  mutate(percent_of_total = round(100*n/sum(n),1)) %>%
  ungroup() %>%
  data.frame()
travel_df
##   Gender    BusinessTravel   n percent_of_total
## 1 Female        Non-Travel  49              8.3
## 2 Female Travel_Frequently 117             19.9
## 3 Female     Travel_Rarely 422             71.8
## 4   Male        Non-Travel 101             11.5
## 5   Male Travel_Frequently 160             18.1
## 6   Male     Travel_Rarely 621             70.4
fig5 <- plot_ly(hole=0.7) %>%
  layout(title = "Business Travel requirements for Employees by Gender") %>%
  add_trace(data= travel_df[travel_df$Gender ==  "Female",],
            labels = ~BusinessTravel,
            values =~travel_df[travel_df$Gender ==  "Female","n"],
            type ="pie",
            textposition ="inside",
            hovertemplate= "Gender: Female<br>Travel Type:%{label}<br>Percent:%{percent}<br>Number of Employees: %{value}<extra></extra>") %>%
  
  add_trace(data= travel_df[travel_df$Gender ==  "Male",],
            labels = ~BusinessTravel,
            values =~travel_df[travel_df$Gender ==  "Male","n"],
            type ="pie",
            textposition ="inside",
            hovertemplate= "Gender: Male<br>Travel Type:%{label}<br>Percent:%{percent}<br>Number of Employees: %{value}<extra></extra>",
            domain = list(
              x = c(0.16,0.84),
              y = c(0.16,0.84)))


fig5

Income Vs Education Vs Job Level

Why do we have this disparity? Is it possible that something is wrong in the system with the way Job level, Education and monthly income are tied together.

This heatmap visualization makes it clear that monthly income is directly proportional to job levels in the organization and education levels for the most part.

edu_jl <- hrdata %>% 
  select(Education, MonthlyIncome, JobLevel) %>%
  mutate(edu_description = ifelse(hrdata$Education == "1" , "Below College", 
                                  ifelse(hrdata$Education == "2", "College", 
                                         ifelse(hrdata$Education == "3", "Bachelor",
                                                ifelse(hrdata$Education == "4", "Master", "Doctor"))))) %>%
  group_by(edu_description, JobLevel) %>%
  summarise(avgincome =mean(MonthlyIncome), .groups= 'keep') %>%
  data.frame()
edu_jl
##    edu_description JobLevel avgincome
## 1         Bachelor        1  2720.359
## 2         Bachelor        2  5549.456
## 3         Bachelor        3  9612.020
## 4         Bachelor        4 15279.318
## 5         Bachelor        5 19151.679
## 6    Below College        1  2773.202
## 7    Below College        2  5617.000
## 8    Below College        3 10513.200
## 9    Below College        4 15183.250
## 10   Below College        5 19392.167
## 11         College        1  2771.170
## 12         College        2  5301.816
## 13         College        3  9444.121
## 14         College        4 16086.765
## 15         College        5 19043.538
## 16          Doctor        1  3127.375
## 17          Doctor        2  5427.650
## 18          Doctor        3  9506.333
## 19          Doctor        4 15590.222
## 20          Doctor        5 18943.000
## 21          Master        1  2913.785
## 22          Master        2  5578.830
## 23          Master        3 10184.586
## 24          Master        4 15566.357
## 25          Master        5 19309.200
breaks <- c(seq(0, max(edu_jl$avgincome), by=2500))


myorder <- c("Below College","College","Bachelor","Master","Doctor")
edu_jl$edu_description <- factor(edu_jl$edu_description , levels = myorder)

fig6 <- ggplot(edu_jl, aes(x=JobLevel, y=edu_description, fill=avgincome)) +
  geom_tile(color ="black") +
  geom_text(aes(label=comma(avgincome))) +
  coord_equal(ratio=1) +
  labs(title="Average Monthly Income across Job Levels & Education Levels",
       x= "Job Level",
       y="Education Level",
       fill = "Average Monthly Income in $") +
  theme_minimal() +
  theme(plot.title = element_text(hjust=0.5)) +
  scale_fill_continuous(low= "white", high= "orange", breaks= breaks) +
  guides(fill= guide_legend(reverse= TRUE, override.aes=list(colour="black")))

fig6

ggplotly(fig6, tooltip = c("avgincome","JobLevel","edu_description")) %>%
  style(hoverlabel = list(bgcolor = "yellow"))

Conclusion

It is very obvious from all the visualization that gender wage gap does exist in this organization. In spite of women employees in this organization being less attrition prone, holding higher education levels and more willing to travel for business, they are still compensated financially lesser and take longer for climbing the corporate ladder compared to their peers from opposite gender.

This is a systematic issue of gender inequality at workplace and we cannot turn blind eyes to it anymore. The HR policies including compensation policies should be revived with a pledge to fix gender wage gap.