Analysis of Employee Attrition: IBM’s HR Records

[Author: Abdulaziz Alhaqbani]

Introduction

This report provides an analysis, backed by various visualizations, to reveal potential factors that contribute to attrition in the corporate realm. The dataset used in this report was synthetically created by IBM’s data scientists to simulate real-life cases of HR analytics and employee behavior. The dataset was published on the Kaggle website.

The dataset contains several characteristics of employees and their attitudes towards their work, such as job involvement, education level, years since last promotion, monthly salary, and the distance they commute from home to the office.

## Observations: 1,470
## Variables: 31
## $ Attrition                <fctr> Yes, No, Yes, No, No, No, No, No, No...
## $ BusinessTravel           <fctr> Travel_Rarely, Travel_Frequently, Tr...
## $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department               <fctr> Sales, Research & Development, Resea...
## $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education                <ord> College, Below College, College, Mast...
## $ EducationField           <fctr> Life Sciences, Life Sciences, Other,...
## $ EnvironmentSatisfaction  <ord> Medium, High, Very High, Very High, L...
## $ Gender                   <fctr> Female, Male, Male, Female, Male, Ma...
## $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement           <ord> High, Medium, Medium, High, High, Hig...
## $ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole                  <fctr> Sales Executive, Research Scientist,...
## $ JobSatisfaction          <ord> Very High, Medium, High, High, Medium...
## $ MaritalStatus            <fctr> Single, Married, Single, Married, Ma...
## $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ OverTime                 <fctr> Yes, No, Yes, Yes, No, No, Yes, No, ...
## $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating        <ord> Excellent, Outstanding, Excellent, Ex...
## $ RelationshipSatisfaction <ord> Low, Very High, Medium, High, Very Hi...
## $ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance          <ord> Bad, Better, Better, Better, Better, ...
## $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...
## $ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...

Displaying a sample of selected observations 500 to 505:

##     Attrition    BusinessTravel DailyRate             Department
## 500        No     Travel_Rarely      1216                  Sales
## 501        No     Travel_Rarely       646 Research & Development
## 502        No Travel_Frequently       160 Research & Development
## 503        No     Travel_Rarely       238                  Sales
## 504        No     Travel_Rarely      1397 Research & Development
## 505       Yes Travel_Frequently       306                  Sales
##     DistanceFromHome     Education EducationField EnvironmentSatisfaction
## 500                8        Master      Marketing                    High
## 501                9        Master  Life Sciences                     Low
## 502                3      Bachelor        Medical                    High
## 503                1 Below College        Medical               Very High
## 504                1        Doctor  Life Sciences                  Medium
## 505               26        Master  Life Sciences                     Low
##     Gender HourlyRate JobInvolvement JobLevel            JobRole
## 500   Male         39           High        2    Sales Executive
## 501 Female         92           High        2 Research Scientist
## 502 Female         71           High        1 Research Scientist
## 503 Female         34           High        2    Sales Executive
## 504   Male         42           High        1 Research Scientist
## 505 Female        100           High        2    Sales Executive
##     JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 500            High      Divorced          7104       20431
## 501       Very High       Married          6322       18089
## 502            High      Divorced          2083       22653
## 503             Low        Single          8381        7507
## 504       Very High       Married          2691        7660
## 505             Low       Married          4286        5630
##     NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 500                  0       No                12         Excellent
## 501                  1      Yes                12         Excellent
## 502                  1       No                20       Outstanding
## 503                  7       No                20       Outstanding
## 504                  1       No                12         Excellent
## 505                  2       No                14         Excellent
##     RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## 500                Very High                0                 6
## 501                Very High                1                 6
## 502                     High                1                 1
## 503                Very High                0                18
## 504                Very High                1                10
## 505                Very High                2                 5
##     TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 500                     3          Better              5
## 501                     2            Good              6
## 502                     2          Better              1
## 503                     2            Best             14
## 504                     4            Good             10
## 505                     4          Better              1
##     YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager Age
## 500                  0                       1                    2  33
## 501                  4                       0                    5  32
## 502                  0                       0                    0  30
## 503                  7                       8                   10  53
## 504                  9                       8                    8  34
## 505                  1                       0                    0  45

Using the psych library, we can get more in-depth statistics such as the standard deviation, mad (median absolute deviation), and skew (a measure of how asymmetrical the data distribution is):

##                vars    n  mean    sd median trimmed   mad min max range
## Age               1 1470 36.92  9.14     36   36.47  8.90  18  60    42
## HourlyRate        2 1470 65.89 20.33     66   66.02 26.69  30 100    70
## YearsAtCompany    3 1470  7.01  6.13      5    5.99  4.45   0  40    40
##                 skew kurtosis   se
## Age             0.41    -0.41 0.24
## HourlyRate     -0.03    -1.20 0.53
## YearsAtCompany  1.76     3.91 0.16
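To make these measures concrete, here is a minimal base R sketch that computes them by hand. The vector `x` is a made-up toy sample, not the report's data, and the skew formula shown is one common variant; which variant psych uses by default is not asserted here.

```r
# Toy right-skewed sample (hypothetical values, standing in for a column
# such as YearsAtCompany)
x <- c(0, 1, 1, 2, 3, 5, 5, 7, 10, 40)

n     <- length(x)
sd_x  <- sd(x)    # sample standard deviation
mad_x <- mad(x)   # median absolute deviation, scaled by 1.4826 by default

# One common skewness estimate: third central moment over sd cubed
skew_x <- sum((x - mean(x))^3) / (n * sd_x^3)

print(c(sd = sd_x, mad = mad_x, skew = skew_x))
```

A positive skew, as for this toy sample, indicates a long right tail, which is the same reading as for YearsAtCompany in the table above.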

Univariate Plots Section

As explained earlier, the dataset was essentially built to analyze employee attrition at IBM; therefore, it contains a factor variable named Attrition with two levels (Yes, No) that indicates whether the employee with the given features left the company or not.

## 
##   No  Yes 
## 1233  237
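As a small sketch, the counts above can be turned into percentages in base R. The counts are copied from the table; the variable names are illustrative.

```r
# Attrition counts copied from the table above
attrition_counts <- c(No = 1233, Yes = 237)

# Share of each level, as a percentage rounded to one decimal
attrition_share <- prop.table(attrition_counts)
attrition_pct   <- round(100 * attrition_share, 1)

print(attrition_pct)
```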

Only 16% of the employees have left the company, that is, 237 out of 1470 employees. This points to an intrinsic issue with the dataset: it is clearly imbalanced towards employees who opted to stay at the company. Either way, let us build our first histogram to represent the Age variable of all employees.

The distribution of the Age variable looks almost bell-shaped (normal distribution), with the median and mean close to each other near the center. Let us draw the same histogram with additional measures of central tendency.

As we can see, the mean and median are more than 20 years away from retirement age; the company seems to depend heavily on employees younger than 40 to carry out its operations. But does the company hire both genders almost equally? Let us find out.

Clearly, males dominate, by almost a third.

Since the dataset has more than 30 variables, we will focus on the variables that might have a great influence on job satisfaction, and hence on lowering the attrition rate, which is a serious problem for many companies.

We will start by exploring more variables that may be strongly involved in determining employees’ attrition.

Most of the employees considered their involvement with their jobs as High which could correlate with their overall job satisfaction, whereas a small group of fewer than 100 employees exhibited the opposite.

The monthly salary variable is highly right-skewed, which makes sense in the corporate realm, where the operational level accounts for the biggest segment of the workforce and generally receives the lowest monthly pay. After plotting on a log10 scale, we notice a jump in the count of employees around a monthly salary of 2000. This might be caused by the salary difference between interns and full-time employees.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.000   1.000   2.188   3.000  15.000
## 
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15 
## 581 357 159  52  61  45  32  76  18  17   6  24  10  10   9  13

This histogram shows how many years have passed since each employee’s last promotion. The mean time since the last promotion is about two years and two months (2.188 years), with suspicious outliers of 15 years without a promotion.
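The summary statistics can be recovered from the frequency table alone. The sketch below rebuilds the variable from the counts printed above and converts the fractional part of the mean into months; the variable names are illustrative.

```r
# Years since last promotion and their counts, copied from the table above
years  <- 0:15
counts <- c(581, 357, 159, 52, 61, 45, 32, 76, 18, 17, 6, 24, 10, 10, 9, 13)

# Expand the frequency table back into one value per employee
x <- rep(years, times = counts)

mean_years   <- mean(x)
whole_years  <- floor(mean_years)
extra_months <- round((mean_years - whole_years) * 12)

print(round(mean_years, 3))
print(median(x))
print(c(years = whole_years, months = extra_months))
```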

Univariate Analysis

What is the structure of your dataset?

The IBM employees dataset contains 1470 observations and 31 variables (after cleaning) as follows:

##  [1] "Attrition"                "BusinessTravel"          
##  [3] "DailyRate"                "Department"              
##  [5] "DistanceFromHome"         "Education"               
##  [7] "EducationField"           "EnvironmentSatisfaction" 
##  [9] "Gender"                   "HourlyRate"              
## [11] "JobInvolvement"           "JobLevel"                
## [13] "JobRole"                  "JobSatisfaction"         
## [15] "MaritalStatus"            "MonthlyIncome"           
## [17] "MonthlyRate"              "NumCompaniesWorked"      
## [19] "OverTime"                 "PercentSalaryHike"       
## [21] "PerformanceRating"        "RelationshipSatisfaction"
## [23] "StockOptionLevel"         "TotalWorkingYears"       
## [25] "TrainingTimesLastYear"    "WorkLifeBalance"         
## [27] "YearsAtCompany"           "YearsInCurrentRole"      
## [29] "YearsSinceLastPromotion"  "YearsWithCurrManager"    
## [31] "Age"

The dataset includes factor variables like:

Attrition (Yes, No). Gender (Male, Female). Marital Status (Married, Single, Divorced).

Also, there are ordered variables such as:

Job Involvement (Low, Medium, High, Very High). Work Life Balance (Bad, Good, Better, Best). Environment Satisfaction (Low, Medium, High, Very High).

What is/are the main feature(s) of interest in your dataset?

Job Involvement, Work Life Balance, and Years Since Last Promotion are the main features for judging whether an employee is satisfied with his/her job, while the Attrition variable records the outcome.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

I believe the dataset has other supporting features, such as Environment Satisfaction, since the workplace environment plays a critical role in an employee’s overall job satisfaction. Another supporting feature would be Relationship Satisfaction, which takes into account how employees feel towards their managers. It is commonly said that “People Leave Managers, Not Companies”, and it would be worth investigating that later on.

Did you create any new variables from existing variables in the dataset?

No.

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

The variable Performance Rating exhibits an unexpected pattern: all employees have been rated as either Outstanding or Excellent, although, as stated in the data description, the variable has 4 ordered levels (Outstanding, Excellent, Good, Low). Could the company have a unique rating system that differs from corporate norms?

I converted the factor variables (Education, WorkLifeBalance, JobInvolvement, JobSatisfaction, EnvironmentSatisfaction, RelationshipSatisfaction) to ordered variables, since their levels have a natural order.
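A minimal sketch of such a conversion in base R is shown below. It assumes the raw column is stored as integer codes 1 to 4 (an assumption made for illustration); the level labels are the ones listed for Work Life Balance in this report, and the sample codes are made up.

```r
# Hypothetical raw codes for WorkLifeBalance (assumed to be integers 1-4)
wlb_codes <- c(1, 3, 3, 2, 4)

# Convert to an ordered factor so that comparisons respect the level order
wlb <- factor(wlb_codes,
              levels  = 1:4,
              labels  = c("Bad", "Good", "Better", "Best"),
              ordered = TRUE)

print(wlb)
print(wlb[1] < wlb[2])   # comparisons are meaningful for ordered factors
print(max(wlb))          # so are min() and max()
```

With an ordered factor, plots and tables list the levels in their natural order rather than alphabetically.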

I removed a variable named Over18, since all the listed employees are above 18, as well as those redundant variables that have no meaningful variability (EmployeeCount, StandardHours, EmployeeNumber).
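One way to spot columns with no variability is to count the distinct values in each column. The sketch below uses a tiny made-up data frame (`hr_toy` and its values are hypothetical) and is not necessarily how the cleaning was done for this report.

```r
# Tiny hypothetical data frame with two informative and three constant columns
hr_toy <- data.frame(
  Age           = c(41, 49, 37),
  Attrition     = c("Yes", "No", "Yes"),
  Over18        = c("Y", "Y", "Y"),
  EmployeeCount = c(1, 1, 1),
  StandardHours = c(80, 80, 80),
  stringsAsFactors = FALSE
)

# A column is constant when it has exactly one distinct value
is_constant <- vapply(hr_toy, function(col) length(unique(col)) == 1, logical(1))

# Keep only the columns that vary
hr_clean <- hr_toy[, !is_constant, drop = FALSE]

print(names(hr_toy)[is_constant])
print(names(hr_clean))
```

Note that an identifier column would not be caught by this check, since every value is distinct; it has to be dropped by name.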

Bivariate Plots Section

We will begin the bivariate section by building a correlation table (Pearson correlation coefficient) that gives an overview of potential correlations between the dataset’s variables.

The table shows strong correlations that are logical, such as the one between Age and TotalWorkingYears. However, such ordinary relationships would not reveal interesting patterns or valuable insights. Consequently, we will turn our attention to correlations that are worth investigating further. For example, the table shows a strong positive correlation between YearsSinceLastPromotion and YearsAtCompany! Does that mean that the more years an employee spends serving the company, the less likely he/she is to get promoted? We will look into this pattern later on.

Since our focus is on why employees may opt to leave the company, which raises the problematic attrition level, it would be interesting to find the attrition percentage for both genders as an initial overview.

Approximately 6% and 10% of all employees were females and males, respectively, who left the company, which together make up the 16% overall attrition. Now we shall dive into the main features alongside the supporting ones to uncover the underlying patterns and correlations.

A notable difference appears when comparing the monthly salary distribution of employees who left against those who stayed: the two diverge at 5000 and below, with a minor spike at 10000. If we draw a similar density chart for employees’ age, we see a similar pattern in which younger employees, up to around 35, had a higher chance of attrition. One possible explanation is that younger employees have more agility and flexibility to land new jobs, whereas older employees tend to hold their current jobs until retirement.

Employees with Low job involvement tend to be younger than other groups. Could this be related to the usual obstacles new employees face, as it takes time to get accustomed to a company’s culture?

Additionally, employees who rated their work life balance as Best had a shorter distance to their homes. This could be one reason why these employees like their jobs and are less likely to resign.

There seems to be a consistent pattern of a segment of employees reporting to the same manager throughout their entire career at the company. However, the story changes after 10 years at the company, where a notable percentage of employees either left the company or changed roles and hence reported to a new manager.

Statistical summary of employees’ years with their current manager as a fraction of their tenure at the company:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0625  0.5000  0.6875  0.6803  0.8750  1.0000
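A ratio like the one summarized above could be computed as sketched below. The six rows are the YearsAtCompany and YearsWithCurrManager values of observations 500 to 505 shown earlier; dropping rows with zero values is an assumption made here (suggested by the minimum above being greater than zero), and the column name `ManagerShare` is hypothetical.

```r
# Values of observations 500 to 505 from the sample shown earlier
tenure <- data.frame(
  YearsAtCompany       = c(5, 6, 1, 14, 10, 1),
  YearsWithCurrManager = c(2, 5, 0, 10, 8, 0)
)

# Assumption: keep only rows where both values are positive
tenure <- subset(tenure, YearsAtCompany > 0 & YearsWithCurrManager > 0)

# Fraction of the company tenure spent with the current manager
tenure$ManagerShare <- tenure$YearsWithCurrManager / tenure$YearsAtCompany

print(summary(tenure$ManagerShare))
```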

We see that employees on average spend about two thirds (0.68) of their tenure at the company with the same manager, and, surprisingly, some employees have reported to the same manager throughout their entire career at the company. Let us calculate the Pearson correlation coefficient for these two variables.

## 
##  Pearson's product-moment correlation
## 
## data:  hr$YearsAtCompany and hr$YearsWithCurrManager
## t = 46.123, df = 1468, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.7474818 0.7892985
## sample estimates:
##       cor 
## 0.7692124
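The test statistic and confidence interval in this output follow directly from the sample correlation and the sample size. The sketch below reproduces them in base R from the printed values, using the standard t statistic for a correlation and the Fisher z-transformation for the interval; the variable names are illustrative.

```r
r  <- 0.7692124   # sample correlation, copied from the output above
df <- 1468        # degrees of freedom, n - 2
n  <- df + 2

# t statistic for testing that the true correlation is zero
t_stat <- r * sqrt(df) / sqrt(1 - r^2)

# 95% confidence interval via the Fisher z-transformation
z    <- atanh(r)
se_z <- 1 / sqrt(n - 3)
ci   <- tanh(z + c(-1, 1) * qnorm(0.975) * se_z)

print(round(t_stat, 3))
print(round(ci, 4))
```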

The Pearson correlation coefficient is roughly 0.77, which represents a strong positive relationship; however, we should keep in mind that YearsAtCompany is a superset of the other variable, since years with the current manager are counted within years at the company.

Are there specific education fields that are associated with attrited employees?

## # A tibble: 6 x 3
## # Groups:   EducationField [6]
##     EducationField Attrition Total
##             <fctr>    <fctr> <int>
## 1    Life Sciences       Yes    89
## 2          Medical       Yes    63
## 3        Marketing       Yes    35
## 4 Technical Degree       Yes    32
## 5            Other       Yes    11
## 6  Human Resources       Yes     7
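The counts above can be expressed as each field's share of all attrited employees, as in the sketch below. Note that this is a share among leavers, not an attrition rate within each field; the latter would also need the total headcount per field, which is not shown in this table.

```r
# Attrited employees per education field, copied from the table above
leavers <- c("Life Sciences"    = 89,
             "Medical"          = 63,
             "Marketing"        = 35,
             "Technical Degree" = 32,
             "Other"            = 11,
             "Human Resources"  = 7)

# Each field's share of all attrited employees, in percent
share_pct <- round(100 * leavers / sum(leavers), 1)

print(sum(leavers))
print(share_pct)
```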

Employees who majored in Life Sciences accounted for the largest share of attrited employees (89 of 237), followed by Medical and Marketing.

Clearly, this bar graph reflects a pattern in which employees who travelled rarely had the highest attrition level, whereas employees who were not required to travel had the lowest chance of attrition.

It seems that the company gives priority to employees’ continuous development by offering most of them two training sessions a year. Even the attrited employees had their chance to be trained, which could mean that the availability of training is not a likely factor in their attrition.

Surprisingly, about 80% of the employees who stayed at the company were not required to do overtime; on the other hand, the vast majority of attrited employees were actually asked to do overtime. Could stress and job pressure have led them to leave the company?

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

Since the dataset has more than 30 variables, many of them correlate with each other, sometimes in unusual ways. For example, the business travel variable showed an unexpected correlation with attrition for employees who travel rarely. Moreover, employees who worked overtime had a higher probability of leaving the company than those who did not.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

The education field variable behaved in an unexpected way with respect to attrition: Life Sciences was the most common education field among attrited employees. Likewise, employees apparently prefer a shorter distance to their homes, which was reflected in Best ratings for work life balance.

What was the strongest relationship you found?

The strongest relationship was between Overtime and employees’ attrition, which might be a result of employees not being rewarded (or promoted) for their hard work.

Multivariate Plots Section

Now that we have explored two variables at a time, we will explore more variables together and how they relate to the attrition level.

This collection of boxplots sheds light on how distance from home could positively correlate with a higher attrition level. We can see that the median attrited employee had to commute longer than the median employee who stayed, in each panel, for both genders and across marital statuses.

The scatter plot above reflects quite an interesting behavior. Apparently, diligent employees who put in more effort by working overtime seem to take more years on average to get promoted. Such treatment by the company could demotivate hardworking employees and thus lead them to leave the company.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

A strong correlation was observed when plotting attrition, distance from home, and marital status: the longer the distance from home, the more likely an employee is to leave, and the same pattern repeats for each marital status (Married, Divorced and Single).

Another important relationship we discovered is that being a hard worker (doing extra overtime) could result in unfair treatment, where promotions tend to take more years to arrive. In contrast, employees who do not work overtime are likely to be promoted more quickly.

Were there any interesting or surprising interactions between features?

Overtime’s impact on employees’ attrition.


Final Plots and Summary

Plot One

Description One

In these multivariate boxplots, a distinct pattern appears: the longer the distance from home, the more likely an employee is to leave the company, and this pattern is stronger for single male employees.

Plot Two

Description Two

In these multivariate boxplots, a notable correlation appears: the longer the distance from home, the more likely an employee is to leave the company, and this correlation is stronger for single male employees.

Plot Three

Description Three

In this last scatter plot, with the use of a best-fitting line (trend line), we notice an interesting (and rather unexpected) correlation between being a hard worker and getting promoted more slowly.


Reflection

We explored a dataset of HR records that was synthetically created by IBM’s data scientists. The dataset contained 1470 observations and 31 variables (after cleaning) describing employees’ personal and work-related characteristics. We then analyzed and visualized the dataset in univariate, bivariate, and multivariate sections using various statistical measures and charts.

One struggle that I encountered at the beginning was the sheer number of variables combined with my limited domain knowledge of HR analytics. However, after a few explorations and visualizations I started to build an intuition that helped me dive with confidence into exploring and untangling complex relationships between variables, and hence extract valuable insights.

In the end, we have seen strong features that could determine employees’ attrition (Over Time, Distance From Home, Years Since Last Promotion) and that can be incorporated into a predictive model as the next step in the data science cycle.