[Author: Abdulaziz Alhaqbani]
Introduction
This report provides an analysis, backed by various visualizations, of the factors that may contribute to attrition in a corporate setting. The dataset was synthetically created by IBM’s data scientists to simulate real-life HR analytics cases and employee behavior, and it was published on Kaggle.
The dataset describes several characteristics of employees and their attitudes towards their work, such as job involvement, education level, years since last promotion, monthly salary, and the distance they commute from home to the office.
## Observations: 1,470
## Variables: 31
## $ Attrition <fctr> Yes, No, Yes, No, No, No, No, No, No...
## $ BusinessTravel <fctr> Travel_Rarely, Travel_Frequently, Tr...
## $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department <fctr> Sales, Research & Development, Resea...
## $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education <ord> College, Below College, College, Mast...
## $ EducationField <fctr> Life Sciences, Life Sciences, Other,...
## $ EnvironmentSatisfaction <ord> Medium, High, Very High, Very High, L...
## $ Gender <fctr> Female, Male, Male, Female, Male, Ma...
## $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement <ord> High, Medium, Medium, High, High, Hig...
## $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole <fctr> Sales Executive, Research Scientist,...
## $ JobSatisfaction <ord> Very High, Medium, High, High, Medium...
## $ MaritalStatus <fctr> Single, Married, Single, Married, Ma...
## $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ OverTime <fctr> Yes, No, Yes, Yes, No, No, Yes, No, ...
## $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating <ord> Excellent, Outstanding, Excellent, Ex...
## $ RelationshipSatisfaction <ord> Low, Very High, Medium, High, Very Hi...
## $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance <ord> Bad, Better, Better, Better, Better, ...
## $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...
## $ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
Displaying a sample of selected observations 500 to 505:
## Attrition BusinessTravel DailyRate Department
## 500 No Travel_Rarely 1216 Sales
## 501 No Travel_Rarely 646 Research & Development
## 502 No Travel_Frequently 160 Research & Development
## 503 No Travel_Rarely 238 Sales
## 504 No Travel_Rarely 1397 Research & Development
## 505 Yes Travel_Frequently 306 Sales
## DistanceFromHome Education EducationField EnvironmentSatisfaction
## 500 8 Master Marketing High
## 501 9 Master Life Sciences Low
## 502 3 Bachelor Medical High
## 503 1 Below College Medical Very High
## 504 1 Doctor Life Sciences Medium
## 505 26 Master Life Sciences Low
## Gender HourlyRate JobInvolvement JobLevel JobRole
## 500 Male 39 High 2 Sales Executive
## 501 Female 92 High 2 Research Scientist
## 502 Female 71 High 1 Research Scientist
## 503 Female 34 High 2 Sales Executive
## 504 Male 42 High 1 Research Scientist
## 505 Female 100 High 2 Sales Executive
## JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 500 High Divorced 7104 20431
## 501 Very High Married 6322 18089
## 502 High Divorced 2083 22653
## 503 Low Single 8381 7507
## 504 Very High Married 2691 7660
## 505 Low Married 4286 5630
## NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## 500 0 No 12 Excellent
## 501 1 Yes 12 Excellent
## 502 1 No 20 Outstanding
## 503 7 No 20 Outstanding
## 504 1 No 12 Excellent
## 505 2 No 14 Excellent
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## 500 Very High 0 6
## 501 Very High 1 6
## 502 High 1 1
## 503 Very High 0 18
## 504 Very High 1 10
## 505 Very High 2 5
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 500 3 Better 5
## 501 2 Good 6
## 502 2 Better 1
## 503 2 Best 14
## 504 4 Good 10
## 505 4 Better 1
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager Age
## 500 0 1 2 33
## 501 4 0 5 32
## 502 0 0 0 30
## 503 7 8 10 53
## 504 9 8 8 34
## 505 1 0 0 45
Using the psych library, we can get more in-depth statistics such as the standard deviation, mad (median absolute deviation), and skew (a measure of how asymmetric the distribution is):
## vars n mean sd median trimmed mad min max range
## Age 1 1470 36.92 9.14 36 36.47 8.90 18 60 42
## HourlyRate 2 1470 65.89 20.33 66 66.02 26.69 30 100 70
## YearsAtCompany 3 1470 7.01 6.13 5 5.99 4.45 0 40 40
## skew kurtosis se
## Age 0.41 -0.41 0.24
## HourlyRate -0.03 -1.20 0.53
## YearsAtCompany 1.76 3.91 0.16
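To make the `mad` and `skew` columns more concrete, here is a minimal base-R sketch on a small made-up vector (`x` is hypothetical and not taken from the dataset). It assumes `mad` is the median absolute deviation scaled by 1.4826, as in base R's `mad()`, and it uses a simple moment-based skewness, which may differ slightly from the estimator that psych applies.

```r
# Hypothetical toy vector, right-skewed on purpose (not real HR data)
x <- c(1, 2, 2, 3, 3, 3, 4, 10)

# Median absolute deviation, scaled so it is comparable to sd for normal data
mad_manual <- 1.4826 * median(abs(x - median(x)))

# Moment-based skewness: third central moment over the cubed population sd
m2 <- mean((x - mean(x))^2)
m3 <- mean((x - mean(x))^3)
skew_manual <- m3 / m2^1.5

c(sd = sd(x), mad = mad_manual, skew = skew_manual)
```

A positive skew, as for YearsAtCompany above, indicates a long right tail. A single extreme value affects the mad far less than it affects the standard deviation.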
As explained earlier, the dataset was built to analyze employee attrition at IBM; therefore, it contains a factor variable named Attrition with two levels (Yes, No) that indicates whether the employee with the given features left the company.
##
## No Yes
## 1233 237
Only 16% of the employees have left the company, that is 237 out of 1,470 employees. This points to an intrinsic issue with the dataset: it is clearly imbalanced towards employees who opted to stay at the company. Either way, let us build our first histogram to represent the Age variable of all employees.
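The class shares can be recomputed from the two counts in the table. The sketch below uses only those counts. The "always predict No" baseline is an added illustration of why the imbalance matters and is not part of the original analysis.

```r
# Counts taken from the Attrition table in this report
attrition <- c(No = 1233, Yes = 237)

# Share of each class, in percent
share <- round(100 * prop.table(attrition), 1)
share

# Accuracy of a trivial model that always predicts "No"
baseline_accuracy <- max(attrition) / sum(attrition)
baseline_accuracy
```

Any future predictive model would need to beat this baseline, so accuracy alone would be a weak yardstick for this dataset.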
The distribution of the Age variable looks roughly bell-shaped (close to a normal distribution), with the median and mean nearly equal and near the center. Let us draw the same histogram with additional measures of central tendency.
As we can see, the mean and median ages are more than 20 years away from retirement age, so the company seems to depend heavily on employees younger than 40 to carry out its operations. But does the company hire both genders almost equally? Let us find out.
Clearly, males dominate by almost a third.
Since the dataset has more than 30 variables, we will focus on variables that might strongly influence job satisfaction and hence help lower the attrition rate, which is a serious problem for many companies.
We will start by exploring more variables that may be closely tied to employees’ attrition.
Most employees rated their job involvement as High, which could correlate with their overall job satisfaction, whereas a small group of fewer than 100 employees reported the opposite.
The monthly salary variable is highly right-skewed, which makes sense in a corporate setting where operational staff make up the largest segment of the workforce and generally receive the lowest monthly pay. After plotting on a log10 scale, we notice a jump in the count of employees at a monthly salary of around 2,000. This might be caused by the salary difference between interns and full-time employees.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 1.000 2.188 3.000 15.000
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 581 357 159 52 61 45 32 76 18 17 6 24 10 10 9 13
This histogram shows how many years have passed since each employee’s last promotion. The mean is about 2.2 years (roughly two years and two months) and the median is one year, with a suspicious tail of 13 employees who have gone 15 years since their last promotion.
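The summary statistics can be cross-checked against the frequency table printed above. The sketch below rebuilds the mean and median from those counts alone. The vector names are illustrative.

```r
# Years since last promotion and the employee counts from the table above
years  <- 0:15
counts <- c(581, 357, 159, 52, 61, 45, 32, 76, 18, 17, 6, 24, 10, 10, 9, 13)

n <- sum(counts)                        # total number of employees
mean_years <- sum(years * counts) / n   # frequency-weighted mean
median_years <- median(rep(years, counts))

mean_years        # in years
mean_years * 12   # the same mean expressed in months
median_years
```

The gap between the mean and the median reflects the long right tail of the distribution.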
The IBM employees dataset contains 1,470 observations and 31 variables (after cleaning), as follows:
## [1] "Attrition" "BusinessTravel"
## [3] "DailyRate" "Department"
## [5] "DistanceFromHome" "Education"
## [7] "EducationField" "EnvironmentSatisfaction"
## [9] "Gender" "HourlyRate"
## [11] "JobInvolvement" "JobLevel"
## [13] "JobRole" "JobSatisfaction"
## [15] "MaritalStatus" "MonthlyIncome"
## [17] "MonthlyRate" "NumCompaniesWorked"
## [19] "OverTime" "PercentSalaryHike"
## [21] "PerformanceRating" "RelationshipSatisfaction"
## [23] "StockOptionLevel" "TotalWorkingYears"
## [25] "TrainingTimesLastYear" "WorkLifeBalance"
## [27] "YearsAtCompany" "YearsInCurrentRole"
## [29] "YearsSinceLastPromotion" "YearsWithCurrManager"
## [31] "Age"
The dataset includes factor variables such as:
Attrition (Yes, No). Gender (Male, Female). Marital Status (Married, Single, Divorced).
There are also ordered variables such as:
Job Involvement (Low, Medium, High, Very High). Work Life Balance (Bad, Good, Better, Best). Environment Satisfaction (Low, Medium, High, Very High).
Job Involvement, Work Life Balance, and Years Since Last Promotion are the main features for judging whether an employee is satisfied with their job, and the Attrition variable records the outcome.
I believe the dataset has other supporting features, such as Environment Satisfaction, since the workplace environment plays a critical role in an employee’s overall job satisfaction. Another supporting feature is Relationship Satisfaction, which takes into account how employees feel towards their managers. It is commonly said that “People Leave Managers, Not Companies”, and it would be worth investigating that later on.
No.
The variable Performance Rating exhibits an unexpected pattern: all employees have been rated as either Outstanding or Excellent, although, as stated in the data description, the variable has 4 ordered levels (Outstanding, Excellent, Good, Low). Could the company have a rating system that differs from corporate norms?
I converted the factor variables Education, WorkLifeBalance, JobInvolvement, JobSatisfaction, EnvironmentSatisfaction, and RelationshipSatisfaction to ordered variables, since their levels have a natural order.
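As a minimal sketch of this kind of conversion, the snippet below turns a character vector into an ordered factor using the Work Life Balance levels listed in this report. The vector `wlb` holds made-up responses and is not a sample from the dataset.

```r
# Hypothetical responses (not rows from the dataset)
wlb <- c("Better", "Bad", "Best", "Good", "Better")

# Declare the levels from worst to best and mark the factor as ordered
wlb_ord <- factor(wlb,
                  levels = c("Bad", "Good", "Better", "Best"),
                  ordered = TRUE)

is.ordered(wlb_ord)
wlb_ord >= "Better"   # comparisons now respect the level order
max(wlb_ord)
```

Once the order is declared, comparisons, sorting, and plot axes follow it instead of alphabetical order.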
I removed a variable named Over18, since all the listed employees are above 18, as well as the redundant variables that have no variability whatsoever (EmployeeCount, StandardHours, EmployeeNumber).
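One possible way to find columns with no variability is to count the distinct values per column. The sketch below runs on a tiny made-up data frame (`hr_demo` and its values are hypothetical; only the column names mirror this report). It is not necessarily the procedure used for the cleaning described above.

```r
# Hypothetical mini data frame; column names mirror the report, values are made up
hr_demo <- data.frame(
  Age           = c(41, 49, 37),
  Over18        = c("Y", "Y", "Y"),
  StandardHours = c(80, 80, 80),
  OverTime      = c("Yes", "No", "Yes")
)

# A column is constant when it holds a single distinct value
is_constant <- vapply(hr_demo, function(col) length(unique(col)) == 1, logical(1))

# Keep only the columns that actually vary
hr_clean <- hr_demo[, !is_constant, drop = FALSE]
names(hr_clean)
```

A column that varies but carries no analytic meaning would not be caught by this check and would have to be dropped by name instead.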
We will begin the bivariate section with a correlation table (Pearson correlation coefficients) that gives an overview of potential correlations between the dataset’s variables.
The table shows some strong but unsurprising correlations, such as Age and TotalWorkingYears; such ordinary relationships would not reveal interesting patterns or valuable insights. Consequently, we will turn our attention to correlations that are worth investigating further. For example, the table above shows a strong positive correlation between YearsSinceLastPromotion and YearsAtCompany. Does that mean the more years an employee spends serving the company, the less likely he/she is to get promoted? We will explore this pattern later on.
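A Pearson correlation table of this kind can be sketched in base R by keeping the numeric columns and passing them to `cor()`. The data frame below is a made-up toy example (`hr_demo` and all its values are hypothetical), so the resulting coefficients say nothing about the real dataset.

```r
# Hypothetical toy data; values are made up for illustration only
hr_demo <- data.frame(
  Age               = c(25, 32, 41, 50, 38, 29),
  TotalWorkingYears = c(3, 10, 18, 28, 15, 6),
  YearsAtCompany    = c(2, 5, 10, 20, 7, 1),
  Gender            = c("F", "M", "M", "F", "M", "F")
)

# Pearson correlation is only defined for numeric columns
numeric_cols <- vapply(hr_demo, is.numeric, logical(1))
cor_matrix <- cor(hr_demo[, numeric_cols], method = "pearson")

round(cor_matrix, 2)
```

The matrix is symmetric with ones on the diagonal, so only the upper or lower triangle needs to be read.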
Since our focus is on why employees may opt to leave the company, which drives up the problematic attrition level, it would be interesting to find the attrition percentage for each gender as an initial overview.
Approximately 6% and 10% of employees are females and males, respectively, who left the company. These two figures add up to the overall 16% attrition rate, so they appear to be shares of the whole workforce rather than rates within each gender. Now we shall dive into the main features alongside the supporting ones to uncover the underlying patterns and correlations.
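Percentages like these can be computed against two different denominators, and the choice changes the reading. The sketch below contrasts a share of the whole workforce with a rate within each gender. The counts in `tab` are made up and are not the real gender breakdown.

```r
# Hypothetical counts (NOT the real dataset): rows are gender, columns attrition
tab <- matrix(c(40, 10,
                35, 15),
              nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Female", "Male"),
                              Attrition = c("No", "Yes")))

# Share of ALL employees: the four cells sum to 100
share_of_all <- 100 * prop.table(tab)

# Rate WITHIN each gender: every row sums to 100
rate_by_gender <- 100 * prop.table(tab, margin = 1)

share_of_all
rate_by_gender
```

With these toy counts, leaving females are 10% of all employees but 20% of female employees, so stating the denominator avoids misreading the result.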
A notable difference appears when comparing the monthly salary distribution of employees who left against those who stayed: attrition is concentrated at 5,000 and below, with a minor bump at 10,000. A similar density chart for employees’ age shows a comparable pattern, where younger employees, up to around 35, had a higher chance of attrition. This could be because young employees have more agility and flexibility to land new jobs, whereas older employees tend to hold their current jobs until retirement.
Employees with Low job involvement tend to be younger than other groups. Could this be related to the usual obstacles new employees face, since it takes time to become accustomed to a company’s culture?
Additionally, employees who rated their work life balance as Best had shorter distances to their homes. This could be one reason why these employees like their jobs and are less likely to resign.
There seems to be a consistent pattern of a segment of employees reporting to the same manager throughout their entire career at the company. However, the story changes after 10 years at the company, where a notable percentage of employees either left the company or changed roles and hence report to a new manager.
Statistical summary of employees’ years with their current manager as a fraction of their tenure at the company:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0625 0.5000 0.6875 0.6803 0.8750 1.0000
We see that employees on average spend about two-thirds of their tenure at the company with the same manager and, surprisingly, some have reported to the same manager throughout their entire career at the company. Let us calculate the Pearson correlation coefficient for these two variables.
##
## Pearson's product-moment correlation
##
## data: hr$YearsAtCompany and hr$YearsWithCurrManager
## t = 46.123, df = 1468, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.7474818 0.7892985
## sample estimates:
## cor
## 0.7692124
The Pearson correlation coefficient is roughly 0.77, which represents a strong positive relationship. However, we should keep in mind that YearsWithCurrManager can never exceed YearsAtCompany, so some positive correlation is expected by construction.
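The test statistic and confidence interval in the output can be reproduced from the reported correlation and sample size alone. The sketch below uses the standard t statistic for a Pearson correlation and a Fisher z-transformation for the interval. The variable names are illustrative.

```r
r <- 0.7692124   # sample correlation reported in the output above
n <- 1470        # number of observations, so df = n - 2 = 1468

# t statistic for testing whether the true correlation is zero
t_stat <- r * sqrt((n - 2) / (1 - r^2))

# 95% confidence interval via the Fisher z-transformation
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)

t_stat
ci
```

With a sample this large the interval is narrow, so the estimate of about 0.77 is precise even though part of the relationship is structural.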
Are specific education fields associated with attrited employees?
## # A tibble: 6 x 3
## # Groups: EducationField [6]
## EducationField Attrition Total
## <fctr> <fctr> <int>
## 1 Life Sciences Yes 89
## 2 Medical Yes 63
## 3 Marketing Yes 35
## 4 Technical Degree Yes 32
## 5 Other Yes 11
## 6 Human Resources Yes 7
Employees who majored in Life Sciences accounted for the highest number of attrited employees, followed by Medical and Marketing.
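The table lists counts of leavers, so the percentages it supports directly are each field's share of all attrited employees. The sketch below computes those shares from the counts above. An attrition rate per field would also need the total headcount of each field, which the table does not show.

```r
# Attrited employees per education field, copied from the table above
leavers <- c("Life Sciences" = 89, "Medical" = 63, "Marketing" = 35,
             "Technical Degree" = 32, "Other" = 11, "Human Resources" = 7)

total_leavers <- sum(leavers)

# Each field's share of all attrited employees, in percent
share_of_leavers <- round(100 * leavers / total_leavers, 1)

total_leavers
share_of_leavers
```

A field with many employees will naturally contribute many leavers, so a high share here does not by itself mean a high attrition rate within that field.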
Clearly, this bar graph shows that employees who travelled rarely had the highest attrition level, whereas employees who were not required to travel had the lowest.
It seems that the company prioritizes employees’ continuous development by offering most of them two training sessions per year. Even the attrited employees had the chance to be trained, which suggests that the availability of training is not a likely factor in their attrition.
Surprisingly, about 80% of employees who stayed at the company were not required to work overtime, whereas the vast majority of attrited employees were asked to work overtime. Could stress and job pressure have led them to leave the company?
Since the dataset has more than 30 variables, many of them are correlated with each other, sometimes in unusual ways. For example, the business travel variable showed an unexpected relationship with attrition for employees who travel rarely. Moreover, employees who worked overtime had a higher probability of leaving the company than those who did not.
The education field variable also behaved unexpectedly with respect to attrition: Life Sciences accounted for the largest group of attrited employees. Likewise, employees apparently value a shorter distance to their homes, and that was reflected in Best responses to the work life balance question.
Overtime showed a strong relationship with employees’ attrition, which might be a result of employees not being rewarded (or promoted) for their extra effort.
Now that we have explored two variables at a time, we will explore more variables together alongside their relationship to attrition.
This collection of boxplots sheds light on how distance from home could positively correlate with attrition. We can see that, for each combination of gender and marital status, the median commute of attrited employees was longer than that of those who stayed.
The scatter plot above reflects quite an interesting behavior. Apparently, diligent employees who put in more effort by working overtime seem to take more years, on average, to get promoted. Such treatment by the company could demotivate hardworking employees and lead them to leave the company.
A strong relationship was observed when plotting attrition, distance from home, and marital status: the longer the distance from home, the higher the chance of an employee’s attrition, and the same pattern repeats for each marital status (Married, Divorced, and Single).
Another important relationship is that being a hard worker (working overtime) appears to result in unfair treatment, where promotions tend to take more years to arrive. In contrast, employees who do not work overtime are likely to be promoted more quickly.
Overtime’s impact on employees’ attrition.
In these multivariate boxplots, a notable pattern appears: the longer the distance to home, the more likely an employee is to leave the company, and this pattern is stronger for single male employees.
In this last scatter plot, with the use of a best-fitting line (trend line), we notice an interesting (and rather unexpected) relationship between being a hard worker and slower promotions.
We explored a dataset of HR records that was synthetically created by IBM’s data scientists. After cleaning, the dataset contained 1,470 observations and 31 variables describing employees’ personal and work-related characteristics. We then analyzed and visualized the dataset in univariate, bivariate, and multivariate sections using various statistical measures and charts.
One struggle I encountered at the beginning was the sheer number of variables combined with my limited domain knowledge of HR analytics. After a few explorations and visualizations, I started to build an intuition that helped me dive with confidence into exploring complex relationships between variables and extracting valuable insights.
In the end, we have seen strong features that could help determine employees’ attrition, such as Over Time, Distance From Home, and Years Since Last Promotion, which can be incorporated into a predictive model as the next step of the data science cycle.