Employee churn is a problem that many firms face. In the US, firms face an average churn of about 10%-15%, and this churn can prove costly. It is estimated that the unwanted departure of an employee can cost the firm anywhere from 30% of the employee’s annual salary for more junior employees to 400% for much more senior employees. The difficulty of finding a replacement and bringing that replacement to the same level of productivity, the lost knowledge and know-how, and the period during which colleagues have to carry the extra load of a missing team member can be serious problems, especially for firms that face a higher rate of attrition. As such, many firms are trying to solve this issue and have put in place more measures to boost employee retention, such as training programmes, career progression and the promise of better work-life balance.
The arrival of data science can help predict the probability of employee churn and so help alleviate this problem. In fact, in the last couple of years, IBM has built algorithms that help predict the likelihood of its employees leaving. The company claims that it has reached a 95% level of accuracy and has clocked retention cost savings to the tune of $300 million. Together with other data science tools that help keep employees more committed to their jobs (e.g. by offering fairer compensation and providing internal movement opportunities), this puts IBM at the forefront of employee retention through data science.
Since many of us have faced and will face employee churn, our group believes that learning how an employee churn model works, and what business insights and actions we can glean from it, would be very useful. We therefore decided to create our own employee churn model using IBM data, given that this data is available online. What we hope to achieve is an understanding of the drivers of churn, the consequent actions businesses would take, as well as the know-how of building such a model, which we can replicate in the future for our own companies.
The Process
Part 1: Approach & Analysis
Part 2: Business Insights
Part 3: Strategy & Recommendations
On first previewing the data set, we quickly noticed some strange data that appeared to have been created mathematically (for example, the daily rate was not a consistent multiple of the monthly rate for each employee). This created a lot of issues and signaled to us that the data set was flawed.
We therefore decided to clean the data substantially, removing employee count, application ID, employee number, over 18 and standard hours, as well as mathematically inconsistent fields such as daily rate, hourly rate and monthly rate, which had near-zero correlation with one another.
Some data was missing, so we replaced it with either the median (for numerical values) or “other” (for non-numerical values).
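The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the original analysis does not show its code, the function name `clean_rows` is hypothetical, and the three rows are made-up toy values rather than records from the real data set.

```python
from statistics import median


def clean_rows(rows, drop_columns):
    """Drop unwanted columns, then fill gaps: median for numbers, "Other" for text."""
    # Remove identifier-like or constant columns from every row.
    kept = [{k: v for k, v in row.items() if k not in drop_columns} for row in rows]

    # Assumes every row has the same columns as the first one.
    columns = list(kept[0]) if kept else []
    for col in columns:
        present = [row[col] for row in kept if row[col] is not None]
        is_numeric = bool(present) and all(isinstance(v, (int, float)) for v in present)
        # Median for numerical columns; the literal label "Other" otherwise.
        fill = median(present) if is_numeric else "Other"
        for row in kept:
            if row[col] is None:
                row[col] = fill
    return kept


# Toy rows (invented for illustration only).
rows = [
    {"EmployeeNumber": 1, "StandardHours": 80, "MonthlyIncome": 3000, "EducationField": "Medical"},
    {"EmployeeNumber": 2, "StandardHours": 80, "MonthlyIncome": None, "EducationField": None},
    {"EmployeeNumber": 3, "StandardHours": 80, "MonthlyIncome": 5000, "EducationField": "Marketing"},
]
cleaned = clean_rows(rows, drop_columns={"EmployeeNumber", "StandardHours"})
print(cleaned[1])  # {'MonthlyIncome': 4000.0, 'EducationField': 'Other'}
```

Using the median rather than the mean keeps the imputed value robust to a few very high earners, which matters for skewed fields such as income.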
Per the online documentation for the dataset, the definitions are as follows:
Name | Description |
---|---|
Attrition | Employment status at IBM (Possible Values: Current Employee, Voluntary Resignation, Termination) |
Age | Current age of employee (Possible Values: 18+) |
BusinessTravel | How often does the employee travel for business (Possible Values: Non-Travel, Travel_Rarely, Travel_Frequently) |
DailyRate | How much the employee can earn in a given day (in USD) |
Department | Current department of employee (Possible Values: Human Resources, Research & Development, Sales) |
DistanceFromHome | Miles from employee’s home |
Education | Most recent degree achieved (Possible Values: 1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4 ‘Master’ 5 ‘Doctor’) |
EducationField | Field of most recent study (Possible Values: Human Resources, Life Sciences, Marketing, Medical, Technical Degree, Other) |
EmployeeNumber | Employee ID number |
EnvironmentSatisfaction | Employee satisfaction with their work environment (Possible Values: 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’) |
Gender | Gender (Male/Female) |
HourlyRate | Current hourly rate for job in USD |
JobInvolvement | Self-rated assessment describing how involved they must be at their job (Possible Values: 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’) |
JobLevel | Current job level at the organization (out of 5) (Possible Values: 1 - Intern, 2 - Junior, 3 - Mid-Level, 4 - Senior, 5 - Director) |
JobRole | Employee’s current role at the company (Possible Values: Healthcare Representative, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director, Research Scientist, Sales Executive, Sales Representative) |
JobSatisfaction | Employee rated satisfaction of job on most recent company survey (Possible Values: 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’) |
MaritalStatus | Marital status (Possible Values: Single, Married, Divorced) |
MonthlyIncome | Most recent earned income in USD |
MonthlyRate | Current monthly rate for job in USD |
NumCompaniesWorked | Number of companies worked at before current company |
OverTime | Whether the employee must work overtime for their job (Possible Values: Yes/No) |
PercentSalaryHike | Percent of salary raised last year |
PerformanceRating | Performance rating by manager last year (Possible Values: 1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4 ‘Outstanding’) |
RelationshipSatisfaction | Employee satisfaction in current relationship (Possible Values: 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’) |
StockOptionLevel | Current stock option level, out of three (Possible Values: 0 - 3) |
TotalWorkingYears | Number of years working overall |
TrainingTimesLastYear | Number of trainings received last year |
WorkLifeBalance | Employee’s rating of their current work-life balance (Possible Values: 1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’) |
YearsAtCompany | Number of years at current company |
YearsInCurrentRole | Number of years at current role at current company |
YearsSinceLastPromotion | Number of years since last promoted |
YearsWithCurrManager | Number of years working with current manager |
We checked the correlations between the variables and found that many of them were correlated with one another, so we judged that dimensionality reduction would make the data easier to explain.
We used a PCA function to determine how many components the variables could be reduced to.
This is where we had to take a strategic hybrid approach. The model told us we could reduce the data to 5 factors, and some of the groupings made sense, such as the correlation between years in a job, years in the company and years with current manager. Others did not make sense, for example Marital Status and Stock Options, so we decided to keep those variables separate. That way, some variables were condensed into a meaningful factor while others were kept independent. In the end, we used PCA to reduce the data from 33 to 15 factors.
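To make the "how many components" step concrete, here is a minimal sketch of reading the number of components off the explained variance. It assumes NumPy is available; the function names, the 90% threshold and the tiny matrix are all illustrative assumptions, not the settings or data used in the analysis above. The first two toy columns stand in for two tenure variables that move together, the third for an unrelated variable.

```python
import numpy as np


def explained_variance(X):
    """Share of total variance captured by each principal component."""
    # PCA on standardized data is an eigen-decomposition of the correlation matrix.
    corr = np.corrcoef(X, rowvar=False)
    eigenvalues = np.linalg.eigvalsh(corr)[::-1]   # largest first
    eigenvalues = np.clip(eigenvalues, 0.0, None)  # remove tiny negative round-off
    return eigenvalues / eigenvalues.sum()


def components_needed(ratios, threshold=0.9):
    """Smallest number of components whose cumulative share reaches the threshold."""
    return int(np.searchsorted(np.cumsum(ratios), threshold) + 1)


# Toy data: column 2 is exactly twice column 1, column 3 is uncorrelated with both.
X = np.array([[1, 2, 1], [2, 4, -1], [3, 6, -1], [4, 8, 1]], dtype=float)
ratios = explained_variance(X)
print(np.round(ratios, 3))
print(components_needed(ratios))
```

In this toy case the two perfectly correlated columns collapse into one component, so two components carry all of the variance. Whether a component is also meaningful to the business is a separate judgment, which is the reason for the hybrid approach described above.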
Component | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
---|---|---|---|---|---|---|---|---|---|---|
Component(Factor) 1 | -0.40 | -0.55 | -0.28 | -0.45 | -0.56 | -0.45 | -0.40 | -0.28 | -0.40 | -1.21 |
Component(Factor) 2 | 0.07 | 0.21 | -0.04 | 0.07 | 0.17 | 0.07 | 0.07 | -0.04 | 0.07 | 2.35 |
Component(Factor) 3 | 1.99 | 0.26 | 1.28 | -0.22 | 0.31 | -0.22 | 1.99 | 1.28 | 1.99 | -0.02 |
Component(Factor) 4 | -1.11 | -1.71 | -1.17 | -0.83 | -1.63 | -0.83 | -1.11 | -1.17 | -1.11 | -0.53 |
Component(Factor) 5 | -0.70 | 1.11 | -0.71 | -0.48 | -0.50 | -0.48 | -0.70 | -0.71 | -0.70 | -0.73 |
Component(Factor) 6 | -1.50 | -0.10 | -1.58 | 0.08 | 0.09 | 0.08 | -1.50 | -1.58 | -1.50 | -1.52 |
We performed CTREE classification on the cleaned data set and reached an accuracy of 84.0%, roughly on par with the no-information rate of 84.2%.
Confusion Matrix and Statistics
Prediction | Reference 0 | Reference 1 |
---|---|---|
0 | 1710 | 105 |
1 | 272 | 266 |

Statistic | Value |
---|---|
Accuracy | 0.8398 |
95% CI | (0.8243, 0.8544) |
No Information Rate | 0.8423 |
P-Value [Acc > NIR] | 0.6455 |
Kappa | 0.4901 |
Mcnemar’s Test P-Value | <2e-16 |
Sensitivity | 0.7170 |
Specificity | 0.8628 |
Pos Pred Value | 0.4944 |
Neg Pred Value | 0.9421 |
Prevalence | 0.1577 |
Detection Rate | 0.1130 |
Detection Prevalence | 0.2286 |
Balanced Accuracy | 0.7899 |
'Positive' Class | 1 |
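
The headline statistics in the output above follow directly from the four cells of the confusion matrix. As a sanity check, the sketch below recomputes a few of them in Python; the function and variable names are illustrative, since the original analysis does not show its code. Note that the accuracy of 0.8398 sits just below the no-information rate of 0.8423, i.e. the score obtained by always predicting that an employee stays, which is why sensitivity and positive predictive value are worth reading alongside it.

```python
def confusion_metrics(tn, fn, fp, tp):
    """Derive headline statistics from the four cells of a 2x2 confusion matrix."""
    total = tn + fn + fp + tp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),      # share of actual leavers that were flagged
        "specificity": tn / (tn + fp),      # share of actual stayers left unflagged
        "pos_pred_value": tp / (tp + fp),   # share of flagged employees who really left
        # Accuracy of always guessing the majority class.
        "no_information_rate": max(tp + fn, tn + fp) / total,
    }


# Cell counts copied from the matrix above; class 1 (leaving) is the positive class.
m = confusion_metrics(tn=1710, fn=105, fp=272, tp=266)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Read this way, the model catches roughly 72% of the employees who leave, but only about half of the employees it flags actually do so.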
Confusion Matrix and Statistics
Prediction | Reference 0 | Reference 1 |
---|---|---|
0 | 1209 | 137 |
1 | 773 | 234 |

Statistic | Value |
---|---|
Accuracy | 0.6133 |
95% CI | (0.5932, 0.633) |
No Information Rate | 0.8423 |
P-Value [Acc > NIR] | 1 |
Kappa | 0.1419 |
Mcnemar’s Test P-Value | <2e-16 |
Sensitivity | 0.63073 |
Specificity | 0.60999 |
Pos Pred Value | 0.23237 |
Neg Pred Value | 0.89822 |
Prevalence | 0.15767 |
Detection Rate | 0.09945 |
Detection Prevalence | 0.42796 |
Balanced Accuracy | 0.62036 |
'Positive' Class | 1 |

Based on the dimensionality reduction performed above, we performed a clustering analysis on the data. We tried different methods (hclust and kmeans) and found hclust to work best. Six clusters was the optimal number, based on the elbow of the curve that plots the distance between clusters as a function of the number of clusters.
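
The idea of reading the number of clusters off the merge distances can be sketched as follows. This is a toy illustration in plain Python, not the analysis itself: it assumes complete linkage (the linkage actually used is not stated), it replaces reading the elbow by eye with a simple "largest jump" rule, and the eight 2-D points are invented so that three groups are obvious.

```python
import math
from itertools import combinations


def complete_linkage(a, b):
    """Distance between two clusters = largest distance between their points."""
    return max(math.dist(p, q) for p in a for q in b)


def merge_heights(points):
    """Agglomerative clustering; return the linkage distance at each merge."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        # Find the pair of clusters that is closest under complete linkage.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: complete_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        heights.append(complete_linkage(clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return heights


def clusters_at_largest_jump(heights):
    """Cut the tree just before the biggest jump in merge distance."""
    n_points = len(heights) + 1
    jumps = [later - earlier for earlier, later in zip(heights, heights[1:])]
    biggest = jumps.index(max(jumps))
    # After merges 0..biggest, n_points - (biggest + 1) clusters remain.
    return n_points - (biggest + 1)


# Toy points (invented): three tight groups placed far apart from one another.
points = [(0, 0), (1, 0), (0, 2), (100, 0), (101, 1), (50, 87), (51, 88), (50, 89)]
heights = merge_heights(points)
print(clusters_at_largest_jump(heights))  # 3
```

On real data the curve is rarely this clean, so the automatic rule is best treated as a starting point and checked against a plot of the merge distances.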

Average value of each variable, for the whole population and for each segment:

Variable | Population | Seg.1 | Seg.2 | Seg.3 | Seg.4 | Seg.5 | Seg.6 |
---|---|---|---|---|---|---|---|
Is_Resigning | 1.16 | 1.15 | 1.08 | 1.20 | 1.13 | 1.16 | 1.16 |
YearsInCurrentRole | 4.22 | 5.10 | 6.39 | 2.54 | 4.24 | 5.67 | 6.41 |
YearsWithCurrManager | 4.12 | 4.87 | 6.17 | 2.54 | 4.26 | 5.16 | 6.40 |
YearsSinceLastPromotion | 2.18 | 2.12 | 4.53 | 1.24 | 1.82 | 3.65 | 3.98 |
TotalWorkingYears | 11.26 | 11.19 | 25.49 | 6.15 | 9.65 | 16.02 | 22.12 |
JobLevel | 2.06 | 2.33 | 4.48 | 1.06 | 1.82 | 2.90 | 3.61 |
MonthlyIncome | 6502.81 | 7142.18 | 17966.03 | 2580.68 | 4750.85 | 10430.22 | 13484.47 |
Department | 3.22 | 3.33 | 3.13 | 3.10 | 3.27 | 3.32 | 3.29 |
PerformanceRating | 3.16 | 3.15 | 3.15 | 3.16 | 3.15 | 3.15 | 3.17 |
EducationField | 3.24 | 3.20 | 3.09 | 3.23 | 3.30 | 3.28 | 3.44 |
DistanceFromHome | 9.19 | 9.42 | 8.27 | 9.08 | 9.15 | 9.82 | 10.26 |
Education | 2.91 | 2.95 | 2.91 | 2.83 | 2.94 | 2.99 | 3.09 |
PercentSalaryHike | 15.21 | 15.19 | 14.99 | 15.40 | 15.11 | 15.11 | 15.10 |
Age | 36.91 | 37.44 | 41.94 | 34.83 | 36.38 | 38.23 | 41.48 |
JobRole | 5.96 | 5.92 | 5.09 | 6.13 | 6.16 | 5.70 | 6.05 |
NumCompaniesWorked | 2.69 | 2.84 | 2.25 | 2.56 | 2.80 | 2.89 | 2.94 |
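
Relative difference between each segment average and the population average (segment average divided by population average, minus 1):

Variable | Seg.1 | Seg.2 | Seg.3 | Seg.4 | Seg.5 | Seg.6 |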
---|---|---|---|---|---|---|
Is_Resigning | -0.01 | -0.06 | 0.04 | -0.02 | 0.00 | 0.00 |
YearsInCurrentRole | 0.21 | 0.51 | -0.40 | 0.00 | 0.34 | 0.52 |
YearsWithCurrManager | 0.18 | 0.50 | -0.38 | 0.03 | 0.25 | 0.55 |
YearsSinceLastPromotion | -0.03 | 1.08 | -0.43 | -0.16 | 0.68 | 0.83 |
TotalWorkingYears | -0.01 | 1.26 | -0.45 | -0.14 | 0.42 | 0.96 |
JobLevel | 0.13 | 1.17 | -0.49 | -0.12 | 0.41 | 0.75 |
MonthlyIncome | 0.10 | 1.76 | -0.60 | -0.27 | 0.60 | 1.07 |
Department | 0.04 | -0.03 | -0.04 | 0.02 | 0.03 | 0.02 |
PerformanceRating | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 |
EducationField | -0.01 | -0.05 | 0.00 | 0.02 | 0.01 | 0.06 |
DistanceFromHome | 0.02 | -0.10 | -0.01 | 0.00 | 0.07 | 0.12 |
Education | 0.02 | 0.00 | -0.03 | 0.01 | 0.03 | 0.06 |
PercentSalaryHike | 0.00 | -0.01 | 0.01 | -0.01 | -0.01 | -0.01 |
Age | 0.01 | 0.14 | -0.06 | -0.01 | 0.04 | 0.12 |
JobRole | -0.01 | -0.15 | 0.03 | 0.03 | -0.04 | 0.01 |
NumCompaniesWorked | 0.06 | -0.16 | -0.05 | 0.04 | 0.07 | 0.09 |

The same relative differences, showing only those larger than 0.1 in absolute value:

Variable | Seg.1 | Seg.2 | Seg.3 | Seg.4 | Seg.5 | Seg.6 |
---|---|---|---|---|---|---|
Is_Resigning | | | | | | |
YearsInCurrentRole | 0.21 | 0.51 | -0.40 | | 0.34 | 0.52 |
YearsWithCurrManager | 0.18 | 0.50 | -0.38 | | 0.25 | 0.55 |
YearsSinceLastPromotion | | 1.08 | -0.43 | -0.16 | 0.68 | 0.83 |
TotalWorkingYears | | 1.26 | -0.45 | -0.14 | 0.42 | 0.96 |
JobLevel | 0.13 | 1.17 | -0.49 | -0.12 | 0.41 | 0.75 |
MonthlyIncome | | 1.76 | -0.60 | -0.27 | 0.60 | 1.07 |
Department | | | | | | |
PerformanceRating | | | | | | |
EducationField | | | | | | |
DistanceFromHome | | | | | | 0.12 |
Education | | | | | | |
PercentSalaryHike | | | | | | |
Age | | 0.14 | | | | 0.12 |
JobRole | | -0.15 | | | | |
NumCompaniesWorked | | -0.16 | | | | |
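
The segment profile figures can be reproduced from the averages in the first table: each entry is the segment average divided by the population average, minus 1. The short Python sketch below checks this for MonthlyIncome; the function names and the 0.1 cut-off applied to the rounded values are assumptions inferred from the tables rather than taken from the original code.

```python
def relative_to_population(segment_means, population_mean):
    """Relative difference of each segment mean from the population mean."""
    return [round(m / population_mean - 1, 2) for m in segment_means]


def keep_notable(differences, cutoff=0.1):
    """Blank out (None) the differences that do not exceed the cutoff in absolute value."""
    return [d if abs(d) > cutoff else None for d in differences]


# MonthlyIncome averages copied from the first table (population, then Seg.1 to Seg.6).
population = 6502.81
segments = [7142.18, 17966.03, 2580.68, 4750.85, 10430.22, 13484.47]

income_profile = relative_to_population(segments, population)
print(income_profile)                # [0.1, 1.76, -0.6, -0.27, 0.6, 1.07]
print(keep_notable(income_profile))  # [None, 1.76, -0.6, -0.27, 0.6, 1.07]
```

A value of 1.76 for Seg.2 therefore means that the average monthly income in that segment is 176% above the population average, while -0.60 for Seg.3 means it is 60% below.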

Based on our analysis, we grouped IBM’s employees into six clusters for segmentation. Each cluster represents a “persona” of people for whom we can propose a solution. This methodology yielded a blend of demographic and professional background information for each persona.
From these groupings, we can conclude that the “Rookies” are in the most dire need of attention, with a turnover ratio of 20%, followed by the “High-performing Mid-level” group at 16%. The table below depicts the average persona of a risky employee as well as the variables that define the cluster with the highest liability.
Based on historical data and values, we propose a solution that combines reactive and proactive approaches.
As a proactive approach, we acknowledge the variance between individuals within each cluster, so the hiring process should not immediately rule out candidates purely because they fit a certain segment. Instead, IBM should incorporate additional questions during the hiring process for candidates who fit the profile with an average of 6 years of experience (the segment with an average turnover rate of 20%). This way, IBM can identify risk profiles and start gathering measurable data points at an early stage of each candidate’s lifecycle.
In addition, IBM should periodically deepen its understanding of employees in the “danger zone” clusters, namely the rookies and the high-performing mid-level employees. A bi-annual assessment of current employee sentiment should be carried out and shared between division managers and HR. Policies around hiring, benefits, retention, and learning & development should be designed around the variables and clusters identified above, and adjustments should be considered on an annual basis. Even so, some employees may be leaving IBM for reasons that are out of scope or too financially burdensome for HR to address.
Alternatively, if employees are leaving the company for reasons that are within IBM’s control, reactive measures should be taken. Each division manager should bear the responsibility of understanding the recognition, financial, and progression needs of each individual. These data points will feed into an updated churn risk ratio. Using these qualitative and quantitative factors, reactive rewards can be implemented. It is also crucial to incorporate external market data on compensation by job function. Together, these measures can help create a dynamic incentive and severance package that increases overall satisfaction and improves the mix of incentives (stock vs. salary) at IBM.
Lastly, organizational behavior should be observed throughout each employee’s tenure. Culture is typically a significant factor in daily happiness and supports the vision of the company, and both firm-wide and divisional culture should be considered. For example, how employees interact with one another, especially with “danger zone” employees, is an important factor to consider. If an individual spreads dissatisfaction with the company to those around them, the number of employees at flight risk can rise.