Business Problem

Employee churn is a problem that many firms face. In the US, firms face an average churn of about 10%-15%, and this churn can prove costly. It is estimated that an unwanted departure can cost the firm anywhere from 30% of the employee’s annual salary for more junior employees to 400% for much more senior employees. The difficulty of finding a replacement and bringing that replacement to the same level of productivity, the lost knowledge and know-how, and the period during which colleagues have to carry the extra load can all be serious problems, especially for firms with a higher rate of attrition. As a result, many firms are trying to solve this issue and have put in place more measures to boost employee retention, such as training programmes, career progression and the promise of better work-life balance.

Data science can help predict the probability of employee churn and so help alleviate this problem. In fact, in the last couple of years, IBM has built algorithms that help predict the likelihood of its employees leaving. The company claims that it has reached a 95% level of accuracy and has clocked retention cost savings to the tune of $300 million. Coupled with other data science tools that help keep employees more committed to their jobs (e.g. by giving fairer compensation or providing internal movement opportunities), IBM is leading the way in employee retention through data science.

Since many of us have faced and will face employee churn, our group believes it would be very useful to learn how an employee churn model works and what business insights and actions we can glean from it. We therefore decided to create our own employee churn model, albeit with IBM data, given the availability of this data online. What we hope to achieve is an understanding of the drivers of churn, the consequent actions businesses would take, and the know-how of building such a model, which we can replicate in the future for our own companies.

The Process

Part 1: Approach & Analysis

Part 2: Business Insights

Part 3: Strategy & Recommendations

Part 1: Approach & Analysis

The Data

Just from previewing the data set, we quickly noticed some strange values that appeared to be mathematically generated (for example, the daily rate was not a consistent multiple of each employee’s monthly rate). This created a lot of issues and signaled to us that the data set was flawed.
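One way to make this consistency check concrete is to look at the ratio of monthly rate to daily rate per employee: if one were a fixed multiple of the other, the ratio would be nearly constant. The sketch below is a minimal illustration in Python, not the code used for this report. The three records are made-up values, and `ratio_spread` is a hypothetical helper.

```python
from statistics import mean, pstdev

# Made-up records for illustration only; not rows from the IBM data set.
employees = [
    {"DailyRate": 1100, "MonthlyRate": 19500},
    {"DailyRate": 280, "MonthlyRate": 24900},
    {"DailyRate": 1370, "MonthlyRate": 2400},
]

def ratio_spread(rows, numerator, denominator):
    """Return the coefficient of variation of numerator/denominator across rows.

    A value near 0 means one column is a consistent multiple of the other;
    a large value means the two columns are not related by a fixed factor.
    """
    ratios = [row[numerator] / row[denominator] for row in rows]
    return pstdev(ratios) / mean(ratios)

spread = ratio_spread(employees, "MonthlyRate", "DailyRate")
print(f"coefficient of variation of MonthlyRate / DailyRate: {spread:.2f}")
```

A spread close to zero would support keeping the columns; a spread as large as the one these made-up rows produce is the kind of signal that led us to distrust the rate fields.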

We therefore cleaned the data extensively, removing the employee count, application ID, employee number, over-18 flag and standard hours columns, as well as the mathematically inconsistent daily rate, hourly rate and monthly rate columns, which had near-zero correlation with one another.

Some data was missing, so we replaced it by either the median (for numerical values) or “other.”
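The two cleaning steps described above can be sketched as follows. This is an illustrative Python version under stated assumptions, not the report's own code: the rows are made up, `None` stands for a missing value, and `drop_constant_columns` and `impute` are hypothetical helper names. Dropping single-valued columns is one plausible reason for removing fields such as standard hours.

```python
from statistics import median

# Made-up rows for illustration; None marks a missing value.
rows = [
    {"Age": 41, "EducationField": "Medical", "StandardHours": 80},
    {"Age": None, "EducationField": None, "StandardHours": 80},
    {"Age": 33, "EducationField": "Marketing", "StandardHours": 80},
    {"Age": 29, "EducationField": "Medical", "StandardHours": 80},
]

def drop_constant_columns(rows):
    """Remove columns that hold a single distinct value and so carry no signal."""
    columns = rows[0].keys()
    keep = [c for c in columns if len({row[c] for row in rows}) > 1]
    return [{c: row[c] for c in keep} for row in rows]

def impute(rows):
    """Fill missing numbers with the column median and missing text with 'Other'."""
    filled = [dict(row) for row in rows]
    for column in rows[0].keys():
        present = [row[column] for row in rows if row[column] is not None]
        numeric = all(isinstance(v, (int, float)) for v in present)
        fill = median(present) if numeric else "Other"
        for row in filled:
            if row[column] is None:
                row[column] = fill
    return filled

cleaned = impute(drop_constant_columns(rows))
print(cleaned[1])
```

The median is used rather than the mean so that a few extreme values (for example, very high incomes) do not distort the filled-in entries.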

Data Dictionary

Per the online documentation for the dataset, the definitions are as follows:

Name	Description
Attrition	Employment status at IBM (Possible Values: Current Employee, Voluntary Resignation, Termination)
Age	Current age of employee (Possible Values: 18+)
BusinessTravel	How often does the employee travel for business (Possible Values: Non-Travel, Travel_Rarely, Travel_Frequently)
DailyRate	How much the employee can earn in a given day (in USD)
Department	Current department of employee (Possible Values: Human Resources, Research & Development, Sales)
DistanceFromHome	Miles from employee’s home
Education	Most recent degree achieved (Possible Values: 1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4 ‘Master’ 5 ‘Doctor’)
EducationField	Field of most recent study (Possible Values: Human Resources, Life Sciences, Marketing, Medical, Technical Degree, Other)
EmployeeNumber	Employee ID number
EnvironmentSatisfaction	Employee satisfaction with their work environment (Possible Values: 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’)
Gender	Gender (Male/Female)
HourlyRate	Current hourly rate for job in USD
JobInvolvement	Self-rated assessment describing how involved they must be at their job (Possible Values: 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’)
JobLevel	Current job level at the organization (out of 5) (Possible Values: 1 - Intern, 2 - Junior, 3 - Mid-Level, 4 - Senior, 5 - Director)
JobRole	Employee’s current role at the company (Possible Values: Healthcare Representative, Human Resources, Laboratory Technician, Manager, Manufacturing Director, Research Director, Research Scientist, Sales Executive, Sales Representative)
JobSatisfaction	Employee rated satisfaction of job on most recent company survey (Possible Values: 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’)
MaritalStatus	Marital status (Possible Values: Single, Married, Divorced)
MonthlyIncome	Most recent earned income in USD
MonthlyRate	Current monthly rate for job in USD
NumCompaniesWorked	Number of companies worked at before current company
OverTime	Whether the employee must work overtime for their job (Possible Values: Yes/No)
PercentSalaryHike	Percent of salary raised last year
PerformanceRating	Performance rating by manager last year (Possible Values: 1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4 ‘Outstanding’)
RelationshipSatisfaction	Employee satisfaction in current relationship (Possible Values: 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’)
StockOptionLevel	Current stock option level, out of three (Possible Values: 0 - 3)
TotalWorkingYears	Number of years working overall
TrainingTimesLastYear	Number of trainings received last year
WorkLifeBalance	Employee’s rating of their current work-life balance (Possible Values: 1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’)
YearsAtCompany	Number of years at current company
YearsInCurrentRole	Number of years at current role at current company
YearsSinceLastPromotion	Number of years since last promoted
YearsWithCurrManager	Number of years working with current manager

Dimensionality Reduction

Step 1: Checking the data

Step 2: Check Correlations

We checked the correlations between variables and found that many of them were correlated, so we judged that dimensionality reduction would make the data easier to explain.
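As an illustration of this step, the sketch below computes pairwise Pearson correlations and lists the pairs above a cut-off. It is a minimal Python example with made-up values for five employees; the 0.6 cut-off and the helper names are assumptions chosen for illustration, not settings taken from our analysis.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def strongly_correlated_pairs(columns, threshold=0.6):
    """List column pairs whose absolute correlation is at least `threshold`."""
    names = list(columns)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            r = pearson(columns[first], columns[second])
            if abs(r) >= threshold:
                pairs.append((first, second, round(r, 2)))
    return pairs

# Made-up values for five employees; illustration only.
columns = {
    "YearsAtCompany": [1, 3, 5, 8, 10],
    "YearsInCurrentRole": [1, 2, 4, 7, 8],
    "DistanceFromHome": [9, 2, 25, 4, 11],
}
pairs = strongly_correlated_pairs(columns)
print(pairs)
```

Groups of variables that show up together in such a list are natural candidates for being condensed into a single factor.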

Step 3: Choose Number of Factors

We used the PCA function to determine how many components the variables could be reduced to.

This is where we had to take a strategic hybrid approach. The model told us we could reduce the data to 5 factors, and some of the groupings made sense, such as years in a job, years in the company and years with current manager being correlated. Others did not make sense, for example marital status and stock options, so we decided to keep those variables separate. That way, some variables were condensed into a meaningful factor while others were kept independent. In this way we used PCA to reduce the data from 33 to 15 factors.
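The report does not record which rule produced the suggestion of 5 factors. A common choice is to keep components whose eigenvalue exceeds 1 and to check how much variance they explain together; the sketch below shows that rule in Python as an assumption, not as the method we used. The eigenvalues are hypothetical and the function names are illustrative.

```python
def components_to_keep(eigenvalues, min_eigenvalue=1.0):
    """Count components whose eigenvalue exceeds `min_eigenvalue` (Kaiser-style rule)."""
    return sum(1 for value in eigenvalues if value > min_eigenvalue)

def cumulative_variance(eigenvalues):
    """Cumulative share of total variance explained by the leading components."""
    total = sum(eigenvalues)
    shares, running = [], 0.0
    for value in sorted(eigenvalues, reverse=True):
        running += value
        shares.append(running / total)
    return shares

# Hypothetical eigenvalues of a correlation matrix; not taken from the report.
eigenvalues = [3.2, 1.9, 1.4, 1.1, 0.8, 0.6, 0.5, 0.5]
kept = components_to_keep(eigenvalues)
explained = cumulative_variance(eigenvalues)[kept - 1]
print(f"keep {kept} components, explaining {explained:.0%} of the variance")
```

Whatever rule is used, the suggested number is only a starting point: as described above, we overrode it wherever the resulting factor had no sensible business interpretation.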

	Obs.01	Obs.02	Obs.03	Obs.04	Obs.05	Obs.06	Obs.07	Obs.08	Obs.09	Obs.10
Component(Factor) 1	-0.40	-0.55	-0.28	-0.45	-0.56	-0.45	-0.40	-0.28	-0.40	-1.21
Component(Factor) 2	0.07	0.21	-0.04	0.07	0.17	0.07	0.07	-0.04	0.07	2.35
Component(Factor) 3	1.99	0.26	1.28	-0.22	0.31	-0.22	1.99	1.28	1.99	-0.02
Component(Factor) 4	-1.11	-1.71	-1.17	-0.83	-1.63	-0.83	-1.11	-1.17	-1.11	-0.53
Component(Factor) 5	-0.70	1.11	-0.71	-0.48	-0.50	-0.48	-0.70	-0.71	-0.70	-0.73
Component(Factor) 6	-1.50	-0.10	-1.58	0.08	0.09	0.08	-1.50	-1.58	-1.50	-1.52

Classification

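We performed CTREE classification on the cleaned data set and reached an accuracy of 84.0%. Note that this is about the same as the no-information rate of 84.2% reported in the output below, so sensitivity (71.7%) and balanced accuracy (79.0%) are the more informative figures.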

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1710  105
         1  272  266

               Accuracy : 0.8398
                 95% CI : (0.8243, 0.8544)
    No Information Rate : 0.8423
    P-Value [Acc > NIR] : 0.6455

                  Kappa : 0.4901

 Mcnemar’s Test P-Value : <2e-16

            Sensitivity : 0.7170
            Specificity : 0.8628
         Pos Pred Value : 0.4944
         Neg Pred Value : 0.9421
             Prevalence : 0.1577
         Detection Rate : 0.1130
   Detection Prevalence : 0.2286
      Balanced Accuracy : 0.7899

       'Positive' Class : 1

Confusion Matrix and Statistics

          Reference
Prediction    0    1
         0 1209  137
         1  773  234

               Accuracy : 0.6133
                 95% CI : (0.5932, 0.633)
    No Information Rate : 0.8423
    P-Value [Acc > NIR] : 1

                  Kappa : 0.1419

 Mcnemar’s Test P-Value : <2e-16

            Sensitivity : 0.63073
            Specificity : 0.60999
         Pos Pred Value : 0.23237
         Neg Pred Value : 0.89822
             Prevalence : 0.15767
         Detection Rate : 0.09945
   Detection Prevalence : 0.42796
      Balanced Accuracy : 0.62036

       'Positive' Class : 1
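The headline statistics in both outputs follow directly from the four cell counts of each matrix. The sketch below recomputes them in Python as a cross-check; `binary_metrics` is a hypothetical helper written for this illustration, and the counts are read from the matrices above with rows as predictions, columns as reference values and class 1 as the positive class.

```python
def binary_metrics(tn, fn, fp, tp):
    """Summary statistics for a 2x2 confusion matrix with class 1 as positive."""
    n = tn + fn + fp + tp
    accuracy = (tn + tp) / n
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Chance agreement: product of predicted and actual shares for each class.
    expected = ((tn + fn) * (tn + fp) + (fp + tp) * (fn + tp)) / n**2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / n,
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "kappa": (accuracy - expected) / (1 - expected),
    }

# Cell counts read from the two matrices above.
first = binary_metrics(tn=1710, fn=105, fp=272, tp=266)
second = binary_metrics(tn=1209, fn=137, fp=773, tp=234)
for name in ("accuracy", "sensitivity", "kappa"):
    print(f"{name}: {first[name]:.4f} vs {second[name]:.4f}")
```

Because only about 16% of employees in the data are leavers, accuracy is dominated by the majority class, which is why kappa and balanced accuracy are worth reading alongside it.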

Segmentation

Based on the dimensionality reduction performed above, we performed a clustering analysis on the data. We tried different methods (hclust and kmeans) and found hclust to be the best.

With hclust, we found 6 to be the optimal number of clusters, based on the elbow in the curve of between-cluster distance as a function of the number of clusters.
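To show the idea behind reading the number of clusters off the merge distances, the sketch below runs a naive agglomerative clustering in Python and picks the cluster count just before the largest jump in merge distance. This is a simplified stand-in for the elbow reading, not our actual procedure: the complete-linkage rule, the "largest jump" heuristic, the function names and the nine toy points are all assumptions made for illustration.

```python
from math import dist

def complete_linkage_merges(points):
    """Agglomerative clustering; returns the distance at which each merge happens."""
    clusters = [[p] for p in points]
    heights = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Complete linkage: distance between the two farthest members.
                d = max(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        heights.append(d)
    return heights

def clusters_at_largest_jump(heights):
    """Pick the cluster count just before the biggest jump in merge distance."""
    jumps = [later - earlier for earlier, later in zip(heights, heights[1:])]
    biggest = jumps.index(max(jumps))
    # n points give n - 1 merges; stopping after merge `biggest` (0-based)
    # leaves n - (biggest + 1) clusters.
    return len(heights) + 1 - (biggest + 1)

# Toy data: three well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
heights = complete_linkage_merges(points)
n_clusters = clusters_at_largest_jump(heights)
print(f"suggested number of clusters: {n_clusters}")
```

On real data the jump is rarely this clean, which is why the elbow is usually judged by eye from the plotted curve rather than picked automatically.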

	Population	Seg.1	Seg.2	Seg.3	Seg.4	Seg.5	Seg.6
Is_Resigning	1.16	1.15	1.08	1.20	1.13	1.16	1.16
YearsInCurrentRole	4.22	5.10	6.39	2.54	4.24	5.67	6.41
YearsWithCurrManager	4.12	4.87	6.17	2.54	4.26	5.16	6.40
YearsSinceLastPromotion	2.18	2.12	4.53	1.24	1.82	3.65	3.98
TotalWorkingYears	11.26	11.19	25.49	6.15	9.65	16.02	22.12
JobLevel	2.06	2.33	4.48	1.06	1.82	2.90	3.61
MonthlyIncome	6502.81	7142.18	17966.03	2580.68	4750.85	10430.22	13484.47
Department	3.22	3.33	3.13	3.10	3.27	3.32	3.29
PerformanceRating	3.16	3.15	3.15	3.16	3.15	3.15	3.17
EducationField	3.24	3.20	3.09	3.23	3.30	3.28	3.44
DistanceFromHome	9.19	9.42	8.27	9.08	9.15	9.82	10.26
Education	2.91	2.95	2.91	2.83	2.94	2.99	3.09
PercentSalaryHike	15.21	15.19	14.99	15.40	15.11	15.11	15.10
Age	36.91	37.44	41.94	34.83	36.38	38.23	41.48
JobRole	5.96	5.92	5.09	6.13	6.16	5.70	6.05
NumCompaniesWorked	2.69	2.84	2.25	2.56	2.80	2.89	2.94

	Seg.1	Seg.2	Seg.3	Seg.4	Seg.5	Seg.6
Is_Resigning	-0.01	-0.06	0.04	-0.02	0.00	0.00
YearsInCurrentRole	0.21	0.51	-0.40	0.00	0.34	0.52
YearsWithCurrManager	0.18	0.50	-0.38	0.03	0.25	0.55
YearsSinceLastPromotion	-0.03	1.08	-0.43	-0.16	0.68	0.83
TotalWorkingYears	-0.01	1.26	-0.45	-0.14	0.42	0.96
JobLevel	0.13	1.17	-0.49	-0.12	0.41	0.75
MonthlyIncome	0.10	1.76	-0.60	-0.27	0.60	1.07
Department	0.04	-0.03	-0.04	0.02	0.03	0.02
PerformanceRating	0.00	0.00	0.00	0.00	0.00	0.00
EducationField	-0.01	-0.05	0.00	0.02	0.01	0.06
DistanceFromHome	0.02	-0.10	-0.01	0.00	0.07	0.12
Education	0.02	0.00	-0.03	0.01	0.03	0.06
PercentSalaryHike	0.00	-0.01	0.01	-0.01	-0.01	-0.01
Age	0.01	0.14	-0.06	-0.01	0.04	0.12
JobRole	-0.01	-0.15	0.03	0.03	-0.04	0.01
NumCompaniesWorked	0.06	-0.16	-0.05	0.04	0.07	0.09

	Seg.1	Seg.2	Seg.3	Seg.4	Seg.5	Seg.6
Is_Resigning
YearsInCurrentRole	0.21	0.51	-0.40		0.34	0.52
YearsWithCurrManager	0.18	0.50	-0.38		0.25	0.55
YearsSinceLastPromotion		1.08	-0.43	-0.16	0.68	0.83
TotalWorkingYears		1.26	-0.45	-0.14	0.42	0.96
JobLevel	0.13	1.17	-0.49	-0.12	0.41	0.75
MonthlyIncome		1.76	-0.60	-0.27	0.60	1.07
Department
PerformanceRating
EducationField
DistanceFromHome						0.12
Education
PercentSalaryHike
Age		0.14				0.12
JobRole		-0.15
NumCompaniesWorked		-0.16
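The second and third tables can be derived from the first: each entry is the segment mean divided by the population mean, minus 1, and the third table appears to blank out entries whose magnitude is 0.1 or less. The sketch below reproduces the YearsInCurrentRole row in Python; the helper names and the exact blanking rule are assumptions inferred from the tables.

```python
def relative_deviation(segment_means, population_mean):
    """Segment mean divided by population mean, minus 1, rounded to two decimals."""
    return [round(value / population_mean - 1, 2) for value in segment_means]

def hide_small(values, threshold=0.1):
    """Replace entries with magnitude at or below `threshold` by None (a blank cell)."""
    return [v if abs(v) > threshold else None for v in values]

# YearsInCurrentRole row from the first table: population mean, then Seg.1 to Seg.6.
population = 4.22
segments = [5.10, 6.39, 2.54, 4.24, 5.67, 6.41]

deviations = relative_deviation(segments, population)
visible = hide_small(deviations)
print(deviations)
print(visible)
```

Reading the tables this way, a value of 0.51 for Segment 2 means its members have spent about 51% longer in their current role than the average employee.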

Part 2: Business Insights

Based on our analysis, we grouped IBM employees into six clusters for segmentation. Each cluster represents a “persona” for which we can propose a solution. The personas draw on a blend of demographic and professional background information.

“Executives” (Segment 2) = Total working years, Pay, Tenure
“Laterally hired Executives” (Segment 6) = Total working years, Pay, Tenure, Years with IBM
“High-performing Mid-level” (Segment 5) = Total working years, Pay, Department
“Mid-Level” (Segment 1) = Total working years, Pay
“Junior-tenured employees” (Segment 4) = Total working years, Pay
“Rookies” (Segment 3) = Total working years, Pay

From these groupings, we conclude that “Rookies” are in the most dire need of attention, with a turnover ratio of 20%, followed by “High-performing Mid-level” at 16%. The table below depicts the average persona of a risky employee as well as the variables that define the cluster with the highest liability.
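These turnover ratios can be read off the Is_Resigning row of the segment-means table, on the assumption that the flag is coded 1 for employees who stay and 2 for those who resign, so that the mean minus 1 is the share of leavers. That coding is inferred from the figures rather than documented, and the Python sketch below is illustrative only.

```python
# Is_Resigning row of the segment-means table (Seg.1 to Seg.6).
is_resigning_mean = {1: 1.15, 2: 1.08, 3: 1.20, 4: 1.13, 5: 1.16, 6: 1.16}

def turnover_rate(mean_code, base_code=1):
    """Share of leavers, assuming the flag is coded base_code (stay) or base_code + 1 (leave)."""
    return round(mean_code - base_code, 2)

rates = {segment: turnover_rate(mean) for segment, mean in is_resigning_mean.items()}
# Sort from highest to lowest turnover; ties keep their original segment order.
ranked = sorted(rates.items(), key=lambda item: item[1], reverse=True)
for segment, rate in ranked:
    print(f"Segment {segment}: {rate:.0%}")
```

At the two-decimal precision of the table, Segments 5 and 6 show the same rate, so the ordering between them should not be over-interpreted.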

Part 3: Strategy & Recommendations

Based on historical data and values, we propose a solution based on not only reactive but also proactive approaches.

As a proactive approach, we acknowledge the variance between individuals within each cluster, so the hiring process should not immediately rule out candidates purely because they fit a certain segment. Instead, IBM should incorporate additional questions during the hiring process for candidates who match the segment averaging about 6 years of experience (an average turnover rate of 20%). This way, IBM can identify risk profiles and start gathering measurable data points at an early stage of each candidate’s lifecycle.

In addition, IBM should periodically deepen its understanding of employees in the “danger zone” clusters: rookies and high-performing mid-level. Current employee sentiment should be assessed bi-annually and shared between division managers and HR. Policies around hiring, benefits, retention, and learning & development should be designed around the variables and clusters previously identified, and adjustments should be considered on an annual basis. Some employees, however, may be leaving IBM for reasons that are “out-of-scope” or too financially burdensome for HR to address.

Alternatively, if employees are leaving the company for reasons that are within its control, reactive measures should be taken. Each division manager should be responsible for understanding each individual’s needs for recognition, financial reward and progression. These data points will feed into updates of the churn risk ratio. Using these qualitative and quantitative factors, reactive rewards can be implemented. It is also crucial to incorporate external market data on compensation by job function. These measures can help create a dynamic incentive and severance package to increase overall satisfaction and balance incentives (stock vs. salary) for IBM.

Lastly, organizational behavior should be observed over the course of each employee’s tenure. Culture is typically a significant factor in daily happiness and supports the vision of the company, and both firm-wide and divisional culture should be considered. For example, interaction between employees, especially with “danger zone” employees, is an important factor to consider. If an individual spreads dissatisfaction with the company to those around them, the number of flight risks can rise.

Predicting Employee Retention at IBM

20J Group 8: Joe Yoo, Martin Febrian, Jessie Serrino, Aisling Grogan, Antoine Joan, Magno Guidote