What are the characteristics of employees that can help managers predict the possibility of a particular employee leaving the company?
The data is obtained from Kaggle.
Part 1: We cleaned the data to change categorical variables to dummy variables.
Part 2: We did a factor analysis to identify factors that signal the possibility of an employee leaving.
Part 3: We did a cluster analysis to define the profile of employees who might leave the company.
Finally, we will use the results of this analysis to make business decisions to reduce the turnover of employees among those who are at the highest risk of leaving the company.
First we loaded the data to use: data/HR_comma_sep.csv.
Then, we cleaned the data to create dummy variables for the Departments.
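The dummy-variable step can be sketched as follows. This is a minimal illustration on a toy table, not the code used for the analysis; the column name `department` and the sample values are assumptions, while the `dummy_` prefix mirrors the variable names used later in this report.

```python
import pandas as pd

# Toy stand-in for the HR data. The column name "department" is hypothetical.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72],
    "department": ["sales", "accounting", "hr", "sales"],
})

# One 0/1 column per department, named dummy_<department>.
dummies = pd.get_dummies(df["department"], prefix="dummy", dtype=int)

# Replace the categorical column with its dummy columns.
df_clean = pd.concat([df.drop(columns="department"), dummies], axis=1)
print(df_clean)
```

Each employee then has exactly one department dummy equal to 1, which lets the department enter the later steps as metric data.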
The light blue parts of the graph indicate people who stayed and the dark blue parts indicate people who left. In all departments, people with very low satisfaction scores chose to leave the company, and the number of people leaving declines as satisfaction increases, as can be expected. The department with the least turnover is the management team, matching the intuition that people who are typically recognized for their efforts through promotions are more satisfied. The accounting department follows closely. The departments with the highest turnover are arguably HR and support. The sales department has the highest percentage of retained employees. Although some turnover patterns are related to department, satisfaction scores vary within every department and lead to turnover at certain critical points. Hence, department alone does not explain turnover: other factors also characterize employee turnover behavior, and the rest of the analysis will help to identify them.
From the data, we determine the key characteristics that identify the employees who left the firm. We use the eigenvalue method to derive the criteria for segmenting employees into multiple buckets.
We check that all of the data is metric:
After running the eigenvalue method, we identified 16 parameters that define an employee in the data: satisfaction level, last evaluation completed, number of projects, average monthly hours worked, time spent in the firm, work-related accidents, number of promotions in the last 5 years, and the employee’s department (Accounting, HR, Technical, Support, Management, IT, Product Management, Marketing, and R&D).
Next, we show the descriptive statistics of the 16 parameters after standardization, so that each has a mean of 0 and a standard deviation of 1.
| | min | 25 percent | median | mean | 75 percent | max | std |
|---|---|---|---|---|---|---|---|
| satisfaction_level | -1.33 | -1.17 | -0.11 | 0 | 1.10 | 1.82 | 1 |
| last_evaluation | -1.36 | -1.00 | 0.36 | 0 | 0.92 | 1.43 | 1 |
| number_project | -1.02 | -1.02 | 0.08 | 0 | 1.18 | 1.73 | 1 |
| average_montly_hours | -1.33 | -1.00 | 0.27 | 0 | 0.89 | 1.68 | 1 |
| time_spend_company | -1.92 | -0.90 | 0.13 | 0 | 1.15 | 2.17 | 1 |
| Work_accident | -0.22 | -0.22 | -0.22 | 0 | -0.22 | 4.49 | 1 |
| promotion_last_5years | -0.07 | -0.07 | -0.07 | 0 | -0.07 | 13.67 | 1 |
| dummy_accounting | -0.25 | -0.25 | -0.25 | 0 | -0.25 | 4.06 | 1 |
| dummy_hr | -0.25 | -0.25 | -0.25 | 0 | -0.25 | 3.95 | 1 |
| dummy_technical | -0.49 | -0.49 | -0.49 | 0 | -0.49 | 2.03 | 1 |
| dummy_support | -0.43 | -0.43 | -0.43 | 0 | -0.43 | 2.33 | 1 |
| dummy_management | -0.16 | -0.16 | -0.16 | 0 | -0.16 | 6.18 | 1 |
| dummy_IT | -0.29 | -0.29 | -0.29 | 0 | -0.29 | 3.48 | 1 |
| dummy_product_mng | -0.24 | -0.24 | -0.24 | 0 | -0.24 | 4.13 | 1 |
| dummy_marketing | -0.25 | -0.25 | -0.25 | 0 | -0.25 | 4.07 | 1 |
| dummy_RandD | -0.19 | -0.19 | -0.19 | 0 | -0.19 | 5.34 | 1 |
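A mean of 0 and a standard deviation of 1 in every row is what column-wise standardization produces. The sketch below shows the idea on made-up numbers; the use of the sample standard deviation (`ddof=1`) is an assumption, since the report does not say which convention was used.

```python
import numpy as np

def standardize(X):
    """Scale each column to mean 0 and standard deviation 1."""
    X = np.asarray(X, dtype=float)
    # ddof=1 gives the sample standard deviation (an assumption of this sketch).
    return (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Made-up rows: satisfaction level, monthly hours, work accident flag.
X = np.array([
    [0.38, 157.0, 0.0],
    [0.80, 262.0, 0.0],
    [0.11, 272.0, 1.0],
    [0.72, 223.0, 0.0],
])
Z = standardize(X)
print(np.round(Z, 2))
```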
We check the correlation between the 16 characteristics (parameters) identified in Steps 1 and 2. The figure below shows that a few characteristics are strongly correlated with one another. For example, time spent in the firm, last evaluation completed, number of projects and average monthly hours are strongly correlated. This is intuitive: taken together, these four characteristics signify the “involvement” of an employee in the firm, which can be read as time spent in the firm or at work. A second example is the correlation between satisfaction level and time spent in the firm. The positive correlation implies that an employee who has spent more time in the firm is more satisfied than one who is new. One implication is that an employee who has spent less time in the firm is more likely to leave than one who has spent more.
The correlations are useful for the factor analysis that follows, which helps to better define an ex-employee.
To make the analysis easier to use and understand, we reduce the dimensions of the data, i.e. the 16 characteristics identified in the previous step, by performing a factor analysis. Factor analysis is a method for deriving a smaller number of new components that better segment the data.
The new factors are combinations of the 16 employee characteristics, and there are 16 factors in total. The first factor explains the most variance, the second the next most, and so on. Each factor is associated with an eigenvalue, which corresponds to the amount of variance the factor explains. Because the characteristics are standardized, each has a variance of 1, so the total variance is 16 and a factor with an eigenvalue above 1 explains more variance than any single characteristic. We aim to capture as much of the total variance as possible in as few factors as possible.
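A correlation matrix of this kind can be computed as in the sketch below. The data here is synthetic and only imitates the pattern described above (hours and projects moving together, satisfaction unrelated); the variable names and distributions are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic columns: projects depend on hours, satisfaction is independent.
hours = rng.normal(200, 50, n)
projects = hours / 50 + rng.normal(0, 0.5, n)
satisfaction = rng.uniform(0, 1, n)

X = np.column_stack([hours, projects, satisfaction])

# rowvar=False treats each column as one variable.
corr = np.corrcoef(X, rowvar=False)
print(np.round(corr, 2))
```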
| | Eigenvalue | Pct of explained variance | Cumulative pct of explained variance |
|---|---|---|---|
| Component 1 | 3.31 | 20.68 | 20.68 |
| Component 2 | 1.34 | 8.36 | 29.04 |
| Component 3 | 1.21 | 7.55 | 36.59 |
| Component 4 | 1.14 | 7.12 | 43.72 |
| Component 5 | 1.11 | 6.92 | 50.64 |
| Component 6 | 1.08 | 6.72 | 57.36 |
| Component 7 | 1.06 | 6.63 | 63.99 |
| Component 8 | 1.06 | 6.61 | 70.61 |
| Component 9 | 1.05 | 6.54 | 77.14 |
| Component 10 | 1.04 | 6.51 | 83.66 |
| Component 11 | 0.99 | 6.18 | 89.84 |
| Component 12 | 0.91 | 5.69 | 95.52 |
| Component 13 | 0.32 | 2.01 | 97.53 |
| Component 14 | 0.17 | 1.06 | 98.59 |
| Component 15 | 0.13 | 0.81 | 99.40 |
| Component 16 | 0.10 | 0.60 | 100.00 |
The scree plot (in blue) shows the eigenvalues in decreasing order. Ten components have an eigenvalue greater than 1, so we retain 10 factors, which together explain 83.66% of the variance (see the table above). One way to determine the number of factors to carry forward is to look for the “elbow” in the plot; here we place the cutoff where the eigenvalue falls to 1.
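The eigenvalue table and the cutoff at 1 can be reproduced along these lines. This is a sketch under the assumption that the eigenvalues come from the correlation matrix of the standardized data; the helper names `eigen_summary` and `kaiser_count` are hypothetical. The list `reported` simply copies the eigenvalues from the table above.

```python
import numpy as np

def eigen_summary(X):
    """Eigenvalues of the correlation matrix, with percent and cumulative percent."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    eigvals = np.linalg.eigvalsh(corr)[::-1]  # sorted in decreasing order
    pct = 100 * eigvals / eigvals.sum()
    return eigvals, pct, np.cumsum(pct)

def kaiser_count(eigvals):
    """Number of components with an eigenvalue greater than 1."""
    return int(np.sum(np.asarray(eigvals) > 1.0))

# Eigenvalues as reported in the table above.
reported = [3.31, 1.34, 1.21, 1.14, 1.11, 1.08, 1.06, 1.06,
            1.05, 1.04, 0.99, 0.91, 0.32, 0.17, 0.13, 0.10]
print(kaiser_count(reported))  # 10
```

Applied to the reported eigenvalues, the rule keeps the first 10 components, since Component 11 (0.99) is the first to fall below 1.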
In this case we selected the varimax rotation. For our data, the 10 selected factors look as follows after this rotation:
| | Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 | Comp.8 | Comp.9 | Comp.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| average_montly_hours | 0.94 | -0.11 | 0.02 | 0.01 | -0.01 | 0.02 | 0.00 | -0.01 | 0.00 | -0.01 |
| last_evaluation | 0.93 | 0.18 | -0.01 | 0.01 | -0.01 | 0.01 | -0.02 | -0.01 | 0.00 | 0.01 |
| number_project | 0.93 | -0.25 | 0.01 | 0.01 | 0.01 | 0.02 | 0.00 | -0.02 | 0.00 | 0.00 |
| time_spend_company | 0.79 | 0.51 | -0.01 | -0.02 | -0.04 | 0.01 | -0.01 | 0.01 | 0.01 | 0.01 |
| dummy_technical | 0.04 | 0.00 | 0.58 | -0.36 | -0.11 | 0.35 | -0.31 | -0.32 | -0.32 | -0.18 |
| dummy_management | 0.04 | -0.06 | 0.04 | -0.09 | 0.74 | -0.12 | 0.04 | 0.07 | 0.08 | -0.07 |
| dummy_IT | 0.02 | -0.02 | 0.08 | 0.95 | 0.00 | 0.07 | -0.06 | -0.06 | -0.06 | -0.07 |
| dummy_RandD | 0.02 | -0.02 | 0.07 | 0.03 | -0.15 | -0.07 | -0.01 | -0.02 | 0.01 | 0.89 |
| dummy_support | 0.01 | 0.01 | -0.91 | -0.15 | -0.06 | 0.14 | -0.13 | -0.13 | -0.13 | -0.08 |
| satisfaction_level | 0.00 | 0.98 | -0.01 | -0.01 | 0.00 | -0.01 | -0.01 | 0.01 | 0.02 | 0.00 |
| dummy_product_mng | 0.00 | 0.03 | 0.06 | -0.06 | -0.04 | 0.08 | -0.07 | -0.07 | 0.96 | -0.04 |
| Work_accident | -0.01 | 0.03 | -0.05 | -0.13 | 0.29 | 0.17 | -0.03 | -0.02 | -0.06 | 0.46 |
| dummy_accounting | -0.01 | -0.01 | 0.07 | -0.06 | -0.04 | 0.07 | 0.97 | -0.06 | -0.06 | -0.04 |
| dummy_marketing | -0.02 | 0.01 | 0.07 | -0.06 | -0.04 | 0.08 | -0.06 | 0.96 | -0.07 | -0.04 |
| dummy_hr | -0.04 | 0.01 | 0.07 | -0.07 | -0.02 | -0.93 | -0.08 | -0.08 | -0.08 | -0.05 |
| promotion_last_5years | -0.06 | 0.05 | 0.00 | 0.11 | 0.64 | 0.11 | -0.07 | -0.10 | -0.10 | 0.06 |
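For readers unfamiliar with varimax, the sketch below implements the raw varimax rotation using Kaiser's classical pairwise planar rotations. It is not the routine used to produce the table above (the report does not name one), and the example loadings are made up: a clean two-factor structure is deliberately mixed by a 30 degree rotation, and varimax recovers it.

```python
import numpy as np

def varimax(loadings, max_sweeps=50, tol=1e-8):
    """Raw varimax rotation via Kaiser's pairwise planar rotations."""
    L = np.array(loadings, dtype=float)
    p, k = L.shape
    for _ in range(max_sweeps):
        largest_angle = 0.0
        for i in range(k - 1):
            for j in range(i + 1, k):
                x, y = L[:, i].copy(), L[:, j].copy()
                u, v = x**2 - y**2, 2 * x * y
                # Closed-form angle that maximizes the varimax criterion
                # for this pair of columns.
                num = 2 * np.sum(u * v) - 2 * u.sum() * v.sum() / p
                den = np.sum(u**2 - v**2) - (u.sum()**2 - v.sum()**2) / p
                phi = 0.25 * np.arctan2(num, den)
                c, s = np.cos(phi), np.sin(phi)
                L[:, i] = c * x + s * y
                L[:, j] = -s * x + c * y
                largest_angle = max(largest_angle, abs(phi))
        if largest_angle < tol:  # no pair needed a further rotation
            break
    return L

# Made-up simple structure: three variables per factor.
simple = np.array([[0.9, 0.0], [0.8, 0.0], [0.7, 0.0],
                   [0.0, 0.9], [0.0, 0.8], [0.0, 0.7]])
theta = np.pi / 6
mix = np.array([[np.cos(theta), np.sin(theta)],
                [-np.sin(theta), np.cos(theta)]])
unrotated = simple @ mix

rotated = varimax(unrotated)
print(np.round(rotated, 2))
```

The rotation leaves each variable's communality (the row sum of squared loadings) unchanged; it only redistributes the loadings so that each variable loads mainly on one factor.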
To better visualize and interpret the factors we suppress loadings with small values, e.g. with absolute values smaller than 0.5. In this case our factors look as follows after suppressing the small numbers:
| | Comp.1 | Comp.2 | Comp.3 | Comp.4 | Comp.5 | Comp.6 | Comp.7 | Comp.8 | Comp.9 | Comp.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| average_montly_hours | 0.94 | | | | | | | | | |
| last_evaluation | 0.93 | | | | | | | | | |
| number_project | 0.93 | | | | | | | | | |
| time_spend_company | 0.79 | 0.51 | | | | | | | | |
| dummy_technical | | | 0.58 | | | | | | | |
| dummy_management | | | | | 0.74 | | | | | |
| dummy_IT | | | | 0.95 | | | | | | |
| dummy_RandD | | | | | | | | | | 0.89 |
| dummy_support | | | -0.91 | | | | | | | |
| satisfaction_level | | 0.98 | | | | | | | | |
| dummy_product_mng | | | | | | | | | 0.96 | |
| Work_accident | | | | | | | | | | |
| dummy_accounting | | | | | | | 0.97 | | | |
| dummy_marketing | | | | | | | | 0.96 | | |
| dummy_hr | | | | | | -0.93 | | | | |
| promotion_last_5years | | | | | 0.64 | | | | | |
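Suppressing small loadings is a simple masking step. The sketch below applies the 0.5 threshold to three rows copied from the rotated loadings table (`time_spend_company`, `Work_accident` and `satisfaction_level`); representing a suppressed cell as `NaN` is a choice made for this illustration.

```python
import numpy as np

names = ["time_spend_company", "Work_accident", "satisfaction_level"]
loadings = np.array([
    [0.79, 0.51, -0.01, -0.02, -0.04, 0.01, -0.01, 0.01, 0.01, 0.01],
    [-0.01, 0.03, -0.05, -0.13, 0.29, 0.17, -0.03, -0.02, -0.06, 0.46],
    [0.00, 0.98, -0.01, -0.01, 0.00, -0.01, -0.01, 0.01, 0.02, 0.00],
])

threshold = 0.5
# Keep a loading only if its absolute value reaches the threshold.
suppressed = np.where(np.abs(loadings) < threshold, np.nan, loadings)

for name, row in zip(names, suppressed):
    kept = {f"Comp.{j + 1}": v for j, v in enumerate(row) if not np.isnan(v)}
    print(name, kept)
```

`Work_accident` ends up with no loading at all, because its largest loading (0.46 on Comp.10) is below the threshold.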
After shortlisting the 10 factors based on the principal component analysis, we name each factor according to the characteristics that load on it.
Component 1 (factor 1) consists of average monthly hours, last evaluation completed, number of projects completed and time spent in the firm. We will call this factor “Involvement” of an employee in the firm.
Component 2 (factor 2) consists of time spent in the firm and satisfaction level. We will call this factor “firm’s contribution to employee satisfaction”.
Component 3 (factor 3) consists of the technical department and the support department (with a negative loading). We will call this factor “Technical department sans support function”.
Component 4 (factor 4) defines employee belonging to “IT department”.
Component 5 (factor 5) consists of Management department in the firm and number of promotions. This will signify “Promotions in Management”.
Component 6 (factor 6) defines an employee not belonging to Human Resources. We will call this factor “Non-HR”.
Component 7 (factor 7) defines employee belonging to “Accounting”.
Component 8 (factor 8) defines employee belonging to “Marketing”.
Component 9 (factor 9) defines employee belonging to “Product Management”.
Component 10 (factor 10) defines employee belonging to “R&D”.
The segmentation variables are chosen based on the highest weights observed in the 10 components from the factor analysis: 1) last evaluation, 2) average monthly hours, and 3) the departments of accounting, HR, support, management, IT, product management, marketing and R&D.
We use the Euclidean distance to measure the differences between the profiles of individual employees. The pairwise distances for 10 observations are shown below:
| | Obs.01 | Obs.02 | Obs.03 | Obs.04 | Obs.05 | Obs.06 | Obs.07 | Obs.08 | Obs.09 | Obs.10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Obs.01 | 0 | | | | | | | | | |
| Obs.02 | 105 | 0 | | | | | | | | |
| Obs.03 | 115 | 10 | 0 | | | | | | | |
| Obs.04 | 66 | 39 | 49 | 0 | | | | | | |
| Obs.05 | 2 | 103 | 113 | 64 | 0 | | | | | |
| Obs.06 | 4 | 109 | 119 | 70 | 6 | 0 | | | | |
| Obs.07 | 90 | 15 | 25 | 24 | 88 | 94 | 0 | | | |
| Obs.08 | 102 | 3 | 13 | 36 | 100 | 106 | 12 | 0 | | |
| Obs.09 | 67 | 38 | 48 | 1 | 65 | 71 | 23 | 35 | 0 | |
| Obs.10 | 15 | 120 | 130 | 81 | 17 | 11 | 105 | 117 | 82 | 0 |
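A pairwise Euclidean distance matrix like the one above can be computed as in this sketch. The three observations are made up so that the distances are easy to check by hand; they are not rows from the HR data.

```python
import numpy as np

def pairwise_euclidean(X):
    """Matrix of Euclidean distances between all pairs of rows of X."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]  # shape (n, n, number of variables)
    return np.sqrt((diff ** 2).sum(axis=-1))

# Made-up observations with two variables each.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
D = pairwise_euclidean(X)
print(D)
```

The matrix is symmetric with zeros on the diagonal, which is why only the lower triangle is displayed in the table.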
Below is a histogram of all pairwise Euclidean distances:
Below is the dendrogram of the hierarchical clustering of our data.
Also displayed is a plot of the distances travelled by the points as clusters are merged.
We chose 3 segments to partition our dataset.
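Cutting a hierarchical clustering into 3 segments can be sketched with SciPy as below. The data is synthetic (three well-separated groups), and Ward linkage is an assumption, since the report does not state which linkage was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)

# Synthetic data: 20 points around each of three distant centers.
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in (0, 10, 20)])

# Build the merge tree (Ward linkage on Euclidean distances, an assumption).
Z = linkage(X, method="ward")

# Cut the tree so that at most 3 clusters remain.
labels = fcluster(Z, t=3, criterion="maxclust")
print(labels[:10])
```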
Here is the segment membership of the first 10 respondents if we use hierarchical clustering:
| Observation Number | Cluster_Membership |
|---|---|
| 1 | 1 |
| 2 | 2 |
| 3 | 2 |
| 4 | 2 |
| 5 | 1 |
| 6 | 1 |
| 7 | 2 |
| 8 | 2 |
| 9 | 2 |
| 10 | 1 |
This is the segment membership of the same observations if we use k-means:
| Observation Number | Cluster_Membership |
|---|---|
| 1 | 2 |
| 2 | 1 |
| 3 | 3 |
| 4 | 1 |
| 5 | 2 |
| 6 | 2 |
| 7 | 1 |
| 8 | 1 |
| 9 | 1 |
| 10 | 2 |
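The k-means step can be illustrated with a short NumPy version of Lloyd's algorithm. This is a sketch, not the routine used in the analysis: the data is synthetic, and the deterministic farthest-point initialization is an assumption chosen to make the example reproducible.

```python
import numpy as np

def kmeans(X, k, n_iter=100):
    """Lloyd's algorithm with a deterministic farthest-point initialization."""
    X = np.asarray(X, dtype=float)
    # Start from the first point, then repeatedly add the point that is
    # farthest from all centers chosen so far.
    centers = [X[0]]
    for _ in range(1, k):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign every point to its nearest center.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its points (keep it if empty).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.5, size=(20, 2)) for c in (0, 10, 20)])
labels, centers = kmeans(X, k=3)
print(labels[:10])
```

Cluster numbers are arbitrary labels, so the same group of observations can be called segment 1 by one method and segment 2 by another. When comparing the two membership tables above, what matters is which observations are grouped together, not the numbers themselves.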
The table below profiles the segments obtained from the segmentation variables: for each variable, it shows the average over all employees (Population) next to the average within each segment.
| | Population | Seg.1 | Seg.2 | Seg.3 |
|---|---|---|---|---|
| last_evaluation | 0.72 | 0.53 | 0.88 | 0.85 |
| number_project | 3.86 | 2.15 | 5.01 | 6.10 |
| average_montly_hours | 207.42 | 144.05 | 248.83 | 295.24 |
| time_spend_company | 3.88 | 3.06 | 4.68 | 4.11 |
| Work_accident | 0.05 | 0.05 | 0.05 | 0.04 |
| left | 1.00 | 1.00 | 1.00 | 1.00 |
| promotion_last_5years | 0.01 | 0.01 | 0.00 | 0.01 |
| salary.f | 1.41 | 1.41 | 1.42 | 1.41 |
| dummy_sales | 0.28 | 0.30 | 0.28 | 0.23 |
| dummy_accounting | 0.06 | 0.06 | 0.05 | 0.07 |
| dummy_hr | 0.06 | 0.07 | 0.05 | 0.07 |
| dummy_technical | 0.20 | 0.17 | 0.21 | 0.25 |
| dummy_support | 0.16 | 0.16 | 0.16 | 0.14 |
| dummy_management | 0.03 | 0.02 | 0.03 | 0.02 |
| dummy_IT | 0.08 | 0.07 | 0.07 | 0.11 |
| dummy_product_mng | 0.06 | 0.06 | 0.06 | 0.05 |
| dummy_marketing | 0.06 | 0.06 | 0.05 | 0.05 |
| dummy_RandD | 0.03 | 0.03 | 0.04 | 0.02 |
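A profile table of this shape (population mean next to per-segment means) can be built as in the sketch below. The six rows of data and the `segment` column are made up for illustration; only the column labels follow the table above.

```python
import pandas as pd

# Made-up employees with an already assigned segment.
df = pd.DataFrame({
    "number_project": [2, 2, 5, 5, 6, 7],
    "average_montly_hours": [140, 150, 250, 246, 290, 300],
    "segment": [1, 1, 2, 2, 3, 3],
})

features = df.drop(columns="segment")

# Mean of every variable within each segment, variables as rows.
profile = features.groupby(df["segment"]).mean().T
profile.columns = [f"Seg.{s}" for s in profile.columns]

# Mean over all employees as the first column.
profile.insert(0, "Population", features.mean())
print(profile)
```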
As can be seen from the segmentation, the defining variables for each segment are broadly the number of hours spent in the company and the number of projects undertaken. This has led to the formation of 3 distinct segments with distinct identities.
The snake plot for the 3 segments across the 19 variables, based on the mean of each variable within each segment, highlights the critical variables on which the 3 segments of employees differ in their behavior: time spent in the company, number of projects, average monthly hours and last evaluation.
Comparing the average of each profiling variable in each segment against the population mean for that variable (the table below shows the segment mean divided by the population mean, minus 1), we find that all 3 segments are inherently distinct: segment 1 stands for employees who spent the least time in the company, segment 2 for employees who undertook the maximum projects as compared to the population mean, and segment 3 for employees with the least number of promotions as compared to the population.
| | Seg.1 | Seg.2 | Seg.3 |
|---|---|---|---|
| last_evaluation | -0.27 | 0.23 | 0.19 |
| number_project | -0.44 | 0.30 | 0.58 |
| average_montly_hours | -0.31 | 0.20 | 0.42 |
| time_spend_company | -0.21 | 0.21 | 0.06 |
| Work_accident | 0.02 | 0.01 | -0.11 |
| left | 0.00 | 0.00 | 0.00 |
| promotion_last_5years | 0.75 | -0.88 | 0.24 |
| salary.f | 0.00 | 0.00 | 0.00 |
| dummy_sales | 0.06 | 0.00 | -0.21 |
| dummy_accounting | 0.01 | -0.08 | 0.24 |
| dummy_hr | 0.16 | -0.23 | 0.17 |
| dummy_technical | -0.13 | 0.06 | 0.27 |
| dummy_support | 0.00 | 0.03 | -0.12 |
| dummy_management | -0.05 | 0.17 | -0.39 |
| dummy_IT | -0.06 | -0.06 | 0.41 |
| dummy_product_mng | 0.00 | 0.02 | -0.04 |
| dummy_marketing | 0.13 | -0.12 | -0.07 |
| dummy_RandD | -0.10 | 0.23 | -0.41 |
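The values in this table appear to be the segment mean divided by the population mean, minus 1. That formula is inferred from the numbers rather than stated in the text, so treat it as an assumption. The sketch below applies it to four rows of the profile table of means; the results match the table above to within rounding of the inputs.

```python
import numpy as np

variables = ["last_evaluation", "number_project",
             "average_montly_hours", "time_spend_company"]

# Population and segment means copied from the profile table of averages.
population = np.array([0.72, 3.86, 207.42, 3.88])
segments = np.array([
    [0.53, 0.88, 0.85],
    [2.15, 5.01, 6.10],
    [144.05, 248.83, 295.24],
    [3.06, 4.68, 4.11],
])

# Relative deviation of each segment mean from the population mean.
relative = segments / population[:, None] - 1

for name, row in zip(variables, relative):
    print(name, np.round(row, 2))
```

A value of 0.58 for `number_project` in Seg.3, for example, means that employees in that segment handled about 58% more projects than the average employee in the data.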
Hence, through this robust segmentation exercise, the employees who left the organization can broadly be divided into 3 segments: Segment 1, employees who spent the least time in the organization; Segment 2, employees who undertook the maximum projects in the year and spent the highest average monthly hours in the company; and Segment 3, employees who had the least number of promotions in the group in the last 5 years.
Our analysis revealed three key segments in the company: Segment 1, employees who spent the least time in the organization; Segment 2, employees who undertook the maximum projects in the year and spent the highest average monthly hours in the company; and Segment 3, employees who had the least number of promotions in the group in the last 5 years. These segments represent the three profiles that describe employees who have left the firm.
Based on our analysis, we cannot conclude with absolute certainty that employees who match one or more of these three profiles will leave the firm. Instead, we would recommend the following actions to gain further insight into attrition:
HR could consider monitoring current employees who match one or more of these segment profiles. This would help managers look for and recognize early signs related to an employee’s desire to leave the firm. Being prepared to persuade the employee to stay could empower managers to retain their staff. Therefore, the results of this study may allow HR to play a more proactive supporting role in retaining employees.
The firm may also want to examine turnover patterns across departments. The current data does not provide detail below the department level, but it is interesting that HR and sales appear to have higher attrition than other departments at the firm. Staff interviews or more department-specific data could be helpful here.
Further study is needed to complement the findings in this study and understand the drivers behind why employees choose to leave the firm. For example, perhaps further analysis complemented with interviews will reveal that instituting new policies around work/life balance would reduce turnover, if burnout is a common cause for an employee to leave the firm.
Should the company want to iterate upon this initial study, increasing the number of factors in subsequent analysis would improve accuracy and the level of detail available to describe patterns related to the firm’s attrition.