To develop an employee churn prediction model, we will perform the following steps:
Read input data from Attrition.csv file
Input Data analysis
## [1] 2940 35
## 'data.frame': 2940 obs. of 35 variables:
## $ EmployeeNumber : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
## EmployeeNumber Attrition Age BusinessTravel
## Min. : 1.0 No :2466 Min. :18.00 Non-Travel : 300
## 1st Qu.: 735.8 Yes: 474 1st Qu.:30.00 Travel_Frequently: 554
## Median :1470.5 Median :36.00 Travel_Rarely :2086
## Mean :1470.5 Mean :36.92
## 3rd Qu.:2205.2 3rd Qu.:43.00
## Max. :2940.0 Max. :60.00
##
## DailyRate Department DistanceFromHome
## Min. : 102.0 Human Resources : 126 Min. : 1.000
## 1st Qu.: 465.0 Research & Development:1922 1st Qu.: 2.000
## Median : 802.0 Sales : 892 Median : 7.000
## Mean : 802.5 Mean : 9.193
## 3rd Qu.:1157.0 3rd Qu.:14.000
## Max. :1499.0 Max. :29.000
##
## Education EducationField EmployeeCount
## Min. :1.000 Human Resources : 54 Min. :1
## 1st Qu.:2.000 Life Sciences :1212 1st Qu.:1
## Median :3.000 Marketing : 318 Median :1
## Mean :2.913 Medical : 928 Mean :1
## 3rd Qu.:4.000 Other : 164 3rd Qu.:1
## Max. :5.000 Technical Degree: 264 Max. :1
##
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement
## Min. :1.000 Female:1176 Min. : 30.00 Min. :1.00
## 1st Qu.:2.000 Male :1764 1st Qu.: 48.00 1st Qu.:2.00
## Median :3.000 Median : 66.00 Median :3.00
## Mean :2.722 Mean : 65.89 Mean :2.73
## 3rd Qu.:4.000 3rd Qu.: 84.00 3rd Qu.:3.00
## Max. :4.000 Max. :100.00 Max. :4.00
##
## JobLevel JobRole JobSatisfaction
## Min. :1.000 Sales Executive :652 Min. :1.000
## 1st Qu.:1.000 Research Scientist :584 1st Qu.:2.000
## Median :2.000 Laboratory Technician :518 Median :3.000
## Mean :2.064 Manufacturing Director :290 Mean :2.729
## 3rd Qu.:3.000 Healthcare Representative:262 3rd Qu.:4.000
## Max. :5.000 Manager :204 Max. :4.000
## (Other) :430
## MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked
## Divorced: 654 Min. : 1009 Min. : 2094 Min. :0.000
## Married :1346 1st Qu.: 2911 1st Qu.: 8045 1st Qu.:1.000
## Single : 940 Median : 4919 Median :14236 Median :2.000
## Mean : 6503 Mean :14313 Mean :2.693
## 3rd Qu.: 8380 3rd Qu.:20462 3rd Qu.:4.000
## Max. :19999 Max. :26999 Max. :9.000
##
## Over18 OverTime PercentSalaryHike PerformanceRating
## Y:2940 No :2108 Min. :11.00 Min. :3.000
## Yes: 832 1st Qu.:12.00 1st Qu.:3.000
## Median :14.00 Median :3.000
## Mean :15.21 Mean :3.154
## 3rd Qu.:18.00 3rd Qu.:3.000
## Max. :25.00 Max. :4.000
##
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## Min. :1.000 Min. :80 Min. :0.0000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000 1st Qu.: 6.00
## Median :3.000 Median :80 Median :1.0000 Median :10.00
## Mean :2.712 Mean :80 Mean :0.7939 Mean :11.28
## 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000 3rd Qu.:15.00
## Max. :4.000 Max. :80 Max. :3.0000 Max. :40.00
##
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
##
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 1.000 Median : 3.000
## Mean : 2.188 Mean : 4.123
## 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :15.000 Max. :17.000
##
##
## No Yes
## 2466 474
##
## No Yes
## 0.8387755 0.1612245
From the above analysis we see that the employee attrition rate is 16.12%: 474 out of 2940 employees have churned.
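The rate can be reproduced from the class counts alone. The sketch below rebuilds the Attrition column from the counts printed above instead of reading Attrition.csv; the vector name `attrition` is an assumption, and the original script presumably applied the same two calls to the column read from the file.

```r
# Rebuild the Attrition column from the counts printed above
# (2466 "No", 474 "Yes"); a real run would use the column from the CSV.
attrition <- factor(rep(c("No", "Yes"), times = c(2466, 474)))

counts <- table(attrition)       # absolute frequencies
shares <- prop.table(counts)     # relative frequencies

print(counts)
print(round(shares, 4))

# Attrition rate as a percentage of all employees
attrition_rate <- 100 * as.numeric(shares["Yes"])
print(round(attrition_rate, 2))
```

With only about one leaver in six, the classes are imbalanced, which is worth keeping in mind when judging any classifier built later on accuracy alone.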
Exploratory Data Analysis: We will explore the
## Warning: package 'ggplot2' was built under R version 3.5.1
Let us further explore the other variables and attrition data
Attrition by Department
##
## No Yes
## Human Resources 102 24
## Research & Development 1656 266
## Sales 708 184
The highest attrition rate is seen in the Sales department (184 of 892, about 20.6%), followed by Human Resources (24 of 126, about 19.0%). Research & Development has the lowest rate (266 of 1922, about 13.8%).
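Because the departments differ greatly in size, the comparison is easier to read as row-wise rates than as raw counts. The following is a minimal sketch that re-enters the table above by hand; the object name `tab_department` is an assumption chosen to match the naming seen in the test output further down.

```r
# Department x Attrition counts copied from the table above
tab_department <- matrix(
  c(102, 1656, 708,    # Attrition = No
     24,  266, 184),   # Attrition = Yes
  nrow = 3,
  dimnames = list(
    Department = c("Human Resources", "Research & Development", "Sales"),
    Attrition  = c("No", "Yes")
  )
)

# margin = 1 normalises each row, giving the attrition rate within a department
rates <- prop.table(tab_department, margin = 1)
print(round(100 * rates, 1))

# Departments ordered from highest to lowest attrition rate
ranking <- names(sort(rates[, "Yes"], decreasing = TRUE))
print(ranking)
```

The same `prop.table(..., margin = 1)` call applies unchanged to the job level, job role, gender and education tables that follow.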
Attrition and Job Levels
##
## No Yes
## 1 800 286
## 2 964 104
## 3 372 64
## 4 202 10
## 5 128 10
The highest attrition rate is at job level 1 (286 of 1086, about 26.3%), followed by level 3 (64 of 436, about 14.7%).
Attrition and Job Satisfaction
##
## No Yes
## 1 446 132
## 2 468 92
## 3 738 146
## 4 814 104
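The lower the job satisfaction index, the higher the attrition rate, as expected: about 22.8% at level 1 versus about 11.3% at level 4, with levels 2 and 3 nearly equal at roughly 16.5%.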
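Since JobSatisfaction is an ordered scale, one way to check the "lower satisfaction, higher attrition" reading is a test for trend in proportions. This is a sketch using `prop.trend.test()` from base R's stats package; it is an addition for illustration and not part of the original analysis, and the vector names are assumptions.

```r
# Leavers and head counts per JobSatisfaction level (1 = lowest, 4 = highest),
# taken from the table above
leavers   <- c(132, 92, 146, 104)
headcount <- c(446, 468, 738, 814) + leavers

# Attrition rate within each satisfaction level
rate_by_level <- leavers / headcount
print(round(100 * rate_by_level, 1))

# Chi-squared test for a linear trend in proportions across the ordered levels
trend <- prop.trend.test(leavers, headcount)
print(trend)
```

A small p-value here supports a downward trend overall, even though levels 2 and 3 are almost indistinguishable.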
Attrition and Overtime
##
## No Yes
## No 1888 220
## Yes 578 254
Employees who work overtime are more likely to leave than those who work only normal hours: 254 of 832 (about 30.5%) versus 220 of 2108 (about 10.4%).
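"More likely" can be quantified with a relative risk and an odds ratio computed directly from the table. The sketch below is worked arithmetic on the counts above; the object names are assumptions.

```r
# OverTime x Attrition counts copied from the table above
tab_overtime <- matrix(
  c(1888, 578,    # Attrition = No
     220, 254),   # Attrition = Yes
  nrow = 2,
  dimnames = list(OverTime = c("No", "Yes"), Attrition = c("No", "Yes"))
)

# Attrition rate within each OverTime group
risk <- tab_overtime[, "Yes"] / rowSums(tab_overtime)

# How many times more likely overtime workers are to leave
relative_risk <- risk[["Yes"]] / risk[["No"]]

# Ratio of the odds of leaving (overtime vs. no overtime)
odds       <- tab_overtime[, "Yes"] / tab_overtime[, "No"]
odds_ratio <- odds[["Yes"]] / odds[["No"]]

print(round(c(relative_risk = relative_risk, odds_ratio = odds_ratio), 2))
```

On these counts the attrition rate among overtime workers is roughly three times that of the others.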
Monthly Income and Attrition
From the above plot we see that monthly income has an inverse relationship with attrition in all departments.
Attrition and Job Role
##
## No Yes
## Healthcare Representative 244 18
## Human Resources 80 24
## Laboratory Technician 394 124
## Manager 194 10
## Manufacturing Director 270 20
## Research Director 156 4
## Research Scientist 490 94
## Sales Executive 538 114
## Sales Representative 100 66
The highest attrition rate is seen for Sales Representatives (66 of 166, about 39.8%) and the lowest for Research Directors (4 of 160, 2.5%).
Gender And Attrition
##
## No Yes
## Female 1002 174
## Male 1464 300
The attrition rate is higher for male employees (300 of 1764, about 17.0%) than for female employees (174 of 1176, about 14.8%).
Attrition and Education
##
## No Yes
## 1 278 62
## 2 476 88
## 3 946 198
## 4 680 116
## 5 86 10
The highest attrition rates are seen for education levels 1 and 3 (about 18.2% and 17.3%) and the lowest for education level 5 (about 10.4%).
Age and Attrition
From the above boxplot, we see that younger employees are more likely to leave than older employees.
Hypothesis Tests using Chi-square test of independence
##
## No Yes
## No 1888 220
## Yes 578 254
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tabovertime
## X-squared = 176.61, df = 1, p-value < 2.2e-16
##
## No Yes
## Healthcare Representative 244 18
## Human Resources 80 24
## Laboratory Technician 394 124
## Manager 194 10
## Manufacturing Director 270 20
## Research Director 156 4
## Research Scientist 490 94
## Sales Executive 538 114
## Sales Representative 100 66
##
## Pearson's Chi-squared test
##
## data: tabjobrole
## X-squared = 172.38, df = 8, p-value < 2.2e-16
##
## No Yes
## 1 278 62
## 2 476 88
## 3 946 198
## 4 680 116
## 5 86 10
##
## Pearson's Chi-squared test
##
## data: tabEducation
## X-squared = 6.1479, df = 4, p-value = 0.1884
##
## No Yes
## Divorced 588 66
## Married 1178 168
## Single 700 240
##
## Pearson's Chi-squared test
##
## data: tabMaritalStatus
## X-squared = 92.327, df = 2, p-value < 2.2e-16
##
## No Yes
## 1 446 132
## 2 468 92
## 3 738 146
## 4 814 104
##
## Pearson's Chi-squared test
##
## data: tabJobSatisfaction
## X-squared = 35.01, df = 3, p-value = 1.212e-07
##
## No Yes
## 1 110 50
## 2 572 116
## 3 1532 254
## 4 252 54
##
## Pearson's Chi-squared test
##
## data: tabWorkLifeBalance
## X-squared = 32.65, df = 3, p-value = 3.817e-07
##
## No Yes
## Human Resources 102 24
## Research & Development 1656 266
## Sales 708 184
##
## Pearson's Chi-squared test
##
## data: tabDepartment
## X-squared = 21.592, df = 2, p-value = 2.048e-05
##
## No Yes
## 1 438 114
## 2 516 90
## 3 776 142
## 4 736 128
##
## Pearson's Chi-squared test
##
## data: tabRelSatisfaction
## X-squared = 10.482, df = 3, p-value = 0.01488
##
## No Yes
## Non-Travel 276 24
## Travel_Frequently 416 138
## Travel_Rarely 1774 312
##
## Pearson's Chi-squared test
##
## data: tabBusTravel
## X-squared = 48.365, df = 2, p-value = 3.146e-11
##
## No Yes
## 0 954 308
## 1 1080 112
## 2 292 24
## 3 140 30
##
## Pearson's Chi-squared test
##
## data: tabStockOptionLevel
## X-squared = 121.2, df = 3, p-value < 2.2e-16
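The tests above can be reproduced from the printed contingency tables without the raw data. The sketch below re-enters two of them, the overtime table (strongly significant) and the education table (not significant at the 5% level), and calls `chisq.test()`. The object names mirror those in the output, but the hand-entered matrices are an assumption standing in for `table()` on the real columns.

```r
# OverTime x Attrition (rows: OverTime No/Yes, columns: Attrition No/Yes)
tab_overtime <- matrix(
  c(1888, 578, 220, 254),
  nrow = 2,
  dimnames = list(OverTime = c("No", "Yes"), Attrition = c("No", "Yes"))
)

# Education x Attrition (rows: Education 1..5)
tab_education <- matrix(
  c(278, 476, 946, 680, 86,    # Attrition = No
     62,  88, 198, 116, 10),   # Attrition = Yes
  nrow = 5,
  dimnames = list(Education = as.character(1:5), Attrition = c("No", "Yes"))
)

# For a 2x2 table chisq.test() applies Yates' continuity correction by default
test_overtime  <- chisq.test(tab_overtime)
test_education <- chisq.test(tab_education)

print(test_overtime)
print(test_education)

# Null hypothesis: the variable and Attrition are independent
alpha <- 0.05
print(c(overtime  = test_overtime$p.value  < alpha,
        education = test_education$p.value < alpha))
```

At a 5% level, independence is rejected for every variable tested above except Education (p-value = 0.1884), so the differences between education levels seen earlier may be due to chance.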
Correlation Matrix
## Age DailyRate DistanceFromHome
## Age 1.00000 0.010661 -0.00169
## DailyRate 0.01066 1.000000 -0.00499
## DistanceFromHome -0.00169 -0.004985 1.00000
## EnvironmentSatisfaction 0.01015 0.018355 -0.01608
## HourlyRate 0.02429 0.023381 0.03113
## JobLevel 0.50960 0.002966 0.00530
## MonthlyIncome 0.49785 0.007707 -0.01701
## MonthlyRate 0.02805 -0.032182 0.02747
## NumCompaniesWorked 0.29963 0.038153 -0.02925
## PercentSalaryHike 0.00363 0.022704 0.04024
## PerformanceRating 0.00190 0.000473 0.02711
## TotalWorkingYears 0.68038 0.014515 0.00463
## TrainingTimesLastYear -0.01962 0.002453 -0.03694
## YearsAtCompany 0.31131 -0.034055 0.00951
## YearsInCurrentRole 0.21290 0.009932 0.01884
## YearsSinceLastPromotion 0.21651 -0.033229 0.01003
## YearsWithCurrManager 0.20209 -0.026363 0.01441
## EnvironmentSatisfaction HourlyRate JobLevel
## Age 0.01015 0.02429 0.50960
## DailyRate 0.01835 0.02338 0.00297
## DistanceFromHome -0.01608 0.03113 0.00530
## EnvironmentSatisfaction 1.00000 -0.04986 0.00121
## HourlyRate -0.04986 1.00000 -0.02785
## JobLevel 0.00121 -0.02785 1.00000
## MonthlyIncome -0.00626 -0.01579 0.95030
## MonthlyRate 0.03760 -0.01530 0.03956
## NumCompaniesWorked 0.01259 0.02216 0.14250
## PercentSalaryHike -0.03170 -0.00906 -0.03473
## PerformanceRating -0.02955 -0.00217 -0.02122
## TotalWorkingYears -0.00269 -0.00233 0.78221
## TrainingTimesLastYear -0.01936 -0.00855 -0.01819
## YearsAtCompany 0.00146 -0.01958 0.53474
## YearsInCurrentRole 0.01801 -0.02411 0.38945
## YearsSinceLastPromotion 0.01619 -0.02672 0.35389
## YearsWithCurrManager -0.00500 -0.02012 0.37528
## MonthlyIncome MonthlyRate NumCompaniesWorked
## Age 0.49785 0.02805 0.2996
## DailyRate 0.00771 -0.03218 0.0382
## DistanceFromHome -0.01701 0.02747 -0.0293
## EnvironmentSatisfaction -0.00626 0.03760 0.0126
## HourlyRate -0.01579 -0.01530 0.0222
## JobLevel 0.95030 0.03956 0.1425
## MonthlyIncome 1.00000 0.03481 0.1495
## MonthlyRate 0.03481 1.00000 0.0175
## NumCompaniesWorked 0.14952 0.01752 1.0000
## PercentSalaryHike -0.02727 -0.00643 -0.0102
## PerformanceRating -0.01712 -0.00981 -0.0141
## TotalWorkingYears 0.77289 0.02644 0.2376
## TrainingTimesLastYear -0.02174 0.00147 -0.0661
## YearsAtCompany 0.51428 -0.02366 -0.1184
## YearsInCurrentRole 0.36382 -0.01281 -0.0908
## YearsSinceLastPromotion 0.34498 0.00157 -0.0368
## YearsWithCurrManager 0.34408 -0.03675 -0.1103
## PercentSalaryHike PerformanceRating
## Age 0.00363 0.001904
## DailyRate 0.02270 0.000473
## DistanceFromHome 0.04024 0.027110
## EnvironmentSatisfaction -0.03170 -0.029548
## HourlyRate -0.00906 -0.002172
## JobLevel -0.03473 -0.021222
## MonthlyIncome -0.02727 -0.017120
## MonthlyRate -0.00643 -0.009811
## NumCompaniesWorked -0.01024 -0.014095
## PercentSalaryHike 1.00000 0.773550
## PerformanceRating 0.77355 1.000000
## TotalWorkingYears -0.02061 0.006744
## TrainingTimesLastYear -0.00522 -0.015579
## YearsAtCompany -0.03599 0.003435
## YearsInCurrentRole -0.00152 0.034986
## YearsSinceLastPromotion -0.02215 0.017896
## YearsWithCurrManager -0.01199 0.022827
## TotalWorkingYears TrainingTimesLastYear
## Age 0.68038 -0.01962
## DailyRate 0.01451 0.00245
## DistanceFromHome 0.00463 -0.03694
## EnvironmentSatisfaction -0.00269 -0.01936
## HourlyRate -0.00233 -0.00855
## JobLevel 0.78221 -0.01819
## MonthlyIncome 0.77289 -0.02174
## MonthlyRate 0.02644 0.00147
## NumCompaniesWorked 0.23764 -0.06605
## PercentSalaryHike -0.02061 -0.00522
## PerformanceRating 0.00674 -0.01558
## TotalWorkingYears 1.00000 -0.03566
## TrainingTimesLastYear -0.03566 1.00000
## YearsAtCompany 0.62813 0.00357
## YearsInCurrentRole 0.46036 -0.00574
## YearsSinceLastPromotion 0.40486 -0.00207
## YearsWithCurrManager 0.45919 -0.00410
## YearsAtCompany YearsInCurrentRole
## Age 0.31131 0.21290
## DailyRate -0.03405 0.00993
## DistanceFromHome 0.00951 0.01884
## EnvironmentSatisfaction 0.00146 0.01801
## HourlyRate -0.01958 -0.02411
## JobLevel 0.53474 0.38945
## MonthlyIncome 0.51428 0.36382
## MonthlyRate -0.02366 -0.01281
## NumCompaniesWorked -0.11842 -0.09075
## PercentSalaryHike -0.03599 -0.00152
## PerformanceRating 0.00344 0.03499
## TotalWorkingYears 0.62813 0.46036
## TrainingTimesLastYear 0.00357 -0.00574
## YearsAtCompany 1.00000 0.75875
## YearsInCurrentRole 0.75875 1.00000
## YearsSinceLastPromotion 0.61841 0.54806
## YearsWithCurrManager 0.76921 0.71436
## YearsSinceLastPromotion YearsWithCurrManager
## Age 0.21651 0.2021
## DailyRate -0.03323 -0.0264
## DistanceFromHome 0.01003 0.0144
## EnvironmentSatisfaction 0.01619 -0.0050
## HourlyRate -0.02672 -0.0201
## JobLevel 0.35389 0.3753
## MonthlyIncome 0.34498 0.3441
## MonthlyRate 0.00157 -0.0367
## NumCompaniesWorked -0.03681 -0.1103
## PercentSalaryHike -0.02215 -0.0120
## PerformanceRating 0.01790 0.0228
## TotalWorkingYears 0.40486 0.4592
## TrainingTimesLastYear -0.00207 -0.0041
## YearsAtCompany 0.61841 0.7692
## YearsInCurrentRole 0.54806 0.7144
## YearsSinceLastPromotion 1.00000 0.5102
## YearsWithCurrManager 0.51022 1.0000
## Warning: package 'corrplot' was built under R version 3.5.1
## corrplot 0.84 loaded
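A matrix of this size is hard to scan by eye, so it can help to list only the strongly correlated pairs. The following sketch does this for a four-variable excerpt of the matrix above. The 0.7 cut-off and the object names are assumptions; on the full data the matrix would come from `cor()` on the numeric columns.

```r
# Four-variable excerpt of the correlation matrix printed above
vars <- c("Age", "JobLevel", "MonthlyIncome", "TotalWorkingYears")
cor_sub <- matrix(
  c(1.00000, 0.50960, 0.49785, 0.68038,
    0.50960, 1.00000, 0.95030, 0.78221,
    0.49785, 0.95030, 1.00000, 0.77289,
    0.68038, 0.78221, 0.77289, 1.00000),
  nrow = 4, byrow = TRUE, dimnames = list(vars, vars)
)

# Keep each pair once (upper triangle) and filter on |r| > 0.7
idx <- which(upper.tri(cor_sub) & abs(cor_sub) > 0.7, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_sub)[idx[, "row"]],
  var2 = colnames(cor_sub)[idx[, "col"]],
  r    = cor_sub[idx]
)

# Strongest pairs first
high_pairs <- high_pairs[order(-abs(high_pairs$r)), ]
print(high_pairs)
```

JobLevel and MonthlyIncome (r = 0.95) carry almost the same information, which matters when choosing predictors for the regression model that follows.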
Regression Model: Let us predict the number of years an employee will stay at the company (YearsAtCompany) from the other variables in the dataset.
## [1] "EmployeeNumber" "Attrition"
## [3] "Age" "BusinessTravel"
## [5] "DailyRate" "Department"
## [7] "DistanceFromHome" "Education"
## [9] "EducationField" "EnvironmentSatisfaction"
## [11] "Gender" "HourlyRate"
## [13] "JobInvolvement" "JobLevel"
## [15] "JobRole" "JobSatisfaction"
## [17] "MaritalStatus" "MonthlyIncome"
## [19] "MonthlyRate" "NumCompaniesWorked"
## [21] "OverTime" "PercentSalaryHike"
## [23] "PerformanceRating" "RelationshipSatisfaction"
## [25] "StandardHours" "StockOptionLevel"
## [27] "TotalWorkingYears" "TrainingTimesLastYear"
## [29] "WorkLifeBalance" "YearsAtCompany"
## [31] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [33] "YearsWithCurrManager"
##
## Call:
## lm(formula = train$YearsAtCompany ~ train$DistanceFromHome +
## train$MonthlyIncome + train$PercentSalaryHike + train$TotalWorkingYears +
## train$YearsInCurrentRole + train$YearsSinceLastPromotion +
## train$YearsWithCurrManager, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4257 -1.4584 -0.0825 1.0890 20.2286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -9.519e-02 3.001e-01 -0.317 0.7511
## train$DistanceFromHome 2.188e-03 7.736e-03 0.283 0.7773
## train$MonthlyIncome 8.428e-05 2.087e-05 4.038 5.58e-05 ***
## train$PercentSalaryHike -3.659e-02 1.740e-02 -2.102 0.0356 *
## train$TotalWorkingYears 1.619e-01 1.350e-02 11.991 < 2e-16 ***
## train$YearsInCurrentRole 4.980e-01 2.622e-02 18.995 < 2e-16 ***
## train$YearsSinceLastPromotion 3.413e-01 2.460e-02 13.874 < 2e-16 ***
## train$YearsWithCurrManager 5.848e-01 2.615e-02 22.368 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.872 on 2050 degrees of freedom
## Multiple R-squared: 0.772, Adjusted R-squared: 0.7713
## F-statistic: 991.9 on 7 and 2050 DF, p-value: < 2.2e-16
## Warning: package 'car' was built under R version 3.5.1
## Loading required package: carData
## train$DistanceFromHome train$MonthlyIncome
## 1.002486 2.345415
## train$PercentSalaryHike train$TotalWorkingYears
## 1.002499 2.648315
## train$YearsInCurrentRole train$YearsSinceLastPromotion
## 2.207798 1.548921
## train$YearsWithCurrManager
## 2.185022
## [1] 10193.22
## [1] 10243.89
## [1] 8.218202
## [1] 2.866741
##
## Call:
## lm(formula = train$YearsAtCompany ~ train$MonthlyIncome + train$YearsInCurrentRole +
## train$YearsSinceLastPromotion + train$YearsWithCurrManager,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2123 -1.4223 -0.1086 0.8207 23.1071
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3187272 0.1238384 -2.574 0.0101 *
## train$MonthlyIncome 0.0002590 0.0000155 16.710 <2e-16 ***
## train$YearsInCurrentRole 0.5262255 0.0270209 19.475 <2e-16 ***
## train$YearsSinceLastPromotion 0.3643281 0.0253856 14.352 <2e-16 ***
## train$YearsWithCurrManager 0.6355921 0.0266973 23.807 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.973 on 2053 degrees of freedom
## Multiple R-squared: 0.7555, Adjusted R-squared: 0.755
## F-statistic: 1586 on 4 and 2053 DF, p-value: < 2.2e-16
## train$MonthlyIncome train$YearsInCurrentRole
## 1.207962 2.189769
## train$YearsSinceLastPromotion train$YearsWithCurrManager
## 1.539617 2.127238
## [1] 10331.33
## [1] 10365.11
## [1] 8.814305
## [1] 2.96889
## Warning: 'newdata' had 882 rows but variables found have 2058 rows
## Warning in test$YearsAtCompany - pred.testMLE: longer object length is not
## a multiple of shorter object length
## [1] 8.357591
## Warning: 'newdata' had 882 rows but variables found have 2058 rows
## 1 2 3 4 5 6
## 4.412022 8.446663 8.597393 10.946814 2.744178 4.296685
## Warning in test$YearsAtCompany - pred.testMLE: longer object length is not
## a multiple of shorter object length
## [1] 8.311958
## Warning in cbind(test$EmployeeNumber, test$YearsAtCompany, pred.testMLE):
## number of rows of result is not a multiple of vector length (arg 1)
## EmployeeNumber YearsAtCompany YearsAtCompany-Predicted
## 1 2 10 4.412022
## 2 3 0 8.446663
## 3 6 7 8.597393
## 4 13 5 10.946814
## 5 15 4 2.744178
## 6 26 14 4.296685
## 7 28 9 17.174903
## 8 31 1 17.027760
## 9 32 4 3.914374
## 10 34 1 3.156766
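A regression model to predict YearsAtCompany has been built, but its performance on the test data cannot be judged from the figures above. Because the model formula refers to the columns as train$..., predict() ignored the 882-row test set and returned predictions for the 2058 training rows (hence the 'newdata' warnings). The reported test errors (8.36 and 8.31) therefore compare recycled training predictions with test outcomes. Skewness in the data may also be a problem. Until the model is refit with bare column names and re-evaluated, we cannot use it for prediction with good accuracy.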
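The 'newdata' warnings in the output arise when a formula is written with `train$` prefixes. A minimal sketch of the usual remedy follows: bare column names in the formula plus `data = train`, so that `predict()` can look the same names up in the test set. The data here is synthetic and the coefficients are made up; only the pattern is meant to carry over.

```r
set.seed(42)

# Synthetic stand-in for the employee data (hypothetical values, not the CSV)
n <- 300
emp <- data.frame(
  YearsInCurrentRole   = rpois(n, 4),
  YearsWithCurrManager = rpois(n, 4)
)
emp$YearsAtCompany <- emp$YearsInCurrentRole + emp$YearsWithCurrManager +
  rnorm(n, sd = 2)

# 70/30 split, mirroring the 2058/882 split seen in the output above
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- emp[train_idx, ]
test  <- emp[-train_idx, ]

# Bare column names: predict() can then find the predictors in newdata
fit <- lm(YearsAtCompany ~ YearsInCurrentRole + YearsWithCurrManager,
          data = train)

pred_test <- predict(fit, newdata = test)

# One prediction per test row, so the residuals line up without recycling
test_rmse <- sqrt(mean((test$YearsAtCompany - pred_test)^2))
print(c(n_test = nrow(test), n_pred = length(pred_test), rmse = test_rmse))
```

When the number of predictions equals the number of test rows and no warning is raised, the test error is a meaningful estimate and can be compared with the training error.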
Build CART Model to predict Attrition
## Warning: package 'rpart' was built under R version 3.5.1
## Warning: package 'caret' was built under R version 3.5.1
## Loading required package: lattice
## Warning: package 'rattle' was built under R version 3.5.1
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## Warning: package 'rpart.plot' was built under R version 3.5.1
## 'data.frame': 2058 obs. of 33 variables:
## $ EmployeeNumber : int 905 758 1623 166 1376 1420 2384 1087 1603 500 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 1 2 1 ...
## $ Age : int 48 34 53 50 32 42 45 50 31 33 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 3 3 2 3 3 2 3 3 ...
## $ DailyRate : int 715 216 1436 1452 238 557 1449 333 542 1216 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 2 3 3 2 2 2 3 2 3 3 ...
## $ DistanceFromHome : int 1 1 6 11 5 18 2 22 20 8 ...
## $ Education : Factor w/ 5 levels "1","2","3","4",..: 3 4 2 3 2 4 3 5 3 4 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 3 3 2 2 2 3 4 2 3 ...
## $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 2 2 3 1 4 1 3 2 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 1 2 1 2 ...
## $ HourlyRate : int 76 75 34 53 47 35 94 88 71 39 ...
## $ JobInvolvement : Factor w/ 4 levels "1","2","3","4": 2 4 3 3 4 3 1 1 1 3 ...
## $ JobLevel : Factor w/ 5 levels "1","2","3","4",..: 5 2 2 5 1 2 5 4 2 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 6 8 9 4 7 7 4 6 8 8 ...
## $ JobSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 4 3 2 3 1 2 4 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 1 2 3 3 1 3 3 2 1 ...
## $ MonthlyIncome : int 18265 9725 2306 19926 2432 5410 18824 14411 4559 7104 ...
## $ MonthlyRate : int 8733 12278 16047 17053 15318 11189 2493 24450 24788 20431 ...
## $ NumCompaniesWorked : int 6 0 2 3 3 6 2 1 3 0 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 2 2 2 1 ...
## $ PercentSalaryHike : int 12 11 20 15 14 17 16 13 11 12 ...
## $ PerformanceRating : Factor w/ 2 levels "3","4": 1 1 2 1 1 1 1 1 1 1 ...
## $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 3 4 4 2 1 3 1 4 3 4 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 1 2 2 1 1 2 1 1 2 1 ...
## $ TotalWorkingYears : int 25 16 13 21 8 9 26 32 4 6 ...
## $ TrainingTimesLastYear : int 3 2 3 5 2 3 2 2 2 3 ...
## $ WorkLifeBalance : Factor w/ 4 levels "1","2","3","4": 4 2 1 3 3 2 3 3 3 3 ...
## $ YearsAtCompany : int 1 15 7 5 4 4 24 32 2 5 ...
## $ YearsInCurrentRole : int 0 1 7 4 1 3 10 6 2 0 ...
## $ YearsSinceLastPromotion : int 0 0 4 4 0 1 1 13 2 1 ...
## $ YearsWithCurrManager : int 0 9 5 4 3 2 11 9 2 2 ...
## [1] "EmployeeNumber" "Attrition"
## [3] "Age" "BusinessTravel"
## [5] "DailyRate" "Department"
## [7] "DistanceFromHome" "Education"
## [9] "EducationField" "EnvironmentSatisfaction"
## [11] "Gender" "HourlyRate"
## [13] "JobInvolvement" "JobLevel"
## [15] "JobRole" "JobSatisfaction"
## [17] "MaritalStatus" "MonthlyIncome"
## [19] "MonthlyRate" "NumCompaniesWorked"
## [21] "OverTime" "PercentSalaryHike"
## [23] "PerformanceRating" "RelationshipSatisfaction"
## [25] "StandardHours" "StockOptionLevel"
## [27] "TotalWorkingYears" "TrainingTimesLastYear"
## [29] "WorkLifeBalance" "YearsAtCompany"
## [31] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [33] "YearsWithCurrManager"
## n= 2058
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 2058 330 No (0.839650146 0.160349854)
## 2) OverTime=No 1474 157 No (0.893487110 0.106512890)
## 4) TotalWorkingYears>=2.5 1346 114 No (0.915304606 0.084695394)
## 8) StockOptionLevel=1,2 715 30 No (0.958041958 0.041958042)
## 16) YearsWithCurrManager>=1.5 590 17 No (0.971186441 0.028813559)
## 32) TotalWorkingYears< 33.5 583 15 No (0.974271012 0.025728988)
## 64) PercentSalaryHike>=11.5 503 8 No (0.984095427 0.015904573)
## 128) HourlyRate< 90.5 404 3 No (0.992574257 0.007425743) *
## 129) HourlyRate>=90.5 99 5 No (0.949494949 0.050505051)
## 258) JobRole=Healthcare Representative,Manager,Manufacturing Director,Research Director,Research Scientist,Sales Executive 74 0 No (1.000000000 0.000000000) *
## 259) JobRole=Human Resources,Laboratory Technician,Sales Representative 25 5 No (0.800000000 0.200000000)
## 518) HourlyRate>=93.5 17 0 No (1.000000000 0.000000000) *
## 519) HourlyRate< 93.5 8 3 Yes (0.375000000 0.625000000) *
## 65) PercentSalaryHike< 11.5 80 7 No (0.912500000 0.087500000) *
## 33) TotalWorkingYears>=33.5 7 2 No (0.714285714 0.285714286) *
## 17) YearsWithCurrManager< 1.5 125 13 No (0.896000000 0.104000000)
## 34) MonthlyIncome>=2375.5 113 8 No (0.929203540 0.070796460)
## 68) BusinessTravel=Non-Travel,Travel_Rarely 93 3 No (0.967741935 0.032258065) *
## 69) BusinessTravel=Travel_Frequently 20 5 No (0.750000000 0.250000000)
## 138) Age< 36.5 11 0 No (1.000000000 0.000000000) *
## 139) Age>=36.5 9 4 Yes (0.444444444 0.555555556) *
## 35) MonthlyIncome< 2375.5 12 5 No (0.583333333 0.416666667) *
## 9) StockOptionLevel=0,3 631 84 No (0.866877971 0.133122029)
## 18) Age>=33.5 389 31 No (0.920308483 0.079691517)
## 36) WorkLifeBalance=2,3,4 370 25 No (0.932432432 0.067567568)
## 72) JobRole=Healthcare Representative,Human Resources,Manager,Manufacturing Director 133 1 No (0.992481203 0.007518797) *
## 73) JobRole=Laboratory Technician,Research Director,Research Scientist,Sales Executive,Sales Representative 237 24 No (0.898734177 0.101265823)
## 146) HourlyRate< 81.5 185 12 No (0.935135135 0.064864865) *
## 147) HourlyRate>=81.5 52 12 No (0.769230769 0.230769231)
## 294) JobSatisfaction=2,3,4 40 5 No (0.875000000 0.125000000) *
## 295) JobSatisfaction=1 12 5 Yes (0.416666667 0.583333333) *
## 37) WorkLifeBalance=1 19 6 No (0.684210526 0.315789474) *
## 19) Age< 33.5 242 53 No (0.780991736 0.219008264)
## 38) NumCompaniesWorked< 4.5 188 23 No (0.877659574 0.122340426)
## 76) JobLevel=1,2,4 165 14 No (0.915151515 0.084848485) *
## 77) JobLevel=3 23 9 No (0.608695652 0.391304348)
## 154) DailyRate< 1031.5 15 2 No (0.866666667 0.133333333) *
## 155) DailyRate>=1031.5 8 1 Yes (0.125000000 0.875000000) *
## 39) NumCompaniesWorked>=4.5 54 24 Yes (0.444444444 0.555555556)
## 78) DailyRate>=467 41 17 No (0.585365854 0.414634146)
## 156) HourlyRate>=47.5 31 8 No (0.741935484 0.258064516)
## 312) TotalWorkingYears>=4.5 23 2 No (0.913043478 0.086956522) *
## 313) TotalWorkingYears< 4.5 8 2 Yes (0.250000000 0.750000000) *
## 157) HourlyRate< 47.5 10 1 Yes (0.100000000 0.900000000) *
## 79) DailyRate< 467 13 0 Yes (0.000000000 1.000000000) *
## 5) TotalWorkingYears< 2.5 128 43 No (0.664062500 0.335937500)
## 10) JobInvolvement=2,3,4 118 33 No (0.720338983 0.279661017)
## 20) MonthlyRate< 24646 105 23 No (0.780952381 0.219047619)
## 40) Department=Research & Development,Sales 98 17 No (0.826530612 0.173469388)
## 80) WorkLifeBalance=1,3 68 6 No (0.911764706 0.088235294) *
## 81) WorkLifeBalance=2,4 30 11 No (0.633333333 0.366666667)
## 162) DistanceFromHome< 7.5 20 3 No (0.850000000 0.150000000) *
## 163) DistanceFromHome>=7.5 10 2 Yes (0.200000000 0.800000000) *
## 41) Department=Human Resources 7 1 Yes (0.142857143 0.857142857) *
## 21) MonthlyRate>=24646 13 3 Yes (0.230769231 0.769230769) *
## 11) JobInvolvement=1 10 0 Yes (0.000000000 1.000000000) *
## 3) OverTime=Yes 584 173 No (0.703767123 0.296232877)
## 6) MonthlyIncome>=2475 487 110 No (0.774127310 0.225872690)
## 12) MaritalStatus=Divorced,Married 337 53 No (0.842729970 0.157270030)
## 24) JobRole=Healthcare Representative,Manager,Manufacturing Director,Research Director,Research Scientist,Sales Executive 271 31 No (0.885608856 0.114391144)
## 48) DistanceFromHome< 12.5 189 11 No (0.941798942 0.058201058) *
## 49) DistanceFromHome>=12.5 82 20 No (0.756097561 0.243902439)
## 98) YearsInCurrentRole>=6.5 34 2 No (0.941176471 0.058823529) *
## 99) YearsInCurrentRole< 6.5 48 18 No (0.625000000 0.375000000)
## 198) YearsSinceLastPromotion< 1.5 30 5 No (0.833333333 0.166666667) *
## 199) YearsSinceLastPromotion>=1.5 18 5 Yes (0.277777778 0.722222222) *
## 25) JobRole=Human Resources,Laboratory Technician,Sales Representative 66 22 No (0.666666667 0.333333333)
## 50) JobLevel=2 18 0 No (1.000000000 0.000000000) *
## 51) JobLevel=1,3 48 22 No (0.541666667 0.458333333)
## 102) MonthlyRate< 10019 23 4 No (0.826086957 0.173913043)
## 204) EducationField=Life Sciences,Medical,Other 17 0 No (1.000000000 0.000000000) *
## 205) EducationField=Marketing,Technical Degree 6 2 Yes (0.333333333 0.666666667) *
## 103) MonthlyRate>=10019 25 7 Yes (0.280000000 0.720000000)
## 206) YearsAtCompany< 3.5 13 6 No (0.538461538 0.461538462) *
## 207) YearsAtCompany>=3.5 12 0 Yes (0.000000000 1.000000000) *
## 13) MaritalStatus=Single 150 57 No (0.620000000 0.380000000)
## 26) JobRole=Healthcare Representative,Human Resources,Manager,Manufacturing Director,Research Director,Research Scientist 81 17 No (0.790123457 0.209876543)
## 52) JobLevel=2,3,4 48 2 No (0.958333333 0.041666667) *
## 53) JobLevel=1,5 33 15 No (0.545454545 0.454545455)
## 106) HourlyRate< 86.5 24 7 No (0.708333333 0.291666667)
## 212) YearsSinceLastPromotion< 5 16 0 No (1.000000000 0.000000000) *
## 213) YearsSinceLastPromotion>=5 8 1 Yes (0.125000000 0.875000000) *
## 107) HourlyRate>=86.5 9 1 Yes (0.111111111 0.888888889) *
## 27) JobRole=Laboratory Technician,Sales Executive,Sales Representative 69 29 Yes (0.420289855 0.579710145)
## 54) TrainingTimesLastYear>=2.5 28 10 No (0.642857143 0.357142857)
## 108) EducationField=Life Sciences,Medical 15 1 No (0.933333333 0.066666667) *
## 109) EducationField=Marketing,Other,Technical Degree 13 4 Yes (0.307692308 0.692307692) *
## 55) TrainingTimesLastYear< 2.5 41 11 Yes (0.268292683 0.731707317)
## 110) MonthlyRate< 8860.5 14 6 No (0.571428571 0.428571429) *
## 111) MonthlyRate>=8860.5 27 3 Yes (0.111111111 0.888888889) *
## 7) MonthlyIncome< 2475 97 34 Yes (0.350515464 0.649484536)
## 14) DailyRate>=888 46 19 No (0.586956522 0.413043478)
## 28) YearsInCurrentRole>=2.5 16 1 No (0.937500000 0.062500000) *
## 29) YearsInCurrentRole< 2.5 30 12 Yes (0.400000000 0.600000000)
## 58) WorkLifeBalance=2 6 0 No (1.000000000 0.000000000) *
## 59) WorkLifeBalance=1,3,4 24 6 Yes (0.250000000 0.750000000) *
## 15) DailyRate< 888 51 7 Yes (0.137254902 0.862745098) *
## [1] "rpart"
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
##
## Classification tree:
## rpart(formula = Attrition ~ ., data = train[, -1], method = "class",
## model = TRUE, control = control.parameter)
##
## Variables actually used in tree construction:
## [1] Age BusinessTravel
## [3] DailyRate Department
## [5] DistanceFromHome EducationField
## [7] HourlyRate JobInvolvement
## [9] JobLevel JobRole
## [11] JobSatisfaction MaritalStatus
## [13] MonthlyIncome MonthlyRate
## [15] NumCompaniesWorked OverTime
## [17] PercentSalaryHike StockOptionLevel
## [19] TotalWorkingYears TrainingTimesLastYear
## [21] WorkLifeBalance YearsAtCompany
## [23] YearsInCurrentRole YearsSinceLastPromotion
## [25] YearsWithCurrManager
##
## Root node error: 330/2058 = 0.16035
##
## n= 2058
##
## CP nsplit rel error xerror xstd
## 1 0.0439394 0 1.00000 1.00000 0.050442
## 2 0.0242424 2 0.91212 0.98788 0.050193
## 3 0.0181818 3 0.88788 0.93030 0.048975
## 4 0.0166667 5 0.85152 0.91212 0.048577
## 5 0.0151515 8 0.79394 0.91818 0.048711
## 6 0.0127273 13 0.71212 0.88788 0.048036
## 7 0.0121212 18 0.64848 0.87273 0.047692
## 8 0.0111111 19 0.63636 0.86970 0.047623
## 9 0.0106061 22 0.60303 0.86970 0.047623
## 10 0.0090909 25 0.56364 0.87273 0.047692
## 11 0.0080808 29 0.52727 0.86061 0.047413
## 12 0.0060606 32 0.50303 0.86364 0.047483
## 13 0.0030303 34 0.49091 0.88788 0.048036
## 14 0.0015152 35 0.48788 0.90606 0.048443
## 15 0.0010101 39 0.48182 0.92727 0.048909
## 16 0.0001000 48 0.47273 0.94545 0.049302
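In the table above the training error (rel error) keeps falling as the tree grows, while the cross-validated error (xerror) bottoms out and then rises again, which is the usual sign of overfitting. The sketch below shows two common ways to pick a pruning value of CP from such a table: the minimum xerror, and the "one standard error" rule. It re-enters the printed values by hand. On a fitted model they would come from the `cptable` component of the rpart object, and the chosen value would be passed to `prune()`. Which CP the original script used is not stated, so this is an illustration only.

```r
# Complexity table copied from the printcp() output above
cp_table <- data.frame(
  CP     = c(0.0439394, 0.0242424, 0.0181818, 0.0166667, 0.0151515,
             0.0127273, 0.0121212, 0.0111111, 0.0106061, 0.0090909,
             0.0080808, 0.0060606, 0.0030303, 0.0015152, 0.0010101,
             0.0001000),
  nsplit = c(0, 2, 3, 5, 8, 13, 18, 19, 22, 25, 29, 32, 34, 35, 39, 48),
  xerror = c(1.00000, 0.98788, 0.93030, 0.91212, 0.91818, 0.88788,
             0.87273, 0.86970, 0.86970, 0.87273, 0.86061, 0.86364,
             0.88788, 0.90606, 0.92727, 0.94545),
  xstd   = c(0.050442, 0.050193, 0.048975, 0.048577, 0.048711, 0.048036,
             0.047692, 0.047623, 0.047623, 0.047692, 0.047413, 0.047483,
             0.048036, 0.048443, 0.048909, 0.049302)
)

# Option 1: the row with the lowest cross-validated error
best <- which.min(cp_table$xerror)

# Option 2 (one-SE rule): the smallest tree whose xerror is within one
# standard error of that minimum
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se    <- min(which(cp_table$xerror <= threshold))

print(cp_table[c(best, one_se), ])
```

On these numbers the minimum xerror is at 29 splits (CP = 0.0080808), while the one-SE rule settles for a smaller tree with 13 splits (CP = 0.0127273).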
## n= 2058
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 2058 330 No (0.83965015 0.16034985)
## 2) OverTime=No 1474 157 No (0.89348711 0.10651289)
## 4) TotalWorkingYears>=2.5 1346 114 No (0.91530461 0.08469539)
## 8) StockOptionLevel=1,2 715 30 No (0.95804196 0.04195804) *
## 9) StockOptionLevel=0,3 631 84 No (0.86687797 0.13312203)
## 18) Age>=33.5 389 31 No (0.92030848 0.07969152) *
## 19) Age< 33.5 242 53 No (0.78099174 0.21900826)
## 38) NumCompaniesWorked< 4.5 188 23 No (0.87765957 0.12234043)
## 76) JobLevel=1,2,4 165 14 No (0.91515152 0.08484848) *
## 77) JobLevel=3 23 9 No (0.60869565 0.39130435)
## 154) DailyRate< 1031.5 15 2 No (0.86666667 0.13333333) *
## 155) DailyRate>=1031.5 8 1 Yes (0.12500000 0.87500000) *
## 39) NumCompaniesWorked>=4.5 54 24 Yes (0.44444444 0.55555556)
## 78) DailyRate>=467 41 17 No (0.58536585 0.41463415)
## 156) HourlyRate>=47.5 31 8 No (0.74193548 0.25806452)
## 312) TotalWorkingYears>=4.5 23 2 No (0.91304348 0.08695652) *
## 313) TotalWorkingYears< 4.5 8 2 Yes (0.25000000 0.75000000) *
## 157) HourlyRate< 47.5 10 1 Yes (0.10000000 0.90000000) *
## 79) DailyRate< 467 13 0 Yes (0.00000000 1.00000000) *
## 5) TotalWorkingYears< 2.5 128 43 No (0.66406250 0.33593750)
## 10) JobInvolvement=2,3,4 118 33 No (0.72033898 0.27966102)
## 20) MonthlyRate< 24646 105 23 No (0.78095238 0.21904762)
## 40) Department=Research & Development,Sales 98 17 No (0.82653061 0.17346939)
## 80) WorkLifeBalance=1,3 68 6 No (0.91176471 0.08823529) *
## 81) WorkLifeBalance=2,4 30 11 No (0.63333333 0.36666667)
## 162) DistanceFromHome< 7.5 20 3 No (0.85000000 0.15000000) *
## 163) DistanceFromHome>=7.5 10 2 Yes (0.20000000 0.80000000) *
## 41) Department=Human Resources 7 1 Yes (0.14285714 0.85714286) *
## 21) MonthlyRate>=24646 13 3 Yes (0.23076923 0.76923077) *
## 11) JobInvolvement=1 10 0 Yes (0.00000000 1.00000000) *
## 3) OverTime=Yes 584 173 No (0.70376712 0.29623288)
## 6) MonthlyIncome>=2475 487 110 No (0.77412731 0.22587269)
## 12) MaritalStatus=Divorced,Married 337 53 No (0.84272997 0.15727003)
## 24) JobRole=Healthcare Representative,Manager,Manufacturing Director,Research Director,Research Scientist,Sales Executive 271 31 No (0.88560886 0.11439114)
## 48) DistanceFromHome< 12.5 189 11 No (0.94179894 0.05820106) *
## 49) DistanceFromHome>=12.5 82 20 No (0.75609756 0.24390244)
## 98) YearsInCurrentRole>=6.5 34 2 No (0.94117647 0.05882353) *
## 99) YearsInCurrentRole< 6.5 48 18 No (0.62500000 0.37500000)
## 198) YearsSinceLastPromotion< 1.5 30 5 No (0.83333333 0.16666667) *
## 199) YearsSinceLastPromotion>=1.5 18 5 Yes (0.27777778 0.72222222) *
## 25) JobRole=Human Resources,Laboratory Technician,Sales Representative 66 22 No (0.66666667 0.33333333)
## 50) JobLevel=2 18 0 No (1.00000000 0.00000000) *
## 51) JobLevel=1,3 48 22 No (0.54166667 0.45833333)
## 102) MonthlyRate< 10019 23 4 No (0.82608696 0.17391304)
## 204) EducationField=Life Sciences,Medical,Other 17 0 No (1.00000000 0.00000000) *
## 205) EducationField=Marketing,Technical Degree 6 2 Yes (0.33333333 0.66666667) *
## 103) MonthlyRate>=10019 25 7 Yes (0.28000000 0.72000000) *
## 13) MaritalStatus=Single 150 57 No (0.62000000 0.38000000)
## 26) JobRole=Healthcare Representative,Human Resources,Manager,Manufacturing Director,Research Director,Research Scientist 81 17 No (0.79012346 0.20987654)
## 52) JobLevel=2,3,4 48 2 No (0.95833333 0.04166667) *
## 53) JobLevel=1,5 33 15 No (0.54545455 0.45454545)
## 106) HourlyRate< 86.5 24 7 No (0.70833333 0.29166667)
## 212) YearsSinceLastPromotion< 5 16 0 No (1.00000000 0.00000000) *
## 213) YearsSinceLastPromotion>=5 8 1 Yes (0.12500000 0.87500000) *
## 107) HourlyRate>=86.5 9 1 Yes (0.11111111 0.88888889) *
## 27) JobRole=Laboratory Technician,Sales Executive,Sales Representative 69 29 Yes (0.42028986 0.57971014)
## 54) TrainingTimesLastYear>=2.5 28 10 No (0.64285714 0.35714286)
## 108) EducationField=Life Sciences,Medical 15 1 No (0.93333333 0.06666667) *
## 109) EducationField=Marketing,Other,Technical Degree 13 4 Yes (0.30769231 0.69230769) *
## 55) TrainingTimesLastYear< 2.5 41 11 Yes (0.26829268 0.73170732)
## 110) MonthlyRate< 8860.5 14 6 No (0.57142857 0.42857143) *
## 111) MonthlyRate>=8860.5 27 3 Yes (0.11111111 0.88888889) *
## 7) MonthlyIncome< 2475 97 34 Yes (0.35051546 0.64948454)
## 14) DailyRate>=888 46 19 No (0.58695652 0.41304348)
## 28) YearsInCurrentRole>=2.5 16 1 No (0.93750000 0.06250000) *
## 29) YearsInCurrentRole< 2.5 30 12 Yes (0.40000000 0.60000000)
## 58) WorkLifeBalance=2 6 0 No (1.00000000 0.00000000) *
## 59) WorkLifeBalance=1,3,4 24 6 Yes (0.25000000 0.75000000) *
## 15) DailyRate< 888 51 7 Yes (0.13725490 0.86274510) *
##
## Classification tree:
## rpart(formula = Attrition ~ ., data = train[, -c(1)], method = "class",
## model = TRUE, control = control.parameter1)
##
## Variables actually used in tree construction:
## [1] Age DailyRate
## [3] Department DistanceFromHome
## [5] EducationField HourlyRate
## [7] JobInvolvement JobLevel
## [9] JobRole MaritalStatus
## [11] MonthlyIncome MonthlyRate
## [13] NumCompaniesWorked OverTime
## [15] StockOptionLevel TotalWorkingYears
## [17] TrainingTimesLastYear WorkLifeBalance
## [19] YearsInCurrentRole YearsSinceLastPromotion
##
## Root node error: 330/2058 = 0.16035
##
## n= 2058
##
## CP nsplit rel error xerror xstd
## 1 0.0439394 0 1.00000 1.00000 0.050442
## 2 0.0242424 2 0.91212 0.96364 0.049688
## 3 0.0181818 3 0.88788 0.96364 0.049688
## 4 0.0166667 5 0.85152 0.96364 0.049688
## 5 0.0151515 8 0.79394 0.93939 0.049172
## 6 0.0127273 13 0.71212 0.92727 0.048909
## 7 0.0121212 18 0.64848 0.85152 0.047202
## 8 0.0111111 19 0.63636 0.86364 0.047483
## 9 0.0106061 22 0.60303 0.86364 0.047483
## 10 0.0090909 25 0.56364 0.84848 0.047131
## 11 0.0080808 29 0.52727 0.84848 0.047131
## 12 0.0060606 32 0.50303 0.84242 0.046989
## 13 0.0060606 34 0.49091 0.84545 0.047060
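The `xerror` column in the table above is what usually drives pruning: one picks the CP value with the lowest cross-validated error and passes it to `prune()`. A minimal sketch of that selection follows, with the printed values typed in by hand. The data frame `cp_table` is an illustrative stand-in; with a fitted model one would read the `cptable` component of the rpart object instead. The one-standard-error rule is shown only as a common alternative, not as something this analysis is known to have used.

```r
# CP table copied from the printcp() output above (illustrative stand-in
# for the `cptable` component of a fitted rpart object).
cp_table <- data.frame(
  CP     = c(0.0439394, 0.0242424, 0.0181818, 0.0166667, 0.0151515,
             0.0127273, 0.0121212, 0.0111111, 0.0106061, 0.0090909,
             0.0080808, 0.0060606, 0.0060606),
  nsplit = c(0, 2, 3, 5, 8, 13, 18, 19, 22, 25, 29, 32, 34),
  xerror = c(1.00000, 0.96364, 0.96364, 0.96364, 0.93939, 0.92727,
             0.85152, 0.86364, 0.86364, 0.84848, 0.84848, 0.84242,
             0.84545),
  xstd   = c(0.050442, 0.049688, 0.049688, 0.049688, 0.049172, 0.048909,
             0.047202, 0.047483, 0.047483, 0.047131, 0.047131, 0.046989,
             0.047060)
)

# Row with the smallest cross-validated error.
best <- which.min(cp_table$xerror)
best_cp <- cp_table$CP[best]

# One-standard-error rule: the simplest tree whose xerror is within
# one xstd of the minimum.
threshold <- cp_table$xerror[best] + cp_table$xstd[best]
one_se <- min(which(cp_table$xerror <= threshold))

print(cp_table[c(best, one_se), ])
```

Here the minimum falls on row 12 (32 splits), which is consistent with the 12-row table printed next. The one-standard-error rule would instead select the smaller 18-split tree in row 7.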
##
## Classification tree:
## rpart(formula = Attrition ~ ., data = train[, -c(1)], method = "class",
## model = TRUE, control = control.parameter1)
##
## Variables actually used in tree construction:
## [1] Age DailyRate
## [3] Department DistanceFromHome
## [5] EducationField HourlyRate
## [7] JobInvolvement JobLevel
## [9] JobRole MaritalStatus
## [11] MonthlyIncome MonthlyRate
## [13] NumCompaniesWorked OverTime
## [15] StockOptionLevel TotalWorkingYears
## [17] TrainingTimesLastYear WorkLifeBalance
## [19] YearsInCurrentRole YearsSinceLastPromotion
##
## Root node error: 330/2058 = 0.16035
##
## n= 2058
##
## CP nsplit rel error xerror xstd
## 1 0.0439394 0 1.00000 1.00000 0.050442
## 2 0.0242424 2 0.91212 0.96364 0.049688
## 3 0.0181818 3 0.88788 0.96364 0.049688
## 4 0.0166667 5 0.85152 0.96364 0.049688
## 5 0.0151515 8 0.79394 0.93939 0.049172
## 6 0.0127273 13 0.71212 0.92727 0.048909
## 7 0.0121212 18 0.64848 0.85152 0.047202
## 8 0.0111111 19 0.63636 0.86364 0.047483
## 9 0.0106061 22 0.60303 0.86364 0.047483
## 10 0.0090909 25 0.56364 0.84848 0.047131
## 11 0.0080808 29 0.52727 0.84848 0.047131
## 12 0.0060606 32 0.50303 0.84242 0.046989
## 905 758 1623 166 1376 1420
## No No No No Yes No
## Levels: No Yes
## predictTrain
## No Yes
## No 1676 52
## Yes 114 216
## Warning: package 'ROCR' was built under R version 3.5.1
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.5.1
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
## predict.class
## Attrition No Yes
## No 1676 52
## Yes 114 216
## [1] "True Positive Rate/Precision: 0.806"
## [1] "Sensitivity/Recall Rate: 0.806"
## [1] "False Positive Rate: 0.0637"
## [1] "Accuracy: 0.9193"
## [1] "Specificity/TNR: 0.9363"
## [1] "Area under ROC Curve: 0.856222818462402"
## [1] "KS: 0.626262626262626"
## [1] "Gini Index 0.59820508289896"
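The labels above pair precision with the true positive rate, but these are different ratios: precision divides true positives by all predicted positives, while recall (sensitivity, TPR) divides them by all actual positives. The sketch below recomputes the usual metrics from the training confusion matrix printed above, treating rows as actual `Attrition` and columns as the predicted class. The object names (`cm`, `metrics`) are illustrative and are not taken from the original script.

```r
# Training confusion matrix from the output above:
# rows = actual Attrition, columns = predicted class.
cm <- matrix(c(1676, 52,
               114, 216),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("No", "Yes"),
                             predicted = c("No", "Yes")))

tp <- cm["Yes", "Yes"]; fn <- cm["Yes", "No"]
fp <- cm["No", "Yes"];  tn <- cm["No", "No"]

metrics <- c(
  precision   = tp / (tp + fp),  # of predicted leavers, share who left
  recall      = tp / (tp + fn),  # of actual leavers, share caught (TPR)
  fpr         = fp / (fp + tn),  # of actual stayers, share flagged
  specificity = tn / (tn + fp),  # 1 - fpr (TNR)
  accuracy    = (tp + tn) / sum(cm)
)
print(round(metrics, 4))
```

On this matrix, precision (216/268 ≈ 0.806) and accuracy (≈ 0.9193) match the printed values, but recall is 216/330 ≈ 0.6545 and specificity is 1676/1728 ≈ 0.9699. The 0.806 printed under "Recall" therefore appears to be precision, and the printed 0.0637 and 0.9363 equal 114/1790 and 1676/1790, which are column-wise rather than row-wise ratios.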
CART Model performance on Test Data
## predictTest
## No Yes
## No 710 28
## Yes 68 76
## predict.class
## Attrition No Yes
## No 710 28
## Yes 68 76
## [1] "True Positive Rate/Precision: 0.5278"
## [1] "Sensitivity/Recall Rate: 0.5278"
## [1] "False Positive Rate: 0.0379"
## [1] "Accuracy: 0.8912"
## [1] "Specificity/TNR: 0.9621"
## [1] "Area under ROC Curve: 0.7742"
## [1] "KS: 0.6263"
## [1] "Gini Index 0.5862"
On the test sample, the CART model achieves 89.12% accuracy, 52.78% sensitivity, 96.21% specificity, an area under the ROC curve of 0.774, a KS value of 0.626, and a Gini index of 0.586. Let us now evaluate the performance of the Random Forest model.
Employee Attrition prediction using Random Forest
## Warning: package 'randomForest' was built under R version 3.5.1
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
## [1] "No of observations in training dataset: 2058"
## [1] "No of observations in test dataset: 882"
## 'data.frame': 2058 obs. of 33 variables:
## $ EmployeeNumber : int 905 758 1623 166 1376 1420 2384 1087 1603 500 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 2 1 2 1 ...
## $ Age : int 48 34 53 50 32 42 45 50 31 33 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 3 3 2 3 3 2 3 3 ...
## $ DailyRate : int 715 216 1436 1452 238 557 1449 333 542 1216 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 2 3 3 2 2 2 3 2 3 3 ...
## $ DistanceFromHome : int 1 1 6 11 5 18 2 22 20 8 ...
## $ Education : Factor w/ 5 levels "1","2","3","4",..: 3 4 2 3 2 4 3 5 3 4 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 3 3 2 2 2 3 4 2 3 ...
## $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 2 2 3 1 4 1 3 2 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 2 1 1 2 1 2 1 2 ...
## $ HourlyRate : int 76 75 34 53 47 35 94 88 71 39 ...
## $ JobInvolvement : Factor w/ 4 levels "1","2","3","4": 2 4 3 3 4 3 1 1 1 3 ...
## $ JobLevel : Factor w/ 5 levels "1","2","3","4",..: 5 2 2 5 1 2 5 4 2 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 6 8 9 4 7 7 4 6 8 8 ...
## $ JobSatisfaction : Factor w/ 4 levels "1","2","3","4": 4 4 3 2 3 1 2 4 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 1 2 3 3 1 3 3 2 1 ...
## $ MonthlyIncome : int 18265 9725 2306 19926 2432 5410 18824 14411 4559 7104 ...
## $ MonthlyRate : int 8733 12278 16047 17053 15318 11189 2493 24450 24788 20431 ...
## $ NumCompaniesWorked : int 6 0 2 3 3 6 2 1 3 0 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 2 1 2 2 2 2 2 1 ...
## $ PercentSalaryHike : int 12 11 20 15 14 17 16 13 11 12 ...
## $ PerformanceRating : Factor w/ 2 levels "3","4": 1 1 2 1 1 1 1 1 1 1 ...
## $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 3 4 4 2 1 3 1 4 3 4 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : Factor w/ 4 levels "0","1","2","3": 1 2 2 1 1 2 1 1 2 1 ...
## $ TotalWorkingYears : int 25 16 13 21 8 9 26 32 4 6 ...
## $ TrainingTimesLastYear : int 3 2 3 5 2 3 2 2 2 3 ...
## $ WorkLifeBalance : Factor w/ 4 levels "1","2","3","4": 4 2 1 3 3 2 3 3 3 3 ...
## $ YearsAtCompany : int 1 15 7 5 4 4 24 32 2 5 ...
## $ YearsInCurrentRole : int 0 1 7 4 1 3 10 6 2 0 ...
## $ YearsSinceLastPromotion : int 0 0 4 4 0 1 1 13 2 1 ...
## $ YearsWithCurrManager : int 0 9 5 4 3 2 11 9 2 2 ...
##
## Call:
## randomForest(formula = Attrition ~ ., data = trainRF[, -1], ntree = 301, mtry = 3, nodesize = 2, importance = TRUE)
## Type of random forest: classification
## Number of trees: 301
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 4.91%
## Confusion matrix:
## No Yes class.error
## No 1726 2 0.001157407
## Yes 99 231 0.300000000
##
## Call:
## randomForest(formula = Attrition ~ ., data = trainRF[, -1], ntree = 50, mtry = 3, nodesize = 2, importance = TRUE)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 3
##
## OOB estimate of error rate: 5.49%
## Confusion matrix:
## No Yes class.error
## No 1724 4 0.002314815
## Yes 109 221 0.330303030
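The OOB error rate and the `class.error` column in the two printouts above both follow directly from the OOB confusion matrices. A minimal sketch under that reading, with the matrices typed in by hand; `oob_summary` is a hypothetical helper and not part of the randomForest package:

```r
# OOB confusion matrices copied from the two randomForest printouts above
# (rows = actual class, columns = OOB-predicted class).
oob_301 <- matrix(c(1726, 2, 99, 231), nrow = 2, byrow = TRUE,
                  dimnames = list(c("No", "Yes"), c("No", "Yes")))
oob_50  <- matrix(c(1724, 4, 109, 221), nrow = 2, byrow = TRUE,
                  dimnames = list(c("No", "Yes"), c("No", "Yes")))

# Hypothetical helper: overall OOB error and per-class error.
oob_summary <- function(cm) {
  list(
    oob_error   = 1 - sum(diag(cm)) / sum(cm),
    class_error = 1 - diag(cm) / rowSums(cm)
  )
}

s301 <- oob_summary(oob_301)
s50  <- oob_summary(oob_50)
print(round(100 * c(ntree_301 = s301$oob_error, ntree_50 = s50$oob_error), 2))
print(s301$class_error)
```

This reproduces the printed 4.91% and 5.49%. It also makes the class imbalance visible: with 301 trees the out-of-bag error is about 0.1% for "No" but 30% for "Yes", so the low overall error rate mostly reflects the majority class.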
## No Yes MeanDecreaseAccuracy MeanDecreaseGini
## EducationField 8.74 11.58 11.61 18.83
## DailyRate 7.55 9.47 11.38 26.24
## Age 9.55 9.05 11.36 33.75
## TotalWorkingYears 9.26 8.14 11.33 28.17
## OverTime 9.32 9.86 11.04 21.55
## JobInvolvement 7.52 6.91 10.09 14.31
## HourlyRate 7.01 9.15 9.56 26.28
## JobRole 7.85 8.68 9.51 24.19
## JobSatisfaction 7.00 9.05 9.16 17.33
## WorkLifeBalance 6.32 7.56 8.93 15.52
## MonthlyRate 7.15 6.73 8.87 26.82
## StockOptionLevel 7.24 8.69 8.70 15.55
## RelationshipSatisfaction 5.75 9.13 8.28 15.67
## EnvironmentSatisfaction 6.64 8.53 8.24 19.13
## BusinessTravel 6.79 7.56 8.06 12.60
## MonthlyIncome 5.81 9.40 8.05 32.37
## YearsAtCompany 6.33 7.86 7.96 20.90
## PercentSalaryHike 6.27 7.52 7.78 18.22
## YearsWithCurrManager 5.47 7.06 7.41 18.16
## JobLevel 6.32 6.61 7.34 12.00
## NumCompaniesWorked 6.65 6.16 7.06 17.51
## TrainingTimesLastYear 5.51 6.08 6.81 12.52
## Education 5.62 5.49 6.61 14.78
## DistanceFromHome 4.68 7.63 6.54 21.20
## MaritalStatus 5.93 6.03 6.36 11.69
## YearsInCurrentRole 5.21 4.77 6.26 13.13
## Gender 3.84 4.67 5.70 4.86
## Department 4.40 5.49 5.51 6.87
## YearsSinceLastPromotion 3.52 5.97 4.75 11.95
## PerformanceRating 3.74 2.43 4.06 2.17
## StandardHours 0.00 0.00 0.00 0.00
## [1] "EmployeeNumber" "Attrition"
## [3] "Age" "BusinessTravel"
## [5] "DailyRate" "Department"
## [7] "DistanceFromHome" "Education"
## [9] "EducationField" "EnvironmentSatisfaction"
## [11] "Gender" "HourlyRate"
## [13] "JobInvolvement" "JobLevel"
## [15] "JobRole" "JobSatisfaction"
## [17] "MaritalStatus" "MonthlyIncome"
## [19] "MonthlyRate" "NumCompaniesWorked"
## [21] "OverTime" "PercentSalaryHike"
## [23] "PerformanceRating" "RelationshipSatisfaction"
## [25] "StandardHours" "StockOptionLevel"
## [27] "TotalWorkingYears" "TrainingTimesLastYear"
## [29] "WorkLifeBalance" "YearsAtCompany"
## [31] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [33] "YearsWithCurrManager"
## mtry = 5 OOB error = 6.41%
## Searching left ...
## mtry = 3 OOB error = 6.85%
## -0.06818182 0.005
## Searching right ...
## mtry = 10 OOB error = 6.22%
## 0.03030303 0.005
## mtry = 20 OOB error = 6.12%
## 0.015625 0.005
## mtry = 29 OOB error = 5.83%
## 0.04761905 0.005
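The two-number lines in the search trace above appear to be the relative improvement in OOB error at each step, followed by the minimum improvement (0.005) required to keep searching. The sketch below reproduces those figures. The misclassification counts are an assumption: they are inferred from the printed percentages and the 2058 training rows, not read from the model object, and `rel_improve` is an illustrative helper.

```r
n <- 2058  # training rows

# OOB misclassification counts inferred from the printed percentages
# (an assumption: e.g. 132 / 2058 = 6.41%).
errors <- c(mtry_5 = 132, mtry_3 = 141, mtry_10 = 128,
            mtry_20 = 126, mtry_29 = 120)
print(round(100 * errors / n, 2))

# Relative improvement of a candidate over the current reference error.
rel_improve <- function(new, old) 1 - new / old

improvement <- c(
  left_3   = rel_improve(errors[["mtry_3"]],  errors[["mtry_5"]]),
  right_10 = rel_improve(errors[["mtry_10"]], errors[["mtry_5"]]),
  right_20 = rel_improve(errors[["mtry_20"]], errors[["mtry_10"]]),
  right_29 = rel_improve(errors[["mtry_29"]], errors[["mtry_20"]])
)
print(improvement)

# A step is accepted only if it beats the threshold printed beside it.
print(improvement > 0.005)
```

Under these assumed counts, moving left to `mtry = 3` makes the error worse (negative improvement), so that direction stops at once, while each step to the right clears the 0.005 threshold until `mtry` reaches 29.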
## No Yes MeanDecreaseAccuracy MeanDecreaseGini
## OverTime 3.421312e-02 0.1497481075 0.0525983030 32.1663063
## MonthlyIncome 2.305700e-02 0.0868095144 0.0332689226 48.9927319
## StockOptionLevel 2.053927e-02 0.0883051888 0.0313137840 20.2533097
## JobRole 2.199264e-02 0.0674163280 0.0292430100 30.2818387
## Age 1.776303e-02 0.0701639178 0.0260889968 36.0998391
## YearsAtCompany 1.307768e-02 0.0372733649 0.0169107197 20.4645003
## YearsWithCurrManager 1.063692e-02 0.0354600267 0.0145660908 15.4587802
## JobLevel 8.565560e-03 0.0451302376 0.0143403514 11.7340973
## JobSatisfaction 9.510402e-03 0.0328168496 0.0132370804 16.5851081
## DistanceFromHome 7.871728e-03 0.0394132275 0.0129167246 25.7156570
## DailyRate 9.211615e-03 0.0296247003 0.0124675638 31.7446777
## EnvironmentSatisfaction 7.873530e-03 0.0345239847 0.0121113658 19.6163238
## NumCompaniesWorked 7.351471e-03 0.0347614642 0.0117274889 19.2439180
## EducationField 7.763148e-03 0.0299764154 0.0113065589 18.9344575
## HourlyRate 6.465411e-03 0.0251595161 0.0094442778 23.2559919
## RelationshipSatisfaction 5.686965e-03 0.0222657718 0.0083454513 14.1076408
## MonthlyRate 5.311772e-03 0.0216995825 0.0079193504 23.4294266
## WorkLifeBalance 5.389711e-03 0.0205250929 0.0078118417 14.8967726
## PercentSalaryHike 5.049804e-03 0.0153337177 0.0066869957 13.8480141
## YearsSinceLastPromotion 4.507368e-03 0.0178477923 0.0066511095 12.2889447
## MaritalStatus 4.370844e-03 0.0181945840 0.0065769187 8.3304856
## BusinessTravel 4.349745e-03 0.0160812637 0.0062225412 10.6497942
## JobInvolvement 4.087245e-03 0.0174524284 0.0062126648 14.2958563
## YearsInCurrentRole 3.815498e-03 0.0110609897 0.0049648003 8.6512194
## TrainingTimesLastYear 3.199792e-03 0.0126043172 0.0046966303 12.7096981
## Education 2.778793e-03 0.0074466212 0.0035200736 9.2211762
## Department 7.324137e-04 0.0026808524 0.0010468276 1.7209162
## Gender 2.589403e-04 0.0011203404 0.0003930719 1.2697352
## PerformanceRating 8.542784e-05 0.0002876599 0.0001171267 0.2669939
## predictionRf
## No Yes
## No 1728 0
## Yes 6 324
## [1] "True Positive Rate/Precision: 0.9758"
## [1] "Sensitivity/Recall Rate: 0.9758"
## [1] "False Positive Rate: 0"
## [1] "Accuracy: 0.9961"
## [1] "Specificity/TNR: 1"
## [1] "KSRF: 1"
## [1] "Gini Index 0.736"
## [1] "KSRF - Test : 0.8855"
## [1] "Gini Index 0.6309"
## predictionRf.test
## No Yes
## No 734 4
## Yes 30 114
## [1] "True Positive Rate/Precision: 0.7778"
## [1] "Sensitivity/Recall Rate: 0.7778"
## [1] "False Positive Rate: 0.0054"
## [1] "Accuracy: 0.9592"
## [1] "Specificity/TNR: 0.9946"