Import the data file
emp_raw <- read.csv("HR_Employee_Attrition_Data.csv")
emp <- emp_raw
Drop the columns with no variability. Also, drop EmployeeNumber, as it is just an identifier.
library(caret)
## Warning: package 'caret' was built under R version 3.3.3
## Loading required package: lattice
## Loading required package: ggplot2
nearZeroVar(emp)
## [1] 9 22 27
emp$Over18 <- NULL
emp$EmployeeCount <- NULL
emp$StandardHours <- NULL
emp$EmployeeNumber <- NULL
nrow(emp)
## [1] 2940
table(emp$Attrition)
##
## No Yes
## 2466 474
prop.table(table(emp$Attrition))
##
## No Yes
## 0.8387755 0.1612245
ggplot(emp, aes(Attrition)) + geom_bar(fill = "salmon")
The overall attrition rate is approximately 16%: 474 out of 2940 employees have attrited.
As per the CART model (built later in this analysis), OverTime, MonthlyIncome, TotalWorkingYears, HourlyRate, JobRole and Age are the most important factors influencing attrition. Let’s explore these variables.
prop.table(table(emp$OverTime))
##
## No Yes
## 0.7170068 0.2829932
table(emp$OverTime, emp$Attrition)
##
## No Yes
## No 1888 220
## Yes 578 254
ggplot(emp, aes(OverTime, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")
Overall, 28% of the employees are putting in overtime. The attrition rate among those putting in overtime is about 31% (254/832), versus about 10% (220/2108) for those who are not. Thus OverTime is contributing towards attrition.
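As a cross-check, a row-wise proportion table (a small sketch reusing the same emp data) makes these rates explicit:
# Attrition rate within each OverTime group (rows sum to 1)
round(prop.table(table(emp$OverTime, emp$Attrition), margin = 1), 3)
# No-overtime row: ~10.4% attrition; overtime row: ~30.5% attrition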
summary(emp$MonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1009 2911 4919 6503 8380 20000
MnthlyIncome <- cut(emp$MonthlyIncome, 10, include.lowest = TRUE, labels=c(1,2,3,4,5,6,7,8,9,10))
ggplot(emp, aes(MnthlyIncome, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")
Attrition, in absolute terms, decreases as salary increases; thus lower salary is contributing towards attrition.
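To quantify the trend, here is a quick sketch computing the attrition rate within each income band, reusing the MnthlyIncome bins created above:
# Share of leavers within each monthly-income band
round(tapply(emp$Attrition == "Yes", MnthlyIncome, mean), 3)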
summary(emp$TotalWorkingYears)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 6.00 10.00 11.28 15.00 40.00
TtlWkgYrs <- cut(emp$TotalWorkingYears, 10, include.lowest = TRUE)
ggplot(emp, aes(TtlWkgYrs, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")
Attrition, in absolute terms, decreases as the total number of working years increases. Once an employee has accumulated 8-12 years of total working experience, the chances of attrition decrease.
summary(emp$HourlyRate)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 30.00 48.00 66.00 65.89 84.00 100.00
HrlyRate<- cut(emp$HourlyRate, 7, include.lowest = TRUE)
ggplot(emp, aes(HrlyRate, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")
No clear pattern in attrition is visible across the hourly-rate bands.
table(emp$JobRole)
##
## Healthcare Representative Human Resources
## 262 104
## Laboratory Technician Manager
## 518 204
## Manufacturing Director Research Director
## 290 160
## Research Scientist Sales Executive
## 584 652
## Sales Representative
## 166
table(emp$JobRole, emp$Attrition)
##
## No Yes
## Healthcare Representative 244 18
## Human Resources 80 24
## Laboratory Technician 394 124
## Manager 194 10
## Manufacturing Director 270 20
## Research Director 156 4
## Research Scientist 490 94
## Sales Executive 538 114
## Sales Representative 100 66
ggplot(emp, aes(JobRole, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")
In absolute terms, Laboratory Technicians, followed by Sales Executives, contribute the most towards attrition. In percentage terms, Sales Representatives are far ahead at roughly 40% (66 of 166).
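A row-wise proportion table confirms the role-level rates (note that 66 is a count, and the Sales Representative rate is 66/166, roughly 40%). A quick sketch:
# Attrition rate within each job role, sorted in decreasing order
roleRates <- prop.table(table(emp$JobRole, emp$Attrition), margin = 1)[, "Yes"]
round(sort(roleRates, decreasing = TRUE), 3)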
summary(emp$Age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 30.00 36.00 36.92 43.00 60.00
A_g_e <- cut(emp$Age, 8, include.lowest = TRUE)
ggplot(emp, aes(A_g_e, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")
In percentage terms, the 18-23 age group contributes the most towards attrition. Beyond roughly age 34, attrition shows a downward trend.
Split the data - create Training and Testing Sets.
set.seed(777)
library(caTools)
## Warning: package 'caTools' was built under R version 3.3.3
split = sample.split(emp$Attrition, SplitRatio = 0.70)
# Create training and testing sets
train = subset(emp, split == TRUE)
test = subset(emp, split == FALSE)
Baseline Accuracy - If we just predict attrition as “No” for every observation, we will get an accuracy of 83.90%.
# Baseline Accuracy
table(test$Attrition)
##
## No Yes
## 740 142
740/nrow(test)
## [1] 0.8390023
# Load CART packages
library(rpart)
## Warning: package 'rpart' was built under R version 3.3.3
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.3.3
modelCart = rpart(Attrition ~ ., data=train, method="class")
#Plot the model
prp(modelCart)
#Predict the test data
predictionCart <- predict(modelCart, newdata=test, type="class")
#CART Accuracy
#Confusion matrix
t1 <- table(test$Attrition, predictionCart)
#CART model accuracy
(t1[1]+t1[4])/(nrow(test))
## [1] 0.8526077
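Since t1[1] + t1[4] is just the diagonal of the 2x2 confusion matrix, a small helper function (a sketch we could reuse for the later models) makes the intent clearer:
# Accuracy = correctly classified (diagonal) / all observations
accuracy <- function(confMat) sum(diag(confMat)) / sum(confMat)
accuracy(t1)   # same value as above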
The CART model has improved on the baseline accuracy, but not by much (85.26% vs 83.90%). We will skip pruning, as it did not improve accuracy at all in the last assignment.
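For reference, a pruning step would look like the sketch below, selecting the complexity parameter with the lowest cross-validated error; we do not run it here.
# Prune at the cp with minimum cross-validated error (xerror)
cpBest <- modelCart$cptable[which.min(modelCart$cptable[, "xerror"]), "CP"]
prunedCart <- prune(modelCart, cp = cpBest)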
# Load Random Forest package
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.3.3
modelRf = randomForest(Attrition ~ ., data=train, ntree = 100, mtry = 5, importance = TRUE, method="class")
#Plot the model
print(modelRf)
##
## Call:
## randomForest(formula = Attrition ~ ., data = train, ntree = 100, mtry = 5, importance = TRUE, method = "class")
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 5
##
## OOB estimate of error rate: 4.57%
## Confusion matrix:
## No Yes class.error
## No 1720 6 0.003476246
## Yes 88 244 0.265060241
#OOB vs No. Of Trees
plot(modelRf, main="")
legend("topright", c("OOB", "0", "1"), text.col=1:6, lty=1:3, col=1:3)
title(main="Error Rates Random Forest")
# List the importance of the variables.
impVar <- round(randomForest::importance(modelRf), 2)
impVar[order(impVar[,3], decreasing=TRUE),]
## No Yes MeanDecreaseAccuracy MeanDecreaseGini
## OverTime 17.32 19.21 20.00 31.57
## DailyRate 12.52 17.14 17.82 30.00
## HourlyRate 11.89 16.76 17.37 27.15
## Age 14.16 14.20 17.15 34.25
## JobRole 13.95 14.35 16.75 26.13
## DistanceFromHome 11.82 14.14 16.65 27.02
## MonthlyIncome 12.37 15.61 16.29 44.21
## EnvironmentSatisfaction 10.62 13.24 14.10 17.91
## MonthlyRate 9.06 13.27 13.12 27.99
## TotalWorkingYears 11.01 10.00 12.97 24.97
## PercentSalaryHike 9.80 11.37 12.94 19.12
## EducationField 10.72 10.60 12.62 17.78
## YearsWithCurrManager 10.74 10.44 12.45 16.49
## NumCompaniesWorked 9.50 11.95 12.25 16.97
## YearsAtCompany 11.39 9.66 12.17 22.39
## YearsSinceLastPromotion 8.99 10.37 11.90 14.24
## Education 9.91 7.31 11.79 11.00
## StockOptionLevel 10.38 10.19 11.64 15.83
## WorkLifeBalance 8.56 11.32 11.58 12.73
## JobSatisfaction 9.88 10.28 11.43 14.62
## YearsInCurrentRole 7.51 9.55 11.12 14.89
## TrainingTimesLastYear 8.45 9.68 10.81 14.38
## RelationshipSatisfaction 8.05 9.56 10.60 11.64
## MaritalStatus 9.18 9.38 10.07 11.61
## BusinessTravel 8.52 8.38 9.38 11.53
## JobInvolvement 7.30 9.81 9.14 15.94
## JobLevel 7.79 7.83 8.77 13.67
## Department 6.55 7.41 8.22 5.49
## PerformanceRating 3.22 4.75 5.52 2.80
## Gender 5.38 3.21 5.36 3.13
# Tuning Random Forest
tunedRf <- tuneRF(x = train[, -2],
                  y = as.factor(train$Attrition),
                  mtryStart = 5,
                  ntreeTry = 60,
                  stepFactor = 2,
                  improve = 0.001,
                  trace = TRUE,
                  plot = TRUE,
                  doBest = TRUE,
                  nodesize = 5,
                  importance = TRUE)
## mtry = 5 OOB error = 6.66%
## Searching left ...
## mtry = 3 OOB error = 7%
## -0.05109489 0.001
## Searching right ...
## mtry = 10 OOB error = 5.83%
## 0.1240876 0.001
## mtry = 20 OOB error = 5.64%
## 0.03333333 0.001
## mtry = 30 OOB error = 5.25%
## 0.06896552 0.001
impvarTunedRf <- tunedRf$importance
impvarTunedRf[order(impvarTunedRf[,3], decreasing=TRUE),]
## No Yes MeanDecreaseAccuracy
## OverTime 3.674092e-02 0.1667691646 0.0576395975
## MonthlyIncome 3.268947e-02 0.1254217065 0.0475849764
## JobRole 2.941582e-02 0.0942117181 0.0397810489
## Age 2.141759e-02 0.0601320879 0.0276401058
## TotalWorkingYears 1.455379e-02 0.0347263189 0.0177597347
## DailyRate 1.113130e-02 0.0453898972 0.0166284494
## StockOptionLevel 9.046704e-03 0.0376012830 0.0135968107
## EnvironmentSatisfaction 6.130424e-03 0.0389680116 0.0114104307
## YearsAtCompany 8.189525e-03 0.0241061639 0.0107252280
## DistanceFromHome 7.160049e-03 0.0293487146 0.0107039021
## HourlyRate 6.379418e-03 0.0243691692 0.0092769217
## YearsWithCurrManager 7.336233e-03 0.0192609916 0.0092546631
## JobInvolvement 5.314447e-03 0.0280428066 0.0089592523
## NumCompaniesWorked 6.483768e-03 0.0205790769 0.0087458960
## MonthlyRate 5.206747e-03 0.0248349592 0.0083532979
## MaritalStatus 5.235030e-03 0.0221778228 0.0079683717
## EducationField 5.463110e-03 0.0197898666 0.0077799366
## JobSatisfaction 5.062239e-03 0.0194320409 0.0073663880
## BusinessTravel 5.305516e-03 0.0179475686 0.0073431044
## WorkLifeBalance 4.798595e-03 0.0200955019 0.0072560857
## JobLevel 3.970271e-03 0.0227245670 0.0069841502
## YearsSinceLastPromotion 5.018073e-03 0.0160183508 0.0067753955
## PercentSalaryHike 4.621691e-03 0.0144902501 0.0062057635
## TrainingTimesLastYear 3.894748e-03 0.0164957317 0.0059222692
## YearsInCurrentRole 3.317704e-03 0.0115006854 0.0046358212
## Education 2.673795e-03 0.0080349286 0.0035239971
## RelationshipSatisfaction 1.800002e-03 0.0092262596 0.0029895233
## Department 5.475392e-04 0.0014427027 0.0006921772
## Gender 1.834251e-04 0.0007469336 0.0002706506
## PerformanceRating 6.947282e-05 0.0003719227 0.0001184777
## MeanDecreaseGini
## OverTime 39.9988731
## MonthlyIncome 64.5393310
## JobRole 31.4530547
## Age 30.4219463
## TotalWorkingYears 25.3786984
## DailyRate 37.0597263
## StockOptionLevel 17.9044688
## EnvironmentSatisfaction 15.0264290
## YearsAtCompany 14.2737445
## DistanceFromHome 23.8306455
## HourlyRate 24.8692498
## YearsWithCurrManager 10.7489891
## JobInvolvement 15.8718079
## NumCompaniesWorked 14.2662131
## MonthlyRate 23.8923478
## MaritalStatus 10.3925575
## EducationField 15.2117212
## JobSatisfaction 10.5671425
## BusinessTravel 9.6970989
## WorkLifeBalance 13.5919405
## JobLevel 3.5939619
## YearsSinceLastPromotion 12.6421162
## PercentSalaryHike 13.5701889
## TrainingTimesLastYear 14.7839038
## YearsInCurrentRole 7.6675645
## Education 7.3502480
## RelationshipSatisfaction 6.4095451
## Department 0.9804010
## Gender 1.0998102
## PerformanceRating 0.6185574
predictionRf <- predict(tunedRf, test, type="class")
#RandomForest Accuracy
#Confusion matrix
t2 <- table(test$Attrition, predictionRf)
#RandomForest model accuracy
(t2[1]+t2[4])/(nrow(test))
## [1] 0.9421769
Random forest has improved the accuracy to about 94%, showing that bagging an ensemble of trees can do better than a single tree model (CART).
We will use the nnet package instead of the neuralnet package: nnet does not require converting factor/character variables into numeric ones.
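For contrast, here is a sketch (not run here) of the extra preparation neuralnet would need: every predictor must first be expanded into a numeric design matrix.
# neuralnet requires all-numeric inputs, e.g. via model.matrix
X <- model.matrix(Attrition ~ ., data = train)[, -1]   # dummy-code factors, drop intercept
y <- as.numeric(train$Attrition == "Yes")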
# Using the nnet package (2 lines of code)
# No need to convert factors to numeric, unlike the neuralnet package
library(nnet)
set.seed(777)
modelNN <- nnet(Attrition ~ ., train, size = 21, rang = 0.07, Hess = FALSE, decay = 15e-4, maxit = 2000)
## # weights: 967
## initial value 1259.026759
## iter 10 value 899.779328
## ... (intermediate iterations omitted) ...
## iter2000 value 324.678115
## final value 324.678115
## stopped after 2000 iterations
predictionNN <- predict(modelNN, test, type = "class")
table(predictionNN)
## predictionNN
## No Yes
## 772 110
library(devtools)
## Warning: package 'devtools' was built under R version 3.3.3
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
## SHA-1 hash of file is 74c80bd5ddbc17ab3ae5ece9c0ed9beb612e87ef
plot.nnet(modelNN)
## Loading required package: scales
## Loading required package: reshape
## Warning: package 'reshape' was built under R version 3.3.3
#Confusion matrix
t3 <- table(test$Attrition, predictionNN)
#NeuralNetwork model accuracy
(t3[1]+t3[4])/(nrow(test))
## [1] 0.893424
The neural network gives an accuracy of approximately 89%.
We will create an ensemble of the three models - CART, random forest and neural network - to see if it improves accuracy.
predictions <- data.frame(predictionCart = predictionCart,
                          predictionRf = predictionRf,
                          predictionNN = predictionNN)
# Ensemble: predict "Yes" when any two of the three models agree on "Yes"
predictions$predictionEnsemble <- as.factor(
  ifelse(predictions$predictionCart == 'Yes' & predictions$predictionRf == 'Yes', 'Yes',
  ifelse(predictions$predictionCart == 'Yes' & predictions$predictionNN == 'Yes', 'Yes',
  ifelse(predictions$predictionRf == 'Yes' & predictions$predictionNN == 'Yes', 'Yes', 'No'))))
#Confusion Matrix
t4 <- table(test$Attrition, predictions$predictionEnsemble)
#Ensemble accuracy
(t4[1]+t4[4])/(nrow(test))
## [1] 0.9092971
The ensemble has been unable to beat the 94% accuracy achieved by the random forest alone. Thus, ensembling may not be useful in all circumstances.
OverTime remains the single most important factor contributing towards attrition for both the CART and random forest models.
Age and TotalWorkingYears are also among the top factors influencing attrition in the random forest model (4th and 5th by MeanDecreaseAccuracy in the tuned model).
Ensembling the models has not really helped improve accuracy; a single random forest model has done better. We should check the correlation between the predictions of the three models, as ensembles usually gain accuracy only when the predictions are relatively uncorrelated. Stacking is another option worth trying on a larger dataset.
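As a first step, here is a quick sketch of the pairwise agreement between the three prediction vectors; agreement close to 1 means the models make nearly the same calls, leaving little for an ensemble to gain.
# Pairwise agreement between the three sets of test-set predictions
mean(predictionCart == predictionRf)
mean(predictionCart == predictionNN)
mean(predictionRf == predictionNN)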