Problem Statement

  1. Data Import (Target variable is “Attrition” column)
  4. Split the data into Development and Hold-Out samples (70:30)
  5. Perform Exploratory Data Analysis
  6. Identify columns which are of no use and drop them
  7. Write hypotheses and validate them
  8. Build Neural Network Model (Development sample)
  9. Validate the NN model on the Hold-Out sample; improve it if need be
  8. Build Random Forest Model
  9. Validate RF Model
  10. Compare the NN, RF and CART models
  11. Combine NN, RF and CART into an Ensemble Model
  12. Check whether the Ensemble Model outperforms the individual models

Data Preparation

Import the data file

emp_raw <- read.csv("HR_Employee_Attrition_Data.csv")
emp <- emp_raw

Drop the columns with no variability.

Also, drop Employee Number as it is just an identifier.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
nearZeroVar(emp)  # 9, 22, 27 = EmployeeCount, Over18, StandardHours
## [1]  9 22 27
emp$Over18 <- NULL
emp$EmployeeCount <- NULL
emp$StandardHours <- NULL
emp$EmployeeNumber <- NULL
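
An equivalent, index-driven sketch of the same clean-up (not run here; it reuses the indices returned by nearZeroVar above):

# Drop the near-zero-variance columns by index, then the identifier by name
nzv <- nearZeroVar(emp_raw)  # 9, 22, 27 on this data
emp <- emp_raw[, -nzv]
emp$EmployeeNumber <- NULL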

Exploratory Data Analysis

Attrition

nrow(emp)
## [1] 2940
table(emp$Attrition)
## 
##   No  Yes 
## 2466  474
prop.table(table(emp$Attrition))
## 
##        No       Yes 
## 0.8387755 0.1612245
ggplot(emp, aes(Attrition)) + geom_bar(fill="salmon")

The overall attrition rate is approximately 16%: 474 out of 2940 employees have attrited.

As per the CART model built previously, OverTime, MonthlyIncome, TotalWorkingYears, HourlyRate, JobRole and Age are the most important factors influencing the attrition rates. Let’s explore these variables.

Overtime

prop.table(table(emp$OverTime))
## 
##        No       Yes 
## 0.7170068 0.2829932
table(emp$OverTime, emp$Attrition)
##      
##         No  Yes
##   No  1888  220
##   Yes  578  254
ggplot(emp, aes(OverTime, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")

Overall, 28% of the employees are putting in overtime. The attrition rate amongst those putting in overtime is about 31% (254/832) vs about 10% (220/2108) for those who are not. Thus, overtime is contributing towards attrition.
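
Row-wise proportions make these rates explicit; a quick sketch:

# Attrition rate within each OverTime group (rows sum to 1)
round(prop.table(table(emp$OverTime, emp$Attrition), margin = 1), 3)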

MonthlyIncome

summary(emp$MonthlyIncome)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1009    2911    4919    6503    8380   20000
# 10 equal-width bins; the name differs from emp$MonthlyIncome so aes() picks up this vector
MnthlyIncome <- cut(emp$MonthlyIncome, 10, include.lowest = TRUE, labels=c(1,2,3,4,5,6,7,8,9,10))
ggplot(emp, aes(MnthlyIncome, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")

Attrition in absolute terms decreases as the salary increases; thus, lower salary appears to contribute towards attrition.
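
Counts alone can mislead here, since the lower-income bins also contain more employees; a sketch of the attrition rate per bin, reusing the MnthlyIncome bins created above:

# Proportion of "Yes" in each income bin
tapply(emp$Attrition == "Yes", MnthlyIncome, mean)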

TotalWorkingYears

summary(emp$TotalWorkingYears)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    6.00   10.00   11.28   15.00   40.00
TtlWkgYrs <- cut(emp$TotalWorkingYears, 10, include.lowest = TRUE)
ggplot(emp, aes(TtlWkgYrs, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")

Attrition in absolute terms decreases as the total number of working years increases. After an employee has spent 8-12 years in the company, their chances of attrition decrease.

HourlyRate

summary(emp$HourlyRate)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   30.00   48.00   66.00   65.89   84.00  100.00
HrlyRate<- cut(emp$HourlyRate, 7, include.lowest = TRUE)
ggplot(emp, aes(HrlyRate, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")

No clear pattern in attrition is visible across the hourly-rate bands.

JobRole

table(emp$JobRole)
## 
## Healthcare Representative           Human Resources 
##                       262                       104 
##     Laboratory Technician                   Manager 
##                       518                       204 
##    Manufacturing Director         Research Director 
##                       290                       160 
##        Research Scientist           Sales Executive 
##                       584                       652 
##      Sales Representative 
##                       166
table(emp$JobRole, emp$Attrition)
##                            
##                              No Yes
##   Healthcare Representative 244  18
##   Human Resources            80  24
##   Laboratory Technician     394 124
##   Manager                   194  10
##   Manufacturing Director    270  20
##   Research Director         156   4
##   Research Scientist        490  94
##   Sales Executive           538 114
##   Sales Representative      100  66
ggplot(emp, aes(JobRole, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")

In absolute terms, Laboratory Technicians followed by Sales Executives contribute the most towards attrition. In percentage terms, Sales Representatives are far ahead at about 40% (66/166).
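
The per-role rates behind these numbers, as a quick sketch:

# Attrition rate within each JobRole (rows sum to 1)
round(prop.table(table(emp$JobRole, emp$Attrition), margin = 1), 3)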

Age

summary(emp$Age)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   18.00   30.00   36.00   36.92   43.00   60.00
A_g_e <- cut(emp$Age, 8, include.lowest = TRUE)  # distinct name so aes() does not pick up the uncut emp$Age
ggplot(emp, aes(A_g_e, ..count.., fill = factor(Attrition))) + geom_bar(position="dodge")

The age group of 18-23 contributes the maximum to attrition in percentage terms. Post 34, attrition shows a downward trend.

Hypothesis

  1. Overtime is the single most important factor contributing towards attrition.
  2. Young people with fewer years at the company contribute most to the attrition.
  3. Ensembling different models should improve accuracy.

Data Split

Split the data - create Training and Testing Sets.

set.seed(777)
library(caTools)
split = sample.split(emp$Attrition, SplitRatio = 0.70)

# Create training and testing sets
train = subset(emp, split == TRUE)
test = subset(emp, split == FALSE)
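
sample.split stratifies on the outcome, so both samples should show roughly the same 16% attrition rate; a quick check:

# Verify the split preserved the class balance
prop.table(table(train$Attrition))
prop.table(table(test$Attrition))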

Modeling

1. Baseline Model

Baseline Accuracy - If we just predict attrition as “No” for every observation, we will get an accuracy of 83.90%.

# Baseline Accuracy
table(test$Attrition)
## 
##  No Yes 
## 740 142
740/nrow(test)
## [1] 0.8390023
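
A sketch that avoids hard-coding the majority-class count:

# Majority-class baseline without the hard-coded 740
max(table(test$Attrition)) / nrow(test)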

2. CART

# Load CART packages
library(rpart)
library(rpart.plot)
modelCart = rpart(Attrition ~ ., data=train, method="class")

#Plot the model
prp(modelCart)

#Predict the test data
predictionCart <- predict(modelCart, newdata=test, type="class")

#CART Accuracy
#Confusion matrix 
t1 <- table(test$Attrition, predictionCart)


#CART model accuracy
(t1[1]+t1[4])/(nrow(test))
## [1] 0.8526077
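
Equivalently, the accuracy can be read off the diagonal of the confusion matrix; a small sketch that generalizes beyond 2x2 tables:

# Same accuracy via the table diagonal
sum(diag(t1)) / nrow(test)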

The CART model has improved the accuracy, but not by much. We will skip pruning, as it did not improve the accuracy at all in the last assignment.

3. RandomForest

# Load Random Forest package
library(randomForest)
# note: randomForest() has no "method" argument; it infers classification from the factor response
modelRf = randomForest(Attrition ~ ., data=train, ntree = 100, mtry = 5, importance = TRUE, method="class")

#Print the model
print(modelRf)
## 
## Call:
##  randomForest(formula = Attrition ~ ., data = train, ntree = 100,      mtry = 5, importance = TRUE, method = "class") 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 5
## 
##         OOB estimate of  error rate: 4.57%
## Confusion matrix:
##       No Yes class.error
## No  1720   6 0.003476246
## Yes   88 244 0.265060241
#OOB vs No. Of Trees
plot(modelRf, main="")
legend("topright", c("OOB", "0", "1"), text.col=1:6, lty=1:3, col=1:3)
title(main="Error Rates Random Forest")
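
The final OOB error rate can also be read directly off the model object; a small sketch (err.rate holds one row per tree, with columns OOB, No and Yes):

# OOB error after the last of the 100 trees
modelRf$err.rate[nrow(modelRf$err.rate), "OOB"]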

## List the importance of the variables.
impVar <- round(randomForest::importance(modelRf), 2)
impVar[order(impVar[,3], decreasing=TRUE),]  # sorted by MeanDecreaseAccuracy
##                             No   Yes MeanDecreaseAccuracy MeanDecreaseGini
## OverTime                 17.32 19.21                20.00            31.57
## DailyRate                12.52 17.14                17.82            30.00
## HourlyRate               11.89 16.76                17.37            27.15
## Age                      14.16 14.20                17.15            34.25
## JobRole                  13.95 14.35                16.75            26.13
## DistanceFromHome         11.82 14.14                16.65            27.02
## MonthlyIncome            12.37 15.61                16.29            44.21
## EnvironmentSatisfaction  10.62 13.24                14.10            17.91
## MonthlyRate               9.06 13.27                13.12            27.99
## TotalWorkingYears        11.01 10.00                12.97            24.97
## PercentSalaryHike         9.80 11.37                12.94            19.12
## EducationField           10.72 10.60                12.62            17.78
## YearsWithCurrManager     10.74 10.44                12.45            16.49
## NumCompaniesWorked        9.50 11.95                12.25            16.97
## YearsAtCompany           11.39  9.66                12.17            22.39
## YearsSinceLastPromotion   8.99 10.37                11.90            14.24
## Education                 9.91  7.31                11.79            11.00
## StockOptionLevel         10.38 10.19                11.64            15.83
## WorkLifeBalance           8.56 11.32                11.58            12.73
## JobSatisfaction           9.88 10.28                11.43            14.62
## YearsInCurrentRole        7.51  9.55                11.12            14.89
## TrainingTimesLastYear     8.45  9.68                10.81            14.38
## RelationshipSatisfaction  8.05  9.56                10.60            11.64
## MaritalStatus             9.18  9.38                10.07            11.61
## BusinessTravel            8.52  8.38                 9.38            11.53
## JobInvolvement            7.30  9.81                 9.14            15.94
## JobLevel                  7.79  7.83                 8.77            13.67
## Department                6.55  7.41                 8.22             5.49
## PerformanceRating         3.22  4.75                 5.52             2.80
## Gender                    5.38  3.21                 5.36             3.13
# Tuning the Random Forest
tunedRf <- tuneRF(x = train[,-2],  # column 2 is Attrition
              y=as.factor(train$Attrition),
              mtryStart = 5, 
              ntreeTry=60, 
              stepFactor = 2, 
              improve = 0.001, 
              trace=TRUE, 
              plot = TRUE,
              doBest = TRUE,
              nodesize = 5, 
              importance=TRUE
)
## mtry = 5  OOB error = 6.66% 
## Searching left ...
## mtry = 3     OOB error = 7% 
## -0.05109489 0.001 
## Searching right ...
## mtry = 10    OOB error = 5.83% 
## 0.1240876 0.001 
## mtry = 20    OOB error = 5.64% 
## 0.03333333 0.001 
## mtry = 30    OOB error = 5.25% 
## 0.06896552 0.001

impvarTunedRf <- tunedRf$importance
impvarTunedRf[order(impvarTunedRf[,3], decreasing=TRUE),]
##                                    No          Yes MeanDecreaseAccuracy
## OverTime                 3.674092e-02 0.1667691646         0.0576395975
## MonthlyIncome            3.268947e-02 0.1254217065         0.0475849764
## JobRole                  2.941582e-02 0.0942117181         0.0397810489
## Age                      2.141759e-02 0.0601320879         0.0276401058
## TotalWorkingYears        1.455379e-02 0.0347263189         0.0177597347
## DailyRate                1.113130e-02 0.0453898972         0.0166284494
## StockOptionLevel         9.046704e-03 0.0376012830         0.0135968107
## EnvironmentSatisfaction  6.130424e-03 0.0389680116         0.0114104307
## YearsAtCompany           8.189525e-03 0.0241061639         0.0107252280
## DistanceFromHome         7.160049e-03 0.0293487146         0.0107039021
## HourlyRate               6.379418e-03 0.0243691692         0.0092769217
## YearsWithCurrManager     7.336233e-03 0.0192609916         0.0092546631
## JobInvolvement           5.314447e-03 0.0280428066         0.0089592523
## NumCompaniesWorked       6.483768e-03 0.0205790769         0.0087458960
## MonthlyRate              5.206747e-03 0.0248349592         0.0083532979
## MaritalStatus            5.235030e-03 0.0221778228         0.0079683717
## EducationField           5.463110e-03 0.0197898666         0.0077799366
## JobSatisfaction          5.062239e-03 0.0194320409         0.0073663880
## BusinessTravel           5.305516e-03 0.0179475686         0.0073431044
## WorkLifeBalance          4.798595e-03 0.0200955019         0.0072560857
## JobLevel                 3.970271e-03 0.0227245670         0.0069841502
## YearsSinceLastPromotion  5.018073e-03 0.0160183508         0.0067753955
## PercentSalaryHike        4.621691e-03 0.0144902501         0.0062057635
## TrainingTimesLastYear    3.894748e-03 0.0164957317         0.0059222692
## YearsInCurrentRole       3.317704e-03 0.0115006854         0.0046358212
## Education                2.673795e-03 0.0080349286         0.0035239971
## RelationshipSatisfaction 1.800002e-03 0.0092262596         0.0029895233
## Department               5.475392e-04 0.0014427027         0.0006921772
## Gender                   1.834251e-04 0.0007469336         0.0002706506
## PerformanceRating        6.947282e-05 0.0003719227         0.0001184777
##                          MeanDecreaseGini
## OverTime                       39.9988731
## MonthlyIncome                  64.5393310
## JobRole                        31.4530547
## Age                            30.4219463
## TotalWorkingYears              25.3786984
## DailyRate                      37.0597263
## StockOptionLevel               17.9044688
## EnvironmentSatisfaction        15.0264290
## YearsAtCompany                 14.2737445
## DistanceFromHome               23.8306455
## HourlyRate                     24.8692498
## YearsWithCurrManager           10.7489891
## JobInvolvement                 15.8718079
## NumCompaniesWorked             14.2662131
## MonthlyRate                    23.8923478
## MaritalStatus                  10.3925575
## EducationField                 15.2117212
## JobSatisfaction                10.5671425
## BusinessTravel                  9.6970989
## WorkLifeBalance                13.5919405
## JobLevel                        3.5939619
## YearsSinceLastPromotion        12.6421162
## PercentSalaryHike              13.5701889
## TrainingTimesLastYear          14.7839038
## YearsInCurrentRole              7.6675645
## Education                       7.3502480
## RelationshipSatisfaction        6.4095451
## Department                      0.9804010
## Gender                          1.0998102
## PerformanceRating               0.6185574
predictionRf <- predict(tunedRf, test, type="class")

#RandomForest Accuracy
#Confusion matrix 
t2 <- table(test$Attrition, predictionRf)

#RandomForest model accuracy
(t2[1]+t2[4])/(nrow(test))
## [1] 0.9421769
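
Given the roughly 16% attrition rate, accuracy alone can flatter a model; caret's confusionMatrix() also reports sensitivity and specificity. A sketch (caret is already loaded):

# Fuller diagnostics than raw accuracy
confusionMatrix(predictionRf, test$Attrition, positive = "Yes")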

Random Forest has improved the accuracy to about 94%, suggesting that bagging many trees works better here than a single tree model (CART).

4. Neural Networks

We will use the nnet package instead of the neuralnet package; nnet does not require factor/character variables to be converted into numeric ones.

## Using the nnet package (2 lines of code)
# No need to convert factors to numeric, unlike with the neuralnet package
library(nnet)
set.seed(777)
modelNN<-nnet(Attrition~.,train,size=21,rang=0.07,Hess=FALSE,decay=15e-4,maxit=2000)
## # weights:  967
## initial  value 1259.026759 
## iter  10 value 899.779328
## ... (iterations 20-1990 omitted; the objective falls steadily) ...
## iter2000 value 324.678115
## final  value 324.678115 
## stopped after 2000 iterations
predictionNN<-predict(modelNN,test,type=("class"))
table(predictionNN)
## predictionNN
##  No Yes 
## 772 110
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')
## SHA-1 hash of file is 74c80bd5ddbc17ab3ae5ece9c0ed9beb612e87ef
plot.nnet(modelNN)
## Loading required package: scales
## Loading required package: reshape

#Confusion Matrix
t3 <- table(test$Attrition, predictionNN)

#NeuralNetwork model accuracy
(t3[1]+t3[4])/(nrow(test))
## [1] 0.893424

The neural network gives an accuracy of approximately 89%.
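
If we later want to trade precision against recall, nnet can also return raw probabilities instead of hard labels; a sketch (with a two-level factor response, the single output approximates P(Attrition = "Yes")):

# Class probabilities, useful for tuning the Yes/No cutoff
probNN <- predict(modelNN, test, type = "raw")
head(probNN)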

5. Ensembling

We will create an ensemble of the three models (CART, Random Forest and Neural Network) via a majority vote to see if it improves accuracy.

predictions <- data.frame(predictionCart = predictionCart,
                          predictionRf = predictionRf,
                          predictionNN = predictionNN)

# Majority vote: predict "Yes" when at least two of the three models say "Yes"
votes <- (predictions$predictionCart == "Yes") +
         (predictions$predictionRf == "Yes") +
         (predictions$predictionNN == "Yes")
predictions$predictionEnsemble <- as.factor(ifelse(votes >= 2, "Yes", "No"))

#Confusion Matrix
t4 <- table(test$Attrition, predictions$predictionEnsemble)

#Ensemble Accuracy
(t4[1]+t4[4])/(nrow(test))
## [1] 0.9092971

The ensemble (90.9%) has been unable to beat the 94% accuracy achieved by the random forest alone. Thus, ensembling may not be useful in all circumstances.

Hypothesis Validation

  1. Overtime is the single most important factor contributing towards attrition.

Overtime remains the single most important factor contributing towards attrition in both the CART and Random Forest models.

  2. Young people with fewer years at the company contribute most to the attrition.

Age and TotalWorkingYears are amongst the top five factors influencing attrition in the tuned random forest model (4th and 5th respectively by MeanDecreaseAccuracy).

  3. Ensembling models should improve accuracy.

Ensembling has not really helped improve the accuracy; a single random forest model has done better. We should check the correlation between the predictions from the three models, since uncorrelated predictions usually improve an ensemble. Stacking is another option worth trying on a larger dataset.
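
A minimal sketch of that correlation check on the hold-out predictions:

# Agreement between the three models; highly correlated predictions
# limit what a majority vote can add
predNum <- data.frame(cart = as.numeric(predictionCart == "Yes"),
                      rf = as.numeric(predictionRf == "Yes"),
                      nn = as.numeric(predictionNN == "Yes"))
round(cor(predNum), 2)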