In this exploratory analysis, I examine a predictive model of which employees are likely to leave and which are likely to stay. The dataset, which you can get HERE, has a binary outcome variable called Termd_1_or_0, indicating whether the employee has terminated (1) or not (0). Since the outcome is binary, the analysis lends itself to binary classification. I recently shared portions of this analysis during a presentation at the SHRM ’19 Annual Conference in Las Vegas.
The dataset has already been cleansed and should not contain any NaN or NULL values.
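A quick check confirms this; a minimal sketch, assuming the data frame is loaded as hr (the name used in the model code below):

# Count missing values per column; every entry should be zero
colSums(is.na(hr))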
Let’s take a look at a summary of a basic model first. We cannot account for every possible factor that contributes to an employee terminating; this model is just a sample of the types of predictive analyses we can do. Here, we predict whether the employee has terminated based on the following predictors: Performance Score, Marital Status, Gender, Department, and Employee (Recruitment) Source. One caveat: the model below is fit with lm(), so it is a linear probability model rather than a true logistic regression; a glm() sketch of the logistic version follows the output.
m <- lm(Termd_1_or_0 ~ Performance.Score + MaritalStatusID + GenderID + Department + Employee.Source,
        data = hr)
summary(m)
##
## Call:
## lm(formula = Termd_1_or_0 ~ Performance.Score + MaritalStatusID +
## GenderID + Department + Employee.Source, data = hr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75112 -0.28901 -0.06531 0.35474 0.95820
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.025559 0.219499
## Performance.ScoreExceeds -0.142233 0.138586
## Performance.ScoreExceptional -0.277410 0.179203
## Performance.ScoreFully Meets -0.053027 0.102871
## Performance.ScoreN/A- too early to review -0.004307 0.121506
## Performance.ScoreNeeds Improvement -0.061749 0.177421
## Performance.ScorePIP 0.116637 0.196880
## MaritalStatusID 0.048982 0.031548
## GenderID 0.067753 0.060841
## DepartmentExecutive Office 0.060163 0.473568
## DepartmentIT/IS 0.155325 0.184940
## DepartmentProduction 0.257198 0.162713
## DepartmentSales 0.059009 0.176736
## DepartmentSoftware Engineering 0.094886 0.222765
## Employee.SourceCareerbuilder -0.295346 0.452952
## Employee.SourceCompany Intranet - Partner 0.806527 0.463643
## Employee.SourceDiversity Job Fair 0.425565 0.149829
## Employee.SourceEmployee Referral -0.057899 0.153594
## Employee.SourceGlassdoor 0.220965 0.186932
## Employee.SourceIndeed -0.129449 0.229623
## Employee.SourceInformation Session 0.125654 0.288383
## Employee.SourceInternet Search 0.141294 0.254513
## Employee.SourceMBTA ads 0.105537 0.167859
## Employee.SourceMonster.com 0.365652 0.154389
## Employee.SourceNewspager/Magazine -0.006377 0.168073
## Employee.SourceOn-campus Recruiting -0.205204 0.179722
## Employee.SourceOther 0.270820 0.206408
## Employee.SourcePay Per Click 0.508987 0.469398
## Employee.SourcePay Per Click - Google -0.030558 0.163483
## Employee.SourceProfessional Society -0.047544 0.165613
## Employee.SourceSearch Engine - Google Bing Yahoo 0.375468 0.154391
## Employee.SourceSocial Networks - Facebook Twitter etc 0.380696 0.196935
## Employee.SourceVendor Referral 0.163684 0.184542
## Employee.SourceWebsite Banner Ads -0.142353 0.189663
## Employee.SourceWord of Mouth 0.342035 0.195642
## t value Pr(>|t|)
## (Intercept) -0.116 0.90741
## Performance.ScoreExceeds -1.026 0.30590
## Performance.ScoreExceptional -1.548 0.12310
## Performance.ScoreFully Meets -0.515 0.60675
## Performance.ScoreN/A- too early to review -0.035 0.97176
## Performance.ScoreNeeds Improvement -0.348 0.72816
## Performance.ScorePIP 0.592 0.55419
## MaritalStatusID 1.553 0.12200
## GenderID 1.114 0.26671
## DepartmentExecutive Office 0.127 0.89903
## DepartmentIT/IS 0.840 0.40192
## DepartmentProduction 1.581 0.11543
## DepartmentSales 0.334 0.73880
## DepartmentSoftware Engineering 0.426 0.67058
## Employee.SourceCareerbuilder -0.652 0.51507
## Employee.SourceCompany Intranet - Partner 1.740 0.08338 .
## Employee.SourceDiversity Job Fair 2.840 0.00494 **
## Employee.SourceEmployee Referral -0.377 0.70658
## Employee.SourceGlassdoor 1.182 0.23850
## Employee.SourceIndeed -0.564 0.57352
## Employee.SourceInformation Session 0.436 0.66348
## Employee.SourceInternet Search 0.555 0.57937
## Employee.SourceMBTA ads 0.629 0.53020
## Employee.SourceMonster.com 2.368 0.01876 *
## Employee.SourceNewspager/Magazine -0.038 0.96977
## Employee.SourceOn-campus Recruiting -1.142 0.25482
## Employee.SourceOther 1.312 0.19091
## Employee.SourcePay Per Click 1.084 0.27944
## Employee.SourcePay Per Click - Google -0.187 0.85190
## Employee.SourceProfessional Society -0.287 0.77433
## Employee.SourceSearch Engine - Google Bing Yahoo 2.432 0.01585 *
## Employee.SourceSocial Networks - Facebook Twitter etc 1.933 0.05455 .
## Employee.SourceVendor Referral 0.887 0.37609
## Employee.SourceWebsite Banner Ads -0.751 0.45375
## Employee.SourceWord of Mouth 1.748 0.08186 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4342 on 213 degrees of freedom
## Multiple R-squared: 0.2684, Adjusted R-squared: 0.1517
## F-statistic: 2.299 on 34 and 213 DF, p-value: 0.0001833
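The term-level ANOVA table below decomposes the model’s explained variance by predictor. The original call isn’t shown; a minimal sketch of how such a table can be produced in an R Markdown document, assuming the fitted model m from above:

# Term-level ANOVA; knitr::kable() renders the result as a markdown table
knitr::kable(anova(m))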
| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Performance.Score | 6 | 1.4965795 | 0.2494299 | 1.3231241 | 0.2480014 |
| MaritalStatusID | 1 | 0.5696195 | 0.5696195 | 3.0215994 | 0.0836075 |
| GenderID | 1 | 0.0029732 | 0.0029732 | 0.0157718 | 0.9001782 |
| Department | 5 | 2.5574275 | 0.5114855 | 2.7132222 | 0.0212008 |
| Employee.Source | 21 | 10.1066124 | 0.4812673 | 2.5529267 | 0.0003775 |
| Residuals | 213 | 40.1538846 | 0.1885159 | NA | NA |
According to this output, employees recruited from the Diversity Job Fair make a statistically significant contribution to whether an employee terminated, at the p < .01 level; you can read this from the significance codes, where ** marks a p-value below .01. Monster.com and Search Engine - Google Bing Yahoo are likewise significant at the p < .05 level. The ANOVA table above tells a consistent story: Employee.Source as a whole (p ≈ .0004) and Department (p ≈ .021) are the significant terms.
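Because the outcome is binary, a true logistic regression is usually the better tool here than the linear probability model above. A minimal sketch, assuming the same hr data frame; coefficients would then be on the log-odds scale:

# Logistic regression version of the same model
m_logit <- glm(Termd_1_or_0 ~ Performance.Score + MaritalStatusID + GenderID +
                 Department + Employee.Source,
               data = hr, family = binomial)
summary(m_logit)
exp(coef(m_logit))  # odds ratios for easier interpretation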
# Next, let's try a classification tree on Termd_1_or_0 to see what happens.
library(dplyr)       # select()
library(caret)       # createDataPartition()
library(rpart)       # rpart(), rpart.control()
library(rpart.plot)  # rpart.plot()

# First, subset the data to the outcome and the predictors of interest
sub1 <- select(hr, Termd_1_or_0, MarriedID, Age, GenderID, Department,
               Performance.Score, Days.Employed)
set.seed(1972)
trainIndex <- createDataPartition(sub1$Termd_1_or_0, p = 0.75, list = FALSE, times = 1)
train <- sub1[ trainIndex, ]
test <- sub1[ -trainIndex, ]
# Set hyperparameters for tuning.
rpctrl <- rpart.control(minsplit = 5, minbucket = round(5 / 3), maxdepth = 3, cp = 0, xval = 5)
# Helper: predict on the hold-out test set and report classification accuracy
accuracy_tune <- function(fit) {
  predict_unseen <- predict(fit, test, type = 'class')
  table_mat <- table(test$Termd_1_or_0, predict_unseen)
  accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
  print(paste('Accuracy for test ', format(accuracy_Test, digits = 3)))
}
fit <- rpart(Termd_1_or_0 ~ ., data = train, method = 'class', control = rpctrl)
rpart.plot(fit, extra = 105)
accuracy_tune(fit)
## [1] "Accuracy for test 0.753"
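The text rules below describe each leaf of the fitted tree. The exact call is not shown in the original; one way to get a listing in this format is rattle’s asRules(), which is an assumption on my part:

# Assumption: rattle::asRules() prints the 'Rule number: ...' listing below
library(rattle)
asRules(fit)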
##
## Rule number: 13 [Termd_1_or_0=1 cover=3 (1%) prob=1.00]
## Days.Employed< 926
## Department=Admin Offices,IT/IS,Sales
## Days.Employed>=710.5
##
## Rule number: 5 [Termd_1_or_0=1 cover=2 (1%) prob=1.00]
## Days.Employed>=926
## Age>=64
##
## Rule number: 7 [Termd_1_or_0=1 cover=48 (21%) prob=0.90]
## Days.Employed< 926
## Department=Production ,Software Engineering
##
## Rule number: 12 [Termd_1_or_0=0 cover=17 (7%) prob=0.41]
## Days.Employed< 926
## Department=Admin Offices,IT/IS,Sales
## Days.Employed< 710.5
##
## Rule number: 4 [Termd_1_or_0=0 cover=163 (70%) prob=0.15]
## Days.Employed>=926
## Age< 64
OK, so we know that the decision tree’s test accuracy is around 75%. The most actionable rule is number 7: employees in Production or Software Engineering with fewer than 926 days employed terminated with probability 0.90, covering 21% of the training data. Let’s try some other algorithms and compare them under repeated cross-validation.
# Recode the outcome as a factor so caret treats this as a classification task
hr2$Termd_1_or_0 <- as.factor(hr2$Termd_1_or_0)
# Repeated cross-validation: 10 folds, repeated 3 times, compared on accuracy
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
metr <- "Accuracy"
fit.lda       <- train(Termd_1_or_0 ~ ., data = hr2, method = "lda",       metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.glm       <- train(Termd_1_or_0 ~ ., data = hr2, method = "glm",       metric = metr, trControl = ctrl)
fit.glmnet    <- train(Termd_1_or_0 ~ ., data = hr2, method = "glmnet",    metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.svmRadial <- train(Termd_1_or_0 ~ ., data = hr2, method = "svmRadial", metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.knn       <- train(Termd_1_or_0 ~ ., data = hr2, method = "knn",       metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.nb        <- train(Termd_1_or_0 ~ ., data = hr2, method = "nb",        metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.cart      <- train(Termd_1_or_0 ~ ., data = hr2, method = "rpart",     metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.c50       <- train(Termd_1_or_0 ~ ., data = hr2, method = "C5.0",      metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.rf        <- train(Termd_1_or_0 ~ ., data = hr2, method = "rf",        metric = metr, trControl = ctrl)
fit.ada       <- train(Termd_1_or_0 ~ ., data = hr2, method = "ada",       metric = metr, trControl = ctrl)
The results…
results <- resamples(list(lda=fit.lda, logistic=fit.glm, glmnet=fit.glmnet,
svm=fit.svmRadial, knn=fit.knn, nb=fit.nb, cart=fit.cart, c50=fit.c50, rf=fit.rf, ada=fit.ada))
results
##
## Call:
## resamples.default(x = list(lda = fit.lda, logistic = fit.glm, glmnet
## = fit.glmnet, svm = fit.svmRadial, knn = fit.knn, nb = fit.nb, cart
## = fit.cart, c50 = fit.c50, rf = fit.rf, ada = fit.ada))
##
## Models: lda, logistic, glmnet, svm, knn, nb, cart, c50, rf, ada
## Number of resamples: 30
## Performance metrics: Accuracy, Kappa
## Time estimates for: everything, final model fit
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, logistic, glmnet, svm, knn, nb, cart, c50, rf, ada
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.6153846 0.7679167 0.8000000 0.7949103 0.8333333 0.8846154 0
## logistic 0.5600000 0.7600000 0.8166667 0.8118590 0.8446154 1.0000000 0
## glmnet 0.6250000 0.7300000 0.8200000 0.8057906 0.8750000 0.9200000 0
## svm 0.7083333 0.7623077 0.8000000 0.7998718 0.8383333 0.9583333 0
## knn 0.6000000 0.7200000 0.7600000 0.7715641 0.8000000 0.9200000 0
## nb 0.4166667 0.7112500 0.7600000 0.7553547 0.8333333 0.9200000 0
## cart 0.7083333 0.7916667 0.8000000 0.8156154 0.8400000 0.9200000 0
## c50 0.6800000 0.8000000 0.8366667 0.8238632 0.8400000 0.9583333 0
## rf 0.6923077 0.7692308 0.8166667 0.8205812 0.8662500 0.9600000 0
## ada 0.6800000 0.8000000 0.8333333 0.8306923 0.8787500 0.9600000 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## lda 0.05109489 0.4601057 0.5247148 0.5120852 0.6203248 0.7234043
## logistic 0.02135231 0.4561887 0.5807297 0.5609937 0.6650327 1.0000000
## glmnet 0.06896552 0.3282510 0.5690672 0.5194197 0.7096774 0.8031496
## svm 0.16000000 0.3988738 0.4897959 0.4813131 0.5750605 0.9032258
## knn 0.08088235 0.3238463 0.3886121 0.4417260 0.5099409 0.8275862
## nb -0.23529412 0.3396031 0.4344558 0.4327721 0.6187500 0.8031496
## cart 0.22907489 0.4456681 0.5584416 0.5462311 0.6434295 0.8175182
## c50 0.27007299 0.4897959 0.5881356 0.5805562 0.6343656 0.9032258
## rf 0.19379845 0.4341038 0.5657157 0.5538006 0.6753296 0.9049430
## ada 0.30769231 0.4898960 0.5714286 0.5864009 0.6928220 0.9049430
## NA's
## lda 0
## logistic 0
## glmnet 0
## svm 0
## knn 0
## nb 0
## cart 0
## c50 0
## rf 0
## ada 0
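Based on these tables, boosting (ada), C5.0, and random forest sit at the top on mean accuracy, roughly 0.82 to 0.83, though the resampling distributions overlap considerably. A visual comparison often makes this clearer; a minimal sketch using the lattice methods caret provides for resamples objects:

# Box-and-whisker and dot plots of cross-validated accuracy per model
bwplot(results, metric = "Accuracy")
dotplot(results, metric = "Accuracy")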