In this exploratory analysis, I examine a predictive model of which employees are likely to leave and which are likely to stay. The dataset, which you can get HERE, has a binary outcome variable called Termd_1_or_0, indicating whether the employee has terminated (1) or not (0). Since the outcome is binary, the analysis lends itself to binary classification. I recently shared portions of this analysis during a presentation at the SHRM ’19 Annual Conference in Las Vegas.
The dataset has already been cleansed and should not contain any NaN or NULL values.
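A quick check confirms this; a minimal sketch, assuming the data frame is loaded as hr (the name used in the model code below):

# Count missing values per column; every entry should be zero
colSums(is.na(hr))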
Let’s take a look at a summary of a basic model first. We cannot account for every possible factor that contributes to an employee terminating; this model is just a sample of the types of predictive analyses we can do. Here, we predict whether the employee has terminated based on the following predictors: Performance Score, Marital Status, Gender, Department, and Employee (Recruitment) Source. One caveat: the model below is fit with lm(), so it is a linear probability model rather than a true logistic regression; a glm() sketch of the logistic version follows the output.
m <- lm(Termd_1_or_0 ~ Performance.Score + MaritalStatusID + GenderID + Department + Employee.Source,
        data = hr)
summary(m)
##
## Call:
## lm(formula = Termd_1_or_0 ~ Performance.Score + MaritalStatusID +
## GenderID + Department + Employee.Source, data = hr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75112 -0.28901 -0.06531 0.35474 0.95820
##
## Coefficients:
## Estimate Std. Error
## (Intercept) -0.025559 0.219499
## Performance.ScoreExceeds -0.142233 0.138586
## Performance.ScoreExceptional -0.277410 0.179203
## Performance.ScoreFully Meets -0.053027 0.102871
## Performance.ScoreN/A- too early to review -0.004307 0.121506
## Performance.ScoreNeeds Improvement -0.061749 0.177421
## Performance.ScorePIP 0.116637 0.196880
## MaritalStatusID 0.048982 0.031548
## GenderID 0.067753 0.060841
## DepartmentExecutive Office 0.060163 0.473568
## DepartmentIT/IS 0.155325 0.184940
## DepartmentProduction 0.257198 0.162713
## DepartmentSales 0.059009 0.176736
## DepartmentSoftware Engineering 0.094886 0.222765
## Employee.SourceCareerbuilder -0.295346 0.452952
## Employee.SourceCompany Intranet - Partner 0.806527 0.463643
## Employee.SourceDiversity Job Fair 0.425565 0.149829
## Employee.SourceEmployee Referral -0.057899 0.153594
## Employee.SourceGlassdoor 0.220965 0.186932
## Employee.SourceIndeed -0.129449 0.229623
## Employee.SourceInformation Session 0.125654 0.288383
## Employee.SourceInternet Search 0.141294 0.254513
## Employee.SourceMBTA ads 0.105537 0.167859
## Employee.SourceMonster.com 0.365652 0.154389
## Employee.SourceNewspager/Magazine -0.006377 0.168073
## Employee.SourceOn-campus Recruiting -0.205204 0.179722
## Employee.SourceOther 0.270820 0.206408
## Employee.SourcePay Per Click 0.508987 0.469398
## Employee.SourcePay Per Click - Google -0.030558 0.163483
## Employee.SourceProfessional Society -0.047544 0.165613
## Employee.SourceSearch Engine - Google Bing Yahoo 0.375468 0.154391
## Employee.SourceSocial Networks - Facebook Twitter etc 0.380696 0.196935
## Employee.SourceVendor Referral 0.163684 0.184542
## Employee.SourceWebsite Banner Ads -0.142353 0.189663
## Employee.SourceWord of Mouth 0.342035 0.195642
## t value Pr(>|t|)
## (Intercept) -0.116 0.90741
## Performance.ScoreExceeds -1.026 0.30590
## Performance.ScoreExceptional -1.548 0.12310
## Performance.ScoreFully Meets -0.515 0.60675
## Performance.ScoreN/A- too early to review -0.035 0.97176
## Performance.ScoreNeeds Improvement -0.348 0.72816
## Performance.ScorePIP 0.592 0.55419
## MaritalStatusID 1.553 0.12200
## GenderID 1.114 0.26671
## DepartmentExecutive Office 0.127 0.89903
## DepartmentIT/IS 0.840 0.40192
## DepartmentProduction 1.581 0.11543
## DepartmentSales 0.334 0.73880
## DepartmentSoftware Engineering 0.426 0.67058
## Employee.SourceCareerbuilder -0.652 0.51507
## Employee.SourceCompany Intranet - Partner 1.740 0.08338 .
## Employee.SourceDiversity Job Fair 2.840 0.00494 **
## Employee.SourceEmployee Referral -0.377 0.70658
## Employee.SourceGlassdoor 1.182 0.23850
## Employee.SourceIndeed -0.564 0.57352
## Employee.SourceInformation Session 0.436 0.66348
## Employee.SourceInternet Search 0.555 0.57937
## Employee.SourceMBTA ads 0.629 0.53020
## Employee.SourceMonster.com 2.368 0.01876 *
## Employee.SourceNewspager/Magazine -0.038 0.96977
## Employee.SourceOn-campus Recruiting -1.142 0.25482
## Employee.SourceOther 1.312 0.19091
## Employee.SourcePay Per Click 1.084 0.27944
## Employee.SourcePay Per Click - Google -0.187 0.85190
## Employee.SourceProfessional Society -0.287 0.77433
## Employee.SourceSearch Engine - Google Bing Yahoo 2.432 0.01585 *
## Employee.SourceSocial Networks - Facebook Twitter etc 1.933 0.05455 .
## Employee.SourceVendor Referral 0.887 0.37609
## Employee.SourceWebsite Banner Ads -0.751 0.45375
## Employee.SourceWord of Mouth 1.748 0.08186 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4342 on 213 degrees of freedom
## Multiple R-squared: 0.2684, Adjusted R-squared: 0.1517
## F-statistic: 2.299 on 34 and 213 DF, p-value: 0.0001833
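The term-level ANOVA table below decomposes the model’s explained variance by predictor. The original call isn’t shown; a minimal sketch of how such a table can be produced in an R Markdown document, assuming the fitted model m from above:

# Term-level ANOVA; knitr::kable() renders the result as a markdown table
knitr::kable(anova(m))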
| Term | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Performance.Score | 6 | 1.4965795 | 0.2494299 | 1.3231241 | 0.2480014 |
| MaritalStatusID | 1 | 0.5696195 | 0.5696195 | 3.0215994 | 0.0836075 |
| GenderID | 1 | 0.0029732 | 0.0029732 | 0.0157718 | 0.9001782 |
| Department | 5 | 2.5574275 | 0.5114855 | 2.7132222 | 0.0212008 |
| Employee.Source | 21 | 10.1066124 | 0.4812673 | 2.5529267 | 0.0003775 |
| Residuals | 213 | 40.1538846 | 0.1885159 | NA | NA |
According to this output, employees recruited from the Diversity Job Fair make a statistically significant contribution to whether an employee terminated, at the p < .01 level; you can read this from the significance codes, where ** marks a p-value below .01. Monster.com and Search Engine - Google Bing Yahoo are likewise significant at the p < .05 level. The ANOVA table above tells a consistent story: Employee.Source as a whole (p ≈ .0004) and Department (p ≈ .021) are the significant terms.
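Because the outcome is binary, a true logistic regression is usually the better tool here than the linear probability model above. A minimal sketch, assuming the same hr data frame; coefficients would then be on the log-odds scale:

# Logistic regression version of the same model
m_logit <- glm(Termd_1_or_0 ~ Performance.Score + MaritalStatusID + GenderID +
                 Department + Employee.Source,
               data = hr, family = binomial)
summary(m_logit)
exp(coef(m_logit))  # odds ratios for easier interpretation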
# Next, let's try a classification tree on Termd_1_or_0 to see what happens.
library(dplyr)       # select()
library(caret)       # createDataPartition()
library(rpart)       # rpart(), rpart.control()
library(rpart.plot)  # rpart.plot()

# First, subset the data to the outcome and the predictors of interest
sub1 <- select(hr, Termd_1_or_0, MarriedID, Age, GenderID, Department,
               Performance.Score, Days.Employed)
set.seed(1972)
trainIndex <- createDataPartition(sub1$Termd_1_or_0, p = 0.75, list = FALSE, times = 1)
train <- sub1[ trainIndex, ]
test <- sub1[ -trainIndex, ]
# Set hyperparameters for tuning.
rpctrl <- rpart.control(minsplit = 5, minbucket = round(5 / 3), maxdepth = 3, cp = 0, xval = 5)
# Helper: predict on the hold-out test set and report classification accuracy
accuracy_tune <- function(fit) {
  predict_unseen <- predict(fit, test, type = 'class')
  table_mat <- table(test$Termd_1_or_0, predict_unseen)
  accuracy_Test <- sum(diag(table_mat)) / sum(table_mat)
  print(paste('Accuracy for test ', format(accuracy_Test, digits = 3)))
}
fit <- rpart(Termd_1_or_0 ~ ., data = train, method = 'class', control = rpctrl)
rpart.plot(fit, extra = 105)
accuracy_tune(fit)
## [1] "Accuracy for test 0.753"
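The text rules below describe each leaf of the fitted tree. The exact call is not shown in the original; one way to get a listing in this format is rattle’s asRules(), which is an assumption on my part:

# Assumption: rattle::asRules() prints the 'Rule number: ...' listing below
library(rattle)
asRules(fit)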
##
## Rule number: 13 [Termd_1_or_0=1 cover=3 (1%) prob=1.00]
## Days.Employed< 926
## Department=Admin Offices,IT/IS,Sales
## Days.Employed>=710.5
##
## Rule number: 5 [Termd_1_or_0=1 cover=2 (1%) prob=1.00]
## Days.Employed>=926
## Age>=64
##
## Rule number: 7 [Termd_1_or_0=1 cover=48 (21%) prob=0.90]
## Days.Employed< 926
## Department=Production ,Software Engineering
##
## Rule number: 12 [Termd_1_or_0=0 cover=17 (7%) prob=0.41]
## Days.Employed< 926
## Department=Admin Offices,IT/IS,Sales
## Days.Employed< 710.5
##
## Rule number: 4 [Termd_1_or_0=0 cover=163 (70%) prob=0.15]
## Days.Employed>=926
## Age< 64
OK, so we know that the decision tree’s test accuracy is around 75%. The most actionable rule is number 7: employees in Production or Software Engineering with fewer than 926 days employed terminated with probability 0.90, covering 21% of the training data. Let’s try some other algorithms and compare them under repeated cross-validation.
# Recode the outcome as a factor so caret treats this as a classification task
hr2$Termd_1_or_0 <- as.factor(hr2$Termd_1_or_0)
# Repeated cross-validation: 10 folds, repeated 3 times, compared on accuracy
ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
metr <- "Accuracy"
fit.lda       <- train(Termd_1_or_0 ~ ., data = hr2, method = "lda",       metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.glm       <- train(Termd_1_or_0 ~ ., data = hr2, method = "glm",       metric = metr, trControl = ctrl)
fit.glmnet    <- train(Termd_1_or_0 ~ ., data = hr2, method = "glmnet",    metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.svmRadial <- train(Termd_1_or_0 ~ ., data = hr2, method = "svmRadial", metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.knn       <- train(Termd_1_or_0 ~ ., data = hr2, method = "knn",       metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.nb        <- train(Termd_1_or_0 ~ ., data = hr2, method = "nb",        metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.cart      <- train(Termd_1_or_0 ~ ., data = hr2, method = "rpart",     metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.c50       <- train(Termd_1_or_0 ~ ., data = hr2, method = "C5.0",      metric = metr, preProc = c("center", "scale"), trControl = ctrl)
fit.rf        <- train(Termd_1_or_0 ~ ., data = hr2, method = "rf",        metric = metr, trControl = ctrl)
fit.ada       <- train(Termd_1_or_0 ~ ., data = hr2, method = "ada",       metric = metr, trControl = ctrl)
The results…
results <- resamples(list(lda=fit.lda, logistic=fit.glm, glmnet=fit.glmnet,
svm=fit.svmRadial, knn=fit.knn, nb=fit.nb, cart=fit.cart, c50=fit.c50, rf=fit.rf, ada=fit.ada))
results
##
## Call:
## resamples.default(x = list(lda = fit.lda, logistic = fit.glm, glmnet
## = fit.glmnet, svm = fit.svmRadial, knn = fit.knn, nb = fit.nb, cart
## = fit.cart, c50 = fit.c50, rf = fit.rf, ada = fit.ada))
##
## Models: lda, logistic, glmnet, svm, knn, nb, cart, c50, rf, ada
## Number of resamples: 30
## Performance metrics: Accuracy, Kappa
## Time estimates for: everything, final model fit
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: lda, logistic, glmnet, svm, knn, nb, cart, c50, rf, ada
## Number of resamples: 30
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## lda 0.6153846 0.7679167 0.8000000 0.7949103 0.8333333 0.8846154 0
## logistic 0.5600000 0.7600000 0.8166667 0.8118590 0.8446154 1.0000000 0
## glmnet 0.6250000 0.7300000 0.8200000 0.8057906 0.8750000 0.9200000 0
## svm 0.7083333 0.7623077 0.8000000 0.7998718 0.8383333 0.9583333 0
## knn 0.6000000 0.7200000 0.7600000 0.7715641 0.8000000 0.9200000 0
## nb 0.4166667 0.7112500 0.7600000 0.7553547 0.8333333 0.9200000 0
## cart 0.7083333 0.7916667 0.8000000 0.8156154 0.8400000 0.9200000 0
## c50 0.6800000 0.8000000 0.8366667 0.8238632 0.8400000 0.9583333 0
## rf 0.6923077 0.7692308 0.8166667 0.8205812 0.8662500 0.9600000 0
## ada 0.6800000 0.8000000 0.8333333 0.8306923 0.8787500 0.9600000 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## lda 0.05109489 0.4601057 0.5247148 0.5120852 0.6203248 0.7234043
## logistic 0.02135231 0.4561887 0.5807297 0.5609937 0.6650327 1.0000000
## glmnet 0.06896552 0.3282510 0.5690672 0.5194197 0.7096774 0.8031496
## svm 0.16000000 0.3988738 0.4897959 0.4813131 0.5750605 0.9032258
## knn 0.08088235 0.3238463 0.3886121 0.4417260 0.5099409 0.8275862
## nb -0.23529412 0.3396031 0.4344558 0.4327721 0.6187500 0.8031496
## cart 0.22907489 0.4456681 0.5584416 0.5462311 0.6434295 0.8175182
## c50 0.27007299 0.4897959 0.5881356 0.5805562 0.6343656 0.9032258
## rf 0.19379845 0.4341038 0.5657157 0.5538006 0.6753296 0.9049430
## ada 0.30769231 0.4898960 0.5714286 0.5864009 0.6928220 0.9049430
## NA's
## lda 0
## logistic 0
## glmnet 0
## svm 0
## knn 0
## nb 0
## cart 0
## c50 0
## rf 0
## ada 0
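Based on these tables, boosting (ada), C5.0, and random forest sit at the top on mean accuracy, roughly 0.82 to 0.83, though the resampling distributions overlap considerably. A visual comparison often makes this clearer; a minimal sketch using the lattice methods caret provides for resamples objects:

# Box-and-whisker and dot plots of cross-validated accuracy per model
bwplot(results, metric = "Accuracy")
dotplot(results, metric = "Accuracy")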