# Data Wrangling
library(dplyr)
library(tidyverse)

# Plotting
library(ggplot2)

# Machine Learning
## KNN
library(class)
## Naive Bayes
library(e1071)
## Random Forest
library(randomForest)

# Machine Learning Evaluation
library(caret)
library(car)

About the data

Employees are the backbone of any organization. An organization's success depends heavily on committed, motivated, and productive employees who carry out the company's goals and objectives. Employee attrition, the rate at which employees leave their jobs, therefore hurts an organization's growth and performance in terms of financial profit, productivity, and time. Still, attrition is an inevitable part of any business: there will come a time when an employee wants to leave a company, and this can happen for many reasons. For instance, employees might be seeking better opportunities, struggling with mental health, working excessive hours, or working in a toxic atmosphere or under bad management. But when attrition passes a certain threshold, it becomes a concern to be addressed, as it may reflect a deeper problem with the company culture. Attrition is particularly concerning when the overall attrition rate is high, indicating that employees are turning over quickly, or when the early attrition rate is high, indicating that new joiners leave the company within their first few years of employment. It is therefore important to ask questions, find out why people are leaving, and understand what influences the attrition rate within an organization.

The most important question to ask is: which factors contribute most to employee attrition? By understanding these reasons, a company can take the right measures to retain its employees. To explore them, we use an IBM Employee Attrition dataset to explore and visualize the factors most strongly associated with higher attrition.

emp_att <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
  • Age: Employee age
  • Attrition: whether the employee left the job (Yes/No; encoded as 1/0 during wrangling)
  • BusinessTravel: The frequency of job travels
  • DailyRate: Billing cost for employee’s services for a single day
  • Department: Employee work department
  • DistanceFromHome: Distance traveled to work from home
  • Education: Employee education level (1 = Below College, 2 = College, 3 = Bachelor, 4 = Master, 5 = Doctor)
  • EducationField: Employee education field
  • EmployeeCount: Employee Count (Constant)
  • EmployeeNumber: Employee ID
  • EnvironmentSatisfaction: Num value for environment satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
  • Gender: Employee gender
  • HourlyRate: The amount of money that employee earns for every hour worked
  • JobInvolvement: Numerical value for job involvement (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
  • JobLevel: Numerical value for job level
  • JobRole: Employee job position
  • JobSatisfaction: Numerical value for job satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
  • MaritalStatus: Employee marital status
  • MonthlyIncome: The amount of money that employee earns in one month, before taxes or deductions
  • MonthlyRate: Billing cost for employee’s services for a month
  • NumCompaniesWorked: Number of companies worked at
  • Over18: whether the employee is over 18 years old
  • OverTime: whether the employee works overtime
  • PercentSalaryHike: Percent increase in salary
  • PerformanceRating: Numerical value for performance rating (1 = Low, 2 = Good, 3 = Excellent, 4 = Outstanding)
  • RelationshipSatisfaction: Numerical value for relationship satisfaction (1 = Low, 2 = Medium, 3 = High, 4 = Very High)
  • StandardHours: Hours employee spent working (Constant)
  • StockOptionLevel: Numerical value for stock options
  • TotalWorkingYears: Total number of years employee worked
  • TrainingTimesLastYear: Hours employee spent on training last year
  • WorkLifeBalance: Numerical value for work life balance (1 = Bad, 2 = Good, 3 = Better, 4 = Best)
  • YearsAtCompany: Number of years employee worked at company
  • YearsInCurrentRole: Number of years employee worked as their current job role
  • YearsSinceLastPromotion: Number of years since last promotion
  • YearsWithCurrManager: Number of years employee worked with current manager

Data Wrangling

We use glimpse() to inspect the data:

glimpse(emp_att)
Rows: 1,470
Columns: 35
$ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, …
$ Attrition                <chr> "Yes", "No", "Yes", "No", "No", "No", "N…
$ BusinessTravel           <chr> "Travel_Rarely", "Travel_Frequently", "T…
$ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, …
$ Department               <chr> "Sales", "Research & Development", "Rese…
$ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15,…
$ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2…
$ EducationField           <chr> "Life Sciences", "Life Sciences", "Other…
$ EmployeeCount            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15…
$ EnvironmentSatisfaction  <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2…
$ Gender                   <chr> "Female", "Male", "Male", "Female", "Mal…
$ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, …
$ JobInvolvement           <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3…
$ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1…
$ JobRole                  <chr> "Sales Executive", "Research Scientist",…
$ JobSatisfaction          <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4…
$ MaritalStatus            <chr> "Single", "Married", "Single", "Married"…
$ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670…
$ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 11864,…
$ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0…
$ Over18                   <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
$ OverTime                 <chr> "Yes", "No", "Yes", "Yes", "No", "No", "…
$ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, …
$ PerformanceRating        <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3…
$ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3…
$ StandardHours            <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, …
$ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1…
$ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10,…
$ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2…
$ WorkLifeBalance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3…
$ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, …
$ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2…
$ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1…
$ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2…

From the glimpse above we can see that some variables have very little variance, and such variables are usually poor predictors for our target, which in this case is how likely an employee is to resign.

To be sure, we can use caret's nearZeroVar() and remove all variables with near-zero variance.

zero_var <- nearZeroVar(emp_att)

ea_cln <- emp_att %>% 
              select(-all_of(zero_var))
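As a quick sanity check (optional, not part of the original pipeline), we can list the flagged columns by name; given the data dictionary, the constant columns EmployeeCount, Over18, and StandardHours should be among them.

# Names of the near-zero-variance columns that were dropped
names(emp_att)[zero_var]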

For modelling purposes we encode several variables: Attrition, OverTime, and Gender become binary (1/0), and the remaining categorical variables are mapped from categories to integer codes.

# Map a categorical vector to integer codes, assigned in order of first appearance
convert_to_numeric <- function(x) {
  as.numeric(factor(x, levels = unique(x)))
}
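A quick illustration of what convert_to_numeric() does, on a hypothetical input vector (codes follow the order of first appearance). Note that this imposes an arbitrary ordering on nominal categories, which we accept here for simplicity.

# "Sales" -> 1, "R&D" -> 2, "HR" -> 3
convert_to_numeric(c("Sales", "R&D", "Sales", "HR"))
# [1] 1 2 1 3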

ea_cln <- ea_cln %>% 
            mutate(Attrition = ifelse(Attrition == 'Yes', 1, 0),
                   OverTime = ifelse(OverTime == 'Yes', 1, 0),
                   Gender = ifelse(Gender=='Female', 1, 0)) %>% 
            mutate(Attrition=as.factor(Attrition)) %>% 
            mutate_at(vars(BusinessTravel, MaritalStatus, Department, EducationField, JobRole), convert_to_numeric)

head(ea_cln)

Exploratory Data Analysis

How many employees actually resigned?

emp_att %>% 
  select(Attrition) %>% 
  group_by(Attrition) %>% 
  summarise(total_count=n()) %>%
  mutate(percentage = (total_count / sum(total_count)) * 100)
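Since ggplot2 is already loaded, a simple bar chart (a minimal sketch, not part of the original output) makes the class imbalance easy to see at a glance:

# Bar chart of the attrition split; given the low overall attrition rate,
# the "No" bar should dwarf the "Yes" bar
ggplot(emp_att, aes(x = Attrition, fill = Attrition)) +
  geom_bar() +
  labs(title = "Employee Attrition", y = "Count")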

Employee Attrition by Department

emp_att %>% 
  select(c(Attrition, Department)) %>% 
  group_by(Attrition, Department) %>% 
  summarise(total_count=n())

Employee Attrition by Gender

emp_att %>% 
  select(c(Attrition, Gender)) %>% 
  group_by(Attrition, Gender) %>% 
  summarise(total_count=n())
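Raw counts are hard to compare across groups of different sizes. Computing the attrition rate within each group, sketched below for Department, puts the groups on the same scale:

# Attrition rate within each department, rather than raw counts
emp_att %>% 
  group_by(Department) %>% 
  summarise(attrition_rate = mean(Attrition == "Yes"),
            total_count = n())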

If we check the groups one by one (gender or department above), we can't really see a difference. Rather than continuing one variable at a time, we fit a logistic regression model to check which predictors explain attrition best.

model_test <- glm(Attrition~., ea_cln, family = "binomial")

summary(model_test)

Call:
glm(formula = Attrition ~ ., family = "binomial", data = ea_cln)

Coefficients:
                             Estimate   Std. Error z value
(Intercept)               6.094865193  1.303035466   4.677
Age                      -0.029414017  0.013042566  -2.255
BusinessTravel           -0.004028734  0.131556461  -0.031
DailyRate                -0.000315710  0.000209533  -1.507
Department               -0.607195370  0.174471567  -3.480
DistanceFromHome          0.038606599  0.010274366   3.758
Education                 0.030978598  0.083821179   0.370
EducationField            0.159943981  0.059315864   2.696
EmployeeNumber           -0.000119581  0.000142957  -0.836
EnvironmentSatisfaction  -0.399924340  0.078883976  -5.070
Gender                   -0.406539962  0.177598337  -2.289
HourlyRate               -0.000476374  0.004187871  -0.114
JobInvolvement           -0.507323811  0.117682680  -4.311
JobLevel                 -0.232365127  0.277802474  -0.836
JobRole                   0.102008817  0.040841734   2.498
JobSatisfaction          -0.372911141  0.078061710  -4.777
MaritalStatus            -0.556564876  0.165511511  -3.363
MonthlyIncome            -0.000079180  0.000066524  -1.190
MonthlyRate               0.000005541  0.000011981   0.462
NumCompaniesWorked        0.182091866  0.036352423   5.009
OverTime                  1.844527957  0.182442703  10.110
PercentSalaryHike        -0.050153340  0.037482185  -1.338
PerformanceRating         0.327361507  0.380456753   0.860
RelationshipSatisfaction -0.244772535  0.079541606  -3.077
StockOptionLevel         -0.241558733  0.145973448  -1.655
TotalWorkingYears        -0.048884647  0.027307027  -1.790
TrainingTimesLastYear    -0.154009409  0.069703456  -2.209
WorkLifeBalance          -0.322623449  0.116454784  -2.770
YearsAtCompany            0.091360764  0.036473855   2.505
YearsInCurrentRole       -0.141023532  0.043341719  -3.254
YearsSinceLastPromotion   0.159495346  0.040286712   3.959
YearsWithCurrManager     -0.127626504  0.044161463  -2.890
                                     Pr(>|z|)    
(Intercept)                       0.000002905 ***
Age                                  0.024119 *  
BusinessTravel                       0.975570    
DailyRate                            0.131880    
Department                           0.000501 ***
DistanceFromHome                     0.000172 ***
Education                            0.711696    
EducationField                       0.007008 ** 
EmployeeNumber                       0.402883    
EnvironmentSatisfaction           0.000000398 ***
Gender                               0.022074 *  
HourlyRate                           0.909435    
JobInvolvement                    0.000016256 ***
JobLevel                             0.402907    
JobRole                              0.012502 *  
JobSatisfaction                   0.000001778 ***
MaritalStatus                        0.000772 ***
MonthlyIncome                        0.233952    
MonthlyRate                          0.643741    
NumCompaniesWorked                0.000000547 ***
OverTime                 < 0.0000000000000002 ***
PercentSalaryHike                    0.180878    
PerformanceRating                    0.389545    
RelationshipSatisfaction             0.002089 ** 
StockOptionLevel                     0.097962 .  
TotalWorkingYears                    0.073424 .  
TrainingTimesLastYear                0.027140 *  
WorkLifeBalance                      0.005599 ** 
YearsAtCompany                       0.012251 *  
YearsInCurrentRole                   0.001139 ** 
YearsSinceLastPromotion           0.000075262 ***
YearsWithCurrManager                 0.003852 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1298.58  on 1469  degrees of freedom
Residual deviance:  923.64  on 1438  degrees of freedom
AIC: 987.64

Number of Fisher Scoring iterations: 6

From the model above, we take note of the predictors significant at the <0.001 level as the factors that contribute most strongly to an employee's attrition (whether they will resign or not):

  • Department
  • DistanceFromHome
  • EnvironmentSatisfaction
  • JobInvolvement
  • JobSatisfaction
  • MaritalStatus
  • NumCompaniesWorked
  • OverTime
  • YearsSinceLastPromotion

Note that whether an employee works overtime is the most significant factor in predicting how likely someone is to resign from their position.
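To put that coefficient on an interpretable scale, we can exponentiate it into an odds ratio: with the estimate of about 1.84 above, working overtime multiplies the odds of attrition by roughly exp(1.84) ≈ 6.3, holding the other predictors fixed.

# Odds ratio for OverTime: about 6.3x higher odds of attrition
exp(coef(model_test)["OverTime"])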

Building Model for Prediction

Train-Test Split

Let's split the data into a training set and a test set to build a model for prediction.

# Lock the random number generator for reproducibility
RNGkind(sample.kind = "Rounding") 
set.seed(417)

# Index
index <- sample(nrow(ea_cln), nrow(ea_cln)*0.8)

# Splitting
ea_train <- ea_cln[index,]
ea_test <- ea_cln[-index,]
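As a quick sanity check (not part of the original pipeline), we can confirm that both splits carry a similar share of the positive class; the test split used below works out to about 13% attrition.

# Proportion of each Attrition class in train vs. test
prop.table(table(ea_train$Attrition))
prop.table(table(ea_test$Attrition))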

Logistic Regression

Making The Model

Base Logistic Regression Model

model_base <- glm(Attrition~., ea_train, family = "binomial")

summary(model_base)

Call:
glm(formula = Attrition ~ ., family = "binomial", data = ea_train)

Coefficients:
                             Estimate   Std. Error z value
(Intercept)               5.518642241  1.485422434   3.715
Age                      -0.029857725  0.014397873  -2.074
BusinessTravel           -0.000468029  0.148311647  -0.003
DailyRate                -0.000175185  0.000230698  -0.759
Department               -0.617262871  0.194879977  -3.167
DistanceFromHome          0.039406526  0.011410185   3.454
Education                 0.088593530  0.093816433   0.944
EducationField            0.152509610  0.066342213   2.299
EmployeeNumber           -0.000064731  0.000158766  -0.408
EnvironmentSatisfaction  -0.381377091  0.087576288  -4.355
Gender                   -0.555817870  0.197337074  -2.817
HourlyRate                0.001715846  0.004593870   0.374
JobInvolvement           -0.412927652  0.130820889  -3.156
JobLevel                 -0.285196224  0.306935318  -0.929
JobRole                   0.119394226  0.045677550   2.614
JobSatisfaction          -0.377832196  0.086195285  -4.383
MaritalStatus            -0.542522375  0.181740036  -2.985
MonthlyIncome            -0.000096878  0.000075269  -1.287
MonthlyRate               0.000006974  0.000013234   0.527
NumCompaniesWorked        0.209468447  0.039666928   5.281
OverTime                  1.914446933  0.201964101   9.479
PercentSalaryHike        -0.053309773  0.041785612  -1.276
PerformanceRating         0.391069884  0.427983831   0.914
RelationshipSatisfaction -0.305112567  0.088344798  -3.454
StockOptionLevel         -0.264462295  0.161063619  -1.642
TotalWorkingYears        -0.046135677  0.030218546  -1.527
TrainingTimesLastYear    -0.133005181  0.076868537  -1.730
WorkLifeBalance          -0.410563478  0.129354941  -3.174
YearsAtCompany            0.126452475  0.040800222   3.099
YearsInCurrentRole       -0.155515675  0.047825813  -3.252
YearsSinceLastPromotion   0.161650603  0.043821052   3.689
YearsWithCurrManager     -0.151006422  0.051060614  -2.957
                                     Pr(>|z|)    
(Intercept)                          0.000203 ***
Age                                  0.038102 *  
BusinessTravel                       0.997482    
DailyRate                            0.447632    
Department                           0.001538 ** 
DistanceFromHome                     0.000553 ***
Education                            0.345002    
EducationField                       0.021514 *  
EmployeeNumber                       0.683484    
EnvironmentSatisfaction           0.000013319 ***
Gender                               0.004854 ** 
HourlyRate                           0.708771    
JobInvolvement                       0.001597 ** 
JobLevel                             0.352799    
JobRole                              0.008953 ** 
JobSatisfaction                   0.000011682 ***
MaritalStatus                        0.002834 ** 
MonthlyIncome                        0.198063    
MonthlyRate                          0.598212    
NumCompaniesWorked                0.000000129 ***
OverTime                 < 0.0000000000000002 ***
PercentSalaryHike                    0.202029    
PerformanceRating                    0.360849    
RelationshipSatisfaction             0.000553 ***
StockOptionLevel                     0.100595    
TotalWorkingYears                    0.126827    
TrainingTimesLastYear                0.083578 .  
WorkLifeBalance                      0.001504 ** 
YearsAtCompany                       0.001940 ** 
YearsInCurrentRole                   0.001147 ** 
YearsSinceLastPromotion              0.000225 ***
YearsWithCurrManager                 0.003103 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1069.32  on 1175  degrees of freedom
Residual deviance:  752.96  on 1144  degrees of freedom
AIC: 816.96

Number of Fisher Scoring iterations: 6

Logistic Regression using Predictors with significance level <0.001

model_sig <- glm(Attrition~Department+DistanceFromHome+EnvironmentSatisfaction+JobInvolvement+JobSatisfaction+MaritalStatus+NumCompaniesWorked+OverTime+YearsSinceLastPromotion, ea_train, family = "binomial")

summary(model_sig)

Call:
glm(formula = Attrition ~ Department + DistanceFromHome + EnvironmentSatisfaction + 
    JobInvolvement + JobSatisfaction + MaritalStatus + NumCompaniesWorked + 
    OverTime + YearsSinceLastPromotion, family = "binomial", 
    data = ea_train)

Coefficients:
                        Estimate Std. Error z value             Pr(>|z|)
(Intercept)              1.93095    0.57193   3.376             0.000735
Department              -0.28331    0.16465  -1.721             0.085307
DistanceFromHome         0.02780    0.01002   2.775             0.005521
EnvironmentSatisfaction -0.30731    0.07819  -3.930        0.00008486634
JobInvolvement          -0.46996    0.11790  -3.986        0.00006713646
JobSatisfaction         -0.29574    0.07593  -3.895        0.00009820334
MaritalStatus           -0.72346    0.12494  -5.790        0.00000000703
NumCompaniesWorked       0.08340    0.03266   2.554             0.010663
OverTime                 1.59343    0.17480   9.116 < 0.0000000000000002
YearsSinceLastPromotion -0.01701    0.02787  -0.610             0.541682
                           
(Intercept)             ***
Department              .  
DistanceFromHome        ** 
EnvironmentSatisfaction ***
JobInvolvement          ***
JobSatisfaction         ***
MaritalStatus           ***
NumCompaniesWorked      *  
OverTime                ***
YearsSinceLastPromotion    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1069.32  on 1175  degrees of freedom
Residual deviance:  894.54  on 1166  degrees of freedom
AIC: 914.54

Number of Fisher Scoring iterations: 5

Stepwise Logistic Regression to Find the Model with the Smallest AIC

model_step <- step(model_base, direction = "backward", trace = FALSE)

summary(model_step)

Call:
glm(formula = Attrition ~ Age + Department + DistanceFromHome + 
    EducationField + EnvironmentSatisfaction + Gender + JobInvolvement + 
    JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + 
    NumCompaniesWorked + OverTime + RelationshipSatisfaction + 
    StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
    WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
    YearsWithCurrManager, family = "binomial", data = ea_train)

Coefficients:
                            Estimate  Std. Error z value
(Intercept)               5.93459384  0.93389577   6.355
Age                      -0.02873555  0.01403087  -2.048
Department               -0.56862062  0.18995364  -2.993
DistanceFromHome          0.03838888  0.01128931   3.400
EducationField            0.15488422  0.06609420   2.343
EnvironmentSatisfaction  -0.38050523  0.08697840  -4.375
Gender                   -0.54727602  0.19632787  -2.788
JobInvolvement           -0.40548884  0.13039656  -3.110
JobRole                   0.11567999  0.04508837   2.566
JobSatisfaction          -0.38375993  0.08528478  -4.500
MaritalStatus            -0.55196664  0.18099838  -3.050
MonthlyIncome            -0.00015464  0.00004196  -3.686
NumCompaniesWorked        0.20994659  0.03932347   5.339
OverTime                  1.91284609  0.20082985   9.525
RelationshipSatisfaction -0.30416398  0.08712048  -3.491
StockOptionLevel         -0.26098842  0.15975689  -1.634
TotalWorkingYears        -0.05141708  0.02931597  -1.754
TrainingTimesLastYear    -0.13124170  0.07669190  -1.711
WorkLifeBalance          -0.40648707  0.12851073  -3.163
YearsAtCompany            0.12692345  0.04110856   3.088
YearsInCurrentRole       -0.15219084  0.04743761  -3.208
YearsSinceLastPromotion   0.16930421  0.04333034   3.907
YearsWithCurrManager     -0.15667826  0.05071895  -3.089
                                     Pr(>|z|)    
(Intercept)                    0.000000000209 ***
Age                                  0.040558 *  
Department                           0.002758 ** 
DistanceFromHome                     0.000673 ***
EducationField                       0.019110 *  
EnvironmentSatisfaction        0.000012159441 ***
Gender                               0.005311 ** 
JobInvolvement                       0.001873 ** 
JobRole                              0.010299 *  
JobSatisfaction                0.000006803439 ***
MaritalStatus                        0.002292 ** 
MonthlyIncome                        0.000228 ***
NumCompaniesWorked             0.000000093479 ***
OverTime                 < 0.0000000000000002 ***
RelationshipSatisfaction             0.000481 ***
StockOptionLevel                     0.102330    
TotalWorkingYears                    0.079449 .  
TrainingTimesLastYear                0.087028 .  
WorkLifeBalance                      0.001561 ** 
YearsAtCompany                       0.002018 ** 
YearsInCurrentRole                   0.001336 ** 
YearsSinceLastPromotion        0.000093337114 ***
YearsWithCurrManager                 0.002007 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1069.32  on 1175  degrees of freedom
Residual deviance:  757.52  on 1153  degrees of freedom
AIC: 803.52

Number of Fisher Scoring iterations: 6

Predicting and Evaluating

Predicting with the models

# Predict model base
ea_test$pred_att_bs <- predict(model_base, newdata = ea_test, type = "response")

# Predict significant only model
ea_test$pred_att_sig <- predict(model_sig, newdata = ea_test, type = "response")

# Predict stepwise model
ea_test$pred_att_stp <- predict(model_step, newdata = ea_test, type = "response")

Converting predicted probabilities to labels

# Label model base
ea_test$label_bs <- ifelse(ea_test$pred_att_bs>0.5, 1, 0)

# Label model sig
ea_test$label_sig <- ifelse(ea_test$pred_att_sig>0.5, 1, 0)

# Label model step
ea_test$label_stp <- ifelse(ea_test$pred_att_stp>0.5, 1, 0)
ea_test %>% 
  select(c(Attrition, label_bs, label_sig, label_stp))

Evaluating The Model

# Eval model base
confusionMatrix(as.factor(ea_test$label_bs), as.factor(ea_test$Attrition), "1")
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 246  24
         1  10  14
                                          
               Accuracy : 0.8844          
                 95% CI : (0.8422, 0.9186)
    No Information Rate : 0.8707          
    P-Value [Acc > NIR] : 0.27602         
                                          
                  Kappa : 0.3906          
                                          
 Mcnemar's Test P-Value : 0.02578         
                                          
            Sensitivity : 0.36842         
            Specificity : 0.96094         
         Pos Pred Value : 0.58333         
         Neg Pred Value : 0.91111         
             Prevalence : 0.12925         
         Detection Rate : 0.04762         
   Detection Prevalence : 0.08163         
      Balanced Accuracy : 0.66468         
                                          
       'Positive' Class : 1               
                                          
# Eval model sig
confusionMatrix(as.factor(ea_test$label_sig), as.factor(ea_test$Attrition), "1")
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 250  33
         1   6   5
                                          
               Accuracy : 0.8673          
                 95% CI : (0.8231, 0.9039)
    No Information Rate : 0.8707          
    P-Value [Acc > NIR] : 0.6105          
                                          
                  Kappa : 0.155           
                                          
 Mcnemar's Test P-Value : 0.00003136      
                                          
            Sensitivity : 0.13158         
            Specificity : 0.97656         
         Pos Pred Value : 0.45455         
         Neg Pred Value : 0.88339         
             Prevalence : 0.12925         
         Detection Rate : 0.01701         
   Detection Prevalence : 0.03741         
      Balanced Accuracy : 0.55407         
                                          
       'Positive' Class : 1               
                                          
# Eval model step
confusionMatrix(as.factor(ea_test$label_stp), as.factor(ea_test$Attrition), "1")
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 245  25
         1  11  13
                                          
               Accuracy : 0.8776          
                 95% CI : (0.8345, 0.9127)
    No Information Rate : 0.8707          
    P-Value [Acc > NIR] : 0.40497         
                                          
                  Kappa : 0.3548          
                                          
 Mcnemar's Test P-Value : 0.03026         
                                          
            Sensitivity : 0.34211         
            Specificity : 0.95703         
         Pos Pred Value : 0.54167         
         Neg Pred Value : 0.90741         
             Prevalence : 0.12925         
         Detection Rate : 0.04422         
   Detection Prevalence : 0.08163         
      Balanced Accuracy : 0.64957         
                                          
       'Positive' Class : 1               
                                          

Result:

  • All three logistic regression models reach roughly 87-88% accuracy, which looks good but hides the fact that they fail to identify most of the employees who actually resigned (those employees are mostly predicted as staying).
  • This shows up in the very low sensitivity (about 13% for model_sig, and 34-37% for model_base and model_step).
  • This can be caused by many things, such as the imbalance between the positive and negative classes, or a model that simply does not fit the data; one common remedy is sketched below.
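One common remedy, not applied in this analysis, is to rebalance the training data before fitting. The sketch below uses caret's downSample() to equalize the classes and refits the logistic regression; lowering the 0.5 labelling threshold would be another option.

# Hypothetical rebalancing: downsample the majority class so both classes
# are equally represented, then refit
ea_down <- downSample(x = ea_train %>% select(-Attrition),
                      y = ea_train$Attrition,
                      yname = "Attrition")

model_down <- glm(Attrition ~ ., ea_down, family = "binomial")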

K Nearest Neighbor (KNN)

Data Preprocessing

Separating the predictors from the target variable.

# Predictor data train
ea_trn_x <- ea_train %>% select(-Attrition)

# Target data train
ea_trn_y <- ea_train[,"Attrition"]

# Predictor data test
ea_tst_x <- ea_test %>% select(
                          -c(Attrition, pred_att_bs, pred_att_sig, pred_att_stp, label_bs, label_sig, label_stp))

# Target data test
ea_tst_y <- ea_test[,"Attrition"]

Scaling the data. Note that the test set is scaled with the training set's centers and scales, so no information from the test set leaks into the preprocessing.

# Train
ea_trn_xs <- scale(ea_trn_x)

# Test
ea_tst_xs <- scale(ea_tst_x,
                      center= attr(ea_trn_xs, "scaled:center"),
                      scale = attr(ea_trn_xs, "scaled:scale"))

Predicting

Finding the optimum k (a common heuristic is the square root of the number of training rows)

sqrt(nrow(ea_train))
[1] 34.29286

Because the target has two classes (yes/no), we take an odd k, either 33 or 35. An odd k prevents tied votes in KNN's majority voting.
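If we wanted a data-driven choice instead of the square-root heuristic, we could search over odd k values near sqrt(n), as sketched below. (Purely illustrative: scoring against the test set like this is optimistic; a validation split or caret's train(method = "knn") would be cleaner.)

# Accuracy of KNN for odd k values around sqrt(nrow(ea_train))
ks <- seq(25, 45, by = 2)
acc <- sapply(ks, function(k) {
  pred <- knn(train = ea_trn_xs, test = ea_tst_xs, cl = ea_trn_y, k = k)
  mean(as.character(pred) == as.character(ea_tst_y))
})
data.frame(k = ks, accuracy = acc)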

Predicting the data

# Predict using N = 33
pr_knn1 <- knn(train = ea_trn_xs, # pred data train
               test = ea_tst_xs, # pred data test
               cl = ea_trn_y, # label data train
               k = 33)

# Predict using N = 35
pr_knn2 <- knn(train = ea_trn_xs, # pred data train
               test = ea_tst_xs, # pred data test
               cl = ea_trn_y, # label data train
               k = 35)

Evaluating The Model

# Evaluating KNN N = 33
confusionMatrix(as.factor(pr_knn1), as.factor(ea_tst_y), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 256  38
         1   0   0
                                          
               Accuracy : 0.8707          
                 95% CI : (0.8269, 0.9069)
    No Information Rate : 0.8707          
    P-Value [Acc > NIR] : 0.5431          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : 0.000000001947  
                                          
            Sensitivity : 0.0000          
            Specificity : 1.0000          
         Pos Pred Value :    NaN          
         Neg Pred Value : 0.8707          
             Prevalence : 0.1293          
         Detection Rate : 0.0000          
   Detection Prevalence : 0.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : 1               
                                          
# Evaluating KNN N = 35
confusionMatrix(as.factor(pr_knn2), as.factor(ea_tst_y), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 256  38
         1   0   0
                                          
               Accuracy : 0.8707          
                 95% CI : (0.8269, 0.9069)
    No Information Rate : 0.8707          
    P-Value [Acc > NIR] : 0.5431          
                                          
                  Kappa : 0               
                                          
 Mcnemar's Test P-Value : 0.000000001947  
                                          
            Sensitivity : 0.0000          
            Specificity : 1.0000          
         Pos Pred Value :    NaN          
         Neg Pred Value : 0.8707          
             Prevalence : 0.1293          
         Detection Rate : 0.0000          
   Detection Prevalence : 0.0000          
      Balanced Accuracy : 0.5000          
                                          
       'Positive' Class : 1               
                                          

Result

  • The KNN model fails even harder than logistic regression at distinguishing the employees who will resign: it classifies every test employee as retained (sensitivity of 0%).
  • It therefore performs considerably worse than the logistic regression models.

Naive Bayes

Making The Model

naive_ea <- naiveBayes(Attrition~., ea_train, laplace = 1)

Predicting

ea_test$pred_nv <- predict(object = naive_ea,
                           newdata = ea_test %>% select(-c(pred_att_bs, pred_att_sig, pred_att_stp, label_bs, 
                                                          label_sig, label_stp)),
                           type="class") 

Model Evaluation

confusionMatrix(data = as.factor(ea_test$pred_nv), reference = as.factor(ea_test$Attrition), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 218  12
         1  38  26
                                         
               Accuracy : 0.8299         
                 95% CI : (0.782, 0.8711)
    No Information Rate : 0.8707         
    P-Value [Acc > NIR] : 0.982321       
                                         
                  Kappa : 0.4149         
                                         
 Mcnemar's Test P-Value : 0.000407       
                                         
            Sensitivity : 0.68421        
            Specificity : 0.85156        
         Pos Pred Value : 0.40625        
         Neg Pred Value : 0.94783        
             Prevalence : 0.12925        
         Detection Rate : 0.08844        
   Detection Prevalence : 0.21769        
      Balanced Accuracy : 0.76789        
                                         
       'Positive' Class : 1              
                                         

Result

  • The Naive Bayes model looks like the better classifier of whether someone will resign: accuracy is still high (83%) and, more importantly, sensitivity is much higher (68%), and we want our model to be good at predicting the positive class (will resign).
  • It is also a very practical model, since it takes little time and computing power to fit.

Random Forest

Making the model

ctrl <- trainControl(method = "repeatedcv",
                     number = 5, # k-fold
                     repeats = 3) # repetition
 
ea_forest <- train(Attrition ~ .,
                   data = ea_train,
                   method = "rf", # random forest
                   trControl = ctrl)
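Since the original question is which factors drive attrition, it is worth inspecting the fitted forest's variable importance (results vary from run to run; OverTime would be expected near the top, given the regression results above):

# Variable importance from the fitted random forest
varImp(ea_forest)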

Predicting

# Predict using the random forest model; type = "raw" returns the predicted
# class labels directly (use type = "prob" if class probabilities are needed)
ea_test$label_rf <- predict(object = ea_forest,
                            newdata = ea_test %>% select(-c(pred_att_bs, pred_att_sig, pred_att_stp, label_bs, 
                                                           label_sig, label_stp, pred_nv)),
                            type = "raw") %>% as.character() %>% as.numeric()

Model Evaluation

confusionMatrix(data = as.factor(ea_test$label_rf), reference = as.factor(ea_test$Attrition), positive = "1")
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 251  25
         1   5  13
                                          
               Accuracy : 0.898           
                 95% CI : (0.8575, 0.9301)
    No Information Rate : 0.8707          
    P-Value [Acc > NIR] : 0.0932007       
                                          
                  Kappa : 0.4157          
                                          
 Mcnemar's Test P-Value : 0.0005226       
                                          
            Sensitivity : 0.34211         
            Specificity : 0.98047         
         Pos Pred Value : 0.72222         
         Neg Pred Value : 0.90942         
             Prevalence : 0.12925         
         Detection Rate : 0.04422         
   Detection Prevalence : 0.06122         
      Balanced Accuracy : 0.66129         
                                          
       'Positive' Class : 1               
                                          

Result

  • Compared with the logistic regression and Naive Bayes models, this random forest achieves the highest overall accuracy (about 90%) and the highest positive predictive value (72%), so its positive predictions are the most reliable.
  • Its sensitivity, however, is only 34%, well below Naive Bayes (68%), so it still misses most of the employees who actually resign.
  • Since, as a company, we mainly need to anticipate which employees are likely to resign, avoiding false negatives matters most, and that keeps Naive Bayes the strongest candidate for this purpose; the snippet below collects the reported metrics side by side.
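For reference, the test-set metrics reported in the confusion matrices above can be collected into a single table (values transcribed from the outputs above):

# Side-by-side summary of the reported test-set metrics
data.frame(
  model       = c("logistic (base)", "logistic (sig)", "logistic (step)",
                  "KNN (k = 33/35)", "naive Bayes", "random forest"),
  accuracy    = c(0.884, 0.867, 0.878, 0.871, 0.830, 0.898),
  sensitivity = c(0.368, 0.132, 0.342, 0.000, 0.684, 0.342)
)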