This analysis will try to classify employee attrition. The dataset is downloaded from Kaggle: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
This analysis is part of the Algoritma LBB Project for the C1 and C2 classes.
For C2 class LBB, please scroll to Decision Tree and Random Forest part.
In this project, I will create a Logistic Regression model and a KNN model with different parameters and tuning, and compare them to find which predicts better for this dataset.
If you have any suggestions or questions, connect with me on LinkedIn.
library(tidyverse)
library(ggthemr)
library(ggpubr)
library(scales)
library(skimr)
library(GGally)
library(corrr)
library(corrplot)
library(brglm2)
library(ROSE)
library(ROCR)
library(caret)
library(plotROC)
library(plotly)
library(yardstick)
library(partykit)
Now we will get a glimpse of the data. Personally, I prefer the skim() function from the skimr package over glimpse() or summary() because it gives more detailed information about the data, including the missing rate, complete rate, number of unique values, and column type. For numeric columns, skim() also returns the five-number summary with a cute little histogram.
| Data summary | |
|---|---|
| Name | dataHR |
| Number of rows | 1470 |
| Number of columns | 35 |
| Column type frequency: | |
| factor | 9 |
| numeric | 26 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 1043, Tra: 277, Non: 150 |
| Department | 0 | 1 | FALSE | 3 | Res: 961, Sal: 446, Hum: 63 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 606, Med: 464, Mar: 159, Tec: 132 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 882, Fem: 588 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 326, Res: 292, Lab: 259, Man: 145 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 673, Sin: 470, Div: 327 |
| Over18 | 0 | 1 | FALSE | 1 | Y: 1470 |
| OverTime | 0 | 1 | FALSE | 2 | No: 1054, Yes: 416 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ï..Age | 0 | 1 | 36.92 | 9.14 | 18 | 30.00 | 36.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 802.49 | 403.51 | 102 | 465.00 | 802.0 | 1157.00 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.19 | 8.11 | 1 | 2.00 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.91 | 1.02 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▂▃▇▆▁ |
| EmployeeCount | 0 | 1 | 1.00 | 0.00 | 1 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▇▁▁ |
| EmployeeNumber | 0 | 1 | 1024.87 | 602.02 | 1 | 491.25 | 1020.5 | 1555.75 | 2068 | ▇▇▇▇▇ |
| EnvironmentSatisfaction | 0 | 1 | 2.72 | 1.09 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| HourlyRate | 0 | 1 | 65.89 | 20.33 | 30 | 48.00 | 66.0 | 83.75 | 100 | ▇▇▇▇▇ |
| JobInvolvement | 0 | 1 | 2.73 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
| JobLevel | 0 | 1 | 2.06 | 1.11 | 1 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| JobSatisfaction | 0 | 1 | 2.73 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| MonthlyIncome | 0 | 1 | 6502.93 | 4707.96 | 1009 | 2911.00 | 4919.0 | 8379.00 | 19999 | ▇▅▂▁▂ |
| MonthlyRate | 0 | 1 | 14313.10 | 7117.79 | 2094 | 8047.00 | 14235.5 | 20461.50 | 26999 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.69 | 2.50 | 0 | 1.00 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.21 | 3.66 | 11 | 12.00 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.00 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
| RelationshipSatisfaction | 0 | 1 | 2.71 | 1.08 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| StandardHours | 0 | 1 | 80.00 | 0.00 | 80 | 80.00 | 80.0 | 80.00 | 80 | ▁▁▇▁▁ |
| StockOptionLevel | 0 | 1 | 0.79 | 0.85 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.28 | 7.78 | 0 | 6.00 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.80 | 1.29 | 0 | 2.00 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| WorkLifeBalance | 0 | 1 | 2.76 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
| YearsAtCompany | 0 | 1 | 7.01 | 6.13 | 0 | 3.00 | 5.0 | 9.00 | 40 | ▇▂▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.23 | 3.62 | 0 | 2.00 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.19 | 3.22 | 0 | 0.00 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.12 | 3.57 | 0 | 2.00 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
Now we will do some data transformation, such as renaming the mis-encoded column name (ï..Age) and converting some columns to the appropriate type. A column like Education, for instance, is really a factor, not an integer or numeric.
names(dataHR)[names(dataHR) == "ï..Age"] <- "Age"
dataHR$Education <- as.factor(dataHR$Education)
dataHR$EnvironmentSatisfaction <- as.factor(dataHR$EnvironmentSatisfaction)
dataHR$JobInvolvement <- as.factor(dataHR$JobInvolvement)
dataHR$JobLevel <- as.factor(dataHR$JobLevel)
dataHR$JobSatisfaction <- as.factor(dataHR$JobSatisfaction)
dataHR$StockOptionLevel <- as.factor(dataHR$StockOptionLevel)
dataHR$PerformanceRating <- as.factor(dataHR$PerformanceRating)
dataHR$RelationshipSatisfaction <- as.factor(dataHR$RelationshipSatisfaction)
dataHR$WorkLifeBalance <- as.factor(dataHR$WorkLifeBalance)
Some columns (EmployeeCount, Over18, StandardHours) contain only a single value for the entire column. Such columns are not informative for modeling, so we will remove them.
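A minimal sketch of that step, assuming dplyr and the constant columns identified in the skim() output above:
# drop the single-valued columns; they carry no information for the model
dataHR <- dataHR %>%
  select(-EmployeeCount, -Over18, -StandardHours)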
We will make sure once again that we don’t have NA values in our dataset.
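For example, the check can be a one-liner:
sum(is.na(dataHR))  # total count of NA values across the whole data frame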
## [1] 0
Our target variable is in the Attrition column. Before doing any analysis or modeling, we want to know the distribution of our Attrition variable. If it is imbalanced, further handling will be needed.
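A sketch of the chunk producing the counts below (the object name dist_attr matches its use in the plot afterwards):
# count employees per Attrition class
dist_attr <- dataHR %>%
  group_by(Attrition) %>%
  summarise(Total = n())
dist_attr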
## # A tibble: 2 x 2
## Attrition Total
## <fct> <int>
## 1 No 1233
## 2 Yes 237
dist_attr %>%
ggplot(aes(x=Attrition, y=Total)) +
geom_col() +
ggtitle("Total Numbers of Attrition")
We also want to see the distribution of our employees' age by gender.
mean_age <- dataHR %>%
group_by(Gender) %>%
summarise(mean = mean(Age),
median = median(Age)) %>%
print()
## # A tibble: 2 x 3
## Gender mean median
## <fct> <dbl> <dbl>
## 1 Female 37.3 36
## 2 Male 36.7 35
plot1 <- dataHR %>%
ggplot(aes(x=Age)) +
geom_density(fill = "green", alpha = 0.5) +
geom_vline(aes(xintercept = mean(Age))) +
labs(title = "Age Distribution")
plot2 <- dataHR %>%
filter(Gender == "Male") %>%
ggplot(aes(x=Age)) +
geom_density(fill = "blue", alpha = 0.5) +
geom_vline(aes(xintercept = mean(Age))) +
labs(title = "Male Age Distribution")
plot3 <- dataHR %>%
filter(Gender == "Female") %>%
ggplot(aes(x=Age)) +
geom_density(fill = "red", alpha = 0.5) +
geom_vline(aes(xintercept = mean(Age))) +
labs(title = "Female Age Distribution")
ggarrange(plot1,
ggarrange(plot2, plot3),
nrow = 2)
dist_attr_gender <- dataHR %>%
group_by(Attrition, Gender) %>%
summarise(Total = n())
print(dist_attr_gender)
## # A tibble: 4 x 3
## # Groups: Attrition [2]
## Attrition Gender Total
## <fct> <fct> <int>
## 1 No Female 501
## 2 No Male 732
## 3 Yes Female 87
## 4 Yes Male 150
pie_attr_male <- dist_attr_gender %>%
filter(Gender == "Male") %>%
ggplot(aes(x="", y=Total, fill=Attrition)) +
geom_bar(width=1, stat="identity") +
coord_polar("y", start=0) +
ggtitle("Pie Chart \nAttrition Male") +
geom_text(aes(y = Total/2 + c(5, 10),
label = percent(Total/sum(Total))), size=5)
pie_attr_female <- dist_attr_gender %>%
filter(Gender == "Female") %>%
ggplot(aes(x="", y=Total, fill=Attrition)) +
geom_bar(width=1, stat="identity") +
coord_polar("y", start=0) +
ggtitle("Pie Chart \nAttrition Female") +
geom_text(aes(y = Total/2 + c(5, 10),
label = percent(Total/sum(Total))), size=5)
ggarrange(pie_attr_male, pie_attr_female)
Our variable of interest is Attrition, which indicates whether an employee belongs to the attrition group (Attrition = Yes) or not (Attrition = No).
The Cambridge Dictionary defines attrition as "a reduction in the number of people who work for an organization that is achieved by not replacing those people who leave."
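The first few values and the levels of the target (a sketch of the likely chunk):
head(dataHR$Attrition)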
## [1] Yes No Yes No No No
## Levels: No Yes
We will split the dataset into 80% training data and 20% test data, using the set.seed() function so that R produces the same random numbers and this report stays reproducible.
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(18)
index <- sample(nrow(dataHR), nrow(dataHR)*0.8)
data_train <- dataHR[index, ]
data_test <- dataHR[-index, ]
Before doing any analysis, we must make sure that we have balanced training data. A model trained on imbalanced data will most likely perform worse than one trained on balanced data, so we will balance the training set with an upsampling technique using the ovun.sample() function from the ROSE package.
Looking at the training data distribution, we see quite a large skew in our target variable.
##
## No Yes
## 996 180
Before modeling, we need to balance the data. A model fit on imbalanced data can perform poorly because the algorithm cannot learn well enough from the minority class. We will use an upsampling technique to treat the imbalance.
train_balanced <- ovun.sample(Attrition ~ ., data = data_train, method = "over",
N = 996*2, seed = 1)$data
table(train_balanced$Attrition)
##
## No Yes
## 996 996
After the data is balanced, we will build our first Logistic Regression model using all the predictors in the formula.
model_log_full <- glm(Attrition ~ ., family = "binomial", data = train_balanced)
summary(model_log_full)
##
## Call:
## glm(formula = Attrition ~ ., family = "binomial", data = train_balanced)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8672 -0.5173 0.0246 0.5847 3.4560
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.801e+00 3.256e+02 -0.030 0.975983
## Age -3.858e-02 1.031e-02 -3.741 0.000183 ***
## BusinessTravelTravel_Frequently 2.260e+00 3.203e-01 7.054 1.74e-12 ***
## BusinessTravelTravel_Rarely 1.224e+00 2.896e-01 4.226 2.38e-05 ***
## DailyRate -4.255e-04 1.751e-04 -2.429 0.015120 *
## DepartmentResearch & Development 1.521e+01 3.256e+02 0.047 0.962735
## DepartmentSales 1.505e+01 3.256e+02 0.046 0.963137
## DistanceFromHome 4.566e-02 8.898e-03 5.132 2.87e-07 ***
## Education2 -1.167e-02 2.495e-01 -0.047 0.962693
## Education3 -9.487e-02 2.207e-01 -0.430 0.667304
## Education4 2.541e-01 2.422e-01 1.049 0.294043
## Education5 2.890e-01 4.320e-01 0.669 0.503559
## EducationFieldLife Sciences -1.471e+00 6.599e-01 -2.229 0.025842 *
## EducationFieldMarketing -6.816e-01 6.877e-01 -0.991 0.321555
## EducationFieldMedical -1.346e+00 6.509e-01 -2.068 0.038676 *
## EducationFieldOther -8.496e-01 6.986e-01 -1.216 0.223902
## EducationFieldTechnical Degree 1.691e-01 6.741e-01 0.251 0.801992
## EmployeeNumber -2.245e-04 1.224e-04 -1.834 0.066652 .
## EnvironmentSatisfaction2 -1.244e+00 2.258e-01 -5.509 3.60e-08 ***
## EnvironmentSatisfaction3 -1.090e+00 2.059e-01 -5.296 1.18e-07 ***
## EnvironmentSatisfaction4 -1.111e+00 2.029e-01 -5.473 4.41e-08 ***
## GenderMale 4.716e-01 1.448e-01 3.258 0.001124 **
## HourlyRate -1.655e-03 3.419e-03 -0.484 0.628366
## JobInvolvement2 -1.240e+00 3.224e-01 -3.846 0.000120 ***
## JobInvolvement3 -1.580e+00 3.078e-01 -5.133 2.85e-07 ***
## JobInvolvement4 -1.678e+00 3.808e-01 -4.408 1.05e-05 ***
## JobLevel2 -1.025e+00 2.986e-01 -3.433 0.000596 ***
## JobLevel3 1.023e+00 5.256e-01 1.946 0.051666 .
## JobLevel4 -2.091e+00 9.785e-01 -2.137 0.032572 *
## JobLevel5 3.386e+00 1.189e+00 2.848 0.004400 **
## JobRoleHuman Resources 1.642e+01 3.256e+02 0.050 0.959764
## JobRoleLaboratory Technician 2.041e+00 4.764e-01 4.284 1.83e-05 ***
## JobRoleManager -3.228e-01 8.494e-01 -0.380 0.703927
## JobRoleManufacturing Director 1.068e+00 4.715e-01 2.264 0.023555 *
## JobRoleResearch Director -2.529e+00 9.117e-01 -2.774 0.005531 **
## JobRoleResearch Scientist 1.107e+00 4.789e-01 2.312 0.020776 *
## JobRoleSales Executive 2.459e+00 1.007e+00 2.442 0.014620 *
## JobRoleSales Representative 3.035e+00 1.061e+00 2.862 0.004213 **
## JobSatisfaction2 -1.168e+00 2.168e-01 -5.387 7.17e-08 ***
## JobSatisfaction3 -8.921e-01 1.891e-01 -4.718 2.38e-06 ***
## JobSatisfaction4 -1.750e+00 2.043e-01 -8.564 < 2e-16 ***
## MaritalStatusMarried 5.933e-01 2.022e-01 2.935 0.003338 **
## MaritalStatusSingle 6.732e-01 2.930e-01 2.297 0.021591 *
## MonthlyIncome -1.466e-04 7.005e-05 -2.093 0.036321 *
## MonthlyRate 9.625e-06 9.522e-06 1.011 0.312118
## NumCompaniesWorked 2.380e-01 3.096e-02 7.687 1.50e-14 ***
## OverTimeYes 2.020e+00 1.534e-01 13.166 < 2e-16 ***
## PercentSalaryHike -6.368e-02 3.011e-02 -2.115 0.034450 *
## PerformanceRating4 2.976e-01 3.191e-01 0.933 0.351078
## RelationshipSatisfaction2 -1.149e+00 2.183e-01 -5.264 1.41e-07 ***
## RelationshipSatisfaction3 -1.045e+00 1.955e-01 -5.347 8.93e-08 ***
## RelationshipSatisfaction4 -1.094e+00 1.970e-01 -5.552 2.82e-08 ***
## StockOptionLevel1 -1.274e+00 2.405e-01 -5.299 1.16e-07 ***
## StockOptionLevel2 -1.131e+00 3.153e-01 -3.586 0.000336 ***
## StockOptionLevel3 -3.910e-01 3.662e-01 -1.068 0.285618
## TotalWorkingYears -3.083e-02 2.078e-02 -1.483 0.138032
## TrainingTimesLastYear -1.209e-01 5.217e-02 -2.318 0.020468 *
## WorkLifeBalance2 -9.685e-01 2.946e-01 -3.288 0.001008 **
## WorkLifeBalance3 -1.549e+00 2.815e-01 -5.503 3.73e-08 ***
## WorkLifeBalance4 -6.003e-01 3.325e-01 -1.805 0.071037 .
## YearsAtCompany 2.011e-01 3.016e-02 6.670 2.56e-11 ***
## YearsInCurrentRole -2.519e-01 4.021e-02 -6.266 3.70e-10 ***
## YearsSinceLastPromotion 1.024e-01 3.173e-02 3.225 0.001258 **
## YearsWithCurrManager -1.281e-01 3.721e-02 -3.444 0.000573 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2761.5 on 1991 degrees of freedom
## Residual deviance: 1511.3 on 1928 degrees of freedom
## AIC: 1639.3
##
## Number of Fisher Scoring iterations: 14
We will check whether perfect separation exists in our data (see https://cran.r-project.org/web/packages/brglm2/vignettes/separation.html).
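A sketch of that check, assuming the brglm2 version in which "detect_separation" is accepted as a glm fitting method, as in the vignette linked above:
# refit the same formula, but only to detect separation instead of estimating
glm(Attrition ~ ., family = "binomial", data = train_balanced,
    method = "detect_separation")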
## Separation: FALSE
## Existence of maximum likelihood estimates
## (Intercept) Age
## -Inf 0
## BusinessTravelTravel_Frequently BusinessTravelTravel_Rarely
## 0 0
## DailyRate DepartmentResearch & Development
## 0 Inf
## DepartmentSales DistanceFromHome
## Inf 0
## Education2 Education3
## 0 0
## Education4 Education5
## 0 0
## EducationFieldLife Sciences EducationFieldMarketing
## 0 0
## EducationFieldMedical EducationFieldOther
## 0 0
## EducationFieldTechnical Degree EmployeeNumber
## 0 0
## EnvironmentSatisfaction2 EnvironmentSatisfaction3
## 0 0
## EnvironmentSatisfaction4 GenderMale
## 0 0
## HourlyRate JobInvolvement2
## 0 0
## JobInvolvement3 JobInvolvement4
## 0 0
## JobLevel2 JobLevel3
## 0 0
## JobLevel4 JobLevel5
## 0 0
## JobRoleHuman Resources JobRoleLaboratory Technician
## Inf 0
## JobRoleManager JobRoleManufacturing Director
## 0 0
## JobRoleResearch Director JobRoleResearch Scientist
## 0 0
## JobRoleSales Executive JobRoleSales Representative
## 0 0
## JobSatisfaction2 JobSatisfaction3
## 0 0
## JobSatisfaction4 MaritalStatusMarried
## 0 0
## MaritalStatusSingle MonthlyIncome
## 0 0
## MonthlyRate NumCompaniesWorked
## 0 0
## OverTimeYes PercentSalaryHike
## 0 0
## PerformanceRating4 RelationshipSatisfaction2
## 0 0
## RelationshipSatisfaction3 RelationshipSatisfaction4
## 0 0
## StockOptionLevel1 StockOptionLevel2
## 0 0
## StockOptionLevel3 TotalWorkingYears
## 0 0
## TrainingTimesLastYear WorkLifeBalance2
## 0 0
## WorkLifeBalance3 WorkLifeBalance4
## 0 0
## YearsAtCompany YearsInCurrentRole
## 0 0
## YearsSinceLastPromotion YearsWithCurrManager
## 0 0
## 0: finite value, Inf: infinity, -Inf: -infinity
FALSE means no perfect separation exists in our model.
Next, we will do feature selection by applying the step() function to our full model. Using the backward option, the function iterates over the model, starting with all predictors (the full model as we input it) and removing the least contributive predictors.
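A minimal sketch of that call (step() minimises AIC by default; the object name model_log_bw is an assumption matching its later use):
model_log_bw <- step(model_log_full, direction = "backward", trace = FALSE)
summary(model_log_bw)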
##
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + DailyRate +
## Department + DistanceFromHome + EducationField + EmployeeNumber +
## EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel +
## JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome +
## NumCompaniesWorked + OverTime + PercentSalaryHike + RelationshipSatisfaction +
## StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
## WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion +
## YearsWithCurrManager, family = "binomial", data = train_balanced)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8611 -0.5238 0.0286 0.5907 3.4512
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.020e+01 3.216e+02 -0.032 0.974706
## Age -3.617e-02 1.003e-02 -3.607 0.000310 ***
## BusinessTravelTravel_Frequently 2.247e+00 3.185e-01 7.054 1.73e-12 ***
## BusinessTravelTravel_Rarely 1.235e+00 2.873e-01 4.298 1.72e-05 ***
## DailyRate -4.102e-04 1.721e-04 -2.384 0.017135 *
## DepartmentResearch & Development 1.535e+01 3.216e+02 0.048 0.961932
## DepartmentSales 1.518e+01 3.216e+02 0.047 0.962360
## DistanceFromHome 4.663e-02 8.788e-03 5.307 1.12e-07 ***
## EducationFieldLife Sciences -1.530e+00 6.345e-01 -2.411 0.015903 *
## EducationFieldMarketing -7.559e-01 6.685e-01 -1.131 0.258195
## EducationFieldMedical -1.436e+00 6.248e-01 -2.299 0.021518 *
## EducationFieldOther -8.717e-01 6.693e-01 -1.302 0.192746
## EducationFieldTechnical Degree 4.438e-02 6.511e-01 0.068 0.945657
## EmployeeNumber -2.060e-04 1.209e-04 -1.703 0.088483 .
## EnvironmentSatisfaction2 -1.228e+00 2.241e-01 -5.481 4.22e-08 ***
## EnvironmentSatisfaction3 -1.073e+00 2.041e-01 -5.257 1.47e-07 ***
## EnvironmentSatisfaction4 -1.081e+00 2.009e-01 -5.381 7.42e-08 ***
## GenderMale 4.882e-01 1.436e-01 3.400 0.000673 ***
## JobInvolvement2 -1.221e+00 3.185e-01 -3.832 0.000127 ***
## JobInvolvement3 -1.556e+00 3.013e-01 -5.163 2.44e-07 ***
## JobInvolvement4 -1.678e+00 3.747e-01 -4.478 7.54e-06 ***
## JobLevel2 -1.005e+00 2.963e-01 -3.390 0.000698 ***
## JobLevel3 1.020e+00 5.201e-01 1.961 0.049846 *
## JobLevel4 -2.127e+00 9.762e-01 -2.179 0.029357 *
## JobLevel5 3.288e+00 1.176e+00 2.796 0.005179 **
## JobRoleHuman Resources 1.645e+01 3.216e+02 0.051 0.959207
## JobRoleLaboratory Technician 1.983e+00 4.728e-01 4.195 2.73e-05 ***
## JobRoleManager -1.912e-01 8.395e-01 -0.228 0.819847
## JobRoleManufacturing Director 1.051e+00 4.682e-01 2.245 0.024773 *
## JobRoleResearch Director -2.265e+00 8.983e-01 -2.522 0.011679 *
## JobRoleResearch Scientist 1.058e+00 4.758e-01 2.224 0.026168 *
## JobRoleSales Executive 2.461e+00 1.011e+00 2.434 0.014931 *
## JobRoleSales Representative 2.974e+00 1.065e+00 2.794 0.005207 **
## JobSatisfaction2 -1.168e+00 2.138e-01 -5.463 4.68e-08 ***
## JobSatisfaction3 -8.703e-01 1.861e-01 -4.677 2.92e-06 ***
## JobSatisfaction4 -1.731e+00 2.005e-01 -8.635 < 2e-16 ***
## MaritalStatusMarried 5.959e-01 1.996e-01 2.986 0.002826 **
## MaritalStatusSingle 6.371e-01 2.893e-01 2.203 0.027626 *
## MonthlyIncome -1.574e-04 6.925e-05 -2.272 0.023069 *
## NumCompaniesWorked 2.340e-01 3.066e-02 7.632 2.31e-14 ***
## OverTimeYes 2.012e+00 1.516e-01 13.270 < 2e-16 ***
## PercentSalaryHike -4.092e-02 1.929e-02 -2.121 0.033902 *
## RelationshipSatisfaction2 -1.112e+00 2.130e-01 -5.222 1.77e-07 ***
## RelationshipSatisfaction3 -1.010e+00 1.918e-01 -5.267 1.39e-07 ***
## RelationshipSatisfaction4 -1.046e+00 1.948e-01 -5.371 7.83e-08 ***
## StockOptionLevel1 -1.356e+00 2.350e-01 -5.771 7.87e-09 ***
## StockOptionLevel2 -1.197e+00 3.086e-01 -3.879 0.000105 ***
## StockOptionLevel3 -4.014e-01 3.607e-01 -1.113 0.265790
## TotalWorkingYears -2.977e-02 2.054e-02 -1.449 0.147289
## TrainingTimesLastYear -1.282e-01 5.158e-02 -2.486 0.012915 *
## WorkLifeBalance2 -9.798e-01 2.906e-01 -3.372 0.000745 ***
## WorkLifeBalance3 -1.531e+00 2.757e-01 -5.553 2.81e-08 ***
## WorkLifeBalance4 -6.113e-01 3.278e-01 -1.865 0.062216 .
## YearsAtCompany 2.056e-01 3.004e-02 6.845 7.65e-12 ***
## YearsInCurrentRole -2.527e-01 4.009e-02 -6.303 2.92e-10 ***
## YearsSinceLastPromotion 1.046e-01 3.127e-02 3.345 0.000824 ***
## YearsWithCurrManager -1.277e-01 3.683e-02 -3.466 0.000528 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2761.5 on 1991 degrees of freedom
## Residual deviance: 1517.1 on 1935 degrees of freedom
## AIC: 1631.1
##
## Number of Fisher Scoring iterations: 14
A parsimonious model is one that achieves a desirable and useful level of explanation with as few predictors as possible. That is why, after applying the backward step to our full model, we obtain a second model that uses fewer predictors with comparably good results.
Now we will check whether multicollinearity exists in our model using the vif() function from the car package. According to Zuur et al. (2010), a high VIF value indicates multicollinearity, and they suggest dropping the predictor with the highest VIF.
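A sketch of the call (assuming the car package, which is not in the library chunk above):
library(car)
vif(model_log_bw)  # generalised VIF, since the model contains factor predictors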
## GVIF Df GVIF^(1/(2*Df))
## Age 1.987430e+00 1 1.409763
## BusinessTravel 1.338507e+00 2 1.075611
## DailyRate 1.175317e+00 1 1.084120
## Department 5.743169e+07 2 87.053832
## DistanceFromHome 1.228154e+00 1 1.108221
## EducationField 4.666025e+00 5 1.166527
## EmployeeNumber 1.185489e+00 1 1.088801
## EnvironmentSatisfaction 1.636246e+00 3 1.085529
## Gender 1.147067e+00 1 1.071012
## JobInvolvement 1.598506e+00 3 1.081315
## JobLevel 7.672646e+01 4 1.720355
## JobRole 1.243779e+09 8 3.701873
## JobSatisfaction 1.595519e+00 3 1.080978
## MaritalStatus 3.569071e+00 2 1.374481
## MonthlyIncome 1.533085e+01 1 3.915463
## NumCompaniesWorked 1.523547e+00 1 1.234321
## OverTime 1.308205e+00 1 1.143768
## PercentSalaryHike 1.128670e+00 1 1.062389
## RelationshipSatisfaction 1.508066e+00 3 1.070870
## StockOptionLevel 4.275539e+00 3 1.273987
## TotalWorkingYears 5.195686e+00 1 2.279405
## TrainingTimesLastYear 1.135342e+00 1 1.065524
## WorkLifeBalance 1.602410e+00 3 1.081755
## YearsAtCompany 7.559767e+00 1 2.749503
## YearsInCurrentRole 4.249090e+00 1 2.061332
## YearsSinceLastPromotion 2.307515e+00 1 1.519051
## YearsWithCurrManager 3.736256e+00 1 1.932940
We recreate the formula with the highest-VIF predictor, Department, dropped.
model_log_bw <- glm(formula = Attrition ~ Age + BusinessTravel + DailyRate +
DistanceFromHome + EducationField + EmployeeNumber +
EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel +
JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome +
NumCompaniesWorked + OverTime + PercentSalaryHike + RelationshipSatisfaction +
StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion +
YearsWithCurrManager, family = "binomial", data = train_balanced)
Now we recalculate the VIF values.
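Presumably the same call as before, now on the refitted model:
vif(model_log_bw)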
## GVIF Df GVIF^(1/(2*Df))
## Age 1.983295 1 1.408295
## BusinessTravel 1.329186 2 1.073733
## DailyRate 1.175157 1 1.084047
## DistanceFromHome 1.216231 1 1.102829
## EducationField 4.315324 5 1.157448
## EmployeeNumber 1.182660 1 1.087502
## EnvironmentSatisfaction 1.614912 3 1.083157
## Gender 1.138396 1 1.066956
## JobInvolvement 1.549342 3 1.075700
## JobLevel 73.886948 4 1.712265
## JobRole 77.216664 8 1.312145
## JobSatisfaction 1.574728 3 1.078618
## MaritalStatus 3.560918 2 1.373696
## MonthlyIncome 15.480245 1 3.934494
## NumCompaniesWorked 1.516519 1 1.231470
## OverTime 1.298094 1 1.139339
## PercentSalaryHike 1.124487 1 1.060418
## RelationshipSatisfaction 1.497423 3 1.069607
## StockOptionLevel 4.255169 3 1.272974
## TotalWorkingYears 5.173397 1 2.274510
## TrainingTimesLastYear 1.134146 1 1.064963
## WorkLifeBalance 1.598007 3 1.081259
## YearsAtCompany 7.271288 1 2.696533
## YearsInCurrentRole 4.092361 1 2.022959
## YearsSinceLastPromotion 2.249532 1 1.499844
## YearsWithCurrManager 3.700395 1 1.923641
With the highest-VIF predictor dropped, our model is ready for predicting on the test dataset.
Let's say the company's objective is to prevent attrition, so we **don't want to predict Attrition = No when it is actually Yes.**
If the real value of Attrition is Yes while the predicted value is No, we will fail to take the preventive action needed to stop the attrition from happening.
If the real value of Attrition is No while the predicted value is Yes, we may take an unnecessary step to prevent something that won't happen. But in my opinion, this mistake is not fatal and may even be beneficial: by taking the 'mistaken' step, we may build a stronger relationship with, and understanding of, our employees. Not a bad outcome, actually.
FP: a test result that incorrectly indicates that a particular condition or attribute is present.
FN: a test result that incorrectly indicates that a particular condition or attribute is absent.
So, in our case, we prefer to focus on lowering the false negative rate. Why? Because a false negative means we miss a valuable employee who is about to leave.
In the false positive case, we may take a concrete step, like talking with our employee, only to find out in the end that the employee has no problem at all.
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
Accuracy = (TN + TP)/(TN+TP+FN+FP)
So, we want to decrease the FN count. According to the formulas, decreasing FN increases both the Sensitivity and the Accuracy score.
Now we will compare our predictions with the real values in the test dataset. The threshold is set to 0.5.
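The probabilities used below come from the refitted model; a sketch of the chunk that creates pred:
pred <- predict(model_log_bw, newdata = data_test, type = "response")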
pred_round <- as.factor(ifelse(pred >= 0.5, "Yes", "No"))
confusionMatrix(pred_round, data_test$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 186 13
## Yes 51 44
##
## Accuracy : 0.7823
## 95% CI : (0.7307, 0.8281)
## No Information Rate : 0.8061
## P-Value [Acc > NIR] : 0.8651
##
## Kappa : 0.4443
##
## Mcnemar's Test P-Value : 3.746e-06
##
## Sensitivity : 0.7719
## Specificity : 0.7848
## Pos Pred Value : 0.4632
## Neg Pred Value : 0.9347
## Prevalence : 0.1939
## Detection Rate : 0.1497
## Detection Prevalence : 0.3231
## Balanced Accuracy : 0.7784
##
## 'Positive' Class : Yes
##
We will try to increase the accuracy and decrease the FN count. In a Logistic Regression model, this can be done by adjusting the prediction threshold. But the question is: what is the optimal threshold value? We will create a graph to help us visualize the distribution of the predictions.
data_test$pred <- pred
ggplot(data_test, aes(pred, color = as.factor(Attrition) ) ) +
geom_density( size = 1 ) +
ggtitle("Testing Set's Predicted Score")
Looking at the graph above, we can tune our model by adjusting the threshold:
A threshold below 0.5 will result in more Attrition == Yes predictions.
A threshold above 0.5 will result in fewer Attrition == Yes predictions.
Besides reading the graph, we can compute the optimal cutoff value mathematically.
ROCRpred = prediction(pred, data_test$Attrition)
# Performance function
plot1 = performance(ROCRpred, "acc", "fnr")
plot2 = performance(ROCRpred, "prec", "rec")
par(mfrow = c(1, 2))
plot(plot1, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
plot(plot2, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
We want a threshold with the highest accuracy and the lowest false negative rate. Looking at the graphs, the threshold should be around 0.3 or 0.4. We will try both values and compare the results.
pred_round <- as.factor(ifelse(pred >= 0.3, "Yes", "No"))
confusionMatrix(pred_round, data_test$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 159 7
## Yes 78 50
##
## Accuracy : 0.7109
## 95% CI : (0.6554, 0.762)
## No Information Rate : 0.8061
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3721
##
## Mcnemar's Test P-Value : 3.136e-14
##
## Sensitivity : 0.8772
## Specificity : 0.6709
## Pos Pred Value : 0.3906
## Neg Pred Value : 0.9578
## Prevalence : 0.1939
## Detection Rate : 0.1701
## Detection Prevalence : 0.4354
## Balanced Accuracy : 0.7740
##
## 'Positive' Class : Yes
##
pred_round <- as.factor(ifelse(pred >= 0.4, "Yes", "No"))
confusionMatrix(pred_round, data_test$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 173 10
## Yes 64 47
##
## Accuracy : 0.7483
## 95% CI : (0.6946, 0.7969)
## No Information Rate : 0.8061
## P-Value [Acc > NIR] : 0.9939
##
## Kappa : 0.4078
##
## Mcnemar's Test P-Value : 7.223e-10
##
## Sensitivity : 0.8246
## Specificity : 0.7300
## Pos Pred Value : 0.4234
## Neg Pred Value : 0.9454
## Prevalence : 0.1939
## Detection Rate : 0.1599
## Detection Prevalence : 0.3776
## Balanced Accuracy : 0.7773
##
## 'Positive' Class : Yes
##
Threshold values of 0.3 and 0.4 both successfully decrease the false negative count of the prediction, thus increasing the sensitivity. They also increase the true positive count. A good sign!
The drawback of lowering the threshold is that our model predicts fewer true negatives. But if we consider a false positive not a big deal, this should not be a problem.
KNN, or k-nearest neighbors, is a non-parametric method that can be used for classification. We will try predicting with this method and compare its performance with the Logistic Regression.
KNN (as applied here) only works with numeric predictors, so we will select the numeric columns only.
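A sketch of that selection, assuming dplyr 1.0's where() helper; judging from the confusion matrices later on, the target's levels are also reordered so that "Yes" comes first (twoClassSummary's Sens then refers to the Yes class):
dataHR_num <- dataHR %>%
  select(Attrition, where(is.numeric)) %>%
  mutate(Attrition = relevel(Attrition, ref = "Yes"))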
First, we will split the dataset into 80% training data and 20% test data, again using set.seed() so that R produces the same random numbers and this report stays reproducible.
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(18)
ind <- dataHR_num$Attrition %>%
createDataPartition(p = 0.8, list = FALSE)
train_caret <- dataHR_num[ind, ]
test_caret <- dataHR_num[-ind, ]
I will be using the caret package to run KNN. After splitting the data into a train set and a test set, I use k-fold cross-validation in the train() function; the centering and scaling pre-processing is specified via the preProcess argument of train().
The number of cross-validation folds will be set to 10 in every case. Data imbalance will be handled via the sampling argument of trainControl(); I will try all the available options to compare how the different sampling methods perform.
The k values for KNN are set with tuneGrid in train(). The function will automatically pick the best k, the one with the best Sensitivity (chosen via the metric option), to use on the test set.
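The output below is consistent with the rule-of-thumb computation on our data (a sketch):
sqrt(nrow(dataHR_num))  # square root of the number of rows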
## [1] 38.34058
A common rule of thumb for the k value in KNN is the square root of the number of rows in the dataset.
I will iterate k from 1 to 50 in the tuneGrid; the function will take the k value that produces the highest Sensitivity, the metric chosen in line with our company objective.
The imbalanced data will be handled using several sampling methods: Upsampling, Downsampling, SMOTE, ROSE.
model_imbalanced <- train(
Attrition ~., data = train_caret, method = "knn",
trControl = trainControl(method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction=twoClassSummary),
preProcess = c("center","scale"),
metric = "Sens",
tuneGrid = expand.grid(k = 1:50)
)
model_upsampling <- train(
Attrition ~., data = train_caret, method = "knn",
trControl = trainControl(method = "cv",
number = 10,
sampling = "up",
classProbs = TRUE,
summaryFunction=twoClassSummary),
preProcess = c("center","scale"),
metric = "Sens",
tuneGrid = expand.grid(k = 1:50)
)
model_downsampling <- train(
Attrition ~., data = train_caret, method = "knn",
trControl = trainControl(method = "cv",
number = 10,
sampling = "down",
classProbs = TRUE,
summaryFunction=twoClassSummary),
preProcess = c("center","scale"),
metric = "Sens",
tuneGrid = expand.grid(k = 1:50)
)
model_smote<- train(
Attrition ~., data = train_caret, method = "knn",
trControl = trainControl(method = "cv",
number = 10,
sampling = "smote",
classProbs = TRUE,
summaryFunction=twoClassSummary),
preProcess = c("center","scale"),
metric = "Sens",
tuneGrid = expand.grid(k = 1:50)
)
model_rose<- train(
Attrition ~., data = train_caret, method = "knn",
trControl = trainControl(method = "cv",
number = 10,
sampling = "rose",
classProbs = TRUE,
summaryFunction=twoClassSummary),
preProcess = c("center","scale"),
metric = "Sens",
tuneGrid = expand.grid(k = 1:50)
)
Now we will make predictions on the test dataset with the models we created, saving the results into appropriately named objects.
pred_imbalanced <- model_imbalanced %>%
predict(test_caret)
pred_upsampling <- model_upsampling %>%
predict(test_caret)
pred_downsampling <- model_downsampling %>%
predict(test_caret)
pred_smote <- model_smote %>%
predict(test_caret)
pred_rose <- model_rose %>%
predict(test_caret)
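Here are the confusion matrices for each model, in the same order as above (imbalanced, upsampling, downsampling, SMOTE, ROSE); a sketch of the call pattern, repeated for each prediction object:
confusionMatrix(pred_imbalanced, test_caret$Attrition, positive = "Yes")
## Confusion Matrix and Statistics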
##
## Reference
## Prediction Yes No
## Yes 15 38
## No 32 208
##
## Accuracy : 0.7611
## 95% CI : (0.7081, 0.8088)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.1566
##
## Mcnemar's Test P-Value : 0.5501
##
## Sensitivity : 0.31915
## Specificity : 0.84553
## Pos Pred Value : 0.28302
## Neg Pred Value : 0.86667
## Prevalence : 0.16041
## Detection Rate : 0.05119
## Detection Prevalence : 0.18089
## Balanced Accuracy : 0.58234
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 29 102
## No 18 144
##
## Accuracy : 0.5904
## 95% CI : (0.5318, 0.6473)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1175
##
## Mcnemar's Test P-Value : 3.541e-14
##
## Sensitivity : 0.61702
## Specificity : 0.58537
## Pos Pred Value : 0.22137
## Neg Pred Value : 0.88889
## Prevalence : 0.16041
## Detection Rate : 0.09898
## Detection Prevalence : 0.44710
## Balanced Accuracy : 0.60119
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 26 101
## No 21 145
##
## Accuracy : 0.5836
## 95% CI : (0.5249, 0.6407)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0845
##
## Mcnemar's Test P-Value : 8.532e-13
##
## Sensitivity : 0.55319
## Specificity : 0.58943
## Pos Pred Value : 0.20472
## Neg Pred Value : 0.87349
## Prevalence : 0.16041
## Detection Rate : 0.08874
## Detection Prevalence : 0.43345
## Balanced Accuracy : 0.57131
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 23 82
## No 24 164
##
## Accuracy : 0.6382
## 95% CI : (0.5803, 0.6933)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1041
##
## Mcnemar's Test P-Value : 3.089e-08
##
## Sensitivity : 0.4894
## Specificity : 0.6667
## Pos Pred Value : 0.2190
## Neg Pred Value : 0.8723
## Prevalence : 0.1604
## Detection Rate : 0.0785
## Detection Prevalence : 0.3584
## Balanced Accuracy : 0.5780
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 27 81
## No 20 165
##
## Accuracy : 0.6553
## 95% CI : (0.5978, 0.7096)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1608
##
## Mcnemar's Test P-Value : 2.369e-09
##
## Sensitivity : 0.57447
## Specificity : 0.67073
## Pos Pred Value : 0.25000
## Neg Pred Value : 0.89189
## Prevalence : 0.16041
## Detection Rate : 0.09215
## Detection Prevalence : 0.36860
## Balanced Accuracy : 0.62260
##
## 'Positive' Class : Yes
##
We will calculate the performance differences statistically.
resamps <- resamples(list(Imbalanced = model_imbalanced,
Upsampling = model_upsampling,
Downsampling = model_downsampling,
SMOTE = model_smote,
ROSE = model_rose))
resamps$values %>%
select(Resample,
Imbalanced = `Imbalanced~Sens`,
Upsampling = `Upsampling~Sens`,
Downsampling = `Downsampling~Sens`,
SMOTE = `SMOTE~Sens`,
ROSE = `ROSE~Sens`)From the table and the chart above, we can conclude that, in general, Upsampling produced the highest Sensitivity value of our model. It doesn’t mean that Upsampling is always the best for all models and all situations. But in our case, because we focus on the Sensitivity value, the Upsampling method gives us the best result.
Our model summary:

| Model | Accuracy | Sensitivity |
|---|---|---|
| Logistic Regression, threshold 0.5 | 0.7721 | 0.7368 |
| Logistic Regression, threshold 0.3 | 0.7075 | 0.8772 |
| Logistic Regression, threshold 0.4 | 0.7381 | 0.8070 |
| KNN, imbalanced | 0.7611 | 0.31915 |
| KNN, upsampling | 0.5904 | 0.61702 |
| KNN, downsampling | 0.5836 | 0.55319 |
| KNN, SMOTE | 0.6382 | 0.4894 |
| KNN, ROSE | 0.6553 | 0.5745 |
All of our KNN predictions produce lower Accuracy and Sensitivity scores than the Logistic Regression model. This may be because we only used numeric predictors in KNN, losing valuable information contained in the categorical predictors.
With both numeric and categorical columns included, the Logistic Regression model performs better than KNN. Since our business objective is to decrease false negatives, we will choose the Logistic Regression with a threshold of 0.3 as our model.
This analysis is part of Algoritma LBB Project in C2 class.
In this project, I will create a Decision Tree and a Random Forest model with different parameters and tuning, and compare them to find which predicts better for this dataset.
The EDA part is still the same, so I will focus directly on the analysis here.
I will be using the caret package to fit the Decision Tree, by specifying method = "ctree" as an argument.
The cross-validation will use the repeatedcv method with number set to 10 and repeats set to 3. Data imbalance will be handled via the sampling argument of trainControl(); here I will use the upsampling method.
tree_imbalanced <- train(
Attrition ~., data = data_train, method = "ctree",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
)
tree_upsample <- train(
Attrition ~., data = data_train, method = "ctree",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
sampling = "up")
)
Now we will predict with the models above on our test data; we will predict both classes (raw) and probabilities (prob).
pred_tree_imbalanced <- predict(tree_imbalanced, data_test)
pred_tree_upsample <- predict(tree_upsample, data_test)
pred_tree_imbalanced_prob <- predict(tree_imbalanced, data_test, type = "prob")
pred_tree_upsample_prob <- predict(tree_upsample, data_test, type = "prob")
Here is the confusion matrix with the imbalanced dataset.
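A sketch of the call that prints this table (the same pattern applies to the upsampled model below):
table(Prediction = pred_tree_imbalanced, Reference = data_test$Attrition)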
## Reference
## Prediction No Yes
## No 223 43
## Yes 14 14
Now we will calculate the metrics for our prediction with the imbalanced dataset.
table <- select(data_test, Attrition) %>%
  bind_cols(pred_tree_imbalanced = pred_tree_imbalanced)
# Note: the factor levels are c("No", "Yes") and yardstick treats the first
# level as the event, so spec_vec() here returns the sensitivity for "Yes"
# (and sens_vec() the specificity for "Yes") -- the swap below is deliberate
imbalanced_tree_metrics <- table %>% summarise(
  accuracy = accuracy_vec(data_test$Attrition, pred_tree_imbalanced),
  sensitivity = spec_vec(data_test$Attrition, pred_tree_imbalanced),
  specificity = sens_vec(data_test$Attrition, pred_tree_imbalanced),
  precision = precision_vec(data_test$Attrition, pred_tree_imbalanced))
imbalanced_tree_metrics
Here is the confusion matrix with the balanced data, using the upsampling method.
## Reference
## Prediction No Yes
## No 185 39
## Yes 52 18
Now we will calculate the metrics for our prediction with the balanced dataset (upsampling method).
table <- select(data_test, Attrition) %>%
bind_cols(pred_tree_upsample = pred_tree_upsample)
upsample_tree_metrics <- table %>% summarise(
accuracy = accuracy_vec(data_test$Attrition, pred_tree_upsample),
sensitivity = spec_vec(data_test$Attrition, pred_tree_upsample),
specificity = sens_vec(data_test$Attrition, pred_tree_upsample),
precision = precision_vec(data_test$Attrition, pred_tree_upsample))
upsample_tree_metrics
We can also visualize the fitted tree to see which variables matter most.
ct <- ctree(Attrition ~., data = data_train)
plotfun <- function(i) c(
as.character(i$prediction),
paste("n =", i$n),
format(round(i$distribution/i$n, digits = 3), nsmall = 3)
)
plot(as.simpleparty(ct), tp_args = list(FUN = plotfun), ep_args = list(justmin = 20))
Wow, the tree for the balanced (upsampled) data looks messy. The upsampled model has a lower accuracy but a higher sensitivity; because we want to focus on sensitivity, we will choose the upsampled one.
Now I will look at the ROC curve of my final model: the balanced dataset with the upsampling method.
table <- select(data_test, Attrition) %>%
bind_cols(attrition_pred = pred_tree_upsample_prob) %>%
bind_cols(attrition_eprob = round(pred_tree_upsample_prob[,2],4)) %>%
bind_cols(attrition_pprob = round(pred_tree_upsample_prob[,1],4))
roc <- data.frame(prediction = round(pred_tree_upsample_prob[,2],4),
trueclass = as.numeric(table$Attrition))
roc <- prediction(roc$prediction, roc$trueclass)
# ROC Curve
plot(performance(roc, "tpr", "fpr"), main = "ROC")
abline(a = 0, b = 1)
Looking at the graph above, our model performs poorly, as the curve lies very close to the diagonal reference line.
# AUC
auc_ROCR_tree <- performance(roc, measure = "auc")
auc_ROCR_tree <- auc_ROCR_tree@y.values[[1]]
auc_ROCR_tree
## [1] 0.5787253
The AUC (area under the curve) is a metric for how good our model is: the higher the AUC, the better. A value of 0.5 denotes a classifier no better than random. As our value is very close to 0.5, we can say that our model doesn't perform well.
We may be able to improve our model by adjusting the classification probability threshold. Now I will try tuning the model by looking at the tradeoff graph.
th_cut <- function(cutoff, prob, ref, postarget, negtarget)
{
  # classify with the given cutoff, then collect confusion-matrix metrics
  predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict, ref, positive = postarget)
  acc <- conf$overall[1]
  rec <- conf$byClass[1]   # sensitivity / recall
  prec <- conf$byClass[3]  # positive predictive value / precision
  spec <- conf$byClass[2]  # specificity
  mat <- t(as.matrix(c(rec, acc, prec, spec)))
  colnames(mat) <- c("recall", "accuracy", "precision", "specificity")
  return(mat)
}
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)
for(i in 1:100){
result[i,] = th_cut(cutoff = co[i],
prob = table$attrition_eprob,
ref = as.factor(ifelse(table$Attrition == "Yes", 1, 0)),
postarget = "1",
negtarget = "0")
}
# visualize
ggplotly(tibble("Sensitivity" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(-0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(-0,1,0.1)) +
labs(title = "Tradeoff Model Perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))
From the graph above, there is no clear intersection point marking an optimum threshold. We could increase the threshold to 0.7 to increase the accuracy, precision, and specificity, but that would decrease the sensitivity, which we actually want as high as possible. So I will skip the tuning step, as it won't improve the model for our objective.
Here are the metrics of our final model for the Decision Tree algorithm.
final_tree <- table %>% summarise(
accuracy = accuracy_vec(Attrition, pred_tree_upsample),
sensitivity = spec_vec(Attrition, pred_tree_upsample),
specificity = sens_vec(Attrition, pred_tree_upsample),
precision = precision_vec(Attrition, pred_tree_upsample)) %>%
cbind(AUC = auc_ROCR_tree)
final_tree
A Random Forest consists of a large number of individual decision trees that operate as an ensemble.
We will train with the repeatedcv method, the number of cross-validation folds set to 5 and repeats set to 3. Data imbalance will be handled via the sampling argument of trainControl(); I will try several methods to handle the imbalance.
rf_imbalanced <- train(
Attrition ~., data = data_train, method = "rf",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
summaryFunction=twoClassSummary,
classProbs=T)
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
rf_upsample <- train(
Attrition ~., data = data_train, method = "rf",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
sampling = "up",
summaryFunction=twoClassSummary,
classProbs=T)
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
rf_smote <- train(
Attrition ~., data = data_train, method = "rf",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
sampling = "smote",
summaryFunction=twoClassSummary,
classProbs=T)
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
Here are the graphs visualizing the mtry values tried for each model. The three graphs plot the resampling metric (y-axis) against mtry; the algorithm automatically chooses the mtry that produces the best value (ROC here, per the warnings above).
Now we will predict with the models above on our test data; we will predict both raw classes and probabilities.
pred_rf_imbalanced_prob <- predict(rf_imbalanced, data_test, type = "prob")
pred_rf_upsample_prob <- predict(rf_upsample, data_test, type = "prob")
pred_rf_smote_prob <- predict(rf_smote, data_test, type = "prob")
pred_rf_imbalanced_raw <- predict(rf_imbalanced, data_test, type = "raw")
pred_rf_upsample_raw <- predict(rf_upsample, data_test, type = "raw")
pred_rf_smote_raw <- predict(rf_smote, data_test, type = "raw")
Here is the confusion matrix with the imbalanced dataset.
## Reference
## Prediction No Yes
## No 237 56
## Yes 0 1
Now we will calculate the metrics for our prediction with the imbalanced dataset.
table <- select(data_test, Attrition) %>%
bind_cols(pred_rf_imbalanced_raw = pred_rf_imbalanced_raw)
imbalanced_metrics <- table %>% summarise(
accuracy = accuracy_vec(data_test$Attrition, pred_rf_imbalanced_raw),
sensitivity = spec_vec(data_test$Attrition, pred_rf_imbalanced_raw),
specificity = sens_vec(data_test$Attrition, pred_rf_imbalanced_raw),
precision = precision_vec(data_test$Attrition, pred_rf_imbalanced_raw))
imbalanced_metrics
Here is the confusion matrix with the upsampling method.
## Reference
## Prediction No Yes
## No 230 41
## Yes 7 16
Now we will calculate the metrics for our prediction with the upsampling method.
table <- select(data_test, Attrition) %>%
bind_cols(pred_rf_upsample_raw = pred_rf_upsample_raw)
upsample_metrics <- table %>% summarise(
accuracy = accuracy_vec(data_test$Attrition, pred_rf_upsample_raw),
sensitivity = spec_vec(data_test$Attrition, pred_rf_upsample_raw),
specificity = sens_vec(data_test$Attrition, pred_rf_upsample_raw),
precision = precision_vec(data_test$Attrition, pred_rf_upsample_raw))
upsample_metrics
Here is the confusion matrix with the SMOTE method.
## Reference
## Prediction No Yes
## No 234 43
## Yes 3 14
Now we will calculate the metrics for our prediction with the SMOTE method.
table <- select(data_test, Attrition) %>%
bind_cols(pred_rf_smote_raw = pred_rf_smote_raw)
smote_metrics <- table %>% summarise(
accuracy = accuracy_vec(data_test$Attrition, pred_rf_smote_raw),
sensitivity = spec_vec(data_test$Attrition, pred_rf_smote_raw),
specificity = sens_vec(data_test$Attrition, pred_rf_smote_raw),
precision = precision_vec(data_test$Attrition, pred_rf_smote_raw))
smote_metrics
Because the upsampling technique gives the best results in both accuracy and sensitivity, we will proceed with it for the next analysis.
Now we will look at the details and plot of the chosen model.
## Random Forest
##
## 1176 samples
## 31 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 941, 941, 941, 940, 941, 940, ...
## Addtional sampling using up-sampling
##
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.7707324 0.9708878 0.2462963
## 32 0.7586876 0.9598526 0.2296296
## 63 0.7349803 0.9531508 0.2611111
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Here is the variable importance for the chosen model.
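A sketch of the call that prints it:
varImp(rf_upsample)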
## rf variable importance
##
## only 20 most important variables shown (out of 63)
##
## Overall
## MonthlyIncome 100.00
## Age 80.44
## TotalWorkingYears 75.80
## OverTimeYes 66.15
## EmployeeNumber 63.83
## YearsAtCompany 63.76
## DailyRate 61.57
## MonthlyRate 61.22
## HourlyRate 60.08
## YearsWithCurrManager 58.07
## YearsInCurrentRole 56.36
## DistanceFromHome 51.92
## PercentSalaryHike 47.31
## NumCompaniesWorked 44.50
## YearsSinceLastPromotion 38.29
## TrainingTimesLastYear 34.20
## StockOptionLevel1 31.54
## MaritalStatusSingle 27.13
## JobLevel2 23.30
## JobSatisfaction4 22.34
Here we can see the error decreasing as the number of trees grows. This is the main idea of Random Forest: the crowd produces a better result than any individual tree.
plot(rf_upsample$finalModel)
legend("topright", colnames(rf_upsample$finalModel$err.rate), col=1:6, cex=0.8, fill = 1:6)
Now I will look at the ROC curve of my final model.
table <- select(data_test, Attrition) %>%
bind_cols(attrition_pred = pred_rf_upsample_prob) %>%
bind_cols(attrition_eprob = round(pred_rf_upsample_prob[,2],4)) %>%
bind_cols(attrition_pprob = round(pred_rf_upsample_prob[,1],4))
roc <- data.frame(prediction = round(pred_rf_upsample_prob[,2],4),
trueclass = as.numeric(table$Attrition))
roc <- prediction(roc$prediction, roc$trueclass)
# ROC Curve
plot(performance(roc, "tpr", "fpr"), main = "ROC")
abline(a = 0, b = 1)
We can see that our ROC curve is visually better than the Decision Tree one. Now let's calculate it statistically.
# AUC
auc_ROCR_f <- performance(roc, measure = "auc")
auc_ROCR_f <- auc_ROCR_f@y.values[[1]]
auc_ROCR_f
## [1] 0.7794803
The AUC of the Random Forest is far better than that of the Decision Tree, so it is obvious that we will choose the Random Forest as our model.
Based on the results above, the random forest model gives us reasonable results in terms of accuracy and specificity. Furthermore, the AUC of the model is also reasonable, at around 78%.
In the next step, we will try to tune the threshold to improve the metrics.
# reuse the th_cut() helper defined in the Decision Tree section above
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)
for(i in 1:100){
result[i,] = th_cut(cutoff = co[i],
prob = table$attrition_eprob,
ref = as.factor(ifelse(table$Attrition == "Yes", 1, 0)),
postarget = "1",
negtarget = "0")
}
# visualize
ggplotly(tibble("Sensitivity" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(-0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(-0,1,0.1)) +
labs(title = "Tradeoff Model Perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))
From the graph above, we can see the cutoff values where the metrics intersect, and it is clear that reducing the cutoff increases the sensitivity. Since the focus of this analysis is both sensitivity and accuracy, I will choose a cutoff value of 0.3.
table <- table %>%
mutate(tuning_pred = as.factor(ifelse(attrition_eprob >= 0.3, "Yes", "No")))
final_rf_threshold <- table %>% summarise(
accuracy = accuracy_vec(Attrition, tuning_pred),
sensitivity = spec_vec(Attrition, tuning_pred),
specificity = sens_vec(Attrition, tuning_pred),
precision = precision_vec(Attrition, tuning_pred)) %>%
cbind(AUC = auc_ROCR_f)
final_rf_threshold
We can see that decreasing the threshold dramatically increases the sensitivity while sacrificing a bit of accuracy.
Finally, I will draw a conclusion from both the Decision Tree and Random Forest predictions.
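The comparison below uses an object holding the pre-tuning Random Forest metrics; the name rf_upsample_before_tuning comes from its use below, and this construction is an assumption:
rf_upsample_before_tuning <- upsample_metrics %>%
  cbind(AUC = auc_ROCR_f)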
rbind("Decision Tree" = final_tree,
      "Random Forest, No Tuning" = rf_upsample_before_tuning,
      "Random Forest, After Tuning" = final_rf_threshold)
Based on the comparison above, we can conclude that the tuned Random Forest gives the best result: its sensitivity and precision beat all the other models. The tuning sacrifices some accuracy and specificity, but since our priority is the highest possible sensitivity, we will choose the tuned Random Forest as our final model.