Introduction

Description

The dataset under examination comprises 1,470 employees and includes 35 attributes that offer insights into various factors influencing employee attrition. Key variables encompass demographic information (such as age, gender, and marital status), job-related metrics (including job role, level, and satisfaction), and compensation details (such as monthly income and hourly rate). Additionally, the dataset captures factors related to employee work environment and career progression, including distance from home, job involvement, and years with the current manager. This rich dataset is instrumental in understanding the dynamics of employee turnover and identifying critical factors that contribute to attrition, enabling targeted interventions to improve employee retention and organizational stability.

Source: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset

Business Question

  1. What is the breakdown of distance from home by job role and attrition?
  2. How does the average monthly income compare across different education levels and attrition?
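As a minimal exploratory sketch of how these two questions could be approached (assuming the employee data frame imported in section 1.3; these are quick summaries, not the models built later):

# Q1: average distance from home by job role and attrition
employee %>% 
  group_by(JobRole, Attrition) %>% 
  summarise(avg_distance = mean(DistanceFromHome), .groups = "drop") %>% 
  arrange(desc(avg_distance))

# Q2: average monthly income by education level and attrition
employee %>% 
  group_by(Education, Attrition) %>% 
  summarise(avg_income = mean(MonthlyIncome), .groups = "drop") %>% 
  ggplot(aes(x = factor(Education), y = avg_income, fill = Attrition)) +
  geom_col(position = "dodge") +
  labs(x = "Education level", y = "Average monthly income")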

Variable Description

1. Data Preparation

1.1 Prerequisites

1.2 Importing Libraries

set.seed(314)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(factoextra) 
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR) 
library(ggiraphExtra)
library(rsample)
library(caret) 
## Loading required package: lattice
## 
## Attaching package: 'caret'
## 
## The following object is masked from 'package:purrr':
## 
##     lift
library(keras) 

1.3 Importing Dataset

employee <- read.csv("data_input/HRA.csv", stringsAsFactors = T)
head(employee)
##   Age Attrition    BusinessTravel DailyRate             Department
## 1  41       Yes     Travel_Rarely      1102                  Sales
## 2  49        No Travel_Frequently       279 Research & Development
## 3  37       Yes     Travel_Rarely      1373 Research & Development
## 4  33        No Travel_Frequently      1392 Research & Development
## 5  27        No     Travel_Rarely       591 Research & Development
## 6  32        No Travel_Frequently      1005 Research & Development
##   DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1                1         2  Life Sciences             1              1
## 2                8         1  Life Sciences             1              2
## 3                2         2          Other             1              4
## 4                3         4  Life Sciences             1              5
## 5                2         1        Medical             1              7
## 6                2         2  Life Sciences             1              8
##   EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1                       2 Female         94              3        2
## 2                       3   Male         61              2        2
## 3                       4   Male         92              2        1
## 4                       4 Female         56              3        1
## 5                       1   Male         40              3        1
## 6                       4   Male         79              3        1
##                 JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## 1       Sales Executive               4        Single          5993       19479
## 2    Research Scientist               2       Married          5130       24907
## 3 Laboratory Technician               3        Single          2090        2396
## 4    Research Scientist               3       Married          2909       23159
## 5 Laboratory Technician               2       Married          3468       16632
## 6 Laboratory Technician               4        Single          3068       11864
##   NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
## 1                  8      Y      Yes                11                 3
## 2                  1      Y       No                23                 4
## 3                  6      Y      Yes                15                 3
## 4                  1      Y      Yes                11                 3
## 5                  9      Y       No                12                 3
## 6                  0      Y       No                13                 3
##   RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## 1                        1            80                0                 8
## 2                        4            80                1                10
## 3                        2            80                0                 7
## 4                        3            80                0                 8
## 5                        4            80                1                 6
## 6                        3            80                0                 8
##   TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 1                     0               1              6                  4
## 2                     3               3             10                  7
## 3                     3               3              0                  0
## 4                     3               3              8                  7
## 5                     3               3              2                  2
## 6                     2               2              7                  7
##   YearsSinceLastPromotion YearsWithCurrManager
## 1                       0                    5
## 2                       1                    7
## 3                       0                    0
## 4                       3                    0
## 5                       2                    2
## 6                       3                    6

1.4 Data Inspection

glimpse(employee)
## Rows: 1,470
## Columns: 35
## $ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2…
## $ Attrition                <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,…
## $ BusinessTravel           <fct> Travel_Rarely, Travel_Frequently, Travel_Rare…
## $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,…
## $ Department               <fct> Sales, Research & Development, Research & Dev…
## $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, …
## $ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, …
## $ EducationField           <fct> Life Sciences, Life Sciences, Other, Life Sci…
## $ EmployeeCount            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,…
## $ EnvironmentSatisfaction  <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, …
## $ Gender                   <fct> Female, Male, Male, Female, Male, Male, Femal…
## $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4…
## $ JobInvolvement           <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, …
## $ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, …
## $ JobRole                  <fct> Sales Executive, Research Scientist, Laborato…
## $ JobSatisfaction          <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, …
## $ MaritalStatus            <fct> Single, Married, Single, Married, Married, Si…
## $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269…
## $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964…
## $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, …
## $ Over18                   <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, …
## $ OverTime                 <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N…
## $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1…
## $ PerformanceRating        <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, …
## $ StandardHours            <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8…
## $ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, …
## $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3…
## $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, …
## $ WorkLifeBalance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, …
## $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,…
## $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, …
## $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, …
## $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, …

1.5 Data Types

# drop columns that carry no information for modelling:
# EmployeeCount, Over18 and StandardHours are constant, EmployeeNumber is an identifier
employee_clean <- employee %>% 
  select(-c("EmployeeCount", "Over18", "StandardHours", "EmployeeNumber"))

unique(employee$EmployeeCount)
## [1] 1
unique(employee$JobInvolvement)
## [1] 3 2 4 1
unique(employee$JobLevel)
## [1] 2 1 3 4 5
unique(employee$Over18)
## [1] Y
## Levels: Y
summary(employee)
##       Age        Attrition            BusinessTravel   DailyRate     
##  Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0  
##  1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0  
##  Median :36.00              Travel_Rarely    :1043   Median : 802.0  
##  Mean   :36.92                                       Mean   : 802.5  
##  3rd Qu.:43.00                                       3rd Qu.:1157.0  
##  Max.   :60.00                                       Max.   :1499.0  
##                                                                      
##                   Department  DistanceFromHome   Education    
##  Human Resources       : 63   Min.   : 1.000   Min.   :1.000  
##  Research & Development:961   1st Qu.: 2.000   1st Qu.:2.000  
##  Sales                 :446   Median : 7.000   Median :3.000  
##                               Mean   : 9.193   Mean   :2.913  
##                               3rd Qu.:14.000   3rd Qu.:4.000  
##                               Max.   :29.000   Max.   :5.000  
##                                                               
##           EducationField EmployeeCount EmployeeNumber   EnvironmentSatisfaction
##  Human Resources : 27    Min.   :1     Min.   :   1.0   Min.   :1.000          
##  Life Sciences   :606    1st Qu.:1     1st Qu.: 491.2   1st Qu.:2.000          
##  Marketing       :159    Median :1     Median :1020.5   Median :3.000          
##  Medical         :464    Mean   :1     Mean   :1024.9   Mean   :2.722          
##  Other           : 82    3rd Qu.:1     3rd Qu.:1555.8   3rd Qu.:4.000          
##  Technical Degree:132    Max.   :1     Max.   :2068.0   Max.   :4.000          
##                                                                                
##     Gender      HourlyRate     JobInvolvement    JobLevel    
##  Female:588   Min.   : 30.00   Min.   :1.00   Min.   :1.000  
##  Male  :882   1st Qu.: 48.00   1st Qu.:2.00   1st Qu.:1.000  
##               Median : 66.00   Median :3.00   Median :2.000  
##               Mean   : 65.89   Mean   :2.73   Mean   :2.064  
##               3rd Qu.: 83.75   3rd Qu.:3.00   3rd Qu.:3.000  
##               Max.   :100.00   Max.   :4.00   Max.   :5.000  
##                                                              
##                       JobRole    JobSatisfaction  MaritalStatus MonthlyIncome  
##  Sales Executive          :326   Min.   :1.000   Divorced:327   Min.   : 1009  
##  Research Scientist       :292   1st Qu.:2.000   Married :673   1st Qu.: 2911  
##  Laboratory Technician    :259   Median :3.000   Single  :470   Median : 4919  
##  Manufacturing Director   :145   Mean   :2.729                  Mean   : 6503  
##  Healthcare Representative:131   3rd Qu.:4.000                  3rd Qu.: 8379  
##  Manager                  :102   Max.   :4.000                  Max.   :19999  
##  (Other)                  :215                                                 
##   MonthlyRate    NumCompaniesWorked Over18   OverTime   PercentSalaryHike
##  Min.   : 2094   Min.   :0.000      Y:1470   No :1054   Min.   :11.00    
##  1st Qu.: 8047   1st Qu.:1.000               Yes: 416   1st Qu.:12.00    
##  Median :14236   Median :2.000                          Median :14.00    
##  Mean   :14313   Mean   :2.693                          Mean   :15.21    
##  3rd Qu.:20462   3rd Qu.:4.000                          3rd Qu.:18.00    
##  Max.   :26999   Max.   :9.000                          Max.   :25.00    
##                                                                          
##  PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
##  Min.   :3.000     Min.   :1.000            Min.   :80    Min.   :0.0000  
##  1st Qu.:3.000     1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000  
##  Median :3.000     Median :3.000            Median :80    Median :1.0000  
##  Mean   :3.154     Mean   :2.712            Mean   :80    Mean   :0.7939  
##  3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000  
##  Max.   :4.000     Max.   :4.000            Max.   :80    Max.   :3.0000  
##                                                                           
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##                                                                          
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000      
## 

2. Handling Missing Values

2.1 Missing Value

colSums(is.na(employee_clean))
##                      Age                Attrition           BusinessTravel 
##                        0                        0                        0 
##                DailyRate               Department         DistanceFromHome 
##                        0                        0                        0 
##                Education           EducationField  EnvironmentSatisfaction 
##                        0                        0                        0 
##                   Gender               HourlyRate           JobInvolvement 
##                        0                        0                        0 
##                 JobLevel                  JobRole          JobSatisfaction 
##                        0                        0                        0 
##            MaritalStatus            MonthlyIncome              MonthlyRate 
##                        0                        0                        0 
##       NumCompaniesWorked                 OverTime        PercentSalaryHike 
##                        0                        0                        0 
##        PerformanceRating RelationshipSatisfaction         StockOptionLevel 
##                        0                        0                        0 
##        TotalWorkingYears    TrainingTimesLastYear          WorkLifeBalance 
##                        0                        0                        0 
##           YearsAtCompany       YearsInCurrentRole  YearsSinceLastPromotion 
##                        0                        0                        0 
##     YearsWithCurrManager 
##                        0

3. Data Wrangling

3.1 Data Transformation

employee_num <- employee_clean %>% 
  select_if(is.numeric)
employee_num_scaled <- scale(employee_num)

4. Exploratory Data Analysis

4.1 Base PCA

pca <- prcomp(x = employee_num, scale. = T)
plot(pca)

4.2 Biplot

# biplot of the first 100 (scaled) observations, to keep the plot readable
biplot(x = prcomp(employee_num %>% scale() %>% head(100)), cex = 1.4)

4.3 Contribution

# contribution of each variable to the first principal component (axes = 1)
fviz_contrib( 
  X = prcomp(employee_num %>% scale() %>% head(100)), 
  choice = "var",
  axes = 1
)

4.4 FactoMineR

# column indices of the categorical (qualitative) variables in employee_clean
qualivar <- c(2, 3, 5, 8, 10, 14, 16, 20)
# the remaining quantitative columns (employee_clean has 31 columns in total)
quantivar <- setdiff(1:ncol(employee_clean), qualivar)

employee_pca <- PCA(X = employee_clean,
    scale.unit = T,
    quali.sup = qualivar,
    ncp = 23,
    graph = F)

head(employee_pca$ind$coord)
##        Dim.1      Dim.2      Dim.3       Dim.4      Dim.5      Dim.6     Dim.7
## 1 -0.3402021 -1.6425344  0.1100639  0.79035276  1.7185598 -1.4744878 0.3235198
## 2  0.6094791  2.5534278  1.9127937 -1.65050159 -1.3792339 -0.3235685 0.4902771
## 3 -2.4205040 -1.1315811  0.4480900 -0.01902049  0.6779377  1.5413090 0.1483998
## 4 -0.8671335 -0.2839553 -1.3820657 -0.55764281  0.3905435  0.3867866 1.4611601
## 5 -1.8951143 -1.1223375 -0.3939562 -0.62741412 -0.8146266  0.9042650 0.4460685
## 6 -0.6072899  0.9053973 -1.7447346 -0.07965912  1.1054204  0.1542838 1.4889031
##        Dim.8      Dim.9      Dim.10      Dim.11      Dim.12      Dim.13
## 1 -1.3971587  0.5591311 -2.71656720  0.59075634 -1.48430683  1.41815322
## 2  0.8520032  1.0811053  0.07483455 -0.41473204  0.53055444  1.31146524
## 3  0.4840149 -0.2652283 -1.67933257 -1.25575534 -0.01562162 -0.33593253
## 4  0.4593693 -0.4916710  0.21216668  1.36703711 -0.10674537 -0.03908637
## 5 -0.5806523  0.5630754  0.02822246 -0.49757189  0.10959459  0.67359554
## 6 -0.5516226  0.2139034 -0.96440791  0.07958235 -0.45396460  0.90433096
##       Dim.14       Dim.15      Dim.16      Dim.17      Dim.18      Dim.19
## 1  0.5790848  0.002514608  1.79267740 -0.09268832  0.83260007  0.05851981
## 2 -0.8091512 -0.478065807 -0.01387224 -1.90034148  0.07800892 -0.14169801
## 3  1.2407232  1.341190640  1.39389024 -0.14769863 -0.27731617  0.12351658
## 4  1.8601876 -0.277377803 -0.54986108  0.06324740 -0.53771863 -1.17503560
## 5 -1.4635249 -2.509889243  2.20042483  0.78034536  0.45339256  0.06691025
## 6  0.4118312  1.270759320  0.12120559 -0.37673990 -0.06362240 -0.26571979
##        Dim.20      Dim.21       Dim.22      Dim.23
## 1 -0.32617741  0.59438195  0.626231307 -0.06270521
## 2  0.09368915  0.55826032  0.404472546 -0.12222444
## 3  0.21470234 -0.14999009 -0.004433548 -0.03718327
## 4 -0.69266787 -0.48663837  0.100428299  0.12997028
## 5 -0.29395462 -0.08193348 -0.023614068  0.18864176
## 6 -0.06111410  0.22260941 -0.386322515  0.16894805

4.5 PCA Individuals

plot.PCA(x = employee_pca,
         choix = "ind",
         select = "contrib10",
         invisible = "quali")

4.6 PCA Variables

plot.PCA(x = employee_pca, choix = "var", label = "all")
## Warning: ggrepel: 9 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

4.7 Variable Contribution

dim <- dimdesc(employee_pca)

as.data.frame(dim$Dim.1$quanti)
##                         correlation       p.value
## TotalWorkingYears         0.8705583  0.000000e+00
## YearsAtCompany            0.8447751  0.000000e+00
## JobLevel                  0.8260250  0.000000e+00
## MonthlyIncome             0.8085802  0.000000e+00
## YearsInCurrentRole        0.7248619 7.342664e-240
## YearsWithCurrManager      0.7128950 1.680595e-228
## YearsSinceLastPromotion   0.6417415 2.111727e-171
## Age                       0.5977268 4.273467e-143
## Education                 0.1681454  8.724442e-11
## NumCompaniesWorked        0.1046384  5.831645e-05
as.data.frame(dim$Dim.2$quanti)
##                          correlation       p.value
## PerformanceRating         0.60006294 1.730861e-144
## PercentSalaryHike         0.57470626 5.763050e-130
## YearsWithCurrManager      0.38237187  2.221494e-52
## YearsInCurrentRole        0.38232101  2.297324e-52
## YearsAtCompany            0.29393322  1.104823e-30
## YearsSinceLastPromotion   0.27600778  4.121838e-27
## DistanceFromHome          0.07691673  3.168538e-03
## TrainingTimesLastYear     0.05855275  2.477052e-02
## HourlyRate               -0.06037865  2.060771e-02
## JobInvolvement           -0.06514718  1.247870e-02
## MonthlyRate              -0.08475562  1.143265e-03
## RelationshipSatisfaction -0.11964635  4.226111e-06
## Education                -0.19869805  1.488675e-14
## TotalWorkingYears        -0.23083766  3.123827e-19
## JobLevel                 -0.25909443  5.620386e-24
## MonthlyIncome            -0.27145517  3.027261e-26
## Age                      -0.38400456  7.538725e-53
## NumCompaniesWorked       -0.51407356 5.988350e-100

4.8 PCA Eigenvalues

employee_pca$eig
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1  4.65568301             20.2421000                          20.24210
## comp 2  1.83608877              7.9829947                          28.22509
## comp 3  1.75518507              7.6312394                          35.85633
## comp 4  1.19196184              5.1824428                          41.03878
## comp 5  1.08623652              4.7227675                          45.76154
## comp 6  1.05824417              4.6010616                          50.36261
## comp 7  1.05156065              4.5720028                          54.93461
## comp 8  1.03833464              4.5144985                          59.44911
## comp 9  1.02137212              4.4407483                          63.88986
## comp 10 0.99066276              4.3072294                          68.19708
## comp 11 0.96536641              4.1972452                          72.39433
## comp 12 0.94683080              4.1166557                          76.51099
## comp 13 0.90365332              3.9289275                          80.43991
## comp 14 0.90208602              3.9221131                          84.36203
## comp 15 0.87926425              3.8228881                          88.18491
## comp 16 0.75387072              3.2776988                          91.46261
## comp 17 0.54146306              2.3541872                          93.81680
## comp 18 0.52212983              2.2701297                          96.08693
## comp 19 0.27987064              1.2168289                          97.30376
## comp 20 0.22607818              0.9829486                          98.28671
## comp 21 0.20390815              0.8865572                          99.17326
## comp 22 0.14174295              0.6162737                          99.78954
## comp 23 0.04840611              0.2104613                         100.00000

5. Neural Network

5.1 Proportion Data

prop.table(table(employee_clean$Attrition))
## 
##        No       Yes 
## 0.8387755 0.1612245

5.2 Upsampling and Downsampling

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(314)

employee_down <- downSample(x = employee_clean %>% select(-Attrition),
                                        y =  employee_clean$Attrition,
                                        yname = "Attrition")

prop.table(table(employee_down$Attrition))
## 
##  No Yes 
## 0.5 0.5
employee_up <- upSample(x = employee_clean %>% select(-Attrition),
                                        y =  employee_clean$Attrition,
                                        yname = "Attrition")

prop.table(table(employee_up$Attrition))
## 
##  No Yes 
## 0.5 0.5

6. Cross-Validation

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(314)


emp_index <- initial_split(employee_clean, prop = 0.8, strata = "Attrition")
emp_train <- training(emp_index)
emp_test <- testing(emp_index)


emp_index <- initial_split(employee_down, prop = 0.8, strata = "Attrition")
emp_train_down <- training(emp_index)
emp_test_down <- testing(emp_index)


emp_index <- initial_split(employee_up, prop = 0.8, strata = "Attrition")
emp_train_up <- training(emp_index)
emp_test_up <- testing(emp_index)
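As a quick sanity check (not part of the original output), the stratified splits should roughly preserve the class balance of their source data; a small sketch:

# the full-data split should stay close to ~84% No / ~16% Yes
prop.table(table(emp_train$Attrition))
prop.table(table(emp_test$Attrition))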

7. Model Building Section A

7.1 All Pred, No Down, No Up

7.1.1 Pre-processing

emp_train_x <-
  emp_train %>% 
  select(-Attrition)

emp_test_x <-
  emp_test %>% 
  select(-Attrition)

emp_train_y <-
  emp_train %>% 
  select(Attrition)

emp_test_y <-
  emp_test %>% 
  select(Attrition)
# glimpse(emp_train_x)
# emp_train_x[,c(2,4,7,9,13,15,19)]
# one-hot encode the factor predictors with model.matrix(), drop the intercept
# column, then scale every column to mean 0 / sd 1
emp_train_x_sc <-
  as.data.frame(model.matrix(~., data = emp_train_x)) %>% 
  select(-"(Intercept)") %>% 
  scale()

emp_test_x_sc <- 
  as.data.frame(model.matrix(~., data = emp_test_x)) %>% 
  select(-"(Intercept)") %>% 
  scale()

emp_train_x_keras <- 
  emp_train_x_sc %>% 
  array_reshape(dim = dim(emp_train_x_sc))

emp_test_x_keras <- 
  emp_test_x_sc %>% 
  array_reshape(dim = dim(emp_test_x_sc))

# one-hot encode the target: column 1 corresponds to "No" (0), column 2 to "Yes" (1)
emp_train_y_keras <- 
  emp_train_y %>%
  mutate(Attrition = ifelse(Attrition == "Yes", 1, 0)) %>% 
  as.matrix() %>%
  to_categorical()

# emp_test_y_keras <- 
#   emp_test_y %>%
#   mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>% 
#   as.matrix() %>%
#   to_categorical()

7.1.2 Model Fitting

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(314)
initializer <- initializer_random_normal(seed = 314)

model_nn <- keras_model_sequential()

model_nn %>% 
  layer_dense(input_shape = 44, 
              units =  128,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden1") %>% 
  layer_dense(units = 64,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden2") %>% 
  layer_dense(units = 32, 
              activation = "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
            name = "hidden3") %>% 
  layer_dense(units =  2,
              activation = "sigmoid",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name = "output") 

model_nn %>%
  compile(
    loss =  "binary_crossentropy",
    optimizer = optimizer_adam(learning_rate = 0.001),
    metrics =  "accuracy"
  )

history <- model_nn %>% 
           fit(emp_train_x_keras,
               emp_train_y_keras,
               epochs = 15,
               batch_size = 200)
## Epoch 1/15
## 6/6 - 1s - loss: 0.6814 - accuracy: 0.8128 - 793ms/epoch - 132ms/step
## Epoch 2/15
## 6/6 - 0s - loss: 0.6539 - accuracy: 0.8570 - 50ms/epoch - 8ms/step
## Epoch 3/15
## 6/6 - 0s - loss: 0.6113 - accuracy: 0.8315 - 24ms/epoch - 4ms/step
## Epoch 4/15
## 6/6 - 0s - loss: 0.5525 - accuracy: 0.8255 - 23ms/epoch - 4ms/step
## Epoch 5/15
## 6/6 - 0s - loss: 0.4766 - accuracy: 0.8511 - 24ms/epoch - 4ms/step
## Epoch 6/15
## 6/6 - 0s - loss: 0.4010 - accuracy: 0.8791 - 26ms/epoch - 4ms/step
## Epoch 7/15
## 6/6 - 0s - loss: 0.3453 - accuracy: 0.8868 - 23ms/epoch - 4ms/step
## Epoch 8/15
## 6/6 - 0s - loss: 0.3084 - accuracy: 0.8911 - 25ms/epoch - 4ms/step
## Epoch 9/15
## 6/6 - 0s - loss: 0.2903 - accuracy: 0.8953 - 22ms/epoch - 4ms/step
## Epoch 10/15
## 6/6 - 0s - loss: 0.2809 - accuracy: 0.8953 - 25ms/epoch - 4ms/step
## Epoch 11/15
## 6/6 - 0s - loss: 0.2775 - accuracy: 0.9021 - 26ms/epoch - 4ms/step
## Epoch 12/15
## 6/6 - 0s - loss: 0.2718 - accuracy: 0.9038 - 24ms/epoch - 4ms/step
## Epoch 13/15
## 6/6 - 0s - loss: 0.2699 - accuracy: 0.9047 - 25ms/epoch - 4ms/step
## Epoch 14/15
## 6/6 - 0s - loss: 0.2700 - accuracy: 0.9072 - 29ms/epoch - 5ms/step
## Epoch 15/15
## 6/6 - 0s - loss: 0.2681 - accuracy: 0.9098 - 28ms/epoch - 5ms/step
summary(model_nn)
## Model: "sequential"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  hidden1 (Dense)                    (None, 128)                     5760        
##  hidden2 (Dense)                    (None, 64)                      8256        
##  hidden3 (Dense)                    (None, 32)                      2080        
##  output (Dense)                     (None, 2)                       66          
## ================================================================================
## Total params: 16,162
## Trainable params: 16,162
## Non-trainable params: 0
## ________________________________________________________________________________

7.1.3 Visualization Plot 1

plot(history)

7.1.4 Evaluation

pred_nn_test <- predict(model_nn, emp_test_x_keras)
## 10/10 - 0s - 97ms/epoch - 10ms/step
# convert the two predicted probabilities per employee into a class label:
# max.col() picks the column ("0" = No, "1" = Yes) with the higher probability
pred_nn_test_label <- 
  pred_nn_test %>% 
  as.data.frame() %>% 
  `colnames<-`(0:1) %>% 
  transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1", "Yes", "No")))

confusionMatrix(data = pred_nn_test_label$Class, reference = emp_test_y$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  236  27
##        Yes  11  21
##                                           
##                Accuracy : 0.8712          
##                  95% CI : (0.8275, 0.9072)
##     No Information Rate : 0.8373          
##     P-Value [Acc > NIR] : 0.06380         
##                                           
##                   Kappa : 0.4539          
##                                           
##  Mcnemar's Test P-Value : 0.01496         
##                                           
##             Sensitivity : 0.43750         
##             Specificity : 0.95547         
##          Pos Pred Value : 0.65625         
##          Neg Pred Value : 0.89734         
##              Prevalence : 0.16271         
##          Detection Rate : 0.07119         
##    Detection Prevalence : 0.10847         
##       Balanced Accuracy : 0.69648         
##                                           
##        'Positive' Class : Yes             
## 

7.1.5 Summary Evaluation

  1. The confusion matrix reports an accuracy of 87.12%, which is quite high. However, given the imbalanced nature of the dataset, this accuracy can be misleading: a model that predicted the majority class for every employee would still score well on accuracy. The test accuracy is only slightly lower than the final training accuracy of roughly 91%, suggesting the model is a reasonable fit.
  2. Sensitivity / Recall: 43.75%. Recall measures the proportion of actual positives (employees who will leave) that were correctly identified by the model. A recall of 43.75% means the model caught fewer than half of the employees who were actually at risk of leaving, which is problematic if the goal is to identify as many potential leavers as possible.
  3. Pos Pred Value / Precision: 65.63%. Precision indicates the proportion of positive predictions (employees predicted to leave) that were actually correct. With a precision of 65.63%, the model is reasonably accurate when it predicts that an employee will leave, but about 34.38% of those positive predictions are still incorrect.

Interpretation: While the accuracy is high, this might give a false sense of performance due to data imbalance. The lower Recall indicates that the model misses a significant portion of employees who are at risk of leaving, and the moderate Precision shows that there’s still a considerable rate of false positives.

While the model demonstrates high accuracy, its performance on Recall and Precision suggests room for improvement. The low Recall indicates that the model misses a substantial number of employees who are at risk of leaving, which is crucial for preventive measures. The moderate Precision shows that the model has a fair accuracy in predicting who will leave but still has a notable rate of incorrect predictions. To enhance the model’s utility, especially in a business context focused on attrition management, efforts should be directed toward improving both Recall and Precision. This will ensure the model not only performs well overall but also provides reliable predictions for actionable insights.
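For transparency, the recall and precision quoted above follow directly from the confusion-matrix counts printed in 7.1.4; a minimal sketch recomputing them by hand:

# counts taken from the confusion matrix above (positive class = "Yes")
TP <- 21; FN <- 27; FP <- 11; TN <- 236

recall    <- TP / (TP + FN)                   # 21 / 48  = 0.4375
precision <- TP / (TP + FP)                   # 21 / 32  = 0.65625
accuracy  <- (TP + TN) / (TP + TN + FP + FN)  # 257 / 295 = 0.8712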

7.2 All Pred, Downsampling

7.2.1 Pre Processing

emp_train_down_x <-
  emp_train_down %>% 
  select(-Attrition)

emp_test_down_x <-
  emp_test_down %>% 
  select(-Attrition)

emp_train_down_y <-
  emp_train_down %>% 
  select(Attrition)

emp_test_down_y <-
  emp_test_down %>% 
  select(Attrition)
# glimpse(emp_train_x)
# emp_train_x[,c(2,4,7,9,13,15,19)]

emp_train_down_x_sc <- as.data.frame(model.matrix(~., data = emp_train_down_x)) %>% 
  select(-"(Intercept)") %>% 
  scale()

emp_test_down_x_sc <- 
  as.data.frame(model.matrix(~., data = emp_test_down_x)) %>% 
  select(-"(Intercept)") %>% 
  scale()

emp_train_down_x_keras <- 
  emp_train_down_x_sc %>% 
  array_reshape(dim = dim(emp_train_down_x_sc))

emp_test_down_x_keras <- 
  emp_test_down_x_sc %>% 
  array_reshape(dim = dim(emp_test_down_x_sc))

emp_train_down_y_keras <- 
  emp_train_down_y %>%
  mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>% 
  as.matrix() %>%
  to_categorical()

# emp_test_y_keras <-
#   emp_test_y %>%
#   mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
#   as.matrix() %>%
#   to_categorical()

7.2.2 Model Fitting

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(314)
initializer <- initializer_random_normal(seed = 314)

model_nn_down <- keras_model_sequential()

model_nn_down %>% 
  layer_dense(input_shape = 44, 
              units =  128,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden1") %>% 
  layer_dense(units = 64,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden2") %>% 
  layer_dense(units = 32, 
              activation = "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
            name = "hidden3") %>% 
  layer_dense(units =  2,
              activation = "sigmoid",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name = "output") 

model_nn_down %>%
  compile(
    loss =  "binary_crossentropy",
    optimizer = optimizer_adam(learning_rate = 0.001),
    metrics =  "accuracy"
  )

history_down <- model_nn_down %>% 
           fit(emp_train_down_x_keras,
               emp_train_down_y_keras,
               epochs = 20,
               batch_size = 100)
## Epoch 1/20
## 4/4 - 1s - loss: 0.6910 - accuracy: 0.5794 - 596ms/epoch - 149ms/step
## Epoch 2/20
## 4/4 - 0s - loss: 0.6786 - accuracy: 0.7196 - 13ms/epoch - 3ms/step
## Epoch 3/20
## 4/4 - 0s - loss: 0.6599 - accuracy: 0.7593 - 16ms/epoch - 4ms/step
## Epoch 4/20
## 4/4 - 0s - loss: 0.6313 - accuracy: 0.7566 - 21ms/epoch - 5ms/step
## Epoch 5/20
## 4/4 - 0s - loss: 0.5945 - accuracy: 0.7593 - 24ms/epoch - 6ms/step
## Epoch 6/20
## 4/4 - 0s - loss: 0.5590 - accuracy: 0.7698 - 25ms/epoch - 6ms/step
## Epoch 7/20
## 4/4 - 0s - loss: 0.5299 - accuracy: 0.7698 - 15ms/epoch - 4ms/step
## Epoch 8/20
## 4/4 - 0s - loss: 0.5043 - accuracy: 0.7778 - 16ms/epoch - 4ms/step
## Epoch 9/20
## 4/4 - 0s - loss: 0.4838 - accuracy: 0.7778 - 18ms/epoch - 4ms/step
## Epoch 10/20
## 4/4 - 0s - loss: 0.4659 - accuracy: 0.8042 - 15ms/epoch - 4ms/step
## Epoch 11/20
## 4/4 - 0s - loss: 0.4516 - accuracy: 0.8042 - 14ms/epoch - 3ms/step
## Epoch 12/20
## 4/4 - 0s - loss: 0.4393 - accuracy: 0.8201 - 12ms/epoch - 3ms/step
## Epoch 13/20
## 4/4 - 0s - loss: 0.4300 - accuracy: 0.8360 - 15ms/epoch - 4ms/step
## Epoch 14/20
## 4/4 - 0s - loss: 0.4195 - accuracy: 0.8413 - 14ms/epoch - 4ms/step
## Epoch 15/20
## 4/4 - 0s - loss: 0.4120 - accuracy: 0.8439 - 17ms/epoch - 4ms/step
## Epoch 16/20
## 4/4 - 0s - loss: 0.4067 - accuracy: 0.8413 - 13ms/epoch - 3ms/step
## Epoch 17/20
## 4/4 - 0s - loss: 0.4020 - accuracy: 0.8492 - 19ms/epoch - 5ms/step
## Epoch 18/20
## 4/4 - 0s - loss: 0.3988 - accuracy: 0.8598 - 18ms/epoch - 4ms/step
## Epoch 19/20
## 4/4 - 0s - loss: 0.3942 - accuracy: 0.8624 - 24ms/epoch - 6ms/step
## Epoch 20/20
## 4/4 - 0s - loss: 0.3919 - accuracy: 0.8598 - 22ms/epoch - 5ms/step
summary(model_nn_down)
## Model: "sequential_1"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  hidden1 (Dense)                    (None, 128)                     5760        
##  hidden2 (Dense)                    (None, 64)                      8256        
##  hidden3 (Dense)                    (None, 32)                      2080        
##  output (Dense)                     (None, 2)                       66          
## ================================================================================
## Total params: 16,162
## Trainable params: 16,162
## Non-trainable params: 0
## ________________________________________________________________________________

7.2.3 Visualization Plot 2

plot(history_down)

7.2.4 Evaluation

pred_nn_test_down <- predict(model_nn_down, emp_test_down_x_keras)
## 3/3 - 0s - 57ms/epoch - 19ms/step
pred_nn_test_down_label <- 
pred_nn_test_down %>% 
  as.data.frame() %>% 
  `colnames<-`(0:1) %>% 
  transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))

confusionMatrix(data = pred_nn_test_down_label$Class, reference = emp_test_down_y$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  37  11
##        Yes 11  37
##                                           
##                Accuracy : 0.7708          
##                  95% CI : (0.6739, 0.8505)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 4.72e-08        
##                                           
##                   Kappa : 0.5417          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7708          
##             Specificity : 0.7708          
##          Pos Pred Value : 0.7708          
##          Neg Pred Value : 0.7708          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3854          
##    Detection Prevalence : 0.5000          
##       Balanced Accuracy : 0.7708          
##                                           
##        'Positive' Class : Yes             
## 

7.2.5 Summary Evaluation

  1. The confusion matrix reports an accuracy of 77.08%, meaning the model correctly predicted the class (whether an employee will leave or stay) 77.08% of the time. This is decent, but noticeably lower than the final training accuracy of about 86%, indicating that the model might be slightly overfitting.
  2. Sensitivity / Recall: 77.08%. Recall measures the proportion of actual positives (employees who will leave) that were correctly identified by the model. A recall of 77.08% means the model is fairly effective at identifying employees who are at risk of leaving.
  3. Pos Pred Value / Precision: 77.08%. Precision indicates the proportion of positive predictions (employees predicted to leave) that were actually correct. With a precision of 77.08%, the model is relatively reliable when it predicts attrition, though false positives still occur.

Interpretation: While the accuracy is reasonable, it is important to note that it has dropped from the training phase, which could signal overfitting. The Recall and Precision are balanced, both at 77.08%, which is decent but might need improvement depending on the business needs.

Overall, the model performs adequately with balanced metrics, but given that the Accuracy during testing is notably lower than during training, there’s a slight concern of overfitting. Additional tuning might be necessary to achieve a model that consistently performs well across Accuracy, Recall, and Precision, ideally targeting at least 80% in all metrics.
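One way to quantify the train/test gap discussed above is to score both sets with keras::evaluate(); a sketch, assuming a one-hot test label matrix built the same way as the training labels (this matrix is left commented out in the pre-processing step above):

# one-hot encode the downsampled test labels, mirroring emp_train_down_y_keras
emp_test_down_y_keras <- 
  emp_test_down_y %>%
  mutate(Attrition = ifelse(Attrition == "Yes", 1, 0)) %>% 
  as.matrix() %>%
  to_categorical()

# loss and accuracy on training vs. test data; a large gap points to overfitting
evaluate(model_nn_down, emp_train_down_x_keras, emp_train_down_y_keras)
evaluate(model_nn_down, emp_test_down_x_keras, emp_test_down_y_keras)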

7.3 All Pred, Upsampling

7.3.1 Pre-Processing

emp_train_up_x <-
  emp_train_up %>% 
  select(-Attrition)

emp_test_up_x <-
  emp_test_up %>% 
  select(-Attrition)

emp_train_up_y <-
  emp_train_up %>% 
  select(Attrition)

emp_test_up_y <-
  emp_test_up %>% 
  select(Attrition)
# glimpse(emp_train_x)
# emp_train_x[,c(2,4,7,9,13,15,19)]
emp_train_up_x_sc <-
  as.data.frame(model.matrix(~., data = emp_train_up_x)) %>% 
  select(-"(Intercept)") %>% 
  scale()

emp_test_up_x_sc <- 
  as.data.frame(model.matrix(~., data = emp_test_up_x)) %>% 
  select(-"(Intercept)") %>% 
  scale()

emp_train_up_x_keras <- 
  emp_train_up_x_sc %>% 
  array_reshape(dim = dim(emp_train_up_x_sc))

emp_test_up_x_keras <- 
  emp_test_up_x_sc %>% 
  array_reshape(dim = dim(emp_test_up_x_sc))

emp_train_up_y_keras <- 
  emp_train_up_y %>%
  mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>% 
  as.matrix() %>%
  to_categorical()

# emp_test_y_keras <- 
#   emp_test_y %>%
#   mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>% 
#   as.matrix() %>%
#   to_categorical()

7.3.2 Model Fitting

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(314)
initializer <- initializer_random_normal(seed = 314)

model_nn_up <- keras_model_sequential()

model_nn_up %>% 
  layer_dense(input_shape = 44, 
              units =  128,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden1") %>% 
  layer_dense(units = 64,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden2") %>% 
  layer_dense(units = 32, 
              activation = "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
            name = "hidden3") %>% 
  layer_dense(units =  2,
              activation = "sigmoid",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name = "output") 

model_nn_up %>%
  compile(
    loss =  "binary_crossentropy",
    optimizer = optimizer_adam(learning_rate = 0.001),
    metrics =  "accuracy"
  )

history_up <- model_nn_up %>% 
           fit(emp_train_up_x_keras,
               emp_train_up_y_keras,
               epochs = 20,
               batch_size = 250)
## Epoch 1/20
## 8/8 - 1s - loss: 0.6848 - accuracy: 0.6592 - 725ms/epoch - 91ms/step
## Epoch 2/20
## 8/8 - 0s - loss: 0.6485 - accuracy: 0.7155 - 31ms/epoch - 4ms/step
## Epoch 3/20
## 8/8 - 0s - loss: 0.5902 - accuracy: 0.7165 - 34ms/epoch - 4ms/step
## Epoch 4/20
## 8/8 - 0s - loss: 0.5340 - accuracy: 0.7541 - 37ms/epoch - 5ms/step
## Epoch 5/20
## 8/8 - 0s - loss: 0.4954 - accuracy: 0.7809 - 40ms/epoch - 5ms/step
## Epoch 6/20
## 8/8 - 0s - loss: 0.4751 - accuracy: 0.8027 - 37ms/epoch - 5ms/step
## Epoch 7/20
## 8/8 - 0s - loss: 0.4631 - accuracy: 0.8083 - 31ms/epoch - 4ms/step
## Epoch 8/20
## 8/8 - 0s - loss: 0.4582 - accuracy: 0.8144 - 28ms/epoch - 3ms/step
## Epoch 9/20
## 8/8 - 0s - loss: 0.4541 - accuracy: 0.8124 - 33ms/epoch - 4ms/step
## Epoch 10/20
## 8/8 - 0s - loss: 0.4532 - accuracy: 0.8159 - 33ms/epoch - 4ms/step
## Epoch 11/20
## 8/8 - 0s - loss: 0.4476 - accuracy: 0.8215 - 38ms/epoch - 5ms/step
## Epoch 12/20
## 8/8 - 0s - loss: 0.4464 - accuracy: 0.8240 - 40ms/epoch - 5ms/step
## Epoch 13/20
## 8/8 - 0s - loss: 0.4447 - accuracy: 0.8271 - 33ms/epoch - 4ms/step
## Epoch 14/20
## 8/8 - 0s - loss: 0.4437 - accuracy: 0.8240 - 34ms/epoch - 4ms/step
## Epoch 15/20
## 8/8 - 0s - loss: 0.4414 - accuracy: 0.8220 - 41ms/epoch - 5ms/step
## Epoch 16/20
## 8/8 - 0s - loss: 0.4387 - accuracy: 0.8316 - 33ms/epoch - 4ms/step
## Epoch 17/20
## 8/8 - 0s - loss: 0.4373 - accuracy: 0.8316 - 41ms/epoch - 5ms/step
## Epoch 18/20
## 8/8 - 0s - loss: 0.4352 - accuracy: 0.8347 - 39ms/epoch - 5ms/step
## Epoch 19/20
## 8/8 - 0s - loss: 0.4342 - accuracy: 0.8357 - 32ms/epoch - 4ms/step
## Epoch 20/20
## 8/8 - 0s - loss: 0.4318 - accuracy: 0.8362 - 38ms/epoch - 5ms/step
summary(model_nn_up)
## Model: "sequential_2"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  hidden1 (Dense)                    (None, 128)                     5760        
##  hidden2 (Dense)                    (None, 64)                      8256        
##  hidden3 (Dense)                    (None, 32)                      2080        
##  output (Dense)                     (None, 2)                       66          
## ================================================================================
## Total params: 16,162
## Trainable params: 16,162
## Non-trainable params: 0
## ________________________________________________________________________________

7.3.3 Visualization Plot 3

plot(history_up)

7.3.4 Evaluation

pred_nn_test_up <- predict(model_nn_up, emp_test_up_x_keras)
## 16/16 - 0s - 70ms/epoch - 4ms/step
pred_nn_test_up_label <- 
pred_nn_test_up %>% 
  as.data.frame() %>% 
  `colnames<-`(0:1) %>% 
  transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))

confusionMatrix(data = pred_nn_test_up_label$Class, reference = emp_test_up_y$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  200  52
##        Yes  47 195
##                                          
##                Accuracy : 0.7996         
##                  95% CI : (0.7615, 0.834)
##     No Information Rate : 0.5            
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.5992         
##                                          
##  Mcnemar's Test P-Value : 0.6877         
##                                          
##             Sensitivity : 0.7895         
##             Specificity : 0.8097         
##          Pos Pred Value : 0.8058         
##          Neg Pred Value : 0.7937         
##              Prevalence : 0.5000         
##          Detection Rate : 0.3947         
##    Detection Prevalence : 0.4899         
##       Balanced Accuracy : 0.7996         
##                                          
##        'Positive' Class : Yes            
## 

7.3.5 Summary Evaluation

  1. The confusion matrix reports an accuracy of 79.96%, indicating that the model correctly predicted the class (whether an employee will leave or stay) roughly 80% of the time. This is a good score, though accuracy alone does not tell the full story, especially since the original data is imbalanced.
  2. Sensitivity / Recall: 78.95%. Recall measures the proportion of actual positives (employees who will leave) that were correctly identified by the model. With a recall of 78.95%, the model catches a high proportion of employees who are at risk of leaving.
  3. Pos Pred Value / Precision: 80.58%. Precision indicates the proportion of positive predictions (employees predicted to leave) that were actually correct. A precision of 80.58% means the model is quite accurate when it predicts that an employee will leave, though roughly one in five positive predictions is still a false alarm.

Interpretation: The accuracy is solid, and Recall and Precision are well balanced, both close to 80%. The gap between the final training accuracy (about 84%) and the test accuracy (about 80%) is modest, so overfitting appears limited for this model.

Overall, the model performs adequately with balanced Accuracy, Recall, and Precision. Recall sits just below the 80% mark, so further tuning could still improve robustness, with the aim of pushing all metrics consistently to 80% or above so the model remains reliable and useful for decision-making in HR.

7.4 Model Building Section A Summary

Model All Pred, No Down, No Up

Accuracy: 87.12% (slightly lower than the final training accuracy of about 91%)
Recall: 43.75%
Precision: 65.63%

Analysis:

Accuracy: The high accuracy can be misleading on an imbalanced dataset; the model can achieve it by predominantly predicting the majority class.
Recall: At 43.75%, recall is low, indicating that the model misses a significant portion of employees who are at risk of leaving.
Precision: At 65.63%, precision is decent, but there is still a notable rate of false positives.

Conclusion: This model may not be fully overfitted, but the high accuracy combined with the low recall suggests it is biased towards the majority class. Further adjustments are needed to improve recall.

Model All Pred, Downsampling

Accuracy: 77.08% (lower than the final training accuracy of about 86%)
Recall: 77.08%
Precision: 77.08%

Analysis:

Accuracy: The drop from training to test accuracy indicates potential overfitting, where the model performs well on training data but less so on unseen data.
Recall and Precision: Both metrics are balanced and fairly high, indicating that the model is effective at identifying and predicting employee attrition.

Conclusion: This model shows signs of overfitting due to the drop in accuracy from training to test data. However, the balanced recall and precision suggest it identifies attrition reasonably well; further tuning may still help.

Model All Pred, Upsampling

Accuracy: 79.96%
Recall: 78.95%
Precision: 80.58%

Analysis:

Accuracy: A good, relatively stable score, suggesting the model may be more robust across different datasets than the other two models.
Recall and Precision: Both metrics are close to 80% and well balanced, indicating that the model detects employees at risk of leaving and makes reasonably accurate predictions.

Conclusion: This model appears less prone to overfitting than the first two, as accuracy, recall, and precision are balanced and consistent. It is a good fit for practical use, though there is still room for improvement.

Conclusion

Model All Pred, No Down, No Up: needs further adjustment, especially to improve recall; potential overfitting should also be evaluated further.
Model All Pred, Downsampling: shows signs of overfitting due to the drop in accuracy from training to testing; additional tuning may be needed to ensure consistent performance.
Model All Pred, Upsampling: more stable and less prone to overfitting than the other models; relatively reliable, but could be fine-tuned for even better performance.

8. Model Building Section B

Dimensionality reduction will be performed with Principal Component Analysis (PCA), keeping the smallest number of components whose cumulative explained variance is at least 85%. A neural network will then be trained on this reduced set of predictors. This model uses the original, imbalanced target variable.
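As a sketch, the 85% cutoff can be read off the cumulative-variance column that FactoMineR stores in $eig (using the full-data employee_pca from section 4.4 as an example; the same logic applies to the training-set PCA fitted below, where it also yields 15 components):

# smallest number of components whose cumulative explained variance reaches 85%
ncp_85 <- min(which(employee_pca$eig[, "cumulative percentage of variance"] >= 85))
ncp_85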

8.1 Model No Down, No Up

8.1.1 Pre-Processing

glimpse(emp_train_x)
## Rows: 1,175
## Columns: 30
## $ Age                      <int> 49, 33, 27, 32, 30, 36, 35, 29, 31, 34, 29, 3…
## $ BusinessTravel           <fct> Travel_Frequently, Travel_Frequently, Travel_…
## $ DailyRate                <int> 279, 1392, 591, 1005, 1358, 1299, 809, 153, 6…
## $ Department               <fct> Research & Development, Research & Developmen…
## $ DistanceFromHome         <int> 8, 3, 2, 2, 24, 27, 16, 15, 26, 19, 21, 5, 11…
## $ Education                <int> 1, 4, 1, 2, 1, 3, 3, 2, 1, 2, 4, 2, 2, 4, 2, …
## $ EducationField           <fct> Life Sciences, Life Sciences, Medical, Life S…
## $ EnvironmentSatisfaction  <int> 3, 4, 1, 4, 4, 3, 1, 4, 1, 2, 2, 1, 1, 1, 3, …
## $ Gender                   <fct> Male, Female, Male, Male, Male, Male, Male, F…
## $ HourlyRate               <int> 61, 56, 40, 79, 67, 94, 84, 49, 31, 93, 51, 8…
## $ JobInvolvement           <int> 2, 3, 3, 3, 3, 3, 4, 2, 3, 3, 4, 4, 4, 3, 3, …
## $ JobLevel                 <int> 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 3, 1, 2, 3, 1, …
## $ JobRole                  <fct> Research Scientist, Research Scientist, Labor…
## $ JobSatisfaction          <int> 2, 3, 2, 4, 3, 3, 2, 3, 3, 4, 1, 2, 3, 2, 4, …
## $ MaritalStatus            <fct> Married, Married, Married, Single, Divorced, …
## $ MonthlyIncome            <int> 5130, 2909, 3468, 3068, 2693, 5237, 2426, 419…
## $ MonthlyRate              <int> 24907, 23159, 16632, 11864, 13335, 16577, 164…
## $ NumCompaniesWorked       <int> 1, 1, 9, 0, 1, 6, 0, 0, 1, 0, 1, 0, 0, 0, 1, …
## $ OverTime                 <fct> No, Yes, No, No, No, No, No, Yes, No, No, No,…
## $ PercentSalaryHike        <int> 23, 11, 12, 13, 22, 13, 13, 12, 17, 11, 11, 1…
## $ PerformanceRating        <int> 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ RelationshipSatisfaction <int> 4, 3, 4, 3, 2, 2, 3, 4, 4, 3, 3, 4, 4, 3, 4, …
## $ StockOptionLevel         <int> 1, 0, 1, 0, 1, 2, 1, 0, 1, 1, 1, 2, 1, 0, 0, …
## $ TotalWorkingYears        <int> 10, 8, 6, 8, 1, 17, 6, 10, 5, 3, 10, 7, 5, 13…
## $ TrainingTimesLastYear    <int> 3, 3, 3, 2, 2, 3, 5, 3, 1, 2, 1, 5, 5, 4, 6, …
## $ WorkLifeBalance          <int> 3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 3, 2, 2, 3, 3, …
## $ YearsAtCompany           <int> 10, 8, 2, 7, 1, 7, 5, 9, 5, 2, 10, 6, 4, 12, …
## $ YearsInCurrentRole       <int> 7, 7, 2, 7, 0, 7, 4, 5, 2, 2, 9, 2, 2, 6, 0, …
## $ YearsSinceLastPromotion  <int> 1, 3, 2, 3, 0, 7, 0, 0, 4, 1, 8, 0, 1, 2, 0, …
## $ YearsWithCurrManager     <int> 7, 0, 2, 6, 0, 7, 3, 8, 3, 2, 8, 5, 3, 11, 0,…
# column indices of the categorical (qualitative) variables in emp_train_x
qualivar <- c(2,4,7,9,13,15,19)
quantivar <- c(1:30)[!c(1:30) %in% qualivar]

# first fit the PCA with all 23 quantitative components to inspect the eigenvalues
emp_pca_train_x_keep85pc <- PCA(X = emp_train_x,
    scale.unit = T,
    quali.sup = qualivar,
    ncp = 23,
    graph = F)

emp_pca_train_x_keep85pc$eig 
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1  4.66759898             20.2939086                          20.29391
## comp 2  1.81435288              7.8884908                          28.18240
## comp 3  1.76448905              7.6716915                          35.85409
## comp 4  1.19782421              5.2079313                          41.06202
## comp 5  1.12566371              4.8941901                          45.95621
## comp 6  1.07305779              4.6654686                          50.62168
## comp 7  1.05463790              4.5853822                          55.20706
## comp 8  1.02529541              4.4578061                          59.66487
## comp 9  1.01367542              4.4072844                          64.07215
## comp 10 0.98594266              4.2867072                          68.35886
## comp 11 0.97383468              4.2340638                          72.59292
## comp 12 0.93806844              4.0785584                          76.67148
## comp 13 0.90832738              3.9492495                          80.62073
## comp 14 0.87035760              3.7841635                          84.40490
## comp 15 0.85071252              3.6987501                          88.10365
## comp 16 0.77327987              3.3620864                          91.46573
## comp 17 0.53903524              2.3436315                          93.80936
## comp 18 0.51675118              2.2467443                          96.05611
## comp 19 0.29087058              1.2646547                          97.32076
## comp 20 0.21861264              0.9504897                          98.27125
## comp 21 0.20409012              0.8873484                          99.15860
## comp 22 0.14179538              0.6165017                          99.77510
## comp 23 0.05172637              0.2248973                         100.00000
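As a quick cross-check on the table above, the cut-off can also be picked programmatically rather than read off by eye. A minimal sketch, reusing the $eig matrix just printed:

# third column of $eig is the cumulative percentage of variance
cum_var <- emp_pca_train_x_keep85pc$eig[, 3]

# smallest number of components whose cumulative variance reaches 85%
which(cum_var >= 85)[1]
# returns 15 here (84.4% at 14 components, 88.1% at 15), which is the ncp used below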
# re-fit keeping the first 15 components, the smallest number reaching ~85% cumulative variance
emp_pca_train_x_keep85pc <- PCA(X = emp_train_x,
    scale.unit = T,
    quali.sup = qualivar,
    ncp = 15,
    graph = F)
emp_pca_test_x_keep85pc <- predict(emp_pca_train_x_keep85pc, emp_test_x)
glimpse(emp_pca_test_x_keep85pc)
## List of 3
##  $ coord: num [1:295, 1:15] -1.32 1.1 -3.05 4.79 -1.5 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:15] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ cos2 : num [1:295, 1:15] 0.0458 0.0519 0.3552 0.6021 0.1591 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:15] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ dist : num [1:295] 6.16 4.82 5.12 6.17 3.76 ...

8.1.2 Model Fitting

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(314)
initializer <- initializer_random_normal(seed = 314)

model_nn_pca85 <- keras_model_sequential()

model_nn_pca85 %>% 
  layer_dense(input_shape = 15, 
              units =  128,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden1") %>% 
  layer_dense(units = 64,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden2") %>% 
  layer_dense(units = 32, 
              activation = "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
            name = "hidden3") %>% 
  layer_dense(units =  2,
              activation = "sigmoid",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name = "output") 

model_nn_pca85 %>%
  compile(
    loss =  "binary_crossentropy",
    optimizer = optimizer_adam(learning_rate = 0.001),
    metrics =  "accuracy"
  )

history <- model_nn_pca85 %>% 
           fit(emp_pca_train_x_keep85pc$ind$coord,
               emp_train_y_keras,
               epochs = 15,
               batch_size = 200)
## Epoch 1/15
## 6/6 - 1s - loss: 0.6814 - accuracy: 0.8111 - 668ms/epoch - 111ms/step
## Epoch 2/15
## 6/6 - 0s - loss: 0.6537 - accuracy: 0.8357 - 24ms/epoch - 4ms/step
## Epoch 3/15
## 6/6 - 0s - loss: 0.6154 - accuracy: 0.8204 - 27ms/epoch - 5ms/step
## Epoch 4/15
## 6/6 - 0s - loss: 0.5628 - accuracy: 0.8102 - 30ms/epoch - 5ms/step
## Epoch 5/15
## 6/6 - 0s - loss: 0.5015 - accuracy: 0.8196 - 24ms/epoch - 4ms/step
## Epoch 6/15
## 6/6 - 0s - loss: 0.4415 - accuracy: 0.8400 - 23ms/epoch - 4ms/step
## Epoch 7/15
## 6/6 - 0s - loss: 0.3953 - accuracy: 0.8519 - 21ms/epoch - 4ms/step
## Epoch 8/15
## 6/6 - 0s - loss: 0.3781 - accuracy: 0.8570 - 21ms/epoch - 3ms/step
## Epoch 9/15
## 6/6 - 0s - loss: 0.3767 - accuracy: 0.8553 - 22ms/epoch - 4ms/step
## Epoch 10/15
## 6/6 - 0s - loss: 0.3732 - accuracy: 0.8545 - 22ms/epoch - 4ms/step
## Epoch 11/15
## 6/6 - 0s - loss: 0.3733 - accuracy: 0.8587 - 22ms/epoch - 4ms/step
## Epoch 12/15
## 6/6 - 0s - loss: 0.3721 - accuracy: 0.8604 - 21ms/epoch - 4ms/step
## Epoch 13/15
## 6/6 - 0s - loss: 0.3714 - accuracy: 0.8579 - 22ms/epoch - 4ms/step
## Epoch 14/15
## 6/6 - 0s - loss: 0.3701 - accuracy: 0.8579 - 23ms/epoch - 4ms/step
## Epoch 15/15
## 6/6 - 0s - loss: 0.3700 - accuracy: 0.8604 - 24ms/epoch - 4ms/step
summary(model_nn_pca85)
## Model: "sequential_3"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  hidden1 (Dense)                    (None, 128)                     2048        
##  hidden2 (Dense)                    (None, 64)                      8256        
##  hidden3 (Dense)                    (None, 32)                      2080        
##  output (Dense)                     (None, 2)                       66          
## ================================================================================
## Total params: 12,450
## Trainable params: 12,450
## Non-trainable params: 0
## ________________________________________________________________________________
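The parameter counts in the summary can be verified by hand: a dense layer has (inputs * units) weights plus one bias per unit. A small check in R:

# parameters of a dense layer = inputs * units + units (one bias per unit)
dense_params <- function(n_in, n_out) n_in * n_out + n_out

dense_params(15, 128)  # hidden1: 2048
dense_params(128, 64)  # hidden2: 8256
dense_params(64, 32)   # hidden3: 2080
dense_params(32, 2)    # output:  66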

8.1.3 Visualization Plot 1

plot(history)

8.1.4 Evaluation

pred_nn_pca85_test <- predict(model_nn_pca85, emp_pca_test_x_keep85pc$coord)
## 10/10 - 0s - 65ms/epoch - 6ms/step
pred_nn_pca85_test_label <- 
pred_nn_pca85_test %>% 
  as.data.frame() %>% 
  # name the two probability columns after the class indices 0 (No) and 1 (Yes)
  `colnames<-`(0:1) %>% 
  # take the column with the highest probability per row and map it back to Yes/No
  transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1", "Yes", "No")))

confusionMatrix(data = pred_nn_pca85_test_label$Class, reference = emp_test_y$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  240  34
##        Yes   7  14
##                                           
##                Accuracy : 0.861           
##                  95% CI : (0.8162, 0.8984)
##     No Information Rate : 0.8373          
##     P-Value [Acc > NIR] : 0.1523          
##                                           
##                   Kappa : 0.3405          
##                                           
##  Mcnemar's Test P-Value : 4.896e-05       
##                                           
##             Sensitivity : 0.29167         
##             Specificity : 0.97166         
##          Pos Pred Value : 0.66667         
##          Neg Pred Value : 0.87591         
##              Prevalence : 0.16271         
##          Detection Rate : 0.04746         
##    Detection Prevalence : 0.07119         
##       Balanced Accuracy : 0.63166         
##                                           
##        'Positive' Class : Yes             
## 
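For reference, the headline figures in the summary below can be recomputed directly from the confusion matrix above (TP = 14, FP = 7, FN = 34, TN = 240):

TP <- 14; FP <- 7; FN <- 34; TN <- 240

(TP + TN) / (TP + TN + FP + FN)  # accuracy  = 0.861
TP / (TP + FN)                   # recall    = 0.2917
TP / (TP + FP)                   # precision = 0.6667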

8.1.5 Summary Evaluation

  1. Accuracy: 86.1%. The model correctly predicted whether an employee will leave or stay 86.1% of the time. This is consistent with the other models, suggesting the model is neither over- nor under-fitting the data.
  2. Sensitivity / Recall: 29.17%. Recall measures the proportion of actual positives (employees who will leave) that the model correctly identified. A Recall of 29.17% is low, meaning the model misses most of the employees who are actually at risk of leaving.
  3. Pos Pred Value / Precision: 66.67%. Precision indicates the proportion of positive predictions (employees predicted to leave) that were actually correct. At 66.67%, the model is only moderately reliable when it does flag an employee as likely to leave, and it still produces false positives.

Interpretation: the accuracy of 86.1% is similar to the other models, so the model fits the data reasonably well overall. Recall, however, is very low, which is a serious concern if the objective is to identify as many at-risk employees as possible, and Precision is only moderate.

While the model’s accuracy is high and consistent, the weak Recall is a major drawback, so this model is not recommended for further use as-is. It would need further tuning, or a different approach, to improve Recall and make it reliable for practical HR decision-making; one low-cost adjustment is sketched below.
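That adjustment could be as simple as lowering the decision threshold on the predicted “Yes” probability instead of always taking the higher-scoring column. The sketch assumes the second column of pred_nn_pca85_test corresponds to the “Yes” class, and 0.3 is only an illustrative cut-off:

# trade some Precision for Recall by flagging attrition at a lower cut-off
threshold <- 0.3  # illustrative value; would need tuning on a validation set

pred_nn_pca85_thresh_label <- factor(
  ifelse(pred_nn_pca85_test[, 2] > threshold, "Yes", "No"),
  levels = levels(emp_test_y$Attrition)
)

confusionMatrix(data = pred_nn_pca85_thresh_label,
                reference = emp_test_y$Attrition,
                positive = "Yes")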

8.2 Model Down Sampling

8.2.1 Pre Processing

emp_pca_train_down_x_keep85pc <- PCA(X = emp_train_down_x,
    scale.unit = T,
    quali.sup = qualivar,
    ncp = 23,
    graph = F)

emp_pca_train_down_x_keep85pc$eig 
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1   4.8418940             21.0517130                          21.05171
## comp 2   1.9245716              8.3677027                          29.41942
## comp 3   1.6613027              7.2230550                          36.64247
## comp 4   1.3322077              5.7922073                          42.43468
## comp 5   1.2104230              5.2627089                          47.69739
## comp 6   1.1589041              5.0387134                          52.73610
## comp 7   1.1088665              4.8211587                          57.55726
## comp 8   1.1023011              4.7926135                          62.34987
## comp 9   1.0377085              4.5117762                          66.86165
## comp 10  0.9786614              4.2550495                          71.11670
## comp 11  0.9262472              4.0271616                          75.14386
## comp 12  0.8542553              3.7141537                          78.85801
## comp 13  0.8294086              3.6061242                          82.46414
## comp 14  0.7955556              3.4589373                          85.92307
## comp 15  0.7233068              3.1448121                          89.06789
## comp 16  0.6492792              2.8229531                          91.89084
## comp 17  0.4958773              2.1559884                          94.04683
## comp 18  0.4391188              1.9092120                          95.95604
## comp 19  0.2886065              1.2548110                          97.21085
## comp 20  0.2446569              1.0637258                          98.27458
## comp 21  0.2249168              0.9778991                          99.25248
## comp 22  0.1220239              0.5305387                          99.78302
## comp 23  0.0499065              0.2169848                         100.00000
emp_pca_train_down_x_keep85pc <- PCA(X = emp_train_down_x,
    scale.unit = T,
    quali.sup = qualivar,
    ncp = 14,
    graph = F)
emp_pca_test_down_x_keep85pc <- predict(emp_pca_train_down_x_keep85pc, emp_test_down_x)
glimpse(emp_pca_test_down_x_keep85pc)
## List of 3
##  $ coord: num [1:96, 1:14] -2 -0.172 -1.022 1.766 0.981 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ cos2 : num [1:96, 1:14] 0.15478 0.00236 0.07268 0.12367 0.08736 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ dist : num [1:96] 5.08 3.55 3.79 5.02 3.32 ...

8.2.2 Model Fitting

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(314)
initializer <- initializer_random_normal(seed = 314)

model_nn_down_pca85 <- keras_model_sequential()

model_nn_down_pca85 %>% 
  layer_dense(input_shape = 14, 
              units =  128,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden1") %>% 
  layer_dense(units = 64,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden2") %>% 
  layer_dense(units = 32, 
              activation = "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
            name = "hidden3") %>% 
  layer_dense(units =  2,
              activation = "sigmoid",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name = "output") 

model_nn_down_pca85 %>%
  compile(
    loss =  "binary_crossentropy",
    optimizer = optimizer_adam(learning_rate = 0.05),
    metrics =  "accuracy"
  )

history <- model_nn_down_pca85 %>% 
           fit(emp_pca_train_down_x_keep85pc$ind$coord,
               emp_train_down_y_keras,
               epochs = 15,
               batch_size = 150)
## Epoch 1/15
## 3/3 - 1s - loss: 0.6718 - accuracy: 0.5899 - 598ms/epoch - 199ms/step
## Epoch 2/15
## 3/3 - 0s - loss: 0.6458 - accuracy: 0.6481 - 11ms/epoch - 4ms/step
## Epoch 3/15
## 3/3 - 0s - loss: 0.6450 - accuracy: 0.6693 - 11ms/epoch - 4ms/step
## Epoch 4/15
## 3/3 - 0s - loss: 0.6217 - accuracy: 0.6878 - 10ms/epoch - 3ms/step
## Epoch 5/15
## 3/3 - 0s - loss: 0.6099 - accuracy: 0.7063 - 12ms/epoch - 4ms/step
## Epoch 6/15
## 3/3 - 0s - loss: 0.5907 - accuracy: 0.7143 - 12ms/epoch - 4ms/step
## Epoch 7/15
## 3/3 - 0s - loss: 0.5857 - accuracy: 0.7222 - 14ms/epoch - 5ms/step
## Epoch 8/15
## 3/3 - 0s - loss: 0.5960 - accuracy: 0.7275 - 15ms/epoch - 5ms/step
## Epoch 9/15
## 3/3 - 0s - loss: 0.5768 - accuracy: 0.7275 - 15ms/epoch - 5ms/step
## Epoch 10/15
## 3/3 - 0s - loss: 0.5826 - accuracy: 0.7328 - 14ms/epoch - 5ms/step
## Epoch 11/15
## 3/3 - 0s - loss: 0.5670 - accuracy: 0.7302 - 17ms/epoch - 6ms/step
## Epoch 12/15
## 3/3 - 0s - loss: 0.5678 - accuracy: 0.7275 - 12ms/epoch - 4ms/step
## Epoch 13/15
## 3/3 - 0s - loss: 0.5573 - accuracy: 0.7302 - 10ms/epoch - 3ms/step
## Epoch 14/15
## 3/3 - 0s - loss: 0.5671 - accuracy: 0.7275 - 11ms/epoch - 4ms/step
## Epoch 15/15
## 3/3 - 0s - loss: 0.5409 - accuracy: 0.7407 - 12ms/epoch - 4ms/step
summary(model_nn_down_pca85)
## Model: "sequential_4"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  hidden1 (Dense)                    (None, 128)                     1920        
##  hidden2 (Dense)                    (None, 64)                      8256        
##  hidden3 (Dense)                    (None, 32)                      2080        
##  output (Dense)                     (None, 2)                       66          
## ================================================================================
## Total params: 12,322
## Trainable params: 12,322
## Non-trainable params: 0
## ________________________________________________________________________________

8.2.3 Visualization Plot 2

plot(history)

8.2.4 Evaluation

pred_nn_down_pca85_test <- predict(model_nn_down_pca85, emp_pca_test_down_x_keep85pc$coord)
## 3/3 - 0s - 52ms/epoch - 17ms/step
# head(pred_nn_down_pca85_test)

pred_nn_down_pca85_test_label <- 
pred_nn_down_pca85_test %>% 
  as.data.frame() %>% 
  `colnames<-`(0:1) %>% 
  transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))

confusionMatrix(data = pred_nn_down_pca85_test_label$Class, reference = emp_test_down_y$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  30  14
##        Yes 18  34
##                                           
##                Accuracy : 0.6667          
##                  95% CI : (0.5631, 0.7596)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : 0.0007122       
##                                           
##                   Kappa : 0.3333          
##                                           
##  Mcnemar's Test P-Value : 0.5958831       
##                                           
##             Sensitivity : 0.7083          
##             Specificity : 0.6250          
##          Pos Pred Value : 0.6538          
##          Neg Pred Value : 0.6818          
##              Prevalence : 0.5000          
##          Detection Rate : 0.3542          
##    Detection Prevalence : 0.5417          
##       Balanced Accuracy : 0.6667          
##                                           
##        'Positive' Class : Yes             
## 

8.2.5 Summary Evaluation

  1. Accuracy: 66.67%. The model correctly predicted the class (whether an employee will leave or stay) 66.67% of the time. This is lower than the other models, suggesting the down-sampled model is weaker overall.
  2. Sensitivity / Recall: 70.83%. Recall measures the proportion of actual positives (employees who will leave) that the model correctly identified. A Recall of 70.83% means the model catches most at-risk employees, though it still misses some.
  3. Pos Pred Value / Precision: 65.38%. Precision indicates the proportion of positive predictions (employees predicted to leave) that were actually correct. At 65.38%, the model is reasonably reliable when it predicts attrition, though it still produces a fair number of false positives.

Interpretation: the overall accuracy of 66.67% is the lowest of the three models, suggesting weaker general performance. The Recall of 70.83%, however, is much better than the no-sampling model, meaning this model identifies far more of the at-risk employees, and the Precision of 65.38% shows it is still reasonably accurate when it predicts an employee will leave.

While the overall accuracy is lower than the other models, Recall in particular is clearly better than the model trained without any sampling adjustment. Further improvement is still needed, so it would be worth exploring additional adjustments or different approaches to lift performance across all metrics, such as the class-weighting sketch below.
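That sketch keeps the original (imbalanced) training data and passes class weights to fit(), so errors on the minority “Yes” class are penalised more heavily than errors on “No”. Here model_nn_weighted and history_weighted are assumed names: model_nn_weighted would be a fresh copy of the earlier architecture, compiled the same way as model_nn_pca85, and the 5x weight is an illustrative guess rather than a tuned value.

history_weighted <- model_nn_weighted %>% 
  fit(emp_pca_train_x_keep85pc$ind$coord,   # PCA scores of the un-resampled training set
      emp_train_y_keras,
      epochs = 15,
      batch_size = 200,
      # weight errors on class index 1 ("Yes") five times as heavily as class 0 ("No")
      class_weight = list("0" = 1, "1" = 5))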

8.3 Model Up Sampling

8.3.1 Pre Processing

emp_pca_train_up_x_keep85pc <- PCA(X = emp_train_up_x,
    scale.unit = T,
    quali.sup = qualivar,
    ncp = 23,
    graph = F)

emp_pca_train_up_x_keep85pc$eig
##         eigenvalue percentage of variance cumulative percentage of variance
## comp 1  5.01937909             21.8233873                          21.82339
## comp 2  1.81644091              7.8975692                          29.72096
## comp 3  1.61782680              7.0340296                          36.75499
## comp 4  1.22559056              5.3286546                          42.08364
## comp 5  1.21230424              5.2708880                          47.35453
## comp 6  1.12974937              4.9119538                          52.26648
## comp 7  1.06716596              4.6398520                          56.90633
## comp 8  1.04888044              4.5603497                          61.46668
## comp 9  1.02345306              4.4497959                          65.91648
## comp 10 0.97769542              4.2508497                          70.16733
## comp 11 0.94300201              4.1000087                          74.26734
## comp 12 0.89837621              3.9059835                          78.17332
## comp 13 0.85910807              3.7352525                          81.90857
## comp 14 0.85345778              3.7106860                          85.61926
## comp 15 0.78767824              3.4246880                          89.04395
## comp 16 0.72781499              3.1644130                          92.20836
## comp 17 0.49085503              2.1341523                          94.34251
## comp 18 0.43635827              1.8972099                          96.23972
## comp 19 0.25132105              1.0927002                          97.33242
## comp 20 0.22576300              0.9815783                          98.31400
## comp 21 0.21350278              0.9282729                          99.24228
## comp 22 0.12536278              0.5450556                          99.78733
## comp 23 0.04891395              0.2126694                         100.00000
emp_pca_train_up_x_keep85pc <- PCA(X = emp_train_up_x,
    scale.unit = T,
    quali.sup = qualivar,
    ncp = 14,
    graph = F)
emp_pca_test_up_x_keep85pc <- predict(emp_pca_train_up_x_keep85pc, emp_test_up_x)
glimpse(emp_pca_test_up_x_keep85pc)
## List of 3
##  $ coord: num [1:494, 1:14] 0.847 -1.19 1.461 0.917 2.476 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ cos2 : num [1:494, 1:14] 0.0283 0.0373 0.1134 0.0808 0.1862 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
##  $ dist : num [1:494] 5.04 6.16 4.34 3.22 5.74 ...

8.3.2 Model Fitting

RNGkind(sample.kind = "Rounding")
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(314)
initializer <- initializer_random_normal(seed = 314)

model_nn_up_pca85 <- keras_model_sequential()

model_nn_up_pca85 %>% 
  layer_dense(input_shape = 14, 
              units =  128,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden1") %>% 
  layer_dense(units = 64,
              activation =  "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name =  "hidden2") %>% 
  layer_dense(units = 32, 
              activation = "tanh",
              bias_initializer = initializer,
              kernel_initializer = initializer,
            name = "hidden3") %>% 
  layer_dense(units =  2,
              activation = "sigmoid",
              bias_initializer = initializer,
              kernel_initializer = initializer,
              name = "output") 

model_nn_up_pca85 %>%
  compile(
    loss =  "binary_crossentropy",
    optimizer = optimizer_adam(learning_rate = 0.01),
    metrics =  "accuracy"
  )

history <- model_nn_up_pca85 %>% 
           fit(emp_pca_train_up_x_keep85pc$ind$coord,
               emp_train_up_y_keras,
               epochs = 25,
               batch_size = 150)
## Epoch 1/25
## 14/14 - 1s - loss: 0.6334 - accuracy: 0.6506 - 643ms/epoch - 46ms/step
## Epoch 2/25
## 14/14 - 0s - loss: 0.6026 - accuracy: 0.6942 - 55ms/epoch - 4ms/step
## Epoch 3/25
## 14/14 - 0s - loss: 0.5974 - accuracy: 0.7008 - 48ms/epoch - 3ms/step
## Epoch 4/25
## 14/14 - 0s - loss: 0.5914 - accuracy: 0.7049 - 54ms/epoch - 4ms/step
## Epoch 5/25
## 14/14 - 0s - loss: 0.5826 - accuracy: 0.7099 - 49ms/epoch - 4ms/step
## Epoch 6/25
## 14/14 - 0s - loss: 0.5707 - accuracy: 0.7074 - 55ms/epoch - 4ms/step
## Epoch 7/25
## 14/14 - 0s - loss: 0.5595 - accuracy: 0.7221 - 44ms/epoch - 3ms/step
## Epoch 8/25
## 14/14 - 0s - loss: 0.5480 - accuracy: 0.7394 - 56ms/epoch - 4ms/step
## Epoch 9/25
## 14/14 - 0s - loss: 0.5373 - accuracy: 0.7373 - 61ms/epoch - 4ms/step
## Epoch 10/25
## 14/14 - 0s - loss: 0.5284 - accuracy: 0.7323 - 51ms/epoch - 4ms/step
## Epoch 11/25
## 14/14 - 0s - loss: 0.5121 - accuracy: 0.7601 - 40ms/epoch - 3ms/step
## Epoch 12/25
## 14/14 - 0s - loss: 0.4901 - accuracy: 0.7652 - 39ms/epoch - 3ms/step
## Epoch 13/25
## 14/14 - 0s - loss: 0.4547 - accuracy: 0.7845 - 39ms/epoch - 3ms/step
## Epoch 14/25
## 14/14 - 0s - loss: 0.4036 - accuracy: 0.8245 - 44ms/epoch - 3ms/step
## Epoch 15/25
## 14/14 - 0s - loss: 0.3493 - accuracy: 0.8463 - 42ms/epoch - 3ms/step
## Epoch 16/25
## 14/14 - 0s - loss: 0.3248 - accuracy: 0.8661 - 49ms/epoch - 4ms/step
## Epoch 17/25
## 14/14 - 0s - loss: 0.2726 - accuracy: 0.8879 - 44ms/epoch - 3ms/step
## Epoch 18/25
## 14/14 - 0s - loss: 0.2329 - accuracy: 0.9062 - 42ms/epoch - 3ms/step
## Epoch 19/25
## 14/14 - 0s - loss: 0.2024 - accuracy: 0.9244 - 52ms/epoch - 4ms/step
## Epoch 20/25
## 14/14 - 0s - loss: 0.1557 - accuracy: 0.9432 - 57ms/epoch - 4ms/step
## Epoch 21/25
## 14/14 - 0s - loss: 0.1211 - accuracy: 0.9635 - 42ms/epoch - 3ms/step
## Epoch 22/25
## 14/14 - 0s - loss: 0.1044 - accuracy: 0.9655 - 41ms/epoch - 3ms/step
## Epoch 23/25
## 14/14 - 0s - loss: 0.0814 - accuracy: 0.9772 - 45ms/epoch - 3ms/step
## Epoch 24/25
## 14/14 - 0s - loss: 0.0621 - accuracy: 0.9833 - 46ms/epoch - 3ms/step
## Epoch 25/25
## 14/14 - 0s - loss: 0.0466 - accuracy: 0.9848 - 47ms/epoch - 3ms/step
summary(model_nn_up_pca85)
## Model: "sequential_5"
## ________________________________________________________________________________
##  Layer (type)                       Output Shape                    Param #     
## ================================================================================
##  hidden1 (Dense)                    (None, 128)                     1920        
##  hidden2 (Dense)                    (None, 64)                      8256        
##  hidden3 (Dense)                    (None, 32)                      2080        
##  output (Dense)                     (None, 2)                       66          
## ================================================================================
## Total params: 12,322
## Trainable params: 12,322
## Non-trainable params: 0
## ________________________________________________________________________________

8.3.3 Visualization Plot 3

plot(history)

8.3.4 Evaluation

pred_nn_up_pca85_test <- predict(model_nn_up_pca85, emp_pca_test_up_x_keep85pc$coord)
## 16/16 - 0s - 84ms/epoch - 5ms/step
pred_nn_up_pca85_test_label <- 
pred_nn_up_pca85_test %>% 
  as.data.frame() %>% 
  `colnames<-`(0:1) %>% 
  transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))

confusionMatrix(data = pred_nn_up_pca85_test_label$Class, reference = emp_test_up_y$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  213   7
##        Yes  34 240
##                                           
##                Accuracy : 0.917           
##                  95% CI : (0.8891, 0.9398)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.834           
##                                           
##  Mcnemar's Test P-Value : 4.896e-05       
##                                           
##             Sensitivity : 0.9717          
##             Specificity : 0.8623          
##          Pos Pred Value : 0.8759          
##          Neg Pred Value : 0.9682          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4858          
##    Detection Prevalence : 0.5547          
##       Balanced Accuracy : 0.9170          
##                                           
##        'Positive' Class : Yes             
## 

8.3.5 Summary Evaluation

Based on the confusion matrix above, the model’s performance on the test set is as follows:

  1. Accuracy: 91.7%. The model correctly predicted the class (whether an employee will leave or stay) 91.7% of the time, the highest of the three models.
  2. Sensitivity / Recall: 97.17%. Recall measures the proportion of actual positives (employees who will leave) that the model correctly identified. A Recall of 97.17% means the model catches nearly all at-risk employees in this test set.
  3. Pos Pred Value / Precision: 87.59%. Precision indicates the proportion of positive predictions (employees predicted to leave) that were actually correct. At 87.59%, the model is reliable when it predicts attrition, with a relatively low rate of false positives.

Interpretation: the Accuracy of 91.7% reflects strong overall performance on the balanced (up-sampled) test set. The Recall of 97.17% shows the model identifies almost all at-risk employees, and the Precision of 87.59% shows that when it predicts attrition, it is usually correct.

This model achieves the highest values across Accuracy, Recall, and Precision, making it the strongest of the models evaluated. Training accuracy climbed to roughly 98% by the final epoch, which hints at some overfitting, but performance on the test data remains strong. Overall it is a robust choice for predicting employee attrition, balancing high Recall and Precision with high Accuracy; the sketch below shows how a validation split could be used to monitor that overfitting during training.
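A minimal sketch, assuming the same up-sampled training matrices as above; in practice it would replace the original fit() call on a freshly compiled model, and because the training data is up-sampled, duplicated rows can leak into the validation slice, so this is only a rough check:

history_val <- model_nn_up_pca85 %>% 
  fit(emp_pca_train_up_x_keep85pc$ind$coord,
      emp_train_up_y_keras,
      epochs = 25,
      batch_size = 150,
      validation_split = 0.2)  # hold out the last 20% of rows for validation each epoch

plot(history_val)  # training vs validation curves on the same plot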

8.4 Model Building Section B Summary

Model No Down No Up

Accuracy: 86.1%, Recall: 29.17%, Precision: 66.67%

Analysis: the accuracy is high, which suggests the model fits the data well, but high accuracy can be misleading on an imbalanced dataset. Recall is low at 29.17%, so the model misses most employees who are at risk of leaving, while Precision is moderate at 66.67%, meaning its attrition predictions are only somewhat reliable. Conclusion: despite high accuracy, the low Recall makes this model poorly suited to identifying at-risk employees in practice; further tuning or a different approach would be necessary.

Model Down Sampling

Accuracy: 66.67%, Recall: 70.83%, Precision: 65.38%

Analysis: the accuracy is the lowest of the three models, suggesting weaker overall predictive power. Recall of 70.83% means the model identifies most at-risk employees, and Precision of 65.38% shows its attrition predictions are reasonably accurate but still include some false positives. Conclusion: this model trades accuracy for much better Recall than the no-sampling model in Section 8.1; it is better at identifying at-risk employees, but its overall performance still has room for improvement.

Model Up Sampling

Accuracy: 91.7%, Recall: 97.17%, Precision: 87.59%

Analysis: accuracy is the strongest of the three models at 91.7%. Recall of 97.17% shows the model is very effective at identifying at-risk employees, and Precision of 87.59% means its predictions are reliable with relatively few false positives. Conclusion: this model performs well across all metrics, with balanced, high values for accuracy, Recall, and Precision, and it is the best-performing model of the three.

Conclusion

Model No Down No Up: high accuracy but low Recall and only moderate Precision make this model less suitable for practical use; further tuning is needed. Model Down Sampling: lower accuracy but markedly better Recall, though it still requires improvement overall. Model Up Sampling: consistently high accuracy, Recall, and Precision, making it the most robust and effective of the three models.

10. Conclusion & Business Recommendation

Best Model: Section B - Model Up Sampling

Accuracy: 91.7%, Sensitivity / Recall: 97.17%, Precision: 87.59%

Interpretation:

The accuracy of 91.7% reflects strong overall performance. The high Recall of 97.17% indicates that the model identifies almost all at-risk employees, while the Precision of 87.59% shows it is reliable when predicting attrition, with a relatively low rate of false positives. Despite signs of mild overfitting during training, this model provides a balanced and robust approach to predicting employee attrition and is the best among the evaluated models.

Overall Conclusion

The up-sampling model from Section 8.3 is the best model among those evaluated. It demonstrates strong performance across Accuracy, Recall, and Precision, making it the most effective for predicting employee attrition. It strikes a good balance, providing reliable predictions while keeping both false positives and false negatives comparatively low.

Business Question

  1. What is the breakdown of distance from home by job role and attrition? Group the data by job role and attrition status and summarise the distribution of distance from home (for example, the average or median per group), then present it as a summary table or visualization. This shows whether certain job roles involve longer commutes and whether commuting distance is associated with attrition.

  2. How does the average monthly income compare across different education levels and attrition? Calculate the average monthly income for each education level, split by employees who left and those who stayed, and compare the groups with summary statistics or a visualization such as a grouped bar chart. This shows whether income differs across education levels and whether it is related to attrition (a dplyr/ggplot2 sketch covering both questions follows below).
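A minimal dplyr/ggplot2 sketch of both summaries, using the employee data frame loaded earlier and the column names shown in the glimpse output:

# 1. average distance from home by job role and attrition
employee %>% 
  group_by(JobRole, Attrition) %>% 
  summarise(avg_distance = mean(DistanceFromHome), .groups = "drop") %>% 
  arrange(JobRole, Attrition)

# 2. average monthly income by education level and attrition
employee %>% 
  group_by(Education, Attrition) %>% 
  summarise(avg_income = mean(MonthlyIncome), .groups = "drop") %>% 
  ggplot(aes(x = factor(Education), y = avg_income, fill = Attrition)) +
  geom_col(position = "dodge") +
  labs(x = "Education level", y = "Average monthly income")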

11. Dataset