Background
The dataset comes from Kaggle: IBM HR Analytics Employee Attrition & Performance
I was intrigued by the idea that Unsupervised Learning, through Principal Component Analysis, can reduce the dimensionality of a dataset, and that the reduced dataset can then be fed into Supervised Learning. Some references also suggest that, when tuned well, a Neural Network with fewer predictors/variables can be more accurate. That is why I try to combine both concepts, Unsupervised Learning and Neural Networks, in this report.
Calling Libraries
set.seed(314)
library(tidyverse)
library(factoextra) #for fviz_contrib
library(FactoMineR) #for PCA that is more easy to visualize
library(ggiraphExtra) #for ggRadar
library(rsample) #for easy sampling / cross-validation
library(caret) #for confusionMatrix
library(keras) #for neural network
Reading the Dataset
employee <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv", stringsAsFactors = T)
head(employee)
#> ï..Age Attrition BusinessTravel DailyRate Department
#> 1 41 Yes Travel_Rarely 1102 Sales
#> 2 49 No Travel_Frequently 279 Research & Development
#> 3 37 Yes Travel_Rarely 1373 Research & Development
#> 4 33 No Travel_Frequently 1392 Research & Development
#> 5 27 No Travel_Rarely 591 Research & Development
#> 6 32 No Travel_Frequently 1005 Research & Development
#> DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
#> 1 1 2 Life Sciences 1 1
#> 2 8 1 Life Sciences 1 2
#> 3 2 2 Other 1 4
#> 4 3 4 Life Sciences 1 5
#> 5 2 1 Medical 1 7
#> 6 2 2 Life Sciences 1 8
#> EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
#> 1 2 Female 94 3 2
#> 2 3 Male 61 2 2
#> 3 4 Male 92 2 1
#> 4 4 Female 56 3 1
#> 5 1 Male 40 3 1
#> 6 4 Male 79 3 1
#> JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
#> 1 Sales Executive 4 Single 5993 19479
#> 2 Research Scientist 2 Married 5130 24907
#> 3 Laboratory Technician 3 Single 2090 2396
#> 4 Research Scientist 3 Married 2909 23159
#> 5 Laboratory Technician 2 Married 3468 16632
#> 6 Laboratory Technician 4 Single 3068 11864
#> NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
#> 1 8 Y Yes 11 3
#> 2 1 Y No 23 4
#> 3 6 Y Yes 15 3
#> 4 1 Y Yes 11 3
#> 5 9 Y No 12 3
#> 6 0 Y No 13 3
#> RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
#> 1 1 80 0 8
#> 2 4 80 1 10
#> 3 2 80 0 7
#> 4 3 80 0 8
#> 5 4 80 1 6
#> 6 3 80 0 8
#> TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
#> 1 0 1 6 4
#> 2 3 3 10 7
#> 3 3 3 0 0
#> 4 3 3 8 7
#> 5 3 3 2 2
#> 6 2 2 7 7
#> YearsSinceLastPromotion YearsWithCurrManager
#> 1 0 5
#> 2 1 7
#> 3 0 0
#> 4 3 0
#> 5 2 2
#> 6 3 6
Data Wrangling
glimpse(employee)
#> Rows: 1,470
#> Columns: 35
#> $ ï..Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2~
#> $ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,~
#> $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rare~
#> $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,~
#> $ Department <fct> Sales, Research & Development, Research & Dev~
#> $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, ~
#> $ Education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, ~
#> $ EducationField <fct> Life Sciences, Life Sciences, Other, Life Sci~
#> $ EmployeeCount <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ EmployeeNumber <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,~
#> $ EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, ~
#> $ Gender <fct> Female, Male, Male, Female, Male, Male, Femal~
#> $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4~
#> $ JobInvolvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, ~
#> $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, ~
#> $ JobRole <fct> Sales Executive, Research Scientist, Laborato~
#> $ JobSatisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, ~
#> $ MaritalStatus <fct> Single, Married, Single, Married, Married, Si~
#> $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269~
#> $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964~
#> $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, ~
#> $ Over18 <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ~
#> $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
#> $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1~
#> $ PerformanceRating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, ~
#> $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, ~
#> $ StandardHours <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8~
#> $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, ~
#> $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3~
#> $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, ~
#> $ WorkLifeBalance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, ~
#> $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,~
#> $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, ~
#> $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, ~
#> $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, ~
Age is renamed to strip the byte-order-mark artifact (ï..Age) from the CSV header, and four uninformative columns are dropped: EmployeeCount and StandardHours are constant, Over18 has only one level, and EmployeeNumber is just an identifier.
employee_clean <- employee %>%
rename(Age = "ï..Age") %>%
select(-c("EmployeeCount", "Over18", "StandardHours", "EmployeeNumber"))
unique(employee$EmployeeCount)
#> [1] 1
unique(employee$JobInvolvement)
#> [1] 3 2 4 1
unique(employee$JobLevel)
#> [1] 2 1 3 4 5
unique(employee$Over18)
#> [1] Y
#> Levels: Y
summary(employee)
#> ï..Age Attrition BusinessTravel DailyRate
#> Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0
#> 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0
#> Median :36.00 Travel_Rarely :1043 Median : 802.0
#> Mean :36.92 Mean : 802.5
#> 3rd Qu.:43.00 3rd Qu.:1157.0
#> Max. :60.00 Max. :1499.0
#>
#> Department DistanceFromHome Education
#> Human Resources : 63 Min. : 1.000 Min. :1.000
#> Research & Development:961 1st Qu.: 2.000 1st Qu.:2.000
#> Sales :446 Median : 7.000 Median :3.000
#> Mean : 9.193 Mean :2.913
#> 3rd Qu.:14.000 3rd Qu.:4.000
#> Max. :29.000 Max. :5.000
#>
#> EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
#> Human Resources : 27 Min. :1 Min. : 1.0 Min. :1.000
#> Life Sciences :606 1st Qu.:1 1st Qu.: 491.2 1st Qu.:2.000
#> Marketing :159 Median :1 Median :1020.5 Median :3.000
#> Medical :464 Mean :1 Mean :1024.9 Mean :2.722
#> Other : 82 3rd Qu.:1 3rd Qu.:1555.8 3rd Qu.:4.000
#> Technical Degree:132 Max. :1 Max. :2068.0 Max. :4.000
#>
#> Gender HourlyRate JobInvolvement JobLevel
#> Female:588 Min. : 30.00 Min. :1.00 Min. :1.000
#> Male :882 1st Qu.: 48.00 1st Qu.:2.00 1st Qu.:1.000
#> Median : 66.00 Median :3.00 Median :2.000
#> Mean : 65.89 Mean :2.73 Mean :2.064
#> 3rd Qu.: 83.75 3rd Qu.:3.00 3rd Qu.:3.000
#> Max. :100.00 Max. :4.00 Max. :5.000
#>
#> JobRole JobSatisfaction MaritalStatus MonthlyIncome
#> Sales Executive :326 Min. :1.000 Divorced:327 Min. : 1009
#> Research Scientist :292 1st Qu.:2.000 Married :673 1st Qu.: 2911
#> Laboratory Technician :259 Median :3.000 Single :470 Median : 4919
#> Manufacturing Director :145 Mean :2.729 Mean : 6503
#> Healthcare Representative:131 3rd Qu.:4.000 3rd Qu.: 8379
#> Manager :102 Max. :4.000 Max. :19999
#> (Other) :215
#> MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
#> Min. : 2094 Min. :0.000 Y:1470 No :1054 Min. :11.00
#> 1st Qu.: 8047 1st Qu.:1.000 Yes: 416 1st Qu.:12.00
#> Median :14236 Median :2.000 Median :14.00
#> Mean :14313 Mean :2.693 Mean :15.21
#> 3rd Qu.:20462 3rd Qu.:4.000 3rd Qu.:18.00
#> Max. :26999 Max. :9.000 Max. :25.00
#>
#> PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
#> Min. :3.000 Min. :1.000 Min. :80 Min. :0.0000
#> 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000
#> Median :3.000 Median :3.000 Median :80 Median :1.0000
#> Mean :3.154 Mean :2.712 Mean :80 Mean :0.7939
#> 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000
#> Max. :4.000 Max. :4.000 Max. :80 Max. :3.0000
#>
#> TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
#> Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000
#> 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000
#> Median :10.00 Median :3.000 Median :3.000 Median : 5.000
#> Mean :11.28 Mean :2.799 Mean :2.761 Mean : 7.008
#> 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000
#> Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000
#>
#> YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
#> Min. : 0.000 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
#> Median : 3.000 Median : 1.000 Median : 3.000
#> Mean : 4.229 Mean : 2.188 Mean : 4.123
#> 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
#> Max. :18.000 Max. :15.000 Max. :17.000
#>
colSums(is.na(employee_clean))
#> Age Attrition BusinessTravel
#> 0 0 0
#> DailyRate Department DistanceFromHome
#> 0 0 0
#> Education EducationField EnvironmentSatisfaction
#> 0 0 0
#> Gender HourlyRate JobInvolvement
#> 0 0 0
#> JobLevel JobRole JobSatisfaction
#> 0 0 0
#> MaritalStatus MonthlyIncome MonthlyRate
#> 0 0 0
#> NumCompaniesWorked OverTime PercentSalaryHike
#> 0 0 0
#> PerformanceRating RelationshipSatisfaction StockOptionLevel
#> 0 0 0
#> TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
#> 0 0 0
#> YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
#> 0 0 0
#> YearsWithCurrManager
#> 0
Simple Principal Component Analysis
Principal Component Analysis (PCA) is an Unsupervised Machine Learning technique whose main goal is to transform all numerical predictors into new numerical predictors that summarize as much of the data's variance as possible. The number of numerical predictors before and after the transformation is the same.
The first new predictor, called PC1 or Principal Component 1, captures the largest share of the variance that a single component can summarize. The second, PC2, captures the second largest share, and so on.
Some of the higher-numbered PCs can therefore be "sacrificed", since most of the variance is already captured by the lower-numbered PCs. This trimming step after the initial PCA is usually called dimensionality reduction, and it can improve the performance of a Machine Learning model because there is less data to process.
employee_num <- employee_clean %>%
select_if(is.numeric)
employee_num_scaled <- scale(employee_num)
pca <- prcomp(x = employee_num, scale. = T)
plot(pca)
This simple plot shows the variance of the first PCs. The variance records how much information is stored within each PC. As the concept dictates, the first PC (usually simply called PC1) contains the most information, PC2 the second most, and so on.
biplot(x = prcomp(employee_num %>% scale() %>% head(100)), cex=1.4)
The biplot above shows how the original numerical predictors correlate with each other, how they affect each observation, and how the first 100 observations are spread out.
Some insights:
- PerformanceRating and PercentSalaryHike behave similarly and are highly correlated, judging from the similar angle and length of their vectors. Such similarity is also a dimensionality-reduction consideration: one of two "similar" predictors can be removed, since the other already represents it well enough.
- The observation with row index 91 could be considered an outlier, since its characteristics place it far from the centre of the other data points.
fviz_contrib(
X = prcomp(employee_num %>% scale() %>% head(100)),
choice = "var",
axes = 1
)
This, for example, is how PC1 is constituted from the original numerical predictors. The red line marks the average contribution, 100% divided by the number of predictors, so predictors crossing this line can be said to contribute a relatively large amount of information to the Principal Component.
As we can see, YearsAtCompany contributes the most information to PC1, followed by TotalWorkingYears, JobLevel, and so on.
The first 8 predictors, up to Age, contribute the most information, being the only ones passing the red average line. The remaining predictors are not well explained by this PC and will most likely be better represented by other PCs.
FactoMineR Principal Component Analysis
After the simple PCA with prcomp() for EDA, we will use the PCA() function from the FactoMineR library to do a proper PCA, using the process as part of pre-processing the data before feeding it to the Neural Network.
qualivar <- c(2,3,5,8,10,14,16,20)
quantivar <- c(1:31)[!c(1:31) %in% qualivar] #the 23 quantitative columns of employee_clean
employee_pca <- PCA(X = employee_clean,
scale.unit = T,
quali.sup = qualivar,
ncp = 23,
graph = F)
head(employee_pca$ind$coord)
#> Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
#> 1 -0.3402021 -1.6425344 0.1100639 0.79035276 1.7185598 -1.4744878 0.3235198
#> 2 0.6094791 2.5534278 1.9127937 -1.65050159 -1.3792339 -0.3235685 0.4902771
#> 3 -2.4205040 -1.1315811 0.4480900 -0.01902049 0.6779377 1.5413090 0.1483998
#> 4 -0.8671335 -0.2839553 -1.3820657 -0.55764281 0.3905435 0.3867866 1.4611601
#> 5 -1.8951143 -1.1223375 -0.3939562 -0.62741412 -0.8146266 0.9042650 0.4460685
#> 6 -0.6072899 0.9053973 -1.7447346 -0.07965912 1.1054204 0.1542838 1.4889031
#> Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13
#> 1 -1.3971587 0.5591311 -2.71656720 0.59075634 -1.48430683 1.41815322
#> 2 0.8520032 1.0811053 0.07483455 -0.41473204 0.53055444 1.31146524
#> 3 0.4840149 -0.2652283 -1.67933257 -1.25575534 -0.01562162 -0.33593253
#> 4 0.4593693 -0.4916710 0.21216668 1.36703711 -0.10674537 -0.03908637
#> 5 -0.5806523 0.5630754 0.02822246 -0.49757189 0.10959459 0.67359554
#> 6 -0.5516226 0.2139034 -0.96440791 0.07958235 -0.45396460 0.90433096
#> Dim.14 Dim.15 Dim.16 Dim.17 Dim.18 Dim.19
#> 1 0.5790848 0.002514608 1.79267740 -0.09268832 0.83260007 0.05851981
#> 2 -0.8091512 -0.478065807 -0.01387224 -1.90034148 0.07800892 -0.14169801
#> 3 1.2407232 1.341190640 1.39389024 -0.14769863 -0.27731617 0.12351658
#> 4 1.8601876 -0.277377803 -0.54986108 0.06324740 -0.53771863 -1.17503560
#> 5 -1.4635249 -2.509889243 2.20042483 0.78034536 0.45339256 0.06691025
#> 6 0.4118312 1.270759320 0.12120559 -0.37673990 -0.06362240 -0.26571979
#> Dim.20 Dim.21 Dim.22 Dim.23
#> 1 -0.32617741 0.59438195 0.626231307 -0.06270521
#> 2 0.09368915 0.55826032 0.404472546 -0.12222444
#> 3 0.21470234 -0.14999009 -0.004433548 -0.03718327
#> 4 -0.69266787 -0.48663837 0.100428299 0.12997028
#> 5 -0.29395462 -0.08193348 -0.023614068 0.18864176
#> 6 -0.06111410 0.22260941 -0.386322515 0.16894805
The PCA is done; above are the coordinates of the observations on the 23 new PCs, which replace the 23 original numerical predictors.
plot.PCA(x = employee_pca,
choix = "ind",
select = "contrib10",
invisible = "quali")
Plotted on the 2 most informative PCs, this is how our data is scattered, with the 10 observations farthest from the (0,0) origin highlighted. Since they are not actually very far from the other observations, these data points do not need to be treated as outliers and excluded.
plot.PCA(x = employee_pca, choix = "var", label = "all")
Shown above is a PCA graph of how each original numerical predictor contributes to the first 2 PCs.
Variable Contribution to Each PC
dim <- dimdesc(employee_pca)
as.data.frame(dim$Dim.1$quanti)
#> correlation p.value
#> TotalWorkingYears 0.8705583 0.000000e+00
#> YearsAtCompany 0.8447751 0.000000e+00
#> JobLevel 0.8260250 0.000000e+00
#> MonthlyIncome 0.8085802 0.000000e+00
#> YearsInCurrentRole 0.7248619 7.342664e-240
#> YearsWithCurrManager 0.7128950 1.680595e-228
#> YearsSinceLastPromotion 0.6417415 2.111727e-171
#> Age 0.5977268 4.273467e-143
#> Education 0.1681454 8.724442e-11
#> NumCompaniesWorked 0.1046384 5.831645e-05
as.data.frame(dim$Dim.2$quanti)
#> correlation p.value
#> PerformanceRating 0.60006294 1.730861e-144
#> PercentSalaryHike 0.57470626 5.763050e-130
#> YearsWithCurrManager 0.38237187 2.221494e-52
#> YearsInCurrentRole 0.38232101 2.297324e-52
#> YearsAtCompany 0.29393322 1.104823e-30
#> YearsSinceLastPromotion 0.27600778 4.121838e-27
#> DistanceFromHome 0.07691673 3.168538e-03
#> TrainingTimesLastYear 0.05855275 2.477052e-02
#> HourlyRate -0.06037865 2.060771e-02
#> JobInvolvement -0.06514718 1.247870e-02
#> MonthlyRate -0.08475562 1.143265e-03
#> RelationshipSatisfaction -0.11964635 4.226111e-06
#> Education -0.19869805 1.488675e-14
#> TotalWorkingYears -0.23083766 3.123827e-19
#> JobLevel -0.25909443 5.620386e-24
#> MonthlyIncome -0.27145517 3.027261e-26
#> Age -0.38400456 7.538725e-53
#> NumCompaniesWorked -0.51407356 5.988350e-100
From dimdesc() above, we can see the correlation values, i.e. how strongly each original predictor is tied to each Principal Component and thus how much of its information the component carries. As an example, we look at the significant variable contributors of PC1 and PC2.
PC Contribution to Variance / Information
employee_pca$eig
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 4.65568301 20.2421000 20.24210
#> comp 2 1.83608877 7.9829947 28.22509
#> comp 3 1.75518507 7.6312394 35.85633
#> comp 4 1.19196184 5.1824428 41.03878
#> comp 5 1.08623652 4.7227675 45.76154
#> comp 6 1.05824417 4.6010616 50.36261
#> comp 7 1.05156065 4.5720028 54.93461
#> comp 8 1.03833464 4.5144985 59.44911
#> comp 9 1.02137212 4.4407483 63.88986
#> comp 10 0.99066276 4.3072294 68.19708
#> comp 11 0.96536641 4.1972452 72.39433
#> comp 12 0.94683080 4.1166557 76.51099
#> comp 13 0.90365332 3.9289275 80.43991
#> comp 14 0.90208602 3.9221131 84.36203
#> comp 15 0.87926425 3.8228881 88.18491
#> comp 16 0.75387072 3.2776988 91.46261
#> comp 17 0.54146306 2.3541872 93.81680
#> comp 18 0.52212983 2.2701297 96.08693
#> comp 19 0.27987064 1.2168289 97.30376
#> comp 20 0.22607818 0.9829486 98.28671
#> comp 21 0.20390815 0.8865572 99.17326
#> comp 22 0.14174295 0.6162737 99.78954
#> comp 23 0.04840611 0.2104613 100.00000
Above is the cumulative percentage of variance. It tells us how many PCs must be kept to retain a given percentage of the original variance. For example, keeping at least 85% of the variance requires keeping PC1 through PC15, since that is where the cumulative percentage first passes 85%.
Neural Network
Class Imbalance: Need Downsampling/Upsampling?
Our target variable is Attrition. Let's check its class proportions:
prop.table(table(employee_clean$Attrition))
#>
#> No Yes
#> 0.8387755 0.1612245
Since the classes are quite imbalanced, at roughly an 84:16 proportion, I would personally use upsampling to rebalance them and help the modelling. It does bias the data towards the minority class by duplicating its rows, but unlike downsampling it does not throw away a lot of precious data.
For now, I will apply both upsampling and downsampling, and compare the models afterwards.
RNGkind(sample.kind = "Rounding")
set.seed(314)
employee_down <- downSample(x = employee_clean %>% select(-Attrition),
y = employee_clean$Attrition,
yname = "Attrition")
prop.table(table(employee_down$Attrition))
#>
#> No Yes
#> 0.5 0.5
employee_up <- upSample(x = employee_clean %>% select(-Attrition),
y = employee_clean$Attrition,
yname = "Attrition")
prop.table(table(employee_up$Attrition))
#>
#> No Yes
#> 0.5 0.5
Cross-Validation 80:20
I will hold out 20% of the data for testing and train on the remaining 80%, using stratified random sampling on the target variable Attrition to preserve the class balance achieved by the upsampling and downsampling above. (One caveat: because resampling was done before splitting, upsampling's duplicated minority-class rows can appear in both the training and the testing sets, which may flatter the upsampled models' test metrics.)
RNGkind(sample.kind = "Rounding")
set.seed(314)
#without downsampling
emp_index <- initial_split(employee_clean, prop = 0.8, strata = "Attrition")
emp_train <- training(emp_index)
emp_test <- testing(emp_index)
#with downsampling
emp_index <- initial_split(employee_down, prop = 0.8, strata = "Attrition")
emp_train_down <- training(emp_index)
emp_test_down <- testing(emp_index)
#with upsampling
emp_index <- initial_split(employee_up, prop = 0.8, strata = "Attrition")
emp_train_up <- training(emp_index)
emp_test_up <- testing(emp_index)
Training: All Pred, No Down, No Up
We will train our model using all predictors (both numerical predictors and categorical predictors converted to dummy variables), on the original dataset with its class imbalance.
Pre-processing
emp_train_x <-
emp_train %>%
select(-Attrition)
emp_test_x <-
emp_test %>%
select(-Attrition)
emp_train_y <-
emp_train %>%
select(Attrition)
emp_test_y <-
emp_test %>%
select(Attrition)
# glimpse(emp_train_x)
# emp_train_x[,c(2,4,7,9,13,15,19)]
emp_train_x_sc <-
as.data.frame(model.matrix(~., data = emp_train_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_test_x_sc <-
as.data.frame(model.matrix(~., data = emp_test_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_train_x_keras <-
emp_train_x_sc %>%
array_reshape(dim = dim(emp_train_x_sc))
emp_test_x_keras <-
emp_test_x_sc %>%
array_reshape(dim = dim(emp_test_x_sc))
emp_train_y_keras <-
emp_train_y %>%
mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
as.matrix() %>%
to_categorical()
# emp_test_y_keras <-
# emp_test_y %>%
# mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
# as.matrix() %>%
# to_categorical()
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn <- keras_model_sequential()
model_nn %>%
layer_dense(input_shape = 44, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "accuracy"
)
history <- model_nn %>%
fit(emp_train_x_keras,
emp_train_y_keras,
epochs = 15,
batch_size = 200)
summary(model_nn)
#> Model: "sequential"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 5760
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 16,162
#> Trainable params: 16,162
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history)
The accuracy reaches above 90%. Let's hope that this won't turn into an overfitting case.
Evaluation
pred_nn_test <- predict(model_nn, emp_test_x_keras)
# head(pred_nn_test)
pred_nn_test_label <-
pred_nn_test %>%
as.data.frame() %>%
`colnames<-`(0:1) %>% #name the two probability columns "0" and "1"
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No"))) #label each row by its higher-probability column
confusionMatrix(data = pred_nn_test_label$Class, reference = emp_test_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 234 26
#> Yes 13 22
#>
#> Accuracy : 0.8678
#> 95% CI : (0.8237, 0.9043)
#> No Information Rate : 0.8373
#> P-Value [Acc > NIR] : 0.08741
#>
#> Kappa : 0.4554
#>
#> Mcnemar's Test P-Value : 0.05466
#>
#> Sensitivity : 0.45833
#> Specificity : 0.94737
#> Pos Pred Value : 0.62857
#> Neg Pred Value : 0.90000
#> Prevalence : 0.16271
#> Detection Rate : 0.07458
#> Detection Prevalence : 0.11864
#> Balanced Accuracy : 0.70285
#>
#> 'Positive' Class : Yes
#>
Accuracy: 86.78%
Sensitivity / Recall: 45.83%
Pos Pred Value / Precision: 62.86%
Accuracy is pretty good, but this is expected with an imbalanced class: in the extreme "lazy" case where the model predicts the majority class for all testing data, accuracy would still look high. It is also not far from the training accuracy, which touches 90%, so this should be a just-fit model.
As for Recall vs Precision, the choice depends on the business question being tackled.
If HR wants to prevent employee attrition, the use case is preventive and we should focus on Recall.
If HR wants to predict attrition as precisely as possible, to optimize salaries and the like, it is the quality of the prediction that matters, and we should focus on Precision.
For now, we will compare models on all 3 metrics: Accuracy, Recall, and Precision (the sketch below shows how the latter two fall out of the confusion matrix). I would prefer every metric to reach at least 80%, and this model's Recall and Precision are well short of that.
Training: All Pred, Downsampling
We will train this model using all predictors, as above, but on the dataset that was downsampled to balance the classes of the target variable.
Pre-processing
#split predictors and target
emp_train_down_x <-
emp_train_down %>%
select(-Attrition)
emp_test_down_x <-
emp_test_down %>%
select(-Attrition)
emp_train_down_y <-
emp_train_down %>%
select(Attrition)
emp_test_down_y <-
emp_test_down %>%
select(Attrition)
# glimpse(emp_train_x)
# emp_train_x[,c(2,4,7,9,13,15,19)]
emp_train_down_x_sc <-
as.data.frame(model.matrix(~., data = emp_train_down_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_test_down_x_sc <-
as.data.frame(model.matrix(~., data = emp_test_down_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_train_down_x_keras <-
emp_train_down_x_sc %>%
array_reshape(dim = dim(emp_train_down_x_sc))
emp_test_down_x_keras <-
emp_test_down_x_sc %>%
array_reshape(dim = dim(emp_test_down_x_sc))
emp_train_down_y_keras <-
emp_train_down_y %>%
mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
as.matrix() %>%
to_categorical()
# emp_test_y_keras <-
# emp_test_y %>%
# mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
# as.matrix() %>%
# to_categorical()
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn_down <- keras_model_sequential()
model_nn_down %>%
layer_dense(input_shape = 44, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn_down %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "accuracy"
)
history_down <- model_nn_down %>%
fit(emp_train_down_x_keras,
emp_train_down_y_keras,
epochs = 20,
batch_size = 100)
summary(model_nn_down)
#> Model: "sequential_1"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 5760
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 16,162
#> Trainable params: 16,162
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history_down)
Evaluation
pred_nn_test_down <- predict(model_nn_down, emp_test_down_x_keras)
# head(pred_nn_test)
pred_nn_test_down_label <-
pred_nn_test_down %>%
as.data.frame() %>%
`colnames<-`(0:1) %>%
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))
confusionMatrix(data = pred_nn_test_down_label$Class, reference = emp_test_down_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 37 11
#> Yes 11 37
#>
#> Accuracy : 0.7708
#> 95% CI : (0.6739, 0.8505)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 4.72e-08
#>
#> Kappa : 0.5417
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.7708
#> Specificity : 0.7708
#> Pos Pred Value : 0.7708
#> Neg Pred Value : 0.7708
#> Prevalence : 0.5000
#> Detection Rate : 0.3854
#> Detection Prevalence : 0.5000
#> Balanced Accuracy : 0.7708
#>
#> 'Positive' Class : Yes
#>
Accuracy: 77.08%
Sensitivity / Recall: 77.08%
Pos Pred Value / Precision: 77.08%
Overall, not a bad model, though the accuracy on this testing data is quite far below the training accuracy of around 86.24%, so this model is somewhat overfitting.
Training: All Pred, Upsampling
We will train this model using all predictors, as above, but on the dataset that was upsampled to balance the classes of the target variable. I expect this to perform better than the imbalanced or downsampled versions, since no valuable data is removed, not even from the majority class, which here is not the positive class (Attrition "No", i.e. employees staying with the company).
Pre-processing
#split predictors and target
emp_train_up_x <-
emp_train_up %>%
select(-Attrition)
emp_test_up_x <-
emp_test_up %>%
select(-Attrition)
emp_train_up_y <-
emp_train_up %>%
select(Attrition)
emp_test_up_y <-
emp_test_up %>%
select(Attrition)
# glimpse(emp_train_x)
# emp_train_x[,c(2,4,7,9,13,15,19)]
emp_train_up_x_sc <-
as.data.frame(model.matrix(~., data = emp_train_up_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_test_up_x_sc <-
as.data.frame(model.matrix(~., data = emp_test_up_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_train_up_x_keras <-
emp_train_up_x_sc %>%
array_reshape(dim = dim(emp_train_up_x_sc))
emp_test_up_x_keras <-
emp_test_up_x_sc %>%
array_reshape(dim = dim(emp_test_up_x_sc))
emp_train_up_y_keras <-
emp_train_up_y %>%
mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
as.matrix() %>%
to_categorical()
# emp_test_y_keras <-
# emp_test_y %>%
# mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
# as.matrix() %>%
# to_categorical()
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn_up <- keras_model_sequential()
model_nn_up %>%
layer_dense(input_shape = 44, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn_up %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "accuracy"
)
history_up <- model_nn_up %>%
fit(emp_train_up_x_keras,
emp_train_up_y_keras,
epochs = 20,
batch_size = 250)
summary(model_nn_up)
#> Model: "sequential_2"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 5760
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 16,162
#> Trainable params: 16,162
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history_up)
Evaluation
pred_nn_test_up <- predict(model_nn_up, emp_test_up_x_keras)
# head(pred_nn_test)
pred_nn_test_up_label <-
pred_nn_test_up %>%
as.data.frame() %>%
`colnames<-`(0:1) %>%
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))
confusionMatrix(data = pred_nn_test_up_label$Class, reference = emp_test_up_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 199 48
#> Yes 48 199
#>
#> Accuracy : 0.8057
#> 95% CI : (0.768, 0.8397)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.6113
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.8057
#> Specificity : 0.8057
#> Pos Pred Value : 0.8057
#> Neg Pred Value : 0.8057
#> Prevalence : 0.5000
#> Detection Rate : 0.4028
#> Detection Prevalence : 0.5000
#> Balanced Accuracy : 0.8057
#>
#> 'Positive' Class : Yes
#>
Accuracy: 80.57%
Sensitivity / Recall: 80.57%
Pos Pred Value / Precision: 80.57%
Overall, not a bad model and very similar to the downsampled one; the metrics here are a bit higher and just meet the 80% target, though the model is still a bit overfitting.
Training: Keep 85% using PCA, No Down, No Up
Now we will keep only enough PCs to retain at least 85% of the variance, using the PCA concepts above, and train the Neural Network on fewer predictors. This one uses the original dataset with its imbalanced target variable.
Pre-processing
glimpse(emp_train_x)
#> Rows: 1,175
#> Columns: 30
#> $ Age <int> 49, 33, 27, 32, 30, 36, 35, 29, 31, 34, 29, 3~
#> $ BusinessTravel <fct> Travel_Frequently, Travel_Frequently, Travel_~
#> $ DailyRate <int> 279, 1392, 591, 1005, 1358, 1299, 809, 153, 6~
#> $ Department <fct> Research & Development, Research & Developmen~
#> $ DistanceFromHome <int> 8, 3, 2, 2, 24, 27, 16, 15, 26, 19, 21, 5, 11~
#> $ Education <int> 1, 4, 1, 2, 1, 3, 3, 2, 1, 2, 4, 2, 2, 4, 2, ~
#> $ EducationField <fct> Life Sciences, Life Sciences, Medical, Life S~
#> $ EnvironmentSatisfaction <int> 3, 4, 1, 4, 4, 3, 1, 4, 1, 2, 2, 1, 1, 1, 3, ~
#> $ Gender <fct> Male, Female, Male, Male, Male, Male, Male, F~
#> $ HourlyRate <int> 61, 56, 40, 79, 67, 94, 84, 49, 31, 93, 51, 8~
#> $ JobInvolvement <int> 2, 3, 3, 3, 3, 3, 4, 2, 3, 3, 4, 4, 4, 3, 3, ~
#> $ JobLevel <int> 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 3, 1, 2, 3, 1, ~
#> $ JobRole <fct> Research Scientist, Research Scientist, Labor~
#> $ JobSatisfaction <int> 2, 3, 2, 4, 3, 3, 2, 3, 3, 4, 1, 2, 3, 2, 4, ~
#> $ MaritalStatus <fct> Married, Married, Married, Single, Divorced, ~
#> $ MonthlyIncome <int> 5130, 2909, 3468, 3068, 2693, 5237, 2426, 419~
#> $ MonthlyRate <int> 24907, 23159, 16632, 11864, 13335, 16577, 164~
#> $ NumCompaniesWorked <int> 1, 1, 9, 0, 1, 6, 0, 0, 1, 0, 1, 0, 0, 0, 1, ~
#> $ OverTime <fct> No, Yes, No, No, No, No, No, Yes, No, No, No,~
#> $ PercentSalaryHike <int> 23, 11, 12, 13, 22, 13, 13, 12, 17, 11, 11, 1~
#> $ PerformanceRating <int> 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
#> $ RelationshipSatisfaction <int> 4, 3, 4, 3, 2, 2, 3, 4, 4, 3, 3, 4, 4, 3, 4, ~
#> $ StockOptionLevel <int> 1, 0, 1, 0, 1, 2, 1, 0, 1, 1, 1, 2, 1, 0, 0, ~
#> $ TotalWorkingYears <int> 10, 8, 6, 8, 1, 17, 6, 10, 5, 3, 10, 7, 5, 13~
#> $ TrainingTimesLastYear <int> 3, 3, 3, 2, 2, 3, 5, 3, 1, 2, 1, 5, 5, 4, 6, ~
#> $ WorkLifeBalance <int> 3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 3, 2, 2, 3, 3, ~
#> $ YearsAtCompany <int> 10, 8, 2, 7, 1, 7, 5, 9, 5, 2, 10, 6, 4, 12, ~
#> $ YearsInCurrentRole <int> 7, 7, 2, 7, 0, 7, 4, 5, 2, 2, 9, 2, 2, 6, 0, ~
#> $ YearsSinceLastPromotion <int> 1, 3, 2, 3, 0, 7, 0, 0, 4, 1, 8, 0, 1, 2, 0, ~
#> $ YearsWithCurrManager <int> 7, 0, 2, 6, 0, 7, 3, 8, 3, 2, 8, 5, 3, 11, 0,~
qualivar <- c(2,4,7,9,13,15,19)
quantivar <- c(1:30)[!c(1:30) %in% qualivar]
emp_pca_train_x_keep85pc <- PCA(X = emp_train_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 23,
graph = F)
emp_pca_train_x_keep85pc$eig #checking cumulative, still at 15PC for 85%
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 4.66759898 20.2939086 20.29391
#> comp 2 1.81435288 7.8884908 28.18240
#> comp 3 1.76448905 7.6716915 35.85409
#> comp 4 1.19782421 5.2079313 41.06202
#> comp 5 1.12566371 4.8941901 45.95621
#> comp 6 1.07305779 4.6654686 50.62168
#> comp 7 1.05463790 4.5853822 55.20706
#> comp 8 1.02529541 4.4578061 59.66487
#> comp 9 1.01367542 4.4072844 64.07215
#> comp 10 0.98594266 4.2867072 68.35886
#> comp 11 0.97383468 4.2340638 72.59292
#> comp 12 0.93806844 4.0785584 76.67148
#> comp 13 0.90832738 3.9492495 80.62073
#> comp 14 0.87035760 3.7841635 84.40490
#> comp 15 0.85071252 3.6987501 88.10365
#> comp 16 0.77327987 3.3620864 91.46573
#> comp 17 0.53903524 2.3436315 93.80936
#> comp 18 0.51675118 2.2467443 96.05611
#> comp 19 0.29087058 1.2646547 97.32076
#> comp 20 0.21861264 0.9504897 98.27125
#> comp 21 0.20409012 0.8873484 99.15860
#> comp 22 0.14179538 0.6165017 99.77510
#> comp 23 0.05172637 0.2248973 100.00000
We would need 15 PCs.
emp_pca_train_x_keep85pc <- PCA(X = emp_train_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 15,
graph = F)
Treating the testing data using the same PCs from training:
emp_pca_test_x_keep85pc <- predict(emp_pca_train_x_keep85pc, emp_test_x)
glimpse(emp_pca_test_x_keep85pc)
#> List of 3
#> $ coord: num [1:295, 1:15] -1.32 1.1 -3.05 4.79 -1.5 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:295] "7" "9" "18" "19" ...
#> .. ..$ : chr [1:15] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ cos2 : num [1:295, 1:15] 0.0458 0.0519 0.3552 0.6021 0.1591 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:295] "7" "9" "18" "19" ...
#> .. ..$ : chr [1:15] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ dist : Named num [1:295] 6.16 4.82 5.12 6.17 3.76 ...
#> ..- attr(*, "names")= chr [1:295] "7" "9" "18" "19" ...
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn_pca85 <- keras_model_sequential()
model_nn_pca85 %>%
layer_dense(input_shape = 15, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn_pca85 %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "accuracy"
)
history <- model_nn_pca85 %>%
fit(emp_pca_train_x_keep85pc$ind$coord,
emp_train_y_keras,
epochs = 15,
batch_size = 200)
summary(model_nn_pca85)
#> Model: "sequential_3"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 2048
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 12,450
#> Trainable params: 12,450
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history)
Evaluation
pred_nn_pca85_test <- predict(model_nn_pca85, emp_pca_test_x_keep85pc$coord)
# head(pred_nn_test)
pred_nn_pca85_test_label <-
pred_nn_pca85_test %>%
as.data.frame() %>%
`colnames<-`(0:1) %>%
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))
confusionMatrix(data = pred_nn_pca85_test_label$Class, reference = emp_test_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 242 32
#> Yes 5 16
#>
#> Accuracy : 0.8746
#> 95% CI : (0.8313, 0.9101)
#> No Information Rate : 0.8373
#> P-Value [Acc > NIR] : 0.04536
#>
#> Kappa : 0.4048
#>
#> Mcnemar's Test P-Value : 1.917e-05
#>
#> Sensitivity : 0.33333
#> Specificity : 0.97976
#> Pos Pred Value : 0.76190
#> Neg Pred Value : 0.88321
#> Prevalence : 0.16271
#> Detection Rate : 0.05424
#> Detection Prevalence : 0.07119
#> Balanced Accuracy : 0.65655
#>
#> 'Positive' Class : Yes
#>
Accuracy: 87.46%
Sensitivity / Recall: 33.33%
Pos Pred Value / Precision: 76.19%
Accuracy-wise, it is very similar to the "All Pred, No Down, No Up" model, at around 87%. Against a training accuracy of around 85%, this model is not overfitting at all and is just-fit.
Meanwhile, Recall took a big hit and is now very low. Therefore, I would not recommend using this model further.
Training: Keep 85% using PCA, Downsampling
We are still keeping only enough PCs to retain at least 85% of the variance and training the Neural Network on fewer predictors, but this time on the downsampled data, so the classes of the target variable are balanced at the size of the minority class, unfortunately removing a lot of data from the majority class.
Pre-processing
emp_pca_train_down_x_keep85pc <- PCA(X = emp_train_down_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 23,
graph = F)
emp_pca_train_down_x_keep85pc$eig #checking cumulative, 14PC is enough for 85%
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 4.8418940 21.0517130 21.05171
#> comp 2 1.9245716 8.3677027 29.41942
#> comp 3 1.6613027 7.2230550 36.64247
#> comp 4 1.3322077 5.7922073 42.43468
#> comp 5 1.2104230 5.2627089 47.69739
#> comp 6 1.1589041 5.0387134 52.73610
#> comp 7 1.1088665 4.8211587 57.55726
#> comp 8 1.1023011 4.7926135 62.34987
#> comp 9 1.0377085 4.5117762 66.86165
#> comp 10 0.9786614 4.2550495 71.11670
#> comp 11 0.9262472 4.0271616 75.14386
#> comp 12 0.8542553 3.7141537 78.85801
#> comp 13 0.8294086 3.6061242 82.46414
#> comp 14 0.7955556 3.4589373 85.92307
#> comp 15 0.7233068 3.1448121 89.06789
#> comp 16 0.6492792 2.8229531 91.89084
#> comp 17 0.4958773 2.1559884 94.04683
#> comp 18 0.4391188 1.9092120 95.95604
#> comp 19 0.2886065 1.2548110 97.21085
#> comp 20 0.2446569 1.0637258 98.27458
#> comp 21 0.2249168 0.9778991 99.25248
#> comp 22 0.1220239 0.5305387 99.78302
#> comp 23 0.0499065 0.2169848 100.00000
In this case, instead of the 15 PCs used before, 14 are enough to pass 85%.
emp_pca_train_down_x_keep85pc <- PCA(X = emp_train_down_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 14,
graph = F)
Treating the testing data using the same PCs from training:
emp_pca_test_down_x_keep85pc <- predict(emp_pca_train_down_x_keep85pc, emp_test_down_x)
glimpse(emp_pca_test_down_x_keep85pc)
#> List of 3
#> $ coord: num [1:96, 1:14] -2 -0.172 -1.022 1.766 0.981 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:96] "15" "20" "22" "29" ...
#> .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ cos2 : num [1:96, 1:14] 0.15478 0.00236 0.07268 0.12367 0.08736 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:96] "15" "20" "22" "29" ...
#> .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ dist : Named num [1:96] 5.08 3.55 3.79 5.02 3.32 ...
#> ..- attr(*, "names")= chr [1:96] "15" "20" "22" "29" ...
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn_down_pca85 <- keras_model_sequential()
model_nn_down_pca85 %>%
layer_dense(input_shape = 14, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn_down_pca85 %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.05),
metrics = "accuracy"
)
history <- model_nn_down_pca85 %>%
fit(emp_pca_train_down_x_keep85pc$ind$coord,
emp_train_down_y_keras,
epochs = 15,
batch_size = 150)
summary(model_nn_down_pca85)
#> Model: "sequential_4"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 1920
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 12,322
#> Trainable params: 12,322
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history)
Evaluation
pred_nn_down_pca85_test <- predict(model_nn_down_pca85, emp_pca_test_down_x_keep85pc$coord)
# head(pred_nn_test)
pred_nn_down_pca85_test_label <-
pred_nn_down_pca85_test %>%
as.data.frame() %>%
`colnames<-`(0:1) %>%
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))
confusionMatrix(data = pred_nn_down_pca85_test_label$Class, reference = emp_test_down_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 33 15
#> Yes 15 33
#>
#> Accuracy : 0.6875
#> 95% CI : (0.5848, 0.7782)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.0001528
#>
#> Kappa : 0.375
#>
#> Mcnemar's Test P-Value : 1.0000000
#>
#> Sensitivity : 0.6875
#> Specificity : 0.6875
#> Pos Pred Value : 0.6875
#> Neg Pred Value : 0.6875
#> Prevalence : 0.5000
#> Detection Rate : 0.3438
#> Detection Prevalence : 0.5000
#> Balanced Accuracy : 0.6875
#>
#> 'Positive' Class : Yes
#>
Accuracy: 68.75%
Sensitivity / Recall: 68.75%
Pos Pred Value / Precision: 68.75%
Overall, all the metrics took a hit and are lower than those of the equivalent model using all predictors. The Recall, however, is still better than that of the PCA model trained without downsampling or upsampling.
Training: Keep 85% using PCA, Upsampling
We are still keeping only enough PCs to retain at least 85% of the variance and training the Neural Network on fewer predictors, but this time on the upsampled data, so the classes of the target variable are balanced at the size of the majority class by duplicating random observations from the minority class.
Pre-processing
emp_pca_train_up_x_keep85pc <- PCA(X = emp_train_up_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 23,
graph = F)
emp_pca_train_up_x_keep85pc$eig
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 5.01937909 21.8233873 21.82339
#> comp 2 1.81644091 7.8975692 29.72096
#> comp 3 1.61782680 7.0340296 36.75499
#> comp 4 1.22559056 5.3286546 42.08364
#> comp 5 1.21230424 5.2708880 47.35453
#> comp 6 1.12974937 4.9119538 52.26648
#> comp 7 1.06716596 4.6398520 56.90633
#> comp 8 1.04888044 4.5603497 61.46668
#> comp 9 1.02345306 4.4497959 65.91648
#> comp 10 0.97769542 4.2508497 70.16733
#> comp 11 0.94300201 4.1000087 74.26734
#> comp 12 0.89837621 3.9059835 78.17332
#> comp 13 0.85910807 3.7352525 81.90857
#> comp 14 0.85345778 3.7106860 85.61926
#> comp 15 0.78767824 3.4246880 89.04395
#> comp 16 0.72781499 3.1644130 92.20836
#> comp 17 0.49085503 2.1341523 94.34251
#> comp 18 0.43635827 1.8972099 96.23972
#> comp 19 0.25132105 1.0927002 97.33242
#> comp 20 0.22576300 0.9815783 98.31400
#> comp 21 0.21350278 0.9282729 99.24228
#> comp 22 0.12536278 0.5450556 99.78733
#> comp 23 0.04891395 0.2126694 100.00000
Here, we only need to keep the first 14 PCs.
emp_pca_train_up_x_keep85pc <- PCA(X = emp_train_up_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 14,
graph = F)
Treating the testing data using the same PCs from training:
emp_pca_test_up_x_keep85pc <- predict(emp_pca_train_up_x_keep85pc, emp_test_up_x)
glimpse(emp_pca_test_up_x_keep85pc)
#> List of 3
#> $ coord: num [1:494, 1:14] 0.847 -1.19 1.461 0.917 2.476 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:494] "1" "5" "8" "22" ...
#> .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ cos2 : num [1:494, 1:14] 0.0283 0.0373 0.1134 0.0808 0.1862 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:494] "1" "5" "8" "22" ...
#> .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ dist : Named num [1:494] 5.04 6.16 4.34 3.22 5.74 ...
#> ..- attr(*, "names")= chr [1:494] "1" "5" "8" "22" ...
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn_up_pca85 <- keras_model_sequential()
model_nn_up_pca85 %>%
layer_dense(input_shape = 14, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn_up_pca85 %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.01),
metrics = "accuracy"
)
history <- model_nn_up_pca85 %>%
fit(emp_pca_train_up_x_keep85pc$ind$coord,
emp_train_up_y_keras,
epochs = 25,
batch_size = 150)
summary(model_nn_up_pca85)
#> Model: "sequential_5"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 1920
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 12,322
#> Trainable params: 12,322
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history)
Evaluation
pred_nn_up_pca85_test <- predict(model_nn_up_pca85, emp_pca_test_up_x_keep85pc$coord)
# head(pred_nn_test)
pred_nn_up_pca85_test_label <-
pred_nn_up_pca85_test %>%
as.data.frame() %>%
`colnames<-`(0:1) %>%
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))
confusionMatrix(data = pred_nn_up_pca85_test_label$Class, reference = emp_test_up_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 209 23
#> Yes 38 224
#>
#> Accuracy : 0.8765
#> 95% CI : (0.8442, 0.9042)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 2e-16
#>
#> Kappa : 0.753
#>
#> Mcnemar's Test P-Value : 0.07305
#>
#> Sensitivity : 0.9069
#> Specificity : 0.8462
#> Pos Pred Value : 0.8550
#> Neg Pred Value : 0.9009
#> Prevalence : 0.5000
#> Detection Rate : 0.4534
#> Detection Prevalence : 0.5304
#> Balanced Accuracy : 0.8765
#>
#> 'Positive' Class : Yes
#>
Accuracy: 87.65%
Sensitivity / Recall: 90.69%
Pos Pred Value / Precision: 85.50%
Our best model, reaching 85-90% on all metrics! Given a training accuracy that reached about 97%, the model is a bit overfitting, but it is still the strongest performer on the test set.
Conclusions
The best Neural Network model we found uses upsampling to balance the classes of the target variable, together with fewer predictors/variables obtained through Principal Component Analysis and dimensionality reduction. That model, model_nn_up_pca85, has an Accuracy of 87.65%, Recall of 90.69%, and Precision of 85.50%.
This supports my original business question: we can indeed use the PCA concept from Unsupervised Machine Learning to reduce the dimensionality of our data, potentially improving the performance of a Supervised Machine Learning model, in this case a Neural Network.
If the class imbalance leans heavily towards the majority class, upsampling may be preferable to downsampling, since the bias introduced by duplicating the minority class should be less harmful than the information lost by removing rows from the majority class.
About Me
Hi! My name is Calvin, and I am from Jakarta, Indonesia. I am looking to become a full-time data analyst and/or data scientist. I have a background in Mathematics and Computer Science from my Bachelor's degrees, and I love playing with numbers and data. I wrote this to build up my Data Science portfolio (constructive criticism is very much welcome!) and as part of a Learn-By-Building assignment at Algoritma Data Science School.
You can reach me at my LinkedIn for more discussion. Thank you!