Background
The dataset comes from Kaggle: IBM HR Analytics Employee Attrition & Performance
I was intrigued by the idea that Unsupervised Learning, through Principal Component Analysis, can reduce the dimensionality of a dataset, and that the reduced dataset can then be fed into Supervised Learning. Some references also suggest that, when tuned well, a Neural Network with fewer predictors/variables can be more accurate. That is why I try to combine both concepts, Unsupervised Learning and Neural Networks, in this report.
Calling Libraries
set.seed(314)
library(tidyverse)
library(factoextra) #for fviz_contrib
library(FactoMineR) #for PCA that is more easy to visualize
library(ggiraphExtra) #for ggRadar
library(rsample) #for easy sampling / cross-validation
library(caret) #for confusionMatrix
library(keras) #for neural network
Reading the Dataset
employee <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv", stringsAsFactors = T)
head(employee)
#> ï..Age Attrition BusinessTravel DailyRate Department
#> 1 41 Yes Travel_Rarely 1102 Sales
#> 2 49 No Travel_Frequently 279 Research & Development
#> 3 37 Yes Travel_Rarely 1373 Research & Development
#> 4 33 No Travel_Frequently 1392 Research & Development
#> 5 27 No Travel_Rarely 591 Research & Development
#> 6 32 No Travel_Frequently 1005 Research & Development
#> DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
#> 1 1 2 Life Sciences 1 1
#> 2 8 1 Life Sciences 1 2
#> 3 2 2 Other 1 4
#> 4 3 4 Life Sciences 1 5
#> 5 2 1 Medical 1 7
#> 6 2 2 Life Sciences 1 8
#> EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
#> 1 2 Female 94 3 2
#> 2 3 Male 61 2 2
#> 3 4 Male 92 2 1
#> 4 4 Female 56 3 1
#> 5 1 Male 40 3 1
#> 6 4 Male 79 3 1
#> JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
#> 1 Sales Executive 4 Single 5993 19479
#> 2 Research Scientist 2 Married 5130 24907
#> 3 Laboratory Technician 3 Single 2090 2396
#> 4 Research Scientist 3 Married 2909 23159
#> 5 Laboratory Technician 2 Married 3468 16632
#> 6 Laboratory Technician 4 Single 3068 11864
#> NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating
#> 1 8 Y Yes 11 3
#> 2 1 Y No 23 4
#> 3 6 Y Yes 15 3
#> 4 1 Y Yes 11 3
#> 5 9 Y No 12 3
#> 6 0 Y No 13 3
#> RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
#> 1 1 80 0 8
#> 2 4 80 1 10
#> 3 2 80 0 7
#> 4 3 80 0 8
#> 5 4 80 1 6
#> 6 3 80 0 8
#> TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
#> 1 0 1 6 4
#> 2 3 3 10 7
#> 3 3 3 0 0
#> 4 3 3 8 7
#> 5 3 3 2 2
#> 6 2 2 7 7
#> YearsSinceLastPromotion YearsWithCurrManager
#> 1 0 5
#> 2 1 7
#> 3 0 0
#> 4 3 0
#> 5 2 2
#> 6 3 6
Data Wrangling
glimpse(employee)
#> Rows: 1,470
#> Columns: 35
#> $ ï..Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 2~
#> $ Attrition <fct> Yes, No, Yes, No, No, No, No, No, No, No, No,~
#> $ BusinessTravel <fct> Travel_Rarely, Travel_Frequently, Travel_Rare~
#> $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358,~
#> $ Department <fct> Sales, Research & Development, Research & Dev~
#> $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, ~
#> $ Education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, ~
#> $ EducationField <fct> Life Sciences, Life Sciences, Other, Life Sci~
#> $ EmployeeCount <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
#> $ EmployeeNumber <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16,~
#> $ EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, ~
#> $ Gender <fct> Female, Male, Male, Female, Male, Male, Femal~
#> $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 4~
#> $ JobInvolvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, ~
#> $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, ~
#> $ JobRole <fct> Sales Executive, Research Scientist, Laborato~
#> $ JobSatisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, ~
#> $ MaritalStatus <fct> Single, Married, Single, Married, Married, Si~
#> $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 269~
#> $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864, 9964~
#> $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, ~
#> $ Over18 <fct> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ~
#> $ OverTime <fct> Yes, No, Yes, Yes, No, No, Yes, No, No, No, N~
#> $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 1~
#> $ PerformanceRating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, ~
#> $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, ~
#> $ StandardHours <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8~
#> $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, ~
#> $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3~
#> $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, ~
#> $ WorkLifeBalance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, ~
#> $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4,~
#> $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, ~
#> $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, ~
#> $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, ~
Age is renamed to strip the byte-order-mark artifact (ï..Age) from the CSV header, and four uninformative columns are dropped: EmployeeCount and StandardHours are constant, Over18 has only one level, and EmployeeNumber is just an identifier.
employee_clean <- employee %>%
rename(Age = "ï..Age") %>%
select(-c("EmployeeCount", "Over18", "StandardHours", "EmployeeNumber"))
unique(employee$EmployeeCount)
#> [1] 1
unique(employee$JobInvolvement)
#> [1] 3 2 4 1
unique(employee$JobLevel)
#> [1] 2 1 3 4 5
unique(employee$Over18)
#> [1] Y
#> Levels: Y
summary(employee)
#> ï..Age Attrition BusinessTravel DailyRate
#> Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0
#> 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0
#> Median :36.00 Travel_Rarely :1043 Median : 802.0
#> Mean :36.92 Mean : 802.5
#> 3rd Qu.:43.00 3rd Qu.:1157.0
#> Max. :60.00 Max. :1499.0
#>
#> Department DistanceFromHome Education
#> Human Resources : 63 Min. : 1.000 Min. :1.000
#> Research & Development:961 1st Qu.: 2.000 1st Qu.:2.000
#> Sales :446 Median : 7.000 Median :3.000
#> Mean : 9.193 Mean :2.913
#> 3rd Qu.:14.000 3rd Qu.:4.000
#> Max. :29.000 Max. :5.000
#>
#> EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction
#> Human Resources : 27 Min. :1 Min. : 1.0 Min. :1.000
#> Life Sciences :606 1st Qu.:1 1st Qu.: 491.2 1st Qu.:2.000
#> Marketing :159 Median :1 Median :1020.5 Median :3.000
#> Medical :464 Mean :1 Mean :1024.9 Mean :2.722
#> Other : 82 3rd Qu.:1 3rd Qu.:1555.8 3rd Qu.:4.000
#> Technical Degree:132 Max. :1 Max. :2068.0 Max. :4.000
#>
#> Gender HourlyRate JobInvolvement JobLevel
#> Female:588 Min. : 30.00 Min. :1.00 Min. :1.000
#> Male :882 1st Qu.: 48.00 1st Qu.:2.00 1st Qu.:1.000
#> Median : 66.00 Median :3.00 Median :2.000
#> Mean : 65.89 Mean :2.73 Mean :2.064
#> 3rd Qu.: 83.75 3rd Qu.:3.00 3rd Qu.:3.000
#> Max. :100.00 Max. :4.00 Max. :5.000
#>
#> JobRole JobSatisfaction MaritalStatus MonthlyIncome
#> Sales Executive :326 Min. :1.000 Divorced:327 Min. : 1009
#> Research Scientist :292 1st Qu.:2.000 Married :673 1st Qu.: 2911
#> Laboratory Technician :259 Median :3.000 Single :470 Median : 4919
#> Manufacturing Director :145 Mean :2.729 Mean : 6503
#> Healthcare Representative:131 3rd Qu.:4.000 3rd Qu.: 8379
#> Manager :102 Max. :4.000 Max. :19999
#> (Other) :215
#> MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
#> Min. : 2094 Min. :0.000 Y:1470 No :1054 Min. :11.00
#> 1st Qu.: 8047 1st Qu.:1.000 Yes: 416 1st Qu.:12.00
#> Median :14236 Median :2.000 Median :14.00
#> Mean :14313 Mean :2.693 Mean :15.21
#> 3rd Qu.:20462 3rd Qu.:4.000 3rd Qu.:18.00
#> Max. :26999 Max. :9.000 Max. :25.00
#>
#> PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
#> Min. :3.000 Min. :1.000 Min. :80 Min. :0.0000
#> 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000
#> Median :3.000 Median :3.000 Median :80 Median :1.0000
#> Mean :3.154 Mean :2.712 Mean :80 Mean :0.7939
#> 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000
#> Max. :4.000 Max. :4.000 Max. :80 Max. :3.0000
#>
#> TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
#> Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000
#> 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000
#> Median :10.00 Median :3.000 Median :3.000 Median : 5.000
#> Mean :11.28 Mean :2.799 Mean :2.761 Mean : 7.008
#> 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000
#> Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000
#>
#> YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
#> Min. : 0.000 Min. : 0.000 Min. : 0.000
#> 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
#> Median : 3.000 Median : 1.000 Median : 3.000
#> Mean : 4.229 Mean : 2.188 Mean : 4.123
#> 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
#> Max. :18.000 Max. :15.000 Max. :17.000
#>
colSums(is.na(employee_clean))
#> Age Attrition BusinessTravel
#> 0 0 0
#> DailyRate Department DistanceFromHome
#> 0 0 0
#> Education EducationField EnvironmentSatisfaction
#> 0 0 0
#> Gender HourlyRate JobInvolvement
#> 0 0 0
#> JobLevel JobRole JobSatisfaction
#> 0 0 0
#> MaritalStatus MonthlyIncome MonthlyRate
#> 0 0 0
#> NumCompaniesWorked OverTime PercentSalaryHike
#> 0 0 0
#> PerformanceRating RelationshipSatisfaction StockOptionLevel
#> 0 0 0
#> TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
#> 0 0 0
#> YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
#> 0 0 0
#> YearsWithCurrManager
#> 0
Simple Principal Component Analysis
Principal Component Analysis (PCA) is an Unsupervised Machine Learning technique whose main goal is to transform all numerical predictors into new numerical predictors that summarize as much of the data's variance as possible. The number of numerical predictors before and after the transformation is the same.
The first new predictor, called PC1 or Principal Component 1, captures the largest share of the variance that a single component can summarize. The second, PC2, captures the second largest share, and so on.
Some of the higher-numbered PCs can therefore be "sacrificed", since most of the variance is already captured by the lower-numbered PCs. This trimming step after the initial PCA is usually called dimensionality reduction, and it can improve the performance of a Machine Learning model because there is less data to process.
employee_num <- employee_clean %>%
select_if(is.numeric)
employee_num_scaled <- scale(employee_num)
pca <- prcomp(x = employee_num, scale. = T)
plot(pca)
This simple plot shows the variance of the first PCs. The variance records how much information is stored within each PC. As the concept dictates, the first PC (usually simply called PC1) contains the most information, PC2 the second most, and so on.
biplot(x = prcomp(employee_num %>% scale() %>% head(100)), cex=1.4)
The biplot above shows how the original numerical predictors correlate with each other, how they affect each observation, and how the first 100 observations are spread out.
Some insights:
- PerformanceRating and PercentSalaryHike behave similarly and are highly correlated, judging from the similar angle and length of their vectors. Such similarity is also a dimensionality-reduction consideration: one of two "similar" predictors can be removed, since the other already represents it well enough.
- The observation with row index 91 could be considered an outlier, since its characteristics place it far from the centre of the other data points.
fviz_contrib(
X = prcomp(employee_num %>% scale() %>% head(100)),
choice = "var",
axes = 1
)
This, for example, is how PC1 is constituted from the original numerical predictors. The red line marks the average contribution, 100% divided by the number of predictors, so predictors crossing this line can be said to contribute a relatively large amount of information to the Principal Component.
As we can see, YearsAtCompany contributes the most information to PC1, followed by TotalWorkingYears, JobLevel, and so on.
The first 8 predictors, up to Age, contribute the most information, being the only ones passing the red average line. The remaining predictors are not well explained by this PC and will most likely be better represented by other PCs.
FactoMineR Principal Component Analysis
After the simple PCA with prcomp() for EDA, we will use the PCA() function from the FactoMineR library to do a proper PCA, using the process as part of pre-processing the data before feeding it to the Neural Network.
qualivar <- c(2,3,5,8,10,14,16,20)
quantivar <- c(1:31)[!c(1:31) %in% qualivar] #the 23 quantitative columns of employee_clean
employee_pca <- PCA(X = employee_clean,
scale.unit = T,
quali.sup = qualivar,
ncp = 23,
graph = F)
head(employee_pca$ind$coord)
#> Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
#> 1 -0.3402021 -1.6425344 0.1100639 0.79035276 1.7185598 -1.4744878 0.3235198
#> 2 0.6094791 2.5534278 1.9127937 -1.65050159 -1.3792339 -0.3235685 0.4902771
#> 3 -2.4205040 -1.1315811 0.4480900 -0.01902049 0.6779377 1.5413090 0.1483998
#> 4 -0.8671335 -0.2839553 -1.3820657 -0.55764281 0.3905435 0.3867866 1.4611601
#> 5 -1.8951143 -1.1223375 -0.3939562 -0.62741412 -0.8146266 0.9042650 0.4460685
#> 6 -0.6072899 0.9053973 -1.7447346 -0.07965912 1.1054204 0.1542838 1.4889031
#> Dim.8 Dim.9 Dim.10 Dim.11 Dim.12 Dim.13
#> 1 -1.3971587 0.5591311 -2.71656720 0.59075634 -1.48430683 1.41815322
#> 2 0.8520032 1.0811053 0.07483455 -0.41473204 0.53055444 1.31146524
#> 3 0.4840149 -0.2652283 -1.67933257 -1.25575534 -0.01562162 -0.33593253
#> 4 0.4593693 -0.4916710 0.21216668 1.36703711 -0.10674537 -0.03908637
#> 5 -0.5806523 0.5630754 0.02822246 -0.49757189 0.10959459 0.67359554
#> 6 -0.5516226 0.2139034 -0.96440791 0.07958235 -0.45396460 0.90433096
#> Dim.14 Dim.15 Dim.16 Dim.17 Dim.18 Dim.19
#> 1 0.5790848 0.002514608 1.79267740 -0.09268832 0.83260007 0.05851981
#> 2 -0.8091512 -0.478065807 -0.01387224 -1.90034148 0.07800892 -0.14169801
#> 3 1.2407232 1.341190640 1.39389024 -0.14769863 -0.27731617 0.12351658
#> 4 1.8601876 -0.277377803 -0.54986108 0.06324740 -0.53771863 -1.17503560
#> 5 -1.4635249 -2.509889243 2.20042483 0.78034536 0.45339256 0.06691025
#> 6 0.4118312 1.270759320 0.12120559 -0.37673990 -0.06362240 -0.26571979
#> Dim.20 Dim.21 Dim.22 Dim.23
#> 1 -0.32617741 0.59438195 0.626231307 -0.06270521
#> 2 0.09368915 0.55826032 0.404472546 -0.12222444
#> 3 0.21470234 -0.14999009 -0.004433548 -0.03718327
#> 4 -0.69266787 -0.48663837 0.100428299 0.12997028
#> 5 -0.29395462 -0.08193348 -0.023614068 0.18864176
#> 6 -0.06111410 0.22260941 -0.386322515 0.16894805
The PCA is done; above are the coordinates of the observations on the 23 new PCs, which replace the 23 original numerical predictors.
plot.PCA(x = employee_pca,
choix = "ind",
select = "contrib10",
invisible = "quali")
Plotted on the 2 most informative PCs, this is how our data is scattered, with the 10 observations farthest from the (0,0) origin highlighted. Since they are not actually very far from the other observations, these data points do not need to be treated as outliers and excluded.
plot.PCA(x = employee_pca, choix = "var", label = "all")
Shown above is a PCA graph of how each original numerical predictor contributes to the first 2 PCs.
Variable Contribution to Each PC
dim <- dimdesc(employee_pca)
as.data.frame(dim$Dim.1$quanti)
#> correlation p.value
#> TotalWorkingYears 0.8705583 0.000000e+00
#> YearsAtCompany 0.8447751 0.000000e+00
#> JobLevel 0.8260250 0.000000e+00
#> MonthlyIncome 0.8085802 0.000000e+00
#> YearsInCurrentRole 0.7248619 7.342664e-240
#> YearsWithCurrManager 0.7128950 1.680595e-228
#> YearsSinceLastPromotion 0.6417415 2.111727e-171
#> Age 0.5977268 4.273467e-143
#> Education 0.1681454 8.724442e-11
#> NumCompaniesWorked 0.1046384 5.831645e-05
as.data.frame(dim$Dim.2$quanti)
#> correlation p.value
#> PerformanceRating 0.60006294 1.730861e-144
#> PercentSalaryHike 0.57470626 5.763050e-130
#> YearsWithCurrManager 0.38237187 2.221494e-52
#> YearsInCurrentRole 0.38232101 2.297324e-52
#> YearsAtCompany 0.29393322 1.104823e-30
#> YearsSinceLastPromotion 0.27600778 4.121838e-27
#> DistanceFromHome 0.07691673 3.168538e-03
#> TrainingTimesLastYear 0.05855275 2.477052e-02
#> HourlyRate -0.06037865 2.060771e-02
#> JobInvolvement -0.06514718 1.247870e-02
#> MonthlyRate -0.08475562 1.143265e-03
#> RelationshipSatisfaction -0.11964635 4.226111e-06
#> Education -0.19869805 1.488675e-14
#> TotalWorkingYears -0.23083766 3.123827e-19
#> JobLevel -0.25909443 5.620386e-24
#> MonthlyIncome -0.27145517 3.027261e-26
#> Age -0.38400456 7.538725e-53
#> NumCompaniesWorked -0.51407356 5.988350e-100
From dimdesc() above, we can see the correlation values, i.e. how strongly each original predictor is tied to each Principal Component and thus how much of its information the component carries. As an example, we look at the significant variable contributors of PC1 and PC2.
PC Contribution to Variance / Information
employee_pca$eig
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 4.65568301 20.2421000 20.24210
#> comp 2 1.83608877 7.9829947 28.22509
#> comp 3 1.75518507 7.6312394 35.85633
#> comp 4 1.19196184 5.1824428 41.03878
#> comp 5 1.08623652 4.7227675 45.76154
#> comp 6 1.05824417 4.6010616 50.36261
#> comp 7 1.05156065 4.5720028 54.93461
#> comp 8 1.03833464 4.5144985 59.44911
#> comp 9 1.02137212 4.4407483 63.88986
#> comp 10 0.99066276 4.3072294 68.19708
#> comp 11 0.96536641 4.1972452 72.39433
#> comp 12 0.94683080 4.1166557 76.51099
#> comp 13 0.90365332 3.9289275 80.43991
#> comp 14 0.90208602 3.9221131 84.36203
#> comp 15 0.87926425 3.8228881 88.18491
#> comp 16 0.75387072 3.2776988 91.46261
#> comp 17 0.54146306 2.3541872 93.81680
#> comp 18 0.52212983 2.2701297 96.08693
#> comp 19 0.27987064 1.2168289 97.30376
#> comp 20 0.22607818 0.9829486 98.28671
#> comp 21 0.20390815 0.8865572 99.17326
#> comp 22 0.14174295 0.6162737 99.78954
#> comp 23 0.04840611 0.2104613 100.00000
Above is the cumulative percentage of variance. It tells us how many PCs must be kept to retain a given percentage of the original variance. For example, keeping at least 85% of the variance requires keeping PC1 through PC15, since that is where the cumulative percentage first passes 85%.
Neural Network
Class Imbalance: Need Downsampling/Upsampling?
Our target variable is Attrition. Let's check its class proportions:
prop.table(table(employee_clean$Attrition))
#>
#> No Yes
#> 0.8387755 0.1612245
Since the classes are quite imbalanced, at roughly an 84:16 proportion, I would personally use upsampling to rebalance them and help the modelling. It does bias the data towards the minority class by duplicating its rows, but unlike downsampling it does not throw away a lot of precious data.
For now, I will apply both upsampling and downsampling, and compare the models afterwards.
RNGkind(sample.kind = "Rounding")
set.seed(314)
employee_down <- downSample(x = employee_clean %>% select(-Attrition),
y = employee_clean$Attrition,
yname = "Attrition")
prop.table(table(employee_down$Attrition))
#>
#> No Yes
#> 0.5 0.5
employee_up <- upSample(x = employee_clean %>% select(-Attrition),
y = employee_clean$Attrition,
yname = "Attrition")
prop.table(table(employee_up$Attrition))
#>
#> No Yes
#> 0.5 0.5
Cross-Validation 80:20
I will hold out 20% of the data for testing and train on the remaining 80%, using stratified random sampling on the target variable Attrition to preserve the class balance achieved by the upsampling and downsampling above. (One caveat: because resampling was done before splitting, upsampling's duplicated minority-class rows can appear in both the training and the testing sets, which may flatter the upsampled models' test metrics.)
RNGkind(sample.kind = "Rounding")
set.seed(314)
#without downsampling
emp_index <- initial_split(employee_clean, prop = 0.8, strata = "Attrition")
emp_train <- training(emp_index)
emp_test <- testing(emp_index)
#with downsampling
emp_index <- initial_split(employee_down, prop = 0.8, strata = "Attrition")
emp_train_down <- training(emp_index)
emp_test_down <- testing(emp_index)
#with upsampling
emp_index <- initial_split(employee_up, prop = 0.8, strata = "Attrition")
emp_train_up <- training(emp_index)
emp_test_up <- testing(emp_index)
Training: All Pred, No Down, No Up
We will train our model using all predictors (both numerical predictors and categorical predictors converted to dummy variables), on the original dataset with its class imbalance.
Pre-processing
emp_train_x <-
emp_train %>%
select(-Attrition)
emp_test_x <-
emp_test %>%
select(-Attrition)
emp_train_y <-
emp_train %>%
select(Attrition)
emp_test_y <-
emp_test %>%
select(Attrition)
# glimpse(emp_train_x)
# emp_train_x[,c(2,4,7,9,13,15,19)]
emp_train_x_sc <-
as.data.frame(model.matrix(~., data = emp_train_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_test_x_sc <-
as.data.frame(model.matrix(~., data = emp_test_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_train_x_keras <-
emp_train_x_sc %>%
array_reshape(dim = dim(emp_train_x_sc))
emp_test_x_keras <-
emp_test_x_sc %>%
array_reshape(dim = dim(emp_test_x_sc))
emp_train_y_keras <-
emp_train_y %>%
mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
as.matrix() %>%
to_categorical()
# emp_test_y_keras <-
# emp_test_y %>%
# mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
# as.matrix() %>%
# to_categorical()
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn <- keras_model_sequential()
model_nn %>%
layer_dense(input_shape = 44, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "accuracy"
)
history <- model_nn %>%
fit(emp_train_x_keras,
emp_train_y_keras,
epochs = 15,
batch_size = 200)
summary(model_nn)
#> Model: "sequential"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 5760
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 16,162
#> Trainable params: 16,162
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history)
The accuracy reaches above 90%. Let's hope that this won't turn into an overfitting case.
Evaluation
pred_nn_test <- predict(model_nn, emp_test_x_keras)
# head(pred_nn_test)
pred_nn_test_label <-
pred_nn_test %>%
as.data.frame() %>%
`colnames<-`(0:1) %>% #name the two probability columns "0" and "1"
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No"))) #label each row by its higher-probability column
confusionMatrix(data = pred_nn_test_label$Class, reference = emp_test_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 234 26
#> Yes 13 22
#>
#> Accuracy : 0.8678
#> 95% CI : (0.8237, 0.9043)
#> No Information Rate : 0.8373
#> P-Value [Acc > NIR] : 0.08741
#>
#> Kappa : 0.4554
#>
#> Mcnemar's Test P-Value : 0.05466
#>
#> Sensitivity : 0.45833
#> Specificity : 0.94737
#> Pos Pred Value : 0.62857
#> Neg Pred Value : 0.90000
#> Prevalence : 0.16271
#> Detection Rate : 0.07458
#> Detection Prevalence : 0.11864
#> Balanced Accuracy : 0.70285
#>
#> 'Positive' Class : Yes
#>
Accuracy: 86.78%
Sensitivity / Recall: 45.83%
Pos Pred Value / Precision: 62.86%
Accuracy is pretty good, but this is expected with an imbalanced class: in the extreme "lazy" case where the model predicts the majority class for all testing data, accuracy would still look high. It is also not far from the training accuracy, which touches 90%, so this should be a just-fit model.
As for Recall vs Precision, the choice depends on the business question being tackled.
If HR wants to prevent employee attrition, the use case is preventive and we should focus on Recall.
If HR wants to predict attrition as precisely as possible, to optimize salaries and the like, it is the quality of the prediction that matters, and we should focus on Precision.
For now, we will compare models on all 3 metrics: Accuracy, Recall, and Precision (the sketch below shows how the latter two fall out of the confusion matrix). I would prefer every metric to reach at least 80%, and this model's Recall and Precision are well short of that.
Training: All Pred, Downsampling
We will train this model using all predictors, as above, but on the dataset that was downsampled to balance the classes of the target variable.
Pre-processing
#split predictors and target
emp_train_down_x <-
emp_train_down %>%
select(-Attrition)
emp_test_down_x <-
emp_test_down %>%
select(-Attrition)
emp_train_down_y <-
emp_train_down %>%
select(Attrition)
emp_test_down_y <-
emp_test_down %>%
select(Attrition)
# glimpse(emp_train_x)
# emp_train_x[,c(2,4,7,9,13,15,19)]
emp_train_down_x_sc <-
as.data.frame(model.matrix(~., data = emp_train_down_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_test_down_x_sc <-
as.data.frame(model.matrix(~., data = emp_test_down_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_train_down_x_keras <-
emp_train_down_x_sc %>%
array_reshape(dim = dim(emp_train_down_x_sc))
emp_test_down_x_keras <-
emp_test_down_x_sc %>%
array_reshape(dim = dim(emp_test_down_x_sc))
emp_train_down_y_keras <-
emp_train_down_y %>%
mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
as.matrix() %>%
to_categorical()
# emp_test_y_keras <-
# emp_test_y %>%
# mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
# as.matrix() %>%
# to_categorical()
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn_down <- keras_model_sequential()
model_nn_down %>%
layer_dense(input_shape = 44, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn_down %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "accuracy"
)
history_down <- model_nn_down %>%
fit(emp_train_down_x_keras,
emp_train_down_y_keras,
epochs = 20,
batch_size = 100)
summary(model_nn_down)
#> Model: "sequential_1"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 5760
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 16,162
#> Trainable params: 16,162
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history_down)
Evaluation
pred_nn_test_down <- predict(model_nn_down, emp_test_down_x_keras)
# head(pred_nn_test)
pred_nn_test_down_label <-
pred_nn_test_down %>%
as.data.frame() %>%
`colnames<-`(0:1) %>%
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))
confusionMatrix(data = pred_nn_test_down_label$Class, reference = emp_test_down_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 37 11
#> Yes 11 37
#>
#> Accuracy : 0.7708
#> 95% CI : (0.6739, 0.8505)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 4.72e-08
#>
#> Kappa : 0.5417
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.7708
#> Specificity : 0.7708
#> Pos Pred Value : 0.7708
#> Neg Pred Value : 0.7708
#> Prevalence : 0.5000
#> Detection Rate : 0.3854
#> Detection Prevalence : 0.5000
#> Balanced Accuracy : 0.7708
#>
#> 'Positive' Class : Yes
#>
Accuracy: 77.08%
Sensitivity / Recall: 77.08%
Pos Pred Value / Precision: 77.08%
Overall, not a bad model, though the accuracy on this testing data is quite far below the training accuracy of around 86.24%, so this model is somewhat overfitting.
Training: All Pred, Upsampling
We will train this model using all predictors, as above, but on the dataset that was upsampled to balance the classes of the target variable. I expect this to perform better than the imbalanced or downsampled versions, since no valuable data is removed, not even from the majority class, which here is not the positive class (Attrition "No", i.e. employees staying with the company).
Pre-processing
#split predictors and target
emp_train_up_x <-
emp_train_up %>%
select(-Attrition)
emp_test_up_x <-
emp_test_up %>%
select(-Attrition)
emp_train_up_y <-
emp_train_up %>%
select(Attrition)
emp_test_up_y <-
emp_test_up %>%
select(Attrition)
# glimpse(emp_train_x)
# emp_train_x[,c(2,4,7,9,13,15,19)]
emp_train_up_x_sc <-
as.data.frame(model.matrix(~., data = emp_train_up_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_test_up_x_sc <-
as.data.frame(model.matrix(~., data = emp_test_up_x)) %>% #dummy variable all the predictors including the categoricals
select(-"(Intercept)") %>%
scale()
emp_train_up_x_keras <-
emp_train_up_x_sc %>%
array_reshape(dim = dim(emp_train_up_x_sc))
emp_test_up_x_keras <-
emp_test_up_x_sc %>%
array_reshape(dim = dim(emp_test_up_x_sc))
emp_train_up_y_keras <-
emp_train_up_y %>%
mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
as.matrix() %>%
to_categorical()
# emp_test_y_keras <-
# emp_test_y %>%
# mutate(Attrition = ifelse(Attrition == "Yes",1,0)) %>%
# as.matrix() %>%
# to_categorical()
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn_up <- keras_model_sequential()
model_nn_up %>%
layer_dense(input_shape = 44, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn_up %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "accuracy"
)
history_up <- model_nn_up %>%
fit(emp_train_up_x_keras,
emp_train_up_y_keras,
epochs = 20,
batch_size = 250)
summary(model_nn_up)
#> Model: "sequential_2"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 5760
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 16,162
#> Trainable params: 16,162
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history_up)
Evaluation
pred_nn_test_up <- predict(model_nn_up, emp_test_up_x_keras)
# head(pred_nn_test)
pred_nn_test_up_label <-
pred_nn_test_up %>%
as.data.frame() %>%
`colnames<-`(0:1) %>%
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))
confusionMatrix(data = pred_nn_test_up_label$Class, reference = emp_test_up_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 199 48
#> Yes 48 199
#>
#> Accuracy : 0.8057
#> 95% CI : (0.768, 0.8397)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : <2e-16
#>
#> Kappa : 0.6113
#>
#> Mcnemar's Test P-Value : 1
#>
#> Sensitivity : 0.8057
#> Specificity : 0.8057
#> Pos Pred Value : 0.8057
#> Neg Pred Value : 0.8057
#> Prevalence : 0.5000
#> Detection Rate : 0.4028
#> Detection Prevalence : 0.5000
#> Balanced Accuracy : 0.8057
#>
#> 'Positive' Class : Yes
#>
Accuracy: 80.57%
Sensitivity / Recall: 80.57%
Pos Pred Value / Precision: 80.57%
Overall, not a bad model and very similar to the downsampled one; the metrics here are a bit higher and just meet the 80% target, though the model is still a bit overfitting.
Training: Keep 85% using PCA, No Down, No Up
Now we will keep only enough PCs to retain at least 85% of the variance, using the PCA concepts above, and train the Neural Network on fewer predictors. This one uses the original dataset with its imbalanced target variable.
Pre-processing
glimpse(emp_train_x)
#> Rows: 1,175
#> Columns: 30
#> $ Age <int> 49, 33, 27, 32, 30, 36, 35, 29, 31, 34, 29, 3~
#> $ BusinessTravel <fct> Travel_Frequently, Travel_Frequently, Travel_~
#> $ DailyRate <int> 279, 1392, 591, 1005, 1358, 1299, 809, 153, 6~
#> $ Department <fct> Research & Development, Research & Developmen~
#> $ DistanceFromHome <int> 8, 3, 2, 2, 24, 27, 16, 15, 26, 19, 21, 5, 11~
#> $ Education <int> 1, 4, 1, 2, 1, 3, 3, 2, 1, 2, 4, 2, 2, 4, 2, ~
#> $ EducationField <fct> Life Sciences, Life Sciences, Medical, Life S~
#> $ EnvironmentSatisfaction <int> 3, 4, 1, 4, 4, 3, 1, 4, 1, 2, 2, 1, 1, 1, 3, ~
#> $ Gender <fct> Male, Female, Male, Male, Male, Male, Male, F~
#> $ HourlyRate <int> 61, 56, 40, 79, 67, 94, 84, 49, 31, 93, 51, 8~
#> $ JobInvolvement <int> 2, 3, 3, 3, 3, 3, 4, 2, 3, 3, 4, 4, 4, 3, 3, ~
#> $ JobLevel <int> 2, 1, 1, 1, 1, 2, 1, 2, 1, 1, 3, 1, 2, 3, 1, ~
#> $ JobRole <fct> Research Scientist, Research Scientist, Labor~
#> $ JobSatisfaction <int> 2, 3, 2, 4, 3, 3, 2, 3, 3, 4, 1, 2, 3, 2, 4, ~
#> $ MaritalStatus <fct> Married, Married, Married, Single, Divorced, ~
#> $ MonthlyIncome <int> 5130, 2909, 3468, 3068, 2693, 5237, 2426, 419~
#> $ MonthlyRate <int> 24907, 23159, 16632, 11864, 13335, 16577, 164~
#> $ NumCompaniesWorked <int> 1, 1, 9, 0, 1, 6, 0, 0, 1, 0, 1, 0, 0, 0, 1, ~
#> $ OverTime <fct> No, Yes, No, No, No, No, No, Yes, No, No, No,~
#> $ PercentSalaryHike <int> 23, 11, 12, 13, 22, 13, 13, 12, 17, 11, 11, 1~
#> $ PerformanceRating <int> 4, 3, 3, 3, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ~
#> $ RelationshipSatisfaction <int> 4, 3, 4, 3, 2, 2, 3, 4, 4, 3, 3, 4, 4, 3, 4, ~
#> $ StockOptionLevel <int> 1, 0, 1, 0, 1, 2, 1, 0, 1, 1, 1, 2, 1, 0, 0, ~
#> $ TotalWorkingYears <int> 10, 8, 6, 8, 1, 17, 6, 10, 5, 3, 10, 7, 5, 13~
#> $ TrainingTimesLastYear <int> 3, 3, 3, 2, 2, 3, 5, 3, 1, 2, 1, 5, 5, 4, 6, ~
#> $ WorkLifeBalance <int> 3, 3, 3, 2, 3, 2, 3, 3, 2, 3, 3, 2, 2, 3, 3, ~
#> $ YearsAtCompany <int> 10, 8, 2, 7, 1, 7, 5, 9, 5, 2, 10, 6, 4, 12, ~
#> $ YearsInCurrentRole <int> 7, 7, 2, 7, 0, 7, 4, 5, 2, 2, 9, 2, 2, 6, 0, ~
#> $ YearsSinceLastPromotion <int> 1, 3, 2, 3, 0, 7, 0, 0, 4, 1, 8, 0, 1, 2, 0, ~
#> $ YearsWithCurrManager <int> 7, 0, 2, 6, 0, 7, 3, 8, 3, 2, 8, 5, 3, 11, 0,~
qualivar <- c(2,4,7,9,13,15,19)
quantivar <- c(1:30)[!c(1:30) %in% qualivar]
emp_pca_train_x_keep85pc <- PCA(X = emp_train_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 23,
graph = F)
emp_pca_train_x_keep85pc$eig #checking cumulative, still at 15PC for 85%
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 4.66759898 20.2939086 20.29391
#> comp 2 1.81435288 7.8884908 28.18240
#> comp 3 1.76448905 7.6716915 35.85409
#> comp 4 1.19782421 5.2079313 41.06202
#> comp 5 1.12566371 4.8941901 45.95621
#> comp 6 1.07305779 4.6654686 50.62168
#> comp 7 1.05463790 4.5853822 55.20706
#> comp 8 1.02529541 4.4578061 59.66487
#> comp 9 1.01367542 4.4072844 64.07215
#> comp 10 0.98594266 4.2867072 68.35886
#> comp 11 0.97383468 4.2340638 72.59292
#> comp 12 0.93806844 4.0785584 76.67148
#> comp 13 0.90832738 3.9492495 80.62073
#> comp 14 0.87035760 3.7841635 84.40490
#> comp 15 0.85071252 3.6987501 88.10365
#> comp 16 0.77327987 3.3620864 91.46573
#> comp 17 0.53903524 2.3436315 93.80936
#> comp 18 0.51675118 2.2467443 96.05611
#> comp 19 0.29087058 1.2646547 97.32076
#> comp 20 0.21861264 0.9504897 98.27125
#> comp 21 0.20409012 0.8873484 99.15860
#> comp 22 0.14179538 0.6165017 99.77510
#> comp 23 0.05172637 0.2248973 100.00000
We would need 15 PCs.
emp_pca_train_x_keep85pc <- PCA(X = emp_train_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 15,
graph = F)
Treating the testing data using the same PCs from training:
emp_pca_test_x_keep85pc <- predict(emp_pca_train_x_keep85pc, emp_test_x)
glimpse(emp_pca_test_x_keep85pc)
#> List of 3
#> $ coord: num [1:295, 1:15] -1.32 1.1 -3.05 4.79 -1.5 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:295] "7" "9" "18" "19" ...
#> .. ..$ : chr [1:15] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ cos2 : num [1:295, 1:15] 0.0458 0.0519 0.3552 0.6021 0.1591 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:295] "7" "9" "18" "19" ...
#> .. ..$ : chr [1:15] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ dist : Named num [1:295] 6.16 4.82 5.12 6.17 3.76 ...
#> ..- attr(*, "names")= chr [1:295] "7" "9" "18" "19" ...
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn_pca85 <- keras_model_sequential()
model_nn_pca85 %>%
layer_dense(input_shape = 15, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn_pca85 %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.001),
metrics = "accuracy"
)
history <- model_nn_pca85 %>%
fit(emp_pca_train_x_keep85pc$ind$coord,
emp_train_y_keras,
epochs = 15,
batch_size = 200)
summary(model_nn_pca85)
#> Model: "sequential_3"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 2048
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 12,450
#> Trainable params: 12,450
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history)
Evaluation
pred_nn_pca85_test <- predict(model_nn_pca85, emp_pca_test_x_keep85pc$coord)
# head(pred_nn_test)
pred_nn_pca85_test_label <-
pred_nn_pca85_test %>%
as.data.frame() %>%
`colnames<-`(0:1) %>%
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))
confusionMatrix(data = pred_nn_pca85_test_label$Class, reference = emp_test_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 242 32
#> Yes 5 16
#>
#> Accuracy : 0.8746
#> 95% CI : (0.8313, 0.9101)
#> No Information Rate : 0.8373
#> P-Value [Acc > NIR] : 0.04536
#>
#> Kappa : 0.4048
#>
#> Mcnemar's Test P-Value : 1.917e-05
#>
#> Sensitivity : 0.33333
#> Specificity : 0.97976
#> Pos Pred Value : 0.76190
#> Neg Pred Value : 0.88321
#> Prevalence : 0.16271
#> Detection Rate : 0.05424
#> Detection Prevalence : 0.07119
#> Balanced Accuracy : 0.65655
#>
#> 'Positive' Class : Yes
#>
Accuracy: 87.46%
Sensitivity / Recall: 33.33%
Pos Pred Value / Precision: 76.19%
Accuracy-wise, it is very similar to the "All Pred, No Down, No Up" model, at around 87%. Against a training accuracy of around 85%, this model is not overfitting at all and is just-fit.
Meanwhile, Recall took a big hit and is now very low. Therefore, I would not recommend using this model further.
Training: Keep 85% using PCA, Downsampling
We are still keeping only enough PCs to retain at least 85% of the variance and training the Neural Network on fewer predictors, but this time on the downsampled data, so the classes of the target variable are balanced at the size of the minority class, unfortunately removing a lot of data from the majority class.
Pre-processing
emp_pca_train_down_x_keep85pc <- PCA(X = emp_train_down_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 23,
graph = F)
emp_pca_train_down_x_keep85pc$eig #checking cumulative, 14PC is enough for 85%
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 4.8418940 21.0517130 21.05171
#> comp 2 1.9245716 8.3677027 29.41942
#> comp 3 1.6613027 7.2230550 36.64247
#> comp 4 1.3322077 5.7922073 42.43468
#> comp 5 1.2104230 5.2627089 47.69739
#> comp 6 1.1589041 5.0387134 52.73610
#> comp 7 1.1088665 4.8211587 57.55726
#> comp 8 1.1023011 4.7926135 62.34987
#> comp 9 1.0377085 4.5117762 66.86165
#> comp 10 0.9786614 4.2550495 71.11670
#> comp 11 0.9262472 4.0271616 75.14386
#> comp 12 0.8542553 3.7141537 78.85801
#> comp 13 0.8294086 3.6061242 82.46414
#> comp 14 0.7955556 3.4589373 85.92307
#> comp 15 0.7233068 3.1448121 89.06789
#> comp 16 0.6492792 2.8229531 91.89084
#> comp 17 0.4958773 2.1559884 94.04683
#> comp 18 0.4391188 1.9092120 95.95604
#> comp 19 0.2886065 1.2548110 97.21085
#> comp 20 0.2446569 1.0637258 98.27458
#> comp 21 0.2249168 0.9778991 99.25248
#> comp 22 0.1220239 0.5305387 99.78302
#> comp 23 0.0499065 0.2169848 100.00000
In this case, instead of the 15 PCs used before, 14 are enough to pass 85%.
emp_pca_train_down_x_keep85pc <- PCA(X = emp_train_down_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 14,
graph = F)
Treating the testing data using the same PCs from training:
emp_pca_test_down_x_keep85pc <- predict(emp_pca_train_down_x_keep85pc, emp_test_down_x)
glimpse(emp_pca_test_down_x_keep85pc)
#> List of 3
#> $ coord: num [1:96, 1:14] -2 -0.172 -1.022 1.766 0.981 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:96] "15" "20" "22" "29" ...
#> .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ cos2 : num [1:96, 1:14] 0.15478 0.00236 0.07268 0.12367 0.08736 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:96] "15" "20" "22" "29" ...
#> .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ dist : Named num [1:96] 5.08 3.55 3.79 5.02 3.32 ...
#> ..- attr(*, "names")= chr [1:96] "15" "20" "22" "29" ...
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn_down_pca85 <- keras_model_sequential()
model_nn_down_pca85 %>%
layer_dense(input_shape = 14, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn_down_pca85 %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.05),
metrics = "accuracy"
)
history <- model_nn_down_pca85 %>%
fit(emp_pca_train_down_x_keep85pc$ind$coord,
emp_train_down_y_keras,
epochs = 15,
batch_size = 150)
summary(model_nn_down_pca85)
#> Model: "sequential_4"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 1920
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 12,322
#> Trainable params: 12,322
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history)
Evaluation
pred_nn_down_pca85_test <- predict(model_nn_down_pca85, emp_pca_test_down_x_keep85pc$coord)
# head(pred_nn_test)
pred_nn_down_pca85_test_label <-
pred_nn_down_pca85_test %>%
as.data.frame() %>%
`colnames<-`(0:1) %>%
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))
confusionMatrix(data = pred_nn_down_pca85_test_label$Class, reference = emp_test_down_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 33 15
#> Yes 15 33
#>
#> Accuracy : 0.6875
#> 95% CI : (0.5848, 0.7782)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : 0.0001528
#>
#> Kappa : 0.375
#>
#> Mcnemar's Test P-Value : 1.0000000
#>
#> Sensitivity : 0.6875
#> Specificity : 0.6875
#> Pos Pred Value : 0.6875
#> Neg Pred Value : 0.6875
#> Prevalence : 0.5000
#> Detection Rate : 0.3438
#> Detection Prevalence : 0.5000
#> Balanced Accuracy : 0.6875
#>
#> 'Positive' Class : Yes
#>
Accuracy: 68.75%
Sensitivity / Recall: 68.75%
Pos Pred Value / Precision: 68.75%
Overall, all the metrics took a hit and are lower than those of the equivalent model using all predictors. The Recall, however, is still better than that of the PCA model trained without downsampling or upsampling.
Training: Keep 85% using PCA, Upsampling
We are still keeping only enough PCs to retain at least 85% of the variance and training the Neural Network on fewer predictors, but this time on the upsampled data, so the classes of the target variable are balanced at the size of the majority class by duplicating random observations from the minority class.
Pre-processing
emp_pca_train_up_x_keep85pc <- PCA(X = emp_train_up_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 23,
graph = F)
emp_pca_train_up_x_keep85pc$eig
#> eigenvalue percentage of variance cumulative percentage of variance
#> comp 1 5.01937909 21.8233873 21.82339
#> comp 2 1.81644091 7.8975692 29.72096
#> comp 3 1.61782680 7.0340296 36.75499
#> comp 4 1.22559056 5.3286546 42.08364
#> comp 5 1.21230424 5.2708880 47.35453
#> comp 6 1.12974937 4.9119538 52.26648
#> comp 7 1.06716596 4.6398520 56.90633
#> comp 8 1.04888044 4.5603497 61.46668
#> comp 9 1.02345306 4.4497959 65.91648
#> comp 10 0.97769542 4.2508497 70.16733
#> comp 11 0.94300201 4.1000087 74.26734
#> comp 12 0.89837621 3.9059835 78.17332
#> comp 13 0.85910807 3.7352525 81.90857
#> comp 14 0.85345778 3.7106860 85.61926
#> comp 15 0.78767824 3.4246880 89.04395
#> comp 16 0.72781499 3.1644130 92.20836
#> comp 17 0.49085503 2.1341523 94.34251
#> comp 18 0.43635827 1.8972099 96.23972
#> comp 19 0.25132105 1.0927002 97.33242
#> comp 20 0.22576300 0.9815783 98.31400
#> comp 21 0.21350278 0.9282729 99.24228
#> comp 22 0.12536278 0.5450556 99.78733
#> comp 23 0.04891395 0.2126694 100.00000
Here, we only need to keep the first 14 PCs.
emp_pca_train_up_x_keep85pc <- PCA(X = emp_train_up_x,
scale.unit = T,
quali.sup = qualivar,
ncp = 14,
graph = F)
Treating the testing data using the same PCs from training:
emp_pca_test_up_x_keep85pc <- predict(emp_pca_train_up_x_keep85pc, emp_test_up_x)
glimpse(emp_pca_test_up_x_keep85pc)
#> List of 3
#> $ coord: num [1:494, 1:14] 0.847 -1.19 1.461 0.917 2.476 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:494] "1" "5" "8" "22" ...
#> .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ cos2 : num [1:494, 1:14] 0.0283 0.0373 0.1134 0.0808 0.1862 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : chr [1:494] "1" "5" "8" "22" ...
#> .. ..$ : chr [1:14] "Dim.1" "Dim.2" "Dim.3" "Dim.4" ...
#> $ dist : Named num [1:494] 5.04 6.16 4.34 3.22 5.74 ...
#> ..- attr(*, "names")= chr [1:494] "1" "5" "8" "22" ...
NN Fitting
RNGkind(sample.kind = "Rounding")
set.seed(314)
initializer <- initializer_random_normal(seed = 314)
model_nn_up_pca85 <- keras_model_sequential()
model_nn_up_pca85 %>%
layer_dense(input_shape = 14, #input layer
units = 128,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden1") %>%
layer_dense(units = 64,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden2") %>%
layer_dense(units = 32,
activation = "tanh",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "hidden3") %>%
layer_dense(units = 2,
activation = "sigmoid",
bias_initializer = initializer,
kernel_initializer = initializer,
name = "output")
model_nn_up_pca85 %>%
compile(
loss = "binary_crossentropy",
optimizer = optimizer_adam(learning_rate = 0.01),
metrics = "accuracy"
)
history <- model_nn_up_pca85 %>%
fit(emp_pca_train_up_x_keep85pc$ind$coord,
emp_train_up_y_keras,
epochs = 25,
batch_size = 150)
summary(model_nn_up_pca85)
#> Model: "sequential_5"
#> ________________________________________________________________________________
#> Layer (type) Output Shape Param #
#> ================================================================================
#> hidden1 (Dense) (None, 128) 1920
#>
#> hidden2 (Dense) (None, 64) 8256
#>
#> hidden3 (Dense) (None, 32) 2080
#>
#> output (Dense) (None, 2) 66
#>
#> ================================================================================
#> Total params: 12,322
#> Trainable params: 12,322
#> Non-trainable params: 0
#> ________________________________________________________________________________
plot(history)
Evaluation
pred_nn_up_pca85_test <- predict(model_nn_up_pca85, emp_pca_test_up_x_keep85pc$coord)
# head(pred_nn_test)
pred_nn_up_pca85_test_label <-
pred_nn_up_pca85_test %>%
as.data.frame() %>%
`colnames<-`(0:1) %>%
transmute(Class = as.factor(ifelse(names(.)[max.col(.)] == "1","Yes","No")))
confusionMatrix(data = pred_nn_up_pca85_test_label$Class, reference = emp_test_up_y$Attrition, positive = "Yes")
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction No Yes
#> No 209 23
#> Yes 38 224
#>
#> Accuracy : 0.8765
#> 95% CI : (0.8442, 0.9042)
#> No Information Rate : 0.5
#> P-Value [Acc > NIR] : < 2e-16
#>
#> Kappa : 0.753
#>
#> Mcnemar's Test P-Value : 0.07305
#>
#> Sensitivity : 0.9069
#> Specificity : 0.8462
#> Pos Pred Value : 0.8550
#> Neg Pred Value : 0.9009
#> Prevalence : 0.5000
#> Detection Rate : 0.4534
#> Detection Prevalence : 0.5304
#> Balanced Accuracy : 0.8765
#>
#> 'Positive' Class : Yes
#>
Accuracy: 87.65%
Sensitivity / Recall: 90.69%
Pos Pred Value / Precision: 85.50%
Our best model, reaching 85-90% on all metrics! Given a training accuracy that reached about 97%, the model is a bit overfitting, but it is still the strongest performer on the test set.
Conclusions
The best Neural Network model we found uses upsampling to balance the classes of the target variable, together with fewer predictors/variables obtained through Principal Component Analysis and dimensionality reduction. That model, model_nn_up_pca85, has an Accuracy of 87.65%, Recall of 90.69%, and Precision of 85.50%.
This supports my original business question: we can indeed use the PCA concept from Unsupervised Machine Learning to reduce the dimensionality of our data, potentially improving the performance of a Supervised Machine Learning model, in this case a Neural Network.
If the class imbalance leans heavily towards the majority class, upsampling may be preferable to downsampling, since the bias introduced by duplicating the minority class should be less harmful than the information lost by removing rows from the majority class.
About Me
Hi! My name is Calvin, and I am from Jakarta, Indonesia. I am looking to become a full-time data analyst and/or data scientist. I have a background in Mathematics and Computer Science from my Bachelor's degrees, and I love playing with numbers and data. I wrote this to build up my Data Science portfolio (constructive criticism is very much welcome!) and as part of a Learn-By-Building assignment at Algoritma Data Science School.
You can reach me at my LinkedIn for more discussion. Thank you!