# Data Wrangling
library(dplyr)
library(tidyverse)
# Plotting
library(ggplot2)
# Machine Learning
## KNN
library(class)
## Naive Bayes
library(e1071)
## Random Forest
library(partykit)
# Machine Learning Evaluation
library(caret)
library(car)

Employees are the backbone of any organization. An organization's success depends heavily on committed employees who are motivated and productive in carrying out the company's goals and objectives. Employee attrition, the rate at which employees leave their jobs, therefore hurts an organization's growth and performance in terms of financial profit, productivity, and time. However, attrition is an inevitable part of any business. There will come a time when an employee wants to leave a company, and this can happen for many reasons: seeking better opportunities, struggling with mental health, working excessive hours, or working in a toxic atmosphere or under bad management. But when attrition exceeds a certain threshold, it becomes a concern to be addressed, as it may reflect a deeper problem with the company culture. Attrition is particularly concerning when the overall attrition rate is high, indicating that employees are turning over quickly, or when early attrition is high, indicating that new joiners leave within their first few years of employment. It is therefore important to ask questions, find out why people are leaving, and understand what influences the attrition rate within an organization.
The most important question to ask is: which factors contribute most to employee attrition? By understanding these factors, a company can take the right measures to retain its employees. To explore them, we use the IBM Employee Attrition dataset to visualize the most interesting and significant factors associated with higher attrition.
We run glimpse() on the data to inspect it:
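Assuming the dataset was read into emp_att (the data frame name used in the later chunks; the read step itself is not shown), the call producing the output below is simply:

glimpse(emp_att)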
Rows: 1,470
Columns: 35
$ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, …
$ Attrition <chr> "Yes", "No", "Yes", "No", "No", "No", "N…
$ BusinessTravel <chr> "Travel_Rarely", "Travel_Frequently", "T…
$ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 1324, …
$ Department <chr> "Sales", "Research & Development", "Rese…
$ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15,…
$ Education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2…
$ EducationField <chr> "Life Sciences", "Life Sciences", "Other…
$ EmployeeCount <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
$ EmployeeNumber <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15…
$ EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2…
$ Gender <chr> "Female", "Male", "Male", "Female", "Mal…
$ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, …
$ JobInvolvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3…
$ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1…
$ JobRole <chr> "Sales Executive", "Research Scientist",…
$ JobSatisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4…
$ MaritalStatus <chr> "Single", "Married", "Single", "Married"…
$ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2670…
$ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 11864,…
$ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0…
$ Over18 <chr> "Y", "Y", "Y", "Y", "Y", "Y", "Y", "Y", …
$ OverTime <chr> "Yes", "No", "Yes", "Yes", "No", "No", "…
$ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, …
$ PerformanceRating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3…
$ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3…
$ StandardHours <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, …
$ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1…
$ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10,…
$ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2…
$ WorkLifeBalance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3…
$ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, …
$ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2…
$ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1…
$ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2…
From the glimpse above we can see that some variables have very little variance, and such variables are usually poor predictors for our target, which in this case is how likely an employee is to resign. To be sure, we can use nearZeroVar() and remove all variables with near-zero variance.
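A minimal sketch of this step, assuming the raw data is in emp_att and the cleaned data in ea_cln (both names appear in later chunks):

# Column indices of near-zero-variance predictors (here the constant
# columns, e.g. EmployeeCount, Over18, and StandardHours)
nzv <- nearZeroVar(emp_att)
ea_cln <- emp_att[, -nzv]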
For modeling purposes we encode several variables: Attrition, OverTime, and Gender are mapped to binary 0/1 values, and the remaining categorical variables are converted to numeric codes.
# Map a categorical variable to numeric codes in order of first appearance
convert_to_numeric <- function(x) {
  as.numeric(factor(x, levels = unique(x)))
}

ea_cln <- ea_cln %>%
  # Binary-encode Attrition, OverTime, and Gender
  mutate(Attrition = ifelse(Attrition == 'Yes', 1, 0),
         OverTime = ifelse(OverTime == 'Yes', 1, 0),
         Gender = ifelse(Gender == 'Female', 1, 0)) %>%
  # Keep the target as a factor for classification
  mutate(Attrition = as.factor(Attrition)) %>%
  # Numeric codes for the remaining categorical predictors
  mutate_at(vars(BusinessTravel, MaritalStatus, Department, EducationField, JobRole),
            convert_to_numeric)
head(ea_cln)

How many of the employees actually resigned?
emp_att %>%
select(Attrition) %>%
group_by(Attrition) %>%
summarise(total_count=n()) %>%
mutate(percentage = (total_count / sum(total_count)) * 100)

Employee Attrition by Department
emp_att %>%
select(c(Attrition, Department)) %>%
group_by(Attrition, Department) %>%
summarise(total_count=n())

Employee Attrition by Gender
emp_att %>%
select(c(Attrition, Gender)) %>%
group_by(Attrition, Gender) %>%
summarise(total_count=n())

If we check the variables one by one (as with gender and department above), the differences are hard to see. Rather than going through each variable individually, we fit a logistic regression model to check which predictors explain attrition best.
Call:
glm(formula = Attrition ~ ., family = "binomial", data = ea_cln)
Coefficients:
Estimate Std. Error z value
(Intercept) 6.094865193 1.303035466 4.677
Age -0.029414017 0.013042566 -2.255
BusinessTravel -0.004028734 0.131556461 -0.031
DailyRate -0.000315710 0.000209533 -1.507
Department -0.607195370 0.174471567 -3.480
DistanceFromHome 0.038606599 0.010274366 3.758
Education 0.030978598 0.083821179 0.370
EducationField 0.159943981 0.059315864 2.696
EmployeeNumber -0.000119581 0.000142957 -0.836
EnvironmentSatisfaction -0.399924340 0.078883976 -5.070
Gender -0.406539962 0.177598337 -2.289
HourlyRate -0.000476374 0.004187871 -0.114
JobInvolvement -0.507323811 0.117682680 -4.311
JobLevel -0.232365127 0.277802474 -0.836
JobRole 0.102008817 0.040841734 2.498
JobSatisfaction -0.372911141 0.078061710 -4.777
MaritalStatus -0.556564876 0.165511511 -3.363
MonthlyIncome -0.000079180 0.000066524 -1.190
MonthlyRate 0.000005541 0.000011981 0.462
NumCompaniesWorked 0.182091866 0.036352423 5.009
OverTime 1.844527957 0.182442703 10.110
PercentSalaryHike -0.050153340 0.037482185 -1.338
PerformanceRating 0.327361507 0.380456753 0.860
RelationshipSatisfaction -0.244772535 0.079541606 -3.077
StockOptionLevel -0.241558733 0.145973448 -1.655
TotalWorkingYears -0.048884647 0.027307027 -1.790
TrainingTimesLastYear -0.154009409 0.069703456 -2.209
WorkLifeBalance -0.322623449 0.116454784 -2.770
YearsAtCompany 0.091360764 0.036473855 2.505
YearsInCurrentRole -0.141023532 0.043341719 -3.254
YearsSinceLastPromotion 0.159495346 0.040286712 3.959
YearsWithCurrManager -0.127626504 0.044161463 -2.890
Pr(>|z|)
(Intercept) 0.000002905 ***
Age 0.024119 *
BusinessTravel 0.975570
DailyRate 0.131880
Department 0.000501 ***
DistanceFromHome 0.000172 ***
Education 0.711696
EducationField 0.007008 **
EmployeeNumber 0.402883
EnvironmentSatisfaction 0.000000398 ***
Gender 0.022074 *
HourlyRate 0.909435
JobInvolvement 0.000016256 ***
JobLevel 0.402907
JobRole 0.012502 *
JobSatisfaction 0.000001778 ***
MaritalStatus 0.000772 ***
MonthlyIncome 0.233952
MonthlyRate 0.643741
NumCompaniesWorked 0.000000547 ***
OverTime < 0.0000000000000002 ***
PercentSalaryHike 0.180878
PerformanceRating 0.389545
RelationshipSatisfaction 0.002089 **
StockOptionLevel 0.097962 .
TotalWorkingYears 0.073424 .
TrainingTimesLastYear 0.027140 *
WorkLifeBalance 0.005599 **
YearsAtCompany 0.012251 *
YearsInCurrentRole 0.001139 **
YearsSinceLastPromotion 0.000075262 ***
YearsWithCurrManager 0.003852 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1298.58 on 1469 degrees of freedom
Residual deviance: 923.64 on 1438 degrees of freedom
AIC: 987.64
Number of Fisher Scoring iterations: 6
From the model above, we take note of the predictors with significance level < 0.001 as the factors that contribute most strongly to an employee's attrition (whether they will resign or not).
Note that whether an employee has worked overtime is the single most significant factor in predicting how likely someone is to resign from their position.
Let's split the data into a training set and a test set to build models for prediction.
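The split itself is not shown; a minimal sketch that matches the 1,176/294 row counts in the model outputs below (an 80/20 split; the seed value is an assumption):

# Reproducible 80/20 train/test split (seed assumed)
set.seed(123)
idx <- sample(nrow(ea_cln), size = 0.8 * nrow(ea_cln))
ea_train <- ea_cln[idx, ]
ea_test  <- ea_cln[-idx, ]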
Base Logistic Regression Model
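The model whose summary follows can be reconstructed directly from its Call:

model_base <- glm(Attrition ~ ., data = ea_train, family = "binomial")
summary(model_base)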
Call:
glm(formula = Attrition ~ ., family = "binomial", data = ea_train)
Coefficients:
Estimate Std. Error z value
(Intercept) 5.518642241 1.485422434 3.715
Age -0.029857725 0.014397873 -2.074
BusinessTravel -0.000468029 0.148311647 -0.003
DailyRate -0.000175185 0.000230698 -0.759
Department -0.617262871 0.194879977 -3.167
DistanceFromHome 0.039406526 0.011410185 3.454
Education 0.088593530 0.093816433 0.944
EducationField 0.152509610 0.066342213 2.299
EmployeeNumber -0.000064731 0.000158766 -0.408
EnvironmentSatisfaction -0.381377091 0.087576288 -4.355
Gender -0.555817870 0.197337074 -2.817
HourlyRate 0.001715846 0.004593870 0.374
JobInvolvement -0.412927652 0.130820889 -3.156
JobLevel -0.285196224 0.306935318 -0.929
JobRole 0.119394226 0.045677550 2.614
JobSatisfaction -0.377832196 0.086195285 -4.383
MaritalStatus -0.542522375 0.181740036 -2.985
MonthlyIncome -0.000096878 0.000075269 -1.287
MonthlyRate 0.000006974 0.000013234 0.527
NumCompaniesWorked 0.209468447 0.039666928 5.281
OverTime 1.914446933 0.201964101 9.479
PercentSalaryHike -0.053309773 0.041785612 -1.276
PerformanceRating 0.391069884 0.427983831 0.914
RelationshipSatisfaction -0.305112567 0.088344798 -3.454
StockOptionLevel -0.264462295 0.161063619 -1.642
TotalWorkingYears -0.046135677 0.030218546 -1.527
TrainingTimesLastYear -0.133005181 0.076868537 -1.730
WorkLifeBalance -0.410563478 0.129354941 -3.174
YearsAtCompany 0.126452475 0.040800222 3.099
YearsInCurrentRole -0.155515675 0.047825813 -3.252
YearsSinceLastPromotion 0.161650603 0.043821052 3.689
YearsWithCurrManager -0.151006422 0.051060614 -2.957
Pr(>|z|)
(Intercept) 0.000203 ***
Age 0.038102 *
BusinessTravel 0.997482
DailyRate 0.447632
Department 0.001538 **
DistanceFromHome 0.000553 ***
Education 0.345002
EducationField 0.021514 *
EmployeeNumber 0.683484
EnvironmentSatisfaction 0.000013319 ***
Gender 0.004854 **
HourlyRate 0.708771
JobInvolvement 0.001597 **
JobLevel 0.352799
JobRole 0.008953 **
JobSatisfaction 0.000011682 ***
MaritalStatus 0.002834 **
MonthlyIncome 0.198063
MonthlyRate 0.598212
NumCompaniesWorked 0.000000129 ***
OverTime < 0.0000000000000002 ***
PercentSalaryHike 0.202029
PerformanceRating 0.360849
RelationshipSatisfaction 0.000553 ***
StockOptionLevel 0.100595
TotalWorkingYears 0.126827
TrainingTimesLastYear 0.083578 .
WorkLifeBalance 0.001504 **
YearsAtCompany 0.001940 **
YearsInCurrentRole 0.001147 **
YearsSinceLastPromotion 0.000225 ***
YearsWithCurrManager 0.003103 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1069.32 on 1175 degrees of freedom
Residual deviance: 752.96 on 1144 degrees of freedom
AIC: 816.96
Number of Fisher Scoring iterations: 6
Logistic Regression using predictors with significance level < 0.001

model_sig <- glm(Attrition ~ Department + DistanceFromHome + EnvironmentSatisfaction +
                   JobInvolvement + JobSatisfaction + MaritalStatus +
                   NumCompaniesWorked + OverTime + YearsSinceLastPromotion,
                 data = ea_train, family = "binomial")
summary(model_sig)
Call:
glm(formula = Attrition ~ Department + DistanceFromHome + EnvironmentSatisfaction +
JobInvolvement + JobSatisfaction + MaritalStatus + NumCompaniesWorked +
OverTime + YearsSinceLastPromotion, family = "binomial",
data = ea_train)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.93095 0.57193 3.376 0.000735
Department -0.28331 0.16465 -1.721 0.085307
DistanceFromHome 0.02780 0.01002 2.775 0.005521
EnvironmentSatisfaction -0.30731 0.07819 -3.930 0.00008486634
JobInvolvement -0.46996 0.11790 -3.986 0.00006713646
JobSatisfaction -0.29574 0.07593 -3.895 0.00009820334
MaritalStatus -0.72346 0.12494 -5.790 0.00000000703
NumCompaniesWorked 0.08340 0.03266 2.554 0.010663
OverTime 1.59343 0.17480 9.116 < 0.0000000000000002
YearsSinceLastPromotion -0.01701 0.02787 -0.610 0.541682
(Intercept) ***
Department .
DistanceFromHome **
EnvironmentSatisfaction ***
JobInvolvement ***
JobSatisfaction ***
MaritalStatus ***
NumCompaniesWorked *
OverTime ***
YearsSinceLastPromotion
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1069.32 on 1175 degrees of freedom
Residual deviance: 894.54 on 1166 degrees of freedom
AIC: 914.54
Number of Fisher Scoring iterations: 5
Stepwise Logistic Regression to find the model with the smallest AIC
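The stepwise fit is not shown; a sketch assuming backward elimination from the base model (the direction is an assumption, but the formula in the Call below is a subset of the full model, which is consistent with it):

# Backward stepwise selection by AIC (direction assumed)
model_step <- step(model_base, direction = "backward", trace = FALSE)
summary(model_step)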
Call:
glm(formula = Attrition ~ Age + Department + DistanceFromHome +
EducationField + EnvironmentSatisfaction + Gender + JobInvolvement +
JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome +
NumCompaniesWorked + OverTime + RelationshipSatisfaction +
StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion +
YearsWithCurrManager, family = "binomial", data = ea_train)
Coefficients:
Estimate Std. Error z value
(Intercept) 5.93459384 0.93389577 6.355
Age -0.02873555 0.01403087 -2.048
Department -0.56862062 0.18995364 -2.993
DistanceFromHome 0.03838888 0.01128931 3.400
EducationField 0.15488422 0.06609420 2.343
EnvironmentSatisfaction -0.38050523 0.08697840 -4.375
Gender -0.54727602 0.19632787 -2.788
JobInvolvement -0.40548884 0.13039656 -3.110
JobRole 0.11567999 0.04508837 2.566
JobSatisfaction -0.38375993 0.08528478 -4.500
MaritalStatus -0.55196664 0.18099838 -3.050
MonthlyIncome -0.00015464 0.00004196 -3.686
NumCompaniesWorked 0.20994659 0.03932347 5.339
OverTime 1.91284609 0.20082985 9.525
RelationshipSatisfaction -0.30416398 0.08712048 -3.491
StockOptionLevel -0.26098842 0.15975689 -1.634
TotalWorkingYears -0.05141708 0.02931597 -1.754
TrainingTimesLastYear -0.13124170 0.07669190 -1.711
WorkLifeBalance -0.40648707 0.12851073 -3.163
YearsAtCompany 0.12692345 0.04110856 3.088
YearsInCurrentRole -0.15219084 0.04743761 -3.208
YearsSinceLastPromotion 0.16930421 0.04333034 3.907
YearsWithCurrManager -0.15667826 0.05071895 -3.089
Pr(>|z|)
(Intercept) 0.000000000209 ***
Age 0.040558 *
Department 0.002758 **
DistanceFromHome 0.000673 ***
EducationField 0.019110 *
EnvironmentSatisfaction 0.000012159441 ***
Gender 0.005311 **
JobInvolvement 0.001873 **
JobRole 0.010299 *
JobSatisfaction 0.000006803439 ***
MaritalStatus 0.002292 **
MonthlyIncome 0.000228 ***
NumCompaniesWorked 0.000000093479 ***
OverTime < 0.0000000000000002 ***
RelationshipSatisfaction 0.000481 ***
StockOptionLevel 0.102330
TotalWorkingYears 0.079449 .
TrainingTimesLastYear 0.087028 .
WorkLifeBalance 0.001561 **
YearsAtCompany 0.002018 **
YearsInCurrentRole 0.001336 **
YearsSinceLastPromotion 0.000093337114 ***
YearsWithCurrManager 0.002007 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1069.32 on 1175 degrees of freedom
Residual deviance: 757.52 on 1153 degrees of freedom
AIC: 803.52
Number of Fisher Scoring iterations: 6
Predicting with the models
# Predict model base
ea_test$pred_att_bs <- predict(model_base, newdata = ea_test, type = "response")
# Predict significant only model
ea_test$pred_att_sig <- predict(model_sig, newdata = ea_test, type = "response")
# Predict step wise model
ea_test$pred_att_stp <- predict(model_step, newdata = ea_test, type = "response")

Converting predicted probabilities to class labels
# Label model base
ea_test$label_bs <- ifelse(ea_test$pred_att_bs>0.5, 1, 0)
# Label model sig
ea_test$label_sig <- ifelse(ea_test$pred_att_sig>0.5, 1, 0)
# Label model step
ea_test$label_stp <- ifelse(ea_test$pred_att_stp>0.5, 1, 0)

Evaluating the Models
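The three confusion matrices below (base, significant-only, and stepwise, in that order) were presumably produced with caret's confusionMatrix, mirroring the call shown later for Naive Bayes:

# Evaluate each model's labels against the true attrition labels
confusionMatrix(data = as.factor(ea_test$label_bs), reference = as.factor(ea_test$Attrition), positive = "1")
confusionMatrix(data = as.factor(ea_test$label_sig), reference = as.factor(ea_test$Attrition), positive = "1")
confusionMatrix(data = as.factor(ea_test$label_stp), reference = as.factor(ea_test$Attrition), positive = "1")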
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 246 24
1 10 14
Accuracy : 0.8844
95% CI : (0.8422, 0.9186)
No Information Rate : 0.8707
P-Value [Acc > NIR] : 0.27602
Kappa : 0.3906
Mcnemar's Test P-Value : 0.02578
Sensitivity : 0.36842
Specificity : 0.96094
Pos Pred Value : 0.58333
Neg Pred Value : 0.91111
Prevalence : 0.12925
Detection Rate : 0.04762
Detection Prevalence : 0.08163
Balanced Accuracy : 0.66468
'Positive' Class : 1
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 250 33
1 6 5
Accuracy : 0.8673
95% CI : (0.8231, 0.9039)
No Information Rate : 0.8707
P-Value [Acc > NIR] : 0.6105
Kappa : 0.155
Mcnemar's Test P-Value : 0.00003136
Sensitivity : 0.13158
Specificity : 0.97656
Pos Pred Value : 0.45455
Neg Pred Value : 0.88339
Prevalence : 0.12925
Detection Rate : 0.01701
Detection Prevalence : 0.03741
Balanced Accuracy : 0.55407
'Positive' Class : 1
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 245 25
1 11 13
Accuracy : 0.8776
95% CI : (0.8345, 0.9127)
No Information Rate : 0.8707
P-Value [Acc > NIR] : 0.40497
Kappa : 0.3548
Mcnemar's Test P-Value : 0.03026
Sensitivity : 0.34211
Specificity : 0.95703
Pos Pred Value : 0.54167
Neg Pred Value : 0.90741
Prevalence : 0.12925
Detection Rate : 0.04422
Detection Prevalence : 0.08163
Balanced Accuracy : 0.64957
'Positive' Class : 1
Result: the base model performs best on the test set (accuracy 0.8844, sensitivity 0.3684), the stepwise model comes close (accuracy 0.8776, sensitivity 0.3421) with a smaller AIC, while the significant-predictors-only model sacrifices most of its sensitivity (0.1316) without gaining accuracy.
K-Nearest Neighbors (KNN)
Separating predictor and target variables.
# Predictor data train
ea_trn_x <- ea_train %>% select(-Attrition)
# Target data train
ea_trn_y <- ea_train[,"Attrition"]
# Predictor data test
ea_tst_x <- ea_test %>% select(
-c(Attrition, pred_att_bs, pred_att_sig, pred_att_stp, label_bs, label_sig, label_stp))
# Target data test
ea_tst_y <- ea_test[,"Attrition"]

Scaling data.
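The scaling step is not shown; a standard sketch, z-scaling the test predictors with the training set's center and scale (the names ea_trn_xs and ea_tst_xs match the knn() calls below):

# Scale training predictors, then apply the same center/scale to the test set
ea_trn_xs <- scale(ea_trn_x)
ea_tst_xs <- scale(ea_tst_x,
                   center = attr(ea_trn_xs, "scaled:center"),
                   scale = attr(ea_trn_xs, "scaled:scale"))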
Finding the optimum K (a common heuristic is the square root of the number of training observations)
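The value below is the square root of the 1,176 training rows:

sqrt(nrow(ea_train))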
[1] 34.29286
Because the target has two classes (yes/no), we take K = 33 or 35 (odd numbers). This prevents tied votes in KNN's majority voting.
Predicting the data
# Predict using N = 33
pr_knn1 <- knn(train = ea_trn_xs, # pred data train
test = ea_tst_xs, # pred data test
cl = ea_trn_y, # label data train
k = 33)
# Predict using N = 35
pr_knn2 <- knn(train = ea_trn_xs, # pred data train
test = ea_tst_xs, # pred data test
cl = ea_trn_y, # label data train
k = 35)
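The two matrices below evaluate pr_knn1 (K = 33) and pr_knn2 (K = 35), presumably with caret's confusionMatrix as elsewhere in this analysis:

# Evaluate both KNN predictions against the true test labels
confusionMatrix(data = pr_knn1, reference = as.factor(ea_test$Attrition), positive = "1")
confusionMatrix(data = pr_knn2, reference = as.factor(ea_test$Attrition), positive = "1")

Confusion Matrix and Statistics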
Reference
Prediction 0 1
0 256 38
1 0 0
Accuracy : 0.8707
95% CI : (0.8269, 0.9069)
No Information Rate : 0.8707
P-Value [Acc > NIR] : 0.5431
Kappa : 0
Mcnemar's Test P-Value : 0.000000001947
Sensitivity : 0.0000
Specificity : 1.0000
Pos Pred Value : NaN
Neg Pred Value : 0.8707
Prevalence : 0.1293
Detection Rate : 0.0000
Detection Prevalence : 0.0000
Balanced Accuracy : 0.5000
'Positive' Class : 1
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 256 38
1 0 0
Accuracy : 0.8707
95% CI : (0.8269, 0.9069)
No Information Rate : 0.8707
P-Value [Acc > NIR] : 0.5431
Kappa : 0
Mcnemar's Test P-Value : 0.000000001947
Sensitivity : 0.0000
Specificity : 1.0000
Pos Pred Value : NaN
Neg Pred Value : 0.8707
Prevalence : 0.1293
Detection Rate : 0.0000
Detection Prevalence : 0.0000
Balanced Accuracy : 0.5000
'Positive' Class : 1
Result: both K values produce identical predictions, and KNN never predicts the positive class (sensitivity = 0, Kappa = 0), so it does no better than always predicting "no attrition".
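Naive Bayes
The predictions evaluated below were presumably produced with e1071's naiveBayes (loaded at the top); a minimal sketch, with the model object name assumed:

# Train a Naive Bayes classifier on the training predictors and labels
model_nv <- naiveBayes(x = ea_trn_x, y = ea_trn_y)
# Predict class labels for the test predictors
ea_test$pred_nv <- predict(model_nv, ea_tst_x)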
confusionMatrix(data = as.factor(ea_test$pred_nv), reference = as.factor(ea_test$Attrition), positive = "1")

Confusion Matrix and Statistics
Reference
Prediction 0 1
0 218 12
1 38 26
Accuracy : 0.8299
95% CI : (0.782, 0.8711)
No Information Rate : 0.8707
P-Value [Acc > NIR] : 0.982321
Kappa : 0.4149
Mcnemar's Test P-Value : 0.000407
Sensitivity : 0.68421
Specificity : 0.85156
Pos Pred Value : 0.40625
Neg Pred Value : 0.94783
Prevalence : 0.12925
Detection Rate : 0.08844
Detection Prevalence : 0.21769
Balanced Accuracy : 0.76789
'Positive' Class : 1
Result: Naive Bayes trades accuracy (0.8299) for much better recall (sensitivity 0.6842), catching far more of the employees who actually leave, at the cost of lower precision (0.4063).
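Random Forest
The training of ea_forest is not shown. A hypothetical reconstruction, assuming caret::train was used (its predict method accepts type = "raw", matching the prediction code below); the method and tuning are assumptions:

# Hypothetical: train a conditional-inference random forest via caret
set.seed(123)  # assumed seed
ea_forest <- train(Attrition ~ ., data = ea_train, method = "cforest")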
# Predict with the random forest model; type = "raw" returns class labels
ea_test$pred_rf_raw <- predict(object = ea_forest,
                               newdata = ea_test %>% select(-c(pred_att_bs, pred_att_sig, pred_att_stp,
                                                               label_bs, label_sig, label_stp, pred_nv)),
                               type = "raw") %>% as.character() %>% as.numeric()
# Pass the 0/1 labels through the logistic function (0 -> 0.5, 1 -> ~0.73);
# with the 0.55 cutoff below this simply reproduces the predicted classes
ea_test$pred_rf_prob <- exp(ea_test$pred_rf_raw) / (1 + exp(ea_test$pred_rf_raw))
# Label the random forest predictions
ea_test$label_rf <- ifelse(ea_test$pred_rf_prob > 0.55, 1, 0)

confusionMatrix(data = as.factor(ea_test$label_rf), reference = as.factor(ea_test$Attrition), positive = "1")

Confusion Matrix and Statistics
Reference
Prediction 0 1
0 251 25
1 5 13
Accuracy : 0.898
95% CI : (0.8575, 0.9301)
No Information Rate : 0.8707
P-Value [Acc > NIR] : 0.0932007
Kappa : 0.4157
Mcnemar's Test P-Value : 0.0005226
Sensitivity : 0.34211
Specificity : 0.98047
Pos Pred Value : 0.72222
Neg Pred Value : 0.90942
Prevalence : 0.12925
Detection Rate : 0.04422
Detection Prevalence : 0.06122
Balanced Accuracy : 0.66129
'Positive' Class : 1
Result: the random forest gives the highest accuracy (0.898) and precision (0.7222) of all the models, though its sensitivity (0.3421) is similar to the logistic models. If the goal is to flag as many likely leavers as possible, Naive Bayes' much higher recall makes it the better choice despite its lower accuracy.