This analysis will try to classify employee attrition. The dataset is downloaded from Kaggle: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset
This analysis is part of the Algoritma LBB Project for the C1 and C2 classes.
For C2 class LBB, please scroll to Decision Tree and Random Forest part.
In this project, I will create a Logistic Regression model and a KNN model with different parameters and tuning, and compare them to find which predicts better for this dataset.
If you have any suggestions or questions, connect with me on LinkedIn.
library(tidyverse)
library(ggthemr)
library(ggpubr)
library(scales)
library(skimr)
library(GGally)
library(corrr)
library(corrplot)
library(brglm2)
library(ROSE)
library(ROCR)
library(caret)
library(plotROC)
library(plotly)
library(yardstick)
library(partykit)
Now we will get a glimpse of the data. Personally, I prefer the skim() function from the skimr package over glimpse() or summary() because it gives more detailed information about the data, including the missing rate, complete rate, number of unique values, and column type. For numeric columns, skim() also returns the five-number summary with a cute little histogram.
| Data summary | |
|---|---|
| Name | dataHR |
| Number of rows | 1470 |
| Number of columns | 35 |
| Column type frequency: | |
| factor | 9 |
| numeric | 26 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| Attrition | 0 | 1 | FALSE | 2 | No: 1233, Yes: 237 |
| BusinessTravel | 0 | 1 | FALSE | 3 | Tra: 1043, Tra: 277, Non: 150 |
| Department | 0 | 1 | FALSE | 3 | Res: 961, Sal: 446, Hum: 63 |
| EducationField | 0 | 1 | FALSE | 6 | Lif: 606, Med: 464, Mar: 159, Tec: 132 |
| Gender | 0 | 1 | FALSE | 2 | Mal: 882, Fem: 588 |
| JobRole | 0 | 1 | FALSE | 9 | Sal: 326, Res: 292, Lab: 259, Man: 145 |
| MaritalStatus | 0 | 1 | FALSE | 3 | Mar: 673, Sin: 470, Div: 327 |
| Over18 | 0 | 1 | FALSE | 1 | Y: 1470 |
| OverTime | 0 | 1 | FALSE | 2 | No: 1054, Yes: 416 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| ï..Age | 0 | 1 | 36.92 | 9.14 | 18 | 30.00 | 36.0 | 43.00 | 60 | ▂▇▇▃▂ |
| DailyRate | 0 | 1 | 802.49 | 403.51 | 102 | 465.00 | 802.0 | 1157.00 | 1499 | ▇▇▇▇▇ |
| DistanceFromHome | 0 | 1 | 9.19 | 8.11 | 1 | 2.00 | 7.0 | 14.00 | 29 | ▇▅▂▂▂ |
| Education | 0 | 1 | 2.91 | 1.02 | 1 | 2.00 | 3.0 | 4.00 | 5 | ▂▃▇▆▁ |
| EmployeeCount | 0 | 1 | 1.00 | 0.00 | 1 | 1.00 | 1.0 | 1.00 | 1 | ▁▁▇▁▁ |
| EmployeeNumber | 0 | 1 | 1024.87 | 602.02 | 1 | 491.25 | 1020.5 | 1555.75 | 2068 | ▇▇▇▇▇ |
| EnvironmentSatisfaction | 0 | 1 | 2.72 | 1.09 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| HourlyRate | 0 | 1 | 65.89 | 20.33 | 30 | 48.00 | 66.0 | 83.75 | 100 | ▇▇▇▇▇ |
| JobInvolvement | 0 | 1 | 2.73 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▁ |
| JobLevel | 0 | 1 | 2.06 | 1.11 | 1 | 1.00 | 2.0 | 3.00 | 5 | ▇▇▃▂▁ |
| JobSatisfaction | 0 | 1 | 2.73 | 1.10 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| MonthlyIncome | 0 | 1 | 6502.93 | 4707.96 | 1009 | 2911.00 | 4919.0 | 8379.00 | 19999 | ▇▅▂▁▂ |
| MonthlyRate | 0 | 1 | 14313.10 | 7117.79 | 2094 | 8047.00 | 14235.5 | 20461.50 | 26999 | ▇▇▇▇▇ |
| NumCompaniesWorked | 0 | 1 | 2.69 | 2.50 | 0 | 1.00 | 2.0 | 4.00 | 9 | ▇▃▂▂▁ |
| PercentSalaryHike | 0 | 1 | 15.21 | 3.66 | 11 | 12.00 | 14.0 | 18.00 | 25 | ▇▅▃▂▁ |
| PerformanceRating | 0 | 1 | 3.15 | 0.36 | 3 | 3.00 | 3.0 | 3.00 | 4 | ▇▁▁▁▂ |
| RelationshipSatisfaction | 0 | 1 | 2.71 | 1.08 | 1 | 2.00 | 3.0 | 4.00 | 4 | ▅▅▁▇▇ |
| StandardHours | 0 | 1 | 80.00 | 0.00 | 80 | 80.00 | 80.0 | 80.00 | 80 | ▁▁▇▁▁ |
| StockOptionLevel | 0 | 1 | 0.79 | 0.85 | 0 | 0.00 | 1.0 | 1.00 | 3 | ▇▇▁▂▁ |
| TotalWorkingYears | 0 | 1 | 11.28 | 7.78 | 0 | 6.00 | 10.0 | 15.00 | 40 | ▇▇▂▁▁ |
| TrainingTimesLastYear | 0 | 1 | 2.80 | 1.29 | 0 | 2.00 | 3.0 | 3.00 | 6 | ▂▇▇▂▃ |
| WorkLifeBalance | 0 | 1 | 2.76 | 0.71 | 1 | 2.00 | 3.0 | 3.00 | 4 | ▁▃▁▇▂ |
| YearsAtCompany | 0 | 1 | 7.01 | 6.13 | 0 | 3.00 | 5.0 | 9.00 | 40 | ▇▂▁▁▁ |
| YearsInCurrentRole | 0 | 1 | 4.23 | 3.62 | 0 | 2.00 | 3.0 | 7.00 | 18 | ▇▃▂▁▁ |
| YearsSinceLastPromotion | 0 | 1 | 2.19 | 3.22 | 0 | 0.00 | 1.0 | 3.00 | 15 | ▇▁▁▁▁ |
| YearsWithCurrManager | 0 | 1 | 4.12 | 3.57 | 0 | 2.00 | 3.0 | 7.00 | 17 | ▇▂▅▁▁ |
Now we will do some data transformation, such as renaming the mis-encoded column name (ï..Age) and converting some columns to the appropriate type. A column like Education, for instance, is really a factor, not an integer or numeric.
names(dataHR)[names(dataHR) == "ï..Age"] <- "Age"
dataHR$Education <- as.factor(dataHR$Education)
dataHR$EnvironmentSatisfaction <- as.factor(dataHR$EnvironmentSatisfaction)
dataHR$JobInvolvement <- as.factor(dataHR$JobInvolvement)
dataHR$JobLevel <- as.factor(dataHR$JobLevel)
dataHR$JobSatisfaction <- as.factor(dataHR$JobSatisfaction)
dataHR$StockOptionLevel <- as.factor(dataHR$StockOptionLevel)
dataHR$PerformanceRating <- as.factor(dataHR$PerformanceRating)
dataHR$RelationshipSatisfaction <- as.factor(dataHR$RelationshipSatisfaction)
dataHR$WorkLifeBalance <- as.factor(dataHR$WorkLifeBalance)
Some columns (EmployeeCount, Over18, StandardHours) contain only a single value for the entire column. Such columns are not informative for modeling, so we will remove them.
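A minimal sketch of that step, assuming dplyr and the constant columns identified in the skim() output above:
# drop the single-valued columns; they carry no information for the model
dataHR <- dataHR %>%
  select(-EmployeeCount, -Over18, -StandardHours)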
We will make sure once again that we don’t have NA values in our dataset.
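For example, the check can be a one-liner:
sum(is.na(dataHR))  # total count of NA values across the whole data frame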
## [1] 0
Our target variable is in the Attrition column. Before doing any analysis or modeling, we want to know the distribution of our Attrition variable. If it is imbalanced, further handling will be needed.
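A sketch of the chunk producing the counts below (the object name dist_attr matches its use in the plot afterwards):
# count employees per Attrition class
dist_attr <- dataHR %>%
  group_by(Attrition) %>%
  summarise(Total = n())
dist_attr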
## # A tibble: 2 x 2
## Attrition Total
## <fct> <int>
## 1 No 1233
## 2 Yes 237
dist_attr %>%
ggplot(aes(x=Attrition, y=Total)) +
geom_col() +
ggtitle("Total Numbers of Attrition")
We also want to see the distribution of our employees' age by gender.
mean_age <- dataHR %>%
group_by(Gender) %>%
summarise(mean = mean(Age),
median = median(Age)) %>%
print()
## # A tibble: 2 x 3
## Gender mean median
## <fct> <dbl> <dbl>
## 1 Female 37.3 36
## 2 Male 36.7 35
plot1 <- dataHR %>%
ggplot(aes(x=Age)) +
geom_density(fill = "green", alpha = 0.5) +
geom_vline(aes(xintercept = mean(Age))) +
labs(title = "Age Distribution")
plot2 <- dataHR %>%
filter(Gender == "Male") %>%
ggplot(aes(x=Age)) +
geom_density(fill = "blue", alpha = 0.5) +
geom_vline(aes(xintercept = mean(Age))) +
labs(title = "Male Age Distribution")
plot3 <- dataHR %>%
filter(Gender == "Female") %>%
ggplot(aes(x=Age)) +
geom_density(fill = "red", alpha = 0.5) +
geom_vline(aes(xintercept = mean(Age))) +
labs(title = "Female Age Distribution")
ggarrange(plot1,
ggarrange(plot2, plot3),
nrow = 2)
dist_attr_gender <- dataHR %>%
group_by(Attrition, Gender) %>%
summarise(Total = n())
print(dist_attr_gender)
## # A tibble: 4 x 3
## # Groups: Attrition [2]
## Attrition Gender Total
## <fct> <fct> <int>
## 1 No Female 501
## 2 No Male 732
## 3 Yes Female 87
## 4 Yes Male 150
pie_attr_male <- dist_attr_gender %>%
filter(Gender == "Male") %>%
ggplot(aes(x="", y=Total, fill=Attrition)) +
geom_bar(width=1, stat="identity") +
coord_polar("y", start=0) +
ggtitle("Pie Chart \nAttrition Male") +
geom_text(aes(y = Total/2 + c(5, 10),
label = percent(Total/sum(Total))), size=5)
pie_attr_female <- dist_attr_gender %>%
filter(Gender == "Female") %>%
ggplot(aes(x="", y=Total, fill=Attrition)) +
geom_bar(width=1, stat="identity") +
coord_polar("y", start=0) +
ggtitle("Pie Chart \nAttrition Female") +
geom_text(aes(y = Total/2 + c(5, 10),
label = percent(Total/sum(Total))), size=5)
ggarrange(pie_attr_male, pie_attr_female)
Our variable of interest is Attrition, which indicates whether an employee belongs to the attrition group (Attrition = Yes) or not (Attrition = No).
The Cambridge Dictionary defines attrition as "a reduction in the number of people who work for an organization that is achieved by not replacing those people who leave."
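The first few values and the levels of the target (a sketch of the likely chunk):
head(dataHR$Attrition)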
## [1] Yes No Yes No No No
## Levels: No Yes
We will split the dataset into 80% training data and 20% test data, using the set.seed() function so that R produces the same random numbers and this report stays reproducible.
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(18)
index <- sample(nrow(dataHR), nrow(dataHR)*0.8)
data_train <- dataHR[index, ]
data_test <- dataHR[-index, ]
Before doing any analysis, we must make sure that we have balanced training data. A model trained on imbalanced data will most likely perform worse than one trained on balanced data, so we will balance the training set with an upsampling technique using the ovun.sample() function from the ROSE package.
Looking at the training data distribution, we see quite a large skew in our target variable.
##
## No Yes
## 996 180
Before modeling, we need to balance the data. A model fit on imbalanced data can perform poorly because the algorithm cannot learn well enough from the minority class. We will use an upsampling technique to treat the imbalance.
train_balanced <- ovun.sample(Attrition ~ ., data = data_train, method = "over",
N = 996*2, seed = 1)$data
table(train_balanced$Attrition)
##
## No Yes
## 996 996
After the data is balanced, we will build our first Logistic Regression model using all the predictors in the formula.
model_log_full <- glm(Attrition ~ ., family = "binomial", data = train_balanced)
summary(model_log_full)
##
## Call:
## glm(formula = Attrition ~ ., family = "binomial", data = train_balanced)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8672 -0.5173 0.0246 0.5847 3.4560
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.801e+00 3.256e+02 -0.030 0.975983
## Age -3.858e-02 1.031e-02 -3.741 0.000183 ***
## BusinessTravelTravel_Frequently 2.260e+00 3.203e-01 7.054 1.74e-12 ***
## BusinessTravelTravel_Rarely 1.224e+00 2.896e-01 4.226 2.38e-05 ***
## DailyRate -4.255e-04 1.751e-04 -2.429 0.015120 *
## DepartmentResearch & Development 1.521e+01 3.256e+02 0.047 0.962735
## DepartmentSales 1.505e+01 3.256e+02 0.046 0.963137
## DistanceFromHome 4.566e-02 8.898e-03 5.132 2.87e-07 ***
## Education2 -1.167e-02 2.495e-01 -0.047 0.962693
## Education3 -9.487e-02 2.207e-01 -0.430 0.667304
## Education4 2.541e-01 2.422e-01 1.049 0.294043
## Education5 2.890e-01 4.320e-01 0.669 0.503559
## EducationFieldLife Sciences -1.471e+00 6.599e-01 -2.229 0.025842 *
## EducationFieldMarketing -6.816e-01 6.877e-01 -0.991 0.321555
## EducationFieldMedical -1.346e+00 6.509e-01 -2.068 0.038676 *
## EducationFieldOther -8.496e-01 6.986e-01 -1.216 0.223902
## EducationFieldTechnical Degree 1.691e-01 6.741e-01 0.251 0.801992
## EmployeeNumber -2.245e-04 1.224e-04 -1.834 0.066652 .
## EnvironmentSatisfaction2 -1.244e+00 2.258e-01 -5.509 3.60e-08 ***
## EnvironmentSatisfaction3 -1.090e+00 2.059e-01 -5.296 1.18e-07 ***
## EnvironmentSatisfaction4 -1.111e+00 2.029e-01 -5.473 4.41e-08 ***
## GenderMale 4.716e-01 1.448e-01 3.258 0.001124 **
## HourlyRate -1.655e-03 3.419e-03 -0.484 0.628366
## JobInvolvement2 -1.240e+00 3.224e-01 -3.846 0.000120 ***
## JobInvolvement3 -1.580e+00 3.078e-01 -5.133 2.85e-07 ***
## JobInvolvement4 -1.678e+00 3.808e-01 -4.408 1.05e-05 ***
## JobLevel2 -1.025e+00 2.986e-01 -3.433 0.000596 ***
## JobLevel3 1.023e+00 5.256e-01 1.946 0.051666 .
## JobLevel4 -2.091e+00 9.785e-01 -2.137 0.032572 *
## JobLevel5 3.386e+00 1.189e+00 2.848 0.004400 **
## JobRoleHuman Resources 1.642e+01 3.256e+02 0.050 0.959764
## JobRoleLaboratory Technician 2.041e+00 4.764e-01 4.284 1.83e-05 ***
## JobRoleManager -3.228e-01 8.494e-01 -0.380 0.703927
## JobRoleManufacturing Director 1.068e+00 4.715e-01 2.264 0.023555 *
## JobRoleResearch Director -2.529e+00 9.117e-01 -2.774 0.005531 **
## JobRoleResearch Scientist 1.107e+00 4.789e-01 2.312 0.020776 *
## JobRoleSales Executive 2.459e+00 1.007e+00 2.442 0.014620 *
## JobRoleSales Representative 3.035e+00 1.061e+00 2.862 0.004213 **
## JobSatisfaction2 -1.168e+00 2.168e-01 -5.387 7.17e-08 ***
## JobSatisfaction3 -8.921e-01 1.891e-01 -4.718 2.38e-06 ***
## JobSatisfaction4 -1.750e+00 2.043e-01 -8.564 < 2e-16 ***
## MaritalStatusMarried 5.933e-01 2.022e-01 2.935 0.003338 **
## MaritalStatusSingle 6.732e-01 2.930e-01 2.297 0.021591 *
## MonthlyIncome -1.466e-04 7.005e-05 -2.093 0.036321 *
## MonthlyRate 9.625e-06 9.522e-06 1.011 0.312118
## NumCompaniesWorked 2.380e-01 3.096e-02 7.687 1.50e-14 ***
## OverTimeYes 2.020e+00 1.534e-01 13.166 < 2e-16 ***
## PercentSalaryHike -6.368e-02 3.011e-02 -2.115 0.034450 *
## PerformanceRating4 2.976e-01 3.191e-01 0.933 0.351078
## RelationshipSatisfaction2 -1.149e+00 2.183e-01 -5.264 1.41e-07 ***
## RelationshipSatisfaction3 -1.045e+00 1.955e-01 -5.347 8.93e-08 ***
## RelationshipSatisfaction4 -1.094e+00 1.970e-01 -5.552 2.82e-08 ***
## StockOptionLevel1 -1.274e+00 2.405e-01 -5.299 1.16e-07 ***
## StockOptionLevel2 -1.131e+00 3.153e-01 -3.586 0.000336 ***
## StockOptionLevel3 -3.910e-01 3.662e-01 -1.068 0.285618
## TotalWorkingYears -3.083e-02 2.078e-02 -1.483 0.138032
## TrainingTimesLastYear -1.209e-01 5.217e-02 -2.318 0.020468 *
## WorkLifeBalance2 -9.685e-01 2.946e-01 -3.288 0.001008 **
## WorkLifeBalance3 -1.549e+00 2.815e-01 -5.503 3.73e-08 ***
## WorkLifeBalance4 -6.003e-01 3.325e-01 -1.805 0.071037 .
## YearsAtCompany 2.011e-01 3.016e-02 6.670 2.56e-11 ***
## YearsInCurrentRole -2.519e-01 4.021e-02 -6.266 3.70e-10 ***
## YearsSinceLastPromotion 1.024e-01 3.173e-02 3.225 0.001258 **
## YearsWithCurrManager -1.281e-01 3.721e-02 -3.444 0.000573 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2761.5 on 1991 degrees of freedom
## Residual deviance: 1511.3 on 1928 degrees of freedom
## AIC: 1639.3
##
## Number of Fisher Scoring iterations: 14
We will check whether perfect separation exists in our data (see https://cran.r-project.org/web/packages/brglm2/vignettes/separation.html).
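A sketch of that check, assuming the brglm2 version in which "detect_separation" is accepted as a glm fitting method, as in the vignette linked above:
# refit the same formula, but only to detect separation instead of estimating
glm(Attrition ~ ., family = "binomial", data = train_balanced,
    method = "detect_separation")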
## Separation: FALSE
## Existence of maximum likelihood estimates
## (Intercept) Age
## -Inf 0
## BusinessTravelTravel_Frequently BusinessTravelTravel_Rarely
## 0 0
## DailyRate DepartmentResearch & Development
## 0 Inf
## DepartmentSales DistanceFromHome
## Inf 0
## Education2 Education3
## 0 0
## Education4 Education5
## 0 0
## EducationFieldLife Sciences EducationFieldMarketing
## 0 0
## EducationFieldMedical EducationFieldOther
## 0 0
## EducationFieldTechnical Degree EmployeeNumber
## 0 0
## EnvironmentSatisfaction2 EnvironmentSatisfaction3
## 0 0
## EnvironmentSatisfaction4 GenderMale
## 0 0
## HourlyRate JobInvolvement2
## 0 0
## JobInvolvement3 JobInvolvement4
## 0 0
## JobLevel2 JobLevel3
## 0 0
## JobLevel4 JobLevel5
## 0 0
## JobRoleHuman Resources JobRoleLaboratory Technician
## Inf 0
## JobRoleManager JobRoleManufacturing Director
## 0 0
## JobRoleResearch Director JobRoleResearch Scientist
## 0 0
## JobRoleSales Executive JobRoleSales Representative
## 0 0
## JobSatisfaction2 JobSatisfaction3
## 0 0
## JobSatisfaction4 MaritalStatusMarried
## 0 0
## MaritalStatusSingle MonthlyIncome
## 0 0
## MonthlyRate NumCompaniesWorked
## 0 0
## OverTimeYes PercentSalaryHike
## 0 0
## PerformanceRating4 RelationshipSatisfaction2
## 0 0
## RelationshipSatisfaction3 RelationshipSatisfaction4
## 0 0
## StockOptionLevel1 StockOptionLevel2
## 0 0
## StockOptionLevel3 TotalWorkingYears
## 0 0
## TrainingTimesLastYear WorkLifeBalance2
## 0 0
## WorkLifeBalance3 WorkLifeBalance4
## 0 0
## YearsAtCompany YearsInCurrentRole
## 0 0
## YearsSinceLastPromotion YearsWithCurrManager
## 0 0
## 0: finite value, Inf: infinity, -Inf: -infinity
FALSE means no perfect separation exists in our model.
Next, we will do feature selection by applying the step() function to our full model. Using the backward option, the function iterates over the model, starting with all predictors (the full model as we input it) and removing the least contributive predictors.
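A minimal sketch of that call (step() minimises AIC by default; the object name model_log_bw is an assumption matching its later use):
model_log_bw <- step(model_log_full, direction = "backward", trace = FALSE)
summary(model_log_bw)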
##
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + DailyRate +
## Department + DistanceFromHome + EducationField + EmployeeNumber +
## EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel +
## JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome +
## NumCompaniesWorked + OverTime + PercentSalaryHike + RelationshipSatisfaction +
## StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
## WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion +
## YearsWithCurrManager, family = "binomial", data = train_balanced)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.8611 -0.5238 0.0286 0.5907 3.4512
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.020e+01 3.216e+02 -0.032 0.974706
## Age -3.617e-02 1.003e-02 -3.607 0.000310 ***
## BusinessTravelTravel_Frequently 2.247e+00 3.185e-01 7.054 1.73e-12 ***
## BusinessTravelTravel_Rarely 1.235e+00 2.873e-01 4.298 1.72e-05 ***
## DailyRate -4.102e-04 1.721e-04 -2.384 0.017135 *
## DepartmentResearch & Development 1.535e+01 3.216e+02 0.048 0.961932
## DepartmentSales 1.518e+01 3.216e+02 0.047 0.962360
## DistanceFromHome 4.663e-02 8.788e-03 5.307 1.12e-07 ***
## EducationFieldLife Sciences -1.530e+00 6.345e-01 -2.411 0.015903 *
## EducationFieldMarketing -7.559e-01 6.685e-01 -1.131 0.258195
## EducationFieldMedical -1.436e+00 6.248e-01 -2.299 0.021518 *
## EducationFieldOther -8.717e-01 6.693e-01 -1.302 0.192746
## EducationFieldTechnical Degree 4.438e-02 6.511e-01 0.068 0.945657
## EmployeeNumber -2.060e-04 1.209e-04 -1.703 0.088483 .
## EnvironmentSatisfaction2 -1.228e+00 2.241e-01 -5.481 4.22e-08 ***
## EnvironmentSatisfaction3 -1.073e+00 2.041e-01 -5.257 1.47e-07 ***
## EnvironmentSatisfaction4 -1.081e+00 2.009e-01 -5.381 7.42e-08 ***
## GenderMale 4.882e-01 1.436e-01 3.400 0.000673 ***
## JobInvolvement2 -1.221e+00 3.185e-01 -3.832 0.000127 ***
## JobInvolvement3 -1.556e+00 3.013e-01 -5.163 2.44e-07 ***
## JobInvolvement4 -1.678e+00 3.747e-01 -4.478 7.54e-06 ***
## JobLevel2 -1.005e+00 2.963e-01 -3.390 0.000698 ***
## JobLevel3 1.020e+00 5.201e-01 1.961 0.049846 *
## JobLevel4 -2.127e+00 9.762e-01 -2.179 0.029357 *
## JobLevel5 3.288e+00 1.176e+00 2.796 0.005179 **
## JobRoleHuman Resources 1.645e+01 3.216e+02 0.051 0.959207
## JobRoleLaboratory Technician 1.983e+00 4.728e-01 4.195 2.73e-05 ***
## JobRoleManager -1.912e-01 8.395e-01 -0.228 0.819847
## JobRoleManufacturing Director 1.051e+00 4.682e-01 2.245 0.024773 *
## JobRoleResearch Director -2.265e+00 8.983e-01 -2.522 0.011679 *
## JobRoleResearch Scientist 1.058e+00 4.758e-01 2.224 0.026168 *
## JobRoleSales Executive 2.461e+00 1.011e+00 2.434 0.014931 *
## JobRoleSales Representative 2.974e+00 1.065e+00 2.794 0.005207 **
## JobSatisfaction2 -1.168e+00 2.138e-01 -5.463 4.68e-08 ***
## JobSatisfaction3 -8.703e-01 1.861e-01 -4.677 2.92e-06 ***
## JobSatisfaction4 -1.731e+00 2.005e-01 -8.635 < 2e-16 ***
## MaritalStatusMarried 5.959e-01 1.996e-01 2.986 0.002826 **
## MaritalStatusSingle 6.371e-01 2.893e-01 2.203 0.027626 *
## MonthlyIncome -1.574e-04 6.925e-05 -2.272 0.023069 *
## NumCompaniesWorked 2.340e-01 3.066e-02 7.632 2.31e-14 ***
## OverTimeYes 2.012e+00 1.516e-01 13.270 < 2e-16 ***
## PercentSalaryHike -4.092e-02 1.929e-02 -2.121 0.033902 *
## RelationshipSatisfaction2 -1.112e+00 2.130e-01 -5.222 1.77e-07 ***
## RelationshipSatisfaction3 -1.010e+00 1.918e-01 -5.267 1.39e-07 ***
## RelationshipSatisfaction4 -1.046e+00 1.948e-01 -5.371 7.83e-08 ***
## StockOptionLevel1 -1.356e+00 2.350e-01 -5.771 7.87e-09 ***
## StockOptionLevel2 -1.197e+00 3.086e-01 -3.879 0.000105 ***
## StockOptionLevel3 -4.014e-01 3.607e-01 -1.113 0.265790
## TotalWorkingYears -2.977e-02 2.054e-02 -1.449 0.147289
## TrainingTimesLastYear -1.282e-01 5.158e-02 -2.486 0.012915 *
## WorkLifeBalance2 -9.798e-01 2.906e-01 -3.372 0.000745 ***
## WorkLifeBalance3 -1.531e+00 2.757e-01 -5.553 2.81e-08 ***
## WorkLifeBalance4 -6.113e-01 3.278e-01 -1.865 0.062216 .
## YearsAtCompany 2.056e-01 3.004e-02 6.845 7.65e-12 ***
## YearsInCurrentRole -2.527e-01 4.009e-02 -6.303 2.92e-10 ***
## YearsSinceLastPromotion 1.046e-01 3.127e-02 3.345 0.000824 ***
## YearsWithCurrManager -1.277e-01 3.683e-02 -3.466 0.000528 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 2761.5 on 1991 degrees of freedom
## Residual deviance: 1517.1 on 1935 degrees of freedom
## AIC: 1631.1
##
## Number of Fisher Scoring iterations: 14
A parsimonious model is one that achieves a desirable and useful level of explanation with as few predictors as possible. That is why, after applying the backward step to our full model, we obtain a second model that uses fewer predictors with comparably good results.
Now we will check whether multicollinearity exists in our model using the vif() function from the car package. According to Zuur et al. (2010), a high VIF value indicates multicollinearity, and they suggest dropping the predictor with the highest VIF.
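A sketch of the call (assuming the car package, which is not in the library chunk above):
library(car)
vif(model_log_bw)  # generalised VIF, since the model contains factor predictors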
## GVIF Df GVIF^(1/(2*Df))
## Age 1.987430e+00 1 1.409763
## BusinessTravel 1.338507e+00 2 1.075611
## DailyRate 1.175317e+00 1 1.084120
## Department 5.743169e+07 2 87.053832
## DistanceFromHome 1.228154e+00 1 1.108221
## EducationField 4.666025e+00 5 1.166527
## EmployeeNumber 1.185489e+00 1 1.088801
## EnvironmentSatisfaction 1.636246e+00 3 1.085529
## Gender 1.147067e+00 1 1.071012
## JobInvolvement 1.598506e+00 3 1.081315
## JobLevel 7.672646e+01 4 1.720355
## JobRole 1.243779e+09 8 3.701873
## JobSatisfaction 1.595519e+00 3 1.080978
## MaritalStatus 3.569071e+00 2 1.374481
## MonthlyIncome 1.533085e+01 1 3.915463
## NumCompaniesWorked 1.523547e+00 1 1.234321
## OverTime 1.308205e+00 1 1.143768
## PercentSalaryHike 1.128670e+00 1 1.062389
## RelationshipSatisfaction 1.508066e+00 3 1.070870
## StockOptionLevel 4.275539e+00 3 1.273987
## TotalWorkingYears 5.195686e+00 1 2.279405
## TrainingTimesLastYear 1.135342e+00 1 1.065524
## WorkLifeBalance 1.602410e+00 3 1.081755
## YearsAtCompany 7.559767e+00 1 2.749503
## YearsInCurrentRole 4.249090e+00 1 2.061332
## YearsSinceLastPromotion 2.307515e+00 1 1.519051
## YearsWithCurrManager 3.736256e+00 1 1.932940
We recreate the formula with the highest-VIF predictor, Department, dropped.
model_log_bw <- glm(formula = Attrition ~ Age + BusinessTravel + DailyRate +
DistanceFromHome + EducationField + EmployeeNumber +
EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel +
JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome +
NumCompaniesWorked + OverTime + PercentSalaryHike + RelationshipSatisfaction +
StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear +
WorkLifeBalance + YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion +
YearsWithCurrManager, family = "binomial", data = train_balanced)
Now we recalculate the VIF values.
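Presumably the same call as before, now on the refitted model:
vif(model_log_bw)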
## GVIF Df GVIF^(1/(2*Df))
## Age 1.983295 1 1.408295
## BusinessTravel 1.329186 2 1.073733
## DailyRate 1.175157 1 1.084047
## DistanceFromHome 1.216231 1 1.102829
## EducationField 4.315324 5 1.157448
## EmployeeNumber 1.182660 1 1.087502
## EnvironmentSatisfaction 1.614912 3 1.083157
## Gender 1.138396 1 1.066956
## JobInvolvement 1.549342 3 1.075700
## JobLevel 73.886948 4 1.712265
## JobRole 77.216664 8 1.312145
## JobSatisfaction 1.574728 3 1.078618
## MaritalStatus 3.560918 2 1.373696
## MonthlyIncome 15.480245 1 3.934494
## NumCompaniesWorked 1.516519 1 1.231470
## OverTime 1.298094 1 1.139339
## PercentSalaryHike 1.124487 1 1.060418
## RelationshipSatisfaction 1.497423 3 1.069607
## StockOptionLevel 4.255169 3 1.272974
## TotalWorkingYears 5.173397 1 2.274510
## TrainingTimesLastYear 1.134146 1 1.064963
## WorkLifeBalance 1.598007 3 1.081259
## YearsAtCompany 7.271288 1 2.696533
## YearsInCurrentRole 4.092361 1 2.022959
## YearsSinceLastPromotion 2.249532 1 1.499844
## YearsWithCurrManager 3.700395 1 1.923641
With the highest-VIF predictor dropped, our model is ready for predicting on the test dataset.
Let's say the company's objective is to prevent attrition, so we **don't want to predict Attrition = No when it is actually Yes.**
If the real value of Attrition is Yes while the predicted value is No, we will fail to take the preventive action needed to stop the attrition from happening.
If the real value of Attrition is No while the predicted value is Yes, we may take an unnecessary step to prevent something that won't happen. But in my opinion, this mistake is not fatal and may even be beneficial: by taking the 'mistaken' step, we may build a stronger relationship with, and understanding of, our employees. Not a bad outcome, actually.
FP: a test result that incorrectly indicates that a particular condition or attribute is present.
FN: a test result that incorrectly indicates that a particular condition or attribute is absent.
So, in our case, we prefer to focus on lowering the false negative rate. Why? Because a false negative means we miss a valuable employee who is about to leave.
In the false positive case, we may take a concrete step, like talking with our employee, only to find out in the end that the employee has no problem at all.
Sensitivity = TP/(TP + FN)
Specificity = TN/(TN + FP)
Accuracy = (TN + TP)/(TN+TP+FN+FP)
So, we want to decrease the FN count. According to the formulas, decreasing FN increases both the Sensitivity and the Accuracy score.
Now we will compare our predictions with the real values in the test dataset. The threshold is set to 0.5.
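The probabilities used below come from the refitted model; a sketch of the chunk that creates pred:
pred <- predict(model_log_bw, newdata = data_test, type = "response")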
pred_round <- as.factor(ifelse(pred >= 0.5, "Yes", "No"))
confusionMatrix(pred_round, data_test$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 186 13
## Yes 51 44
##
## Accuracy : 0.7823
## 95% CI : (0.7307, 0.8281)
## No Information Rate : 0.8061
## P-Value [Acc > NIR] : 0.8651
##
## Kappa : 0.4443
##
## Mcnemar's Test P-Value : 3.746e-06
##
## Sensitivity : 0.7719
## Specificity : 0.7848
## Pos Pred Value : 0.4632
## Neg Pred Value : 0.9347
## Prevalence : 0.1939
## Detection Rate : 0.1497
## Detection Prevalence : 0.3231
## Balanced Accuracy : 0.7784
##
## 'Positive' Class : Yes
##
We will try to increase the accuracy and decrease the FN count. In a Logistic Regression model, this can be done by adjusting the prediction threshold. But the question is: what is the optimal threshold value? We will create a graph to help us visualize the distribution of the predictions.
data_test$pred <- pred
ggplot(data_test, aes(pred, color = as.factor(Attrition) ) ) +
geom_density( size = 1 ) +
ggtitle("Testing Set's Predicted Score")
Looking at the graph above, we can tune our model by adjusting the threshold:
A threshold below 0.5 will result in more Attrition == Yes predictions.
A threshold above 0.5 will result in fewer Attrition == Yes predictions.
Besides reading the graph, we can compute the optimal cutoff value mathematically.
ROCRpred = prediction(pred, data_test$Attrition)
# Performance function
plot1 = performance(ROCRpred, "acc", "fnr")
plot2 = performance(ROCRpred, "prec", "rec")
par(mfrow = c(1, 2))
plot(plot1, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
plot(plot2, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
We want a threshold with the highest accuracy and the lowest false negative rate. Looking at the graphs, the threshold should be around 0.3 or 0.4. We will try both values and compare the results.
pred_round <- as.factor(ifelse(pred >= 0.3, "Yes", "No"))
confusionMatrix(pred_round, data_test$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 159 7
## Yes 78 50
##
## Accuracy : 0.7109
## 95% CI : (0.6554, 0.762)
## No Information Rate : 0.8061
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3721
##
## Mcnemar's Test P-Value : 3.136e-14
##
## Sensitivity : 0.8772
## Specificity : 0.6709
## Pos Pred Value : 0.3906
## Neg Pred Value : 0.9578
## Prevalence : 0.1939
## Detection Rate : 0.1701
## Detection Prevalence : 0.4354
## Balanced Accuracy : 0.7740
##
## 'Positive' Class : Yes
##
pred_round <- as.factor(ifelse(pred >= 0.4, "Yes", "No"))
confusionMatrix(pred_round, data_test$Attrition, positive = "Yes")
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 173 10
## Yes 64 47
##
## Accuracy : 0.7483
## 95% CI : (0.6946, 0.7969)
## No Information Rate : 0.8061
## P-Value [Acc > NIR] : 0.9939
##
## Kappa : 0.4078
##
## Mcnemar's Test P-Value : 7.223e-10
##
## Sensitivity : 0.8246
## Specificity : 0.7300
## Pos Pred Value : 0.4234
## Neg Pred Value : 0.9454
## Prevalence : 0.1939
## Detection Rate : 0.1599
## Detection Prevalence : 0.3776
## Balanced Accuracy : 0.7773
##
## 'Positive' Class : Yes
##
Threshold values of 0.3 and 0.4 both successfully decrease the false negative count of the prediction, thus increasing the sensitivity. They also increase the true positive count. A good sign!
The drawback of lowering the threshold is that our model predicts fewer true negatives. But if we consider a false positive not a big deal, this should not be a problem.
KNN, or k-nearest neighbors, is a non-parametric method that can be used for classification. We will try predicting with this method and compare its performance with the Logistic Regression.
KNN (as applied here) only works with numeric predictors, so we will select the numeric columns only.
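A sketch of that selection, assuming dplyr 1.0's where() helper; judging from the confusion matrices later on, the target's levels are also reordered so that "Yes" comes first (twoClassSummary's Sens then refers to the Yes class):
dataHR_num <- dataHR %>%
  select(Attrition, where(is.numeric)) %>%
  mutate(Attrition = relevel(Attrition, ref = "Yes"))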
First, we will split the dataset into 80% training data and 20% test data, again using set.seed() so that R produces the same random numbers and this report stays reproducible.
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
set.seed(18)
ind <- dataHR_num$Attrition %>%
createDataPartition(p = 0.8, list = FALSE)
train_caret <- dataHR_num[ind, ]
test_caret <- dataHR_num[-ind, ]
I will be using the caret package to run KNN. After splitting the data into a train set and a test set, I use k-fold cross-validation in the train() function; the centering and scaling pre-processing is specified via the preProcess argument of train().
The number of cross-validation folds will be set to 10 in every case. Data imbalance will be handled via the sampling argument of trainControl(); I will try all the available options to compare how the different sampling methods perform.
The k values for KNN are set with tuneGrid in train(). The function will automatically pick the best k, the one with the best Sensitivity (chosen via the metric option), to use on the test set.
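The output below is consistent with the rule-of-thumb computation on our data (a sketch):
sqrt(nrow(dataHR_num))  # square root of the number of rows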
## [1] 38.34058
A common rule of thumb for the k value in KNN is the square root of the number of rows in the dataset.
I will iterate k from 1 to 50 in the tuneGrid; the function will take the k value that produces the highest Sensitivity, the metric chosen in line with our company objective.
The imbalanced data will be handled using several sampling methods: Upsampling, Downsampling, SMOTE, ROSE.
model_imbalanced <- train(
Attrition ~., data = train_caret, method = "knn",
trControl = trainControl(method = "cv",
number = 10,
classProbs = TRUE,
summaryFunction=twoClassSummary),
preProcess = c("center","scale"),
metric = "Sens",
tuneGrid = expand.grid(k = 1:50)
)
model_upsampling <- train(
Attrition ~., data = train_caret, method = "knn",
trControl = trainControl(method = "cv",
number = 10,
sampling = "up",
classProbs = TRUE,
summaryFunction=twoClassSummary),
preProcess = c("center","scale"),
metric = "Sens",
tuneGrid = expand.grid(k = 1:50)
)
model_downsampling <- train(
Attrition ~., data = train_caret, method = "knn",
trControl = trainControl(method = "cv",
number = 10,
sampling = "down",
classProbs = TRUE,
summaryFunction=twoClassSummary),
preProcess = c("center","scale"),
metric = "Sens",
tuneGrid = expand.grid(k = 1:50)
)
model_smote<- train(
Attrition ~., data = train_caret, method = "knn",
trControl = trainControl(method = "cv",
number = 10,
sampling = "smote",
classProbs = TRUE,
summaryFunction=twoClassSummary),
preProcess = c("center","scale"),
metric = "Sens",
tuneGrid = expand.grid(k = 1:50)
)
model_rose<- train(
Attrition ~., data = train_caret, method = "knn",
trControl = trainControl(method = "cv",
number = 10,
sampling = "rose",
classProbs = TRUE,
summaryFunction=twoClassSummary),
preProcess = c("center","scale"),
metric = "Sens",
tuneGrid = expand.grid(k = 1:50)
)
Now we will make predictions on the test dataset with the models we created, saving the results into appropriately named objects.
pred_imbalanced <- model_imbalanced %>%
predict(test_caret)
pred_upsampling <- model_upsampling %>%
predict(test_caret)
pred_downsampling <- model_downsampling %>%
predict(test_caret)
pred_smote <- model_smote %>%
predict(test_caret)
pred_rose <- model_rose %>%
predict(test_caret)
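Here are the confusion matrices for each model, in the same order as above (imbalanced, upsampling, downsampling, SMOTE, ROSE); a sketch of the call pattern, repeated for each prediction object:
confusionMatrix(pred_imbalanced, test_caret$Attrition, positive = "Yes")
## Confusion Matrix and Statistics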
##
## Reference
## Prediction Yes No
## Yes 15 38
## No 32 208
##
## Accuracy : 0.7611
## 95% CI : (0.7081, 0.8088)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 0.9998
##
## Kappa : 0.1566
##
## Mcnemar's Test P-Value : 0.5501
##
## Sensitivity : 0.31915
## Specificity : 0.84553
## Pos Pred Value : 0.28302
## Neg Pred Value : 0.86667
## Prevalence : 0.16041
## Detection Rate : 0.05119
## Detection Prevalence : 0.18089
## Balanced Accuracy : 0.58234
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 29 102
## No 18 144
##
## Accuracy : 0.5904
## 95% CI : (0.5318, 0.6473)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1175
##
## Mcnemar's Test P-Value : 3.541e-14
##
## Sensitivity : 0.61702
## Specificity : 0.58537
## Pos Pred Value : 0.22137
## Neg Pred Value : 0.88889
## Prevalence : 0.16041
## Detection Rate : 0.09898
## Detection Prevalence : 0.44710
## Balanced Accuracy : 0.60119
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 26 101
## No 21 145
##
## Accuracy : 0.5836
## 95% CI : (0.5249, 0.6407)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.0845
##
## Mcnemar's Test P-Value : 8.532e-13
##
## Sensitivity : 0.55319
## Specificity : 0.58943
## Pos Pred Value : 0.20472
## Neg Pred Value : 0.87349
## Prevalence : 0.16041
## Detection Rate : 0.08874
## Detection Prevalence : 0.43345
## Balanced Accuracy : 0.57131
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 23 82
## No 24 164
##
## Accuracy : 0.6382
## 95% CI : (0.5803, 0.6933)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1041
##
## Mcnemar's Test P-Value : 3.089e-08
##
## Sensitivity : 0.4894
## Specificity : 0.6667
## Pos Pred Value : 0.2190
## Neg Pred Value : 0.8723
## Prevalence : 0.1604
## Detection Rate : 0.0785
## Detection Prevalence : 0.3584
## Balanced Accuracy : 0.5780
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction Yes No
## Yes 27 81
## No 20 165
##
## Accuracy : 0.6553
## 95% CI : (0.5978, 0.7096)
## No Information Rate : 0.8396
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.1608
##
## Mcnemar's Test P-Value : 2.369e-09
##
## Sensitivity : 0.57447
## Specificity : 0.67073
## Pos Pred Value : 0.25000
## Neg Pred Value : 0.89189
## Prevalence : 0.16041
## Detection Rate : 0.09215
## Detection Prevalence : 0.36860
## Balanced Accuracy : 0.62260
##
## 'Positive' Class : Yes
##
We will calculate the performance differences statistically.
resamps <- resamples(list(Imbalanced = model_imbalanced,
Upsampling = model_upsampling,
Downsampling = model_downsampling,
SMOTE = model_smote,
ROSE = model_rose))
resamps$values %>%
select(Resample,
Imbalanced = `Imbalanced~Sens`,
Upsampling = `Upsampling~Sens`,
Downsampling = `Downsampling~Sens`,
SMOTE = `SMOTE~Sens`,
ROSE = `ROSE~Sens`)From the table and the chart above, we can conclude that, in general, Upsampling produced the highest Sensitivity value of our model. It doesn’t mean that Upsampling is always the best for all models and all situations. But in our case, because we focus on the Sensitivity value, the Upsampling method gives us the best result.
Our model summary:

| Model | Accuracy | Sensitivity |
|---|---|---|
| Logistic Regression, threshold 0.5 | 0.7721 | 0.7368 |
| Logistic Regression, threshold 0.3 | 0.7075 | 0.8772 |
| Logistic Regression, threshold 0.4 | 0.7381 | 0.8070 |
| KNN, imbalanced | 0.7611 | 0.31915 |
| KNN, upsampling | 0.5904 | 0.61702 |
| KNN, downsampling | 0.5836 | 0.55319 |
| KNN, SMOTE | 0.6382 | 0.4894 |
| KNN, ROSE | 0.6553 | 0.5745 |
All of our KNN predictions produce lower Accuracy and Sensitivity scores than the Logistic Regression model. This may be because we only used numeric predictors in KNN, losing valuable information contained in the categorical predictors.
With both numeric and categorical columns included, the Logistic Regression model performs better than KNN. Since our business objective is to decrease false negatives, we will choose the Logistic Regression with a threshold of 0.3 as our model.
This analysis is part of Algoritma LBB Project in C2 class.
In this project, I will create a Decision Tree and a Random Forest model with different parameters and tuning, and compare them to find which predicts better for this dataset.
The EDA part is still the same, so I will focus directly on the analysis here.
I will be using the caret package to fit the Decision Tree, by specifying method = "ctree" as an argument.
The cross-validation will use the repeatedcv method with number set to 10 and repeats set to 3. Data imbalance will be handled via the sampling argument of trainControl(); here I will use the upsampling method.
tree_imbalanced <- train(
Attrition ~., data = data_train, method = "ctree",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3)
)
tree_upsample <- train(
Attrition ~., data = data_train, method = "ctree",
trControl = trainControl(method = "repeatedcv",
number = 10,
repeats = 3,
sampling = "up")
)
Now we will predict with the models above on our test data; we will predict both classes (raw) and probabilities (prob).
pred_tree_imbalanced <- predict(tree_imbalanced, data_test)
pred_tree_upsample <- predict(tree_upsample, data_test)
pred_tree_imbalanced_prob <- predict(tree_imbalanced, data_test, type = "prob")
pred_tree_upsample_prob <- predict(tree_upsample, data_test, type = "prob")
Here is the confusion matrix with the imbalanced dataset.
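A sketch of the call that prints this table (the same pattern applies to the upsampled model below):
table(Prediction = pred_tree_imbalanced, Reference = data_test$Attrition)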
## Reference
## Prediction No Yes
## No 223 43
## Yes 14 14
Now we will calculate the metrics for our prediction with the imbalanced dataset.
table <- select(data_test, Attrition) %>%
  bind_cols(pred_tree_imbalanced = pred_tree_imbalanced)
# Note: the factor levels are c("No", "Yes") and yardstick treats the first
# level as the event, so spec_vec() here returns the sensitivity for "Yes"
# (and sens_vec() the specificity for "Yes") -- the swap below is deliberate
imbalanced_tree_metrics <- table %>% summarise(
  accuracy = accuracy_vec(data_test$Attrition, pred_tree_imbalanced),
  sensitivity = spec_vec(data_test$Attrition, pred_tree_imbalanced),
  specificity = sens_vec(data_test$Attrition, pred_tree_imbalanced),
  precision = precision_vec(data_test$Attrition, pred_tree_imbalanced))
imbalanced_tree_metrics
Here is the confusion matrix with the balanced data, using the upsampling method.
## Reference
## Prediction No Yes
## No 185 39
## Yes 52 18
Now we will calculate the metrics for our prediction with the balanced dataset (upsampling method).
table <- select(data_test, Attrition) %>%
bind_cols(pred_tree_upsample = pred_tree_upsample)
upsample_tree_metrics <- table %>% summarise(
accuracy = accuracy_vec(data_test$Attrition, pred_tree_upsample),
sensitivity = spec_vec(data_test$Attrition, pred_tree_upsample),
specificity = sens_vec(data_test$Attrition, pred_tree_upsample),
precision = precision_vec(data_test$Attrition, pred_tree_upsample))
upsample_tree_metrics
We can also visualize the fitted tree to see which variables matter most.
ct <- ctree(Attrition ~., data = data_train)
plotfun <- function(i) c(
as.character(i$prediction),
paste("n =", i$n),
format(round(i$distribution/i$n, digits = 3), nsmall = 3)
)
plot(as.simpleparty(ct), tp_args = list(FUN = plotfun), ep_args = list(justmin = 20))
Wow, the tree for the balanced (upsampled) data looks messy. The upsampled model has a lower accuracy but a higher sensitivity; because we want to focus on sensitivity, we will choose the upsampled one.
Now I will look at the ROC curve of my final model: the balanced dataset with the upsampling method.
table <- select(data_test, Attrition) %>%
bind_cols(attrition_pred = pred_tree_upsample_prob) %>%
bind_cols(attrition_eprob = round(pred_tree_upsample_prob[,2],4)) %>%
bind_cols(attrition_pprob = round(pred_tree_upsample_prob[,1],4))
roc <- data.frame(prediction = round(pred_tree_upsample_prob[,2],4),
trueclass = as.numeric(table$Attrition))
roc <- prediction(roc$prediction, roc$trueclass)
# ROC Curve
plot(performance(roc, "tpr", "fpr"), main = "ROC")
abline(a = 0, b = 1)
Looking at the graph above, our model performs poorly, as the curve lies very close to the diagonal reference line.
# AUC
auc_ROCR_tree <- performance(roc, measure = "auc")
auc_ROCR_tree <- auc_ROCR_tree@y.values[[1]]
auc_ROCR_tree
## [1] 0.5787253
The AUC (area under the curve) is a metric for how good our model is: the higher the AUC, the better. A value of 0.5 denotes a classifier no better than random. As our value is very close to 0.5, we can say that our model doesn't perform well.
We may be able to improve our model by adjusting the classification probability threshold. Now I will try tuning the model by looking at the tradeoff graph.
th_cut <- function(cutoff, prob, ref, postarget, negtarget)
{
  # classify with the given cutoff, then collect confusion-matrix metrics
  predict <- factor(ifelse(prob >= cutoff, postarget, negtarget))
  conf <- caret::confusionMatrix(predict, ref, positive = postarget)
  acc <- conf$overall[1]
  rec <- conf$byClass[1]   # sensitivity / recall
  prec <- conf$byClass[3]  # positive predictive value / precision
  spec <- conf$byClass[2]  # specificity
  mat <- t(as.matrix(c(rec, acc, prec, spec)))
  colnames(mat) <- c("recall", "accuracy", "precision", "specificity")
  return(mat)
}
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)
for(i in 1:100){
result[i,] = th_cut(cutoff = co[i],
prob = table$attrition_eprob,
ref = as.factor(ifelse(table$Attrition == "Yes", 1, 0)),
postarget = "1",
negtarget = "0")
}
# visualize
ggplotly(tibble("Sensitivity" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(-0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(-0,1,0.1)) +
labs(title = "Tradeoff Model Perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))
From the graph above, there is no clear intersection point marking an optimum threshold. We could increase the threshold to 0.7 to increase the accuracy, precision, and specificity, but that would decrease the sensitivity, which we actually want as high as possible. So I will skip the tuning step, as it won't improve the model for our objective.
Here are the metrics of our final model for the Decision Tree algorithm.
final_tree <- table %>% summarise(
accuracy = accuracy_vec(Attrition, pred_tree_upsample),
sensitivity = spec_vec(Attrition, pred_tree_upsample),
specificity = sens_vec(Attrition, pred_tree_upsample),
precision = precision_vec(Attrition, pred_tree_upsample)) %>%
cbind(AUC = auc_ROCR_tree)
final_tree
A Random Forest consists of a large number of individual decision trees that operate as an ensemble.
We will train with the repeatedcv method, the number of cross-validation folds set to 5 and repeats set to 3. Data imbalance will be handled via the sampling argument of trainControl(); I will try several methods to handle the imbalance.
rf_imbalanced <- train(
Attrition ~., data = data_train, method = "rf",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
summaryFunction=twoClassSummary,
classProbs=T)
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
rf_upsample <- train(
Attrition ~., data = data_train, method = "rf",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
sampling = "up",
summaryFunction=twoClassSummary,
classProbs=T)
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
rf_smote <- train(
Attrition ~., data = data_train, method = "rf",
trControl = trainControl(method = "repeatedcv",
number = 5,
repeats = 3,
sampling = "smote",
summaryFunction=twoClassSummary,
classProbs=T)
)
## Warning in train.default(x, y, weights = w, ...): The metric "Accuracy" was not
## in the result set. ROC will be used instead.
Here are the graphs visualizing the mtry values tried for each model. The three graphs plot the resampling metric (y-axis) against mtry; the algorithm automatically chooses the mtry that produces the best value (ROC here, per the warnings above).
Now we will predict with the models above on our test data; we will predict both raw classes and probabilities.
pred_rf_imbalanced_prob <- predict(rf_imbalanced, data_test, type = "prob")
pred_rf_upsample_prob <- predict(rf_upsample, data_test, type = "prob")
pred_rf_smote_prob <- predict(rf_smote, data_test, type = "prob")
pred_rf_imbalanced_raw <- predict(rf_imbalanced, data_test, type = "raw")
pred_rf_upsample_raw <- predict(rf_upsample, data_test, type = "raw")
pred_rf_smote_raw <- predict(rf_smote, data_test, type = "raw")
Here is the confusion matrix with the imbalanced dataset.
## Reference
## Prediction No Yes
## No 237 56
## Yes 0 1
Now we will calculate the metrics for our prediction with the imbalanced dataset.
table <- select(data_test, Attrition) %>%
bind_cols(pred_rf_imbalanced_raw = pred_rf_imbalanced_raw)
imbalanced_metrics <- table %>% summarise(
accuracy = accuracy_vec(data_test$Attrition, pred_rf_imbalanced_raw),
sensitivity = spec_vec(data_test$Attrition, pred_rf_imbalanced_raw),
specificity = sens_vec(data_test$Attrition, pred_rf_imbalanced_raw),
precision = precision_vec(data_test$Attrition, pred_rf_imbalanced_raw))
imbalanced_metrics
Here is the confusion matrix with the upsampling method.
## Reference
## Prediction No Yes
## No 230 41
## Yes 7 16
Now we will calculate the metrics for our prediction with the upsampling method.
table <- select(data_test, Attrition) %>%
bind_cols(pred_rf_upsample_raw = pred_rf_upsample_raw)
upsample_metrics <- table %>% summarise(
accuracy = accuracy_vec(data_test$Attrition, pred_rf_upsample_raw),
sensitivity = spec_vec(data_test$Attrition, pred_rf_upsample_raw),
specificity = sens_vec(data_test$Attrition, pred_rf_upsample_raw),
precision = precision_vec(data_test$Attrition, pred_rf_upsample_raw))
upsample_metrics
Here is the confusion matrix with the SMOTE method.
## Reference
## Prediction No Yes
## No 234 43
## Yes 3 14
Now we will calculate the metrics for our prediction with the SMOTE method.
table <- select(data_test, Attrition) %>%
bind_cols(pred_rf_smote_raw = pred_rf_smote_raw)
smote_metrics <- table %>% summarise(
accuracy = accuracy_vec(data_test$Attrition, pred_rf_smote_raw),
sensitivity = spec_vec(data_test$Attrition, pred_rf_smote_raw),
specificity = sens_vec(data_test$Attrition, pred_rf_smote_raw),
precision = precision_vec(data_test$Attrition, pred_rf_smote_raw))
smote_metrics
Because the upsampling technique gives the best results in both accuracy and sensitivity, we will proceed with it for the next analysis.
Now we will look at the details and plot of the chosen model.
## Random Forest
##
## 1176 samples
## 31 predictor
## 2 classes: 'No', 'Yes'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 3 times)
## Summary of sample sizes: 941, 941, 941, 940, 941, 940, ...
## Addtional sampling using up-sampling
##
## Resampling results across tuning parameters:
##
## mtry ROC Sens Spec
## 2 0.7707324 0.9708878 0.2462963
## 32 0.7586876 0.9598526 0.2296296
## 63 0.7349803 0.9531508 0.2611111
##
## ROC was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
Here is the variable importance for the chosen model.
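A sketch of the call that prints it:
varImp(rf_upsample)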
## rf variable importance
##
## only 20 most important variables shown (out of 63)
##
## Overall
## MonthlyIncome 100.00
## Age 80.44
## TotalWorkingYears 75.80
## OverTimeYes 66.15
## EmployeeNumber 63.83
## YearsAtCompany 63.76
## DailyRate 61.57
## MonthlyRate 61.22
## HourlyRate 60.08
## YearsWithCurrManager 58.07
## YearsInCurrentRole 56.36
## DistanceFromHome 51.92
## PercentSalaryHike 47.31
## NumCompaniesWorked 44.50
## YearsSinceLastPromotion 38.29
## TrainingTimesLastYear 34.20
## StockOptionLevel1 31.54
## MaritalStatusSingle 27.13
## JobLevel2 23.30
## JobSatisfaction4 22.34
Here we can see the error decreasing as the number of trees grows. This is the main idea of Random Forest: the crowd produces a better result than any individual tree.
plot(rf_upsample$finalModel)
legend("topright", colnames(rf_upsample$finalModel$err.rate), col=1:6, cex=0.8, fill = 1:6)
Now I will look at the ROC curve of my final model.
table <- select(data_test, Attrition) %>%
bind_cols(attrition_pred = pred_rf_upsample_prob) %>%
bind_cols(attrition_eprob = round(pred_rf_upsample_prob[,2],4)) %>%
bind_cols(attrition_pprob = round(pred_rf_upsample_prob[,1],4))
roc <- data.frame(prediction = round(pred_rf_upsample_prob[,2],4),
trueclass = as.numeric(table$Attrition))
roc <- prediction(roc$prediction, roc$trueclass)
# ROC Curve
plot(performance(roc, "tpr", "fpr"), main = "ROC")
abline(a = 0, b = 1)
We can see that our ROC curve is visually better than the Decision Tree one. Now let's calculate it statistically.
# AUC
auc_ROCR_f <- performance(roc, measure = "auc")
auc_ROCR_f <- auc_ROCR_f@y.values[[1]]
auc_ROCR_f
## [1] 0.7794803
The AUC of the Random Forest is far better than that of the Decision Tree, so it is obvious that we will choose the Random Forest as our model.
Based on the results above, the random forest model gives us reasonable results in terms of accuracy and specificity. Furthermore, the AUC of the model is also reasonable, at around 78%.
In the next step, we will try to tune the threshold to improve the metrics.
# reuse the th_cut() helper defined in the Decision Tree section above
co <- seq(0.01,0.99,length=100)
result <- matrix(0,100,4)
for(i in 1:100){
result[i,] = th_cut(cutoff = co[i],
prob = table$attrition_eprob,
ref = as.factor(ifelse(table$Attrition == "Yes", 1, 0)),
postarget = "1",
negtarget = "0")
}
# visualize
ggplotly(tibble("Sensitivity" = result[,1],
"Accuracy" = result[,2],
"Precision" = result[,3],
"Specificity" = result[,4],
"Cutoff" = co) %>%
gather(key = "Metrics", value = "value", 1:4) %>%
ggplot(aes(x = Cutoff, y = value, col = Metrics)) +
geom_line(lwd = 1.5) +
scale_color_manual(values = c("darkred","darkgreen","orange", "blue")) +
scale_y_continuous(breaks = seq(-0,1,0.1), limits = c(0,1)) +
scale_x_continuous(breaks = seq(-0,1,0.1)) +
labs(title = "Tradeoff Model Perfomance") +
theme_minimal() +
theme(legend.position = "top",
panel.grid.minor.y = element_blank(),
panel.grid.minor.x = element_blank()))
From the graph above, we can see the cutoff values where the metrics intersect, and it is clear that reducing the cutoff increases the sensitivity. Since the focus of this analysis is both sensitivity and accuracy, I will choose a cutoff value of 0.3.
table <- table %>%
mutate(tuning_pred = as.factor(ifelse(attrition_eprob >= 0.3, "Yes", "No")))
final_rf_threshold <- table %>% summarise(
accuracy = accuracy_vec(Attrition, tuning_pred),
sensitivity = spec_vec(Attrition, tuning_pred),
specificity = sens_vec(Attrition, tuning_pred),
precision = precision_vec(Attrition, tuning_pred)) %>%
cbind(AUC = auc_ROCR_f)
final_rf_threshold
We can see that decreasing the threshold dramatically increases the sensitivity while sacrificing a bit of accuracy.
Finally, I will draw a conclusion from both the Decision Tree and Random Forest predictions.
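The comparison below uses an object holding the pre-tuning Random Forest metrics; the name rf_upsample_before_tuning comes from its use below, and this construction is an assumption:
rf_upsample_before_tuning <- upsample_metrics %>%
  cbind(AUC = auc_ROCR_f)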
rbind("Decision Tree" = final_tree,
      "Random Forest, No Tuning" = rf_upsample_before_tuning,
      "Random Forest, After Tuning" = final_rf_threshold)
Based on the comparison above, we can conclude that the tuned Random Forest gives the best result: its sensitivity and precision beat all the other models. The tuning sacrifices some accuracy and specificity, but since our priority is the highest possible sensitivity, we will choose the tuned Random Forest as our final model.