High turnover rates can negatively affect a company and its employees in many ways. With the constant need to hire and train new employees, it is easy to veer from the true mission and vision of the organization.
By retaining employees, companies can maintain a higher-caliber workforce that positively affects the bottom line. Businesses can lower turnover rates by providing adequate training, rewarding employees for a job well done, and creating a company culture of trust.
This study therefore applies machine learning models to understand which kinds of employees are likely to leave, and whether certain job roles are more prone to turnover.
The dataset on employee turnover is taken from Kaggle, where it was provided by IBM Analytics; it contains 1470 observations and 35 variables.
Below are the questions proposed for this study:
Category : Classification
Who among the employees will leave the company?
Which job roles are more likely to leave the company?
Category : Regression
What is the suitable monthly income for an employee?
To start, a few packages/libraries need to be installed and loaded: caret, Boruta, EnvStats, tidyverse, heatmaply, caTools, rpart.plot, corrplot and kernlab (glmnet and plotly are also used later, for the ridge regression and the interactive scatter plot). After installing, proceed to load the data into R.
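A minimal setup chunk might look like this (the exact set of library() calls is a sketch; run install.packages() first for any that are missing):
# loading the libraries used throughout this study
library(tidyverse)   # data wrangling, ggplot2
library(caret)       # model training utilities
library(Boruta)      # feature importance analysis
library(EnvStats)    # rosnerTest() for outliers
library(heatmaply)   # heatmaps and normalize()
library(caTools)     # sample.split()
library(rpart)       # decision trees
library(rpart.plot)  # tree visualisation
library(corrplot)    # correlation matrix plot
library(kernlab)     # kernel-based methods
library(glmnet)      # ridge regression (cv.glmnet)
library(plotly)      # ggplotly() interactive plots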
Reading CSV
The dataset consists of 1470 observations with 35 variables.
The majority of the variables are categorical, with a few continuous.
DF <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv", fileEncoding="UTF-8-BOM")
str(DF)
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : chr "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr "Female" "Male" "Male" "Female" ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr "Y" "Y" "Y" "Y" ...
## $ OverTime : chr "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
summary(DF)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 Length:1470 Length:1470 Min. : 102.0
## 1st Qu.:30.00 Class :character Class :character 1st Qu.: 465.0
## Median :36.00 Mode :character Mode :character Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
## Department DistanceFromHome Education EducationField
## Length:1470 Min. : 1.000 Min. :1.000 Length:1470
## Class :character 1st Qu.: 2.000 1st Qu.:2.000 Class :character
## Mode :character Median : 7.000 Median :3.000 Mode :character
## Mean : 9.193 Mean :2.913
## 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000
## EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender
## Min. :1 Min. : 1.0 Min. :1.000 Length:1470
## 1st Qu.:1 1st Qu.: 491.2 1st Qu.:2.000 Class :character
## Median :1 Median :1020.5 Median :3.000 Mode :character
## Mean :1 Mean :1024.9 Mean :2.722
## 3rd Qu.:1 3rd Qu.:1555.8 3rd Qu.:4.000
## Max. :1 Max. :2068.0 Max. :4.000
## HourlyRate JobInvolvement JobLevel JobRole
## Min. : 30.00 Min. :1.00 Min. :1.000 Length:1470
## 1st Qu.: 48.00 1st Qu.:2.00 1st Qu.:1.000 Class :character
## Median : 66.00 Median :3.00 Median :2.000 Mode :character
## Mean : 65.89 Mean :2.73 Mean :2.064
## 3rd Qu.: 83.75 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :100.00 Max. :4.00 Max. :5.000
## JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## Min. :1.000 Length:1470 Min. : 1009 Min. : 2094
## 1st Qu.:2.000 Class :character 1st Qu.: 2911 1st Qu.: 8047
## Median :3.000 Mode :character Median : 4919 Median :14236
## Mean :2.729 Mean : 6503 Mean :14313
## 3rd Qu.:4.000 3rd Qu.: 8379 3rd Qu.:20462
## Max. :4.000 Max. :19999 Max. :26999
## NumCompaniesWorked Over18 OverTime PercentSalaryHike
## Min. :0.000 Length:1470 Length:1470 Min. :11.00
## 1st Qu.:1.000 Class :character Class :character 1st Qu.:12.00
## Median :2.000 Mode :character Mode :character Median :14.00
## Mean :2.693 Mean :15.21
## 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :9.000 Max. :25.00
## PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
## Min. :3.000 Min. :1.000 Min. :80 Min. :0.0000
## 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000
## Median :3.000 Median :3.000 Median :80 Median :1.0000
## Mean :3.154 Mean :2.712 Mean :80 Mean :0.7939
## 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000
## Max. :4.000 Max. :4.000 Max. :80 Max. :3.0000
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000
## 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000
## Median :10.00 Median :3.000 Median :3.000 Median : 5.000
## Mean :11.28 Mean :2.799 Mean :2.761 Mean : 7.008
## 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000
## Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 3.000 Median : 1.000 Median : 3.000
## Mean : 4.229 Mean : 2.188 Mean : 4.123
## 3rd Qu.: 7.000 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :18.000 Max. :15.000 Max. :17.000
Checking for unique values
Three columns have only one unique value (EmployeeCount, Over18 and StandardHours).
One column has a different value in every row (EmployeeNumber, an auto-number).
# checking unique value in every column
DF %>%
lapply(unique) %>%
lengths()
## Age Attrition BusinessTravel
## 43 2 3
## DailyRate Department DistanceFromHome
## 886 3 29
## Education EducationField EmployeeCount
## 5 6 1
## EmployeeNumber EnvironmentSatisfaction Gender
## 1470 4 2
## HourlyRate JobInvolvement JobLevel
## 71 4 5
## JobRole JobSatisfaction MaritalStatus
## 9 4 3
## MonthlyIncome MonthlyRate NumCompaniesWorked
## 1349 1427 10
## Over18 OverTime PercentSalaryHike
## 1 2 15
## PerformanceRating RelationshipSatisfaction StandardHours
## 2 4 1
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## 4 40 7
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 4 37 19
## YearsSinceLastPromotion YearsWithCurrManager
## 16 18
# checking for inconsistency
DF %>%
subset(select=c("BusinessTravel","Department","EducationField","JobRole","MaritalStatus")) %>%
lapply(unique)
## $BusinessTravel
## [1] "Travel_Rarely" "Travel_Frequently" "Non-Travel"
##
## $Department
## [1] "Sales" "Research & Development" "Human Resources"
##
## $EducationField
## [1] "Life Sciences" "Other" "Medical" "Marketing"
## [5] "Technical Degree" "Human Resources"
##
## $JobRole
## [1] "Sales Executive" "Research Scientist"
## [3] "Laboratory Technician" "Manufacturing Director"
## [5] "Healthcare Representative" "Manager"
## [7] "Sales Representative" "Research Director"
## [9] "Human Resources"
##
## $MaritalStatus
## [1] "Single" "Married" "Divorced"
# checking for missing values
sapply(DF, function(x) sum(is.na(x)))
## Age Attrition BusinessTravel
## 0 0 0
## DailyRate Department DistanceFromHome
## 0 0 0
## Education EducationField EmployeeCount
## 0 0 0
## EmployeeNumber EnvironmentSatisfaction Gender
## 0 0 0
## HourlyRate JobInvolvement JobLevel
## 0 0 0
## JobRole JobSatisfaction MaritalStatus
## 0 0 0
## MonthlyIncome MonthlyRate NumCompaniesWorked
## 0 0 0
## Over18 OverTime PercentSalaryHike
## 0 0 0
## PerformanceRating RelationshipSatisfaction StandardHours
## 0 0 0
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## 0 0 0
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 0 0 0
## YearsSinceLastPromotion YearsWithCurrManager
## 0 0
Data Cleaning
##########
dataCleaning <- function(df) {
df %>%
# dropping columns with single value
Filter(function(x) min(x)!=max(x),.) %>%
# remove autonumber column
.[, names(.) != "EmployeeNumber"] %>%
mutate(
# changing YesNo data to logical data
OverTime = ifelse(OverTime == "No",FALSE,TRUE),
Attrition = ifelse(Attrition == "No",FALSE,TRUE),
# changing char data to ordered factor (ordinal)
BusinessTravel = factor(BusinessTravel,
levels = c("Non-Travel","Travel_Rarely","Travel_Frequently"),
ordered = TRUE),
Education = factor(Education, ordered = TRUE),
EnvironmentSatisfaction = factor(EnvironmentSatisfaction, ordered = TRUE),
JobInvolvement = factor(JobInvolvement, ordered = TRUE),
JobLevel = factor(JobLevel, ordered = TRUE),
JobSatisfaction = factor(JobSatisfaction, ordered = TRUE),
PerformanceRating = factor(PerformanceRating, ordered = TRUE),
RelationshipSatisfaction = factor(RelationshipSatisfaction, ordered = TRUE),
StockOptionLevel = factor(StockOptionLevel, ordered = TRUE),
WorkLifeBalance = factor(WorkLifeBalance, ordered = TRUE)
) %>%
# changing other char columns to unordered factor (nominal)
mutate_if(is.character,as.factor)
}
DF <- dataCleaning(DF)
# recheck data is correctly cleaned
str(DF)
## 'data.frame': 1470 obs. of 31 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : logi TRUE FALSE TRUE FALSE FALSE FALSE ...
## $ BusinessTravel : Ord.factor w/ 3 levels "Non-Travel"<"Travel_Rarely"<..: 2 3 2 3 2 3 2 2 3 2 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EnvironmentSatisfaction : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ OverTime : logi TRUE FALSE TRUE TRUE FALSE FALSE ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : Ord.factor w/ 2 levels "3"<"4": 1 2 1 1 1 1 2 2 2 1 ...
## $ RelationshipSatisfaction: Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 4 2 3 4 3 1 2 2 2 ...
## $ StockOptionLevel : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 1 2 1 1 2 1 4 2 1 3 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
Based on the results above, there is no inconsistency in the categorical data and no missing values in any column.
Classification Modelling
Before starting on the modelling, a few preparation steps are needed.
Step 1, Checking for outliers
outlierTest <- function(df){
# outlier from boxplot
DFo <- list()
for (i in names(select_if(df, is.numeric))) {
cat(sprintf('%-24s : %d\n', i, length(boxplot.stats(df[,i])$out)))
if (length(boxplot.stats(df[,i])$out)>0){
DFo <- c(DFo,i)
}
}
# testing outliers
for (i in DFo) {
cat(sprintf('%s\n',i))
    print(rosnerTest(df[,i])$all.stats)
cat(sprintf('\n'))
}
  # retesting YearsAtCompany with a larger k, since its boxplot flagged the most extreme values
  print(rosnerTest(df$YearsAtCompany, k=20)$all.stats)
}
outlierTest(DF)
## Age : 0
## DailyRate : 0
## DistanceFromHome : 0
## HourlyRate : 0
## MonthlyIncome : 114
## MonthlyRate : 0
## NumCompaniesWorked : 52
## PercentSalaryHike : 0
## TotalWorkingYears : 63
## TrainingTimesLastYear : 238
## YearsAtCompany : 104
## YearsInCurrentRole : 21
## YearsSinceLastPromotion : 107
## YearsWithCurrManager : 14
## MonthlyIncome
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 6502.931 4707.957 19999 191 2.866651 4.133386 FALSE
## 2 1 6493.744 4696.360 19973 747 2.870150 4.133224 FALSE
## 3 2 6484.562 4684.751 19943 852 2.872818 4.133061 FALSE
##
## NumCompaniesWorked
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 2.693197 2.498009 9 5 2.524732 4.133386 FALSE
## 2 1 2.688904 2.493429 9 39 2.531092 4.133224 FALSE
## 3 2 2.684605 2.488826 9 51 2.537500 4.133061 FALSE
##
## TotalWorkingYears
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 11.27959 7.780782 40 127 3.691198 4.133386 FALSE
## 2 1 11.26004 7.747227 40 596 3.709709 4.133224 FALSE
## 3 2 11.24046 7.713431 38 99 3.469213 4.133061 FALSE
##
## TrainingTimesLastYear
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 2.799320 1.289271 6 24 2.482551 4.133386 FALSE
## 2 1 2.797141 1.287000 6 34 2.488625 4.133224 FALSE
## 3 2 2.794959 1.284718 6 42 2.494743 4.133061 FALSE
##
## YearsAtCompany
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 7.008163 6.126525 40 127 5.385081 4.133386 TRUE
## 2 1 6.985705 6.067777 37 99 4.946506 4.133224 TRUE
## 3 2 6.965259 6.019013 36 271 4.823838 4.133061 TRUE
##
## YearsInCurrentRole
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 4.229252 3.623137 18 124 3.800780 4.133386 FALSE
## 2 1 4.219877 3.606494 18 191 3.820919 4.133224 FALSE
## 3 2 4.210490 3.589726 17 232 3.562809 4.133061 FALSE
##
## YearsSinceLastPromotion
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 2.187755 3.222430 15 46 3.975957 4.133386 FALSE
## 2 1 2.179033 3.206124 15 124 3.998899 4.133224 FALSE
## 3 2 2.170300 3.189689 15 127 4.022242 4.133061 FALSE
##
## YearsWithCurrManager
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 4.123129 3.568136 17 29 3.608851 4.133386 FALSE
## 2 1 4.114364 3.553483 17 387 3.626199 4.133224 FALSE
## 3 2 4.105586 3.538727 17 617 3.643801 4.133061 FALSE
## Warning in rosnerTest(df$YearsAtCompany, k = 20): The true Type I error may be larger than assumed.
## Although the help file for 'rosnerTest' has a table with information
## on the estimated Type I error level,
## simulations were not run for k > 10 or k > floor(n/2).
## i Mean.i SD.i Value Obs.Num R.i+1 lambda.i+1 Outlier
## 1 0 7.008163 6.126525 40 127 5.385081 4.133386 TRUE
## 2 1 6.985705 6.067777 37 99 4.946506 4.133224 TRUE
## 3 2 6.965259 6.019013 36 271 4.823838 4.133061 TRUE
## 4 3 6.945467 5.973089 36 1117 4.864239 4.132899 TRUE
## 5 4 6.925648 5.926679 34 562 4.568216 4.132736 TRUE
## 6 5 6.907167 5.886296 33 191 4.432810 4.132573 TRUE
## 7 6 6.889344 5.848630 33 238 4.464405 4.132410 TRUE
## 8 7 6.871497 5.810614 33 915 4.496686 4.132247 TRUE
## 9 8 6.853625 5.772239 33 963 4.529677 4.132084 TRUE
## 10 9 6.835729 5.733498 33 1112 4.563405 4.131921 TRUE
## 11 10 6.817808 5.694383 32 478 4.422286 4.131757 TRUE
## 12 11 6.800548 5.658004 32 750 4.453771 4.131594 TRUE
## 13 12 6.783265 5.621285 32 1087 4.485938 4.131430 TRUE
## 14 13 6.765957 5.584220 31 474 4.339737 4.131266 TRUE
## 15 14 6.749313 5.549868 31 596 4.369597 4.131102 TRUE
## 16 15 6.732646 5.515203 31 654 4.400083 4.130938 TRUE
## 17 16 6.715956 5.480221 30 1139 4.248742 4.130774 TRUE
## 18 17 6.699931 5.447923 29 412 4.093315 4.130610 FALSE
## 19 18 6.684573 5.418244 29 919 4.118572 4.130445 FALSE
## 20 19 6.669194 5.388315 27 63 3.773129 4.130281 FALSE
# removing outliers
DFnoOut <- DF %>%
  filter(YearsAtCompany < 30)
17 outliers (rows with YearsAtCompany of 30 or more) were detected and removed.
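A quick count confirms the number of removed rows:
# rows removed by the outlier filter
nrow(DF) - nrow(DFnoOut)
## [1] 17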
Step 2, Data Scaling
########## Data Scaling
##########
# heatmap before normalization showing range difference in columns
heatmap(sapply(DFnoOut,as.numeric), main = "Before Normalization", Rowv = NA, Colv = NA,labRow = NA)
# normalizing data
DFScaled <- DFnoOut %>%
  mutate_if(is.factor, as.numeric) %>%
  mutate_if(is.numeric, normalize)
# heatmap after normalization showing better range distribution
heatmap(sapply(DFScaled,as.numeric), main = "After Normalization", Rowv = NA, Colv = NA,labRow = NA)
# return to default view
str(DFScaled)
## 'data.frame': 1453 obs. of 31 variables:
## $ Age : num 0.548 0.738 0.452 0.357 0.214 ...
## $ Attrition : logi TRUE FALSE TRUE FALSE FALSE FALSE ...
## $ BusinessTravel : num 0.5 1 0.5 1 0.5 1 0.5 0.5 1 0.5 ...
## $ DailyRate : num 0.716 0.127 0.91 0.923 0.35 ...
## $ Department : num 1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
## $ DistanceFromHome : num 0 0.25 0.0357 0.0714 0.0357 ...
## $ Education : num 0.25 0 0.25 0.75 0 0.25 0.5 0 0.5 0.5 ...
## $ EducationField : num 0.2 0.2 0.8 0.2 0.6 0.2 0.6 0.2 0.2 0.6 ...
## $ EnvironmentSatisfaction : num 0.333 0.667 1 1 0 ...
## $ Gender : num 0 1 1 0 1 1 0 1 1 1 ...
## $ HourlyRate : num 0.914 0.443 0.886 0.371 0.143 ...
## $ JobInvolvement : num 0.667 0.333 0.333 0.667 0.667 ...
## $ JobLevel : num 0.25 0.25 0 0 0 0 0 0 0.5 0.25 ...
## $ JobRole : num 0.875 0.75 0.25 0.75 0.25 0.25 0.25 0.25 0.5 0 ...
## $ JobSatisfaction : num 1 0.333 0.667 0.667 0.333 ...
## $ MaritalStatus : num 1 0.5 1 0.5 0.5 1 0.5 0 1 0.5 ...
## $ MonthlyIncome : num 0.263 0.217 0.057 0.1 0.13 ...
## $ MonthlyRate : num 0.6981 0.916 0.0121 0.8458 0.5837 ...
## $ NumCompaniesWorked : num 0.889 0.111 0.667 0.111 1 ...
## $ OverTime : logi TRUE FALSE TRUE TRUE FALSE FALSE ...
## $ PercentSalaryHike : num 0 0.8571 0.2857 0 0.0714 ...
## $ PerformanceRating : num 0 1 0 0 0 0 1 1 1 0 ...
## $ RelationshipSatisfaction: num 0 1 0.333 0.667 1 ...
## $ StockOptionLevel : num 0 0.333 0 0 0.333 ...
## $ TotalWorkingYears : num 0.216 0.27 0.189 0.216 0.162 ...
## $ TrainingTimesLastYear : num 0 0.5 0.5 0.5 0.5 ...
## $ WorkLifeBalance : num 0 0.667 0.667 0.667 0.667 ...
## $ YearsAtCompany : num 0.207 0.345 0 0.276 0.069 ...
## $ YearsInCurrentRole : num 0.222 0.389 0 0.389 0.111 ...
## $ YearsSinceLastPromotion : num 0 0.0667 0 0.2 0.1333 ...
## $ YearsWithCurrManager : num 0.294 0.412 0 0 0.118 ...
A heatmap is used to visualize the range differences between columns.
The data is scaled with min-max normalization, so every column falls in [0, 1].
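normalize() is assumed here to be heatmaply's min-max scaler; an equivalent hand-rolled version for a single numeric vector would be:
# min-max normalization: rescales a numeric vector to [0, 1]
minMaxNormalize <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
minMaxNormalize(c(18, 36, 60))
## [1] 0.0000000 0.4285714 1.0000000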
Step 3, Using Boruta to measure feature importance for Attrition
########## Feature analysis
##########
featureSelection <- function(df,dv){
# using Boruta to analyze feature importance for one selected column
  # the data has no NA values, so na.omit() is only a safeguard here
# doTrace can be changed to 0 for no console output, 1 for less and 2 for more
boruta_output <- Boruta(as.formula(paste(dv, "~", ".")), data=na.omit(df), doTrace=0)
  # resolving tentative features, then keeping the significant ones
roughFixMod <- TentativeRoughFix(boruta_output)
imps <- attStats(roughFixMod)
imps2 <- imps[imps$decision != 'Rejected', c('meanImp', 'decision')]
cat(sprintf('Feature Importance Score\n'))
print(imps2[order(-imps2$meanImp), ]) # descending sort
plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Feature Importance")
return(imps2[order(-imps2$meanImp), ])
}
imps2 <- featureSelection(DFScaled, "Attrition")
## Feature Importance Score
## meanImp decision
## OverTime 21.664273 Confirmed
## MonthlyIncome 12.778403 Confirmed
## TotalWorkingYears 11.747154 Confirmed
## Age 11.415415 Confirmed
## JobLevel 9.775490 Confirmed
## StockOptionLevel 9.698172 Confirmed
## YearsAtCompany 8.556034 Confirmed
## MaritalStatus 8.011467 Confirmed
## YearsWithCurrManager 7.737806 Confirmed
## YearsInCurrentRole 6.359683 Confirmed
## JobRole 6.214727 Confirmed
## BusinessTravel 5.014753 Confirmed
## EnvironmentSatisfaction 4.410866 Confirmed
## NumCompaniesWorked 4.349563 Confirmed
## JobSatisfaction 3.996361 Confirmed
## JobInvolvement 3.858005 Confirmed
## WorkLifeBalance 3.222004 Confirmed
## Department 2.872305 Confirmed
## YearsSinceLastPromotion 2.518286 Confirmed
Step 4, Data Splitting
########## Feature selection and splitting data
##########
dataSplit <- function(df,i,y){
if (i == 0){
cleanDF <- mutate(df)
}
else {
# selecting features with highest importance
cleanDF <- df[ , names(df) %in% row.names(imps2[1:i,])]
cleanDF[[y]] <- df[[y]]
}
#set seed for repeatable outcome
set.seed(7004)
# splitting dataframe into test and train data
sample <- sample.split(cleanDF[[y]], SplitRatio = .75)
out <- list()
out$train <- subset(cleanDF, sample == TRUE)
out$test <- subset(cleanDF, sample == FALSE)
list2env(out, .GlobalEnv)
print(names(train))
}
# three parameters : dataframe to split, number of top features to keep (0 to keep all), dependent variable
dataSplit(DFScaled,6,"Attrition")
## [1] "Age" "JobLevel" "MonthlyIncome"
## [4] "OverTime" "StockOptionLevel" "TotalWorkingYears"
## [7] "Attrition"
Splitting the data into a train set and a test set.
The function sample.split keeps the TRUE/FALSE ratio the same in both subsets, as checked below.
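A quick sanity check (a sketch; train and test are the global objects created by dataSplit):
# attrition rate should be nearly identical across the full, train and test sets
c(full = mean(DFScaled$Attrition), train = mean(train$Attrition), test = mean(test$Attrition))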
Step 5, applying the models to the data
Question 1 : Who will leave the company? (Predicting Attrition)
Decision Tree and Logistic Regression are used for this question.
decisionTree <- function(y,x){
# changing parameters for model
control <- rpart.control(minsplit = 20, maxdepth = 4, cp = 0)
# building decision tree model
dcmodel <- rpart(as.formula(paste(y, "~", x)), data = train,
method = 'class', control = control)
rpart.plot(dcmodel, extra= 106)
# prediction
dcpredict <-predict(dcmodel, test, type = 'class')
con_mat <- table(test[,y], dcpredict)
print(con_mat)
accuracy_Test <- sum(diag(con_mat)) / sum(con_mat)
print(paste('Accuracy for test', accuracy_Test))
}
decisionTree("Attrition",".")
## dcpredict
## FALSE TRUE
## FALSE 298 7
## TRUE 45 13
## [1] "Accuracy for test 0.856749311294766"
Based on the result above, the decision tree achieves an accuracy of about 0.86 on the test set. The tree visualisation shows that OverTime, MonthlyIncome and Age are the most influential variables in determining who is likely to leave.
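Accuracy alone is flattering here because the classes are imbalanced; sensitivity and specificity can be computed directly from the printed confusion matrix (rows are actual, columns are predicted):
# counts taken from the confusion matrix above
TN <- 298; FP <- 7; FN <- 45; TP <- 13
TP / (TP + FN)  # sensitivity ~0.22: only 13 of 58 actual leavers are caught
TN / (TN + FP)  # specificity ~0.98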
*For this question the splitting is done differently from the Decision Tree above, since that split kept only the top 6 features from the feature importance.
# note: the 1018:1453 indices carry over from the outlier-filtered data, so the last 17 rows of DF go unused here
train2 <- DF[1:1017,]
test2 <- DF[1018:1453,]
model <- glm(Attrition ~ Age+BusinessTravel+StockOptionLevel+
EnvironmentSatisfaction+Gender+JobInvolvement+
JobLevel+JobRole+JobSatisfaction+MaritalStatus+
MonthlyIncome+NumCompaniesWorked+
OverTime+TotalWorkingYears+WorkLifeBalance+
YearsAtCompany+YearsInCurrentRole+
YearsWithCurrManager,family=binomial(link='logit'),data=train2)
summary(model)
##
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + StockOptionLevel +
## EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel +
## JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome +
## NumCompaniesWorked + OverTime + TotalWorkingYears + WorkLifeBalance +
## YearsAtCompany + YearsInCurrentRole + YearsWithCurrManager,
## family = binomial(link = "logit"), data = train2)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.11448 -0.47190 -0.23673 -0.09205 3.06996
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.3119708 1.2835092 0.243 0.807958
## Age -0.0428556 0.0161990 -2.646 0.008155 **
## BusinessTravel.L 1.1757784 0.3324404 3.537 0.000405 ***
## BusinessTravel.Q -0.0311223 0.2192468 -0.142 0.887119
## StockOptionLevel.L -0.3844071 0.4110641 -0.935 0.349710
## StockOptionLevel.Q 0.6445212 0.3725329 1.730 0.083611 .
## StockOptionLevel.C -0.3120749 0.3402087 -0.917 0.358983
## EnvironmentSatisfaction.L -0.9569920 0.2269740 -4.216 2.48e-05 ***
## EnvironmentSatisfaction.Q 0.3001998 0.2208489 1.359 0.174052
## EnvironmentSatisfaction.C -0.1295658 0.2199537 -0.589 0.555821
## GenderMale 0.2620825 0.2233497 1.173 0.240628
## JobInvolvement.L -1.6692244 0.3980256 -4.194 2.74e-05 ***
## JobInvolvement.Q 0.4176760 0.3223002 1.296 0.195002
## JobInvolvement.C -0.3493188 0.2101773 -1.662 0.096509 .
## JobLevel.L 2.8490529 1.3966489 2.040 0.041358 *
## JobLevel.Q 1.0552027 0.5825452 1.811 0.070084 .
## JobLevel.C -0.2905530 0.5244857 -0.554 0.579595
## JobLevel^4 1.1224262 0.4103807 2.735 0.006236 **
## JobRoleHuman Resources 0.2052411 0.8084664 0.254 0.799600
## JobRoleLaboratory Technician 0.7450391 0.6844041 1.089 0.276332
## JobRoleManager -0.8927124 1.0650114 -0.838 0.401908
## JobRoleManufacturing Director -0.3463023 0.6456596 -0.536 0.591714
## JobRoleResearch Director -2.1148384 1.1671887 -1.812 0.070000 .
## JobRoleResearch Scientist -0.2749128 0.6976372 -0.394 0.693535
## JobRoleSales Executive 0.9289974 0.5006575 1.856 0.063517 .
## JobRoleSales Representative 1.1220319 0.7586067 1.479 0.139122
## JobSatisfaction.L -0.8246162 0.2217753 -3.718 0.000201 ***
## JobSatisfaction.Q -0.0497849 0.2243623 -0.222 0.824396
## JobSatisfaction.C -0.5396369 0.2250814 -2.398 0.016507 *
## MaritalStatusMarried 0.4486917 0.3365660 1.333 0.182484
## MaritalStatusSingle 0.7169908 0.4664782 1.537 0.124286
## MonthlyIncome -0.0001333 0.0001134 -1.175 0.239798
## NumCompaniesWorked 0.1830446 0.0482171 3.796 0.000147 ***
## OverTimeTRUE 2.1159567 0.2381157 8.886 < 2e-16 ***
## TotalWorkingYears -0.0566521 0.0364002 -1.556 0.119620
## WorkLifeBalance.L -0.7933512 0.3650188 -2.173 0.029746 *
## WorkLifeBalance.Q 0.6940302 0.3018206 2.299 0.021478 *
## WorkLifeBalance.C 0.2152122 0.2087123 1.031 0.302474
## YearsAtCompany 0.1080266 0.0436496 2.475 0.013329 *
## YearsInCurrentRole -0.0946155 0.0590094 -1.603 0.108847
## YearsWithCurrManager -0.1337650 0.0570066 -2.346 0.018952 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 921.25 on 1016 degrees of freedom
## Residual deviance: 585.52 on 976 degrees of freedom
## AIC: 667.52
##
## Number of Fisher Scoring iterations: 6
Based on the logistic regression results above, BusinessTravel, EnvironmentSatisfaction, JobInvolvement, JobSatisfaction, NumCompaniesWorked and OverTime (TRUE) are highly significant predictors of who is likely to leave.
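Since the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which are easier to interpret (a sketch; model is the glm fitted above):
# odds ratios for the fitted logistic regression
oddsRatios <- exp(coef(model))
# e.g. exp(2.116) is roughly 8.3, so overtime multiplies the odds of leaving about eightfold
sort(oddsRatios, decreasing = TRUE)[1:5]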
Question 2 : Which job roles are more likely to leave?
Decision Tree is used for this question.
*For this question the splitting is done differently from Question 1, since that split kept only the top 6 features from the feature importance.
train1 <- DF[1:1017,]
test1 <- DF[1018:1453,]
decisionTree <- function(){
# changing parameters for model
control <- rpart.control(minsplit = 20, maxdepth = 4, cp = 0)
# building decision tree model
dcmodel <- rpart(Attrition ~ JobRole + MaritalStatus, data = train1,
method = 'class', control = control)
rpart.plot(dcmodel, extra= 106)
# prediction
dcpredict <-predict(dcmodel, test1, type = 'class')
con_mat <- table(test1$Attrition, dcpredict)
print(con_mat)
accuracy_Test <- sum(diag(con_mat)) / sum(con_mat)
print(paste('Accuracy for test', accuracy_Test))
}
decisionTree()
## dcpredict
## FALSE TRUE
## FALSE 365 6
## TRUE 60 5
## [1] "Accuracy for test 0.848623853211009"
The result above shows that Laboratory Technicians are more likely to leave than other job roles, with a test accuracy of about 0.85 (although, as before, recall on leavers is low: only 5 of 65 are caught).
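As a cross-check (a sketch using dplyr from the tidyverse), the raw attrition rate per job role can be tabulated directly from the cleaned data:
# raw attrition rate by job role, highest first
DF %>%
  group_by(JobRole) %>%
  summarise(attritionRate = mean(Attrition), n = n()) %>%
  arrange(desc(attritionRate))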
Regression Modelling
Before applying the model to the dataset, the following preparation steps are needed.
Step 1, Correlation analysis
correlationFunction <- function(df){
# creating a temporary numeric dataframe
temp <- df %>%
mutate_if(is.factor,as.numeric) %>%
mutate_if(is.logical,as.numeric)
# making a correlation matrix and visualizing
tempcor <- cor(temp[, sapply(temp, is.numeric)])
corrplot(tempcor, tl.col = "black",type = "lower",
method = "color", tl.cex=0.7, tl.srt = 45)
}
correlationFunction(DF)
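To read the strongest relationships numerically rather than visually, a column of the correlation matrix can be sorted; a sketch for MonthlyIncome, repeating the same numeric conversion:
# absolute correlations with MonthlyIncome, strongest first (index 1 is MonthlyIncome itself)
temp <- DF %>%
  mutate_if(is.factor, as.numeric) %>%
  mutate_if(is.logical, as.numeric)
sort(abs(cor(temp)[, "MonthlyIncome"]), decreasing = TRUE)[2:6]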
Step 2, Outlier detection
########## Detecting outlier using scatter plot
# interactive scatterplot to identify outlier visually
ggplotly(ggplot(DF, aes(x=TotalWorkingYears, y=MonthlyIncome, color=JobLevel)) +
geom_point(size=1.5) +
scale_color_manual(values=c("darkslategrey","chartreuse3","darkorange1","deepskyblue3","maroon"))+
theme(panel.background = element_rect(fill = 'white')))
DFnoOutI <- DF %>%
  filter(!(TotalWorkingYears > 35 & JobLevel %in% "3")) %>%
  filter(!(TotalWorkingYears < 20 & JobLevel %in% "4"))
ggplot(DFnoOutI, aes(x=TotalWorkingYears, y=MonthlyIncome, color=JobLevel)) +
geom_point(size=2) +
scale_color_manual(values=c("darkslategrey","chartreuse3","darkorange1","deepskyblue3","maroon"))+
theme(panel.background = element_rect(fill = 'white'))
Step 3, Feature Importance
Data is scaled using normalization.
The feature selection function is reused to confirm feature significance.
Data is split into train and test sets.
DFScaledI <- DFnoOutI %>%
  mutate_if(is.factor, as.numeric) %>%
  mutate_if(is.numeric, normalize)
########## confirming significance of jobrole and totalworkingyears to monthly income
##########
imps2 <- featureSelection(DFScaledI, "MonthlyIncome")
## Feature Importance Score
## meanImp decision
## JobLevel 41.633264 Confirmed
## TotalWorkingYears 26.076690 Confirmed
## JobRole 17.190628 Confirmed
## Age 14.453971 Confirmed
## YearsAtCompany 13.770025 Confirmed
## YearsInCurrentRole 10.299587 Confirmed
## NumCompaniesWorked 9.426253 Confirmed
## Department 8.585054 Confirmed
## YearsWithCurrManager 8.147893 Confirmed
## YearsSinceLastPromotion 6.596010 Confirmed
## Attrition 6.121120 Confirmed
########## splitting data to train set and test set
##########
dataSplit(DFScaledI,0,"MonthlyIncome")
## [1] "Age" "Attrition"
## [3] "BusinessTravel" "DailyRate"
## [5] "Department" "DistanceFromHome"
## [7] "Education" "EducationField"
## [9] "EnvironmentSatisfaction" "Gender"
## [11] "HourlyRate" "JobInvolvement"
## [13] "JobLevel" "JobRole"
## [15] "JobSatisfaction" "MaritalStatus"
## [17] "MonthlyIncome" "MonthlyRate"
## [19] "NumCompaniesWorked" "OverTime"
## [21] "PercentSalaryHike" "PerformanceRating"
## [23] "RelationshipSatisfaction" "StockOptionLevel"
## [25] "TotalWorkingYears" "TrainingTimesLastYear"
## [27] "WorkLifeBalance" "YearsAtCompany"
## [29] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [31] "YearsWithCurrManager"
Step 4, applying the model
Q3 : What is the suitable monthly income for an employee? (Predicting Monthly Income)
A Ridge Regression model is used for this question.
########## regression model using ridge regression
##########
ridgeRegression <- function(){
trainY<-train$MonthlyIncome
trainX<-data.matrix(subset(train, select=c(JobLevel, TotalWorkingYears)))
lambda_seq <- 10^seq(2, -2, by = -.1)
# Using glmnet function - setting alpha to 0 to build the ridge regression
cv_fit <- cv.glmnet(trainX, trainY, alpha = 0, lambda = lambda_seq)
plot(cv_fit)
opt_lambda <- cv_fit$lambda.min
cat(sprintf('Lambda : %f\n',opt_lambda))
# prediction
testY <- test$MonthlyIncome
testX <- data.matrix(subset(test, select=c(JobLevel, TotalWorkingYears)))
y_predicted <- predict(cv_fit, s = opt_lambda, newx = testX)
# Sum of Squares Total and Error
sst <- sum((testY - mean(testY))^2)
sse <- sum((y_predicted - testY)^2)
# R squared
rsq <- 1 - sse / sst
cat(sprintf('R Squared : %f\n',rsq))
plot(y_predicted,test$MonthlyIncome, pch = 16, col = c('red','black')[as.factor(!test$Attrition)], xlab="predicted",ylab="actual")
abline(0,1)
}
ridgeRegression()
## Lambda : 0.010000
## R Squared : 0.901915
Based on the above result, the ridge regression using only JobLevel and TotalWorkingYears explains about 90% of the variance in MonthlyIncome (R² ≈ 0.90), so these two features alone give a strong estimate of a suitable monthly income for an employee.
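As a complement to R², a short sketch computing the RMSE of the ridge predictions (on the normalized income scale; it reuses the train and test objects from the split above):
# refitting the same ridge model and reporting RMSE
trainX <- data.matrix(subset(train, select = c(JobLevel, TotalWorkingYears)))
testX  <- data.matrix(subset(test,  select = c(JobLevel, TotalWorkingYears)))
cv_fit <- cv.glmnet(trainX, train$MonthlyIncome, alpha = 0, lambda = 10^seq(2, -2, by = -.1))
pred   <- predict(cv_fit, s = cv_fit$lambda.min, newx = testX)
sqrt(mean((pred - test$MonthlyIncome)^2))  # RMSE on the [0, 1] scale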