Introduction

High turnover rates can negatively affect a company and its employees in many ways. With the constant need to hire and train new employees, it is easy to veer from true mission and vision of the organization.

By retaining employees, companies can provide a higher caliber workforce that positively affects the bottom line. Businesses can lower turnover rates by providing adequate training, rewarding employees for a job well done and creating a company culture of trust.

Therefore, this study is to apply machine learning models to understand what kind of employee are likely to turnover and is there a specific job role that is likely to have turnover.

The dataset on employee turnover is taken from Kaggle which was provided by IBM Analytics with a total of 1470 observations and 31 variables.

Below are the questions proposed for this study;

Category : Classification

  1. Who among the employee will leave the company?

  2. Which job role that an employee will likely leave the company?

Category : Regression

  1. What is the suitable monthly income for an employee?

Data Exploratory and Analysis

To start, a few packages and/or library would need to be installed/called : Caret, Boruta, EnvStats, tidyverse, heatmaply, caTools, rpart.plor, corrplot and kernlab. After installing, proceed to load data onto R.

Reading CSV

DF <- read.csv("WA_Fn-UseC_-HR-Employee-Attrition.csv", fileEncoding="UTF-8-BOM")

str(DF)
## 'data.frame':    1470 obs. of  35 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : chr  "Yes" "No" "Yes" "No" ...
##  $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : chr  "Sales" "Research & Development" "Research & Development" "Research & Development" ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : chr  "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : chr  "Female" "Male" "Male" "Female" ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : chr  "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : chr  "Single" "Married" "Single" "Married" ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
##  $ OverTime                : chr  "Yes" "No" "Yes" "Yes" ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...
summary(DF)
##       Age         Attrition         BusinessTravel       DailyRate     
##  Min.   :18.00   Length:1470        Length:1470        Min.   : 102.0  
##  1st Qu.:30.00   Class :character   Class :character   1st Qu.: 465.0  
##  Median :36.00   Mode  :character   Mode  :character   Median : 802.0  
##  Mean   :36.92                                         Mean   : 802.5  
##  3rd Qu.:43.00                                         3rd Qu.:1157.0  
##  Max.   :60.00                                         Max.   :1499.0  
##   Department        DistanceFromHome   Education     EducationField    
##  Length:1470        Min.   : 1.000   Min.   :1.000   Length:1470       
##  Class :character   1st Qu.: 2.000   1st Qu.:2.000   Class :character  
##  Mode  :character   Median : 7.000   Median :3.000   Mode  :character  
##                     Mean   : 9.193   Mean   :2.913                     
##                     3rd Qu.:14.000   3rd Qu.:4.000                     
##                     Max.   :29.000   Max.   :5.000                     
##  EmployeeCount EmployeeNumber   EnvironmentSatisfaction    Gender         
##  Min.   :1     Min.   :   1.0   Min.   :1.000           Length:1470       
##  1st Qu.:1     1st Qu.: 491.2   1st Qu.:2.000           Class :character  
##  Median :1     Median :1020.5   Median :3.000           Mode  :character  
##  Mean   :1     Mean   :1024.9   Mean   :2.722                             
##  3rd Qu.:1     3rd Qu.:1555.8   3rd Qu.:4.000                             
##  Max.   :1     Max.   :2068.0   Max.   :4.000                             
##    HourlyRate     JobInvolvement    JobLevel       JobRole         
##  Min.   : 30.00   Min.   :1.00   Min.   :1.000   Length:1470       
##  1st Qu.: 48.00   1st Qu.:2.00   1st Qu.:1.000   Class :character  
##  Median : 66.00   Median :3.00   Median :2.000   Mode  :character  
##  Mean   : 65.89   Mean   :2.73   Mean   :2.064                     
##  3rd Qu.: 83.75   3rd Qu.:3.00   3rd Qu.:3.000                     
##  Max.   :100.00   Max.   :4.00   Max.   :5.000                     
##  JobSatisfaction MaritalStatus      MonthlyIncome    MonthlyRate   
##  Min.   :1.000   Length:1470        Min.   : 1009   Min.   : 2094  
##  1st Qu.:2.000   Class :character   1st Qu.: 2911   1st Qu.: 8047  
##  Median :3.000   Mode  :character   Median : 4919   Median :14236  
##  Mean   :2.729                      Mean   : 6503   Mean   :14313  
##  3rd Qu.:4.000                      3rd Qu.: 8379   3rd Qu.:20462  
##  Max.   :4.000                      Max.   :19999   Max.   :26999  
##  NumCompaniesWorked    Over18            OverTime         PercentSalaryHike
##  Min.   :0.000      Length:1470        Length:1470        Min.   :11.00    
##  1st Qu.:1.000      Class :character   Class :character   1st Qu.:12.00    
##  Median :2.000      Mode  :character   Mode  :character   Median :14.00    
##  Mean   :2.693                                            Mean   :15.21    
##  3rd Qu.:4.000                                            3rd Qu.:18.00    
##  Max.   :9.000                                            Max.   :25.00    
##  PerformanceRating RelationshipSatisfaction StandardHours StockOptionLevel
##  Min.   :3.000     Min.   :1.000            Min.   :80    Min.   :0.0000  
##  1st Qu.:3.000     1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000  
##  Median :3.000     Median :3.000            Median :80    Median :1.0000  
##  Mean   :3.154     Mean   :2.712            Mean   :80    Mean   :0.7939  
##  3rd Qu.:3.000     3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000  
##  Max.   :4.000     Max.   :4.000            Max.   :80    Max.   :3.0000  
##  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany  
##  Min.   : 0.00     Min.   :0.000         Min.   :1.000   Min.   : 0.000  
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000  
##  Median :10.00     Median :3.000         Median :3.000   Median : 5.000  
##  Mean   :11.28     Mean   :2.799         Mean   :2.761   Mean   : 7.008  
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000  
##  Max.   :40.00     Max.   :6.000         Max.   :4.000   Max.   :40.000  
##  YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000     Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 2.000     1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 3.000     Median : 1.000          Median : 3.000      
##  Mean   : 4.229     Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 7.000     3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :18.000     Max.   :15.000          Max.   :17.000

Checking for unique values

# checking unique value in every column
DF %>%
  lapply(unique) %>%
  lengths()
##                      Age                Attrition           BusinessTravel 
##                       43                        2                        3 
##                DailyRate               Department         DistanceFromHome 
##                      886                        3                       29 
##                Education           EducationField            EmployeeCount 
##                        5                        6                        1 
##           EmployeeNumber  EnvironmentSatisfaction                   Gender 
##                     1470                        4                        2 
##               HourlyRate           JobInvolvement                 JobLevel 
##                       71                        4                        5 
##                  JobRole          JobSatisfaction            MaritalStatus 
##                        9                        4                        3 
##            MonthlyIncome              MonthlyRate       NumCompaniesWorked 
##                     1349                     1427                       10 
##                   Over18                 OverTime        PercentSalaryHike 
##                        1                        2                       15 
##        PerformanceRating RelationshipSatisfaction            StandardHours 
##                        2                        4                        1 
##         StockOptionLevel        TotalWorkingYears    TrainingTimesLastYear 
##                        4                       40                        7 
##          WorkLifeBalance           YearsAtCompany       YearsInCurrentRole 
##                        4                       37                       19 
##  YearsSinceLastPromotion     YearsWithCurrManager 
##                       16                       18
# checking for inconsistency
DF %>%
  subset(select=c("BusinessTravel","Department","EducationField","JobRole","MaritalStatus")) %>%
  lapply(unique)
## $BusinessTravel
## [1] "Travel_Rarely"     "Travel_Frequently" "Non-Travel"       
## 
## $Department
## [1] "Sales"                  "Research & Development" "Human Resources"       
## 
## $EducationField
## [1] "Life Sciences"    "Other"            "Medical"          "Marketing"       
## [5] "Technical Degree" "Human Resources" 
## 
## $JobRole
## [1] "Sales Executive"           "Research Scientist"       
## [3] "Laboratory Technician"     "Manufacturing Director"   
## [5] "Healthcare Representative" "Manager"                  
## [7] "Sales Representative"      "Research Director"        
## [9] "Human Resources"          
## 
## $MaritalStatus
## [1] "Single"   "Married"  "Divorced"
# checking for missing values
sapply(DF, function(x) sum(is.na(x)))
##                      Age                Attrition           BusinessTravel 
##                        0                        0                        0 
##                DailyRate               Department         DistanceFromHome 
##                        0                        0                        0 
##                Education           EducationField            EmployeeCount 
##                        0                        0                        0 
##           EmployeeNumber  EnvironmentSatisfaction                   Gender 
##                        0                        0                        0 
##               HourlyRate           JobInvolvement                 JobLevel 
##                        0                        0                        0 
##                  JobRole          JobSatisfaction            MaritalStatus 
##                        0                        0                        0 
##            MonthlyIncome              MonthlyRate       NumCompaniesWorked 
##                        0                        0                        0 
##                   Over18                 OverTime        PercentSalaryHike 
##                        0                        0                        0 
##        PerformanceRating RelationshipSatisfaction            StandardHours 
##                        0                        0                        0 
##         StockOptionLevel        TotalWorkingYears    TrainingTimesLastYear 
##                        0                        0                        0 
##          WorkLifeBalance           YearsAtCompany       YearsInCurrentRole 
##                        0                        0                        0 
##  YearsSinceLastPromotion     YearsWithCurrManager 
##                        0                        0

Data Cleaning

##########
dataCleaning <- function(df) {
  df %>%
    # dropping columns with single value
    Filter(function(x) min(x)!=max(x),.) %>%
    # remove autonumber column
    .[, names(.) != "EmployeeNumber"] %>%
    
    mutate(
      # changing YesNo data to logical data
      OverTime = ifelse(OverTime == "No",FALSE,TRUE),
      Attrition = ifelse(Attrition == "No",FALSE,TRUE),
      
      # changing char data to ordered factor (ordinal)
      BusinessTravel = factor(BusinessTravel,
                              levels = c("Non-Travel","Travel_Rarely","Travel_Frequently"),
                              ordered = TRUE),
      Education = factor(Education, ordered = TRUE),
      EnvironmentSatisfaction = factor(EnvironmentSatisfaction, ordered = TRUE),
      JobInvolvement = factor(JobInvolvement, ordered = TRUE),
      JobLevel = factor(JobLevel, ordered = TRUE),
      JobSatisfaction = factor(JobSatisfaction, ordered = TRUE),
      PerformanceRating = factor(PerformanceRating, ordered = TRUE),
      RelationshipSatisfaction = factor(RelationshipSatisfaction, ordered = TRUE),
      StockOptionLevel = factor(StockOptionLevel, ordered = TRUE),
      WorkLifeBalance = factor(WorkLifeBalance, ordered = TRUE)
    ) %>%
    
    # changing other char columns to unordered factor (nominal)
    mutate_if(is.character,as.factor)
}

DF <- dataCleaning(DF)

# recheck data is correctly cleaned
str(DF)
## 'data.frame':    1470 obs. of  31 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : logi  TRUE FALSE TRUE FALSE FALSE FALSE ...
##  $ BusinessTravel          : Ord.factor w/ 3 levels "Non-Travel"<"Travel_Rarely"<..: 2 3 2 3 2 3 2 2 3 2 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : logi  TRUE FALSE TRUE TRUE FALSE FALSE ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : Ord.factor w/ 2 levels "3"<"4": 1 2 1 1 1 1 2 2 2 1 ...
##  $ RelationshipSatisfaction: Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 1 2 1 1 2 1 4 2 1 3 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

Based on the result above, there is no inconsistency in categorical data and there is no missing value in any column

Classification Modelling

Before starting on the modelling few things needed to be done before applying the model.

Step 1, Checking for outlier

outlierTest <- function(df){
  # outlier from boxplot
  DFo <- list()
  for (i in names(select_if(df, is.numeric))) {
    cat(sprintf('%-24s : %d\n', i, length(boxplot.stats(df[,i])$out)))
    if (length(boxplot.stats(df[,i])$out)>0){
      DFo <- c(DFo,i)
    }
  }
  
  # testing outliers
  for (i in DFo) {
    cat(sprintf('%s\n',i))
    print(rosnerTest(DF[,i])$all.stats)
    cat(sprintf('\n'))
  }

  # retesting specific column for outlier
  print(rosnerTest(df$YearsAtCompany, k=20)$all.stats)
}

outlierTest(DF)
## Age                      : 0
## DailyRate                : 0
## DistanceFromHome         : 0
## HourlyRate               : 0
## MonthlyIncome            : 114
## MonthlyRate              : 0
## NumCompaniesWorked       : 52
## PercentSalaryHike        : 0
## TotalWorkingYears        : 63
## TrainingTimesLastYear    : 238
## YearsAtCompany           : 104
## YearsInCurrentRole       : 21
## YearsSinceLastPromotion  : 107
## YearsWithCurrManager     : 14
## MonthlyIncome
##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 6502.931 4707.957 19999     191 2.866651   4.133386   FALSE
## 2 1 6493.744 4696.360 19973     747 2.870150   4.133224   FALSE
## 3 2 6484.562 4684.751 19943     852 2.872818   4.133061   FALSE
## 
## NumCompaniesWorked
##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 2.693197 2.498009     9       5 2.524732   4.133386   FALSE
## 2 1 2.688904 2.493429     9      39 2.531092   4.133224   FALSE
## 3 2 2.684605 2.488826     9      51 2.537500   4.133061   FALSE
## 
## TotalWorkingYears
##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 11.27959 7.780782    40     127 3.691198   4.133386   FALSE
## 2 1 11.26004 7.747227    40     596 3.709709   4.133224   FALSE
## 3 2 11.24046 7.713431    38      99 3.469213   4.133061   FALSE
## 
## TrainingTimesLastYear
##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 2.799320 1.289271     6      24 2.482551   4.133386   FALSE
## 2 1 2.797141 1.287000     6      34 2.488625   4.133224   FALSE
## 3 2 2.794959 1.284718     6      42 2.494743   4.133061   FALSE
## 
## YearsAtCompany
##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 7.008163 6.126525    40     127 5.385081   4.133386    TRUE
## 2 1 6.985705 6.067777    37      99 4.946506   4.133224    TRUE
## 3 2 6.965259 6.019013    36     271 4.823838   4.133061    TRUE
## 
## YearsInCurrentRole
##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 4.229252 3.623137    18     124 3.800780   4.133386   FALSE
## 2 1 4.219877 3.606494    18     191 3.820919   4.133224   FALSE
## 3 2 4.210490 3.589726    17     232 3.562809   4.133061   FALSE
## 
## YearsSinceLastPromotion
##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 2.187755 3.222430    15      46 3.975957   4.133386   FALSE
## 2 1 2.179033 3.206124    15     124 3.998899   4.133224   FALSE
## 3 2 2.170300 3.189689    15     127 4.022242   4.133061   FALSE
## 
## YearsWithCurrManager
##   i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1 0 4.123129 3.568136    17      29 3.608851   4.133386   FALSE
## 2 1 4.114364 3.553483    17     387 3.626199   4.133224   FALSE
## 3 2 4.105586 3.538727    17     617 3.643801   4.133061   FALSE
## Warning in rosnerTest(df$YearsAtCompany, k = 20): The true Type I error may be larger than assumed.
## Although the help file for 'rosnerTest' has a table with information
## on the estimated Type I error level,
## simulations were not run for k > 10 or k > floor(n/2).
##     i   Mean.i     SD.i Value Obs.Num    R.i+1 lambda.i+1 Outlier
## 1   0 7.008163 6.126525    40     127 5.385081   4.133386    TRUE
## 2   1 6.985705 6.067777    37      99 4.946506   4.133224    TRUE
## 3   2 6.965259 6.019013    36     271 4.823838   4.133061    TRUE
## 4   3 6.945467 5.973089    36    1117 4.864239   4.132899    TRUE
## 5   4 6.925648 5.926679    34     562 4.568216   4.132736    TRUE
## 6   5 6.907167 5.886296    33     191 4.432810   4.132573    TRUE
## 7   6 6.889344 5.848630    33     238 4.464405   4.132410    TRUE
## 8   7 6.871497 5.810614    33     915 4.496686   4.132247    TRUE
## 9   8 6.853625 5.772239    33     963 4.529677   4.132084    TRUE
## 10  9 6.835729 5.733498    33    1112 4.563405   4.131921    TRUE
## 11 10 6.817808 5.694383    32     478 4.422286   4.131757    TRUE
## 12 11 6.800548 5.658004    32     750 4.453771   4.131594    TRUE
## 13 12 6.783265 5.621285    32    1087 4.485938   4.131430    TRUE
## 14 13 6.765957 5.584220    31     474 4.339737   4.131266    TRUE
## 15 14 6.749313 5.549868    31     596 4.369597   4.131102    TRUE
## 16 15 6.732646 5.515203    31     654 4.400083   4.130938    TRUE
## 17 16 6.715956 5.480221    30    1139 4.248742   4.130774    TRUE
## 18 17 6.699931 5.447923    29     412 4.093315   4.130610   FALSE
## 19 18 6.684573 5.418244    29     919 4.118572   4.130445   FALSE
## 20 19 6.669194 5.388315    27      63 3.773129   4.130281   FALSE
# removing outlier
DFnoOut <-  DF %>%
  mutate() %>%
  filter(YearsAtCompany < 30)

17 outliers detected and they are removed

Step 2, Data Scaling

##########  Data Scaling
##########

# heatmap before normalization showing range difference in columns
heatmap(sapply(DFnoOut,as.numeric), main = "Before Normalization", Rowv = NA, Colv = NA,labRow = NA)

# normalizing data
DFScaled <- DFnoOut %>%
  mutate() %>%
  mutate_if(is.factor, as.numeric) %>%
  mutate_if(is.numeric, normalize) 

# heatmap after normalization showing better range distribution
heatmap(sapply(DFScaled,as.numeric), main = "After Normalization", Rowv = NA, Colv = NA,labRow = NA)

# return to default view

str(DFScaled)
## 'data.frame':    1453 obs. of  31 variables:
##  $ Age                     : num  0.548 0.738 0.452 0.357 0.214 ...
##  $ Attrition               : logi  TRUE FALSE TRUE FALSE FALSE FALSE ...
##  $ BusinessTravel          : num  0.5 1 0.5 1 0.5 1 0.5 0.5 1 0.5 ...
##  $ DailyRate               : num  0.716 0.127 0.91 0.923 0.35 ...
##  $ Department              : num  1 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 0.5 ...
##  $ DistanceFromHome        : num  0 0.25 0.0357 0.0714 0.0357 ...
##  $ Education               : num  0.25 0 0.25 0.75 0 0.25 0.5 0 0.5 0.5 ...
##  $ EducationField          : num  0.2 0.2 0.8 0.2 0.6 0.2 0.6 0.2 0.2 0.6 ...
##  $ EnvironmentSatisfaction : num  0.333 0.667 1 1 0 ...
##  $ Gender                  : num  0 1 1 0 1 1 0 1 1 1 ...
##  $ HourlyRate              : num  0.914 0.443 0.886 0.371 0.143 ...
##  $ JobInvolvement          : num  0.667 0.333 0.333 0.667 0.667 ...
##  $ JobLevel                : num  0.25 0.25 0 0 0 0 0 0 0.5 0.25 ...
##  $ JobRole                 : num  0.875 0.75 0.25 0.75 0.25 0.25 0.25 0.25 0.5 0 ...
##  $ JobSatisfaction         : num  1 0.333 0.667 0.667 0.333 ...
##  $ MaritalStatus           : num  1 0.5 1 0.5 0.5 1 0.5 0 1 0.5 ...
##  $ MonthlyIncome           : num  0.263 0.217 0.057 0.1 0.13 ...
##  $ MonthlyRate             : num  0.6981 0.916 0.0121 0.8458 0.5837 ...
##  $ NumCompaniesWorked      : num  0.889 0.111 0.667 0.111 1 ...
##  $ OverTime                : logi  TRUE FALSE TRUE TRUE FALSE FALSE ...
##  $ PercentSalaryHike       : num  0 0.8571 0.2857 0 0.0714 ...
##  $ PerformanceRating       : num  0 1 0 0 0 0 1 1 1 0 ...
##  $ RelationshipSatisfaction: num  0 1 0.333 0.667 1 ...
##  $ StockOptionLevel        : num  0 0.333 0 0 0.333 ...
##  $ TotalWorkingYears       : num  0.216 0.27 0.189 0.216 0.162 ...
##  $ TrainingTimesLastYear   : num  0 0.5 0.5 0.5 0.5 ...
##  $ WorkLifeBalance         : num  0 0.667 0.667 0.667 0.667 ...
##  $ YearsAtCompany          : num  0.207 0.345 0 0.276 0.069 ...
##  $ YearsInCurrentRole      : num  0.222 0.389 0 0.389 0.111 ...
##  $ YearsSinceLastPromotion : num  0 0.0667 0 0.2 0.1333 ...
##  $ YearsWithCurrManager    : num  0.294 0.412 0 0 0.118 ...

Step 3, Using Boruta to measure feature importance for Attrition

##########  Feature analysis
##########
featureSelection <- function(df,dv){
  # using Boruta to analyze feature importance for one selected column
  # the data has no na, na.omit() is not necessary
  # doTrace can be changed to 0 for no console output, 1 for less and 2 for more
  boruta_output <- Boruta(as.formula(paste(dv, "~", ".")), data=na.omit(df), doTrace=0)  
  names(boruta_output)
  
  # Get significant features with tentative
  roughFixMod <- TentativeRoughFix(boruta_output)
  imps <- attStats(roughFixMod)
  imps2 <- imps[imps$decision != 'Rejected', c('meanImp', 'decision')]
  cat(sprintf('Feature Importance Score\n'))
  print(imps2[order(-imps2$meanImp), ])  # descending sort
  plot(boruta_output, cex.axis=.7, las=2, xlab="", main="Feature Importance")
  return(imps2[order(-imps2$meanImp), ])
}

imps2 <- featureSelection(DFScaled, "Attrition")
## Feature Importance Score
##                           meanImp  decision
## OverTime                21.664273 Confirmed
## MonthlyIncome           12.778403 Confirmed
## TotalWorkingYears       11.747154 Confirmed
## Age                     11.415415 Confirmed
## JobLevel                 9.775490 Confirmed
## StockOptionLevel         9.698172 Confirmed
## YearsAtCompany           8.556034 Confirmed
## MaritalStatus            8.011467 Confirmed
## YearsWithCurrManager     7.737806 Confirmed
## YearsInCurrentRole       6.359683 Confirmed
## JobRole                  6.214727 Confirmed
## BusinessTravel           5.014753 Confirmed
## EnvironmentSatisfaction  4.410866 Confirmed
## NumCompaniesWorked       4.349563 Confirmed
## JobSatisfaction          3.996361 Confirmed
## JobInvolvement           3.858005 Confirmed
## WorkLifeBalance          3.222004 Confirmed
## Department               2.872305 Confirmed
## YearsSinceLastPromotion  2.518286 Confirmed

Step 4, Data Splitting

########## Feature selection and splitting data
##########
dataSplit <- function(df,i,y){
  
  if (i == 0){
    cleanDF <- mutate(df)
  }
  else {
    # selecting features with highest importance
    cleanDF <- df[ , names(df) %in% row.names(imps2[1:i,])]
    cleanDF[[y]] <- df[[y]]
  }

  #set seed for repeatable outcome
  set.seed(7004)
  # splitting dataframe into test and train data
  sample <- sample.split(cleanDF[[y]], SplitRatio = .75)
  out <- list()
  out$train <- subset(cleanDF, sample == TRUE)
  out$test <- subset(cleanDF, sample == FALSE)
  list2env(out, .GlobalEnv)
  print(names(train))
}

# two parameters : dataframe to split, number of features to choose. 0 to choose all
dataSplit(DFScaled,6,"Attrition")
## [1] "Age"               "JobLevel"          "MonthlyIncome"    
## [4] "OverTime"          "StockOptionLevel"  "TotalWorkingYears"
## [7] "Attrition"

Step 5, performing modelling onto the data

Question 1 : Who will leave the company? (Predicting Attrition)

Decision Tree and Logistic Regression is used for this question.

decisionTree <- function(y,x){
  # changing parameters for model
  control <- rpart.control(minsplit = 20, maxdepth = 4, cp = 0)
  # building decision tree model
  dcmodel <- rpart(as.formula(paste(y, "~", x)), data = train,
                   method = 'class', control = control)
  rpart.plot(dcmodel, extra= 106)
  
  # prediction
  dcpredict <-predict(dcmodel, test, type = 'class')
  con_mat <- table(test[,y], dcpredict)
  print(con_mat)
  accuracy_Test <- sum(diag(con_mat)) / sum(con_mat)
  print(paste('Accuracy for test', accuracy_Test))
}
decisionTree("Attrition",".")

##        dcpredict
##         FALSE TRUE
##   FALSE   298    7
##   TRUE     45   13
## [1] "Accuracy for test 0.856749311294766"

Based on the above result, the decision tree shows a high accuracy of 0.85 on prediction. The visualisation shows that an employee with no overtime, monthly income and age is a significant variable in determining who is likely to turnover.

##For this question the splitting is done differently from Decision Tree due to previous ones only choose 6 features from the feature importance.

train2 <- DF[1:1017,]
test2<-DF[1018:1453,]

model <- glm(Attrition ~ Age+BusinessTravel+StockOptionLevel+
               EnvironmentSatisfaction+Gender+JobInvolvement+
               JobLevel+JobRole+JobSatisfaction+MaritalStatus+
               MonthlyIncome+NumCompaniesWorked+
               OverTime+TotalWorkingYears+WorkLifeBalance+ 
               YearsAtCompany+YearsInCurrentRole+
               YearsWithCurrManager,family=binomial(link='logit'),data=train2)
summary(model)
## 
## Call:
## glm(formula = Attrition ~ Age + BusinessTravel + StockOptionLevel + 
##     EnvironmentSatisfaction + Gender + JobInvolvement + JobLevel + 
##     JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + 
##     NumCompaniesWorked + OverTime + TotalWorkingYears + WorkLifeBalance + 
##     YearsAtCompany + YearsInCurrentRole + YearsWithCurrManager, 
##     family = binomial(link = "logit"), data = train2)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.11448  -0.47190  -0.23673  -0.09205   3.06996  
## 
## Coefficients:
##                                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                    0.3119708  1.2835092   0.243 0.807958    
## Age                           -0.0428556  0.0161990  -2.646 0.008155 ** 
## BusinessTravel.L               1.1757784  0.3324404   3.537 0.000405 ***
## BusinessTravel.Q              -0.0311223  0.2192468  -0.142 0.887119    
## StockOptionLevel.L            -0.3844071  0.4110641  -0.935 0.349710    
## StockOptionLevel.Q             0.6445212  0.3725329   1.730 0.083611 .  
## StockOptionLevel.C            -0.3120749  0.3402087  -0.917 0.358983    
## EnvironmentSatisfaction.L     -0.9569920  0.2269740  -4.216 2.48e-05 ***
## EnvironmentSatisfaction.Q      0.3001998  0.2208489   1.359 0.174052    
## EnvironmentSatisfaction.C     -0.1295658  0.2199537  -0.589 0.555821    
## GenderMale                     0.2620825  0.2233497   1.173 0.240628    
## JobInvolvement.L              -1.6692244  0.3980256  -4.194 2.74e-05 ***
## JobInvolvement.Q               0.4176760  0.3223002   1.296 0.195002    
## JobInvolvement.C              -0.3493188  0.2101773  -1.662 0.096509 .  
## JobLevel.L                     2.8490529  1.3966489   2.040 0.041358 *  
## JobLevel.Q                     1.0552027  0.5825452   1.811 0.070084 .  
## JobLevel.C                    -0.2905530  0.5244857  -0.554 0.579595    
## JobLevel^4                     1.1224262  0.4103807   2.735 0.006236 ** 
## JobRoleHuman Resources         0.2052411  0.8084664   0.254 0.799600    
## JobRoleLaboratory Technician   0.7450391  0.6844041   1.089 0.276332    
## JobRoleManager                -0.8927124  1.0650114  -0.838 0.401908    
## JobRoleManufacturing Director -0.3463023  0.6456596  -0.536 0.591714    
## JobRoleResearch Director      -2.1148384  1.1671887  -1.812 0.070000 .  
## JobRoleResearch Scientist     -0.2749128  0.6976372  -0.394 0.693535    
## JobRoleSales Executive         0.9289974  0.5006575   1.856 0.063517 .  
## JobRoleSales Representative    1.1220319  0.7586067   1.479 0.139122    
## JobSatisfaction.L             -0.8246162  0.2217753  -3.718 0.000201 ***
## JobSatisfaction.Q             -0.0497849  0.2243623  -0.222 0.824396    
## JobSatisfaction.C             -0.5396369  0.2250814  -2.398 0.016507 *  
## MaritalStatusMarried           0.4486917  0.3365660   1.333 0.182484    
## MaritalStatusSingle            0.7169908  0.4664782   1.537 0.124286    
## MonthlyIncome                 -0.0001333  0.0001134  -1.175 0.239798    
## NumCompaniesWorked             0.1830446  0.0482171   3.796 0.000147 ***
## OverTimeTRUE                   2.1159567  0.2381157   8.886  < 2e-16 ***
## TotalWorkingYears             -0.0566521  0.0364002  -1.556 0.119620    
## WorkLifeBalance.L             -0.7933512  0.3650188  -2.173 0.029746 *  
## WorkLifeBalance.Q              0.6940302  0.3018206   2.299 0.021478 *  
## WorkLifeBalance.C              0.2152122  0.2087123   1.031 0.302474    
## YearsAtCompany                 0.1080266  0.0436496   2.475 0.013329 *  
## YearsInCurrentRole            -0.0946155  0.0590094  -1.603 0.108847    
## YearsWithCurrManager          -0.1337650  0.0570066  -2.346 0.018952 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 921.25  on 1016  degrees of freedom
## Residual deviance: 585.52  on  976  degrees of freedom
## AIC: 667.52
## 
## Number of Fisher Scoring iterations: 6

Based on Logistic Regression result above, BusinessTravel,EnvironmentSatisfaction,JobInvolvement,JobSatisfaction,NumofCompaniesWorked and OverTime(TRUE) are highly significant whereby it shows who is likely to turnover.

Question 2 : Which job roles are more likely to leave?

Decision Tree is used for this question.

*For this question the splitting is done differently from Q2 due to previous ones only choose 6 features from the feature importance.

train1 <- DF[1:1017,]
test1<-DF[1018:1453,]

decisionTree <- function(){
  # changing parameters for model
  control <- rpart.control(minsplit = 20, maxdepth = 4, cp = 0)
  # building decision tree model
  dcmodel <- rpart(Attrition ~ JobRole + MaritalStatus, data = train1,
                   method = 'class', control = control)
  rpart.plot(dcmodel, extra= 106)
  
  # prediction
  dcpredict <-predict(dcmodel, test1, type = 'class')
  con_mat <- table(test1$Attrition, dcpredict)
  print(con_mat)
  accuracy_Test <- sum(diag(con_mat)) / sum(con_mat)
  print(paste('Accuracy for test', accuracy_Test))
}

decisionTree()

##        dcpredict
##         FALSE TRUE
##   FALSE   365    6
##   TRUE     60    5
## [1] "Accuracy for test 0.848623853211009"

The result above shows that Laboratory Technician are likely to have turnover compared to other job role and the result also shows an accuracy of 0.84

Regression Modelling

Before applying the model onto the dataset, following steps are what needed to be done before the modelling.

Step 1, correlation analysis

correlationFunction <- function(df){
  
  # creating a temporary numeric dataframe
  temp <- df %>%
    mutate() %>%
    mutate_if(is.factor,as.numeric) %>%
    mutate_if(is.logical,as.numeric)
  
  # making a correlation matrix and visualizing
  tempcor <- cor(temp[, sapply(temp, is.numeric)])
  corrplot(tempcor, tl.col = "black",type = "lower",
           method = "color", tl.cex=0.7, tl.srt = 45)
}

correlationFunction(DF)

Step 2, Outlier detection

########## Detecting outlier using scatter plot
# interactive scatterplot to identify outlier visually
ggplotly(ggplot(DF, aes(x=TotalWorkingYears, y=MonthlyIncome, color=JobLevel)) +
           geom_point(size=1.5) + 
           scale_color_manual(values=c("darkslategrey","chartreuse3","darkorange1","deepskyblue3","maroon"))+
           theme(panel.background = element_rect(fill = 'white')))
DFnoOutI <- DF %>%
  mutate() %>%
  filter(!(TotalWorkingYears > 35 & JobLevel %in% "3")) %>%
  filter(!(TotalWorkingYears < 20 & JobLevel %in% "4"))


ggplot(DFnoOutI, aes(x=TotalWorkingYears, y=MonthlyIncome, color=JobLevel)) +
  geom_point(size=2) + 
  scale_color_manual(values=c("darkslategrey","chartreuse3","darkorange1","deepskyblue3","maroon"))+
  theme(panel.background = element_rect(fill = 'white'))

Step 3, Feature Importance

DFScaledI <- DFnoOutI %>%
  mutate() %>%
  mutate_if(is.factor, as.numeric) %>%
  mutate_if(is.numeric, normalize) 


########## confirming significance of jobrole and totalworkingyears to monthly income
##########
imps2 <- featureSelection(DFScaledI, "MonthlyIncome")
## Feature Importance Score
##                           meanImp  decision
## JobLevel                41.633264 Confirmed
## TotalWorkingYears       26.076690 Confirmed
## JobRole                 17.190628 Confirmed
## Age                     14.453971 Confirmed
## YearsAtCompany          13.770025 Confirmed
## YearsInCurrentRole      10.299587 Confirmed
## NumCompaniesWorked       9.426253 Confirmed
## Department               8.585054 Confirmed
## YearsWithCurrManager     8.147893 Confirmed
## YearsSinceLastPromotion  6.596010 Confirmed
## Attrition                6.121120 Confirmed

########## splitting data to train set and test set
##########
dataSplit(DFScaledI,0,"MonthlyIncome")
##  [1] "Age"                      "Attrition"               
##  [3] "BusinessTravel"           "DailyRate"               
##  [5] "Department"               "DistanceFromHome"        
##  [7] "Education"                "EducationField"          
##  [9] "EnvironmentSatisfaction"  "Gender"                  
## [11] "HourlyRate"               "JobInvolvement"          
## [13] "JobLevel"                 "JobRole"                 
## [15] "JobSatisfaction"          "MaritalStatus"           
## [17] "MonthlyIncome"            "MonthlyRate"             
## [19] "NumCompaniesWorked"       "OverTime"                
## [21] "PercentSalaryHike"        "PerformanceRating"       
## [23] "RelationshipSatisfaction" "StockOptionLevel"        
## [25] "TotalWorkingYears"        "TrainingTimesLastYear"   
## [27] "WorkLifeBalance"          "YearsAtCompany"          
## [29] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [31] "YearsWithCurrManager"

Step 4, applying the model

Q3 : What is the suitable monthly income for an employee? (Predicting Monthly Income)

Ridge Regression model is used for this question

########## regression model using ridge regression
##########
ridgeRegression <- function(){
  trainY<-train$MonthlyIncome
  trainX<-data.matrix(subset(train, select=c(JobLevel, TotalWorkingYears)))
  
  lambda_seq <- 10^seq(2, -2, by = -.1)
  
  # Using glmnet function - setting alpha to 0 to build the ridge regression
  cv_fit <- cv.glmnet(trainX, trainY, alpha = 0, lambda = lambda_seq)
  plot(cv_fit)
  opt_lambda <- cv_fit$lambda.min
  cat(sprintf('Lambda : %f\n',opt_lambda))
  
  # prediction
  testY <- test$MonthlyIncome
  testX <- data.matrix(subset(test, select=c(JobLevel, TotalWorkingYears)))
  y_predicted <- predict(cv_fit, s = opt_lambda, newx = testX)
  
  # Sum of Squares Total and Error
  sst <- sum((testY - mean(testY))^2)
  sse <- sum((y_predicted - testY)^2)
  # R squared
  rsq <- 1 - sse / sst
  cat(sprintf('R Squared : %f\n',rsq))
  
  plot(y_predicted,test$MonthlyIncome, pch = 16, col = c('red','black')[as.factor(!test$Attrition)], xlab="predicted",ylab="actual")
  abline(0,1)
}

ridgeRegression()

## Lambda : 0.010000
## R Squared : 0.901915

Based on the above result, ….