Introduction

In a healthy economy, a certain amount of voluntary employee turnover is normal. People switch jobs for many reasons: family, convenience, compensation, growth opportunities, and more. It is nevertheless worth understanding the drivers behind attrition so that measures can be taken to prevent a talent drain.

Objective

The objective is to understand which factors contribute most to employee attrition and to build a model that predicts whether a given employee will leave the company. A secondary goal is to help formulate retention strategies targeted at at-risk employees. Overall, such a model gives management a firmer basis for decision-making.

Dataset

We will be using a dataset from the IBM Analytics website. The dataset can be found at https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/.

We will start by importing the dataset and understanding the data at a high level.

Importing the necessary libraries

library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0     v purrr   0.3.0
## v tibble  2.0.1     v dplyr   0.7.8
## v tidyr   0.8.2     v stringr 1.3.1
## v readr   1.3.1     v forcats 0.3.0
## -- Conflicts -------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(caret)
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
library(SuperLearner)
## Loading required package: nnls
## Super Learner
## Version: 2.0-24
## Package created on 2018-08-10
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
library(ROCR)
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
library(ranger)

Importing the dataset and checking the dimensions

dataset <- read.csv('Dataset.csv')
dim(dataset)
## [1] 1470   35

The dataset has 1470 observations with 35 variables.

Dataset Features

names(dataset)
##  [1] "Age"                      "Attrition"               
##  [3] "BusinessTravel"           "DailyRate"               
##  [5] "Department"               "DistanceFromHome"        
##  [7] "Education"                "EducationField"          
##  [9] "EmployeeCount"            "EmployeeNumber"          
## [11] "EnvironmentSatisfaction"  "Gender"                  
## [13] "HourlyRate"               "JobInvolvement"          
## [15] "JobLevel"                 "JobRole"                 
## [17] "JobSatisfaction"          "MaritalStatus"           
## [19] "MonthlyIncome"            "MonthlyRate"             
## [21] "NumCompaniesWorked"       "Over18"                  
## [23] "OverTime"                 "PercentSalaryHike"       
## [25] "PerformanceRating"        "RelationshipSatisfaction"
## [27] "StandardHours"            "StockOptionLevel"        
## [29] "TotalWorkingYears"        "TrainingTimesLastYear"   
## [31] "WorkLifeBalance"          "YearsAtCompany"          
## [33] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [35] "YearsWithCurrManager"

Of the 35 variables, 34 are independent variables and one, Attrition, is the dependent/target variable.

Dataset Structure

str(dataset)
## 'data.frame':    1470 obs. of  35 variables:
##  $ Age                     : int  41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : int  1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : int  1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : int  2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ EmployeeNumber          : int  1 2 4 5 7 8 10 11 12 13 ...
##  $ EnvironmentSatisfaction : int  2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : int  94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : int  3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : int  2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : int  4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : int  5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
##  $ MonthlyRate             : int  19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
##  $ NumCompaniesWorked      : int  8 1 6 1 9 0 4 1 0 6 ...
##  $ Over18                  : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : int  11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : int  3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: int  1 4 2 3 4 3 1 2 2 2 ...
##  $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
##  $ StockOptionLevel        : int  0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : int  8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : int  0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : int  1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : int  6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : int  4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : int  0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : int  5 7 0 0 2 6 0 0 8 7 ...

As we can see from the structure, a few columns such as Education, EnvironmentSatisfaction, and JobInvolvement hold integer lookup codes. We will first replace these codes with the descriptions provided with the dataset. We will also remove a few columns that we won't be using further in the analysis.

dataset <- dataset %>%
  mutate(Education = as.factor(if_else(Education == 1,"Below College", if_else(Education == 2, "College", if_else(Education == 3, "Bachelor", if_else(Education == 4, "Master","Doctor")))))
         ,EnvironmentSatisfaction = as.factor(if_else(EnvironmentSatisfaction == 1,"Low",if_else(EnvironmentSatisfaction == 2, "Medium", if_else(EnvironmentSatisfaction == 3, "High", "Very High"))))
         ,JobInvolvement = as.factor(if_else(JobInvolvement == 1,"Low",if_else(JobInvolvement == 2, "Medium",if_else(JobInvolvement == 3, "High", "Very High"))))
         ,JobSatisfaction = as.factor(if_else(JobSatisfaction == 1, "Low",if_else(JobSatisfaction == 2, "Medium",if_else(JobSatisfaction == 3, "High","Very High"))))
         ,PerformanceRating = as.factor(if_else(PerformanceRating == 1, "Low",if_else(PerformanceRating == 2, "Good", if_else(PerformanceRating == 3, "Excellent", "Outstanding"))))
         ,RelationshipSatisfaction = as.factor(if_else(RelationshipSatisfaction == 1, "Low",if_else(RelationshipSatisfaction == 2, "Medium", if_else(RelationshipSatisfaction == 3, "High", "Very High"))))
         ,WorkLifeBalance = as.factor(if_else(WorkLifeBalance == 1, "Bad",if_else(WorkLifeBalance == 2, "Good", if_else(WorkLifeBalance == 3, "Better", "Best"))))
         ,JobLevel = as.factor(JobLevel)
         ) %>%
  select(-EmployeeCount, -EmployeeNumber, -Over18, -StandardHours, -StockOptionLevel, -JobLevel)
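
An aside on the code above: JobLevel is converted to a factor only to be dropped in the select(), so that conversion could simply be omitted. Also, the nested if_else() calls work but are hard to read; a labelled factor expresses the same mapping more clearly. A sketch of the equivalent recode for one column, had we used it instead:

# Equivalent, more readable recode for Education; the same pattern applies
# to the other lookup columns.
dataset <- dataset %>%
  mutate(Education = factor(Education, levels = 1:5,
                            labels = c("Below College", "College", "Bachelor",
                                       "Master", "Doctor")))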

Exploratory Data Analysis

After understanding the dimensions of the dataset, we will analyse the variables visually. This will help us understand the nature of the data, both the distribution of individual variables/features and their relationships with other variables.

We will first carry out univariate data analysis and then move forward to bivariate analysis.

Univariate EDA

A. Dataset Summary

summary(dataset)
##       Age        Attrition            BusinessTravel   DailyRate     
##  Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0  
##  1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0  
##  Median :36.00              Travel_Rarely    :1043   Median : 802.0  
##  Mean   :36.92                                       Mean   : 802.5  
##  3rd Qu.:43.00                                       3rd Qu.:1157.0  
##  Max.   :60.00                                       Max.   :1499.0  
##                                                                      
##                   Department  DistanceFromHome         Education  
##  Human Resources       : 63   Min.   : 1.000   Bachelor     :572  
##  Research & Development:961   1st Qu.: 2.000   Below College:170  
##  Sales                 :446   Median : 7.000   College      :282  
##                               Mean   : 9.193   Doctor       : 48  
##                               3rd Qu.:14.000   Master       :398  
##                               Max.   :29.000                      
##                                                                   
##           EducationField EnvironmentSatisfaction    Gender   
##  Human Resources : 27    High     :453           Female:588  
##  Life Sciences   :606    Low      :284           Male  :882  
##  Marketing       :159    Medium   :287                       
##  Medical         :464    Very High:446                       
##  Other           : 82                                        
##  Technical Degree:132                                        
##                                                              
##    HourlyRate       JobInvolvement                      JobRole   
##  Min.   : 30.00   High     :868    Sales Executive          :326  
##  1st Qu.: 48.00   Low      : 83    Research Scientist       :292  
##  Median : 66.00   Medium   :375    Laboratory Technician    :259  
##  Mean   : 65.89   Very High:144    Manufacturing Director   :145  
##  3rd Qu.: 83.75                    Healthcare Representative:131  
##  Max.   :100.00                    Manager                  :102  
##                                    (Other)                  :215  
##   JobSatisfaction  MaritalStatus MonthlyIncome    MonthlyRate   
##  High     :442    Divorced:327   Min.   : 1009   Min.   : 2094  
##  Low      :289    Married :673   1st Qu.: 2911   1st Qu.: 8047  
##  Medium   :280    Single  :470   Median : 4919   Median :14236  
##  Very High:459                   Mean   : 6503   Mean   :14313  
##                                  3rd Qu.: 8379   3rd Qu.:20462  
##                                  Max.   :19999   Max.   :26999  
##                                                                 
##  NumCompaniesWorked OverTime   PercentSalaryHike   PerformanceRating
##  Min.   :0.000      No :1054   Min.   :11.00     Excellent  :1244   
##  1st Qu.:1.000      Yes: 416   1st Qu.:12.00     Outstanding: 226   
##  Median :2.000                 Median :14.00                        
##  Mean   :2.693                 Mean   :15.21                        
##  3rd Qu.:4.000                 3rd Qu.:18.00                        
##  Max.   :9.000                 Max.   :25.00                        
##                                                                     
##  RelationshipSatisfaction TotalWorkingYears TrainingTimesLastYear
##  High     :459            Min.   : 0.00     Min.   :0.000        
##  Low      :276            1st Qu.: 6.00     1st Qu.:2.000        
##  Medium   :303            Median :10.00     Median :3.000        
##  Very High:432            Mean   :11.28     Mean   :2.799        
##                           3rd Qu.:15.00     3rd Qu.:3.000        
##                           Max.   :40.00     Max.   :6.000        
##                                                                  
##  WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Bad   : 80      Min.   : 0.000   Min.   : 0.000    
##  Best  :153      1st Qu.: 3.000   1st Qu.: 2.000    
##  Better:893      Median : 5.000   Median : 3.000    
##  Good  :344      Mean   : 7.008   Mean   : 4.229    
##                  3rd Qu.: 9.000   3rd Qu.: 7.000    
##                  Max.   :40.000   Max.   :18.000    
##                                                     
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 1.000          Median : 3.000      
##  Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :15.000          Max.   :17.000      
## 

This gives us a high-level view of the dataset. We can see that we don't have any NULL values. We can also see that there are just 237 attrition records out of 1470, which is approximately 16%. We need to preserve this ratio when we split the data into training and test sets for model evaluation, to avoid biasing the evaluation.

B. Employee Personal Demographics - Numerical Variables

p1 <- ggplot(dataset) + geom_histogram(aes(Age), binwidth = 5, fill = "blue",col = "black")
p2 <- ggplot(dataset) + geom_histogram(aes(DistanceFromHome), binwidth = 5, fill = "blue",col = "black")
p3 <- ggplot(dataset) + geom_histogram(aes(NumCompaniesWorked), binwidth = 2, fill = "blue",col = "black")
p4 <- ggplot(dataset) + geom_histogram(aes(TotalWorkingYears), binwidth = 4, fill = "blue",col = "black")

grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)

Observations :

  1. Age is more or less normally distributed.
  2. Distance from Home, Number of Companies Worked, and Total Working Years are right skewed and should be transformed to curb the skewness (see the sketch below).
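
To back the visual impression with numbers, we can compute a skewness statistic (a sketch, assuming the e1071 package is available):

library(e1071)
# Skewness of the four numeric variables above; values near 0 indicate
# symmetry, large positive values indicate right skew.
sapply(dataset[, c("Age", "DistanceFromHome", "NumCompaniesWorked",
                   "TotalWorkingYears")], skewness)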

C. Employee Billing Rate Demographics - Numerical Variables

p1 <- ggplot(dataset) + geom_histogram(aes(HourlyRate), binwidth = 5, fill = "blue",col = "black")
p2 <- ggplot(dataset) + geom_histogram(aes(DailyRate), binwidth = 100, fill = "blue",col = "black")
p3 <- ggplot(dataset) + geom_histogram(aes(MonthlyRate), binwidth = 1000, fill = "blue",col = "black")

grid.arrange(p1, p2, p3, nrow = 3)

Observations :

  1. There is no clear-cut pattern in these rates.

D. Employee Work Demographics - Numerical Variables

p1 <- ggplot(dataset) + geom_histogram(aes(MonthlyIncome), binwidth = 1000, fill = "blue",col = "black")
p2 <- ggplot(dataset) + geom_histogram(aes(PercentSalaryHike), binwidth = 1, fill = "blue",col = "black")
p3 <- ggplot(dataset) + geom_histogram(aes(YearsAtCompany), binwidth = 2, fill = "blue",col = "black")
p4 <- ggplot(dataset) + geom_histogram(aes(YearsInCurrentRole), binwidth = 2, fill = "blue",col = "black")
p5 <- ggplot(dataset) + geom_histogram(aes(YearsSinceLastPromotion), binwidth = 2, fill = "blue",col = "black")
p6 <- ggplot(dataset) + geom_histogram(aes(YearsWithCurrManager), binwidth = 2, fill = "blue",col = "black")

grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

Observations :

  1. All 6 variables are right skewed and should be transformed to curb the skewness.

E. Employee Personal Demographics - Categorical Variables

p1<- dataset %>%
  group_by(Gender) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(Gender), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Gender") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 900))

p2<- dataset %>%
  group_by(Education) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(Education), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Education") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 650))

p3 <- dataset %>%
  group_by(EducationField) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(EducationField), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Education Field") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 650))

p4 <- dataset %>%
  group_by(MaritalStatus) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(MaritalStatus), y = counts)) + geom_bar(stat = 'identity', fill = "coral1")+ ggtitle("Marital Status") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 750))

p5 <- dataset %>%
  group_by(RelationshipSatisfaction) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(RelationshipSatisfaction), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Relationship Satisfaction") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())+ scale_y_continuous(limits = c(0, 500))

p6 <- dataset %>%
  group_by(WorkLifeBalance) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(WorkLifeBalance), y = counts)) + geom_bar(stat = 'identity', fill = "coral1")+ ggtitle("Work Life Balance") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 950))

grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

Observations :

  1. About 60% of the workforce is male.
  2. About 39% of employees hold a bachelor's degree and 27% hold a master's degree.
  3. Almost 73% of employees come from Life Sciences or Medical backgrounds.
  4. About 46% of employees are married, whereas 22% are divorced.
  5. Roughly 30% of employees each report High and Very High relationship satisfaction.
  6. More than 60% of employees rate their work-life balance as "Better".

F. Employee Work Demographics - Categorical Variables

p1 <- dataset %>%
  group_by(BusinessTravel) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(BusinessTravel), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Business Travel") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ theme(plot.title = element_text(size =10),axis.text.x = element_text(size =10,angle = 45, hjust = 1),axis.title.x=element_blank())+ scale_y_continuous(limits = c(0, 1100))



p2 <- dataset %>%
  group_by(EnvironmentSatisfaction) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(EnvironmentSatisfaction), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Environment Satisfaction") + geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =10,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 500))

p3 <- dataset %>%
  group_by(JobInvolvement) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(JobInvolvement), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Job Involvement") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ theme(plot.title = element_text(size =10),axis.text.x = element_text(size =10,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 900))


p4 <- dataset %>%
  group_by(JobSatisfaction) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(JobSatisfaction), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Job Satisfaction") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 500))

p5 <- dataset %>%
  group_by(OverTime) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(OverTime), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Over Time") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 1100))


p6 <- dataset %>%
  group_by(PerformanceRating) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(PerformanceRating), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Performance Rating") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 1300))

grid.arrange(p1,p2,p3,p4,p5,p6,nrow = 2)

p1 <- dataset %>%
  group_by(Department) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(Department), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Department") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ theme(plot.title = element_text(size =10),axis.text.x = element_text(size = 7, angle = 45, hjust = 1),axis.title.x=element_blank())

p2 <- dataset %>%
  group_by(JobRole) %>%
  summarise(counts = n()) %>%
  ggplot(aes(x = as.factor(JobRole), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Job Role") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())

grid.arrange(p1,p2 ,ncol = 2)

Observations :

  1. More than 70% of employees travel rarely for work.
  2. About 30% of employees each report High and Very High environment satisfaction.
  3. Almost 59% of employees report High job involvement at work.
  4. Similarly for Job Satisfaction, about 30% of employees each report High and Very High job satisfaction.
  5. More than 70% of employees do not work overtime; only about 28% do.
  6. About 85% of employees have an Excellent performance rating.
  7. Roughly two-thirds of employees work in the R&D department.
  8. The majority of employees work as Sales Executives, Research Scientists, and Laboratory Technicians.

Bivariate EDA

After looking at every feature individually, let's now do some bivariate/multivariate analysis. Here we'll explore the independent variables with respect to the target variable. The objective is to discover hidden relationships between the independent variables and the target variable.

A. Employee Personal Demographics - Numerical Variables

p1 <- dataset %>%
  ggplot(aes(x = Age, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Age") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())

p2 <- dataset %>%
  ggplot(aes(x = DistanceFromHome, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Distance From Home")  + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())

p3 <- dataset %>%
  ggplot(aes(x = NumCompaniesWorked, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Number of Companies")  + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())

p4 <- dataset %>%
  ggplot(aes(x = TotalWorkingYears, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Total Working Years")  + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())

grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)

Observations :

  1. Younger employees, roughly 25-35 years old, have a higher attrition rate.
  2. We see a lower attrition rate when the distance from home is within 10 km; the attrition rate increases beyond 10 km.
  3. The attrition rate tends to be higher among employees who have worked for 5 to 7 companies.
  4. The attrition rate is markedly higher among employees with roughly 0 to 7 years of total working experience.

B. Employee Billing Rate Demographics - Numerical Variables

p1 <- dataset %>%
  ggplot(aes(x = HourlyRate, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Hourly Rate") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())


p2 <- dataset %>%
  ggplot(aes(x = DailyRate, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Daily Rate") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())


p3 <- dataset %>%
  ggplot(aes(x = MonthlyRate, fill = Attrition)) + geom_density(alpha = 0.5)+ ggtitle("Monthly Rate") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())


grid.arrange(p1, p2, p3)

Observations :

  1. No specific trend is observed, except for Daily Rate, which is bimodal: attrition is highest near 400 and lowest near 1200.

C. Employee Work Demographics - Numerical Variables

p1 <- dataset %>%
  ggplot(aes(x = MonthlyIncome, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Monthly Income") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())


p2 <- dataset %>%
  ggplot(aes(x = PercentSalaryHike, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Percentage Salary Hike") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())


p3 <- dataset %>%
  ggplot(aes(x = YearsAtCompany, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Years At Company") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())


p4 <- dataset %>%
  ggplot(aes(x = YearsInCurrentRole, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Years in Current Role") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())


p5 <- dataset %>%
  ggplot(aes(x = YearsSinceLastPromotion, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Years Since Last Promotion") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())


p6 <- dataset %>%
  ggplot(aes(x = YearsWithCurrManager, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Years With Current Manager") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())


grid.arrange(p1, p2, p3, p4, p5, p6 , nrow = 3, ncol = 2)

Observations :

  1. We observe a peak in the attrition rate at a monthly income of approximately 2500.
  2. We also see a peak in the attrition rate when the employee has been with the company for roughly 0-2 years.
  3. From the two modes in the "Years in Current Role" plot, the attrition rate is higher when an employee has been in the same role for roughly 0-2 years or about 6 years.
  4. Likewise, the two modes in the "Years with Current Manager" plot suggest employees tend to leave after less than 1.5 years or around 6 years with the same manager.

D. Employee Personal Demographics - Categorical Variables

p1 <- dataset %>%
  group_by(Gender) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = Gender, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Gender") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 20))


p2 <- dataset %>%
  group_by(Education) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = Education, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Education") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 20))

p3 <- dataset %>%
  group_by(EducationField) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = EducationField, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Education Field") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))

p4 <- dataset %>%
  group_by(MaritalStatus) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = MaritalStatus, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Marital Status") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))

p5 <- dataset %>%
  group_by(RelationshipSatisfaction) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = as.factor(RelationshipSatisfaction), y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Relationship Satisfaction") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))

p6 <- dataset %>%
  group_by(WorkLifeBalance) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = as.factor(WorkLifeBalance), y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Work Life Balance") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 35))

grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

Observations :

  1. The attrition rate is slightly higher for males than for females.
  2. An 18% attrition rate is observed among employees with below-college education.
  3. The attrition rate is notably high among employees from HR, Marketing, and Technical Degree backgrounds.
  4. As expected, the attrition rate is very high among employees who report a bad work-life balance.
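
A note on the plotting code above: each panel repeats the same group-summarise-plot pattern. A small helper would keep this DRY (a sketch of the idea, not used in the original; it assumes a dplyr version with the {{ }} tidy-eval operator, which is newer than the 0.7.8 loaded above):

# Attrition rate (%) by a grouping column, drawn as a labelled bar chart.
plot_attrition_rate <- function(df, col, title) {
  df %>%
    group_by({{ col }}) %>%
    summarise(attrition_rate = round(mean(Attrition == "Yes") * 100, 2)) %>%
    ggplot(aes(x = as.factor({{ col }}), y = attrition_rate)) +
    geom_col(fill = "coral3") +
    geom_text(aes(label = attrition_rate), size = 2.5, vjust = -0.25) +
    ggtitle(title)
}

# Usage: plot_attrition_rate(dataset, Gender, "Attrition Rate - Gender")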

E. Employee Work Demographics - Categorical Variables

p1 <- dataset %>%
  group_by(BusinessTravel) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = BusinessTravel, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Business Travel") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))

p2 <- dataset %>%
  group_by(EnvironmentSatisfaction) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = EnvironmentSatisfaction, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Environment Satisfaction") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))

p3 <- dataset %>%
  group_by(JobInvolvement) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = JobInvolvement, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Job Involvement") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 40))

p4 <- dataset %>%
  group_by(JobSatisfaction) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = JobSatisfaction, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Job Satisfaction") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))

p5 <- dataset %>%
  group_by(OverTime) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = OverTime, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Over Time") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 35))

p6 <- dataset %>%
  group_by(PerformanceRating) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = as.factor(PerformanceRating), y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Performance Rating") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 20))

grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)

p1 <- dataset %>%
  group_by(Department) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = Department, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Department") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)

p2 <- dataset %>%
  group_by(JobRole) %>%
  summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
  ggplot(aes(x = JobRole, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Job Role") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)



grid.arrange(p1, p2, ncol = 2)

Observations :

  1. The attrition rate is higher among people who travel frequently.
  2. It is also higher among employees with low environment satisfaction, low job involvement, or low job satisfaction.
  3. The attrition rate is almost 30% among employees who work overtime.
  4. The Sales department has the highest departmental attrition at about 20%, while among job roles, Sales Representatives have the highest attrition at about 40%.

Data Preprocessing

Missing Value Imputations

It is the rare database that contains no missing values. In the world of big data, the problem of missing data is widespread, so it is important to understand methods for handling it that will not bias the results.

The common techniques to handle missing data are :

  1. Deletion of rows with missing values. The downside of this method is the loss of information and a drop in the model's predictive power.

  2. For continuous variables, missing values can be replaced with the mean or median of the known values of that variable. For categorical variables, we can use the mode of the given values (a sketch follows this list).

  3. Building a prediction model: we can build a predictive model to impute the missing data in a variable. Here we treat the variable with missing data as the target and the other variables (except the actual target) as predictors.
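
As a hedged illustration of technique 2 (we will not need it here, and the data frame and column names below are purely hypothetical):

# Hypothetical sketch: impute a numeric column with its median and a
# categorical column with its most frequent level.
df$num_col[is.na(df$num_col)] <- median(df$num_col, na.rm = TRUE)
mode_val <- names(which.max(table(df$cat_col)))
df$cat_col[is.na(df$cat_col)] <- mode_val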

Checking for missing data in our dataset.

sum(is.na(dataset))
## [1] 0

Luckily, we have no missing values in our dataset, so we can move ahead.

Encoding categorical variables

Most machine learning algorithms produce better results with numerical inputs, so it is essential to treat the categorical variables in the data. We will use one-hot encoding, wherein each category of a categorical variable is converted into a new binary (1/0) column.

# dataset[-2] drops column 2 (Attrition) so the target is not one-hot encoded.
dmy <- dummyVars(~., data = dataset[-2])
trsf <- data.frame(predict(dmy, newdata = dataset[-2]))
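
To see concretely what dummyVars does, here is a toy example on made-up data:

# Toy illustration, independent of our dataset.
toy <- data.frame(Dept = factor(c("Sales", "HR", "Sales")))
data.frame(predict(dummyVars(~., data = toy), newdata = toy))
# Expected: one 0/1 indicator column per level of Dept.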

Removing Skewness

Skewness in variables is undesirable for predictive modeling. Some machine learning methods assume normally distributed data, and a skewed variable can be transformed by taking its log, square root, or cube root to bring its distribution closer to normal. We will treat the skewness with a log transformation, and we will also center and scale the values.

# log(x + 1) is used so that zero values (e.g. YearsAtCompany = 0) stay finite.
trsf <- trsf %>%
  mutate(Age = log(Age + 1)
         ,DailyRate = log(DailyRate + 1)
         ,DistanceFromHome = log(DistanceFromHome + 1)
         ,HourlyRate = log(HourlyRate + 1)
         ,MonthlyIncome = log(MonthlyIncome + 1)
         ,MonthlyRate = log(MonthlyRate + 1)
         ,NumCompaniesWorked = log(NumCompaniesWorked + 1)
         ,PercentSalaryHike = log(PercentSalaryHike + 1)
         ,TotalWorkingYears = log(TotalWorkingYears + 1)
         ,TrainingTimesLastYear = log(TrainingTimesLastYear + 1)
         ,YearsAtCompany = log(YearsAtCompany +1)
         ,YearsInCurrentRole = log(YearsInCurrentRole + 1)
         ,YearsSinceLastPromotion = log(YearsSinceLastPromotion + 1)
         ,YearsWithCurrManager = log(YearsWithCurrManager + 1))
prep_num = preProcess(trsf, method=c("center", "scale"))
final_dataset = predict(prep_num, trsf)
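
One step the code above leaves implicit: trsf was built from dataset[-2], so final_dataset does not yet contain the target. Before splitting we need to re-attach Attrition as a numeric 0/1 column, matching the 0/1 levels shown in the proportion tables below; a minimal sketch:

# Re-attach the target as 0/1 so it sits alongside the encoded predictors.
final_dataset$Attrition <- if_else(dataset$Attrition == "Yes", 1, 0)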

Predictive Modeling & Model Implementation

Since the output we are predicting, attrition, can only be yes or no, this analysis falls under the classification category.

Accuracy measurement

We will be using two methods to measure the accuracy.

1. Confusion Matrix

A confusion matrix is a table often used to describe the performance of a classification model (or classifier) on a set of test data for which the true values are known. From the confusion matrix, it is possible to derive metrics that measure the fitness of a model for a particular context.

In the columns we have the actual positives and negatives, whereas in the rows we have the predicted positives and negatives.

True Positives and True Negatives indicate the cases which have been correctly predicted by the model. False positives and false negatives are considered as wrong predictions by the model.

Various measures can be derived from the confusion matrix:

  1. Accuracy: (TP + TN) / (TP + FN + FP + TN)
  2. Sensitivity: TP / (TP + FN)
  3. Specificity: TN / (FP + TN)
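
As a quick worked example of these formulas on a toy 2x2 table:

# Toy counts (hypothetical): TP/FN/FP/TN from some classifier.
TP <- 90; FN <- 10; FP <- 20; TN <- 80
(TP + TN) / (TP + FN + FP + TN)  # Accuracy    = 0.85
TP / (TP + FN)                   # Sensitivity = 0.90
TN / (FP + TN)                   # Specificity = 0.80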

2. Receiver Operating Characteristic (ROC) Curve and Area Under the Curve (AUC)

The ROC curve is a graphical representation of the trade-off between sensitivity and specificity, and the AUC is the area under that curve. The higher the AUC, the better the model's performance. An AUC of 0.5 is equivalent to random guessing; if the AUC of our model is below 0.5, it performs worse than random predictions and should never be productionized.

It is quite possible for the confusion matrix to show a high accuracy, above 0.8, while the AUC is near 0.5. In our dataset, for example, about 84% of employees did not attrite, so a model that predicts "No" for everyone achieves roughly 84% accuracy yet has an AUC of 0.5. It is therefore necessary to consider both values when judging model performance.

Dataset Split in Training and Test Sets

Let's start by splitting the dataset into training and test sets. We will use the training set to train our model and test its accuracy on the test set.

We need to ensure that the attrition data in both the training and test sets is in the same proportion as in the full dataset, to avoid biasing the predictions. For this we will use the createDataPartition function. Our training set will have 70% of the rows, and the test set will have the remaining 30%.

# createDataPartition samples within each class of Attrition, preserving the
# class ratio; a set.seed() call here would make the split reproducible.
Train <- createDataPartition(final_dataset$Attrition, p=0.7, list=FALSE)
training <- final_dataset[ Train, ]
testing <- final_dataset[ -Train, ]
testing$Attrition <- as.factor(testing$Attrition)

Let's check whether the proportion of attrition data is maintained in the sets we created.

prop.table(table(final_dataset$Attrition))
## 
##         0         1 
## 0.8387755 0.1612245
prop.table(table(training$Attrition))
## 
##         0         1 
## 0.8367347 0.1632653
prop.table(table(testing$Attrition))
## 
##         0         1 
## 0.8435374 0.1564626

The proportions are indeed maintained.

We will be building three models to predict the attrition of an employee. These models are :

  1. Logistic Regression
  2. Random Forest
  3. Gradient Boosting

Let's start implementing the models. For each model we will:

  1. Train the model on the training dataset.
  2. Predict the output for the test dataset.
  3. Use the confusion matrix to check the accuracy.
  4. Calculate the AUC (Area Under the Curve) of the ROC plot; the higher the AUC, the better the performance.

Logistic Regression

Model Training

control <- trainControl(method="repeatedcv", number=5)
set.seed(123)
model_lr <- train(as.factor(Attrition)~., data=training, method="glm", trControl=control)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

Output Prediction

pred_lr <- predict(model_lr, newdata=testing)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

Confusion Matrix

confusionMatrix(pred_lr, testing$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 356  37
##          1  16  32
##                                           
##                Accuracy : 0.8798          
##                  95% CI : (0.8458, 0.9087)
##     No Information Rate : 0.8435          
##     P-Value [Acc > NIR] : 0.01854         
##                                           
##                   Kappa : 0.4803          
##  Mcnemar's Test P-Value : 0.00601         
##                                           
##             Sensitivity : 0.9570          
##             Specificity : 0.4638          
##          Pos Pred Value : 0.9059          
##          Neg Pred Value : 0.6667          
##              Prevalence : 0.8435          
##          Detection Rate : 0.8073          
##    Detection Prevalence : 0.8912          
##       Balanced Accuracy : 0.7104          
##                                           
##        'Positive' Class : 0               
## 
ROCRpred <- prediction(as.numeric(pred_lr), as.numeric(testing$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_lr <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_lr, digits=5, scientific=FALSE)))
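
A caveat on the AUC above (and the analogous calculations for the random forest and XGBoost models below): it is computed from hard class labels rather than predicted probabilities, which collapses the ROC curve to a single operating point and typically understates the achievable AUC. A probability-based version would look roughly like this (a sketch; caret's predict with type = "prob" returns one column per class, assumed here to be labelled "0" and "1"):

# Probability of the positive class ("1" = attrition), then AUC as before.
prob_lr <- predict(model_lr, newdata = testing, type = "prob")[, "1"]
ROCRpred_prob <- prediction(prob_lr, as.numeric(testing$Attrition))
as.numeric(performance(ROCRpred_prob, "auc")@y.values)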

Random Forest

Model Training

control <- trainControl(method="repeatedcv", number=5)
set.seed(123)
model_rf <- train(as.factor(Attrition)~., data=training, method="rf",  trControl=control)

Output Prediction

pred_rf <- predict(model_rf, newdata=testing)

Confusion Matrix

confusionMatrix(pred_rf, testing$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 363  52
##          1   9  17
##                                           
##                Accuracy : 0.8617          
##                  95% CI : (0.8259, 0.8925)
##     No Information Rate : 0.8435          
##     P-Value [Acc > NIR] : 0.1628          
##                                           
##                   Kappa : 0.2978          
##  Mcnemar's Test P-Value : 7.551e-08       
##                                           
##             Sensitivity : 0.9758          
##             Specificity : 0.2464          
##          Pos Pred Value : 0.8747          
##          Neg Pred Value : 0.6538          
##              Prevalence : 0.8435          
##          Detection Rate : 0.8231          
##    Detection Prevalence : 0.9410          
##       Balanced Accuracy : 0.6111          
##                                           
##        'Positive' Class : 0               
## 
ROCRpred <- prediction(as.numeric(pred_rf), as.numeric(testing$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_rf <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_rf, digits=5, scientific=FALSE)))

Extreme Gradient Boosting (XGBoost)

Model Training

control <- trainControl(method="repeatedcv", number=5)
set.seed(123)
model_xgb <- train(as.factor(Attrition)~., data=training, method="xgbTree",  trControl=control)

Output Prediction

pred_xgb <- predict(model_xgb, newdata=testing)

Confusion Matrix

confusionMatrix(pred_xgb, testing$Attrition)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 362  43
##          1  10  26
##                                           
##                Accuracy : 0.8798          
##                  95% CI : (0.8458, 0.9087)
##     No Information Rate : 0.8435          
##     P-Value [Acc > NIR] : 0.01854         
##                                           
##                   Kappa : 0.4346          
##  Mcnemar's Test P-Value : 1.105e-05       
##                                           
##             Sensitivity : 0.9731          
##             Specificity : 0.3768          
##          Pos Pred Value : 0.8938          
##          Neg Pred Value : 0.7222          
##              Prevalence : 0.8435          
##          Detection Rate : 0.8209          
##    Detection Prevalence : 0.9184          
##       Balanced Accuracy : 0.6750          
##                                           
##        'Positive' Class : 0               
## 
ROCRpred <- prediction(as.numeric(pred_xgb), as.numeric(testing$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_xgb <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_xgb, digits=5, scientific=FALSE)))

Ensemble Modeling

Ensemble modeling is the process of running two or more related but different analytical models and synthesizing their results into a single score or spread, in order to improve the accuracy of predictive analytics and data mining applications.

We will use SuperLearner to ensemble the above three models into one and see whether it collectively outperforms the logistic regression model.

SuperLearner requires the predictors (X) and the response (Y) to be in separate data structures, so we split them out for both the training and test sets.

ytrain <- as.numeric(training[,65])   # column 65 holds the 0/1 Attrition target
ytest <- as.numeric(testing[,65])
xtrain <- data.frame(training[,1:64]) # columns 1-64 are the encoded predictors
xtest <- data.frame(testing[,1:64])

Now we train an ensemble using all three models together.

set.seed(150)
single.model <- SuperLearner(ytrain,
                             xtrain,
                             family=binomial(),
                             SL.library=list("SL.ranger",
                                             "SL.glm",
                                             "SL.xgboost"))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Loading required package: xgboost
## 
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
## 
##     slice
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading

Simply printing the model shows each algorithm's coefficient, which is its weight in the ensemble, and its risk, which is the cross-validated error the algorithm produces.

single.model
## 
## Call:  
## SuperLearner(Y = ytrain, X = xtrain, family = binomial(), SL.library = list("SL.ranger",  
##     "SL.glm", "SL.xgboost")) 
## 
## 
##                      Risk       Coef
## SL.ranger_All  0.11131486 0.25225946
## SL.glm_All     0.09863218 0.68782039
## SL.xgboost_All 0.11921474 0.05992015
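
For a more rigorous comparison of the learners, the package also provides CV.SuperLearner, which cross-validates the ensemble itself (a sketch, not run here):

# Externally cross-validate the ensemble over V = 5 folds and summarise
# each learner's risk.
set.seed(150)
cv.model <- CV.SuperLearner(ytrain, xtrain, V = 5, family = binomial(),
                            SL.library = list("SL.ranger", "SL.glm",
                                              "SL.xgboost"))
summary(cv.model)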

Predicting the output for the test set.

predictions <- predict.SuperLearner(single.model, newdata=xtest)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
head(predictions$library.predict)
##      SL.ranger_All SL.glm_All SL.xgboost_All
## [1,]     0.2375000 0.46317283     0.65912575
## [2,]     0.1520000 0.02129748     0.03247750
## [3,]     0.1320000 0.02270261     0.03948908
## [4,]     0.2613333 0.07986202     0.16892673
## [5,]     0.1400000 0.17017250     0.03366067
## [6,]     0.2170000 0.24031384     0.03786308

Note that the prediction given by SuperLearner is always a probability. By choosing an optimal threshold value we can increase the accuracy, sensitivity, or specificity, depending on the requirements.

Here the threshold value is set to 0.35, which gave the best result.

conv.preds <- ifelse(predictions$pred>=0.35,1,0)
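
Rather than picking the threshold by eye, we can scan a grid of candidate values and keep the one that maximizes a metric of interest (a sketch, using balanced accuracy):

# Balanced accuracy at each candidate threshold; pick the best.
thresholds <- seq(0.1, 0.9, by = 0.05)
bal_acc <- sapply(thresholds, function(t) {
  pred <- ifelse(predictions$pred >= t, 1, 0)
  sens <- mean(pred[ytest == 0] == 0)  # recall on the "No attrition" class
  spec <- mean(pred[ytest == 1] == 1)  # recall on the attrition class
  (sens + spec) / 2
})
thresholds[which.max(bal_acc)]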

Creating a confusion matrix to check the accuracy.

cm <- confusionMatrix(as.factor(conv.preds), as.factor(testing[,65]))
cm
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 347  29
##          1  25  40
##                                           
##                Accuracy : 0.8776          
##                  95% CI : (0.8433, 0.9067)
##     No Information Rate : 0.8435          
##     P-Value [Acc > NIR] : 0.0259          
##                                           
##                   Kappa : 0.5249          
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.9328          
##             Specificity : 0.5797          
##          Pos Pred Value : 0.9229          
##          Neg Pred Value : 0.6154          
##              Prevalence : 0.8435          
##          Detection Rate : 0.7868          
##    Detection Prevalence : 0.8526          
##       Balanced Accuracy : 0.7563          
##                                           
##        'Positive' Class : 0               
## 
ROCRpred <- prediction(as.numeric(conv.preds), as.numeric(ytest))
ROCRpref <- performance(ROCRpred,"auc")
auc_ens <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_ens, digits=5, scientific=FALSE)))

Conclusion

To conclude, we walked through the entire process: importing the dataset, getting to know it at a high level, carrying out univariate and multivariate EDA, preprocessing the data, and finally building models to predict the classification.

Every model comes with hyperparameters that can be tuned to obtain higher accuracy or results tailored to specific requirements. For example, in a health setting where we want to predict whether a person has cancer, we may accept a model with lower overall accuracy provided it produces very few false negatives, i.e. cases where a person actually has cancer but the model predicts they do not. In such cases hyperparameter tuning lets us tweak the results accordingly. As we had no such specific requirements and achieved the desired levels of accuracy and AUC, we have not tuned hyperparameters here.
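
For reference, hyperparameter tuning in caret is typically a matter of supplying a tuneGrid; a hedged sketch for the random forest's mtry parameter (not run here):

# Hypothetical: search a small grid of mtry values with the same CV control.
rf_grid <- expand.grid(mtry = c(2, 8, 16, 32))
model_rf_tuned <- train(as.factor(Attrition)~., data = training,
                        method = "rf", trControl = control,
                        tuneGrid = rf_grid)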