In a healthy economy, a certain amount of voluntary employee turnover is normal. People switch jobs for many reasons: family, convenience, compensation, growth opportunities, and more. However, it is important to understand the reasons behind attrition so that proper measures can be taken to prevent a talent drain.
The objective is to understand which factors contribute most to employee attrition and to build a model that predicts whether a given employee will leave the company. The goal also includes helping formulate retention strategies targeted at specific employees. Overall, this model will allow management to make better-informed decisions.
We will be using the IBM HR Analytics dataset, which can be found at https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/.
We will start by importing the dataset and understanding the data at a high level.
Importing the necessary libraries
library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.0 v purrr 0.3.0
## v tibble 2.0.1 v dplyr 0.7.8
## v tidyr 0.8.2 v stringr 1.3.1
## v readr 1.3.1 v forcats 0.3.0
## -- Conflicts -------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(SuperLearner)
## Loading required package: nnls
## Super Learner
## Version: 2.0-24
## Package created on 2018-08-10
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(ROCR)
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
library(ranger)
Importing the dataset and checking the dimensions
dataset <- read.csv('Dataset.csv')
dim(dataset)
## [1] 1470 35
The dataset has 1470 observations with 35 variables.
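One note on importing: on R versions before 4.0, read.csv() converts character columns to factors by default, which the structure shown below relies on; on R 4.0 or later, the same behaviour has to be requested explicitly:
dataset <- read.csv('Dataset.csv', stringsAsFactors = TRUE)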
Dataset Features
names(dataset)
## [1] "Age" "Attrition"
## [3] "BusinessTravel" "DailyRate"
## [5] "Department" "DistanceFromHome"
## [7] "Education" "EducationField"
## [9] "EmployeeCount" "EmployeeNumber"
## [11] "EnvironmentSatisfaction" "Gender"
## [13] "HourlyRate" "JobInvolvement"
## [15] "JobLevel" "JobRole"
## [17] "JobSatisfaction" "MaritalStatus"
## [19] "MonthlyIncome" "MonthlyRate"
## [21] "NumCompaniesWorked" "Over18"
## [23] "OverTime" "PercentSalaryHike"
## [25] "PerformanceRating" "RelationshipSatisfaction"
## [27] "StandardHours" "StockOptionLevel"
## [29] "TotalWorkingYears" "TrainingTimesLastYear"
## [31] "WorkLifeBalance" "YearsAtCompany"
## [33] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [35] "YearsWithCurrManager"
Out of the 35 variables, 34 are independent variables and one is the dependent/target variable, Attrition.
Dataset Structure
str(dataset)
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
As we can see from the structure, a few columns such as Education, EnvironmentSatisfaction, and JobInvolvement hold coded lookup values. We will first replace these codes with the actual descriptions provided with the dataset. We will also remove a few columns that we won't be using in the analysis.
dataset <- dataset %>%
mutate(Education = as.factor(case_when(Education == 1 ~ "Below College", Education == 2 ~ "College", Education == 3 ~ "Bachelor", Education == 4 ~ "Master", TRUE ~ "Doctor"))
,EnvironmentSatisfaction = as.factor(case_when(EnvironmentSatisfaction == 1 ~ "Low", EnvironmentSatisfaction == 2 ~ "Medium", EnvironmentSatisfaction == 3 ~ "High", TRUE ~ "Very High"))
,JobInvolvement = as.factor(case_when(JobInvolvement == 1 ~ "Low", JobInvolvement == 2 ~ "Medium", JobInvolvement == 3 ~ "High", TRUE ~ "Very High"))
,JobSatisfaction = as.factor(case_when(JobSatisfaction == 1 ~ "Low", JobSatisfaction == 2 ~ "Medium", JobSatisfaction == 3 ~ "High", TRUE ~ "Very High"))
,PerformanceRating = as.factor(case_when(PerformanceRating == 1 ~ "Low", PerformanceRating == 2 ~ "Good", PerformanceRating == 3 ~ "Excellent", TRUE ~ "Outstanding"))
,RelationshipSatisfaction = as.factor(case_when(RelationshipSatisfaction == 1 ~ "Low", RelationshipSatisfaction == 2 ~ "Medium", RelationshipSatisfaction == 3 ~ "High", TRUE ~ "Very High"))
,WorkLifeBalance = as.factor(case_when(WorkLifeBalance == 1 ~ "Bad", WorkLifeBalance == 2 ~ "Good", WorkLifeBalance == 3 ~ "Better", TRUE ~ "Best"))
,JobLevel = as.factor(JobLevel)
) %>%
select(-EmployeeCount, -EmployeeNumber, -Over18, -StandardHours, -StockOptionLevel, -JobLevel)
After understanding the dimensions of the dataset, we will analyse the variables visually. This will help us understand the nature of the data: the distribution of individual variables/features and their relationships with other variables.
We will first carry out univariate data analysis and then move forward to bivariate analysis.
A. Dataset Summary
summary(dataset)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0
## 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0
## Median :36.00 Travel_Rarely :1043 Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
##
## Department DistanceFromHome Education
## Human Resources : 63 Min. : 1.000 Bachelor :572
## Research & Development:961 1st Qu.: 2.000 Below College:170
## Sales :446 Median : 7.000 College :282
## Mean : 9.193 Doctor : 48
## 3rd Qu.:14.000 Master :398
## Max. :29.000
##
## EducationField EnvironmentSatisfaction Gender
## Human Resources : 27 High :453 Female:588
## Life Sciences :606 Low :284 Male :882
## Marketing :159 Medium :287
## Medical :464 Very High:446
## Other : 82
## Technical Degree:132
##
## HourlyRate JobInvolvement JobRole
## Min. : 30.00 High :868 Sales Executive :326
## 1st Qu.: 48.00 Low : 83 Research Scientist :292
## Median : 66.00 Medium :375 Laboratory Technician :259
## Mean : 65.89 Very High:144 Manufacturing Director :145
## 3rd Qu.: 83.75 Healthcare Representative:131
## Max. :100.00 Manager :102
## (Other) :215
## JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
## High :442 Divorced:327 Min. : 1009 Min. : 2094
## Low :289 Married :673 1st Qu.: 2911 1st Qu.: 8047
## Medium :280 Single :470 Median : 4919 Median :14236
## Very High:459 Mean : 6503 Mean :14313
## 3rd Qu.: 8379 3rd Qu.:20462
## Max. :19999 Max. :26999
##
## NumCompaniesWorked OverTime PercentSalaryHike PerformanceRating
## Min. :0.000 No :1054 Min. :11.00 Excellent :1244
## 1st Qu.:1.000 Yes: 416 1st Qu.:12.00 Outstanding: 226
## Median :2.000 Median :14.00
## Mean :2.693 Mean :15.21
## 3rd Qu.:4.000 3rd Qu.:18.00
## Max. :9.000 Max. :25.00
##
## RelationshipSatisfaction TotalWorkingYears TrainingTimesLastYear
## High :459 Min. : 0.00 Min. :0.000
## Low :276 1st Qu.: 6.00 1st Qu.:2.000
## Medium :303 Median :10.00 Median :3.000
## Very High:432 Mean :11.28 Mean :2.799
## 3rd Qu.:15.00 3rd Qu.:3.000
## Max. :40.00 Max. :6.000
##
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Bad : 80 Min. : 0.000 Min. : 0.000
## Best :153 1st Qu.: 3.000 1st Qu.: 2.000
## Better:893 Median : 5.000 Median : 3.000
## Good :344 Mean : 7.008 Mean : 4.229
## 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :40.000 Max. :18.000
##
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 1.000 Median : 3.000
## Mean : 2.188 Mean : 4.123
## 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :15.000 Max. :17.000
##
This gives us a high-level view of the dataset. We can see that we don't have any NULL values. We can also see that we have just 237 attrition records out of 1470, which is approximately 16%. We need to maintain this ratio once we split the data into training and test sets for model evaluation, to avoid any bias.
B. Employee Personal Demographics - Numerical Variables
p1 <- ggplot(dataset) + geom_histogram(aes(Age), binwidth = 5, fill = "blue",col = "black")
p2 <- ggplot(dataset) + geom_histogram(aes(DistanceFromHome), binwidth = 5, fill = "blue",col = "black")
p3 <- ggplot(dataset) + geom_histogram(aes(NumCompaniesWorked), binwidth = 2, fill = "blue",col = "black")
p4 <- ggplot(dataset) + geom_histogram(aes(TotalWorkingYears), binwidth = 4, fill = "blue",col = "black")
grid.arrange(p1, p2, p3, p4, ncol = 2, nrow = 2)
Observations:
C. Employee Billing Rate Demographics - Numerical Variables
p1 <- ggplot(dataset) + geom_histogram(aes(HourlyRate), binwidth = 5, fill = "blue",col = "black")
p2 <- ggplot(dataset) + geom_histogram(aes(DailyRate), binwidth = 100, fill = "blue",col = "black")
p3 <- ggplot(dataset) + geom_histogram(aes(MonthlyRate), binwidth = 1000, fill = "blue",col = "black")
grid.arrange(p1, p2, p3, nrow = 3)
Observations:
D. Employee Work Demographics - Numerical Variables
p1 <- ggplot(dataset) + geom_histogram(aes(MonthlyIncome), binwidth = 1000, fill = "blue",col = "black")
p2 <- ggplot(dataset) + geom_histogram(aes(PercentSalaryHike), binwidth = 1, fill = "blue",col = "black")
p3 <- ggplot(dataset) + geom_histogram(aes(YearsAtCompany), binwidth = 2, fill = "blue",col = "black")
p4 <- ggplot(dataset) + geom_histogram(aes(YearsInCurrentRole), binwidth = 2, fill = "blue",col = "black")
p5 <- ggplot(dataset) + geom_histogram(aes(YearsSinceLastPromotion), binwidth = 2, fill = "blue",col = "black")
p6 <- ggplot(dataset) + geom_histogram(aes(YearsWithCurrManager), binwidth = 2, fill = "blue",col = "black")
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)
Observations:
E. Employee Personal Demographics - Categorical Variables
p1<- dataset %>%
group_by(Gender) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(Gender), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Gender") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 900))
p2<- dataset %>%
group_by(Education) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(Education), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Education") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 650))
p3 <- dataset %>%
group_by(EducationField) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(EducationField), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Education Field") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 650))
p4 <- dataset %>%
group_by(MaritalStatus) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(MaritalStatus), y = counts)) + geom_bar(stat = 'identity', fill = "coral1")+ ggtitle("Marital Status") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 750))
p5 <- dataset %>%
group_by(RelationshipSatisfaction) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(RelationshipSatisfaction), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Relationship Satisfaction") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())+ scale_y_continuous(limits = c(0, 500))
p6 <- dataset %>%
group_by(WorkLifeBalance) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(WorkLifeBalance), y = counts)) + geom_bar(stat = 'identity', fill = "coral1")+ ggtitle("Work Life Balance") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 950))
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)
Observations:
F. Employee Work Demographics - Categorical Variables
p1 <- dataset %>%
group_by(BusinessTravel) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(BusinessTravel), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Business Travel") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ theme(plot.title = element_text(size =10),axis.text.x = element_text(size =10,angle = 45, hjust = 1),axis.title.x=element_blank())+ scale_y_continuous(limits = c(0, 1100))
p2 <- dataset %>%
group_by(EnvironmentSatisfaction) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(EnvironmentSatisfaction), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Environment Satisfaction") + geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =10,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 500))
p3 <- dataset %>%
group_by(JobInvolvement) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(JobInvolvement), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Job Involvement") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ theme(plot.title = element_text(size =10),axis.text.x = element_text(size =10,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 900))
p4 <- dataset %>%
group_by(JobSatisfaction) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(JobSatisfaction), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Job Satisfaction") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 500))
p5 <- dataset %>%
group_by(OverTime) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(OverTime), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Over Time") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 1100))
p6 <- dataset %>%
group_by(PerformanceRating) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(PerformanceRating), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Performance Rating") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) + scale_y_continuous(limits = c(0, 1300))
grid.arrange(p1,p2,p3,p4,p5,p6,nrow = 2)
p1 <- dataset %>%
group_by(Department) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(Department), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Department") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ theme(plot.title = element_text(size =10),axis.text.x = element_text(size = 7, angle = 45, hjust = 1),axis.title.x=element_blank())
p2 <- dataset %>%
group_by(JobRole) %>%
summarise(counts = n()) %>%
ggplot(aes(x = as.factor(JobRole), y = counts)) + geom_bar(stat = 'identity', fill = "coral1") + ggtitle("Job Role") +geom_text(aes(label=counts), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25) + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
grid.arrange(p1,p2 ,ncol = 2)
Observations:
After looking at every feature individually, let's now do some bivariate/multivariate analysis. Here we'll explore the independent variables with respect to the target variable. The objective is to discover hidden relationships between the independent variables and the target variable.
A. Employee Personal Demographics - Numerical Variables
p1 <- dataset %>%
ggplot(aes(x = Age, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Age") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p2 <- dataset %>%
ggplot(aes(x = DistanceFromHome, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Distance From Home") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p3 <- dataset %>%
ggplot(aes(x = NumCompaniesWorked, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Number of Companies") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p4 <- dataset %>%
ggplot(aes(x = TotalWorkingYears, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Total Working Years") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
grid.arrange(p1, p2, p3, p4, nrow = 2, ncol = 2)
Observations:
B. Employee Billing Rate Demographics - Numerical Variables
p1 <- dataset %>%
ggplot(aes(x = HourlyRate, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Hourly Rate") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p2 <- dataset %>%
ggplot(aes(x = DailyRate, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Daily Rate") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p3 <- dataset %>%
ggplot(aes(x = MonthlyRate, fill = Attrition)) + geom_density(alpha = 0.5)+ ggtitle("Monthly Rate") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
grid.arrange(p1, p2, p3)
Observations:
C. Employee Work Demographics - Numerical Variables
p1 <- dataset %>%
ggplot(aes(x = MonthlyIncome, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Monthly Income") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p2 <- dataset %>%
ggplot(aes(x = PercentSalaryHike, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Percentage Salary Hike") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p3 <- dataset %>%
ggplot(aes(x = YearsAtCompany, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Years At Company") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p4 <- dataset %>%
ggplot(aes(x = YearsInCurrentRole, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Years in Current Role") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p5 <- dataset %>%
ggplot(aes(x = YearsSinceLastPromotion, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Years Since Last Promotion") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
p6 <- dataset %>%
ggplot(aes(x = YearsWithCurrManager, fill = Attrition)) + geom_density(alpha = 0.5) + ggtitle("Years With Current Manager") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank())
grid.arrange(p1, p2, p3, p4, p5, p6 , nrow = 3, ncol = 2)
Observations:
D. Employee Personal Demographics - Categorical Variables
p1 <- dataset %>%
group_by(Gender) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = Gender, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Gender") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 20))
p2 <- dataset %>%
group_by(Education) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = Education, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Education") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 20))
p3 <- dataset %>%
group_by(EducationField) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = EducationField, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Education Field") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))
p4 <- dataset %>%
group_by(MaritalStatus) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = MaritalStatus, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Marital Status") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))
p5 <- dataset %>%
group_by(RelationshipSatisfaction) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = as.factor(RelationshipSatisfaction), y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Relationship Satisfaction") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))
p6 <- dataset %>%
group_by(WorkLifeBalance) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = as.factor(WorkLifeBalance), y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Work Life Balance") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 35))
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)
Observations:
E. Employee Work Demographics - Categorical Variables
p1 <- dataset %>%
group_by(BusinessTravel) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = BusinessTravel, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Business Travel") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))
p2 <- dataset %>%
group_by(EnvironmentSatisfaction) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = EnvironmentSatisfaction, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Environment Satisfaction") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))
p3 <- dataset %>%
group_by(JobInvolvement) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = JobInvolvement, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Job Involvement") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 40))
p4 <- dataset %>%
group_by(JobSatisfaction) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = JobSatisfaction, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Job Satisfaction") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 30))
p5 <- dataset %>%
group_by(OverTime) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = OverTime, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Over Time") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 35))
p6 <- dataset %>%
group_by(PerformanceRating) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = as.factor(PerformanceRating), y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Performance Rating") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)+ scale_y_continuous(limits = c(0, 20))
grid.arrange(p1, p2, p3, p4, p5, p6, nrow = 2, ncol = 3)
p1 <- dataset %>%
group_by(Department) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = Department, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Department") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)
p2 <- dataset %>%
group_by(JobRole) %>%
summarise(attrition_rate = round((sum(if_else(Attrition == "Yes",1,0))/n()*100),2)) %>%
ggplot(aes(x = JobRole, y = attrition_rate))+ geom_bar(stat = 'identity',fill = "coral3") + ggtitle("Attrition Rate - Job Role") + theme(plot.title = element_text(size =10),axis.text.x = element_text(size =7,angle = 45, hjust = 1),axis.title.x=element_blank()) +geom_text(aes(label=attrition_rate), size = 2.5, position=position_dodge(width=0.2), vjust=-0.25)
grid.arrange(p1, p2, ncol = 2)
Observations:
It is the rare database that contains no missing values. In the world of big data, the problem of missing data is widespread and so it is important to have an understanding of methods for handling missing data that will not bias the results.
The common techniques to handle missing data are:
1. Deletion of rows with missing values. The downside of this method is the loss of information and a drop in the model's predictive power.
2. Imputation: for a continuous variable, missing values can be replaced with the mean or median of the known values of that variable; for a categorical variable, with the mode of the given values (see the sketch after this list).
3. Building a prediction model: we can build a predictive model to impute the missing data in a variable, treating the variable with missing values as the target and the other variables (except the actual target) as predictors.
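As a minimal sketch of the imputation technique (for a hypothetical data frame df with a numeric column x and a categorical column g):
df$x[is.na(df$x)] <- median(df$x, na.rm = TRUE)      # numeric: impute with the median
mode_g <- names(which.max(table(df$g)))              # categorical: most frequent level
df$g[is.na(df$g)] <- mode_g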
Checking for missing data in our dataset.
sum(is.na(dataset))
## [1] 0
Luckily, we have no missing values in our dataset and hence we will move ahead.
Most machine learning algorithms produce better results with numerical variables, so it is essential to encode the categorical variables present in the data. We will use one-hot encoding, wherein each category of a categorical variable is converted into a new binary (1/0) column. We exclude the target, Attrition (column 2), from the encoding.
dmy <- dummyVars(~., data = dataset[-2])
trsf <- data.frame(predict(dmy, newdata = dataset[-2]))
Skewness in variables is undesirable for predictive modeling. Some machine learning methods assume normally distributed data, and a skewed variable can be transformed by taking its log, square root, or cube root to bring its distribution as close to normal as possible. We will treat the skewness here with a log transformation, and we will also center and scale the values.
trsf <- trsf %>%
mutate(Age = log(Age + 1)
,DailyRate = log(DailyRate + 1)
,DistanceFromHome = log(DistanceFromHome + 1)
,HourlyRate = log(HourlyRate + 1)
,MonthlyIncome = log(MonthlyIncome + 1)
,MonthlyRate = log(MonthlyRate + 1)
,NumCompaniesWorked = log(NumCompaniesWorked + 1)
,PercentSalaryHike = log(PercentSalaryHike + 1)
,TotalWorkingYears = log(TotalWorkingYears + 1)
,TrainingTimesLastYear = log(TrainingTimesLastYear + 1)
,YearsAtCompany = log(YearsAtCompany +1)
,YearsInCurrentRole = log(YearsInCurrentRole + 1)
,YearsSinceLastPromotion = log(YearsSinceLastPromotion + 1)
,YearsWithCurrManager = log(YearsWithCurrManager + 1))
prep_num = preProcess(trsf, method=c("center", "scale"))
final_dataset = predict(prep_num, trsf)
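The target itself was excluded from the encoding (dataset[-2]) and from the scaling above, so it needs to be added back to final_dataset as a 0/1 outcome before splitting the data; one way to do this:
# Re-attach the target as a numeric 0/1 outcome
final_dataset$Attrition <- if_else(dataset$Attrition == "Yes", 1, 0)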
Since we are predicting attrition, an outcome that is either yes or no, this analysis falls under the classification category.
We will use two methods to measure model performance: the confusion matrix and the ROC curve with its AUC.
A confusion matrix is a table often used to describe the performance of a classification model (or classifier) on a set of test data for which the true values are known. From the information in the confusion matrix it is possible to derive metrics that measure the fitness of a model in a particular context.
The columns hold the actual positives and negatives, while the rows hold the predicted positives and negatives.
True positives and true negatives are the cases the model predicted correctly; false positives and false negatives are its wrong predictions.
Various measures can be derived from the confusion matrix:
1. Accuracy: (TP + TN) / (TP + FN + FP + TN)
2. Sensitivity: TP / (TP + FN)
3. Specificity: TN / (FP + TN)
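As a quick illustration with hypothetical counts:
TP <- 90; FN <- 20; FP <- 10; TN <- 80   # hypothetical confusion-matrix cells
(TP + TN) / (TP + FN + FP + TN)          # Accuracy    = 0.85
TP / (TP + FN)                           # Sensitivity = 0.8182
TN / (FP + TN)                           # Specificity = 0.8889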
ROC is a graphical representation of the trade-off between sensitivity and specificity, and AUC is the area under the ROC curve. The higher the AUC, the better the model's performance. An AUC of 0.5 corresponds to random guessing; if a model's AUC is below 0.5, it performs worse than random predictions and should never be put into production.
It is also possible for the confusion matrix to show high accuracy (above 0.8) while the AUC is near 0.5, so it is necessary to consider both values before judging the performance of a model.
Let's start by splitting the dataset into training and test sets. We will use the training set to train our models and test their accuracy on the test set.
We need to ensure that the attrition ratio in the training and test sets matches that of the main dataset, to avoid any bias in prediction. For this we will use the createDataPartition function. Our training set will have 70% of the rows and the test set the remaining 30%.
Train <- createDataPartition(final_dataset$Attrition, p=0.7, list=FALSE)
training <- final_dataset[ Train, ]
testing <- final_dataset[ -Train, ]
testing$Attrition <- as.factor(testing$Attrition)
Let's check whether the proportion of attrition records is maintained in the sets we created.
prop.table(table(final_dataset$Attrition))
##
## 0 1
## 0.8387755 0.1612245
prop.table(table(training$Attrition))
##
## 0 1
## 0.8367347 0.1632653
prop.table(table(testing$Attrition))
##
## 0 1
## 0.8435374 0.1564626
Yes, the proportions are maintained.
We will build three models to predict the attrition of an employee: logistic regression, random forest, and XGBoost.
Let's start implementing them. For each model we follow the same procedure: train with 5-fold cross-validation, predict on the test set, inspect the confusion matrix, and plot the ROC curve with its AUC. We begin with logistic regression.
control <- trainControl(method="repeatedcv", number=5)
set.seed(123)
model_lr <- train(as.factor(Attrition)~., data=training, method="glm", trControl=control)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
pred_lr <- predict(model_lr, newdata=testing)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
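These rank-deficiency warnings arise because dummyVars(~.) creates one binary column per factor level, so each factor's dummy columns sum to one and are linearly dependent; the GLM silently drops the redundant columns. A tweak we have not applied here, but which would avoid the warnings, is to request a full-rank encoding:
dmy <- dummyVars(~., data = dataset[-2], fullRank = TRUE)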
confusionMatrix(pred_lr, testing$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 356 37
## 1 16 32
##
## Accuracy : 0.8798
## 95% CI : (0.8458, 0.9087)
## No Information Rate : 0.8435
## P-Value [Acc > NIR] : 0.01854
##
## Kappa : 0.4803
## Mcnemar's Test P-Value : 0.00601
##
## Sensitivity : 0.9570
## Specificity : 0.4638
## Pos Pred Value : 0.9059
## Neg Pred Value : 0.6667
## Prevalence : 0.8435
## Detection Rate : 0.8073
## Detection Prevalence : 0.8912
## Balanced Accuracy : 0.7104
##
## 'Positive' Class : 0
##
ROCRpred <- prediction(as.numeric(pred_lr), as.numeric(testing$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_lr <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_lr, digits=5, scientific=FALSE)))
Next, we train a random forest with the same cross-validation setup.
control <- trainControl(method="repeatedcv", number=5)
set.seed(123)
model_rf <- train(as.factor(Attrition)~., data=training, method="rf", trControl=control)
pred_rf <- predict(model_rf, newdata=testing)
confusionMatrix(pred_rf, testing$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 363 52
## 1 9 17
##
## Accuracy : 0.8617
## 95% CI : (0.8259, 0.8925)
## No Information Rate : 0.8435
## P-Value [Acc > NIR] : 0.1628
##
## Kappa : 0.2978
## Mcnemar's Test P-Value : 7.551e-08
##
## Sensitivity : 0.9758
## Specificity : 0.2464
## Pos Pred Value : 0.8747
## Neg Pred Value : 0.6538
## Prevalence : 0.8435
## Detection Rate : 0.8231
## Detection Prevalence : 0.9410
## Balanced Accuracy : 0.6111
##
## 'Positive' Class : 0
##
ROCRpred <- prediction(as.numeric(pred_rf), as.numeric(testing$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_rf <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_rf, digits=5, scientific=FALSE)))
Finally, we train an XGBoost (extreme gradient boosting) model.
control <- trainControl(method="repeatedcv", number=5)
set.seed(123)
model_xgb <- train(as.factor(Attrition)~., data=training, method="xgbTree", trControl=control)
pred_xgb <- predict(model_xgb, newdata=testing)
confusionMatrix(pred_xgb, testing$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 362 43
## 1 10 26
##
## Accuracy : 0.8798
## 95% CI : (0.8458, 0.9087)
## No Information Rate : 0.8435
## P-Value [Acc > NIR] : 0.01854
##
## Kappa : 0.4346
## Mcnemar's Test P-Value : 1.105e-05
##
## Sensitivity : 0.9731
## Specificity : 0.3768
## Pos Pred Value : 0.8938
## Neg Pred Value : 0.7222
## Prevalence : 0.8435
## Detection Rate : 0.8209
## Detection Prevalence : 0.9184
## Balanced Accuracy : 0.6750
##
## 'Positive' Class : 0
##
ROCRpred <- prediction(as.numeric(pred_xgb), as.numeric(testing$Attrition))
ROCRpref <- performance(ROCRpred,"auc")
auc_xgb <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_xgb, digits=5, scientific=FALSE)))
Ensemble modeling is the process of running two or more related but different analytical models and then synthesizing the results into a single score or spread in order to improve the accuracy of predictive analytics and data mining applications.
We will use SuperLearner to ensemble the above three models into one and see whether it collectively outperforms the logistic regression model.
SuperLearner requires the predictors (X) and responses (Y) to be in their own data structures, so we split them out for both the training and test sets.
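# Column 65 holds the 0/1 Attrition outcome appended earlier; columns 1:64 are the encoded predictors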
ytrain <- as.numeric(training[,65])
ytest <- as.numeric(testing[,65])
xtrain <- data.frame(training[,1:64])
xtest <- data.frame(testing[,1:64])
Now we train the ensemble using all three models together.
set.seed(150)
single.model <- SuperLearner(ytrain,
xtrain,
family=binomial(),
SL.library=list("SL.ranger",
"SL.glm",
"SL.xgboost"))
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Loading required package: xgboost
##
## Attaching package: 'xgboost'
## The following object is masked from 'package:dplyr':
##
## slice
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
Simply printing the model shows each algorithm's coefficient, which is its weight in the ensemble, and its risk, which is the estimated error the algorithm produces.
single.model
##
## Call:
## SuperLearner(Y = ytrain, X = xtrain, family = binomial(), SL.library = list("SL.ranger",
## "SL.glm", "SL.xgboost"))
##
##
## Risk Coef
## SL.ranger_All 0.11131486 0.25225946
## SL.glm_All 0.09863218 0.68782039
## SL.xgboost_All 0.11921474 0.05992015
Predicting the output for the test set.
predictions <- predict.SuperLearner(single.model, newdata=xtest)
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
head(predictions$library.predict)
## SL.ranger_All SL.glm_All SL.xgboost_All
## [1,] 0.2375000 0.46317283 0.65912575
## [2,] 0.1520000 0.02129748 0.03247750
## [3,] 0.1320000 0.02270261 0.03948908
## [4,] 0.2613333 0.07986202 0.16892673
## [5,] 0.1400000 0.17017250 0.03366067
## [6,] 0.2170000 0.24031384 0.03786308
Note that the prediction given by SuperLearner is always a probability. By choosing an appropriate threshold value we can increase the accuracy, sensitivity, or specificity based on the requirements, as the sketch below shows.
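For example, one could sweep candidate thresholds and compare the resulting accuracy (an illustrative sketch; ideally this tuning would be done on a separate validation set rather than the test set):
thresholds <- seq(0.20, 0.60, by = 0.05)
accs <- sapply(thresholds, function(t) mean(ifelse(predictions$pred >= t, 1, 0) == ytest))
data.frame(threshold = thresholds, accuracy = round(accs, 4))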
Here the threshold value is set to 0.35 to get the best result.
conv.preds <- ifelse(predictions$pred>=0.35,1,0)
Creating a confusion matrix to check the accuracy
cm <- confusionMatrix(as.factor(conv.preds), as.factor(testing[,65]))
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 347 29
## 1 25 40
##
## Accuracy : 0.8776
## 95% CI : (0.8433, 0.9067)
## No Information Rate : 0.8435
## P-Value [Acc > NIR] : 0.0259
##
## Kappa : 0.5249
## Mcnemar's Test P-Value : 0.6831
##
## Sensitivity : 0.9328
## Specificity : 0.5797
## Pos Pred Value : 0.9229
## Neg Pred Value : 0.6154
## Prevalence : 0.8435
## Detection Rate : 0.7868
## Detection Prevalence : 0.8526
## Balanced Accuracy : 0.7563
##
## 'Positive' Class : 0
##
ROCRpred <- prediction(as.numeric(conv.preds), as.numeric(ytest))
ROCRpref <- performance(ROCRpred,"auc")
auc_ens <- as.numeric(ROCRpref@y.values)
perf_ROC <- performance(ROCRpred,"tpr","fpr") #plot the actual ROC curve
plot(perf_ROC, main="ROC plot")
text(0.5,0.5,paste("AUC = ",format(auc_ens, digits=5, scientific=FALSE)))
To conclude, we went through the entire process: importing the dataset, getting to know it at a high level, carrying out EDA (univariate and multivariate), pre-processing the data, and finally building models to predict the classification.
Every model comes with hyperparameters that can be tuned to obtain higher accuracy or results geared to a specific goal. For example, in healthcare, when predicting whether a person has cancer, a model with lower overall accuracy may be acceptable as long as it produces very few false negatives, i.e. cases where a person actually has cancer but the model predicts that he does not. This is where hyperparameter tuning comes into the picture, letting us tweak the results accordingly. As we had no such specific requirement and achieved the desired levels of accuracy and AUC, we have not used hyperparameter tuning here.
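For reference, caret exposes such tuning through its tuneGrid argument; a hypothetical sketch for the random forest's mtry parameter (the number of variables tried at each split, the only tunable parameter for method = "rf"):
grid <- expand.grid(mtry = c(2, 8, 16, 32))
model_rf_tuned <- train(as.factor(Attrition)~., data = training,
                        method = "rf", trControl = control, tuneGrid = grid)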