Objective: To predict if an employee is going to resign or not
Dataset : We are using a dataset put up by IBM for our analysis. The dataset contain 35 variables along with Attrition variable. It can be downloaded from the following link
Link- https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/
Methodology: 1. Through our analysis we intend to build a model which can predict if an employee is about to quit. 2. We shall be looking at all variables through some plots and infer about it in our exploratory analysis. 3. After our exploration we shall build some features based on the Variables at hand and take a call on inclusion/exclusion of few variables.
We load the data and look at the different Variables in the dataset
setwd("C:/Users/user/Desktop/Bhanu/Data Science/Self Projects/IBM HR Data Analysis")
HRdata <- read.csv("IBMhrData.csv")
names(HRdata)
## [1] "ï..Age" "Attrition"
## [3] "BusinessTravel" "DailyRate"
## [5] "Department" "DistanceFromHome"
## [7] "Education" "EducationField"
## [9] "EmployeeCount" "EmployeeNumber"
## [11] "EnvironmentSatisfaction" "Gender"
## [13] "HourlyRate" "JobInvolvement"
## [15] "JobLevel" "JobRole"
## [17] "JobSatisfaction" "MaritalStatus"
## [19] "MonthlyIncome" "MonthlyRate"
## [21] "NumCompaniesWorked" "Over18"
## [23] "OverTime" "PercentSalaryHike"
## [25] "PerformanceRating" "RelationshipSatisfaction"
## [27] "StandardHours" "StockOptionLevel"
## [29] "TotalWorkingYears" "TrainingTimesLastYear"
## [31] "WorkLifeBalance" "YearsAtCompany"
## [33] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [35] "YearsWithCurrManager"
colnames(HRdata)[1] <- "Age" # Renaming the column
In ordet to build our model we have to train on a set of observations, which we call the Training Set and then validate on the remaining set of observations, which we call our Testing Set.
So, we will perform all our analysis on the training set and validate it on our testing set. We shall divide our dataset into training(75%) and testing(25%)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(12345)
inTrain <- createDataPartition(HRdata$Attrition,p=0.75,list = FALSE)
Training <- HRdata[inTrain,]
Testing <- HRdata[-inTrain,]
Lets have a look at the format of the fields which we have to play with in the Training Set
str(Training)
## 'data.frame': 1103 obs. of 35 variables:
## $ Age : int 41 37 33 27 32 59 30 38 36 35 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 2 1 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 3 2 3 2 3 3 2 3 3 ...
## $ DailyRate : int 1102 1373 1392 591 1005 1324 1358 216 1299 809 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 2 3 2 2 3 24 23 27 16 ...
## $ Education : int 2 2 4 1 2 3 1 3 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 5 2 4 2 4 2 2 4 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 4 5 7 8 10 11 12 13 14 ...
## $ EnvironmentSatisfaction : int 2 4 4 1 4 3 4 4 3 1 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 1 2 2 1 2 2 2 2 ...
## $ HourlyRate : int 94 92 56 40 79 81 67 44 94 84 ...
## $ JobInvolvement : int 3 2 3 3 3 4 3 2 3 4 ...
## $ JobLevel : int 2 1 1 1 1 1 1 3 2 1 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 3 7 3 3 3 3 5 1 3 ...
## $ JobSatisfaction : int 4 3 3 2 4 1 3 3 3 2 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 3 2 2 3 2 1 3 2 2 ...
## $ MonthlyIncome : int 5993 2090 2909 3468 3068 2670 2693 9526 5237 2426 ...
## $ MonthlyRate : int 19479 2396 23159 16632 11864 9964 13335 8787 16577 16479 ...
## $ NumCompaniesWorked : int 8 6 1 9 0 4 1 0 6 0 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 2 2 1 1 2 1 1 1 1 ...
## $ PercentSalaryHike : int 11 15 11 12 13 20 22 21 13 13 ...
## $ PerformanceRating : int 3 3 3 3 3 4 4 4 3 3 ...
## $ RelationshipSatisfaction: int 1 2 3 4 3 1 2 2 2 3 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 0 0 1 0 3 1 0 2 1 ...
## $ TotalWorkingYears : int 8 7 8 6 8 12 1 10 17 6 ...
## $ TrainingTimesLastYear : int 0 3 3 3 2 3 2 2 3 5 ...
## $ WorkLifeBalance : int 1 3 3 3 2 2 3 3 2 3 ...
## $ YearsAtCompany : int 6 0 8 2 7 1 1 9 7 5 ...
## $ YearsInCurrentRole : int 4 0 7 2 7 0 0 7 7 4 ...
## $ YearsSinceLastPromotion : int 0 0 3 2 3 0 0 1 7 0 ...
## $ YearsWithCurrManager : int 5 0 0 2 6 0 0 8 7 3 ...
Let us start by taking a look at the attrition percentage of IBM. We see that we have approximately 16% leaving the organization.
library(ggplot2)
ggplot(Training,aes(Attrition,fill=Attrition))+geom_bar()
prop.table(table(Training$Attrition)) #Percentage of Attrition
##
## No Yes
## 0.8386219 0.1613781
Let us look at each variable and see its influence on the Attrition of the organization
library(ggplot2)
library(grid)
library(gridExtra)
agePlot <- ggplot(Training,aes(Age,fill=Attrition))+geom_density()+facet_grid(~Attrition)
travelPlot <- ggplot(Training,aes(BusinessTravel,fill=Attrition))+geom_bar()
ratePlot <- ggplot(Training,aes(DailyRate,Attrition))+geom_point(size=4,alpha = 0.05)
depPlot <- ggplot(Training,aes(Department,fill = Attrition))+geom_bar()
grid.arrange(agePlot,travelPlot,ratePlot,depPlot,ncol=2,top = "Fig 1")
distPlot <- ggplot(Training,aes(DistanceFromHome,fill=Attrition))+geom_bar()
eduPlot <- ggplot(Training,aes(Education,fill=Attrition))+geom_bar()
edufieldPlot <- ggplot(Training,aes(EducationField,fill=Attrition))+geom_bar()
envPlot <- ggplot(Training,aes(EnvironmentSatisfaction,fill=Attrition))+geom_bar()
genPlot <- ggplot(Training,aes(Gender,fill=Attrition))+geom_bar()
grid.arrange(distPlot,eduPlot,edufieldPlot,envPlot,genPlot,ncol=2,top = "Fig 2")
hourlyPlot <- ggplot(Training,aes(HourlyRate,fill=Attrition))+geom_bar()
jobInvPlot <- ggplot(Training,aes(JobInvolvement,fill=Attrition))+geom_bar()
jobLevelPlot <- ggplot(Training,aes(JobLevel,fill=Attrition))+geom_bar()
jobSatPlot <- ggplot(Training,aes(JobSatisfaction,fill=Attrition))+geom_bar()
grid.arrange(hourlyPlot,jobInvPlot,jobLevelPlot,jobSatPlot,ncol=2,top = "Fig 3")
marPlot <- ggplot(Training,aes(MaritalStatus,fill=Attrition))+geom_bar()
monthlyIncPlot <- ggplot(Training,aes(MonthlyIncome,fill=Attrition))+geom_density()
monthlyRatePlot <- ggplot(Training,aes(MonthlyRate,fill=Attrition))+geom_density()
numCompPlot <- ggplot(Training,aes(NumCompaniesWorked,fill=Attrition))+geom_bar()
grid.arrange(marPlot,monthlyIncPlot,monthlyRatePlot,numCompPlot,ncol=2,top = "Fig 4")
overTimePlot <- ggplot(Training,aes(OverTime,fill=Attrition))+geom_bar()
hikePlot <- ggplot(Training,aes(PercentSalaryHike,Attrition))+geom_point(size=4,alpha = 0.01)
perfPlot <- ggplot(Training,aes(PerformanceRating,fill = Attrition))+geom_bar()
RelSatPlot <- ggplot(Training,aes(RelationshipSatisfaction,fill = Attrition))+geom_bar()
grid.arrange(overTimePlot,hikePlot,perfPlot,RelSatPlot,ncol=2,top = "Fig 5")
StockPlot <- ggplot(Training,aes(StockOptionLevel,fill = Attrition))+geom_bar()
workingYearsPlot <- ggplot(Training,aes(TotalWorkingYears,fill = Attrition))+geom_bar()
TrainTimesPlot <- ggplot(Training,aes(TrainingTimesLastYear,fill = Attrition))+geom_bar()
WLBPlot <- ggplot(Training,aes(WorkLifeBalance,fill = Attrition))+geom_bar()
grid.arrange(StockPlot,workingYearsPlot,TrainTimesPlot,WLBPlot,ncol=2,top = "Fig 6")
YearAtComPlot <- ggplot(Training,aes(YearsAtCompany,fill = Attrition))+geom_bar()
YearInCurrPlot <- ggplot(Training,aes(YearsInCurrentRole,fill = Attrition))+geom_bar()
YearsSinceProm <- ggplot(Training,aes(YearsSinceLastPromotion,fill = Attrition))+geom_bar()
YearsCurrManPlot <- ggplot(Training,aes(YearsWithCurrManager,fill = Attrition))+geom_bar()
grid.arrange(YearAtComPlot,YearInCurrPlot,YearsSinceProm,YearsCurrManPlot,ncol=2,top = "Fig 7")
If we look at the dataset the minority is Attrition - Yes cases ( 16%).
Usually if we the proportion of Minority class is too low, we can try to balance the sample using Oversampling or Undersampling techniques. They come with their advantages and disadvantages though, which should be kept in mind while doing so.
In this analysis,we don’t get any better results using the above mentioned techniques. Also the proportion and absolute number is not too less. Hence, we stick with the sample we have for model building.
Training1_os <- Training
Feature engineering is one aspect which provided a huge impact on the outcome rather than the model. Here, we try at creating new features with the existing variables we have based on my assumptions.
Tenure per job: Usually, people who have worked with many companies but for small periods at every organization tend to leave early as they always need a change of Organization to keep them going.
Years without Change: For any person, a change either in role or job level or responsibility is needed to keep the work exciting to continue. We create a variable to see how many years it has been for an employee without any sort of change using Promotion, Role and Job Change as a metric to cover different variants of change.
If we look at the plots in Fig 8, we see that there is an influence of these new features on the Attrition.
Here, we use the Comapny pay information to calculate our Compa Ratio at Department Level & Organiation Level.
People, with Compa Ratio less than 1, usually feel underpaid and show more tendency to leave the Organization in search of a better pay.
If we look at the figure (Fig 9), we can notice the effect of lower Compa Ratios.
Training1_os$TenurePerJob <- ifelse(Training1_os$NumCompaniesWorked!=0, Training1_os$TotalWorkingYears/Training1_os$NumCompaniesWorked,0)
Training1_os$YearWithoutChange <- Training1_os$YearsInCurrentRole - Training1_os$YearsSinceLastPromotion
Training1_os$YearsWithoutChange2 <- Training1_os$TotalWorkingYears - Training1_os$YearsSinceLastPromotion
tenurePlot <- ggplot(Training1_os,aes(TenurePerJob))+geom_density()+facet_grid(~Attrition)
changePlot <- ggplot(Training1_os,aes(YearWithoutChange))+geom_density()+facet_grid(~Attrition)
change2Plot <- ggplot(Training1_os,aes(YearsWithoutChange2))+geom_density()+facet_grid(~Attrition)
grid.arrange(tenurePlot,changePlot,change2Plot,ncol=2,top = "Fig 8")
Med_HR <- median(Training1_os[Training1_os$Department == 'Human Resources',]$MonthlyIncome)
Med_RnD <- median(Training1_os[Training1_os$Department == 'Research & Development',]$MonthlyIncome)
Med_Sales <- median(Training1_os[Training1_os$Department == 'Sales',]$MonthlyIncome)
Med_LabTech <- median(Training1_os[Training1_os$JobRole == 'Laboratory Technician',]$MonthlyIncome)
TrainLabTech <- Training1_os[Training1_os$JobRole == 'Laboratory Technician',]
TrainLabTech$comparole <- TrainLabTech$MonthlyIncome/Med_LabTech
Med_overall <- median(Training1_os$MonthlyIncome)
Training1_os$CompaRatioDep <- ifelse(Training1_os$Department == 'Human Resources',Training1_os$MonthlyIncome/Med_HR,ifelse(Training1_os$Department=='Research & Development',Training1_os$MonthlyIncome/Med_RnD,Training1_os$MonthlyIncome/Med_Sales))
Training1_os$CompaRatioOverall <- Training1_os$MonthlyIncome/Med_overall
Training1_os$CompaOverallGroup <- ifelse(Training1_os$CompaRatioOverall>4,4,ifelse(Training1_os$CompaRatioOverall>3,3,ifelse(Training1_os$CompaRatioOverall>2,2,ifelse(Training1_os$CompaRatioOverall>1,1,ifelse(Training1_os$CompaRatioOverall>0.5,0.5,0)))))
Training1_os$CompaDepGroup <- ifelse(Training1_os$CompaRatioDep>4,4,ifelse(Training1_os$CompaRatioDep>3,3,ifelse(Training1_os$CompaRatioDep>2,2,ifelse(Training1_os$CompaRatioDep>1,1,ifelse(Training1_os$CompaRatioDep>0.5,0.5,0)))))
CompaOverallPlot <- ggplot(Training1_os,aes(CompaRatioOverall))+geom_density()+facet_grid(~Attrition)
CompaDepPlot <- ggplot(Training1_os,aes(CompaRatioDep))+geom_density()+facet_grid(~Attrition)
grid.arrange(CompaOverallPlot,CompaDepPlot,ncol=2,top = "Fig 9")
# Adding the variables for Testing Set
Testing$TenurePerJob <- ifelse(Testing$NumCompaniesWorked!=0, Testing$TotalWorkingYears/Testing$NumCompaniesWorked,0)
Testing$YearWithoutChange <- Testing$YearsInCurrentRole - Testing$YearsSinceLastPromotion
Testing$YearsWithoutChange2 <- Testing$TotalWorkingYears - Testing$YearsSinceLastPromotion
Testing$CompaRatioDep <- ifelse(Testing$Department == 'Human Resources',Testing$MonthlyIncome/Med_HR,ifelse(Testing$Department=='Research & Development',Testing$MonthlyIncome/Med_RnD,Testing$MonthlyIncome/Med_Sales))
Testing$CompaRatioOverall <- Testing$MonthlyIncome/Med_overall
Testing$CompaOverallGroup <- ifelse(Testing$CompaRatioOverall>4,4,ifelse(Testing$CompaRatioOverall>3,3,ifelse(Testing$CompaRatioOverall>2,2,ifelse(Testing$CompaRatioOverall>1,1,ifelse(Testing$CompaRatioOverall>0.5,0.5,0)))))
Testing$CompaDepGroup <- ifelse(Testing$CompaRatioDep>4,4,ifelse(Testing$CompaRatioDep>3,3,ifelse(Testing$CompaRatioDep>2,2,ifelse(Testing$CompaRatioDep>1,1,ifelse(Testing$CompaRatioDep>0.5,0.5,0)))))
#Testing$AvgSatis <- with(Testing,(EnvironmentSatisfaction+JobInvolvement+JobSatisfaction+RelationshipSatisfaction+WorkLifeBalance)/4)
Many of our variables are either continuos in nature or have lot of discrete values which peak at certain points.
To make sense of it we need to Categorise our variables which can collect the impact making groups and make more sense. Hence, we categorize many of the features.
While doing these transformations, we ensure that we do the same to our Testing set also.
Training1_os$AgeGroup <- with(Training1_os,ifelse(Age>55,8,ifelse(Age>50,7,ifelse(Age>45,6,ifelse(Age>40,5,ifelse(Age>35,4,ifelse(Age>30,3,ifelse(Age>25,2,1)))))))) #Creating Age Groups
Training1_os$DistanceGroup <- with(Training1_os,ifelse(DistanceFromHome>25,6,ifelse(DistanceFromHome>20,5,ifelse(DistanceFromHome>15,4,ifelse(DistanceFromHome>10,3,ifelse(DistanceFromHome>5,2,1)))))) #Creating Distance Groups
Training1_os$YearsWithManagerGroup <- with(Training1_os,ifelse(YearsWithCurrManager>15,5,ifelse(YearsWithCurrManager>10,4,ifelse(YearsWithCurrManager>5,3,ifelse(YearsWithCurrManager>2,2,1))))) #Creating YearsWithManager Groups
Training1_os$TenureGroup <- with(Training1_os,ifelse(TenurePerJob>35,9,ifelse(TenurePerJob>30,8,ifelse(TenurePerJob>25,7,ifelse(TenurePerJob>20,6,ifelse(TenurePerJob>15,5,ifelse(TenurePerJob>10,4,ifelse(TenurePerJob>5,3,ifelse(TenurePerJob>2,2,1))))))))) #Creating Tenure Per Job groups
Training1_os$Change2Group <- with(Training1_os,ifelse(YearsWithoutChange2>10,3,ifelse(YearsWithoutChange2>5,2,1))) #Creating Years Without Change2
Training1_os$Change1Group <- with(Training1_os,ifelse(YearWithoutChange>2.5,3,ifelse(YearWithoutChange>-2.5,2,1))) #Creating Years Without Change 1
#Training1_os$AvgSatisGroup <- with(Training1_os,ifelse(AvgSatis<2.5,1,2)) # Create Average Satisfaction Groups
Training1_os$WorkYearGroup <- with(Training1_os,ifelse(TotalWorkingYears>35,9,ifelse(TotalWorkingYears>30,8,ifelse(TotalWorkingYears>25,7,ifelse(TotalWorkingYears>20,6,ifelse(TotalWorkingYears>15,5,ifelse(TotalWorkingYears>10,4,ifelse(TotalWorkingYears>5,3,ifelse(TotalWorkingYears>2,2,1)))))))))
Training1_os$NumCompGroup <- with(Training1_os,ifelse(NumCompaniesWorked>4,3,ifelse(NumCompaniesWorked>2,2,1))) #Creating Number of Companies Worked
# For Testing Set
Testing$AgeGroup <- with(Testing,ifelse(Age>55,8,ifelse(Age>50,7,ifelse(Age>45,6,ifelse(Age>40,5,ifelse(Age>35,4,ifelse(Age>30,3,ifelse(Age>25,2,1)))))))) #Creating Age Groups
Testing$DistanceGroup <- with(Testing,ifelse(DistanceFromHome>25,6,ifelse(DistanceFromHome>20,5,ifelse(DistanceFromHome>15,4,ifelse(DistanceFromHome>10,3,ifelse(DistanceFromHome>5,2,1)))))) #Creating Distance Groups
Testing$YearsWithManagerGroup <- with(Testing,ifelse(YearsWithCurrManager>15,5,ifelse(YearsWithCurrManager>10,4,ifelse(YearsWithCurrManager>5,3,ifelse(YearsWithCurrManager>2,2,1))))) #Creating YearsWithManager Groups
Testing$TenureGroup <- with(Testing,ifelse(TenurePerJob>35,9,ifelse(TenurePerJob>30,8,ifelse(TenurePerJob>25,7,ifelse(TenurePerJob>20,6,ifelse(TenurePerJob>15,5,ifelse(TenurePerJob>10,4,ifelse(TenurePerJob>5,3,ifelse(TenurePerJob>2,2,1))))))))) #Creating Tenure Per Job groups
Testing$Change2Group <- with(Testing,ifelse(YearsWithoutChange2>10,3,ifelse(YearsWithoutChange2>5,2,1))) #Creating Years Without Change2
Testing$Change1Group <- with(Testing,ifelse(YearWithoutChange>2.5,3,ifelse(YearWithoutChange>-2.5,2,1))) #Creating Years Without Change 1
#Testing$AvgSatisGroup <- with(Testing,ifelse(AvgSatis<2.5,1,2)) # Creating avg satisfaction group
Testing$WorkYearGroup <- with(Testing,ifelse(TotalWorkingYears>35,9,ifelse(TotalWorkingYears>30,8,ifelse(TotalWorkingYears>25,7,ifelse(TotalWorkingYears>20,6,ifelse(TotalWorkingYears>15,5,ifelse(TotalWorkingYears>10,4,ifelse(TotalWorkingYears>5,3,ifelse(TotalWorkingYears>2,2,1)))))))))
Testing$NumCompGroup <- with(Testing,ifelse(NumCompaniesWorked>4,3,ifelse(NumCompaniesWorked>2,2,1))) #Creating Number of Companies Worked
We see lot of correlation among the following variables
library(corrplot)
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
Training_cor <- Training
for(i in 1:ncol(Training_cor)){
Training_cor[,i]<- as.integer(Training_cor[,i])
}
corrplot(cor(Training_cor))
## Warning in cor(Training_cor): the standard deviation is zero
#plot(cor.ci(Training_cor))
# Removing higly correlated Variables, Variables for which binning has been done and near Zero Variance variables
Train <- Training1_os[,c(2,3,5,7,8,11,12,14,15,16,17,18,21,23,24,26,28,29,30,31,41:48)]
Train <- Training1_os[,c(2,3,5,7,8,11,12,14,15,16,17,18,21,23,24,26,29,30,31,41:48)]
Test <- Testing[,-2]
# Coding the categorical Variables
Train$BusinessTravel <- as.integer(Train$BusinessTravel)
Train$Department <- as.integer(Train$Department)
Train$Gender <- as.integer(Train$Gender)
Train$MaritalStatus <- as.integer(Train$MaritalStatus)
Train$OverTime <- as.integer(Train$OverTime)
Train$JobRole <- as.integer(Train$JobRole)
Train$EducationField <- as.integer(Train$EducationField)
Test$BusinessTravel <- as.integer(Test$BusinessTravel)
Test$Department <- as.integer(Test$Department)
Test$Gender <- as.integer(Test$Gender)
Test$MaritalStatus <- as.integer(Test$MaritalStatus)
Test$OverTime <- as.integer(Test$OverTime)
Test$JobRole <- as.integer(Test$JobRole)
Test$EducationField <- as.integer(Test$EducationField)
Train1 <- Train
for(i in 1:ncol(Train1)){
Train1[,i] <- as.factor(Train1[,i])
}
As we are facing a classification problem, we try to use tree based methods and GLM. We aslo apply Clustering Based Algorithms like SVM and KNN.
fit_rpart <- train(Attrition ~.,Train,method = 'rpart', trControl = trainControl(method = 'cv',number = 3)) # A simple Decision Tree
## Loading required package: rpart
set.seed(123)
fit_rf <- train(Attrition ~.,Train,method = 'rf', trControl = trainControl(method = 'repeatedcv',number = 3)) # Random Forest
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:psych':
##
## outlier
## The following object is masked from 'package:gridExtra':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
xgbGrid <- expand.grid(nrounds = 300,
max_depth = 1,
eta = 0.3,
gamma = 0.01,
colsample_bytree = .7,
min_child_weight = 1,
subsample = 0.9)
set.seed(12)
fit_xgb <- train(Attrition ~.,Train,method = 'xgbTree',tuneGrid = xgbGrid,trControl = trainControl(method = 'repeatedcv',number = 3,classProbs = TRUE))
## Loading required package: xgboost
## Loading required package: plyr
fit_nn <- train(Attrition ~.,Train,method = 'pcaNNet',trControl = trainControl(method = 'repeatedcv',number = 3),tuneGrid = expand.grid(size = 25,decay = 0.01))
## Loading required package: nnet
## # weights: 652
## initial value 476.499390
## iter 10 value 238.347077
## iter 20 value 131.887112
## iter 30 value 70.531971
## iter 40 value 60.031849
## iter 50 value 52.142232
## iter 60 value 38.924575
## iter 70 value 34.596370
## iter 80 value 30.566229
## iter 90 value 25.840818
## iter 100 value 23.645236
## final value 23.645236
## stopped after 100 iterations
## # weights: 652
## initial value 705.336314
## iter 10 value 206.527939
## iter 20 value 128.781476
## iter 30 value 75.223689
## iter 40 value 52.001293
## iter 50 value 38.993577
## iter 60 value 32.894662
## iter 70 value 28.075916
## iter 80 value 23.744128
## iter 90 value 21.689197
## iter 100 value 18.680350
## final value 18.680350
## stopped after 100 iterations
## # weights: 652
## initial value 439.001946
## iter 10 value 242.604899
## iter 20 value 192.185686
## iter 30 value 161.091719
## iter 40 value 153.560900
## iter 50 value 121.710218
## iter 60 value 67.303700
## iter 70 value 53.142201
## iter 80 value 44.394973
## iter 90 value 35.426695
## iter 100 value 24.649864
## final value 24.649864
## stopped after 100 iterations
## # weights: 652
## initial value 462.836369
## iter 10 value 268.607395
## iter 20 value 207.513776
## iter 30 value 76.547816
## iter 40 value 57.854159
## iter 50 value 46.161156
## iter 60 value 43.460912
## iter 70 value 41.166526
## iter 80 value 40.233490
## iter 90 value 39.422458
## iter 100 value 37.632924
## final value 37.632924
## stopped after 100 iterations
fit_glm <- train(Attrition~.,Train,method = 'glm',trControl = trainControl(method = 'repeatedcv',number = 3))
fit_svm <- train(Attrition~.,Train,method = 'svmRadial',trControl = trainControl(method = 'repeatedcv',number = 3))
## Loading required package: kernlab
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:psych':
##
## alpha
## The following object is masked from 'package:ggplot2':
##
## alpha
fit_knn <- train(Attrition~.,Train,method = 'knn',trControl = trainControl(method = 'repeatedcv',number = 3))
fit_glmBoost <- train(Attrition~.,Train,method = 'glmboost',trControl = trainControl(method = 'repeatedcv',number = 3))
## Loading required package: mboost
## Loading required package: parallel
## Loading required package: stabs
## This is mboost 2.8-0. See 'package?mboost' and 'news(package = "mboost")'
## for a complete list of changes.
##
## Attaching package: 'mboost'
## The following object is masked from 'package:psych':
##
## %+%
## The following object is masked from 'package:ggplot2':
##
## %+%
Predictions_rpart <- predict(fit_rpart,Test)
Predictions_rf <- predict(fit_rf, Test)
Predictions_xgb <- predict(fit_xgb, Test)
Predictions_nn <- predict(fit_nn, Test)
Predictions_glm <- predict(fit_glm, Test)
Predictions_svm <- predict(fit_svm,Test)
Predictions_knn <- predict(fit_knn,Test)
Predictions_glmboost <- predict(fit_glmBoost,Test)
In order to find the best model we look at the confusion matrix comparing with the Actual Values.
Once we identify the best model, we try to tune it further to get the best results.
While looking at the confusion Matrix, if we consider only the Accuracy Values then our model is doomed to fail in Practicality.
For example, we get an accuracy of 90%,which looks like very good number, but we are able to correclty predict only 30% of the Minority Class ( which is more important) and out of our Predictions only 30 % are correct then it is a challenge.
Out of overall 1600, if we have 300 people who are quitting and our model gives accuracy of 90 percent it is not enough to say that our model is good as the majority class is itself near 90% of the toal observations. Our aim is to predict the Minority. So, if our model is able to identify only 100 people ( low specificity ) and to identify that 100 people we are predicting in all 300 people ( High Rate )- which means 200 are a wrong prediction.
In Practical Sense, if HR wants to talk to all the identified people they have to address atleast 300 employees to actually address the concerns of only 100 people which is also just 30% of overall Attrition.
After checking all Models, we find that XGBTree ( Boosted Decision Tree) works the best for us with a decent specificity rate ( > 50%) and a very low error rate ( < 30 %).
The overall accuracy is 89% which is also very good.
confusionMatrix(Predictions_xgb,Testing$Attrition)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 297 29
## Yes 11 30
##
## Accuracy : 0.891
## 95% CI : (0.8546, 0.921)
## No Information Rate : 0.8392
## P-Value [Acc > NIR] : 0.003055
##
## Kappa : 0.5393
## Mcnemar's Test P-Value : 0.007190
##
## Sensitivity : 0.9643
## Specificity : 0.5085
## Pos Pred Value : 0.9110
## Neg Pred Value : 0.7317
## Prevalence : 0.8392
## Detection Rate : 0.8093
## Detection Prevalence : 0.8883
## Balanced Accuracy : 0.7364
##
## 'Positive' Class : No
##
So, our best single model is XGBTree and further work can be done by looking at ensemble and stacking of the models which can help improve our metrics.