Among the many aspects of the Human Resources function, attrition is a painful and dreadful thing that an organization inevitably has to face. Along with the employee, valuable knowledge built over the years also walks out of the door. Whenever an employee decides to put in his/her resignation, the business takes a hit, either in the productivity of that employee until he/she finally exits the system, or in the morale of fellow colleagues: constant murmuring about what is happening with the company that makes employees leave. Am I in the right job or role? My colleagues who once vouched for this company are no longer working here.
As an HR business partner, how can I keep attrition under control? Is it possible to predict which employees are more likely to resign?
The objective of this study is to predict whether an employee is going to stay or leave. We will calculate the probability of an employee leaving/resigning.
We will use three different methods to estimate the probability of exit, and we will determine the best model based on ROC (you will learn about that later).
I am using publicly available data shared by IBM-Watson.
link.
The packages listed below will be used in our analysis:
library(ggplot2) # For visualization of data
library(plotly) # For interactive visualization of data
library(RColorBrewer) # For color palettes used in visualizations
library(e1071) # For SVM (Support Vector Machines)
library(MASS) # For assorted statistical functions (e.g. LDA)
library(ISLR) # Data sets accompanying An Introduction to Statistical Learning
library(corrgram) # For plotting correlation charts
library(corrplot) # For plotting correlation charts
library(car) # Companion to Applied Regression (e.g. vif())
library(plyr) # For data manipulation
library(dplyr) # For data manipulation
library(caTools) # For sample splitting, i.e. train and test data
library(ROCR) # For the ROC/AUC measure
library(pROC) # For the ROC/AUC measure
library(rattle) # For visualization of decision trees
library(rpart.plot) # For plotting decision trees
library(psych) # For point-biserial correlation
library(mlr) # For summarizeColumns()
library(moments) # To calculate skewness and kurtosis
#Reading HR Employee Attrition Data
HRData <- read.csv("C:/Users/guptaanu/Desktop/Analytics/IBM Watson/WA_Fn-UseC_-HR-Employee-Attrition.csv")
# To view only the first 6 rows of HRData
head(HRData)
## Age Attrition BusinessTravel DailyRate Department
## 1 41 Yes Travel_Rarely 1102 Sales
## 2 49 No Travel_Frequently 279 Research & Development
## 3 37 Yes Travel_Rarely 1373 Research & Development
## 4 33 No Travel_Frequently 1392 Research & Development
## 5 27 No Travel_Rarely 591 Research & Development
## 6 32 No Travel_Frequently 1005 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 1 2 Life Sciences 1 1
## 2 8 1 Life Sciences 1 2
## 3 2 2 Other 1 4
## 4 3 4 Life Sciences 1 5
## 5 2 1 Medical 1 7
## 6 2 2 Life Sciences 1 8
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Female 94 3 2
## 2 3 Male 61 2 2
## 3 4 Male 92 2 1
## 4 4 Female 56 3 1
## 5 1 Male 40 3 1
## 6 4 Male 79 3 1
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Sales Executive 4 Single 5993
## 2 Research Scientist 2 Married 5130
## 3 Laboratory Technician 3 Single 2090
## 4 Research Scientist 3 Married 2909
## 5 Laboratory Technician 2 Married 3468
## 6 Laboratory Technician 4 Single 3068
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 19479 8 Y Yes 11
## 2 24907 1 Y No 23
## 3 2396 6 Y Yes 15
## 4 23159 1 Y Yes 11
## 5 16632 9 Y No 12
## 6 11864 0 Y No 13
## PerformanceRating RelationshipSatisfaction StandardHours
## 1 3 1 80
## 2 4 4 80
## 3 3 2 80
## 4 3 3 80
## 5 3 4 80
## 6 3 3 80
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 1 0 8 0 1
## 2 1 10 3 3
## 3 0 7 3 3
## 4 0 8 3 3
## 5 1 6 3 3
## 6 0 8 2 2
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1 6 4 0
## 2 10 7 1
## 3 0 0 0
## 4 8 7 3
## 5 2 2 2
## 6 7 7 3
## YearsWithCurrManager
## 1 5
## 2 7
## 3 0
## 4 0
## 5 2
## 6 6
# Attaching HRData so that its columns can be referenced directly by name
attach(HRData)
# For understanding the overall structure of HRData (number and type of variables)
str(HRData)
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
#Changing the class of some variables
unique(WorkLifeBalance)
## [1] 1 3 2 4
unique(Education)
## [1] 2 1 4 3 5
unique(EducationField)
## [1] Life Sciences Other Medical Marketing
## [5] Technical Degree Human Resources
## 6 Levels: Human Resources Life Sciences Marketing Medical ... Technical Degree
Education <- as.factor(HRData$Education)
WorkLifeBalance <-as.factor(HRData$WorkLifeBalance)
PerformanceRating <-as.factor(HRData$PerformanceRating)
RelationshipSatisfaction <-as.factor(HRData$RelationshipSatisfaction)
JobSatisfaction <-as.factor(HRData$JobSatisfaction)
class(PerformanceRating)
## [1] "factor"
class(Education)
## [1] "factor"
class(RelationshipSatisfaction)
## [1] "factor"
class(JobSatisfaction)
## [1] "factor"
class(WorkLifeBalance)
## [1] "factor"
class(EducationField)
## [1] "factor"
# To identify whether there are any missing values in any variable we can use the command below; we don't strictly need it here, as the data has no missing values
summarizeColumns(HRData)
## name type na mean disp median
## 1 Age integer 0 3.692381e+01 9.1353735 36.0
## 2 Attrition factor 0 NA 0.1612245 NA
## 3 BusinessTravel factor 0 NA 0.2904762 NA
## 4 DailyRate integer 0 8.024857e+02 403.5090999 802.0
## 5 Department factor 0 NA 0.3462585 NA
## 6 DistanceFromHome integer 0 9.192517e+00 8.1068644 7.0
## 7 Education integer 0 2.912925e+00 1.0241649 3.0
## 8 EducationField factor 0 NA 0.5877551 NA
## 9 EmployeeCount integer 0 1.000000e+00 0.0000000 1.0
## 10 EmployeeNumber integer 0 1.024865e+03 602.0243348 1020.5
## 11 EnvironmentSatisfaction integer 0 2.721769e+00 1.0930822 3.0
## 12 Gender factor 0 NA 0.4000000 NA
## 13 HourlyRate integer 0 6.589116e+01 20.3294276 66.0
## 14 JobInvolvement integer 0 2.729932e+00 0.7115611 3.0
## 15 JobLevel integer 0 2.063946e+00 1.1069399 2.0
## 16 JobRole factor 0 NA 0.7782313 NA
## 17 JobSatisfaction integer 0 2.728571e+00 1.1028461 3.0
## 18 MaritalStatus factor 0 NA 0.5421769 NA
## 19 MonthlyIncome integer 0 6.502931e+03 4707.9567831 4919.0
## 20 MonthlyRate integer 0 1.431310e+04 7117.7860441 14235.5
## 21 NumCompaniesWorked integer 0 2.693197e+00 2.4980090 2.0
## 22 Over18 factor 0 NA 0.0000000 NA
## 23 OverTime factor 0 NA 0.2829932 NA
## 24 PercentSalaryHike integer 0 1.520952e+01 3.6599377 14.0
## 25 PerformanceRating integer 0 3.153741e+00 0.3608235 3.0
## 26 RelationshipSatisfaction integer 0 2.712245e+00 1.0812089 3.0
## 27 StandardHours integer 0 8.000000e+01 0.0000000 80.0
## 28 StockOptionLevel integer 0 7.938776e-01 0.8520767 1.0
## 29 TotalWorkingYears integer 0 1.127959e+01 7.7807817 10.0
## 30 TrainingTimesLastYear integer 0 2.799320e+00 1.2892706 3.0
## 31 WorkLifeBalance integer 0 2.761224e+00 0.7064758 3.0
## 32 YearsAtCompany integer 0 7.008163e+00 6.1265252 5.0
## 33 YearsInCurrentRole integer 0 4.229252e+00 3.6231370 3.0
## 34 YearsSinceLastPromotion integer 0 2.187755e+00 3.2224303 1.0
## 35 YearsWithCurrManager integer 0 4.123129e+00 3.5681361 3.0
## mad min max nlevs
## 1 8.8956 18 60 0
## 2 NA 237 1233 2
## 3 NA 150 1043 3
## 4 510.0144 102 1499 0
## 5 NA 63 961 3
## 6 7.4130 1 29 0
## 7 1.4826 1 5 0
## 8 NA 27 606 6
## 9 0.0000 1 1 0
## 10 790.9671 1 2068 0
## 11 1.4826 1 4 0
## 12 NA 588 882 2
## 13 26.6868 30 100 0
## 14 0.0000 1 4 0
## 15 1.4826 1 5 0
## 16 NA 52 326 9
## 17 1.4826 1 4 0
## 18 NA 327 673 3
## 19 3260.2374 1009 19999 0
## 20 9201.7569 2094 26999 0
## 21 1.4826 0 9 0
## 22 NA 1470 1470 1
## 23 NA 416 1054 2
## 24 2.9652 11 25 0
## 25 0.0000 3 4 0
## 26 1.4826 1 4 0
## 27 0.0000 80 80 0
## 28 1.4826 0 3 0
## 29 5.9304 0 40 0
## 30 1.4826 0 6 0
## 31 0.0000 1 4 0
## 32 4.4478 0 40 0
## 33 4.4478 0 18 0
## 34 1.4826 0 15 0
## 35 4.4478 0 17 0
dim(HRData)
## [1] 1470 35
HRData has 1470 rows and 35 columns.
Since our task is to predict employee churn, Attrition is the dependent variable (DV) and the remaining 34 variables will be treated as independent variables (IDVs) in our analysis.
Our assumption for now is that all 34 variables contribute to, or influence, the attrition/retention of employees.
Before we delve into the analysis and understanding of the variables, it is recommended to identify and remove those variables which do not add any value to the model.
If we evaluate the data more closely, we observe that the column Over18 has only one type of value, i.e. 'Factor w/ 1 level'. Since everyone is above the age of 18, this column brings no additional information to the model, so we can drop it from the analysis. The same holds for EmployeeCount, EmployeeNumber and StandardHours.
# Identifying the number of levels that the Over18, EmployeeCount and EmployeeNumber variables have
unique(Over18)
## [1] Y
## Levels: Y
unique(EmployeeCount)
## [1] 1
grep("Over18", colnames(HRData)) # grep function is used to identify the column no of the variable
## [1] 22
grep("EmployeeCount", colnames(HRData))# doesn't add any value to the analysis
## [1] 9
grep("EmployeeNumber", colnames(HRData)) # doesn't add any value to the analysis
## [1] 10
grep("StandardHours", colnames(HRData))# doesn't add any value to the analysis
## [1] 27
# Removing the columns Over18, EmployeeCount, EmployeeNumber and StandardHours from the dataset, as they carry no value for the model
HRData = HRData[,c(-22,-9,-10,-27)]
dim(HRData)
## [1] 1470 31
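As an aside, such columns can also be flagged programmatically rather than by eyeballing str(); a sketch (run on the raw data before the drop above, the first line would flag Over18, EmployeeCount and StandardHours, while the second catches pure ID columns such as EmployeeNumber):
# Columns with a single unique value (zero variance)
names(HRData)[sapply(HRData, function(x) length(unique(x)) == 1)]
# Columns where every value is unique (pure IDs)
names(HRData)[sapply(HRData, function(x) length(unique(x)) == nrow(HRData))]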
Many factors influence attrition; here we will look at a few variables individually and note any specific patterns they show with respect to attrition.
#Plotting a histogram with Age of Employees and storing it in hist
hist<-plot_ly(x=HRData$Age,type='histogram')
#Defining labels and title using layout()
layout(hist,title = "HR Data - Employee Age",xaxis = list(title = "Age"),yaxis = list(title = "Count"))
Here we observe that Age is roughly normally distributed, so we don't need any transformation (logistic models tend to behave better with approximately normally distributed predictors).
#Plotting a histogram with Monthly Income and storing it in hist
hist<-plot_ly(x=HRData$MonthlyIncome,type='histogram')
#Defining labels and title using layout()
layout(hist,title = "HR Data - Monthly Income",xaxis = list(title = " Monthly Income "),yaxis = list(title = "Count"))
Monthly Income is highly right-skewed, and this will affect the model outcome, so we will transform the monthly income variable.
We can also check the skewness and normality of the data as detailed below.
skewness(HRData$MonthlyIncome)
## [1] 1.368419
# Performing Shapiro-Wilk normality test
shapiro.test(HRData$MonthlyIncome)
##
## Shapiro-Wilk normality test
##
## data: HRData$MonthlyIncome
## W = 0.82791, p-value < 2.2e-16
Null hypothesis: the sample comes from a normal distribution, i.e. the data is normally distributed.
Alternative hypothesis: the sample does not come from a normal distribution, i.e. the data is not normally distributed.
If the p-value is LOW, the null hypothesis is rejected.
Trick to remember: if P is LOW, the Null must GO.
So here we can confidently say that the data is not normally distributed. We may have to transform the variable.
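As a quick sketch of one common remedy (not the route we take below): a log transform usually pulls in this kind of right skew, and MonthlyIncome is strictly positive, so log() is safe here:
# Log-transform the skewed income and re-check the skewness
skewness(log(HRData$MonthlyIncome)) # expect a value much closer to 0 than 1.37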
box<- plot_ly(x = ~HRData$JobLevel, y = ~HRData$MonthlyIncome, type = "box")
layout(box,title = "HR Data - Monthly Income",xaxis = list(title = " Monthly Income "),yaxis = list(title = "Count"),boxmode = "group")
Compensation Analysis:
Since salary is largely determined by a person's job level, we have to find a way to use monthly income more meaningfully, i.e. salary has to be viewed relative to the level the employee is in.
A compensation expert would suggest using the compa ratio as a relative indicator of salary at each level.
Compa ratio (as sourced from Google): a compa-ratio of 1.00, or 100%, means that the employee is paid exactly what the industry average pays and sits at the midpoint of the salary range. A ratio of 0.75 means the employee is paid 25% below the industry average and is at risk of seeking employment with competitors who are perceived to pay more.
An employee in an administrative/back-office job might be happy with a take-home of INR 30,000 given the median for that group, while a manager or business-unit head might be unhappy even with the median manager salary at that level of roughly INR 1.5 lakh. So employee satisfaction also depends on where the employee's salary sits within his/her level.
This suggests that looking at salary relative to the median of an employee’s role/level (using compa ratio) makes more sense and will add value to the overall model
To do this, we’ll first calculate the median salary for each group and then create a new salary measure that measures a salary relative to that employee’s role median.
In this case, we’ll simply get the difference between the employee’s salary and that of the median. By this new metric, salaries lower than the median of the relevant role will be negative, those higher than the median will be positive.
This will put the salaries on a relative footing.
Observe that in data science parlance, this step of creating a new predictor variable from the data available is part of feature engineering. This is a huge topic by itself but the lesson for you is this: You DO NOT need to just stick with the original raw variables; those are just the starting point.
Calculate the median salary at each job level
Here I will be using the aggregate() function. How to use aggregate() is explained at the link below.
link.
# Method 1
Median_Salary_Level <- aggregate(HRData$MonthlyIncome,list(JobLevel=HRData$JobLevel),median)
names(Median_Salary_Level)[2]<-("Median Salary Level") # renaming the column header
print(Median_Salary_Level)
## JobLevel Median Salary Level
## 1 1 2670
## 2 2 5340
## 3 3 9980
## 4 4 16154
## 5 5 19232
# Method 2 (for reference only; not used in the model; commented out)
#Median_Salary_Level <- aggregate(HRData$MonthlyIncome~HRData$JobLevel,data = HRData,median)
#names(Median_Salary_Level)[2]<-("Median Salary Level") # renaming the column header
#print(Median_Salary_Level)
Adding Median Salary at each level for every employee in the HR data frame
HRData <- merge(HRData,Median_Salary_Level,by ="JobLevel")
Calculating the compa ratio, and the difference between monthly salary and the level median, for every employee
# Salary Difference
HRData$Salary_Diff <- HRData$MonthlyIncome- HRData$`Median Salary Level`
# Compa Ratio
HRData$CompaRatio <- HRData$MonthlyIncome/HRData$`Median Salary Level`
Exploration of Salary Difference
#Plotting a histogram with Median Salary Level and storing it in hist
hist<-plot_ly(x=HRData$Salary_Diff,type='histogram')
#Defining labels and title using layout()
layout(hist,title = " Median Salary Difference",xaxis = list(title = " Monthly Income - Median Salary specific to that level "),yaxis = list(title = "Count"))
Looking at the histogram, we can see that the salary difference now appears approximately normally distributed.
We can check the normality condition again using the Shapiro-Wilk normality test.
Performing Shapiro-Wilk normality test
shapiro.test(HRData$Salary_Diff)
##
## Shapiro-Wilk normality test
##
## data: HRData$Salary_Diff
## W = 0.96357, p-value < 2.2e-16
Observe that the p-value is still < .05, so our null hypothesis is still rejected, even after the transformation!
The histogram of the salary difference looks close to normal, yet the test says otherwise.
There are certain limitations to the Shapiro-Wilk normality test. Please follow the below link to read more about the Shapiro-Wilk test link
For now, I am treating the median salary difference as approximately normally distributed.
Hypothesis Testing
Hypothesis: employees with a higher median-salary difference are more likely to stay active.
aggregate(HRData$Salary_Diff~HRData$Attrition, data = HRData, mean)
## HRData$Attrition HRData$Salary_Diff
## 1 No 52.07299
## 2 Yes -89.61603
The average median-salary difference for employees who have exited is negative, while for active employees it is positive.
Employees at the lower end of the salary band for their level are more likely to leave.
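A quick sketch to visualize the same contrast directly:
# Box plot of the salary difference split by attrition status
ggplot(data = HRData) + geom_boxplot(mapping = aes(x = Attrition, y = Salary_Diff))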
Correlation between median salary difference and Attrition
Here we have to use the point-biserial correlation coefficient because one of the variables is dichotomous. Please follow the below link to read more about biserial correlation
biserial(HRData$Salary_Diff, HRData$Attrition)
## [,1]
## [1,] -0.05998931
The point-biserial coefficient suggests a weak negative relationship, indicating that as salary increases relative to the level median, the likelihood of leaving decreases. This makes sense, since those earning more relative to their true peers in the same role are less likely to be looking elsewhere for a better deal.
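One note of caution: biserial() from the psych package computes the biserial coefficient, which assumes a normal latent variable behind the dichotomy. The point-biserial itself is simply Pearson's r with the factor coded 0/1, as in this sketch, and is usually a little smaller in magnitude:
# Point-biserial = Pearson correlation with a 0/1 coding of Attrition
cor(HRData$Salary_Diff, as.numeric(HRData$Attrition == "Yes"))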
We will use HRData$Salary_Diff in our model (the new variable is justified and adds value to the overall model).
Analyzing Monthly Rate of employees
#Plotting a histogram with Monthly Income and storing it in hist
hist<-plot_ly(x=HRData$MonthlyRate,type='histogram')
#Defining labels and title using layout()
layout(hist,title = "HR Data - Monthly Rate",xaxis = list(title = " Monthly Rate "),yaxis = list(title = "Count"))
# Analyzing Monthly Rate w.r.t level
box<- plot_ly(x = ~HRData$JobLevel, y = ~HRData$MonthlyRate, type = "box")
layout(box,title = "HR Data - Monthly Rate",xaxis = list(title = " Monthly Rate "),yaxis = list(title = "Count"),boxmode = "group")
The monthly rate variable is not giving any specific insight, so I am assuming it will not make much of a difference to the model.
p <- plot_ly(data = HRData, x = ~HRData$HourlyRate, y = ~HRData$DailyRate)
layout(p,title = "HR Data - Daily rate vs Hourly Rate ",xaxis = list(title = " Hourly Rate "),yaxis = list(title = "Daily Rate"))
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Again, there is no specific insight coming from these two variables, so I am assuming they are simply redundant.
table(Attrition) # count of employees who have attrited
## Attrition
## No Yes
## 1233 237
prop.table(table(Attrition)) # Proportion of employees who are active/exit
## Attrition
## No Yes
## 0.8387755 0.1612245
Employees who commute from far away are generally observed to be more dissatisfied and more likely to leave the organization, as they want to minimize the commute and look for organizations closer to home.
ggplot(data = HRData) + geom_bar(mapping = aes( x= DistanceFromHome, fill= Attrition),position = "fill")
ggplot(data = HRData)+geom_point(mapping = aes(x=YearsAtCompany,y=YearsSinceLastPromotion))
cor(YearsAtCompany,YearsSinceLastPromotion)
## [1] 0.6184089
ggplot(data = HRData)+geom_point(mapping = aes(x=YearsAtCompany,y=YearsWithCurrManager))
cor(YearsAtCompany,YearsWithCurrManager)
## [1] 0.7692124
If you observe the plots above, all of these tenure variables are highly correlated with one another.
We will have to use factor analysis / principal component analysis to identify a smaller set of underlying components (more on this below).
Observing the interaction of performance rating with Attrition:
unique(HRData$PerformanceRating)
## [1] 3 4
mytable <-table(HRData$Attrition,HRData$PerformanceRating)
mytable
##
## 3 4
## No 1044 189
## Yes 200 37
prop.table(mytable) # cell percentages
##
## 3 4
## No 0.71020408 0.12857143
## Yes 0.13605442 0.02517007
prop.table(mytable, 1) # row percentages
##
## 3 4
## No 0.8467153 0.1532847
## Yes 0.8438819 0.1561181
prop.table(mytable, 2) # column percentage
##
## 3 4
## No 0.8392283 0.8362832
## Yes 0.1607717 0.1637168
# Using GGplot
PerformanceRating <-as.factor(PerformanceRating)
class(PerformanceRating)
## [1] "factor"
ggplot(data = HRData) + geom_bar(mapping = aes(x = PerformanceRating,fill = Attrition),position = "fill")
ggplot(data = HRData) + geom_bar(mapping = aes(x = PerformanceRating,fill = Attrition),position = "dodge")
Attrition is a factor variable with two levels (Yes/No). Here I am going to create a new variable, Attrition_M, where M stands for modified.
Employee exited -> 1; employee active -> 0
HRData$Attrition_M<- HRData$Attrition
HRData$Attrition_M <-as.character(HRData$Attrition_M)
HRData$Attrition_M[HRData$Attrition_M=="Yes"] <-"1"
HRData$Attrition_M[HRData$Attrition_M=="No"] <-"0"
HRData$Attrition_M <-as.numeric(HRData$Attrition_M)
mean(HRData$Attrition_M)
## [1] 0.1612245
unique(HRData$Attrition_M)
## [1] 0 1
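For reference, the same recoding can be done in a single step with ifelse(), equivalent to the block above:
# One-liner: 1 if the employee exited, 0 otherwise
HRData$Attrition_M <- ifelse(HRData$Attrition == "Yes", 1, 0)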
str(HRData)
## 'data.frame': 1470 obs. of 35 variables:
## $ JobLevel : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 32 37 38 27 28 31 41 27 45 29 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 2 3 3 3 3 3 3 3 3 3 ...
## $ DailyRate : int 1005 408 362 1420 1434 1082 465 591 1457 1328 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 2 2 2 3 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 2 19 1 2 5 1 14 2 7 2 ...
## $ Education : int 2 2 1 1 4 4 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 2 3 6 4 2 4 4 2 ...
## $ EnvironmentSatisfaction : int 4 2 3 3 3 3 1 1 1 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 2 2 2 1 2 ...
## $ HourlyRate : int 79 73 43 85 50 87 56 40 83 76 ...
## $ JobInvolvement : int 3 3 3 3 3 3 3 3 3 3 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 3 7 7 9 3 7 7 3 7 7 ...
## $ JobSatisfaction : int 4 2 1 1 3 2 3 2 3 2 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 1 3 3 1 2 2 2 ...
## $ MonthlyIncome : int 3068 3022 2619 3041 3441 2501 2451 3468 4477 2703 ...
## $ MonthlyRate : int 11864 10227 14561 16346 11179 18775 4609 16632 20100 4956 ...
## $ NumCompaniesWorked : int 0 4 3 0 1 1 4 9 4 0 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 2 1 ...
## $ PercentSalaryHike : int 13 21 17 11 13 17 12 12 19 23 ...
## $ PerformanceRating : int 3 4 3 3 3 3 3 3 3 4 ...
## $ RelationshipSatisfaction: int 3 1 4 2 3 2 1 4 3 4 ...
## $ StockOptionLevel : int 0 0 0 1 0 0 1 1 1 1 ...
## $ TotalWorkingYears : int 8 8 8 5 2 1 13 6 7 6 ...
## $ TrainingTimesLastYear : int 2 1 3 3 3 4 2 3 2 3 ...
## $ WorkLifeBalance : int 2 3 2 3 2 3 3 3 2 3 ...
## $ YearsAtCompany : int 7 1 0 4 2 1 9 2 3 5 ...
## $ YearsInCurrentRole : int 7 0 0 3 2 1 8 2 2 4 ...
## $ YearsSinceLastPromotion : int 3 0 0 0 2 1 1 2 0 0 ...
## $ YearsWithCurrManager : int 6 0 0 2 2 0 8 2 2 4 ...
## $ Median Salary Level : num 2670 2670 2670 2670 2670 2670 2670 2670 2670 2670 ...
## $ Salary_Diff : num 398 352 -51 371 771 ...
## $ CompaRatio : num 1.149 1.132 0.981 1.139 1.289 ...
## $ Attrition_M : num 0 0 0 1 1 0 0 0 0 0 ...
dim(HRData)
## [1] 1470 35
So far we have done data exploration, i.e. looked at a few specific variables, individually or in pairs, and their relationship with attrition.
Here we have a data set of dimension 1470 (n) × 30 (p), where n is the number of observations and p the number of predictors. Since p = 30, there are p(p-1)/2 possible scatter plots, i.e. more than 400 plots to examine for variable relationships. Wouldn't it be tedious to perform exploratory analysis this way?
Don't worry! We have a way out. Principal Component Analysis will come to our rescue. Please read on to learn more about PCA.
Principal component analysis is a method of extracting important variables (in form of components) from a large set of variables available in a data set. It extracts low dimensional set of features from a high dimensional data set with a motive to capture as much information as possible.
Please follow below link to read more about Principal Component Analysis
# Creating a data set of numeric variables as PCA is applicable on Numeric data
nums <- sapply(HRData, is.numeric)
HRData_numeric <-HRData[ , nums]
head(HRData_numeric)
## JobLevel Age DailyRate DistanceFromHome Education
## 1 1 32 1005 2 2
## 2 1 37 408 19 2
## 3 1 38 362 1 1
## 4 1 27 1420 2 1
## 5 1 28 1434 5 4
## 6 1 31 1082 1 4
## EnvironmentSatisfaction HourlyRate JobInvolvement JobSatisfaction
## 1 4 79 3 4
## 2 2 73 3 2
## 3 3 43 3 1
## 4 3 85 3 1
## 5 3 50 3 3
## 6 3 87 3 2
## MonthlyIncome MonthlyRate NumCompaniesWorked PercentSalaryHike
## 1 3068 11864 0 13
## 2 3022 10227 4 21
## 3 2619 14561 3 17
## 4 3041 16346 0 11
## 5 3441 11179 1 13
## 6 2501 18775 1 17
## PerformanceRating RelationshipSatisfaction StockOptionLevel
## 1 3 3 0
## 2 4 1 0
## 3 3 4 0
## 4 3 2 1
## 5 3 3 0
## 6 3 2 0
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1 8 2 2 7
## 2 8 1 3 1
## 3 8 3 2 0
## 4 5 3 3 4
## 5 2 3 2 2
## 6 1 4 3 1
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1 7 3 6
## 2 0 0 0
## 3 0 0 0
## 4 3 0 2
## 5 2 2 2
## 6 1 1 0
## Median Salary Level Salary_Diff CompaRatio Attrition_M
## 1 2670 398 1.1490637 0
## 2 2670 352 1.1318352 0
## 3 2670 -51 0.9808989 0
## 4 2670 371 1.1389513 1
## 5 2670 771 1.2887640 1
## 6 2670 -169 0.9367041 0
View(HRData_numeric)
if (!require('FactoMineR')) { install.packages('FactoMineR'); library(FactoMineR) }
## Loading required package: FactoMineR
pca=PCA(HRData_numeric)
pca$eig
## eigenvalue percentage of variance
## comp 1 5.429495e+00 2.010924e+01
## comp 2 1.976901e+00 7.321855e+00
## comp 3 1.893096e+00 7.011465e+00
## comp 4 1.771847e+00 6.562396e+00
## comp 5 1.287847e+00 4.769803e+00
## comp 6 1.211759e+00 4.487996e+00
## comp 7 1.081930e+00 4.007149e+00
## comp 8 1.063586e+00 3.939207e+00
## comp 9 1.045010e+00 3.870407e+00
## comp 10 1.036292e+00 3.838120e+00
## comp 11 1.002848e+00 3.714252e+00
## comp 12 9.711744e-01 3.596942e+00
## comp 13 9.641448e-01 3.570907e+00
## comp 14 9.202070e-01 3.408174e+00
## comp 15 9.115981e-01 3.376289e+00
## comp 16 8.856361e-01 3.280134e+00
## comp 17 7.907383e-01 2.928660e+00
## comp 18 7.083341e-01 2.623460e+00
## comp 19 5.437008e-01 2.013707e+00
## comp 20 5.097173e-01 1.887842e+00
## comp 21 2.795359e-01 1.035318e+00
## comp 22 2.255444e-01 8.353498e-01
## comp 23 2.060553e-01 7.631678e-01
## comp 24 1.423288e-01 5.271439e-01
## comp 25 1.237772e-01 4.584339e-01
## comp 26 1.689718e-02 6.258216e-02
## comp 27 6.217231e-29 2.302678e-28
## cumulative percentage of variance
## comp 1 20.10924
## comp 2 27.43109
## comp 3 34.44256
## comp 4 41.00496
## comp 5 45.77476
## comp 6 50.26275
## comp 7 54.26990
## comp 8 58.20911
## comp 9 62.07952
## comp 10 65.91764
## comp 11 69.63189
## comp 12 73.22883
## comp 13 76.79974
## comp 14 80.20791
## comp 15 83.58420
## comp 16 86.86434
## comp 17 89.79300
## comp 18 92.41646
## comp 19 94.43016
## comp 20 96.31800
## comp 21 97.35332
## comp 22 98.18867
## comp 23 98.95184
## comp 24 99.47898
## comp 25 99.93742
## comp 26 100.00000
## comp 27 100.00000
#Exploring the data through correlation plots
Correlation_Matrix=as.data.frame(round(cor(HRData_numeric,pca$ind$coord)^2*100,0))
Correlation_Matrix[with(Correlation_Matrix, order(-Correlation_Matrix[,1])),]
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## JobLevel 78 1 8 1 0
## TotalWorkingYears 78 1 3 1 0
## Median Salary Level 78 1 9 1 0
## MonthlyIncome 75 10 2 1 1
## YearsAtCompany 64 7 12 2 0
## YearsInCurrentRole 45 7 21 2 0
## YearsWithCurrManager 43 8 21 3 0
## Age 37 6 9 3 1
## YearsSinceLastPromotion 35 6 10 2 1
## Attrition_M 5 1 1 0 50
## Education 3 2 1 0 1
## NumCompaniesWorked 1 17 14 3 0
## DailyRate 0 1 0 0 12
## DistanceFromHome 0 1 0 1 2
## EnvironmentSatisfaction 0 0 0 0 5
## HourlyRate 0 1 0 0 1
## JobInvolvement 0 0 0 0 20
## JobSatisfaction 0 0 0 0 6
## MonthlyRate 0 0 1 0 3
## PercentSalaryHike 0 1 8 79 0
## PerformanceRating 0 2 9 77 0
## RelationshipSatisfaction 0 0 1 0 1
## StockOptionLevel 0 0 0 0 21
## TrainingTimesLastYear 0 0 0 0 3
## WorkLifeBalance 0 0 0 0 1
## Salary_Diff 0 62 30 0 0
## CompaRatio 0 62 30 1 0
#principal component analysis
View(HRData_numeric)
prin_comp <- prcomp(HRData_numeric, scale. = T)
names(prin_comp)
## [1] "sdev" "rotation" "center" "scale" "x"
#The prcomp() function results in 5 useful measures:
##1. center and scale refer to the respective means and standard deviations of the variables used for normalization prior to implementing PCA
#outputs the mean of variables
prin_comp$center
## JobLevel Age DailyRate
## 2.063946e+00 3.692381e+01 8.024857e+02
## DistanceFromHome Education EnvironmentSatisfaction
## 9.192517e+00 2.912925e+00 2.721769e+00
## HourlyRate JobInvolvement JobSatisfaction
## 6.589116e+01 2.729932e+00 2.728571e+00
## MonthlyIncome MonthlyRate NumCompaniesWorked
## 6.502931e+03 1.431310e+04 2.693197e+00
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 1.520952e+01 3.153741e+00 2.712245e+00
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## 7.938776e-01 1.127959e+01 2.799320e+00
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 2.761224e+00 7.008163e+00 4.229252e+00
## YearsSinceLastPromotion YearsWithCurrManager Median Salary Level
## 2.187755e+00 4.123129e+00 6.473702e+03
## Salary_Diff CompaRatio Attrition_M
## 2.922925e+01 1.021795e+00 1.612245e-01
#outputs the standard deviation of variables
prin_comp$scale
## JobLevel Age DailyRate
## 1.1069399 9.1353735 403.5090999
## DistanceFromHome Education EnvironmentSatisfaction
## 8.1068644 1.0241649 1.0930822
## HourlyRate JobInvolvement JobSatisfaction
## 20.3294276 0.7115611 1.1028461
## MonthlyIncome MonthlyRate NumCompaniesWorked
## 4707.9567831 7117.7860441 2.4980090
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 3.6599377 0.3608235 1.0812089
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## 0.8520767 7.7807817 1.2892706
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 0.7064758 6.1265252 3.6231370
## YearsSinceLastPromotion YearsWithCurrManager Median Salary Level
## 3.2224303 3.5681361 4695.4308733
## Salary_Diff CompaRatio Attrition_M
## 1306.2112901 0.2465838 0.3678630
##2. The rotation component provides the principal component loadings. Each column of the rotation matrix contains one principal component loading vector. This is the most important output we are interested in.
prin_comp$rotation
## PC1 PC2 PC3
## JobLevel 0.379903632 -0.068444254 0.206924614
## Age 0.261113430 -0.168958959 0.214185692
## DailyRate -0.001490123 -0.059056172 0.010763893
## DistanceFromHome 0.001263467 0.079315139 0.008111305
## Education 0.068333788 -0.100250642 0.071504839
## EnvironmentSatisfaction 0.004658870 0.030545177 0.035006623
## HourlyRate -0.009766244 -0.064764784 -0.000549535
## JobInvolvement 0.001442457 -0.003342808 0.030662131
## JobSatisfaction -0.004785191 0.021978374 -0.012252134
## MonthlyIncome 0.370432137 -0.223352114 0.106624273
## MonthlyRate 0.007275577 -0.044932068 0.054217785
## NumCompaniesWorked 0.051046148 -0.294084939 0.275183214
## PercentSalaryHike -0.015296412 0.072833869 -0.208914719
## PerformanceRating -0.002498925 0.100985341 -0.218463765
## RelationshipSatisfaction 0.015601004 -0.025916610 0.063677290
## StockOptionLevel 0.015419551 -0.002419895 -0.023365341
## TotalWorkingYears 0.379786039 -0.086856420 0.118727911
## TrainingTimesLastYear -0.009710365 0.043700164 -0.028941726
## WorkLifeBalance 0.013632426 0.006847215 -0.014528387
## YearsAtCompany 0.343470202 0.188593576 -0.250741372
## YearsInCurrentRole 0.286701217 0.190921614 -0.333858649
## YearsSinceLastPromotion 0.254902573 0.177004045 -0.227480396
## YearsWithCurrManager 0.281042887 0.204978380 -0.329208653
## Median Salary Level 0.378444677 -0.068340945 0.217503208
## Salary_Diff -0.025250377 -0.559359672 -0.397553453
## CompaRatio -0.007249761 -0.558082060 -0.398569418
## Attrition_M -0.097569223 0.053585240 0.062224931
## PC4 PC5 PC6
## JobLevel -0.071425501 0.0374986298 -0.14128675
## Age -0.119443508 -0.0774624166 0.14038332
## DailyRate -0.035575273 -0.3012049252 0.05914881
## DistanceFromHome -0.056584621 0.1175123123 0.28265660
## Education -0.023752472 -0.0641512369 0.38969967
## EnvironmentSatisfaction 0.050535150 -0.2062798113 -0.23622768
## HourlyRate -0.004365295 -0.0839424235 0.40363230
## JobInvolvement 0.030855996 -0.3935499912 0.26160606
## JobSatisfaction -0.016687141 -0.2204889547 -0.30976723
## MonthlyIncome -0.063169067 0.0627519101 -0.15807262
## MonthlyRate -0.012339693 0.1466224594 -0.20408286
## NumCompaniesWorked -0.132516817 0.0240279385 0.24363805
## PercentSalaryHike -0.666374486 -0.0275530652 -0.02304386
## PerformanceRating -0.658374476 0.0004671452 -0.01666273
## RelationshipSatisfaction 0.042281790 -0.0701198396 0.01932180
## StockOptionLevel -0.012798069 -0.3991586579 0.19401303
## TotalWorkingYears -0.067665248 0.0170043293 0.03301825
## TrainingTimesLastYear 0.040184109 -0.1407404759 -0.21975844
## WorkLifeBalance 0.015209322 -0.0946489880 -0.20617070
## YearsAtCompany 0.114603537 0.0405574624 0.04765693
## YearsInCurrentRole 0.113317862 -0.0323892214 0.08417727
## YearsSinceLastPromotion 0.098402503 0.1068695222 0.08867197
## YearsWithCurrManager 0.126687464 -0.0134142210 0.11578133
## Median Salary Level -0.075955061 0.0490070213 -0.14407950
## Salary_Diff 0.045355985 0.0500104391 -0.05181683
## CompaRatio 0.067164029 0.0435484282 -0.01410153
## Attrition_M -0.012656810 0.6242459940 0.17600558
## PC7 PC8 PC9
## JobLevel 0.078841478 0.109381742 0.021376980
## Age -0.059196611 -0.086124081 -0.026754756
## DailyRate 0.293481278 -0.100580084 -0.157345785
## DistanceFromHome 0.243261635 -0.086817248 0.437385247
## Education -0.149294526 -0.160883101 -0.075054815
## EnvironmentSatisfaction -0.256618782 -0.443261590 0.369268542
## HourlyRate -0.029272923 0.408019944 0.176391951
## JobInvolvement -0.139756539 -0.076396968 -0.120798742
## JobSatisfaction 0.440147731 -0.248095846 -0.151867067
## MonthlyIncome 0.083724831 0.113017284 0.014176856
## MonthlyRate -0.086991044 -0.161203032 0.509229671
## NumCompaniesWorked -0.159460685 -0.187852269 -0.026008197
## PercentSalaryHike -0.049041425 -0.009532876 -0.016690687
## PerformanceRating -0.074977890 -0.006077328 -0.018452980
## RelationshipSatisfaction -0.412009013 -0.180775405 -0.290277876
## StockOptionLevel 0.236075317 0.181792716 0.392772917
## TotalWorkingYears 0.008738231 -0.014129060 -0.013245680
## TrainingTimesLastYear -0.106730537 0.516566539 -0.118334388
## WorkLifeBalance -0.487359814 0.261074843 0.230332756
## YearsAtCompany -0.009800126 -0.026203521 -0.026144572
## YearsInCurrentRole -0.035938442 -0.071242600 0.024442964
## YearsSinceLastPromotion -0.068904802 -0.093289656 -0.012147953
## YearsWithCurrManager -0.041630099 -0.061843490 -0.029101978
## Median Salary Level 0.082189840 0.112742566 0.008494278
## Salary_Diff 0.006320702 0.002071309 0.020563081
## CompaRatio 0.023707291 -0.024843651 0.019941934
## Attrition_M 0.070594885 -0.002329182 -0.047519707
## PC10 PC11 PC12
## JobLevel 0.067682032 -0.008541918 -0.054360301
## Age -0.103456679 -0.031003329 0.100583075
## DailyRate 0.341067649 0.379397086 -0.024859682
## DistanceFromHome 0.209512902 -0.396570729 -0.245565713
## Education -0.524340621 -0.233167650 0.156515968
## EnvironmentSatisfaction 0.018934122 0.401362915 0.037365351
## HourlyRate 0.275633729 0.062956335 -0.031366217
## JobInvolvement 0.268736513 -0.254628220 0.266235600
## JobSatisfaction -0.191597504 -0.417937364 -0.153935052
## MonthlyIncome 0.087500279 -0.009040978 -0.061739021
## MonthlyRate 0.219677411 -0.278346829 0.495464689
## NumCompaniesWorked -0.135485037 0.162801614 0.024274295
## PercentSalaryHike 0.016401329 -0.006036777 0.021626793
## PerformanceRating 0.020116800 0.015840488 0.010731712
## RelationshipSatisfaction 0.441223129 -0.285630511 -0.324683796
## StockOptionLevel -0.202307260 0.107328743 -0.159552476
## TotalWorkingYears -0.008719107 0.020581669 0.027383133
## TrainingTimesLastYear -0.039938249 -0.110032090 0.436974882
## WorkLifeBalance -0.207858737 -0.099787430 -0.472614700
## YearsAtCompany 0.001476838 0.010632110 0.012392629
## YearsInCurrentRole -0.027400923 0.035184241 0.015370440
## YearsSinceLastPromotion -0.008426576 0.045145414 0.021707424
## YearsWithCurrManager -0.017527466 0.013583770 0.047678709
## Median Salary Level 0.080628610 -0.001994417 -0.059250190
## Salary_Diff 0.025540637 -0.025416932 -0.009538634
## CompaRatio 0.019278000 -0.030482156 -0.021706703
## Attrition_M 0.014686316 0.117173771 -0.030636323
## PC13 PC14 PC15
## JobLevel -0.082546691 -0.015627444 -0.07960094
## Age 0.115051367 0.064647463 0.06055567
## DailyRate 0.218617346 -0.622843453 0.23619905
## DistanceFromHome 0.244670379 -0.241011829 -0.37934094
## Education 0.085008028 -0.238681902 0.07451553
## EnvironmentSatisfaction 0.076240897 0.090501935 -0.18154719
## HourlyRate -0.137815230 0.229759973 0.49088163
## JobInvolvement -0.474812321 -0.147562928 -0.29726427
## JobSatisfaction 0.021985559 0.062819931 0.36081857
## MonthlyIncome -0.086163764 -0.008626024 -0.10422406
## MonthlyRate 0.029061541 -0.080056778 0.40420806
## NumCompaniesWorked 0.199939153 0.016259095 0.10969406
## PercentSalaryHike -0.001770767 0.005188512 -0.01170265
## PerformanceRating -0.007342699 0.032319898 0.00279999
## RelationshipSatisfaction 0.412989845 0.249839414 0.06797384
## StockOptionLevel 0.236714527 0.283849192 -0.06883193
## TotalWorkingYears 0.020490186 0.010902452 0.01601755
## TrainingTimesLastYear 0.538960294 -0.116644637 -0.18395192
## WorkLifeBalance -0.134890341 -0.462753012 0.17398483
## YearsAtCompany 0.024983474 0.011635910 0.04533176
## YearsInCurrentRole 0.030610356 -0.056224049 0.07715220
## YearsSinceLastPromotion 0.116202172 0.014244203 0.08778571
## YearsWithCurrManager 0.004338385 -0.002638002 0.03091558
## Median Salary Level -0.091428409 -0.007743205 -0.08468835
## Salary_Diff 0.018098527 -0.003256183 -0.07122437
## CompaRatio -0.003891091 -0.005255410 -0.05511932
## Attrition_M 0.080967949 -0.153642949 0.01316782
## PC16 PC17 PC18
## JobLevel -0.022313981 0.097423827 0.114783137
## Age -0.034801520 -0.122916668 -0.231818727
## DailyRate 0.053156014 0.142959707 0.015685455
## DistanceFromHome -0.231653914 -0.108929900 -0.211538029
## Education -0.238934827 0.466051187 0.225661071
## EnvironmentSatisfaction -0.451863036 -0.081801910 0.266193266
## HourlyRate -0.466025640 -0.080001618 0.084509335
## JobInvolvement 0.150549190 -0.277472920 0.296985110
## JobSatisfaction -0.188113545 -0.342592566 0.222631743
## MonthlyIncome -0.032494396 0.095440020 0.156492428
## MonthlyRate 0.277934590 0.131662985 0.008548726
## NumCompaniesWorked 0.198180083 -0.525105988 -0.175765160
## PercentSalaryHike -0.013733162 0.017749196 0.034412653
## PerformanceRating -0.001106856 0.004011410 0.043368106
## RelationshipSatisfaction 0.101550863 0.200004069 0.153789063
## StockOptionLevel 0.471468358 0.085725394 0.318098734
## TotalWorkingYears -0.009491363 -0.064974085 -0.099162271
## TrainingTimesLastYear -0.176620276 -0.217648284 0.101010424
## WorkLifeBalance 0.094752839 -0.164016516 0.025026910
## YearsAtCompany -0.003547795 -0.025133010 -0.008945467
## YearsInCurrentRole 0.051367158 -0.084338609 -0.067717982
## YearsSinceLastPromotion 0.066913562 -0.157550874 0.175935603
## YearsWithCurrManager 0.029621928 -0.041987725 -0.111175592
## Median Salary Level -0.027190546 0.096920168 0.138947653
## Salary_Diff -0.019377326 -0.004405459 0.064568794
## CompaRatio -0.016864785 -0.014547985 0.015863238
## Attrition_M 0.072582728 -0.214779738 0.587940224
## PC19 PC20 PC21
## JobLevel -0.1649846217 0.015059547 1.987337e-02
## Age 0.7388296245 -0.031849034 4.760898e-02
## DailyRate 0.0358938488 -0.021241687 -3.866189e-02
## DistanceFromHome -0.0491937315 -0.045465668 -7.687354e-03
## Education -0.1423550477 -0.003840078 6.267605e-03
## EnvironmentSatisfaction 0.0348139213 0.069895657 -1.814423e-02
## HourlyRate -0.0510785241 0.003985597 2.829376e-03
## JobInvolvement 0.0464435248 0.005990439 6.733255e-03
## JobSatisfaction -0.0081084849 0.055216774 -2.402262e-02
## MonthlyIncome -0.1343734458 -0.013825961 1.133065e-02
## MonthlyRate -0.0230311696 0.032053459 -2.169203e-02
## NumCompaniesWorked -0.4843465574 0.056866467 -1.960949e-02
## PercentSalaryHike -0.0079070903 0.004516371 -1.246120e-02
## PerformanceRating -0.0146802007 0.001545554 7.755357e-05
## RelationshipSatisfaction -0.0001816872 0.082601160 1.916085e-02
## StockOptionLevel 0.0682543128 0.055229705 -2.639510e-02
## TotalWorkingYears 0.1861212005 0.075484372 -7.240561e-02
## TrainingTimesLastYear -0.0303146185 0.039847586 1.990692e-03
## WorkLifeBalance 0.1033805810 0.005564292 -5.112251e-02
## YearsAtCompany 0.0312561706 0.135669539 -7.648120e-02
## YearsInCurrentRole -0.0877109391 0.230243524 7.685922e-01
## YearsSinceLastPromotion -0.0246438463 -0.838413926 -8.092032e-02
## YearsWithCurrManager -0.0824171026 0.366268000 -6.174012e-01
## Median Salary Level -0.1397007695 -0.011129578 1.370880e-02
## Salary_Diff 0.0178615290 -0.009825259 -8.440076e-03
## CompaRatio 0.0591896367 -0.001477403 -9.113936e-03
## Attrition_M 0.2304828560 0.240506831 8.069989e-03
## PC22 PC23 PC24
## JobLevel 0.0499453318 -0.1693392783 -0.073015731
## Age 0.0767528151 -0.2920167943 -0.228839032
## DailyRate -0.0159752140 -0.0034679573 -0.021707760
## DistanceFromHome -0.0160228099 -0.0007846826 -0.011865390
## Education -0.0188521245 0.0178795652 0.003326954
## EnvironmentSatisfaction 0.0012006978 0.0061111836 0.007395061
## HourlyRate 0.0112857543 -0.0087668862 0.004759065
## JobInvolvement -0.0149416666 0.0287852510 -0.005774135
## JobSatisfaction -0.0112456541 -0.0097525200 0.014339574
## MonthlyIncome 0.0206823654 -0.1074270862 -0.070636721
## MonthlyRate -0.0032350094 0.0062726537 -0.011213104
## NumCompaniesWorked 0.0103044531 0.0146988460 -0.117853364
## PercentSalaryHike 0.6839496083 0.1725600539 0.034011126
## PerformanceRating -0.6839742563 -0.1565538148 -0.050335348
## RelationshipSatisfaction 0.0077969408 0.0010258760 0.025318394
## StockOptionLevel -0.0056362748 0.0201869398 0.010221359
## TotalWorkingYears -0.1744729150 0.5333146781 0.667172191
## TrainingTimesLastYear -0.0089731922 0.0024912062 0.014156121
## WorkLifeBalance 0.0009695364 0.0116966021 -0.001065797
## YearsAtCompany -0.0891948687 0.5629334791 -0.638400353
## YearsInCurrentRole 0.0535687670 -0.1890869692 0.156110400
## YearsSinceLastPromotion 0.0318000797 -0.0782417443 0.087449559
## YearsWithCurrManager 0.0907491329 -0.3951591079 0.165917108
## Median Salary Level 0.0317333606 -0.0909478913 -0.068184674
## Salary_Diff -0.0395266206 -0.0602678472 -0.009491732
## CompaRatio 0.0194536995 0.0776161597 0.008088583
## Attrition_M 0.0219484236 -0.0278122093 0.026364919
## PC25 PC26 PC27
## JobLevel 0.0470991161 -0.8076721831 -7.079678e-16
## Age -0.0071914044 0.0112666005 8.372591e-18
## DailyRate -0.0036996340 0.0002206090 -3.882268e-17
## DistanceFromHome -0.0109313511 0.0076715012 1.429304e-16
## Education 0.0005763606 0.0070069915 -1.002268e-16
## EnvironmentSatisfaction 0.0121247909 -0.0032576188 6.881040e-17
## HourlyRate -0.0039657476 -0.0028056375 -6.785957e-18
## JobInvolvement -0.0060892708 -0.0085831486 1.005618e-16
## JobSatisfaction -0.0033960181 0.0006845210 -7.239160e-17
## MonthlyIncome -0.0838645203 0.3987353254 -6.947690e-01
## MonthlyRate 0.0052436459 0.0086118904 5.309418e-17
## NumCompaniesWorked -0.0012600009 0.0155427148 4.931790e-17
## PercentSalaryHike -0.0332989320 -0.0006100768 -1.554176e-17
## PerformanceRating 0.0588225748 -0.0043141785 1.198809e-16
## RelationshipSatisfaction 0.0096537170 -0.0050389886 2.773348e-17
## StockOptionLevel 0.0032560362 0.0015864883 2.100705e-16
## TotalWorkingYears -0.0674172203 -0.0375832961 -6.895901e-17
## TrainingTimesLastYear 0.0326056211 0.0022958207 3.025400e-17
## WorkLifeBalance 0.0008095454 0.0039883977 1.090267e-16
## YearsAtCompany -0.0536327603 -0.0274173535 1.457420e-16
## YearsInCurrentRole 0.0026368213 0.0299388436 -1.968351e-16
## YearsSinceLastPromotion 0.0094004316 -0.0131897301 1.433796e-17
## YearsWithCurrManager 0.0181084516 0.0382395364 -3.347949e-17
## Median Salary Level 0.1048237519 0.4200057460 6.929205e-01
## Salary_Diff -0.6790809609 -0.0726369979 1.927620e-01
## CompaRatio 0.7102475865 -0.0366333340 -1.313865e-16
## Attrition_M 0.0122826643 -0.0156420523 7.571008e-17
#This returns 27 principal component loadings. Is that correct? Absolutely. In a data set, the maximum number of principal components is min(n - 1, p). Let's look at the first 4 principal components and first 5 rows.
prin_comp$rotation[1:5,1:4]
## PC1 PC2 PC3 PC4
## JobLevel 0.379903632 -0.06844425 0.206924614 -0.07142550
## Age 0.261113430 -0.16895896 0.214185692 -0.11944351
## DailyRate -0.001490123 -0.05905617 0.010763893 -0.03557527
## DistanceFromHome 0.001263467 0.07931514 0.008111305 -0.05658462
## Education 0.068333788 -0.10025064 0.071504839 -0.02375247
##3. To obtain the principal component score vectors, we don't need to multiply the loadings with the data ourselves. The matrix x already holds the principal component scores, with dimension 1470 × 27.
dim(prin_comp$x)
## [1] 1470 27
#Let's plot the resultant principal components.
biplot(prin_comp, scale = 0)
##4. The prcomp() function also provides the facility to compute standard deviation of each principal component. sdev refers to the standard deviation of principal components.
#compute standard deviation of each principal component
std_dev <- prin_comp$sdev
#compute variance
pr_var <- std_dev^2
#check variance of first 10 components
pr_var[1:10]
## [1] 5.429495 1.976901 1.893096 1.771847 1.287847 1.211759 1.081930
## [8] 1.063586 1.045010 1.036292
#We aim to find the components which explain the maximum variance, because we want to retain as much information as possible using these components. The higher the explained variance, the more information those components contain. To compute the proportion of variance explained by each component, we simply divide each variance by the total variance. This results in:
#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:20]
## [1] 0.20109240 0.07321855 0.07011465 0.06562396 0.04769803 0.04487996
## [7] 0.04007149 0.03939207 0.03870407 0.03838120 0.03714252 0.03596942
## [13] 0.03570907 0.03408174 0.03376289 0.03280134 0.02928660 0.02623460
## [19] 0.02013707 0.01887842
#This shows that the first principal component explains 20.1% of the variance, the second component 7.3%, the third component 7.0%, and so on. So, how do we decide how many components to select for the modeling stage?
#The answer to this question is provided by a scree plot. A scree plot is used to assess the components or factors which explain most of the variability in the data. It shows the values in descending order.
#scree plot
plot(prop_varex, xlab = "Principal Component",ylab = "Proportion of Variance Explained",type = "b")
#The plot above shows that ~20 components explain around 96% of the variance in the data set. In other words, using PCA we have reduced 27 predictors to 20 without compromising much on explained variance. This is the power of PCA. Let's do a confirmation check by plotting a cumulative variance plot, which will give us a clear picture of the number of components.
#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",ylab = "Cumulative Proportion of Variance Explained",type = "b")
#This plot shows that 20 components yield a cumulative explained variance close to ~96%. Therefore, in this case, we'll select 20 components [PC1 to PC20] and proceed to the modeling stage. This completes the steps to implement PCA on the training data. For modeling, we'll use these 20 components as predictor variables and follow the normal procedure.
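A sketch of what that modeling step could look like, assuming we keep PC1 to PC20 (note that prin_comp was fit on a matrix that still contains Attrition_M, so for a real model the response should be dropped before calling prcomp()):
# Assemble a modeling frame: the first 20 component scores plus the response
pca_data <- data.frame(Attrition = HRData$Attrition, prin_comp$x[, 1:20])
dim(pca_data) # 1470 rows, 21 columns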
I will be completing this PCA-based model later; for now, let's run our first basic regression model.
dim(HRData)
## [1] 1470 35
Data <- HRData[,c(-36)] # Intended to remove the last column (Attrition_M). Note: HRData has only 35 columns, so -36 removes nothing and Attrition_M stays in Data (this matters for the model below)
View(Data)
# setting the random seed for replication
set.seed(123)
Data$spl =sample.split(Data,SplitRatio = .7)
# By using sample.split() you are actually creating a vector with two values, TRUE and FALSE. By setting SplitRatio to 0.7, you are splitting the original HRData of 1470 rows into 70% training and 30% testing data.
Assigning the result to Data$spl creates a new column in the Data dataset. If you view Data again you will notice a new column added at the end; you can use View() or head() for this.
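For reference only (kept commented out so the outputs below are unchanged): sample.split() is normally given the outcome vector, which preserves the Yes/No ratio in both halves, whereas the line above passes the whole data frame:
# Reference only -- stratify the split on the outcome (not run)
# spl_strat <- sample.split(Data$Attrition, SplitRatio = 0.7)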
View(Data)
dim(Data)
## [1] 1470 36
Now, you can split the dataset to training and testing as given
train= subset(Data,Data$spl==TRUE) # spl == TRUE selects the rows that go into the training data frame
View(train) # this data frame contains all rows where Data$spl == TRUE
test=subset(Data,Data$spl==FALSE) # spl == FALSE selects the rows that go into the testing data frame
View(test) # this data frame contains all rows where Data$spl == FALSE
str(train)
## 'data.frame': 1008 obs. of 36 variables:
## $ JobLevel : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 32 37 38 31 41 45 29 32 31 19 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 2 3 3 3 3 3 3 3 2 2 ...
## $ DailyRate : int 1005 408 362 1082 465 1457 1328 634 444 602 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ DistanceFromHome : int 2 19 1 1 14 7 2 5 5 1 ...
## $ Education : int 2 2 1 4 3 3 3 4 3 1 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 2 4 2 4 2 5 3 6 ...
## $ EnvironmentSatisfaction : int 4 2 3 3 1 1 3 2 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 1 1 ...
## $ HourlyRate : int 79 73 43 87 56 83 76 35 84 100 ...
## $ JobInvolvement : int 3 3 3 3 3 3 3 4 3 1 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 3 7 7 7 7 7 7 7 9 9 ...
## $ JobSatisfaction : int 4 2 1 2 3 3 2 4 2 1 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 3 1 2 2 2 1 3 ...
## $ MonthlyIncome : int 3068 3022 2619 2501 2451 4477 2703 3312 2789 2325 ...
## $ MonthlyRate : int 11864 10227 14561 18775 4609 20100 4956 18783 3909 20989 ...
## $ NumCompaniesWorked : int 0 4 3 1 4 4 0 3 1 0 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ PercentSalaryHike : int 13 21 17 17 12 19 23 17 11 21 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 3 3 4 ...
## $ RelationshipSatisfaction: int 3 1 4 2 1 3 4 4 3 1 ...
## $ StockOptionLevel : int 0 0 0 0 1 1 1 2 1 0 ...
## $ TotalWorkingYears : int 8 8 8 1 13 7 6 6 2 1 ...
## $ TrainingTimesLastYear : int 2 1 3 4 2 2 3 3 5 5 ...
## $ WorkLifeBalance : int 2 3 2 3 3 2 3 3 2 4 ...
## $ YearsAtCompany : int 7 1 0 1 9 3 5 3 2 0 ...
## $ YearsInCurrentRole : int 7 0 0 1 8 2 4 2 2 0 ...
## $ YearsSinceLastPromotion : int 3 0 0 1 1 0 0 0 2 0 ...
## $ YearsWithCurrManager : int 6 0 0 0 8 2 4 2 2 0 ...
## $ Median Salary Level : num 2670 2670 2670 2670 2670 2670 2670 2670 2670 2670 ...
## $ Salary_Diff : num 398 352 -51 -169 -219 ...
## $ CompaRatio : num 1.149 1.132 0.981 0.937 0.918 ...
## $ Attrition_M : num 0 0 0 0 0 0 0 0 0 1 ...
## $ spl : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
Checking that both sets have a similar distribution
mean(train$MonthlyIncome)
## [1] 6471.623
mean(test$MonthlyIncome)
## [1] 6571.24
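Since the target is a class label, it is also worth checking that the Yes/No balance is similar in both halves; a quick sketch:
# Class proportions in each split
prop.table(table(train$Attrition))
prop.table(table(test$Attrition))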
glm.fit <- glm(Attrition ~., data = train, family = binomial())
## Warning: glm.fit: algorithm did not converge
summary(glm.fit)
##
## Call:
## glm(formula = Attrition ~ ., family = binomial(), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.409e-06 -2.409e-06 -2.409e-06 -2.409e-06 2.409e-06
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.657e+01 2.715e+05 0.000 1.000
## JobLevel -1.071e-10 9.044e+04 0.000 1.000
## Age -1.658e-12 1.729e+03 0.000 1.000
## BusinessTravelTravel_Frequently 2.879e-11 4.403e+04 0.000 1.000
## BusinessTravelTravel_Rarely 3.057e-11 3.650e+04 0.000 1.000
## DailyRate -1.769e-14 2.891e+01 0.000 1.000
## DepartmentResearch & Development 3.794e-10 1.578e+05 0.000 1.000
## DepartmentSales 2.353e-10 1.618e+05 0.000 1.000
## DistanceFromHome 3.975e-12 1.399e+03 0.000 1.000
## Education 9.045e-13 1.144e+04 0.000 1.000
## EducationFieldLife Sciences -5.298e-10 1.052e+05 0.000 1.000
## EducationFieldMarketing -1.813e-10 1.127e+05 0.000 1.000
## EducationFieldMedical -4.326e-10 1.054e+05 0.000 1.000
## EducationFieldOther -4.331e-10 1.151e+05 0.000 1.000
## EducationFieldTechnical Degree -5.040e-10 1.102e+05 0.000 1.000
## EnvironmentSatisfaction 1.309e-11 1.052e+04 0.000 1.000
## GenderMale -2.425e-11 2.351e+04 0.000 1.000
## HourlyRate -2.527e-12 5.673e+02 0.000 1.000
## JobInvolvement 3.000e-11 1.625e+04 0.000 1.000
## JobRoleHuman Resources 4.028e-10 1.631e+05 0.000 1.000
## JobRoleLaboratory Technician -2.574e-10 5.741e+04 0.000 1.000
## JobRoleManager 2.543e-10 1.032e+05 0.000 1.000
## JobRoleManufacturing Director 3.273e-11 5.225e+04 0.000 1.000
## JobRoleResearch Director 1.204e-10 9.280e+04 0.000 1.000
## JobRoleResearch Scientist 2.849e-11 5.690e+04 0.000 1.000
## JobRoleSales Executive -2.388e-11 1.046e+05 0.000 1.000
## JobRoleSales Representative 8.084e-10 1.163e+05 0.000 1.000
## JobSatisfaction -7.131e-12 1.038e+04 0.000 1.000
## MaritalStatusMarried 1.535e-11 3.055e+04 0.000 1.000
## MaritalStatusSingle 3.110e-12 4.222e+04 0.000 1.000
## MonthlyIncome 4.428e-14 2.513e+01 0.000 1.000
## MonthlyRate -1.685e-15 1.615e+00 0.000 1.000
## NumCompaniesWorked 4.476e-12 5.132e+03 0.000 1.000
## OverTimeYes 1.379e-10 2.610e+04 0.000 1.000
## PercentSalaryHike -6.239e-13 4.939e+03 0.000 1.000
## PerformanceRating 2.694e-11 5.050e+04 0.000 1.000
## RelationshipSatisfaction 2.860e-11 1.059e+04 0.000 1.000
## StockOptionLevel 5.057e-11 1.834e+04 0.000 1.000
## TotalWorkingYears 3.366e-13 3.229e+03 0.000 1.000
## TrainingTimesLastYear 1.535e-11 8.967e+03 0.000 1.000
## WorkLifeBalance 1.994e-10 1.623e+04 0.000 1.000
## YearsAtCompany 1.230e-12 4.093e+03 0.000 1.000
## YearsInCurrentRole -1.454e-12 5.216e+03 0.000 1.000
## YearsSinceLastPromotion -4.554e-12 4.737e+03 0.000 1.000
## YearsWithCurrManager 4.868e-12 5.652e+03 0.000 1.000
## `Median Salary Level` -3.914e-14 2.513e+01 0.000 1.000
## Salary_Diff NA NA NA NA
## CompaRatio -2.187e-10 1.159e+05 0.000 1.000
## Attrition_M 5.313e+01 3.633e+04 0.001 0.999
## splTRUE NA NA NA NA
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.7877e+02 on 1007 degrees of freedom
## Residual deviance: 5.8480e-09 on 960 degrees of freedom
## AIC: 96
##
## Number of Fisher Scoring iterations: 25
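A word of caution before interpreting this output: the “algorithm did not converge” warning, the enormous standard errors, and a residual deviance of essentially zero are the classic signature of perfect separation. It arises here because Attrition ~ . sweeps in Attrition_M (a numeric copy of the outcome) and spl (the train/test flag) as predictors, while Salary_Diff comes back NA because it is collinear with MonthlyIncome and `Median Salary Level`. A hedged sketch of a cleaner fit follows; the remainder of this walkthrough keeps the original glm.fit so the printed output still matches, but in practice you would carry the clean model forward:
# Drop the leaked outcome copy, the split flag, and the collinear column
train_clean <- subset(train, select = -c(Attrition_M, spl, Salary_Diff))
glm.fit_clean <- glm(Attrition ~ ., data = train_clean, family = binomial())
summary(glm.fit_clean)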
Interpreting the results of our logistic regression model
In the output above, the first thing we see is the call; this is R reminding us what model we ran and what options we specified.
Next we see the deviance residuals, which are a measure of model fit. This part of the output shows the distribution of the deviance residuals for the individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to assess model fit.
The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. In a healthy fit you would scan this column for predictors whose p-values fall below your chosen significance level; here every p-value is essentially 1 and every standard error is enormous, which reflects the separation problem flagged above rather than evidence that nothing matters. The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable.
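Because the coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are usually easier to communicate to HR stakeholders. A minimal sketch, run against a converged fit such as the glm.fit_clean sketched above:
# Odds ratios with Wald 95% confidence intervals (log-odds -> odds scale)
exp(cbind(OddsRatio = coef(glm.fit_clean), confint.default(glm.fit_clean)))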
Now we can run the anova() function on the model to analyze the table of deviance
anova(glm.fit, test="Chisq")
## Warning: glm.fit: algorithm did not converge
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Attrition
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1007 878.77
## JobLevel 1 35.93 1006 842.85 2.049e-09 ***
## Age 1 7.39 1005 835.46 0.0065574 **
## BusinessTravel 2 27.76 1003 807.70 9.374e-07 ***
## DailyRate 1 0.78 1002 806.92 0.3774528
## Department 2 13.25 1000 793.67 0.0013253 **
## DistanceFromHome 1 7.84 999 785.83 0.0051238 **
## Education 1 0.21 998 785.62 0.6467971
## EducationField 5 5.40 993 780.22 0.3685629
## EnvironmentSatisfaction 1 20.14 992 760.08 7.214e-06 ***
## Gender 1 2.25 991 757.83 0.1335512
## HourlyRate 1 0.53 990 757.30 0.4682762
## JobInvolvement 1 16.90 989 740.40 3.936e-05 ***
## JobRole 8 9.99 981 730.41 0.2658905
## JobSatisfaction 1 8.00 980 722.42 0.0046829 **
## MaritalStatus 2 26.72 978 695.70 1.577e-06 ***
## MonthlyIncome 1 0.22 977 695.47 0.6372306
## MonthlyRate 1 0.44 976 695.04 0.5082545
## NumCompaniesWorked 1 21.13 975 673.91 4.301e-06 ***
## OverTime 1 64.11 974 609.81 1.180e-15 ***
## PercentSalaryHike 1 1.41 973 608.40 0.2352663
## PerformanceRating 1 0.01 972 608.38 0.9082739
## RelationshipSatisfaction 1 2.60 971 605.78 0.1067035
## StockOptionLevel 1 3.87 970 601.91 0.0491960 *
## TotalWorkingYears 1 9.16 969 592.76 0.0024785 **
## TrainingTimesLastYear 1 2.07 968 590.68 0.1498382
## WorkLifeBalance 1 11.08 967 579.61 0.0008747 ***
## YearsAtCompany 1 0.73 966 578.88 0.3940867
## YearsInCurrentRole 1 4.31 965 574.57 0.0379302 *
## YearsSinceLastPromotion 1 8.65 964 565.92 0.0032653 **
## YearsWithCurrManager 1 6.07 963 559.85 0.0137515 *
## `Median Salary Level` 1 11.31 962 548.54 0.0007704 ***
## Salary_Diff 0 0.00 962 548.54
## CompaRatio 1 3.01 961 545.53 0.0826204 .
## Attrition_M 1 545.53 960 0.00 < 2.2e-16 ***
## spl 0 0.00 960 0.00
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. Analyzing the table we can see the drop in deviance when adding each variable one at a time. Adding OverTime, JobLevel, BusinessTravel and MaritalStatus significantly reduces the residual deviance, while variables such as DailyRate, Education, HourlyRate and MonthlyRate improve the model much less. A large p-value here indicates that the model without the variable explains more or less the same amount of variation. Notice also that Attrition_M alone wipes out all of the remaining deviance (545.53 on 1 df), which is exactly what we would expect from a leaked copy of the outcome and confirms the problem flagged earlier. Ultimately, what you would like to see is a significant drop in both the deviance and the AIC.
While no exact equivalent to the R2 of linear regression exists, the McFadden R2 index can be used to assess the model fit.
## library(pscl)
## pR2(glm.fit)
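If you would rather not pull in another package, McFadden’s R2 is simply one minus the ratio of the residual deviance to the null deviance, so it can be computed straight from the fitted glm object:
# McFadden's pseudo-R2: 1 - (residual deviance / null deviance)
1 - glm.fit$deviance / glm.fit$null.deviance
# Essentially 1 for this fit -- yet another symptom of the leaked outcome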
With our final model specified, it’s time to actually see the predicted output from the fitted values in the model output. The fitted values provide the predicted probability of leaving for each individual.
This individual-level data is really useful in the real world because we can then calculate probabilities of leaving for any set of arbitrarily defined groups after creating our model. For example, we can group our prediction to predict “Females in Sales”, “Managers in Accounting”, or even “low performing males under 30 in Sales”.
Individual-level predictors allow one to drill down and slice the data anyway you wish while still leveraging all of the data available across the workforce.
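As a hedged sketch of that idea (the grouping variables here are arbitrary; any columns in the data would do), dplyr can roll the individual fitted probabilities up to whatever slice of the workforce you care about:
library(dplyr)
# Attach each employee's predicted probability, then summarise by group
train %>%
  mutate(p_leave = glm.fit$fitted.values) %>%
  group_by(Department, Gender) %>%
  summarise(avg_prob_leaving = mean(p_leave), headcount = n())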
Let’s see the distribution.
hist(glm.fit$fitted.values, main = "Distribution of Predicted Probabilities", xlab = "Probability of Leaving", col = "Blue", border = F, breaks = 50)
abline(v = .5, col = "red", lwd = 3)
prop.table(table(glm.fit$fitted.values >= .5))
##
## FALSE TRUE
## 0.8422619 0.1577381
prop.table(table(glm.fit$fitted.values >= .3))
##
## FALSE TRUE
## 0.8422619 0.1577381
Our histogram (and calculations) for the training data show that roughly 16% of our employees have a predicted probability of leaving of .5 or higher, and exactly the same 16% sit above .3. Identical proportions at both cutoffs are another warning sign: the fitted probabilities are piling up at 0 and 1 instead of spreading across the range.
Now what? These data give us a feel for the output on the training data (and yes, you should ALWAYS plot them) but of course the real test is seeing how the model did with the novel test data.
Measuring Accuracy To figure out how accurate our model is at predicting turnover, we need to first take our model and use it to predict the outcomes for our test dataset. We’ll use the “predict” function and feed it our final glm.fit model that we built above. We will also set the newdata argument to our “test” dataset from our split, and set type = “response” to get the predicted probabilities for each person.
glm.fit_test<- predict(glm.fit, newdata = test, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
hist(glm.fit_test, main = "Distribution of Test Set \nPredicted Probabilities", xlab = "Probability of Leaving", col = "Blue", border = F, breaks = 50)
abline(v = .5, col = "red", lwd = 3)
prop.table(table(glm.fit_test >= .5))
##
## FALSE TRUE
## 0.8311688 0.1688312
prop.table(table(glm.fit_test>= .3))
##
## FALSE TRUE
## 0.8311688 0.1688312
Similar to our training set predictions, we see that 16.88% have a predicted probability greater than .5, and the identical 16.88% have a predicted probability greater than .3. Note again that the model has provided the individual probabilities, not a 0/1 response.
To get the 0/1 value we need to measure model accuracy, we’ll need to first select some cutoff value for predicting whether someone will leave or not. A good starting point is .5. After all, those with a probability value greater than .5 are more likely to leave than those with less than a .5 predicted probability.
Our reference point is the base rate, which for the test data is 83.11% staying and 16.88% leaving. Thus, if we had no information available, we could simply predict that everyone will stay and be correct 83.11% of the time. You could think about this as a “majority class model” in which our best guess is picking the largest class. Let’s hope our model does better.
To check accuracy, we’ll build a “confusion matrix”. This allows us to compare the observed results from our test set with our predicted results. We will set the cutoff at .5 so that anyone with a value over .5 will be categorized as leaving.
prop.table(table(test$Attrition))
##
## No Yes
## 0.8311688 0.1688312
accuracy <- table(glm.fit_test > .5, test$Attrition) # confusion matrix
addmargins(table(glm.fit_test> .5, test$Attrition))
##
## No Yes Sum
## FALSE 384 0 384
## TRUE 0 78 78
## Sum 384 78 462
The row labels (FALSE/TRUE) represent the predicted outcome (where FALSE = stay and TRUE = leave). The columns represent the real, observed outcomes.
To get the model accuracy, we will sum the correctly classified observations (the values on the diagonal) and divide by the total number of observations in the test set (462). This will give us the proportion correct.
sum(diag(accuracy))/sum(accuracy)
## [1] 1
The result is a perfect 100% v. the “no model” case of 83.11%. That should raise eyebrows rather than applause: perfect accuracy on held-out data almost always means leakage, and here it is the Attrition_M column doing the work. Refit without the leaked columns, as sketched earlier, and you should expect a realistic accuracy modestly above the base rate instead.
Remember too that we are dealing with far more variables than you would normally have in your HR data; given that, a properly specified model beating the base rate is doing quite well.
Let’s take a minute to think a bit more about that .5 cutoff.
addmargins(accuracy)
##
## No Yes Sum
## FALSE 384 0 384
## TRUE 0 78 78
## Sum 384 78 462
When we choose this cutoff, our model catches all 78 of the 78 turnover events, a True Positive Rate of 100%, and all 384 of the 384 stay events, a True Negative Rate of 100%. With a legitimate model you would instead see a tradeoff here, typically a high True Negative Rate paired with a far more modest True Positive Rate.
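Both rates are easy to compute directly from the confusion matrix built above (rows are predictions, columns are observed outcomes):
# True Positive Rate (sensitivity) and True Negative Rate (specificity)
tpr <- accuracy["TRUE", "Yes"] / sum(accuracy[, "Yes"])
tnr <- accuracy["FALSE", "No"] / sum(accuracy[, "No"])
c(TPR = tpr, TNR = tnr)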
What happens if we lower our threshold to .3 instead?
prop.table(table(test$Attrition))
##
## No Yes
## 0.8311688 0.1688312
accuracy2 <- table(glm.fit_test > .3, test$Attrition) # confusion matrix
addmargins(accuracy2)
##
## No Yes Sum
## FALSE 384 0 384
## TRUE 0 78 78
## Sum 384 78 462
sum(diag(accuracy2))/sum(accuracy2)
## [1] 1
Here, because the fitted probabilities are pinned near 0 and 1, the .3 cutoff produces exactly the same confusion matrix and the same accuracy. In a well-behaved model this is where the tradeoff would show up: loosening the cutoff catches more of the leave events (a higher True Positive Rate) at the cost of labeling more of the stayers as leavers (a lower True Negative Rate, and usually a lower overall accuracy). The world is full of tradeoffs.
To make this more concrete, think instead of a test for a disease. In that case, we might be willing to lower the cutoff to make sure we are catching those with the disease, even at the expense of more False Positive results.
We will deal with this more in the ROC section below. The take home message for now is that changing that cutoff can have a big impact on your interpretation of the model; the actions you wish to take given your results should guide your decision.
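To make the cutoff tradeoff easy to inspect for any model, here is a small hypothetical helper (the function name and arguments are mine, not part of the original analysis) that reports accuracy, TPR and TNR at a given threshold:
# Accuracy, TPR and TNR for one probability cutoff.
# probs: predicted probabilities; actual: factor with levels "No"/"Yes"
cutoff_metrics <- function(probs, actual, cutoff = 0.5) {
  pred <- factor(probs > cutoff, levels = c(FALSE, TRUE))
  cm <- table(pred, actual)
  c(cutoff   = cutoff,
    accuracy = sum(diag(cm)) / sum(cm),
    TPR      = cm["TRUE", "Yes"] / sum(cm[, "Yes"]),
    TNR      = cm["FALSE", "No"] / sum(cm[, "No"]))
}
sapply(c(.3, .5, .7), function(k) cutoff_metrics(glm.fit_test, test$Attrition, k))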
We’ve already discussed the impact of changing cutoffs on accuracy, but let’s dig into an alternative way of assessing our logistic regression and decision tree models: the ROC curve.
ROC stands for “Receiver Operating Characteristic” and was first used in WWII to aid radar detection processes. It is still used today in human cognition research, radiology, epidemiology, and predictive models of all sorts.
ROC curves help us evaluate binary classifiers by showing us how different cutoff (threshold) values impact the quality of our model. Said differently, it helps us figure out the best tradeoff between True Positive and False Positive rates.
To develop our intuitions, let’s go back to our logistic model and select five possible cutoffs: 0, .3, .5, .7, and 1. These cutoffs are just illustrative.
Let’s plot our False Positive and True Positive values at each of these sample cutoffs.
# Collector for the rates that we will plot
rates <- data.frame(fp = numeric(3), tp = numeric(3), cutoff = numeric(3))
temp_cutoffs <- c(.3, .5, .7) # the middle cutoff values to test
for (i in 1:3){
temp_res <- addmargins(table(glm.fit_test > temp_cutoffs[i], test$Attrition))
rates$fp[i] <- temp_res[2,1]/temp_res[3,1] # False Positive Rate at this cutoff
rates$tp[i] <- temp_res[2,2]/temp_res[3,2] # True Positive Rate at this cutoff
rates$cutoff[i] <- temp_cutoffs[i]
}
# plotting the calculated rates
plot(rates$fp[1:3], rates$tp[1:3], pch = 19, col = "red", cex = 1.5,
xlim = c(0,1), ylim = c(0,1), xlab = "False Positive Rate",
ylab = "True Positive Rate",
main = "False Positive and True Positive Rates \nFor Different Cutoffs")
# Lines connecting the dots
lines(c(0, rates$fp[3:1], 1), c(0, rates$tp[3:1], 1))
# Additional notes for the figure
abline(coef = c(0,1), col = "black", lwd = 2)
points(x = c(0,1), y = c(0,1),pch = 19, col = "red", cex = 1.5)
points(x = 0, y = 1, pch = 19, col = "dark green", cex = 1.5)
text(x = rates$fp, y = (rates$tp + .05), labels = rates$cutoff)
text(x = .07, y = .97, labels = "Ideal Model")
text(x = 0, y = .05, labels = 1)
text(x = 1, y = .95, labels = 0)
abline(v = 0, h = 1, lty = 2, lwd = 2)
text(.5, .46, srt = 30, labels = "No Predictive Power")
At one extreme, a cutoff value of 1 (bottom left corner) means we treat everyone as a stayer. This would give us 0 False Positives (because no one is predicted to leave) but also 0 True Positives.
At the other extreme, a cutoff of 0 (upper right corner) means we call everyone a leaver. By categorizing everyone as a leaver, we would have a 100% True Positive rate but also a 100% False Positive rate (because every stayer is predicted to leave regardless of the model output).
The middle values will give us a feel for the tradeoffs involved in changing cutoffs.
As we progress through the different cutoff values, we see that we move along a curve. Note that at each point of the curve we have a different False Positive and True Positive rate.
For example, in a typical model a cutoff of .5 might give a False Positive Rate of roughly 18% (x axis) and a True Positive Rate of 42% (y axis), while a looser cutoff of .3 might trade that for roughly a 50% False Positive Rate and an 80% True Positive Rate. (With our leaked model, all three plotted cutoffs collapse onto the ideal top-left corner.)
Ideally, we would like to see our curve shoot up sharply with small changes in the cutoff value. That is, we would like to see a huge increase in catching the True Positives while limiting the number of False Positives.
The absolute ideal (if unachievable) reference point is the green dot, which represents perfect prediction: all True Positives and no False Positives.
We want to get our ROC curves close to that dotted line shape when evaluating and developing our models. We also want our chosen cutoff value along that ROC curve to be as close to that green dot as possible.
The diagonal line stretching across the figure represents a model with no predictive power.
The better your model, the closer it gets to that ideal step-like dotted line. The worse your model, the closer it gets to the diagonal.
Now that you understand the principles of thresholds and their impact on False Positive and True Positive rate, let’s build complete ROC curves using our models.
The only difference between these complete curves and the simplified example above is that we will test every possible threshold we can using the probability values produced by our model. Those probabilities produced by our models represent the possible cutoff values.
Fortunately, some smart people came up with functions that make the calculations and plots easy.
I have created a function that uses the basic ROCR setup but adds a few other details for this primer.
We’ll start with our glm.fit model from the logistic regression along with our test dataset and the specific outcomes.
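For reference, the core ROCR recipe my function wraps looks roughly like this (a minimal sketch; the object names here are illustrative):
library(ROCR)
# prediction() pairs each predicted probability with its observed label;
# performance() then computes TPR and FPR at every distinct cutoff
pred_obj <- prediction(glm.fit_test, test$Attrition)
roc_perf <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(roc_perf, main = "ROC Curve: Logistic Regression")
abline(coef = c(0, 1), lty = 2) # the no-predictive-power diagonal
# Area under the curve (AUC)
performance(pred_obj, measure = "auc")@y.values[[1]]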