Among the many aspects of the Human Resources function, attrition is a painful and dreadful thing that an organization inevitably has to face. Along with the employee, valuable knowledge built over the years also walks out of the door. Whenever an employee decides to put in his/her resignation, the business takes a hit, either in the productivity of that employee until he/she finally exits the system, or in the morale of fellow colleagues: constant murmuring about what is happening with the company that makes employees leave. Am I in the right job or role? My colleagues who once vouched for this company are no longer working here.
As an HR business partner, how can I keep attrition under control? Is it possible to predict which employees are more likely to resign?
The objective of this study is to predict whether an employee is going to stay or leave. We will calculate the probability of an employee leaving/resigning.
We will use three different methods to estimate the probability of exit, and we will determine the best model based on ROC (you will learn about that later).
I am using publicly available data shared by IBM-Watson.
link.
The packages listed below will be used in our analysis:
library(ggplot2) # For visualization of data
library(plotly) # For interactive visualization of data
library(RColorBrewer) # For color palettes used in visualizations
library(e1071) # For SVM (Support Vector Machines)
library(MASS) # For assorted statistical functions (e.g. LDA)
library(ISLR) # Data sets accompanying An Introduction to Statistical Learning
library(corrgram) # For plotting correlation charts
library(corrplot) # For plotting correlation charts
library(car) # Companion to Applied Regression (e.g. vif())
library(plyr) # For data manipulation
library(dplyr) # For data manipulation
library(caTools) # For sample splitting, i.e. train and test data
library(ROCR) # For the ROC/AUC measure
library(pROC) # For the ROC/AUC measure
library(rattle) # For visualization of decision trees
library(rpart.plot) # For plotting decision trees
library(psych) # For point-biserial correlation
library(mlr) # For summarizeColumns()
library(moments) # To calculate skewness and kurtosis
#Reading HR Employee Attrition Data
HRData <- read.csv("C:/Users/guptaanu/Desktop/Analytics/IBM Watson/WA_Fn-UseC_-HR-Employee-Attrition.csv")
# To view only the first 6 rows of HRData
head(HRData)
## Age Attrition BusinessTravel DailyRate Department
## 1 41 Yes Travel_Rarely 1102 Sales
## 2 49 No Travel_Frequently 279 Research & Development
## 3 37 Yes Travel_Rarely 1373 Research & Development
## 4 33 No Travel_Frequently 1392 Research & Development
## 5 27 No Travel_Rarely 591 Research & Development
## 6 32 No Travel_Frequently 1005 Research & Development
## DistanceFromHome Education EducationField EmployeeCount EmployeeNumber
## 1 1 2 Life Sciences 1 1
## 2 8 1 Life Sciences 1 2
## 3 2 2 Other 1 4
## 4 3 4 Life Sciences 1 5
## 5 2 1 Medical 1 7
## 6 2 2 Life Sciences 1 8
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement JobLevel
## 1 2 Female 94 3 2
## 2 3 Male 61 2 2
## 3 4 Male 92 2 1
## 4 4 Female 56 3 1
## 5 1 Male 40 3 1
## 6 4 Male 79 3 1
## JobRole JobSatisfaction MaritalStatus MonthlyIncome
## 1 Sales Executive 4 Single 5993
## 2 Research Scientist 2 Married 5130
## 3 Laboratory Technician 3 Single 2090
## 4 Research Scientist 3 Married 2909
## 5 Laboratory Technician 2 Married 3468
## 6 Laboratory Technician 4 Single 3068
## MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike
## 1 19479 8 Y Yes 11
## 2 24907 1 Y No 23
## 3 2396 6 Y Yes 15
## 4 23159 1 Y Yes 11
## 5 16632 9 Y No 12
## 6 11864 0 Y No 13
## PerformanceRating RelationshipSatisfaction StandardHours
## 1 3 1 80
## 2 4 4 80
## 3 3 2 80
## 4 3 3 80
## 5 3 4 80
## 6 3 3 80
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
## 1 0 8 0 1
## 2 1 10 3 3
## 3 0 7 3 3
## 4 0 8 3 3
## 5 1 6 3 3
## 6 0 8 2 2
## YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
## 1 6 4 0
## 2 10 7 1
## 3 0 0 0
## 4 8 7 3
## 5 2 2 2
## 6 7 7 3
## YearsWithCurrManager
## 1 5
## 2 7
## 3 0
## 4 0
## 5 2
## 6 6
# Attaching HRData so that its columns can be referenced directly by name
attach(HRData)
# For understanding the overall structure of HRData (number and type of variables)
str(HRData)
## 'data.frame': 1470 obs. of 35 variables:
## $ Age : int 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
## $ DailyRate : int 1102 279 1373 1392 591 1005 1324 1358 216 1299 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : int 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
## $ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : int 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : int 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
## $ HourlyRate : int 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : int 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : int 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
## $ JobSatisfaction : int 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
## $ MonthlyIncome : int 5993 5130 2090 2909 3468 3068 2670 2693 9526 5237 ...
## $ MonthlyRate : int 19479 24907 2396 23159 16632 11864 9964 13335 8787 16577 ...
## $ NumCompaniesWorked : int 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : Factor w/ 1 level "Y": 1 1 1 1 1 1 1 1 1 1 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
## $ PercentSalaryHike : int 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: int 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : int 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : int 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : int 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : int 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : int 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : int 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : int 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : int 5 7 0 0 2 6 0 0 8 7 ...
#Changing the class of some variables
unique(WorkLifeBalance)
## [1] 1 3 2 4
unique(Education)
## [1] 2 1 4 3 5
unique(EducationField)
## [1] Life Sciences Other Medical Marketing
## [5] Technical Degree Human Resources
## 6 Levels: Human Resources Life Sciences Marketing Medical ... Technical Degree
Education <- as.factor(HRData$Education)
WorkLifeBalance <-as.factor(HRData$WorkLifeBalance)
PerformanceRating <-as.factor(HRData$PerformanceRating)
RelationshipSatisfaction <-as.factor(HRData$RelationshipSatisfaction)
JobSatisfaction <-as.factor(HRData$JobSatisfaction)
class(PerformanceRating)
## [1] "factor"
class(Education)
## [1] "factor"
class(RelationshipSatisfaction)
## [1] "factor"
class(JobSatisfaction)
## [1] "factor"
class(WorkLifeBalance)
## [1] "factor"
class(EducationField)
## [1] "factor"
# To identify whether there are any missing values in any variable we can use the command below; we don't strictly need it here, as the data has no missing values
summarizeColumns(HRData)
## name type na mean disp median
## 1 Age integer 0 3.692381e+01 9.1353735 36.0
## 2 Attrition factor 0 NA 0.1612245 NA
## 3 BusinessTravel factor 0 NA 0.2904762 NA
## 4 DailyRate integer 0 8.024857e+02 403.5090999 802.0
## 5 Department factor 0 NA 0.3462585 NA
## 6 DistanceFromHome integer 0 9.192517e+00 8.1068644 7.0
## 7 Education integer 0 2.912925e+00 1.0241649 3.0
## 8 EducationField factor 0 NA 0.5877551 NA
## 9 EmployeeCount integer 0 1.000000e+00 0.0000000 1.0
## 10 EmployeeNumber integer 0 1.024865e+03 602.0243348 1020.5
## 11 EnvironmentSatisfaction integer 0 2.721769e+00 1.0930822 3.0
## 12 Gender factor 0 NA 0.4000000 NA
## 13 HourlyRate integer 0 6.589116e+01 20.3294276 66.0
## 14 JobInvolvement integer 0 2.729932e+00 0.7115611 3.0
## 15 JobLevel integer 0 2.063946e+00 1.1069399 2.0
## 16 JobRole factor 0 NA 0.7782313 NA
## 17 JobSatisfaction integer 0 2.728571e+00 1.1028461 3.0
## 18 MaritalStatus factor 0 NA 0.5421769 NA
## 19 MonthlyIncome integer 0 6.502931e+03 4707.9567831 4919.0
## 20 MonthlyRate integer 0 1.431310e+04 7117.7860441 14235.5
## 21 NumCompaniesWorked integer 0 2.693197e+00 2.4980090 2.0
## 22 Over18 factor 0 NA 0.0000000 NA
## 23 OverTime factor 0 NA 0.2829932 NA
## 24 PercentSalaryHike integer 0 1.520952e+01 3.6599377 14.0
## 25 PerformanceRating integer 0 3.153741e+00 0.3608235 3.0
## 26 RelationshipSatisfaction integer 0 2.712245e+00 1.0812089 3.0
## 27 StandardHours integer 0 8.000000e+01 0.0000000 80.0
## 28 StockOptionLevel integer 0 7.938776e-01 0.8520767 1.0
## 29 TotalWorkingYears integer 0 1.127959e+01 7.7807817 10.0
## 30 TrainingTimesLastYear integer 0 2.799320e+00 1.2892706 3.0
## 31 WorkLifeBalance integer 0 2.761224e+00 0.7064758 3.0
## 32 YearsAtCompany integer 0 7.008163e+00 6.1265252 5.0
## 33 YearsInCurrentRole integer 0 4.229252e+00 3.6231370 3.0
## 34 YearsSinceLastPromotion integer 0 2.187755e+00 3.2224303 1.0
## 35 YearsWithCurrManager integer 0 4.123129e+00 3.5681361 3.0
## mad min max nlevs
## 1 8.8956 18 60 0
## 2 NA 237 1233 2
## 3 NA 150 1043 3
## 4 510.0144 102 1499 0
## 5 NA 63 961 3
## 6 7.4130 1 29 0
## 7 1.4826 1 5 0
## 8 NA 27 606 6
## 9 0.0000 1 1 0
## 10 790.9671 1 2068 0
## 11 1.4826 1 4 0
## 12 NA 588 882 2
## 13 26.6868 30 100 0
## 14 0.0000 1 4 0
## 15 1.4826 1 5 0
## 16 NA 52 326 9
## 17 1.4826 1 4 0
## 18 NA 327 673 3
## 19 3260.2374 1009 19999 0
## 20 9201.7569 2094 26999 0
## 21 1.4826 0 9 0
## 22 NA 1470 1470 1
## 23 NA 416 1054 2
## 24 2.9652 11 25 0
## 25 0.0000 3 4 0
## 26 1.4826 1 4 0
## 27 0.0000 80 80 0
## 28 1.4826 0 3 0
## 29 5.9304 0 40 0
## 30 1.4826 0 6 0
## 31 0.0000 1 4 0
## 32 4.4478 0 40 0
## 33 4.4478 0 18 0
## 34 1.4826 0 15 0
## 35 4.4478 0 17 0
dim(HRData)
## [1] 1470 35
HRData has 1470 rows and 35 columns.
Since our task is to predict employee churn, Attrition is the dependent variable (DV) and the remaining 34 variables will be treated as independent variables (IDVs) in our analysis.
Our assumption for now is that all 34 variables contribute to, or influence, the attrition/retention of employees.
Before we delve into the analysis and understanding of the variables, it is recommended to identify and remove those variables which do not add any value to the model.
If we evaluate the data more closely, we observe that the column Over18 has only one type of value, i.e. 'Factor w/ 1 level'. Since everyone is above the age of 18, this column brings no additional information to the model, so we can drop it from the analysis. The same holds for EmployeeCount, EmployeeNumber and StandardHours.
# Identifying the number of levels that the Over18, EmployeeCount and EmployeeNumber variables have
unique(Over18)
## [1] Y
## Levels: Y
unique(EmployeeCount)
## [1] 1
grep("Over18", colnames(HRData)) # grep function is used to identify the column no of the variable
## [1] 22
grep("EmployeeCount", colnames(HRData))# doesn't add any value to the analysis
## [1] 9
grep("EmployeeNumber", colnames(HRData)) # doesn't add any value to the analysis
## [1] 10
grep("StandardHours", colnames(HRData))# doesn't add any value to the analysis
## [1] 27
# Removing the columns Over18, EmployeeCount, EmployeeNumber and StandardHours from the dataset, as they carry no value for the model
HRData = HRData[,c(-22,-9,-10,-27)]
dim(HRData)
## [1] 1470 31
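As an aside, such columns can also be flagged programmatically rather than by eyeballing str(); a sketch (run on the raw data before the drop above, the first line would flag Over18, EmployeeCount and StandardHours, while the second catches pure ID columns such as EmployeeNumber):
# Columns with a single unique value (zero variance)
names(HRData)[sapply(HRData, function(x) length(unique(x)) == 1)]
# Columns where every value is unique (pure IDs)
names(HRData)[sapply(HRData, function(x) length(unique(x)) == nrow(HRData))]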
Many factors influence attrition; here we will look at a few variables individually and note any specific patterns they show with respect to attrition.
#Plotting a histogram with Age of Employees and storing it in hist
hist<-plot_ly(x=HRData$Age,type='histogram')
#Defining labels and title using layout()
layout(hist,title = "HR Data - Employee Age",xaxis = list(title = "Age"),yaxis = list(title = "Count"))
Here we observe that Age is roughly normally distributed, so we don't need any transformation (logistic models tend to behave better with approximately normally distributed predictors).
#Plotting a histogram with Monthly Income and storing it in hist
hist<-plot_ly(x=HRData$MonthlyIncome,type='histogram')
#Defining labels and title using layout()
layout(hist,title = "HR Data - Monthly Income",xaxis = list(title = " Monthly Income "),yaxis = list(title = "Count"))
Monthly Income is highly right-skewed, and this will affect the model outcome, so we will transform the monthly income variable.
We can also check the skewness and normality of the data as detailed below.
skewness(HRData$MonthlyIncome)
## [1] 1.368419
# Performing Shapiro-Wilk normality test
shapiro.test(HRData$MonthlyIncome)
##
## Shapiro-Wilk normality test
##
## data: HRData$MonthlyIncome
## W = 0.82791, p-value < 2.2e-16
Null hypothesis: the sample comes from a normal distribution, i.e. the data is normally distributed.
Alternative hypothesis: the sample does not come from a normal distribution, i.e. the data is not normally distributed.
If the p-value is LOW, the null hypothesis is rejected.
Trick to remember: if P is LOW, the Null must GO.
So here we can confidently say that the data is not normally distributed. We may have to transform the variable.
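As a quick sketch of one common remedy (not the route we take below): a log transform usually pulls in this kind of right skew, and MonthlyIncome is strictly positive, so log() is safe here:
# Log-transform the skewed income and re-check the skewness
skewness(log(HRData$MonthlyIncome)) # expect a value much closer to 0 than 1.37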
box<- plot_ly(x = ~HRData$JobLevel, y = ~HRData$MonthlyIncome, type = "box")
layout(box,title = "HR Data - Monthly Income",xaxis = list(title = " Monthly Income "),yaxis = list(title = "Count"),boxmode = "group")
Compensation Analysis:
Since salary is largely determined by a person's job level, we have to find a way to use monthly income more meaningfully, i.e. salary has to be viewed relative to the level the employee is in.
A compensation expert would suggest using the compa ratio as a relative indicator of salary at each level.
Compa ratio (as sourced from Google): a compa-ratio of 1.00, or 100%, means that the employee is paid exactly what the industry average pays and sits at the midpoint of the salary range. A ratio of 0.75 means the employee is paid 25% below the industry average and is at risk of seeking employment with competitors who are perceived to pay more.
An employee in an administrative/back-office job might be happy with a take-home of INR 30,000 given the median for that group, while a manager or business-unit head might be unhappy even with the median manager salary at that level of roughly INR 1.5 lakh. So employee satisfaction also depends on where the employee's salary sits within his/her level.
This suggests that looking at salary relative to the median of an employee’s role/level (using compa ratio) makes more sense and will add value to the overall model
To do this, we’ll first calculate the median salary for each group and then create a new salary measure that measures a salary relative to that employee’s role median.
In this case, we’ll simply get the difference between the employee’s salary and that of the median. By this new metric, salaries lower than the median of the relevant role will be negative, those higher than the median will be positive.
This will put the salaries on a relative footing.
Observe that in data science parlance, this step of creating a new predictor variable from the data available is part of feature engineering. This is a huge topic by itself but the lesson for you is this: You DO NOT need to just stick with the original raw variables; those are just the starting point.
Calculate the median salary at each job level
Here I will be using the aggregate() function. How to use aggregate() is explained at the link below.
link.
# Method 1
Median_Salary_Level <- aggregate(HRData$MonthlyIncome,list(JobLevel=HRData$JobLevel),median)
names(Median_Salary_Level)[2]<-("Median Salary Level") # renaming the column header
print(Median_Salary_Level)
## JobLevel Median Salary Level
## 1 1 2670
## 2 2 5340
## 3 3 9980
## 4 4 16154
## 5 5 19232
# Method 2 (for reference only; not used in the model; commented out)
#Median_Salary_Level <- aggregate(HRData$MonthlyIncome~HRData$JobLevel,data = HRData,median)
#names(Median_Salary_Level)[2]<-("Median Salary Level") # renaming the column header
#print(Median_Salary_Level)
Adding Median Salary at each level for every employee in the HR data frame
HRData <- merge(HRData,Median_Salary_Level,by ="JobLevel")
Calculating the compa ratio, and the difference between monthly salary and the level median, for every employee
# Salary Difference
HRData$Salary_Diff <- HRData$MonthlyIncome- HRData$`Median Salary Level`
# Compa Ratio
HRData$CompaRatio <- HRData$MonthlyIncome/HRData$`Median Salary Level`
Exploration of Salary Difference
#Plotting a histogram with Median Salary Level and storing it in hist
hist<-plot_ly(x=HRData$Salary_Diff,type='histogram')
#Defining labels and title using layout()
layout(hist,title = " Median Salary Difference",xaxis = list(title = " Monthly Income - Median Salary specific to that level "),yaxis = list(title = "Count"))
Looking at the histogram, we can see that the salary difference now appears approximately normally distributed.
We can check the normality condition again using the Shapiro-Wilk normality test.
Performing Shapiro-Wilk normality test
shapiro.test(HRData$Salary_Diff)
##
## Shapiro-Wilk normality test
##
## data: HRData$Salary_Diff
## W = 0.96357, p-value < 2.2e-16
Observe that the p-value is still < .05, so our null hypothesis is still rejected, even after the transformation!
The histogram of the salary difference looks close to normal, yet the test says otherwise.
There are certain limitations to the Shapiro-Wilk normality test. Please follow the below link to read more about the Shapiro-Wilk test link
For now, I am treating the median salary difference as approximately normally distributed.
Hypothesis Testing
Hypothesis: employees with a higher median-salary difference are more likely to stay active.
aggregate(HRData$Salary_Diff~HRData$Attrition, data = HRData, mean)
## HRData$Attrition HRData$Salary_Diff
## 1 No 52.07299
## 2 Yes -89.61603
The average median-salary difference for employees who have exited is negative, while for active employees it is positive.
Employees at the lower end of the salary band for their level are more likely to leave.
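A quick sketch to visualize the same contrast directly:
# Box plot of the salary difference split by attrition status
ggplot(data = HRData) + geom_boxplot(mapping = aes(x = Attrition, y = Salary_Diff))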
Correlation between median salary difference and Attrition
Here we have to use the point-biserial correlation coefficient because one of the variables is dichotomous. Please follow the below link to read more about biserial correlation
biserial(HRData$Salary_Diff, HRData$Attrition)
## [,1]
## [1,] -0.05998931
The point-biserial coefficient suggests a weak negative relationship, indicating that as salary increases relative to the level median, the likelihood of leaving decreases. This makes sense, since those earning more relative to their true peers in the same role are less likely to be looking elsewhere for a better deal.
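One note of caution: biserial() from the psych package computes the biserial coefficient, which assumes a normal latent variable behind the dichotomy. The point-biserial itself is simply Pearson's r with the factor coded 0/1, as in this sketch, and is usually a little smaller in magnitude:
# Point-biserial = Pearson correlation with a 0/1 coding of Attrition
cor(HRData$Salary_Diff, as.numeric(HRData$Attrition == "Yes"))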
We will use HRData$Salary_Diff in our model (the new variable is justified and adds value to the overall model).
Analyzing Monthly Rate of employees
#Plotting a histogram with Monthly Income and storing it in hist
hist<-plot_ly(x=HRData$MonthlyRate,type='histogram')
#Defining labels and title using layout()
layout(hist,title = "HR Data - Monthly Rate",xaxis = list(title = " Monthly Rate "),yaxis = list(title = "Count"))
# Analyzing Monthly Rate w.r.t level
box<- plot_ly(x = ~HRData$JobLevel, y = ~HRData$MonthlyRate, type = "box")
layout(box,title = "HR Data - Monthly Rate",xaxis = list(title = " Monthly Rate "),yaxis = list(title = "Count"),boxmode = "group")
The monthly rate variable is not giving any specific insight, so I am assuming it will not make much of a difference to the model.
p <- plot_ly(data = HRData, x = ~HRData$HourlyRate, y = ~HRData$DailyRate)
layout(p,title = "HR Data - Daily rate vs Hourly Rate ",xaxis = list(title = " Hourly Rate "),yaxis = list(title = "Daily Rate"))
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plot.ly/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plot.ly/r/reference/#scatter-mode
Again, there is no specific insight coming from these two variables, so I am assuming they are simply redundant.
table(Attrition) # count of employees who have attrited
## Attrition
## No Yes
## 1233 237
prop.table(table(Attrition)) # Proportion of employees who are active/exit
## Attrition
## No Yes
## 0.8387755 0.1612245
Employees who commute from far away are generally observed to be more dissatisfied and more likely to leave the organization, as they want to minimize the commute and look for organizations closer to home.
ggplot(data = HRData) + geom_bar(mapping = aes( x= DistanceFromHome, fill= Attrition),position = "fill")
ggplot(data = HRData)+geom_point(mapping = aes(x=YearsAtCompany,y=YearsSinceLastPromotion))
cor(YearsAtCompany,YearsSinceLastPromotion)
## [1] 0.6184089
ggplot(data = HRData)+geom_point(mapping = aes(x=YearsAtCompany,y=YearsWithCurrManager))
cor(YearsAtCompany,YearsWithCurrManager)
## [1] 0.7692124
If you observe the plots above, all of these tenure variables are highly correlated with one another.
We will have to use factor analysis / principal component analysis to identify a smaller set of underlying components (more on this below).
Observing the interaction of performance rating with Attrition:
unique(HRData$PerformanceRating)
## [1] 3 4
mytable <-table(HRData$Attrition,HRData$PerformanceRating)
mytable
##
## 3 4
## No 1044 189
## Yes 200 37
prop.table(mytable) # cell percentages
##
## 3 4
## No 0.71020408 0.12857143
## Yes 0.13605442 0.02517007
prop.table(mytable, 1) # row percentages
##
## 3 4
## No 0.8467153 0.1532847
## Yes 0.8438819 0.1561181
prop.table(mytable, 2) # column percentage
##
## 3 4
## No 0.8392283 0.8362832
## Yes 0.1607717 0.1637168
# Using GGplot
PerformanceRating <-as.factor(PerformanceRating)
class(PerformanceRating)
## [1] "factor"
ggplot(data = HRData) + geom_bar(mapping = aes(x = PerformanceRating,fill = Attrition),position = "fill")
ggplot(data = HRData) + geom_bar(mapping = aes(x = PerformanceRating,fill = Attrition),position = "dodge")
Attrition is a factor variable with two levels (Yes/No). Here I am going to create a new variable, Attrition_M, where M stands for modified.
Employee exited -> 1; employee active -> 0
HRData$Attrition_M<- HRData$Attrition
HRData$Attrition_M <-as.character(HRData$Attrition_M)
HRData$Attrition_M[HRData$Attrition_M=="Yes"] <-"1"
HRData$Attrition_M[HRData$Attrition_M=="No"] <-"0"
HRData$Attrition_M <-as.numeric(HRData$Attrition_M)
mean(HRData$Attrition_M)
## [1] 0.1612245
unique(HRData$Attrition_M)
## [1] 0 1
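For reference, the same recoding can be done in a single step with ifelse(), equivalent to the block above:
# One-liner: 1 if the employee exited, 0 otherwise
HRData$Attrition_M <- ifelse(HRData$Attrition == "Yes", 1, 0)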
str(HRData)
## 'data.frame': 1470 obs. of 35 variables:
## $ JobLevel : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 32 37 38 27 28 31 41 27 45 29 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 2 2 1 1 1 1 1 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 2 3 3 3 3 3 3 3 3 3 ...
## $ DailyRate : int 1005 408 362 1420 1434 1082 465 591 1457 1328 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 2 2 2 3 2 2 2 2 2 2 ...
## $ DistanceFromHome : int 2 19 1 2 5 1 14 2 7 2 ...
## $ Education : int 2 2 1 1 4 4 3 1 3 3 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 2 3 6 4 2 4 4 2 ...
## $ EnvironmentSatisfaction : int 4 2 3 3 3 3 1 1 1 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 2 2 2 1 2 ...
## $ HourlyRate : int 79 73 43 85 50 87 56 40 83 76 ...
## $ JobInvolvement : int 3 3 3 3 3 3 3 3 3 3 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 3 7 7 9 3 7 7 3 7 7 ...
## $ JobSatisfaction : int 4 2 1 1 3 2 3 2 3 2 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 1 3 3 1 2 2 2 ...
## $ MonthlyIncome : int 3068 3022 2619 3041 3441 2501 2451 3468 4477 2703 ...
## $ MonthlyRate : int 11864 10227 14561 16346 11179 18775 4609 16632 20100 4956 ...
## $ NumCompaniesWorked : int 0 4 3 0 1 1 4 9 4 0 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 2 1 1 1 2 1 ...
## $ PercentSalaryHike : int 13 21 17 11 13 17 12 12 19 23 ...
## $ PerformanceRating : int 3 4 3 3 3 3 3 3 3 4 ...
## $ RelationshipSatisfaction: int 3 1 4 2 3 2 1 4 3 4 ...
## $ StockOptionLevel : int 0 0 0 1 0 0 1 1 1 1 ...
## $ TotalWorkingYears : int 8 8 8 5 2 1 13 6 7 6 ...
## $ TrainingTimesLastYear : int 2 1 3 3 3 4 2 3 2 3 ...
## $ WorkLifeBalance : int 2 3 2 3 2 3 3 3 2 3 ...
## $ YearsAtCompany : int 7 1 0 4 2 1 9 2 3 5 ...
## $ YearsInCurrentRole : int 7 0 0 3 2 1 8 2 2 4 ...
## $ YearsSinceLastPromotion : int 3 0 0 0 2 1 1 2 0 0 ...
## $ YearsWithCurrManager : int 6 0 0 2 2 0 8 2 2 4 ...
## $ Median Salary Level : num 2670 2670 2670 2670 2670 2670 2670 2670 2670 2670 ...
## $ Salary_Diff : num 398 352 -51 371 771 ...
## $ CompaRatio : num 1.149 1.132 0.981 1.139 1.289 ...
## $ Attrition_M : num 0 0 0 1 1 0 0 0 0 0 ...
dim(HRData)
## [1] 1470 35
So far we have done data exploration, i.e. looked at a few specific variables, individually or in pairs, and their relationship with attrition.
Here we have a data set of dimension 1470 (n) × 30 (p), where n is the number of observations and p the number of predictors. Since p = 30, there are p(p-1)/2 possible scatter plots, i.e. more than 400 plots to examine for variable relationships. Wouldn't it be tedious to perform exploratory analysis this way?
Don't worry! We have a way out. Principal Component Analysis will come to our rescue. Please read on to learn more about PCA.
Principal component analysis is a method of extracting important variables (in form of components) from a large set of variables available in a data set. It extracts low dimensional set of features from a high dimensional data set with a motive to capture as much information as possible.
Please follow below link to read more about Principal Component Analysis
# Creating a data set of numeric variables as PCA is applicable on Numeric data
nums <- sapply(HRData, is.numeric)
HRData_numeric <-HRData[ , nums]
head(HRData_numeric)
## JobLevel Age DailyRate DistanceFromHome Education
## 1 1 32 1005 2 2
## 2 1 37 408 19 2
## 3 1 38 362 1 1
## 4 1 27 1420 2 1
## 5 1 28 1434 5 4
## 6 1 31 1082 1 4
## EnvironmentSatisfaction HourlyRate JobInvolvement JobSatisfaction
## 1 4 79 3 4
## 2 2 73 3 2
## 3 3 43 3 1
## 4 3 85 3 1
## 5 3 50 3 3
## 6 3 87 3 2
## MonthlyIncome MonthlyRate NumCompaniesWorked PercentSalaryHike
## 1 3068 11864 0 13
## 2 3022 10227 4 21
## 3 2619 14561 3 17
## 4 3041 16346 0 11
## 5 3441 11179 1 13
## 6 2501 18775 1 17
## PerformanceRating RelationshipSatisfaction StockOptionLevel
## 1 3 3 0
## 2 4 1 0
## 3 3 4 0
## 4 3 2 1
## 5 3 3 0
## 6 3 2 0
## TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany
## 1 8 2 2 7
## 2 8 1 3 1
## 3 8 3 2 0
## 4 5 3 3 4
## 5 2 3 2 2
## 6 1 4 3 1
## YearsInCurrentRole YearsSinceLastPromotion YearsWithCurrManager
## 1 7 3 6
## 2 0 0 0
## 3 0 0 0
## 4 3 0 2
## 5 2 2 2
## 6 1 1 0
## Median Salary Level Salary_Diff CompaRatio Attrition_M
## 1 2670 398 1.1490637 0
## 2 2670 352 1.1318352 0
## 3 2670 -51 0.9808989 0
## 4 2670 371 1.1389513 1
## 5 2670 771 1.2887640 1
## 6 2670 -169 0.9367041 0
View(HRData_numeric)
if (!require('FactoMineR')) { install.packages('FactoMineR'); library(FactoMineR) }
## Loading required package: FactoMineR
pca=PCA(HRData_numeric)
pca$eig
## eigenvalue percentage of variance
## comp 1 5.429495e+00 2.010924e+01
## comp 2 1.976901e+00 7.321855e+00
## comp 3 1.893096e+00 7.011465e+00
## comp 4 1.771847e+00 6.562396e+00
## comp 5 1.287847e+00 4.769803e+00
## comp 6 1.211759e+00 4.487996e+00
## comp 7 1.081930e+00 4.007149e+00
## comp 8 1.063586e+00 3.939207e+00
## comp 9 1.045010e+00 3.870407e+00
## comp 10 1.036292e+00 3.838120e+00
## comp 11 1.002848e+00 3.714252e+00
## comp 12 9.711744e-01 3.596942e+00
## comp 13 9.641448e-01 3.570907e+00
## comp 14 9.202070e-01 3.408174e+00
## comp 15 9.115981e-01 3.376289e+00
## comp 16 8.856361e-01 3.280134e+00
## comp 17 7.907383e-01 2.928660e+00
## comp 18 7.083341e-01 2.623460e+00
## comp 19 5.437008e-01 2.013707e+00
## comp 20 5.097173e-01 1.887842e+00
## comp 21 2.795359e-01 1.035318e+00
## comp 22 2.255444e-01 8.353498e-01
## comp 23 2.060553e-01 7.631678e-01
## comp 24 1.423288e-01 5.271439e-01
## comp 25 1.237772e-01 4.584339e-01
## comp 26 1.689718e-02 6.258216e-02
## comp 27 6.217231e-29 2.302678e-28
## cumulative percentage of variance
## comp 1 20.10924
## comp 2 27.43109
## comp 3 34.44256
## comp 4 41.00496
## comp 5 45.77476
## comp 6 50.26275
## comp 7 54.26990
## comp 8 58.20911
## comp 9 62.07952
## comp 10 65.91764
## comp 11 69.63189
## comp 12 73.22883
## comp 13 76.79974
## comp 14 80.20791
## comp 15 83.58420
## comp 16 86.86434
## comp 17 89.79300
## comp 18 92.41646
## comp 19 94.43016
## comp 20 96.31800
## comp 21 97.35332
## comp 22 98.18867
## comp 23 98.95184
## comp 24 99.47898
## comp 25 99.93742
## comp 26 100.00000
## comp 27 100.00000
#Exploring the data through correlation plots
Correlation_Matrix=as.data.frame(round(cor(HRData_numeric,pca$ind$coord)^2*100,0))
Correlation_Matrix[with(Correlation_Matrix, order(-Correlation_Matrix[,1])),]
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5
## JobLevel 78 1 8 1 0
## TotalWorkingYears 78 1 3 1 0
## Median Salary Level 78 1 9 1 0
## MonthlyIncome 75 10 2 1 1
## YearsAtCompany 64 7 12 2 0
## YearsInCurrentRole 45 7 21 2 0
## YearsWithCurrManager 43 8 21 3 0
## Age 37 6 9 3 1
## YearsSinceLastPromotion 35 6 10 2 1
## Attrition_M 5 1 1 0 50
## Education 3 2 1 0 1
## NumCompaniesWorked 1 17 14 3 0
## DailyRate 0 1 0 0 12
## DistanceFromHome 0 1 0 1 2
## EnvironmentSatisfaction 0 0 0 0 5
## HourlyRate 0 1 0 0 1
## JobInvolvement 0 0 0 0 20
## JobSatisfaction 0 0 0 0 6
## MonthlyRate 0 0 1 0 3
## PercentSalaryHike 0 1 8 79 0
## PerformanceRating 0 2 9 77 0
## RelationshipSatisfaction 0 0 1 0 1
## StockOptionLevel 0 0 0 0 21
## TrainingTimesLastYear 0 0 0 0 3
## WorkLifeBalance 0 0 0 0 1
## Salary_Diff 0 62 30 0 0
## CompaRatio 0 62 30 1 0
#principal component analysis
View(HRData_numeric)
prin_comp <- prcomp(HRData_numeric, scale. = T)
names(prin_comp)
## [1] "sdev" "rotation" "center" "scale" "x"
#The prcomp() function results in 5 useful measures:
##1. center and scale refer to the respective means and standard deviations of the variables used for normalization prior to implementing PCA
#outputs the mean of variables
prin_comp$center
## JobLevel Age DailyRate
## 2.063946e+00 3.692381e+01 8.024857e+02
## DistanceFromHome Education EnvironmentSatisfaction
## 9.192517e+00 2.912925e+00 2.721769e+00
## HourlyRate JobInvolvement JobSatisfaction
## 6.589116e+01 2.729932e+00 2.728571e+00
## MonthlyIncome MonthlyRate NumCompaniesWorked
## 6.502931e+03 1.431310e+04 2.693197e+00
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 1.520952e+01 3.153741e+00 2.712245e+00
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## 7.938776e-01 1.127959e+01 2.799320e+00
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 2.761224e+00 7.008163e+00 4.229252e+00
## YearsSinceLastPromotion YearsWithCurrManager Median Salary Level
## 2.187755e+00 4.123129e+00 6.473702e+03
## Salary_Diff CompaRatio Attrition_M
## 2.922925e+01 1.021795e+00 1.612245e-01
#outputs the standard deviation of variables
prin_comp$scale
## JobLevel Age DailyRate
## 1.1069399 9.1353735 403.5090999
## DistanceFromHome Education EnvironmentSatisfaction
## 8.1068644 1.0241649 1.0930822
## HourlyRate JobInvolvement JobSatisfaction
## 20.3294276 0.7115611 1.1028461
## MonthlyIncome MonthlyRate NumCompaniesWorked
## 4707.9567831 7117.7860441 2.4980090
## PercentSalaryHike PerformanceRating RelationshipSatisfaction
## 3.6599377 0.3608235 1.0812089
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## 0.8520767 7.7807817 1.2892706
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 0.7064758 6.1265252 3.6231370
## YearsSinceLastPromotion YearsWithCurrManager Median Salary Level
## 3.2224303 3.5681361 4695.4308733
## Salary_Diff CompaRatio Attrition_M
## 1306.2112901 0.2465838 0.3678630
##2. The rotation component provides the principal component loadings. Each column of the rotation matrix contains one principal component loading vector. This is the most important output we are interested in.
prin_comp$rotation
## PC1 PC2 PC3
## JobLevel 0.379903632 -0.068444254 0.206924614
## Age 0.261113430 -0.168958959 0.214185692
## DailyRate -0.001490123 -0.059056172 0.010763893
## DistanceFromHome 0.001263467 0.079315139 0.008111305
## Education 0.068333788 -0.100250642 0.071504839
## EnvironmentSatisfaction 0.004658870 0.030545177 0.035006623
## HourlyRate -0.009766244 -0.064764784 -0.000549535
## JobInvolvement 0.001442457 -0.003342808 0.030662131
## JobSatisfaction -0.004785191 0.021978374 -0.012252134
## MonthlyIncome 0.370432137 -0.223352114 0.106624273
## MonthlyRate 0.007275577 -0.044932068 0.054217785
## NumCompaniesWorked 0.051046148 -0.294084939 0.275183214
## PercentSalaryHike -0.015296412 0.072833869 -0.208914719
## PerformanceRating -0.002498925 0.100985341 -0.218463765
## RelationshipSatisfaction 0.015601004 -0.025916610 0.063677290
## StockOptionLevel 0.015419551 -0.002419895 -0.023365341
## TotalWorkingYears 0.379786039 -0.086856420 0.118727911
## TrainingTimesLastYear -0.009710365 0.043700164 -0.028941726
## WorkLifeBalance 0.013632426 0.006847215 -0.014528387
## YearsAtCompany 0.343470202 0.188593576 -0.250741372
## YearsInCurrentRole 0.286701217 0.190921614 -0.333858649
## YearsSinceLastPromotion 0.254902573 0.177004045 -0.227480396
## YearsWithCurrManager 0.281042887 0.204978380 -0.329208653
## Median Salary Level 0.378444677 -0.068340945 0.217503208
## Salary_Diff -0.025250377 -0.559359672 -0.397553453
## CompaRatio -0.007249761 -0.558082060 -0.398569418
## Attrition_M -0.097569223 0.053585240 0.062224931
## PC4 PC5 PC6
## JobLevel -0.071425501 0.0374986298 -0.14128675
## Age -0.119443508 -0.0774624166 0.14038332
## DailyRate -0.035575273 -0.3012049252 0.05914881
## DistanceFromHome -0.056584621 0.1175123123 0.28265660
## Education -0.023752472 -0.0641512369 0.38969967
## EnvironmentSatisfaction 0.050535150 -0.2062798113 -0.23622768
## HourlyRate -0.004365295 -0.0839424235 0.40363230
## JobInvolvement 0.030855996 -0.3935499912 0.26160606
## JobSatisfaction -0.016687141 -0.2204889547 -0.30976723
## MonthlyIncome -0.063169067 0.0627519101 -0.15807262
## MonthlyRate -0.012339693 0.1466224594 -0.20408286
## NumCompaniesWorked -0.132516817 0.0240279385 0.24363805
## PercentSalaryHike -0.666374486 -0.0275530652 -0.02304386
## PerformanceRating -0.658374476 0.0004671452 -0.01666273
## RelationshipSatisfaction 0.042281790 -0.0701198396 0.01932180
## StockOptionLevel -0.012798069 -0.3991586579 0.19401303
## TotalWorkingYears -0.067665248 0.0170043293 0.03301825
## TrainingTimesLastYear 0.040184109 -0.1407404759 -0.21975844
## WorkLifeBalance 0.015209322 -0.0946489880 -0.20617070
## YearsAtCompany 0.114603537 0.0405574624 0.04765693
## YearsInCurrentRole 0.113317862 -0.0323892214 0.08417727
## YearsSinceLastPromotion 0.098402503 0.1068695222 0.08867197
## YearsWithCurrManager 0.126687464 -0.0134142210 0.11578133
## Median Salary Level -0.075955061 0.0490070213 -0.14407950
## Salary_Diff 0.045355985 0.0500104391 -0.05181683
## CompaRatio 0.067164029 0.0435484282 -0.01410153
## Attrition_M -0.012656810 0.6242459940 0.17600558
## PC7 PC8 PC9
## JobLevel 0.078841478 0.109381742 0.021376980
## Age -0.059196611 -0.086124081 -0.026754756
## DailyRate 0.293481278 -0.100580084 -0.157345785
## DistanceFromHome 0.243261635 -0.086817248 0.437385247
## Education -0.149294526 -0.160883101 -0.075054815
## EnvironmentSatisfaction -0.256618782 -0.443261590 0.369268542
## HourlyRate -0.029272923 0.408019944 0.176391951
## JobInvolvement -0.139756539 -0.076396968 -0.120798742
## JobSatisfaction 0.440147731 -0.248095846 -0.151867067
## MonthlyIncome 0.083724831 0.113017284 0.014176856
## MonthlyRate -0.086991044 -0.161203032 0.509229671
## NumCompaniesWorked -0.159460685 -0.187852269 -0.026008197
## PercentSalaryHike -0.049041425 -0.009532876 -0.016690687
## PerformanceRating -0.074977890 -0.006077328 -0.018452980
## RelationshipSatisfaction -0.412009013 -0.180775405 -0.290277876
## StockOptionLevel 0.236075317 0.181792716 0.392772917
## TotalWorkingYears 0.008738231 -0.014129060 -0.013245680
## TrainingTimesLastYear -0.106730537 0.516566539 -0.118334388
## WorkLifeBalance -0.487359814 0.261074843 0.230332756
## YearsAtCompany -0.009800126 -0.026203521 -0.026144572
## YearsInCurrentRole -0.035938442 -0.071242600 0.024442964
## YearsSinceLastPromotion -0.068904802 -0.093289656 -0.012147953
## YearsWithCurrManager -0.041630099 -0.061843490 -0.029101978
## Median Salary Level 0.082189840 0.112742566 0.008494278
## Salary_Diff 0.006320702 0.002071309 0.020563081
## CompaRatio 0.023707291 -0.024843651 0.019941934
## Attrition_M 0.070594885 -0.002329182 -0.047519707
## PC10 PC11 PC12
## JobLevel 0.067682032 -0.008541918 -0.054360301
## Age -0.103456679 -0.031003329 0.100583075
## DailyRate 0.341067649 0.379397086 -0.024859682
## DistanceFromHome 0.209512902 -0.396570729 -0.245565713
## Education -0.524340621 -0.233167650 0.156515968
## EnvironmentSatisfaction 0.018934122 0.401362915 0.037365351
## HourlyRate 0.275633729 0.062956335 -0.031366217
## JobInvolvement 0.268736513 -0.254628220 0.266235600
## JobSatisfaction -0.191597504 -0.417937364 -0.153935052
## MonthlyIncome 0.087500279 -0.009040978 -0.061739021
## MonthlyRate 0.219677411 -0.278346829 0.495464689
## NumCompaniesWorked -0.135485037 0.162801614 0.024274295
## PercentSalaryHike 0.016401329 -0.006036777 0.021626793
## PerformanceRating 0.020116800 0.015840488 0.010731712
## RelationshipSatisfaction 0.441223129 -0.285630511 -0.324683796
## StockOptionLevel -0.202307260 0.107328743 -0.159552476
## TotalWorkingYears -0.008719107 0.020581669 0.027383133
## TrainingTimesLastYear -0.039938249 -0.110032090 0.436974882
## WorkLifeBalance -0.207858737 -0.099787430 -0.472614700
## YearsAtCompany 0.001476838 0.010632110 0.012392629
## YearsInCurrentRole -0.027400923 0.035184241 0.015370440
## YearsSinceLastPromotion -0.008426576 0.045145414 0.021707424
## YearsWithCurrManager -0.017527466 0.013583770 0.047678709
## Median Salary Level 0.080628610 -0.001994417 -0.059250190
## Salary_Diff 0.025540637 -0.025416932 -0.009538634
## CompaRatio 0.019278000 -0.030482156 -0.021706703
## Attrition_M 0.014686316 0.117173771 -0.030636323
## PC13 PC14 PC15
## JobLevel -0.082546691 -0.015627444 -0.07960094
## Age 0.115051367 0.064647463 0.06055567
## DailyRate 0.218617346 -0.622843453 0.23619905
## DistanceFromHome 0.244670379 -0.241011829 -0.37934094
## Education 0.085008028 -0.238681902 0.07451553
## EnvironmentSatisfaction 0.076240897 0.090501935 -0.18154719
## HourlyRate -0.137815230 0.229759973 0.49088163
## JobInvolvement -0.474812321 -0.147562928 -0.29726427
## JobSatisfaction 0.021985559 0.062819931 0.36081857
## MonthlyIncome -0.086163764 -0.008626024 -0.10422406
## MonthlyRate 0.029061541 -0.080056778 0.40420806
## NumCompaniesWorked 0.199939153 0.016259095 0.10969406
## PercentSalaryHike -0.001770767 0.005188512 -0.01170265
## PerformanceRating -0.007342699 0.032319898 0.00279999
## RelationshipSatisfaction 0.412989845 0.249839414 0.06797384
## StockOptionLevel 0.236714527 0.283849192 -0.06883193
## TotalWorkingYears 0.020490186 0.010902452 0.01601755
## TrainingTimesLastYear 0.538960294 -0.116644637 -0.18395192
## WorkLifeBalance -0.134890341 -0.462753012 0.17398483
## YearsAtCompany 0.024983474 0.011635910 0.04533176
## YearsInCurrentRole 0.030610356 -0.056224049 0.07715220
## YearsSinceLastPromotion 0.116202172 0.014244203 0.08778571
## YearsWithCurrManager 0.004338385 -0.002638002 0.03091558
## Median Salary Level -0.091428409 -0.007743205 -0.08468835
## Salary_Diff 0.018098527 -0.003256183 -0.07122437
## CompaRatio -0.003891091 -0.005255410 -0.05511932
## Attrition_M 0.080967949 -0.153642949 0.01316782
## PC16 PC17 PC18
## JobLevel -0.022313981 0.097423827 0.114783137
## Age -0.034801520 -0.122916668 -0.231818727
## DailyRate 0.053156014 0.142959707 0.015685455
## DistanceFromHome -0.231653914 -0.108929900 -0.211538029
## Education -0.238934827 0.466051187 0.225661071
## EnvironmentSatisfaction -0.451863036 -0.081801910 0.266193266
## HourlyRate -0.466025640 -0.080001618 0.084509335
## JobInvolvement 0.150549190 -0.277472920 0.296985110
## JobSatisfaction -0.188113545 -0.342592566 0.222631743
## MonthlyIncome -0.032494396 0.095440020 0.156492428
## MonthlyRate 0.277934590 0.131662985 0.008548726
## NumCompaniesWorked 0.198180083 -0.525105988 -0.175765160
## PercentSalaryHike -0.013733162 0.017749196 0.034412653
## PerformanceRating -0.001106856 0.004011410 0.043368106
## RelationshipSatisfaction 0.101550863 0.200004069 0.153789063
## StockOptionLevel 0.471468358 0.085725394 0.318098734
## TotalWorkingYears -0.009491363 -0.064974085 -0.099162271
## TrainingTimesLastYear -0.176620276 -0.217648284 0.101010424
## WorkLifeBalance 0.094752839 -0.164016516 0.025026910
## YearsAtCompany -0.003547795 -0.025133010 -0.008945467
## YearsInCurrentRole 0.051367158 -0.084338609 -0.067717982
## YearsSinceLastPromotion 0.066913562 -0.157550874 0.175935603
## YearsWithCurrManager 0.029621928 -0.041987725 -0.111175592
## Median Salary Level -0.027190546 0.096920168 0.138947653
## Salary_Diff -0.019377326 -0.004405459 0.064568794
## CompaRatio -0.016864785 -0.014547985 0.015863238
## Attrition_M 0.072582728 -0.214779738 0.587940224
## PC19 PC20 PC21
## JobLevel -0.1649846217 0.015059547 1.987337e-02
## Age 0.7388296245 -0.031849034 4.760898e-02
## DailyRate 0.0358938488 -0.021241687 -3.866189e-02
## DistanceFromHome -0.0491937315 -0.045465668 -7.687354e-03
## Education -0.1423550477 -0.003840078 6.267605e-03
## EnvironmentSatisfaction 0.0348139213 0.069895657 -1.814423e-02
## HourlyRate -0.0510785241 0.003985597 2.829376e-03
## JobInvolvement 0.0464435248 0.005990439 6.733255e-03
## JobSatisfaction -0.0081084849 0.055216774 -2.402262e-02
## MonthlyIncome -0.1343734458 -0.013825961 1.133065e-02
## MonthlyRate -0.0230311696 0.032053459 -2.169203e-02
## NumCompaniesWorked -0.4843465574 0.056866467 -1.960949e-02
## PercentSalaryHike -0.0079070903 0.004516371 -1.246120e-02
## PerformanceRating -0.0146802007 0.001545554 7.755357e-05
## RelationshipSatisfaction -0.0001816872 0.082601160 1.916085e-02
## StockOptionLevel 0.0682543128 0.055229705 -2.639510e-02
## TotalWorkingYears 0.1861212005 0.075484372 -7.240561e-02
## TrainingTimesLastYear -0.0303146185 0.039847586 1.990692e-03
## WorkLifeBalance 0.1033805810 0.005564292 -5.112251e-02
## YearsAtCompany 0.0312561706 0.135669539 -7.648120e-02
## YearsInCurrentRole -0.0877109391 0.230243524 7.685922e-01
## YearsSinceLastPromotion -0.0246438463 -0.838413926 -8.092032e-02
## YearsWithCurrManager -0.0824171026 0.366268000 -6.174012e-01
## Median Salary Level -0.1397007695 -0.011129578 1.370880e-02
## Salary_Diff 0.0178615290 -0.009825259 -8.440076e-03
## CompaRatio 0.0591896367 -0.001477403 -9.113936e-03
## Attrition_M 0.2304828560 0.240506831 8.069989e-03
## PC22 PC23 PC24
## JobLevel 0.0499453318 -0.1693392783 -0.073015731
## Age 0.0767528151 -0.2920167943 -0.228839032
## DailyRate -0.0159752140 -0.0034679573 -0.021707760
## DistanceFromHome -0.0160228099 -0.0007846826 -0.011865390
## Education -0.0188521245 0.0178795652 0.003326954
## EnvironmentSatisfaction 0.0012006978 0.0061111836 0.007395061
## HourlyRate 0.0112857543 -0.0087668862 0.004759065
## JobInvolvement -0.0149416666 0.0287852510 -0.005774135
## JobSatisfaction -0.0112456541 -0.0097525200 0.014339574
## MonthlyIncome 0.0206823654 -0.1074270862 -0.070636721
## MonthlyRate -0.0032350094 0.0062726537 -0.011213104
## NumCompaniesWorked 0.0103044531 0.0146988460 -0.117853364
## PercentSalaryHike 0.6839496083 0.1725600539 0.034011126
## PerformanceRating -0.6839742563 -0.1565538148 -0.050335348
## RelationshipSatisfaction 0.0077969408 0.0010258760 0.025318394
## StockOptionLevel -0.0056362748 0.0201869398 0.010221359
## TotalWorkingYears -0.1744729150 0.5333146781 0.667172191
## TrainingTimesLastYear -0.0089731922 0.0024912062 0.014156121
## WorkLifeBalance 0.0009695364 0.0116966021 -0.001065797
## YearsAtCompany -0.0891948687 0.5629334791 -0.638400353
## YearsInCurrentRole 0.0535687670 -0.1890869692 0.156110400
## YearsSinceLastPromotion 0.0318000797 -0.0782417443 0.087449559
## YearsWithCurrManager 0.0907491329 -0.3951591079 0.165917108
## Median Salary Level 0.0317333606 -0.0909478913 -0.068184674
## Salary_Diff -0.0395266206 -0.0602678472 -0.009491732
## CompaRatio 0.0194536995 0.0776161597 0.008088583
## Attrition_M 0.0219484236 -0.0278122093 0.026364919
## PC25 PC26 PC27
## JobLevel 0.0470991161 -0.8076721831 -7.079678e-16
## Age -0.0071914044 0.0112666005 8.372591e-18
## DailyRate -0.0036996340 0.0002206090 -3.882268e-17
## DistanceFromHome -0.0109313511 0.0076715012 1.429304e-16
## Education 0.0005763606 0.0070069915 -1.002268e-16
## EnvironmentSatisfaction 0.0121247909 -0.0032576188 6.881040e-17
## HourlyRate -0.0039657476 -0.0028056375 -6.785957e-18
## JobInvolvement -0.0060892708 -0.0085831486 1.005618e-16
## JobSatisfaction -0.0033960181 0.0006845210 -7.239160e-17
## MonthlyIncome -0.0838645203 0.3987353254 -6.947690e-01
## MonthlyRate 0.0052436459 0.0086118904 5.309418e-17
## NumCompaniesWorked -0.0012600009 0.0155427148 4.931790e-17
## PercentSalaryHike -0.0332989320 -0.0006100768 -1.554176e-17
## PerformanceRating 0.0588225748 -0.0043141785 1.198809e-16
## RelationshipSatisfaction 0.0096537170 -0.0050389886 2.773348e-17
## StockOptionLevel 0.0032560362 0.0015864883 2.100705e-16
## TotalWorkingYears -0.0674172203 -0.0375832961 -6.895901e-17
## TrainingTimesLastYear 0.0326056211 0.0022958207 3.025400e-17
## WorkLifeBalance 0.0008095454 0.0039883977 1.090267e-16
## YearsAtCompany -0.0536327603 -0.0274173535 1.457420e-16
## YearsInCurrentRole 0.0026368213 0.0299388436 -1.968351e-16
## YearsSinceLastPromotion 0.0094004316 -0.0131897301 1.433796e-17
## YearsWithCurrManager 0.0181084516 0.0382395364 -3.347949e-17
## Median Salary Level 0.1048237519 0.4200057460 6.929205e-01
## Salary_Diff -0.6790809609 -0.0726369979 1.927620e-01
## CompaRatio 0.7102475865 -0.0366333340 -1.313865e-16
## Attrition_M 0.0122826643 -0.0156420523 7.571008e-17
#This returns 27 principal component loadings. Is that correct? Absolutely. In a data set, the maximum number of principal components is min(n - 1, p). Let's look at the first 4 principal components and first 5 rows.
prin_comp$rotation[1:5,1:4]
## PC1 PC2 PC3 PC4
## JobLevel 0.379903632 -0.06844425 0.206924614 -0.07142550
## Age 0.261113430 -0.16895896 0.214185692 -0.11944351
## DailyRate -0.001490123 -0.05905617 0.010763893 -0.03557527
## DistanceFromHome 0.001263467 0.07931514 0.008111305 -0.05658462
## Education 0.068333788 -0.10025064 0.071504839 -0.02375247
##3. To obtain the principal component score vectors, we don't need to multiply the loadings with the data ourselves. The matrix x already holds the principal component scores, with dimension 1470 × 27.
dim(prin_comp$x)
## [1] 1470 27
#Let's plot the resultant principal components.
biplot(prin_comp, scale = 0)
##4. The prcomp() function also provides the facility to compute standard deviation of each principal component. sdev refers to the standard deviation of principal components.
#compute standard deviation of each principal component
std_dev <- prin_comp$sdev
#compute variance
pr_var <- std_dev^2
#check variance of first 10 components
pr_var[1:10]
## [1] 5.429495 1.976901 1.893096 1.771847 1.287847 1.211759 1.081930
## [8] 1.063586 1.045010 1.036292
#We aim to find the components which explain the maximum variance, because we want to retain as much information as possible using these components. The higher the explained variance, the more information those components contain. To compute the proportion of variance explained by each component, we simply divide each variance by the total variance. This results in:
#proportion of variance explained
prop_varex <- pr_var/sum(pr_var)
prop_varex[1:20]
## [1] 0.20109240 0.07321855 0.07011465 0.06562396 0.04769803 0.04487996
## [7] 0.04007149 0.03939207 0.03870407 0.03838120 0.03714252 0.03596942
## [13] 0.03570907 0.03408174 0.03376289 0.03280134 0.02928660 0.02623460
## [19] 0.02013707 0.01887842
#This shows that the first principal component explains 20.1% of the variance, the second component 7.3%, the third component 7.0%, and so on. So, how do we decide how many components to select for the modeling stage?
#The answer to this question is provided by a scree plot. A scree plot is used to assess the components or factors which explain most of the variability in the data. It shows the values in descending order.
#scree plot
plot(prop_varex, xlab = "Principal Component",ylab = "Proportion of Variance Explained",type = "b")
#The plot above shows that ~20 components explain around 96% of the variance in the data set. In other words, using PCA we have reduced 27 predictors to 20 without compromising much on explained variance. This is the power of PCA. Let's do a confirmation check by plotting a cumulative variance plot, which will give us a clear picture of the number of components.
#cumulative scree plot
plot(cumsum(prop_varex), xlab = "Principal Component",ylab = "Cumulative Proportion of Variance Explained",type = "b")
#This plot shows that 20 components yield a cumulative explained variance close to ~96%. Therefore, in this case, we'll select 20 components [PC1 to PC20] and proceed to the modeling stage. This completes the steps to implement PCA on the training data. For modeling, we'll use these 20 components as predictor variables and follow the normal procedure.
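A sketch of what that modeling step could look like, assuming we keep PC1 to PC20 (note that prin_comp was fit on a matrix that still contains Attrition_M, so for a real model the response should be dropped before calling prcomp()):
# Assemble a modeling frame: the first 20 component scores plus the response
pca_data <- data.frame(Attrition = HRData$Attrition, prin_comp$x[, 1:20])
dim(pca_data) # 1470 rows, 21 columns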
I will be completing this PCA-based model later; for now, let's run our first basic regression model.
dim(HRData)
## [1] 1470 35
Data <- HRData[,c(-36)] # Intended to remove the last column (Attrition_M). Note: HRData has only 35 columns, so -36 removes nothing and Attrition_M stays in Data (this matters for the model below)
View(Data)
# setting the random seed for replication
set.seed(123)
Data$spl =sample.split(Data,SplitRatio = .7)
# By using sample.split() you are actually creating a vector with two values, TRUE and FALSE. By setting SplitRatio to 0.7, you are splitting the original HRData of 1470 rows into 70% training and 30% testing data.
Assigning the result to Data$spl creates a new column in the Data dataset. If you view Data again you will notice a new column added at the end; you can use View() or head() for this.
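For reference only (kept commented out so the outputs below are unchanged): sample.split() is normally given the outcome vector, which preserves the Yes/No ratio in both halves, whereas the line above passes the whole data frame:
# Reference only -- stratify the split on the outcome (not run)
# spl_strat <- sample.split(Data$Attrition, SplitRatio = 0.7)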
View(Data)
dim(Data)
## [1] 1470 36
Now, you can split the dataset to training and testing as given
train= subset(Data,Data$spl==TRUE) # spl == TRUE selects the rows that go into the training data frame
View(train) # this data frame contains all rows where Data$spl == TRUE
test=subset(Data,Data$spl==FALSE) # spl == FALSE selects the rows that go into the testing data frame
View(test) # this data frame contains all rows where Data$spl == FALSE
str(train)
## 'data.frame': 1008 obs. of 36 variables:
## $ JobLevel : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Age : int 32 37 38 31 41 45 29 32 31 19 ...
## $ Attrition : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 2 ...
## $ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 2 3 3 3 3 3 3 3 2 2 ...
## $ DailyRate : int 1005 408 362 1082 465 1457 1328 634 444 602 ...
## $ Department : Factor w/ 3 levels "Human Resources",..: 2 2 2 2 2 2 2 2 3 3 ...
## $ DistanceFromHome : int 2 19 1 1 14 7 2 5 5 1 ...
## $ Education : int 2 2 1 4 3 3 3 4 3 1 ...
## $ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 2 4 2 4 2 5 3 6 ...
## $ EnvironmentSatisfaction : int 4 2 3 3 1 1 3 2 4 3 ...
## $ Gender : Factor w/ 2 levels "Female","Male": 2 2 1 2 2 1 2 1 1 1 ...
## $ HourlyRate : int 79 73 43 87 56 83 76 35 84 100 ...
## $ JobInvolvement : int 3 3 3 3 3 3 3 4 3 1 ...
## $ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 3 7 7 7 7 7 7 7 9 9 ...
## $ JobSatisfaction : int 4 2 1 2 3 3 2 4 2 1 ...
## $ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 3 1 2 2 2 1 3 ...
## $ MonthlyIncome : int 3068 3022 2619 2501 2451 4477 2703 3312 2789 2325 ...
## $ MonthlyRate : int 11864 10227 14561 18775 4609 20100 4956 18783 3909 20989 ...
## $ NumCompaniesWorked : int 0 4 3 1 4 4 0 3 1 0 ...
## $ OverTime : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 2 1 1 1 1 ...
## $ PercentSalaryHike : int 13 21 17 17 12 19 23 17 11 21 ...
## $ PerformanceRating : int 3 4 3 3 3 3 4 3 3 4 ...
## $ RelationshipSatisfaction: int 3 1 4 2 1 3 4 4 3 1 ...
## $ StockOptionLevel : int 0 0 0 0 1 1 1 2 1 0 ...
## $ TotalWorkingYears : int 8 8 8 1 13 7 6 6 2 1 ...
## $ TrainingTimesLastYear : int 2 1 3 4 2 2 3 3 5 5 ...
## $ WorkLifeBalance : int 2 3 2 3 3 2 3 3 2 4 ...
## $ YearsAtCompany : int 7 1 0 1 9 3 5 3 2 0 ...
## $ YearsInCurrentRole : int 7 0 0 1 8 2 4 2 2 0 ...
## $ YearsSinceLastPromotion : int 3 0 0 1 1 0 0 0 2 0 ...
## $ YearsWithCurrManager : int 6 0 0 0 8 2 4 2 2 0 ...
## $ Median Salary Level : num 2670 2670 2670 2670 2670 2670 2670 2670 2670 2670 ...
## $ Salary_Diff : num 398 352 -51 -169 -219 ...
## $ CompaRatio : num 1.149 1.132 0.981 0.937 0.918 ...
## $ Attrition_M : num 0 0 0 0 0 0 0 0 0 1 ...
## $ spl : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
Checking that both sets have a similar distribution
mean(train$MonthlyIncome)
## [1] 6471.623
mean(test$MonthlyIncome)
## [1] 6571.24
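Since the target is a class label, it is also worth checking that the Yes/No balance is similar in both halves; a quick sketch:
# Class proportions in each split
prop.table(table(train$Attrition))
prop.table(table(test$Attrition))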
glm.fit <- glm(Attrition ~., data = train, family = binomial())
## Warning: glm.fit: algorithm did not converge
summary(glm.fit)
##
## Call:
## glm(formula = Attrition ~ ., family = binomial(), data = train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.409e-06 -2.409e-06 -2.409e-06 -2.409e-06 2.409e-06
##
## Coefficients: (2 not defined because of singularities)
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.657e+01 2.715e+05 0.000 1.000
## JobLevel -1.071e-10 9.044e+04 0.000 1.000
## Age -1.658e-12 1.729e+03 0.000 1.000
## BusinessTravelTravel_Frequently 2.879e-11 4.403e+04 0.000 1.000
## BusinessTravelTravel_Rarely 3.057e-11 3.650e+04 0.000 1.000
## DailyRate -1.769e-14 2.891e+01 0.000 1.000
## DepartmentResearch & Development 3.794e-10 1.578e+05 0.000 1.000
## DepartmentSales 2.353e-10 1.618e+05 0.000 1.000
## DistanceFromHome 3.975e-12 1.399e+03 0.000 1.000
## Education 9.045e-13 1.144e+04 0.000 1.000
## EducationFieldLife Sciences -5.298e-10 1.052e+05 0.000 1.000
## EducationFieldMarketing -1.813e-10 1.127e+05 0.000 1.000
## EducationFieldMedical -4.326e-10 1.054e+05 0.000 1.000
## EducationFieldOther -4.331e-10 1.151e+05 0.000 1.000
## EducationFieldTechnical Degree -5.040e-10 1.102e+05 0.000 1.000
## EnvironmentSatisfaction 1.309e-11 1.052e+04 0.000 1.000
## GenderMale -2.425e-11 2.351e+04 0.000 1.000
## HourlyRate -2.527e-12 5.673e+02 0.000 1.000
## JobInvolvement 3.000e-11 1.625e+04 0.000 1.000
## JobRoleHuman Resources 4.028e-10 1.631e+05 0.000 1.000
## JobRoleLaboratory Technician -2.574e-10 5.741e+04 0.000 1.000
## JobRoleManager 2.543e-10 1.032e+05 0.000 1.000
## JobRoleManufacturing Director 3.273e-11 5.225e+04 0.000 1.000
## JobRoleResearch Director 1.204e-10 9.280e+04 0.000 1.000
## JobRoleResearch Scientist 2.849e-11 5.690e+04 0.000 1.000
## JobRoleSales Executive -2.388e-11 1.046e+05 0.000 1.000
## JobRoleSales Representative 8.084e-10 1.163e+05 0.000 1.000
## JobSatisfaction -7.131e-12 1.038e+04 0.000 1.000
## MaritalStatusMarried 1.535e-11 3.055e+04 0.000 1.000
## MaritalStatusSingle 3.110e-12 4.222e+04 0.000 1.000
## MonthlyIncome 4.428e-14 2.513e+01 0.000 1.000
## MonthlyRate -1.685e-15 1.615e+00 0.000 1.000
## NumCompaniesWorked 4.476e-12 5.132e+03 0.000 1.000
## OverTimeYes 1.379e-10 2.610e+04 0.000 1.000
## PercentSalaryHike -6.239e-13 4.939e+03 0.000 1.000
## PerformanceRating 2.694e-11 5.050e+04 0.000 1.000
## RelationshipSatisfaction 2.860e-11 1.059e+04 0.000 1.000
## StockOptionLevel 5.057e-11 1.834e+04 0.000 1.000
## TotalWorkingYears 3.366e-13 3.229e+03 0.000 1.000
## TrainingTimesLastYear 1.535e-11 8.967e+03 0.000 1.000
## WorkLifeBalance 1.994e-10 1.623e+04 0.000 1.000
## YearsAtCompany 1.230e-12 4.093e+03 0.000 1.000
## YearsInCurrentRole -1.454e-12 5.216e+03 0.000 1.000
## YearsSinceLastPromotion -4.554e-12 4.737e+03 0.000 1.000
## YearsWithCurrManager 4.868e-12 5.652e+03 0.000 1.000
## `Median Salary Level` -3.914e-14 2.513e+01 0.000 1.000
## Salary_Diff NA NA NA NA
## CompaRatio -2.187e-10 1.159e+05 0.000 1.000
## Attrition_M 5.313e+01 3.633e+04 0.001 0.999
## splTRUE NA NA NA NA
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 8.7877e+02 on 1007 degrees of freedom
## Residual deviance: 5.8480e-09 on 960 degrees of freedom
## AIC: 96
##
## Number of Fisher Scoring iterations: 25
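A word of caution before interpreting this output: the “algorithm did not converge” warning, the enormous standard errors, and a residual deviance of essentially zero are the classic signature of perfect separation. It arises here because Attrition ~ . sweeps in Attrition_M (a numeric copy of the outcome) and spl (the train/test flag) as predictors, while Salary_Diff comes back NA because it is collinear with MonthlyIncome and `Median Salary Level`. A hedged sketch of a cleaner fit follows; the remainder of this walkthrough keeps the original glm.fit so the printed output still matches, but in practice you would carry the clean model forward:
# Drop the leaked outcome copy, the split flag, and the collinear column
train_clean <- subset(train, select = -c(Attrition_M, spl, Salary_Diff))
glm.fit_clean <- glm(Attrition ~ ., data = train_clean, family = binomial())
summary(glm.fit_clean)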
Interpreting the results of our logistic regression model
In the output above, the first thing we see is the call; this is R reminding us what model we ran and what options we specified.
Next we see the deviance residuals, which are a measure of model fit. This part of the output shows the distribution of the deviance residuals for the individual cases used in the model. Below we discuss how to use summaries of the deviance statistic to assess model fit.
The next part of the output shows the coefficients, their standard errors, the z-statistic (sometimes called a Wald z-statistic), and the associated p-values. In a healthy fit you would scan this column for predictors whose p-values fall below your chosen significance level; here every p-value is essentially 1 and every standard error is enormous, which reflects the separation problem flagged above rather than evidence that nothing matters. The logistic regression coefficients give the change in the log odds of the outcome for a one-unit increase in the predictor variable.
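Because the coefficients are on the log-odds scale, exponentiating them yields odds ratios, which are usually easier to communicate to HR stakeholders. A minimal sketch, run against a converged fit such as the glm.fit_clean sketched above:
# Odds ratios with Wald 95% confidence intervals (log-odds -> odds scale)
exp(cbind(OddsRatio = coef(glm.fit_clean), confint.default(glm.fit_clean)))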
Now we can run the anova() function on the model to analyze the table of deviance
anova(glm.fit, test="Chisq")
## Warning: glm.fit: algorithm did not converge
## Analysis of Deviance Table
##
## Model: binomial, link: logit
##
## Response: Attrition
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 1007 878.77
## JobLevel 1 35.93 1006 842.85 2.049e-09 ***
## Age 1 7.39 1005 835.46 0.0065574 **
## BusinessTravel 2 27.76 1003 807.70 9.374e-07 ***
## DailyRate 1 0.78 1002 806.92 0.3774528
## Department 2 13.25 1000 793.67 0.0013253 **
## DistanceFromHome 1 7.84 999 785.83 0.0051238 **
## Education 1 0.21 998 785.62 0.6467971
## EducationField 5 5.40 993 780.22 0.3685629
## EnvironmentSatisfaction 1 20.14 992 760.08 7.214e-06 ***
## Gender 1 2.25 991 757.83 0.1335512
## HourlyRate 1 0.53 990 757.30 0.4682762
## JobInvolvement 1 16.90 989 740.40 3.936e-05 ***
## JobRole 8 9.99 981 730.41 0.2658905
## JobSatisfaction 1 8.00 980 722.42 0.0046829 **
## MaritalStatus 2 26.72 978 695.70 1.577e-06 ***
## MonthlyIncome 1 0.22 977 695.47 0.6372306
## MonthlyRate 1 0.44 976 695.04 0.5082545
## NumCompaniesWorked 1 21.13 975 673.91 4.301e-06 ***
## OverTime 1 64.11 974 609.81 1.180e-15 ***
## PercentSalaryHike 1 1.41 973 608.40 0.2352663
## PerformanceRating 1 0.01 972 608.38 0.9082739
## RelationshipSatisfaction 1 2.60 971 605.78 0.1067035
## StockOptionLevel 1 3.87 970 601.91 0.0491960 *
## TotalWorkingYears 1 9.16 969 592.76 0.0024785 **
## TrainingTimesLastYear 1 2.07 968 590.68 0.1498382
## WorkLifeBalance 1 11.08 967 579.61 0.0008747 ***
## YearsAtCompany 1 0.73 966 578.88 0.3940867
## YearsInCurrentRole 1 4.31 965 574.57 0.0379302 *
## YearsSinceLastPromotion 1 8.65 964 565.92 0.0032653 **
## YearsWithCurrManager 1 6.07 963 559.85 0.0137515 *
## `Median Salary Level` 1 11.31 962 548.54 0.0007704 ***
## Salary_Diff 0 0.00 962 548.54
## CompaRatio 1 3.01 961 545.53 0.0826204 .
## Attrition_M 1 545.53 960 0.00 < 2.2e-16 ***
## spl 0 0.00 960 0.00
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The difference between the null deviance and the residual deviance shows how our model is doing against the null model (a model with only the intercept). The wider this gap, the better. Analyzing the table we can see the drop in deviance when adding each variable one at a time. Adding OverTime, JobLevel, BusinessTravel and MaritalStatus significantly reduces the residual deviance, while variables such as DailyRate, Education, HourlyRate and MonthlyRate improve the model much less. A large p-value here indicates that the model without the variable explains more or less the same amount of variation. Notice also that Attrition_M alone wipes out all of the remaining deviance (545.53 on 1 df), which is exactly what we would expect from a leaked copy of the outcome and confirms the problem flagged earlier. Ultimately, what you would like to see is a significant drop in both the deviance and the AIC.
While no exact equivalent to the R2 of linear regression exists, the McFadden R2 index can be used to assess the model fit.
## library(pscl)
## pR2(glm.fit)
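If you would rather not pull in another package, McFadden’s R2 is simply one minus the ratio of the residual deviance to the null deviance, so it can be computed straight from the fitted glm object:
# McFadden's pseudo-R2: 1 - (residual deviance / null deviance)
1 - glm.fit$deviance / glm.fit$null.deviance
# Essentially 1 for this fit -- yet another symptom of the leaked outcome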
With our final model specified, it’s time to actually see the predicted output from the fitted values in the model output. The fitted values provide the predicted probability of leaving for each individual.
This individual-level data is really useful in the real world because we can then calculate probabilities of leaving for any set of arbitrarily defined groups after creating our model. For example, we can group our prediction to predict “Females in Sales”, “Managers in Accounting”, or even “low performing males under 30 in Sales”.
Individual-level predictors allow one to drill down and slice the data anyway you wish while still leveraging all of the data available across the workforce.
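As a hedged sketch of that idea (the grouping variables here are arbitrary; any columns in the data would do), dplyr can roll the individual fitted probabilities up to whatever slice of the workforce you care about:
library(dplyr)
# Attach each employee's predicted probability, then summarise by group
train %>%
  mutate(p_leave = glm.fit$fitted.values) %>%
  group_by(Department, Gender) %>%
  summarise(avg_prob_leaving = mean(p_leave), headcount = n())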
Let’s see the distribution.
hist(glm.fit$fitted.values, main = "Distribution of Predicted Probabilities", xlab = "Probability of Leaving", col = "Blue", border = F, breaks = 50)
abline(v = .5, col = "red", lwd = 3)
prop.table(table(glm.fit$fitted.values >= .5))
##
## FALSE TRUE
## 0.8422619 0.1577381
prop.table(table(glm.fit$fitted.values >= .3))
##
## FALSE TRUE
## 0.8422619 0.1577381
Our histogram (and calculations) for the training data show that roughly 16% of our employees have a predicted probability of leaving of .5 or higher, and exactly the same 16% sit above .3. Identical proportions at both cutoffs are another warning sign: the fitted probabilities are piling up at 0 and 1 instead of spreading across the range.
Now what? These data give us a feel for the output on the training data (and yes, you should ALWAYS plot them) but of course the real test is seeing how the model did with the novel test data.
Measuring Accuracy To figure out how accurate our model is at predicting turnover, we need to first take our model and use it to predict the outcomes for our test dataset. We’ll use the “predict” function and feed it our final glm.fit model that we built above. We will also set the newdata argument to our “test” dataset from our split, and set type = “response” to get the predicted probabilities for each person.
glm.fit_test<- predict(glm.fit, newdata = test, type = "response")
## Warning in predict.lm(object, newdata, se.fit, scale = 1, type =
## ifelse(type == : prediction from a rank-deficient fit may be misleading
hist(glm.fit_test, main = "Distribution of Test Set \nPredicted Probabilities", xlab = "Probability of Leaving", col = "Blue", border = F, breaks = 50)
abline(v = .5, col = "red", lwd = 3)
prop.table(table(glm.fit_test >= .5))
##
## FALSE TRUE
## 0.8311688 0.1688312
prop.table(table(glm.fit_test>= .3))
##
## FALSE TRUE
## 0.8311688 0.1688312
Similar to our training set predictions, we see that 16.88% have a predicted probability greater than .5, and the identical 16.88% have a predicted probability greater than .3. Note again that the model has provided the individual probabilities, not a 0/1 response.
To get the 0/1 value we need to measure model accuracy, we’ll need to first select some cutoff value for predicting whether someone will leave or not. A good starting point is .5. After all, those with a probability value greater than .5 are more likely to leave than those with less than a .5 predicted probability.
Our reference point is the base rate, which for the test data is 83.11% staying and 16.88% leaving. Thus, if we had no information available, we could simply predict that everyone will stay and be correct 83.11% of the time. You could think about this as a “majority class model” in which our best guess is picking the largest class. Let’s hope our model does better.
To check accuracy, we’ll build a “confusion matrix”. This allows us to compare the observed results from our test set with our predicted results. We will set the cutoff at .5 so that anyone with a value over .5 will be categorized as leaving.
prop.table(table(test$Attrition))
##
## No Yes
## 0.8311688 0.1688312
accuracy <- table(glm.fit_test > .5, test$Attrition) # confusion matrix
addmargins(table(glm.fit_test> .5, test$Attrition))
##
## No Yes Sum
## FALSE 384 0 384
## TRUE 0 78 78
## Sum 384 78 462
The row labels (FALSE/TRUE) represent the predicted outcome (where FALSE = stay and TRUE = leave). The columns represent the real, observed outcomes.
To get the model accuracy, we will sum the correctly classified observations (the values on the diagonal) and divide by the total number of observations in the test set (462). This will give us the proportion correct.
sum(diag(accuracy))/sum(accuracy)
## [1] 1
The result is a perfect 100% v. the “no model” case of 83.11%. That should raise eyebrows rather than applause: perfect accuracy on held-out data almost always means leakage, and here it is the Attrition_M column doing the work. Refit without the leaked columns, as sketched earlier, and you should expect a realistic accuracy modestly above the base rate instead.
Remember too that we are dealing with far more variables than you would normally have in your HR data; given that, a properly specified model beating the base rate is doing quite well.
Let’s take a minute to think a bit more about that .5 cutoff.
addmargins(accuracy)
##
## No Yes Sum
## FALSE 384 0 384
## TRUE 0 78 78
## Sum 384 78 462
When we choose this cutoff, our model catches all 78 of the 78 turnover events, a True Positive Rate of 100%, and all 384 of the 384 stay events, a True Negative Rate of 100%. With a legitimate model you would instead see a tradeoff here, typically a high True Negative Rate paired with a far more modest True Positive Rate.
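Both rates are easy to compute directly from the confusion matrix built above (rows are predictions, columns are observed outcomes):
# True Positive Rate (sensitivity) and True Negative Rate (specificity)
tpr <- accuracy["TRUE", "Yes"] / sum(accuracy[, "Yes"])
tnr <- accuracy["FALSE", "No"] / sum(accuracy[, "No"])
c(TPR = tpr, TNR = tnr)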
What happens if we lower our threshold to .3 instead?
prop.table(table(test$Attrition))
##
## No Yes
## 0.8311688 0.1688312
accuracy2 <- table(glm.fit_test > .3, test$Attrition) # confusion matrix
addmargins(accuracy2)
##
## No Yes Sum
## FALSE 384 0 384
## TRUE 0 78 78
## Sum 384 78 462
sum(diag(accuracy2))/sum(accuracy2)
## [1] 1
Here, because the fitted probabilities are pinned near 0 and 1, the .3 cutoff produces exactly the same confusion matrix and the same accuracy. In a well-behaved model this is where the tradeoff would show up: loosening the cutoff catches more of the leave events (a higher True Positive Rate) at the cost of labeling more of the stayers as leavers (a lower True Negative Rate, and usually a lower overall accuracy). The world is full of tradeoffs.
To make this more concrete, think instead of a test for a disease. In that case, we might be willing to lower the cutoff to make sure we are catching those with the disease, even at the expense of more False Positive results.
We will deal with this more in the ROC section below. The take home message for now is that changing that cutoff can have a big impact on your interpretation of the model; the actions you wish to take given your results should guide your decision.
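To make the cutoff tradeoff easy to inspect for any model, here is a small hypothetical helper (the function name and arguments are mine, not part of the original analysis) that reports accuracy, TPR and TNR at a given threshold:
# Accuracy, TPR and TNR for one probability cutoff.
# probs: predicted probabilities; actual: factor with levels "No"/"Yes"
cutoff_metrics <- function(probs, actual, cutoff = 0.5) {
  pred <- factor(probs > cutoff, levels = c(FALSE, TRUE))
  cm <- table(pred, actual)
  c(cutoff   = cutoff,
    accuracy = sum(diag(cm)) / sum(cm),
    TPR      = cm["TRUE", "Yes"] / sum(cm[, "Yes"]),
    TNR      = cm["FALSE", "No"] / sum(cm[, "No"]))
}
sapply(c(.3, .5, .7), function(k) cutoff_metrics(glm.fit_test, test$Attrition, k))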
We’ve already discussed the impact of changing cutoffs on accuracy, but let’s dig into an alternative way of assessing our logistic regression and decision tree models: the ROC curve.
ROC stands for “Receiver Operating Characteristic” and was first used in WWII to aid radar detection processes. It is still used today in human cognition research, radiology, epidemiology, and predictive models of all sorts.
ROC curves help us evaluate binary classifiers by showing us how different cutoff (threshold) values impact the quality of our model. Said differently, it helps us figure out the best tradeoff between True Positive and False Positive rates.
To develop our intuitions, let’s go back to our logistic model and select five possible cutoffs: 0, .3, .5, .7, and 1. These cutoffs are just illustrative.
Let’s plot our False Positive and True Positive values at each of these sample cutoffs.
# Collector for the rates that we will plot
rates <- data.frame(fp = numeric(3), tp = numeric(3), cutoff = numeric(3))
temp_cutoffs <- c(.3, .5, .7) # the middle cutoff values to test
for (i in 1:3){
temp_res <- addmargins(table(glm.fit_test > temp_cutoffs[i], test$Attrition))
rates$fp[i] <- temp_res[2,1]/temp_res[3,1] # False Positive Rate at this cutoff
rates$tp[i] <- temp_res[2,2]/temp_res[3,2] # True Positive Rate at this cutoff
rates$cutoff[i] <- temp_cutoffs[i]
}
# plotting the calculated rates
plot(rates$fp[1:3], rates$tp[1:3], pch = 19, col = "red", cex = 1.5,
xlim = c(0,1), ylim = c(0,1), xlab = "False Positive Rate",
ylab = "True Positive Rate",
main = "False Positive and True Positive Rates \nFor Different Cutoffs")
# Lines connecting the dots
lines(c(0, rates$fp[3:1], 1), c(0, rates$tp[3:1], 1))
# Additional notes for the figure
abline(coef = c(0,1), col = "black", lwd = 2)
points(x = c(0,1), y = c(0,1),pch = 19, col = "red", cex = 1.5)
points(x = 0, y = 1, pch = 19, col = "dark green", cex = 1.5)
text(x = rates$fp, y = (rates$tp + .05), labels = rates$cutoff)
text(x = .07, y = .97, labels = "Ideal Model")
text(x = 0, y = .05, labels = 1)
text(x = 1, y = .95, labels = 0)
abline(v = 0, h = 1, lty = 2, lwd = 2)
text(.5, .46, srt = 30, labels = "No Predictive Power")
At one extreme, a cutoff value of 1 (bottom left corner) means we treat everyone as a stayer. This would give us 0 False Positives (because no one is predicted to leave) but also 0 True Positives.
At the other extreme, a cutoff of 0 (upper right corner) means we call everyone a leaver. By categorizing everyone as a leaver, we would have a 100% True Positive rate but also a 100% False Positive rate (because every stayer is predicted to leave regardless of the model output).
The middle values will give us a feel for the tradeoffs involved in changing cutoffs.
As we progress through the different cutoff values, we see that we move along a curve. Note that at each point of the curve we have a different False Positive and True Positive rate.
For example, in a typical model a cutoff of .5 might give a False Positive Rate of roughly 18% (x axis) and a True Positive Rate of 42% (y axis), while a looser cutoff of .3 might trade that for roughly a 50% False Positive Rate and an 80% True Positive Rate. (With our leaked model, all three plotted cutoffs collapse onto the ideal top-left corner.)
Ideally, we would like to see our curve shoot up sharply with small changes in the cutoff value. That is, we would like to see a huge increase in catching the True Positives while limiting the number of False Positives.
The absolute ideal (if unachievable) reference point is the green dot, which represents perfect prediction: all True Positives and no False Positives.
We want to get our ROC curves close to that dotted line shape when evaluating and developing our models. We also want our chosen cutoff value along that ROC curve to be as close to that green dot as possible.
The diagonal line stretching across the figure represents a model with no predictive power.
The better your model, the closer it gets to that ideal step-like dotted line. The worse your model, the closer it gets to the diagonal.
Now that you understand the principles of thresholds and their impact on False Positive and True Positive rate, let’s build complete ROC curves using our models.
The only difference between these complete curves and the simplified example above is that we will test every possible threshold we can using the probability values produced by our model. Those probabilities produced by our models represent the possible cutoff values.
Fortunately, some smart people came up with functions that make the calculations and plots easy.
I have created a function that uses the basic ROCR setup but adds a few other details for this primer.
We’ll start with our glm.fit model from the logistic regression along with our test dataset and the specific outcomes.
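For reference, the core ROCR recipe my function wraps looks roughly like this (a minimal sketch; the object names here are illustrative):
library(ROCR)
# prediction() pairs each predicted probability with its observed label;
# performance() then computes TPR and FPR at every distinct cutoff
pred_obj <- prediction(glm.fit_test, test$Attrition)
roc_perf <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(roc_perf, main = "ROC Curve: Logistic Regression")
abline(coef = c(0, 1), lty = 2) # the no-predictive-power diagonal
# Area under the curve (AUC)
performance(pred_obj, measure = "auc")@y.values[[1]]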