This is the capstone project submitted during the Data Analytics Internship under the guidance of Prof. Sameer Mathur.

Introduction

Objective : To investigate how objective, company-related factors influence employee attrition, and what kind of working environment is most likely to cause attrition.

Dataset : I am using a dataset published by IBM for my analysis. The dataset contains 35 variables, including the Attrition variable. It can be downloaded from the following link

Link- https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/

Methodology : In my exploratory analysis I shall look at all the variables through plots and draw inferences from them. Through this exploration I shall try to identify the variables that tend to have an impact on the attrition of the most experienced and talented employees, and then try to fit a linear regression model and use it to test hypotheses and draw inferences.

Loading the data

Loading the data to look at different Variables in the dataset

setwd("C:/Users/Rengan/Desktop/Coursera/Files")
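# stringsAsFactors = TRUE keeps text columns as factors (this stopped being the default in R 4.0.0)
HR.data <- read.csv("HR-Em.csv", stringsAsFactors = TRUE)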
names(HR.data)
##  [1] "ï..Age"                   "Attrition"               
##  [3] "BusinessTravel"           "DailyRate"               
##  [5] "Department"               "DistanceFromHome"        
##  [7] "Education"                "EducationField"          
##  [9] "EmployeeCount"            "EmployeeNumber"          
## [11] "EnvironmentSatisfaction"  "Gender"                  
## [13] "HourlyRate"               "JobInvolvement"          
## [15] "JobLevel"                 "JobRole"                 
## [17] "JobSatisfaction"          "MaritalStatus"           
## [19] "MonthlyIncome"            "MonthlyRate"             
## [21] "NumCompaniesWorked"       "Over18"                  
## [23] "OverTime"                 "PercentSalaryHike"       
## [25] "PerformanceRating"        "RelationshipSatisfaction"
## [27] "StandardHours"            "StockOptionLevel"        
## [29] "TotalWorkingYears"        "TrainingTimesLastYear"   
## [31] "WorkLifeBalance"          "YearsAtCompany"          
## [33] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [35] "YearsWithCurrManager"
colnames(HR.data)[1] <- "Age"
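
The garbled first column name ("ï..Age") is the usual symptom of a UTF-8 byte-order mark (BOM) at the start of the CSV file. Assuming that is the cause here, the name can be read correctly in the first place instead of being patched afterwards. The sketch below uses a small temporary file with made-up rows, not the IBM file itself.

```r
# Write a tiny CSV that starts with a UTF-8 byte-order mark (BOM).
# This is toy data standing in for the real file.
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)                     # the three BOM bytes
writeBin(charToRaw("Age,Attrition\n41,Yes\n49,No\n"), con)   # header plus two rows
close(con)

# fileEncoding = "UTF-8-BOM" asks R to strip the BOM while decoding the file,
# so the first column name arrives clean.
toy <- read.csv(tmp, fileEncoding = "UTF-8-BOM", stringsAsFactors = TRUE)
names(toy)
```

With this option the later `colnames(...)[1] <- "Age"` step would no longer be needed.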
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
glimpse(HR.data)
## Observations: 1,470
## Variables: 35
## $ Age                      <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ Attrition                <fctr> Yes, No, Yes, No, No, No, No, No, No...
## $ BusinessTravel           <fctr> Travel_Rarely, Travel_Frequently, Tr...
## $ DailyRate                <int> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department               <fctr> Sales, Research & Development, Resea...
## $ DistanceFromHome         <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education                <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1...
## $ EducationField           <fctr> Life Sciences, Life Sciences, Other,...
## $ EmployeeCount            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber           <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14,...
## $ EnvironmentSatisfaction  <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1...
## $ Gender                   <fctr> Female, Male, Male, Female, Male, Ma...
## $ HourlyRate               <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement           <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3...
## $ JobLevel                 <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole                  <fctr> Sales Executive, Research Scientist,...
## $ JobSatisfaction          <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3...
## $ MaritalStatus            <fctr> Single, Married, Single, Married, Ma...
## $ MonthlyIncome            <int> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate              <int> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked       <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ Over18                   <fctr> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ OverTime                 <fctr> Yes, No, Yes, Yes, No, No, Yes, No, ...
## $ PercentSalaryHike        <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating        <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3...
## $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4...
## $ StandardHours            <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel         <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears        <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear    <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance          <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2...
## $ YearsAtCompany           <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole       <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion  <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager     <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...
summary(HR.data)
##       Age        Attrition            BusinessTravel   DailyRate     
##  Min.   :18.00   No :1233   Non-Travel       : 150   Min.   : 102.0  
##  1st Qu.:30.00   Yes: 237   Travel_Frequently: 277   1st Qu.: 465.0  
##  Median :36.00              Travel_Rarely    :1043   Median : 802.0  
##  Mean   :36.92                                       Mean   : 802.5  
##  3rd Qu.:43.00                                       3rd Qu.:1157.0  
##  Max.   :60.00                                       Max.   :1499.0  
##                                                                      
##                   Department  DistanceFromHome   Education    
##  Human Resources       : 63   Min.   : 1.000   Min.   :1.000  
##  Research & Development:961   1st Qu.: 2.000   1st Qu.:2.000  
##  Sales                 :446   Median : 7.000   Median :3.000  
##                               Mean   : 9.193   Mean   :2.913  
##                               3rd Qu.:14.000   3rd Qu.:4.000  
##                               Max.   :29.000   Max.   :5.000  
##                                                               
##           EducationField EmployeeCount EmployeeNumber  
##  Human Resources : 27    Min.   :1     Min.   :   1.0  
##  Life Sciences   :606    1st Qu.:1     1st Qu.: 491.2  
##  Marketing       :159    Median :1     Median :1020.5  
##  Medical         :464    Mean   :1     Mean   :1024.9  
##  Other           : 82    3rd Qu.:1     3rd Qu.:1555.8  
##  Technical Degree:132    Max.   :1     Max.   :2068.0  
##                                                        
##  EnvironmentSatisfaction    Gender      HourlyRate     JobInvolvement
##  Min.   :1.000           Female:588   Min.   : 30.00   Min.   :1.00  
##  1st Qu.:2.000           Male  :882   1st Qu.: 48.00   1st Qu.:2.00  
##  Median :3.000                        Median : 66.00   Median :3.00  
##  Mean   :2.722                        Mean   : 65.89   Mean   :2.73  
##  3rd Qu.:4.000                        3rd Qu.: 83.75   3rd Qu.:3.00  
##  Max.   :4.000                        Max.   :100.00   Max.   :4.00  
##                                                                      
##     JobLevel                          JobRole    JobSatisfaction
##  Min.   :1.000   Sales Executive          :326   Min.   :1.000  
##  1st Qu.:1.000   Research Scientist       :292   1st Qu.:2.000  
##  Median :2.000   Laboratory Technician    :259   Median :3.000  
##  Mean   :2.064   Manufacturing Director   :145   Mean   :2.729  
##  3rd Qu.:3.000   Healthcare Representative:131   3rd Qu.:4.000  
##  Max.   :5.000   Manager                  :102   Max.   :4.000  
##                  (Other)                  :215                  
##   MaritalStatus MonthlyIncome    MonthlyRate    NumCompaniesWorked
##  Divorced:327   Min.   : 1009   Min.   : 2094   Min.   :0.000     
##  Married :673   1st Qu.: 2911   1st Qu.: 8047   1st Qu.:1.000     
##  Single  :470   Median : 4919   Median :14236   Median :2.000     
##                 Mean   : 6503   Mean   :14313   Mean   :2.693     
##                 3rd Qu.: 8379   3rd Qu.:20462   3rd Qu.:4.000     
##                 Max.   :19999   Max.   :26999   Max.   :9.000     
##                                                                   
##  Over18   OverTime   PercentSalaryHike PerformanceRating
##  Y:1470   No :1054   Min.   :11.00     Min.   :3.000    
##           Yes: 416   1st Qu.:12.00     1st Qu.:3.000    
##                      Median :14.00     Median :3.000    
##                      Mean   :15.21     Mean   :3.154    
##                      3rd Qu.:18.00     3rd Qu.:3.000    
##                      Max.   :25.00     Max.   :4.000    
##                                                         
##  RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
##  Min.   :1.000            Min.   :80    Min.   :0.0000   Min.   : 0.00    
##  1st Qu.:2.000            1st Qu.:80    1st Qu.:0.0000   1st Qu.: 6.00    
##  Median :3.000            Median :80    Median :1.0000   Median :10.00    
##  Mean   :2.712            Mean   :80    Mean   :0.7939   Mean   :11.28    
##  3rd Qu.:4.000            3rd Qu.:80    3rd Qu.:1.0000   3rd Qu.:15.00    
##  Max.   :4.000            Max.   :80    Max.   :3.0000   Max.   :40.00    
##                                                                           
##  TrainingTimesLastYear WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :0.000         Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :3.000         Median :3.000   Median : 5.000   Median : 3.000    
##  Mean   :2.799         Mean   :2.761   Mean   : 7.008   Mean   : 4.229    
##  3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.: 7.000    
##  Max.   :6.000         Max.   :4.000   Max.   :40.000   Max.   :18.000    
##                                                                           
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 1.000          Median : 3.000      
##  Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :15.000          Max.   :17.000      
## 

As we see in the Data:

Observations: 1,470 with Variables: 35

Employee Count is equal to 1 for all observations, so it cannot contribute any useful information for this sample. Over18 is equal to ‘Y’ for every employee, which means no employee is younger than 18. Both will therefore be removed from the data set.

Moreover, Standard Hours is equal to 80 for all observations, so the decision for this attribute is the same as for Over18 and Employee Count. BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus and OverTime are categorical; the other variables are stored as integers.

Some of the variables are related to years of working, which makes them good candidates for feature generation. Some variables are related to personal factors, like WorkLifeBalance, RelationshipSatisfaction, JobSatisfaction, EnvironmentSatisfaction, etc.

There are some variables that are related to income, like MonthlyIncome, PercentSalaryHike, etc.

EmployeeNumber is a variable identifying the specific employee. If we had more information about the employees and the structure of the employee number, we could extract some new features from it. That is not possible here, so we have to remove it from the data set.
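
The constant columns described above were spotted by reading the summary. As a minimal sketch, the same check can be done programmatically by counting distinct values per column. The data frame below is a made-up four-row example, and the names `toy`, `constant_cols` and `toy_clean` are illustrative only.

```r
# Toy data frame with two constant columns (illustrative values only).
toy <- data.frame(
  Age           = c(41, 49, 37, 33),
  EmployeeCount = c(1, 1, 1, 1),
  Over18        = factor(c("Y", "Y", "Y", "Y")),
  Attrition     = factor(c("Yes", "No", "Yes", "No"))
)

# A column with a single distinct value carries no information.
n_distinct_values <- vapply(toy, function(col) length(unique(col)), integer(1))
constant_cols <- names(n_distinct_values)[n_distinct_values == 1]
constant_cols

# Drop the constant columns by name.
toy_clean <- toy[, !(names(toy) %in% constant_cols), drop = FALSE]
names(toy_clean)
```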

cat("Thus Data Set has ",dim(HR.data)[1], " Rows and ", dim(HR.data)[2], " Columns" )
## Thus Data Set has  1470  Rows and  35  Columns

Checking for missing values and removing non-informative attributes

apply(is.na(HR.data), 2, sum)
##                      Age                Attrition           BusinessTravel 
##                        0                        0                        0 
##                DailyRate               Department         DistanceFromHome 
##                        0                        0                        0 
##                Education           EducationField            EmployeeCount 
##                        0                        0                        0 
##           EmployeeNumber  EnvironmentSatisfaction                   Gender 
##                        0                        0                        0 
##               HourlyRate           JobInvolvement                 JobLevel 
##                        0                        0                        0 
##                  JobRole          JobSatisfaction            MaritalStatus 
##                        0                        0                        0 
##            MonthlyIncome              MonthlyRate       NumCompaniesWorked 
##                        0                        0                        0 
##                   Over18                 OverTime        PercentSalaryHike 
##                        0                        0                        0 
##        PerformanceRating RelationshipSatisfaction            StandardHours 
##                        0                        0                        0 
##         StockOptionLevel        TotalWorkingYears    TrainingTimesLastYear 
##                        0                        0                        0 
##          WorkLifeBalance           YearsAtCompany       YearsInCurrentRole 
##                        0                        0                        0 
##  YearsSinceLastPromotion     YearsWithCurrManager 
##                        0                        0
HR.data$EmployeeNumber<- NULL
HR.data$StandardHours <- NULL
HR.data$Over18 <- NULL
HR.data$EmployeeCount <- NULL
cat("Data Set has ",dim(HR.data)[1], " Rows and ", dim(HR.data)[2], " Columns" )
## Data Set has  1470  Rows and  31  Columns
sum(duplicated(HR.data))  # number of rows that exactly repeat an earlier row
## [1] 0

There are no missing values and no duplicate rows, so I am lucky.

Some attributes are categorical but are stored as integers in the data set. We have to convert them to factors.

HR.data$Education <- factor(HR.data$Education)
HR.data$EnvironmentSatisfaction <- factor(HR.data$EnvironmentSatisfaction)
HR.data$JobInvolvement <- factor(HR.data$JobInvolvement)
HR.data$JobLevel <- factor(HR.data$JobLevel)
HR.data$JobSatisfaction <- factor(HR.data$JobSatisfaction)
HR.data$PerformanceRating <- factor(HR.data$PerformanceRating)
HR.data$RelationshipSatisfaction <- factor(HR.data$RelationshipSatisfaction)
HR.data$StockOptionLevel <- factor(HR.data$StockOptionLevel)
HR.data$WorkLifeBalance <- factor(HR.data$WorkLifeBalance)
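
The nine conversions above all follow the same pattern, so they could also be written as a single step over a vector of column names. This is only a sketch on a made-up data frame; `toy` and `ordinal_cols` are illustrative names.

```r
# Toy data frame: two coded rating columns and one genuinely numeric column.
toy <- data.frame(
  Education       = c(2L, 1L, 4L, 3L),
  JobSatisfaction = c(4L, 2L, 3L, 3L),
  MonthlyIncome   = c(5993L, 5130L, 2090L, 2909L)
)

# Name the columns that hold category codes, then convert them all at once.
ordinal_cols <- c("Education", "JobSatisfaction")
toy[ordinal_cols] <- lapply(toy[ordinal_cols], factor)

str(toy)  # the two rating columns are now factors; MonthlyIncome is untouched
```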

Visualization of Attrition

library(ggplot2)
HR.data %>%
        group_by(Attrition) %>%
        tally() %>%
        ggplot(aes(x = Attrition, y = n,fill=Attrition)) +
        geom_bar(stat = "identity") +
        theme_minimal()+
        labs(x="Attrition", y="Count of Attrition")+
        ggtitle("Attrition")+
        geom_text(aes(label = n), vjust = -0.5, position = position_dodge(0.9))

prop.table(table(HR.data$Attrition))
## 
##        No       Yes 
## 0.8387755 0.1612245

Influence of each variable on the Attrition of the organization

library(ggplot2)
ggplot(data=HR.data, aes(x=Age)) + 
        geom_histogram(breaks=seq(18, 60, by=2),   # cover the full age range in the data (18 to 60)
                       col="red", 
                       aes(fill=after_stat(count)))+   # after_stat() needs ggplot2 >= 3.3.0
        labs(x="Age", y="Count")+
        scale_fill_gradient("Count", low="blue", high="dark green")

library(grid)
library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
travelPlot <- ggplot(HR.data,aes(BusinessTravel,fill=Attrition))+geom_bar()
depPlot <- ggplot(HR.data,aes(Department,fill = Attrition))+geom_bar()
distPlot <- ggplot(HR.data,aes(DistanceFromHome,fill=Attrition))+geom_bar()
grid.arrange(travelPlot,depPlot,distPlot, nrow=3)

Age: We see that the majority of employees are between 28 and 36 years old (especially 34-36).

Business Travel: Among the people who leave, most travel.

Department: Fewer of the employees who left come from the HR department, because HR makes up a small proportion of the organization.

Distance From Home: Contrary to normal assumptions, a majority of the employees who have left the organization live near the office.
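
Stacked count bars make it hard to separate "few people leave" from "few people are in this group", which matters for observations like the ones on Department and Business Travel. One way to check is to compute the attrition rate within each category. The sketch below uses eight made-up rows, not the real data.

```r
# Toy data: travel category and attrition flag for eight hypothetical employees.
toy <- data.frame(
  BusinessTravel = c("Non-Travel", "Non-Travel",
                     "Travel_Rarely", "Travel_Rarely", "Travel_Rarely", "Travel_Rarely",
                     "Travel_Frequently", "Travel_Frequently"),
  Attrition      = c("No", "No", "No", "No", "No", "Yes", "No", "Yes")
)

# Cross-tabulate, then convert each row to proportions (margin = 1),
# so every category sums to 1 regardless of its size.
counts <- table(toy$BusinessTravel, toy$Attrition)
rates  <- prop.table(counts, margin = 1)

# Share of leavers within each travel category.
round(rates[, "Yes"], 2)
```

In ggplot2, `geom_bar(position = "fill")` draws the same within-category proportions as a plot.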

eduPlot <- ggplot(HR.data,aes(Education,fill=Attrition))+geom_bar()
edufieldPlot <- ggplot(HR.data,aes(EducationField,fill=Attrition))+geom_bar()
envPlot <- ggplot(HR.data,aes(EnvironmentSatisfaction,fill=Attrition))+geom_bar()
genPlot <- ggplot(HR.data,aes(Gender,fill=Attrition))+geom_bar()
grid.arrange(distPlot,eduPlot,edufieldPlot,envPlot,genPlot,ncol=2)

Education: Very few employees with a doctorate leave, possibly because there are so few of them.

Education Field: Only a small minority of leavers were educated in HR, mainly because few employees have that background.

Environment Satisfaction: No distinguishable pattern.

Gender: The majority of separated employees are male, which might be because 60% of the employees in the dataset are male.

hourlyPlot <- ggplot(HR.data,aes(HourlyRate,fill=Attrition))+geom_bar()
jobInvPlot <- ggplot(HR.data,aes(JobInvolvement,fill=Attrition))+geom_bar()
jobLevelPlot <- ggplot(HR.data,aes(JobLevel,fill=Attrition))+geom_bar()
jobSatPlot <- ggplot(HR.data,aes(JobSatisfaction,fill=Attrition))+geom_bar()
grid.arrange(hourlyPlot,jobInvPlot,jobLevelPlot,jobSatPlot,ncol=2)

Hourly Rate: There seems to be no straightforward relation with the Daily Rate of the employees.

Job Involvement: The majority of employees who leave are either very highly involved or barely involved in their jobs.

Job Level: As Job Level increases, the number of people quitting decreases.

Job Satisfaction: Attrition is higher among the lower Job Satisfaction levels.

marPlot <- ggplot(HR.data,aes(MaritalStatus,fill=Attrition))+geom_bar()
monthlyIncPlot <- ggplot(HR.data,aes(MonthlyIncome,fill=Attrition))+geom_density()
monthlyRatePlot <- ggplot(HR.data,aes(MonthlyRate,fill=Attrition))+geom_density()
numCompPlot <- ggplot(HR.data,aes(NumCompaniesWorked,fill=Attrition))+geom_bar()
grid.arrange(marPlot,monthlyIncPlot,monthlyRatePlot,numCompPlot,ncol=2)

Marital Status: Attrition is highest for Single employees and lowest for Divorced employees.

Monthly Income: We see higher levels of attrition in the lower segment of monthly income. Looked at in isolation, this might be due to dissatisfaction with the income received for the effort put in.

Monthly Rate: We don’t see any inferable trend here, and there is no straightforward relation with Monthly Income either.

Number of Companies Worked: We see a clear indication that many of the people who quit have worked at only one company before.
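
The Monthly Income observation is read off the density plot. A simple numeric companion is to compare a summary statistic, such as the median, between the two attrition groups. The sketch below uses six made-up incomes purely for illustration.

```r
# Toy data: hypothetical monthly incomes for leavers and stayers.
toy <- data.frame(
  MonthlyIncome = c(2000, 2500, 3000, 6000, 8000, 10000),
  Attrition     = c("Yes", "Yes", "Yes", "No", "No", "No")
)

# Median income within each attrition group.
median_income <- tapply(toy$MonthlyIncome, toy$Attrition, median)
median_income
```

The median is used here rather than the mean because income distributions are typically right-skewed.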

overTimePlot <- ggplot(HR.data,aes(OverTime,fill=Attrition))+geom_bar()
hikePlot <- ggplot(HR.data,aes(PercentSalaryHike,Attrition))+geom_point(size=4,alpha = 0.01)
perfPlot <- ggplot(HR.data,aes(PerformanceRating,fill = Attrition))+geom_bar()
RelSatPlot <- ggplot(HR.data,aes(RelationshipSatisfaction,fill = Attrition))+geom_bar()
grid.arrange(overTimePlot,hikePlot,perfPlot,RelSatPlot,ncol=2)

Over18: An insignificant variable, as all employees are above 18 years.

Over Time: A larger proportion of the employees who work overtime are quitting.

Percent Salary Hike: We see that people with a hike of less than 15% are more likely to leave.

Performance Rating: We have employees with ratings of 3 and 4 only. A smaller proportion of those rated 4 quit.

Standard Hours: The same for all employees and hence not a significant variable for us.

Stock Option Level: Larger proportions of levels 1 & 2 quit.

Total Working Years: We see larger proportions of people with 1 year of experience quitting the organization, and also of those in the 1-10 years bracket.

Training Times Last Year: This indicates the number of training interventions the employee has attended. People who have been trained 2-4 times are an area of concern.
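
Visual impressions such as the Over Time one can be backed by a test of association between two categorical variables. The sketch below runs a chi-squared test on a 2x2 table. The counts are hypothetical, chosen only to show the mechanics, and are not taken from the dataset.

```r
# Hypothetical 2x2 table of counts: rows are OverTime, columns are Attrition.
counts <- matrix(
  c(90, 10,    # OverTime = No : 90 stayed, 10 left
    60, 40),   # OverTime = Yes: 60 stayed, 40 left
  nrow = 2, byrow = TRUE,
  dimnames = list(OverTime = c("No", "Yes"), Attrition = c("No", "Yes"))
)

# Chi-squared test of independence between the two factors.
test <- chisq.test(counts)
test$p.value  # a small p-value suggests attrition is not independent of overtime
```

On the real data the table would come from `table(HR.data$OverTime, HR.data$Attrition)`.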

StockPlot <- ggplot(HR.data,aes(StockOptionLevel,fill = Attrition))+geom_bar()
workingYearsPlot <- ggplot(HR.data,aes(TotalWorkingYears,fill = Attrition))+geom_bar()
TrainTimesPlot <- ggplot(HR.data,aes(TrainingTimesLastYear,fill = Attrition))+geom_bar()
WLBPlot <- ggplot(HR.data,aes(WorkLifeBalance,fill = Attrition))+geom_bar()
grid.arrange(StockPlot,workingYearsPlot,TrainTimesPlot,WLBPlot,ncol=2)

Work Life Balance: As expected, a larger proportion of employees with a rating of 1 quit, but in absolute numbers ratings 2 & 3 are on the higher side.

Years at Company: A larger proportion of newcomers are quitting the organization, which undermines its recruitment efforts.

YearAtComPlot <- ggplot(HR.data,aes(YearsAtCompany,fill = Attrition))+geom_bar()
YearInCurrPlot <- ggplot(HR.data,aes(YearsInCurrentRole,fill = Attrition))+geom_bar()
YearsSinceProm <- ggplot(HR.data,aes(YearsSinceLastPromotion,fill = Attrition))+geom_bar()
YearsCurrManPlot <- ggplot(HR.data,aes(YearsWithCurrManager,fill = Attrition))+geom_bar()
grid.arrange(YearAtComPlot,YearInCurrPlot,YearsSinceProm,YearsCurrManPlot,ncol=2,top = "Fig 7")

Years In Current Role: The plot shows a larger proportion of those with 0 years in their role quitting. A role change may be a trigger for quitting.

Years Since Last Promotion: A larger proportion of people who have been promoted recently have quit the organization.

Years With Current Manager: As expected, a new manager is a big cause for quitting.

Correlation of Variables

We see a lot of correlation among the following variables:

Years at Company, Years in Curr Role, Years with Curr Manager & Years Since Last Promotion - I will consider ‘Years with Curr Manager’ for the model

Job Level & Monthly Income - I will consider ‘Job Level’

Percent Salary Hike & Performance Rating - I will consider ‘Percent Salary Hike’

library(corrplot)
## corrplot 0.84 loaded
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha
HR.data_cor <- HR.data

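# Convert every column to integer so cor() can be applied (factors become their level codes)
for(i in seq_len(ncol(HR.data_cor))){
        HR.data_cor[,i]<- as.integer(HR.data_cor[,i])
}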

corrplot(cor(HR.data_cor))
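
Reading strongly correlated pairs off a correlation plot is easy to get wrong with 31 variables. As a sketch, the pairs above a chosen threshold can be listed directly from the correlation matrix. The data frame is made up, and the 0.8 cutoff is an assumption for illustration, not a value used in the analysis.

```r
# Toy data: two tenure-like columns that move together and one unrelated column.
toy <- data.frame(
  YearsAtCompany       = c(1, 3, 5, 7, 10, 15),
  YearsWithCurrManager = c(0, 2, 4, 7, 8, 12),
  DistanceFromHome     = c(9, 2, 24, 1, 7, 3)
)

cor_mat <- cor(toy)

# Blank out the diagonal and upper triangle so each pair appears only once.
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA

# Reshape the matrix into one row per pair and keep the strong correlations.
cor_pairs <- as.data.frame(as.table(cor_mat), stringsAsFactors = FALSE)
names(cor_pairs) <- c("var1", "var2", "r")
high <- cor_pairs[!is.na(cor_pairs$r) & abs(cor_pairs$r) > 0.8, ]
high
```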

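Removing highly correlated variables and near-zero-variance variables

# Columns are picked by position in the 31-column data frame left after the removals above.
# Note: at these positions the subset keeps MonthlyIncome and all four Years* columns
# and drops JobLevel, which differs from the choices listed in the previous section.
HR <- HR.data[,c(2,3,5,7,8,11,12,14,15,16,17,18,21,23,24,26,28,29,30,31)]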
library(corrgram)
corrgram(HR, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main = "Corrgram of all  variables")
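
Positional indices shift whenever a column is dropped earlier in the script, so selecting by name is less fragile. The sketch below shows the idea on a made-up data frame. The columns listed in `drop_cols` are an illustrative assumption, not the author's final selection.

```r
# Toy data frame with a few of the dataset's column names (values are made up).
toy <- data.frame(
  Attrition            = factor(c("Yes", "No")),
  JobLevel             = c(2L, 1L),
  MonthlyIncome        = c(5993L, 2090L),
  YearsAtCompany       = c(6L, 0L),
  YearsWithCurrManager = c(5L, 0L)
)

# Name the columns to drop, then keep everything else.
drop_cols <- c("MonthlyIncome", "YearsAtCompany")
keep_cols <- setdiff(names(toy), drop_cols)
reduced   <- toy[, keep_cols, drop = FALSE]

names(reduced)
```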

I have thus analysed and broken down the variables and their effects on attrition in the organisation. As a future extension, machine learning algorithms like random forest and XGBoost could be used to build a model that predicts which employees are most likely to leave. I am stopping my analysis here, as those methods are beyond the scope of this two-week mini Capstone Project.