This is the capstone project submitted during the Data Analytics Internship under the guidance of Prof. Sameer Mathur.
Objective: To investigate how objective company factors influence employee attrition, and what kind of working environment is most likely to cause attrition.
Dataset: I am using a dataset published by IBM for my analysis. The dataset contains 35 variables, including the Attrition variable. It can be downloaded from the following link:
Link- https://www.ibm.com/communities/analytics/watson-analytics-blog/hr-employee-attrition/
Methodology: In my exploratory analysis I shall look at all variables through plots and draw inferences from them. Through this exploration I shall try to identify the variables that tend to have an impact on the attrition of the most experienced and talented employees, and then try to fit a linear regression model and use it to test hypotheses and draw inferences.
Loading the data to look at different Variables in the dataset
setwd("C:/Users/Rengan/Desktop/Coursera/Files")
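# stringsAsFactors = TRUE keeps text columns as factors, as in the output below (R >= 4.0 no longer does this by default)
HR.data <- read.csv("HR-Em.csv", stringsAsFactors = TRUE)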
names(HR.data)
## [1] "ï..Age" "Attrition"
## [3] "BusinessTravel" "DailyRate"
## [5] "Department" "DistanceFromHome"
## [7] "Education" "EducationField"
## [9] "EmployeeCount" "EmployeeNumber"
## [11] "EnvironmentSatisfaction" "Gender"
## [13] "HourlyRate" "JobInvolvement"
## [15] "JobLevel" "JobRole"
## [17] "JobSatisfaction" "MaritalStatus"
## [19] "MonthlyIncome" "MonthlyRate"
## [21] "NumCompaniesWorked" "Over18"
## [23] "OverTime" "PercentSalaryHike"
## [25] "PerformanceRating" "RelationshipSatisfaction"
## [27] "StandardHours" "StockOptionLevel"
## [29] "TotalWorkingYears" "TrainingTimesLastYear"
## [31] "WorkLifeBalance" "YearsAtCompany"
## [33] "YearsInCurrentRole" "YearsSinceLastPromotion"
## [35] "YearsWithCurrManager"
colnames(HR.data)[1] <- "Age"
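A first column name such as "ï..Age" is the usual symptom of a UTF-8 byte order mark (BOM) at the start of the CSV file. As an alternative to renaming the column by hand, the encoding can be declared when reading. The following is a minimal sketch on a hypothetical temporary file (the file contents and the name `demo` are assumptions for illustration, not part of the original analysis):

```r
# Hypothetical two-column file written with a UTF-8 byte order mark (BOM),
# mimicking how the header of the real CSV may have been stored.
tmp <- tempfile(fileext = ".csv")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("Age,Attrition\n41,Yes\n49,No\n")), tmp)

# Declaring the encoding lets read.csv() strip the BOM while reading,
# so the first column name comes out clean.
demo <- read.csv(tmp, fileEncoding = "UTF-8-BOM")
names(demo)
```

If this assumption about the file holds, the same `fileEncoding` argument could be passed when reading "HR-Em.csv", which would make the manual rename unnecessary.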
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
glimpse(HR.data)
## Observations: 1,470
## Variables: 35
## $ Age <int> 41, 49, 37, 33, 27, 32, 59, 30, 38, 3...
## $ Attrition <fctr> Yes, No, Yes, No, No, No, No, No, No...
## $ BusinessTravel <fctr> Travel_Rarely, Travel_Frequently, Tr...
## $ DailyRate <int> 1102, 279, 1373, 1392, 591, 1005, 132...
## $ Department <fctr> Sales, Research & Development, Resea...
## $ DistanceFromHome <int> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, ...
## $ Education <int> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1...
## $ EducationField <fctr> Life Sciences, Life Sciences, Other,...
## $ EmployeeCount <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ EmployeeNumber <int> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14,...
## $ EnvironmentSatisfaction <int> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1...
## $ Gender <fctr> Female, Male, Male, Female, Male, Ma...
## $ HourlyRate <int> 94, 61, 92, 56, 40, 79, 81, 67, 44, 9...
## $ JobInvolvement <int> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3...
## $ JobLevel <int> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1...
## $ JobRole <fctr> Sales Executive, Research Scientist,...
## $ JobSatisfaction <int> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3...
## $ MaritalStatus <fctr> Single, Married, Single, Married, Ma...
## $ MonthlyIncome <int> 5993, 5130, 2090, 2909, 3468, 3068, 2...
## $ MonthlyRate <int> 19479, 24907, 2396, 23159, 16632, 118...
## $ NumCompaniesWorked <int> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1...
## $ Over18 <fctr> Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y, ...
## $ OverTime <fctr> Yes, No, Yes, Yes, No, No, Yes, No, ...
## $ PercentSalaryHike <int> 11, 23, 15, 11, 12, 13, 20, 22, 21, 1...
## $ PerformanceRating <int> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3...
## $ RelationshipSatisfaction <int> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4...
## $ StandardHours <int> 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
## $ StockOptionLevel <int> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1...
## $ TotalWorkingYears <int> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, ...
## $ TrainingTimesLastYear <int> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1...
## $ WorkLifeBalance <int> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2...
## $ YearsAtCompany <int> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, ...
## $ YearsInCurrentRole <int> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2...
## $ YearsSinceLastPromotion <int> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4...
## $ YearsWithCurrManager <int> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3...
summary(HR.data)
## Age Attrition BusinessTravel DailyRate
## Min. :18.00 No :1233 Non-Travel : 150 Min. : 102.0
## 1st Qu.:30.00 Yes: 237 Travel_Frequently: 277 1st Qu.: 465.0
## Median :36.00 Travel_Rarely :1043 Median : 802.0
## Mean :36.92 Mean : 802.5
## 3rd Qu.:43.00 3rd Qu.:1157.0
## Max. :60.00 Max. :1499.0
##
## Department DistanceFromHome Education
## Human Resources : 63 Min. : 1.000 Min. :1.000
## Research & Development:961 1st Qu.: 2.000 1st Qu.:2.000
## Sales :446 Median : 7.000 Median :3.000
## Mean : 9.193 Mean :2.913
## 3rd Qu.:14.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000
##
## EducationField EmployeeCount EmployeeNumber
## Human Resources : 27 Min. :1 Min. : 1.0
## Life Sciences :606 1st Qu.:1 1st Qu.: 491.2
## Marketing :159 Median :1 Median :1020.5
## Medical :464 Mean :1 Mean :1024.9
## Other : 82 3rd Qu.:1 3rd Qu.:1555.8
## Technical Degree:132 Max. :1 Max. :2068.0
##
## EnvironmentSatisfaction Gender HourlyRate JobInvolvement
## Min. :1.000 Female:588 Min. : 30.00 Min. :1.00
## 1st Qu.:2.000 Male :882 1st Qu.: 48.00 1st Qu.:2.00
## Median :3.000 Median : 66.00 Median :3.00
## Mean :2.722 Mean : 65.89 Mean :2.73
## 3rd Qu.:4.000 3rd Qu.: 83.75 3rd Qu.:3.00
## Max. :4.000 Max. :100.00 Max. :4.00
##
## JobLevel JobRole JobSatisfaction
## Min. :1.000 Sales Executive :326 Min. :1.000
## 1st Qu.:1.000 Research Scientist :292 1st Qu.:2.000
## Median :2.000 Laboratory Technician :259 Median :3.000
## Mean :2.064 Manufacturing Director :145 Mean :2.729
## 3rd Qu.:3.000 Healthcare Representative:131 3rd Qu.:4.000
## Max. :5.000 Manager :102 Max. :4.000
## (Other) :215
## MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked
## Divorced:327 Min. : 1009 Min. : 2094 Min. :0.000
## Married :673 1st Qu.: 2911 1st Qu.: 8047 1st Qu.:1.000
## Single :470 Median : 4919 Median :14236 Median :2.000
## Mean : 6503 Mean :14313 Mean :2.693
## 3rd Qu.: 8379 3rd Qu.:20462 3rd Qu.:4.000
## Max. :19999 Max. :26999 Max. :9.000
##
## Over18 OverTime PercentSalaryHike PerformanceRating
## Y:1470 No :1054 Min. :11.00 Min. :3.000
## Yes: 416 1st Qu.:12.00 1st Qu.:3.000
## Median :14.00 Median :3.000
## Mean :15.21 Mean :3.154
## 3rd Qu.:18.00 3rd Qu.:3.000
## Max. :25.00 Max. :4.000
##
## RelationshipSatisfaction StandardHours StockOptionLevel TotalWorkingYears
## Min. :1.000 Min. :80 Min. :0.0000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:80 1st Qu.:0.0000 1st Qu.: 6.00
## Median :3.000 Median :80 Median :1.0000 Median :10.00
## Mean :2.712 Mean :80 Mean :0.7939 Mean :11.28
## 3rd Qu.:4.000 3rd Qu.:80 3rd Qu.:1.0000 3rd Qu.:15.00
## Max. :4.000 Max. :80 Max. :3.0000 Max. :40.00
##
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
##
## YearsSinceLastPromotion YearsWithCurrManager
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 2.000
## Median : 1.000 Median : 3.000
## Mean : 2.188 Mean : 4.123
## 3rd Qu.: 3.000 3rd Qu.: 7.000
## Max. :15.000 Max. :17.000
##
As we see in the data:
There are 1,470 observations of 35 variables.
EmployeeCount is equal to 1 for all observations, so it cannot provide any useful information for this sample. Over18 is equal to ‘Y’ for all observations, which means no employee is younger than 18. Since both are constant in this data set, we will remove them.
Moreover, StandardHours is equal to 80 for all observations, so it is removed for the same reason as Over18 and EmployeeCount. BusinessTravel, Department, EducationField, Gender, JobRole, MaritalStatus and OverTime are categorical; the other variables are numeric (stored as integers).
Some variables are related to years of working, which makes them good candidates for feature generation. Some variables are related to personal matters, such as WorkLifeBalance, RelationshipSatisfaction, JobSatisfaction and EnvironmentSatisfaction.
Some variables are related to income, such as MonthlyIncome and PercentSalaryHike.
EmployeeNumber is an identifier for each employee. If we had more information about the employees and the structure of the employee number, we could extract new features from it. That is not possible here, so we remove it from the data set.
cat("Thus Data Set has ",dim(HR.data)[1], " Rows and ", dim(HR.data)[2], " Columns" )
## Thus Data Set has 1470 Rows and 35 Columns
apply(is.na(HR.data), 2, sum)
## Age Attrition BusinessTravel
## 0 0 0
## DailyRate Department DistanceFromHome
## 0 0 0
## Education EducationField EmployeeCount
## 0 0 0
## EmployeeNumber EnvironmentSatisfaction Gender
## 0 0 0
## HourlyRate JobInvolvement JobLevel
## 0 0 0
## JobRole JobSatisfaction MaritalStatus
## 0 0 0
## MonthlyIncome MonthlyRate NumCompaniesWorked
## 0 0 0
## Over18 OverTime PercentSalaryHike
## 0 0 0
## PerformanceRating RelationshipSatisfaction StandardHours
## 0 0 0
## StockOptionLevel TotalWorkingYears TrainingTimesLastYear
## 0 0 0
## WorkLifeBalance YearsAtCompany YearsInCurrentRole
## 0 0 0
## YearsSinceLastPromotion YearsWithCurrManager
## 0 0
HR.data$EmployeeNumber<- NULL
HR.data$StandardHours <- NULL
HR.data$Over18 <- NULL
HR.data$EmployeeCount <- NULL
cat("Data Set has ",dim(HR.data)[1], " Rows and ", dim(HR.data)[2], " Columns" )
## Data Set has 1470 Rows and 31 Columns
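Constant columns such as EmployeeCount, Over18 and StandardHours can also be found programmatically instead of by reading the summary. The sketch below uses a small toy data frame with made-up values (the names `toy`, `is_constant` and `toy_reduced` are assumptions for illustration only):

```r
# Toy data frame (hypothetical values) with two constant columns.
toy <- data.frame(
  Age           = c(41, 49, 37),
  EmployeeCount = c(1, 1, 1),
  Over18        = c("Y", "Y", "Y"),
  Attrition     = c("Yes", "No", "Yes")
)

# A column is constant when it holds exactly one distinct value.
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))
names(toy)[is_constant]

# Drop all constant columns in one step.
toy_reduced <- toy[, !is_constant, drop = FALSE]
names(toy_reduced)
```

Identifier columns like EmployeeNumber are not constant, so they would still need to be removed by name.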
sum(is.na(duplicated(HR.data)))
## [1] 0
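Note that `duplicated()` returns a logical vector that never contains `NA`, so wrapping it in `is.na()` yields 0 regardless of the data. To count duplicated rows, the logical vector is usually summed directly. A minimal sketch on toy data (the values and the names `toy`, `dup` and `n_dup` are assumptions for illustration):

```r
# Toy data frame (hypothetical values); row 3 repeats row 1.
toy <- data.frame(
  Age       = c(41, 49, 41),
  Attrition = c("Yes", "No", "Yes")
)

# duplicated() flags every row that repeats an earlier row.
dup <- duplicated(toy)

# Summing the flags counts the duplicated rows.
n_dup <- sum(dup)
n_dup

# The flags are TRUE/FALSE only, so an is.na() check on them is always FALSE.
anyNA(dup)
```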
Some attributes are categorical but are stored as integers in the data set. We have to convert them to factors.
HR.data$Education <- factor(HR.data$Education)
HR.data$EnvironmentSatisfaction <- factor(HR.data$EnvironmentSatisfaction)
HR.data$JobInvolvement <- factor(HR.data$JobInvolvement)
HR.data$JobLevel <- factor(HR.data$JobLevel)
HR.data$JobSatisfaction <- factor(HR.data$JobSatisfaction)
HR.data$PerformanceRating <- factor(HR.data$PerformanceRating)
HR.data$RelationshipSatisfaction <- factor(HR.data$RelationshipSatisfaction)
HR.data$StockOptionLevel <- factor(HR.data$StockOptionLevel)
HR.data$WorkLifeBalance <- factor(HR.data$WorkLifeBalance)
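The nine conversions above follow the same pattern, so they could also be written as one step over a vector of column names. This is only a sketch on a toy data frame (the values and the names `toy` and `ordinal_cols` are assumptions for illustration):

```r
# Toy data frame (hypothetical values) with two integer-coded ratings.
toy <- data.frame(
  Education = c(2L, 1L, 4L),
  JobLevel  = c(2L, 2L, 1L),
  Age       = c(41L, 49L, 37L)
)

# Columns that are really categories, listed once by name.
ordinal_cols <- c("Education", "JobLevel")

# Convert all listed columns to factors; other columns stay untouched.
toy[ordinal_cols] <- lapply(toy[ordinal_cols], factor)
str(toy)
```

Listing the columns by name keeps the set of converted variables visible in one place.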
library(ggplot2)
HR.data %>%
  group_by(Attrition) %>%
  tally() %>%
  ggplot(aes(x = Attrition, y = n, fill = Attrition)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(x = "Attrition", y = "Count of Attrition") +
  ggtitle("Attrition") +
  geom_text(aes(label = n), vjust = -0.5, position = position_dodge(0.9))
prop.table(table(HR.data$Attrition))
##
## No Yes
## 0.8387755 0.1612245
library(ggplot2)
# The bins span ages 20 to 50 only; employees outside that range are not drawn.
ggplot(data = HR.data, aes(x = Age)) +
  geom_histogram(breaks = seq(20, 50, by = 2),
                 col = "red",
                 aes(fill = ..count..)) +
  labs(x = "Age", y = "Count") +
  scale_fill_gradient("Count", low = "blue", high = "dark green")
library(grid)
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
travelPlot <- ggplot(HR.data,aes(BusinessTravel,fill=Attrition))+geom_bar()
depPlot <- ggplot(HR.data,aes(Department,fill = Attrition))+geom_bar()
distPlot <- ggplot(HR.data,aes(DistanceFromHome,fill=Attrition))+geom_bar()
grid.arrange(travelPlot,depPlot,distPlot, nrow=3)
Age: The majority of employees are between 28 and 36 years old (especially 34-36).
Business Travel: Among the people who leave, most travel.
Department: Few of the employees who left come from the HR department, because HR makes up a small share of the organization.
Distance From Home: Contrary to normal assumptions, a majority of the employees who have left the organization live near the office.
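Stacked count bars make it hard to separate group size from attrition rate, which matters for small groups such as HR. One way to compare groups fairly is to compute the share of leavers within each group. The sketch below uses a toy data frame with made-up values (the names `toy`, `counts` and `rates` are assumptions for illustration, and the numbers are not taken from the IBM data):

```r
# Toy data (hypothetical): every department has exactly one leaver,
# but the departments differ in size.
toy <- data.frame(
  Department = c("HR", "HR",
                 "Sales", "Sales", "Sales", "Sales",
                 "R&D", "R&D", "R&D", "R&D"),
  Attrition  = c("Yes", "No",
                 "Yes", "No", "No", "No",
                 "Yes", "No", "No", "No")
)

# Absolute counts per department and attrition status.
counts <- table(toy$Department, toy$Attrition)

# Row-wise proportions: the attrition rate within each department.
rates <- prop.table(counts, margin = 1)
round(rates[, "Yes"], 2)
```

In this toy example the absolute number of leavers is identical across departments, yet the rate in the smallest group is twice as high, which is the kind of difference the count plots can hide.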
eduPlot <- ggplot(HR.data,aes(Education,fill=Attrition))+geom_bar()
edufieldPlot <- ggplot(HR.data,aes(EducationField,fill=Attrition))+geom_bar()
envPlot <- ggplot(HR.data,aes(EnvironmentSatisfaction,fill=Attrition))+geom_bar()
genPlot <- ggplot(HR.data,aes(Gender,fill=Attrition))+geom_bar()
grid.arrange(distPlot,eduPlot,edufieldPlot,envPlot,genPlot,ncol=2)
Education: Very few employees with a doctorate leave, possibly because there are few of them.
Education Field: Only a small number of employees with an HR education leave, mainly because there are few of them.
Environment Satisfaction: No distinguishable pattern.
Gender: The majority of separated employees are male, possibly because 60% of the employees in the dataset are male.
hourlyPlot <- ggplot(HR.data,aes(HourlyRate,fill=Attrition))+geom_bar()
jobInvPlot <- ggplot(HR.data,aes(JobInvolvement,fill=Attrition))+geom_bar()
jobLevelPlot <- ggplot(HR.data,aes(JobLevel,fill=Attrition))+geom_bar()
jobSatPlot <- ggplot(HR.data,aes(JobSatisfaction,fill=Attrition))+geom_bar()
grid.arrange(hourlyPlot,jobInvPlot,jobLevelPlot,jobSatPlot,ncol=2)
Hourly Rate: There seems to be no straightforward relation with the daily rate of the employees.
Job Involvement: The majority of employees who leave are either very highly involved or little involved in their jobs.
Job Level: As the job level increases, the number of people quitting decreases.
Job Satisfaction: Attrition is higher among the lower job satisfaction levels.
marPlot <- ggplot(HR.data,aes(MaritalStatus,fill=Attrition))+geom_bar()
monthlyIncPlot <- ggplot(HR.data,aes(MonthlyIncome,fill=Attrition))+geom_density()
monthlyRatePlot <- ggplot(HR.data,aes(MonthlyRate,fill=Attrition))+geom_density()
numCompPlot <- ggplot(HR.data,aes(NumCompaniesWorked,fill=Attrition))+geom_bar()
grid.arrange(marPlot,monthlyIncPlot,monthlyRatePlot,numCompPlot,ncol=2)
Marital Status: Attrition is highest for single employees and lowest for divorced employees.
Monthly Income: We see higher levels of attrition in the lower segment of monthly income. Looked at in isolation, this might be due to dissatisfaction with the income relative to the effort put in.
Monthly Rate: We don’t see any inferable trend here, and there is no straightforward relation with Monthly Income.
Number of Companies Worked: There is a clear indication that many of the people who quit have worked at only one company before.
overTimePlot <- ggplot(HR.data,aes(OverTime,fill=Attrition))+geom_bar()
hikePlot <- ggplot(HR.data,aes(PercentSalaryHike,Attrition))+geom_point(size=4,alpha = 0.01)
perfPlot <- ggplot(HR.data,aes(PerformanceRating,fill = Attrition))+geom_bar()
RelSatPlot <- ggplot(HR.data,aes(RelationshipSatisfaction,fill = Attrition))+geom_bar()
grid.arrange(overTimePlot,hikePlot,perfPlot,RelSatPlot,ncol=2)
Over18: An insignificant variable, as all employees are over 18.
Over Time: A larger proportion of employees who work overtime are quitting.
Percent Salary Hike: People with a hike of less than 15% are more likely to leave.
Performance Rating: Only ratings 3 and 4 occur. A smaller proportion of employees rated 4 quit.
Standard Hours: The same for all employees and hence not a significant variable for us.
Stock Option Level: Larger proportions of levels 1 & 2 quit.
Total Working Years: Larger proportions of people with 1 year of experience are quitting the organization, and more generally those in the 1-10 year bracket.
Training Times Last Year: This indicates the number of training interventions the employee has attended. People who have been trained 2-4 times are an area of concern.
StockPlot <- ggplot(HR.data,aes(StockOptionLevel,fill = Attrition))+geom_bar()
workingYearsPlot <- ggplot(HR.data,aes(TotalWorkingYears,fill = Attrition))+geom_bar()
TrainTimesPlot <- ggplot(HR.data,aes(TrainingTimesLastYear,fill = Attrition))+geom_bar()
WLBPlot <- ggplot(HR.data,aes(WorkLifeBalance,fill = Attrition))+geom_bar()
grid.arrange(StockPlot,workingYearsPlot,TrainTimesPlot,WLBPlot,ncol=2)
Work Life Balance: As expected, a larger proportion of employees with a rating of 1 quit, but in absolute numbers ratings 2 & 3 are higher.
Years at Company: A larger proportion of newcomers are quitting the organization, which undermines its recruitment efforts.
YearAtComPlot <- ggplot(HR.data,aes(YearsAtCompany,fill = Attrition))+geom_bar()
YearInCurrPlot <- ggplot(HR.data,aes(YearsInCurrentRole,fill = Attrition))+geom_bar()
YearsSinceProm <- ggplot(HR.data,aes(YearsSinceLastPromotion,fill = Attrition))+geom_bar()
YearsCurrManPlot <- ggplot(HR.data,aes(YearsWithCurrManager,fill = Attrition))+geom_bar()
grid.arrange(YearAtComPlot,YearInCurrPlot,YearsSinceProm,YearsCurrManPlot,ncol=2,top = "Fig 7")
Years In Current Role: The plot shows a larger proportion of employees with 0 years in their current role quitting. A role change may be a trigger for quitting.
Years Since Last Promotion: A larger proportion of people who have been promoted recently have quit the organization.
Years With Current Manager: As expected, a new manager is a big cause for quitting.
We see a lot of correlation among the following variables:
Years at Company, Years in Curr Role, Years with Curr Manager & Years Since Last Promotion - I will consider ‘Years with Curr Manager’ for the model
Job Level & Monthly Income - I will consider ‘Job Level’
Percent Salary Hike & Performance Rating - I will consider ‘Percent Salary Hike’
library(corrplot)
## corrplot 0.84 loaded
library(psych)
##
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
HR.data_cor <- HR.data
for (i in seq_len(ncol(HR.data_cor))) {
  # factors are replaced by their integer level codes
  HR.data_cor[, i] <- as.integer(HR.data_cor[, i])
}
corrplot(cor(HR.data_cor))
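Besides reading the correlation plot by eye, strongly correlated pairs can be listed from the correlation matrix itself. The following is a sketch on toy data; the values, the 0.8 cutoff and the names `toy`, `cm`, `high` and `pairs` are assumptions for illustration, not results from the IBM data:

```r
# Toy data (hypothetical): two tenure variables that move together
# and one unrelated pay variable.
toy <- data.frame(
  YearsAtCompany       = c(1, 3, 5, 7, 9, 11),
  YearsWithCurrManager = c(0, 2, 4, 7, 8, 10),
  HourlyRate           = c(60, 45, 80, 50, 75, 55)
)

cm <- cor(toy)

# Blank out the diagonal and lower triangle so each pair appears once.
cm[lower.tri(cm, diag = TRUE)] <- NA

# Positions of the remaining entries above an assumed cutoff of 0.8.
high <- which(abs(cm) > 0.8, arr.ind = TRUE)

# Readable listing of the flagged pairs and their correlations.
pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
pairs
```

From each flagged pair, one variable can then be kept for modelling, in the same spirit as the choices described above.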
Removing highly correlated variables and near-zero-variance variables
HR <- HR.data[,c(2,3,5,7,8,11,12,14,15,16,17,18,21,23,24,26,28,29,30,31)]
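Positional indices depend on the column order at the moment of selection, and that order changed when four columns were dropped earlier. Selecting by name is less fragile and documents which variables are kept. A minimal sketch on a toy data frame (the values and the `keep` list are assumptions for illustration, not the author's final variable list):

```r
# Toy data frame (hypothetical values).
toy <- data.frame(
  Age              = c(41, 49, 37),
  Attrition        = c("Yes", "No", "Yes"),
  JobLevel         = c(2, 2, 1),
  YearsInCurrentRole = c(4, 7, 0)
)

# Variables to keep, listed by name (illustrative choice only).
keep <- c("Attrition", "JobLevel")

# Fail early if a requested column does not exist, e.g. after a typo.
missing_cols <- setdiff(keep, names(toy))
stopifnot(length(missing_cols) == 0)

# Name-based selection is unaffected by earlier column removals.
subset_by_name <- toy[, keep, drop = FALSE]
names(subset_by_name)
```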
library(corrgram)
corrgram(HR, lower.panel = panel.shade, upper.panel = panel.pie, text.panel = panel.txt, main = "Corrgram of all variables")
I have thus analysed and broken down the variables and their effects on attrition in the organisation. As a future extension, machine learning algorithms such as random forest and XGBoost could be used to build a model that predicts which employees are most likely to leave. I am stopping my analysis here, as that is beyond the scope of this two-week mini capstone project.