About this Project
This case study is an extra exercise to showcase the skills I have acquired during and after the Google Data Analytics Certificate course currently offered on Coursera. The purpose is to utilize the tools the tools of choice to ultimately provide recommendations by following the six steps of the data analysis process: Ask, Prepare, Process, Analyze, Share and Act.
Scenario You are an entry level data analytics hire and your employer Big Tuna has tasked you with finding a possible solution to the high attrition in the firm. Turnover tends to be expensive and time consuming for the company and they want to understand the root cause.
They have provided the Employee Attrition and Factors dataset for you to analyze and figure out any relationships between the variables and attrition rate. Additionally, they expect recommendations at the end of your analysis.
Step 2: Prepare
The second step is to put together the tools needed to copmplete the analysis. My tool of choice in this case is R Studio for both data wrangling and visual analysis.
Step 3: Process
Copy HR_Analytics.csv to a dataframe
The third step for is to prepare the environment for our analysis by loading the libraries we will need:
After that, we can import the dataset. We use the “read_csv()” function to do that since or data has been downloaded in the “csv” format. We can also subsequently display the characteristics of our dataset using the “str()” function to see if any of the variables are out of place/are in the incorrect data type.
HR_Analytics <- read_csv(here("HR_Analytics.csv"))
str(HR_Analytics)
## spc_tbl_ [1,470 × 35] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Age : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
## $ Attrition : chr [1:1470] "Yes" "No" "Yes" "No" ...
## $ BusinessTravel : chr [1:1470] "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
## $ DailyRate : num [1:1470] 1102 279 1373 1392 591 ...
## $ Department : chr [1:1470] "Sales" "Research & Development" "Research & Development" "Research & Development" ...
## $ DistanceFromHome : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
## $ Education : num [1:1470] 2 1 2 4 1 2 3 1 3 3 ...
## $ EducationField : chr [1:1470] "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
## $ EmployeeCount : num [1:1470] 1 1 1 1 1 1 1 1 1 1 ...
## $ EmployeeNumber : num [1:1470] 1 2 4 5 7 8 10 11 12 13 ...
## $ EnvironmentSatisfaction : num [1:1470] 2 3 4 4 1 4 3 4 4 3 ...
## $ Gender : chr [1:1470] "Female" "Male" "Male" "Female" ...
## $ HourlyRate : num [1:1470] 94 61 92 56 40 79 81 67 44 94 ...
## $ JobInvolvement : num [1:1470] 3 2 2 3 3 3 4 3 2 3 ...
## $ JobLevel : num [1:1470] 2 2 1 1 1 1 1 1 3 2 ...
## $ JobRole : chr [1:1470] "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
## $ JobSatisfaction : num [1:1470] 4 2 3 3 2 4 1 3 3 3 ...
## $ MaritalStatus : chr [1:1470] "Single" "Married" "Single" "Married" ...
## $ MonthlyIncome : num [1:1470] 5993 5130 2090 2909 3468 ...
## $ MonthlyRate : num [1:1470] 19479 24907 2396 23159 16632 ...
## $ NumCompaniesWorked : num [1:1470] 8 1 6 1 9 0 4 1 0 6 ...
## $ Over18 : chr [1:1470] "Y" "Y" "Y" "Y" ...
## $ OverTime : chr [1:1470] "Yes" "No" "Yes" "Yes" ...
## $ PercentSalaryHike : num [1:1470] 11 23 15 11 12 13 20 22 21 13 ...
## $ PerformanceRating : num [1:1470] 3 4 3 3 3 3 4 4 4 3 ...
## $ RelationshipSatisfaction: num [1:1470] 1 4 2 3 4 3 1 2 2 2 ...
## $ StandardHours : num [1:1470] 80 80 80 80 80 80 80 80 80 80 ...
## $ StockOptionLevel : num [1:1470] 0 1 0 0 1 0 3 1 0 2 ...
## $ TotalWorkingYears : num [1:1470] 8 10 7 8 6 8 12 1 10 17 ...
## $ TrainingTimesLastYear : num [1:1470] 0 3 3 3 3 2 3 2 2 3 ...
## $ WorkLifeBalance : num [1:1470] 1 3 3 3 3 2 2 3 3 2 ...
## $ YearsAtCompany : num [1:1470] 6 10 0 8 2 7 1 1 9 7 ...
## $ YearsInCurrentRole : num [1:1470] 4 7 0 7 2 7 0 0 7 7 ...
## $ YearsSinceLastPromotion : num [1:1470] 0 1 0 3 2 3 0 0 1 7 ...
## $ YearsWithCurrManager : num [1:1470] 5 7 0 0 2 6 0 0 8 7 ...
## - attr(*, "spec")=
## .. cols(
## .. Age = col_double(),
## .. Attrition = col_character(),
## .. BusinessTravel = col_character(),
## .. DailyRate = col_double(),
## .. Department = col_character(),
## .. DistanceFromHome = col_double(),
## .. Education = col_double(),
## .. EducationField = col_character(),
## .. EmployeeCount = col_double(),
## .. EmployeeNumber = col_double(),
## .. EnvironmentSatisfaction = col_double(),
## .. Gender = col_character(),
## .. HourlyRate = col_double(),
## .. JobInvolvement = col_double(),
## .. JobLevel = col_double(),
## .. JobRole = col_character(),
## .. JobSatisfaction = col_double(),
## .. MaritalStatus = col_character(),
## .. MonthlyIncome = col_double(),
## .. MonthlyRate = col_double(),
## .. NumCompaniesWorked = col_double(),
## .. Over18 = col_character(),
## .. OverTime = col_character(),
## .. PercentSalaryHike = col_double(),
## .. PerformanceRating = col_double(),
## .. RelationshipSatisfaction = col_double(),
## .. StandardHours = col_double(),
## .. StockOptionLevel = col_double(),
## .. TotalWorkingYears = col_double(),
## .. TrainingTimesLastYear = col_double(),
## .. WorkLifeBalance = col_double(),
## .. YearsAtCompany = col_double(),
## .. YearsInCurrentRole = col_double(),
## .. YearsSinceLastPromotion = col_double(),
## .. YearsWithCurrManager = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
Step 3-1: Check HR_Analytics dataset for consistency and general cleaning
So far, the data appears fine. There are a couple of variables stored as the “character” type, but these will be utilized as the “logical” type. This is because even though they are Boolean in nature, they also have a descriptive element to them. the “Gender” variable can be kept the same. However, the “Attrition” and “Overtime” variables should have the same character length for the sake of consistency. Let’s have them in the simpler “Y/N” format instead. The “Yes” and “No” strings can be conditionally replaced by using of the square brackets to manipulate the object elements:
HR_Analytics$OverTime[HR_Analytics$OverTime=="Yes"] <-"Y"
HR_Analytics$OverTime[HR_Analytics$OverTime=="No"] <-"N"
HR_Analytics$Attrition[HR_Analytics$Attrition=="Yes"] <-"Y"
HR_Analytics$Attrition[HR_Analytics$Attrition=="No"] <-"N"
Step 4-1: Visual Analysis - Attrition vs. Department
Now that our dataset is consistent and clean, we can do a number of visual analysis to compare our variables.
Our first analysis will be on department-specific attrition. We want to know where the most attrition happens based on the departments in the dataset. The best visualization medium for proportions of a whole is the pie chart. But first, we need to construct new vectors based on the conditions we are trying to meet. To get the number of people that left based on their department, we need to filter our count based on the dataset “HR_Analytics” for a specific “Department” (Sales, Human Resources or Research & Development) where “Attrition” value equals “Y”.
The department-specific total divided by total attrition multiplied by 100 gives the percentage of workers that left Big Tuna per department. It would also give us a hint on where to focus most of our attention as far as attrition goes. The formulas required to make those calculations are as follows:
#Find out how many people left the organization per department as a percentage
# a) Create a "label" vector. We calculate the attrition percentage per department and concatenate the results
# with descriptive text elements using the "paste()" command
#--------------------------------------------------------------------------------------------------------------
labels = c(paste(round((nrow(filter(HR_Analytics, Department=="Sales",Attrition=="Y"))/nrow(filter(HR_Analytics, Attrition=="Y")))*100,0),"%","","- Sales"), paste(round((nrow(filter(HR_Analytics, Department=="Human Resources",Attrition=="Y"))/nrow(filter(HR_Analytics, Attrition=="Y")))*100,0),"%","","- Human Resources"), paste(round((nrow(filter(HR_Analytics, Department=="Research & Development",Attrition=="Y"))/nrow(filter(HR_Analytics, Attrition=="Y")))*100,0),"%","","- Research & Development"))
# b) Create a "Pie_Bin" vector. We can use the "nrow()" and "filter()" commands to count the total occurrences
# per department.
#--------------------------------------------------------------------------------------------------------------
Pie_Bin<-c(nrow(filter(HR_Analytics, Department=="Sales")), nrow(filter(HR_Analytics, Department=="Human Resources")), nrow(filter(HR_Analytics, Department=="Research & Development")))
# c) Combine the previous two results using the "pie" command.
#--------------------------------------------------------------------------------------------------------------
pie(Pie_Bin,labels,col = topo.colors(3), radius = 1)
title(main = "Attrition Percentage vs. Department", xlab = "", ylab = "",
cex.main = 1, font.main = 3, col.main = "darkgreen",
cex.sub = 2, font.sub = 3, col.sub = "darkgreen",
col.lab ="black"
)
Based on the visual above, we can see the department with the highest attrition rate is “Research & Development”. The “Sales” depart is not too far off it seems. “Human Resources” has the lowest rate, so this may not even be necessary to consider as a candidate for trying to solve the attrition issue.
Step 4-2: Visual Analysis - Attrition vs. Business Travel
Though some may enjoy it, travelling for business can be a problem for some. This may be even more true if the employees have families of their own. With this in mind, we can create another pie chart below to see how each of those variables look like when visually represented.
#Find out how many people left the organization based on how much they travel
# a) Create a "label" vector. We calculate the attrition percentage tied to travel frequency and concatenate
# the results with descriptive text elements using the "paste()" command
#--------------------------------------------------------------------------------------------------------------
labels = c(paste(round((nrow(filter(HR_Analytics, BusinessTravel=="Non-Travel",Attrition=="Y"))/nrow(filter(HR_Analytics, Attrition=="Y")))*100,0),"%","","- Non-Travel"), paste(round((nrow(filter(HR_Analytics, BusinessTravel=="Travel_Rarely",Attrition=="Y"))/nrow(filter(HR_Analytics, Attrition=="Y")))*100,0),"%","","- Travel_Rarely"), paste(round((nrow(filter(HR_Analytics, BusinessTravel=="Travel_Frequently",Attrition=="Y"))/nrow(filter(HR_Analytics, Attrition=="Y")))*100,0),"%","","- Travel_Frequently"))
# b) Create a "Pie_Bin" vector. We can use the "nrow()" and "filter()" commands to count the total occurrences
# per travel frequency category.
#--------------------------------------------------------------------------------------------------------------
Pie_Bin<-c(nrow(filter(HR_Analytics, BusinessTravel=="Non-Travel")), nrow(filter(HR_Analytics, BusinessTravel=="Travel_Rarely")), nrow(filter(HR_Analytics, BusinessTravel=="Travel_Frequently")))
# c) Combine the previous two results using the "pie" command.
#--------------------------------------------------------------------------------------------------------------
pie(Pie_Bin,labels,col = topo.colors(3), radius = 1)
title(main = "Attrition Percentage vs. Business Travel", xlab = "", ylab = "",
cex.main = 1, font.main = 3, col.main = "darkgreen",
cex.sub = 2, font.sub = 3, col.sub = "darkgreen",
col.lab ="black"
)
The visual analysis above somewhat aligns with our expectations. The least amount of “Non-Travel” attrition is the lowest. My expectations are subverted a little though because “Travel_Rarely” employee attrition rate is over double the “Travel_Frequently” rate. Correlation is not necessarily causation however, so we can dig into the data a little more to gain more insights.
Step 4-3: Visual Analysis - Attrition vs. Job Satisfaction
Job Satisfaction is a simple enough metric to understand. While various factors determine Job Satisfaction, we are more interested in how the numbers look against attrition levels. We can use the “unique()” to identify all the levels of satisfaction. It appears they range from 1-4.
#Find out how many people left the organization based on Job Satisfaction level
# a) Create a "label" vector. We calculate the attrition percentage tied to Job Satisfaction and concatenate
# the results with descriptive text elements using the "paste()" command
#--------------------------------------------------------------------------------------------------------------
labels = c(paste(round((nrow(filter(HR_Analytics, JobSatisfaction=="1",Attrition=="Y"))/nrow(filter(HR_Analytics, Attrition=="Y")))*100,0),"%","","- Level 1"), paste(round((nrow(filter(HR_Analytics, JobSatisfaction=="2",Attrition=="Y"))/nrow(filter(HR_Analytics, Attrition=="Y")))*100,0),"%","","- Level 2"), paste(round((nrow(filter(HR_Analytics, JobSatisfaction=="3",Attrition=="Y"))/nrow(filter(HR_Analytics, Attrition=="Y")))*100,0),"%","","- Level 3"),
paste(round((nrow(filter(HR_Analytics, JobSatisfaction=="4",Attrition=="Y"))/nrow(filter(HR_Analytics, Attrition=="Y")))*100,0),"%","","- Level 4"))
# b) Create a "Pie_Bin" vector. We can use the "nrow()" and "filter()" commands to count the total occurrences
# per Job Satisfaction category.
#--------------------------------------------------------------------------------------------------------------
Pie_Bin<-c(nrow(filter(HR_Analytics, JobSatisfaction=="1")), nrow(filter(HR_Analytics, JobSatisfaction=="2")), nrow(filter(HR_Analytics, JobSatisfaction=="3")),nrow(filter(HR_Analytics, JobSatisfaction=="4")))
# c) Combine the previous two results using the "pie" command.
#--------------------------------------------------------------------------------------------------------------
pie(Pie_Bin,labels,col = topo.colors(4), radius = 1)
title(main = "Attrition Percentage vs. Job Satisfaction", xlab = "", ylab = "",
cex.main = 1, font.main = 3, col.main = "darkgreen",
cex.sub = 2, font.sub = 3, col.sub = "darkgreen",
col.lab ="black"
)
The visual Analysis shows that there seems to be very little effect of “Job Satisfaction” on attrition. The not one satisfaction level heavily outweighs the others. We can also compare the significance of “Job Satisfaction” to employees that stayed with the company. We just have to change the “Y” condition to “N”.
#Find out how many people remained with the organization based on Job Satisfaction level
# a) Create a "label" vector. We calculate the attrition percentage tied to Job Satisfaction and concatenate
# the results with descriptive text elements using the "paste()" command
#--------------------------------------------------------------------------------------------------------------
labels = c(paste(round((nrow(filter(HR_Analytics, JobSatisfaction=="1",Attrition=="N"))/nrow(filter(HR_Analytics, Attrition=="N")))*100,0),"%","","- Level 1"), paste(round((nrow(filter(HR_Analytics, JobSatisfaction=="2",Attrition=="N"))/nrow(filter(HR_Analytics, Attrition=="N")))*100,0),"%","","- Level 2"), paste(round((nrow(filter(HR_Analytics, JobSatisfaction=="3",Attrition=="N"))/nrow(filter(HR_Analytics, Attrition=="N")))*100,0),"%","","- Level 3"),
paste(round((nrow(filter(HR_Analytics, JobSatisfaction=="4",Attrition=="N"))/nrow(filter(HR_Analytics, Attrition=="N")))*100,0),"%","","- Level 4"))
# b) Create a "Pie_Bin" vector. We can use the "nrow()" and "filter()" commands to count the total occurrences
# per Job Satisfaction category.
#--------------------------------------------------------------------------------------------------------------
Pie_Bin<-c(nrow(filter(HR_Analytics, JobSatisfaction=="1")), nrow(filter(HR_Analytics, JobSatisfaction=="2")), nrow(filter(HR_Analytics, JobSatisfaction=="3")),nrow(filter(HR_Analytics, JobSatisfaction=="4")))
# c) Combine the previous two results using the "pie" command.
#--------------------------------------------------------------------------------------------------------------
pie(Pie_Bin,labels,col = topo.colors(4), radius = 1)
title(main = "Attrition Percentage vs. Job Satisfaction", xlab = "", ylab = "",
cex.main = 1, font.main = 3, col.main = "darkgreen",
cex.sub = 2, font.sub = 3, col.sub = "darkgreen",
col.lab ="black"
)
As suspected, the weight of “Job Satisfaction” still seems to bear little significance on attrition.
Step 4-4: Visual Analysis - Attrition vs. Distance From Home
Another important comparison is distance from home to the workplace. With remote options available, paired with the inconvenience of traveling to an from work, the expectation is higher attrition as the distance goes up. We can also use the “unique()” function here to see how many distances there are. We can use a bar type plot to see how Attrition compares to distance from work. We can also go a step further to compare the interaction between the two against all “Y”s and “N”s:
ggplot(data = HR_Analytics, aes(x = DistanceFromHome))+
facet_wrap(vars(HR_Analytics$Attrition))+
ylab("Attrition")+
geom_bar()
The findings subvert expectations again when comparing the relative
volume of the two plots. It looks like distance from work does not play
a significant role in people leaving the organization.
Step 5: Share
Key Findings
We can now briefly go over the observations for the visual output that appears to carry significance in affecting attrition:
Recommendations
References