Clean Data - Employee Attrition
I looked at Kaggle.com for datasets related to Human Resources.
One of the big questions in all HR departments is how to explain and predict the employee turnover. I am an HR professional, so I am interested on these type of “problems”.
I got the dataset here: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset
According to the description in the site, this is a fictional data set created by IBM data scientists.
And here the descriptions of the categorical variables in the dataset:
Education:
1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4 ‘Master’ 5 ‘Doctor’
EnvironmentSatisfaction:
1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’
JobInvolvement:
1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’
JobSatisfaction:
1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’
PerformanceRating:
1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4 ‘Outstanding’
RelationshipSatisfaction:
1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’
WorkLifeBalance:
1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’
First let’s set the working directory and load the data.
dim.data.frame(ibm)
[1] 1470 35
# A tibble: 6 × 35
Age Attrition BusinessTravel DailyRate Department DistanceFromHome
<int> <chr> <chr> <int> <chr> <int>
1 41 Yes Travel_Rarely 1102 Sales 1
2 49 No Travel_Freque… 279 Research … 8
3 37 Yes Travel_Rarely 1373 Research … 2
4 33 No Travel_Freque… 1392 Research … 3
5 27 No Travel_Rarely 591 Research … 2
6 32 No Travel_Freque… 1005 Research … 2
# … with 29 more variables: Education <int>, EducationField <chr>,
# EmployeeCount <int>, EmployeeNumber <int>,
# EnvironmentSatisfaction <int>, Gender <chr>, HourlyRate <int>,
# JobInvolvement <int>, JobLevel <int>, JobRole <chr>,
# JobSatisfaction <int>, MaritalStatus <chr>, MonthlyIncome <int>,
# MonthlyRate <int>, NumCompaniesWorked <int>, Over18 <chr>,
# OverTime <chr>, PercentSalaryHike <int>, …
# Want to see all the variables
head(ibm)
Age Attrition BusinessTravel DailyRate Department
1 41 Yes Travel_Rarely 1102 Sales
2 49 No Travel_Frequently 279 Research & Development
3 37 Yes Travel_Rarely 1373 Research & Development
4 33 No Travel_Frequently 1392 Research & Development
5 27 No Travel_Rarely 591 Research & Development
6 32 No Travel_Frequently 1005 Research & Development
DistanceFromHome Education EducationField EmployeeCount
1 1 2 Life Sciences 1
2 8 1 Life Sciences 1
3 2 2 Other 1
4 3 4 Life Sciences 1
5 2 1 Medical 1
6 2 2 Life Sciences 1
EmployeeNumber EnvironmentSatisfaction Gender HourlyRate
1 1 2 Female 94
2 2 3 Male 61
3 4 4 Male 92
4 5 4 Female 56
5 7 1 Male 40
6 8 4 Male 79
JobInvolvement JobLevel JobRole JobSatisfaction
1 3 2 Sales Executive 4
2 2 2 Research Scientist 2
3 2 1 Laboratory Technician 3
4 3 1 Research Scientist 3
5 3 1 Laboratory Technician 2
6 3 1 Laboratory Technician 4
MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18
1 Single 5993 19479 8 Y
2 Married 5130 24907 1 Y
3 Single 2090 2396 6 Y
4 Married 2909 23159 1 Y
5 Married 3468 16632 9 Y
6 Single 3068 11864 0 Y
OverTime PercentSalaryHike PerformanceRating
1 Yes 11 3
2 No 23 4
3 Yes 15 3
4 Yes 11 3
5 No 12 3
6 No 13 3
RelationshipSatisfaction StandardHours StockOptionLevel
1 1 80 0
2 4 80 1
3 2 80 0
4 3 80 0
5 4 80 1
6 3 80 0
TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
1 8 0 1
2 10 3 3
3 7 3 3
4 8 3 3
5 6 3 3
6 8 2 2
YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
1 6 4 0
2 10 7 1
3 0 0 0
4 8 7 3
5 2 2 2
6 7 7 3
YearsWithCurrManager
1 5
2 7
3 0
4 0
5 2
6 6
colnames(ibm)
[1] "Age" "Attrition"
[3] "BusinessTravel" "DailyRate"
[5] "Department" "DistanceFromHome"
[7] "Education" "EducationField"
[9] "EmployeeCount" "EmployeeNumber"
[11] "EnvironmentSatisfaction" "Gender"
[13] "HourlyRate" "JobInvolvement"
[15] "JobLevel" "JobRole"
[17] "JobSatisfaction" "MaritalStatus"
[19] "MonthlyIncome" "MonthlyRate"
[21] "NumCompaniesWorked" "Over18"
[23] "OverTime" "PercentSalaryHike"
[25] "PerformanceRating" "RelationshipSatisfaction"
[27] "StandardHours" "StockOptionLevel"
[29] "TotalWorkingYears" "TrainingTimesLastYear"
[31] "WorkLifeBalance" "YearsAtCompany"
[33] "YearsInCurrentRole" "YearsSinceLastPromotion"
[35] "YearsWithCurrManager"
The dataset is a large data frame (1470 obs), I will reduce it and create a new dataset by selecting a random sample of 200 observations/rows.
# Random Sample
ibm_s <- sample_n(ibm, 200)
dim.data.frame(ibm_s)
[1] 200 35
I learned to use head() to see the structure of the dataset but I found str() more easy for me to read the dataset
str(ibm_s)
'data.frame': 200 obs. of 35 variables:
$ Age : int 54 24 27 52 33 31 28 28 56 26 ...
$ Attrition : chr "No" "No" "No" "No" ...
$ BusinessTravel : chr "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" ...
$ DailyRate : int 1217 581 199 994 1075 1003 821 1404 1369 775 ...
$ Department : chr "Research & Development" "Research & Development" "Research & Development" "Research & Development" ...
$ DistanceFromHome : int 2 9 6 7 3 5 5 17 23 29 ...
$ Education : int 4 3 3 4 2 3 4 3 3 2 ...
$ EducationField : chr "Technical Degree" "Medical" "Life Sciences" "Life Sciences" ...
$ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
$ EmployeeNumber : int 126 1707 1162 1118 910 1749 916 1960 1373 618 ...
$ EnvironmentSatisfaction : int 1 3 4 2 4 1 1 3 4 1 ...
$ Gender : chr "Female" "Male" "Male" "Male" ...
$ HourlyRate : int 60 62 55 87 57 51 98 32 68 45 ...
$ JobInvolvement : int 3 4 2 3 3 3 3 2 3 3 ...
$ JobLevel : int 3 1 1 3 1 2 2 1 4 2 ...
$ JobRole : chr "Research Director" "Research Scientist" "Research Scientist" "Healthcare Representative" ...
$ JobSatisfaction : int 3 3 3 2 2 3 4 4 2 3 ...
$ MaritalStatus : chr "Married" "Married" "Married" "Single" ...
$ MonthlyIncome : int 13549 4401 2539 10445 2277 8346 4908 2367 13402 4306 ...
$ MonthlyRate : int 24001 17616 7950 15322 22650 20943 24252 18779 18235 4267 ...
$ NumCompaniesWorked : int 9 1 1 7 3 1 1 5 4 5 ...
$ Over18 : chr "Y" "Y" "Y" "Y" ...
$ OverTime : chr "No" "No" "No" "No" ...
$ PercentSalaryHike : int 12 16 13 19 11 19 14 12 12 12 ...
$ PerformanceRating : int 3 3 3 3 3 3 3 3 3 3 ...
$ RelationshipSatisfaction: int 1 4 3 4 3 3 2 1 1 1 ...
$ StandardHours : int 80 80 80 80 80 80 80 80 80 80 ...
$ StockOptionLevel : int 1 1 1 0 1 1 0 1 1 2 ...
$ TotalWorkingYears : int 16 5 4 18 7 6 4 6 33 8 ...
$ TrainingTimesLastYear : int 5 1 0 4 4 3 3 2 0 5 ...
$ WorkLifeBalance : int 1 3 3 3 4 3 3 2 3 3 ...
$ YearsAtCompany : int 4 5 4 8 4 5 4 4 19 0 ...
$ YearsInCurrentRole : int 3 3 2 6 3 2 2 1 16 0 ...
$ YearsSinceLastPromotion : int 0 0 2 4 0 0 0 0 15 0 ...
$ YearsWithCurrManager : int 3 4 2 0 3 2 2 3 9 0 ...
Now let’s see the unique values in categorical/nominal variables, and others like “StandardHours” that maybe does not vary too much in the dataset
unique(ibm_s$Attrition)
[1] "No" "Yes"
unique(ibm_s$BusinessTrave)
[1] "Travel_Rarely" "Travel_Frequently" "Non-Travel"
unique(ibm_s$Department)
[1] "Research & Development" "Human Resources"
[3] "Sales"
unique(ibm_s$EducationField)
[1] "Technical Degree" "Medical" "Life Sciences"
[4] "Human Resources" "Marketing" "Other"
unique(ibm_s$Gender)
[1] "Female" "Male"
unique(ibm_s$JobRole)
[1] "Research Director" "Research Scientist"
[3] "Healthcare Representative" "Human Resources"
[5] "Sales Executive" "Laboratory Technician"
[7] "Manufacturing Director" "Manager"
[9] "Sales Representative"
unique(ibm_s$MaritalStatus)
[1] "Married" "Single" "Divorced"
unique(ibm_s$Over18)
[1] "Y"
unique(ibm_s$OverTime)
[1] "No" "Yes"
unique(ibm_s$StandardHours)
[1] 80
I found this dataset very comprehensive. There are 35 very interesting variables to explore potential explanations to the attrition.
In terms of the data type, We have:
Nine (9) Categorical variables:
- Attrition
- BusinessTravel
- Department
- EducationField
- Gender
- JobRole
- MaritalStatus
- Over18
- OverTimeSeven (7) Ordinal:
- Education
- EnvironmentSatisfaction:
- JobInvolvement:
- JobSatisfaction:
- PerformanceRating:
- RelationshipSatisfaction:
- WorkLifeBalance: And the rest of the variables are (19) are numerical discrete.
The dataset seems very clean in general, but I will adjust a couple of things:
'data.frame': 200 obs. of 33 variables:
$ Age : int 54 24 27 52 33 31 28 28 56 26 ...
$ Attrition : chr "No" "No" "No" "No" ...
$ BusinessTravel : chr "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" ...
$ DailyRate : int 1217 581 199 994 1075 1003 821 1404 1369 775 ...
$ Department : chr "Research & Development" "Research & Development" "Research & Development" "Research & Development" ...
$ DistanceFromHome : int 2 9 6 7 3 5 5 17 23 29 ...
$ Education : int 4 3 3 4 2 3 4 3 3 2 ...
$ EducationField : chr "Technical Degree" "Medical" "Life Sciences" "Life Sciences" ...
$ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
$ EmployeeNumber : int 126 1707 1162 1118 910 1749 916 1960 1373 618 ...
$ EnvironmentSatisfaction : int 1 3 4 2 4 1 1 3 4 1 ...
$ Gender : chr "Female" "Male" "Male" "Male" ...
$ HourlyRate : int 60 62 55 87 57 51 98 32 68 45 ...
$ JobInvolvement : int 3 4 2 3 3 3 3 2 3 3 ...
$ JobLevel : int 3 1 1 3 1 2 2 1 4 2 ...
$ JobRole : chr "Research Director" "Research Scientist" "Research Scientist" "Healthcare Representative" ...
$ JobSatisfaction : int 3 3 3 2 2 3 4 4 2 3 ...
$ MaritalStatus : chr "Married" "Married" "Married" "Single" ...
$ MonthlyIncome : int 13549 4401 2539 10445 2277 8346 4908 2367 13402 4306 ...
$ MonthlyRate : int 24001 17616 7950 15322 22650 20943 24252 18779 18235 4267 ...
$ NumCompaniesWorked : int 9 1 1 7 3 1 1 5 4 5 ...
$ OverTime : chr "No" "No" "No" "No" ...
$ PercentSalaryHike : int 12 16 13 19 11 19 14 12 12 12 ...
$ PerformanceRating : int 3 3 3 3 3 3 3 3 3 3 ...
$ RelationshipSatisfaction: int 1 4 3 4 3 3 2 1 1 1 ...
$ StockOptionLevel : int 1 1 1 0 1 1 0 1 1 2 ...
$ TotalWorkingYears : int 16 5 4 18 7 6 4 6 33 8 ...
$ TrainingTimesLastYear : int 5 1 0 4 4 3 3 2 0 5 ...
$ WorkLifeBalance : int 1 3 3 3 4 3 3 2 3 3 ...
$ YearsAtCompany : int 4 5 4 8 4 5 4 4 19 0 ...
$ YearsInCurrentRole : int 3 3 2 6 3 2 2 1 16 0 ...
$ YearsSinceLastPromotion : int 0 0 2 4 0 0 0 0 15 0 ...
$ YearsWithCurrManager : int 3 4 2 0 3 2 2 3 9 0 ...
'data.frame': 200 obs. of 31 variables:
$ Age : int 54 24 27 52 33 31 28 28 56 26 ...
$ Attrition : chr "No" "No" "No" "No" ...
$ BusinessTravel : chr "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" ...
$ Department : chr "Research & Development" "Research & Development" "Research & Development" "Research & Development" ...
$ DistanceFromHome : int 2 9 6 7 3 5 5 17 23 29 ...
$ Education : int 4 3 3 4 2 3 4 3 3 2 ...
$ EducationField : chr "Technical Degree" "Medical" "Life Sciences" "Life Sciences" ...
$ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
$ EmployeeNumber : int 126 1707 1162 1118 910 1749 916 1960 1373 618 ...
$ EnvironmentSatisfaction : int 1 3 4 2 4 1 1 3 4 1 ...
$ Gender : chr "Female" "Male" "Male" "Male" ...
$ HourlyRate : int 60 62 55 87 57 51 98 32 68 45 ...
$ JobInvolvement : int 3 4 2 3 3 3 3 2 3 3 ...
$ JobLevel : int 3 1 1 3 1 2 2 1 4 2 ...
$ JobRole : chr "Research Director" "Research Scientist" "Research Scientist" "Healthcare Representative" ...
$ JobSatisfaction : int 3 3 3 2 2 3 4 4 2 3 ...
$ MaritalStatus : chr "Married" "Married" "Married" "Single" ...
$ MonthlyIncome : int 13549 4401 2539 10445 2277 8346 4908 2367 13402 4306 ...
$ NumCompaniesWorked : int 9 1 1 7 3 1 1 5 4 5 ...
$ OverTime : chr "No" "No" "No" "No" ...
$ PercentSalaryHike : int 12 16 13 19 11 19 14 12 12 12 ...
$ PerformanceRating : int 3 3 3 3 3 3 3 3 3 3 ...
$ RelationshipSatisfaction: int 1 4 3 4 3 3 2 1 1 1 ...
$ StockOptionLevel : int 1 1 1 0 1 1 0 1 1 2 ...
$ TotalWorkingYears : int 16 5 4 18 7 6 4 6 33 8 ...
$ TrainingTimesLastYear : int 5 1 0 4 4 3 3 2 0 5 ...
$ WorkLifeBalance : int 1 3 3 3 4 3 3 2 3 3 ...
$ YearsAtCompany : int 4 5 4 8 4 5 4 4 19 0 ...
$ YearsInCurrentRole : int 3 3 2 6 3 2 2 1 16 0 ...
$ YearsSinceLastPromotion : int 0 0 2 4 0 0 0 0 15 0 ...
$ YearsWithCurrManager : int 3 4 2 0 3 2 2 3 9 0 ...
#recode Education as education_ch and WorkLifeBalance as WorkLifeBalance_ch
ibm_s2<-ibm_s2%>%
mutate(education_ch = case_when(
Education == 1 ~ "Below College",
Education == 2 ~ "College",
Education == 3 ~ "Bachelor",
Education == 4 ~ "Master",
Education == 5 ~ "Doctor",
)) %>%
mutate(WorkLifeBalance_ch = case_when(
WorkLifeBalance == 1 ~ "Bad",
WorkLifeBalance == 2 ~ "Good",
WorkLifeBalance == 3 ~ "Better",
WorkLifeBalance == 4 ~ "Best"
))
str(ibm_s2)
'data.frame': 200 obs. of 33 variables:
$ Age : int 54 24 27 52 33 31 28 28 56 26 ...
$ Attrition : chr "No" "No" "No" "No" ...
$ BusinessTravel : chr "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" ...
$ Department : chr "Research & Development" "Research & Development" "Research & Development" "Research & Development" ...
$ DistanceFromHome : int 2 9 6 7 3 5 5 17 23 29 ...
$ Education : int 4 3 3 4 2 3 4 3 3 2 ...
$ EducationField : chr "Technical Degree" "Medical" "Life Sciences" "Life Sciences" ...
$ EmployeeCount : int 1 1 1 1 1 1 1 1 1 1 ...
$ EmployeeNumber : int 126 1707 1162 1118 910 1749 916 1960 1373 618 ...
$ EnvironmentSatisfaction : int 1 3 4 2 4 1 1 3 4 1 ...
$ Gender : chr "Female" "Male" "Male" "Male" ...
$ HourlyRate : int 60 62 55 87 57 51 98 32 68 45 ...
$ JobInvolvement : int 3 4 2 3 3 3 3 2 3 3 ...
$ JobLevel : int 3 1 1 3 1 2 2 1 4 2 ...
$ JobRole : chr "Research Director" "Research Scientist" "Research Scientist" "Healthcare Representative" ...
$ JobSatisfaction : int 3 3 3 2 2 3 4 4 2 3 ...
$ MaritalStatus : chr "Married" "Married" "Married" "Single" ...
$ MonthlyIncome : int 13549 4401 2539 10445 2277 8346 4908 2367 13402 4306 ...
$ NumCompaniesWorked : int 9 1 1 7 3 1 1 5 4 5 ...
$ OverTime : chr "No" "No" "No" "No" ...
$ PercentSalaryHike : int 12 16 13 19 11 19 14 12 12 12 ...
$ PerformanceRating : int 3 3 3 3 3 3 3 3 3 3 ...
$ RelationshipSatisfaction: int 1 4 3 4 3 3 2 1 1 1 ...
$ StockOptionLevel : int 1 1 1 0 1 1 0 1 1 2 ...
$ TotalWorkingYears : int 16 5 4 18 7 6 4 6 33 8 ...
$ TrainingTimesLastYear : int 5 1 0 4 4 3 3 2 0 5 ...
$ WorkLifeBalance : int 1 3 3 3 4 3 3 2 3 3 ...
$ YearsAtCompany : int 4 5 4 8 4 5 4 4 19 0 ...
$ YearsInCurrentRole : int 3 3 2 6 3 2 2 1 16 0 ...
$ YearsSinceLastPromotion : int 0 0 2 4 0 0 0 0 15 0 ...
$ YearsWithCurrManager : int 3 4 2 0 3 2 2 3 9 0 ...
$ education_ch : chr "Master" "Bachelor" "Bachelor" "Master" ...
$ WorkLifeBalance_ch : chr "Bad" "Better" "Better" "Better" ...
Now we have a dataset of 200 obs. and 31 variables (+ 2 variables duplicated due to recoding).
In the HR “world” is common to think that employee attrition/turnover is very related to the immediate leadership, so some of the initial questions could be:
But also there are other questions rekated to the demographics:
Then, more recently is also common to find in exit interviews that the Quality of Life in general is a more and more relevant topic for the labor force:
Lastly, it is also important to check the impact of the compensation into the turnover rate:
Attrition
No Yes
82 18
Department
Human Resources Research & Development Sales
6 65 29
#Make crosstabs for Department and Attrition
xtabs(~ Department + Attrition, ibm_s2)
Attrition
Department No Yes
Human Resources 9 3
Research & Development 108 22
Sales 47 11
#Make proportional crosstabs for Department and Attrition
prop.table(xtabs(~ Department + Attrition, ibm_s2))*100
Attrition
Department No Yes
Human Resources 4.5 1.5
Research & Development 54.0 11.0
Sales 23.5 5.5
#Make proportional crosstabs for Job Satisfaction and Attrition
prop.table(xtabs(~ JobSatisfaction + Attrition, ibm_s2))*100
Attrition
JobSatisfaction No Yes
1 12.5 6.0
2 19.5 3.0
3 26.5 4.0
4 23.5 5.0
#Make proportional crosstabs for Gender and Attrition
prop.table(xtabs(~ Gender + Attrition, ibm_s2))*100
Attrition
Gender No Yes
Female 31.5 5.5
Male 50.5 12.5