HW-3

Clean Data - Employee Attrition

Felix Betancourt
2022-06-27

About Dataset:

I looked at Kaggle.com for datasets related to Human Resources.

One of the big questions in all HR departments is how to explain and predict the employee turnover. I am an HR professional, so I am interested on these type of “problems”.

I got the dataset here: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset

According to the description in the site, this is a fictional data set created by IBM data scientists.

And here the descriptions of the categorical variables in the dataset:

Education:

1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4 ‘Master’ 5 ‘Doctor’

EnvironmentSatisfaction:

1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobInvolvement:

1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobSatisfaction:

1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

PerformanceRating:

1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4 ‘Outstanding’

RelationshipSatisfaction:

1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

WorkLifeBalance:

1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’

Now let’s start working with the data set:

Setting up and Reading the data

First let’s set the working directory and load the data.

#First of all let's load the Dplyr and Tidyverse package

library(dplyr)

library(tidyverse)

# Now let's set the working directory

setwd("/Users/fbeta/OneDrive/1-UMASS-DACSS/601/Week-6/HW-3")

ibm <- read.csv("IBM_data.csv")

Exploring the data

[1] 1470   35
ibm_t <- as_tibble(ibm)

head(ibm_t)
# A tibble: 6 × 35
    Age Attrition BusinessTravel DailyRate Department DistanceFromHome
  <int> <chr>     <chr>              <int> <chr>                 <int>
1    41 Yes       Travel_Rarely       1102 Sales                     1
2    49 No        Travel_Freque…       279 Research …                8
3    37 Yes       Travel_Rarely       1373 Research …                2
4    33 No        Travel_Freque…      1392 Research …                3
5    27 No        Travel_Rarely        591 Research …                2
6    32 No        Travel_Freque…      1005 Research …                2
# … with 29 more variables: Education <int>, EducationField <chr>,
#   EmployeeCount <int>, EmployeeNumber <int>,
#   EnvironmentSatisfaction <int>, Gender <chr>, HourlyRate <int>,
#   JobInvolvement <int>, JobLevel <int>, JobRole <chr>,
#   JobSatisfaction <int>, MaritalStatus <chr>, MonthlyIncome <int>,
#   MonthlyRate <int>, NumCompaniesWorked <int>, Over18 <chr>,
#   OverTime <chr>, PercentSalaryHike <int>, …
# Want to see all the variables
head(ibm)
  Age Attrition    BusinessTravel DailyRate             Department
1  41       Yes     Travel_Rarely      1102                  Sales
2  49        No Travel_Frequently       279 Research & Development
3  37       Yes     Travel_Rarely      1373 Research & Development
4  33        No Travel_Frequently      1392 Research & Development
5  27        No     Travel_Rarely       591 Research & Development
6  32        No Travel_Frequently      1005 Research & Development
  DistanceFromHome Education EducationField EmployeeCount
1                1         2  Life Sciences             1
2                8         1  Life Sciences             1
3                2         2          Other             1
4                3         4  Life Sciences             1
5                2         1        Medical             1
6                2         2  Life Sciences             1
  EmployeeNumber EnvironmentSatisfaction Gender HourlyRate
1              1                       2 Female         94
2              2                       3   Male         61
3              4                       4   Male         92
4              5                       4 Female         56
5              7                       1   Male         40
6              8                       4   Male         79
  JobInvolvement JobLevel               JobRole JobSatisfaction
1              3        2       Sales Executive               4
2              2        2    Research Scientist               2
3              2        1 Laboratory Technician               3
4              3        1    Research Scientist               3
5              3        1 Laboratory Technician               2
6              3        1 Laboratory Technician               4
  MaritalStatus MonthlyIncome MonthlyRate NumCompaniesWorked Over18
1        Single          5993       19479                  8      Y
2       Married          5130       24907                  1      Y
3        Single          2090        2396                  6      Y
4       Married          2909       23159                  1      Y
5       Married          3468       16632                  9      Y
6        Single          3068       11864                  0      Y
  OverTime PercentSalaryHike PerformanceRating
1      Yes                11                 3
2       No                23                 4
3      Yes                15                 3
4      Yes                11                 3
5       No                12                 3
6       No                13                 3
  RelationshipSatisfaction StandardHours StockOptionLevel
1                        1            80                0
2                        4            80                1
3                        2            80                0
4                        3            80                0
5                        4            80                1
6                        3            80                0
  TotalWorkingYears TrainingTimesLastYear WorkLifeBalance
1                 8                     0               1
2                10                     3               3
3                 7                     3               3
4                 8                     3               3
5                 6                     3               3
6                 8                     2               2
  YearsAtCompany YearsInCurrentRole YearsSinceLastPromotion
1              6                  4                       0
2             10                  7                       1
3              0                  0                       0
4              8                  7                       3
5              2                  2                       2
6              7                  7                       3
  YearsWithCurrManager
1                    5
2                    7
3                    0
4                    0
5                    2
6                    6
colnames(ibm)
 [1] "Age"                      "Attrition"               
 [3] "BusinessTravel"           "DailyRate"               
 [5] "Department"               "DistanceFromHome"        
 [7] "Education"                "EducationField"          
 [9] "EmployeeCount"            "EmployeeNumber"          
[11] "EnvironmentSatisfaction"  "Gender"                  
[13] "HourlyRate"               "JobInvolvement"          
[15] "JobLevel"                 "JobRole"                 
[17] "JobSatisfaction"          "MaritalStatus"           
[19] "MonthlyIncome"            "MonthlyRate"             
[21] "NumCompaniesWorked"       "Over18"                  
[23] "OverTime"                 "PercentSalaryHike"       
[25] "PerformanceRating"        "RelationshipSatisfaction"
[27] "StandardHours"            "StockOptionLevel"        
[29] "TotalWorkingYears"        "TrainingTimesLastYear"   
[31] "WorkLifeBalance"          "YearsAtCompany"          
[33] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
[35] "YearsWithCurrManager"    

The dataset is a large data frame (1470 obs), I will reduce it and create a new dataset by selecting a random sample of 200 observations/rows.

# Random Sample

ibm_s <- sample_n(ibm, 200)

dim.data.frame(ibm_s)
[1] 200  35

I learned to use head() to see the structure of the dataset but I found str() more easy for me to read the dataset

str(ibm_s)
'data.frame':   200 obs. of  35 variables:
 $ Age                     : int  54 24 27 52 33 31 28 28 56 26 ...
 $ Attrition               : chr  "No" "No" "No" "No" ...
 $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" ...
 $ DailyRate               : int  1217 581 199 994 1075 1003 821 1404 1369 775 ...
 $ Department              : chr  "Research & Development" "Research & Development" "Research & Development" "Research & Development" ...
 $ DistanceFromHome        : int  2 9 6 7 3 5 5 17 23 29 ...
 $ Education               : int  4 3 3 4 2 3 4 3 3 2 ...
 $ EducationField          : chr  "Technical Degree" "Medical" "Life Sciences" "Life Sciences" ...
 $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ EmployeeNumber          : int  126 1707 1162 1118 910 1749 916 1960 1373 618 ...
 $ EnvironmentSatisfaction : int  1 3 4 2 4 1 1 3 4 1 ...
 $ Gender                  : chr  "Female" "Male" "Male" "Male" ...
 $ HourlyRate              : int  60 62 55 87 57 51 98 32 68 45 ...
 $ JobInvolvement          : int  3 4 2 3 3 3 3 2 3 3 ...
 $ JobLevel                : int  3 1 1 3 1 2 2 1 4 2 ...
 $ JobRole                 : chr  "Research Director" "Research Scientist" "Research Scientist" "Healthcare Representative" ...
 $ JobSatisfaction         : int  3 3 3 2 2 3 4 4 2 3 ...
 $ MaritalStatus           : chr  "Married" "Married" "Married" "Single" ...
 $ MonthlyIncome           : int  13549 4401 2539 10445 2277 8346 4908 2367 13402 4306 ...
 $ MonthlyRate             : int  24001 17616 7950 15322 22650 20943 24252 18779 18235 4267 ...
 $ NumCompaniesWorked      : int  9 1 1 7 3 1 1 5 4 5 ...
 $ Over18                  : chr  "Y" "Y" "Y" "Y" ...
 $ OverTime                : chr  "No" "No" "No" "No" ...
 $ PercentSalaryHike       : int  12 16 13 19 11 19 14 12 12 12 ...
 $ PerformanceRating       : int  3 3 3 3 3 3 3 3 3 3 ...
 $ RelationshipSatisfaction: int  1 4 3 4 3 3 2 1 1 1 ...
 $ StandardHours           : int  80 80 80 80 80 80 80 80 80 80 ...
 $ StockOptionLevel        : int  1 1 1 0 1 1 0 1 1 2 ...
 $ TotalWorkingYears       : int  16 5 4 18 7 6 4 6 33 8 ...
 $ TrainingTimesLastYear   : int  5 1 0 4 4 3 3 2 0 5 ...
 $ WorkLifeBalance         : int  1 3 3 3 4 3 3 2 3 3 ...
 $ YearsAtCompany          : int  4 5 4 8 4 5 4 4 19 0 ...
 $ YearsInCurrentRole      : int  3 3 2 6 3 2 2 1 16 0 ...
 $ YearsSinceLastPromotion : int  0 0 2 4 0 0 0 0 15 0 ...
 $ YearsWithCurrManager    : int  3 4 2 0 3 2 2 3 9 0 ...

Now let’s see the unique values in categorical/nominal variables, and others like “StandardHours” that maybe does not vary too much in the dataset

unique(ibm_s$Attrition)
[1] "No"  "Yes"
unique(ibm_s$BusinessTrave)
[1] "Travel_Rarely"     "Travel_Frequently" "Non-Travel"       
unique(ibm_s$Department)
[1] "Research & Development" "Human Resources"       
[3] "Sales"                 
unique(ibm_s$EducationField)
[1] "Technical Degree" "Medical"          "Life Sciences"   
[4] "Human Resources"  "Marketing"        "Other"           
unique(ibm_s$Gender)
[1] "Female" "Male"  
unique(ibm_s$JobRole)
[1] "Research Director"         "Research Scientist"       
[3] "Healthcare Representative" "Human Resources"          
[5] "Sales Executive"           "Laboratory Technician"    
[7] "Manufacturing Director"    "Manager"                  
[9] "Sales Representative"     
unique(ibm_s$MaritalStatus)
[1] "Married"  "Single"   "Divorced"
unique(ibm_s$Over18)
[1] "Y"
unique(ibm_s$OverTime)
[1] "No"  "Yes"
unique(ibm_s$StandardHours)
[1] 80

I found this dataset very comprehensive. There are 35 very interesting variables to explore potential explanations to the attrition.

In terms of the data type, We have:

The dataset seems very clean in general, but I will adjust a couple of things:

  1. Variable “Over18” and “Standardhours” are not relevant in this dataset as they doesn’t “vary”. All employees are over 18 years old and work 80 hours per pay period. Therefore I can delete these columns from the dataset.
ibm_s2 <- select(ibm_s, - c("Over18", "StandardHours"))
str(ibm_s2)
'data.frame':   200 obs. of  33 variables:
 $ Age                     : int  54 24 27 52 33 31 28 28 56 26 ...
 $ Attrition               : chr  "No" "No" "No" "No" ...
 $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" ...
 $ DailyRate               : int  1217 581 199 994 1075 1003 821 1404 1369 775 ...
 $ Department              : chr  "Research & Development" "Research & Development" "Research & Development" "Research & Development" ...
 $ DistanceFromHome        : int  2 9 6 7 3 5 5 17 23 29 ...
 $ Education               : int  4 3 3 4 2 3 4 3 3 2 ...
 $ EducationField          : chr  "Technical Degree" "Medical" "Life Sciences" "Life Sciences" ...
 $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ EmployeeNumber          : int  126 1707 1162 1118 910 1749 916 1960 1373 618 ...
 $ EnvironmentSatisfaction : int  1 3 4 2 4 1 1 3 4 1 ...
 $ Gender                  : chr  "Female" "Male" "Male" "Male" ...
 $ HourlyRate              : int  60 62 55 87 57 51 98 32 68 45 ...
 $ JobInvolvement          : int  3 4 2 3 3 3 3 2 3 3 ...
 $ JobLevel                : int  3 1 1 3 1 2 2 1 4 2 ...
 $ JobRole                 : chr  "Research Director" "Research Scientist" "Research Scientist" "Healthcare Representative" ...
 $ JobSatisfaction         : int  3 3 3 2 2 3 4 4 2 3 ...
 $ MaritalStatus           : chr  "Married" "Married" "Married" "Single" ...
 $ MonthlyIncome           : int  13549 4401 2539 10445 2277 8346 4908 2367 13402 4306 ...
 $ MonthlyRate             : int  24001 17616 7950 15322 22650 20943 24252 18779 18235 4267 ...
 $ NumCompaniesWorked      : int  9 1 1 7 3 1 1 5 4 5 ...
 $ OverTime                : chr  "No" "No" "No" "No" ...
 $ PercentSalaryHike       : int  12 16 13 19 11 19 14 12 12 12 ...
 $ PerformanceRating       : int  3 3 3 3 3 3 3 3 3 3 ...
 $ RelationshipSatisfaction: int  1 4 3 4 3 3 2 1 1 1 ...
 $ StockOptionLevel        : int  1 1 1 0 1 1 0 1 1 2 ...
 $ TotalWorkingYears       : int  16 5 4 18 7 6 4 6 33 8 ...
 $ TrainingTimesLastYear   : int  5 1 0 4 4 3 3 2 0 5 ...
 $ WorkLifeBalance         : int  1 3 3 3 4 3 3 2 3 3 ...
 $ YearsAtCompany          : int  4 5 4 8 4 5 4 4 19 0 ...
 $ YearsInCurrentRole      : int  3 3 2 6 3 2 2 1 16 0 ...
 $ YearsSinceLastPromotion : int  0 0 2 4 0 0 0 0 15 0 ...
 $ YearsWithCurrManager    : int  3 4 2 0 3 2 2 3 9 0 ...
  1. And, there are 2 variables that could be redundant. Since we have hourly rate, I prefer to delete montlhy and daily rate. I’ll be able to calculate those from Hourly rate (if need to).
ibm_s2 <- select(ibm_s2, - c("DailyRate", "MonthlyRate"))
str(ibm_s2)
'data.frame':   200 obs. of  31 variables:
 $ Age                     : int  54 24 27 52 33 31 28 28 56 26 ...
 $ Attrition               : chr  "No" "No" "No" "No" ...
 $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" ...
 $ Department              : chr  "Research & Development" "Research & Development" "Research & Development" "Research & Development" ...
 $ DistanceFromHome        : int  2 9 6 7 3 5 5 17 23 29 ...
 $ Education               : int  4 3 3 4 2 3 4 3 3 2 ...
 $ EducationField          : chr  "Technical Degree" "Medical" "Life Sciences" "Life Sciences" ...
 $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ EmployeeNumber          : int  126 1707 1162 1118 910 1749 916 1960 1373 618 ...
 $ EnvironmentSatisfaction : int  1 3 4 2 4 1 1 3 4 1 ...
 $ Gender                  : chr  "Female" "Male" "Male" "Male" ...
 $ HourlyRate              : int  60 62 55 87 57 51 98 32 68 45 ...
 $ JobInvolvement          : int  3 4 2 3 3 3 3 2 3 3 ...
 $ JobLevel                : int  3 1 1 3 1 2 2 1 4 2 ...
 $ JobRole                 : chr  "Research Director" "Research Scientist" "Research Scientist" "Healthcare Representative" ...
 $ JobSatisfaction         : int  3 3 3 2 2 3 4 4 2 3 ...
 $ MaritalStatus           : chr  "Married" "Married" "Married" "Single" ...
 $ MonthlyIncome           : int  13549 4401 2539 10445 2277 8346 4908 2367 13402 4306 ...
 $ NumCompaniesWorked      : int  9 1 1 7 3 1 1 5 4 5 ...
 $ OverTime                : chr  "No" "No" "No" "No" ...
 $ PercentSalaryHike       : int  12 16 13 19 11 19 14 12 12 12 ...
 $ PerformanceRating       : int  3 3 3 3 3 3 3 3 3 3 ...
 $ RelationshipSatisfaction: int  1 4 3 4 3 3 2 1 1 1 ...
 $ StockOptionLevel        : int  1 1 1 0 1 1 0 1 1 2 ...
 $ TotalWorkingYears       : int  16 5 4 18 7 6 4 6 33 8 ...
 $ TrainingTimesLastYear   : int  5 1 0 4 4 3 3 2 0 5 ...
 $ WorkLifeBalance         : int  1 3 3 3 4 3 3 2 3 3 ...
 $ YearsAtCompany          : int  4 5 4 8 4 5 4 4 19 0 ...
 $ YearsInCurrentRole      : int  3 3 2 6 3 2 2 1 16 0 ...
 $ YearsSinceLastPromotion : int  0 0 2 4 0 0 0 0 15 0 ...
 $ YearsWithCurrManager    : int  3 4 2 0 3 2 2 3 9 0 ...
  1. I think worth recoding back to the categories name for Education and Work Life Balance (ordinal variables) but saving it in new variables in case I need to use it as character in tables.
#recode Education as education_ch and WorkLifeBalance as WorkLifeBalance_ch
ibm_s2<-ibm_s2%>%
  mutate(education_ch = case_when(
         Education == 1 ~ "Below College",
         Education == 2 ~ "College",
         Education == 3 ~ "Bachelor",
         Education == 4 ~ "Master",
         Education == 5 ~ "Doctor",
         )) %>%
  mutate(WorkLifeBalance_ch = case_when(
         WorkLifeBalance == 1 ~ "Bad",
         WorkLifeBalance == 2 ~ "Good",
         WorkLifeBalance == 3 ~ "Better",
         WorkLifeBalance == 4 ~ "Best"
         ))

str(ibm_s2)
'data.frame':   200 obs. of  33 variables:
 $ Age                     : int  54 24 27 52 33 31 28 28 56 26 ...
 $ Attrition               : chr  "No" "No" "No" "No" ...
 $ BusinessTravel          : chr  "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" "Travel_Rarely" ...
 $ Department              : chr  "Research & Development" "Research & Development" "Research & Development" "Research & Development" ...
 $ DistanceFromHome        : int  2 9 6 7 3 5 5 17 23 29 ...
 $ Education               : int  4 3 3 4 2 3 4 3 3 2 ...
 $ EducationField          : chr  "Technical Degree" "Medical" "Life Sciences" "Life Sciences" ...
 $ EmployeeCount           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ EmployeeNumber          : int  126 1707 1162 1118 910 1749 916 1960 1373 618 ...
 $ EnvironmentSatisfaction : int  1 3 4 2 4 1 1 3 4 1 ...
 $ Gender                  : chr  "Female" "Male" "Male" "Male" ...
 $ HourlyRate              : int  60 62 55 87 57 51 98 32 68 45 ...
 $ JobInvolvement          : int  3 4 2 3 3 3 3 2 3 3 ...
 $ JobLevel                : int  3 1 1 3 1 2 2 1 4 2 ...
 $ JobRole                 : chr  "Research Director" "Research Scientist" "Research Scientist" "Healthcare Representative" ...
 $ JobSatisfaction         : int  3 3 3 2 2 3 4 4 2 3 ...
 $ MaritalStatus           : chr  "Married" "Married" "Married" "Single" ...
 $ MonthlyIncome           : int  13549 4401 2539 10445 2277 8346 4908 2367 13402 4306 ...
 $ NumCompaniesWorked      : int  9 1 1 7 3 1 1 5 4 5 ...
 $ OverTime                : chr  "No" "No" "No" "No" ...
 $ PercentSalaryHike       : int  12 16 13 19 11 19 14 12 12 12 ...
 $ PerformanceRating       : int  3 3 3 3 3 3 3 3 3 3 ...
 $ RelationshipSatisfaction: int  1 4 3 4 3 3 2 1 1 1 ...
 $ StockOptionLevel        : int  1 1 1 0 1 1 0 1 1 2 ...
 $ TotalWorkingYears       : int  16 5 4 18 7 6 4 6 33 8 ...
 $ TrainingTimesLastYear   : int  5 1 0 4 4 3 3 2 0 5 ...
 $ WorkLifeBalance         : int  1 3 3 3 4 3 3 2 3 3 ...
 $ YearsAtCompany          : int  4 5 4 8 4 5 4 4 19 0 ...
 $ YearsInCurrentRole      : int  3 3 2 6 3 2 2 1 16 0 ...
 $ YearsSinceLastPromotion : int  0 0 2 4 0 0 0 0 15 0 ...
 $ YearsWithCurrManager    : int  3 4 2 0 3 2 2 3 9 0 ...
 $ education_ch            : chr  "Master" "Bachelor" "Bachelor" "Master" ...
 $ WorkLifeBalance_ch      : chr  "Bad" "Better" "Better" "Better" ...

Now we have a dataset of 200 obs. and 31 variables (+ 2 variables duplicated due to recoding).

In the HR “world” is common to think that employee attrition/turnover is very related to the immediate leadership, so some of the initial questions could be:

  1. Is there a pattern of attrition by Department?, is there significantly more tenure in certain department than others?
  2. Are there significant Job Satisfaction levels differences by Department? or based on Years with current manager?
  3. Also how is the job satisfaction and attrition relationship?

But also there are other questions rekated to the demographics:

  1. Is there a pattern in terms of demographics of the employees who left the company? (marital status, age, gender, education field)
  2. What about tenure (years at company) or years with current manager

Then, more recently is also common to find in exit interviews that the Quality of Life in general is a more and more relevant topic for the labor force:

  1. How is the attrition among employees with Over time vs none over time? or based on the work life Balance rating?

Lastly, it is also important to check the impact of the compensation into the turnover rate:

  1. Is there a relationship between salary and job satisfaction and attrition?
  2. Is there a trend to leave the organization depending on the salary increase?

Let’s create tables to start exploring relationships among variables

#Make tables for Attrition

ibm_s2 %>%
  select(Attrition) %>%
  table() %>%
  prop.table()*100
Attrition
 No Yes 
 82  18 
#Make tables for department

ibm_s2 %>%
  select(Department) %>%
  table() %>%
  prop.table()*100
Department
       Human Resources Research & Development                  Sales 
                     6                     65                     29 
#Make crosstabs for Department and Attrition

xtabs(~ Department + Attrition, ibm_s2)
                        Attrition
Department                No Yes
  Human Resources          9   3
  Research & Development 108  22
  Sales                   47  11
#Make proportional crosstabs for Department and Attrition

prop.table(xtabs(~ Department + Attrition, ibm_s2))*100
                        Attrition
Department                 No  Yes
  Human Resources         4.5  1.5
  Research & Development 54.0 11.0
  Sales                  23.5  5.5
#Make proportional crosstabs for Job Satisfaction and Attrition

prop.table(xtabs(~ JobSatisfaction + Attrition, ibm_s2))*100
               Attrition
JobSatisfaction   No  Yes
              1 12.5  6.0
              2 19.5  3.0
              3 26.5  4.0
              4 23.5  5.0
#Make proportional crosstabs for Gender and Attrition

prop.table(xtabs(~ Gender + Attrition, ibm_s2))*100
        Attrition
Gender     No  Yes
  Female 31.5  5.5
  Male   50.5 12.5

To be continued…