1 Introduction

In People Analytics, the attrition data set created by IBM is a valuable resource for addressing questions about employee turnover and engagement. With 1470 rows and 35 columns, it covers many aspects of employees’ professional and personal lives. The data set combines typical Human Resources Information System (HRIS) data with an engagement survey, giving a broad view of employees’ experiences and sentiments.

The main objective of this analysis is to understand and predict employee turnover within an organization. By examining the factors that contribute to attrition, we aim to uncover valuable insights that can guide HR strategies and improve employee retention rates. Moreover, this data set offers a unique opportunity to identify differences between the group of employees who chose to stay and those who decided to leave the organization.

2 Description of Data

The data set was created and made available by IBM, a renowned leader in the tech industry. Given this source, the data is expected to be reliable and well structured, enabling meaningful analysis and inference. Although the data collection methods are not documented, the breadth of the data set suggests that it was curated to capture a wide range of attributes relevant to employee turnover and engagement.

Sample Size and Feature Variables: The data set consists of 1470 rows, each representing an individual employee record, and 35 columns. One column, Attrition, is the response variable; the remaining columns are candidate feature variables. They include demographic information such as age and gender, job satisfaction and environment satisfaction, education field, job role, income, overtime, percentage salary hike, tenure, training time, years in the current role, relationship status, and several other parameters that may affect attrition and engagement.

The feature variables encompass various data types, including numeric, categorical, and ordinal data, which allows for a diverse set of analyses and modeling approaches.

Overall, this data set offers a broad collection of features for understanding and predicting employee attrition within an organization. By examining the relationships between variables, we can gain insights with practical implications for HR policies and practices.

A detailed description of the variables is given below:

Education 1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4 ‘Master’ 5 ‘Doctor’

EnvironmentSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobInvolvement 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

JobSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

PerformanceRating 1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4 ‘Outstanding’

RelationshipSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very High’

WorkLifeBalance 1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’
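The integer codes above can be turned into labeled ordinal variables when readable plots or tables are needed. The following is a minimal sketch on a toy vector (the values and the name `job_satisfaction_codes` are hypothetical, not taken from the data set); the same call could be applied to a column such as `JobSatisfaction` after loading the data.

```r
# Toy vector of coded survey responses (hypothetical values for illustration)
job_satisfaction_codes <- c(1, 4, 3, 2, 3)

# Map the integer codes 1-4 to the ordered labels listed above
job_satisfaction <- factor(job_satisfaction_codes,
                           levels = 1:4,
                           labels = c("Low", "Medium", "High", "Very High"),
                           ordered = TRUE)

# Frequency of each label, shown in the natural order Low < ... < Very High
table(job_satisfaction)
```

Keeping the factor ordered preserves the ranking of the levels, so comparisons such as `job_satisfaction >= "High"` remain meaningful.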

A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv.

EmployeeAttrition = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv")

Next, we apply binary coding to the response variable.

# Convert our response variable from 'Yes' and 'No' to 1 and 0
EmployeeAttrition$Attrition_num <- ifelse(EmployeeAttrition$Attrition == "Yes", 1, 0)

The EmployeeNumber column will be removed to protect the identity of employees and their sensitive information.

Additional columns have been identified for removal:

DailyRate, HourlyRate, and MonthlyRate have no clear interpretation. EmployeeCount, Over18, and StandardHours take the same value for every employee.

# Using subset() function to drop specific columns
EmployeeAttrition_drop <- subset(EmployeeAttrition, select = -c(EmployeeNumber, EmployeeCount, Over18, StandardHours, DailyRate, HourlyRate, MonthlyRate))
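The claim that some columns are constant can be checked programmatically rather than by inspection. The sketch below uses a small toy data frame (the values are illustrative assumptions, and `toy`, `is_constant`, and `constant_cols` are hypothetical names); applied to `EmployeeAttrition`, the same logic would list every column with a single unique value.

```r
# Toy data frame with one varying column and three constant columns
# (values are illustrative only)
toy <- data.frame(
  Age = c(41, 49, 37),
  EmployeeCount = c(1, 1, 1),
  Over18 = c("Y", "Y", "Y"),
  StandardHours = c(80, 80, 80),
  stringsAsFactors = FALSE
)

# A column is constant when it has exactly one unique value
is_constant <- sapply(toy, function(x) length(unique(x)) == 1)

# Names of the constant columns, which carry no information for modeling
constant_cols <- names(toy)[is_constant]
constant_cols
```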


3 EDA for Feature Engineering

We first scan the entire data set and determine the EDA tools to use for feature engineering.

summary(EmployeeAttrition_drop)
##       Age         Attrition         BusinessTravel      Department       
##  Min.   :18.00   Length:1470        Length:1470        Length:1470       
##  1st Qu.:30.00   Class :character   Class :character   Class :character  
##  Median :36.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :36.92                                                           
##  3rd Qu.:43.00                                                           
##  Max.   :60.00                                                           
##  DistanceFromHome   Education     EducationField     EnvironmentSatisfaction
##  Min.   : 1.000   Min.   :1.000   Length:1470        Min.   :1.000          
##  1st Qu.: 2.000   1st Qu.:2.000   Class :character   1st Qu.:2.000          
##  Median : 7.000   Median :3.000   Mode  :character   Median :3.000          
##  Mean   : 9.193   Mean   :2.913                      Mean   :2.722          
##  3rd Qu.:14.000   3rd Qu.:4.000                      3rd Qu.:4.000          
##  Max.   :29.000   Max.   :5.000                      Max.   :4.000          
##     Gender          JobInvolvement    JobLevel       JobRole         
##  Length:1470        Min.   :1.00   Min.   :1.000   Length:1470       
##  Class :character   1st Qu.:2.00   1st Qu.:1.000   Class :character  
##  Mode  :character   Median :3.00   Median :2.000   Mode  :character  
##                     Mean   :2.73   Mean   :2.064                     
##                     3rd Qu.:3.00   3rd Qu.:3.000                     
##                     Max.   :4.00   Max.   :5.000                     
##  JobSatisfaction MaritalStatus      MonthlyIncome   NumCompaniesWorked
##  Min.   :1.000   Length:1470        Min.   : 1009   Min.   :0.000     
##  1st Qu.:2.000   Class :character   1st Qu.: 2911   1st Qu.:1.000     
##  Median :3.000   Mode  :character   Median : 4919   Median :2.000     
##  Mean   :2.729                      Mean   : 6503   Mean   :2.693     
##  3rd Qu.:4.000                      3rd Qu.: 8379   3rd Qu.:4.000     
##  Max.   :4.000                      Max.   :19999   Max.   :9.000     
##    OverTime         PercentSalaryHike PerformanceRating
##  Length:1470        Min.   :11.00     Min.   :3.000    
##  Class :character   1st Qu.:12.00     1st Qu.:3.000    
##  Mode  :character   Median :14.00     Median :3.000    
##                     Mean   :15.21     Mean   :3.154    
##                     3rd Qu.:18.00     3rd Qu.:3.000    
##                     Max.   :25.00     Max.   :4.000    
##  RelationshipSatisfaction StockOptionLevel TotalWorkingYears
##  Min.   :1.000            Min.   :0.0000   Min.   : 0.00    
##  1st Qu.:2.000            1st Qu.:0.0000   1st Qu.: 6.00    
##  Median :3.000            Median :1.0000   Median :10.00    
##  Mean   :2.712            Mean   :0.7939   Mean   :11.28    
##  3rd Qu.:4.000            3rd Qu.:1.0000   3rd Qu.:15.00    
##  Max.   :4.000            Max.   :3.0000   Max.   :40.00    
##  TrainingTimesLastYear WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :0.000         Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :3.000         Median :3.000   Median : 5.000   Median : 3.000    
##  Mean   :2.799         Mean   :2.761   Mean   : 7.008   Mean   : 4.229    
##  3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.: 9.000   3rd Qu.: 7.000    
##  Max.   :6.000         Max.   :4.000   Max.   :40.000   Max.   :18.000    
##  YearsSinceLastPromotion YearsWithCurrManager Attrition_num   
##  Min.   : 0.000          Min.   : 0.000       Min.   :0.0000  
##  1st Qu.: 0.000          1st Qu.: 2.000       1st Qu.:0.0000  
##  Median : 1.000          Median : 3.000       Median :0.0000  
##  Mean   : 2.188          Mean   : 4.123       Mean   :0.1612  
##  3rd Qu.: 3.000          3rd Qu.: 7.000       3rd Qu.:0.0000  
##  Max.   :15.000          Max.   :17.000       Max.   :1.0000

There appear to be no obvious outliers. All numerical variables fall within a reasonable range.

The average age is about 37 years, with a minimum of 18 and a maximum of 60, so the workforce could range from new interns to senior managers. Some employees live very close to work and others live far away (DistanceFromHome ranges from 1 to 29). Monthly income also varies widely, from about USD 1,000 to about USD 20,000, with an average of about USD 6,500. This spread is plausible, since new interns earn less than senior managers. The average tenure at the company is 7 years; some employees have just started, while others have worked there for 40 years.

# checking the unique character variables.
unique(EmployeeAttrition_drop$Gender)
## [1] "Female" "Male"
unique(EmployeeAttrition_drop$BusinessTravel)
## [1] "Travel_Rarely"     "Travel_Frequently" "Non-Travel"
unique(EmployeeAttrition_drop$Department)
## [1] "Sales"                  "Research & Development" "Human Resources"
unique(EmployeeAttrition_drop$JobRole)
## [1] "Sales Executive"           "Research Scientist"       
## [3] "Laboratory Technician"     "Manufacturing Director"   
## [5] "Healthcare Representative" "Manager"                  
## [7] "Sales Representative"      "Research Director"        
## [9] "Human Resources"
unique(EmployeeAttrition_drop$MaritalStatus)
## [1] "Single"   "Married"  "Divorced"
unique(EmployeeAttrition_drop$OverTime)
## [1] "Yes" "No"
unique(EmployeeAttrition_drop$Attrition)
## [1] "Yes" "No"
unique(EmployeeAttrition_drop$EducationField)
## [1] "Life Sciences"    "Other"            "Medical"          "Marketing"       
## [5] "Technical Degree" "Human Resources"

All of the categorical variables have consistent category labels.

3.1 Missing values

The data set appears to have no missing values, which simplifies the Exploratory Data Analysis (EDA) process. With a complete data set, we can focus on exploring relationships, identifying patterns, and gaining insights more effectively.
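A quick way to confirm the absence of missing values is to count `NA` entries per column. The sketch below runs on a toy data frame that deliberately contains missing values so the counts are visible (`toy` and `na_counts` are hypothetical names); on the real data one would expect `colSums(is.na(EmployeeAttrition_drop))` to return all zeros. Note that `read.csv()` keeps blank strings in character columns as `""` rather than `NA`, so those would need a separate check.

```r
# Toy data frame with deliberately missing entries (illustrative values)
toy <- data.frame(
  Age = c(41, NA, 37),
  Department = c("Sales", "Sales", NA),
  MonthlyIncome = c(5993, 5130, 2090),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# TRUE if at least one value is missing anywhere in the data frame
anyNA(toy)
```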

The data cleaning process is complete.

numeric_vars <- sapply(EmployeeAttrition_drop, is.numeric)
numeric_data <- EmployeeAttrition_drop[, numeric_vars]

3.2 Assess Distributions

This subsection focuses on the potential discretization of continuous variables and the grouping of sparse categories of categorical variables based on their distributions.

3.2.1 Discretizing Continuous Variables

We group Age, which ranges from 18 to 60, into three subgroups, and likewise group DistanceFromHome into three subgroups. The grouped variables grp.age and grp.dist, together with the binary outcome Attrition_num, stand in for Age, DistanceFromHome, and Attrition to simplify the graphical analysis.

# Age groups: 18-30, 31-49, 50-60 (Age is recorded in whole years)
EmployeeAttrition_drop$grp.age <- ifelse(EmployeeAttrition_drop$Age <= 30, '18-30',
               ifelse(EmployeeAttrition_drop$Age >= 50, '50-60', '31-49'))
# Distance groups: 1-10, 11-19, 20-29 (DistanceFromHome is recorded in whole units)
EmployeeAttrition_drop$grp.dist <- ifelse(EmployeeAttrition_drop$DistanceFromHome <= 10, '1-10',
               ifelse(EmployeeAttrition_drop$DistanceFromHome >= 20, '20-29', '11-19'))
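As an alternative to nested `ifelse()` calls, base R's `cut()` can produce the same three groups in one step and returns a factor with the levels in their natural order. This is a sketch on a toy vector, assuming ages are whole numbers as in this data set; the labels and the names `age` and `grp_age` are illustrative choices.

```r
# Toy ages covering the boundaries of each group (illustrative values)
age <- c(18, 30, 31, 49, 50, 60)

# Intervals are [18, 30], (30, 49], (49, 60]; for whole-number ages this
# matches the rule: <= 30, between 31 and 49, >= 50
grp_age <- cut(age,
               breaks = c(18, 30, 49, 60),
               labels = c("18-30", "31-49", "50-60"),
               include.lowest = TRUE)

# Count of observations per age group, in increasing order of age
table(grp_age)
```

Because the result is a factor with ordered levels, plots such as mosaic plots display the groups from youngest to oldest instead of in alphabetical order.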

3.3 Pairwise associations

Pairwise association refers to the relationship between a pair of variables in a data set, that is, how the values of the two variables co-occur or change together. There are three types of pairwise association: between two numeric variables, between two categorical variables, and between a numeric and a categorical variable.

3.3.1 Two numeric variables

The best visual tool for assessing pairwise linear association between two numeric variables is a pair-wise scatter plot.

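# ggpairs() is provided by the GGally package
library(GGally)

# Selecting specific columns: Age, DistanceFromHome, MonthlyIncome,
# NumCompaniesWorked, PercentSalaryHike, TotalWorkingYears
selected_columns <- c(1, 5, 15, 16, 18, 22)

# Creating ggpairs plot with selected columns
ggpairs(EmployeeAttrition_drop,
        columns = selected_columns,
        aes(color = Attrition, alpha = 0.5))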

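# Selecting specific columns: TrainingTimesLastYear, YearsAtCompany,
# YearsInCurrentRole, YearsSinceLastPromotion, YearsWithCurrManager
selected_columns_2 <- c(23, 25, 26, 27, 28)

# Creating ggpairs plot with selected columns
ggpairs(EmployeeAttrition_drop,
        columns = selected_columns_2,
        aes(color = Attrition, alpha = 0.5))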

The off-diagonal plots and numbers indicate the correlation between each pair of numeric variables. As expected, YearsAtCompany is strongly correlated with YearsWithCurrManager, YearsInCurrentRole, TotalWorkingYears, and YearsSinceLastPromotion, and YearsWithCurrManager is strongly correlated with YearsInCurrentRole. The other pairs are only weakly correlated.

The stacked density curves on the main diagonal show how the distribution of each numeric variable differs between the attrition and no-attrition groups. In other words, they show the relationship between a numeric variable and the categorical response. The curves do not overlap completely, which indicates some association between each of these numeric variables and the binary response variable.

3.3.2 Two Categorical variables

Mosaic plots are convenient to show whether two categorical variables are dependent. In EDA, we are primarily interested in whether the response (binary in this case) is independent of categorical variables. Those categorical variables that are independent of the response variable should be excluded in any of the subsequent models and algorithms.

par(mfrow = c(2,2))
mosaicplot(BusinessTravel ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="BusinessTravel vs Attrition")
mosaicplot(Department ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Department vs Attrition")
mosaicplot(EducationField ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="EducationField vs Attrition")
mosaicplot(JobRole ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="JobRole vs Attrition")

The top two mosaic plots show that Attrition is not independent of BusinessTravel and Department, because the proportion of attrition cases differs across categories. Employees who travel frequently have the highest attrition rate, whereas non-travel employees have the lowest. The bottom two mosaic plots show that Attrition is also not independent of EducationField and JobRole, for the same reason.
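The proportions that a mosaic plot displays can also be tabulated directly. The sketch below computes the attrition rate within each travel category on a toy data frame (the eight rows are made up for illustration, and `toy`, `counts`, and `rates` are hypothetical names); the same two calls could be run on `EmployeeAttrition_drop`.

```r
# Toy data: four non-travel and four frequent-travel employees (made-up rows)
toy <- data.frame(
  BusinessTravel = c(rep("Non-Travel", 4), rep("Travel_Frequently", 4)),
  Attrition = c("No", "No", "No", "Yes", "No", "Yes", "Yes", "Yes"),
  stringsAsFactors = FALSE
)

# Cross-tabulate travel category (rows) against attrition (columns)
counts <- table(toy$BusinessTravel, toy$Attrition)

# Row-wise proportions: each row sums to 1, so the "Yes" column is the
# attrition rate within that travel category
rates <- prop.table(counts, margin = 1)
rates
```

If the response were independent of the categorical variable, the "Yes" column would be roughly the same in every row.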

par(mfrow = c(2,3))
mosaicplot(OverTime ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="OverTime vs Attrition")
mosaicplot(MaritalStatus ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="MaritalStatus vs Attrition")
mosaicplot(Gender ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Gender vs Attrition")
mosaicplot(grp.age ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Agegroup vs Attrition")
mosaicplot(grp.dist ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="DistanceFromHome vs Attrition")

The first two mosaic plots in the top row show that Attrition is associated with OverTime and MaritalStatus: employees who work overtime and employees who are single have higher attrition rates. The bottom two mosaic plots show that Attrition is not independent of age group and distance from home, because the proportion of attrition cases differs across categories. The younger age group has a higher attrition rate than the older age groups, and the attrition rate increases with distance from the workplace. Lastly, Gender appears to have little influence on attrition.
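The visual impression of dependence can be complemented with a formal test. The following sketch applies Pearson's chi-squared test of independence to a toy 2x2 table of counts (the counts are invented for illustration and `counts` and `test` are hypothetical names; this test is a suggested supplement, not part of the original analysis).

```r
# Toy contingency table of OverTime (rows) by Attrition (columns);
# the counts are invented for illustration
counts <- matrix(c(90, 10,
                   60, 40),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(OverTime = c("No", "Yes"),
                                 Attrition = c("No", "Yes")))

# Pearson's chi-squared test of independence (for 2x2 tables, chisq.test()
# applies Yates' continuity correction by default)
test <- chisq.test(counts)

# A small p-value is evidence against independence of the two variables
test$p.value
```

On the real data, the input could be built with `table(EmployeeAttrition_drop$OverTime, EmployeeAttrition_drop$Attrition)`.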

# Calculate the correlation matrix
cor_matrix <- cor(numeric_data)

# You can also plot the correlation matrix for better visualization
library(corrplot)
corrplot(cor_matrix, method = "color", tl.col = "black")

4 Concluding Remarks

We will remove grp.age and grp.dist, as well as the binary outcome variable Attrition_num. We will also remove YearsAtCompany and TotalWorkingYears, because they are highly correlated with other variables (correlation coefficient > 0.75). In the final cleaned data set, the remaining feature variables are correlated to varying degrees, but none strongly enough to warrant removal. Therefore, we will keep them for further analysis.
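Pairs exceeding the 0.75 threshold can be extracted from a correlation matrix instead of being read off a plot. The sketch below uses a small deterministic toy data frame (values constructed for illustration; `toy`, `high`, and `high_pairs` are hypothetical names) and lists each highly correlated pair once.

```r
# Toy data: the second column closely tracks the first, the third does not
z <- rep(c(1, -1), times = 5)
toy <- data.frame(
  YearsAtCompany = 1:10,
  YearsWithCurrManager = 1:10 + 0.5 * z,
  PercentSalaryHike = 15 + z
)

# Pearson correlation matrix of the numeric columns
cor_matrix <- cor(toy)

# Positions above the diagonal with |r| > 0.75, so each pair appears once
high <- which(abs(cor_matrix) > 0.75 & upper.tri(cor_matrix), arr.ind = TRUE)

# Report variable names and the corresponding correlation coefficient
high_pairs <- data.frame(
  var1 = rownames(cor_matrix)[high[, "row"]],
  var2 = colnames(cor_matrix)[high[, "col"]],
  r = cor_matrix[high]
)
high_pairs
```

From each flagged pair, one variable can then be dropped, for example with `subset(..., select = -c(...))` as used earlier in this report.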