In the realm of People Analytics, the significance of a data set created by IBM for attrition modeling cannot be overstated. This data set presents a valuable resource for addressing critical questions related to employee turnover and engagement. With 1470 rows and 35 columns, it offers a wealth of information that encompasses various aspects of employees’ professional and personal lives. The data set combines the typical Human Resources Information System (HRIS) data with a comprehensive engagement survey, providing a holistic view of the employees’ experiences and sentiments.
The main objective of this analysis is to understand and predict employee turnover within an organization. By examining the factors that contribute to attrition, we aim to uncover valuable insights that can guide HR strategies and improve employee retention rates. Moreover, this data set offers a unique opportunity to identify differences between the group of employees who chose to stay and those who decided to leave the organization.
The data set was created and made available by IBM, a renowned leader in the tech industry. As a reputable source, the data is expected to be reliable and well-structured, enabling meaningful analysis and inference. While the specific details of data collection methods are not provided, the comprehensive nature of the data set suggests that it was meticulously curated to capture a wide range of attributes relevant to employee turnover and engagement.
Sample Size and Feature Variables: The data set consists of 1470 rows, representing individual employee records. Each row contains 35 columns, making these columns the feature variables for analysis. The feature variables include demographic information such as age and gender, factors related to job satisfaction and environment satisfaction, education field, job role, income, overtime, percentage salary hike, tenure, training time, years in the current role, relationship status, and several other parameters that may impact attrition and engagement.
The feature variables encompass various data types, including numeric, categorical, and ordinal data, which allows for a diverse set of analyses and modeling approaches.
In conclusion, this data set offers a comprehensive collection of features that are crucial for understanding and predicting employee attrition within an organization. By delving into the relationships between different variables, we can gain valuable insights that have practical implications for HR policies and practices.
A detailed description of the variables is given below:
Education 1 ‘Below College’ 2 ‘College’ 3 ‘Bachelor’ 4
‘Master’ 5 ‘Doctor’
EnvironmentSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4
‘Very High’
JobInvolvement 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very
High’
JobSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4 ‘Very
High’
PerformanceRating 1 ‘Low’ 2 ‘Good’ 3 ‘Excellent’ 4
‘Outstanding’
RelationshipSatisfaction 1 ‘Low’ 2 ‘Medium’ 3 ‘High’ 4
‘Very High’
WorkLifeBalance 1 ‘Bad’ 2 ‘Good’ 3 ‘Better’ 4 ‘Best’
A copy of this publicly available data is stored at https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv.
EmployeeAttrition = read.csv("https://raw.githubusercontent.com/Tenam01/DATASETS/main/EmployeeAttritionData.csv")
Applying binary coding.
# Convert our response variable from 'Yes' and 'No' to 1 and 0
EmployeeAttrition$Attrition_num <- ifelse(EmployeeAttrition$Attrition == "Yes", 1, 0)
Column Employee Number will be removed to protect the identity of employees and their sensitive information.
Additional columns have been identified for removal:
DailyRate, HourlyRate, and MonthlyRate are inexplicable. EmployeeCount, Over18, and StandardHours are uniform for all employees.
# Using subset() function to drop specific columns
EmployeeAttrition_drop <- subset(EmployeeAttrition, select = -c(EmployeeNumber, EmployeeCount, Over18, StandardHours, DailyRate, HourlyRate, MonthlyRate))
We first scan the entire data set and determine the EDA tools to use for feature engineering.
summary(EmployeeAttrition_drop)
## Age Attrition BusinessTravel Department
## Min. :18.00 Length:1470 Length:1470 Length:1470
## 1st Qu.:30.00 Class :character Class :character Class :character
## Median :36.00 Mode :character Mode :character Mode :character
## Mean :36.92
## 3rd Qu.:43.00
## Max. :60.00
## DistanceFromHome Education EducationField EnvironmentSatisfaction
## Min. : 1.000 Min. :1.000 Length:1470 Min. :1.000
## 1st Qu.: 2.000 1st Qu.:2.000 Class :character 1st Qu.:2.000
## Median : 7.000 Median :3.000 Mode :character Median :3.000
## Mean : 9.193 Mean :2.913 Mean :2.722
## 3rd Qu.:14.000 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :29.000 Max. :5.000 Max. :4.000
## Gender JobInvolvement JobLevel JobRole
## Length:1470 Min. :1.00 Min. :1.000 Length:1470
## Class :character 1st Qu.:2.00 1st Qu.:1.000 Class :character
## Mode :character Median :3.00 Median :2.000 Mode :character
## Mean :2.73 Mean :2.064
## 3rd Qu.:3.00 3rd Qu.:3.000
## Max. :4.00 Max. :5.000
## JobSatisfaction MaritalStatus MonthlyIncome NumCompaniesWorked
## Min. :1.000 Length:1470 Min. : 1009 Min. :0.000
## 1st Qu.:2.000 Class :character 1st Qu.: 2911 1st Qu.:1.000
## Median :3.000 Mode :character Median : 4919 Median :2.000
## Mean :2.729 Mean : 6503 Mean :2.693
## 3rd Qu.:4.000 3rd Qu.: 8379 3rd Qu.:4.000
## Max. :4.000 Max. :19999 Max. :9.000
## OverTime PercentSalaryHike PerformanceRating
## Length:1470 Min. :11.00 Min. :3.000
## Class :character 1st Qu.:12.00 1st Qu.:3.000
## Mode :character Median :14.00 Median :3.000
## Mean :15.21 Mean :3.154
## 3rd Qu.:18.00 3rd Qu.:3.000
## Max. :25.00 Max. :4.000
## RelationshipSatisfaction StockOptionLevel TotalWorkingYears
## Min. :1.000 Min. :0.0000 Min. : 0.00
## 1st Qu.:2.000 1st Qu.:0.0000 1st Qu.: 6.00
## Median :3.000 Median :1.0000 Median :10.00
## Mean :2.712 Mean :0.7939 Mean :11.28
## 3rd Qu.:4.000 3rd Qu.:1.0000 3rd Qu.:15.00
## Max. :4.000 Max. :3.0000 Max. :40.00
## TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
## Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
## 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
## Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
## Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229
## 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
## Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
## YearsSinceLastPromotion YearsWithCurrManager Attrition_num
## Min. : 0.000 Min. : 0.000 Min. :0.0000
## 1st Qu.: 0.000 1st Qu.: 2.000 1st Qu.:0.0000
## Median : 1.000 Median : 3.000 Median :0.0000
## Mean : 2.188 Mean : 4.123 Mean :0.1612
## 3rd Qu.: 3.000 3rd Qu.: 7.000 3rd Qu.:0.0000
## Max. :15.000 Max. :17.000 Max. :1.0000
There seems to be no apparent outliers.All the numerical variables seem to be in a reasonable range.
The average age seems to be around 37 years, the highest age seems to be 60 and the lowest age seems to be 18. This dynamic age group could range from new interns all the way to senior managers. There are people who live very close to work and some live far away from the work place Salaries tend to fluctuate a lot as well with a average monthly income of USD 6,500. The salaries can range anywhere from lowest USD 1,000 to as high as USD 20,000. This explains as new interns get paid less whereas the senior managers make a lot. The average number of years worked as an employee is 7 years, where there are employees who have worked for 40 years and as well as some who just started working.
# checking the unique character variables.
unique(EmployeeAttrition_drop$Gender)
## [1] "Female" "Male"
unique(EmployeeAttrition_drop$BusinessTravel)
## [1] "Travel_Rarely" "Travel_Frequently" "Non-Travel"
unique(EmployeeAttrition_drop$Department)
## [1] "Sales" "Research & Development" "Human Resources"
unique(EmployeeAttrition_drop$JobRole)
## [1] "Sales Executive" "Research Scientist"
## [3] "Laboratory Technician" "Manufacturing Director"
## [5] "Healthcare Representative" "Manager"
## [7] "Sales Representative" "Research Director"
## [9] "Human Resources"
unique(EmployeeAttrition_drop$MaritalStatus)
## [1] "Single" "Married" "Divorced"
unique(EmployeeAttrition_drop$OverTime)
## [1] "Yes" "No"
unique(EmployeeAttrition_drop$Attrition)
## [1] "Yes" "No"
unique(EmployeeAttrition_drop$EducationField)
## [1] "Life Sciences" "Other" "Medical" "Marketing"
## [5] "Technical Degree" "Human Resources"
All of the categorical characters are consistent.
Since the data set appears to have no missing values, it simplifies the Exploratory Data Analysis (EDA) process. With a complete data set, we can focus on exploring relationships, identifying patterns, and gaining insights more effectively.
The data cleaning process is complete.
numeric_vars <- sapply(EmployeeAttrition_drop, is.numeric)
numeric_data <- EmployeeAttrition_drop[, numeric_vars]
This subsection focuses on the potential discretization of continuous variables and grouping sparse categories of category variables based on their distribution.
We will group age into three sub groups ranging from 18 to 60 and also group the distance from home into three subgroups. We will replace Age, DistanceFromHome, and Attrition feature variables and replace them with the modified grouped variable grp.age and grp.dist also binary variable outcome Attrition_num for easy graphical approach.
EmployeeAttrition_drop$grp.age <- ifelse(EmployeeAttrition_drop$Age <= 30, '(18, 30)',
ifelse(EmployeeAttrition_drop$Age >= 50, '(50, 60)', '[30,50]'))
EmployeeAttrition_drop$grp.dist <- ifelse(EmployeeAttrition_drop$DistanceFromHome <= 10, '(1, 10)',
ifelse(EmployeeAttrition_drop$DistanceFromHome >= 20, '(20, 30)', '[10,20]'))
Pairwise association refers to the examination of relationships between pairs of variables in a data set. It involves analyzing how the values of two variables co-occur or change together. There are three different types of pairwise associations.
The best visual tool for assessing pairwise linear association between two numeric variables is a pair-wise scatter plot.
# Selecting specific columns
selected_columns <- c(1, 5, 15, 16, 18, 22)
# Creating ggpairs plot with selected columns
ggpairs(EmployeeAttrition_drop,
columns = selected_columns,
aes(color = Attrition, alpha = 0.5))
# Selecting specific columns
selected_columnss <- c(23, 25, 26,27,28)
# Creating ggpairs plot with selected columns
ggpairs(EmployeeAttrition_drop,
columns = selected_columnss,
aes(color = Attrition, alpha = 0.5))
The off-diagonal plots and numbers indicate the correlation between
the pair-wise numeric variables. As expected,
YearsWithCurrManager and YearsAtCompany are
significantly correlated, YearsWithCurrManager and
YearsInCurrentRole are significantly correlated,
TotalWorkingYears and YearsAtCompany are
significantly correlated, YearsInCurrentRole and
YearsAtCompany are significantly correlated, and
YearsSinceLastPromotion and YearsAtCompany are
significantly correlated. Other paired variables have weak
correlations.
The main diagonal stacked density curves show the potential difference in the distribution of the underlying numeric variable in Attrition and no attrition groups. This means that the stacked density curves show the relation between numeric and categorical variables. These stacked density curves are not completely overlapped indicating somewhat correlation between each of these numeric variables and the binary response variable.
Mosaic plots are convenient to show whether two categorical variables are dependent. In EDA, we are primarily interested in whether the response (binary in this case) is independent of categorical variables. Those categorical variables that are independent of the response variable should be excluded in any of the subsequent models and algorithms.
par(mfrow = c(2,2))
mosaicplot(BusinessTravel ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="BusinessTravel vs Attrition")
mosaicplot(Department ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Department vs Attrition")
mosaicplot(EducationField ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="EducationField vs Attrition")
mosaicplot(JobRole ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="JobRole vs Attrition")
The top two mosaic plots demonstrate show that Attrition is not independent of Business Travel and Department because the proportion of Attrition cases in individual categories is not identical. In which employees traveling frequently have the highest attrition rate whereas non-travel employees has the least attrition rate. The bottom two mosaic plots also show that Attrition is not independent of Education field and Job Role because the proportion of Attrition cases in individual categories is not identical.
par(mfrow = c(2,3))
mosaicplot(OverTime ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="OverTime vs Attrition")
mosaicplot(MaritalStatus ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="MaritalStatus vs Attrition")
mosaicplot(Gender ~ Attrition, data=EmployeeAttrition,col=c("Blue","Red"), main="Gender vs Attrition")
mosaicplot(grp.age ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="Agegroup vs Attrition")
mosaicplot(grp.dist ~ Attrition, data=EmployeeAttrition_drop,col=c("Blue","Red"), main="DistanceFromHome vs Attrition")
The top left two mosaic plots demonstrate the positive association between overtime and marital status readings. In which employees being single and working a lot of overtime shows increasing rate of employee attrition. The bottom two mosaic plots also show that Attrition is not independent of age group and distance from home because the proportion of Attrition cases in individual categories is not identical. Here, the younger age group have a higher attrition rate than older age group. Also further away from the work place shows increase in the rate of attrition. Lastly, as for the Gender, there seems to be not much of an influence.
# Calculate the correlation matrix
cor_matrix <- cor(numeric_data)
# You can also plot the correlation matrix for better visualization
library(corrplot)
corrplot(cor_matrix, method = "color", tl.col = "black")
We will be removing grp.age and grp.dist also binary variable outcome Attrition_num. We will aslo remove YearsAtCompany, and TotalWorkingYears, as they highly correlated with correlation coefficient > .75. For our final cleaned data set, most feature variables have very high or very low correlation but not high or low enough to be removed. Therefore, we will keep them for further analysis.