The goal of this project is to develop Machine Learning predictive models that leverage employee demographic information, job-related insights, performance metrics, and other key factors to enable the Human Resources (HR) department to make well-informed, data-driven decisions.
The two main objectives are:
1. Classification: predict whether an employee will leave (Attrition).
2. Regression: predict how long (in years) an employee stays in the company (YearsAtCompany).
Year: Uploaded Aug 2024 (data from 2012-2022).
Purpose of dataset: To analyze employee attrition based on key factors, including performance trends, that influence retention.
Data Set Content and Dimensions:
Total of 5 datasets: Employee.csv, PerformanceRating.csv, EducationLevel.csv, RatingLevel.csv, and SatisfiedLevel.csv.
During the data cleaning process, the following tasks were addressed.
First, all the required packages were installed and loaded:
1. tidyr: To organize data for analysis and visualization
2. caret: For data preprocessing, model training, hyperparameter tuning, and model evaluation
3. dplyr: For data manipulation and transformation
4. ggplot2: For data visualization
5. TTR: For technical analysis functions; used here for the exponential moving average (EMA)
6. randomForest: For Random Forest machine learning tasks
7. glmnet: For fitting generalized linear models (GLMs) with regularization techniques
8. rpart: For building decision trees using recursive partitioning, for both classification and regression tasks
9. rpart.plot: For visualizing decision trees
10. lubridate: For manipulating dates and times
11. caTools: For data partitioning, model evaluation, and performance metrics
12. DMwR2: For data mining and machine learning utilities
13. smotefamily: For resampling techniques on class-imbalanced data
14. xgboost: For XGBoost models
15. grid: For organizing complex visualizations
16. corrplot: To visualize correlation matrices in a clear and concise manner
17. Metrics: For evaluating machine learning model performance using metrics such as RMSE, MAE, and R², providing insights into model accuracy and error
# install.packages("smotefamily")
# install.packages("igraph")
library(tidyr)
library(caret)
library(dplyr)
library(ggplot2)
library(TTR)
library(randomForest)
library(glmnet)
library(rpart)
library(rpart.plot)
library(lubridate)
library(caTools)
library(DMwR2)
library(smotefamily)
library(xgboost)
library(grid)
library(corrplot)
library(Metrics)
Next, the CSV datasets were read into R data frames.
employee <- read.csv('dataset/Employee.csv', header = TRUE)
performance <- read.csv('dataset/PerformanceRating.csv', header = TRUE)
edu_lvl <- read.csv('dataset/EducationLevel.csv', header = TRUE)
rating_lvl <- read.csv('dataset/RatingLevel.csv', header = TRUE)
satisified_lvl <- read.csv('dataset/SatisfiedLevel.csv', header = TRUE)
# Inspect the structure of each data frame
str(employee)
## 'data.frame': 1470 obs. of 23 variables:
## $ EmployeeID : chr "3012-1A41" "CBCB-9C9D" "95D7-1CE9" "47A0-559B" ...
## $ FirstName : chr "Leonelle" "Leonerd" "Ahmed" "Ermentrude" ...
## $ LastName : chr "Simco" "Aland" "Sykes" "Berrie" ...
## $ Gender : chr "Female" "Male" "Male" "Non-Binary" ...
## $ Age : int 30 38 43 39 29 34 42 40 38 31 ...
## $ BusinessTravel : chr "Some Travel" "Some Travel" "Some Travel" "Some Travel" ...
## $ Department : chr "Sales" "Sales" "Human Resources" "Technology" ...
## $ DistanceFromHome..KM. : int 27 23 29 12 29 30 45 3 20 4 ...
## $ State : chr "IL" "CA" "CA" "IL" ...
## $ Ethnicity : chr "White" "White" "Asian or Asian American" "White" ...
## $ Education : int 5 4 4 3 2 2 3 2 4 2 ...
## $ EducationField : chr "Marketing" "Marketing" "Marketing " "Computer Science" ...
## $ JobRole : chr "Sales Executive" "Sales Executive" "HR Business Partner" "Engineering Manager" ...
## $ MaritalStatus : chr "Divorced" "Single" "Married" "Married" ...
## $ Salary : int 102059 157718 309964 293132 49606 133468 259284 104426 147098 69747 ...
## $ StockOptionLevel : int 1 0 1 0 0 1 1 1 1 0 ...
## $ OverTime : chr "No" "Yes" "No" "No" ...
## $ HireDate : chr "2012-01-03" "2012-01-04" "2012-01-04" "2012-01-05" ...
## $ Attrition : chr "No" "No" "No" "No" ...
## $ YearsAtCompany : int 10 10 10 10 6 10 10 10 10 6 ...
## $ YearsInMostRecentRole : int 4 6 6 10 1 3 2 3 5 5 ...
## $ YearsSinceLastPromotion: int 9 10 10 10 1 7 6 4 8 5 ...
## $ YearsWithCurrManager : int 7 0 8 0 6 9 6 6 2 1 ...
str(performance)
## 'data.frame': 6709 obs. of 11 variables:
## $ PerformanceID : chr "PR01" "PR02" "PR03" "PR04" ...
## $ EmployeeID : chr "79F7-78EC" "B61E-0F26" "F5E3-48BB" "0678-748A" ...
## $ ReviewDate : chr "1/2/2013" "1/3/2013" "1/3/2013" "1/4/2013" ...
## $ EnvironmentSatisfaction : int 5 5 3 5 5 3 3 4 4 5 ...
## $ JobSatisfaction : int 4 4 4 3 2 3 4 5 5 4 ...
## $ RelationshipSatisfaction : int 5 4 5 2 3 2 5 4 2 3 ...
## $ TrainingOpportunitiesWithinYear: int 1 1 3 2 1 2 2 1 1 2 ...
## $ TrainingOpportunitiesTaken : int 0 3 2 0 0 0 1 1 1 3 ...
## $ WorkLifeBalance : int 4 4 3 2 4 4 5 3 4 4 ...
## $ SelfRating : int 4 4 5 3 4 4 4 3 5 5 ...
## $ ManagerRating : int 4 3 4 2 3 4 3 2 4 4 ...
str(edu_lvl)
## 'data.frame': 5 obs. of 2 variables:
## $ EducationLevelID: int 1 2 3 4 5
## $ EducationLevel : chr "No Formal Qualifications" "High School " "Bachelors " "Masters " ...
str(rating_lvl)
## 'data.frame': 5 obs. of 2 variables:
## $ RatingID : int 1 2 3 4 5
## $ RatingLevel: chr "Unacceptable" "Needs Improvement" "Meets Expectation" "Exceeds Expectation " ...
str(satisified_lvl)
## 'data.frame': 5 obs. of 2 variables:
## $ SatisfactionID : int 1 2 3 4 5
## $ SatisfactionLevel: chr "Very Dissatisfied" "Dissatisfied" "Neutral" "Satisfied " ...
Then, check for missing values and duplicates in the employee and performance dataframes:
# Count total NA values in the dataframe
total_na_count1 <- sum(is.na(employee))
total_na_count2 <- sum(is.na(performance))
print(paste(total_na_count1, "missing values in employee df"))
## [1] "0 missing values in employee df"
print(paste(total_na_count2, "missing values in performance df"))
## [1] "0 missing values in performance df"
# Check for duplicates
duplicate1 <- sum(duplicated(employee))
duplicate2 <- sum(duplicated(performance))
print(paste(duplicate1, "duplicate rows in employee df"))
## [1] "0 duplicate rows in employee df"
print(paste(duplicate2, "duplicate rows in performance df"))
## [1] "0 duplicate rows in performance df"
Check the unique values in the categorical columns.
# Print all the unique values
uniquegender <- unique(employee$Gender)
print(uniquegender)
## [1] "Female" "Male" "Non-Binary"
## [4] "Prefer Not To Say"
uniquetravel <- unique(employee$BusinessTravel)
print(uniquetravel)
## [1] "Some Travel" "No Travel " "Frequent Traveller"
To ease preprocessing of the date columns, their datatypes were converted from character to Date.
# Date transformation
# Convert date from chr to date
performance$ReviewDate <- as.Date(performance$ReviewDate, format = "%m/%d/%Y")
employee$HireDate <- as.Date(employee$HireDate, format = "%Y-%m-%d")
# print(class(performance$ReviewDate))
# print(class(employee$HireDate))
# Arrange based on employee ID and review date
performance_sorted <- performance %>% arrange(EmployeeID,ReviewDate)
Then, to aggregate the satisfaction ratings across multiple aspects, an OverallSatisfaction column was calculated and added to performance_overall.
# Calculate overall job satisfaction of employee towards workplace
performance_overall <- performance_sorted %>%
mutate(OverallSatisfaction = (EnvironmentSatisfaction + JobSatisfaction + RelationshipSatisfaction + WorkLifeBalance)/4) %>%
select(EmployeeID,ReviewDate,EnvironmentSatisfaction, JobSatisfaction, RelationshipSatisfaction, WorkLifeBalance, TrainingOpportunitiesWithinYear,TrainingOpportunitiesTaken,SelfRating,ManagerRating,OverallSatisfaction) %>%
arrange(EmployeeID,ReviewDate)
One-hot encoding was performed to convert categorical data into numerical form for use in machine learning algorithms.
# One Hot Encoding
dummy <- dummyVars("~ Gender + BusinessTravel + Department + State + Ethnicity + EducationField + JobRole + MaritalStatus", data = employee)
employeenew <- predict(dummy, newdata = employee)
employeenew <- as.data.frame(employeenew)
employeenew <- cbind(employee, employeenew)
Binary encoding was performed on categorical columns with "Yes" and "No" values.
# Transforming "yes" "no" to binary (1 = yes, 0 = no)
employeenew <- employeenew %>%
mutate(Attrition = ifelse(Attrition == "Yes", 1, 0)) %>%
mutate(OverTime = ifelse(OverTime == "Yes", 1, 0))
Min-max normalization was applied to rescale features with differing ranges onto a common 0-1 scale.
# Normalization
subset <- employeenew[, c("Age", "DistanceFromHome..KM.", "Salary")]
preprocess <- preProcess(subset, method = "range")
normalizeddata <- predict(preprocess, newdata = subset)
colnames(normalizeddata) <- c("AgeNormalized", "DistanceFromHomeNormalized", "SalaryNormalized")
employeenew <- cbind(employeenew, normalizeddata)
Then, to match performance reviews with employee data, the two datasets were merged using a left join, retaining every employee record, including those without a performance review.
# Perform a left join
employee_perf <- merge(employeenew, performance_overall, by = "EmployeeID", all.x = TRUE)
# View(employee_perf)
glimpse(employee_perf)
## Rows: 6,899
## Columns: 81
## $ EmployeeID <chr> "001A-8F88", "005C-E0FB", …
## $ FirstName <chr> "Christy", "Fin", "Fin", "…
## $ LastName <chr> "Jumel", "O'Halleghane", "…
## $ Gender <chr> "Male", "Non-Binary", "Non…
## $ Age <int> 22, 24, 24, 24, 30, 30, 30…
## $ BusinessTravel <chr> "Some Travel", "Frequent T…
## $ Department <chr> "Technology", "Sales", "Sa…
## $ DistanceFromHome..KM. <int> 40, 17, 17, 17, 6, 6, 6, 6…
## $ State <chr> "CA", "CA", "CA", "CA", "C…
## $ Ethnicity <chr> "White", "White", "White",…
## $ Education <int> 4, 4, 4, 4, 2, 2, 2, 2, 2,…
## $ EducationField <chr> "Information Systems", "Ma…
## $ JobRole <chr> "Software Engineer", "Sale…
## $ MaritalStatus <chr> "Married", "Married", "Mar…
## $ Salary <int> 27763, 56155, 56155, 56155…
## $ StockOptionLevel <int> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ OverTime <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ HireDate <date> 2021-09-05, 2017-08-26, 2…
## $ Attrition <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ YearsAtCompany <int> 1, 5, 5, 5, 10, 10, 10, 10…
## $ YearsInMostRecentRole <int> 0, 2, 2, 2, 3, 3, 3, 3, 3,…
## $ YearsSinceLastPromotion <int> 1, 2, 2, 2, 6, 6, 6, 6, 6,…
## $ YearsWithCurrManager <int> 0, 0, 0, 0, 6, 6, 6, 6, 6,…
## $ GenderFemale <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GenderMale <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `GenderNon-Binary` <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `GenderPrefer Not To Say` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelFrequent Traveller` <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `BusinessTravelNo Travel ` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelSome Travel` <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `DepartmentHuman Resources` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ DepartmentSales <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ DepartmentTechnology <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ StateCA <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ StateIL <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ StateNY <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAmerican Indian or Alaska Native` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAsian or Asian American` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityBlack or African American` <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `EthnicityMixed or multiple ethnic groups` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityNative Hawaiian ` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityOther ` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EthnicityWhite <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `EducationFieldBusiness Studies` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldComputer Science` <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ EducationFieldEconomics <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldHuman Resources` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldInformation Systems` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EducationFieldMarketing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldMarketing ` <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ EducationFieldOther <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldTechnical Degree` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleAnalytics Manager` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleData Scientist` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleEngineering Manager` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Business Partner` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Executive` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Manager` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleMachine Learning Engineer` <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ JobRoleManager <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleRecruiter <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSales Executive` <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `JobRoleSales Representative` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSenior Software Engineer` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSoftware Engineer` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusDivorced <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusMarried <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ MaritalStatusSingle <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AgeNormalized <dbl> 0.1212121, 0.1818182, 0.18…
## $ DistanceFromHomeNormalized <dbl> 0.8863636, 0.3636364, 0.36…
## $ SalaryNormalized <dbl> 0.01400107, 0.06789454, 0.…
## $ ReviewDate <date> NA, 2020-06-17, 2022-06-1…
## $ EnvironmentSatisfaction <int> NA, 3, 3, 4, 5, 3, 5, 5, 3…
## $ JobSatisfaction <int> NA, 3, 4, 4, 4, 3, 2, 2, 2…
## $ RelationshipSatisfaction <int> NA, 2, 5, 5, 2, 4, 4, 5, 2…
## $ WorkLifeBalance <int> NA, 2, 4, 5, 2, 2, 5, 5, 4…
## $ TrainingOpportunitiesWithinYear <int> NA, 1, 3, 1, 3, 3, 1, 3, 3…
## $ TrainingOpportunitiesTaken <int> NA, 2, 0, 1, 0, 1, 0, 0, 0…
## $ SelfRating <int> NA, 4, 4, 3, 3, 3, 3, 5, 3…
## $ ManagerRating <int> NA, 3, 4, 3, 2, 3, 3, 4, 3…
## $ OverallSatisfaction <dbl> NA, 2.50, 4.00, 4.50, 3.25…
For employees who have left the company (Attrition == 1), ExitDate was calculated by adding YearsAtCompany to HireDate.
# Create a column of exit date if employee has left the company.
# Ensure ExitDate is properly handled as a Date
employee_perf_clean <- employee_perf %>%
mutate(
ExitDate = if_else(
Attrition == 1,
as.Date(HireDate) + years(YearsAtCompany),
as.Date(NA) # Maintain NA as a Date
)
)
# str(employee_perf_clean)
There are 190 records of employees without a performance review entry; this indicates they may be new joiners who have not yet had their first review.
# Find rows where review date is NA
missing_review_date <- employee_perf_clean %>%
filter(is.na(ReviewDate))
# View the result
sum(is.na(employee_perf_clean$ReviewDate))
## [1] 190
There is dirty data whereby a performance review is dated before the employee joined the company or after the employee left. A total of 1,818 such rows were removed.
# Retaining only those with valid performance rating
checking <- employee_perf_clean %>%
filter(
is.na(ReviewDate) |
(ReviewDate >= HireDate &
(is.na(ExitDate) | ReviewDate <= ExitDate)) # ReviewDate >= HireDate and ReviewDate <= ExitDate (if ExitDate is not NA)
)
# str(checking)
# colSums(is.na(checking))
Identifying the earliest and latest hire dates helps pinpoint employees who haven’t yet undergone their initial performance review. These individuals might be recent hires or may not have completed the required tenure, such as one year, necessary for a performance evaluation.
# Filter rows where ReviewDate is NA and select HireDate
check1 <- checking %>% filter(is.na(ReviewDate)) %>% select(HireDate, YearsAtCompany)
# Calculate min and max HireDate from the 'check1' dataset
min_hire_date <- min(check1$HireDate, na.rm = TRUE)
max_hire_date <- max(check1$HireDate, na.rm = TRUE)
max_years <- max(check1$YearsAtCompany, na.rm = TRUE)
# Calculate max ReviewDate for the entire dataset
max_review_date <- max(employee_perf_clean$ReviewDate, na.rm = TRUE)
# Print results
print(paste("Minimum hire date with an NA:", min_hire_date))
## [1] "Minimum hire date with an NA: 2021-06-30"
print(paste("Maximum hire date with an NA:", max_hire_date))
## [1] "Maximum hire date with an NA: 2022-12-31"
print(paste("Check max years at company with an NA:" ,max_years))
## [1] "Check max years at company with an NA: 1"
print(paste("Check last review date:",max_review_date))
## [1] "Check last review date: 2022-12-31"
Next, the employee lifecycle is calculated. For employees without a ReviewDate (ReviewDate == NA), the latest review date in the dataset is substituted; any resulting Lifecycle of 0 is replaced with 1.
employee_perf_check <- checking %>%
mutate(
ReviewDate = ifelse(is.na(ReviewDate), max_review_date, ReviewDate),
ReviewDate = as.Date(ReviewDate),
Lifecycle = as.numeric(difftime(ReviewDate, HireDate, units = "days")) / 365,
Lifecycle = ifelse(ceiling(Lifecycle) == 0, 1, ceiling(Lifecycle))
)
Since employees with missing ratings are mostly at Lifecycle == 1, mean performance ratings for this group were calculated.
# Calculate mean for imputation where Lifecycle = 1
means1 <- employee_perf_check %>%
filter(Lifecycle == 1) %>%
summarise(
EnvironmentSatisfaction_mean = mean(EnvironmentSatisfaction, na.rm = TRUE),
JobSatisfaction_mean = mean(JobSatisfaction, na.rm = TRUE),
RelationshipSatisfaction_mean = mean(RelationshipSatisfaction, na.rm = TRUE),
WorkLifeBalance_mean = mean(WorkLifeBalance, na.rm = TRUE),
TrainingOpportunitiesWithinYear_mean = mean(TrainingOpportunitiesWithinYear, na.rm = TRUE),
TrainingOpportunitiesTaken_mean = mean(TrainingOpportunitiesTaken, na.rm = TRUE),
SelfRating_mean = mean(SelfRating, na.rm = TRUE),
ManagerRating_mean = mean(ManagerRating, na.rm = TRUE),
OverallSatisfaction_mean = mean(OverallSatisfaction, na.rm = TRUE)
)
For employees with missing performance ratings, mean imputation using the Lifecycle == 1 group means was performed.
# Perform mean imputation for missing values (NA) using Lifecycle = 1 means
employee_perf_check <- employee_perf_check %>%
mutate(
EnvironmentSatisfaction = ifelse(is.na(EnvironmentSatisfaction), means1$EnvironmentSatisfaction_mean, EnvironmentSatisfaction),
JobSatisfaction = ifelse(is.na(JobSatisfaction), means1$JobSatisfaction_mean, JobSatisfaction),
RelationshipSatisfaction = ifelse(is.na(RelationshipSatisfaction), means1$RelationshipSatisfaction_mean, RelationshipSatisfaction),
WorkLifeBalance = ifelse(is.na(WorkLifeBalance), means1$WorkLifeBalance_mean, WorkLifeBalance),
TrainingOpportunitiesWithinYear = ifelse(is.na(TrainingOpportunitiesWithinYear), means1$TrainingOpportunitiesWithinYear_mean, TrainingOpportunitiesWithinYear),
TrainingOpportunitiesTaken = ifelse(is.na(TrainingOpportunitiesTaken), means1$TrainingOpportunitiesTaken_mean, TrainingOpportunitiesTaken),
SelfRating = ifelse(is.na(SelfRating), means1$SelfRating_mean, SelfRating),
ManagerRating = ifelse(is.na(ManagerRating), means1$ManagerRating_mean, ManagerRating),
OverallSatisfaction = ifelse(is.na(OverallSatisfaction), means1$OverallSatisfaction_mean, OverallSatisfaction)
)
# View the resulting dataset
glimpse(employee_perf_check)
## Rows: 5,081
## Columns: 83
## $ EmployeeID <chr> "001A-8F88", "005C-E0FB", …
## $ FirstName <chr> "Christy", "Fin", "Fin", "…
## $ LastName <chr> "Jumel", "O'Halleghane", "…
## $ Gender <chr> "Male", "Non-Binary", "Non…
## $ Age <int> 22, 24, 24, 24, 30, 30, 30…
## $ BusinessTravel <chr> "Some Travel", "Frequent T…
## $ Department <chr> "Technology", "Sales", "Sa…
## $ DistanceFromHome..KM. <int> 40, 17, 17, 17, 6, 6, 6, 6…
## $ State <chr> "CA", "CA", "CA", "CA", "C…
## $ Ethnicity <chr> "White", "White", "White",…
## $ Education <int> 4, 4, 4, 4, 2, 2, 2, 2, 2,…
## $ EducationField <chr> "Information Systems", "Ma…
## $ JobRole <chr> "Software Engineer", "Sale…
## $ MaritalStatus <chr> "Married", "Married", "Mar…
## $ Salary <int> 27763, 56155, 56155, 56155…
## $ StockOptionLevel <int> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ OverTime <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ HireDate <date> 2021-09-05, 2017-08-26, 2…
## $ Attrition <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ YearsAtCompany <int> 1, 5, 5, 5, 10, 10, 10, 10…
## $ YearsInMostRecentRole <int> 0, 2, 2, 2, 3, 3, 3, 3, 3,…
## $ YearsSinceLastPromotion <int> 1, 2, 2, 2, 6, 6, 6, 6, 6,…
## $ YearsWithCurrManager <int> 0, 0, 0, 0, 6, 6, 6, 6, 6,…
## $ GenderFemale <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GenderMale <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `GenderNon-Binary` <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `GenderPrefer Not To Say` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelFrequent Traveller` <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `BusinessTravelNo Travel ` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelSome Travel` <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `DepartmentHuman Resources` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ DepartmentSales <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ DepartmentTechnology <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ StateCA <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ StateIL <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ StateNY <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAmerican Indian or Alaska Native` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAsian or Asian American` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityBlack or African American` <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `EthnicityMixed or multiple ethnic groups` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityNative Hawaiian ` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityOther ` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EthnicityWhite <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `EducationFieldBusiness Studies` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldComputer Science` <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ EducationFieldEconomics <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldHuman Resources` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldInformation Systems` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EducationFieldMarketing <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldMarketing ` <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ EducationFieldOther <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldTechnical Degree` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleAnalytics Manager` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleData Scientist` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleEngineering Manager` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Business Partner` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Executive` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Manager` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleMachine Learning Engineer` <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ JobRoleManager <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleRecruiter <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSales Executive` <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `JobRoleSales Representative` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSenior Software Engineer` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSoftware Engineer` <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusDivorced <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusMarried <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ MaritalStatusSingle <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AgeNormalized <dbl> 0.12121212, 0.18181818, 0.…
## $ DistanceFromHomeNormalized <dbl> 0.8863636, 0.3636364, 0.36…
## $ SalaryNormalized <dbl> 0.014001067, 0.067894544, …
## $ ReviewDate <date> 2022-12-31, 2020-06-17, 2…
## $ EnvironmentSatisfaction <dbl> 3.833333, 3.000000, 3.0000…
## $ JobSatisfaction <dbl> 3.535088, 3.000000, 4.0000…
## $ RelationshipSatisfaction <dbl> 3.526316, 2.000000, 5.0000…
## $ WorkLifeBalance <dbl> 3.495614, 2.000000, 4.0000…
## $ TrainingOpportunitiesWithinYear <dbl> 2.074561, 1.000000, 3.0000…
## $ TrainingOpportunitiesTaken <dbl> 1.02193, 2.00000, 0.00000,…
## $ SelfRating <dbl> 4, 4, 4, 3, 3, 3, 3, 5, 3,…
## $ ManagerRating <dbl> 3.54386, 3.00000, 4.00000,…
## $ OverallSatisfaction <dbl> 3.597588, 2.500000, 4.0000…
## $ ExitDate <date> NA, NA, NA, NA, NA, NA, N…
## $ Lifecycle <dbl> 2, 3, 5, 4, 9, 4, 7, 6, 10…
Finally, the unique employee dataset and the merged employee performance rating dataset are ready to proceed to Exploratory Data Analysis (EDA) and Feature Engineering.
Calculate total number of employees each year based on hire date, cumulative count
## # A tibble: 11 × 3
## Year EmployeesHired CumulativeEmployees
## <dbl> <int> <int>
## 1 2012 151 151
## 2 2013 136 287
## 3 2014 136 423
## 4 2015 127 550
## 5 2016 114 664
## 6 2017 106 770
## 7 2018 136 906
## 8 2019 145 1051
## 9 2020 127 1178
## 10 2021 137 1315
## 11 2022 155 1470
Calculate total attrition employees each year, cumulative count
Calculate cumulative employees remaining and rate of imbalance
## # A tibble: 11 × 8
## Year EmployeesHired CumulativeEmployees AttritionCount CumulativeAttrition
## <dbl> <int> <int> <dbl> <dbl>
## 1 2012 151 151 0 0
## 2 2013 136 287 2 2
## 3 2014 136 423 10 12
## 4 2015 127 550 11 23
## 5 2016 114 664 12 35
## 6 2017 106 770 14 49
## 7 2018 136 906 21 70
## 8 2019 145 1051 21 91
## 9 2020 127 1178 32 123
## 10 2021 137 1315 60 183
## 11 2022 155 1470 54 237
## # ℹ 3 more variables: CumulativeEmployeesRemain <dbl>,
## # AttritionRatePerYear <dbl>, AttritiontoEmployeesRemainRatio <dbl>
Create a combined plot for both hire and attrition trends
Insights:
Insights:
The annual salary range with the highest employee count is 50k to 100k, followed by 25k to 50k, more than 150k, 100k to 150k, and less than 25k.
The two most common ranges (50k to 100k and 25k to 50k) each have over 400 employees, while the least common (less than 25k and 100k to 150k) each have fewer than 200.
This highlights that most employees earn between 25k and 100k, with a moderate number earning over 150k.
The department with the largest headcount is Technology, followed by Sales and Human Resources.
This suggests that the Technology department plays a central role in the company's operations, with departments like Sales and Human Resources playing more of a supporting role within the organisation.
Key features were engineered to enhance the dataset for modeling purposes. The feature engineering steps included:
A higher StagnationRate indicates a greater risk of stagnation, i.e. fewer promotions relative to tenure.
RoleStability reflects how long an employee has remained in the same role relative to their tenure.
By combining these two metrics, employees can be categorized into several growth groups, as sketched below.
An exponential moving average (EMA) is applied to each employee's performance ratings to generate a final, smoothed rating per employee, which is then used for modeling. This smooths out fluctuations in ratings over time, providing a more consistent and reliable assessment of performance while placing more weight on recent ratings, so that trends and changes in performance are better captured.
# Feature engineering on performance df using EMA and work satisfaction
columns <- c(
"EnvironmentSatisfaction", "JobSatisfaction", "RelationshipSatisfaction",
"TrainingOpportunitiesWithinYear", "TrainingOpportunitiesTaken",
"WorkLifeBalance", "SelfRating", "ManagerRating", "OverallSatisfaction"
)
# Apply a dynamic EMA per employee: each series is reversed into chronological
# order before TTR::EMA so that recent reviews carry the most weight, with the
# window capped at 10 reviews (or the employee's review count, if smaller)
EMA_perf <- employee_perf_check %>%
  arrange(EmployeeID, desc(ReviewDate)) %>%
  group_by(EmployeeID) %>%
  mutate(across(
    all_of(columns),
    ~ if (n() >= 10) rev(EMA(rev(.x), n = 10)) else rev(EMA(rev(.x), n = n())),
    .names = "{.col}_EMA")
  ) %>%
  ungroup() %>%
  # EMA returns NA for the first n-1 points, so keeping only complete rows
  # leaves a single smoothed (most recent) row per employee
  filter(if_all(ends_with("_EMA"), ~ !is.na(.))) %>%
  select(EmployeeID, ReviewDate, ends_with("_EMA"))
# Check final unique numbers of employees
n_distinct(EMA_perf$EmployeeID)
## [1] 1359
# Work Satisfaction Creation
EMA_final <- EMA_perf %>% mutate(WorkSatisfaction = WorkLifeBalance_EMA * JobSatisfaction_EMA)
str(EMA_final)
## tibble [1,359 × 12] (S3: tbl_df/tbl/data.frame)
## $ EmployeeID : chr [1:1359] "001A-8F88" "005C-E0FB" "00A3-2445" "00B0-F199" ...
## $ ReviewDate : Date[1:1359], format: "2022-12-31" "2022-06-17" ...
## $ EnvironmentSatisfaction_EMA : num [1:1359] 3.83 3.33 4.12 3 3.75 ...
## $ JobSatisfaction_EMA : num [1:1359] 3.54 3.67 3.5 4 4.25 ...
## $ RelationshipSatisfaction_EMA : num [1:1359] 3.53 4 3.62 3 3.5 ...
## $ TrainingOpportunitiesWithinYear_EMA: num [1:1359] 2.07 1.67 2.12 1 2 ...
## $ TrainingOpportunitiesTaken_EMA : num [1:1359] 1.02 1 0.75 1 1.25 ...
## $ WorkLifeBalance_EMA : num [1:1359] 3.5 3.67 3.75 4 4.25 ...
## $ SelfRating_EMA : num [1:1359] 4 3.67 3.75 4 3.75 ...
## $ ManagerRating_EMA : num [1:1359] 3.54 3.33 3.38 3 3 ...
## $ OverallSatisfaction_EMA : num [1:1359] 3.6 3.67 3.75 3.5 3.94 ...
## $ WorkSatisfaction : num [1:1359] 12.4 13.4 13.1 16 18.1 ...
The EMA rating is then left joined to the cleaned unique employee dataset.
# Join the EMA features to employee_fe (the feature-engineered employee data)
employee_combined <- employee_fe %>%
left_join(EMA_final, by = "EmployeeID")
Mean imputation is performed for employees with missing EMA ratings.
## Rows: 1,470
## Columns: 89
## $ EmployeeID <chr> "3012-1A41", "CBCB-9C9D", …
## $ FirstName <chr> "Leonelle", "Leonerd", "Ah…
## $ LastName <chr> "Simco", "Aland", "Sykes",…
## $ Gender <chr> "Female", "Male", "Male", …
## $ Age <int> 30, 38, 43, 39, 29, 34, 42…
## $ BusinessTravel <chr> "Some Travel", "Some Trave…
## $ Department <chr> "Sales", "Sales", "Human R…
## $ DistanceFromHome..KM. <int> 27, 23, 29, 12, 29, 30, 45…
## $ State <chr> "IL", "CA", "CA", "IL", "C…
## $ Ethnicity <chr> "White", "White", "Asian o…
## $ Education <int> 5, 4, 4, 3, 2, 2, 3, 2, 4,…
## $ EducationField <chr> "Marketing", "Marketing", …
## $ JobRole <chr> "Sales Executive", "Sales …
## $ MaritalStatus <chr> "Divorced", "Single", "Mar…
## $ Salary <int> 102059, 157718, 309964, 29…
## $ StockOptionLevel <int> 1, 0, 1, 0, 0, 1, 1, 1, 1,…
## $ OverTime <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ HireDate <date> 2012-01-03, 2012-01-04, 2…
## $ Attrition <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ YearsAtCompany <int> 10, 10, 10, 10, 6, 10, 10,…
## $ YearsInMostRecentRole <int> 4, 6, 6, 10, 1, 3, 2, 3, 5…
## $ YearsSinceLastPromotion <int> 9, 10, 10, 10, 1, 7, 6, 4,…
## $ YearsWithCurrManager <int> 7, 0, 8, 0, 6, 9, 6, 6, 2,…
## $ GenderFemale <dbl> 1, 0, 0, 0, 1, 0, 1, 1, 0,…
## $ GenderMale <dbl> 0, 1, 1, 0, 0, 1, 0, 0, 1,…
## $ `GenderNon-Binary` <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ `GenderPrefer Not To Say` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelFrequent Traveller` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelNo Travel ` <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ `BusinessTravelSome Travel` <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ `DepartmentHuman Resources` <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0,…
## $ DepartmentSales <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ DepartmentTechnology <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ StateCA <dbl> 0, 1, 1, 0, 1, 0, 0, 1, 0,…
## $ StateIL <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1,…
## $ StateNY <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ `EthnicityAmerican Indian or Alaska Native` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAsian or Asian American` <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityBlack or African American` <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1,…
## $ `EthnicityMixed or multiple ethnic groups` <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ `EthnicityNative Hawaiian ` <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ `EthnicityOther ` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EthnicityWhite <dbl> 1, 1, 0, 1, 1, 0, 0, 0, 0,…
## $ `EducationFieldBusiness Studies` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldComputer Science` <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ EducationFieldEconomics <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldHuman Resources` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldInformation Systems` <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ EducationFieldMarketing <dbl> 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ `EducationFieldMarketing ` <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ EducationFieldOther <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ `EducationFieldTechnical Degree` <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ `JobRoleAnalytics Manager` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleData Scientist` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleEngineering Manager` <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ `JobRoleHR Business Partner` <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Executive` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Manager` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleMachine Learning Engineer` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleManager <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleRecruiter <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ `JobRoleSales Executive` <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ `JobRoleSales Representative` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSenior Software Engineer` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSoftware Engineer` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusDivorced <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 0,…
## $ MaritalStatusMarried <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 1,…
## $ MaritalStatusSingle <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0,…
## $ AgeNormalized <dbl> 0.3636364, 0.6060606, 0.75…
## $ DistanceFromHomeNormalized <dbl> 0.59090909, 0.50000000, 0.…
## $ SalaryNormalized <dbl> 0.15502917, 0.26068065, 0.…
## $ SalaryRange <fct> 100k-150k, >150k, >150k, >…
## $ EducationLevel <fct> Doctorate, Masters, Master…
## $ JobRoleCat <chr> "Executive", "Executive", …
## $ StagnationRate <dbl> 0.8181818, 0.9090909, 0.90…
## $ RoleStability <dbl> 0.4000000, 0.6000000, 0.60…
## $ StagCat <chr> "Stagnation Risk", "Stagna…
## $ RSCat <chr> "Moderate Role Stability",…
## $ GrowthCat <chr> "Needs Review", "Needs Rev…
## $ EnvironmentSatisfaction_EMA <dbl> 3.555556, 3.888889, 3.8888…
## $ JobSatisfaction_EMA <dbl> 3.666667, 3.333333, 3.5555…
## $ RelationshipSatisfaction_EMA <dbl> 3.333333, 3.777778, 2.8888…
## $ TrainingOpportunitiesWithinYear_EMA <dbl> 2.000000, 2.444444, 2.1111…
## $ TrainingOpportunitiesTaken_EMA <dbl> 0.3333333, 0.7777778, 0.66…
## $ WorkLifeBalance_EMA <dbl> 3.333333, 3.000000, 3.7777…
## $ SelfRating_EMA <dbl> 4.111111, 4.222222, 3.5555…
## $ ManagerRating_EMA <dbl> 3.555556, 3.888889, 3.0000…
## $ OverallSatisfaction_EMA <dbl> 3.472222, 3.500000, 3.5277…
## $ WorkSatisfaction <dbl> 12.222222, 10.000000, 13.4…
Assess class imbalance and ensure that it is maintained when splitting the data into training and testing sets. If necessary, apply data resampling to address the imbalance, and re-train the model using the resampled training data.
Check for sample imbalance on Attrition:
##
## 0 1
## 1233 237
##
## 0 1
## 83.87755 16.12245
## [1] "Class Imbalance Ratio: 5.2"
Due to the high class imbalance ratio, SMOTE (Synthetic Minority Over-sampling Technique) resampling may be necessary.
## Rows: 1,470
## Columns: 71
## $ Age <int> 30, 38, 43, 39, 29, 34, 42…
## $ DistanceFromHome..KM. <int> 27, 23, 29, 12, 29, 30, 45…
## $ Education <int> 5, 4, 4, 3, 2, 2, 3, 2, 4,…
## $ Salary <int> 102059, 157718, 309964, 29…
## $ StockOptionLevel <int> 1, 0, 1, 0, 0, 1, 1, 1, 1,…
## $ OverTime <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ Attrition <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ YearsAtCompany <int> 10, 10, 10, 10, 6, 10, 10,…
## $ YearsInMostRecentRole <int> 4, 6, 6, 10, 1, 3, 2, 3, 5…
## $ YearsSinceLastPromotion <int> 9, 10, 10, 10, 1, 7, 6, 4,…
## $ YearsWithCurrManager <int> 7, 0, 8, 0, 6, 9, 6, 6, 2,…
## $ GenderFemale <dbl> 1, 0, 0, 0, 1, 0, 1, 1, 0,…
## $ GenderMale <dbl> 0, 1, 1, 0, 0, 1, 0, 0, 1,…
## $ `GenderNon-Binary` <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ `GenderPrefer Not To Say` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelFrequent Traveller` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelNo Travel ` <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ `BusinessTravelSome Travel` <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ `DepartmentHuman Resources` <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0,…
## $ DepartmentSales <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ DepartmentTechnology <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ StateCA <dbl> 0, 1, 1, 0, 1, 0, 0, 1, 0,…
## $ StateIL <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1,…
## $ StateNY <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ `EthnicityAmerican Indian or Alaska Native` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAsian or Asian American` <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityBlack or African American` <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1,…
## $ `EthnicityMixed or multiple ethnic groups` <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ `EthnicityNative Hawaiian ` <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ `EthnicityOther ` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EthnicityWhite <dbl> 1, 1, 0, 1, 1, 0, 0, 0, 0,…
## $ `EducationFieldBusiness Studies` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldComputer Science` <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ EducationFieldEconomics <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldHuman Resources` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldInformation Systems` <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ EducationFieldMarketing <dbl> 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ `EducationFieldMarketing ` <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ EducationFieldOther <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ `EducationFieldTechnical Degree` <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ `JobRoleAnalytics Manager` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleData Scientist` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleEngineering Manager` <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ `JobRoleHR Business Partner` <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Executive` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Manager` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleMachine Learning Engineer` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleManager <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleRecruiter <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ `JobRoleSales Executive` <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ `JobRoleSales Representative` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSenior Software Engineer` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSoftware Engineer` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusDivorced <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 0,…
## $ MaritalStatusMarried <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 1,…
## $ MaritalStatusSingle <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0,…
## $ AgeNormalized <dbl> 0.3636364, 0.6060606, 0.75…
## $ DistanceFromHomeNormalized <dbl> 0.59090909, 0.50000000, 0.…
## $ SalaryNormalized <dbl> 0.15502917, 0.26068065, 0.…
## $ StagnationRate <dbl> 0.8181818, 0.9090909, 0.90…
## $ RoleStability <dbl> 0.4000000, 0.6000000, 0.60…
## $ EnvironmentSatisfaction_EMA <dbl> 3.555556, 3.888889, 3.8888…
## $ JobSatisfaction_EMA <dbl> 3.666667, 3.333333, 3.5555…
## $ RelationshipSatisfaction_EMA <dbl> 3.333333, 3.777778, 2.8888…
## $ TrainingOpportunitiesWithinYear_EMA <dbl> 2.000000, 2.444444, 2.1111…
## $ TrainingOpportunitiesTaken_EMA <dbl> 0.3333333, 0.7777778, 0.66…
## $ WorkLifeBalance_EMA <dbl> 3.333333, 3.000000, 3.7777…
## $ SelfRating_EMA <dbl> 4.111111, 4.222222, 3.5555…
## $ ManagerRating_EMA <dbl> 3.555556, 3.888889, 3.0000…
## $ OverallSatisfaction_EMA <dbl> 3.472222, 3.500000, 3.5277…
## $ WorkSatisfaction <dbl> 12.222222, 10.000000, 13.4…
ii. Split the data into a training set and a test set, using a 70:30 ratio.
The class imbalance on Attrition remains in both the training and testing sets.
The baseline accuracy is 83.88%, the accuracy achieved by always predicting the majority class (Attrition = No) in the dataset.
## Class Distribution in Training Set:
##
## 0 1
## 83.86783 16.13217
##
## Class Distribution in Testing Set:
##
## 0 1
## 83.90023 16.09977
## [1] "Baseline Accuracy: 83.88 %"
Apply SMOTE to oversample the attrition class, addressing the class imbalance by generating synthetic data points for the minority class.
##
## Class Distribution in Balanced Training Set:
##
## 0 1
## 863 498
The training data has been resampled to a much smaller imbalance ratio and is now ready for model training and evaluation.
Initial model training should also be performed on the original training set, to assess whether the model trained on the balanced data performs better.
Check for class imbalance on YearsAtCompany.
# Visualizing the target variable for regression (Dataset is evenly distributed, hence no data imbalance)
hist(employee_fe_numeric$YearsAtCompany, breaks = 30, main = "Distribution of Target Variable", xlab = "Target", col = "lightblue")
Since the target variable YearsAtCompany is fairly evenly distributed across its continuous range, there is no need to address data imbalance for the regression model.
## Features that have strong correlations to Attrition:
## [1] "OverTime" "YearsAtCompany"
## [3] "YearsInMostRecentRole" "YearsSinceLastPromotion"
## [5] "YearsWithCurrManager" "StagnationRate"
## [7] "RoleStability"
##
## Redundant features dropped:
## [1] "YearsSinceLastPromotion"
##
## Final Features:
## [1] "OverTime" "YearsAtCompany" "YearsInMostRecentRole"
## [4] "YearsWithCurrManager" "StagnationRate" "RoleStability"
# subset of data using only the final feature and target variable
ori_classification_data <- train[, final_features_classi]
# Add class(Attrition) as label to the classification_data
ori_classification_data$label <- as.factor(train$Attrition)
ori_X_train_classi <- as.matrix(ori_classification_data[, -which(colnames(ori_classification_data) == "label")])
ori_Y_train_classi <- ori_classification_data$label
ori_X_test_classi <- as.matrix(test[, -which(colnames(test) == "Attrition")])
ori_Y_test_classi <- as.factor(test$Attrition)
## Features that have strong correlations to Attrition:
## [1] "OverTime" "YearsAtCompany"
## [3] "YearsInMostRecentRole" "YearsSinceLastPromotion"
## [5] "YearsWithCurrManager" "StagnationRate"
## [7] "RoleStability"
# Calculate correlation matrix for selected features
cor_matrix_classi <- cor(train_bal[, selected_features_classi])
# Visualize the correlation matrix
library(grid)
library(corrplot)
par(mfrow = c(1,1), mar = c(8,5,3,2))
corrplot(cor_matrix_classi, method = "color", type = "lower", order = "hclust", tl.cex = 0.8)
title(main = "Correlation Matrix of selected features", col.main = "black")
##
## Redundant features dropped:
## [1] "YearsSinceLastPromotion"
##
## Final Features:
## [1] "OverTime" "YearsAtCompany" "YearsInMostRecentRole"
## [4] "YearsWithCurrManager" "StagnationRate" "RoleStability"
# Split data for SMOTE Balanced Data
classification_data <- train_bal[, final_features_classi]
classification_data$label <- as.factor(train_bal$class)
X_train_classi <- as.matrix(classification_data[, -which(colnames(classification_data) == "label")])
Y_train_classi <- classification_data$label
X_test_classi <- as.matrix(test[, -which(colnames(test) == "Attrition")])
Y_test_classi <- as.factor(test$Attrition)
Examining the correlations between the features and the target variable (Attrition), six features will be fed to the model for training and testing.
Selection criteria: features with a strong correlation to the target are kept, and features highly correlated with another selected feature are dropped as redundant.
## Features that have strong correlations to YearsAtCompany:
## [1] "Age" "YearsInMostRecentRole"
## [3] "YearsSinceLastPromotion" "YearsWithCurrManager"
## [5] "EthnicityWhite" "AgeNormalized"
## [7] "StagnationRate"
##
## Redundant features dropped:
## [1] "YearsSinceLastPromotion" "Age"
##
## Final Features:
## [1] "YearsInMostRecentRole" "YearsWithCurrManager" "EthnicityWhite"
## [4] "AgeNormalized" "StagnationRate"
Feature Selection
Examining the correlations between the features and the target variable (YearsAtCompany), five features will be fed to the model for training and testing.
Selection criteria: features with a strong correlation to the target are kept, and features highly correlated with another selected feature are dropped as redundant.
# Split the data into training and testing sets
# Train : 70%
# Test : 30%
# subset of data using only the final feature and target variable
regression_data <- employee_fe_numeric[, final_features]
# Add YearsAtCompany as label to the regression_data
regression_data$label <- employee_fe_numeric$YearsAtCompany
set.seed(123)
trainIndex <- createDataPartition(regression_data$label, p = .7,
list = FALSE,
times = 1)
trainData <- regression_data[trainIndex,]
testData <- regression_data[-trainIndex,]
X_train <- as.matrix(trainData[, -which(colnames(trainData) == "label")])
Y_train <- trainData$label
X_test <- as.matrix(testData[, -which(colnames(testData) == "label")])
Y_test <- testData$label
Objective: Predict employee attrition.
Attrition will be used as the target variable.
Models:
# Fit the model
ori_knn_model <- train(label ~., data = ori_classification_data, method = "knn")
# Summarize the model
summary(ori_knn_model)
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 6 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
# Plot the model
plot(ori_knn_model)
# Predict on test data
ori_knn_predictions <- predict(ori_knn_model, ori_X_test_classi)
# Fit the model
knn_model <- train(label ~., data = classification_data, method = "knn")
# Summarize the model
summary(knn_model)
## Length Class Mode
## learn 2 -none- list
## k 1 -none- numeric
## theDots 0 -none- list
## xNames 6 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
# Plot the model
plot(knn_model)
# Predict on test data
knn_predictions <- predict(knn_model, X_test_classi)
# Fit the model
ori_rf_model <- train(label ~ ., data = ori_classification_data, method = "rf")
# Summarize the model
summary(ori_rf_model)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 1029 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 2058 matrix numeric
## oob.times 1029 -none- numeric
## classes 2 -none- character
## importance 6 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 1029 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 6 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
# Predict on test data
ori_rf_predictions <- predict(ori_rf_model, ori_X_test_classi)
# Fit the model
rf_model <- train(label ~ ., data = classification_data, method = "rf")
# Summarize the model
summary(rf_model)
## Length Class Mode
## call 4 -none- call
## type 1 -none- character
## predicted 1361 factor numeric
## err.rate 1500 -none- numeric
## confusion 6 -none- numeric
## votes 2722 matrix numeric
## oob.times 1361 -none- numeric
## classes 2 -none- character
## importance 6 -none- numeric
## importanceSD 0 -none- NULL
## localImportance 0 -none- NULL
## proximity 0 -none- NULL
## ntree 1 -none- numeric
## mtry 1 -none- numeric
## forest 14 -none- list
## y 1361 factor numeric
## test 0 -none- NULL
## inbag 0 -none- NULL
## xNames 6 -none- character
## problemType 1 -none- character
## tuneValue 1 data.frame list
## obsLevels 2 -none- character
## param 0 -none- list
# Predict on test data
rf_predictions <- predict(rf_model, X_test_classi)
# Fit the model
ori_svm_model <- train(label ~., data = ori_classification_data, method = "svmRadial")
# Summarize the model
summary(ori_svm_model)
## Length Class Mode
## 1 ksvm S4
# Predict on test data
ori_svm_predictions <- predict(ori_svm_model, ori_X_test_classi)
# Fit the model
svm_model <- train(label ~., data = classification_data, method = "svmRadial")
# Summarize the model
summary(svm_model)
## Length Class Mode
## 1 ksvm S4
# Predict on test data
svm_predictions <- predict(svm_model, X_test_classi)
Objective: Predict how long (in years) an employee stays in the company.
YearsAtCompany will be used as the target variable.
# Fit the model
linear_regression_model <- lm(label ~ ., data = trainData)
# Summarize the model
summary(linear_regression_model)
##
## Call:
## lm(formula = label ~ ., data = trainData)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2703 -1.1555 -0.3113 0.9979 7.1938
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.13876 0.14787 7.701 3.18e-14 ***
## YearsInMostRecentRole 0.34996 0.03105 11.272 < 2e-16 ***
## YearsWithCurrManager 0.47595 0.02665 17.859 < 2e-16 ***
## EthnicityWhite -0.90458 0.11617 -7.787 1.68e-14 ***
## AgeNormalized 3.30280 0.25644 12.880 < 2e-16 ***
## StagnationRate 2.01496 0.24513 8.220 6.11e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.781 on 1024 degrees of freedom
## Multiple R-squared: 0.7125, Adjusted R-squared: 0.7111
## F-statistic: 507.5 on 5 and 1024 DF, p-value: < 2.2e-16
# Predict on test data
linear_regression_prediction <- predict(linear_regression_model, testData)
#import rpart library
library(rpart)
# Train the decision tree regression model
decision_tree_model <- rpart(label ~ ., data = trainData, method = "anova")
# Summarize the model
summary(decision_tree_model)
## Call:
## rpart(formula = label ~ ., data = trainData, method = "anova")
## n= 1030
##
## CP nsplit rel error xerror xstd
## 1 0.37943504 0 1.0000000 1.0026023 0.02625433
## 2 0.13737903 1 0.6205650 0.6238024 0.02528469
## 3 0.11154286 2 0.4831859 0.5107801 0.02224078
## 4 0.04771225 3 0.3716431 0.3987939 0.02330235
## 5 0.02736749 4 0.3239308 0.3361217 0.02004071
## 6 0.02008641 5 0.2965633 0.3322190 0.02033671
## 7 0.01801957 7 0.2563905 0.3167477 0.01985053
## 8 0.01454200 8 0.2383710 0.2840977 0.01882688
## 9 0.01103456 9 0.2238290 0.2607595 0.01684891
## 10 0.01000000 10 0.2127944 0.2548949 0.01655874
##
## Variable importance
## YearsWithCurrManager YearsInMostRecentRole StagnationRate
## 30 25 20
## AgeNormalized EthnicityWhite
## 19 6
##
## Node number 1: 1030 observations, complexity param=0.379435
## mean=4.572816, MSE=10.96703
## left son=2 (540 obs) right son=3 (490 obs)
## Primary splits:
## YearsWithCurrManager < 1.5 to the left, improve=0.37943500, (0 missing)
## YearsInMostRecentRole < 2.5 to the left, improve=0.36188020, (0 missing)
## StagnationRate < 0.04545455 to the left, improve=0.35262460, (0 missing)
## AgeNormalized < 0.1969697 to the left, improve=0.31202770, (0 missing)
## EthnicityWhite < 0.5 to the right, improve=0.09852108, (0 missing)
## Surrogate splits:
## YearsInMostRecentRole < 1.5 to the left, agree=0.729, adj=0.431, (0 split)
## StagnationRate < 0.5227273 to the left, agree=0.718, adj=0.408, (0 split)
## AgeNormalized < 0.1969697 to the left, agree=0.657, adj=0.280, (0 split)
## EthnicityWhite < 0.5 to the right, agree=0.618, adj=0.198, (0 split)
##
## Node number 2: 540 observations, complexity param=0.137379
## mean=2.62963, MSE=7.470233
## left son=4 (395 obs) right son=5 (145 obs)
## Primary splits:
## YearsInMostRecentRole < 1.5 to the left, improve=0.38469690, (0 missing)
## StagnationRate < 0.05 to the left, improve=0.37149720, (0 missing)
## AgeNormalized < 0.1363636 to the left, improve=0.19967400, (0 missing)
## EthnicityWhite < 0.5 to the right, improve=0.13088220, (0 missing)
## YearsWithCurrManager < 0.5 to the left, improve=0.06312805, (0 missing)
## Surrogate splits:
## StagnationRate < 0.6306818 to the left, agree=0.863, adj=0.49, (0 split)
##
## Node number 3: 490 observations, complexity param=0.1115429
## mean=6.714286, MSE=6.073469
## left son=6 (223 obs) right son=7 (267 obs)
## Primary splits:
## AgeNormalized < 0.2878788 to the left, improve=0.423384600, (0 missing)
## YearsWithCurrManager < 4.5 to the left, improve=0.230631500, (0 missing)
## YearsInMostRecentRole < 4.5 to the left, improve=0.202520900, (0 missing)
## StagnationRate < 0.8819444 to the left, improve=0.140402200, (0 missing)
## EthnicityWhite < 0.5 to the right, improve=0.004421336, (0 missing)
## Surrogate splits:
## YearsWithCurrManager < 4.5 to the left, agree=0.645, adj=0.220, (0 split)
## YearsInMostRecentRole < 4.5 to the left, agree=0.612, adj=0.148, (0 split)
## StagnationRate < 0.8819444 to the left, agree=0.594, adj=0.108, (0 split)
##
## Node number 4: 395 observations, complexity param=0.04771225
## mean=1.602532, MSE=4.102778
## left son=8 (197 obs) right son=9 (198 obs)
## Primary splits:
## StagnationRate < 0.05 to the left, improve=0.33256840, (0 missing)
## YearsWithCurrManager < 0.5 to the left, improve=0.13852670, (0 missing)
## AgeNormalized < 0.1060606 to the left, improve=0.12773670, (0 missing)
## EthnicityWhite < 0.5 to the right, improve=0.09200418, (0 missing)
## YearsInMostRecentRole < 0.5 to the left, improve=0.07483577, (0 missing)
## Surrogate splits:
## YearsInMostRecentRole < 0.5 to the left, agree=0.785, adj=0.569, (0 split)
## YearsWithCurrManager < 0.5 to the left, agree=0.742, adj=0.482, (0 split)
## AgeNormalized < 0.1060606 to the left, agree=0.630, adj=0.259, (0 split)
## EthnicityWhite < 0.5 to the right, agree=0.542, adj=0.081, (0 split)
##
## Node number 5: 145 observations, complexity param=0.02736749
## mean=5.427586, MSE=5.941308
## left son=10 (47 obs) right son=11 (98 obs)
## Primary splits:
## AgeNormalized < 0.1969697 to the left, improve=0.358848500, (0 missing)
## YearsInMostRecentRole < 5.5 to the left, improve=0.326086700, (0 missing)
## StagnationRate < 0.8660714 to the left, improve=0.220893100, (0 missing)
## EthnicityWhite < 0.5 to the right, improve=0.024278930, (0 missing)
## YearsWithCurrManager < 0.5 to the right, improve=0.008307433, (0 missing)
##
## Node number 6: 223 observations, complexity param=0.01801957
## mean=4.959641, MSE=2.415412
## left son=12 (94 obs) right son=13 (129 obs)
## Primary splits:
## AgeNormalized < 0.1969697 to the left, improve=0.37789800, (0 missing)
## YearsInMostRecentRole < 3.5 to the left, improve=0.17782810, (0 missing)
## YearsWithCurrManager < 3.5 to the left, improve=0.16222870, (0 missing)
## StagnationRate < 0.8452381 to the left, improve=0.13976110, (0 missing)
## EthnicityWhite < 0.5 to the right, improve=0.02247327, (0 missing)
## Surrogate splits:
## YearsWithCurrManager < 2.5 to the left, agree=0.632, adj=0.128, (0 split)
## EthnicityWhite < 0.5 to the right, agree=0.587, adj=0.021, (0 split)
##
## Node number 7: 267 observations, complexity param=0.02008641
## mean=8.179775, MSE=4.409628
## left son=14 (99 obs) right son=15 (168 obs)
## Primary splits:
## YearsWithCurrManager < 3.5 to the left, improve=0.155517000, (0 missing)
## YearsInMostRecentRole < 3.5 to the left, improve=0.142482400, (0 missing)
## StagnationRate < 0.8090909 to the left, improve=0.094389490, (0 missing)
## AgeNormalized < 0.4090909 to the left, improve=0.077354000, (0 missing)
## EthnicityWhite < 0.5 to the right, improve=0.002488476, (0 missing)
## Surrogate splits:
## StagnationRate < 0.04545455 to the left, agree=0.64, adj=0.03, (0 split)
##
## Node number 8: 197 observations
## mean=0.4314721, MSE=0.7529181
##
## Node number 9: 198 observations, complexity param=0.014542
## mean=2.767677, MSE=4.713703
## left son=18 (143 obs) right son=19 (55 obs)
## Primary splits:
## EthnicityWhite < 0.5 to the right, improve=0.1760041000, (0 missing)
## StagnationRate < 0.5277778 to the left, improve=0.1695346000, (0 missing)
## AgeNormalized < 0.1363636 to the left, improve=0.1233405000, (0 missing)
## YearsInMostRecentRole < 0.5 to the right, improve=0.0236862400, (0 missing)
## YearsWithCurrManager < 0.5 to the left, improve=0.0002690429, (0 missing)
## Surrogate splits:
## StagnationRate < 0.2111111 to the right, agree=0.747, adj=0.091, (0 split)
##
## Node number 10: 47 observations
## mean=3.319149, MSE=0.9406971
##
## Node number 11: 98 observations, complexity param=0.01103456
## mean=6.438776, MSE=5.185027
## left son=22 (67 obs) right son=23 (31 obs)
## Primary splits:
## YearsInMostRecentRole < 5.5 to the left, improve=2.453038e-01, (0 missing)
## StagnationRate < 0.8819444 to the left, improve=1.882079e-01, (0 missing)
## AgeNormalized < 0.6969697 to the left, improve=1.461969e-01, (0 missing)
## EthnicityWhite < 0.5 to the right, improve=5.600144e-03, (0 missing)
## YearsWithCurrManager < 0.5 to the left, improve=1.435002e-06, (0 missing)
## Surrogate splits:
## StagnationRate < 0.8452381 to the left, agree=0.806, adj=0.387, (0 split)
## AgeNormalized < 0.6969697 to the left, agree=0.724, adj=0.129, (0 split)
##
## Node number 12: 94 observations
## mean=3.840426, MSE=1.304323
##
## Node number 13: 129 observations
## mean=5.775194, MSE=1.647137
##
## Node number 14: 99 observations, complexity param=0.02008641
## mean=7.10101, MSE=7.605959
## left son=28 (46 obs) right son=29 (53 obs)
## Primary splits:
## YearsInMostRecentRole < 3.5 to the left, improve=0.35949020, (0 missing)
## StagnationRate < 0.8090909 to the left, improve=0.19071570, (0 missing)
## AgeNormalized < 0.4090909 to the left, improve=0.06846250, (0 missing)
## EthnicityWhite < 0.5 to the right, improve=0.03530751, (0 missing)
## YearsWithCurrManager < 2.5 to the left, improve=0.02353120, (0 missing)
## Surrogate splits:
## StagnationRate < 0.6833333 to the left, agree=0.727, adj=0.413, (0 split)
## AgeNormalized < 0.4393939 to the left, agree=0.646, adj=0.239, (0 split)
## YearsWithCurrManager < 2.5 to the left, agree=0.556, adj=0.043, (0 split)
## EthnicityWhite < 0.5 to the right, agree=0.545, adj=0.022, (0 split)
##
## Node number 15: 168 observations
## mean=8.815476, MSE=1.436189
##
## Node number 18: 143 observations
## mean=2.202797, MSE=3.196636
##
## Node number 19: 55 observations
## mean=4.236364, MSE=5.671405
##
## Node number 22: 67 observations
## mean=5.671642, MSE=4.847405
##
## Node number 23: 31 observations
## mean=8.096774, MSE=1.893861
##
## Node number 28: 46 observations
## mean=5.326087, MSE=8.089319
##
## Node number 29: 53 observations
## mean=8.641509, MSE=2.079032
library(rpart.plot)
# Visualize the decision tree
rpart.plot(decision_tree_model, type = 3, digits = 2, fallen.leaves = TRUE)
# Shallower trees are less likely to overfit;
# the first splits are the most important predictors
# Predict on test data
decision_tree_prediction <- predict(decision_tree_model, testData)
YearsWithCurrManager is the most important predictor, as it forms the root split of the tree.
YearsWithCurrManager < 2 -> shorter tenure on average
YearsWithCurrManager >= 2 -> longer tenure on average
YearsInMostRecentRole < 2 -> shorter tenure on average
YearsInMostRecentRole >= 2 -> longer tenure on average
This suggests that the longer employees spend in their most recent role, the longer their overall tenure tends to be.
AgeNormalized < 0.29 -> shorter tenure on average
AgeNormalized >= 0.29 -> longer tenure on average
Younger employees tend to receive shorter tenure predictions.
StagnationRate < 0.05 -> shorter tenure on average
StagnationRate >= 0.05 -> longer tenure on average
Employees who are stagnant in their current role tend to stay with the company for a longer duration.
EthnicityWhite = 1 -> shorter tenure on average
EthnicityWhite = 0 -> longer tenure on average
Employees of white ethnicity tend to leave the company earlier.
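As a quick check on this reading, the split-based importance that rpart records can be inspected directly; a minimal sketch, assuming decision_tree_model is the tree fitted above:
# Split-improvement importance (higher = more important)
decision_tree_model$variable.importance
# Variable used at the root split (node 1)
decision_tree_model$frame[1, "var"]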
Modification of Linear Regression
Elastic Net combines Lasso (L1) and Ridge (L2) regularization; with alpha = 0.5 the two penalties are weighted equally.
library(glmnet)
elastic_net_model <- cv.glmnet(
x = as.matrix(X_train),
y = Y_train,
alpha = 0.5,
family = "gaussian"
)
# Cross-validation results (Lambda selection)
plot(elastic_net_model)
# Select the lambda with the lowest cross-validated error
best_lambda <- elastic_net_model$lambda.min
# Predict on test data
elastic_net_prediction <- predict(elastic_net_model, newx = as.matrix(X_test), s = best_lambda)
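The fitted coefficients at the selected lambda can also be examined via the standard glmnet accessor; a one-line sketch:
# Elastic net coefficients at the cross-validated lambda.min
coef(elastic_net_model, s = "lambda.min")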
library(xgboost)
set.seed(123)
# Convert data to XGBoost DMatrix (xgb.DMatrix expects a numeric matrix)
dtrain <- xgb.DMatrix(data = as.matrix(X_train), label = Y_train)
dtest <- xgb.DMatrix(data = as.matrix(X_test), label = Y_test)
# Define XGBoost Parameters (Default)
params <- list(
objective = "reg:squarederror", # Regression for predicting numerical target
booster = "gbtree", # Use tree-based boosting
eta = 0.1, # Learning rate
max_depth = 6, # Maximum depth of each tree
subsample = 0.8, # Subsample ratio of the training data
colsample_bytree = 0.8 # Subsample ratio of columns
)
#train model
xgboost_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 1000,
early_stopping_rounds = 10, # Stop if the test RMSE doesn’t improve for 10 rounds
watchlist = list(train = dtrain, test = dtest),
verbose = 0
)
# Predict on test data
xgboost_prediction <- predict(xgboost_model, dtest)
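Because early stopping was enabled, the boosting round actually used can be checked on the fitted model; a short sketch:
# Best round found by early stopping on the watchlist
xgboost_model$best_iteration
# Train/test RMSE trace recorded during training
tail(xgboost_model$evaluation_log)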
## KNN Accuracy Precision Recall F1
## 1 Original 0.8208617 0.8900804 0.8972973 0.8936743
## 2 SMOTE Balanced 0.8480726 0.8740741 0.9567568 0.9135484
## RF Accuracy Precision Recall F1
## 1 Original 0.8548753 0.8903061 0.9432432 0.9160105
## 2 SMOTE Balanced 0.8458050 0.8665049 0.9648649 0.9130435
## SVM Accuracy Precision Recall F1
## 1 Original 0.8367347 0.8962766 0.9108108 0.9034853
## 2 SMOTE Balanced 0.8458050 0.8629808 0.9702703 0.9134860
In general, models trained on the SMOTE-balanced dataset perform slightly better than those trained on the original imbalanced data, particularly on recall and F1 (Random Forest on accuracy being the exception). It is therefore worth resampling with SMOTE to improve model performance; a minimal sketch of the resampling step follows below.
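The sketch assumes a numeric feature frame X and a binary label vector y (names assumed) rather than the exact objects from the earlier code:
library(smotefamily)
# Synthesize minority-class samples using 5 nearest neighbours
smote_out <- SMOTE(X = X, target = y, K = 5)
balanced <- smote_out$data # original + synthetic rows; last column is "class"
table(balanced$class) # verify the improved class balance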
## Models Accuracy Precision Recall F1
## 1 KNN 0.8208617 0.8900804 0.8972973 0.8936743
## 2 Random Forest 0.8548753 0.8903061 0.9432432 0.9160105
## 3 SVM 0.8367347 0.8962766 0.9108108 0.9034853
The Random Forest model outperforms the other two models, with the highest accuracy, recall, and F1 score. This indicates that the model predicts accurately, and the high F1 score means it balances precision and recall well. KNN has the lowest accuracy but still maintains a good precision-recall balance, reflected in its F1 score. SVM has the highest precision but trails Random Forest slightly in recall and F1 score.
Feature Importance in Random Forest
Mean Decrease Gini measures variable importance based on the Gini impurity index used to evaluate splits in the trees. Higher values indicate features that contribute more to reducing impurity (the Gini index) in the model. From the graph, StagnationRate, OverTime, and YearsAtCompany are the top three factors in the Random Forest model; the underlying values can be read from the fitted forest, as sketched below.
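A minimal sketch, assuming rf_model (name assumed) is the fitted randomForest classifier:
library(randomForest)
importance(rf_model) # MeanDecreaseGini for each feature
varImpPlot(rf_model) # the importance plot described above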
Comparison of the F1 scores of the classification models between train and test data.
## Train F1 Score (KNN): 0.8774265
## Test F1 Score (KNN): 0.8933873
Good fit. With the balanced dataset, the test F1 score is higher than the train F1 score, which means the KNN model performs well without significant overfitting.
## Train F1 Score (Random Forest): 0.9375352
## Test F1 Score (Random Forest): 0.9160105
Slightly overfit. The gap indicates the model performs somewhat better on the training set than on unseen data; however, the difference is small and acceptable.
## Train F1 Score (SVM): 0.8687534
## Test F1 Score (SVM): 0.9034853
Good fit. The SVM model achieves a higher F1 score on the test set than on the train set, indicating that it is accurate and generalizes well.
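For reference, the F1 scores compared above reduce to precision and recall taken from a confusion matrix; a minimal helper sketch (function and argument names assumed):
# F1 = 2 * precision * recall / (precision + recall)
f1_score <- function(actual, predicted, positive) {
tp <- sum(predicted == positive & actual == positive)
fp <- sum(predicted == positive & actual != positive)
fn <- sum(predicted != positive & actual == positive)
precision <- tp / (tp + fp)
recall <- tp / (tp + fn)
2 * precision * recall / (precision + recall)
}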
## Model MAE MSE RMSE R2 Adj_R2
## 1 Linear Regression 1.3968233 3.044407 1.744823 0.7104385 0.7071025
## 2 Decision Tree 1.1746989 2.490537 1.578144 0.7632154 0.7604875
## 3 Elastic Net 1.3982779 3.038865 1.743234 0.7105643 0.7072298
## 4 XGBoost 0.6944892 1.175592 1.084247 0.8872953 0.7072298
XGBoost clearly outperforms the other three models across the evaluation metrics:
- Lowest error metrics (MAE, MSE, RMSE)
- Highest R2 and Adjusted R2
With an R2 of about 0.887, the XGBoost model explains roughly 88.7% of the variance in YearsAtCompany from the selected features (YearsWithCurrManager, YearsInMostRecentRole, StagnationRate, AgeNormalized, EthnicityWhite).
The Decision Tree is a simpler alternative for predicting employee tenure, achieving an R2 of 76.3% while using far fewer computational resources than XGBoost. It also outperforms Linear Regression and Elastic Net on every metric.
Linear Regression and Elastic Net score similarly across the metrics and performed poorly on the test data. These models fail to capture the non-linear relationships in the data, which causes both to underperform in tenure prediction.
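For reference, the Adjusted R2 column applies the standard correction for the number of predictors (p = 5 here); a minimal sketch, with the test-set size read from testData:
# Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)
adj_r2(0.7632154, n = nrow(testData), p = 5) # Decision Tree row above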
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.1387572 0.14787257 7.700936 3.175446e-14
## YearsInMostRecentRole 0.3499602 0.03104623 11.272230 7.289326e-28
## YearsWithCurrManager 0.4759478 0.02664998 17.859221 2.595076e-62
## EthnicityWhite -0.9045818 0.11616599 -7.786977 1.675088e-14
## AgeNormalized 3.3028031 0.25643764 12.879557 2.737538e-35
## StagnationRate 2.0149616 0.24512647 8.220090 6.106997e-16
Linear Regression
All features have very small p-values (< 0.05), suggesting that all predictors are statistically significant in the model.
AgeNormalized is the most impactful predictor, with the largest coefficient of 3.3: a one-unit increase in the normalized age (i.e., moving across the full age range) corresponds to an expected increase of about 3.3 years in company tenure.
It is followed by StagnationRate, which increases expected tenure by about 2 years for a one-unit increase in the rate.
EthnicityWhite shows a negative relationship with YearsAtCompany: employees of white ethnicity are expected to stay about 0.9 years less than others.
YearsInMostRecentRole and YearsWithCurrManager also have a positive impact on tenure, although their coefficients are smaller than those of the other predictors.
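The coefficient table above is standard lm summary output; a one-line sketch, assuming linear_model (name assumed) is the fitted lm object:
# Estimates, standard errors, t values and p-values, as shown above
summary(linear_model)$coefficients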
## YearsWithCurrManager YearsInMostRecentRole StagnationRate
## 5043.732 4285.701 3364.404
## AgeNormalized EthnicityWhite
## 3191.390 1066.731
Decision Tree
Importance rank:
1. YearsWithCurrManager
2. YearsInMostRecentRole
3. StagnationRate
4. AgeNormalized
5. EthnicityWhite
The manager relationship (YearsWithCurrManager) and career progression (YearsInMostRecentRole) have the greatest impact on employee tenure.
Stagnation and age are contributing factors of moderate importance to the duration of stay.
Ethnicity is the least important feature, contributing little to the decision tree's predictions.
Elastic Net Regression
The feature importance pattern and values are similar to those of linear regression.
Importance rank:
1. YearsInMostRecentRole
2. YearsWithCurrManager
3. EthnicityWhite
4. AgeNormalized
5. StagnationRate
Possible reasons for the similarity:
- Low regularization strength (lambda of about 0.02), so the model behaves much like standard linear regression
- No collinearity, as most highly correlated predictors were removed during correlation analysis
XGBoost
Importance rank:
1. YearsWithCurrManager
2. YearsInMostRecentRole
3. StagnationRate
4. AgeNormalized
5. EthnicityWhite
The ranking matches that of the Decision Tree regression model.
YearsWithCurrManager, YearsInMostRecentRole and StagnationRate are the top three drivers of predicted tenure, together accounting for nearly 80% of the total importance. AgeNormalized has a moderate influence, while EthnicityWhite contributes minimally, with an importance score of only 1.9%. The gain shares behind these figures can be read from the fitted booster, as sketched below.
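A minimal sketch, assuming xgboost_model is the booster trained earlier:
library(xgboost)
imp <- xgb.importance(model = xgboost_model)
imp # the Gain column sums to 1 across features
xgb.plot.importance(imp) # bar chart of the gain shares discussed above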
Comparison of the Mean Absolute Error (MAE) of the regression models between train and test data.
## Train MAE (Linear Regression): 1.390022
## Test MAE (Linear Regression): 1.396823
Good fit.
The linear regression MAE on unseen data is slightly higher than on the train set, but the two are close enough to indicate neither overfitting nor underfitting.
## Train MAE (Decision Tree): 1.128646
## Test MAE (Decision Tree): 1.174699
Slightly overfit.
Test MAE is somewhat higher than train MAE, suggesting the decision tree performs better on the training data than on unseen data, though the gap is modest.
## Train MAE (Elastic Net): 1.457014
## Test MAE (Elastic Net): 1.451609
Good fit.
Train and test MAE are similar, indicating that the Elastic Net model is accurate and well-generalized.
## Train MAE (XGBoost): 0.3872376
## Test MAE (XGBoost): 0.6944892
Overfit.
For XGBoost, the test-set MAE is noticeably higher than the train-set MAE; the large gap between train and test error indicates overfitting.
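These train/test MAE figures can be reproduced with the Metrics package; a short sketch for XGBoost (the other models follow the same pattern), assuming the objects defined earlier:
# MAE on the training and test sets
Metrics::mae(Y_train, predict(xgboost_model, dtrain))
Metrics::mae(Y_test, xgboost_prediction)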
tunegrid_knn <- data.frame(k = seq(11, 85, by = 2))
trainControl_knn <- trainControl(method = "cv", number = 10)
knnModel_tune <- train(
label ~ .,
data = classification_data,
method = "knn",
tuneGrid = tunegrid_knn,
trControl = trainControl_knn # tuneLength dropped: it is ignored when tuneGrid is supplied
)
# Plot performance
plot(knnModel_tune)
# Show best parameter
cat("Best Tune Hyperparameter (k): ", knnModel_tune$bestTune$k)
## Best Tune Hyperparameter (k): 21
## Model Accuracy Precision Recall F1
## 1 KNN 0.8208617 0.8900804 0.8972973 0.8936743
## 2 Tuned KNN 0.8072562 0.8925620 0.8756757 0.8840382
For KNN, the tuning appears to hurt generalization slightly, resulting in a lower F1 score despite a higher precision.
tunegrid_rf <- expand.grid(mtry = 1:5) # candidate 'mtry' values
trainControl_rf <- trainControl(method = "cv", number = 10) # 10-fold cross-validation (the explicit grid overrides a random search)
rfModel_tune <- train(
label ~ .,
data = classification_data,
method = "rf",
tuneGrid = tunegrid_rf,
trControl = trainControl_rf # tuneLength dropped: it is ignored when tuneGrid is supplied
)
# Plot performance
plot(rfModel_tune)
# Show best parameter
cat("Best Tune Hyperparameter (mtry): ", rfModel_tune$bestTune$mtry)
## Best Tune Hyperparameter (mtry): 1
## Model Accuracy Precision Recall F1
## 1 Random Forest 0.8548753 0.8903061 0.9432432 0.9160105
## 2 Tuned Random Forest 0.8548753 0.8844221 0.9513514 0.9166667
For Random Forest, the tuned model performs slightly better than the original on recall and F1, with identical accuracy and slightly lower precision. Given the marginal improvement, tuning is not strictly necessary.
tunegrid_svm <- expand.grid(C = c(0.25, 0.5, 0.75, 1), sigma = c(0.5, 0.75, 0.8, 0.9, 1))
trainControl_svm <- trainControl(method = "cv", number = 10) # 10-fold cross-validation (random search dropped: an explicit grid is supplied)
svmModel_tune <- train(
label ~ .,
data = classification_data,
method = "svmRadial",
tuneGrid = tunegrid_svm,
preProcess = c("center","scale"),
trControl = trainControl_svm # reuse the control object defined above
)
# Plot performance
plot(svmModel_tune)
# Show best parameter
cat("Best Tune Hyperparameter (sigma): ", svmModel_tune$bestTune$sigma)
## Best Tune Hyperparameter (sigma): 1
cat("Best Tune Hyperparameter (C): ", svmModel_tune$bestTune$C)
## Best Tune Hyperparameter (C): 1
## Model Accuracy Precision Recall F1
## 1 SVM 0.8367347 0.8962766 0.9108108 0.9034853
## 2 Tuned SVM 0.8321995 0.8978495 0.9027027 0.9002695
Although the tuned SVM shows slightly better precision, its overall performance is worse, especially in F1 score and accuracy. This suggests that the default SVM parameters are already well suited to the dataset, and tuning is not necessary.
Random Forest stands out as the top-performing model among the three, with the highest scores across the evaluation metrics. In predicting employee attrition, Random Forest indicates that StagnationRate, OverTime, and YearsAtCompany could be the main contributing factors. KNN and SVM also score highly in predicting attrition, but slightly below Random Forest. Random Forest is therefore the most suitable model for this prediction, given its strong F1 score, recall, and accuracy.
## Best Tune Hyperparameter (cp): 0.001
## Model MAE MSE RMSE R2 Adj_R2
## 1 Decision Tree 1.174699 2.490537 1.578144 0.7632154 0.7604875
## 2 Tuned Decision Tree 1.008521 2.140439 1.463024 0.7968465 0.7945060
## Train MAE (Tuned Decision Tree): 0.8315298
## Test MAE (Tuned Decision Tree): 1.008521
Although the performance metrics of the tuned tree have improved overall, the train-test gap has grown larger, indicating the model has become more overfit. The tuned tree will therefore struggle to generalize, and may not perform as well as the default decision tree on unseen data.
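The cp search itself is not shown above; a minimal sketch of how it could have been run with caret, assuming trainData and a YearsAtCompany target (both names assumed from earlier sections):
# Hypothetical cp grid search with 10-fold cross-validation
tunegrid_dt <- expand.grid(cp = seq(0.001, 0.05, by = 0.001))
dtModel_tune <- train(
YearsAtCompany ~ .,
data = trainData,
method = "rpart",
tuneGrid = tunegrid_dt,
trControl = trainControl(method = "cv", number = 10)
)
dtModel_tune$bestTune$cp # the value reported above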
# Define the grid of hyperparameters
grid <- expand.grid(
alpha = seq(0, 1, by = 0.1),
lambda = 10^seq(-4, 0, length=100)
)
# Define training control for 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)
set.seed(123)
# Train Elastic Net with caret
elastic_net_tune <- train(
label ~ ., data = trainData,
method = "glmnet",
trControl = train_control,
tuneGrid = grid
)
plot(elastic_net_tune)
# The selected alpha/lambda pair achieves the lowest CV error, balancing model parsimony with accuracy
## Best Tune Hyperparameter (alpha): 0.1
## Best Tune Hyperparameter (lambda): 0.04229243
## Model MAE MSE RMSE R2 Adj_R2
## 1 Elastic Net 1.398278 3.038865 1.743234 0.7105643 0.7072298
## 2 Tuned Elastic Net 1.398166 3.039481 1.743411 0.7104614 0.7071257
The tuned Elastic Net improves slightly on MAE, but the difference is negligible, and it performs marginally worse on MSE, RMSE and R2. This suggests that the default Elastic Net hyperparameters already work well on this dataset; given the minimal difference in performance, further tuning is unnecessary.
# Suppress messages and warnings during training
# (note: "/dev/null" is Unix-specific; on Windows use "NUL")
sink("/dev/null")
suppressWarnings({
suppressMessages({
# Define parameter grid
grid <- expand.grid(
nrounds = c(50, 100, 150), # Number of boosting rounds
max_depth = c(3, 4, 5), # Tree depth
eta = c(0.5, 0.1), # Learning rate
gamma = c(0), # Minimum loss reduction
colsample_bytree = c(0.5, 0.7), # Column sampling ratio
min_child_weight = c(1, 3), # Minimum child weight
subsample = c(0.7, 0.8, 1.0) # Subsample ratio
)
# Define training control for 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10) # 10-fold cross-validation
# Train the model
set.seed(123)
xgb_caret_model <- train(
x = X_train,
y = Y_train,
method = "xgbTree",
tuneGrid = grid,
trControl = train_control
)
})
})
sink() # Restore normal output
## Best Tune Hyperparameter (eta): 0.5
## Best Tune Hyperparameter (max_depth): 4
## Best Tune Hyperparameter (subsample): 1
## Best Tune Hyperparameter (colsample_bytree): 0.7
## Best Tune Hyperparameter (min_child_weight): 3
## Best Tune Hyperparameter (gamma): 0
## Model MAE MSE RMSE R2 Adj_R2
## 1 XGBoost 0.6944892 1.175592 1.084247 0.8872953 0.7072298
## 2 Tuned XGBoost 0.7952340 1.313201 1.145950 0.8743488 0.7071257
## Train MAE (Tuned XGBoost): 0.6133807
## Test MAE (Tuned XGBoost): 0.795234
Although the performance metrics of the tuned XGBoost drop slightly compared to the untuned model, it still produces good results overall, with an R2 above 85%. Since the aim of hyperparameter tuning here was better generalization, the tuned model has in fact reduced the overfitting. Tuning focused mainly on max_depth, eta, min_child_weight, subsample and nrounds to limit overfitting.
As the original model overfits more than the tuned version, the tuned XGBoost generalizes better and will likely perform better on unseen data.
XGBoost and the tuned XGBoost come out on top in predicting employee tenure. However, the original model may not perform as well on unseen data because it overfits the training set. The tuned XGBoost is the ideal choice, achieving similar performance to the original with less overfitting.
The Decision Tree comes second, with a good R2 of 76% and only slight overfitting, while requiring far fewer computational resources than XGBoost.
Linear Regression and Elastic Net, with R2 scores of 71%, are a fairly good fit for predicting tenure. Their errors tend to be larger than the decision tree's, but they generalize well. The similar results of Linear Regression and Elastic Net indicate that the data has minimal multicollinearity, since highly correlated predictors were removed during the feature selection stage; regularization offers little advantage when collinearity is low. Finally, the non-linear regression models predict employee tenure better than the linear models, given features such as YearsWithCurrManager, YearsInMostRecentRole, StagnationRate, AgeNormalized, and EthnicityWhite.