Group Members:

  1. Nicholas Ooi Jiawei 23110707
  2. Low E-Jie 23092124
  3. Ho Wei Wen 23097016
  4. Cheong Meng Ben 24051211
  5. Toh Chu Xian 23105612

Objective:

The project goal is to develop predictive models using Machine Learning that leverage employee demographic information, job-related insights, performance metrics and other key factors to enable the Human Resources (HR) department to make well-informed, data-driven decisions.

The two main objectives are:

  1. Employee Attrition Prediction:
  2. Employee Tenure Prediction:

Dataset Overview:

Dataset Link

Year: Uploaded Aug 2024 (data from 2012-2022).
Purpose of dataset: To analyze employee attrition based on key factors, including performance trends, which influence retention.
Data Set Content and Dimensions:
Total of 5 datasets:

Data Cleaning & Preprocessing

During the data cleaning process, the following tasks were addressed:

  1. Managing missing values, handling outliers, and ensuring proper date formatting.
  2. Transforming data through techniques such as one-hot encoding, label encoding, and normalization.
  3. Merging the employee and performance rating CSV files to analyze trends throughout the employee lifecycle.
  4. Performing feature engineering.

First, all the required packages were installed and loaded.
1. tidyr: To organise data for analysis and visualization
2. caret: For data preprocessing, model training, hyperparameter tuning, and model evaluation
3. dplyr: For data manipulation and transformation
4. ggplot2: For data visualization
5. TTR: For computing moving averages such as the Exponential Moving Average (EMA)
6. randomForest: For Random Forest machine learning tasks
7. glmnet: For fitting generalized linear models (GLMs) with regularization techniques
8. rpart: For building decision trees using recursive partitioning for both classification and regression tasks
9. rpart.plot: For visualizing decision trees
10. lubridate: For manipulating dates and times
11. caTools: For data partitioning, model evaluation and performance metrics
12. DMwR2: For data mining and machine learning utilities
13. smotefamily: For resampling class-imbalanced data (e.g., SMOTE)
14. xgboost: For the XGBoost regression model
15. grid: For organizing complex visualizations
16. corrplot: To visualize correlation matrices in a clear and concise manner
17. Metrics: For evaluating machine learning model performance using metrics such as RMSE, MAE, and R², providing insights into model accuracy and error

# install.packages("smotefamily")
# install.packages("igraph")
library(tidyr)
library(caret)
library(dplyr)
library(ggplot2)
library(TTR)
library(randomForest)
library(glmnet)
library(rpart)
library(rpart.plot)
library(lubridate)
library(caTools)
library(DMwR2)
library(smotefamily)
library(xgboost)
library(grid)
library(corrplot)
library(Metrics)

Next, the CSV file datasets were read into R dataframes.

employee <- read.csv('dataset/Employee.csv', header = TRUE)
performance <- read.csv('dataset/PerformanceRating.csv', header = TRUE)
edu_lvl <- read.csv('dataset/EducationLevel.csv', header = TRUE)
rating_lvl <- read.csv('dataset/RatingLevel.csv', header = TRUE)
satisified_lvl <- read.csv('dataset/SatisfiedLevel.csv', header = TRUE)

# Inspect the structure of each dataframe
str(employee)
## 'data.frame':    1470 obs. of  23 variables:
##  $ EmployeeID             : chr  "3012-1A41" "CBCB-9C9D" "95D7-1CE9" "47A0-559B" ...
##  $ FirstName              : chr  "Leonelle" "Leonerd" "Ahmed" "Ermentrude" ...
##  $ LastName               : chr  "Simco" "Aland" "Sykes" "Berrie" ...
##  $ Gender                 : chr  "Female" "Male" "Male" "Non-Binary" ...
##  $ Age                    : int  30 38 43 39 29 34 42 40 38 31 ...
##  $ BusinessTravel         : chr  "Some Travel" "Some Travel" "Some Travel" "Some Travel" ...
##  $ Department             : chr  "Sales" "Sales" "Human Resources" "Technology" ...
##  $ DistanceFromHome..KM.  : int  27 23 29 12 29 30 45 3 20 4 ...
##  $ State                  : chr  "IL" "CA" "CA" "IL" ...
##  $ Ethnicity              : chr  "White" "White" "Asian or Asian American" "White" ...
##  $ Education              : int  5 4 4 3 2 2 3 2 4 2 ...
##  $ EducationField         : chr  "Marketing" "Marketing" "Marketing " "Computer Science" ...
##  $ JobRole                : chr  "Sales Executive" "Sales Executive" "HR Business Partner" "Engineering Manager" ...
##  $ MaritalStatus          : chr  "Divorced" "Single" "Married" "Married" ...
##  $ Salary                 : int  102059 157718 309964 293132 49606 133468 259284 104426 147098 69747 ...
##  $ StockOptionLevel       : int  1 0 1 0 0 1 1 1 1 0 ...
##  $ OverTime               : chr  "No" "Yes" "No" "No" ...
##  $ HireDate               : chr  "2012-01-03" "2012-01-04" "2012-01-04" "2012-01-05" ...
##  $ Attrition              : chr  "No" "No" "No" "No" ...
##  $ YearsAtCompany         : int  10 10 10 10 6 10 10 10 10 6 ...
##  $ YearsInMostRecentRole  : int  4 6 6 10 1 3 2 3 5 5 ...
##  $ YearsSinceLastPromotion: int  9 10 10 10 1 7 6 4 8 5 ...
##  $ YearsWithCurrManager   : int  7 0 8 0 6 9 6 6 2 1 ...
str(performance)
## 'data.frame':    6709 obs. of  11 variables:
##  $ PerformanceID                  : chr  "PR01" "PR02" "PR03" "PR04" ...
##  $ EmployeeID                     : chr  "79F7-78EC" "B61E-0F26" "F5E3-48BB" "0678-748A" ...
##  $ ReviewDate                     : chr  "1/2/2013" "1/3/2013" "1/3/2013" "1/4/2013" ...
##  $ EnvironmentSatisfaction        : int  5 5 3 5 5 3 3 4 4 5 ...
##  $ JobSatisfaction                : int  4 4 4 3 2 3 4 5 5 4 ...
##  $ RelationshipSatisfaction       : int  5 4 5 2 3 2 5 4 2 3 ...
##  $ TrainingOpportunitiesWithinYear: int  1 1 3 2 1 2 2 1 1 2 ...
##  $ TrainingOpportunitiesTaken     : int  0 3 2 0 0 0 1 1 1 3 ...
##  $ WorkLifeBalance                : int  4 4 3 2 4 4 5 3 4 4 ...
##  $ SelfRating                     : int  4 4 5 3 4 4 4 3 5 5 ...
##  $ ManagerRating                  : int  4 3 4 2 3 4 3 2 4 4 ...
str(edu_lvl)
## 'data.frame':    5 obs. of  2 variables:
##  $ EducationLevelID: int  1 2 3 4 5
##  $ EducationLevel  : chr  "No Formal Qualifications" "High School " "Bachelors " "Masters " ...
str(rating_lvl)
## 'data.frame':    5 obs. of  2 variables:
##  $ RatingID   : int  1 2 3 4 5
##  $ RatingLevel: chr  "Unacceptable" "Needs Improvement" "Meets Expectation" "Exceeds Expectation " ...
str(satisified_lvl)
## 'data.frame':    5 obs. of  2 variables:
##  $ SatisfactionID   : int  1 2 3 4 5
##  $ SatisfactionLevel: chr  "Very Dissatisfied" "Dissatisfied" "Neutral" "Satisfied " ...

Then, check for missing values and duplicates in the employee and performance dataframes:

# Count total NA values in the dataframes
total_na_count1 <- sum(is.na(employee))
total_na_count2 <- sum(is.na(performance))
print(paste(total_na_count1, "missing values in employee df"))
## [1] "0 missing values in employee df"
print(paste(total_na_count2, "missing values in performance df"))
## [1] "0 missing values in performance df"
# Check for duplicates
duplicate1 <- sum(duplicated(employee))
duplicate2 <- sum(duplicated(performance))
print(paste(duplicate1, "duplicate rows in employee df"))
## [1] "0 duplicate rows in employee df"
print(paste(duplicate2, "duplicate rows in performance df"))
## [1] "0 duplicate rows in performance df"

Check for unique values in columns.

# Print all the unique values
uniquegender <- unique(employee$Gender)
print(uniquegender)
## [1] "Female"            "Male"              "Non-Binary"       
## [4] "Prefer Not To Say"
uniquetravel <- unique(employee$BusinessTravel)
print(uniquetravel)
## [1] "Some Travel"        "No Travel "         "Frequent Traveller"

To ease preprocessing, the date columns were converted from character to Date type.

# Date transformation
# Convert date from chr to date
performance$ReviewDate <- as.Date(performance$ReviewDate, format = "%m/%d/%Y")
employee$HireDate <- as.Date(employee$HireDate, format = "%Y-%m-%d")
# print(class(performance$ReviewDate))
# print(class(employee$HireDate))

# Arrange based on employee ID and review date
performance_sorted <- performance %>% arrange(EmployeeID,ReviewDate)

Then, to aggregate the satisfaction ratings across multiple aspects, an OverallSatisfaction column was calculated and added to performance_overall.

# Calculate overall job satisfaction of employee towards workplace
performance_overall <- performance_sorted %>%
  mutate(OverallSatisfaction = (EnvironmentSatisfaction + JobSatisfaction + RelationshipSatisfaction + WorkLifeBalance)/4) %>%
  select(EmployeeID,ReviewDate,EnvironmentSatisfaction, JobSatisfaction, RelationshipSatisfaction, WorkLifeBalance, TrainingOpportunitiesWithinYear,TrainingOpportunitiesTaken,SelfRating,ManagerRating,OverallSatisfaction) %>%
  arrange(EmployeeID,ReviewDate)

One-Hot Encoding was performed to convert categorical data into numerical indicator columns for use in machine learning algorithms.

# One Hot Encoding

dummy <- dummyVars("~ Gender + BusinessTravel + Department + State + Ethnicity + EducationField + JobRole + MaritalStatus", data = employee)
employeenew <- predict(dummy, newdata = employee)
employeenew <- as.data.frame(employeenew)
employeenew <- cbind(employee, employeenew)

Binary Encoding was performed on categorical data with “Yes” and “No” values.

# Transforming "yes" "no" to binary (1 = yes, 0 = no)
employeenew <- employeenew %>%
  mutate(Attrition = ifelse(Attrition == "Yes", 1, 0)) %>%
  mutate(OverTime = ifelse(OverTime == "Yes", 1, 0))

Min-Max Normalization was applied to rescale features with differing ranges onto a common 0-1 scale.

# Normalization
subset <- employeenew[, c("Age", "DistanceFromHome..KM.", "Salary")]
preprocess <- preProcess(subset, method = "range")
normalizeddata <- predict(preprocess, newdata = subset)
colnames(normalizeddata) <- c("AgeNormalized", "DistanceFromHomeNormalized", "SalaryNormalized")
employeenew <- cbind(employeenew, normalizeddata)

Then, to match employee performance with employee data, the data was merged using a left join on the employee dataset, keeping every employee (including those without any performance review) alongside all of their performance records.

# Perform a left join
employee_perf <- merge(employeenew, performance_overall, by = "EmployeeID", all.x = TRUE)

# View(employee_perf)
glimpse(employee_perf)
## Rows: 6,899
## Columns: 81
## $ EmployeeID                                  <chr> "001A-8F88", "005C-E0FB", …
## $ FirstName                                   <chr> "Christy", "Fin", "Fin", "…
## $ LastName                                    <chr> "Jumel", "O'Halleghane", "…
## $ Gender                                      <chr> "Male", "Non-Binary", "Non…
## $ Age                                         <int> 22, 24, 24, 24, 30, 30, 30…
## $ BusinessTravel                              <chr> "Some Travel", "Frequent T…
## $ Department                                  <chr> "Technology", "Sales", "Sa…
## $ DistanceFromHome..KM.                       <int> 40, 17, 17, 17, 6, 6, 6, 6…
## $ State                                       <chr> "CA", "CA", "CA", "CA", "C…
## $ Ethnicity                                   <chr> "White", "White", "White",…
## $ Education                                   <int> 4, 4, 4, 4, 2, 2, 2, 2, 2,…
## $ EducationField                              <chr> "Information Systems", "Ma…
## $ JobRole                                     <chr> "Software Engineer", "Sale…
## $ MaritalStatus                               <chr> "Married", "Married", "Mar…
## $ Salary                                      <int> 27763, 56155, 56155, 56155…
## $ StockOptionLevel                            <int> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ OverTime                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ HireDate                                    <date> 2021-09-05, 2017-08-26, 2…
## $ Attrition                                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ YearsAtCompany                              <int> 1, 5, 5, 5, 10, 10, 10, 10…
## $ YearsInMostRecentRole                       <int> 0, 2, 2, 2, 3, 3, 3, 3, 3,…
## $ YearsSinceLastPromotion                     <int> 1, 2, 2, 2, 6, 6, 6, 6, 6,…
## $ YearsWithCurrManager                        <int> 0, 0, 0, 0, 6, 6, 6, 6, 6,…
## $ GenderFemale                                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GenderMale                                  <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `GenderNon-Binary`                          <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `GenderPrefer Not To Say`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelFrequent Traveller`          <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `BusinessTravelNo Travel `                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelSome Travel`                 <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `DepartmentHuman Resources`                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ DepartmentSales                             <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ DepartmentTechnology                        <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ StateCA                                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ StateIL                                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ StateNY                                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAmerican Indian or Alaska Native` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAsian or Asian American`          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityBlack or African American`        <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `EthnicityMixed or multiple ethnic groups`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityNative Hawaiian `                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityOther `                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EthnicityWhite                              <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `EducationFieldBusiness Studies`            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldComputer Science`            <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ EducationFieldEconomics                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldHuman Resources`             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldInformation Systems`         <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EducationFieldMarketing                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldMarketing `                  <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ EducationFieldOther                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldTechnical Degree`            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleAnalytics Manager`                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleData Scientist`                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleEngineering Manager`                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Business Partner`                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Executive`                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Manager`                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleMachine Learning Engineer`          <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ JobRoleManager                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleRecruiter                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSales Executive`                    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `JobRoleSales Representative`               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSenior Software Engineer`           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSoftware Engineer`                  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusDivorced                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusMarried                        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ MaritalStatusSingle                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AgeNormalized                               <dbl> 0.1212121, 0.1818182, 0.18…
## $ DistanceFromHomeNormalized                  <dbl> 0.8863636, 0.3636364, 0.36…
## $ SalaryNormalized                            <dbl> 0.01400107, 0.06789454, 0.…
## $ ReviewDate                                  <date> NA, 2020-06-17, 2022-06-1…
## $ EnvironmentSatisfaction                     <int> NA, 3, 3, 4, 5, 3, 5, 5, 3…
## $ JobSatisfaction                             <int> NA, 3, 4, 4, 4, 3, 2, 2, 2…
## $ RelationshipSatisfaction                    <int> NA, 2, 5, 5, 2, 4, 4, 5, 2…
## $ WorkLifeBalance                             <int> NA, 2, 4, 5, 2, 2, 5, 5, 4…
## $ TrainingOpportunitiesWithinYear             <int> NA, 1, 3, 1, 3, 3, 1, 3, 3…
## $ TrainingOpportunitiesTaken                  <int> NA, 2, 0, 1, 0, 1, 0, 0, 0…
## $ SelfRating                                  <int> NA, 4, 4, 3, 3, 3, 3, 5, 3…
## $ ManagerRating                               <int> NA, 3, 4, 3, 2, 3, 3, 4, 3…
## $ OverallSatisfaction                         <dbl> NA, 2.50, 4.00, 4.50, 3.25…

For employees with Attrition == 1, ExitDate was calculated as HireDate plus YearsAtCompany.

# Create a column of exit date if employee has left the company.
# Ensure ExitDate is properly handled as a Date
employee_perf_clean <- employee_perf %>%
  mutate(
    ExitDate = if_else(
      Attrition == 1,
      as.Date(HireDate) + years(YearsAtCompany),
      as.Date(NA)  # Maintain NA as a Date
    )
  )

# str(employee_perf_clean)

There are 190 employee records without a performance review entry; this indicates these employees may be new joiners who have not yet had their first review.

# Find rows where review date is NA
missing_review_date <- employee_perf_clean %>%
  filter(is.na(ReviewDate))

# View the result
sum(is.na(employee_perf_clean$ReviewDate))
## [1] 190

There is dirty data whereby a performance review is dated before an employee joined the company or after an employee left. A total of 1818 rows are cleaned up by removing these ratings.

# Retaining only those with valid performance rating
checking <- employee_perf_clean %>%
  filter(
    is.na(ReviewDate) |               
    (ReviewDate >= HireDate &
    (is.na(ExitDate) | ReviewDate <= ExitDate))      # ReviewDate >= HireDate and ReviewDate <= ExitDate (if ExitDate is not NA)
  )

# str(checking)
# colSums(is.na(checking))

Identifying the earliest and latest hire dates helps pinpoint employees who haven’t yet undergone their initial performance review. These individuals might be recent hires or may not have completed the required tenure, such as one year, necessary for a performance evaluation.

# Filter rows where ReviewDate is NA and select HireDate
check1 <- checking %>% filter(is.na(ReviewDate)) %>% select(HireDate, YearsAtCompany)

# Calculate min and max HireDate from the 'check1' dataset
min_hire_date <- min(check1$HireDate, na.rm = TRUE)
max_hire_date <- max(check1$HireDate, na.rm = TRUE)
max_years <- max(check1$YearsAtCompany, na.rm = TRUE)

# Calculate max ReviewDate for the entire dataset
max_review_date <- max(employee_perf_clean$ReviewDate, na.rm = TRUE)

# Print results
print(paste("Minimum hire date with an NA:", min_hire_date))
## [1] "Minimum hire date with an NA: 2021-06-30"
print(paste("Maximum hire date with an NA:", max_hire_date))
## [1] "Maximum hire date with an NA: 2022-12-31"
print(paste("Check max years at company with an NA:" ,max_years))
## [1] "Check max years at company with an NA: 1"
print(paste("Check last review date:",max_review_date))
## [1] "Check last review date: 2022-12-31"

Next, the employee lifecycle is calculated. For employees without a ReviewDate (ReviewDate == NA), the latest review date in the dataset is substituted, and any resulting Lifecycle of 0 is replaced with 1.

employee_perf_check <- checking %>%
  mutate(
    ReviewDate = ifelse(is.na(ReviewDate), max_review_date, ReviewDate),
    ReviewDate = as.Date(ReviewDate),
    Lifecycle = as.numeric(difftime(ReviewDate, HireDate, units = "days")) / 365,
    Lifecycle = ifelse(ceiling(Lifecycle) == 0, 1, ceiling(Lifecycle))
  )

Since employees with missing ratings mainly have Lifecycle == 1, the mean performance ratings for this group are calculated.

# Calculate mean for imputation where Lifecycle = 1
means1 <- employee_perf_check %>%
  filter(Lifecycle == 1) %>%
  summarise(
    EnvironmentSatisfaction_mean = mean(EnvironmentSatisfaction, na.rm = TRUE),
    JobSatisfaction_mean = mean(JobSatisfaction, na.rm = TRUE),
    RelationshipSatisfaction_mean = mean(RelationshipSatisfaction, na.rm = TRUE),
    WorkLifeBalance_mean = mean(WorkLifeBalance, na.rm = TRUE),
    TrainingOpportunitiesWithinYear_mean = mean(TrainingOpportunitiesWithinYear, na.rm = TRUE),
    TrainingOpportunitiesTaken_mean = mean(TrainingOpportunitiesTaken, na.rm = TRUE),
    SelfRating_mean = mean(SelfRating, na.rm = TRUE),
    ManagerRating_mean = mean(ManagerRating, na.rm = TRUE),
    OverallSatisfaction_mean = mean(OverallSatisfaction, na.rm = TRUE)
  )

For employees with missing performance ratings, mean imputation using the Lifecycle == 1 means was then performed.

# Perform mean imputation for missing values (NA) using Lifecycle = 1 means
employee_perf_check <- employee_perf_check %>%
  mutate(
    EnvironmentSatisfaction = ifelse(is.na(EnvironmentSatisfaction), means1$EnvironmentSatisfaction_mean, EnvironmentSatisfaction),
    JobSatisfaction = ifelse(is.na(JobSatisfaction), means1$JobSatisfaction_mean, JobSatisfaction),
    RelationshipSatisfaction = ifelse(is.na(RelationshipSatisfaction), means1$RelationshipSatisfaction_mean, RelationshipSatisfaction),
    WorkLifeBalance = ifelse(is.na(WorkLifeBalance), means1$WorkLifeBalance_mean, WorkLifeBalance),
    TrainingOpportunitiesWithinYear = ifelse(is.na(TrainingOpportunitiesWithinYear), means1$TrainingOpportunitiesWithinYear_mean, TrainingOpportunitiesWithinYear),
    TrainingOpportunitiesTaken = ifelse(is.na(TrainingOpportunitiesTaken), means1$TrainingOpportunitiesTaken_mean, TrainingOpportunitiesTaken),
    SelfRating = ifelse(is.na(SelfRating), means1$SelfRating_mean, SelfRating),
    ManagerRating = ifelse(is.na(ManagerRating), means1$ManagerRating_mean, ManagerRating),
    OverallSatisfaction = ifelse(is.na(OverallSatisfaction), means1$OverallSatisfaction_mean, OverallSatisfaction)
  )

# View the resulting dataset
glimpse(employee_perf_check)
## Rows: 5,081
## Columns: 83
## $ EmployeeID                                  <chr> "001A-8F88", "005C-E0FB", …
## $ FirstName                                   <chr> "Christy", "Fin", "Fin", "…
## $ LastName                                    <chr> "Jumel", "O'Halleghane", "…
## $ Gender                                      <chr> "Male", "Non-Binary", "Non…
## $ Age                                         <int> 22, 24, 24, 24, 30, 30, 30…
## $ BusinessTravel                              <chr> "Some Travel", "Frequent T…
## $ Department                                  <chr> "Technology", "Sales", "Sa…
## $ DistanceFromHome..KM.                       <int> 40, 17, 17, 17, 6, 6, 6, 6…
## $ State                                       <chr> "CA", "CA", "CA", "CA", "C…
## $ Ethnicity                                   <chr> "White", "White", "White",…
## $ Education                                   <int> 4, 4, 4, 4, 2, 2, 2, 2, 2,…
## $ EducationField                              <chr> "Information Systems", "Ma…
## $ JobRole                                     <chr> "Software Engineer", "Sale…
## $ MaritalStatus                               <chr> "Married", "Married", "Mar…
## $ Salary                                      <int> 27763, 56155, 56155, 56155…
## $ StockOptionLevel                            <int> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ OverTime                                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ HireDate                                    <date> 2021-09-05, 2017-08-26, 2…
## $ Attrition                                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ YearsAtCompany                              <int> 1, 5, 5, 5, 10, 10, 10, 10…
## $ YearsInMostRecentRole                       <int> 0, 2, 2, 2, 3, 3, 3, 3, 3,…
## $ YearsSinceLastPromotion                     <int> 1, 2, 2, 2, 6, 6, 6, 6, 6,…
## $ YearsWithCurrManager                        <int> 0, 0, 0, 0, 6, 6, 6, 6, 6,…
## $ GenderFemale                                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ GenderMale                                  <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `GenderNon-Binary`                          <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `GenderPrefer Not To Say`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelFrequent Traveller`          <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `BusinessTravelNo Travel `                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelSome Travel`                 <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `DepartmentHuman Resources`                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ DepartmentSales                             <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ DepartmentTechnology                        <dbl> 1, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ StateCA                                     <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ StateIL                                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ StateNY                                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAmerican Indian or Alaska Native` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAsian or Asian American`          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityBlack or African American`        <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ `EthnicityMixed or multiple ethnic groups`  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityNative Hawaiian `                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityOther `                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EthnicityWhite                              <dbl> 1, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `EducationFieldBusiness Studies`            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldComputer Science`            <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ EducationFieldEconomics                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldHuman Resources`             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldInformation Systems`         <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EducationFieldMarketing                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldMarketing `                  <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ EducationFieldOther                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldTechnical Degree`            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleAnalytics Manager`                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleData Scientist`                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleEngineering Manager`                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Business Partner`                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Executive`                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Manager`                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleMachine Learning Engineer`          <dbl> 0, 0, 0, 0, 1, 1, 1, 1, 1,…
## $ JobRoleManager                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleRecruiter                            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSales Executive`                    <dbl> 0, 1, 1, 1, 0, 0, 0, 0, 0,…
## $ `JobRoleSales Representative`               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSenior Software Engineer`           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSoftware Engineer`                  <dbl> 1, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusDivorced                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusMarried                        <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ MaritalStatusSingle                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ AgeNormalized                               <dbl> 0.12121212, 0.18181818, 0.…
## $ DistanceFromHomeNormalized                  <dbl> 0.8863636, 0.3636364, 0.36…
## $ SalaryNormalized                            <dbl> 0.014001067, 0.067894544, …
## $ ReviewDate                                  <date> 2022-12-31, 2020-06-17, 2…
## $ EnvironmentSatisfaction                     <dbl> 3.833333, 3.000000, 3.0000…
## $ JobSatisfaction                             <dbl> 3.535088, 3.000000, 4.0000…
## $ RelationshipSatisfaction                    <dbl> 3.526316, 2.000000, 5.0000…
## $ WorkLifeBalance                             <dbl> 3.495614, 2.000000, 4.0000…
## $ TrainingOpportunitiesWithinYear             <dbl> 2.074561, 1.000000, 3.0000…
## $ TrainingOpportunitiesTaken                  <dbl> 1.02193, 2.00000, 0.00000,…
## $ SelfRating                                  <dbl> 4, 4, 4, 3, 3, 3, 3, 5, 3,…
## $ ManagerRating                               <dbl> 3.54386, 3.00000, 4.00000,…
## $ OverallSatisfaction                         <dbl> 3.597588, 2.500000, 4.0000…
## $ ExitDate                                    <date> NA, NA, NA, NA, NA, NA, N…
## $ Lifecycle                                   <dbl> 2, 3, 5, 4, 9, 4, 7, 6, 10…

Finally, the unique employee dataset and the merged employee performance rating dataset are ready to proceed to Exploratory Data Analysis (EDA) and Feature Engineering.

Exploratory Data Analysis (EDA)

Distribution of Company’s Hiring vs Attrition

Insights:

  • The total numbers of employees hired and lost to attrition from 2012 to 2022 are 1470 and 237 respectively.
  • The year with the most hires is 2022, while the year with the most attrition is 2021.
  • Attrition counts vary widely and increase significantly after 2020, whereas hires fluctuate much less, staying around 100 to 150 per year.
  • This suggests that attrition has been rising in recent years, and further analysis could be done to determine the cause of the drastic increase.

Calculate total number of employees each year based on hire date, cumulative count
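The chunk for this step is not echoed in the knit; a minimal dplyr/lubridate sketch that should reproduce the table below (column names taken from the output) is:

# Hedged sketch: yearly hires and a running total, from the Date-typed HireDate
hires_by_year <- employee %>%
  mutate(Year = year(HireDate)) %>%
  count(Year, name = "EmployeesHired") %>%
  arrange(Year) %>%
  mutate(CumulativeEmployees = cumsum(EmployeesHired))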

## # A tibble: 11 × 3
##     Year EmployeesHired CumulativeEmployees
##    <dbl>          <int>               <int>
##  1  2012            151                 151
##  2  2013            136                 287
##  3  2014            136                 423
##  4  2015            127                 550
##  5  2016            114                 664
##  6  2017            106                 770
##  7  2018            136                 906
##  8  2019            145                1051
##  9  2020            127                1178
## 10  2021            137                1315
## 11  2022            155                1470

Calculate total attrition employees each year, cumulative count
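Again the chunk is hidden; a sketch using the ExitDate derived earlier (reduced to one row per employee) might be:

# Hedged sketch: yearly attrition counts based on the derived ExitDate
attrition_by_year <- employee_perf_clean %>%
  distinct(EmployeeID, .keep_all = TRUE) %>%
  filter(Attrition == 1) %>%
  mutate(Year = year(ExitDate)) %>%
  count(Year, name = "AttritionCount")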

Calculate cumulative employees remaining and rate of imbalance
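A sketch combining the two summaries follows; the two rate definitions are assumptions inferred only from the column names in the output:

# Hedged sketch: join hires with attrition and derive cumulative columns
hire_attrition <- hires_by_year %>%
  left_join(attrition_by_year, by = "Year") %>%
  mutate(
    AttritionCount = coalesce(AttritionCount, 0),
    CumulativeAttrition = cumsum(AttritionCount),
    CumulativeEmployeesRemain = CumulativeEmployees - CumulativeAttrition,
    AttritionRatePerYear = AttritionCount / CumulativeEmployeesRemain,                 # assumed definition
    AttritiontoEmployeesRemainRatio = CumulativeAttrition / CumulativeEmployeesRemain  # assumed definition
  )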

## # A tibble: 11 × 8
##     Year EmployeesHired CumulativeEmployees AttritionCount CumulativeAttrition
##    <dbl>          <int>               <int>          <dbl>               <dbl>
##  1  2012            151                 151              0                   0
##  2  2013            136                 287              2                   2
##  3  2014            136                 423             10                  12
##  4  2015            127                 550             11                  23
##  5  2016            114                 664             12                  35
##  6  2017            106                 770             14                  49
##  7  2018            136                 906             21                  70
##  8  2019            145                1051             21                  91
##  9  2020            127                1178             32                 123
## 10  2021            137                1315             60                 183
## 11  2022            155                1470             54                 237
## # ℹ 3 more variables: CumulativeEmployeesRemain <dbl>,
## #   AttritionRatePerYear <dbl>, AttritiontoEmployeesRemainRatio <dbl>

Create a combined plot for both hire and attrition trends
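The plotting chunk printed only its theme object in the knit; a hedged ggplot2 sketch of the intended combined plot (using the hire_attrition tibble sketched above) is:

# Hedged sketch: yearly hires vs attrition on one set of axes
ggplot(hire_attrition, aes(x = Year)) +
  geom_line(aes(y = EmployeesHired, colour = "Hired")) +
  geom_line(aes(y = AttritionCount, colour = "Attrition")) +
  labs(y = "Employee Count", colour = NULL) +
  theme(legend.position = "bottom")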


Distribution of Employees’ Demographics

  1. Distribution of Employees by Gender

Insights:

  • There are more female than male employees in the company.
  • Besides female and male, some employees identify as non-binary and some prefer not to disclose their gender.
  2. Distribution of Employees by Age

Insights:

  • The age range with the highest employee count is 20 to 30, while each age band between 30 and 50 has fewer than 50 employees, with numbers dropping sharply towards 50.
  • This indicates that most of the company’s employees are young adults aged 20 to 30.
  3. Distribution of Employees by DistanceFromHome

Insights:

  • Distance from home varies widely across employees, with fewer employees at the lower and upper ends of the range.
  4. Distribution of Employees by Ethnicity

Insights:

  • White is the most common ethnicity among employees by a large margin, followed by Black or African American and mixed or multiple ethnic groups.
  • The “Other” ethnicity has the lowest employee count, followed by Native Hawaiian and American Indian or Alaska Native.
  5. Distribution of Employees by Salary Range

Insights:

  • The annual salary range with the highest employee count is 50k to 100k, followed by 25k to 50k, more than 150k, 100k to 150k, and less than 25k.

  • The two most common salary ranges, 50k to 100k and 25k to 50k, each contain over 400 employees, while the least common, less than 25k and 100k to 150k, each contain fewer than 200.

  • This highlights that most employees earn between 25k and 100k, with a moderate number earning over 150k.

  • The department with the highest employee count is Technology, followed by Sales and Human Resources.

  • This shows that the Technology department plays a central role in the company’s operations, with departments like Sales and Human Resources playing more of a supporting role within the organisation.

  6. Distribution of Employees by Department and Years at Company

Insights:

  • Employees who have been with the company for less than a year have the highest count in every department.
  • Employees with 6 years of tenure have the lowest count in the Sales and Technology departments, while employees with 8 years have the lowest count in Human Resources.
  • Sales and Technology have the most employees with 0 to 1 year of tenure, while in Human Resources the largest groups are employees with either 0 or 10 years.
  7. Distribution of Years at Company by Department

Insights:

  • The median YearsAtCompany in every department is around 4 years.
  • The middle 50% of employees have been with the company for 1 to 7.5 years in Human Resources, 1 to 8 years in Sales, and 2 to 7 years in Technology.

  8. Distribution of Employees by Education and Years at Company

Insights:

  • Employees with 0 to 1 year of tenure have the highest count at most education levels, except for High School, where 9-year employees are the most numerous, and Doctorate, where 7-year employees are.
  • The least common tenure differs by education level: 6 years for No Formal Qualifications, 8 years for High School and Masters, 2 years for Bachelors, and 2, 3, 6 and 9 years equally for Doctorate.
  • Employees holding Bachelors and Masters qualifications make up the majority of the company’s workforce.
  9. Heatmap of Employees by Education and Years at Company (Individual Years)

Insights:

  • In Human Resources, Bachelors employees with either 0 or 10 years at the company have the highest counts.
  • In Sales, Bachelors employees with either 1 or 8 years of tenure have the highest counts, and Masters employees with 0 years are the most numerous in that group. Only very few Doctorate employees are in the Sales department.
  • Technology has the most Bachelors employees, spread across varying years at company, followed by Masters and High School qualifications.

Salary Distribution

  1. Salary Distribution by Department and Job Role

Insights:

  • HR Manager, Analytics Manager, HR Business Partner, Manager and Engineering Manager have the highest median salaries among all job roles across departments.
  • Recruiter has the lowest median salary, followed by Sales Representative, Data Scientist, Software Engineer and the remaining job roles.
  • Job roles such as HR Business Partner, Manager, Analytics Manager and Engineering Manager have a wider salary distribution than other job roles.
  • The middle 50% of HR Managers earn more than 35k, while the middle 50% of Recruiters and Sales Representatives earn less than 5k.

  2. Salary Distribution by Education Level

Insights:

  • The median monthly salary is lowest for employees with no formal qualifications and highest for employees with a doctorate.
  • The middle 50% of employees with no formal qualifications earn a monthly salary below 10k, while the middle 50% of doctorate holders earn between roughly 5k and 17.5k.
  • The median monthly salaries for employees with High School, Bachelors and Masters qualifications are similar, at around 6k.

  3. Monthly Income Distribution by Attrition Status

Insights:

  • The median monthly salary of attrition employees is lower than that of employees who stayed with the company.
  • Employees who stayed have salaries ranging from 5k to over 10k, while attrition employees have salaries between roughly 2.5k and 7.5k.

Feature Engineering

Key features were engineered to enhance the dataset for modeling purposes. The feature engineering steps included:

  1. Introduction of Hierarchical Features: Differentiating roles into managerial, senior, and executive categories.
  2. Incorporation of Career Progression Metrics:
  3. Exponential Moving Average (EMA) of Performance Ratings: Applied to employee performance ratings to give more weight to recent evaluations, emphasizing their current performance trends.

Hierarchical Features

  1. Job Role and Years at Company:
  • The majority of employees fall under Executive roles, with significantly more new hires (0 years at company) and a steady decrease in counts as years of service increase. This may indicate a higher recruitment rate or higher turnover within these roles.
  • The overall numbers of Managers and Seniors are significantly lower than Executives, with counts spread across all tenure lengths.

  2. Monthly Salary Distribution by Department and Job Hierarchy:
  • Managers in the Human Resources department generally earn higher salaries compared to managers in other departments, with a median salary of $40k, whereas the median for managerial roles in the Sales and Technology departments is around $25k.
  • Executives across all departments have a median salary of approximately $5k, with executives in the Technology department displaying more outliers at the higher end of the salary range.
  3. Attrition across Job Hierarchy:
  • Attrition rate among executives is significantly higher, reaching approximately 30%.
  • Managerial and Senior attrition rate remains relatively low, indicating greater stability in these positions.

  4. Job Role and Department Across Education Field:
  • Technology roles mainly employ individuals with Computer Science and Information Systems backgrounds.
  • Employees in Sales primarily come from Marketing and Economics backgrounds.
  • Computer Science and Information Systems dominate the higher-ranking roles, especially in the technical domain, suggesting that senior roles are primarily concentrated within Technology.

Career Progression Features

A higher stagnation rate indicates a greater risk of career stagnation, i.e. fewer promotions relative to an employee's tenure.
Role stability reflects how long an employee has remained in their current role relative to their total tenure.
By combining these two metrics, we can categorize employees into several groups, as sketched below.
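The feature-engineering chunk itself is not shown; the formulas below are inferred from values printed later in the report (e.g., StagnationRate = 0.818 alongside YearsSinceLastPromotion = 9 and YearsAtCompany = 10), so treat them as a plausible sketch rather than the exact code:

# Hedged sketch: career-progression metrics inferred from the report's output
employee_fe <- employeenew %>%
  mutate(
    StagnationRate = YearsSinceLastPromotion / (YearsAtCompany + 1),
    RoleStability  = YearsInMostRecentRole / pmax(YearsAtCompany, 1)  # pmax guard against 0 tenure is an added assumption
  )
# StagCat, RSCat and GrowthCat are then derived by binning these two metrics;
# the thresholds used are not shown in the report.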

  • The distribution of employees across all categories remains relatively consistent across departments.
  • This categorization helps us identify and recommend actions for different employees.
  • The “Needs Review” category stands out with the highest number of employees, highlighting a potential area for improvement in career growth opportunities for a large portion of the workforce.

EMA on Performance Rating Features:

  1. Employee Rating Lifecycle: How Ratings Vary Across Employees’ Tenure Years
  • Training opportunities are offered approximately twice a year to every employee, while on average only about one is taken.
  • Self-rating remains consistently stable at a score of 4, the highest among all the metrics.
  • Job Satisfaction and Environment Satisfaction remain relatively stable throughout the lifecycle but dip after year 9, suggesting a lack of improvement over time and possibly less challenging work for employees with longer tenures.
  • Work-Life Balance ratings stay stable across the employee lifecycle but rise toward the end of year 10, indicating an improvement in employee work-life balance.
  • Relationship Satisfaction and Manager Rating remain steady with slight fluctuations.

  2. Exponential Moving Average (EMA) Performance Rating Feature:

EMA is applied to all employees’ performance ratings to generate a final, smoothed rating for each employee, which is then used for modeling.
This smooths out fluctuations in ratings over time, providing a more consistent and reliable assessment of performance, while weighting recent reviews more heavily so that trends and changes in performance are better captured.

# Feature engineering on performance df using EMA and work satisfaction

columns <- c(
  "EnvironmentSatisfaction", "JobSatisfaction", "RelationshipSatisfaction",
  "TrainingOpportunitiesWithinYear", "TrainingOpportunitiesTaken",
  "WorkLifeBalance", "SelfRating", "ManagerRating", "OverallSatisfaction"
)

# Apply a dynamic EMA per employee: each series is reversed into chronological
# order, smoothed with TTR::EMA, then reversed back so the smoothed value
# lands on the most recent review.
EMA_perf <- employee_perf_check %>%
  arrange(EmployeeID, desc(ReviewDate)) %>%
  group_by(EmployeeID) %>%
  mutate(across(
    all_of(columns),
    # Use a 10-review window where available; otherwise shrink the window
    # to the number of reviews the employee actually has.
    ~ if (n() >= 10) rev(EMA(rev(.x), n = 10)) else rev(EMA(rev(.x), n = n())),
    .names = "{.col}_EMA"
  )) %>%
  ungroup() %>%
  # Keep only the rows (latest review per employee) where every EMA is defined
  filter(if_all(ends_with("_EMA"), ~ !is.na(.))) %>%
  select(EmployeeID, ReviewDate, ends_with("_EMA"))

# Check final unique numbers of employees
n_distinct(EMA_perf$EmployeeID)
## [1] 1359
# Work Satisfaction Creation
EMA_final <- EMA_perf %>% mutate(WorkSatisfaction = WorkLifeBalance_EMA * JobSatisfaction_EMA)

str(EMA_final)
## tibble [1,359 × 12] (S3: tbl_df/tbl/data.frame)
##  $ EmployeeID                         : chr [1:1359] "001A-8F88" "005C-E0FB" "00A3-2445" "00B0-F199" ...
##  $ ReviewDate                         : Date[1:1359], format: "2022-12-31" "2022-06-17" ...
##  $ EnvironmentSatisfaction_EMA        : num [1:1359] 3.83 3.33 4.12 3 3.75 ...
##  $ JobSatisfaction_EMA                : num [1:1359] 3.54 3.67 3.5 4 4.25 ...
##  $ RelationshipSatisfaction_EMA       : num [1:1359] 3.53 4 3.62 3 3.5 ...
##  $ TrainingOpportunitiesWithinYear_EMA: num [1:1359] 2.07 1.67 2.12 1 2 ...
##  $ TrainingOpportunitiesTaken_EMA     : num [1:1359] 1.02 1 0.75 1 1.25 ...
##  $ WorkLifeBalance_EMA                : num [1:1359] 3.5 3.67 3.75 4 4.25 ...
##  $ SelfRating_EMA                     : num [1:1359] 4 3.67 3.75 4 3.75 ...
##  $ ManagerRating_EMA                  : num [1:1359] 3.54 3.33 3.38 3 3 ...
##  $ OverallSatisfaction_EMA            : num [1:1359] 3.6 3.67 3.75 3.5 3.94 ...
##  $ WorkSatisfaction                   : num [1:1359] 12.4 13.4 13.1 16 18.1 ...

The EMA ratings are then left-joined to the cleaned unique employee dataset.

# Join the EMA features to the feature-engineered employee dataset (employee_fe)
employee_combined <- employee_fe %>%
  left_join(EMA_final, by = "EmployeeID")

Mean imputation is then performed for employees with missing EMA ratings.
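The imputation chunk is not shown; a dplyr sketch of column-mean imputation over the EMA features (the exact set of imputed columns is an assumption) is:

# Hedged sketch: fill missing EMA features and WorkSatisfaction with column means
employee_combined <- employee_combined %>%
  mutate(across(
    c(ends_with("_EMA"), WorkSatisfaction),
    ~ replace(.x, is.na(.x), mean(.x, na.rm = TRUE))
  ))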

## Rows: 1,470
## Columns: 89
## $ EmployeeID                                  <chr> "3012-1A41", "CBCB-9C9D", …
## $ FirstName                                   <chr> "Leonelle", "Leonerd", "Ah…
## $ LastName                                    <chr> "Simco", "Aland", "Sykes",…
## $ Gender                                      <chr> "Female", "Male", "Male", …
## $ Age                                         <int> 30, 38, 43, 39, 29, 34, 42…
## $ BusinessTravel                              <chr> "Some Travel", "Some Trave…
## $ Department                                  <chr> "Sales", "Sales", "Human R…
## $ DistanceFromHome..KM.                       <int> 27, 23, 29, 12, 29, 30, 45…
## $ State                                       <chr> "IL", "CA", "CA", "IL", "C…
## $ Ethnicity                                   <chr> "White", "White", "Asian o…
## $ Education                                   <int> 5, 4, 4, 3, 2, 2, 3, 2, 4,…
## $ EducationField                              <chr> "Marketing", "Marketing", …
## $ JobRole                                     <chr> "Sales Executive", "Sales …
## $ MaritalStatus                               <chr> "Divorced", "Single", "Mar…
## $ Salary                                      <int> 102059, 157718, 309964, 29…
## $ StockOptionLevel                            <int> 1, 0, 1, 0, 0, 1, 1, 1, 1,…
## $ OverTime                                    <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ HireDate                                    <date> 2012-01-03, 2012-01-04, 2…
## $ Attrition                                   <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ YearsAtCompany                              <int> 10, 10, 10, 10, 6, 10, 10,…
## $ YearsInMostRecentRole                       <int> 4, 6, 6, 10, 1, 3, 2, 3, 5…
## $ YearsSinceLastPromotion                     <int> 9, 10, 10, 10, 1, 7, 6, 4,…
## $ YearsWithCurrManager                        <int> 7, 0, 8, 0, 6, 9, 6, 6, 2,…
## $ GenderFemale                                <dbl> 1, 0, 0, 0, 1, 0, 1, 1, 0,…
## $ GenderMale                                  <dbl> 0, 1, 1, 0, 0, 1, 0, 0, 1,…
## $ `GenderNon-Binary`                          <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ `GenderPrefer Not To Say`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelFrequent Traveller`          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelNo Travel `                  <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ `BusinessTravelSome Travel`                 <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ `DepartmentHuman Resources`                 <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0,…
## $ DepartmentSales                             <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ DepartmentTechnology                        <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ StateCA                                     <dbl> 0, 1, 1, 0, 1, 0, 0, 1, 0,…
## $ StateIL                                     <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1,…
## $ StateNY                                     <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ `EthnicityAmerican Indian or Alaska Native` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAsian or Asian American`          <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityBlack or African American`        <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1,…
## $ `EthnicityMixed or multiple ethnic groups`  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ `EthnicityNative Hawaiian `                 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ `EthnicityOther `                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EthnicityWhite                              <dbl> 1, 1, 0, 1, 1, 0, 0, 0, 0,…
## $ `EducationFieldBusiness Studies`            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldComputer Science`            <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ EducationFieldEconomics                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldHuman Resources`             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldInformation Systems`         <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ EducationFieldMarketing                     <dbl> 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ `EducationFieldMarketing `                  <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ EducationFieldOther                         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ `EducationFieldTechnical Degree`            <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ `JobRoleAnalytics Manager`                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleData Scientist`                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleEngineering Manager`                <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ `JobRoleHR Business Partner`                <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Executive`                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Manager`                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleMachine Learning Engineer`          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleManager                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleRecruiter                            <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ `JobRoleSales Executive`                    <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ `JobRoleSales Representative`               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSenior Software Engineer`           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSoftware Engineer`                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusDivorced                       <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 0,…
## $ MaritalStatusMarried                        <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 1,…
## $ MaritalStatusSingle                         <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0,…
## $ AgeNormalized                               <dbl> 0.3636364, 0.6060606, 0.75…
## $ DistanceFromHomeNormalized                  <dbl> 0.59090909, 0.50000000, 0.…
## $ SalaryNormalized                            <dbl> 0.15502917, 0.26068065, 0.…
## $ SalaryRange                                 <fct> 100k-150k, >150k, >150k, >…
## $ EducationLevel                              <fct> Doctorate, Masters, Master…
## $ JobRoleCat                                  <chr> "Executive", "Executive", …
## $ StagnationRate                              <dbl> 0.8181818, 0.9090909, 0.90…
## $ RoleStability                               <dbl> 0.4000000, 0.6000000, 0.60…
## $ StagCat                                     <chr> "Stagnation Risk", "Stagna…
## $ RSCat                                       <chr> "Moderate Role Stability",…
## $ GrowthCat                                   <chr> "Needs Review", "Needs Rev…
## $ EnvironmentSatisfaction_EMA                 <dbl> 3.555556, 3.888889, 3.8888…
## $ JobSatisfaction_EMA                         <dbl> 3.666667, 3.333333, 3.5555…
## $ RelationshipSatisfaction_EMA                <dbl> 3.333333, 3.777778, 2.8888…
## $ TrainingOpportunitiesWithinYear_EMA         <dbl> 2.000000, 2.444444, 2.1111…
## $ TrainingOpportunitiesTaken_EMA              <dbl> 0.3333333, 0.7777778, 0.66…
## $ WorkLifeBalance_EMA                         <dbl> 3.333333, 3.000000, 3.7777…
## $ SelfRating_EMA                              <dbl> 4.111111, 4.222222, 3.5555…
## $ ManagerRating_EMA                           <dbl> 3.555556, 3.888889, 3.0000…
## $ OverallSatisfaction_EMA                     <dbl> 3.472222, 3.500000, 3.5277…
## $ WorkSatisfaction                            <dbl> 12.222222, 10.000000, 13.4…

Sample Imbalance and Split between Training and Test Set

Assess class imbalance and ensure that it is maintained when splitting the data into training and testing sets.
If necessary, apply data resampling to address imbalance, and re-train the model using the resampled training data.

Sample Imbalance

Check for sample imbalance on Attrition:
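The check itself is not echoed; a minimal sketch that reproduces the counts, percentages, and ratio below (assuming the combined dataframe from the previous step) is:

# Hedged sketch: class counts, percentages, and imbalance ratio for Attrition
attr_tab <- table(employee_combined$Attrition)
attr_tab
prop.table(attr_tab) * 100
paste("Class Imbalance Ratio:", round(attr_tab[["0"]] / attr_tab[["1"]], 1))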

## 
##    0    1 
## 1233  237
## 
##        0        1 
## 83.87755 16.12245
## [1] "Class Imbalance Ratio: 5.2"

Due to the high class imbalance ratio, SMOTE (Synthetic Minority Over-sampling Technique) resampling may be necessary.

Splitting of Training vs Test Set

  1. Perform final data cleanup by retaining only the numerical feature columns.
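The cleanup code is hidden in the knit; a one-line dplyr sketch (the object name is taken from the histogram code shown later) is:

# Hedged sketch: keep only numeric columns for modelling
employee_fe_numeric <- employee_combined %>% select(where(is.numeric))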
## Rows: 1,470
## Columns: 71
## $ Age                                         <int> 30, 38, 43, 39, 29, 34, 42…
## $ DistanceFromHome..KM.                       <int> 27, 23, 29, 12, 29, 30, 45…
## $ Education                                   <int> 5, 4, 4, 3, 2, 2, 3, 2, 4,…
## $ Salary                                      <int> 102059, 157718, 309964, 29…
## $ StockOptionLevel                            <int> 1, 0, 1, 0, 0, 1, 1, 1, 1,…
## $ OverTime                                    <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0,…
## $ Attrition                                   <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ YearsAtCompany                              <int> 10, 10, 10, 10, 6, 10, 10,…
## $ YearsInMostRecentRole                       <int> 4, 6, 6, 10, 1, 3, 2, 3, 5…
## $ YearsSinceLastPromotion                     <int> 9, 10, 10, 10, 1, 7, 6, 4,…
## $ YearsWithCurrManager                        <int> 7, 0, 8, 0, 6, 9, 6, 6, 2,…
## $ GenderFemale                                <dbl> 1, 0, 0, 0, 1, 0, 1, 1, 0,…
## $ GenderMale                                  <dbl> 0, 1, 1, 0, 0, 1, 0, 0, 1,…
## $ `GenderNon-Binary`                          <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ `GenderPrefer Not To Say`                   <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelFrequent Traveller`          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `BusinessTravelNo Travel `                  <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ `BusinessTravelSome Travel`                 <dbl> 1, 1, 1, 1, 1, 1, 0, 1, 1,…
## $ `DepartmentHuman Resources`                 <dbl> 0, 0, 1, 0, 1, 0, 0, 0, 0,…
## $ DepartmentSales                             <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ DepartmentTechnology                        <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ StateCA                                     <dbl> 0, 1, 1, 0, 1, 0, 0, 1, 0,…
## $ StateIL                                     <dbl> 1, 0, 0, 1, 0, 0, 0, 0, 1,…
## $ StateNY                                     <dbl> 0, 0, 0, 0, 0, 1, 1, 0, 0,…
## $ `EthnicityAmerican Indian or Alaska Native` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityAsian or Asian American`          <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ `EthnicityBlack or African American`        <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 1,…
## $ `EthnicityMixed or multiple ethnic groups`  <dbl> 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ `EthnicityNative Hawaiian `                 <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ `EthnicityOther `                           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ EthnicityWhite                              <dbl> 1, 1, 0, 1, 1, 0, 0, 0, 0,…
## $ `EducationFieldBusiness Studies`            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldComputer Science`            <dbl> 0, 0, 0, 1, 0, 0, 0, 0, 0,…
## $ EducationFieldEconomics                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldHuman Resources`             <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `EducationFieldInformation Systems`         <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0,…
## $ EducationFieldMarketing                     <dbl> 1, 1, 0, 0, 0, 1, 0, 0, 0,…
## $ `EducationFieldMarketing `                  <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 1,…
## $ EducationFieldOther                         <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0,…
## $ `EducationFieldTechnical Degree`            <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ `JobRoleAnalytics Manager`                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleData Scientist`                     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleEngineering Manager`                <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0,…
## $ `JobRoleHR Business Partner`                <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Executive`                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleHR Manager`                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleMachine Learning Engineer`          <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleManager                              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ JobRoleRecruiter                            <dbl> 0, 0, 0, 0, 1, 0, 0, 0, 0,…
## $ `JobRoleSales Executive`                    <dbl> 1, 1, 0, 0, 0, 1, 0, 1, 1,…
## $ `JobRoleSales Representative`               <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSenior Software Engineer`           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ `JobRoleSoftware Engineer`                  <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ MaritalStatusDivorced                       <dbl> 1, 0, 0, 0, 0, 1, 0, 1, 0,…
## $ MaritalStatusMarried                        <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 1,…
## $ MaritalStatusSingle                         <dbl> 0, 1, 0, 0, 1, 0, 0, 0, 0,…
## $ AgeNormalized                               <dbl> 0.3636364, 0.6060606, 0.75…
## $ DistanceFromHomeNormalized                  <dbl> 0.59090909, 0.50000000, 0.…
## $ SalaryNormalized                            <dbl> 0.15502917, 0.26068065, 0.…
## $ StagnationRate                              <dbl> 0.8181818, 0.9090909, 0.90…
## $ RoleStability                               <dbl> 0.4000000, 0.6000000, 0.60…
## $ EnvironmentSatisfaction_EMA                 <dbl> 3.555556, 3.888889, 3.8888…
## $ JobSatisfaction_EMA                         <dbl> 3.666667, 3.333333, 3.5555…
## $ RelationshipSatisfaction_EMA                <dbl> 3.333333, 3.777778, 2.8888…
## $ TrainingOpportunitiesWithinYear_EMA         <dbl> 2.000000, 2.444444, 2.1111…
## $ TrainingOpportunitiesTaken_EMA              <dbl> 0.3333333, 0.7777778, 0.66…
## $ WorkLifeBalance_EMA                         <dbl> 3.333333, 3.000000, 3.7777…
## $ SelfRating_EMA                              <dbl> 4.111111, 4.222222, 3.5555…
## $ ManagerRating_EMA                           <dbl> 3.555556, 3.888889, 3.0000…
## $ OverallSatisfaction_EMA                     <dbl> 3.472222, 3.500000, 3.5277…
## $ WorkSatisfaction                            <dbl> 12.222222, 10.000000, 13.4…

ii. Split the data into a training set and a test set, using a 70:30 ratio.
The class imbalance on Attrition remains in both the training and testing sets.
The baseline accuracy is 83.88%, which is the accuracy achieved by always predicting the majority class (Attrition = No).

## Class Distribution in Training Set:
## 
##        0        1 
## 83.86783 16.13217
## 
## Class Distribution in Testing Set:
## 
##        0        1 
## 83.90023 16.09977
## [1] "Baseline Accuracy: 83.88 %"

Data Resampling with SMOTE:

Apply SMOTE to oversample the attrition class, addressing the class imbalance by generating synthetic data points for the minority class.

## 
## Class Distribution in Balanced Training Set:
## 
##   0   1 
## 863 498

The testing data is left unresampled, retaining the original imbalanced class ratio, and is now ready for model evaluation.
Initial model training is also performed on the original training set, to assess whether the model trained on the balanced data performs better.
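The resampling call is not shown; a hedged sketch using smotefamily::SMOTE (K = 5 is the package default; the exact arguments are assumptions). Note that SMOTE() returns the balanced data with the target stored in a column named class, which the later chunks rely on:

# Oversample the minority (Attrition = 1) class with synthetic points
feature_cols <- setdiff(colnames(train), "Attrition")
smote_out <- SMOTE(X = train[, feature_cols],  # numeric predictors only
                   target = train$Attrition,
                   K = 5)                      # neighbours used for synthesis
train_bal <- smote_out$data                    # target column is named "class"

cat("\nClass Distribution in Balanced Training Set:\n")
print(table(train_bal$class))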

Check for class imbalance on YearsAtCompany.

# Visualizing the target variable for regression (Dataset is evenly distributed, hence no data imbalance)
hist(employee_fe_numeric$YearsAtCompany, breaks = 30, main = "Distribution of Target Variable", xlab = "Target", col = "lightblue")

Since the values of the continuous target YearsAtCompany are fairly evenly distributed, there is no need to address imbalance for the regression models.

Correlation Analysis for feature selection (Classification)

  • Measures the degree to which two variables move in relation to each other
  • Pearson correlation coefficient (r)
  • Determines how strongly each feature relates to the target variable
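The selection code itself is not shown in the report; a hedged sketch of how the correlation filter and the redundancy drop below might look (the cut-off values 0.1 and 0.8 are assumptions, not the thresholds actually used):

# Correlation of every candidate feature with the target
cors <- cor(train[, setdiff(colnames(train), "Attrition")], train$Attrition)
selected_features_classi <- rownames(cors)[abs(cors) > 0.1]

# Drop one feature from any highly inter-correlated pair (caret::findCorrelation)
cor_matrix <- cor(train[, selected_features_classi])
drop_idx <- findCorrelation(cor_matrix, cutoff = 0.8)
final_features_classi <- if (length(drop_idx) > 0)
  selected_features_classi[-drop_idx] else selected_features_classi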

Original Sample

## Features that have strong correlations to Attrition:
## [1] "OverTime"                "YearsAtCompany"         
## [3] "YearsInMostRecentRole"   "YearsSinceLastPromotion"
## [5] "YearsWithCurrManager"    "StagnationRate"         
## [7] "RoleStability"
## 
## Redundant features dropped:
## [1] "YearsSinceLastPromotion"
## 
## Final Features:
## [1] "OverTime"              "YearsAtCompany"        "YearsInMostRecentRole"
## [4] "YearsWithCurrManager"  "StagnationRate"        "RoleStability"
# subset of data using only the final feature and target variable
ori_classification_data <- train[, final_features_classi]

# Add class(Attrition) as label to the classification_data
ori_classification_data$label <- as.factor(train$Attrition)

ori_X_train_classi <- as.matrix(ori_classification_data[, -which(colnames(ori_classification_data) == "label")])
ori_Y_train_classi <- ori_classification_data$label
ori_X_test_classi  <- as.matrix(test[, -which(colnames(test) == "Attrition")])
ori_Y_test_classi <- as.factor(test$Attrition)

Data Resampling with SMOTE

## Features that have strong correlations to Attrition:
## [1] "OverTime"                "YearsAtCompany"         
## [3] "YearsInMostRecentRole"   "YearsSinceLastPromotion"
## [5] "YearsWithCurrManager"    "StagnationRate"         
## [7] "RoleStability"
# Calculate correlation matrix for selected features
cor_matrix_classi <- cor(train_bal[, selected_features_classi])

# Visualize the correlation matrix
library(grid)
library(corrplot)
par(mfrow = c(1,1), mar = c(8,5,3,2))
corrplot(cor_matrix_classi, method = "color", type = "lower", order = "hclust", tl.cex = 0.8)

title(main = "Correlation Matrix of selected features", col.main = "black")

## 
## Redundant features dropped:
## [1] "YearsSinceLastPromotion"
## 
## Final Features:
## [1] "OverTime"              "YearsAtCompany"        "YearsInMostRecentRole"
## [4] "YearsWithCurrManager"  "StagnationRate"        "RoleStability"
# Split data for SMOTE Balanced Data
classification_data <- train_bal[, final_features_classi]
classification_data$label <- as.factor(train_bal$class)

X_train_classi <- as.matrix(classification_data[, -which(colnames(classification_data) == "label")])
Y_train_classi <- classification_data$label
X_test_classi  <- as.matrix(test[, -which(colnames(test) == "Attrition")])
Y_test_classi <- as.factor(test$Attrition)

Feature Selection

Examining the correlation between each feature and the target variable (Attrition),
the following six features will be fed to the model for training and testing:

  • OverTime
  • YearsAtCompany
  • YearsInMostRecentRole
  • YearsWithCurrManager
  • StagnationRate
  • RoleStability

Selection Criteria:

  • Include features with a strong correlation to the target variable
  • Remove features with high multicollinearity

Correlation Analysis for feature selection (Regression)

  • Measures the degree to which two variables move in relation to each other
  • Pearson correlation coefficient (r)
  • Determines how strongly each feature relates to the target variable
## Features that have strong correlations to YearsAtCompany:
## [1] "Age"                     "YearsInMostRecentRole"  
## [3] "YearsSinceLastPromotion" "YearsWithCurrManager"   
## [5] "EthnicityWhite"          "AgeNormalized"          
## [7] "StagnationRate"

## 
## Redundant features dropped:
## [1] "YearsSinceLastPromotion" "Age"
## 
## Final Features:
## [1] "YearsInMostRecentRole" "YearsWithCurrManager"  "EthnicityWhite"       
## [4] "AgeNormalized"         "StagnationRate"

Feature Selection
Examining the correlation between each feature and the target variable (YearsAtCompany),

the following 5 features will be fed to the model for training and testing:

  • YearsInMostRecentRole
  • YearsWithCurrManager
  • EthnicityWhite
  • AgeNormalized
  • StagnationRate

Selection Criteria:

  • Include features with a strong correlation to the target variable
  • Remove features with high multicollinearity
# Split the data into training and testing sets
# Train : 70%
# Test : 30%

# subset of data using only the final feature and target variable
regression_data <- employee_fe_numeric[, final_features]

# Add YearsAtCompany as label to the regression_data
regression_data$label <- employee_fe_numeric$YearsAtCompany

set.seed(123)
trainIndex <- createDataPartition(regression_data$label, p = .7,
                                  list = FALSE,
                                  times = 1)

trainData <- regression_data[trainIndex,]
testData  <- regression_data[-trainIndex,]


X_train <- as.matrix(trainData[, -which(colnames(trainData) == "label")])
Y_train <- trainData$label
X_test  <- as.matrix(testData[, -which(colnames(testData) == "label")])
Y_test <- testData$label

Model Selection

Classification Models

Objective: Predict employee attrition.
Attrition will be used as the target variable.

Models:

  • K-Nearest Neighbors (KNN)
  • Random Forest
  • Support Vector Machine (SVM)

Classification Model 1: K-Nearest Neighbors (KNN)

Original Sample
# Fit the model
ori_knn_model <- train(label ~., data = ori_classification_data, method = "knn")

# Summarize the model
summary(ori_knn_model)
##             Length Class      Mode     
## learn       2      -none-     list     
## k           1      -none-     numeric  
## theDots     0      -none-     list     
## xNames      6      -none-     character
## problemType 1      -none-     character
## tuneValue   1      data.frame list     
## obsLevels   2      -none-     character
## param       0      -none-     list
# Plot the model
plot(ori_knn_model)

# Predict on test data
ori_knn_predictions <- predict(ori_knn_model, ori_X_test_classi)
Data Resampling with SMOTE
# Fit the model
knn_model <- train(label ~., data = classification_data, method = "knn")

# Summarize the model
summary(knn_model)
##             Length Class      Mode     
## learn       2      -none-     list     
## k           1      -none-     numeric  
## theDots     0      -none-     list     
## xNames      6      -none-     character
## problemType 1      -none-     character
## tuneValue   1      data.frame list     
## obsLevels   2      -none-     character
## param       0      -none-     list
# Plot the model
plot(knn_model)

# Predict on test data
knn_predictions <- predict(knn_model, X_test_classi)

Classification Model 2: Random Forest (RF)

Original Sample
# Fit the model
ori_rf_model <- train(label ~ ., data = ori_classification_data, method = "rf")

# Summarize the model
summary(ori_rf_model)
##                 Length Class      Mode     
## call               4   -none-     call     
## type               1   -none-     character
## predicted       1029   factor     numeric  
## err.rate        1500   -none-     numeric  
## confusion          6   -none-     numeric  
## votes           2058   matrix     numeric  
## oob.times       1029   -none-     numeric  
## classes            2   -none-     character
## importance         6   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            14   -none-     list     
## y               1029   factor     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames             6   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          2   -none-     character
## param              0   -none-     list
# Predict on test data
ori_rf_predictions <- predict(ori_rf_model, ori_X_test_classi)
Data Resampling with SMOTE
# Fit the model
rf_model <- train(label ~ ., data = classification_data, method = "rf")

# Summarize the model
summary(rf_model)
##                 Length Class      Mode     
## call               4   -none-     call     
## type               1   -none-     character
## predicted       1361   factor     numeric  
## err.rate        1500   -none-     numeric  
## confusion          6   -none-     numeric  
## votes           2722   matrix     numeric  
## oob.times       1361   -none-     numeric  
## classes            2   -none-     character
## importance         6   -none-     numeric  
## importanceSD       0   -none-     NULL     
## localImportance    0   -none-     NULL     
## proximity          0   -none-     NULL     
## ntree              1   -none-     numeric  
## mtry               1   -none-     numeric  
## forest            14   -none-     list     
## y               1361   factor     numeric  
## test               0   -none-     NULL     
## inbag              0   -none-     NULL     
## xNames             6   -none-     character
## problemType        1   -none-     character
## tuneValue          1   data.frame list     
## obsLevels          2   -none-     character
## param              0   -none-     list
# Predict on test data
rf_predictions <- predict(rf_model, X_test_classi)

Classification Model 3: Support Vector Machine (SVM)

Original Sample
# Fit the model
ori_svm_model <- train(label ~., data = ori_classification_data, method = "svmRadial")

# Summarize the model
summary(ori_svm_model)
## Length  Class   Mode 
##      1   ksvm     S4
# Predict on test data
ori_svm_predictions <- predict(ori_svm_model, ori_X_test_classi)
Data Resampling with SMOTE
# Fit the model
svm_model <- train(label ~., data = classification_data, method = "svmRadial")

# Summarize the model
summary(svm_model)
## Length  Class   Mode 
##      1   ksvm     S4
# Predict on test data
svm_predictions <- predict(svm_model, X_test_classi)

Regression Models

Objective: Predict how long (in years) an employee stays in the company.
YearsAtCompany will be used as the target variable.

Regression Model 1: Linear Regression

# Fit the model
linear_regression_model <- lm(label ~ ., data = trainData)

# Summarize the model
summary(linear_regression_model)
## 
## Call:
## lm(formula = label ~ ., data = trainData)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2703 -1.1555 -0.3113  0.9979  7.1938 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            1.13876    0.14787   7.701 3.18e-14 ***
## YearsInMostRecentRole  0.34996    0.03105  11.272  < 2e-16 ***
## YearsWithCurrManager   0.47595    0.02665  17.859  < 2e-16 ***
## EthnicityWhite        -0.90458    0.11617  -7.787 1.68e-14 ***
## AgeNormalized          3.30280    0.25644  12.880  < 2e-16 ***
## StagnationRate         2.01496    0.24513   8.220 6.11e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.781 on 1024 degrees of freedom
## Multiple R-squared:  0.7125, Adjusted R-squared:  0.7111 
## F-statistic: 507.5 on 5 and 1024 DF,  p-value: < 2.2e-16
# Predict on test data
linear_regression_prediction <- predict(linear_regression_model, testData)

Regression Model 2 : Decision Tree Regression

#import rpart library
library(rpart)

# Train the decision tree regression model
decision_tree_model <- rpart(label ~ .,  data = trainData, method = "anova")

# Summarize the model
summary(decision_tree_model)
## Call:
## rpart(formula = label ~ ., data = trainData, method = "anova")
##   n= 1030 
## 
##            CP nsplit rel error    xerror       xstd
## 1  0.37943504      0 1.0000000 1.0026023 0.02625433
## 2  0.13737903      1 0.6205650 0.6238024 0.02528469
## 3  0.11154286      2 0.4831859 0.5107801 0.02224078
## 4  0.04771225      3 0.3716431 0.3987939 0.02330235
## 5  0.02736749      4 0.3239308 0.3361217 0.02004071
## 6  0.02008641      5 0.2965633 0.3322190 0.02033671
## 7  0.01801957      7 0.2563905 0.3167477 0.01985053
## 8  0.01454200      8 0.2383710 0.2840977 0.01882688
## 9  0.01103456      9 0.2238290 0.2607595 0.01684891
## 10 0.01000000     10 0.2127944 0.2548949 0.01655874
## 
## Variable importance
##  YearsWithCurrManager YearsInMostRecentRole        StagnationRate 
##                    30                    25                    20 
##         AgeNormalized        EthnicityWhite 
##                    19                     6 
## 
## Node number 1: 1030 observations,    complexity param=0.379435
##   mean=4.572816, MSE=10.96703 
##   left son=2 (540 obs) right son=3 (490 obs)
##   Primary splits:
##       YearsWithCurrManager  < 1.5        to the left,  improve=0.37943500, (0 missing)
##       YearsInMostRecentRole < 2.5        to the left,  improve=0.36188020, (0 missing)
##       StagnationRate        < 0.04545455 to the left,  improve=0.35262460, (0 missing)
##       AgeNormalized         < 0.1969697  to the left,  improve=0.31202770, (0 missing)
##       EthnicityWhite        < 0.5        to the right, improve=0.09852108, (0 missing)
##   Surrogate splits:
##       YearsInMostRecentRole < 1.5        to the left,  agree=0.729, adj=0.431, (0 split)
##       StagnationRate        < 0.5227273  to the left,  agree=0.718, adj=0.408, (0 split)
##       AgeNormalized         < 0.1969697  to the left,  agree=0.657, adj=0.280, (0 split)
##       EthnicityWhite        < 0.5        to the right, agree=0.618, adj=0.198, (0 split)
## 
## Node number 2: 540 observations,    complexity param=0.137379
##   mean=2.62963, MSE=7.470233 
##   left son=4 (395 obs) right son=5 (145 obs)
##   Primary splits:
##       YearsInMostRecentRole < 1.5        to the left,  improve=0.38469690, (0 missing)
##       StagnationRate        < 0.05       to the left,  improve=0.37149720, (0 missing)
##       AgeNormalized         < 0.1363636  to the left,  improve=0.19967400, (0 missing)
##       EthnicityWhite        < 0.5        to the right, improve=0.13088220, (0 missing)
##       YearsWithCurrManager  < 0.5        to the left,  improve=0.06312805, (0 missing)
##   Surrogate splits:
##       StagnationRate < 0.6306818  to the left,  agree=0.863, adj=0.49, (0 split)
## 
## Node number 3: 490 observations,    complexity param=0.1115429
##   mean=6.714286, MSE=6.073469 
##   left son=6 (223 obs) right son=7 (267 obs)
##   Primary splits:
##       AgeNormalized         < 0.2878788  to the left,  improve=0.423384600, (0 missing)
##       YearsWithCurrManager  < 4.5        to the left,  improve=0.230631500, (0 missing)
##       YearsInMostRecentRole < 4.5        to the left,  improve=0.202520900, (0 missing)
##       StagnationRate        < 0.8819444  to the left,  improve=0.140402200, (0 missing)
##       EthnicityWhite        < 0.5        to the right, improve=0.004421336, (0 missing)
##   Surrogate splits:
##       YearsWithCurrManager  < 4.5        to the left,  agree=0.645, adj=0.220, (0 split)
##       YearsInMostRecentRole < 4.5        to the left,  agree=0.612, adj=0.148, (0 split)
##       StagnationRate        < 0.8819444  to the left,  agree=0.594, adj=0.108, (0 split)
## 
## Node number 4: 395 observations,    complexity param=0.04771225
##   mean=1.602532, MSE=4.102778 
##   left son=8 (197 obs) right son=9 (198 obs)
##   Primary splits:
##       StagnationRate        < 0.05       to the left,  improve=0.33256840, (0 missing)
##       YearsWithCurrManager  < 0.5        to the left,  improve=0.13852670, (0 missing)
##       AgeNormalized         < 0.1060606  to the left,  improve=0.12773670, (0 missing)
##       EthnicityWhite        < 0.5        to the right, improve=0.09200418, (0 missing)
##       YearsInMostRecentRole < 0.5        to the left,  improve=0.07483577, (0 missing)
##   Surrogate splits:
##       YearsInMostRecentRole < 0.5        to the left,  agree=0.785, adj=0.569, (0 split)
##       YearsWithCurrManager  < 0.5        to the left,  agree=0.742, adj=0.482, (0 split)
##       AgeNormalized         < 0.1060606  to the left,  agree=0.630, adj=0.259, (0 split)
##       EthnicityWhite        < 0.5        to the right, agree=0.542, adj=0.081, (0 split)
## 
## Node number 5: 145 observations,    complexity param=0.02736749
##   mean=5.427586, MSE=5.941308 
##   left son=10 (47 obs) right son=11 (98 obs)
##   Primary splits:
##       AgeNormalized         < 0.1969697  to the left,  improve=0.358848500, (0 missing)
##       YearsInMostRecentRole < 5.5        to the left,  improve=0.326086700, (0 missing)
##       StagnationRate        < 0.8660714  to the left,  improve=0.220893100, (0 missing)
##       EthnicityWhite        < 0.5        to the right, improve=0.024278930, (0 missing)
##       YearsWithCurrManager  < 0.5        to the right, improve=0.008307433, (0 missing)
## 
## Node number 6: 223 observations,    complexity param=0.01801957
##   mean=4.959641, MSE=2.415412 
##   left son=12 (94 obs) right son=13 (129 obs)
##   Primary splits:
##       AgeNormalized         < 0.1969697  to the left,  improve=0.37789800, (0 missing)
##       YearsInMostRecentRole < 3.5        to the left,  improve=0.17782810, (0 missing)
##       YearsWithCurrManager  < 3.5        to the left,  improve=0.16222870, (0 missing)
##       StagnationRate        < 0.8452381  to the left,  improve=0.13976110, (0 missing)
##       EthnicityWhite        < 0.5        to the right, improve=0.02247327, (0 missing)
##   Surrogate splits:
##       YearsWithCurrManager < 2.5        to the left,  agree=0.632, adj=0.128, (0 split)
##       EthnicityWhite       < 0.5        to the right, agree=0.587, adj=0.021, (0 split)
## 
## Node number 7: 267 observations,    complexity param=0.02008641
##   mean=8.179775, MSE=4.409628 
##   left son=14 (99 obs) right son=15 (168 obs)
##   Primary splits:
##       YearsWithCurrManager  < 3.5        to the left,  improve=0.155517000, (0 missing)
##       YearsInMostRecentRole < 3.5        to the left,  improve=0.142482400, (0 missing)
##       StagnationRate        < 0.8090909  to the left,  improve=0.094389490, (0 missing)
##       AgeNormalized         < 0.4090909  to the left,  improve=0.077354000, (0 missing)
##       EthnicityWhite        < 0.5        to the right, improve=0.002488476, (0 missing)
##   Surrogate splits:
##       StagnationRate < 0.04545455 to the left,  agree=0.64, adj=0.03, (0 split)
## 
## Node number 8: 197 observations
##   mean=0.4314721, MSE=0.7529181 
## 
## Node number 9: 198 observations,    complexity param=0.014542
##   mean=2.767677, MSE=4.713703 
##   left son=18 (143 obs) right son=19 (55 obs)
##   Primary splits:
##       EthnicityWhite        < 0.5        to the right, improve=0.1760041000, (0 missing)
##       StagnationRate        < 0.5277778  to the left,  improve=0.1695346000, (0 missing)
##       AgeNormalized         < 0.1363636  to the left,  improve=0.1233405000, (0 missing)
##       YearsInMostRecentRole < 0.5        to the right, improve=0.0236862400, (0 missing)
##       YearsWithCurrManager  < 0.5        to the left,  improve=0.0002690429, (0 missing)
##   Surrogate splits:
##       StagnationRate < 0.2111111  to the right, agree=0.747, adj=0.091, (0 split)
## 
## Node number 10: 47 observations
##   mean=3.319149, MSE=0.9406971 
## 
## Node number 11: 98 observations,    complexity param=0.01103456
##   mean=6.438776, MSE=5.185027 
##   left son=22 (67 obs) right son=23 (31 obs)
##   Primary splits:
##       YearsInMostRecentRole < 5.5        to the left,  improve=2.453038e-01, (0 missing)
##       StagnationRate        < 0.8819444  to the left,  improve=1.882079e-01, (0 missing)
##       AgeNormalized         < 0.6969697  to the left,  improve=1.461969e-01, (0 missing)
##       EthnicityWhite        < 0.5        to the right, improve=5.600144e-03, (0 missing)
##       YearsWithCurrManager  < 0.5        to the left,  improve=1.435002e-06, (0 missing)
##   Surrogate splits:
##       StagnationRate < 0.8452381  to the left,  agree=0.806, adj=0.387, (0 split)
##       AgeNormalized  < 0.6969697  to the left,  agree=0.724, adj=0.129, (0 split)
## 
## Node number 12: 94 observations
##   mean=3.840426, MSE=1.304323 
## 
## Node number 13: 129 observations
##   mean=5.775194, MSE=1.647137 
## 
## Node number 14: 99 observations,    complexity param=0.02008641
##   mean=7.10101, MSE=7.605959 
##   left son=28 (46 obs) right son=29 (53 obs)
##   Primary splits:
##       YearsInMostRecentRole < 3.5        to the left,  improve=0.35949020, (0 missing)
##       StagnationRate        < 0.8090909  to the left,  improve=0.19071570, (0 missing)
##       AgeNormalized         < 0.4090909  to the left,  improve=0.06846250, (0 missing)
##       EthnicityWhite        < 0.5        to the right, improve=0.03530751, (0 missing)
##       YearsWithCurrManager  < 2.5        to the left,  improve=0.02353120, (0 missing)
##   Surrogate splits:
##       StagnationRate       < 0.6833333  to the left,  agree=0.727, adj=0.413, (0 split)
##       AgeNormalized        < 0.4393939  to the left,  agree=0.646, adj=0.239, (0 split)
##       YearsWithCurrManager < 2.5        to the left,  agree=0.556, adj=0.043, (0 split)
##       EthnicityWhite       < 0.5        to the right, agree=0.545, adj=0.022, (0 split)
## 
## Node number 15: 168 observations
##   mean=8.815476, MSE=1.436189 
## 
## Node number 18: 143 observations
##   mean=2.202797, MSE=3.196636 
## 
## Node number 19: 55 observations
##   mean=4.236364, MSE=5.671405 
## 
## Node number 22: 67 observations
##   mean=5.671642, MSE=4.847405 
## 
## Node number 23: 31 observations
##   mean=8.096774, MSE=1.893861 
## 
## Node number 28: 46 observations
##   mean=5.326087, MSE=8.089319 
## 
## Node number 29: 53 observations
##   mean=8.641509, MSE=2.079032
library(rpart.plot)

# Visualize the decision tree
rpart.plot(decision_tree_model, type = 3, digits = 2, fallen.leaves = TRUE)

# Shallower trees are less likely to overfit
# The first splits are the most important predictors

# Predict on test data
decision_tree_prediction <- predict(decision_tree_model, testData)

YearsWithCurrManager is the most important predictor as the tree starts with it.
YearsWithCurrManager < 2 -> shorter tenure on average
YearsWithCurrManager >= 2 -> longer tenure on average

YearsInMostRecentRole < 2 -> shorter tenure on average
YearsInMostRecentRole >= 2 -> longer tenure on average
This suggests that the more time employees spend in their most recent role, the longer their tenure is likely to be.

AgeNormalized < 0.29 -> shorter tenure on average
AgeNormalized >= 0.29 -> longer tenure on average
Younger employees tend to receive shorter tenure predictions.

StagnationRate < 0.05 -> shorter tenure on average
StagnationRate >= 0.05 -> longer tenure on average
Employees who are stagnant in their current role tend to stay with the company for a longer duration.

EthnicityWhite = 1 -> shorter tenure on average
EthnicityWhite = 0 -> longer tenure on average
Employees of white ethnicity tend to leave the company earlier.

Regression Model 3 : Elastic Net Regression

Modification of Linear Regression
Lasso (L1) & Ridge (L2) Regularization
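For reference, the objective minimized by cv.glmnet blends the two penalties; alpha = 0.5 in the chunk below weights the lasso (L1) and ridge (L2) terms equally (this is the standard glmnet formulation, not taken from the report):

$$\min_{\beta_0,\beta}\; \frac{1}{2n}\sum_{i=1}^{n}\left(y_i-\beta_0-x_i^{\top}\beta\right)^2+\lambda\left[\frac{1-\alpha}{2}\lVert\beta\rVert_2^2+\alpha\lVert\beta\rVert_1\right]$$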

library(glmnet)

elastic_net_model <- cv.glmnet(
    x = as.matrix(X_train),
    y = Y_train,
    alpha = 0.5,
    family = "gaussian"
)

# Cross-validation results (Lambda selection)
plot(elastic_net_model)

# determine the best lambda score
# choose a small lambda for minimal regularization
best_lambda <- elastic_net_model$lambda.min

# Predict on test data
elastic_net_prediction <- predict(elastic_net_model, newx = as.matrix(X_test), s = best_lambda)

Regression Model 4: XGBoost

library(xgboost)

set.seed(123)
# convert data to XGBoost DMatrix
dtrain <- xgb.DMatrix(data = X_train, label = Y_train)
dtest <- xgb.DMatrix(data = X_test, label = Y_test)

# Define XGBoost Parameters (Default)
params <- list(
  objective = "reg:squarederror",  # Regression for predicting numerical target
  booster = "gbtree",             # Use tree-based boosting
  eta = 0.1,                      # Learning rate
  max_depth = 6,                  # Maximum depth of each tree
  subsample = 0.8,                # Subsample ratio of the training data
  colsample_bytree = 0.8          # Subsample ratio of columns
)

#train model
xgboost_model <- xgb.train(
  params = params,
  data = dtrain,
  nrounds = 1000,
  early_stopping_rounds = 10,  # Stop if the test RMSE doesn’t improve for 10 rounds
  watchlist = list(train = dtrain, test = dtest),
  verbose = 0
)

# Predict on test data
xgboost_prediction <- predict(xgboost_model, dtest)

Model Evaluation

Classification Models Evaluation

Comparison of Original Split data and SMOTE Balanced Data

KNN

##              KNN  Accuracy Precision    Recall        F1
## 1       Original 0.8208617 0.8900804 0.8972973 0.8936743
## 2 SMOTE Balanced 0.8480726 0.8740741 0.9567568 0.9135484
Random Forest

##               RF  Accuracy Precision    Recall        F1
## 1       Original 0.8548753 0.8903061 0.9432432 0.9160105
## 2 SMOTE Balanced 0.8458050 0.8665049 0.9648649 0.9130435
SVM

##              SVM  Accuracy Precision    Recall        F1
## 1       Original 0.8367347 0.8962766 0.9108108 0.9034853
## 2 SMOTE Balanced 0.8458050 0.8629808 0.9702703 0.9134860

Overview

In general, models trained on the SMOTE-balanced dataset perform slightly better than those trained on the original imbalanced data: recall improves for all three models, and KNN and SVM also gain accuracy, though Random Forest loses a little. Resampling with SMOTE is therefore a worthwhile step for improving how well the models detect the minority (attrition) class.

Classification Evaluation Metrics :

  • Accuracy - proportion of all classifications that were correct, whether positive or negative.
  • Precision - proportion of all the model’s positive classifications that are actually positive.
  • Recall - the true positive rate (TPR): the proportion of actual positives that are correctly identified.
  • F1 Score - harmonic mean (a kind of average) of precision and recall.
##          Models  Accuracy Precision    Recall        F1
## 1           KNN 0.8208617 0.8900804 0.8972973 0.8936743
## 2 Random Forest 0.8548753 0.8903061 0.9432432 0.9160105
## 3           SVM 0.8367347 0.8962766 0.9108108 0.9034853
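The evaluation code is not shown; a sketch of how one row of such a table can be derived with caret::confusionMatrix (Random Forest shown, reusing rf_predictions and Y_test_classi from the earlier chunks):

cm_rf <- confusionMatrix(rf_predictions, Y_test_classi, mode = "prec_recall")
cm_rf$overall["Accuracy"]                      # overall accuracy
cm_rf$byClass[c("Precision", "Recall", "F1")]  # precision, recall and F1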

The Random Forest model outperforms the other two models, with the highest accuracy, a high recall, and the highest F1 score. This indicates that the model predicts accurately, and the high F1 score means it strikes a good balance between precision and recall. KNN has the lowest accuracy but still maintains a good precision-recall balance, reflected in its F1 score. SVM has the highest precision but trails Random Forest slightly in recall and F1 score.

Feature Importance for Random Forest Models

Feature Importance in Random Forest

  • StagnationRate
  • OverTime
  • YearsAtCompany
  • YearsWithCurrManager
  • YearsInMostRecentRole

Mean Decrease Gini is a measure of variable importance based on the Gini impurity index used for calculating the splits in trees. Higher values indicate features that contribute more to reducing impurity (the Gini index) in the model. From the graph, StagnationRate, OverTime, and YearsAtCompany are the top three factors in the Random Forest model.
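The importance plot is generated elsewhere; a sketch of how the Mean Decrease Gini values can be extracted from the caret-wrapped randomForest:

importance(rf_model$finalModel)   # MeanDecreaseGini for each feature
varImpPlot(rf_model$finalModel,
           main = "Feature Importance (Mean Decrease Gini)")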

Classification Overfitting / Underfitting Evaluation

Comparison of the F1 scores of the classification models on the training and test data.
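A sketch of how this comparison can be computed with caret's F_meas(), shown for KNN (the other models follow the same pattern; the train-side prediction step is an assumption):

train_pred_knn <- predict(knn_model, X_train_classi)
cat("Train F1 Score (KNN):", F_meas(train_pred_knn, Y_train_classi), "\n")
cat("Test F1 Score (KNN):", F_meas(knn_predictions, Y_test_classi), "\n")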

KNN
## Train F1 Score (KNN): 0.8774265
## Test F1 Score (KNN): 0.8933873

Good Fit. The test F1 score is slightly higher than the train F1 score, which suggests the KNN model performs well without significant overfitting.

Random Forest
## Train F1 Score (Random Forest): 0.9375352
## Test F1 Score (Random Forest): 0.9160105

Slightly Overfit.
The gap indicates the model performs slightly better on the training set than on unseen data. However, the difference is quite small and acceptable.

Support Vector Machine (SVM)
## Train F1 Score (SVM): 0.8687534
## Test F1 Score (SVM): 0.9034853

Good Fit. The SVM model performs well, with a higher F1 score on the test set than on the train set. This indicates that the SVM model is accurate and well-generalized.

Regression Models Evaluation

Regression Evaluation Metrics :

  • Mean Absolute Error (MAE) - minimizes average absolute errors
  • Mean Squared Error (MSE) - penalizes large errors
  • Root Mean Squared Error (RMSE) - interpretable in the same units as the target variable
  • R2 (Coefficient of Determination) - proportion of variance in the dependent variable explained by the model
  • Adjusted R2 - modified version of R2 that accounts for the number of predictors
##               Model       MAE      MSE     RMSE        R2    Adj_R2
## 1 Linear Regression 1.3968233 3.044407 1.744823 0.7104385 0.7071025
## 2     Decision Tree 1.1746989 2.490537 1.578144 0.7632154 0.7604875
## 3       Elastic Net 1.3982779 3.038865 1.743234 0.7105643 0.7072298
## 4           XGBoost 0.6944892 1.175592 1.084247 0.8872953 0.7072298
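A sketch of how one row of this table can be computed with the Metrics package (Linear Regression shown); R2 and Adjusted R2 are derived manually, with p the number of predictors:

p <- ncol(X_test)
n <- length(Y_test)
mae_lm  <- mae(Y_test, linear_regression_prediction)
mse_lm  <- mse(Y_test, linear_regression_prediction)
rmse_lm <- rmse(Y_test, linear_regression_prediction)
r2_lm   <- 1 - sum((Y_test - linear_regression_prediction)^2) /
               sum((Y_test - mean(Y_test))^2)
adj_r2_lm <- 1 - (1 - r2_lm) * (n - 1) / (n - p - 1)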

XGBoost clearly outperforms the other three models across the evaluation metrics:

  • Lowest error metrics (MAE, MSE, RMSE)
  • Highest R2 and Adjusted R2

With an R2 of about 0.887, the XGBoost model explains roughly 88.7% of the variance in YearsAtCompany from the selected features (YearsWithCurrManager, YearsInMostRecentRole, StagnationRate, AgeNormalized, EthnicityWhite).

The Decision Tree is a simpler alternative for predicting employee tenure, obtaining an R2 of 76.3% while using far less computational resource than XGBoost. It also outperforms Linear Regression and Elastic Net in every metric.

Linear Regression and Elastic Net score similarly across the metrics and perform the worst on the test data. Both fail to capture the non-linear structure of the data, which causes them to underperform in tenure prediction.

Feature Importance across Regression Models

##                         Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)            1.1387572 0.14787257  7.700936 3.175446e-14
## YearsInMostRecentRole  0.3499602 0.03104623 11.272230 7.289326e-28
## YearsWithCurrManager   0.4759478 0.02664998 17.859221 2.595076e-62
## EthnicityWhite        -0.9045818 0.11616599 -7.786977 1.675088e-14
## AgeNormalized          3.3028031 0.25643764 12.879557 2.737538e-35
## StagnationRate         2.0149616 0.24512647  8.220090 6.106997e-16

Linear Regression

All features have very small p-values (< 0.05), which suggests that every predictor is statistically significant in the model.
AgeNormalized is the most impactful predictor, with the largest coefficient (3.3): a one-unit increase in normalized age (i.e., moving across the full age range) corresponds to an expected 3.3-year increase in tenure.
It is followed by StagnationRate, which increases expected tenure by about 2 years per one-unit increase.
EthnicityWhite shows a negative relationship with YearsAtCompany: employees of white ethnicity are expected to stay about 0.9 years less than others.
YearsInMostRecentRole and YearsWithCurrManager also have positive effects on tenure, although their coefficients are smaller than those of the other features.

##  YearsWithCurrManager YearsInMostRecentRole        StagnationRate 
##              5043.732              4285.701              3364.404 
##         AgeNormalized        EthnicityWhite 
##              3191.390              1066.731

Decision Tree
Importance Rank:
1. YearsWithCurrManager
2. YearsInMostRecentRole
3. StagnationRate
4. AgeNormalized
5. EthnicityWhite

The manager relationship (YearsWithCurrManager) and career progression (YearsInMostRecentRole) have the greatest impact on employee tenure.
Stagnation and age are contributing factors of moderate importance to the duration of stay.
Ethnicity is the least important feature, contributing little to the decision tree's predictions.

Elastic Net Regression
Feature importance follows a similar pattern, with values close to those of Linear Regression.
Importance Rank:
1. YearsInMostRecentRole
2. YearsWithCurrManager
3. EthnicityWhite
4. AgeNormalized
5. StagnationRate

Possible reasons for the similarity:
- Low regularization strength (lambda of about 0.02), so the model behaves much like standard Linear Regression
- Little collinearity, as most highly correlated predictors were removed during correlation analysis

XGBoost
Importance Rank:
1. YearsWithCurrManager
2. YearsInMostRecentRole
3. StagnationRate
4. AgeNormalized
5. EthnicityWhite

The importance ranking matches that of the Decision Tree Regression model.
YearsWithCurrManager, YearsInMostRecentRole, and StagnationRate are the top three drivers of predicted tenure, together accounting for nearly 80% of the total importance. AgeNormalized has a moderate influence, while EthnicityWhite contributes minimally, with only a 1.9% importance score.
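A sketch of how the gain-based importance scores can be extracted from the trained booster:

imp <- xgb.importance(model = xgboost_model)
print(imp)                # the Gain column ~ each feature's share of importance
xgb.plot.importance(imp)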

Overfitting / Underfitting Evaluation

Comparison of the Mean Absolute Error (MAE) of the regression models on the training and test data.
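A sketch of the train-versus-test MAE comparison (XGBoost shown; the same pattern applies to the other models):

train_pred_xgb <- predict(xgboost_model, dtrain)
cat("Train MAE (XGBoost):", mae(Y_train, train_pred_xgb), "\n")
cat("Test MAE (XGBoost):", mae(Y_test, xgboost_prediction), "\n")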

Linear Regression

## Train MAE (Linear Regression): 1.390022
## Test MAE (Linear Regression): 1.396823

Good Fit.
The MAE of the linear regression model on unseen data is slightly higher than on the training data, but the two are very close, which indicates neither overfitting nor underfitting.

Decision Tree

## Train MAE (Decision Tree): 1.128646
## Test MAE (Decision Tree): 1.174699

Slightly Overfit.
The test MAE is somewhat higher than the train MAE, suggesting mild overfitting: the model performs a little better on the training data than on the test data.

Elastic Net

## Train MAE (Elastic Net): 1.457014
## Test MAE (Elastic Net): 1.451609

Good Fit.
Both train and test MAE are similar, which indicates that the Elastic Net model is accurate and well-generalized.

XGBoost

## Train MAE (XGBoost): 0.3872376
## Test MAE (XGBoost): 0.6944892

Overfit.
For XGBoost, the test MAE is markedly higher than the train MAE; this large gap between train and test error indicates overfitting.

Hyperparameter Tuning

Classification Models

Classification Model 1: K-Nearest Neighbours

  • value of K (k)
    • how many neighbors will be checked
  • Cross Validation
    • 10 fold cross validation
tunegrid_knn <- data.frame(k = seq(11, 85, by = 2))
trainControl_knn <- trainControl(method = "cv", number = 10)
knnModel_tune <- train(
  label ~ .,
  data = classification_data,
  method = "knn",
  tuneGrid = tunegrid_knn,
  trControl = trainControl_knn,
  tuneLength = 10
)

# Plot performance
plot(knnModel_tune)

# Show best parameter
cat("Best Tune Hyperparameter (k): ", knnModel_tune$bestTune$k)
## Best Tune Hyperparameter (k):  21
##       Model  Accuracy Precision    Recall        F1
## 1       KNN 0.8208617 0.8900804 0.8972973 0.8936743
## 2 Tuned KNN 0.8072562 0.8925620 0.8756757 0.8840382

For KNN, tuning did not help: the tuned model trades recall for slightly higher precision, resulting in a lower F1 score and accuracy on the test set.

Classification Model 2: Random Forest

  • value of mtry (mtry)
    • Number of randomly sampled predictors
  • Cross Validation
    • 10 fold cross validation
tunegrid_rf <- expand.grid(mtry = 1:5) # candidate 'mtry' values
trainControl_rf <- trainControl(method = "cv", number = 10, search = "random") # 10 fold cross validation, randomized search

rfModel_tune <- train(
  label ~ .,
  data = classification_data,
  method = "rf",
  tuneGrid = tunegrid_rf,
  trControl = trainControl_rf,
  tuneLength = 10
)

# Plot performance
plot(rfModel_tune)

# Show best parameter
cat("Best Tune Hyperparameter (mtry): ", rfModel_tune$bestTune$mtry)
## Best Tune Hyperparameter (mtry):  1
##                 Model  Accuracy Precision    Recall        F1
## 1       Random Forest 0.8548753 0.8903061 0.9432432 0.9160105
## 2 Tuned Random Forest 0.8548753 0.8844221 0.9513514 0.9166667

For Random Forest, the tuned model performs slightly better than the original, with marginally higher recall and F1 score (accuracy is unchanged and precision dips slightly). Given how small the improvement is, tuning is not strictly necessary here.

Classification Model 3: Support Vector Machine (SVM)

  • value of C (C) and sigma (sigma)
    • C: the cost parameter, controlling how strongly misclassified training samples are penalized
    • sigma: the width of the Gaussian (RBF) kernel
  • Cross Validation
    • 10 fold cross validation
tunegrid_svm <- expand.grid(C = c(0.25, 0.5, .75, 1), sigma=c(.5, .75, .8, .9, 1))
trainControl_svm <- trainControl(method = "cv",
                                 number = 10,
                                 search = "random")
svmModel_tune <- train(
  label ~ .,
  data = classification_data,
  method = "svmRadial",
  tuneGrid = tunegrid_svm,
  preProcess = c("center","scale"),
  trControl = trainControl_svm
)

# Plot performance
plot(svmModel_tune)

# Show best parameter
cat("Best Tune Hyperparameter (sigma): ", svmModel_tune$bestTune$sigma)
## Best Tune Hyperparameter (sigma):  1
cat("Best Tune Hyperparameter (C): ", svmModel_tune$bestTune$C)
## Best Tune Hyperparameter (C):  1
##       Model  Accuracy Precision    Recall        F1
## 1       SVM 0.8367347 0.8962766 0.9108108 0.9034853
## 2 Tuned SVM 0.8321995 0.8978495 0.9027027 0.9002695

Although the tuned SVM shows slightly better precision, its overall performance is worse, particularly in F1 score and accuracy. This suggests that the default SVM parameters are already well-suited to the dataset, and tuning is not necessary.

Final Verdict on Classification Models

Random Forest stands out as the top-performing model of the three, with the highest accuracy, recall, and F1 score. In predicting employee attrition, Random Forest suggests that StagnationRate, OverTime, and YearsAtCompany could be the main factors contributing to attrition. The KNN and SVM models also score highly, but slightly below Random Forest. Random Forest is therefore the most suitable model for this prediction task, given its strong F1 score, recall, and accuracy.

Regression Models

Regression Model 1: Linear Regression

  • No tuning as there is no hyperparameters in the standard linear regression model.

Regression Model 2: Decision Tree Regression

  • Grid Search
    • complexity parameter (cp)
  • Cross Validation
    • 10 fold cross validation
## Best Tune Hyperparameter (cp):  0.001
##                 Model      MAE      MSE     RMSE        R2    Adj_R2
## 1       Decision Tree 1.174699 2.490537 1.578144 0.7632154 0.7604875
## 2 Tuned Decision Tree 1.008521 2.140439 1.463024 0.7968465 0.7945060
## Train MAE (Tuned Decision Tree): 0.8315298
## Test MAE (Tuned Decision Tree): 1.008521
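The tuning chunk is not shown; a hedged sketch of the cp grid search with caret (the exact grid is an assumption):

tunegrid_dt <- data.frame(cp = seq(0.001, 0.05, by = 0.001))
trainControl_dt <- trainControl(method = "cv", number = 10)
dtModel_tune <- train(
  label ~ .,
  data = trainData,
  method = "rpart",
  tuneGrid = tunegrid_dt,
  trControl = trainControl_dt
)
cat("Best Tune Hyperparameter (cp): ", dtModel_tune$bestTune$cp)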

Although the tuned tree's test metrics improved overall, the train-test gap has grown larger, indicating that the model has become more overfit. The tuned tree may therefore struggle to generalize, and might not perform as well as the default decision tree on truly unseen data.

Regression Model 3: Elastic Net Regression

  • Grid Search
    • Alpha (Balances Ridge & Lasso properties)
    • Lambda (Strength of regularization)
  • Cross Validation
    • 10 fold cross validation
# Define the grid of hyperparameters
grid <- expand.grid(
alpha = seq(0, 1, by = 0.1),
lambda = 10^seq(-4, 0, length=100)
)

# Define training control for 10-fold cross-validation
train_control <- trainControl(method = "cv", number = 10)

set.seed(123)
# Train Elastic Net with caret
elastic_net_tune <- train(
  label ~ ., data = trainData,
  method = "glmnet",
  trControl = train_control,
  tuneGrid = grid
)

plot(elastic_net_tune)

# The selected lambda achieved the lowest error, balancing model parsimony with accuracy
## Best Tune Hyperparameter (alpha):  0.1
## Best Tune Hyperparameter (lambda):  0.04229243
##               Model      MAE      MSE     RMSE        R2    Adj_R2
## 1       Elastic Net 1.398278 3.038865 1.743234 0.7105643 0.7072298
## 2 Tuned Elastic Net 1.398166 3.039481 1.743411 0.7104614 0.7071257

The tuned Elastic Net improves marginally on MAE, while performing slightly worse on MSE, RMSE, and R2; the differences are negligible. This suggests that the default Elastic Net hyperparameters already work well on this dataset, so further tuning is unnecessary.

Regression Model 4: XGBoost

  • Grid Search (Reduce Model Complexity)
    • Max_Depth (Shallow - Generalize better)
    • eta (Smaller - Reduce Step Size)
    • min_Child_Weight (Higher - discourage splitting on small data)
    • subsample (Lower - Add randomness)
    • nrounds (Early stopping rounds)
    • gamma (Greater - minimum loss reduction required to make a further split)
  • Cross Validation
    • 10 fold cross validation
# Suppress messages and warnings during tuning
# Redirect console output to the null device
sink("/dev/null")
suppressWarnings({
  suppressMessages({
    # Define parameter grid
    grid <- expand.grid(
      nrounds = c(50, 100, 150),            # Number of boosting rounds
      max_depth = c(3, 4, 5),               # Tree depth
      eta = c(0.5, 0.1),                    # Learning rate
      gamma = c(0),                         # Minimum loss reduction
      colsample_bytree = c(0.5, 0.7),       # Column sampling ratio
      min_child_weight = c(1, 3),           # Minimum child weight
      subsample = c(0.7, 0.8, 1.0)          # Subsample ratio
    )
    
    # Define training control for 10-fold cross-validation
    train_control <- trainControl(method = "cv", number = 10)  # 10-fold cross-validation

    # Train the model
    set.seed(123)
    xgb_caret_model <- train(
      x = X_train,
      y = Y_train,
      method = "xgbTree",
      tuneGrid = grid,
      trControl = train_control
    )
  })
})
sink()  # Restore normal output
## Best Tune Hyperparameter (eta): 0.5
## Best Tune Hyperparameter (max_depth): 4
## Best Tune Hyperparameter (subsample): 1
## Best Tune Hyperparameter (colsample_bytree): 0.7
## Best Tune Hyperparameter (min_child_weight): 3
## Best Tune Hyperparameter (gamma): 0
##           Model       MAE      MSE     RMSE        R2    Adj_R2
## 1       XGBoost 0.6944892 1.175592 1.084247 0.8872953 0.7072298
## 2 Tuned XGBoost 0.7952340 1.313201 1.145950 0.8743488 0.7071257
## Train MAE (Tuned XGBoost): 0.6133807
## Test MAE (Tuned XGBoost): 0.795234

Although the test metrics of the tuned XGBoost drop a little compared with the untuned model, it still produces good results overall, with an R2 above 85%. Since the aim of the hyperparameter tuning was better generalization, the tuned model succeeds: the train-test MAE gap shrinks, reducing the overfitting observed earlier. Tuning focused mainly on max_depth, eta, min_child_weight, subsample, and nrounds to curb overfitting.
Because the original model overfits more than the tuned version, the tuned XGBoost generalizes better and will likely perform better on unseen data.

Final Verdict on Regression Models

XGBoost and Tuned XGBoost come out on top in predicting how long an employee stays with the company, although the untuned version may not perform as well on unseen data because it overfits the training data. Tuned XGBoost is therefore the ideal model: it achieves similar performance to the original but with less overfitting.
The Decision Tree comes second, with a good R2 of 76% and only slight overfitting, while requiring far less computational resource than XGBoost.
Linear Regression and Elastic Net Regression, with R2 scores of 71%, are a fairly good fit for predicting tenure. Their errors are worse than the Decision Tree's, but they generalize well. The similar results of Linear Regression and Elastic Net indicate that the data has minimal multicollinearity, since highly correlated features were removed during feature selection; regularization therefore provides little advantage here. Overall, the non-linear regression models predict employee tenure better than the linear models, given features such as YearsWithCurrManager, YearsInMostRecentRole, StagnationRate, AgeNormalized, and EthnicityWhite.