1 Objectives

The main objectives are to predict monthly salary and employee attrition:

  1. To identify the factors associated with salary and attrition by examining various employee features.

  2. To develop machine learning models to predict both outcomes.

  3. To evaluate the effectiveness of these machine learning models.

2 Introduction

Employee attrition refers to the natural process by which employees leave an organization, often through resignation or other factors. It is a critical challenge for organizations, as it directly impacts workforce stability, financial costs, and overall productivity. According to the 2023 Salary Increase and Turnover Study, Malaysia’s attrition rate rose to 16.2% in 2023 from 14.9% in 2022. Both voluntary and involuntary attrition can hinder organizational growth and competitiveness. Addressing the root causes of attrition is crucial to enhancing employee retention, which in turn sustains organizational stability, productivity, and competitive advantage. In addition to predicting attrition, analyzing and predicting monthly salaries plays an equally important role: it allows organizations to identify patterns in compensation and to ensure fairness and attractiveness in pay structures, which can mitigate dissatisfaction and reduce attrition. Leveraging machine learning, a powerful tool in data science, enables organizations to analyze historical data, predict attrition trends, and implement proactive strategies to mitigate adverse effects such as skill shortages and diminished performance.

3 Dataset Description

The dataset, titled ‘Fictional dataset on HR Employee attrition and performance’, is obtained from Kaggle at https://www.kaggle.com/datasets/patelprashant/employee-attrition. It is structured as a CSV file containing both categorical and numerical variables. The dataset holds information about employees, focusing on factors that may contribute to attrition, such as demographics, job-related attributes, performance metrics and attrition status.

Education: ‘1-Below College’ ‘2-College’ ‘3-Bachelor’ ‘4-Master’ ‘5-Doctor’

Environment Satisfaction: ‘1-Low’ ‘2-Medium’ ‘3-High’ ‘4-Very High’

Job Involvement: ‘1-Low’ ‘2-Medium’ ‘3-High’ ‘4-Very High’

Job Satisfaction: ‘1-Low’ ‘2-Medium’ ‘3-High’ ‘4-Very High’

Performance Rating: ‘1-Low’ ‘2-Good’ ‘3-Excellent’ ‘4-Outstanding’

Relationship Satisfaction: ‘1-Low’ ‘2-Medium’ ‘3-High’ ‘4-Very High’

Work Life Balance: ‘1-Bad’ ‘2-Good’ ‘3-Better’ ‘4-Best’
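
These codes are stored as plain integers in the CSV. Where labelled output is preferred (for example in plots), the codes can be mapped back to their descriptions. Below is a minimal sketch for the Education column, assuming the coding listed above; the analysis itself keeps the numeric codes.

# Map Education codes 1-5 to their labels (sketch; apply once the data is loaded)
education_labels <- c("Below College", "College", "Bachelor", "Master", "Doctor")
# factor(df$Education, levels = 1:5, labels = education_labels)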

# Import necessary libraries
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
library(readr)
library(repr)
library(reshape2)
library(ROSE)
## Loaded ROSE 0.0-4
library(scales)
## 
## Attaching package: 'scales'
## The following object is masked from 'package:readr':
## 
##     col_factor
library(stats)
library(tidyr)
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:reshape2':
## 
##     smiths
csv_url <- "https://drive.google.com/uc?id=1TLe_eD3c-KURISY6KU-ltJNxukC8X2a_&export=download"
df <- read_csv(csv_url)
## Rows: 1470 Columns: 35
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (9): Attrition, BusinessTravel, Department, EducationField, Gender, Job...
## dbl (26): Age, DailyRate, DistanceFromHome, Education, EmployeeCount, Employ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(df)
## # A tibble: 6 × 35
##     Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education
##   <dbl> <chr>     <chr>              <dbl> <chr>                 <dbl>     <dbl>
## 1    41 Yes       Travel_Rarely       1102 Sales                     1         2
## 2    49 No        Travel_Freque…       279 Research …                8         1
## 3    37 Yes       Travel_Rarely       1373 Research …                2         2
## 4    33 No        Travel_Freque…      1392 Research …                3         4
## 5    27 No        Travel_Rarely        591 Research …                2         1
## 6    32 No        Travel_Freque…      1005 Research …                2         2
## # ℹ 28 more variables: EducationField <chr>, EmployeeCount <dbl>,
## #   EmployeeNumber <dbl>, EnvironmentSatisfaction <dbl>, Gender <chr>,
## #   HourlyRate <dbl>, JobInvolvement <dbl>, JobLevel <dbl>, JobRole <chr>,
## #   JobSatisfaction <dbl>, MaritalStatus <chr>, MonthlyIncome <dbl>,
## #   MonthlyRate <dbl>, NumCompaniesWorked <dbl>, Over18 <chr>, OverTime <chr>,
## #   PercentSalaryHike <dbl>, PerformanceRating <dbl>,
## #   RelationshipSatisfaction <dbl>, StandardHours <dbl>, …

4 Preprocessing

# Check if there are any missing values
anyNA(df)
## [1] TRUE
# Perform imputation with mean

# Find columns with missing values
missing_cols <- colnames(df)[colSums(is.na(df)) > 0]
missing_cols
## [1] "MonthlyRate"
# Impute missing values with the mean for each column
for (col in missing_cols) {
  if (is.numeric(df[[col]])) {
    df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
  }
}
# Remove columns with only one unique value (which are irrelevant) and the EmployeeNumber identifier

univalue_cols <- names(df)[sapply(df, function(x) length(unique(x)) == 1)]
cols_to_remove <- c("EmployeeNumber", univalue_cols)

# Print out columns to be removed
print(cols_to_remove)
## [1] "EmployeeNumber" "EmployeeCount"  "Over18"         "StandardHours"
df <- df[, !(names(df) %in% cols_to_remove)]
head(df)
## # A tibble: 6 × 31
##     Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education
##   <dbl> <chr>     <chr>              <dbl> <chr>                 <dbl>     <dbl>
## 1    41 Yes       Travel_Rarely       1102 Sales                     1         2
## 2    49 No        Travel_Freque…       279 Research …                8         1
## 3    37 Yes       Travel_Rarely       1373 Research …                2         2
## 4    33 No        Travel_Freque…      1392 Research …                3         4
## 5    27 No        Travel_Rarely        591 Research …                2         1
## 6    32 No        Travel_Freque…      1005 Research …                2         2
## # ℹ 24 more variables: EducationField <chr>, EnvironmentSatisfaction <dbl>,
## #   Gender <chr>, HourlyRate <dbl>, JobInvolvement <dbl>, JobLevel <dbl>,
## #   JobRole <chr>, JobSatisfaction <dbl>, MaritalStatus <chr>,
## #   MonthlyIncome <dbl>, MonthlyRate <dbl>, NumCompaniesWorked <dbl>,
## #   OverTime <chr>, PercentSalaryHike <dbl>, PerformanceRating <dbl>,
## #   RelationshipSatisfaction <dbl>, StockOptionLevel <dbl>,
## #   TotalWorkingYears <dbl>, TrainingTimesLastYear <dbl>, …
str(df)
## tibble [1,470 × 31] (S3: tbl_df/tbl/data.frame)
##  $ Age                     : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : chr [1:1470] "Yes" "No" "Yes" "No" ...
##  $ BusinessTravel          : chr [1:1470] "Travel_Rarely" "Travel_Frequently" "Travel_Rarely" "Travel_Frequently" ...
##  $ DailyRate               : num [1:1470] 1102 279 1373 1392 591 ...
##  $ Department              : chr [1:1470] "Sales" "Research & Development" "Research & Development" "Research & Development" ...
##  $ DistanceFromHome        : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : num [1:1470] 2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : chr [1:1470] "Life Sciences" "Life Sciences" "Other" "Life Sciences" ...
##  $ EnvironmentSatisfaction : num [1:1470] 2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : chr [1:1470] "Female" "Male" "Male" "Female" ...
##  $ HourlyRate              : num [1:1470] 94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : num [1:1470] 3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : num [1:1470] 2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : chr [1:1470] "Sales Executive" "Research Scientist" "Laboratory Technician" "Research Scientist" ...
##  $ JobSatisfaction         : num [1:1470] 4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : chr [1:1470] "Single" "Married" "Single" "Married" ...
##  $ MonthlyIncome           : num [1:1470] 5993 5130 2090 2909 3468 ...
##  $ MonthlyRate             : num [1:1470] 19479 24907 2396 23159 14281 ...
##  $ NumCompaniesWorked      : num [1:1470] 8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : chr [1:1470] "Yes" "No" "Yes" "Yes" ...
##  $ PercentSalaryHike       : num [1:1470] 11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : num [1:1470] 3 4 3 3 3 3 4 4 4 3 ...
##  $ RelationshipSatisfaction: num [1:1470] 1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : num [1:1470] 0 1 0 0 1 0 3 1 0 2 ...
##  $ TotalWorkingYears       : num [1:1470] 8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : num [1:1470] 0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : num [1:1470] 1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : num [1:1470] 6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : num [1:1470] 4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : num [1:1470] 0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : num [1:1470] 5 7 0 0 2 6 0 0 8 7 ...
colnames(df)
##  [1] "Age"                      "Attrition"               
##  [3] "BusinessTravel"           "DailyRate"               
##  [5] "Department"               "DistanceFromHome"        
##  [7] "Education"                "EducationField"          
##  [9] "EnvironmentSatisfaction"  "Gender"                  
## [11] "HourlyRate"               "JobInvolvement"          
## [13] "JobLevel"                 "JobRole"                 
## [15] "JobSatisfaction"          "MaritalStatus"           
## [17] "MonthlyIncome"            "MonthlyRate"             
## [19] "NumCompaniesWorked"       "OverTime"                
## [21] "PercentSalaryHike"        "PerformanceRating"       
## [23] "RelationshipSatisfaction" "StockOptionLevel"        
## [25] "TotalWorkingYears"        "TrainingTimesLastYear"   
## [27] "WorkLifeBalance"          "YearsAtCompany"          
## [29] "YearsInCurrentRole"       "YearsSinceLastPromotion" 
## [31] "YearsWithCurrManager"
# List out the numerical columns and categorical columns for preprocessing later
numerical_cols <- c("Age", "DailyRate", "DistanceFromHome", "HourlyRate", "MonthlyIncome", "MonthlyRate", "NumCompaniesWorked", "PercentSalaryHike", "TotalWorkingYears", "TrainingTimesLastYear", "YearsAtCompany",
"YearsInCurrentRole", "YearsSinceLastPromotion", "YearsWithCurrManager")
categorical_cols <- c("Attrition", "BusinessTravel", "Department", "Education", "EducationField", "EnvironmentSatisfaction", "Gender", "JobInvolvement", "JobLevel", "JobRole", "JobSatisfaction", "MaritalStatus",
"OverTime", "PerformanceRating", "RelationshipSatisfaction", "StockOptionLevel", "WorkLifeBalance")

# Convert categorical columns to factors
for (col in categorical_cols) {
  df[[col]] <- factor(df[[col]])
}

# Verify structure
str(df)
## tibble [1,470 × 31] (S3: tbl_df/tbl/data.frame)
##  $ Age                     : num [1:1470] 41 49 37 33 27 32 59 30 38 36 ...
##  $ Attrition               : Factor w/ 2 levels "No","Yes": 2 1 2 1 1 1 1 1 1 1 ...
##  $ BusinessTravel          : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
##  $ DailyRate               : num [1:1470] 1102 279 1373 1392 591 ...
##  $ Department              : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
##  $ DistanceFromHome        : num [1:1470] 1 8 2 3 2 2 3 24 23 27 ...
##  $ Education               : Factor w/ 5 levels "1","2","3","4",..: 2 1 2 4 1 2 3 1 3 3 ...
##  $ EducationField          : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
##  $ EnvironmentSatisfaction : Factor w/ 4 levels "1","2","3","4": 2 3 4 4 1 4 3 4 4 3 ...
##  $ Gender                  : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
##  $ HourlyRate              : num [1:1470] 94 61 92 56 40 79 81 67 44 94 ...
##  $ JobInvolvement          : Factor w/ 4 levels "1","2","3","4": 3 2 2 3 3 3 4 3 2 3 ...
##  $ JobLevel                : Factor w/ 5 levels "1","2","3","4",..: 2 2 1 1 1 1 1 1 3 2 ...
##  $ JobRole                 : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
##  $ JobSatisfaction         : Factor w/ 4 levels "1","2","3","4": 4 2 3 3 2 4 1 3 3 3 ...
##  $ MaritalStatus           : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
##  $ MonthlyIncome           : num [1:1470] 5993 5130 2090 2909 3468 ...
##  $ MonthlyRate             : num [1:1470] 19479 24907 2396 23159 14281 ...
##  $ NumCompaniesWorked      : num [1:1470] 8 1 6 1 9 0 4 1 0 6 ...
##  $ OverTime                : Factor w/ 2 levels "No","Yes": 2 1 2 2 1 1 2 1 1 1 ...
##  $ PercentSalaryHike       : num [1:1470] 11 23 15 11 12 13 20 22 21 13 ...
##  $ PerformanceRating       : Factor w/ 2 levels "3","4": 1 2 1 1 1 1 2 2 2 1 ...
##  $ RelationshipSatisfaction: Factor w/ 4 levels "1","2","3","4": 1 4 2 3 4 3 1 2 2 2 ...
##  $ StockOptionLevel        : Factor w/ 4 levels "0","1","2","3": 1 2 1 1 2 1 4 2 1 3 ...
##  $ TotalWorkingYears       : num [1:1470] 8 10 7 8 6 8 12 1 10 17 ...
##  $ TrainingTimesLastYear   : num [1:1470] 0 3 3 3 3 2 3 2 2 3 ...
##  $ WorkLifeBalance         : Factor w/ 4 levels "1","2","3","4": 1 3 3 3 3 2 2 3 3 2 ...
##  $ YearsAtCompany          : num [1:1470] 6 10 0 8 2 7 1 1 9 7 ...
##  $ YearsInCurrentRole      : num [1:1470] 4 7 0 7 2 7 0 0 7 7 ...
##  $ YearsSinceLastPromotion : num [1:1470] 0 1 0 3 2 3 0 0 1 7 ...
##  $ YearsWithCurrManager    : num [1:1470] 5 7 0 0 2 6 0 0 8 7 ...

5 Numerical Data

We generated a summary of the numerical columns to gain an understanding of the data, and then used the ggplot2 library to visualise the distributions of the numerical values. From these visualisations we can identify whether each variable follows a normal distribution or is skewed to the left or right. For example, variables like “Monthly Income” or “Years at Company” may show skewness, indicating a small group with significantly higher values than the rest.

summary(df[numerical_cols])
##       Age          DailyRate      DistanceFromHome   HourlyRate    
##  Min.   :18.00   Min.   : 102.0   Min.   : 1.000   Min.   : 30.00  
##  1st Qu.:30.00   1st Qu.: 465.0   1st Qu.: 2.000   1st Qu.: 48.00  
##  Median :36.00   Median : 802.0   Median : 7.000   Median : 66.00  
##  Mean   :36.92   Mean   : 802.5   Mean   : 9.193   Mean   : 65.89  
##  3rd Qu.:43.00   3rd Qu.:1157.0   3rd Qu.:14.000   3rd Qu.: 83.75  
##  Max.   :60.00   Max.   :1499.0   Max.   :29.000   Max.   :100.00  
##  MonthlyIncome    MonthlyRate    NumCompaniesWorked PercentSalaryHike
##  Min.   : 1009   Min.   : 2094   Min.   :0.000      Min.   :11.00    
##  1st Qu.: 2911   1st Qu.: 8807   1st Qu.:1.000      1st Qu.:12.00    
##  Median : 4919   Median :14281   Median :2.000      Median :14.00    
##  Mean   : 6503   Mean   :14281   Mean   :2.693      Mean   :15.21    
##  3rd Qu.: 8379   3rd Qu.:19801   3rd Qu.:4.000      3rd Qu.:18.00    
##  Max.   :19999   Max.   :26999   Max.   :9.000      Max.   :25.00    
##  TotalWorkingYears TrainingTimesLastYear YearsAtCompany   YearsInCurrentRole
##  Min.   : 0.00     Min.   :0.000         Min.   : 0.000   Min.   : 0.000    
##  1st Qu.: 6.00     1st Qu.:2.000         1st Qu.: 3.000   1st Qu.: 2.000    
##  Median :10.00     Median :3.000         Median : 5.000   Median : 3.000    
##  Mean   :11.28     Mean   :2.799         Mean   : 7.008   Mean   : 4.229    
##  3rd Qu.:15.00     3rd Qu.:3.000         3rd Qu.: 9.000   3rd Qu.: 7.000    
##  Max.   :40.00     Max.   :6.000         Max.   :40.000   Max.   :18.000    
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 1.000          Median : 3.000      
##  Mean   : 2.188          Mean   : 4.123      
##  3rd Qu.: 3.000          3rd Qu.: 7.000      
##  Max.   :15.000          Max.   :17.000
# Numeric data
df_numeric <- df[sapply(df, is.numeric)]

# Plot histograms for numeric columns
numeric_long <- pivot_longer(as.data.frame(df_numeric), cols = everything(), names_to = "variable", values_to = "value")
ggplot(numeric_long, aes(x = value, fill = variable)) +
  geom_histogram(bins = 20, color = "white", show.legend = FALSE) +
  facet_wrap(~variable, scales = "free") +
  labs(title = "Histograms for Numeric Features", x = "Value", y = "Frequency") +
  theme_minimal()
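
As a quick numeric check of the skewness visible in the histograms, a simple moment-based estimate can be computed in base R (a sketch; positive values indicate a right skew, as expected for variables like MonthlyIncome and YearsAtCompany).

# Moment-based skewness estimate for each numerical column (sketch)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
sort(sapply(df[numerical_cols], skewness), decreasing = TRUE)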

6 Categorical Data

Bar plots provide a clear view of the frequency or proportion of categories within categorical variables, helping identify patterns and imbalances.

summary(df[categorical_cols])
##  Attrition            BusinessTravel                  Department  Education
##  No :1233   Non-Travel       : 150   Human Resources       : 63   1:170    
##  Yes: 237   Travel_Frequently: 277   Research & Development:961   2:282    
##             Travel_Rarely    :1043   Sales                 :446   3:572    
##                                                                   4:398    
##                                                                   5: 48    
##                                                                            
##                                                                            
##           EducationField EnvironmentSatisfaction    Gender    JobInvolvement
##  Human Resources : 27    1:284                   Female:588   1: 83         
##  Life Sciences   :606    2:287                   Male  :882   2:375         
##  Marketing       :159    3:453                                3:868         
##  Medical         :464    4:446                                4:144         
##  Other           : 82                                                       
##  Technical Degree:132                                                       
##                                                                             
##  JobLevel                      JobRole    JobSatisfaction  MaritalStatus
##  1:543    Sales Executive          :326   1:289           Divorced:327  
##  2:534    Research Scientist       :292   2:280           Married :673  
##  3:218    Laboratory Technician    :259   3:442           Single  :470  
##  4:106    Manufacturing Director   :145   4:459                         
##  5: 69    Healthcare Representative:131                                 
##           Manager                  :102                                 
##           (Other)                  :215                                 
##  OverTime   PerformanceRating RelationshipSatisfaction StockOptionLevel
##  No :1054   3:1244            1:276                    0:631           
##  Yes: 416   4: 226            2:303                    1:596           
##                               3:459                    2:158           
##                               4:432                    3: 85           
##                                                                        
##                                                                        
##                                                                        
##  WorkLifeBalance
##  1: 80          
##  2:344          
##  3:893          
##  4:153          
##                 
##                 
## 
# Factor (categorical) data
df_factor <- df[sapply(df, is.factor)]
# Plot bar plots for categorical columns
factor_long <- pivot_longer(as.data.frame(df_factor), cols = everything(), names_to = "variable", values_to = "value")
ggplot(factor_long, aes(x = value, fill = variable)) +
geom_bar(color = "white", show.legend = FALSE) +
facet_wrap(~variable, scales = "free") +
labs(title = "Bar Plots for Categorical Features", x = "Categories", y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

We also generated a correlation heatmap to show the relationships between variables. As our target variable for the regression problem is monthly income and for the classification problem is attrition, we have focused on these two variables.

After analysing the correlations of the different features with monthly income, it was noted that job level, total working years and years at company show a relatively high correlation with monthly income (above +0.5). Hence, these variables deserve attention as they have a strong relationship with the target variable.

Meanwhile, the correlations of the different features with attrition show that the overtime column has a positive correlation above 0.2 with attrition, suggesting overtime might be an important factor in employees deciding to leave.

# Convert factor columns to numeric codes (note: correlations computed on coded categories are rough approximations)
df_numeric <- df
df_numeric[sapply(df, is.factor)] <- lapply(df[sapply(df, is.factor)], as.numeric)

# Compute correlation matrix
cor_matrix <- cor(df_numeric)

# Reorder rows and columns so "MonthlyIncome" and "Attrition" appear as the last two rows/columns
order <- c(setdiff(rownames(cor_matrix), c("MonthlyIncome", "Attrition")), "MonthlyIncome", "Attrition")
cor_matrix <- cor_matrix[order, order]

# Melt the correlation matrix for ggplot
melted_cor <- melt(cor_matrix, varnames = c("Row", "Column"))

# Reverse the order of the rows for the y-axis
melted_cor$Row <- factor(melted_cor$Row, levels = rev(unique(melted_cor$Row)))

# Plot the heatmap with coefficients
ggplot(data = melted_cor, aes(x = Column, y = Row, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "blue", high = "red", mid = "white", midpoint = 0) +
geom_text(aes(label = round(value, 2)), color = "black", size = 3) +  # Add coefficients
labs(title = "Correlation Heatmap", x = "", y = "") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) # to make x axis values visible

# Plot correlation matrix of different features with monthly income

cor_monthly_income <- cor_matrix["MonthlyIncome", ]
cor_df <- data.frame(Feature = names(cor_monthly_income), Correlation = cor_monthly_income)
cor_df <- cor_df[cor_df$Feature != "MonthlyIncome", ]

ggplot(cor_df, aes(x = reorder(Feature, Correlation), y = Correlation)) +
geom_col() +
coord_flip() +
labs(title = "Correlation of Features with Monthly Income",
      x = "Feature",
      y = "Correlation")

# Plot correlation matrix of different features with attrition

cor_attrition <- cor_matrix["Attrition", ]
cor_df <- data.frame(Feature = names(cor_attrition), Correlation = cor_attrition)
cor_df <- cor_df[cor_df$Feature != "Attrition", ]

ggplot(cor_df, aes(x = reorder(Feature, Correlation), y = Correlation)) +
geom_col() +
coord_flip() +
labs(title = "Correlation of Features with Attrition",
      x = "Feature",
      y = "Correlation")

More than 50% of employees earn between 2000 and 6000 in monthly income. This indicates that the company’s workforce is skewed towards lower to mid-level income groups.

# Histogram for numeric variables
ggplot(df, aes(x = MonthlyIncome)) +
geom_histogram(bins = 30, fill = "skyblue", color = "black") +
scale_x_continuous(n.breaks = 10) +
labs(title = "Distribution of Monthly Income", x = "Monthly Income", y = "Frequency")

This boxplot shows that employees who leave tend to have lower monthly incomes.

# Boxplot for numeric variables
ggplot(df, aes(y = MonthlyIncome, x = Attrition)) +
geom_boxplot(fill = "lightgreen") +
labs(title = "Monthly Income by Attrition", x = "Attrition", y = "Monthly Income")

This graph shows that younger employees tend to leave, possibly in search of better opportunities.

# Density plot for numeric variables
ggplot(df, aes(x = Age, fill = Attrition)) +
geom_density(alpha = 0.5) +
labs(title = "Distribution of Age by Attrition", x = "Age", y = "Density")

The Sales department may require attention regarding employee retention strategies, as it has a higher proportion of attrition relative to its total size. R&D also shows attrition, but not at an alarming rate compared to its total workforce.

# Bar plot for categorical variables
ggplot(df, aes(x = Department, fill = Attrition)) +
geom_bar(position = "dodge")+
labs(title = "Employee Attrition by Department", x = "Department", y = "Count")

Employees with lower salaries and shorter tenures are more likely to leave. This suggests the need for early career engagement strategies.

# Scatter plot for two numeric variables
ggplot(df, aes(x = MonthlyIncome, y = YearsAtCompany, color = Attrition)) +
geom_point(alpha = 0.7) +
labs(title = "Monthly Income vs Years at Company", x = "Monthly Income", y = "Years at Company")

The majority of employees who travel rarely do not leave; travelling infrequently might positively influence their job satisfaction and work-life balance.

# Bar plot to visualize Attrition by Business Travel
ggplot(df, aes(x = BusinessTravel, fill = Attrition)) +
geom_bar(position = "dodge") +
labs(title = "Attrition by Business Travel",
      x = "Business Travel",
      y = "Count")

The Attrition column is imbalanced, as the count of ‘No’ is much larger than that of ‘Yes’.

# Plot barchart for count of attrition

ggplot(df, aes(x = factor(Attrition), fill = factor(Attrition))) +
geom_bar() +
labs(title = "Count of Attrition", x = "Attrition", y = "Count")

table(df$Attrition)
## 
##   No  Yes 
## 1233  237
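
Because of this imbalance, the classification models in Section 8 include an optional resampling step. As a standalone illustration, ROSE can balance the classes by generating synthetic examples; this is a sketch on the full data frame, whereas the modeling code applies resampling per fold to avoid leakage.

# Balance the Attrition classes with ROSE synthetic sampling (illustration only)
balanced <- ROSE(Attrition ~ ., data = as.data.frame(df), seed = 42)$data
table(balanced$Attrition)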

7 Chi-Square Test of Independence

This table presents the results of chi-square tests evaluating the association between each categorical variable and attrition. Features like ‘BusinessTravel’, ‘Department’ and ‘JobSatisfaction’ are marked as Dependent, indicating significant associations based on their low p-values (less than 0.05) and chi-square statistics exceeding the critical value. Meanwhile, features like ‘Gender’, ‘Education’ and ‘PerformanceRating’ are marked as Independent, suggesting no significant relationship with attrition. These results highlight the most influential factors for further analysis or modeling.
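
For a single variable, the test reduces to a contingency table passed to chisq.test; for instance, OverTime, one of the strongest associations in the table below:

# Chi-square test for a single variable: OverTime vs Attrition
chisq.test(table(df$OverTime, df$Attrition))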

cols_to_drop <- c(numerical_cols, "Attrition")
a <- df %>% select(-all_of(cols_to_drop))  # Factor columns
b <- df$Attrition  # Target column

# Perform Chi-Square test for each factor column
results <- lapply(a, function(column) {
  contingency_table <- table(column, b)  # Create a contingency table
  chi_result <- chisq.test(contingency_table)  # Perform Chi-Square test

  # Extract results
  list(
    statistic = chi_result$statistic,
    p_value = chi_result$p.value,
    df = chi_result$parameter,
    critical_value = qchisq(0.95, df = chi_result$parameter),
    hypothesis = ifelse(chi_result$p.value < 0.05, "Dependent", "Independent")
  )
})
## Warning in chisq.test(contingency_table): Chi-squared approximation may be
## incorrect
# Convert results to a DataFrame
results_df <- data.frame(
  Feature = names(a),
  Chi_Square_Statistic = sapply(results, `[[`, "statistic"),
  P_Value = sapply(results, `[[`, "p_value"),
  Degrees_of_Freedom = sapply(results, `[[`, "df"),
  Critical_Value_5_Percent = sapply(results, `[[`, "critical_value"),
  Hypothesis = sapply(results, `[[`, "hypothesis")
)

# Print the results
results_df
##                                                     Feature
## BusinessTravel.X-squared                     BusinessTravel
## Department.X-squared                             Department
## Education.X-squared                               Education
## EducationField.X-squared                     EducationField
## EnvironmentSatisfaction.X-squared   EnvironmentSatisfaction
## Gender.X-squared                                     Gender
## JobInvolvement.X-squared                     JobInvolvement
## JobLevel.X-squared                                 JobLevel
## JobRole.X-squared                                   JobRole
## JobSatisfaction.X-squared                   JobSatisfaction
## MaritalStatus.X-squared                       MaritalStatus
## OverTime.X-squared                                 OverTime
## PerformanceRating.X-squared               PerformanceRating
## RelationshipSatisfaction.X-squared RelationshipSatisfaction
## StockOptionLevel.X-squared                 StockOptionLevel
## WorkLifeBalance.X-squared                   WorkLifeBalance
##                                    Chi_Square_Statistic      P_Value
## BusinessTravel.X-squared                   2.418241e+01 5.608614e-06
## Department.X-squared                       1.079601e+01 4.525607e-03
## Education.X-squared                        3.073961e+00 5.455253e-01
## EducationField.X-squared                   1.602467e+01 6.773980e-03
## EnvironmentSatisfaction.X-squared          2.250388e+01 5.123469e-05
## Gender.X-squared                           1.116967e+00 2.905724e-01
## JobInvolvement.X-squared                   2.849202e+01 2.863181e-06
## JobLevel.X-squared                         7.252901e+01 6.634685e-15
## JobRole.X-squared                          8.619025e+01 2.752482e-15
## JobSatisfaction.X-squared                  1.750508e+01 5.563005e-04
## MaritalStatus.X-squared                    4.616368e+01 9.455511e-11
## OverTime.X-squared                         8.756429e+01 8.158424e-21
## PerformanceRating.X-squared                1.547544e-04 9.900745e-01
## RelationshipSatisfaction.X-squared         5.241068e+00 1.549724e-01
## StockOptionLevel.X-squared                 6.059830e+01 4.379390e-13
## WorkLifeBalance.X-squared                  1.632510e+01 9.725699e-04
##                                    Degrees_of_Freedom Critical_Value_5_Percent
## BusinessTravel.X-squared                            2                 5.991465
## Department.X-squared                                2                 5.991465
## Education.X-squared                                 4                 9.487729
## EducationField.X-squared                            5                11.070498
## EnvironmentSatisfaction.X-squared                   3                 7.814728
## Gender.X-squared                                    1                 3.841459
## JobInvolvement.X-squared                            3                 7.814728
## JobLevel.X-squared                                  4                 9.487729
## JobRole.X-squared                                   8                15.507313
## JobSatisfaction.X-squared                           3                 7.814728
## MaritalStatus.X-squared                             2                 5.991465
## OverTime.X-squared                                  1                 3.841459
## PerformanceRating.X-squared                         1                 3.841459
## RelationshipSatisfaction.X-squared                  3                 7.814728
## StockOptionLevel.X-squared                          3                 7.814728
## WorkLifeBalance.X-squared                           3                 7.814728
##                                     Hypothesis
## BusinessTravel.X-squared             Dependent
## Department.X-squared                 Dependent
## Education.X-squared                Independent
## EducationField.X-squared             Dependent
## EnvironmentSatisfaction.X-squared    Dependent
## Gender.X-squared                   Independent
## JobInvolvement.X-squared             Dependent
## JobLevel.X-squared                   Dependent
## JobRole.X-squared                    Dependent
## JobSatisfaction.X-squared            Dependent
## MaritalStatus.X-squared              Dependent
## OverTime.X-squared                   Dependent
## PerformanceRating.X-squared        Independent
## RelationshipSatisfaction.X-squared Independent
## StockOptionLevel.X-squared           Dependent
## WorkLifeBalance.X-squared            Dependent

8 Modeling

First, we train regression models with k-fold cross-validation, evaluate their performance, and select the best model using the R-squared metric. Then, we train classification models with cross-validation and an optional oversampling step to handle class imbalance. Once the best model is chosen, a classification report is generated to evaluate its performance.

options(repr.plot.width = 6, repr.plot.height = 6)
model_training_regression <- function(X, y, model, k = 5) {
  # Combine predictors (X) and target (y) into a single data frame
  data <- cbind(X, target = y)

  # K-Fold Cross Validation
  set.seed(42)
  folds <- createFolds(y, k = k, list = TRUE, returnTrain = TRUE)

  # Initialize variables to store results
  fold_metrics <- data.frame(Fold = integer(), MAE = numeric(), RMSE = numeric(), R_Squared = numeric())
  models <- list()

  # Loop through folds
  for (fold in seq_along(folds)) {
    train_idx <- folds[[fold]]
    test_idx <- setdiff(seq_len(nrow(data)), train_idx)

    X_train <- X[train_idx, ]
    X_test <- X[test_idx, ]
    y_train <- y[train_idx]
    y_test <- y[test_idx]

    # Train the model
    model_fit <- train(X_train, y_train, method = model)
    models[[fold]] <- model_fit

    # Predict and evaluate
    y_pred <- predict(model_fit, X_test)

    # Compute Metrics
    mae <- mean(abs(y_test - y_pred))
    rmse <- sqrt(mean((y_test - y_pred)^2))
    ss_total <- sum((y_test - mean(y_test))^2)
    ss_residual <- sum((y_test - y_pred)^2)
    r_squared <- 1 - (ss_residual / ss_total)

    # Save metrics
    fold_metrics <- rbind(fold_metrics, data.frame(Fold = fold, MAE = mae, RMSE = rmse, R_Squared = r_squared))

    cat(sprintf("Fold %d: MSE = %.2f, RMSE = %.2f, R-Squared = %.2f\n", fold, mae, rmse, r_squared))
  }
  # Calculate and print average metrics
  avg_metrics <- colMeans(fold_metrics[, -1])  # Exclude 'Fold' column
  cat(sprintf("\nAverage Metrics Across Folds: MAE = %.2f, RMSE = %.2f, R-Squared = %.2f\n",
              avg_metrics["MAE"], avg_metrics["RMSE"], avg_metrics["R_Squared"]))

  # Select the best model based on R-Squared
  best_fold <- which.max(fold_metrics$R_Squared)
  best_model <- models[[best_fold]]

  cat(sprintf("\nBest Model is from Fold %d with R-Squared = %.2f\n", best_fold, fold_metrics$R_Squared[best_fold]))

  # Plot Predictions vs Actual for the Best Model
  best_test_idx <- setdiff(seq_len(nrow(data)), folds[[best_fold]])
  X_best_test <- X[best_test_idx, ]
  y_best_test <- y[best_test_idx]
  y_best_pred <- predict(best_model, X_best_test)

  print(
    ggplot(data = data.frame(Actual = y_best_test, Predicted = y_best_pred), aes(x = Actual, y = Predicted)) +
      geom_point(color = "blue") +
      geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
      labs(title = "Predicted vs Actual (Best Model)", x = "Actual", y = "Predicted") +
      theme_minimal()
  )

  # Return best model and fold metrics
  list(BestModel = best_model, FoldMetrics = fold_metrics)
}
model_training <- function(X, y, model, sampler = NULL) {
  # Combine predictors (X) and target (y) into a single data frame
  data <- cbind(X, target = y)

  # Stratified K-Fold Cross Validation
  set.seed(42)
  folds <- createFolds(y, k = 5, list = TRUE, returnTrain = TRUE)

  # Initialize variables to store results
  fold_accuracies <- numeric()
  models <- list()
  confusion_matrices <- list()
  predictions <- list()
  true_labels <- list()

  # Loop through folds
  for (fold in seq_along(folds)) {
    train_idx <- folds[[fold]]
    test_idx <- setdiff(seq_len(nrow(data)), train_idx)

    X_train <- X[train_idx, ]
    X_test <- X[test_idx, ]
    y_train <- y[train_idx]
    y_test <- y[test_idx]

    # Apply resampling if specified
    if (!is.null(sampler)) {
      if (sampler == "SMOTE") {
        cat("Applying SMOTE...\n")
        smote_data <- ROSE(target ~ ., data = data.frame(X_train, target = y_train),
                           seed = 42)$data
        X_train <- smote_data[, -ncol(smote_data)]  # Drop target column
        y_train <- smote_data$target
      } else if (sampler == "RandomOversampling") {
        cat("Applying Random Oversampling...\n")
        rose_data <- ROSE(target ~ ., data = data.frame(X_train, target = y_train), seed = 42)$data
        X_train <- rose_data[, -ncol(rose_data)]  # Drop target column
        y_train <- rose_data$target
      }
    }

    # Train the model
    model_fit <- train(X_train, y_train, method = model)
    models[[fold]] <- model_fit

    # Predict and evaluate
    y_pred <- predict(model_fit, X_test)
    accuracy <- Accuracy(y_pred, y_test)
    fold_accuracies <- c(fold_accuracies, accuracy)

      # Save predictions and true labels
    predictions[[fold]] <- y_pred
    true_labels[[fold]] <- y_test

    # Save confusion matrix
    confusion_matrices[[fold]] <- confusionMatrix(y_pred, y_test)  # predictions first, then reference
    cat(sprintf("Fold %d: Accuracy = %.2f\n", fold, accuracy))
  }

  # Average accuracy across folds
  avg_accuracy <- mean(fold_accuracies)
  cat(sprintf("Average Accuracy: %.2f\n", avg_accuracy))

  # Identify the best model
  best_fold <- which.max(fold_accuracies)
  best_model <- models[[best_fold]]
  cat(sprintf("\nBest Model is from Fold %d with Accuracy = %.2f\n", best_fold, fold_accuracies[best_fold]))

  # Display the confusion matrix and classification report for the best model
  print(confusion_matrices[[best_fold]])
  list(
    BestModel = best_model,
    BestPredictions = predictions[[best_fold]],
    BestTrueLabels = true_labels[[best_fold]]
  )
}


generate_classification_report <- function(y_true, y_pred) {
  # Ensure inputs are factors
  y_true <- as.factor(y_true)
  y_pred <- as.factor(y_pred)

  # Get the levels of the true labels
  classes <- levels(y_true)

  # Initialize the report data frame
  report <- data.frame(
    Class = character(),
    Precision = numeric(),
    Recall = numeric(),
    F1_Score = numeric(),
    Support = integer()
  )

  # Calculate metrics for each class
  for (class in classes) {
    precision <- Precision(y_true, y_pred, positive = class)
    recall <- Recall(y_true, y_pred, positive = class)
    f1_score <- F1_Score(y_true, y_pred, positive = class)
    support <- sum(y_true == class)

    report <- rbind(report, data.frame(
      Class = class,
      Precision = precision,
      Recall = recall,
      F1_Score = f1_score,
      Support = support
    ))
  }

  # Add averages (macro-average)
  macro_precision <- mean(report$Precision, na.rm = TRUE)
  macro_recall <- mean(report$Recall, na.rm = TRUE)
  macro_f1 <- mean(report$F1_Score, na.rm = TRUE)
  report <- rbind(report, data.frame(
    Class = "Macro Average",
    Precision = macro_precision,
    Recall = macro_recall,
    F1_Score = macro_f1,
    Support = NA
  ))

  # Add weighted averages (weighted by support)
  total_support <- sum(report$Support, na.rm = TRUE)
  weighted_precision <- sum(report$Precision * report$Support, na.rm = TRUE) / total_support
  weighted_recall <- sum(report$Recall * report$Support, na.rm = TRUE) / total_support
  weighted_f1 <- sum(report$F1_Score * report$Support, na.rm = TRUE) / total_support
  report <- rbind(report, data.frame(
    Class = "Weighted Average",
    Precision = weighted_precision,
    Recall = weighted_recall,
    F1_Score = weighted_f1,
    Support = total_support
  ))

  return(report)
}
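
Below is a sketch of how these helpers are invoked for the classification task, shown commented out; "rf", X_cls and y_cls are illustrative assumptions, and the actual calls appear in the sections that follow.

# Example usage (sketch): predict Attrition with random forest and oversampling
# X_cls <- df %>% select(-Attrition)
# y_cls <- df$Attrition
# cls <- model_training(X_cls, y_cls, model = "rf", sampler = "RandomOversampling")
# generate_classification_report(cls$BestTrueLabels, cls$BestPredictions)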

9 Monthly Income Prediction (Regression)

In this part, we used four different models to predict monthly income from the other variables: linear regression, a generalized linear model (glm), a support vector machine (SVM) and a random forest. The models are evaluated on the R-squared metric to select the best one. Linear regression and the glm give identical performance (R-squared of 0.95) because caret’s ‘glm’ defaults to the Gaussian family, making it equivalent to ordinary least squares; the random forest follows with 0.94 and the SVM with 0.93. The linear regression model shows that higher job levels are directly related to higher monthly income.

df_final <- as.data.frame(df)
X <- df_final %>% select(-MonthlyIncome)
num_cols <- sapply(X, is.numeric)
# Min-Max scale numerical columns to [0, 1], i.e., (x - min(x)) / (max(x) - min(x))
X[, num_cols] <- lapply(X[, num_cols], rescale, to = c(0, 1))
y <- df_final$MonthlyIncome

Linear Regression

# Train the model
results <- model_training_regression(X, y, model = "lm")
## Fold 1: MAE = 828.91, RMSE = 1082.25, R-Squared = 0.95
## Fold 2: MAE = 870.33, RMSE = 1116.75, R-Squared = 0.95
## Fold 3: MAE = 845.95, RMSE = 1106.10, R-Squared = 0.94
## Fold 4: MAE = 768.92, RMSE = 1045.57, R-Squared = 0.95
## Fold 5: MAE = 800.64, RMSE = 1056.02, R-Squared = 0.95
## 
## Average Metrics Across Folds: MAE = 822.95, RMSE = 1081.34, R-Squared = 0.95
## 
## Best Model is from Fold 4 with R-Squared = 0.95

# View the fold metrics
print(results$FoldMetrics)
##   Fold      MAE     RMSE R_Squared
## 1    1 828.9101 1082.247 0.9485850
## 2    2 870.3345 1116.750 0.9491387
## 3    3 845.9539 1106.101 0.9386794
## 4    4 768.9196 1045.574 0.9500628
## 5    5 800.6386 1056.016 0.9482275
# Access the best model
best_model <- results$BestModel
model_summary <- summary(best_model)
# Extract coefficients table and convert it to a data frame
coefficients_df <- as.data.frame(model_summary$coefficients)

# Add column names for clarity
colnames(coefficients_df) <- c("Estimate", "Std.Error", "t.value", "P.Value")

# Sort by p-value
sorted_coefficients <- coefficients_df[order(coefficients_df$P.Value), ]
sorted_coefficients
##                                      Estimate Std.Error     t.value
## JobLevel5                        11128.553100 305.96735 36.37170178
## JobLevel4                         8449.651934 262.28936 32.21500040
## JobLevel3                         4877.776167 173.62223 28.09419127
## JobRoleResearch Director          3396.313228 202.71578 16.75406457
## JobRoleManager                    3405.927580 247.97128 13.73516961
## JobLevel2                         1599.175704 126.17838 12.67392829
## JobRoleLaboratory Technician     -1266.673428 160.98849 -7.86809950
## JobRoleResearch Scientist        -1216.878099 161.38288 -7.54031729
## (Intercept)                       3692.168968 585.47547  6.30627440
## JobRoleSales Representative      -1221.448955 333.73051 -3.65998594
## TotalWorkingYears                 1300.374016 367.11759  3.54211853
## JobInvolvement3                   -363.622583 138.73212 -2.62104099
## JobInvolvement4                   -364.940933 168.28534 -2.16858424
## NumCompaniesWorked                 267.058359 129.66398  2.05961869
## JobInvolvement2                   -295.845662 145.88960 -2.02787360
## Education5                        -331.801800 211.86178 -1.56612390
## StockOptionLevel1                  169.394274 111.44450  1.51998778
## Education3                        -151.800846 107.93985 -1.40634663
## JobSatisfaction2                  -141.038264 100.68643 -1.40076739
## Education2                        -150.754922 120.54784 -1.25058169
## DepartmentResearch & Development   551.353643 453.50926  1.21574948
## AttritionYes                      -120.712163 101.21726 -1.19260455
## YearsInCurrentRole                 310.864297 264.72191  1.17430512
## JobSatisfaction3                  -100.382876  91.38259 -1.09849015
## GenderMale                          63.139388  65.55567  0.96314149
## OverTimeYes                         68.411639  72.44940  0.94426791
## PerformanceRating4                -128.060348 135.99746 -0.94163780
## JobRoleSales Executive             274.291916 293.62630  0.93415310
## Age                               -189.363032 207.14438 -0.91415965
## YearsSinceLastPromotion            172.839726 192.77058  0.89660842
## PercentSalaryHike                  168.532776 191.21627  0.88137255
## BusinessTravelTravel_Rarely         91.409246 105.30608  0.86803387
## EducationFieldMedical             -262.353335 308.23376 -0.85115055
## RelationshipSatisfaction2           85.137088 100.54232  0.84677866
## WorkLifeBalance4                   139.557308 167.60848  0.83263871
## YearsWithCurrManager              -211.210099 266.40319 -0.79282121
## BusinessTravelTravel_Frequently     92.981542 123.17768  0.75485709
## EducationFieldLife Sciences       -223.608795 308.54154 -0.72472833
## EducationFieldMarketing           -232.839766 327.56029 -0.71083026
## TrainingTimesLastYear             -101.781937 148.29514 -0.68634708
## EducationFieldOther               -224.944912 330.68697 -0.68023519
## EducationFieldTechnical Degree    -202.511764 319.61746 -0.63360671
## JobRoleHuman Resources            -278.771392 474.99356 -0.58689510
## WorkLifeBalance3                    83.592331 143.80397  0.58129362
## JobSatisfaction4                   -52.738468  91.50354 -0.57635444
## MaritalStatusSingle                 76.193547 137.79456  0.55295033
## EnvironmentSatisfaction2           -52.280042 101.98499 -0.51262488
## DepartmentSales                    232.652083 461.82531  0.50376642
## WorkLifeBalance2                    74.649952 152.98642  0.48795150
## JobRoleManufacturing Director      -70.873720 149.73308 -0.47333376
## MaritalStatusMarried                37.796013  84.42156  0.44770568
## DistanceFromHome                   -46.711766 108.73972 -0.42957411
## EnvironmentSatisfaction4           -38.608441  92.80066 -0.41603625
## DailyRate                           45.235325 110.07200  0.41096122
## HourlyRate                          44.388598 109.39371  0.40576921
## RelationshipSatisfaction3           15.371131  92.67310  0.16586400
## MonthlyRate                        -16.768151 116.25081 -0.14424115
## EnvironmentSatisfaction3            -9.737652  92.23749 -0.10557152
## RelationshipSatisfaction4            7.705286  94.47260  0.08156107
## YearsAtCompany                      35.760154 475.69156  0.07517509
## StockOptionLevel3                  -11.093335 168.41468 -0.06586917
## StockOptionLevel2                    8.134095 141.69522  0.05740557
## Education4                          -5.505794 115.27068 -0.04776405
##                                        P.Value
## JobLevel5                        1.619338e-191
## JobLevel4                        2.081430e-161
## JobLevel3                        1.064009e-131
## JobRoleResearch Director          2.342074e-56
## JobRoleManager                    8.981680e-40
## JobLevel2                         1.753125e-34
## JobRoleLaboratory Technician      8.484063e-15
## JobRoleResearch Scientist         9.699518e-14
## (Intercept)                       4.114887e-10
## JobRoleSales Representative       2.640573e-04
## TotalWorkingYears                 4.133597e-04
## JobInvolvement3                   8.885654e-03
## JobInvolvement4                   3.032553e-02
## NumCompaniesWorked                3.966716e-02
## JobInvolvement2                   4.281099e-02
## Education5                        1.176039e-01
## StockOptionLevel1                 1.287981e-01
## Education3                        1.599003e-01
## JobSatisfaction2                  1.615624e-01
## Education2                        2.113500e-01
## DepartmentResearch & Development  2.243382e-01
## AttritionYes                      2.332785e-01
## YearsInCurrentRole                2.405240e-01
## JobSatisfaction3                  2.722281e-01
## GenderMale                        3.356857e-01
## OverTimeYes                       3.452377e-01
## PerformanceRating4                3.465824e-01
## JobRoleSales Executive            3.504275e-01
## Age                               3.608310e-01
## YearsSinceLastPromotion           3.701218e-01
## PercentSalaryHike                 3.783065e-01
## BusinessTravelTravel_Rarely       3.855629e-01
## EducationFieldMedical             3.948688e-01
## RelationshipSatisfaction2         3.973005e-01
## WorkLifeBalance4                  4.052271e-01
## YearsWithCurrManager              4.280510e-01
## BusinessTravelTravel_Frequently   4.504944e-01
## EducationFieldLife Sciences       4.687710e-01
## EducationFieldMarketing           4.773384e-01
## TrainingTimesLastYear             4.926372e-01
## EducationFieldOther               4.964970e-01
## EducationFieldTechnical Degree    5.264677e-01
## JobRoleHuman Resources            5.573932e-01
## WorkLifeBalance3                  5.611603e-01
## JobSatisfaction4                  5.644922e-01
## MaritalStatusSingle               5.804085e-01
## EnvironmentSatisfaction2          6.083155e-01
## DepartmentSales                   6.145252e-01
## WorkLifeBalance2                  6.256803e-01
## JobRoleManufacturing Director     6.360679e-01
## MaritalStatusMarried              6.544527e-01
## DistanceFromHome                  6.675886e-01
## EnvironmentSatisfaction4          6.774637e-01
## DailyRate                         6.811801e-01
## HourlyRate                        6.849902e-01
## RelationshipSatisfaction3         8.682941e-01
## MonthlyRate                       8.853362e-01
## EnvironmentSatisfaction3          9.159413e-01
## RelationshipSatisfaction4         9.350104e-01
## YearsAtCompany                    9.400889e-01
## StockOptionLevel3                 9.474938e-01
## StockOptionLevel2                 9.542324e-01
## Education4                        9.619129e-01
# Perform linear regression modelling
model <- lm(y ~ ., data = X)

# Summarize the model
summary(model)
## 
## Call:
## lm(formula = y ~ ., data = X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3222.6  -660.7   -54.9   620.1  4483.7 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                       3585.679    493.757   7.262 6.29e-13 ***
## Age                               -128.846    184.795  -0.697  0.48577    
## AttritionYes                      -113.667     89.442  -1.271  0.20400    
## BusinessTravelTravel_Frequently    132.955    109.808   1.211  0.22618    
## BusinessTravelTravel_Rarely        126.625     94.394   1.341  0.17999    
## DailyRate                           95.528     97.747   0.977  0.32859    
## DepartmentResearch & Development   331.253    387.016   0.856  0.39219    
## DepartmentSales                    101.324    401.361   0.252  0.80073    
## DistanceFromHome                   -63.680     97.677  -0.652  0.51454    
## Education2                        -191.251    105.982  -1.805  0.07136 .  
## Education3                        -134.821     95.213  -1.416  0.15700    
## Education4                         -31.364    101.967  -0.308  0.75844    
## Education5                        -262.173    177.985  -1.473  0.14097    
## EducationFieldLife Sciences       -147.111    277.291  -0.531  0.59583    
## EducationFieldMarketing            -79.744    295.410  -0.270  0.78724    
## EducationFieldMedical             -151.789    278.342  -0.545  0.58561    
## EducationFieldOther               -166.127    298.232  -0.557  0.57759    
## EducationFieldTechnical Degree     -86.024    289.431  -0.297  0.76635    
## EnvironmentSatisfaction2           -35.576     90.513  -0.393  0.69435    
## EnvironmentSatisfaction3           -30.856     82.169  -0.376  0.70733    
## EnvironmentSatisfaction4           -45.904     82.610  -0.556  0.57852    
## GenderMale                          82.946     57.425   1.444  0.14884    
## HourlyRate                          68.097     96.644   0.705  0.48117    
## JobInvolvement2                   -227.718    131.024  -1.738  0.08243 .  
## JobInvolvement3                   -348.415    124.408  -2.801  0.00517 ** 
## JobInvolvement4                   -316.115    149.621  -2.113  0.03480 *  
## JobLevel2                         1628.154    112.570  14.464  < 2e-16 ***
## JobLevel3                         4876.694    153.935  31.680  < 2e-16 ***
## JobLevel4                         8462.717    232.879  36.340  < 2e-16 ***
## JobLevel5                        11031.862    272.062  40.549  < 2e-16 ***
## JobRoleHuman Resources            -520.019    405.941  -1.281  0.20040    
## JobRoleLaboratory Technician     -1236.021    143.365  -8.622  < 2e-16 ***
## JobRoleManager                    3474.382    212.172  16.375  < 2e-16 ***
## JobRoleManufacturing Director      -68.293    129.234  -0.528  0.59728    
## JobRoleResearch Director          3406.925    182.622  18.656  < 2e-16 ***
## JobRoleResearch Scientist        -1168.127    143.144  -8.161 7.33e-16 ***
## JobRoleSales Executive             224.696    255.954   0.878  0.38016    
## JobRoleSales Representative      -1247.632    292.098  -4.271 2.07e-05 ***
## JobSatisfaction2                   -78.912     90.060  -0.876  0.38106    
## JobSatisfaction3                   -94.659     81.527  -1.161  0.24581    
## JobSatisfaction4                   -49.215     81.189  -0.606  0.54450    
## MaritalStatusMarried                64.713     76.061   0.851  0.39502    
## MaritalStatusSingle                121.552    123.560   0.984  0.32541    
## MonthlyRate                        -28.088    103.415  -0.272  0.78596    
## NumCompaniesWorked                 263.853    114.216   2.310  0.02103 *  
## OverTimeYes                         88.453     65.297   1.355  0.17575    
## PercentSalaryHike                  204.099    170.029   1.200  0.23020    
## PerformanceRating4                -162.435    122.408  -1.327  0.18473    
## RelationshipSatisfaction2           68.620     90.085   0.762  0.44635    
## RelationshipSatisfaction3           52.051     82.464   0.631  0.52801    
## RelationshipSatisfaction4           20.505     83.162   0.247  0.80528    
## StockOptionLevel1                  163.770     99.127   1.652  0.09873 .  
## StockOptionLevel2                    1.533    126.409   0.012  0.99033    
## StockOptionLevel3                  -69.960    149.662  -0.467  0.64025    
## TotalWorkingYears                 1422.871    322.413   4.413 1.10e-05 ***
## TrainingTimesLastYear              -43.775    131.373  -0.333  0.73903    
## WorkLifeBalance2                    79.179    134.219   0.590  0.55534    
## WorkLifeBalance3                    54.587    126.428   0.432  0.66598    
## WorkLifeBalance4                    46.607    148.693   0.313  0.75399    
## YearsAtCompany                    -115.719    397.733  -0.291  0.77114    
## YearsInCurrentRole                 308.803    232.981   1.325  0.18524    
## YearsSinceLastPromotion            143.594    170.292   0.843  0.39925    
## YearsWithCurrManager              -150.110    224.556  -0.668  0.50394    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1056 on 1407 degrees of freedom
## Multiple R-squared:  0.9518, Adjusted R-squared:  0.9497 
## F-statistic: 448.1 on 62 and 1407 DF,  p-value: < 2.2e-16
# Test the accuracy of the model

# Predict monthly income using the model. Note: the model was fit on the
# min-max scaled predictors in X, so predicting on the unscaled df feeds the
# model out-of-range inputs and inflates the RMSE reported below;
# predict(model, newdata = X) would evaluate on the training-scale data.
predictions <- predict(model, newdata = df)

# Calculate the Root Mean Squared Error (RMSE)
rmse <- sqrt(mean((df$MonthlyIncome - predictions)^2))

# Print the RMSE
cat("RMSE:", rmse, "\n")
## RMSE: 362994.7
# Calculate the R-squared value
r_squared <- summary(model)$r.squared

# Print the R-squared value
cat("R-squared:", r_squared, "\n")
## R-squared: 0.9518001

Generalized Linear Model (glm)

# Train the model
results <- model_training_regression(X, y, model = "glm")
## Fold 1: MAE = 828.91, RMSE = 1082.25, R-Squared = 0.95
## Fold 2: MAE = 870.33, RMSE = 1116.75, R-Squared = 0.95
## Fold 3: MAE = 845.95, RMSE = 1106.10, R-Squared = 0.94
## Fold 4: MAE = 768.92, RMSE = 1045.57, R-Squared = 0.95
## Fold 5: MAE = 800.64, RMSE = 1056.02, R-Squared = 0.95
## 
## Average Metrics Across Folds: MAE = 822.95, RMSE = 1081.34, R-Squared = 0.95
## 
## Best Model is from Fold 4 with R-Squared = 0.95

# View the fold metrics
print(results$FoldMetrics)
##   Fold      MAE     RMSE R_Squared
## 1    1 828.9101 1082.247 0.9485850
## 2    2 870.3345 1116.750 0.9491387
## 3    3 845.9539 1106.101 0.9386794
## 4    4 768.9196 1045.574 0.9500628
## 5    5 800.6386 1056.016 0.9482275
# Access the best model
best_model <- results$BestModel
model_summary <- summary(best_model)
# Extract coefficients table and convert it to a data frame
coefficients_df <- as.data.frame(model_summary$coefficients)

# Add column names for clarity
colnames(coefficients_df) <- c("Estimate", "Std.Error", "t.value", "P.Value")

# Sort by p-value
sorted_coefficients <- coefficients_df[order(coefficients_df$P.Value), ]
sorted_coefficients
##                                      Estimate Std.Error     t.value
## JobLevel5                        11128.553100 305.96735 36.37170178
## JobLevel4                         8449.651934 262.28936 32.21500040
## JobLevel3                         4877.776167 173.62223 28.09419127
## JobRoleResearch Director          3396.313228 202.71578 16.75406457
## JobRoleManager                    3405.927580 247.97128 13.73516961
## JobLevel2                         1599.175704 126.17838 12.67392829
## JobRoleLaboratory Technician     -1266.673428 160.98849 -7.86809950
## JobRoleResearch Scientist        -1216.878099 161.38288 -7.54031729
## (Intercept)                       3692.168968 585.47547  6.30627440
## JobRoleSales Representative      -1221.448955 333.73051 -3.65998594
## TotalWorkingYears                 1300.374016 367.11759  3.54211853
## JobInvolvement3                   -363.622583 138.73212 -2.62104099
## JobInvolvement4                   -364.940933 168.28534 -2.16858424
## NumCompaniesWorked                 267.058359 129.66398  2.05961869
## JobInvolvement2                   -295.845662 145.88960 -2.02787360
## Education5                        -331.801800 211.86178 -1.56612390
## StockOptionLevel1                  169.394274 111.44450  1.51998778
## Education3                        -151.800846 107.93985 -1.40634663
## JobSatisfaction2                  -141.038264 100.68643 -1.40076739
## Education2                        -150.754922 120.54784 -1.25058169
## DepartmentResearch & Development   551.353643 453.50926  1.21574948
## AttritionYes                      -120.712163 101.21726 -1.19260455
## YearsInCurrentRole                 310.864297 264.72191  1.17430512
## JobSatisfaction3                  -100.382876  91.38259 -1.09849015
## GenderMale                          63.139388  65.55567  0.96314149
## OverTimeYes                         68.411639  72.44940  0.94426791
## PerformanceRating4                -128.060348 135.99746 -0.94163780
## JobRoleSales Executive             274.291916 293.62630  0.93415310
## Age                               -189.363032 207.14438 -0.91415965
## YearsSinceLastPromotion            172.839726 192.77058  0.89660842
## PercentSalaryHike                  168.532776 191.21627  0.88137255
## BusinessTravelTravel_Rarely         91.409246 105.30608  0.86803387
## EducationFieldMedical             -262.353335 308.23376 -0.85115055
## RelationshipSatisfaction2           85.137088 100.54232  0.84677866
## WorkLifeBalance4                   139.557308 167.60848  0.83263871
## YearsWithCurrManager              -211.210099 266.40319 -0.79282121
## BusinessTravelTravel_Frequently     92.981542 123.17768  0.75485709
## EducationFieldLife Sciences       -223.608795 308.54154 -0.72472833
## EducationFieldMarketing           -232.839766 327.56029 -0.71083026
## TrainingTimesLastYear             -101.781937 148.29514 -0.68634708
## EducationFieldOther               -224.944912 330.68697 -0.68023519
## EducationFieldTechnical Degree    -202.511764 319.61746 -0.63360671
## JobRoleHuman Resources            -278.771392 474.99356 -0.58689510
## WorkLifeBalance3                    83.592331 143.80397  0.58129362
## JobSatisfaction4                   -52.738468  91.50354 -0.57635444
## MaritalStatusSingle                 76.193547 137.79456  0.55295033
## EnvironmentSatisfaction2           -52.280042 101.98499 -0.51262488
## DepartmentSales                    232.652083 461.82531  0.50376642
## WorkLifeBalance2                    74.649952 152.98642  0.48795150
## JobRoleManufacturing Director      -70.873720 149.73308 -0.47333376
## MaritalStatusMarried                37.796013  84.42156  0.44770568
## DistanceFromHome                   -46.711766 108.73972 -0.42957411
## EnvironmentSatisfaction4           -38.608441  92.80066 -0.41603625
## DailyRate                           45.235325 110.07200  0.41096122
## HourlyRate                          44.388598 109.39371  0.40576921
## RelationshipSatisfaction3           15.371131  92.67310  0.16586400
## MonthlyRate                        -16.768151 116.25081 -0.14424115
## EnvironmentSatisfaction3            -9.737652  92.23749 -0.10557152
## RelationshipSatisfaction4            7.705286  94.47260  0.08156107
## YearsAtCompany                      35.760154 475.69156  0.07517509
## StockOptionLevel3                  -11.093335 168.41468 -0.06586917
## StockOptionLevel2                    8.134095 141.69522  0.05740557
## Education4                          -5.505794 115.27068 -0.04776405
##                                        P.Value
## JobLevel5                        1.619338e-191
## JobLevel4                        2.081430e-161
## JobLevel3                        1.064009e-131
## JobRoleResearch Director          2.342074e-56
## JobRoleManager                    8.981680e-40
## JobLevel2                         1.753125e-34
## JobRoleLaboratory Technician      8.484063e-15
## JobRoleResearch Scientist         9.699518e-14
## (Intercept)                       4.114887e-10
## JobRoleSales Representative       2.640573e-04
## TotalWorkingYears                 4.133597e-04
## JobInvolvement3                   8.885654e-03
## JobInvolvement4                   3.032553e-02
## NumCompaniesWorked                3.966716e-02
## JobInvolvement2                   4.281099e-02
## Education5                        1.176039e-01
## StockOptionLevel1                 1.287981e-01
## Education3                        1.599003e-01
## JobSatisfaction2                  1.615624e-01
## Education2                        2.113500e-01
## DepartmentResearch & Development  2.243382e-01
## AttritionYes                      2.332785e-01
## YearsInCurrentRole                2.405240e-01
## JobSatisfaction3                  2.722281e-01
## GenderMale                        3.356857e-01
## OverTimeYes                       3.452377e-01
## PerformanceRating4                3.465824e-01
## JobRoleSales Executive            3.504275e-01
## Age                               3.608310e-01
## YearsSinceLastPromotion           3.701218e-01
## PercentSalaryHike                 3.783065e-01
## BusinessTravelTravel_Rarely       3.855629e-01
## EducationFieldMedical             3.948688e-01
## RelationshipSatisfaction2         3.973005e-01
## WorkLifeBalance4                  4.052271e-01
## YearsWithCurrManager              4.280510e-01
## BusinessTravelTravel_Frequently   4.504944e-01
## EducationFieldLife Sciences       4.687710e-01
## EducationFieldMarketing           4.773384e-01
## TrainingTimesLastYear             4.926372e-01
## EducationFieldOther               4.964970e-01
## EducationFieldTechnical Degree    5.264677e-01
## JobRoleHuman Resources            5.573932e-01
## WorkLifeBalance3                  5.611603e-01
## JobSatisfaction4                  5.644922e-01
## MaritalStatusSingle               5.804085e-01
## EnvironmentSatisfaction2          6.083155e-01
## DepartmentSales                   6.145252e-01
## WorkLifeBalance2                  6.256803e-01
## JobRoleManufacturing Director     6.360679e-01
## MaritalStatusMarried              6.544527e-01
## DistanceFromHome                  6.675886e-01
## EnvironmentSatisfaction4          6.774637e-01
## DailyRate                         6.811801e-01
## HourlyRate                        6.849902e-01
## RelationshipSatisfaction3         8.682941e-01
## MonthlyRate                       8.853362e-01
## EnvironmentSatisfaction3          9.159413e-01
## RelationshipSatisfaction4         9.350104e-01
## YearsAtCompany                    9.400889e-01
## StockOptionLevel3                 9.474938e-01
## StockOptionLevel2                 9.542324e-01
## Education4                        9.619129e-01

Support Vector Machine

# One-hot encode categorical predictors so the SVM works on numeric inputs
dummy_model <- dummyVars("~ .", data = X)
X_numeric <- predict(dummy_model, newdata = X)
X_numeric <- as.data.frame(X_numeric)
# Train the model
results <- model_training_regression(X_numeric, y, model = "svmRadial")
## Fold 1: MAE = 959.78, RMSE = 1245.38, R-Squared = 0.93
## Fold 2: MAE = 958.74, RMSE = 1256.37, R-Squared = 0.94
## Fold 3: MAE = 932.48, RMSE = 1204.21, R-Squared = 0.93
## Fold 4: MAE = 903.54, RMSE = 1188.87, R-Squared = 0.94
## Fold 5: MAE = 940.82, RMSE = 1201.95, R-Squared = 0.93
## 
## Average Metrics Across Folds: MAE = 939.07, RMSE = 1219.36, R-Squared = 0.93
## 
## Best Model is from Fold 2 with R-Squared = 0.94

# View the fold metrics
print(results$FoldMetrics)
##   Fold      MAE     RMSE R_Squared
## 1    1 959.7766 1245.375 0.9319172
## 2    2 958.7353 1256.369 0.9356262
## 3    3 932.4791 1204.212 0.9273188
## 4    4 903.5439 1188.875 0.9354365
## 5    5 940.8169 1201.947 0.9329300
# Access the best model
best_model <- results$BestModel
# Get variable importance
options(repr.plot.width = 6, repr.plot.height = 15)
importance <- varImp(best_model, scale = FALSE)
print(importance)
## loess r-squared variable importance
## 
##   only 20 most important variables shown (out of 79)
## 
##                               Overall
## TotalWorkingYears             0.58817
## JobRole.Manager               0.39688
## JobLevel.1                    0.36809
## JobLevel.5                    0.34497
## JobLevel.4                    0.29949
## YearsAtCompany                0.27172
## Age                           0.23015
## JobRole.Research Director     0.21773
## YearsInCurrentRole            0.14339
## YearsSinceLastPromotion       0.13441
## JobRole.Research Scientist    0.12625
## YearsWithCurrManager          0.11669
## JobRole.Laboratory Technician 0.09895
## JobLevel.3                    0.08358
## JobRole.Sales Representative  0.03905
## Attrition.No                  0.02541
## Attrition.Yes                 0.02541
## JobLevel.2                    0.02248
## NumCompaniesWorked            0.01971
## StockOptionLevel.1            0.01360
# Plot variable importance
plot(importance, top = 20, cex = 0.8) 

Random Forest Regressor

# Step 1: Create an 80:20 train-test split
set.seed(42)
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X_numeric[train_idx, ]
X_test <- X_numeric[-train_idx, ]
y_train <- y[train_idx]
y_test <- y[-train_idx]

# Step 2: Train the model
model <- train(
  X_train, y_train,
  method = "rf",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "none")  # Single train-test split
)

# Step 3: Predict on the test set
y_pred <- predict(model, X_test)

# Step 4: Compute metrics
mae <- mean(abs(y_test - y_pred))  # Mean Absolute Error
rmse <- sqrt(mean((y_test - y_pred)^2))  # Root Mean Squared Error
ss_total <- sum((y_test - mean(y_test))^2)
ss_residual <- sum((y_test - y_pred)^2)
r_squared <- 1 - (ss_residual / ss_total)  # R-Squared

# Print metrics
cat(sprintf("Metrics on Test Set:\nMAE = %.2f\nRMSE = %.2f\nR-Squared = %.2f\n", mae, rmse, r_squared))
## Metrics on Test Set:
## MAE = 856.51
## RMSE = 1151.24
## R-Squared = 0.94
# Step 5: Plot Predictions vs Actual
ggplot(data = data.frame(Actual = y_test, Predicted = y_pred), aes(x = Actual, y = Predicted)) +
  geom_point(color = "blue") +
  geom_abline(slope = 1, intercept = 0, color = "red", linetype = "dashed") +
  labs(title = "Predicted vs Actual", x = "Actual", y = "Predicted") +
  theme_minimal()

# Get variable importance
options(repr.plot.width = 6, repr.plot.height = 15)
importance <- varImp(model, scale = FALSE)
print(importance)
## rf variable importance
## 
##   only 20 most important variables shown (out of 79)
## 
##                                 Overall
## TotalWorkingYears             8.076e+09
## JobLevel.1                    3.037e+09
## JobLevel.5                    2.873e+09
## JobLevel.3                    2.402e+09
## JobRole.Manager               2.181e+09
## JobLevel.4                    1.559e+09
## JobLevel.2                    1.437e+09
## JobRole.Research Director     1.422e+09
## YearsAtCompany                6.369e+08
## Age                           3.372e+08
## JobRole.Sales Executive       1.520e+08
## JobRole.Laboratory Technician 1.273e+08
## YearsInCurrentRole            1.260e+08
## JobRole.Research Scientist    1.132e+08
## DailyRate                     1.009e+08
## HourlyRate                    8.956e+07
## MonthlyRate                   8.865e+07
## YearsWithCurrManager          8.197e+07
## YearsSinceLastPromotion       8.138e+07
## DistanceFromHome              7.257e+07
# Plot variable importance
plot(importance)

10 Attrition Prediction

Logistic regression, SVM, and random forest models are used to predict attrition from the remaining variables. After comparison, logistic regression is selected as the best model because it achieves the highest accuracy (0.905) and the fewest misclassifications in the confusion matrix. Based on the model summary, working overtime is the primary factor contributing to attrition, followed by monthly income, total working years, number of companies worked, and distance from home.
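For a quick side-by-side view, the best reported accuracy of each classifier can be collected into a small comparison table. This is a summary sketch, not part of the modelling pipeline; the values are transcribed from the outputs in the subsections below:

# Accuracies transcribed from the model outputs reported below
comparison <- data.frame(
  Model = c("Logistic Regression", "SVM (Linear)", "SVM (Radial)", "Random Forest"),
  Accuracy = c(0.9051, 0.8915, 0.8881, 0.8635)
)
comparison[order(-comparison$Accuracy), ]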

Logistic Regression

X <- df_final %>% select(-Attrition)
num_cols <- sapply(X, is.numeric)
# Min-Max scale numerical columns
X[, num_cols] <- lapply(X[, num_cols], rescale, to = c(0, 1))
y <- df_final$Attrition
suppressWarnings({
  results <- model_training(X, y, model = "glm", sampler = NULL)
})
## Fold 1: Accuracy = 0.86
## Fold 2: Accuracy = 0.89
## Fold 3: Accuracy = 0.86
## Fold 4: Accuracy = 0.87
## Fold 5: Accuracy = 0.91
## Average Accuracy: 0.88
## 
## Best Model is from Fold 5 with Accuracy = 0.91
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  236  11
##        Yes  17  31
##                                          
##                Accuracy : 0.9051         
##                  95% CI : (0.8657, 0.936)
##     No Information Rate : 0.8576         
##     P-Value [Acc > NIR] : 0.009493       
##                                          
##                   Kappa : 0.6332         
##                                          
##  Mcnemar's Test P-Value : 0.344704       
##                                          
##             Sensitivity : 0.9328         
##             Specificity : 0.7381         
##          Pos Pred Value : 0.9555         
##          Neg Pred Value : 0.6458         
##              Prevalence : 0.8576         
##          Detection Rate : 0.8000         
##    Detection Prevalence : 0.8373         
##       Balanced Accuracy : 0.8355         
##                                          
##        'Positive' Class : No             
## 
# Access the best model
best_model <- results$BestModel

# Extract predictions and true labels for the best fold
y_pred <- results$BestPredictions
y_test <- results$BestTrueLabels

# Generate the classification report
classification_report <- generate_classification_report(y_test, y_pred)

# Print the classification report
classification_report
##              Class Precision    Recall  F1_Score Support
## 1               No 0.9554656 0.9328063 0.9440000     247
## 2              Yes 0.6458333 0.7380952 0.6888889      48
## 3    Macro Average 0.8006495 0.8354508 0.8164444      NA
## 4 Weighted Average 0.9050847 0.9011245 0.9024904     295
model_summary <- summary(best_model)
# Extract coefficients table and convert it to a data frame
coefficients_df <- as.data.frame(model_summary$coefficients)

# Add column names for clarity (the binomial glm actually reports z statistics;
# the column is renamed t.value here for consistency with the earlier table)
colnames(coefficients_df) <- c("Estimate", "Std.Error", "t.value", "P.Value")

# Sort by p-value
sorted_coefficients <- coefficients_df[order(coefficients_df$P.Value), ]
sorted_coefficients
##                                      Estimate   Std.Error     t.value
## OverTimeYes                        2.14728976   0.2351219  9.13266710
## NumCompaniesWorked                 1.96199273   0.4044277  4.85128130
## EnvironmentSatisfaction4          -1.39308297   0.3059738 -4.55294788
## DistanceFromHome                   1.46393935   0.3529273  4.14799136
## JobInvolvement3                   -1.69100215   0.4177843 -4.04754882
## JobInvolvement4                   -2.22500703   0.5500317 -4.04523443
## BusinessTravelTravel_Frequently    1.92052570   0.4752718  4.04089992
## JobSatisfaction4                  -1.13127134   0.3003542 -3.76645752
## EnvironmentSatisfaction3          -1.11117173   0.2959835 -3.75416828
## EnvironmentSatisfaction2          -1.27776080   0.3426227 -3.72935195
## YearsSinceLastPromotion            2.67637895   0.7667300  3.49064073
## JobInvolvement2                   -1.48543339   0.4441896 -3.34414232
## JobLevel2                         -1.97161336   0.5944914 -3.31647090
## WorkLifeBalance3                  -1.29638574   0.4270061 -3.03598846
## RelationshipSatisfaction2         -1.03790311   0.3514475 -2.95322373
## RelationshipSatisfaction3         -0.86466470   0.3046718 -2.83802015
## YearsWithCurrManager              -2.54204443   0.9315524 -2.72882614
## StockOptionLevel1                 -0.97576638   0.3639313 -2.68118286
## RelationshipSatisfaction4         -0.79169582   0.2996262 -2.64227841
## JobSatisfaction3                  -0.68182842   0.2856354 -2.38705843
## YearsInCurrentRole                -2.40180035   1.0313226 -2.32885451
## BusinessTravelTravel_Rarely        0.97619068   0.4337658  2.25050178
## YearsAtCompany                     4.14588740   1.8429603  2.24958042
## TotalWorkingYears                 -3.07892059   1.4055955 -2.19047415
## WorkLifeBalance2                  -0.95968886   0.4579243 -2.09573691
## DailyRate                         -0.73708157   0.3677695 -2.00419420
## TrainingTimesLastYear             -0.98934134   0.5042333 -1.96207053
## GenderMale                         0.40880073   0.2178506  1.87651862
## MonthlyIncome                     -3.53008494   2.0091424 -1.75701080
## JobSatisfaction2                  -0.55183424   0.3165151 -1.74346896
## WorkLifeBalance4                  -0.87450136   0.5026432 -1.73980527
## StockOptionLevel2                 -0.84103357   0.5005979 -1.68005822
## JobRoleResearch Director          -2.40776989   1.4648478 -1.64369969
## MaritalStatusSingle                0.63403839   0.4618445  1.37283960
## EducationFieldOther               -1.46544712   1.0760930 -1.36182198
## JobLevel5                          2.29487657   1.8123382  1.26625182
## MonthlyRate                        0.48599936   0.3872225  1.25509072
## JobRoleResearch Scientist         -0.95575395   0.7871207 -1.21424058
## JobRoleManufacturing Director      0.74116024   0.6576467  1.12698848
## EducationFieldLife Sciences       -1.00750838   0.9490896 -1.06155246
## Age                               -0.66242002   0.6624095 -1.00001593
## EducationFieldMedical             -0.88367045   0.9526814 -0.92756136
## MaritalStatusMarried               0.27796145   0.3259368  0.85280778
## JobRoleSales Executive             1.04583115   1.3245031  0.78960267
## HourlyRate                         0.26369307   0.3659306  0.72060943
## PercentSalaryHike                 -0.39141001   0.6512822 -0.60098375
## EducationFieldMarketing           -0.59331395   1.0018866 -0.59219672
## JobLevel4                         -0.73042113   1.4061718 -0.51943948
## JobRoleLaboratory Technician       0.36195536   0.7542048  0.47991653
## StockOptionLevel3                 -0.24117872   0.5444473 -0.44297901
## Education4                         0.15545223   0.3715673  0.41836900
## Education2                        -0.14023446   0.3936925 -0.35620305
## JobRoleManager                     0.35478524   1.2063074  0.29410849
## EducationFieldTechnical Degree    -0.26616344   0.9580493 -0.27781811
## JobRoleSales Representative        0.37377966   1.4531736  0.25721611
## JobLevel3                         -0.20214672   0.8481719 -0.23833225
## Education3                         0.08019492   0.3409004  0.23524444
## PerformanceRating4                 0.08096047   0.4811355  0.16826960
## Education5                        -0.04431876   0.7003849 -0.06327772
## DepartmentSales                   14.86901021 598.7455466  0.02483360
## JobRoleHuman Resources            14.43411793 598.7453717  0.02410727
## DepartmentResearch & Development  14.33604722 598.7450422  0.02394349
## (Intercept)                      -10.89246893 598.7464656 -0.01819212
##                                       P.Value
## OverTimeYes                      6.683212e-20
## NumCompaniesWorked               1.226664e-06
## EnvironmentSatisfaction4         5.289938e-06
## DistanceFromHome                 3.354050e-05
## JobInvolvement3                  5.175679e-05
## JobInvolvement4                  5.227079e-05
## BusinessTravelTravel_Frequently  5.324648e-05
## JobSatisfaction4                 1.655802e-04
## EnvironmentSatisfaction3         1.739180e-04
## EnvironmentSatisfaction2         1.919729e-04
## YearsSinceLastPromotion          4.818638e-04
## JobInvolvement2                  8.253739e-04
## JobLevel2                        9.116208e-04
## WorkLifeBalance3                 2.397485e-03
## RelationshipSatisfaction2        3.144740e-03
## RelationshipSatisfaction3        4.539432e-03
## YearsWithCurrManager             6.356020e-03
## StockOptionLevel1                7.336241e-03
## RelationshipSatisfaction4        8.235033e-03
## JobSatisfaction3                 1.698379e-02
## YearsInCurrentRole               1.986677e-02
## BusinessTravelTravel_Rarely      2.441711e-02
## YearsAtCompany                   2.447559e-02
## TotalWorkingYears                2.848987e-02
## WorkLifeBalance2                 3.610554e-02
## DailyRate                        4.504926e-02
## TrainingTimesLastYear            4.975427e-02
## GenderMale                       6.058410e-02
## MonthlyIncome                    7.891597e-02
## JobSatisfaction2                 8.125173e-02
## WorkLifeBalance4                 8.189322e-02
## StockOptionLevel2                9.294599e-02
## JobRoleResearch Director         1.002382e-01
## MaritalStatusSingle              1.698022e-01
## EducationFieldOther              1.732541e-01
## JobLevel5                        2.054229e-01
## MonthlyRate                      2.094458e-01
## JobRoleResearch Scientist        2.246559e-01
## JobRoleManufacturing Director    2.597474e-01
## EducationFieldLife Sciences      2.884389e-01
## Age                              3.173028e-01
## EducationFieldMedical            3.536351e-01
## MaritalStatusMarried             3.937659e-01
## JobRoleSales Executive           4.297598e-01
## HourlyRate                       4.711499e-01
## PercentSalaryHike                5.478508e-01
## EducationFieldMarketing          5.537189e-01
## JobLevel4                        6.034543e-01
## JobRoleLaboratory Technician     6.312867e-01
## StockOptionLevel3                6.577809e-01
## Education4                       6.756773e-01
## Education2                       7.216885e-01
## JobRoleManager                   7.686750e-01
## EducationFieldTechnical Degree   7.811520e-01
## JobRoleSales Representative      7.970119e-01
## JobLevel3                        8.116234e-01
## Education3                       8.140190e-01
## PerformanceRating4               8.663712e-01
## Education5                       9.495454e-01
## DepartmentSales                  9.801877e-01
## JobRoleHuman Resources           9.807670e-01
## DepartmentResearch & Development 9.808977e-01
## (Intercept)                      9.854856e-01

Random oversampling worsens rather than improves the model's performance on the minority class: overall accuracy drops from 0.91 to 0.79 and recall for 'Yes' falls from 0.74 to 0.42, possibly because the class imbalance is too severe for simple duplication of minority rows to help.
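For reference, a minimal sketch of what random oversampling does, using the ROSE package loaded earlier; the resampler inside `model_training` is assumed to behave equivalently (that code is not shown), and `train_df` and `balanced` are illustrative names:

# Duplicate minority-class rows until both classes have equal counts
train_df <- cbind(X, Attrition = y)
set.seed(42)
balanced <- ovun.sample(Attrition ~ ., data = train_df, method = "over",
                        N = 2 * sum(train_df$Attrition == "No"))$data
table(balanced$Attrition)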

suppressWarnings({
  results <- model_training(X, y, model = "glm", sampler = "RandomOversampling")
})
## Applying Random Oversampling...
## Fold 1: Accuracy = 0.77
## Applying Random Oversampling...
## Fold 2: Accuracy = 0.78
## Applying Random Oversampling...
## Fold 3: Accuracy = 0.79
## Applying Random Oversampling...
## Fold 4: Accuracy = 0.79
## Applying Random Oversampling...
## Fold 5: Accuracy = 0.79
## Average Accuracy: 0.78
## 
## Best Model is from Fold 5 with Accuracy = 0.79
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  194  53
##        Yes  10  38
##                                           
##                Accuracy : 0.7864          
##                  95% CI : (0.7352, 0.8318)
##     No Information Rate : 0.6915          
##     P-Value [Acc > NIR] : 0.0001746       
##                                           
##                   Kappa : 0.4241          
##                                           
##  Mcnemar's Test P-Value : 1.213e-07       
##                                           
##             Sensitivity : 0.9510          
##             Specificity : 0.4176          
##          Pos Pred Value : 0.7854          
##          Neg Pred Value : 0.7917          
##              Prevalence : 0.6915          
##          Detection Rate : 0.6576          
##    Detection Prevalence : 0.8373          
##       Balanced Accuracy : 0.6843          
##                                           
##        'Positive' Class : No              
## 
# Extract predictions and true labels for the best fold
y_pred <- results$BestPredictions
y_test <- results$BestTrueLabels

# Generate the classification report
classification_report <- generate_classification_report(y_test, y_pred)

# Print the classification report
classification_report
##              Class Precision    Recall  F1_Score Support
## 1               No 0.7854251 0.9509804 0.8603104     247
## 2              Yes 0.7916667 0.4175824 0.5467626      48
## 3    Macro Average 0.7885459 0.6842814 0.7035365      NA
## 4 Weighted Average 0.7864407 0.8641902 0.8092925     295
# Create an 80:20 train-test split (note: no seed is set, so this split is not reproducible)
train_index <- sample(1:nrow(df), 0.8 * nrow(df))
train_data <- df[train_index, ]
test_data <- df[-train_index, ]

# Perform logistic regression
model <- glm(Attrition ~ ., data = train_data, family = binomial)

# Summarize the model
summary(model)
## 
## Call:
## glm(formula = Attrition ~ ., family = binomial, data = train_data)
## 
## Coefficients:
##                                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      -9.131e+00  6.522e+02  -0.014 0.988829    
## Age                              -3.549e-02  1.681e-02  -2.112 0.034687 *  
## BusinessTravelTravel_Frequently   2.343e+00  4.972e-01   4.712 2.45e-06 ***
## BusinessTravelTravel_Rarely       1.204e+00  4.490e-01   2.682 0.007323 ** 
## DailyRate                        -4.124e-04  2.627e-04  -1.570 0.116473    
## DepartmentResearch & Development  1.426e+01  6.522e+02   0.022 0.982550    
## DepartmentSales                   1.343e+01  6.522e+02   0.021 0.983576    
## DistanceFromHome                  5.388e-02  1.336e-02   4.031 5.54e-05 ***
## Education2                        4.232e-01  3.935e-01   1.076 0.282099    
## Education3                        3.967e-01  3.517e-01   1.128 0.259301    
## Education4                        3.749e-01  3.871e-01   0.969 0.332721    
## Education5                        5.782e-01  6.814e-01   0.848 0.396172    
## EducationFieldLife Sciences      -1.195e+00  9.703e-01  -1.231 0.218283    
## EducationFieldMarketing          -5.345e-01  1.029e+00  -0.520 0.603348    
## EducationFieldMedical            -9.341e-01  9.683e-01  -0.965 0.334740    
## EducationFieldOther              -1.274e+00  1.069e+00  -1.192 0.233338    
## EducationFieldTechnical Degree   -3.483e-02  9.954e-01  -0.035 0.972088    
## EnvironmentSatisfaction2         -1.206e+00  3.453e-01  -3.491 0.000481 ***
## EnvironmentSatisfaction3         -1.042e+00  3.014e-01  -3.458 0.000545 ***
## EnvironmentSatisfaction4         -1.438e+00  3.096e-01  -4.646 3.39e-06 ***
## GenderMale                        4.334e-01  2.210e-01   1.961 0.049836 *  
## HourlyRate                        4.423e-03  5.448e-03   0.812 0.416928    
## JobInvolvement2                  -9.876e-01  4.228e-01  -2.336 0.019494 *  
## JobInvolvement3                  -1.631e+00  4.011e-01  -4.065 4.80e-05 ***
## JobInvolvement4                  -2.039e+00  5.475e-01  -3.724 0.000196 ***
## JobLevel2                        -2.308e+00  6.204e-01  -3.720 0.000200 ***
## JobLevel3                        -5.086e-01  8.808e-01  -0.577 0.563672    
## JobLevel4                        -1.177e+00  1.500e+00  -0.785 0.432559    
## JobLevel5                         2.443e+00  2.117e+00   1.154 0.248449    
## JobRoleHuman Resources            1.396e+01  6.522e+02   0.021 0.982920    
## JobRoleLaboratory Technician     -1.881e-01  7.328e-01  -0.257 0.797443    
## JobRoleManager                   -8.795e-01  1.551e+00  -0.567 0.570657    
## JobRoleManufacturing Director     6.053e-01  6.058e-01   0.999 0.317675    
## JobRoleResearch Director         -3.041e+00  1.610e+00  -1.889 0.058912 .  
## JobRoleResearch Scientist        -1.315e+00  7.584e-01  -1.734 0.082932 .  
## JobRoleSales Executive            2.218e+00  1.687e+00   1.315 0.188549    
## JobRoleSales Representative       9.756e-01  1.782e+00   0.547 0.584100    
## JobSatisfaction2                 -7.221e-01  3.250e-01  -2.222 0.026280 *  
## JobSatisfaction3                 -5.940e-01  2.881e-01  -2.062 0.039230 *  
## JobSatisfaction4                 -1.286e+00  3.142e-01  -4.093 4.25e-05 ***
## MaritalStatusMarried              4.019e-01  3.290e-01   1.221 0.221907    
## MaritalStatusSingle               6.364e-01  4.727e-01   1.346 0.178276    
## MonthlyIncome                    -1.745e-04  1.097e-04  -1.591 0.111685    
## MonthlyRate                       6.023e-06  1.587e-05   0.380 0.704260    
## NumCompaniesWorked                2.001e-01  4.671e-02   4.283 1.84e-05 ***
## OverTimeYes                       2.268e+00  2.409e-01   9.412  < 2e-16 ***
## PercentSalaryHike                -5.212e-02  4.766e-02  -1.094 0.274171    
## PerformanceRating4                2.803e-01  4.925e-01   0.569 0.569282    
## RelationshipSatisfaction2        -1.130e+00  3.533e-01  -3.200 0.001376 ** 
## RelationshipSatisfaction3        -1.113e+00  3.047e-01  -3.654 0.000258 ***
## RelationshipSatisfaction4        -1.003e+00  2.991e-01  -3.354 0.000797 ***
## StockOptionLevel1                -1.263e+00  3.770e-01  -3.350 0.000807 ***
## StockOptionLevel2                -1.169e+00  5.221e-01  -2.238 0.025224 *  
## StockOptionLevel3                -3.105e-01  5.417e-01  -0.573 0.566539    
## TotalWorkingYears                -5.116e-02  3.560e-02  -1.437 0.150760    
## TrainingTimesLastYear            -2.043e-01  8.521e-02  -2.398 0.016489 *  
## WorkLifeBalance2                 -1.019e+00  4.446e-01  -2.292 0.021907 *  
## WorkLifeBalance3                 -1.376e+00  4.176e-01  -3.294 0.000986 ***
## WorkLifeBalance4                 -9.547e-01  5.101e-01  -1.872 0.061252 .  
## YearsAtCompany                    1.065e-01  4.939e-02   2.157 0.031016 *  
## YearsInCurrentRole               -1.704e-01  5.875e-02  -2.901 0.003720 ** 
## YearsSinceLastPromotion           1.594e-01  5.195e-02   3.067 0.002159 ** 
## YearsWithCurrManager             -1.179e-01  5.574e-02  -2.115 0.034407 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1046.75  on 1175  degrees of freedom
## Residual deviance:  609.94  on 1113  degrees of freedom
## AIC: 735.94
## 
## Number of Fisher Scoring iterations: 15
# Predict on the test data
predictions <- predict(model, newdata = test_data, type = "response")

# Convert probabilities to binary predictions (0 or 1) using a threshold of 0.5
predicted_attrition <- ifelse(predictions > 0.5, 1, 0)

# Evaluate the model
# Create a confusion matrix
conf_matrix <- table(predicted_attrition, test_data$Attrition)
print(conf_matrix)
##                    
## predicted_attrition  No Yes
##                   0 238  22
##                   1  11  23
# Calculate accuracy
accuracy <- sum(diag(conf_matrix)) / sum(conf_matrix)
cat("Accuracy:", accuracy, "\n")
## Accuracy: 0.8877551
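Since the minority class ('Yes') is under-detected, another lever worth exploring is the decision threshold. A hedged sketch, reusing `model` and `test_data` from the chunk above (`probs` and `thr` are illustrative names):

# Illustrative only: a lower threshold trades precision for recall on 'Yes'
probs <- predict(model, newdata = test_data, type = "response")
for (thr in c(0.3, 0.5)) {
  pred <- ifelse(probs > thr, "Yes", "No")
  cat("Threshold:", thr, "\n")
  print(table(Predicted = pred, Actual = test_data$Attrition))
}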

SVM

# One-hot encode categorical predictors so the SVM works on numeric inputs
dummy_model <- dummyVars("~ .", data = X)
X_numeric <- predict(dummy_model, newdata = X)
X_numeric <- as.data.frame(X_numeric)
# Train the model
results <- model_training(X_numeric, y, model = "svmLinear")
## Fold 1: Accuracy = 0.87
## Fold 2: Accuracy = 0.89
## Fold 3: Accuracy = 0.88
## Fold 4: Accuracy = 0.87
## Fold 5: Accuracy = 0.89
## Average Accuracy: 0.88
## 
## Best Model is from Fold 5 with Accuracy = 0.89
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  236  11
##        Yes  21  27
##                                           
##                Accuracy : 0.8915          
##                  95% CI : (0.8503, 0.9246)
##     No Information Rate : 0.8712          
##     P-Value [Acc > NIR] : 0.1699          
##                                           
##                   Kappa : 0.5654          
##                                           
##  Mcnemar's Test P-Value : 0.1116          
##                                           
##             Sensitivity : 0.9183          
##             Specificity : 0.7105          
##          Pos Pred Value : 0.9555          
##          Neg Pred Value : 0.5625          
##              Prevalence : 0.8712          
##          Detection Rate : 0.8000          
##    Detection Prevalence : 0.8373          
##       Balanced Accuracy : 0.8144          
##                                           
##        'Positive' Class : No              
## 
# View the fold metrics (model_training does not populate FoldMetrics for
# classification runs, so this prints NULL)
print(results$FoldMetrics)
## NULL
# Access the best model
best_model <- results$BestModel

# Extract predictions and true labels for the best fold
y_pred <- results$BestPredictions
y_test <- results$BestTrueLabels

# Generate the classification report
classification_report <- generate_classification_report(y_test, y_pred)

# Print the classification report
classification_report
##              Class Precision    Recall  F1_Score Support
## 1               No 0.9554656 0.9182879 0.9365079     247
## 2              Yes 0.5625000 0.7105263 0.6279070      48
## 3    Macro Average 0.7589828 0.8144071 0.7822075      NA
## 4 Weighted Average 0.8915254 0.8844827 0.8862949     295
# Get variable importance
options(repr.plot.width = 6, repr.plot.height = 15)
importance <- varImp(best_model, scale = FALSE)
print(importance)
## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 78)
## 
##                                   Importance
## MonthlyIncome                         0.6695
## TotalWorkingYears                     0.6656
## JobLevel.1                            0.6512
## YearsAtCompany                        0.6492
## OverTime.Yes                          0.6376
## OverTime.No                           0.6376
## YearsWithCurrManager                  0.6339
## Age                                   0.6327
## YearsInCurrentRole                    0.6313
## StockOptionLevel.0                    0.6244
## MaritalStatus.Single                  0.6098
## StockOptionLevel.1                    0.6010
## JobLevel.2                            0.5910
## EnvironmentSatisfaction.1             0.5704
## DistanceFromHome                      0.5675
## MaritalStatus.Married                 0.5653
## BusinessTravel.Travel_Frequently      0.5574
## JobRole.Laboratory Technician         0.5572
## Department.Research & Development     0.5539
## DailyRate                             0.5505
# Plot variable importance
plot(importance)

# Train the model
results <- model_training(X_numeric, y, model = "svmRadial")
## Fold 1: Accuracy = 0.86
## Fold 2: Accuracy = 0.88
## Fold 3: Accuracy = 0.86
## Fold 4: Accuracy = 0.88
## Fold 5: Accuracy = 0.89
## Average Accuracy: 0.87
## 
## Best Model is from Fold 5 with Accuracy = 0.89
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  246   1
##        Yes  32  16
##                                           
##                Accuracy : 0.8881          
##                  95% CI : (0.8465, 0.9217)
##     No Information Rate : 0.9424          
##     P-Value [Acc > NIR] : 0.9999          
##                                           
##                   Kappa : 0.4451          
##                                           
##  Mcnemar's Test P-Value : 1.767e-07       
##                                           
##             Sensitivity : 0.8849          
##             Specificity : 0.9412          
##          Pos Pred Value : 0.9960          
##          Neg Pred Value : 0.3333          
##              Prevalence : 0.9424          
##          Detection Rate : 0.8339          
##    Detection Prevalence : 0.8373          
##       Balanced Accuracy : 0.9130          
##                                           
##        'Positive' Class : No              
## 
# View the fold metrics (not populated for classification runs, so this prints NULL)
print(results$FoldMetrics)
## NULL
# Access the best model
best_model <- results$BestModel

# Extract predictions and true labels for the best fold
y_pred <- results$BestPredictions
y_test <- results$BestTrueLabels

# Generate the classification report
classification_report <- generate_classification_report(y_test, y_pred)

# Print the classification report
classification_report
##              Class Precision    Recall  F1_Score Support
## 1               No 0.9959514 0.8848921 0.9371429     247
## 2              Yes 0.3333333 0.9411765 0.4923077      48
## 3    Macro Average 0.6646424 0.9130343 0.7147253      NA
## 4 Weighted Average 0.8881356 0.8940502 0.8647629     295
# Get variable importance
options(repr.plot.width = 6, repr.plot.height = 15)
importance <- varImp(best_model, scale = FALSE)
print(importance)
## ROC curve variable importance
## 
##   only 20 most important variables shown (out of 78)
## 
##                                   Importance
## MonthlyIncome                         0.6695
## TotalWorkingYears                     0.6656
## JobLevel.1                            0.6512
## YearsAtCompany                        0.6492
## OverTime.Yes                          0.6376
## OverTime.No                           0.6376
## YearsWithCurrManager                  0.6339
## Age                                   0.6327
## YearsInCurrentRole                    0.6313
## StockOptionLevel.0                    0.6244
## MaritalStatus.Single                  0.6098
## StockOptionLevel.1                    0.6010
## JobLevel.2                            0.5910
## EnvironmentSatisfaction.1             0.5704
## DistanceFromHome                      0.5675
## MaritalStatus.Married                 0.5653
## BusinessTravel.Travel_Frequently      0.5574
## JobRole.Laboratory Technician         0.5572
## Department.Research & Development     0.5539
## DailyRate                             0.5505
# Plot variable importance (identical to the linear SVM's ranking above, since
# caret computes a model-free, ROC-based filter importance for SVM models)
plot(importance)

Random Forest (RF)

# Full cross-validation is too slow for random forest here,
# so a single train-test split is used instead

# Step 1: Create an 80:20 train-test split
set.seed(42)
train_idx <- createDataPartition(y, p = 0.8, list = FALSE)
X_train <- X_numeric[train_idx, ]
X_test <- X_numeric[-train_idx, ]
y_train <- y[train_idx]
y_test <- y[-train_idx]

# Step 2: Train the model
model <- train(
  X_train, y_train,
  method = "rf",
  preProcess = c("center", "scale"),
  trControl = trainControl(method = "none")  # Single train-test split
)

# Step 3: Predict on the test set
y_pred <- predict(model, X_test)

# Note: caret's confusionMatrix(data, reference) expects predictions first;
# the arguments are swapped here, so the 'Prediction' margin actually shows
# the true labels
confusionMatrix(y_test, y_pred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  No Yes
##        No  245   1
##        Yes  39   8
##                                           
##                Accuracy : 0.8635          
##                  95% CI : (0.8188, 0.9006)
##     No Information Rate : 0.9693          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.2469          
##                                           
##  Mcnemar's Test P-Value : 4.909e-09       
##                                           
##             Sensitivity : 0.8627          
##             Specificity : 0.8889          
##          Pos Pred Value : 0.9959          
##          Neg Pred Value : 0.1702          
##              Prevalence : 0.9693          
##          Detection Rate : 0.8362          
##    Detection Prevalence : 0.8396          
##       Balanced Accuracy : 0.8758          
##                                           
##        'Positive' Class : No              
## 
# Generate the classification report
classification_report <- generate_classification_report(y_test, y_pred)

# Print the classification report
classification_report
##              Class Precision    Recall  F1_Score Support
## 1               No 0.9959350 0.8626761 0.9245283     246
## 2              Yes 0.1702128 0.8888889 0.2857143      47
## 3    Macro Average 0.5830739 0.8757825 0.6051213      NA
## 4 Weighted Average 0.8634812 0.8668808 0.8220564     293
# Get variable importance
options(repr.plot.width = 6, repr.plot.height = 15)
importance <- varImp(model, scale = FALSE)
print(importance)
## rf variable importance
## 
##   only 20 most important variables shown (out of 78)
## 
##                                  Overall
## MonthlyIncome                     19.573
## Age                               17.533
## DailyRate                         14.628
## TotalWorkingYears                 13.921
## MonthlyRate                       13.499
## HourlyRate                        13.111
## DistanceFromHome                  12.189
## YearsAtCompany                    12.082
## PercentSalaryHike                  9.730
## NumCompaniesWorked                 9.315
## OverTime.Yes                       9.197
## YearsWithCurrManager               8.994
## OverTime.No                        8.752
## TrainingTimesLastYear              7.447
## YearsInCurrentRole                 7.322
## YearsSinceLastPromotion            7.089
## EnvironmentSatisfaction.1          5.240
## StockOptionLevel.0                 5.150
## BusinessTravel.Travel_Frequently   4.608
## JobLevel.1                         4.600
# Plot variable importance
plot(importance)

11 Data Interpretation/Model Evaluation

Important features used by the model to predict monthly income: 1. Job Role 2. Job Level 3. Total Working Years

Interpretation of Job Role’s Impact on Monthly Income:

Job role is a significant factor in determining monthly income, as evidenced by the clear distinctions between roles in the boxplot below, which shows the distribution of monthly income across job roles. A few key insights emerge: 1. Manager and Research Director have the highest median incomes, while Sales Representative has the lowest. 2. Outliers are visible for several roles, such as Laboratory Technician, Research Scientist, Sales Executive, and Sales Representative, indicating individual cases of exceptionally high or low salaries. 3. Manufacturing Director and Research Director show broader income ranges, suggesting pay structures that vary with seniority or performance, while Sales Representatives consistently earn the least, with incomes clustered at the lower end.
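The role-level gap can also be read off numerically; a quick sketch on the same `df`:

# Median monthly income by job role, highest first
df %>%
  group_by(JobRole) %>%
  summarise(MedianIncome = median(MonthlyIncome)) %>%
  arrange(desc(MedianIncome))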

options(repr.plot.width = 12, repr.plot.height = 12)
ggplot(df, aes(x = JobRole, y = MonthlyIncome, fill = JobRole)) +
  geom_boxplot() +
  labs(
    title = "Monthly Income by Job Role",
    x = "Job Role",
    y = "Monthly Income"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation of Job Level’s Impact on Monthly Income:

After job role, job level is the second most significant factor in determining monthly income. The boxplot below shows a clear positive relationship between job level and monthly income: as job level increases from 1 to 5, the median monthly income rises substantially, indicating that higher job levels are associated with higher earnings. This also suggests: 1. Promotions and career progression result in higher pay. 2. The wider spread at higher job levels (especially Job Level 4) could indicate variations in responsibilities, individual negotiation, or incentives for senior positions. 3. Job Level 2 has a noticeable number of outliers earning above the upper quartile, which may indicate exceptional cases such as high-performing individuals or unique circumstances at this level.
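The level-by-level medians can be checked directly (a quick sketch on the same `df`):

# Median monthly income at each job level
aggregate(MonthlyIncome ~ JobLevel, data = df, FUN = median)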

options(repr.plot.width = 12, repr.plot.height = 12)
ggplot(df, aes(x = JobLevel, y = MonthlyIncome, fill = JobLevel)) +
  geom_boxplot() +
  labs(
    title = "Monthly Income by Job Level",
    x = "Job Level",
    y = "Monthly Income"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation of Total Working Years’ Impact on Monthly Income:

After job level, total working years is the third most significant factor influencing monthly income. The smoothing line shows a positive correlation between working years and income, indicating that more experience generally leads to higher earnings. Key insights include: 1. Experience and skills over time lead to higher-paying roles and promotions, with more experienced individuals often commanding higher salaries. 2. The chart differentiates data points by job level, highlighting that higher job levels tend to earn more, even with the same number of working years. 3. Variability in income at each experience level suggests that other factors, like industry and location, also influence earnings.
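The strength of this relationship can be verified directly (sketch):

# Pearson correlation between total working years and monthly income
cor(df$TotalWorkingYears, df$MonthlyIncome)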

ggplot(df, aes(x = TotalWorkingYears, y = MonthlyIncome, color = JobLevel)) +
  geom_point(size = 3, alpha = 0.8) +  # Adjust size and transparency
  geom_smooth(method = "lm", se = FALSE, linetype = "dashed", color = "black") +  # Trend line
  labs(
    title = "Monthly Income by Total Working Years",
    x = "Total Working Years",
    y = "Monthly Income",
    color = "Job Level"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",               # Move legend to the top
    axis.text = element_text(size = 12),   # Adjust axis text size
    plot.title = element_text(hjust = 0.5) # Center the title
  )
## `geom_smooth()` using formula = 'y ~ x'

Important features used by the models to predict attrition: 1. Over Time 2. Monthly Income 3. Total Working Years 4. Number of Companies Worked 5. Distance From Home
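Before the per-feature plots, the numeric drivers in this list can be summarised by attrition status; a quick sketch on the same `df`:

# Medians of the key numeric attrition drivers for stayers vs leavers
df %>%
  group_by(Attrition) %>%
  summarise(
    MedianIncome = median(MonthlyIncome),
    MedianWorkingYears = median(TotalWorkingYears),
    MedianCompanies = median(NumCompaniesWorked),
    MedianDistance = median(DistanceFromHome)
  )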

Interpretation of Over Time’s Impact on Employee Attrition:

Overtime is the primary factor influencing employee attrition. The chart below indicates a positive association between overtime work and attrition, highlighting the need for further analysis and potential interventions. Key insights include: 1. Employees who do not work overtime have a significantly lower attrition rate than those who do, indicating a notable association between working overtime and the likelihood of leaving the company. 2. This suggests that working overtime may contribute to higher employee turnover. 3. Potential underlying reasons include work-life imbalance, job stress, or other factors related to overtime work.

Recommendations: 1. Company should investigate the reasons behind the higher attrition rate among employees who work overtime. 2. Consider strategies to reduce overtime or support employees who work extra hours to improve retention. 3. Strategies such as: (i). Offering fair pay or additional benefits for overtime work shows that the company values employees’ time and effort. (ii). Providing flexible working hours or the option to take time off in lieu of overtime hours worked helps employees manage their work-life balance. (iii). Offering access to health and wellness programs or mental health support can help employees manage the stress and fatigue that may come from working extra hours. (iv). Acknowledging the hard work and commitment of employees who put in extra hours can foster a sense of appreciation and loyalty, motivating them to stay with the organization.
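To quantify the gap described above, the attrition rate within each OverTime group can be tabulated directly (a quick sketch on the same `df`):

# Share of leavers within each OverTime group
df %>%
  group_by(OverTime) %>%
  summarise(AttritionRate = mean(Attrition == "Yes"), Count = n())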

# Summarize data: Count occurrences of each combination
df_summary <- as.data.frame(table(df$OverTime, df$Attrition))
colnames(df_summary) <- c("OverTime", "Attrition", "Count")
# Create the grouped bar chart
ggplot(df_summary, aes(x = OverTime, y = Count, fill = Attrition)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(
    title = "Attrition Grouped by OverTime",
    x = "OverTime",
    y = "Count",
    fill = "Attrition"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 0, hjust = 0.5),
    plot.title = element_text(hjust = 0.5)
  )

Interpretation of Monthly Income’s Impact on Employee Attrition:

Monthly income is the second most significant factor influencing employee attrition. The boxplot below shows the income distributions for staying and leaving employees: employees who leave tend to have lower incomes, suggesting that monthly income is a key factor driving attrition. Key insights include: 1. Lower incomes among those who leave could indicate dissatisfaction with compensation, prompting employees to seek better opportunities elsewhere. 2. Higher incomes among those who leave might point to other factors, such as lack of job satisfaction, insufficient career development, or poor work-life balance; despite earning higher salaries, these employees may feel their potential is not fully utilized or recognized, prompting them to move on.

Recommendations: 1. Regularly review and adjust compensation to remain competitive within the industry, as salary is a key retention factor. 2. Offer valuable benefits like health insurance, retirement plans, and performance incentives to show care for employees’ long-term well-being.

options(repr.plot.width = 12, repr.plot.height = 12)
ggplot(df, aes(x = Attrition, y = MonthlyIncome, fill = Attrition)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of Monthly Income Grouped by Attrition",
    x = "Attrition",
    y = "Monthly Income"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation of the Impact of Number of Companies Worked on Employee Attrition:

The boxplot below shows that employees who left the company tend to have more variability in the number of companies they’ve worked for. However, the median number of companies worked for by employees who left is 1, suggesting they may be early in their careers and have worked at only one previous employer. The key insights include: 1. This may reflect a pattern of lower organizational loyalty or dissatisfaction with career progression. 2. Employees in their first job may leave after gaining initial experience, seeking a better fit once they understand their skills, interests, and the industry.

Recommendations: 1. Provide employees with clear, structured career progression plans. Make sure they understand how they can advance within the company and the steps required to reach higher positions. 2. Offer continuous learning opportunities, including skill-building programs, leadership training, and mentorship. This can help employees feel valued and give them the tools to grow within the organization. 3. Implement regular check-ins and performance reviews to provide feedback, set goals, and recognize achievements. This helps employees feel their progress is acknowledged and motivates them to stay. 4. Ensure that employees’ roles are aligned with their skills, interests, and long-term career goals. Providing meaningful, challenging work can increase job satisfaction and reduce the urge to leave for new opportunities.

options(repr.plot.width = 12, repr.plot.height = 12)
ggplot(df, aes(x = Attrition, y = NumCompaniesWorked, fill = Attrition)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of Number of Companies Worked Grouped by Attrition",
    x = "Attrition",
    y = "Number of Companies Worked"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Interpretation of the Impact of Distance From Home on Employee Attrition:

The boxplot below indicates a link between longer commutes and higher attrition rates. Comparing the median distances of the two groups shows that employees who left the company tend to live farther from work; the higher median suggests that greater commuting distances might correlate with attrition. Key insights include: 1. Long commutes can increase employee dissatisfaction through the time and cost involved, leading to a higher likelihood of leaving the company.

Recommendations: 1. Implement remote or hybrid work models to reduce commuting and improve work-life balance. 2. Offer transportation subsidies or reimbursements to ease commuting costs and stress. 3. Consider satellite offices or coworking spaces closer to employees’ homes to cut commuting time. 4. Allow flexible work hours to help employees avoid peak commute times. 5. Provide relocation assistance for employees willing to move closer to the office.

options(repr.plot.width = 12, repr.plot.height = 12)
ggplot(df, aes(x = Attrition, y = DistanceFromHome, fill = Attrition)) +
  geom_boxplot() +
  labs(
    title = "Boxplot of Distance from Home Grouped by Attrition",
    x = "Attrition",
    y = "Distance from Home"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

12 Conclusion

In this project, we successfully developed machine learning models to predict employee attrition and monthly income with promising performance. Through our analysis, we identified key factors influencing attrition, specifically OverTime, Monthly Income, Total Working Years, Number of Companies Worked, and Distance From Home. These insights provide valuable direction for addressing employee retention by highlighting the most impactful areas for intervention.

However, this project has several limitations. The dataset used was relatively small, and limited computing resources restricted the complexity of models and analysis we could perform. For future work, expanding the dataset, integrating additional features, and retraining the models on more diverse data could improve performance and test the models’ generalizability across different industries.

In conclusion, adopting a holistic approach to enhancing the employee experience—focused on improving work-life balance, supporting career growth, and promoting overall well-being—can help foster a more engaged and loyal workforce. This will ultimately lead to higher retention, increased employee motivation, and better alignment with the company’s long-term goals, driving sustained success.