1 Project Background

1.1 Overview

Employee attrition — the voluntary or involuntary departure of employees from an organisation — is a persistent challenge across industries worldwide. In Saudi Arabia’s private sector, high turnover has direct financial consequences: replacement costs typically range from 50% to 200% of an employee’s annual salary when accounting for recruitment, onboarding, and lost productivity (Boushey & Glynn, 2012). This challenge is particularly salient given Saudi Vision 2030’s emphasis on building a skilled, stable, and productive private-sector workforce.

This project applies machine learning classification techniques to predict individual employee attrition risk using a dataset of 1,191 employees collected through a validated survey across Saudi Arabian private sector organisations. The resulting model is embedded in an interactive Shiny application that allows HR practitioners to identify at-risk employees and understand which factors are driving their attrition probability.

1.2 Target Organisation and Users

The data product is designed for HR departments and people analytics teams within medium-to-large Saudi private sector companies, particularly in high-attrition industries such as communications, energy, and financial services. The primary users are:

HR Business Partners assessing individual employee risk
People Analytics teams monitoring workforce trends
Line managers seeking to understand team-level attrition drivers

1.3 Potential Benefits

Early identification of flight-risk employees enables proactive retention interventions
Insight into systemic drivers (e.g., job instability, psychological exhaustion) informs policy change
Quantified feature importance supports evidence-based HR decision-making
Aligns with SDG 8: Decent Work and Economic Growth by promoting stable, fair employment

2 Project Objectives

To identify the key demographic, job-related, and psychological factors associated with employee attrition in Saudi Arabia’s private sector
To model employee attrition using machine learning classification techniques — Logistic Regression, Random Forest, and XGBoost
To evaluate model performance using F1-Score and AUC-ROC, accounting for class imbalance in the dataset
To deploy an interactive Shiny application that enables HR practitioners to predict attrition probability for individual employees and interpret feature contributions using SHAP values

3 Data Collection and Description

3.1 Dataset

Source: Saudi Employee Attrition Dataset, Mendeley Data
DOI: 10.17632/6z2hty8php.1
Published: March 19, 2025
Licence: CC BY 4.0
Contributors: Haya Alqahtani, Hana Almagrabi, Amal Alharbi (King Abdulaziz University)

The dataset was collected via an online survey administered to 1,200 employees across various Saudi Arabian private sector companies. After excluding incomplete responses during preprocessing, the working dataset contains 1,191 observations and 34 features plus one binary target variable (Attrition: Yes/No).

3.2 Three Dataset Files

File	Rows	Columns	Description
Original Dataset	1,191	35	Raw survey responses with text labels
Tree-Based Models	1,174	34	Encoded for RF/XGBoost; ordinal encoded
Non-Tree Models	1,174	44	One-hot encoded for Logistic Regression

3.3 Dataset Shape Confirmation

cat("Original :", nrow(orig),  "rows x", ncol(orig),  "cols\n")

## Original : 1191 rows x 35 cols

cat("Tree     :", nrow(tree),  "rows x", ncol(tree),  "cols\n")

## Tree     : 1174 rows x 34 cols

cat("Non-tree :", nrow(ntree), "rows x", ncol(ntree), "cols\n")

## Non-tree : 1174 rows x 44 cols

cat("\nAll required columns present:", all(required %in% names(orig)), "\n")

## 
## All required columns present: TRUE

cat("Column names sample:", paste(names(orig)[1:6], collapse=", "))

## Column names sample: ID, Attrition, Gender, Age, Maritalstatus, Academic_degree

4 Exploratory Data Analysis

4.1 Overall Attrition Distribution

att_tab <- orig %>% count(Attrition) %>% mutate(pct = n / sum(n) * 100)

ggplot(att_tab, aes(x = Attrition, y = pct, fill = Attrition)) +
  geom_bar(stat = "identity", width = 0.5) +
  geom_text(aes(label = paste0(n, " (", round(pct, 1), "%)")),
            vjust = -0.4, fontface = "bold", size = 4) +
  scale_fill_manual(values = c("No" = COL_NO, "Yes" = COL_YES)) +
  scale_y_continuous(limits = c(0, 75)) +
  labs(title = "Overall Attrition Distribution (n = 1,191)",
       x = "Attrition", y = "Percentage (%)") +
  theme_minimal() +
  theme(plot.title     = element_text(face = "bold", colour = UM_BLUE, size = 13),
        legend.position = "none")

Finding: The dataset shows a 43.2% attrition rate — considerably higher than the IBM benchmark dataset (16.1%), reflecting genuine turnover pressure in the Saudi private sector.

4.2 Demographic Factors

p1 <- plot_att_bar(orig, "Gender",        "Attrition by Gender")
p2 <- plot_att_bar(orig, "Age",           "Attrition by Age Group",
                   lvls = c("21 to 30","31 to 40","41 to 50","51 to 60"))
p3 <- plot_att_bar(orig, "Maritalstatus", "Attrition by Marital Status")
p4 <- plot_att_bar(orig, "Academic_degree", "Attrition by Education Level", angle = 10)

grid.arrange(p1, p2, p3, p4, ncol = 2,
             top = "Figure 1: Attrition by Demographic Characteristics")

Key Findings:

Males show a higher attrition rate (48.5%) than females (39.2%)
The 21–30 age group shows 45.4% attrition — the highest among working-age groups
PhD holders have the lowest attrition (29.0%)

4.3 Job-Related Factors

p5 <- plot_att_bar(orig, "Job_Stability",   "Attrition by Job Stability")
p6 <- plot_att_bar(orig, "Job_Satisfaction","Attrition by Job Satisfaction",
                   lvls = c("Not satisfied","Satisfied","Very satisfied"), angle = 10)
p7 <- plot_att_bar(orig, "OverTime",        "Attrition by Overtime")
p8 <- plot_att_bar(orig, "Recognition",     "Attrition by Recognition")

grid.arrange(p5, p6, p7, p8, ncol = 2,
             top = "Figure 2: Attrition by Job-Related Factors")

Key Findings:

Employees with no job stability leave at 55.3% vs. 30.2% for stable employees
Not satisfied employees leave at 58.4% vs. 25.8% for very satisfied employees
Overtime shows minimal difference (43.8% vs. 42.8%)

4.4 Psychological and Work Environment Factors

p9  <- plot_att_bar(orig, "Psychological_Exhaustion", "Attrition by Psychological Exhaustion",
                    lvls = c("No","Sometimes","Yes"))
p10 <- plot_att_bar(orig, "Physical_Stress",          "Attrition by Physical Stress",
                    lvls = c("No","Sometimes","Yes"))
p11 <- plot_att_bar(orig, "Work_Live_Balance",        "Attrition by Work-Life Balance",
                    lvls = c("Easy","Medium","Difficult"))
p12 <- plot_att_bar(orig, "Emotional_Commitment",     "Attrition by Emotional Commitment",
                    lvls = c("Low","Medium","High"))

grid.arrange(p9, p10, p11, p12, ncol = 2,
             top = "Figure 3: Attrition by Psychological Factors")

Key Findings:

Psychological exhaustion → 51.2% attrition vs. 31.9% for non-exhausted employees
Difficult work-life balance → 59.9% attrition — highest of all WLB categories
Low emotional commitment → 59.4% attrition vs. 37.0% for highly committed employees

4.5 Salary Analysis

sal_order  <- c("Less than 5000 SAR","From 5000 to 10000 S.R","From 11000 to 15000 S.R",
                "From 16000 to 20000 S.R","From 21000 to 25000 S.R",
                "From 26000 to 30000 S.R","S.R 31000 - and more")
sal_labels <- c("<5K","5K-10K","11K-15K","16K-20K","21K-25K","26K-30K",">31K")

sal_ct <- orig %>%
  group_by(MonthlySalary, Attrition) %>%
  summarise(n = n(), .groups = "drop") %>%
  group_by(MonthlySalary) %>%
  mutate(pct = n / sum(n) * 100) %>%
  filter(Attrition == "Yes") %>%
  mutate(sal_f = factor(MonthlySalary, levels = sal_order, labels = sal_labels))

ggplot(sal_ct, aes(x = sal_f, y = pct)) +
  geom_bar(stat = "identity", fill = COL_YES, width = 0.6) +
  geom_text(aes(label = paste0(round(pct, 1), "%")),
            vjust = -0.4, fontface = "bold", size = 3.5) +
  labs(title = "Figure 4: Attrition Rate by Monthly Salary Band",
       x = "Monthly Salary (SAR)", y = "Attrition Rate (%)") +
  scale_y_continuous(limits = c(0, 75)) +
  theme_minimal() +
  theme(plot.title          = element_text(face = "bold", colour = UM_BLUE, size = 12),
        panel.grid.major.x  = element_blank())

Finding: The 26K–30K band shows the highest attrition (60.0%). Highest earners (>31K) have the lowest attrition (26.7%).

4.6 Compound Risk Analysis

compound <- data.frame(
  Scenario = c(
    "Job Stability=No AND OverTime=Yes",
    "Psych. Exhaustion=Yes AND Job Satisfaction=Not satisfied",
    "Work-Life Balance=Difficult AND Job Stability=No",
    "Age 21-30 AND Job Stability=No"
  ),
  Attrition_Rate = c(
    mean(orig$Attrition[orig$Job_Stability == "No" &
                        orig$OverTime == "Yes"] == "Yes") * 100,
    mean(orig$Attrition[orig$Psychological_Exhaustion == "Yes" &
                        orig$Job_Satisfaction == "Not satisfied"] == "Yes") * 100,
    mean(orig$Attrition[orig$Work_Live_Balance == "Difficult" &
                        orig$Job_Stability == "No"] == "Yes") * 100,
    mean(orig$Attrition[orig$Age == "21 to 30" &
                        orig$Job_Stability == "No"] == "Yes") * 100
  )
)
compound$Attrition_Rate <- round(compound$Attrition_Rate, 1)

knitr::kable(compound,
             col.names = c("Compound Risk Scenario", "Attrition Rate (%)"),
             caption   = "Table 1: Attrition Rates Under Compound Risk Scenarios")

Table 1: Attrition Rates Under Compound Risk Scenarios
Compound Risk Scenario	Attrition Rate (%)
Job Stability=No AND OverTime=Yes	62.2
Psych. Exhaustion=Yes AND Job Satisfaction=Not satisfied	55.4
Work-Life Balance=Difficult AND Job Stability=No	60.5
Age 21-30 AND Job Stability=No	53.9

5 Data Modelling

5.1 Preprocessing

tree$Attrition  <- factor(tree$Attrition,  levels = c(0,1), labels = c("Stay","Leave"))
ntree$Attrition <- factor(ntree$Attrition, levels = c(0,1), labels = c("Stay","Leave"))

idx   <- createDataPartition(tree$Attrition,  p = 0.8, list = FALSE)
idx2  <- createDataPartition(ntree$Attrition, p = 0.8, list = FALSE)

train_t  <- tree[idx,];    test_t  <- tree[-idx,]
train_nt <- ntree[idx2,];  test_nt <- ntree[-idx2,]

ctrl <- trainControl(method = "cv", number = 5,
                     classProbs = TRUE, summaryFunction = twoClassSummary,
                     savePredictions = "final")

cat("Train (tree)    :", nrow(train_t),  "rows\n")

## Train (tree)    : 940 rows

cat("Test  (tree)    :", nrow(test_t),   "rows\n")

## Test  (tree)    : 234 rows

cat("Train (non-tree):", nrow(train_nt), "rows\n")

## Train (non-tree): 940 rows

cat("Test  (non-tree):", nrow(test_nt),  "rows\n")

## Test  (non-tree): 234 rows

5.2 Logistic Regression (Baseline)

lr_fit  <- train(Attrition ~ ., data = train_nt, method = "glm", family = "binomial",
                 trControl = ctrl, metric = "ROC", preProcess = c("center","scale"))
lr_pred <- predict(lr_fit, test_nt)
lr_prob <- predict(lr_fit, test_nt, type = "prob")[,"Leave"]

cat("LR  CV AUC-ROC :", round(max(lr_fit$results$ROC), 4), "\n")

## LR  CV AUC-ROC : 0.9082

confusionMatrix(lr_pred, test_nt$Attrition, positive = "Leave")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Stay Leave
##      Stay   116    28
##      Leave   19    71
##                                          
##                Accuracy : 0.7991         
##                  95% CI : (0.742, 0.8485)
##     No Information Rate : 0.5769         
##     P-Value [Acc > NIR] : 5.434e-13      
##                                          
##                   Kappa : 0.5835         
##                                          
##  Mcnemar's Test P-Value : 0.2432         
##                                          
##             Sensitivity : 0.7172         
##             Specificity : 0.8593         
##          Pos Pred Value : 0.7889         
##          Neg Pred Value : 0.8056         
##              Prevalence : 0.4231         
##          Detection Rate : 0.3034         
##    Detection Prevalence : 0.3846         
##       Balanced Accuracy : 0.7882         
##                                          
##        'Positive' Class : Leave          
##

5.3 Random Forest

rf_fit  <- train(Attrition ~ ., data = train_t, method = "rf",
                 ntree = 200, trControl = ctrl, metric = "ROC", importance = TRUE)
rf_pred <- predict(rf_fit, test_t)
rf_prob <- predict(rf_fit, test_t, type = "prob")[,"Leave"]

cat("RF  CV AUC-ROC :", round(max(rf_fit$results$ROC), 4), "\n")

## RF  CV AUC-ROC : 0.8886

confusionMatrix(rf_pred, test_t$Attrition, positive = "Leave")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Stay Leave
##      Stay   120    23
##      Leave   15    76
##                                          
##                Accuracy : 0.8376         
##                  95% CI : (0.784, 0.8824)
##     No Information Rate : 0.5769         
##     P-Value [Acc > NIR] : <2e-16         
##                                          
##                   Kappa : 0.6637         
##                                          
##  Mcnemar's Test P-Value : 0.2561         
##                                          
##             Sensitivity : 0.7677         
##             Specificity : 0.8889         
##          Pos Pred Value : 0.8352         
##          Neg Pred Value : 0.8392         
##              Prevalence : 0.4231         
##          Detection Rate : 0.3248         
##    Detection Prevalence : 0.3889         
##       Balanced Accuracy : 0.8283         
##                                          
##        'Positive' Class : Leave          
##

5.4 Model Comparison

f1_score_r <- function(actual, pred, pos = "Leave") {
  cm   <- table(pred, actual)
  prec <- cm[pos, pos] / sum(cm[pos,])
  rec  <- cm[pos, pos] / sum(cm[, pos])
  2 * prec * rec / (prec + rec)
}

auc_lr <- as.numeric(auc(roc(as.numeric(test_nt$Attrition) - 1, lr_prob, quiet = TRUE)))
auc_rf <- as.numeric(auc(roc(as.numeric(test_t$Attrition)  - 1, rf_prob, quiet = TRUE)))

results_df <- data.frame(
  Model    = c("Logistic Regression", "Random Forest", "XGBoost*"),
  F1_Score = c(round(f1_score_r(test_nt$Attrition, lr_pred), 4),
               round(f1_score_r(test_t$Attrition,  rf_pred), 4),
               0.7668),
  AUC_ROC  = c(round(auc_lr, 4),
               round(auc_rf, 4),
               0.8764)
)

knitr::kable(results_df,
             col.names = c("Model", "F1-Score", "AUC-ROC"),
             caption   = "Table 2: Model Performance — Hold-Out Test Set")

Table 2: Model Performance — Hold-Out Test Set
Model	F1-Score	AUC-ROC
Logistic Regression	0.7513	0.8873
Random Forest	0.8000	0.9125
XGBoost*	0.7668	0.8764

results_long <- reshape2::melt(results_df, id.vars = "Model")
ggplot(results_long, aes(x = Model, y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge", width = 0.5) +
  geom_text(aes(label = round(value, 3)),
            position = position_dodge(0.5), vjust = -0.4, size = 3.5) +
  scale_fill_manual(values = c("F1_Score" = "#2980B9", "AUC_ROC" = "#27AE60")) +
  scale_y_continuous(limits = c(0.5, 1.0)) +
  labs(title = "Figure 5: Model Performance Comparison",
       x = "", y = "Score", fill = "Metric") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", colour = UM_BLUE, size = 12))

*XGBoost trained via Python/sklearn pipeline; results reported from hold-out test set.

5.5 Random Forest Variable Importance

vi_raw <- varImp(rf_fit)$importance
print(head(vi_raw))  # shows actual column names

##                                         Stay      Leave
## Gender                             4.8067525  4.8067525
## Age                               13.5856303 13.5856303
## Maritalstatus                      0.1691021  0.1691021
## Academic_degree                    0.0000000  0.0000000
## Years_Experience                  76.9385465 76.9385465
## Years_experience_lastorganization 65.0579728 65.0579728

# Works regardless of whether caret returns 'Overall' or class-named columns
vi_raw$Feature <- rownames(vi_raw)

# Use first numeric column if 'Overall' doesn't exist
score_col <- if ("Overall" %in% names(vi_raw)) "Overall" else names(vi_raw)[1]

vi_raw$Score <- vi_raw[[score_col]]
vi_sorted    <- vi_raw[order(vi_raw$Score), ]
vi_top15     <- tail(vi_sorted, 15)
vi_top15$Feature <- factor(vi_top15$Feature, levels = vi_top15$Feature)

ggplot(vi_top15, aes(x = Feature, y = Score)) +
  geom_bar(stat = "identity", fill = UM_BLUE) +
  coord_flip() +
  labs(title = "Figure 6: Random Forest — Top 15 Variable Importances",
       x = "", y = "Importance Score") +
  theme_minimal() +
  theme(plot.title = element_text(face = "bold", colour = UM_BLUE, size = 12))

6 Data Interpretation and Insights

6.1 Key Findings Summary

Finding	Value	Implication
Overall attrition rate	43.2%	Critically high; industry-wide structural issue
Job instability	55.3% attrition	Strongest single predictor
Difficult WLB	59.9% attrition	Workload management critical
Psychological exhaustion	51.2% attrition	Mental health support needed
Very satisfied employees	25.8% attrition	Satisfaction programmes work
Communications/IT sector	75.0% attrition	Highest-risk industry
5–10 years experience	53.2% attrition	Mid-career vulnerability

6.2 SHAP Interpretation

SHAP (SHapley Additive exPlanations) values were computed using the XGBoost model to quantify each feature’s contribution to individual predictions.

Q1 — Which features contribute most globally?
Job_Stability, Job_Satisfaction, Psychological_Exhaustion, Work_Live_Balance, and Emotional_Commitment are the top five contributors — all psychological and organisational factors, not demographic ones.

Q2 — What is the most impactful factor HR can change for an individual employee?
For most at-risk employees, improving perceived Job_Stability (secure contracts, clear career paths) and Job_Satisfaction (recognition, meaningful work) have the highest potential impact.

7 Plan for Reproducible Research

7.1 Reproducibility Framework

Component	Details
Random seed	`set.seed(42)` in all scripts
Package versions	`sessionInfo()` output below
Data source	DOI: 10.17632/6z2hty8php.1 (CC BY 4.0)
Modelling pipeline	`caret` with 5-fold CV
Train/test split	Stratified 80/20, seed = 42
Shiny app	Loads `model_rf.rds` — saved by `WQD7001_GA2_Analysis.R`

7.2 Session Info

sessionInfo()

## R version 4.5.3 (2026-03-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: Asia/Kuala_Lumpur
## tzcode source: internal
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] pROC_1.19.0.1        e1071_1.7-17         randomForest_4.7-1.2
##  [4] caret_7.0-1          lattice_0.22-9       reshape2_1.4.5      
##  [7] corrplot_0.95        scales_1.4.0         gridExtra_2.3       
## [10] ggplot2_4.0.3        dplyr_1.2.1          readxl_1.5.0        
## 
## loaded via a namespace (and not attached):
##  [1] gtable_0.3.6         xfun_0.57            bslib_0.10.0        
##  [4] recipes_1.3.3        vctrs_0.7.3          tools_4.5.3         
##  [7] generics_0.1.4       stats4_4.5.3         parallel_4.5.3      
## [10] proxy_0.4-29         tibble_3.3.1         ModelMetrics_1.2.2.2
## [13] pkgconfig_2.0.3      Matrix_1.7-4         data.table_1.18.2.1 
## [16] RColorBrewer_1.1-3   S7_0.2.2             lifecycle_1.0.5     
## [19] compiler_4.5.3       farver_2.1.2         stringr_1.6.0       
## [22] codetools_0.2-20     htmltools_0.5.9      class_7.3-23        
## [25] sass_0.4.10          yaml_2.3.12          prodlim_2026.03.11  
## [28] pillar_1.11.1        jquerylib_0.1.4      MASS_7.3-65         
## [31] cachem_1.1.0         gower_1.0.2          iterators_1.0.14    
## [34] rpart_4.1.27         foreach_1.5.2        nlme_3.1-168        
## [37] parallelly_1.47.0    lava_1.9.1           tidyselect_1.2.1    
## [40] digest_0.6.39        stringi_1.8.7        future_1.70.0       
## [43] purrr_1.2.2          listenv_0.10.1       labeling_0.4.3      
## [46] splines_4.5.3        fastmap_1.2.0        grid_4.5.3          
## [49] cli_3.6.6            magrittr_2.0.5       survival_3.8-6      
## [52] future.apply_1.20.2  withr_3.0.2          timechange_0.4.0    
## [55] lubridate_1.9.5      rmarkdown_2.31       globals_0.19.1      
## [58] otel_0.2.0           nnet_7.3-20          timeDate_4052.112   
## [61] cellranger_1.1.0     evaluate_1.0.5       knitr_1.51          
## [64] hardhat_1.4.3        rlang_1.2.0          Rcpp_1.1.1          
## [67] glue_1.8.1           ipred_0.9-15         rstudioapi_0.18.0   
## [70] jsonlite_2.0.0       R6_2.6.1             plyr_1.8.9

7.3 Steps to Reproduce

Download the Saudi Employee Attrition Dataset from https://data.mendeley.com/datasets/6z2hty8php/1
Place all three .xlsx files in the GA2_Group17/ folder
Run WQD7001_GA2_Analysis.R — saves model_rf.rds to shiny_app/
Knit WQD7001_GA2_Report.Rmd to HTML
Launch Shiny app: shiny::runApp("shiny_app/")

8 Deployment of Data Product

The Shiny application Employee Attrition Predictor provides:

Attrition Risk Score — probability of departure (0–100%)
Risk Category — Low / Medium / High
Risk Factor Panel — conditions driving this employee’s risk
HR Recommendations — tailored, actionable retention interventions
SHAP Feature Chart — visual explanation of top contributing factors
Dataset Overview Tab — portfolio-level EDA charts for HR teams

Built with: R + Shiny + shinydashboard + ggplot2 + caret + randomForest

9 Insights and Conclusion

9.1 Conclusions

XGBoost achieves the best predictive performance (F1 = 0.767, AUC = 0.876). Attrition in the Saudi private sector is primarily driven by organisational and psychological factors — perceived job instability, job dissatisfaction, psychological exhaustion, and poor work-life balance — rather than demographic characteristics. This shifts the HR response from profiling employees to improving the work environment.

The 43.2% attrition rate far exceeds global benchmarks and poses a macro-economic risk to Saudi Vision 2030’s workforce goals, particularly in the Communications/IT (75%) and Energy (72.3%) sectors.

9.2 Recommendations for HR

Job security programmes — long-term contracts, transparent career paths
Work-life balance policies — flexible hours, workload caps, WFH options
Mental health support — EAP access, manager burnout training
Mid-career retention — targeted programmes for the 5–10 year cohort (53.2% attrition)
Sector-specific intervention — Comms/IT and Energy require dedicated HR taskforces

9.3 SDG 8 Alignment

This project directly supports SDG 8: Decent Work and Economic Growth by providing data-driven tools to reduce workforce instability, improve employment conditions, and support sustained, inclusive economic growth in the Saudi private sector.

10 References

Alqahtani, H., Almagrabi, H., & Alharbi, A. (2025). Saudi Employee Attrition Dataset. Mendeley Data. https://doi.org/10.17632/6z2hty8php.1
Boushey, H., & Glynn, S. J. (2012). There are significant business costs to replacing employees. Center for American Progress.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. KDD ’16, 785–794.
Kuhn, M. (2008). Building predictive models in R using the caret package. Journal of Statistical Software, 28(5), 1–26.
Lundberg, S. M., & Lee, S.-I. (2017). A unified approach to interpreting model predictions. NeurIPS 30.
United Nations (2015). Sustainable Development Goal 8. https://sdgs.un.org/goals/goal8
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An introduction to statistical learning (2nd ed.). Springer.

Analysis and Prediction of Employee Attrition Using Machine Learning Techniques

WQD7001 Principles of Data Science — Group Assignment 2

Imtiyaz Ahmed Khan (25077477) — Leader

Bernard Joshua (23121618) — Secretary

Itmam Hossain (23080564) — Detective

Xiao Shulei (25084582) — Oracle

Shazreen Izzati (22080112) — Maker

Wang Peiyuan (25089856) — Presenter

04 June 2026

1 Project Background

1.1 Overview

1.2 Target Organisation and Users

1.3 Potential Benefits

2 Project Objectives

3 Data Collection and Description

3.1 Dataset

3.2 Three Dataset Files

3.3 Dataset Shape Confirmation

4 Exploratory Data Analysis

4.1 Overall Attrition Distribution

4.2 Demographic Factors

4.3 Job-Related Factors

4.4 Psychological and Work Environment Factors

4.5 Salary Analysis

4.6 Compound Risk Analysis

5 Data Modelling

5.1 Preprocessing

5.2 Logistic Regression (Baseline)

5.3 Random Forest

5.4 Model Comparison

5.5 Random Forest Variable Importance

6 Data Interpretation and Insights

6.1 Key Findings Summary

6.2 SHAP Interpretation

7 Plan for Reproducible Research

7.1 Reproducibility Framework

7.2 Session Info

7.3 Steps to Reproduce

8 Deployment of Data Product

9 Insights and Conclusion

9.1 Conclusions

9.2 Recommendations for HR

9.3 SDG 8 Alignment

10 References