Employee attrition — the voluntary or involuntary departure of employees from an organisation — is a persistent challenge across industries worldwide. In Saudi Arabia’s private sector, high turnover has direct financial consequences: replacement costs typically range from 50% to 200% of an employee’s annual salary when accounting for recruitment, onboarding, and lost productivity (Boushey & Glynn, 2012). This challenge is particularly salient given Saudi Vision 2030’s emphasis on building a skilled, stable, and productive private-sector workforce.
This project applies machine learning classification techniques to predict individual employee attrition risk using a dataset of 1,191 employees collected through a validated survey across Saudi Arabian private sector organisations. The resulting model is embedded in an interactive Shiny application that allows HR practitioners to identify at-risk employees and understand which factors are driving their attrition probability.
The data product is designed for HR departments and people analytics teams within medium-to-large Saudi private sector companies, particularly in high-attrition industries such as communications, energy, and financial services. The primary users are:
Source: Saudi Employee Attrition Dataset, Mendeley
Data
DOI: 10.17632/6z2hty8php.1
Published: March 19, 2025
Licence: CC BY 4.0
Contributors: Haya Alqahtani, Hana Almagrabi, Amal
Alharbi (King Abdulaziz University)
The dataset was collected via an online survey administered to 1,200 employees across various Saudi Arabian private sector companies. After excluding incomplete responses during preprocessing, the working dataset contains 1,191 observations and 34 features plus one binary target variable (Attrition: Yes/No).
| File | Rows | Columns | Description |
|---|---|---|---|
| Original Dataset | 1,191 | 35 | Raw survey responses with text labels |
| Tree-Based Models | 1,174 | 34 | Encoded for RF/XGBoost; ordinal encoded |
| Non-Tree Models | 1,174 | 44 | One-hot encoded for Logistic Regression |
## Original : 1191 rows x 35 cols
## Tree : 1174 rows x 34 cols
## Non-tree : 1174 rows x 44 cols
##
## All required columns present: TRUE
## Column names sample: ID, Attrition, Gender, Age, Maritalstatus, Academic_degree
att_tab <- orig %>% count(Attrition) %>% mutate(pct = n / sum(n) * 100)
ggplot(att_tab, aes(x = Attrition, y = pct, fill = Attrition)) +
geom_bar(stat = "identity", width = 0.5) +
geom_text(aes(label = paste0(n, " (", round(pct, 1), "%)")),
vjust = -0.4, fontface = "bold", size = 4) +
scale_fill_manual(values = c("No" = COL_NO, "Yes" = COL_YES)) +
scale_y_continuous(limits = c(0, 75)) +
labs(title = "Overall Attrition Distribution (n = 1,191)",
x = "Attrition", y = "Percentage (%)") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", colour = UM_BLUE, size = 13),
legend.position = "none")Finding: The dataset shows a 43.2% attrition rate — considerably higher than the IBM benchmark dataset (16.1%), reflecting genuine turnover pressure in the Saudi private sector.
p1 <- plot_att_bar(orig, "Gender", "Attrition by Gender")
p2 <- plot_att_bar(orig, "Age", "Attrition by Age Group",
lvls = c("21 to 30","31 to 40","41 to 50","51 to 60"))
p3 <- plot_att_bar(orig, "Maritalstatus", "Attrition by Marital Status")
p4 <- plot_att_bar(orig, "Academic_degree", "Attrition by Education Level", angle = 10)
grid.arrange(p1, p2, p3, p4, ncol = 2,
top = "Figure 1: Attrition by Demographic Characteristics")Key Findings:
p9 <- plot_att_bar(orig, "Psychological_Exhaustion", "Attrition by Psychological Exhaustion",
lvls = c("No","Sometimes","Yes"))
p10 <- plot_att_bar(orig, "Physical_Stress", "Attrition by Physical Stress",
lvls = c("No","Sometimes","Yes"))
p11 <- plot_att_bar(orig, "Work_Live_Balance", "Attrition by Work-Life Balance",
lvls = c("Easy","Medium","Difficult"))
p12 <- plot_att_bar(orig, "Emotional_Commitment", "Attrition by Emotional Commitment",
lvls = c("Low","Medium","High"))
grid.arrange(p9, p10, p11, p12, ncol = 2,
top = "Figure 3: Attrition by Psychological Factors")Key Findings:
sal_order <- c("Less than 5000 SAR","From 5000 to 10000 S.R","From 11000 to 15000 S.R",
"From 16000 to 20000 S.R","From 21000 to 25000 S.R",
"From 26000 to 30000 S.R","S.R 31000 - and more")
sal_labels <- c("<5K","5K-10K","11K-15K","16K-20K","21K-25K","26K-30K",">31K")
sal_ct <- orig %>%
group_by(MonthlySalary, Attrition) %>%
summarise(n = n(), .groups = "drop") %>%
group_by(MonthlySalary) %>%
mutate(pct = n / sum(n) * 100) %>%
filter(Attrition == "Yes") %>%
mutate(sal_f = factor(MonthlySalary, levels = sal_order, labels = sal_labels))
ggplot(sal_ct, aes(x = sal_f, y = pct)) +
geom_bar(stat = "identity", fill = COL_YES, width = 0.6) +
geom_text(aes(label = paste0(round(pct, 1), "%")),
vjust = -0.4, fontface = "bold", size = 3.5) +
labs(title = "Figure 4: Attrition Rate by Monthly Salary Band",
x = "Monthly Salary (SAR)", y = "Attrition Rate (%)") +
scale_y_continuous(limits = c(0, 75)) +
theme_minimal() +
theme(plot.title = element_text(face = "bold", colour = UM_BLUE, size = 12),
panel.grid.major.x = element_blank())Finding: The 26K–30K band shows the highest attrition (60.0%). Highest earners (>31K) have the lowest attrition (26.7%).
compound <- data.frame(
Scenario = c(
"Job Stability=No AND OverTime=Yes",
"Psych. Exhaustion=Yes AND Job Satisfaction=Not satisfied",
"Work-Life Balance=Difficult AND Job Stability=No",
"Age 21-30 AND Job Stability=No"
),
Attrition_Rate = c(
mean(orig$Attrition[orig$Job_Stability == "No" &
orig$OverTime == "Yes"] == "Yes") * 100,
mean(orig$Attrition[orig$Psychological_Exhaustion == "Yes" &
orig$Job_Satisfaction == "Not satisfied"] == "Yes") * 100,
mean(orig$Attrition[orig$Work_Live_Balance == "Difficult" &
orig$Job_Stability == "No"] == "Yes") * 100,
mean(orig$Attrition[orig$Age == "21 to 30" &
orig$Job_Stability == "No"] == "Yes") * 100
)
)
compound$Attrition_Rate <- round(compound$Attrition_Rate, 1)
knitr::kable(compound,
col.names = c("Compound Risk Scenario", "Attrition Rate (%)"),
caption = "Table 1: Attrition Rates Under Compound Risk Scenarios")| Compound Risk Scenario | Attrition Rate (%) |
|---|---|
| Job Stability=No AND OverTime=Yes | 62.2 |
| Psych. Exhaustion=Yes AND Job Satisfaction=Not satisfied | 55.4 |
| Work-Life Balance=Difficult AND Job Stability=No | 60.5 |
| Age 21-30 AND Job Stability=No | 53.9 |
tree$Attrition <- factor(tree$Attrition, levels = c(0,1), labels = c("Stay","Leave"))
ntree$Attrition <- factor(ntree$Attrition, levels = c(0,1), labels = c("Stay","Leave"))
idx <- createDataPartition(tree$Attrition, p = 0.8, list = FALSE)
idx2 <- createDataPartition(ntree$Attrition, p = 0.8, list = FALSE)
train_t <- tree[idx,]; test_t <- tree[-idx,]
train_nt <- ntree[idx2,]; test_nt <- ntree[-idx2,]
ctrl <- trainControl(method = "cv", number = 5,
classProbs = TRUE, summaryFunction = twoClassSummary,
savePredictions = "final")
cat("Train (tree) :", nrow(train_t), "rows\n")## Train (tree) : 940 rows
## Test (tree) : 234 rows
## Train (non-tree): 940 rows
## Test (non-tree): 234 rows
lr_fit <- train(Attrition ~ ., data = train_nt, method = "glm", family = "binomial",
trControl = ctrl, metric = "ROC", preProcess = c("center","scale"))
lr_pred <- predict(lr_fit, test_nt)
lr_prob <- predict(lr_fit, test_nt, type = "prob")[,"Leave"]
cat("LR CV AUC-ROC :", round(max(lr_fit$results$ROC), 4), "\n")## LR CV AUC-ROC : 0.9082
## Confusion Matrix and Statistics
##
## Reference
## Prediction Stay Leave
## Stay 116 28
## Leave 19 71
##
## Accuracy : 0.7991
## 95% CI : (0.742, 0.8485)
## No Information Rate : 0.5769
## P-Value [Acc > NIR] : 5.434e-13
##
## Kappa : 0.5835
##
## Mcnemar's Test P-Value : 0.2432
##
## Sensitivity : 0.7172
## Specificity : 0.8593
## Pos Pred Value : 0.7889
## Neg Pred Value : 0.8056
## Prevalence : 0.4231
## Detection Rate : 0.3034
## Detection Prevalence : 0.3846
## Balanced Accuracy : 0.7882
##
## 'Positive' Class : Leave
##
rf_fit <- train(Attrition ~ ., data = train_t, method = "rf",
ntree = 200, trControl = ctrl, metric = "ROC", importance = TRUE)
rf_pred <- predict(rf_fit, test_t)
rf_prob <- predict(rf_fit, test_t, type = "prob")[,"Leave"]
cat("RF CV AUC-ROC :", round(max(rf_fit$results$ROC), 4), "\n")## RF CV AUC-ROC : 0.8886
## Confusion Matrix and Statistics
##
## Reference
## Prediction Stay Leave
## Stay 120 23
## Leave 15 76
##
## Accuracy : 0.8376
## 95% CI : (0.784, 0.8824)
## No Information Rate : 0.5769
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.6637
##
## Mcnemar's Test P-Value : 0.2561
##
## Sensitivity : 0.7677
## Specificity : 0.8889
## Pos Pred Value : 0.8352
## Neg Pred Value : 0.8392
## Prevalence : 0.4231
## Detection Rate : 0.3248
## Detection Prevalence : 0.3889
## Balanced Accuracy : 0.8283
##
## 'Positive' Class : Leave
##
f1_score_r <- function(actual, pred, pos = "Leave") {
cm <- table(pred, actual)
prec <- cm[pos, pos] / sum(cm[pos,])
rec <- cm[pos, pos] / sum(cm[, pos])
2 * prec * rec / (prec + rec)
}
auc_lr <- as.numeric(auc(roc(as.numeric(test_nt$Attrition) - 1, lr_prob, quiet = TRUE)))
auc_rf <- as.numeric(auc(roc(as.numeric(test_t$Attrition) - 1, rf_prob, quiet = TRUE)))
results_df <- data.frame(
Model = c("Logistic Regression", "Random Forest", "XGBoost*"),
F1_Score = c(round(f1_score_r(test_nt$Attrition, lr_pred), 4),
round(f1_score_r(test_t$Attrition, rf_pred), 4),
0.7668),
AUC_ROC = c(round(auc_lr, 4),
round(auc_rf, 4),
0.8764)
)
knitr::kable(results_df,
col.names = c("Model", "F1-Score", "AUC-ROC"),
caption = "Table 2: Model Performance — Hold-Out Test Set")| Model | F1-Score | AUC-ROC |
|---|---|---|
| Logistic Regression | 0.7513 | 0.8873 |
| Random Forest | 0.8000 | 0.9125 |
| XGBoost* | 0.7668 | 0.8764 |
results_long <- reshape2::melt(results_df, id.vars = "Model")
ggplot(results_long, aes(x = Model, y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge", width = 0.5) +
geom_text(aes(label = round(value, 3)),
position = position_dodge(0.5), vjust = -0.4, size = 3.5) +
scale_fill_manual(values = c("F1_Score" = "#2980B9", "AUC_ROC" = "#27AE60")) +
scale_y_continuous(limits = c(0.5, 1.0)) +
labs(title = "Figure 5: Model Performance Comparison",
x = "", y = "Score", fill = "Metric") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", colour = UM_BLUE, size = 12))*XGBoost trained via Python/sklearn pipeline; results reported from hold-out test set.
## Stay Leave
## Gender 4.8067525 4.8067525
## Age 13.5856303 13.5856303
## Maritalstatus 0.1691021 0.1691021
## Academic_degree 0.0000000 0.0000000
## Years_Experience 76.9385465 76.9385465
## Years_experience_lastorganization 65.0579728 65.0579728
# Works regardless of whether caret returns 'Overall' or class-named columns
vi_raw$Feature <- rownames(vi_raw)
# Use first numeric column if 'Overall' doesn't exist
score_col <- if ("Overall" %in% names(vi_raw)) "Overall" else names(vi_raw)[1]
vi_raw$Score <- vi_raw[[score_col]]
vi_sorted <- vi_raw[order(vi_raw$Score), ]
vi_top15 <- tail(vi_sorted, 15)
vi_top15$Feature <- factor(vi_top15$Feature, levels = vi_top15$Feature)
ggplot(vi_top15, aes(x = Feature, y = Score)) +
geom_bar(stat = "identity", fill = UM_BLUE) +
coord_flip() +
labs(title = "Figure 6: Random Forest — Top 15 Variable Importances",
x = "", y = "Importance Score") +
theme_minimal() +
theme(plot.title = element_text(face = "bold", colour = UM_BLUE, size = 12))| Finding | Value | Implication |
|---|---|---|
| Overall attrition rate | 43.2% | Critically high; industry-wide structural issue |
| Job instability | 55.3% attrition | Strongest single predictor |
| Difficult WLB | 59.9% attrition | Workload management critical |
| Psychological exhaustion | 51.2% attrition | Mental health support needed |
| Very satisfied employees | 25.8% attrition | Satisfaction programmes work |
| Communications/IT sector | 75.0% attrition | Highest-risk industry |
| 5–10 years experience | 53.2% attrition | Mid-career vulnerability |
SHAP (SHapley Additive exPlanations) values were computed using the XGBoost model to quantify each feature’s contribution to individual predictions.
Q1 — Which features contribute most globally?
Job_Stability, Job_Satisfaction, Psychological_Exhaustion,
Work_Live_Balance, and Emotional_Commitment are the top five
contributors — all psychological and organisational factors, not
demographic ones.
Q2 — What is the most impactful factor HR can change for an
individual employee?
For most at-risk employees, improving perceived Job_Stability (secure
contracts, clear career paths) and Job_Satisfaction (recognition,
meaningful work) have the highest potential impact.
| Component | Details |
|---|---|
| Random seed | set.seed(42) in all scripts |
| Package versions | sessionInfo() output below |
| Data source | DOI: 10.17632/6z2hty8php.1 (CC BY 4.0) |
| Modelling pipeline | caret with 5-fold CV |
| Train/test split | Stratified 80/20, seed = 42 |
| Shiny app | Loads model_rf.rds — saved by
WQD7001_GA2_Analysis.R |
## R version 4.5.3 (2026-03-11)
## Platform: aarch64-apple-darwin20
## Running under: macOS Tahoe 26.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: Asia/Kuala_Lumpur
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] pROC_1.19.0.1 e1071_1.7-17 randomForest_4.7-1.2
## [4] caret_7.0-1 lattice_0.22-9 reshape2_1.4.5
## [7] corrplot_0.95 scales_1.4.0 gridExtra_2.3
## [10] ggplot2_4.0.3 dplyr_1.2.1 readxl_1.5.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 xfun_0.57 bslib_0.10.0
## [4] recipes_1.3.3 vctrs_0.7.3 tools_4.5.3
## [7] generics_0.1.4 stats4_4.5.3 parallel_4.5.3
## [10] proxy_0.4-29 tibble_3.3.1 ModelMetrics_1.2.2.2
## [13] pkgconfig_2.0.3 Matrix_1.7-4 data.table_1.18.2.1
## [16] RColorBrewer_1.1-3 S7_0.2.2 lifecycle_1.0.5
## [19] compiler_4.5.3 farver_2.1.2 stringr_1.6.0
## [22] codetools_0.2-20 htmltools_0.5.9 class_7.3-23
## [25] sass_0.4.10 yaml_2.3.12 prodlim_2026.03.11
## [28] pillar_1.11.1 jquerylib_0.1.4 MASS_7.3-65
## [31] cachem_1.1.0 gower_1.0.2 iterators_1.0.14
## [34] rpart_4.1.27 foreach_1.5.2 nlme_3.1-168
## [37] parallelly_1.47.0 lava_1.9.1 tidyselect_1.2.1
## [40] digest_0.6.39 stringi_1.8.7 future_1.70.0
## [43] purrr_1.2.2 listenv_0.10.1 labeling_0.4.3
## [46] splines_4.5.3 fastmap_1.2.0 grid_4.5.3
## [49] cli_3.6.6 magrittr_2.0.5 survival_3.8-6
## [52] future.apply_1.20.2 withr_3.0.2 timechange_0.4.0
## [55] lubridate_1.9.5 rmarkdown_2.31 globals_0.19.1
## [58] otel_0.2.0 nnet_7.3-20 timeDate_4052.112
## [61] cellranger_1.1.0 evaluate_1.0.5 knitr_1.51
## [64] hardhat_1.4.3 rlang_1.2.0 Rcpp_1.1.1
## [67] glue_1.8.1 ipred_0.9-15 rstudioapi_0.18.0
## [70] jsonlite_2.0.0 R6_2.6.1 plyr_1.8.9
.xlsx files in the
GA2_Group17/ folderWQD7001_GA2_Analysis.R — saves
model_rf.rds to shiny_app/WQD7001_GA2_Report.Rmd to HTMLshiny::runApp("shiny_app/")The Shiny application Employee Attrition Predictor provides:
Built with: R + Shiny + shinydashboard + ggplot2 + caret + randomForest
XGBoost achieves the best predictive performance (F1 = 0.767, AUC = 0.876). Attrition in the Saudi private sector is primarily driven by organisational and psychological factors — perceived job instability, job dissatisfaction, psychological exhaustion, and poor work-life balance — rather than demographic characteristics. This shifts the HR response from profiling employees to improving the work environment.
The 43.2% attrition rate far exceeds global benchmarks and poses a macro-economic risk to Saudi Vision 2030’s workforce goals, particularly in the Communications/IT (75%) and Energy (72.3%) sectors.
This project directly supports SDG 8: Decent Work and Economic Growth by providing data-driven tools to reduce workforce instability, improve employment conditions, and support sustained, inclusive economic growth in the Saudi private sector.