Data607Final

Author

XiaoFei Mei

Published

May 10, 2026

1. Introduction

In this final project, my goal is to use random forest and gradient boosted machines(GBM) to predict the daily productivity of factory work teams from historical operational records. The dataset was covering January–March 2015 publised in Kaggle website. Our target variable is “actual_productivity”.

From RF and GBM, we train and compare two supervised regression modele. Both models are tuned via 5-fold cross-validation and evaluated on a set 20% test set using RMSE, MAE, and R².

At the end, we will determin which model performs better and based on our finding to provide recommendation to factory manager on how to improve productivity effectively.

2. Data Loading & Inspection

Code
df_raw <- read.csv(PATH_RAW, stringsAsFactors = FALSE) #the csv file downloded from Kaggle website is saved in the data folder of this project.

kable(head(df_raw, 6)) #test connection
date quarter department day team targeted_productivity smv wip over_time incentive idle_time idle_men no_of_style_change no_of_workers actual_productivity
1/1/2015 Quarter1 sweing Thursday 8 0.80 26.16 1108 7080 98 0 0 0 59.0 0.9407254
1/1/2015 Quarter1 finishing Thursday 1 0.75 3.94 NA 960 0 0 0 0 8.0 0.8865000
1/1/2015 Quarter1 sweing Thursday 11 0.80 11.41 968 3660 50 0 0 0 30.5 0.8005705
1/1/2015 Quarter1 sweing Thursday 12 0.80 11.41 968 3660 50 0 0 0 30.5 0.8005705
1/1/2015 Quarter1 sweing Thursday 6 0.80 25.90 1170 1920 50 0 0 0 56.0 0.8003819
1/1/2015 Quarter1 sweing Thursday 7 0.80 25.90 984 6720 38 0 0 0 56.0 0.8001250

3. Data Cleaning & Feature Engineering

To prepare the data for modelling:

  • Standardise column names — lowercase, underscores instead of spaces
  • Drop date — quarter and day already capture all temporal structure
  • Trim whitespace in categorical columns and convert to factors
  • Engineer three new features to improve model signal
Code
df <- df_raw |>
  rename_with(~ tolower(trimws(gsub("\\s+", "_", .)))) |>
  select(-any_of("date")) |>
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) |>
  filter(!is.na(actual_productivity)) |>
  mutate(
    department           = as.factor(trimws(department)),
    day                  = as.factor(trimws(day)),
    quarter              = as.factor(trimws(quarter)),
    overtime_per_worker  = ifelse(no_of_workers > 0,
                                  over_time / no_of_workers, 0),
    incentive_per_worker = ifelse(no_of_workers > 0,
                                  incentive  / no_of_workers, 0),
    productivity_gap     = actual_productivity - targeted_productivity
  )

Engineered features and their rationale:

Feature:

  • overtime_per_worker : Normalises overtime by team size — removes scale bias between large and small teams
  • incentive_per_worker : Same normalisation for incentive — isolates per-person motivation signal
  • productivity_gap : Actual minus targeted — used in EDA only, dropped before modelling to prevent target leakage

4. Exploratory Data Analysis

4.1 Target Variable Distribution

Code
ggplot(df, aes(actual_productivity)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "#4C72B0", colour = "white", alpha = .85) +
  geom_density(colour = "#C44E52", linewidth = 1.1) +
  labs(title = "Distribution of Actual Productivity",
       x = "Actual Productivity", y = "Density") +
  theme_minimal(base_size = 13)

From the chart above, looks like the target has a skewed distribution concentrated between 0.70 and 0.85. The sharp spike near 0.80 reflects most frequent occurance. The long left tail represents rare but significant underperformance events that will be hardest to predict.

Code
summary(df$actual_productivity) |> 
  t() |> 
  kable(caption = "productivity distribution records", digits = 4)
productivity distribution records
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.2337 0.6503 0.7733 0.7351 0.8503 1.1204

4.2 Actual vs Targeted Productivity

Code
ggplot(df, aes(targeted_productivity, actual_productivity,
               colour = department)) +
  geom_point(alpha = .5, size = 1.8) +
  geom_abline(linetype = "dashed", colour = "grey30", linewidth = .8) +
  scale_colour_viridis_d(option = "D") +
  labs(title = "Actual vs Targeted Productivity by Department",
       subtitle = "Points above the dashed line exceeded their target",
       x = "Targeted Productivity", y = "Actual Productivity",
       colour = "Department") +
  theme_minimal(base_size = 13)

Points above the dashed line exceeded targets; those below fell short.

4.3 Productivity by Department & Day

Code
p_dept <- ggplot(df, aes(department, actual_productivity, fill = department)) +
  geom_boxplot(alpha = .8) +
  scale_fill_viridis_d(option = "C") +
  labs(title = "By Department", x = NULL, y = "Actual Productivity") +
  theme_minimal() +
  theme(legend.position = "none")

day_lvl <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
df$day  <- factor(df$day, levels = intersect(day_lvl, levels(df$day)))

p_day <- ggplot(df, aes(day, actual_productivity, fill = day)) +
  geom_violin(alpha = .7, trim = FALSE) +
  geom_boxplot(width = .1, fill = "white", outlier.size = .8) +
  scale_fill_viridis_d(option = "E") +
  labs(title = "By Day of Week", x = NULL, y = NULL) +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1))

p_dept + p_day +
  plot_annotation(
    title = "Productivity Distribution by Department and Day",
    theme = theme(plot.title = element_text(size = 14, face = "bold"))
  )

Finishing dedpartment shows slightly higher and more consistent productivity given narrower IQR.

4.4 Key Feature Relationships

Code
p5 <- ggplot(df, aes(incentive_per_worker, actual_productivity,
                      colour = department)) +
  geom_point(alpha = .4, size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_colour_viridis_d(option = "D") +
  labs(title = "Incentive per Worker vs Productivity",
       x = "Incentive per Worker (BDT)", y = "Actual Productivity",
       colour = "Dept") +
  theme_minimal(base_size = 12)

p6 <- ggplot(df, aes(smv, actual_productivity)) +
  geom_point(aes(colour = department), alpha = .4, size = 1.5) +
  geom_smooth(method = "loess", se = TRUE, colour = "grey20") +
  scale_colour_viridis_d(option = "D") +
  labs(title = "SMV vs Actual Productivity",
       x = "Standard Minute Value", y = "Actual Productivity",
       colour = "Dept") +
  theme_minimal(base_size = 12)

p7 <- ggplot(df, aes(no_of_workers, actual_productivity)) +
  geom_point(aes(colour = department), alpha = .4, size = 1.5) +
  geom_smooth(method = "loess", se = TRUE, colour = "grey20") +
  scale_colour_viridis_d(option = "D") +
  labs(title = "Team Size vs Productivity",
       x = "Number of Workers", y = "Actual Productivity",
       colour = "Dept") +
  theme_minimal(base_size = 12)

p8 <- ggplot(df, aes(overtime_per_worker, actual_productivity,
                      colour = department)) +
  geom_point(alpha = .4, size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_colour_viridis_d(option = "D") +
  labs(title = "Overtime per Worker vs Productivity",
       x = "Overtime per Worker (mins)", y = "Actual Productivity",
       colour = "Dept") +
  theme_minimal(base_size = 12)

(p5 + p6) / (p7 + p8) +
  plot_annotation(
    title = "Key Feature Relationships with Productivity",
    theme = theme(plot.title = element_text(size = 14, face = "bold"))
  )

From above charts key observations:

  • Incentive has a clear positive association — higher bonuses correlate with better output
  • SMV shows diminishing returns — very high workload per piece hurts productivity
  • Team size plateaus — very large teams (>60) gain no further productivity advantage
  • Overtime has a complex pattern — moderate overtime may reflect dedication; extreme overtime suggests fatigue

4.5 Correlation Matrix

Code
num_cols <- df |> select(where(is.numeric)) |> names()
cor_mat  <- cor(df[, num_cols], use = "pairwise.complete.obs")

corrplot(cor_mat,
         method      = "color",
         type        = "upper",
         tl.cex      = .72,
         tl.col      = "black",
         col         = colorRampPalette(c("#C44E52", "white", "#4C72B0"))(200),
         addCoef.col = "black",
         number.cex  = .52,
         title       = "Numeric Feature Correlation Matrix",
         mar         = c(0, 0, 2, 0))


5. Modelling Pipeline

Feature Matrix

Categorical variables (department, day, quarter) are one-hot encoded via model.matrix(). The productivity_gap column is explicitly dropped — it is derived from the target and would cause data leakage if included.

Code
target <- df$actual_productivity

feat_mat <- df |>
  select(-actual_productivity, -productivity_gap) |>
  mutate(across(where(is.factor), as.character)) |>
  (\(x) model.matrix(~ . - 1, data = x))() |>
  as.data.frame()

cat("Feature matrix:", nrow(feat_mat), "rows x", ncol(feat_mat), "columns\n")
Feature matrix: 1197 rows x 23 columns

Train / Test Split

Code
idx     <- createDataPartition(target, p = .80, list = FALSE)
X_train <- feat_mat[ idx, ]; y_train <- target[ idx]
X_test  <- feat_mat[-idx, ]; y_test  <- target[-idx]

cat(sprintf("Train: %d rows  |  Test: %d rows\n",
            nrow(X_train), nrow(X_test)))
Train: 959 rows  |  Test: 238 rows
Code
ctrl <- trainControl(method          = "cv",
                     number          = 5,
                     verboseIter     = FALSE,
                     savePredictions = "final")

An 80/20 stratified split ensures both sets have similar productivity distributions. 5-fold cross-validation on the training set selects the best hyperparameters.


6. Model 1 — Random Forest

Code
rf_model <- train(
  x          = X_train,
  y          = y_train,
  method     = "rf",
  trControl  = ctrl,
  tuneGrid   = expand.grid(mtry = c(3, 5, 7, 10)),
  ntree      = 300,
  importance = TRUE
)

cat("Best mtry:", rf_model$bestTune$mtry, "\n")
Best mtry: 7 
Code
kable(rf_model$results, digits = 5,
      caption = "Random Forest — CV results across mtry values")
Random Forest — CV results across mtry values
mtry RMSE Rsquared MAE RMSESD RsquaredSD MAESD
3 0.12849 0.48414 0.08521 0.00924 0.06349 0.00394
5 0.12717 0.48554 0.08173 0.01072 0.07345 0.00440
7 0.12699 0.48532 0.08040 0.01100 0.07349 0.00447
10 0.12804 0.47705 0.08010 0.01175 0.07830 0.00502
Code
ggplot(rf_model) +
  labs(title = "Random Forest — 5-Fold CV RMSE by mtry",
       x     = "mtry (features considered at each split)",
       y     = "Cross-Validation RMSE") +
  theme_minimal(base_size = 13)

The U-shaped curve maybe caused by too few features → high bias (underfitting); too many → correlated trees (overfitting).

Code
rf_pred_train <- predict(rf_model, X_train)
rf_pred_test  <- predict(rf_model, X_test)

7. Model 2 — Gradient Boosted Machines (GBM)

Key hyperparameters tuned(n.trees): 100, 200, 300. Number of sequential boosting iterations (interaction.depth): 2, 3, 5. Maximum tree depth (shrinkage): 0.05, 0.10. and lastly learning rate — smaller = more conservative.

Code
gbm_model <- train(
  x         = X_train,
  y         = y_train,
  method    = "gbm",
  trControl = ctrl,
  tuneGrid  = expand.grid(
    n.trees           = c(100, 200, 300),
    interaction.depth = c(2, 3, 5),
    shrinkage         = c(0.05, 0.1),
    n.minobsinnode    = 10
  ),
  verbose = FALSE
)

cat("Best GBM parameters:\n")
Best GBM parameters:
Code
kable(gbm_model$bestTune, caption = "GBM — Best hyperparameters from CV")
GBM — Best hyperparameters from CV
n.trees interaction.depth shrinkage n.minobsinnode
8 200 5 0.05 10
Code
gbm_pred_train <- predict(gbm_model, X_train)
gbm_pred_test  <- predict(gbm_model, X_test)

8. Model Evaluation

Evaluation Metrics

Three complementary metrics are used to compare the models:

Metric Formula Interpretation
RMSE √mean((y − ŷ)²) Penalises large errors heavily; same units as target. Lower is better.
MAE mean(|y − ŷ|) Average absolute error; more robust to outliers. Lower is better.
1 − SS_res/SS_tot Proportion of variance explained; 1 = perfect fit. Higher is better.
Code
make_metrics <- function(label, y_tr, pr_tr, y_te, pr_te) {
  tibble(
    Model = label,
    Set   = c("Train", "Test"),
    RMSE  = c(rmse(y_tr, pr_tr), rmse(y_te, pr_te)),
    MAE   = c(mae(y_tr,  pr_tr), mae(y_te,  pr_te)),
    R2    = c(cor(y_tr,  pr_tr)^2, cor(y_te, pr_te)^2)
  )
}

all_metrics <- bind_rows(
  make_metrics("Random Forest",
               y_train, rf_pred_train, y_test, rf_pred_test),
  make_metrics("GBM",
               y_train, gbm_pred_train, y_test, gbm_pred_test)
)
test_metrics <- filter(all_metrics, Set == "Test")

kable(all_metrics, digits = 5,
      caption = "Train and Test Metrics for Both Models",
      col.names = c("Model", "Set", "RMSE", "MAE", "R²"))
Train and Test Metrics for Both Models
Model Set RMSE MAE
Random Forest Train 0.06789 0.04141 0.87783
Random Forest Test 0.10268 0.06551 0.61602
GBM Train 0.10720 0.07092 0.63749
GBM Test 0.10834 0.07388 0.57261

The gap between train and test metrics indicates moderate overfitting — typical for ensemble tree methods on small datasets. Random Forest shows slightly better than GBM here.

Actual vs Predicted

Code
pred_df <- tibble(
  Actual         = y_test,
  `Random Forest` = rf_pred_test,
  GBM             = gbm_pred_test
) |> pivot_longer(-Actual, names_to = "Model", values_to = "Predicted")

ggplot(pred_df, aes(Actual, Predicted, colour = Model)) +
  geom_point(alpha = .45, size = 1.8) +
  geom_abline(linetype = "dashed", colour = "grey30") +
  facet_wrap(~ Model) +
  scale_colour_manual(
    values = c("Random Forest" = "#4C72B0", "GBM" = "#DD8452")) +
  labs(title    = "Actual vs Predicted Productivity — Test Set",
       subtitle = "Points on the dashed line indicate perfect prediction",
       x = "Actual", y = "Predicted") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Both models track well through the 0.5–0.9 range where most observations shows.

Metrics Comparison

Code
test_metrics |>
  pivot_longer(c(RMSE, MAE, R2), names_to = "Metric", values_to = "Value") |>
  ggplot(aes(Model, Value, fill = Model)) +
  geom_col(width = .55, alpha = .85) +
  geom_text(aes(label = round(Value, 4)), vjust = -.4, size = 4) +
  facet_wrap(~ Metric, scales = "free_y") +
  scale_fill_manual(
    values = c("Random Forest" = "#4C72B0", "GBM" = "#DD8452")) +
  labs(title = "Evaluation Metrics — Test Set Comparison",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Cross-Validation RMSE

Code
cv_rf <- rf_model$results |>
  filter(mtry == rf_model$bestTune$mtry) |>
  mutate(Model = "Random Forest") |>
  select(Model, RMSE, RMSESD)

cv_gbm <- gbm_model$results |>
  semi_join(gbm_model$bestTune,
            by = c("n.trees", "interaction.depth",
                   "shrinkage", "n.minobsinnode")) |>
  mutate(Model = "GBM") |>
  select(Model, RMSE, RMSESD)

bind_rows(cv_rf, cv_gbm) |>
  ggplot(aes(Model, RMSE, fill = Model)) +
  geom_col(width = .45, alpha = .85) +
  geom_errorbar(aes(ymin = RMSE - RMSESD, ymax = RMSE + RMSESD),
                width = .15, linewidth = 1) +
  geom_text(aes(label = round(RMSE, 5)), vjust = -2.5, size = 4.2) +
  scale_fill_manual(
    values = c("Random Forest" = "#4C72B0", "GBM" = "#DD8452")) +
  labs(title    = "5-Fold Cross-Validation RMSE (Best Hyperparameters)",
       subtitle = "Error bars = ± 1 SD across folds",
       x = NULL, y = "CV RMSE") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Smaller error bars indicate more stable generalisation across the 5 folds. The CV RMSE closely matches the held-out test RMSE — confirming no overfitting from the hyperparameter search.

Code
imp_fn <- function(m, label, clr, top = 12) {
  varImp(m)$importance |>
    rownames_to_column("Feature") |>
    arrange(desc(Overall)) |>
    head(top) |>
    ggplot(aes(reorder(Feature, Overall), Overall)) +
    geom_col(fill = clr, alpha = .85) +
    coord_flip() +
    labs(title = label, x = NULL, y = "Importance") +
    theme_minimal(base_size = 10)
}

imp_fn(rf_model,  "Random Forest", "#4C72B0") +
imp_fn(gbm_model, "GBM",           "#DD8452") +
  plot_annotation(
    title = "Feature Importance — Top 12 Predictors",
    theme = theme(plot.title = element_text(size = 14, face = "bold"))
  )


Conclusion

Code
kable(test_metrics, digits = 5,
      caption = "Final Test-Set Performance Summary",
      col.names = c("Model", "Set", "RMSE", "MAE", "R²"))
Final Test-Set Performance Summary
Model Set RMSE MAE
Random Forest Test 0.10268 0.06551 0.61602
GBM Test 0.10834 0.07388 0.57261

Both models explain approximately 57–62% of the variance in worker productivity which is a somewhat good result.Beased on the findings, Some factors have affect actual productivity. Knowing those finding can provide management a clear actionalbe solution to boost productivity such as smv, no_of_workers, incentive_per_worker, etc.