Data607Final

Author

XiaoFei Mei

Published

May 10, 2026

1. Introduction

In this final project, my goal is to use random forest and gradient boosted machines(GBM) to predict the daily productivity of factory work teams from historical operational records. The dataset was covering January–March 2015 publised in Kaggle website. Our target variable is “actual_productivity”.

From RF and GBM, we train and compare two supervised regression modele. Both models are tuned via 5-fold cross-validation and evaluated on a set 20% test set using RMSE, MAE, and R².

At the end, we will determin which model performs better and based on our finding to provide recommendation to factory manager on how to improve productivity effectively.

2. Data Loading & Inspection

Code

df_raw <- read.csv(PATH_RAW, stringsAsFactors = FALSE) #the csv file downloded from Kaggle website is saved in the data folder of this project.

kable(head(df_raw, 6)) #test connection

date	quarter	department	day	team	targeted_productivity	smv	wip	over_time	incentive	no_of_workers	actual_productivity
1/1/2015	Quarter1	sweing	Thursday	8	0.80	26.16	1108	7080	98	59.0	0.9407254
1/1/2015	Quarter1	finishing	Thursday	1	0.75	3.94	NA	960	0	8.0	0.8865000
1/1/2015	Quarter1	sweing	Thursday	11	0.80	11.41	968	3660	50	30.5	0.8005705
1/1/2015	Quarter1	sweing	Thursday	12	0.80	11.41	968	3660	50	30.5	0.8005705
1/1/2015	Quarter1	sweing	Thursday	6	0.80	25.90	1170	1920	50	56.0	0.8003819
1/1/2015	Quarter1	sweing	Thursday	7	0.80	25.90	984	6720	38	56.0	0.8001250

3. Data Cleaning & Feature Engineering

To prepare the data for modelling:

Standardise column names — lowercase, underscores instead of spaces
Drop date — quarter and day already capture all temporal structure
Trim whitespace in categorical columns and convert to factors
Engineer three new features to improve model signal

Code

df <- df_raw |>
  rename_with(~ tolower(trimws(gsub("\\s+", "_", .)))) |>
  select(-any_of("date")) |>
  mutate(across(where(is.numeric),
                ~ ifelse(is.na(.), median(., na.rm = TRUE), .))) |>
  filter(!is.na(actual_productivity)) |>
  mutate(
    department           = as.factor(trimws(department)),
    day                  = as.factor(trimws(day)),
    quarter              = as.factor(trimws(quarter)),
    overtime_per_worker  = ifelse(no_of_workers > 0,
                                  over_time / no_of_workers, 0),
    incentive_per_worker = ifelse(no_of_workers > 0,
                                  incentive  / no_of_workers, 0),
    productivity_gap     = actual_productivity - targeted_productivity
  )

Engineered features and their rationale:

Feature:

overtime_per_worker : Normalises overtime by team size — removes scale bias between large and small teams
incentive_per_worker : Same normalisation for incentive — isolates per-person motivation signal
productivity_gap : Actual minus targeted — used in EDA only, dropped before modelling to prevent target leakage

4. Exploratory Data Analysis

4.1 Target Variable Distribution

Code

ggplot(df, aes(actual_productivity)) +
  geom_histogram(aes(y = after_stat(density)), bins = 40,
                 fill = "#4C72B0", colour = "white", alpha = .85) +
  geom_density(colour = "#C44E52", linewidth = 1.1) +
  labs(title = "Distribution of Actual Productivity",
       x = "Actual Productivity", y = "Density") +
  theme_minimal(base_size = 13)

From the chart above, looks like the target has a skewed distribution concentrated between 0.70 and 0.85. The sharp spike near 0.80 reflects most frequent occurance. The long left tail represents rare but significant underperformance events that will be hardest to predict.

Code

summary(df$actual_productivity) |> 
  t() |> 
  kable(caption = "productivity distribution records", digits = 4)

productivity distribution records
Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
0.2337	0.6503	0.7733	0.7351	0.8503	1.1204

4.2 Actual vs Targeted Productivity

Code

ggplot(df, aes(targeted_productivity, actual_productivity,
               colour = department)) +
  geom_point(alpha = .5, size = 1.8) +
  geom_abline(linetype = "dashed", colour = "grey30", linewidth = .8) +
  scale_colour_viridis_d(option = "D") +
  labs(title = "Actual vs Targeted Productivity by Department",
       subtitle = "Points above the dashed line exceeded their target",
       x = "Targeted Productivity", y = "Actual Productivity",
       colour = "Department") +
  theme_minimal(base_size = 13)

Points above the dashed line exceeded targets; those below fell short.

4.3 Productivity by Department & Day

Code

p_dept <- ggplot(df, aes(department, actual_productivity, fill = department)) +
  geom_boxplot(alpha = .8) +
  scale_fill_viridis_d(option = "C") +
  labs(title = "By Department", x = NULL, y = "Actual Productivity") +
  theme_minimal() +
  theme(legend.position = "none")

day_lvl <- c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")
df$day  <- factor(df$day, levels = intersect(day_lvl, levels(df$day)))

p_day <- ggplot(df, aes(day, actual_productivity, fill = day)) +
  geom_violin(alpha = .7, trim = FALSE) +
  geom_boxplot(width = .1, fill = "white", outlier.size = .8) +
  scale_fill_viridis_d(option = "E") +
  labs(title = "By Day of Week", x = NULL, y = NULL) +
  theme_minimal() +
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1))

p_dept + p_day +
  plot_annotation(
    title = "Productivity Distribution by Department and Day",
    theme = theme(plot.title = element_text(size = 14, face = "bold"))
  )

Finishing dedpartment shows slightly higher and more consistent productivity given narrower IQR.

4.4 Key Feature Relationships

Code

p5 <- ggplot(df, aes(incentive_per_worker, actual_productivity,
                      colour = department)) +
  geom_point(alpha = .4, size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_colour_viridis_d(option = "D") +
  labs(title = "Incentive per Worker vs Productivity",
       x = "Incentive per Worker (BDT)", y = "Actual Productivity",
       colour = "Dept") +
  theme_minimal(base_size = 12)

p6 <- ggplot(df, aes(smv, actual_productivity)) +
  geom_point(aes(colour = department), alpha = .4, size = 1.5) +
  geom_smooth(method = "loess", se = TRUE, colour = "grey20") +
  scale_colour_viridis_d(option = "D") +
  labs(title = "SMV vs Actual Productivity",
       x = "Standard Minute Value", y = "Actual Productivity",
       colour = "Dept") +
  theme_minimal(base_size = 12)

p7 <- ggplot(df, aes(no_of_workers, actual_productivity)) +
  geom_point(aes(colour = department), alpha = .4, size = 1.5) +
  geom_smooth(method = "loess", se = TRUE, colour = "grey20") +
  scale_colour_viridis_d(option = "D") +
  labs(title = "Team Size vs Productivity",
       x = "Number of Workers", y = "Actual Productivity",
       colour = "Dept") +
  theme_minimal(base_size = 12)

p8 <- ggplot(df, aes(overtime_per_worker, actual_productivity,
                      colour = department)) +
  geom_point(alpha = .4, size = 1.5) +
  geom_smooth(method = "lm", se = TRUE, colour = "grey20") +
  scale_colour_viridis_d(option = "D") +
  labs(title = "Overtime per Worker vs Productivity",
       x = "Overtime per Worker (mins)", y = "Actual Productivity",
       colour = "Dept") +
  theme_minimal(base_size = 12)

(p5 + p6) / (p7 + p8) +
  plot_annotation(
    title = "Key Feature Relationships with Productivity",
    theme = theme(plot.title = element_text(size = 14, face = "bold"))
  )

From above charts key observations:

Incentive has a clear positive association — higher bonuses correlate with better output
SMV shows diminishing returns — very high workload per piece hurts productivity
Team size plateaus — very large teams (>60) gain no further productivity advantage
Overtime has a complex pattern — moderate overtime may reflect dedication; extreme overtime suggests fatigue

4.5 Correlation Matrix

Code

num_cols <- df |> select(where(is.numeric)) |> names()
cor_mat  <- cor(df[, num_cols], use = "pairwise.complete.obs")

corrplot(cor_mat,
         method      = "color",
         type        = "upper",
         tl.cex      = .72,
         tl.col      = "black",
         col         = colorRampPalette(c("#C44E52", "white", "#4C72B0"))(200),
         addCoef.col = "black",
         number.cex  = .52,
         title       = "Numeric Feature Correlation Matrix",
         mar         = c(0, 0, 2, 0))

5. Modelling Pipeline

Feature Matrix

Categorical variables (department, day, quarter) are one-hot encoded via model.matrix(). The productivity_gap column is explicitly dropped — it is derived from the target and would cause data leakage if included.

Code

target <- df$actual_productivity

feat_mat <- df |>
  select(-actual_productivity, -productivity_gap) |>
  mutate(across(where(is.factor), as.character)) |>
  (\(x) model.matrix(~ . - 1, data = x))() |>
  as.data.frame()

cat("Feature matrix:", nrow(feat_mat), "rows x", ncol(feat_mat), "columns\n")

Feature matrix: 1197 rows x 23 columns

Train / Test Split

Code

idx     <- createDataPartition(target, p = .80, list = FALSE)
X_train <- feat_mat[ idx, ]; y_train <- target[ idx]
X_test  <- feat_mat[-idx, ]; y_test  <- target[-idx]

cat(sprintf("Train: %d rows  |  Test: %d rows\n",
            nrow(X_train), nrow(X_test)))

Train: 959 rows  |  Test: 238 rows

Code

ctrl <- trainControl(method          = "cv",
                     number          = 5,
                     verboseIter     = FALSE,
                     savePredictions = "final")

An 80/20 stratified split ensures both sets have similar productivity distributions. 5-fold cross-validation on the training set selects the best hyperparameters.

6. Model 1 — Random Forest

Code

rf_model <- train(
  x          = X_train,
  y          = y_train,
  method     = "rf",
  trControl  = ctrl,
  tuneGrid   = expand.grid(mtry = c(3, 5, 7, 10)),
  ntree      = 300,
  importance = TRUE
)

cat("Best mtry:", rf_model$bestTune$mtry, "\n")

Best mtry: 7

Code

kable(rf_model$results, digits = 5,
      caption = "Random Forest — CV results across mtry values")

Random Forest — CV results across mtry values
mtry	RMSE	Rsquared	MAE	RMSESD	RsquaredSD	MAESD
3	0.12849	0.48414	0.08521	0.00924	0.06349	0.00394
5	0.12717	0.48554	0.08173	0.01072	0.07345	0.00440
7	0.12699	0.48532	0.08040	0.01100	0.07349	0.00447
10	0.12804	0.47705	0.08010	0.01175	0.07830	0.00502

Code

ggplot(rf_model) +
  labs(title = "Random Forest — 5-Fold CV RMSE by mtry",
       x     = "mtry (features considered at each split)",
       y     = "Cross-Validation RMSE") +
  theme_minimal(base_size = 13)

The U-shaped curve maybe caused by too few features → high bias (underfitting); too many → correlated trees (overfitting).

Code

rf_pred_train <- predict(rf_model, X_train)
rf_pred_test  <- predict(rf_model, X_test)

7. Model 2 — Gradient Boosted Machines (GBM)

Key hyperparameters tuned(n.trees): 100, 200, 300. Number of sequential boosting iterations (interaction.depth): 2, 3, 5. Maximum tree depth (shrinkage): 0.05, 0.10. and lastly learning rate — smaller = more conservative.

Code

gbm_model <- train(
  x         = X_train,
  y         = y_train,
  method    = "gbm",
  trControl = ctrl,
  tuneGrid  = expand.grid(
    n.trees           = c(100, 200, 300),
    interaction.depth = c(2, 3, 5),
    shrinkage         = c(0.05, 0.1),
    n.minobsinnode    = 10
  ),
  verbose = FALSE
)

cat("Best GBM parameters:\n")

Best GBM parameters:

Code

kable(gbm_model$bestTune, caption = "GBM — Best hyperparameters from CV")

GBM — Best hyperparameters from CV
	n.trees	interaction.depth	shrinkage	n.minobsinnode
8	200	5	0.05	10

Code

gbm_pred_train <- predict(gbm_model, X_train)
gbm_pred_test  <- predict(gbm_model, X_test)

8. Model Evaluation

Evaluation Metrics

Three complementary metrics are used to compare the models:

Metric	Formula	Interpretation
RMSE	√mean((y − ŷ)²)	Penalises large errors heavily; same units as target. Lower is better.
MAE	mean(\|y − ŷ\|)	Average absolute error; more robust to outliers. Lower is better.
R²	1 − SS_res/SS_tot	Proportion of variance explained; 1 = perfect fit. Higher is better.

Code

make_metrics <- function(label, y_tr, pr_tr, y_te, pr_te) {
  tibble(
    Model = label,
    Set   = c("Train", "Test"),
    RMSE  = c(rmse(y_tr, pr_tr), rmse(y_te, pr_te)),
    MAE   = c(mae(y_tr,  pr_tr), mae(y_te,  pr_te)),
    R2    = c(cor(y_tr,  pr_tr)^2, cor(y_te, pr_te)^2)
  )
}

all_metrics <- bind_rows(
  make_metrics("Random Forest",
               y_train, rf_pred_train, y_test, rf_pred_test),
  make_metrics("GBM",
               y_train, gbm_pred_train, y_test, gbm_pred_test)
)
test_metrics <- filter(all_metrics, Set == "Test")

kable(all_metrics, digits = 5,
      caption = "Train and Test Metrics for Both Models",
      col.names = c("Model", "Set", "RMSE", "MAE", "R²"))

Train and Test Metrics for Both Models
Model	Set	RMSE	MAE	R²
Random Forest	Train	0.06789	0.04141	0.87783
Random Forest	Test	0.10268	0.06551	0.61602
GBM	Train	0.10720	0.07092	0.63749
GBM	Test	0.10834	0.07388	0.57261

The gap between train and test metrics indicates moderate overfitting — typical for ensemble tree methods on small datasets. Random Forest shows slightly better than GBM here.

Actual vs Predicted

Code

pred_df <- tibble(
  Actual         = y_test,
  `Random Forest` = rf_pred_test,
  GBM             = gbm_pred_test
) |> pivot_longer(-Actual, names_to = "Model", values_to = "Predicted")

ggplot(pred_df, aes(Actual, Predicted, colour = Model)) +
  geom_point(alpha = .45, size = 1.8) +
  geom_abline(linetype = "dashed", colour = "grey30") +
  facet_wrap(~ Model) +
  scale_colour_manual(
    values = c("Random Forest" = "#4C72B0", "GBM" = "#DD8452")) +
  labs(title    = "Actual vs Predicted Productivity — Test Set",
       subtitle = "Points on the dashed line indicate perfect prediction",
       x = "Actual", y = "Predicted") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Both models track well through the 0.5–0.9 range where most observations shows.

Metrics Comparison

Code

test_metrics |>
  pivot_longer(c(RMSE, MAE, R2), names_to = "Metric", values_to = "Value") |>
  ggplot(aes(Model, Value, fill = Model)) +
  geom_col(width = .55, alpha = .85) +
  geom_text(aes(label = round(Value, 4)), vjust = -.4, size = 4) +
  facet_wrap(~ Metric, scales = "free_y") +
  scale_fill_manual(
    values = c("Random Forest" = "#4C72B0", "GBM" = "#DD8452")) +
  labs(title = "Evaluation Metrics — Test Set Comparison",
       x = NULL, y = NULL) +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Cross-Validation RMSE

Code

cv_rf <- rf_model$results |>
  filter(mtry == rf_model$bestTune$mtry) |>
  mutate(Model = "Random Forest") |>
  select(Model, RMSE, RMSESD)

cv_gbm <- gbm_model$results |>
  semi_join(gbm_model$bestTune,
            by = c("n.trees", "interaction.depth",
                   "shrinkage", "n.minobsinnode")) |>
  mutate(Model = "GBM") |>
  select(Model, RMSE, RMSESD)

bind_rows(cv_rf, cv_gbm) |>
  ggplot(aes(Model, RMSE, fill = Model)) +
  geom_col(width = .45, alpha = .85) +
  geom_errorbar(aes(ymin = RMSE - RMSESD, ymax = RMSE + RMSESD),
                width = .15, linewidth = 1) +
  geom_text(aes(label = round(RMSE, 5)), vjust = -2.5, size = 4.2) +
  scale_fill_manual(
    values = c("Random Forest" = "#4C72B0", "GBM" = "#DD8452")) +
  labs(title    = "5-Fold Cross-Validation RMSE (Best Hyperparameters)",
       subtitle = "Error bars = ± 1 SD across folds",
       x = NULL, y = "CV RMSE") +
  theme_minimal(base_size = 13) +
  theme(legend.position = "none")

Smaller error bars indicate more stable generalisation across the 5 folds. The CV RMSE closely matches the held-out test RMSE — confirming no overfitting from the hyperparameter search.

Code

imp_fn <- function(m, label, clr, top = 12) {
  varImp(m)$importance |>
    rownames_to_column("Feature") |>
    arrange(desc(Overall)) |>
    head(top) |>
    ggplot(aes(reorder(Feature, Overall), Overall)) +
    geom_col(fill = clr, alpha = .85) +
    coord_flip() +
    labs(title = label, x = NULL, y = "Importance") +
    theme_minimal(base_size = 10)
}

imp_fn(rf_model,  "Random Forest", "#4C72B0") +
imp_fn(gbm_model, "GBM",           "#DD8452") +
  plot_annotation(
    title = "Feature Importance — Top 12 Predictors",
    theme = theme(plot.title = element_text(size = 14, face = "bold"))
  )

Conclusion

Code

kable(test_metrics, digits = 5,
      caption = "Final Test-Set Performance Summary",
      col.names = c("Model", "Set", "RMSE", "MAE", "R²"))

Final Test-Set Performance Summary
Model	Set	RMSE	MAE	R²
Random Forest	Test	0.10268	0.06551	0.61602
GBM	Test	0.10834	0.07388	0.57261

Both models explain approximately 57–62% of the variance in worker productivity which is a somewhat good result.Beased on the findings, Some factors have affect actual productivity. Knowing those finding can provide management a clear actionalbe solution to boost productivity such as smv, no_of_workers, incentive_per_worker, etc.