In this final project, my goal is to use random forest and gradient boosted machines(GBM) to predict the daily productivity of factory work teams from historical operational records. The dataset was covering January–March 2015 publised in Kaggle website. Our target variable is “actual_productivity”.
From RF and GBM, we train and compare two supervised regression modele. Both models are tuned via 5-fold cross-validation and evaluated on a set 20% test set using RMSE, MAE, and R².
At the end, we will determin which model performs better and based on our finding to provide recommendation to factory manager on how to improve productivity effectively.
2. Data Loading & Inspection
Code
df_raw <-read.csv(PATH_RAW, stringsAsFactors =FALSE) #the csv file downloded from Kaggle website is saved in the data folder of this project.kable(head(df_raw, 6)) #test connection
date
quarter
department
day
team
targeted_productivity
smv
wip
over_time
incentive
idle_time
idle_men
no_of_style_change
no_of_workers
actual_productivity
1/1/2015
Quarter1
sweing
Thursday
8
0.80
26.16
1108
7080
98
0
0
0
59.0
0.9407254
1/1/2015
Quarter1
finishing
Thursday
1
0.75
3.94
NA
960
0
0
0
0
8.0
0.8865000
1/1/2015
Quarter1
sweing
Thursday
11
0.80
11.41
968
3660
50
0
0
0
30.5
0.8005705
1/1/2015
Quarter1
sweing
Thursday
12
0.80
11.41
968
3660
50
0
0
0
30.5
0.8005705
1/1/2015
Quarter1
sweing
Thursday
6
0.80
25.90
1170
1920
50
0
0
0
56.0
0.8003819
1/1/2015
Quarter1
sweing
Thursday
7
0.80
25.90
984
6720
38
0
0
0
56.0
0.8001250
3. Data Cleaning & Feature Engineering
To prepare the data for modelling:
Standardise column names — lowercase, underscores instead of spaces
Drop date — quarter and day already capture all temporal structure
Trim whitespace in categorical columns and convert to factors
Engineer three new features to improve model signal
overtime_per_worker : Normalises overtime by team size — removes scale bias between large and small teams
incentive_per_worker : Same normalisation for incentive — isolates per-person motivation signal
productivity_gap : Actual minus targeted — used in EDA only, dropped before modelling to prevent target leakage
4. Exploratory Data Analysis
4.1 Target Variable Distribution
Code
ggplot(df, aes(actual_productivity)) +geom_histogram(aes(y =after_stat(density)), bins =40,fill ="#4C72B0", colour ="white", alpha = .85) +geom_density(colour ="#C44E52", linewidth =1.1) +labs(title ="Distribution of Actual Productivity",x ="Actual Productivity", y ="Density") +theme_minimal(base_size =13)
From the chart above, looks like the target has a skewed distribution concentrated between 0.70 and 0.85. The sharp spike near 0.80 reflects most frequent occurance. The long left tail represents rare but significant underperformance events that will be hardest to predict.
Code
summary(df$actual_productivity) |>t() |>kable(caption ="productivity distribution records", digits =4)
productivity distribution records
Min.
1st Qu.
Median
Mean
3rd Qu.
Max.
0.2337
0.6503
0.7733
0.7351
0.8503
1.1204
4.2 Actual vs Targeted Productivity
Code
ggplot(df, aes(targeted_productivity, actual_productivity,colour = department)) +geom_point(alpha = .5, size =1.8) +geom_abline(linetype ="dashed", colour ="grey30", linewidth = .8) +scale_colour_viridis_d(option ="D") +labs(title ="Actual vs Targeted Productivity by Department",subtitle ="Points above the dashed line exceeded their target",x ="Targeted Productivity", y ="Actual Productivity",colour ="Department") +theme_minimal(base_size =13)
Points above the dashed line exceeded targets; those below fell short.
4.3 Productivity by Department & Day
Code
p_dept <-ggplot(df, aes(department, actual_productivity, fill = department)) +geom_boxplot(alpha = .8) +scale_fill_viridis_d(option ="C") +labs(title ="By Department", x =NULL, y ="Actual Productivity") +theme_minimal() +theme(legend.position ="none")day_lvl <-c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday")df$day <-factor(df$day, levels =intersect(day_lvl, levels(df$day)))p_day <-ggplot(df, aes(day, actual_productivity, fill = day)) +geom_violin(alpha = .7, trim =FALSE) +geom_boxplot(width = .1, fill ="white", outlier.size = .8) +scale_fill_viridis_d(option ="E") +labs(title ="By Day of Week", x =NULL, y =NULL) +theme_minimal() +theme(legend.position ="none",axis.text.x =element_text(angle =30, hjust =1))p_dept + p_day +plot_annotation(title ="Productivity Distribution by Department and Day",theme =theme(plot.title =element_text(size =14, face ="bold")) )
Finishing dedpartment shows slightly higher and more consistent productivity given narrower IQR.
4.4 Key Feature Relationships
Code
p5 <-ggplot(df, aes(incentive_per_worker, actual_productivity,colour = department)) +geom_point(alpha = .4, size =1.5) +geom_smooth(method ="lm", se =TRUE, colour ="grey20") +scale_colour_viridis_d(option ="D") +labs(title ="Incentive per Worker vs Productivity",x ="Incentive per Worker (BDT)", y ="Actual Productivity",colour ="Dept") +theme_minimal(base_size =12)p6 <-ggplot(df, aes(smv, actual_productivity)) +geom_point(aes(colour = department), alpha = .4, size =1.5) +geom_smooth(method ="loess", se =TRUE, colour ="grey20") +scale_colour_viridis_d(option ="D") +labs(title ="SMV vs Actual Productivity",x ="Standard Minute Value", y ="Actual Productivity",colour ="Dept") +theme_minimal(base_size =12)p7 <-ggplot(df, aes(no_of_workers, actual_productivity)) +geom_point(aes(colour = department), alpha = .4, size =1.5) +geom_smooth(method ="loess", se =TRUE, colour ="grey20") +scale_colour_viridis_d(option ="D") +labs(title ="Team Size vs Productivity",x ="Number of Workers", y ="Actual Productivity",colour ="Dept") +theme_minimal(base_size =12)p8 <-ggplot(df, aes(overtime_per_worker, actual_productivity,colour = department)) +geom_point(alpha = .4, size =1.5) +geom_smooth(method ="lm", se =TRUE, colour ="grey20") +scale_colour_viridis_d(option ="D") +labs(title ="Overtime per Worker vs Productivity",x ="Overtime per Worker (mins)", y ="Actual Productivity",colour ="Dept") +theme_minimal(base_size =12)(p5 + p6) / (p7 + p8) +plot_annotation(title ="Key Feature Relationships with Productivity",theme =theme(plot.title =element_text(size =14, face ="bold")) )
From above charts key observations:
Incentive has a clear positive association — higher bonuses correlate with better output
SMV shows diminishing returns — very high workload per piece hurts productivity
Team size plateaus — very large teams (>60) gain no further productivity advantage
Overtime has a complex pattern — moderate overtime may reflect dedication; extreme overtime suggests fatigue
Categorical variables (department, day, quarter) are one-hot encoded via model.matrix(). The productivity_gap column is explicitly dropped — it is derived from the target and would cause data leakage if included.
An 80/20 stratified split ensures both sets have similar productivity distributions. 5-fold cross-validation on the training set selects the best hyperparameters.
ggplot(rf_model) +labs(title ="Random Forest — 5-Fold CV RMSE by mtry",x ="mtry (features considered at each split)",y ="Cross-Validation RMSE") +theme_minimal(base_size =13)
The U-shaped curve maybe caused by too few features → high bias (underfitting); too many → correlated trees (overfitting).
Key hyperparameters tuned(n.trees): 100, 200, 300. Number of sequential boosting iterations (interaction.depth): 2, 3, 5. Maximum tree depth (shrinkage): 0.05, 0.10. and lastly learning rate — smaller = more conservative.
The gap between train and test metrics indicates moderate overfitting — typical for ensemble tree methods on small datasets. Random Forest shows slightly better than GBM here.
Actual vs Predicted
Code
pred_df <-tibble(Actual = y_test,`Random Forest`= rf_pred_test,GBM = gbm_pred_test) |>pivot_longer(-Actual, names_to ="Model", values_to ="Predicted")ggplot(pred_df, aes(Actual, Predicted, colour = Model)) +geom_point(alpha = .45, size =1.8) +geom_abline(linetype ="dashed", colour ="grey30") +facet_wrap(~ Model) +scale_colour_manual(values =c("Random Forest"="#4C72B0", "GBM"="#DD8452")) +labs(title ="Actual vs Predicted Productivity — Test Set",subtitle ="Points on the dashed line indicate perfect prediction",x ="Actual", y ="Predicted") +theme_minimal(base_size =13) +theme(legend.position ="none")
Both models track well through the 0.5–0.9 range where most observations shows.
Smaller error bars indicate more stable generalisation across the 5 folds. The CV RMSE closely matches the held-out test RMSE — confirming no overfitting from the hyperparameter search.
Code
imp_fn <-function(m, label, clr, top =12) {varImp(m)$importance |>rownames_to_column("Feature") |>arrange(desc(Overall)) |>head(top) |>ggplot(aes(reorder(Feature, Overall), Overall)) +geom_col(fill = clr, alpha = .85) +coord_flip() +labs(title = label, x =NULL, y ="Importance") +theme_minimal(base_size =10)}imp_fn(rf_model, "Random Forest", "#4C72B0") +imp_fn(gbm_model, "GBM", "#DD8452") +plot_annotation(title ="Feature Importance — Top 12 Predictors",theme =theme(plot.title =element_text(size =14, face ="bold")) )
Both models explain approximately 57–62% of the variance in worker productivity which is a somewhat good result.Beased on the findings, Some factors have affect actual productivity. Knowing those finding can provide management a clear actionalbe solution to boost productivity such as smv, no_of_workers, incentive_per_worker, etc.