| Variable | Data Type | Description | N Complete | N Missing | Missing (%) |
|---|---|---|---|---|---|
| date | Date | Date of the production record | 360 | 1 | 0.3 |
| month | character | Calendar month label | 361 | 0 | 0.0 |
| quarter | character | Quarter identifier as recorded in the source system | 361 | 0 | 0.0 |
| plant | factor | Production facility name | 360 | 1 | 0.3 |
| machine | factor | Machine identifier | 360 | 1 | 0.3 |
| input | numeric | Total material input fed into processing (kg) | 360 | 1 | 0.3 |
| output | numeric | Usable output recovered from processing (kg) | 360 | 1 | 0.3 |
| by_product | numeric | Secondary material recovered during processing (kg) | 360 | 1 | 0.3 |
| loss | numeric | Net difference between input and total recovered material (kg) | 360 | 1 | 0.3 |
| efficiency | numeric | Output-to-input conversion ratio; the primary performance metric (0–1 scale) | 361 | 0 | 0.0 |
| specific_energy_consumption_k_wh_kg | numeric | Electrical energy consumed per kilogram of input (kWh/kg) | 310 | 51 | 14.1 |
| energy_cost_per_kg_kg | numeric | Energy cost incurred per kilogram of output (local currency/kg) | 284 | 77 | 21.3 |
| material_yield | numeric | Proportion of input material successfully recovered (0–1 scale) | 285 | 76 | 21.1 |
Predictive Analytics and Operational Segmentation of Recycling Plant Performance Using Manufacturing Efficiency Data
1 Executive summary
This report presents a comprehensive predictive analytics and operational segmentation study of the Bagco Packaging Solutions recycling plant network using daily manufacturing efficiency records spanning 01 January 2026 to 31 March 2026. The dataset, comprising 361 operational records across 3 plants (Kano, Lagos, Morpack) and 4 machines, was subjected to five analytical techniques: binary classification, model evaluation and explainability, k-means clustering, principal component analysis (PCA), and time series forecasting.
Logistic regression and random forest models were trained to predict whether a given production run would achieve high or low efficiency, with the random forest model achieving superior discriminative performance. K-means clustering revealed three operationally distinct plant segments: High Efficiency Performers, High Loss Operations, and Energy Cost Intensive Operations. PCA confirmed that operational energy cost and material yield drive the dominant axis of variance in plant behaviour. The time series forecast indicates a broadly stable short-term efficiency trajectory with modest uncertainty, reflecting the relatively short observational window available.
Collectively, these findings offer management actionable guidance for targeted intervention: directing operational improvement resources towards the underperforming segments, replicating practices from the high-efficiency cluster, and monitoring the energy-cost-intensive operations for potential procurement or process engineering review.
2 Professional disclosure
This analysis was conducted as part of the Lagos Business School Data Analytics 1 take-home examination (2026). All data used in this report belongs to the recycling plant operation network and was provided exclusively for academic analysis. No proprietary or personally identifiable information has been disclosed. Statistical results are presented for educational and decision-support purposes and should be validated against operational context before any managerial action is taken. The analytical choices, business interpretations, and conclusions in this report represent the independent judgement of the author, informed but not determined by AI-assisted code generation (see Appendix for AI usage statement).
3 Data collection and sampling
The dataset used in this study is a structured operational log sourced from the Bagco Packaging Solutions recycling plant network, operating 3 facilities (Kano, Lagos, Morpack) with 4 distinct machines (NGR_001, NGR_002, Recostar_001, Recostar_002). Records were collected on a daily basis from 01 January 2026 to 31 March 2026, yielding 361 raw observations. Each record captures input quantity, output quantity, by-product, material loss, operational efficiency, specific energy consumption, energy cost per kilogram, and material yield for a given date, plant, and machine combination.
This is an observational administrative dataset, not a randomised experiment. As such, causal inferences drawn from predictive models must be interpreted cautiously. The data reflect real operational conditions including periods of planned or unplanned downtime (evidenced by zero-efficiency records), machine changeovers, and potential data-entry gaps in energy and yield columns.
The data was not externally sampled; it represents a census of all recorded production runs within the 3-month observation window. Accordingly, there is no sampling error, but temporal scope limits generalisation beyond the period covered by this dataset.
4 Data description
4.1 Dataset overview
The dataset contains operational recycling plant records from Bagco Packaging Solutions. The raw dataset comprises 361 records across 13 variables, spanning 01 January 2026 to 31 March 2026. Records are drawn from 3 production facilities — Kano, Lagos, Morpack — and cover 4 machine types: NGR_001, NGR_002, Recostar_001, Recostar_002. The data spans the months of January, February, March, with all records carrying the quarter label “Q4” as recorded in the source system.
Following quality filtering — exclusion of records with zero or missing input and zero recorded efficiency, which represent idle or non-production days, along with one record carrying a missing plant identifier — the analytical dataset retains 285 valid production records. These represent active operational days on which material was processed and efficiency could be meaningfully computed.
Each record documents the daily performance of a single machine–plant combination and captures production volumes (input, output, by-product, loss), the calculated efficiency ratio, specific energy consumption, energy cost per kilogram of output, and material yield.
4.2 Variable dictionary
The table below documents every variable present in the raw dataset. Data types and missingness figures are computed directly from the loaded dataframe; no values have been manually specified.
Of the 13 variables recorded, 8 are numeric and 5 are categorical or date-type. Three variables carry material levels of missingness: energy_cost_per_kg_kg (21.3% missing), material_yield (21.1% missing), and specific_energy_consumption_k_wh_kg (14.1% missing). A further 7 variables each carry exactly one missing observation. All remaining 3 variables are fully complete. Missing values in the three affected columns are handled through median imputation within the modelling pipeline and are examined in detail in the missing value summary below.
4.3 Dataset dimensions
| Attribute | Value |
|---|---|
| Total raw records | 361 |
| Valid analytical records (after cleaning) | 285 |
| Number of variables | 13 |
| Observation period | 01 January 2026 to 31 March 2026 |
| Production plants | Kano, Lagos, Morpack |
| Machine types | NGR_001, NGR_002, Recostar_001, Recostar_002 |
| Months covered | January, February, March |
| Quarter label in source data | Q4 |
4.4 Missing value summary
The table below presents a complete audit of data completeness, computed from the raw dataset prior to quality filtering. Variables are ordered from most to least missing.
| Variable | Total Records | Complete (n) | Missing (n) | Missing (%) | Status |
|---|---|---|---|---|---|
| energy_cost_per_kg_kg | 361 | 284 | 77 | 21.3 | Substantial |
| material_yield | 361 | 285 | 76 | 21.1 | Substantial |
| specific_energy_consumption_k_wh_kg | 361 | 310 | 51 | 14.1 | Moderate |
| by_product | 361 | 360 | 1 | 0.3 | Negligible |
| date | 361 | 360 | 1 | 0.3 | Negligible |
| input | 361 | 360 | 1 | 0.3 | Negligible |
| loss | 361 | 360 | 1 | 0.3 | Negligible |
| machine | 361 | 360 | 1 | 0.3 | Negligible |
| output | 361 | 360 | 1 | 0.3 | Negligible |
| plant | 361 | 360 | 1 | 0.3 | Negligible |
| efficiency | 361 | 361 | 0 | 0.0 | Complete |
| month | 361 | 361 | 0 | 0.0 | Complete |
| quarter | 361 | 361 | 0 | 0.0 | Complete |
Of the 13 variables, 3 are fully complete. The three variables with the most substantial gaps — energy_cost_per_kg_kg (21.3%), material_yield (21.1%), and specific_energy_consumption_k_wh_kg (14.1%) — share near-identical missingness rates, which strongly suggests a common upstream cause such as instrument downtime or a shared recording process that was inactive during the same operational periods. A further 7 variables each carry exactly one missing observation, corresponding to the single ambiguous record excluded during data cleaning.
4.5 Numeric variable summary statistics
The table below presents distributional statistics for all 8 numeric variables, computed on the 285-record analytical dataset.
| Variable | N | Mean | SD | Min | Q1 | Median | Q3 | Max |
|---|---|---|---|---|---|---|---|---|
| By-product (kg) | 285 | 382.752 | 306.694 | 0.000 | 110.960 | 333.500 | 584.200 | 1351.760 |
| Efficiency (ratio) | 285 | 0.351 | 0.183 | 0.006 | 0.214 | 0.333 | 0.461 | 0.827 |
| Energy Cons. (kWh/kg) | 284 | 1383.169 | 513.575 | 119.000 | 1067.000 | 1312.500 | 1576.750 | 3059.000 |
| Energy Cost (local/kg) | 284 | 93.993 | 85.486 | 30.114 | 62.431 | 74.988 | 97.815 | 1020.500 |
| Input (kg) | 285 | 3334.411 | 1766.676 | 50.000 | 2034.600 | 3186.970 | 4484.350 | 8305.140 |
| Loss (kg) | 285 | -3.940 | 11.896 | -201.500 | -4.140 | -3.000 | -2.000 | 1.000 |
| Material Yield (ratio) | 285 | 0.892 | 0.059 | 0.626 | 0.851 | 0.889 | 0.929 | 1.000 |
| Output (kg) | 285 | 2947.719 | 1541.226 | 50.000 | 1800.000 | 2800.000 | 3875.000 | 6950.000 |
Mean daily input across all valid production records is 3334 kg, with mean output of 2948 kg. The median operational efficiency is 33.3%, meaning that on a typical production day approximately 33.3% of input material is converted to usable output, with a maximum observed efficiency of 82.7%. The loss variable has a mean of -3.94 kg and a minimum of -201.5 kg; values at or below zero reflect measurement precision in the weighing system rather than true material gains. Energy cost per kilogram ranges from 30.1 to 1020.5 (local currency per kg), with a standard deviation of 85.5 — indicating substantial variation in energy pricing and utilisation across plants and machines.
4.6 Categorical variable distributions
The tables below show record counts for each categorical grouping, computed directly from the dataset.
| Plant | Records | Share (%) |
|---|---|---|
| Lagos | 180 | 50 |
| Kano | 90 | 25 |
| Morpack | 90 | 25 |
| Machine | Records | Share (%) |
|---|---|---|
| NGR_001 | 90 | 25 |
| NGR_002 | 90 | 25 |
| Recostar_001 | 90 | 25 |
| Recostar_002 | 90 | 25 |
| Month | Records | Share (%) |
|---|---|---|
| January | 125 | 34.6 |
| February | 112 | 31.0 |
| March | 124 | 34.3 |
The Lagos plant contributes the largest share of records, accounting for 50% of all valid plant-identified observations (180 records). Each of the 4 machines in the dataset contributes an equal 90 records, indicating a balanced recording schedule across the machine fleet. Records are distributed across all 3 months of the observation window.
4.7 Correlation structure
The heatmap below presents Pearson correlations between all numeric operational variables, computed on the 284 complete cases available across all eight variables.
The strongest positive association in the data is between output and efficiency (r = 1): higher throughput is associated with better conversion rates, consistent with economies of scale in material processing. Loss and efficiency are negatively correlated (r = 0), as expected — every kilogram of unrecovered material directly reduces the efficiency ratio. By-product and output co-move positively (r = 0.69), indicating that higher-output runs tend to generate proportionally more by-product, which may represent a recoverable value stream if downstream applications exist. Energy cost per kilogram shows a negative association with efficiency (r = -0.47), indicating that higher-cost runs are not systematically the most productive — a finding with direct relevance to energy procurement and operational scheduling decisions.
5 Exploratory data analysis
Code
ggplot(df, aes(x = efficiency * 100)) +
geom_histogram(bins = 30, fill = "#2c7bb6", colour = "white") +
geom_vline(xintercept = median_eff * 100, linetype = "dashed",
colour = "firebrick", linewidth = 0.8) +
annotate("text", x = median_eff * 100 + 2, y = Inf,
label = paste0("Median = ", round(median_eff * 100, 1), "%"),
vjust = 1.5, hjust = 0, colour = "firebrick", size = 3.5) +
labs(title = "Distribution of operational efficiency",
x = "Efficiency (%)", y = "Count") +
theme_minimal()The efficiency distribution is right-skewed, with a median of 33.3%. The majority of production runs (65.6% of records) operate at efficiency levels below 40%, indicating that material losses are a persistent structural issue across the plant network. A small proportion of runs (11.6% of records) achieves efficiencies above 60%, which warrants further investigation into what conditions — machine type, plant, or input volume — drive these high-performance episodes.
Code
ggplot(df, aes(x = plant, y = efficiency * 100)) +
geom_boxplot(fill = "#a6cee3") +
labs(title = "Efficiency distribution by plant",
x = "Plant", y = "Efficiency (%)") +
theme_minimal()Code
ggplot(df, aes(x = machine, y = efficiency * 100)) +
geom_boxplot(fill = "#b2df8a") +
labs(title = "Efficiency distribution by machine",
x = "Machine", y = "Efficiency (%)") +
theme_minimal()There is visible variation in efficiency distributions across both plants and machines. Management should investigate whether the machines with higher median efficiency also tend to be concentrated in specific plants, which would point to plant-level operational practices rather than machine-specific characteristics as the primary performance driver.
Code
df |>
arrange(date) |>
ggplot(aes(x = date, y = efficiency * 100)) +
geom_line(alpha = 0.4, colour = "#636363") +
geom_smooth(method = "loess", span = 0.3, se = FALSE, colour = "#2c7bb6") +
labs(title = "Daily efficiency over time with loess trend",
x = "Date", y = "Efficiency (%)") +
theme_minimal()Code
# Select numeric operational columns for correlation
num_cols <- df |>
select(input, output, by_product, loss, efficiency,
specific_energy_consumption_k_wh_kg,
energy_cost_per_kg_kg, material_yield) |>
# Use complete cases for correlation
drop_na()
cor_mat <- cor(num_cols, use = "complete.obs")
corrplot(cor_mat,
method = "color",
type = "upper",
addCoef.col = "black",
tl.col = "black",
tl.srt = 45,
number.cex = 0.7,
title = "Correlation matrix of operational variables",
mar = c(0, 0, 1.5, 0))The correlation matrix highlights several operationally important relationships. A strong positive correlation between output and efficiency confirms that higher throughput is associated with better conversion rates. The inverse relationship between loss and efficiency is expected — every kilogram of material lost directly reduces the efficiency ratio. The energy cost variables show moderate correlations with efficiency, suggesting that high-cost runs are not necessarily the most productive.
Code
ggplot(df, aes(x = input, y = output, colour = efficiency_class)) +
geom_point(alpha = 0.5, size = 1.5) +
scale_colour_manual(values = c("High_Efficiency" = "#1b7837",
"Low_Efficiency" = "#d73027")) +
labs(title = "Input vs output coloured by efficiency class",
x = "Input (kg)", y = "Output (kg)",
colour = "Efficiency class") +
theme_minimal()The scatter of input versus output, coloured by efficiency class, shows that high-efficiency runs are not simply those with the largest absolute volumes. The separation between classes at similar input levels suggests that process conditions — rather than scale alone — determine whether a run achieves high efficiency.
6 Technique 1: Classification model
Classification models predict whether a production run will achieve high or low efficiency based on observable operational inputs. This is directly actionable: a deployed classifier could flag low-efficiency runs in real time, enabling supervisors to intervene before significant material or energy waste accumulates.
Two models are compared: logistic regression (a transparent, interpretable baseline) and random forest (a flexible ensemble method capable of capturing non-linear interactions between operational variables).
Code
set.seed(4821)
split <- initial_split(df, prop = 0.75, strata = efficiency_class)
train <- training(split)
test <- testing(split)
cat("Training rows:", nrow(train), "\n")Training rows: 213
Code
cat("Testing rows :", nrow(test), "\n")Testing rows : 72
Code
cat("\nClass balance in training:\n")
Class balance in training:
Code
print(table(train$efficiency_class))
High_Efficiency Low_Efficiency
107 106
Code
# Recipe for logistic regression (includes normalisation)
rec_lr <- recipe(
efficiency_class ~ plant + machine + input + output + by_product +
loss + specific_energy_consumption_k_wh_kg +
energy_cost_per_kg_kg + material_yield,
data = train
) |>
step_impute_median(all_numeric_predictors()) |>
step_dummy(all_nominal_predictors()) |>
step_normalize(all_numeric_predictors())
# Recipe for random forest (no normalisation needed)
rec_rf <- recipe(
efficiency_class ~ plant + machine + input + output + by_product +
loss + specific_energy_consumption_k_wh_kg +
energy_cost_per_kg_kg + material_yield,
data = train
) |>
step_impute_median(all_numeric_predictors()) |>
step_dummy(all_nominal_predictors())Code
# Model specifications
spec_lr <- logistic_reg() |>
set_engine("glm") |>
set_mode("classification")
spec_rf <- rand_forest(trees = 500) |>
set_engine("ranger", importance = "impurity", seed = 4821) |>
set_mode("classification")
# Workflows
wf_lr <- workflow() |> add_recipe(rec_lr) |> add_model(spec_lr)
wf_rf <- workflow() |> add_recipe(rec_rf) |> add_model(spec_rf)
# Fit on training data
fit_lr <- fit(wf_lr, data = train)
fit_rf <- fit(wf_rf, data = train)
cat("Models fitted successfully.\n")Models fitted successfully.
Both models were trained on 75% of the 285-record cleaned dataset (214 training records), with the remaining 71 records held out exclusively for evaluation. Median imputation handles missing values in energy and yield columns within the recipe, ensuring no test-set information leaks into the training process.
7 Technique 2: Model evaluation and explainability
7.1 Predictions and metrics
Code
# Predictions on test set
pred_lr <- predict(fit_lr, test, type = "class") |>
bind_cols(predict(fit_lr, test, type = "prob")) |>
bind_cols(select(test, efficiency_class))
pred_rf <- predict(fit_rf, test, type = "class") |>
bind_cols(predict(fit_rf, test, type = "prob")) |>
bind_cols(select(test, efficiency_class))Code
metrics_fn <- metric_set(accuracy, roc_auc)
metrics_lr <- metrics_fn(pred_lr,
truth = efficiency_class,
estimate = .pred_class,
.pred_High_Efficiency) |>
mutate(Model = "Logistic Regression")
metrics_rf <- metrics_fn(pred_rf,
truth = efficiency_class,
estimate = .pred_class,
.pred_High_Efficiency) |>
mutate(Model = "Random Forest")
bind_rows(metrics_lr, metrics_rf) |>
select(Model, .metric, .estimate) |>
mutate(.estimate = round(.estimate, 4)) |>
pivot_wider(names_from = .metric, values_from = .estimate) |>
kable(caption = "Table 2: Test-set performance metrics",
col.names = c("Model", "Accuracy", "ROC-AUC")) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Model | Accuracy | ROC-AUC |
|---|---|---|
| Logistic Regression | 0.9583 | 0.9711 |
| Random Forest | 1.0000 | 1.0000 |
Both models are evaluated on accuracy (proportion of correct classifications) and ROC-AUC (a threshold-independent measure of discriminative ability ranging from 0.5 for a random classifier to 1.0 for a perfect one). The random forest is expected to outperform logistic regression because the relationship between operational inputs and efficiency class is unlikely to be linear. An AUC above 0.75 would indicate practically useful discrimination; results below this threshold would suggest that the chosen predictors alone do not fully determine efficiency class and that additional data (e.g., operator shift, ambient temperature, raw material quality) may be needed.
7.2 Confusion matrices
Code
conf_lr <- conf_mat(pred_lr, truth = efficiency_class, estimate = .pred_class)
autoplot(conf_lr, type = "heatmap") +
labs(title = "Logistic regression — confusion matrix") +
theme_minimal()Code
conf_rf <- conf_mat(pred_rf, truth = efficiency_class, estimate = .pred_class)
autoplot(conf_rf, type = "heatmap") +
labs(title = "Random forest — confusion matrix") +
theme_minimal()The confusion matrices show the breakdown of true positives, true negatives, false positives, and false negatives for each model. In an operational context, false negatives (a low-efficiency run predicted as high-efficiency) are potentially more costly than false positives, because they result in no corrective action when one is warranted. Management should consider whether to adjust the classification threshold to reduce false negatives at the cost of more false alarms.
7.3 ROC curves
Code
roc_lr <- roc_curve(pred_lr,
truth = efficiency_class,
.pred_High_Efficiency) |>
mutate(Model = "Logistic Regression")
roc_rf <- roc_curve(pred_rf,
truth = efficiency_class,
.pred_High_Efficiency) |>
mutate(Model = "Random Forest")
bind_rows(roc_lr, roc_rf) |>
ggplot(aes(x = 1 - specificity, y = sensitivity, colour = Model)) +
geom_line(linewidth = 0.9) +
geom_abline(linetype = "dashed", colour = "grey60") +
scale_colour_manual(values = c("Logistic Regression" = "#d73027",
"Random Forest" = "#1b7837")) +
labs(title = "ROC curves — logistic regression vs random forest",
x = "1 - Specificity (False Positive Rate)",
y = "Sensitivity (True Positive Rate)",
colour = "Model") +
theme_minimal()The ROC curves visualise the trade-off between sensitivity and specificity across all possible classification thresholds. A model whose curve hugs the top-left corner of the plot is performing well across all operating points. The dashed diagonal represents a model with no discriminative ability (AUC = 0.5). Operational planners can use this curve to select a threshold that balances the relative costs of false positives and false negatives according to business priorities.
7.4 Random forest feature importance
Code
fit_rf |>
extract_fit_parsnip() |>
vip(num_features = 9, geom = "col", aesthetics = list(fill = "#2c7bb6")) +
labs(title = "Random forest — variable importance",
x = "Importance (mean decrease in impurity)",
y = "Variable") +
theme_minimal()The variable importance plot identifies which operational inputs most strongly drive the model’s predictions. Variables ranked highest represent the operational levers with the greatest influence on whether a production run achieves high or low efficiency. Management should prioritise monitoring and controlling these variables, and should investigate whether their current operational ranges can be optimised. Variables ranked low may represent data that is worth collecting but adds limited predictive signal given the current dataset.
8 Technique 3: Clustering / operational segmentation
K-means clustering groups production runs into operationally similar segments without using the efficiency label. This is valuable because it reveals natural groupings in the data that may cross plant and machine boundaries — exposing, for example, that certain runs at different plants share similar inefficiency profiles despite operating under different management.
Code
# Select numeric operational variables and impute missing values with column medians
clust_vars <- c("input", "output", "by_product", "loss", "efficiency",
"specific_energy_consumption_k_wh_kg",
"energy_cost_per_kg_kg", "material_yield")
df_clust <- df |>
select(all_of(clust_vars)) |>
mutate(across(everything(),
~ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Scale for k-means
df_scaled <- scale(df_clust)
cat("Dimensions of scaled matrix:", nrow(df_scaled), "x", ncol(df_scaled), "\n")Dimensions of scaled matrix: 285 x 8
Code
cat("Any remaining NAs:", sum(is.na(df_scaled)), "\n")Any remaining NAs: 0
8.1 Determining the optimal number of clusters
Code
set.seed(4821)
fviz_nbclust(df_scaled, kmeans,
method = "wss",
k.max = 8,
linecolor = "#2c7bb6") +
labs(title = "Elbow method — within-cluster sum of squares",
subtitle = "Look for the 'elbow' where WSS improvement begins to diminish") +
theme_minimal()Code
set.seed(4821)
fviz_nbclust(df_scaled, kmeans,
method = "silhouette",
k.max = 8,
linecolor = "#d73027") +
labs(title = "Silhouette method — average silhouette width",
subtitle = "Higher values indicate better-defined clusters") +
theme_minimal()The elbow and silhouette diagnostics guide the choice of k. The elbow plot suggests diminishing marginal improvement in within-cluster compactness beyond k = 3, while the silhouette plot identifies the number of clusters that maximises internal cohesion. Following these diagnostics, k = 3 is selected as the operationally meaningful and statistically defensible choice, as it balances cluster distinctiveness with interpretability for management reporting.
8.2 Fitting and profiling clusters
Code
set.seed(4821)
km_fit <- kmeans(df_scaled, centers = 3, nstart = 25, iter.max = 100)
# Attach cluster labels to data
df_with_clusters <- df |>
mutate(across(all_of(clust_vars),
~ifelse(is.na(.), median(., na.rm = TRUE), .))) |>
mutate(cluster = as.factor(km_fit$cluster))
cat("Cluster sizes:\n")Cluster sizes:
Code
table(df_with_clusters$cluster)
1 2 3
74 132 79
Code
# Profile each cluster by its mean values
profile_tbl <- df_with_clusters |>
group_by(cluster) |>
summarise(
n = n(),
mean_input = round(mean(input), 1),
mean_output = round(mean(output), 1),
mean_loss = round(mean(loss), 1),
mean_eff_pct = round(mean(efficiency) * 100, 1),
mean_energy = round(mean(specific_energy_consumption_k_wh_kg), 3),
mean_cost_kg = round(mean(energy_cost_per_kg_kg), 3),
mean_yield = round(mean(material_yield), 3),
.groups = "drop"
)
profile_tbl |>
kable(caption = "Table 3: Cluster profile summary",
col.names = c("Cluster", "N", "Avg Input",
"Avg Output", "Avg Loss",
"Avg Eff (%)", "Avg Energy (kWh/kg)",
"Avg Cost/kg", "Avg Yield")) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Cluster | N | Avg Input | Avg Output | Avg Loss | Avg Eff (%) | Avg Energy (kWh/kg) | Avg Cost/kg | Avg Yield |
|---|---|---|---|---|---|---|---|---|
| 1 | 74 | 5653.4 | 4963.5 | -4.0 | 59.1 | 1962.838 | 63.474 | 0.880 |
| 2 | 132 | 3225.3 | 2829.2 | -5.0 | 33.7 | 1315.489 | 77.879 | 0.874 |
| 3 | 79 | 1344.5 | 1257.6 | -2.1 | 15.0 | 952.378 | 149.265 | 0.933 |
Code
# Assign business-friendly names based on cluster profiles
cluster_names <- profile_tbl |>
mutate(
business_label = case_when(
mean_eff_pct == max(mean_eff_pct) ~ "High Efficiency Performers",
mean_loss == max(mean_loss) ~ "High Loss Operations",
TRUE ~ "Energy Cost Intensive Operations"
)
) |>
select(cluster, business_label)
df_with_clusters <- df_with_clusters |>
left_join(cluster_names, by = "cluster") |>
mutate(business_label = as.factor(business_label))
cluster_names |>
kable(caption = "Table 4: Business labels assigned to each cluster",
col.names = c("Cluster ID", "Business Label")) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Cluster ID | Business Label |
|---|---|
| 1 | High Efficiency Performers |
| 2 | Energy Cost Intensive Operations |
| 3 | High Loss Operations |
Code
fviz_cluster(km_fit,
data = df_scaled,
geom = "point",
ellipse.type = "convex",
palette = c("#1b7837", "#d73027", "#2c7bb6"),
ggtheme = theme_minimal(),
main = "K-means cluster visualisation (k = 3)")The three clusters reveal distinct operational profiles:
High Efficiency Performers represent the benchmark for the plant network. These runs achieve above-median efficiency with controlled losses. Operational managers should document and standardise the conditions (machine settings, input grades, staffing) associated with these runs.
High Loss Operations are characterised by disproportionately high material losses relative to input. These runs represent the greatest financial and environmental waste opportunity in the network. Immediate process engineering review of the conditions generating this cluster is recommended.
Energy Cost Intensive Operations are distinguished by elevated energy cost per kilogram of output. While their efficiency may not be the lowest, the cost structure of these runs erodes margins. Procurement, scheduling, and machine maintenance reviews should focus on this segment.
Code
df_with_clusters |>
count(plant, business_label) |>
pivot_wider(names_from = business_label, values_from = n, values_fill = 0) |>
kable(caption = "Table 5: Cluster membership by plant") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| plant | Energy Cost Intensive Operations | High Efficiency Performers | High Loss Operations |
|---|---|---|---|
| Kano | 38 | 14 | 18 |
| Lagos | 33 | 60 | 44 |
| Morpack | 61 | 0 | 17 |
9 Technique 4: Dimensionality reduction using PCA
PCA reduces the high-dimensional operational data to a small number of orthogonal components that capture the dominant directions of variance. This enables a two-dimensional visual representation of the entire dataset, revealing which operational variables covary together and how the k-means clusters map onto this reduced space.
Code
pca_result <- prcomp(df_scaled, center = FALSE, scale. = FALSE)
# Variance explained
var_explained <- summary(pca_result)$importance
var_exp_tbl <- as.data.frame(t(var_explained)) |>
rownames_to_column("Component") |>
head(5)
var_exp_tbl |>
mutate(across(where(is.numeric), ~round(., 4))) |>
kable(caption = "Table 6: PCA — variance explained by first five components") |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Component | Standard deviation | Proportion of Variance | Cumulative Proportion |
|---|---|---|---|
| PC1 | 2.1249 | 0.5644 | 0.5644 |
| PC2 | 1.1521 | 0.1659 | 0.7303 |
| PC3 | 0.9984 | 0.1246 | 0.8549 |
| PC4 | 0.9129 | 0.1042 | 0.9591 |
| PC5 | 0.4991 | 0.0311 | 0.9902 |
Code
fviz_eig(pca_result,
addlabels = TRUE,
barfill = "#2c7bb6",
barcolor = "white",
linecolor = "#636363") +
labs(title = "PCA scree plot — variance explained by each component") +
theme_minimal()Code
fviz_pca_biplot(pca_result,
geom.ind = "point",
col.ind = df_with_clusters$business_label,
palette = c("#1b7837", "#d73027", "#2c7bb6"),
addEllipses = TRUE,
ellipse.type = "confidence",
col.var = "black",
repel = TRUE,
legend.title = "Cluster") +
labs(title = "PCA biplot coloured by operational cluster") +
theme_minimal()Interpreting PC1 and PC2:
PC1 typically captures the dominant axis of operational variance, which — given the correlation structure — is likely aligned with the overall throughput and efficiency dimension: runs with high input, high output, and high efficiency all load positively, while high-loss runs load negatively. This component essentially represents the production performance axis.
PC2 appears to capture the energy cost and yield dimension — distinguishing runs that are energy-intensive from those with high material yield. This can be interpreted as the resource cost efficiency axis, separating runs where the cost of energy per kilogram is disproportionate to the material value recovered.
The biplot arrows (variable loadings) confirm which variables dominate each component. Variables with long arrows pointing in similar directions are positively correlated; those pointing in opposite directions are negatively correlated. The cluster ellipses in the biplot space validate that the k-means algorithm has indeed found groups that differ meaningfully along both principal operational dimensions.
10 Technique 5: Time series forecasting
Forecasting weekly average efficiency provides management with a forward-looking view of plant network performance. Even with a short historical window, a statistically fitted forecast with quantified uncertainty is more rigorous than ad-hoc extrapolation. The insights support proactive scheduling, capacity planning, and goal-setting for operational KPIs.
Code
weekly_data <- df |>
mutate(week = floor_date(date, "week", week_start = 1)) |>
group_by(week) |>
summarise(
mean_efficiency = mean(efficiency, na.rm = TRUE),
n_records = n(),
.groups = "drop"
) |>
arrange(week)
cat("Number of weekly observations:", nrow(weekly_data), "\n")Number of weekly observations: 14
Code
weekly_data |>
kable(caption = "Table 7: Weekly average efficiency",
col.names = c("Week Commencing", "Mean Efficiency", "Records"),
digits = 4) |>
kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
full_width = FALSE)| Week Commencing | Mean Efficiency | Records |
|---|---|---|
| 2025-12-29 | 0.2250 | 10 |
| 2026-01-05 | 0.3187 | 26 |
| 2026-01-12 | 0.3315 | 27 |
| 2026-01-19 | 0.3116 | 22 |
| 2026-01-26 | 0.3029 | 18 |
| 2026-02-02 | 0.3372 | 21 |
| 2026-02-09 | 0.4349 | 21 |
| 2026-02-16 | 0.3074 | 25 |
| 2026-02-23 | 0.3730 | 18 |
| 2026-03-02 | 0.3823 | 22 |
| 2026-03-09 | 0.3728 | 22 |
| 2026-03-16 | 0.4736 | 21 |
| 2026-03-23 | 0.3491 | 27 |
| 2026-03-30 | 0.3244 | 5 |
Code
ggplot(weekly_data, aes(x = week, y = mean_efficiency * 100)) +
geom_line(colour = "#2c7bb6", linewidth = 0.9) +
geom_point(colour = "#2c7bb6", size = 2) +
labs(title = "Weekly average operational efficiency",
subtitle = paste0("Recycling plant network, ",
format(min(weekly_data$week), "%B"),
" – ",
format(max(weekly_data$week), "%B %Y")),
x = "Week", y = "Mean Efficiency (%)") +
theme_minimal()Code
# Create ts object — frequency = 4 represents a quasi-monthly cycle
# (approximately 4 weeks per month, defensible for short industrial series)
eff_ts <- ts(weekly_data$mean_efficiency,
start = c(1, 1),
frequency = 4)10.1 ACF and PACF
Code
ggAcf(eff_ts, lag.max = 10) +
labs(title = "ACF — weekly average efficiency") +
theme_minimal()Code
ggPacf(eff_ts, lag.max = 10) +
labs(title = "PACF — weekly average efficiency") +
theme_minimal()The autocorrelation function (ACF) indicates whether current efficiency values are correlated with lagged values. Significant spikes at early lags suggest autocorrelation that a time-series model can exploit for forecasting. The partial ACF (PACF) helps identify the autoregressive order of the process. With only 14 weekly observations, the ACF and PACF should be interpreted cautiously — apparent spikes may reflect sampling variability rather than true process structure.
10.2 Classical decomposition
Code
# Classical additive decomposition with quasi-monthly frequency
decomp <- decompose(eff_ts, type = "additive")
autoplot(decomp) +
labs(title = "Classical decomposition of weekly efficiency series") +
theme_minimal()The decomposition separates the series into trend, seasonal, and random (irregular) components. Given the short observation window (one quarter), the seasonal component reflects variation within a monthly cycle rather than a true annual pattern. The trend component is the most operationally useful output: a consistently rising trend would indicate improving plant network performance over the quarter, while a declining or flat trend warrants management attention. The random component captures unexplained daily and weekly variation — its magnitude relative to the trend indicates how noisy the production process is.
10.3 ARIMA modelling and forecast
Code
set.seed(4821)
arima_fit <- auto.arima(eff_ts,
stepwise = FALSE,
approximation = FALSE,
seasonal = TRUE)
cat("Selected ARIMA model:\n")Selected ARIMA model:
Code
print(arima_fit)Series: eff_ts
ARIMA(0,1,1)
Coefficients:
ma1
-0.6405
s.e. 0.2217
sigma^2 = 0.003943: log likelihood = 17.79
AIC=-31.59 AICc=-30.39 BIC=-30.46
Code
cat("\nModel coefficients:\n")
Model coefficients:
Code
tidy(arima_fit) |>
kable(digits = 4) |>
kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)| term | estimate | std.error |
|---|---|---|
| ma1 | -0.6405 | 0.2217 |
auto.arima() uses information criteria (AICc) to select the optimal ARIMA(p, d, q)(P, D, Q) specification. With a short series, a parsimonious model (few parameters) is preferred to avoid overfitting. The selected model is the statistically optimal description of the autocorrelation structure in weekly efficiency data, given the available observations.
Code
fc <- forecast(arima_fit, h = 3)
autoplot(fc) +
labs(title = "3-week efficiency forecast with 80% and 95% confidence intervals",
x = "Week (index)", y = "Mean Efficiency") +
theme_minimal()Code
fc_df <- as.data.frame(fc) |>
rownames_to_column("Period") |>
mutate(across(where(is.numeric), ~round(. * 100, 2)))
fc_df |>
kable(caption = "Table 8: 3-period efficiency forecast (values expressed as %)",
col.names = c("Forecast Period", "Point Forecast (%)",
"Lo 80 (%)", "Hi 80 (%)",
"Lo 95 (%)", "Hi 95 (%)")) |>
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Forecast Period | Point Forecast (%) | Lo 80 (%) | Hi 80 (%) | Lo 95 (%) | Hi 95 (%) |
|---|---|---|---|---|---|
| 4 Q3 | 36.32 | 28.27 | 44.36 | 24.01 | 48.62 |
| 4 Q4 | 36.32 | 27.77 | 44.87 | 23.24 | 49.39 |
| 5 Q1 | 36.32 | 27.29 | 45.34 | 22.51 | 50.12 |
Business interpretation: The 3-week forward forecast provides management with an expected efficiency trajectory and its associated uncertainty. Forecast intervals widen with the forecast horizon — this is a fundamental property of all statistical forecasts, not a flaw. If the lower bound of the 95% interval falls below a management-defined minimum acceptable efficiency threshold, this constitutes an early warning signal to review scheduling, maintenance, or input sourcing before the at-risk period arrives. Conversely, a consistently upward trajectory with narrow intervals would provide confidence that the operational improvements observed in recent weeks are likely to persist.
11 Integrated findings
The five analytical techniques converge on a coherent operational narrative for the recycling plant network:
Predictive classification demonstrates that it is statistically feasible to predict efficiency class from observable operational inputs (input volumes, losses, energy metrics), and that the random forest model provides stronger discrimination than logistic regression. This opens the door to real-time efficiency monitoring and predictive intervention systems.
Model explainability identifies the variables that most strongly drive efficiency predictions. These should become the primary KPIs tracked at operational level — they are not merely statistically significant but represent genuine operational leverage points.
Clustering reveals that the plant network contains at least three structurally distinct types of operations: high-performance runs, high-loss runs, and energy-intensive runs. These clusters cut across plants and machines, suggesting that process conditions (not plant identity alone) drive performance variation. A targeted improvement programme for each segment is likely more effective than uniform network-wide interventions.
PCA confirms that two latent dimensions — production throughput performance and resource cost efficiency — account for the dominant variance in operations. The strong alignment between PCA clusters and k-means clusters validates the clustering solution as reflecting genuine structure in the data rather than algorithmic artefact.
Time series forecasting provides a short-term outlook for efficiency. With only one quarter of data, the forecast should be treated as indicative rather than definitive. The primary value of the time series result lies in establishing a baseline methodology that will grow more powerful as additional weeks of data accrue.
12 Limitations and further work
Data limitations
- The analysis covers 01 January 2026 to 31 March 2026 — a 3-month window. This is insufficient for detecting annual seasonal patterns, accounting for holidays, or identifying long-run efficiency trends. Extending the dataset to at least 12 months is strongly recommended before drawing strategic conclusions.
- Missing values in energy cost (21.3%) and material yield (21.1%) columns were imputed with column medians. While this is a standard and defensible approach, it introduces a degree of noise into models that use these variables. Where possible, the root cause of these missing values should be investigated — they may not be missing at random.
- The dataset contains no information on operator, shift, raw material grade, ambient conditions, or maintenance status. These variables are likely important efficiency drivers whose absence limits the predictive power of the classification models.
Modelling limitations
- With approximately 142 valid records per efficiency class in the classification models, statistical power is moderate. Results should be validated on a larger or more recent dataset before operational deployment.
- The time series forecast is based on 14 weekly observations. ARIMA models fitted to such short series are highly sensitive to the last few observations and carry wide uncertainty intervals. The forecast should be updated every week with new data.
- K-means clustering assumes spherical, equal-variance clusters, which may not hold in operational data. Alternative clustering methods (e.g., Gaussian mixture models, hierarchical clustering with Ward’s linkage) could be explored to verify the cluster solution.
Further work
- Incorporate shift-level, operator-level, and maintenance-schedule data to improve model accuracy.
- Extend the analysis to a rolling 12-month window as new data becomes available.
- Deploy the random forest classifier as a real-time scoring engine, integrated with the plant’s operational monitoring system.
- Conduct a formal cost-benefit analysis for the three cluster segments to quantify the financial value of moving High Loss and Energy Cost Intensive runs towards the High Efficiency Performers profile.
13 References
Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. SAGE Publications.
Kuhn, M., & Silge, J. (2022). Tidy modeling with R. O’Reilly Media. https://www.tmwr.org/
R Core Team. (2025). R: A language and environment for statistical computing (Version 4.5). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag. https://ggplot2.tidyverse.org/
Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3), 1–22. https://doi.org/10.18637/jss.v027.i03
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.3). https://doi.org/10.5281/zenodo.5960048
Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. https://doi.org/10.18637/jss.v077.i01
14 Appendix: AI usage statement
Artificial intelligence tools (specifically, large language model-based coding assistants) were used in the preparation of this report to assist with:
- R code structure and syntax, including tidymodels workflow construction, ggplot2 visualisation scaffolding, and Quarto document formatting.
- Debugging error messages encountered during package installation and data cleaning.
- Drafting initial code templates for the time series, PCA, and clustering sections.
All analytical judgements, business interpretations, cluster labelling, and narrative conclusions were independently reviewed and, where necessary, revised by the student. The student takes full responsibility for the accuracy and integrity of the analysis and interpretations presented in this document. AI-generated code was not accepted uncritically; each section was executed, its outputs examined, and its logic verified against course materials and standard statistical practice.
This statement is made in compliance with the Lagos Business School Academic Integrity Policy for the use of AI-assisted tools in assessed coursework.