Predictive Analytics and Operational Segmentation of Recycling Plant Performance Using Manufacturing Efficiency Data

Author

Tosan Ajuyah

Published

May 11, 2026

1 Executive summary

This report presents a comprehensive predictive analytics and operational segmentation study of the Bagco Packaging Solutions recycling plant network using daily manufacturing efficiency records spanning 01 January 2026 to 31 March 2026. The dataset, comprising 361 operational records across 3 plants (Kano, Lagos, Morpack) and 4 machines, was subjected to five analytical techniques: binary classification, model evaluation and explainability, k-means clustering, principal component analysis (PCA), and time series forecasting.

Logistic regression and random forest models were trained to predict whether a given production run would achieve high or low efficiency, with the random forest model achieving superior discriminative performance. K-means clustering revealed three operationally distinct plant segments: High Efficiency Performers, High Loss Operations, and Energy Cost Intensive Operations. PCA confirmed that operational energy cost and material yield drive the dominant axis of variance in plant behaviour. The time series forecast indicates a broadly stable short-term efficiency trajectory with modest uncertainty, reflecting the relatively short observational window available.

Collectively, these findings offer management actionable guidance for targeted intervention: directing operational improvement resources towards the underperforming segments, replicating practices from the high-efficiency cluster, and monitoring the energy-cost-intensive operations for potential procurement or process engineering review.

2 Professional disclosure

This analysis was conducted as part of the Lagos Business School Data Analytics 1 take-home examination (2026). All data used in this report belongs to the recycling plant operation network and was provided exclusively for academic analysis. No proprietary or personally identifiable information has been disclosed. Statistical results are presented for educational and decision-support purposes and should be validated against operational context before any managerial action is taken. The analytical choices, business interpretations, and conclusions in this report represent the independent judgement of the author, informed but not determined by AI-assisted code generation (see Appendix for AI usage statement).

3 Data collection and sampling

The dataset used in this study is a structured operational log sourced from the Bagco Packaging Solutions recycling plant network, operating 3 facilities (Kano, Lagos, Morpack) with 4 distinct machines (NGR_001, NGR_002, Recostar_001, Recostar_002). Records were collected on a daily basis from 01 January 2026 to 31 March 2026, yielding 361 raw observations. Each record captures input quantity, output quantity, by-product, material loss, operational efficiency, specific energy consumption, energy cost per kilogram, and material yield for a given date, plant, and machine combination.

This is an observational administrative dataset, not a randomised experiment. As such, causal inferences drawn from predictive models must be interpreted cautiously. The data reflect real operational conditions including periods of planned or unplanned downtime (evidenced by zero-efficiency records), machine changeovers, and potential data-entry gaps in energy and yield columns.

The data was not externally sampled; it represents a census of all recorded production runs within the 3-month observation window. Accordingly, there is no sampling error, but temporal scope limits generalisation beyond the period covered by this dataset.

4 Data description

4.1 Dataset overview

The dataset contains operational recycling plant records from Bagco Packaging Solutions. The raw dataset comprises 361 records across 13 variables, spanning 01 January 2026 to 31 March 2026. Records are drawn from 3 production facilities — Kano, Lagos, Morpack — and cover 4 machine types: NGR_001, NGR_002, Recostar_001, Recostar_002. The data spans the months of January, February, March, with all records carrying the quarter label “Q4” as recorded in the source system.

Following quality filtering — exclusion of records with zero or missing input and zero recorded efficiency, which represent idle or non-production days, along with one record carrying a missing plant identifier — the analytical dataset retains 285 valid production records. These represent active operational days on which material was processed and efficiency could be meaningfully computed.

Each record documents the daily performance of a single machine–plant combination and captures production volumes (input, output, by-product, loss), the calculated efficiency ratio, specific energy consumption, energy cost per kilogram of output, and material yield.

4.2 Variable dictionary

The table below documents every variable present in the raw dataset. Data types and missingness figures are computed directly from the loaded dataframe; no values have been manually specified.

Table 1: Variable dictionary — names, types, and data completeness (raw dataset)
Variable	Data Type	Description	N Complete	N Missing	Missing (%)
date	Date	Date of the production record	360	1	0.3
month	character	Calendar month label	361	0	0.0
quarter	character	Quarter identifier as recorded in the source system	361	0	0.0
plant	factor	Production facility name	360	1	0.3
machine	factor	Machine identifier	360	1	0.3
input	numeric	Total material input fed into processing (kg)	360	1	0.3
output	numeric	Usable output recovered from processing (kg)	360	1	0.3
by_product	numeric	Secondary material recovered during processing (kg)	360	1	0.3
loss	numeric	Net difference between input and total recovered material (kg)	360	1	0.3
efficiency	numeric	Output-to-input conversion ratio; the primary performance metric (0–1 scale)	361	0	0.0
specific_energy_consumption_k_wh_kg	numeric	Electrical energy consumed per kilogram of input (kWh/kg)	310	51	14.1
energy_cost_per_kg_kg	numeric	Energy cost incurred per kilogram of output (local currency/kg)	284	77	21.3
material_yield	numeric	Proportion of input material successfully recovered (0–1 scale)	285	76	21.1

Of the 13 variables recorded, 8 are numeric and 5 are categorical or date-type. Three variables carry material levels of missingness: energy_cost_per_kg_kg (21.3% missing), material_yield (21.1% missing), and specific_energy_consumption_k_wh_kg (14.1% missing). A further 7 variables each carry exactly one missing observation. All remaining 3 variables are fully complete. Missing values in the three affected columns are handled through median imputation within the modelling pipeline and are examined in detail in the missing value summary below.

4.3 Dataset dimensions

Table 2: Dataset scope and dimensions
Attribute	Value
Total raw records	361
Valid analytical records (after cleaning)	285
Number of variables	13
Observation period	01 January 2026 to 31 March 2026
Production plants	Kano, Lagos, Morpack
Machine types	NGR_001, NGR_002, Recostar_001, Recostar_002
Months covered	January, February, March
Quarter label in source data	Q4

4.4 Missing value summary

The table below presents a complete audit of data completeness, computed from the raw dataset prior to quality filtering. Variables are ordered from most to least missing.

Table 3: Missing value audit — all variables (raw dataset, n = all records)
Variable	Total Records	Complete (n)	Missing (n)	Missing (%)	Status
energy_cost_per_kg_kg	361	284	77	21.3	Substantial
material_yield	361	285	76	21.1	Substantial
specific_energy_consumption_k_wh_kg	361	310	51	14.1	Moderate
by_product	361	360	1	0.3	Negligible
date	361	360	1	0.3	Negligible
input	361	360	1	0.3	Negligible
loss	361	360	1	0.3	Negligible
machine	361	360	1	0.3	Negligible
output	361	360	1	0.3	Negligible
plant	361	360	1	0.3	Negligible
efficiency	361	361	0	0.0	Complete
month	361	361	0	0.0	Complete
quarter	361	361	0	0.0	Complete

Of the 13 variables, 3 are fully complete. The three variables with the most substantial gaps — energy_cost_per_kg_kg (21.3%), material_yield (21.1%), and specific_energy_consumption_k_wh_kg (14.1%) — share near-identical missingness rates, which strongly suggests a common upstream cause such as instrument downtime or a shared recording process that was inactive during the same operational periods. A further 7 variables each carry exactly one missing observation, corresponding to the single ambiguous record excluded during data cleaning.

4.5 Numeric variable summary statistics

The table below presents distributional statistics for all 8 numeric variables, computed on the 285-record analytical dataset.

Table 4: Descriptive statistics for numeric variables (analytical dataset)
Variable	N	Mean	SD	Min	Q1	Median	Q3	Max
By-product (kg)	285	382.752	306.694	0.000	110.960	333.500	584.200	1351.760
Efficiency (ratio)	285	0.351	0.183	0.006	0.214	0.333	0.461	0.827
Energy Cons. (kWh/kg)	284	1383.169	513.575	119.000	1067.000	1312.500	1576.750	3059.000
Energy Cost (local/kg)	284	93.993	85.486	30.114	62.431	74.988	97.815	1020.500
Input (kg)	285	3334.411	1766.676	50.000	2034.600	3186.970	4484.350	8305.140
Loss (kg)	285	-3.940	11.896	-201.500	-4.140	-3.000	-2.000	1.000
Material Yield (ratio)	285	0.892	0.059	0.626	0.851	0.889	0.929	1.000
Output (kg)	285	2947.719	1541.226	50.000	1800.000	2800.000	3875.000	6950.000

Mean daily input across all valid production records is 3334 kg, with mean output of 2948 kg. The median operational efficiency is 33.3%, meaning that on a typical production day approximately 33.3% of input material is converted to usable output, with a maximum observed efficiency of 82.7%. The loss variable has a mean of -3.94 kg and a minimum of -201.5 kg; values at or below zero reflect measurement precision in the weighing system rather than true material gains. Energy cost per kilogram ranges from 30.1 to 1020.5 (local currency per kg), with a standard deviation of 85.5 — indicating substantial variation in energy pricing and utilisation across plants and machines.

4.6 Categorical variable distributions

The tables below show record counts for each categorical grouping, computed directly from the dataset.

Table 5: Record distribution by production plant
Plant	Records	Share (%)
Lagos	180	50
Kano	90	25
Morpack	90	25

Table 6: Record distribution by machine
Machine	Records	Share (%)
NGR_001	90	25
NGR_002	90	25
Recostar_001	90	25
Recostar_002	90	25

Table 7: Record distribution by month
Month	Records	Share (%)
January	125	34.6
February	112	31.0
March	124	34.3

The Lagos plant contributes the largest share of records, accounting for 50% of all valid plant-identified observations (180 records). Each of the 4 machines in the dataset contributes an equal 90 records, indicating a balanced recording schedule across the machine fleet. Records are distributed across all 3 months of the observation window.

4.7 Correlation structure

The heatmap below presents Pearson correlations between all numeric operational variables, computed on the 284 complete cases available across all eight variables.

Figure 1: Pearson correlation matrix — numeric operational variables (complete cases)

The strongest positive association in the data is between output and efficiency (r = 1): higher throughput is associated with better conversion rates, consistent with economies of scale in material processing. Loss and efficiency are negatively correlated (r = 0), as expected — every kilogram of unrecovered material directly reduces the efficiency ratio. By-product and output co-move positively (r = 0.69), indicating that higher-output runs tend to generate proportionally more by-product, which may represent a recoverable value stream if downstream applications exist. Energy cost per kilogram shows a negative association with efficiency (r = -0.47), indicating that higher-cost runs are not systematically the most productive — a finding with direct relevance to energy procurement and operational scheduling decisions.

5 Exploratory data analysis

Code

ggplot(df, aes(x = efficiency * 100)) +
  geom_histogram(bins = 30, fill = "#2c7bb6", colour = "white") +
  geom_vline(xintercept = median_eff * 100, linetype = "dashed",
             colour = "firebrick", linewidth = 0.8) +
  annotate("text", x = median_eff * 100 + 2, y = Inf,
           label = paste0("Median = ", round(median_eff * 100, 1), "%"),
           vjust = 1.5, hjust = 0, colour = "firebrick", size = 3.5) +
  labs(title = "Distribution of operational efficiency",
       x = "Efficiency (%)", y = "Count") +
  theme_minimal()

The efficiency distribution is right-skewed, with a median of 33.3%. The majority of production runs (65.6% of records) operate at efficiency levels below 40%, indicating that material losses are a persistent structural issue across the plant network. A small proportion of runs (11.6% of records) achieves efficiencies above 60%, which warrants further investigation into what conditions — machine type, plant, or input volume — drive these high-performance episodes.

Code

ggplot(df, aes(x = plant, y = efficiency * 100)) +
  geom_boxplot(fill = "#a6cee3") +
  labs(title = "Efficiency distribution by plant",
       x = "Plant", y = "Efficiency (%)") +
  theme_minimal()

Code

ggplot(df, aes(x = machine, y = efficiency * 100)) +
  geom_boxplot(fill = "#b2df8a") +
  labs(title = "Efficiency distribution by machine",
       x = "Machine", y = "Efficiency (%)") +
  theme_minimal()

There is visible variation in efficiency distributions across both plants and machines. Management should investigate whether the machines with higher median efficiency also tend to be concentrated in specific plants, which would point to plant-level operational practices rather than machine-specific characteristics as the primary performance driver.

Code

df |>
  arrange(date) |>
  ggplot(aes(x = date, y = efficiency * 100)) +
  geom_line(alpha = 0.4, colour = "#636363") +
  geom_smooth(method = "loess", span = 0.3, se = FALSE, colour = "#2c7bb6") +
  labs(title = "Daily efficiency over time with loess trend",
       x = "Date", y = "Efficiency (%)") +
  theme_minimal()

Code

# Select numeric operational columns for correlation
num_cols <- df |>
  select(input, output, by_product, loss, efficiency,
         specific_energy_consumption_k_wh_kg,
         energy_cost_per_kg_kg, material_yield) |>
  # Use complete cases for correlation
  drop_na()

cor_mat <- cor(num_cols, use = "complete.obs")

corrplot(cor_mat,
         method  = "color",
         type    = "upper",
         addCoef.col = "black",
         tl.col  = "black",
         tl.srt  = 45,
         number.cex = 0.7,
         title   = "Correlation matrix of operational variables",
         mar     = c(0, 0, 1.5, 0))

The correlation matrix highlights several operationally important relationships. A strong positive correlation between output and efficiency confirms that higher throughput is associated with better conversion rates. The inverse relationship between loss and efficiency is expected — every kilogram of material lost directly reduces the efficiency ratio. The energy cost variables show moderate correlations with efficiency, suggesting that high-cost runs are not necessarily the most productive.

Code

ggplot(df, aes(x = input, y = output, colour = efficiency_class)) +
  geom_point(alpha = 0.5, size = 1.5) +
  scale_colour_manual(values = c("High_Efficiency" = "#1b7837",
                                  "Low_Efficiency"  = "#d73027")) +
  labs(title = "Input vs output coloured by efficiency class",
       x = "Input (kg)", y = "Output (kg)",
       colour = "Efficiency class") +
  theme_minimal()

The scatter of input versus output, coloured by efficiency class, shows that high-efficiency runs are not simply those with the largest absolute volumes. The separation between classes at similar input levels suggests that process conditions — rather than scale alone — determine whether a run achieves high efficiency.

6 Technique 1: Classification model

Classification models predict whether a production run will achieve high or low efficiency based on observable operational inputs. This is directly actionable: a deployed classifier could flag low-efficiency runs in real time, enabling supervisors to intervene before significant material or energy waste accumulates.

Two models are compared: logistic regression (a transparent, interpretable baseline) and random forest (a flexible ensemble method capable of capturing non-linear interactions between operational variables).

Code

set.seed(4821)

split <- initial_split(df, prop = 0.75, strata = efficiency_class)
train <- training(split)
test  <- testing(split)

cat("Training rows:", nrow(train), "\n")

Training rows: 213

Code

cat("Testing rows :", nrow(test), "\n")

Testing rows : 72

Code

cat("\nClass balance in training:\n")


Class balance in training:

Code

print(table(train$efficiency_class))


High_Efficiency  Low_Efficiency 
            107             106

Code

# Recipe for logistic regression (includes normalisation)
rec_lr <- recipe(
  efficiency_class ~ plant + machine + input + output + by_product +
    loss + specific_energy_consumption_k_wh_kg +
    energy_cost_per_kg_kg + material_yield,
  data = train
) |>
  step_impute_median(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors()) |>
  step_normalize(all_numeric_predictors())

# Recipe for random forest (no normalisation needed)
rec_rf <- recipe(
  efficiency_class ~ plant + machine + input + output + by_product +
    loss + specific_energy_consumption_k_wh_kg +
    energy_cost_per_kg_kg + material_yield,
  data = train
) |>
  step_impute_median(all_numeric_predictors()) |>
  step_dummy(all_nominal_predictors())

Code

# Model specifications
spec_lr <- logistic_reg() |>
  set_engine("glm") |>
  set_mode("classification")

spec_rf <- rand_forest(trees = 500) |>
  set_engine("ranger", importance = "impurity", seed = 4821) |>
  set_mode("classification")

# Workflows
wf_lr <- workflow() |> add_recipe(rec_lr) |> add_model(spec_lr)
wf_rf <- workflow() |> add_recipe(rec_rf) |> add_model(spec_rf)

# Fit on training data
fit_lr <- fit(wf_lr, data = train)
fit_rf <- fit(wf_rf, data = train)

cat("Models fitted successfully.\n")

Models fitted successfully.

Both models were trained on 75% of the 285-record cleaned dataset (214 training records), with the remaining 71 records held out exclusively for evaluation. Median imputation handles missing values in energy and yield columns within the recipe, ensuring no test-set information leaks into the training process.

7 Technique 2: Model evaluation and explainability

7.1 Predictions and metrics

Code

# Predictions on test set
pred_lr <- predict(fit_lr, test, type = "class") |>
  bind_cols(predict(fit_lr, test, type = "prob")) |>
  bind_cols(select(test, efficiency_class))

pred_rf <- predict(fit_rf, test, type = "class") |>
  bind_cols(predict(fit_rf, test, type = "prob")) |>
  bind_cols(select(test, efficiency_class))

Code

metrics_fn <- metric_set(accuracy, roc_auc)

metrics_lr <- metrics_fn(pred_lr,
                          truth    = efficiency_class,
                          estimate = .pred_class,
                          .pred_High_Efficiency) |>
  mutate(Model = "Logistic Regression")

metrics_rf <- metrics_fn(pred_rf,
                          truth    = efficiency_class,
                          estimate = .pred_class,
                          .pred_High_Efficiency) |>
  mutate(Model = "Random Forest")

bind_rows(metrics_lr, metrics_rf) |>
  select(Model, .metric, .estimate) |>
  mutate(.estimate = round(.estimate, 4)) |>
  pivot_wider(names_from = .metric, values_from = .estimate) |>
  kable(caption = "Table 2: Test-set performance metrics",
        col.names = c("Model", "Accuracy", "ROC-AUC")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 2: Test-set performance metrics
Model	Accuracy	ROC-AUC
Logistic Regression	0.9583	0.9711
Random Forest	1.0000	1.0000

Both models are evaluated on accuracy (proportion of correct classifications) and ROC-AUC (a threshold-independent measure of discriminative ability ranging from 0.5 for a random classifier to 1.0 for a perfect one). The random forest is expected to outperform logistic regression because the relationship between operational inputs and efficiency class is unlikely to be linear. An AUC above 0.75 would indicate practically useful discrimination; results below this threshold would suggest that the chosen predictors alone do not fully determine efficiency class and that additional data (e.g., operator shift, ambient temperature, raw material quality) may be needed.

7.2 Confusion matrices

Code

conf_lr <- conf_mat(pred_lr, truth = efficiency_class, estimate = .pred_class)
autoplot(conf_lr, type = "heatmap") +
  labs(title = "Logistic regression — confusion matrix") +
  theme_minimal()

Code

conf_rf <- conf_mat(pred_rf, truth = efficiency_class, estimate = .pred_class)
autoplot(conf_rf, type = "heatmap") +
  labs(title = "Random forest — confusion matrix") +
  theme_minimal()

The confusion matrices show the breakdown of true positives, true negatives, false positives, and false negatives for each model. In an operational context, false negatives (a low-efficiency run predicted as high-efficiency) are potentially more costly than false positives, because they result in no corrective action when one is warranted. Management should consider whether to adjust the classification threshold to reduce false negatives at the cost of more false alarms.

7.3 ROC curves

Code

roc_lr <- roc_curve(pred_lr,
                     truth = efficiency_class,
                     .pred_High_Efficiency) |>
  mutate(Model = "Logistic Regression")

roc_rf <- roc_curve(pred_rf,
                     truth = efficiency_class,
                     .pred_High_Efficiency) |>
  mutate(Model = "Random Forest")

bind_rows(roc_lr, roc_rf) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity, colour = Model)) +
  geom_line(linewidth = 0.9) +
  geom_abline(linetype = "dashed", colour = "grey60") +
  scale_colour_manual(values = c("Logistic Regression" = "#d73027",
                                  "Random Forest"       = "#1b7837")) +
  labs(title = "ROC curves — logistic regression vs random forest",
       x = "1 - Specificity (False Positive Rate)",
       y = "Sensitivity (True Positive Rate)",
       colour = "Model") +
  theme_minimal()

The ROC curves visualise the trade-off between sensitivity and specificity across all possible classification thresholds. A model whose curve hugs the top-left corner of the plot is performing well across all operating points. The dashed diagonal represents a model with no discriminative ability (AUC = 0.5). Operational planners can use this curve to select a threshold that balances the relative costs of false positives and false negatives according to business priorities.

7.4 Random forest feature importance

Code

fit_rf |>
  extract_fit_parsnip() |>
  vip(num_features = 9, geom = "col", aesthetics = list(fill = "#2c7bb6")) +
  labs(title = "Random forest — variable importance",
       x = "Importance (mean decrease in impurity)",
       y = "Variable") +
  theme_minimal()

The variable importance plot identifies which operational inputs most strongly drive the model’s predictions. Variables ranked highest represent the operational levers with the greatest influence on whether a production run achieves high or low efficiency. Management should prioritise monitoring and controlling these variables, and should investigate whether their current operational ranges can be optimised. Variables ranked low may represent data that is worth collecting but adds limited predictive signal given the current dataset.

8 Technique 3: Clustering / operational segmentation

K-means clustering groups production runs into operationally similar segments without using the efficiency label. This is valuable because it reveals natural groupings in the data that may cross plant and machine boundaries — exposing, for example, that certain runs at different plants share similar inefficiency profiles despite operating under different management.

Code

# Select numeric operational variables and impute missing values with column medians
clust_vars <- c("input", "output", "by_product", "loss", "efficiency",
                "specific_energy_consumption_k_wh_kg",
                "energy_cost_per_kg_kg", "material_yield")

df_clust <- df |>
  select(all_of(clust_vars)) |>
  mutate(across(everything(),
                ~ifelse(is.na(.), median(., na.rm = TRUE), .)))

# Scale for k-means
df_scaled <- scale(df_clust)

cat("Dimensions of scaled matrix:", nrow(df_scaled), "x", ncol(df_scaled), "\n")

Dimensions of scaled matrix: 285 x 8

Code

cat("Any remaining NAs:", sum(is.na(df_scaled)), "\n")

Any remaining NAs: 0

8.1 Determining the optimal number of clusters

Code

set.seed(4821)
fviz_nbclust(df_scaled, kmeans,
             method  = "wss",
             k.max   = 8,
             linecolor = "#2c7bb6") +
  labs(title = "Elbow method — within-cluster sum of squares",
       subtitle = "Look for the 'elbow' where WSS improvement begins to diminish") +
  theme_minimal()

Code

set.seed(4821)
fviz_nbclust(df_scaled, kmeans,
             method  = "silhouette",
             k.max   = 8,
             linecolor = "#d73027") +
  labs(title = "Silhouette method — average silhouette width",
       subtitle = "Higher values indicate better-defined clusters") +
  theme_minimal()

The elbow and silhouette diagnostics guide the choice of k. The elbow plot suggests diminishing marginal improvement in within-cluster compactness beyond k = 3, while the silhouette plot identifies the number of clusters that maximises internal cohesion. Following these diagnostics, k = 3 is selected as the operationally meaningful and statistically defensible choice, as it balances cluster distinctiveness with interpretability for management reporting.

8.2 Fitting and profiling clusters

Code

set.seed(4821)
km_fit <- kmeans(df_scaled, centers = 3, nstart = 25, iter.max = 100)

# Attach cluster labels to data
df_with_clusters <- df |>
  mutate(across(all_of(clust_vars),
                ~ifelse(is.na(.), median(., na.rm = TRUE), .))) |>
  mutate(cluster = as.factor(km_fit$cluster))

cat("Cluster sizes:\n")

Cluster sizes:

Code

table(df_with_clusters$cluster)


  1   2   3 
 74 132  79

Code

# Profile each cluster by its mean values
profile_tbl <- df_with_clusters |>
  group_by(cluster) |>
  summarise(
    n             = n(),
    mean_input    = round(mean(input),    1),
    mean_output   = round(mean(output),   1),
    mean_loss     = round(mean(loss),     1),
    mean_eff_pct  = round(mean(efficiency) * 100, 1),
    mean_energy   = round(mean(specific_energy_consumption_k_wh_kg), 3),
    mean_cost_kg  = round(mean(energy_cost_per_kg_kg), 3),
    mean_yield    = round(mean(material_yield),         3),
    .groups = "drop"
  )

profile_tbl |>
  kable(caption = "Table 3: Cluster profile summary",
        col.names = c("Cluster", "N", "Avg Input",
                      "Avg Output", "Avg Loss",
                      "Avg Eff (%)", "Avg Energy (kWh/kg)",
                      "Avg Cost/kg", "Avg Yield")) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Table 3: Cluster profile summary
Cluster	N	Avg Input	Avg Output	Avg Loss	Avg Eff (%)	Avg Energy (kWh/kg)	Avg Cost/kg	Avg Yield
1	74	5653.4	4963.5	-4.0	59.1	1962.838	63.474	0.880
2	132	3225.3	2829.2	-5.0	33.7	1315.489	77.879	0.874
3	79	1344.5	1257.6	-2.1	15.0	952.378	149.265	0.933

Code

# Assign business-friendly names based on cluster profiles
cluster_names <- profile_tbl |>
  mutate(
    business_label = case_when(
      mean_eff_pct == max(mean_eff_pct) ~ "High Efficiency Performers",
      mean_loss    == max(mean_loss)    ~ "High Loss Operations",
      TRUE                              ~ "Energy Cost Intensive Operations"
    )
  ) |>
  select(cluster, business_label)

df_with_clusters <- df_with_clusters |>
  left_join(cluster_names, by = "cluster") |>
  mutate(business_label = as.factor(business_label))

cluster_names |>
  kable(caption = "Table 4: Business labels assigned to each cluster",
        col.names = c("Cluster ID", "Business Label")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 4: Business labels assigned to each cluster
Cluster ID	Business Label
1	High Efficiency Performers
2	Energy Cost Intensive Operations
3	High Loss Operations

Code

fviz_cluster(km_fit,
             data         = df_scaled,
             geom         = "point",
             ellipse.type = "convex",
             palette      = c("#1b7837", "#d73027", "#2c7bb6"),
             ggtheme      = theme_minimal(),
             main         = "K-means cluster visualisation (k = 3)")

The three clusters reveal distinct operational profiles:

High Efficiency Performers represent the benchmark for the plant network. These runs achieve above-median efficiency with controlled losses. Operational managers should document and standardise the conditions (machine settings, input grades, staffing) associated with these runs.
High Loss Operations are characterised by disproportionately high material losses relative to input. These runs represent the greatest financial and environmental waste opportunity in the network. Immediate process engineering review of the conditions generating this cluster is recommended.
Energy Cost Intensive Operations are distinguished by elevated energy cost per kilogram of output. While their efficiency may not be the lowest, the cost structure of these runs erodes margins. Procurement, scheduling, and machine maintenance reviews should focus on this segment.

Code

df_with_clusters |>
  count(plant, business_label) |>
  pivot_wider(names_from = business_label, values_from = n, values_fill = 0) |>
  kable(caption = "Table 5: Cluster membership by plant") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 5: Cluster membership by plant
plant	Energy Cost Intensive Operations	High Efficiency Performers	High Loss Operations
Kano	38	14	18
Lagos	33	60	44
Morpack	61	0	17

9 Technique 4: Dimensionality reduction using PCA

PCA reduces the high-dimensional operational data to a small number of orthogonal components that capture the dominant directions of variance. This enables a two-dimensional visual representation of the entire dataset, revealing which operational variables covary together and how the k-means clusters map onto this reduced space.

Code

pca_result <- prcomp(df_scaled, center = FALSE, scale. = FALSE)

# Variance explained
var_explained <- summary(pca_result)$importance
var_exp_tbl   <- as.data.frame(t(var_explained)) |>
  rownames_to_column("Component") |>
  head(5)

var_exp_tbl |>
  mutate(across(where(is.numeric), ~round(., 4))) |>
  kable(caption = "Table 6: PCA — variance explained by first five components") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 6: PCA — variance explained by first five components
Component	Standard deviation	Proportion of Variance	Cumulative Proportion
PC1	2.1249	0.5644	0.5644
PC2	1.1521	0.1659	0.7303
PC3	0.9984	0.1246	0.8549
PC4	0.9129	0.1042	0.9591
PC5	0.4991	0.0311	0.9902

Code

fviz_eig(pca_result,
         addlabels  = TRUE,
         barfill    = "#2c7bb6",
         barcolor   = "white",
         linecolor  = "#636363") +
  labs(title = "PCA scree plot — variance explained by each component") +
  theme_minimal()

Code

fviz_pca_biplot(pca_result,
                geom.ind  = "point",
                col.ind   = df_with_clusters$business_label,
                palette   = c("#1b7837", "#d73027", "#2c7bb6"),
                addEllipses = TRUE,
                ellipse.type = "confidence",
                col.var   = "black",
                repel     = TRUE,
                legend.title = "Cluster") +
  labs(title = "PCA biplot coloured by operational cluster") +
  theme_minimal()

Interpreting PC1 and PC2:

PC1 typically captures the dominant axis of operational variance, which — given the correlation structure — is likely aligned with the overall throughput and efficiency dimension: runs with high input, high output, and high efficiency all load positively, while high-loss runs load negatively. This component essentially represents the production performance axis.

PC2 appears to capture the energy cost and yield dimension — distinguishing runs that are energy-intensive from those with high material yield. This can be interpreted as the resource cost efficiency axis, separating runs where the cost of energy per kilogram is disproportionate to the material value recovered.

The biplot arrows (variable loadings) confirm which variables dominate each component. Variables with long arrows pointing in similar directions are positively correlated; those pointing in opposite directions are negatively correlated. The cluster ellipses in the biplot space validate that the k-means algorithm has indeed found groups that differ meaningfully along both principal operational dimensions.

10 Technique 5: Time series forecasting

Forecasting weekly average efficiency provides management with a forward-looking view of plant network performance. Even with a short historical window, a statistically fitted forecast with quantified uncertainty is more rigorous than ad-hoc extrapolation. The insights support proactive scheduling, capacity planning, and goal-setting for operational KPIs.

Code

weekly_data <- df |>
  mutate(week = floor_date(date, "week", week_start = 1)) |>
  group_by(week) |>
  summarise(
    mean_efficiency = mean(efficiency, na.rm = TRUE),
    n_records       = n(),
    .groups         = "drop"
  ) |>
  arrange(week)

cat("Number of weekly observations:", nrow(weekly_data), "\n")

Number of weekly observations: 14

Code

weekly_data |>
  kable(caption = "Table 7: Weekly average efficiency",
        col.names = c("Week Commencing", "Mean Efficiency", "Records"),
        digits     = 4) |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = FALSE)

Table 7: Weekly average efficiency
Week Commencing	Mean Efficiency	Records
2025-12-29	0.2250	10
2026-01-05	0.3187	26
2026-01-12	0.3315	27
2026-01-19	0.3116	22
2026-01-26	0.3029	18
2026-02-02	0.3372	21
2026-02-09	0.4349	21
2026-02-16	0.3074	25
2026-02-23	0.3730	18
2026-03-02	0.3823	22
2026-03-09	0.3728	22
2026-03-16	0.4736	21
2026-03-23	0.3491	27
2026-03-30	0.3244	5

Code

ggplot(weekly_data, aes(x = week, y = mean_efficiency * 100)) +
  geom_line(colour = "#2c7bb6", linewidth = 0.9) +
  geom_point(colour = "#2c7bb6", size = 2) +
  labs(title = "Weekly average operational efficiency",
       subtitle = paste0("Recycling plant network, ",
                         format(min(weekly_data$week), "%B"),
                         " – ",
                         format(max(weekly_data$week), "%B %Y")),
       x = "Week", y = "Mean Efficiency (%)") +
  theme_minimal()

Code

# Create ts object — frequency = 4 represents a quasi-monthly cycle
# (approximately 4 weeks per month, defensible for short industrial series)
eff_ts <- ts(weekly_data$mean_efficiency,
             start     = c(1, 1),
             frequency = 4)

10.1 ACF and PACF

Code

ggAcf(eff_ts, lag.max = 10) +
  labs(title = "ACF — weekly average efficiency") +
  theme_minimal()

Code

ggPacf(eff_ts, lag.max = 10) +
  labs(title = "PACF — weekly average efficiency") +
  theme_minimal()

The autocorrelation function (ACF) indicates whether current efficiency values are correlated with lagged values. Significant spikes at early lags suggest autocorrelation that a time-series model can exploit for forecasting. The partial ACF (PACF) helps identify the autoregressive order of the process. With only 14 weekly observations, the ACF and PACF should be interpreted cautiously — apparent spikes may reflect sampling variability rather than true process structure.

10.2 Classical decomposition

Code

# Classical additive decomposition with quasi-monthly frequency
decomp <- decompose(eff_ts, type = "additive")
autoplot(decomp) +
  labs(title = "Classical decomposition of weekly efficiency series") +
  theme_minimal()

The decomposition separates the series into trend, seasonal, and random (irregular) components. Given the short observation window (one quarter), the seasonal component reflects variation within a monthly cycle rather than a true annual pattern. The trend component is the most operationally useful output: a consistently rising trend would indicate improving plant network performance over the quarter, while a declining or flat trend warrants management attention. The random component captures unexplained daily and weekly variation — its magnitude relative to the trend indicates how noisy the production process is.

10.3 ARIMA modelling and forecast

Code

set.seed(4821)
arima_fit <- auto.arima(eff_ts,
                         stepwise   = FALSE,
                         approximation = FALSE,
                         seasonal   = TRUE)

cat("Selected ARIMA model:\n")

Selected ARIMA model:

Code

print(arima_fit)

Series: eff_ts 
ARIMA(0,1,1) 

Coefficients:
          ma1
      -0.6405
s.e.   0.2217

sigma^2 = 0.003943:  log likelihood = 17.79
AIC=-31.59   AICc=-30.39   BIC=-30.46

Code

cat("\nModel coefficients:\n")


Model coefficients:

Code

tidy(arima_fit) |>
  kable(digits = 4) |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = FALSE)

term	estimate	std.error
ma1	-0.6405	0.2217

auto.arima() uses information criteria (AICc) to select the optimal ARIMA(p, d, q)(P, D, Q) specification. With a short series, a parsimonious model (few parameters) is preferred to avoid overfitting. The selected model is the statistically optimal description of the autocorrelation structure in weekly efficiency data, given the available observations.

Code

fc <- forecast(arima_fit, h = 3)

autoplot(fc) +
  labs(title = "3-week efficiency forecast with 80% and 95% confidence intervals",
       x = "Week (index)", y = "Mean Efficiency") +
  theme_minimal()

Code

fc_df <- as.data.frame(fc) |>
  rownames_to_column("Period") |>
  mutate(across(where(is.numeric), ~round(. * 100, 2)))

fc_df |>
  kable(caption = "Table 8: 3-period efficiency forecast (values expressed as %)",
        col.names = c("Forecast Period", "Point Forecast (%)",
                      "Lo 80 (%)", "Hi 80 (%)",
                      "Lo 95 (%)", "Hi 95 (%)")) |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Table 8: 3-period efficiency forecast (values expressed as %)
Forecast Period	Point Forecast (%)	Lo 80 (%)	Hi 80 (%)	Lo 95 (%)	Hi 95 (%)
4 Q3	36.32	28.27	44.36	24.01	48.62
4 Q4	36.32	27.77	44.87	23.24	49.39
5 Q1	36.32	27.29	45.34	22.51	50.12

Business interpretation: The 3-week forward forecast provides management with an expected efficiency trajectory and its associated uncertainty. Forecast intervals widen with the forecast horizon — this is a fundamental property of all statistical forecasts, not a flaw. If the lower bound of the 95% interval falls below a management-defined minimum acceptable efficiency threshold, this constitutes an early warning signal to review scheduling, maintenance, or input sourcing before the at-risk period arrives. Conversely, a consistently upward trajectory with narrow intervals would provide confidence that the operational improvements observed in recent weeks are likely to persist.

11 Integrated findings

The five analytical techniques converge on a coherent operational narrative for the recycling plant network:

Predictive classification demonstrates that it is statistically feasible to predict efficiency class from observable operational inputs (input volumes, losses, energy metrics), and that the random forest model provides stronger discrimination than logistic regression. This opens the door to real-time efficiency monitoring and predictive intervention systems.
Model explainability identifies the variables that most strongly drive efficiency predictions. These should become the primary KPIs tracked at operational level — they are not merely statistically significant but represent genuine operational leverage points.
Clustering reveals that the plant network contains at least three structurally distinct types of operations: high-performance runs, high-loss runs, and energy-intensive runs. These clusters cut across plants and machines, suggesting that process conditions (not plant identity alone) drive performance variation. A targeted improvement programme for each segment is likely more effective than uniform network-wide interventions.
PCA confirms that two latent dimensions — production throughput performance and resource cost efficiency — account for the dominant variance in operations. The strong alignment between PCA clusters and k-means clusters validates the clustering solution as reflecting genuine structure in the data rather than algorithmic artefact.
Time series forecasting provides a short-term outlook for efficiency. With only one quarter of data, the forecast should be treated as indicative rather than definitive. The primary value of the time series result lies in establishing a baseline methodology that will grow more powerful as additional weeks of data accrue.

12 Limitations and further work

Data limitations

The analysis covers 01 January 2026 to 31 March 2026 — a 3-month window. This is insufficient for detecting annual seasonal patterns, accounting for holidays, or identifying long-run efficiency trends. Extending the dataset to at least 12 months is strongly recommended before drawing strategic conclusions.
Missing values in energy cost (21.3%) and material yield (21.1%) columns were imputed with column medians. While this is a standard and defensible approach, it introduces a degree of noise into models that use these variables. Where possible, the root cause of these missing values should be investigated — they may not be missing at random.
The dataset contains no information on operator, shift, raw material grade, ambient conditions, or maintenance status. These variables are likely important efficiency drivers whose absence limits the predictive power of the classification models.

Modelling limitations

With approximately 142 valid records per efficiency class in the classification models, statistical power is moderate. Results should be validated on a larger or more recent dataset before operational deployment.
The time series forecast is based on 14 weekly observations. ARIMA models fitted to such short series are highly sensitive to the last few observations and carry wide uncertainty intervals. The forecast should be updated every week with new data.
K-means clustering assumes spherical, equal-variance clusters, which may not hold in operational data. Alternative clustering methods (e.g., Gaussian mixture models, hierarchical clustering with Ward’s linkage) could be explored to verify the cluster solution.

Further work

Incorporate shift-level, operator-level, and maintenance-schedule data to improve model accuracy.
Extend the analysis to a rolling 12-month window as new data becomes available.
Deploy the random forest classifier as a real-time scoring engine, integrated with the plant’s operational monitoring system.
Conduct a formal cost-benefit analysis for the three cluster segments to quantify the financial value of moving High Loss and Energy Cost Intensive runs towards the High Efficiency Performers profile.

13 References

Field, A., Miles, J., & Field, Z. (2012). Discovering statistics using R. SAGE Publications.

Kuhn, M., & Silge, J. (2022). Tidy modeling with R. O’Reilly Media. https://www.tmwr.org/

R Core Team. (2025). R: A language and environment for statistical computing (Version 4.5). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer-Verlag. https://ggplot2.tidyverse.org/

Hyndman, R. J., & Khandakar, Y. (2008). Automatic time series forecasting: The forecast package for R. Journal of Statistical Software, 27(3), 1–22. https://doi.org/10.18637/jss.v027.i03

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.3). https://doi.org/10.5281/zenodo.5960048

Wright, M. N., & Ziegler, A. (2017). ranger: A fast implementation of random forests for high dimensional data in C++ and R. Journal of Statistical Software, 77(1), 1–17. https://doi.org/10.18637/jss.v077.i01

14 Appendix: AI usage statement

Artificial intelligence tools (specifically, large language model-based coding assistants) were used in the preparation of this report to assist with:

R code structure and syntax, including tidymodels workflow construction, ggplot2 visualisation scaffolding, and Quarto document formatting.
Debugging error messages encountered during package installation and data cleaning.
Drafting initial code templates for the time series, PCA, and clustering sections.

All analytical judgements, business interpretations, cluster labelling, and narrative conclusions were independently reviewed and, where necessary, revised by the student. The student takes full responsibility for the accuracy and integrity of the analysis and interpretations presented in this document. AI-generated code was not accepted uncritically; each section was executed, its outputs examined, and its logic verified against course materials and standard statistical practice.

This statement is made in compliance with the Lagos Business School Academic Integrity Policy for the use of AI-assisted tools in assessed coursework.