Eupepsia Farms: Predictive Modelling & Segmentation

Lagos Business School MBA Capstone — Case Study 2

Author

Emmanuel Atolagbe

Published

April 28, 2026

1 Background

Eupepsia Farms (trading as EYiA FarmCity) operates a network of commercial greenhouse clusters across Nigeria. This capstone applies a full predictive modelling and segmentation pipeline to the farm’s 2026 time-series harvest and revenue forecast, which was built from 450 KoboToolbox form submissions and bank-reconciled pricing data. The three primary crops are Bell Pepper, Habanero, and Tomato.

Two data sources are combined: the 2026 forecast described above and the Pack House Register, a transaction-level record of every kilogram graded and packaged across the EYiA FarmCity portfolio (321 companies, Nov 2025–Apr 2026). The register is used to cross-validate the January–February 2026 “FarmCity actuals” embedded in the forecast, and to extract real-world grade quality ratios.

The analysis covers:

  • Data loading and structure validation
  • Pack House Register cross-validation (Jan–Feb 2026 actuals)
  • Feature engineering (temporal + agronomic)
  • Simulation of conditionally-dependent environmental covariates
  • Classification modelling (Logistic Regression + Random Forest)
  • Clustering and segmentation (K-means)
  • Dimensionality reduction (PCA)
  • Time-series decomposition and forecasting (ETS/ARIMA)
  • Revenue optimisation via linear programming (current + stressed price scenarios)
  • Business interpretation after each technique

2 Step 1: Data loading

Detected column types after reshaping
Type Examples / Values
Date / temporal date, month, quarter
Crop identifier crop (Bell Pepper, Habanero, Tomato)
Harvest forecast (kg) harvest_kg
Revenue forecast (₦'000) revenue_000
Engineered: season Dry / Early Rainy / Peak Rainy
Engineered: quarter Q1 – Q4
Simulated: environment rainfall_mm, humidity_pct, avg_temp_c, pest_pressure, disease_incidence

3 Step 1b: Pack House Register cross-validation

The EYiA FarmCity Pack House Register is a KoboToolbox-powered transaction database recording grade-level weights for every crop entering the packhouse. The raw submission export contains 22,450 grading records across 321 portfolio companies from November 2025 to April 2026.

Important

Scope clarification — The register aggregates output from all 321 incubated companies, not Eupepsia’s own FarmCity production in isolation. The Jan–Feb 2026 figures in the forecast CSV are labelled “FarmCity actuals” and represent Eupepsia’s own production (~6,200 kg BP and ~5,600 kg Habanero in January). The packhouse portfolio totals are much larger and serve as a benchmark for the scale of the broader programme.

3.1 Portfolio monthly totals

EYiA FarmCity portfolio — packhouse monthly totals (kg)
Month Companies Portfolio BP Portfolio HAB Portfolio TOM Avg BP/co Avg HAB/co Avg TOM/co
Nov-25 137 19604 13782 1655 143 101 12
Dec-25 44 17657 7665 163 401 174 4
Jan-26 232 55904 50292 34933 241 217 151
Feb-26 131 85531 48513 11324 653 370 86
Mar-26 63 43385 58515 29471 689 929 468
Apr-26 2 0 560 0 0 280 0

3.2 Eupepsia forecast vs portfolio average

Eupepsia FarmCity actuals vs portfolio per-company average (kg)
Month Eupepsia BP Portfolio avg BP % diff Eupepsia HAB Portfolio avg HAB % diff Eupepsia TOM Portfolio avg TOM % diff
Jan-26 6212 241 2478 5588 217 2478 3882 151 2478
Feb-26 6579 653 908 3691 370 897 871 86 908
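
The % diff columns can be reproduced from the two tables in this section. The Python sketch below is an illustrative check, not the report's own code; it assumes % diff means (Eupepsia kg minus portfolio average) divided by portfolio average, using the unrounded per-company average.

```python
# Figures copied from the two tables in this section (kg).
portfolio = {  # month -> (companies, BP total, HAB total, TOM total)
    "Jan-26": (232, 55904, 50292, 34933),
    "Feb-26": (131, 85531, 48513, 11324),
}
eupepsia = {  # month -> (BP, HAB, TOM)
    "Jan-26": (6212, 5588, 3882),
    "Feb-26": (6579, 3691, 871),
}

def pct_diff(own_kg, total_kg, n_companies):
    """Percent difference of own output vs the unrounded per-company average."""
    avg = total_kg / n_companies
    return round((own_kg - avg) / avg * 100)

result = {
    month: [
        pct_diff(own, total, portfolio[month][0])
        for own, total in zip(eupepsia[month], portfolio[month][1:])
    ]
    for month in eupepsia
}
print(result)
```

Under that assumption the sketch returns the table's values. Using the rounded averages printed in the table instead (for example 217 kg for Habanero in January) would give 2475 rather than 2478, so the published figures only reproduce from unrounded averages.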

3.3 Harvest quality: grade breakdown

Figure 1: Pack House grade quality — Bell Pepper, Habanero and Tomato (Jan–Feb 2026 portfolio)
Note

Business observation — Bell Pepper records a ~28% rejection rate with only 3.5% reaching Grade A; Habanero shows the strongest quality profile (40% Grade A, 1.5% rejection). This snapshot should be interpreted with care: the grading data was captured during Bell Pepper’s penultimate-to-ultimate growth month, a stage characteristically marked by smaller fruit size and naturally lower external quality. This is a known agronomic pattern for bell pepper, not an indicator of post-harvest handling failure, and likely understates typical mid-cycle performance. A direct comparison with Habanero is also unreliable since the two crops were graded at different growth stages. From a revenue standpoint, Grade A Bell Pepper commands a market price of ~₦7,000/kg — underscoring the upside when quality grades improve at peak cycle. Eupepsia should re-grade Bell Pepper at mid-cycle to establish a true quality baseline before drawing conclusions or redesigning handling protocols.


4 Step 2: Data understanding

Missing value count per column
column n_missing
crop 0
month_label 0
harvest_kg 0
revenue_000 0

2026 annual summary by crop
Crop Total Harvest (kg) Total Revenue (₦) Avg Monthly (kg) Peak Month
Bell Pepper 116,143 ₦325,200,400 9,679 Jun-26
Habanero 91,410 ₦182,819,600 7,617 Jul-26
Tomato 40,898 ₦51,122,700 3,408 Jun-26
Note

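Business observation: Bell Pepper is the highest-volume crop (116,143 kg forecast) and also earns the most, both in total (₦325.2M) and per kilogram: dividing total revenue by total harvest gives roughly ₦2,800/kg for Bell Pepper, ₦2,000/kg for Habanero and ₦1,250/kg for Tomato. Habanero is the second-largest contributor (91,410 kg, ₦182.8M). Tomato exhibits a distinct dual-cycle pattern with peaks in June and October.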
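
The revenue per kilogram implied by the annual summary can be checked by dividing total revenue by total harvest. The Python sketch below uses only the figures in the summary table; the variable names are illustrative.

```python
# Totals copied from the 2026 annual summary table.
summary = {  # crop -> (total harvest in kg, total revenue in naira)
    "Bell Pepper": (116_143, 325_200_400),
    "Habanero": (91_410, 182_819_600),
    "Tomato": (40_898, 51_122_700),
}

# Average realised price implied by the forecast, rounded to the nearest naira.
implied_price = {crop: round(rev / kg) for crop, (kg, rev) in summary.items()}
print(implied_price)
```

These implied averages rank Bell Pepper first, Habanero second and Tomato third on a per-kilogram basis.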


5 Step 3: Feature engineering

Yield category distribution by crop (within-crop tertiles)
crop Low Medium High
Bell Pepper 4 4 4
Habanero 4 4 4
Tomato 4 4 4
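
The 4/4/4 split follows from ranking the 12 monthly values within each crop and cutting the ranks into thirds. The Python sketch below shows one way to do this; it is rank-based, which is an assumption (the report may have cut on quantile values instead), and the harvest series is hypothetical rather than taken from the forecast.

```python
def tertile_labels(values):
    """Label each value Low/Medium/High by its rank within the list."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])  # indices, smallest first
    labels = [None] * n
    for rank, i in enumerate(order):
        labels[i] = ("Low", "Medium", "High")[min(rank * 3 // n, 2)]
    return labels

# Hypothetical 12-month harvest series for one crop (kg), NOT the forecast data.
harvest = [6200, 6600, 3100, 2400, 5200, 18000,
           16500, 14000, 9000, 7500, 4100, 3300]
labels = tertile_labels(harvest)
counts = {k: labels.count(k) for k in ("Low", "Medium", "High")}
print(counts)
```

Because the cut is made separately for each crop, every crop contributes exactly four months to each class, which is why the classes are perfectly balanced.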
Figure 2: Monthly harvest (kg) by crop — Eupepsia Farms 2026 forecast
Note

Business observation — The heatmap reveals a pronounced mid-year harvest concentration (June–August) across all three crops, typical of Nigeria’s rainy-season production window. January–February values for Bell Pepper and Habanero reflect carry-over output from the previous cycle (C11/C12 tail). Farm management should plan cold-chain and packhouse capacity for the June peak.


6 Step 4: Environmental data simulation

Because the KoboToolbox form did not capture environmental readings at harvest level, we simulate five agronomically relevant covariates with conditional dependency — each variable is a function of the preceding one in the causal chain:

\[\text{Rainfall} \rightarrow \text{Humidity} \rightarrow \text{Temperature} \rightarrow \text{Pest Pressure} \rightarrow \text{Disease Incidence}\]

Figure 3: Simulated environmental variables — pairwise correlations confirm expected agronomic dependencies
Note

Validation note: The correlation matrix confirms that the simulation reproduces the intended dependency chain: pest_pressure and disease_incidence correlate strongly and positively with rainfall_mm and humidity_pct, while avg_temp_c correlates negatively with rainfall, consistent with the cooler conditions of the peak rainy season in Southwest Nigeria. Because all five covariates are simulated, this check validates the generator, not the underlying agronomy.
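
A conditionally dependent simulation of this kind can be sketched in a few lines of Python. Every coefficient, noise level, the seed and the monthly rainfall means below are hypothetical placeholders chosen for illustration; they are not the values used in the report and are not NIMET data.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(2026)  # fixed seed so the sketch is reproducible
# Hypothetical monthly rainfall means (mm), Jan..Dec.
rain_mean = [10, 20, 60, 110, 170, 230, 250, 200, 220, 130, 40, 15]

rows = []
for _crop in range(3):          # 3 crops x 12 months = 36 rows
    for m in range(12):
        # Each variable depends only on the previous one in the chain.
        rain = max(0.0, rng.gauss(rain_mean[m], 15))
        humidity = min(100.0, 55 + 0.15 * rain + rng.gauss(0, 3))
        temp = 36 - 0.08 * humidity + rng.gauss(0, 0.5)
        pest = max(0.0, 2.0 * (32 - temp) + rng.gauss(0, 0.5))
        disease = max(0.0, 1 + 0.8 * pest + rng.gauss(0, 0.5))
        rows.append((rain, humidity, temp, pest, disease))

rain, hum, temp, pest, dis = (list(col) for col in zip(*rows))
print(round(pearson(rain, hum), 2), round(pearson(rain, temp), 2),
      round(pearson(rain, dis), 2))
```

Because each variable is generated from its predecessor plus noise, the correlation with rainfall weakens slightly at each step down the chain while keeping the intended sign.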


7 Step 5: Classification modelling

We aim to predict yield category (High / Medium / Low) from temporal and environmental features.

Important

Sample-size caveat — After reshaping, the dataset contains 36 observations (3 crops × 12 months). A 75/25 train/test split yields approximately 27 training and 9 test rows — a very small sample. Results should be interpreted directionally, not as definitive performance benchmarks. In production, this model would be retrained when multi-year or site-level data become available.
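
With 12 observations per yield class, a stratified 75/25 split gives 9 training and 3 test rows per class. The Python sketch below illustrates this; whether the report stratified its split is not stated, so stratification and the seed are assumptions.

```python
import random

def stratified_split(labels, test_frac, seed):
    """Return (train_idx, test_idx), sampling test_frac of each class."""
    rng = random.Random(seed)
    by_class = {}
    for i, lab in enumerate(labels):
        by_class.setdefault(lab, []).append(i)
    train, test = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test += idx[:n_test]
        train += idx[n_test:]
    return sorted(train), sorted(test)

# 3 crops x 4 months in each within-crop tertile = 12 rows per class.
labels = ["Low"] * 12 + ["Medium"] * 12 + ["High"] * 12
train_idx, test_idx = stratified_split(labels, 0.25, seed=42)
print(len(train_idx), len(test_idx))
```

With only three test rows per class, a single misclassification moves per-class recall by a third, which is the practical meaning of the caveat above.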

7.1 Logistic regression (multinomial)

7.2 Random Forest


8 Step 6: Model evaluation

8.1 Confusion matrices

Figure 4: Confusion matrices — Logistic Regression (left) vs Random Forest (right)

8.2 ROC curves (one-vs-rest)

Figure 5: One-vs-rest ROC curves for Random Forest — yield category classification
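
One-vs-rest AUC treats each class in turn as the positive class and all others as negative. The Python sketch below computes it as the share of correctly ranked (positive, negative) pairs; the probabilities are hypothetical and are not output from the report's Random Forest.

```python
def auc(scores, is_positive):
    """Share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(prob_rows, truth, classes):
    """AUC per class, scoring each row by its probability for that class."""
    return {
        cls: auc([row[k] for row in prob_rows], [t == cls for t in truth])
        for k, cls in enumerate(classes)
    }

classes = ["Low", "Medium", "High"]
# Hypothetical class probabilities (Low, Medium, High) for six test rows.
probs = [
    (0.7, 0.2, 0.1), (0.5, 0.4, 0.1), (0.3, 0.5, 0.2),
    (0.2, 0.3, 0.5), (0.1, 0.3, 0.6), (0.4, 0.35, 0.25),
]
truth = ["Low", "Low", "Medium", "Medium", "High", "High"]
aucs = one_vs_rest_auc(probs, truth, classes)
print(aucs)
```

On a nine-row test set each class has only three positives, so each one-vs-rest curve is built from 18 pairs and moves in coarse steps.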

8.3 Feature importance

Figure 6: Random Forest feature importance (Mean Decrease Gini)
Note

Agronomic interpretation

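  • crop ranks first because yield scale differs fundamentally across species (Bell Pepper peaks at ~43,000 kg/month versus Tomato’s ~22,000 kg/month, reflecting different crop physiology and planting density), so the harvest_kg cut-offs that define each tertile are crop-specific. Because the tertiles are computed within crop, crop carries no information about the class on its own; it matters only in combination with harvest_kg.
  • harvest_kg is included as a feature even though the target is a tertile of that same variable. This is target leakage: the label can be recovered almost directly from harvest_kg and crop, so the importance ranking and the accuracy figures overstate what temporal and environmental features alone can predict.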
  • season and rainfall_mm rank next, confirming that Nigeria’s agroclimatic calendar (dry vs rainy season transition) drives harvest windows. Rainfed and supplemental irrigation decisions are most critical in April–June (Early Rainy transition).
  • disease_incidence and pest_pressure have moderate importance, reinforcing that phytosanitary management during peak humidity months (Jul–Sep) directly governs whether a crop falls in the Low versus High category.

Decision implication — Eupepsia management should anchor scouting schedules and pesticide protocols to the season × pest_pressure interaction, particularly for Bell Pepper in the June–July window.
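
The interaction between crop and harvest_kg can be illustrated with a small Python sketch. The monthly values are hypothetical, not the forecast data. A rule that sees harvest_kg and crop reproduces the within-crop tertile labels exactly, while the same rule with pooled cut-offs (ignoring crop) does not.

```python
def tertile_cuts(values):
    """Upper bounds of the Low and Medium tertiles."""
    s = sorted(values)
    n = len(s)
    return s[n // 3 - 1], s[2 * n // 3 - 1]

def label(value, cuts):
    low_max, mid_max = cuts
    return "Low" if value <= low_max else "Medium" if value <= mid_max else "High"

# Hypothetical monthly harvests (kg) for two crops; NOT the forecast data.
data = {
    "Bell Pepper": [6200, 6600, 3100, 2400, 5200, 18000,
                    16500, 14000, 9000, 7500, 4100, 3300],
    "Tomato": [3900, 900, 300, 200, 2500, 9000,
               6000, 1500, 3000, 8000, 4000, 2100],
}
n_rows = sum(len(v) for v in data.values())

# Target: within-crop tertiles, as in the feature engineering step.
crop_cuts = {c: tertile_cuts(v) for c, v in data.items()}
target = {c: [label(x, crop_cuts[c]) for x in v] for c, v in data.items()}

# Rule A uses harvest_kg and crop; rule B uses harvest_kg with pooled cut-offs.
pooled_cuts = tertile_cuts([x for v in data.values() for x in v])
hits_with_crop = sum(label(x, crop_cuts[c]) == t
                     for c, v in data.items() for x, t in zip(v, target[c]))
hits_pooled = sum(label(x, pooled_cuts) == t
                  for c, v in data.items() for x, t in zip(v, target[c]))
acc_with_crop = hits_with_crop / n_rows
acc_pooled = hits_pooled / n_rows
print(acc_with_crop, round(acc_pooled, 3))
```

Rule A scores perfectly by construction, which is the point: any model given harvest_kg and crop can approach that score without learning anything about season or environment. Dropping harvest_kg from the feature set would give a more honest read on the other predictors.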


9 Step 7: Clustering and segmentation

K-means clustering groups months across crops into agronomic production regimes, enabling targeted input allocation.

9.1 Optimal K: elbow + silhouette

Figure 7: Elbow plot (left) and average silhouette width (right) for K selection
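
The elbow side of this selection can be sketched with a plain implementation of Lloyd's algorithm that reports the within-cluster sum of squares (WSS) for each K. The Python code below is a minimal illustration with a deterministic start and hypothetical standardised (harvest, rainfall) points; it does not reproduce the report's clustering, and the silhouette calculation is omitted.

```python
def dist2(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(pts):
    return tuple(sum(c) / len(pts) for c in zip(*pts))

def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm, started from k evenly spaced points in the list."""
    step = max(1, len(points) // k)
    centres = [points[i * step] for i in range(k)]
    for _ in range(n_iter):
        labels = [min(range(k), key=lambda c: dist2(p, centres[c]))
                  for p in points]
        new = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            new.append(mean_point(members) if members else centres[c])
        if new == centres:      # assignments stable: converged
            break
        centres = new
    wss = sum(dist2(p, centres[lab]) for p, lab in zip(points, labels))
    return labels, centres, wss

# Hypothetical standardised (harvest, rainfall) pairs: three dry, three wet.
points = [(-1.0, -1.1), (-0.8, -0.9), (-1.2, -1.0),
          (1.0, 0.9), (1.1, 1.2), (0.9, 1.0)]
labels2, _, _ = kmeans(points, 2)
wss = {k: kmeans(points, k)[2] for k in (1, 2, 3)}
print(labels2, wss)
```

The elbow is the K after which WSS stops falling sharply; in this toy example the large drop is from K = 1 to K = 2, with little gained at K = 3.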

9.2 Cluster profiles

K-means cluster profiles
Cluster n Label Avg Harvest (kg) Avg Rainfall (mm) Avg Pest Avg Disease Dom. Season
1 15 Peak Production (Rainy Season) 8308.0 183.0 5.7 6.2 Peak Rainy
2 21 Dry-Season Low-Yield 5896.7 28.3 1.5 3.5 Dry
Figure 8: K-means clusters — harvest (kg) vs rainfall (mm), by crop
Note

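Agronomic interpretation

The cluster profile table above reports two clusters (n = 15 and n = 21), so only the Peak Production and Dry-Season Low-Yield rows below correspond to the fitted solution. The other two labels are candidate regimes that are not separated at this K.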

Cluster label Farm management implication
Peak Production (Rainy Season) Maximise packhouse throughput; pre-book logistics contracts in March for Jun–Aug peak
High-Risk: Wet & Pest-Prone Intensify Integrated Pest Management (IPM) scouting; deploy fungicides prophylactically
Dry-Season Low-Yield Consider supplemental drip irrigation; reserve seed stocks for January cycle
Transitional / Recovery Monitor carry-over crops; optimise labour allocation between cycles

10 Step 8: Principal component analysis

PCA reduces the six numeric features to their principal axes, revealing which combinations of environmental and production variables explain the most variance across crop-months.

PCA variance explained
Component % Variance Cumulative %
PC1 78.1 78.1
PC2 15.8 93.9
PC3 3.6 97.6
PC4 1.5 99.1
PC5 0.7 99.8
PC6 0.2 100.0
Figure 9: Scree plot — variance explained by each principal component
Figure 10: PCA biplot — PC1 vs PC2 coloured by cluster, with loading arrows
Note

Agronomic interpretation of principal components

  • PC1 (“Wet-season production axis”): High positive loadings on rainfall_mm, humidity_pct, pest_pressure, and disease_incidence, with a negative loading on avg_temp_c. Points high on PC1 are rainy-season months with elevated biotic stress. This axis explains 78.1% of total variance, confirming that Nigeria’s agroclimatic calendar is the dominant source of production variation.

  • PC2 (“Yield intensity axis”): Dominated by the harvest_kg loading and orthogonal to the weather axis; it explains a further 15.8% of variance. PC2 separates high-volume crop-months (Bell Pepper Jun, Habanero Jul–Aug) from low-volume tail months. Crop species drives position on this axis more than weather.

Decision implication — Farm investment in disease management (fungicides, drainage) primarily affects the PC1 dimension; investment in agronomic practices (variety selection, nutrition) primarily affects PC2. These are separable levers and should have separate budget lines.
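
The share of variance attributed to PC1 is the largest eigenvalue of the correlation matrix divided by the number of variables. The Python sketch below finds it by power iteration on three hypothetical columns; it is an illustration of the technique, not the report's PCA, and the data are invented.

```python
def standardise(col):
    """Centre and scale a column to unit sample variance."""
    n = len(col)
    m = sum(col) / n
    sd = (sum((v - m) ** 2 for v in col) / (n - 1)) ** 0.5
    return [(v - m) / sd for v in col]

def correlation_matrix(columns):
    z = [standardise(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z]
            for zi in z]

def first_component(corr, n_iter=500):
    """Power iteration: returns (loadings, share of total variance) for PC1."""
    p = len(corr)
    # Uneven start vector so it is not orthogonal to mixed-sign loadings.
    v = [1.0 / (i + 1) for i in range(p)]
    for _ in range(n_iter):
        w = [sum(corr[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(corr[i][j] * v[j] for j in range(p))
                     for i in range(p))
    return v, eigenvalue / p

# Hypothetical readings for six months; NOT the simulated covariates.
rainfall = [10, 60, 170, 250, 200, 40]
humidity = [57, 63, 80, 92, 86, 62]
avg_temp = [31.5, 31.0, 29.6, 28.7, 29.0, 31.2]

loadings, share = first_component(
    correlation_matrix([rainfall, humidity, avg_temp]))
print([round(x, 2) for x in loadings], round(share, 3))
```

When the variables are tightly chained, as here, PC1 absorbs most of the variance and the temperature loading takes the opposite sign to rainfall and humidity, the same pattern described for PC1 above.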


11 Step 9: Time-series analysis and forecasting

We aggregate monthly harvest to total farm output and model the 12-month trajectory, then forecast the first three months of the next planning period.

Figure 11: Total monthly harvest with LOESS trend — Eupepsia Farms 2026
Figure 12: ETS forecast — total monthly harvest, 3-period horizon (Jan–Mar 2027)

Monthly harvest forecast by crop — line chart with season shading
Note

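Business observation: The LOESS-smoothed series shows a single seasonal peak centred on June–July, followed by a modest decline that is consistent with cycle tail-off before the November replanting window. With only 12 monthly observations, which is a single seasonal cycle, a formal decomposition such as STL cannot separate trend from seasonality, so this reading is descriptive. The ETS forecast for January–March 2027 projects carry-over output in the 5,000–8,000 kg total range. That is below the Jan–Feb 2026 actuals in Section 3.2, which sum to roughly 15,700 kg and 11,100 kg across the three crops, so the forecast points to a lighter cycle tail than in 2026. It should be checked against seedling distribution timelines before the next planning cycle is treated as on track.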
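
The simplest member of the ETS family, simple exponential smoothing (ETS(A,N,N)), can be sketched in a few lines of Python. This is not the model fitted in the report: the smoothing weight alpha is fixed by hand here rather than estimated, there is no trend or seasonal term, and the series is hypothetical.

```python
def ses_forecast(series, alpha, horizon):
    """Simple exponential smoothing with a fixed alpha; flat forecast."""
    level = series[0]
    for y in series[1:]:
        # New level = weighted average of the latest value and the old level.
        level = alpha * y + (1 - alpha) * level
    return [level] * horizon

# Hypothetical total monthly harvest (kg), Jan..Dec; NOT the forecast CSV.
total_kg = [15000, 11000, 9000, 12000, 20000, 42000,
            38000, 33000, 25000, 21000, 13000, 9500]
fc = ses_forecast(total_kg, alpha=0.8, horizon=3)
print([round(v) for v in fc])
```

With a high alpha the forecast sits close to the last few observations, which is why a non-seasonal forecast made at the end of a cycle lands near the cycle-tail level rather than near the annual mean.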


12 Step 10: Revenue optimisation — linear programme

Given a fixed greenhouse production capacity, this section determines the profit-maximising crop mix under five price scenarios — from baseline to full market stress. The LP uses real yield coefficients derived from the seedling distribution data and imposes agronomic constraints (minimum diversification, maximum concentration, and a demand ceiling per crop).

12.1 Model formulation

\[\text{Maximise } Z = m_{BP} x_{BP} + m_{HAB} x_{HAB} + m_{TOM} x_{TOM}\]

Where \(x_i\) is kilograms produced of crop \(i\) and \(m_i = \text{price}_i - \text{cost}_i\) is the net margin per kilogram.

Constraints:

Constraint Expression Basis
Greenhouse space \(0.796 x_{BP} + 0.453 x_{HAB} + 0.902 x_{TOM} \leq 170{,}769\) Seed-count-derived space coefficients
Labour budget (+10% flex) \(0.30 x_{BP} + 0.20 x_{HAB} + 0.15 x_{TOM} \leq 65{,}186\) Relative grading labour intensity
Minimum diversification \(x_i \geq 10\%\) of capacity Risk management floor
Maximum concentration \(x_{BP}, x_{HAB} \leq 65\%\); \(x_{TOM} \leq 30\%\) Demand & cycle ceiling
Total output ceiling \(\sum x_i \leq 300{,}000 \text{ kg}\) Packhouse throughput limit

12.2 Scenario results

LP optimal crop mix under five price scenarios
Scenario P_BP P_HAB P_TOM Opt BP (kg) Opt HAB (kg) Opt TOM (kg) % BP % HAB % TOM Total (kg) Profit (₦M) Status
S1: Baseline 3000 2500 1500 103,412 161,493 12,423 37.3 58.2 4.5 277,328 519.0 Optimal
S2: BP glut (BP −33%) 2000 2500 1500 99,362 161,493 20,521 35.3 57.4 7.3 281,377 418.5 Optimal
S3: Habanero shock (HAB −40%) 3000 1500 1500 161,493 68,389 12,423 66.6 28.2 5.1 242,305 405.0 Optimal
S4: Tomato surge (TOM +47%) 3000 2500 2200 99,362 161,493 20,521 35.3 57.4 7.3 281,377 532.2 Optimal
S5: Full market stress 2000 1500 1200 99,362 161,493 20,521 35.3 57.4 7.3 281,377 250.8 Optimal
Figure 13: Optimal crop mix allocation (% of total kg) by price scenario
Figure 14: Annual profit (₦M) under each price scenario — shows downside risk from full market stress
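
The reported optima can be checked against the constraint table by plugging them back into the left-hand sides. The Python sketch below does this for S1–S3. One assumption is made and flagged in the code: the "capacity" in the share constraints is taken to be the 2026 forecast total of 248,451 kg, because 65% of that figure equals the 161,493 kg cap seen in the results.

```python
SPACE = (0.796, 0.453, 0.902)     # greenhouse-space coefficients (BP, HAB, TOM)
LABOUR = (0.30, 0.20, 0.15)       # labour coefficients (BP, HAB, TOM)
SPACE_CAP, LABOUR_CAP, OUTPUT_CAP = 170_769, 65_186, 300_000
# ASSUMPTION: capacity for the share constraints = 2026 forecast total (kg).
CAPACITY = 116_143 + 91_410 + 40_898

solutions = {  # scenario -> optimal kg (BP, HAB, TOM) from the results table
    "S1": (103_412, 161_493, 12_423),
    "S2": (99_362, 161_493, 20_521),
    "S3": (161_493, 68_389, 12_423),
}

def usage(coef, x):
    """Left-hand side of a linear constraint."""
    return sum(c * v for c, v in zip(coef, x))

report = {}
for name, x in solutions.items():
    report[name] = {
        "space_used": round(usage(SPACE, x)),
        "labour_used": round(usage(LABOUR, x)),
        "total_kg": sum(x),
        "share_pct": tuple(round(100 * v / CAPACITY, 1) for v in x),
    }
    assert usage(SPACE, x) <= SPACE_CAP and usage(LABOUR, x) <= LABOUR_CAP
    assert sum(x) <= OUTPUT_CAP

for name, row in report.items():
    print(name, row)
```

Under that assumption, the labour budget is effectively binding in S1 and S2, greenhouse space is close to binding in S2 and S3 (within the rounding of the published coefficients), and the crop at 161,493 kg sits at the 65% cap. The Tomato allocation of 12,423 kg in S1 and S3 works out to 5% of 248,451 kg rather than the 10% floor stated in the constraint table, so either the floor uses a different capacity base or it was set at 5%; this is worth confirming against the model code.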
Note

Business interpretation — LP results

The LP reveals four actionable insights for Eupepsia’s crop planning committee:

  1. Habanero anchors the mix (S2): Habanero is already at its 65% concentration cap (161,493 kg) in the baseline and stays there when Bell Pepper prices fall by 33% (market glut, a recurring pattern in Lagos pepper markets). The LP cannot shift further toward Habanero; instead it trims Bell Pepper from 103,412 kg to 99,362 kg and raises Tomato from 12,423 kg to 20,521 kg, while profit falls from ₦519.0M to ₦418.5M. Habanero volume is what cushions the loss, so management should maintain Habanero seed stock as a demand-responsive buffer.

  2. Tomato expands only modestly (S4): At ₦2,200/kg the LP lifts Tomato from 12,423 kg (4.5% of output) to 20,521 kg (7.3%). The same allocation is optimal in S2 and S5, where Tomato is priced at ₦1,500/kg and ₦1,200/kg but Bell Pepper falls to ₦2,000/kg, so the trigger is Tomato’s margin relative to Bell Pepper rather than an absolute price threshold. In the baseline (S1) and the Habanero shock (S3), Tomato stays at its minimum allocation of 12,423 kg. The dual-cycle Tomato programme (May–July + Sep–Nov) therefore remains a minor share in every scenario, which aligns with the low Tomato share in the 2026 forecast.

  3. Full market stress (S5) compresses profit to ₦250.8M, about 52% below the baseline, but the optimal response is not to exit; it is to maximise volume (281,377 kg, the highest total across the scenarios) while holding minimum crop diversity. This is the agronomic equivalent of a defensive portfolio in equities.

  4. Grade quality amplifies LP profit: The LP uses blended prices anchored to the forecast CSV’s realised/wholesale assumption of ₦3,000/kg for Bell Pepper. Within that model, improving Grade A yield from 3.5% to 15% raises the effective blended price to ~₦3,250/kg, shifting S1 baseline profit upward by roughly ₦25–26M (₦250/kg on the 103,412 kg S1 Bell Pepper volume). Note, however, that the retail market price for Bell Pepper is ~₦7,000/kg, substantially above the LP’s ₦3,000/kg assumption. If that gap reflects wholesale-vs-retail pricing or packhouse deductions, the true revenue uplift from grade improvement could be far larger than the modelled figure, and warrants separate investigation.
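
The headline figures in this interpretation follow from simple arithmetic on the scenario table. The Python sketch below reproduces the grade-quality uplift and each scenario's profit change against the baseline. It assumes the Bell Pepper volume stays at the S1 optimum when the blended price rises, which is a simplification because the LP is not re-solved.

```python
# Grade-quality uplift: price gain per kg times S1 Bell Pepper volume.
bp_kg_s1 = 103_412                          # S1 optimal Bell Pepper volume (kg)
base_price, improved_price = 3_000, 3_250   # blended naira per kg, from the text
uplift_m = (improved_price - base_price) * bp_kg_s1 / 1e6   # naira millions

# Profit by scenario (naira millions), copied from the results table.
profit_m = {"S1": 519.0, "S2": 418.5, "S3": 405.0, "S4": 532.2, "S5": 250.8}
change_vs_s1 = {
    s: round(100 * (p - profit_m["S1"]) / profit_m["S1"], 1)
    for s, p in profit_m.items()
}
print(round(uplift_m, 1), change_vs_s1)
```

Read this way, a Habanero price shock (S3) costs slightly more profit than a Bell Pepper glut (S2), and the Tomato surge (S4) adds only a few percent because Tomato volume stays small.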


13 Step 11: Business interpretation summary

The table below consolidates the decision-relevant insights from all analytical techniques, including the Pack House Register cross-validation and LP optimisation added in this enhanced version.

Business interpretation and decision matrix — Eupepsia 2026
Technique Finding Decision
Pack House Register Bell Pepper rejection rate is 28%; only 3.5% reaches Grade A (data captured at penultimate growth stage — smaller fruit is expected). Habanero achieves 40% Grade A with 1.5% rejection Re-grade Bell Pepper at mid-cycle to establish a representative quality baseline; Grade A BP fetches ~₦7,000/kg at market — quantify the mid-cycle grade mix before redesigning handling protocols
Feature Engineering June–August accounts for ~62% of total annual harvest; this window (June in Q2, July–August in Q3) is the make-or-break period Lock in packhouse capacity and cold-chain contracts by March for the June–August surge
Environmental Simulation Rainfall → humidity cascade confirms that wet-season biotic stress (pest, disease) is the chief agronomic risk Deploy preventive fungicide and IPM schedules from May onward, not reactively in August
Logistic Regression Temporal features (season, quarter) are sufficient to classify yield tier; crop identity dominates the signal Use season as a simple operational trigger for resource reallocation (labour, inputs, transport)
Random Forest Random Forest improves over logistic regression by capturing non-linear crop × season interactions Embed RF model in the KoboToolbox pipeline to flag Low-yield risk at seedling distribution stage
Feature Importance Crop species, season, and disease incidence are the top predictors of whether output falls in the Low tertile Crop selection and disease management are the two highest-ROI levers for yield improvement
K-means Clustering The fitted solution separates two production regimes: peak rainy-season high output (n = 15) and dry-season low yield (n = 21); wet high-risk and transitional regimes are candidate labels not separated at this K Allocate agrochemical and irrigation budgets by cluster, not uniformly across the farm network
PCA PC1 separates wet-biotic-stress months from dry months; PC2 separates high-volume from low-volume crops Separate investment in drainage/disease control (PC1) from variety/nutrition investment (PC2)
Time-Series (ETS) 2026 seasonal peak is well-defined; Jan–Mar 2027 forecast aligns with prior cycle tail, confirming planning assumptions Order habanero seeds for cycle 2 by August to hit the Sep–Nov replanting window on schedule
LP Optimisation Habanero sits at its 65% cap in four of the five scenarios and anchors profit in a BP price crash; Tomato never exceeds 7.3% of output, even at ₦2,200/kg Maintain Habanero seed stock as a market-responsive buffer; review crop mix quarterly against spot prices

14 Step 12: Limitations and further work

Important

Key limitations

  1. Sample size — 36 observations (3 crops × 12 months) is insufficient for robust ML generalisation. All model accuracy figures are indicative. Results become more reliable when multi-year or site-level (greenhouse cluster) data are incorporated.

  2. Simulated environment — The five environmental covariates were generated stochastically. The Pack House Register’s temperature/humidity columns were inspected but found to contain placeholder values (coded as 1) across the Jan–Feb 2026 window, making them unusable as sensor replacements. Deploying actual IoT greenhouse sensors and linking to NIMET rain gauge data remains an open requirement.

  3. Revenue assumptions: Prices (₦3,000 Bell Pepper / ₦2,500 Habanero / ₦1,500 Tomato per kg) are held constant and sourced from the bank-reconciled forecast CSV. They are higher than the averages implied by the annual summary in Step 2, where total revenue divided by total harvest gives roughly ₦2,800, ₦2,000 and ₦1,250 per kg, so the two sets of figures should be reconciled. The Bell Pepper price assumption of ₦3,000/kg also diverges materially from the ~₦7,000/kg average retail market price. This gap likely reflects a combination of packhouse deductions, wholesale/off-taker contract pricing, and grade-mix discounting, but the precise source of the difference should be reconciled with Eupepsia’s accounts. Until resolved, LP profit figures for Bell Pepper should be treated as conservative lower bounds. More broadly, Nigerian commodity prices exhibit significant seasonal volatility; a price-adjusted revenue model would improve operational planning accuracy.

  4. Single-season forecast: The ETS model is fit on 12 data points, which is exactly one 12-period seasonal cycle. A seasonal component cannot be identified from a single cycle, so the confidence intervals on the 3-step-ahead forecast are wide and the forecast should be treated as directional, not precise.

  5. Grading register: single late-cycle snapshot — The Pack House Register quality data (Bell Pepper Grade A = 3.5%, rejection = 28%) covers only the Jan–Feb 2026 window, which falls in the penultimate-to-ultimate month of the Bell Pepper growth cycle — a stage characterised by smaller fruit size and naturally degraded external quality. This single snapshot cannot serve as a representative baseline for annual quality performance. A full quality analysis would require grading records spanning at least mid-cycle (peak-fruit stage) and early-cycle, captured systematically across multiple crop batches and greenhouse clusters.

Tip

Recommended next steps

  • Sensor integration — Install IoT temperature/humidity loggers in greenhouse clusters and link data to KoboToolbox submissions; this would make the environmental simulation in Step 4 fully data-driven.
  • Model deployment — Embed the Random Forest classifier in the KoboToolbox harvest estimation form as an early-warning flag for predicted Low-yield months.
  • Grade quality baseline — The Pack House Register’s 28% Bell Pepper rejection rate was recorded at the penultimate-to-ultimate growth month, when smaller fruit and lower external quality are agronomically normal. This figure should not trigger a post-harvest handling audit in isolation. Eupepsia should re-grade at mid-cycle to capture a representative quality distribution; given that Grade A Bell Pepper fetches ~₦7,000/kg at market, even a modest improvement in the mid-cycle Grade A share would generate meaningful revenue upside.
  • Dynamic LP — Extend the LP model with a monthly price index sourced from Lagos commodity markets to enable real-time crop mix recommendations at the seedling distribution stage.

15 Session information

R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Africa/Lagos
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] scales_1.4.0         kableExtra_1.4.0     knitr_1.51          
 [4] lpSolve_5.6.23       e1071_1.7-17         forecast_9.0.2      
 [7] ggfortify_0.4.19     cluster_2.1.8.2      pROC_1.19.0.1       
[10] nnet_7.3-20          randomForest_4.7-1.2 readxl_1.4.5        
[13] lubridate_1.9.5      forcats_1.0.1        stringr_1.6.0       
[16] dplyr_1.2.0          purrr_1.2.1          readr_2.2.0         
[19] tidyr_1.3.2          tibble_3.3.1         ggplot2_4.0.3       
[22] tidyverse_2.0.0     

loaded via a namespace (and not attached):
 [1] gtable_0.3.6       xfun_0.56          htmlwidgets_1.6.4  ggrepel_0.9.8     
 [5] lattice_0.22-7     tzdb_0.5.0         vctrs_0.7.1        tools_4.5.2       
 [9] generics_0.1.4     parallel_4.5.2     proxy_0.4-29       pkgconfig_2.0.3   
[13] Matrix_1.7-4       RColorBrewer_1.1-3 S7_0.2.1           lifecycle_1.0.5   
[17] compiler_4.5.2     farver_2.1.2       textshaping_1.0.5  htmltools_0.5.9   
[21] class_7.3-23       yaml_2.3.12        crayon_1.5.3       pillar_1.11.1     
[25] nlme_3.1-168       fracdiff_1.5-3     tidyselect_1.2.1   digest_0.6.39     
[29] stringi_1.8.7      splines_4.5.2      labeling_0.4.3     fastmap_1.2.0     
[33] grid_4.5.2         colorspace_2.1-2   cli_3.6.5          magrittr_2.0.4    
[37] withr_3.0.2        bit64_4.6.0-1      timechange_0.4.0   rmarkdown_2.30    
[41] bit_4.6.0          otel_0.2.0         timeDate_4052.112  gridExtra_2.3     
[45] cellranger_1.1.0   zoo_1.8-15         hms_1.1.4          urca_1.3-4        
[49] evaluate_1.0.5     viridisLite_0.4.3  mgcv_1.9-3         rlang_1.1.7       
[53] Rcpp_1.1.1         glue_1.8.0         xml2_1.5.2         vroom_1.7.1       
[57] svglite_2.2.2      rstudioapi_0.18.0  jsonlite_2.0.0     R6_2.6.1          
[61] systemfonts_1.3.2