| Type | Examples / Values |
|---|---|
| Date / temporal | date, month, quarter |
| Crop identifier | crop (Bell Pepper, Habanero, Tomato) |
| Harvest forecast (kg) | harvest_kg |
| Revenue forecast (₦'000) | revenue_000 |
| Engineered: season | Dry / Early Rainy / Peak Rainy |
| Engineered: quarter | Q1 – Q4 |
| Simulated: environment | rainfall_mm, humidity_pct, avg_temp_c, pest_pressure, disease_incidence |
Eupepsia Farms: Predictive Modelling & Segmentation
Lagos Business School MBA Capstone — Case Study 2
1 Background
Eupepsia Farms (trading as EYiA FarmCity) operates a network of commercial greenhouse clusters across Nigeria. This capstone applies a full predictive modelling and segmentation pipeline to the farm’s 2026 time-series harvest and revenue forecast, which was built from 450 KoboToolbox form submissions and bank-reconciled pricing data. The three primary crops are Bell Pepper, Habanero, and Tomato.
Two data sources are combined: the 2026 forecast described above and the Pack House Register, a transaction-level record of every kilogram graded and packaged across the EYiA FarmCity portfolio (321 companies, Nov 2025–Apr 2026). The register is used to cross-validate the January–February 2026 “FarmCity actuals” embedded in the forecast and to extract real-world grade quality ratios.
The analysis covers:
- Data loading and structure validation
- Pack House Register cross-validation (Jan–Feb 2026 actuals)
- Feature engineering (temporal + agronomic)
- Simulation of conditionally-dependent environmental covariates
- Classification modelling (Logistic Regression + Random Forest)
- Clustering and segmentation (K-means)
- Dimensionality reduction (PCA)
- Time-series decomposition and forecasting (ETS/ARIMA)
- Revenue optimisation via linear programming (current + stressed price scenarios)
- Business interpretation after each technique
2 Step 1: Data loading
3 Step 1b: Pack House Register cross-validation
The EYiA FarmCity Pack House Register is a KoboToolbox-powered transaction database recording grade-level weights for every crop entering the packhouse. The raw submission export contains 22,450 grading records across 321 portfolio companies from November 2025 to April 2026.
Scope clarification — The register aggregates output from all 321 incubated companies, not Eupepsia’s own FarmCity production in isolation. The Jan–Feb 2026 figures in the forecast CSV are labelled “FarmCity actuals” and represent Eupepsia’s own production (~6,200 kg BP and ~5,600 kg Habanero in January). The packhouse portfolio totals are much larger and serve as a benchmark for the scale of the broader programme.
3.1 Portfolio monthly totals
| Month | Companies | Portfolio BP (kg) | Portfolio HAB (kg) | Portfolio TOM (kg) | Avg BP/co (kg) | Avg HAB/co (kg) | Avg TOM/co (kg) |
|---|---|---|---|---|---|---|---|
| Nov-25 | 137 | 19604 | 13782 | 1655 | 143 | 101 | 12 |
| Dec-25 | 44 | 17657 | 7665 | 163 | 401 | 174 | 4 |
| Jan-26 | 232 | 55904 | 50292 | 34933 | 241 | 217 | 151 |
| Feb-26 | 131 | 85531 | 48513 | 11324 | 653 | 370 | 86 |
| Mar-26 | 63 | 43385 | 58515 | 29471 | 689 | 929 | 468 |
| Apr-26 | 2 | 0 | 560 | 0 | 0 | 280 | 0 |
3.2 Eupepsia forecast vs portfolio average
| Month | Eupepsia BP (kg) | Portfolio avg BP (kg) | BP % diff | Eupepsia HAB (kg) | Portfolio avg HAB (kg) | HAB % diff | Eupepsia TOM (kg) | Portfolio avg TOM (kg) | TOM % diff |
|---|---|---|---|---|---|---|---|---|---|
| Jan-26 | 6212 | 241 | 2478 | 5588 | 217 | 2478 | 3882 | 151 | 2478 |
| Feb-26 | 6579 | 653 | 908 | 3691 | 370 | 897 | 871 | 86 | 908 |
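The % diff columns compare Eupepsia's own monthly harvest with the portfolio average per reporting company. A minimal R sketch of that arithmetic for Bell Pepper, using only figures from the two tables above (the variable names are illustrative, not taken from the original analysis):

```r
# Bell Pepper figures copied from Sections 3.1 and 3.2
portfolio_bp_kg <- c(Jan = 55904, Feb = 85531)  # portfolio totals (kg)
n_companies     <- c(Jan = 232,   Feb = 131)    # reporting companies
eupepsia_bp_kg  <- c(Jan = 6212,  Feb = 6579)   # Eupepsia "FarmCity actuals" (kg)

# Average harvest per reporting company
avg_bp_per_co <- portfolio_bp_kg / n_companies

# Percentage by which Eupepsia exceeds the portfolio average
pct_diff <- (eupepsia_bp_kg / avg_bp_per_co - 1) * 100

round(avg_bp_per_co)  # 241 and 653 kg per company
round(pct_diff)       # 2478 and 908 percent
```

Because the denominator is an average over many small incubated companies, the percentages measure a difference in scale rather than a forecast error.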
3.3 Harvest quality: grade breakdown
Business observation — Bell Pepper records a ~28% rejection rate with only 3.5% reaching Grade A; Habanero shows the strongest quality profile (40% Grade A, 1.5% rejection). This snapshot should be interpreted with care: the grading data were captured during the final two months of the Bell Pepper growth cycle, a stage characteristically marked by smaller fruit and naturally lower external quality. This is a known agronomic pattern for Bell Pepper, not an indicator of post-harvest handling failure, and it likely understates typical mid-cycle performance. A direct comparison with Habanero is also unreliable because the two crops were graded at different growth stages. From a revenue standpoint, Grade A Bell Pepper commands a market price of ~₦7,000/kg, which underscores the upside when quality grades improve at peak cycle. Eupepsia should re-grade Bell Pepper at mid-cycle to establish a true quality baseline before drawing conclusions or redesigning handling protocols.
4 Step 2: Data understanding
| column | n_missing |
|---|---|
| crop | 0 |
| month_label | 0 |
| harvest_kg | 0 |
| revenue_000 | 0 |
| Crop | Total Harvest (kg) | Total Revenue | Avg Monthly (kg) | Peak Month |
|---|---|---|---|---|
| Bell Pepper | 116,143 | ₦325,200,400 | 9,678.6 | Jun-26 |
| Habanero | 91,410 | ₦182,819,600 | 7,617.5 | Jul-26 |
| Tomato | 40,898 | ₦51,122,700 | 3,408.2 | Jun-26 |
Business observation — Bell Pepper is the highest-volume crop (116,143 kg forecast) and, on the totals above, also earns the most per kilogram: about ₦2,800/kg, against roughly ₦2,000/kg for Habanero and ₦1,250/kg for Tomato. Tomato exhibits a distinct dual-cycle pattern with peaks in June and October.
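Revenue per kilogram follows directly from the totals in the summary table. The short R check below divides total revenue by total harvest for each crop; the data frame and column names are illustrative.

```r
# Totals copied from the crop summary table
crop_summary <- data.frame(
  crop        = c("Bell Pepper", "Habanero", "Tomato"),
  harvest_kg  = c(116143, 91410, 40898),
  revenue_ngn = c(325200400, 182819600, 51122700)
)

# Implied average revenue per kilogram (naira)
crop_summary$ngn_per_kg <- crop_summary$revenue_ngn / crop_summary$harvest_kg

# Rank crops from highest to lowest revenue per kilogram
crop_summary[order(-crop_summary$ngn_per_kg), c("crop", "ngn_per_kg")]
```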
5 Step 3: Feature engineering
| crop | Low | Medium | High |
|---|---|---|---|
| Bell Pepper | 4 | 4 | 4 |
| Habanero | 4 | 4 | 4 |
| Tomato | 4 | 4 | 4 |
Business observation — The heatmap reveals a pronounced mid-year harvest concentration (June–August) across all three crops, typical of Nigeria’s rainy-season production window. January–February values for Bell Pepper and Habanero reflect carry-over output from the previous cycle (C11/C12 tail). Farm management should plan cold-chain and packhouse capacity for the June peak.
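The engineered features in this step can be sketched in base R as follows. The harvest values are randomly generated placeholders and the month-to-season boundaries are an assumption, since the source does not state them. The within-crop tertile rule is inferred from the 4/4/4 counts in the table above and may differ from the original code.

```r
set.seed(2026)

# Hypothetical long-format data: one row per crop-month (3 crops x 12 months)
harvest <- expand.grid(
  month = 1:12,
  crop  = c("Bell Pepper", "Habanero", "Tomato"),
  stringsAsFactors = FALSE
)
harvest$harvest_kg <- round(runif(nrow(harvest), min = 500, max = 40000))

# Calendar quarter from the month number
harvest$quarter <- paste0("Q", ceiling(harvest$month / 3))

# Season lookup (ASSUMED boundaries, not taken from the source)
season_of <- function(m) {
  ifelse(m %in% 4:6, "Early Rainy",
         ifelse(m %in% 7:10, "Peak Rainy", "Dry"))
}
harvest$season <- season_of(harvest$month)

# Yield tier: within-crop tertiles by rank, so each crop gets 4/4/4
tertile <- function(x) {
  ceiling(rank(x, ties.method = "first") * 3 / length(x))
}
harvest$yield_tier <- factor(
  ave(harvest$harvest_kg, harvest$crop, FUN = tertile),
  levels = 1:3, labels = c("Low", "Medium", "High")
)

table(harvest$crop, harvest$yield_tier)
```

Ranking within each crop means a "High" Tomato month and a "High" Bell Pepper month can differ widely in absolute kilograms, which matters when reading the classification results later.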
6 Step 4: Environmental data simulation
Because the KoboToolbox form did not capture environmental readings at harvest level, we simulate five agronomically relevant covariates with conditional dependency — each variable is a function of the preceding one in the causal chain:
\[\text{Rainfall} \rightarrow \text{Humidity} \rightarrow \text{Temperature} \rightarrow \text{Pest Pressure} \rightarrow \text{Disease Incidence}\]
Validation note — The correlation matrix shows that the simulation reproduces the intended dependency chain: pest_pressure and disease_incidence correlate strongly and positively with rainfall_mm and humidity_pct, while avg_temp_c correlates negatively with rainfall_mm, consistent with wet-bulb cooling during peak rainy seasons in Southwest Nigeria.
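A conditionally dependent simulation of this kind can be sketched in a few lines of base R. Every coefficient, noise level and the monthly rainfall baseline below is a hypothetical placeholder chosen for illustration; the original parameter values are not given in the text.

```r
set.seed(42)

# Hypothetical monthly rainfall baseline (mm), Jan-Dec; illustrative values only
rain_base <- c(10, 20, 50, 100, 180, 250, 280, 260, 220, 120, 30, 10)

# Each variable is drawn conditional on the previous one in the chain:
# rainfall -> humidity -> temperature -> pest pressure -> disease incidence
rainfall_mm       <- pmax(0, rain_base + rnorm(12, sd = 15))
humidity_pct      <- 55 + 0.12 * rainfall_mm + rnorm(12, sd = 2)
avg_temp_c        <- 40 - 0.15 * humidity_pct + rnorm(12, sd = 0.4)
pest_pressure     <- pmin(10, pmax(0, 30 - 0.9 * avg_temp_c + rnorm(12, sd = 0.4)))
disease_incidence <- pmin(10, pmax(0, 1 + 0.8 * pest_pressure + rnorm(12, sd = 0.4)))

env <- data.frame(rainfall_mm, humidity_pct, avg_temp_c,
                  pest_pressure, disease_incidence)

# Correlation matrix used to check that the intended signs were reproduced
round(cor(env), 2)
```

Because the dependencies are built in by construction, the correlation matrix is a sanity check on the simulation code, not evidence about real greenhouse conditions.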
7 Step 5: Classification modelling
We aim to predict yield category (High / Medium / Low) from temporal and environmental features.
Sample-size caveat — After reshaping, the dataset contains 36 observations (3 crops × 12 months). A 75/25 train/test split yields approximately 27 training and 9 test rows — a very small sample. Results should be interpreted directionally, not as definitive performance benchmarks. In production, this model would be retrained when multi-year or site-level data become available.
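The arithmetic behind this caveat is easy to reproduce. The sketch below uses a plain random 75/25 split; the original may have stratified by crop or yield tier, which the text does not say.

```r
set.seed(1)

n_obs   <- 3 * 12                   # 3 crops x 12 months
n_train <- round(0.75 * n_obs)      # 27 training rows
n_test  <- n_obs - n_train          # 9 test rows

# Random 75/25 partition of row indices
train_idx <- sample(n_obs, n_train)
test_idx  <- setdiff(seq_len(n_obs), train_idx)

# Each misclassified test row moves accuracy by this many percentage points
accuracy_step_pct <- 100 / n_test

c(train = n_train, test = n_test, step_pct = round(accuracy_step_pct, 1))
```

With nine test rows, accuracy can only take values in steps of about 11 percentage points, so small differences between models are within the noise of a single split.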
7.1 Logistic regression (multinomial)
7.2 Random Forest
8 Step 6: Model evaluation
8.1 Confusion matrices
8.2 ROC curves (one-vs-rest)
8.3 Feature importance
Agronomic interpretation
- crop is the dominant predictor because yield scale differs fundamentally across species: Bell Pepper peaks at ~43,000 kg/month versus Tomato’s ~22,000 kg/month, reflecting different crop physiology and planting density.
- harvest_kg itself (included as a feature) captures the direct magnitude effect; this is expected in a small dataset where the target is a tertile of the same variable.
- season and rainfall_mm rank next, confirming that Nigeria’s agroclimatic calendar (dry vs rainy season transition) drives harvest windows. Rainfed and supplemental irrigation decisions are most critical in April–June (Early Rainy transition).
- disease_incidence and pest_pressure have moderate importance, reinforcing that phytosanitary management during peak humidity months (Jul–Sep) directly governs whether a crop falls in the Low versus High category.
Decision implication — Eupepsia management should anchor scouting schedules and pesticide protocols to the season × pest_pressure interaction, particularly for Bell Pepper in the June–July window.
9 Step 7: Clustering and segmentation
K-means clustering groups months across crops into agronomic production regimes, enabling targeted input allocation.
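A minimal base-R sketch of the elbow step and the final fit is shown below. The feature values are simulated around the cluster averages reported in Section 9.2, with hypothetical spreads, and only three of the features are used to keep the example short. Silhouette widths would additionally need the cluster package and are omitted here.

```r
set.seed(123)

# Hypothetical crop-month features: 15 wet-season and 21 dry-season rows
features <- data.frame(
  harvest_kg    = c(rnorm(15, 8300, 1500), rnorm(21, 5900, 1500)),
  rainfall_mm   = c(rnorm(15, 183, 30),    rnorm(21, 28, 10)),
  pest_pressure = c(rnorm(15, 5.7, 0.8),   rnorm(21, 1.5, 0.5))
)

# Standardise so that no variable dominates the Euclidean distance
features_scaled <- scale(features)

# Elbow curve: total within-cluster sum of squares for K = 1..6
wss <- sapply(1:6, function(k) {
  kmeans(features_scaled, centers = k, nstart = 25)$tot.withinss
})

# Fit the chosen solution (K = 2 here) and profile the clusters
km <- kmeans(features_scaled, centers = 2, nstart = 25)
aggregate(features, by = list(cluster = km$cluster), FUN = mean)
```

Scaling matters here because harvest_kg is measured in thousands while pest_pressure sits on a 0 to 10 scale; without it the clusters would be driven almost entirely by harvest volume.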
9.1 Optimal K: elbow + silhouette
9.2 Cluster profiles
| Cluster | n | Label | Avg Harvest (kg) | Avg Rainfall (mm) | Avg Pest | Avg Disease | Dom. Season |
|---|---|---|---|---|---|---|---|
| 1 | 15 | Peak Production (Rainy Season) | 8308.0 | 183.0 | 5.7 | 6.2 | Peak Rainy |
| 2 | 21 | Dry-Season Low-Yield | 5896.7 | 28.3 | 1.5 | 3.5 | Dry |
Agronomic interpretation
| Cluster label | Farm management implication |
|---|---|
| Peak Production (Rainy Season) | Maximise packhouse throughput; pre-book logistics contracts in March for Jun–Aug peak |
| High-Risk: Wet & Pest-Prone | Intensify Integrated Pest Management (IPM) scouting; deploy fungicides prophylactically |
| Dry-Season Low-Yield | Consider supplemental drip irrigation; reserve seed stocks for January cycle |
| Transitional / Recovery | Monitor carry-over crops; optimise labour allocation between cycles |
10 Step 8: Principal component analysis
PCA reduces the six numeric features to their principal axes, revealing which combinations of environmental and production variables explain the most variance across crop-months.
| Component | % Variance | Cumulative % |
|---|---|---|
| PC1 | 78.1 | 78.1 |
| PC2 | 15.8 | 93.9 |
| PC3 | 3.6 | 97.6 |
| PC4 | 1.5 | 99.1 |
| PC5 | 0.7 | 99.8 |
| PC6 | 0.2 | 100.0 |
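A variance table of this shape can be produced with base R's prcomp. The sketch below uses a hypothetical 36 x 6 feature matrix in which the five environmental variables share one latent "wetness" factor; the numbers it prints will not match the table above.

```r
set.seed(7)

# Hypothetical 36 x 6 feature matrix driven by a shared "wetness" factor
wet <- rnorm(36)
features <- data.frame(
  harvest_kg        = 7000 + 2000 * rnorm(36),
  rainfall_mm       = 100 + 80 * wet + rnorm(36, sd = 10),
  humidity_pct      = 70 + 10 * wet + rnorm(36, sd = 2),
  avg_temp_c        = 29 - 2 * wet + rnorm(36, sd = 0.5),
  pest_pressure     = 4 + 2 * wet + rnorm(36, sd = 0.5),
  disease_incidence = 5 + 1.5 * wet + rnorm(36, sd = 0.5)
)

# PCA on centred, unit-variance variables
pca <- prcomp(features, center = TRUE, scale. = TRUE)

# Share of variance per component and running total
var_share <- pca$sdev^2 / sum(pca$sdev^2)
data.frame(
  component  = paste0("PC", seq_along(var_share)),
  pct_var    = round(100 * var_share, 1),
  cumulative = round(100 * cumsum(var_share), 1)
)

# Loadings of the first two components
round(pca$rotation[, 1:2], 2)
```

The sign of each loading column is arbitrary in PCA, so a "negative loading on avg_temp_c" should be read relative to the signs of the other variables on the same component.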
Agronomic interpretation of principal components
- PC1 (“Wet-season production axis”): high positive loadings on rainfall_mm, humidity_pct, pest_pressure, and disease_incidence, with a negative loading on avg_temp_c. Points high on PC1 are rainy-season months with elevated biotic stress. This axis explains the majority of total variance, confirming that Nigeria’s agroclimatic calendar is the dominant source of production variation.
- PC2 (“Yield intensity axis”): dominated by the harvest_kg loading, orthogonal to the weather axis. PC2 separates high-volume crop-months (Bell Pepper Jun, Habanero Jul–Aug) from low-volume tail months. Crop species drives position on this axis more than weather.
Decision implication — Farm investment in disease management (fungicides, drainage) primarily affects the PC1 dimension; investment in agronomic practices (variety selection, nutrition) primarily affects PC2. These are separable levers and should have separate budget lines.
11 Step 9: Time-series analysis and forecasting
We aggregate monthly harvest to total farm output and model the 12-month trajectory, then forecast the first three months of the next planning period.
Business observation — The STL decomposition reveals a strong single-peak seasonal component centred on June–July, with the trend component showing modest decline after the mid-year peak — consistent with cycle tail-off before the November replanting window. The ETS forecast for January–March 2027 projects carry-over output in the 5,000–8,000 kg total range, similar to the Jan–Feb 2026 actuals (cycle tail), suggesting the next planning cycle is on track if seedling distribution timelines are maintained.
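As a rough illustration of a three-month-ahead forecast on a 12-point series, the sketch below fits simple exponential smoothing with base R's HoltWinters. This is a deliberately simpler stand-in, not the ETS/ARIMA models used in the analysis, and the monthly totals are hypothetical apart from the first two, which are rounded sums of the Jan–Feb 2026 figures in Section 3.2.

```r
# Hypothetical total monthly harvest (kg) for Jan-Dec 2026; illustrative values only
total_kg <- c(15700, 11100, 9000, 12000, 22000, 48000,
              42000, 36000, 18000, 15000, 10000, 8000)

harvest_ts <- ts(total_kg, start = c(2026, 1), frequency = 12)

# Simple exponential smoothing: level only, no trend or seasonal term,
# because a single year cannot separate a 12-month seasonal pattern from trend
ses_fit <- HoltWinters(harvest_ts, beta = FALSE, gamma = FALSE)

# Three-step-ahead forecast (Jan-Mar 2027); SES forecasts are flat
fc <- predict(ses_fit, n.ahead = 3)
fc
```

A level-only model returns the same value for all three months, which makes the limits of forecasting from one season of data visible.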
12 Step 10: Revenue optimisation — linear programme
Given a fixed greenhouse production capacity, this section determines the profit-maximising crop mix under five price scenarios — from baseline to full market stress. The LP uses real yield coefficients derived from the seedling distribution data and imposes agronomic constraints (minimum diversification, maximum concentration, and a demand ceiling per crop).
12.1 Model formulation
\[\text{Maximise } Z = m_{BP} x_{BP} + m_{HAB} x_{HAB} + m_{TOM} x_{TOM}\]
Where \(x_i\) is kilograms produced of crop \(i\) and \(m_i = \text{price}_i - \text{cost}_i\) is the net margin per kilogram.
Constraints:
| Constraint | Expression | Basis |
|---|---|---|
| Greenhouse space | \(0.796 x_{BP} + 0.453 x_{HAB} + 0.902 x_{TOM} \leq 170{,}769\) | Seed-count-derived space coefficients |
| Labour budget (+10% flex) | \(0.30 x_{BP} + 0.20 x_{HAB} + 0.15 x_{TOM} \leq 65{,}186\) | Relative grading labour intensity |
| Minimum diversification | \(x_i \geq 10\%\) of capacity | Risk management floor |
| Maximum concentration | \(x_{BP}, x_{HAB} \leq 65\%\) of capacity; \(x_{TOM} \leq 30\%\) of capacity | Demand & cycle ceiling |
| Total output ceiling | \(\sum x_i \leq 300{,}000 \text{ kg}\) | Packhouse throughput limit |
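The formulation above maps directly onto lpSolve, which appears in the session information. The sketch below is a hedged illustration, not the original model: the unit costs and the capacity base for the percentage constraints are hypothetical placeholders, because neither is stated in the text, so the solution will not reproduce the scenario table.

```r
library(lpSolve)

# Prices from the baseline scenario; unit costs are HYPOTHETICAL placeholders
price  <- c(BP = 3000, HAB = 2500, TOM = 1500)
cost   <- c(BP = 900,  HAB = 700,  TOM = 600)    # assumed, not from the source
margin <- price - cost

capacity_kg <- 250000  # ASSUMED base for the percentage floors and ceilings

# Rows: space, labour, total output, three floors, three ceilings
const_mat <- rbind(
  space  = c(0.796, 0.453, 0.902),
  labour = c(0.30,  0.20,  0.15),
  total  = c(1, 1, 1),
  diag(3),   # minimum diversification
  diag(3)    # maximum concentration
)
const_dir <- c("<=", "<=", "<=", rep(">=", 3), rep("<=", 3))
const_rhs <- c(170769, 65186, 300000,
               rep(0.10 * capacity_kg, 3),
               c(0.65, 0.65, 0.30) * capacity_kg)

fit <- lp("max", margin, const_mat, const_dir, const_rhs)

fit$status                          # 0 means an optimal solution was found
setNames(round(fit$solution), names(price))
fit$objval / 1e6                    # profit in million naira under the assumed costs
```

Re-running the same call with a different price vector is all that a scenario analysis requires; the constraint matrix stays fixed.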
12.2 Scenario results
| Scenario | P_BP | P_HAB | P_TOM | Opt BP (kg) | Opt HAB (kg) | Opt TOM (kg) | % BP | % HAB | % TOM | Total (kg) | Profit (₦M) | Status |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| S1: Baseline | 3000 | 2500 | 1500 | 103,412 | 161,493 | 12,423 | 37.3 | 58.2 | 4.5 | 277,328 | 519.0 | Optimal |
| S2: BP glut (BP −33%) | 2000 | 2500 | 1500 | 99,362 | 161,493 | 20,521 | 35.3 | 57.4 | 7.3 | 281,377 | 418.5 | Optimal |
| S3: Habanero shock (HAB −40%) | 3000 | 1500 | 1500 | 161,493 | 68,389 | 12,423 | 66.6 | 28.2 | 5.1 | 242,305 | 405.0 | Optimal |
| S4: Tomato surge (TOM +47%) | 3000 | 2500 | 2200 | 99,362 | 161,493 | 20,521 | 35.3 | 57.4 | 7.3 | 281,377 | 532.2 | Optimal |
| S5: Full market stress | 2000 | 1500 | 1200 | 99,362 | 161,493 | 20,521 | 35.3 | 57.4 | 7.3 | 281,377 | 250.8 | Optimal |
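The reported allocations can be checked against the space, labour and throughput limits from the constraint table. The base-R sketch below computes resource usage and slack for each scenario, using only figures from the two tables in this section.

```r
# Optimal allocations (kg) copied from the scenario table: BP, HAB, TOM
alloc <- rbind(
  S1 = c(103412, 161493, 12423),
  S2 = c( 99362, 161493, 20521),
  S3 = c(161493,  68389, 12423),
  S4 = c( 99362, 161493, 20521),
  S5 = c( 99362, 161493, 20521)
)

# Resource coefficients and limits from the constraint table
space_coef  <- c(0.796, 0.453, 0.902); space_cap  <- 170769
labour_coef <- c(0.30,  0.20,  0.15);  labour_cap <- 65186

usage <- data.frame(
  space_used  = as.numeric(alloc %*% space_coef),
  labour_used = as.numeric(alloc %*% labour_coef),
  total_kg    = rowSums(alloc),
  row.names   = rownames(alloc)
)
usage$space_slack  <- space_cap  - usage$space_used
usage$labour_slack <- labour_cap - usage$labour_used

round(usage)
```

All five allocations are feasible. Labour slack is under one unit in every scenario except S3, and greenhouse space is within about 35 units of its limit in S2 to S5, so these two resources, not the 300,000 kg throughput ceiling, shape the optimal mix.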
Business interpretation — LP results
The LP reveals four actionable insights for Eupepsia’s crop planning committee:
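Habanero is the structural hedge (S2) — When Bell Pepper prices fall by 33% (market glut, a recurring pattern in Lagos pepper markets), the LP keeps Habanero at the same 161,493 kg it is allocated at baseline, trims Bell Pepper from 103,412 kg to 99,362 kg, and raises Tomato from 12,423 kg to 20,521 kg. Habanero, which has a higher net margin per greenhouse unit, therefore anchors the plan while the adjustment happens at the margin between Bell Pepper and Tomato. Management should maintain Habanero seed stock as a demand-responsive buffer.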
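Tomato becomes viable at ₦2,200/kg (S4) — In S4 the Tomato allocation rises from 12,423 kg at baseline to 20,521 kg (7.3% of output). With Bell Pepper and Habanero at baseline prices, the dual-cycle Tomato programme (May–July + Sep–Nov) only moves off its minimum diversification floor once market prices sustain above ~₦2,000/kg. The threshold is relative, not absolute: S2 and S5 return the same 20,521 kg at Tomato prices of ₦1,500/kg and ₦1,200/kg because Bell Pepper is priced at ₦2,000/kg in those scenarios. This aligns with the current low Tomato share in the 2026 forecast.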
Full market stress (S5) compresses profit to ₦250.8M — but the optimal response is not to exit; it is to maximise volume (fill the greenhouse) while holding minimum crop diversity. This is the agronomic equivalent of a defensive portfolio in equities.
Grade quality amplifies LP profit — The LP uses blended prices anchored to the forecast CSV’s realized/wholesale assumption of ₦3,000/kg for Bell Pepper. Within that model, improving Grade A yield from 3.5% to 15% raises the effective blended price to ~₦3,250/kg, shifting S1 baseline profit upward by roughly ₦25M. Note, however, that the retail market price for Bell Pepper is ~₦7,000/kg — substantially above the LP’s ₦3,000/kg assumption. If that gap reflects wholesale-vs-retail pricing or packhouse deductions, the true revenue uplift from grade improvement could be far larger than the modelled ₦25M, and warrants separate investigation.
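The ₦25M figure in the last insight can be reproduced as a first-order calculation that holds the S1 allocation and unit costs fixed. In the sketch below the fixed allocation is a simplifying assumption, since a re-solved LP could shift the crop mix, and the variable names are illustrative.

```r
# Figures quoted in the text and scenario table
price_base_ngn    <- 3000     # blended Bell Pepper price in the LP (per kg)
price_blended_ngn <- 3250     # blended price at a 15% Grade A share (per kg)
bp_kg_s1          <- 103412   # S1 optimal Bell Pepper volume (kg)

# First-order profit uplift, holding the S1 allocation and unit costs fixed
uplift_ngn_m <- (price_blended_ngn - price_base_ngn) * bp_kg_s1 / 1e6
uplift_ngn_m   # about 25.9 (million naira)

# Implied Grade A premium over the blend for an 11.5-point share gain
(price_blended_ngn - price_base_ngn) / (0.15 - 0.035)
```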
13 Step 11: Business interpretation summary
The table below consolidates the decision-relevant insights from all analytical techniques, including the Pack House Register cross-validation and LP optimisation added in this enhanced version.
| Technique | Finding | Decision |
|---|---|---|
| Pack House Register | Bell Pepper rejection rate is 28%; only 3.5% reaches Grade A (data captured at penultimate growth stage — smaller fruit is expected). Habanero achieves 40% Grade A with 1.5% rejection | Re-grade Bell Pepper at mid-cycle to establish a representative quality baseline; Grade A BP fetches ~₦7,000/kg at market — quantify the mid-cycle grade mix before redesigning handling protocols |
| Feature Engineering | June–August accounts for ~62% of total annual harvest, making it the make-or-break window | Lock in packhouse capacity and cold-chain contracts by March for the June–August surge |
| Environmental Simulation | In the simulated data, the rainfall → humidity cascade makes wet-season biotic stress (pest, disease) the chief agronomic risk | Deploy preventive fungicide and IPM schedules from May onward, not reactively in August |
| Logistic Regression | Temporal features (season, quarter) are sufficient to classify yield tier; crop identity dominates the signal | Use season as a simple operational trigger for resource reallocation (labour, inputs, transport) |
| Random Forest | Random Forest improves over logistic regression by capturing non-linear crop × season interactions | Embed RF model in the KoboToolbox pipeline to flag Low-yield risk at seedling distribution stage |
| Feature Importance | Crop species, season, and disease incidence are the top predictors of whether output falls in the Low tertile | Crop selection and disease management are the two highest-ROI levers for yield improvement |
| K-means Clustering | The fitted solution separates two production regimes: peak-production rainy-season (15 crop-months) and dry-season low-yield (21 crop-months) | Allocate agrochemical and irrigation budgets by cluster, not uniformly across the farm network |
| PCA | PC1 separates wet-biotic-stress months from dry months; PC2 separates high-volume from low-volume crops | Separate investment in drainage/disease control (PC1) from variety/nutrition investment (PC2) |
| Time-Series (ETS) | 2026 seasonal peak is well-defined; Jan–Mar 2027 forecast aligns with prior cycle tail, confirming planning assumptions | Order Habanero seeds for cycle 2 by August to hit the Sep–Nov replanting window on schedule |
| LP Optimisation | Habanero is the structural hedge in a BP price crash; Tomato only merits expansion if prices exceed ₦2,000/kg | Maintain Habanero seed stock as a market-responsive buffer; review crop mix quarterly against spot prices |
14 Step 12: Limitations and further work
Sample size — 36 observations (3 crops × 12 months) is insufficient for robust ML generalisation. All model accuracy figures are indicative. Results become more reliable when multi-year or site-level (greenhouse cluster) data are incorporated.
Simulated environment — The five environmental covariates were generated stochastically. The Pack House Register’s temperature/humidity columns were inspected but found to contain placeholder values (coded as 1) across the Jan–Feb 2026 window, making them unusable as sensor replacements. Deploying actual IoT greenhouse sensors and linking to NIMET rain gauge data remains an open requirement.
Revenue assumptions — Prices (₦3,000 Bell Pepper / ₦2,500 Habanero / ₦1,500 Tomato per kg) are held constant and sourced from the bank-reconciled forecast CSV. The Bell Pepper price assumption of ₦3,000/kg diverges materially from the ~₦7,000/kg average retail market price. This gap likely reflects a combination of packhouse deductions, wholesale/off-taker contract pricing, and grade-mix discounting — but the precise source of the difference should be reconciled with Eupepsia’s accounts. Until resolved, LP profit figures for Bell Pepper should be treated as conservative lower bounds. More broadly, Nigerian commodity prices exhibit significant seasonal volatility; a price-adjusted revenue model would improve operational planning accuracy.
Single-season forecast — The ETS model is fit on 12 data points with a 12-period seasonal cycle. The confidence intervals on the 3-step-ahead forecast are therefore wide and should be treated as directional, not precise.
Grading register: single late-cycle snapshot — The Pack House Register quality data (Bell Pepper Grade A = 3.5%, rejection = 28%) covers only the Jan–Feb 2026 window, which falls in the penultimate-to-ultimate month of the Bell Pepper growth cycle — a stage characterised by smaller fruit size and naturally degraded external quality. This single snapshot cannot serve as a representative baseline for annual quality performance. A full quality analysis would require grading records spanning at least mid-cycle (peak-fruit stage) and early-cycle, captured systematically across multiple crop batches and greenhouse clusters.
- Sensor integration — Install IoT temperature/humidity loggers in greenhouse clusters and link data to KoboToolbox submissions; this would make the environmental simulation in Step 4 fully data-driven.
- Model deployment — Embed the Random Forest classifier in the KoboToolbox harvest estimation form as an early-warning flag for predicted Low-yield months.
- Grade quality baseline — The Pack House Register’s 28% Bell Pepper rejection rate was recorded at the penultimate-to-ultimate growth month, when smaller fruit and lower external quality are agronomically normal. This figure should not trigger a post-harvest handling audit in isolation. Eupepsia should re-grade at mid-cycle to capture a representative quality distribution; given that Grade A Bell Pepper fetches ~₦7,000/kg at market, even a modest improvement in the mid-cycle Grade A share would generate meaningful revenue upside.
- Dynamic LP — Extend the LP model with a monthly price index sourced from Lagos commodity markets to enable real-time crop mix recommendations at the seedling distribution stage.
15 Session information
R version 4.5.2 (2025-10-31)
Platform: aarch64-apple-darwin20
Running under: macOS Sequoia 15.5
Matrix products: default
BLAS: /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.1
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: Africa/Lagos
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] scales_1.4.0 kableExtra_1.4.0 knitr_1.51
[4] lpSolve_5.6.23 e1071_1.7-17 forecast_9.0.2
[7] ggfortify_0.4.19 cluster_2.1.8.2 pROC_1.19.0.1
[10] nnet_7.3-20 randomForest_4.7-1.2 readxl_1.4.5
[13] lubridate_1.9.5 forcats_1.0.1 stringr_1.6.0
[16] dplyr_1.2.0 purrr_1.2.1 readr_2.2.0
[19] tidyr_1.3.2 tibble_3.3.1 ggplot2_4.0.3
[22] tidyverse_2.0.0
loaded via a namespace (and not attached):
[1] gtable_0.3.6 xfun_0.56 htmlwidgets_1.6.4 ggrepel_0.9.8
[5] lattice_0.22-7 tzdb_0.5.0 vctrs_0.7.1 tools_4.5.2
[9] generics_0.1.4 parallel_4.5.2 proxy_0.4-29 pkgconfig_2.0.3
[13] Matrix_1.7-4 RColorBrewer_1.1-3 S7_0.2.1 lifecycle_1.0.5
[17] compiler_4.5.2 farver_2.1.2 textshaping_1.0.5 htmltools_0.5.9
[21] class_7.3-23 yaml_2.3.12 crayon_1.5.3 pillar_1.11.1
[25] nlme_3.1-168 fracdiff_1.5-3 tidyselect_1.2.1 digest_0.6.39
[29] stringi_1.8.7 splines_4.5.2 labeling_0.4.3 fastmap_1.2.0
[33] grid_4.5.2 colorspace_2.1-2 cli_3.6.5 magrittr_2.0.4
[37] withr_3.0.2 bit64_4.6.0-1 timechange_0.4.0 rmarkdown_2.30
[41] bit_4.6.0 otel_0.2.0 timeDate_4052.112 gridExtra_2.3
[45] cellranger_1.1.0 zoo_1.8-15 hms_1.1.4 urca_1.3-4
[49] evaluate_1.0.5 viridisLite_0.4.3 mgcv_1.9-3 rlang_1.1.7
[53] Rcpp_1.1.1 glue_1.8.0 xml2_1.5.2 vroom_1.7.1
[57] svglite_2.2.2 rstudioapi_0.18.0 jsonlite_2.0.0 R6_2.6.1
[61] systemfonts_1.3.2