Solar Energy Generation Performance Analytics: A Site-Level Study of the Solar Mini-grids Operated by PowerGen Renewable Energy
Author: Ikechukwu Enekwe
Published: May 12, 2026
1. Executive Summary
This study analyzes weekly solar energy generation and revenue performance across 16 solar mini-grid sites operated by PowerGen Renewable Energy (“PowerGen”) in Nigeria, covering the period 16 February 2025 to 3 May 2026 — a span of 62 weeks across the full portfolio. As Group Financial Controller (“GFC”), I obtained 907 weekly site-level observations from the Operations and Maintenance team’s tracker. The 16 sites are organised into four batches (Batch 1 RMG, Batch 2 RMG, Batch 3 RMG, and IMG) and the dataset encompasses 15 variables including raw operational metrics, derived financial variables, and engineered analytical variables.
The analysis applies five complementary techniques — Exploratory Data Analysis, Data Visualization, Hypothesis Testing, Correlation Analysis, and Linear Regression — to answer a central financial question: what drives revenue generation performance across PowerGen’s Nigerian portfolio, and are observed performance gaps between site batches statistically significant?
Key findings are as follows. Toto (IMG batch) is a structural outlier producing more than three times the average weekly solar output of Batch 1 RMG sites, reflecting its status as Nigeria’s first Interconnected Mini-Grid rather than a standard isolated rural installation. Rainy season reduces solar generation by approximately 15–20% compared to dry season — a difference confirmed as statistically significant (p < 0.001). Grid availability is the strongest controllable predictor of revenue, and sites with high generator dependency consistently underperform financially. The performance ratio analysis — comparing actual solar output to the GSA theoretical benchmark — reveals meaningful variation across sites that raw production figures alone cannot capture, separating operational underperformance from structural capacity differences. Two data quality anomalies at Maggi Bukun require remediation in PowerGen’s source reporting systems.
The integrated recommendation is that PowerGen’s operations team prioritise grid availability improvements and reduce generator dependency at Batch 1 RMG sites. These sites have the most inconsistent uptime across the portfolio, meaning they are frequently offline or running on a backup generator when they should be producing solar energy. The regression analysis in this study estimates how much additional revenue each site stands to gain per percentage point of improved grid uptime, giving management a clear financial case for directing maintenance resources to these sites first.
2. Professional Disclosure
Job Title: Group Financial Controller
Organization: PowerGen Renewable Energy
Sector: Utilities (Renewable Energy)
PowerGen Renewable Energy designs, builds, and operates solar installations. The Company operates two business models: Grids and Commercial & Industrial. This analysis focuses on the Grids business because it provides sufficiently rich data for analytical purposes. As Group Financial Controller, my responsibilities include financial reporting, revenue assurance, operational performance monitoring, and budget management across a portfolio of active solar installations. The five analytical techniques chosen for this study are directly relevant to my day-to-day work:
Technique 1 — Exploratory Data Analysis:
During the financial close process, I review site generation data from the Operations and Maintenance team’s tracker for completeness, plausibility, and anomalies. I look for missing values, outliers, and distributional patterns that could distort reported revenue figures or mask underperforming assets. In some instances, I have identified negative revenue or negative consumption values which, if left unresolved, would distort the financial results presented to the Executives.
Technique 2 — Data Visualization:
Monthly and quarterly performance reports submitted to the Executives and investors rely on visual communication of generation and revenue trends. Choosing the right chart type determines whether non-technical stakeholders can act on the data. This technique directly supports my reporting function. Investors are usually keen to assess the performance of the power grids as a prerequisite for approving further investment.
Technique 3 — Hypothesis Testing:
A recurring management question is whether performance differences observed between site batches reflect genuine operational differences or merely random variation. Formal hypothesis testing gives me a statistically defensible answer to replace what would otherwise be subjective judgement in performance reviews and board presentations.
Technique 4 — Correlation Analysis:
Revenue assurance requires understanding which variables are leading indicators of financial outcomes. For instance, from a basic understanding of the energy business, revenue is driven mainly by two variables: tariff and consumption. However, other variables, such as solar PV yield and technical losses, may also be associated with the performance of a mini-grid site. Taking all of these into consideration when reporting on performance helps provide insights to management.
These insights also enable management to decide on resource allocation to each site and identify early indications of energy theft.
Technique 5 — Linear Regression:
PowerGen’s annual budget includes revenue projections by site. Regression provides a data-driven basis for those projections, quantifying how much revenue is expected per kWh of solar generation, per percentage point of availability, and per season — replacing assumption-driven estimates with empirically fitted parameters. Seasonality plays an important role in energy yield and site-level performance.
3. Data Collection & Sampling
Source: Operations and Maintenance tracker of PowerGen Renewable Energy
Collection Method: Direct extraction from the NGBU Grid Scorecard workbook in my capacity as Group Financial Controller. Reshaped from wide to tidy (long) format and enriched with derived variables using Python prior to analysis in R.
Sampling Frame: All 16 active mini-grid solar sites in PowerGen’s portfolio as at May 2026, organized across four batches: Batch 1 RMG (7 sites), Batch 2 RMG (6 sites), Batch 3 RMG (2 sites), and IMG/Toto (1 site).
Sample Size: The total number of observations used for the analysis is 907.
Time Period: 16 February 2025 to 3 May 2026 (62 weeks across the full portfolio; individual site ranges vary. Maggi Igenchi is the earliest site with data from 16 February 2025, while most sites commenced recording from 9 March 2025; Ofosu and Owode began in June 2025)
Ethical Notes: The data used has been accessed in my professional capacity with management authorization. Site names are operational identifiers; no anonymization required.
Data Sharing: The data published is being used for academic purposes only. This is consistent with PowerGen’s data governance policy.
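The wide-to-tidy reshaping described under Collection Method was done in Python; the sketch below shows the same idea in base R for readers following the R code in this report. The column layout (one column per week ending), the site values, and the object names `wide` and `long` are assumptions for illustration, not the actual structure of the NGBU Grid Scorecard workbook.

```r
# Hypothetical wide layout: one row per site, one column per week ending.
wide <- data.frame(
  site = c("Gbara", "Toto"),
  `2025-03-09` = c(1200, 4100),
  `2025-03-16` = c(1150, 4300),
  check.names = FALSE
)

week_cols <- setdiff(names(wide), "site")

# Reshape to tidy (long) format: one row per site-week observation.
long <- reshape(
  wide,
  direction = "long",
  varying   = week_cols,
  v.names   = "solar_production_kwh",
  timevar   = "week_ending",
  times     = week_cols,
  idvar     = "site"
)

long$week_ending <- as.Date(long$week_ending)
long <- long[order(long$site, long$week_ending), ]
rownames(long) <- NULL
print(long)
```

Each site-week becomes its own row, which is the shape that the grouped summaries and charts later in this report expect.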
```r
pg |>
  count(batch, site, name = "n_weeks") |>
  arrange(batch, desc(n_weeks)) |>
  kable(caption = "Observations per Site") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
```
Observations per Site

| batch | site | n_weeks |
| --- | --- | --- |
| Batch 1 RMG | Maggi Igenchi | 62 |
| Batch 1 RMG | Gbara | 61 |
| Batch 1 RMG | Maggi Bukun | 61 |
| Batch 1 RMG | Nantu | 61 |
| Batch 1 RMG | Ndejiko | 61 |
| Batch 1 RMG | Rokota | 60 |
| Batch 1 RMG | Kpanbo | 55 |
| Batch 2 RMG | Dukugi | 61 |
| Batch 2 RMG | Ebangi | 61 |
| Batch 2 RMG | Gbade | 61 |
| Batch 2 RMG | Sachi Nku | 61 |
| Batch 2 RMG | Sosa | 61 |
| Batch 2 RMG | Danchitagi | 58 |
| Batch 3 RMG | Ofosu | 34 |
| Batch 3 RMG | Owode | 28 |
| IMG | Toto | 61 |
5. Exploratory Data Analysis (EDA)
Theory: Exploratory Data Analysis (EDA) is the systematic examination of a dataset’s structure, distributions, and anomalies before formal modelling (Adi, 2026, Ch. 9). The objective is to understand what the data contains, where it is incomplete, and where it may violate data analysis assumptions. Anscombe’s Quartet (1973) illustrates why summary statistics alone are insufficient — datasets with identical means and variances can exhibit fundamentally different patterns. EDA guards against this by combining numerical summaries with visual inspection.
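The Anscombe point can be checked directly, because the quartet ships with base R as the `anscombe` data frame. The short sketch below is an illustration added for this theory paragraph rather than part of the PowerGen analysis; it computes the mean and variance of each y-variable and its correlation with the matching x-variable.

```r
# Anscombe's Quartet is available in base R (datasets package).
data(anscombe)

y_cols <- c("y1", "y2", "y3", "y4")

# Summary statistics for the four y-variables.
y_means <- sapply(anscombe[, y_cols], mean)
y_vars  <- sapply(anscombe[, y_cols], var)

# Pearson correlation of each (x_i, y_i) pair.
xy_cors <- sapply(1:4, function(i) {
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]])
})

print(round(y_means, 2))
print(round(y_vars, 2))
print(round(xy_cors, 2))
```

The four sets agree to two decimal places on all three statistics, yet a scatter plot of each pair shows clearly different shapes, which is why the numerical summaries in this section are paired with charts.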
Business Justification: As the GFC, I rely on the O&M trackers to build my site performance reports. Undetected outliers or data anomalies directly affect reporting.
```r
p1 <- ggplot(pg, aes(x = solar_production_kwh)) +
  geom_histogram(bins = 40, fill = "#2E86AB", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = comma) +
  labs(title = "Solar Production (kWh)", x = "kWh", y = "Count") +
  theme_minimal()

p2 <- ggplot(pg, aes(x = availability_pct)) +
  geom_histogram(bins = 30, fill = "#A23B72", colour = "white", alpha = 0.85) +
  labs(title = "Grid Availability (%)", x = "Availability", y = "Count") +
  theme_minimal()

p3 <- ggplot(pg, aes(x = batch, y = solar_production_kwh, fill = batch)) +
  geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 1.5) +
  scale_y_continuous(labels = comma) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Solar Production by Batch", x = "Batch", y = "kWh", fill = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

p4 <- ggplot(pg, aes(x = performance_tier, y = solar_production_kwh, fill = performance_tier)) +
  geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 1.5) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("Low" = "#E74C3C", "Medium" = "#F39C12", "High" = "#27AE60")) +
  labs(title = "Solar Production by Performance Tier", x = "Tier", y = "kWh", fill = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

(p1 + p2) / (p3 + p4)
```

```r
# Performance ratio: actual solar output vs GSA theoretical benchmark
pr_summary <- pg |>
  filter(!is.na(performance_ratio)) |>
  group_by(site, batch) |>
  summarise(
    avg_ratio = mean(performance_ratio, na.rm = TRUE),
    avg_specific_yield = mean(specific_yield_kwp, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_ratio))

# Chart: Performance ratio per site (reference line at 1.0 = meeting benchmark)
p_ratio <- ggplot(pr_summary,
                  aes(x = fct_reorder(site, avg_ratio), y = avg_ratio, fill = batch)) +
  geom_col(alpha = 0.85) +
  geom_hline(yintercept = 1.0, linetype = "dashed", colour = "red", linewidth = 0.8) +
  annotate("text", x = 2, y = 1.05, label = "Benchmark (1.0)", colour = "red", size = 3) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip() +
  labs(
    title = "Average Performance Ratio by Site (Solar Output ÷ GSA Benchmark)",
    subtitle = "Red dashed line = theoretical benchmark. Above = outperforming; Below = underperforming",
    x = NULL, y = "Performance Ratio", fill = "Batch"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

# Chart: Specific yield per site
p_yield <- ggplot(pr_summary,
                  aes(x = fct_reorder(site, avg_specific_yield),
                      y = avg_specific_yield, fill = batch)) +
  geom_col(alpha = 0.85) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip() +
  labs(
    title = "Average Specific Yield by Site (kWh per kWp installed)",
    subtitle = "Normalises production by installed capacity — makes sites directly comparable",
    x = NULL, y = "Specific Yield (kWh/kWp)", fill = "Batch"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

p_ratio / p_yield
```
Performance ratio and specific yield distributions by site
```r
# Energy mix: proportion of energy from solar vs generator per site
pg |>
  filter(!is.na(solar_production_kwh), !is.na(generator_production_kwh)) |>
  group_by(site, batch) |>
  summarise(
    avg_solar = mean(solar_production_kwh, na.rm = TRUE),
    avg_generator = mean(generator_production_kwh, na.rm = TRUE),
    .groups = "drop"
  ) |>
  pivot_longer(cols = c(avg_solar, avg_generator),
               names_to = "source", values_to = "avg_kwh") |>
  mutate(source = case_when(
    source == "avg_solar" ~ "Solar",
    source == "avg_generator" ~ "Generator",
    TRUE ~ source
  )) |>
  ggplot(aes(x = fct_reorder(site, avg_kwh), y = avg_kwh, fill = source)) +
  geom_col(position = "stack", alpha = 0.85) +
  scale_fill_manual(values = c("Solar" = "#F4A261", "Generator" = "#457B9D")) +
  scale_y_continuous(labels = comma) +
  coord_flip() +
  labs(
    title = "Average Weekly Energy Mix by Site — Solar vs Generator (kWh)",
    subtitle = "Sites with high generator share face higher fuel costs and operational risk",
    x = NULL, y = "Average Weekly kWh", fill = "Energy Source"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
```
Solar vs generator energy mix by site
Data Quality Issues Identified
| Issue | Site | Detail | Action Taken |
| --- | --- | --- | --- |
| Negative technical losses (-6.59%) | Maggi Bukun | Technical losses relate to the energy loss suffered between the time energy is generated and when it is distributed to customers. A negative loss is practically impossible. | Flagged in analysis; recommend source system correction |
| Extreme consumption outlier | Maggi Bukun | Week of 8 Feb 2026 contains a customer consumption of 227,125 kWh for the site. This was most likely a cumulative figure entered erroneously. | Row excluded from all revenue-based analyses |
Plain-Language Interpretation for Management: The distribution of solar production is right-skewed, meaning a small number of sites, primarily Toto, generate significantly more energy than all other sites. Toto is the first Interconnected Mini-Grid in Nigeria and serves a peri-urban community about one hour from Abuja. The community comprises individual homes, small-scale businesses and public utility centres.
Most sites register consumption levels of 500 to 2,000 kWh per week on average. Grid availability is high across most sites (greater than 90%) but drops noticeably for Toto and Owode. Given the financial significance of these two sites, their reduced grid availability weighs on both site-level performance and the results of the overall business.
The performance ratio analysis reveals which sites are meeting their theoretical solar potential (GSA benchmark). Sites with a ratio above 1.0 are exceeding their benchmark — a sign of well-maintained panels and favourable local conditions. Sites consistently below 1.0 warrant investigation: the shortfall could reflect panel degradation, shading, inverter inefficiency, or data recording issues. This metric separates operational underperformance from structural capacity differences, giving management a fairer basis for site comparison than raw kWh output alone.
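To make the two normalised metrics concrete, the sketch below computes them for two made-up sites. The column names `performance_ratio`, `specific_yield_kwp` and `solar_production_kwh` follow the analysis code in this report; `gsa_benchmark_kwh`, `installed_kwp`, the site labels and every number are assumptions for illustration only.

```r
# Hypothetical weekly figures for two sites; all values are illustrative.
sites <- data.frame(
  site = c("Site A", "Site B"),
  solar_production_kwh = c(1500, 900),  # actual weekly solar output
  gsa_benchmark_kwh    = c(1400, 1000), # assumed theoretical weekly output
  installed_kwp        = c(100, 60)     # assumed installed PV capacity
)

# Performance ratio: actual output divided by the theoretical benchmark.
sites$performance_ratio <- sites$solar_production_kwh / sites$gsa_benchmark_kwh

# Specific yield: output per kWp installed, comparable across site sizes.
sites$specific_yield_kwp <- sites$solar_production_kwh / sites$installed_kwp

# A ratio of at least 1.0 means the site met its benchmark that week.
sites$meets_benchmark <- sites$performance_ratio >= 1.0

print(sites)
```

In this toy case both sites deliver the same specific yield (15 kWh/kWp), yet only Site A meets its benchmark, which illustrates why the two metrics are reported side by side.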
The energy mix analysis shows how much of each site’s weekly output came from solar panels versus the backup diesel generator. Sites with high generator dependency face elevated fuel costs that compress margins. From a GFC perspective, a rising generator share at any site is an early warning signal — it often precedes a revenue shortfall and should trigger an operations review before it appears in the financial statements.
In addition, two data recording errors were identified at Maggi Bukun. First, the system recorded negative technical losses in at least one period, which is physically impossible and indicates a meter reading or data entry problem. Second, one week in February 2026 shows customer consumption roughly 300 times higher than normal, almost certainly a data entry error. Both should be corrected in the source system. The affected Maggi Bukun row has been excluded from the revenue-based analyses to avoid distorting results.
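Checks of this kind can be automated so that similar errors are caught at each financial close. The following is a minimal sketch on made-up data: `customer_metered_consumption_kwh` is a column used elsewhere in this report, while `technical_loss_pct`, the site labels, and the 10x-median threshold are assumptions. The two anomalous values mirror the ones reported for Maggi Bukun.

```r
# Hypothetical weekly records; values are illustrative only.
# `technical_loss_pct` is an assumed column name.
obs <- data.frame(
  site = rep(c("Site A", "Site B"), each = 5),
  customer_metered_consumption_kwh = c(700, 760, 720, 227125, 740,
                                       1500, 1450, 1520, 1480, 1510),
  technical_loss_pct = c(0.08, 0.07, -0.0659, 0.09, 0.08,
                         0.10, 0.11, 0.09, 0.10, 0.12)
)

# Rule 1: technical losses cannot be negative.
obs$flag_negative_loss <- obs$technical_loss_pct < 0

# Rule 2: consumption far above the site's own median.
# The 10x multiplier is an assumed threshold, not a PowerGen policy.
site_median <- ave(obs$customer_metered_consumption_kwh, obs$site, FUN = median)
obs$flag_consumption_spike <- obs$customer_metered_consumption_kwh > 10 * site_median

# Rows that need review before revenue-based analysis.
flagged <- obs[obs$flag_negative_loss | obs$flag_consumption_spike, ]
print(flagged)
```

Comparing each week against the site's own median, rather than a portfolio-wide cut-off, keeps a structurally large site such as Toto from being flagged simply for being large.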
6. Data Visualization
Theory: The Grammar of Graphics (Wilkinson, 2005, cited in Adi, 2026, Ch. 10) holds that every effective chart is built from a systematic mapping of data variables to visual properties — position, colour, size, shape. Chart selection should be governed by the data type and the story being told, not aesthetic preference.
A time series calls for a line chart; distributions call for histograms or ridge plots; categorical comparisons call for bar charts or boxplots; relationships between continuous variables call for scatter plots.
Business Justification: PowerGen’s management and investors require visual performance summaries that communicate site-level and portfolio-level trends. I work closely with the Financial Planning and Analysis Manager to prepare and present these visuals.
```r
pg |>
  group_by(week_ending, batch) |>
  summarise(total_solar = sum(solar_production_kwh, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = week_ending, y = total_solar, colour = batch)) +
  geom_line(linewidth = 0.8, alpha = 0.9) +
  geom_smooth(se = FALSE, linewidth = 0.4, linetype = "dashed", alpha = 0.5) +
  scale_y_continuous(labels = comma) +
  scale_colour_brewer(palette = "Set1") +
  labs(
    title = "Weekly Solar Production by Batch — Feb 2025 to May 2026",
    subtitle = "Dashed lines show smoothed trend",
    x = NULL, y = "Total kWh", colour = "Batch"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
```
Weekly solar production over time by batch
```r
pg |>
  filter(!is.na(availability_pct)) |>
  mutate(month = floor_date(week_ending, "month")) |>
  group_by(site, month) |>
  summarise(avg_avail = mean(availability_pct, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = month, y = fct_reorder(site, avg_avail), fill = avg_avail)) +
  geom_tile(colour = "white", linewidth = 0.3) +
  scale_fill_gradient2(low = "#E74C3C", mid = "#F39C12", high = "#27AE60",
                       midpoint = 0.75, labels = percent) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  labs(
    title = "Monthly Average Grid Availability by Site",
    subtitle = "Green = high availability, Red = low availability",
    x = NULL, y = NULL, fill = "Availability"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
Site-level availability heatmap
```r
pg_clean |>
  filter(!is.na(revenue_ngn)) |>
  group_by(site, season, batch) |>
  summarise(avg_revenue = mean(revenue_ngn, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = fct_reorder(site, avg_revenue), y = avg_revenue, fill = season)) +
  geom_col(position = "dodge", alpha = 0.85) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("Dry" = "#F4A261", "Rainy" = "#457B9D")) +
  coord_flip() +
  labs(
    title = "Average Weekly Revenue by Site and Season (₦)",
    x = NULL, y = "Average Weekly Revenue (₦)", fill = "Season"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
```
Average weekly revenue by site and season
```r
pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(solar_production_kwh)) |>
  ggplot(aes(x = solar_production_kwh, y = revenue_ngn,
             colour = performance_tier, shape = season)) +
  geom_point(alpha = 0.5, size = 1.8) +
  geom_smooth(method = "lm", se = TRUE, colour = "black", linewidth = 0.8) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  scale_colour_manual(values = c("Low" = "#E74C3C", "Medium" = "#F39C12", "High" = "#27AE60")) +
  labs(
    title = "Solar Production vs Revenue",
    subtitle = "Black line = OLS fit; colour = performance tier",
    x = "Solar Production (kWh)", y = "Revenue (₦)",
    colour = "Tier", shape = "Season"
  ) +
  theme_minimal()
```
Solar production vs revenue
```r
pg |>
  filter(!is.na(solar_production_kwh)) |>
  ggplot(aes(x = solar_production_kwh, y = season, fill = season)) +
  geom_density_ridges(alpha = 0.7, scale = 1.2) +
  scale_x_continuous(labels = comma) +
  scale_fill_manual(values = c("Dry" = "#F4A261", "Rainy" = "#457B9D")) +
  labs(
    title = "Solar Production Distribution: Dry vs Rainy Season",
    x = "Solar Production (kWh)", y = NULL, fill = "Season"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
```
Solar production distribution by season
```r
pg |>
  filter(!is.na(performance_ratio)) |>
  ggplot(aes(x = week_ending, y = performance_ratio, colour = batch)) +
  geom_line(alpha = 0.5, linewidth = 0.5) +
  geom_smooth(se = FALSE, linewidth = 0.9) +
  geom_hline(yintercept = 1.0, linetype = "dashed", colour = "red", linewidth = 0.7) +
  annotate("text", x = as.Date("2025-04-01"), y = 1.08,
           label = "GSA benchmark (1.0)", colour = "red", size = 3) +
  scale_colour_brewer(palette = "Set1") +
  facet_wrap(~ batch, ncol = 2) +
  labs(
    title = "Weekly Performance Ratio Over Time by Batch",
    subtitle = "Above red line = outperforming theoretical benchmark",
    x = NULL, y = "Performance Ratio (Actual ÷ GSA)", colour = "Batch"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
```
Weekly performance ratio vs GSA benchmark over time, by batch
Plain-Language Interpretation for Management: Six charts tell one story. Chart 1 shows that Batch 1 RMG leads portfolio output most weeks, but its production is more volatile than Batch 2 RMG. Batch 2 RMG sites were developed after Batch 1 and incorporated some of the technical improvements suggested following the Batch 1 projects.
Chart 2 reveals that grid availability is close to 100% for most Batch 1 and 2 sites but significantly lower for Toto and Owode. Toto and Owode have reported inconsistent grid availability due to technical issues with the diesel generator and the Battery Energy Storage System.
Chart 3 confirms dry-season revenue consistently exceeds rainy-season revenue across all sites. Chart 4 demonstrates a strong linear relationship between solar production and revenue. Chart 5 confirms the seasonal effect — dry-season production is shifted higher, with less variation than the rainy season.
Chart 6 introduces the performance ratio — actual solar output divided by the GSA theoretical benchmark. Batches that consistently sit above 1.0 are outperforming their theoretical potential, reflecting well-maintained panels and efficient operations. Batches that dip below 1.0 during the rainy season highlight where weather-related underperformance is greatest and where operational interventions could recover the most output.
7. Hypothesis Testing
Theory: Hypothesis testing is the formal framework for deciding whether observed data differences are statistically real or attributable to chance (Adi, 2026, Ch. 11). A null hypothesis (H₀) posits no effect; an alternative hypothesis (H₁) posits a direction of difference. The p-value quantifies the probability of observing data at least as extreme as the sample if H₀ were true. We report effect sizes (Cohen’s d or η²) alongside p-values, as statistical significance alone does not convey practical importance. Two tests are used: the Wilcoxon Rank-Sum test (non-parametric alternative to the t-test, used when distributions are skewed) and one-way ANOVA with Tukey HSD post-hoc comparisons.
Business Justification: PowerGen management regularly compares batch performance in operational reviews. Without formal testing, conclusions risk being driven by selective attention to favourable weeks. The two hypotheses below directly inform resource allocation and tariff review decisions that I support as GFC.
Hypothesis 1 — Seasonal Effect on Solar Production
H₀: Mean weekly solar production is equal in dry and rainy seasons
H₁: Mean weekly solar production is higher in the dry season (one-tailed)
```r
cat(sprintf(" Dry season: W = %.4f, p = %.4f\n", sw_dry$statistic, sw_dry$p.value))
cat(sprintf(" Rainy season: W = %.4f, p = %.4f\n", sw_rainy$statistic, sw_rainy$p.value))
```

```
 Dry season: W = 0.7144, p = 0.0000
 Rainy season: W = 0.7044, p = 0.0000
```

```r
wilcox_result <- wilcox.test(dry_solar, rainy_solar, alternative = "greater")
cat("\nWilcoxon Rank-Sum Test (one-sided: dry > rainy):\n")
print(wilcox_result)
```

```
Wilcoxon Rank-Sum Test (one-sided: dry > rainy):

	Wilcoxon rank sum test with continuity correction

data:  dry_solar and rainy_solar
W = 114335, p-value = 0.0003651
alternative hypothesis: true location shift is greater than 0
```
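The theory paragraph calls for an effect size alongside each p-value, and for a rank-sum test a common choice is the rank-biserial correlation, which can be derived from the W statistic and the two group sizes. The sketch below uses small made-up vectors (`dry_example`, `rainy_example`) rather than the study data, because the group sizes behind the reported W are not shown here; it is an illustration of the calculation, not the author's reported result.

```r
# Illustrative weekly solar production (kWh); not the study data.
dry_example   <- c(1500, 1620, 1710, 1580, 1450, 1690)
rainy_example <- c(1210, 1340, 1280, 1400, 1460, 1190)

res <- wilcox.test(dry_example, rainy_example, alternative = "greater")

# In R, W counts the (dry, rainy) pairs in which the dry value is larger
# (ties would count as one half each).
w  <- unname(res$statistic)
n1 <- length(dry_example)
n2 <- length(rainy_example)

prob_superiority <- w / (n1 * n2)           # share of pairs where dry > rainy
rank_biserial    <- 2 * prob_superiority - 1 # ranges from -1 to +1

print(c(W = w, prob_superiority = prob_superiority, rank_biserial = rank_biserial))
```

A rank-biserial value near 0 means the two seasons overlap almost completely, while values near +1 mean dry-season weeks almost always out-produce rainy-season weeks. Applying the same formula to the study's W would require the number of dry and rainy observations.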
```
             Df    Sum Sq   Mean Sq F value Pr(>F)    
batch         3 9.066e+08 302186345     212 <2e-16 ***
Residuals   903 1.287e+09   1425526                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```r
levene_result <- leveneTest(solar_production_kwh ~ batch,
                            data = pg |> filter(!is.na(solar_production_kwh)))
cat("\nLevene's Test for Homogeneity of Variance:\n")
print(levene_result)
```

```
Levene's Test for Homogeneity of Variance:
Levene's Test for Homogeneity of Variance (center = median)
       Df F value    Pr(>F)    
group   3  123.21 < 2.2e-16 ***
      903                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

```r
cat("Interpretation: >0.14 = large effect (Cohen, 1988)\n")
```

```
Interpretation: >0.14 = large effect (Cohen, 1988)
```
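The η² value behind this interpretation can be recovered from the ANOVA table itself, since η² is the between-group sum of squares divided by the total sum of squares. The sketch below uses only the figures printed in the ANOVA output; the variable names are illustrative.

```r
# Sums of squares and mean squares as printed in the ANOVA output.
ss_batch     <- 9.066e+08
ss_residuals <- 1.287e+09
ms_batch     <- 302186345
ms_residuals <- 1425526

# Eta squared: share of total variance in solar production explained by batch.
eta_squared <- ss_batch / (ss_batch + ss_residuals)

# Cross-check: the F statistic is the ratio of the two mean squares.
f_check <- ms_batch / ms_residuals

print(round(eta_squared, 3))  # about 0.41
print(round(f_check, 1))      # about 212, matching the reported F value
```

An η² of roughly 0.41 means batch membership accounts for about two fifths of the variation in weekly solar production, well above the 0.14 threshold for a large effect.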
Plain-Language Interpretation for Management:
Hypothesis 1: Dry season solar production is significantly higher than rainy season (p < 0.001). The gap is approximately 20% — about 270 kWh per site per week. This is statistically confirmed, not just visually apparent.
Practical implication: In preparing revenue forecasts, ensure that seasonality effects are considered. A revenue adjustment of 15–20% between dry and rainy quarters might be appropriate.
Hypothesis 2: Batch-level differences in solar production are statistically significant and large (p < 0.001; η² ≈ 0.41 when computed from the ANOVA sums of squares, well above the 0.14 threshold for a large effect). Toto (IMG) is the main driver: it produces significantly more than all RMG batches. This is because Toto is an Interconnected Mini-grid and not a regular Isolated Rural Mini-grid.
Practical implication: Performance benchmarking must be done within batch cohorts to ensure that the comparison is like-for-like. For instance, comparing a Batch 1 RMG site to Toto is misleading and unfair — they are structurally different installations.
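Because Levene's test rejects equal variances across batches, one reasonable robustness check is to repeat the batch comparison with methods that do not assume homogeneous variance. The sketch below shows Welch's ANOVA and the Kruskal-Wallis test on synthetic data; the means, standard deviations and group sizes are invented for illustration, and this check is a suggestion rather than something the original analysis reports.

```r
set.seed(42)

# Synthetic stand-in data with unequal variances across four batches.
# All parameters are illustrative, not estimates from the study.
sim <- data.frame(
  batch = factor(rep(c("Batch 1 RMG", "Batch 2 RMG", "Batch 3 RMG", "IMG"),
                     each = 30)),
  solar_production_kwh = c(
    rnorm(30, mean = 1300, sd = 300),
    rnorm(30, mean = 1200, sd = 250),
    rnorm(30, mean = 1000, sd = 200),
    rnorm(30, mean = 4500, sd = 1200)
  )
)

# Welch's ANOVA: compares group means without assuming equal variances.
welch <- oneway.test(solar_production_kwh ~ batch, data = sim, var.equal = FALSE)

# Kruskal-Wallis: rank-based alternative that also tolerates skewed data.
kw <- kruskal.test(solar_production_kwh ~ batch, data = sim)

print(welch)
print(kw)
```

If both tests agree with the classical ANOVA on the real data, the conclusion about batch differences does not hinge on the equal-variance assumption.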
8. Correlation Analysis
Theory: Correlation measures the strength and direction of linear association between two continuous variables (Adi, 2026, Ch. 13). Pearson’s r is appropriate for normally distributed variables; Spearman’s ρ (rho) is preferred when distributions are skewed or ordinal. Values range from -1 (perfect negative) to +1 (perfect positive); 0 indicates no linear relationship. A correlation matrix with heatmap provides a portfolio-level view of inter-variable relationships. Crucially, correlation does not imply causation — a high correlation between two variables may reflect a common underlying driver rather than a direct causal link.
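The difference between the two coefficients is easiest to see on a relationship that is perfectly monotonic but not linear. The toy example below is an illustration only and does not use the study data: Spearman's ρ, which works on ranks, reports a perfect association, while Pearson's r is pulled down by the curvature and skew.

```r
# A monotonic but strongly non-linear, right-skewed relationship.
x <- 1:10
y <- exp(x)

pearson_r    <- cor(x, y, method = "pearson")
spearman_rho <- cor(x, y, method = "spearman")

print(c(pearson = pearson_r, spearman = spearman_rho))
```

Since solar production and revenue are right-skewed in this portfolio, a rank-based coefficient is the safer default for the correlation matrix.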
Business Justification: Understanding which operational metrics impact revenue helps PowerGen’s management to shape strategy and prioritize interventions. For instance, if grid availability has a stronger correlation with revenue than specific yield, maintenance budgets would typically be channeled towards metering and core uptime rather than PV panel efficiency. This is a resource allocation question I face as GFC when preparing the annual capital expenditure budget.
```r
c1 <- pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(customer_metered_consumption_kwh)) |>
  ggplot(aes(x = customer_metered_consumption_kwh, y = revenue_ngn)) +
  geom_point(alpha = 0.4, colour = "#2E86AB", size = 1.5) +
  geom_smooth(method = "lm", colour = "black") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(title = "Consumption vs Revenue", x = "Consumption (kWh)", y = "Revenue (₦)") +
  theme_minimal()

c2 <- pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(availability_pct)) |>
  ggplot(aes(x = availability_pct, y = revenue_ngn)) +
  geom_point(alpha = 0.4, colour = "#A23B72", size = 1.5) +
  geom_smooth(method = "lm", colour = "black") +
  scale_y_continuous(labels = comma) +
  labs(title = "Availability vs Revenue", x = "Grid Availability", y = "Revenue (₦)") +
  theme_minimal()

c1 + c2
```
Top two correlates of revenue
Plain-Language Interpretation for Management: Customer consumption is almost perfectly correlated with revenue (ρ ≈ 0.99) — this is expected since revenue is a function of tariff and consumption. There also exists a strong correlation between grid availability and revenue: sites that are operational for a higher proportion of the week consistently earn more. Solar production also correlates strongly with revenue, but availability is the more controllable variable in the short term — you cannot quickly add solar panels, but you can improve maintenance response times.
Generator production shows a notable negative correlation with revenue. This is an important operational signal: when a site relies heavily on its diesel generator, it is typically because solar generation or battery storage has failed. High generator hours compress margins through fuel costs and often co-occur with lower customer consumption, which is consistent with generator dependency being a leading indicator of financial underperformance and not merely a cost item.
Specific yield is positively correlated with both solar production and revenue. Because specific yield normalises output by installed capacity, this correlation confirms that well-utilised installations — not simply large ones — drive stronger financial results. Two sites with identical panel capacity but different specific yields are performing differently at an operational level, and the one with higher specific yield is generating more revenue per naira of capital deployed.
GSA average correlates moderately with solar production and specific yield, confirming that actual performance tracks theoretical potential across sites. This validates the data: sites in higher-irradiance locations do produce more, as the GSA model predicts.
Technical losses show a weak negative correlation with revenue — higher losses reduce billable energy, but the effect is modest compared to availability.
9. Linear Regression
Theory: Ordinary Least Squares (OLS) regression estimates the linear relationship between a dependent variable and one or more predictors by minimizing the sum of squared residuals (Adi, 2026, Ch. 14). Each coefficient represents the expected change in the outcome for a one-unit increase in the predictor, holding all others constant. Diagnostic plots assess four key assumptions: linearity (Residuals vs Fitted), normality of residuals (Q-Q Plot), homoscedasticity (Scale-Location), and influential observations (Cook’s Distance). A log transformation of the outcome variable is used here to address right skew in revenue data.
Business Justification: PowerGen’s annual budgeting process requires site-level revenue forecasts. Currently these rely on manual assumptions. This regression model provides a data-driven basis for those forecasts, quantifying the marginal revenue contribution of solar production, grid availability, season, and batch — enabling more accurate and defensible projections for management and investors.
```r
reg_data <- pg_clean |>
  filter(!is.na(revenue_ngn),
         !is.na(solar_production_kwh),
         !is.na(availability_pct),
         !is.na(specific_yield_kwp),
         !is.na(performance_ratio),
         !is.na(solar_share_pct)) |>
  mutate(log_revenue = log(revenue_ngn + 1))

# Full model including performance ratio and solar share
model <- lm(log_revenue ~ solar_production_kwh + availability_pct +
              specific_yield_kwp + performance_ratio + solar_share_pct +
              season + batch,
            data = reg_data)

summary(model)
```
Plain-Language Interpretation for Management: The model explains approximately 75–80% of the variation in site-level weekly revenue (R² ≈ 0.75–0.80) — strong enough for operational planning and annual budgeting purposes. The most important finding is that grid availability has a positive relationship with revenue even after controlling for how much energy the panels are generating. This means availability improvements deliver financial returns independently of weather or panel capacity.
The performance ratio coefficient quantifies the financial value of outperforming the GSA benchmark. A site that consistently achieves a performance ratio above 1.0 — meaning it extracts more energy than its theoretical potential — earns measurably more revenue, holding all other factors equal. This gives operations management a financially grounded target: not just generating more kWh in absolute terms, but optimising yield relative to installed capacity.
The solar share coefficient confirms that sites generating a higher proportion of their energy from solar panels (rather than the diesel generator) earn more revenue. This is consistent with the correlation finding — generator dependency is a proxy for operational stress, and the regression quantifies the revenue cost of that dependency.
In practical budgeting terms: a site generating 1,500 kWh/week at 90% availability and high solar share can be expected to earn materially more than the same site at 70% availability running on generator backup. The regression quantifies both gaps simultaneously, giving the operations and finance teams a unified framework for prioritising maintenance expenditure.
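Because the outcome is log revenue, a coefficient is read as an approximate percentage change in revenue rather than a naira amount. The sketch below shows the conversion on synthetic data with a single predictor. It assumes availability is stored as a proportion between 0 and 1 (which the percent-formatted heatmap scale suggests), and the true slope of 1.5 is invented for the illustration; it is not the fitted coefficient from the study's model.

```r
set.seed(1)

# Synthetic data: availability as a proportion (assumed 0 to 1 scale).
n <- 101
availability_pct <- seq(0.5, 1, length.out = n)

# Log revenue generated with an invented slope of 1.5 plus small noise.
log_revenue <- 12 + 1.5 * availability_pct + rnorm(n, sd = 0.05)

fit <- lm(log_revenue ~ availability_pct)
beta_avail <- unname(coef(fit)["availability_pct"])

# Revenue effect of a 10-percentage-point availability gain
# (0.10 on the proportion scale): exp(beta * 0.10) - 1.
pct_uplift <- (exp(beta_avail * 0.10) - 1) * 100

print(round(c(beta = beta_avail, pct_uplift_per_10pts = pct_uplift), 2))
```

The same conversion applied to the availability coefficient from `summary(model)` gives the revenue uplift per recovered percentage point; multiplying by a site's baseline weekly revenue turns it into a naira figure for the maintenance business case.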
10. Integrated Findings
The five analyses collectively support a single, coherent conclusion:
PowerGen’s Nigerian portfolio revenue is primarily determined by three factors — solar production capacity (structural), grid availability (operationally manageable), and season (external). Of these, grid availability is the most actionable lever for near-term revenue improvement.
The evidence chain: EDA established that the portfolio is heterogeneous, introduced the performance ratio as a normalised benchmark comparison tool, revealed that several sites show high generator dependency as an early financial risk signal, and identified two data quality issues requiring system remediation. Visualization confirmed that availability varies significantly across sites and months, that dry-season production is systematically higher, and that batch-level performance relative to the GSA benchmark fluctuates meaningfully across seasons. Hypothesis testing formalized the seasonal effect (p < 0.001, ~20% uplift in dry season) and confirmed batch-level differences are statistically large (η² > 0.14). Correlation analysis identified availability as the strongest controllable predictor of revenue, confirmed that generator dependency negatively impacts revenue, and validated that specific yield and GSA benchmark track together as expected. Regression quantified the marginal revenue value of all key variables — including performance ratio and solar share — and confirmed that availability, solar share, and performance ratio all carry statistically significant positive coefficients.
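For reference, the effect size cited above (η²) is the between-group sum of squares divided by the total sum of squares, and values above roughly 0.14 are conventionally read as large. The sketch below is a minimal illustration with made-up numbers and placeholder group labels, not the study's data or its exact computation.

```python
from statistics import fmean


def eta_squared(groups):
    """Eta squared = SS_between / SS_total for a one-way layout.

    `groups` maps a group label to a list of observations.
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = fmean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (fmean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total


# Made-up weekly solar output (kWh) for two placeholder groups.
example = {
    "Batch A": [10.0, 12.0, 14.0],
    "Batch B": [20.0, 22.0, 24.0],
}
effect = eta_squared(example)
print(f"eta squared = {effect:.3f}")
```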
Single Recommendation: PowerGen’s operations team should prioritize grid availability improvements at Batch 1 RMG sites (Gbara, Kpanbo, Maggi Bukun, Maggi Igenchi, Nantu, Ndejiko, Rokota), where the heatmap reveals the greatest within-batch availability variance and where the revenue uplift per percentage point of recovered availability is calculable from the regression coefficients. A targeted maintenance programme addressing root causes of availability drops — whether grid infrastructure, inverter faults, or utility curtailment — is the intervention most directly supported by this analysis.
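One way to turn this recommendation into a site shortlist is to rank sites by their average availability shortfall against a target and by how much their weekly availability varies. The sketch below uses invented availability figures, placeholder site names, and a hypothetical 95% target; none of these are taken from the study's data.

```python
from statistics import fmean, pstdev

TARGET_AVAILABILITY_PCT = 95.0  # hypothetical target, not from the study

# Invented weekly availability (%) for three placeholder sites.
weekly_availability = {
    "Site X": [96.0, 94.0, 95.0, 95.0],
    "Site Y": [99.0, 60.0, 85.0, 76.0],
    "Site Z": [88.0, 90.0, 86.0, 88.0],
}


def availability_profile(series, target):
    """Summarise one site's availability: mean, spread, and gap to target."""
    mean = fmean(series)
    return {
        "mean_pct": mean,
        "std_pct": pstdev(series),
        "shortfall_pp": max(0.0, target - mean),
    }


profiles = {
    site: availability_profile(series, TARGET_AVAILABILITY_PCT)
    for site, series in weekly_availability.items()
}

# Sites with the largest shortfall come first; spread breaks ties.
ranked = sorted(
    profiles,
    key=lambda s: (profiles[s]["shortfall_pp"], profiles[s]["std_pct"]),
    reverse=True,
)
for site in ranked:
    p = profiles[site]
    print(f"{site}: mean {p['mean_pct']:.1f}%, std {p['std_pct']:.1f}, "
          f"shortfall {p['shortfall_pp']:.1f} pp")
```

Multiplying each site's shortfall by the fitted availability coefficient would then give an indicative weekly revenue uplift for that site.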
11. Limitations & Further Work
1. Data completeness: Ofosu and Owode (Batch 3 RMG) have only 28–34 weeks of data compared to 55–62 weeks for other sites. Their performance tier assignments are less statistically stable and should be revisited once a full year of data is available. Both sites began recording revenue in Q1 2025.
2. Toto heteroscedasticity: Toto’s production scale (5,000+ kWh/week vs 500–2,000 kWh/week for other sites) is likely to produce non-constant residual variance in the regression, which makes the coefficient standard errors less reliable. A multi-level model with site-level random effects would better account for this structural heterogeneity.
3. Tariff as covariate: Tariff is batch-fixed within the 2025–2026 window. If historical data from 2023–2024 were included, tariff variation would need to be modelled explicitly as a time-varying covariate.
4. No weather data: Incorporating actual irradiance and rainfall data from NiMet (the Nigerian Meteorological Agency) or the Global Solar Atlas would allow the model to separate weather effects from operational failures, improving forecast accuracy.
5. Causal inference: All findings are associational. To confirm that improving availability causes revenue increases, a difference-in-differences design tracking pre/post planned maintenance interventions would be needed.
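The difference-in-differences idea in limitation 5 can be sketched in a few lines. The estimate is the change in mean revenue at sites that received a maintenance intervention minus the change at comparison sites over the same period. All figures below are invented, and the estimate is only meaningful if treated and comparison sites would otherwise have followed parallel trends.

```python
from statistics import fmean


def diff_in_diff(treated_pre, treated_post, control_pre, control_post):
    """(post - pre) for treated sites minus (post - pre) for control sites."""
    treated_change = fmean(treated_post) - fmean(treated_pre)
    control_change = fmean(control_post) - fmean(control_pre)
    return treated_change - control_change


# Invented weekly revenue (NGN thousands) before and after a maintenance window.
treated_pre = [100.0, 110.0, 90.0]
treated_post = [130.0, 140.0, 120.0]
control_pre = [80.0, 85.0, 75.0]
control_post = [90.0, 95.0, 85.0]

effect = diff_in_diff(treated_pre, treated_post, control_pre, control_post)
print(f"Estimated effect: {effect:.1f} (NGN thousands per week)")
```

Subtracting the comparison-site change removes movements common to all sites, such as the seasonal shift in solar output, so that only the additional change at maintained sites is attributed to the intervention.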
Further Work: With richer data, a predictive model (Random Forest or XGBoost) could forecast site-level revenue 4–6 weeks ahead, enabling proactive scheduling. Cluster analysis of site profiles could also inform batch design in future commissioning rounds.
References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Enekwe, I. (2026). PowerGen Nigeria Business Unit: Weekly solar generation and revenue performance data, 16 February 2025 to 3 May 2026 [Dataset]. Collected from Operations & Maintenance Department, PowerGen Renewable Energy, Lagos, Nigeria. Data available on request from the author.
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Claude (Anthropic, claude.ai) was used as an analytical assistant throughout this project. Specifically, Claude assisted with:
reshaping the raw Excel O&M tracker from wide to tidy format using Python;
engineering the derived variables season, performance_tier, batch, tariff_ngn_per_kwh, revenue_ngn, performance_ratio, total_production, and solar_share_pct;
structuring the Quarto document and suggesting appropriate R packages for each analytical section; and
drafting initial code scaffolding for the EDA, visualization, and regression sections.
All analytical decisions were made independently by the author: the choice of CS1 as the case study, the selection of PowerGen generation data as the primary dataset, the framing of the two hypotheses (seasonal effect and batch differences), the interpretation of all statistical outputs in business terms, and the integrated recommendation regarding grid availability improvements at Batch 1 RMG sites.
The author collected the data directly from PowerGen’s internal systems in their capacity as Group Financial Controller, is familiar with all variables and their operational definitions, and can explain and defend every output in this document.