Solar Energy Generation Performance Analytics: A Site-Level Study of the Solar Mini-grids Operated by PowerGen Renewable Energy
Author
IKECHUKWU ENEKWE
Published
May 7, 2026
1. Executive Summary
This study analyzes weekly solar energy generation and revenue performance across 16 solar mini-grid sites operated by PowerGen Renewable Energy (“PowerGen”) in Nigeria, covering the period February 2025 to May 2026. As Group Financial Controller (“GFC”), I obtained the weekly site-level observations from the Operations and Maintenance team’s tracker. The 16 sites are grouped into four batches for analysis. The dataset contains 9 variables, covering both financial and non-financial metrics.
The analysis applies five complementary techniques — Exploratory Data Analysis, Data Visualization, Hypothesis Testing, Correlation Analysis, and Linear Regression — to answer a central financial question: what drives revenue generation performance across PowerGen’s Nigerian portfolio, and are observed performance gaps between site batches statistically significant?
Key findings reveal that Toto (IMG batch) is a structural outlier with average weekly solar production more than three times higher than Batch 1 RMG sites; that the rainy season reduces solar generation by approximately 15–20% compared to the dry season; that grid availability is the strongest operational predictor of revenue; and that two data quality anomalies require remediation in PowerGen’s reporting systems.
The integrated recommendation is that management prioritise grid availability improvements at Batch 1 RMG sites, where availability variance is highest, as this single intervention offers the largest predicted revenue uplift per naira of operational expenditure.
2. Professional Disclosure
Job Title: Group Financial Controller
Organization: PowerGen Renewable Energy
Sector: Utilities (Renewable Energy)
PowerGen Renewable Energy designs, builds, and operates solar installations. The Company operates two business models: Grids and Commercial & Industrial. This analysis focuses on the Grids business because it provides sufficiently rich data for analysis. As Group Financial Controller, my responsibilities include financial reporting, revenue assurance, operational performance monitoring, and budget management across a portfolio of active solar installations. The five analytical techniques chosen for this study are directly relevant to my day-to-day work:
Technique 1 — Exploratory Data Analysis:
During the financial close process, I review site generation data from the Operations and Maintenance team’s tracker for completeness, plausibility, and anomalies. I look for missing values, outliers, and distributional patterns that could distort reported revenue figures or mask underperforming assets. In some instances I have found negative revenue or negative consumption values which, if left unresolved, would distort the financial results presented to the Executives.
Technique 2 — Data Visualization:
Monthly and quarterly performance reports submitted to the Executives and investors rely on visual communication of generation and revenue trends. Choosing the right chart type determines whether non-technical stakeholders can act on the data. This technique directly supports my reporting function. Investors in particular assess the performance of the power grids as a prerequisite for approving further investment.
Technique 3 — Hypothesis Testing:
A recurring management question is whether performance differences observed between site batches reflect genuine operational differences or merely random variation. Formal hypothesis testing gives me a statistically defensible answer to replace what would otherwise be subjective judgement in performance reviews and board presentations.
Technique 4 — Correlation Analysis:
Revenue assurance requires understanding which variables are leading indicators of financial outcomes. In the energy business, revenue is driven mainly by two variables: tariff and consumption. However, other variables, such as solar PV yield and technical losses, may also be associated with the performance of a mini-grid site. Taking these into account when reporting on performance gives management a fuller picture.
These insights also enable management to decide on resource allocation to each site and identify early indications of energy theft.
Technique 5 — Linear Regression:
PowerGen’s annual budget includes revenue projections by site. Regression provides a data-driven basis for those projections, quantifying how much revenue is expected per kWh of solar generation, per percentage point of availability, and per season — replacing assumption-driven estimates with empirically fitted parameters. Seasonality plays an important role in energy yield and site-level performance.
3. Data Collection & Sampling
| Item | Details |
| --- | --- |
| Source | Operations and Maintenance (O&M) tracker of PowerGen Renewable Energy |
| Collection Method | Direct extraction from the O&M tracker in my capacity as Group Financial Controller. Reshaped from wide to tidy (long) format and enriched with derived variables using Python prior to analysis in R. |
| Sampling Frame | All 16 active mini-grid solar sites in PowerGen’s portfolio as at May 2026, organized across four batches: Batch 1 RMG (7 sites), Batch 2 RMG (6 sites), Batch 3 RMG (2 sites), and IMG/Toto (1 site). |
| Sample Size | The total number of observations used for the analysis is 907. |
| Time Period | 9 March 2025 to 1 May 2026 (approximately 60 weeks) |
| Ethical Notes | The data used has been accessed in my professional capacity with management authorization. Site names are operational identifiers; no anonymization required. |
| Data Sharing | The data published is being used for academic purposes only. This is consistent with PowerGen’s data governance policy. |
# Count weekly observations per site, ordered within each batch
pg |>
  count(batch, site, name = "n_weeks") |>
  arrange(batch, desc(n_weeks)) |>
  kable(caption = "Observations per Site") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Observations per Site

| batch | site | n_weeks |
| --- | --- | --- |
| Batch 1 RMG | Maggi Igenchi | 62 |
| Batch 1 RMG | Gbara | 61 |
| Batch 1 RMG | Maggi Bukun | 61 |
| Batch 1 RMG | Nantu | 61 |
| Batch 1 RMG | Ndejiko | 61 |
| Batch 1 RMG | Rokota | 60 |
| Batch 1 RMG | Kpanbo | 55 |
| Batch 2 RMG | Dukugi | 61 |
| Batch 2 RMG | Ebangi | 61 |
| Batch 2 RMG | Gbade | 61 |
| Batch 2 RMG | Sachi Nku | 61 |
| Batch 2 RMG | Sosa | 61 |
| Batch 2 RMG | Danchitagi | 58 |
| Batch 3 RMG | Ofosu | 34 |
| Batch 3 RMG | Owode | 28 |
| IMG | Toto | 61 |
5. Exploratory Data Analysis (EDA)
Theory: Exploratory Data Analysis (EDA) is the systematic examination of a dataset’s structure, distributions, and anomalies before formal modelling (Adi, 2026, Ch. 4). The objective is to understand what the data contains, where it is incomplete, and where it may violate data analysis assumptions. Anscombe’s Quartet (1973) illustrates why summary statistics alone are insufficient — datasets with identical means and variances can exhibit fundamentally different patterns. EDA guards against this by combining numerical summaries with visual inspection.
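The Anscombe point can be checked directly, since the quartet ships with base R as the `anscombe` data frame. The sketch below is illustrative only and is not part of the PowerGen analysis; it computes the usual headline statistics for each of the four x/y pairs.

```r
# Anscombe's quartet is bundled with base R (columns x1..x4 and y1..y4).
data(anscombe)

# Summarise each of the four x/y pairs with the usual headline statistics.
quartet_summary <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(
    dataset = i,
    mean_x  = mean(x),
    mean_y  = mean(y),
    var_x   = var(x),
    var_y   = var(y),
    cor_xy  = cor(x, y)
  )
}))

# The four rows agree to about two decimal places.
print(round(quartet_summary, 2))
```

Plotting each pair, for example with `plot(anscombe$x1, anscombe$y1)`, shows four visibly different shapes behind these near-identical summaries, which is why the numerical review of the O&M tracker is paired with charts.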
Business Justification: As the GFC, I rely on the O&M trackers to build my site performance reports. Undetected outliers or data anomalies directly affect reporting.
Code
# Panel 1: distribution of weekly solar production
p1 <- ggplot(pg, aes(x = solar_production_kwh)) +
  geom_histogram(bins = 40, fill = "#2E86AB", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = comma) +
  labs(title = "Solar Production (kWh)", x = "kWh", y = "Count") +
  theme_minimal()

# Panel 2: distribution of grid availability
p2 <- ggplot(pg, aes(x = availability_pct)) +
  geom_histogram(bins = 30, fill = "#A23B72", colour = "white", alpha = 0.85) +
  labs(title = "Grid Availability (%)", x = "Availability", y = "Count") +
  theme_minimal()

# Panel 3: solar production by batch, outliers in red
p3 <- ggplot(pg, aes(x = batch, y = solar_production_kwh, fill = batch)) +
  geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 1.5) +
  scale_y_continuous(labels = comma) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Solar Production by Batch", x = "Batch", y = "kWh", fill = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

# Panel 4: solar production by performance tier
p4 <- ggplot(pg, aes(x = performance_tier, y = solar_production_kwh,
                     fill = performance_tier)) +
  geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 1.5) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("Low" = "#E74C3C", "Medium" = "#F39C12",
                               "High" = "#27AE60")) +
  labs(title = "Solar Production by Performance Tier", x = "Tier", y = "kWh",
       fill = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

# Arrange the four panels in a 2 x 2 grid (patchwork syntax)
(p1 + p2) / (p3 + p4)
| Issue | Site | Description | Treatment |
| --- | --- | --- | --- |
| Negative technical losses | Maggi Bukun | Technical losses are the energy lost between the point of generation and distribution to customers. A negative loss is physically impossible. | Flagged in analysis; recommend source system correction |
| Extreme consumption outlier | Maggi Bukun | The week of 8 Feb 2026 records customer consumption of 227,125 kWh for the site, most likely a cumulative figure entered in error. | Row excluded from all revenue-based analyses |
Plain-Language Interpretation for Management: The distribution of solar production is right-skewed, meaning a small number of sites — primarily Toto — generate significantly more energy than all other sites. Toto is the first Interconnected Mini-Grid in Nigeria and is a peri-urban community about 1 hour from Abuja. The community comprises individual homes, small-scale businesses and public utility centres.
Most sites register consumption levels of 500 to 2,000 kWh per week on average. Another observation is that grid availability is high across most sites (greater than 90%), but drops noticeably for Toto and Owode. Given the financial significance of Toto and Owode, the reduced grid availability would have an impact on the performance of those sites and the overall business.
In addition, two data recording errors were identified at Maggi Bukun. First, the system recorded negative technical losses in at least one period, which is physically impossible and indicates a meter reading or data entry problem. Second, one week in February 2026 shows customer consumption of 227,125 kWh, roughly 300 times the normal level and almost certainly a data entry error. Both should be corrected in the source system. The affected Maggi Bukun row has been excluded from all revenue-based analyses to avoid distorting results.
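The two checks described above could be automated along the following lines. This is a minimal sketch on toy data, not the screening actually used in this study: `technical_losses_kwh` is a hypothetical column name, the site label is invented, and the 10x-median threshold is an assumption chosen only for illustration.

```r
# Toy weekly data for one site; values are illustrative, not PowerGen records.
weekly <- data.frame(
  site = "Site A",
  week_ending = as.Date("2026-01-04") + 7 * 0:5,
  customer_metered_consumption_kwh = c(710, 695, 730, 702, 227125, 688),
  technical_losses_kwh = c(52, 48, -15, 50, 47, 51)  # hypothetical column name
)

# Rule 1: technical losses cannot be negative.
weekly$flag_negative_loss <- weekly$technical_losses_kwh < 0

# Rule 2: consumption far above the site's own median suggests a cumulative
# reading keyed in as a weekly figure. The 10x multiplier is an assumption.
site_median <- ave(weekly$customer_metered_consumption_kwh, weekly$site,
                   FUN = median)
weekly$flag_extreme_consumption <-
  weekly$customer_metered_consumption_kwh > 10 * site_median

# Rows to query with O&M before the data feeds any revenue analysis.
flagged <- weekly[weekly$flag_negative_loss | weekly$flag_extreme_consumption, ]
print(flagged)

# Revenue-based analyses would then use only rows with plausible consumption.
weekly_clean <- weekly[!weekly$flag_extreme_consumption, ]
```

Comparing each week against the site's own median, instead of a portfolio-wide cut-off, keeps a structurally larger site such as Toto from being flagged simply for its scale.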
6. Data Visualization
Theory: The Grammar of Graphics (Wilkinson, 2005, cited in Adi, 2026, Ch. 5) holds that every effective chart is built from a systematic mapping of data variables to visual properties — position, colour, size, shape. Chart selection should be governed by the data type and the story being told, not aesthetic preference.
A time series calls for a line chart; distributions call for histograms or ridge plots; categorical comparisons call for bar charts or boxplots; relationships between continuous variables call for scatter plots.
Business Justification: PowerGen’s management and investors require visual performance summaries that communicate site-level and portfolio-level trends. I work closely with the Financial Planning and Analysis Manager to prepare and present these visuals.
Code
# Total weekly solar production per batch, with a smoothed trend overlay
pg |>
  group_by(week_ending, batch) |>
  summarise(total_solar = sum(solar_production_kwh, na.rm = TRUE),
            .groups = "drop") |>
  ggplot(aes(x = week_ending, y = total_solar, colour = batch)) +
  geom_line(linewidth = 0.8, alpha = 0.9) +
  geom_smooth(se = FALSE, linewidth = 0.4, linetype = "dashed", alpha = 0.5) +
  scale_y_continuous(labels = comma) +
  scale_colour_brewer(palette = "Set1") +
  labs(
    title = "Weekly Solar Production by Batch — Feb 2025 to May 2026",
    subtitle = "Dashed lines show smoothed trend",
    x = NULL, y = "Total kWh", colour = "Batch"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Weekly solar production over time by batch
Code
# Monthly mean availability per site, drawn as a heatmap
pg |>
  filter(!is.na(availability_pct)) |>
  mutate(month = floor_date(week_ending, "month")) |>
  group_by(site, month) |>
  summarise(avg_avail = mean(availability_pct, na.rm = TRUE),
            .groups = "drop") |>
  ggplot(aes(x = month, y = fct_reorder(site, avg_avail), fill = avg_avail)) +
  geom_tile(colour = "white", linewidth = 0.3) +
  scale_fill_gradient2(low = "#E74C3C", mid = "#F39C12", high = "#27AE60",
                       midpoint = 0.75, labels = percent) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  labs(
    title = "Monthly Average Grid Availability by Site",
    subtitle = "Green = high availability, Red = low availability",
    x = NULL, y = NULL, fill = "Availability"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
Site-level availability heatmap
Code
# Average weekly revenue per site, split by season
pg_clean |>
  filter(!is.na(revenue_ngn)) |>
  group_by(site, season, batch) |>
  summarise(avg_revenue = mean(revenue_ngn, na.rm = TRUE),
            .groups = "drop") |>
  ggplot(aes(x = fct_reorder(site, avg_revenue), y = avg_revenue,
             fill = season)) +
  geom_col(position = "dodge", alpha = 0.85) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("Dry" = "#F4A261", "Rainy" = "#457B9D")) +
  coord_flip() +
  labs(
    title = "Average Weekly Revenue by Site and Season (₦)",
    x = NULL, y = "Average Weekly Revenue (₦)", fill = "Season"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")
Average weekly revenue by site and season
Code
# Revenue against solar production, with a single OLS fit across all points
pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(solar_production_kwh)) |>
  ggplot(aes(x = solar_production_kwh, y = revenue_ngn,
             colour = performance_tier, shape = season)) +
  geom_point(alpha = 0.5, size = 1.8) +
  geom_smooth(method = "lm", se = TRUE, colour = "black", linewidth = 0.8) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  scale_colour_manual(values = c("Low" = "#E74C3C", "Medium" = "#F39C12",
                                 "High" = "#27AE60")) +
  labs(
    title = "Solar Production vs Revenue",
    subtitle = "Black line = OLS fit; colour = performance tier",
    x = "Solar Production (kWh)", y = "Revenue (₦)",
    colour = "Tier", shape = "Season"
  ) +
  theme_minimal()
Solar production vs revenue
Code
# Ridge plot of solar production by season (geom_density_ridges is from ggridges)
pg |>
  filter(!is.na(solar_production_kwh)) |>
  ggplot(aes(x = solar_production_kwh, y = season, fill = season)) +
  geom_density_ridges(alpha = 0.7, scale = 1.2) +
  scale_x_continuous(labels = comma) +
  scale_fill_manual(values = c("Dry" = "#F4A261", "Rainy" = "#457B9D")) +
  labs(
    title = "Solar Production Distribution: Dry vs Rainy Season",
    x = "Solar Production (kWh)", y = NULL, fill = "Season"
  ) +
  theme_minimal() +
  theme(legend.position = "none")
Solar production distribution by season
Plain-Language Interpretation for Management: Five charts tell one story. Chart 1 shows that Batch 1 RMG leads portfolio output most weeks, but its production is more volatile than Batch 2 RMG. Batch 2 RMG sites were developed after Batch 1 and incorporated some of the technical improvements recommended following the Batch 1 projects.
Chart 2 reveals that grid availability is close to 100% for most Batch 1 and 2 sites but significantly lower for Toto and Owode. Toto and Owode have reported inconsistent grid availability due to technical issues with the diesel generator and the Battery Energy Storage System.
Chart 3 confirms dry-season revenue consistently exceeds rainy-season revenue across all sites. Chart 4 demonstrates a strong linear relationship between solar production and revenue. Chart 5 confirms the seasonal effect — dry-season production is shifted higher, with less variation than the rainy season.
7. Hypothesis Testing
Theory: Hypothesis testing is the formal framework for deciding whether observed data differences are statistically real or attributable to chance (Adi, 2026, Ch. 6). A null hypothesis (H₀) posits no effect; an alternative hypothesis (H₁) posits an effect, which may be directional. The p-value is the probability of observing data at least as extreme as the sample if H₀ were true. Effect sizes (Cohen’s d or η²) are reported alongside p-values, because statistical significance alone does not convey practical importance. Two tests are used: the Wilcoxon Rank-Sum test (a non-parametric alternative to the two-sample t-test, used when distributions are skewed) and one-way ANOVA with Tukey HSD post-hoc comparisons.
Business Justification: PowerGen management regularly compares batch performance in operational reviews. Without formal testing, conclusions risk being driven by selective attention to favourable weeks. The two hypotheses below directly inform resource allocation and tariff review decisions that I support as GFC.
Hypothesis 1 — Seasonal Effect on Solar Production
H₀: Mean weekly solar production is equal in dry and rainy seasons
H₁: Mean weekly solar production is higher in the dry season (one-tailed)
cat(sprintf(" Dry season: W = %.4f, p = %.4f\n", sw_dry$statistic, sw_dry$p.value))
Dry season: W = 0.7144, p = 0.0000
Code
cat(sprintf(" Rainy season: W = %.4f, p = %.4f\n", sw_rainy$statistic, sw_rainy$p.value))
Rainy season: W = 0.6963, p = 0.0000
Code
# One-sided test: is dry-season production shifted above rainy-season production?
wilcox_result <- wilcox.test(dry_solar, rainy_solar, alternative = "greater")
cat("\nWilcoxon Rank-Sum Test (one-sided: dry > rainy):\n")
Wilcoxon Rank-Sum Test (one-sided: dry > rainy):
Code
print(wilcox_result)
Wilcoxon rank sum test with continuity correction
data: dry_solar and rainy_solar
W = 114335, p-value = 0.0003651
alternative hypothesis: true location shift is greater than 0
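The theory paragraph calls for an effect size alongside each p-value. For a Wilcoxon rank-sum test, one common choice is the rank-biserial correlation, which can be derived from W and the two sample sizes. The sketch below uses synthetic log-normal data, because the group sizes behind the output above are not shown here; `dry_demo` and `rainy_demo` are illustrative names, not the study's vectors.

```r
set.seed(42)
# Synthetic right-skewed weekly production (kWh); not PowerGen data.
dry_demo   <- rlnorm(120, meanlog = log(1400), sdlog = 0.5)
rainy_demo <- rlnorm(150, meanlog = log(1150), sdlog = 0.5)

wt <- wilcox.test(dry_demo, rainy_demo, alternative = "greater")

# In R, W counts the (dry, rainy) pairs in which the dry value is the larger.
n_pairs          <- length(dry_demo) * length(rainy_demo)
prob_superiority <- unname(wt$statistic) / n_pairs  # share of pairs with dry > rainy
rank_biserial    <- 2 * prob_superiority - 1        # ranges from -1 to 1

cat(sprintf("W = %.0f, p = %.4g\n", wt$statistic, wt$p.value))
cat(sprintf("P(dry > rainy) = %.3f, rank-biserial r = %.3f\n",
            prob_superiority, rank_biserial))
```

A rank-biserial value near 0 means the two seasons overlap almost completely, while a value near 1 means nearly every dry week outproduces every rainy week. That conveys practical importance in a way the p-value alone does not.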
Df Sum Sq Mean Sq F value Pr(>F)
batch 3 9.066e+08 302186345 212 <2e-16 ***
Residuals 903 1.287e+09 1425526
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
# Levene's test (car package): are the batch variances comparable?
levene_result <- leveneTest(
  solar_production_kwh ~ batch,
  data = pg |> filter(!is.na(solar_production_kwh))
)
cat("\nLevene's Test for Homogeneity of Variance:\n")
Levene's Test for Homogeneity of Variance:
Code
print(levene_result)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 123.21 < 2.2e-16 ***
903
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
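Levene's test above rejects equal variances across batches (F = 123.21), which is an assumption of classical one-way ANOVA. One possible robustness check, not one reported in this study, is Welch's ANOVA, available in base R through `oneway.test()`. The sketch below runs both versions on synthetic data with deliberately unequal spread.

```r
set.seed(1)
# Synthetic groups with unequal spread, mimicking one large site alongside
# smaller ones. Values are illustrative only.
demo <- data.frame(
  batch = factor(rep(c("A", "B", "C"), times = c(60, 60, 30))),
  kwh   = c(rnorm(60, mean = 1200, sd = 150),
            rnorm(60, mean = 1300, sd = 200),
            rnorm(30, mean = 5000, sd = 1200))
)

# Classical one-way ANOVA assumes equal variances across groups.
classic <- oneway.test(kwh ~ batch, data = demo, var.equal = TRUE)

# Welch's version drops that assumption (var.equal = FALSE is the default).
welch <- oneway.test(kwh ~ batch, data = demo, var.equal = FALSE)

print(classic)
print(welch)
```

If the Welch result agrees with the classical one, the batch-level conclusion does not hinge on the equal-variance assumption. For pairwise follow-ups under unequal variances, `pairwise.t.test()` with `pool.sd = FALSE` is one base R option.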
cat("Interpretation: >0.14 = large effect (Cohen, 1988)\n")
Interpretation: >0.14 = large effect (Cohen, 1988)
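For reference, η² can be recomputed from the sums of squares printed in the ANOVA table above. The short calculation below uses only those printed figures, so the result is subject to their rounding.

```r
# Sums of squares and degrees of freedom as printed in the ANOVA table.
ss_between <- 9.066e+08   # batch
ss_within  <- 1.287e+09   # residuals
df_between <- 3
df_within  <- 903

# Eta squared: share of total variation in weekly production explained by batch.
eta_sq <- ss_between / (ss_between + ss_within)

# Cross-check: rebuild the F statistic from the same table.
f_stat <- (ss_between / df_between) / (ss_within / df_within)

cat(sprintf("eta squared = %.3f, F = %.1f\n", eta_sq, f_stat))
# prints: eta squared = 0.413, F = 212.0
```

On these figures batch membership accounts for roughly 41% of the variation in weekly solar production, well above the 0.14 threshold for a large effect.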
Plain-Language Interpretation for Management:
Hypothesis 1: Dry season solar production is significantly higher than rainy season (p < 0.001). The gap is approximately 20% — about 270 kWh per site per week. This is statistically confirmed, not just visually apparent.
Practical implication: In preparing revenue forecasts, ensure that seasonality effects are considered. A revenue adjustment of 15–20% between dry and rainy quarters might be appropriate.
Hypothesis 2: Batch-level differences in solar production are statistically large (p < 0.001, η² > 0.14). Toto (IMG) is the main driver — it produces significantly more than all RMG batches. This is because Toto is an Interconnected Mini-grid and not a regular Isolated Rural Mini-grid.
Practical implication: Performance benchmarking must be done within batch cohorts to ensure that the comparison is like-for-like. For instance, comparing a Batch 1 RMG site to Toto is misleading and unfair — they are structurally different installations.
8. Correlation Analysis
Theory: Correlation measures the strength and direction of linear association between two continuous variables (Adi, 2026, Ch. 8). Pearson’s r is appropriate for normally distributed variables; Spearman’s ρ (rho) is preferred when distributions are skewed or ordinal. Values range from -1 (perfect negative) to +1 (perfect positive); 0 indicates no linear relationship. A correlation matrix with heatmap provides a portfolio-level view of inter-variable relationships. Crucially, correlation does not imply causation — a high correlation between two variables may reflect a common underlying driver rather than a direct causal link.
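The difference between the two coefficients is easiest to see on a relationship that is perfectly monotonic but not linear. The toy example below is unrelated to the PowerGen data and simply contrasts `method = "pearson"` with `method = "spearman"` in base R's `cor()`.

```r
# A perfectly monotonic but strongly non-linear relationship.
x <- 1:20
y <- exp(x / 2)

pearson  <- cor(x, y, method = "pearson")   # measures linear association
spearman <- cor(x, y, method = "spearman")  # measures monotonic association

cat(sprintf("Pearson r = %.3f, Spearman rho = %.3f\n", pearson, spearman))
```

Spearman's ρ equals 1 here because the ranks of x and y agree exactly, while Pearson's r falls below 1 because the points do not lie on a straight line. With right-skewed variables such as production and revenue, the rank-based measure is the more robust summary.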
Business Justification: Understanding which operational metrics impact revenue helps PowerGen’s management to shape strategy and prioritize interventions. For instance, if grid availability has a stronger correlation with revenue than specific yield, maintenance budgets would typically be channeled towards metering and core uptime rather than PV panel efficiency. This is a resource allocation question I face as GFC when preparing the annual capital expenditure budget.
# Scatter 1: customer consumption against revenue
c1 <- pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(customer_metered_consumption_kwh)) |>
  ggplot(aes(x = customer_metered_consumption_kwh, y = revenue_ngn)) +
  geom_point(alpha = 0.4, colour = "#2E86AB", size = 1.5) +
  geom_smooth(method = "lm", colour = "black") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(title = "Consumption vs Revenue", x = "Consumption (kWh)",
       y = "Revenue (₦)") +
  theme_minimal()

# Scatter 2: grid availability against revenue
c2 <- pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(availability_pct)) |>
  ggplot(aes(x = availability_pct, y = revenue_ngn)) +
  geom_point(alpha = 0.4, colour = "#A23B72", size = 1.5) +
  geom_smooth(method = "lm", colour = "black") +
  scale_y_continuous(labels = comma) +
  labs(title = "Availability vs Revenue", x = "Grid Availability",
       y = "Revenue (₦)") +
  theme_minimal()

# Place the two panels side by side (patchwork syntax)
c1 + c2
Top two correlates of revenue
Plain-Language Interpretation for Management: Customer consumption is almost perfectly correlated with revenue (ρ ≈ 0.99) — this is expected since revenue is a function of tariff and consumption. There also exists a strong correlation between grid availability and revenue: sites that are operational for a higher proportion of the week consistently earn more. Solar production also correlates strongly with revenue, but availability is the more controllable variable in the short term — you cannot quickly add solar panels, but you can improve maintenance response times.
Technical losses show a weak negative correlation with revenue: higher losses reduce billable energy, but the effect is modest compared to availability. The correlation between specific yield and the Global Solar Atlas (GSA) benchmark indicates that actual site performance tracks theoretical solar potential, which supports the integrity of the generation data.
9. Linear Regression
Theory: Ordinary Least Squares (OLS) regression estimates the linear relationship between a dependent variable and one or more predictors by minimizing the sum of squared residuals (Adi, 2026, Ch. 9). Each coefficient represents the expected change in the outcome for a one-unit increase in the predictor, holding all others constant. Diagnostic plots assess four key assumptions: linearity (Residuals vs Fitted), normality of residuals (Q-Q Plot), homoscedasticity (Scale-Location), and influential observations (Cook’s Distance). A log transformation of the outcome variable is used here to address right skew in revenue data.
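To make the log-outcome specification concrete, the sketch below fits such a model to synthetic site-weeks. The variable names mirror those used in this report, but the data and the coefficients in the data-generating step are invented for illustration, so the output says nothing about PowerGen's actual estimates.

```r
set.seed(123)
n <- 300
# Synthetic site-weeks; variable names mirror the report, values do not.
sim <- data.frame(
  solar_production_kwh = runif(n, 500, 2000),
  availability_pct     = runif(n, 0.60, 1.00),
  season               = factor(sample(c("Dry", "Rainy"), n, replace = TRUE))
)

# Data-generating process with known, made-up coefficients.
sim$revenue_ngn <- exp(
  11 + 0.0005 * sim$solar_production_kwh +
    1.0 * sim$availability_pct -
    0.15 * (sim$season == "Rainy") +
    rnorm(n, sd = 0.05)
)

# OLS on the logged outcome.
fit <- lm(log(revenue_ngn) ~ solar_production_kwh + availability_pct + season,
          data = sim)
print(round(coef(summary(fit)), 4))

# With a logged outcome, exp(coef) - 1 is the proportional change in revenue
# for a one-unit change in the predictor, holding the others constant.
pct_effect <- 100 * (exp(coef(fit)) - 1)
print(round(pct_effect[-1], 2))

# Residual diagnostics can be drawn with:
# par(mfrow = c(2, 2)); plot(fit)
```

Because availability is stored as a fraction in this sketch, a "one-unit" change means moving from 0% to 100%. Multiplying the coefficient by 0.01 before exponentiating gives the effect per percentage point.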
Business Justification: PowerGen’s annual budgeting process requires site-level revenue forecasts. Currently these rely on manual assumptions. This regression model provides a data-driven basis for those forecasts, quantifying the marginal revenue contribution of solar production, grid availability, season, and batch — enabling more accurate and defensible projections for management and investors.
Plain-Language Interpretation for Management: The model explains approximately 75–80% of the variation in site-level weekly revenue (R² ≈ 0.75–0.80) — strong enough for operational planning and annual budgeting purposes. The most important finding is that grid availability has a positive relationship with revenue even after controlling for how much energy the panels are generating. This means availability improvements deliver financial returns independently of weather or panel capacity.
In practical budgeting terms: a site generating 1,500 kWh/week at 90% availability can be expected to earn materially more than the same site at 70% availability, even in the same weather week. The regression quantifies that gap, giving operations management a financial case for prioritizing uptime maintenance over other expenditure categories.
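The 90% versus 70% comparison follows directly from the log-linear form: the revenue ratio is exp(β × change in availability). The helper below is a hypothetical illustration, and `beta_hypothetical` is a placeholder value, not the coefficient fitted in this study.

```r
# Proportional revenue change implied by a log-linear model when availability
# moves from `from` to `to` (both as fractions), all else held equal.
availability_uplift <- function(beta_availability, from, to) {
  exp(beta_availability * (to - from)) - 1
}

# beta_hypothetical is a placeholder, NOT the fitted PowerGen coefficient.
beta_hypothetical <- 1.0
uplift <- availability_uplift(beta_hypothetical, from = 0.70, to = 0.90)

cat(sprintf("Revenue at 90%% vs 70%% availability: +%.1f%%\n", 100 * uplift))
# prints: Revenue at 90% vs 70% availability: +22.1%
```

Substituting the fitted availability coefficient for the placeholder would turn this into the site-level figure operations management needs when weighing an uptime intervention against its cost.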
10. Integrated Findings
The five analyses collectively support a single, coherent conclusion:
PowerGen’s Nigerian portfolio revenue is primarily determined by three factors — solar production capacity (structural), grid availability (operationally manageable), and season (external). Of these, grid availability is the most actionable lever for near-term revenue improvement.
The evidence chain: EDA established that the portfolio is heterogeneous and identified two data quality issues requiring system remediation. Visualization confirmed that availability varies significantly across sites and months, and that dry-season production is systematically higher. Hypothesis testing formalized the seasonal effect (p < 0.001, ~20% uplift in dry season) and confirmed batch-level differences are statistically large (η² > 0.14). Correlation analysis identified availability as the strongest controllable predictor of revenue. Regression quantified the marginal revenue value of each variable and confirmed the availability coefficient is positive, significant, and economically meaningful.
Single Recommendation: PowerGen’s operations team should prioritize grid availability improvements at Batch 1 RMG sites (Gbara, Kpanbo, Maggi Bukun, Maggi Igenchi, Nantu, Ndejiko, Rokota), where the heatmap reveals the greatest within-batch availability variance and where the revenue uplift per percentage point of recovered availability is calculable from the regression coefficients. A targeted maintenance programme addressing root causes of availability drops — whether grid infrastructure, inverter faults, or utility curtailment — is the intervention most directly supported by this analysis.
11. Limitations & Further Work
1. Data completeness: Ofosu and Owode (Batch 3 RMG) have only 28–34 weeks of data compared to 55–62 weeks for other sites. Their performance tier assignments are less statistically stable and should be revisited once a full year of data is available. Both sites began recording revenue in Q1 2025.
2. Toto heteroscedasticity: Toto’s production scale (5,000+ kWh/week vs 500–2,000 kWh for other sites) introduces variance inflation in the regression. A multi-level model with site-level random effects would better account for this structural heterogeneity.
3. Tariff as covariate: Tariff is batch-fixed within the 2025–2026 window. If historical data from 2023–2024 were included, tariff variation would need to be modelled explicitly as a time-varying covariate.
4. No weather data: Incorporating actual irradiance and rainfall data from NIMET or Global Solar Atlas would allow the model to separate weather effects from operational failures, improving forecast accuracy.
5. Causal inference: All findings are associational. To confirm that improving availability causes revenue increases, a difference-in-differences design tracking pre/post planned maintenance interventions would be needed.
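The difference-in-differences design mentioned in item 5 could take roughly the following shape. This is a sketch on a synthetic panel under stated assumptions: the site count, the intervention week, and the 0.10 log-point effect are all invented, and the simple `lm()` fit ignores within-site correlation that a real analysis would need to address.

```r
set.seed(2026)
# Synthetic panel: 8 sites x 20 weeks; sites 1-4 receive a maintenance
# intervention from week 11. All numbers are invented for illustration.
panel <- expand.grid(site = 1:8, week = 1:20)
panel$treated <- as.integer(panel$site <= 4)
panel$post    <- as.integer(panel$week >= 11)

true_effect <- 0.10   # assumed gain in log revenue from the intervention
panel$log_revenue <- 12 +
  0.05 * panel$treated +                      # fixed gap between the groups
  0.03 * panel$post +                         # shock common to all sites
  true_effect * panel$treated * panel$post +  # the effect of interest
  rnorm(nrow(panel), sd = 0.04)

# The coefficient on treated:post is the difference-in-differences estimate.
did_fit <- lm(log_revenue ~ treated * post, data = panel)
did_est <- coef(did_fit)[["treated:post"]]

print(round(coef(summary(did_fit)), 4))
```

The estimate is only credible if treated and untreated sites would have followed parallel revenue trends without the intervention, which is worth checking on pre-intervention weeks before relying on the result.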
Further Work: With richer data, a predictive model (Random Forest or XGBoost) could forecast site-level revenue 4–6 weeks ahead, enabling proactive scheduling. Cluster analysis of site profiles could also inform batch design in future commissioning rounds.
References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Enekwe, I. (2026). PowerGen NGBU weekly O&M tracker — site-level data extract, January 2025 to May 2026 [Dataset]. Finance Department, PowerGen Renewable Energy, Lagos, Nigeria. Data available on request from the author.
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
Claude (Anthropic, claude.ai) was used as an analytical assistant throughout this project. Specifically, Claude assisted with:
reshaping the raw Excel O&M tracker from wide to tidy format using Python;
engineering the derived variables season, performance_tier, batch, tariff_ngn_per_kwh, and revenue_ngn;
structuring the Quarto document and suggesting appropriate R packages for each analytical section; and
drafting initial code scaffolding for the EDA, visualization, and regression sections.
All analytical decisions were made independently by the author: the choice of CS1 as the case study, the selection of PowerGen generation data as the primary dataset, the framing of the two hypotheses (seasonal effect and batch differences), the interpretation of all statistical outputs in business terms, and the integrated recommendation regarding grid availability improvements at Batch 1 RMG sites.
The author collected the data directly from PowerGen’s internal systems in their capacity as Group Financial Controller, is familiar with all variables and their operational definitions, and can explain and defend every output in this document.