Solar Energy Generation Performance Analytics: A Site-Level Study of the Solar Mini-grids Operated by PowerGen Renewable Energy

Author

IKECHUKWU ENEKWE

Published

May 7, 2026

1. Executive Summary

This study analyzes weekly solar energy generation and revenue performance across 16 solar grid sites operated by PowerGen Renewable Energy (“PowerGen”) in Nigeria, covering the period February 2025 to May 2026. As Group Financial Controller (“GFC”), I obtained the weekly site-level observations from the Operations and Maintenance team’s tracker. The 16 sites are grouped into four batches for ease of analysis. The dataset contains 907 site-week observations and 15 variables, nine of them numeric, covering both financial and non-financial metrics.

The analysis applies five complementary techniques — Exploratory Data Analysis, Data Visualization, Hypothesis Testing, Correlation Analysis, and Linear Regression — to answer a central financial question: what drives revenue generation performance across PowerGen’s Nigerian portfolio, and are observed performance gaps between site batches statistically significant?

Key findings reveal that Toto (IMG batch) is a structural outlier with average weekly solar production more than three times higher than Batch 1 RMG sites; that rainy season reduces solar generation by approximately 15–20% compared to dry season; that grid availability is the strongest operational predictor of revenue; and that two data quality anomalies require remediation in PowerGen’s reporting systems.

The integrated recommendation is that management prioritise grid availability improvements at Batch 1 RMG sites, where availability variance is highest, as this single intervention offers the largest predicted revenue uplift per naira of operational expenditure.

2. Professional Disclosure

Job Title: Group Financial Controller
Organization: PowerGen Renewable Energy
Sector: Utilities — Renewable Energy

PowerGen Renewable Energy designs, builds, and operates solar installations. The Company operates two business models: Grids and Commercial & Industrial. This analysis focuses on the Grids business because it provides sufficiently rich data for analysis. As Group Financial Controller, my responsibilities include financial reporting, revenue assurance, operational performance monitoring, and budget management across a portfolio of active solar installations. The five analytical techniques chosen for this study are directly relevant to my day-to-day work:

Technique 1 — Exploratory Data Analysis:
During our financial close process, I review site generation data from the Operations and Maintenance team’s tracker for completeness, plausibility, and anomalies. During this process, I look out for missing values, outliers, and distributional patterns that could distort reported revenue figures or mask underperforming assets. In some instances, I have identified negative revenue values or negative consumption values which, if not identified and resolved, could distort the financial results presented to the Executives.

Technique 2 — Data Visualization:
Monthly and quarterly performance reports submitted to the Executives and investors rely on visual communication of generation and revenue trends. Choosing the right chart type determines whether non-technical stakeholders can act on the data. This technique directly supports my reporting function. Investors in particular assess the performance of the power grids before deciding whether to approve further investment.

Technique 3 — Hypothesis Testing:
A recurring management question is whether performance differences observed between site batches reflect genuine operational differences or merely random variation. Formal hypothesis testing gives me a statistically defensible answer to replace what would otherwise be subjective judgement in performance reviews and board presentations.

Technique 4 — Correlation Analysis:
Revenue assurance requires understanding which variables are leading indicators of financial outcomes. In the energy business, revenue is driven mainly by two variables: tariff and consumption. However, other variables, such as solar PV yield and technical losses, may also be associated with the performance of a mini-grid site. Taking all of these into account when reporting on performance gives management better insight.

These insights also enable management to decide on resource allocation to each site and identify early indications of energy theft.

Technique 5 — Linear Regression:
PowerGen’s annual budget includes revenue projections by site. Regression provides a data-driven basis for those projections, quantifying how much revenue is expected per kWh of solar generation, per percentage point of availability, and per season — replacing assumption-driven estimates with empirically fitted parameters. Seasonality plays an important role in energy yield and site-level performance.

3. Data Collection & Sampling

Item	Details
Source	Operations and Maintenance (O&M) tracker of PowerGen Renewable Energy
Collection Method	Direct extraction from the O&M tracker in my capacity as Group Financial Controller. Reshaped from wide to tidy (long) format and enriched with derived variables using Python prior to analysis in R.
Sampling Frame	All 16 active mini-grid solar sites in PowerGen’s portfolio as at May 2026, organized across four batches: Batch 1 RMG (7 sites), Batch 2 RMG (6 sites), Batch 3 RMG (2 sites), and IMG/Toto (1 site).
Sample Size	The total number of observations used for the analysis is 907.
Time Period	16 February 2025 to 3 May 2026 (62 weekly periods)
Ethical Notes	The data used has been accessed in my professional capacity with management authorization. Site names are operational identifiers; no anonymization required.
Data Sharing	The data published is being used for academic purposes only. This is consistent with PowerGen’s data governance policy.

Tariff History

Sites	Tariff History
Batch 1 & 2 RMG (13 sites)	₦240/kWh before May 2024 → ₦540/kWh from May 2024
Batch 3 RMG — Ofosu, Owode	₦650/kWh throughout
IMG — Toto	₦165 pre-Jan 2024 → ₦195 Jan–Mar 2024 → ₦450 from Apr 2024
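
The tariff schedule above can be expressed as a small lookup, which is useful whenever revenue has to be recomputed from consumption. The sketch below is illustrative only: the function name `tariff_for()` is hypothetical, and it assumes each change takes effect on the first day of the stated month, which the table does not confirm.

```r
# Hypothetical helper: tariff (NGN per kWh) for a batch on a given date.
# Assumption: each tariff change applies from the 1st of the stated month.
tariff_for <- function(batch, date) {
  date <- as.Date(date)
  if (batch %in% c("Batch 1 RMG", "Batch 2 RMG")) {
    if (date < as.Date("2024-05-01")) {
      240
    } else {
      540
    }
  } else if (batch == "Batch 3 RMG") {
    650
  } else if (batch == "IMG") {
    if (date < as.Date("2024-01-01")) {
      165
    } else if (date < as.Date("2024-04-01")) {
      195
    } else {
      450
    }
  } else {
    stop("Unknown batch: ", batch)
  }
}

tariff_for("Batch 1 RMG", "2025-03-09")
tariff_for("IMG", "2024-02-15")
```

Every week in the study period falls after the last tariff change, so only the 450, 540 and 650 rates should appear in the analysis data.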

4. Data Description

Code

# Load required packages
library(tidyverse)
library(skimr)
library(corrplot)
library(ggcorrplot)
library(scales)
library(knitr)
library(kableExtra)
library(car)
library(broom)
library(ggridges)
library(patchwork)

# Load data
pg <- read_csv("powergen_grid_data_clean.csv", show_col_types = FALSE)

# Convert types
pg <- pg |>
  mutate(
    week_ending      = as.Date(week_ending),
    site             = as.factor(site),
    batch            = factor(batch, levels = c("Batch 1 RMG", "Batch 2 RMG",
                                                 "Batch 3 RMG", "IMG")),
    season           = factor(season, levels = c("Dry", "Rainy")),
    performance_tier = factor(performance_tier, levels = c("Low", "Medium", "High"))
  )

glimpse(pg)

Rows: 907
Columns: 15
$ site                             <fct> Gbara, Gbara, Gbara, Gbara, Gbara, Gb…
$ week_ending                      <date> 2025-03-09, 2025-03-16, 2025-03-23, …
$ solar_production_kwh             <dbl> 2207, 2130, 2266, 2241, 1797, 1480, 1…
$ generator_production_kwh         <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 46…
$ specific_yield_kwp               <dbl> 3.742708, 3.612129, 3.842762, 3.80036…
$ gsa_average                      <dbl> 28.50000, 28.50000, 28.50000, 28.5442…
$ customer_metered_consumption_kwh <dbl> 1217.25, 1020.26, 945.46, 1071.86, 89…
$ losses_pct                       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ availability_pct                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ batch                            <fct> Batch 1 RMG, Batch 1 RMG, Batch 1 RMG…
$ tariff_ngn_per_kwh               <dbl> 540, 540, 540, 540, 540, 540, 540, 54…
$ revenue_ngn                      <dbl> 657315.0, 550940.4, 510548.4, 578804.…
$ season                           <fct> Dry, Dry, Dry, Dry, Rainy, Rainy, Rai…
$ performance_tier                 <fct> Medium, Medium, Medium, Medium, Mediu…
$ data_flag                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
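
The first Gbara rows printed above are consistent with `revenue_ngn` being derived as tariff multiplied by metered consumption. The check below uses only values visible in the `glimpse()` output; treating this as the derivation rule is an inference from those rows, not a documented property of the tracker.

```r
# Values copied from the first three Gbara rows of the glimpse() output
consumption_kwh <- c(1217.25, 1020.26, 945.46)
tariff_ngn      <- c(540, 540, 540)
revenue_ngn     <- c(657315.0, 550940.4, 510548.4)

# Recompute revenue and compare with the reported figures
recomputed <- consumption_kwh * tariff_ngn
all.equal(recomputed, revenue_ngn)
```

Running the same comparison over the full table would surface any rows where revenue and consumption disagree.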

Code

skim(pg)

Data summary
Name	pg
Number of rows	907
Number of columns	15
_______________________
Column type frequency:
character	1
Date	1
factor	4
numeric	9
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
data_flag	906	0	19	19	0	1	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
week_ending	0	1	2025-02-16	2026-05-03	2025-10-05	62

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
site	1	FALSE	16	Mag: 62, Duk: 61, Eba: 61, Gba: 61
batch	1	FALSE	4	Bat: 421, Bat: 363, Bat: 62, IMG: 61
season	1	FALSE	2	Rai: 512, Dry: 395
performance_tier	1	FALSE	3	Hig: 337, Med: 298, Low: 272

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
solar_production_kwh	0	1.00	1688.48	1556.09	0.00	858.50	1315.00	1840.00	11507.00	▇▁▁▁▁
generator_production_kwh	25	0.97	464.36	1544.01	0.00	0.00	0.65	126.58	11340.00	▇▁▁▁▁
specific_yield_kwp	1	1.00	2.33	0.92	0.00	1.91	2.48	2.91	4.67	▂▂▇▅▁
gsa_average	0	1.00	26.65	1.01	24.52	26.55	26.55	26.55	29.02	▂▁▇▁▂
customer_metered_consumption_kwh	47	0.95	2329.03	8338.19	0.00	703.93	1142.94	1640.50	227125.00	▇▁▁▁▁
losses_pct	250	0.72	-0.33	12.13	-310.56	0.02	0.03	0.13	1.00	▁▁▁▁▇
availability_pct	264	0.71	0.85	0.22	0.00	0.80	0.92	0.99	1.00	▁▁▁▂▇
tariff_ngn_per_kwh	0	1.00	541.47	37.03	450.00	540.00	540.00	540.00	650.00	▁▁▇▁▁
revenue_ngn	47	0.95	1182335.42	4400663.14	0.00	380123.55	617187.60	885870.00	122647500.00	▇▁▁▁▁

Code

missing_summary <- pg |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Count") |>
  mutate(
    Total       = nrow(pg),
    Missing_Pct = round(Missing_Count / Total * 100, 1)
  ) |>
  filter(Missing_Count > 0) |>
  arrange(desc(Missing_Pct))

kable(missing_summary, caption = "Missing Values by Variable") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Missing Values by Variable
Variable	Missing_Count	Total	Missing_Pct
data_flag	906	907	99.9
availability_pct	264	907	29.1
losses_pct	250	907	27.6
customer_metered_consumption_kwh	47	907	5.2
revenue_ngn	47	907	5.2
generator_production_kwh	25	907	2.8
specific_yield_kwp	1	907	0.1

Code

pg |>
  count(batch, site, name = "n_weeks") |>
  arrange(batch, desc(n_weeks)) |>
  kable(caption = "Observations per Site") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Observations per Site
batch	site	n_weeks
Batch 1 RMG	Maggi Igenchi	62
Batch 1 RMG	Gbara	61
Batch 1 RMG	Maggi Bukun	61
Batch 1 RMG	Nantu	61
Batch 1 RMG	Ndejiko	61
Batch 1 RMG	Rokota	60
Batch 1 RMG	Kpanbo	55
Batch 2 RMG	Dukugi	61
Batch 2 RMG	Ebangi	61
Batch 2 RMG	Gbade	61
Batch 2 RMG	Sachi Nku	61
Batch 2 RMG	Sosa	61
Batch 2 RMG	Danchitagi	58
Batch 3 RMG	Ofosu	34
Batch 3 RMG	Owode	28
IMG	Toto	61

5. Exploratory Data Analysis (EDA)

Theory: Exploratory Data Analysis (EDA) is the systematic examination of a dataset’s structure, distributions, and anomalies before formal modelling (Adi, 2026, Ch. 4). The objective is to understand what the data contains, where it is incomplete, and where it may violate modelling assumptions. Anscombe’s Quartet (1973) illustrates why summary statistics alone are insufficient: datasets with nearly identical means, variances, and correlations can exhibit fundamentally different patterns. EDA guards against this by combining numerical summaries with visual inspection.
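
The Anscombe point can be reproduced in a few lines, because the quartet ships with base R as the `anscombe` data frame. This is a minimal sketch for illustration and is not part of the PowerGen analysis; the object name `quartet_summary` is my own.

```r
# Anscombe's quartet ships with base R (datasets package)
data(anscombe)

x_cols <- paste0("x", 1:4)
y_cols <- paste0("y", 1:4)

# Near-identical summary statistics across the four x-y pairs
quartet_summary <- data.frame(
  pair   = 1:4,
  mean_y = sapply(anscombe[y_cols], mean),
  var_y  = sapply(anscombe[y_cols], var),
  cor_xy = mapply(function(x, y) cor(anscombe[[x]], anscombe[[y]]),
                  x_cols, y_cols)
)
print(round(quartet_summary, 3))
```

The four rows of the summary are almost indistinguishable, yet a scatter plot of each pair shows a line, a curve, an outlier-driven slope and a single high-leverage point. That is the case for plotting before modelling.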

Business Justification: As the GFC, I rely on the O&M trackers to build my site performance reports. Undetected outliers or data anomalies directly affect reporting.

Code

p1 <- ggplot(pg, aes(x = solar_production_kwh)) +
  geom_histogram(bins = 40, fill = "#2E86AB", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = comma) +
  labs(title = "Solar Production (kWh)", x = "kWh", y = "Count") +
  theme_minimal()

p2 <- ggplot(pg, aes(x = availability_pct)) +
  geom_histogram(bins = 30, fill = "#A23B72", colour = "white", alpha = 0.85) +
  labs(title = "Grid Availability (%)", x = "Availability", y = "Count") +
  theme_minimal()

p3 <- ggplot(pg, aes(x = batch, y = solar_production_kwh, fill = batch)) +
  geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 1.5) +
  scale_y_continuous(labels = comma) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Solar Production by Batch", x = "Batch", y = "kWh", fill = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

p4 <- ggplot(pg, aes(x = performance_tier, y = solar_production_kwh, fill = performance_tier)) +
  geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 1.5) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("Low" = "#E74C3C", "Medium" = "#F39C12", "High" = "#27AE60")) +
  labs(title = "Solar Production by Performance Tier", x = "Tier", y = "kWh", fill = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

(p1 + p2) / (p3 + p4)

Code

# DATA QUALITY ISSUE 1: Negative losses
neg_losses <- pg |> filter(!is.na(losses_pct), losses_pct < 0)
cat("DATA QUALITY ISSUE 1 — Negative technical losses:\n")

DATA QUALITY ISSUE 1 — Negative technical losses:

Code

cat(sprintf("  %d observation(s) with losses_pct < 0\n", nrow(neg_losses)))

  70 observation(s) with losses_pct < 0

Code

cat(sprintf("  Affected sites: %s\n", paste(unique(neg_losses$site), collapse = ", ")))

  Affected sites: Gbara, Kpanbo, Maggi Bukun, Maggi Igenchi, Nantu, Rokota, Danchitagi, Dukugi, Ebangi, Gbade, Sosa, Ofosu

Code

cat(sprintf("  Min value: %.2f%%\n\n", min(neg_losses$losses_pct)))

  Min value: -310.56%

Code

# DATA QUALITY ISSUE 2: Extreme consumption outlier
outlier <- pg |>
  filter(!is.na(customer_metered_consumption_kwh)) |>
  filter(customer_metered_consumption_kwh > 50000)
cat("DATA QUALITY ISSUE 2 — Extreme consumption outlier:\n")

DATA QUALITY ISSUE 2 — Extreme consumption outlier:

Code

print(outlier |> select(site, week_ending, solar_production_kwh,
                         customer_metered_consumption_kwh, revenue_ngn))

# A tibble: 1 × 5
  site       week_ending solar_production_kwh customer_metered_con…¹ revenue_ngn
  <fct>      <date>                     <dbl>                  <dbl>       <dbl>
1 Maggi Buk… 2026-02-08                   729                 227125   122647500
# ℹ abbreviated name: ¹customer_metered_consumption_kwh

Code

cat("\n  Note: Maggi Bukun's 227,125 kWh reading on 2026-02-08 is ~300x its typical\n")


  Note: Maggi Bukun's 227,125 kWh reading on 2026-02-08 is ~300x its typical

Code

cat("  weekly consumption and is almost certainly a data entry error.\n")

  weekly consumption and is almost certainly a data entry error.

Code

cat("  This row is excluded from revenue-based analyses.\n")

  This row is excluded from revenue-based analyses.

Code

# Create clean dataset
pg_clean <- pg |>
  filter(!(site == "Maggi Bukun" &
           !is.na(customer_metered_consumption_kwh) &
           customer_metered_consumption_kwh > 50000))

cat(sprintf("\nAnalysis dataset: %d observations (1 outlier row removed)\n", nrow(pg_clean)))


Analysis dataset: 906 observations (1 outlier row removed)

Data Quality Issues Identified

Issue	Site	Detail	Action Taken
Negative technical losses	12 sites, including Maggi Bukun (-6.59%)	70 observations have losses_pct below zero, with a minimum recorded value of -310.56. Technical losses relate to the energy lost between the point of generation and the point of delivery to customers, so a negative loss is physically implausible.	Flagged in analysis; recommend source system correction
Extreme consumption outlier	Maggi Bukun	Week of 8 Feb 2026 contains a customer consumption of 227,125 kWh for the site. This was most likely a cumulative figure entered erroneously.	Row excluded from all revenue-based analyses
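
Both checks can be turned into reusable flags rather than one-off filters. The sketch below runs on made-up numbers, not the tracker, and is written in base R so it does not depend on the packages loaded earlier. The 10x-median threshold is a hypothetical rule chosen for illustration.

```r
# Toy data (made up) with one consumption spike and two negative losses
toy <- data.frame(
  site            = rep(c("A", "B"), each = 4),
  consumption_kwh = c(900, 1000, 1100, 250000, 700, 720, 680, 710),
  losses_pct      = c(0.03, 0.02, -0.07, 0.04, 0.05, 0.03, 0.02, -3.10)
)

# Per-site median consumption, repeated on every row of that site
site_median <- ave(toy$consumption_kwh, toy$site, FUN = median)

# Hypothetical rule: flag readings above 10x the site's own median
toy$ratio_to_median <- toy$consumption_kwh / site_median
toy$flag_spike      <- toy$ratio_to_median > 10

# Negative losses are not physically plausible, so flag them as well
toy$flag_neg_loss <- toy$losses_pct < 0

toy[toy$flag_spike | toy$flag_neg_loss, ]
```

Comparing each reading with the site's own median, rather than a fixed cut-off such as 50,000 kWh, keeps the rule meaningful for a large site like Toto as well as for the smaller sites.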

Plain-Language Interpretation for Management: The distribution of solar production is right-skewed, meaning a small number of sites — primarily Toto — generate significantly more energy than all other sites. Toto is the first Interconnected Mini-Grid in Nigeria and is a peri-urban community about 1 hour from Abuja. The community comprises individual homes, small-scale businesses and public utility centres.

Most sites register consumption levels of 500 to 2,000 kWh per week on average. Another observation is that grid availability is high across most sites (greater than 90%), but drops noticeably for Toto and Owode. Given the financial significance of Toto and Owode, the reduced grid availability would have an impact on the performance of those sites and the overall business.

In addition, two data recording errors were identified. First, the tracker records negative technical losses in 70 site-weeks across 12 sites, Maggi Bukun among them. A negative loss is physically implausible and points to a meter reading or data entry problem. Second, one week in February 2026 at Maggi Bukun shows customer consumption roughly 300 times higher than normal, almost certainly a data entry error. Both should be corrected in the source system. The affected Maggi Bukun week has been removed from the revenue analysis to avoid distorting results.

6. Data Visualization

Theory: The Grammar of Graphics (Wilkinson, 2005, cited in Adi, 2026, Ch. 5) holds that every effective chart is built from a systematic mapping of data variables to visual properties — position, colour, size, shape. Chart selection should be governed by the data type and the story being told, not aesthetic preference.

A time series calls for a line chart; distributions call for histograms or ridge plots; categorical comparisons call for bar charts or boxplots; relationships between continuous variables call for scatter plots.

Business Justification: PowerGen’s management and investors require visual performance summaries that communicate site-level and portfolio-level trends. I work closely with the Financial Planning and Analysis Manager to prepare and present these visuals.

Code

pg |>
  group_by(week_ending, batch) |>
  summarise(total_solar = sum(solar_production_kwh, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = week_ending, y = total_solar, colour = batch)) +
  geom_line(linewidth = 0.8, alpha = 0.9) +
  geom_smooth(se = FALSE, linewidth = 0.4, linetype = "dashed", alpha = 0.5) +
  scale_y_continuous(labels = comma) +
  scale_colour_brewer(palette = "Set1") +
  labs(
    title    = "Weekly Solar Production by Batch — Feb 2025 to May 2026",
    subtitle = "Dashed lines show smoothed trend",
    x = NULL, y = "Total kWh", colour = "Batch"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Weekly solar production over time by batch

Code

pg |>
  filter(!is.na(availability_pct)) |>
  mutate(month = floor_date(week_ending, "month")) |>
  group_by(site, month) |>
  summarise(avg_avail = mean(availability_pct, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = month, y = fct_reorder(site, avg_avail), fill = avg_avail)) +
  geom_tile(colour = "white", linewidth = 0.3) +
  scale_fill_gradient2(low = "#E74C3C", mid = "#F39C12", high = "#27AE60",
                       midpoint = 0.75, labels = percent) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  labs(
    title    = "Monthly Average Grid Availability by Site",
    subtitle = "Green = high availability, Red = low availability",
    x = NULL, y = NULL, fill = "Availability"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code

pg_clean |>
  filter(!is.na(revenue_ngn)) |>
  group_by(site, season, batch) |>
  summarise(avg_revenue = mean(revenue_ngn, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = fct_reorder(site, avg_revenue), y = avg_revenue, fill = season)) +
  geom_col(position = "dodge", alpha = 0.85) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("Dry" = "#F4A261", "Rainy" = "#457B9D")) +
  coord_flip() +
  labs(
    title = "Average Weekly Revenue by Site and Season (₦)",
    x = NULL, y = "Average Weekly Revenue (₦)", fill = "Season"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Average weekly revenue by site and season

Code

pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(solar_production_kwh)) |>
  ggplot(aes(x = solar_production_kwh, y = revenue_ngn,
             colour = performance_tier, shape = season)) +
  geom_point(alpha = 0.5, size = 1.8) +
  geom_smooth(method = "lm", se = TRUE, colour = "black", linewidth = 0.8) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  scale_colour_manual(values = c("Low" = "#E74C3C", "Medium" = "#F39C12", "High" = "#27AE60")) +
  labs(
    title    = "Solar Production vs Revenue",
    subtitle = "Black line = OLS fit; colour = performance tier",
    x = "Solar Production (kWh)", y = "Revenue (₦)",
    colour = "Tier", shape = "Season"
  ) +
  theme_minimal()

Code

pg |>
  filter(!is.na(solar_production_kwh)) |>
  ggplot(aes(x = solar_production_kwh, y = season, fill = season)) +
  geom_density_ridges(alpha = 0.7, scale = 1.2) +
  scale_x_continuous(labels = comma) +
  scale_fill_manual(values = c("Dry" = "#F4A261", "Rainy" = "#457B9D")) +
  labs(
    title = "Solar Production Distribution: Dry vs Rainy Season",
    x = "Solar Production (kWh)", y = NULL, fill = "Season"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Plain-Language Interpretation for Management: Five charts tell one story. Chart 1 tracks total weekly output by batch. On a per-site basis, Batch 2 RMG produces more than Batch 1 RMG (about 612 kWh per site per week more, per the Tukey comparison in Section 7), and Batch 1 production is the more volatile of the two. Batch 2 RMG sites were developed after Batch 1 and incorporated some of the technical improvements recommended after the Batch 1 projects.

Chart 2 reveals that grid availability is close to 100% for most Batch 1 and 2 sites but significantly lower for Toto and Owode. Toto and Owode have reported inconsistent grid availability due to technical issues with the diesel generator and the Battery Energy Storage System.

Chart 3 confirms dry-season revenue consistently exceeds rainy-season revenue across all sites. Chart 4 demonstrates a strong linear relationship between solar production and revenue. Chart 5 confirms the seasonal effect — dry-season production is shifted higher, with less variation than the rainy season.

7. Hypothesis Testing

Theory: Hypothesis testing is the formal framework for deciding whether observed differences in data are statistically real or attributable to chance (Adi, 2026, Ch. 6). A null hypothesis (H₀) posits no effect; an alternative hypothesis (H₁) posits a difference, which may be directional (one-tailed) or not. The p-value is the probability of observing data at least as extreme as the sample if H₀ were true. Effect sizes (rank-biserial r for the Wilcoxon test and η² for ANOVA) are reported alongside p-values, as statistical significance alone does not convey practical importance. Two tests are used: the Wilcoxon Rank-Sum test (a non-parametric alternative to the t-test, used when distributions are skewed) and one-way ANOVA with Tukey HSD post-hoc comparisons.

Business Justification: PowerGen management regularly compares batch performance in operational reviews. Without formal testing, conclusions risk being driven by selective attention to favourable weeks. The two hypotheses below directly inform resource allocation and tariff review decisions that I support as GFC.

Hypothesis 1 — Seasonal Effect on Solar Production

H₀: Mean weekly solar production is equal in dry and rainy seasons
H₁: Mean weekly solar production is higher in the dry season (one-tailed)

Code

dry_solar   <- pg$solar_production_kwh[pg$season == "Dry"   & !is.na(pg$solar_production_kwh)]
rainy_solar <- pg$solar_production_kwh[pg$season == "Rainy" & !is.na(pg$solar_production_kwh)]

sw_dry   <- shapiro.test(sample(dry_solar,   min(500, length(dry_solar))))
sw_rainy <- shapiro.test(sample(rainy_solar, min(500, length(rainy_solar))))

cat("Normality check (Shapiro-Wilk, sampled):\n")

Normality check (Shapiro-Wilk, sampled):

Code

cat(sprintf("  Dry season:   W = %.4f, p = %.4f\n", sw_dry$statistic,   sw_dry$p.value))

  Dry season:   W = 0.7144, p = 0.0000

Code

cat(sprintf("  Rainy season: W = %.4f, p = %.4f\n", sw_rainy$statistic, sw_rainy$p.value))

  Rainy season: W = 0.6963, p = 0.0000

Code

wilcox_result <- wilcox.test(dry_solar, rainy_solar, alternative = "greater")
cat("\nWilcoxon Rank-Sum Test (one-sided: dry > rainy):\n")


Wilcoxon Rank-Sum Test (one-sided: dry > rainy):

Code

print(wilcox_result)


    Wilcoxon rank sum test with continuity correction

data:  dry_solar and rainy_solar
W = 114335, p-value = 0.0003651
alternative hypothesis: true location shift is greater than 0

Code

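n1 <- length(dry_solar); n2 <- length(rainy_solar)
# W counts (dry, rainy) pairs where dry exceeds rainy (ties count one half),
# so 2W/(n1*n2) - 1 is positive when dry-season weeks tend to be higher
r_effect <- (2 * wilcox_result$statistic) / (n1 * n2) - 1
cat(sprintf("\nEffect size (rank-biserial r): %.3f\n", r_effect))


Effect size (rank-biserial r): 0.131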
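
The sign of a rank-biserial r depends on the convention used. A small toy example (made-up values, no ties) shows how the W statistic returned by `wilcox.test()` relates to the number of pairs in which the first sample exceeds the second, and how r follows when positive is taken to mean "first sample tends to be larger". The variable names here are illustrative.

```r
# Toy samples (made up, no ties)
x <- c(6, 8, 10, 12)
y <- c(1, 3, 7)

# W reported by wilcox.test() counts the (x, y) pairs with x > y
w <- unname(wilcox.test(x, y)$statistic)
pairs_x_greater <- sum(outer(x, y, ">"))

# Rank-biserial r on a -1 to +1 scale; positive means x tends to exceed y
r_rb <- 2 * w / (length(x) * length(y)) - 1

c(W = w, pairs_x_greater = pairs_x_greater, r = round(r_rb, 3))
```

Applied to the reported W = 114335 with 395 dry-season and 512 rainy-season observations, this convention gives 2 × 114335 / (395 × 512) − 1 ≈ 0.131, a small effect in the hypothesised direction.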

Code

cat(sprintf("Dry season mean:   %.1f kWh\n", mean(dry_solar)))

Dry season mean:   1888.9 kWh

Code

cat(sprintf("Rainy season mean: %.1f kWh\n", mean(rainy_solar)))

Rainy season mean: 1533.8 kWh

Code

cat(sprintf("Difference:        %.1f kWh (%.1f%%)\n",
            mean(dry_solar) - mean(rainy_solar),
            (mean(dry_solar) - mean(rainy_solar)) / mean(rainy_solar) * 100))

Difference:        355.1 kWh (23.2%)
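
The percentage attached to the seasonal gap depends on which season is used as the base. A short worked calculation from the two reported means makes the distinction explicit:

```r
# Means reported in the output above
dry_mean   <- 1888.9
rainy_mean <- 1533.8
gap        <- dry_mean - rainy_mean

pct_above_rainy <- gap / rainy_mean * 100  # dry relative to rainy
pct_below_dry   <- gap / dry_mean   * 100  # rainy relative to dry

round(c(gap = gap,
        pct_above_rainy = pct_above_rainy,
        pct_below_dry   = pct_below_dry), 1)
```

Dry-season output is about 23.2% above the rainy-season mean, while rainy-season output is about 18.8% below the dry-season mean. Both describe the same 355.1 kWh gap.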

Hypothesis 2 — Performance Differences Across Batches

H₀: Mean weekly solar production is equal across all four batches
H₁: At least one batch differs significantly from the others

Code

anova_model <- aov(solar_production_kwh ~ batch,
                   data = pg |> filter(!is.na(solar_production_kwh)))
summary(anova_model)

             Df    Sum Sq   Mean Sq F value Pr(>F)    
batch         3 9.066e+08 302186345     212 <2e-16 ***
Residuals   903 1.287e+09   1425526                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

levene_result <- leveneTest(solar_production_kwh ~ batch,
                            data = pg |> filter(!is.na(solar_production_kwh)))
cat("\nLevene's Test for Homogeneity of Variance:\n")


Levene's Test for Homogeneity of Variance:

Code

print(levene_result)

Levene's Test for Homogeneity of Variance (center = median)
       Df F value    Pr(>F)    
group   3  123.21 < 2.2e-16 ***
      903                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
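
Because Levene's test rejects equal variances, a robustness check that does not rely on that assumption is worth considering. The sketch below shows two options available in base R, Welch's ANOVA via `oneway.test()` and the rank-based Kruskal-Wallis test via `kruskal.test()`. It runs on synthetic data with invented group means and spreads, so its output says nothing about the PowerGen sites.

```r
set.seed(123)  # synthetic data only; not the PowerGen tracker

# Four groups with unequal means AND unequal spreads (values are made up)
sim <- data.frame(
  batch = factor(rep(c("B1", "B2", "B3", "IMG"), each = 60)),
  kwh   = c(rnorm(60, 1100, 400), rnorm(60, 1700, 700),
            rnorm(60, 1650, 600), rnorm(60, 5200, 2500))
)

# Welch's ANOVA: does not assume equal variances across groups
welch <- oneway.test(kwh ~ batch, data = sim, var.equal = FALSE)

# Kruskal-Wallis: rank-based, no normality assumption
kw <- kruskal.test(kwh ~ batch, data = sim)

print(welch)
print(kw)
```

On the real data the same two calls would take `solar_production_kwh ~ batch` with `data = pg`. If both agree with the classical ANOVA, the conclusion does not hinge on the equal-variance assumption.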

Code

tukey_result <- TukeyHSD(anova_model)
cat("\nTukey HSD Post-hoc Test:\n")


Tukey HSD Post-hoc Test:

Code

print(tukey_result)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = solar_production_kwh ~ batch, data = filter(pg, !is.na(solar_production_kwh)))

$batch
                              diff       lwr       upr     p adj
Batch 2 RMG-Batch 1 RMG  611.65545  391.5506  831.7604 0.0000000
Batch 3 RMG-Batch 1 RMG  520.76940  102.7440  938.7948 0.0075864
IMG-Batch 1 RMG         4120.88702 3699.8856 4541.8885 0.0000000
Batch 3 RMG-Batch 2 RMG  -90.88606 -513.1766  331.4045 0.9454810
IMG-Batch 2 RMG         3509.23156 3083.9949 3934.4683 0.0000000
IMG-Batch 3 RMG         3600.11762 3045.9287 4154.3065 0.0000000

Code

ss_total   <- sum((pg$solar_production_kwh[!is.na(pg$solar_production_kwh)] -
                   mean(pg$solar_production_kwh, na.rm = TRUE))^2)
ss_between <- sum(summary(anova_model)[[1]][["Sum Sq"]][1])
eta_sq <- ss_between / ss_total
cat(sprintf("\nEffect size (eta-squared): %.3f\n", eta_sq))


Effect size (eta-squared): 0.413

Code

cat("Interpretation: >0.14 = large effect (Cohen, 1988)\n")

Interpretation: >0.14 = large effect (Cohen, 1988)

Plain-Language Interpretation for Management:

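Hypothesis 1: Dry season solar production is significantly higher than rainy season (p < 0.001). The gap is about 355 kWh per site per week: dry-season output is roughly 23% above the rainy-season mean, or equivalently rainy-season output is roughly 19% below the dry-season mean. The difference is statistically confirmed rather than just visually apparent, although the effect size is small (rank-biserial r of about 0.13 in magnitude).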

Practical implication: In preparing revenue forecasts, ensure that seasonality effects are considered. A revenue adjustment of 15–20% between dry and rainy quarters might be appropriate.

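Hypothesis 2: Batch-level differences in solar production are statistically significant and large (p < 0.001, η² = 0.41). Toto (IMG) is the main driver: it produces about 3,500 to 4,100 kWh per week more than each RMG batch. This is because Toto is an Interconnected Mini-grid and not a regular Isolated Rural Mini-grid. Among the RMG batches, Batch 2 and Batch 3 both out-produce Batch 1, while Batch 2 and Batch 3 do not differ significantly from each other (adjusted p = 0.95). Levene's test rejects equal variances across batches, so the ANOVA p-value should be read with some caution.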

Practical implication: Performance benchmarking must be done within batch cohorts to ensure that the comparison is like-for-like. For instance, comparing a Batch 1 RMG site to Toto is misleading and unfair — they are structurally different installations.

8. Correlation Analysis

Theory: Correlation measures the strength and direction of linear association between two continuous variables (Adi, 2026, Ch. 8). Pearson’s r is appropriate for normally distributed variables; Spearman’s ρ (rho) is preferred when distributions are skewed or ordinal. Values range from -1 (perfect negative) to +1 (perfect positive); 0 indicates no linear relationship. A correlation matrix with heatmap provides a portfolio-level view of inter-variable relationships. Crucially, correlation does not imply causation — a high correlation between two variables may reflect a common underlying driver rather than a direct causal link.
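
The difference between the two coefficients is easiest to see on a relationship that is perfectly monotone but not linear. The values below are illustrative and not drawn from the tracker.

```r
# Illustrative values only (not tracker data)
x <- 1:10
y <- exp(x / 2)  # monotone but strongly non-linear in x

pearson_r    <- cor(x, y, method = "pearson")
spearman_rho <- cor(x, y, method = "spearman")

c(pearson = pearson_r, spearman = spearman_rho)
```

Spearman's ρ equals 1 because the rank order is preserved exactly, while Pearson's r falls short of 1 because the points do not lie on a straight line. With right-skewed variables such as revenue and consumption, this is the reason for preferring Spearman.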

Business Justification: Understanding which operational metrics impact revenue helps PowerGen’s management to shape strategy and prioritize interventions. For instance, if grid availability has a stronger correlation with revenue than specific yield, maintenance budgets would typically be channeled towards metering and core uptime rather than PV panel efficiency. This is a resource allocation question I face as GFC when preparing the annual capital expenditure budget.

Code

corr_vars <- pg_clean |>
  filter(!is.na(revenue_ngn)) |>
  select(solar_production_kwh, generator_production_kwh, specific_yield_kwp,
         customer_metered_consumption_kwh, losses_pct, availability_pct, revenue_ngn) |>
  drop_na()

corr_matrix <- cor(corr_vars, method = "spearman")

ggcorrplot(corr_matrix,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 3.5,
           colors   = c("#E74C3C", "white", "#27AE60"),
           title    = "Spearman Correlation Matrix — PowerGen NGBU Metrics",
           ggtheme  = theme_minimal())

Spearman correlation matrix of operational and financial variables

Code

corr_with_revenue <- corr_matrix[, "revenue_ngn"]
corr_df <- data.frame(
  Variable   = names(corr_with_revenue),
  Spearman_r = round(corr_with_revenue, 3),
  row.names  = NULL  # drop row names so the variable is not printed twice
) |>
  filter(Variable != "revenue_ngn") |>
  arrange(desc(abs(Spearman_r)))

kable(corr_df, caption = "Spearman Correlation with Revenue (₦)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Spearman Correlation with Revenue (₦)
Variable	Spearman_r
customer_metered_consumption_kwh	1.000
solar_production_kwh	0.923
losses_pct	-0.412
generator_production_kwh	0.372
specific_yield_kwp	0.267
availability_pct	-0.252

Code

c1 <- pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(customer_metered_consumption_kwh)) |>
  ggplot(aes(x = customer_metered_consumption_kwh, y = revenue_ngn)) +
  geom_point(alpha = 0.4, colour = "#2E86AB", size = 1.5) +
  geom_smooth(method = "lm", colour = "black") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(title = "Consumption vs Revenue", x = "Consumption (kWh)", y = "Revenue (₦)") +
  theme_minimal()

c2 <- pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(availability_pct)) |>
  ggplot(aes(x = availability_pct, y = revenue_ngn)) +
  geom_point(alpha = 0.4, colour = "#A23B72", size = 1.5) +
  geom_smooth(method = "lm", colour = "black") +
  scale_y_continuous(labels = comma) +
  labs(title = "Availability vs Revenue", x = "Grid Availability", y = "Revenue (₦)") +
  theme_minimal()

c1 + c2

Plain-Language Interpretation for Management: Customer consumption is almost perfectly rank-correlated with revenue (ρ = 1.000 to three decimal places), which is expected because revenue is a function of tariff and consumption. Solar production also correlates strongly with revenue (ρ = 0.923). Grid availability, by contrast, shows a weak negative pairwise correlation with revenue (ρ = -0.252), so the matrix on its own does not show that sites with higher availability earn more. The regression in Section 9 finds a positive availability effect once solar production, season, and batch are held constant, which suggests the pairwise figure is confounded by differences between sites. Availability remains the more controllable variable in the short term: you cannot quickly add solar panels, but you can improve maintenance response times.

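Technical losses show a moderate negative correlation with revenue (ρ = -0.412): higher losses reduce billable energy, and in this matrix the association is stronger in absolute terms than that of availability. Generator production (ρ = 0.372) and specific yield (ρ = 0.267) are weakly positively correlated with revenue. The GSA benchmark is not one of the seven variables in this matrix, so agreement between specific yield and the benchmark, which would indicate that actual site performance tracks theoretical solar potential and support the integrity of the generation data, cannot be read from this output.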
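A small simulation can show how a variable may have a negative pairwise correlation with revenue and still carry a positive coefficient once production is controlled for. The sketch below uses entirely synthetic data, not the PowerGen dataset; the names `solar`, `avail`, and `log_rev` and every numeric setting are assumptions chosen for illustration only.

```r
# Synthetic illustration only: none of these numbers come from the PowerGen data.
set.seed(42)
n <- 600

# Assumption: larger sites (more solar output) tend to have lower availability.
solar <- runif(n, min = 500, max = 5000)               # weekly kWh
avail <- 0.95 - 0.00005 * solar + rnorm(n, sd = 0.03)  # availability as a 0-1 fraction

# Assumption: log revenue rises with both production and availability.
log_rev <- 9 + 0.0008 * solar + 3.5 * avail + rnorm(n, sd = 0.3)

# Pairwise (marginal) Spearman correlation: dominated by the site-size effect.
rho_pairwise <- cor(avail, log_rev, method = "spearman")

# Regression coefficient on availability, holding solar production constant.
fit <- lm(log_rev ~ solar + avail)
beta_avail <- unname(coef(fit)["avail"])

cat("Pairwise Spearman rho:", round(rho_pairwise, 3), "\n")
cat("Availability coefficient (controlled):", round(beta_avail, 3), "\n")
```

In this synthetic setup the pairwise correlation is negative while the controlled coefficient is positive, the same pattern as the -0.252 correlation and the positive availability coefficient reported in this study. Whether the same mechanism operates in the PowerGen data is not established by this sketch.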

9. Linear Regression

Theory: Ordinary Least Squares (OLS) regression estimates the linear relationship between a dependent variable and one or more predictors by minimizing the sum of squared residuals (Adi, 2026, Ch. 9). Each coefficient represents the expected change in the outcome for a one-unit increase in the predictor, holding all others constant. Diagnostic plots assess four key assumptions: linearity (Residuals vs Fitted), normality of residuals (Q-Q Plot), homoscedasticity (Scale-Location), and influential observations (Cook’s Distance). A log transformation of the outcome variable is used here to address right skew in revenue data.

Business Justification: PowerGen’s annual budgeting process requires site-level revenue forecasts. Currently these rely on manual assumptions. This regression model provides a data-driven basis for those forecasts, quantifying the marginal revenue contribution of solar production, grid availability, season, and batch — enabling more accurate and defensible projections for management and investors.

Code

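# Keep only rows with complete values for the outcome and the numeric predictors
reg_data <- pg_clean |>
  filter(!is.na(revenue_ngn),
         !is.na(solar_production_kwh),
         !is.na(availability_pct),
         !is.na(specific_yield_kwp)) |>
  # log(x + 1) is defined at zero, so weeks with zero revenue stay in the sample
  mutate(log_revenue = log(revenue_ngn + 1))

# OLS on log revenue; season and batch enter as categorical predictors
model <- lm(log_revenue ~ solar_production_kwh + availability_pct +
              specific_yield_kwp + season + batch,
            data = reg_data)

summary(model)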


Call:
lm(formula = log_revenue ~ solar_production_kwh + availability_pct + 
    specific_yield_kwp + season + batch, data = reg_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9648 -0.2785 -0.0620  0.2953  6.3989 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           9.315e+00  2.603e-01  35.787  < 2e-16 ***
solar_production_kwh  7.925e-04  4.193e-05  18.899  < 2e-16 ***
availability_pct      3.501e+00  2.708e-01  12.929  < 2e-16 ***
specific_yield_kwp   -1.153e-01  6.510e-02  -1.771 0.077032 .  
seasonRainy          -3.268e-02  7.643e-02  -0.428 0.669149    
batchBatch 2 RMG      5.328e-02  8.126e-02   0.656 0.512296    
batchBatch 3 RMG     -5.350e+00  3.000e-01 -17.833  < 2e-16 ***
batchIMG              9.305e-01  2.605e-01   3.572 0.000382 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.904 on 605 degrees of freedom
Multiple R-squared:  0.7336,    Adjusted R-squared:  0.7305 
F-statistic:   238 on 7 and 605 DF,  p-value: < 2.2e-16

Code

tidy(model, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~ round(., 4))) |>
  kable(caption = "Regression Coefficients (Dependent variable: log Revenue)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Regression Coefficients (Dependent variable: log Revenue)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	9.3149	0.2603	35.7871	0.0000	8.8038	9.8261
solar_production_kwh	0.0008	0.0000	18.8986	0.0000	0.0007	0.0009
availability_pct	3.5006	0.2708	12.9289	0.0000	2.9688	4.0323
specific_yield_kwp	-0.1153	0.0651	-1.7712	0.0770	-0.2432	0.0125
seasonRainy	-0.0327	0.0764	-0.4275	0.6691	-0.1828	0.1174
batchBatch 2 RMG	0.0533	0.0813	0.6557	0.5123	-0.1063	0.2129
batchBatch 3 RMG	-5.3501	0.3000	-17.8330	0.0000	-5.9393	-4.7609
batchIMG	0.9305	0.2605	3.5722	0.0004	0.4189	1.4420
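Because the dependent variable is log revenue, each coefficient is easier to read once converted to an approximate percentage change. The sketch below copies the estimates from the table above and applies 100 × (exp(β × Δ) − 1). The helper `pct_effect` is a hypothetical name, and the conversion ignores the +1 inside log(revenue + 1), which is a close approximation only when revenue is large. The reference batch is taken to be the omitted level, Batch 1 RMG going by the batch names used in this report.

```r
# Coefficients copied from the regression table (dependent variable: log revenue).
coefs <- c(
  solar_production_kwh = 0.0007925,
  seasonRainy          = -0.0327,
  `batchBatch 2 RMG`   = 0.0533,
  `batchBatch 3 RMG`   = -5.3501,
  batchIMG             = 0.9305
)

# Hypothetical helper: percentage change in revenue implied by a change of
# `delta` units in a predictor, under the log-linear model.
pct_effect <- function(beta, delta = 1) 100 * (exp(beta * delta) - 1)

effects <- c(
  "Solar production, +100 kWh/week" = pct_effect(coefs[["solar_production_kwh"]], 100),
  "Rainy vs dry season"             = pct_effect(coefs[["seasonRainy"]]),
  "Batch 2 RMG vs reference batch"  = pct_effect(coefs[["batchBatch 2 RMG"]]),
  "Batch 3 RMG vs reference batch"  = pct_effect(coefs[["batchBatch 3 RMG"]]),
  "IMG vs reference batch"          = pct_effect(coefs[["batchIMG"]])
)

print(round(effects, 1))
```

Read this way, the IMG coefficient corresponds to revenue roughly two and a half times the reference batch at equal production and availability, while the Batch 3 RMG coefficient implies a reduction of more than 99%. A gap that large may reflect weeks with zero recorded revenue rather than a true performance difference; that reading is a conjecture and would need checking against the raw data.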

Code

glance(model) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) |>
  mutate(across(where(is.numeric), ~ round(., 4))) |>
  kable(caption = "Model Fit Statistics") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Fit Statistics
r.squared	adj.r.squared	sigma	statistic	p.value	df	nobs
0.7336	0.7305	0.904	238.0232	0	7	613

Code

par(mfrow = c(2, 2))
plot(model)

Code

par(mfrow = c(1, 1))
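The four diagnostic plots are read by eye, so it can help to pair them with numeric checks. The sketch below is one possible set of companions, run on synthetic data with deliberately non-constant error variance rather than on the report's `model` object. `demo_model` and the simulated `x` and `y` are assumptions for illustration; the heteroscedasticity check is a Breusch-Pagan-style auxiliary regression written out in base R, not the output of any particular package.

```r
# Synthetic data with deliberately non-constant error variance (illustration only).
set.seed(1)
n <- 300
x <- runif(n, min = 1, max = 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.2 * x)
demo_model <- lm(y ~ x)

# Normality of residuals: Shapiro-Wilk test (companion to the Q-Q plot).
shapiro_p <- shapiro.test(residuals(demo_model))$p.value

# Homoscedasticity: regress squared residuals on fitted values and compare
# n * R^2 with a chi-squared distribution on 1 degree of freedom.
aux <- lm(residuals(demo_model)^2 ~ fitted(demo_model))
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Influential observations: Cook's distance above the common 4/n rule of thumb.
cooks <- cooks.distance(demo_model)
n_influential <- sum(cooks > 4 / n)

cat("Shapiro-Wilk p-value:", signif(shapiro_p, 3), "\n")
cat("Breusch-Pagan-style p-value:", signif(bp_p, 3), "\n")
cat("Observations with Cook's D > 4/n:", n_influential, "\n")
```

To apply the same checks to the regression in this section, `model` would replace `demo_model`. `shapiro.test()` accepts between 3 and 5,000 observations, which covers the 613 used here.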

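Plain-Language Interpretation for Management: The model explains approximately 73% of the variation in the log of site-level weekly revenue (R² = 0.734, adjusted R² = 0.731), which is strong enough to inform operational planning and annual budgeting. The most important finding is that grid availability has a positive, statistically significant relationship with revenue (coefficient 3.50, p < 0.001) even after controlling for how much energy the panels are generating. This means higher availability is associated with higher revenue independently of weather or panel capacity. Season, by contrast, has no detectable effect of its own once solar production is in the model (p = 0.67), and specific yield is only marginally significant (p = 0.077).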

In practical budgeting terms: a site generating 1,500 kWh/week at 90% availability can be expected to earn materially more than the same site at 70% availability, even in the same weather week. The regression quantifies that gap, giving operations management a financial case for prioritizing uptime maintenance over other expenditure categories.
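The 70% versus 90% comparison above can be worked through directly from the availability coefficient and its confidence interval. This is a sketch under one loud assumption: that `availability_pct` is stored as a 0-1 fraction, so 70% is 0.70. If it were stored on a 0-100 scale, a coefficient of 3.5 per point would imply implausibly large effects, so the fraction reading seems the more likely one, but it is not confirmed by the output shown here.

```r
# Availability coefficient and 95% confidence bounds from the coefficient table.
beta_avail <- c(estimate = 3.5006, conf.low = 2.9688, conf.high = 4.0323)

# Assumption: availability_pct is a 0-1 fraction, so 70% = 0.70 and 90% = 0.90.
avail_low  <- 0.70
avail_high <- 0.90

# All other predictors are held fixed, so their terms cancel in the ratio
# of predicted revenue at the two availability levels.
revenue_ratio <- exp(beta_avail * (avail_high - avail_low))

# Implied uplift for one percentage point (0.01) of recovered availability.
uplift_per_point_pct <- 100 * (exp(beta_avail * 0.01) - 1)

print(round(revenue_ratio, 2))
print(round(uplift_per_point_pct, 2))
```

Under that assumption the model implies roughly double the weekly revenue at 90% availability compared with 70%, and an uplift of about 3.6% per percentage point of availability. These are model-implied associations on the log scale, not guaranteed returns from a maintenance intervention.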

10. Integrated Findings

The five analyses collectively support a single, coherent conclusion:

PowerGen’s Nigerian portfolio revenue is primarily determined by three factors: solar production capacity (structural), grid availability (operationally manageable), and season (external, acting mainly through solar production rather than as a separate effect in the regression). Of these, grid availability is the most actionable lever for near-term revenue improvement.

The evidence chain: EDA established that the portfolio is heterogeneous and identified two data quality issues requiring system remediation. Visualization confirmed that availability varies significantly across sites and months, and that dry-season production is systematically higher. Hypothesis testing formalized the seasonal effect (p < 0.001, ~20% uplift in dry season) and confirmed batch-level differences are statistically large (η² > 0.14). Correlation analysis showed that consumption and solar production are the variables most strongly associated with revenue, while the pairwise correlation between availability and revenue is weak and negative (ρ = -0.252). Regression quantified the marginal revenue value of each variable and confirmed that, once solar production, season, and batch are held constant, the availability coefficient is positive, significant, and economically meaningful.

Single Recommendation: PowerGen’s operations team should prioritize grid availability improvements at Batch 1 RMG sites (Gbara, Kpanbo, Maggi Bukun, Maggi Igenchi, Nantu, Ndejiko, Rokota), where the heatmap reveals the greatest within-batch availability variance and where the revenue uplift per percentage point of recovered availability is calculable from the regression coefficients. A targeted maintenance programme addressing root causes of availability drops — whether grid infrastructure, inverter faults, or utility curtailment — is the intervention most directly supported by this analysis.

11. Limitations & Further Work

1. Data completeness: Ofosu and Owode (Batch 3 RMG) have only 28–34 weeks of data compared to 55–62 weeks for other sites. Their performance tier assignments are less statistically stable and should be revisited once a full year of data is available. Both sites began recording revenue in Q1 2025.

2. Toto heteroscedasticity: Toto’s production scale (5,000+ kWh/week vs 500–2,000 kWh for other sites) is likely to introduce non-constant residual variance in the regression, which makes the reported standard errors less reliable. A multi-level model with site-level random effects would better account for this structural heterogeneity.

3. Tariff as covariate: Tariff is batch-fixed within the 2025–2026 window. If historical data from 2023–2024 were included, tariff variation would need to be modelled explicitly as a time-varying covariate.

4. No weather data: Incorporating actual irradiance and rainfall data from NIMET or Global Solar Atlas would allow the model to separate weather effects from operational failures, improving forecast accuracy.

5. Causal inference: All findings are associational. To confirm that improving availability causes revenue increases, a difference-in-differences design tracking pre/post planned maintenance interventions would be needed.
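The difference-in-differences design mentioned in item 5 can be sketched as a regression with a treatment-by-period interaction. The example below runs on a synthetic panel, not PowerGen data: the 16 sites, the 40 weeks, the choice of sites 1 to 7 as the treated group, the week-21 start date, and the 0.30 effect size are all assumptions made so that the estimate can be compared with a known truth.

```r
# Synthetic panel for illustration only: 16 hypothetical sites over 40 weeks.
set.seed(7)
sites <- 16
weeks <- 40
panel <- expand.grid(site = seq_len(sites), week = seq_len(weeks))

# Assumption: sites 1-7 receive a maintenance programme starting in week 21.
panel$treated <- as.integer(panel$site <= 7)
panel$post    <- as.integer(panel$week > 20)

# Assumed true effect on log revenue (known only because the data are simulated).
true_effect <- 0.30
panel$log_revenue <- 11 +
  0.20 * panel$treated +                       # fixed gap between the two groups
  0.10 * panel$post +                          # time shift common to all sites
  true_effect * panel$treated * panel$post +   # programme effect
  rnorm(nrow(panel), sd = 0.2)

# The coefficient on treated:post is the difference-in-differences estimate.
did_model <- lm(log_revenue ~ treated * post, data = panel)
did_estimate <- unname(coef(did_model)["treated:post"])

cat("Difference-in-differences estimate:", round(did_estimate, 3), "\n")
```

A real application would also need to check that treated and untreated sites followed parallel revenue trends before the intervention, and would typically use standard errors clustered by site, neither of which this sketch attempts.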

Further Work: With richer data, a predictive model (Random Forest or XGBoost) could forecast site-level revenue 4–6 weeks ahead, enabling proactive scheduling. Cluster analysis of site profiles could also inform batch design in future commissioning rounds.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Enekwe, I. (2026). PowerGen NGBU weekly O&M tracker — site-level data extract, January 2025 to May 2026 [Dataset]. Finance Department, PowerGen Renewable Energy, Lagos, Nigeria. Data available on request from the author.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Code

for (pkg in c("skimr", "corrplot", "ggcorrplot", "kableExtra", "car", "broom", "ggridges", "patchwork")) {
  cat(sprintf("\n**%s:** ", pkg))
  cit <- citation(pkg)
  cat(format(cit, style = "text")[1])
  cat("\n")
}

skimr: Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2026). skimr: Compact and Flexible Summaries of Data. R package version 2.2.2, https://doi.org/10.32614/CRAN.package.skimr, https://CRAN.R-project.org/package=skimr.

corrplot: Wei T, Simko V (2024). R package ‘corrplot’: Visualization of a Correlation Matrix. (Version 0.95), https://github.com/taiyun/corrplot.

ggcorrplot: Kassambara A (2023). ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’. R package version 0.1.4.1, https://doi.org/10.32614/CRAN.package.ggcorrplot, https://CRAN.R-project.org/package=ggcorrplot.

kableExtra: Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.4.0, https://doi.org/10.32614/CRAN.package.kableExtra, https://CRAN.R-project.org/package=kableExtra.

car: Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://www.john-fox.ca/Companion/.

broom: Robinson D, Hayes A, Couch S, Hvitfeldt E (2026). broom: Convert Statistical Objects into Tidy Tibbles. R package version 1.0.12, https://doi.org/10.32614/CRAN.package.broom, https://CRAN.R-project.org/package=broom.

ggridges: Wilke C (2025). ggridges: Ridgeline Plots in ‘ggplot2’. R package version 0.5.7, https://doi.org/10.32614/CRAN.package.ggridges, https://CRAN.R-project.org/package=ggridges.

patchwork: Pedersen T (2025). patchwork: The Composer of Plots. R package version 1.3.2, https://doi.org/10.32614/CRAN.package.patchwork, https://CRAN.R-project.org/package=patchwork.

Appendix: AI Usage Statement

Claude (Anthropic, claude.ai) was used as an analytical assistant throughout this project. Specifically, Claude assisted with:

reshaping the raw Excel O&M tracker from wide to tidy format using Python;
engineering the derived variables season, performance_tier, batch, tariff_ngn_per_kwh, and revenue_ngn;
structuring the Quarto document and suggesting appropriate R packages for each analytical section; and
drafting initial code scaffolding for the EDA, visualization, and regression sections.

All analytical decisions were made independently by the author: the choice of CS1 as the case study, the selection of PowerGen generation data as the primary dataset, the framing of the two hypotheses (seasonal effect and batch differences), the interpretation of all statistical outputs in business terms, and the integrated recommendation regarding grid availability improvements at Batch 1 RMG sites.

The author collected the data directly from PowerGen’s internal systems in their capacity as Group Financial Controller, is familiar with all variables and their operational definitions, and can explain and defend every output in this document.