Solar Energy Generation Performance Analytics: A Site-Level Study of the Solar Mini-grids Operated by PowerGen Renewable Energy

Author

IKECHUKWU ENEKWE

Published

May 7, 2026

1. Executive Summary

This study analyzes weekly solar energy generation and revenue performance across 16 solar grid sites operated by PowerGen Renewable Energy (“PowerGen”) in Nigeria, covering the period February 2025 to May 2026. As Group Financial Controller (“GFC”), I obtained the weekly site-level observations from the Operations and Maintenance team’s tracker. The 16 solar grid sites have been categorized into four batches for ease of data analysis. The total site-level observations contain 9 variables – including financial and non-financial metrics.

The analysis applies five complementary techniques — Exploratory Data Analysis, Data Visualization, Hypothesis Testing, Correlation Analysis, and Linear Regression — to answer a central financial question: what drives revenue generation performance across PowerGen’s Nigerian portfolio, and are observed performance gaps between site batches statistically significant?

Key findings reveal that Toto (IMG batch) is a structural outlier with average weekly solar production more than three times higher than Batch 1 RMG sites; that rainy season reduces solar generation by approximately 15–20% compared to dry season; that grid availability is the strongest operational predictor of revenue; and that two data quality anomalies require remediation in PowerGen’s reporting systems.

The integrated recommendation is that management prioritise grid availability improvements at Batch 1 RMG sites, where availability variance is highest, as this single intervention offers the largest predicted revenue uplift per naira of operational expenditure.


2. Professional Disclosure

Job Title: Group Financial Controller
Organization: PowerGen Renewable Energy
Sector: Utilities — Renewable Energy

PowerGen Renewable Energy designs, builds, and operates solar installations. The Company operates two business models – Grids and Commercial & Industrial. This analysis focuses on the grids business as it guarantees sufficient and rich data for analytical purposes. As Group Financial Controller, my responsibilities include financial reporting, revenue assurance, operational performance monitoring, and budget management across a portfolio of active solar installations. The five analytical techniques chosen for this study are directly relevant to my day-to-day work:

Technique 1 — Exploratory Data Analysis:
During our financial close process, I review site generation data from the Operations and Maintenance team’s tracker for completeness, plausibility, and anomalies. During this process, I look out for missing values, outliers, and distributional patterns that could distort reported revenue figures or mask underperforming assets. In some instances, I have identified negative revenue values or negative consumption values which, if not identified and resolved, could distort the financial results presented to the Executives.

Technique 2 — Data Visualization:
Monthly and quarterly performance reports submitted to the Executives and investors rely on visual communication of generation and revenue trends. Choosing the right chart type determines whether non-technical stakeholders can act on the data. This technique directly supports my reporting function. Investors, in particular, are usually keen to assess the performance of the power grids before approving further investment.

Technique 3 — Hypothesis Testing:
A recurring management question is whether performance differences observed between site batches reflect genuine operational differences or merely random variation. Formal hypothesis testing gives me a statistically defensible answer to replace what would otherwise be subjective judgement in performance reviews and board presentations.

Technique 4 — Correlation Analysis:
Revenue assurance requires understanding which variables are leading indicators of financial outcomes. For instance, from a basic understanding of the energy business, revenue is driven mainly by two variables: tariff and consumption. However, other variables, such as solar PV yield and technical losses, may also be associated with the performance of a mini-grid site. Taking all of these into consideration when reporting on performance helps provide insights to management.

These insights also enable management to decide on resource allocation to each site and identify early indications of energy theft.

Technique 5 — Linear Regression:
PowerGen’s annual budget includes revenue projections by site. Regression provides a data-driven basis for those projections, quantifying how much revenue is expected per kWh of solar generation, per percentage point of availability, and per season — replacing assumption-driven estimates with empirically fitted parameters. Seasonality plays an important role in energy yield and site-level performance.


3. Data Collection & Sampling

Source: Operations and Maintenance (O&M) tracker of PowerGen Renewable Energy
Collection Method: Direct extraction from the O&M tracker in my capacity as Group Financial Controller. Reshaped from wide to tidy (long) format and enriched with derived variables using Python prior to analysis in R.
Sampling Frame: All 16 active mini-grid solar sites in PowerGen’s portfolio as at May 2026, organized across four batches: Batch 1 RMG (7 sites), Batch 2 RMG (6 sites), Batch 3 RMG (2 sites), and IMG/Toto (1 site).
Sample Size: 907 weekly site-level observations.
Time Period: Week-ending dates from 16 February 2025 to 3 May 2026 (62 distinct weeks; coverage varies by site)
Ethical Notes: The data used has been accessed in my professional capacity with management authorization. Site names are operational identifiers; no anonymization required.
Data Sharing: The data published is being used for academic purposes only. This is consistent with PowerGen’s data governance policy.

Tariff History

Batch 1 & 2 RMG (13 sites): ₦240/kWh before May 2024 → ₦540/kWh from May 2024
Batch 3 RMG (Ofosu, Owode): ₦650/kWh throughout
IMG (Toto): ₦165/kWh before Jan 2024 → ₦195/kWh Jan–Mar 2024 → ₦450/kWh from Apr 2024
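
The derived revenue column appears to follow revenue = tariff × metered consumption. The sketch below checks that assumption against the first three Gbara rows printed in the data preview in Section 4. The derivation script itself is not shown in this report, so the formula is an assumption, and `revenue_recomputed` and `matches` are illustrative names only.

```r
# Values copied from the first three Gbara rows of the glimpse() output in Section 4
check <- data.frame(
  customer_metered_consumption_kwh = c(1217.25, 1020.26, 945.46),
  tariff_ngn_per_kwh               = c(540, 540, 540),
  revenue_ngn                      = c(657315.0, 550940.4, 510548.4)
)

# Assumed derivation: revenue = tariff x metered consumption
check$revenue_recomputed <- check$customer_metered_consumption_kwh * check$tariff_ngn_per_kwh

# TRUE for a row when reported and recomputed revenue agree to within 0.05 naira
check$matches <- abs(check$revenue_recomputed - check$revenue_ngn) < 0.05
print(check)
```

If the same check were run over the full dataset, any row where `matches` is FALSE would be a candidate for review before revenue figures are reported.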

4. Data Description

Code
# Load required packages
library(tidyverse)
library(skimr)
library(corrplot)
library(ggcorrplot)
library(scales)
library(knitr)
library(kableExtra)
library(car)
library(broom)
library(ggridges)
library(patchwork)

# Load data
pg <- read_csv("powergen_grid_data_clean.csv", show_col_types = FALSE)

# Convert types
pg <- pg |>
  mutate(
    week_ending      = as.Date(week_ending),
    site             = as.factor(site),
    batch            = factor(batch, levels = c("Batch 1 RMG", "Batch 2 RMG",
                                                 "Batch 3 RMG", "IMG")),
    season           = factor(season, levels = c("Dry", "Rainy")),
    performance_tier = factor(performance_tier, levels = c("Low", "Medium", "High"))
  )

glimpse(pg)
Rows: 907
Columns: 15
$ site                             <fct> Gbara, Gbara, Gbara, Gbara, Gbara, Gb…
$ week_ending                      <date> 2025-03-09, 2025-03-16, 2025-03-23, …
$ solar_production_kwh             <dbl> 2207, 2130, 2266, 2241, 1797, 1480, 1…
$ generator_production_kwh         <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 46…
$ specific_yield_kwp               <dbl> 3.742708, 3.612129, 3.842762, 3.80036…
$ gsa_average                      <dbl> 28.50000, 28.50000, 28.50000, 28.5442…
$ customer_metered_consumption_kwh <dbl> 1217.25, 1020.26, 945.46, 1071.86, 89…
$ losses_pct                       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ availability_pct                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ batch                            <fct> Batch 1 RMG, Batch 1 RMG, Batch 1 RMG…
$ tariff_ngn_per_kwh               <dbl> 540, 540, 540, 540, 540, 540, 540, 54…
$ revenue_ngn                      <dbl> 657315.0, 550940.4, 510548.4, 578804.…
$ season                           <fct> Dry, Dry, Dry, Dry, Rainy, Rainy, Rai…
$ performance_tier                 <fct> Medium, Medium, Medium, Medium, Mediu…
$ data_flag                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
Code
skim(pg)
Data summary
Name pg
Number of rows 907
Number of columns 15
_______________________
Column type frequency:
character 1
Date 1
factor 4
numeric 9
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
data_flag 906 0 19 19 0 1 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
week_ending 0 1 2025-02-16 2026-05-03 2025-10-05 62

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
site 0 1 FALSE 16 Mag: 62, Duk: 61, Eba: 61, Gba: 61
batch 0 1 FALSE 4 Bat: 421, Bat: 363, Bat: 62, IMG: 61
season 0 1 FALSE 2 Rai: 512, Dry: 395
performance_tier 0 1 FALSE 3 Hig: 337, Med: 298, Low: 272

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
solar_production_kwh 0 1.00 1688.48 1556.09 0.00 858.50 1315.00 1840.00 11507.00 ▇▁▁▁▁
generator_production_kwh 25 0.97 464.36 1544.01 0.00 0.00 0.65 126.58 11340.00 ▇▁▁▁▁
specific_yield_kwp 1 1.00 2.33 0.92 0.00 1.91 2.48 2.91 4.67 ▂▂▇▅▁
gsa_average 0 1.00 26.65 1.01 24.52 26.55 26.55 26.55 29.02 ▂▁▇▁▂
customer_metered_consumption_kwh 47 0.95 2329.03 8338.19 0.00 703.93 1142.94 1640.50 227125.00 ▇▁▁▁▁
losses_pct 250 0.72 -0.33 12.13 -310.56 0.02 0.03 0.13 1.00 ▁▁▁▁▇
availability_pct 264 0.71 0.85 0.22 0.00 0.80 0.92 0.99 1.00 ▁▁▁▂▇
tariff_ngn_per_kwh 0 1.00 541.47 37.03 450.00 540.00 540.00 540.00 650.00 ▁▁▇▁▁
revenue_ngn 47 0.95 1182335.42 4400663.14 0.00 380123.55 617187.60 885870.00 122647500.00 ▇▁▁▁▁
Code
missing_summary <- pg |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Count") |>
  mutate(
    Total       = nrow(pg),
    Missing_Pct = round(Missing_Count / Total * 100, 1)
  ) |>
  filter(Missing_Count > 0) |>
  arrange(desc(Missing_Pct))

kable(missing_summary, caption = "Missing Values by Variable") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Missing Values by Variable
Variable Missing_Count Total Missing_Pct
data_flag 906 907 99.9
availability_pct 264 907 29.1
losses_pct 250 907 27.6
customer_metered_consumption_kwh 47 907 5.2
revenue_ngn 47 907 5.2
generator_production_kwh 25 907 2.8
specific_yield_kwp 1 907 0.1
Code
pg |>
  count(batch, site, name = "n_weeks") |>
  arrange(batch, desc(n_weeks)) |>
  kable(caption = "Observations per Site") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Observations per Site
batch site n_weeks
Batch 1 RMG Maggi Igenchi 62
Batch 1 RMG Gbara 61
Batch 1 RMG Maggi Bukun 61
Batch 1 RMG Nantu 61
Batch 1 RMG Ndejiko 61
Batch 1 RMG Rokota 60
Batch 1 RMG Kpanbo 55
Batch 2 RMG Dukugi 61
Batch 2 RMG Ebangi 61
Batch 2 RMG Gbade 61
Batch 2 RMG Sachi Nku 61
Batch 2 RMG Sosa 61
Batch 2 RMG Danchitagi 58
Batch 3 RMG Ofosu 34
Batch 3 RMG Owode 28
IMG Toto 61

5. Exploratory Data Analysis (EDA)

Theory: Exploratory Data Analysis (EDA) is the systematic examination of a dataset’s structure, distributions, and anomalies before formal modelling (Adi, 2026, Ch. 4). The objective is to understand what the data contains, where it is incomplete, and where it may violate data analysis assumptions. Anscombe’s Quartet (1973) illustrates why summary statistics alone are insufficient — datasets with identical means and variances can exhibit fundamentally different patterns. EDA guards against this by combining numerical summaries with visual inspection.
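
The Anscombe point can be reproduced directly, since the quartet ships with base R as the `anscombe` data frame. The sketch below is illustrative only and does not use PowerGen data; the `quartet` summary table is a name chosen for this example.

```r
# anscombe ships with base R (package 'datasets'): columns x1..x4 and y1..y4
data(anscombe)

# Summary statistics for each of the four x/y pairs
quartet <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  var_y  = sapply(1:4, function(i) var(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                       anscombe[[paste0("y", i)]]))
)
print(round(quartet, 2))
```

The four rows are nearly identical, yet plotting each pair (for example `plot(anscombe$x2, anscombe$y2)`) shows a curve, an outlier-driven line, and a vertical cluster. This is why the EDA in this section pairs every numerical summary with a chart.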

Business Justification: As the GFC, I rely on the O&M trackers to build my site performance reports. Undetected outliers or data anomalies directly affect reporting.

Code
p1 <- ggplot(pg, aes(x = solar_production_kwh)) +
  geom_histogram(bins = 40, fill = "#2E86AB", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = comma) +
  labs(title = "Solar Production (kWh)", x = "kWh", y = "Count") +
  theme_minimal()

p2 <- ggplot(pg, aes(x = availability_pct)) +
  geom_histogram(bins = 30, fill = "#A23B72", colour = "white", alpha = 0.85) +
  labs(title = "Grid Availability (%)", x = "Availability", y = "Count") +
  theme_minimal()

p3 <- ggplot(pg, aes(x = batch, y = solar_production_kwh, fill = batch)) +
  geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 1.5) +
  scale_y_continuous(labels = comma) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Solar Production by Batch", x = "Batch", y = "kWh", fill = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

p4 <- ggplot(pg, aes(x = performance_tier, y = solar_production_kwh, fill = performance_tier)) +
  geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 1.5) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("Low" = "#E74C3C", "Medium" = "#F39C12", "High" = "#27AE60")) +
  labs(title = "Solar Production by Performance Tier", x = "Tier", y = "kWh", fill = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

(p1 + p2) / (p3 + p4)

Distribution of key numeric variables
Code
# DATA QUALITY ISSUE 1: Negative losses
neg_losses <- pg |> filter(!is.na(losses_pct), losses_pct < 0)
cat("DATA QUALITY ISSUE 1 — Negative technical losses:\n")
DATA QUALITY ISSUE 1 — Negative technical losses:
Code
cat(sprintf("  %d observation(s) with losses_pct < 0\n", nrow(neg_losses)))
  70 observation(s) with losses_pct < 0
Code
cat(sprintf("  Affected sites: %s\n", paste(unique(neg_losses$site), collapse = ", ")))
  Affected sites: Gbara, Kpanbo, Maggi Bukun, Maggi Igenchi, Nantu, Rokota, Danchitagi, Dukugi, Ebangi, Gbade, Sosa, Ofosu
Code
cat(sprintf("  Min value: %.2f%%\n\n", min(neg_losses$losses_pct)))
  Min value: -310.56%
Code
# DATA QUALITY ISSUE 2: Extreme consumption outlier
outlier <- pg |>
  filter(!is.na(customer_metered_consumption_kwh)) |>
  filter(customer_metered_consumption_kwh > 50000)
cat("DATA QUALITY ISSUE 2 — Extreme consumption outlier:\n")
DATA QUALITY ISSUE 2 — Extreme consumption outlier:
Code
print(outlier |> select(site, week_ending, solar_production_kwh,
                         customer_metered_consumption_kwh, revenue_ngn))
# A tibble: 1 × 5
  site       week_ending solar_production_kwh customer_metered_con…¹ revenue_ngn
  <fct>      <date>                     <dbl>                  <dbl>       <dbl>
1 Maggi Buk… 2026-02-08                   729                 227125   122647500
# ℹ abbreviated name: ¹​customer_metered_consumption_kwh
Code
cat("\n  Note: Maggi Bukun's 227,125 kWh reading on 2026-02-08 is ~300x its typical\n")

  Note: Maggi Bukun's 227,125 kWh reading on 2026-02-08 is ~300x its typical
Code
cat("  weekly consumption and is almost certainly a data entry error.\n")
  weekly consumption and is almost certainly a data entry error.
Code
cat("  This row is excluded from revenue-based analyses.\n")
  This row is excluded from revenue-based analyses.
Code
# Create clean dataset
pg_clean <- pg |>
  filter(!(site == "Maggi Bukun" &
           !is.na(customer_metered_consumption_kwh) &
           customer_metered_consumption_kwh > 50000))

cat(sprintf("\nAnalysis dataset: %d observations (1 outlier row removed)\n", nrow(pg_clean)))

Analysis dataset: 906 observations (1 outlier row removed)

Data Quality Issues Identified

Issue 1: Negative technical losses
Site: 70 observations across 12 sites (Gbara, Kpanbo, Maggi Bukun, Maggi Igenchi, Nantu, Rokota, Danchitagi, Dukugi, Ebangi, Gbade, Sosa, Ofosu), including -6.59% at Maggi Bukun
Detail: Technical losses are the energy lost between generation and delivery to customers. A negative loss is physically impossible and points to a metering or data entry problem.
Action Taken: Flagged in analysis; recommend source system correction

Issue 2: Extreme consumption outlier
Site: Maggi Bukun
Detail: The week ending 8 Feb 2026 records customer consumption of 227,125 kWh for the site, most likely a cumulative figure entered in error.
Action Taken: Row excluded from all revenue-based analyses

Plain-Language Interpretation for Management: The distribution of solar production is right-skewed, meaning a small number of sites, primarily Toto, generate significantly more energy than all other sites. Toto is the first Interconnected Mini-Grid in Nigeria and serves a peri-urban community about 1 hour from Abuja. The community comprises individual homes, small-scale businesses and public utility centres.

Most sites register consumption levels of 500 to 2,000 kWh per week on average. Another observation is that grid availability is high across most sites (greater than 90%), but drops noticeably for Toto and Owode. Given the financial significance of Toto and Owode, the reduced grid availability would have an impact on the performance of those sites and the overall business.

In addition, two data quality issues were identified. First, the tracker records negative technical losses in 70 observations across 12 sites, including Maggi Bukun. This is physically impossible and indicates a meter reading or data entry problem. Second, one week in February 2026 at Maggi Bukun shows customer consumption about 300 times higher than normal, almost certainly a data entry error. Both should be corrected in the source system. The affected Maggi Bukun row has been removed from the revenue analysis to avoid distorting results.
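
The fixed 50,000 kWh cut-off used above works for this dataset, but a site-relative rule may generalise better across sites of different sizes. The following is a minimal sketch on simulated data, not the method used in this report; the two sites, the column names and the 10x threshold are all hypothetical choices for illustration.

```r
# Simulated weekly consumption for two hypothetical sites (not PowerGen data)
set.seed(42)
demo <- data.frame(
  site            = rep(c("Site A", "Site B"), each = 30),
  consumption_kwh = c(rnorm(30, mean = 800, sd = 80),
                      rnorm(30, mean = 1500, sd = 150))
)
demo$consumption_kwh[10] <- 227125   # inject one implausible reading at Site A

# Ratio of each reading to its own site's median; one bad value barely moves a median
site_median <- ave(demo$consumption_kwh, demo$site, FUN = median)
demo$ratio_to_median <- demo$consumption_kwh / site_median

# Hypothetical rule: flag readings more than 10x the site median for manual review
demo$flag <- demo$ratio_to_median > 10
print(demo[demo$flag, ])
```

A rule of this kind flags a row for review rather than deleting it automatically, which keeps the final decision with the person who knows the site.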


6. Data Visualization

Theory: The Grammar of Graphics (Wilkinson, 2005, cited in Adi, 2026, Ch. 5) holds that every effective chart is built from a systematic mapping of data variables to visual properties — position, colour, size, shape. Chart selection should be governed by the data type and the story being told, not aesthetic preference.

A time series calls for a line chart; distributions call for histograms or ridge plots; categorical comparisons call for bar charts or boxplots; relationships between continuous variables call for scatter plots.

Business Justification: PowerGen’s management and investors require visual performance summaries that communicate site-level and portfolio-level trends. I work closely with the Financial Planning and Analysis Manager to prepare and present these visuals.

Code
pg |>
  group_by(week_ending, batch) |>
  summarise(total_solar = sum(solar_production_kwh, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = week_ending, y = total_solar, colour = batch)) +
  geom_line(linewidth = 0.8, alpha = 0.9) +
  geom_smooth(se = FALSE, linewidth = 0.4, linetype = "dashed", alpha = 0.5) +
  scale_y_continuous(labels = comma) +
  scale_colour_brewer(palette = "Set1") +
  labs(
    title    = "Weekly Solar Production by Batch — Feb 2025 to May 2026",
    subtitle = "Dashed lines show smoothed trend",
    x = NULL, y = "Total kWh", colour = "Batch"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Weekly solar production over time by batch
Code
pg |>
  filter(!is.na(availability_pct)) |>
  mutate(month = lubridate::floor_date(week_ending, "month")) |>
  group_by(site, month) |>
  summarise(avg_avail = mean(availability_pct, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = month, y = fct_reorder(site, avg_avail), fill = avg_avail)) +
  geom_tile(colour = "white", linewidth = 0.3) +
  scale_fill_gradient2(low = "#E74C3C", mid = "#F39C12", high = "#27AE60",
                       midpoint = 0.75, labels = percent) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  labs(
    title    = "Monthly Average Grid Availability by Site",
    subtitle = "Green = high availability, Red = low availability",
    x = NULL, y = NULL, fill = "Availability"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Site-level availability heatmap
Code
pg_clean |>
  filter(!is.na(revenue_ngn)) |>
  group_by(site, season, batch) |>
  summarise(avg_revenue = mean(revenue_ngn, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = fct_reorder(site, avg_revenue), y = avg_revenue, fill = season)) +
  geom_col(position = "dodge", alpha = 0.85) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("Dry" = "#F4A261", "Rainy" = "#457B9D")) +
  coord_flip() +
  labs(
    title = "Average Weekly Revenue by Site and Season (₦)",
    x = NULL, y = "Average Weekly Revenue (₦)", fill = "Season"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Average weekly revenue by site and season
Code
pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(solar_production_kwh)) |>
  ggplot(aes(x = solar_production_kwh, y = revenue_ngn,
             colour = performance_tier, shape = season)) +
  geom_point(alpha = 0.5, size = 1.8) +
  geom_smooth(method = "lm", se = TRUE, colour = "black", linewidth = 0.8) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  scale_colour_manual(values = c("Low" = "#E74C3C", "Medium" = "#F39C12", "High" = "#27AE60")) +
  labs(
    title    = "Solar Production vs Revenue",
    subtitle = "Black line = OLS fit; colour = performance tier",
    x = "Solar Production (kWh)", y = "Revenue (₦)",
    colour = "Tier", shape = "Season"
  ) +
  theme_minimal()

Solar production vs revenue
Code
pg |>
  filter(!is.na(solar_production_kwh)) |>
  ggplot(aes(x = solar_production_kwh, y = season, fill = season)) +
  geom_density_ridges(alpha = 0.7, scale = 1.2) +
  scale_x_continuous(labels = comma) +
  scale_fill_manual(values = c("Dry" = "#F4A261", "Rainy" = "#457B9D")) +
  labs(
    title = "Solar Production Distribution: Dry vs Rainy Season",
    x = "Solar Production (kWh)", y = NULL, fill = "Season"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Solar production distribution by season

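Plain-Language Interpretation for Management: Five charts tell one story. Chart 1 plots total weekly output per batch. Batch 2 RMG sites average about 612 kWh more per site per week than Batch 1 RMG sites (see the Tukey comparison in Section 7), so Batch 2's six sites together out-produce Batch 1's seven in a typical week; Batch 1's production is also the more volatile of the two. Batch 2 RMG sites were developed after Batch 1 and incorporated some of the technical improvements suggested after the Batch 1 projects.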

Chart 2 reveals that grid availability is close to 100% for most Batch 1 and 2 sites but significantly lower for Toto and Owode. Toto and Owode have reported inconsistent grid availability due to technical issues with the diesel generator and the Battery Energy Storage System.

Chart 3 confirms dry-season revenue consistently exceeds rainy-season revenue across all sites. Chart 4 demonstrates a strong linear relationship between solar production and revenue. Chart 5 confirms the seasonal effect — dry-season production is shifted higher, with less variation than the rainy season.


7. Hypothesis Testing

Theory: Hypothesis testing is the formal framework for deciding whether observed data differences are statistically real or attributable to chance (Adi, 2026, Ch. 6). A null hypothesis (H₀) posits no effect; an alternative hypothesis (H₁) posits that a difference exists, in a stated direction when the test is one-tailed. The p-value quantifies the probability of observing data at least as extreme as the sample if H₀ were true. We report effect sizes (rank-biserial r for the Wilcoxon test and η² for ANOVA) alongside p-values, as statistical significance alone does not convey practical importance. Two tests are used: the Wilcoxon Rank-Sum test (non-parametric alternative to the t-test, used when distributions are skewed) and one-way ANOVA with Tukey HSD post-hoc comparisons.

Business Justification: PowerGen management regularly compares batch performance in operational reviews. Without formal testing, conclusions risk being driven by selective attention to favourable weeks. The two hypotheses below directly inform resource allocation and tariff review decisions that I support as GFC.

Hypothesis 1 — Seasonal Effect on Solar Production

H₀: Mean weekly solar production is equal in dry and rainy seasons
H₁: Mean weekly solar production is higher in the dry season (one-tailed)

Code
dry_solar   <- pg$solar_production_kwh[pg$season == "Dry"   & !is.na(pg$solar_production_kwh)]
rainy_solar <- pg$solar_production_kwh[pg$season == "Rainy" & !is.na(pg$solar_production_kwh)]

sw_dry   <- shapiro.test(sample(dry_solar,   min(500, length(dry_solar))))
sw_rainy <- shapiro.test(sample(rainy_solar, min(500, length(rainy_solar))))

cat("Normality check (Shapiro-Wilk, sampled):\n")
Normality check (Shapiro-Wilk, sampled):
Code
cat(sprintf("  Dry season:   W = %.4f, p = %.4f\n", sw_dry$statistic,   sw_dry$p.value))
  Dry season:   W = 0.7144, p = 0.0000
Code
cat(sprintf("  Rainy season: W = %.4f, p = %.4f\n", sw_rainy$statistic, sw_rainy$p.value))
  Rainy season: W = 0.6963, p = 0.0000
Code
wilcox_result <- wilcox.test(dry_solar, rainy_solar, alternative = "greater")
cat("\nWilcoxon Rank-Sum Test (one-sided: dry > rainy):\n")

Wilcoxon Rank-Sum Test (one-sided: dry > rainy):
Code
print(wilcox_result)

    Wilcoxon rank sum test with continuity correction

data:  dry_solar and rainy_solar
W = 114335, p-value = 0.0003651
alternative hypothesis: true location shift is greater than 0
Code
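n1 <- length(dry_solar); n2 <- length(rainy_solar)
# W from wilcox.test() counts the (dry, rainy) pairs in which the dry value is larger,
# so the rank-biserial correlation is 2W / (n1 * n2) - 1 (positive when dry > rainy)
r_effect <- (2 * wilcox_result$statistic) / (n1 * n2) - 1
cat(sprintf("\nEffect size (rank-biserial r): %.3f\n", r_effect))

Effect size (rank-biserial r): 0.131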
Code
cat(sprintf("Dry season mean:   %.1f kWh\n", mean(dry_solar)))
Dry season mean:   1888.9 kWh
Code
cat(sprintf("Rainy season mean: %.1f kWh\n", mean(rainy_solar)))
Rainy season mean: 1533.8 kWh
Code
cat(sprintf("Difference:        %.1f kWh (%.1f%%)\n",
            mean(dry_solar) - mean(rainy_solar),
            (mean(dry_solar) - mean(rainy_solar)) / mean(rainy_solar) * 100))
Difference:        355.1 kWh (23.2%)
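
The rank-biserial effect size can be cross-checked against its definition. In R, the `W` returned by `wilcox.test(x, y)` counts the (x, y) pairs in which x is larger, so for the "x greater than y" direction the effect size is 2W / (n1 × n2) − 1, which equals P(x > y) − P(x < y). The sketch below verifies this on simulated data (not PowerGen data) and then applies the formula to the figures reported above, taking n1 = 395 dry and n2 = 512 rainy observations from the season counts in Section 4.

```r
# Simulated samples (not PowerGen data); continuous values, so no ties
set.seed(1)
x <- rnorm(40, mean = 1.0)   # plays the role of the "dry" sample
y <- rnorm(50, mean = 0.5)   # plays the role of the "rainy" sample

# W reported by wilcox.test() is the number of (x, y) pairs with x > y
w <- unname(wilcox.test(x, y, alternative = "greater")$statistic)
r_from_w <- 2 * w / (length(x) * length(y)) - 1

# Same quantity from its definition: P(x > y) - P(x < y) over all pairs
r_from_pairs <- mean(outer(x, y, ">")) - mean(outer(x, y, "<"))
print(c(r_from_w = r_from_w, r_from_pairs = r_from_pairs))

# Applying the formula to the reported figures (W = 114335, n1 = 395, n2 = 512)
r_reported <- 2 * 114335 / (395 * 512) - 1
print(round(r_reported, 3))
```

A positive value means dry-season weeks tend to out-produce rainy-season weeks; a magnitude near 0.13 is conventionally read as a small effect.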

Hypothesis 2 — Performance Differences Across Batches

H₀: Mean weekly solar production is equal across all four batches
H₁: At least one batch differs significantly from the others

Code
anova_model <- aov(solar_production_kwh ~ batch,
                   data = pg |> filter(!is.na(solar_production_kwh)))
summary(anova_model)
             Df    Sum Sq   Mean Sq F value Pr(>F)    
batch         3 9.066e+08 302186345     212 <2e-16 ***
Residuals   903 1.287e+09   1425526                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
levene_result <- leveneTest(solar_production_kwh ~ batch,
                            data = pg |> filter(!is.na(solar_production_kwh)))
cat("\nLevene's Test for Homogeneity of Variance:\n")

Levene's Test for Homogeneity of Variance:
Code
print(levene_result)
Levene's Test for Homogeneity of Variance (center = median)
       Df F value    Pr(>F)    
group   3  123.21 < 2.2e-16 ***
      903                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
tukey_result <- TukeyHSD(anova_model)
cat("\nTukey HSD Post-hoc Test:\n")

Tukey HSD Post-hoc Test:
Code
print(tukey_result)
  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = solar_production_kwh ~ batch, data = filter(pg, !is.na(solar_production_kwh)))

$batch
                              diff       lwr       upr     p adj
Batch 2 RMG-Batch 1 RMG  611.65545  391.5506  831.7604 0.0000000
Batch 3 RMG-Batch 1 RMG  520.76940  102.7440  938.7948 0.0075864
IMG-Batch 1 RMG         4120.88702 3699.8856 4541.8885 0.0000000
Batch 3 RMG-Batch 2 RMG  -90.88606 -513.1766  331.4045 0.9454810
IMG-Batch 2 RMG         3509.23156 3083.9949 3934.4683 0.0000000
IMG-Batch 3 RMG         3600.11762 3045.9287 4154.3065 0.0000000
Code
ss_total   <- sum((pg$solar_production_kwh[!is.na(pg$solar_production_kwh)] -
                   mean(pg$solar_production_kwh, na.rm = TRUE))^2)
ss_between <- sum(summary(anova_model)[[1]][["Sum Sq"]][1])
eta_sq <- ss_between / ss_total
cat(sprintf("\nEffect size (eta-squared): %.3f\n", eta_sq))

Effect size (eta-squared): 0.413
Code
cat("Interpretation: >0.14 = large effect (Cohen, 1988)\n")
Interpretation: >0.14 = large effect (Cohen, 1988)

Plain-Language Interpretation for Management:

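Hypothesis 1: Dry season solar production is significantly higher than rainy season (p < 0.001). The gap is about 355 kWh per site per week: dry-season output is roughly 23% above the rainy-season mean, or equivalently rainy-season output is roughly 19% below the dry-season mean. The rank-biserial effect size is small (about 0.13 in magnitude), so the two seasonal distributions overlap considerably. The difference is statistically confirmed, not just visually apparent.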

Practical implication: In preparing revenue forecasts, ensure that seasonality effects are considered. A revenue adjustment of 15–20% between dry and rainy quarters might be appropriate.
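
When applying a seasonal adjustment, it matters which season is used as the base. The short calculation below uses only the two seasonal means reported in the Hypothesis 1 output; the variable names are illustrative.

```r
# Seasonal means as reported in the Hypothesis 1 output above
dry_mean   <- 1888.9
rainy_mean <- 1533.8
gap        <- dry_mean - rainy_mean

pct_vs_rainy <- gap / rainy_mean * 100   # uplift of dry over rainy
pct_vs_dry   <- gap / dry_mean   * 100   # shortfall of rainy relative to dry

cat(sprintf("Gap: %.1f kWh | vs rainy base: %.1f%% | vs dry base: %.1f%%\n",
            gap, pct_vs_rainy, pct_vs_dry))
```

A forecast that starts from dry-season figures and scales down should use the dry-base percentage; one that starts from rainy-season figures and scales up should use the rainy-base percentage.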

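Hypothesis 2: Batch-level differences in solar production are statistically large (p < 0.001, η² = 0.41, well above the 0.14 threshold for a large effect). Toto (IMG) is the main driver: it produces roughly 3,500 to 4,100 kWh more per week than each RMG batch. This is because Toto is an Interconnected Mini-grid and not a regular Isolated Rural Mini-grid. Batch 2 and Batch 3 RMG sites also out-produce Batch 1 (by about 612 and 521 kWh per week respectively), while Batch 2 and Batch 3 do not differ significantly from each other (adjusted p = 0.95). Levene's test rejects equal variances across batches, so these ANOVA results should be read with some caution.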

Practical implication: Performance benchmarking must be done within batch cohorts to ensure that the comparison is like-for-like. For instance, comparing a Batch 1 RMG site to Toto is misleading and unfair — they are structurally different installations.
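
Levene's test above rejects equal variances across batches, which is an assumption of the classical one-way ANOVA and of Tukey HSD. One possible robustness check, sketched here on simulated data rather than the PowerGen dataset, is Welch's ANOVA via `oneway.test()` followed by unpooled pairwise t-tests. The group labels, sizes, means and spreads below are arbitrary assumptions chosen only to mimic unequal variances.

```r
# Simulated groups with unequal spread (not PowerGen data)
set.seed(7)
sim <- data.frame(
  batch = factor(rep(c("A", "B", "C"), times = c(60, 60, 20))),
  kwh   = c(rnorm(60, mean = 1100, sd = 300),
            rnorm(60, mean = 1700, sd = 600),
            rnorm(20, mean = 5200, sd = 2000))
)

# Welch's one-way ANOVA: does not assume equal variances across groups
welch <- oneway.test(kwh ~ batch, data = sim, var.equal = FALSE)
print(welch)

# Pairwise Welch t-tests with Holm adjustment, a variance-robust stand-in for Tukey HSD
pw <- pairwise.t.test(sim$kwh, sim$batch, pool.sd = FALSE, p.adjust.method = "holm")
print(pw)
```

If the Welch results on the real data agree with the ANOVA above, the batch-level conclusion can be reported with more confidence.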


8. Correlation Analysis

Theory: Correlation measures the strength and direction of association between two continuous variables (Adi, 2026, Ch. 8). Pearson’s r measures linear association and is appropriate for normally distributed variables; Spearman’s ρ (rho) measures monotonic association and is preferred when distributions are skewed or ordinal. Values range from -1 (perfect negative) to +1 (perfect positive); 0 indicates no linear (Pearson) or monotonic (Spearman) relationship. A correlation matrix with heatmap provides a portfolio-level view of inter-variable relationships. Crucially, correlation does not imply causation: a high correlation between two variables may reflect a common underlying driver rather than a direct causal link.
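
The difference between the two coefficients is easiest to see on skewed data with a monotonic but non-linear relationship. The sketch below uses simulated values (not PowerGen data); `x` and `y` are arbitrary illustrative variables.

```r
# Simulated skewed, monotonic relationship (not PowerGen data)
set.seed(3)
x <- rexp(200, rate = 1)   # right-skewed predictor
y <- exp(x)                # strictly increasing but non-linear in x

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")
print(round(c(pearson = pearson, spearman = spearman), 3))
```

Spearman's ρ works on ranks, so it registers the perfectly monotonic relationship in full, while Pearson's r is pulled down by the curvature and the long right tail. The variables in this study are similarly right-skewed, which is why Spearman is used below.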

Business Justification: Understanding which operational metrics impact revenue helps PowerGen’s management to shape strategy and prioritize interventions. For instance, if grid availability has a stronger correlation with revenue than specific yield, maintenance budgets would typically be channeled towards metering and core uptime rather than PV panel efficiency. This is a resource allocation question I face as GFC when preparing the annual capital expenditure budget.

Code
corr_vars <- pg_clean |>
  filter(!is.na(revenue_ngn)) |>
  select(solar_production_kwh, generator_production_kwh, specific_yield_kwp,
         customer_metered_consumption_kwh, losses_pct, availability_pct, revenue_ngn) |>
  drop_na()

corr_matrix <- cor(corr_vars, method = "spearman")

ggcorrplot(corr_matrix,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 3.5,
           colors   = c("#E74C3C", "white", "#27AE60"),
           title    = "Spearman Correlation Matrix — PowerGen NGBU Metrics",
           ggtheme  = theme_minimal())

Spearman correlation matrix of operational and financial variables
Code
corr_with_revenue <- corr_matrix[, "revenue_ngn"]
corr_df <- data.frame(
  Variable   = names(corr_with_revenue),
  Spearman_r = round(corr_with_revenue, 3),
  row.names  = NULL   # drop row names so the variable name is not printed twice
) |>
  filter(Variable != "revenue_ngn") |>
  arrange(desc(abs(Spearman_r)))

kable(corr_df, caption = "Spearman Correlation with Revenue (₦)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Spearman Correlation with Revenue (₦)
Variable Spearman_r
customer_metered_consumption_kwh 1.000
solar_production_kwh 0.923
losses_pct -0.412
generator_production_kwh 0.372
specific_yield_kwp 0.267
availability_pct -0.252
Code
c1 <- pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(customer_metered_consumption_kwh)) |>
  ggplot(aes(x = customer_metered_consumption_kwh, y = revenue_ngn)) +
  geom_point(alpha = 0.4, colour = "#2E86AB", size = 1.5) +
  geom_smooth(method = "lm", colour = "black") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(title = "Consumption vs Revenue", x = "Consumption (kWh)", y = "Revenue (₦)") +
  theme_minimal()

c2 <- pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(availability_pct)) |>
  ggplot(aes(x = availability_pct, y = revenue_ngn)) +
  geom_point(alpha = 0.4, colour = "#A23B72", size = 1.5) +
  geom_smooth(method = "lm", colour = "black") +
  scale_y_continuous(labels = comma) +
  labs(title = "Availability vs Revenue", x = "Grid Availability", y = "Revenue (₦)") +
  theme_minimal()

c1 + c2

Revenue against customer consumption and against grid availability

Plain-Language Interpretation for Management: Customer consumption is perfectly rank-correlated with revenue to three decimal places (ρ = 1.000), which is expected since revenue is a function of tariff and consumption. Solar production also correlates strongly with revenue (ρ = 0.923). Grid availability, by contrast, shows a weak negative bivariate correlation with revenue (ρ = −0.252), so the raw correlations alone do not show that sites with higher availability earn more. This may reflect differences between batches rather than an effect of availability itself: the regression in Section 9 controls for production, season, and batch, and there the availability coefficient is positive and significant. Availability remains the more controllable variable in the short term: you cannot quickly add solar panels, but you can improve maintenance response times.

Technical losses show a moderate negative correlation with revenue (ρ = −0.412): higher losses reduce billable energy, and in this bivariate view the association is stronger than that of availability. The correlation between specific yield and the GSA benchmark confirms that actual site performance tracks theoretical solar potential, which validates the integrity of the generation data.
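
The table reports a negative Spearman coefficient for availability_pct (−0.252), whereas the regression in Section 9 estimates a positive availability coefficient. One way such a sign change can arise is when larger sites earn more but run at lower availability. The sketch below illustrates that mechanism with invented numbers for two hypothetical sites. The columns `site` and `production_kwh`, the flat tariff of 100, and every value in `toy` are assumptions made for illustration; none of them comes from the PowerGen tracker.

```r
# Hypothetical illustration only: two invented sites, ten weeks each.
toy <- data.frame(
  site           = rep(c("large", "small"), each = 10),
  production_kwh = rep(c(5000, 1000), each = 10),
  availability   = c(seq(0.60, 0.80, length.out = 10),   # large site, lower uptime
                     seq(0.85, 0.99, length.out = 10))   # small site, higher uptime
)

# Assumed revenue rule: energy delivered times a flat tariff of 100 per kWh.
toy$revenue <- toy$production_kwh * toy$availability * 100

# Pooled rank correlation: negative, because the large site earns more at lower uptime.
rho_pooled <- cor(toy$availability, toy$revenue, method = "spearman")

# Holding site fixed, the availability coefficient is positive.
fit_toy <- lm(log(revenue) ~ availability + site, data = toy)
beta_availability <- unname(coef(fit_toy)["availability"])

print(c(rho_pooled = rho_pooled, beta_availability = beta_availability))
```

Whether this mechanism explains the PowerGen figures would need to be checked on the real data, for example by computing the availability correlation within each batch.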


9. Linear Regression

Theory: Ordinary Least Squares (OLS) regression estimates the linear relationship between a dependent variable and one or more predictors by minimizing the sum of squared residuals (Adi, 2026, Ch. 9). Each coefficient represents the expected change in the outcome for a one-unit increase in the predictor, holding all others constant. Diagnostic plots assess four key assumptions: linearity (Residuals vs Fitted), normality of residuals (Q-Q plot), homoscedasticity (Scale-Location), and the absence of unduly influential observations (Residuals vs Leverage, with Cook’s distance contours). The outcome is modelled as log(revenue + 1) to address right skew in the revenue data, so each coefficient is read as a change in log revenue rather than in naira.
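
To make "minimizing the sum of squared residuals" concrete, the following minimal sketch solves the OLS normal equations by hand and compares the result with `lm()`. It uses R's built-in `mtcars` data as a stand-in, so the variables `mpg`, `wt`, and `hp` are illustrative assumptions and have nothing to do with the PowerGen dataset.

```r
# Stand-in data: R's built-in mtcars, not the PowerGen tracker.
X <- cbind(intercept = 1, wt = mtcars$wt, hp = mtcars$hp)  # design matrix
y <- mtcars$mpg

# Normal equations: solve (X'X) b = X'y for the coefficients that
# minimise the residual sum of squares.
beta_hat <- solve(crossprod(X), crossprod(X, y))

# lm() fits the same model (it uses a QR decomposition internally).
fit <- lm(mpg ~ wt + hp, data = mtcars)

comparison <- cbind(normal_equations = as.numeric(beta_hat),
                    lm               = as.numeric(coef(fit)))
rownames(comparison) <- colnames(X)
print(comparison)
```

The two columns should agree to numerical precision, which is all that `lm()` is doing for the revenue model as well, just with more predictors.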

Business Justification: PowerGen’s annual budgeting process requires site-level revenue forecasts. Currently these rely on manual assumptions. This regression model provides a data-driven basis for those forecasts, quantifying the marginal revenue contribution of solar production, grid availability, season, and batch — enabling more accurate and defensible projections for management and investors.

Code
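# Keep only weeks with complete values for the outcome and the numeric predictors
reg_data <- pg_clean |>
  filter(!is.na(revenue_ngn),
         !is.na(solar_production_kwh),
         !is.na(availability_pct),
         !is.na(specific_yield_kwp)) |>
  # log(revenue + 1) keeps zero-revenue weeks, where log(0) would be undefined
  mutate(log_revenue = log(revenue_ngn + 1))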

model <- lm(log_revenue ~ solar_production_kwh + availability_pct +
              specific_yield_kwp + season + batch,
            data = reg_data)

summary(model)

Call:
lm(formula = log_revenue ~ solar_production_kwh + availability_pct + 
    specific_yield_kwp + season + batch, data = reg_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.9648 -0.2785 -0.0620  0.2953  6.3989 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           9.315e+00  2.603e-01  35.787  < 2e-16 ***
solar_production_kwh  7.925e-04  4.193e-05  18.899  < 2e-16 ***
availability_pct      3.501e+00  2.708e-01  12.929  < 2e-16 ***
specific_yield_kwp   -1.153e-01  6.510e-02  -1.771 0.077032 .  
seasonRainy          -3.268e-02  7.643e-02  -0.428 0.669149    
batchBatch 2 RMG      5.328e-02  8.126e-02   0.656 0.512296    
batchBatch 3 RMG     -5.350e+00  3.000e-01 -17.833  < 2e-16 ***
batchIMG              9.305e-01  2.605e-01   3.572 0.000382 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.904 on 605 degrees of freedom
Multiple R-squared:  0.7336,    Adjusted R-squared:  0.7305 
F-statistic:   238 on 7 and 605 DF,  p-value: < 2.2e-16
Code
tidy(model, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~ round(., 4))) |>
  kable(caption = "Regression Coefficients (Dependent variable: log Revenue)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Regression Coefficients (Dependent variable: log Revenue)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 9.3149 0.2603 35.7871 0.0000 8.8038 9.8261
solar_production_kwh 0.0008 0.0000 18.8986 0.0000 0.0007 0.0009
availability_pct 3.5006 0.2708 12.9289 0.0000 2.9688 4.0323
specific_yield_kwp -0.1153 0.0651 -1.7712 0.0770 -0.2432 0.0125
seasonRainy -0.0327 0.0764 -0.4275 0.6691 -0.1828 0.1174
batchBatch 2 RMG 0.0533 0.0813 0.6557 0.5123 -0.1063 0.2129
batchBatch 3 RMG -5.3501 0.3000 -17.8330 0.0000 -5.9393 -4.7609
batchIMG 0.9305 0.2605 3.5722 0.0004 0.4189 1.4420
Code
glance(model) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) |>
  mutate(across(where(is.numeric), ~ round(., 4))) |>
  kable(caption = "Model Fit Statistics") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Model Fit Statistics
r.squared adj.r.squared sigma statistic p.value df nobs
0.7336 0.7305 0.904 238.0232 0 7 613
Code
par(mfrow = c(2, 2))
plot(model)

OLS regression diagnostic plots
Code
par(mfrow = c(1, 1))
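
The four diagnostic plots are read by eye, so it can help to back them with numeric checks. The sketch below is one possible complement, using base R only and a stand-in model fitted to the built-in `mtcars` data. The object name `fit` and the 4/n cut-off for Cook's distance are assumptions (the cut-off is a common rule of thumb, not a formal test); in practice `fit` would be replaced by the study's `model` object.

```r
# Stand-in model on built-in data; swap in the study's `model` object in practice.
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Influence: Cook's distance per observation, flagged with the 4/n rule of thumb.
cooks       <- cooks.distance(fit)
influential <- names(cooks)[cooks > 4 / nobs(fit)]

# Normality of residuals: Shapiro-Wilk test (small p-values suggest non-normality).
normality <- shapiro.test(residuals(fit))

print(influential)
print(normality$p.value)
```

Flagged observations are candidates for inspection, not automatic deletion; a site with unusual scale can be influential without being erroneous.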

Plain-Language Interpretation for Management: The model explains approximately 73% of the variation in log weekly site-level revenue (R² = 0.734, adjusted R² = 0.731), which is strong enough for operational planning and annual budgeting purposes. The most important finding is that grid availability has a positive relationship with revenue even after controlling for how much energy the panels are generating. This suggests that availability improvements are associated with higher revenue independently of weather or panel capacity.

In practical budgeting terms: a site generating 1,500 kWh/week at 90% availability can be expected to earn materially more than the same site at 70% availability, even in the same weather week. The regression quantifies that gap, giving operations management a financial case for prioritizing uptime maintenance over other expenditure categories.
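
The size of that gap can be worked out from the coefficient table. The sketch below copies the availability estimate and its confidence limits from the table and converts them to a multiplicative effect on revenue. It assumes that `availability_pct` is stored as a proportion (0.90 for 90%), which the magnitude of the coefficient suggests but the text does not state. Under that assumption the model implies roughly twice the revenue at 90% availability compared with 70% (interval roughly 1.8 to 2.2 times), or about 3.6% more revenue per percentage point of availability.

```r
# Estimates copied from the coefficient table above.
beta_avail <- 3.5006
ci_avail   <- c(low = 2.9688, high = 4.0323)

# Assumption: availability_pct is a proportion, so 90% vs 70% is a gap of 0.20.
gap <- 0.90 - 0.70

# The outcome is log(revenue + 1), so exp() gives the multiplicative change
# in (revenue + 1), which is effectively revenue at these magnitudes.
ratio_point <- exp(beta_avail * gap)
ratio_ci    <- exp(ci_avail * gap)

# Implied uplift for one percentage point of availability (0.01 on the proportion scale).
uplift_per_point <- exp(beta_avail * 0.01) - 1

print(round(c(ratio_point = ratio_point, ratio_ci,
              uplift_per_point = uplift_per_point), 3))
```

These figures hold the other predictors fixed and describe an association, so they are best treated as an upper-level planning estimate and not as a guaranteed return on maintenance spend.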


10. Integrated Findings

The five analyses collectively support a single, coherent conclusion:

PowerGen’s Nigerian portfolio revenue is primarily determined by three factors: solar production capacity (structural), grid availability (operationally manageable), and season (external, and apparently acting mainly through solar production, since the seasonal term is not significant in the regression once production is controlled). Of these, grid availability is the most actionable lever for near-term revenue improvement.

The evidence chain: EDA established that the portfolio is heterogeneous and identified two data quality issues requiring system remediation. Visualization confirmed that availability varies significantly across sites and months, and that dry-season production is systematically higher. Hypothesis testing formalized the seasonal effect (p < 0.001, ~20% uplift in dry season) and confirmed batch-level differences are statistically large (η² > 0.14). Correlation analysis showed that consumption and solar production are the strongest correlates of revenue, and that availability’s bivariate correlation with revenue is weak and negative (ρ = −0.252). Regression quantified the marginal association of each variable with revenue and showed that, once production, season, and batch are controlled, the availability coefficient is positive, significant, and economically meaningful.

Single Recommendation: PowerGen’s operations team should prioritize grid availability improvements at Batch 1 RMG sites (Gbara, Kpanbo, Maggi Bukun, Maggi Igenchi, Nantu, Ndejiko, Rokota), where the heatmap reveals the greatest within-batch availability variance and where the revenue uplift per percentage point of recovered availability is calculable from the regression coefficients. A targeted maintenance programme addressing root causes of availability drops — whether grid infrastructure, inverter faults, or utility curtailment — is the intervention most directly supported by this analysis.


11. Limitations & Further Work

1. Data completeness: Ofosu and Owode (Batch 3 RMG) have only 28–34 weeks of data compared to 55–62 weeks for other sites. Their performance tier assignments are less statistically stable and should be revisited once a full year of data is available. Both sites began recording revenue in Q1 2025.

2. Toto heteroscedasticity: Toto’s production scale (5,000+ kWh/week vs 500–2,000 kWh for other sites) introduces non-constant residual variance and gives Toto’s observations high leverage in the regression. A multi-level model with site-level random effects would better account for this structural heterogeneity.
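
As a rough sketch of what a model with site-level random effects could look like, the example below fits a random-intercept model with `nlme::lme()` on simulated data. The `nlme` package ships with standard R installations but is not one of the packages used in this study, and the simulated sites, effect sizes, and column names are all assumptions for illustration.

```r
library(nlme)  # recommended package bundled with R; not used elsewhere in this study

# Simulated stand-in data: 8 hypothetical sites observed for 30 weeks each.
set.seed(1)
sites       <- paste0("site_", 1:8)
site_effect <- setNames(rnorm(length(sites), sd = 0.8), sites)  # site-level shifts

sim <- expand.grid(site = sites, week = 1:30)
sim$availability <- runif(nrow(sim), min = 0.6, max = 1)
sim$log_revenue  <- 10 + site_effect[as.character(sim$site)] +
  1.5 * sim$availability + rnorm(nrow(sim), sd = 0.3)

# Random intercept per site: each site gets its own baseline revenue level,
# while the availability slope is shared across sites.
fit_mixed <- lme(log_revenue ~ availability, random = ~ 1 | site, data = sim)

print(fixef(fit_mixed))    # shared (fixed) effects
print(VarCorr(fit_mixed))  # between-site and residual variance components
```

With the real data, the batch indicators could stay as fixed effects while the random intercept absorbs the remaining site-to-site differences, including Toto's scale.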

3. Tariff as covariate: Tariff is batch-fixed within the 2025–2026 window. If historical data from 2023–2024 were included, tariff variation would need to be modelled explicitly as a time-varying covariate.

4. No weather data: Incorporating actual irradiance and rainfall data from NIMET or Global Solar Atlas would allow the model to separate weather effects from operational failures, improving forecast accuracy.

5. Causal inference: All findings are associational. To confirm that improving availability causes revenue increases, a difference-in-differences design tracking pre/post planned maintenance interventions would be needed.

Further Work: With richer data, a predictive model (Random Forest or XGBoost) could forecast site-level revenue 4–6 weeks ahead, enabling proactive scheduling. Cluster analysis of site profiles could also inform batch design in future commissioning rounds.


References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Enekwe, I. (2026). PowerGen NGBU weekly O&M tracker — site-level data extract, January 2025 to May 2026 [Dataset]. Finance Department, PowerGen Renewable Energy, Lagos, Nigeria. Data available on request from the author.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Code
for (pkg in c("skimr", "corrplot", "ggcorrplot", "kableExtra", "car", "broom", "ggridges", "patchwork")) {
  cat(sprintf("\n**%s:** ", pkg))
  cit <- citation(pkg)
  cat(format(cit, style = "text")[1])
  cat("\n")
}

skimr: Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2026). skimr: Compact and Flexible Summaries of Data. doi:10.32614/CRAN.package.skimr https://doi.org/10.32614/CRAN.package.skimr, R package version 2.2.2, https://CRAN.R-project.org/package=skimr.

corrplot: Wei T, Simko V (2024). R package ‘corrplot’: Visualization of a Correlation Matrix. (Version 0.95), https://github.com/taiyun/corrplot.

ggcorrplot: Kassambara A (2023). ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’. doi:10.32614/CRAN.package.ggcorrplot https://doi.org/10.32614/CRAN.package.ggcorrplot, R package version 0.1.4.1, https://CRAN.R-project.org/package=ggcorrplot.

kableExtra: Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. doi:10.32614/CRAN.package.kableExtra https://doi.org/10.32614/CRAN.package.kableExtra, R package version 1.4.0, https://CRAN.R-project.org/package=kableExtra.

car: Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://www.john-fox.ca/Companion/.

broom: Robinson D, Hayes A, Couch S, Hvitfeldt E (2026). broom: Convert Statistical Objects into Tidy Tibbles. doi:10.32614/CRAN.package.broom https://doi.org/10.32614/CRAN.package.broom, R package version 1.0.12, https://CRAN.R-project.org/package=broom.

ggridges: Wilke C (2025). ggridges: Ridgeline Plots in ‘ggplot2’. doi:10.32614/CRAN.package.ggridges https://doi.org/10.32614/CRAN.package.ggridges, R package version 0.5.7, https://CRAN.R-project.org/package=ggridges.

patchwork: Pedersen T (2025). patchwork: The Composer of Plots. doi:10.32614/CRAN.package.patchwork https://doi.org/10.32614/CRAN.package.patchwork, R package version 1.3.2, https://CRAN.R-project.org/package=patchwork.


Appendix: AI Usage Statement

Claude (Anthropic, claude.ai) was used as an analytical assistant throughout this project. Specifically, Claude assisted with:

  1. reshaping the raw Excel O&M tracker from wide to tidy format using Python;

  2. engineering the derived variables season, performance_tier, batch, tariff_ngn_per_kwh, and revenue_ngn;

  3. structuring the Quarto document and suggesting appropriate R packages for each analytical section; and

  4. drafting initial code scaffolding for the EDA, visualization, and regression sections.

All analytical decisions were made independently by the author: the choice of CS1 as the case study, the selection of PowerGen generation data as the primary dataset, the framing of the two hypotheses (seasonal effect and batch differences), the interpretation of all statistical outputs in business terms, and the integrated recommendation regarding grid availability improvements at Batch 1 RMG sites.

The author collected the data directly from PowerGen’s internal systems in their capacity as Group Financial Controller, is familiar with all variables and their operational definitions, and can explain and defend every output in this document.