Solar Energy Generation Performance Analytics: A Site-Level Study of the Solar Mini-grids Operated by PowerGen Renewable Energy

Author

Ikechukwu Enekwe

Published

May 12, 2026

1. Executive Summary

This study analyzes weekly solar energy generation and revenue performance across 16 solar mini-grid sites operated by PowerGen Renewable Energy (“PowerGen”) in Nigeria, covering the period 16 February 2025 to 3 May 2026 — a span of 62 weeks across the full portfolio. As Group Financial Controller (“GFC”), I obtained 907 weekly site-level observations from the Operations and Maintenance team’s tracker. The 16 sites are organised into four batches (Batch 1 RMG, Batch 2 RMG, Batch 3 RMG, and IMG) and the dataset encompasses 15 variables including raw operational metrics, derived financial variables, and engineered analytical variables.

The analysis applies five complementary techniques — Exploratory Data Analysis, Data Visualization, Hypothesis Testing, Correlation Analysis, and Linear Regression — to answer a central financial question: what drives revenue generation performance across PowerGen’s Nigerian portfolio, and are observed performance gaps between site batches statistically significant?

Key findings are as follows. Toto (IMG batch) is a structural outlier, producing more than three times the average weekly solar output of Batch 1 RMG sites, which reflects its status as Nigeria’s first Interconnected Mini-Grid rather than a standard isolated rural installation. Mean weekly solar generation in the rainy season is approximately 15–20% lower than in the dry season (1,533.8 kWh versus 1,888.9 kWh), a difference confirmed as statistically significant (p < 0.001). Grid availability is the strongest controllable predictor of revenue, and sites with high generator dependency consistently underperform financially. The performance ratio analysis, which compares actual solar output to the GSA theoretical benchmark, reveals meaningful variation across sites that raw production figures alone cannot capture, separating operational underperformance from structural capacity differences. Two data quality anomalies require remediation in PowerGen’s source reporting systems: negative technical losses (70 observations across 12 sites, including Maggi Bukun) and an extreme consumption reading at Maggi Bukun.

The integrated recommendation is that PowerGen’s operations team prioritise grid availability improvements and reduce generator dependency at Batch 1 RMG sites. These sites have the most inconsistent uptime across the portfolio, meaning they are frequently offline or running on a backup generator when they should be producing solar energy. The regression analysis in this study estimates how much additional revenue each site stands to gain per percentage-point improvement in grid uptime, giving management a clear financial case for directing maintenance resources to these sites first.

2. Professional Disclosure

Job Title: Group Financial Controller
Organization: PowerGen Renewable Energy
Sector: Utilities — Renewable Energy

PowerGen Renewable Energy designs, builds, and operates solar installations. The Company operates two business models – Grids and Commercial & Industrial. This analysis focuses on the grids business as it guarantees sufficient and rich data for analytical purposes. As Group Financial Controller, my responsibilities include financial reporting, revenue assurance, operational performance monitoring, and budget management across a portfolio of active solar installations. The five analytical techniques chosen for this study are directly relevant to my day-to-day work:

Technique 1 — Exploratory Data Analysis:
During the financial close process, I review site generation data from the Operations and Maintenance team’s tracker for completeness, plausibility, and anomalies. I look for missing values, outliers, and distributional patterns that could distort reported revenue figures or mask underperforming assets. In some instances, I have identified negative revenue or negative consumption values which, if left unresolved, would distort the financial results presented to the Executives.

Technique 2 — Data Visualization:
Monthly and quarterly performance reports submitted to the Executives and investors rely on visual communication of generation and revenue trends. Choosing the right chart type determines whether non-technical stakeholders can act on the data. This technique directly supports my reporting function. Investors are usually keen to assess the performance of the power grids as a prerequisite for approving further investment.

Technique 3 — Hypothesis Testing:
A recurring management question is whether performance differences observed between site batches reflect genuine operational differences or merely random variation. Formal hypothesis testing gives me a statistically defensible answer to replace what would otherwise be subjective judgement in performance reviews and board presentations.

Technique 4 — Correlation Analysis:
Revenue assurance requires understanding which variables are leading indicators of financial outcomes. From a basic understanding of the energy business, revenue is driven mainly by two variables: tariff and consumption. However, other variables, such as solar PV yield and technical losses, may also be associated with the performance of a mini-grid site. Correlation analysis quantifies these associations but does not by itself establish causation. Taking these variables into account when reporting on performance provides additional insight to management.

These insights also enable management to decide on resource allocation to each site and identify early indications of energy theft.

Technique 5 — Linear Regression:
PowerGen’s annual budget includes revenue projections by site. Regression provides a data-driven basis for those projections, quantifying how much revenue is expected per kWh of solar generation, per percentage point of availability, and per season — replacing assumption-driven estimates with empirically fitted parameters. Seasonality plays an important role in energy yield and site-level performance.

3. Data Collection & Sampling

Item	Details
Source	Operations and Maintenance tracker of PowerGen Renewable Energy
Collection Method	Direct extraction from the NGBU Grid Scorecard workbook in my capacity as Group Financial Controller. Reshaped from wide to tidy (long) format and enriched with derived variables using Python prior to analysis in R.
Sampling Frame	All 16 active mini-grid solar sites in PowerGen’s portfolio as at May 2026, organized across four batches: Batch 1 RMG (7 sites), Batch 2 RMG (6 sites), Batch 3 RMG (2 sites), and IMG/Toto (1 site).
Sample Size	The total number of observations used for the analysis is 907.
Time Period	16 February 2025 to 3 May 2026 (62 weeks across the full portfolio; individual site ranges vary — Maggi Igenchi is the earliest site with data from 16 February 2025, while most sites commenced recording from 9 March 2025; Ofosu and Owode began in June 2025)
Ethical Notes	The data used has been accessed in my professional capacity with management authorization. Site names are operational identifiers; no anonymization required.
Data Sharing	The data published is being used for academic purposes only. This is consistent with PowerGen’s data governance policy.

Tariff History

Sites	Tariff History
Batch 1 & 2 RMG (13 sites)	₦240/kWh before May 2024 → ₦540/kWh from May 2024
Batch 3 RMG — Ofosu, Owode	₦650/kWh throughout
IMG — Toto	₦165 pre-Jan 2024 → ₦195 Jan–Mar 2024 → ₦450 from Apr 2024
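
The first rows of the glimpse output in Section 4 are consistent with weekly revenue being metered consumption multiplied by the applicable tariff. The sketch below checks that relationship on three Gbara rows copied from that output. The object and column names (check, implied_revenue) are illustrative and are not part of the original pipeline.

```r
# Three Gbara rows copied from the glimpse output in Section 4
check <- data.frame(
  consumption_kwh    = c(1217.25, 1020.26, 945.46),
  tariff_ngn_per_kwh = c(540, 540, 540),
  revenue_ngn        = c(657315.0, 550940.4, 510548.4)
)

# Assumed derivation: revenue = metered consumption x tariff
check$implied_revenue <- check$consumption_kwh * check$tariff_ngn_per_kwh

# TRUE when implied and reported revenue agree within rounding
print(all(abs(check$implied_revenue - check$revenue_ngn) < 0.01))
```

The same multiplication reproduces the ₦122,647,500 revenue on the Maggi Bukun outlier row (227,125 kWh at ₦540/kWh), which is why a single consumption error passes straight through to revenue.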

4. Data Description

Code

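# Load required packages
library(tidyverse)
library(lubridate)   # floor_date(); attached by tidyverse >= 2.0.0, loaded explicitly for older versions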
library(skimr)
library(corrplot)
library(ggcorrplot)
library(scales)
library(knitr)
library(kableExtra)
library(car)
library(broom)
library(ggridges)
library(patchwork)

# Load data
pg <- read_csv("powergen_grid_data_clean.csv", show_col_types = FALSE)

# Convert types
pg <- pg |>
  mutate(
    week_ending      = as.Date(week_ending),
    site             = as.factor(site),
    batch            = factor(batch, levels = c("Batch 1 RMG", "Batch 2 RMG",
                                                 "Batch 3 RMG", "IMG")),
    season           = factor(season, levels = c("Dry", "Rainy")),
    performance_tier = factor(performance_tier, levels = c("Low", "Medium", "High"))
  ) |>
  mutate(
    # Performance ratio: actual solar production relative to GSA theoretical benchmark.
    # NOTE: this divides total site kWh by gsa_average without normalising by installed
    # capacity, so values are on the order of tens (see glimpse below), not centred on 1.0.
    performance_ratio = ifelse(gsa_average > 0, solar_production_kwh / gsa_average, NA_real_),
    # Energy mix: share of total production coming from solar (vs generator)
    total_production  = solar_production_kwh + generator_production_kwh,
    solar_share_pct   = ifelse(total_production > 0,
                               solar_production_kwh / total_production, NA_real_)
  )

glimpse(pg)

Rows: 907
Columns: 18
$ site                             <fct> Gbara, Gbara, Gbara, Gbara, Gbara, Gb…
$ week_ending                      <date> 2025-03-09, 2025-03-16, 2025-03-23, …
$ solar_production_kwh             <dbl> 2207, 2130, 2266, 2241, 1797, 1480, 1…
$ generator_production_kwh         <dbl> 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 46…
$ specific_yield_kwp               <dbl> 3.742708, 3.612129, 3.842762, 3.80036…
$ gsa_average                      <dbl> 28.50000, 28.50000, 28.50000, 28.5442…
$ customer_metered_consumption_kwh <dbl> 1217.25, 1020.26, 945.46, 1071.86, 89…
$ losses_pct                       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ availability_pct                 <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ batch                            <fct> Batch 1 RMG, Batch 1 RMG, Batch 1 RMG…
$ tariff_ngn_per_kwh               <dbl> 540, 540, 540, 540, 540, 540, 540, 54…
$ revenue_ngn                      <dbl> 657315.0, 550940.4, 510548.4, 578804.…
$ season                           <fct> Dry, Dry, Dry, Dry, Rainy, Rainy, Rai…
$ performance_tier                 <fct> Medium, Medium, Medium, Medium, Mediu…
$ data_flag                        <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
$ performance_ratio                <dbl> 77.43860, 74.73684, 79.50877, 78.5097…
$ total_production                 <dbl> 2207.0, 2130.0, 2266.0, 2241.0, 1797.…
$ solar_share_pct                  <dbl> 1.0000000, 1.0000000, 1.0000000, 1.00…
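
The performance_ratio values above sit near 77 rather than near 1.0 because total site kWh is divided by gsa_average, which looks like a per-kWp yield. One possible capacity-normalised version is sketched below. It rests on two assumptions that the source does not confirm: that specific_yield_kwp is a daily kWh/kWp figure, and that gsa_average is a weekly kWh/kWp benchmark. The column name performance_ratio_norm is hypothetical.

```r
# First three Gbara rows from the glimpse output above
gbara <- data.frame(
  specific_yield_kwp = c(3.742708, 3.612129, 3.842762),
  gsa_average        = c(28.5, 28.5, 28.5)
)

# ASSUMPTION: specific_yield_kwp is daily kWh/kWp and gsa_average is weekly kWh/kWp.
# Under that assumption, 7 x daily yield is directly comparable with the benchmark.
gbara$performance_ratio_norm <- ifelse(
  gbara$gsa_average > 0,
  (gbara$specific_yield_kwp * 7) / gbara$gsa_average,
  NA_real_
)

print(round(gbara$performance_ratio_norm, 3))
```

Under these assumptions the three rows fall between roughly 0.89 and 0.94, a scale on which a 1.0 reference line is meaningful. If the units differ from those assumed, the factor of 7 would need to change accordingly.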

Code

skim(pg)

Data summary
Name	pg
Number of rows	907
Number of columns	18
_______________________
Column type frequency:
character	1
Date	1
factor	4
numeric	12
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	empty	n_unique	whitespace
data_flag	906	0	19	19	0	1	0

Variable type: Date

skim_variable	n_missing	complete_rate	min	max	median	n_unique
week_ending	0	1	2025-02-16	2026-05-03	2025-10-05	62

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
site	1	FALSE	16	Mag: 62, Duk: 61, Eba: 61, Gba: 61
batch	1	FALSE	4	Bat: 421, Bat: 363, Bat: 62, IMG: 61
season	1	FALSE	2	Rai: 512, Dry: 395
performance_tier	1	FALSE	3	Hig: 337, Med: 298, Low: 272

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
solar_production_kwh	0	1.00	1688.48	1556.09	0.00	858.50	1315.00	1840.00	11507.00	▇▁▁▁▁
generator_production_kwh	25	0.97	464.36	1544.01	0.00	0.00	0.65	126.58	11340.00	▇▁▁▁▁
specific_yield_kwp	1	1.00	2.33	0.92	0.00	1.91	2.48	2.91	4.67	▂▂▇▅▁
gsa_average	0	1.00	26.65	1.01	24.52	26.55	26.55	26.55	29.02	▂▁▇▁▂
customer_metered_consumption_kwh	47	0.95	2329.03	8338.19	0.00	703.93	1142.94	1640.50	227125.00	▇▁▁▁▁
losses_pct	250	0.72	-0.33	12.13	-310.56	0.02	0.03	0.13	1.00	▁▁▁▁▇
availability_pct	264	0.71	0.85	0.22	0.00	0.80	0.92	0.99	1.00	▁▁▁▂▇
tariff_ngn_per_kwh	0	1.00	541.47	37.03	450.00	540.00	540.00	540.00	650.00	▁▁▇▁▁
revenue_ngn	47	0.95	1182335.42	4400663.14	0.00	380123.55	617187.60	885870.00	122647500.00	▇▁▁▁▁
performance_ratio	0	1.00	63.64	59.34	0.00	32.00	49.19	68.25	396.52	▇▁▁▁▁
total_production	25	0.97	2129.99	2817.55	0.00	919.48	1383.30	1912.20	20220.00	▇▁▁▁▁
solar_share_pct	43	0.95	0.92	0.17	0.00	0.92	1.00	1.00	1.00	▁▁▁▁▇

Code

missing_summary <- pg |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Count") |>
  mutate(
    Total       = nrow(pg),
    Missing_Pct = round(Missing_Count / Total * 100, 1)
  ) |>
  filter(Missing_Count > 0) |>
  arrange(desc(Missing_Pct))

kable(missing_summary, caption = "Missing Values by Variable") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Missing Values by Variable
Variable	Missing_Count	Total	Missing_Pct
data_flag	906	907	99.9
availability_pct	264	907	29.1
losses_pct	250	907	27.6
customer_metered_consumption_kwh	47	907	5.2
revenue_ngn	47	907	5.2
solar_share_pct	43	907	4.7
generator_production_kwh	25	907	2.8
total_production	25	907	2.8
specific_yield_kwp	1	907	0.1

Code

pg |>
  count(batch, site, name = "n_weeks") |>
  arrange(batch, desc(n_weeks)) |>
  kable(caption = "Observations per Site") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Observations per Site
batch	site	n_weeks
Batch 1 RMG	Maggi Igenchi	62
Batch 1 RMG	Gbara	61
Batch 1 RMG	Maggi Bukun	61
Batch 1 RMG	Nantu	61
Batch 1 RMG	Ndejiko	61
Batch 1 RMG	Rokota	60
Batch 1 RMG	Kpanbo	55
Batch 2 RMG	Dukugi	61
Batch 2 RMG	Ebangi	61
Batch 2 RMG	Gbade	61
Batch 2 RMG	Sachi Nku	61
Batch 2 RMG	Sosa	61
Batch 2 RMG	Danchitagi	58
Batch 3 RMG	Ofosu	34
Batch 3 RMG	Owode	28
IMG	Toto	61

5. Exploratory Data Analysis (EDA)

Theory: Exploratory Data Analysis (EDA) is the systematic examination of a dataset’s structure, distributions, and anomalies before formal modelling (Adi, 2026, Ch. 9). The objective is to understand what the data contains, where it is incomplete, and where it may violate modelling assumptions. Anscombe’s Quartet (1973) illustrates why summary statistics alone are insufficient: four datasets with near-identical means, variances, and correlations exhibit fundamentally different patterns when plotted. EDA guards against this by combining numerical summaries with visual inspection.
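
As a brief illustration of the Anscombe point, the quartet ships with base R as the anscombe data frame. The sketch below is independent of the PowerGen data, and the object name quartet_summary is illustrative. It computes the same summaries for each of the four x/y pairs.

```r
# Anscombe's quartet is available in base R (datasets::anscombe)
data(anscombe)

# Summarise each of the four x/y pairs with the same statistics
quartet_summary <- data.frame(
  set    = 1:4,
  mean_x = sapply(1:4, function(i) mean(anscombe[[paste0("x", i)]])),
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  var_x  = sapply(1:4, function(i) var(anscombe[[paste0("x", i)]])),
  cor_xy = sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                       anscombe[[paste0("y", i)]]))
)

print(round(quartet_summary, 2))
```

The four rows are almost indistinguishable numerically, yet scatter plots of the four sets look entirely different. This is the reason the histograms and boxplots in this section accompany the skim summaries rather than replace them.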

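Business Justification: As the GFC, I rely on the O&M trackers to build my site performance reports. Undetected outliers or data anomalies flow directly into reported figures: the single erroneous consumption reading at Maggi Bukun identified below would, if left in place, add ₦122,647,500 of revenue to one week.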

Code

p1 <- ggplot(pg, aes(x = solar_production_kwh)) +
  geom_histogram(bins = 40, fill = "#2E86AB", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = comma) +
  labs(title = "Solar Production (kWh)", x = "kWh", y = "Count") +
  theme_minimal()

p2 <- ggplot(pg, aes(x = availability_pct)) +
  geom_histogram(bins = 30, fill = "#A23B72", colour = "white", alpha = 0.85) +
  scale_x_continuous(labels = percent) +  # availability_pct is stored as a fraction (0-1)
  labs(title = "Grid Availability (%)", x = "Availability", y = "Count") +
  theme_minimal()

p3 <- ggplot(pg, aes(x = batch, y = solar_production_kwh, fill = batch)) +
  geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 1.5) +
  scale_y_continuous(labels = comma) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Solar Production by Batch", x = "Batch", y = "kWh", fill = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

p4 <- ggplot(pg, aes(x = performance_tier, y = solar_production_kwh, fill = performance_tier)) +
  geom_boxplot(alpha = 0.8, outlier.colour = "red", outlier.size = 1.5) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("Low" = "#E74C3C", "Medium" = "#F39C12", "High" = "#27AE60")) +
  labs(title = "Solar Production by Performance Tier", x = "Tier", y = "kWh", fill = NULL) +
  theme_minimal() +
  theme(legend.position = "none")

(p1 + p2) / (p3 + p4)

Code

# DATA QUALITY ISSUE 1: Negative losses
neg_losses <- pg |> filter(!is.na(losses_pct), losses_pct < 0)
cat("DATA QUALITY ISSUE 1 — Negative technical losses:\n")

DATA QUALITY ISSUE 1 — Negative technical losses:

Code

cat(sprintf("  %d observation(s) with losses_pct < 0\n", nrow(neg_losses)))

  70 observation(s) with losses_pct < 0

Code

cat(sprintf("  Affected sites: %s\n", paste(unique(neg_losses$site), collapse = ", ")))

  Affected sites: Gbara, Kpanbo, Maggi Bukun, Maggi Igenchi, Nantu, Rokota, Danchitagi, Dukugi, Ebangi, Gbade, Sosa, Ofosu

Code

cat(sprintf("  Min value: %.2f (losses_pct is stored as a fraction)\n\n", min(neg_losses$losses_pct)))

  Min value: -310.56 (losses_pct is stored as a fraction)

Code

# DATA QUALITY ISSUE 2: Extreme consumption outlier
outlier <- pg |>
  filter(!is.na(customer_metered_consumption_kwh)) |>
  filter(customer_metered_consumption_kwh > 50000)
cat("DATA QUALITY ISSUE 2 — Extreme consumption outlier:\n")

DATA QUALITY ISSUE 2 — Extreme consumption outlier:

Code

print(outlier |> select(site, week_ending, solar_production_kwh,
                         customer_metered_consumption_kwh, revenue_ngn))

# A tibble: 1 × 5
  site       week_ending solar_production_kwh customer_metered_con…¹ revenue_ngn
  <fct>      <date>                     <dbl>                  <dbl>       <dbl>
1 Maggi Buk… 2026-02-08                   729                 227125   122647500
# ℹ abbreviated name: ¹customer_metered_consumption_kwh

Code

cat("\n  Note: Maggi Bukun's 227,125 kWh reading on 2026-02-08 is ~300x its typical\n")


  Note: Maggi Bukun's 227,125 kWh reading on 2026-02-08 is ~300x its typical

Code

cat("  weekly consumption and is almost certainly a data entry error.\n")

  weekly consumption and is almost certainly a data entry error.

Code

cat("  This row is excluded from revenue-based analyses.\n")

  This row is excluded from revenue-based analyses.

Code

# Create clean dataset
pg_clean <- pg |>
  filter(!(site == "Maggi Bukun" &
           !is.na(customer_metered_consumption_kwh) &
           customer_metered_consumption_kwh > 50000))

cat(sprintf("\nAnalysis dataset: %d observations (1 outlier row removed)\n", nrow(pg_clean)))


Analysis dataset: 906 observations (1 outlier row removed)

Code

# Performance ratio: actual solar output vs GSA theoretical benchmark
pr_summary <- pg |>
  filter(!is.na(performance_ratio)) |>
  group_by(site, batch) |>
  summarise(
    avg_ratio         = mean(performance_ratio, na.rm = TRUE),
    avg_specific_yield = mean(specific_yield_kwp, na.rm = TRUE),
    .groups = "drop"
  ) |>
  arrange(desc(avg_ratio))

# Chart: Performance ratio per site (reference line at 1.0 = meeting benchmark)
p_ratio <- ggplot(pr_summary,
                  aes(x = fct_reorder(site, avg_ratio), y = avg_ratio, fill = batch)) +
  geom_col(alpha = 0.85) +
  geom_hline(yintercept = 1.0, linetype = "dashed", colour = "red", linewidth = 0.8) +
  annotate("text", x = 2, y = 1.05, label = "Benchmark (1.0)", colour = "red", size = 3) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip() +
  labs(
    title    = "Average Performance Ratio by Site (Solar Output ÷ GSA Benchmark)",
    subtitle = "Red dashed line = theoretical benchmark. Above = outperforming; Below = underperforming",
    x = NULL, y = "Performance Ratio", fill = "Batch"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

# Chart: Specific yield per site
p_yield <- ggplot(pr_summary,
                  aes(x = fct_reorder(site, avg_specific_yield),
                      y = avg_specific_yield, fill = batch)) +
  geom_col(alpha = 0.85) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip() +
  labs(
    title = "Average Specific Yield by Site (kWh per kWp installed)",
    subtitle = "Normalises production by installed capacity — makes sites directly comparable",
    x = NULL, y = "Specific Yield (kWh/kWp)", fill = "Batch"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

p_ratio / p_yield

Performance ratio and specific yield distributions by site

Code

# Energy mix: proportion of energy from solar vs generator per site
pg |>
  filter(!is.na(solar_production_kwh), !is.na(generator_production_kwh)) |>
  group_by(site, batch) |>
  summarise(
    avg_solar     = mean(solar_production_kwh, na.rm = TRUE),
    avg_generator = mean(generator_production_kwh, na.rm = TRUE),
    .groups = "drop"
  ) |>
  pivot_longer(cols = c(avg_solar, avg_generator),
               names_to = "source", values_to = "avg_kwh") |>
  mutate(source = case_when(
    source == "avg_solar"     ~ "Solar",
    source == "avg_generator" ~ "Generator",
    TRUE ~ source
  )) |>
  ggplot(aes(x = fct_reorder(site, avg_kwh), y = avg_kwh, fill = source)) +
  geom_col(position = "stack", alpha = 0.85) +
  scale_fill_manual(values = c("Solar" = "#F4A261", "Generator" = "#457B9D")) +
  scale_y_continuous(labels = comma) +
  coord_flip() +
  labs(
    title    = "Average Weekly Energy Mix by Site — Solar vs Generator (kWh)",
    subtitle = "Sites with high generator share face higher fuel costs and operational risk",
    x = NULL, y = "Average Weekly kWh", fill = "Energy Source"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Data Quality Issues Identified

Issue	Site	Detail	Action Taken
Negative technical losses (-6.59% at Maggi Bukun)	Maggi Bukun and 11 other sites	70 observations have losses_pct below zero. Technical losses are the energy lost between generation and delivery to customers, so a negative loss is physically impossible.	Flagged in analysis; recommend source system correction
Extreme consumption outlier	Maggi Bukun	The week ending 8 Feb 2026 records customer consumption of 227,125 kWh for the site, most likely a cumulative figure entered in error.	Row excluded from all revenue-based analyses

Plain-Language Interpretation for Management: The distribution of solar production is right-skewed, meaning a small number of sites, primarily Toto, generate significantly more energy than the rest. Toto hosts the first Interconnected Mini-Grid in Nigeria and is a peri-urban community about one hour from Abuja, comprising individual homes, small-scale businesses, and public utility centres.

Most sites register consumption of 500 to 2,000 kWh per week on average. Grid availability is high across most sites (greater than 90%) but drops noticeably for Toto and Owode. Given the financial significance of these two sites, their reduced availability weighs on both site-level and portfolio-level performance.

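The performance ratio analysis is intended to show which sites are meeting their theoretical solar potential (GSA benchmark). On a capacity-normalised scale, sites with a ratio above 1.0 are exceeding their benchmark, a sign of well-maintained panels and favourable local conditions, while sites consistently below 1.0 warrant investigation: the shortfall could reflect panel degradation, shading, inverter inefficiency, or data recording issues. As computed in Section 4, however, the ratio divides total site kWh by gsa_average and has a mean of 63.64, so values should be compared between sites rather than against the 1.0 line until production is normalised by installed capacity. With that caveat, the metric is meant to separate operational underperformance from structural capacity differences, giving management a fairer basis for site comparison than raw kWh output alone.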

The energy mix analysis shows how much of each site’s weekly output came from solar panels versus the backup diesel generator. Sites with high generator dependency face elevated fuel costs that compress margins. From a GFC perspective, a rising generator share at any site is an early warning signal — it often precedes a revenue shortfall and should trigger an operations review before it appears in the financial statements.

In addition, two data recording errors were identified. First, the tracker recorded negative technical losses in 70 weekly observations across 12 sites, including Maggi Bukun. This is physically impossible and points to a meter reading or data entry problem. Second, one week in February 2026 shows Maggi Bukun customer consumption roughly 300 times higher than normal, almost certainly a data entry error. Both should be corrected in the source system. The affected Maggi Bukun row has been removed from the revenue analysis to avoid distorting results.

6. Data Visualization

Theory: The Grammar of Graphics (Wilkinson, 2005, cited in Adi, 2026, Ch. 10) holds that every effective chart is built from a systematic mapping of data variables to visual properties — position, colour, size, shape. Chart selection should be governed by the data type and the story being told, not aesthetic preference.

A time series calls for a line chart; distributions call for histograms or ridge plots; categorical comparisons call for bar charts or boxplots; relationships between continuous variables call for scatter plots.

Business Justification: PowerGen’s management and investors require visual performance summaries that communicate site-level and portfolio-level trends. I work closely with the Financial Planning and Analysis Manager to prepare and present these visuals.

Code

pg |>
  group_by(week_ending, batch) |>
  summarise(total_solar = sum(solar_production_kwh, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = week_ending, y = total_solar, colour = batch)) +
  geom_line(linewidth = 0.8, alpha = 0.9) +
  geom_smooth(se = FALSE, linewidth = 0.4, linetype = "dashed", alpha = 0.5) +
  scale_y_continuous(labels = comma) +
  scale_colour_brewer(palette = "Set1") +
  labs(
    title    = "Weekly Solar Production by Batch — Feb 2025 to May 2026",
    subtitle = "Dashed lines show smoothed trend",
    x = NULL, y = "Total kWh", colour = "Batch"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Weekly solar production over time by batch

Code

pg |>
  filter(!is.na(availability_pct)) |>
  mutate(month = floor_date(week_ending, "month")) |>
  group_by(site, month) |>
  summarise(avg_avail = mean(availability_pct, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = month, y = fct_reorder(site, avg_avail), fill = avg_avail)) +
  geom_tile(colour = "white", linewidth = 0.3) +
  scale_fill_gradient2(low = "#E74C3C", mid = "#F39C12", high = "#27AE60",
                       midpoint = 0.75, labels = percent) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "2 months") +
  labs(
    title    = "Monthly Average Grid Availability by Site",
    subtitle = "Green = high availability, Red = low availability",
    x = NULL, y = NULL, fill = "Availability"
  ) +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Code

pg_clean |>
  filter(!is.na(revenue_ngn)) |>
  group_by(site, season, batch) |>
  summarise(avg_revenue = mean(revenue_ngn, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = fct_reorder(site, avg_revenue), y = avg_revenue, fill = season)) +
  geom_col(position = "dodge", alpha = 0.85) +
  scale_y_continuous(labels = comma) +
  scale_fill_manual(values = c("Dry" = "#F4A261", "Rainy" = "#457B9D")) +
  coord_flip() +
  labs(
    title = "Average Weekly Revenue by Site and Season (₦)",
    x = NULL, y = "Average Weekly Revenue (₦)", fill = "Season"
  ) +
  theme_minimal() +
  theme(legend.position = "bottom")

Average weekly revenue by site and season

Code

pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(solar_production_kwh)) |>
  ggplot(aes(x = solar_production_kwh, y = revenue_ngn,
             colour = performance_tier, shape = season)) +
  geom_point(alpha = 0.5, size = 1.8) +
  geom_smooth(method = "lm", se = TRUE, colour = "black", linewidth = 0.8) +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  scale_colour_manual(values = c("Low" = "#E74C3C", "Medium" = "#F39C12", "High" = "#27AE60")) +
  labs(
    title    = "Solar Production vs Revenue",
    subtitle = "Black line = OLS fit; colour = performance tier",
    x = "Solar Production (kWh)", y = "Revenue (₦)",
    colour = "Tier", shape = "Season"
  ) +
  theme_minimal()

Code

pg |>
  filter(!is.na(solar_production_kwh)) |>
  ggplot(aes(x = solar_production_kwh, y = season, fill = season)) +
  geom_density_ridges(alpha = 0.7, scale = 1.2) +
  scale_x_continuous(labels = comma) +
  scale_fill_manual(values = c("Dry" = "#F4A261", "Rainy" = "#457B9D")) +
  labs(
    title = "Solar Production Distribution: Dry vs Rainy Season",
    x = "Solar Production (kWh)", y = NULL, fill = "Season"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Code

pg |>
  filter(!is.na(performance_ratio)) |>
  ggplot(aes(x = week_ending, y = performance_ratio, colour = batch)) +
  geom_line(alpha = 0.5, linewidth = 0.5) +
  geom_smooth(se = FALSE, linewidth = 0.9) +
  geom_hline(yintercept = 1.0, linetype = "dashed", colour = "red", linewidth = 0.7) +
  annotate("text", x = as.Date("2025-04-01"), y = 1.08,
           label = "GSA benchmark (1.0)", colour = "red", size = 3) +
  scale_colour_brewer(palette = "Set1") +
  facet_wrap(~ batch, ncol = 2) +
  labs(
    title    = "Weekly Performance Ratio Over Time by Batch",
    subtitle = "Above red line = outperforming theoretical benchmark",
    x = NULL, y = "Performance Ratio (Actual ÷ GSA)", colour = "Batch"
  ) +
  theme_minimal() +
  theme(legend.position = "none")

Performance ratio vs GSA benchmark by batch and season

Plain-Language Interpretation for Management: Six charts tell one story. Chart 1 shows that Batch 1 RMG leads portfolio output most weeks, but its production is more volatile than that of Batch 2 RMG. Batch 2 RMG sites were developed after Batch 1 and incorporated some of the technical improvements recommended following the Batch 1 projects.

Chart 2 reveals that grid availability is close to 100% for most Batch 1 and 2 sites but significantly lower for Toto and Owode. Toto and Owode have reported inconsistent grid availability due to technical issues with the diesel generator and the Battery Energy Storage System.

Chart 3 confirms dry-season revenue consistently exceeds rainy-season revenue across all sites. Chart 4 demonstrates a strong linear relationship between solar production and revenue. Chart 5 confirms the seasonal effect — dry-season production is shifted higher, with less variation than the rainy season.

Chart 6 introduces the performance ratio: actual solar output divided by the GSA theoretical benchmark. On a capacity-normalised scale, batches that consistently sit above 1.0 are outperforming their theoretical potential, reflecting well-maintained panels and efficient operations, and batches that dip during the rainy season highlight where weather-related underperformance is greatest and where operational interventions could recover the most output. Because the ratio as computed is not normalised by installed capacity (its mean is 63.64), the chart is best read for trends within each batch rather than against the 1.0 reference line.

7. Hypothesis Testing

Theory: Hypothesis testing is the formal framework for deciding whether observed differences in data are statistically real or attributable to chance (Adi, 2026, Ch. 11). A null hypothesis (H₀) posits no effect; an alternative hypothesis (H₁) posits that an effect exists and may specify its direction. The p-value is the probability of observing data at least as extreme as the sample if H₀ were true. We report effect sizes (rank-biserial r, Cohen’s d, or η², depending on the test) alongside p-values, as statistical significance alone does not convey practical importance. Two tests are used: the Wilcoxon Rank-Sum test (a non-parametric alternative to the t-test, used when distributions are skewed) and one-way ANOVA with Tukey HSD post-hoc comparisons.

Business Justification: PowerGen management regularly compares batch performance in operational reviews. Without formal testing, conclusions risk being driven by selective attention to favourable weeks. The two hypotheses below directly inform resource allocation and tariff review decisions that I support as GFC.

Hypothesis 1 — Seasonal Effect on Solar Production

H₀: Mean weekly solar production is equal in dry and rainy seasons
H₁: Mean weekly solar production is higher in the dry season (one-tailed)

Code

dry_solar   <- pg$solar_production_kwh[pg$season == "Dry"   & !is.na(pg$solar_production_kwh)]
rainy_solar <- pg$solar_production_kwh[pg$season == "Rainy" & !is.na(pg$solar_production_kwh)]

sw_dry   <- shapiro.test(sample(dry_solar,   min(500, length(dry_solar))))
sw_rainy <- shapiro.test(sample(rainy_solar, min(500, length(rainy_solar))))

cat("Normality check (Shapiro-Wilk, sampled):\n")

Normality check (Shapiro-Wilk, sampled):

Code

cat(sprintf("  Dry season:   W = %.4f, p = %.4f\n", sw_dry$statistic,   sw_dry$p.value))

  Dry season:   W = 0.7144, p = 0.0000

Code

cat(sprintf("  Rainy season: W = %.4f, p = %.4f\n", sw_rainy$statistic, sw_rainy$p.value))

  Rainy season: W = 0.7044, p = 0.0000

Code

wilcox_result <- wilcox.test(dry_solar, rainy_solar, alternative = "greater")
cat("\nWilcoxon Rank-Sum Test (one-sided: dry > rainy):\n")


Wilcoxon Rank-Sum Test (one-sided: dry > rainy):

Code

print(wilcox_result)


    Wilcoxon rank sum test with continuity correction

data:  dry_solar and rainy_solar
W = 114335, p-value = 0.0003651
alternative hypothesis: true location shift is greater than 0

Code

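n1 <- length(dry_solar); n2 <- length(rainy_solar)
# W counts the (dry, rainy) pairs in which dry exceeds rainy (ties count one half),
# so 2W/(n1*n2) - 1 is positive when dry-season production tends to be higher
r_effect <- (2 * wilcox_result$statistic) / (n1 * n2) - 1
cat(sprintf("\nEffect size (rank-biserial r): %.3f\n", r_effect))


Effect size (rank-biserial r): 0.131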

Code

cat(sprintf("Dry season mean:   %.1f kWh\n", mean(dry_solar)))

Dry season mean:   1888.9 kWh

Code

cat(sprintf("Rainy season mean: %.1f kWh\n", mean(rainy_solar)))

Rainy season mean: 1533.8 kWh

Code

cat(sprintf("Difference:        %.1f kWh (%.1f%%)\n",
            mean(dry_solar) - mean(rainy_solar),
            (mean(dry_solar) - mean(rainy_solar)) / mean(rainy_solar) * 100))

Difference:        355.1 kWh (23.2%)
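
The rank-biserial effect size can be obtained directly from the W statistic that `wilcox.test()` reports. A minimal sketch on toy data (the vectors `x` and `y` are hypothetical), using the convention that a positive value means the first sample tends to be larger:

```r
# Toy data (hypothetical): every value in x exceeds every value in y
x <- c(5, 6, 7)
y <- c(1, 2, 3)

wt <- wilcox.test(x, y, alternative = "greater")

# In R, the reported W is the Mann-Whitney U for the first sample:
# the number of (x, y) pairs with x > y (ties count one half)
u_x <- unname(wt$statistic)

# Rank-biserial correlation: +1 if x always wins, -1 if y always wins
rank_biserial <- 2 * u_x / (length(x) * length(y)) - 1

cat(sprintf("U = %g, rank-biserial r = %.2f\n", u_x, rank_biserial))
```

Under this convention the sign of r follows the direction of H₁ (dry > rainy), which makes the effect size easier to read next to the one-sided p-value.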

Hypothesis 2 — Performance Differences Across Batches

H₀: Mean weekly solar production is equal across all four batches
H₁: At least one batch differs significantly from the others

Code

anova_model <- aov(solar_production_kwh ~ batch,
                   data = pg |> filter(!is.na(solar_production_kwh)))
summary(anova_model)

             Df    Sum Sq   Mean Sq F value Pr(>F)    
batch         3 9.066e+08 302186345     212 <2e-16 ***
Residuals   903 1.287e+09   1425526                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

levene_result <- leveneTest(solar_production_kwh ~ batch,
                            data = pg |> filter(!is.na(solar_production_kwh)))
cat("\nLevene's Test for Homogeneity of Variance:\n")


Levene's Test for Homogeneity of Variance:

Code

print(levene_result)

Levene's Test for Homogeneity of Variance (center = median)
       Df F value    Pr(>F)    
group   3  123.21 < 2.2e-16 ***
      903                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

tukey_result <- TukeyHSD(anova_model)
cat("\nTukey HSD Post-hoc Test:\n")


Tukey HSD Post-hoc Test:

Code

print(tukey_result)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = solar_production_kwh ~ batch, data = filter(pg, !is.na(solar_production_kwh)))

$batch
                              diff       lwr       upr     p adj
Batch 2 RMG-Batch 1 RMG  611.65545  391.5506  831.7604 0.0000000
Batch 3 RMG-Batch 1 RMG  520.76940  102.7440  938.7948 0.0075864
IMG-Batch 1 RMG         4120.88702 3699.8856 4541.8885 0.0000000
Batch 3 RMG-Batch 2 RMG  -90.88606 -513.1766  331.4045 0.9454810
IMG-Batch 2 RMG         3509.23156 3083.9949 3934.4683 0.0000000
IMG-Batch 3 RMG         3600.11762 3045.9287 4154.3065 0.0000000

Code

ss_total   <- sum((pg$solar_production_kwh[!is.na(pg$solar_production_kwh)] -
                   mean(pg$solar_production_kwh, na.rm = TRUE))^2)
ss_between <- sum(summary(anova_model)[[1]][["Sum Sq"]][1])
eta_sq <- ss_between / ss_total
cat(sprintf("\nEffect size (eta-squared): %.3f\n", eta_sq))


Effect size (eta-squared): 0.413

Code

cat("Interpretation: >0.14 = large effect (Cohen, 1988)\n")

Interpretation: >0.14 = large effect (Cohen, 1988)
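
Levene's test above rejects equal variances across batches, which is one of the assumptions of classical one-way ANOVA. A common robustness check is Welch's one-way test, available in base R as `oneway.test()` with `var.equal = FALSE`. The sketch below runs it on simulated data (the data frame `sim` and all of its values are hypothetical, chosen only to mimic batches with unequal spread):

```r
# Simulated batches with unequal spread (hypothetical values, not PowerGen data)
set.seed(1)
sim <- data.frame(
  batch = factor(rep(c("Batch 1 RMG", "Batch 2 RMG", "IMG"), times = c(60, 60, 30))),
  kwh   = c(rnorm(60, mean = 900,  sd = 300),
            rnorm(60, mean = 1500, sd = 500),
            rnorm(30, mean = 5000, sd = 1500))
)

# Welch's one-way test: does not assume equal variances across groups
welch <- oneway.test(kwh ~ batch, data = sim, var.equal = FALSE)
print(welch)

# Classical ANOVA for comparison, with eta-squared from its sums of squares
ss <- summary(aov(kwh ~ batch, data = sim))[[1]][["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)
cat(sprintf("Eta-squared: %.3f\n", eta_sq))
```

If the Welch test and the classical ANOVA agree, as would be expected with differences as large as those between IMG and the RMG batches, the unequal variances are unlikely to affect the conclusion.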

Plain-Language Interpretation for Management:

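Hypothesis 1: Dry season solar production is significantly higher than rainy season production (Wilcoxon rank-sum test, p < 0.001). The gap in means is approximately 23%, about 355 kWh per site per week (1,889 kWh vs 1,534 kWh). The rank-biserial effect size is small (|r| ≈ 0.13), so the seasonal shift is statistically reliable but modest relative to the week-to-week and site-to-site variation.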

Practical implication: In preparing revenue forecasts, ensure that seasonality effects are considered. A revenue adjustment of 15–20% between dry and rainy quarters might be appropriate.

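Hypothesis 2: Batch-level differences in solar production are statistically significant and large (p < 0.001, η² = 0.41). Toto (IMG) is the main driver: it produces roughly 3,500 to 4,100 kWh per week more than each RMG batch. This is because Toto is an Interconnected Mini-grid and not a regular Isolated Rural Mini-grid. Among the RMG batches, Batch 2 and Batch 3 both out-produce Batch 1 (by about 610 and 520 kWh per week respectively), while Batch 2 and Batch 3 do not differ significantly from each other (adjusted p = 0.95). Levene's test rejects equal variances across batches (p < 0.001), so the ANOVA p-values should be read with some caution, although an effect of this size is unlikely to be an artefact of that violation.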

Practical implication: Performance benchmarking must be done within batch cohorts to ensure that the comparison is like-for-like. For instance, comparing a Batch 1 RMG site to Toto is misleading and unfair — they are structurally different installations.

8. Correlation Analysis

Theory: Correlation measures the strength and direction of linear association between two continuous variables (Adi, 2026, Ch. 13). Pearson’s r is appropriate for normally distributed variables; Spearman’s ρ (rho) is preferred when distributions are skewed or ordinal. Values range from -1 (perfect negative) to +1 (perfect positive); 0 indicates no linear relationship. A correlation matrix with heatmap provides a portfolio-level view of inter-variable relationships. Crucially, correlation does not imply causation — a high correlation between two variables may reflect a common underlying driver rather than a direct causal link.
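
The difference between the two coefficients is easiest to see on a relationship that is perfectly monotonic but far from linear. In this minimal sketch (the vectors `x` and `y` are made up for illustration), Spearman's ρ captures the perfect ordering while Pearson's r is pulled down by the curvature:

```r
# Hypothetical monotonic but strongly non-linear relationship
x <- 1:10
y <- exp(x)

pearson  <- cor(x, y, method = "pearson")
spearman <- cor(x, y, method = "spearman")

cat(sprintf("Pearson r = %.3f, Spearman rho = %.3f\n", pearson, spearman))
```

This is the reason Spearman's ρ is used for the skewed production and revenue variables in this section.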

Business Justification: Understanding which operational metrics impact revenue helps PowerGen’s management to shape strategy and prioritize interventions. For instance, if grid availability has a stronger correlation with revenue than specific yield, maintenance budgets would typically be channeled towards metering and core uptime rather than PV panel efficiency. This is a resource allocation question I face as GFC when preparing the annual capital expenditure budget.

Code

corr_vars <- pg_clean |>
  filter(!is.na(revenue_ngn)) |>
  select(solar_production_kwh, generator_production_kwh, specific_yield_kwp,
         gsa_average, customer_metered_consumption_kwh, losses_pct,
         availability_pct, revenue_ngn) |>
  drop_na()

corr_matrix <- cor(corr_vars, method = "spearman")

ggcorrplot(corr_matrix,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 3.2,
           colors   = c("#E74C3C", "white", "#27AE60"),
           title    = "Spearman Correlation Matrix — PowerGen NGBU Metrics",
           ggtheme  = theme_minimal())

Spearman correlation matrix of operational and financial variables

Code

corr_with_revenue <- corr_matrix[, "revenue_ngn"]
corr_df <- data.frame(
  Variable   = names(corr_with_revenue),
  Spearman_r = round(corr_with_revenue, 3)
) |>
  filter(Variable != "revenue_ngn") |>
  arrange(desc(abs(Spearman_r)))

kable(corr_df, caption = "Spearman Correlation with Revenue (₦)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Spearman Correlation with Revenue (₦)
Variable	Spearman_r
customer_metered_consumption_kwh	1.000
solar_production_kwh	0.923
losses_pct	-0.412
generator_production_kwh	0.372
specific_yield_kwp	0.267
availability_pct	-0.252
gsa_average	-0.169

Code

c1 <- pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(customer_metered_consumption_kwh)) |>
  ggplot(aes(x = customer_metered_consumption_kwh, y = revenue_ngn)) +
  geom_point(alpha = 0.4, colour = "#2E86AB", size = 1.5) +
  geom_smooth(method = "lm", colour = "black") +
  scale_x_continuous(labels = comma) +
  scale_y_continuous(labels = comma) +
  labs(title = "Consumption vs Revenue", x = "Consumption (kWh)", y = "Revenue (₦)") +
  theme_minimal()

c2 <- pg_clean |>
  filter(!is.na(revenue_ngn), !is.na(availability_pct)) |>
  ggplot(aes(x = availability_pct, y = revenue_ngn)) +
  geom_point(alpha = 0.4, colour = "#A23B72", size = 1.5) +
  geom_smooth(method = "lm", colour = "black") +
  scale_y_continuous(labels = comma) +
  labs(title = "Availability vs Revenue", x = "Grid Availability", y = "Revenue (₦)") +
  theme_minimal()

c1 + c2

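Plain-Language Interpretation for Management: Customer consumption is perfectly rank-correlated with revenue (ρ = 1.00). This is expected, since revenue is computed from tariff and consumption. Solar production also correlates strongly with revenue (ρ = 0.92). Grid availability, by contrast, shows a weak negative pooled correlation with revenue (ρ = -0.25): across the whole portfolio, the weeks with higher availability are not the higher-revenue weeks. This most likely reflects differences between sites and batches rather than a harmful effect of availability (for example, if the largest, highest-revenue sites also record lower availability), and the regression in Section 9 finds a positive availability effect once production, season, and batch are held constant. Availability remains the more controllable variable in the short term: you cannot quickly add solar panels, but you can improve maintenance response times.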

Generator production shows a moderate positive correlation with revenue (ρ = 0.37), not a negative one. The likely reason is scale: sites that serve more load both sell more energy and run their generators more, so the pooled correlation does not isolate generator dependency. Heavy reliance on the diesel generator can still compress margins through fuel costs, but this correlation table does not by itself show that generator dependency lowers revenue. Testing that claim would require a margin measure or a within-site comparison.

Specific yield is positively correlated with solar production and, more weakly, with revenue (ρ = 0.27). Because specific yield normalises output by installed capacity, this is consistent with the view that well-utilised installations, not simply large ones, achieve stronger financial results, although the association is too weak to settle the point on its own. Two sites with identical panel capacity but different specific yields are performing differently at an operational level, and the one with the higher specific yield is producing more energy per unit of installed capacity.

GSA average correlates moderately with solar production and specific yield, which suggests that actual performance tracks theoretical potential across sites: sites in higher-irradiance locations do produce more, as the GSA model predicts. Its direct correlation with revenue is weak and slightly negative (ρ = -0.17), so irradiance potential alone says little about revenue.

Technical losses show a moderate negative correlation with revenue (ρ = -0.41): higher losses reduce billable energy. In absolute terms this is the third-strongest association in the table, ahead of availability (ρ = -0.25).
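
The availability row of the table above (ρ = -0.252) is a pooled figure across all sites. A pooled correlation can differ in sign from the correlation within each batch when batches differ in scale. The sketch below uses made-up numbers (the data frame `d` is entirely hypothetical) to show how that can happen; it illustrates the mechanism and is not evidence that this is what drives the PowerGen figures:

```r
# Hypothetical data: within each group, higher availability goes with higher
# revenue, but the larger site has lower availability overall
d <- data.frame(
  batch        = rep(c("Small sites", "Large site"), each = 5),
  availability = c(0.90, 0.92, 0.94, 0.96, 0.98,
                   0.70, 0.72, 0.74, 0.76, 0.78),
  revenue      = c(100, 110, 120, 130, 140,
                   1000, 1100, 1200, 1300, 1400)
)

# Correlation ignoring group membership
pooled <- cor(d$availability, d$revenue, method = "spearman")

# Correlation computed separately inside each group
within <- sapply(split(d, d$batch), function(g) {
  cor(g$availability, g$revenue, method = "spearman")
})

cat(sprintf("Pooled rho: %.2f\n", pooled))
print(round(within, 2))
```

Running the same within-batch calculation on `pg_clean` would show whether the negative pooled coefficient survives once batch is held fixed.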

9. Linear Regression

Theory: Ordinary Least Squares (OLS) regression estimates the linear relationship between a dependent variable and one or more predictors by minimizing the sum of squared residuals (Adi, 2026, Ch. 14). Each coefficient represents the expected change in the outcome for a one-unit increase in the predictor, holding all others constant. Diagnostic plots assess four key assumptions: linearity (Residuals vs Fitted), normality of residuals (Q-Q Plot), homoscedasticity (Scale-Location), and influential observations (Cook’s Distance). A log transformation of the outcome variable is used here to address right skew in revenue data.

Business Justification: PowerGen’s annual budgeting process requires site-level revenue forecasts. Currently these rely on manual assumptions. This regression model provides a data-driven basis for those forecasts, quantifying the marginal revenue contribution of solar production, grid availability, season, and batch — enabling more accurate and defensible projections for management and investors.

Code

reg_data <- pg_clean |>
  filter(!is.na(revenue_ngn),
         !is.na(solar_production_kwh),
         !is.na(availability_pct),
         !is.na(specific_yield_kwp),
         !is.na(performance_ratio),
         !is.na(solar_share_pct)) |>
  mutate(log_revenue = log(revenue_ngn + 1))

# Full model including performance ratio and solar share
model <- lm(log_revenue ~ solar_production_kwh + availability_pct +
              specific_yield_kwp + performance_ratio + solar_share_pct +
              season + batch,
            data = reg_data)

summary(model)


Call:
lm(formula = log_revenue ~ solar_production_kwh + availability_pct + 
    specific_yield_kwp + performance_ratio + solar_share_pct + 
    season + batch, data = reg_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.46249 -0.13764  0.08286  0.20104  1.76230 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)          11.8496624  0.1644937  72.037  < 2e-16 ***
solar_production_kwh  0.0081490  0.0004499  18.112  < 2e-16 ***
availability_pct      0.5023743  0.1254012   4.006 6.96e-05 ***
specific_yield_kwp    0.0889787  0.0241423   3.686 0.000249 ***
performance_ratio    -0.2005840  0.0116635 -17.198  < 2e-16 ***
solar_share_pct      -0.0192949  0.1705907  -0.113 0.909985    
seasonRainy          -0.1596607  0.0277743  -5.749 1.44e-08 ***
batchBatch 2 RMG      0.1161583  0.0307960   3.772 0.000178 ***
batchBatch 3 RMG      2.5719051  0.1981608  12.979  < 2e-16 ***
batchIMG              2.8558814  0.1641828  17.395  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3269 on 591 degrees of freedom
Multiple R-squared:   0.85, Adjusted R-squared:  0.8477 
F-statistic:   372 on 9 and 591 DF,  p-value: < 2.2e-16

Code

tidy(model, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~ round(., 4))) |>
  kable(caption = "Regression Coefficients (Dependent variable: log Revenue)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Regression Coefficients (Dependent variable: log Revenue)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	11.8497	0.1645	72.0372	0.0000	11.5266	12.1727
solar_production_kwh	0.0081	0.0004	18.1118	0.0000	0.0073	0.0090
availability_pct	0.5024	0.1254	4.0061	0.0001	0.2561	0.7487
specific_yield_kwp	0.0890	0.0241	3.6856	0.0002	0.0416	0.1364
performance_ratio	-0.2006	0.0117	-17.1976	0.0000	-0.2235	-0.1777
solar_share_pct	-0.0193	0.1706	-0.1131	0.9100	-0.3543	0.3157
seasonRainy	-0.1597	0.0278	-5.7485	0.0000	-0.2142	-0.1051
batchBatch 2 RMG	0.1162	0.0308	3.7719	0.0002	0.0557	0.1766
batchBatch 3 RMG	2.5719	0.1982	12.9789	0.0000	2.1827	2.9611
batchIMG	2.8559	0.1642	17.3945	0.0000	2.5334	3.1783

Code

glance(model) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) |>
  mutate(across(where(is.numeric), ~ round(., 4))) |>
  kable(caption = "Model Fit Statistics") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Model Fit Statistics
r.squared	adj.r.squared	sigma	statistic	p.value	df	nobs
0.85	0.8477	0.3269	371.9835	0	9	601

Code

par(mfrow = c(2, 2))
plot(model)

Code

par(mfrow = c(1, 1))

Plain-Language Interpretation for Management: The model explains approximately 85% of the variation in the logarithm of site-level weekly revenue (R² = 0.85, adjusted R² = 0.848, n = 601), which is strong enough for operational planning and annual budgeting purposes. The most important finding is that grid availability has a positive, statistically significant relationship with revenue (coefficient 0.50, p < 0.001) even after controlling for how much energy the panels are generating. This suggests that availability improvements are associated with higher revenue independently of weather or panel capacity.

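The performance ratio coefficient is negative (-0.20, p < 0.001): holding solar production, specific yield, and the other predictors constant, a higher performance ratio is associated with lower revenue. This does not mean that outperforming the GSA benchmark is financially harmful. Performance ratio is derived from solar output, so it overlaps heavily with solar_production_kwh and specific_yield_kwp, and its coefficient captures only the variation left over once those are fixed. The sign should therefore be treated as a likely symptom of collinearity among the production variables until a variance inflation check or a reduced model confirms otherwise, and it should not be used to set an operational target.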

The solar share coefficient is small and not statistically significant (-0.02, p = 0.91). Once production, availability, season, and batch are accounted for, the share of energy coming from solar panels rather than the diesel generator adds no detectable explanatory power for revenue. The regression therefore does not quantify a revenue cost of generator dependency; that cost is more likely to appear in fuel expenditure and margins than in billed revenue.

In practical budgeting terms: a site generating 1,500 kWh/week at 90% availability can be expected to earn more than the same site at 70% availability, and a rainy-season week is expected to bring in roughly 15% less revenue than a comparable dry-season week (coefficient -0.16 on the log scale). The regression quantifies both gaps simultaneously, giving the operations and finance teams a unified framework for prioritising maintenance expenditure.
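
Because the dependent variable is log revenue, each coefficient translates into an approximate percentage change in revenue through exp(β) − 1. The sketch below applies this to coefficients copied from the regression output above. The availability calculation assumes that `availability_pct` is stored as a fraction between 0 and 1, which the source does not confirm; if it is stored on a 0 to 100 scale, the step size would need to be 10 rather than 0.10. The result is also approximate because the model uses log(revenue + 1).

```r
# Coefficients copied from the regression output above (log-revenue model)
coefs <- c(seasonRainy = -0.1596607, `batchBatch 2 RMG` = 0.1161583)

# For a log outcome, exp(beta) - 1 is the approximate proportional change in
# revenue for a one-unit change in the predictor
pct_change <- (exp(coefs) - 1) * 100
print(round(pct_change, 1))

# Availability: effect of a 10-point improvement, ASSUMING availability_pct is
# stored as a fraction between 0 and 1 (not confirmed by the source)
beta_avail   <- 0.5023743
avail_uplift <- (exp(beta_avail * 0.10) - 1) * 100
cat(sprintf("10-point availability gain: about %.1f%% more revenue\n", avail_uplift))
```

The same conversion can be applied to the confidence limits in the coefficient table to give a range, not just a point estimate, for budgeting.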

10. Integrated Findings

The five analyses collectively support a single, coherent conclusion:

PowerGen’s Nigerian portfolio revenue is primarily determined by three factors — solar production capacity (structural), grid availability (operationally manageable), and season (external). Of these, grid availability is the most actionable lever for near-term revenue improvement.

The evidence chain: EDA established that the portfolio is heterogeneous, introduced the performance ratio as a normalised benchmark comparison tool, revealed that several sites show high generator dependency as an early financial risk signal, and identified two data quality issues requiring system remediation. Visualization confirmed that availability varies significantly across sites and months, that dry-season production is systematically higher, and that batch-level performance relative to the GSA benchmark fluctuates meaningfully across seasons. Hypothesis testing formalized the seasonal effect (p < 0.001, roughly 23% higher mean production in the dry season) and confirmed that batch-level differences are statistically large (η² = 0.41). Correlation analysis showed that consumption and solar production are the variables most strongly associated with revenue, while the pooled correlations for availability (ρ = -0.25) and generator production (ρ = 0.37) run against the expected direction and appear to be driven by scale differences between sites. Regression, which controls for production, season, and batch, found a statistically significant positive coefficient for availability, a significant negative coefficient for the rainy season, a negative coefficient for performance ratio that is probably a collinearity artefact, and no significant effect of solar share.

Single Recommendation: PowerGen’s operations team should prioritize grid availability improvements at Batch 1 RMG sites (Gbara, Kpanbo, Maggi Bukun, Maggi Igenchi, Nantu, Ndejiko, Rokota), where the heatmap reveals the greatest within-batch availability variance and where the revenue uplift per percentage point of recovered availability is calculable from the regression coefficients. A targeted maintenance programme addressing root causes of availability drops — whether grid infrastructure, inverter faults, or utility curtailment — is the intervention most directly supported by this analysis.

11. Limitations & Further Work

1. Data completeness: Ofosu and Owode (Batch 3 RMG) have only 28–34 weeks of data compared to 55–62 weeks for other sites. Their performance tier assignments are less statistically stable and should be revisited once a full year of data is available. Both sites began recording revenue in Q1 2025.

2. Toto heteroscedasticity: Toto’s production scale (5,000+ kWh/week vs 500–2,000 kWh for other sites) is likely to introduce non-constant residual variance in the regression and to give Toto’s observations disproportionate influence on the fitted coefficients. A multi-level model with site-level random effects would better account for this structural heterogeneity.

3. Tariff as covariate: Tariff is batch-fixed within the 2025–2026 window. If historical data from 2023–2024 were included, tariff variation would need to be modelled explicitly as a time-varying covariate.

4. No weather data: Incorporating actual irradiance and rainfall data from NIMET or Global Solar Atlas would allow the model to separate weather effects from operational failures, improving forecast accuracy.

5. Causal inference: All findings are associational. To confirm that improving availability causes revenue increases, a difference-in-differences design tracking pre/post planned maintenance interventions would be needed.

Further Work: With richer data, a predictive model (Random Forest or XGBoost) could forecast site-level revenue 4–6 weeks ahead, enabling proactive scheduling. Cluster analysis of site profiles could also inform batch design in future commissioning rounds.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Enekwe, I. (2026). PowerGen Nigeria Business Unit — weekly solar generation and revenue performance data, 16 February 2025 to 3 May 2026 [Dataset]. Collected from Operations & Maintenance Department, PowerGen Renewable Energy, Lagos, Nigeria. Data available on request from the author.

PowerGen Renewable Energy. (2026). PowerGen Nigeria O&M weekly site-level generation and performance tracker [Internal data]. Operations & Maintenance Department, PowerGen Renewable Energy, Lagos, Nigeria.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Code

for (pkg in c("skimr", "corrplot", "ggcorrplot", "kableExtra", "car", "broom", "ggridges", "patchwork")) {
  cat(sprintf("\n**%s:** ", pkg))
  cit <- citation(pkg)
  cat(format(cit, style = "text")[1])
  cat("\n")
}

skimr: Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S (2026). skimr: Compact and Flexible Summaries of Data. doi:10.32614/CRAN.package.skimr https://doi.org/10.32614/CRAN.package.skimr, R package version 2.2.2, https://CRAN.R-project.org/package=skimr.

corrplot: Wei T, Simko V (2024). R package ‘corrplot’: Visualization of a Correlation Matrix. (Version 0.95), https://github.com/taiyun/corrplot.

ggcorrplot: Kassambara A (2023). ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’. doi:10.32614/CRAN.package.ggcorrplot https://doi.org/10.32614/CRAN.package.ggcorrplot, R package version 0.1.4.1, https://CRAN.R-project.org/package=ggcorrplot.

kableExtra: Zhu H (2024). kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. doi:10.32614/CRAN.package.kableExtra https://doi.org/10.32614/CRAN.package.kableExtra, R package version 1.4.0, https://CRAN.R-project.org/package=kableExtra.

car: Fox J, Weisberg S (2019). An R Companion to Applied Regression, Third edition. Sage, Thousand Oaks CA. https://www.john-fox.ca/Companion/.

broom: Robinson D, Hayes A, Couch S, Hvitfeldt E (2026). broom: Convert Statistical Objects into Tidy Tibbles. doi:10.32614/CRAN.package.broom https://doi.org/10.32614/CRAN.package.broom, R package version 1.0.12, https://CRAN.R-project.org/package=broom.

ggridges: Wilke C (2025). ggridges: Ridgeline Plots in ‘ggplot2’. doi:10.32614/CRAN.package.ggridges https://doi.org/10.32614/CRAN.package.ggridges, R package version 0.5.7, https://CRAN.R-project.org/package=ggridges.

patchwork: Pedersen T (2025). patchwork: The Composer of Plots. doi:10.32614/CRAN.package.patchwork https://doi.org/10.32614/CRAN.package.patchwork, R package version 1.3.2, https://CRAN.R-project.org/package=patchwork.

Appendix: AI Usage Statement

Claude (Anthropic, claude.ai) was used as an analytical assistant throughout this project. Specifically, Claude assisted with:

reshaping the raw Excel O&M tracker from wide to tidy format using Python;
engineering the derived variables season, performance_tier, batch, tariff_ngn_per_kwh, revenue_ngn, performance_ratio, total_production, and solar_share_pct;
structuring the Quarto document and suggesting appropriate R packages for each analytical section; and
drafting initial code scaffolding for the EDA, visualization, and regression sections.

All analytical decisions were made independently by the author: the choice of CS1 as the case study, the selection of PowerGen generation data as the primary dataset, the framing of the two hypotheses (seasonal effect and batch differences), the interpretation of all statistical outputs in business terms, and the integrated recommendation regarding grid availability improvements at Batch 1 RMG sites.

The author collected the data directly from PowerGen’s internal systems in their capacity as Group Financial Controller, is familiar with all variables and their operational definitions, and can explain and defend every output in this document.