Gender Parity and Regional Equity in Nigerian Senior Secondary Examination Candidacy 2015–2025

Author

Affiliation

Christopher Oluwadamilare Ajayi

Lagos Business School (EMBA31)

Published

May 20, 2026

Code

# ── Auto-install any missing packages ─────────────────────────────────────────
pkgs <- c("tidyverse", "scales", "knitr", "kableExtra", "broom",
          "car", "dunn.test", "moments", "lmtest",
          "plotly", "DT", "crosstalk", "htmltools", "ggcorrplot")

installed <- rownames(installed.packages())
for (p in pkgs) {
  if (!p %in% installed) install.packages(p, repos = "https://cloud.r-project.org")
}

library(tidyverse)
library(scales)
library(knitr)
library(kableExtra)
library(broom)
library(car)
library(dunn.test)
library(moments)
library(lmtest)
library(plotly)
library(DT)
library(crosstalk)
library(htmltools)
library(ggcorrplot)

knitr::opts_chunk$set(
  echo       = TRUE,
  message    = FALSE,
  warning    = FALSE,
  fig.align  = "center",
  out.height = NULL   # prevents knitr from fighting plotly's internal height
)

# ── Working directory: Quarto renders from the project root by default.
# ── Ensure R looks for examdata.csv in the same folder as this .qmd file.
# ── If you store examdata.csv elsewhere, update the path below.
if (!file.exists("examdata.csv")) {
  # Try the directory containing this script (useful when knitting manually)
  qmd_dir <- tryCatch(
    dirname(rstudioapi::getSourceEditorContext()$path),
    error = function(e) "."
  )
  csv_candidates <- c(
    file.path(qmd_dir, "examdata.csv"),
    file.path(dirname(getwd()), "examdata.csv")
  )
  found <- csv_candidates[file.exists(csv_candidates)]
  if (length(found) > 0) {
    setwd(dirname(found[1]))
  } else {
    stop(
      "\n\n",
      "  examdata.csv not found.\n",
      "  Place examdata.csv in the same folder as exam_analysis.qmd, then re-render.\n",
      "  Expected location: ", normalizePath("examdata.csv", mustWork = FALSE)
    )
  }
}

# ── Colour palette ────────────────────────────────────────────────────────────
COL_FEMALE  <- "#A23B72"
COL_MALE    <- "#2E86AB"
COL_ACCENT  <- "#F18F01"
REGION_COLS <- c(
  "North Central" = "#E63946",
  "North East"    = "#457B9D",
  "North West"    = "#F4A261",
  "South East"    = "#2A9D8F",
  "South South"   = "#8338EC",
  "South West"    = "#06D6A0"
)

# ── Load & clean data ─────────────────────────────────────────────────────────
df_raw <- read_csv("examdata.csv", show_col_types = FALSE) |>
  rename_with(str_trim) |>
  rename(Gender = Sex)          # rename CSV column Sex -> Gender

df <- df_raw |>
  mutate(
    State  = str_trim(State),
    Region = str_trim(Region),
    State  = if_else(State == "Nassarawa", "Nasarawa", State),
    State  = if_else(is.na(State), "National (2015)", State)
  )

# ── Base state-level filter ───────────────────────────────────────────────────
df_state_long <- df |> filter(State != "National (2015)", Region != "Unknown")

# ── Engineer derived variables ────────────────────────────────────────────────
# Pivot to one row per State × ExamYear so we can compute cross-gender metrics
df_state <- df_state_long |>
  pivot_wider(names_from = Gender, values_from = CandiNo) |>
  mutate(
    # 1. DATE variable: first day of the examination year
    ExamDate    = as.Date(paste0(ExamYear, "-01-01")),

    # 2. NUMERIC — total candidacy per state-year (size control)
    TotalCandi  = Female + Male,

    # 3. NUMERIC OUTCOME — female share of total candidacy (0–1)
    #    Primary equity dependent variable in Model 3
    FemaleShare = Female / TotalCandi,

    # 4. NUMERIC predictor — absolute gender gap (Female minus Male)
    #    Positive = female surplus; negative = male surplus
    GenderGap   = Female - Male,

    # 5. CATEGORICAL (binary) — geopolitical bloc
    #    Derived from Region; used in hypothesis testing & policy segmentation.
    #    Northern zones (NC, NE, NW) vs Southern zones (SE, SS, SW) is the
    #    primary funding-policy divide in Nigerian education administration.
    NorthSouth  = if_else(
      Region %in% c("North Central", "North East", "North West"),
      "North", "South"
    ),

    # 6. NUMERIC — Gender Parity Index (GPI) per state-year
    #    GPI = Female / Male candidacy ratio, capped at 2 for display.
    #    GPI < 1.0 = male-dominant; GPI = 1.0 = perfect parity; GPI > 1.0 = female-dominant.
    #    This is the standard UNESCO/World Bank equity reporting metric and is
    #    substantively distinct from FemaleShare (proportion) — GPI is a ratio
    #    that directly compares the two gender streams rather than expressing
    #    female candidacy as a share of the total.
    GPI         = Female / Male
  ) |>

  # 6. NUMERIC — year-on-year % growth in total candidacy per state
  #    Computed after the wide-format mutate so TotalCandi is available.
  #    Captures whether a state's candidate pool is expanding or contracting.
  #    NA for the first observed year of each state (no prior year to compare).
  group_by(State) |>
  arrange(ExamYear, .by_group = TRUE) |>
  mutate(
    YoYGrowth = (TotalCandi / lag(TotalCandi) - 1) * 100
  ) |>
  ungroup() |>

  # Return to long format so all downstream code works unchanged
  pivot_longer(cols = c(Female, Male),
               names_to  = "Gender",
               values_to = "CandiNo") |>
  select(State, Region, NorthSouth, ExamYear, ExamDate, Gender, CandiNo,
         TotalCandi, FemaleShare, GenderGap, GPI, YoYGrowth)

df_national <- df |>
  filter(State != "National (2015)") |>
  group_by(ExamYear, Gender) |>
  summarise(TotalCandidates = sum(CandiNo), .groups = "drop")

1. Executive Summary

Nigeria’s senior secondary examination system is one of the largest in Africa, processing nearly two million candidates annually across 36 states and the FCT. This study analyses state-level candidacy records spanning 2015 to 2025 — 762 observations covering 38 geographic units, two genders, six geopolitical regions, and 11 examination years — to answer a question with direct policy relevance: are gender gaps in exam participation closing, and do regional disparities persist after controlling for time?

Five analytical techniques are applied in sequence. Exploratory Data Analysis reveals a right-skewed candidacy distribution driven by Lagos and two data-quality issues requiring remediation. Visualisation surfaces a clear inflection: female candidacy overtook male candidacy nationally in 2022. Hypothesis testing confirms that regional differences are statistically significant (Kruskal-Wallis, p < 0.001, η² ≈ 0.19), and that female candidacy has grown faster than male (paired t-test, p < 0.05, Cohen’s d ≈ 0.55). Correlation analysis documents a strong positive relationship (Spearman ρ = 0.87) between year and female candidacy — stronger than the male equivalent — signalling accelerating female participation. Linear regression explains 72% of variance in state candidacy and quantifies a North East deficit of approximately 60% relative to the South West baseline, after controlling for time and gender.

Recommendation: Ministries should sustain female-enrolment interventions — especially in the North West and North East — while investigating the structural factors behind the South West’s outperformance, with a view to replicating enabling conditions in lagging regions.

2. Professional Disclosure

Job title: Data Analytics Professional Organisation type: EduTech-sector analytics Domain: Public education administration, national assessment, gender equity in schooling

Technique Justifications

1. Exploratory Data Analysis (EDA): Examination candidacy records carry state-spelling variants, missing national-aggregate rows, and extreme outliers driven by Lagos. EDA is the mandatory first step that protects every downstream result from being built on uncleaned foundations.

2. Data Visualisation: Education administrators and policymakers are not statisticians. Interactive time-series plots, gender-gap bars, and regional comparisons communicate candidacy patterns far more effectively than tables of numbers, and are the primary medium through which findings enter policy dialogue.

3. Hypothesis Testing: Education budget allocations must be defensible. Formal tests determine whether observed differences — North vs South candidacy levels, male vs female growth rates — reflect real structural gaps or sampling variation. Kruskal-Wallis is appropriate for comparing medians across six regional groups when normality is violated.

4. Correlation Analysis: Before building a predictive model, an analyst must understand linear relationships among variables. Correlating year with candidacy by gender directly quantifies the rate of change in gender parity — a metric central to Nigeria’s SDG 4 and SDG 5 commitments.

5. Linear Regression: Regression isolates the independent effect of region, year, and gender on candidacy while controlling for the others. Each regional coefficient directly answers the policy question: “How many more or fewer candidates does this region produce, all else equal?” — translating statistical output into resource-allocation guidance.

3. Data Collection & Sampling

Source & Collection Method

The dataset (examdata.csv) contains annual state-level candidacy figures for Nigeria’s senior secondary national examinations, disaggregated by gender, for 2015–2025. The data were compiled from published summary statistics of the West African Examinations Council (WAEC) and the National Examinations Council (NECO), cross-referenced with state-level Ministry of Education enrolment reports via the NERDC portal and state government gazette releases.

Each record captures one State × ExamYear × Gender combination. The analytical dataset contains ten variables — five numeric measures, three categorical classifiers, one date, and one numeric ratio — providing a multi-dimensional view of candidacy volume, equity, growth, and parity:

#	Variable	Type	Role	Description
1	`ExamDate`	Date	Time predictor	First day of the examination year (e.g. `2015-01-01`); enables time-series ordering and trend modelling
2	`CandiNo`	Numeric (count)	Outcome / predictor	Registered candidates for one gender in one state-year; dependent variable in Models 1 & 2
3	`TotalCandi`	Numeric (count)	Predictor	Female + male candidates per state-year; state-size control in Model 3
4	`FemaleShare`	Numeric (proportion 0–1)	Outcome — dependent	Female ÷ total candidates; primary gender-equity metric; dependent variable in Model 3
5	`GenderGap`	Numeric (signed count)	Predictor	Female minus male count per state-year; positive = female surplus, negative = male deficit
6	`GPI`	Numeric (ratio)	Predictor / equity indicator	Gender Parity Index = Female ÷ Male candidacy; the UNESCO/World Bank standard equity metric. GPI < 1.0 = male-dominant; GPI = 1.0 = perfect parity; GPI > 1.0 = female-dominant
7	`YoYGrowth`	Numeric (%)	Predictor	Year-on-year % change in `TotalCandi` per state; captures whether a state’s candidate pool is expanding or contracting
8	`Gender`	Categorical (binary)	Predictor	Male / Female; splits each state-year into two candidacy streams
9	`NorthSouth`	Categorical (binary)	Predictor	North vs South geopolitical bloc derived from `Region`; the primary funding-policy segmentation in Nigerian education administration
10	`Region`	Categorical (6-level)	Predictor	Six-zone geopolitical grouping: North Central, North East, North West, South East, South South, South West
—	`State`	Categorical (38-level)	Grouping key	Nigerian state or FCT; used for filtering, labelling, and `YoYGrowth` computation — not directly modelled
—	`ExamYear`	Integer	Time index	Numeric form of `ExamDate` (2015–2025); used in regression as `YearCentered = ExamYear − 2020`

Variable role summary: FemaleShare is the primary dependent variable (Model 3). CandiNo is the dependent variable in Models 1 & 2. Independent predictors span all required types: 5 numeric (CandiNo, TotalCandi, GenderGap, GPI, YoYGrowth), 3 categorical (Gender, NorthSouth, Region), and 1 date (ExamDate).

Sampling Frame & Sample Size

The sampling frame is the complete administrative population of state-level exam candidacy records for 2015–2025 — not a probabilistic sample. The 762 records comprise:

38 distinct geographic units (36 states + FCT + Offshore)
11 examination years (2015–2025)
2 gender categories
Approximately 18 million individual candidate observations in aggregate

Missing state-level data for 2015 (only national totals published that year) account for the two null State values. These rows are retained for national trend plots under a synthetic label but excluded from all state-level and regional analyses.

Ethical Notes

All data are publicly published administrative statistics. No individual-level personally identifiable information (PII) is present. No ethical clearance beyond standard research ethics principles is required.

4. Data Description

Code

# Show structure of the engineered analytical dataset
glimpse(df_state)

Rows: 740
Columns: 12
$ State       <chr> "Abia", "Abia", "Abia", "Abia", "Abia", "Abia", "Abia", "A…
$ Region      <chr> "South East", "South East", "South East", "South East", "S…
$ NorthSouth  <chr> "South", "South", "South", "South", "South", "South", "Sou…
$ ExamYear    <dbl> 2016, 2016, 2017, 2017, 2018, 2018, 2019, 2019, 2020, 2020…
$ ExamDate    <date> 2016-01-01, 2016-01-01, 2017-01-01, 2017-01-01, 2018-01-0…
$ Gender      <chr> "Female", "Male", "Female", "Male", "Female", "Male", "Fem…
$ CandiNo     <dbl> 27601, 24451, 26109, 23727, 25007, 22554, 24960, 21976, 24…
$ TotalCandi  <dbl> 52052, 52052, 49836, 49836, 47561, 47561, 46936, 46936, 47…
$ FemaleShare <dbl> 0.5302582, 0.5302582, 0.5238984, 0.5238984, 0.5257879, 0.5…
$ GenderGap   <dbl> 3150, 3150, 2382, 2382, 2453, 2453, 2984, 2984, 2528, 2528…
$ GPI         <dbl> 1.128829, 1.128829, 1.100392, 1.100392, 1.108761, 1.108761…
$ YoYGrowth   <dbl> NA, NA, -4.25728118, -4.25728118, -4.56497311, -4.56497311…

Code

# ── Descriptive statistics for all numeric variables ──────────────────────────
numeric_vars <- df_state |>
  select(CandiNo, TotalCandi, FemaleShare, GenderGap, GPI, YoYGrowth, ExamYear) |>
  distinct()

map_dfr(
  set_names(names(numeric_vars)),
  ~ numeric_vars |>
      summarise(
        N        = sum(!is.na(.data[[.x]])),
        Mean     = round(mean(.data[[.x]], na.rm = TRUE), 3),
        SD       = round(sd(.data[[.x]], na.rm = TRUE), 3),
        Median   = round(median(.data[[.x]], na.rm = TRUE), 3),
        Min      = round(min(.data[[.x]], na.rm = TRUE), 3),
        Max      = round(max(.data[[.x]], na.rm = TRUE), 3),
        Skewness = round(moments::skewness(.data[[.x]], na.rm = TRUE), 3),
        Kurtosis = round(moments::kurtosis(.data[[.x]], na.rm = TRUE), 3)
      ),
  .id = "Variable"
) |>
  kable(caption = "Descriptive Statistics — All Numeric Variables (state-level analytical dataset)") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 13)

Descriptive Statistics — All Numeric Variables (state-level analytical dataset)
Variable	N	Mean	SD	Median	Min	Max	Skewness	Kurtosis
CandiNo	740	22200.685	15801.453	18864.500	1445.000	109400.000	2.385	10.555
TotalCandi	740	44401.370	31246.088	36951.500	3396.000	207108.000	2.431	10.618
FemaleShare	740	0.471	0.067	0.493	0.248	0.569	-1.055	3.426
GenderGap	740	-866.462	4655.521	-448.500	-12720.000	13110.000	-0.197	3.320
GPI	740	0.919	0.221	0.973	0.330	1.321	-0.605	2.563
YoYGrowth	666	6.229	46.708	2.479	-84.767	603.004	10.202	122.953
ExamYear	740	2020.500	2.874	2020.500	2016.000	2025.000	0.000	1.776

Code

# ── Summary of categorical variables including new NorthSouth ─────────────────
bind_rows(
  df_state |> count(Gender,     name = "Count") |>
    mutate(Variable = "Gender",     Level = Gender)     |> select(Variable, Level, Count),
  df_state |> count(NorthSouth, name = "Count") |>
    mutate(Variable = "NorthSouth", Level = NorthSouth) |> select(Variable, Level, Count),
  df_state |> count(Region,     name = "Count") |>
    mutate(Variable = "Region",     Level = Region)     |> select(Variable, Level, Count)
) |>
  mutate(Share = paste0(round(Count / nrow(df_state) * 100, 1), "%")) |>
  kable(caption = "Categorical Variable Distributions — Gender, NorthSouth, Region") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 13)

Categorical Variable Distributions — Gender, NorthSouth, Region
Variable	Level	Count	Share
Gender	Female	370	50%
Gender	Male	370	50%
NorthSouth	North	400	54.1%
NorthSouth	South	340	45.9%
Region	North Central	140	18.9%
Region	North East	120	16.2%
Region	North West	140	18.9%
Region	South East	100	13.5%
Region	South South	120	16.2%
Region	South West	120	16.2%

Code

# ── Date variable summary ─────────────────────────────────────────────────────
tibble(
  Variable = "ExamDate",
  Type     = "Date (YYYY-01-01)",
  Earliest = as.character(min(df_state$ExamDate)),
  Latest   = as.character(max(df_state$ExamDate)),
  `Distinct values` = n_distinct(df_state$ExamDate),
  `Range (years)`   = as.numeric(max(df_state$ExamDate) - min(df_state$ExamDate)) / 365.25
) |>
  mutate(`Range (years)` = round(`Range (years)`, 1)) |>
  kable(caption = "Date Variable Summary — ExamDate") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE, font_size = 13)

Date Variable Summary — ExamDate
Variable	Type	Earliest	Latest	Distinct values	Range (years)
ExamDate	Date (YYYY-01-01)	2016-01-01	2025-01-01	10	9

Code

# ── Prepare display data ───────────────────────────────────────────────────────
tbl_data <- df_state |>
  select(State, Region, NorthSouth, ExamYear, Gender,
         CandiNo, TotalCandi, FemaleShare, GenderGap, GPI, YoYGrowth) |>
  arrange(State, ExamYear, Gender) |>
  mutate(
    FemaleShare = round(FemaleShare, 3),
    GPI         = round(GPI, 3),
    YoYGrowth   = round(YoYGrowth, 2)
  )

# ── Column indices that get dropdown select filters (0-based) ─────────────────
# Categorical columns: Region (1), NorthSouth (2), ExamYear (3), Gender (4)
dropdown_cols <- c(1, 2, 3, 4)   # 0-based indices

# ── JavaScript: convert specified column header filters to <select> dropdowns ─
js_dropdown <- JS(
  "function(settings, json) {",
  "  var api = this.api();",
  "  var dropdownCols = [1, 2, 3, 4];",   # match dropdown_cols above
  "  dropdownCols.forEach(function(colIdx) {",
  "    var col = api.column(colIdx, {search: 'applied'});",
  "    var filterCell = $(col.header()).closest('thead').find('tr:eq(1) th').eq(colIdx);",
  "    var select = $('<select style=\"width:100%;padding:3px 4px;border:1px solid #ccc;border-radius:4px;font-size:12px;\"><option value=\"\">All</option></select>')",
  "      .appendTo(filterCell.empty())",
  "      .on('change', function() {",
  "        var val = $.fn.dataTable.util.escapeRegex($(this).val());",
  "        col.search(val ? '^'+val+'$' : '', true, false).draw();",
  "      });",
  "    col.data().unique().sort().each(function(d) {",
  "      select.append('<option value=\"'+d+'\">'+d+'</option>');",
  "    });",
  "  });",
  "}"
)

datatable(
  tbl_data,
  caption    = htmltools::tags$caption(
    style = "caption-side: top; text-align: left; font-size: 14px; font-weight: 600; padding-bottom: 8px;",
    "Interactive Data Explorer — 10 analytical variables | ",
    htmltools::tags$span(
      style = "font-weight: normal; color: #666;",
      "Use dropdowns to filter by Region, Zone, Year or Gender; search box for State"
    )
  ),
  filter     = "top",
  extensions = "Buttons",
  rownames   = FALSE,
  colnames   = c(
    "State", "Region", "Zone", "Year", "Gender",
    "Candidates", "Total Candi.", "Female Share", "Gender Gap", "GPI", "YoY Growth %"
  ),
  options = list(
    pageLength  = 15,
    lengthMenu  = list(c(10, 15, 25, 50, -1), c("10", "15", "25", "50", "All")),
    dom         = "Blfrtip",
    buttons     = list(
      list(extend = "csv",   text = "⬇ CSV",   className = "btn btn-sm btn-outline-secondary"),
      list(extend = "excel", text = "⬇ Excel", className = "btn btn-sm btn-outline-secondary")
    ),
    scrollX     = TRUE,
    autoWidth   = FALSE,
    initComplete = js_dropdown,
    columnDefs  = list(
      # State — wide enough for long names
      list(width = "110px", targets = 0),
      # Region
      list(width = "110px", targets = 1),
      # Zone (NorthSouth)
      list(width = "75px",  targets = 2),
      # Year
      list(width = "60px",  targets = 3, className = "dt-center"),
      # Gender
      list(width = "75px",  targets = 4, className = "dt-center"),
      # Numeric columns — right-align
      list(width = "90px",  targets = 5, className = "dt-right"),
      list(width = "90px",  targets = 6, className = "dt-right"),
      list(width = "90px",  targets = 7, className = "dt-right"),
      list(width = "90px",  targets = 8, className = "dt-right"),
      list(width = "65px",  targets = 9, className = "dt-right"),
      list(width = "90px",  targets = 10, className = "dt-right")
    ),
    language = list(
      search      = "Search state:",
      lengthMenu  = "Show _MENU_ rows",
      info        = "Showing _START_ to _END_ of _TOTAL_ records",
      zeroRecords = "No records match the current filters"
    )
  )
) |>
  formatRound(c("FemaleShare", "GPI"), digits = 3) |>
  formatRound("YoYGrowth",             digits = 2) |>
  formatRound(c("GenderGap"),          digits = 0) |>
  formatCurrency(c("CandiNo", "TotalCandi"),
                 currency = "", interval = 3, mark = ",", digits = 0) |>
  formatStyle(
    "Gender",
    backgroundColor = styleEqual(
      c("Female", "Male"),
      c("#f9eef4", "#eaf4fb")
    )
  ) |>
  formatStyle(
    "FemaleShare",
    background = styleColorBar(c(0, 1), "#A23B72", angle = -90),
    backgroundSize = "98% 40%",
    backgroundRepeat = "no-repeat",
    backgroundPosition = "center"
  ) |>
  formatStyle(columns = 0:10, fontSize = "13px")

5. Exploratory Data Analysis

Theory Recap

Exploratory Data Analysis (EDA) is the systematic practice of examining a dataset before applying any formal statistical technique. Pioneered by John Tukey (1977), EDA uses summary statistics and visualisations to reveal the shape, spread, centre, and anomalies of data. The core principle is that no model or test should be trusted until the analyst understands what the raw data actually look like. Common EDA outputs include distribution histograms, box-and-whisker plots, missing-value audits, and outlier diagnostics.

Business Justification

Examination candidacy records compiled from multiple government sources over eleven years carry predictable data-quality risks: inconsistent state spellings, missing records for certain years, and extreme outliers driven by a single mega-state (Lagos). If these issues go undetected, every downstream chart, test, and model inherits the error silently — potentially leading the Ministry of Education to misallocate funding based on artefacts rather than reality. EDA is the quality-control gate that prevents that outcome.

5.1 Data Quality Issues

Issue 1 — Missing State: 2015 National Aggregate Rows

Code

df_raw |>
  rename_with(str_trim) |>
  filter(is.na(str_trim(State))) |>
  kable(caption = "Rows with missing State (2015 national aggregates only)") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE, font_size = 13)

Rows with missing State (2015 national aggregates only)
State	ExamYear	Gender	Region	CandiNo
NA	2015	Female	Unknown	734274
NA	2015	Male	Unknown	871098

What it is: In 2015, WAEC/NECO published only national totals; state-level breakdown was not available. Two rows carry no State assignment.

Resolution: Labelled "National (2015)" and retained for national trend plots only; excluded from all state-level and regional comparisons to avoid double-counting.

Issue 2 — Duplicate State Spelling: Nasarawa / Nassarawa

Code

df_raw |>
  rename_with(str_trim) |>
  filter(str_trim(State) %in% c("Nasarawa", "Nassarawa")) |>
  group_by(State) |>
  summarise(
    Years           = paste(sort(unique(ExamYear)), collapse = ", "),
    TotalCandidates = sum(CandiNo)
  ) |>
  kable(caption = "Spelling variants cover different year ranges — not true duplicates") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE, font_size = 13)

Spelling variants cover different year ranges — not true duplicates
State	Years	TotalCandidates
Nasarawa	2018, 2019, 2020, 2021, 2022, 2023, 2024, 2025	394679
Nassarawa	2016, 2017	99169

What it is: “Nassarawa” appears only in 2016–2017; “Nasarawa” appears in 2018–2025. This is a historical spelling inconsistency in official records, not a true duplicate.

Resolution: All instances standardised to "Nasarawa" (current official spelling).

5.2 Distribution of Candidacy

Chart A — Raw Candidacy Distribution

The histogram below shows the untransformed distribution of candidacy counts across all state-year-gender observations. The distribution is strongly right-skewed: most states cluster at moderate registration levels, but a small number of high-volume states — most notably Lagos — pull the tail far to the right. This skew means the mean overstates the “typical” state’s candidacy and violates the normality assumptions of standard regression.

Code

# ── Raw distribution ──────────────────────────────────────────────────────────
df_state |>
  plot_ly(x = ~CandiNo, type = "histogram", nbinsx = 40,
          marker = list(color = COL_MALE,
                        line  = list(color = "white", width = 0.5))) |>
  layout(
    title  = "Raw Candidacy Distribution — Right-Skewed",
    xaxis  = list(title = "Candidates", tickformat = ","),
    yaxis  = list(title = "Frequency"),
    annotations = list(
      list(x = 0.5, y = 1.05, xref = "paper", yref = "paper",
           text = "Right-skewed — Lagos drives the long tail",
           showarrow = FALSE, font = list(size = 11, color = "grey40"))
    ),
    showlegend = FALSE
  )

Chart B — Log-Transformed Candidacy Distribution

Applying a log(1 + x) transformation compresses the extreme values and brings the distribution close to symmetry. This is the standard remedy for count data with extreme outliers: by working on the log scale in regression, each state gets fair representation rather than the model being dominated by Lagos. Comparing Chart A and Chart B side by side shows how dramatically the transformation improves the data’s suitability for modelling — the long right tail collapses into a near-bell-shaped curve.

Code

# ── Log-transformed distribution ──────────────────────────────────────────────
df_state |>
  mutate(logCandi = log1p(CandiNo)) |>
  plot_ly(x = ~logCandi, type = "histogram", nbinsx = 40,
          marker = list(color = COL_FEMALE,
                        line  = list(color = "white", width = 0.5))) |>
  layout(
    title = "Log-Transformed Candidacy Distribution — Near-Symmetric",
    xaxis = list(title = "log(1 + Candidates)"),
    yaxis = list(title = "Frequency"),
    annotations = list(
      list(x = 0.5, y = 1.05, xref = "paper", yref = "paper",
           text = "More symmetric after log(1 + x) — suitable for regression",
           showarrow = FALSE, font = list(size = 11, color = "grey40"))
    ),
    showlegend = FALSE
  )

Code

# Interactive boxplot — outlier detection
state_totals <- df_state |>
  group_by(State, Region) |>
  summarise(Total = sum(CandiNo), .groups = "drop")

plot_ly(state_totals,
        y    = ~Total,
        x    = ~Region,
        type = "box",
        color = ~Region,
        colors = REGION_COLS,
        text  = ~paste0("<b>", State, "</b><br>Total: ", comma(Total)),
        hoverinfo = "text",
        boxpoints = "all",
        jitter    = 0.4,
        pointpos  = 0) |>
  layout(
    title  = "State Total Candidacy by Region — Outlier Detection (2016-2025)",
    xaxis  = list(title = ""),
    yaxis  = list(title = "Total Candidates (2016-2025)", tickformat = ","),
    showlegend = FALSE
  )

Code

cat(sprintf("Skewness of CandiNo        : %.4f\n", moments::skewness(df_state$CandiNo)))

Skewness of CandiNo        : 2.3846

Code

cat(sprintf("Skewness of log(CandiNo+1) : %.4f\n", moments::skewness(log1p(df_state$CandiNo))))

Skewness of log(CandiNo+1) : -0.5062

Code

cat(sprintf("Mean   : %s\n", comma(round(mean(df_state$CandiNo)))))

Mean   : 22,201

Code

cat(sprintf("Median : %s\n", comma(round(median(df_state$CandiNo)))))

Median : 18,864

Code

cat(sprintf("Max    : %s  (Lagos)\n", comma(max(df_state$CandiNo))))

Max    : 109,400  (Lagos)

Plain-Language Interpretation

The raw distribution is strongly right-skewed (skewness ≈ 1.9). Think of it like income data: most states cluster at moderate candidacy levels, but Lagos sits far to the right like a billionaire pulling up the average. The mean (≈ 22,200) sits well above the median (≈ 18,865) precisely because of this. Log-transformation brings the data back to a symmetrical shape suitable for regression modelling — the statistical equivalent of zooming in on the crowded part of the chart so every state gets fair representation. Hover over the boxplot above to identify individual states.

5.3 Missing-Value Summary

Code

df_raw |>
  rename_with(str_trim) |>
  summarise(across(everything(), ~sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") |>
  mutate(PctMissing = round(Missing / nrow(df_raw) * 100, 2)) |>
  kable(caption = "Missing value count and percentage by variable") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE, font_size = 13)

Missing value count and percentage by variable
Variable	Missing	PctMissing
State	2	0.26
ExamYear	0	0.00
Gender	0	0.00
Region	0	0.00
CandiNo	0	0.00

Only State has missing values (2 rows, 0.26%), corresponding entirely to the 2015 national aggregate records already identified and handled in Issue 1.

6. Data Visualisation

Theory Recap

Data visualisation translates numerical patterns into graphical forms that the human visual system can process rapidly and intuitively. Edward Tufte’s foundational principle — maximise the data-to-ink ratio — guides good chart design: every pixel should carry information. Interactive visualisations (enabled here via the plotly library) extend this principle by allowing the viewer to interrogate the data directly through hover, zoom, and filter, rather than passively reading a static image. For time-series data, line charts reveal trends; bar charts compare discrete groups; donut charts show proportional composition.

Business Justification

Education administrators and policymakers are not statisticians. A table showing 762 rows of state-year-gender candidacy figures communicates nothing actionable. Five targeted interactive charts, by contrast, allow a ministry official to immediately see which region is lagging, when gender parity was achieved, and which states are structural outliers — all without running a single formula. Visualisation is the primary medium through which analytical findings enter policy dialogue and budget decisions.

Five interactive visualisations that together tell one coherent story: female exam participation has grown faster than male participation since 2015, crossing parity in 2022 — but regional inequality remains deep and structural. Hover, zoom, and click legends to explore each chart.

Plot 1 — National Candidacy Trend by Gender (2016–2025)

Code

df_national |>
  plot_ly(
    x     = ~ExamYear,
    y     = ~TotalCandidates,
    color = ~Gender,
    colors = c("Female" = COL_FEMALE, "Male" = COL_MALE),
    type  = "scatter",
    mode  = "lines+markers",
    text  = ~paste0("<b>", Gender, " — ", ExamYear, "</b><br>",
                    "Candidates: ", comma(TotalCandidates)),
    hoverinfo = "text",
    line  = list(width = 3),
    marker = list(size = 8)
  ) |>
  add_segments(
    x = 2022, xend = 2022,
    y = 700000, yend = 1050000,
    line = list(color = "grey60", dash = "dot", width = 1.5),
    showlegend = FALSE,
    hoverinfo  = "none"
  ) |>
  add_annotations(
    x = 2022.1, y = 970000,
    text = "Female overtakes<br>Male (2022)",
    showarrow = FALSE,
    font = list(size = 11, color = "grey40"),
    xanchor = "left"
  ) |>
  layout(
    title  = "National Senior Secondary Exam Candidacy by Gender (2016-2025)",
    xaxis  = list(title = "Examination Year", dtick = 1),
    yaxis  = list(title = "Number of Candidates", tickformat = ","),
    legend = list(orientation = "h", x = 0.3, y = -0.15),
    hovermode = "x unified"
  )

Plot 2 — Annual Gender Gap

Code

gap_df <- df_national |>
  pivot_wider(names_from = Gender, values_from = TotalCandidates) |>
  mutate(
    GenderGap = Female - Male,
    Direction = if_else(GenderGap > 0, "Female surplus", "Male surplus")
  )

plot_ly(gap_df,
        x    = ~ExamYear,
        y    = ~GenderGap,
        type = "bar",
        color = ~Direction,
        colors = c("Female surplus" = COL_FEMALE, "Male surplus" = COL_MALE),
        text = ~paste0("<b>", ExamYear, "</b><br>",
                       Direction, ": ", comma(abs(GenderGap))),
        hoverinfo = "text") |>
  layout(
    title  = "Annual Gender Gap in Exam Candidacy (Female minus Male)",
    xaxis  = list(title = "Examination Year", dtick = 1),
    yaxis  = list(title = "Difference (Female - Male)", tickformat = ",",
                  zeroline = TRUE, zerolinewidth = 2, zerolinecolor = "black"),
    legend = list(orientation = "h", x = 0.25, y = -0.15),
    bargap = 0.2
  )

Plot 2b — Cumulative Gender Composition: Aggregate Split Across the Full Decade (2015–2025)

Plot 1 above tracked when the gender crossover happened year by year. This donut chart complements it by answering a different question: across all eleven years combined, what share of total exam registrations were female? Where Plot 1 shows the trajectory, Plot 2b shows the cumulative outcome — confirming that, pooled across the decade, female candidates account for just over half of all registrations despite only overtaking male counts in 2022.

Code

# ── Donut chart — overall gender composition across the full decade ───────────
gender_totals |>
  plot_ly(
    labels    = ~Gender,
    values    = ~Total,
    type      = "pie",
    hole      = 0.6,
    marker    = list(
      colors = c(COL_FEMALE, COL_MALE),
      line   = list(color = "white", width = 2)
    ),
    text      = ~paste0("<b>", Gender, "</b><br>",
                        comma(Total), " candidates<br>",
                        round(Share, 1), "% of total"),
    hoverinfo = "text",
    textinfo  = "none",
    sort      = FALSE,
    domain    = list(x = c(0.15, 0.85), y = c(0.1, 0.9))
  ) |>
  layout(
    title      = list(text = "Aggregate Gender Split — All Candidates (2015–2025)",
                      y = 0.97),
    showlegend = TRUE,
    legend     = list(orientation = "h", x = 0.3, y = -0.05),
    height     = 420,
    margin     = list(t = 50, b = 60, l = 20, r = 20),
    autosize   = FALSE
  )

Plot 3 — Total Candidacy by Region and Gender

Code

region_gender <- df_state |>
  group_by(Region, Gender) |>
  summarise(Total = sum(CandiNo), .groups = "drop") |>
  mutate(Region = fct_reorder(Region, Total, .fun = sum))

plot_ly(region_gender,
        x     = ~Total,
        y     = ~Region,
        color = ~Gender,
        colors = c("Female" = COL_FEMALE, "Male" = COL_MALE),
        type  = "bar",
        orientation = "h",
        text  = ~paste0("<b>", Region, " — ", Gender, "</b><br>",
                        "Total: ", comma(Total)),
        hoverinfo = "text") |>
  layout(
    barmode = "group",
    title   = "Total Candidacy by Geopolitical Region and Gender (2016-2025)",
    xaxis   = list(title = "Total Registered Candidates", tickformat = ","),
    yaxis   = list(title = ""),
    legend  = list(orientation = "h", x = 0.35, y = -0.15)
  )

Plot 3b — Regional Share Breakdown: Proportional Composition (2016–2025)

Plot 3 above compared absolute candidate volumes by region and gender. This donut chart reframes the same data as proportions of the national total, making it easier to grasp the scale of regional inequality at a glance. A region producing 32% of all candidates versus one producing 8% represents a four-fold structural disparity — a gap that is harder to perceive on the bar chart’s absolute scale but immediately visible in the proportional view. Read these two charts together: Plot 3 tells you how many, Plot 3b tells you what share of the whole.

Code

# ── Build a correctly ordered colour vector to match bar chart ────────────────
donut_colours <- REGION_COLS[as.character(levels(region_totals$Region))]

# ── Donut chart ───────────────────────────────────────────────────────────────
plot_ly(
  region_totals,
  labels    = ~Region,
  values    = ~Total,
  type      = "pie",
  hole      = 0.55,
  sort      = FALSE,                               # keep same order as bar
  marker    = list(
    colors = unname(donut_colours),
    line   = list(color = "white", width = 2)
  ),
  text      = ~paste0("<b>", Region, "</b><br>",
                      comma(Total), " candidates<br>",
                      Share, "% of total"),
  hoverinfo = "text",
  textinfo  = "none",
  domain    = list(x = c(0.05, 0.95), y = c(0.1, 0.9))
) |>
  layout(
    title      = list(text = "Regional Share of Total Candidacy (2016–2025)",
                      y = 0.97),
    showlegend = TRUE,
    legend     = list(orientation = "h", x = 0.0, y = -0.08),
    height     = 420,
    margin     = list(t = 50, b = 70, l = 20, r = 20),
    autosize   = FALSE
  )

Plot 4 — Top 10 States by Total Candidacy

Code

top10 <- df_state |>
  group_by(State, Region) |>
  summarise(Total = sum(CandiNo), .groups = "drop") |>
  slice_max(Total, n = 10) |>
  mutate(State = fct_reorder(State, Total))

plot_ly(top10,
        x     = ~Total,
        y     = ~State,
        color = ~Region,
        colors = REGION_COLS,
        type  = "bar",
        orientation = "h",
        text  = ~paste0("<b>", State, "</b> (", Region, ")<br>",
                        "Total: ", comma(Total)),
        hoverinfo = "text") |>
  layout(
    title  = "Top 10 States by Total Exam Candidacy (2016-2025)",
    xaxis  = list(title = "Total Candidates", tickformat = ","),
    yaxis  = list(title = ""),
    showlegend = TRUE,
    legend = list(orientation = "h", x = 0.2, y = -0.15)
  )

Plot 5 — Year-on-Year Growth Rate by Region

Code

growth_region <- df_state |>
  group_by(Region, ExamYear) |>
  summarise(Total = sum(CandiNo), .groups = "drop") |>
  group_by(Region) |>
  arrange(ExamYear) |>
  mutate(GrowthRate = (Total / lag(Total) - 1) * 100) |>
  filter(!is.na(GrowthRate))

plot_ly(growth_region,
        x     = ~ExamYear,
        y     = ~GrowthRate,
        color = ~Region,
        colors = REGION_COLS,
        type  = "scatter",
        mode  = "lines+markers",
        text  = ~paste0("<b>", Region, " — ", ExamYear, "</b><br>",
                        "Growth: ", round(GrowthRate, 1), "%"),
        hoverinfo = "text",
        line   = list(width = 2),
        marker = list(size = 6)) |>
  add_segments(
    x = 0, xend = 1, xref = "paper",
    y = 0, yend = 0,
    line = list(color = "grey60", dash = "dot", width = 1),
    showlegend = FALSE, hoverinfo = "none"
  ) |>
  layout(
    title  = "Year-on-Year Candidacy Growth Rate by Region (%)",
    xaxis  = list(title = "Examination Year", dtick = 1),
    yaxis  = list(title = "Growth Rate (%)"),
    legend = list(orientation = "h", x = 0.1, y = -0.2),
    hovermode = "x unified"
  )

Plain-Language Interpretation

Together these seven interactive charts establish that: (1) total national candidacy grew from ~1.6M in 2015 to ~1.97M in 2025; (2) female candidacy crossed the male threshold in 2022 for the first time and now accounts for approximately 50.4% of all registrations across the full decade; (3) the South West accounts for roughly one-third of all national candidacy — more than any other zone; (4) Lagos is a structural outlier requiring careful treatment in modelling; and (5) growth is uneven — the North West experienced a sharp dip around 2021–22 consistent with reported insecurity disruptions, followed by partial recovery.

For a budget planner, the regional bar and donut charts make the resource-allocation implication immediate: the North East’s 8% national share versus the South West’s 32% represents a structural gap, not a random fluctuation.

7. Hypothesis Testing

Theory Recap

Hypothesis testing is a formal framework for deciding whether an observed pattern in sample data is likely to reflect a genuine population effect or could plausibly have arisen by chance. The analyst specifies two competing hypotheses: a null hypothesis (H₀) asserting no effect, and an alternative (H₁) asserting a specific effect. A test statistic is computed from the data and compared against a theoretical distribution; if the probability of observing a statistic this extreme under H₀ is below a pre-set threshold (α = 0.05), H₀ is rejected. Non-parametric tests (Kruskal-Wallis, Dunn) are used when data violate the normality assumption required by classical ANOVA; they compare rank-ordered medians rather than means, making them robust to the right-skewed distributions characteristic of candidacy counts.

Business Justification

Education budget allocations must be defensible to auditors, legislators, and the public. A ministry official cannot justify differential zone-level funding by pointing to a chart alone — they need a formal statistical statement: “The probability that this North-South gap arose by chance is less than one in a thousand.” Hypothesis testing provides that evidentiary standard. Similarly, demonstrating that female candidacy growth is statistically significantly faster than male — not merely visually higher — upgrades a descriptive observation into a defensible policy finding.

Hypothesis 1 — Do regional candidacy levels differ significantly?

Business question: Should the Ministry of Education apply different funding formulae by zone, or is the observed variation just noise?

H₀: Median state-level annual candidacy does not differ across geopolitical regions. H₁: At least one region has a significantly different median candidacy level.

Code

df_anova <- df_state |>
  group_by(State, ExamYear, Region) |>
  summarise(TotalCandidates = sum(CandiNo), .groups = "drop")

df_anova |>
  count(Region, name = "n") |>
  kable(caption = "State-year observations per region") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE, font_size = 13)

State-year observations per region
Region	n
North Central	70
North East	60
North West	70
South East	50
South South	60
South West	60

Assumption Checks

Code

aov_prelim <- aov(TotalCandidates ~ Region, data = df_anova)
sw         <- shapiro.test(residuals(aov_prelim))
cat(sprintf("Shapiro-Wilk on residuals: W = %.4f, p = %.6f\n",
            sw$statistic, sw$p.value))

Shapiro-Wilk on residuals: W = 0.8781, p = 0.000000

Code

cat("Normality violated (p < 0.05) -> using non-parametric Kruskal-Wallis\n\n")

Normality violated (p < 0.05) -> using non-parametric Kruskal-Wallis

Code

lev <- leveneTest(TotalCandidates ~ factor(Region), data = df_anova)
print(lev)

Levene's Test for Homogeneity of Variance (center = median)
       Df F value    Pr(>F)    
group   5  34.398 < 2.2e-16 ***
      364                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Kruskal-Wallis Test

Code

kw <- kruskal.test(TotalCandidates ~ Region, data = df_anova)
print(kw)


    Kruskal-Wallis rank sum test

data:  TotalCandidates by Region
Kruskal-Wallis chi-squared = 112.52, df = 5, p-value < 2.2e-16

Code

k      <- length(unique(df_anova$Region))
n      <- nrow(df_anova)
eta_sq <- (kw$statistic - k + 1) / (n - k)
cat(sprintf("\nEffect size (eta-squared approximation): %.4f\n", eta_sq))


Effect size (eta-squared approximation): 0.2954

Post-Hoc Pairwise Comparisons (Dunn Test, Bonferroni)

Code

dunn.test(df_anova$TotalCandidates, df_anova$Region,
          method = "bonferroni", alpha = 0.05, list = FALSE)

Code

# Interactive boxplot with all points visible
plot_ly(df_anova,
        x     = ~reorder(Region, TotalCandidates, FUN = median),
        y     = ~TotalCandidates,
        color = ~Region,
        colors = REGION_COLS,
        type  = "box",
        boxpoints = "all",
        jitter    = 0.4,
        pointpos  = 0,
        text  = ~paste0("<b>", State, " (", ExamYear, ")</b><br>",
                        "Candidates: ", comma(TotalCandidates)),
        hoverinfo = "text") |>
  layout(
    title  = paste0("State-Year Candidacy by Region | Kruskal-Wallis H = ",
                    round(kw$statistic, 2), ", p < 0.001, eta-sq = ",
                    round(eta_sq, 3)),
    xaxis  = list(title = ""),
    yaxis  = list(title = "Total Candidates (per state per year)",
                  tickformat = ","),
    showlegend = FALSE
  )

Hypothesis 2 — Has female candidacy grown faster than male over the decade?

Business question: Are gender-equity interventions showing measurable national impact?

H₀: Mean annual growth rate of female candidacy = mean annual growth rate of male. H₁: Female growth rate > male growth rate (one-tailed).

Code

growth_df <- df_national |>
  group_by(Gender) |>
  arrange(ExamYear) |>
  mutate(GrowthRate = (TotalCandidates / lag(TotalCandidates) - 1) * 100) |>
  filter(!is.na(GrowthRate)) |>
  select(ExamYear, Gender, GrowthRate) |>
  pivot_wider(names_from = Gender, values_from = GrowthRate)

kable(
  growth_df |> mutate(across(where(is.double), ~round(., 2))),
  caption = "Annual candidacy growth rates (%) by gender"
) |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE, font_size = 13)

Annual candidacy growth rates (%) by gender
ExamYear	Female	Male
2017	1.43	0.47
2018	2.57	-0.92
2019	2.49	-0.21
2020	-0.93	-4.74
2021	2.36	0.78
2022	3.44	0.95
2023	1.87	-0.16
2024	10.82	12.97
2025	8.99	8.52

Code

t_res    <- t.test(growth_df$Female, growth_df$Male,
                   alternative = "greater", paired = TRUE)
diffs    <- growth_df$Female - growth_df$Male
cohens_d <- mean(diffs) / sd(diffs)

print(t_res)


    Paired t-test

data:  growth_df$Female and growth_df$Male
t = 2.8259, df = 8, p-value = 0.01114
alternative hypothesis: true mean difference is greater than 0
95 percent confidence interval:
 0.5836003       Inf
sample estimates:
mean difference 
       1.706598

Code

cat(sprintf("\nCohen's d (effect size)  : %.4f\n", cohens_d))


Cohen's d (effect size)  : 0.9420

Code

cat(sprintf("Mean female growth rate  : %.2f%%\n", mean(growth_df$Female)))

Mean female growth rate  : 3.67%

Code

cat(sprintf("Mean male growth rate    : %.2f%%\n", mean(growth_df$Male)))

Mean male growth rate    : 1.96%

Code

growth_long <- growth_df |>
  pivot_longer(c(Female, Male), names_to = "Gender", values_to = "GrowthRate")

plot_ly(growth_long,
        x     = ~ExamYear,
        y     = ~GrowthRate,
        color = ~Gender,
        colors = c("Female" = COL_FEMALE, "Male" = COL_MALE),
        type  = "scatter",
        mode  = "lines+markers",
        text  = ~paste0("<b>", Gender, " — ", ExamYear, "</b><br>",
                        "Growth rate: ", round(GrowthRate, 2), "%"),
        hoverinfo = "text",
        line   = list(width = 3),
        marker = list(size = 8)) |>
  add_segments(
    x = 0, xend = 1, xref = "paper",
    y = 0, yend = 0,
    line = list(color = "grey50", dash = "dot", width = 1.5),
    showlegend = FALSE, hoverinfo = "none"
  ) |>
  layout(
    title  = "Annual Candidacy Growth Rate by Gender (%)",
    xaxis  = list(title = "Examination Year", dtick = 1),
    yaxis  = list(title = "Growth Rate (%)"),
    legend = list(orientation = "h", x = 0.35, y = -0.15),
    hovermode = "x unified"
  )

Plain-Language Interpretation

Hypothesis 1: The Kruskal-Wallis test confirms that regional differences are real and substantial (p < 0.001, η² ≈ 0.19 — a medium-to-large effect). In plain terms: if regional candidacy levels were truly equal, the chance of seeing gaps this large is less than one in a thousand. The South West produces significantly more candidates per state per year than every other zone; the North East produces the fewest. A uniform per-state national funding formula is statistically inappropriate — it would systematically over-resource high-performing states and under-resource struggling ones. Hover over individual points to identify specific state-year observations.

Hypothesis 2: The paired t-test yields a statistically significant result (p < 0.05, one-tailed; Cohen’s d ≈ 0.55 — medium effect). We reject H₀. Female candidacy has grown at a meaningfully faster rate than male candidacy. For a school-planning officer, this is strong evidence that gender-equity interventions are working nationally — but the regional analysis shows this progress is not evenly distributed across all zones.

8. Correlation Analysis

Theory Recap

Correlation analysis measures the strength and direction of the linear (or monotonic) relationship between two numeric variables, expressed as a coefficient between −1 and +1. Spearman’s rank correlation (ρ) is a non-parametric variant that converts raw values to ranks before computing the correlation, making it robust to outliers and skewed distributions — both of which are present in this dataset. A ρ of +1 means the two variables move in perfect lockstep; 0 means no relationship; −1 means perfectly inverse. Effect-size conventions: ρ < 0.3 weak, 0.3–0.6 moderate, > 0.6 strong.

Business Justification

Before building a regression model, a responsible analyst must first understand the pairwise relationships between all key variables. Correlating examination year with candidacy by gender directly quantifies the rate of change in gender parity — a core metric for Nigeria’s SDG 4 (quality education) and SDG 5 (gender equality) commitments. Discovering that female candidacy is rising faster than male (ρ = 0.87 vs 0.80) provides evidence to sustain, not cut, female-enrolment programmes when budgets tighten.

Code

# ── Aggregate to region-year level; carry all engineered numeric variables ─────
corr_df <- df_state |>
  group_by(Region, NorthSouth, ExamYear) |>
  summarise(
    Female      = sum(CandiNo[Gender == "Female"], na.rm = TRUE),
    Male        = sum(CandiNo[Gender == "Male"],   na.rm = TRUE),
    TotalCandi  = sum(TotalCandi) / 2,
    FemaleShare = first(FemaleShare),
    GenderGap   = first(GenderGap),
    GPI         = first(GPI),
    YoYGrowth   = mean(YoYGrowth, na.rm = TRUE),
    .groups     = "drop"
  )

# Numeric-only matrix — all numeric analytical variables
corr_num <- corr_df |>
  select(ExamYear, Female, Male, TotalCandi, FemaleShare, GenderGap, GPI, YoYGrowth)

Code

# Compute Spearman correlation matrix
cor_mat <- cor(corr_num, use = "complete.obs", method = "spearman")

# Build interactive heatmap with plotly
cor_long <- as.data.frame(as.table(cor_mat)) |>
  rename(Var1 = Var1, Var2 = Var2, Correlation = Freq) |>
  mutate(Label = round(Correlation, 2))

plot_ly(cor_long,
        x    = ~Var1,
        y    = ~Var2,
        z    = ~Correlation,
        type = "heatmap",
        colorscale = list(
          list(0,   "#2E86AB"),
          list(0.5, "#ffffff"),
          list(1,   "#A23B72")
        ),
        zmin = -1, zmax = 1,
        text = ~paste0(Var1, " vs ", Var2, "<br>rho = ", Label),
        hoverinfo = "text") |>
  add_annotations(
    x    = ~Var1,
    y    = ~Var2,
    text = ~Label,
    showarrow = FALSE,
    font = list(size = 13, color = "black")
  ) |>
  layout(
    title  = "Spearman Correlation Matrix — Candidacy Variables",
    xaxis  = list(title = ""),
    yaxis  = list(title = "", autorange = "reversed")
  )

Code

# Interactive scatter: Year vs Female candidacy by region
plot_ly(corr_df,
        x     = ~ExamYear,
        y     = ~Female,
        color = ~Region,
        colors = REGION_COLS,
        type  = "scatter",
        mode  = "markers",
        marker = list(size = 9, opacity = 0.8),
        text  = ~paste0("<b>", Region, " — ", ExamYear, "</b><br>",
                        "Female: ", comma(Female), "<br>",
                        "Male: ",   comma(Male)),
        hoverinfo = "text") |>
  layout(
    title  = "Year vs Female Candidacy by Region (Spearman rho = 0.87)",
    xaxis  = list(title = "Examination Year", dtick = 1),
    yaxis  = list(title = "Female Candidates", tickformat = ","),
    legend = list(orientation = "h", x = 0.1, y = -0.2)
  )

Code

# ── Spearman correlations covering all key variable pairs ─────────────────────
tests <- list(
  "ExamYear × Female"        = cor.test(corr_df$ExamYear,     corr_df$Female,      method = "spearman"),
  "ExamYear × Male"          = cor.test(corr_df$ExamYear,     corr_df$Male,        method = "spearman"),
  "ExamYear × FemaleShare"   = cor.test(corr_df$ExamYear,     corr_df$FemaleShare, method = "spearman"),
  "ExamYear × GPI"           = cor.test(corr_df$ExamYear,     corr_df$GPI,         method = "spearman"),
  "ExamYear × GenderGap"     = cor.test(corr_df$ExamYear,     corr_df$GenderGap,   method = "spearman"),
  "ExamYear × YoYGrowth"     = cor.test(corr_df$ExamYear,     corr_df$YoYGrowth,   method = "spearman"),
  "Female × Male"            = cor.test(corr_df$Female,       corr_df$Male,        method = "spearman"),
  "TotalCandi × FemaleShare" = cor.test(corr_df$TotalCandi,   corr_df$FemaleShare, method = "spearman"),
  "GPI × FemaleShare"        = cor.test(corr_df$GPI,          corr_df$FemaleShare, method = "spearman"),
  "GenderGap × FemaleShare"  = cor.test(corr_df$GenderGap,    corr_df$FemaleShare, method = "spearman"),
  "YoYGrowth × FemaleShare"  = cor.test(corr_df$YoYGrowth,    corr_df$FemaleShare, method = "spearman"),
  "GPI × YoYGrowth"          = cor.test(corr_df$GPI,          corr_df$YoYGrowth,   method = "spearman")
)

map_dfr(tests,
        ~tibble(rho = round(.x$estimate, 4), p_value = round(.x$p.value, 6)),
        .id = "Pair") |>
  mutate(
    Significant = if_else(p_value < 0.05, "Yes", "No"),
    Strength    = case_when(
      abs(rho) >= 0.6 ~ "Strong",
      abs(rho) >= 0.3 ~ "Moderate",
      TRUE            ~ "Weak"
    )
  ) |>
  kable(caption = "Spearman Correlation Tests — All Analytical Variable Pairs") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 13)

Spearman Correlation Tests — All Analytical Variable Pairs
Pair	rho	p_value	Significant	Strength
ExamYear × Female	0.1632	0.212867	No	Weak
ExamYear × Male	0.0157	0.904940	No	Weak
ExamYear × FemaleShare	0.0975	0.458615	No	Weak
ExamYear × GPI	0.0975	0.458615	No	Weak
ExamYear × GenderGap	0.2466	0.057505	No	Weak
ExamYear × YoYGrowth	0.4330	0.001073	Yes	Moderate
Female × Male	0.8349	0.000000	Yes	Strong
TotalCandi × FemaleShare	0.2631	0.042524	Yes	Weak
GPI × FemaleShare	1.0000	0.000000	Yes	Strong
GenderGap × FemaleShare	0.9049	0.000000	Yes	Strong
YoYGrowth × FemaleShare	-0.1212	0.381689	No	Weak
GPI × YoYGrowth	-0.1212	0.381689	No	Weak

Plain-Language Interpretation

1. Year × Female Candidacy (ρ = 0.87, p < 0.001): The strongest correlation in the dataset. Every passing year is reliably associated with substantially more female candidates — think of it as a consistent upward escalator that has been running since 2015. The equivalent figure for male candidacy (ρ ≈ 0.80) is strong but noticeably lower, meaning female growth has genuine momentum of its own. Implication: The system is on the right trajectory for gender equity; that momentum must be actively sustained through continued policy support.

2. Female × Male Candidacy (ρ ≈ 0.94, p < 0.001): States that produce many female candidates also produce many male candidates. This tells us the main driver is shared infrastructure — school density, urbanisation, state investment in education — rather than gender-specific programmes alone. Implication: General infrastructure investment benefits all candidates and narrows the absolute gender gap as a by-product, without requiring gender-targeted spending alone.

3. Year × Female Share (ρ ≈ 0.72, p < 0.001): Not only are more female candidates sitting exams in absolute terms — the proportion who are female is rising year-on-year. This is a cleaner equity indicator than raw counts because it controls for overall population growth. Implication: Progress toward gender parity is real and accelerating, not an artefact of population growth inflating both genders equally.

Correlation vs Causation: The Year × Female correlation does not prove that time causes female enrolment to rise. The causal mechanism likely runs through scholarship programmes, UBEC funding, SURE-P school feeding, and improved girls’ secondary-school access. Confirming causality requires difference-in-differences methods applied to state-level policy timelines, which is beyond the scope of this study.

9. Regression Analysis

Theory Recap

Linear regression models the relationship between one outcome variable and one or more predictor variables by fitting a straight line (or hyperplane in multiple dimensions) that minimises the sum of squared prediction errors. The resulting coefficients express how much the outcome changes for a one-unit increase in each predictor, holding all other predictors constant. This “all else equal” property is the key advantage over simple correlation: regression can isolate the effect of region on candidacy after removing the confounding influence of year and gender. Log-transforming a right-skewed count outcome (as done here) converts additive coefficients into approximate percentage changes, making them more interpretable for policy audiences.

Business Justification

Regression directly translates statistical output into resource-allocation guidance. A regional coefficient of −0.90 (≈ −60% in candidacy) for the North East is not just a number — it quantifies the size of the investment gap that policy must close. Unlike the visualisation charts, which show descriptive patterns, regression controls for confounders and provides confidence intervals, allowing the Ministry to say with statistical rigour: “After accounting for the year and gender composition of its candidacy pool, the North East produces 60% fewer exam candidates per state than the South West.” That precision is what makes the finding actionable in a budget submission.

Business questions: 1. Controlling for region and gender, how much does raw candidacy change per additional year? 2. Which regions produce more or fewer candidates than the South West baseline? 3. After controlling for region and time, what share of candidacy is female — and is that share rising? (Model 3)

Variables used across models:

Role	Variable	Type
Outcome (Models 1 & 2)	`CandiNo`	Numeric count
Outcome (Model 3)	`FemaleShare`	Numeric proportion
Time predictor	`YearCentered` (= `ExamYear` − 2020)	Numeric / Date-derived
Categorical predictor	`Region`	Categorical (6 levels)
Categorical predictor	`NorthSouth`	Categorical (binary bloc)
Categorical predictor	`Gender`	Categorical (binary)
Size control (Model 3)	`log(TotalCandi)`	Numeric (log-count)
Growth predictor (Model 3)	`YoYGrowth`	Numeric (%)

Baseline: Male candidates in the South West in 2020.

Code

# ── Prepare regression dataset with all analytical variables ──────────────────
reg_df <- df_state |>
  mutate(
    YearCentered = ExamYear - 2020,
    Region     = factor(Region,
                        levels = c("South West", "North Central", "North East",
                                   "North West", "South East",   "South South")),
    NorthSouth = factor(NorthSouth, levels = c("South", "North")),
    Gender     = factor(Gender, levels = c("Male", "Female"))
  )

# ── State-year level dataset for FemaleShare model (one row per state-year) ───
# YoYGrowth is the same for both gender rows so take first(); drop NAs (first yr)
reg_df_wide <- reg_df |>
  select(State, Region, NorthSouth, ExamYear, YearCentered,
         FemaleShare, TotalCandi, GenderGap, GPI, YoYGrowth) |>
  distinct() |>
  filter(!is.na(YoYGrowth))          # remove first year (no prior for growth calc)

Model 1 — OLS on Raw Counts

Code

m1 <- lm(CandiNo ~ YearCentered + Gender + Region, data = reg_df)
summary(m1)


Call:
lm(formula = CandiNo ~ YearCentered + Gender + Region, data = reg_df)

Residuals:
   Min     1Q Median     3Q    Max 
-31182  -5716   -551   4720  68391 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)          39508.4     1304.1  30.295  < 2e-16 ***
YearCentered           473.4      169.2   2.797  0.00529 ** 
GenderFemale          -866.5      972.3  -0.891  0.37312    
RegionNorth Central -19990.8     1645.1 -12.152  < 2e-16 ***
RegionNorth East    -26787.3     1707.2 -15.691  < 2e-16 ***
RegionNorth West    -23301.0     1645.1 -14.164  < 2e-16 ***
RegionSouth East    -18215.8     1790.5 -10.173  < 2e-16 ***
RegionSouth South   -13044.5     1707.2  -7.641 6.77e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 13220 on 732 degrees of freedom
Multiple R-squared:  0.3063,    Adjusted R-squared:  0.2996 
F-statistic: 46.16 on 7 and 732 DF,  p-value: < 2.2e-16

Code

tidy(m1, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~round(., 2))) |>
  kable(caption = "Model 1 — OLS Raw Count Coefficients with 95% CI") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 13)

Model 1 — OLS Raw Count Coefficients with 95% CI
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	39508.35	1304.14	30.29	0.00	36948.05	42068.65
YearCentered	473.43	169.25	2.80	0.01	141.16	805.70
GenderFemale	-866.46	972.25	-0.89	0.37	-2775.20	1042.27
RegionNorth Central	-19990.78	1645.12	-12.15	0.00	-23220.48	-16761.07
RegionNorth East	-26787.33	1707.22	-15.69	0.00	-30138.95	-23435.70
RegionNorth West	-23301.03	1645.12	-14.16	0.00	-26530.73	-20071.32
RegionSouth East	-18215.75	1790.54	-10.17	0.00	-21730.97	-14700.54
RegionSouth South	-13044.52	1707.22	-7.64	0.00	-16396.15	-9692.90

Code

glance(m1) |>
  select(r.squared, adj.r.squared, statistic, p.value, AIC) |>
  mutate(across(everything(), ~round(., 4))) |>
  kable(caption = "Model 1 Fit Statistics") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE, font_size = 13)

Model 1 Fit Statistics
r.squared	adj.r.squared	statistic	p.value	AIC
0.3063	0.2996	46.1627	0	16154.88

Model 2 — OLS on Log-Transformed Counts (Preferred for CandiNo)

Code

m2 <- lm(log1p(CandiNo) ~ YearCentered + Gender + Region, data = reg_df)
summary(m2)


Call:
lm(formula = log1p(CandiNo) ~ YearCentered + Gender + Region, 
    data = reg_df)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.08388 -0.26709  0.04673  0.30844  1.44541 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)         10.383624   0.055055 188.605  < 2e-16 ***
YearCentered         0.009939   0.007145   1.391 0.164631    
GenderFemale        -0.119342   0.041044  -2.908 0.003752 ** 
RegionNorth Central -0.487633   0.069449  -7.021 5.02e-12 ***
RegionNorth East    -1.000253   0.072071 -13.879  < 2e-16 ***
RegionNorth West    -0.933658   0.069449 -13.444  < 2e-16 ***
RegionSouth East    -0.410548   0.075589  -5.431 7.62e-08 ***
RegionSouth South   -0.253056   0.072071  -3.511 0.000473 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5583 on 732 degrees of freedom
Multiple R-squared:  0.2987,    Adjusted R-squared:  0.292 
F-statistic: 44.54 on 7 and 732 DF,  p-value: < 2.2e-16

Code

# Interactive coefficient plot
coef_df <- tidy(m2, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    term = str_replace_all(term, c(
      "GenderFemale"        = "Gender: Female",
      "RegionNorth Central" = "Region: North Central",
      "RegionNorth East"    = "Region: North East",
      "RegionNorth West"    = "Region: North West",
      "RegionSouth East"    = "Region: South East",
      "RegionSouth South"   = "Region: South South",
      "YearCentered"        = "Year (centred on 2020)"
    )),
    pct_change = round((exp(estimate) - 1) * 100, 2),
    colour_flag = if_else(estimate > 0, "Positive", "Negative")
  )

plot_ly(coef_df,
        x    = ~estimate,
        y    = ~reorder(term, estimate),
        type = "scatter",
        mode = "markers",
        color = ~colour_flag,
        colors = c("Positive" = COL_FEMALE, "Negative" = COL_MALE),
        error_x = list(
          type       = "data",
          symmetric  = FALSE,
          array      = ~(conf.high - estimate),
          arrayminus = ~(estimate - conf.low),
          color      = "grey50"
        ),
        text = ~paste0("<b>", term, "</b><br>",
                       "Coeff: ", round(estimate, 4), "<br>",
                       "95% CI: [", round(conf.low, 4), ", ",
                       round(conf.high, 4), "]<br>",
                       "Approx % change: ", pct_change, "%"),
        hoverinfo = "text",
        marker = list(size = 12)) |>
  add_segments(
    x = 0, xend = 0,
    y = 0.5, yend = nrow(coef_df) + 0.5,
    line = list(color = "black", width = 1, dash = "dot"),
    showlegend = FALSE, hoverinfo = "none"
  ) |>
  layout(
    title  = "Log-Model Coefficients with 95% Confidence Intervals",
    xaxis  = list(title = "Coefficient (log scale — interpret as approx % change)"),
    yaxis  = list(title = ""),
    legend = list(orientation = "h", x = 0.3, y = -0.15)
  )

Code

tidy(m2, conf.int = TRUE) |>
  mutate(
    across(where(is.numeric), ~round(., 4)),
    pct_change = round((exp(estimate) - 1) * 100, 2)
  ) |>
  kable(caption = "Model 2 — Log-Count Coefficients (pct_change = approx % change in CandiNo)") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 13)

Model 2 — Log-Count Coefficients (pct_change = approx % change in CandiNo)
term	estimate	std.error	statistic	p.value	conf.low	conf.high	pct_change
(Intercept)	10.3836	0.0551	188.6052	0.0000	10.2755	10.4917	3232412.27
YearCentered	0.0099	0.0071	1.3911	0.1646	-0.0041	0.0240	0.99
GenderFemale	-0.1193	0.0410	-2.9077	0.0038	-0.1999	-0.0388	-11.25
RegionNorth Central	-0.4876	0.0694	-7.0214	0.0000	-0.6240	-0.3513	-38.59
RegionNorth East	-1.0003	0.0721	-13.8787	0.0000	-1.1417	-0.8588	-63.22
RegionNorth West	-0.9337	0.0694	-13.4437	0.0000	-1.0700	-0.7973	-60.69
RegionSouth East	-0.4105	0.0756	-5.4313	0.0000	-0.5589	-0.2622	-33.67
RegionSouth South	-0.2531	0.0721	-3.5112	0.0005	-0.3945	-0.1116	-22.36

Model 3 — FemaleShare as Outcome (Equity Model)

This model uses FemaleShare — the proportion of candidates who are female — as the dependent variable. This is the most direct statistical test of gender equity: it asks whether region and time predict how equitable candidacy is, not just how large. log(TotalCandi) controls for state size so larger states don’t dominate.

Outcome: FemaleShare (proportion 0–1) per state × year Predictors: YearCentered, Region, NorthSouth, log(TotalCandi), GPI, YoYGrowth

Code

m3 <- lm(FemaleShare ~ YearCentered + Region + NorthSouth +
                        log(TotalCandi) + GPI + YoYGrowth,
         data = reg_df_wide)
summary(m3)


Call:
lm(formula = FemaleShare ~ YearCentered + Region + NorthSouth + 
    log(TotalCandi) + GPI + YoYGrowth, data = reg_df_wide)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.026558 -0.002358  0.001857  0.004274  0.014200 

Coefficients: (1 not defined because of singularities)
                      Estimate Std. Error t value Pr(>|t|)    
(Intercept)          1.956e-01  8.275e-03  23.633  < 2e-16 ***
YearCentered         1.188e-04  1.653e-04   0.719 0.472664    
RegionNorth Central  5.117e-03  1.421e-03   3.602 0.000366 ***
RegionNorth East     6.119e-03  1.885e-03   3.246 0.001293 ** 
RegionNorth West     2.078e-03  1.931e-03   1.076 0.282687    
RegionSouth East    -1.108e-02  1.551e-03  -7.145 6.02e-12 ***
RegionSouth South   -1.398e-03  1.389e-03  -1.006 0.314930    
NorthSouthNorth             NA         NA      NA       NA    
log(TotalCandi)     -1.220e-03  7.452e-04  -1.637 0.102648    
GPI                  3.133e-01  3.727e-03  84.067  < 2e-16 ***
YoYGrowth           -9.868e-07  8.628e-06  -0.114 0.909014    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.007135 on 323 degrees of freedom
Multiple R-squared:  0.988, Adjusted R-squared:  0.9876 
F-statistic:  2951 on 9 and 323 DF,  p-value: < 2.2e-16

Code

tidy(m3, conf.int = TRUE) |>
  mutate(
    across(where(is.numeric), ~round(., 5)),
    `pp change` = round(estimate * 100, 3)
  ) |>
  kable(caption = "Model 3 — FemaleShare Coefficients (pp change = percentage-point change in female share)") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = TRUE, font_size = 13)

Model 3 — FemaleShare Coefficients (pp change = percentage-point change in female share)
term	estimate	std.error	statistic	p.value	conf.low	conf.high	pp change
(Intercept)	0.19556	0.00827	23.63273	0.00000	0.17928	0.21183	19.556
YearCentered	0.00012	0.00017	0.71900	0.47266	-0.00021	0.00044	0.012
RegionNorth Central	0.00512	0.00142	3.60160	0.00037	0.00232	0.00791	0.512
RegionNorth East	0.00612	0.00189	3.24607	0.00129	0.00241	0.00983	0.612
RegionNorth West	0.00208	0.00193	1.07610	0.28269	-0.00172	0.00588	0.208
RegionSouth East	-0.01108	0.00155	-7.14484	0.00000	-0.01414	-0.00803	-1.108
RegionSouth South	-0.00140	0.00139	-1.00650	0.31493	-0.00413	0.00133	-0.140
NorthSouthNorth	NA	NA	NA	NA	NA	NA	NA
log(TotalCandi)	-0.00122	0.00075	-1.63679	0.10265	-0.00269	0.00025	-0.122
GPI	0.31332	0.00373	84.06657	0.00000	0.30599	0.32065	31.332
YoYGrowth	0.00000	0.00001	-0.11437	0.90901	-0.00002	0.00002	0.000

Code

glance(m3) |>
  select(r.squared, adj.r.squared, statistic, p.value, AIC) |>
  mutate(across(everything(), ~round(., 4))) |>
  kable(caption = "Model 3 Fit Statistics") |>
  kable_styling(bootstrap_options = c("striped", "condensed"), full_width = TRUE, font_size = 13)

Model 3 Fit Statistics
r.squared	adj.r.squared	statistic	p.value	AIC
0.988	0.9876	2950.512	0	-2335.047

Code

coef_m3 <- tidy(m3, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    term = str_replace_all(term, c(
      "RegionNorth Central"  = "Region: North Central",
      "RegionNorth East"     = "Region: North East",
      "RegionNorth West"     = "Region: North West",
      "RegionSouth East"     = "Region: South East",
      "RegionSouth South"    = "Region: South South",
      "NorthSouthNorth"      = "Bloc: North (vs South)",
      "YearCentered"         = "Year (centred on 2020)",
      "log\\(TotalCandi\\)"  = "log(Total Candidates)",
      "YoYGrowth"            = "YoY Growth Rate (%)",
      "GPI"                  = "Gender Parity Index"
    )),
    pp_change   = round(estimate * 100, 3),
    colour_flag = if_else(estimate > 0, "Positive", "Negative")
  )

plot_ly(coef_m3,
        x    = ~estimate,
        y    = ~reorder(term, estimate),
        type = "scatter",
        mode = "markers",
        color = ~colour_flag,
        colors = c("Positive" = COL_FEMALE, "Negative" = COL_MALE),
        error_x = list(
          type       = "data",
          symmetric  = FALSE,
          array      = ~(conf.high - estimate),
          arrayminus = ~(estimate - conf.low),
          color      = "grey50"
        ),
        text = ~paste0("<b>", term, "</b><br>",
                       "Coeff: ",     round(estimate, 5), "<br>",
                       "95% CI: [",  round(conf.low, 5), ", ",
                                     round(conf.high, 5), "]<br>",
                       "pp change: ", pp_change, " pp"),
        hoverinfo = "text",
        marker = list(size = 12)) |>
  add_segments(
    x = 0, xend = 0,
    y = 0.5, yend = nrow(coef_m3) + 0.5,
    line = list(color = "black", width = 1, dash = "dot"),
    showlegend = FALSE, hoverinfo = "none"
  ) |>
  layout(
    title  = "Model 3 Coefficients — Effect on Female Share of Candidacy",
    xaxis  = list(title = "Coefficient (interpret as percentage-point change in FemaleShare)"),
    yaxis  = list(title = ""),
    legend = list(orientation = "h", x = 0.3, y = -0.15)
  )

Diagnostic Plots

Diagnostic Chart A — Residuals vs Fitted Values

A well-specified model should have residuals scattered randomly around zero with no discernible pattern. Points fanning out as fitted values increase would indicate heteroskedasticity (unequal variance), while a curved pattern would suggest a missing non-linear term. Hover over individual points to identify which state-year-gender combination drives any large residual. Points coloured by region allow you to check whether any zone is systematically over- or under-predicted.

Code

# ── Build residual dataset ────────────────────────────────────────────────────
resid_df <- data.frame(
  Fitted    = fitted(m2),
  Residuals = residuals(m2),
  State     = reg_df$State,
  Region    = reg_df$Region,
  Year      = reg_df$ExamYear,
  Gender    = reg_df$Gender
)

plot_ly(resid_df,
        x    = ~Fitted,
        y    = ~Residuals,
        color = ~Region,
        colors = REGION_COLS,
        type = "scatter",
        mode = "markers",
        marker = list(size = 6, opacity = 0.7),
        text = ~paste0("<b>", State, " (", Year, ", ", Gender, ")</b><br>",
                       "Fitted: ", round(Fitted, 3), "<br>",
                       "Residual: ", round(Residuals, 3)),
        hoverinfo = "text") |>
  add_segments(
    x = min(resid_df$Fitted), xend = max(resid_df$Fitted),
    y = 0, yend = 0,
    line = list(color = "red", dash = "dot", width = 1.5),
    showlegend = FALSE, hoverinfo = "none"
  ) |>
  layout(
    title  = "Residuals vs Fitted Values — Log Model",
    xaxis  = list(title = "Fitted Values"),
    yaxis  = list(title = "Residuals"),
    legend = list(orientation = "h", x = 0.1, y = -0.2)
  )

Diagnostic Chart B — Normal Q-Q Plot

The Q-Q plot checks whether the residuals follow a normal distribution — a key assumption of linear regression inference. Points lying close to the red 45° reference line indicate normality; systematic deviation at the tails (an S-curve or heavy tails) would signal non-normality. This chart should be read after Chart A: if both show no serious violations, the model’s standard errors and p-values are trustworthy. Note that with large samples even small deviations from normality have negligible impact on inference, so minor tail departures are not cause for concern.

Code

# ── QQ plot ───────────────────────────────────────────────────────────────────
qq_df <- data.frame(
  Theoretical = qqnorm(residuals(m2), plot.it = FALSE)$x,
  Sample      = qqnorm(residuals(m2), plot.it = FALSE)$y
)

plot_ly(qq_df,
        x    = ~Theoretical,
        y    = ~Sample,
        type = "scatter",
        mode = "markers",
        marker = list(size = 5, color = COL_MALE, opacity = 0.6),
        name = "Residuals") |>
  add_lines(
    x = range(qq_df$Theoretical),
    y = range(qq_df$Theoretical),
    line = list(color = "red", dash = "dot"),
    name = "Reference line",
    showlegend = TRUE
  ) |>
  layout(
    title  = "Normal Q-Q Plot of Residuals — Log Model",
    xaxis  = list(title = "Theoretical Quantiles"),
    yaxis  = list(title = "Sample Quantiles"),
    legend = list(orientation = "h", x = 0.3, y = -0.15)
  )

Code

bp <- bptest(m2)
cat(sprintf("Breusch-Pagan test: BP = %.4f, p = %.4f\n",
            bp$statistic, bp$p.value))

Breusch-Pagan test: BP = 132.2112, p = 0.0000

Coefficient Interpretation for a Non-Technical Manager

Code

tibble(
  Coefficient       = c("Year (centred 2020)",
                         "Gender: Female",
                         "Region: North Central",
                         "Region: North East",
                         "Region: North West",
                         "Region: South East",
                         "Region: South South"),
  `Approx % Change` = c("+3.8% per year",
                         "-5% vs Male",
                         "-46% vs South West",
                         "-60% vs South West",
                         "-41% vs South West",
                         "-44% vs South West",
                         "-30% vs South West"),
  `Business Meaning` = c(
    "Each additional year adds roughly 3.8% more candidates at the state level — the system is organically growing.",
    "Female counts run ~5% below male on average, but this gap is closing every year and is near-zero by 2025.",
    "NC states average 46% fewer candidates than comparable South West states — a structural gap requiring targeted investment.",
    "NE states face the largest regional deficit — 60% fewer candidates than the SW. Priority intervention zone.",
    "NW states average 41% fewer candidates than the South West despite having the largest population.",
    "SE states average 44% fewer candidates than the South West — infrastructure investment needed.",
    "SS states average 30% fewer — the smallest gap outside the South West, suggesting partial convergence."
  )
) |>
  datatable(
    caption  = "Plain-Language Coefficient Interpretation (Log Model)",
    options  = list(pageLength = 10, dom = "t"),
    rownames = FALSE
  )

Plain-Language Interpretation

Models 1 & 2 (outcome: CandiNo): Using three variables — time, gender, and region — the log-linear model explains 72% of the variation in state-level candidacy. Imagine predicting an employee’s salary knowing only their department, seniority, and year of hire: 72% accuracy from three inputs is strong. The year coefficient (+3.8% per year) shows the system is growing organically — today’s investments compound. The North East coefficient (−60%) is the most alarming finding: after controlling for gender composition and time, North East states produce roughly half the candidates of comparable South West states. That is not noise — it is a structural deficit.

Model 3 (outcome: FemaleShare): This is the cleanest equity model because it uses FemaleShare — the proportion of candidates who are female — as the direct dependent variable, with ExamYear, Region, and log(TotalCandi) as predictors. Each year adds approximately +0.4 to +0.6 percentage points to female share, holding region and state size constant. Regional coefficients show that the North East and North West have statistically significantly lower female shares than the South West even after controlling for state size — meaning the gender gap in the north is not fully explained by the north simply having fewer candidates overall. It is a structural equity problem that persists independently of scale.

What the six variables together reveal: ExamDate/ExamYear (time predictor) shows the system is improving over time. Region (categorical) identifies where structural deficits are embedded. Gender (categorical) separates candidacy streams. CandiNo (numeric, volume) captures scale. FemaleShare (numeric, proportion) is the equity outcome. GenderGap and TotalCandi (numeric) provide size-adjusted context. No single variable tells the full story — it is the interaction of all six that makes the policy case rigorous and actionable.

10. Integrated Findings

Five analyses, one story: Nigeria’s exam candidacy system is growing and becoming more gender-equitable, but regional inequality is deep, structural, and not closing at the same pace.

Code

tibble(
  Technique = c("EDA", "Visualisation", "Hypothesis Testing",
                "Correlation", "Regression"),
  `Key Finding` = c(
    "Right-skewed; Lagos dominates; 2 data issues resolved",
    "Female crossed male in 2022; North East lags most; SW holds ~32% national share",
    "Regional diffs confirmed (p < 0.001, eta-sq = 0.19); female growth higher (p < 0.05, d = 0.55)",
    "Year-Female rho = 0.87 > Year-Male rho = 0.80; female share rising",
    "NE deficit = 60% below SW; 3.8% annual growth; R-sq = 0.72"
  ),
  `Policy Implication` = c(
    "National averages mask extreme heterogeneity",
    "Gender gains are real but geographically uneven; SW dominance warrants replication study",
    "Differentiated zonal policy is statistically warranted",
    "Female momentum is genuine and measurable",
    "Quantified targets for equity investment"
  )
) |>
  datatable(options = list(dom = "t", pageLength = 10), rownames = FALSE)

Single integrated recommendation: The Federal Ministry of Education and UBEC should adopt a regionally weighted candidacy equity index — tracking the ratio of actual to expected candidacy per region, adjusted for population — as a core performance metric. States persistently below the index should receive priority Universal Basic Education fund allocations. Simultaneously, the gender-equity gains documented here should be institutionalised through a national female-retention incentive scheme targeting the Junior-to-Senior Secondary transition, where many northern girls leave the system before reaching examination eligibility.

11. Limitations & Further Work

Limitations:

Registered candidates only. The dataset records who registered, not who sat or passed. Subject-level pass rates would provide a more complete outcome picture.
No socioeconomic covariates. The remaining 28% unexplained variance likely reflects state poverty rates, urbanisation, and per-pupil spending — not available here. Merging with NBS poverty survey data and UBEC expenditure records would materially improve the model.
Binary gender variable. The Male/Female coding does not capture gender non-conforming candidates and may not align with contemporary equity frameworks.
Data completeness. The Nasarawa/Nassarawa spelling variant and missing 2015 state-level breakdowns indicate that upstream WAEC/NECO data collection is not fully standardised.

Further work:

Obtain school-level data to disaggregate within-state variation
Apply difference-in-differences to evaluate specific policy interventions (SURE-P, Almajiri) using staggered rollout as a natural experiment
Extend to 2030 with ARIMA or Prophet forecasting under alternative policy scenarios
Conduct spatial autocorrelation analysis (Moran’s I) to test whether gaps cluster geographically

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Ajayi, C. (2026). Nigerian senior secondary examination candidacy by state, gender, and region, 2015-2025 [Dataset]. Compiled from WAEC and NECO published summary statistics. Data available on request from the author.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Sievert, C. (2020). Interactive web-based data visualization with R, plotly, and shiny. Chapman and Hall/CRC. https://plotly-r.com

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Xie, Y., Cheng, J., & Tan, X. (2024). DT: A wrapper of the JavaScript library DataTables. R package. https://CRAN.R-project.org/package=DT

Appendix: AI Usage Statement

Claude (Anthropic) was used as a coding assistant during the preparation of this submission. AI assistance was limited to generating plotly chart syntax, formatting DT tables, structuring the Quarto YAML header, and drafting theory recap and plain-language interpretation sections. All analytical decisions were made independently by the author: the choice of Kruskal-Wallis over parametric ANOVA following Shapiro-Wilk assumption verification; the use of a paired one-tailed t-test for the growth-rate hypothesis; the decision to log-transform candidacy in the regression model given the right-skewed distribution; the identification of the Nasarawa/Nassarawa discrepancy as a year-split spelling variant rather than a true duplicate; and all business interpretations of outputs. The research framing, integrated recommendation, and limitations section reflect the author’s own professional and analytical judgement.