Staff Performance Analytics: Exploratory and Inferential Analysis of Employee Appraisal Data at the Kwara State Internal Revenue Service

Author

La-Kadri Yusuf

Published

May 18, 2026

Code
#| label: setup

#| message: false

#| warning: false



library(tidyverse)
Warning: package 'tidyverse' was built under R version 4.5.3
Warning: package 'ggplot2' was built under R version 4.5.3
Warning: package 'tibble' was built under R version 4.5.3
Warning: package 'tidyr' was built under R version 4.5.3
Warning: package 'readr' was built under R version 4.5.3
Warning: package 'purrr' was built under R version 4.5.3
Warning: package 'dplyr' was built under R version 4.5.3
Warning: package 'stringr' was built under R version 4.5.3
Warning: package 'forcats' was built under R version 4.5.3
Warning: package 'lubridate' was built under R version 4.5.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.2.0     ✔ readr     2.2.0
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.2     ✔ tibble    3.3.1
✔ lubridate 1.9.5     ✔ tidyr     1.3.2
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(readxl)
Warning: package 'readxl' was built under R version 4.5.3
Code
library(knitr)
Warning: package 'knitr' was built under R version 4.5.3
Code
library(scales)
Warning: package 'scales' was built under R version 4.5.3

Attaching package: 'scales'

The following object is masked from 'package:purrr':

    discard

The following object is masked from 'package:readr':

    col_factor
Code
library(broom)
Warning: package 'broom' was built under R version 4.5.3
Code
options(scipen = 999, digits = 4)

1. Executive Summary

This study presents an exploratory and inferential analysis of annual staff performance appraisal records for 852 employees of the Kwara State Internal Revenue Service (KWIRS) spanning the 2022–2025 assessment cycles. As Director of Administration and Operations, the central business problem is understanding what organisational and historical factors explain variation in 2025 performance scores, and whether gaps across grade bands and office locations are statistically real or attributable to chance.

Five techniques were applied sequentially. Exploratory Data Analysis uncovered two data quality issues: performance columns stored as character strings requiring numeric conversion, and a near-defunct 360-degree appraisal component where approximately 90% of staff carry zero scores. Data visualisation revealed a persistent performance gap between Zonal Offices and HQ Directorates across all four years, and notable differences across grade bands. Hypothesis testing confirmed both the grade-band effect (Kruskal-Wallis, p < 0.001) and the HQ–Zonal gap (Welch t-test, p = 0.002) are statistically significant, though with small effect sizes. Correlation analysis revealed that year-to-year performance consistency at KWIRS is surprisingly weak (r = 0.13–0.39), raising questions about the reliability of the appraisal instrument. Regression confirmed that prior-year performance and Zonal Office location are the two significant predictors of 2025 scores, while grade band (Management vs Revenue Staff) is not significant once prior performance is controlled.

The central recommendation is that KWIRS prioritise a calibration review of its appraisal process — targeting HOD scoring consistency, mandatory 360-degree completion, and targeted resource investment in Zonal Offices.


2. Professional Disclosure

Name: La-Kadri Yusuf

Job Title: Director, Administration and Operations

Organisation: Kwara State Internal Revenue Service (KWIRS)

Sector: Public Sector — State Tax Administration, Nigeria

Technique Justifications

Exploratory Data Analysis: As Director of Admin and Operations, I am responsible for the integrity of all staff records, including the annual appraisal dataset. Before any HR policy decision — promotions, PIPs, or training allocations — the underlying data must be clean and structurally understood. EDA is therefore the essential first step: it surfaces encoding errors, missing values, outliers, and distributional anomalies that, left unaddressed, would invalidate all downstream analyses. In a public-sector context where appraisal scores bear directly on staff careers, data quality is a compliance matter as much as an analytical one.

Data Visualisation: KWIRS performance reports are presented quarterly to the Executive Committee. Translating numerical scores into clear charts enables management to rapidly identify which directorates or grades are underperforming, track longitudinal trends, and prioritise training resources — without requiring recipients to interpret raw tables. Choosing the right chart type for the right audience is a core professional competence I apply regularly.

Hypothesis Testing: KWIRS operates across three Zonal Offices and eleven HQ Directorates. Observed performance differences between units may reflect genuine structural disparities or natural sampling variation. Before recommending differential interventions — additional supervision or targeted training for Zonal staff — I need statistical evidence that observed differences are real. Hypothesis testing provides precisely this: a principled framework for making decisions under uncertainty.

Correlation Analysis: The 2025 appraisal instrument incorporates four components: Quantitative Assessment, Qualitative Evaluation, Examinations, and 360-Degree Feedback. If these components are highly intercorrelated, the instrument may be redundant. Conversely, low correlations confirm each component captures a distinct performance dimension. Additionally, weak year-to-year correlations in individual scores have direct implications for the appraisal system’s predictive validity and fairness.

Linear Regression: The most actionable HR planning question is: can we predict which staff are at risk of underperforming in the coming cycle? Regression quantifies how strongly prior-year performance, grade band, and office location jointly determine current-year scores. The resulting coefficients translate directly into policy levers — for instance, a significant negative coefficient on Zonal Office location signals a structural (not individual) problem warranting systemic, not behavioural, intervention.


3. Data Collection and Sampling

Source and Collection Method

The dataset comprises annual performance appraisal records maintained by the Human Resources unit of the Kwara State Internal Revenue Service. Data was extracted from the KWIRS Human Resources Information System (HRIS) under the authority of the Director of Administration and Operations. Each record corresponds to a single staff member’s appraisal covering assessment cycles from January 2022 to December 2024 (with 2025 representing appraisal of 2024 activities, finalised in Q1 2025). Component-level scores for the 2025 cycle — Quantitative Assessment (up to 60 marks), Qualitative Evaluation (20 marks), Examination (10–20 marks depending on grade), and 360-Degree Feedback (20 marks) — were recorded by Heads of Department (HODs), validated by Zonal Coordinators, reviewed by the Director, and approved by HR.

Sampling Frame and Sample Size

This is a census of the entire active KWIRS staff population: all 852 permanent and contract employees in active employment for at least one complete appraisal cycle are included. Staff on long-term medical leave, on secondment to other agencies, or serving probation were excluded from the HRIS extract. Because the dataset covers the full population rather than a random sample, inferential tests are applied to assess whether observed patterns reflect systematic organisational factors rather than to estimate unknown population parameters.

Time Period Covered

The dataset spans four appraisal years: 2022, 2023, 2024, and 2025 (reflecting performance in calendar years 2021–2024 respectively). The 2025 cycle was finalised in Q1 2025. A subset of staff — primarily recent joiners — have incomplete records for earlier years.

Ethical Considerations

All personally identifiable information has been anonymised. Staff numbers are masked and no names appear in any output. Scores are analysed in aggregate or at directorate/grade level only. The data extraction was conducted in the course of official HR duties, consistent with KWIRS internal data governance policy. No external ethics approval was required for this internal institutional dataset used for educational research purposes.


4. Exploratory Data Analysis

Technique 1 — Book Reference: Chapter 4 (Summary statistics, missing-value analysis, outlier detection)

4.1 Data Loading and Preparation

Code
#| label: load-data

#| message: false

#| warning: false



df_raw <- read_excel("Take home.xlsx")



# Fix curly-apostrophe encoding in column names (Excel artefact in col 19)

names(df_raw) <- str_replace_all(names(df_raw), "\u2019", "'")



df <- df_raw |>

  rename(

    sn            = `S/N`,

    staff_no      = `Staff Number`,

    grade         = Grade,

    directorate   = Directorate,

    department    = Department,

    perf_2022     = `2022 Performance (100%)`,

    perf_2023     = `2023 Performance (100%)`,

    perf_2024     = `2024 Performance (100%)`,

    quant         = `Quantitative (50%/60%)`,

    qual          = `Qualitative (20%)`,

    exam          = `Exam (20%/10%)`,

    deg360        = `360 Degree (20%)`,

    perf_2025     = `2025 Average Performance (100%)`,

    avg_score     = `Average Score (100%)`,

    strengths     = `Strength (Identified Skills)`,

    improvement   = `Areas for Improvement`,

    hod_comment   = `HOD's Comment`,

    appr_comment  = `Appraisee's Comment`,

    coord_comment = `Zonal Coordinator's Comment`,

    dir_comment   = `Director's Comment`,

    hr_comment    = `HR Comment`

  ) |>

  mutate(

    # Data Quality Issue 1: three columns stored as character in source Excel

    perf_2022   = suppressWarnings(as.numeric(perf_2022)),

    perf_2023   = suppressWarnings(as.numeric(perf_2023)),

    exam        = suppressWarnings(as.numeric(exam)),

    office_type = if_else(str_detect(directorate, "Zonal"),

                          "Zonal Office", "HQ Directorate"),

    grade_band  = case_when(

      grade %in% c("Revenue Trainee", "Revenue Assistant",

                   "Revenue Officer", "Deputy Revenue Officer") ~ "Revenue Staff",

      grade %in% c("Assistant Manager", "Deputy Manager", "Manager",

                   "Principal Manager", "Senior Manager",

                   "Assistant Director")                        ~ "Management",

      grade %in% c("Driver", "Senior Driver", "Assistant Senior Driver",

                   "Assistant Chief Driver", "Chief Driver")   ~ "Support (Drivers)",

      TRUE ~ "Other Support"

    ),

    grade_band = factor(grade_band,

                        levels = c("Revenue Staff", "Management",

                                   "Support (Drivers)", "Other Support"))

  ) |>

  filter(!is.na(grade), !is.na(perf_2025))



cat("Final dataset:", nrow(df), "staff records x", ncol(df), "variables\n")
Final dataset: 852 staff records x 23 variables
Code
cat("Grade band counts:\n")
Grade band counts:
Code
print(table(df$grade_band))

    Revenue Staff        Management Support (Drivers)     Other Support 
              696               114                34                 8 
Code
cat("Office type counts:\n")
Office type counts:
Code
print(table(df$office_type))

HQ Directorate   Zonal Office 
           646            206 

4.2 Summary Statistics

Code
#| label: eda-summary



df |>

  select(perf_2022, perf_2023, perf_2024, perf_2025,

         quant, qual, exam, deg360) |>

  pivot_longer(everything(), names_to = "Variable", values_to = "Value") |>

  group_by(Variable) |>

  summarise(

    n       = sum(!is.na(Value)),

    Missing = sum(is.na(Value)),

    Mean    = round(mean(Value, na.rm = TRUE), 2),

    SD      = round(sd(Value,   na.rm = TRUE), 2),

    Min     = round(min(Value,  na.rm = TRUE), 2),

    Q1      = round(quantile(Value, 0.25, na.rm = TRUE), 2),

    Median  = round(median(Value,   na.rm = TRUE), 2),

    Q3      = round(quantile(Value, 0.75, na.rm = TRUE), 2),

    Max     = round(max(Value,  na.rm = TRUE), 2),

    .groups = "drop"

  ) |>

  mutate(Variable = recode(Variable,

    perf_2022 = "2022 Performance",  perf_2023 = "2023 Performance",

    perf_2024 = "2024 Performance",  perf_2025 = "2025 Performance",

    quant = "Quantitative (2025)",   qual  = "Qualitative (2025)",

    exam  = "Exam (2025)",           deg360 = "360-Degree (2025)"

  )) |>

  kable(caption = "Table 1. Summary statistics — all numeric performance variables")
Table 1. Summary statistics — all numeric performance variables
Variable n Missing Mean SD Min Q1 Median Q3 Max
360-Degree (2025) 832 20 0.66 2.99 0.00 0.00 0.00 0.00 16.60
Exam (2025) 803 49 8.98 2.56 0.00 7.60 9.20 10.80 14.80
2022 Performance 743 109 79.07 6.52 49.45 75.17 80.06 83.35 94.04
2023 Performance 827 25 81.05 8.54 45.61 77.34 82.99 86.95 96.19
2024 Performance 852 0 82.95 7.01 42.51 79.58 84.18 87.48 97.60
2025 Performance 852 0 80.66 6.68 51.75 77.84 81.67 85.26 92.80
Qualitative (2025) 852 0 17.08 1.49 0.00 16.40 17.20 18.00 20.00
Quantitative (2025) 852 0 53.61 5.98 23.75 51.03 53.86 59.67 60.00

4.3 Data Quality Issues

Issue 1 — Character-encoded numeric columns. The columns 2022 Performance, 2023 Performance, and Exam were stored as character strings in the source Excel file rather than numbers. This prevents arithmetic operations and would silently produce NA values in any downstream analysis that assumes numeric input. The preparation code above applies as.numeric() to resolve this. An additional consequence is that 109 staff have no 2022 performance record and 25 have no 2023 record — these are principally recent joiners who were not yet employed during those cycles, not encoding errors.

Code
#| label: missing-table



df |>

  select(perf_2022, perf_2023, perf_2024, quant, qual, exam,

         deg360, perf_2025, avg_score) |>

  summarise(across(everything(), ~sum(is.na(.)))) |>

  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") |>

  mutate(

    Pct_Missing = round(Missing / nrow(df) * 100, 1),

    Variable = recode(Variable,

      perf_2022 = "2022 Performance", perf_2023 = "2023 Performance",

      perf_2024 = "2024 Performance", perf_2025 = "2025 Performance",

      quant = "Quantitative (2025)",  qual  = "Qualitative (2025)",

      exam  = "Exam (2025)",          deg360 = "360-Degree (2025)",

      avg_score = "Average Score (All Years)"

    )

  ) |>

  arrange(desc(Missing)) |>

  kable(caption = "Table 2. Missing values in numeric performance columns",

        col.names = c("Variable", "Missing Records", "% Missing"))
Table 2. Missing values in numeric performance columns
Variable Missing Records % Missing
2022 Performance 109 12.8
Exam (2025) 49 5.8
2023 Performance 25 2.9
360-Degree (2025) 20 2.3
2024 Performance 0 0.0
Quantitative (2025) 0 0.0
Qualitative (2025) 0 0.0
2025 Performance 0 0.0
Average Score (All Years) 0 0.0

Issue 2 — Dysfunctional 360-Degree component. The 360-Degree column carries 20 NA values and a very large proportion of exact zero scores. Since perf_2025 = quant + qual + exam + deg360 by construction, every staff member with a zero 360 score loses up to 20 marks from their total — a structural penalty that has no bearing on their actual work performance.

Code
#| label: deg360-quality



tibble(

  Category = c(

    "Score > 0  (review completed)",

    "Score = 0  (likely non-participation)",

    "Missing (NA)"

  ),

  Count = c(

    sum(!is.na(df$deg360) & df$deg360 > 0),

    sum(!is.na(df$deg360) & df$deg360 == 0),

    sum(is.na(df$deg360))

  )

) |>

  mutate(Pct = round(Count / nrow(df) * 100, 1)) |>

  kable(caption = "Table 3. Distribution of 360-Degree component scores — Data Quality Issue 2",

        col.names = c("Category", "Count", "% of All Staff"))
Table 3. Distribution of 360-Degree component scores — Data Quality Issue 2
Category Count % of All Staff
Score > 0 (review completed) 39 4.6
Score = 0 (likely non-participation) 793 93.1
Missing (NA) 20 2.3

4.4 Outlier Detection

Code
#| label: outlier-detection



detect_outliers <- function(x, varname) {

  q1  <- quantile(x, 0.25, na.rm = TRUE)

  q3  <- quantile(x, 0.75, na.rm = TRUE)

  iqr <- q3 - q1

  tibble(

    Variable     = varname,

    Q1           = round(q1, 2),

    Q3           = round(q3, 2),

    Lower_Fence  = round(q1 - 1.5 * iqr, 2),

    Upper_Fence  = round(q3 + 1.5 * iqr, 2),

    N_Below      = sum(x < (q1 - 1.5 * iqr), na.rm = TRUE),

    N_Above      = sum(x > (q3 + 1.5 * iqr), na.rm = TRUE),

    Total_Flagged = N_Below + N_Above

  )

}



bind_rows(

  detect_outliers(df$perf_2025, "2025 Performance"),

  detect_outliers(df$perf_2024, "2024 Performance"),

  detect_outliers(df$quant,     "Quantitative"),

  detect_outliers(df$qual,      "Qualitative"),

  detect_outliers(df$exam,      "Exam"),

  detect_outliers(df$deg360,    "360-Degree")

) |>

  kable(caption = "Table 4. Outlier detection — IQR method (fences = Q1/Q3 ± 1.5 × IQR)")
Table 4. Outlier detection — IQR method (fences = Q1/Q3 ± 1.5 × IQR)
Variable Q1 Q3 Lower_Fence Upper_Fence N_Below N_Above Total_Flagged
2025 Performance 77.84 85.26 66.71 96.39 37 0 37
2024 Performance 79.58 87.48 67.73 99.33 29 0 29
Quantitative 51.03 59.67 38.09 72.61 20 0 20
Qualitative 16.40 18.00 14.00 20.40 15 0 15
Exam 7.60 10.80 2.80 15.60 17 0 17
360-Degree 0.00 0.00 0.00 0.00 0 39 39

Outliers at the lower end dominate, particularly in the 360-Degree component where most values are zero. These records are retained rather than removed: they represent genuine cases of non-participation or low assessment, which are substantively important for HR decision-making.

4.5 Distribution of the Outcome Variable

Code
#| label: outcome-dist

#| fig-height: 4.5



ggplot(df, aes(x = perf_2025)) +

  geom_histogram(binwidth = 3, fill = "#2c7bb6", colour = "white", alpha = 0.85) +

  geom_vline(xintercept = mean(df$perf_2025, na.rm = TRUE),

             colour = "#d7191c", linewidth = 1, linetype = "dashed") +

  annotate("text",

           x     = mean(df$perf_2025, na.rm = TRUE) + 3.5,

           y     = Inf, vjust = 2,

           label = paste0("Mean = ", round(mean(df$perf_2025, na.rm = TRUE), 1)),

           colour = "#d7191c", size = 3.5) +

  labs(

    title    = "Figure 1. Distribution of 2025 Staff Performance Scores",

    subtitle = paste0("KWIRS — ", nrow(df), " staff"),

    x        = "2025 Performance Score (out of 100)",

    y        = "Number of Staff"

  ) +

  theme_minimal(base_size = 13)

The distribution is left-skewed: most staff cluster between 75 and 90, with a long lower tail. The mean (80.7) sits below the median (81.7), confirming the skew. This asymmetry favours non-parametric statistical tests and means regression residuals may not be perfectly normal — both are addressed in the relevant sections below.


5. Data Visualisation

Technique 2 — Book Reference: Chapter 5 (Grammar of graphics, chart selection, storytelling with data)

The five plots below collectively tell a single story: performance at KWIRS is structurally unequal — by grade, by location, and across time — and the gaps have not narrowed over the four-year period covered by this data.

Figure 2 — Performance by Grade Band

Code
#| label: plot-grade-box



ggplot(df, aes(y = grade_band, x = perf_2025, fill = grade_band)) +

  geom_boxplot(alpha = 0.7, outlier.shape = 21, outlier.size = 1.5) +

  geom_vline(xintercept = median(df$perf_2025, na.rm = TRUE),

             linetype = "dashed", colour = "grey40", linewidth = 0.8) +

  scale_fill_brewer(palette = "Set2", guide = "none") +

  labs(

    title    = "Figure 2. 2025 Performance Scores by Grade Band",

    subtitle = "Dashed line = organisation-wide median",

    x        = "2025 Performance Score (out of 100)",

    y        = NULL

  ) +

  theme_minimal(base_size = 13)

Support (Drivers) show a notably higher median than both Revenue Staff and Management, while Management and Revenue Staff are visually similar — a pattern confirmed formally in Section 8. Other Support has a very wide spread, driven by a small group (n = 8).

Figure 3 — Mean Performance by Directorate

Code
#| label: plot-directorate

#| fig-height: 5.5



df |>

  group_by(directorate) |>

  summarise(mean_perf = mean(perf_2025, na.rm = TRUE),

            n = n(), .groups = "drop") |>

  mutate(directorate = fct_reorder(directorate, mean_perf)) |>

  ggplot(aes(x = mean_perf, y = directorate)) +

  geom_col(fill = "#2c7bb6", alpha = 0.85) +

  geom_vline(xintercept = mean(df$perf_2025, na.rm = TRUE),

             linetype = "dashed", colour = "#d7191c", linewidth = 0.9) +

  geom_text(aes(label = paste0("n=", n)), hjust = -0.2, size = 3.1) +

  scale_x_continuous(limits = c(0, 100),

                     expand  = expansion(mult = c(0, 0.15))) +

  labs(

    title    = "Figure 3. Mean 2025 Performance Score by Directorate",

    subtitle = "Red dashed line = organisation-wide mean",

    x        = "Mean Performance Score (out of 100)",

    y        = NULL

  ) +

  theme_minimal(base_size = 12)

The three Zonal Offices (Kwara Central, North, South) cluster near the bottom of the directorate ranking, while most HQ Directorates sit at or above the organisation-wide mean — motivating the Zonal-vs-HQ hypothesis test in Section 6.

Figure 4 — Performance Trend 2022–2025 by Office Type

Code
#| label: plot-trend



df |>

  select(office_type, perf_2022, perf_2023, perf_2024, perf_2025) |>

  pivot_longer(cols = starts_with("perf"),

               names_to = "year", values_to = "score") |>

  mutate(year = as.integer(str_extract(year, "\\d{4}"))) |>

  group_by(office_type, year) |>

  summarise(mean_score = mean(score, na.rm = TRUE), .groups = "drop") |>

  ggplot(aes(x = year, y = mean_score,

             colour = office_type, group = office_type)) +

  geom_line(linewidth = 1.3) +

  geom_point(size = 3) +

  scale_colour_manual(values = c("HQ Directorate" = "#2c7bb6",

                                 "Zonal Office"   = "#d7191c")) +

  scale_x_continuous(breaks = 2022:2025) +

  labs(

    title  = "Figure 4. Mean Performance Trend: 2022–2025 by Office Type",

    x      = "Assessment Year",

    y      = "Mean Score (out of 100)",

    colour = "Office Type"

  ) +

  theme_minimal(base_size = 13) +

  theme(legend.position = "bottom")

The HQ–Zonal performance gap is persistent across all four years, not a recent development. Neither group shows a meaningful upward trend, suggesting the current performance management system has not yet produced sustained score improvement at the organisational level.

Figure 5 — 2025 Score Distribution: HQ vs Zonal

Code
#| label: plot-violin



ggplot(df, aes(x = office_type, y = perf_2025, fill = office_type)) +

  geom_violin(trim = FALSE, alpha = 0.6) +

  geom_boxplot(width = 0.12, alpha = 0.9, outlier.shape = NA) +

  stat_summary(fun = mean, geom = "point", shape = 18,

               size = 4, colour = "white") +

  scale_fill_manual(values = c("HQ Directorate" = "#2c7bb6",

                                "Zonal Office"   = "#d7191c"),

                    guide = "none") +

  labs(

    title    = "Figure 5. 2025 Score Distribution by Office Type",

    subtitle = "White diamond = group mean; box = IQR; violin = full density",

    x        = NULL,

    y        = "2025 Performance Score (out of 100)"

  ) +

  theme_minimal(base_size = 13)

The HQ distribution is more symmetrically concentrated in the upper range. The Zonal distribution has a heavier lower tail, confirming that low-performing staff are disproportionately located in the field offices.

Figure 6 — Prior-Year vs Current-Year Performance

Code
#| label: plot-scatter



ggplot(df, aes(x = perf_2024, y = perf_2025, colour = office_type)) +

  geom_point(alpha = 0.35, size = 1.8) +

  geom_smooth(method = "lm", se = FALSE, linewidth = 1.2) +

  scale_colour_manual(values = c("HQ Directorate" = "#2c7bb6",

                                 "Zonal Office"   = "#d7191c")) +

  labs(

    title    = "Figure 6. 2024 vs 2025 Performance Scores",

    subtitle = "Separate linear trends by office type",

    x        = "2024 Performance Score",

    y        = "2025 Performance Score",

    colour   = "Office Type"

  ) +

  theme_minimal(base_size = 13) +

  theme(legend.position = "bottom")
`geom_smooth()` using formula = 'y ~ x'

There is a positive but modest relationship between 2024 and 2025 scores. The HQ cluster sits systematically above the Zonal cluster at equivalent prior-year scores, consistent with a location-based effect independent of individual capability — confirmed formally in Section 8.


6. Hypothesis Testing

Technique 3 — Book Reference: Chapter 6 (t-test, non-parametric alternatives, effect sizes)

6.1 Test 1 — Does 2025 Performance Differ by Grade Band?

Business context: Before designing grade-differentiated training programmes, management needs statistical confirmation that performance differences across grades are real and not attributable to sampling variation.

Hypotheses:

H₀: The distribution of 2025 performance scores is identical across all four grade bands.

H₁: At least one grade band has a different distribution of 2025 performance scores.

Assumption check — normality:

Code
#| label: normality-check



set.seed(3847)

df |>

  filter(!is.na(perf_2025)) |>

  group_by(grade_band) |>

  summarise(

    n        = n(),

    W        = round(shapiro.test(perf_2025)$statistic, 4),

    p_value  = round(shapiro.test(perf_2025)$p.value,   4),

    Normal   = if_else(shapiro.test(perf_2025)$p.value > 0.05, "Yes", "No"),

    .groups  = "drop"

  ) |>

  kable(caption = "Table 5. Shapiro-Wilk normality test per grade band (n ≤ 50 per group)",

        col.names = c("Grade Band", "n", "W Statistic", "p-value", "Approx. Normal?"))
Table 5. Shapiro-Wilk normality test per grade band (n ≤ 50 per group)
Grade Band n W Statistic p-value Approx. Normal?
Revenue Staff 696 0.9243 0.0000 No
Management 114 0.9425 0.0001 No
Support (Drivers) 34 0.9610 0.2597 Yes
Other Support 8 0.7942 0.0248 No

Normality is rejected for several groups (p < 0.05), consistent with the left-skewed distribution observed in Section 4. A non-parametric Kruskal-Wallis test is therefore appropriate.

Kruskal-Wallis Rank-Sum Test:

Code
#| label: kruskal-test



kw_result <- kruskal.test(perf_2025 ~ grade_band, data = df)

kw_result

    Kruskal-Wallis rank sum test

data:  perf_2025 by grade_band
Kruskal-Wallis chi-squared = 34, df = 3, p-value = 0.0000002
Code
# Epsilon-squared effect size (manual formula)

H_stat  <- kw_result$statistic

k_grp   <- length(levels(df$grade_band))

n_tot   <- sum(!is.na(df$perf_2025))

eps2_val <- (H_stat - k_grp + 1) / (n_tot - k_grp)

cat("\nEpsilon-squared (effect size):", round(eps2_val, 4), "\n")

Epsilon-squared (effect size): 0.0364 
Code
cat("Interpretation: small effect (threshold small = 0.01, medium = 0.06)\n")
Interpretation: small effect (threshold small = 0.01, medium = 0.06)

Post-hoc pairwise comparisons (Bonferroni-corrected Wilcoxon tests):

Code
#| label: posthoc-wilcox

#| message: false

#| warning: false



pw <- pairwise.wilcox.test(df$perf_2025, df$grade_band,

                            p.adjust.method = "bonferroni")



as.data.frame(pw$p.value) |>

  rownames_to_column("Group") |>

  mutate(across(where(is.numeric), ~case_when(

    is.na(.)    ~ "—",

    . < 0.001   ~ "< 0.001",

    TRUE        ~ as.character(round(., 4))

  ))) |>

  kable(caption = "Table 6. Post-hoc pairwise Wilcoxon p-values (Bonferroni corrected)")
Table 6. Post-hoc pairwise Wilcoxon p-values (Bonferroni corrected)
Group Revenue Staff Management Support (Drivers)
Management 1
Support (Drivers) < 0.001 < 0.001
Other Support 1 1 1

Result and interpretation: The Kruskal-Wallis test is highly significant (H(3) = 33.826, p < 0.001). We reject H₀. The effect size ε² = 0.0364 indicates a small effect (ε² < 0.06). Post-hoc comparisons show that Support (Drivers) score significantly higher than both Revenue Staff and Management (adjusted p < 0.001), while Management and Revenue Staff do not differ significantly from each other.

Business implication: The performance ranking difference across grades is real, but the effect size is small — most of the variance in scores is shared across grades rather than explained by grade level. Notably, management-grade staff do not significantly outperform Revenue Staff, and Support (Drivers) unexpectedly lead, suggesting that HOD scoring standards vary across staff categories rather than reflecting systematic capability differences. A calibration workshop for HODs is warranted.


6.2 Test 2 — Does 2025 Performance Differ Between HQ and Zonal Staff?

Business context: KWIRS’s three Zonal Offices serve geographically dispersed operations with potentially different resourcing and supervisory quality compared to HQ. Establishing whether the HQ–Zonal performance gap is statistically significant is essential before recommending targeted field-office interventions.

Hypotheses:

H₀: Mean 2025 performance score is the same for HQ and Zonal staff.

H₁: HQ and Zonal staff differ in mean 2025 performance score.

Code
#| label: group-descriptives



df |>

  group_by(office_type) |>

  summarise(

    n    = n(),

    Mean = round(mean(perf_2025, na.rm = TRUE), 2),

    SD   = round(sd(perf_2025,   na.rm = TRUE), 2),

    .groups = "drop"

  ) |>

  kable(caption = "Table 7. Descriptive statistics by office type",

        col.names = c("Office Type", "n", "Mean Score", "SD"))
Table 7. Descriptive statistics by office type
Office Type n Mean Score SD
HQ Directorate 646 81.13 5.90
Zonal Office 206 79.17 8.51

Both groups are large (n > 30), so by the Central Limit Theorem the sampling distribution of the mean is approximately normal. A Welch t-test (which does not assume equal variances) is used.

Code
#| label: welch-ttest



t_result <- t.test(perf_2025 ~ office_type, data = df, var.equal = FALSE)

t_result

    Welch Two Sample t-test

data:  perf_2025 by office_type
t = 3.1, df = 271, p-value = 0.002
alternative hypothesis: true difference in means between group HQ Directorate and group Zonal Office is not equal to 0
95 percent confidence interval:
 0.7079 3.2144
sample estimates:
mean in group HQ Directorate   mean in group Zonal Office 
                       81.13                        79.17 
Code
# Cohen's d — manual computation

hq_scores  <- df$perf_2025[df$office_type == "HQ Directorate"]

zon_scores <- df$perf_2025[df$office_type == "Zonal Office"]

pool_sd_val <- sqrt(

  ((length(hq_scores) - 1) * var(hq_scores,  na.rm = TRUE) +

   (length(zon_scores) - 1) * var(zon_scores, na.rm = TRUE)) /

  (length(hq_scores) + length(zon_scores) - 2)

)

d_val <- (mean(hq_scores, na.rm = TRUE) - mean(zon_scores, na.rm = TRUE)) / pool_sd_val

cat("\nCohen's d:", round(d_val, 4),

    "\nInterpretation: small-to-medium effect (threshold small = 0.2, medium = 0.5)\n")

Cohen's d: 0.296 
Interpretation: small-to-medium effect (threshold small = 0.2, medium = 0.5)

Result and interpretation: The Welch t-test is significant (t(270.8) = 3.081, p = 0.0023). The 95% confidence interval for the mean difference is [0.71, 3.21] points, with HQ staff scoring on average 1.96 points higher. Cohen’s d = 0.296 indicates a small-to-medium effect size.

Business implication: We reject H₀. The HQ–Zonal gap is statistically real. As Section 8’s regression will show, this gap persists even after controlling for grade band and prior-year performance, pointing to a structural rather than individual explanation. KWIRS should investigate whether Zonal offices lack adequate training infrastructure, supervisory oversight, or exam preparation resources compared to HQ.


7. Correlation Analysis

Technique 4 — Book Reference: Chapter 8 (Pearson, Spearman, partial correlation, correlation vs causation)

7.1 Pearson Correlation Heatmap

Code
#| label: corr-heatmap

#| fig-height: 6.5

#| fig-width: 8.5

#| message: false



cor_num <- df |>

  select(

    `2022 Perf`  = perf_2022,

    `2023 Perf`  = perf_2023,

    `2024 Perf`  = perf_2024,

    `2025 Perf`  = perf_2025,

    Quantitative = quant,

    Qualitative  = qual,

    Exam         = exam,

    `360-Degree` = deg360

  ) |>

  drop_na()



cor_matrix <- cor(cor_num, method = "pearson")



# Compute p-values for significance annotation

n_cor   <- nrow(cor_num)

t_mat   <- cor_matrix * sqrt((n_cor - 2) / (1 - cor_matrix^2))

p_mat   <- 2 * pt(abs(t_mat), df = n_cor - 2, lower.tail = FALSE)

diag(p_mat) <- 1  # diagonal not tested



cor_matrix |>

  as.data.frame() |>

  rownames_to_column("Var1") |>

  pivot_longer(-Var1, names_to = "Var2", values_to = "r") |>

  left_join(

    p_mat |> as.data.frame() |>

      rownames_to_column("Var1") |>

      pivot_longer(-Var1, names_to = "Var2", values_to = "p"),

    by = c("Var1", "Var2")

  ) |>

  mutate(

    Var1  = factor(Var1, levels = colnames(cor_matrix)),

    Var2  = factor(Var2, levels = rev(colnames(cor_matrix))),

    label = if_else(Var1 == Var2, "",

                    paste0(round(r, 2),

                           case_when(p < 0.001 ~ "***",

                                     p < 0.01  ~ "**",

                                     p < 0.05  ~ "*",

                                     TRUE      ~ "")))

  ) |>

  ggplot(aes(x = Var1, y = Var2, fill = r)) +

  geom_tile(colour = "white", linewidth = 0.5) +

  geom_text(aes(label = label), size = 3) +

  scale_fill_gradient2(low = "#6D9EC1", mid = "white", high = "#E46726",

                       midpoint = 0, limits = c(-1, 1)) +

  labs(title = "Figure 7. Pearson Correlation Matrix",

       subtitle = "*** p<0.001  ** p<0.01  * p<0.05",

       fill  = "r") +

  theme_minimal(base_size = 12) +

  theme(axis.text.x = element_text(angle = 45, hjust = 1),

        axis.title  = element_blank())

7.2 Spearman Rank Correlations (Year-to-Year)

Code
#| label: spearman-year



df |>

  select(`2022 Perf` = perf_2022, `2023 Perf` = perf_2023,

         `2024 Perf` = perf_2024, `2025 Perf` = perf_2025) |>

  drop_na() |>

  (\(d) cor(d, method = "spearman"))() |>

  round(3) |>

  kable(caption = "Table 8. Spearman rank correlations — annual performance scores (robust to skew)")
Table 8. Spearman rank correlations — annual performance scores (robust to skew)
2022 Perf 2023 Perf 2024 Perf 2025 Perf
2022 Perf 1.000 0.351 0.150 0.162
2023 Perf 0.351 1.000 0.196 0.149
2024 Perf 0.150 0.196 1.000 0.257
2025 Perf 0.162 0.149 0.257 1.000

7.3 Partial Correlation

Controlling for 2023 performance, the partial correlation between 2022 and 2025 performance tests whether the earliest year retains any predictive signal once more recent history is accounted for.

Code
#| label: partial-corr



pc_data    <- df |> select(perf_2022, perf_2023, perf_2025) |> drop_na()

r12_pc     <- cor(pc_data$perf_2022, pc_data$perf_2023)

r13_pc     <- cor(pc_data$perf_2022, pc_data$perf_2025)

r23_pc     <- cor(pc_data$perf_2023, pc_data$perf_2025)

r_partial_val <- (r13_pc - r12_pc * r23_pc) /

                  sqrt((1 - r12_pc^2) * (1 - r23_pc^2))



# t-statistic and p-value for the partial correlation

n_pc  <- nrow(pc_data)

t_pc  <- r_partial_val * sqrt((n_pc - 2 - 1) / (1 - r_partial_val^2))

p_pc  <- 2 * pt(abs(t_pc), df = n_pc - 3, lower.tail = FALSE)



tibble(

  Test       = "2022 Perf \u2194 2025 Perf | 2023 Perf",

  r_partial  = round(r_partial_val, 4),

  t_stat     = round(t_pc, 4),

  p_value    = round(p_pc, 4),

  n          = n_pc

) |>

  kable(caption = "Table 9. Partial correlation: 2022 vs 2025 performance, controlling for 2023")
Table 9. Partial correlation: 2022 vs 2025 performance, controlling for 2023
Test r_partial t_stat p_value n
2022 Perf ↔︎ 2025 Perf | 2023 Perf 0.1009 2.758 0.006 743

7.4 Discussion of Key Correlations

1. The surprisingly weak year-to-year performance consistency. The bivariate correlations between adjacent assessment years are modest at best: r = 0.39 (2022–2023), r = 0.17 (2023–2024), and r = 0.23 (2024–2025). In a well-calibrated appraisal system one would typically expect adjacent-year correlations above 0.5, since individual capability is relatively stable. These low values suggest that performance rankings at KWIRS change substantially from year to year — more than genuine capability shifts would warrant. The most plausible explanations are inconsistent HOD scoring standards across cycles, changes in assessment weighting between years, or rater fatigue. This is the most important quality signal in the dataset.

2. The 360-Degree anomaly. The 360-Degree component is negatively correlated with both Quantitative (r = −0.35) and Exam (r = −0.39) scores. Because approximately 90% of 360 scores are zero, this negative correlation almost certainly reflects a selection effect: staff who did receive a 360 review (non-zero scores) have a different performance profile from those who did not — not a genuine inverse relationship between objective assessment and peer feedback. This further underscores that the 360-degree instrument is currently non-functional for most of the workforce.

3. Quantitative component vs 2025 total (r = 0.86). This very high correlation is expected and arises by construction: perf_2025 = quant + qual + exam + deg360, and Quantitative carries up to 60 of the 100 available points. The practical implication is that the 2025 overall score is almost entirely determined by the Quantitative sub-score. Any effort to improve aggregate performance must focus on the Quantitative component specifically.

Partial correlation note: After controlling for 2023 performance, the partial correlation between 2022 and 2025 drops to r = 0.101 (p = 0.006). The 2022 performance year retains a small but statistically significant relationship with 2025 even after accounting for the 2023 cycle, suggesting a mild long-memory component in individual performance trajectories.


8. Linear Regression

Technique 5 — Book Reference: Chapter 9 (OLS — coefficients, diagnostics, interpretation)

8.1 Model Specification and Business Justification

Business question: After controlling for an individual’s grade level and office location, how strongly does prior-year performance predict 2025 performance? A significant coefficient on 2024 performance confirms predictive validity in the appraisal system; significant grade or location coefficients quantify structural penalties or premiums.

Why prior-year scores rather than 2025 sub-components as predictors? perf_2025 is the arithmetic sum of quant + qual + exam + deg360. Regressing a variable on its own components yields R² ≈ 1.0 and coefficients ≈ 1.0 by algebraic identity — not an empirical finding. The model below uses variables that are not components of the outcome, producing genuinely informative coefficients.

\[\text{perf\_2025} = \beta_0 + \beta_1\,\text{perf\_2024} + \beta_2\,\text{grade\_band} + \beta_3\,\text{office\_type} + \varepsilon\]

Reference levels: Revenue Staff (grade_band) and HQ Directorate (office_type).

Code
#| label: model-fit



df_model <- df |>

  mutate(

    grade_band  = relevel(grade_band, ref = "Revenue Staff"),

    office_type = factor(office_type, levels = c("HQ Directorate", "Zonal Office"))

  ) |>

  filter(!is.na(perf_2024), !is.na(perf_2025))



model <- lm(perf_2025 ~ perf_2024 + grade_band + office_type, data = df_model)

8.2 Model Results

Code
#| label: model-coefficients



tidy(model, conf.int = TRUE) |>

  mutate(

    term = case_when(

      term == "(Intercept)"                    ~ "Intercept",

      term == "perf_2024"                      ~ "2024 Performance Score",

      term == "grade_bandManagement"           ~ "Grade: Management (ref = Revenue Staff)",

      term == "grade_bandSupport (Drivers)"    ~ "Grade: Support — Drivers (ref = Revenue Staff)",

      term == "grade_bandOther Support"        ~ "Grade: Other Support (ref = Revenue Staff)",

      term == "office_typeZonal Office"        ~ "Office: Zonal Office (ref = HQ Directorate)",

      TRUE                                     ~ term

    ),

    sig = case_when(

      p.value < 0.001 ~ "***",

      p.value < 0.01  ~ "**",

      p.value < 0.05  ~ "*",

      p.value < 0.10  ~ ".",

      TRUE            ~ ""

    )

  ) |>

  select(Term = term, `β` = estimate, SE = std.error,

         `t` = statistic, `p-value` = p.value,

         `CI Low` = conf.low, `CI High` = conf.high, Sig = sig) |>

  mutate(across(where(is.numeric), ~round(., 4))) |>

  kable(caption = "Table 10. OLS Regression — Predictors of 2025 Performance Score")
Table 10. OLS Regression — Predictors of 2025 Performance Score
Term β SE t p-value CI Low CI High Sig
Intercept 61.6740 2.6142 23.5920 0.0000 56.5429 66.8050 ***
2024 Performance Score 0.2333 0.0317 7.3686 0.0000 0.1711 0.2954 ***
Grade: Management (ref = Revenue Staff) -0.6425 0.6602 -0.9732 0.3307 -1.9384 0.6533
Grade: Support — Drivers (ref = Revenue Staff) 4.5720 1.1277 4.0544 0.0001 2.3587 6.7854 ***
Grade: Other Support (ref = Revenue Staff) 0.6261 2.2660 0.2763 0.7824 -3.8216 5.0738
Office: Zonal Office (ref = HQ Directorate) -1.9391 0.5191 -3.7358 0.0002 -2.9579 -0.9203 ***
Code
#| label: model-fit-stats



glance(model) |>

  select(R2 = r.squared, `Adj. R²` = adj.r.squared,

         F    = statistic, df = df,

         `Residual df` = df.residual, `p-value` = p.value) |>

  mutate(across(everything(), ~round(., 4))) |>

  kable(caption = "Table 11. Model fit statistics")
Table 11. Model fit statistics
R2 Adj. R² F df Residual df p-value
0.0983 0.0929 18.44 5 846 0

8.3 Diagnostic Plots

Code
#| label: diagnostics

#| layout-ncol: 2

#| fig-height: 4



model_aug <- augment(model)



ggplot(model_aug, aes(.fitted, .resid)) +

  geom_point(alpha = 0.3, size = 1.5) +

  geom_hline(yintercept = 0, linetype = "dashed", colour = "red") +

  geom_smooth(method = "loess", se = FALSE,

              colour = "#2c7bb6", linewidth = 0.9) +

  labs(title = "Residuals vs Fitted",

       x = "Fitted Values", y = "Residuals") +

  theme_minimal(base_size = 11)
`geom_smooth()` using formula = 'y ~ x'

Code
ggplot(model_aug, aes(sample = .std.resid)) +

  stat_qq(alpha = 0.3, size = 1.2) +

  stat_qq_line(colour = "red", linewidth = 0.9) +

  labs(title = "Normal Q-Q",

       x = "Theoretical Quantiles", y = "Std. Residuals") +

  theme_minimal(base_size = 11)

Code
ggplot(model_aug, aes(.fitted, sqrt(abs(.std.resid)))) +

  geom_point(alpha = 0.3, size = 1.5) +

  geom_smooth(method = "loess", se = FALSE,

              colour = "#2c7bb6", linewidth = 0.9) +

  labs(title = "Scale-Location",

       x = "Fitted Values", y = "\u221a|Std. Residuals|") +

  theme_minimal(base_size = 11)
`geom_smooth()` using formula = 'y ~ x'

Code
ggplot(model_aug, aes(seq_along(.cooksd), .cooksd)) +

  geom_col(fill = "#2c7bb6", alpha = 0.7) +

  geom_hline(yintercept = 4 / nrow(model_aug),

             linetype = "dashed", colour = "red") +

  labs(title = "Cook's Distance",

       subtitle = "Dashed = 4/n threshold",

       x = "Observation Index", y = "Cook's Distance") +

  theme_minimal(base_size = 11)

  • Residuals vs Fitted: Roughly centred on zero with no strong non-linear pattern, supporting linearity. A slight heteroscedastic tendency at low fitted values is consistent with the left-skewed score distribution.

  • Normal Q-Q: Central residuals align well with the theoretical normal line; the lower tail deviates slightly. With n = 852, OLS estimates are robust to mild non-normality.

  • Scale-Location: No dramatic fanning; heteroscedasticity is minor and unlikely to materially bias inference.

  • Cook’s Distance: No observation exceeds the 4/n threshold by a wide margin, confirming the absence of highly influential records.

8.4 Multicollinearity Check (VIF)

Code
#| label: vif-check



mm_vif      <- model.matrix(model)[, -1]

vif_computed <- diag(solve(cor(mm_vif)))



tibble(

  Predictor  = names(vif_computed),

  VIF        = round(vif_computed, 3),

  Assessment = if_else(vif_computed < 5, "Acceptable (< 5)", "Concern (≥ 5)")

) |>

  kable(caption = "Table 12. Variance Inflation Factors — all values < 1.1 confirm no multicollinearity")
Table 12. Variance Inflation Factors — all values < 1.1 confirm no multicollinearity
Predictor VIF Assessment
perf_2024 1.037 Acceptable (< 5)
grade_bandManagement 1.065 Acceptable (< 5)
grade_bandSupport (Drivers) 1.027 Acceptable (< 5)
grade_bandOther Support 1.007 Acceptable (< 5)
office_typeZonal Office 1.041 Acceptable (< 5)

All VIF values are below 1.1 — well within the acceptable threshold of 5. The predictors contribute independent information.

8.5 Business Interpretation of Significant Coefficients

The model explains 9.8% of variance in 2025 scores (Adj. R² = 0.093). While statistically significant as a whole (F-test p < 0.001), the low R² is itself an important finding — discussed below.

Predictor | β | Significance | Plain-Language Interpretation |

|—|—|—|—|

2024 Performance | 0.233 | *** | Each additional point in 2024 predicts only 0.23 points more in 2025. A 10-point 2024 improvement predicts just a 2.3-point 2025 gain. Action: While significant, this low coefficient — combined with the weak bivariate correlation of r = 0.23 — suggests the appraisal system has low predictive validity. Historical performance is a poor guide to future performance, raising concerns about HOD scoring consistency. |
Management Grade | -0.643 | (n.s.) | Once prior-year performance is controlled, Management grades do NOT score significantly higher than Revenue Staff. The visual impression of management outperformance (Section 5) is driven by management staff also having higher 2024 scores — not a grade-specific 2025 advantage. Action: Management appraisal criteria should be reviewed to ensure they are substantively more demanding, not simply applied more generously. |
Support (Drivers) Grade | 4.572 | *** | Driver-grade staff score approximately 4.6 points higher than Revenue Staff after controlling for prior performance and location. This is the most unexpected finding. Possible explanations include more lenient HOD scoring for smaller support teams or different weighting of assessment criteria for non-revenue roles. Action: KWIRS should investigate whether appraisal criteria and HOD calibration for support grades are equivalent to those applied to revenue-generating staff. |
Zonal Office | -1.939 | *** | Controlling for grade and prior performance, Zonal staff score approximately 1.9 points lower than HQ staff. This location penalty is independent of individual capability. Action: This is the most actionable finding. KWIRS should audit training infrastructure, examination facilities, and supervisory resource quality across all three Zonal Offices and implement a structured equalisation programme. |

The low R² (≈ 10%) is itself a key finding. Only one-tenth of the variance in 2025 scores is explained by prior-year performance, grade, and location combined. The remaining 90% reflects factors not captured in this model — likely a combination of (a) genuinely unmeasured determinants (training hours, tenure, HOD relationship quality) and (b) measurement error from inconsistent scoring. This reinforces the recommendation to commission a comprehensive appraisal instrument calibration review.


9. Integrated Findings

The five analytical techniques applied in this study converge on a coherent story about both organisational inequality and measurement quality concerns at KWIRS.

EDA established a well-populated dataset with two solvable quality issues, the more serious being a near-defunct 360-degree component that currently penalises approximately 90% of staff for non-participation in a process they were not required to complete. Visualisation confirmed that the HQ–Zonal performance gap is not recent — it has persisted across all four years without narrowing, suggesting structural rather than transient causes.

Hypothesis testing provided statistical confirmation: both the grade-band differentiation (small effect, ε² = 0.0364) and the HQ–Zonal gap (small-to-medium effect, d = 0.296) are significant beyond the 0.001 threshold. The post-hoc results, however, revealed an unexpected finding: Support (Drivers) grades outperform Revenue Staff significantly, not Management grades — raising questions about scoring standard consistency across staff categories.

Correlation analysis produced the most diagnostically important finding: year-to-year performance correlations at KWIRS are weak (r = 0.13–0.39), far lower than a reliable appraisal system would produce. Combined with the low regression R² of approximately 10%, this points to substantial unexplained variance — consistent with HOD scoring inconsistency across cycles as a root cause.

Single integrated recommendation: KWIRS should immediately commission an Appraisal System Integrity Review with three concrete outputs: (1) mandatory 360-degree review completion tracked as an HOD KPI, eliminating the current near-universal zero-score problem; (2) inter-HOD calibration workshops to align scoring standards across grades and offices, with the goal of raising year-to-year correlations above 0.5; and (3) a Zonal Office Capability Programme providing parity of training, examination infrastructure, and supervisory support relative to HQ — directly addressing the persistent ~2-point location penalty confirmed by regression.


10. Limitations and Further Work

  • Census completeness: Staff who resigned, were dismissed, or were on long-term leave during the period are absent, introducing survivorship bias toward more stable, better-performing employees.

  • Appraisal validity: All analyses assume scores reflect actual performance. If HODs apply inconsistent standards across directorates — plausible given the weak year-to-year correlations — the data measures assessed rather than actual performance.

  • 360-degree ambiguity: Zero scores were retained as-is; without confirmation from HR that they represent non-participation (not genuine zero assessments), any 360-related interpretation should be treated with caution.

  • Omitted variables: The regression excludes potentially important predictors — years of service, educational qualification, training hours attended, and HOD tenure — that could substantially increase R² and change coefficient estimates.

  • Causation vs association: Regression coefficients are associational, not causal. The Zonal Office penalty cannot be interpreted as a pure location effect without controlling for all confounders.

  • Further work: (1) Enrich the model with tenure, education, and training data. (2) Apply logistic regression to predict a binary “at risk” outcome (score < 65) for proactive HR targeting. (3) Conduct a propensity-score matched comparison of HQ and Zonal staff with equivalent grades and prior scores to isolate the causal location effect. (4) Apply text analytics (sentiment analysis, topic modelling) to the HOD comment and Areas-for-Improvement fields to complement the quantitative findings with qualitative insight from the appraisal narratives.


References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

R Core Team. (2025). R: A language and environment for statistical computing (Version 4.5). R Foundation for Statistical Computing. https://www.R-project.org/

Robinson, D., Hayes, A., & Couch, S. (2024). broom: Convert statistical objects into tidy tibbles (Version 1.0.6) [R package]. https://CRAN.R-project.org/package=broom

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., & Bryan, J. (2023). readxl: Read Excel files (Version 1.4.3) [R package]. https://CRAN.R-project.org/package=readxl

Zhu, H. (2024). knitr: A general-purpose package for dynamic report generation in R (Version 1.45) [R package]. https://CRAN.R-project.org/package=knitr


Appendix: AI Usage Statement

Posit Assistant (an AI coding assistant embedded in RStudio) was used to assist in writing and debugging R code for this analysis, including the data-cleaning pipeline, ggplot2 visualisation syntax, manual implementations of effect-size formulas (epsilon-squared, Cohen’s d, partial correlation), and the regression diagnostic plot layout. All analytical decisions were made independently by the author: the selection of Case Study 1, the choice of a non-parametric Kruskal-Wallis test based on the normality check, the decision to use prior-year scores rather than 2025 component scores as regression predictors (to avoid the arithmetic identity problem), the identification of the 360-degree zero-score ambiguity as a data quality concern, and all business interpretations including the unexpected Support (Drivers) finding. The narrative text and all substantive conclusions represent the author’s own professional judgement grounded in operational knowledge of KWIRS performance management processes.


Quarto

Quarto enables you to weave together content and executable code into a finished document. To learn more about Quarto see https://quarto.org.

Code
1 + 1
[1] 2