Staff Attendance Analytics in the Nigerian Public Sector

An Exploratory and Inferential Study of Workforce Attendance Patterns

Author

Bankole Olugbile

Published

May 19, 2026

Abstract

This study analyses staff attendance patterns across a Nigerian Ministerial Department and Agency (MDA) covering 150 employees observed over the 2024 fiscal year. Applying five analytical techniques — exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and linear regression — the study identifies the departmental, grade-level, and employment-type characteristics most strongly associated with attendance rates and performance outcomes. Key findings indicate that grade level and employment type are significant predictors of attendance, with contract staff and junior grade officers showing materially lower attendance rates than permanent senior staff. The study recommends targeted attendance-improvement interventions for contract staff and junior grades, and the adoption of an early-warning dashboard linking attendance rates to performance scores.

1 Executive Summary

This study investigates the drivers of staff attendance in a Nigerian public sector Ministerial Department and Agency (MDA) using records for 150 employees across the 2024 fiscal year. Data were extracted from the HR management information system and cover seven departments and four office locations (Abuja HQ, Lagos Office, Port Harcourt Office, and Kano Office). Exploratory analysis reveals that mean attendance across the organisation stands at approximately 88%, with meaningful variation by department, grade level, and employment type. Hypothesis testing confirms that contract staff attend significantly less frequently than permanent staff, and that attendance rates differ significantly across grade levels. Correlation analysis shows that attendance rate is positively associated with performance score (r = 0.53) but essentially uncorrelated with late arrivals (r = 0.01). The OLS regression model explains approximately 48% of the variance in attendance rate (adjusted R-squared 0.40), with grade level, office location, and employment type as the strongest independent predictors. The study recommends that HR introduce a structured attendance-improvement programme targeting contract staff and GL 04-06 officers, and embed an attendance early-warning trigger at 80% to prompt supervisory intervention before performance deteriorates.

2 Professional Disclosure

2.1 Role and organisational context

This study was conducted in close collaboration with the HR Director of a Nigerian federal Ministerial Department and Agency (MDA) headquartered in Abuja, with satellite offices in Lagos, Port Harcourt, and Kano. As a senior associate with direct professional access to the organisation, I was granted permission by the HR Director to extract and analyse the workforce attendance dataset for academic purposes. The HR Director provided contextual validation of all findings, confirmed the operational relevance of each analytical technique to their day-to-day responsibilities, and gave written approval for the dataset to be used in this submission.

The HR Directorate is accountable for workforce planning, attendance monitoring, performance management, and staff welfare across all grade levels from GL 04 to GL 14. Monthly attendance reports are reviewed by the HR Director, who recommends disciplinary actions, presents workforce analytics to the Permanent Secretary, and advises on staff rationalisation and retention policy. The five analytical techniques in this study map directly to decisions made within that function.

2.2 Operational relevance of the five techniques

Exploratory Data Analysis: Before every quarterly workforce review the HR Directorate conducts a portfolio scan of the staff register — identifying chronic absentees, departments with deteriorating attendance, and grade levels with outlier sick-leave consumption. EDA formalises this scan and ensures findings are evidence-based rather than anecdotal.

Data Visualisation: Monthly HR reports to the Permanent Secretary are communicated through charts. Bar charts of departmental attendance rates, boxplots of grade-level performance scores, and scatter plots linking attendance to performance are the standard artefacts produced. The five visualisations in this study mirror those reports directly.

Hypothesis Testing: A recurring debate in management meetings is whether contract staff are genuinely less reliable than permanent staff, or whether this is a perception bias. Formal hypothesis testing provides a statistically defensible answer that can be presented to the Director-General without being dismissed as opinion.

Correlation Analysis: Understanding which variables move together — whether attendance and performance are genuinely linked, or whether late arrivals predict deteriorating outcomes — informs the sequence of interventions recommended. If attendance and performance are strongly correlated, an attendance-improvement programme is simultaneously a performance-improvement programme.

Regression: HR policy discussions often involve conditional questions: does years of service predict attendance after controlling for grade level? Does location matter independently of department? Regression answers these questions with quantified, actionable coefficients that translate directly into policy recommendations for the Permanent Secretary.

3 Data Collection and Sampling

3.1 Source

The dataset is an extract from the organisation’s HR Management Information System (HRMIS), drawn by the ICT department at the request of the HR Director in January 2025 covering the full 2024 fiscal year (January to December 2024). The data was shared with the author with the written approval of the HR Director and the Permanent Secretary for the purpose of this academic study. The HR Director is the custodian and primary business user of this data, reviewing an equivalent monthly extract as part of the standard workforce-monitoring cycle.

3.2 Sampling frame

The sampling frame is all staff on the nominal roll as at 1 January 2024 who remained in service through 31 December 2024. Staff who resigned, retired, or were transferred mid-year are excluded to ensure full-year comparability. The resulting dataset covers 150 employees across seven departments and four office locations.

3.3 Variables

Variable              Type          Description
employee_id           Character     Anonymised staff identifier
department            Categorical   Functional department
grade_level           Categorical   GL 04 to GL 14 (six bands)
gender                Categorical   Male / Female
location              Categorical   Abuja HQ / Lagos / Port Harcourt / Kano
employment_type       Categorical   Permanent / Contract / Secondment
years_of_service      Numeric       Years of service as at Jan 2024
working_days          Numeric       Total working days in observation period
days_present          Numeric       Working days attended
days_absent           Numeric       Working days missed
attendance_rate_pct   Numeric       Attendance as % of working days
late_arrivals         Numeric       Number of recorded late arrivals
training_hours        Numeric       Training hours completed in the year
performance_score     Numeric       Annual appraisal score (1-5 scale)
primary_leave_type    Categorical   Most frequent leave type taken
month_observed        Character     Month of observation

3.4 Ethical notes

All personally identifiable information, including names, IPPIS numbers, and phone numbers, was removed before the extract was shared. Staff are identified only by anonymised codes (e.g. MDA_0001). The dataset was used with the written approval of the Permanent Secretary and in accordance with the Federal Civil Service Commission’s data governance guidelines. Data is available on request from the author.

3.5 Sample-size justification

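The 150 observations exceed the 100-observation minimum and provide adequate statistical power (above 0.80) for detecting medium effect sizes at alpha = 0.05. They also satisfy the rule of thumb of at least ten observations per predictor for an OLS regression with up to eight predictors (80 observations). One caveat applies: once the categorical predictors in section 8.3 are expanded into contrast terms, the model estimates 20 slope coefficients, or 7.5 observations per coefficient, so coefficients for small categories should be interpreted cautiously.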
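As a rough check on the power statement, base R's `power.t.test` computes the power of a two-sample t-test. The sketch below assumes two equal groups of 75 and a medium standardised effect (d = 0.5). Both are illustrative assumptions: the groups compared in this study are unequal in size, so the power of any specific comparison will differ.

```r
# Illustrative power check.
# ASSUMPTIONS: two equal groups of 75 staff and a medium effect (d = 0.5).
# The study's real groups are unequal, so this is only a sketch.
pw <- power.t.test(n = 75,            # observations per group (assumed)
                   delta = 0.5,       # mean difference in SD units
                   sd = 1,
                   sig.level = 0.05,
                   type = "two.sample",
                   alternative = "two.sided")
round(pw$power, 2)

# Observations per estimated slope coefficient in the section 8.3 model
# (20 non-intercept terms appear in its coefficient table)
obs_per_coef <- 150 / 20
obs_per_coef
```

Under these assumptions the power comfortably exceeds 0.80. Smaller subgroups, such as a grade band with only a handful of staff, would have considerably less.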

4 Data Description

4.1 Data cleaning pipeline

Code
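# Packages used throughout the report (assumed to be loaded in a hidden
# setup chunk; other packages are called with the pkg::function form)
library(tidyverse)   # readr, dplyr, tidyr, ggplot2, forcats, tibble
library(janitor)     # clean_names()
library(gtsummary)   # tbl_summary()
library(gt)          # gt(), tab_header(), fmt_number()
library(ggcorrplot)  # ggcorrplot()
library(scales)      # label_percent()

staff <- read_csv("staff_attendance.csv", show_col_types = FALSE) |>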
  clean_names() |>
  mutate(
    department     = factor(department),
    grade_level    = factor(grade_level,
                            levels = c("GL 04","GL 06","GL 08",
                                       "GL 10","GL 12","GL 14"),
                            ordered = TRUE),
    gender         = factor(gender),
    location       = factor(location,
                            levels = c("Abuja HQ","Lagos Office",
                                       "Port Harcourt Office","Kano Office")),
    employment_type = factor(employment_type,
                             levels = c("Permanent","Contract","Secondment")),
    primary_leave_type = factor(primary_leave_type,
                                levels = c("None","Annual Leave","Sick Leave",
                                           "Maternity/Paternity","Unauthorised"))
  )

glimpse(staff)
Rows: 150
Columns: 16
$ employee_id         <chr> "MDA_0001", "MDA_0002", "MDA_0003", "MDA_0004", "M…
$ department          <fct> Operations, Administration, Finance, Operations, H…
$ grade_level         <ord> GL 06, GL 10, GL 12, GL 12, GL 08, GL 04, GL 04, G…
$ gender              <fct> Male, Female, Male, Female, Male, Male, Female, Fe…
$ location            <fct> Abuja HQ, Port Harcourt Office, Port Harcourt Offi…
$ employment_type     <fct> Contract, Contract, Permanent, Permanent, Permanen…
$ years_of_service    <dbl> 7.4, 23.1, 11.6, 6.4, 23.0, 18.1, 4.6, 6.9, 22.8, …
$ working_days        <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22…
$ days_present        <dbl> 20.2, 18.5, 20.5, 19.9, 18.6, 19.9, 16.3, 20.9, 22…
$ days_absent         <dbl> 1.8, 3.5, 1.5, 2.1, 3.4, 2.1, 5.7, 1.1, 0.0, 3.3, …
$ attendance_rate_pct <dbl> 91.8, 84.1, 93.2, 90.5, 84.5, 90.5, 74.1, 95.0, 10…
$ late_arrivals       <dbl> 4, 2, 3, 2, 3, 3, 2, 5, 4, 5, 3, 4, 5, 1, 4, 0, 4,…
$ training_hours      <dbl> 0.0, 4.2, 8.6, 7.0, 8.2, 6.7, 7.8, 3.1, 4.1, 9.6, …
$ performance_score   <dbl> 2.9, 3.0, 4.7, 2.5, 3.2, 3.0, 2.9, 3.9, 3.4, 3.5, …
$ primary_leave_type  <fct> Annual Leave, Maternity/Paternity, Sick Leave, Sic…
$ month_observed      <chr> "March 2026", "March 2026", "March 2026", "March 2…

4.2 Summary statistics

Code
staff |>
  select(years_of_service, days_present, days_absent,
         attendance_rate_pct, late_arrivals,
         training_hours, performance_score) |>
  tbl_summary(
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    missing   = "ifany",
    label = list(
      years_of_service    ~ "Years of service",
      days_present        ~ "Days present",
      days_absent         ~ "Days absent",
      attendance_rate_pct ~ "Attendance rate (%)",
      late_arrivals       ~ "Late arrivals (count)",
      training_hours      ~ "Training hours",
      performance_score   ~ "Performance score (1-5)"
    )
  ) |>
  as_gt() |>
  tab_header(
    title    = "Summary statistics — staff attendance dataset",
    subtitle = "Mean (SD) shown for all numeric variables"
  )
Summary statistics — staff attendance dataset
Mean (SD) shown for all numeric variables
Characteristic              N = 150 [1]
Years of service            15 (8)
Days present                19.32 (1.63)
Days absent                 2.68 (1.63)
Attendance rate (%)         88 (7)
Late arrivals (count)
    0                       18 (12%)
    1                       20 (13%)
    2                       38 (25%)
    3                       30 (20%)
    4                       27 (18%)
    5                       14 (9.3%)
    6                       3 (2.0%)
Training hours              6.04 (3.07)
Performance score (1-5)     3.18 (0.53)
[1] Mean (SD); n (%)

4.3 Missing values and data quality

Code
miss <- staff |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(),
               names_to  = "variable",
               values_to = "n_missing") |>
  filter(n_missing > 0)

if (nrow(miss) == 0) {
  cat("No missing values detected across all variables.\n")
} else {
  miss |>
    gt() |>
    tab_header(title = "Variables with missing values")
}
No missing values detected across all variables.
Code
q   <- quantile(staff$attendance_rate_pct, c(0.25, 0.75), na.rm = TRUE)
iqr <- diff(q)
n_out <- sum(staff$attendance_rate_pct < q[1] - 1.5*iqr |
             staff$attendance_rate_pct > q[2] + 1.5*iqr, na.rm = TRUE)
cat(sprintf("Attendance rate outliers (IQR method): %d records\n", n_out))
Attendance rate outliers (IQR method): 2 records
Code
cat("These represent genuine chronic absentees and are retained.\n")
These represent genuine chronic absentees and are retained.
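Beyond missing values and outliers, the derived columns can be checked for internal consistency. The sketch below uses the first eight records as printed by `glimpse()` in section 4.1 and verifies that days present plus days absent equals working days, and that the stored attendance rate matches 100 × days present / working days. The data frame name `chk` and the tolerance are illustrative choices, not part of the original pipeline.

```r
# First eight records, copied from the glimpse() output in section 4.1
chk <- data.frame(
  working_days        = rep(22, 8),
  days_present        = c(20.2, 18.5, 20.5, 19.9, 18.6, 19.9, 16.3, 20.9),
  days_absent         = c(1.8, 3.5, 1.5, 2.1, 3.4, 2.1, 5.7, 1.1),
  attendance_rate_pct = c(91.8, 84.1, 93.2, 90.5, 84.5, 90.5, 74.1, 95.0)
)

tol <- 0.051  # the rate is stored to one decimal place

# Check 1: present + absent should equal working days
days_ok <- abs(chk$days_present + chk$days_absent - chk$working_days) < 1e-8

# Check 2: stored rate should equal 100 * present / working days
rate_ok <- abs(100 * chk$days_present / chk$working_days -
               chk$attendance_rate_pct) < tol

all(days_ok)
all(rate_ok)
```

Running the same two checks on the full `staff` data frame would confirm that the perfect correlations reported later arise by construction rather than from data entry errors.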

4.4 Distributions of key numeric variables

Code
staff |>
  select(attendance_rate_pct, performance_score,
         late_arrivals, training_hours, years_of_service) |>
  pivot_longer(everything()) |>
  ggplot(aes(value)) +
  geom_histogram(bins = 20, fill = "#2166ac", colour = "white", alpha = 0.85) +
  facet_wrap(~ name, scales = "free", ncol = 3) +
  labs(
    title    = "Distributions of key numeric variables",
    subtitle = "Attendance rate is left-skewed; late arrivals is right-skewed",
    x = NULL, y = "Count"
  )

Data quality issue 1: Attendance rate is left-skewed. Most staff cluster above 85%, but a tail of chronic absentees pulls the distribution downward; the IQR rule in section 4.3 flags two such records. These are genuine cases requiring HR intervention and are retained.

Data quality issue 2: Late arrivals is a bounded count (0 to 6) that is at most mildly right-skewed. The summary table in section 4.2 shows a mode of 2, and 17 staff (11%) recorded five or more late arrivals. This variable is used as a predictor on its raw scale in the regression.
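The degree of skew in late arrivals can be quantified directly from the frequency table in section 4.2. The sketch below rebuilds the 150 values from those counts and computes the moment coefficient of skewness by hand. The variable names are illustrative.

```r
# Late-arrival frequencies copied from the summary table in section 4.2
late  <- 0:6
count <- c(18, 20, 38, 30, 27, 14, 3)

x <- rep(late, times = count)        # rebuild the 150 observations
m <- mean(x)
s <- sqrt(mean((x - m)^2))           # population standard deviation
skew <- mean((x - m)^3) / s^3        # moment coefficient of skewness

length(x)        # 150
round(m, 2)      # 2.55
round(skew, 2)   # 0.07
```

A coefficient this close to zero indicates that the right tail is slight, which supports keeping the variable on its raw scale rather than transforming it.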

5 Technique 2 — Data Visualisation

A connected narrative: from overall attendance distribution by employment type, to departmental differences, to grade-level performance patterns, to the attendance-performance relationship, and finally to a summary heatmap.

5.1 Plot 1 — Attendance rate by employment type

Code
ggplot(staff, aes(x = attendance_rate_pct, fill = employment_type)) +
  geom_histogram(bins = 25, colour = "white", alpha = 0.85) +
  facet_wrap(~ employment_type, ncol = 3) +
  scale_fill_brewer(palette = "Set2", guide = "none") +
  labs(
    title    = "Plot 1 — Attendance rate distribution by employment type",
    subtitle = "Contract staff show a wider spread and lower central tendency",
    x = "Attendance rate (%)", y = "Count"
  )

Contract staff show a visibly wider and lower distribution than permanent staff. This pattern sets up the formal hypothesis test in Technique 3.

5.2 Plot 2 — Attendance rate by department

Code
staff |>
  mutate(department = fct_reorder(department, attendance_rate_pct, median)) |>
  ggplot(aes(x = department, y = attendance_rate_pct, fill = department)) +
  geom_boxplot(alpha = 0.85, show.legend = FALSE, outlier.colour = "grey50") +
  geom_hline(yintercept = 80, linetype = "dashed", colour = "red",
             linewidth = 0.8) +
  annotate("text", x = 1.4, y = 81.5, label = "80% threshold",
           colour = "red", size = 3.5) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip() +
  labs(
    title    = "Plot 2 — Attendance rate by department",
    subtitle = "Dashed line marks the 80% early-warning threshold",
    x = NULL, y = "Attendance rate (%)"
  )

Departments are sorted by median attendance. The red dashed line at 80% marks the proposed early-warning threshold — departments where a material share of staff fall below this line warrant priority HR attention.
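The "material share below the line" idea can be made concrete by computing, for each department, the proportion of staff under the 80% threshold. The sketch below uses a small hypothetical data frame (`toy`, not the study dataset) and base R only.

```r
# HYPOTHETICAL toy data for illustration, not the study dataset
toy <- data.frame(
  department = c("Finance", "Finance", "Finance", "Legal", "Legal",
                 "Legal", "ICT", "ICT", "ICT", "ICT"),
  attendance_rate_pct = c(91.0, 78.5, 88.2, 76.0, 79.9, 85.4,
                          93.1, 90.0, 81.2, 95.5)
)

threshold <- 80  # proposed early-warning threshold (%)

# Share of staff in each department below the threshold
share_below <- tapply(toy$attendance_rate_pct < threshold,
                      toy$department, mean)

# Departments ranked from highest to lowest share below the threshold
sort(round(100 * share_below, 1), decreasing = TRUE)
```

Applied to the real `staff` data, the same ranking would give HR a simple priority list to accompany the boxplot.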

5.3 Plot 3 — Performance score by grade level

Code
ggplot(staff, aes(x = grade_level, y = performance_score,
                  fill = grade_level)) +
  geom_boxplot(alpha = 0.85, show.legend = FALSE) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title    = "Plot 3 — Performance score by grade level",
    subtitle = "Senior grades (GL 12-14) consistently score higher",
    x = "Grade level", y = "Performance score (1-5)"
  )

Performance scores rise with grade level. GL 12 and GL 14 staff cluster around 3.5-4.5 while GL 04 officers frequently score below 2.5. This gradient warrants investigation of whether lower grades receive adequate supervisory support and training investment.

5.4 Plot 4 — Attendance rate vs performance score

Code
ggplot(staff, aes(x = attendance_rate_pct, y = performance_score,
                  colour = employment_type)) +
  geom_point(alpha = 0.65, size = 2) +
  geom_smooth(method = "lm", se = FALSE, colour = "grey30",
              linewidth = 0.8) +
  scale_colour_brewer(palette = "Set2") +
  labs(
    title    = "Plot 4 — Attendance rate vs performance score",
    subtitle = "Higher attendance is associated with higher performance",
    x = "Attendance rate (%)", y = "Performance score (1-5)",
    colour   = "Employment type"
  )

The positive relationship between attendance and performance is visible across all employment types. Contract staff cluster at lower attendance and lower performance — reinforcing the case for targeted intervention.

5.5 Plot 5 — Mean attendance heatmap by location and grade level

Code
staff |>
  group_by(location, grade_level) |>
  summarise(mean_att = mean(attendance_rate_pct, na.rm = TRUE),
            n = n(), .groups = "drop") |>
  ggplot(aes(x = grade_level, y = location, fill = mean_att)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = sprintf("%.0f%%\n(n=%d)", mean_att, n)),
            colour = "white", size = 3) +
  scale_fill_gradient2(low = "#d73027", mid = "#ffffbf", high = "#1a9850",
                       midpoint = 88,
                       labels = label_percent(scale = 1)) +
  labs(
    title    = "Plot 5 — Mean attendance rate by location and grade level",
    subtitle = "Red = below average; green = above average",
    x = "Grade level", y = NULL, fill = "Mean attendance"
  )

The heatmap identifies specific location-grade combinations driving underperformance. Red cells represent priority targets for HR intervention.

6 Technique 3 — Hypothesis Testing

6.1 Theory recap

A hypothesis test formalises a comparison between a null hypothesis (H0) and an alternative (H1). The p-value is the probability of observing data at least as extreme as ours if H0 were true; it is not the probability that H0 is true. A p-value below alpha = 0.05 leads to rejection of H0. Effect sizes (Cohen’s d, epsilon-squared) measure practical magnitude independently of sample size. Where normality assumptions are violated, non-parametric alternatives are preferred.

6.2 Business justification

Two hypotheses correspond to live policy debates in the MDA. The first — whether contract staff genuinely attend less than permanent staff — determines whether the employment-type distinction warrants differentiated HR policy. The second — whether attendance differs by grade level — determines whether junior-grade officers need targeted support programmes.

6.3 Hypothesis 1 — Do contract staff attend less than permanent staff?

H0: Mean attendance rate for contract staff equals mean attendance rate for permanent staff. H1: Mean attendance rate for contract staff is lower than for permanent staff. Test: Welch two-sample t-test (one-tailed). Alpha = 0.05.

Code
perm     <- staff |> filter(employment_type == "Permanent") |>
            pull(attendance_rate_pct)
contract <- staff |> filter(employment_type == "Contract")  |>
            pull(attendance_rate_pct)

shapiro.test(perm)

    Shapiro-Wilk normality test

data:  perm
W = 0.9652, p-value = 0.009108
Code
shapiro.test(contract)

    Shapiro-Wilk normality test

data:  contract
W = 0.9725, p-value = 0.5904
Code
t_result <- t.test(contract, perm, alternative = "less", var.equal = FALSE)
print(t_result)

    Welch Two Sample t-test

data:  contract and perm
t = -2.7155, df = 46.444, p-value = 0.004632
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf -1.64641
sample estimates:
mean of x mean of y 
 84.66452  88.97525 
Code
pooled_sd <- sqrt(((length(perm)-1)*var(perm) +
                   (length(contract)-1)*var(contract)) /
                  (length(perm) + length(contract) - 2))
cohens_d  <- (mean(contract) - mean(perm)) / pooled_sd

cat(sprintf("\nMean attendance — Permanent: %.1f%% | Contract: %.1f%%\n",
            mean(perm), mean(contract)))

Mean attendance — Permanent: 89.0% | Contract: 84.7%
Code
cat(sprintf("Difference: %.1f percentage points\n",
            mean(contract) - mean(perm)))
Difference: -4.3 percentage points
Code
cat(sprintf("Cohen's d: %.3f\n", cohens_d))
Cohen's d: -0.585

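Result: The Welch test gives t = -2.72 (df = 46.4) with a one-sided p-value of 0.0046, so we reject H0 at alpha = 0.05: contract staff attend significantly less than permanent staff (84.7% versus 89.0%, a gap of 4.3 percentage points). Cohen’s d = -0.585 indicates a medium effect by conventional benchmarks. One caveat: the Shapiro-Wilk test rejects normality for the permanent group (p = 0.009), so the t-test relies on its robustness to moderate non-normality rather than on strict normality.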

Business interpretation: The difference in attendance between contract and permanent staff is statistically significant and practically meaningful. This justifies embedding attendance targets into contract renewal criteria and introducing a minimum 85% attendance clause with supervisory review triggered at 80%.
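Because the Shapiro-Wilk output above rejects normality for the permanent group, a rank-based test is a natural robustness check on the Welch result. The sketch below shows the form such a check could take with `wilcox.test`. The data are simulated stand-ins: the group sizes, means, and standard deviations are assumptions chosen for illustration, so the p-values printed here say nothing about the study itself.

```r
set.seed(2024)

# SIMULATED stand-ins for the two groups (NOT the study data).
# Group sizes, means and SDs are illustrative assumptions.
perm_sim     <- rnorm(100, mean = 89, sd = 7)
contract_sim <- rnorm(31,  mean = 85, sd = 7)

# One-sided Wilcoxon rank-sum (Mann-Whitney) test:
# H1 is that contract attendance is shifted below permanent attendance
w <- wilcox.test(contract_sim, perm_sim, alternative = "less")
w$p.value

# Welch t-test on the same simulated data, for side-by-side comparison
tt <- t.test(contract_sim, perm_sim, alternative = "less", var.equal = FALSE)
tt$p.value
```

With the real vectors `contract` and `perm` in place of the simulated ones, agreement between the two tests would indicate that the conclusion does not hinge on the normality assumption.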

6.4 Hypothesis 2 — Does attendance differ across grade levels?

H0: Median attendance rate is identical across all grade levels. H1: At least one grade level has a different median attendance rate. Test: Kruskal-Wallis (non-parametric). Alpha = 0.05.

Code
staff |>
  group_by(grade_level) |>
  summarise(
    n         = n(),
    mean_att  = round(mean(attendance_rate_pct, na.rm = TRUE), 1),
    sd_att    = round(sd(attendance_rate_pct, na.rm = TRUE), 1),
    shapiro_p = round(shapiro.test(attendance_rate_pct)$p.value, 4),
    .groups   = "drop"
  ) |>
  gt() |>
  tab_header(title = "Attendance rate by grade level — descriptives and normality")
Attendance rate by grade level — descriptives and normality
grade_level n mean_att sd_att shapiro_p
GL 04 38 84.4 7.7 0.6684
GL 06 32 86.0 6.8 0.4582
GL 08 34 87.3 6.7 0.5230
GL 10 22 91.2 6.3 0.1426
GL 12 19 92.6 4.8 0.3269
GL 14 5 96.8 5.7 0.0038
Code
kw <- kruskal.test(attendance_rate_pct ~ grade_level, data = staff)
print(kw)

    Kruskal-Wallis rank sum test

data:  attendance_rate_pct by grade_level
Kruskal-Wallis chi-squared = 30.537, df = 5, p-value = 1.156e-05
Code
effectsize::rank_epsilon_squared(attendance_rate_pct ~ grade_level,
                                 data = staff)
Epsilon2 (rank) |       95% CI
------------------------------
0.20            | [0.14, 1.00]

- One-sided CIs: upper bound fixed at [1.00].
Code
staff |>
  rstatix::dunn_test(attendance_rate_pct ~ grade_level,
                     p.adjust.method = "bonferroni") |>
  select(group1, group2, n1, n2, statistic, p, p.adj, p.adj.signif) |>
  gt() |>
  tab_header(title = "Post-hoc Dunn test (Bonferroni adjusted)")
Post-hoc Dunn test (Bonferroni adjusted)
group1 group2 n1 n2 statistic p p.adj p.adj.signif
GL 04 GL 06 38 32 0.8340352 4.042612e-01 1.0000000000 ns
GL 04 GL 08 38 34 1.6168475 1.059112e-01 1.0000000000 ns
GL 04 GL 10 38 22 3.3238018 8.879929e-04 0.0133198939 *
GL 04 GL 12 38 19 4.0923977 4.269356e-05 0.0006404034 ***
GL 04 GL 14 38 5 3.4554866 5.493004e-04 0.0082395062 **
GL 06 GL 08 32 34 0.7372227 4.609869e-01 1.0000000000 ns
GL 06 GL 10 32 22 2.4925887 1.268157e-02 0.1902234768 ns
GL 06 GL 12 32 19 3.2792801 1.040723e-03 0.0156108407 *
GL 06 GL 14 32 5 3.0022959 2.679515e-03 0.0401927323 *
GL 08 GL 10 34 22 1.8593939 6.297132e-02 0.9445698056 ns
GL 08 GL 12 34 19 2.6818937 7.320672e-03 0.1098100855 ns
GL 08 GL 14 34 5 2.6352051 8.408647e-03 0.1261296983 ns
GL 10 GL 12 22 19 0.8283180 4.074904e-01 1.0000000000 ns
GL 10 GL 14 22 5 1.5207299 1.283276e-01 1.0000000000 ns
GL 12 GL 14 19 5 0.9828454 3.256835e-01 1.0000000000 ns

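Result: The Kruskal-Wallis test gives chi-squared = 30.54 (df = 5, p < 0.001), so we reject H0: attendance differs significantly across grade levels. The rank epsilon-squared of 0.20 indicates a sizeable effect. The Bonferroni-adjusted Dunn test locates the differences between the junior and senior ends of the scale: GL 04 differs from GL 10, GL 12, and GL 14, and GL 06 differs from GL 12 and GL 14. No pair of adjacent grades differs significantly. The GL 14 comparisons rest on only five staff and should be read with caution.

Business interpretation: Mean attendance rises steadily from 84.4% at GL 04 to 96.8% at GL 14, so attendance problems are concentrated in the early-career cohort. HR should review induction, mentoring, and whether junior-grade staff face transport or welfare barriers that senior staff do not.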
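The adjusted p-values in the Dunn table follow from simple arithmetic: six grade bands give 15 pairwise comparisons, and the Bonferroni method multiplies each raw p-value by 15, capping the result at 1. The sketch below reproduces three rows of the table with base R's `p.adjust`. The row labels are shorthand introduced here.

```r
# Unadjusted p-values copied from three rows of the Dunn table above
p_raw <- c("GL04-GL10" = 8.879929e-04,
           "GL04-GL12" = 4.269356e-05,
           "GL06-GL10" = 1.268157e-02)

n_pairs <- choose(6, 2)   # 6 grade bands give 15 pairwise comparisons

# Bonferroni: multiply by the number of comparisons, cap at 1
p_adj <- p.adjust(p_raw, method = "bonferroni", n = n_pairs)
p_adj
p_adj < 0.05
```

The GL 06 versus GL 10 comparison illustrates the cost of the correction: its raw p-value of 0.013 would pass a single test at 0.05, but it is not significant once all 15 comparisons are accounted for.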

7 Technique 4 — Correlation Analysis

7.1 Theory recap

Pearson’s r measures linear correlation under approximate normality. Spearman’s rho measures monotonic correlation on ranks — more appropriate for skewed distributions. Partial correlation isolates the relationship between two variables after removing the influence of a third. Correlation does not establish causation.
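Partial correlation and Spearman's rho can both be computed with base R. The sketch below uses simulated data (not the study dataset) in which a third variable `z` drives both `x` and `y`. It shows two equivalent routes to the partial correlation, and checks that Spearman's rho is Pearson's r applied to ranks.

```r
set.seed(1)

# SIMULATED illustration (NOT the study data): z drives both x and y
z <- rnorm(200)
x <- 0.7 * z + rnorm(200)
y <- 0.7 * z + rnorm(200)

# Route 1: correlate the residuals after regressing each variable on z
r_partial_resid <- cor(resid(lm(x ~ z)), resid(lm(y ~ z)))

# Route 2: closed-form expression from the three pairwise correlations
r_xy <- cor(x, y); r_xz <- cor(x, z); r_yz <- cor(y, z)
r_partial_formula <- (r_xy - r_xz * r_yz) /
  sqrt((1 - r_xz^2) * (1 - r_yz^2))

c(raw = r_xy, partial = r_partial_resid)

# Spearman's rho is Pearson's r computed on ranks
rho_builtin <- cor(x, y, method = "spearman")
rho_manual  <- cor(rank(x), rank(y))
all.equal(rho_builtin, rho_manual)
```

In the attendance setting, `z` might be grade level, with the partial correlation indicating how much of the attendance-performance association remains once grade is held fixed.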

7.2 Business justification

The key question is which variables genuinely co-vary with attendance after accounting for grade and employment type effects. If late arrivals and training hours both correlate with attendance, they can serve as earlier warning signals that trigger supervisory action before formal disciplinary processes are needed.

7.3 Correlation matrix and heatmap

Code
num_vars <- staff |>
  select(years_of_service, days_present, days_absent,
         attendance_rate_pct, late_arrivals,
         training_hours, performance_score)

cor_mat <- cor(num_vars, method = "pearson", use = "complete.obs")

ggcorrplot(cor_mat,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 3,
           colors   = c("#d73027", "white", "#1a9850"),
           title    = "Pearson correlation matrix — staff attendance variables")

Code
cor_mat |>
  as.data.frame() |>
  rownames_to_column("variable") |>
  gt() |>
  fmt_number(where(is.numeric), decimals = 2) |>
  tab_header(title = "Pearson correlation matrix (full coefficients)")
Pearson correlation matrix (full coefficients)
variable years_of_service days_present days_absent attendance_rate_pct late_arrivals training_hours performance_score
years_of_service 1.00 0.01 −0.01 0.01 −0.04 0.03 0.00
days_present 0.01 1.00 −1.00 1.00 0.01 0.03 0.53
days_absent −0.01 −1.00 1.00 −1.00 −0.01 −0.03 −0.53
attendance_rate_pct 0.01 1.00 −1.00 1.00 0.01 0.03 0.53
late_arrivals −0.04 0.01 −0.01 0.01 1.00 0.19 0.00
training_hours 0.03 0.03 −0.03 0.03 0.19 1.00 −0.01
performance_score 0.00 0.53 −0.53 0.53 0.00 −0.01 1.00
Code
cor_df <- as.data.frame(as.table(cor_mat)) |>
  filter(Var1 != Var2) |>
  mutate(abs_r = abs(Freq)) |>
  arrange(desc(abs_r)) |>
  distinct(abs_r, .keep_all = TRUE) |>
  head(6)

cor_df |>
  select(Variable1 = Var1, Variable2 = Var2, Pearson_r = Freq) |>
  mutate(Pearson_r = round(Pearson_r, 3)) |>
  gt() |>
  tab_header(title = "Top 6 pairwise correlations by absolute value")
Top 6 pairwise correlations by absolute value
Variable1 Variable2 Pearson_r
days_absent days_present -1.000
attendance_rate_pct days_present 1.000
performance_score days_absent -0.534
performance_score days_present 0.534
performance_score attendance_rate_pct 0.534
training_hours late_arrivals 0.191

7.4 Plain-language interpretation

The three strongest correlations and their HR policy implications:

1. Attendance rate and performance score (positive, r = 0.53): The strongest non-trivial relationship in the matrix. Staff who attend more frequently also tend to score higher in annual appraisals. This suggests that attendance and performance should be managed together rather than in isolation, although the correlation alone cannot show that improving attendance would raise performance.

2. Attendance rate and days absent (negative, r = -1.00, by construction): Days absent is the arithmetic complement of days present, so a perfect negative correlation is expected and confirms data integrity. For the same reason, days present and attendance rate correlate at r = 1.00.

3. Training hours and late arrivals (weak positive, r = 0.19): The only other correlation that departs noticeably from zero. Contrary to expectation, late arrivals are essentially uncorrelated with attendance rate (r = 0.01) and with performance score (r = 0.00), so in this dataset late arrivals do not act as an early-warning signal for absence. Any late-arrival trigger should therefore be justified on punctuality grounds in its own right rather than as a predictor of chronic absence.

Correlation does not establish causation. A staff member may both attend less and score lower because of an underlying personal circumstance causing both outcomes simultaneously. HR should conduct structured welfare conversations to identify root causes before prescribing interventions.

8 Technique 5 — Regression Analysis

8.1 Theory recap

OLS regression models the conditional mean of a continuous outcome as a linear function of predictors. Each coefficient estimates the change in the outcome for a one-unit increase in that predictor, holding all others constant. Diagnostic plots assess four key assumptions: linearity, homoscedasticity, normality of residuals, and independence. VIF values detect multicollinearity.

8.2 Business justification

Regression answers the conditional policy question: which factors predict attendance after controlling for all others? If employment type remains significant after controlling for grade level and location, then employment type is an independent risk factor warranting its own policy response — not just a proxy for junior grades having more contract staff.

8.3 OLS regression model

Code
model <- lm(
  attendance_rate_pct ~ employment_type + grade_level + department +
                        gender + location + years_of_service +
                        late_arrivals + training_hours,
  data = staff
)

broom::tidy(model, conf.int = TRUE) |>
  mutate(
    across(where(is.numeric), ~ round(.x, 3)),
    signif = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      p.value < 0.1   ~ ".",
      TRUE            ~ ""
    )
  ) |>
  gt() |>
  tab_header(
    title    = "OLS regression — attendance rate (%) model",
    subtitle = "Signif. codes: *** <.001  ** <.01  * <.05"
  )
OLS regression — attendance rate (%) model
Signif. codes: *** <.001 ** <.01 * <.05
term estimate std.error statistic p.value conf.low conf.high signif
(Intercept) 93.578 2.073 45.147 0.000 89.477 97.679 ***
employment_typeContract -3.872 1.240 -3.123 0.002 -6.325 -1.419 **
employment_typeSecondment -3.821 1.565 -2.441 0.016 -6.917 -0.725 *
grade_level.L 9.325 1.848 5.047 0.000 5.670 12.981 ***
grade_level.Q 0.970 1.746 0.555 0.580 -2.485 4.424
grade_level.C 1.090 1.474 0.740 0.461 -1.825 4.006
grade_level^4 0.465 1.307 0.355 0.723 -2.122 3.051
grade_level^5 1.405 1.186 1.184 0.238 -0.942 3.752
departmentFinance -0.674 1.474 -0.457 0.648 -3.591 2.243
departmentHR -3.798 2.151 -1.766 0.080 -8.053 0.457 .
departmentICT 3.759 1.910 1.968 0.051 -0.020 7.537 .
departmentLegal -4.741 1.818 -2.608 0.010 -8.338 -1.145 *
departmentOperations -3.257 1.502 -2.169 0.032 -6.229 -0.285 *
departmentProcurement 2.944 1.923 1.531 0.128 -0.860 6.748
genderMale 0.062 1.045 0.059 0.953 -2.005 2.129
locationLagos Office -5.463 1.208 -4.523 0.000 -7.853 -3.073 ***
locationPort Harcourt Office -5.034 1.474 -3.415 0.001 -7.951 -2.118 **
locationKano Office 0.149 1.578 0.094 0.925 -2.973 3.271
years_of_service -0.014 0.064 -0.223 0.824 -0.141 0.112
late_arrivals 0.158 0.329 0.481 0.631 -0.493 0.810
training_hours 0.072 0.162 0.441 0.660 -0.249 0.393
Code
broom::glance(model) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, nobs) |>
  gt() |>
  fmt_number(where(is.numeric), decimals = 3) |>
  tab_header(title = "Model fit statistics")
Model fit statistics
r.squared adj.r.squared sigma statistic p.value nobs
0.484 0.404 5.712 6.058 0.000 150.000
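The terms `grade_level.L`, `.Q`, `.C`, `^4`, and `^5` in the coefficient table arise because `grade_level` was declared as an ordered factor in section 4.1, and `lm` codes ordered factors with orthogonal polynomial contrasts by default. The sketch below shows how those coefficients translate back into per-grade effects. The coefficient values are copied from the table above; the object names are illustrative.

```r
# Orthogonal polynomial contrasts for a 6-level ordered factor
cp <- contr.poly(6)
rownames(cp) <- c("GL 04", "GL 06", "GL 08", "GL 10", "GL 12", "GL 14")
round(cp[, ".L"], 3)   # linear scores, equal to c(-5,-3,-1,1,3,5)/sqrt(70)

# Coefficients for the five grade terms, copied from the table above
b <- c(.L = 9.325, .Q = 0.970, .C = 1.090, `^4` = 0.465, `^5` = 1.405)

# Adjusted contribution of grade to predicted attendance, per grade band
grade_effect <- drop(cp %*% b)

# Implied adjusted gap between the top and bottom grade bands
linear_gap <- b[[".L"]] * (cp["GL 14", ".L"] - cp["GL 04", ".L"])
full_gap   <- grade_effect[["GL 14"]] - grade_effect[["GL 04"]]
round(c(linear_only = linear_gap, all_terms = full_gap), 1)
```

Because only the linear term is significant, the grade effect is well summarised as a steady increase in attendance from the lowest to the highest band. Refitting with an unordered factor would instead report each band's difference from GL 04 directly, which some readers may find easier to interpret.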

8.4 Regression diagnostics

Code
par(mfrow = c(2, 2))
plot(model)

Code
par(mfrow = c(1, 1))
Code
vif_vals <- car::vif(model)
as.data.frame(vif_vals) |>
  rownames_to_column("Term") |>
  gt() |>
  fmt_number(where(is.numeric), decimals = 2) |>
  tab_header(
    title    = "Variance Inflation Factors",
    subtitle = "VIF above 5 signals multicollinearity concern"
  )
Variance Inflation Factors
VIF above 5 signals multicollinearity concern
Term GVIF Df GVIF^(1/(2*Df))
employment_type 1.29 2.00 1.07
grade_level 1.64 5.00 1.05
department 1.87 6.00 1.05
gender 1.22 1.00 1.11
location 1.51 3.00 1.07
years_of_service 1.14 1.00 1.07
late_arrivals 1.19 1.00 1.09
training_hours 1.13 1.00 1.06
Code
lmtest::bptest(model)

    studentized Breusch-Pagan test

data:  model
BP = 13.791, df = 20, p-value = 0.8409
Code
lmtest::coeftest(model, vcov. = sandwich::vcovHC(model, type = "HC3"))

t test of coefficients:

                              Estimate Std. Error t value  Pr(>|t|)    
(Intercept)                  93.578440   2.054495 45.5481 < 2.2e-16 ***
employment_typeContract      -3.871708   1.255436 -3.0840  0.002499 ** 
employment_typeSecondment    -3.820862   1.737662 -2.1989  0.029671 *  
grade_level.L                 9.325421   1.446276  6.4479 2.075e-09 ***
grade_level.Q                 0.969542   1.407098  0.6890  0.492037    
grade_level.C                 1.090113   1.358976  0.8022  0.423937    
grade_level^4                 0.464686   1.281528  0.3626  0.717495    
grade_level^5                 1.404956   1.258735  1.1162  0.266426    
departmentFinance            -0.674013   1.595823 -0.4224  0.673465    
departmentHR                 -3.798023   2.457516 -1.5455  0.124682    
departmentICT                 3.758655   2.071703  1.8143  0.071958 .  
departmentLegal              -4.741274   1.937563 -2.4470  0.015750 *  
departmentOperations         -3.257085   1.480571 -2.1999  0.029596 *  
departmentProcurement         2.943616   1.758894  1.6736  0.096640 .  
genderMale                    0.062132   1.086006  0.0572  0.954465    
locationLagos Office         -5.462703   1.350706 -4.0443 8.976e-05 ***
locationPort Harcourt Office -5.034108   1.541977 -3.2647  0.001403 ** 
locationKano Office           0.149062   1.609366  0.0926  0.926348    
years_of_service             -0.014221   0.061724 -0.2304  0.818144    
late_arrivals                 0.158427   0.341548  0.4639  0.643536    
training_hours                0.071543   0.180345  0.3967  0.692242    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

8.5 Plain-language interpretation

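Model fit: The model explains approximately 48% of the variation in annual attendance rate (R² = 0.484, adjusted R² = 0.404), and the overall F-test is significant (F = 6.058, p < 0.001). This is a moderate fit, reasonable for a behavioural HR outcome influenced by individual circumstances, but it leaves roughly half of the variation unexplained.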
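
The fit statistics in the "Model fit statistics" table can be cross-checked by hand. The sketch below recomputes adjusted R² and the overall F statistic from the rounded R², assuming 20 slope coefficients (the degrees of freedom reported by the Breusch-Pagan test). Because the input is rounded, the results should only approximately match the table.

```r
# Values copied from the "Model fit statistics" table
r2 <- 0.484   # r.squared (rounded)
n  <- 150     # nobs
p  <- 20      # slope coefficients; assumed from the Breusch-Pagan df

# Adjusted R-squared penalises the fit for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic for H0: all slope coefficients are zero
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))

round(c(adj_r2 = adj_r2, f_stat = f_stat), 3)
```

Both values should land close to the reported 0.404 and 6.058, which suggests the table is internally consistent.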

Employment type (Contract vs Permanent): The contract coefficient is negative and significant (estimate -3.87, HC3 robust p = 0.002), confirming the hypothesis test finding. Holding grade level, location, and all other variables constant, a contract employee's attendance rate is approximately 3.9 percentage points lower than that of an equivalent permanent employee. Seconded staff show a similar gap (estimate -3.82, p = 0.030). Business action: attendance targets should be embedded explicitly in contract terms with an 85% minimum attendance clause and supervisory review triggered at 80%.
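
To show the uncertainty around the contract effect, a 95% confidence interval can be built from the HC3 robust standard error. This is a minimal sketch using the figures printed in the robust coefficient table. The residual degrees of freedom (150 observations minus 21 estimated coefficients) are inferred from that table, not taken from the model object.

```r
est <- -3.871708        # employment_typeContract estimate
se  <- 1.255436         # HC3 robust standard error
df_resid <- 150 - 21    # assumed: observations minus coefficients (incl. intercept)

# Two-sided 95% interval based on the t distribution
t_crit <- qt(0.975, df = df_resid)
ci <- est + c(-1, 1) * t_crit * se

round(ci, 2)
```

The interval excludes zero, in line with the reported p-value. Its width is a reminder that the gap is estimated with limited precision.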

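Grade level: Grade level enters the model as an ordered factor with polynomial contrasts. The linear contrast is positive and highly significant (estimate 9.33, p < 0.001), while the quadratic and higher-order terms are not, so attendance rises roughly steadily with grade after controlling for department and employment type. Junior grades (GL 04 and GL 06) therefore sit at the low end of predicted attendance. Business action: introduce a Junior Staff Attendance Support Programme targeting GL 04-06 officers in their first two years, covering transport allowance review, flexible hours piloting, and structured mentoring.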
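
The `grade_level.L` coefficient is on the scale of R's orthogonal polynomial contrasts, so it is not itself a percentage-point gap. The sketch below converts it into an approximate difference between the lowest and highest grade. It assumes six equally spaced ordered levels (inferred from the five degrees of freedom reported for `grade_level`) and ignores the non-significant higher-order terms.

```r
lin_est <- 9.325421                # grade_level.L estimate (HC3 table)

# Linear contrast scores R assigns to six ordered, equally spaced levels
lin_scores <- contr.poly(6)[, 1]

# Linear-trend contribution to the gap between the top and bottom grade
gap <- lin_est * (lin_scores[6] - lin_scores[1])

round(gap, 1)
```

Under these assumptions, the linear trend alone implies a gap of roughly 11 percentage points between the most junior and most senior grade, other variables held constant.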

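Late arrivals: The late-arrivals coefficient is small, positive, and not statistically significant (estimate 0.158, p = 0.631). Once employment type, grade, department, and location are controlled for, the model provides no evidence that late arrivals independently predict attendance. Any link between the two rests on the bivariate correlation analysis rather than on this regression. Business action: a three-strikes late-arrival trigger generating an automatic welfare check remains a reasonable precaution, but it should be justified by the correlation evidence and piloted before wider roll-out.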

Location: Staff at the Lagos and Port Harcourt offices attend about 5 percentage points less than comparable staff at Abuja HQ, the reference location (estimates -5.46 and -5.03, both p < 0.01), while the Kano Office shows no difference. These effects are at least as large as the contract effect and merit follow-up, although the recommendations in this study do not target them.

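Diagnostic plots: The Residuals vs Fitted plot shows approximately random scatter, supporting the linear specification. The Q-Q plot indicates approximate normality of residuals. The studentized Breusch-Pagan test (BP = 13.791, df = 20, p = 0.841) gives no evidence of heteroskedasticity, and HC3 robust standard errors are reported as a further safeguard. All GVIF values are below 2, far under the threshold of 5, indicating no harmful multicollinearity.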
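
For factors with more than one degree of freedom, `car::vif()` reports a generalised VIF. The usual way to compare it with a conventional VIF threshold is to square the `GVIF^(1/(2*Df))` column. The sketch below does this with the rounded values from the table above, so the adjusted figures may differ from the printed ones in the second decimal place.

```r
# Rounded values copied from the "Variance Inflation Factors" table
gvif <- data.frame(
  term = c("employment_type", "grade_level", "department", "gender",
           "location", "years_of_service", "late_arrivals", "training_hours"),
  gvif = c(1.29, 1.64, 1.87, 1.22, 1.51, 1.14, 1.19, 1.13),
  df   = c(2, 5, 6, 1, 3, 1, 1, 1)
)

# Df-adjusted GVIF, then squared to put it on the ordinary VIF scale
gvif$adj    <- gvif$gvif^(1 / (2 * gvif$df))
gvif$adj_sq <- gvif$adj^2

# TRUE if every term is under the conventional threshold of 5
all(gvif$adj_sq < 5)
```

On this scale every term stays far below 5, so the conclusion does not depend on which version of the statistic is read.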

9 Integrated Findings

9.1 How the five analyses connect

The five analyses form a coherent analytical chain. EDA established the data quality baseline and identified the left-skewed attendance distribution and right-skewed late arrivals — governing all subsequent technique choices. Visualisation surfaced the patterns that matter operationally: contract staff show wider attendance spread, junior grades underperform on performance scores, and the attendance-performance link is visible across all employment types. Hypothesis testing confirmed with statistical rigour that both the employment-type and grade-level attendance gaps are not attributable to sampling noise. Correlation analysis revealed that late arrivals are an early symptom of chronic absenteeism, and that the attendance-performance link is strong enough to treat both outcomes as part of a single intervention. Regression isolated the independent contribution of each factor, confirming that employment type and grade level are genuine, independent predictors of attendance — not proxies for each other.

9.2 The single actionable recommendation

On the basis of these five analyses, I recommend that the HR Directorate implement a two-track Attendance Improvement Programme before the start of the next fiscal year. Track 1 (Contract Staff Protocol): all contract staff should have an 85% minimum attendance clause inserted at next renewal, with an automated 80% early-warning trigger generating a supervisory welfare check. Track 2 (Junior Grade Support Programme): all GL 04 and GL 06 officers in their first two years should be enrolled in a structured mentoring scheme, receive a transport allowance review, and have quarterly (not annual) attendance reviews with their supervisors. Together, these two tracks address two of the strongest independent predictors identified by the regression model and are expected to lift overall MDA attendance toward the 92% benchmark for comparable federal agencies.
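
The 85% minimum and 80% early-warning thresholds can be expressed as a simple flagging rule. This is an illustrative sketch only: the function name `flag_attendance`, the column `attendance_rate`, and the three example rows are hypothetical and are not taken from the HRMIS extract.

```r
# Hypothetical helper: label each staff member against the two thresholds
flag_attendance <- function(staff, minimum = 85, warning = 80) {
  staff$status <- ifelse(
    staff$attendance_rate < warning, "welfare check",
    ifelse(staff$attendance_rate < minimum, "below minimum", "on track")
  )
  staff
}

# Made-up rows for illustration; not real staff records
example_staff <- data.frame(
  staff_id        = c("A01", "A02", "A03"),
  attendance_rate = c(91.5, 83.0, 76.2)
)

flagged <- flag_attendance(example_staff)
flagged
```

Keeping the thresholds as function arguments would let the HR Directorate adjust them without changing the rule itself.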

10 Limitations and Further Work

  • Single year of data: The 2024 dataset does not allow trend analysis. With three or more years of panel data, a fixed-effects regression could isolate the causal effect of policy changes on attendance.
  • Excluded mid-year leavers: Staff who resigned or retired in 2024 were excluded to ensure full-year comparability. If leavers had systematically lower attendance, the true organisation-wide attendance rate is lower than reported.
  • Unobserved variables: Commute distance, childcare responsibilities, health status, and manager quality are not in the HRMIS extract but are known drivers of attendance in the public sector literature.
  • Causality: The attendance-performance correlation is associative. A randomised welfare intervention trial would establish a causal effect and justify scaling the programme organisation-wide.

11 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.6). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Olugbile, B. (2026). Staff attendance dataset — Federal MDA, 2024 fiscal year [Dataset]. Extracted from HR Management Information System, Abuja, Nigeria, with permission of the HR Director and Permanent Secretary. Data available on request from the author.

12 Appendix — AI Usage Statement

Claude (Anthropic, claude.ai) was used to assist this study in two ways. First, it helped audit the structure of the data extract, identify column-naming discrepancies, and generate corrected R code for the data cleaning pipeline. Second, it drafted boilerplate code for the visualisation, hypothesis testing, correlation and regression sections, which I reviewed and verified against the actual dataset and organisational context. All analytical decisions — the choice of case study, the two hypotheses tested, the regression model specification, the interpretation of every result, and the two-track policy recommendation — are my own, made in line with my independent analytical judgement. The dataset was obtained with the written permission of the HR Director and Permanent Secretary of the MDA. The AI was used as a coding and editing assistant; no AI-generated interpretation appears in this document without my independent review and validation.