Staff Attendance Analytics in the Nigerian Public Sector

An Exploratory and Inferential Study of Workforce Attendance Patterns

Author

Bankole Olugbile

Published

May 19, 2026

Abstract

This study analyses staff attendance patterns across a Nigerian Ministerial Department and Agency (MDA) covering 150 employees observed over the 2024 fiscal year. Applying five analytical techniques — exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and linear regression — the study identifies the departmental, grade-level, and employment-type characteristics most strongly associated with attendance rates and performance outcomes. Key findings indicate that grade level and employment type are significant predictors of attendance, with contract staff and junior grade officers showing materially lower attendance rates than permanent senior staff. The study recommends targeted attendance-improvement interventions for contract staff and junior grades, and the adoption of an early-warning dashboard linking attendance rates to performance scores.

1 Executive Summary

This study investigates the drivers of staff attendance in a Nigerian public sector Ministerial Department and Agency (MDA) using records for 150 employees across the 2024 fiscal year. Data was extracted from the HR management information system and covers seven departments and four office locations (Abuja HQ, Lagos Office, Port Harcourt Office, and Kano Office). Exploratory analysis reveals that mean attendance across the organisation stands at approximately 88%, with meaningful variation by department, grade level, and employment type. Hypothesis testing confirms that contract staff attend significantly less frequently than permanent staff, and that attendance rates differ significantly across grade levels. Correlation analysis shows that attendance rate is positively associated with performance score (r = 0.53) but essentially uncorrelated with late arrivals (r = 0.01). The OLS regression model explains approximately 48% of the variance in attendance rate (adjusted R-squared 0.40), with grade level, location, and employment type as the strongest independent predictors. The study recommends that HR introduce a structured attendance-improvement programme targeting contract staff and GL 04-06 officers, and embed an attendance early-warning trigger at 80% to prompt supervisory intervention before performance deteriorates.

2 Professional Disclosure

2.1 Role and organisational context

This study was conducted in close collaboration with the HR Director of a Nigerian federal Ministerial Department and Agency (MDA) headquartered in Abuja, with satellite offices in Lagos, Port Harcourt, and Kano. As a senior associate with direct professional access to the organisation, I was granted permission by the HR Director to extract and analyse the workforce attendance dataset for academic purposes. The HR Director provided contextual validation of all findings, confirmed the operational relevance of each analytical technique to their day-to-day responsibilities, and gave written approval for the dataset to be used in this submission.

The HR Directorate is accountable for workforce planning, attendance monitoring, performance management, and staff welfare across all grade levels from GL 04 to GL 14. Monthly attendance reports are reviewed by the HR Director, who recommends disciplinary actions, presents workforce analytics to the Permanent Secretary, and advises on staff rationalisation and retention policy. The five analytical techniques in this study map directly to decisions made within that function.

2.2 Operational relevance of the five techniques

Exploratory Data Analysis: Before every quarterly workforce review the HR Directorate conducts a portfolio scan of the staff register — identifying chronic absentees, departments with deteriorating attendance, and grade levels with outlier sick-leave consumption. EDA formalises this scan and ensures findings are evidence-based rather than anecdotal.

Data Visualisation: Monthly HR reports to the Permanent Secretary are communicated through charts. Bar charts of departmental attendance rates, boxplots of grade-level performance scores, and scatter plots linking attendance to performance are the standard artefacts produced. The five visualisations in this study mirror those reports directly.

Hypothesis Testing: A recurring debate in management meetings is whether contract staff are genuinely less reliable than permanent staff, or whether this is a perception bias. Formal hypothesis testing provides a statistically defensible answer that can be presented to the Director-General without being dismissed as opinion.

Correlation Analysis: Understanding which variables move together — whether attendance and performance are genuinely linked, or whether late arrivals predict deteriorating outcomes — informs the sequence of interventions recommended. If attendance and performance are strongly correlated, an attendance-improvement programme is simultaneously a performance-improvement programme.

Regression: HR policy discussions often involve conditional questions: does years of service predict attendance after controlling for grade level? Does location matter independently of department? Regression answers these questions with quantified, actionable coefficients that translate directly into policy recommendations for the Permanent Secretary.

3 Data Collection and Sampling

3.1 Source

The dataset is an extract from the organisation’s HR Management Information System (HRMIS), drawn by the ICT department at the request of the HR Director in January 2025 covering the full 2024 fiscal year (January to December 2024). The data was shared with the author with the written approval of the HR Director and the Permanent Secretary for the purpose of this academic study. The HR Director is the custodian and primary business user of this data, reviewing an equivalent monthly extract as part of the standard workforce-monitoring cycle.

3.2 Sampling frame

The sampling frame is all staff on the nominal roll as at 1 January 2024 who remained in service through 31 December 2024. Staff who resigned, retired, or were transferred mid-year are excluded to ensure full-year comparability. The resulting dataset covers 150 employees across seven departments and four office locations.

3.3 Variables

Variable	Type	Description
employee_id	Character	Anonymised staff identifier
department	Categorical	Functional department
grade_level	Categorical	GL 04 to GL 14 (six bands)
gender	Categorical	Male / Female
location	Categorical	Abuja HQ / Lagos / Port Harcourt / Kano
employment_type	Categorical	Permanent / Contract / Secondment
years_of_service	Numeric	Years of service as at Jan 2024
working_days	Numeric	Total working days in observation period
days_present	Numeric	Working days attended
days_absent	Numeric	Working days missed
attendance_rate_pct	Numeric	Attendance as % of working days
late_arrivals	Numeric	Number of recorded late arrivals
training_hours	Numeric	Training hours completed in the year
performance_score	Numeric	Annual appraisal score (1-5 scale)
primary_leave_type	Categorical	Most frequent leave type taken
month_observed	Character	Month of observation
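
The three attendance fields in the table are arithmetically linked, which is worth confirming before any analysis. The sketch below uses the first eight records visible in the data preview in Section 4.1 and assumes that attendance_rate_pct is days_present divided by working_days, expressed as a percentage. That definition is inferred from the values and is not stated by the source system.

```r
# First eight records as shown in the glimpse() preview (Section 4.1)
working_days <- rep(22, 8)
days_present <- c(20.2, 18.5, 20.5, 19.9, 18.6, 19.9, 16.3, 20.9)
days_absent  <- c(1.8, 3.5, 1.5, 2.1, 3.4, 2.1, 5.7, 1.1)
reported_pct <- c(91.8, 84.1, 93.2, 90.5, 84.5, 90.5, 74.1, 95.0)

# Assumed definition: attendance rate = days present / working days * 100
derived_pct <- round(100 * days_present / working_days, 1)

# Both checks are TRUE when the fields are internally consistent
rate_ok <- all(abs(derived_pct - reported_pct) < 0.05)
days_ok <- isTRUE(all.equal(days_present + days_absent, working_days))

print(c(rate_ok = rate_ok, days_ok = days_ok))
```

Because days_present, days_absent, and attendance_rate_pct carry the same information, only one of them can usefully enter a correlation or regression analysis.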

3.4 Ethical notes

All personally identifiable information (names, IPPIS numbers, and phone numbers) was removed before the extract was shared. Staff are identified only by anonymised codes (e.g. MDA_0001). The dataset was used with the written approval of the Permanent Secretary and in accordance with the Federal Civil Service Commission’s data governance guidelines. Data is available on request from the author.

3.5 Sample-size justification

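The 150 observations exceed the 100-observation minimum and provide adequate statistical power (above 0.80) for detecting medium effect sizes at alpha = 0.05. For the OLS regression, the rule of thumb of at least ten observations per predictor is met for the eight predictor variables (150 / 8 ≈ 19). Because the categorical predictors expand into dummy and contrast terms, the fitted model estimates 20 coefficients besides the intercept (about 7.5 observations each), so estimates for small categories should be read with caution.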
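
The power claim can be illustrated with base R's power.t.test(). The sketch assumes a two-sided, two-sample comparison with two equal groups of 75 and a medium standardised effect of d = 0.5. The equal split is purely illustrative; the actual groups in this dataset are unequal, which lowers power somewhat.

```r
# Power for a medium effect (d = 0.5) with 75 staff per group (illustrative split)
power_equal_split <- power.t.test(n = 75, delta = 0.5, sd = 1,
                                  sig.level = 0.05,
                                  type = "two.sample",
                                  alternative = "two.sided")$power

# Per-group sample size needed to reach 80% power for the same effect
n_required <- power.t.test(delta = 0.5, sd = 1, power = 0.80,
                           sig.level = 0.05,
                           type = "two.sample",
                           alternative = "two.sided")$n

cat(sprintf("Power with 75 per group: %.2f\n", power_equal_split))
cat(sprintf("Per-group n for 80%% power: %d\n", ceiling(n_required)))
```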

4 Data Description

4.1 Data cleaning pipeline

Code

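# Packages assumed for the functions used in this analysis
library(tidyverse)   # readr, dplyr, tidyr, ggplot2, forcats, tibble
library(janitor)     # clean_names()
library(gt)          # gt(), tab_header(), fmt_number()
library(gtsummary)   # tbl_summary(), as_gt()
library(ggcorrplot)  # ggcorrplot()
library(scales)      # label_percent()

staff <- read_csv("staff_attendance.csv", show_col_types = FALSE) |>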
  clean_names() |>
  mutate(
    department     = factor(department),
    grade_level    = factor(grade_level,
                            levels = c("GL 04","GL 06","GL 08",
                                       "GL 10","GL 12","GL 14"),
                            ordered = TRUE),
    gender         = factor(gender),
    location       = factor(location,
                            levels = c("Abuja HQ","Lagos Office",
                                       "Port Harcourt Office","Kano Office")),
    employment_type = factor(employment_type,
                             levels = c("Permanent","Contract","Secondment")),
    primary_leave_type = factor(primary_leave_type,
                                levels = c("None","Annual Leave","Sick Leave",
                                           "Maternity/Paternity","Unauthorised"))
  )

glimpse(staff)

Rows: 150
Columns: 16
$ employee_id         <chr> "MDA_0001", "MDA_0002", "MDA_0003", "MDA_0004", "M…
$ department          <fct> Operations, Administration, Finance, Operations, H…
$ grade_level         <ord> GL 06, GL 10, GL 12, GL 12, GL 08, GL 04, GL 04, G…
$ gender              <fct> Male, Female, Male, Female, Male, Male, Female, Fe…
$ location            <fct> Abuja HQ, Port Harcourt Office, Port Harcourt Offi…
$ employment_type     <fct> Contract, Contract, Permanent, Permanent, Permanen…
$ years_of_service    <dbl> 7.4, 23.1, 11.6, 6.4, 23.0, 18.1, 4.6, 6.9, 22.8, …
$ working_days        <dbl> 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22, 22…
$ days_present        <dbl> 20.2, 18.5, 20.5, 19.9, 18.6, 19.9, 16.3, 20.9, 22…
$ days_absent         <dbl> 1.8, 3.5, 1.5, 2.1, 3.4, 2.1, 5.7, 1.1, 0.0, 3.3, …
$ attendance_rate_pct <dbl> 91.8, 84.1, 93.2, 90.5, 84.5, 90.5, 74.1, 95.0, 10…
$ late_arrivals       <dbl> 4, 2, 3, 2, 3, 3, 2, 5, 4, 5, 3, 4, 5, 1, 4, 0, 4,…
$ training_hours      <dbl> 0.0, 4.2, 8.6, 7.0, 8.2, 6.7, 7.8, 3.1, 4.1, 9.6, …
$ performance_score   <dbl> 2.9, 3.0, 4.7, 2.5, 3.2, 3.0, 2.9, 3.9, 3.4, 3.5, …
$ primary_leave_type  <fct> Annual Leave, Maternity/Paternity, Sick Leave, Sic…
$ month_observed      <chr> "March 2026", "March 2026", "March 2026", "March 2…

4.2 Summary statistics

Code

staff |>
  select(years_of_service, days_present, days_absent,
         attendance_rate_pct, late_arrivals,
         training_hours, performance_score) |>
  tbl_summary(
    statistic = list(all_continuous() ~ "{mean} ({sd})"),
    missing   = "ifany",
    label = list(
      years_of_service    ~ "Years of service",
      days_present        ~ "Days present",
      days_absent         ~ "Days absent",
      attendance_rate_pct ~ "Attendance rate (%)",
      late_arrivals       ~ "Late arrivals (count)",
      training_hours      ~ "Training hours",
      performance_score   ~ "Performance score (1-5)"
    )
  ) |>
  as_gt() |>
  tab_header(
    title    = "Summary statistics — staff attendance dataset",
    subtitle = "Mean (SD) shown for all numeric variables"
  )

Summary statistics — staff attendance dataset
Mean (SD) shown for all numeric variables
Characteristic	N = 150¹
Years of service	15 (8)
Days present	19.32 (1.63)
Days absent	2.68 (1.63)
Attendance rate (%)	88 (7)
Late arrivals (count)
0	18 (12%)
1	20 (13%)
2	38 (25%)
3	30 (20%)
4	27 (18%)
5	14 (9.3%)
6	3 (2.0%)
Training hours	6.04 (3.07)
Performance score (1-5)	3.18 (0.53)
¹ Mean (SD); n (%)
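
Late arrivals appear as a frequency table and not as a mean (SD), most likely because tbl_summary() by default treats numeric variables with only a few distinct values as categorical. The mean and SD can still be recovered from the counts above. The sketch below rebuilds the 150 values from the table.

```r
# Frequency table of late arrivals as reported in the summary above
late_values <- 0:6
late_counts <- c(18, 20, 38, 30, 27, 14, 3)

# Expand the counts back into one value per employee
late_arrivals <- rep(late_values, times = late_counts)

n_staff   <- length(late_arrivals)
mean_late <- mean(late_arrivals)
sd_late   <- sd(late_arrivals)

cat(sprintf("n = %d, mean = %.2f, SD = %.2f\n", n_staff, mean_late, sd_late))
```

This gives a mean of about 2.55 late arrivals per employee with an SD of about 1.55. Passing type = list(late_arrivals ~ "continuous") to tbl_summary() should report the same figures directly in the table.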

4.3 Missing values and data quality

Code

miss <- staff |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(),
               names_to  = "variable",
               values_to = "n_missing") |>
  filter(n_missing > 0)

if (nrow(miss) == 0) {
  cat("No missing values detected across all variables.\n")
} else {
  miss |>
    gt() |>
    tab_header(title = "Variables with missing values")
}

No missing values detected across all variables.

Code

q   <- quantile(staff$attendance_rate_pct, c(0.25, 0.75), na.rm = TRUE)
iqr <- diff(q)
n_out <- sum(staff$attendance_rate_pct < q[1] - 1.5*iqr |
             staff$attendance_rate_pct > q[2] + 1.5*iqr, na.rm = TRUE)
cat(sprintf("Attendance rate outliers (IQR method): %d records\n", n_out))

Attendance rate outliers (IQR method): 2 records

Code

cat("These represent genuine chronic absentees and are retained.\n")

These represent genuine chronic absentees and are retained.

4.4 Distributions of key numeric variables

Code

staff |>
  select(attendance_rate_pct, performance_score,
         late_arrivals, training_hours, years_of_service) |>
  pivot_longer(everything()) |>
  ggplot(aes(value)) +
  geom_histogram(bins = 20, fill = "#2166ac", colour = "white", alpha = 0.85) +
  facet_wrap(~ name, scales = "free", ncol = 3) +
  labs(
    title    = "Distributions of key numeric variables",
    subtitle = "Attendance rate is left-skewed; late arrivals is right-skewed",
    x = NULL, y = "Count"
  )

Data quality issue 1: Attendance rate is left-skewed. Most staff cluster above 85%, but a tail of chronic absentees pulls the distribution downward. These are genuine cases requiring HR intervention and are retained.

Data quality issue 2: Late arrivals is right-skewed with many low values. Most staff have few late arrivals, but a small number of repeat offenders drive the upper tail. This variable is used as a predictor on its raw scale in the regression.

5 Technique 2 — Data Visualisation

A connected narrative: from overall attendance distribution by employment type, to departmental differences, to grade-level performance patterns, to the attendance-performance relationship, and finally to a summary heatmap.

5.1 Plot 1 — Attendance rate by employment type

Code

ggplot(staff, aes(x = attendance_rate_pct, fill = employment_type)) +
  geom_histogram(bins = 25, colour = "white", alpha = 0.85) +
  facet_wrap(~ employment_type, ncol = 3) +
  scale_fill_brewer(palette = "Set2", guide = "none") +
  labs(
    title    = "Plot 1 — Attendance rate distribution by employment type",
    subtitle = "Contract staff show a wider spread and lower central tendency",
    x = "Attendance rate (%)", y = "Count"
  )

Contract staff show a visibly wider and lower distribution than permanent staff. This pattern sets up the formal hypothesis test in Technique 3.

5.2 Plot 2 — Attendance rate by department

Code

staff |>
  mutate(department = fct_reorder(department, attendance_rate_pct, median)) |>
  ggplot(aes(x = department, y = attendance_rate_pct, fill = department)) +
  geom_boxplot(alpha = 0.85, show.legend = FALSE, outlier.colour = "grey50") +
  geom_hline(yintercept = 80, linetype = "dashed", colour = "red",
             linewidth = 0.8) +
  annotate("text", x = 1.4, y = 81.5, label = "80% threshold",
           colour = "red", size = 3.5) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip() +
  labs(
    title    = "Plot 2 — Attendance rate by department",
    subtitle = "Dashed line marks the 80% early-warning threshold",
    x = NULL, y = "Attendance rate (%)"
  )

Departments are sorted by median attendance. The red dashed line at 80% marks the proposed early-warning threshold — departments where a material share of staff fall below this line warrant priority HR attention.
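
The 80% early-warning rule can be sketched as a simple flag. The example uses the first eight attendance rates from the data preview in Section 4.1; the sequential IDs and the helper name flag_low_attendance() are assumptions for illustration, not part of the original pipeline.

```r
# Illustrative records: rates follow the glimpse() preview; IDs assumed sequential
staff_sample <- data.frame(
  employee_id         = sprintf("MDA_%04d", 1:8),
  attendance_rate_pct = c(91.8, 84.1, 93.2, 90.5, 84.5, 90.5, 74.1, 95.0)
)

# Hypothetical helper: TRUE when a rate falls below the threshold
flag_low_attendance <- function(rate_pct, threshold = 80) {
  rate_pct < threshold
}

staff_sample$early_warning <- flag_low_attendance(staff_sample$attendance_rate_pct)

# IDs that would be referred for supervisory review
flagged_ids <- staff_sample$employee_id[staff_sample$early_warning]
print(flagged_ids)
```

Grouping the same flag by department would give the share of staff below the line in each department, which is the quantity the boxplot is used to judge by eye.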

5.3 Plot 3 — Performance score by grade level

Code

ggplot(staff, aes(x = grade_level, y = performance_score,
                  fill = grade_level)) +
  geom_boxplot(alpha = 0.85, show.legend = FALSE) +
  scale_fill_brewer(palette = "Blues") +
  labs(
    title    = "Plot 3 — Performance score by grade level",
    subtitle = "Senior grades (GL 12-14) consistently score higher",
    x = "Grade level", y = "Performance score (1-5)"
  )

Performance scores rise with grade level. GL 12 and GL 14 staff cluster around 3.5-4.5 while GL 04 officers frequently score below 2.5. This gradient warrants investigation of whether lower grades receive adequate supervisory support and training investment.

5.4 Plot 4 — Attendance rate vs performance score

Code

ggplot(staff, aes(x = attendance_rate_pct, y = performance_score,
                  colour = employment_type)) +
  geom_point(alpha = 0.65, size = 2) +
  geom_smooth(method = "lm", se = FALSE, colour = "grey30",
              linewidth = 0.8) +
  scale_colour_brewer(palette = "Set2") +
  labs(
    title    = "Plot 4 — Attendance rate vs performance score",
    subtitle = "Higher attendance is associated with higher performance",
    x = "Attendance rate (%)", y = "Performance score (1-5)",
    colour   = "Employment type"
  )

The positive relationship between attendance and performance is visible across all employment types. Contract staff cluster at lower attendance and lower performance — reinforcing the case for targeted intervention.

5.5 Plot 5 — Mean attendance heatmap by location and grade level

Code

staff |>
  group_by(location, grade_level) |>
  summarise(mean_att = mean(attendance_rate_pct, na.rm = TRUE),
            n = n(), .groups = "drop") |>
  ggplot(aes(x = grade_level, y = location, fill = mean_att)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = sprintf("%.0f%%\n(n=%d)", mean_att, n)),
            colour = "white", size = 3) +
  scale_fill_gradient2(low = "#d73027", mid = "#ffffbf", high = "#1a9850",
                       midpoint = 88,
                       labels = label_percent(scale = 1)) +
  labs(
    title    = "Plot 5 — Mean attendance rate by location and grade level",
    subtitle = "Red = below average; green = above average",
    x = "Grade level", y = NULL, fill = "Mean attendance"
  )

The heatmap identifies specific location-grade combinations driving underperformance. Red cells represent priority targets for HR intervention.

6 Technique 3 — Hypothesis Testing

6.1 Theory recap

A hypothesis test formalises a comparison between a null hypothesis (H0) and an alternative (H1). The p-value is the probability of observing data at least as extreme as ours if H0 were true. A p-value below alpha = 0.05 leads to rejection of H0. Effect sizes (Cohen’s d, epsilon-squared) measure practical magnitude independently of sample size. Where normality assumptions are violated, non-parametric alternatives are used.

6.2 Business justification

Two hypotheses correspond to live policy debates in the MDA. The first — whether contract staff genuinely attend less than permanent staff — determines whether the employment-type distinction warrants differentiated HR policy. The second — whether attendance differs by grade level — determines whether junior-grade officers need targeted support programmes.

6.3 Hypothesis 1 — Do contract staff attend less than permanent staff?

H0: Mean attendance rate for contract staff equals mean attendance rate for permanent staff. H1: Mean attendance rate for contract staff is lower than for permanent staff. Test: Welch two-sample t-test (one-tailed). Alpha = 0.05.

Code

perm     <- staff |> filter(employment_type == "Permanent") |>
            pull(attendance_rate_pct)
contract <- staff |> filter(employment_type == "Contract")  |>
            pull(attendance_rate_pct)

shapiro.test(perm)


    Shapiro-Wilk normality test

data:  perm
W = 0.9652, p-value = 0.009108

Code

shapiro.test(contract)


    Shapiro-Wilk normality test

data:  contract
W = 0.9725, p-value = 0.5904

Code

t_result <- t.test(contract, perm, alternative = "less", var.equal = FALSE)
print(t_result)


    Welch Two Sample t-test

data:  contract and perm
t = -2.7155, df = 46.444, p-value = 0.004632
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
     -Inf -1.64641
sample estimates:
mean of x mean of y 
 84.66452  88.97525
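
As a check on how the printed figures fit together, the one-tailed p-value and the upper confidence bound can be recomputed from the reported t statistic, degrees of freedom, and group means. This is a sketch from the rounded output above, so small rounding differences are expected.

```r
# Values copied from the Welch test output above
t_stat    <- -2.7155
df_welch  <- 46.444
mean_diff <- 84.66452 - 88.97525   # contract minus permanent

# One-tailed p-value for H1: contract mean < permanent mean
p_one_sided <- pt(t_stat, df = df_welch)

# Standard error implied by the statistic, then the one-sided 95% upper bound
se_diff  <- mean_diff / t_stat
upper_95 <- mean_diff + qt(0.95, df = df_welch) * se_diff

cat(sprintf("p = %.6f, SE = %.3f, upper bound = %.3f\n",
            p_one_sided, se_diff, upper_95))
```

The upper bound sits below zero, which is the confidence-interval counterpart of rejecting H0 at the 5% level.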

Code

pooled_sd <- sqrt(((length(perm)-1)*var(perm) +
                   (length(contract)-1)*var(contract)) /
                  (length(perm) + length(contract) - 2))
cohens_d  <- (mean(contract) - mean(perm)) / pooled_sd

cat(sprintf("\nMean attendance — Permanent: %.1f%% | Contract: %.1f%%\n",
            mean(perm), mean(contract)))


Mean attendance — Permanent: 89.0% | Contract: 84.7%

Code

cat(sprintf("Difference: %.1f percentage points\n",
            mean(contract) - mean(perm)))

Difference: -4.3 percentage points

Code

cat(sprintf("Cohen's d: %.3f\n", cohens_d))

Cohen's d: -0.585

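Result: The one-tailed Welch test gives t = -2.72 (df ≈ 46.4, p = 0.0046), so H0 is rejected at alpha = 0.05: contract staff attend significantly less than permanent staff (84.7% against 89.0%). Cohen’s d = -0.585 indicates a medium effect, a gap of roughly 0.6 pooled standard deviations. Because the Shapiro-Wilk test rejects normality for the permanent group (p = 0.009), a Wilcoxon rank-sum test would be a natural robustness check on this result.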

Business interpretation: The difference in attendance between contract and permanent staff is statistically significant and practically meaningful. This justifies embedding attendance targets into contract renewal criteria and introducing a minimum 85% attendance clause with supervisory review triggered at 80%.
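
To make the effect size concrete, the pooled standard deviation implied by the reported means and Cohen's d can be backed out, and d can be placed against Cohen's conventional benchmarks. This is a sketch from the printed values; the benchmark cut-offs (0.2, 0.5, 0.8) are conventions, not properties of this dataset.

```r
# Values copied from the output in this section
mean_contract <- 84.66452
mean_perm     <- 88.97525
cohens_d      <- -0.585

# d = (mean difference) / pooled SD, so pooled SD = (mean difference) / d
pooled_sd_implied <- (mean_contract - mean_perm) / cohens_d

# Conventional benchmarks: 0.2 small, 0.5 medium, 0.8 large
effect_label <- cut(abs(cohens_d),
                    breaks = c(0, 0.2, 0.5, 0.8, Inf),
                    labels = c("negligible", "small", "medium", "large"),
                    right  = FALSE)

cat(sprintf("Implied pooled SD: %.2f points (%s effect)\n",
            pooled_sd_implied, as.character(effect_label)))
```

The implied pooled SD of about 7.4 percentage points is in line with the overall attendance SD of 7 reported in Section 4.2.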

6.4 Hypothesis 2 — Does attendance differ across grade levels?

H0: Median attendance rate is identical across all grade levels. H1: At least one grade level has a different median attendance rate. Test: Kruskal-Wallis (non-parametric). Alpha = 0.05.

Code

staff |>
  group_by(grade_level) |>
  summarise(
    n         = n(),
    mean_att  = round(mean(attendance_rate_pct, na.rm = TRUE), 1),
    sd_att    = round(sd(attendance_rate_pct, na.rm = TRUE), 1),
    shapiro_p = round(shapiro.test(attendance_rate_pct)$p.value, 4),
    .groups   = "drop"
  ) |>
  gt() |>
  tab_header(title = "Attendance rate by grade level — descriptives and normality")

Attendance rate by grade level — descriptives and normality
grade_level	n	mean_att	sd_att	shapiro_p
GL 04	38	84.4	7.7	0.6684
GL 06	32	86.0	6.8	0.4582
GL 08	34	87.3	6.7	0.5230
GL 10	22	91.2	6.3	0.1426
GL 12	19	92.6	4.8	0.3269
GL 14	5	96.8	5.7	0.0038

Code

kw <- kruskal.test(attendance_rate_pct ~ grade_level, data = staff)
print(kw)


    Kruskal-Wallis rank sum test

data:  attendance_rate_pct by grade_level
Kruskal-Wallis chi-squared = 30.537, df = 5, p-value = 1.156e-05

Code

effectsize::rank_epsilon_squared(attendance_rate_pct ~ grade_level,
                                 data = staff)

Epsilon2 (rank) |       95% CI
------------------------------
0.20            | [0.14, 1.00]

- One-sided CIs: upper bound fixed at [1.00].
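
The rank epsilon-squared can be reproduced by hand from the Kruskal-Wallis statistic, using the standard relation epsilon-squared = H / (n - 1). The sketch uses the printed H and the sample size of 150.

```r
# Values copied from the Kruskal-Wallis output above
h_stat   <- 30.537   # Kruskal-Wallis chi-squared (H)
n_total  <- 150      # staff in the dataset
k_groups <- 6        # grade-level bands

# p-value from the chi-squared reference distribution with k - 1 df
p_value <- pchisq(h_stat, df = k_groups - 1, lower.tail = FALSE)

# Rank epsilon-squared: H / ((n^2 - 1) / (n + 1)), which simplifies to H / (n - 1)
eps_sq <- h_stat / (n_total - 1)

cat(sprintf("p = %.3e, epsilon-squared = %.2f\n", p_value, eps_sq))
```

An epsilon-squared of 0.20 means roughly a fifth of the variation in attendance ranks is associated with grade level.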

Code

staff |>
  rstatix::dunn_test(attendance_rate_pct ~ grade_level,
                     p.adjust.method = "bonferroni") |>
  select(group1, group2, n1, n2, statistic, p, p.adj, p.adj.signif) |>
  gt() |>
  tab_header(title = "Post-hoc Dunn test (Bonferroni adjusted)")

Post-hoc Dunn test (Bonferroni adjusted)
group1	group2	n1	n2	statistic	p	p.adj	p.adj.signif
GL 04	GL 06	38	32	0.8340352	4.042612e-01	1.0000000000	ns
GL 04	GL 08	38	34	1.6168475	1.059112e-01	1.0000000000	ns
GL 04	GL 10	38	22	3.3238018	8.879929e-04	0.0133198939	*
GL 04	GL 12	38	19	4.0923977	4.269356e-05	0.0006404034	***
GL 04	GL 14	38	5	3.4554866	5.493004e-04	0.0082395062	**
GL 06	GL 08	32	34	0.7372227	4.609869e-01	1.0000000000	ns
GL 06	GL 10	32	22	2.4925887	1.268157e-02	0.1902234768	ns
GL 06	GL 12	32	19	3.2792801	1.040723e-03	0.0156108407	*
GL 06	GL 14	32	5	3.0022959	2.679515e-03	0.0401927323	*
GL 08	GL 10	34	22	1.8593939	6.297132e-02	0.9445698056	ns
GL 08	GL 12	34	19	2.6818937	7.320672e-03	0.1098100855	ns
GL 08	GL 14	34	5	2.6352051	8.408647e-03	0.1261296983	ns
GL 10	GL 12	22	19	0.8283180	4.074904e-01	1.0000000000	ns
GL 10	GL 14	22	5	1.5207299	1.283276e-01	1.0000000000	ns
GL 12	GL 14	19	5	0.9828454	3.256835e-01	1.0000000000	ns
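
The Bonferroni adjustment in the table multiplies each raw p-value by the number of pairwise comparisons (15 for six grade bands) and caps the result at 1. The sketch reproduces the adjusted column from the raw p column using base R's p.adjust().

```r
# Raw p-values copied from the Dunn test table, in row order
p_raw <- c(4.042612e-01, 1.059112e-01, 8.879929e-04, 4.269356e-05, 5.493004e-04,
           4.609869e-01, 1.268157e-02, 1.040723e-03, 2.679515e-03,
           6.297132e-02, 7.320672e-03, 8.408647e-03,
           4.074904e-01, 1.283276e-01, 3.256835e-01)

# Six grade bands give choose(6, 2) pairwise comparisons
n_comparisons <- choose(6, 2)

# Bonferroni: multiply by the number of tests, cap at 1
p_bonf <- p.adjust(p_raw, method = "bonferroni")

# Number of pairs that stay significant at alpha = 0.05 after adjustment
n_significant <- sum(p_bonf < 0.05)

cat(sprintf("%d comparisons, %d significant after adjustment\n",
            n_comparisons, n_significant))
```

Bonferroni is conservative: GL 06 vs GL 10 and both GL 08 comparisons with the senior bands have raw p-values below 0.05 but do not survive the correction.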

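Result: The Kruskal-Wallis test rejects H0 (chi-squared = 30.54, df = 5, p = 1.16e-05), so attendance differs significantly across grade levels, with a rank epsilon-squared of 0.20. The Bonferroni-adjusted Dunn test locates the difference between the junior and senior bands: GL 04 differs from GL 10, GL 12, and GL 14, and GL 06 differs from GL 12 and GL 14. No adjacent pair of grades differs significantly, and comparisons involving GL 14 rest on only five staff.

Business interpretation: Mean attendance rises steadily with grade, from 84.4% at GL 04 to 96.8% at GL 14, so attendance problems are concentrated in the early-career cohort. HR should review induction, mentoring, and whether junior-grade staff face transport or welfare barriers that senior staff do not.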

7 Technique 4 — Correlation Analysis

7.1 Theory recap

Pearson’s r measures the strength of linear association between two variables, and its significance test assumes approximate normality. Spearman’s rho measures monotonic correlation on ranks and is more appropriate for skewed distributions. Partial correlation isolates the relationship between two variables after removing the influence of a third. Correlation does not establish causation.
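
Partial correlation can be sketched with the first-order formula, which needs only the three pairwise coefficients. The helper name partial_cor() is hypothetical, and the inputs are the rounded Pearson coefficients for attendance rate, performance score, and training hours from the matrix in Section 7.3.

```r
# Hypothetical helper: correlation of x and y after removing the influence of z
partial_cor <- function(r_xy, r_xz, r_yz) {
  (r_xy - r_xz * r_yz) / sqrt((1 - r_xz^2) * (1 - r_yz^2))
}

# Rounded coefficients from the Pearson matrix (Section 7.3)
r_att_perf   <- 0.53    # attendance rate vs performance score
r_att_train  <- 0.03    # attendance rate vs training hours
r_perf_train <- -0.01   # performance score vs training hours

# Attendance-performance correlation, controlling for training hours
r_partial <- partial_cor(r_att_perf, r_att_train, r_perf_train)

cat(sprintf("Partial r = %.3f\n", r_partial))
```

Because training hours is almost unrelated to both variables, the partial coefficient stays at about 0.53: training does not explain the attendance-performance link.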

7.2 Business justification

The key question is which variables genuinely co-vary with attendance after accounting for grade and employment type effects. If late arrivals and training hours both correlate with attendance, they can serve as earlier warning signals that trigger supervisory action before formal disciplinary processes are needed.

7.3 Correlation matrix and heatmap

Code

num_vars <- staff |>
  select(years_of_service, days_present, days_absent,
         attendance_rate_pct, late_arrivals,
         training_hours, performance_score)

cor_mat <- cor(num_vars, method = "pearson", use = "complete.obs")

ggcorrplot(cor_mat,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 3,
           colors   = c("#d73027", "white", "#1a9850"),
           title    = "Pearson correlation matrix — staff attendance variables")

Code

cor_mat |>
  as.data.frame() |>
  rownames_to_column("variable") |>
  gt() |>
  fmt_number(where(is.numeric), decimals = 2) |>
  tab_header(title = "Pearson correlation matrix (full coefficients)")

Pearson correlation matrix (full coefficients)
variable	years_of_service	days_present	days_absent	attendance_rate_pct	late_arrivals	training_hours	performance_score
years_of_service	1.00	0.01	−0.01	0.01	−0.04	0.03	0.00
days_present	0.01	1.00	−1.00	1.00	0.01	0.03	0.53
days_absent	−0.01	−1.00	1.00	−1.00	−0.01	−0.03	−0.53
attendance_rate_pct	0.01	1.00	−1.00	1.00	0.01	0.03	0.53
late_arrivals	−0.04	0.01	−0.01	0.01	1.00	0.19	0.00
training_hours	0.03	0.03	−0.03	0.03	0.19	1.00	−0.01
performance_score	0.00	0.53	−0.53	0.53	0.00	−0.01	1.00

Code

cor_df <- as.data.frame(as.table(cor_mat)) |>
  filter(Var1 != Var2) |>
  mutate(abs_r = abs(Freq)) |>
  arrange(desc(abs_r)) |>
  distinct(abs_r, .keep_all = TRUE) |>
  head(6)

cor_df |>
  select(Variable1 = Var1, Variable2 = Var2, Pearson_r = Freq) |>
  mutate(Pearson_r = round(Pearson_r, 3)) |>
  gt() |>
  tab_header(title = "Top 6 pairwise correlations by absolute value")

Top 6 pairwise correlations by absolute value
Variable1	Variable2	Pearson_r
days_absent	days_present	-1.000
attendance_rate_pct	days_present	1.000
performance_score	days_absent	-0.534
performance_score	days_present	0.534
performance_score	attendance_rate_pct	0.534
training_hours	late_arrivals	0.191
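
The distinct(abs_r) step removes mirrored pairs by matching on equal absolute values, but it can also drop genuinely different pairs that happen to share the same |r|, which may be why the attendance_rate_pct and days_absent pair is missing from the table above. A more robust alternative, sketched here in base R, keeps each unordered pair once by reading only the upper triangle. The small matrix uses four variables and the rounded coefficients from the full matrix.

```r
# Illustrative 4 x 4 correlation matrix (rounded values from the full matrix)
vars <- c("attendance_rate_pct", "late_arrivals",
          "training_hours", "performance_score")
cor_mat <- matrix(c(1.00, 0.01,  0.03,  0.53,
                    0.01, 1.00,  0.19,  0.00,
                    0.03, 0.19,  1.00, -0.01,
                    0.53, 0.00, -0.01,  1.00),
                  nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Row/column positions of the upper triangle: each unordered pair exactly once
idx <- which(upper.tri(cor_mat), arr.ind = TRUE)

pairs <- data.frame(
  variable1 = rownames(cor_mat)[idx[, "row"]],
  variable2 = colnames(cor_mat)[idx[, "col"]],
  pearson_r = cor_mat[idx],
  stringsAsFactors = FALSE
)

# Rank by absolute strength, strongest first
pairs <- pairs[order(-abs(pairs$pearson_r)), ]
print(pairs, row.names = FALSE)
```

With k variables this always returns k(k - 1)/2 rows, regardless of ties in the coefficients.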

7.4 Plain-language interpretation

The three most informative relationships in the matrix and their HR policy implications:

1. Attendance rate and performance score (r = 0.53, positive): The strongest substantive relationship in the matrix. Staff who attend more frequently also tend to score higher in annual appraisals, which suggests that an attendance-improvement programme may also support performance and that the two outcomes should be managed together.

2. Attendance rate, days present, and days absent (r = ±1.00, by construction): Days absent is the arithmetic complement of days present, and attendance rate is days present divided by working days, so perfect correlations are expected. They confirm data integrity but carry no behavioural information.

3. Attendance rate and late arrivals (r = 0.01): Contrary to the expectation that lateness and absence are symptoms of the same disengagement, late arrivals are essentially uncorrelated with attendance in this dataset. The only other non-trivial coefficient is the weak positive link between training hours and late arrivals (r = 0.19). Late arrivals therefore cannot serve as an early-warning proxy for absence here, and any late-arrival trigger would need to be justified on its own terms.

Correlation does not establish causation. A staff member may both attend less and score lower because of an underlying personal circumstance causing both outcomes simultaneously. HR should conduct structured welfare conversations to identify root causes before prescribing interventions.

8 Technique 5 — Regression Analysis

8.1 Theory recap

OLS regression models the conditional mean of a continuous outcome as a linear function of predictors. Each coefficient estimates the change in the outcome for a one-unit increase in that predictor, holding all others constant. Diagnostic plots assess four key assumptions: linearity, homoscedasticity, normality of residuals, and independence. VIF values detect multicollinearity.

8.2 Business justification

Regression answers the conditional policy question: which factors predict attendance after controlling for all others? If employment type remains significant after controlling for grade level and location, then employment type is an independent risk factor warranting its own policy response — not just a proxy for junior grades having more contract staff.

8.3 OLS regression model

Code

model <- lm(
  attendance_rate_pct ~ employment_type + grade_level + department +
                        gender + location + years_of_service +
                        late_arrivals + training_hours,
  data = staff
)

broom::tidy(model, conf.int = TRUE) |>
  mutate(
    across(where(is.numeric), ~ round(.x, 3)),
    signif = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      p.value < 0.1   ~ ".",
      TRUE            ~ ""
    )
  ) |>
  gt() |>
  tab_header(
    title    = "OLS regression — attendance rate (%) model",
    subtitle = "Signif. codes: *** <.001  ** <.01  * <.05"
  )

OLS regression — attendance rate (%) model
Signif. codes: *** <.001  ** <.01  * <.05
term	estimate	std.error	statistic	p.value	conf.low	conf.high	signif
(Intercept)	93.578	2.073	45.147	0.000	89.477	97.679	***
employment_typeContract	-3.872	1.240	-3.123	0.002	-6.325	-1.419	**
employment_typeSecondment	-3.821	1.565	-2.441	0.016	-6.917	-0.725	*
grade_level.L	9.325	1.848	5.047	0.000	5.670	12.981	***
grade_level.Q	0.970	1.746	0.555	0.580	-2.485	4.424
grade_level.C	1.090	1.474	0.740	0.461	-1.825	4.006
grade_level^4	0.465	1.307	0.355	0.723	-2.122	3.051
grade_level^5	1.405	1.186	1.184	0.238	-0.942	3.752
departmentFinance	-0.674	1.474	-0.457	0.648	-3.591	2.243
departmentHR	-3.798	2.151	-1.766	0.080	-8.053	0.457	.
departmentICT	3.759	1.910	1.968	0.051	-0.020	7.537	.
departmentLegal	-4.741	1.818	-2.608	0.010	-8.338	-1.145	*
departmentOperations	-3.257	1.502	-2.169	0.032	-6.229	-0.285	*
departmentProcurement	2.944	1.923	1.531	0.128	-0.860	6.748
genderMale	0.062	1.045	0.059	0.953	-2.005	2.129
locationLagos Office	-5.463	1.208	-4.523	0.000	-7.853	-3.073	***
locationPort Harcourt Office	-5.034	1.474	-3.415	0.001	-7.951	-2.118	**
locationKano Office	0.149	1.578	0.094	0.925	-2.973	3.271
years_of_service	-0.014	0.064	-0.223	0.824	-0.141	0.112
late_arrivals	0.158	0.329	0.481	0.631	-0.493	0.810
training_hours	0.072	0.162	0.441	0.660	-0.249	0.393
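
Because grade_level is an ordered factor, lm() codes it with orthogonal polynomial contrasts, so grade_level.L is a linear trend across the six bands and not a per-band difference. The sketch converts the linear coefficient into the implied gap between GL 04 and GL 14. It uses the linear term only and ignores the non-significant higher-order terms, so it approximates the full model contrast.

```r
# Linear polynomial contrast scores for six ordered levels
lin <- contr.poly(6)[, ".L"]
names(lin) <- c("GL 04", "GL 06", "GL 08", "GL 10", "GL 12", "GL 14")

# grade_level.L estimate from the regression table
beta_linear <- 9.325

# Implied attendance gap (percentage points) between top and bottom bands
linear_gap <- unname(beta_linear * (lin["GL 14"] - lin["GL 04"]))

print(round(lin, 3))
cat(sprintf("Linear-trend gap, GL 14 vs GL 04: %.1f points\n", linear_gap))
```

The linear trend implies a gap of about 11 percentage points between GL 14 and GL 04 with other predictors held constant, close to the raw difference in group means (96.8% against 84.4%) reported in Section 6.4.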

Code

broom::glance(model) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, nobs) |>
  gt() |>
  fmt_number(where(is.numeric), decimals = 3) |>
  tab_header(title = "Model fit statistics")

Model fit statistics
r.squared	adj.r.squared	sigma	statistic	p.value	nobs
0.484	0.404	5.712	6.058	0.000	150.000
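
The fit statistics are linked by standard formulas, so the adjusted R-squared and the F statistic can be recovered from R-squared, the sample size, and the 20 estimated slope coefficients. The sketch uses the rounded values in the table, so the F statistic matches only approximately.

```r
# Values from the model fit table and the coefficient table
r2     <- 0.484
n_obs  <- 150
n_coef <- 20                     # slope coefficients, excluding the intercept
resid_df <- n_obs - n_coef - 1   # residual degrees of freedom

# Adjusted R-squared penalises R-squared for the number of coefficients
adj_r2 <- 1 - (1 - r2) * (n_obs - 1) / resid_df

# Overall F statistic for the null that all slopes are zero
f_stat <- (r2 / n_coef) / ((1 - r2) / resid_df)

cat(sprintf("Adjusted R-squared = %.3f, F(%d, %d) = %.2f\n",
            adj_r2, n_coef, resid_df, f_stat))
```

The gap between 0.484 and 0.404 reflects the cost of estimating 20 coefficients from 150 observations.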

8.4 Regression diagnostics

Code

par(mfrow = c(2, 2))
plot(model)

Code

par(mfrow = c(1, 1))

Code

vif_vals <- car::vif(model)
as.data.frame(vif_vals) |>
  rownames_to_column("Term") |>
  gt() |>
  fmt_number(where(is.numeric), decimals = 2) |>
  tab_header(
    title    = "Variance Inflation Factors",
    subtitle = "VIF above 5 signals multicollinearity concern"
  )

Variance Inflation Factors
VIF above 5 signals multicollinearity concern
Term	GVIF	Df	GVIF^(1/(2*Df))
employment_type	1.29	2.00	1.07
grade_level	1.64	5.00	1.05
department	1.87	6.00	1.05
gender	1.22	1.00	1.11
location	1.51	3.00	1.07
years_of_service	1.14	1.00	1.07
late_arrivals	1.19	1.00	1.09
training_hours	1.13	1.00	1.06
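
For predictors with more than one degree of freedom, car::vif() reports a generalised VIF, and the usual guidance is to compare the square of GVIF^(1/(2*Df)) with ordinary VIF thresholds. The sketch recomputes the adjusted column from the rounded GVIF and Df values above; treat the threshold comparison as a rule of thumb, not a formal test.

```r
# GVIF and degrees of freedom copied from the table above
gvif <- c(employment_type = 1.29, grade_level = 1.64, department = 1.87,
          gender = 1.22, location = 1.51, years_of_service = 1.14,
          late_arrivals = 1.19, training_hours = 1.13)
df_terms <- c(2, 5, 6, 1, 3, 1, 1, 1)

# Df-adjusted GVIF, comparable across terms with different Df
gvif_adj <- gvif^(1 / (2 * df_terms))

# Squaring puts the adjusted value on the ordinary VIF scale
vif_equivalent <- gvif_adj^2

print(round(cbind(gvif, df_terms, gvif_adj, vif_equivalent), 2))
```

Every term stays far below 5 on the VIF-equivalent scale, so multicollinearity does not appear to be a concern for this model.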

Code

lmtest::bptest(model)


    studentized Breusch-Pagan test

data:  model
BP = 13.791, df = 20, p-value = 0.8409
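
The Breusch-Pagan statistic is referred to a chi-squared distribution with one degree of freedom per regressor, so the printed p-value can be reproduced directly. The null hypothesis is constant residual variance.

```r
# Values copied from the Breusch-Pagan output above
bp_stat <- 13.791
bp_df   <- 20

# Upper-tail probability under the chi-squared reference distribution
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# TRUE when there is no evidence against homoscedasticity at alpha = 0.05
homoscedastic_ok <- bp_p > 0.05

cat(sprintf("BP p-value = %.4f, homoscedasticity retained: %s\n",
            bp_p, homoscedastic_ok))
```

With p well above 0.05 there is no evidence of heteroscedasticity, so the HC3 robust standard errors serve as a sensitivity check and not as a correction.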

Code

lmtest::coeftest(model, vcov. = sandwich::vcovHC(model, type = "HC3"))


t test of coefficients:

                              Estimate Std. Error t value  Pr(>|t|)    
(Intercept)                  93.578440   2.054495 45.5481 < 2.2e-16 ***
employment_typeContract      -3.871708   1.255436 -3.0840  0.002499 ** 
employment_typeSecondment    -3.820862   1.737662 -2.1989  0.029671 *  
grade_level.L                 9.325421   1.446276  6.4479 2.075e-09 ***
grade_level.Q                 0.969542   1.407098  0.6890  0.492037    
grade_level.C                 1.090113   1.358976  0.8022  0.423937    
grade_level^4                 0.464686   1.281528  0.3626  0.717495    
grade_level^5                 1.404956   1.258735  1.1162  0.266426    
departmentFinance            -0.674013   1.595823 -0.4224  0.673465    
departmentHR                 -3.798023   2.457516 -1.5455  0.124682    
departmentICT                 3.758655   2.071703  1.8143  0.071958 .  
departmentLegal              -4.741274   1.937563 -2.4470  0.015750 *  
departmentOperations         -3.257085   1.480571 -2.1999  0.029596 *  
departmentProcurement         2.943616   1.758894  1.6736  0.096640 .  
genderMale                    0.062132   1.086006  0.0572  0.954465    
locationLagos Office         -5.462703   1.350706 -4.0443 8.976e-05 ***
locationPort Harcourt Office -5.034108   1.541977 -3.2647  0.001403 ** 
locationKano Office           0.149062   1.609366  0.0926  0.926348    
years_of_service             -0.014221   0.061724 -0.2304  0.818144    
late_arrivals                 0.158427   0.341548  0.4639  0.643536    
training_hours                0.071543   0.180345  0.3967  0.692242    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
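
Each row of the robust table is the OLS estimate divided by its HC3 standard error, referred to a t distribution on the residual degrees of freedom. The following is a minimal sketch for the contract coefficient, assuming 129 residual degrees of freedom (150 observations minus 21 estimated coefficients). The confidence interval is an added illustration that coeftest() does not print.

```r
# Contract coefficient and HC3 standard error copied from the output above.
est    <- -3.871708
se_hc3 <-  1.255436
df_res <- 150 - 21   # assumed residual df: nobs minus coefficients incl. intercept

# t statistic and two-sided p-value from the t distribution.
t_val <- est / se_hc3
p_val <- 2 * pt(abs(t_val), df = df_res, lower.tail = FALSE)

# 95% confidence interval built from the robust standard error.
ci_95 <- est + c(-1, 1) * qt(0.975, df = df_res) * se_hc3

round(c(t = t_val, p = p_val, lower = ci_95[1], upper = ci_95[2]), 4)
```

An interval that excludes zero corresponds to significance at the 5% level in the table.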

8.5 Plain-language interpretation

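Model fit: The model explains approximately 48% of the variation in annual attendance rate (R-squared = 0.484; adjusted R-squared = 0.404), a moderate fit that is reasonable for a behavioural HR outcome influenced by individual circumstances. The overall F-test (F = 6.058, p < 0.001) indicates that the predictors are jointly significant.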

Employment type (Contract vs Permanent): The contract coefficient is negative and significant (-3.87, p = 0.002 with robust standard errors), consistent with the hypothesis test finding. Holding grade level, location, and all other variables constant, a contract employee's attendance rate is approximately 3.9 percentage points lower than that of an equivalent permanent employee. Seconded staff show a gap of similar size (-3.82, p = 0.030). Business action: attendance targets should be embedded explicitly in contract terms with an 85% minimum attendance clause and supervisory review triggered at 80%.

Grade level: grade_level enters the model as an ordered factor with polynomial contrasts, and only the linear term is significant (9.33, p < 0.001). Attendance therefore rises steadily with grade after controlling for department and employment type, so the most junior grades (GL 04 and GL 06) have the lowest expected attendance. Business action: introduce a Junior Staff Attendance Support Programme targeting GL 04-06 officers in their first two years, covering transport allowance review, flexible hours piloting, and structured mentoring.
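
The grade_level.L, .Q, .C, ^4 and ^5 rows in the coefficient tables are the orthogonal polynomial contrasts that R applies to ordered factors by default. The sketch below shows how those coefficients translate into per-grade attendance offsets. It assumes six ordered grade levels (implied by the five contrast terms) and uses the estimates from the robust table. Levels are indexed from lowest to highest because the full set of grade labels is not listed in this section.

```r
# Polynomial contrast coefficients copied from the regression output.
poly_coefs <- c(.L = 9.325421, .Q = 0.969542, .C = 1.090113,
                `^4` = 0.464686, `^5` = 1.404956)

# Contrast matrix R uses for an ordered factor with six levels (assumed count).
contrasts_6 <- contr.poly(6)

# Offset of each grade from the average across grades, in percentage points.
grade_offsets <- as.vector(contrasts_6 %*% poly_coefs)
names(grade_offsets) <- paste0("level_", 1:6)   # 1 = lowest grade, 6 = highest

# Gap between the highest and lowest grade, in total and from the linear term alone.
gap_total  <- grade_offsets[["level_6"]] - grade_offsets[["level_1"]]
gap_linear <- unname(poly_coefs[".L"]) * (contrasts_6[6, 1] - contrasts_6[1, 1])

round(grade_offsets, 2)
```

On these estimates the linear term alone implies a gap of roughly 11 percentage points between the highest and lowest grade, holding the other predictors constant.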

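Late arrivals: The late-arrivals coefficient is small, positive, and not significant (0.158, p = 0.644 with robust standard errors), so the model gives no evidence that late arrivals predict attendance independently of the other factors. The case for acting on late arrivals rests on the bivariate correlation reported earlier rather than on this regression. Business action: a three-strikes late-arrival trigger generating an automatic welfare check remains a reasonable early-warning measure, but this model does not identify late arrivals as an independent driver of attendance.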

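Diagnostic plots: The Residuals vs Fitted plot shows approximately random scatter, supporting the linear specification. The Q-Q plot indicates approximate normality of residuals. The Breusch-Pagan test (BP = 13.791, df = 20, p = 0.841) gives no evidence of heteroskedasticity, and the HC3 robust standard errors are close to the conventional ones. All rescaled GVIF values are below 1.2, far under the threshold of 5, indicating no harmful multicollinearity.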

9 Integrated Findings

9.1 How the five analyses connect

The five analyses form a coherent analytical chain. EDA established the data quality baseline and identified the left-skewed attendance distribution and right-skewed late arrivals, which governed all subsequent technique choices. Visualisation surfaced the patterns that matter operationally: contract staff show wider attendance spread, junior grades underperform on performance scores, and the attendance-performance link is visible across all employment types. Hypothesis testing confirmed that both the employment-type and grade-level attendance gaps are unlikely to be attributable to sampling noise. Correlation analysis showed that late arrivals are negatively associated with attendance, and that the attendance-performance link is strong enough to treat both outcomes as part of a single intervention. Regression isolated the independent contribution of each factor: employment type and grade level remain significant predictors of attendance after controlling for each other, so they are not proxies for one another, whereas late arrivals show no independent effect once the other factors are held constant.

9.2 The single actionable recommendation

On the basis of these five analyses, I recommend that the HR Directorate implement a two-track Attendance Improvement Programme before the start of the next fiscal year. Track 1 (Contract Staff Protocol): all contract staff should have an 85% minimum attendance clause inserted at next renewal, with an automated 80% early-warning trigger generating a supervisory welfare check. Track 2 (Junior Grade Support Programme): all GL 04 and GL 06 officers in their first two years should be enrolled in a structured mentoring scheme, receive a transport allowance review, and have quarterly rather than annual attendance reviews with their supervisors. Together, these two tracks address employment type and grade level, both significant independent predictors in the regression model, and are intended to lift overall MDA attendance toward the 92% benchmark for comparable federal agencies.
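
The two thresholds in Track 1 can be expressed as a simple classification rule. The sketch below is hypothetical: the function name, the labels, and the treatment of boundary values (a rate of exactly 80% or 85% counts as meeting that threshold) are assumptions, and the example rates are made up rather than drawn from the dataset.

```r
# Hypothetical helper: classify annual attendance rates (in percent)
# against the 80% early-warning trigger and the 85% contractual minimum.
flag_attendance <- function(rate) {
  cut(
    rate,
    breaks = c(-Inf, 80, 85, Inf),
    labels = c("Welfare check", "Below minimum", "Meets minimum"),
    right  = FALSE   # intervals are [lower, upper): exactly 80 or 85 meets that threshold
  )
}

# Made-up rates for illustration only.
example_rates <- c(78.0, 80.0, 84.9, 85.0, 92.0)
data.frame(rate = example_rates, status = flag_attendance(example_rates))
```

Whether a rate of exactly 80% should trigger the welfare check is a policy choice; switching to right = TRUE would make the boundaries inclusive on the lower category instead.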

10 Limitations and Further Work

Single year of data: The 2024 dataset does not allow trend analysis. With three or more years of panel data, a fixed-effects regression could isolate the causal effect of policy changes on attendance.
Excluded mid-year leavers: Staff who resigned or retired in 2024 were excluded to ensure full-year comparability. If leavers had systematically lower attendance, the true organisation-wide attendance rate would be lower than reported.
Unobserved variables: Commute distance, childcare responsibilities, health status, and manager quality are not in the HRMIS extract but are known drivers of attendance in the public sector literature.
Causality: The attendance-performance correlation is associative. A randomised welfare intervention trial would establish a causal effect and justify scaling the programme organisation-wide.

11 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.6). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., & Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Olugbile, B. (2026). Staff attendance dataset — Federal MDA, 2024 fiscal year [Dataset]. Extracted from HR Management Information System, Abuja, Nigeria, with permission of the HR Director and Permanent Secretary. Data available on request from the author.

12 Appendix — AI Usage Statement

Claude (Anthropic, claude.ai) was used to assist this study in two ways. First, it helped audit the structure of the data extract, identify column-naming discrepancies, and generate corrected R code for the data cleaning pipeline. Second, it drafted boilerplate code for the visualisation, hypothesis testing, correlation and regression sections, which I reviewed and verified against the actual dataset and organisational context. All analytical decisions — the choice of case study, the two hypotheses tested, the regression model specification, the interpretation of every result, and the two-track policy recommendation — are my own, made in line with my independent analytical judgement. The dataset was obtained with the written permission of the HR Director and Permanent Secretary of the MDA. The AI was used as a coding and editing assistant; no AI-generated interpretation appears in this document without my independent review and validation.