---
title: "Drivers of Employee Performance at P&S Holdings"
subtitle: "An HR Analytics Case Study — LBS EMBA 31, Data Analytics 1"
author: "[Tolulope Olateju]"
date: today
format:
  html:
    theme: flatly
    toc: true
    toc-depth: 3
    toc-location: left
    code-fold: true
    code-tools: true
    self-contained: true
    fig-width: 9
    fig-height: 6
    number-sections: true
    smooth-scroll: true
execute:
  warning: false
  message: false
  echo: true
---
# Executive Summary {.unnumbered}
This study addresses a question I have wanted to measure for some time as Head of People Operations & Administration at P&S Holdings, a Lagos-based fintech group: **does training spend measurably improve employee performance, and what other levers explain the variation we see in appraisal scores?**
The dataset is a complete census of all 97 permanent staff across 11 functional departments plus a 4-person Management group, joined to 2025 training records, biometric attendance data, and the most recent six-month appraisal cycle. Five analytical techniques were applied — exploratory data analysis, visualisation, hypothesis testing, correlation, and linear regression.
**Three findings.** First, **each training a staff member attended is associated with a 5.12-point lift in appraisal score** (p = 0.044), controlling for tenure, attendance, level, gender, and department, so the L&D budget is defensible on the evidence. Second, **appraisal ratings show no detectable difference by gender** (χ² p = 0.75, Cramér's V = 0.00). Third, **Terminal Support scores 15.5 points below comparable Software Development staff** (p = 0.010), identifying a specific unit requiring management attention.
**Recommendation:** Maintain or grow the 2026 L&D budget, and prioritise the 47 currently-untrained staff (48.5% of the workforce) as the highest-leverage HR intervention available.
# Professional Disclosure
I am the **Head of People Operations & Administration** for P&S Holdings, the parent group of two sister fintech companies operating in Lagos, Nigeria. One subsidiary holds a Mobile Money Operator (MMO) licence from the Central Bank of Nigeria, while the other holds a Payment Terminal Service Provider (PTSP) licence. Together, the group employs 97 permanent staff across 11 functional departments — software development, business, product success, audit and compliance, settlement and reconciliation, IT and infrastructure, terminal support, information security, operations, people operations, and finance — plus a board-level Management group of four executives who oversee all departments. For analytical purposes the Management group is treated as a twelfth category in the dataset.
I am responsible for the full employee lifecycle at P&S Holdings: recruitment, onboarding, performance management, learning and development, attendance, leave, exits, and the cost envelope around all of these. In addition, I am responsible for the day-to-day administration of the office. People Operations and Administration is the function I lead and from which this analysis originates. The framework for this study follows Adi (2026), who positions exploratory and inferential analytics as the foundation that must precede predictive modelling — a sequencing directly applicable to the maturity of analytics at P&S Holdings, where descriptive rigour is the current binding constraint rather than algorithmic sophistication.
The specific business problem driving this study is one I have wanted to measure for some time: **does the training spend P&S Holdings made in 2025 measurably improve employee performance, and which other levers (attendance, tenure, level, department) explain the variation we see in appraisal scores?** Until now, training impact has been argued from feedback forms and anecdote. I want a defensible numeric answer that compares training intensity with performance outcomes ahead of the next appraisal cycle in July 2026. The outcome will inform a briefing I plan to share with my CEO and the leadership team to support a more evidence-based 2026 L&D investment case. The five techniques in this case study map directly to that question:
**Exploratory Data Analysis** is how I will demonstrate that the HRIS data underlying any recommendation is sound. Before presenting any conclusion to the CEO, I need to show that I understand its distributions, missing values, and quirks — for example, that the December 2025 attendance dip reflects the firm's Christmas shutdown rather than a workforce problem. EDA is the equivalent of reconciling a trial balance before declaring profit.
**Data Visualisation** is the medium the CEO consumes. Tables of regression coefficients will not survive a board memo; the same finding rendered as a clear plot will. I use visualisation in every management report I write, and the skill I most need to sharpen is choosing the chart type that makes the single most important pattern obvious without further explanation.
**Hypothesis Testing** is how I move beyond impression. Senior managers regularly assert that *"the X department is under-performing"* or *"we lose more women than men"*. My job is to confirm or refute these claims with a defensible statistical test rather than an eye-balled average. An ANOVA on appraisal scores by department, and a chi-squared test on rating bands by gender, are the two specific tests this study runs.
**Correlation Analysis** lets me quantify the strength of association between the HR levers I control — training intensity, attendance, tenure — and the outcome the business cares about, namely performance ratings. Correlation does not prove causation, but a correlation matrix tells me which levers are worth investigating further and which I can deprioritise.
**Linear Regression** is the technique that answers the central question. By regressing appraisal score on training intensity while controlling for tenure, attendance, level, and department, I can isolate the marginal contribution of training — the number I need to defend the L&D budget. If the coefficient is significant and positive, the budget stays or grows. If it is not, I have a difficult but evidence-based conversation to have. I will not claim that any single regression coefficient establishes causation — only association under control variables. The decision to extend or curtail training spend will still require management judgement; my role is to ensure that judgement is informed by the strongest evidence the data can support.
# Data Collection & Sampling
**Source.** All data in this study were extracted from the internal HRIS of P&S Holdings, the parent group for which I serve as Head of People Operations & Administration. The export covers all 97 permanent staff active on the payroll of the two operating subsidiaries (MMO and PTSP) as at 17 May 2026, plus the event logs (new hires, promotions, role changes, exits, leave, attendance, training, and performance appraisals) for the period 1 January 2025 to 17 May 2026.
**Collection method.** I exported the staff list, performance, leave, hire, exit, and promotion records directly from the HRIS in Excel format. Attendance data was pulled from the biometric clock-in system that was deployed across P&S Holdings in November 2025, which is why attendance data is only available from November 2025 onwards rather than for the full 17-month event window. Training records were extracted from the 2025 L&D budget tracker maintained by People Operations.
**Sampling frame and sample size.** This is a census, not a sample — all 97 active staff are included, eliminating sampling error as a source of bias. The dataset is therefore fully representative of the workforce at P&S Holdings as at the snapshot date. Departments range in size from 3 staff (Finance) to 30 staff (Software Development); this imbalance is preserved in the analysis and addressed by reporting effect sizes alongside p-values rather than relying on statistical significance alone.
**Time period covered.** The personnel snapshot is current as at 17 May 2026. Performance appraisals are the most recent six-month cycle (October 2025 – March 2026). Attendance covers six complete months (November 2025 – April 2026) plus a partial May 2026, up to the 17 May extract date. Training events cover the full 2025 calendar year. The misalignment between these windows is acknowledged as a limitation in Section 11.
**Ethical clearance.** This analysis is conducted with the verbal awareness of the CEO of P&S Holdings; no objection has been raised regarding its use as an academic submission. The firm name **"P&S Holdings"** is a pseudonym used in this published document to protect the commercial identity of the two operating entities, in line with the assessment brief's instruction that students "anonymise if needed". All staff IDs were pseudonymised by the HRIS prior to my extract (format `PPL/#####/##` or `SWP/#####/##`); the dataset contains no names, salaries, contact details, or bank information. The dataset is not redistributable outside this academic submission.
**Data quality work undertaken.** The raw HRIS export contained 16 documented data-quality issues, which were systematically resolved before analysis. Department names had 14 spelling and spacing variants collapsed into 12 canonical labels; performance ratings were filtered to the 5 valid bands after removing six descriptive legend rows that had polluted the column; one leave end-date typo (`25/3/206`) was reconstructed as `2026-03-25`; one duplicate attendance record was resolved by averaging the two entries; and all text fields were stripped of leading, trailing, and redundant internal whitespace. The full cleaning log is included as Appendix B.
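As a rough illustration of the label-collapsing step, a minimal base R sketch follows. The variant spellings and the `canonical_lookup` table are hypothetical stand-ins, not the actual entries from the cleaning log in Appendix B.

```r
# Hypothetical raw department labels with spacing and case variants
raw_dept <- c("Software  Development", "software development ",
              " Terminal Support", "Terminal support", "Finance")

# Normalise: trim the ends, collapse runs of whitespace, lower-case for matching
normalise <- function(x) tolower(gsub("\\s+", " ", trimws(x)))

# Hypothetical lookup from normalised key to canonical label
canonical_lookup <- c(
  "software development" = "Software Development",
  "terminal support"     = "Terminal Support",
  "finance"              = "Finance"
)

clean_dept <- unname(canonical_lookup[normalise(raw_dept)])

# A label that fails to match comes back as NA and should be reviewed by hand
stopifnot(!anyNA(clean_dept))
table(clean_dept)
```

Matching on a normalised key rather than editing labels one by one keeps the mapping auditable: every raw variant either resolves to a canonical label or surfaces as `NA`.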
# Setup and Data Loading
```{r setup}
library(tidyverse)
library(readxl)
library(janitor)
library(skimr)
library(patchwork)
library(scales)
library(car)
library(rstatix)
library(effectsize)
library(broom)
library(corrplot)
library(knitr)
library(kableExtra)
# Global plot theme
theme_set(theme_minimal(base_size = 12))
# Load the cleaned master analytical table
hr <- read_excel("Employee_Data_CLEAN.xlsx",
sheet = "00_Master_Analytical") |>
mutate(
Level_Ord = factor(Level,
levels = c("Trainee","Junior Staff","Associate","Coordinator",
"Specialist","Principal","Manager","Consultant","Partner"),
ordered = TRUE),
Performance_Rating = factor(Performance_Rating,
levels = c("Unsatisfactory","Below Expectations","Meets Expectations",
"Exceeds Expectations","Outstanding"),
ordered = TRUE)
)
```
The dataset contains **`r nrow(hr)` staff** across **`r n_distinct(hr$Department)` departments**, with **`r sum(!is.na(hr$Appraisal_Score))` having a recent performance appraisal** and **`r sum(!is.na(hr$Avg_Attendance))` having attendance data** from the biometric system. The remaining analyses build on this dataset.
# Data Description
This study draws on a master analytical table of 97 staff (one row per current employee) joined to performance, attendance, leave, training, and promotion records. Before any inferential analysis, this section establishes what the data looks like, where the missing values are, and which variables carry usable signal for the regression that follows.
## Variable inventory
```{r var-inventory}
hr |>
summarise(
Numeric = sum(sapply(hr, is.numeric)),
Character = sum(sapply(hr, is.character)),
Factor = sum(sapply(hr, is.factor)),
Date = sum(sapply(hr, lubridate::is.Date) | sapply(hr, lubridate::is.POSIXct)),
Total_Vars = ncol(hr),
Rows = nrow(hr)
) |>
kable(caption = "Dataset shape and column types")
```
The dataset combines 8 categorical predictors (gender, marital status, department, level, employment status, performance rating, tenure bucket, and high-performer flag), 8 numeric predictors and outcomes (tenure years, average attendance, total leave days, leave events, trainings in department, total training cost, appraisal score, and level rank), and 2 date variables (resumption date and confirmation date). This satisfies the Case Study 1 minimum of 6 variables including at least 3 numeric, 2 categorical, and 1 date.
## Summary statistics for numeric variables
```{r numeric-summary}
hr |>
select(Tenure_Years, Avg_Attendance, Total_Leave_Days, Leave_Events,
Trainings_Attended, Appraisal_Score) |>
skim() |>
yank("numeric") |>
select(skim_variable, n_missing, mean, sd, p0, p25, p50, p75, p100) |>
kable(digits = 1,
col.names = c("Variable", "Missing", "Mean", "SD",
"Min", "Q1", "Median", "Q3", "Max"),
caption = "Distribution of numeric variables in the analytical dataset")
```
## Categorical frequency tables
```{r cat-tables}
hr |>
count(Department, sort = TRUE) |>
mutate(Pct = round(100 * n / sum(n), 1)) |>
kable(col.names = c("Department", "Headcount", "% of staff"),
caption = "Workforce by department (n = 97)")
hr |>
count(Level, sort = TRUE) |>
mutate(Pct = round(100 * n / sum(n), 1)) |>
kable(col.names = c("Level", "Headcount", "% of staff"),
caption = "Workforce by level (n = 97)")
hr |>
count(Gender, Marital_Status) |>
pivot_wider(names_from = Marital_Status, values_from = n, values_fill = 0) |>
kable(caption = "Cross-tabulation of gender and marital status")
```
## Missing-value scan
```{r missing-scan}
hr |>
summarise(across(everything(), ~ sum(is.na(.)))) |>
pivot_longer(everything(), names_to = "Variable", values_to = "n_missing") |>
filter(n_missing > 0) |>
mutate(Pct_missing = round(100 * n_missing / nrow(hr), 1)) |>
arrange(desc(n_missing)) |>
kable(col.names = c("Variable", "Missing rows", "Missing %"),
caption = "Variables with missing values")
```
## Outlier check on the outcome variable
```{r outlier-check}
out_thresh <- quantile(hr$Appraisal_Score, c(0.25, 0.75), na.rm = TRUE) +
c(-1.5, 1.5) * IQR(hr$Appraisal_Score, na.rm = TRUE)
cat("Appraisal_Score IQR fences (1.5 × IQR rule):\n",
"Lower fence:", round(out_thresh[1], 1), "\n",
"Upper fence:", round(out_thresh[2], 1), "\n",
"Observed range:", min(hr$Appraisal_Score, na.rm = TRUE), "to",
max(hr$Appraisal_Score, na.rm = TRUE), "\n",
"Outliers flagged:", sum(hr$Appraisal_Score < out_thresh[1] |
hr$Appraisal_Score > out_thresh[2], na.rm = TRUE))
```
## Interpretation
**Workforce structure.** The 97-staff census is dominated by **Software Development (30 staff, 31%)** and **Business (19 staff, 20%)**, which together make up just over half the workforce. The smallest functional unit is Finance with 3 staff, which is a sample-size warning for any department-level test: each Finance employee's score carries a one-third weight in the department mean. Five departments (Information Security, Management, Operations, People Operations, Audit & Compliance) each have 4–6 staff, placing them in the same fragile sample-size category. The level distribution is bottom-heavy, as expected in a young fintech (26 Associates, 20 Coordinators, 18 Principals, 17 Specialists), with only 9 staff at Manager-or-above.
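To make the sample-size warning concrete, the sketch below uses hypothetical scores (not the real Finance figures) and shows how far a department mean moves when one person's score changes by 10 points, in a 3-person unit versus a 30-person one.

```r
# Shift in a group mean when one member's score changes by `delta` points
mean_shift <- function(scores, delta) {
  changed <- scores
  changed[1] <- changed[1] + delta
  mean(changed) - mean(scores)
}

finance_scores <- c(60, 70, 80)   # hypothetical, n = 3
large_dept     <- rep(70, 30)     # hypothetical, n = 30

mean_shift(finance_scores, 10)    # 10 / 3  (about 3.33 points)
mean_shift(large_dept, 10)        # 10 / 30 (about 0.33 points)
```

The shift is always `delta / n`, so the same individual change is ten times more visible in a 3-person department than in a 30-person one.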
**Outcome variable.** Appraisal scores range from 50 to 90 with a median of 70 and a standard deviation of ~11 points. The IQR rule flags zero outliers, so extreme values are unlikely to distort the parametric tests that follow (ANOVA, Pearson correlation, OLS regression); the shape of the distribution itself is examined in Plot 1. The observed maximum is 90 rather than the theoretical 100: no employee in this cycle scored above 90, which is itself a finding worth noting to the CEO. P&S Holdings has plenty of *Meets Expectations* performers and a credible pipeline of *Exceeds*, but no exceptionally high scores pulling the mean upward.
**Data quality issues identified and handled.** Two material issues remained after the initial HRIS cleaning, both visible in the missing-value scan above:
1. **Attendance data is missing for 13 staff** (~13% of the workforce). These are mostly senior management and board-level staff who, by company policy, do not clock in on the biometric system. The gaps are therefore **systematic rather than accidental**: they follow seniority, so the data are not missing completely at random, and attendance results describe only the staff who clock in. For any analysis involving attendance I retain these staff in the dataset, exclude them from attendance-specific calculations through R's missing-value handling, and report the effective sample size in each test. I do not impute the values, because imputing a clock-in score for a Partner who never clocks in would invent data.
2. **Two staff lack a performance appraisal score.** These are likely staff whose appraisals were pending at extract date or who joined after the appraisal window closed. They are retained for descriptive purposes but excluded from the regression in Section 9, where Appraisal_Score is the dependent variable.
The full cleaning log appears in Appendix B. None of the remaining patterns required transformation (no log transforms, no winsorisation), and no rows were dropped during EDA.
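Whether missingness follows seniority can be checked by tabulating it against level. The sketch below is a minimal illustration on a hypothetical six-row data frame; the column names mirror `Level` and `Avg_Attendance` in the master table, but the values are invented.

```r
# Hypothetical rows: senior staff have no biometric attendance record
toy <- data.frame(
  Level = c("Associate", "Associate", "Specialist", "Manager", "Partner", "Partner"),
  Avg_Attendance = c(92, 88, 95, NA, NA, NA)
)

# Share of staff with no attendance record, by level
miss_by_level <- tapply(is.na(toy$Avg_Attendance), toy$Level, mean)
miss_by_level
```

If the shares cluster at 0 for junior levels and near 1 for senior ones, the gaps are structural and imputation would not be appropriate.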
# Data Visualisation
Following Adi (2026, Ch. 5), this section uses five plots designed to be read as a single narrative. The story they collectively tell: **appraisal scores at P&S Holdings are well-behaved on average, but the variation around that average is driven more by what people do day-to-day (attendance, training intensity) than by who they are (department, gender, marital status).**
## Plot 1 — Distribution of the outcome variable
```{r viz-1}
# Compute the median and n once so the labels always match the data
scored <- hr |> filter(!is.na(Appraisal_Score))
score_median <- median(scored$Appraisal_Score)
p1 <- scored |>
  ggplot(aes(Appraisal_Score)) +
  geom_histogram(bins = 10, fill = "#1F4E79", colour = "white") +
  # The median is a constant, so it is passed outside aes() and drawn once
  geom_vline(xintercept = score_median,
             linetype = "dashed", colour = "#C00000", linewidth = 0.8) +
  annotate("text", x = score_median + 1, y = Inf, hjust = 0, vjust = 1.5,
           label = paste("Median =", score_median),
           colour = "#C00000", size = 3.5) +
  labs(title = "1. Most staff cluster between 60 and 80",
       subtitle = paste0("Distribution of 6-month appraisal scores (n = ",
                         nrow(scored), ")"),
       x = "Appraisal score (out of 100)", y = "Number of staff") +
  theme(plot.title = element_text(face = "bold"))
p1
```
## Plot 2 — Performance by department
```{r viz-2, fig.width=10, fig.height=6}
p2 <- hr |>
  filter(!is.na(Appraisal_Score)) |>
  mutate(Department = fct_reorder(Department, Appraisal_Score, .fun = median)) |>
  ggplot(aes(Department, Appraisal_Score, fill = Department)) +
  # Hide boxplot outlier markers: every score is already drawn by geom_jitter()
  geom_boxplot(show.legend = FALSE, alpha = 0.85, outlier.shape = NA) +
  # Jitter along the department axis only, so plotted scores keep their true values
  geom_jitter(width = 0.15, height = 0, alpha = 0.4, size = 1,
              show.legend = FALSE) +
  coord_flip() +
  labs(title = "2. Median performance varies by department",
       subtitle = "Box = IQR, line = median, dots = individual staff",
       x = NULL, y = "Appraisal score") +
  theme(plot.title = element_text(face = "bold"))
p2
```
## Plot 3 — Tenure and performance
```{r viz-3}
p3 <- hr |>
filter(!is.na(Appraisal_Score), !is.na(Tenure_Years)) |>
ggplot(aes(Tenure_Years, Appraisal_Score)) +
geom_point(alpha = 0.6, size = 2.5, colour = "#1F4E79") +
geom_smooth(method = "lm", se = TRUE, colour = "#C00000", fill = "#F2D7D5") +
labs(title = "3. Tenure shows little linear association with performance",
subtitle = "Each dot is one staff member; red line is the OLS best fit",
x = "Tenure (years)", y = "Appraisal score") +
theme(plot.title = element_text(face = "bold"))
p3
```
## Plot 4 — Attendance and performance
```{r viz-4}
p4 <- hr |>
filter(!is.na(Appraisal_Score), !is.na(Avg_Attendance)) |>
ggplot(aes(Avg_Attendance, Appraisal_Score, colour = Gender)) +
geom_point(alpha = 0.75, size = 2.5) +
geom_smooth(method = "lm", se = FALSE, linewidth = 0.9) +
scale_colour_manual(values = c("Female" = "#C00000", "Male" = "#1F4E79")) +
labs(title = "4. Attendance has no clear linear relationship with performance",
subtitle = "Coloured by gender — the slope is flat for both",
x = "Average attendance (%)", y = "Appraisal score",
colour = NULL) +
theme(plot.title = element_text(face = "bold"),
legend.position = "bottom")
p4
```
## Plot 5 — Rating mix across levels
```{r viz-5, fig.width=10, fig.height=5}
p5 <- hr |>
filter(!is.na(Performance_Rating)) |>
mutate(Level_Ord = fct_drop(Level_Ord)) |>
ggplot(aes(Level_Ord, fill = Performance_Rating)) +
geom_bar(position = "fill", colour = "white", linewidth = 0.3) +
scale_y_continuous(labels = percent_format()) +
scale_fill_manual(values = c(
"Meets Expectations" = "#A6BDDB",
"Exceeds Expectations" = "#3690C0",
"Outstanding" = "#034E7B"
)) +
labs(title = "5. Rating mix shifts upward with level",
subtitle = "Share of each performance band by job level",
x = "Level", y = "Share of staff",
fill = "Rating") +
theme(plot.title = element_text(face = "bold"),
axis.text.x = element_text(angle = 25, hjust = 1),
legend.position = "bottom")
p5
```
## Storyline — what the five plots say together
```{r viz-combined, fig.width=10, fig.height=22}
# Stacked layout: each plot gets its own row, no title collision
p1 / p2 / p3 / p4 / p5 +
plot_annotation(
title = "Performance at P&S Holdings: a five-plot narrative",
subtitle = "Distribution → department variation → tenure → attendance → level progression",
theme = theme(plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11))
) +
plot_layout(heights = c(1, 1.4, 1, 1, 1))
```
## Interpretation
**Plot 1** establishes the shape of the outcome. The distribution is roughly bimodal with one cluster around 60 and a second around 80–85, suggesting two distinct performance populations within the workforce. The median sits at 70 and the range is 50–90, with no extreme outliers and no ceiling effects (nobody scored 100). For the CEO this is reassuring: the appraisal system has discriminating power. The bimodality is itself worth noting — it suggests the workforce may be splitting into a "Meets" cluster and an "Exceeds" cluster, which the regression in Section 9 will help explain.
**Plot 2** shows that median performance does vary across departments. Software Development sits at the top of the median ranking; Terminal Support and Management sit at the bottom. The interquartile ranges are wide for the larger departments and visibly narrow for the smaller ones (Finance, Operations, Information Security), which reflects small sample sizes rather than genuinely tight performance. Whether the visible gap between top and bottom departments is **statistically significant** is the question the ANOVA in the Hypothesis Testing section answers formally. The necessary caveat is that any department summary resting on 3–5 observations should be interpreted with caution.
**Plot 3** shows the tenure-performance relationship. The slope is essentially flat, and the confidence band does not rule out a slightly negative one. Long-tenured staff at P&S Holdings are not, on average, higher performers than recent joiners. This is a meaningful finding: it implies that staying longer in the firm is not, by itself, a performance lever. The HR implication is that retention policy should be selective rather than blanket: keeping our best performers, not simply keeping everyone.
**Plot 4** is the most surprising finding of the five and the one I will lead with when briefing the CEO. The slope of appraisal on attendance is **flat for both men and women**: recorded attendance does not predict being a high performer at P&S Holdings. This runs counter to the HR intuition that attendance is a productivity proxy. The implication is that simple presence is not enough; what people *do* during their hours matters more than whether they show up. This finding is quantified formally in the correlation section, where a small, statistically non-significant coefficient is expected.
**Plot 5** completes the story by showing that the performance rating mix does shift upward with level, but not monotonically. Trainees, Junior Staff, Consultants, and Partners are entirely *Meets Expectations*; Associates through Managers show a healthy share of *Exceeds Expectations* and a small *Outstanding* tail. One plausible reading is that ratings discriminate most at the working levels (Associate through Manager), where high performers earn *Exceeds* ratings, while the entry levels and the most senior levels are rated more uniformly. The CEO should not over-interpret the "100% Meets Expectations" at Partner level: this is a sample of 2 board members whose appraisal cycle may operate differently from the general workforce.
**The single most important takeaway from these five plots:** the three predictors HR usually treats as obvious performance levers — **department membership, tenure, and attendance** — are all weaker than expected. None of them shows a strong, clean visual relationship with appraisal score. This is itself the headline message for the CEO: performance variation at P&S Holdings is not well explained by who people are or where they sit on the org chart. The regression in Section 9 will test whether **training intensity** — the lever the L&D budget actually pays for — has a meaningful effect once we control for these weaker factors.
# Hypothesis Testing
Following Adi (2026, Ch. 6), this section formalises two claims that come up regularly in management conversations at P&S Holdings. Visualisation hinted at the patterns; formal testing tells us whether the patterns are real signal or sample-size noise. For each test I state the null and alternative hypotheses, check assumptions, run the test, and report both the p-value and an effect size — because in a 97-person dataset, statistical significance alone is not enough.
## Hypothesis 1 — Does mean appraisal score differ across departments?
**H₀:** μ_dept1 = μ_dept2 = … = μ_deptK (mean appraisal score is the same in every department)
**H₁:** at least one department mean differs from the others
This is a one-way ANOVA. Because Plot 2 showed visible differences between the top and bottom departments, formal testing is needed to confirm whether the gap survives the small-sample noise in the smaller departments.
### Filter to departments with adequate sample size
```{r anova-prep}
dept_n <- hr |>
filter(!is.na(Appraisal_Score)) |>
count(Department) |>
filter(n >= 3) |>
pull(Department)
hr_an <- hr |>
filter(Department %in% dept_n, !is.na(Appraisal_Score))
cat("Departments retained:", length(dept_n),
"| Staff retained:", nrow(hr_an),
"| Min group size:", min(table(hr_an$Department)),
"| Max group size:", max(table(hr_an$Department)), "\n")
```
### Assumption check 1 — Equal variances across groups (Levene's test)
```{r anova-levene}
levene_result <- car::leveneTest(Appraisal_Score ~ as.factor(Department), data = hr_an)
print(levene_result)
```
### Run the ANOVA
```{r anova-fit}
aov_fit <- aov(Appraisal_Score ~ as.factor(Department), data = hr_an)
anova_summary <- summary(aov_fit)
print(anova_summary)
# Effect size — eta squared
eta_sq <- effectsize::eta_squared(aov_fit)
print(eta_sq)
```
### Post-hoc — which department pairs differ?
```{r anova-tukey}
tukey_result <- TukeyHSD(aov_fit)
tukey_df <- as.data.frame(tukey_result$`as.factor(Department)`) |>
tibble::rownames_to_column("Comparison") |>
rename(Diff = diff, Lower = lwr, Upper = upr, `p_adj` = `p adj`) |>
filter(p_adj < 0.05) |>
arrange(p_adj)
if (nrow(tukey_df) > 0) {
tukey_df |>
kable(digits = 3,
caption = "Statistically significant pairwise department differences (Tukey HSD, p < 0.05)")
} else {
cat("No pairwise department differences reached statistical significance at p < 0.05.\n")
}
```
### Interpretation — Hypothesis 1
The ANOVA produces **F(11, 83) = 1.79, p = 0.069**. By the strict 5% significance threshold we **fail to reject H₀** — we cannot say with formal confidence that department means differ. But this is a marginal result, sitting just above the conventional cutoff and well below the 10% threshold. The honest reading is that **the data hints at a real department effect without quite confirming it at the statistical bar most readers apply**.
The **eta-squared (η²) is 0.19**, which on Cohen's (1988) conventions is a **large effect**. In business language: department membership accounts for roughly **19% of the variance** in appraisal scores in this sample. That is not a trivial amount. The large effect size and the borderline p-value coexist because of sample size: with only 3–6 staff in several departments (Finance, IT & Infrastructure, Information Security, Operations, People Operations, Audit & Compliance, Management), statistical power to detect a difference of this magnitude is low. The data are consistent with a real department effect, but the test cannot formally confirm one.
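Because η² tends to run high when there are many small groups, it can help to set it beside a bias-adjusted alternative. The sketch below recomputes η² from the reported F statistic and adds omega-squared (ω²) and the value η² would take on average if H₀ were true. The formulas are the standard one-way ANOVA identities, and the inputs are the rounded figures quoted above, so the results are approximate.

```r
F_stat <- 1.79   # reported F statistic
df1    <- 11     # between-groups df (12 departments - 1)
df2    <- 83     # within-groups df
N      <- df1 + df2 + 1

p_value  <- pf(F_stat, df1, df2, lower.tail = FALSE)
eta_sq   <- (F_stat * df1) / (F_stat * df1 + df2)
omega_sq <- (df1 * (F_stat - 1)) / (df1 * (F_stat - 1) + N)
eta_null <- df1 / (df1 + df2)   # expected eta-squared under H0

round(c(p = p_value, eta_sq = eta_sq, omega_sq = omega_sq, eta_null = eta_null), 3)
```

On these inputs η² is about 0.19 while ω² is roughly 0.08, and about 0.12 of η² would be expected by chance alone with 12 groups and 95 staff. That is a reason to read the "large effect" label with some caution.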
For the briefing to the CEO, the practical translation is: *"Department membership appears to explain roughly 19% of performance variance in this sample, a meaningful share, but the small size of several departments means we cannot declare the difference statistically significant at the conventional 5% threshold. Another appraisal cycle of data would help confirm or rule out the effect."* The headline action is therefore not to assume department doesn't matter, but to **investigate the pattern further** rather than dismiss it.
The Tukey post-hoc above identifies which specific department pairs differ. Where the table is empty or short, this reflects the same low-power story: pairwise comparisons require even more data than the omnibus test, so most pairs will not reach significance even when the underlying differences are real. The defensible move is to combine the η² finding ("department matters") with the visualisation in Plot 2 ("Software Development and Settlement & Reconciliation cluster high; Terminal Support and Management cluster low") and treat that as a hypothesis to test more rigorously in the next cycle.
## Hypothesis 2 — Is performance rating independent of gender?
**H₀:** the distribution of performance ratings is the same for women and men (rating and gender are statistically independent)
**H₁:** the distribution differs by gender
This is a chi-squared test of independence on the cross-tabulation of `Performance_Rating` × `Gender`. The test directly answers a question that gets raised in management meetings: *"are we rating women and men differently?"*
### The cross-tabulation
```{r chi-cross}
gender_rating_tab <- table(hr$Gender, droplevels(hr$Performance_Rating))
gender_rating_tab |>
kable(caption = "Cross-tabulation: gender × performance rating")
```
### Assumption check — are expected counts ≥ 5 in most cells?
```{r chi-expected}
chi_result <- chisq.test(gender_rating_tab)
cat("Expected counts under H₀ (independence):\n")
print(round(chi_result$expected, 1))
cells_below_5 <- sum(chi_result$expected < 5)
total_cells <- length(chi_result$expected)
cat("\nCells with expected count < 5:", cells_below_5,
"out of", total_cells,
"(", round(100 * cells_below_5 / total_cells), "% )\n")
```
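For readers less familiar with where the expected counts come from: under independence, each cell's expected count is (row total × column total) / grand total. A minimal sketch on a hypothetical 2 × 3 table follows; the counts are invented for illustration and are not the P&S Holdings figures.

```r
# Hypothetical gender x rating counts
toy_tab <- matrix(c(20, 12, 3,
                    35, 20, 5),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(Gender = c("Female", "Male"),
                                  Rating = c("Meets", "Exceeds", "Outstanding")))

# Expected count per cell = row total * column total / grand total
expected <- outer(rowSums(toy_tab), colSums(toy_tab)) / sum(toy_tab)

# Cross-check against chisq.test(); warnings about small cells are silenced here
chisq_expected <- suppressWarnings(chisq.test(toy_tab))$expected
stopifnot(isTRUE(all.equal(unname(expected), unname(chisq_expected))))

# Share of cells below the rule-of-thumb minimum of 5
mean(expected < 5)
```

The rarest rating band is usually the one that produces small expected counts, because its column total is small whichever gender row it is paired with.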
### Run the test
```{r chi-test}
print(chi_result)
# Effect size — Cramér's V
cramers_v <- effectsize::cramers_v(gender_rating_tab)
print(cramers_v)
```
### Interpretation — Hypothesis 2
The chi-squared test produces **χ² = 0.57, df = 2, p = 0.7507**. We **fail to reject H₀** by a very wide margin. If ratings were truly independent of gender, a cross-tabulation at least as uneven as the one observed would arise by chance about 75% of the time. There is **no statistical evidence** that women and men are rated differently at P&S Holdings.
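The **Cramér's V is 0.00** (95% CI: [0.00, 1.00]), the smallest value the statistic can take: any association between gender and performance rating in this dataset is too small to distinguish from zero. The interval should not be over-read, since an upper bound of exactly 1.00 most likely reflects a one-sided interval that places no constraint on how large the effect could be. A Cramér's V of zero combined with a p-value of 0.75 is a textbook null result, and a defensible one to take to the CEO.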
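For context on how a V of exactly 0.00 can arise, the sketch below recomputes Cramér's V from the reported χ², with and without a small-sample bias correction (Bergsma, 2013). Two things are assumed here rather than taken from the output: that the table contains the 95 appraised staff, and that the reported 0.00 is a bias-corrected estimate, which I believe is the default in recent versions of `effectsize`.

```r
chi_sq <- 0.57   # reported chi-squared statistic
n      <- 95     # ASSUMED: number of appraised staff in the table
r      <- 2      # gender categories
k      <- 3      # rating bands actually observed

# Unadjusted Cramér's V
v_raw <- sqrt((chi_sq / n) / min(r - 1, k - 1))

# Bias-corrected version: shrink phi-squared and truncate at zero
phi2_adj <- max(0, chi_sq / n - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

round(c(unadjusted = v_raw, bias_corrected = v_adj), 3)
```

On these inputs the unadjusted V is roughly 0.08 and the corrected value is truncated to zero, so 0.00 is better read as "too small to distinguish from zero at this sample size" than as an exact absence of association.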
This is the answer the CEO will want to hear, and it survives scrutiny. The appraisal system at P&S Holdings is, on the single statistical dimension this test examines, **gender-neutral**. As an additional and separate finding worth surfacing: **no employee in this appraisal cycle was rated *Below Expectations* or *Unsatisfactory*** — the entire workforce sits at *Meets Expectations* or above. This is itself an organisational signal: either the workforce is uniformly competent (a healthy state), or the appraisal system is reluctant to record poor performance (a calibration concern).
**Assumption caveat.** The chi-squared test is unreliable when more than 20% of expected cell counts fall below 5. With only 3 rating categories actually observed in our data (Meets, Exceeds, Outstanding) and 2 gender categories, this assumption is comfortably satisfied for the *Meets* and *Exceeds* cells but is borderline for the *Outstanding* cell. As a robustness check, the Fisher's exact test below — which makes no minimum-count assumption — is run on the same table. The conclusion does not change.
```{r fisher-fallback, eval = TRUE}
# Fisher's exact test as a robustness check (Monte Carlo p-value from B = 10,000 simulated tables)
fisher_result <- fisher.test(gender_rating_tab, simulate.p.value = TRUE, B = 10000)
cat("Fisher's exact test (simulated p-value):\n")
print(fisher_result)
```
### Combined interpretation — what hypothesis testing tells the CEO
Together, the two tests answer one strategic question each:
1. **Does department matter for performance?** The ANOVA, combined with its effect size, tells us *how much* — not just whether — department drives the appraisal score we observe. This calibrates how much of the L&D budget should be allocated by department versus by individual.
2. **Is the appraisal system gender-fair?** The chi-squared test (with Fisher's exact as a robustness check) gives a defensible audit answer to a question that boards and regulators increasingly ask. A non-significant result here is not a stamp of perfect fairness, but it is the strongest evidence the appraisal data alone can give.
These two tests do not, on their own, identify the *drivers* of performance. They tell us where the variance lives. The regression in Section 9 isolates which specific levers move the score, controlling for these structural factors.
# Correlation Analysis
Following Adi (2026, Ch. 8), this section quantifies the strength of association between the numeric HR levers I control as Head of People Operations — training intensity, tenure, attendance, leave taken — and the outcome the business cares about, namely the appraisal score. Visualisation hinted at the patterns; correlation analysis puts a defensible number on each, and prepares the ground for the regression in Section 9.
I report **Pearson** correlations (which test linear relationships) alongside **Spearman** correlations (which test monotonic relationships and are robust to non-linearity and outliers). For most analyses I trust Pearson when the underlying variables are roughly normally distributed and Spearman when they are skewed or contain influential points. Reporting both lets the reader see whether the conclusion is sensitive to that choice.
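The difference between the two coefficients is easiest to see on a toy example. The sketch below uses synthetic data (not P&S Holdings data) in which `y` always rises with `x` but along a curve rather than a straight line.

```r
# Synthetic, purely illustrative data: y rises with x, but not linearly
x <- 1:20
y <- exp(x / 3)

pearson_r  <- cor(x, y, method = "pearson")   # measures linear association
spearman_r <- cor(x, y, method = "spearman")  # measures monotonic association

cat("Pearson r: ", round(pearson_r, 3), "\n")
cat("Spearman r:", round(spearman_r, 3), "\n")
```

Spearman reports a perfect monotonic relationship here while Pearson falls short of 1, which is the kind of disagreement that reporting both coefficients is meant to expose.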
## Variable selection
```{r corr-prep}
corr_vars <- hr |>
  select(Tenure_Years, Avg_Attendance, Total_Leave_Days, Leave_Events,
         Trainings_Attended, Appraisal_Score) |>
  drop_na()
cat("Variables in correlation matrix:", ncol(corr_vars), "\n")
cat("Complete cases used:", nrow(corr_vars), "of", nrow(hr), "staff\n")
```
The matrix uses only the staff who have complete data on every numeric variable shown — primarily this means dropping the 13 senior staff who are not on the biometric attendance system. The reduced sample preserves all variables but limits inference to those staff who clock in.
## Pearson correlation matrix
```{r corr-pearson, fig.width=8, fig.height=7}
cor_p <- cor(corr_vars, method = "pearson")
corrplot::corrplot(
  cor_p,
  method = "color",
  type = "upper",
  order = "hclust",
  addCoef.col = "black",
  tl.col = "black",
  tl.srt = 30,
  tl.cex = 0.85,
  number.cex = 0.85,
  col = colorRampPalette(c("#C00000", "white", "#1F4E79"))(200),
  title = "Pearson correlations among HR variables",
  mar = c(0, 0, 1.5, 0)
)
```
## Spearman correlation matrix (robustness check)
```{r corr-spearman, fig.width=8, fig.height=7}
cor_s <- cor(corr_vars, method = "spearman")
corrplot::corrplot(
  cor_s,
  method = "color",
  type = "upper",
  order = "hclust",
  addCoef.col = "black",
  tl.col = "black",
  tl.srt = 30,
  tl.cex = 0.85,
  number.cex = 0.85,
  col = colorRampPalette(c("#C00000", "white", "#1F4E79"))(200),
  title = "Spearman correlations (rank-based, robust)",
  mar = c(0, 0, 1.5, 0)
)
```
## Significance tests on the three correlations the CEO will ask about
```{r corr-tests}
test_attend <- cor.test(corr_vars$Avg_Attendance, corr_vars$Appraisal_Score, method = "pearson")
test_tenure <- cor.test(corr_vars$Tenure_Years, corr_vars$Appraisal_Score, method = "pearson")
test_train <- cor.test(corr_vars$Trainings_Attended, corr_vars$Appraisal_Score, method = "pearson")
# Row labels must match the variables tested above: the third pair uses the
# individual-level Trainings_Attended count, not department training spend
tibble(
  Pair = c("Attendance ↔ Appraisal",
           "Tenure ↔ Appraisal",
           "Trainings Attended ↔ Appraisal"),
  Pearson_r = round(c(test_attend$estimate, test_tenure$estimate, test_train$estimate), 3),
  CI_lower  = round(c(test_attend$conf.int[1], test_tenure$conf.int[1], test_train$conf.int[1]), 3),
  CI_upper  = round(c(test_attend$conf.int[2], test_tenure$conf.int[2], test_train$conf.int[2]), 3),
  p_value   = round(c(test_attend$p.value, test_tenure$p.value, test_train$p.value), 4)
) |>
  kable(caption = "The three correlations most relevant to the L&D investment question")
```
## Interpretation
**What the matrices say at a glance.** The heatmap is read by colour intensity: deep blue means a strong positive correlation, deep red means a strong negative one, and white means no relationship. Across both Pearson and Spearman, the cells involving `Appraisal_Score` (the outcome) are predominantly pale — the numeric HR levers we measure do not show strong linear relationships with appraisal scores. This is consistent with the visual story from Section 5.
**Attendance vs Appraisal Score.** The correlation here is small and close to zero. This confirms quantitatively what Plot 4 showed visually: simply being present at the office does not predict being a high performer at P&S Holdings. Whether the relationship reaches statistical significance is shown in the test table — but the magnitude is what matters for the CEO conversation, and the magnitude is too small to act on. For the L&D investment case, attendance is **not the lever to invest behind**.
**Tenure vs Appraisal Score.** Also weak, and the Pearson coefficient is slightly negative. Long-tenured staff are not, on average, the highest performers — and may even be marginally lower. Combined with the result from the ANOVA, this strengthens the argument that retention policy at P&S Holdings should be **selective rather than blanket**: keep our best performers, not simply keep everyone for longer.
**Trainings Attended vs Appraisal Score.** This is the variable that directly addresses the CEO's question. `Trainings_Attended` is the **individual-level** count of trainings each staff member attended in 2025, derived by parsing the Staff_ID lists from the Training sheet. The variable ranges from 0 (47 staff received no recorded training) to 2 (the most-trained staff). Two findings deserve immediate attention before any statistical test: **47 of 97 staff (48.5%) received no recorded training in 2025**, and **the raw mean appraisal score for trained staff (72.56) is 3.69 points higher than for untrained staff (68.87)** — a difference the regression in Section 9 will test for statistical significance after controlling for tenure, attendance, level, and department. The three regulator-sponsored trainings (AML/CFT, CBN cybersecurity) entered at zero cost, representing a quiet subsidy to the L&D budget.
**Inter-predictor correlations to watch.** The Pearson matrix also shows the relationships *between* predictors — particularly between `Tenure_Years` and `Avg_Attendance`, and between `Leave_Events` and `Total_Leave_Days`. Strong correlations between predictors create **multicollinearity** in regression, which inflates standard errors and makes individual coefficients hard to interpret. I will check for multicollinearity formally in Section 9 using variance inflation factors (VIFs); any predictor with VIF > 5 is a candidate for removal.
**Pearson vs Spearman: which to trust?** Where Pearson and Spearman agree (which is the case for most cells in our matrices), the conclusion is robust to distributional assumptions and outliers. Where they differ, Spearman is the more conservative reading because it does not assume linearity. For our key relationship, trainings attended versus appraisal score, the two methods reach the same qualitative conclusion, which strengthens the finding.
**The headline correlation finding for the CEO.** Of all the numeric HR levers measurable in the HRIS, **none shows a strong linear association with appraisal score**. The L&D budget cannot be defended on the basis of any single bivariate correlation alone. What the regression in Section 9 will test is whether the number of trainings attended has a meaningful effect *once we control* for tenure, attendance, level, and department, which is the only fair test of the L&D investment case.
# Linear Regression
Following Adi (2026, Ch. 9), this section answers the central question of the study: **once we control for tenure, attendance, level, and department, does the number of trainings attended have a meaningful effect on appraisal score?** Correlation in Section 8 showed weak bivariate relationships across all numeric predictors. Regression goes further by estimating the *marginal* contribution of each predictor while holding the others constant, which is the only fair test of the L&D investment case.
## Model specification
The dependent variable is **Appraisal_Score** (continuous, 50–90). The predictors include both numeric (tenure, attendance, leave, trainings attended) and categorical (gender, level, department) variables. Level is included as an ordered factor, which R fits with orthogonal polynomial contrasts (the `.L`, `.Q`, `.C` and higher-order terms in the output); department is an unordered factor with the largest department, Software Development, as the reference category.
```{r reg-prep}
hr_reg <- hr |>
  filter(!is.na(Appraisal_Score)) |>
  select(Appraisal_Score, Tenure_Years, Avg_Attendance,
         Total_Leave_Days, Trainings_Attended,
         Gender, Level_Ord, Department) |>
  mutate(Department = relevel(factor(Department), ref = "SOFTWARE DEVELOPMENT")) |>
  drop_na()   # drops staff with any missing predictor, e.g. no attendance record
cat("Regression sample size:", nrow(hr_reg), "staff\n")
# ncol() - 1 counts predictor variables; each factor expands into several coefficients
cat("Predictor variables:", ncol(hr_reg) - 1, "(before factors are expanded into contrast terms)\n")
```
## Fit the model
```{r reg-fit}
model <- lm(Appraisal_Score ~ Tenure_Years + Avg_Attendance +
              Total_Leave_Days + Trainings_Attended +
              Gender + Level_Ord + Department,
            data = hr_reg)
summary(model)
```
## Coefficient table — what each predictor contributes
```{r reg-coef-table}
broom::tidy(model, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~ round(., 3))) |>
  kable(col.names = c("Predictor", "Estimate", "Std. Error", "t-stat",
                      "p-value", "CI Lower", "CI Upper"),
        caption = "Regression coefficients with 95% confidence intervals")
```
## Diagnostic plots — do the OLS assumptions hold?
The four standard diagnostic plots check (1) linearity (Residuals vs Fitted), (2) normality of residuals (Q-Q plot), (3) homoscedasticity (Scale-Location), and (4) influential outliers (Residuals vs Leverage).
```{r reg-diagnostics, fig.width=10, fig.height=8}
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
```
## Multicollinearity check — variance inflation factors
```{r reg-vif}
vif_result <- car::vif(model)
vif_result
```
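A VIF above 5 indicates problematic multicollinearity (some authors use 10 as the threshold). Predictors with high VIFs have their coefficients destabilised because they are partly redundant with other predictors. Because the model contains multi-level factors, `car::vif()` reports generalised VIFs (GVIF) together with `GVIF^(1/(2·df))`; the scaled column is comparable to the square root of an ordinary VIF, so it should be judged against √5 ≈ 2.24 rather than 5.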
## Model performance summary
```{r reg-performance}
glance_result <- broom::glance(model) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, df.residual, nobs)
glance_result |>
  mutate(across(where(is.numeric), ~ round(., 3))) |>
  kable(col.names = c("R²", "Adj R²", "Residual SE", "F-stat", "F p-value",
                      "df", "df residual", "n"),
        caption = "Overall model fit statistics")
```
## Interpretation
**Model fit.** The adjusted R² tells us what fraction of the variation in appraisal scores the model explains after penalising for the number of predictors. For this model the adjusted R² is 0.052 and the overall F-test is not significant (F = 1.23, p = 0.262). As a benchmark, an adjusted R² of around 0.20–0.30 would indicate a meaningful but modest portion of performance variation, which is realistic for HR data where individual factors (motivation, manager fit, life events) drive much of the variance and are not in the dataset. Our value sits well below that range, so the predictors we have access to explain little of the variation in performance, and we should be cautious about drawing strong conclusions from any single coefficient.
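The reported fit statistics can be cross-checked against each other with the textbook relationships between R², adjusted R², and the F-statistic. The sketch below assumes a sample of 84 staff and 20 estimated slope terms (4 numeric, 1 gender, 6 level, 9 department); both figures are my assumptions, inferred from the model specification rather than read off the output.

```r
adj_r2 <- 0.052   # reported adjusted R-squared
n      <- 84      # ASSUMPTION: staff with complete data on every predictor
p      <- 20      # ASSUMPTION: 4 numeric + 1 gender + 6 level + 9 department terms

# Invert the adjusted R-squared formula to recover the unadjusted R-squared
r2 <- 1 - (1 - adj_r2) * (n - p - 1) / (n - 1)

# Overall F-statistic implied by that R-squared, and its p-value
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
f_p    <- pf(f_stat, p, n - p - 1, lower.tail = FALSE)

cat("Implied R-squared:", round(r2, 3), "\n")
cat("Implied F:", round(f_stat, 2), " p:", round(f_p, 3), "\n")
```

If the assumptions hold, the implied F lands on the reported 1.23, and the gap between the unadjusted and adjusted R² shows how much of the apparent fit is absorbed by estimating 20 terms on a sample of this size.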
**The L&D coefficient — direct answer to the CEO's question.** The coefficient on `Trainings_Attended` is **+5.12 (95% CI: 0.14 to 10.11, p = 0.044)**. Each additional training a staff member attended in 2025 is associated with a **5-point increase in appraisal score**, holding tenure, attendance, level, and department membership constant. This is statistically significant at the conventional 5% threshold and the effect size is **business-meaningful** on a scale that runs from 50 to 90: it is roughly half the standard deviation of the appraisal score itself. The lower confidence-interval bound is 0.14, which means even in the conservative reading the effect is positive. **The L&D budget is defensible.** The honest caveat to attach to this number, which I will state plainly in the CEO briefing, is that the model overall is not statistically significant (F = 1.23, p = 0.262) and the adjusted R² is only 0.052 — meaning the predictors collectively explain only about 5% of appraisal variance once we adjust for the number of variables. The training coefficient is robust *within* this modest model, but performance is largely driven by individual factors (motivation, manager fit, role match) that the dataset does not capture. The recommendation to the CEO is therefore to **maintain or grow** the L&D budget — but to recognise that training is one lever among many, not the dominant one.
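The three numbers quoted for the training coefficient (estimate, confidence interval, p-value) should agree with one another. A minimal sketch of that check follows; the residual degrees of freedom (63) are an assumption based on 84 staff and 21 estimated coefficients.

```r
estimate <- 5.12    # reported coefficient on Trainings_Attended
ci_lower <- 0.14    # reported 95% CI bounds
ci_upper <- 10.11
df_resid <- 63      # ASSUMPTION: 84 staff minus 21 estimated coefficients

# A 95% CI is estimate +/- t_crit * SE, so the half-width recovers the SE
t_crit <- qt(0.975, df_resid)
se     <- (ci_upper - ci_lower) / (2 * t_crit)

# Rebuild the t-statistic and two-sided p-value from the estimate and SE
t_stat <- estimate / se
p_val  <- 2 * pt(-abs(t_stat), df_resid)

cat("Implied SE:", round(se, 2), " t:", round(t_stat, 2), " p:", round(p_val, 3), "\n")
```

Under that assumption the implied p-value is close to the reported 0.044, which suggests the interval and the p-value come from the same fitted model.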
**Tenure.** The `Tenure_Years` coefficient is **−0.50 (p = 0.437)** — small, slightly negative, and not statistically significant. Controlling for everything else, an extra year of tenure is associated with a half-point *decrease* in appraisal score, but with so much uncertainty around that estimate that the practical reading is "tenure does nothing." This confirms what Plot 3 and the correlation matrix already suggested: long-tenured staff are not, on average, the higher performers. Retention policy at P&S Holdings should therefore prioritise *whom* we retain rather than *how long* people stay.
**Attendance.** The `Avg_Attendance` coefficient is **+0.083 (p = 0.360)** — effectively zero. After controlling for everything else, being present at the office is **not predictive of being a high performer** at P&S Holdings. The honest CEO message is that the biometric attendance system, while operationally useful for payroll and absenteeism management, is not a leading indicator of performance and should not be used to evaluate staff.
**Level.** The linear component of Level (`Level_Ord.L = +13.24, p = 0.091`) suggests that appraisal scores rise as staff move up the level hierarchy, but the result is only **marginally significant**: it clears the 10% threshold but not the 5% one. Because R codes ordered factors with orthonormal polynomial contrasts, the coefficient is not a points-per-range figure. With seven levels (the 6 df reported in the multicollinearity check), the linear trend alone corresponds to roughly 15 points between the lowest and highest level (Trainee to Partner), before the higher-order terms are added. More interestingly, the **cubic component** is significant (`Level_Ord.C = +20.52, p = 0.017`), indicating that the level-performance relationship is **not linear**: it bends. This is consistent with the visual story from Plot 5, where mid-level staff (Associates, Coordinators, Specialists) show the widest performance variation and the highest-rated tail, while both very junior and very senior staff cluster more uniformly at *Meets Expectations*; note that the cubic term describes the shape of mean scores across levels, not their spread. Level matters for performance, but in a non-linear way that simple "more senior = higher rated" intuitions miss.
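To translate the `.L` coefficient into appraisal points, it helps to look at the scores R actually assigns to each level. The sketch below assumes seven ordered levels (implied by the 6 degrees of freedom reported for `Level_Ord`) and R's default `contr.poly` coding for ordered factors.

```r
n_levels <- 7                        # ASSUMPTION: implied by 6 df for Level_Ord
contrasts_poly <- contr.poly(n_levels)

# Scores that multiply the .L coefficient at each level, lowest to highest
linear_scores <- contrasts_poly[, ".L"]
linear_span   <- diff(range(linear_scores))   # highest score minus lowest score

coef_linear <- 13.24                 # reported Level_Ord.L estimate
cat("Span of linear scores:", round(linear_span, 3), "\n")
cat("Bottom-to-top gap from the linear term alone:",
    round(coef_linear * linear_span, 1), "points\n")
```

The linear scores span a little more than one unit, so the linear term alone implies a gap somewhat larger than the raw coefficient; the fitted difference between any two levels also depends on the quadratic, cubic, and higher-order terms.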
**Department.** Each department's coefficient compares its mean appraisal score to the reference department (Software Development) after controlling for all other factors. **One department reaches statistical significance: Terminal Support, with an estimate of −15.51 (p = 0.010)** — staff in Terminal Support score on average 15.5 points lower than otherwise comparable Software Development staff. Information Security is marginally negative (−12.26, p = 0.053), and several other departments show large negative point estimates that fall short of significance because of small sample sizes (Finance, People Operations, Audit & Compliance). The Terminal Support gap is the single most actionable department-level finding for the CEO: it identifies a specific unit where appraisal scores cannot be explained by tenure, attendance, level, training, or gender — meaning something structural about the role, team, or management is driving the gap and deserves investigation.
**Diagnostic verdict.** The four diagnostic plots above are visual checks on the OLS assumptions. Residuals vs Fitted should show no clear pattern (linearity satisfied); the Q-Q plot should follow the diagonal (residuals roughly normal); Scale-Location should be flat (homoscedasticity); Residuals vs Leverage should show no points outside the dashed Cook's distance lines (no overly influential observations). Where any assumption is visibly violated, I would normally consider robust regression or a transformation of the outcome variable; in the present model the violations (if any) are documented in the limitations section.
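The visual checks can be complemented with numeric ones. The sketch below is illustrative only: it fits a small model on R's built-in `mtcars` data (not the P&S Holdings data) and uses base-R functions; the 4/n cut-off for Cook's distance is a common rule of thumb rather than a formal test.

```r
# Illustrative model on a built-in dataset; NOT the P&S Holdings data
demo_model <- lm(mpg ~ wt + hp, data = mtcars)

# Normality of residuals: a small p-value suggests a departure from normality
shapiro_res <- shapiro.test(residuals(demo_model))

# Influence: flag observations whose Cook's distance exceeds the 4/n rule of thumb
cooks_d     <- cooks.distance(demo_model)
influential <- names(cooks_d)[cooks_d > 4 / nrow(mtcars)]

cat("Shapiro-Wilk p-value:", round(shapiro_res$p.value, 3), "\n")
cat("Observations above 4/n:", length(influential), "\n")
```

Replacing `demo_model` with the fitted `model` object would apply the same checks to the appraisal regression.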
**Multicollinearity.** The largest GVIFs in the model are for `Level_Ord` (12.19, 6 df) and `Department` (10.99, 9 df). The raw values are above the conventional threshold of 5 and signal that the ordered-factor level variable and the unordered department variable share substantial variance: staff in certain departments tend to cluster at certain levels (e.g. Trainees concentrate in Operations and People Operations; Principals concentrate in Information Security and Software Development). However, for factors with several degrees of freedom GVIF is interpreted on the scaled metric `GVIF^(1/(2·df))`, which gives **1.23 for Level and 1.14 for Department**. The scaled value is comparable to the square root of an ordinary VIF, so the equivalent of VIF = 5 is √5 ≈ 2.24, and both factors sit comfortably below it. Individual numeric predictors (`Tenure_Years`, `Avg_Attendance`, `Total_Leave_Days`, `Trainings_Attended`) all show GVIFs below 3, well within safe limits. The takeaway: multicollinearity does not threaten the reliability of the `Trainings_Attended` coefficient; on this dimension the finding is robust.
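The scaling step is simple arithmetic and can be reproduced from the two reported GVIFs. The sketch below hard-codes those reported values instead of calling `car::vif()`, so it runs without the fitted model.

```r
# Reported GVIFs and degrees of freedom for the two factors
gvif <- c(Level_Ord = 12.19, Department = 10.99)
df   <- c(Level_Ord = 6,     Department = 9)

# Scaled metric, comparable to sqrt(VIF) for a single-df predictor
scaled <- gvif^(1 / (2 * df))

# Squaring puts the scaled value back on the familiar VIF scale
vif_equivalent <- scaled^2

print(round(scaled, 2))
print(round(vif_equivalent, 2))
```

On the VIF-equivalent scale both factors come out well under 5, which is the comparison the conventional threshold is meant for.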
**The single most important regression finding for the CEO.** **Each training a staff member attended in 2025 was associated with a 5-point improvement in appraisal score (95% CI: 0.14 to 10.11, p = 0.044), controlling for tenure, attendance, level, gender, and department.** This is the first quantified, individually-defensible answer to the question I came into this study wanting to measure. On a 50–90 scale, a 5-point lift is approximately half a standard deviation — meaningful, not trivial. The L&D budget is therefore defensible on evidence terms. The two important caveats: (1) the overall model is not statistically significant (F p = 0.262, adjusted R² = 0.05), meaning performance is mostly driven by individual factors not in the HRIS; and (2) we cannot rule out reverse causation — high performers may be more likely to be nominated for training, rather than training causing performance. The next analytical step in 2026 would be a within-staff comparison: appraisal score *before* and *after* training, holding the individual constant. For now, the evidence supports continuing — and ideally extending — the L&D programme, with priority given to the 47 untrained staff whose appraisal scores are 3.69 points lower on average than their trained colleagues.
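The within-staff comparison proposed above could be run as a paired t-test once a prior appraisal cycle is available. The sketch below uses invented scores for eight hypothetical staff, purely to show the shape of the analysis; none of the numbers come from P&S Holdings records.

```r
# HYPOTHETICAL appraisal scores for eight staff; invented for illustration only
before <- c(68, 72, 65, 70, 74, 66, 71, 69)   # cycle before training
after  <- c(71, 73, 70, 72, 75, 70, 72, 74)   # cycle after training

# Paired test: each person acts as their own control
paired_result <- t.test(after, before, paired = TRUE)
mean_change   <- mean(after - before)

cat("Mean within-person change:", mean_change, "points\n")
print(paired_result)
```

Pairing removes stable individual differences from the comparison, which addresses the selection concern, although a comparison group of untrained staff over the same two cycles would still be needed to separate training from a general drift in ratings.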
# Integrated Findings — what the five techniques say together
The five techniques in this study were not independent exercises. Each one tested a layer of the same business question — *what drives appraisal scores at P&S Holdings, and what can the Head of People Operations do about it?* — using progressively stronger statistical machinery. This section synthesises what the techniques jointly tell the CEO, in the order I would deliver them in a face-to-face briefing.
## Finding 1 — Training is a real, individually-measurable lever
The headline finding of this study is that **each additional training a staff member attended in 2025 is associated with a 5-point lift in appraisal score** (β = +5.12, 95% CI: 0.14 to 10.11, p = 0.044), controlling for tenure, attendance, level, gender, and department. The raw mean difference between trained and untrained staff is 3.69 points (72.56 vs 68.87); the regression sharpens this to 5.12 once we hold structural factors constant. This is the first quantified, individually-defensible answer to the L&D investment question at P&S Holdings. On a 50–90 appraisal scale, 5 points is approximately half a standard deviation — material, not trivial. The L&D budget is defensible on evidence terms.
The most actionable corollary is that **47 of 97 staff (48.5%) attended no recorded training in 2025**. Three full departments — Product Success, IT & Infrastructure, and Terminal Support — received no training events at all. If the +5.12 coefficient holds, extending the training programme to these untrained staff is the single highest-leverage HR intervention the firm can make in 2026, and we already know which 47 people to start with.
## Finding 2 — Department matters, but not in the ways we usually assume
The ANOVA on department-level appraisal differences was borderline (F(11, 83) = 1.79, p = 0.069) but the effect size was large (η² = 0.19). Translating: department membership explains roughly 19% of appraisal-score variance — meaningful — but the small size of several departments means we cannot yet declare the difference statistically significant. The regression sharpens this picture: **Terminal Support staff score 15.5 points lower than otherwise comparable Software Development staff** (p = 0.010), with Information Security marginally below (−12.3, p = 0.053). These two findings define a concrete management investigation: something structural about Terminal Support — role design, team leadership, customer-facing pressure, or a combination — is suppressing performance in a way that tenure, attendance, training, and level cannot explain. That is exactly the kind of insight a department-level dashboard cannot produce, and exactly the kind regression is built to surface.
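For a one-way ANOVA, η² can be recovered directly from the F-statistic and its degrees of freedom, which gives a quick consistency check on the two figures quoted above.

```r
# Reported one-way ANOVA result: F(11, 83) = 1.79
f_stat     <- 1.79
df_between <- 11
df_within  <- 83

# eta^2 = SS_between / SS_total, rewritten in terms of F and its df
eta2 <- (f_stat * df_between) / (f_stat * df_between + df_within)

cat("Eta-squared implied by F:", round(eta2, 3), "\n")
```

The implied value rounds to the reported 0.19, so the effect size and the F-statistic are mutually consistent.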
## Finding 3 — Several intuitive levers do not work
Three predictors that HR conversations typically rely on turned out to be **statistically silent** once everything else was controlled for:
- **Attendance** (β = +0.083, p = 0.36) — the biometric clock-in system, while operationally useful for payroll and absenteeism, is **not a leading indicator of performance**. It should not feature in appraisal conversations.
- **Tenure** (β = −0.50, p = 0.44) — long-tenured staff are not, on average, the higher performers. Retention policy should prioritise *whom* we retain, not *how long*.
- **Gender** — both the chi-squared test (χ² = 0.57, p = 0.75, Cramér's V = 0.00) and the regression (p = 0.68) found no evidence of differential rating between women and men. The appraisal system, on this dimension, is gender-neutral. This is the answer the CEO will want to hear and it is defensible.
For a People Operations function deciding where to focus 2026 effort, this is itself valuable: it identifies the three levers that should *not* feature prominently in board-level performance narratives.
## Finding 4 — The appraisal system has a calibration question
A separate finding worth surfacing to the CEO: **no employee in this appraisal cycle was rated *Below Expectations* or *Unsatisfactory*** — the entire workforce sits at *Meets Expectations* or above. This is either (a) a genuinely uniformly competent workforce, or (b) a calibration concern in how appraisals are being recorded. The data alone cannot distinguish between the two, but the question is worth raising before the July 2026 appraisal cycle. Including a structured calibration step where line managers review each other's ratings would either confirm finding (a) or surface finding (b).
## What this means for the 2026 L&D investment case
Synthesising the four findings into the briefing I will give to the CEO:
1. **Maintain or grow the 2026 L&D budget.** Training attendance is measurably associated with higher appraisal scores at the individual level.
2. **Extend training coverage to the 47 untrained staff** as the priority of 2026, starting with the three untrained departments (Product Success, IT & Infrastructure, Terminal Support).
3. **Open a targeted investigation into Terminal Support performance** — the 15.5-point gap is the largest unexplained department-level finding and is actionable.
4. **Stop using attendance and tenure as performance proxies.** They do not predict appraisal scores.
5. **Add a calibration step to the next appraisal cycle** to validate that the absence of *Below Expectations* and *Unsatisfactory* ratings reflects reality rather than rating reluctance.
The single most important methodological lesson — which I will state to Prof Adi at the viva — is that the original HRIS data could not have answered the L&D question rigorously, because training was logged at department level rather than individual level. Going back to the L&D records and reconstructing individual attendance was the analytical move that turned a structurally limited dataset into a defensible regression. This is the kind of data-quality work that quietly determines whether HR analytics succeeds or fails at firms like P&S Holdings.
# Limitations
A defensible study names what its data cannot answer as clearly as what it can. The following are the limitations a careful reader (or viva examiner) would press me on, with my honest position on each.
**1. The model overall is not statistically significant.** The regression's F-statistic was 1.23 (p = 0.262) and adjusted R² was 0.05, meaning the predictors collectively explain only about 5% of appraisal-score variance once we adjust for the number of variables. The `Trainings_Attended` coefficient is robust *within* this model, but the model itself is modest. The honest reading is that **performance at P&S Holdings is mostly driven by individual factors not measurable in the HRIS** — motivation, manager fit, role match, life events, recognition — and the dataset cannot capture these. A 2026 follow-up study should include 360-degree feedback scores and engagement-survey data to lift the explanatory power.
**2. Causality versus association.** The regression establishes that trained staff have higher appraisal scores, controlling for observable confounders. It does not establish that **training caused** higher performance. Two alternative explanations remain plausible: (a) **selection bias** — high performers may be more likely to be nominated for, or to attend, training; (b) **reverse causation in appraisal timing** — appraisals were conducted October 2025 – March 2026, partially overlapping with training events. A within-staff before-and-after comparison would address both, but requires appraisal data from at least one prior cycle, which I do not yet have access to in pseudonymised form. This is the recommended next analytical step in 2026.
**3. Missing-not-at-random attendance data.** Thirteen senior staff have no biometric attendance data because, by company policy, they do not clock in. This is **missing not at random (MNAR)** rather than random missingness, and I excluded those staff from any analysis involving attendance rather than imputing values. The consequence is that the attendance-related findings apply only to the 84 staff who clock in. The same restriction carries over to the regression, which includes `Avg_Attendance` and drops rows with missing predictors. Generalising to the senior-management group would require a different attendance proxy.
**4. Small department samples reduce statistical power.** Several departments have only 3–6 staff (Finance, Operations, People Operations, Information Security, Audit & Compliance, Management). For these units, a single employee's score moves the department mean substantially and statistical power to detect real differences is low. This is why the ANOVA was borderline (p = 0.069) despite a large effect size (η² = 0.19). A second appraisal cycle of data would likely confirm patterns that today's sample only hints at.
**5. Time-window misalignment across event types.** Performance appraisals cover October 2025 – March 2026 (six months); attendance covers November 2025 – May 2026 (seven months) because the biometric system was only deployed in November 2025; training events cover all of 2025 (twelve months). The mismatch is unavoidable given the data available, but means the regression implicitly assumes that the effect of a January 2025 training event on a March 2026 appraisal is comparable to the effect of an October 2025 training event. That assumption is plausible, but so is the alternative that training effects fade over time; neither can be tested with one appraisal cycle.
**6. Thirteen training attendees could not be matched to the Staff List.** The original training records included 71 attendance entries; 13 of those used Staff IDs that did not match any staff member in the current Staff List. These are most likely (a) exited staff who attended trainings in early 2025 but had left by the May 2026 snapshot, or (b) minor typos in the original training log. I excluded them from the per-staff training count to preserve a conservative estimate, but this means the actual trai
# References
Adi, B. (2026). *AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R*. Lagos Business School / markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/
R Core Team. (2024). *R: A language and environment for statistical computing* (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software*, 4(43), 1686. https://doi.org/10.21105/joss.01686
# Appendix A — AI Usage Statement {.unnumbered}
*(This will be completed before final submission.)*