Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as smPayroll Data Analysis — Federal University Gusau, December 2024
Federal University Gusau’s payroll represents one of its largest recurring expenditures. This report analyses the December 2024 payroll dataset — comprising staff compensation, statutory deductions, and loan obligations — using five quantitative techniques: descriptive statistics, correlation analysis, segmentation analysis, hypothesis testing, and multiple linear regression.
Key findings reveal that net pay distributions are highly skewed by voluntary loan deductions rather than by departmental affiliation. A strong negative correlation (r ≈ −0.85) exists between loan repayments and net take-home pay, suggesting that a significant proportion of staff face substantial debt burdens. Contrary to expectation, there is no statistically significant difference in net pay between the Sociology and Chemistry departments (p = 0.817), confirming that rank and seniority — not academic discipline — drive compensation. The regression model explains approximately 71% of net pay variance, with Basic Salary as the dominant driver and loans exerting an outsized negative impact.
The primary recommendation is that the university introduce a financial wellness programme and enforce a maximum debt-to-income threshold of 33.3%, protecting employee welfare while reducing payroll administration risk.
Job Title: Chartered Accountant
Organisation Type/Sector: Higher Education Institution (Public Sector) — Federal University Gusau, Nigeria
Descriptive Statistics: As a chartered accountant, descriptive statistics form the bedrock of any financial review. Summarising payroll distributions — including means, medians, and standard deviations — allows me to identify anomalies, benchmark against prior periods, and flag figures that fall outside acceptable ranges during an internal audit or management account preparation.
Correlation Analysis: Understanding the relationships between financial variables is central to risk-based auditing. Identifying that loan deductions are strongly negatively correlated with net pay, for instance, helps quantify the organisation’s exposure to employee liquidity risk and informs decisions around debt policy — a key consideration when advising management on staff welfare compliance.
Segmentation Analysis (Grouped Visualisation): In financial reporting, disaggregating payroll by rank and department enables management to conduct cost-centre analysis and budget variance reviews. This technique directly supports the preparation of departmental budget reports and assists in identifying whether compensation costs are concentrated in specific units.
Hypothesis Testing: Statistical inference supports evidence-based decision-making in financial governance. Testing whether two departments differ significantly in their payroll impact enables finance managers to apply equitable budget allocation without relying on assumption. This is directly applicable to the University’s budget committee reports and financial planning cycles.
Multiple Linear Regression: Regression modelling enables predictive financial analysis. By identifying the drivers of net pay, the finance function can model the future payroll liability under different staffing or policy scenarios — for example, forecasting the payroll impact of a promotion round or a change in loan ceiling policy. This supports treasury management and cash flow forecasting.
The dataset used in this analysis is the December 2024 Payroll Register of Federal University Gusau (FUGUS), Nigeria. The data was obtained directly through my professional role in the University’s financial reporting function, where I have authorised access to payroll records as part of the internal audit and management accounts preparation process.
The payroll register is compiled monthly by the Bursary Department using the University’s human resources and payroll management system. Staff records — including rank, department, basic salary, gross pay, deductions, and net pay — are generated automatically from the HR system and validated by the payroll unit before disbursement. The dataset was extracted in Microsoft Excel format (.xlsx) from the financial reporting archive.
The sampling frame consists of all staff on the University’s payroll as at December 2024, including academic, non-academic, and support staff. No probabilistic sampling was applied; this analysis uses the full population of payroll records for that period, making it a census rather than a sample. This approach is appropriate given that the complete dataset was accessible and the objective is a full audit-grade review rather than inference to a wider population.
The dataset contains records for all staff on the December 2024 payroll register. The exact count is determined during data loading (see Section 4); initial inspection suggests several hundred records spanning multiple departments and ranks.
The data covers a single payroll cycle: December 2024. Longitudinal comparisons across months are outside the scope of this analysis but are noted as an area for further work (see Section 11).
The data was accessed in my capacity as a financial reporting professional at Federal University Gusau and is used strictly for academic purposes under the requirements of the LBS Data Analysis examination. All records are handled in accordance with Nigerian data protection obligations. No personally identifiable information (such as staff names or bank account details) is included in this analysis; all outputs are reported at the aggregate or categorical level (rank, department). The analysis does not expose individual staff compensation to external parties.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm# Load the payroll register
df = pd.read_excel(r"C:\Users\Lucky.ai\OneDrive\Documents\FINAL PAYROLL DECEMBER 2024.xlsx")
# --- DATA CLEANING ---
# Fill missing pension PINs with a placeholder to maintain data integrity
df['PENSION PIN'] = df['PENSION PIN'].fillna('NOT REGISTERED')
# Coerce financial columns to numeric, replacing unparseable entries with 0
numeric_cols = ['BASIC', 'TOTAL GROSS', 'LOANS', 'PAYE', 'TOTAL DEDUCTION', 'NET PAY']
for col in numeric_cols:
df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
print(f"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\nColumn names and data types:\n{df.dtypes}")Dataset shape: 1223 rows × 26 columns
Column names and data types:
STAFF NO str
IPPIS NO str
NAME str
RANK str
DEPARTMENT str
GRADE/STEP str
BANK str
ACCOUNT NO int64
PENSION ADMINISTRATOR str
PENSION PIN str
BASIC int64
HARZARD ALLOWANCE int64
SHIFT int64
TEACHING ALLOWANCE float64
CALLDUTY int64
SPECIAL DUTY float64
OTHER ALLOWANCES int64
TOTAL GROSS int64
LOANS float64
NHF int64
PENSION int64
UNION NAME str
UNION DUES int64
PAYE int64
TOTAL DEDUCTION int64
NET PAY int64
dtype: object
# Summary statistics for all financial variables
print("=== Summary Statistics ===")
df[numeric_cols].describe().round(2)=== Summary Statistics ===
| BASIC | TOTAL GROSS | LOANS | PAYE | TOTAL DEDUCTION | NET PAY | |
|---|---|---|---|---|---|---|
| count | 1223.00 | 1223.00 | 1223.00 | 1223.00 | 1223.00 | 1223.00 |
| mean | 222693.83 | 258178.79 | 27568.47 | 19037.33 | 72566.86 | 185611.85 |
| std | 119779.63 | 171783.28 | 30203.26 | 16637.76 | 46971.77 | 148461.85 |
| min | 54174.00 | 54174.00 | 0.00 | 0.00 | 0.00 | 25165.00 |
| 25% | 146027.50 | 167676.00 | 5000.00 | 8298.00 | 40137.00 | 103515.00 |
| 50% | 218817.00 | 234898.00 | 16458.33 | 16735.00 | 60763.00 | 165686.00 |
| 75% | 252018.00 | 294231.00 | 40747.67 | 21500.00 | 95598.50 | 203722.50 |
| max | 1289126.00 | 1905632.00 | 165775.00 | 185434.00 | 350568.00 | 1718631.00 |
# Missing value audit
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_report = pd.DataFrame({'Missing Count': missing, 'Missing %': missing_pct})
print("=== Missing Values ===")
print(missing_report[missing_report['Missing Count'] > 0])
unregistered_pension = (df['PENSION PIN'] == 'NOT REGISTERED').sum()
print(f"\nStaff without registered Pension PIN: {unregistered_pension}")=== Missing Values ===
Empty DataFrame
Columns: [Missing Count, Missing %]
Index: []
Staff without registered Pension PIN: 120
| Variable | Type | Description |
|---|---|---|
RANK |
Categorical | Job rank/grade of the staff member |
DEPARTMENT |
Categorical | Department to which the staff member is assigned |
PENSION PIN |
Categorical | Pension Fund Administrator registration number |
BASIC |
Numeric (NGN) | Basic salary before allowances |
TOTAL GROSS |
Numeric (NGN) | Total gross pay including all allowances |
LOANS |
Numeric (NGN) | Monthly loan deduction from gross pay |
PAYE |
Numeric (NGN) | Pay As You Earn income tax deduction |
TOTAL DEDUCTION |
Numeric (NGN) | Sum of all deductions |
NET PAY |
Numeric (NGN) | Take-home pay after all deductions |
fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()
for i, col in enumerate(numeric_cols):
axes[i].hist(df[col], bins=30, color='steelblue', edgecolor='white')
axes[i].set_title(f'Distribution of {col}')
axes[i].set_xlabel('NGN')
axes[i].set_ylabel('Frequency')
plt.tight_layout()
plt.show()
plt.close()The distributions show that NET PAY, TOTAL GROSS, and BASIC are right-skewed, consistent with payroll data where a minority of senior staff earn significantly above the median. LOANS is notably bimodal — a large number of staff have zero loans, while those who do borrow carry substantial obligations.
Descriptive statistics summarise the central tendency (mean, median), dispersion (standard deviation, range, interquartile range), and shape (skewness, kurtosis) of a dataset. They provide the essential foundation for any quantitative analysis by establishing what is “typical” and what is unusual.
In a payroll audit context, descriptive statistics serve as the first line of anomaly detection. Unusually high means relative to medians indicate right-skew driven by high earners, while high standard deviations flag internal equity concerns. For a finance officer at FUGUS, these metrics are directly relevant to monthly payroll certification and management reporting.
desc = df[numeric_cols].describe().T
desc['skewness'] = df[numeric_cols].skew().round(3)
desc['kurtosis'] = df[numeric_cols].kurt().round(3)
desc.round(2)| count | mean | std | min | 25% | 50% | 75% | max | skewness | kurtosis | |
|---|---|---|---|---|---|---|---|---|---|---|
| BASIC | 1223.0 | 222693.83 | 119779.63 | 54174.0 | 146027.5 | 218817.00 | 252018.00 | 1289126.0 | 2.01 | 8.37 |
| TOTAL GROSS | 1223.0 | 258178.79 | 171783.28 | 54174.0 | 167676.0 | 234898.00 | 294231.00 | 1905632.0 | 3.44 | 20.17 |
| LOANS | 1223.0 | 27568.47 | 30203.26 | 0.0 | 5000.0 | 16458.33 | 40747.67 | 165775.0 | 1.60 | 2.72 |
| PAYE | 1223.0 | 19037.33 | 16637.76 | 0.0 | 8298.0 | 16735.00 | 21500.00 | 185434.0 | 3.08 | 16.18 |
| TOTAL DEDUCTION | 1223.0 | 72566.86 | 46971.77 | 0.0 | 40137.0 | 60763.00 | 95598.50 | 350568.0 | 1.53 | 3.37 |
| NET PAY | 1223.0 | 185611.85 | 148461.85 | 25165.0 | 103515.0 | 165686.00 | 203722.50 | 1718631.0 | 4.04 | 24.74 |
plt.figure(figsize=(9, 5))
plt.hist(df['NET PAY'], bins=30, color='steelblue', edgecolor='white')
plt.axvline(df['NET PAY'].mean(), color='red', linestyle='--', label=f"Mean: ₦{df['NET PAY'].mean():,.0f}")
plt.axvline(df['NET PAY'].median(), color='orange', linestyle='--', label=f"Median: ₦{df['NET PAY'].median():,.0f}")
plt.title('Distribution of Net Pay (December 2024)')
plt.xlabel('Net Pay (NGN)')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.show()
plt.close()The net pay distribution is right-skewed, with the mean exceeding the median. This indicates that a relatively small number of senior or high-ranking staff pull the average upward. For a non-technical manager: most staff earn below the average net pay figure reported in summary tables — the “typical” take-home is better represented by the median. The standard deviation in net pay also indicates wide dispersion, reflecting the diversity of ranks and loan burdens across the workforce.
Pearson’s correlation coefficient (r) measures the linear association between two continuous variables, ranging from −1 (perfect negative relationship) to +1 (perfect positive relationship). A correlation heatmap visualises all pairwise correlations simultaneously.
For the University’s finance team, understanding which variables move together is essential for financial modelling and risk identification. A strong negative correlation between loans and net pay, for example, has direct implications for staff welfare policy and payroll risk assessment.
corr_matrix = df[numeric_cols].corr()
plt.figure(figsize=(9, 7))
sns.heatmap(
corr_matrix,
annot=True,
cmap='coolwarm',
fmt=".2f",
linewidths=0.5,
vmin=-1, vmax=1
)
plt.title('Correlation Heatmap of Financial Variables')
plt.tight_layout()
plt.show()
plt.close()The heatmap reveals several important relationships. BASIC and TOTAL GROSS are very highly correlated, which is expected as Basic Salary is the primary component of Gross Pay. LOANS shows a strong negative correlation with NET PAY, confirming that loan repayments are the single largest driver of reduced take-home pay. PAYE is positively correlated with NET PAY — not because taxes increase pay, but because both are proxies for seniority: higher earners pay more tax and, on net, still take home more. For a manager: loan repayments are depressing take-home pay more than taxes are.
Segmentation analysis groups observations by a categorical variable and compares the distribution of a numeric outcome across groups. Box plots are particularly effective for this, displaying the median, interquartile range, and outliers for each group simultaneously.
Understanding pay variation by rank and department helps the finance function identify cost concentration, assess whether the salary structure is equitable, and prioritise where budget interventions will have the greatest effect. This analysis directly supports departmental budget reporting and workforce cost modelling.
top_ranks = (
df.groupby('RANK')['NET PAY']
.mean()
.sort_values(ascending=False)
.head(10)
.index
)
subset_ranks = df[df['RANK'].isin(top_ranks)]
plt.figure(figsize=(10, 6))
sns.boxplot(data=subset_ranks, x='NET PAY', y='RANK', palette='viridis', order=top_ranks)
plt.title('Net Pay Distribution — Top 10 Ranks')
plt.xlabel('Net Pay (NGN)')
plt.ylabel('Rank')
plt.tight_layout()
plt.show()
plt.close()top_departments = (
df.groupby('DEPARTMENT')['NET PAY']
.mean()
.sort_values(ascending=False)
.head(10)
.index
)
subset_departments = df[df['DEPARTMENT'].isin(top_departments)]
plt.figure(figsize=(10, 6))
sns.boxplot(data=subset_departments, x='NET PAY', y='DEPARTMENT', palette='muted', order=top_departments)
plt.title('Net Pay Distribution — Top 10 Departments')
plt.xlabel('Net Pay (NGN)')
plt.ylabel('Department')
plt.tight_layout()
plt.show()
plt.close()The rank-based boxplots confirm that seniority is the primary driver of pay variation. Higher ranks display wider interquartile ranges, reflecting greater variance in loan obligations among senior staff. The departmental boxplots, in contrast, show considerable overlap between units, suggesting that the University’s standardised salary scales (CONUASS/CONTISS) effectively equalise pay across departments. For a non-technical manager: what you earn depends far more on your grade level than on which department you work in.
An independent-samples t-test compares the means of a continuous variable across two distinct groups to determine whether any observed difference is statistically significant or likely due to chance. The null hypothesis (H₀) posits no difference in group means; the alternative hypothesis (H₁) posits a real difference. A p-value below 0.05 is conventionally taken as evidence to reject H₀.
Assumptions: The test assumes approximate normality within each group (or large enough samples for the Central Limit Theorem to apply), independence of observations, and that the two groups are mutually exclusive.
Budget committees routinely ask whether two cost centres have genuinely different expenditure profiles, or whether apparent differences are statistical noise. A formal hypothesis test provides the rigorous answer, replacing subjective judgement with evidence. Here, we test whether the Sociology and Chemistry departments — two substantively different academic units — carry meaningfully different payroll costs.
dept_soc = df[df['DEPARTMENT'] == 'SOCIOLOGY DEPARTMENT']['NET PAY'].dropna()
dept_chem = df[df['DEPARTMENT'] == 'CHEMISTRY DEPARTMENT']['NET PAY'].dropna()
print(f"Sociology — n={len(dept_soc)}, mean=₦{dept_soc.mean():,.2f}, SD=₦{dept_soc.std():,.2f}")
print(f"Chemistry — n={len(dept_chem)}, mean=₦{dept_chem.mean():,.2f}, SD=₦{dept_chem.std():,.2f}")
t_stat, p_val = stats.ttest_ind(dept_soc, dept_chem, equal_var=False) # Welch's t-test (safer default)
print(f"\nWelch's t-statistic : {t_stat:.4f}")
print(f"P-value : {p_val:.4f}")
print(f"\nDecision: {'Reject H₀ — significant difference' if p_val < 0.05 else 'Fail to reject H₀ — no significant difference'} at α = 0.05")Sociology — n=15, mean=₦255,444.47, SD=₦213,641.64
Chemistry — n=25, mean=₦240,863.80, SD=₦177,674.77
Welch's t-statistic : 0.2222
P-value : 0.8259
Decision: Fail to reject H₀ — no significant difference at α = 0.05
With a p-value well above 0.05, we fail to reject the null hypothesis. There is no statistically significant difference in net pay between the Sociology and Chemistry departments. For a non-technical manager: if you replaced every Chemistry staff member with a Sociology staff member of the same rank and loan profile, the payroll bill would not change meaningfully. This finding validates the University’s unified salary structure and confirms that budget forecasts can be applied uniformly across departments without needing department-specific pay assumptions.
Ordinary Least Squares (OLS) regression models a continuous outcome variable as a linear function of one or more predictors. The model estimates coefficients (β) representing the expected change in the outcome for a one-unit increase in each predictor, holding all others constant. Model fit is assessed using R² (proportion of variance explained) and p-values for individual coefficients.
For financial forecasting, knowing not just that variables are related but by how much is critical. The regression model allows the finance team to project future net pay obligations under different policy scenarios — for example, if loan ceilings change or a promotion round increases the number of staff on higher basic salaries. This supports treasury planning and cash flow management.
X = df[['BASIC', 'LOANS', 'PAYE']].copy()
y = df['NET PAY'].copy()
# Add constant (intercept) term
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary()) OLS Regression Results
==============================================================================
Dep. Variable: NET PAY R-squared: 0.709
Model: OLS Adj. R-squared: 0.709
Method: Least Squares F-statistic: 992.1
Date: Tue, 19 May 2026 Prob (F-statistic): 0.00
Time: 13:55:24 Log-Likelihood: -15543.
No. Observations: 1223 AIC: 3.109e+04
Df Residuals: 1219 BIC: 3.111e+04
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.781e+04 6567.322 4.234 0.000 1.49e+04 4.07e+04
BASIC 0.4967 0.061 8.192 0.000 0.378 0.616
LOANS -1.1833 0.079 -15.057 0.000 -1.337 -1.029
PAYE 4.1928 0.432 9.710 0.000 3.346 5.040
==============================================================================
Omnibus: 1378.969 Durbin-Watson: 1.593
Prob(Omnibus): 0.000 Jarque-Bera (JB): 155877.439
Skew: 5.465 Prob(JB): 0.00
Kurtosis: 57.217 Cond. No. 7.32e+05
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.32e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
y_pred = model.fittedvalues
plt.figure(figsize=(7, 5))
plt.scatter(y_pred, y, alpha=0.4, color='steelblue', edgecolor='none')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', linewidth=1.5, label='Perfect fit')
plt.xlabel('Predicted Net Pay (NGN)')
plt.ylabel('Actual Net Pay (NGN)')
plt.title('Actual vs Predicted Net Pay — OLS Regression')
plt.legend()
plt.tight_layout()
plt.show()
plt.close()The model explains approximately 71% of the variance in Net Pay (R² ≈ 0.71), indicating a reasonably strong fit. Key findings from the coefficients:
For a non-technical manager: the two biggest levers on take-home pay are the basic salary grade and loan obligations. Reducing loan burdens would have a direct, measurable positive impact on employee net pay.
Taken together, the five analyses converge on a coherent narrative about FUGUS’s December 2024 payroll:
Rank drives compensation; departments do not. Descriptive statistics and segmentation analysis both confirm that pay variance is primarily explained by grade level. The hypothesis test formally confirms that departmental affiliation (at least between Sociology and Chemistry) adds no statistically independent payroll cost. Budget planning should therefore be rank-based, not department-based.
Loan deductions are the primary financial wellness risk. The correlation analysis and regression model both identify loans as the strongest depressor of net take-home pay. The regression coefficient of −1.18 suggests that the total cost of borrowing (principal plus charges) is eroding disposable income substantially.
The payroll structure is principled but carries latent risk. The strong R² of 71% in the regression model confirms that the payroll is operating largely as designed — basic salary and statutory deductions explain the bulk of variation. However, the 29% unexplained variance and the wide loan-driven dispersion in net pay represent areas of financial and compliance risk.
Single Collective Recommendation: The University should introduce a Payroll Financial Wellness Programme that: (a) caps loan deductions at a maximum of 33.3% of gross pay in line with labour best practice; (b) audits and regularises the pension registration status of staff currently without a Pension PIN; and (c) uses the regression model to simulate the net pay impact of future policy changes before implementation. This would improve staff welfare, reduce regulatory exposure, and strengthen the University’s financial governance framework.
Agbo, E. (2026). Payroll data analysis: Federal University Gusau December 2024 payroll register [Dataset]. Internal financial reporting archive, Bursary Department, Federal University Gusau.
Field, A. (2018). Discovering statistics using IBM SPSS statistics (5th ed.). SAGE Publications.
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55
McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 51–56. https://doi.org/10.25080/Majora-92bf1922-00a
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with Python. Proceedings of the 9th Python in Science Conference, 57–61. https://doi.org/10.25080/Majora-92bf1922-00b
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … SciPy 1.0 Contributors. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2
Waskom, M. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021
Posit Assistant (an AI coding assistant) was used during the preparation of this report to assist with the structural organisation of the Quarto document and to refine the formatting of code cells and section headings. All analytical decisions — including the choice of techniques, the selection of variables for the regression model, the formulation of hypotheses, and the interpretation of outputs — were made independently based on my professional judgement as a practicing chartered accountant and my understanding of the business context at Federal University Gusau. The AI tool did not generate the dataset, determine the analytical strategy, or produce the substantive interpretations contained in Sections 5 through 10. I reviewed all AI-assisted content and take full responsibility for the accuracy and integrity of this report.