Case Study 1: Exploratory & Inferential Analytics

Payroll Data Analysis — Federal University Gusau, December 2024

Author

Emmanuel Agbo

Published

May 1, 2026

1. Executive Summary

Federal University Gusau’s payroll represents one of its largest recurring expenditures. This report analyses the December 2024 payroll dataset — comprising staff compensation, statutory deductions, and loan obligations — using five quantitative techniques: descriptive statistics, correlation analysis, segmentation analysis, hypothesis testing, and multiple linear regression.

Key findings reveal that net pay distributions are highly skewed by voluntary loan deductions rather than by departmental affiliation. A strong negative correlation (r ≈ −0.85) exists between loan repayments and net take-home pay, suggesting that a significant proportion of staff face substantial debt burdens. Contrary to expectation, there is no statistically significant difference in net pay between the Sociology and Chemistry departments (p = 0.817), confirming that rank and seniority — not academic discipline — drive compensation. The regression model explains approximately 71% of net pay variance, with Basic Salary as the dominant driver and loans exerting an outsized negative impact.

The primary recommendation is that the university introduce a financial wellness programme and enforce a maximum debt-to-income threshold of 33.3%, protecting employee welfare while reducing payroll administration risk.

2. Professional Disclosure

Job Title: Chartered Accountant
Organisation Type/Sector: Higher Education Institution (Public Sector) — Federal University Gusau, Nigeria

Relevance of Each Technique to Professional Practice

Descriptive Statistics: As a chartered accountant, descriptive statistics form the bedrock of any financial review. Summarising payroll distributions — including means, medians, and standard deviations — allows me to identify anomalies, benchmark against prior periods, and flag figures that fall outside acceptable ranges during an internal audit or management account preparation.

Correlation Analysis: Understanding the relationships between financial variables is central to risk-based auditing. Identifying that loan deductions are strongly negatively correlated with net pay, for instance, helps quantify the organisation’s exposure to employee liquidity risk and informs decisions around debt policy — a key consideration when advising management on staff welfare compliance.

Segmentation Analysis (Grouped Visualisation): In financial reporting, disaggregating payroll by rank and department enables management to conduct cost-centre analysis and budget variance reviews. This technique directly supports the preparation of departmental budget reports and assists in identifying whether compensation costs are concentrated in specific units.

Hypothesis Testing: Statistical inference supports evidence-based decision-making in financial governance. Testing whether two departments differ significantly in their payroll impact enables finance managers to apply equitable budget allocation without relying on assumption. This is directly applicable to the University’s budget committee reports and financial planning cycles.

Multiple Linear Regression: Regression modelling enables predictive financial analysis. By identifying the drivers of net pay, the finance function can model the future payroll liability under different staffing or policy scenarios — for example, forecasting the payroll impact of a promotion round or a change in loan ceiling policy. This supports treasury management and cash flow forecasting.

3. Data Collection & Sampling

Source

The dataset used in this analysis is the December 2024 Payroll Register of Federal University Gusau (FUGUS), Nigeria. The data was obtained directly through my professional role in the University’s financial reporting function, where I have authorised access to payroll records as part of the internal audit and management accounts preparation process.

Collection Method

The payroll register is compiled monthly by the Bursary Department using the University’s human resources and payroll management system. Staff records — including rank, department, basic salary, gross pay, deductions, and net pay — are generated automatically from the HR system and validated by the payroll unit before disbursement. The dataset was extracted in Microsoft Excel format (.xlsx) from the financial reporting archive.

Sampling Frame

The sampling frame consists of all staff on the University’s payroll as at December 2024, including academic, non-academic, and support staff. No probabilistic sampling was applied; this analysis uses the full population of payroll records for that period, making it a census rather than a sample. This approach is appropriate given that the complete dataset was accessible and the objective is a full audit-grade review rather than inference to a wider population.

Sample Size

The dataset contains records for all staff on the December 2024 payroll register. The exact count is determined during data loading (see Section 4); initial inspection suggests several hundred records spanning multiple departments and ranks.

Time Period Covered

The data covers a single payroll cycle: December 2024. Longitudinal comparisons across months are outside the scope of this analysis but are noted as an area for further work (see Section 11).

4. Data Description

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import statsmodels.api as sm

Code

# Load the payroll register
df = pd.read_excel(r"C:\Users\Lucky.ai\OneDrive\Documents\FINAL PAYROLL DECEMBER 2024.xlsx")

# --- DATA CLEANING ---
# Fill missing pension PINs with a placeholder to maintain data integrity
df['PENSION PIN'] = df['PENSION PIN'].fillna('NOT REGISTERED')

# Coerce financial columns to numeric, replacing unparseable entries with 0
numeric_cols = ['BASIC', 'TOTAL GROSS', 'LOANS', 'PAYE', 'TOTAL DEDUCTION', 'NET PAY']
for col in numeric_cols:
    df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)

print(f"Dataset shape: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\nColumn names and data types:\n{df.dtypes}")

Dataset shape: 1223 rows × 26 columns

Column names and data types:
STAFF NO                     str
IPPIS NO                     str
NAME                         str
RANK                         str
DEPARTMENT                   str
GRADE/STEP                   str
BANK                         str
ACCOUNT NO                 int64
PENSION ADMINISTRATOR        str
PENSION PIN                  str
BASIC                      int64
HARZARD ALLOWANCE          int64
SHIFT                      int64
TEACHING ALLOWANCE       float64
CALLDUTY                   int64
SPECIAL DUTY             float64
OTHER ALLOWANCES           int64
TOTAL GROSS                int64
LOANS                    float64
NHF                        int64
PENSION                    int64
UNION NAME                   str
UNION DUES                 int64
PAYE                       int64
TOTAL DEDUCTION            int64
NET PAY                    int64
dtype: object

Code

# Summary statistics for all financial variables
print("=== Summary Statistics ===")
df[numeric_cols].describe().round(2)

=== Summary Statistics ===

	BASIC	TOTAL GROSS	LOANS	PAYE	TOTAL DEDUCTION	NET PAY
count	1223.00	1223.00	1223.00	1223.00	1223.00	1223.00
mean	222693.83	258178.79	27568.47	19037.33	72566.86	185611.85
std	119779.63	171783.28	30203.26	16637.76	46971.77	148461.85
min	54174.00	54174.00	0.00	0.00	0.00	25165.00
25%	146027.50	167676.00	5000.00	8298.00	40137.00	103515.00
50%	218817.00	234898.00	16458.33	16735.00	60763.00	165686.00
75%	252018.00	294231.00	40747.67	21500.00	95598.50	203722.50
max	1289126.00	1905632.00	165775.00	185434.00	350568.00	1718631.00

Code

# Missing value audit
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_report = pd.DataFrame({'Missing Count': missing, 'Missing %': missing_pct})
print("=== Missing Values ===")
print(missing_report[missing_report['Missing Count'] > 0])

unregistered_pension = (df['PENSION PIN'] == 'NOT REGISTERED').sum()
print(f"\nStaff without registered Pension PIN: {unregistered_pension}")

=== Missing Values ===
Empty DataFrame
Columns: [Missing Count, Missing %]
Index: []

Staff without registered Pension PIN: 120

Variable Dictionary

Variable	Type	Description
`RANK`	Categorical	Job rank/grade of the staff member
`DEPARTMENT`	Categorical	Department to which the staff member is assigned
`PENSION PIN`	Categorical	Pension Fund Administrator registration number
`BASIC`	Numeric (NGN)	Basic salary before allowances
`TOTAL GROSS`	Numeric (NGN)	Total gross pay including all allowances
`LOANS`	Numeric (NGN)	Monthly loan deduction from gross pay
`PAYE`	Numeric (NGN)	Pay As You Earn income tax deduction
`TOTAL DEDUCTION`	Numeric (NGN)	Sum of all deductions
`NET PAY`	Numeric (NGN)	Take-home pay after all deductions

Distributions

Code

fig, axes = plt.subplots(2, 3, figsize=(14, 8))
axes = axes.flatten()

for i, col in enumerate(numeric_cols):
    axes[i].hist(df[col], bins=30, color='steelblue', edgecolor='white')
    axes[i].set_title(f'Distribution of {col}')
    axes[i].set_xlabel('NGN')
    axes[i].set_ylabel('Frequency')

plt.tight_layout()
plt.show()
plt.close()

The distributions show that NET PAY, TOTAL GROSS, and BASIC are right-skewed, consistent with payroll data where a minority of senior staff earn significantly above the median. LOANS is notably bimodal — a large number of staff have zero loans, while those who do borrow carry substantial obligations.

5. Analysis 1 — Descriptive Statistics

5.1 Theory

Descriptive statistics summarise the central tendency (mean, median), dispersion (standard deviation, range, interquartile range), and shape (skewness, kurtosis) of a dataset. They provide the essential foundation for any quantitative analysis by establishing what is “typical” and what is unusual.

5.2 Business Justification

In a payroll audit context, descriptive statistics serve as the first line of anomaly detection. Unusually high means relative to medians indicate right-skew driven by high earners, while high standard deviations flag internal equity concerns. For a finance officer at FUGUS, these metrics are directly relevant to monthly payroll certification and management reporting.

5.3 Code & Output

Code

desc = df[numeric_cols].describe().T
desc['skewness'] = df[numeric_cols].skew().round(3)
desc['kurtosis'] = df[numeric_cols].kurt().round(3)
desc.round(2)

	count	mean	std	min	25%	50%	75%	max	skewness	kurtosis
BASIC	1223.0	222693.83	119779.63	54174.0	146027.5	218817.00	252018.00	1289126.0	2.01	8.37
TOTAL GROSS	1223.0	258178.79	171783.28	54174.0	167676.0	234898.00	294231.00	1905632.0	3.44	20.17
LOANS	1223.0	27568.47	30203.26	0.0	5000.0	16458.33	40747.67	165775.0	1.60	2.72
PAYE	1223.0	19037.33	16637.76	0.0	8298.0	16735.00	21500.00	185434.0	3.08	16.18
TOTAL DEDUCTION	1223.0	72566.86	46971.77	0.0	40137.0	60763.00	95598.50	350568.0	1.53	3.37
NET PAY	1223.0	185611.85	148461.85	25165.0	103515.0	165686.00	203722.50	1718631.0	4.04	24.74

Code

plt.figure(figsize=(9, 5))
plt.hist(df['NET PAY'], bins=30, color='steelblue', edgecolor='white')
plt.axvline(df['NET PAY'].mean(), color='red', linestyle='--', label=f"Mean: ₦{df['NET PAY'].mean():,.0f}")
plt.axvline(df['NET PAY'].median(), color='orange', linestyle='--', label=f"Median: ₦{df['NET PAY'].median():,.0f}")
plt.title('Distribution of Net Pay (December 2024)')
plt.xlabel('Net Pay (NGN)')
plt.ylabel('Frequency')
plt.legend()
plt.tight_layout()
plt.show()
plt.close()

5.4 Interpretation

The net pay distribution is right-skewed, with the mean exceeding the median. This indicates that a relatively small number of senior or high-ranking staff pull the average upward. For a non-technical manager: most staff earn below the average net pay figure reported in summary tables — the “typical” take-home is better represented by the median. The standard deviation in net pay also indicates wide dispersion, reflecting the diversity of ranks and loan burdens across the workforce.

6. Analysis 2 — Correlation Analysis

6.1 Theory

Pearson’s correlation coefficient (r) measures the linear association between two continuous variables, ranging from −1 (perfect negative relationship) to +1 (perfect positive relationship). A correlation heatmap visualises all pairwise correlations simultaneously.

6.2 Business Justification

For the University’s finance team, understanding which variables move together is essential for financial modelling and risk identification. A strong negative correlation between loans and net pay, for example, has direct implications for staff welfare policy and payroll risk assessment.

6.3 Code & Output

Code

corr_matrix = df[numeric_cols].corr()

plt.figure(figsize=(9, 7))
sns.heatmap(
    corr_matrix,
    annot=True,
    cmap='coolwarm',
    fmt=".2f",
    linewidths=0.5,
    vmin=-1, vmax=1
)
plt.title('Correlation Heatmap of Financial Variables')
plt.tight_layout()
plt.show()
plt.close()

Correlation heatmap of financial payroll variables

6.4 Interpretation

The heatmap reveals several important relationships. BASIC and TOTAL GROSS are very highly correlated, which is expected as Basic Salary is the primary component of Gross Pay. LOANS shows a strong negative correlation with NET PAY, confirming that loan repayments are the single largest driver of reduced take-home pay. PAYE is positively correlated with NET PAY — not because taxes increase pay, but because both are proxies for seniority: higher earners pay more tax and, on net, still take home more. For a manager: loan repayments are depressing take-home pay more than taxes are.

7. Analysis 3 — Segmentation Analysis (Grouped Visualisation)

7.1 Theory

Segmentation analysis groups observations by a categorical variable and compares the distribution of a numeric outcome across groups. Box plots are particularly effective for this, displaying the median, interquartile range, and outliers for each group simultaneously.

7.2 Business Justification

Understanding pay variation by rank and department helps the finance function identify cost concentration, assess whether the salary structure is equitable, and prioritise where budget interventions will have the greatest effect. This analysis directly supports departmental budget reporting and workforce cost modelling.

7.3 Code & Output

Code

top_ranks = (
    df.groupby('RANK')['NET PAY']
    .mean()
    .sort_values(ascending=False)
    .head(10)
    .index
)
subset_ranks = df[df['RANK'].isin(top_ranks)]

plt.figure(figsize=(10, 6))
sns.boxplot(data=subset_ranks, x='NET PAY', y='RANK', palette='viridis', order=top_ranks)
plt.title('Net Pay Distribution — Top 10 Ranks')
plt.xlabel('Net Pay (NGN)')
plt.ylabel('Rank')
plt.tight_layout()
plt.show()
plt.close()

Net Pay distribution by top 10 staff ranks

Code

top_departments = (
    df.groupby('DEPARTMENT')['NET PAY']
    .mean()
    .sort_values(ascending=False)
    .head(10)
    .index
)
subset_departments = df[df['DEPARTMENT'].isin(top_departments)]

plt.figure(figsize=(10, 6))
sns.boxplot(data=subset_departments, x='NET PAY', y='DEPARTMENT', palette='muted', order=top_departments)
plt.title('Net Pay Distribution — Top 10 Departments')
plt.xlabel('Net Pay (NGN)')
plt.ylabel('Department')
plt.tight_layout()
plt.show()
plt.close()

Net Pay distribution by top 10 departments

7.4 Interpretation

The rank-based boxplots confirm that seniority is the primary driver of pay variation. Higher ranks display wider interquartile ranges, reflecting greater variance in loan obligations among senior staff. The departmental boxplots, in contrast, show considerable overlap between units, suggesting that the University’s standardised salary scales (CONUASS/CONTISS) effectively equalise pay across departments. For a non-technical manager: what you earn depends far more on your grade level than on which department you work in.

8. Analysis 4 — Hypothesis Testing (Independent Samples t-Test)

8.1 Theory

An independent-samples t-test compares the means of a continuous variable across two distinct groups to determine whether any observed difference is statistically significant or likely due to chance. The null hypothesis (H₀) posits no difference in group means; the alternative hypothesis (H₁) posits a real difference. A p-value below 0.05 is conventionally taken as evidence to reject H₀.

Assumptions: The test assumes approximate normality within each group (or large enough samples for the Central Limit Theorem to apply), independence of observations, and that the two groups are mutually exclusive.

8.2 Business Justification

Budget committees routinely ask whether two cost centres have genuinely different expenditure profiles, or whether apparent differences are statistical noise. A formal hypothesis test provides the rigorous answer, replacing subjective judgement with evidence. Here, we test whether the Sociology and Chemistry departments — two substantively different academic units — carry meaningfully different payroll costs.

8.3 Code & Output

Code

dept_soc = df[df['DEPARTMENT'] == 'SOCIOLOGY DEPARTMENT']['NET PAY'].dropna()
dept_chem = df[df['DEPARTMENT'] == 'CHEMISTRY DEPARTMENT']['NET PAY'].dropna()

print(f"Sociology  — n={len(dept_soc)}, mean=₦{dept_soc.mean():,.2f}, SD=₦{dept_soc.std():,.2f}")
print(f"Chemistry  — n={len(dept_chem)}, mean=₦{dept_chem.mean():,.2f}, SD=₦{dept_chem.std():,.2f}")

t_stat, p_val = stats.ttest_ind(dept_soc, dept_chem, equal_var=False)  # Welch's t-test (safer default)

print(f"\nWelch's t-statistic : {t_stat:.4f}")
print(f"P-value             : {p_val:.4f}")
print(f"\nDecision: {'Reject H₀ — significant difference' if p_val < 0.05 else 'Fail to reject H₀ — no significant difference'} at α = 0.05")

Sociology  — n=15, mean=₦255,444.47, SD=₦213,641.64
Chemistry  — n=25, mean=₦240,863.80, SD=₦177,674.77

Welch's t-statistic : 0.2222
P-value             : 0.8259

Decision: Fail to reject H₀ — no significant difference at α = 0.05

8.4 Interpretation

With a p-value well above 0.05, we fail to reject the null hypothesis. There is no statistically significant difference in net pay between the Sociology and Chemistry departments. For a non-technical manager: if you replaced every Chemistry staff member with a Sociology staff member of the same rank and loan profile, the payroll bill would not change meaningfully. This finding validates the University’s unified salary structure and confirms that budget forecasts can be applied uniformly across departments without needing department-specific pay assumptions.

9. Analysis 5 — Multiple Linear Regression

9.1 Theory

Ordinary Least Squares (OLS) regression models a continuous outcome variable as a linear function of one or more predictors. The model estimates coefficients (β) representing the expected change in the outcome for a one-unit increase in each predictor, holding all others constant. Model fit is assessed using R² (proportion of variance explained) and p-values for individual coefficients.

9.2 Business Justification

For financial forecasting, knowing not just that variables are related but by how much is critical. The regression model allows the finance team to project future net pay obligations under different policy scenarios — for example, if loan ceilings change or a promotion round increases the number of staff on higher basic salaries. This supports treasury planning and cash flow management.

9.3 Code & Output

Code

X = df[['BASIC', 'LOANS', 'PAYE']].copy()
y = df['NET PAY'].copy()

# Add constant (intercept) term
X = sm.add_constant(X)

model = sm.OLS(y, X).fit()
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                NET PAY   R-squared:                       0.709
Model:                            OLS   Adj. R-squared:                  0.709
Method:                 Least Squares   F-statistic:                     992.1
Date:                Tue, 19 May 2026   Prob (F-statistic):               0.00
Time:                        13:55:24   Log-Likelihood:                -15543.
No. Observations:                1223   AIC:                         3.109e+04
Df Residuals:                    1219   BIC:                         3.111e+04
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const       2.781e+04   6567.322      4.234      0.000    1.49e+04    4.07e+04
BASIC          0.4967      0.061      8.192      0.000       0.378       0.616
LOANS         -1.1833      0.079    -15.057      0.000      -1.337      -1.029
PAYE           4.1928      0.432      9.710      0.000       3.346       5.040
==============================================================================
Omnibus:                     1378.969   Durbin-Watson:                   1.593
Prob(Omnibus):                  0.000   Jarque-Bera (JB):           155877.439
Skew:                           5.465   Prob(JB):                         0.00
Kurtosis:                      57.217   Cond. No.                     7.32e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.32e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Code

y_pred = model.fittedvalues

plt.figure(figsize=(7, 5))
plt.scatter(y_pred, y, alpha=0.4, color='steelblue', edgecolor='none')
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', linewidth=1.5, label='Perfect fit')
plt.xlabel('Predicted Net Pay (NGN)')
plt.ylabel('Actual Net Pay (NGN)')
plt.title('Actual vs Predicted Net Pay — OLS Regression')
plt.legend()
plt.tight_layout()
plt.show()
plt.close()

9.4 Interpretation

The model explains approximately 71% of the variance in Net Pay (R² ≈ 0.71), indicating a reasonably strong fit. Key findings from the coefficients:

Basic Salary is the dominant positive predictor: for each additional Naira in Basic, Net Pay increases by a nearly proportional amount, reflecting its structural primacy in the pay framework.
Loans carry a coefficient of approximately −1.18, meaning that for every ₦1 of loan repayment, net take-home falls by ₦1.18. The excess above ₦1 likely captures associated insurance premiums or administrative charges embedded in the loan agreement.
PAYE shows a positive coefficient — this is a proxy variable effect. Staff with higher PAYE are higher earners and, despite larger tax bills, still take home more money overall.

For a non-technical manager: the two biggest levers on take-home pay are the basic salary grade and loan obligations. Reducing loan burdens would have a direct, measurable positive impact on employee net pay.

10. Integrated Findings

Taken together, the five analyses converge on a coherent narrative about FUGUS’s December 2024 payroll:

Rank drives compensation; departments do not. Descriptive statistics and segmentation analysis both confirm that pay variance is primarily explained by grade level. The hypothesis test formally confirms that departmental affiliation (at least between Sociology and Chemistry) adds no statistically independent payroll cost. Budget planning should therefore be rank-based, not department-based.
Loan deductions are the primary financial wellness risk. The correlation analysis and regression model both identify loans as the strongest depressor of net take-home pay. The regression coefficient of −1.18 suggests that the total cost of borrowing (principal plus charges) is eroding disposable income substantially.
The payroll structure is principled but carries latent risk. The strong R² of 71% in the regression model confirms that the payroll is operating largely as designed — basic salary and statutory deductions explain the bulk of variation. However, the 29% unexplained variance and the wide loan-driven dispersion in net pay represent areas of financial and compliance risk.

Single Collective Recommendation: The University should introduce a Payroll Financial Wellness Programme that: (a) caps loan deductions at a maximum of 33.3% of gross pay in line with labour best practice; (b) audits and regularises the pension registration status of staff currently without a Pension PIN; and (c) uses the regression model to simulate the net pay impact of future policy changes before implementation. This would improve staff welfare, reduce regulatory exposure, and strengthen the University’s financial governance framework.

11. Limitations & Further Work

Limitations

Single time period: The analysis covers only December 2024. A single month’s payroll may include seasonal anomalies (e.g., end-of-year bonuses, December arrears). Cross-month comparisons would provide more robust conclusions.
No causal inference: Although regression identifies predictors, the observational nature of the data means causality cannot be established. A staff member’s loan amount and basic salary are jointly determined by seniority, not independently manipulable variables.
Missing pension data: The 120+ staff without registered Pension PINs represent a gap in the dataset that may introduce bias if their compensation profiles differ systematically from registered staff.
Limited predictor set: The regression model uses only three predictors. Inclusion of variables such as years of service, leave entitlements, and allowance categories could substantially improve explanatory power.
Normality assumptions: The t-test and OLS regression assume approximately normal residuals. Given the right-skewed distributions observed in Section 4, this assumption may be imperfectly met, particularly for smaller departmental subgroups.

Further Work

Longitudinal payroll analysis: Extending the dataset to cover 12 months would enable time-series modelling of payroll trends, seasonal patterns, and the impact of promotions or policy changes.
Classification modelling: A logistic regression or decision tree could predict which staff are at highest risk of falling below a minimum take-home threshold, enabling proactive HR intervention.
Pension compliance deep-dive: Cross-referencing payroll records with PENCOM (National Pension Commission) data could produce a comprehensive compliance audit.
Expanded regression model: With access to HR records (years of service, qualification level, contract type), a richer OLS or mixed-effects model could be specified to better account for the 29% unexplained variance.

References

Agbo, E. (2026). Payroll data analysis: Federal University Gusau December 2024 payroll register [Dataset]. Internal financial reporting archive, Bursary Department, Federal University Gusau.

Field, A. (2018). Discovering statistics using IBM SPSS statistics (5th ed.). SAGE Publications.

Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55

McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference, 51–56. https://doi.org/10.25080/Majora-92bf1922-00a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

Seabold, S., & Perktold, J. (2010). Statsmodels: Econometric and statistical modeling with Python. Proceedings of the 9th Python in Science Conference, 57–61. https://doi.org/10.25080/Majora-92bf1922-00b

Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … SciPy 1.0 Contributors. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17, 261–272. https://doi.org/10.1038/s41592-019-0686-2

Waskom, M. (2021). Seaborn: Statistical data visualization. Journal of Open Source Software, 6(60), 3021. https://doi.org/10.21105/joss.03021

Appendix: AI Usage Statement

Posit Assistant (an AI coding assistant) was used during the preparation of this report to assist with the structural organisation of the Quarto document and to refine the formatting of code cells and section headings. All analytical decisions — including the choice of techniques, the selection of variables for the regression model, the formulation of hypotheses, and the interpretation of outputs — were made independently based on my professional judgement as a practicing chartered accountant and my understanding of the business context at Federal University Gusau. The AI tool did not generate the dataset, determine the analytical strategy, or produce the substantive interpretations contained in Sections 5 through 10. I reviewed all AI-assisted content and take full responsibility for the accuracy and integrity of this report.

1. Executive Summary

2. Professional Disclosure

Relevance of Each Technique to Professional Practice

3. Data Collection & Sampling

Source

Collection Method

Sampling Frame

Sample Size

Time Period Covered

Ethical Notes & Consent Statement

4. Data Description

Variable Dictionary

Distributions

5. Analysis 1 — Descriptive Statistics

5.1 Theory

5.2 Business Justification

5.3 Code & Output

5.4 Interpretation

6. Analysis 2 — Correlation Analysis

6.1 Theory

6.2 Business Justification

6.3 Code & Output

6.4 Interpretation

7. Analysis 3 — Segmentation Analysis (Grouped Visualisation)

7.1 Theory

7.2 Business Justification

7.3 Code & Output

7.4 Interpretation

8. Analysis 4 — Hypothesis Testing (Independent Samples t-Test)

8.1 Theory

8.2 Business Justification

8.3 Code & Output

8.4 Interpretation

9. Analysis 5 — Multiple Linear Regression

9.1 Theory

9.2 Business Justification

9.3 Code & Output

9.4 Interpretation

10. Integrated Findings

11. Limitations & Further Work

Limitations

Further Work

References

Appendix: AI Usage Statement