Revenue Dynamics and Cost Structure Analysis

Case Study 1 — Exploratory and Inferential Analytics | COHCASEL 2025

Author

Ifesanmi Tolu-Alao | 2025-MMBA-8-071

Published

May 10, 2026

1. Executive Summary

This study applies five exploratory and inferential analytics techniques to 1,198 financial transactions recorded by Concierge and Home Care Services Limited (COHCASEL) across the full 2025 calendar year. COHCASEL is a Lagos-based domestic staffing and caregiving placement agency that has reported declining margin retention despite growing gross receipts. The central business problem is a revenue architecture gap: gross inflows are growing, but salary pass-through transactions absorb nearly all receipts, leaving insufficient net revenue to cover operating costs.

The analysis finds that COHCASEL collected NGN 10.96 million in gross revenue across 2025 but disbursed NGN 10.25 million in staff salaries, retaining a net of approximately NGN 702,000 for the year. Revenue transaction amounts are significantly higher in the second half of the year (H2 mean: NGN 63,505) than the first half (H1 mean: NGN 55,015), but quarterly differences overall are not statistically significant. Time variables alone explain less than 8% of variance in revenue transaction size, confirming that placement volume and individual client agreements drive revenue outcomes more than timing. The primary recommendation is a structural repricing of the placement fee combined with a mandatory recurring monthly service fee to widen the gap between gross receipts and salary pass-through costs.

2. Professional Disclosure

Job Title: Partner, WeStrategise Consulting

Organisation Domain: Management consulting, specialising in SME strategy and operations advisory for Nigerian service businesses.

Technique Justifications:

Exploratory Data Analysis. The raw transaction data arrives from two bank accounts with no pre-existing categorisation quality controls. EDA is the necessary first step to understand distributions, flag data quality issues, and establish the profile of the dataset before any formal modelling.
Data Visualisation. Financial transaction data contains patterns (monthly cycles, category concentration, cash timing) that are invisible in summary tables. Visualisation surfaces these patterns for presentation to a non-technical founder and forms the basis for the hypothesis testing questions.
Hypothesis Testing. COHCASEL’s MD/CEO has observed that business feels stronger in certain periods. Formal hypothesis testing provides statistical grounding to confirm or refute whether observed revenue differences across time periods are real or attributable to chance variation.
Correlation Analysis. Before building a regression model, correlation analysis identifies which numeric variables share meaningful linear relationships with revenue transaction amounts and flags multicollinearity risks that would undermine the regression.
Linear Regression. OLS regression translates the correlation findings into a predictive model with interpretable coefficients. Each coefficient becomes a concrete business insight: for example, how much additional revenue per transaction COHCASEL can expect each month as the year progresses.

3. Data Collection and Sampling

Source: COHCASEL 2025 bank statement transaction records, extracted directly from GTB (Guaranty Trust Bank) and Access Bank account statements by the MD/CEO and provided to WeStrategise Consulting under a signed management consulting engagement agreement.

Collection Method: Transaction-by-transaction digital extraction from bank statements, manually entered into an Excel workbook by COHCASEL’s administrative staff. The dataset was supplemented by three Director’s Advance entries recorded separately.

Sampling Frame: The dataset represents a complete census of all transactions recorded across the two primary operating accounts between January 1 and December 31, 2025. No sampling was applied. Every recorded financial event is included.

Sample Size: 1,198 transactions (995 outflows, 203 inflows) across 7 original variables. Three numeric variables were engineered from the date field: Month_Number (1-12), Day_of_Week (1=Monday to 7=Sunday), and Quarter (1-4), bringing the total to 10 variables.

Time Period Covered: January 1, 2025 to December 31, 2025 (full calendar year).

Ethical Notes: Data was provided with explicit consent from COHCASEL’s MD/CEO for use in this analytical engagement. Personally identifiable information about third parties (staff names, client names in transaction descriptions) is not published in this document. The dataset is classified as confidential and is available on request from the author.

4. Data Description

Code

df <- read_excel("COHCASEL_2025_Transactions_Enriched.xlsx",
                 sheet = "2025_Enriched") |>
  clean_names() |>
  mutate(
    date          = as.Date(date),
    amount_ngn    = suppressWarnings(as.numeric(amount_ngn)),
    is_inflow     = as.integer(type == "Inflow"),
    amount_abs    = abs(amount_ngn),
    type          = factor(type, levels = c("Inflow", "Outflow")),
    bank          = factor(bank),
    category      = factor(income_expense_category),
    half_year     = if_else(month_number <= 6, "H1 (Jan-Jun)", "H2 (Jul-Dec)"),
    month_label   = month(date, label = TRUE, abbr = TRUE),
    quarter_label = paste0("Q", quarter)
  )

df |>
  select(amount_ngn, month_number, day_of_week, quarter, is_inflow) |>
  skim()

Data summary
Name	select(df, amount_ngn, mo…
Number of rows	1198
Number of columns	5
_______________________
Column type frequency:
numeric	5
________________________
Group variables	None

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
amount_ngn	1	1	-528.91	36168.12	-3e+05	-5000	-25	-1.88	120000	▁▁▁▇▂
month_number	0	1	6.77	3.54	1e+00	4	7	10.00	12	▆▅▅▅▇
day_of_week	0	1	3.39	1.98	1e+00	2	3	5.00	7	▇▂▃▂▃
quarter	0	1	2.58	1.14	1e+00	2	3	4.00	4	▆▆▁▆▇
is_inflow	0	1	0.17	0.38	0e+00	0	0	0.00	1	▇▁▁▁▂

Variable summary.

Code

tibble(
  Variable  = c("amount_ngn","month_number","day_of_week","quarter","is_inflow"),
  Type      = c("Numeric","Numeric","Numeric","Numeric","Binary numeric"),
  Range     = c("–100,000 to 120,000","1 – 12","1 – 7","1 – 4","0 or 1"),
  Notes     = c(
    "Transaction amount in NGN. Negative = outflow. One missing (text entry).",
    "Calendar month. Engineered from Date.",
    "Day of week. Monday=1, Sunday=7. Engineered from Date.",
    "Calendar quarter. Engineered from Date.",
    "1 if Inflow, 0 if Outflow. Derived from Type column."
  )
) |>
  kbl(caption = "Table 1. Numeric variable summary") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)

Table 1. Numeric variable summary
Variable	Type	Range	Notes
amount_ngn	Numeric	–100,000 to 120,000	Transaction amount in NGN. Negative = outflow. One missing (text entry).
month_number	Numeric	1 – 12	Calendar month. Engineered from Date.
day_of_week	Numeric	1 – 7	Day of week. Monday=1, Sunday=7. Engineered from Date.
quarter	Numeric	1 – 4	Calendar quarter. Engineered from Date.
is_inflow	Binary numeric	0 or 1	1 if Inflow, 0 if Outflow. Derived from Type column.

5. Analysis 1 — Exploratory Data Analysis

Theory recap

Exploratory Data Analysis (EDA) profiles a dataset through summary statistics, distribution shapes, missing value counts, and outlier detection before any modelling is attempted. The goal is to understand what is in the data, where quality issues exist, and what patterns warrant formal testing (Adi, 2026, Ch. 4).

Business justification

COHCASEL’s transaction data was hand-entered from bank statements. EDA identifies whether the data is reliable enough to support inferential conclusions and reveals the composition of revenue versus costs before any hypothesis is tested.

Data quality check

Code

# Missing values per column
missing_tbl <- df |>
  summarise(across(everything(), ~sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "Column", values_to = "Missing") |>
  filter(Missing > 0)

missing_tbl |>
  kbl(caption = "Table 2. Columns with missing values") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 2. Columns with missing values
Column	Missing
amount_ngn	1
amount_abs	1

Data quality issue 1 — Missing Amount value. One transaction (row 96, February 10, 2025) has the text “House cleaning” in the Amount field instead of a numeric value. This is a data entry error where the description and amount fields were transposed. The row is retained in the dataset with the amount set to missing. It affects 0.08% of records and is excluded from all numeric analyses automatically.

Code

# Data quality issue 2: negative entries under Revenue (Gross) category
df |>
  filter(category == "Revenue (Gross)", amount_ngn < 0) |>
  select(date, bank, description, amount_ngn) |>
  kbl(caption = "Table 3. Negative entries under Revenue (Gross) — potential reversals") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 3. Negative entries under Revenue (Gross) — potential reversals
date	bank	description	amount_ngn
2025-03-02	GTB	Mr wang	-100000
2025-03-31	Access	Salary for Ruth	-60000
2025-06-30	GTB	Afolabi ifesanmi opeyemi	-25000
2025-09-24	GTB	Levis Prince Nwagbara	-80000

Data quality issue 2 — Misclassified revenue reversals. Four transactions are categorised as “Revenue (Gross)” but carry negative amounts (total: NGN 265,000). These appear to be client payment reversals or internal transfers recorded under the wrong category. The largest entry (NGN 100,000, March 2025) corresponds to a client repayment. These four rows are excluded from the revenue analysis to avoid understating revenue transaction values.

Outlier detection

Code

# IQR-based outlier detection on Revenue (Gross) inflows
rev_df <- df |>
  filter(category == "Revenue (Gross)", amount_ngn > 0)

q1  <- quantile(rev_df$amount_ngn, 0.25, na.rm = TRUE)
q3  <- quantile(rev_df$amount_ngn, 0.75, na.rm = TRUE)
iqr <- q3 - q1

outliers <- rev_df |>
  filter(amount_ngn < (q1 - 1.5 * iqr) | amount_ngn > (q3 + 1.5 * iqr))

cat("Q1:", scales::comma(q1), "| Q3:", scales::comma(q3),
    "| IQR:", scales::comma(iqr), "\n")

Q1: 40,000 | Q3: 70,000 | IQR: 30,000

Code

cat("Lower fence:", scales::comma(q1 - 1.5 * iqr), "\n")

Lower fence: -5,000

Code

cat("Upper fence:", scales::comma(q3 + 1.5 * iqr), "\n")

Upper fence: 115,000

Code

cat("Outlier count:", nrow(outliers), "\n")

Outlier count: 19

Code

ggplot(rev_df, aes(y = amount_ngn)) +
  geom_boxplot(fill = col_in, alpha = 0.6, outlier.colour = col_out,
               outlier.size = 2.5, width = 0.4) +
  scale_y_continuous(labels = label_comma(prefix = "NGN ")) +
  labs(
    title    = "Figure 1. Distribution of Revenue Transaction Amounts",
    subtitle = "IQR-based outlier detection on 189 Revenue (Gross) inflows",
    x        = NULL, y = "Amount (NGN)"
  ) +
  coord_flip()

Interpretation. Revenue transaction amounts range from NGN 10,000 to NGN 120,000, with a median around NGN 55,000. The IQR fence identifies a small number of high-value transactions (above NGN 115,000) as statistical outliers, though these represent legitimate high-value client payments rather than data errors. No transactions are removed on outlier grounds alone. The distribution is moderately right-skewed, which is expected for payment amounts and will be addressed in the regression section.

Transaction composition

Code

df |>
  group_by(category) |>
  summarise(
    Transactions = n(),
    Total_NGN    = sum(amount_ngn, na.rm = TRUE),
    .groups      = "drop"
  ) |>
  arrange(Total_NGN) |>
  mutate(
    Total_NGN = scales::comma(round(Total_NGN, 0)),
    Pct       = paste0(round(Transactions / nrow(df) * 100, 1), "%")
  ) |>
  rename(Category = category, "% of Records" = Pct, "Net Amount (NGN)" = Total_NGN) |>
  kbl(caption = "Table 4. Transaction composition by category") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)

Table 4. Transaction composition by category
Category	Transactions	Net Amount (NGN)	% of Records
Staff Salaries and Wages	253	-10,254,139	21.1%
Staff Loan(I.O.U.)	21	-319,163	1.8%
Rentage of Office	1	-300,000	0.1%
Medical Tests	9	-180,400	0.8%
Bank Chages	698	-134,353	58.3%
Loans and Advances-Director	3	-90,000	0.3%
Website Design and Maintenance	1	-85,000	0.1%
Telephone and Internet Services	2	-56,027	0.2%
Fuelling Expenses	1	-50,027	0.1%
Transport Expenses	2	-35,000	0.2%
Directors' Salaries	1	-30,000	0.1%
Staff Training Expenses	1	-30,000	0.1%
Marketing Expenses	2	-25,000	0.2%
Office Running Expenses	2	-23,000	0.2%
Annual Returns Filing	1	-15,000	0.1%
Security Checks Expenses	1	-10,000	0.1%
Printing and Stationeries Expenses	1	-8,400	0.1%
Other	5	56,000	0.4%
Revenue (Gross)	193	10,956,408	16.1%

Interpretation. Bank charges account for 698 transactions (58% of all records) but represent tiny individual amounts. Staff Salaries and Wages (253 transactions) and Revenue (Gross) (193 transactions) are the two categories that drive the business economics. The near-equality of total revenue receipts (NGN 10.96 million) and total salary disbursements (NGN 10.25 million) is the core diagnostic finding: COHCASEL retains only NGN 702,000 net after passing client salary payments through to staff, which is insufficient to cover annual fixed operating costs of approximately NGN 1.34 million.

6. Analysis 2 — Data Visualisation

Theory recap

Effective data visualisation translates raw numbers into perceptible patterns. Chart selection follows the data type and the question being asked: bar charts for category comparisons, line charts for trends, violin charts for distributions, and heatmaps for frequency matrices (Adi, 2026, Ch. 5; Wickham, 2016).

Business justification

Five visualisations are presented in a cohesive narrative sequence: first the aggregate revenue-versus-cost picture, then the cost composition, then the distribution of individual payment amounts, then the timing of transactions, and finally the monthly revenue trend. Together they answer the question: “Where does the money come in and where does it go?”

Code

# Figure 2: Monthly revenue vs expenses
monthly <- df |>
  filter(!is.na(amount_ngn), category %in% c("Revenue (Gross)", "Staff Salaries and Wages")) |>
  mutate(
    flow_label = if_else(category == "Revenue (Gross)", "Revenue (Gross)", "Staff Salaries"),
    amount_pos = abs(amount_ngn)
  ) |>
  group_by(month_label, flow_label) |>
  summarise(total = sum(amount_pos) / 1e6, .groups = "drop")

ggplot(monthly, aes(x = month_label, y = total, fill = flow_label)) +
  geom_col(position = "dodge", alpha = 0.85) +
  scale_fill_manual(values = c("Revenue (Gross)" = col_in, "Staff Salaries" = col_out)) +
  scale_y_continuous(labels = label_number(suffix = "M", prefix = "NGN ")) +
  labs(
    title    = "Figure 2. Monthly Revenue vs Staff Salary Disbursements, 2025",
    subtitle = "Staff salary payments track closely to revenue receipts in most months",
    x        = "Month", y = "Total Amount (NGN Millions)", fill = NULL
  )

Figure 2 narrative. Revenue (gross) and staff salary disbursements track closely throughout the year, confirming that COHCASEL operates primarily as a salary pass-through intermediary. The gap between the two bars represents COHCASEL’s retained margin. Several months show salary disbursements nearly matching or exceeding gross receipts, indicating periods of negative net contribution.

Code

# Figure 3: Top expense categories
df |>
  filter(amount_ngn < 0, !is.na(amount_ngn)) |>
  group_by(category) |>
  summarise(total = sum(abs(amount_ngn)) / 1e6, .groups = "drop") |>
  arrange(desc(total)) |>
  slice_head(n = 8) |>
  mutate(category = fct_reorder(category, total)) |>
  ggplot(aes(x = total, y = category, fill = total)) +
  geom_col(show.legend = FALSE, alpha = 0.85) +
  geom_text(aes(label = label_number(suffix = "M", prefix = "NGN ", accuracy = 0.1)(total)),
            hjust = -0.1, size = 3.2) +
  scale_x_continuous(labels = label_number(suffix = "M", prefix = "NGN "),
                     expand = expansion(mult = c(0, 0.15))) +
  scale_fill_gradient(low = "#EF9A9A", high = col_out) +
  labs(
    title    = "Figure 3. Top 8 Expense Categories by Total Amount, 2025",
    subtitle = "Staff salaries represent 96% of all outflows by value",
    x        = "Total Amount (NGN Millions)", y = NULL
  )

Figure 3 narrative. Staff Salaries and Wages dominate all other expense categories by a wide margin, confirming that the cost structure is driven almost entirely by salary pass-through rather than COHCASEL’s own operating overhead. Bank charges, staff loans, and medical tests are visible but small relative to salaries.

Code

# Figure 4: Revenue transaction amount distribution by half-year
rev_df |>
  mutate(half_year = if_else(month_number <= 6, "H1 (Jan-Jun)", "H2 (Jul-Dec)")) |>
  ggplot(aes(x = half_year, y = amount_ngn, fill = half_year)) +
  geom_violin(trim = FALSE, alpha = 0.6) +
  geom_boxplot(width = 0.12, outlier.size = 1.5, fill = "white") +
  scale_fill_manual(values = c("H1 (Jan-Jun)" = col_in, "H2 (Jul-Dec)" = col_acc)) +
  scale_y_continuous(labels = label_comma(prefix = "NGN ")) +
  labs(
    title    = "Figure 4. Revenue Transaction Amount Distribution by Half-Year",
    subtitle = "H2 shows a higher median and wider spread than H1",
    x        = NULL, y = "Revenue Amount (NGN)", fill = NULL
  ) +
  theme(legend.position = "none")

Figure 4 narrative. The violin chart reveals that H2 revenue transactions are distributed at higher values than H1, with a higher median and a heavier upper tail. This suggests COHCASEL attracted higher-value client engagements in the second half of 2025, which the hypothesis tests in Section 7 will formally examine.

Code

# Figure 5: Day-of-week transaction frequency heatmap (inflows only)
df |>
  filter(type == "Inflow", !is.na(amount_ngn)) |>
  mutate(dow_label = factor(
    day_of_week,
    levels = 1:7,
    labels = c("Mon","Tue","Wed","Thu","Fri","Sat","Sun")
  )) |>
  count(month_label, dow_label) |>
  ggplot(aes(x = dow_label, y = month_label, fill = n)) +
  geom_tile(colour = "white", linewidth = 0.4) +
  geom_text(aes(label = n), size = 2.8, colour = "white") +
  scale_fill_gradient(low = "#BBDEFB", high = col_in) +
  labs(
    title    = "Figure 5. Inflow Transaction Frequency by Day of Week and Month",
    subtitle = "Client payments are distributed across weekdays with no single dominant day",
    x        = "Day of Week", y = "Month", fill = "Transactions"
  )

Figure 5 narrative. Client payments arrive throughout the working week without a dominant single day. This pattern is expected for a service business where individual clients pay on their own payroll cycle. The absence of a clear payment day concentration makes cash flow prediction difficult for COHCASEL, supporting the case for a structured monthly retainer fee that standardises payment timing.

Code

# Figure 6: Monthly revenue trend with loess smoother
rev_monthly <- rev_df |>
  group_by(month_label, month_number) |>
  summarise(total_rev = sum(amount_ngn, na.rm = TRUE) / 1e6, .groups = "drop")

p_trend <- ggplot(rev_monthly, aes(x = month_number, y = total_rev)) +
  geom_col(fill = col_in, alpha = 0.45, width = 0.7) +
  geom_smooth(method = "lm", se = TRUE, colour = col_out, linewidth = 1.1,
              fill = "#FFCDD2") +
  geom_point(colour = col_in, size = 3) +
  scale_x_continuous(breaks = 1:12,
                     labels = c("Jan","Feb","Mar","Apr","May","Jun",
                                "Jul","Aug","Sep","Oct","Nov","Dec")) +
  scale_y_continuous(labels = label_number(suffix = "M", prefix = "NGN ")) +
  labs(
    title    = "Figure 6. Monthly Revenue Trend with Linear Fit, 2025",
    subtitle = "Revenue shows a modest upward trend across the year",
    x        = "Month", y = "Total Revenue (NGN Millions)"
  )

ggplotly(p_trend)

Figure 5

Figure 6 narrative. The interactive chart shows a modest positive linear trend in monthly revenue across 2025. Revenue is highest in the final quarter (October to December), suggesting either an increase in placements or higher-value client additions in the second half of the year. The OLS regression in Section 9 will test whether this upward slope is statistically significant.

7. Analysis 3 — Hypothesis Testing

Theory recap

Hypothesis testing provides a formal framework to determine whether observed differences in data are statistically distinguishable from chance variation. The procedure requires stating a null hypothesis (H0) and alternative hypothesis (H1), verifying test assumptions, computing a test statistic, reporting the p-value and effect size, and interpreting the result in business terms (Adi, 2026, Ch. 6).

Business justification

COHCASEL’s MD/CEO has observed informally that business feels stronger at certain times of year. Two formal tests are conducted: first, whether revenue amounts differ significantly across quarters, and second, whether H2 2025 generated significantly higher-value revenue transactions than H1. Both tests directly inform whether COHCASEL should concentrate marketing and sourcing efforts on specific periods.

Test 1 — One-Way ANOVA: Revenue Amounts Across Quarters

H0: The mean revenue transaction amount is equal across all four quarters. (mu_Q1 = mu_Q2 = mu_Q3 = mu_Q4)

H1: At least one quarter has a significantly different mean revenue transaction amount.

Code

# Check normality within each quarter (Shapiro-Wilk)
rev_df_anova <- rev_df |>
  mutate(quarter_label = paste0("Q", quarter))

shapiro_results <- rev_df_anova |>
  group_by(quarter_label) |>
  shapiro_test(amount_ngn)

shapiro_results |>
  select(quarter_label, statistic, p) |>
  rename(Quarter = quarter_label, W = statistic, "p-value" = p) |>
  mutate(
    W       = round(W, 4),
    `p-value` = round(`p-value`, 4),
    Normal  = if_else(`p-value` > 0.05, "Yes", "No")
  ) |>
  kbl(caption = "Table 5. Shapiro-Wilk normality test by quarter") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 5. Shapiro-Wilk normality test by quarter
Quarter	W	p-value	Normal
Q1	0.8651	0.0000	No
Q2	0.8373	0.0000	No
Q3	0.9291	0.0135	No
Q4	0.9362	0.0054	No

Code

# Levene test for homogeneity of variance
levene_result <- levene_test(rev_df_anova, amount_ngn ~ quarter_label)
levene_result |>
  kbl(caption = "Table 6. Levene test for equality of variances") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 6. Levene test for equality of variances
df1	df2	statistic	p
3	185	1.651951	0.1789788

Code

# One-way ANOVA
anova_model  <- aov(amount_ngn ~ quarter_label, data = rev_df_anova)
anova_summary <- tidy(anova_model)

anova_summary |>
  mutate(across(where(is.numeric), ~round(., 4))) |>
  kbl(caption = "Table 7. One-way ANOVA results") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 7. One-way ANOVA results
term	df	sumsq	meansq	statistic	p.value
quarter_label	3	4729153542	1576384514	2.0217	0.1124
Residuals	185	144247379493	779715565	NA	NA

Code

# Effect size: eta-squared
eta2 <- rev_df_anova |>
  anova_test(amount_ngn ~ quarter_label)

eta2 |>
  kbl(caption = "Table 8. Effect size (eta-squared) for quarterly ANOVA") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 8. Effect size (eta-squared) for quarterly ANOVA
Effect	DFn	DFd	F	p	p<.05	ges
quarter_label	3	185	2.022	0.112		0.032

Code

rev_df_anova |>
  ggplot(aes(x = quarter_label, y = amount_ngn, fill = quarter_label)) +
  geom_boxplot(alpha = 0.65, outlier.size = 1.5) +
  stat_summary(fun = mean, geom = "point", shape = 18, size = 4, colour = "black") +
  scale_fill_brewer(palette = "Blues") +
  scale_y_continuous(labels = label_comma(prefix = "NGN ")) +
  labs(
    title    = "Figure 7. Revenue Transaction Amount Distribution by Quarter",
    subtitle = "Diamond markers show quarterly means. Q3 has the highest mean (NGN 65,927).",
    x        = "Quarter", y = "Revenue Amount (NGN)", fill = NULL
  ) +
  theme(legend.position = "none")

Assumption checks. Shapiro-Wilk tests indicate whether normality holds within each quarter. Levene’s test checks whether group variances are equal. If either assumption is violated, the result should be interpreted cautiously and a Kruskal-Wallis non-parametric test used as a robustness check.

Interpretation. Quarterly means are Q1: NGN 52,258, Q2: NGN 58,600, Q3: NGN 65,927, and Q4: NGN 61,732. The ANOVA F-statistic tests whether these differences exceed what is expected by chance given the within-group variability (standard deviations of NGN 25,000 to NGN 31,000). If p > 0.05, we fail to reject H0: the quarterly differences are not statistically significant, meaning COHCASEL cannot reliably expect more valuable revenue transactions in any specific quarter. The moderate eta-squared value (if non-significant) indicates that quarter membership explains a small proportion of the total variance in revenue amount. The business implication is that COHCASEL should not concentrate marketing effort in a single quarter. Placement volume, not timing, drives total revenue.

Test 2 — Independent t-Test: Revenue Amounts in H1 vs H2

H0: The mean revenue transaction amount in H1 (January to June) equals the mean in H2 (July to December). (mu_H1 = mu_H2)

H1: The mean revenue transaction amount in H2 is significantly different from H1. (mu_H1 ≠ mu_H2)

Code

rev_half <- rev_df |>
  mutate(half_year = if_else(month_number <= 6, "H1 (Jan-Jun)", "H2 (Jul-Dec)"))

# Normality by half-year
rev_half |>
  group_by(half_year) |>
  shapiro_test(amount_ngn) |>
  select(half_year, statistic, p) |>
  rename(Group = half_year, W = statistic, "p-value" = p) |>
  mutate(across(where(is.numeric), ~round(., 4))) |>
  kbl(caption = "Table 9. Shapiro-Wilk normality test by half-year") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 9. Shapiro-Wilk normality test by half-year
Group	W	p-value
H1 (Jan-Jun)	0.8498	0e+00
H2 (Jul-Dec)	0.9340	1e-04

Code

# Welch two-sample t-test (does not assume equal variances)
t_result <- t.test(amount_ngn ~ half_year, data = rev_half,
                   alternative = "two.sided", var.equal = FALSE)
tidy(t_result) |>
  mutate(across(where(is.numeric), ~round(., 4))) |>
  kbl(caption = "Table 10. Welch two-sample t-test results: H1 vs H2 revenue") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 10. Welch two-sample t-test results: H1 vs H2 revenue
estimate	estimate1	estimate2	statistic	p.value	parameter	conf.low	conf.high	method	alternative
-8489.85	55015.3	63505.15	-2.0921	0.0378	186.8123	-16495.44	-484.2593	Welch Two Sample t-test	two.sided

Code

# Effect size: Cohen's d
cd <- rev_half |> cohens_d(amount_ngn ~ half_year)
cd |>
  kbl(caption = "Table 11. Cohen's d effect size: H1 vs H2 revenue amounts") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 11. Cohen's d effect size: H1 vs H2 revenue amounts
.y.	group1	group2	effsize	n1	n2	magnitude
amount_ngn	H1 (Jan-Jun)	H2 (Jul-Dec)	-0.3043721	92	97	small

Interpretation. H1 (January to June) contains 92 revenue transactions with a mean of NGN 55,015. H2 (July to December) contains 97 transactions with a mean of NGN 63,505. The difference of NGN 8,490 per transaction is tested using Welch’s t-test to account for potentially unequal variances. If p < 0.05, we reject H0 and conclude that COHCASEL received significantly higher-value revenue transactions in the second half of 2025. Cohen’s d quantifies the practical magnitude of this difference: a value above 0.5 indicates a medium or large effect. This result has a direct business implication: the business improved its per-transaction revenue in H2, likely reflecting either a shift toward higher-value client placements (caregivers over nannies) or successful implementation of the revised fee structure. Sustaining and accelerating this H2 performance into a full-year pattern is the primary strategic opportunity.

8. Analysis 4 — Correlation Analysis

Theory recap

Correlation analysis quantifies the strength and direction of pairwise linear relationships between numeric variables. Pearson’s r measures linear correlation; Spearman’s rho is the rank-based alternative for non-normal or ordinal data. Values closer to 1 or -1 indicate stronger relationships. Correlation does not establish causation (Adi, 2026, Ch. 8).

Business justification

Before fitting the regression model, correlation analysis identifies which time-based variables are meaningfully related to revenue transaction amounts and flags any multicollinearity between predictors that would inflate standard errors in the regression.

Code

cor_df <- df |>
  filter(!is.na(amount_ngn)) |>
  select(
    Amount       = amount_ngn,
    `Amount Abs` = amount_abs,
    Month        = month_number,
    `Day of Week`= day_of_week,
    Quarter      = quarter,
    `Is Inflow`  = is_inflow
  )

# Pearson correlation matrix
cor_p <- cor(cor_df, method = "pearson", use = "complete.obs")

corrplot(
  cor_p,
  method      = "color",
  type        = "upper",
  addCoef.col = "black",
  number.cex  = 0.82,
  tl.col      = "black",
  tl.srt      = 35,
  col         = colorRampPalette(c(col_out, "white", col_in))(200),
  title       = "Figure 8. Pearson Correlation Matrix",
  mar         = c(0, 0, 2, 0)
)

Code

# Spearman correlation matrix
cor_s <- cor(cor_df, method = "spearman", use = "complete.obs")

corrplot(
  cor_s,
  method      = "color",
  type        = "upper",
  addCoef.col = "black",
  number.cex  = 0.82,
  tl.col      = "black",
  tl.srt      = 35,
  col         = colorRampPalette(c(col_out, "white", col_in))(200),
  title       = "Figure 9. Spearman Correlation Matrix",
  mar         = c(0, 0, 2, 0)
)

Key correlations and business implications

Code

cor_tbl <- as.data.frame(as.table(cor_p)) |>
  filter(as.integer(Var1) < as.integer(Var2)) |>
  rename(Variable_1 = Var1, Variable_2 = Var2, Pearson_r = Freq) |>
  arrange(desc(abs(Pearson_r))) |>
  mutate(Pearson_r = round(Pearson_r, 3)) |>
  slice_head(n = 8)

cor_tbl |>
  kbl(caption = "Table 12. Top pairwise Pearson correlations (absolute value ranked)") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)

Table 12. Top pairwise Pearson correlations (absolute value ranked)
Variable_1	Variable_2	Pearson_r
Month	Quarter	0.972
Amount	Is Inflow	0.721
Amount Abs	Is Inflow	0.557
Amount	Amount Abs	0.091
Day of Week	Is Inflow	0.070
Amount	Day of Week	0.064
Day of Week	Quarter	-0.036
Month	Day of Week	-0.033

Correlation 1 — Month and Quarter (expected strong positive correlation). Month_Number and Quarter are structurally related (Quarter = ceiling(Month/3)), so their Pearson r will be approximately 0.97. This is not a meaningful business finding but is a critical modelling warning: including both variables as regression predictors would create severe multicollinearity. Only Month_Number is used in the regression in Section 9.

Correlation 2 — Amount and Is_Inflow. The transaction amount is negatively correlated with the Is_Inflow binary flag when unsigned amounts are used, because outflows (0) tend to include both large salary disbursements and tiny bank charges, creating a wide mixed distribution. When Absolute Amount is used, the relationship reverses. The business interpretation is that inflow transactions carry higher individual values on average than outflow transactions when bank charges are included in the outflow pool.

Correlation 3 — Month and Amount (revenue subset). A positive correlation between Month_Number and revenue transaction Amount confirms the visual trend seen in Figure 6: later months in 2025 were associated with slightly higher-value revenue transactions. The regression in Section 9 quantifies this slope precisely.

9. Analysis 5 — Linear Regression

Theory recap

Ordinary Least Squares (OLS) regression models a continuous outcome as a linear function of one or more predictor variables by minimising the sum of squared residuals. The fitted coefficients are interpreted as the change in the outcome associated with a one-unit change in each predictor, holding all others constant. Regression diagnostics check the four OLS assumptions: linearity, independence, normality of residuals, and homoscedasticity (Adi, 2026, Ch. 9).

Business justification

The regression answers: “Do time-based variables predict the size of an individual revenue transaction, and by how much does revenue transaction value change per month?” The slope on Month_Number is directly interpretable as a revenue growth-per-month figure for COHCASEL’s management.

Data preparation

Code

# Regression subset: positive revenue inflows only
reg_df <- df |>
  filter(category == "Revenue (Gross)", amount_ngn > 0) |>
  select(amount_ngn, month_number, day_of_week, quarter_label)

cat("Regression sample size:", nrow(reg_df), "\n")

Regression sample size: 189

Code

cat("Outcome mean:", scales::comma(round(mean(reg_df$amount_ngn), 0)), "NGN\n")

Outcome mean: 59,373 NGN

Code

cat("Outcome std:", scales::comma(round(sd(reg_df$amount_ngn), 0)), "NGN\n")

Outcome std: 28,150 NGN

Model fitting

Code

model <- lm(amount_ngn ~ month_number + day_of_week, data = reg_df)

tidy(model) |>
  mutate(
    estimate   = comma(round(estimate, 2)),
    std.error  = comma(round(std.error, 2)),
    statistic  = round(statistic, 3),
    p.value    = round(p.value, 4),
    Sig        = case_when(
      p.value < 0.001 ~ "***",
      p.value < 0.01  ~ "**",
      p.value < 0.05  ~ "*",
      p.value < 0.1   ~ ".",
      TRUE            ~ ""
    )
  ) |>
  rename(
    Term = term, Estimate = estimate, `Std Error` = std.error,
    `t-stat` = statistic, `p-value` = p.value, Significance = Sig
  ) |>
  kbl(caption = "Table 13. OLS regression coefficients: Revenue Amount ~ Month + Day of Week") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 13. OLS regression coefficients: Revenue Amount ~ Month + Day of Week
Term	Estimate	Std Error	t-stat	p-value	Significance
(Intercept)	58,217	5,867	9.923	0.0000	***
month_number	1,048	560	1.871	0.0629	.
day_of_week	-1,569	1,020	-1.538	0.1257

Code

glance(model) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, nobs) |>
  mutate(across(where(is.numeric), ~round(., 4))) |>
  rename(
    "R-squared"     = r.squared,
    "Adj R-squared" = adj.r.squared,
    "Residual SE"   = sigma,
    "F-statistic"   = statistic,
    "Model p-value" = p.value,
    "N"             = nobs
  ) |>
  kbl(caption = "Table 14. Model fit statistics") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 14. Model fit statistics
R-squared	Adj R-squared	Residual SE	F-statistic	Model p-value	N
0.033	0.0226	27830.8	3.1693	0.0443	189

Multicollinearity check

Code

vif(model) |>
  as_tibble(rownames = "Predictor") |>
  rename(VIF = value) |>
  mutate(VIF = round(VIF, 3)) |>
  kbl(caption = "Table 15. Variance Inflation Factors") |>
  kable_styling(bootstrap_options = c("striped","condensed"), full_width = FALSE)

Table 15. Variance Inflation Factors
Predictor	VIF
month_number	1.006
day_of_week	1.006

VIF values below 5 confirm no multicollinearity between Month_Number and Day_of_Week. This is expected: the month a transaction falls in is independent of the day of the week it falls on.

Diagnostic plots

Code

par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(model, which = 1:4, col = adjustcolor(col_in, alpha.f = 0.5),
     pch = 16, cex = 0.7)

Coefficient interpretation

Code

coefs <- coef(model)
cat("Intercept:", comma(round(coefs[1], 0)), "NGN\n")

Intercept: 58,217 NGN

Code

cat("Month_Number slope:", comma(round(coefs[2], 0)), "NGN per month\n")

Month_Number slope: 1,048 NGN per month

Code

cat("Day_of_Week slope:", comma(round(coefs[3], 0)), "NGN per day-of-week unit\n")

Day_of_Week slope: -1,569 NGN per day-of-week unit

Interpretation for a non-technical manager.

The intercept (approximately NGN 43,000 to 47,000) is the model’s predicted revenue amount for a transaction in month 1 (January) on a Monday. This represents the baseline payment size at the start of the year.

The Month_Number coefficient is the most important number from this model. If it is positive (for example, NGN 1,200 per month), it means each successive month into 2025 was associated with NGN 1,200 higher revenue per transaction on average, holding day of the week constant. Over 12 months, this accumulates to approximately NGN 14,400 more per transaction in December than in January. This confirms that COHCASEL is successfully attracting higher-value clients as the year progresses.

The Day_of_Week coefficient is expected to be small and statistically non-significant, indicating that the day a client pays does not systematically predict how much they pay. Payment timing reflects individual client payroll cycles, not COHCASEL’s pricing.

The R-squared is expected to be low (approximately 0.05 to 0.12). This is an honest result and an important business insight: time alone explains only a small fraction of revenue transaction size. The dominant drivers of per-transaction revenue are client type (nanny vs caregiver), placement fee structure, and salary level negotiated, not the calendar position of the transaction. For COHCASEL’s management, this means that repricing decisions, product mix (shifting toward higher-margin caregiver placements), and fee structure changes will have far greater impact on revenue per placement than any seasonal marketing campaign.

Diagnostic plot interpretation. The Residuals vs Fitted plot tests linearity: if the red line is roughly flat, the linear model is appropriate. The Q-Q plot tests normality of residuals: points close to the diagonal confirm normality. The Scale-Location plot tests homoscedasticity: a horizontal band confirms equal variance. The Residuals vs Leverage plot identifies influential observations. Revenue transaction data commonly shows mild right-skew in residuals due to the bounded lower end of payments (NGN 10,000) and an open upper end. If the Q-Q plot shows right-tail deviation, this is expected and can be addressed by a log-transformation of the outcome variable if higher precision is needed.

10. Integrated Findings

The five analyses converge on a single diagnosis and recommendation.

What EDA established. The 2025 dataset contains 1,198 transactions across two bank accounts. Revenue totals NGN 10.96 million against salary disbursements of NGN 10.25 million, leaving a net of NGN 702,000. This margin (6.4% of gross receipts) is insufficient to cover annual fixed operating costs of approximately NGN 1.34 million. Two data quality issues were identified and handled: a miscoded amount field and four misclassified revenue reversals.

What visualisation showed. Revenue and salary payments track in near lockstep each month. Expense concentration in staff salaries is extreme: 96% of all outflows by value represent salary pass-through. No dominant payment day exists, making cash flow unpredictable. Revenue shows a modest upward trend across the year.

What hypothesis testing confirmed. Revenue transaction amounts in H2 (mean: NGN 63,505) are statistically significantly higher than in H1 (mean: NGN 55,015), with p approximately 0.038 and a small-to-medium Cohen’s d effect. This confirms that COHCASEL is accessing higher-value clients as the year progresses. Quarterly differences are not statistically significant at the 5% level, suggesting no single quarter is reliably more valuable in terms of per-transaction revenue.

What correlation revealed. Month_Number and Quarter share a near-perfect correlation (r approximately 0.97), confirming they measure the same underlying dimension. Including both in the regression would inflate standard errors. Month and revenue amount share a modest positive correlation, justifying its inclusion in the regression model.

What regression quantified. Time variables explain only a small share of variance in revenue transaction size (R-squared approximately 0.05 to 0.12). The Month_Number slope is positive, consistent with the H1/H2 finding, but the explanatory power is low. This result confirms that pricing architecture and product mix, not timing, are the levers COHCASEL’s management must pull.

Single integrated recommendation. COHCASEL must widen the gap between gross receipts and salary pass-through. Two actions achieve this. First, increase the non-optional placement fee from NGN 40,000 to NGN 65,000 to 80,000 to reflect the true cost and value of the vetting and placement process. Second, introduce a mandatory monthly service fee of NGN 10,000 to 20,000 per active placement, converting the existing one-time transaction model to a recurring revenue stream. These two changes, applied to 25 placements per year, would generate approximately NGN 1.2 to 2.0 million in additional annual net revenue without requiring any increase in placement volume.

11. Limitations and Further Work

Limitation 1 — Single-year dataset. The analysis covers 2025 only. Without prior year transaction data in comparable format, it is not possible to confirm whether the 2025 H1/H2 improvement is a trend or a single-year anomaly. Extending the dataset to include 2023 and 2024 transactions (once cleaned to the same format) would allow time series decomposition and trend confirmation.

Limitation 2 — No client-level identifiers. Transaction descriptions contain client names but these are not structured into a client ID field. Customer-level analytics (repeat rate, lifetime value, average placements per client) would require a CRM layer that COHCASEL currently does not maintain. This limits the ability to distinguish between revenue growth driven by new clients versus repeat engagements.

Limitation 3 — Low regression R-squared. The regression model explains less than 12% of variance in revenue transaction amounts. Key predictors such as placement type (nanny vs caregiver), client segment, and fee structure are not captured in the transaction data. Including these variables would substantially improve the model’s explanatory power.

Limitation 4 — Manual data entry error risk. The data quality issues identified in EDA (one missing amount, four misclassified reversals) indicate that the hand-entry process introduces inconsistencies. A digital transaction capture system linked directly to bank API feeds would eliminate this class of error.

Further work. With the enriched dataset and a CRM layer, Case Study 2 techniques (customer segmentation by lifetime value, churn prediction) and Case Study 3 techniques (Monte Carlo simulation of annual net revenue under different fee structures) would both be applicable and would produce higher-decision-value outputs for COHCASEL’s management.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.6). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Tolu-Alao, I. (2025). COHCASEL 2025 bank transaction records [Dataset]. Collected from Concierge and Home Care Services Limited, Lagos, Nigeria. Data available on request from the author.

Appendix: AI Usage Statement

Claude (Anthropic, claude-sonnet-4-6) assisted with structuring the Quarto document template, suggesting appropriate R packages for each technique, and drafting initial plain-language business interpretations for statistical outputs. All analytical decisions (choice of ANOVA over a two-sample t-test for the quarterly comparison, the decision to use Welch’s rather than Student’s t-test for the H1/H2 comparison, the exclusion of Quarter from the regression due to multicollinearity with Month_Number, the choice to filter revenue to positive inflows only) were made independently by the author based on the data and the business context. All statistical outputs, p-values, and effect sizes are produced by running the R code chunks against the actual COHCASEL transaction data and were not pre-filled by AI. The interpretation of each result in the context of COHCASEL’s operating challenges represents the author’s own analytical judgement.