Exploratory & Inferential Analytics: Dalos-Pro Solutions

Author

Chidalu Okabekwa

Published

May 9, 2026


1. Executive Summary

Dalos-Pro Solutions is a professional cleaning and facility maintenance company based in Lekki, Lagos, Nigeria, founded in October 2023. This report applies five analytical techniques — Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — to a primary dataset of 100 completed job transactions recorded between April 2024 and November 2025. Variables captured per job include date, service type, number of janitors deployed, job duration, revenue charged, and repeat-client status. Two additional variables were derived: a season indicator (Wet: September–January; Dry: February–August) and a service category grouping (four categories). Key findings show that post-construction and renovation jobs generate the highest mean revenue (₦449,375), that revenue differences across service categories are statistically significant (one-way ANOVA, p < 0.05), and that job duration and team size are the strongest operational predictors of revenue. Contrary to the perceived seasonal narrative, the wet/dry season revenue difference was not statistically significant in this dataset — a finding that redirects strategic focus from seasonal timing to service-mix optimisation. The report recommends prioritising post-construction and deep cleaning services and improving quoting accuracy on duration and team size to sustain and grow profitability.


2. Professional Disclosure

Job Title: Chief Executive Officer (CEO)

Organisation: Dalos-Pro Solutions — Professional Cleaning & Facility Maintenance Company, Lekki, Lagos, Nigeria.

Sector: Facilities Management / Cleaning Services (SME)

Technique Justifications:

  • Exploratory Data Analysis (EDA): As CEO, I regularly review job records to monitor revenue performance across service lines. Formal EDA extends this routine review by quantifying distributions, detecting outliers systematically, and identifying data quality issues that manual inspection misses. EDA provides the evidentiary foundation for every subsequent analytical decision in this report — it is the first step a responsible data scientist takes before drawing conclusions from any dataset.

  • Data Visualisation: Communicating business performance to staff, potential partners, and investors requires clear and accurate visual storytelling. Charts of revenue distributions by service type, seasonal job volumes, and operational scatter plots directly inform Dalos-Pro’s marketing calendar and capacity planning. The grammar-of-graphics approach (ggplot2) ensures that every design choice — chart type, axis, colour — is deliberate and aligned with the message being communicated.

  • Hypothesis Testing: Dalos-Pro operates on a widely held belief that the wet season (September–January) drives higher revenue than the dry season. Formal hypothesis testing — using a Welch two-sample t-test and one-way ANOVA — determines whether these observed differences are statistically significant or within the range of random variation. This transforms intuition into an evidence-based foundation for pricing and budgeting decisions.

  • Correlation Analysis: Understanding which operational inputs (team size, duration) are most strongly associated with revenue enables smarter resource allocation and pricing. Pearson and Spearman correlation matrices identify the strongest pairwise relationships in the data, highlight multicollinearity risks before regression, and reveal whether associations hold under different distributional assumptions.

  • Linear Regression (OLS): A multiple regression model that predicts job revenue from observable operational inputs — service type, season, duration, team size, repeat-client status — provides a practical, data-driven quoting tool. This directly addresses Dalos-Pro’s core business challenge: finding the right balance between competitive pricing and premium quality. Each regression coefficient translates directly into a concrete pricing or resource-allocation decision.


3. Data Collection & Sampling

Source: Primary data extracted from Dalos-Pro Solutions’ internal job records, including invoice logs, WhatsApp booking confirmations, and operational tracking sheets maintained by the administrative team.

Collection Method: Manual transcription and collation of job-level records into a structured Microsoft Excel spreadsheet by the author in their capacity as CEO. Each row represents one completed, invoiced cleaning job.

Sampling Frame: All completed and invoiced jobs from April 2024 to November 2025 for which full records were available across all six variables.

Sample Size: 100 job records — meets the minimum requirement of 100 observations with at least 5 variables.

Time Period Covered: 9 April 2024 – 15 November 2025 (approximately 19 months).

Variables Collected:

Variable Type Description
Job_date Date Date the cleaning job was completed
Service_type Categorical Specific service(s) rendered (15 raw types)
Num_janitors Numeric Number of janitors deployed on the job
Job_duration_hours Numeric Total hours spent on the job
Revenue Numeric Total amount charged to client (NGN)
Is_repeat_client Categorical Whether client has previously used Dalos-Pro

Derived Variables:

Variable Type Description
season Categorical Wet (Sep–Jan) or Dry (Feb–Aug), derived from date
service_category Categorical 15 raw types grouped into 4 analytical categories
rev_per_jan_hour Numeric Revenue / (Num_janitors x Duration) — efficiency

Sampling Justification: A census approach was adopted — all available completed jobs in the period were included — because the population size (approximately 100 completed jobs with full records) was small enough to capture entirely, maximising statistical power and eliminating sampling bias. The 100 observations meet the assessment minimum.

Ethical Notes: No client personal data is included. Jobs are identified by date and service type only. Data is proprietary to Dalos-Pro Solutions and is available on request from the author for academic verification. Revenue figures are commercially sensitive; only aggregate statistics and model outputs are published in this document.


4. Data Description & EDA

4.1 Load Libraries

Code
# Run once to install all required packages:
# install.packages(c("tidyverse", "readxl", "ggplot2", "ggcorrplot",
#                    "car", "lmtest", "effectsize", "knitr", "scales",
#                    "patchwork", "lubridate", "moments", "broom"))

library(tidyverse)
library(readxl)
library(ggplot2)
library(ggcorrplot)
library(car)
library(lmtest)
library(effectsize)
library(knitr)
library(scales)
library(patchwork)
library(lubridate)
library(moments)
library(broom)

4.2 Load & Prepare Data

Code
# Excel file expected in the Data/ subfolder relative to this .qmd file
dalos_raw <- read_excel("Data/Dalos_dataset.xlsx")

dalos <- dalos_raw %>%
  rename(
    job_date         = Job_date,
    service_type     = Service_type,
    num_janitors     = Num_janitors,
    duration_hours   = Job_duration_hours,
    revenue          = Revenue,
    is_repeat_client = Is_repeat_client
  ) %>%
  mutate(
    job_date = as.Date(job_date),

    # Season derived from month
    season = factor(
      ifelse(month(job_date) %in% c(9, 10, 11, 12, 1), "Wet", "Dry"),
      levels = c("Dry", "Wet")
    ),

    # Group 15 raw service types into 4 analytical categories
    service_category = factor(case_when(
      str_detect(service_type, "Post-Construction|Renovation") ~
        "Post-Construction/Renovation",
      str_detect(service_type, "Deep Cleaning") ~
        "Deep Cleaning",
      str_detect(service_type, "Upholstery") &
        !str_detect(service_type, "Deep Cleaning") ~
        "Upholstery",
      TRUE ~
        "Facility & Specialist"
    )),

    is_repeat_client = factor(is_repeat_client, levels = c("No", "Yes")),

    # Derived efficiency metric
    rev_per_jan_hour = round(revenue / (num_janitors * duration_hours), 0)
  )

cat("Dataset ready:", nrow(dalos), "rows x", ncol(dalos), "columns\n")
Dataset ready: 100 rows x 9 columns
Code
cat("Date range   :", format(min(dalos$job_date)),
    "to", format(max(dalos$job_date)), "\n")
Date range   : 2024-04-09 to 2025-11-15 
Code
cat("Missing values:", sum(is.na(dalos)), "\n")
Missing values: 0 

4.3 Data Structure

Code
glimpse(dalos)
Rows: 100
Columns: 9
$ job_date         <date> 2024-04-09, 2024-04-10, 2024-04-17, 2024-05-10, 2024…
$ service_type     <chr> "Deep Cleaning + Maintenance", "Upholstery", "Upholst…
$ num_janitors     <dbl> 4, 3, 3, 6, 5, 6, 7, 6, 5, 3, 5, 3, 3, 4, 5, 5, 3, 3,…
$ duration_hours   <dbl> 8.0, 5.0, 5.0, 10.0, 8.5, 11.0, 17.0, 13.0, 9.5, 4.5,…
$ revenue          <dbl> 315000, 75000, 75000, 350000, 320000, 340000, 460000,…
$ is_repeat_client <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, Y…
$ season           <fct> Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry…
$ service_category <fct> Deep Cleaning, Upholstery, Upholstery, Deep Cleaning,…
$ rev_per_jan_hour <dbl> 9844, 5000, 5000, 5833, 7529, 5152, 3866, 5641, 6737,…

4.4 Summary Statistics — Numeric Variables

Code
dalos %>%
  select(num_janitors, duration_hours, revenue, rev_per_jan_hour) %>%
  summarise(across(everything(), list(
    Mean   = ~round(mean(.), 1),
    Median = ~round(median(.), 1),
    SD     = ~round(sd(.), 1),
    Min    = ~min(.),
    Max    = ~max(.)
  ))) %>%
  pivot_longer(everything(),
               names_to  = c("Variable", "Stat"),
               names_sep = "_(?=[^_]+$)") %>%
  pivot_wider(names_from = Stat, values_from = value) %>%
  kable(caption = "Descriptive Statistics — Numeric Variables")
Descriptive Statistics — Numeric Variables
Variable Mean Median SD Min Max
num_janitors 4.0 4.0 1.5 2 9
duration_hours 7.6 6.8 3.9 2 22
revenue 170540.0 120000.0 132880.5 20000 550000
rev_per_jan_hour 5351.1 5168.5 1888.8 1641 11667

4.5 Frequency Tables — Categorical Variables

Code
cat("── Service Category ──────────────────────────────\n")
── Service Category ──────────────────────────────
Code
dalos %>%
  count(service_category, sort = TRUE) %>%
  mutate(Pct = round(n / sum(n) * 100, 1)) %>%
  kable(col.names = c("Service Category", "Count", "%"))
Service Category Count %
Upholstery 40 40
Deep Cleaning 38 38
Facility & Specialist 14 14
Post-Construction/Renovation 8 8
Code
cat("\n── Season ────────────────────────────────────────\n")

── Season ────────────────────────────────────────
Code
dalos %>%
  count(season) %>%
  mutate(Pct = round(n / sum(n) * 100, 1)) %>%
  kable(col.names = c("Season", "Count", "%"))
Season Count %
Dry 51 51
Wet 49 49
Code
cat("\n── Repeat Client ─────────────────────────────────\n")

── Repeat Client ─────────────────────────────────
Code
dalos %>%
  count(is_repeat_client) %>%
  mutate(Pct = round(n / sum(n) * 100, 1)) %>%
  kable(col.names = c("Repeat Client", "Count", "%"))
Repeat Client Count %
No 70 70
Yes 30 30

4.6 Data Quality: Missing Values, Outliers & Skewness

Code
# Missing values
cat("── Missing Values ────────────────────────────────\n")
── Missing Values ────────────────────────────────
Code
colSums(is.na(dalos)) %>%
  as.data.frame() %>%
  rename("Missing" = ".") %>%
  kable()
Missing
job_date 0
service_type 0
num_janitors 0
duration_hours 0
revenue 0
is_repeat_client 0
season 0
service_category 0
rev_per_jan_hour 0
Code
# Outliers (IQR method)
detect_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

cat("\n── Outlier Counts (IQR Method) ───────────────────\n")

── Outlier Counts (IQR Method) ───────────────────
Code
dalos %>%
  select(num_janitors, duration_hours, revenue, rev_per_jan_hour) %>%
  summarise(across(everything(), detect_outliers)) %>%
  kable(caption = "Outlier counts per numeric variable")
Outlier counts per numeric variable
num_janitors duration_hours revenue rev_per_jan_hour
1 2 2 8
Code
# Skewness
cat("\n── Skewness ──────────────────────────────────────\n")

── Skewness ──────────────────────────────────────
Code
dalos %>%
  select(num_janitors, duration_hours, revenue, rev_per_jan_hour) %>%
  summarise(across(everything(), ~round(skewness(.), 3))) %>%
  kable(caption = "Skewness — |value| > 1 = high skew")
Skewness — |value| > 1 = high skew
num_janitors duration_hours revenue rev_per_jan_hour
0.862 1.109 1.084 0.901

Data Quality Finding 1 — No missing values: All 100 job records are complete across every variable, reflecting consistent administrative record-keeping.

Data Quality Finding 2 — Revenue right-skew: Revenue is positively skewed, driven by a small number of high-value post-construction jobs (max ₦550,000) pulling the mean (₦170,540) above the median (₦120,000). This is operationally expected and is noted in the regression diagnostics.
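
One quick way to gauge how much of this skew a log transform would remove is to compare skewness on the raw and log scales. The sketch below uses the already-loaded moments package and is illustrative only, not part of the main pipeline.

Code
# Illustrative check: skewness of raw vs log-transformed revenue
round(skewness(dalos$revenue), 3)        # raw scale (right-skewed)
round(skewness(log(dalos$revenue)), 3)   # log scale, expected closer to 0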


5. Data Visualisation

Five plots, one story: at Dalos-Pro, service category — not season — is the primary driver of revenue. Post-construction commands roughly five times the revenue of upholstery per job. Duration is the clearest operational predictor of earnings.

Code
ggplot(dalos,
       aes(x    = reorder(service_category, revenue, median),
           y    = revenue,
           fill = service_category)) +
  geom_boxplot(alpha          = 0.85,
               outlier.colour = "#E63946",
               outlier.shape  = 16,
               outlier.size   = 2.5) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip() +
  labs(
    title   = "Plot 1 — Revenue Distribution by Service Category",
    subtitle = "Post-Construction/Renovation dominates revenue despite fewest bookings",
    x = NULL, y = "Revenue (NGN)",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

Code
dalos %>%
  mutate(month = floor_date(job_date, "month")) %>%
  group_by(month, season) %>%
  summarise(total_rev = sum(revenue), .groups = "drop") %>%
  ggplot(aes(x = month, y = total_rev, fill = season)) +
  geom_col(alpha = 0.85) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_fill_manual(values = c(Dry = "#F4A261", Wet = "#2A9D8F")) +
  labs(
    title    = "Plot 2 — Monthly Total Revenue by Season",
    subtitle = "Revenue is spread across both seasons with no dominant seasonal peak",
    x = "Month", y = "Total Revenue (NGN)", fill = "Season",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Code
ggplot(dalos,
       aes(x = is_repeat_client, y = revenue, fill = is_repeat_client)) +
  geom_violin(alpha = 0.70, trim = FALSE) +
  geom_boxplot(width = 0.12, fill = "white", outlier.shape = NA) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_fill_manual(values = c(No = "#E63946", Yes = "#457B9D")) +
  labs(
    title    = "Plot 3 — Revenue: Repeat vs. New Clients",
    subtitle = "Repeat clients show a broader revenue spread, indicating more complex jobs",
    x = "Repeat Client", y = "Revenue (NGN)",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

Code
dalos %>%
  count(service_category, season) %>%
  ggplot(aes(x = season, y = service_category, fill = n)) +
  geom_tile(colour = "white", linewidth = 1.2) +
  geom_text(aes(label = n), fontface = "bold", size = 5) +
  scale_fill_gradient(low = "#D8F3DC", high = "#1B4332") +
  labs(
    title    = "Plot 4 — Job Volume Heatmap: Service Category x Season",
    subtitle = "Upholstery and Deep Cleaning dominate volumes in both seasons",
    x = "Season", y = "Service Category", fill = "Job Count",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Code
ggplot(dalos,
       aes(x = duration_hours, y = revenue, colour = service_category)) +
  geom_point(alpha = 0.70, size = 2.5) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.9) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_colour_brewer(palette = "Set1") +
  labs(
    title    = "Plot 5 — Job Duration vs. Revenue by Service Category",
    subtitle = "Longer jobs earn more across all categories — duration is the key revenue lever",
    x = "Duration (Hours)", y = "Revenue (NGN)", colour = "Service Category",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))
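
The patchwork package loaded in Section 4.1 is designed for combining ggplot objects into a single figure for presentations. A minimal sketch follows; it assumes the boxplot and scatter plot above have been assigned to objects (the chunks above print them directly), so the object names p1 and p5 are illustrative only.

Code
# Illustrative only: assumes Plot 1 and Plot 5 were saved as p1 and p5,
# e.g. p1 <- ggplot(dalos, aes(...)) + ...
combined <- p1 / p5 +
  plot_annotation(
    title   = "Dalos-Pro Revenue Drivers",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  )
combined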


6. Hypothesis Testing

6.1 Hypothesis 1 — Wet vs. Dry Season Revenue

Business context: Dalos-Pro’s planning calendar assumes the wet season generates higher per-job revenue. This test formally evaluates that assumption.

H₀: Mean revenue (Wet) = Mean revenue (Dry)

H₁: Mean revenue (Wet) ≠ Mean revenue (Dry)

Code
dalos %>%
  group_by(season) %>%
  summarise(
    n      = n(),
    Mean   = round(mean(revenue), 0),
    Median = round(median(revenue), 0),
    SD     = round(sd(revenue), 0)
  ) %>%
  kable(caption = "Revenue by Season (NGN)")
Revenue by Season (NGN)
season n Mean Median SD
Dry 51 171137 125000 123424
Wet 49 169918 102000 143349
Code
wet_rev <- dalos$revenue[dalos$season == "Wet"]
dry_rev <- dalos$revenue[dalos$season == "Dry"]

cat("── Shapiro-Wilk Normality ────────────────────────\n")
── Shapiro-Wilk Normality ────────────────────────
Code
cat("Wet: \n"); print(shapiro.test(wet_rev))
Wet: 

    Shapiro-Wilk normality test

data:  wet_rev
W = 0.81993, p-value = 3.139e-06
Code
cat("Dry: \n"); print(shapiro.test(dry_rev))
Dry: 

    Shapiro-Wilk normality test

data:  dry_rev
W = 0.89178, p-value = 0.0002261
Code
cat("\n── Levene's Test (Variance Equality) ────────────\n")

── Levene's Test (Variance Equality) ────────────
Code
print(leveneTest(revenue ~ season, data = dalos))
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  0.0777 0.7811
      98               
Code
t1 <- t.test(revenue ~ season, data = dalos, var.equal = FALSE)
print(t1)

    Welch Two Sample t-test

data:  revenue by season
t = 0.045486, df = 94.637, p-value = 0.9638
alternative hypothesis: true difference in means between group Dry and group Wet is not equal to 0
95 percent confidence interval:
 -51981.93  54419.71
sample estimates:
mean in group Dry mean in group Wet 
         171137.3          169918.4 
Code
cat("\n── Cohen's d (Effect Size) ───────────────────────\n")

── Cohen's d (Effect Size) ───────────────────────
Code
print(cohens_d(revenue ~ season, data = dalos))
Cohen's d |        95% CI
-------------------------
9.13e-03  | [-0.38, 0.40]

- Estimated using pooled SD.
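
Because the Shapiro-Wilk tests above reject normality in both seasons, a distribution-free Wilcoxon rank-sum test is a sensible robustness check on the Welch result. A minimal sketch (output not reproduced here):

Code
# Robustness check: rank-based comparison of revenue by season
wilcox.test(revenue ~ season, data = dalos)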

Result: Welch t(94.6) = 0.045, p = 0.964 — fail to reject H₀.

Business interpretation: The revenue difference between wet and dry seasons is not statistically significant, and the effect size is negligible (Cohen's d ≈ 0.01). Mean revenue is nearly identical — ₦171,137 (Dry) vs ₦169,918 (Wet) — and job counts are also roughly even across seasons (51 dry, 49 wet), so the perceived wet-season advantage is not supported by this dataset. This finding redirects strategic focus from seasonal timing to service mix: what type of job is booked matters far more than when it is booked.


6.2 Hypothesis 2 — Revenue Across Service Categories

Business context: Do the four service categories earn meaningfully different revenues, justifying differentiated pricing and marketing investment?

H₀: Mean revenue is equal across all four service categories

H₁: At least one category has a significantly different mean revenue

Code
dalos %>%
  group_by(service_category) %>%
  summarise(
    n      = n(),
    Mean   = round(mean(revenue), 0),
    Median = round(median(revenue), 0),
    SD     = round(sd(revenue), 0)
  ) %>%
  arrange(desc(Mean)) %>%
  kable(caption = "Revenue by Service Category (NGN)")
Revenue by Service Category (NGN)
service_category n Mean Median SD
Post-Construction/Renovation 8 449375 450000 64611
Deep Cleaning 38 224737 220000 112039
Facility & Specialist 14 111429 97500 78310
Upholstery 40 83975 70000 48677
Code
cat("── Shapiro-Wilk per Category ─────────────────────\n")
── Shapiro-Wilk per Category ─────────────────────
Code
dalos %>%
  group_by(service_category) %>%
  summarise(
    n         = n(),
    shapiro_p = ifelse(n() >= 3,
                       round(shapiro.test(revenue)$p.value, 4),
                       NA_real_)
  ) %>%
  mutate(note = ifelse(is.na(shapiro_p), "n < 3, skipped", "")) %>%
  kable()
service_category n shapiro_p note
Deep Cleaning 38 0.2512
Facility & Specialist 14 0.0363
Post-Construction/Renovation 8 0.5256
Upholstery 40 0.0000
Code
# Filter groups with n >= 3 for Levene's and ANOVA
dalos_lev <- dalos %>%
  group_by(service_category) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  mutate(service_category = droplevels(service_category))

cat("\n── Levene's Test ─────────────────────────────────\n")

── Levene's Test ─────────────────────────────────
Code
print(leveneTest(revenue ~ service_category, data = dalos_lev))
Levene's Test for Homogeneity of Variance (center = median)
      Df F value    Pr(>F)    
group  3  9.7395 1.136e-05 ***
      96                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
anova1 <- aov(revenue ~ service_category, data = dalos_lev)
summary(anova1)
                 Df    Sum Sq   Mean Sq F value Pr(>F)    
service_category  3 1.082e+12 3.608e+11   52.02 <2e-16 ***
Residuals        96 6.658e+11 6.935e+09                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
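
Levene's test above indicates unequal variances, and two categories fail the Shapiro-Wilk check, so it is worth confirming the ANOVA conclusion with tests that relax those assumptions. A minimal sketch (output not reproduced here):

Code
# Welch one-way ANOVA: does not assume equal variances across categories
oneway.test(revenue ~ service_category, data = dalos_lev)

# Kruskal-Wallis: rank-based, no normality assumption
kruskal.test(revenue ~ service_category, data = dalos_lev)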
Code
cat("\n── Eta-squared (Effect Size) ─────────────────────\n")

── Eta-squared (Effect Size) ─────────────────────
Code
print(eta_squared(anova1))
# Effect Size for ANOVA

Parameter        | Eta2 |       95% CI
--------------------------------------
service_category | 0.62 | [0.52, 1.00]

- One-sided CIs: upper bound fixed at [1.00].
Code
cat("\n── Tukey HSD Post-Hoc ────────────────────────────\n")

── Tukey HSD Post-Hoc ────────────────────────────
Code
tukey_res <- TukeyHSD(anova1) %>%
  tidy() %>%
  filter(adj.p.value < 0.05) %>%
  select(contrast, estimate, conf.low, conf.high, adj.p.value) %>%
  mutate(across(where(is.numeric), ~round(., 0)))

if (nrow(tukey_res) == 0) {
  cat("No pairs reach p < 0.05 after Tukey adjustment.\n")
} else {
  kable(tukey_res, caption = "Significant pairwise differences (p < 0.05)")
}
Significant pairwise differences (p < 0.05)
contrast estimate conf.low conf.high adj.p.value
Facility & Specialist-Deep Cleaning -113308 -181383 -45233 0
Post-Construction/Renovation-Deep Cleaning 224638 139938 309338 0
Upholstery-Deep Cleaning -140762 -190087 -91437 0
Post-Construction/Renovation-Facility & Specialist 337946 241443 434450 0
Upholstery-Post-Construction/Renovation -365400 -449731 -281069 0
Code
dalos %>%
  group_by(service_category) %>%
  summarise(mean_rev = mean(revenue),
            se       = sd(revenue) / sqrt(n())) %>%
  ggplot(aes(x = reorder(service_category, mean_rev),
             y = mean_rev, fill = service_category)) +
  geom_col(alpha = 0.85) +
  geom_errorbar(aes(ymin = mean_rev - 1.96 * se,
                    ymax = mean_rev + 1.96 * se),
                width = 0.3, colour = "grey30") +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip() +
  labs(
    title    = "Mean Revenue by Service Category (95% CI)",
    subtitle = "Non-overlapping confidence intervals confirm statistically significant differences",
    x = NULL, y = "Mean Revenue (NGN)",
    caption  = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

Result: ANOVA F(3, 96) = 52.0, p < 0.001 — reject H₀.

Business interpretation: Revenue differences across service categories are statistically significant, and the effect is large (η² = 0.62). Post-construction jobs earn approximately ₦365,000 more per booking than upholstery, and deep cleaning earns approximately ₦141,000 more; the only pair Tukey's test does not separate is Facility & Specialist vs Upholstery. Action: shift marketing and capacity allocation toward post-construction and deep cleaning — the data shows these are not just more valuable, but significantly so.


7. Correlation Analysis

Code
cor_vars <- dalos %>%
  select(num_janitors, duration_hours, revenue, rev_per_jan_hour)

cor_mat <- cor(cor_vars, method = "pearson")

cat("── Pearson Correlation Matrix ────────────────────\n")
── Pearson Correlation Matrix ────────────────────
Code
round(cor_mat, 3) %>%
  kable(caption = "Pearson Correlation Coefficients")
Pearson Correlation Coefficients
num_janitors duration_hours revenue rev_per_jan_hour
num_janitors 1.000 0.922 0.846 -0.284
duration_hours 0.922 1.000 0.849 -0.337
revenue 0.846 0.849 1.000 0.109
rev_per_jan_hour -0.284 -0.337 0.109 1.000
Code
cor_p <- cor_pmat(cor_vars)

ggcorrplot(cor_mat,
           hc.order  = TRUE,
           type      = "lower",
           lab       = TRUE,
           lab_size  = 5,
           p.mat     = cor_p,
           sig.level = 0.05,
           insig     = "blank",
           colors    = c("#E63946", "white", "#2A9D8F"),
           title     = "Correlation Heatmap — Dalos-Pro Numeric Variables",
           ggtheme   = theme_minimal(base_size = 12)) +
  labs(caption = paste("Blank cells = not significant (p > 0.05)",
                       "Source: Dalos-Pro Solutions Job Records",
                       sep = "\n"))

Code
cat("── Spearman Correlation (Robustness Check) ───────\n")
── Spearman Correlation (Robustness Check) ───────
Code
cor(cor_vars, method = "spearman") %>%
  round(3) %>%
  kable(caption = "Spearman Rank Correlations")
Spearman Rank Correlations
num_janitors duration_hours revenue rev_per_jan_hour
num_janitors 1.000 0.919 0.899 -0.310
duration_hours 0.919 1.000 0.886 -0.400
revenue 0.899 0.886 1.000 -0.006
rev_per_jan_hour -0.310 -0.400 -0.006 1.000

Top 3 Correlations and Business Implications:

  1. Duration ↔ Revenue (r = 0.85): Each additional hour of work is associated with meaningfully higher revenue. Accurate time estimation when quoting is the primary margin-protection lever for Dalos-Pro.

  2. Num Janitors ↔ Revenue (r = 0.85): Larger teams handle bigger, higher-value jobs. Each extra janitor raises both revenue and wage cost — the net margin must be explicitly priced into every multi-person quote.

  3. Num Janitors ↔ Duration (r = 0.92, the strongest pair in the matrix): Bigger jobs are both longer and larger in team size, a compounding cost structure. Post-construction jobs drive both variables up together, which justifies their premium pricing.

Correlation vs causation: These reflect job complexity structure, not direct cause-and-effect. The regression below isolates each variable’s independent contribution to revenue.
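
To attach a formal confidence interval to the headline duration-revenue association, a single cor.test call is sufficient. A minimal sketch (output not reproduced here):

Code
# Pearson correlation between job duration and revenue, with 95% CI
cor.test(dalos$duration_hours, dalos$revenue, method = "pearson")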


8. Linear Regression

Theory recap: Ordinary Least Squares (OLS) estimates the linear relationship between a continuous outcome (revenue) and a set of predictors. Each coefficient β represents the expected change in revenue for a one-unit increase in a predictor, holding all others constant.

Business justification: A fitted regression model directly solves Dalos-Pro’s quoting problem — given the planned inputs for any new job, the model generates a predicted revenue with a 95% prediction interval, replacing intuition with a data-driven price floor.

8.1 Model Fitting

Code
dalos$season           <- relevel(dalos$season,           ref = "Dry")
dalos$service_category <- relevel(dalos$service_category, ref = "Upholstery")
dalos$is_repeat_client <- relevel(dalos$is_repeat_client, ref = "No")
Code
model <- lm(revenue ~ duration_hours + num_janitors +
              season + service_category + is_repeat_client,
            data = dalos)

summary(model)

Call:
lm(formula = revenue ~ duration_hours + num_janitors + season + 
    service_category + is_repeat_client, data = dalos)

Residuals:
    Min      1Q  Median      3Q     Max 
-230881  -22777   -1374   31452  277651 

Coefficients:
                                             Estimate Std. Error t value
(Intercept)                                  -80755.5    23331.9  -3.461
duration_hours                                 -524.6     5764.1  -0.091
num_janitors                                  52508.2    12104.7   4.338
seasonWet                                     13117.4    12986.5   1.010
service_categoryDeep Cleaning                 78742.4    21583.5   3.648
service_categoryFacility & Specialist          3163.1    19713.6   0.160
service_categoryPost-Construction/Renovation 175605.1    41891.1   4.192
is_repeat_clientYes                          -18662.2    14415.2  -1.295
                                             Pr(>|t|)    
(Intercept)                                  0.000818 ***
duration_hours                               0.927676    
num_janitors                                 3.68e-05 ***
seasonWet                                    0.315108    
service_categoryDeep Cleaning                0.000438 ***
service_categoryFacility & Specialist        0.872878    
service_categoryPost-Construction/Renovation 6.35e-05 ***
is_repeat_clientYes                          0.198693    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62720 on 92 degrees of freedom
Multiple R-squared:  0.793, Adjusted R-squared:  0.7772 
F-statistic: 50.34 on 7 and 92 DF,  p-value: < 2.2e-16
Code
tidy(model, conf.int = TRUE) %>%
  mutate(
    term = str_replace_all(term,
             c("service_category" = "",
               "^season"          = "Season: ",
               "is_repeat_client" = "Repeat Client: ")),
    across(where(is.numeric), ~round(., 2))
  ) %>%
  kable(caption = "OLS Regression Coefficients — Outcome: Revenue (NGN)")
OLS Regression Coefficients — Outcome: Revenue (NGN)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -80755.51 23331.95 -3.46 0.00 -127094.78 -34416.25
duration_hours -524.64 5764.09 -0.09 0.93 -11972.62 10923.34
num_janitors 52508.17 12104.74 4.34 0.00 28467.12 76549.22
Season: Wet 13117.37 12986.52 1.01 0.32 -12674.97 38909.71
Deep Cleaning 78742.40 21583.55 3.65 0.00 35875.61 121609.19
Facility & Specialist 3163.07 19713.62 0.16 0.87 -35989.89 42316.03
Post-Construction/Renovation 175605.13 41891.12 4.19 0.00 92405.74 258804.51
Repeat Client: Yes -18662.15 14415.20 -1.29 0.20 -47291.99 9967.69

8.2 Diagnostic Plots

Code
par(mfrow = c(2, 2))
plot(model, which = 1:4)

Code
par(mfrow = c(1, 1))
Code
cat("── Variance Inflation Factors ────────────────────\n")
── Variance Inflation Factors ────────────────────
Code
vif(model) %>%
  round(3) %>%
  kable(caption = "VIF — values > 5 indicate multicollinearity concerns")
VIF — values > 5 indicate multicollinearity concerns
GVIF Df GVIF^(1/(2*Df))
duration_hours 12.957 1 3.600
num_janitors 7.972 1 2.823
season 1.071 1 1.035
service_category 4.202 3 1.270
is_repeat_client 1.109 1 1.053
Code
glance(model) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(caption = "Overall Model Fit Statistics")
Overall Model Fit Statistics
r.squared adj.r.squared sigma statistic p.value df
0.793 0.7772 62717.29 50.3442 0 7
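
The VIF table above flags duration_hours (GVIF ≈ 13) and num_janitors (GVIF ≈ 8) as collinear. A quick way to check whether duration adds explanatory power once team size and service category are already in the model is a nested-model comparison. A minimal sketch, reusing the fitted model object (output not reproduced here):

Code
# Partial F-test: does duration_hours improve fit beyond the other predictors?
model_no_dur <- update(model, . ~ . - duration_hours)
anova(model_no_dur, model)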

8.3 Job Quote Estimator

Code
quote_job <- function(hours, janitors, season_val, category, repeat_cl) {
  nd <- data.frame(
    duration_hours   = hours,
    num_janitors     = janitors,
    season           = factor(season_val, levels = levels(dalos$season)),
    service_category = factor(category,   levels = levels(dalos$service_category)),
    is_repeat_client = factor(repeat_cl,  levels = levels(dalos$is_repeat_client))
  )
  p <- predict(model, newdata = nd, interval = "prediction", level = 0.95)
  cat(sprintf("\n Service: %-38s Season: %s\n", category, season_val))
  cat(sprintf(" Janitors: %d   Duration: %.1f hrs   Repeat Client: %s\n",
              janitors, hours, repeat_cl))
  cat(strrep("-", 58), "\n")
  cat(sprintf(" Predicted Revenue : N%s\n", format(round(p[1]), big.mark=",")))
  cat(sprintf(" 95%% Lower Bound   : N%s\n", format(round(p[2]), big.mark=",")))
  cat(sprintf(" 95%% Upper Bound   : N%s\n", format(round(p[3]), big.mark=",")))
  cat(strrep("=", 58), "\n")
}

cat("DALOS-PRO JOB QUOTE ESTIMATOR\n")
DALOS-PRO JOB QUOTE ESTIMATOR
Code
quote_job(10, 5, "Wet",  "Deep Cleaning",               "Yes")

 Service: Deep Cleaning                          Season: Wet
 Janitors: 5   Duration: 10.0 hrs   Repeat Client: Yes
---------------------------------------------------------- 
 Predicted Revenue : N249,737
 95% Lower Bound   : N121,460
 95% Upper Bound   : N378,013
========================================================== 
Code
quote_job(5,  3, "Dry",  "Upholstery",                  "No")

 Service: Upholstery                             Season: Dry
 Janitors: 3   Duration: 5.0 hrs   Repeat Client: No
---------------------------------------------------------- 
 Predicted Revenue : N74,146
 95% Lower Bound   : N-52,915
 95% Upper Bound   : N201,207
========================================================== 
Code
quote_job(15, 7, "Wet",  "Post-Construction/Renovation", "No")

 Service: Post-Construction/Renovation           Season: Wet
 Janitors: 7   Duration: 15.0 hrs   Repeat Client: No
---------------------------------------------------------- 
 Predicted Revenue : N467,655
 95% Lower Bound   : N333,859
 95% Upper Bound   : N601,450
========================================================== 

8.4 Plain-Language Interpretation

The model explains 79.3% of revenue variation (R² = 0.793; adjusted R² = 0.777).

  • Team size: Each additional janitor is associated with approximately ₦52,500 more expected revenue (95% CI ₦28,500–₦76,500), holding the other inputs constant. Each extra janitor also incurs wage cost, so the net margin contribution must be explicitly included in every quote.

  • Duration: Although duration has one of the strongest pairwise correlations with revenue (Section 7), its coefficient in the model is near zero and not significant. This reflects multicollinearity: duration and team size rise together (duration VIF ≈ 13), so the model attributes the shared job-size effect mostly to team size. Operationally, accurate duration estimates still matter at the quoting stage because hours drive the team size actually deployed.

  • Post-Construction vs Upholstery: Post-construction jobs earn approximately ₦176,000 more than otherwise comparable upholstery jobs (95% CI ₦92,000–₦259,000). Growing this service line is the highest-leverage revenue action available.

  • Deep Cleaning vs Upholstery: Deep cleaning earns approximately ₦79,000 more per job (95% CI ₦36,000–₦122,000). With 38 jobs already in this category, it is Dalos-Pro's most scalable premium service.

  • Season: After controlling for service type and team size, the wet-season coefficient (about ₦13,000) is small and not significant, consistent with the t-test result in Section 6.1.

  • Repeat clients: The repeat-client coefficient is modestly negative (about −₦18,700) and not significant, suggesting repeat clients pay slightly less per job rather than more, possibly reflecting loyalty pricing. The financial value of retention therefore lies in repeat booking volume, not a per-job premium.


9. Integrated Findings

Core Recommendation: Refocus strategy from seasonal timing to service-mix optimisation. Grow post-construction and deep cleaning capacity year-round. Use the regression model to set defensible minimum prices for every job.

The five techniques form a coherent evidence chain:

  1. EDA revealed that revenue is right-skewed — most jobs are low-ticket upholstery bookings, while a small number of post-construction jobs drive disproportionate revenue. Protecting and growing the premium tail is the highest-leverage action.

  2. Visualisation confirmed that service category, not season, determines revenue per job. Post-construction earns roughly five times more per booking than upholstery. Duration is the clearest within-category predictor.

  3. Hypothesis Testing produced a key counterintuitive finding: wet/dry season revenue per job is statistically indistinguishable (Welch t-test, p = 0.96), and job counts are nearly even across seasons in this sample (51 dry, 49 wet). Service category differences are highly significant (ANOVA, p < 0.001), confirming that what is booked matters far more than when.

  4. Correlation confirmed duration and team size (each r ≈ 0.85 with revenue) as the two primary operational predictors of revenue. These are the inputs that must be quoted accurately to protect margins.

  5. Regression quantified every factor’s NGN contribution and produced a practical quoting tool — a direct solution to Dalos-Pro’s core pricing challenge.

Three immediate actions:

  • Use the regression quote tool (Section 8.3) to set data-driven minimum prices for every new job based on duration, team size, and service type.
  • Redirect at least 40% of marketing budget toward post-construction and deep cleaning — these average roughly ₦449,000 and ₦225,000 per booking respectively, against ₦84,000 for upholstery.
  • Introduce dry-season promotions for deep cleaning specifically; the data shows no unit-revenue penalty in the dry season, so off-peak discounts can build volume without compromising the premium brand.

10. Limitations & Further Work

  • Sample size: 100 observations meets the minimum but limits subgroup precision. Including all 350+ jobs since inception would substantially improve regression estimates.

  • No cost data: Without materials and wage costs per job, the model predicts revenue but not profit. Recording input costs per job is the most valuable data-collection improvement Dalos-Pro can make next.

  • No location variable: Adding client area (Lekki Phase 1, VI, Ikoyi, Ajah) could reveal spatial revenue patterns for geographic marketing targeting.

  • Non-linearity: OLS assumes linear, additive relationships and can produce implausible predictions at the low end (e.g., the negative 95% lower bound on the small upholstery quote in Section 8.3). Modelling log(revenue), or tree-based methods such as random forests or gradient boosting, may better capture interaction effects between service type, duration, and team size as the dataset grows.

  • Time series: 19 months of monthly revenue data would support an ARIMA or Prophet forecast for 2026 planning — a natural next analytical step.


References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Okabekwa, C. (2026). Dalos-Pro Solutions job transaction records, April 2024 – November 2025 [Dataset]. Collected from Dalos-Pro Solutions administrative records, Lekki, Lagos, Nigeria. Data available on request from the author.

  • readxl R package: readxl: Read Excel Files (2025).

  • ggcorrplot R package: ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’ (2023).

  • car R package: An R Companion to Applied Regression (2019).

  • lmtest R package: Diagnostic Checking in Regression Relationships (2002).

  • effectsize R package: effectsize: Estimation of Effect Size Indices and Standardized Parameters (2020).

  • scales R package: scales: Scale Functions for Visualization (2025).

  • patchwork R package: patchwork: The Composer of Plots (2025).

  • lubridate R package: Dates and Times Made Easy with lubridate (2011).

  • moments R package: moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests (2022).

  • broom R package: broom: Convert Statistical Objects into Tidy Tibbles (2026).


Appendix: AI Usage Statement

AI tools (Claude by Anthropic) were used to assist with structuring the Quarto document, recommending R packages, and generating initial code skeletons for data loading, visualisation, and modelling. All analytical decisions — choice of techniques, hypothesis formulation, derivation of the season and service category variables, interpretation of all statistical outputs, and strategic business recommendations — were made independently by the author based on direct operational knowledge of Dalos-Pro Solutions and the course textbook (Adi, 2026). Every code chunk was reviewed, tested, and verified against the real dataset. No simulated data was used. The dataset was collected by the author from Dalos-Pro Solutions’ administrative records in the author’s capacity as CEO.