Exploratory & Inferential Analytics: Dalos-Pro Solutions

Author

Chidalu Okabekwa

Published

May 9, 2026


1. Executive Summary

Dalos-Pro Solutions is a professional cleaning and facility maintenance company based in Lekki, Lagos, Nigeria, founded in October 2023. This report applies five analytical techniques — Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — to a primary dataset of 100 completed job transactions recorded between April 2024 and November 2025. Variables captured per job include date, service type, number of janitors deployed, job duration, revenue charged, and repeat-client status. Two additional variables were derived: a season indicator (Wet: September–January; Dry: February–August) and a service category grouping (four categories). Key findings show that post-construction and renovation jobs generate the highest mean revenue (₦449,375), that revenue differences across service categories are statistically significant (one-way ANOVA, p < 0.05), and that job duration and team size are the strongest operational predictors of revenue. Contrary to the perceived seasonal narrative, the wet/dry season revenue difference was not statistically significant in this dataset — a finding that redirects strategic focus from seasonal timing to service-mix optimisation. The report recommends prioritising post-construction and deep cleaning services and improving quoting accuracy on duration and team size to sustain and grow profitability.


2. Professional Disclosure

Job Title: Chief Executive Officer (CEO)

Organisation: Dalos-Pro Solutions — Professional Cleaning & Facility Maintenance Company, Lekki, Lagos, Nigeria.

Sector: Facilities Management / Cleaning Services (SME)

Technique Justifications:

  • Exploratory Data Analysis (EDA): As CEO, I regularly review job records to monitor revenue performance across service lines. Formal EDA extends this routine review by quantifying distributions, detecting outliers systematically, and identifying data quality issues that manual inspection misses. EDA provides the evidentiary foundation for every subsequent analytical decision in this report — it is the first step a responsible data scientist takes before drawing conclusions from any dataset.

  • Data Visualisation: Communicating business performance to staff, potential partners, and investors requires clear and accurate visual storytelling. Charts of revenue distributions by service type, seasonal job volumes, and operational scatter plots directly inform Dalos-Pro’s marketing calendar and capacity planning. The grammar-of-graphics approach (ggplot2) ensures that every design choice — chart type, axis, colour — is deliberate and aligned with the message being communicated.

  • Hypothesis Testing: Dalos-Pro operates on a widely held belief that the wet season (September–January) drives higher revenue than the dry season. Formal hypothesis testing — using a Welch two-sample t-test and one-way ANOVA — determines whether these observed differences are statistically significant or within the range of random variation. This transforms intuition into an evidence-based foundation for pricing and budgeting decisions.

  • Correlation Analysis: Understanding which operational inputs (team size, duration) are most strongly associated with revenue enables smarter resource allocation and pricing. Pearson and Spearman correlation matrices identify the strongest pairwise relationships in the data, highlight multicollinearity risks before regression, and reveal whether associations hold under different distributional assumptions.

  • Linear Regression (OLS): A multiple regression model that predicts job revenue from observable operational inputs — service type, season, duration, team size, repeat-client status — provides a practical, data-driven quoting tool. This directly addresses Dalos-Pro’s core business challenge: finding the right balance between competitive pricing and premium quality. Each regression coefficient translates directly into a concrete pricing or resource-allocation decision.


3. Data Collection & Sampling

Source: Primary data extracted from Dalos-Pro Solutions’ internal job records, including invoice logs, WhatsApp booking confirmations, and operational tracking sheets maintained by the administrative team.

Collection Method: Manual transcription and collation of job-level records into a structured Microsoft Excel spreadsheet by the author in their capacity as CEO. Each row represents one completed, invoiced cleaning job.

Sampling Frame: All completed and invoiced jobs from April 2024 to November 2025 for which full records were available across all six variables.

Sample Size: 100 job records — meets the minimum requirement of 100 observations with at least 5 variables.

Time Period Covered: 9 April 2024 – 15 November 2025 (approximately 19 months).

Variables Collected:

Variable Type Description
Job_date Date Date the cleaning job was completed
Service_type Categorical Specific service(s) rendered (15 raw types)
Num_janitors Numeric Number of janitors deployed on the job
Job_duration_hours Numeric Total hours spent on the job
Revenue Numeric Total amount charged to client (NGN)
Is_repeat_client Categorical Whether client has previously used Dalos-Pro

Derived Variables:

Variable Type Description
season Categorical Wet (Sep–Jan) or Dry (Feb–Aug), derived from date
service_category Categorical 15 raw types grouped into 4 analytical categories
rev_per_jan_hour Numeric Revenue / (Num_janitors x Duration) — efficiency

Sampling Justification: A census approach was adopted — all available completed jobs in the period were included — because the population size (approximately 100 completed jobs with full records) was small enough to capture entirely, maximising statistical power and eliminating sampling bias. The 100 observations meet the assessment minimum.

Ethical Notes: No client personal data is included. Jobs are identified by date and service type only. Data is proprietary to Dalos-Pro Solutions and is available on request from the author for academic verification. Revenue figures are commercially sensitive; only aggregate statistics and model outputs are published in this document.


4. Data Description & EDA

4.1 Load Libraries

Code
# Run once to install all required packages:
# install.packages(c("tidyverse", "readxl", "ggplot2", "ggcorrplot",
#                    "car", "lmtest", "effectsize", "knitr", "scales",
#                    "patchwork", "lubridate", "moments", "broom"))

library(tidyverse)
library(readxl)
library(ggplot2)
library(ggcorrplot)
library(car)
library(lmtest)
library(effectsize)
library(knitr)
library(scales)
library(patchwork)
library(lubridate)
library(moments)
library(broom)

4.2 Load & Prepare Data

Code
# Excel file expected in the Data/ subfolder relative to this .qmd file
dalos_raw <- read_excel("Data/Dalos_dataset.xlsx")

dalos <- dalos_raw %>%
  rename(
    job_date         = Job_date,
    service_type     = Service_type,
    num_janitors     = Num_janitors,
    duration_hours   = Job_duration_hours,
    revenue          = Revenue,
    is_repeat_client = Is_repeat_client
  ) %>%
  mutate(
    job_date = as.Date(job_date),

    # Season derived from month
    season = factor(
      ifelse(month(job_date) %in% c(9, 10, 11, 12, 1), "Wet", "Dry"),
      levels = c("Dry", "Wet")
    ),

    # Group 15 raw service types into 4 analytical categories
    service_category = factor(case_when(
      str_detect(service_type, "Post-Construction|Renovation") ~
        "Post-Construction/Renovation",
      str_detect(service_type, "Deep Cleaning") ~
        "Deep Cleaning",
      str_detect(service_type, "Upholstery") &
        !str_detect(service_type, "Deep Cleaning") ~
        "Upholstery",
      TRUE ~
        "Facility & Specialist"
    )),

    is_repeat_client = factor(is_repeat_client, levels = c("No", "Yes")),

    # Derived efficiency metric
    rev_per_jan_hour = round(revenue / (num_janitors * duration_hours), 0)
  )

cat("Dataset ready:", nrow(dalos), "rows x", ncol(dalos), "columns\n")
Dataset ready: 100 rows x 9 columns
Code
cat("Date range   :", format(min(dalos$job_date)),
    "to", format(max(dalos$job_date)), "\n")
Date range   : 2024-04-09 to 2025-11-15 
Code
cat("Missing values:", sum(is.na(dalos)), "\n")
Missing values: 0 

4.3 Data Structure

Code
glimpse(dalos)
Rows: 100
Columns: 9
$ job_date         <date> 2024-04-09, 2024-04-10, 2024-04-17, 2024-05-10, 2024…
$ service_type     <chr> "Deep Cleaning + Maintenance", "Upholstery", "Upholst…
$ num_janitors     <dbl> 4, 3, 3, 6, 5, 6, 7, 6, 5, 3, 5, 3, 3, 4, 5, 5, 3, 3,…
$ duration_hours   <dbl> 8.0, 5.0, 5.0, 10.0, 8.5, 11.0, 17.0, 13.0, 9.5, 4.5,…
$ revenue          <dbl> 315000, 75000, 75000, 350000, 320000, 340000, 460000,…
$ is_repeat_client <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, Y…
$ season           <fct> Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry…
$ service_category <fct> Deep Cleaning, Upholstery, Upholstery, Deep Cleaning,…
$ rev_per_jan_hour <dbl> 9844, 5000, 5000, 5833, 7529, 5152, 3866, 5641, 6737,…

4.4 Summary Statistics — Numeric Variables

Code
dalos %>%
  select(num_janitors, duration_hours, revenue, rev_per_jan_hour) %>%
  summarise(across(everything(), list(
    Mean   = ~round(mean(.), 1),
    Median = ~round(median(.), 1),
    SD     = ~round(sd(.), 1),
    Min    = ~min(.),
    Max    = ~max(.)
  ))) %>%
  pivot_longer(everything(),
               names_to  = c("Variable", "Stat"),
               names_sep = "_(?=[^_]+$)") %>%
  pivot_wider(names_from = Stat, values_from = value) %>%
  kable(caption = "Descriptive Statistics — Numeric Variables")
Descriptive Statistics — Numeric Variables
Variable Mean Median SD Min Max
num_janitors 4.0 4.0 1.5 2 9
duration_hours 7.6 6.8 3.9 2 22
revenue 170540.0 120000.0 132880.5 20000 550000
rev_per_jan_hour 5351.1 5168.5 1888.8 1641 11667

4.5 Frequency Tables — Categorical Variables

Code
cat("── Service Category ──────────────────────────────\n")
── Service Category ──────────────────────────────
Code
dalos %>%
  count(service_category, sort = TRUE) %>%
  mutate(Pct = round(n / sum(n) * 100, 1)) %>%
  kable(col.names = c("Service Category", "Count", "%"))
Service Category Count %
Upholstery 40 40
Deep Cleaning 38 38
Facility & Specialist 14 14
Post-Construction/Renovation 8 8
Code
cat("\n── Season ────────────────────────────────────────\n")

── Season ────────────────────────────────────────
Code
dalos %>%
  count(season) %>%
  mutate(Pct = round(n / sum(n) * 100, 1)) %>%
  kable(col.names = c("Season", "Count", "%"))
Season Count %
Dry 51 51
Wet 49 49
Code
cat("\n── Repeat Client ─────────────────────────────────\n")

── Repeat Client ─────────────────────────────────
Code
dalos %>%
  count(is_repeat_client) %>%
  mutate(Pct = round(n / sum(n) * 100, 1)) %>%
  kable(col.names = c("Repeat Client", "Count", "%"))
Repeat Client Count %
No 70 70
Yes 30 30

4.6 Data Quality: Missing Values, Outliers & Skewness

Code
# Missing values
cat("── Missing Values ────────────────────────────────\n")
── Missing Values ────────────────────────────────
Code
colSums(is.na(dalos)) %>%
  as.data.frame() %>%
  rename("Missing" = ".") %>%
  kable()
Missing
job_date 0
service_type 0
num_janitors 0
duration_hours 0
revenue 0
is_repeat_client 0
season 0
service_category 0
rev_per_jan_hour 0
Code
# Outliers (IQR method)
detect_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5 * iqr | x > q[2] + 1.5 * iqr, na.rm = TRUE)
}

cat("\n── Outlier Counts (IQR Method) ───────────────────\n")

── Outlier Counts (IQR Method) ───────────────────
Code
dalos %>%
  select(num_janitors, duration_hours, revenue, rev_per_jan_hour) %>%
  summarise(across(everything(), detect_outliers)) %>%
  kable(caption = "Outlier counts per numeric variable")
Outlier counts per numeric variable
num_janitors duration_hours revenue rev_per_jan_hour
1 2 2 8
Code
# Skewness
cat("\n── Skewness ──────────────────────────────────────\n")

── Skewness ──────────────────────────────────────
Code
dalos %>%
  select(num_janitors, duration_hours, revenue, rev_per_jan_hour) %>%
  summarise(across(everything(), ~round(skewness(.), 3))) %>%
  kable(caption = "Skewness — |value| > 1 = high skew")
Skewness — |value| > 1 = high skew
num_janitors duration_hours revenue rev_per_jan_hour
0.862 1.109 1.084 0.901

Data Quality Finding 1 — No missing values: All 100 job records are complete across every variable, reflecting consistent administrative record-keeping.

Data Quality Finding 2 — Revenue right-skew: Revenue is positively skewed, driven by a small number of high-value post-construction jobs (max ₦550,000) pulling the mean (₦170,540) above the median (₦120,000). This is operationally expected and is noted in the regression diagnostics.
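
One quick way to gauge how much of this skew a log transform would remove is to compare skewness on the raw and log scales. The sketch below uses the already-loaded moments package and is illustrative only, not part of the main pipeline.

Code
# Illustrative check: skewness of raw vs log-transformed revenue
round(skewness(dalos$revenue), 3)        # raw scale (right-skewed)
round(skewness(log(dalos$revenue)), 3)   # log scale, expected closer to 0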


5. Data Visualisation

Five plots, one story: at Dalos-Pro, service category — not season — is the primary driver of revenue. Post-construction commands roughly five times the revenue of upholstery per job. Duration is the clearest operational predictor of earnings.

Code
ggplot(dalos,
       aes(x    = reorder(service_category, revenue, median),
           y    = revenue,
           fill = service_category)) +
  geom_boxplot(alpha          = 0.85,
               outlier.colour = "#E63946",
               outlier.shape  = 16,
               outlier.size   = 2.5) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip() +
  labs(
    title   = "Plot 1 — Revenue Distribution by Service Category",
    subtitle = "Post-Construction/Renovation dominates revenue despite fewest bookings",
    x = NULL, y = "Revenue (NGN)",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

Code
dalos %>%
  mutate(month = floor_date(job_date, "month")) %>%
  group_by(month, season) %>%
  summarise(total_rev = sum(revenue), .groups = "drop") %>%
  ggplot(aes(x = month, y = total_rev, fill = season)) +
  geom_col(alpha = 0.85) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_fill_manual(values = c(Dry = "#F4A261", Wet = "#2A9D8F")) +
  labs(
    title    = "Plot 2 — Monthly Total Revenue by Season",
    subtitle = "Revenue is spread across both seasons with no dominant seasonal peak",
    x = "Month", y = "Total Revenue (NGN)", fill = "Season",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Code
ggplot(dalos,
       aes(x = is_repeat_client, y = revenue, fill = is_repeat_client)) +
  geom_violin(alpha = 0.70, trim = FALSE) +
  geom_boxplot(width = 0.12, fill = "white", outlier.shape = NA) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_fill_manual(values = c(No = "#E63946", Yes = "#457B9D")) +
  labs(
    title    = "Plot 3 — Revenue: Repeat vs. New Clients",
    subtitle = "Repeat clients show a broader revenue spread, indicating more complex jobs",
    x = "Repeat Client", y = "Revenue (NGN)",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

Code
dalos %>%
  count(service_category, season) %>%
  ggplot(aes(x = season, y = service_category, fill = n)) +
  geom_tile(colour = "white", linewidth = 1.2) +
  geom_text(aes(label = n), fontface = "bold", size = 5) +
  scale_fill_gradient(low = "#D8F3DC", high = "#1B4332") +
  labs(
    title    = "Plot 4 — Job Volume Heatmap: Service Category x Season",
    subtitle = "Upholstery and Deep Cleaning dominate volumes in both seasons",
    x = "Season", y = "Service Category", fill = "Job Count",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))

Code
ggplot(dalos,
       aes(x = duration_hours, y = revenue, colour = service_category)) +
  geom_point(alpha = 0.70, size = 2.5) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.9) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_colour_brewer(palette = "Set1") +
  labs(
    title    = "Plot 5 — Job Duration vs. Revenue by Service Category",
    subtitle = "Longer jobs earn more across all categories — duration is the key revenue lever",
    x = "Duration (Hours)", y = "Revenue (NGN)", colour = "Service Category",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))
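
The patchwork package loaded in Section 4.1 is designed for combining ggplot objects into a single figure for presentations. A minimal sketch follows; it assumes the boxplot and scatter plot above have been assigned to objects (the chunks above print them directly), so the object names p1 and p5 are illustrative only.

Code
# Illustrative only: assumes Plot 1 and Plot 5 were saved as p1 and p5,
# e.g. p1 <- ggplot(dalos, aes(...)) + ...
combined <- p1 / p5 +
  plot_annotation(
    title   = "Dalos-Pro Revenue Drivers",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  )
combined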


6. Hypothesis Testing

6.1 Hypothesis 1 — Wet vs. Dry Season Revenue

Business context: Dalos-Pro’s planning calendar assumes the wet season generates higher per-job revenue. This test formally evaluates that assumption.

H₀: Mean revenue (Wet) = Mean revenue (Dry)

H₁: Mean revenue (Wet) ≠ Mean revenue (Dry)

Code
dalos %>%
  group_by(season) %>%
  summarise(
    n      = n(),
    Mean   = round(mean(revenue), 0),
    Median = round(median(revenue), 0),
    SD     = round(sd(revenue), 0)
  ) %>%
  kable(caption = "Revenue by Season (NGN)")
Revenue by Season (NGN)
season n Mean Median SD
Dry 51 171137 125000 123424
Wet 49 169918 102000 143349
Code
wet_rev <- dalos$revenue[dalos$season == "Wet"]
dry_rev <- dalos$revenue[dalos$season == "Dry"]

cat("── Shapiro-Wilk Normality ────────────────────────\n")
── Shapiro-Wilk Normality ────────────────────────
Code
cat("Wet: \n"); print(shapiro.test(wet_rev))
Wet: 

    Shapiro-Wilk normality test

data:  wet_rev
W = 0.81993, p-value = 3.139e-06
Code
cat("Dry: \n"); print(shapiro.test(dry_rev))
Dry: 

    Shapiro-Wilk normality test

data:  dry_rev
W = 0.89178, p-value = 0.0002261
Code
cat("\n── Levene's Test (Variance Equality) ────────────\n")

── Levene's Test (Variance Equality) ────────────
Code
print(leveneTest(revenue ~ season, data = dalos))
Levene's Test for Homogeneity of Variance (center = median)
      Df F value Pr(>F)
group  1  0.0777 0.7811
      98               
Code
t1 <- t.test(revenue ~ season, data = dalos, var.equal = FALSE)
print(t1)

    Welch Two Sample t-test

data:  revenue by season
t = 0.045486, df = 94.637, p-value = 0.9638
alternative hypothesis: true difference in means between group Dry and group Wet is not equal to 0
95 percent confidence interval:
 -51981.93  54419.71
sample estimates:
mean in group Dry mean in group Wet 
         171137.3          169918.4 
Code
cat("\n── Cohen's d (Effect Size) ───────────────────────\n")

── Cohen's d (Effect Size) ───────────────────────
Code
print(cohens_d(revenue ~ season, data = dalos))
Cohen's d |        95% CI
-------------------------
9.13e-03  | [-0.38, 0.40]

- Estimated using pooled SD.
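
Because the Shapiro-Wilk tests above reject normality in both seasons, a distribution-free Wilcoxon rank-sum test is a sensible robustness check on the Welch result. A minimal sketch (output not reproduced here):

Code
# Robustness check: rank-based comparison of revenue by season
wilcox.test(revenue ~ season, data = dalos)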

Result: Welch t(94.6) = 0.045, p = 0.964 — fail to reject H₀.

Business interpretation: The revenue difference between wet and dry seasons is not statistically significant, and the effect size is negligible (Cohen's d ≈ 0.01). Mean revenue is nearly identical — ₦171,137 (Dry) vs ₦169,918 (Wet) — and job counts are also roughly even across seasons (51 dry, 49 wet), so the perceived wet-season advantage is not supported by this dataset. This finding redirects strategic focus from seasonal timing to service mix: what type of job is booked matters far more than when it is booked.


6.2 Hypothesis 2 — Revenue Across Service Categories

Business context: Do the four service categories earn meaningfully different revenues, justifying differentiated pricing and marketing investment?

H₀: Mean revenue is equal across all four service categories

H₁: At least one category has a significantly different mean revenue

Code
dalos %>%
  group_by(service_category) %>%
  summarise(
    n      = n(),
    Mean   = round(mean(revenue), 0),
    Median = round(median(revenue), 0),
    SD     = round(sd(revenue), 0)
  ) %>%
  arrange(desc(Mean)) %>%
  kable(caption = "Revenue by Service Category (NGN)")
Revenue by Service Category (NGN)
service_category n Mean Median SD
Post-Construction/Renovation 8 449375 450000 64611
Deep Cleaning 38 224737 220000 112039
Facility & Specialist 14 111429 97500 78310
Upholstery 40 83975 70000 48677
Code
cat("── Shapiro-Wilk per Category ─────────────────────\n")
── Shapiro-Wilk per Category ─────────────────────
Code
dalos %>%
  group_by(service_category) %>%
  summarise(
    n         = n(),
    shapiro_p = ifelse(n() >= 3,
                       round(shapiro.test(revenue)$p.value, 4),
                       NA_real_)
  ) %>%
  mutate(note = ifelse(is.na(shapiro_p), "n < 3, skipped", "")) %>%
  kable()
service_category n shapiro_p note
Deep Cleaning 38 0.2512
Facility & Specialist 14 0.0363
Post-Construction/Renovation 8 0.5256
Upholstery 40 0.0000
Code
# Filter groups with n >= 3 for Levene's and ANOVA
dalos_lev <- dalos %>%
  group_by(service_category) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  mutate(service_category = droplevels(service_category))

cat("\n── Levene's Test ─────────────────────────────────\n")

── Levene's Test ─────────────────────────────────
Code
print(leveneTest(revenue ~ service_category, data = dalos_lev))
Levene's Test for Homogeneity of Variance (center = median)
      Df F value    Pr(>F)    
group  3  9.7395 1.136e-05 ***
      96                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
anova1 <- aov(revenue ~ service_category, data = dalos_lev)
summary(anova1)
                 Df    Sum Sq   Mean Sq F value Pr(>F)    
service_category  3 1.082e+12 3.608e+11   52.02 <2e-16 ***
Residuals        96 6.658e+11 6.935e+09                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
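
Levene's test above indicates unequal variances, and two categories fail the Shapiro-Wilk check, so it is worth confirming the ANOVA conclusion with tests that relax those assumptions. A minimal sketch (output not reproduced here):

Code
# Welch one-way ANOVA: does not assume equal variances across categories
oneway.test(revenue ~ service_category, data = dalos_lev)

# Kruskal-Wallis: rank-based, no normality assumption
kruskal.test(revenue ~ service_category, data = dalos_lev)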
Code
cat("\n── Eta-squared (Effect Size) ─────────────────────\n")

── Eta-squared (Effect Size) ─────────────────────
Code
print(eta_squared(anova1))
# Effect Size for ANOVA

Parameter        | Eta2 |       95% CI
--------------------------------------
service_category | 0.62 | [0.52, 1.00]

- One-sided CIs: upper bound fixed at [1.00].
Code
cat("\n── Tukey HSD Post-Hoc ────────────────────────────\n")

── Tukey HSD Post-Hoc ────────────────────────────
Code
tukey_res <- TukeyHSD(anova1) %>%
  tidy() %>%
  filter(adj.p.value < 0.05) %>%
  select(contrast, estimate, conf.low, conf.high, adj.p.value) %>%
  mutate(across(where(is.numeric), ~round(., 0)))

if (nrow(tukey_res) == 0) {
  cat("No pairs reach p < 0.05 after Tukey adjustment.\n")
} else {
  kable(tukey_res, caption = "Significant pairwise differences (p < 0.05)")
}
Significant pairwise differences (p < 0.05)
contrast estimate conf.low conf.high adj.p.value
Facility & Specialist-Deep Cleaning -113308 -181383 -45233 0
Post-Construction/Renovation-Deep Cleaning 224638 139938 309338 0
Upholstery-Deep Cleaning -140762 -190087 -91437 0
Post-Construction/Renovation-Facility & Specialist 337946 241443 434450 0
Upholstery-Post-Construction/Renovation -365400 -449731 -281069 0
Code
dalos %>%
  group_by(service_category) %>%
  summarise(mean_rev = mean(revenue),
            se       = sd(revenue) / sqrt(n())) %>%
  ggplot(aes(x = reorder(service_category, mean_rev),
             y = mean_rev, fill = service_category)) +
  geom_col(alpha = 0.85) +
  geom_errorbar(aes(ymin = mean_rev - 1.96 * se,
                    ymax = mean_rev + 1.96 * se),
                width = 0.3, colour = "grey30") +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip() +
  labs(
    title    = "Mean Revenue by Service Category (95% CI)",
    subtitle = "Non-overlapping confidence intervals confirm statistically significant differences",
    x = NULL, y = "Mean Revenue (NGN)",
    caption  = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

Result: ANOVA F(3, 96) = 52.0, p < 0.001 — reject H₀.

Business interpretation: Revenue differences across service categories are statistically significant, and the effect is large (η² = 0.62). Post-construction jobs earn approximately ₦365,000 more per booking than upholstery, and deep cleaning earns approximately ₦141,000 more; the only pair Tukey's test does not separate is Facility & Specialist vs Upholstery. Action: shift marketing and capacity allocation toward post-construction and deep cleaning — the data shows these are not just more valuable, but significantly so.


7. Correlation Analysis

Code
cor_vars <- dalos %>%
  select(num_janitors, duration_hours, revenue, rev_per_jan_hour)

cor_mat <- cor(cor_vars, method = "pearson")

cat("── Pearson Correlation Matrix ────────────────────\n")
── Pearson Correlation Matrix ────────────────────
Code
round(cor_mat, 3) %>%
  kable(caption = "Pearson Correlation Coefficients")
Pearson Correlation Coefficients
num_janitors duration_hours revenue rev_per_jan_hour
num_janitors 1.000 0.922 0.846 -0.284
duration_hours 0.922 1.000 0.849 -0.337
revenue 0.846 0.849 1.000 0.109
rev_per_jan_hour -0.284 -0.337 0.109 1.000
Code
cor_p <- cor_pmat(cor_vars)

ggcorrplot(cor_mat,
           hc.order  = TRUE,
           type      = "lower",
           lab       = TRUE,
           lab_size  = 5,
           p.mat     = cor_p,
           sig.level = 0.05,
           insig     = "blank",
           colors    = c("#E63946", "white", "#2A9D8F"),
           title     = "Correlation Heatmap — Dalos-Pro Numeric Variables",
           ggtheme   = theme_minimal(base_size = 12)) +
  labs(caption = paste("Blank cells = not significant (p > 0.05)",
                       "Source: Dalos-Pro Solutions Job Records",
                       sep = "\n"))

Code
cat("── Spearman Correlation (Robustness Check) ───────\n")
── Spearman Correlation (Robustness Check) ───────
Code
cor(cor_vars, method = "spearman") %>%
  round(3) %>%
  kable(caption = "Spearman Rank Correlations")
Spearman Rank Correlations
num_janitors duration_hours revenue rev_per_jan_hour
num_janitors 1.000 0.919 0.899 -0.310
duration_hours 0.919 1.000 0.886 -0.400
revenue 0.899 0.886 1.000 -0.006
rev_per_jan_hour -0.310 -0.400 -0.006 1.000

Top 3 Correlations and Business Implications:

  1. Duration ↔ Revenue (r = 0.85): Each additional hour of work is associated with meaningfully higher revenue. Accurate time estimation when quoting is the primary margin-protection lever for Dalos-Pro.

  2. Num Janitors ↔ Revenue (r = 0.85): Larger teams handle bigger, higher-value jobs. Each extra janitor raises both revenue and wage cost — the net margin must be explicitly priced into every multi-person quote.

  3. Num Janitors ↔ Duration (r = 0.92, the strongest pair in the matrix): Bigger jobs are both longer and larger in team size, a compounding cost structure. Post-construction jobs drive both variables up together, which justifies their premium pricing.

Correlation vs causation: These reflect job complexity structure, not direct cause-and-effect. The regression below isolates each variable’s independent contribution to revenue.
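
To attach a formal confidence interval to the headline duration-revenue association, a single cor.test call is sufficient. A minimal sketch (output not reproduced here):

Code
# Pearson correlation between job duration and revenue, with 95% CI
cor.test(dalos$duration_hours, dalos$revenue, method = "pearson")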


8. Linear Regression

Theory recap: Ordinary Least Squares (OLS) estimates the linear relationship between a continuous outcome (revenue) and a set of predictors. Each coefficient β represents the expected change in revenue for a one-unit increase in a predictor, holding all others constant.

Business justification: A fitted regression model directly solves Dalos-Pro’s quoting problem — given the planned inputs for any new job, the model generates a predicted revenue with a 95% prediction interval, replacing intuition with a data-driven price floor.

8.1 Model Fitting

Code
dalos$season           <- relevel(dalos$season,           ref = "Dry")
dalos$service_category <- relevel(dalos$service_category, ref = "Upholstery")
dalos$is_repeat_client <- relevel(dalos$is_repeat_client, ref = "No")
Code
model <- lm(revenue ~ duration_hours + num_janitors +
              season + service_category + is_repeat_client,
            data = dalos)

summary(model)

Call:
lm(formula = revenue ~ duration_hours + num_janitors + season + 
    service_category + is_repeat_client, data = dalos)

Residuals:
    Min      1Q  Median      3Q     Max 
-230881  -22777   -1374   31452  277651 

Coefficients:
                                             Estimate Std. Error t value
(Intercept)                                  -80755.5    23331.9  -3.461
duration_hours                                 -524.6     5764.1  -0.091
num_janitors                                  52508.2    12104.7   4.338
seasonWet                                     13117.4    12986.5   1.010
service_categoryDeep Cleaning                 78742.4    21583.5   3.648
service_categoryFacility & Specialist          3163.1    19713.6   0.160
service_categoryPost-Construction/Renovation 175605.1    41891.1   4.192
is_repeat_clientYes                          -18662.2    14415.2  -1.295
                                             Pr(>|t|)    
(Intercept)                                  0.000818 ***
duration_hours                               0.927676    
num_janitors                                 3.68e-05 ***
seasonWet                                    0.315108    
service_categoryDeep Cleaning                0.000438 ***
service_categoryFacility & Specialist        0.872878    
service_categoryPost-Construction/Renovation 6.35e-05 ***
is_repeat_clientYes                          0.198693    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62720 on 92 degrees of freedom
Multiple R-squared:  0.793, Adjusted R-squared:  0.7772 
F-statistic: 50.34 on 7 and 92 DF,  p-value: < 2.2e-16
Code
tidy(model, conf.int = TRUE) %>%
  mutate(
    term = str_replace_all(term,
             c("service_category" = "",
               "^season"          = "Season: ",
               "is_repeat_client" = "Repeat Client: ")),
    across(where(is.numeric), ~round(., 2))
  ) %>%
  kable(caption = "OLS Regression Coefficients — Outcome: Revenue (NGN)")
OLS Regression Coefficients — Outcome: Revenue (NGN)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -80755.51 23331.95 -3.46 0.00 -127094.78 -34416.25
duration_hours -524.64 5764.09 -0.09 0.93 -11972.62 10923.34
num_janitors 52508.17 12104.74 4.34 0.00 28467.12 76549.22
Season: Wet 13117.37 12986.52 1.01 0.32 -12674.97 38909.71
Deep Cleaning 78742.40 21583.55 3.65 0.00 35875.61 121609.19
Facility & Specialist 3163.07 19713.62 0.16 0.87 -35989.89 42316.03
Post-Construction/Renovation 175605.13 41891.12 4.19 0.00 92405.74 258804.51
Repeat Client: Yes -18662.15 14415.20 -1.29 0.20 -47291.99 9967.69

8.2 Diagnostic Plots

Code
par(mfrow = c(2, 2))
plot(model, which = 1:4)

Code
par(mfrow = c(1, 1))
Code
cat("── Variance Inflation Factors ────────────────────\n")
── Variance Inflation Factors ────────────────────
Code
vif(model) %>%
  round(3) %>%
  kable(caption = "VIF — values > 5 indicate multicollinearity concerns")
VIF — values > 5 indicate multicollinearity concerns
GVIF Df GVIF^(1/(2*Df))
duration_hours 12.957 1 3.600
num_janitors 7.972 1 2.823
season 1.071 1 1.035
service_category 4.202 3 1.270
is_repeat_client 1.109 1 1.053
Code
glance(model) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df) %>%
  mutate(across(where(is.numeric), ~round(., 4))) %>%
  kable(caption = "Overall Model Fit Statistics")
Overall Model Fit Statistics
r.squared adj.r.squared sigma statistic p.value df
0.793 0.7772 62717.29 50.3442 0 7
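
The VIF table above flags duration_hours (GVIF ≈ 13) and num_janitors (GVIF ≈ 8) as collinear. A quick way to check whether duration adds explanatory power once team size and service category are already in the model is a nested-model comparison. A minimal sketch, reusing the fitted model object (output not reproduced here):

Code
# Partial F-test: does duration_hours improve fit beyond the other predictors?
model_no_dur <- update(model, . ~ . - duration_hours)
anova(model_no_dur, model)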

8.3 Job Quote Estimator

Code
quote_job <- function(hours, janitors, season_val, category, repeat_cl) {
  nd <- data.frame(
    duration_hours   = hours,
    num_janitors     = janitors,
    season           = factor(season_val, levels = levels(dalos$season)),
    service_category = factor(category,   levels = levels(dalos$service_category)),
    is_repeat_client = factor(repeat_cl,  levels = levels(dalos$is_repeat_client))
  )
  p <- predict(model, newdata = nd, interval = "prediction", level = 0.95)
  cat(sprintf("\n Service: %-38s Season: %s\n", category, season_val))
  cat(sprintf(" Janitors: %d   Duration: %.1f hrs   Repeat Client: %s\n",
              janitors, hours, repeat_cl))
  cat(strrep("-", 58), "\n")
  cat(sprintf(" Predicted Revenue : N%s\n", format(round(p[1]), big.mark=",")))
  cat(sprintf(" 95%% Lower Bound   : N%s\n", format(round(p[2]), big.mark=",")))
  cat(sprintf(" 95%% Upper Bound   : N%s\n", format(round(p[3]), big.mark=",")))
  cat(strrep("=", 58), "\n")
}

cat("DALOS-PRO JOB QUOTE ESTIMATOR\n")
DALOS-PRO JOB QUOTE ESTIMATOR
Code
quote_job(10, 5, "Wet",  "Deep Cleaning",               "Yes")

 Service: Deep Cleaning                          Season: Wet
 Janitors: 5   Duration: 10.0 hrs   Repeat Client: Yes
---------------------------------------------------------- 
 Predicted Revenue : N249,737
 95% Lower Bound   : N121,460
 95% Upper Bound   : N378,013
========================================================== 
Code
quote_job(5,  3, "Dry",  "Upholstery",                  "No")

 Service: Upholstery                             Season: Dry
 Janitors: 3   Duration: 5.0 hrs   Repeat Client: No
---------------------------------------------------------- 
 Predicted Revenue : N74,146
 95% Lower Bound   : N-52,915
 95% Upper Bound   : N201,207
========================================================== 
Code
quote_job(15, 7, "Wet",  "Post-Construction/Renovation", "No")

 Service: Post-Construction/Renovation           Season: Wet
 Janitors: 7   Duration: 15.0 hrs   Repeat Client: No
---------------------------------------------------------- 
 Predicted Revenue : N467,655
 95% Lower Bound   : N333,859
 95% Upper Bound   : N601,450
========================================================== 

8.4 Plain-Language Interpretation

The model explains 79.3% of revenue variation (R² = 0.793; adjusted R² = 0.777).

  • Team size: Each additional janitor is associated with approximately ₦52,500 more expected revenue (95% CI ₦28,500–₦76,500), holding the other inputs constant. Each extra janitor also incurs wage cost, so the net margin contribution must be explicitly included in every quote.

  • Duration: Although duration has one of the strongest pairwise correlations with revenue (Section 7), its coefficient in the model is near zero and not significant. This reflects multicollinearity: duration and team size rise together (duration VIF ≈ 13), so the model attributes the shared job-size effect mostly to team size. Operationally, accurate duration estimates still matter at the quoting stage because hours drive the team size actually deployed.

  • Post-Construction vs Upholstery: Post-construction jobs earn approximately ₦176,000 more than otherwise comparable upholstery jobs (95% CI ₦92,000–₦259,000). Growing this service line is the highest-leverage revenue action available.

  • Deep Cleaning vs Upholstery: Deep cleaning earns approximately ₦79,000 more per job (95% CI ₦36,000–₦122,000). With 38 jobs already in this category, it is Dalos-Pro's most scalable premium service.

  • Season: After controlling for service type and team size, the wet-season coefficient (about ₦13,000) is small and not significant, consistent with the t-test result in Section 6.1.

  • Repeat clients: The repeat-client coefficient is modestly negative (about −₦18,700) and not significant, suggesting repeat clients pay slightly less per job rather than more, possibly reflecting loyalty pricing. The financial value of retention therefore lies in repeat booking volume, not a per-job premium.


9. Integrated Findings

Core Recommendation: Refocus strategy from seasonal timing to service-mix optimisation. Grow post-construction and deep cleaning capacity year-round. Use the regression model to set defensible minimum prices for every job.

The five techniques form a coherent evidence chain:

  1. EDA revealed that revenue is right-skewed — most jobs are low-ticket upholstery bookings, while a small number of post-construction jobs drive disproportionate revenue. Protecting and growing the premium tail is the highest-leverage action.

  2. Visualisation confirmed that service category, not season, determines revenue per job. Post-construction earns roughly five times more per booking than upholstery. Duration is the clearest within-category predictor.

  3. Hypothesis Testing produced a key counterintuitive finding: wet/dry season revenue per job is statistically indistinguishable (Welch t-test, p = 0.96), and job counts are nearly even across seasons in this sample (51 dry, 49 wet). Service category differences are highly significant (ANOVA, p < 0.001), confirming that what is booked matters far more than when.

  4. Correlation confirmed duration and team size (each r ≈ 0.85 with revenue) as the two primary operational predictors of revenue. These are the inputs that must be quoted accurately to protect margins.

  5. Regression quantified every factor’s NGN contribution and produced a practical quoting tool — a direct solution to Dalos-Pro’s core pricing challenge.

Three immediate actions:

  • Use the regression quote tool (Section 8.3) to set data-driven minimum prices for every new job based on duration, team size, and service type.
  • Redirect at least 40% of marketing budget toward post-construction and deep cleaning — these average roughly ₦449,000 and ₦225,000 per booking respectively, against ₦84,000 for upholstery.
  • Introduce dry-season promotions for deep cleaning specifically; the data shows no unit-revenue penalty in the dry season, so off-peak discounts can build volume without compromising the premium brand.

10. Limitations & Further Work

  • Sample size: 100 observations meets the minimum but limits subgroup precision. Including all 350+ jobs since inception would substantially improve regression estimates.

  • No cost data: Without materials and wage costs per job, the model predicts revenue but not profit. Recording input costs per job is the most valuable data-collection improvement Dalos-Pro can make next.

  • No location variable: Adding client area (Lekki Phase 1, VI, Ikoyi, Ajah) could reveal spatial revenue patterns for geographic marketing targeting.

  • Non-linearity: OLS assumes linear, additive relationships and can produce implausible predictions at the low end (e.g., the negative 95% lower bound on the small upholstery quote in Section 8.3). Modelling log(revenue), or tree-based methods such as random forests or gradient boosting, may better capture interaction effects between service type, duration, and team size as the dataset grows.

  • Time series: 19 months of monthly revenue data would support an ARIMA or Prophet forecast for 2026 planning — a natural next analytical step.


References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Okabekwa, C. (2026). Dalos-Pro Solutions job transaction records, April 2024 – November 2025 [Dataset]. Collected from Dalos-Pro Solutions administrative records, Lekki, Lagos, Nigeria. Data available on request from the author.

  • readxl R package: readxl: Read Excel Files (2025).

  • ggcorrplot R package: ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’ (2023).

  • car R package: An R Companion to Applied Regression (2019).

  • lmtest R package: Diagnostic Checking in Regression Relationships (2002).

  • effectsize R package: effectsize: Estimation of Effect Size Indices and Standardized Parameters (2020).

  • scales R package: scales: Scale Functions for Visualization (2025).

  • patchwork R package: patchwork: The Composer of Plots (2025).

  • lubridate R package: Dates and Times Made Easy with lubridate (2011).

  • moments R package: moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests (2022).

  • broom R package: broom: Convert Statistical Objects into Tidy Tibbles (2026).


Appendix: AI Usage Statement

AI tools (Claude by Anthropic) were used to assist with structuring the Quarto document, recommending R packages, and generating initial code skeletons for data loading, visualisation, and modelling. All analytical decisions — choice of techniques, hypothesis formulation, derivation of the season and service category variables, interpretation of all statistical outputs, and strategic business recommendations — were made independently by the author based on direct operational knowledge of Dalos-Pro Solutions and the course textbook (Adi, 2026). Every code chunk was reviewed, tested, and verified against the real dataset. No simulated data was used. The dataset was collected by the author from Dalos-Pro Solutions’ administrative records in the author’s capacity as CEO.