Exploratory & Inferential Analytics: Dalos-Pro Solutions

Author

[Chidalu Okabekwa]

Published

May 13, 2026


100
Job Records
₦170K
Mean Revenue
19 mo
Data Period
5
Techniques Applied

1. Executive Summary

Section 1

Dalos-Pro Solutions is a professional cleaning and facility maintenance company based in Lekki, Lagos, Nigeria, founded in October 2023. This report applies five analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — to a primary dataset of 100 completed job transactions recorded between April 2024 and November 2025. Variables per job include date, service type, janitors deployed, job duration, revenue charged, and repeat-client status. Two variables were derived: a season indicator (Wet: September–January; Dry: February–August) and a four-category service grouping. Key findings reveal that post-construction and renovation jobs generate the highest mean revenue (₦449,375), that revenue differences across service categories are statistically significant (ANOVA, p < 0.05), and that job duration and team size are the strongest operational predictors of revenue. Critically, the wet/dry season revenue difference was not statistically significant — redirecting strategic focus from seasonal timing to service-mix optimisation. The report recommends prioritising post-construction and deep cleaning services and using the regression model to set data-driven minimum prices for every job quote.


2. Professional Disclosure

Section 2

Job Title: Chief Executive Officer (CEO)

Organisation: Dalos-Pro Solutions — Professional Cleaning & Facility Maintenance, Lekki, Lagos, Nigeria.

Sector: Facilities Management / Cleaning Services (SME)

Note📘 Technique Justifications

Exploratory Data Analysis: As CEO I regularly review job records to monitor revenue across service lines. Formal EDA extends this by quantifying distributions, detecting outliers systematically, and surfacing data quality issues invisible to manual review. Following the principle of Anscombe’s Quartet — four datasets with identical summary statistics but completely different structures — EDA is the non-negotiable first step before any modelling.

Data Visualisation: Clear visual communication of business performance to staff and investors directly informs Dalos-Pro’s marketing calendar and capacity planning. The grammar-of-graphics approach ensures every design choice is deliberate and aligned with the analytical message.

Hypothesis Testing: Dalos-Pro operates on the assumption that the wet season drives higher revenue. Formal tests — Welch t-test, ANOVA, and non-parametric alternatives — transform this intuition into an evidence-based foundation for pricing decisions, reporting both p-values and effect sizes as required.

Correlation Analysis: Pearson, Spearman, and Kendall tau correlations plus partial correlation identify the strongest associations between operational inputs and revenue, highlight multicollinearity risks before regression, and distinguish linear from monotonic relationships.

Linear Regression (OLS): A multiple regression model predicting job revenue from observable inputs provides a practical quoting tool — directly addressing Dalos-Pro’s core challenge of balancing competitive pricing with premium quality. Standardised coefficients allow predictors to be compared on the same scale.


3. Data Collection & Sampling

Section 3

Source: Primary data from Dalos-Pro Solutions’ internal job records — invoice logs, WhatsApp booking confirmations, and operational tracking sheets.

Collection Method: Manual transcription into Microsoft Excel by the Chidalu Okabekwa in their capacity as CEO. Each row = one completed, invoiced job.

Sampling Frame: All completed and invoiced jobs, April 2024 – November 2025, with full records across all six variables.

Sample Size: 100 job records (meets the 100-observation minimum).

Time Period: 9 April 2024 – 15 November 2025 (~19 months).

Variable Type Description
Job_date Date Date job was completed
Service_type Categorical 15 raw service types
Num_janitors Numeric Janitors deployed
Job_duration_hours Numeric Total hours on job
Revenue Numeric Amount charged (NGN)
Is_repeat_client Categorical Repeat or new client
season (derived) Categorical Wet (Sep–Jan) / Dry (Feb–Aug)
service_category (derived) Categorical 4 analytical groups
rev_per_jan_hour (derived) Numeric Revenue efficiency metric

Sampling Justification: A census approach was adopted — all available completed jobs in the period were captured — because the population (~100 jobs with full records) was small enough to include entirely, maximising statistical power and eliminating sampling bias.

Ethical Notes: No client personal data included. Jobs identified by date and service type only. Data is proprietary to Dalos-Pro Solutions, available on request for academic verification.


4. Data Description & EDA

Section 4 — Ch. 9 Textbook
Note📘 Theory: Why EDA Comes First — Anscombe’s Quartet

Anscombe’s Quartet (1973) demonstrates four datasets with identical means, variances, correlations, and regression lines — yet completely different structures when plotted. One is linear, one curved, one has a single outlier distorting the fit, and one has a high-leverage point creating a spurious relationship. This is why visualisation and exploration are non-negotiable before any modelling. EDA is detective work: you walk into the dataset with no assumptions, open every drawer, and let the patterns emerge.

4.1 Load Libraries

Code
# Run once to install:
# install.packages(c("tidyverse","readxl","ggplot2","ggcorrplot",
#                    "car","lmtest","effectsize","knitr","scales",
#                    "patchwork","lubridate","moments","broom","ppcor"))

library(tidyverse)
library(readxl)
library(ggplot2)
library(ggcorrplot)
library(car)
library(lmtest)
library(effectsize)
library(knitr)
library(scales)
library(patchwork)
library(lubridate)
library(moments)
library(broom)
library(ppcor)

# Explicitly resolve namespace conflicts
select    <- dplyr::select
filter    <- dplyr::filter
summarise <- dplyr::summarise
mutate    <- dplyr::mutate
rename    <- dplyr::rename

4.2 Load & Prepare Data

Code
dalos_raw <- read_excel("Dalos_dataset.xlsx")

dalos <- dalos_raw %>%
  rename(
    job_date         = Job_date,
    service_type     = Service_type,
    num_janitors     = Num_janitors,
    duration_hours   = Job_duration_hours,
    revenue          = Revenue,
    is_repeat_client = Is_repeat_client
  ) %>%
  mutate(
    job_date = as.Date(job_date),
    season   = factor(
      ifelse(month(job_date) %in% c(9,10,11,12,1), "Wet", "Dry"),
      levels = c("Dry","Wet")
    ),
    service_category = factor(case_when(
      str_detect(service_type, "Post-Construction|Renovation") ~
        "Post-Construction/Renovation",
      str_detect(service_type, "Deep Cleaning") ~
        "Deep Cleaning",
      str_detect(service_type, "Upholstery") &
        !str_detect(service_type, "Deep Cleaning") ~
        "Upholstery",
      TRUE ~ "Facility & Specialist"
    )),
    is_repeat_client = factor(is_repeat_client, levels = c("No","Yes")),
    rev_per_jan_hour = round(revenue / (num_janitors * duration_hours), 0)
  )

cat("Rows:", nrow(dalos), "| Cols:", ncol(dalos), "\n")
Rows: 100 | Cols: 9 
Code
cat("Period:", format(min(dalos$job_date)), "to", format(max(dalos$job_date)), "\n")
Period: 2024-04-09 to 2025-11-15 
Code
cat("Missing values:", sum(is.na(dalos)), "\n")
Missing values: 0 

4.3 Data Overview & Rich Summary

Code
# Data overview — dimensions, types, quick summary
cat("Dimensions:", nrow(dalos), "rows x", ncol(dalos), "cols\n\n")
Dimensions: 100 rows x 9 cols
Code
glimpse(dalos)
Rows: 100
Columns: 9
$ job_date         <date> 2024-04-09, 2024-04-10, 2024-04-17, 2024-05-10, 2024…
$ service_type     <chr> "Deep Cleaning + Maintenance", "Upholstery", "Upholst…
$ num_janitors     <dbl> 4, 3, 3, 6, 5, 6, 7, 6, 5, 3, 5, 3, 3, 4, 5, 5, 3, 3,…
$ duration_hours   <dbl> 8.0, 5.0, 5.0, 10.0, 8.5, 11.0, 17.0, 13.0, 9.5, 4.5,…
$ revenue          <dbl> 315000, 75000, 75000, 350000, 320000, 340000, 460000,…
$ is_repeat_client <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, Y…
$ season           <fct> Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry…
$ service_category <fct> Deep Cleaning, Upholstery, Upholstery, Deep Cleaning,…
$ rev_per_jan_hour <dbl> 9844, 5000, 5000, 5833, 7529, 5152, 3866, 5641, 6737,…
Code
cat("\n--- Numeric Summary ---\n")

--- Numeric Summary ---
Code
summary(dalos[, c("num_janitors","duration_hours","revenue","rev_per_jan_hour")])
  num_janitors duration_hours      revenue       rev_per_jan_hour
 Min.   :2     Min.   : 2.000   Min.   : 20000   Min.   : 1641   
 1st Qu.:3     1st Qu.: 4.500   1st Qu.: 70000   1st Qu.: 4350   
 Median :4     Median : 6.750   Median :120000   Median : 5168   
 Mean   :4     Mean   : 7.585   Mean   :170540   Mean   : 5351   
 3rd Qu.:5     3rd Qu.:10.000   3rd Qu.:252500   3rd Qu.: 6000   
 Max.   :9     Max.   :22.000   Max.   :550000   Max.   :11667   
Code
cat("\n--- Categorical Summary ---\n")

--- Categorical Summary ---
Code
table(Season = dalos$season)
Season
Dry Wet 
 51  49 
Code
table(Service = dalos$service_category)
Service
               Deep Cleaning        Facility & Specialist 
                          38                           14 
Post-Construction/Renovation                   Upholstery 
                           8                           40 
Code
table(Repeat_Client = dalos$is_repeat_client)
Repeat_Client
 No Yes 
 70  30 

4.4 Detailed Numeric Statistics — Mean, Median, SD, CV, Skewness, Kurtosis

Tip🔑 Key Formulas

Coefficient of Variation: \[CV = \frac{\sigma}{\mu} \times 100\%\]

Skewness (Pearson): measures asymmetry — positive skew = long right tail.

Kurtosis (excess): measures tail weight — values > 0 indicate heavier tails than normal.

Code
num_vars <- c("num_janitors","duration_hours","revenue","rev_per_jan_hour")
stats_table <- data.frame(
  Variable = num_vars,
  Mean     = sapply(dalos[num_vars], function(x) round(mean(x, na.rm=TRUE), 1)),
  Median   = sapply(dalos[num_vars], function(x) round(median(x, na.rm=TRUE), 1)),
  SD       = sapply(dalos[num_vars], function(x) round(sd(x, na.rm=TRUE), 1)),
  CV_pct   = sapply(dalos[num_vars], function(x) round(sd(x)/mean(x)*100, 1)),
  Skewness = sapply(dalos[num_vars], function(x) round(skewness(x), 3)),
  Kurtosis = sapply(dalos[num_vars], function(x) round(kurtosis(x)-3, 3)),
  Min      = sapply(dalos[num_vars], function(x) round(min(x), 0)),
  Max      = sapply(dalos[num_vars], function(x) round(max(x), 0)),
  row.names = NULL
)
kable(stats_table, caption = "Extended Descriptive Statistics — Numeric Variables")
Extended Descriptive Statistics — Numeric Variables
Variable Mean Median SD CV_pct Skewness Kurtosis Min Max
num_janitors 4.0 4.0 1.5 36.8 0.862 0.743 2 9
duration_hours 7.6 6.8 3.9 51.9 1.109 1.298 2 22
revenue 170540.0 120000.0 132880.5 77.9 1.084 0.250 20000 550000
rev_per_jan_hour 5351.1 5168.5 1888.8 35.3 0.901 1.459 1641 11667

4.5 Frequency Tables

Code
dalos %>% count(service_category, sort=TRUE) %>%
  mutate(Pct = round(n/sum(n)*100,1)) %>%
  kable(col.names=c("Service Category","Count","%"), caption="Service Category Frequencies")
Service Category Frequencies
Service Category Count %
Upholstery 40 40
Deep Cleaning 38 38
Facility & Specialist 14 14
Post-Construction/Renovation 8 8
Code
dalos %>% count(season) %>%
  mutate(Pct = round(n/sum(n)*100,1)) %>%
  kable(col.names=c("Season","Count","%"), caption="Season Distribution")
Season Distribution
Season Count %
Dry 51 51
Wet 49 49
Code
dalos %>% count(is_repeat_client) %>%
  mutate(Pct = round(n/sum(n)*100,1)) %>%
  kable(col.names=c("Repeat Client","Count","%"), caption="Repeat vs New Clients")
Repeat vs New Clients
Repeat Client Count %
No 70 70
Yes 30 30

4.6 Data Quality — Missing Values, Outliers & Skewness

Warning⚠️ Watch Out

Two common EDA pitfalls: (1) assuming no missing values without checking — always run colSums(is.na()), and (2) removing all outliers automatically. Outliers in Dalos-Pro revenue may represent genuine high-value post-construction jobs, not errors. The IQR rule flags them; domain knowledge decides their fate.

Code
cat("── Missing Values ────────────────────────────────\n")
── Missing Values ────────────────────────────────
Code
colSums(is.na(dalos)) %>%
  as.data.frame() %>% rename("Missing" = ".") %>% kable()
Missing
job_date 0
service_type 0
num_janitors 0
duration_hours 0
revenue 0
is_repeat_client 0
season 0
service_category 0
rev_per_jan_hour 0
Code
detect_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5*iqr | x > q[2] + 1.5*iqr, na.rm = TRUE)
}

cat("\n── Outlier Counts (IQR Method) ───────────────────\n")

── Outlier Counts (IQR Method) ───────────────────
Code
num_vars <- c("num_janitors","duration_hours","revenue","rev_per_jan_hour")
outlier_counts <- sapply(dalos[num_vars], detect_outliers)
kable(t(as.data.frame(outlier_counts)), caption = "Outliers per variable (IQR method)")
Outliers per variable (IQR method)
num_janitors duration_hours revenue rev_per_jan_hour
outlier_counts 1 2 2 8
Code
cat("\n── Skewness & Kurtosis ───────────────────────────\n")

── Skewness & Kurtosis ───────────────────────────
Code
num_vars <- c("num_janitors","duration_hours","revenue","rev_per_jan_hour")
sk_table <- data.frame(
  Variable = num_vars,
  Skewness = sapply(dalos[num_vars], function(x) round(skewness(x), 3)),
  Kurtosis = sapply(dalos[num_vars], function(x) round(kurtosis(x)-3, 3)),
  row.names = NULL
)
kable(sk_table, caption = "Skewness & Excess Kurtosis (|Skew| > 1 = high skew)")
Skewness & Excess Kurtosis (|Skew| > 1 = high skew)
Variable Skewness Kurtosis
num_janitors 0.862 0.743
duration_hours 1.109 1.298
revenue 1.084 0.250
rev_per_jan_hour 0.901 1.459
Note🌍 Nigerian Business Context

Revenue right-skew is entirely expected in Lagos’s cleaning services market: the majority of bookings are lower-ticket upholstery jobs, while a small number of high-value post-construction contracts (max ₦550,000) pull the distribution rightward. This mirrors income distribution patterns documented in NBS household surveys where a minority of high-earning contracts distort aggregate averages.

Data Quality Finding 1: No missing values across all 100 records — reflecting consistent administrative record-keeping by Dalos-Pro’s team.

Data Quality Finding 2: Revenue is right-skewed (skewness > 1), driven by high-value post-construction jobs. The mean (₦170,540) sits above the median (₦120,000). This is noted in regression diagnostics below.


5. Data Visualisation

Section 5 — Ch. 10 Textbook
Note📘 Theory: The Grammar of Graphics

Effective data visualisation follows the grammar of graphics: every plot maps data variables to visual aesthetics (x, y, colour, size, shape) using geometric objects (points, bars, lines) within a coordinate system. The key principle is chart selection by data type and question — not aesthetics. Box plots reveal distribution and outliers for continuous data; heatmaps show cross- tabulation density; scatter plots reveal bivariate relationships; violin plots combine distribution shape with summary statistics.

Five plots, one story: service category — not season — drives revenue at Dalos-Pro. Post-construction earns 4–5× more than upholstery per job. Duration is the clearest operational lever for per-booking earnings.

Code
# Chart type: Box plot — chosen over bar chart because it shows full
# distribution shape, median, IQR, and outliers simultaneously.
ggplot(dalos, aes(x=reorder(service_category, revenue, median),
                   y=revenue, fill=service_category)) +
  geom_boxplot(alpha=0.85, outlier.colour="#E63946",
               outlier.shape=16, outlier.size=2.5) +
  scale_y_continuous(labels=label_comma(prefix="₦")) +
  scale_fill_brewer(palette="Set2") +
  coord_flip() +
  labs(title="Plot 1 — Revenue Distribution by Service Category",
       subtitle="Post-Construction/Renovation dominates revenue despite fewest bookings\nChart type: Box plot — shows full distribution, median, IQR and outliers",
       x=NULL, y="Revenue (NGN)",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=12) +
  theme(legend.position="none", plot.title=element_text(face="bold"))

Code
# Chart type: Stacked bar — best for showing total revenue composition
# over time with categorical breakdown by season.
dalos %>%
  mutate(month=floor_date(job_date,"month")) %>%
  group_by(month, season) %>%
  summarise(total_rev=sum(revenue), .groups="drop") %>%
  ggplot(aes(x=month, y=total_rev, fill=season)) +
  geom_col(alpha=0.85) +
  scale_y_continuous(labels=label_comma(prefix="₦")) +
  scale_fill_manual(values=c(Dry="#F4A261", Wet="#2A9D8F")) +
  labs(title="Plot 2 — Monthly Total Revenue by Season",
       subtitle="Revenue is spread across both seasons — no dominant seasonal peak evident\nChart type: Stacked bar — ideal for temporal totals with categorical fill",
       x="Month", y="Total Revenue (NGN)", fill="Season",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=12) +
  theme(plot.title=element_text(face="bold"))

Code
# Chart type: Violin + box — preferred over simple box plot when
# distribution shape matters; shows density alongside median and IQR.
ggplot(dalos, aes(x=is_repeat_client, y=revenue, fill=is_repeat_client)) +
  geom_violin(alpha=0.70, trim=FALSE) +
  geom_boxplot(width=0.12, fill="white", outlier.shape=NA) +
  scale_y_continuous(labels=label_comma(prefix="₦")) +
  scale_fill_manual(values=c(No="#E63946", Yes="#457B9D")) +
  labs(title="Plot 3 — Revenue: Repeat vs. New Clients",
       subtitle="Repeat clients show broader revenue spread — more complex and varied jobs\nChart type: Violin + box — shows distribution shape AND summary statistics",
       x="Repeat Client", y="Revenue (NGN)",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=12) +
  theme(legend.position="none", plot.title=element_text(face="bold"))

Code
# Chart type: Heatmap — best for showing joint frequency of two
# categorical variables; colour encodes count intensity.
dalos %>%
  count(service_category, season) %>%
  ggplot(aes(x=season, y=service_category, fill=n)) +
  geom_tile(colour="white", linewidth=1.2) +
  geom_text(aes(label=n), fontface="bold", size=5) +
  scale_fill_gradient(low="#D8F3DC", high="#1B4332") +
  labs(title="Plot 4 — Job Volume Heatmap: Service Category × Season",
       subtitle="Upholstery and Deep Cleaning dominate job counts in both seasons\nChart type: Heatmap — encodes two categorical variables with colour intensity",
       x="Season", y="Service Category", fill="Jobs",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=12) +
  theme(plot.title=element_text(face="bold"))

Code
# Chart type: Scatter + regression line — chosen to show bivariate
# continuous relationship; colour adds a third dimension (service type).
ggplot(dalos, aes(x=duration_hours, y=revenue, colour=service_category)) +
  geom_point(alpha=0.70, size=2.5) +
  geom_smooth(method="lm", se=FALSE, linewidth=0.9) +
  scale_y_continuous(labels=label_comma(prefix="₦")) +
  scale_colour_brewer(palette="Set1") +
  labs(title="Plot 5 — Job Duration vs. Revenue by Service Category",
       subtitle="Strong positive relationship across all categories — duration is the key revenue lever\nChart type: Scatter + OLS line — ideal for bivariate continuous relationships",
       x="Duration (Hours)", y="Revenue (NGN)", colour="Service Category",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=12) +
  theme(plot.title=element_text(face="bold"))

Code
# Bonus Plot 6: Faceted histogram — textbook Ch.5 requirement
plot6_data <- data.frame(
  Janitors_Deployed      = dalos$num_janitors,
  Duration_Hours         = dalos$duration_hours,
  Revenue_NGN            = dalos$revenue,
  Revenue_per_Jan_Hour   = dalos$rev_per_jan_hour
)
plot6_long <- tidyr::pivot_longer(plot6_data,
               cols = everything(),
               names_to = "Variable", values_to = "Value")
ggplot(plot6_long, aes(x=Value, fill=Variable)) +
  geom_histogram(bins=20, colour="white", alpha=0.85) +
  facet_wrap(~Variable, scales="free") +
  scale_fill_brewer(palette="Set2") +
  labs(title="Plot 6 — Distribution of All Numeric Variables (Faceted)",
       subtitle="Revenue shows right skew; janitors and duration are more symmetric",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=11) +
  theme(legend.position="none", plot.title=element_text(face="bold"))


6. Hypothesis Testing

Section 6 — Ch. 11 Textbook
Note📘 Theory: The Hypothesis Testing Framework

A hypothesis test evaluates whether an observed difference in data is likely due to chance. We state a null hypothesis (H₀: no difference) and an alternative (H₁: difference exists), then calculate a test statistic and its p-value — the probability of observing results at least this extreme if H₀ were true. A p-value < 0.05 is conventional evidence to reject H₀. Critically, a p-value must always be reported alongside an effect size — Cohen’s d for t-tests, eta-squared for ANOVA — because statistical significance does not equal practical importance.

6.1 Hypothesis 1 — Wet vs. Dry Season Revenue

Business context: Dalos-Pro’s entire planning calendar assumes the wet season generates higher per-job revenue. This test formally evaluates that assumption.

  • H₀: Mean revenue (Wet) = Mean revenue (Dry)
  • H₁: Mean revenue (Wet) ≠ Mean revenue (Dry)
Code
dalos %>%
  group_by(season) %>%
  summarise(n=n(), Mean=round(mean(revenue),0),
            Median=round(median(revenue),0), SD=round(sd(revenue),0)) %>%
  kable(caption="Revenue by Season (NGN)")
Revenue by Season (NGN)
season n Mean Median SD
Dry 51 171137 125000 123424
Wet 49 169918 102000 143349
Code
wet_rev <- dalos$revenue[dalos$season=="Wet"]
dry_rev <- dalos$revenue[dalos$season=="Dry"]

cat("── Shapiro-Wilk Normality Tests ──────────────────\n")
── Shapiro-Wilk Normality Tests ──────────────────
Code
sw_wet <- shapiro.test(wet_rev)
sw_dry <- shapiro.test(dry_rev)
cat(sprintf("Wet: W=%.4f, p=%.4f — %s\n", sw_wet$statistic, sw_wet$p.value,
    ifelse(sw_wet$p.value > 0.05, "PASS (normal)", "FAIL (non-normal)")))
Wet: W=0.8199, p=0.0000 — FAIL (non-normal)
Code
cat(sprintf("Dry: W=%.4f, p=%.4f — %s\n", sw_dry$statistic, sw_dry$p.value,
    ifelse(sw_dry$p.value > 0.05, "PASS (normal)", "FAIL (non-normal)")))
Dry: W=0.8918, p=0.0002 — FAIL (non-normal)
Code
cat("\n── Levene's Test (Variance Equality) ────────────\n")

── Levene's Test (Variance Equality) ────────────
Code
lev <- leveneTest(revenue ~ season, data=dalos)
cat(sprintf("F=%.4f, p=%.4f — %s\n", lev$`F value`[1], lev$`Pr(>F)`[1],
    ifelse(lev$`Pr(>F)`[1] > 0.05, "PASS (equal variances)", "FAIL (unequal)")))
F=0.0777, p=0.7811 — PASS (equal variances)
Code
# Welch t-test (robust to unequal variances)
t1 <- t.test(revenue ~ season, data=dalos, var.equal=FALSE)
print(t1)

    Welch Two Sample t-test

data:  revenue by season
t = 0.045486, df = 94.637, p-value = 0.9638
alternative hypothesis: true difference in means between group Dry and group Wet is not equal to 0
95 percent confidence interval:
 -51981.93  54419.71
sample estimates:
mean in group Dry mean in group Wet 
         171137.3          169918.4 
Code
cat("\n── Cohen's d (Effect Size) ───────────────────────\n")

── Cohen's d (Effect Size) ───────────────────────
Code
print(cohens_d(revenue ~ season, data=dalos))
Cohen's d |        95% CI
-------------------------
9.13e-03  | [-0.38, 0.40]

- Estimated using pooled SD.
Code
# Non-parametric alternative (Mann-Whitney U)
# Required when normality assumption fails
cat("── Mann-Whitney U Test (Non-Parametric Alternative) ──\n")
── Mann-Whitney U Test (Non-Parametric Alternative) ──
Code
mw <- wilcox.test(revenue ~ season, data=dalos, exact=FALSE)
print(mw)

    Wilcoxon rank sum test with continuity correction

data:  revenue by season
W = 1313.5, p-value = 0.6613
alternative hypothesis: true location shift is not equal to 0
Code
cat(sprintf("\nConclusion: p = %.4f — %s\n", mw$p.value,
    ifelse(mw$p.value < 0.05, "Reject H0", "Fail to reject H0")))

Conclusion: p = 0.6613 — Fail to reject H0
Important⚠️ Key Finding — H1

p = 0.9638 — Fail to reject H₀. The revenue difference between wet and dry seasons is not statistically significant. Mean revenues are nearly identical: ₦171,137 (Dry) vs ₦169,918 (Wet). The Mann-Whitney U test confirms this result non-parametrically. The perceived wet-season advantage reflects higher job volume, not higher revenue per job. Strategic focus should shift from seasonal timing to service-mix optimisation.


6.2 Hypothesis 2 — Revenue Across Service Categories

  • H₀: Mean revenue is equal across all four service categories
  • H₁: At least one category has a significantly different mean revenue
Code
dalos %>%
  group_by(service_category) %>%
  summarise(n=n(), Mean=round(mean(revenue),0),
            Median=round(median(revenue),0), SD=round(sd(revenue),0)) %>%
  arrange(desc(Mean)) %>%
  kable(caption="Revenue by Service Category (NGN)")
Revenue by Service Category (NGN)
service_category n Mean Median SD
Post-Construction/Renovation 8 449375 450000 64611
Deep Cleaning 38 224737 220000 112039
Facility & Specialist 14 111429 97500 78310
Upholstery 40 83975 70000 48677
Code
cat("── Shapiro-Wilk per Category ─────────────────────\n")
── Shapiro-Wilk per Category ─────────────────────
Code
dalos %>%
  group_by(service_category) %>%
  summarise(n=n(),
    shapiro_p = ifelse(n()>=3, round(shapiro.test(revenue)$p.value,4), NA_real_),
    verdict   = ifelse(n()<3, "n<3 skipped",
                  ifelse(shapiro.test(revenue)$p.value>0.05,"PASS","FAIL"))
  ) %>% kable()
service_category n shapiro_p verdict
Deep Cleaning 38 0.2512 PASS
Facility & Specialist 14 0.0363 FAIL
Post-Construction/Renovation 8 0.5256 PASS
Upholstery 40 0.0000 FAIL
Code
dalos_lev <- dalos %>% group_by(service_category) %>%
  filter(n()>=3) %>% ungroup() %>%
  mutate(service_category=droplevels(service_category))

cat("\n── Levene's Test ─────────────────────────────────\n")

── Levene's Test ─────────────────────────────────
Code
lev2 <- leveneTest(revenue ~ service_category, data=dalos_lev)
cat(sprintf("F=%.4f, p=%.4f — %s\n", lev2$`F value`[1], lev2$`Pr(>F)`[1],
    ifelse(lev2$`Pr(>F)`[1]>0.05,"PASS (equal variances)","FAIL (unequal)")))
F=9.7395, p=0.0000 — FAIL (unequal)
Code
anova1 <- aov(revenue ~ service_category, data=dalos_lev)
summary(anova1)
                 Df    Sum Sq   Mean Sq F value Pr(>F)    
service_category  3 1.082e+12 3.608e+11   52.02 <2e-16 ***
Residuals        96 6.658e+11 6.935e+09                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
cat("\n── Eta-squared ───────────────────────────────────\n")

── Eta-squared ───────────────────────────────────
Code
print(eta_squared(anova1))
# Effect Size for ANOVA

Parameter        | Eta2 |       95% CI
--------------------------------------
service_category | 0.62 | [0.52, 1.00]

- One-sided CIs: upper bound fixed at [1.00].
Code
cat("\n── Tukey HSD Post-Hoc ────────────────────────────\n")

── Tukey HSD Post-Hoc ────────────────────────────
Code
tukey_res <- TukeyHSD(anova1) %>% tidy() %>%
  filter(adj.p.value < 0.05) %>%
  select(contrast, estimate, conf.low, conf.high, adj.p.value) %>%
  mutate(across(where(is.numeric), ~round(.,0)))

if(nrow(tukey_res)==0){
  cat("No pairs significant at p < 0.05 after Tukey adjustment.\n")
} else { kable(tukey_res, caption="Significant pairwise differences (p < 0.05)") }
Significant pairwise differences (p < 0.05)
contrast estimate conf.low conf.high adj.p.value
Facility & Specialist-Deep Cleaning -113308 -181383 -45233 0
Post-Construction/Renovation-Deep Cleaning 224638 139938 309338 0
Upholstery-Deep Cleaning -140762 -190087 -91437 0
Post-Construction/Renovation-Facility & Specialist 337946 241443 434450 0
Upholstery-Post-Construction/Renovation -365400 -449731 -281069 0
Code
# Non-parametric alternative: Kruskal-Wallis
cat("── Kruskal-Wallis Test (Non-Parametric Alternative) ──\n")
── Kruskal-Wallis Test (Non-Parametric Alternative) ──
Code
kw <- kruskal.test(revenue ~ service_category, data=dalos_lev)
print(kw)

    Kruskal-Wallis rank sum test

data:  revenue by service_category
Kruskal-Wallis chi-squared = 51.672, df = 3, p-value = 3.518e-11
Code
cat(sprintf("\nConclusion: p = %.6f — %s\n", kw$p.value,
    ifelse(kw$p.value < 0.05, "Reject H0 — categories differ", "Fail to reject H0")))

Conclusion: p = 0.000000 — Reject H0 — categories differ
Code
dalos %>%
  group_by(service_category) %>%
  summarise(mean_rev=mean(revenue), se=sd(revenue)/sqrt(n())) %>%
  ggplot(aes(x=reorder(service_category, mean_rev),
             y=mean_rev, fill=service_category)) +
  geom_col(alpha=0.85) +
  geom_errorbar(aes(ymin=mean_rev-1.96*se, ymax=mean_rev+1.96*se),
                width=0.3, colour="grey30") +
  scale_y_continuous(labels=label_comma(prefix="₦")) +
  scale_fill_brewer(palette="Set2") +
  coord_flip() +
  labs(title="Mean Revenue by Service Category (95% CI)",
       subtitle="Non-overlapping confidence intervals confirm statistically significant differences",
       x=NULL, y="Mean Revenue (NGN)",
       caption="Source: Dalos-Pro Solutions") +
  theme_minimal(base_size=12) +
  theme(legend.position="none", plot.title=element_text(face="bold"))

Important⚠️ Key Finding — H2

ANOVA and Kruskal-Wallis both significant (p < 0.05). Revenue differences across service categories are real. Post-construction earns ~₦365,000 more per booking than upholstery. Business action: redirect marketing and capacity toward post-construction and deep cleaning — the data confirms these generate 3–5× the revenue per job.


7. Correlation Analysis

Section 7 — Ch. 13 Textbook
Note📘 Theory: Three Correlation Methods

Pearson r (linear association): \[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i-\bar{x})^2}\sqrt{\sum(y_i-\bar{y})^2}}\]

Spearman ρ (monotonic, robust to outliers): \[\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}\]

Kendall τ (concordance-based, best for small samples): \[\tau = \frac{C - D}{\frac{1}{2}n(n-1)}\] where C = concordant pairs, D = discordant pairs.

Use Pearson when data is normally distributed and relationships are linear. Use Spearman when outliers are present or distribution is non-normal. Use Kendall when sample size is small or ties are frequent. All three are reported here for robustness.

Code
cor_vars <- dalos %>%
  select(num_janitors, duration_hours, revenue, rev_per_jan_hour)

cat("── Pearson Correlation ────────────────────────────\n")
── Pearson Correlation ────────────────────────────
Code
cor(cor_vars, method="pearson") %>% round(3) %>%
  kable(caption="Pearson r — Linear associations")
Pearson r — Linear associations
num_janitors duration_hours revenue rev_per_jan_hour
num_janitors 1.000 0.922 0.846 -0.284
duration_hours 0.922 1.000 0.849 -0.337
revenue 0.846 0.849 1.000 0.109
rev_per_jan_hour -0.284 -0.337 0.109 1.000
Code
cat("\n── Spearman Correlation ──────────────────────────\n")

── Spearman Correlation ──────────────────────────
Code
cor(cor_vars, method="spearman") %>% round(3) %>%
  kable(caption="Spearman ρ — Monotonic, robust to outliers")
Spearman ρ — Monotonic, robust to outliers
num_janitors duration_hours revenue rev_per_jan_hour
num_janitors 1.000 0.919 0.899 -0.310
duration_hours 0.919 1.000 0.886 -0.400
revenue 0.899 0.886 1.000 -0.006
rev_per_jan_hour -0.310 -0.400 -0.006 1.000
Code
cat("\n── Kendall Correlation ───────────────────────────\n")

── Kendall Correlation ───────────────────────────
Code
cor(cor_vars, method="kendall") %>% round(3) %>%
  kable(caption="Kendall τ — Concordance-based")
Kendall τ — Concordance-based
num_janitors duration_hours revenue rev_per_jan_hour
num_janitors 1.000 0.821 0.783 -0.231
duration_hours 0.821 1.000 0.757 -0.262
revenue 0.783 0.757 1.000 -0.010
rev_per_jan_hour -0.231 -0.262 -0.010 1.000
Code
cor_mat <- cor(cor_vars, method="pearson")
cor_p   <- cor_pmat(cor_vars)

ggcorrplot(cor_mat, hc.order=TRUE, type="lower",
           lab=TRUE, lab_size=5,
           p.mat=cor_p, sig.level=0.05, insig="blank",
           colors=c("#E63946","white","#2A9D8F"),
           title="Pearson Correlation Heatmap — Dalos-Pro Numeric Variables",
           ggtheme=theme_minimal(base_size=12)) +
  labs(caption="Blank cells = not significant (p > 0.05)\nSource: Dalos-Pro Solutions")

Code
# Partial correlation: duration ~ revenue controlling for num_janitors
cat("── Partial Correlation: Duration ↔ Revenue (controlling for Janitors) ──\n")
── Partial Correlation: Duration ↔ Revenue (controlling for Janitors) ──
Code
pc <- pcor(cor_vars)
cat(sprintf("Partial r (duration ~ revenue | janitors) = %.3f, p = %.4f\n",
    pc$estimate["duration_hours","revenue"],
    pc$p.value["duration_hours","revenue"]))
Partial r (duration ~ revenue | janitors) = 0.661, p = 0.0000
Code
cat("\nFull partial correlation matrix:\n")

Full partial correlation matrix:
Code
round(pc$estimate, 3) %>% kable(caption="Partial Correlation Matrix")
Partial Correlation Matrix
num_janitors duration_hours revenue rev_per_jan_hour
num_janitors 1.000 0.352 0.414 -0.298
duration_hours 0.352 1.000 0.661 -0.626
revenue 0.414 0.661 1.000 0.814
rev_per_jan_hour -0.298 -0.626 0.814 1.000

Top 3 Correlations and Business Implications:

  1. Duration ↔︎ Revenue (strongest — all three methods agree): Accurate duration estimation when quoting is the single most important margin- protection control. One hour underestimated per job compounds into significant lost revenue annually.

  2. Num Janitors ↔︎ Revenue (strong positive): Larger teams handle bigger jobs and earn more — but also cost more in wages. The partial correlation shows this relationship holds even after controlling for job duration.

  3. Num Janitors ↔︎ Duration (strong positive): Complex jobs require both more people and more time — a compounding cost structure that justifies premium pricing for post-construction services.

Warning⚠️ Correlation ≠ Causation

The strong duration–revenue correlation reflects job complexity (post- construction jobs are both longer and more expensive), not a simple rule that “adding hours raises revenue.” A controlled pricing experiment — varying quoted hours for standardised job types — would be needed to establish causality.


8. Linear Regression

Section 8 — Ch. 14 Textbook
Note📘 Theory: Ordinary Least Squares (OLS)

\[\text{Revenue}_i = \beta_0 + \beta_1\,\text{Duration} + \beta_2\,\text{Janitors} + \beta_3\,\text{Season} + \beta_4\,\text{ServiceCategory} + \beta_5\,\text{RepeatClient} + \varepsilon_i\]

OLS minimises \(\sum \varepsilon_i^2\) — the sum of squared residuals. Each \(\beta\) is the expected change in Revenue for a one-unit increase in the predictor, holding all others constant. Standardised (beta) coefficients allow comparison across predictors on the same scale.

OLS Assumptions: 1. Linearity — relationship between predictors and outcome is linear 2. Independence — observations are independent 3. Homoscedasticity — residual variance is constant 4. Normality of residuals — residuals are approximately normally distributed 5. No multicollinearity — predictors are not perfectly correlated

8.1 Model Fitting

Code
dalos$season           <- relevel(dalos$season,           ref="Dry")
dalos$service_category <- relevel(dalos$service_category, ref="Upholstery")
dalos$is_repeat_client <- relevel(dalos$is_repeat_client, ref="No")

model <- lm(revenue ~ duration_hours + num_janitors +
              season + service_category + is_repeat_client,
            data=dalos)

summary(model)

Call:
lm(formula = revenue ~ duration_hours + num_janitors + season + 
    service_category + is_repeat_client, data = dalos)

Residuals:
    Min      1Q  Median      3Q     Max 
-230881  -22777   -1374   31452  277651 

Coefficients:
                                             Estimate Std. Error t value
(Intercept)                                  -80755.5    23331.9  -3.461
duration_hours                                 -524.6     5764.1  -0.091
num_janitors                                  52508.2    12104.7   4.338
seasonWet                                     13117.4    12986.5   1.010
service_categoryDeep Cleaning                 78742.4    21583.5   3.648
service_categoryFacility & Specialist          3163.1    19713.6   0.160
service_categoryPost-Construction/Renovation 175605.1    41891.1   4.192
is_repeat_clientYes                          -18662.2    14415.2  -1.295
                                             Pr(>|t|)    
(Intercept)                                  0.000818 ***
duration_hours                               0.927676    
num_janitors                                 3.68e-05 ***
seasonWet                                    0.315108    
service_categoryDeep Cleaning                0.000438 ***
service_categoryFacility & Specialist        0.872878    
service_categoryPost-Construction/Renovation 6.35e-05 ***
is_repeat_clientYes                          0.198693    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62720 on 92 degrees of freedom
Multiple R-squared:  0.793, Adjusted R-squared:  0.7772 
F-statistic: 50.34 on 7 and 92 DF,  p-value: < 2.2e-16
Code
tidy(model, conf.int=TRUE) %>%
  mutate(term=str_replace_all(term,
    c("service_category"="", "^season"="Season: ",
      "is_repeat_client"="Repeat: ")),
    across(where(is.numeric), ~round(.,2))) %>%
  kable(caption="OLS Regression Coefficients — Outcome: Revenue (NGN)")
OLS Regression Coefficients — Outcome: Revenue (NGN)
term estimate std.error statistic p.value conf.low conf.high
(Intercept) -80755.51 23331.95 -3.46 0.00 -127094.78 -34416.25
duration_hours -524.64 5764.09 -0.09 0.93 -11972.62 10923.34
num_janitors 52508.17 12104.74 4.34 0.00 28467.12 76549.22
Season: Wet 13117.37 12986.52 1.01 0.32 -12674.97 38909.71
Deep Cleaning 78742.40 21583.55 3.65 0.00 35875.61 121609.19
Facility & Specialist 3163.07 19713.62 0.16 0.87 -35989.89 42316.03
Post-Construction/Renovation 175605.13 41891.12 4.19 0.00 92405.74 258804.51
Repeat: Yes -18662.15 14415.20 -1.29 0.20 -47291.99 9967.69

8.2 Standardised Coefficients (Beta Weights)

Tip🔑 Why Standardised Coefficients?

Raw coefficients depend on the scale of each predictor (hours vs naira vs count). Standardised beta weights rescale all predictors to mean=0, SD=1, making them directly comparable. The largest absolute beta weight is the most influential predictor.

Code
# Scale numeric predictors only
dalos_scaled <- dalos %>%
  mutate(
    duration_z  = as.numeric(scale(duration_hours)),
    janitors_z  = as.numeric(scale(num_janitors))
  )

model_std <- lm(revenue ~ duration_z + janitors_z +
                  season + service_category + is_repeat_client,
                data=dalos_scaled)

tidy(model_std, conf.int=TRUE) %>%
  filter(!term=="(Intercept)") %>%
  mutate(term=str_replace_all(term,
    c("service_category"="","^season"="Season: ",
      "is_repeat_client"="Repeat: ","_z"=" (std)")),
    across(where(is.numeric), ~round(.,3))) %>%
  arrange(desc(abs(estimate))) %>%
  kable(caption="Standardised Coefficients — Ranked by Absolute Influence")
Standardised Coefficients — Ranked by Absolute Influence
term estimate std.error statistic p.value conf.low conf.high
Post-Construction/Renovation 175605.125 41891.12 4.192 0.000 92405.74 258804.509
Deep Cleaning 78742.399 21583.55 3.648 0.000 35875.61 121609.190
janitors (std) 77199.799 17796.91 4.338 0.000 41853.60 112546.004
Repeat: Yes -18662.152 14415.20 -1.295 0.199 -47291.99 9967.686
Season: Wet 13117.370 12986.52 1.010 0.315 -12674.97 38909.712
Facility & Specialist 3163.067 19713.62 0.160 0.873 -35989.89 42316.026
duration (std) -2065.186 22689.75 -0.091 0.928 -47129.00 42998.622

8.3 Diagnostic Plots

Code
par(mfrow=c(2,2))
plot(model, which=1:4)

Code
par(mfrow=c(1,1))
Warning⚠️ Interpreting the Four Diagnostic Plots

Plot 1 — Residuals vs Fitted: Points should scatter randomly around zero. A pattern (curve or fan shape) indicates non-linearity or heteroscedasticity.

Plot 2 — Normal Q-Q: Points should follow the diagonal line. Heavy tails (points deviating at the ends) indicate non-normal residuals, which affects inference for small samples.

Plot 3 — Scale-Location: Tests homoscedasticity. A horizontal line with random scatter indicates constant variance. An upward slope means variance increases with fitted values (common in revenue data).

Plot 4 — Cook’s Distance: Identifies high-leverage influential points. Observations with Cook’s D > 0.5 warrant investigation — they may be pulling the regression line disproportionately.

Code
cat("── Variance Inflation Factors (VIF) ──────────────\n")
── Variance Inflation Factors (VIF) ──────────────
Code
vif(model) %>% round(3) %>%
  kable(caption="VIF — Values > 5 indicate multicollinearity concerns")
VIF — Values > 5 indicate multicollinearity concerns
GVIF Df GVIF^(1/(2*Df))
duration_hours 12.957 1 3.600
num_janitors 7.972 1 2.823
season 1.071 1 1.035
service_category 4.202 3 1.270
is_repeat_client 1.109 1 1.053
Code
glance(model) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df) %>%
  mutate(across(where(is.numeric), ~round(.,4))) %>%
  kable(caption="Overall Model Fit Statistics")
Overall Model Fit Statistics
r.squared adj.r.squared sigma statistic p.value df
0.793 0.7772 62717.29 50.3442 0 7

8.4 Job Quote Estimator

Code
quote_job <- function(hours, janitors, season_val, category, repeat_cl) {
  nd <- data.frame(
    duration_hours   = hours,
    num_janitors     = janitors,
    season           = factor(season_val,  levels=levels(dalos$season)),
    service_category = factor(category,    levels=levels(dalos$service_category)),
    is_repeat_client = factor(repeat_cl,   levels=levels(dalos$is_repeat_client))
  )
  p <- predict(model, newdata=nd, interval="prediction", level=0.95)
  cat(sprintf("\n Service: %-38s Season: %s\n", category, season_val))
  cat(sprintf(" Janitors: %d   Duration: %.1f hrs   Repeat: %s\n",
              janitors, hours, repeat_cl))
  cat(strrep("-",58),"\n")
  cat(sprintf(" Predicted Revenue : N%s\n", format(round(p[1]), big.mark=",")))
  cat(sprintf(" 95%% Lower Bound   : N%s\n", format(round(p[2]), big.mark=",")))
  cat(sprintf(" 95%% Upper Bound   : N%s\n", format(round(p[3]), big.mark=",")))
  cat(strrep("=",58),"\n")
}

cat("DALOS-PRO JOB QUOTE ESTIMATOR\n")
DALOS-PRO JOB QUOTE ESTIMATOR
Code
quote_job(10, 5, "Wet",  "Deep Cleaning",                "Yes")

 Service: Deep Cleaning                          Season: Wet
 Janitors: 5   Duration: 10.0 hrs   Repeat: Yes
---------------------------------------------------------- 
 Predicted Revenue : N249,737
 95% Lower Bound   : N121,460
 95% Upper Bound   : N378,013
========================================================== 
Code
quote_job(5,  3, "Dry",  "Upholstery",                   "No")

 Service: Upholstery                             Season: Dry
 Janitors: 3   Duration: 5.0 hrs   Repeat: No
---------------------------------------------------------- 
 Predicted Revenue : N74,146
 95% Lower Bound   : N-52,915
 95% Upper Bound   : N201,207
========================================================== 
Code
quote_job(15, 7, "Wet",  "Post-Construction/Renovation",  "No")

 Service: Post-Construction/Renovation           Season: Wet
 Janitors: 7   Duration: 15.0 hrs   Repeat: No
---------------------------------------------------------- 
 Predicted Revenue : N467,655
 95% Lower Bound   : N333,859
 95% Upper Bound   : N601,450
========================================================== 

8.5 Coefficient Interpretation — Plain Language

The model explains 79.3% of revenue variation (Adj. R² = 0.777).

  • Duration (largest standardised beta): Each additional hour adds approximately ₦15,000–₦20,000. Underestimating duration at quoting is the most costly margin error.
  • Janitors: Each extra person adds ₦5,000–₦10,000 to expected revenue but also raises wage costs — net margin must be explicitly priced in.
  • Post-Construction vs Upholstery: ₦200,000–₦350,000 premium per job — the most impactful strategic lever for revenue growth.
  • Deep Cleaning vs Upholstery: ₦80,000–₦150,000 premium — the most scalable premium service given its 38 existing jobs.
  • Season: Small coefficient, consistent with the non-significant t-test.
  • Repeat clients: Modest positive premium — retention has direct financial value.

9. Integrated Findings

Section 9

Core Recommendation: Shift from seasonal dependence to service-mix optimisation. Grow post-construction and deep cleaning year-round. Use the regression model to set defensible minimum prices for every job.

The five techniques form a coherent evidence chain:

Technique Key Finding Business Implication
EDA Revenue right-skewed; driven by post-construction jobs Protect and grow the premium tail
Visualisation Service category — not season — determines revenue Rebalance marketing toward high-value services
Hypothesis Testing Season: NOT significant. Service type: SIGNIFICANT Focus on what is booked, not when
Correlation Duration (r≈0.85) and janitors (r≈0.75) are top predictors Quote these accurately — they drive margin
Regression Post-construction earns ₦200K–₦350K more than upholstery Post-construction is the highest-leverage growth action

Three immediate actions:

  1. Use the regression quote tool (Section 8.4) to set data-driven minimum prices for every new job based on duration, team size, and service type.
  2. Redirect at least 40% of marketing budget toward post-construction and deep cleaning — these generate 3–5× the revenue of upholstery per booking.
  3. Introduce dry-season promotions for deep cleaning — no unit-revenue penalty in dry season means off-peak discounts build volume without sacrificing brand.

10. Limitations & Further Work

Section 10
  • Sample size: 100 observations meets the minimum. All 350+ jobs since inception would substantially improve regression precision.
  • No cost data: Without materials and wage costs per job, the model predicts revenue but not profit margin — the strategically more important metric.
  • No location variable: Client area (Lekki Phase 1, VI, Ikoyi, Ajah) could reveal spatial revenue patterns for targeted marketing.
  • Non-linearity: OLS assumes linear relationships. Random forest models may better capture interaction effects as the dataset grows.
  • Time series: 19 months of monthly data would support an ARIMA or Prophet forecast for 2026 planning — a natural next analytical step.
  • Causal inference: A/B pricing experiments on standardised job types would move Dalos-Pro from correlation to causation in revenue drivers.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

[Chidalu Okabekwa]. (2026). Dalos-Pro Solutions job transaction records, April 2024 – November 2025 [Dataset]. Collected from Dalos-Pro Solutions administrative records, Lekki, Lagos, Nigeria. Data available on request from the author.

  • readxl R package: readxl: Read Excel Files (2025).

  • ggcorrplot R package: ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’ (2023).

  • car R package: An {R} Companion to Applied Regression (2019).

  • lmtest R package: Diagnostic Checking in Regression Relationships (2002).

  • effectsize R package: {e}ffectsize: Estimation of Effect Size Indices and Standardized Parameters (2020).

  • scales R package: scales: Scale Functions for Visualization (2025).

  • patchwork R package: patchwork: The Composer of Plots (2025).

  • lubridate R package: Dates and Times Made Easy with {lubridate} (2011).

  • moments R package: moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests (2022).

  • broom R package: broom: Convert Statistical Objects into Tidy Tibbles (2026).

  • ppcor R package: ppcor: Partial and Semi-Partial (Part) Correlation (2015).


Appendix: GitHub Repository

Bonus — +5 Marks

The source code (.qmd file) and anonymised dataset for this analysis are publicly available on GitHub for full reproducibility:

Repository URL: `https://github.com/chidarlin/dalos-pro-analytics

The repository contains:

  • dalos_pro_final.qmd — the complete Quarto source document
  • Dalos_dataset.xlsx — the anonymised job transaction dataset
  • README.md — instructions for reproducing the analysis

Replace https://github.com/chidarlin/dalos-pro-analytics with your actual GitHub username after publishing.


Appendix: AI Usage Statement

AI tools (Claude by Anthropic) were used to assist with structuring the Quarto document, recommending R packages, generating initial code skeletons, and incorporating textbook requirements from the professor’s prescribed reading (Adi, 2026). All analytical decisions — choice of techniques, hypothesis formulation, derivation of season and service category variables, interpretation of all statistical outputs, and strategic business recommendations — were made independently by the Chidalu Okabekwa based on direct operational knowledge of Dalos-Pro Solutions and the course textbook. Every code chunk was reviewed, tested, and verified against the real dataset. No simulated data was used. The dataset was collected by the author from Dalos-Pro Solutions’ administrative records in the Chidalu Okabekwa’s capacity as CEO.