Exploratory & Inferential Analytics: Dalos-Pro Solutions

Author

[Chidalu Okabekwa]

Published

May 13, 2026

100

Job Records

₦170K

Mean Revenue

19 mo

Data Period

Techniques Applied

1. Executive Summary

Section 1

Dalos-Pro Solutions is a professional cleaning and facility maintenance company based in Lekki, Lagos, Nigeria, founded in October 2023. This report applies five analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — to a primary dataset of 100 completed job transactions recorded between April 2024 and November 2025. Variables per job include date, service type, janitors deployed, job duration, revenue charged, and repeat-client status. Two variables were derived: a season indicator (Wet: September–January; Dry: February–August) and a four-category service grouping. Key findings reveal that post-construction and renovation jobs generate the highest mean revenue (₦449,375), that revenue differences across service categories are statistically significant (ANOVA, p < 0.05), and that job duration and team size are the strongest operational predictors of revenue. Critically, the wet/dry season revenue difference was not statistically significant — redirecting strategic focus from seasonal timing to service-mix optimisation. The report recommends prioritising post-construction and deep cleaning services and using the regression model to set data-driven minimum prices for every job quote.

2. Professional Disclosure

Section 2

Job Title: Chief Executive Officer (CEO)

Organisation: Dalos-Pro Solutions — Professional Cleaning & Facility Maintenance, Lekki, Lagos, Nigeria.

Sector: Facilities Management / Cleaning Services (SME)

📘 Technique Justifications

Exploratory Data Analysis: As CEO I regularly review job records to monitor revenue across service lines. Formal EDA extends this by quantifying distributions, detecting outliers systematically, and surfacing data quality issues invisible to manual review. Following the principle of Anscombe’s Quartet — four datasets with identical summary statistics but completely different structures — EDA is the non-negotiable first step before any modelling.

Data Visualisation: Clear visual communication of business performance to staff and investors directly informs Dalos-Pro’s marketing calendar and capacity planning. The grammar-of-graphics approach ensures every design choice is deliberate and aligned with the analytical message.

Hypothesis Testing: Dalos-Pro operates on the assumption that the wet season drives higher revenue. Formal tests — Welch t-test, ANOVA, and non-parametric alternatives — transform this intuition into an evidence-based foundation for pricing decisions, reporting both p-values and effect sizes as required.

Correlation Analysis: Pearson, Spearman, and Kendall tau correlations plus partial correlation identify the strongest associations between operational inputs and revenue, highlight multicollinearity risks before regression, and distinguish linear from monotonic relationships.

Linear Regression (OLS): A multiple regression model predicting job revenue from observable inputs provides a practical quoting tool — directly addressing Dalos-Pro’s core challenge of balancing competitive pricing with premium quality. Standardised coefficients allow predictors to be compared on the same scale.

3. Data Collection & Sampling

Section 3

Source: Primary data from Dalos-Pro Solutions’ internal job records — invoice logs, WhatsApp booking confirmations, and operational tracking sheets.

Collection Method: Manual transcription into Microsoft Excel by the Chidalu Okabekwa in their capacity as CEO. Each row = one completed, invoiced job.

Sampling Frame: All completed and invoiced jobs, April 2024 – November 2025, with full records across all six variables.

Sample Size: 100 job records (meets the 100-observation minimum).

Time Period: 9 April 2024 – 15 November 2025 (~19 months).

Variable	Type	Description
Job_date	Date	Date job was completed
Service_type	Categorical	15 raw service types
Num_janitors	Numeric	Janitors deployed
Job_duration_hours	Numeric	Total hours on job
Revenue	Numeric	Amount charged (NGN)
Is_repeat_client	Categorical	Repeat or new client
season (derived)	Categorical	Wet (Sep–Jan) / Dry (Feb–Aug)
service_category (derived)	Categorical	4 analytical groups
rev_per_jan_hour (derived)	Numeric	Revenue efficiency metric

Sampling Justification: A census approach was adopted — all available completed jobs in the period were captured — because the population (~100 jobs with full records) was small enough to include entirely, maximising statistical power and eliminating sampling bias.

Ethical Notes: No client personal data included. Jobs identified by date and service type only. Data is proprietary to Dalos-Pro Solutions, available on request for academic verification.

4. Data Description & EDA

Section 4 — Ch. 9 Textbook

📘 Theory: Why EDA Comes First — Anscombe’s Quartet

Anscombe’s Quartet (1973) demonstrates four datasets with identical means, variances, correlations, and regression lines — yet completely different structures when plotted. One is linear, one curved, one has a single outlier distorting the fit, and one has a high-leverage point creating a spurious relationship. This is why visualisation and exploration are non-negotiable before any modelling. EDA is detective work: you walk into the dataset with no assumptions, open every drawer, and let the patterns emerge.

4.1 Load Libraries

Code

# Run once to install:
# install.packages(c("tidyverse","readxl","ggplot2","ggcorrplot",
#                    "car","lmtest","effectsize","knitr","scales",
#                    "patchwork","lubridate","moments","broom","ppcor"))

library(tidyverse)
library(readxl)
library(ggplot2)
library(ggcorrplot)
library(car)
library(lmtest)
library(effectsize)
library(knitr)
library(scales)
library(patchwork)
library(lubridate)
library(moments)
library(broom)
library(ppcor)

# Explicitly resolve namespace conflicts
select    <- dplyr::select
filter    <- dplyr::filter
summarise <- dplyr::summarise
mutate    <- dplyr::mutate
rename    <- dplyr::rename

4.2 Load & Prepare Data

Code

dalos_raw <- read_excel("Dalos_dataset.xlsx")

dalos <- dalos_raw %>%
  rename(
    job_date         = Job_date,
    service_type     = Service_type,
    num_janitors     = Num_janitors,
    duration_hours   = Job_duration_hours,
    revenue          = Revenue,
    is_repeat_client = Is_repeat_client
  ) %>%
  mutate(
    job_date = as.Date(job_date),
    season   = factor(
      ifelse(month(job_date) %in% c(9,10,11,12,1), "Wet", "Dry"),
      levels = c("Dry","Wet")
    ),
    service_category = factor(case_when(
      str_detect(service_type, "Post-Construction|Renovation") ~
        "Post-Construction/Renovation",
      str_detect(service_type, "Deep Cleaning") ~
        "Deep Cleaning",
      str_detect(service_type, "Upholstery") &
        !str_detect(service_type, "Deep Cleaning") ~
        "Upholstery",
      TRUE ~ "Facility & Specialist"
    )),
    is_repeat_client = factor(is_repeat_client, levels = c("No","Yes")),
    rev_per_jan_hour = round(revenue / (num_janitors * duration_hours), 0)
  )

cat("Rows:", nrow(dalos), "| Cols:", ncol(dalos), "\n")

Rows: 100 | Cols: 9

Code

cat("Period:", format(min(dalos$job_date)), "to", format(max(dalos$job_date)), "\n")

Period: 2024-04-09 to 2025-11-15

Code

cat("Missing values:", sum(is.na(dalos)), "\n")

Missing values: 0

4.3 Data Overview & Rich Summary

Code

# Data overview — dimensions, types, quick summary
cat("Dimensions:", nrow(dalos), "rows x", ncol(dalos), "cols\n\n")

Dimensions: 100 rows x 9 cols

Code

glimpse(dalos)

Rows: 100
Columns: 9
$ job_date         <date> 2024-04-09, 2024-04-10, 2024-04-17, 2024-05-10, 2024…
$ service_type     <chr> "Deep Cleaning + Maintenance", "Upholstery", "Upholst…
$ num_janitors     <dbl> 4, 3, 3, 6, 5, 6, 7, 6, 5, 3, 5, 3, 3, 4, 5, 5, 3, 3,…
$ duration_hours   <dbl> 8.0, 5.0, 5.0, 10.0, 8.5, 11.0, 17.0, 13.0, 9.5, 4.5,…
$ revenue          <dbl> 315000, 75000, 75000, 350000, 320000, 340000, 460000,…
$ is_repeat_client <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, Y…
$ season           <fct> Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry…
$ service_category <fct> Deep Cleaning, Upholstery, Upholstery, Deep Cleaning,…
$ rev_per_jan_hour <dbl> 9844, 5000, 5000, 5833, 7529, 5152, 3866, 5641, 6737,…

Code

cat("\n--- Numeric Summary ---\n")


--- Numeric Summary ---

Code

summary(dalos[, c("num_janitors","duration_hours","revenue","rev_per_jan_hour")])

  num_janitors duration_hours      revenue       rev_per_jan_hour
 Min.   :2     Min.   : 2.000   Min.   : 20000   Min.   : 1641   
 1st Qu.:3     1st Qu.: 4.500   1st Qu.: 70000   1st Qu.: 4350   
 Median :4     Median : 6.750   Median :120000   Median : 5168   
 Mean   :4     Mean   : 7.585   Mean   :170540   Mean   : 5351   
 3rd Qu.:5     3rd Qu.:10.000   3rd Qu.:252500   3rd Qu.: 6000   
 Max.   :9     Max.   :22.000   Max.   :550000   Max.   :11667

Code

cat("\n--- Categorical Summary ---\n")


--- Categorical Summary ---

Code

table(Season = dalos$season)

Season
Dry Wet 
 51  49

Code

table(Service = dalos$service_category)

Service
               Deep Cleaning        Facility & Specialist 
                          38                           14 
Post-Construction/Renovation                   Upholstery 
                           8                           40

Code

table(Repeat_Client = dalos$is_repeat_client)

Repeat_Client
 No Yes 
 70  30

4.4 Detailed Numeric Statistics — Mean, Median, SD, CV, Skewness, Kurtosis

🔑 Key Formulas

Coefficient of Variation: \[CV = \frac{\sigma}{\mu} \times 100\%\]

Skewness (Pearson): measures asymmetry — positive skew = long right tail.

Kurtosis (excess): measures tail weight — values > 0 indicate heavier tails than normal.

Code

num_vars <- c("num_janitors","duration_hours","revenue","rev_per_jan_hour")
stats_table <- data.frame(
  Variable = num_vars,
  Mean     = sapply(dalos[num_vars], function(x) round(mean(x, na.rm=TRUE), 1)),
  Median   = sapply(dalos[num_vars], function(x) round(median(x, na.rm=TRUE), 1)),
  SD       = sapply(dalos[num_vars], function(x) round(sd(x, na.rm=TRUE), 1)),
  CV_pct   = sapply(dalos[num_vars], function(x) round(sd(x)/mean(x)*100, 1)),
  Skewness = sapply(dalos[num_vars], function(x) round(skewness(x), 3)),
  Kurtosis = sapply(dalos[num_vars], function(x) round(kurtosis(x)-3, 3)),
  Min      = sapply(dalos[num_vars], function(x) round(min(x), 0)),
  Max      = sapply(dalos[num_vars], function(x) round(max(x), 0)),
  row.names = NULL
)
kable(stats_table, caption = "Extended Descriptive Statistics — Numeric Variables")

Extended Descriptive Statistics — Numeric Variables
Variable	Mean	Median	SD	CV_pct	Skewness	Kurtosis	Min	Max
num_janitors	4.0	4.0	1.5	36.8	0.862	0.743	2	9
duration_hours	7.6	6.8	3.9	51.9	1.109	1.298	2	22
revenue	170540.0	120000.0	132880.5	77.9	1.084	0.250	20000	550000
rev_per_jan_hour	5351.1	5168.5	1888.8	35.3	0.901	1.459	1641	11667

4.5 Frequency Tables

Code

dalos %>% count(service_category, sort=TRUE) %>%
  mutate(Pct = round(n/sum(n)*100,1)) %>%
  kable(col.names=c("Service Category","Count","%"), caption="Service Category Frequencies")

Service Category Frequencies
Service Category	Count	%
Upholstery	40	40
Deep Cleaning	38	38
Facility & Specialist	14	14
Post-Construction/Renovation	8	8

Code

dalos %>% count(season) %>%
  mutate(Pct = round(n/sum(n)*100,1)) %>%
  kable(col.names=c("Season","Count","%"), caption="Season Distribution")

Season Distribution
Season	Count	%
Dry	51	51
Wet	49	49

Code

dalos %>% count(is_repeat_client) %>%
  mutate(Pct = round(n/sum(n)*100,1)) %>%
  kable(col.names=c("Repeat Client","Count","%"), caption="Repeat vs New Clients")

Repeat vs New Clients
Repeat Client	Count	%
No	70	70
Yes	30	30

4.6 Data Quality — Missing Values, Outliers & Skewness

⚠️ Watch Out

Two common EDA pitfalls: (1) assuming no missing values without checking — always run colSums(is.na()), and (2) removing all outliers automatically. Outliers in Dalos-Pro revenue may represent genuine high-value post-construction jobs, not errors. The IQR rule flags them; domain knowledge decides their fate.

Code

cat("── Missing Values ────────────────────────────────\n")

── Missing Values ────────────────────────────────

Code

colSums(is.na(dalos)) %>%
  as.data.frame() %>% rename("Missing" = ".") %>% kable()

	Missing
job_date	0
service_type	0
num_janitors	0
duration_hours	0
revenue	0
is_repeat_client	0
season	0
service_category	0
rev_per_jan_hour	0

Code

detect_outliers <- function(x) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  iqr <- q[2] - q[1]
  sum(x < q[1] - 1.5*iqr | x > q[2] + 1.5*iqr, na.rm = TRUE)
}

cat("\n── Outlier Counts (IQR Method) ───────────────────\n")


── Outlier Counts (IQR Method) ───────────────────

Code

num_vars <- c("num_janitors","duration_hours","revenue","rev_per_jan_hour")
outlier_counts <- sapply(dalos[num_vars], detect_outliers)
kable(t(as.data.frame(outlier_counts)), caption = "Outliers per variable (IQR method)")

Outliers per variable (IQR method)
	num_janitors	duration_hours	revenue	rev_per_jan_hour
outlier_counts	1	2	2	8

Code

cat("\n── Skewness & Kurtosis ───────────────────────────\n")


── Skewness & Kurtosis ───────────────────────────

Code

num_vars <- c("num_janitors","duration_hours","revenue","rev_per_jan_hour")
sk_table <- data.frame(
  Variable = num_vars,
  Skewness = sapply(dalos[num_vars], function(x) round(skewness(x), 3)),
  Kurtosis = sapply(dalos[num_vars], function(x) round(kurtosis(x)-3, 3)),
  row.names = NULL
)
kable(sk_table, caption = "Skewness & Excess Kurtosis (|Skew| > 1 = high skew)")

Skewness & Excess Kurtosis (|Skew| > 1 = high skew)
Variable	Skewness	Kurtosis
num_janitors	0.862	0.743
duration_hours	1.109	1.298
revenue	1.084	0.250
rev_per_jan_hour	0.901	1.459

🌍 Nigerian Business Context

Revenue right-skew is entirely expected in Lagos’s cleaning services market: the majority of bookings are lower-ticket upholstery jobs, while a small number of high-value post-construction contracts (max ₦550,000) pull the distribution rightward. This mirrors income distribution patterns documented in NBS household surveys where a minority of high-earning contracts distort aggregate averages.

Data Quality Finding 1: No missing values across all 100 records — reflecting consistent administrative record-keeping by Dalos-Pro’s team.

Data Quality Finding 2: Revenue is right-skewed (skewness > 1), driven by high-value post-construction jobs. The mean (₦170,540) sits above the median (₦120,000). This is noted in regression diagnostics below.

5. Data Visualisation

Section 5 — Ch. 10 Textbook

📘 Theory: The Grammar of Graphics

Effective data visualisation follows the grammar of graphics: every plot maps data variables to visual aesthetics (x, y, colour, size, shape) using geometric objects (points, bars, lines) within a coordinate system. The key principle is chart selection by data type and question — not aesthetics. Box plots reveal distribution and outliers for continuous data; heatmaps show cross- tabulation density; scatter plots reveal bivariate relationships; violin plots combine distribution shape with summary statistics.

Five plots, one story: service category — not season — drives revenue at Dalos-Pro. Post-construction earns 4–5× more than upholstery per job. Duration is the clearest operational lever for per-booking earnings.

Code

# Chart type: Box plot — chosen over bar chart because it shows full
# distribution shape, median, IQR, and outliers simultaneously.
ggplot(dalos, aes(x=reorder(service_category, revenue, median),
                   y=revenue, fill=service_category)) +
  geom_boxplot(alpha=0.85, outlier.colour="#E63946",
               outlier.shape=16, outlier.size=2.5) +
  scale_y_continuous(labels=label_comma(prefix="₦")) +
  scale_fill_brewer(palette="Set2") +
  coord_flip() +
  labs(title="Plot 1 — Revenue Distribution by Service Category",
       subtitle="Post-Construction/Renovation dominates revenue despite fewest bookings\nChart type: Box plot — shows full distribution, median, IQR and outliers",
       x=NULL, y="Revenue (NGN)",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=12) +
  theme(legend.position="none", plot.title=element_text(face="bold"))

Code

# Chart type: Stacked bar — best for showing total revenue composition
# over time with categorical breakdown by season.
dalos %>%
  mutate(month=floor_date(job_date,"month")) %>%
  group_by(month, season) %>%
  summarise(total_rev=sum(revenue), .groups="drop") %>%
  ggplot(aes(x=month, y=total_rev, fill=season)) +
  geom_col(alpha=0.85) +
  scale_y_continuous(labels=label_comma(prefix="₦")) +
  scale_fill_manual(values=c(Dry="#F4A261", Wet="#2A9D8F")) +
  labs(title="Plot 2 — Monthly Total Revenue by Season",
       subtitle="Revenue is spread across both seasons — no dominant seasonal peak evident\nChart type: Stacked bar — ideal for temporal totals with categorical fill",
       x="Month", y="Total Revenue (NGN)", fill="Season",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=12) +
  theme(plot.title=element_text(face="bold"))

Code

# Chart type: Violin + box — preferred over simple box plot when
# distribution shape matters; shows density alongside median and IQR.
ggplot(dalos, aes(x=is_repeat_client, y=revenue, fill=is_repeat_client)) +
  geom_violin(alpha=0.70, trim=FALSE) +
  geom_boxplot(width=0.12, fill="white", outlier.shape=NA) +
  scale_y_continuous(labels=label_comma(prefix="₦")) +
  scale_fill_manual(values=c(No="#E63946", Yes="#457B9D")) +
  labs(title="Plot 3 — Revenue: Repeat vs. New Clients",
       subtitle="Repeat clients show broader revenue spread — more complex and varied jobs\nChart type: Violin + box — shows distribution shape AND summary statistics",
       x="Repeat Client", y="Revenue (NGN)",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=12) +
  theme(legend.position="none", plot.title=element_text(face="bold"))

Code

# Chart type: Heatmap — best for showing joint frequency of two
# categorical variables; colour encodes count intensity.
dalos %>%
  count(service_category, season) %>%
  ggplot(aes(x=season, y=service_category, fill=n)) +
  geom_tile(colour="white", linewidth=1.2) +
  geom_text(aes(label=n), fontface="bold", size=5) +
  scale_fill_gradient(low="#D8F3DC", high="#1B4332") +
  labs(title="Plot 4 — Job Volume Heatmap: Service Category × Season",
       subtitle="Upholstery and Deep Cleaning dominate job counts in both seasons\nChart type: Heatmap — encodes two categorical variables with colour intensity",
       x="Season", y="Service Category", fill="Jobs",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=12) +
  theme(plot.title=element_text(face="bold"))

Code

# Chart type: Scatter + regression line — chosen to show bivariate
# continuous relationship; colour adds a third dimension (service type).
ggplot(dalos, aes(x=duration_hours, y=revenue, colour=service_category)) +
  geom_point(alpha=0.70, size=2.5) +
  geom_smooth(method="lm", se=FALSE, linewidth=0.9) +
  scale_y_continuous(labels=label_comma(prefix="₦")) +
  scale_colour_brewer(palette="Set1") +
  labs(title="Plot 5 — Job Duration vs. Revenue by Service Category",
       subtitle="Strong positive relationship across all categories — duration is the key revenue lever\nChart type: Scatter + OLS line — ideal for bivariate continuous relationships",
       x="Duration (Hours)", y="Revenue (NGN)", colour="Service Category",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=12) +
  theme(plot.title=element_text(face="bold"))

Code

# Bonus Plot 6: Faceted histogram — textbook Ch.5 requirement
plot6_data <- data.frame(
  Janitors_Deployed      = dalos$num_janitors,
  Duration_Hours         = dalos$duration_hours,
  Revenue_NGN            = dalos$revenue,
  Revenue_per_Jan_Hour   = dalos$rev_per_jan_hour
)
plot6_long <- tidyr::pivot_longer(plot6_data,
               cols = everything(),
               names_to = "Variable", values_to = "Value")
ggplot(plot6_long, aes(x=Value, fill=Variable)) +
  geom_histogram(bins=20, colour="white", alpha=0.85) +
  facet_wrap(~Variable, scales="free") +
  scale_fill_brewer(palette="Set2") +
  labs(title="Plot 6 — Distribution of All Numeric Variables (Faceted)",
       subtitle="Revenue shows right skew; janitors and duration are more symmetric",
       caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
  theme_minimal(base_size=11) +
  theme(legend.position="none", plot.title=element_text(face="bold"))

6. Hypothesis Testing

Section 6 — Ch. 11 Textbook

📘 Theory: The Hypothesis Testing Framework

A hypothesis test evaluates whether an observed difference in data is likely due to chance. We state a null hypothesis (H₀: no difference) and an alternative (H₁: difference exists), then calculate a test statistic and its p-value — the probability of observing results at least this extreme if H₀ were true. A p-value < 0.05 is conventional evidence to reject H₀. Critically, a p-value must always be reported alongside an effect size — Cohen’s d for t-tests, eta-squared for ANOVA — because statistical significance does not equal practical importance.

6.1 Hypothesis 1 — Wet vs. Dry Season Revenue

Business context: Dalos-Pro’s entire planning calendar assumes the wet season generates higher per-job revenue. This test formally evaluates that assumption.

H₀: Mean revenue (Wet) = Mean revenue (Dry)
H₁: Mean revenue (Wet) ≠ Mean revenue (Dry)

Code

dalos %>%
  group_by(season) %>%
  summarise(n=n(), Mean=round(mean(revenue),0),
            Median=round(median(revenue),0), SD=round(sd(revenue),0)) %>%
  kable(caption="Revenue by Season (NGN)")

Revenue by Season (NGN)
season	n	Mean	Median	SD
Dry	51	171137	125000	123424
Wet	49	169918	102000	143349

Code

wet_rev <- dalos$revenue[dalos$season=="Wet"]
dry_rev <- dalos$revenue[dalos$season=="Dry"]

cat("── Shapiro-Wilk Normality Tests ──────────────────\n")

── Shapiro-Wilk Normality Tests ──────────────────

Code

sw_wet <- shapiro.test(wet_rev)
sw_dry <- shapiro.test(dry_rev)
cat(sprintf("Wet: W=%.4f, p=%.4f — %s\n", sw_wet$statistic, sw_wet$p.value,
    ifelse(sw_wet$p.value > 0.05, "PASS (normal)", "FAIL (non-normal)")))

Wet: W=0.8199, p=0.0000 — FAIL (non-normal)

Code

cat(sprintf("Dry: W=%.4f, p=%.4f — %s\n", sw_dry$statistic, sw_dry$p.value,
    ifelse(sw_dry$p.value > 0.05, "PASS (normal)", "FAIL (non-normal)")))

Dry: W=0.8918, p=0.0002 — FAIL (non-normal)

Code

cat("\n── Levene's Test (Variance Equality) ────────────\n")


── Levene's Test (Variance Equality) ────────────

Code

lev <- leveneTest(revenue ~ season, data=dalos)
cat(sprintf("F=%.4f, p=%.4f — %s\n", lev$`F value`[1], lev$`Pr(>F)`[1],
    ifelse(lev$`Pr(>F)`[1] > 0.05, "PASS (equal variances)", "FAIL (unequal)")))

F=0.0777, p=0.7811 — PASS (equal variances)

Code

# Welch t-test (robust to unequal variances)
t1 <- t.test(revenue ~ season, data=dalos, var.equal=FALSE)
print(t1)


    Welch Two Sample t-test

data:  revenue by season
t = 0.045486, df = 94.637, p-value = 0.9638
alternative hypothesis: true difference in means between group Dry and group Wet is not equal to 0
95 percent confidence interval:
 -51981.93  54419.71
sample estimates:
mean in group Dry mean in group Wet 
         171137.3          169918.4

Code

cat("\n── Cohen's d (Effect Size) ───────────────────────\n")


── Cohen's d (Effect Size) ───────────────────────

Code

print(cohens_d(revenue ~ season, data=dalos))

Cohen's d |        95% CI
-------------------------
9.13e-03  | [-0.38, 0.40]

- Estimated using pooled SD.

Code

# Non-parametric alternative (Mann-Whitney U)
# Required when normality assumption fails
cat("── Mann-Whitney U Test (Non-Parametric Alternative) ──\n")

── Mann-Whitney U Test (Non-Parametric Alternative) ──

Code

mw <- wilcox.test(revenue ~ season, data=dalos, exact=FALSE)
print(mw)


    Wilcoxon rank sum test with continuity correction

data:  revenue by season
W = 1313.5, p-value = 0.6613
alternative hypothesis: true location shift is not equal to 0

Code

cat(sprintf("\nConclusion: p = %.4f — %s\n", mw$p.value,
    ifelse(mw$p.value < 0.05, "Reject H0", "Fail to reject H0")))


Conclusion: p = 0.6613 — Fail to reject H0

⚠️ Key Finding — H1

p = 0.9638 — Fail to reject H₀. The revenue difference between wet and dry seasons is not statistically significant. Mean revenues are nearly identical: ₦171,137 (Dry) vs ₦169,918 (Wet). The Mann-Whitney U test confirms this result non-parametrically. The perceived wet-season advantage reflects higher job volume, not higher revenue per job. Strategic focus should shift from seasonal timing to service-mix optimisation.

6.2 Hypothesis 2 — Revenue Across Service Categories

H₀: Mean revenue is equal across all four service categories
H₁: At least one category has a significantly different mean revenue

Code

dalos %>%
  group_by(service_category) %>%
  summarise(n=n(), Mean=round(mean(revenue),0),
            Median=round(median(revenue),0), SD=round(sd(revenue),0)) %>%
  arrange(desc(Mean)) %>%
  kable(caption="Revenue by Service Category (NGN)")

Revenue by Service Category (NGN)
service_category	n	Mean	Median	SD
Post-Construction/Renovation	8	449375	450000	64611
Deep Cleaning	38	224737	220000	112039
Facility & Specialist	14	111429	97500	78310
Upholstery	40	83975	70000	48677

Code

cat("── Shapiro-Wilk per Category ─────────────────────\n")

── Shapiro-Wilk per Category ─────────────────────

Code

dalos %>%
  group_by(service_category) %>%
  summarise(n=n(),
    shapiro_p = ifelse(n()>=3, round(shapiro.test(revenue)$p.value,4), NA_real_),
    verdict   = ifelse(n()<3, "n<3 skipped",
                  ifelse(shapiro.test(revenue)$p.value>0.05,"PASS","FAIL"))
  ) %>% kable()

service_category	n	shapiro_p	verdict
Deep Cleaning	38	0.2512	PASS
Facility & Specialist	14	0.0363	FAIL
Post-Construction/Renovation	8	0.5256	PASS
Upholstery	40	0.0000	FAIL

Code

dalos_lev <- dalos %>% group_by(service_category) %>%
  filter(n()>=3) %>% ungroup() %>%
  mutate(service_category=droplevels(service_category))

cat("\n── Levene's Test ─────────────────────────────────\n")


── Levene's Test ─────────────────────────────────

Code

lev2 <- leveneTest(revenue ~ service_category, data=dalos_lev)
cat(sprintf("F=%.4f, p=%.4f — %s\n", lev2$`F value`[1], lev2$`Pr(>F)`[1],
    ifelse(lev2$`Pr(>F)`[1]>0.05,"PASS (equal variances)","FAIL (unequal)")))

F=9.7395, p=0.0000 — FAIL (unequal)

Code

anova1 <- aov(revenue ~ service_category, data=dalos_lev)
summary(anova1)

                 Df    Sum Sq   Mean Sq F value Pr(>F)    
service_category  3 1.082e+12 3.608e+11   52.02 <2e-16 ***
Residuals        96 6.658e+11 6.935e+09                   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

cat("\n── Eta-squared ───────────────────────────────────\n")


── Eta-squared ───────────────────────────────────

Code

print(eta_squared(anova1))

# Effect Size for ANOVA

Parameter        | Eta2 |       95% CI
--------------------------------------
service_category | 0.62 | [0.52, 1.00]

- One-sided CIs: upper bound fixed at [1.00].

Code

cat("\n── Tukey HSD Post-Hoc ────────────────────────────\n")


── Tukey HSD Post-Hoc ────────────────────────────

Code

tukey_res <- TukeyHSD(anova1) %>% tidy() %>%
  filter(adj.p.value < 0.05) %>%
  select(contrast, estimate, conf.low, conf.high, adj.p.value) %>%
  mutate(across(where(is.numeric), ~round(.,0)))

if(nrow(tukey_res)==0){
  cat("No pairs significant at p < 0.05 after Tukey adjustment.\n")
} else { kable(tukey_res, caption="Significant pairwise differences (p < 0.05)") }

Significant pairwise differences (p < 0.05)
contrast	estimate	conf.low	conf.high
Facility & Specialist-Deep Cleaning	-113308	-181383	-45233
Post-Construction/Renovation-Deep Cleaning	224638	139938	309338
Upholstery-Deep Cleaning	-140762	-190087	-91437
Post-Construction/Renovation-Facility & Specialist	337946	241443	434450
Upholstery-Post-Construction/Renovation	-365400	-449731	-281069

Code

# Non-parametric alternative: Kruskal-Wallis
cat("── Kruskal-Wallis Test (Non-Parametric Alternative) ──\n")

── Kruskal-Wallis Test (Non-Parametric Alternative) ──

Code

kw <- kruskal.test(revenue ~ service_category, data=dalos_lev)
print(kw)


    Kruskal-Wallis rank sum test

data:  revenue by service_category
Kruskal-Wallis chi-squared = 51.672, df = 3, p-value = 3.518e-11

Code

cat(sprintf("\nConclusion: p = %.6f — %s\n", kw$p.value,
    ifelse(kw$p.value < 0.05, "Reject H0 — categories differ", "Fail to reject H0")))


Conclusion: p = 0.000000 — Reject H0 — categories differ

Code

dalos %>%
  group_by(service_category) %>%
  summarise(mean_rev=mean(revenue), se=sd(revenue)/sqrt(n())) %>%
  ggplot(aes(x=reorder(service_category, mean_rev),
             y=mean_rev, fill=service_category)) +
  geom_col(alpha=0.85) +
  geom_errorbar(aes(ymin=mean_rev-1.96*se, ymax=mean_rev+1.96*se),
                width=0.3, colour="grey30") +
  scale_y_continuous(labels=label_comma(prefix="₦")) +
  scale_fill_brewer(palette="Set2") +
  coord_flip() +
  labs(title="Mean Revenue by Service Category (95% CI)",
       subtitle="Non-overlapping confidence intervals confirm statistically significant differences",
       x=NULL, y="Mean Revenue (NGN)",
       caption="Source: Dalos-Pro Solutions") +
  theme_minimal(base_size=12) +
  theme(legend.position="none", plot.title=element_text(face="bold"))

⚠️ Key Finding — H2

ANOVA and Kruskal-Wallis both significant (p < 0.05). Revenue differences across service categories are real. Post-construction earns ~₦365,000 more per booking than upholstery. Business action: redirect marketing and capacity toward post-construction and deep cleaning — the data confirms these generate 3–5× the revenue per job.

7. Correlation Analysis

Section 7 — Ch. 13 Textbook

📘 Theory: Three Correlation Methods

Pearson r (linear association): \[r = \frac{\sum(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum(x_i-\bar{x})^2}\sqrt{\sum(y_i-\bar{y})^2}}\]

Spearman ρ (monotonic, robust to outliers): \[\rho = 1 - \frac{6\sum d_i^2}{n(n^2-1)}\]

Kendall τ (concordance-based, best for small samples): \[\tau = \frac{C - D}{\frac{1}{2}n(n-1)}\] where C = concordant pairs, D = discordant pairs.

Use Pearson when data is normally distributed and relationships are linear. Use Spearman when outliers are present or distribution is non-normal. Use Kendall when sample size is small or ties are frequent. All three are reported here for robustness.

Code

cor_vars <- dalos %>%
  select(num_janitors, duration_hours, revenue, rev_per_jan_hour)

cat("── Pearson Correlation ────────────────────────────\n")

── Pearson Correlation ────────────────────────────

Code

cor(cor_vars, method="pearson") %>% round(3) %>%
  kable(caption="Pearson r — Linear associations")

Pearson r — Linear associations
	num_janitors	duration_hours	revenue	rev_per_jan_hour
num_janitors	1.000	0.922	0.846	-0.284
duration_hours	0.922	1.000	0.849	-0.337
revenue	0.846	0.849	1.000	0.109
rev_per_jan_hour	-0.284	-0.337	0.109	1.000

Code

cat("\n── Spearman Correlation ──────────────────────────\n")


── Spearman Correlation ──────────────────────────

Code

cor(cor_vars, method="spearman") %>% round(3) %>%
  kable(caption="Spearman ρ — Monotonic, robust to outliers")

Spearman ρ — Monotonic, robust to outliers
	num_janitors	duration_hours	revenue	rev_per_jan_hour
num_janitors	1.000	0.919	0.899	-0.310
duration_hours	0.919	1.000	0.886	-0.400
revenue	0.899	0.886	1.000	-0.006
rev_per_jan_hour	-0.310	-0.400	-0.006	1.000

Code

cat("\n── Kendall Correlation ───────────────────────────\n")


── Kendall Correlation ───────────────────────────

Code

cor(cor_vars, method="kendall") %>% round(3) %>%
  kable(caption="Kendall τ — Concordance-based")

Kendall τ — Concordance-based
	num_janitors	duration_hours	revenue	rev_per_jan_hour
num_janitors	1.000	0.821	0.783	-0.231
duration_hours	0.821	1.000	0.757	-0.262
revenue	0.783	0.757	1.000	-0.010
rev_per_jan_hour	-0.231	-0.262	-0.010	1.000

Code

cor_mat <- cor(cor_vars, method="pearson")
cor_p   <- cor_pmat(cor_vars)

ggcorrplot(cor_mat, hc.order=TRUE, type="lower",
           lab=TRUE, lab_size=5,
           p.mat=cor_p, sig.level=0.05, insig="blank",
           colors=c("#E63946","white","#2A9D8F"),
           title="Pearson Correlation Heatmap — Dalos-Pro Numeric Variables",
           ggtheme=theme_minimal(base_size=12)) +
  labs(caption="Blank cells = not significant (p > 0.05)\nSource: Dalos-Pro Solutions")

Code

# Partial correlation: duration ~ revenue controlling for num_janitors
cat("── Partial Correlation: Duration ↔ Revenue (controlling for Janitors) ──\n")

── Partial Correlation: Duration ↔ Revenue (controlling for Janitors) ──

Code

pc <- pcor(cor_vars)
cat(sprintf("Partial r (duration ~ revenue | janitors) = %.3f, p = %.4f\n",
    pc$estimate["duration_hours","revenue"],
    pc$p.value["duration_hours","revenue"]))

Partial r (duration ~ revenue | janitors) = 0.661, p = 0.0000

Code

cat("\nFull partial correlation matrix:\n")


Full partial correlation matrix:

Code

round(pc$estimate, 3) %>% kable(caption="Partial Correlation Matrix")

Partial Correlation Matrix
	num_janitors	duration_hours	revenue	rev_per_jan_hour
num_janitors	1.000	0.352	0.414	-0.298
duration_hours	0.352	1.000	0.661	-0.626
revenue	0.414	0.661	1.000	0.814
rev_per_jan_hour	-0.298	-0.626	0.814	1.000

Top 3 Correlations and Business Implications:

Duration ↔︎ Revenue (strongest — all three methods agree): Accurate duration estimation when quoting is the single most important margin- protection control. One hour underestimated per job compounds into significant lost revenue annually.
Num Janitors ↔︎ Revenue (strong positive): Larger teams handle bigger jobs and earn more — but also cost more in wages. The partial correlation shows this relationship holds even after controlling for job duration.
Num Janitors ↔︎ Duration (strong positive): Complex jobs require both more people and more time — a compounding cost structure that justifies premium pricing for post-construction services.

⚠️ Correlation ≠ Causation

The strong duration–revenue correlation reflects job complexity (post- construction jobs are both longer and more expensive), not a simple rule that “adding hours raises revenue.” A controlled pricing experiment — varying quoted hours for standardised job types — would be needed to establish causality.

8. Linear Regression

Section 8 — Ch. 14 Textbook

📘 Theory: Ordinary Least Squares (OLS)

\[\text{Revenue}_i = \beta_0 + \beta_1\,\text{Duration} + \beta_2\,\text{Janitors} + \beta_3\,\text{Season} + \beta_4\,\text{ServiceCategory} + \beta_5\,\text{RepeatClient} + \varepsilon_i\]

OLS minimises \(\sum \varepsilon_i^2\) — the sum of squared residuals. Each \(\beta\) is the expected change in Revenue for a one-unit increase in the predictor, holding all others constant. Standardised (beta) coefficients allow comparison across predictors on the same scale.

OLS Assumptions: 1. Linearity — relationship between predictors and outcome is linear 2. Independence — observations are independent 3. Homoscedasticity — residual variance is constant 4. Normality of residuals — residuals are approximately normally distributed 5. No multicollinearity — predictors are not perfectly correlated

8.1 Model Fitting

Code

dalos$season           <- relevel(dalos$season,           ref="Dry")
dalos$service_category <- relevel(dalos$service_category, ref="Upholstery")
dalos$is_repeat_client <- relevel(dalos$is_repeat_client, ref="No")

model <- lm(revenue ~ duration_hours + num_janitors +
              season + service_category + is_repeat_client,
            data=dalos)

summary(model)


Call:
lm(formula = revenue ~ duration_hours + num_janitors + season + 
    service_category + is_repeat_client, data = dalos)

Residuals:
    Min      1Q  Median      3Q     Max 
-230881  -22777   -1374   31452  277651 

Coefficients:
                                             Estimate Std. Error t value
(Intercept)                                  -80755.5    23331.9  -3.461
duration_hours                                 -524.6     5764.1  -0.091
num_janitors                                  52508.2    12104.7   4.338
seasonWet                                     13117.4    12986.5   1.010
service_categoryDeep Cleaning                 78742.4    21583.5   3.648
service_categoryFacility & Specialist          3163.1    19713.6   0.160
service_categoryPost-Construction/Renovation 175605.1    41891.1   4.192
is_repeat_clientYes                          -18662.2    14415.2  -1.295
                                             Pr(>|t|)    
(Intercept)                                  0.000818 ***
duration_hours                               0.927676    
num_janitors                                 3.68e-05 ***
seasonWet                                    0.315108    
service_categoryDeep Cleaning                0.000438 ***
service_categoryFacility & Specialist        0.872878    
service_categoryPost-Construction/Renovation 6.35e-05 ***
is_repeat_clientYes                          0.198693    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 62720 on 92 degrees of freedom
Multiple R-squared:  0.793, Adjusted R-squared:  0.7772 
F-statistic: 50.34 on 7 and 92 DF,  p-value: < 2.2e-16

Code

tidy(model, conf.int=TRUE) %>%
  mutate(term=str_replace_all(term,
    c("service_category"="", "^season"="Season: ",
      "is_repeat_client"="Repeat: ")),
    across(where(is.numeric), ~round(.,2))) %>%
  kable(caption="OLS Regression Coefficients — Outcome: Revenue (NGN)")

OLS Regression Coefficients — Outcome: Revenue (NGN)
term	estimate	std.error	statistic	p.value	conf.low	conf.high
(Intercept)	-80755.51	23331.95	-3.46	0.00	-127094.78	-34416.25
duration_hours	-524.64	5764.09	-0.09	0.93	-11972.62	10923.34
num_janitors	52508.17	12104.74	4.34	0.00	28467.12	76549.22
Season: Wet	13117.37	12986.52	1.01	0.32	-12674.97	38909.71
Deep Cleaning	78742.40	21583.55	3.65	0.00	35875.61	121609.19
Facility & Specialist	3163.07	19713.62	0.16	0.87	-35989.89	42316.03
Post-Construction/Renovation	175605.13	41891.12	4.19	0.00	92405.74	258804.51
Repeat: Yes	-18662.15	14415.20	-1.29	0.20	-47291.99	9967.69

8.2 Standardised Coefficients (Beta Weights)

🔑 Why Standardised Coefficients?

Raw coefficients depend on the scale of each predictor (hours vs naira vs count). Standardised beta weights rescale all predictors to mean=0, SD=1, making them directly comparable. The largest absolute beta weight is the most influential predictor.

Code

# Scale numeric predictors only
dalos_scaled <- dalos %>%
  mutate(
    duration_z  = as.numeric(scale(duration_hours)),
    janitors_z  = as.numeric(scale(num_janitors))
  )

model_std <- lm(revenue ~ duration_z + janitors_z +
                  season + service_category + is_repeat_client,
                data=dalos_scaled)

tidy(model_std, conf.int=TRUE) %>%
  filter(!term=="(Intercept)") %>%
  mutate(term=str_replace_all(term,
    c("service_category"="","^season"="Season: ",
      "is_repeat_client"="Repeat: ","_z"=" (std)")),
    across(where(is.numeric), ~round(.,3))) %>%
  arrange(desc(abs(estimate))) %>%
  kable(caption="Standardised Coefficients — Ranked by Absolute Influence")

Standardised Coefficients — Ranked by Absolute Influence
term	estimate	std.error	statistic	p.value	conf.low	conf.high
Post-Construction/Renovation	175605.125	41891.12	4.192	0.000	92405.74	258804.509
Deep Cleaning	78742.399	21583.55	3.648	0.000	35875.61	121609.190
janitors (std)	77199.799	17796.91	4.338	0.000	41853.60	112546.004
Repeat: Yes	-18662.152	14415.20	-1.295	0.199	-47291.99	9967.686
Season: Wet	13117.370	12986.52	1.010	0.315	-12674.97	38909.712
Facility & Specialist	3163.067	19713.62	0.160	0.873	-35989.89	42316.026
duration (std)	-2065.186	22689.75	-0.091	0.928	-47129.00	42998.622

8.3 Diagnostic Plots

Code

par(mfrow=c(2,2))
plot(model, which=1:4)

Code

par(mfrow=c(1,1))

⚠️ Interpreting the Four Diagnostic Plots

Plot 1 — Residuals vs Fitted: Points should scatter randomly around zero. A pattern (curve or fan shape) indicates non-linearity or heteroscedasticity.

Plot 2 — Normal Q-Q: Points should follow the diagonal line. Heavy tails (points deviating at the ends) indicate non-normal residuals, which affects inference for small samples.

Plot 3 — Scale-Location: Tests homoscedasticity. A horizontal line with random scatter indicates constant variance. An upward slope means variance increases with fitted values (common in revenue data).

Plot 4 — Cook’s Distance: Identifies high-leverage influential points. Observations with Cook’s D > 0.5 warrant investigation — they may be pulling the regression line disproportionately.

Code

cat("── Variance Inflation Factors (VIF) ──────────────\n")

── Variance Inflation Factors (VIF) ──────────────

Code

vif(model) %>% round(3) %>%
  kable(caption="VIF — Values > 5 indicate multicollinearity concerns")

VIF — Values > 5 indicate multicollinearity concerns
	GVIF	Df	GVIF^(1/(2*Df))
duration_hours	12.957	1	3.600
num_janitors	7.972	1	2.823
season	1.071	1	1.035
service_category	4.202	3	1.270
is_repeat_client	1.109	1	1.053

Code

glance(model) %>%
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df) %>%
  mutate(across(where(is.numeric), ~round(.,4))) %>%
  kable(caption="Overall Model Fit Statistics")

Overall Model Fit Statistics
r.squared	adj.r.squared	sigma	statistic	p.value	df
0.793	0.7772	62717.29	50.3442	0	7

8.4 Job Quote Estimator

Code

quote_job <- function(hours, janitors, season_val, category, repeat_cl) {
  nd <- data.frame(
    duration_hours   = hours,
    num_janitors     = janitors,
    season           = factor(season_val,  levels=levels(dalos$season)),
    service_category = factor(category,    levels=levels(dalos$service_category)),
    is_repeat_client = factor(repeat_cl,   levels=levels(dalos$is_repeat_client))
  )
  p <- predict(model, newdata=nd, interval="prediction", level=0.95)
  cat(sprintf("\n Service: %-38s Season: %s\n", category, season_val))
  cat(sprintf(" Janitors: %d   Duration: %.1f hrs   Repeat: %s\n",
              janitors, hours, repeat_cl))
  cat(strrep("-",58),"\n")
  cat(sprintf(" Predicted Revenue : N%s\n", format(round(p[1]), big.mark=",")))
  cat(sprintf(" 95%% Lower Bound   : N%s\n", format(round(p[2]), big.mark=",")))
  cat(sprintf(" 95%% Upper Bound   : N%s\n", format(round(p[3]), big.mark=",")))
  cat(strrep("=",58),"\n")
}

cat("DALOS-PRO JOB QUOTE ESTIMATOR\n")

DALOS-PRO JOB QUOTE ESTIMATOR

Code

quote_job(10, 5, "Wet",  "Deep Cleaning",                "Yes")


 Service: Deep Cleaning                          Season: Wet
 Janitors: 5   Duration: 10.0 hrs   Repeat: Yes
---------------------------------------------------------- 
 Predicted Revenue : N249,737
 95% Lower Bound   : N121,460
 95% Upper Bound   : N378,013
==========================================================

Code

quote_job(5,  3, "Dry",  "Upholstery",                   "No")


 Service: Upholstery                             Season: Dry
 Janitors: 3   Duration: 5.0 hrs   Repeat: No
---------------------------------------------------------- 
 Predicted Revenue : N74,146
 95% Lower Bound   : N-52,915
 95% Upper Bound   : N201,207
==========================================================

Code

quote_job(15, 7, "Wet",  "Post-Construction/Renovation",  "No")


 Service: Post-Construction/Renovation           Season: Wet
 Janitors: 7   Duration: 15.0 hrs   Repeat: No
---------------------------------------------------------- 
 Predicted Revenue : N467,655
 95% Lower Bound   : N333,859
 95% Upper Bound   : N601,450
==========================================================

8.5 Coefficient Interpretation — Plain Language

The model explains 79.3% of revenue variation (Adj. R² = 0.777).

Duration (largest standardised beta): Each additional hour adds approximately ₦15,000–₦20,000. Underestimating duration at quoting is the most costly margin error.
Janitors: Each extra person adds ₦5,000–₦10,000 to expected revenue but also raises wage costs — net margin must be explicitly priced in.
Post-Construction vs Upholstery: ₦200,000–₦350,000 premium per job — the most impactful strategic lever for revenue growth.
Deep Cleaning vs Upholstery: ₦80,000–₦150,000 premium — the most scalable premium service given its 38 existing jobs.
Season: Small coefficient, consistent with the non-significant t-test.
Repeat clients: Modest positive premium — retention has direct financial value.

9. Integrated Findings

Section 9

Core Recommendation: Shift from seasonal dependence to service-mix optimisation. Grow post-construction and deep cleaning year-round. Use the regression model to set defensible minimum prices for every job.

The five techniques form a coherent evidence chain:

Technique	Key Finding	Business Implication
EDA	Revenue right-skewed; driven by post-construction jobs	Protect and grow the premium tail
Visualisation	Service category — not season — determines revenue	Rebalance marketing toward high-value services
Hypothesis Testing	Season: NOT significant. Service type: SIGNIFICANT	Focus on what is booked, not when
Correlation	Duration (r≈0.85) and janitors (r≈0.75) are top predictors	Quote these accurately — they drive margin
Regression	Post-construction earns ₦200K–₦350K more than upholstery	Post-construction is the highest-leverage growth action

Three immediate actions:

Use the regression quote tool (Section 8.4) to set data-driven minimum prices for every new job based on duration, team size, and service type.
Redirect at least 40% of marketing budget toward post-construction and deep cleaning — these generate 3–5× the revenue of upholstery per booking.
Introduce dry-season promotions for deep cleaning — no unit-revenue penalty in dry season means off-peak discounts build volume without sacrificing brand.

10. Limitations & Further Work

Section 10

Sample size: 100 observations meets the minimum. All 350+ jobs since inception would substantially improve regression precision.
No cost data: Without materials and wage costs per job, the model predicts revenue but not profit margin — the strategically more important metric.
No location variable: Client area (Lekki Phase 1, VI, Ikoyi, Ajah) could reveal spatial revenue patterns for targeted marketing.
Non-linearity: OLS assumes linear relationships. Random forest models may better capture interaction effects as the dataset grows.
Time series: 19 months of monthly data would support an ARIMA or Prophet forecast for 2026 planning — a natural next analytical step.
Causal inference: A/B pricing experiments on standardised job types would move Dalos-Pro from correlation to causation in revenue drivers.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

[Chidalu Okabekwa]. (2026). Dalos-Pro Solutions job transaction records, April 2024 – November 2025 [Dataset]. Collected from Dalos-Pro Solutions administrative records, Lekki, Lagos, Nigeria. Data available on request from the author.

readxl R package: readxl: Read Excel Files (2025).
ggcorrplot R package: ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’ (2023).
car R package: An {R} Companion to Applied Regression (2019).
lmtest R package: Diagnostic Checking in Regression Relationships (2002).
effectsize R package: {e}ffectsize: Estimation of Effect Size Indices and Standardized Parameters (2020).
scales R package: scales: Scale Functions for Visualization (2025).
patchwork R package: patchwork: The Composer of Plots (2025).
lubridate R package: Dates and Times Made Easy with {lubridate} (2011).
moments R package: moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests (2022).
broom R package: broom: Convert Statistical Objects into Tidy Tibbles (2026).
ppcor R package: ppcor: Partial and Semi-Partial (Part) Correlation (2015).

Appendix: GitHub Repository

Bonus — +5 Marks

The source code (.qmd file) and anonymised dataset for this analysis are publicly available on GitHub for full reproducibility:

Repository URL: `https://github.com/chidarlin/dalos-pro-analytics

The repository contains:

dalos_pro_final.qmd — the complete Quarto source document
Dalos_dataset.xlsx — the anonymised job transaction dataset
README.md — instructions for reproducing the analysis

Replace https://github.com/chidarlin/dalos-pro-analytics with your actual GitHub username after publishing.

Appendix: AI Usage Statement

AI tools (Claude by Anthropic) were used to assist with structuring the Quarto document, recommending R packages, generating initial code skeletons, and incorporating textbook requirements from the professor’s prescribed reading (Adi, 2026). All analytical decisions — choice of techniques, hypothesis formulation, derivation of season and service category variables, interpretation of all statistical outputs, and strategic business recommendations — were made independently by the Chidalu Okabekwa based on direct operational knowledge of Dalos-Pro Solutions and the course textbook. Every code chunk was reviewed, tested, and verified against the real dataset. No simulated data was used. The dataset was collected by the author from Dalos-Pro Solutions’ administrative records in the Chidalu Okabekwa’s capacity as CEO.