Exploratory & Inferential Analytics: Dalos-Pro Solutions
1. Executive Summary
Dalos-Pro Solutions is a professional cleaning and facility maintenance company based in Lekki, Lagos, Nigeria, founded in October 2023. This report applies five analytical techniques — Exploratory Data Analysis, Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — to a primary dataset of 100 completed job transactions recorded between April 2024 and November 2025. Variables per job include date, service type, janitors deployed, job duration, revenue charged, and repeat-client status. Two variables were derived: a season indicator (Wet: September–January; Dry: February–August) and a four-category service grouping. Key findings reveal that post-construction and renovation jobs generate the highest mean revenue (₦449,375), that revenue differences across service categories are statistically significant (ANOVA, p < 0.05), and that job duration and team size are the strongest operational predictors of revenue. Critically, the wet/dry season revenue difference was not statistically significant — redirecting strategic focus from seasonal timing to service-mix optimisation. The report recommends prioritising post-construction and deep cleaning services and using the regression model to set data-driven minimum prices for every job quote.
2. Professional Disclosure
Job Title: Chief Executive Officer (CEO)
Organisation: Dalos-Pro Solutions — Professional Cleaning & Facility Maintenance, Lekki, Lagos, Nigeria.
Sector: Facilities Management / Cleaning Services (SME)
3. Data Collection & Sampling
Source: Primary data from Dalos-Pro Solutions’ internal job records — invoice logs, WhatsApp booking confirmations, and operational tracking sheets.
Collection Method: Manual transcription into Microsoft Excel by the Chidalu Okabekwa in their capacity as CEO. Each row = one completed, invoiced job.
Sampling Frame: All completed and invoiced jobs, April 2024 – November 2025, with full records across all six variables.
Sample Size: 100 job records (meets the 100-observation minimum).
Time Period: 9 April 2024 – 15 November 2025 (~19 months).
| Variable | Type | Description |
|---|---|---|
| Job_date | Date | Date job was completed |
| Service_type | Categorical | 15 raw service types |
| Num_janitors | Numeric | Janitors deployed |
| Job_duration_hours | Numeric | Total hours on job |
| Revenue | Numeric | Amount charged (NGN) |
| Is_repeat_client | Categorical | Repeat or new client |
| season (derived) | Categorical | Wet (Sep–Jan) / Dry (Feb–Aug) |
| service_category (derived) | Categorical | 4 analytical groups |
| rev_per_jan_hour (derived) | Numeric | Revenue efficiency metric |
Sampling Justification: A census approach was adopted — all available completed jobs in the period were captured — because the population (~100 jobs with full records) was small enough to include entirely, maximising statistical power and eliminating sampling bias.
Ethical Notes: No client personal data included. Jobs identified by date and service type only. Data is proprietary to Dalos-Pro Solutions, available on request for academic verification.
4. Data Description & EDA
4.1 Load Libraries
Code
# Run once to install:
# install.packages(c("tidyverse","readxl","ggplot2","ggcorrplot",
# "car","lmtest","effectsize","knitr","scales",
# "patchwork","lubridate","moments","broom","ppcor"))
library(tidyverse)
library(readxl)
library(ggplot2)
library(ggcorrplot)
library(car)
library(lmtest)
library(effectsize)
library(knitr)
library(scales)
library(patchwork)
library(lubridate)
library(moments)
library(broom)
library(ppcor)
# Explicitly resolve namespace conflicts
select <- dplyr::select
filter <- dplyr::filter
summarise <- dplyr::summarise
mutate <- dplyr::mutate
rename <- dplyr::rename4.2 Load & Prepare Data
Code
dalos_raw <- read_excel("Dalos_dataset.xlsx")
dalos <- dalos_raw %>%
rename(
job_date = Job_date,
service_type = Service_type,
num_janitors = Num_janitors,
duration_hours = Job_duration_hours,
revenue = Revenue,
is_repeat_client = Is_repeat_client
) %>%
mutate(
job_date = as.Date(job_date),
season = factor(
ifelse(month(job_date) %in% c(9,10,11,12,1), "Wet", "Dry"),
levels = c("Dry","Wet")
),
service_category = factor(case_when(
str_detect(service_type, "Post-Construction|Renovation") ~
"Post-Construction/Renovation",
str_detect(service_type, "Deep Cleaning") ~
"Deep Cleaning",
str_detect(service_type, "Upholstery") &
!str_detect(service_type, "Deep Cleaning") ~
"Upholstery",
TRUE ~ "Facility & Specialist"
)),
is_repeat_client = factor(is_repeat_client, levels = c("No","Yes")),
rev_per_jan_hour = round(revenue / (num_janitors * duration_hours), 0)
)
cat("Rows:", nrow(dalos), "| Cols:", ncol(dalos), "\n")Rows: 100 | Cols: 9
Code
cat("Period:", format(min(dalos$job_date)), "to", format(max(dalos$job_date)), "\n")Period: 2024-04-09 to 2025-11-15
Code
cat("Missing values:", sum(is.na(dalos)), "\n")Missing values: 0
4.3 Data Overview & Rich Summary
Code
# Data overview — dimensions, types, quick summary
cat("Dimensions:", nrow(dalos), "rows x", ncol(dalos), "cols\n\n")Dimensions: 100 rows x 9 cols
Code
glimpse(dalos)Rows: 100
Columns: 9
$ job_date <date> 2024-04-09, 2024-04-10, 2024-04-17, 2024-05-10, 2024…
$ service_type <chr> "Deep Cleaning + Maintenance", "Upholstery", "Upholst…
$ num_janitors <dbl> 4, 3, 3, 6, 5, 6, 7, 6, 5, 3, 5, 3, 3, 4, 5, 5, 3, 3,…
$ duration_hours <dbl> 8.0, 5.0, 5.0, 10.0, 8.5, 11.0, 17.0, 13.0, 9.5, 4.5,…
$ revenue <dbl> 315000, 75000, 75000, 350000, 320000, 340000, 460000,…
$ is_repeat_client <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, Y…
$ season <fct> Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry…
$ service_category <fct> Deep Cleaning, Upholstery, Upholstery, Deep Cleaning,…
$ rev_per_jan_hour <dbl> 9844, 5000, 5000, 5833, 7529, 5152, 3866, 5641, 6737,…
Code
cat("\n--- Numeric Summary ---\n")
--- Numeric Summary ---
Code
summary(dalos[, c("num_janitors","duration_hours","revenue","rev_per_jan_hour")]) num_janitors duration_hours revenue rev_per_jan_hour
Min. :2 Min. : 2.000 Min. : 20000 Min. : 1641
1st Qu.:3 1st Qu.: 4.500 1st Qu.: 70000 1st Qu.: 4350
Median :4 Median : 6.750 Median :120000 Median : 5168
Mean :4 Mean : 7.585 Mean :170540 Mean : 5351
3rd Qu.:5 3rd Qu.:10.000 3rd Qu.:252500 3rd Qu.: 6000
Max. :9 Max. :22.000 Max. :550000 Max. :11667
Code
cat("\n--- Categorical Summary ---\n")
--- Categorical Summary ---
Code
table(Season = dalos$season)Season
Dry Wet
51 49
Code
table(Service = dalos$service_category)Service
Deep Cleaning Facility & Specialist
38 14
Post-Construction/Renovation Upholstery
8 40
Code
table(Repeat_Client = dalos$is_repeat_client)Repeat_Client
No Yes
70 30
4.4 Detailed Numeric Statistics — Mean, Median, SD, CV, Skewness, Kurtosis
Code
num_vars <- c("num_janitors","duration_hours","revenue","rev_per_jan_hour")
stats_table <- data.frame(
Variable = num_vars,
Mean = sapply(dalos[num_vars], function(x) round(mean(x, na.rm=TRUE), 1)),
Median = sapply(dalos[num_vars], function(x) round(median(x, na.rm=TRUE), 1)),
SD = sapply(dalos[num_vars], function(x) round(sd(x, na.rm=TRUE), 1)),
CV_pct = sapply(dalos[num_vars], function(x) round(sd(x)/mean(x)*100, 1)),
Skewness = sapply(dalos[num_vars], function(x) round(skewness(x), 3)),
Kurtosis = sapply(dalos[num_vars], function(x) round(kurtosis(x)-3, 3)),
Min = sapply(dalos[num_vars], function(x) round(min(x), 0)),
Max = sapply(dalos[num_vars], function(x) round(max(x), 0)),
row.names = NULL
)
kable(stats_table, caption = "Extended Descriptive Statistics — Numeric Variables")| Variable | Mean | Median | SD | CV_pct | Skewness | Kurtosis | Min | Max |
|---|---|---|---|---|---|---|---|---|
| num_janitors | 4.0 | 4.0 | 1.5 | 36.8 | 0.862 | 0.743 | 2 | 9 |
| duration_hours | 7.6 | 6.8 | 3.9 | 51.9 | 1.109 | 1.298 | 2 | 22 |
| revenue | 170540.0 | 120000.0 | 132880.5 | 77.9 | 1.084 | 0.250 | 20000 | 550000 |
| rev_per_jan_hour | 5351.1 | 5168.5 | 1888.8 | 35.3 | 0.901 | 1.459 | 1641 | 11667 |
4.5 Frequency Tables
Code
dalos %>% count(service_category, sort=TRUE) %>%
mutate(Pct = round(n/sum(n)*100,1)) %>%
kable(col.names=c("Service Category","Count","%"), caption="Service Category Frequencies")| Service Category | Count | % |
|---|---|---|
| Upholstery | 40 | 40 |
| Deep Cleaning | 38 | 38 |
| Facility & Specialist | 14 | 14 |
| Post-Construction/Renovation | 8 | 8 |
Code
dalos %>% count(season) %>%
mutate(Pct = round(n/sum(n)*100,1)) %>%
kable(col.names=c("Season","Count","%"), caption="Season Distribution")| Season | Count | % |
|---|---|---|
| Dry | 51 | 51 |
| Wet | 49 | 49 |
Code
dalos %>% count(is_repeat_client) %>%
mutate(Pct = round(n/sum(n)*100,1)) %>%
kable(col.names=c("Repeat Client","Count","%"), caption="Repeat vs New Clients")| Repeat Client | Count | % |
|---|---|---|
| No | 70 | 70 |
| Yes | 30 | 30 |
4.6 Data Quality — Missing Values, Outliers & Skewness
Code
cat("── Missing Values ────────────────────────────────\n")── Missing Values ────────────────────────────────
Code
colSums(is.na(dalos)) %>%
as.data.frame() %>% rename("Missing" = ".") %>% kable()| Missing | |
|---|---|
| job_date | 0 |
| service_type | 0 |
| num_janitors | 0 |
| duration_hours | 0 |
| revenue | 0 |
| is_repeat_client | 0 |
| season | 0 |
| service_category | 0 |
| rev_per_jan_hour | 0 |
Code
detect_outliers <- function(x) {
q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1]
sum(x < q[1] - 1.5*iqr | x > q[2] + 1.5*iqr, na.rm = TRUE)
}
cat("\n── Outlier Counts (IQR Method) ───────────────────\n")
── Outlier Counts (IQR Method) ───────────────────
Code
num_vars <- c("num_janitors","duration_hours","revenue","rev_per_jan_hour")
outlier_counts <- sapply(dalos[num_vars], detect_outliers)
kable(t(as.data.frame(outlier_counts)), caption = "Outliers per variable (IQR method)")| num_janitors | duration_hours | revenue | rev_per_jan_hour | |
|---|---|---|---|---|
| outlier_counts | 1 | 2 | 2 | 8 |
Code
cat("\n── Skewness & Kurtosis ───────────────────────────\n")
── Skewness & Kurtosis ───────────────────────────
Code
num_vars <- c("num_janitors","duration_hours","revenue","rev_per_jan_hour")
sk_table <- data.frame(
Variable = num_vars,
Skewness = sapply(dalos[num_vars], function(x) round(skewness(x), 3)),
Kurtosis = sapply(dalos[num_vars], function(x) round(kurtosis(x)-3, 3)),
row.names = NULL
)
kable(sk_table, caption = "Skewness & Excess Kurtosis (|Skew| > 1 = high skew)")| Variable | Skewness | Kurtosis |
|---|---|---|
| num_janitors | 0.862 | 0.743 |
| duration_hours | 1.109 | 1.298 |
| revenue | 1.084 | 0.250 |
| rev_per_jan_hour | 0.901 | 1.459 |
Data Quality Finding 1: No missing values across all 100 records — reflecting consistent administrative record-keeping by Dalos-Pro’s team.
Data Quality Finding 2: Revenue is right-skewed (skewness > 1), driven by high-value post-construction jobs. The mean (₦170,540) sits above the median (₦120,000). This is noted in regression diagnostics below.
5. Data Visualisation
Five plots, one story: service category — not season — drives revenue at Dalos-Pro. Post-construction earns 4–5× more than upholstery per job. Duration is the clearest operational lever for per-booking earnings.
Code
# Chart type: Box plot — chosen over bar chart because it shows full
# distribution shape, median, IQR, and outliers simultaneously.
ggplot(dalos, aes(x=reorder(service_category, revenue, median),
y=revenue, fill=service_category)) +
geom_boxplot(alpha=0.85, outlier.colour="#E63946",
outlier.shape=16, outlier.size=2.5) +
scale_y_continuous(labels=label_comma(prefix="₦")) +
scale_fill_brewer(palette="Set2") +
coord_flip() +
labs(title="Plot 1 — Revenue Distribution by Service Category",
subtitle="Post-Construction/Renovation dominates revenue despite fewest bookings\nChart type: Box plot — shows full distribution, median, IQR and outliers",
x=NULL, y="Revenue (NGN)",
caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
theme_minimal(base_size=12) +
theme(legend.position="none", plot.title=element_text(face="bold"))Code
# Chart type: Stacked bar — best for showing total revenue composition
# over time with categorical breakdown by season.
dalos %>%
mutate(month=floor_date(job_date,"month")) %>%
group_by(month, season) %>%
summarise(total_rev=sum(revenue), .groups="drop") %>%
ggplot(aes(x=month, y=total_rev, fill=season)) +
geom_col(alpha=0.85) +
scale_y_continuous(labels=label_comma(prefix="₦")) +
scale_fill_manual(values=c(Dry="#F4A261", Wet="#2A9D8F")) +
labs(title="Plot 2 — Monthly Total Revenue by Season",
subtitle="Revenue is spread across both seasons — no dominant seasonal peak evident\nChart type: Stacked bar — ideal for temporal totals with categorical fill",
x="Month", y="Total Revenue (NGN)", fill="Season",
caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
theme_minimal(base_size=12) +
theme(plot.title=element_text(face="bold"))Code
# Chart type: Violin + box — preferred over simple box plot when
# distribution shape matters; shows density alongside median and IQR.
ggplot(dalos, aes(x=is_repeat_client, y=revenue, fill=is_repeat_client)) +
geom_violin(alpha=0.70, trim=FALSE) +
geom_boxplot(width=0.12, fill="white", outlier.shape=NA) +
scale_y_continuous(labels=label_comma(prefix="₦")) +
scale_fill_manual(values=c(No="#E63946", Yes="#457B9D")) +
labs(title="Plot 3 — Revenue: Repeat vs. New Clients",
subtitle="Repeat clients show broader revenue spread — more complex and varied jobs\nChart type: Violin + box — shows distribution shape AND summary statistics",
x="Repeat Client", y="Revenue (NGN)",
caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
theme_minimal(base_size=12) +
theme(legend.position="none", plot.title=element_text(face="bold"))Code
# Chart type: Heatmap — best for showing joint frequency of two
# categorical variables; colour encodes count intensity.
dalos %>%
count(service_category, season) %>%
ggplot(aes(x=season, y=service_category, fill=n)) +
geom_tile(colour="white", linewidth=1.2) +
geom_text(aes(label=n), fontface="bold", size=5) +
scale_fill_gradient(low="#D8F3DC", high="#1B4332") +
labs(title="Plot 4 — Job Volume Heatmap: Service Category × Season",
subtitle="Upholstery and Deep Cleaning dominate job counts in both seasons\nChart type: Heatmap — encodes two categorical variables with colour intensity",
x="Season", y="Service Category", fill="Jobs",
caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
theme_minimal(base_size=12) +
theme(plot.title=element_text(face="bold"))Code
# Chart type: Scatter + regression line — chosen to show bivariate
# continuous relationship; colour adds a third dimension (service type).
ggplot(dalos, aes(x=duration_hours, y=revenue, colour=service_category)) +
geom_point(alpha=0.70, size=2.5) +
geom_smooth(method="lm", se=FALSE, linewidth=0.9) +
scale_y_continuous(labels=label_comma(prefix="₦")) +
scale_colour_brewer(palette="Set1") +
labs(title="Plot 5 — Job Duration vs. Revenue by Service Category",
subtitle="Strong positive relationship across all categories — duration is the key revenue lever\nChart type: Scatter + OLS line — ideal for bivariate continuous relationships",
x="Duration (Hours)", y="Revenue (NGN)", colour="Service Category",
caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
theme_minimal(base_size=12) +
theme(plot.title=element_text(face="bold"))Code
# Bonus Plot 6: Faceted histogram — textbook Ch.5 requirement
plot6_data <- data.frame(
Janitors_Deployed = dalos$num_janitors,
Duration_Hours = dalos$duration_hours,
Revenue_NGN = dalos$revenue,
Revenue_per_Jan_Hour = dalos$rev_per_jan_hour
)
plot6_long <- tidyr::pivot_longer(plot6_data,
cols = everything(),
names_to = "Variable", values_to = "Value")
ggplot(plot6_long, aes(x=Value, fill=Variable)) +
geom_histogram(bins=20, colour="white", alpha=0.85) +
facet_wrap(~Variable, scales="free") +
scale_fill_brewer(palette="Set2") +
labs(title="Plot 6 — Distribution of All Numeric Variables (Faceted)",
subtitle="Revenue shows right skew; janitors and duration are more symmetric",
caption="Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)") +
theme_minimal(base_size=11) +
theme(legend.position="none", plot.title=element_text(face="bold"))6. Hypothesis Testing
6.1 Hypothesis 1 — Wet vs. Dry Season Revenue
Business context: Dalos-Pro’s entire planning calendar assumes the wet season generates higher per-job revenue. This test formally evaluates that assumption.
- H₀: Mean revenue (Wet) = Mean revenue (Dry)
- H₁: Mean revenue (Wet) ≠ Mean revenue (Dry)
Code
dalos %>%
group_by(season) %>%
summarise(n=n(), Mean=round(mean(revenue),0),
Median=round(median(revenue),0), SD=round(sd(revenue),0)) %>%
kable(caption="Revenue by Season (NGN)")| season | n | Mean | Median | SD |
|---|---|---|---|---|
| Dry | 51 | 171137 | 125000 | 123424 |
| Wet | 49 | 169918 | 102000 | 143349 |
Code
wet_rev <- dalos$revenue[dalos$season=="Wet"]
dry_rev <- dalos$revenue[dalos$season=="Dry"]
cat("── Shapiro-Wilk Normality Tests ──────────────────\n")── Shapiro-Wilk Normality Tests ──────────────────
Code
sw_wet <- shapiro.test(wet_rev)
sw_dry <- shapiro.test(dry_rev)
cat(sprintf("Wet: W=%.4f, p=%.4f — %s\n", sw_wet$statistic, sw_wet$p.value,
ifelse(sw_wet$p.value > 0.05, "PASS (normal)", "FAIL (non-normal)")))Wet: W=0.8199, p=0.0000 — FAIL (non-normal)
Code
cat(sprintf("Dry: W=%.4f, p=%.4f — %s\n", sw_dry$statistic, sw_dry$p.value,
ifelse(sw_dry$p.value > 0.05, "PASS (normal)", "FAIL (non-normal)")))Dry: W=0.8918, p=0.0002 — FAIL (non-normal)
Code
cat("\n── Levene's Test (Variance Equality) ────────────\n")
── Levene's Test (Variance Equality) ────────────
Code
lev <- leveneTest(revenue ~ season, data=dalos)
cat(sprintf("F=%.4f, p=%.4f — %s\n", lev$`F value`[1], lev$`Pr(>F)`[1],
ifelse(lev$`Pr(>F)`[1] > 0.05, "PASS (equal variances)", "FAIL (unequal)")))F=0.0777, p=0.7811 — PASS (equal variances)
Code
# Welch t-test (robust to unequal variances)
t1 <- t.test(revenue ~ season, data=dalos, var.equal=FALSE)
print(t1)
Welch Two Sample t-test
data: revenue by season
t = 0.045486, df = 94.637, p-value = 0.9638
alternative hypothesis: true difference in means between group Dry and group Wet is not equal to 0
95 percent confidence interval:
-51981.93 54419.71
sample estimates:
mean in group Dry mean in group Wet
171137.3 169918.4
Code
cat("\n── Cohen's d (Effect Size) ───────────────────────\n")
── Cohen's d (Effect Size) ───────────────────────
Code
print(cohens_d(revenue ~ season, data=dalos))Cohen's d | 95% CI
-------------------------
9.13e-03 | [-0.38, 0.40]
- Estimated using pooled SD.
Code
# Non-parametric alternative (Mann-Whitney U)
# Required when normality assumption fails
cat("── Mann-Whitney U Test (Non-Parametric Alternative) ──\n")── Mann-Whitney U Test (Non-Parametric Alternative) ──
Code
mw <- wilcox.test(revenue ~ season, data=dalos, exact=FALSE)
print(mw)
Wilcoxon rank sum test with continuity correction
data: revenue by season
W = 1313.5, p-value = 0.6613
alternative hypothesis: true location shift is not equal to 0
Code
cat(sprintf("\nConclusion: p = %.4f — %s\n", mw$p.value,
ifelse(mw$p.value < 0.05, "Reject H0", "Fail to reject H0")))
Conclusion: p = 0.6613 — Fail to reject H0
6.2 Hypothesis 2 — Revenue Across Service Categories
- H₀: Mean revenue is equal across all four service categories
- H₁: At least one category has a significantly different mean revenue
Code
dalos %>%
group_by(service_category) %>%
summarise(n=n(), Mean=round(mean(revenue),0),
Median=round(median(revenue),0), SD=round(sd(revenue),0)) %>%
arrange(desc(Mean)) %>%
kable(caption="Revenue by Service Category (NGN)")| service_category | n | Mean | Median | SD |
|---|---|---|---|---|
| Post-Construction/Renovation | 8 | 449375 | 450000 | 64611 |
| Deep Cleaning | 38 | 224737 | 220000 | 112039 |
| Facility & Specialist | 14 | 111429 | 97500 | 78310 |
| Upholstery | 40 | 83975 | 70000 | 48677 |
Code
cat("── Shapiro-Wilk per Category ─────────────────────\n")── Shapiro-Wilk per Category ─────────────────────
Code
dalos %>%
group_by(service_category) %>%
summarise(n=n(),
shapiro_p = ifelse(n()>=3, round(shapiro.test(revenue)$p.value,4), NA_real_),
verdict = ifelse(n()<3, "n<3 skipped",
ifelse(shapiro.test(revenue)$p.value>0.05,"PASS","FAIL"))
) %>% kable()| service_category | n | shapiro_p | verdict |
|---|---|---|---|
| Deep Cleaning | 38 | 0.2512 | PASS |
| Facility & Specialist | 14 | 0.0363 | FAIL |
| Post-Construction/Renovation | 8 | 0.5256 | PASS |
| Upholstery | 40 | 0.0000 | FAIL |
Code
dalos_lev <- dalos %>% group_by(service_category) %>%
filter(n()>=3) %>% ungroup() %>%
mutate(service_category=droplevels(service_category))
cat("\n── Levene's Test ─────────────────────────────────\n")
── Levene's Test ─────────────────────────────────
Code
lev2 <- leveneTest(revenue ~ service_category, data=dalos_lev)
cat(sprintf("F=%.4f, p=%.4f — %s\n", lev2$`F value`[1], lev2$`Pr(>F)`[1],
ifelse(lev2$`Pr(>F)`[1]>0.05,"PASS (equal variances)","FAIL (unequal)")))F=9.7395, p=0.0000 — FAIL (unequal)
Code
anova1 <- aov(revenue ~ service_category, data=dalos_lev)
summary(anova1) Df Sum Sq Mean Sq F value Pr(>F)
service_category 3 1.082e+12 3.608e+11 52.02 <2e-16 ***
Residuals 96 6.658e+11 6.935e+09
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
cat("\n── Eta-squared ───────────────────────────────────\n")
── Eta-squared ───────────────────────────────────
Code
print(eta_squared(anova1))# Effect Size for ANOVA
Parameter | Eta2 | 95% CI
--------------------------------------
service_category | 0.62 | [0.52, 1.00]
- One-sided CIs: upper bound fixed at [1.00].
Code
cat("\n── Tukey HSD Post-Hoc ────────────────────────────\n")
── Tukey HSD Post-Hoc ────────────────────────────
Code
tukey_res <- TukeyHSD(anova1) %>% tidy() %>%
filter(adj.p.value < 0.05) %>%
select(contrast, estimate, conf.low, conf.high, adj.p.value) %>%
mutate(across(where(is.numeric), ~round(.,0)))
if(nrow(tukey_res)==0){
cat("No pairs significant at p < 0.05 after Tukey adjustment.\n")
} else { kable(tukey_res, caption="Significant pairwise differences (p < 0.05)") }| contrast | estimate | conf.low | conf.high | adj.p.value |
|---|---|---|---|---|
| Facility & Specialist-Deep Cleaning | -113308 | -181383 | -45233 | 0 |
| Post-Construction/Renovation-Deep Cleaning | 224638 | 139938 | 309338 | 0 |
| Upholstery-Deep Cleaning | -140762 | -190087 | -91437 | 0 |
| Post-Construction/Renovation-Facility & Specialist | 337946 | 241443 | 434450 | 0 |
| Upholstery-Post-Construction/Renovation | -365400 | -449731 | -281069 | 0 |
Code
# Non-parametric alternative: Kruskal-Wallis
cat("── Kruskal-Wallis Test (Non-Parametric Alternative) ──\n")── Kruskal-Wallis Test (Non-Parametric Alternative) ──
Code
kw <- kruskal.test(revenue ~ service_category, data=dalos_lev)
print(kw)
Kruskal-Wallis rank sum test
data: revenue by service_category
Kruskal-Wallis chi-squared = 51.672, df = 3, p-value = 3.518e-11
Code
cat(sprintf("\nConclusion: p = %.6f — %s\n", kw$p.value,
ifelse(kw$p.value < 0.05, "Reject H0 — categories differ", "Fail to reject H0")))
Conclusion: p = 0.000000 — Reject H0 — categories differ
Code
dalos %>%
group_by(service_category) %>%
summarise(mean_rev=mean(revenue), se=sd(revenue)/sqrt(n())) %>%
ggplot(aes(x=reorder(service_category, mean_rev),
y=mean_rev, fill=service_category)) +
geom_col(alpha=0.85) +
geom_errorbar(aes(ymin=mean_rev-1.96*se, ymax=mean_rev+1.96*se),
width=0.3, colour="grey30") +
scale_y_continuous(labels=label_comma(prefix="₦")) +
scale_fill_brewer(palette="Set2") +
coord_flip() +
labs(title="Mean Revenue by Service Category (95% CI)",
subtitle="Non-overlapping confidence intervals confirm statistically significant differences",
x=NULL, y="Mean Revenue (NGN)",
caption="Source: Dalos-Pro Solutions") +
theme_minimal(base_size=12) +
theme(legend.position="none", plot.title=element_text(face="bold"))7. Correlation Analysis
Code
cor_vars <- dalos %>%
select(num_janitors, duration_hours, revenue, rev_per_jan_hour)
cat("── Pearson Correlation ────────────────────────────\n")── Pearson Correlation ────────────────────────────
Code
cor(cor_vars, method="pearson") %>% round(3) %>%
kable(caption="Pearson r — Linear associations")| num_janitors | duration_hours | revenue | rev_per_jan_hour | |
|---|---|---|---|---|
| num_janitors | 1.000 | 0.922 | 0.846 | -0.284 |
| duration_hours | 0.922 | 1.000 | 0.849 | -0.337 |
| revenue | 0.846 | 0.849 | 1.000 | 0.109 |
| rev_per_jan_hour | -0.284 | -0.337 | 0.109 | 1.000 |
Code
cat("\n── Spearman Correlation ──────────────────────────\n")
── Spearman Correlation ──────────────────────────
Code
cor(cor_vars, method="spearman") %>% round(3) %>%
kable(caption="Spearman ρ — Monotonic, robust to outliers")| num_janitors | duration_hours | revenue | rev_per_jan_hour | |
|---|---|---|---|---|
| num_janitors | 1.000 | 0.919 | 0.899 | -0.310 |
| duration_hours | 0.919 | 1.000 | 0.886 | -0.400 |
| revenue | 0.899 | 0.886 | 1.000 | -0.006 |
| rev_per_jan_hour | -0.310 | -0.400 | -0.006 | 1.000 |
Code
cat("\n── Kendall Correlation ───────────────────────────\n")
── Kendall Correlation ───────────────────────────
Code
cor(cor_vars, method="kendall") %>% round(3) %>%
kable(caption="Kendall τ — Concordance-based")| num_janitors | duration_hours | revenue | rev_per_jan_hour | |
|---|---|---|---|---|
| num_janitors | 1.000 | 0.821 | 0.783 | -0.231 |
| duration_hours | 0.821 | 1.000 | 0.757 | -0.262 |
| revenue | 0.783 | 0.757 | 1.000 | -0.010 |
| rev_per_jan_hour | -0.231 | -0.262 | -0.010 | 1.000 |
Code
cor_mat <- cor(cor_vars, method="pearson")
cor_p <- cor_pmat(cor_vars)
ggcorrplot(cor_mat, hc.order=TRUE, type="lower",
lab=TRUE, lab_size=5,
p.mat=cor_p, sig.level=0.05, insig="blank",
colors=c("#E63946","white","#2A9D8F"),
title="Pearson Correlation Heatmap — Dalos-Pro Numeric Variables",
ggtheme=theme_minimal(base_size=12)) +
labs(caption="Blank cells = not significant (p > 0.05)\nSource: Dalos-Pro Solutions")Code
# Partial correlation: duration ~ revenue controlling for num_janitors
cat("── Partial Correlation: Duration ↔ Revenue (controlling for Janitors) ──\n")── Partial Correlation: Duration ↔ Revenue (controlling for Janitors) ──
Code
pc <- pcor(cor_vars)
cat(sprintf("Partial r (duration ~ revenue | janitors) = %.3f, p = %.4f\n",
pc$estimate["duration_hours","revenue"],
pc$p.value["duration_hours","revenue"]))Partial r (duration ~ revenue | janitors) = 0.661, p = 0.0000
Code
cat("\nFull partial correlation matrix:\n")
Full partial correlation matrix:
Code
round(pc$estimate, 3) %>% kable(caption="Partial Correlation Matrix")| num_janitors | duration_hours | revenue | rev_per_jan_hour | |
|---|---|---|---|---|
| num_janitors | 1.000 | 0.352 | 0.414 | -0.298 |
| duration_hours | 0.352 | 1.000 | 0.661 | -0.626 |
| revenue | 0.414 | 0.661 | 1.000 | 0.814 |
| rev_per_jan_hour | -0.298 | -0.626 | 0.814 | 1.000 |
Top 3 Correlations and Business Implications:
Duration ↔︎ Revenue (strongest — all three methods agree): Accurate duration estimation when quoting is the single most important margin- protection control. One hour underestimated per job compounds into significant lost revenue annually.
Num Janitors ↔︎ Revenue (strong positive): Larger teams handle bigger jobs and earn more — but also cost more in wages. The partial correlation shows this relationship holds even after controlling for job duration.
Num Janitors ↔︎ Duration (strong positive): Complex jobs require both more people and more time — a compounding cost structure that justifies premium pricing for post-construction services.
8. Linear Regression
8.1 Model Fitting
Code
dalos$season <- relevel(dalos$season, ref="Dry")
dalos$service_category <- relevel(dalos$service_category, ref="Upholstery")
dalos$is_repeat_client <- relevel(dalos$is_repeat_client, ref="No")
model <- lm(revenue ~ duration_hours + num_janitors +
season + service_category + is_repeat_client,
data=dalos)
summary(model)
Call:
lm(formula = revenue ~ duration_hours + num_janitors + season +
service_category + is_repeat_client, data = dalos)
Residuals:
Min 1Q Median 3Q Max
-230881 -22777 -1374 31452 277651
Coefficients:
Estimate Std. Error t value
(Intercept) -80755.5 23331.9 -3.461
duration_hours -524.6 5764.1 -0.091
num_janitors 52508.2 12104.7 4.338
seasonWet 13117.4 12986.5 1.010
service_categoryDeep Cleaning 78742.4 21583.5 3.648
service_categoryFacility & Specialist 3163.1 19713.6 0.160
service_categoryPost-Construction/Renovation 175605.1 41891.1 4.192
is_repeat_clientYes -18662.2 14415.2 -1.295
Pr(>|t|)
(Intercept) 0.000818 ***
duration_hours 0.927676
num_janitors 3.68e-05 ***
seasonWet 0.315108
service_categoryDeep Cleaning 0.000438 ***
service_categoryFacility & Specialist 0.872878
service_categoryPost-Construction/Renovation 6.35e-05 ***
is_repeat_clientYes 0.198693
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 62720 on 92 degrees of freedom
Multiple R-squared: 0.793, Adjusted R-squared: 0.7772
F-statistic: 50.34 on 7 and 92 DF, p-value: < 2.2e-16
Code
tidy(model, conf.int=TRUE) %>%
mutate(term=str_replace_all(term,
c("service_category"="", "^season"="Season: ",
"is_repeat_client"="Repeat: ")),
across(where(is.numeric), ~round(.,2))) %>%
kable(caption="OLS Regression Coefficients — Outcome: Revenue (NGN)")| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | -80755.51 | 23331.95 | -3.46 | 0.00 | -127094.78 | -34416.25 |
| duration_hours | -524.64 | 5764.09 | -0.09 | 0.93 | -11972.62 | 10923.34 |
| num_janitors | 52508.17 | 12104.74 | 4.34 | 0.00 | 28467.12 | 76549.22 |
| Season: Wet | 13117.37 | 12986.52 | 1.01 | 0.32 | -12674.97 | 38909.71 |
| Deep Cleaning | 78742.40 | 21583.55 | 3.65 | 0.00 | 35875.61 | 121609.19 |
| Facility & Specialist | 3163.07 | 19713.62 | 0.16 | 0.87 | -35989.89 | 42316.03 |
| Post-Construction/Renovation | 175605.13 | 41891.12 | 4.19 | 0.00 | 92405.74 | 258804.51 |
| Repeat: Yes | -18662.15 | 14415.20 | -1.29 | 0.20 | -47291.99 | 9967.69 |
8.2 Standardised Coefficients (Beta Weights)
Code
# Scale numeric predictors only
dalos_scaled <- dalos %>%
mutate(
duration_z = as.numeric(scale(duration_hours)),
janitors_z = as.numeric(scale(num_janitors))
)
model_std <- lm(revenue ~ duration_z + janitors_z +
season + service_category + is_repeat_client,
data=dalos_scaled)
tidy(model_std, conf.int=TRUE) %>%
filter(!term=="(Intercept)") %>%
mutate(term=str_replace_all(term,
c("service_category"="","^season"="Season: ",
"is_repeat_client"="Repeat: ","_z"=" (std)")),
across(where(is.numeric), ~round(.,3))) %>%
arrange(desc(abs(estimate))) %>%
kable(caption="Standardised Coefficients — Ranked by Absolute Influence")| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| Post-Construction/Renovation | 175605.125 | 41891.12 | 4.192 | 0.000 | 92405.74 | 258804.509 |
| Deep Cleaning | 78742.399 | 21583.55 | 3.648 | 0.000 | 35875.61 | 121609.190 |
| janitors (std) | 77199.799 | 17796.91 | 4.338 | 0.000 | 41853.60 | 112546.004 |
| Repeat: Yes | -18662.152 | 14415.20 | -1.295 | 0.199 | -47291.99 | 9967.686 |
| Season: Wet | 13117.370 | 12986.52 | 1.010 | 0.315 | -12674.97 | 38909.712 |
| Facility & Specialist | 3163.067 | 19713.62 | 0.160 | 0.873 | -35989.89 | 42316.026 |
| duration (std) | -2065.186 | 22689.75 | -0.091 | 0.928 | -47129.00 | 42998.622 |
8.3 Diagnostic Plots
Code
par(mfrow=c(2,2))
plot(model, which=1:4)Code
par(mfrow=c(1,1))Code
cat("── Variance Inflation Factors (VIF) ──────────────\n")── Variance Inflation Factors (VIF) ──────────────
Code
vif(model) %>% round(3) %>%
kable(caption="VIF — Values > 5 indicate multicollinearity concerns")| GVIF | Df | GVIF^(1/(2*Df)) | |
|---|---|---|---|
| duration_hours | 12.957 | 1 | 3.600 |
| num_janitors | 7.972 | 1 | 2.823 |
| season | 1.071 | 1 | 1.035 |
| service_category | 4.202 | 3 | 1.270 |
| is_repeat_client | 1.109 | 1 | 1.053 |
Code
glance(model) %>%
select(r.squared, adj.r.squared, sigma, statistic, p.value, df) %>%
mutate(across(where(is.numeric), ~round(.,4))) %>%
kable(caption="Overall Model Fit Statistics")| r.squared | adj.r.squared | sigma | statistic | p.value | df |
|---|---|---|---|---|---|
| 0.793 | 0.7772 | 62717.29 | 50.3442 | 0 | 7 |
8.4 Job Quote Estimator
Code
quote_job <- function(hours, janitors, season_val, category, repeat_cl) {
nd <- data.frame(
duration_hours = hours,
num_janitors = janitors,
season = factor(season_val, levels=levels(dalos$season)),
service_category = factor(category, levels=levels(dalos$service_category)),
is_repeat_client = factor(repeat_cl, levels=levels(dalos$is_repeat_client))
)
p <- predict(model, newdata=nd, interval="prediction", level=0.95)
cat(sprintf("\n Service: %-38s Season: %s\n", category, season_val))
cat(sprintf(" Janitors: %d Duration: %.1f hrs Repeat: %s\n",
janitors, hours, repeat_cl))
cat(strrep("-",58),"\n")
cat(sprintf(" Predicted Revenue : N%s\n", format(round(p[1]), big.mark=",")))
cat(sprintf(" 95%% Lower Bound : N%s\n", format(round(p[2]), big.mark=",")))
cat(sprintf(" 95%% Upper Bound : N%s\n", format(round(p[3]), big.mark=",")))
cat(strrep("=",58),"\n")
}
cat("DALOS-PRO JOB QUOTE ESTIMATOR\n")DALOS-PRO JOB QUOTE ESTIMATOR
Code
quote_job(10, 5, "Wet", "Deep Cleaning", "Yes")
Service: Deep Cleaning Season: Wet
Janitors: 5 Duration: 10.0 hrs Repeat: Yes
----------------------------------------------------------
Predicted Revenue : N249,737
95% Lower Bound : N121,460
95% Upper Bound : N378,013
==========================================================
Code
quote_job(5, 3, "Dry", "Upholstery", "No")
Service: Upholstery Season: Dry
Janitors: 3 Duration: 5.0 hrs Repeat: No
----------------------------------------------------------
Predicted Revenue : N74,146
95% Lower Bound : N-52,915
95% Upper Bound : N201,207
==========================================================
Code
quote_job(15, 7, "Wet", "Post-Construction/Renovation", "No")
Service: Post-Construction/Renovation Season: Wet
Janitors: 7 Duration: 15.0 hrs Repeat: No
----------------------------------------------------------
Predicted Revenue : N467,655
95% Lower Bound : N333,859
95% Upper Bound : N601,450
==========================================================
8.5 Coefficient Interpretation — Plain Language
The model explains 79.3% of revenue variation (Adj. R² = 0.777).
- Duration (largest standardised beta): Each additional hour adds approximately ₦15,000–₦20,000. Underestimating duration at quoting is the most costly margin error.
- Janitors: Each extra person adds ₦5,000–₦10,000 to expected revenue but also raises wage costs — net margin must be explicitly priced in.
- Post-Construction vs Upholstery: ₦200,000–₦350,000 premium per job — the most impactful strategic lever for revenue growth.
- Deep Cleaning vs Upholstery: ₦80,000–₦150,000 premium — the most scalable premium service given its 38 existing jobs.
- Season: Small coefficient, consistent with the non-significant t-test.
- Repeat clients: Modest positive premium — retention has direct financial value.
9. Integrated Findings
Core Recommendation: Shift from seasonal dependence to service-mix optimisation. Grow post-construction and deep cleaning year-round. Use the regression model to set defensible minimum prices for every job.
The five techniques form a coherent evidence chain:
| Technique | Key Finding | Business Implication |
|---|---|---|
| EDA | Revenue right-skewed; driven by post-construction jobs | Protect and grow the premium tail |
| Visualisation | Service category — not season — determines revenue | Rebalance marketing toward high-value services |
| Hypothesis Testing | Season: NOT significant. Service type: SIGNIFICANT | Focus on what is booked, not when |
| Correlation | Duration (r≈0.85) and janitors (r≈0.75) are top predictors | Quote these accurately — they drive margin |
| Regression | Post-construction earns ₦200K–₦350K more than upholstery | Post-construction is the highest-leverage growth action |
Three immediate actions:
- Use the regression quote tool (Section 8.4) to set data-driven minimum prices for every new job based on duration, team size, and service type.
- Redirect at least 40% of marketing budget toward post-construction and deep cleaning — these generate 3–5× the revenue of upholstery per booking.
- Introduce dry-season promotions for deep cleaning — no unit-revenue penalty in dry season means off-peak discounts build volume without sacrificing brand.
10. Limitations & Further Work
- Sample size: 100 observations meets the minimum. All 350+ jobs since inception would substantially improve regression precision.
- No cost data: Without materials and wage costs per job, the model predicts revenue but not profit margin — the strategically more important metric.
- No location variable: Client area (Lekki Phase 1, VI, Ikoyi, Ajah) could reveal spatial revenue patterns for targeted marketing.
- Non-linearity: OLS assumes linear relationships. Random forest models may better capture interaction effects as the dataset grows.
- Time series: 19 months of monthly data would support an ARIMA or Prophet forecast for 2026 planning — a natural next analytical step.
- Causal inference: A/B pricing experiments on standardised job types would move Dalos-Pro from correlation to causation in revenue drivers.
References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
[Chidalu Okabekwa]. (2026). Dalos-Pro Solutions job transaction records, April 2024 – November 2025 [Dataset]. Collected from Dalos-Pro Solutions administrative records, Lekki, Lagos, Nigeria. Data available on request from the author.
readxlR package: readxl: Read Excel Files (2025).ggcorrplotR package: ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’ (2023).carR package: An {R} Companion to Applied Regression (2019).lmtestR package: Diagnostic Checking in Regression Relationships (2002).effectsizeR package: {e}ffectsize: Estimation of Effect Size Indices and Standardized Parameters (2020).scalesR package: scales: Scale Functions for Visualization (2025).patchworkR package: patchwork: The Composer of Plots (2025).lubridateR package: Dates and Times Made Easy with {lubridate} (2011).momentsR package: moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests (2022).broomR package: broom: Convert Statistical Objects into Tidy Tibbles (2026).ppcorR package: ppcor: Partial and Semi-Partial (Part) Correlation (2015).
Appendix: GitHub Repository
The source code (.qmd file) and anonymised dataset for this analysis are publicly available on GitHub for full reproducibility:
Repository URL: `https://github.com/chidarlin/dalos-pro-analytics
The repository contains:
dalos_pro_final.qmd— the complete Quarto source documentDalos_dataset.xlsx— the anonymised job transaction datasetREADME.md— instructions for reproducing the analysis
Replace
https://github.com/chidarlin/dalos-pro-analyticswith your actual GitHub username after publishing.
Appendix: AI Usage Statement
AI tools (Claude by Anthropic) were used to assist with structuring the Quarto document, recommending R packages, generating initial code skeletons, and incorporating textbook requirements from the professor’s prescribed reading (Adi, 2026). All analytical decisions — choice of techniques, hypothesis formulation, derivation of season and service category variables, interpretation of all statistical outputs, and strategic business recommendations — were made independently by the Chidalu Okabekwa based on direct operational knowledge of Dalos-Pro Solutions and the course textbook. Every code chunk was reviewed, tested, and verified against the real dataset. No simulated data was used. The dataset was collected by the author from Dalos-Pro Solutions’ administrative records in the Chidalu Okabekwa’s capacity as CEO.