Dalos-Pro Solutions is a professional cleaning and facility maintenance company based in Lekki, Lagos, Nigeria, founded in October 2023. This report applies five analytical techniques — Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression — to a primary dataset of 100 completed job transactions recorded between April 2024 and November 2025. Variables captured per job include date, service type, number of janitors deployed, job duration, revenue charged, and repeat-client status. Two additional variables were derived: a season indicator (Wet: September–January; Dry: February–August) and a service category grouping (four categories). Key findings show that post-construction and renovation jobs generate the highest mean revenue (₦449,375), that revenue differences across service categories are statistically significant (one-way ANOVA, p < 0.05), and that job duration and team size are the strongest operational predictors of revenue. Contrary to the perceived seasonal narrative, the wet/dry season revenue difference was not statistically significant in this dataset — a finding that redirects strategic focus from seasonal timing to service-mix optimisation. The report recommends prioritising post-construction and deep cleaning services and improving quoting accuracy on duration and team size to sustain and grow profitability.
2. Professional Disclosure
Job Title: Chief Executive Officer (CEO)
Organisation: Dalos-Pro Solutions — Professional Cleaning & Facility Maintenance Company, Lekki, Lagos, Nigeria.
Exploratory Data Analysis (EDA): As CEO, I regularly review job records to monitor revenue performance across service lines. Formal EDA extends this routine review by quantifying distributions, detecting outliers systematically, and identifying data quality issues that manual inspection misses. EDA provides the evidentiary foundation for every subsequent analytical decision in this report — it is the first step a responsible data scientist takes before drawing conclusions from any dataset.
Data Visualisation: Communicating business performance to staff, potential partners, and investors requires clear and accurate visual storytelling. Charts of revenue distributions by service type, seasonal job volumes, and operational scatter plots directly inform Dalos-Pro’s marketing calendar and capacity planning. The grammar-of-graphics approach (ggplot2) ensures that every design choice — chart type, axis, colour — is deliberate and aligned with the message being communicated.
Hypothesis Testing: Dalos-Pro operates on a widely held belief that the wet season (September–January) drives higher revenue than the dry season. Formal hypothesis testing — using a Welch two-sample t-test and one-way ANOVA — determines whether these observed differences are statistically significant or within the range of random variation. This transforms intuition into an evidence-based foundation for pricing and budgeting decisions.
Correlation Analysis: Understanding which operational inputs (team size, duration) are most strongly associated with revenue enables smarter resource allocation and pricing. Pearson and Spearman correlation matrices identify the strongest pairwise relationships in the data, highlight multicollinearity risks before regression, and reveal whether associations hold under different distributional assumptions.
Linear Regression (OLS): A multiple regression model that predicts job revenue from observable operational inputs — service type, season, duration, team size, repeat-client status — provides a practical, data-driven quoting tool. This directly addresses Dalos-Pro’s core business challenge: finding the right balance between competitive pricing and premium quality. Each regression coefficient translates directly into a concrete pricing or resource-allocation decision.
3. Data Collection & Sampling
Source: Primary data extracted from Dalos-Pro Solutions’ internal job records, including invoice logs, WhatsApp booking confirmations, and operational tracking sheets maintained by the administrative team.
Collection Method: Manual transcription and collation of job-level records into a structured Microsoft Excel spreadsheet by the author in their capacity as CEO. Each row represents one completed, invoiced cleaning job.
Sampling Frame: All completed and invoiced jobs from April 2024 to November 2025 for which full records were available across all six variables.
Sample Size: 100 job records — meets the minimum requirement of 100 observations with at least 5 variables.
Time Period Covered: 9 April 2024 – 15 November 2025 (approximately 19 months).
Variables Collected:
| Variable | Type | Description |
|---|---|---|
| Job_date | Date | Date the cleaning job was completed |
| Service_type | Categorical | Specific service(s) rendered (15 raw types) |
| Num_janitors | Numeric | Number of janitors deployed on the job |
| Job_duration_hours | Numeric | Total hours spent on the job |
| Revenue | Numeric | Total amount charged to client (NGN) |
| Is_repeat_client | Categorical | Whether client has previously used Dalos-Pro |
Derived Variables:
| Variable | Type | Description |
|---|---|---|
| season | Categorical | Wet (Sep–Jan) or Dry (Feb–Aug), derived from date |
| service_category | Categorical | 15 raw types grouped into 4 analytical categories |
| rev_per_jan_hour | Numeric | Revenue / (Num_janitors × Job_duration_hours); revenue per janitor-hour (efficiency metric) |
Sampling Justification: A census approach was adopted: every completed job in the period with full records across all six variables (approximately 100 jobs) was included. Within that frame there is no sampling error, since no subset was drawn. Jobs with incomplete records are excluded, however, so the findings generalise to fully documented jobs rather than to every job the company has performed. The 100 observations meet the assessment minimum.
Ethical Notes: No client personal data is included. Jobs are identified by date and service type only. Data is proprietary to Dalos-Pro Solutions and is available on request from the author for academic verification. Revenue figures are commercially sensitive; only aggregate statistics and model outputs are published in this document.
4. Data Description & EDA
4.1 Load Libraries
Code
# Run once to install all required packages:
# install.packages(c("tidyverse", "readxl", "ggplot2", "ggcorrplot",
#                    "car", "lmtest", "effectsize", "knitr", "scales",
#                    "patchwork", "lubridate", "moments", "broom"))
library(tidyverse)
library(readxl)
library(ggplot2)
library(ggcorrplot)
library(car)
library(lmtest)
library(effectsize)
library(knitr)
library(scales)
library(patchwork)
library(lubridate)
library(moments)
library(broom)
4.2 Load & Prepare Data
Code
# Ensure Data/Dalos_dataset.xlsx exists relative to this .qmd file
dalos_raw <- read_excel("Data/Dalos_dataset.xlsx")

dalos <- dalos_raw %>%
  rename(
    job_date         = Job_date,
    service_type     = Service_type,
    num_janitors     = Num_janitors,
    duration_hours   = Job_duration_hours,
    revenue          = Revenue,
    is_repeat_client = Is_repeat_client
  ) %>%
  mutate(
    job_date = as.Date(job_date),
    # Season derived from month (Wet: Sep-Jan; Dry: Feb-Aug)
    season = factor(
      ifelse(month(job_date) %in% c(9, 10, 11, 12, 1), "Wet", "Dry"),
      levels = c("Dry", "Wet")
    ),
    # Group 15 raw service types into 4 analytical categories
    service_category = factor(case_when(
      str_detect(service_type, "Post-Construction|Renovation") ~ "Post-Construction/Renovation",
      str_detect(service_type, "Deep Cleaning") ~ "Deep Cleaning",
      str_detect(service_type, "Upholstery") & !str_detect(service_type, "Deep Cleaning") ~ "Upholstery",
      TRUE ~ "Facility & Specialist"
    )),
    is_repeat_client = factor(is_repeat_client, levels = c("No", "Yes")),
    # Derived efficiency metric: revenue per janitor-hour
    rev_per_jan_hour = round(revenue / (num_janitors * duration_hours), 0)
  )

cat("Dataset ready:", nrow(dalos), "rows x", ncol(dalos), "columns\n")
Dataset ready: 100 rows x 9 columns
Code
cat("Date range :", format(min(dalos$job_date)),"to", format(max(dalos$job_date)), "\n")
Date range : 2024-04-09 to 2025-11-15
Code
cat("Missing values:", sum(is.na(dalos)), "\n")
Missing values: 0
4.3 Data Structure
Code
glimpse(dalos)
Rows: 100
Columns: 9
$ job_date <date> 2024-04-09, 2024-04-10, 2024-04-17, 2024-05-10, 2024…
$ service_type <chr> "Deep Cleaning + Maintenance", "Upholstery", "Upholst…
$ num_janitors <dbl> 4, 3, 3, 6, 5, 6, 7, 6, 5, 3, 5, 3, 3, 4, 5, 5, 3, 3,…
$ duration_hours <dbl> 8.0, 5.0, 5.0, 10.0, 8.5, 11.0, 17.0, 13.0, 9.5, 4.5,…
$ revenue <dbl> 315000, 75000, 75000, 350000, 320000, 340000, 460000,…
$ is_repeat_client <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, Y…
$ season <fct> Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry, Dry…
$ service_category <fct> Deep Cleaning, Upholstery, Upholstery, Deep Cleaning,…
$ rev_per_jan_hour <dbl> 9844, 5000, 5000, 5833, 7529, 5152, 3866, 5641, 6737,…
Data Quality Finding 1 — No missing values: All 100 job records are complete across every variable, reflecting consistent administrative record-keeping.
Data Quality Finding 2 — Revenue right-skew: Revenue is positively skewed, driven by a small number of high-value post-construction jobs (max ₦550,000) pulling the mean (₦170,540) above the median (₦120,000). This is operationally expected and is noted in the regression diagnostics.
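The mean-above-median pattern described in Data Quality Finding 2 can be checked numerically. The sketch below is illustrative only: `toy_revenue` is a hypothetical vector, not the Dalos-Pro records, and `skewness_moment()` is a helper written for this example. It computes a moment-based skewness coefficient in base R; on the real data, `moments::skewness(dalos$revenue)` should give a comparable summary.

```r
# Hypothetical revenues in NGN (illustrative only, not the Dalos-Pro records)
toy_revenue <- c(60000, 70000, 75000, 90000, 120000,
                 150000, 220000, 320000, 460000, 550000)

# Moment-based sample skewness: third central moment / (second central moment)^1.5
skewness_moment <- function(x) {
  centred <- x - mean(x)
  m2 <- mean(centred^2)
  m3 <- mean(centred^3)
  m3 / m2^1.5
}

toy_mean   <- mean(toy_revenue)
toy_median <- median(toy_revenue)
toy_skew   <- skewness_moment(toy_revenue)

cat("Mean    :", toy_mean, "\n")
cat("Median  :", toy_median, "\n")
cat("Skewness:", round(toy_skew, 2), "\n")
```

A positive coefficient together with a mean above the median is the signature of a right-skewed distribution, which is the pattern reported for revenue.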
5. Data Visualisation
Five plots, one story: at Dalos-Pro, service category, not season, is the primary driver of revenue. Post-construction commands roughly five times the revenue of upholstery per job (mean ₦449,375 vs ₦83,975). Duration is the clearest operational predictor of earnings.
Code
ggplot(dalos,
       aes(x = reorder(service_category, revenue, median),
           y = revenue,
           fill = service_category)) +
  geom_boxplot(alpha = 0.85,
               outlier.colour = "#E63946",
               outlier.shape = 16,
               outlier.size = 2.5) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_fill_brewer(palette = "Set2") +
  coord_flip() +
  labs(
    title = "Plot 1 — Revenue Distribution by Service Category",
    # Per-job revenue is highest here; total revenue is not (only 8 bookings)
    subtitle = "Post-Construction/Renovation earns the most per job despite fewest bookings",
    x = NULL, y = "Revenue (NGN)",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))
Code
dalos %>%
  mutate(month = floor_date(job_date, "month")) %>%
  group_by(month, season) %>%
  summarise(total_rev = sum(revenue), .groups = "drop") %>%
  ggplot(aes(x = month, y = total_rev, fill = season)) +
  geom_col(alpha = 0.85) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_fill_manual(values = c(Dry = "#F4A261", Wet = "#2A9D8F")) +
  labs(
    title = "Plot 2 — Monthly Total Revenue by Season",
    subtitle = "Revenue is spread across both seasons with no dominant seasonal peak",
    x = "Month", y = "Total Revenue (NGN)", fill = "Season",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))
Code
ggplot(dalos,
       aes(x = is_repeat_client, y = revenue, fill = is_repeat_client)) +
  geom_violin(alpha = 0.70, trim = FALSE) +
  geom_boxplot(width = 0.12, fill = "white", outlier.shape = NA) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_fill_manual(values = c(No = "#E63946", Yes = "#457B9D")) +
  labs(
    title = "Plot 3 — Revenue: Repeat vs. New Clients",
    subtitle = "Repeat clients show a broader revenue spread, indicating more complex jobs",
    x = "Repeat Client", y = "Revenue (NGN)",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))
Code
dalos %>%
  count(service_category, season) %>%
  ggplot(aes(x = season, y = service_category, fill = n)) +
  geom_tile(colour = "white", linewidth = 1.2) +
  geom_text(aes(label = n), fontface = "bold", size = 5) +
  scale_fill_gradient(low = "#D8F3DC", high = "#1B4332") +
  labs(
    title = "Plot 4 — Job Volume Heatmap: Service Category x Season",
    subtitle = "Upholstery and Deep Cleaning dominate volumes in both seasons",
    x = "Season", y = "Service Category", fill = "Job Count",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))
Code
ggplot(dalos,
       aes(x = duration_hours, y = revenue, colour = service_category)) +
  geom_point(alpha = 0.70, size = 2.5) +
  geom_smooth(method = "lm", se = FALSE, linewidth = 0.9) +
  scale_y_continuous(labels = label_comma(prefix = "₦")) +
  scale_colour_brewer(palette = "Set1") +
  labs(
    title = "Plot 5 — Job Duration vs. Revenue by Service Category",
    subtitle = "Longer jobs earn more across all categories — duration is the key revenue lever",
    x = "Duration (Hours)", y = "Revenue (NGN)", colour = "Service Category",
    caption = "Source: Dalos-Pro Solutions Job Records (Apr 2024 – Nov 2025)"
  ) +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold"))
6. Hypothesis Testing
6.1 Hypothesis 1 — Wet vs. Dry Season Revenue
Business context: Dalos-Pro’s planning calendar assumes the wet season generates higher per-job revenue. This test formally evaluates that assumption.
H₀: Mean revenue (Wet) = Mean revenue (Dry)
H₁: Mean revenue (Wet) ≠ Mean revenue (Dry)
Code
dalos %>%
  group_by(season) %>%
  summarise(
    n      = n(),
    Mean   = round(mean(revenue), 0),
    Median = round(median(revenue), 0),
    SD     = round(sd(revenue), 0)
  ) %>%
  kable(caption = "Revenue by Season (NGN)")
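Code
# Split revenue by season for the assumption checks
# (assumed definitions: the rendered output shows only the names wet_rev and dry_rev)
wet_rev <- dalos$revenue[dalos$season == "Wet"]
dry_rev <- dalos$revenue[dalos$season == "Dry"]
cat("Wet: \n"); print(shapiro.test(wet_rev))
Wet:
Shapiro-Wilk normality test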
data: wet_rev
W = 0.81993, p-value = 3.139e-06
Code
cat("Dry: \n"); print(shapiro.test(dry_rev))
Dry:
Shapiro-Wilk normality test
data: dry_rev
W = 0.89178, p-value = 0.0002261
Code
cat("\n── Levene's Test (Variance Equality) ────────────\n")
── Levene's Test (Variance Equality) ────────────
Code
print(leveneTest(revenue ~ season, data = dalos))
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 0.0777 0.7811
98
Code
t1 <- t.test(revenue ~ season, data = dalos, var.equal = FALSE)
print(t1)
Welch Two Sample t-test
data: revenue by season
t = 0.045486, df = 94.637, p-value = 0.9638
alternative hypothesis: true difference in means between group Dry and group Wet is not equal to 0
95 percent confidence interval:
-51981.93 54419.71
sample estimates:
mean in group Dry mean in group Wet
171137.3 169918.4
Code
cat("\n── Cohen's d (Effect Size) ───────────────────────\n")
── Cohen's d (Effect Size) ───────────────────────
Code
print(cohens_d(revenue ~ season, data = dalos))
Cohen's d | 95% CI
-------------------------
9.13e-03 | [-0.38, 0.40]
- Estimated using pooled SD.
Result: p = 0.9638 — fail to reject H₀.
Business interpretation: The revenue difference between wet and dry seasons is not statistically significant. Mean revenue per job is nearly identical: ₦171,137 (Dry) vs ₦169,918 (Wet), a gap of about ₦1,200 with a 95% confidence interval of roughly −₦52,000 to +₦54,400 and a negligible effect size (Cohen's d ≈ 0.01). The perceived wet-season advantage is therefore more plausibly a matter of job volume than of revenue per job, although volume is not formally tested here. This finding redirects strategic focus from seasonal timing to service mix: what type of job is booked matters far more than when it is booked.
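Because the Shapiro-Wilk tests above reject normality in both seasons, a rank-based test is a reasonable complement to the Welch t-test. The following is a minimal sketch on hypothetical data (the `toy` data frame and its values are invented for illustration and are not the Dalos-Pro records); it is not part of the original analysis.

```r
# Hypothetical per-job revenues (NGN) for two seasons; illustrative values only
toy <- data.frame(
  season  = factor(rep(c("Dry", "Wet"), each = 8), levels = c("Dry", "Wet")),
  revenue = c(70000, 75000, 90000, 120000, 150000, 220000, 340000, 460000,
              65000, 80000, 95000, 110000, 160000, 210000, 350000, 450000)
)

# Welch t-test: compares means without assuming equal variances
welch <- t.test(revenue ~ season, data = toy, var.equal = FALSE)

# Wilcoxon rank-sum (Mann-Whitney) test: works on ranks, so it does not
# rely on normality; exact = FALSE requests the normal approximation
rank_sum <- wilcox.test(revenue ~ season, data = toy, exact = FALSE)

cat("Welch p-value   :", round(welch$p.value, 3), "\n")
cat("Wilcoxon p-value:", round(rank_sum$p.value, 3), "\n")
```

If both tests agree on the real data, the non-significant seasonal result does not depend on the normality assumption.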
6.2 Hypothesis 2 — Revenue Across Service Categories
Business context: Do the four service categories earn meaningfully different revenues, justifying differentiated pricing and marketing investment?
H₀: Mean revenue is equal across all four service categories
H₁: At least one category has a significantly different mean revenue
Code
dalos %>%
  group_by(service_category) %>%
  summarise(
    n      = n(),
    Mean   = round(mean(revenue), 0),
    Median = round(median(revenue), 0),
    SD     = round(sd(revenue), 0)
  ) %>%
  arrange(desc(Mean)) %>%
  kable(caption = "Revenue by Service Category (NGN)")
Revenue by Service Category (NGN)

| service_category | n | Mean | Median | SD |
|---|---|---|---|---|
| Post-Construction/Renovation | 8 | 449375 | 450000 | 64611 |
| Deep Cleaning | 38 | 224737 | 220000 | 112039 |
| Facility & Specialist | 14 | 111429 | 97500 | 78310 |
| Upholstery | 40 | 83975 | 70000 | 48677 |
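The gaps and multiples quoted in this report follow directly from the category means in the summary above. As a small worked check, the sketch below copies those four means and expresses each relative to Upholstery; the variable names are chosen for this example only.

```r
# Category means (NGN) copied from the summary table above
category_mean <- c(
  "Post-Construction/Renovation" = 449375,
  "Deep Cleaning"                = 224737,
  "Facility & Specialist"        = 111429,
  "Upholstery"                   = 83975
)

# Gap and ratio of each category relative to the Upholstery baseline
baseline   <- category_mean[["Upholstery"]]
gap_ngn    <- category_mean - baseline
ratio_to_u <- round(category_mean / baseline, 2)

print(data.frame(mean_ngn = category_mean, gap_ngn = gap_ngn, ratio = ratio_to_u))
```

On these figures, post-construction averages about 5.35 times the upholstery mean (a gap of ₦365,400) and deep cleaning about 2.68 times (a gap of ₦140,762).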
Code
cat("── Shapiro-Wilk per Category ─────────────────────\n")
── Shapiro-Wilk per Category ─────────────────────
Code
# Filter groups with n >= 3 for Levene's and ANOVA
dalos_lev <- dalos %>%
  group_by(service_category) %>%
  filter(n() >= 3) %>%
  ungroup() %>%
  mutate(service_category = droplevels(service_category))
cat("\n── Levene's Test ─────────────────────────────────\n")
── Levene's Test ─────────────────────────────────
Code
print(leveneTest(revenue ~ service_category, data = dalos_lev))
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 3 9.7395 1.136e-05 ***
96
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
anova1 <- aov(revenue ~ service_category, data = dalos_lev)
summary(anova1)
Df Sum Sq Mean Sq F value Pr(>F)
service_category 3 1.082e+12 3.608e+11 52.02 <2e-16 ***
Residuals 96 6.658e+11 6.935e+09
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
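Result: F(3, 96) = 52.02, p < 2e-16; reject H₀.
Business interpretation: Revenue differences across service categories are statistically significant. Post-construction jobs earn approximately ₦365,000 more per booking than upholstery, and deep cleaning approximately ₦140,000 more. Levene's test rejects equal variances (p < 0.001), so the classical F-test should be read with some caution, although an effect this large is unlikely to hinge on that assumption. Action: shift marketing and capacity allocation toward post-construction and deep cleaning, which are not just more valuable on average but significantly so.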
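Levene's test above indicates unequal variances across categories, which the classical one-way ANOVA assumes away. One common robustness check is Welch's ANOVA, available in base R as `oneway.test()` with `var.equal = FALSE`. The sketch below uses a hypothetical three-group `toy` data frame (invented values, not the Dalos-Pro data) and is offered as an illustration rather than as the analysis actually run.

```r
# Hypothetical revenues (NGN) for three categories with unequal spread; illustrative only
toy <- data.frame(
  service_category = factor(rep(c("Upholstery", "Deep Cleaning", "Post-Construction"),
                                each = 6)),
  revenue = c( 60000,  70000,  75000,  80000,  90000, 100000,
              120000, 180000, 220000, 260000, 320000, 400000,
              380000, 420000, 450000, 460000, 480000, 520000)
)

# Classical one-way ANOVA: assumes equal variances across groups
classic <- oneway.test(revenue ~ service_category, data = toy, var.equal = TRUE)

# Welch's ANOVA: relaxes the equal-variance assumption
welch <- oneway.test(revenue ~ service_category, data = toy, var.equal = FALSE)

cat("Classic ANOVA p-value:", signif(classic$p.value, 3), "\n")
cat("Welch ANOVA p-value  :", signif(welch$p.value, 3), "\n")
```

On the real data the same call would be `oneway.test(revenue ~ service_category, data = dalos_lev, var.equal = FALSE)`.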
7. Correlation Analysis
Duration ↔ Revenue (strongest): Each additional hour of work is associated with meaningfully higher revenue. Accurate time estimation when quoting is the primary margin-protection lever for Dalos-Pro.
Num Janitors ↔ Revenue (strong): Larger teams handle bigger, higher-value jobs. Each extra janitor raises both revenue and wage cost, so the net margin must be explicitly priced into every multi-person quote.
Num Janitors ↔ Duration (strong): Bigger jobs are both longer and more heavily staffed, which compounds cost. Post-construction jobs drive both variables up together, which justifies their premium pricing.
Correlation vs causation: These reflect job complexity structure, not direct cause-and-effect. The regression below isolates each variable’s independent contribution to revenue.
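For readers who want to reproduce this kind of comparison, the sketch below computes Pearson and Spearman matrices for the three numeric variables using base R's `cor()` and `cor.test()`. The `toy` data frame holds hypothetical values chosen for illustration; it is not the Dalos-Pro dataset, and the coefficients it produces are not the report's results.

```r
# Hypothetical job records; illustrative values only, not the Dalos-Pro data
toy <- data.frame(
  num_janitors   = c(3, 3, 4, 5, 5, 6, 6, 7),
  duration_hours = c(4.5, 5, 8, 8.5, 9.5, 11, 13, 17),
  revenue        = c(70000, 75000, 200000, 320000, 300000, 340000, 350000, 460000)
)

# Pearson measures linear association; Spearman measures monotonic (rank) association
pearson  <- cor(toy, method = "pearson")
spearman <- cor(toy, method = "spearman")

print(round(pearson, 2))
print(round(spearman, 2))

# Significance test for a single pair of variables
dur_rev <- cor.test(toy$duration_hours, toy$revenue, method = "pearson")
cat("Duration vs revenue: r =", round(dur_rev$estimate, 2),
    ", p =", signif(dur_rev$p.value, 2), "\n")
```

A large gap between the two matrices for the same pair would suggest that outliers or non-linearity are influencing the Pearson coefficient.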
8. Linear Regression
Theory recap: Ordinary Least Squares (OLS) estimates the linear relationship between a continuous outcome (revenue) and a set of predictors. Each coefficient β represents the expected change in revenue for a one-unit increase in a predictor, holding all others constant.
Business justification: A fitted regression model directly solves Dalos-Pro’s quoting problem — given the planned inputs for any new job, the model generates a predicted revenue with a 95% prediction interval, replacing intuition with a data-driven price floor.
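The quoting idea described above can be sketched with `lm()` and `predict(..., interval = "prediction")`. This is a simplified illustration under stated assumptions: the `toy` data are hypothetical, only two predictors are used (the report's model also includes service category, season, and repeat-client status), and `new_job` is an invented planned job.

```r
# Hypothetical job records; illustrative values only, not the Dalos-Pro data
toy <- data.frame(
  duration_hours = c(4.5, 5, 5, 8, 8.5, 9.5, 10, 11, 13, 17),
  num_janitors   = c(3, 3, 4, 4, 5, 5, 6, 6, 6, 7),
  revenue        = c(70000, 75000, 90000, 200000, 320000,
                     300000, 350000, 340000, 400000, 460000)
)

# OLS fit: revenue as a linear function of duration and team size
fit <- lm(revenue ~ duration_hours + num_janitors, data = toy)

# A planned job to be quoted (hypothetical inputs)
new_job <- data.frame(duration_hours = 9, num_janitors = 5)

# Point prediction with a 95% prediction interval for one new job
quote_est <- predict(fit, newdata = new_job, interval = "prediction", level = 0.95)
print(round(quote_est))
```

The `lwr` column is the candidate price floor; a prediction interval is used rather than a confidence interval because the quote concerns a single future job, not the average job.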
The model explains 79.3% of revenue variation (R² = 0.793; adjusted R² = 0.777).
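The two goodness-of-fit figures above are consistent with each other if 79.3% is the unadjusted R². The short check below applies the standard adjustment formula; the predictor count `p = 7` is an assumption inferred from the model description (three service-category dummies plus season, duration, team size, and repeat-client status), not a value printed in the report.

```r
n  <- 100    # observations, as reported
p  <- 7      # ASSUMED: 3 category dummies + season + duration + janitors + repeat client
r2 <- 0.793  # unadjusted R-squared implied by "79.3% of revenue variation"

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
cat("Adjusted R-squared:", round(adj_r2, 3), "\n")
```

Under that assumption the formula gives 0.777, matching the adjusted figure quoted.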
Each additional hour: Adds approximately ₦15,000–₦20,000 to expected revenue. Underestimating job duration at the quoting stage is the single most costly operational error Dalos-Pro can make.
Each additional janitor: Adds approximately ₦5,000–₦10,000 to expected revenue. Each extra janitor also incurs wage cost — the net margin contribution must be explicitly included in every quote.
Post-Construction vs Upholstery: Post-construction jobs earn approximately ₦200,000–₦350,000 more than equivalent upholstery jobs. Growing this service line is the highest-leverage revenue action available.
Deep Cleaning vs Upholstery: Deep cleaning earns approximately ₦80,000–₦150,000 more per job. With 38 jobs already in this category, it is Dalos-Pro’s most scalable premium service.
Season: After controlling for service type and duration, the seasonal coefficient is small, consistent with the non-significant t-test result.
Repeat clients carry a modest positive revenue premium, confirming that retention has direct financial value.
9. Integrated Findings
Core Recommendation: Refocus strategy from seasonal timing to service-mix optimisation. Grow post-construction and deep cleaning capacity year-round. Use the regression model to set defensible minimum prices for every job.
The five techniques form a coherent evidence chain:
EDA revealed that revenue is right-skewed — most jobs are low-ticket upholstery bookings, while a small number of post-construction jobs drive disproportionate revenue. Protecting and growing the premium tail is the highest-leverage action.
Visualisation confirmed that service category, not season, determines revenue per job. Post-construction earns about 5.4 times as much per booking as upholstery (mean ₦449,375 vs ₦83,975). Duration is the clearest within-category predictor.
Hypothesis Testing produced a key counterintuitive finding: mean revenue per job does not differ significantly between wet and dry seasons (Welch t-test, p = 0.96; Cohen's d ≈ 0.01). Seasonal volume differences may exist, but unit revenue is consistent year-round. Service category differences are highly significant (one-way ANOVA, F(3, 96) = 52.02, p < 0.001), confirming that what is booked matters far more than when.
Correlation confirmed duration (strongest) and team size as the two primary operational predictors of revenue. These are the inputs that must be quoted accurately to protect margins.
Regression quantified every factor’s NGN contribution and produced a practical quoting tool — a direct solution to Dalos-Pro’s core pricing challenge.
Three immediate actions:
Use the regression quote tool (Section 8.3) to set data-driven minimum prices for every new job based on duration, team size, and service type.
Redirect at least 40% of marketing budget toward post-construction and deep cleaning, which generate roughly 5.4× and 2.7× the mean revenue of upholstery per booking respectively.
Introduce dry-season promotions for deep cleaning specifically; the data shows no unit-revenue penalty in the dry season, so off-peak discounts can build volume without compromising the premium brand.
10. Limitations & Further Work
Sample size: A sample of 100 observations meets the minimum but limits subgroup precision; the post-construction category, for example, contains only 8 jobs. Including all 350+ jobs since inception would substantially improve regression estimates.
No cost data: Without materials and wage costs per job, the model predicts revenue but not profit. Recording input costs per job is the most valuable data-collection improvement Dalos-Pro can make next.
No location variable: Adding client area (Lekki Phase 1, VI, Ikoyi, Ajah) could reveal spatial revenue patterns for geographic marketing targeting.
Non-linearity: OLS assumes linear relationships. Random forest or gradient boosting models may better capture interaction effects between service type, duration, and team size as the dataset grows.
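Time series: The 19 months of data can be aggregated into a monthly revenue series, but this is short for seasonal modelling. An ARIMA or Prophet forecast for 2026 planning is a natural next analytical step and would become more reliable as further months are recorded.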
References
Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048
R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4
[Your Full Name]. (2026). Dalos-Pro Solutions job transaction records, April 2024 – November 2025 [Dataset]. Collected from Dalos-Pro Solutions administrative records, Lekki, Lagos, Nigeria. Data available on request from the author.
readxl R package: readxl: Read Excel Files (2025).
ggcorrplot R package: ggcorrplot: Visualization of a Correlation Matrix using ‘ggplot2’ (2023).
car R package: An R Companion to Applied Regression (2019).
lmtest R package: Diagnostic Checking in Regression Relationships (2002).
effectsize R package: effectsize: Estimation of Effect Size Indices and Standardized Parameters (2020).
scales R package: scales: Scale Functions for Visualization (2025).
patchwork R package: patchwork: The Composer of Plots (2025).
lubridate R package: Dates and Times Made Easy with lubridate (2011).
moments R package: moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests (2022).
broom R package: broom: Convert Statistical Objects into Tidy Tibbles (2026).
Appendix: AI Usage Statement
AI tools (Claude by Anthropic) were used to assist with structuring the Quarto document, recommending R packages, and generating initial code skeletons for data loading, visualisation, and modelling. All analytical decisions (choice of techniques, hypothesis formulation, derivation of the season and service category variables, interpretation of all statistical outputs, and strategic business recommendations) were made independently by the author based on direct operational knowledge of Dalos-Pro Solutions and the course textbook (Adi, 2026). Every code chunk was reviewed, tested, and verified against the real dataset. No simulated data was used. The dataset was collected by the author from Dalos-Pro Solutions’ administrative records in the author’s capacity as CEO.