Driver Outsourcing Services in Lagos: An Exploratory & Inferential Analytics Study

Author

[Your Full Name]

Published

May 9, 2026

1 Executive Summary

This study investigates the market for driver outsourcing services in Lagos, Nigeria, drawing on primary survey data collected from 90 respondents across varying demographic and professional profiles. Lagos presents a uniquely challenging urban mobility environment — characterised by severe traffic congestion, safety concerns, and a growing professional class that increasingly outsources driving responsibilities. Despite rapid market growth, there is limited empirical understanding of the factors that drive consumer adoption, satisfaction, and willingness to pay.

Using a structured questionnaire administered between March 20 and May 4, 2026, this analysis applies five complementary techniques: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Regression Modelling. Respondents rate safety, reliability, and convenience as the most important provider selection factors, closely followed by driver professionalism, while high cost and lack of trust are the most cited barriers to more frequent use. The regression model, fitted on the 24 responses with a recorded outcome value, does not identify any statistically significant predictor of overall satisfaction (adjusted R² = -0.053, model p = 0.573), so the satisfaction results should be treated as inconclusive. The central recommendation is that providers should prioritise driver vetting, transparent safety communication, and tiered pricing to unlock the latent demand evident in the data.

2 Professional Disclosure

Job Title / Role: [Your Job Title]

Organisation Type / Sector: [e.g., Financial Services / Logistics / FMCG — Lagos, Nigeria]

Why These Five Techniques Are Relevant to My Work:

1. Exploratory Data Analysis (EDA): In my day-to-day role I routinely receive raw operational or customer data that must be assessed for completeness and distributional properties before any business decision is taken. EDA is the first gate I apply to any new dataset — identifying quality issues, understanding distributions, and surfacing initial patterns.

2. Data Visualisation: Communicating analytical insights to non-technical stakeholders is a core professional responsibility. Effective visualisation enables me to convey complex patterns concisely. In this project, visual storytelling communicates how Lagos consumers perceive and experience driver outsourcing.

3. Hypothesis Testing: My organisation regularly makes decisions that depend on whether observed differences in performance or satisfaction are statistically real or due to random variation. Hypothesis testing provides the formal framework to support or challenge business assumptions with data.

4. Correlation Analysis: Understanding which variables move together — and by how much — is essential for identifying levers management can act on. Mapping relationships among service quality dimensions and satisfaction informs investment priorities.

5. Linear Regression: Regression models quantify the relative contribution of different factors to an outcome of interest. Coefficient estimates translate directly into prioritised, actionable service improvement recommendations for non-technical management.

3 Data Collection & Sampling

3.1 Source and Collection Method

The primary dataset was collected via a structured online questionnaire administered using Google Forms between March 20 and May 4, 2026. The survey captured socio-demographic characteristics, usage behaviour, service preferences (Likert-style importance ratings), perceived service quality (10 Likert statements), safety experiences, provider readiness assessments, and an overall satisfaction score (1–5 scale).

The survey link was distributed via professional networks and WhatsApp groups among Lagos-based professionals. Participation was voluntary and anonymous.

3.2 Sampling Frame

Target population: Adults in Lagos who have used or considered using a driver outsourcing service.
Sampling approach: Purposive / snowball sampling targeting individual users, business owners, corporate representatives, and event/logistics coordinators.
Sample size: 90 usable responses.
Time period: March 20 – May 4, 2026 (approximately 6 weeks).
Ethical notes: Informed consent was obtained via the first survey question. No vulnerable populations were targeted. Data are published in anonymised, aggregate form. Email addresses provided by a small number of respondents were removed prior to analysis.

3.3 Technique Justification

Technique	Justification
EDA	The survey generates wide-ranging variable types (ordinal Likert, nominal categorical, free-text numeric). EDA identifies distributions, missing responses, and outliers before formal methods are applied.
Data Visualisation	With 20+ variables across demographic and attitudinal dimensions, visualisation is the only practical way to surface patterns for business decision-makers.
Hypothesis Testing	Key questions — Does satisfaction differ by user type? Do safety concerns vary by respondent segment? — require formal statistical tests to move beyond descriptive anecdote.
Correlation Analysis	Mapping relationships among service-quality ratings and satisfaction identifies which dimensions have the strongest co-movement with the outcome, guiding investment priorities.
Regression	OLS regression on overall satisfaction yields coefficient estimates that translate directly into specific service improvement recommendations.

4 Data Description

4.1 Initial Inspection

Code

cat("Dataset dimensions:", nrow(df), "rows x", ncol(df), "columns\n")

Dataset dimensions: 90 rows x 37 columns

Code

df |>
  select(respondent_type, age_group, gender, monthly_income,
         use_frequency, primary_reason, monthly_spend,
         overall_satisfaction) |>
  head(8) |>
  kable(caption = "First 8 rows — key variables") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 11) |>
  scroll_box(height = "300px")

First 8 rows — key variables
respondent_type	age_group	gender	monthly_income	use_frequency	primary_reason	monthly_spend	overall_satisfaction
Business Owner/ executive	35-44	Male	above #700,000	rarely (1-2 times per year)	Events (weddings, Parties)	#50,001 - #100,000	2000
Business Owner/ executive	45-54	Male	above #700,000	rarely (1-2 times per year)	Personal Transportation	#50,001 - #100,000	25000
event planner/logistics coordinator	45-54	Male	#300,000 -#700,000	rarely (1-2 times per year)	Field Work	less than #20000	NA
Corporate Organization representative	25-34	Male	#300,000 -#700,000	Occasionally (once a month)	Corporate transportaion	#50,001 - #100,000	10000
Individual User	above 54	Male	above #700,000	Occasionally (once a month)	Logistics/delivery	#100,001 - #200,000	NA
Individual User	35-44	Female	above #700,000	rarely (1-2 times per year)	Personal Transportation	#20,001 - #50,000	NA
Individual User	35-44	Male	#300,000 -#700,000	rarely (1-2 times per year)	Events (weddings, Parties)	less than #20000	20000
Corporate Organization representative	45-54	Male	#300,000 -#700,000	Frequently (weekly)	Corporate transportaion	#100,001 - #200,000	NA
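
Later sections reference a derived column, use_freq_f, whose construction is not shown in this report. The block below is a minimal sketch of one way to build it, assuming it is an ordered factor over the four usage-frequency labels that appear in the report's tables. The level order and the small example vector are assumptions for illustration, not the author's code.

```r
# Hypothetical example values, copied from the labels shown in this report.
use_frequency <- c("rarely (1-2 times per year)",
                   "Occasionally (once a month)",
                   "Frequently (weekly)",
                   "Very frequently (multiple times weekly)",
                   "rarely (1-2 times per year)")

# Assumed ordering, from least to most frequent use.
freq_levels <- c("rarely (1-2 times per year)",
                 "Occasionally (once a month)",
                 "Frequently (weekly)",
                 "Very frequently (multiple times weekly)")

use_freq_f <- factor(use_frequency, levels = freq_levels, ordered = TRUE)

# Counts per level, in the assumed order (unused levels would show as 0).
table(use_freq_f)
```

Fixing the level order explicitly keeps bar charts and group summaries sorted from least to most frequent use rather than alphabetically.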

4.2 Variable Summary

Code

df |>
  select(respondent_type, age_group, gender, monthly_income,
         use_frequency, overall_satisfaction,
         starts_with("imp_"), starts_with("rate_")) |>
  skim() |>
  as_tibble() |>
  select(skim_type, skim_variable, n_missing, complete_rate,
         numeric.mean, numeric.sd, numeric.p50) |>
  mutate(across(where(is.numeric), ~ round(., 2))) |>
  kable(caption = "Variable Skim Summary") |>
  kable_styling(bootstrap_options = c("striped", "condensed"),
                full_width = TRUE, font_size = 11) |>
  scroll_box(height = "400px")

Variable Skim Summary
skim_type	skim_variable	n_missing	complete_rate	numeric.mean	numeric.sd	numeric.p50
character	respondent_type	0	1.00	NA	NA	NA
character	age_group	0	1.00	NA	NA	NA
character	gender	7	0.92	NA	NA	NA
character	monthly_income	0	1.00	NA	NA	NA
character	use_frequency	0	1.00	NA	NA	NA
numeric	overall_satisfaction	66	0.27	57791.67	104280.13	20000
numeric	imp_safety	0	1.00	4.53	1.03	5
numeric	imp_cost	0	1.00	4.00	1.08	4
numeric	imp_convenience	0	1.00	4.31	1.07	5
numeric	imp_professionalism	0	1.00	4.29	1.04	5
numeric	imp_reliability	0	1.00	4.37	1.10	5
numeric	imp_booking_ease	0	1.00	4.04	1.10	4
numeric	imp_flexibility	0	1.00	4.01	1.03	4
numeric	rate_punctual	0	1.00	3.61	0.98	4
numeric	rate_professional	0	1.00	3.63	0.88	4
numeric	rate_routes	0	1.00	3.80	0.91	4
numeric	rate_booking_easy	0	1.00	3.78	0.86	4
numeric	rate_consistent	0	1.00	3.52	0.91	4
numeric	rate_poorly_vetted	0	1.00	3.00	1.12	3
numeric	rate_safe	0	1.00	3.40	0.90	4
numeric	rate_trust	0	1.00	2.76	0.99	3
numeric	rate_complaints	0	1.00	3.13	0.84	3
numeric	rate_safety_serious	0	1.00	3.31	1.02	3

5 Analysis Section 1 — Exploratory Data Analysis

Technique: Exploratory Data Analysis | Book Reference: Ch. 4 — Summary statistics, missing-value analysis, outlier detection

Business Justification: Before drawing conclusions about what drives customer satisfaction, we must understand the structure of the data — how complete it is, whether outliers could distort results, and whether key distributions are skewed. This mirrors the first step in any real business intelligence workflow.

5.1 Missing Values Analysis

Code

miss_df <- df |>
  select(respondent_type, age_group, gender, monthly_income,
         use_frequency, primary_reason, service_duration,
         monthly_spend, booking_channel, service_type,
         starts_with("imp_"), starts_with("rate_"),
         safety_concern, overall_satisfaction) |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  mutate(pct_missing = round(100 * n_missing / nrow(df), 1)) |>
  filter(n_missing > 0) |>
  arrange(desc(n_missing))

miss_df |>
  kable(caption = "Variables with Missing Values") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Variables with Missing Values
variable	n_missing	pct_missing
overall_satisfaction	66	73.3
gender	7	7.8

Code

if (nrow(miss_df) > 0) {
  ggplot(miss_df, aes(x = reorder(variable, pct_missing), y = pct_missing)) +
    geom_col(fill = "#2c7fb8") +
    coord_flip() +
    scale_y_continuous(labels = label_percent(scale = 1)) +
    labs(title = "Missing Data by Variable",
         subtitle = "Percentage of observations with no response",
         x = NULL, y = "% Missing")
} else {
  cat("No missing values detected in selected variables.\n")
}

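Data Quality Finding 1: Only two of the selected variables have missing values. overall_satisfaction is missing for 66 of 90 respondents (73.3%), leaving 24 usable observations for every analysis that involves the outcome, and gender is missing for 7 respondents (7.8%). Observations with missing values are retained for categorical analyses but excluded listwise from numeric modelling. The recorded overall_satisfaction values (for example 2000, 25000, and 10000) also fall far outside the 1–5 scale described in Section 3.1, which suggests the column may hold responses to a different open-text numeric question. This should be verified against the raw survey export before the satisfaction results are relied upon.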

5.2 Distribution of the Outcome Variable

Code

sat_clean <- df |> filter(!is.na(overall_satisfaction))

p1 <- sat_clean |>
  mutate(overall_satisfaction = as.factor(overall_satisfaction)) |>
  ggplot(aes(x = overall_satisfaction, fill = overall_satisfaction)) +
  geom_bar(width = 0.7, show.legend = FALSE) +
  scale_fill_brewer(palette = "Blues") +
  labs(title = "Overall Satisfaction Distribution",
       subtitle = "1 = Very Dissatisfied, 5 = Very Satisfied",
       x = "Satisfaction Score", y = "Count")

p2 <- sat_clean |>
  ggplot(aes(x = overall_satisfaction)) +
  geom_boxplot(fill = "#41b6c4", colour = "grey30") +
  labs(title = "Box Plot", x = "Score", y = NULL)

p1 / p2 + plot_layout(heights = c(3, 1))

Code

cat(
  "Mean:", round(mean(sat_clean$overall_satisfaction), 2),
  "| Median:", median(sat_clean$overall_satisfaction),
  "| Skewness:", round(skewness(sat_clean$overall_satisfaction), 2),
  "\n"
)

Mean: 57791.67 | Median: 20000 | Skewness: 3.45

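Data Quality Finding 2: The outcome variable has a mean of 57,791.67, a median of 20,000, and a skewness of 3.45. The distribution is therefore strongly right-skewed: most recorded values are comparatively small, and a few very large values pull the mean well above the median. Because these values are not on the 1–5 satisfaction scale, they cannot be read as levels of satisfaction or dissatisfaction, and the plot labels above (1 = Very Dissatisfied, 5 = Very Satisfied) do not match the data as recorded.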
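
To make the skewness figure concrete, the sketch below computes a moment-based skewness coefficient by hand. It assumes the simple population-moment definition; the report does not say which package supplied skewness(), and other definitions apply small-sample corrections, so exact values may differ slightly. The input values are hypothetical.

```r
# Moment-based sample skewness: third central moment divided by the
# second central moment raised to the power 3/2.
skew <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^1.5
}

# Hypothetical values: mostly small amounts plus one very large one.
x <- c(2000, 10000, 20000, 20000, 25000, 500000)

mean(x) > median(x)  # the large value pulls the mean above the median
skew(x)              # positive, i.e. a long right tail
```

A positive coefficient indicates a long right tail, which is the pattern produced by a handful of very large values.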

5.3 Outlier Detection — Importance Ratings

Code

df |>
  select(starts_with("imp_")) |>
  pivot_longer(everything(), names_to = "factor", values_to = "score") |>
  mutate(factor = str_remove(factor, "imp_") |>
           str_replace_all("_", " ") |> str_to_title()) |>
  filter(!is.na(score)) |>
  ggplot(aes(x = reorder(factor, score, median), y = score, fill = factor)) +
  geom_boxplot(show.legend = FALSE, width = 0.6) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(breaks = 1:5,
    labels = c("Not\nImportant", "Somewhat", "Neutral", "Important", "Very\nImportant")) +
  labs(title = "Importance of Provider Selection Factors",
       subtitle = "Distribution across all respondents (1-5 scale)",
       x = NULL, y = "Importance Rating")

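All importance ratings fall within the valid 1–5 scale, so there are no out-of-range values to remove. Dispersion is similar across the seven factors (standard deviations between 1.03 and 1.10 in the Variable Skim Summary); Cost has the lowest mean importance (4.00) and Safety the highest (4.53). Any difference in price sensitivity across user segments would need to be tested by comparing the Cost rating between respondent types rather than inferred from the pooled box plot.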
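
The dispersion of each importance factor can be compared directly from the figures in the Variable Skim Summary. The sketch below copies the reported means and standard deviations into a small data frame and ranks the factors; the data frame and its column names are illustrative.

```r
# Means and standard deviations copied from the Variable Skim Summary.
imp <- data.frame(
  factor = c("safety", "cost", "convenience", "professionalism",
             "reliability", "booking_ease", "flexibility"),
  mean   = c(4.53, 4.00, 4.31, 4.29, 4.37, 4.04, 4.01),
  sd     = c(1.03, 1.08, 1.07, 1.04, 1.10, 1.10, 1.03)
)

# Rank factors from most to least dispersed.
imp[order(-imp$sd), ]

# Spread of the standard deviations across factors.
range(imp$sd)
```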

6 Analysis Section 2 — Data Visualisation

Technique: Data Visualisation | Book Reference: Ch. 5 — Grammar of graphics, chart selection, storytelling with data

Business Justification: The five plots below form a cohesive narrative — who uses driver outsourcing, how frequently, why, how much they spend, and what holds them back — communicated in a format suitable for senior non-technical leadership.

Code

df |>
  filter(!is.na(respondent_type)) |>
  count(respondent_type) |>
  mutate(pct = n / sum(n),
         respondent_type = str_wrap(respondent_type, 25)) |>
  ggplot(aes(x = reorder(respondent_type, pct), y = pct, fill = respondent_type)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  coord_flip() +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Plot 1 — Who Responded?",
       subtitle = "Respondent type composition",
       x = NULL, y = "% of Respondents")

Code

df |>
  filter(!is.na(use_freq_f)) |>
  count(use_freq_f) |>
  ggplot(aes(x = use_freq_f, y = n, fill = use_freq_f)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  scale_x_discrete(labels = function(x) str_wrap(x, 12)) +
  scale_fill_manual(values = c("#d9f0a3", "#addd8e", "#31a354", "#006837")) +
  labs(title = "Plot 2 — How Often Do Users Hire Outsourced Drivers?",
       subtitle = "Usage frequency distribution",
       x = NULL, y = "Count")

Code

df |>
  filter(!is.na(primary_reason)) |>
  count(primary_reason) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = reorder(primary_reason, pct), y = pct)) +
  geom_col(fill = "#2b8cbe", width = 0.7) +
  coord_flip() +
  scale_y_continuous(labels = label_percent()) +
  labs(title = "Plot 3 — Why Do Users Hire Outsourced Drivers?",
       subtitle = "Primary stated reason",
       x = NULL, y = "% of Respondents")

Code

df |>
  filter(!is.na(spend_ord), !is.na(income_ord)) |>
  count(income_ord, spend_ord) |>
  group_by(income_ord) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = income_ord, y = pct, fill = spend_ord)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_brewer(palette = "YlOrRd", name = "Monthly Spend") +
  scale_x_discrete(labels = function(x) str_wrap(x, 10)) +
  labs(title = "Plot 4 — Monthly Spend by Income Bracket",
       subtitle = "Higher earners spend proportionally more on driver services",
       x = "Monthly Income", y = "% of Respondents") +
  theme(legend.position = "right")

Code

df |>
  filter(!is.na(barriers)) |>
  separate_rows(barriers, sep = ";") |>
  mutate(barrier = str_trim(barriers)) |>
  filter(barrier != "") |>
  count(barrier, sort = TRUE) |>
  head(8) |>
  ggplot(aes(x = reorder(barrier, n), y = n)) +
  geom_col(fill = "#e34a33", width = 0.7) +
  coord_flip() +
  labs(title = "Plot 5 — What Prevents More Frequent Use?",
       subtitle = "Top 8 stated barriers",
       x = NULL, y = "Number of Mentions")

Visualisation Narrative: The five plots tell a coherent story. Individual users make up the majority of respondents (54 of 90), followed by corporate representatives (17) and business owners or executives (13), and most use the service infrequently, primarily for personal transportation and corporate travel. Higher-income respondents skew toward larger monthly spends. Yet the dominant barrier to more frequent use is high cost, followed by lack of trust, which suggests that even customers who are willing to pay are held back by trust deficits that driver certification and transparent pricing could address.

7 Analysis Section 3 — Hypothesis Testing

Technique: Hypothesis Testing | Book Reference: Ch. 6 — t-test, chi-squared, ANOVA, non-parametric alternatives, effect sizes

Business Justification: Lagos’s driver outsourcing providers serve a heterogeneous market. A key operational question is whether user segments genuinely differ in satisfaction and safety experience, or whether apparent differences are merely sampling noise. Hypothesis testing gives formal, defensible answers.

7.1 Hypothesis 1 — Does Satisfaction Differ by Usage Frequency?

H₀: Mean overall satisfaction is equal across all usage frequency groups

H₁: At least one group has a different mean satisfaction score

Test: One-way ANOVA; Kruskal-Wallis as non-parametric backup

Code

h1_df <- df |>
  filter(!is.na(overall_satisfaction), !is.na(use_freq_f))

h1_df |>
  group_by(use_freq_f) |>
  summarise(
    n        = n(),
    mean_sat = round(mean(overall_satisfaction), 2),
    sd_sat   = round(sd(overall_satisfaction), 2),
    shapiro_p = ifelse(n() >= 3,
                  round(shapiro.test(overall_satisfaction)$p.value, 3), NA)
  ) |>
  kable(caption = "Descriptive Statistics & Normality Check by Usage Frequency") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Descriptive Statistics & Normality Check by Usage Frequency
use_freq_f	n	mean_sat	sd_sat	shapiro_p
rarely (1-2 times per year)	19	41421.05	48200.18	0.000
Occasionally (once a month)	4	25000.00	33416.56	0.009
Very frequently (multiple times weekly)	1	500000.00	NA	NA

Code

levene_res <- leveneTest(overall_satisfaction ~ use_freq_f, data = h1_df)
cat("Levene's Test p-value:", round(levene_res$`Pr(>F)`[1], 3), "\n")

Levene's Test p-value: 0.759

Code

aov_res <- aov(overall_satisfaction ~ use_freq_f, data = h1_df)
summary(aov_res)

            Df    Sum Sq   Mean Sq F value   Pr(>F)    
use_freq_f   2 2.049e+11 1.025e+11   47.64 1.57e-08 ***
Residuals   21 4.517e+10 2.151e+09                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Code

print(eta_squared(aov_res))

use_freq_f 
 0.8194049

Code

kw_res <- kruskal.test(overall_satisfaction ~ use_freq_f, data = h1_df)
cat("Kruskal-Wallis p-value:", round(kw_res$p.value, 3), "\n")

Kruskal-Wallis p-value: 0.154

Code

h1_df |>
  ggplot(aes(x = use_freq_f, y = overall_satisfaction, fill = use_freq_f)) +
  geom_boxplot(show.legend = FALSE, width = 0.5) +
  stat_summary(fun = mean, geom = "point", shape = 21, size = 3,
               fill = "white", colour = "black") +
  scale_x_discrete(labels = function(x) str_wrap(x, 12)) +
  scale_fill_brewer(palette = "Blues") +
  labs(title = "Overall Satisfaction by Usage Frequency",
       subtitle = "White dot = group mean",
       x = "Usage Frequency", y = "Satisfaction (1-5)")

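Interpretation: The one-way ANOVA is significant, F(2, 21) = 47.64, p < 0.001, with a very large effect size (η² = 0.82), which taken alone would lead to rejecting H₀. The result is not robust, however. The “Very frequently” group contains a single respondent with a value of 500,000, Shapiro-Wilk rejects normality in both groups large enough to test (p < 0.01), and the Kruskal-Wallis test, which does not depend on normality, fails to reject H₀ (p = 0.154). Levene's test (p = 0.759) gives no evidence of unequal variances, but the single-observation group contributes no information about spread. With only 24 non-missing outcome values, recorded on a scale that does not match the 1–5 satisfaction score, there is insufficient evidence that satisfaction differs by usage frequency. For a Lagos operator, the practical implication is that usage frequency should not yet be used to segment satisfaction initiatives; the outcome column needs to be verified and more responses collected first.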
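
The ANOVA and Kruskal-Wallis outputs above point in different directions, and one group contains a single observation. A sensitivity check of the following kind shows how much one extreme value can drive the F-test. The data below are hypothetical values shaped like the reported group sizes (19, 4, and 1); they are not the survey responses, and the helper aov_p is introduced only for this illustration.

```r
# Hypothetical outcome values, shaped like the reported group sizes
# (19, 4 and 1); these are NOT the survey data.
toy <- data.frame(
  y = c(seq(5000, 95000, by = 5000),        # 19 "rarely" values
        2500, 12500, 22500, 62500,          # 4 "occasionally" values
        500000),                            # 1 "very frequently" value
  g = factor(rep(c("rarely", "occasionally", "very_frequently"),
                 times = c(19, 4, 1)))
)

# Helper: p-value of the one-way ANOVA F-test.
aov_p <- function(d) summary(aov(y ~ g, data = d))[[1]][["Pr(>F)"]][1]

p_all     <- aov_p(toy)
p_reduced <- aov_p(droplevels(subset(toy, g != "very_frequently")))
p_kw      <- kruskal.test(y ~ g, data = toy)$p.value

c(anova_all = p_all, anova_without_singleton = p_reduced, kruskal_all = p_kw)
```

If the ANOVA p-value changes sharply once the single-observation group is dropped, the significant result reflects that one respondent rather than a general difference between groups.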

7.2 Hypothesis 2 — Is Safety Concern Independent of Respondent Type?

H₀: Safety concern experience is independent of respondent type

H₁: Safety concern experience is associated with respondent type

Test: Chi-squared test of independence

Code

h2_df <- df |>
  filter(!is.na(safety_concern_bin), !is.na(respondent_type))

ct <- table(h2_df$respondent_type, h2_df$safety_concern_bin)
colnames(ct) <- c("No Safety Concern", "Safety Concern")

ct |>
  kable(caption = "Contingency Table: Safety Concern x Respondent Type") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Contingency Table: Safety Concern x Respondent Type
	No Safety Concern	Safety Concern
Business Owner/ executive	8	5
Corporate Organization representative	8	9
event planner/logistics coordinator	2	4
Individual User	29	25

Code

chi_res <- chisq.test(ct)
print(chi_res)


    Pearson's Chi-squared test

data:  ct
X-squared = 1.5394, df = 3, p-value = 0.6732

Code

print(cramers_v(ct))

Cramer's V (adj.) |       95% CI
--------------------------------
0                 | [0.00, 1.00]

- One-sided CIs: upper bound fixed at [1.00].

Code

h2_df |>
  group_by(respondent_type) |>
  summarise(pct_concern = mean(safety_concern_bin, na.rm = TRUE)) |>
  ggplot(aes(x = reorder(respondent_type, pct_concern),
             y = pct_concern, fill = pct_concern)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_gradient(low = "#ffffcc", high = "#d73027") +
  coord_flip() +
  labs(title = "% Who Experienced a Safety Concern by Respondent Type",
       x = NULL, y = "% Reporting Safety Concern")

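Interpretation: The chi-squared test gives χ²(3) = 1.54, p = 0.673, with an adjusted Cramér’s V of 0, so H₀ is not rejected: there is no evidence that the experience of a safety concern depends on respondent type. Descriptively, event planners and logistics coordinators report the highest rate (4 of 6, 66.7%), followed by corporate representatives (9 of 17, 52.9%), individual users (25 of 54, 46.3%), and business owners or executives (5 of 13, 38.5%), but these differences are within what sampling variation would produce. The event planner group is small enough that its expected cell counts fall below 5, so the chi-squared approximation should be read with caution. For providers, safety concerns appear to be a market-wide issue (43 of 90 respondents report one) rather than a problem confined to one segment.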
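
Because one respondent type has only six observations, it is worth checking the expected cell counts behind the chi-squared test and comparing the result with an exact test. The sketch below rebuilds the contingency table from the counts printed above; using Fisher's exact test as a cross-check is a suggestion here, not part of the original analysis.

```r
# Contingency table copied from the report (rows: respondent type).
ct <- matrix(c(8, 5,
               8, 9,
               2, 4,
               29, 25),
             ncol = 2, byrow = TRUE,
             dimnames = list(
               c("Business Owner/ executive",
                 "Corporate Organization representative",
                 "event planner/logistics coordinator",
                 "Individual User"),
               c("No Safety Concern", "Safety Concern")))

# chisq.test() warns when expected counts are small; the warning is
# suppressed here because the expected counts are inspected directly.
chi_res <- suppressWarnings(chisq.test(ct))
round(chi_res$expected, 2)

# Share of each respondent type reporting a safety concern.
round(prop.table(ct, margin = 1)[, "Safety Concern"], 3)

# Fisher's exact test does not rely on the large-sample approximation.
fisher_p <- fisher.test(ct)$p.value
fisher_p
```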

8 Analysis Section 4 — Correlation Analysis

Technique: Correlation Analysis | Book Reference: Ch. 8 — Pearson, Spearman, Kendall; correlation vs causation

Business Justification: Understanding which service-quality dimensions co-move with overall satisfaction helps management prioritise improvement investments. Spearman rank correlation is used given the ordinal nature of Likert-scale data.

Code

corr_vars <- df |>
  select(starts_with("imp_"), starts_with("rate_"), overall_satisfaction)

names(corr_vars) <- names(corr_vars) |>
  str_remove("imp_|rate_") |>
  str_replace_all("_", " ") |>
  str_to_title()

corr_mat <- cor(corr_vars, use = "pairwise.complete.obs", method = "spearman")

ggcorrplot(corr_mat,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 2.2,
           colors   = c("#d73027", "white", "#1a9641"),
           title    = "Spearman Correlation Matrix",
           ggtheme  = theme_minimal(base_size = 9)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
        axis.text.y = element_text(size = 7))

Code

corr_sat <- corr_mat[, "Overall Satisfaction"] |>
  sort(decreasing = TRUE) |>
  as_tibble(rownames = "variable") |>
  filter(variable != "Overall Satisfaction") |>
  rename(spearman_r = value) |>
  mutate(spearman_r = round(spearman_r, 3))

corr_sat |>
  kable(caption = "Spearman Correlation with Overall Satisfaction (ranked)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Spearman Correlation with Overall Satisfaction (ranked)
variable	spearman_r
Safe	0.217
Trust	0.162
Safety	0.153
Routes	0.143
Professionalism	0.106
Cost	0.104
Reliability	0.066
Professional	0.062
Flexibility	0.062
Punctual	0.048
Complaints	0.037
Poorly Vetted	0.018
Safety Serious	0.015
Booking Ease	0.010
Convenience	-0.099
Consistent	-0.212
Booking Easy	-0.431

Discussion of Key Correlations:

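Safe (ρ = 0.217): Respondents who agree more strongly that the service is safe tend to record slightly higher outcome values. This is the largest positive coefficient, but it is a weak association.
Trust (ρ = 0.162): The second-largest positive coefficient, also weak.
Safety (ρ = 0.153): The importance a respondent places on safety shows a similarly weak positive association with the outcome.

Two caveats apply. Every correlation with the outcome is based on only the 24 respondents with a non-missing value, so coefficients of this size are difficult to distinguish from zero. The largest coefficient in absolute terms is negative: Booking Easy (ρ = -0.431), followed by Consistent (ρ = -0.212). This runs against the expectation that better-rated service accompanies higher satisfaction and is a further reason to verify the outcome column.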

Causation caveat: These correlations are observational and weak. Even a strong correlation between perceived safety and satisfaction would not confirm that improving safety causes higher satisfaction without a controlled intervention. The positive associations are consistent with prioritising safety investments, but they do not establish causality, and experimental or longitudinal evidence would be needed to do so.
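
The ranked table reports coefficients without p-values. As a rough guide to which of them could be distinguished from zero, the sketch below applies the standard t approximation for a rank correlation to the reported values. It assumes 24 paired observations (the number of non-missing outcome values), makes no adjustment for testing 17 coefficients at once, and is only an approximation; running cor.test() on the raw data would be the more direct check. The helper rho_p is introduced for this illustration.

```r
# Approximate two-sided p-value for a Spearman coefficient, using the
# t approximation t = rho * sqrt((n - 2) / (1 - rho^2)) with n - 2 df.
rho_p <- function(rho, n) {
  t_stat <- rho * sqrt((n - 2) / (1 - rho^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

n_pairs <- 24  # respondents with a non-missing outcome value

# Coefficients copied from the ranked correlation table.
rho <- c(Safe = 0.217, Trust = 0.162, Safety = 0.153,
         Consistent = -0.212, `Booking Easy` = -0.431)

round(sapply(rho, rho_p, n = n_pairs), 3)
```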

9 Analysis Section 5 — Linear Regression

Technique: OLS Linear Regression | Book Reference: Ch. 9 — Coefficients, diagnostics, interpretation

Business Justification: Regression quantifies the independent contribution of each predictor to overall satisfaction, holding other variables constant. This converts correlation findings into specific, prioritised recommendations suitable for a board-level decision.

Code

reg_df <- df |>
  select(overall_satisfaction,
         imp_safety, imp_reliability, imp_professionalism,
         imp_cost, imp_convenience, imp_booking_ease, imp_flexibility,
         rate_punctual, rate_professional, rate_routes,
         rate_consistent, rate_safe, rate_trust, rate_complaints) |>
  drop_na()

mod <- lm(overall_satisfaction ~
            imp_safety + imp_reliability + imp_professionalism +
            imp_cost + imp_convenience + imp_booking_ease + imp_flexibility +
            rate_punctual + rate_professional + rate_routes +
            rate_consistent + rate_safe + rate_trust + rate_complaints,
          data = reg_df)

tidy(mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~ round(., 3))) |>
  arrange(p.value) |>
  kable(caption = "OLS Regression — Predictors of Overall Satisfaction") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)

OLS Regression — Predictors of Overall Satisfaction
term	estimate	std.error	statistic	p.value	conf.low	conf.high
rate_trust	-78660.439	47911.60	-1.642	0.135	-187044.01	29723.13
rate_routes	72618.917	49122.75	1.478	0.173	-38504.46	183742.29
imp_convenience	-106634.543	88053.46	-1.211	0.257	-305825.31	92556.22
imp_cost	69689.847	58469.75	1.192	0.264	-62577.91	201957.60
rate_safe	51448.320	50823.92	1.012	0.338	-63523.38	166420.02
imp_professionalism	56550.224	71368.92	0.792	0.449	-104897.50	217997.95
imp_safety	80066.000	103331.53	0.775	0.458	-153686.17	313818.17
imp_reliability	-72239.072	103149.57	-0.700	0.501	-305579.62	161101.48
rate_punctual	-26798.400	44104.25	-0.608	0.558	-126569.15	72972.35
imp_booking_ease	-59121.691	100847.04	-0.586	0.572	-287253.55	169010.17
imp_flexibility	36728.148	73125.47	0.502	0.628	-128693.16	202149.46
rate_professional	-33172.495	78413.12	-0.423	0.682	-210555.31	144210.32
(Intercept)	-30798.676	180560.75	-0.171	0.868	-439255.47	377658.12
rate_complaints	14867.891	88911.58	0.167	0.871	-186264.08	215999.87
rate_consistent	9217.281	65385.26	0.141	0.891	-138694.46	157129.02

Code

glance(mod) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) |>
  mutate(across(where(is.numeric), ~ round(., 3))) |>
  kable(caption = "Model Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)

Model Fit Statistics
r.squared	adj.r.squared	sigma	statistic	p.value	df	nobs
0.588	-0.053	106983.5	0.918	0.573	14	24

Code

par(mfrow = c(2, 2))
plot(mod, which = 1:4)

Code

par(mfrow = c(1, 1))

Code

tidy(mod, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    term = str_remove(term, "imp_|rate_") |>
           str_replace_all("_", " ") |> str_to_title(),
    significant = p.value < 0.05
  ) |>
  ggplot(aes(x = reorder(term, estimate), y = estimate,
             ymin = conf.low, ymax = conf.high, colour = significant)) +
  geom_pointrange(size = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  coord_flip() +
  scale_colour_manual(
    values = c("TRUE" = "#e34a33", "FALSE" = "grey60"),
    labels = c("TRUE" = "p < .05", "FALSE" = "p >= .05"),
    name = NULL
  ) +
  labs(title = "OLS Regression Coefficients",
       subtitle = "Point estimates with 95% confidence intervals",
       x = NULL, y = "Coefficient Estimate")

Coefficient Interpretation for a Non-Technical Manager:

“The model’s R² of 0.588 overstates its explanatory power, because 14 predictors were fitted to only 24 complete responses. The adjusted R², which penalises the number of predictors, is -0.053, and the overall F-test is not significant (F = 0.918, p = 0.573). No individual coefficient is significant at the 5% level: the smallest p-value belongs to rate_trust (β = -78,660, p = 0.135), and every 95% confidence interval includes zero. The size of the coefficients, in the tens of thousands, also confirms that the outcome is not recorded on a five-point scale. In plain terms, this model cannot yet tell us which service dimension matters most for satisfaction. Before it can support a decision, the outcome column needs to be verified, more responses with a recorded outcome are needed, and the predictor set should be reduced.”
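
The gap between R² and adjusted R² in the Model Fit Statistics table follows directly from the sample size and the number of predictors. The sketch below recomputes the adjusted R² and the overall F-test from the reported values using the standard OLS formulas; because the inputs are rounded to three decimals, the results agree with the table only up to rounding.

```r
# Values copied from the Model Fit Statistics table.
r2 <- 0.588  # R-squared
n  <- 24     # observations used
p  <- 14     # predictors (excluding the intercept)

# Adjusted R-squared penalises R-squared for the number of predictors.
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Overall F statistic and its p-value from the same quantities.
f_stat <- (r2 / p) / ((1 - r2) / (n - p - 1))
f_p    <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

round(c(adj_r2 = adj_r2, f_stat = f_stat, f_p = f_p), 3)
```

With only 9 residual degrees of freedom, the penalty factor (n - 1) / (n - p - 1) is about 2.6, which is why a seemingly high R² turns into a negative adjusted R².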

10 Integrated Findings

The five analyses collectively tell one coherent story about the Lagos driver outsourcing market:

EDA revealed a respondent pool skewed toward infrequent users and an outcome variable that is 73.3% missing, strongly right-skewed (skewness 3.45), and recorded outside the 1–5 scale, which limits every downstream analysis of satisfaction to 24 observations.
Visualisation showed that the dominant use cases are personal transport and corporate travel, that higher-income users are proportionally bigger spenders, and that cost and trust are the twin structural barriers limiting market growth.
Hypothesis testing did not provide robust evidence that satisfaction differs by usage frequency (H1): the ANOVA was significant (η² = 0.82) but driven by a single observation, and the Kruskal-Wallis test was not significant (p = 0.154). Safety concern rates did not differ significantly by respondent type (H2: χ²(3) = 1.54, p = 0.673, Cramér’s V = 0), which indicates that safety concerns are spread across all segments.
Correlation analysis found that Safe and Trust have the largest positive Spearman correlations with the outcome (ρ = 0.217 and 0.162), but both are weak, and the largest correlation in absolute terms is negative (Booking Easy, ρ = -0.431).
Regression modelling did not isolate any independent driver of satisfaction: no coefficient was significant and the adjusted R² was -0.053.

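Single Integrated Recommendation: Lagos driver outsourcing providers should invest first in driver professionalism and vetting programmes, because safety and reliability receive the highest importance ratings, with professionalism close behind, and lack of trust is among the most cited barriers. They should then introduce transparent, tiered pricing to address the cost-and-trust barrier, and implement systematic post-trip feedback mechanisms to generate the operational data loop needed to monitor service consistency at scale. Because the satisfaction analyses were inconclusive, this recommendation rests on the importance ratings and stated barriers rather than on a measured link to satisfaction.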

11 Limitations & Further Work

Data limitations:

Sampling bias: Snowball and purposive sampling over-represents connected, digitally literate Lagos professionals. Findings may not generalise to lower-income users or those outside formal employment.
Self-report bias: Likert-scale responses reflect perceptions, not objective service quality measures. Drivers’ perspectives were not captured.
Cross-sectional design: The survey captures a single point in time; longitudinal data would be needed to assess whether satisfaction trends improve after service interventions.
Open-text willingness-to-pay field: Responses to Q18 were inconsistently formatted (daily vs monthly, varying currency notation), reducing the utility of that variable for quantitative analysis.

With more data, time, or computing power:

A discrete choice / conjoint experiment would more precisely estimate willingness to pay for each service dimension.
A longitudinal panel tracking the same customers over multiple trips would enable causal inference about satisfaction drivers.
Natural language processing on Q19 open-text responses would yield richer qualitative insight to complement the quantitative findings.
A provider-side survey matched to customer-side data would enable multi-level modelling of how firm characteristics mediate individual satisfaction.

12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

[Your Name]. (2026). Driver outsourcing services survey dataset [Dataset]. Collected via Google Forms from Lagos-based professionals, March–May 2026. Data available on request from the author.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

13 Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with (1) generating the initial Quarto document skeleton and section scaffolding, (2) suggesting appropriate R package choices for each analytical technique, and (3) reviewing code syntax for tidyverse and related functions. All analytical decisions — the choice of Case Study 1, the selection of overall satisfaction as the dependent variable, the decision to use Spearman rather than Pearson correlation given the ordinal nature of Likert data, the interpretation of all statistical outputs, and the business recommendations — were made independently by the author. No AI tool generated the professional disclosure, the data collection narrative, or the substantive interpretation of results. The author takes full responsibility for all analytical judgements in this document.