Driver Outsourcing Services in Lagos: An Exploratory & Inferential Analytics Study

Author

[Your Full Name]

Published

May 9, 2026


1 Executive Summary

This study investigates the market for driver outsourcing services in Lagos, Nigeria, drawing on primary survey data collected from 90 respondents across varying demographic and professional profiles. Lagos presents a uniquely challenging urban mobility environment — characterised by severe traffic congestion, safety concerns, and a growing professional class that increasingly outsources driving responsibilities. Despite rapid market growth, there is limited empirical understanding of the factors that drive consumer adoption, satisfaction, and willingness to pay.

Using a structured questionnaire administered between March and May 2026, this analysis applies five complementary techniques: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Regression Modelling. Respondents rate safety (mean importance 4.53 out of 5), reliability (4.37), convenience (4.31), and driver professionalism (4.29) as the most important provider selection factors, while high cost and lack of trust are the most cited barriers to more frequent use. The regression model, fitted to the 24 responses with a recorded outcome value, does not identify a statistically significant predictor of the outcome (overall F-test p = 0.573), so its coefficients are treated as exploratory. The central recommendation is that providers should prioritise driver vetting, transparent safety communication, and tiered pricing to unlock the latent demand evident in the data.


2 Professional Disclosure

Job Title / Role: [Your Job Title]

Organisation Type / Sector: [e.g., Financial Services / Logistics / FMCG — Lagos, Nigeria]

Why These Five Techniques Are Relevant to My Work:

1. Exploratory Data Analysis (EDA): In my day-to-day role I routinely receive raw operational or customer data that must be assessed for completeness and distributional properties before any business decision is taken. EDA is the first gate I apply to any new dataset — identifying quality issues, understanding distributions, and surfacing initial patterns.

2. Data Visualisation: Communicating analytical insights to non-technical stakeholders is a core professional responsibility. Effective visualisation enables me to convey complex patterns concisely. In this project, visual storytelling communicates how Lagos consumers perceive and experience driver outsourcing.

3. Hypothesis Testing: My organisation regularly makes decisions that depend on whether observed differences in performance or satisfaction are statistically real or due to random variation. Hypothesis testing provides the formal framework to support or challenge business assumptions with data.

4. Correlation Analysis: Understanding which variables move together — and by how much — is essential for identifying levers management can act on. Mapping relationships among service quality dimensions and satisfaction informs investment priorities.

5. Linear Regression: Regression models quantify the relative contribution of different factors to an outcome of interest. Coefficient estimates translate directly into prioritised, actionable service improvement recommendations for non-technical management.


3 Data Collection & Sampling

3.1 Source and Collection Method

The primary dataset was collected via a structured online questionnaire administered using Google Forms between March 20 and May 4, 2026. The survey captured socio-demographic characteristics, usage behaviour, service preferences (Likert-style importance ratings), perceived service quality (10 Likert statements), safety experiences, provider readiness assessments, and an overall satisfaction score (1–5 scale).

The survey link was distributed via professional networks and WhatsApp groups among Lagos-based professionals. Participation was voluntary and anonymous.

3.2 Sampling Frame

  • Target population: Adults in Lagos who have used or considered using a driver outsourcing service.
  • Sampling approach: Purposive / snowball sampling targeting individual users, business owners, corporate representatives, and event/logistics coordinators.
  • Sample size: 90 usable responses.
  • Time period: March 20 – May 4, 2026 (approximately 6 weeks).
  • Ethical notes: Informed consent was obtained via the first survey question. No vulnerable populations were targeted. Data are published in anonymised, aggregate form. Email addresses provided by a small number of respondents were removed prior to analysis.

3.3 Technique Justification

Technique | Justification
EDA | The survey generates wide-ranging variable types (ordinal Likert, nominal categorical, free-text numeric). EDA identifies distributions, missing responses, and outliers before formal methods are applied.
Data Visualisation | With 20+ variables across demographic and attitudinal dimensions, visualisation is the only practical way to surface patterns for business decision-makers.
Hypothesis Testing | Key questions (Does satisfaction differ by user type? Do safety concerns vary by respondent segment?) require formal statistical tests to move beyond descriptive anecdote.
Correlation Analysis | Mapping relationships among service-quality ratings and satisfaction identifies which dimensions have the strongest co-movement with the outcome, guiding investment priorities.
Regression | OLS regression on overall satisfaction yields coefficient estimates that translate directly into specific service improvement recommendations.

4 Data Description

4.1 Initial Inspection

Code
cat("Dataset dimensions:", nrow(df), "rows x", ncol(df), "columns\n")
Dataset dimensions: 90 rows x 37 columns
Code
df |>
  select(respondent_type, age_group, gender, monthly_income,
         use_frequency, primary_reason, monthly_spend,
         overall_satisfaction) |>
  head(8) |>
  kable(caption = "First 8 rows — key variables") |>
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"),
                full_width = TRUE, font_size = 11) |>
  scroll_box(height = "300px")
First 8 rows — key variables
respondent_type | age_group | gender | monthly_income | use_frequency | primary_reason | monthly_spend | overall_satisfaction
Business Owner/ executive | 35-44 | Male | above #700,000 | rarely (1-2 times per year) | Events (weddings, Parties) | #50,001 - #100,000 | 2000
Business Owner/ executive | 45-54 | Male | above #700,000 | rarely (1-2 times per year) | Personal Transportation | #50,001 - #100,000 | 25000
event planner/logistics coordinator | 45-54 | Male | #300,000 -#700,000 | rarely (1-2 times per year) | Field Work | less than #20000 | NA
Corporate Organization representative | 25-34 | Male | #300,000 -#700,000 | Occasionally (once a month) | Corporate transportaion | #50,001 - #100,000 | 10000
Individual User | above 54 | Male | above #700,000 | Occasionally (once a month) | Logistics/delivery | #100,001 - #200,000 | NA
Individual User | 35-44 | Female | above #700,000 | rarely (1-2 times per year) | Personal Transportation | #20,001 - #50,000 | NA
Individual User | 35-44 | Male | #300,000 -#700,000 | rarely (1-2 times per year) | Events (weddings, Parties) | less than #20000 | 20000
Corporate Organization representative | 45-54 | Male | #300,000 -#700,000 | Frequently (weekly) | Corporate transportaion | #100,001 - #200,000 | NA

4.2 Variable Summary

Code
df |>
  select(respondent_type, age_group, gender, monthly_income,
         use_frequency, overall_satisfaction,
         starts_with("imp_"), starts_with("rate_")) |>
  skim() |>
  as_tibble() |>
  select(skim_type, skim_variable, n_missing, complete_rate,
         numeric.mean, numeric.sd, numeric.p50) |>
  mutate(across(where(is.numeric), ~ round(., 2))) |>
  kable(caption = "Variable Skim Summary") |>
  kable_styling(bootstrap_options = c("striped", "condensed"),
                full_width = TRUE, font_size = 11) |>
  scroll_box(height = "400px")
Variable Skim Summary
skim_type skim_variable n_missing complete_rate numeric.mean numeric.sd numeric.p50
character respondent_type 0 1.00 NA NA NA
character age_group 0 1.00 NA NA NA
character gender 7 0.92 NA NA NA
character monthly_income 0 1.00 NA NA NA
character use_frequency 0 1.00 NA NA NA
numeric overall_satisfaction 66 0.27 57791.67 104280.13 20000
numeric imp_safety 0 1.00 4.53 1.03 5
numeric imp_cost 0 1.00 4.00 1.08 4
numeric imp_convenience 0 1.00 4.31 1.07 5
numeric imp_professionalism 0 1.00 4.29 1.04 5
numeric imp_reliability 0 1.00 4.37 1.10 5
numeric imp_booking_ease 0 1.00 4.04 1.10 4
numeric imp_flexibility 0 1.00 4.01 1.03 4
numeric rate_punctual 0 1.00 3.61 0.98 4
numeric rate_professional 0 1.00 3.63 0.88 4
numeric rate_routes 0 1.00 3.80 0.91 4
numeric rate_booking_easy 0 1.00 3.78 0.86 4
numeric rate_consistent 0 1.00 3.52 0.91 4
numeric rate_poorly_vetted 0 1.00 3.00 1.12 3
numeric rate_safe 0 1.00 3.40 0.90 4
numeric rate_trust 0 1.00 2.76 0.99 3
numeric rate_complaints 0 1.00 3.13 0.84 3
numeric rate_safety_serious 0 1.00 3.31 1.02 3

5 Analysis Section 1 — Exploratory Data Analysis

Technique: Exploratory Data Analysis | Book Reference: Ch. 4 — Summary statistics, missing-value analysis, outlier detection

Business Justification: Before drawing conclusions about what drives customer satisfaction, we must understand the structure of the data — how complete it is, whether outliers could distort results, and whether key distributions are skewed. This mirrors the first step in any real business intelligence workflow.

5.1 Missing Values Analysis

Code
miss_df <- df |>
  select(respondent_type, age_group, gender, monthly_income,
         use_frequency, primary_reason, service_duration,
         monthly_spend, booking_channel, service_type,
         starts_with("imp_"), starts_with("rate_"),
         safety_concern, overall_satisfaction) |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  mutate(pct_missing = round(100 * n_missing / nrow(df), 1)) |>
  filter(n_missing > 0) |>
  arrange(desc(n_missing))

miss_df |>
  kable(caption = "Variables with Missing Values") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Variables with Missing Values
variable n_missing pct_missing
overall_satisfaction 66 73.3
gender 7 7.8
Code
if (nrow(miss_df) > 0) {
  ggplot(miss_df, aes(x = reorder(variable, pct_missing), y = pct_missing)) +
    geom_col(fill = "#2c7fb8") +
    coord_flip() +
    scale_y_continuous(labels = label_percent(scale = 1)) +
    labs(title = "Missing Data by Variable",
         subtitle = "Percentage of observations with no response",
         x = NULL, y = "% Missing")
} else {
  cat("No missing values detected in selected variables.\n")
}

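Data Quality Finding 1: Only two of the selected variables have missing values: overall_satisfaction (66 of 90 responses, 73.3%) and gender (7 responses, 7.8%). These observations are retained for categorical analyses but excluded listwise from numeric modelling, which leaves 24 responses for any analysis that uses overall_satisfaction.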

5.2 Distribution of the Outcome Variable

Code
sat_clean <- df |> filter(!is.na(overall_satisfaction))

p1 <- sat_clean |>
  mutate(overall_satisfaction = as.factor(overall_satisfaction)) |>
  ggplot(aes(x = overall_satisfaction, fill = overall_satisfaction)) +
  geom_bar(width = 0.7, show.legend = FALSE) +
  scale_fill_brewer(palette = "Blues") +
  labs(title = "Overall Satisfaction Distribution",
       subtitle = "1 = Very Dissatisfied, 5 = Very Satisfied",
       x = "Satisfaction Score", y = "Count")

p2 <- sat_clean |>
  ggplot(aes(x = overall_satisfaction)) +
  geom_boxplot(fill = "#41b6c4", colour = "grey30") +
  labs(title = "Box Plot", x = "Score", y = NULL)

p1 / p2 + plot_layout(heights = c(3, 1))

Code
cat(
  "Mean:", round(mean(sat_clean$overall_satisfaction), 2),
  "| Median:", median(sat_clean$overall_satisfaction),
  "| Skewness:", round(skewness(sat_clean$overall_satisfaction), 2),
  "\n"
)
Mean: 57791.67 | Median: 20000 | Skewness: 3.45 

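Data Quality Finding 2: The 24 non-missing values of overall_satisfaction have a mean of 57,791.67, a median of 20,000, and a skewness of 3.45, so the distribution is strongly right-skewed: a few very large values pull the mean well above the median. These values lie far outside the 1–5 satisfaction scale described in Section 3.1 and resemble naira amounts, which suggests the column may hold the open-text willingness-to-pay responses rather than the satisfaction score. The column mapping should be verified before these results are read as satisfaction.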
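
The document does not show which package supplies `skewness()`. As a reference point, the moment-based definition can be sketched in base R as below. The function name `sample_skewness` and the two toy vectors are illustrative assumptions, and other packages or estimator types may apply small-sample corrections that give slightly different values.

```r
# Moment-based sample skewness: g1 = m3 / m2^(3/2),
# where m2 and m3 are the second and third central moments.
# sample_skewness() is an illustrative helper, not the function used in the report.
sample_skewness <- function(x) {
  x  <- x[!is.na(x)]
  d  <- x - mean(x)
  m2 <- mean(d^2)
  m3 <- mean(d^3)
  m3 / m2^(3 / 2)
}

symmetric  <- c(1, 2, 3, 4, 5)   # toy data, evenly spread around the mean
right_tail <- c(1, 1, 1, 2, 10)  # toy data with one large value

skew_symmetric  <- sample_skewness(symmetric)
skew_right_tail <- sample_skewness(right_tail)
print(c(symmetric = skew_symmetric, right_tail = skew_right_tail))
```

A positive value means the long tail lies to the right of the mean, which is the pattern the reported skewness of 3.45 describes.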

5.3 Outlier Detection — Importance Ratings

Code
df |>
  select(starts_with("imp_")) |>
  pivot_longer(everything(), names_to = "factor", values_to = "score") |>
  mutate(factor = str_remove(factor, "imp_") |>
           str_replace_all("_", " ") |> str_to_title()) |>
  filter(!is.na(score)) |>
  ggplot(aes(x = reorder(factor, score, median), y = score, fill = factor)) +
  geom_boxplot(show.legend = FALSE, width = 0.6) +
  coord_flip() +
  scale_fill_brewer(palette = "Set2") +
  scale_y_continuous(breaks = 1:5,
    labels = c("Not\nImportant", "Somewhat", "Neutral", "Important", "Very\nImportant")) +
  labs(title = "Importance of Provider Selection Factors",
       subtitle = "Distribution across all respondents (1-5 scale)",
       x = NULL, y = "Importance Rating")

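All values fall within the valid 1–5 scale, so no rating is treated as a data-entry outlier. The variable summary shows a similar spread across the seven factors (standard deviations from 1.03 to 1.10), with “Cost” having the lowest mean importance (4.00). Any wider dispersion for “Cost” in the box plot may reflect heterogeneity in price sensitivity across user segments, which would be a useful segmentation finding if a segment-level comparison confirms it.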
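
A box plot flags points beyond 1.5 times the interquartile range, which is a different question from whether a rating is valid. The sketch below, using a made-up vector of six ratings and a hypothetical helper `iqr_outliers`, shows how a legitimate low rating can be flagged when most responses sit at the top of the scale.

```r
# Indices of values beyond k * IQR from the quartiles (the usual box-plot rule).
# iqr_outliers() is a hypothetical helper; ratings is made-up illustrative data.
iqr_outliers <- function(x, k = 1.5) {
  q   <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  which(x < q[1] - k * iqr | x > q[2] + k * iqr)
}

ratings <- c(5, 5, 5, 4, 5, 1)

flagged  <- iqr_outliers(ratings)            # position of flagged ratings
in_scale <- all(ratings >= 1 & ratings <= 5) # every rating is a valid response

print(ratings[flagged])
print(in_scale)
```

For Likert items it is therefore reasonable to check scale validity, as done above, and to keep flagged low ratings rather than remove them.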


6 Analysis Section 2 — Data Visualisation

Technique: Data Visualisation | Book Reference: Ch. 5 — Grammar of graphics, chart selection, storytelling with data

Business Justification: The five plots below form a cohesive narrative — who uses driver outsourcing, how frequently, why, how much they spend, and what holds them back — communicated in a format suitable for senior non-technical leadership.

Code
df |>
  filter(!is.na(respondent_type)) |>
  count(respondent_type) |>
  mutate(pct = n / sum(n),
         respondent_type = str_wrap(respondent_type, 25)) |>
  ggplot(aes(x = reorder(respondent_type, pct), y = pct, fill = respondent_type)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  coord_flip() +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Plot 1 — Who Responded?",
       subtitle = "Respondent type composition",
       x = NULL, y = "% of Respondents")

Code
df |>
  filter(!is.na(use_freq_f)) |>
  count(use_freq_f) |>
  ggplot(aes(x = use_freq_f, y = n, fill = use_freq_f)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  scale_x_discrete(labels = function(x) str_wrap(x, 12)) +
  scale_fill_manual(values = c("#d9f0a3", "#addd8e", "#31a354", "#006837")) +
  labs(title = "Plot 2 — How Often Do Users Hire Outsourced Drivers?",
       subtitle = "Usage frequency distribution",
       x = NULL, y = "Count")

Code
df |>
  filter(!is.na(primary_reason)) |>
  count(primary_reason) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = reorder(primary_reason, pct), y = pct)) +
  geom_col(fill = "#2b8cbe", width = 0.7) +
  coord_flip() +
  scale_y_continuous(labels = label_percent()) +
  labs(title = "Plot 3 — Why Do Users Hire Outsourced Drivers?",
       subtitle = "Primary stated reason",
       x = NULL, y = "% of Respondents")

Code
df |>
  filter(!is.na(spend_ord), !is.na(income_ord)) |>
  count(income_ord, spend_ord) |>
  group_by(income_ord) |>
  mutate(pct = n / sum(n)) |>
  ggplot(aes(x = income_ord, y = pct, fill = spend_ord)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_brewer(palette = "YlOrRd", name = "Monthly Spend") +
  scale_x_discrete(labels = function(x) str_wrap(x, 10)) +
  labs(title = "Plot 4 — Monthly Spend by Income Bracket",
       subtitle = "Higher earners spend proportionally more on driver services",
       x = "Monthly Income", y = "% of Respondents") +
  theme(legend.position = "right")

Code
df |>
  filter(!is.na(barriers)) |>
  separate_rows(barriers, sep = ";") |>
  mutate(barrier = str_trim(barriers)) |>
  filter(barrier != "") |>
  count(barrier, sort = TRUE) |>
  head(8) |>
  ggplot(aes(x = reorder(barrier, n), y = n)) +
  geom_col(fill = "#e34a33", width = 0.7) +
  coord_flip() +
  labs(title = "Plot 5 — What Prevents More Frequent Use?",
       subtitle = "Top 8 stated barriers",
       x = NULL, y = "Number of Mentions")

Visualisation Narrative: The five plots tell a coherent story. Individual users make up the largest respondent group (54 of 90), followed by corporate representatives (17) and business owners or executives (13), and usage is mostly infrequent, primarily for personal transportation and corporate travel. Higher-income respondents skew toward larger monthly spends. Yet the dominant barrier to more frequent use is high cost, followed by lack of trust, which suggests that even customers who are willing to pay are held back by trust deficits that driver certification and transparent pricing could address.


7 Analysis Section 3 — Hypothesis Testing

Technique: Hypothesis Testing | Book Reference: Ch. 6 — t-test, chi-squared, ANOVA, non-parametric alternatives, effect sizes

Business Justification: Lagos’s driver outsourcing providers serve a heterogeneous market. A key operational question is whether user segments genuinely differ in satisfaction and safety experience, or whether apparent differences are merely sampling noise. Hypothesis testing gives formal, defensible answers.

7.1 Hypothesis 1 — Does Satisfaction Differ by Usage Frequency?

H₀: Mean overall satisfaction is equal across all usage frequency groups

H₁: At least one group has a different mean satisfaction score

Test: One-way ANOVA; Kruskal-Wallis as non-parametric backup

Code
h1_df <- df |>
  filter(!is.na(overall_satisfaction), !is.na(use_freq_f))

h1_df |>
  group_by(use_freq_f) |>
  summarise(
    n        = n(),
    mean_sat = round(mean(overall_satisfaction), 2),
    sd_sat   = round(sd(overall_satisfaction), 2),
    shapiro_p = ifelse(n() >= 3,
                  round(shapiro.test(overall_satisfaction)$p.value, 3), NA)
  ) |>
  kable(caption = "Descriptive Statistics & Normality Check by Usage Frequency") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Descriptive Statistics & Normality Check by Usage Frequency
use_freq_f | n | mean_sat | sd_sat | shapiro_p
rarely (1-2 times per year) | 19 | 41421.05 | 48200.18 | 0.000
Occasionally (once a month) | 4 | 25000.00 | 33416.56 | 0.009
Very frequently (multiple times weekly) | 1 | 500000.00 | NA | NA
Code
levene_res <- leveneTest(overall_satisfaction ~ use_freq_f, data = h1_df)
cat("Levene's Test p-value:", round(levene_res$`Pr(>F)`[1], 3), "\n")
Levene's Test p-value: 0.759 
Code
aov_res <- aov(overall_satisfaction ~ use_freq_f, data = h1_df)
summary(aov_res)
            Df    Sum Sq   Mean Sq F value   Pr(>F)    
use_freq_f   2 2.049e+11 1.025e+11   47.64 1.57e-08 ***
Residuals   21 4.517e+10 2.151e+09                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
print(eta_squared(aov_res))
use_freq_f 
 0.8194049 
Code
kw_res <- kruskal.test(overall_satisfaction ~ use_freq_f, data = h1_df)
cat("Kruskal-Wallis p-value:", round(kw_res$p.value, 3), "\n")
Kruskal-Wallis p-value: 0.154 
Code
h1_df |>
  ggplot(aes(x = use_freq_f, y = overall_satisfaction, fill = use_freq_f)) +
  geom_boxplot(show.legend = FALSE, width = 0.5) +
  stat_summary(fun = mean, geom = "point", shape = 21, size = 3,
               fill = "white", colour = "black") +
  scale_x_discrete(labels = function(x) str_wrap(x, 12)) +
  scale_fill_brewer(palette = "Blues") +
  labs(title = "Overall Satisfaction by Usage Frequency",
       subtitle = "White dot = group mean",
       x = "Usage Frequency", y = "Satisfaction (1-5)")

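Interpretation: The one-way ANOVA returns F(2, 21) = 47.64, p < 0.001, with η² = 0.82, so H₀ is rejected at face value. The result is not robust, however. It rests on 24 non-missing responses in three groups, one of which contains a single observation (500,000) that accounts for most of the between-group variance. Shapiro-Wilk rejects normality in both groups large enough to test (p < 0.001 and p = 0.009), and the Kruskal-Wallis test, which does not assume normality, gives p = 0.154. Levene's test (p = 0.759) does not indicate unequal variances. Because the recorded values also fall outside the 1–5 satisfaction scale, the outcome column should be verified before any business implication is drawn for a Lagos operator.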
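
The effect size and F statistic can be traced back to the sums of squares printed in the ANOVA table. The sketch below only re-does that arithmetic from the rounded values shown in the output, so small differences in the last digit are expected.

```r
# Sums of squares as printed in the ANOVA table (rounded to 4 significant figures)
ss_between  <- 2.049e11  # use_freq_f row
ss_residual <- 4.517e10  # Residuals row
df_between  <- 2
df_residual <- 21

# Eta squared: share of total variation attributed to the grouping factor
eta_sq <- ss_between / (ss_between + ss_residual)

# F statistic: ratio of the two mean squares
f_stat <- (ss_between / df_between) / (ss_residual / df_residual)

print(round(c(eta_sq = eta_sq, f_stat = f_stat), 3))
```

Both quantities are driven by the between-group sum of squares, which is why one extreme single-member group can produce a very large η² here.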

7.2 Hypothesis 2 — Is Safety Concern Independent of Respondent Type?

H₀: Safety concern experience is independent of respondent type

H₁: Safety concern experience is associated with respondent type

Test: Chi-squared test of independence

Code
h2_df <- df |>
  filter(!is.na(safety_concern_bin), !is.na(respondent_type))

ct <- table(h2_df$respondent_type, h2_df$safety_concern_bin)
colnames(ct) <- c("No Safety Concern", "Safety Concern")

ct |>
  kable(caption = "Contingency Table: Safety Concern x Respondent Type") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Contingency Table: Safety Concern x Respondent Type
Respondent type | No Safety Concern | Safety Concern
Business Owner/ executive | 8 | 5
Corporate Organization representative | 8 | 9
event planner/logistics coordinator | 2 | 4
Individual User | 29 | 25
Code
chi_res <- chisq.test(ct)
print(chi_res)

    Pearson's Chi-squared test

data:  ct
X-squared = 1.5394, df = 3, p-value = 0.6732
Code
print(cramers_v(ct))
Cramer's V (adj.) |       95% CI
--------------------------------
0                 | [0.00, 1.00]

- One-sided CIs: upper bound fixed at [1.00].
Code
h2_df |>
  group_by(respondent_type) |>
  summarise(pct_concern = mean(safety_concern_bin, na.rm = TRUE)) |>
  ggplot(aes(x = reorder(respondent_type, pct_concern),
             y = pct_concern, fill = pct_concern)) +
  geom_col(show.legend = FALSE, width = 0.7) +
  scale_y_continuous(labels = label_percent()) +
  scale_fill_gradient(low = "#ffffcc", high = "#d73027") +
  coord_flip() +
  labs(title = "% Who Experienced a Safety Concern by Respondent Type",
       x = NULL, y = "% Reporting Safety Concern")

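Interpretation: The chi-squared test gives χ²(3, N = 90) = 1.54, p = 0.673, with an adjusted Cramér’s V of 0.00, so H₀ is not rejected: there is no evidence that safety concern experience depends on respondent type. The observed rates are highest for event planners/logistics coordinators (4 of 6, 66.7%), followed by corporate representatives (9 of 17, 52.9%), individual users (25 of 54, 46.3%), and business owners/executives (5 of 13, 38.5%), but these differences are within sampling variation and the event planner group contains only six respondents.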
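
The test can be reproduced from the contingency table alone. The sketch below rebuilds the table from the printed counts, reruns the test in base R, and computes Cramér's V by hand. The bias-corrected formula follows Bergsma (2013); it is assumed, not confirmed by the source, that this is the adjustment behind the "adj." value reported above.

```r
# Contingency table rebuilt from the counts printed in Section 7.2
ct <- matrix(
  c( 8,  5,
     8,  9,
     2,  4,
    29, 25),
  ncol = 2, byrow = TRUE,
  dimnames = list(
    c("Business Owner/ executive", "Corporate Organization representative",
      "event planner/logistics coordinator", "Individual User"),
    c("No Safety Concern", "Safety Concern")
  )
)

# R warns when expected counts are small; the warning is silenced here
# and the smallest expected count is reported explicitly instead.
chi          <- suppressWarnings(chisq.test(ct))
chi_stat     <- unname(chi$statistic)
chi_p        <- chi$p.value
min_expected <- min(chi$expected)

n <- sum(ct); r <- nrow(ct); k <- ncol(ct)
phi2  <- chi_stat / n
v_raw <- sqrt(phi2 / min(r - 1, k - 1))  # unadjusted Cramer's V

# Bias-corrected version (Bergsma, 2013): truncated at zero
phi2_adj <- max(0, phi2 - (r - 1) * (k - 1) / (n - 1))
r_adj    <- r - (r - 1)^2 / (n - 1)
k_adj    <- k - (k - 1)^2 / (n - 1)
v_adj    <- sqrt(phi2_adj / min(r_adj - 1, k_adj - 1))

# Share of each respondent type reporting a safety concern
concern_rate <- ct[, "Safety Concern"] / rowSums(ct)

print(round(c(chi_sq = chi_stat, p = chi_p, v_raw = v_raw, v_adj = v_adj), 4))
print(round(concern_rate, 3))
```

The smallest expected count is below 5 (the event planner row), so the chi-squared approximation is rough; Fisher's exact test via `fisher.test(ct)` would be a reasonable cross-check.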


8 Analysis Section 4 — Correlation Analysis

Technique: Correlation Analysis | Book Reference: Ch. 8 — Pearson, Spearman, Kendall; correlation vs causation

Business Justification: Understanding which service-quality dimensions co-move with overall satisfaction helps management prioritise improvement investments. Spearman rank correlation is used given the ordinal nature of Likert-scale data.

Code
corr_vars <- df |>
  select(starts_with("imp_"), starts_with("rate_"), overall_satisfaction)

names(corr_vars) <- names(corr_vars) |>
  str_remove("imp_|rate_") |>
  str_replace_all("_", " ") |>
  str_to_title()

corr_mat <- cor(corr_vars, use = "pairwise.complete.obs", method = "spearman")

ggcorrplot(corr_mat,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 2.2,
           colors   = c("#d73027", "white", "#1a9641"),
           title    = "Spearman Correlation Matrix",
           ggtheme  = theme_minimal(base_size = 9)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7),
        axis.text.y = element_text(size = 7))

Code
corr_sat <- corr_mat[, "Overall Satisfaction"] |>
  sort(decreasing = TRUE) |>
  as_tibble(rownames = "variable") |>
  filter(variable != "Overall Satisfaction") |>
  rename(spearman_r = value) |>
  mutate(spearman_r = round(spearman_r, 3))

corr_sat |>
  kable(caption = "Spearman Correlation with Overall Satisfaction (ranked)") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)
Spearman Correlation with Overall Satisfaction (ranked)
variable spearman_r
Safe 0.217
Trust 0.162
Safety 0.153
Routes 0.143
Professionalism 0.106
Cost 0.104
Reliability 0.066
Professional 0.062
Flexibility 0.062
Punctual 0.048
Complaints 0.037
Poorly Vetted 0.018
Safety Serious 0.015
Booking Ease 0.010
Convenience -0.099
Consistent -0.212
Booking Easy -0.431

Discussion of Key Correlations:

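  1. Safe (r = 0.217): The strongest positive coefficient, but weak in absolute terms. With 24 paired observations it falls well short of the roughly 0.40 needed for significance at the 5% level, so it indicates only a tentative tendency for respondents who feel safer to report higher outcome values.

  2. Trust (r = 0.162): Also weak and positive. It points in the same direction as Safe but cannot be distinguished from zero at this sample size.

  3. Safety (r = 0.153): The importance placed on safety shows a similarly weak positive association. The largest coefficient in absolute terms is in fact negative (Booking Easy, r = -0.431), which runs against expectation and should be checked together with the coding of the outcome variable.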

Causation caveat: These correlations are observational. Even a strong correlation between perceived safety and satisfaction would not show that improving safety causes higher satisfaction; establishing that requires a controlled intervention. The coefficients here are weak (the largest positive value is 0.217), so they are best read as pointers for further investigation rather than as evidence for a specific investment.
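
Because only 24 respondents have a non-missing outcome, each coefficient in the ranked table rests on 24 pairs. A rough significance check can be sketched with the t approximation below. The helper `r_to_p` is hypothetical, and the approximation is only indicative for Spearman coefficients at this sample size, especially with tied Likert ratings; `cor.test(..., method = "spearman")` on the raw data would be the direct route.

```r
# Approximate two-sided p-value for a correlation r based on n pairs,
# using t = r * sqrt((n - 2) / (1 - r^2)) with n - 2 degrees of freedom.
# r_to_p() is a hypothetical helper for illustration.
r_to_p <- function(r, n) {
  t_stat <- r * sqrt((n - 2) / (1 - r^2))
  2 * pt(-abs(t_stat), df = n - 2)
}

n_pairs <- 24  # non-missing overall_satisfaction values

p_safe         <- r_to_p(0.217, n_pairs)   # strongest positive coefficient
p_booking_easy <- r_to_p(-0.431, n_pairs)  # largest coefficient in absolute terms

print(round(c(safe = p_safe, booking_easy = p_booking_easy), 3))
```

Under this approximation only the negative Booking Easy coefficient would fall below the 5% threshold, and with 17 coefficients examined at once even that result should be treated cautiously.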


9 Analysis Section 5 — Linear Regression

Technique: OLS Linear Regression | Book Reference: Ch. 9 — Coefficients, diagnostics, interpretation

Business Justification: Regression quantifies the independent contribution of each predictor to overall satisfaction, holding other variables constant. This converts correlation findings into specific, prioritised recommendations suitable for a board-level decision.

Code
reg_df <- df |>
  select(overall_satisfaction,
         imp_safety, imp_reliability, imp_professionalism,
         imp_cost, imp_convenience, imp_booking_ease, imp_flexibility,
         rate_punctual, rate_professional, rate_routes,
         rate_consistent, rate_safe, rate_trust, rate_complaints) |>
  drop_na()

mod <- lm(overall_satisfaction ~
            imp_safety + imp_reliability + imp_professionalism +
            imp_cost + imp_convenience + imp_booking_ease + imp_flexibility +
            rate_punctual + rate_professional + rate_routes +
            rate_consistent + rate_safe + rate_trust + rate_complaints,
          data = reg_df)

tidy(mod, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~ round(., 3))) |>
  arrange(p.value) |>
  kable(caption = "OLS Regression — Predictors of Overall Satisfaction") |>
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = TRUE)
OLS Regression — Predictors of Overall Satisfaction
term estimate std.error statistic p.value conf.low conf.high
rate_trust -78660.439 47911.60 -1.642 0.135 -187044.01 29723.13
rate_routes 72618.917 49122.75 1.478 0.173 -38504.46 183742.29
imp_convenience -106634.543 88053.46 -1.211 0.257 -305825.31 92556.22
imp_cost 69689.847 58469.75 1.192 0.264 -62577.91 201957.60
rate_safe 51448.320 50823.92 1.012 0.338 -63523.38 166420.02
imp_professionalism 56550.224 71368.92 0.792 0.449 -104897.50 217997.95
imp_safety 80066.000 103331.53 0.775 0.458 -153686.17 313818.17
imp_reliability -72239.072 103149.57 -0.700 0.501 -305579.62 161101.48
rate_punctual -26798.400 44104.25 -0.608 0.558 -126569.15 72972.35
imp_booking_ease -59121.691 100847.04 -0.586 0.572 -287253.55 169010.17
imp_flexibility 36728.148 73125.47 0.502 0.628 -128693.16 202149.46
rate_professional -33172.495 78413.12 -0.423 0.682 -210555.31 144210.32
(Intercept) -30798.676 180560.75 -0.171 0.868 -439255.47 377658.12
rate_complaints 14867.891 88911.58 0.167 0.871 -186264.08 215999.87
rate_consistent 9217.281 65385.26 0.141 0.891 -138694.46 157129.02
Code
glance(mod) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, df, nobs) |>
  mutate(across(where(is.numeric), ~ round(., 3))) |>
  kable(caption = "Model Fit Statistics") |>
  kable_styling(bootstrap_options = "striped", full_width = FALSE)
Model Fit Statistics
r.squared adj.r.squared sigma statistic p.value df nobs
0.588 -0.053 106983.5 0.918 0.573 14 24
Code
par(mfrow = c(2, 2))
plot(mod, which = 1:4)

Code
par(mfrow = c(1, 1))
Code
tidy(mod, conf.int = TRUE) |>
  filter(term != "(Intercept)") |>
  mutate(
    term = str_remove(term, "imp_|rate_") |>
           str_replace_all("_", " ") |> str_to_title(),
    significant = p.value < 0.05
  ) |>
  ggplot(aes(x = reorder(term, estimate), y = estimate,
             ymin = conf.low, ymax = conf.high, colour = significant)) +
  geom_pointrange(size = 0.8) +
  geom_hline(yintercept = 0, linetype = "dashed", colour = "grey50") +
  coord_flip() +
  scale_colour_manual(
    values = c("TRUE" = "#e34a33", "FALSE" = "grey60"),
    labels = c("TRUE" = "p < .05", "FALSE" = "p >= .05"),
    name = NULL
  ) +
  labs(title = "OLS Regression Coefficients",
       subtitle = "Point estimates with 95% confidence intervals",
       x = NULL, y = "Coefficient Estimate")

Coefficient Interpretation for a Non-Technical Manager:

“Our model's R² of 0.588 overstates its explanatory power. With 14 predictors and only 24 complete responses, the adjusted R² is -0.053 and the overall F-test is not significant (F = 0.918, p = 0.573), so the model does no better than a simple average. No individual coefficient is significant at the 5% level: the smallest p-value is 0.135 (rate_trust), and every 95% confidence interval includes zero. We therefore cannot name a single most important predictor from this model. The coefficient magnitudes, in the tens of thousands, also show that the outcome is not measured on a five-point scale, so the outcome variable needs to be checked and more complete responses collected before the regression can guide service priorities.”
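
The gap between R² and adjusted R² follows directly from the sample size and the number of predictors in the fit statistics table. The sketch below re-does that arithmetic from the reported figures only; it does not refit the model.

```r
# Values taken from the Model Fit Statistics table
r2 <- 0.588  # r.squared
n  <- 24     # nobs
p  <- 14     # df: number of predictors, excluding the intercept

# Residual degrees of freedom left after estimating 14 slopes and an intercept
resid_df <- n - p - 1

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / resid_df

print(c(resid_df = resid_df, adj_r2 = round(adj_r2, 3)))
```

With only 9 residual degrees of freedom the penalty outweighs the raw fit, which is why the adjusted value is negative.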


10 Integrated Findings

The five analyses collectively tell one coherent story about the Lagos driver outsourcing market:

  1. EDA revealed a heterogeneous respondent pool skewed toward infrequent users, and an outcome variable that is 73.3% missing, strongly right-skewed (skewness 3.45), and recorded on a scale far outside 1–5, which limits every analysis that depends on it.

  2. Visualisation showed that the dominant use cases are personal transport and corporate travel, that higher-income users are proportionally bigger spenders, and that cost and trust are the twin structural barriers limiting market growth.

  3. Hypothesis testing gave mixed results for H1: the ANOVA was significant (F(2, 21) = 47.64, p < 0.001, η² = 0.82), but the Kruskal-Wallis test was not (p = 0.154) and one group contained a single respondent. For H2, safety concern rates did not differ significantly by respondent type (χ²(3) = 1.54, p = 0.673, adjusted Cramér’s V = 0.00), so on this evidence safety measures do not need to be tailored by segment.

  4. Correlation analysis identified Safe (0.217) and Trust (0.162) as the strongest positive correlates of the outcome, although both are weak and the largest coefficient in absolute terms is negative (Booking Easy, -0.431).

  5. Regression modelling did not isolate any statistically significant independent driver of the outcome (adjusted R² = -0.053, model p = 0.573), so it cannot yet support a prioritised action list.

Single Integrated Recommendation: Lagos driver outsourcing providers should invest first in driver vetting and professionalism programmes, which address the safety and reliability factors respondents rated as most important and the trust barrier they cited. They should then introduce transparent, tiered pricing to address the cost-and-trust barrier, and implement systematic post-trip feedback mechanisms to generate the operational data loop needed to monitor service consistency at scale.


11 Limitations & Further Work

Data limitations:

  • Sampling bias: Snowball and purposive sampling over-represents connected, digitally literate Lagos professionals. Findings may not generalise to lower-income users or those outside formal employment.
  • Self-report bias: Likert-scale responses reflect perceptions, not objective service quality measures. Drivers’ perspectives were not captured.
  • Cross-sectional design: The survey captures a single point in time; longitudinal data would be needed to assess whether satisfaction trends improve after service interventions.
  • Open-text willingness-to-pay field: Responses to Q18 were inconsistently formatted (daily vs monthly, varying currency notation), reducing the utility of that variable for quantitative analysis.

With more data, time, or computing power:

  • A discrete choice / conjoint experiment would more precisely estimate willingness to pay for each service dimension.
  • A longitudinal panel tracking the same customers over multiple trips would enable causal inference about satisfaction drivers.
  • Natural language processing on Q19 open-text responses would yield richer qualitative insight to complement the quantitative findings.
  • A provider-side survey matched to customer-side data would enable multi-level modelling of how firm characteristics mediate individual satisfaction.

12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

[Your Name]. (2026). Driver outsourcing services survey dataset [Dataset]. Collected via Google Forms from Lagos-based professionals, March–May 2026. Data available on request from the author.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., Francois, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Muller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048


13 Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with (1) generating the initial Quarto document skeleton and section scaffolding, (2) suggesting appropriate R package choices for each analytical technique, and (3) reviewing code syntax for tidyverse and related functions. All analytical decisions — the choice of Case Study 1, the selection of overall satisfaction as the dependent variable, the decision to use Spearman rather than Pearson correlation given the ordinal nature of Likert data, the interpretation of all statistical outputs, and the business recommendations — were made independently by the author. No AI tool generated the professional disclosure, the data collection narrative, or the substantive interpretation of results. The author takes full responsibility for all analytical judgements in this document.