Nigerian Consumer International Shopping: Exploratory & Inferential Analytics

Author

Idayat Oshodi

Published

May 20, 2026

Executive Summary

This study analyses primary survey data collected from 132 Nigerian consumers to answer the question: What income, spending, and behavioural characteristics predict how much a consumer allocates to international shopping — and which sector × category × season combinations represent the highest-value, lowest-friction segments for a personal shopper to target?

The data were collected via a structured Google Form survey administered to Nigerian professionals in May 2026. Respondents span 16 employment sectors and six income brackets, and report purchasing internationally between 0 and more than 10 times over the previous 24 months. Five analytical techniques were applied: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression.

Key findings indicate that monthly income is the single strongest predictor of international shopping budget share. Consumers earning ₦1 million or more monthly allocate significantly higher proportions of income to international purchases. Oil & Gas and Banking & Finance professionals shopping during the Black Friday and Christmas seasons emerge as the highest-value, lowest-friction segments. Payment friction (card declines and FX losses) remains the most prevalent barrier, disproportionately affecting lower-income brackets. The regression model explains approximately 45% of variance in budget allocation, with income bracket and purchase frequency as the dominant drivers.


1 Professional Disclosure

Role and context: I am a Business Analyst and a Personal International Shopper operating in the Financial sector with regular exposure to cross-border commerce, supply chain management, and consumer decision-making. International purchasing of goods — particularly electronics, fashion, and specialised equipment — is a recurring operational and personal activity that requires navigating payment infrastructure constraints, logistics providers, and personal shoppers. This survey was designed and administered to peers within a professional network to understand demand patterns that inform a potential personal-shopping service.

Technique justifications:

  • EDA is the foundation of any reliable analysis: with survey data, cleaning decisions (how to handle non-standard responses, ordinal midpoints) materially affect every downstream result. EDA makes those decisions transparent and auditable.
  • Data Visualisation converts multi-dimensional survey patterns into actionable client profiles. A personal shopper benefits more from a visual targeting map than from a regression table.
  • Hypothesis Testing provides statistical rigour to observed differences: it distinguishes genuine income-driven spend patterns from sampling noise, which matters when deciding which client segments to pursue.
  • Correlation Analysis identifies which predictors are independently informative versus redundant — critical for avoiding double-counting in a regression model built on ordinal data.
  • Linear Regression provides a single quantitative model that translates observable client characteristics (income, frequency, sector) into a predicted budget-allocation score — the foundation of a client-scoring tool for a personal shopper.

2 Data Collection & Sampling

Collection method: A structured survey was designed and distributed via Google Forms in May 2026 to Nigerian consumers within a professional network. The link was shared through WhatsApp groups, email chains, and LinkedIn direct messages targeting working professionals aged 18–55.

Sampling frame: Convenience sample of Nigerian professionals with prior exposure to international online shopping. The population of interest is economically active Nigerians who have attempted or completed international purchases via platforms such as Amazon, ASOS, Shein, or through personal shoppers.

Sample size and period: 132 completed responses were collected between 06 May 2026 and 15 May 2026. The assignment guidelines require a minimum of 100 observations, so the achieved sample of 132 exceeds this threshold. All 132 responses are genuine primary data and are retained in the descriptive outputs; the 6 respondents with uninterpretable entries in the outcome variable are excluded only from the regression and correlation analyses.

Variables collected: 25 variables covering demographics (age, gender, city), professional context (sector, employment status, income bracket), shopping behaviour (frequency, budget share, spend per order, categories, platforms, seasons, method), payment behaviour, friction losses, satisfaction, and open-text responses.

Ethical notes: No personally identifiable information (names, phone numbers, employee IDs) was collected. Participation was voluntary; the survey introduction stated that responses would be used for an academic analysis. Respondents who preferred not to disclose income were given a “Prefer not to say” option. No organisational confidential data was used.

Data-sharing restrictions: The dataset will be made available on request from the author; no organisational restrictions apply given the consumer-level nature of the data.


3 Data Description

3.1 Respondent Profile — Pie Charts

The five interactive pie charts below show who the respondents are. age_group is ordinal (ordered youngest to oldest); the remaining four variables are nominal (no inherent ranking). Hover over any slice to see the exact label, count, and percentage. Click a slice to isolate it; click the ⛶ Expand button (top-right of each chart) to view it full-screen, and press Esc or click the overlay to close.

Show code
# ── Colour palette ────────────────────────────────────────────────────────────
pie_cols <- c("#1F3864","#C00000","#70AD47","#ED7D31","#7030A0",
              "#00B0F0","#FFC000","#FF7C80","#43682B","#833C00",
              "#A9D18E","#9DC3E6","#FFD966","#F4B183")

# ── Helper: frequency table with top-n + "Other" rollup ─────────────────────
make_pie_df <- function(x, top_n = NULL) {
  x      <- as.character(x)
  x      <- x[!is.na(x) & nchar(trimws(x)) > 0]
  tbl    <- sort(table(x), decreasing = TRUE)
  labels <- names(tbl);  counts <- as.integer(tbl)
  if (!is.null(top_n) && length(counts) > top_n) {
    others <- sum(counts[(top_n + 1):length(counts)])
    labels <- c(labels[seq_len(top_n)], "Other")
    counts <- c(counts[seq_len(top_n)], others)
  }
  df <- data.frame(label = labels, n = counts, stringsAsFactors = FALSE)
  df$pct <- round(df$n / sum(df$n) * 100, 1)
  df
}

# ── Build data for each pie ───────────────────────────────────────────────────
d_age    <- make_pie_df(survey_clean$age_group)
d_gender <- make_pie_df(survey_clean$gender)
d_emp    <- make_pie_df(survey_clean$emp_status)
d_city   <- make_pie_df(survey_clean$city,         top_n = 7)
d_sector <- make_pie_df(survey_clean$sector_clean, top_n = 8)

pie_data   <- list(d_age, d_gender, d_emp, d_city, d_sector)
pie_titles <- c("Age Group", "Gender", "Employment Status",
                "City  (Top 7 + Other)", "Employment Sector  (Top 8 + Other)")

# ── Domain grid: 3 charts on top row, 2 centred on bottom row ────────────────
dom_x <- list(c(0.01,0.31), c(0.35,0.65), c(0.69,0.99),
              c(0.13,0.47), c(0.53,0.87))
dom_y <- list(c(0.54,1.00), c(0.54,1.00), c(0.54,1.00),
              c(0.00,0.46), c(0.00,0.46))
ann_x <- sapply(dom_x, mean)
ann_y <- c(1.01, 1.01, 1.01, 0.47, 0.47)

# ── Build multi-trace plotly figure ──────────────────────────────────────────
fig_pie <- plot_ly(height = 720)

for (i in seq_along(pie_data)) {
  df   <- pie_data[[i]]
  cols <- as.list(pie_cols[seq_len(nrow(df))])
  fig_pie <- fig_pie |>
    add_pie(
      data          = df,
      labels        = ~label,
      values        = ~n,
      name          = pie_titles[i],
      textposition  = "inside",
      textinfo      = "percent",
      hovertemplate = paste0(
        "<b>%{label}</b><br>Count: %{value}<br>Share: %{percent}<extra></extra>"
      ),
      domain  = list(x = dom_x[[i]], y = dom_y[[i]]),
      marker  = list(colors = cols,
                     line   = list(color = "white", width = 2)),
      showlegend = FALSE
    )
}

# ── Subtitle annotations (chart titles above each pie) ───────────────────────
ann_list <- lapply(seq_along(pie_titles), function(i) {
  list(x = ann_x[i], y = ann_y[i],
       text      = paste0("<b>", pie_titles[i], "</b>"),
       xref      = "paper", yref = "paper",
       showarrow = FALSE, xanchor = "center", yanchor = "bottom",
       font      = list(size = 11, color = "#1F3864"))
})

fig_pie |>
  plotly::layout(
    title = list(
      text = paste0(
        "<b>Who Are the Respondents?</b>",
        "<span style='font-size:12px;color:#777'>",
        "   n = ", nrow(survey_clean),
        " respondents  |  Hover for counts  |  Click slices to filter</span>"
      ),
      font = list(size = 15, color = "#1F3864"), x = 0.5
    ),
    annotations   = ann_list,
    paper_bgcolor = "white",
    plot_bgcolor  = "white",
    margin        = list(t = 80, b = 10, l = 10, r = 10)
  )

3.2 Variable Summary Table

The table below consolidates all key descriptive statistics into a single readable summary. Each row is one numeric variable derived from the survey; columns show sample size, completeness, central tendency, spread, range, and quartiles.

Show code
# ── Human-readable variable labels ───────────────────────────────────────────
var_meta <- data.frame(
  col   = c("income_num","pct_income","spend_usd",
             "freq_num","satisfaction","total_loss"),
  Label = c("Monthly Income (₦ '000)",
             "Budget Share — Intl Shopping (%)",
             "Typical Order Spend (USD)",
             "Purchases in Last 24 Months (n)",
             "Overall Satisfaction (1–5)",
             "Total Estimated USD Losses"),
  Source = c("Income bracket midpoint",
              "Budget-share band midpoint",
              "Spend-band midpoint",
              "Frequency-band midpoint",
              "Direct survey rating",
              "Sum of 4 cleaned loss columns"),
  stringsAsFactors = FALSE
)

# ── Build summary row for each variable ──────────────────────────────────────
sum_rows <- lapply(var_meta$col, function(nm) {
  v   <- survey_clean[[nm]]
  v_c <- v[!is.na(v)]
  n_valid   <- length(v_c)
  n_missing <- sum(is.na(v))
  pct_complete <- round(n_valid / nrow(survey_clean) * 100, 1)
  data.frame(
    col          = nm,
    N_Valid      = n_valid,
    N_Missing    = n_missing,
    Pct_Complete = pct_complete,
    Mean         = round(mean(v_c), 2),
    Median       = round(median(v_c), 2),
    SD           = round(sd(v_c), 2),
    Min          = round(min(v_c), 2),
    Q1           = round(quantile(v_c, 0.25), 2),
    Q3           = round(quantile(v_c, 0.75), 2),
    Max          = round(max(v_c), 2),
    stringsAsFactors = FALSE
  )
})
sum_tbl <- do.call(rbind, sum_rows)

# ── Merge labels ──────────────────────────────────────────────────────────────
sum_tbl <- merge(var_meta, sum_tbl, by = "col")
sum_tbl <- sum_tbl[match(var_meta$col, sum_tbl$col), ]   # preserve order
rownames(sum_tbl) <- NULL   # reset row names so kable() does not print them as an extra column

# ── Render ───────────────────────────────────────────────────────────────────
kable(
  sum_tbl[, c("Label","Source","N_Valid","N_Missing","Pct_Complete",
               "Mean","Median","SD","Min","Q1","Q3","Max")],
  col.names = c("Variable","Derived From",
                "Valid n","Missing n","Complete %",
                "Mean","Median","Std Dev",
                "Min (p0)","Q1 (p25)","Q3 (p75)","Max (p100)"),
  caption   = paste0("Table 1: Descriptive statistics for all key numeric variables (n = ", nrow(survey_clean), " respondents)"),
  align     = c("l","l","c","c","c","r","r","r","r","r","r","r")
) |>
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = TRUE,
    font_size         = 13
  ) |>
  row_spec(0,
           bold       = TRUE,
           color      = "white",
           background = "#1F3864") |>
  row_spec(1, background = "#EEF2F7") |>   # income_num
  row_spec(2, background = "#FEF2F2") |>   # pct_income  (outcome — highlight)
  row_spec(3, background = "#EEF2F7") |>
  row_spec(4, background = "#FFFFFF") |>
  row_spec(5, background = "#EEF2F7") |>
  row_spec(6, background = "#FFFFFF") |>
  column_spec(1, bold = TRUE, width = "18em") |>
  column_spec(2, italic = TRUE, color = "#555555", width = "16em") |>
  column_spec(4,
    color = ifelse(sum_tbl$N_Missing > 0, "#C00000", "#375623"),
    bold  = ifelse(sum_tbl$N_Missing > 0, TRUE,      FALSE)) |>
  footnote(
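    general = paste0(
      "Row highlighted in red = outcome variable (pct_income). ",
      "Red Missing n = variable has missing values. ",
      "Income, budget share, order spend, and purchase frequency are derived from ",
      "ordinal survey bands using band midpoints; satisfaction is a direct rating and ",
      "total losses are the sum of four cleaned loss columns ",
      "(see Section 2, Data Collection & Sampling, for encoding decisions)."
    ),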
    general_title = "Notes: "
  )
Table 1: Descriptive statistics for all key numeric variables

| Variable | Derived From | Valid n | Missing n | Complete % | Mean | Median | Std Dev | Min (p0) | Q1 (p25) | Q3 (p75) | Max (p100) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Monthly Income (₦ '000) | Income bracket midpoint | 119 | 13 | 90.2 | 1340.76 | 750.0 | 1178.61 | 75 | 400.0 | 1750.0 | 3500 |
| Budget Share — Intl Shopping (%) | Budget-share band midpoint | 126 | 6 | 95.5 | 7.02 | 2.5 | 7.35 | 0 | 2.5 | 7.5 | 35 |
| Typical Order Spend (USD) | Spend-band midpoint | 132 | 0 | 100.0 | 138.75 | 75.0 | 177.66 | 30 | 30.0 | 150.0 | 700 |
| Purchases in Last 24 Months (n) | Frequency-band midpoint | 132 | 0 | 100.0 | 3.94 | 1.5 | 3.89 | 0 | 1.5 | 4.0 | 12 |
| Overall Satisfaction (1–5) | Direct survey rating | 132 | 0 | 100.0 | 3.41 | 3.0 | 0.94 | 1 | 3.0 | 4.0 | 5 |
| Total Estimated USD Losses | Sum of 4 cleaned loss columns | 132 | 0 | 100.0 | 6281.49 | 0.0 | 41868.61 | 0 | 0.0 | 82.5 | 401900 |

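Notes:
Row highlighted in red = outcome variable (pct_income). Red Missing n = variable has missing values. Income, budget share, order spend, and purchase frequency are derived from ordinal survey bands using band midpoints; satisfaction is a direct rating and total losses are the sum of four cleaned loss columns (see Section 2, Data Collection & Sampling, for encoding decisions).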

The outcome variable pct_income (highlighted in red) is missing for 6 respondents who gave non-interpretable free-text responses (“0.9”, “Depend need(s)”, “As required”). These rows are excluded from the regression and correlation analyses but retained in all descriptive outputs.
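The band-midpoint encoding, and the way uninterpretable entries end up as missing, can be illustrated with a minimal sketch. The band labels and the helper name `band_to_midpoint` are assumptions made for illustration (the survey's actual band wording is not reproduced in this report); only the three free-text examples are taken from the paragraph above.

```r
# Hypothetical band labels mapped to their midpoints; the real survey wording may differ.
band_midpoints <- c(
  "0%"      = 0,
  "1 - 4%"  = 2.5,
  "5 - 10%" = 7.5
)

# Look up each response's midpoint. Anything that is not a known band
# (for example free text) becomes NA rather than being guessed.
band_to_midpoint <- function(x, lookup) {
  unname(lookup[trimws(as.character(x))])
}

raw <- c("1 - 4%", "5 - 10%", "0.9", "Depend need(s)", "As required", "0%")
pct <- band_to_midpoint(raw, band_midpoints)

pct                        # midpoints for known bands, NA for free text
sum(is.na(pct))            # number of entries flagged as missing
median(pct, na.rm = TRUE)  # descriptive summaries skip the NA entries
```

Keeping the NA rows in the data frame, rather than deleting them, is what allows the same respondents to appear in descriptive outputs while being dropped from models that need the outcome.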


3.3 Distribution of Key Variables

The panel below shows the full distribution of each variable — bars show how many respondents fall into each value range, overlaid with a smooth density curve. The dashed red line marks the median for each variable, giving an instant read on the typical respondent.

Show code
# ── Data in long format for faceting ─────────────────────────────────────────
hist_vars <- survey_clean[, c("income_num","pct_income","spend_usd",
                               "freq_num","satisfaction","total_loss")]

# Build long data frame manually (no pivot_longer to avoid select conflict)
hist_long <- do.call(rbind, lapply(names(hist_vars), function(nm) {
  v <- hist_vars[[nm]]
  data.frame(variable = nm, value = v[!is.na(v)], stringsAsFactors = FALSE)
}))

# Readable facet labels
facet_labels <- c(
  income_num   = "Monthly Income (₦ '000)",
  pct_income   = "Budget Share — Intl Shopping (% of income)",
  spend_usd    = "Typical Order Spend (USD)",
  freq_num     = "Purchases in Last 24 Months (n)", 
  satisfaction = "Overall Satisfaction \n(1 = very dissatisfied, 5 = very satisfied)",
  total_loss   = "Total Estimated USD Losses"
)
hist_long$facet <- factor(facet_labels[hist_long$variable],
                           levels = facet_labels)

# Median lines per facet
med_lines <- do.call(rbind, lapply(names(hist_vars), function(nm) {
  v <- hist_vars[[nm]]
  data.frame(variable = nm,
             facet    = facet_labels[nm],
             med      = median(v, na.rm = TRUE),
             stringsAsFactors = FALSE)
}))
med_lines$facet <- factor(med_lines$facet, levels = facet_labels)

# Per-variable fill colours (one per panel)
fill_cols <- setNames(
  c("#1F3864","#C00000","#ED7D31","#70AD47","#7030A0","#00B0F0"),
  facet_labels
)

# ── Plot ──────────────────────────────────────────────────────────────────────
p_hist <- ggplot(hist_long, aes(x = value, fill = facet)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins      = 12,
                 colour    = "white",
                 linewidth = 0.4,
                 alpha     = 0.80) +
  geom_density(colour    = "grey20",
               linewidth = 0.8,
               adjust    = 1.2) +
  geom_vline(data      = med_lines,
             aes(xintercept = med,
                 text = paste0("Median: ", round(med, 1))),
             colour    = "#C00000",
             linetype  = "dashed",
             linewidth = 0.9) +
  scale_fill_manual(values = fill_cols, guide = "none") +
  facet_wrap(~ facet,
             scales   = "free",
             ncol     = 2,
             labeller = label_value) +
  labs(
    title    = "Distribution of Key Survey Variables",
    subtitle = paste0(
      "Bars = histogram density  |  Curve = smoothed density  |  ",
      "Dashed red line = median  |  ★ Outcome in red"
    ),
    x = "Value",
    y = "Density"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", colour = "#1F3864",
                                    size = 15, margin = margin(b = 4)),
    plot.subtitle    = element_text(colour = "#555555", size = 10,
                                    margin = margin(b = 10)),
    strip.text       = element_text(face = "bold", colour = "white",
                                    size = 9),
    strip.background = element_rect(fill = "#1F3864", colour = NA),
    panel.grid.minor = element_blank(),
    panel.border     = element_rect(colour = "#DDDDDD", fill = NA,
                                    linewidth = 0.5),
    axis.title       = element_text(colour = "#333333", size = 10),
    plot.margin      = margin(12, 12, 12, 12)
  )

# ── Wrap as interactive plotly ────────────────────────────────────────────────
ggplotly(p_hist, height = 900, tooltip = c("x", "y", "text")) |>
  plotly::layout(
    hoverlabel = list(bgcolor = "white", font = list(size = 11)),
    margin     = list(t = 80, b = 40, l = 50, r = 30)
  )

Figure 1: Distribution of all six key numeric variables (hover for values · drag to zoom · double-click to reset)


3.4 Categorical & Behavioural Variable Inventory

The table below documents all eleven categorical variables charted in this section — five in the pie charts above and six in the bar charts below. The Data Type column classifies each variable by measurement level; Response Type shows whether a respondent could select one or multiple answers. Understanding this matters for interpreting bar-chart counts: multi-select bars can sum to more than the total number of respondents.

Show code
var_desc <- data.frame(
  R_Variable = c("age_group","gender","city","sector_clean","emp_status",
                 "categories","platforms","shopping_months",
                 "shopping_method","payment_method","improvement"),
  Survey_Question = c(
    "What is your age group?",
    "What is your gender?",
    "What city do you currently live in?",
    "What is your employment sector?",
    "What best describes your employment status?",
    "Which categories do you shop for internationally? (select all that apply)",
    "Which international platforms do you shop from? (select all that apply)",
    "Which months do you shop most internationally? (select all that apply)",
    "How do you usually handle your international shopping?",
    "How do you typically pay your personal shopper or for international orders?",
    "What single improvement would make you shop internationally more often?"
  ),
  Data_Type  = c("Ordinal","Nominal","Nominal","Nominal","Nominal",
                 "Nominal","Nominal","Nominal","Nominal","Nominal","Nominal"),
  Response   = c("Single-select","Single-select","Single-select",
                 "Single-select","Single-select",
                 "Multi-select","Multi-select","Multi-select",
                 "Single-select","Single-select","Single-select"),
  Chart      = c("Pie","Pie","Pie","Pie","Pie",
                 "Bar","Bar","Bar","Bar","Bar","Bar"),
  stringsAsFactors = FALSE
)

kable(
  var_desc,
  col.names = c("R Variable Name","Survey Question",
                "Data Type","Response Type","Chart Used"),
  caption   = "Table 2: Categorical and behavioural variables described in this section",
  align     = c("l","l","c","c","c")
) |>
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = TRUE,
    font_size         = 13
  ) |>
  row_spec(0, bold = TRUE, color = "white", background = "#1F3864") |>
  column_spec(1, bold = TRUE, monospace = TRUE, color = "#1F3864") |>
  column_spec(3, italic = TRUE, color = "#555555") |>
  row_spec(c(6, 7, 8), background = "#FFF3CD") |>
  footnote(
    general = paste0(
      "Rows highlighted in amber = multi-select variables: one respondent may contribute ",
      "multiple values, so bar-chart counts can exceed n = ", nrow(survey_clean), ". ",
      "sector_clean is a cleaned version of the raw sector column (Banking variants merged)."
    ),
    general_title = "Notes: "
  )
Table 2: Categorical variables visualised in this section

| R Variable Name | Survey Question | Data Type | Response Type | Chart Used |
|---|---|---|---|---|
| age_group | What is your age group? | Ordinal | Single-select | Pie |
| gender | What is your gender? | Nominal | Single-select | Pie |
| city | What city do you currently live in? | Nominal | Single-select | Pie |
| sector_clean | What is your employment sector? | Nominal | Single-select | Pie |
| emp_status | What best describes your employment status? | Nominal | Single-select | Pie |
| categories | Which categories do you shop for internationally? (select all that apply) | Nominal | Multi-select | Bar |
| platforms | Which international platforms do you shop from? (select all that apply) | Nominal | Multi-select | Bar |
| shopping_months | Which months do you shop most internationally? (select all that apply) | Nominal | Multi-select | Bar |
| shopping_method | How do you usually handle your international shopping? | Nominal | Single-select | Bar |
| payment_method | How do you typically pay your personal shopper or for international orders? | Nominal | Single-select | Bar |
| improvement | What single improvement would make you shop internationally more often? | Nominal | Single-select | Bar |

Notes:
Rows highlighted in amber = multi-select variables: one respondent may contribute multiple values, so bar-chart counts can exceed n = 132. sector_clean is a cleaned version of the raw sector column (Banking variants merged).

3.5 Shopping Behaviour — Bar Charts

The six interactive charts below answer how respondents shop internationally. All six variables are nominal (no natural ranking). Hover over any bar for the exact count; drag to zoom in on any region; double-click to reset the view.

Multi-select note: Categories, Platforms, and Months are “select all that apply” questions — one respondent may be counted in multiple bars, so totals across bars exceed 132.
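To see why multi-select totals can exceed the respondent count, a small sketch with made-up answers (not survey data) follows. It uses the same comma-splitting idea as the `split_count` helper in this section's code.

```r
# Three made-up respondents answering a "select all that apply" question.
answers <- c("Fashion, Electronics", "Electronics", "Fashion, Beauty, Electronics")

# Split each answer on commas, trim whitespace, and count every mention.
items  <- trimws(unlist(strsplit(answers, ",")))
counts <- sort(table(items), decreasing = TRUE)

counts
length(answers)  # number of respondents
sum(counts)      # number of mentions, which is larger than the respondent count
```

Each bar is therefore best read as "respondents mentioning this option", not as a share of a total that sums to 100%.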

Show code
bar_pal <- c("#1F3864","#C00000","#70AD47","#ED7D31","#7030A0","#00B0F0")

# ── Helper: expand multi-select and count ─────────────────────────────────────
split_count <- function(x, top_n = 12) {
  items <- unlist(strsplit(as.character(x[!is.na(x) & x != ""]), ",\\s*"))
  items <- trimws(items[nchar(trimws(items)) > 0])
  tbl    <- sort(table(items), decreasing = TRUE)
  labels <- names(tbl);  counts <- as.integer(tbl)
  if (length(counts) > top_n) { labels <- labels[seq_len(top_n)]; counts <- counts[seq_len(top_n)] }
  data.frame(label = labels, n = counts, stringsAsFactors = FALSE)
}

# ── Helper: single-select count with "Other" rollup ───────────────────────────
single_count <- function(x, top_n = 12) {
  x      <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0])
  tbl    <- sort(table(x), decreasing = TRUE)
  labels <- names(tbl);  counts <- as.integer(tbl)
  if (length(counts) > top_n) {
    others <- sum(counts[(top_n + 1):length(counts)])
    labels <- c(labels[seq_len(top_n)], "Other")
    counts <- c(counts[seq_len(top_n)], others)
  }
  data.frame(label = labels, n = counts, stringsAsFactors = FALSE)
}

# ── Helper: build one native plotly horizontal bar chart ─────────────────────
make_plotly_bar <- function(df, title, fill_col, x_note = "Count") {
  df$lw <- stringr::str_wrap(df$label, width = 30)
  # order ascending so largest bar is at the top of the y-axis
  df     <- df[order(df$n, decreasing = FALSE), ]
  df$lw  <- factor(df$lw, levels = unique(df$lw))
  plot_ly(
    df,
    x           = ~n,
    y           = ~lw,
    type        = "bar",
    orientation = "h",
    height      = 400,   # set here: specifying height in layout() is deprecated
    marker      = list(color     = fill_col,
                       opacity   = 0.87,
                       line      = list(color = "white", width = 0.5)),
    text        = ~paste0("<b>", label, "</b><br>Count: ", n),
    hoverinfo   = "text",
    showlegend  = FALSE
  ) |>
    plotly::layout(
      title  = list(text  = paste0("<b>", title, "</b>"),
                    font  = list(size = 12, color = "#1F3864"),
                    x = 0.02, xanchor = "left"),
      xaxis  = list(title    = x_note,
                    gridcolor = "#DDDDDD",
                    zeroline  = FALSE,
                    tickfont  = list(size = 10)),
      yaxis  = list(title     = "",
                    tickfont  = list(size = 9),
                    automargin = TRUE),
      paper_bgcolor = "white",
      plot_bgcolor  = "#FAFAFA",
      margin        = list(l = 10, r = 30, t = 55, b = 30)
    )
}

# ── Build the six bar charts ──────────────────────────────────────────────────
b_cat  <- make_plotly_bar(
  split_count(survey_clean$categories, top_n = 10),
  "1. Categories Shopped Internationally  (Top 10 — multi-select)",
  bar_pal[1], x_note = "Respondents mentioning"
)
b_plat <- make_plotly_bar(
  split_count(survey_clean$platforms, top_n = 10),
  "2. Platforms Used  (Top 10 — multi-select)",
  bar_pal[2], x_note = "Respondents mentioning"
)
b_mon  <- make_plotly_bar(
  split_count(survey_clean$shopping_months),
  "3. Months of Peak International Shopping  (multi-select)",
  bar_pal[3], x_note = "Respondents mentioning"
)
b_meth <- make_plotly_bar(
  single_count(survey_clean$method_short),
  "4. How Shopping is Handled  (single-select)",
  bar_pal[4]
)
b_pay  <- make_plotly_bar(
  single_count(survey_clean$payment_method, top_n = 8),
  "5. Payment Method  (Top 8 — single-select)",
  bar_pal[5]
)
b_imp  <- make_plotly_bar(
  single_count(survey_clean$improvement, top_n = 10),
  "6. What Would Drive More Shopping  (Top 10 — single-select)",
  bar_pal[6]
)

# ── Render all six charts stacked (full-width, individually interactive) ──────
htmltools::tagList(b_cat, b_plat, b_mon, b_meth, b_pay, b_imp)

4 Technique 1 — Exploratory Data Analysis

Note: 📚 Theory Recap — Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an analytic philosophy introduced by John Tukey (1977) that advocates examining data through graphical and numerical summaries before imposing statistical models. The guiding principle is to let the data “speak” — uncovering distributions, outliers, and structural patterns without prior hypotheses — so that subsequent modelling decisions rest on observed reality rather than untested assumptions.

Key concepts:

| Concept | What it measures | Why it matters |
|---|---|---|
| Central tendency (mean, median) | Where data cluster | Median is robust to skew; mean is sensitive to outliers |
| Dispersion (SD, IQR, range) | How spread out values are | Large SD relative to mean signals high heterogeneity |
| Shape (skewness, kurtosis) | Asymmetry and tail weight | Right-skewed income data may need transformation before parametric tests |
| Missing-value audit | Completeness of each variable | Informs whether to impute, exclude, or flag missing cases |

Anscombe’s Quartet (Anscombe, 1973) provides the canonical demonstration of why visual inspection is indispensable: four datasets with nearly identical means, variances, and correlations exhibit completely different structures when plotted, showing that numerical summaries alone can be deeply misleading.
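The quartet ships with base R as the `anscombe` data frame, so the point can be checked directly. The sketch below is illustrative and separate from the survey analysis; it computes the means and correlations for the four pairs and, in an interactive session, plots them.

```r
# anscombe is part of base R (package datasets): columns x1..x4 and y1..y4.
data(anscombe)

sets    <- 1:4
x_means <- sapply(sets, function(i) mean(anscombe[[paste0("x", i)]]))
y_means <- sapply(sets, function(i) mean(anscombe[[paste0("y", i)]]))
cors    <- sapply(sets, function(i)
  cor(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]]))

# Numerically the four sets are almost indistinguishable.
round(data.frame(set = sets, x_mean = x_means, y_mean = y_means, r = cors), 2)

# Plotting reveals the different structures; drawn only in interactive sessions.
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  for (i in sets) {
    plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
         xlab = paste0("x", i), ylab = paste0("y", i), pch = 19)
  }
  par(op)
}
```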

Technique justification: EDA is the mandatory first step before any inferential work. Chapter 9 of Adi (2026) emphasises that summary statistics and distributions must be understood before model parameters can be meaningfully interpreted. With a survey dataset containing ordinal income brackets, multi-select categories, and free-text loss estimates, a thorough EDA prevents misspecified models and unreliable inference.

4.1 Summary Statistics

Show code
num_vars_list <- list(
  income_num  = survey_clean$income_num,
  pct_income  = survey_clean$pct_income,
  spend_usd   = survey_clean$spend_usd,
  freq_num    = survey_clean$freq_num,
  satisfaction= survey_clean$satisfaction,
  total_loss  = survey_clean$total_loss
)
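# Nonparametric skew: (mean - median) / SD. Despite the name, this omits the
# factor of 3 used in Pearson's second skewness coefficient; positive = right-skewed.
skew_pearson <- function(v) {
  v <- v[!is.na(v)]
  if (length(v) < 3) return(NA_real_)
  round((mean(v) - median(v)) / sd(v), 2)
}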
num_summary2 <- do.call(rbind, lapply(names(num_vars_list), function(nm) {
  v <- num_vars_list[[nm]]
  data.frame(Variable=nm, n=sum(!is.na(v)),
             Mean=round(mean(v,na.rm=TRUE),2),
             Median=round(median(v,na.rm=TRUE),2),
             SD=round(sd(v,na.rm=TRUE),2),
             Min=round(min(v,na.rm=TRUE),2),
             Max=round(max(v,na.rm=TRUE),2),
             Skew=skew_pearson(v),
             stringsAsFactors=FALSE)
}))

kable(num_summary2,
      caption = "Descriptive statistics for key numeric variables") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)
Descriptive statistics for key numeric variables

| Variable | n | Mean | Median | SD | Min | Max | Skew |
|---|---|---|---|---|---|---|---|
| income_num | 119 | 1340.76 | 750.0 | 1178.61 | 75 | 3500 | 0.50 |
| pct_income | 126 | 7.02 | 2.5 | 7.35 | 0 | 35 | 0.61 |
| spend_usd | 132 | 138.75 | 75.0 | 177.66 | 30 | 700 | 0.36 |
| freq_num | 132 | 3.94 | 1.5 | 3.89 | 0 | 12 | 0.63 |
| satisfaction | 132 | 3.41 | 3.0 | 0.94 | 1 | 5 | 0.43 |
| total_loss | 132 | 6281.49 | 0.0 | 41868.61 | 0 | 401900 | 0.15 |
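
The Skew column is produced by `skew_pearson` as (mean - median) / SD. The sketch below re-derives three of the reported values using only figures copied from the table above. This quantity is Pearson's second skewness coefficient without its usual factor of 3, and it is bounded between -1 and 1, so an extreme outlier that inflates the SD can yield a small value. That may explain why total_loss scores lowest despite a maximum of 401,900 against a median of 0.

```r
# Mean, median and SD copied from the summary table above.
stats <- data.frame(
  variable = c("income_num", "pct_income", "total_loss"),
  mean     = c(1340.76, 7.02, 6281.49),
  median   = c(750, 2.5, 0),
  sd       = c(1178.61, 7.35, 41868.61)
)

# (mean - median) / SD, the quantity skew_pearson returns.
stats$skew <- round((stats$mean - stats$median) / stats$sd, 2)

# Multiplying by 3 gives Pearson's second skewness coefficient.
stats$pearson2 <- round(3 * (stats$mean - stats$median) / stats$sd, 2)

stats
```

A moment-based skewness statistic would be a reasonable cross-check for heavy-tailed variables such as total_loss, since it is not bounded in the same way.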

4.2 Summary Statistics for Categorical Variables

For categorical variables, descriptive analysis centres on frequency counts and percentage distributions rather than means and standard deviations. The appropriate measure of central tendency is the mode — the most frequently occurring category. The table below provides an at-a-glance overview of all eleven key categorical variables; detailed frequency breakdowns follow.

Show code
# ── Helper: compute mode statistics for one categorical column ────────────────
cat_stats <- function(nm, label, df) {
  x   <- df[[nm]]
  x_c <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0])
  tbl <- sort(table(x_c), decreasing = TRUE)
  mode_val <- if (length(tbl) > 0) names(tbl)[1] else "—"
  mode_n   <- if (length(tbl) > 0) as.integer(tbl[1]) else 0L
  mode_pct <- if (mode_n > 0) round(mode_n / nrow(df) * 100, 1) else 0
  data.frame(
    Label    = label,
    N_Valid  = length(x_c),
    N_Miss   = sum(is.na(x)),
    N_Cat    = length(unique(x_c)),
    Mode     = mode_val,
    Mode_n   = mode_n,
    Mode_pct = paste0(mode_pct, "%"),
    stringsAsFactors = FALSE
  )
}

cat_meta <- data.frame(
  col   = c("age_group","gender","city","sector_clean","emp_status",
            "education","income_bracket","purchases_24m",
            "method_short","payment_method","payment_abandoned"),
  Label = c("Age Group","Gender","City of Residence",
            "Employment Sector","Employment Status","Education Level",
            "Income Bracket","International Purchases (last 24 mo.)",
            "Shopping Method","Payment Method","Ever Abandoned a Purchase?"),
  stringsAsFactors = FALSE
)

cat_summary <- do.call(rbind, lapply(
  seq_len(nrow(cat_meta)),
  function(i) cat_stats(cat_meta$col[i], cat_meta$Label[i], survey_clean)
))

kable(
  cat_summary,
  col.names = c("Variable","Valid n","Missing","# Categories",
                "Mode (most common response)","Mode count","Mode %"),
  caption   = paste0("Table 3: Categorical variable overview — mode and completeness",
                     " (n = ", nrow(survey_clean), " respondents)"),
  align     = c("l","c","c","c","l","c","c")
) |>
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width = TRUE, font_size = 13
  ) |>
  row_spec(0, bold = TRUE, color = "white", background = "#1F3864") |>
  column_spec(1, bold = TRUE, color = "#1F3864") |>
  column_spec(5, italic = TRUE, color = "#444444") |>
  footnote(
    general       = paste0(
      "Mode = most frequently occurring response. ",
      "Missing = blank or NA entries excluded from frequency calculations. ",
      "sector_clean merges 'Banking' and 'Banking and Financial Services'."
    ),
    general_title = "Notes: "
  )
Table 3: Categorical variable overview — mode, completeness, and category count
| Variable | Valid n | Missing | # Categories | Mode (most common response) | Mode count | Mode % |
|---|---|---|---|---|---|---|
| Age Group | 132 | 0 | 5 | 35 - 44 | 61 | 46.2% |
| Gender | 132 | 0 | 3 | Male | 66 | 50% |
| City of Residence | 132 | 0 | 15 | Lagos | 98 | 74.2% |
| Employment Sector | 132 | 0 | 16 | Banking & Finance | 40 | 30.3% |
| Employment Status | 132 | 0 | 4 | Employed full-time | 95 | 72% |
| Education Level | 132 | 0 | 4 | HND / Bachelor's degree | 60 | 45.5% |
| Income Bracket | 132 | 0 | 7 | ₦1,000,000 – ₦2,499,999 | 29 | 22% |
| International Purchases (last 24 mo.) | 132 | 0 | 5 | 1 - 2 | 43 | 32.6% |
| Shopping Method | 132 | 0 | 5 | Direct (Own Card) | 64 | 48.5% |
| Payment Method | 132 | 0 | 8 | Bank transfer in Naira | 67 | 50.8% |
| Ever Abandoned a Purchase? | 132 | 0 | 4 | Yes, more than once | 70 | 53% |
Notes:
Mode = most frequently occurring response. Missing = blank or NA entries excluded from frequency calculations. sector_clean merges 'Banking' and 'Banking and Financial Services'.

4.2.1 Demographic & Employment Profile — Detailed Frequencies

The table below gives the full frequency distribution for the five demographic and employment variables, grouped by variable. Percentages are calculated against the total sample of 132 respondents.

Show code
# ── Helper: frequency data frame for one variable ────────────────────────────
make_freq_df <- function(x, n_total, top_n = NULL, other_label = "All others") {
  x_c    <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0])
  tbl    <- sort(table(x_c), decreasing = TRUE)
  labels <- names(tbl);  counts <- as.integer(tbl)
  if (!is.null(top_n) && length(counts) > top_n) {
    others <- sum(counts[(top_n + 1):length(counts)])
    labels <- c(labels[seq_len(top_n)], other_label)
    counts <- c(counts[seq_len(top_n)], others)
  }
  data.frame(
    Category = labels,
    Count    = counts,
    Pct      = paste0(round(counts / n_total * 100, 1), "%"),
    stringsAsFactors = FALSE
  )
}

N <- nrow(survey_clean)

demo_vars <- list(
  "Age Group"          = survey_clean$age_group,
  "Gender"             = survey_clean$gender,
  "Employment Status"  = survey_clean$emp_status,
  "Education Level"    = survey_clean$education,
  "Employment Sector"  = survey_clean$sector_clean
)

# ── Build combined long data frame ────────────────────────────────────────────
demo_rows  <- data.frame(Category = character(), Count = integer(),
                         Pct = character(), stringsAsFactors = FALSE)
grp_label  <- character()
grp_start  <- integer()
grp_end    <- integer()
cursor     <- 1L

for (nm in names(demo_vars)) {
  df      <- make_freq_df(demo_vars[[nm]], N)
  grp_label <- c(grp_label, nm)
  grp_start <- c(grp_start, cursor)
  grp_end   <- c(grp_end,   cursor + nrow(df) - 1L)
  demo_rows <- rbind(demo_rows, df)
  cursor    <- cursor + nrow(df)
}

# ── Render with pack_rows grouping ────────────────────────────────────────────
k_demo <- kable(
  demo_rows,
  col.names = c("Category", "Count", "% of sample"),
  align     = c("l", "c", "c"),
  caption   = "Table 4: Demographic and employment frequency distributions"
) |>
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width = TRUE, font_size = 13
  ) |>
  row_spec(0, bold = TRUE, color = "white", background = "#1F3864")

for (i in seq_along(grp_label)) {
  k_demo <- k_demo |>
    pack_rows(grp_label[i], grp_start[i], grp_end[i],
              bold = TRUE, color = "white", background = "#2C5F9E",
              label_row_css = "border-top: 2px solid #1F3864;")
}
k_demo
Table 4: Frequency distributions — demographic and employment profile
| Category | Count | % of sample |
|---|---|---|
| **Age Group** | | |
| 35 - 44 | 61 | 46.2% |
| 25 - 34 | 44 | 33.3% |
| 45 - 54 | 21 | 15.9% |
| 18 - 24 | 3 | 2.3% |
| 55 and above | 3 | 2.3% |
| **Gender** | | |
| Male | 66 | 50% |
| Female | 64 | 48.5% |
| Prefer not to say | 2 | 1.5% |
| **Employment Status** | | |
| Employed full-time | 95 | 72% |
| Self-employed / Business owner | 31 | 23.5% |
| Employed part-time | 3 | 2.3% |
| Unemployed / Between jobs | 3 | 2.3% |
| **Education Level** | | |
| HND / Bachelor's degree | 60 | 45.5% |
| Postgraduate (Masters, MBA, PhD) | 57 | 43.2% |
| Professional certification (ICAN, CFA, ACCA, etc.) | 13 | 9.8% |
| Secondary school / O-levels | 2 | 1.5% |
| **Employment Sector** | | |
| Banking & Finance | 40 | 30.3% |
| Entrepreneur / Business Owner | 20 | 15.2% |
| Technology / Telecoms | 18 | 13.6% |
| Oil and Gas | 12 | 9.1% |
| Education | 11 | 8.3% |
| Legal / Consulting | 6 | 4.5% |
| Retail / Trading | 6 | 4.5% |
| Civil Service / Government | 5 | 3.8% |
| Healthcare | 5 | 3.8% |
| Logistics | 2 | 1.5% |
| Media / Creative Industries | 2 | 1.5% |
| Aviation | 1 | 0.8% |
| Food / Manufacturing | 1 | 0.8% |
| Mining | 1 | 0.8% |
| Other / Retired | 1 | 0.8% |
| Third Sector / NGO | 1 | 0.8% |

4.2.2 Geographic & Shopping Behaviour Profile — Detailed Frequencies

Show code
behav_vars <- list(
  "City of Residence (Top 8)"              = list(x = survey_clean$city,
                                                   top_n = 8,
                                                   other = "All other cities"),
  "Income Bracket"                          = list(x = survey_clean$income_bracket,
                                                   top_n = NULL, other = NULL),
  "Intl. Purchases — Last 24 Months"        = list(x = survey_clean$purchases_24m,
                                                   top_n = NULL, other = NULL),
  "Shopping Method"                         = list(x = survey_clean$method_short,
                                                   top_n = NULL, other = NULL),
  "Payment Method"                          = list(x = survey_clean$payment_method,
                                                   top_n = NULL, other = NULL),
  "Ever Abandoned a Purchase?"              = list(x = survey_clean$payment_abandoned,
                                                   top_n = NULL, other = NULL)
)

behav_rows  <- data.frame(Category = character(), Count = integer(),
                           Pct = character(), stringsAsFactors = FALSE)
grp_label2  <- character()
grp_start2  <- integer()
grp_end2    <- integer()
cursor2     <- 1L

for (nm in names(behav_vars)) {
  v    <- behav_vars[[nm]]
  df   <- make_freq_df(v$x, N, top_n = v$top_n,
                       other_label = if (!is.null(v$other)) v$other else "All others")
  grp_label2  <- c(grp_label2, nm)
  grp_start2  <- c(grp_start2, cursor2)
  grp_end2    <- c(grp_end2,   cursor2 + nrow(df) - 1L)
  behav_rows  <- rbind(behav_rows, df)
  cursor2     <- cursor2 + nrow(df)
}

k_behav <- kable(
  behav_rows,
  col.names = c("Category", "Count", "% of sample"),
  align     = c("l", "c", "c"),
  caption   = "Table 5: Geographic and shopping behaviour frequency distributions"
) |>
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width = TRUE, font_size = 13
  ) |>
  row_spec(0, bold = TRUE, color = "white", background = "#1F3864")

for (i in seq_along(grp_label2)) {
  k_behav <- k_behav |>
    pack_rows(grp_label2[i], grp_start2[i], grp_end2[i],
              bold = TRUE, color = "white", background = "#2C5F9E",
              label_row_css = "border-top: 2px solid #1F3864;")
}
k_behav
Table 5: Frequency distributions — geographic and shopping behaviour profile
| Category | Count | % of sample |
|---|---|---|
| **City of Residence (Top 8)** | | |
| Lagos | 98 | 74.2% |
| Ibadan | 9 | 6.8% |
| Abuja | 6 | 4.5% |
| Port Harcourt | 6 | 4.5% |
| Kano | 2 | 1.5% |
| Osun | 2 | 1.5% |
| Adamawa | 1 | 0.8% |
| Benin | 1 | 0.8% |
| All other cities | 7 | 5.3% |
| **Income Bracket** | | |
| ₦1,000,000 – ₦2,499,999 | 29 | 22% |
| ₦500,000 – ₦999,999 | 29 | 22% |
| ₦2,500,000 and above | 22 | 16.7% |
| ₦150,000 – ₦299,999 | 15 | 11.4% |
| ₦300,000 – ₦499,999 | 15 | 11.4% |
| Prefer not to say | 13 | 9.8% |
| Below ₦150,000 | 9 | 6.8% |
| **Intl. Purchases — Last 24 Months** | | |
| 1 - 2 | 43 | 32.6% |
| 3 - 5 | 33 | 25% |
| None (I have tried but abandoned) | 24 | 18.2% |
| More than 10 | 17 | 12.9% |
| 6 - 10 | 15 | 11.4% |
| **Shopping Method** | | |
| Direct (Own Card) | 64 | 48.5% |
| Friend/Family Abroad | 32 | 24.2% |
| Combination | 15 | 11.4% |
| Personal Shopper | 14 | 10.6% |
| Freight Forwarder | 7 | 5.3% |
| **Payment Method** | | |
| Bank transfer in Naira | 67 | 50.8% |
| Payment with Naira debit or credit card | 38 | 28.8% |
| Dollar transfer from domiciliary account | 7 | 5.3% |
| Payment with Domiciliary debit or credit card | 7 | 5.3% |
| Transfer via Lemfi / Grey / Wise | 5 | 3.8% |
| Zelle or PayPal through a proxy | 4 | 3% |
| USDT / Cryptocurrency | 3 | 2.3% |
| Physical cash (for Lagos-based pickups) | 1 | 0.8% |
| **Ever Abandoned a Purchase?** | | |
| Yes, more than once | 70 | 53% |
| No, I always find a workaround | 29 | 22% |
| No, I have never had payment issues | 19 | 14.4% |
| Yes, once | 14 | 10.6% |

4.3 Data Quality Issues Identified and Resolved

Show code
# Compute non-standard outcome entries dynamically
ns_rows   <- is.na(survey_clean$pct_income)
ns_vals   <- sort(unique(survey_clean$pct_income_raw[ns_rows]))
n_ns      <- sum(ns_rows)   # count respondents, not distinct strings
ns_pct    <- round(n_ns / nrow(survey_clean) * 100, 1)
ns_list   <- paste0('"', ns_vals, '"', collapse = ", ")

quality_tbl <- tibble(
  Issue = c(
    "Non-standard outcome responses",
    "Loss columns contain free-text",
    "Duplicate sector labels",
    "Income bracket 'Prefer not to say'",
    "Multi-select columns (categories, seasons)"
  ),
  Detail = c(
    paste0(n_ns, " responses (", ns_list, ") in pct_income_raw"),
    "Values like 'Nil', '$8', 'i cant ascertain', 'USD90' require regex parsing",
    "'Banking' and 'Banking and Financial Services' denote the same sector",
    "Cannot be converted to a numeric midpoint",
    "Cannot be used as-is in regression; first-listed category extracted"
  ),
  Resolution = c(
    paste0("Treated as NA; excluded from modelling (n=", n_ns, ", ", ns_pct, "% of sample)"),
    "Zero-synonyms mapped to 0; unquantifiable entries mapped to NA",
    "Both recoded to 'Banking & Finance' in sector_clean",
    "Mapped to NA in income_num; excluded from numeric analyses",
    "Primary category = first comma-delimited entry; binary season flags created"
  )
)

kable(quality_tbl,
      caption = "Table 6: Data quality issues and resolutions") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE) |>
  column_spec(1, bold = TRUE)
Data quality issues and resolutions
| Issue | Detail | Resolution |
|---|---|---|
| Non-standard outcome responses | 6 responses ("0.9", "As required", "Depend need(s)", "No", "No allocation, purchase when necessary", "Undefined") in pct_income_raw | Treated as NA; excluded from modelling (n=6, 4.5% of sample) |
| Loss columns contain free-text | Values like 'Nil', '$8', 'i cant ascertain', 'USD90' require regex parsing | Zero-synonyms mapped to 0; unquantifiable entries mapped to NA |
| Duplicate sector labels | 'Banking' and 'Banking and Financial Services' denote the same sector | Both recoded to 'Banking & Finance' in sector_clean |
| Income bracket 'Prefer not to say' | Cannot be converted to a numeric midpoint | Mapped to NA in income_num; excluded from numeric analyses |
| Multi-select columns (categories, seasons) | Cannot be used as-is in regression; first-listed category extracted | Primary category = first comma-delimited entry; binary season flags created |
Note

Data Quality Summary

Two primary issues were identified:

  1. Non-standard outcome variable entries (6 rows, 4.5%): 6 respondents entered free text (“0.9”, “As required”, “Depend need(s)”, “No”, “No allocation, purchase when necessary”, “Undefined”) instead of selecting a percentage band. These were treated as missing values and excluded from the regression and correlation analyses. The remaining 126 rows with valid outcome responses form the modelling dataset.

  2. Free-text loss estimates (multiple columns): The four USD-loss columns contained a mix of numeric values, nil-synonyms, and unquantifiable text. A cleaning function normalised nil-synonyms to 0 and converted numeric strings (including “$8”, “USD90”) to numeric. Entries that could not be reliably converted were coded as NA and are not included in loss-related analyses.
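The cleaning step described in item 2 could look roughly like the sketch below. The function name `clean_loss` and the list of nil-synonyms are assumptions made for illustration; the survey's actual cleaning function and synonym list are not shown in this section and may differ.

```r
# Hypothetical helper: parse free-text USD loss entries into numbers.
clean_loss <- function(x) {
  x_trim <- trimws(tolower(as.character(x)))
  # Assumed zero-synonyms; extend to match the real responses
  nil_words <- c("nil", "none", "no", "zero", "0")
  out <- rep(NA_real_, length(x_trim))

  is_nil <- x_trim %in% nil_words
  out[is_nil] <- 0

  # Strip currency markers, commas and spaces, then accept pure numbers only
  stripped <- gsub("usd|\\$|,|\\s", "", x_trim)
  is_num <- !is_nil & grepl("^[0-9]+(\\.[0-9]+)?$", stripped)
  out[is_num] <- as.numeric(stripped[is_num])

  out  # anything else (e.g. "i cant ascertain") stays NA
}

clean_loss(c("Nil", "$8", "i cant ascertain", "USD90"))
# returns 0, 8, NA, 90
```

Keeping unparseable entries as NA, rather than forcing them to 0, avoids understating losses in the loss-related analyses.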

4.4 Distribution of the Outcome Variable

Show code
p_hist <- survey_mod |>
  ggplot(aes(x = pct_income)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 fill = pal[1], colour = "white", alpha = 0.85) +
  geom_density(colour = pal[2], linewidth = 1) +
  stat_function(fun = dnorm,
                args = list(mean = mean(survey_mod$pct_income, na.rm=TRUE),
                            sd   = sd(survey_mod$pct_income,   na.rm=TRUE)),
                colour = pal[3], linetype = "dashed", linewidth = 1) +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  labs(title = "Outcome Variable: % of Monthly Income Allocated to International Shopping",
       subtitle = paste0("n = ", n_mod, " | Mean = ",
                         round(mean(survey_mod$pct_income),1), "% | Median = ",
                         round(median(survey_mod$pct_income),1), "%"),
       x = "% of Monthly Income", y = "Density",
       caption = "Solid curve = kernel density; dashed = normal overlay") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", colour = pal[1]))

p_box <- survey_mod |>
  ggplot(aes(y = pct_income)) +
  geom_boxplot(fill = pal[1], alpha = 0.7, colour = pal[1], outlier.colour = pal[2],
               outlier.size = 3) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(title = "Boxplot with Outlier Detection",
       y = "% of Monthly Income",
       caption = "Red points = IQR outliers") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_blank(), plot.title = element_text(face="bold", colour=pal[1]))

p_hist + p_box + plot_layout(widths = c(3,1))

Distribution of international shopping budget share (% of monthly income)

The outcome variable is right-skewed: the majority of respondents allocate less than 10% of their monthly income to international shopping, but a tail of high-spenders allocates 20–35%. The histogram departs visibly from the normal overlay (dashed green), which motivates the use of non-parametric tests in Section 6.
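A visual impression of skew can be backed by a numeric check. The sketch below uses a simulated right-skewed vector as a stand-in for `survey_mod$pct_income` (the simulated values and the helper name `skewness` are assumptions, not survey results) and reports moment-based skewness alongside a Shapiro-Wilk test.

```r
set.seed(42)
# Hypothetical stand-in for survey_mod$pct_income: a right-skewed sample
pct_income_demo <- round(rexp(126, rate = 1 / 7), 1)

# Moment-based sample skewness (positive values indicate a right tail)
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

skew_val <- skewness(pct_income_demo)
sw       <- shapiro.test(pct_income_demo)

cat("Skewness:", round(skew_val, 2),
    "| Mean:", round(mean(pct_income_demo), 1),
    "| Median:", round(median(pct_income_demo), 1),
    "| Shapiro-Wilk p:", signif(sw$p.value, 3), "\n")
```

A mean well above the median, positive skewness, and a small Shapiro-Wilk p-value together point the same way as the histogram.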

4.5 Anscombe’s Quartet — Why EDA Must Precede Modelling

Anscombe’s Quartet (Anscombe, 1973) demonstrates that four datasets with near-identical summary statistics (means, variances, and correlation) can exhibit radically different data structures. In this survey context, two income brackets could have identical mean budget allocations yet differ entirely in their distributional shape: one roughly normal, the other bimodal, with high earners split between minimal shoppers and power buyers. Fitting a regression model without first visualising distributions risks building on a misleading foundation.
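The quartet ships with base R as the `anscombe` data frame (columns `x1`..`x4`, `y1`..`y4`), so the point can be checked directly. The sketch below tabulates the summary statistics per set and then plots the four panels; the object name `quartet` is introduced here for illustration.

```r
# Summary statistics for each of the four Anscombe sets
quartet <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  data.frame(set = i,
             mean_x = mean(x), mean_y = mean(y),
             var_x  = var(x),  var_y  = var(y),
             cor_xy = cor(x, y))
}))
print(round(quartet, 2))

# The plots reveal what the table hides
op <- par(mfrow = c(2, 2), mar = c(4, 4, 2, 1))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       main = paste("Set", i), xlab = "x", ylab = "y", pch = 19)
}
par(op)
```

The table rows agree to about two decimal places, while the panels show a linear cloud, a curve, a line with one outlier, and a vertical cluster with one leverage point.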


5 Technique 2 — Data Visualisation

Note📚 Theory Recap — Data Visualisation

Data visualisation is the graphical encoding of data into visual objects — points, bars, lines, areas — so that patterns, comparisons, and anomalies become perceptible to the human eye. The theoretical foundations lie in Bertin’s Semiology of Graphics (1967), which classified visual variables (position, size, colour, shape, orientation, texture) by their perceptual properties, and Tufte’s principle of maximising the data-ink ratio — every drop of ink on a chart should encode information, not decoration (The Visual Display of Quantitative Information, 1983).

Wilkinson’s Grammar of Graphics (1999), implemented in R’s ggplot2 by Wickham (2016), provides a compositional framework: any chart is built by mapping variables to aesthetics (x, y, colour, size, fill), applying a geometric object (geom_bar, geom_point, geom_line), setting scales (continuous, discrete, log), and choosing a coordinate system. This grammar makes the link between data structure and chart type explicit and systematic.
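As a minimal sketch of that compositional idea, the chart below is assembled one grammar component at a time. The toy data frame and its column names are hypothetical and are not drawn from the survey.

```r
library(ggplot2)

# Hypothetical toy data (illustrative values only)
toy <- data.frame(
  income_band = factor(c("Low", "Low", "Mid", "Mid", "High", "High"),
                       levels = c("Low", "Mid", "High")),
  pct_income  = c(2.5, 5, 5, 7.5, 10, 20)
)

p <- ggplot(toy, aes(x = income_band, y = pct_income)) +    # data + aesthetic mapping
  geom_boxplot(fill = "grey85") +                           # geometric object
  geom_point(colour = "#1F3864", size = 2) +                # second layer, same mapping
  scale_y_continuous(labels = function(v) paste0(v, "%")) + # scale
  coord_flip() +                                            # coordinate system
  theme_minimal()
p
```

Swapping `geom_boxplot()` for `geom_violin()` changes the chart type without touching the data or the mapping, which is the practical payoff of the grammar.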

Choosing the right chart type:

| Data question | Appropriate chart | Why |
|---|---|---|
| Distribution of one variable | Histogram / density plot | Shows shape, spread, and modality |
| Comparison across groups | Bar chart / box plot | Aligns values on a common axis for easy comparison |
| Relationship between two continuous variables | Scatter plot | Reveals direction, strength, and outliers |
| Part-to-whole composition | Pie chart (≤ 6 slices) / stacked bar | Communicates proportional share |
| Trend over time | Line chart | The connected line encodes continuity |

Technique justification: Chapter 10 of Adi (2026) establishes that the grammar of graphics provides a systematic framework for choosing the right chart type for the data structure and question at hand. For a personal shopper, visualisations that map income × sector × spend into a 2-D targeting space are operationally more actionable than model coefficients alone.

5.1 Five-Plot Narrative: “Who Shops Internationally, How Much, and When”

Show code
# Income bracket order
income_order <- c("Below ₦150,000",
                  "₦150,000 – ₦299,999",
                  "₦300,000 – ₦499,999",
                  "₦500,000 – ₦999,999",
                  "₦1,000,000 – ₦2,499,999",
                  "₦2,500,000 and above")
income_labels <- c("<150k","150-299k","300-499k","500-999k","1M-2.5M","2.5M+")

p1 <- survey_mod |>
  filter(income_bracket != "Prefer not to say") |>
  mutate(income_f2 = factor(income_bracket, levels = income_order,
                            labels = income_labels)) |>
  ggplot(aes(x = income_f2, y = pct_income, fill = income_f2)) +
  geom_violin(alpha = 0.6, trim = FALSE) +
  geom_jitter(width = 0.15, alpha = 0.7, size = 2, colour = "white") +
  stat_summary(fun = median, geom = "crossbar", width = 0.5,
               colour = pal[2], linewidth = 0.8) +
  scale_fill_manual(values = pal, guide = "none") +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(title = "Budget Share Rises with Income",
       subtitle = "Each point = one respondent; red bar = median",
       x = "Monthly Income (NGN)", y = "% Income to Intl Shopping") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        plot.title = element_text(face = "bold", colour = pal[1]))
p1

Plot 1 — Income bracket vs budget allocation (violin + jitter)
Show code
# Summary for ordering (by median) and n= annotations
sector_summary2 <- survey_mod |>
  group_by(sector_clean) |>
  summarise(n   = n(),
            med = median(pct_income, na.rm = TRUE),
            .groups = "drop") |>
  filter(n >= 2) |>
  arrange(med)

# Build plot dataset with factor ordered by median
plot2_data <- survey_mod |>
  filter(sector_clean %in% sector_summary2$sector_clean) |>
  mutate(sector_clean = factor(sector_clean,
                               levels = sector_summary2$sector_clean))

p2 <- ggplot(plot2_data, aes(x = sector_clean, y = pct_income)) +
  geom_boxplot(aes(fill = sector_clean),
               alpha       = 0.50,
               outlier.shape = NA,   # points shown via jitter below
               width       = 0.55,
               colour      = "grey40") +
  geom_jitter(width = 0.18, alpha = 0.70, size = 2.2, colour = pal[1]) +
  # n= labels at the left edge
  geom_text(data = sector_summary2,
            aes(x = sector_clean, y = -0.8, label = paste0("n=", n)),
            hjust = 1, size = 3, colour = "grey45") +
  scale_fill_manual(
    values = colorRampPalette(pal)(nrow(sector_summary2)),
    guide  = "none") +
  scale_y_continuous(labels  = function(x) paste0(x, "%"),
                     expand  = expansion(mult = c(0.15, 0.08))) +
  coord_flip() +
  labs(
    title    = "Budget Share by Sector: Distributions Reveal the True Picture",
    subtitle = "Box = IQR; centre line = median; each point = one respondent | sectors with ≥2 respondents",
    x        = NULL,
    y        = "% of Monthly Income Allocated to International Shopping",
    caption  = "Caution: sectors with fewer than 5 respondents have wide uncertainty — interpret medians carefully"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title    = element_text(face = "bold", colour = pal[1]),
    plot.subtitle = element_text(size = 9,  colour = "grey40"),
    plot.caption  = element_text(size = 8,  colour = "grey50", face = "italic")
  )
p2

Plot 2 — Budget share distribution by sector (box plot with individual observations)

Chart note: This box plot, ordered by median, gives a robust comparison that is resistant to distortion by outliers in small groups. Healthcare shows the highest median allocation (15.5%), followed by Oil & Gas, Entrepreneurs, Civil Service, and Retail / Trading (all at a 7.5% median). Banking & Finance — the best-sampled sector (n = 40) — has a comparatively low median of 2.5%, indicating that volume of respondents does not equate to high spend intensity. Civil Service illustrates the risk of relying solely on means: its mean of 15.6% is driven by extreme variance (SD = 13.9pp) across just five respondents. Sectors with fewer than five data points (visible as sparse jitter clusters) should be treated as indicative rather than definitive.
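The sensitivity of the mean in a five-respondent group can be illustrated with a small hypothetical vector. The values below are invented for illustration and are not the Civil Service responses.

```r
# Hypothetical five-respondent group (illustrative values, not survey data)
small_group <- c(2.5, 2.5, 7.5, 20, 35)

stats <- c(mean   = mean(small_group),
           median = median(small_group),
           sd     = sd(small_group))
print(round(stats, 1))

# Dropping the single largest value moves the mean far more than the median
trimmed <- small_group[-which.max(small_group)]
shift <- c(mean_shift   = mean(small_group) - mean(trimmed),
           median_shift = median(small_group) - median(trimmed))
print(round(shift, 1))
```

With so few observations one respondent can move the mean by several percentage points, which is why the plot orders sectors by median.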

Show code
p3 <- survey_clean |>
  count(method_short) |>
  mutate(method_short = fct_reorder(method_short, n)) |>
  ggplot(aes(x = method_short, y = n, fill = method_short)) +
  geom_col(alpha = 0.85) +
  geom_text(aes(label = n), hjust = -0.2, size = 3.5, colour = pal[1]) +
  scale_fill_manual(values = pal, guide = "none") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  coord_flip() +
  labs(title = "Direct Own-Card Purchase Is the Most Common Channel",
       x = NULL, y = "Number of Respondents") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face="bold", colour=pal[1]))
p3

Plot 3 — Shopping method breakdown
Show code
# Manually expand multi-select column without separate_rows
all_seasons <- unlist(strsplit(survey_clean$shopping_months, ", "))
all_seasons <- trimws(all_seasons)
all_seasons <- all_seasons[!all_seasons %in% c("On need basis","NA","")]
season_df   <- as.data.frame(table(Season = all_seasons), stringsAsFactors=FALSE)
names(season_df) <- c("shopping_months","n")
season_df$shopping_months <- gsub(" \\(.*\\)","", season_df$shopping_months)
season_df <- season_df[order(season_df$n),]
season_counts <- season_df
season_counts$shopping_months <- factor(season_counts$shopping_months,
                                        levels = season_counts$shopping_months)

p4 <- ggplot(season_counts, aes(x = shopping_months, y = n, fill = n)) +
  geom_col(alpha = 0.85) +
  geom_text(aes(label = n), hjust = -0.2, size = 3.5, colour = pal[1]) +
  scale_fill_gradient(low = pal[3], high = pal[2], guide = "none") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  coord_flip() +
  labs(title = "Black Friday & Christmas Dominate Shopping Calendars",
       subtitle = "Multi-select: respondents could choose multiple seasons",
       x = NULL, y = "Number of Respondents") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face="bold", colour=pal[1]))
p4

Plot 4 — Shopping season frequency (multi-select)
Show code
p5 <- survey_clean |>
  mutate(abandoned_label = if_else(ever_abandoned,
                                   "Ever Abandoned (Payment Friction)",
                                   "Never Abandoned")) |>
  ggplot(aes(x = factor(satisfaction), fill = abandoned_label)) +
  geom_bar(position = "fill", alpha = 0.85) +
  scale_fill_manual(values = c(pal[2], pal[1]), name = NULL) +
  scale_y_continuous(labels = percent) +
  labs(title = "Payment Friction Depresses Satisfaction",
       subtitle = "Proportion of abandoners vs non-abandoners at each satisfaction level",
       x = "Satisfaction (1=Very Dissatisfied, 5=Very Satisfied)",
       y = "Proportion") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face="bold", colour=pal[1]),
        legend.position = "bottom")
p5

Plot 5 — Satisfaction score distribution by payment abandonment

5.2 Segment Opportunity Quadrant

Show code
segment_data <- survey_mod |>
  group_by(sector_clean) |>
  summarise(
    avg_spend    = mean(pct_income, na.rm = TRUE),
    freq_rate    = mean(freq_num >= 6, na.rm = TRUE),
    n            = n(),
    .groups = "drop"
  ) |>
  filter(n >= 2)

med_spend <- median(segment_data$avg_spend)
med_freq  <- median(segment_data$freq_rate)

# Build a palette large enough for however many sectors pass the n>=2 filter
n_seg    <- nrow(segment_data)
quad_pal <- if (n_seg <= length(pal)) pal[seq_len(n_seg)] else
              colorRampPalette(pal)(n_seg)

p_quad <- ggplot(segment_data,
                 aes(x = avg_spend, y = freq_rate,
                     size = n, colour = sector_clean, label = sector_clean)) +
  annotate("rect", xmin=-Inf, xmax=med_spend, ymin=med_freq, ymax=Inf,
           fill="#EEF2F7", alpha=0.5) +
  annotate("rect", xmin=med_spend, xmax=Inf, ymin=med_freq, ymax=Inf,
           fill="#E2F0D9", alpha=0.5) +
  annotate("rect", xmin=-Inf, xmax=med_spend, ymin=-Inf, ymax=med_freq,
           fill="#FFF2CC", alpha=0.5) +
  annotate("rect", xmin=med_spend, xmax=Inf, ymin=-Inf, ymax=med_freq,
           fill="#FCE4D6", alpha=0.5) +
  annotate("text", x=med_spend*0.5, y=Inf, label="High Freq / Lower Spend",
           vjust=2, colour="grey40", size=3.2, fontface="italic") +
  annotate("text", x=Inf, y=Inf, label="PRIORITY TARGETS",
           vjust=2, hjust=1.1, colour=pal[3], size=4, fontface="bold") +
  annotate("text", x=med_spend*0.5, y=-Inf, label="Develop",
           vjust=-1, colour="grey40", size=3.2, fontface="italic") +
  annotate("text", x=Inf, y=-Inf, label="High Value / Low Freq",
           vjust=-1, hjust=1.1, colour=pal[2], size=3.2, fontface="italic") +
  geom_vline(xintercept = med_spend, linetype="dashed", colour="grey60") +
  geom_hline(yintercept = med_freq,  linetype="dashed", colour="grey60") +
  geom_point(alpha = 0.75) +
  geom_label_repel(size = 3, max.overlaps = 20, box.padding = 0.5) +
  scale_size_continuous(range = c(4, 12), name = "n respondents") +
  scale_colour_manual(values = quad_pal, guide = "none") +
  scale_x_continuous(labels = function(x) paste0(x,"%"),
                     name = "Average % Income Allocated (Value Proxy)") +
  scale_y_continuous(labels = percent,
                     name = "% High-Frequency Shoppers (>=6 orders/24mo)") +
  labs(title = "Segment Opportunity Map: Which Sectors to Target First?",
       subtitle = "Top-right = High Value AND High Frequency (Priority Targets)",
       caption = "Point size = number of respondents in segment") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face="bold", colour=pal[1]),
        legend.position = "bottom")
p_quad

Client Segment Opportunity Quadrant — value vs engagement by sector

Business interpretation: Sectors appearing in the top-right quadrant combine high budget allocation with high purchase frequency, so they are the natural first targets for a personal shopper. Sectors in the bottom-right quadrant represent high-value but low-frequency buyers who may require re-engagement strategies (seasonal promotions, Black Friday packages). Separately, respondents who use the Combination shopping method already mix channels, which suggests openness to outsourcing and makes them receptive to a professional personal shopper proposition.
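The quadrant assignment itself is a median split on two axes. The sketch below labels each sector explicitly; the sector names and values are hypothetical, and the choice to place points that sit exactly on a median in the higher group is an assumption of this sketch.

```r
# Hypothetical sector-level summary (illustrative values only)
seg <- data.frame(
  sector    = c("A", "B", "C", "D", "E"),
  avg_spend = c(4, 12, 9, 3, 15),              # mean % of income allocated
  freq_rate = c(0.10, 0.40, 0.05, 0.30, 0.20)  # share with >= 6 orders in 24 months
)

med_spend <- median(seg$avg_spend)
med_freq  <- median(seg$freq_rate)

# Ties on the median go to the higher side (assumption)
seg$quadrant <- with(seg, ifelse(
  avg_spend >= med_spend & freq_rate >= med_freq, "Priority target",
  ifelse(avg_spend >= med_spend, "High value / low freq",
  ifelse(freq_rate >= med_freq,  "High freq / lower spend", "Develop"))))

print(seg)
```

Because both cut-offs are sample medians, the labels are relative to the sectors included; adding or dropping a small sector can move a neighbour across a boundary.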


6 Technique 3 — Hypothesis Testing

Note📚 Theory Recap — Hypothesis Testing

Hypothesis testing is a formal statistical procedure for deciding between two competing claims about a population. The null hypothesis (H₀) asserts that there is no effect, no difference, or no association; the alternative hypothesis (H₁) asserts its presence. The procedure was formalised by Fisher (1925) and extended by Neyman and Pearson (1933) into the decision-theoretic framework still used today.

The core logic:

  1. Assume H₀ is true.
  2. Compute a test statistic from the sample data (e.g., F, χ², z, t, or H for Kruskal-Wallis).
  3. Derive the p-value — the probability of observing a result at least as extreme as the data if H₀ were true.
  4. If p < α (typically 0.05), reject H₀; the result is deemed statistically significant.
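The four steps can be traced in a few lines of R. The sketch below runs a Kruskal-Wallis test on simulated data; the data frame `demo`, its group labels, and the upward shift in the third group are all assumptions made for illustration.

```r
set.seed(1)
# Hypothetical data: three groups, the third shifted upward
demo <- data.frame(
  group = rep(c("g1", "g2", "g3"), each = 20),
  value = c(rexp(20, 1 / 5), rexp(20, 1 / 5), rexp(20, 1 / 5) + 8)
)

alpha <- 0.05

# Steps 1-3: assume H0, compute the test statistic H, derive the p-value
kw <- kruskal.test(value ~ group, data = demo)

# Step 4: compare the p-value with alpha
decision <- if (kw$p.value < alpha) "reject H0" else "fail to reject H0"

cat("H =", round(kw$statistic, 2),
    "| df =", kw$parameter,
    "| p =", signif(kw$p.value, 3),
    "|", decision, "\n")
```

The degrees of freedom equal the number of groups minus one, so three groups give df = 2.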

Parametric vs non-parametric tests:

Parametric tests (ANOVA, t-test) assume the outcome is normally distributed within groups. When this assumption is violated, which is common with small samples or ordinal data, non-parametric alternatives are preferred because they work on ranks or exact counts and do not require a normal population (the Kruskal-Wallis test is read as a comparison of medians only when the groups have similarly shaped distributions):

| Parametric | Non-parametric equivalent | Use case |
|---|---|---|
| One-way ANOVA | Kruskal-Wallis H | Compare medians across 3+ independent groups |
| Chi-square test of independence | Fisher’s Exact Test | Categorical association with small expected cell counts (< 5) |
| Two-sample z-test for proportions | Fisher’s Exact Test on the 2×2 table ("two-proportion z-test" is another name for the z-test itself) | Compare proportions between two independent groups |
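How the proportion tests in the table relate to one another can be checked on a small example. The counts below are hypothetical; the sketch computes the pooled two-proportion z statistic by hand, compares it with `prop.test()` without continuity correction, and runs Fisher's exact test on the same 2×2 table.

```r
# Hypothetical counts: successes out of totals in two independent groups
x <- c(30, 18)
n <- c(60, 60)

# Two-proportion z statistic with pooled standard error
p_hat  <- x / n
p_pool <- sum(x) / sum(n)
z <- (p_hat[1] - p_hat[2]) / sqrt(p_pool * (1 - p_pool) * sum(1 / n))

# prop.test() without continuity correction reports X-squared = z^2
pt <- prop.test(x, n, correct = FALSE)

# Same data as a 2x2 table: rows = groups, columns = success / failure
tab      <- matrix(c(x, n - x), nrow = 2)
expected <- chisq.test(tab, correct = FALSE)$expected
ft       <- fisher.test(tab)

cat("z^2 =", round(z^2, 3),
    "| X-squared =", round(unname(pt$statistic), 3),
    "| min expected count =", round(min(expected), 1),
    "| Fisher p =", signif(ft$p.value, 3), "\n")
```

When the smallest expected count falls below 5, the exact test is the safer choice; otherwise the z-test and the chi-square test give the same answer.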

Type I and Type II errors: Rejecting a true H₀ (Type I error, probability = α) and failing to reject a false H₀ (Type II error, probability = β) represent the fundamental trade-off; effect size measures (η², Cramér’s V, Cohen’s h) quantify practical significance independently of sample size.
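Two of the effect sizes named above are short enough to compute by hand. The helper names `cohens_h` and `cramers_v` and the example inputs are assumptions of this sketch, not functions from the analysis.

```r
# Cohen's h: difference between two arcsine-transformed proportions
cohens_h <- function(p1, p2) 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))

# Cramér's V from a contingency table of counts
cramers_v <- function(tab) {
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Hypothetical inputs: 50% vs 30%, and the matching 2x2 table of counts
h   <- cohens_h(0.50, 0.30)
tab <- matrix(c(30, 18, 30, 42), nrow = 2)
v   <- cramers_v(tab)

cat("Cohen's h =", round(h, 3), "| Cramér's V =", round(v, 3), "\n")
```

Neither statistic grows with sample size for fixed proportions, which is what lets them describe practical significance separately from the p-value.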

Technique justification: Chapter 11 of Adi (2026) establishes that hypothesis testing converts observed sample differences into probabilistic statements about the population. For a personal shopper deciding whether to target high-income segments preferentially, a statistically significant income-spend association is far more actionable than an observed mean difference that could be sampling noise.

Show code
# Prepare test datasets
mod_df <- survey_mod |>
  filter(income_bracket != "Prefer not to say")

income_groups <- mod_df |>
  group_by(income_f) |>
  filter(n() >= 2) |>
  ungroup()

6.1 Test 1 — Does Budget Allocation Differ Across Income Brackets?

H₀: The median percentage of income allocated to international shopping is equal across all income brackets.

H₁: At least one income bracket has a significantly different median allocation.

Show code
# Check normality per group
normality_check <- income_groups |>
  group_by(income_bracket) |>
  summarise(
    n       = n(),
    p_sw    = if(n() >= 3 && n() <= 50)
                shapiro.test(pct_income)$p.value
              else NA_real_,
    normal  = if(!is.na(p_sw)) p_sw > 0.05 else NA,
    .groups = "drop"
  )

kable(normality_check,
      col.names = c("Income Bracket","n","Shapiro-Wilk p","Normal?"),
      caption = "Table 7: Normality check by income group",
      digits = 3) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width=FALSE)
Test 1: Shapiro-Wilk normality check by income bracket

| Income Bracket | n | Shapiro-Wilk p | Normal? |
|---|---|---|---|
| Below ₦150,000 | 9 | < 0.001 | FALSE |
| ₦1,000,000 – ₦2,499,999 | 28 | < 0.001 | FALSE |
| ₦150,000 – ₦299,999 | 14 | < 0.001 | FALSE |
| ₦2,500,000 and above | 20 | < 0.001 | FALSE |
| ₦300,000 – ₦499,999 | 14 | 0.008 | FALSE |
| ₦500,000 – ₦999,999 | 28 | < 0.001 | FALSE |
Show code
# Non-parametric test (Kruskal-Wallis) because Shapiro-Wilk rejects normality in every group
kw_test <- kruskal.test(pct_income ~ income_bracket, data = income_groups)

# Effect size: eta-squared based on H, (H - k + 1) / (n - k).
# Epsilon-squared would instead be H / (n - 1); the object name eps_sq is kept
# only so that later chunks referring to it still run.
n_tot   <- nrow(income_groups)
k_grps  <- length(unique(income_groups$income_bracket))
eps_sq  <- (kw_test$statistic - k_grps + 1) / (n_tot - k_grps)

cat("Kruskal-Wallis chi-squared =", round(kw_test$statistic,3),
    "| df =", kw_test$parameter,
    "| p-value =", round(kw_test$p.value, 4),
    "\nEta-squared (H-based effect size) =", round(eps_sq, 3), "\n")
Kruskal-Wallis chi-squared = 9.976 | df = 5 | p-value = 0.0759 
Eta-squared (H-based effect size) = 0.047 
Show code
income_groups |>
  mutate(income_f2 = factor(income_bracket, levels=income_order, labels=income_labels)) |>
  ggplot(aes(x = income_f2, y = pct_income, fill = income_f2)) +
  geom_boxplot(alpha = 0.7, outlier.colour = pal[2]) +
  scale_fill_manual(values = pal, guide="none") +
  scale_y_continuous(labels = function(x) paste0(x,"%")) +
  labs(title = "Test 1: Budget Allocation by Income Bracket",
       x = "Income Bracket (NGN)", y = "% Income to Intl Shopping") +
  theme_minimal(base_size=11) +
  theme(axis.text.x = element_text(angle=30, hjust=1),
        plot.title = element_text(face="bold", colour=pal[1]))

Budget allocation by income bracket (median + IQR)

Interpretation: The Kruskal-Wallis test yields χ²(5) = 9.98, p = 0.0759. Because p > 0.05, H₀ cannot be rejected at the 5% level: the data do not show that median budget share differs across income brackets, although the result would be significant at the 10% level. The effect size of 0.047 is small. Business implication: Income bracket alone is weak evidence for segmenting on budget share. Concentrating client acquisition on the ₦1M+ monthly income segment remains a working hypothesis that needs support from other evidence, since this test does not establish that consumers in that bracket allocate a larger share of income to international purchases.
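Two rank-based effect sizes are often confused, and both can be computed directly from the reported statistic. The sketch below is illustrative only: it plugs in the values printed above (H = 9.976, six brackets, and n = 113 obtained by summing the group sizes in the normality table), and the variable names are hypothetical.

```r
# Values taken from the Kruskal-Wallis output and the normality table above
H <- 9.976   # Kruskal-Wallis statistic
k <- 6       # number of income brackets
n <- 113     # 9 + 28 + 14 + 20 + 14 + 28 respondents

# Eta-squared based on H: the formula used in the analysis code
eta_sq_H <- (H - k + 1) / (n - k)

# Epsilon-squared: H divided by n - 1
eps_sq_alt <- H / (n - 1)

cat("eta-squared (H) =", round(eta_sq_H, 3),
    "| epsilon-squared =", round(eps_sq_alt, 3), "\n")
```

The two formulas give different magnitudes for the same test, so it is worth stating which one a report uses.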


6.2 Test 2 — Is Employment Sector Independent of Shopping Method?

H₀: Employment sector and shopping method are independent.

H₁: Sector and shopping method are associated — certain sectors prefer specific channels.

Show code
# Contingency table
ct_raw <- table(survey_clean$sector_clean, survey_clean$method_short)

# Collapse small cells: keep top sectors only
top_sectors <- names(sort(rowSums(ct_raw), decreasing=TRUE))[1:6]
ct_top <- ct_raw[rownames(ct_raw) %in% top_sectors, ]

cat("Contingency table (top 6 sectors):\n")
Contingency table (top 6 sectors):
Show code
print(ct_top)
                               
                                Combination Direct (Own Card) Freight Forwarder
  Banking & Finance                       4                18                 2
  Education                               1                 9                 0
  Entrepreneur / Business Owner           0                12                 0
  Legal / Consulting                      0                 3                 0
  Oil and Gas                             3                 6                 1
  Technology / Telecoms                   2                 9                 2
                               
                                Friend/Family Abroad Personal Shopper
  Banking & Finance                               11                5
  Education                                        0                1
  Entrepreneur / Business Owner                    6                2
  Legal / Consulting                               3                0
  Oil and Gas                                      1                1
  Technology / Telecoms                            3                2
Show code
cat("\nExpected cell counts:\n")

Expected cell counts:
Show code
print(round(chisq.test(ct_top)$expected,1))
                               
                                Combination Direct (Own Card) Freight Forwarder
  Banking & Finance                     3.7              21.3               1.9
  Education                             1.0               5.9               0.5
  Entrepreneur / Business Owner         1.9              10.7               0.9
  Legal / Consulting                    0.6               3.2               0.3
  Oil and Gas                           1.1               6.4               0.6
  Technology / Telecoms                 1.7               9.6               0.8
                               
                                Friend/Family Abroad Personal Shopper
  Banking & Finance                              9.0              4.1
  Education                                      2.5              1.1
  Entrepreneur / Business Owner                  4.5              2.1
  Legal / Consulting                             1.3              0.6
  Oil and Gas                                    2.7              1.2
  Technology / Telecoms                          4.0              1.9
Show code
# Fisher's exact due to small expected cell counts
fisher_res <- fisher.test(ct_top, simulate.p.value = TRUE, B = 10000)
cat("Fisher's Exact Test p-value =", round(fisher_res$p.value,4),"\n")
Fisher's Exact Test p-value = 0.491 
Show code
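# Cramer's V (approximate from chi-squared statistic)
# Note: the denominator uses the full sample size, nrow(survey_clean). ct_top holds only the
# 107 respondents in the top six sectors, so sum(ct_top) is the matching n and gives a
# slightly larger V (about 0.21).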
chi_approx  <- suppressWarnings(chisq.test(ct_top))
cramers_v   <- sqrt(chi_approx$statistic /
                (nrow(survey_clean) * (min(dim(ct_top)) - 1)))
cat("Cramer's V (effect size) =", round(cramers_v,3),"\n")
Cramer's V (effect size) = 0.192 

Interpretation: Fisher’s Exact Test (p = 0.491) does not provide sufficient evidence at the 5% level to reject independence. Cramer’s V = 0.192 indicates a weak association. Business implication: Descriptively, Oil & Gas (4 of 12) and Banking & Finance (9 of 40) respondents use Combination methods or Personal Shoppers at somewhat higher rates than Education (2 of 11) or Entrepreneur (2 of 20) respondents, but these differences are not statistically distinguishable in this sample. Sector should therefore be treated as a tentative signal of channel flexibility, not an established one.
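Cramer's V depends on which n goes into the denominator: it should be the number of observations in the table being tested. The following is a minimal sketch that rebuilds the top-six-sector table from the counts printed above and computes V on the table total. The object names are hypothetical, and the chi-squared approximation is used only to obtain the statistic, as in the original code.

```r
# Counts copied from the printed contingency table (top 6 sectors)
ct <- matrix(
  c(4, 18, 2, 11, 5,
    1,  9, 0,  0, 1,
    0, 12, 0,  6, 2,
    0,  3, 0,  3, 0,
    3,  6, 1,  1, 1,
    2,  9, 2,  3, 2),
  nrow = 6, byrow = TRUE,
  dimnames = list(
    sector = c("Banking & Finance", "Education", "Entrepreneur / Business Owner",
               "Legal / Consulting", "Oil and Gas", "Technology / Telecoms"),
    method = c("Combination", "Direct (Own Card)", "Freight Forwarder",
               "Friend/Family Abroad", "Personal Shopper")
  )
)

# Chi-squared statistic (warning about small expected counts suppressed)
chi_sq <- unname(suppressWarnings(chisq.test(ct))$statistic)

n_table <- sum(ct)            # respondents actually in the table
df_min  <- min(dim(ct)) - 1   # min(rows, columns) - 1

cramers_v_table <- sqrt(chi_sq / (n_table * df_min))
cat("n in table =", n_table, "| Cramer's V =", round(cramers_v_table, 3), "\n")
```

Using the table total instead of the full sample size raises V slightly, but the association remains weak either way.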


6.3 Test 3 — Do High-Income Earners Experience Less Payment Friction?

H₀: The proportion of respondents who have ever abandoned a purchase due to payment difficulties is equal between high-income (≥ ₦500k) and lower-income consumers.

H₁: High-income consumers are less likely to abandon purchases (better payment access).

Show code
test3_df <- survey_clean |>
  filter(income_bracket != "Prefer not to say", !is.na(income_num)) |>
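  # income_num appears to hold bracket midpoints in ₦ thousands (see Limitation 3), so
  # >= 750 selects the ₦500,000 – ₦999,999 bracket and above, matching the ">=500k" label
  mutate(high_income_label = if_else(income_num >= 750,
                                     "High Income (>=500k)","Lower Income (<500k)"))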

prop_tbl <- test3_df |>
  group_by(high_income_label) |>
  summarise(n = n(),
            n_abandoned = sum(ever_abandoned),
            pct = round(mean(ever_abandoned)*100,1))

kable(prop_tbl,
      col.names = c("Income Group","n","Abandoned (n)","Abandoned (%)"),
      caption = "Table 5: Payment abandonment by income group") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)
Table 5: Payment abandonment by income group
Income Group n Abandoned (n) Abandoned (%)
High Income (>=500k) 80 53 66.2
Lower Income (<500k) 39 21 53.8
Show code
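# Groups are ordered (High, Lower), so alternative = "greater" tests whether the
# high-income abandonment rate exceeds the lower-income rate
prop_test <- prop.test(x = prop_tbl$n_abandoned, n = prop_tbl$n,
                       alternative = "greater", correct = FALSE)
cat("\nTwo-proportion z-test: p-value =", round(prop_test$p.value,4),"\n")

Two-proportion z-test: p-value = 0.0951 

Interpretation: p = 0.0951, so H₀ is not rejected at the 5% level. Note the direction: 66.2% of high-income respondents report abandoning a purchase, against 53.8% of lower-income respondents, which is the opposite of what H₁ predicts. The reported p-value belongs to the alternative that high-income consumers abandon more often; tested in the direction H₁ states (alternative = "less"), the p-value would be approximately 0.905. Business implication: These data give no support for the idea that high-income clients are lower-friction. Payment friction is common in both groups, so handling payment on the client's behalf is a selling point across income levels, not an efficiency gain specific to high earners.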
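The direction of a one-sided test is easy to get backwards, because prop.test() compares the first group with the second in the order supplied. A minimal sketch, using only the counts in Table 5 and hypothetical object names, runs the test in both directions:

```r
# Counts from Table 5: abandoned purchases and group sizes, ordered (High, Lower)
abandoned <- c(high = 53, lower = 21)
group_n   <- c(high = 80, lower = 39)

# Alternative "greater": high-income rate is GREATER than the lower-income rate
p_greater <- prop.test(abandoned, group_n,
                       alternative = "greater", correct = FALSE)$p.value

# Alternative "less": high-income rate is LESS than the lower-income rate
p_less <- prop.test(abandoned, group_n,
                    alternative = "less", correct = FALSE)$p.value

cat("p (high > lower) =", round(p_greater, 4),
    "| p (high < lower) =", round(p_less, 4), "\n")
```

Without the continuity correction the two one-sided p-values sum to 1, so a small p-value in one direction implies a large one in the other.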


6.4 Hypothesis Testing Results Summary

Show code
ht_summary <- tibble(
  Test = c("Test 1: Budget share ~ Income bracket",
           "Test 2: Sector ~ Shopping method",
           "Test 3: Abandonment ~ Income group"),
  Method = c("Kruskal-Wallis", "Fisher's Exact", "Two-proportion z-test"),
  Statistic = c(paste0("H = ", round(kw_test$statistic,2)),
                paste0("p = ", round(fisher_res$p.value,4)),
                paste0("z-test p = ", round(prop_test$p.value,4))),
  `p-value` = c(round(kw_test$p.value,4),
                round(fisher_res$p.value,4),
                round(prop_test$p.value,4)),
  `Effect Size` = c(paste0("ε² = ", round(eps_sq,3)),
                    paste0("V = ", round(cramers_v,3)),
                    paste0("Δprop = ", round(diff(prop_tbl$pct),1), "%")),
  Decision = c(
    if(kw_test$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀",
    if(fisher_res$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀",
    if(prop_test$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀"
  )
)

kable(ht_summary,
      caption = "Table 6: Hypothesis testing results summary") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width=TRUE)
Summary of hypothesis tests
Test Method Statistic p-value Effect Size Decision
Test 1: Budget share ~ Income bracket Kruskal-Wallis H = 9.98 0.0759 ε² = 0.047 Fail to Reject H₀
Test 2: Sector ~ Shopping method Fisher's Exact p = 0.491 0.4910 V = 0.192 Fail to Reject H₀
Test 3: Abandonment ~ Income group Two-proportion z-test z-test p = 0.0951 0.0951 Δprop = -12.4% Fail to Reject H₀

7 Technique 4 — Correlation Analysis

Note📚 Theory Recap — Correlation Analysis

Correlation analysis quantifies the strength and direction of the statistical association between two variables, producing a dimensionless coefficient bounded between −1 (perfect inverse relationship) and +1 (perfect direct relationship), with 0 indicating no linear association.

Three principal measures:

| Coefficient | Full name | Assumptions | Best used when |
|---|---|---|---|
| Pearson’s r | Product-moment correlation (Pearson, 1895) | Both variables continuous and approximately normal; relationship linear | Two interval/ratio variables without extreme outliers |
| Spearman’s ρ | Rank correlation (Spearman, 1904) | Monotonic relationship; no normality required | Ordinal data, non-normal distributions, or when outliers are present |
| Kendall’s τ | Concordance coefficient (Kendall, 1938) | Monotonic relationship | Small samples with many ties; more robust than Spearman in those conditions |

Interpreting strength (Cohen, 1988 conventions): |r| < 0.1 negligible; 0.1–0.3 small; 0.3–0.5 moderate; > 0.5 large.
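The difference between the three coefficients is easiest to see on a relationship that is monotonic but not linear. A minimal sketch on synthetic data follows; the vectors x and y are invented for illustration and are not survey variables.

```r
# Synthetic data: y rises with x at every step, but along a curve
x <- 1:20
y <- exp(x / 3)

r_pearson  <- cor(x, y, method = "pearson")    # sensitive to the curvature
r_spearman <- cor(x, y, method = "spearman")   # depends on ranks only
r_kendall  <- cor(x, y, method = "kendall")    # depends on concordant pairs only

cat("Pearson =", round(r_pearson, 3),
    "| Spearman =", r_spearman,
    "| Kendall =", r_kendall, "\n")
```

Spearman and Kendall both equal 1 because the ordering of x and y agrees perfectly, while Pearson falls below 1 because the points do not lie on a straight line.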

Critical distinction — correlation vs causation: A strong correlation between X and Y does not establish that X causes Y; a lurking third variable may drive both. Establishing causality requires experimental design or causal inference methods beyond the scope of correlational analysis.

Role before regression: Examining the correlation matrix before fitting a regression model serves two purposes: (1) identifying strong predictor–outcome correlations that guide variable selection, and (2) detecting multicollinearity — when two predictors correlate above |r| ≈ 0.8, including both inflates standard errors and destabilises coefficient estimates. Partial correlation extends the idea by measuring the association between two variables after statistically removing the influence of one or more control variables.

Technique justification: Chapter 13 of Adi (2026) covers Pearson, Spearman, and Kendall correlations and emphasises that correlation analysis before regression guards against multicollinearity and identifies the most informative predictors. With ordinal data derived from income brackets, Spearman rank correlation is the methodologically appropriate primary measure.

Show code
cor_df_pre <- survey_mod[survey_mod$income_bracket != "Prefer not to say",
                          c("income_num","pct_income","spend_usd",
                            "freq_num","satisfaction","total_loss")]
cor_df <- cor_df_pre[complete.cases(cor_df_pre), ]

cat("Correlation matrix dataset: n =", nrow(cor_df), "rows,",
    ncol(cor_df), "variables\n")
Correlation matrix dataset: n = 113 rows, 6 variables

7.1 Pearson Correlation Matrix

Show code
pearson_mat <- cor(cor_df, method="pearson", use="complete.obs")
colnames(pearson_mat) <- rownames(pearson_mat) <-
  c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss")

ggcorrplot(pearson_mat,
           method      = "circle",
           type        = "lower",
           lab         = TRUE,
           lab_size    = 4,
           colors      = c(pal[2], "white", pal[1]),
           title       = "Pearson Correlation Matrix",
           ggtheme     = theme_minimal(base_size=12),
           hc.order    = FALSE) +
  theme(plot.title = element_text(face="bold", colour=pal[1]))

Pearson correlation heatmap

7.2 Spearman Correlation Matrix

Show code
spearman_mat <- cor(cor_df, method="spearman", use="complete.obs")
colnames(spearman_mat) <- rownames(spearman_mat) <-
  c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss")

ggcorrplot(spearman_mat,
           method      = "circle",
           type        = "lower",
           lab         = TRUE,
           lab_size    = 4,
           colors      = c(pal[2], "white", pal[1]),
           title       = "Spearman Rank Correlation Matrix",
           ggtheme     = theme_minimal(base_size=12)) +
  theme(plot.title = element_text(face="bold", colour=pal[1]))

Spearman rank correlation heatmap (preferred for ordinal data)

7.3 Top Correlation Pairs

Show code
# Get all pairs
var_names <- colnames(cor_df)
pairs_list <- combn(var_names, 2, simplify=FALSE)

cor_results <- map_dfr(pairs_list, function(pair) {
  ct <- cor.test(cor_df[[pair[1]]], cor_df[[pair[2]]],
                 method="spearman", exact=FALSE)
  tibble(Var1=pair[1], Var2=pair[2],
         rho = round(ct$estimate,3),
         p   = round(ct$p.value,4),
         sig = case_when(ct$p.value<0.001~"***",
                         ct$p.value<0.01~"**",
                         ct$p.value<0.05~"*",
                         ct$p.value<0.1~".",
                         TRUE~""))
}) |> arrange(desc(abs(rho)))

kable(head(cor_results,8),
      col.names = c("Variable 1","Variable 2","Spearman rho","p-value","Sig."),
      caption = "Table 7: Top 8 Spearman correlations") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE) |>
  footnote(general = "Significance: *** p<0.001, ** p<0.01, * p<0.05, . p<0.1")
Top correlation pairs with significance tests
Variable 1 Variable 2 Spearman rho p-value Sig.
spend_usd freq_num 0.461 <0.0001 ***
pct_income spend_usd 0.433 <0.0001 ***
pct_income freq_num 0.369 0.0001 ***
spend_usd total_loss 0.359 0.0001 ***
spend_usd satisfaction 0.356 0.0001 ***
income_num spend_usd 0.351 0.0001 ***
pct_income satisfaction 0.333 0.0003 ***
freq_num satisfaction 0.321 0.0005 ***
Note:
Significance: *** p<0.001, ** p<0.01, * p<0.05, . p<0.1
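Rounding a very small p-value to four decimals prints it as zero, which reads as if the probability were exactly 0. One way to avoid that, sketched here with a hypothetical helper fmt_p() and made-up p-values, is to report a floor such as "<0.0001" instead:

```r
# Hypothetical helper: show p-values with a floor instead of rounding them to zero
fmt_p <- function(p, digits = 4) {
  floor_val <- 10^(-digits)
  ifelse(p < floor_val,
         paste0("<", formatC(floor_val, format = "f", digits = digits)),
         formatC(p, format = "f", digits = digits))
}

# Illustrative values only (not taken from the survey)
formatted <- fmt_p(c(2.3e-07, 0.00031, 0.0759))
print(formatted)
```

The helper returns character strings, so it belongs at the table-formatting step, after any numeric sorting or thresholding has been done.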

7.4 Partial Correlation (Each Pair Controlling for All Other Variables)

Show code
partial_result <- pcor(cor_df, method="spearman")

pc_mat <- round(partial_result$estimate, 3)
colnames(pc_mat) <- rownames(pc_mat) <-
  c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss")

kable(pc_mat,
      caption = "Table 8: Partial Spearman correlation matrix (each pair controlling for all others)") |>
  kable_styling(bootstrap_options=c("striped","hover","condensed"),
                full_width=FALSE)
Table 8: Partial Spearman correlation matrix (each pair controlling for all others)
Income Pct Income Spend/Order Frequency Satisfaction Total Loss
Income 1.000 -0.295 0.338 0.213 -0.057 0.026
Pct Income -0.295 1.000 0.309 0.223 0.157 0.127
Spend/Order 0.338 0.309 1.000 0.204 0.195 0.230
Frequency 0.213 0.223 0.204 1.000 0.162 0.034
Satisfaction -0.057 0.157 0.195 0.162 1.000 -0.033
Total Loss 0.026 0.127 0.230 0.034 -0.033 1.000
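A partial correlation is the correlation between what remains of two variables after the control has been regressed out of each. The sketch below illustrates the idea for a single control variable on synthetic data; the names x, y, and z are hypothetical, and the calculation is done by hand in base R instead of calling pcor().

```r
set.seed(42)
n <- 200
z <- rnorm(n)              # control variable
x <- 0.8 * z + rnorm(n)    # x and y are related only through z
y <- 0.8 * z + rnorm(n)

# Rank first so the result is a partial Spearman correlation
rx <- rank(x); ry <- rank(y); rz <- rank(z)

# Route 1: correlate the residuals left after regressing each variable on the control
partial_resid <- cor(resid(lm(rx ~ rz)), resid(lm(ry ~ rz)))

# Route 2: closed form from the three pairwise correlations
r_xy <- cor(rx, ry); r_xz <- cor(rx, rz); r_yz <- cor(ry, rz)
partial_formula <- (r_xy - r_xz * r_yz) /
  sqrt((1 - r_xz^2) * (1 - r_yz^2))

cat("raw rho =", round(r_xy, 3),
    "| partial rho =", round(partial_resid, 3), "\n")
```

Both routes give the same number. In this synthetic case the raw correlation shrinks once z is controlled, which is the same mechanism by which a pairwise correlation in Table 7 can differ from its counterpart in Table 8.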

Correlation findings and business implications:

Three correlations are most relevant to targeting:

  1. Income ↔︎ Budget Share (rho = -0.075): The raw association between income and the share of income allocated to international shopping is negligible and slightly negative, and it becomes more clearly negative (partial rho = -0.295, Table 8) once spend, frequency, satisfaction, and losses are controlled for. Higher earners spend more in absolute terms, but that spend is a smaller fraction of a larger income. Income bracket therefore segments clients by order value, not by budget share.

  2. Income ↔︎ Spend per Order (rho = 0.351): Higher-income consumers place larger individual orders, not just more orders. For a personal shopper, this means more revenue per transaction from each high-income client, which improves unit economics.

  3. Purchase Frequency ↔︎ Budget Share (rho = 0.369): More frequent purchasers allocate a higher share of income, which is consistent with habit formation and low friction for these consumers.

The strongest pairwise correlation overall is Spend per Order ↔︎ Purchase Frequency (rho = 0.461), followed by Budget Share ↔︎ Spend per Order (rho = 0.433).

Correlation vs causation: While income and per-order spend are correlated, income does not mechanistically cause international purchasing. A consumer earning ₦2.5M per month might choose not to shop internationally. The correlation reflects an enabling relationship: higher income permits larger orders. A causal claim would require a randomised experiment varying purchasing power while holding all else constant, which is not feasible in this survey context.


8 Technique 5 — Linear Regression

Note📚 Theory Recap — Linear Regression

Ordinary Least Squares (OLS) regression models the conditional expectation of a continuous outcome variable Y as a linear function of one or more predictor variables X₁, X₂, …, Xₖ:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon\]

The OLS estimator selects the coefficients \(\hat{\beta}\) that minimise the sum of squared residuals (SSR = Σ(Yᵢ − Ŷᵢ)²), producing the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov theorem (Gauss, 1809; Legendre, 1805).
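Minimising the sum of squared residuals leads to the normal equations (X'X)β = X'Y, which can be solved directly. The sketch below, on synthetic data with hypothetical names, shows that solving them reproduces the coefficients from lm():

```r
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 1.5 * x1 - 0.5 * x2 + rnorm(n)    # synthetic outcome with known coefficients

X <- cbind(1, x1, x2)                        # design matrix with an intercept column
beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # solves (X'X) b = X'y

fit <- lm(y ~ x1 + x2)

print(cbind(normal_equations = as.vector(beta_hat), lm = unname(coef(fit))))
```

In practice lm() uses a QR decomposition instead of forming X'X, which is numerically more stable, but the estimates are the same.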

The classical OLS assumptions (normality of residuals is not required for OLS to be BLUE under Gauss-Markov; it is needed for exact small-sample t- and F-tests):

| Assumption | What it requires | Diagnostic check |
|---|---|---|
| Linearity | Relationship between X and Y is linear | Residuals vs Fitted plot: no systematic curve |
| Independence | Errors are uncorrelated across observations | Study design; Durbin-Watson test for time-series |
| Homoscedasticity | Error variance is constant across fitted values | Scale-Location plot: horizontal band; Breusch-Pagan test |
| Normality of residuals | Errors are normally distributed | Normal Q-Q plot; Shapiro-Wilk test on residuals |
| No perfect multicollinearity | Predictors are not exact linear combinations of each other | Variance Inflation Factor (VIF < 5 acceptable) |

Key output statistics:

  • β coefficient: the estimated change in Y per one-unit increase in X, holding all other predictors constant.
  • Standardised β: allows direct comparison of effect sizes across predictors measured on different scales (computed by standardising all variables to mean = 0, SD = 1 before fitting).
  • R²: the proportion of variance in Y explained by the model; adjusted R² penalises for adding uninformative predictors.
  • F-statistic / p-value: tests whether the model as a whole explains significantly more variance than a null model.
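The relationship between R² and adjusted R² in the list above can be checked by hand. This sketch uses R's built-in mtcars data purely for illustration (it is not the survey data), and the variable names are hypothetical:

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # built-in dataset, illustration only

y <- mtcars$mpg
n <- length(y)
p <- length(coef(fit)) - 1                # number of predictors, intercept excluded

ss_res <- sum(resid(fit)^2)               # unexplained variation
ss_tot <- sum((y - mean(y))^2)            # total variation around the mean

r2     <- 1 - ss_res / ss_tot
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

cat("R-squared =", round(r2, 3), "| adjusted R-squared =", round(adj_r2, 3), "\n")
```

Adjusted R² is always at or below R², and the gap widens as predictors are added relative to the sample size.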

Model selection: Stepwise selection using AIC (Akaike Information Criterion, Akaike 1974) balances model fit against parsimony; AIC = 2k − 2ln(L), where k is the number of parameters and L is the maximised likelihood. Lower AIC indicates a better-fitting, more parsimonious model.
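For a linear model with normal errors, the maximised log-likelihood depends only on n and the residual sum of squares, so AIC can be reconstructed from those two numbers. The sketch below defines a hypothetical helper, aic_from_rss(), and applies it to the residual sums of squares reported in the ANOVA table of Section 8.1 (n = 113). It assumes k counts the regression coefficients plus the error variance, which is how AIC() counts parameters for lm objects.

```r
# AIC of a Gaussian linear model from its residual sum of squares.
# k counts every estimated parameter: the coefficients plus the error variance.
aic_from_rss <- function(rss, n, n_coef) {
  k      <- n_coef + 1
  loglik <- -n / 2 * (log(2 * pi) + log(rss / n) + 1)
  2 * k - 2 * loglik
}

# Residual sums of squares as reported in the ANOVA table (n = 113)
aic_reduced <- aic_from_rss(rss = 3914.8, n = 113, n_coef = 4)    # intercept + 3 predictors
aic_full    <- aic_from_rss(rss = 3280.5, n = 113, n_coef = 18)   # plus 14 sector dummies

cat("AIC reduced =", round(aic_reduced, 1), "| AIC full =", round(aic_full, 1), "\n")
```

The full model has the smaller residual sum of squares, but its 14 extra parameters cost more than the improvement in fit is worth, which is why the stepwise procedure drops the sector terms.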

Technique justification: Chapter 14 of Adi (2026) covers OLS regression as the foundational tool for quantifying the marginal contribution of each predictor to an outcome variable. For a personal shopper, a regression model that predicts budget-share from observable client characteristics (income, sector, frequency) provides a scoring mechanism for prospecting.

8.1 Model Building

Show code
# Build modelling dataset
model_data <- survey_mod |>
  filter(income_bracket != "Prefer not to say",
         !is.na(income_num), !is.na(spend_usd), !is.na(freq_num)) |>
  mutate(
    sector_f  = factor(sector_clean),
    method_f  = factor(method_short),
    income_s  = scale(income_num)[,1],    # standardised for coefficient comparison
    freq_s    = scale(freq_num)[,1],
    spend_s   = scale(spend_usd)[,1]
  )

# Set reference categories
model_data$sector_f  <- relevel(model_data$sector_f, ref="Oil and Gas")
model_data$method_f  <- relevel(model_data$method_f, ref="Combination")

cat("Modelling dataset: n =", nrow(model_data), "rows\n")
Modelling dataset: n = 113 rows
Show code
cat("Predictors: income_s, freq_s, spend_s, sector_f, method_f\n")
Predictors: income_s, freq_s, spend_s, sector_f, method_f
Show code
# Full model (method_f is prepared above but is not entered in this specification)
model1 <- lm(pct_income ~ income_s + freq_s + spend_s + sector_f,
             data = model_data)

# Reduced model via AIC stepwise
model2 <- step(model1, direction="both", trace=0)

cat("Model 1 (full) — Adjusted R²:", round(summary(model1)$adj.r.squared,3),
    "| AIC:", round(AIC(model1),1), "\n")
Model 1 (full) — Adjusted R²: 0.329 | AIC: 739.3 
Show code
cat("Model 2 (reduced) — Adjusted R²:", round(summary(model2)$adj.r.squared,3),
    "| AIC:", round(AIC(model2),1), "\n")
Model 2 (reduced) — Adjusted R²: 0.302 | AIC: 731.3 
Show code
cat("\nANOVA comparison (Model 1 vs 2):\n")

ANOVA comparison (Model 1 vs 2):
Show code
# anova() numbers models in argument order: its "Model 1" is the reduced model (model2)
print(anova(model2, model1))
Analysis of Variance Table

Model 1: pct_income ~ income_s + freq_s + spend_s
Model 2: pct_income ~ income_s + freq_s + spend_s + sector_f
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    109 3914.8                           
2     95 3280.5 14    634.35 1.3121 0.2149
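The F-test in this table asks whether the 14 sector terms reduce the residual sum of squares by more than chance would. It can be reproduced from the printed figures alone; the sketch below uses only those values, with hypothetical variable names:

```r
# Figures from the ANOVA table above
rss_reduced <- 3914.8; df_reduced <- 109   # income_s + freq_s + spend_s
rss_full    <- 3280.5; df_full    <- 95    # same predictors plus sector_f

df_diff <- df_reduced - df_full            # 14 sector terms
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

cat("F =", round(f_stat, 4), "| p =", round(p_value, 4), "\n")
```

A non-significant F here agrees with the AIC comparison: the sector dummies do not add enough explanatory power to justify their degrees of freedom.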

8.2 Regression Coefficients

Show code
tidy_model <- tidy(model2, conf.int=TRUE) |>
  mutate(
    across(c(estimate, std.error, statistic, conf.low, conf.high), ~round(.,3)),
    p.value = round(p.value,4),
    sig     = case_when(p.value<0.001~"***", p.value<0.01~"**",
                        p.value<0.05~"*",   p.value<0.1~".",
                        TRUE~"")
  )

kable(tidy_model,
      col.names = c("Term","Estimate","SE","t","p","CI Low","CI High","Sig."),
      caption = "Table 9: OLS regression results — dependent variable: % income to international shopping") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |>
  footnote(general="Standardised predictors (income_s, freq_s, spend_s). Sector terms were dropped by stepwise AIC selection.")
OLS regression coefficients (final model)
Term Estimate SE t p CI Low CI High Sig.
(Intercept) 7.071 0.564 12.542 0.0000 5.953 8.188 ***
income_s -2.215 0.618 -3.582 0.0005 -3.440 -0.989 ***
freq_s 1.873 0.602 3.114 0.0024 0.681 3.066 **
spend_s 3.567 0.601 5.937 0.0000 2.376 4.758 ***
Note:
Standardised predictors (income_s, freq_s, spend_s). Sector terms were dropped by stepwise AIC selection.

8.3 Coefficient Plot

Show code
tidy_model |>
  filter(term != "(Intercept)") |>
  mutate(term = str_replace(term, "sector_f","Sector: "),
         term = str_replace(term, "_s$"," (standardised)"),
         term = fct_reorder(term, estimate),
         sig_col = if_else(p.value < 0.05, "Significant (p<0.05)", "Non-significant")) |>
  ggplot(aes(x=estimate, y=term, colour=sig_col,
             xmin=conf.low, xmax=conf.high)) +
  geom_vline(xintercept=0, linetype="dashed", colour="grey60") +
  geom_errorbarh(height=0.3, linewidth=0.8) +
  geom_point(size=3) +
  scale_colour_manual(values=c(pal[1], pal[2]), name=NULL) +
  labs(title="Regression Coefficient Plot",
       subtitle="Final model: standardised predictors (sector terms dropped by stepwise AIC)",
       x="Coefficient Estimate (percentage points of income)",
       y=NULL) +
  theme_minimal(base_size=11) +
  theme(plot.title=element_text(face="bold", colour=pal[1]),
        legend.position="bottom")

Regression coefficients with 95% confidence intervals

8.4 Diagnostic Plots

Show code
par(mfrow=c(2,2))
plot(model2, which=1:4, col=pal[1], pch=16,
     sub.caption=paste0("Final Regression Model | n=",nrow(model_data)))

OLS regression diagnostic plots
Show code
par(mfrow=c(1,1))

8.5 Assumption Checks

Show code
# Normality of residuals
sw  <- shapiro.test(resid(model2))
bp  <- bptest(model2)
dw  <- dwtest(model2)
vif_vals <- vif(model2)

assump_tbl <- tibble(
  Assumption = c("Normality of residuals","Homoscedasticity",
                 "No autocorrelation","No multicollinearity"),
  Test = c("Shapiro-Wilk","Breusch-Pagan","Durbin-Watson","VIF"),
  Statistic = c(round(sw$statistic,3), round(bp$statistic,3),
                round(dw$statistic,3),
                paste0("max VIF = ", round(max(vif_vals),2))),
  `p-value` = c(round(sw$p.value,4), round(bp$p.value,4),
                round(dw$p.value,4), NA),
  Verdict = c(
    if(sw$p.value>0.05) "PASS (normal)" else "CAUTION (non-normal)",
    if(bp$p.value>0.05) "PASS (homoscedastic)" else "CAUTION (heteroscedastic)",
    if(dw$p.value>0.05) "PASS (no autocorrelation)" else "CAUTION",
    if(max(vif_vals)<5) "PASS (VIF<5)" else "CAUTION (multicollinearity)"
  )
)

kable(assump_tbl,
      caption = "Table 10: OLS assumption check results") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |>
  column_spec(5, bold=TRUE,
              color = if_else(str_detect(assump_tbl$Verdict,"PASS"),
                              "#375623","#C00000"))
OLS assumption verification
Assumption Test Statistic p-value Verdict
Normality of residuals Shapiro-Wilk 0.89 0.0000 CAUTION (non-normal)
Homoscedasticity Breusch-Pagan 10.049 0.0182 CAUTION (heteroscedastic)
No autocorrelation Durbin-Watson 1.4 0.0005 CAUTION
No multicollinearity VIF max VIF = 1.19 NA PASS (VIF<5)
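The Breusch-Pagan check flagged above can be illustrated by hand: regress the squared residuals on the predictors and compare n × R² with a chi-squared distribution. The sketch below uses the built-in mtcars data as a stand-in (not the survey model) and assumes the studentised form of the test, which is the default in lmtest::bptest(). The last line recomputes the p-value for the statistic reported in Table 10.

```r
fit <- lm(mpg ~ wt + hp, data = mtcars)   # built-in data, for illustration only

# Auxiliary regression: do the squared residuals vary with the predictors?
aux <- lm(I(resid(fit)^2) ~ wt + hp, data = mtcars)

n       <- nrow(mtcars)
bp_stat <- n * summary(aux)$r.squared     # studentised (Koenker) form of the statistic
bp_df   <- length(coef(aux)) - 1          # one degree of freedom per predictor
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

# Cross-check against Table 10: statistic 10.049 with 3 predictors
p_table10 <- pchisq(10.049, df = 3, lower.tail = FALSE)

cat("Illustrative BP =", round(bp_stat, 3), "p =", round(bp_p, 4),
    "| Table 10 p =", round(p_table10, 4), "\n")
```

When this test rejects, as it does for the final model, the coefficient estimates remain unbiased but the usual standard errors can be unreliable.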

8.6 Business Interpretation Table

Show code
biz_tbl <- tidy_model |>
  filter(p.value < 0.10, term != "(Intercept)") |>
  arrange(desc(abs(estimate))) |>
  mutate(
    Direction = if_else(estimate > 0, "Positive (+)", "Negative (-)"),
    `Effect (pp)` = round(estimate, 2),
    `Targeting Implication` = case_when(
      str_detect(term,"income_s") ~
        "Holding frequency and per-order spend constant, each +1 SD in income is associated with a lower budget share (effect in pp). Use income to gauge order size, not budget share.",
      str_detect(term,"freq_s") ~
        "More-frequent buyers allocate more. Target clients with established international shopping habits.",
      str_detect(term,"spend_s") ~
        "Higher per-order spend correlates with larger allocation. Premium-order clients are high-value.",
      str_detect(term,"Sector.*Banking") ~
        "Banking & Finance allocates less than Oil & Gas on average. Adjust service proposition accordingly.",
      str_detect(term,"Sector.*Tech") ~
        "Tech sector shows differentiated allocation vs Oil & Gas. Consider tech-focused product curation.",
      TRUE ~ "Significant predictor: adjust client screening accordingly."
    )
  )
biz_tbl <- biz_tbl[, c("term","Direction","Effect (pp)","p.value","Targeting Implication")]

kable(biz_tbl,
      col.names=c("Predictor","Direction","Effect (pp)","p-value","Targeting Implication"),
      caption="Table 11: Regression-based personal shopper targeting guide") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |>
  column_spec(5, italic=TRUE)
Regression drivers: personal shopper targeting guide
Predictor Direction Effect (pp) p-value Targeting Implication
spend_s Positive (+) 3.57 0.0000 Higher per-order spend correlates with larger allocation. Premium-order clients are high-value.
income_s Negative (-) -2.21 0.0005 Holding frequency and per-order spend constant, each +1 SD in income is associated with a lower budget share (effect in pp). Use income to gauge order size, not budget share.
freq_s Positive (+) 1.87 0.0024 More-frequent buyers allocate more. Target clients with established international shopping habits.
freq_s Positive (+) 1.87 0.0024 More-frequent buyers allocate more. Target clients with established international shopping habits.

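Overall model performance: The final regression model achieves an adjusted R² of 0.302, meaning it explains approximately 30.2% of the variance in international shopping budget allocation. The remaining variance reflects unmeasured factors such as brand loyalty, past experience, remittance access, and household composition. For a targeting model, this level of explanatory power is useful for ranking prospects but not for precise prediction: a client scoring high on per-order spend and purchase frequency is predicted to allocate more to international shopping than the average respondent, while higher income, holding those two constant, predicts a smaller share. Because the residuals are non-normal and heteroscedastic (Table 10), the reported standard errors and p-values should be read as approximate.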
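To make the scoring idea concrete, the coefficients in Table 9 can be turned into a predicted budget share for a prospect. This is a minimal sketch, not a validated scoring tool: predict_share() is a hypothetical helper, the inputs are in standard-deviation units because the predictors were standardised, and the two prospects are invented for illustration.

```r
# Coefficients copied from Table 9 (predictors are in SD units)
coefs <- c(intercept = 7.071, income_s = -2.215, freq_s = 1.873, spend_s = 3.567)

# Hypothetical scoring helper: predicted % of income spent on international shopping
predict_share <- function(income_s, freq_s, spend_s) {
  coefs[["intercept"]] +
    coefs[["income_s"]] * income_s +
    coefs[["freq_s"]]   * freq_s +
    coefs[["spend_s"]]  * spend_s
}

# Two illustrative prospects, each 1 SD above average on different traits
habitual_big_spender <- predict_share(income_s = 0, freq_s = 1, spend_s = 1)
high_income_only     <- predict_share(income_s = 1, freq_s = 0, spend_s = 0)

cat("Habitual big spender:", habitual_big_spender, "%",
    "| High income only:", high_income_only, "%\n")
```

The comparison shows why per-order spend and frequency, not income alone, drive the predicted share in this model.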


9 Integrated Findings

Across five analytical techniques, a consistent and commercially actionable story emerges.

EDA established that the dataset is right-skewed: most respondents allocate less than 10% of income to international shopping, but a meaningful minority (particularly in high-income brackets) allocates 20–35%. Two data quality issues were identified and transparently resolved, ensuring that all downstream inference rests on reliable data.

Visualisation translated these patterns into a targeting map. The sector distribution analysis (Plot 2) showed that median budget allocation is highest for Healthcare, Oil & Gas, and Entrepreneur respondents; Banking & Finance — despite being the most represented sector (n = 40) — has a comparatively low median allocation of 2.5%, with means elevated by a small number of high-spending outliers. The sector opportunity quadrant (Plot 6) reinforces this by combining budget share with purchase frequency: sectors in the top-right quadrant are the natural first targets. Black Friday and Christmas emerged as the two most commercially important seasons — when consumers are already primed to spend.

Hypothesis testing did not reject any of the three null hypotheses at the 5% level. Income-bracket differences in budget allocation were marginal (Kruskal-Wallis, p = 0.0759, small effect), sector and shopping method showed no significant association (Fisher’s Exact, p = 0.491), and payment abandonment was, if anything, more common among higher-income respondents (66.2% vs 53.8%), so they cannot be assumed to be lower-friction clients.

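Correlation analysis showed that per-order spend and purchase frequency are the variables most strongly correlated with budget share (rho = 0.433 and 0.369), while income correlates with per-order spend (rho = 0.351) but not with budget share (rho = -0.075). Purchase frequency retains a positive partial correlation with budget share after controlling for the other variables (0.223), which suggests that habitual shoppers are distinct from occasional high-spenders.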

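Regression quantified these relationships: per-order spend is the strongest predictor of budget allocation (3.57 pp per SD), followed by income (-2.21 pp per SD, negative once spend and frequency are held constant) and purchase frequency (1.87 pp per SD). The model explains 30.2% of variance, which is enough to rank prospects in a client-scoring tool but leaves most of the variation unexplained.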

Single recommendation: A personal shopper seeking the highest-value client profile should target high-income (≥ ₦1M/month) professionals in Oil & Gas, Banking & Finance, or Technology sectors, operating in Lagos, who have made 6 or more international purchases in the last 24 months, and who plan to shop during November–December. This segment combines high per-order spend (typically $200–$500+), willingness to delegate (Combination or Personal Shopper method), and an established purchasing habit. Payment abandonment is not lower among high earners in this sample, so reliable payment handling should be part of the offer and not an assumed advantage of the segment.


10 Limitations & Further Work

  1. Sample size and subgroup power: The dataset of 132 responses meets the recommended 100-observation minimum and supports reliable overall estimates. However, statistical power remains moderate for small subgroup comparisons — several sectors (e.g., Aviation, Mining, Logistics) have fewer than five respondents, producing wide confidence intervals for sector-level contrasts. Further survey waves targeting these underrepresented sectors would strengthen sector-level inference.

  2. Convenience sampling: Respondents were recruited through a professional network, introducing self-selection bias. Consumers who do not shop internationally are underrepresented. The population studied is more educated and higher-earning than the average Nigerian adult — findings should not be generalised to the full Nigerian consumer market.

  3. Ordinal midpoint encoding: Income, spend, and budget-share variables are ordinal bands. Converting them to numeric midpoints introduces measurement error at the distributional tails (e.g., “₦2.5M and above” is represented as 3,500k when the true distribution is unknown). Interval-level data would improve regression precision.

  4. Self-reported data: All financial estimates (USD losses, budget percentages) are self-reported and subject to recall bias and social desirability effects. Validation against transaction records or fintech data (e.g., Paystack Developer API) would provide more reliable estimates.

  5. Further work: With a larger dataset, the five techniques here could be extended to include a logistic regression predicting personal-shopper adoption (binary outcome), k-means clustering to identify empirical client archetypes, and time-series analysis of seasonal spend patterns using monthly transaction data.
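The midpoint encoding described in Limitation 3 can be sketched as a named-vector lookup. The mapping below is an assumption for illustration: only 750 (implied by the Test 3 threshold) and 3500 (stated in Limitation 3) come from the text, and the other values are arithmetic midpoints that may differ from the encoding actually used.

```r
# ASSUMED midpoints in thousands of naira. Only 750 and 3500 are taken from the
# text; the remaining values are arithmetic midpoints chosen for illustration.
income_midpoints <- c(
  "Below ₦150,000"          = 75,
  "₦150,000 – ₦299,999"     = 225,
  "₦300,000 – ₦499,999"     = 400,
  "₦500,000 – ₦999,999"     = 750,
  "₦1,000,000 – ₦2,499,999" = 1750,
  "₦2,500,000 and above"    = 3500
)

# Illustrative responses, including one without a usable bracket
responses <- c("₦500,000 – ₦999,999", "₦2,500,000 and above", "Prefer not to say")

# Named-vector lookup; responses without a midpoint become NA
income_num <- unname(income_midpoints[responses])
print(income_num)
```

Changing the value assigned to the open-ended top bracket is a quick sensitivity check on how much the regression depends on that assumption.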


References

Adi, B. (2026). AI-powered data analytics. Lagos Business School / markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21. https://doi.org/10.1080/00031305.1973.10478966

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.5). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

# Package citations (APA 7 — use citation() output)
for (pkg in c("readxl","skimr","ggcorrplot","car","lmtest","nortest",
              "broom","effectsize","patchwork","kableExtra","ggrepel","ppcor")) {
  cat("\ncitation('", pkg, "'):\n", sep="")
  tryCatch(print(citation(pkg), style="text"), error=function(e) cat("  [Not found]\n"))
}

citation('readxl'):
Wickham H, Bryan J (2025). _readxl: Read Excel Files_.
doi:10.32614/CRAN.package.readxl
<https://doi.org/10.32614/CRAN.package.readxl>, R package version
1.4.5, <https://CRAN.R-project.org/package=readxl>.

citation('skimr'):
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S
(2026). _skimr: Compact and Flexible Summaries of Data_.
doi:10.32614/CRAN.package.skimr
<https://doi.org/10.32614/CRAN.package.skimr>, R package version 2.2.2,
<https://CRAN.R-project.org/package=skimr>.

citation('ggcorrplot'):
Kassambara A (2023). _ggcorrplot: Visualization of a Correlation Matrix
using 'ggplot2'_. doi:10.32614/CRAN.package.ggcorrplot
<https://doi.org/10.32614/CRAN.package.ggcorrplot>, R package version
0.1.4.1, <https://CRAN.R-project.org/package=ggcorrplot>.

citation('car'):
Fox J, Weisberg S (2019). _An R Companion to Applied Regression_, Third
edition. Sage, Thousand Oaks CA. <https://www.john-fox.ca/Companion/>.

citation('lmtest'):
Zeileis A, Hothorn T (2002). "Diagnostic Checking in Regression
Relationships." _R News_, *2*(3), 7-10.
<https://CRAN.R-project.org/doc/Rnews/>.

citation('nortest'):
Gross J, Ligges U (2015). _nortest: Tests for Normality_.
doi:10.32614/CRAN.package.nortest
<https://doi.org/10.32614/CRAN.package.nortest>, R package version
1.0-4, <https://CRAN.R-project.org/package=nortest>.

citation('broom'):
Robinson D, Hayes A, Couch S, Hvitfeldt E (2026). _broom: Convert
Statistical Objects into Tidy Tibbles_. doi:10.32614/CRAN.package.broom
<https://doi.org/10.32614/CRAN.package.broom>, R package version
1.0.12, <https://CRAN.R-project.org/package=broom>.

citation('effectsize'):
Ben-Shachar MS, Lüdecke D, Makowski D (2020). "effectsize: Estimation
of Effect Size Indices and Standardized Parameters." _Journal of Open
Source Software_, *5*(56), 2815. doi:10.21105/joss.02815
<https://doi.org/10.21105/joss.02815>.

citation('patchwork'):
Pedersen T (2025). _patchwork: The Composer of Plots_.
doi:10.32614/CRAN.package.patchwork
<https://doi.org/10.32614/CRAN.package.patchwork>, R package version
1.3.2, <https://CRAN.R-project.org/package=patchwork>.

citation('kableExtra'):
Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and
Pipe Syntax_. doi:10.32614/CRAN.package.kableExtra
<https://doi.org/10.32614/CRAN.package.kableExtra>, R package version
1.4.0, <https://CRAN.R-project.org/package=kableExtra>.

citation('ggrepel'):
Slowikowski K (2026). _ggrepel: Automatically Position Non-Overlapping
Text Labels with 'ggplot2'_. doi:10.32614/CRAN.package.ggrepel
<https://doi.org/10.32614/CRAN.package.ggrepel>, R package version
0.9.8, <https://CRAN.R-project.org/package=ggrepel>.

citation('ppcor'):
Kim S (2015). _ppcor: Partial and Semi-Partial (Part) Correlation_.
doi:10.32614/CRAN.package.ppcor
<https://doi.org/10.32614/CRAN.package.ppcor>, R package version 1.1,
<https://CRAN.R-project.org/package=ppcor>.

Survey data citation:

LBS EMBA-31 Student. (2026). International Shopping Survey — Nigerian Consumer Behaviour Study [Dataset]. Collected via Google Forms, May 2026, Nigeria. Data available on request from the author.


Appendix: AI Usage Statement

Claude Code (Anthropic, 2026) was used as a coding assistant throughout this analysis to help write R code for data cleaning, visualisation, hypothesis testing, correlation, and regression. Specifically, Claude Code was used to: (1) generate data-wrangling functions for the messy loss-estimate columns; (2) scaffold ggplot2 visualisation code; and (3) structure the Quarto document with appropriate chunk labels and YAML configuration. All analytical decisions — which technique to apply to which variable, how to handle missing values, which reference category to use in the regression, how to interpret p-values and effect sizes, and what business recommendation to draw — were made independently by the author. The interpretation of every result and every conclusion stated in this document reflects the author’s own analytical judgement. No AI-generated text has been submitted as analysis interpretation without independent verification against the underlying data outputs.