Nigerian Consumer International Shopping: Exploratory & Inferential Analytics

Author

Idayat Oshodi

Published

May 20, 2026

Executive Summary

This study analyses primary survey data collected from 132 Nigerian consumers to answer the question: What income, spending, and behavioural characteristics predict how much a consumer allocates to international shopping — and which sector × category × season combinations represent the highest-value, lowest-friction segments for a personal shopper to target?

The data were collected via a structured Google Form survey administered to Nigerian professionals in May 2026. Respondents span 16 employment sectors, six income brackets, and report purchasing internationally between 0 and more than 10 times over the previous 24 months. Five inferential techniques were applied: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression.

Key findings indicate that monthly income is the single strongest predictor of international shopping budget share. Consumers earning ₦1 million or more monthly allocate significantly higher proportions of income to international purchases. Oil & Gas and Banking & Finance professionals operating in the Black Friday and Christmas seasons emerge as the highest-value, lowest-friction segments. Payment friction (card declines and FX losses) remains the most prevalent barrier, disproportionately affecting lower-income brackets. The regression model explains approximately 45% of variance in budget allocation, with income bracket and purchase frequency as the dominant drivers.

1 Professional Disclosure

Role and context: I am a Business Analyst and a Personal International Shopper operating in the Financial sector with regular exposure to cross-border commerce, supply chain management, and consumer decision-making. International purchasing of goods — particularly electronics, fashion, and specialised equipment — is a recurring operational and personal activity that requires navigating payment infrastructure constraints, logistics providers, and personal shoppers. This survey was designed and administered to peers within a professional network to understand demand patterns that inform a potential personal-shopping service.

Technique justifications:

EDA is the foundation of any reliable analysis: with survey data, cleaning decisions (how to handle non-standard responses, ordinal midpoints) materially affect every downstream result. EDA makes those decisions transparent and auditable.
Data Visualisation converts multi-dimensional survey patterns into actionable client profiles. A personal shopper benefits more from a visual targeting map than from a regression table.
Hypothesis Testing provides statistical rigour to observed differences: it distinguishes genuine income-driven spend patterns from sampling noise, which matters when deciding which client segments to pursue.
Correlation Analysis identifies which predictors are independently informative versus redundant — critical for avoiding double-counting in a regression model built on ordinal data.
Linear Regression provides a single quantitative model that translates observable client characteristics (income, frequency, sector) into a predicted budget-allocation score — the foundation of a client-scoring tool for a personal shopper.

2 Data Collection & Sampling

Collection method: A structured survey was designed and distributed via Google Forms in May 2026 to Nigerian consumers within a professional network. The link was shared through WhatsApp groups, email chains, and LinkedIn direct messages targeting working professionals aged 18–55.

Sampling frame: Convenience sample of Nigerian professionals with prior exposure to international online shopping. The population of interest is economically active Nigerians who have attempted or completed international purchases via platforms such as Amazon, ASOS, Shein, or through personal shoppers.

Sample size and period: 132 completed responses were collected between 06 May 2026 and 15 May 2026. The assignment guidelines require a minimum of 100 observations; the achieved sample of 132 exceeds this threshold, providing a credible analytical base. All 132 responses are genuine primary data and are used in their entirety; no rows were discarded except the 6 with uninterpretable outcome entries in the outcome variable.

Variables collected: 25 variables covering demographics (age, gender, city), professional context (sector, employment status, income bracket), shopping behaviour (frequency, budget share, spend per order, categories, platforms, seasons, method), payment behaviour, friction losses, satisfaction, and open-text responses.

Ethical notes: No personally identifiable information (names, phone numbers, employee IDs) was collected. Participation was voluntary; the survey introduction stated that responses would be used for an academic analysis. Respondents who preferred not to disclose income were given a “Prefer not to say” option. No organisational confidential data was used.

Data-sharing restrictions: The dataset will be made available on request from the author; no organisational restrictions apply given the consumer-level nature of the data.

3 Data Description

3.1 Respondent Profile — Pie Charts

The five interactive pie charts below answer who the respondents are. age_group is ordinal (youngest-to-oldest order); the remaining four are nominal (no inherent ranking). Hover over any slice to see the exact label, count, and percentage. Click a slice to isolate it; click the ⛶ Expand button (top-right of each chart) to explode it to full-screen, and press Esc or click the overlay to close.

Show code

# ── Colour palette ────────────────────────────────────────────────────────────
pie_cols <- c("#1F3864","#C00000","#70AD47","#ED7D31","#7030A0",
              "#00B0F0","#FFC000","#FF7C80","#43682B","#833C00",
              "#A9D18E","#9DC3E6","#FFD966","#F4B183")

# ── Helper: frequency table with top-n + "Other" rollup ─────────────────────
make_pie_df <- function(x, top_n = NULL) {
  x      <- as.character(x)
  x      <- x[!is.na(x) & nchar(trimws(x)) > 0]
  tbl    <- sort(table(x), decreasing = TRUE)
  labels <- names(tbl);  counts <- as.integer(tbl)
  if (!is.null(top_n) && length(counts) > top_n) {
    others <- sum(counts[(top_n + 1):length(counts)])
    labels <- c(labels[seq_len(top_n)], "Other")
    counts <- c(counts[seq_len(top_n)], others)
  }
  df <- data.frame(label = labels, n = counts, stringsAsFactors = FALSE)
  df$pct <- round(df$n / sum(df$n) * 100, 1)
  df
}

# ── Build data for each pie ───────────────────────────────────────────────────
d_age    <- make_pie_df(survey_clean$age_group)
d_gender <- make_pie_df(survey_clean$gender)
d_emp    <- make_pie_df(survey_clean$emp_status)
d_city   <- make_pie_df(survey_clean$city,         top_n = 7)
d_sector <- make_pie_df(survey_clean$sector_clean, top_n = 8)

pie_data   <- list(d_age, d_gender, d_emp, d_city, d_sector)
pie_titles <- c("Age Group", "Gender", "Employment Status",
                "City  (Top 7 + Other)", "Employment Sector  (Top 8 + Other)")

# ── Domain grid: 3 charts on top row, 2 centred on bottom row ────────────────
dom_x <- list(c(0.01,0.31), c(0.35,0.65), c(0.69,0.99),
              c(0.13,0.47), c(0.53,0.87))
dom_y <- list(c(0.54,1.00), c(0.54,1.00), c(0.54,1.00),
              c(0.00,0.46), c(0.00,0.46))
ann_x <- sapply(dom_x, mean)
ann_y <- c(1.01, 1.01, 1.01, 0.47, 0.47)

# ── Build multi-trace plotly figure ──────────────────────────────────────────
fig_pie <- plot_ly(height = 720)

for (i in seq_along(pie_data)) {
  df   <- pie_data[[i]]
  cols <- as.list(pie_cols[seq_len(nrow(df))])
  fig_pie <- fig_pie |>
    add_pie(
      data          = df,
      labels        = ~label,
      values        = ~n,
      name          = pie_titles[i],
      textposition  = "inside",
      textinfo      = "percent",
      hovertemplate = paste0(
        "<b>%{label}</b><br>Count: %{value}<br>Share: %{percent}<extra></extra>"
      ),
      domain  = list(x = dom_x[[i]], y = dom_y[[i]]),
      marker  = list(colors = cols,
                     line   = list(color = "white", width = 2)),
      showlegend = FALSE
    )
}

# ── Subtitle annotations (chart titles above each pie) ───────────────────────
ann_list <- lapply(seq_along(pie_titles), function(i) {
  list(x = ann_x[i], y = ann_y[i],
       text      = paste0("<b>", pie_titles[i], "</b>"),
       xref      = "paper", yref = "paper",
       showarrow = FALSE, xanchor = "center", yanchor = "bottom",
       font      = list(size = 11, color = "#1F3864"))
})

fig_pie |>
  plotly::layout(
    title = list(
      text = paste0(
        "<b>Who Are the Respondents?</b>",
        "<span style='font-size:12px;color:#777'>",
        "   n = ", nrow(survey_clean),
        " respondents  |  Hover for counts  |  Click slices to filter</span>"
      ),
      font = list(size = 15, color = "#1F3864"), x = 0.5
    ),
    annotations   = ann_list,
    paper_bgcolor = "white",
    plot_bgcolor  = "white",
    margin        = list(t = 80, b = 10, l = 10, r = 10)
  )

3.2 Variable Summary Table

The table below consolidates all key descriptive statistics into a single readable summary. Each row is one numeric variable derived from the survey; columns show sample size, completeness, central tendency, spread, range, and quartiles.

Show code

# ── Human-readable variable labels ───────────────────────────────────────────
var_meta <- data.frame(
  col   = c("income_num","pct_income","spend_usd",
             "freq_num","satisfaction","total_loss"),
  Label = c("Monthly Income (₦ '000)",
             "Budget Share — Intl Shopping (%)",
             "Typical Order Spend (USD)",
             "Purchases in Last 24 Months (n)",
             "Overall Satisfaction (1–5)",
             "Total Estimated USD Losses"),
  Source = c("Income bracket midpoint",
              "Budget-share band midpoint",
              "Spend-band midpoint",
              "Frequency-band midpoint",
              "Direct survey rating",
              "Sum of 4 cleaned loss columns"),
  stringsAsFactors = FALSE
)

# ── Build summary row for each variable ──────────────────────────────────────
sum_rows <- lapply(var_meta$col, function(nm) {
  v   <- survey_clean[[nm]]
  v_c <- v[!is.na(v)]
  n_valid   <- length(v_c)
  n_missing <- sum(is.na(v))
  pct_complete <- round(n_valid / nrow(survey_clean) * 100, 1)
  data.frame(
    col          = nm,
    N_Valid      = n_valid,
    N_Missing    = n_missing,
    Pct_Complete = pct_complete,
    Mean         = round(mean(v_c), 2),
    Median       = round(median(v_c), 2),
    SD           = round(sd(v_c), 2),
    Min          = round(min(v_c), 2),
    Q1           = round(quantile(v_c, 0.25), 2),
    Q3           = round(quantile(v_c, 0.75), 2),
    Max          = round(max(v_c), 2),
    stringsAsFactors = FALSE
  )
})
sum_tbl <- do.call(rbind, sum_rows)

# ── Merge labels ──────────────────────────────────────────────────────────────
sum_tbl <- merge(var_meta, sum_tbl, by = "col")
sum_tbl <- sum_tbl[match(var_meta$col, sum_tbl$col), ]   # preserve order

# ── Render ───────────────────────────────────────────────────────────────────
kable(
  sum_tbl[, c("Label","Source","N_Valid","N_Missing","Pct_Complete",
               "Mean","Median","SD","Min","Q1","Q3","Max")],
  col.names = c("Variable","Derived From",
                "Valid n","Missing n","Complete %",
                "Mean","Median","Std Dev",
                "Min (p0)","Q1 (p25)","Q3 (p75)","Max (p100)"),
  caption   = paste0("Table 1: Descriptive statistics for all key numeric variables (n = ", nrow(survey_clean), " respondents)"),
  align     = c("l","l","c","c","c","r","r","r","r","r","r","r")
) |>
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = TRUE,
    font_size         = 13
  ) |>
  row_spec(0,
           bold       = TRUE,
           color      = "white",
           background = "#1F3864") |>
  row_spec(1, background = "#EEF2F7") |>   # income_num
  row_spec(2, background = "#FEF2F2") |>   # pct_income  (outcome — highlight)
  row_spec(3, background = "#EEF2F7") |>
  row_spec(4, background = "#FFFFFF") |>
  row_spec(5, background = "#EEF2F7") |>
  row_spec(6, background = "#FFFFFF") |>
  column_spec(1, bold = TRUE, width = "18em") |>
  column_spec(2, italic = TRUE, color = "#555555", width = "16em") |>
  column_spec(4,
    color = ifelse(sum_tbl$N_Missing > 0, "#C00000", "#375623"),
    bold  = ifelse(sum_tbl$N_Missing > 0, TRUE,      FALSE)) |>
  footnote(
    general = paste0(
      "Row highlighted in red = outcome variable (pct_income). ",
      "Red Missing n = variable has missing values. ",
      "All numeric values derived from ordinal survey bands using band midpoints ",
      "(see Section 3 — Data Collection for encoding decisions)."
    ),
    general_title = "Notes: "
  )

Table 1: Descriptive statistics for all key numeric variables
	Variable	Derived From	Valid n	Missing n	Complete %	Mean	Median	Std Dev	Min (p0)	Q1 (p25)	Q3 (p75)	Max (p100)
2	Monthly Income (₦ '000)	Income bracket midpoint	119	13	90.2	1340.76	750.0	1178.61	75	400.0	1750.0	3500
3	Budget Share — Intl Shopping (%)	Budget-share band midpoint	126	6	95.5	7.02	2.5	7.35	0	2.5	7.5	35
5	Typical Order Spend (USD)	Spend-band midpoint	132	0	100.0	138.75	75.0	177.66	30	30.0	150.0	700
1	Purchases in Last 24 Months (n)	Frequency-band midpoint	132	0	100.0	3.94	1.5	3.89	0	1.5	4.0	12
4	Overall Satisfaction (1–5)	Direct survey rating	132	0	100.0	3.41	3.0	0.94	1	3.0	4.0	5
6	Total Estimated USD Losses	Sum of 4 cleaned loss columns	132	0	100.0	6281.49	0.0	41868.61	0	0.0	82.5	401900
Notes:
Row highlighted in red = outcome variable (pct_income). Red Missing n = variable has missing values. All numeric values derived from ordinal survey bands using band midpoints (see Section 3 — Data Collection for encoding decisions).

The outcome variable pct_income (highlighted in red) is missing for 6 respondents who gave non-interpretable free-text responses (“0.9”, “Depend need(s)”, “As required”). These rows are excluded from the regression and correlation analyses but retained in all descriptive outputs.

3.3 Distribution of Key Variables

The panel below shows the full distribution of each variable — bars show how many respondents fall into each value range, overlaid with a smooth density curve. The dashed red line marks the median for each variable, giving an instant read on the typical respondent.

Show code

# ── Data in long format for faceting ─────────────────────────────────────────
hist_vars <- survey_clean[, c("income_num","pct_income","spend_usd",
                               "freq_num","satisfaction","total_loss")]

# Build long data frame manually (no pivot_longer to avoid select conflict)
hist_long <- do.call(rbind, lapply(names(hist_vars), function(nm) {
  v <- hist_vars[[nm]]
  data.frame(variable = nm, value = v[!is.na(v)], stringsAsFactors = FALSE)
}))

# Readable facet labels
facet_labels <- c(
  income_num   = "Monthly Income (₦ '000)",
  pct_income   = "Budget Share — Intl Shopping (% of income)",
  spend_usd    = "Typical Order Spend (USD)",
  freq_num     = "Purchases in Last 24 Months (n)", 
  satisfaction = "Overall Satisfaction \n(1 = very dissatisfied, 5 = very satisfied)",
  total_loss   = "Total Estimated USD Losses"
)
hist_long$facet <- factor(facet_labels[hist_long$variable],
                           levels = facet_labels)

# Median lines per facet
med_lines <- do.call(rbind, lapply(names(hist_vars), function(nm) {
  v <- hist_vars[[nm]]
  data.frame(variable = nm,
             facet    = facet_labels[nm],
             med      = median(v, na.rm = TRUE),
             stringsAsFactors = FALSE)
}))
med_lines$facet <- factor(med_lines$facet, levels = facet_labels)

# Per-variable fill colours (one per panel)
fill_cols <- setNames(
  c("#1F3864","#C00000","#ED7D31","#70AD47","#7030A0","#00B0F0"),
  facet_labels
)

# ── Plot ──────────────────────────────────────────────────────────────────────
p_hist <- ggplot(hist_long, aes(x = value, fill = facet)) +
  geom_histogram(aes(y = after_stat(density)),
                 bins      = 12,
                 colour    = "white",
                 linewidth = 0.4,
                 alpha     = 0.80) +
  geom_density(colour    = "grey20",
               linewidth = 0.8,
               adjust    = 1.2) +
  geom_vline(data      = med_lines,
             aes(xintercept = med,
                 text = paste0("Median: ", round(med, 1))),
             colour    = "#C00000",
             linetype  = "dashed",
             linewidth = 0.9) +
  scale_fill_manual(values = fill_cols, guide = "none") +
  facet_wrap(~ facet,
             scales   = "free",
             ncol     = 2,
             labeller = label_value) +
  labs(
    title    = "Distribution of Key Survey Variables",
    subtitle = paste0(
      "Bars = histogram density  |  Curve = smoothed density  |  ",
      "Dashed red line = median  |  ★ Outcome in red"
    ),
    x = "Value",
    y = "Density"
  ) +
  theme_minimal(base_size = 12) +
  theme(
    plot.title       = element_text(face = "bold", colour = "#1F3864",
                                    size = 15, margin = margin(b = 4)),
    plot.subtitle    = element_text(colour = "#555555", size = 10,
                                    margin = margin(b = 10)),
    strip.text       = element_text(face = "bold", colour = "white",
                                    size = 9),
    strip.background = element_rect(fill = "#1F3864", colour = NA),
    panel.grid.minor = element_blank(),
    panel.border     = element_rect(colour = "#DDDDDD", fill = NA,
                                    linewidth = 0.5),
    axis.title       = element_text(colour = "#333333", size = 10),
    plot.margin      = margin(12, 12, 12, 12)
  )

# ── Wrap as interactive plotly ────────────────────────────────────────────────
ggplotly(p_hist, height = 900, tooltip = c("x", "y", "text")) |>
  plotly::layout(
    hoverlabel = list(bgcolor = "white", font = list(size = 11)),
    margin     = list(t = 80, b = 40, l = 50, r = 30)
  )

Figure 1: Distribution of all six key numeric variables (hover for values · drag to zoom · double-click to reset)

3.4 Categorical & Behavioural Variable Inventory

The table below documents all eleven categorical variables charted in this section — five in the pie charts above and six in the bar charts below. The Data Type column classifies each variable by measurement level; Response Type shows whether a respondent could select one or multiple answers. Understanding this matters for interpreting bar-chart counts: multi-select bars can sum to more than the total number of respondents.

Show code

var_desc <- data.frame(
  R_Variable = c("age_group","gender","city","sector_clean","emp_status",
                 "categories","platforms","shopping_months",
                 "shopping_method","payment_method","improvement"),
  Survey_Question = c(
    "What is your age group?",
    "What is your gender?",
    "What city do you currently live in?",
    "What is your employment sector?",
    "What best describes your employment status?",
    "Which categories do you shop for internationally? (select all that apply)",
    "Which international platforms do you shop from? (select all that apply)",
    "Which months do you shop most internationally? (select all that apply)",
    "How do you usually handle your international shopping?",
    "How do you typically pay your personal shopper or for international orders?",
    "What single improvement would make you shop internationally more often?"
  ),
  Data_Type  = c("Ordinal","Nominal","Nominal","Nominal","Nominal",
                 "Nominal","Nominal","Nominal","Nominal","Nominal","Nominal"),
  Response   = c("Single-select","Single-select","Single-select",
                 "Single-select","Single-select",
                 "Multi-select","Multi-select","Multi-select",
                 "Single-select","Single-select","Single-select"),
  Chart      = c("Pie","Pie","Pie","Pie","Pie",
                 "Bar","Bar","Bar","Bar","Bar","Bar"),
  stringsAsFactors = FALSE
)

kable(
  var_desc,
  col.names = c("R Variable Name","Survey Question",
                "Data Type","Response Type","Chart Used"),
  caption   = "Table 2: Categorical and behavioural variables described in this section",
  align     = c("l","l","c","c","c")
) |>
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width        = TRUE,
    font_size         = 13
  ) |>
  row_spec(0, bold = TRUE, color = "white", background = "#1F3864") |>
  column_spec(1, bold = TRUE, monospace = TRUE, color = "#1F3864") |>
  column_spec(3, italic = TRUE, color = "#555555") |>
  row_spec(c(6, 7, 8), background = "#FFF3CD") |>
  footnote(
    general = paste0(
      "Rows highlighted in amber = multi-select variables: one respondent may contribute ",
      "multiple values, so bar-chart counts can exceed n = ", nrow(survey_clean), ". ",
      "sector_clean is a cleaned version of the raw sector column (Banking variants merged)."
    ),
    general_title = "Notes: "
  )

Table 2: Categorical variables visualised in this section
R Variable Name	Survey Question	Data Type	Response Type	Chart Used
age_group	What is your age group?	Ordinal	Single-select	Pie
gender	What is your gender?	Nominal	Single-select	Pie
city	What city do you currently live in?	Nominal	Single-select	Pie
sector_clean	What is your employment sector?	Nominal	Single-select	Pie
emp_status	What best describes your employment status?	Nominal	Single-select	Pie
categories	Which categories do you shop for internationally? (select all that apply)	Nominal	Multi-select	Bar
platforms	Which international platforms do you shop from? (select all that apply)	Nominal	Multi-select	Bar
shopping_months	Which months do you shop most internationally? (select all that apply)	Nominal	Multi-select	Bar
shopping_method	How do you usually handle your international shopping?	Nominal	Single-select	Bar
payment_method	How do you typically pay your personal shopper or for international orders?	Nominal	Single-select	Bar
improvement	What single improvement would make you shop internationally more often?	Nominal	Single-select	Bar
Notes:
Rows highlighted in amber = multi-select variables: one respondent may contribute multiple values, so bar-chart counts can exceed n = 132. sector_clean is a cleaned version of the raw sector column (Banking variants merged).

3.5 Shopping Behaviour — Bar Charts

The six interactive charts below answer how respondents shop internationally. All six variables are nominal (no natural ranking). Hover over any bar for the exact count; drag to zoom in on any region; double-click to reset the view.

Multi-select note: Categories, Platforms, and Months are “select all that apply” questions — one respondent may be counted in multiple bars, so totals across bars exceed 132.

Show code

bar_pal <- c("#1F3864","#C00000","#70AD47","#ED7D31","#7030A0","#00B0F0")

# ── Helper: expand multi-select and count ─────────────────────────────────────
split_count <- function(x, top_n = 12) {
  items <- unlist(strsplit(as.character(x[!is.na(x) & x != ""]), ",\\s*"))
  items <- trimws(items[nchar(trimws(items)) > 0])
  tbl    <- sort(table(items), decreasing = TRUE)
  labels <- names(tbl);  counts <- as.integer(tbl)
  if (length(counts) > top_n) { labels <- labels[seq_len(top_n)]; counts <- counts[seq_len(top_n)] }
  data.frame(label = labels, n = counts, stringsAsFactors = FALSE)
}

# ── Helper: single-select count with "Other" rollup ───────────────────────────
single_count <- function(x, top_n = 12) {
  x      <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0])
  tbl    <- sort(table(x), decreasing = TRUE)
  labels <- names(tbl);  counts <- as.integer(tbl)
  if (length(counts) > top_n) {
    others <- sum(counts[(top_n + 1):length(counts)])
    labels <- c(labels[seq_len(top_n)], "Other")
    counts <- c(counts[seq_len(top_n)], others)
  }
  data.frame(label = labels, n = counts, stringsAsFactors = FALSE)
}

# ── Helper: build one native plotly horizontal bar chart ─────────────────────
make_plotly_bar <- function(df, title, fill_col, x_note = "Count") {
  df$lw <- stringr::str_wrap(df$label, width = 30)
  # order ascending so largest bar is at the top of the y-axis
  df     <- df[order(df$n, decreasing = FALSE), ]
  df$lw  <- factor(df$lw, levels = unique(df$lw))
  plot_ly(
    df,
    x           = ~n,
    y           = ~lw,
    type        = "bar",
    orientation = "h",
    marker      = list(color     = fill_col,
                       opacity   = 0.87,
                       line      = list(color = "white", width = 0.5)),
    text        = ~paste0("<b>", label, "</b><br>Count: ", n),
    hoverinfo   = "text",
    showlegend  = FALSE
  ) |>
    plotly::layout(
      title  = list(text  = paste0("<b>", title, "</b>"),
                    font  = list(size = 12, color = "#1F3864"),
                    x = 0.02, xanchor = "left"),
      xaxis  = list(title    = x_note,
                    gridcolor = "#DDDDDD",
                    zeroline  = FALSE,
                    tickfont  = list(size = 10)),
      yaxis  = list(title     = "",
                    tickfont  = list(size = 9),
                    automargin = TRUE),
      height        = 400,
      paper_bgcolor = "white",
      plot_bgcolor  = "#FAFAFA",
      margin        = list(l = 10, r = 30, t = 55, b = 30)
    )
}

# ── Build the six bar charts ──────────────────────────────────────────────────
b_cat  <- make_plotly_bar(
  split_count(survey_clean$categories, top_n = 10),
  "1. Categories Shopped Internationally  (Top 10 — multi-select)",
  bar_pal[1], x_note = "Respondents mentioning"
)
b_plat <- make_plotly_bar(
  split_count(survey_clean$platforms, top_n = 10),
  "2. Platforms Used  (Top 10 — multi-select)",
  bar_pal[2], x_note = "Respondents mentioning"
)
b_mon  <- make_plotly_bar(
  split_count(survey_clean$shopping_months),
  "3. Months of Peak International Shopping  (multi-select)",
  bar_pal[3], x_note = "Respondents mentioning"
)
b_meth <- make_plotly_bar(
  single_count(survey_clean$method_short),
  "4. How Shopping is Handled  (single-select)",
  bar_pal[4]
)
b_pay  <- make_plotly_bar(
  single_count(survey_clean$payment_method, top_n = 8),
  "5. Payment Method  (Top 8 — single-select)",
  bar_pal[5]
)
b_imp  <- make_plotly_bar(
  single_count(survey_clean$improvement, top_n = 10),
  "6. What Would Drive More Shopping  (Top 10 — single-select)",
  bar_pal[6]
)

# ── Render all six charts stacked (full-width, individually interactive) ──────
htmltools::tagList(b_cat, b_plat, b_mon, b_meth, b_pay, b_imp)

4 Technique 1 — Exploratory Data Analysis

📚 Theory Recap — Exploratory Data Analysis

Exploratory Data Analysis (EDA) is an analytic philosophy introduced by John Tukey (1977) that advocates examining data through graphical and numerical summaries before imposing statistical models. The guiding principle is to let the data “speak” — uncovering distributions, outliers, and structural patterns without prior hypotheses — so that subsequent modelling decisions rest on observed reality rather than untested assumptions.

Key concepts:

Concept	What it measures	Why it matters
Central tendency (mean, median)	Where data cluster	Median is robust to skew; mean is sensitive to outliers
Dispersion (SD, IQR, range)	How spread out values are	Large SD relative to mean signals high heterogeneity
Shape (skewness, kurtosis)	Asymmetry and tail weight	Right-skewed income data may need transformation before parametric tests
Missing-value audit	Completeness of each variable	Informs whether to impute, exclude, or flag missing cases

Anscombe’s Quartet (Anscombe, 1973) provides the canonical demonstration of why visual inspection is indispensable: four datasets with identical means, variances, and correlations exhibit completely different structures when plotted, proving that numerical summaries alone can be deeply misleading.

Technique justification: EDA is the mandatory first step before any inferential work. Chapter 9 of Adi (2026) emphasises that summary statistics and distributions must be understood before model parameters can be meaningfully interpreted. With a survey dataset containing ordinal income brackets, multi-select categories, and free-text loss estimates, a thorough EDA prevents misspecified models and unreliable inference.

4.1 Summary Statistics

Show code

num_vars_list <- list(
  income_num  = survey_clean$income_num,
  pct_income  = survey_clean$pct_income,
  spend_usd   = survey_clean$spend_usd,
  freq_num    = survey_clean$freq_num,
  satisfaction= survey_clean$satisfaction,
  total_loss  = survey_clean$total_loss
)
skew_pearson <- function(v) {
  v <- v[!is.na(v)]
  if(length(v)<3) return(NA_real_)
  round((mean(v) - median(v)) / sd(v), 2)
}
num_summary2 <- do.call(rbind, lapply(names(num_vars_list), function(nm) {
  v <- num_vars_list[[nm]]
  data.frame(Variable=nm, n=sum(!is.na(v)),
             Mean=round(mean(v,na.rm=TRUE),2),
             Median=round(median(v,na.rm=TRUE),2),
             SD=round(sd(v,na.rm=TRUE),2),
             Min=round(min(v,na.rm=TRUE),2),
             Max=round(max(v,na.rm=TRUE),2),
             Skew=skew_pearson(v),
             stringsAsFactors=FALSE)
}))

kable(num_summary2,
      caption = "Table 2: Descriptive statistics for key numeric variables") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = FALSE)

Descriptive statistics for key numeric variables
Variable	n	Mean	Median	SD	Min	Max	Skew
income_num	119	1340.76	750.0	1178.61	75	3500	0.50
pct_income	126	7.02	2.5	7.35	0	35	0.61
spend_usd	132	138.75	75.0	177.66	30	700	0.36
freq_num	132	3.94	1.5	3.89	0	12	0.63
satisfaction	132	3.41	3.0	0.94	1	5	0.43
total_loss	132	6281.49	0.0	41868.61	0	401900	0.15

4.2 Summary Statistics for Categorical Variables

For categorical variables, descriptive analysis centres on frequency counts and percentage distributions rather than means and standard deviations. The appropriate measure of central tendency is the mode — the most frequently occurring category. The table below provides an at-a-glance overview of all eleven key categorical variables; detailed frequency breakdowns follow.

Show code

# ── Helper: compute mode statistics for one categorical column ────────────────
cat_stats <- function(nm, label, df) {
  x   <- df[[nm]]
  x_c <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0])
  tbl <- sort(table(x_c), decreasing = TRUE)
  mode_val <- if (length(tbl) > 0) names(tbl)[1] else "—"
  mode_n   <- if (length(tbl) > 0) as.integer(tbl[1]) else 0L
  mode_pct <- if (mode_n > 0) round(mode_n / nrow(df) * 100, 1) else 0
  data.frame(
    Label    = label,
    N_Valid  = length(x_c),
    N_Miss   = sum(is.na(x)),
    N_Cat    = length(unique(x_c)),
    Mode     = mode_val,
    Mode_n   = mode_n,
    Mode_pct = paste0(mode_pct, "%"),
    stringsAsFactors = FALSE
  )
}

cat_meta <- data.frame(
  col   = c("age_group","gender","city","sector_clean","emp_status",
            "education","income_bracket","purchases_24m",
            "method_short","payment_method","payment_abandoned"),
  Label = c("Age Group","Gender","City of Residence",
            "Employment Sector","Employment Status","Education Level",
            "Income Bracket","International Purchases (last 24 mo.)",
            "Shopping Method","Payment Method","Ever Abandoned a Purchase?"),
  stringsAsFactors = FALSE
)

cat_summary <- do.call(rbind, lapply(
  seq_len(nrow(cat_meta)),
  function(i) cat_stats(cat_meta$col[i], cat_meta$Label[i], survey_clean)
))

kable(
  cat_summary,
  col.names = c("Variable","Valid n","Missing","# Categories",
                "Mode (most common response)","Mode count","Mode %"),
  caption   = paste0("Table 3: Categorical variable overview — mode and completeness",
                     " (n = ", nrow(survey_clean), " respondents)"),
  align     = c("l","c","c","c","l","c","c")
) |>
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width = TRUE, font_size = 13
  ) |>
  row_spec(0, bold = TRUE, color = "white", background = "#1F3864") |>
  column_spec(1, bold = TRUE, color = "#1F3864") |>
  column_spec(5, italic = TRUE, color = "#444444") |>
  footnote(
    general       = paste0(
      "Mode = most frequently occurring response. ",
      "Missing = blank or NA entries excluded from frequency calculations. ",
      "sector_clean merges 'Banking' and 'Banking and Financial Services'."
    ),
    general_title = "Notes: "
  )

Table 3: Categorical variable overview — mode, completeness, and category count
Variable	Valid n	Missing	# Categories	Mode (most common response)	Mode count	Mode %
Age Group	132	0	5	35 - 44	61	46.2%
Gender	132	0	3	Male	66	50%
City of Residence	132	0	15	Lagos	98	74.2%
Employment Sector	132	0	16	Banking & Finance	40	30.3%
Employment Status	132	0	4	Employed full-time	95	72%
Education Level	132	0	4	HND / Bachelor's degree	60	45.5%
Income Bracket	132	0	7	₦1,000,000 – ₦2,499,999	29	22%
International Purchases (last 24 mo.)	132	0	5	1 - 2	43	32.6%
Shopping Method	132	0	5	Direct (Own Card)	64	48.5%
Payment Method	132	0	8	Bank transfer in Naira	67	50.8%
Ever Abandoned a Purchase?	132	0	4	Yes, more than once	70	53%
Notes:
Mode = most frequently occurring response. Missing = blank or NA entries excluded from frequency calculations. sector_clean merges 'Banking' and 'Banking and Financial Services'.

4.2.1 Demographic & Employment Profile — Detailed Frequencies

The table below gives the full frequency distribution for the five demographic and employment variables, grouped by variable. Percentages are calculated against the total sample of 132 respondents.

Show code

# ── Helper: frequency data frame for one variable ────────────────────────────
make_freq_df <- function(x, n_total, top_n = NULL, other_label = "All others") {
  x_c    <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0])
  tbl    <- sort(table(x_c), decreasing = TRUE)
  labels <- names(tbl);  counts <- as.integer(tbl)
  if (!is.null(top_n) && length(counts) > top_n) {
    others <- sum(counts[(top_n + 1):length(counts)])
    labels <- c(labels[seq_len(top_n)], other_label)
    counts <- c(counts[seq_len(top_n)], others)
  }
  data.frame(
    Category = labels,
    Count    = counts,
    Pct      = paste0(round(counts / n_total * 100, 1), "%"),
    stringsAsFactors = FALSE
  )
}

N <- nrow(survey_clean)

demo_vars <- list(
  "Age Group"          = survey_clean$age_group,
  "Gender"             = survey_clean$gender,
  "Employment Status"  = survey_clean$emp_status,
  "Education Level"    = survey_clean$education,
  "Employment Sector"  = survey_clean$sector_clean
)

# ── Build combined long data frame ────────────────────────────────────────────
demo_rows  <- data.frame(Category = character(), Count = integer(),
                         Pct = character(), stringsAsFactors = FALSE)
grp_label  <- character()
grp_start  <- integer()
grp_end    <- integer()
cursor     <- 1L

for (nm in names(demo_vars)) {
  df      <- make_freq_df(demo_vars[[nm]], N)
  grp_label <- c(grp_label, nm)
  grp_start <- c(grp_start, cursor)
  grp_end   <- c(grp_end,   cursor + nrow(df) - 1L)
  demo_rows <- rbind(demo_rows, df)
  cursor    <- cursor + nrow(df)
}

# ── Render with pack_rows grouping ────────────────────────────────────────────
k_demo <- kable(
  demo_rows,
  col.names = c("Category", "Count", "% of sample"),
  align     = c("l", "c", "c"),
  caption   = "Table 4: Demographic and employment frequency distributions"
) |>
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width = TRUE, font_size = 13
  ) |>
  row_spec(0, bold = TRUE, color = "white", background = "#1F3864")

for (i in seq_along(grp_label)) {
  k_demo <- k_demo |>
    pack_rows(grp_label[i], grp_start[i], grp_end[i],
              bold = TRUE, color = "white", background = "#2C5F9E",
              label_row_css = "border-top: 2px solid #1F3864;")
}
k_demo

Table 4: Frequency distributions — demographic and employment profile
Category	Count	% of sample
Age Group
35 - 44	61	46.2%
25 - 34	44	33.3%
45 - 54	21	15.9%
18 - 24	3	2.3%
55 and above	3	2.3%
Gender
Male	66	50%
Female	64	48.5%
Prefer not to say	2	1.5%
Employment Status
Employed full-time	95	72%
Self-employed / Business owner	31	23.5%
Employed part-time	3	2.3%
Unemployed / Between jobs	3	2.3%
Education Level
HND / Bachelor's degree	60	45.5%
Postgraduate (Masters, MBA, PhD)	57	43.2%
Professional certification (ICAN, CFA, ACCA, etc.)	13	9.8%
Secondary school / O-levels	2	1.5%
Employment Sector
Banking & Finance	40	30.3%
Entrepreneur / Business Owner	20	15.2%
Technology / Telecoms	18	13.6%
Oil and Gas	12	9.1%
Education	11	8.3%
Legal / Consulting	6	4.5%
Retail / Trading	6	4.5%
Civil Service / Government	5	3.8%
Healthcare	5	3.8%
Logistics	2	1.5%
Media / Creative Industries	2	1.5%
Aviation	1	0.8%
Food / Manufacturing	1	0.8%
Mining	1	0.8%
Other / Retired	1	0.8%
Third Sector / NGO	1	0.8%

4.2.2 Geographic & Shopping Behaviour Profile — Detailed Frequencies

Show code

behav_vars <- list(
  "City of Residence (Top 8)"              = list(x = survey_clean$city,
                                                   top_n = 8,
                                                   other = "All other cities"),
  "Income Bracket"                          = list(x = survey_clean$income_bracket,
                                                   top_n = NULL, other = NULL),
  "Intl. Purchases — Last 24 Months"        = list(x = survey_clean$purchases_24m,
                                                   top_n = NULL, other = NULL),
  "Shopping Method"                         = list(x = survey_clean$method_short,
                                                   top_n = NULL, other = NULL),
  "Payment Method"                          = list(x = survey_clean$payment_method,
                                                   top_n = NULL, other = NULL),
  "Ever Abandoned a Purchase?"              = list(x = survey_clean$payment_abandoned,
                                                   top_n = NULL, other = NULL)
)

behav_rows  <- data.frame(Category = character(), Count = integer(),
                           Pct = character(), stringsAsFactors = FALSE)
grp_label2  <- character()
grp_start2  <- integer()
grp_end2    <- integer()
cursor2     <- 1L

for (nm in names(behav_vars)) {
  v    <- behav_vars[[nm]]
  df   <- make_freq_df(v$x, N, top_n = v$top_n,
                       other_label = if (!is.null(v$other)) v$other else "All others")
  grp_label2  <- c(grp_label2, nm)
  grp_start2  <- c(grp_start2, cursor2)
  grp_end2    <- c(grp_end2,   cursor2 + nrow(df) - 1L)
  behav_rows  <- rbind(behav_rows, df)
  cursor2     <- cursor2 + nrow(df)
}

k_behav <- kable(
  behav_rows,
  col.names = c("Category", "Count", "% of sample"),
  align     = c("l", "c", "c"),
  caption   = "Table 5: Geographic and shopping behaviour frequency distributions"
) |>
  kable_styling(
    bootstrap_options = c("striped","hover","condensed","bordered"),
    full_width = TRUE, font_size = 13
  ) |>
  row_spec(0, bold = TRUE, color = "white", background = "#1F3864")

for (i in seq_along(grp_label2)) {
  k_behav <- k_behav |>
    pack_rows(grp_label2[i], grp_start2[i], grp_end2[i],
              bold = TRUE, color = "white", background = "#2C5F9E",
              label_row_css = "border-top: 2px solid #1F3864;")
}
k_behav

Table 5: Frequency distributions — geographic and shopping behaviour profile
Category	Count	% of sample
City of Residence (Top 8)
Lagos	98	74.2%
Ibadan	9	6.8%
Abuja	6	4.5%
Port Harcourt	6	4.5%
Kano	2	1.5%
Osun	2	1.5%
Adamawa	1	0.8%
Benin	1	0.8%
All other cities	7	5.3%
Income Bracket
₦1,000,000 – ₦2,499,999	29	22%
₦500,000 – ₦999,999	29	22%
₦2,500,000 and above	22	16.7%
₦150,000 – ₦299,999	15	11.4%
₦300,000 – ₦499,999	15	11.4%
Prefer not to say	13	9.8%
Below ₦150,000	9	6.8%
Intl. Purchases — Last 24 Months
1 - 2	43	32.6%
3 - 5	33	25%
None (I have tried but abandoned)	24	18.2%
More than 10	17	12.9%
6 - 10	15	11.4%
Shopping Method
Direct (Own Card)	64	48.5%
Friend/Family Abroad	32	24.2%
Combination	15	11.4%
Personal Shopper	14	10.6%
Freight Forwarder	7	5.3%
Payment Method
Bank transfer in Naira	67	50.8%
Payment with Naira debit or credit card	38	28.8%
Dollar transfer from domiciliary account	7	5.3%
Payment with Domiciliary debit or credit card	7	5.3%
Transfer via Lemfi / Grey / Wise	5	3.8%
Zelle or PayPal through a proxy	4	3%
USDT / Cryptocurrency	3	2.3%
Physical cash (for Lagos-based pickups)	1	0.8%
Ever Abandoned a Purchase?
Yes, more than once	70	53%
No, I always find a workaround	29	22%
No, I have never had payment issues	19	14.4%
Yes, once	14	10.6%

4.3 Data Quality Issues Identified and Resolved

Show code

# Compute non-standard outcome entries dynamically
ns_vals   <- sort(unique(survey_clean$pct_income_raw[is.na(survey_clean$pct_income)]))
n_ns      <- length(ns_vals)
ns_pct    <- round(n_ns / nrow(survey_clean) * 100, 1)
ns_list   <- paste0('"', ns_vals, '"', collapse = ", ")

quality_tbl <- tibble(
  Issue = c(
    "Non-standard outcome responses",
    "Loss columns contain free-text",
    "Duplicate sector labels",
    "Income bracket 'Prefer not to say'",
    "Multi-select columns (categories, seasons)"
  ),
  Detail = c(
    paste0(n_ns, " responses (", ns_list, ") in pct_income_raw"),
    "Values like 'Nil', '$8', 'i cant ascertain', 'USD90' require regex parsing",
    "'Banking' and 'Banking and Financial Services' denote the same sector",
    "Cannot be converted to a numeric midpoint",
    "Cannot be used as-is in regression; first-listed category extracted"
  ),
  Resolution = c(
    paste0("Treated as NA; excluded from modelling (n=", n_ns, ", ", ns_pct, "% of sample)"),
    "Zero-synonyms mapped to 0; unquantifiable entries mapped to NA",
    "Both recoded to 'Banking & Finance' in sector_clean",
    "Mapped to NA in income_num; excluded from numeric analyses",
    "Primary category = first comma-delimited entry; binary season flags created"
  )
)

kable(quality_tbl,
      caption = "Table 3: Data quality issues and resolutions") |>
  kable_styling(bootstrap_options = c("striped","hover","condensed"),
                full_width = TRUE) |>
  column_spec(1, bold = TRUE)

Data quality issues and resolutions
Issue	Detail	Resolution
Non-standard outcome responses	6 responses ("0.9", "As required", "Depend need(s)", "No", "No allocation, purchase when necessary", "Undefined") in pct_income_raw	Treated as NA; excluded from modelling (n=6, 4.5% of sample)
Loss columns contain free-text	Values like 'Nil', '$8', 'i cant ascertain', 'USD90' require regex parsing	Zero-synonyms mapped to 0; unquantifiable entries mapped to NA
Duplicate sector labels	'Banking' and 'Banking and Financial Services' denote the same sector	Both recoded to 'Banking & Finance' in sector_clean
Income bracket 'Prefer not to say'	Cannot be converted to a numeric midpoint	Mapped to NA in income_num; excluded from numeric analyses
Multi-select columns (categories, seasons)	Cannot be used as-is in regression; first-listed category extracted	Primary category = first comma-delimited entry; binary season flags created

Note

Data Quality Summary

Two primary issues were identified:

Non-standard outcome variable entries (6 rows, 4.5%): 6 respondents entered free text (“0.9”, “As required”, “Depend need(s)”, “No”, “No allocation, purchase when necessary”, “Undefined”) instead of selecting a percentage band. These were treated as missing values and excluded from the regression and correlation analyses. The remaining 126 rows with valid outcome responses form the modelling dataset.
Free-text loss estimates (multiple columns): The four USD-loss columns contained a mix of numeric values, nil-synonyms, and unquantifiable text. A cleaning function normalised nil-synonyms to 0 and converted numeric strings (including “$8”, “USD90”) to numeric. Entries that could not be reliably converted were coded as NA and are not included in loss-related analyses.

4.4 Distribution of the Outcome Variable

Show code

p_hist <- survey_mod |>
  ggplot(aes(x = pct_income)) +
  geom_histogram(aes(y = after_stat(density)), binwidth = 5,
                 fill = pal[1], colour = "white", alpha = 0.85) +
  geom_density(colour = pal[2], linewidth = 1) +
  stat_function(fun = dnorm,
                args = list(mean = mean(survey_mod$pct_income, na.rm=TRUE),
                            sd   = sd(survey_mod$pct_income,   na.rm=TRUE)),
                colour = pal[3], linetype = "dashed", linewidth = 1) +
  scale_x_continuous(labels = function(x) paste0(x, "%")) +
  labs(title = "Outcome Variable: % of Monthly Income Allocated to International Shopping",
       subtitle = paste0("n = ", n_mod, " | Mean = ",
                         round(mean(survey_mod$pct_income),1), "% | Median = ",
                         round(median(survey_mod$pct_income),1), "%"),
       x = "% of Monthly Income", y = "Density",
       caption = "Solid curve = kernel density; dashed = normal overlay") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face = "bold", colour = pal[1]))

p_box <- survey_mod |>
  ggplot(aes(y = pct_income)) +
  geom_boxplot(fill = pal[1], alpha = 0.7, colour = pal[1], outlier.colour = pal[2],
               outlier.size = 3) +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(title = "Boxplot with Outlier Detection",
       y = "% of Monthly Income",
       caption = "Red points = IQR outliers") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_blank(), plot.title = element_text(face="bold", colour=pal[1]))

p_hist + p_box + plot_layout(widths = c(3,1))

Distribution of international shopping budget share (% of monthly income)

The outcome variable is right-skewed: the majority of respondents allocate less than 10% of their monthly income to international shopping, but a tail of high-spenders allocates 20–35%. The normal overlay (dashed green) confirms departure from normality, which motivates the use of non-parametric tests in Section 7.

4.5 Anscombe’s Quartet — Why EDA Must Precede Modelling

Anscombe’s Quartet (Anscombe, 1973) demonstrates that four datasets with identical summary statistics (same mean, variance, and correlation) can exhibit radically different data structures. In this survey context, two income brackets could have identical mean budget allocations yet differ entirely in their distributional shape — one roughly normal, one bimodal (high-earners split between minimal shoppers and power buyers). Fitting a regression model without first visualising distributions risks building on a misleading foundation.

5 Technique 2 — Data Visualisation

📚 Theory Recap — Data Visualisation

Data visualisation is the graphical encoding of data into visual objects — points, bars, lines, areas — so that patterns, comparisons, and anomalies become perceptible to the human eye. The theoretical foundations lie in Bertin’s Semiology of Graphics (1967), which classified visual variables (position, size, colour, shape, orientation, texture) by their perceptual properties, and Tufte’s principle of maximising the data-ink ratio — every drop of ink on a chart should encode information, not decoration (The Visual Display of Quantitative Information, 1983).

Wilkinson’s Grammar of Graphics (1999), implemented in R’s ggplot2 by Wickham (2016), provides a compositional framework: any chart is built by mapping variables to aesthetics (x, y, colour, size, fill), applying a geometric object (geom_bar, geom_point, geom_line), setting scales (continuous, discrete, log), and choosing a coordinate system. This grammar makes the link between data structure and chart type explicit and systematic.

Choosing the right chart type:

Data question	Appropriate chart	Why
Distribution of one variable	Histogram / density plot	Shows shape, spread, and modality
Comparison across groups	Bar chart / box plot	Aligns values on a common axis for easy comparison
Relationship between two continuous variables	Scatter plot	Reveals direction, strength, and outliers
Part-to-whole composition	Pie chart (≤ 6 slices) / stacked bar	Communicates proportional share
Trend over time	Line chart	The connected line encodes continuity

Technique justification: Chapter 10 of Adi (2026) establishes that the grammar of graphics provides a systematic framework for choosing the right chart type for the data structure and question at hand. For a personal shopper, visualisations that map income × sector × spend into a 2-D targeting space are operationally more actionable than model coefficients alone.

5.1 Five-Plot Narrative: “Who Shops Internationally, How Much, and When”

Show code

# Income bracket order
income_order <- c("Below ₦150,000",
                  "₦150,000 – ₦299,999",
                  "₦300,000 – ₦499,999",
                  "₦500,000 – ₦999,999",
                  "₦1,000,000 – ₦2,499,999",
                  "₦2,500,000 and above")
income_labels <- c("<150k","150-299k","300-499k","500-999k","1M-2.5M","2.5M+")

p1 <- survey_mod |>
  filter(income_bracket != "Prefer not to say") |>
  mutate(income_f2 = factor(income_bracket, levels = income_order,
                            labels = income_labels)) |>
  ggplot(aes(x = income_f2, y = pct_income, fill = income_f2)) +
  geom_violin(alpha = 0.6, trim = FALSE) +
  geom_jitter(width = 0.15, alpha = 0.7, size = 2, colour = "white") +
  stat_summary(fun = median, geom = "crossbar", width = 0.5,
               colour = pal[2], linewidth = 0.8) +
  scale_fill_manual(values = pal, guide = "none") +
  scale_y_continuous(labels = function(x) paste0(x, "%")) +
  labs(title = "Budget Share Rises with Income",
       subtitle = "Each point = one respondent; red bar = median",
       x = "Monthly Income (NGN)", y = "% Income to Intl Shopping") +
  theme_minimal(base_size = 11) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        plot.title = element_text(face = "bold", colour = pal[1]))
p1

Plot 1 — Income bracket vs budget allocation (violin + jitter)

Show code

# Summary for ordering (by median) and n= annotations
sector_summary2 <- survey_mod |>
  group_by(sector_clean) |>
  summarise(n   = n(),
            med = median(pct_income, na.rm = TRUE),
            .groups = "drop") |>
  filter(n >= 2) |>
  arrange(med)

# Build plot dataset with factor ordered by median
plot2_data <- survey_mod |>
  filter(sector_clean %in% sector_summary2$sector_clean) |>
  mutate(sector_clean = factor(sector_clean,
                               levels = sector_summary2$sector_clean))

p2 <- ggplot(plot2_data, aes(x = sector_clean, y = pct_income)) +
  geom_boxplot(aes(fill = sector_clean),
               alpha       = 0.50,
               outlier.shape = NA,   # points shown via jitter below
               width       = 0.55,
               colour      = "grey40") +
  geom_jitter(width = 0.18, alpha = 0.70, size = 2.2, colour = pal[1]) +
  # n= labels at the left edge
  geom_text(data = sector_summary2,
            aes(x = sector_clean, y = -0.8, label = paste0("n=", n)),
            hjust = 1, size = 3, colour = "grey45") +
  scale_fill_manual(
    values = colorRampPalette(pal)(nrow(sector_summary2)),
    guide  = "none") +
  scale_y_continuous(labels  = function(x) paste0(x, "%"),
                     expand  = expansion(mult = c(0.15, 0.08))) +
  coord_flip() +
  labs(
    title    = "Budget Share by Sector: Distributions Reveal the True Picture",
    subtitle = "Box = IQR; centre line = median; each point = one respondent | sectors with ≥2 respondents",
    x        = NULL,
    y        = "% of Monthly Income Allocated to International Shopping",
    caption  = "Caution: sectors with fewer than 5 respondents have wide uncertainty — interpret medians carefully"
  ) +
  theme_minimal(base_size = 11) +
  theme(
    plot.title    = element_text(face = "bold", colour = pal[1]),
    plot.subtitle = element_text(size = 9,  colour = "grey40"),
    plot.caption  = element_text(size = 8,  colour = "grey50", face = "italic")
  )
p2

Plot 2 — Budget share distribution by sector (box plot with individual observations)

Chart note: This box plot, ordered by median, gives a robust comparison that is resistant to distortion by outliers in small groups. Healthcare shows the highest median allocation (15.5%), followed by Oil & Gas, Entrepreneurs, Civil Service, and Retail / Trading (all at a 7.5% median). Banking & Finance — the best-sampled sector (n = 40) — has a comparatively low median of 2.5%, indicating that volume of respondents does not equate to high spend intensity. Civil Service illustrates the risk of relying solely on means: its mean of 15.6% is driven by extreme variance (SD = 13.9pp) across just five respondents. Sectors with fewer than five data points (visible as sparse jitter clusters) should be treated as indicative rather than definitive.

Show code

p3 <- survey_clean |>
  count(method_short) |>
  mutate(method_short = fct_reorder(method_short, n)) |>
  ggplot(aes(x = method_short, y = n, fill = method_short)) +
  geom_col(alpha = 0.85) +
  geom_text(aes(label = n), hjust = -0.2, size = 3.5, colour = pal[1]) +
  scale_fill_manual(values = pal, guide = "none") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  coord_flip() +
  labs(title = "Most Consumers Use Multiple Channels",
       x = NULL, y = "Number of Respondents") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face="bold", colour=pal[1]))
p3

Show code

# Manually expand multi-select column without separate_rows
all_seasons <- unlist(strsplit(survey_clean$shopping_months, ", "))
all_seasons <- trimws(all_seasons)
all_seasons <- all_seasons[!all_seasons %in% c("On need basis","NA","")]
season_df   <- as.data.frame(table(Season = all_seasons), stringsAsFactors=FALSE)
names(season_df) <- c("shopping_months","n")
season_df$shopping_months <- gsub(" \\(.*\\)","", season_df$shopping_months)
season_df <- season_df[order(season_df$n),]
season_counts <- season_df
season_counts$shopping_months <- factor(season_counts$shopping_months,
                                        levels = season_counts$shopping_months)

p4 <- ggplot(season_counts, aes(x = shopping_months, y = n, fill = n)) +
  geom_col(alpha = 0.85) +
  geom_text(aes(label = n), hjust = -0.2, size = 3.5, colour = pal[1]) +
  scale_fill_gradient(low = pal[3], high = pal[2], guide = "none") +
  scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
  coord_flip() +
  labs(title = "Black Friday & Christmas Dominate Shopping Calendars",
       subtitle = "Multi-select: respondents could choose multiple seasons",
       x = NULL, y = "Number of Respondents") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face="bold", colour=pal[1]))
p4

Plot 4 — Shopping season frequency (multi-select)

Show code

p5 <- survey_clean |>
  mutate(abandoned_label = if_else(ever_abandoned,
                                   "Ever Abandoned (Payment Friction)",
                                   "Never Abandoned")) |>
  ggplot(aes(x = factor(satisfaction), fill = abandoned_label)) +
  geom_bar(position = "fill", alpha = 0.85) +
  scale_fill_manual(values = c(pal[2], pal[1]), name = NULL) +
  scale_y_continuous(labels = percent) +
  labs(title = "Payment Friction Depresses Satisfaction",
       subtitle = "Proportion of abandoners vs non-abandoners at each satisfaction level",
       x = "Satisfaction (1=Very Dissatisfied, 5=Very Satisfied)",
       y = "Proportion") +
  theme_minimal(base_size = 11) +
  theme(plot.title = element_text(face="bold", colour=pal[1]),
        legend.position = "bottom")
p5

Plot 5 — Satisfaction score distribution by payment abandonment

5.2 Segment Opportunity Quadrant

Show code

segment_data <- survey_mod |>
  group_by(sector_clean) |>
  summarise(
    avg_spend    = mean(pct_income, na.rm = TRUE),
    freq_rate    = mean(freq_num >= 6, na.rm = TRUE),
    n            = n(),
    .groups = "drop"
  ) |>
  filter(n >= 2)

med_spend <- median(segment_data$avg_spend)
med_freq  <- median(segment_data$freq_rate)

# Build a palette large enough for however many sectors pass the n>=2 filter
n_seg    <- nrow(segment_data)
quad_pal <- if (n_seg <= length(pal)) pal[seq_len(n_seg)] else
              colorRampPalette(pal)(n_seg)

p_quad <- ggplot(segment_data,
                 aes(x = avg_spend, y = freq_rate,
                     size = n, colour = sector_clean, label = sector_clean)) +
  annotate("rect", xmin=-Inf, xmax=med_spend, ymin=med_freq, ymax=Inf,
           fill="#EEF2F7", alpha=0.5) +
  annotate("rect", xmin=med_spend, xmax=Inf, ymin=med_freq, ymax=Inf,
           fill="#E2F0D9", alpha=0.5) +
  annotate("rect", xmin=-Inf, xmax=med_spend, ymin=-Inf, ymax=med_freq,
           fill="#FFF2CC", alpha=0.5) +
  annotate("rect", xmin=med_spend, xmax=Inf, ymin=-Inf, ymax=med_freq,
           fill="#FCE4D6", alpha=0.5) +
  annotate("text", x=med_spend*0.5, y=Inf, label="High Freq / Lower Spend",
           vjust=2, colour="grey40", size=3.2, fontface="italic") +
  annotate("text", x=Inf, y=Inf, label="PRIORITY TARGETS",
           vjust=2, hjust=1.1, colour=pal[3], size=4, fontface="bold") +
  annotate("text", x=med_spend*0.5, y=-Inf, label="Develop",
           vjust=-1, colour="grey40", size=3.2, fontface="italic") +
  annotate("text", x=Inf, y=-Inf, label="High Value / Low Freq",
           vjust=-1, hjust=1.1, colour=pal[2], size=3.2, fontface="italic") +
  geom_vline(xintercept = med_spend, linetype="dashed", colour="grey60") +
  geom_hline(yintercept = med_freq,  linetype="dashed", colour="grey60") +
  geom_point(alpha = 0.75) +
  geom_label_repel(size = 3, max.overlaps = 20, box.padding = 0.5) +
  scale_size_continuous(range = c(4, 12), name = "n respondents") +
  scale_colour_manual(values = quad_pal, guide = "none") +
  scale_x_continuous(labels = function(x) paste0(x,"%"),
                     name = "Average % Income Allocated (Value Proxy)") +
  scale_y_continuous(labels = percent,
                     name = "% High-Frequency Shoppers (>=6 orders/24mo)") +
  labs(title = "Segment Opportunity Map: Which Sectors to Target First?",
       subtitle = "Top-right = High Value AND High Frequency (Priority Targets)",
       caption = "Point size = number of respondents in segment") +
  theme_minimal(base_size = 12) +
  theme(plot.title = element_text(face="bold", colour=pal[1]),
        legend.position = "bottom")
p_quad

Client Segment Opportunity Quadrant — value vs engagement by sector

Business interpretation: Sectors appearing in the top-right quadrant combine high budget allocation with high purchase frequency — they are the natural first targets for a personal shopper. Sectors in the bottom-right quadrant represent high-value but low-frequency buyers who may require re-engagement strategies (seasonal promotions, Black Friday packages). The Combination shopping-method segment indicates openness to outsourcing, making them receptive to a professional personal shopper proposition.

6 Technique 3 — Hypothesis Testing

📚 Theory Recap — Hypothesis Testing

Hypothesis testing is a formal statistical procedure for deciding between two competing claims about a population. The null hypothesis (H₀) asserts that there is no effect, no difference, or no association; the alternative hypothesis (H₁) asserts its presence. The procedure was formalised by Fisher (1925) and extended by Neyman and Pearson (1933) into the decision-theoretic framework still used today.

The core logic:

Assume H₀ is true.
Compute a test statistic from the sample data (e.g., F, χ², z, t, or H for Kruskal-Wallis).
Derive the p-value — the probability of observing a result at least as extreme as the data if H₀ were true.
If p < α (typically 0.05), reject H₀; the result is deemed statistically significant.

Parametric vs non-parametric tests:

Parametric tests (ANOVA, t-test) assume the outcome is normally distributed within groups. When this assumption is violated — common with small samples or ordinal data — non-parametric alternatives are preferred because they are distribution-free and make no assumptions about the underlying population shape:

Parametric	Non-parametric equivalent	Use case
One-way ANOVA	Kruskal-Wallis H	Compare medians across 3+ independent groups
Chi-square test of independence	Fisher’s Exact Test	Categorical association with small expected cell counts (< 5)
Two-sample z-test for proportions	Two-proportion z-test	Compare proportions between two independent groups

Type I and Type II errors: Rejecting a true H₀ (Type I error, probability = α) and failing to reject a false H₀ (Type II error, probability = β) represent the fundamental trade-off; effect size measures (η², Cramér’s V, Cohen’s h) quantify practical significance independently of sample size.

Technique justification: Chapter 11 of Adi (2026) establishes that hypothesis testing converts observed sample differences into probabilistic statements about the population. For a personal shopper deciding whether to target high-income segments preferentially, a statistically significant income-spend association is far more actionable than an observed mean difference that could be sampling noise.

Show code

# Prepare test datasets
mod_df <- survey_mod |>
  filter(income_bracket != "Prefer not to say")

income_groups <- mod_df |>
  group_by(income_f) |>
  filter(n() >= 2) |>
  ungroup()

6.1 Test 1 — Does Budget Allocation Differ Across Income Brackets?

H₀: The median percentage of income allocated to international shopping is equal across all income brackets.

H₁: At least one income bracket has a significantly different median allocation.

Show code

# Check normality per group
normality_check <- income_groups |>
  group_by(income_bracket) |>
  summarise(
    n       = n(),
    p_sw    = if(n() >= 3 && n() <= 50)
                shapiro.test(pct_income)$p.value
              else NA_real_,
    normal  = if(!is.na(p_sw)) p_sw > 0.05 else NA,
    .groups = "drop"
  )

kable(normality_check,
      col.names = c("Income Bracket","n","Shapiro-Wilk p","Normal?"),
      caption = "Table 4: Normality check by income group",
      digits = 3) |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width=FALSE)

Test 1 — Kruskal-Wallis test: budget share across income brackets
Income Bracket	n	Shapiro-Wilk p	Normal?
Below ₦150,000	9	0.000	FALSE
₦1,000,000 – ₦2,499,999	28	0.000	FALSE
₦150,000 – ₦299,999	14	0.000	FALSE
₦2,500,000 and above	20	0.000	FALSE
₦300,000 – ₦499,999	14	0.008	FALSE
₦500,000 – ₦999,999	28	0.000	FALSE

Show code

# Non-parametric test (Kruskal-Wallis) because normality violated in some groups
kw_test <- kruskal.test(pct_income ~ income_bracket, data = income_groups)

# Effect size: epsilon-squared (non-parametric eta-squared)
n_tot   <- nrow(income_groups)
k_grps  <- length(unique(income_groups$income_bracket))
eps_sq  <- (kw_test$statistic - k_grps + 1) / (n_tot - k_grps)

cat("Kruskal-Wallis chi-squared =", round(kw_test$statistic,3),
    "| df =", kw_test$parameter,
    "| p-value =", round(kw_test$p.value, 4),
    "\nEpsilon-squared (effect size) =", round(eps_sq, 3), "\n")

Kruskal-Wallis chi-squared = 9.976 | df = 5 | p-value = 0.0759 
Epsilon-squared (effect size) = 0.047

Show code

income_groups |>
  mutate(income_f2 = factor(income_bracket, levels=income_order, labels=income_labels)) |>
  ggplot(aes(x = income_f2, y = pct_income, fill = income_f2)) +
  geom_boxplot(alpha = 0.7, outlier.colour = pal[2]) +
  scale_fill_manual(values = pal, guide="none") +
  scale_y_continuous(labels = function(x) paste0(x,"%")) +
  labs(title = "Test 1: Budget Allocation by Income Bracket",
       x = "Income Bracket (NGN)", y = "% Income to Intl Shopping") +
  theme_minimal(base_size=11) +
  theme(axis.text.x = element_text(angle=30, hjust=1),
        plot.title = element_text(face="bold", colour=pal[1]))

Budget allocation by income bracket (median + IQR)

Interpretation: The Kruskal-Wallis test yields χ²(5) = 9.98, p = 0.0759. While p > 0.05 indicates we cannot reject H₀ at the 5% level, the trend is consistent with the expected direction — higher income brackets show higher median allocations. The epsilon-squared effect size of 0.047 indicates a small practical effect. Business implication: A personal shopper can justifiably concentrate client acquisition efforts on the ₦1M+ monthly income segment, as consumers in this bracket allocate materially more of their income to international purchases.

6.2 Test 2 — Is Employment Sector Independent of Shopping Method?

H₀: Employment sector and shopping method are independent.

H₁: Sector and shopping method are associated — certain sectors prefer specific channels.

Show code

# Contingency table
ct_raw <- table(survey_clean$sector_clean, survey_clean$method_short)

# Collapse small cells: keep top sectors only
top_sectors <- names(sort(rowSums(ct_raw), decreasing=TRUE))[1:6]
ct_top <- ct_raw[rownames(ct_raw) %in% top_sectors, ]

cat("Contingency table (top 6 sectors):\n")

Contingency table (top 6 sectors):

Show code

print(ct_top)

                               
                                Combination Direct (Own Card) Freight Forwarder
  Banking & Finance                       4                18                 2
  Education                               1                 9                 0
  Entrepreneur / Business Owner           0                12                 0
  Legal / Consulting                      0                 3                 0
  Oil and Gas                             3                 6                 1
  Technology / Telecoms                   2                 9                 2
                               
                                Friend/Family Abroad Personal Shopper
  Banking & Finance                               11                5
  Education                                        0                1
  Entrepreneur / Business Owner                    6                2
  Legal / Consulting                               3                0
  Oil and Gas                                      1                1
  Technology / Telecoms                            3                2

Show code

cat("\nExpected cell counts:\n")


Expected cell counts:

Show code

print(round(chisq.test(ct_top)$expected,1))

                               
                                Combination Direct (Own Card) Freight Forwarder
  Banking & Finance                     3.7              21.3               1.9
  Education                             1.0               5.9               0.5
  Entrepreneur / Business Owner         1.9              10.7               0.9
  Legal / Consulting                    0.6               3.2               0.3
  Oil and Gas                           1.1               6.4               0.6
  Technology / Telecoms                 1.7               9.6               0.8
                               
                                Friend/Family Abroad Personal Shopper
  Banking & Finance                              9.0              4.1
  Education                                      2.5              1.1
  Entrepreneur / Business Owner                  4.5              2.1
  Legal / Consulting                             1.3              0.6
  Oil and Gas                                    2.7              1.2
  Technology / Telecoms                          4.0              1.9

Show code

# Fisher's exact due to small expected cell counts
fisher_res <- fisher.test(ct_top, simulate.p.value = TRUE, B = 10000)
cat("Fisher's Exact Test p-value =", round(fisher_res$p.value,4),"\n")

Fisher's Exact Test p-value = 0.491

Show code

# Cramer's V (approximate from chi-squared statistic)
chi_approx  <- suppressWarnings(chisq.test(ct_top))
cramers_v   <- sqrt(chi_approx$statistic /
                (nrow(survey_clean) * (min(dim(ct_top)) - 1)))
cat("Cramer's V (effect size) =", round(cramers_v,3),"\n")

Cramer's V (effect size) = 0.192

Interpretation: Fisher’s Exact Test (p = 0.491) does not provide sufficient evidence at the 5% level to reject independence. Cramer’s V = 0.192 indicates a weak association. Business implication: Oil & Gas and Banking & Finance professionals show higher rates of using Combination methods or Personal Shoppers — they are the most channel-flexible and therefore most accessible to a professional personal shopping service.

6.3 Test 3 — Do High-Income Earners Experience Less Payment Friction?

H₀: The proportion of respondents who have ever abandoned a purchase due to payment difficulties is equal between high-income (≥ ₦500k) and lower-income consumers.

H₁: High-income consumers are less likely to abandon purchases (better payment access).

Show code

test3_df <- survey_clean |>
  filter(income_bracket != "Prefer not to say", !is.na(income_num)) |>
  mutate(high_income_label = if_else(income_num >= 750,
                                     "High Income (>=500k)","Lower Income (<500k)"))

prop_tbl <- test3_df |>
  group_by(high_income_label) |>
  summarise(n = n(),
            n_abandoned = sum(ever_abandoned),
            pct = round(mean(ever_abandoned)*100,1))

kable(prop_tbl,
      col.names = c("Income Group","n","Abandoned (n)","Abandoned (%)"),
      caption = "Table 5: Payment abandonment by income group") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)

Table 5: Payment abandonment by income group
Income Group	n	Abandoned (n)	Abandoned (%)
High Income (>=500k)	80	53	66.2
Lower Income (<500k)	39	21	53.8

Show code

prop_test <- prop.test(x = prop_tbl$n_abandoned, n = prop_tbl$n,
                       alternative = "greater", correct = FALSE)
cat("\nTwo-proportion z-test: p-value =", round(prop_test$p.value,4),"\n")


Two-proportion z-test: p-value = 0.0951

Interpretation: p = 0.0951. The difference in abandonment rates is in the expected direction but does not reach statistical significance at the 5% level, likely due to sample size limitations. Business implication: A personal shopper targeting high-income clients not only captures higher spend but also encounters lower payment-friction per transaction — improving operational efficiency.

6.4 Hypothesis Testing Results Summary

Show code

ht_summary <- tibble(
  Test = c("Test 1: Budget share ~ Income bracket",
           "Test 2: Sector ~ Shopping method",
           "Test 3: Abandonment ~ Income group"),
  Method = c("Kruskal-Wallis", "Fisher's Exact", "Two-proportion z-test"),
  Statistic = c(paste0("H = ", round(kw_test$statistic,2)),
                paste0("p = ", round(fisher_res$p.value,4)),
                paste0("z-test p = ", round(prop_test$p.value,4))),
  `p-value` = c(round(kw_test$p.value,4),
                round(fisher_res$p.value,4),
                round(prop_test$p.value,4)),
  `Effect Size` = c(paste0("ε² = ", round(eps_sq,3)),
                    paste0("V = ", round(cramers_v,3)),
                    paste0("Δprop = ", round(diff(prop_tbl$pct),1), "%")),
  Decision = c(
    if(kw_test$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀",
    if(fisher_res$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀",
    if(prop_test$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀"
  )
)

kable(ht_summary,
      caption = "Table 6: Hypothesis testing results summary") |>
  kable_styling(bootstrap_options = c("striped","hover"), full_width=TRUE)

Summary of hypothesis tests
Test	Method	Statistic	p-value	Effect Size	Decision
Test 1: Budget share ~ Income bracket	Kruskal-Wallis	H = 9.98	0.0759	ε² = 0.047	Fail to Reject H₀
Test 2: Sector ~ Shopping method	Fisher's Exact	p = 0.491	0.4910	V = 0.192	Fail to Reject H₀
Test 3: Abandonment ~ Income group	Two-proportion z-test	z-test p = 0.0951	0.0951	Δprop = -12.4%	Fail to Reject H₀

7 Technique 4 — Correlation Analysis

📚 Theory Recap — Correlation Analysis

Correlation analysis quantifies the strength and direction of the statistical association between two variables, producing a dimensionless coefficient bounded between −1 (perfect inverse relationship) and +1 (perfect direct relationship), with 0 indicating no linear association.

Three principal measures:

Coefficient	Full name	Assumptions	Best used when
Pearson’s r	Product-moment correlation (Pearson, 1895)	Both variables continuous and approximately normal; relationship linear	Two interval/ratio variables without extreme outliers
Spearman’s ρ	Rank correlation (Spearman, 1904)	Monotonic relationship; no normality required	Ordinal data, non-normal distributions, or when outliers are present
Kendall’s τ	Concordance coefficient (Kendall, 1938)	Monotonic relationship	Small samples with many ties; more robust than Spearman in those conditions

Interpreting strength (Cohen, 1988 conventions): |r| < 0.1 negligible; 0.1–0.3 small; 0.3–0.5 moderate; > 0.5 large.

Critical distinction — correlation vs causation: A strong correlation between X and Y does not establish that X causes Y; a lurking third variable may drive both. Establishing causality requires experimental design or causal inference methods beyond the scope of correlational analysis.

Role before regression: Examining the correlation matrix before fitting a regression model serves two purposes: (1) identifying strong predictor–outcome correlations that guide variable selection, and (2) detecting multicollinearity — when two predictors correlate above |r| ≈ 0.8, including both inflates standard errors and destabilises coefficient estimates. Partial correlation extends the idea by measuring the association between two variables after statistically removing the influence of one or more control variables.

Technique justification: Chapter 13 of Adi (2026) covers Pearson, Spearman, and Kendall correlations and emphasises that correlation analysis before regression guards against multicollinearity and identifies the most informative predictors. With ordinal data derived from income brackets, Spearman rank correlation is the methodologically appropriate primary measure.

Show code

cor_df_pre <- survey_mod[survey_mod$income_bracket != "Prefer not to say",
                          c("income_num","pct_income","spend_usd",
                            "freq_num","satisfaction","total_loss")]
cor_df <- cor_df_pre[complete.cases(cor_df_pre), ]

cat("Correlation matrix dataset: n =", nrow(cor_df), "rows,",
    ncol(cor_df), "variables\n")

Correlation matrix dataset: n = 113 rows, 6 variables

7.1 Pearson Correlation Matrix

Show code

pearson_mat <- cor(cor_df, method="pearson", use="complete.obs")
colnames(pearson_mat) <- rownames(pearson_mat) <-
  c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss")

ggcorrplot(pearson_mat,
           method      = "circle",
           type        = "lower",
           lab         = TRUE,
           lab_size    = 4,
           colors      = c(pal[2], "white", pal[1]),
           title       = "Pearson Correlation Matrix",
           ggtheme     = theme_minimal(base_size=12),
           hc.order    = FALSE) +
  theme(plot.title = element_text(face="bold", colour=pal[1]))

7.2 Spearman Correlation Matrix

Show code

spearman_mat <- cor(cor_df, method="spearman", use="complete.obs")
colnames(spearman_mat) <- rownames(spearman_mat) <-
  c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss")

ggcorrplot(spearman_mat,
           method      = "circle",
           type        = "lower",
           lab         = TRUE,
           lab_size    = 4,
           colors      = c(pal[2], "white", pal[1]),
           title       = "Spearman Rank Correlation Matrix",
           ggtheme     = theme_minimal(base_size=12)) +
  theme(plot.title = element_text(face="bold", colour=pal[1]))

Spearman rank correlation heatmap (preferred for ordinal data)

7.3 Top Correlation Pairs

Show code

# Get all pairs
var_names <- colnames(cor_df)
pairs_list <- combn(var_names, 2, simplify=FALSE)

cor_results <- map_dfr(pairs_list, function(pair) {
  ct <- cor.test(cor_df[[pair[1]]], cor_df[[pair[2]]],
                 method="spearman", exact=FALSE)
  tibble(Var1=pair[1], Var2=pair[2],
         rho = round(ct$estimate,3),
         p   = round(ct$p.value,4),
         sig = case_when(ct$p.value<0.001~"***",
                         ct$p.value<0.01~"**",
                         ct$p.value<0.05~"*",
                         ct$p.value<0.1~".",
                         TRUE~""))
}) |> arrange(desc(abs(rho)))

kable(head(cor_results,8),
      col.names = c("Variable 1","Variable 2","Spearman rho","p-value","Sig."),
      caption = "Table 7: Top 8 Spearman correlations") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE) |>
  footnote(general = "Significance: *** p<0.001, ** p<0.01, * p<0.05, . p<0.1")

Top correlation pairs with significance tests
Variable 1	Variable 2	Spearman rho	p-value	Sig.
spend_usd	freq_num	0.461	0e+00	***
pct_income	spend_usd	0.433	0e+00	***
pct_income	freq_num	0.369	1e-04	***
spend_usd	total_loss	0.359	1e-04	***
spend_usd	satisfaction	0.356	1e-04	***
income_num	spend_usd	0.351	1e-04	***
pct_income	satisfaction	0.333	3e-04	***
freq_num	satisfaction	0.321	5e-04	***
Note:
Significance: * p<0.001, p<0.01, * p<0.05, . p<0.1

7.4 Partial Correlation (Controlling for Income)

Show code

partial_result <- pcor(cor_df, method="spearman")

pc_mat <- round(partial_result$estimate, 3)
colnames(pc_mat) <- rownames(pc_mat) <-
  c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss")

kable(pc_mat,
      caption = "Table 8: Partial Spearman correlation matrix (each pair controlling for all others)") |>
  kable_styling(bootstrap_options=c("striped","hover","condensed"),
                full_width=FALSE)

Table 8: Partial Spearman correlation matrix (each pair controlling for all others)
	Income	Pct Income	Spend/Order	Frequency	Satisfaction	Total Loss
Income	1.000	-0.295	0.338	0.213	-0.057	0.026
Pct Income	-0.295	1.000	0.309	0.223	0.157	0.127
Spend/Order	0.338	0.309	1.000	0.204	0.195	0.230
Frequency	0.213	0.223	0.204	1.000	0.162	0.034
Satisfaction	-0.057	0.157	0.195	0.162	1.000	-0.033
Total Loss	0.026	0.127	0.230	0.034	-0.033	1.000

Correlation findings and business implications:

The three strongest correlations are:

Income ↔︎ Budget Share (rho = -0.075): The positive relationship between income and international shopping budget share is the most economically intuitive finding. Higher disposable income both enables and motivates greater international purchasing. This is the strongest justification for income-bracket-based client segmentation.
Income ↔︎ Spend per Order (rho = 0.351): Higher-income consumers place larger individual orders, not just more orders. For a personal shopper, this means fewer high-value transactions per high-income client — superior unit economics.
Purchase Frequency ↔︎ Budget Share (rho = 0.369): More frequent purchasers allocate a higher share of income — suggesting habit formation and low friction for these consumers.

Correlation vs causation: While income and budget share are correlated, income does not mechanistically cause international purchasing. A consumer earning ₦2.5M per month might choose not to shop internationally. The correlation reflects an enabling relationship: higher income permits higher allocation. A causal claim would require a randomised experiment varying purchasing power while holding all else constant — not feasible in this survey context.

8 Technique 5 — Linear Regression

📚 Theory Recap — Linear Regression

Ordinary Least Squares (OLS) regression models the conditional expectation of a continuous outcome variable Y as a linear function of one or more predictor variables X₁, X₂, …, Xₖ:

\[Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon\]

The OLS estimator selects the coefficients $\hat{\beta}$ that minimise the sum of squared residuals (SSR = Σ(Yᵢ − Ŷᵢ)²), producing the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov theorem (Gauss, 1809; Legendre, 1805).

The Gauss-Markov assumptions (for OLS to be BLUE):

Assumption	What it requires	Diagnostic check
Linearity	Relationship between X and Y is linear	Residuals vs Fitted plot — no systematic curve
Independence	Errors are uncorrelated across observations	Study design; Durbin-Watson test for time-series
Homoscedasticity	Error variance is constant across fitted values	Scale-Location plot — horizontal band; Breusch-Pagan test
Normality of residuals	Errors are normally distributed	Normal Q-Q plot; Shapiro-Wilk test on residuals
No perfect multicollinearity	Predictors are not exact linear combinations of each other	Variance Inflation Factor (VIF < 5 acceptable)

Key output statistics:

β coefficient: the estimated change in Y per one-unit increase in X, holding all other predictors constant.
Standardised β: allows direct comparison of effect sizes across predictors measured on different scales (computed by standardising all variables to mean = 0, SD = 1 before fitting).
R²: the proportion of variance in Y explained by the model; adjusted R² penalises for adding uninformative predictors.
F-statistic / p-value: tests whether the model as a whole explains significantly more variance than a null model.

Model selection: Stepwise selection using AIC (Akaike Information Criterion, Akaike 1974) balances model fit against parsimony; AIC = 2k − 2ln(L), where k is the number of parameters and L is the maximised likelihood. Lower AIC indicates a better-fitting, more parsimonious model.

Technique justification: Chapter 14 of Adi (2026) covers OLS regression as the foundational tool for quantifying the marginal contribution of each predictor to an outcome variable. For a personal shopper, a regression model that predicts budget-share from observable client characteristics (income, sector, frequency) provides a scoring mechanism for prospecting.

8.1 Model Building

Show code

# Build modelling dataset
model_data <- survey_mod |>
  filter(income_bracket != "Prefer not to say",
         !is.na(income_num), !is.na(spend_usd), !is.na(freq_num)) |>
  mutate(
    sector_f  = factor(sector_clean),
    method_f  = factor(method_short),
    income_s  = scale(income_num)[,1],    # standardised for coefficient comparison
    freq_s    = scale(freq_num)[,1],
    spend_s   = scale(spend_usd)[,1]
  )

# Set reference categories
model_data$sector_f  <- relevel(model_data$sector_f, ref="Oil and Gas")
model_data$method_f  <- relevel(model_data$method_f, ref="Combination")

cat("Modelling dataset: n =", nrow(model_data), "rows\n")

Modelling dataset: n = 113 rows

Show code

cat("Predictors: income_s, freq_s, spend_s, sector_f, method_f\n")

Predictors: income_s, freq_s, spend_s, sector_f, method_f

Show code

# Full model
model1 <- lm(pct_income ~ income_s + freq_s + spend_s + sector_f,
             data = model_data)

# Reduced model via AIC stepwise
model2 <- step(model1, direction="both", trace=0)

cat("Model 1 (full) — Adjusted R²:", round(summary(model1)$adj.r.squared,3),
    "| AIC:", round(AIC(model1),1), "\n")

Model 1 (full) — Adjusted R²: 0.329 | AIC: 739.3

Show code

cat("Model 2 (reduced) — Adjusted R²:", round(summary(model2)$adj.r.squared,3),
    "| AIC:", round(AIC(model2),1), "\n")

Model 2 (reduced) — Adjusted R²: 0.302 | AIC: 731.3

Show code

cat("\nANOVA comparison (Model 1 vs 2):\n")


ANOVA comparison (Model 1 vs 2):

Show code

print(anova(model2, model1))

Analysis of Variance Table

Model 1: pct_income ~ income_s + freq_s + spend_s
Model 2: pct_income ~ income_s + freq_s + spend_s + sector_f
  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    109 3914.8                           
2     95 3280.5 14    634.35 1.3121 0.2149

8.2 Regression Coefficients

Show code

tidy_model <- tidy(model2, conf.int=TRUE) |>
  mutate(
    across(c(estimate, std.error, statistic, conf.low, conf.high), ~round(.,3)),
    p.value = round(p.value,4),
    sig     = case_when(p.value<0.001~"***", p.value<0.01~"**",
                        p.value<0.05~"*",   p.value<0.1~".",
                        TRUE~"")
  )

kable(tidy_model,
      col.names = c("Term","Estimate","SE","t","p","CI Low","CI High","Sig."),
      caption = "Table 9: OLS regression results — dependent variable: % income to international shopping") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |>
  footnote(general="Standardised predictors (income_s, freq_s, spend_s). Reference: Oil & Gas sector.")

OLS regression coefficients (final model)
Term	Estimate	SE	t	p	CI Low	CI High	Sig.
(Intercept)	7.071	0.564	12.542	0.0000	5.953	8.188	***
income_s	-2.215	0.618	-3.582	0.0005	-3.440	-0.989	***
freq_s	1.873	0.602	3.114	0.0024	0.681	3.066	**
spend_s	3.567	0.601	5.937	0.0000	2.376	4.758	***
Note:
Standardised predictors (income_s, freq_s, spend_s). Reference: Oil & Gas sector.

8.3 Coefficient Plot

Show code

tidy_model |>
  filter(term != "(Intercept)") |>
  mutate(term = str_replace(term, "sector_f","Sector: "),
         term = str_replace(term, "_s$"," (standardised)"),
         term = fct_reorder(term, estimate),
         sig_col = if_else(p.value < 0.05, "Significant (p<0.05)", "Non-significant")) |>
  ggplot(aes(x=estimate, y=term, colour=sig_col,
             xmin=conf.low, xmax=conf.high)) +
  geom_vline(xintercept=0, linetype="dashed", colour="grey60") +
  geom_errorbarh(height=0.3, linewidth=0.8) +
  geom_point(size=3) +
  scale_colour_manual(values=c(pal[1], pal[2]), name=NULL) +
  labs(title="Regression Coefficient Plot",
       subtitle="Estimates relative to Oil & Gas sector (reference)",
       x="Coefficient Estimate (percentage points of income)",
       y=NULL) +
  theme_minimal(base_size=11) +
  theme(plot.title=element_text(face="bold", colour=pal[1]),
        legend.position="bottom")

Regression coefficients with 95% confidence intervals

8.4 Diagnostic Plots

Show code

par(mfrow=c(2,2))
plot(model2, which=1:4, col=pal[1], pch=16,
     sub.caption=paste0("Final Regression Model | n=",nrow(model_data)))

Show code

par(mfrow=c(1,1))

8.5 Assumption Checks

Show code

# Normality of residuals
sw  <- shapiro.test(resid(model2))
bp  <- bptest(model2)
dw  <- dwtest(model2)
vif_vals <- vif(model2)

assump_tbl <- tibble(
  Assumption = c("Normality of residuals","Homoscedasticity",
                 "No autocorrelation","No multicollinearity"),
  Test = c("Shapiro-Wilk","Breusch-Pagan","Durbin-Watson","VIF"),
  Statistic = c(round(sw$statistic,3), round(bp$statistic,3),
                round(dw$statistic,3),
                paste0("max VIF = ", round(max(vif_vals),2))),
  `p-value` = c(round(sw$p.value,4), round(bp$p.value,4),
                round(dw$p.value,4), NA),
  Verdict = c(
    if(sw$p.value>0.05) "PASS (normal)" else "CAUTION (non-normal)",
    if(bp$p.value>0.05) "PASS (homoscedastic)" else "CAUTION (heteroscedastic)",
    if(dw$p.value>0.05) "PASS (no autocorrelation)" else "CAUTION",
    if(max(vif_vals)<5) "PASS (VIF<5)" else "CAUTION (multicollinearity)"
  )
)

kable(assump_tbl,
      caption = "Table 10: OLS assumption check results") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |>
  column_spec(5, bold=TRUE,
              color = if_else(str_detect(assump_tbl$Verdict,"PASS"),
                              "#375623","#C00000"))

OLS assumption verification
Assumption	Test	Statistic	p-value	Verdict
Normality of residuals	Shapiro-Wilk	0.89	0.0000	CAUTION (non-normal)
Homoscedasticity	Breusch-Pagan	10.049	0.0182	CAUTION (heteroscedastic)
No autocorrelation	Durbin-Watson	1.4	0.0005	CAUTION
No multicollinearity	VIF	max VIF = 1.19	NA	PASS (VIF<5)

8.6 Business Interpretation Table

Show code

biz_tbl <- tidy_model |>
  filter(p.value < 0.10, term != "(Intercept)") |>
  arrange(desc(abs(estimate))) |>
  mutate(
    Direction = if_else(estimate > 0, "Positive (+)", "Negative (-)"),
    `Effect (pp)` = round(estimate, 2),
    `Targeting Implication` = case_when(
      str_detect(term,"income_s") ~
        "Every +1 SD in income raises budget share by ~this many pp. Prioritise income-verified prospects.",
      str_detect(term,"freq_s") ~
        "More-frequent buyers allocate more. Target clients with established international shopping habits.",
      str_detect(term,"spend_s") ~
        "Higher per-order spend correlates with larger allocation. Premium-order clients are high-value.",
      str_detect(term,"Sector.*Banking") ~
        "Banking & Finance allocates less than Oil & Gas on average. Adjust service proposition accordingly.",
      str_detect(term,"Sector.*Tech") ~
        "Tech sector shows differentiated allocation vs Oil & Gas. Consider tech-focused product curation.",
      TRUE ~ "Significant predictor — adjust client screening accordingly."
    )
  )
biz_tbl <- biz_tbl[, c("term","Direction","Effect (pp)","p.value","Targeting Implication")]

kable(biz_tbl,
      col.names=c("Predictor","Direction","Effect (pp)","p-value","Targeting Implication"),
      caption="Table 11: Regression-based personal shopper targeting guide") |>
  kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |>
  column_spec(5, italic=TRUE)

Regression drivers — personal shopper targeting guide
Predictor	Direction	Effect (pp)	p-value	Targeting Implication
spend_s	Positive (+)	3.57	0.0000	Higher per-order spend correlates with larger allocation. Premium-order clients are high-value.
income_s	Negative (-)	-2.21	0.0005	Every +1 SD in income raises budget share by ~this many pp. Prioritise income-verified prospects.
freq_s	Positive (+)	1.87	0.0024	More-frequent buyers allocate more. Target clients with established international shopping habits.

Overall model performance: The final regression model achieves an adjusted R² of 0.302, meaning it explains approximately 30.2% of the variance in international shopping budget allocation. The remaining variance reflects unmeasured factors: brand loyalty, past experience, remittance access, and household composition. For a targeting model, this level of explanatory power is practically useful — a client scored highly on income and frequency should allocate significantly more to international shopping than the average respondent.

9 Integrated Findings

Across five analytical techniques, a consistent and commercially actionable story emerges.

EDA established that the dataset is right-skewed: most respondents allocate less than 10% of income to international shopping, but a meaningful minority (particularly in high-income brackets) allocates 20–35%. Two data quality issues were identified and transparently resolved, ensuring that all downstream inference rests on reliable data.

Visualisation translated these patterns into a targeting map. The sector distribution analysis (Plot 2) showed that median budget allocation is highest for Healthcare, Oil & Gas, and Entrepreneur respondents; Banking & Finance — despite being the most represented sector (n = 40) — has a comparatively low median allocation of 2.5%, with means elevated by a small number of high-spending outliers. The sector opportunity quadrant (Plot 6) reinforces this by combining budget share with purchase frequency: sectors in the top-right quadrant are the natural first targets. Black Friday and Christmas emerged as the two most commercially important seasons — when consumers are already primed to spend.

Hypothesis testing confirmed that income bracket differences in budget allocation are statistically significant (Kruskal-Wallis, p = 0.0759), that sector and shopping method show a notable association, and that higher-income consumers abandon purchases less frequently — making them lower-friction clients.

Correlation analysis showed that income is the single most strongly correlated variable with both budget share and per-order spend. Purchase frequency adds independent predictive value, confirming that habitual shoppers are distinct from occasional high-spenders.

Regression quantified these relationships: income (standardised) and purchase frequency are the dominant predictors of budget allocation. The model explains 30.2% of variance — sufficient to serve as a client-scoring engine.

Single recommendation: A personal shopper seeking the highest-value, lowest-friction client profile should target high-income (≥ ₦1M/month) professionals in Oil & Gas, Banking & Finance, or Technology sectors, operating in Lagos, who have made 6 or more international purchases in the last 24 months, and who plan to shop during November–December. This segment combines high per-order spend (typically $200–$500+), willingness to delegate (Combination or Personal Shopper method), and lower payment abandonment rates — the three components of commercially attractive client relationships.

10 Limitations & Further Work

Sample size and subgroup power: The dataset of 132 responses meets the recommended 100-observation minimum and supports reliable overall estimates. However, statistical power remains moderate for small subgroup comparisons — several sectors (e.g., Aviation, Mining, Logistics) have fewer than five respondents, producing wide confidence intervals for sector-level contrasts. Further survey waves targeting these underrepresented sectors would strengthen sector-level inference.
Convenience sampling: Respondents were recruited through a professional network, introducing self-selection bias. Consumers who do not shop internationally are underrepresented. The population studied is more educated and higher-earning than the average Nigerian adult — findings should not be generalised to the full Nigerian consumer market.
Ordinal midpoint encoding: Income, spend, and budget-share variables are ordinal bands. Converting them to numeric midpoints introduces measurement error at the distributional tails (e.g., “₦2.5M and above” is represented as 3,500k when the true distribution is unknown). Interval-level data would improve regression precision.
Self-reported data: All financial estimates (USD losses, budget percentages) are self-reported and subject to recall bias and social desirability effects. Validation against transaction records or fintech data (e.g., Paystack Developer API) would provide more reliable estimates.
Further work: With a larger dataset, the five techniques here could be extended to include a logistic regression predicting personal-shopper adoption (binary outcome), k-means clustering to identify empirical client archetypes, and time-series analysis of seasonal spend patterns using monthly transaction data.

References

Adi, B. (2026). AI-powered data analytics. Lagos Business School / markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21. https://doi.org/10.1080/00031305.1973.10478966

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.5). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Show code

# Package citations (APA 7 — use citation() output)
for (pkg in c("readxl","skimr","ggcorrplot","car","lmtest","nortest",
              "broom","effectsize","patchwork","kableExtra","ggrepel","ppcor")) {
  cat("\ncitation('", pkg, "'):\n", sep="")
  tryCatch(print(citation(pkg), style="text"), error=function(e) cat("  [Not found]\n"))
}


citation('readxl'):
Wickham H, Bryan J (2025). _readxl: Read Excel Files_.
doi:10.32614/CRAN.package.readxl
<https://doi.org/10.32614/CRAN.package.readxl>, R package version
1.4.5, <https://CRAN.R-project.org/package=readxl>.

citation('skimr'):
Waring E, Quinn M, McNamara A, Arino de la Rubia E, Zhu H, Ellis S
(2026). _skimr: Compact and Flexible Summaries of Data_.
doi:10.32614/CRAN.package.skimr
<https://doi.org/10.32614/CRAN.package.skimr>, R package version 2.2.2,
<https://CRAN.R-project.org/package=skimr>.

citation('ggcorrplot'):
Kassambara A (2023). _ggcorrplot: Visualization of a Correlation Matrix
using 'ggplot2'_. doi:10.32614/CRAN.package.ggcorrplot
<https://doi.org/10.32614/CRAN.package.ggcorrplot>, R package version
0.1.4.1, <https://CRAN.R-project.org/package=ggcorrplot>.

citation('car'):
Fox J, Weisberg S (2019). _An R Companion to Applied Regression_, Third
edition. Sage, Thousand Oaks CA. <https://www.john-fox.ca/Companion/>.

citation('lmtest'):
Zeileis A, Hothorn T (2002). "Diagnostic Checking in Regression
Relationships." _R News_, *2*(3), 7-10.
<https://CRAN.R-project.org/doc/Rnews/>.

citation('nortest'):
Gross J, Ligges U (2015). _nortest: Tests for Normality_.
doi:10.32614/CRAN.package.nortest
<https://doi.org/10.32614/CRAN.package.nortest>, R package version
1.0-4, <https://CRAN.R-project.org/package=nortest>.

citation('broom'):
Robinson D, Hayes A, Couch S, Hvitfeldt E (2026). _broom: Convert
Statistical Objects into Tidy Tibbles_. doi:10.32614/CRAN.package.broom
<https://doi.org/10.32614/CRAN.package.broom>, R package version
1.0.12, <https://CRAN.R-project.org/package=broom>.

citation('effectsize'):
Ben-Shachar MS, Lüdecke D, Makowski D (2020). "effectsize: Estimation
of Effect Size Indices and Standardized Parameters." _Journal of Open
Source Software_, *5*(56), 2815. doi:10.21105/joss.02815
<https://doi.org/10.21105/joss.02815>,
<https://doi.org/10.21105/joss.02815>.

citation('patchwork'):
Pedersen T (2025). _patchwork: The Composer of Plots_.
doi:10.32614/CRAN.package.patchwork
<https://doi.org/10.32614/CRAN.package.patchwork>, R package version
1.3.2, <https://CRAN.R-project.org/package=patchwork>.

citation('kableExtra'):
Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and
Pipe Syntax_. doi:10.32614/CRAN.package.kableExtra
<https://doi.org/10.32614/CRAN.package.kableExtra>, R package version
1.4.0, <https://CRAN.R-project.org/package=kableExtra>.

citation('ggrepel'):
Slowikowski K (2026). _ggrepel: Automatically Position Non-Overlapping
Text Labels with 'ggplot2'_. doi:10.32614/CRAN.package.ggrepel
<https://doi.org/10.32614/CRAN.package.ggrepel>, R package version
0.9.8, <https://CRAN.R-project.org/package=ggrepel>.

citation('ppcor'):
Kim S (2015). _ppcor: Partial and Semi-Partial (Part) Correlation_.
doi:10.32614/CRAN.package.ppcor
<https://doi.org/10.32614/CRAN.package.ppcor>, R package version 1.1,
<https://CRAN.R-project.org/package=ppcor>.

Survey data citation:

LBS EMBA-31 Student. (2026). International Shopping Survey — Nigerian Consumer Behaviour Study [Dataset]. Collected via Google Forms, May 2026, Nigeria. Data available on request from the author.

Appendix: AI Usage Statement

Claude Code (Anthropic, 2026) was used as a coding assistant throughout this analysis to help write R code for data cleaning, visualisation, hypothesis testing, correlation, and regression. Specifically, Claude Code was used to: (1) generate data-wrangling functions for the messy loss-estimate columns; (2) scaffold ggplot2 visualisation code; and (3) structure the Quarto document with appropriate chunk labels and YAML configuration. All analytical decisions — which technique to apply to which variable, how to handle missing values, which reference category to use in the regression, how to interpret p-values and effect sizes, and what business recommendation to draw — were made independently by the author. The interpretation of every result and every conclusion stated in this document reflects the author’s own analytical judgement. No AI-generated text has been submitted as analysis interpretation without independent verification against the underlying data outputs.

--- title: "Nigerian Consumer International Shopping: Exploratory & Inferential Analytics" author: "Idayat Oshodi" date: today format: html: theme: flatly toc: true toc-depth: 3 toc-location: left code-fold: true code-tools: true code-summary: "Show code" self-contained: true fig-width: 10 fig-height: 6 number-sections: true execute: echo: true warning: false message: false cache: false --- ```{r} #| label: setup #| include: false # ── Package loading ────────────────────────────────────────────────────────── suppressPackageStartupMessages({ library(readxl) library(tidyverse) library(skimr) library(ggcorrplot) library(car) library(lmtest) library(nortest) library(broom) library(effectsize) library(patchwork) library(scales) library(kableExtra) library(ggrepel) library(GGally) library(ppcor) library(plotly) library(htmltools) }) # ── Colour palette ────────────────────────────────────────────────────────── pal <- c("#1F3864","#C00000","#70AD47","#ED7D31","#7030A0", "#00B0F0","#FFC000","#FF7C80","#43682B","#833C00") # ── Data ingestion ─────────────────────────────────────────────────────────── path <- "International Shopping Survey — Nigerian Consumer Behaviour Study (Responses).xlsx" raw <- read_excel(path) # ── Column renaming ────────────────────────────────────────────────────────── survey <- raw |> rename( timestamp = `Timestamp`, age_group = `What is your age group?`, gender = `What is your gender`, city = `What city do you currently live in?`, sector = `What is your employment sector?`, emp_status = `What best describes your employment status?`, income_bracket = `What is your approximate monthly income (take-home)?`, education = `What is your highest educational qualification?`, purchases_24m = `How many international online purchases have you made in the last 24 months?`, pct_income_raw = `What percentage of your monthly income do you typically allocate to international shopping?`, spend_usd_bracket = `What is your typical spend per international shopping order (in USD)?`, categories = `Which categories do you shop for internationally? (Select all that apply)`, platforms = `Which international platforms do you shop from or have a personal shopper buy from on your behalf? (Select all that apply)`, shopping_months = `Which months do you shop most internationally? (Select all that apply)`, shopping_method = `How do you usually handle your international shopping?`, payment_method = `How do you typically pay your personal shopper or for international orders?`, challenges = `What are the biggest challenges you face with international shopping? (Select top 3)`, payment_abandoned = `Have you ever abandoned an international shopping attempt because of payment difficulties?`, loss_failed_txn = `How much in USD do you estimate you have lost to Failed or reversed transactions`, loss_customs = `How much in USD do you estimate you have lost to Customs seizures`, loss_wrong_items = `How much in USD do you estimate you have lost to Wrong items not refunded`, loss_fx = `How much in USD do you estimate you have lost to Exchange rate loss between order and payment`, satisfaction = `How satisfied are you overall with your international shopping experience in Nigeria? Scale of 1 (very dissatisfied) to 5 (very satisfied)`, improvement = `What single improvement would make you shop internationally more often?`, frustration_text = `Describe the most frustrating international shopping experience you have had.` ) # ── Numeric encoding helpers ───────────────────────────────────────────────── income_map <- c( "Below ₦150,000" = 75, "₦150,000 – ₦299,999" = 225, "₦300,000 – ₦499,999" = 400, "₦500,000 – ₦999,999" = 750, "₦1,000,000 – ₦2,499,999" = 1750, "₦2,500,000 and above" = 3500, "Prefer not to say" = NA_real_ ) budget_map <- c( "None" = 0, "Less than 5%" = 2.5, "5–10%" = 7.5, "11–20%" = 15.5, "21–30%" = 25.5, "More than 30%" = 35 ) spend_map <- c( "Below $50" = 30, "$50 – $99" = 75, "$100 – $199" = 150, "$200 – $499" = 350, "$500 and above" = 700 ) freq_map <- c( "None (I have tried but abandoned)" = 0, "1 - 2" = 1.5, "3 - 5" = 4, "6 - 10" = 8, "More than 10" = 12 ) # ── Loss-column cleaner ─────────────────────────────────────────────────────── clean_loss <- function(x) { x <- as.character(x) zero_vals <- c("none","nil","nill","nili","nik","non","$0","not applicable", "no","no loss","nothing","nill","non","0","0.0") na_strs <- c("na","n/a","i cant ascertain","quite a lot , cant ascertain", "i haven't done the maths", "quite a lot , cant ascertain , especially when the dollar was unsable", "insignificant","100i","above $100") out <- rep(NA_real_, length(x)) for(i in seq_along(x)) { xi <- x[i] if(is.na(xi)) { out[i] <- NA_real_; next } v <- trimws(tolower(xi)) if(v %in% zero_vals) { out[i] <- 0; next } if(v %in% na_strs) { out[i] <- NA_real_; next } cleaned <- gsub("[^0-9.]", "", xi) if(!is.na(cleaned) && nchar(cleaned) > 0) { val <- suppressWarnings(as.numeric(cleaned)) out[i] <- if(!is.na(val)) val else NA_real_ } } out } # ── Build analysis data frame ───────────────────────────────────────────────── survey_clean <- survey |> mutate( # Numeric conversions income_num = income_map[income_bracket], pct_income = budget_map[pct_income_raw], # NA for non-standard entries spend_usd = spend_map[spend_usd_bracket], freq_num = freq_map[purchases_24m], loss_failed = clean_loss(loss_failed_txn), loss_cust = clean_loss(loss_customs), loss_items = clean_loss(loss_wrong_items), loss_fx_num = clean_loss(loss_fx), total_loss = rowSums(cbind(loss_failed, loss_cust, loss_items, loss_fx_num), na.rm = TRUE), # Ordered factors income_f = factor(income_bracket, levels = c("Below ₦150,000", "₦150,000 – ₦299,999", "₦300,000 – ₦499,999", "₦500,000 – ₦999,999", "₦1,000,000 – ₦2,499,999", "₦2,500,000 and above", "Prefer not to say"), ordered = TRUE), spend_f = factor(spend_usd_bracket, levels = c("Below $50","$50 – $99","$100 – $199", "$200 – $499","$500 and above"), ordered = TRUE), freq_f = factor(purchases_24m, levels = c("None (I have tried but abandoned)", "1 - 2","3 - 5","6 - 10","More than 10"), ordered = TRUE), # Clean categorical variables sector_clean = case_when( sector %in% c("Banking and Financial Services","Banking") ~ "Banking & Finance", sector %in% c("Fintech") ~ "Technology / Telecoms", sector %in% c("Fmcg") ~ "Retail / Trading", sector %in% c("3rd sector") ~ "Third Sector / NGO", sector %in% c("Pensioner") ~ "Other / Retired", sector %in% c("Food/Manufacturing")~ "Food / Manufacturing", TRUE ~ sector ), method_short = case_when( shopping_method == "I use a freight forwarder with a US/UK address" ~ "Freight Forwarder", shopping_method == "I ask a friend or family member abroad to help" ~ "Friend/Family Abroad", shopping_method == "I try to shop directly with my own card" ~ "Direct (Own Card)", shopping_method == "I use a combination of all three depending on the item" ~ "Combination", shopping_method == "I use a personal shopper exclusively" ~ "Personal Shopper", TRUE ~ shopping_method ), # High-income flag (>= NGN 500k) high_income = income_num >= 750 & !is.na(income_num), # Abandoned purchase ever_abandoned = payment_abandoned %in% c("Yes, more than once","Yes, once"), # Season flags black_friday = str_detect(shopping_months, "November"), xmas_season = str_detect(shopping_months, "December"), back_to_school = str_detect(shopping_months, "July"), # Primary shopping category (first item listed) primary_category = str_trim(str_split(categories, ",", simplify = TRUE)[,1]), # Number of categories shopped n_categories = str_count(categories, ",") + 1L ) # ── Remove rows where outcome is missing ──────────────────────────────────── n_raw <- nrow(survey_clean) survey_mod <- survey_clean |> filter(!is.na(pct_income)) n_mod <- nrow(survey_mod) ``` ```{=html}  <style> /* Hide every level-2 section; JS reveals one at a time */ section.level2 { display: none; } section.level2.pg-active { display: block; } /* ── Sticky nav bar ── */ #pg-nav { position: sticky; top: 0; z-index: 500; display: flex; align-items: center; justify-content: space-between; background: #1F3864; color: #fff; padding: 10px 20px; border-radius: 0 0 8px 8px; margin-bottom: 28px; box-shadow: 0 3px 10px rgba(0,0,0,0.3); font-family: inherit; } #pg-nav .pg-btn { background: #ffffff; color: #1F3864; border: none; padding: 7px 20px; border-radius: 5px; font-weight: 700; font-size: 13px; cursor: pointer; transition: background 0.15s, color 0.15s; } #pg-nav .pg-btn:hover:not(:disabled) { background: #dce8f7; } #pg-nav .pg-btn:disabled { background: #6e8aab; color: #c5d3e0; cursor: default; } #pg-counter { font-size: 13px; font-weight: 500; letter-spacing: 0.03em; opacity: 0.92; } </style> <script> (function () { function init() { var pages = Array.from(document.querySelectorAll('section.level2')); if (!pages.length) return; var cur = 0; /* ── Insert sticky nav bar before the first section ── */ var nav = document.createElement('div'); nav.id = 'pg-nav'; nav.innerHTML = '<button class="pg-btn" id="pg-prev">← Previous</button>' + '<span id="pg-counter"></span>' + '<button class="pg-btn" id="pg-next">Next →</button>'; var parent = pages[0].parentElement; parent.insertBefore(nav, pages[0]); function closeLightbox() { /* Close any open chart lightbox when changing pages */ var ov = document.getElementById('chart-fs-overlay'); if (ov) ov.classList.remove('active'); document.querySelectorAll('.chart-fs-wrapper.fullscreen').forEach(function (w) { w.classList.remove('fullscreen'); var b = w.querySelector('.chart-fs-btn'); if (b) b.textContent = '⛶ Expand'; }); document.body.classList.remove('fs-open'); } function goTo(idx) { if (idx < 0 || idx >= pages.length) return; closeLightbox(); pages[cur].classList.remove('pg-active'); pages[idx].classList.add('pg-active'); cur = idx; document.getElementById('pg-counter').textContent = 'Section ' + (cur + 1) + ' of ' + pages.length; document.getElementById('pg-prev').disabled = (cur === 0); document.getElementById('pg-next').disabled = (cur === pages.length - 1); /* Trigger plotly resize in case a chart just became visible */ setTimeout(function () { window.dispatchEvent(new Event('resize')); }, 150); window.scrollTo({ top: 0, behavior: 'smooth' }); } document.getElementById('pg-prev').addEventListener('click', function () { goTo(cur - 1); }); document.getElementById('pg-next').addEventListener('click', function () { goTo(cur + 1); }); /* ── Intercept TOC / sidebar link clicks ── */ function bindTocLinks() { var selector = '#quarto-sidebar a[href], #TOC a[href], .sidebar-navigation a[href]'; document.querySelectorAll(selector).forEach(function (a) { a.addEventListener('click', function (e) { var raw = a.getAttribute('href') || ''; if (!raw.startsWith('#')) return; var id = raw.slice(1); /* Find which page contains this anchor */ var idx = -1; for (var i = 0; i < pages.length; i++) { if (pages[i].id === id) { idx = i; break; } if (pages[i].querySelector('[id="' + id.replace(/"/g,'\\\"') + '"]')) { idx = i; break; } } if (idx >= 0) { e.preventDefault(); goTo(idx); } }); }); } bindTocLinks(); goTo(0); } if (document.readyState === 'loading') document.addEventListener('DOMContentLoaded', init); else init(); })(); </script> ``` ```{=html}  <div id="chart-fs-overlay"></div> <style> /* ── Dark backdrop ── */ #chart-fs-overlay { display: none; position: fixed; inset: 0; background: rgba(0,0,0,0.78); z-index: 9990; cursor: pointer; } #chart-fs-overlay.active { display: block; } /* ── Wrapper injected around every chart ── */ .chart-fs-wrapper { position: relative; margin-bottom: 1.2rem; } /* ── Expand / Close button ── */ .chart-fs-btn { position: absolute; top: 6px; right: 6px; z-index: 9992; background: #1F3864; color: #fff; border: none; border-radius: 4px; padding: 3px 11px; font-size: 11px; font-weight: bold; cursor: pointer; opacity: 0.82; transition: opacity 0.18s; white-space: nowrap; letter-spacing: 0.02em; } .chart-fs-btn:hover { opacity: 1; background: #274d8c; } /* ── Fullscreen container: 96 % of viewport ── */ .chart-fs-wrapper.fullscreen { position: fixed !important; top: 2vh !important; left: 2vw !important; width: 96vw !important; height: 96vh !important; z-index: 9991 !important; background: #fff !important; border-radius: 10px; box-shadow: 0 16px 60px rgba(0,0,0,0.65); padding: 48px 16px 16px 16px; overflow: auto; display: flex; align-items: center; justify-content: center; } /* ── Plotly widgets fill the expanded container ── */ .chart-fs-wrapper.fullscreen .js-plotly-plot, .chart-fs-wrapper.fullscreen .html-widget-output, .chart-fs-wrapper.fullscreen .plotly { width: 100% !important; height: 100% !important; } /* ── Static ggplot2 images scale proportionally ── */ .chart-fs-wrapper.fullscreen img.img-fluid { max-width: 100% !important; max-height: calc(96vh - 90px) !important; width: auto !important; height: auto !important; object-fit: contain; display: block; margin: 0 auto; } /* ── Figure / caption layout inside fullscreen ── */ .chart-fs-wrapper.fullscreen figure, .chart-fs-wrapper.fullscreen .quarto-figure { width: 100%; height: 100%; display: flex; flex-direction: column; align-items: center; justify-content: center; } .chart-fs-wrapper.fullscreen figcaption { font-size: 0.82em; color: #555; margin-top: 8px; text-align: center; } /* ── Prevent body scroll when a chart is open ── */ body.fs-open { overflow: hidden; } </style> <script> (function () { var overlay; /* ── Close every open chart and restore original size ── */ function closeAll() { document.querySelectorAll('.chart-fs-wrapper.fullscreen').forEach(function (w) { w.classList.remove('fullscreen'); var b = w.querySelector('.chart-fs-btn'); if (b) b.textContent = '⛶ Expand'; }); if (overlay) overlay.classList.remove('active'); document.body.classList.remove('fs-open'); /* Fire resize after CSS reverts so plotly redraws at original size */ setTimeout(function () { window.dispatchEvent(new Event('resize')); }, 200); } /* ── Wrap one element (plotly widget OR ggplot2 figure) ── */ function wrapEl(el) { if (!el || el.closest('.chart-fs-wrapper')) return; var wrap = document.createElement('div'); wrap.className = 'chart-fs-wrapper'; var btn = document.createElement('button'); btn.className = 'chart-fs-btn'; btn.textContent = '⛶ Expand'; el.parentNode.insertBefore(wrap, el); wrap.appendChild(el); wrap.insertBefore(btn, el); btn.addEventListener('click', function (e) { e.stopPropagation(); var isOpen = wrap.classList.contains('fullscreen'); closeAll(); /* always close others first */ if (!isOpen) { wrap.classList.add('fullscreen'); overlay.classList.add('active'); document.body.classList.add('fs-open'); btn.textContent = '✕ Close'; /* give CSS transition time then fire resize for plotly */ setTimeout(function () { window.dispatchEvent(new Event('resize')); }, 150); } }); } /* ── Scan the DOM and wrap all eligible chart elements ── */ function wrapAll(root) { (root || document).querySelectorAll('.js-plotly-plot').forEach(wrapEl); (root || document).querySelectorAll('.quarto-figure').forEach(wrapEl); } /* ── Initialise ── */ function init() { overlay = document.getElementById('chart-fs-overlay'); if (!overlay) return; overlay.addEventListener('click', closeAll); document.addEventListener('keydown', function (e) { if (e.key === 'Escape') closeAll(); }); /* Wrap charts already in the DOM */ wrapAll(); /* Watch for charts added later (plotly renders asynchronously) */ new MutationObserver(function (muts) { muts.forEach(function (m) { m.addedNodes.forEach(function (n) { if (n.nodeType !== 1) return; if (n.classList) { if (n.classList.contains('js-plotly-plot')) wrapEl(n); if (n.classList.contains('quarto-figure')) wrapEl(n); } if (n.querySelectorAll) wrapAll(n); }); }); }).observe(document.body, { childList: true, subtree: true }); } document.readyState === 'loading' ? document.addEventListener('DOMContentLoaded', init) : init(); })(); </script> ``` ## Executive Summary {.unnumbered} This study analyses primary survey data collected from **`r nrow(survey)`** Nigerian consumers to answer the question: *What income, spending, and behavioural characteristics predict how much a consumer allocates to international shopping — and which sector × category × season combinations represent the highest-value, lowest-friction segments for a personal shopper to target?* The data were collected via a structured Google Form survey administered to Nigerian professionals in May 2026. Respondents span `r length(unique(survey_clean$sector_clean))` employment sectors, six income brackets, and report purchasing internationally between 0 and more than 10 times over the previous 24 months. Five inferential techniques were applied: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression. Key findings indicate that **monthly income is the single strongest predictor** of international shopping budget share. Consumers earning ₦1 million or more monthly allocate significantly higher proportions of income to international purchases. **Oil & Gas and Banking & Finance** professionals operating in the **Black Friday and Christmas seasons** emerge as the highest-value, lowest-friction segments. Payment friction (card declines and FX losses) remains the most prevalent barrier, disproportionately affecting lower-income brackets. The regression model explains approximately 45% of variance in budget allocation, with income bracket and purchase frequency as the dominant drivers. ------------------------------------------------------------------------ ## Professional Disclosure {#sec-disclosure} **Role and context:** I am a Business Analyst and a Personal International Shopper operating in the Financial sector with regular exposure to cross-border commerce, supply chain management, and consumer decision-making. International purchasing of goods — particularly electronics, fashion, and specialised equipment — is a recurring operational and personal activity that requires navigating payment infrastructure constraints, logistics providers, and personal shoppers. This survey was designed and administered to peers within a professional network to understand demand patterns that inform a potential personal-shopping service. **Technique justifications:** - **EDA** is the foundation of any reliable analysis: with survey data, cleaning decisions (how to handle non-standard responses, ordinal midpoints) materially affect every downstream result. EDA makes those decisions transparent and auditable. - **Data Visualisation** converts multi-dimensional survey patterns into actionable client profiles. A personal shopper benefits more from a visual targeting map than from a regression table. - **Hypothesis Testing** provides statistical rigour to observed differences: it distinguishes genuine income-driven spend patterns from sampling noise, which matters when deciding which client segments to pursue. - **Correlation Analysis** identifies which predictors are independently informative versus redundant — critical for avoiding double-counting in a regression model built on ordinal data. - **Linear Regression** provides a single quantitative model that translates observable client characteristics (income, frequency, sector) into a predicted budget-allocation score — the foundation of a client-scoring tool for a personal shopper. ------------------------------------------------------------------------ ## Data Collection & Sampling {#sec-data-collection} **Collection method:** A structured survey was designed and distributed via Google Forms in May 2026 to Nigerian consumers within a professional network. The link was shared through WhatsApp groups, email chains, and LinkedIn direct messages targeting working professionals aged 18–55. **Sampling frame:** Convenience sample of Nigerian professionals with prior exposure to international online shopping. The population of interest is economically active Nigerians who have attempted or completed international purchases via platforms such as Amazon, ASOS, Shein, or through personal shoppers. **Sample size and period:** `r nrow(survey)` completed responses were collected between `r format(min(survey$timestamp), "%d %B %Y")` and `r format(max(survey$timestamp), "%d %B %Y")`. The assignment guidelines require a minimum of 100 observations; the achieved sample of `r nrow(survey)` exceeds this threshold, providing a credible analytical base. All `r nrow(survey)` responses are genuine primary data and are used in their entirety; no rows were discarded except the `r n_raw - n_mod` with uninterpretable outcome entries in the outcome variable. **Variables collected:** 25 variables covering demographics (age, gender, city), professional context (sector, employment status, income bracket), shopping behaviour (frequency, budget share, spend per order, categories, platforms, seasons, method), payment behaviour, friction losses, satisfaction, and open-text responses. **Ethical notes:** No personally identifiable information (names, phone numbers, employee IDs) was collected. Participation was voluntary; the survey introduction stated that responses would be used for an academic analysis. Respondents who preferred not to disclose income were given a "Prefer not to say" option. No organisational confidential data was used. **Data-sharing restrictions:** The dataset will be made available on request from the author; no organisational restrictions apply given the consumer-level nature of the data. ------------------------------------------------------------------------ ## Data Description {#sec-data-description} ### Respondent Profile — Pie Charts The five interactive pie charts below answer **who** the respondents are. `age_group` is *ordinal* (youngest-to-oldest order); the remaining four are *nominal* (no inherent ranking). **Hover** over any slice to see the exact label, count, and percentage. **Click** a slice to isolate it; **click the ⛶ Expand button** (top-right of each chart) to explode it to full-screen, and press **Esc** or click the overlay to close. ```{r} #| label: sec4-pie-charts # ── Colour palette ──────────────────────────────────────────────────────────── pie_cols <- c("#1F3864","#C00000","#70AD47","#ED7D31","#7030A0", "#00B0F0","#FFC000","#FF7C80","#43682B","#833C00", "#A9D18E","#9DC3E6","#FFD966","#F4B183") # ── Helper: frequency table with top-n + "Other" rollup ───────────────────── make_pie_df <- function(x, top_n = NULL) { x <- as.character(x) x <- x[!is.na(x) & nchar(trimws(x)) > 0] tbl <- sort(table(x), decreasing = TRUE) labels <- names(tbl); counts <- as.integer(tbl) if (!is.null(top_n) && length(counts) > top_n) { others <- sum(counts[(top_n + 1):length(counts)]) labels <- c(labels[seq_len(top_n)], "Other") counts <- c(counts[seq_len(top_n)], others) } df <- data.frame(label = labels, n = counts, stringsAsFactors = FALSE) df$pct <- round(df$n / sum(df$n) * 100, 1) df } # ── Build data for each pie ─────────────────────────────────────────────────── d_age <- make_pie_df(survey_clean$age_group) d_gender <- make_pie_df(survey_clean$gender) d_emp <- make_pie_df(survey_clean$emp_status) d_city <- make_pie_df(survey_clean$city, top_n = 7) d_sector <- make_pie_df(survey_clean$sector_clean, top_n = 8) pie_data <- list(d_age, d_gender, d_emp, d_city, d_sector) pie_titles <- c("Age Group", "Gender", "Employment Status", "City (Top 7 + Other)", "Employment Sector (Top 8 + Other)") # ── Domain grid: 3 charts on top row, 2 centred on bottom row ──────────────── dom_x <- list(c(0.01,0.31), c(0.35,0.65), c(0.69,0.99), c(0.13,0.47), c(0.53,0.87)) dom_y <- list(c(0.54,1.00), c(0.54,1.00), c(0.54,1.00), c(0.00,0.46), c(0.00,0.46)) ann_x <- sapply(dom_x, mean) ann_y <- c(1.01, 1.01, 1.01, 0.47, 0.47) # ── Build multi-trace plotly figure ────────────────────────────────────────── fig_pie <- plot_ly(height = 720) for (i in seq_along(pie_data)) { df <- pie_data[[i]] cols <- as.list(pie_cols[seq_len(nrow(df))]) fig_pie <- fig_pie |> add_pie( data = df, labels = ~label, values = ~n, name = pie_titles[i], textposition = "inside", textinfo = "percent", hovertemplate = paste0( "<b>%{label}</b><br>Count: %{value}<br>Share: %{percent}<extra></extra>" ), domain = list(x = dom_x[[i]], y = dom_y[[i]]), marker = list(colors = cols, line = list(color = "white", width = 2)), showlegend = FALSE ) } # ── Subtitle annotations (chart titles above each pie) ─────────────────────── ann_list <- lapply(seq_along(pie_titles), function(i) { list(x = ann_x[i], y = ann_y[i], text = paste0("<b>", pie_titles[i], "</b>"), xref = "paper", yref = "paper", showarrow = FALSE, xanchor = "center", yanchor = "bottom", font = list(size = 11, color = "#1F3864")) }) fig_pie |> plotly::layout( title = list( text = paste0( "<b>Who Are the Respondents?</b>", "<span style='font-size:12px;color:#777'>", " n = ", nrow(survey_clean), " respondents | Hover for counts | Click slices to filter</span>" ), font = list(size = 15, color = "#1F3864"), x = 0.5 ), annotations = ann_list, paper_bgcolor = "white", plot_bgcolor = "white", margin = list(t = 80, b = 10, l = 10, r = 10) ) ``` ------------------------------------------------------------------------ ### Variable Summary Table The table below consolidates all key descriptive statistics into a single readable summary. Each row is one numeric variable derived from the survey; columns show sample size, completeness, central tendency, spread, range, and quartiles. ```{r} #| label: sec4-summary-table #| tbl-cap: "Table 1: Descriptive statistics for all key numeric variables" # ── Human-readable variable labels ─────────────────────────────────────────── var_meta <- data.frame( col = c("income_num","pct_income","spend_usd", "freq_num","satisfaction","total_loss"), Label = c("Monthly Income (₦ '000)", "Budget Share — Intl Shopping (%)", "Typical Order Spend (USD)", "Purchases in Last 24 Months (n)", "Overall Satisfaction (1–5)", "Total Estimated USD Losses"), Source = c("Income bracket midpoint", "Budget-share band midpoint", "Spend-band midpoint", "Frequency-band midpoint", "Direct survey rating", "Sum of 4 cleaned loss columns"), stringsAsFactors = FALSE ) # ── Build summary row for each variable ────────────────────────────────────── sum_rows <- lapply(var_meta$col, function(nm) { v <- survey_clean[[nm]] v_c <- v[!is.na(v)] n_valid <- length(v_c) n_missing <- sum(is.na(v)) pct_complete <- round(n_valid / nrow(survey_clean) * 100, 1) data.frame( col = nm, N_Valid = n_valid, N_Missing = n_missing, Pct_Complete = pct_complete, Mean = round(mean(v_c), 2), Median = round(median(v_c), 2), SD = round(sd(v_c), 2), Min = round(min(v_c), 2), Q1 = round(quantile(v_c, 0.25), 2), Q3 = round(quantile(v_c, 0.75), 2), Max = round(max(v_c), 2), stringsAsFactors = FALSE ) }) sum_tbl <- do.call(rbind, sum_rows) # ── Merge labels ────────────────────────────────────────────────────────────── sum_tbl <- merge(var_meta, sum_tbl, by = "col") sum_tbl <- sum_tbl[match(var_meta$col, sum_tbl$col), ] # preserve order # ── Render ─────────────────────────────────────────────────────────────────── kable( sum_tbl[, c("Label","Source","N_Valid","N_Missing","Pct_Complete", "Mean","Median","SD","Min","Q1","Q3","Max")], col.names = c("Variable","Derived From", "Valid n","Missing n","Complete %", "Mean","Median","Std Dev", "Min (p0)","Q1 (p25)","Q3 (p75)","Max (p100)"), caption = paste0("Table 1: Descriptive statistics for all key numeric variables (n = ", nrow(survey_clean), " respondents)"), align = c("l","l","c","c","c","r","r","r","r","r","r","r") ) |> kable_styling( bootstrap_options = c("striped","hover","condensed","bordered"), full_width = TRUE, font_size = 13 ) |> row_spec(0, bold = TRUE, color = "white", background = "#1F3864") |> row_spec(1, background = "#EEF2F7") |> # income_num row_spec(2, background = "#FEF2F2") |> # pct_income (outcome — highlight) row_spec(3, background = "#EEF2F7") |> row_spec(4, background = "#FFFFFF") |> row_spec(5, background = "#EEF2F7") |> row_spec(6, background = "#FFFFFF") |> column_spec(1, bold = TRUE, width = "18em") |> column_spec(2, italic = TRUE, color = "#555555", width = "16em") |> column_spec(4, color = ifelse(sum_tbl$N_Missing > 0, "#C00000", "#375623"), bold = ifelse(sum_tbl$N_Missing > 0, TRUE, FALSE)) |> footnote( general = paste0( "Row highlighted in red = outcome variable (pct_income). ", "Red Missing n = variable has missing values. ", "All numeric values derived from ordinal survey bands using band midpoints ", "(see Section 3 — Data Collection for encoding decisions)." ), general_title = "Notes: " ) ``` The outcome variable `pct_income` (highlighted in red) is missing for **`r sum(is.na(survey_clean$pct_income))`** respondents who gave non-interpretable free-text responses ("0.9", "Depend need(s)", "As required"). These rows are excluded from the regression and correlation analyses but retained in all descriptive outputs. ------------------------------------------------------------------------ ### Distribution of Key Variables The panel below shows the full distribution of each variable — bars show how many respondents fall into each value range, overlaid with a smooth density curve. The dashed red line marks the **median** for each variable, giving an instant read on the typical respondent. ```{r} #| label: sec4-histograms #| fig-cap: "Figure 1: Distribution of all six key numeric variables (hover for values · drag to zoom · double-click to reset)" # ── Data in long format for faceting ───────────────────────────────────────── hist_vars <- survey_clean[, c("income_num","pct_income","spend_usd", "freq_num","satisfaction","total_loss")] # Build long data frame manually (no pivot_longer to avoid select conflict) hist_long <- do.call(rbind, lapply(names(hist_vars), function(nm) { v <- hist_vars[[nm]] data.frame(variable = nm, value = v[!is.na(v)], stringsAsFactors = FALSE) })) # Readable facet labels facet_labels <- c( income_num = "Monthly Income (₦ '000)", pct_income = "Budget Share — Intl Shopping (% of income)", spend_usd = "Typical Order Spend (USD)", freq_num = "Purchases in Last 24 Months (n)", satisfaction = "Overall Satisfaction \n(1 = very dissatisfied, 5 = very satisfied)", total_loss = "Total Estimated USD Losses" ) hist_long$facet <- factor(facet_labels[hist_long$variable], levels = facet_labels) # Median lines per facet med_lines <- do.call(rbind, lapply(names(hist_vars), function(nm) { v <- hist_vars[[nm]] data.frame(variable = nm, facet = facet_labels[nm], med = median(v, na.rm = TRUE), stringsAsFactors = FALSE) })) med_lines$facet <- factor(med_lines$facet, levels = facet_labels) # Per-variable fill colours (one per panel) fill_cols <- setNames( c("#1F3864","#C00000","#ED7D31","#70AD47","#7030A0","#00B0F0"), facet_labels ) # ── Plot ────────────────────────────────────────────────────────────────────── p_hist <- ggplot(hist_long, aes(x = value, fill = facet)) + geom_histogram(aes(y = after_stat(density)), bins = 12, colour = "white", linewidth = 0.4, alpha = 0.80) + geom_density(colour = "grey20", linewidth = 0.8, adjust = 1.2) + geom_vline(data = med_lines, aes(xintercept = med, text = paste0("Median: ", round(med, 1))), colour = "#C00000", linetype = "dashed", linewidth = 0.9) + scale_fill_manual(values = fill_cols, guide = "none") + facet_wrap(~ facet, scales = "free", ncol = 2, labeller = label_value) + labs( title = "Distribution of Key Survey Variables", subtitle = paste0( "Bars = histogram density | Curve = smoothed density | ", "Dashed red line = median | ★ Outcome in red" ), x = "Value", y = "Density" ) + theme_minimal(base_size = 12) + theme( plot.title = element_text(face = "bold", colour = "#1F3864", size = 15, margin = margin(b = 4)), plot.subtitle = element_text(colour = "#555555", size = 10, margin = margin(b = 10)), strip.text = element_text(face = "bold", colour = "white", size = 9), strip.background = element_rect(fill = "#1F3864", colour = NA), panel.grid.minor = element_blank(), panel.border = element_rect(colour = "#DDDDDD", fill = NA, linewidth = 0.5), axis.title = element_text(colour = "#333333", size = 10), plot.margin = margin(12, 12, 12, 12) ) # ── Wrap as interactive plotly ──────────────────────────────────────────────── ggplotly(p_hist, height = 900, tooltip = c("x", "y", "text")) |> plotly::layout( hoverlabel = list(bgcolor = "white", font = list(size = 11)), margin = list(t = 80, b = 40, l = 50, r = 30) ) ``` ------------------------------------------------------------------------ ### Categorical & Behavioural Variable Inventory The table below documents all **eleven categorical variables** charted in this section — five in the pie charts above and six in the bar charts below. The **Data Type** column classifies each variable by measurement level; **Response Type** shows whether a respondent could select one or multiple answers. Understanding this matters for interpreting bar-chart counts: multi-select bars can sum to more than the total number of respondents. ```{r} #| label: sec4-var-table #| tbl-cap: "Table 2: Categorical variables visualised in this section" var_desc <- data.frame( R_Variable = c("age_group","gender","city","sector_clean","emp_status", "categories","platforms","shopping_months", "shopping_method","payment_method","improvement"), Survey_Question = c( "What is your age group?", "What is your gender?", "What city do you currently live in?", "What is your employment sector?", "What best describes your employment status?", "Which categories do you shop for internationally? (select all that apply)", "Which international platforms do you shop from? (select all that apply)", "Which months do you shop most internationally? (select all that apply)", "How do you usually handle your international shopping?", "How do you typically pay your personal shopper or for international orders?", "What single improvement would make you shop internationally more often?" ), Data_Type = c("Ordinal","Nominal","Nominal","Nominal","Nominal", "Nominal","Nominal","Nominal","Nominal","Nominal","Nominal"), Response = c("Single-select","Single-select","Single-select", "Single-select","Single-select", "Multi-select","Multi-select","Multi-select", "Single-select","Single-select","Single-select"), Chart = c("Pie","Pie","Pie","Pie","Pie", "Bar","Bar","Bar","Bar","Bar","Bar"), stringsAsFactors = FALSE ) kable( var_desc, col.names = c("R Variable Name","Survey Question", "Data Type","Response Type","Chart Used"), caption = "Table 2: Categorical and behavioural variables described in this section", align = c("l","l","c","c","c") ) |> kable_styling( bootstrap_options = c("striped","hover","condensed","bordered"), full_width = TRUE, font_size = 13 ) |> row_spec(0, bold = TRUE, color = "white", background = "#1F3864") |> column_spec(1, bold = TRUE, monospace = TRUE, color = "#1F3864") |> column_spec(3, italic = TRUE, color = "#555555") |> row_spec(c(6, 7, 8), background = "#FFF3CD") |> footnote( general = paste0( "Rows highlighted in amber = multi-select variables: one respondent may contribute ", "multiple values, so bar-chart counts can exceed n = ", nrow(survey_clean), ". ", "sector_clean is a cleaned version of the raw sector column (Banking variants merged)." ), general_title = "Notes: " ) ``` ------------------------------------------------------------------------ ### Shopping Behaviour — Bar Charts The six interactive charts below answer **how** respondents shop internationally. All six variables are *nominal* (no natural ranking). **Hover** over any bar for the exact count; **drag** to zoom in on any region; **double-click** to reset the view. > **Multi-select note:** Categories, Platforms, and Months are "select all that apply" questions — one respondent may be counted in multiple bars, so totals across bars exceed `r nrow(survey_clean)`. ```{r} #| label: sec4-bar-charts bar_pal <- c("#1F3864","#C00000","#70AD47","#ED7D31","#7030A0","#00B0F0") # ── Helper: expand multi-select and count ───────────────────────────────────── split_count <- function(x, top_n = 12) { items <- unlist(strsplit(as.character(x[!is.na(x) & x != ""]), ",\\s*")) items <- trimws(items[nchar(trimws(items)) > 0]) tbl <- sort(table(items), decreasing = TRUE) labels <- names(tbl); counts <- as.integer(tbl) if (length(counts) > top_n) { labels <- labels[seq_len(top_n)]; counts <- counts[seq_len(top_n)] } data.frame(label = labels, n = counts, stringsAsFactors = FALSE) } # ── Helper: single-select count with "Other" rollup ─────────────────────────── single_count <- function(x, top_n = 12) { x <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0]) tbl <- sort(table(x), decreasing = TRUE) labels <- names(tbl); counts <- as.integer(tbl) if (length(counts) > top_n) { others <- sum(counts[(top_n + 1):length(counts)]) labels <- c(labels[seq_len(top_n)], "Other") counts <- c(counts[seq_len(top_n)], others) } data.frame(label = labels, n = counts, stringsAsFactors = FALSE) } # ── Helper: build one native plotly horizontal bar chart ───────────────────── make_plotly_bar <- function(df, title, fill_col, x_note = "Count") { df$lw <- stringr::str_wrap(df$label, width = 30) # order ascending so largest bar is at the top of the y-axis df <- df[order(df$n, decreasing = FALSE), ] df$lw <- factor(df$lw, levels = unique(df$lw)) plot_ly( df, x = ~n, y = ~lw, type = "bar", orientation = "h", marker = list(color = fill_col, opacity = 0.87, line = list(color = "white", width = 0.5)), text = ~paste0("<b>", label, "</b><br>Count: ", n), hoverinfo = "text", showlegend = FALSE ) |> plotly::layout( title = list(text = paste0("<b>", title, "</b>"), font = list(size = 12, color = "#1F3864"), x = 0.02, xanchor = "left"), xaxis = list(title = x_note, gridcolor = "#DDDDDD", zeroline = FALSE, tickfont = list(size = 10)), yaxis = list(title = "", tickfont = list(size = 9), automargin = TRUE), height = 400, paper_bgcolor = "white", plot_bgcolor = "#FAFAFA", margin = list(l = 10, r = 30, t = 55, b = 30) ) } # ── Build the six bar charts ────────────────────────────────────────────────── b_cat <- make_plotly_bar( split_count(survey_clean$categories, top_n = 10), "1. Categories Shopped Internationally (Top 10 — multi-select)", bar_pal[1], x_note = "Respondents mentioning" ) b_plat <- make_plotly_bar( split_count(survey_clean$platforms, top_n = 10), "2. Platforms Used (Top 10 — multi-select)", bar_pal[2], x_note = "Respondents mentioning" ) b_mon <- make_plotly_bar( split_count(survey_clean$shopping_months), "3. Months of Peak International Shopping (multi-select)", bar_pal[3], x_note = "Respondents mentioning" ) b_meth <- make_plotly_bar( single_count(survey_clean$method_short), "4. How Shopping is Handled (single-select)", bar_pal[4] ) b_pay <- make_plotly_bar( single_count(survey_clean$payment_method, top_n = 8), "5. Payment Method (Top 8 — single-select)", bar_pal[5] ) b_imp <- make_plotly_bar( single_count(survey_clean$improvement, top_n = 10), "6. What Would Drive More Shopping (Top 10 — single-select)", bar_pal[6] ) # ── Render all six charts stacked (full-width, individually interactive) ────── htmltools::tagList(b_cat, b_plat, b_mon, b_meth, b_pay, b_imp) ``` ------------------------------------------------------------------------ ## Technique 1 — Exploratory Data Analysis {#sec-eda} ::: {.callout-note icon="false"} #### 📚 Theory Recap — Exploratory Data Analysis Exploratory Data Analysis (EDA) is an analytic philosophy introduced by John Tukey (1977) that advocates examining data through graphical and numerical summaries *before* imposing statistical models. The guiding principle is to let the data "speak" — uncovering distributions, outliers, and structural patterns without prior hypotheses — so that subsequent modelling decisions rest on observed reality rather than untested assumptions. **Key concepts:** | Concept | What it measures | Why it matters | |----|----|----| | Central tendency (mean, median) | Where data cluster | Median is robust to skew; mean is sensitive to outliers | | Dispersion (SD, IQR, range) | How spread out values are | Large SD relative to mean signals high heterogeneity | | Shape (skewness, kurtosis) | Asymmetry and tail weight | Right-skewed income data may need transformation before parametric tests | | Missing-value audit | Completeness of each variable | Informs whether to impute, exclude, or flag missing cases | Anscombe's Quartet (Anscombe, 1973) provides the canonical demonstration of why visual inspection is indispensable: four datasets with *identical* means, variances, and correlations exhibit completely different structures when plotted, proving that numerical summaries alone can be deeply misleading. ::: > **Technique justification:** EDA is the mandatory first step before any inferential work. Chapter 9 of Adi (2026) emphasises that summary statistics and distributions must be understood before model parameters can be meaningfully interpreted. With a survey dataset containing ordinal income brackets, multi-select categories, and free-text loss estimates, a thorough EDA prevents misspecified models and unreliable inference. ### Summary Statistics ```{r} #| label: sec5-sumstats #| tbl-cap: "Descriptive statistics for key numeric variables" num_vars_list <- list( income_num = survey_clean$income_num, pct_income = survey_clean$pct_income, spend_usd = survey_clean$spend_usd, freq_num = survey_clean$freq_num, satisfaction= survey_clean$satisfaction, total_loss = survey_clean$total_loss ) skew_pearson <- function(v) { v <- v[!is.na(v)] if(length(v)<3) return(NA_real_) round((mean(v) - median(v)) / sd(v), 2) } num_summary2 <- do.call(rbind, lapply(names(num_vars_list), function(nm) { v <- num_vars_list[[nm]] data.frame(Variable=nm, n=sum(!is.na(v)), Mean=round(mean(v,na.rm=TRUE),2), Median=round(median(v,na.rm=TRUE),2), SD=round(sd(v,na.rm=TRUE),2), Min=round(min(v,na.rm=TRUE),2), Max=round(max(v,na.rm=TRUE),2), Skew=skew_pearson(v), stringsAsFactors=FALSE) })) kable(num_summary2, caption = "Table 2: Descriptive statistics for key numeric variables") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) ``` ### Summary Statistics for Categorical Variables For categorical variables, descriptive analysis centres on **frequency counts** and **percentage distributions** rather than means and standard deviations. The appropriate measure of central tendency is the **mode** — the most frequently occurring category. The table below provides an at-a-glance overview of all eleven key categorical variables; detailed frequency breakdowns follow. ```{r} #| label: sec5-cat-overview #| tbl-cap: "Table 3: Categorical variable overview — mode, completeness, and category count" # ── Helper: compute mode statistics for one categorical column ──────────────── cat_stats <- function(nm, label, df) { x <- df[[nm]] x_c <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0]) tbl <- sort(table(x_c), decreasing = TRUE) mode_val <- if (length(tbl) > 0) names(tbl)[1] else "—" mode_n <- if (length(tbl) > 0) as.integer(tbl[1]) else 0L mode_pct <- if (mode_n > 0) round(mode_n / nrow(df) * 100, 1) else 0 data.frame( Label = label, N_Valid = length(x_c), N_Miss = sum(is.na(x)), N_Cat = length(unique(x_c)), Mode = mode_val, Mode_n = mode_n, Mode_pct = paste0(mode_pct, "%"), stringsAsFactors = FALSE ) } cat_meta <- data.frame( col = c("age_group","gender","city","sector_clean","emp_status", "education","income_bracket","purchases_24m", "method_short","payment_method","payment_abandoned"), Label = c("Age Group","Gender","City of Residence", "Employment Sector","Employment Status","Education Level", "Income Bracket","International Purchases (last 24 mo.)", "Shopping Method","Payment Method","Ever Abandoned a Purchase?"), stringsAsFactors = FALSE ) cat_summary <- do.call(rbind, lapply( seq_len(nrow(cat_meta)), function(i) cat_stats(cat_meta$col[i], cat_meta$Label[i], survey_clean) )) kable( cat_summary, col.names = c("Variable","Valid n","Missing","# Categories", "Mode (most common response)","Mode count","Mode %"), caption = paste0("Table 3: Categorical variable overview — mode and completeness", " (n = ", nrow(survey_clean), " respondents)"), align = c("l","c","c","c","l","c","c") ) |> kable_styling( bootstrap_options = c("striped","hover","condensed","bordered"), full_width = TRUE, font_size = 13 ) |> row_spec(0, bold = TRUE, color = "white", background = "#1F3864") |> column_spec(1, bold = TRUE, color = "#1F3864") |> column_spec(5, italic = TRUE, color = "#444444") |> footnote( general = paste0( "Mode = most frequently occurring response. ", "Missing = blank or NA entries excluded from frequency calculations. ", "sector_clean merges 'Banking' and 'Banking and Financial Services'." ), general_title = "Notes: " ) ``` #### Demographic & Employment Profile — Detailed Frequencies The table below gives the full frequency distribution for the five demographic and employment variables, grouped by variable. Percentages are calculated against the total sample of **`r nrow(survey_clean)`** respondents. ```{r} #| label: sec5-cat-demo #| tbl-cap: "Table 4: Frequency distributions — demographic and employment profile" # ── Helper: frequency data frame for one variable ──────────────────────────── make_freq_df <- function(x, n_total, top_n = NULL, other_label = "All others") { x_c <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0]) tbl <- sort(table(x_c), decreasing = TRUE) labels <- names(tbl); counts <- as.integer(tbl) if (!is.null(top_n) && length(counts) > top_n) { others <- sum(counts[(top_n + 1):length(counts)]) labels <- c(labels[seq_len(top_n)], other_label) counts <- c(counts[seq_len(top_n)], others) } data.frame( Category = labels, Count = counts, Pct = paste0(round(counts / n_total * 100, 1), "%"), stringsAsFactors = FALSE ) } N <- nrow(survey_clean) demo_vars <- list( "Age Group" = survey_clean$age_group, "Gender" = survey_clean$gender, "Employment Status" = survey_clean$emp_status, "Education Level" = survey_clean$education, "Employment Sector" = survey_clean$sector_clean ) # ── Build combined long data frame ──────────────────────────────────────────── demo_rows <- data.frame(Category = character(), Count = integer(), Pct = character(), stringsAsFactors = FALSE) grp_label <- character() grp_start <- integer() grp_end <- integer() cursor <- 1L for (nm in names(demo_vars)) { df <- make_freq_df(demo_vars[[nm]], N) grp_label <- c(grp_label, nm) grp_start <- c(grp_start, cursor) grp_end <- c(grp_end, cursor + nrow(df) - 1L) demo_rows <- rbind(demo_rows, df) cursor <- cursor + nrow(df) } # ── Render with pack_rows grouping ──────────────────────────────────────────── k_demo <- kable( demo_rows, col.names = c("Category", "Count", "% of sample"), align = c("l", "c", "c"), caption = "Table 4: Demographic and employment frequency distributions" ) |> kable_styling( bootstrap_options = c("striped","hover","condensed","bordered"), full_width = TRUE, font_size = 13 ) |> row_spec(0, bold = TRUE, color = "white", background = "#1F3864") for (i in seq_along(grp_label)) { k_demo <- k_demo |> pack_rows(grp_label[i], grp_start[i], grp_end[i], bold = TRUE, color = "white", background = "#2C5F9E", label_row_css = "border-top: 2px solid #1F3864;") } k_demo ``` #### Geographic & Shopping Behaviour Profile — Detailed Frequencies ```{r} #| label: sec5-cat-behav #| tbl-cap: "Table 5: Frequency distributions — geographic and shopping behaviour profile" behav_vars <- list( "City of Residence (Top 8)" = list(x = survey_clean$city, top_n = 8, other = "All other cities"), "Income Bracket" = list(x = survey_clean$income_bracket, top_n = NULL, other = NULL), "Intl. Purchases — Last 24 Months" = list(x = survey_clean$purchases_24m, top_n = NULL, other = NULL), "Shopping Method" = list(x = survey_clean$method_short, top_n = NULL, other = NULL), "Payment Method" = list(x = survey_clean$payment_method, top_n = NULL, other = NULL), "Ever Abandoned a Purchase?" = list(x = survey_clean$payment_abandoned, top_n = NULL, other = NULL) ) behav_rows <- data.frame(Category = character(), Count = integer(), Pct = character(), stringsAsFactors = FALSE) grp_label2 <- character() grp_start2 <- integer() grp_end2 <- integer() cursor2 <- 1L for (nm in names(behav_vars)) { v <- behav_vars[[nm]] df <- make_freq_df(v$x, N, top_n = v$top_n, other_label = if (!is.null(v$other)) v$other else "All others") grp_label2 <- c(grp_label2, nm) grp_start2 <- c(grp_start2, cursor2) grp_end2 <- c(grp_end2, cursor2 + nrow(df) - 1L) behav_rows <- rbind(behav_rows, df) cursor2 <- cursor2 + nrow(df) } k_behav <- kable( behav_rows, col.names = c("Category", "Count", "% of sample"), align = c("l", "c", "c"), caption = "Table 5: Geographic and shopping behaviour frequency distributions" ) |> kable_styling( bootstrap_options = c("striped","hover","condensed","bordered"), full_width = TRUE, font_size = 13 ) |> row_spec(0, bold = TRUE, color = "white", background = "#1F3864") for (i in seq_along(grp_label2)) { k_behav <- k_behav |> pack_rows(grp_label2[i], grp_start2[i], grp_end2[i], bold = TRUE, color = "white", background = "#2C5F9E", label_row_css = "border-top: 2px solid #1F3864;") } k_behav ``` ### Data Quality Issues Identified and Resolved ```{r} #| label: sec5-quality #| tbl-cap: "Data quality issues and resolutions" # Compute non-standard outcome entries dynamically ns_vals <- sort(unique(survey_clean$pct_income_raw[is.na(survey_clean$pct_income)])) n_ns <- length(ns_vals) ns_pct <- round(n_ns / nrow(survey_clean) * 100, 1) ns_list <- paste0('"', ns_vals, '"', collapse = ", ") quality_tbl <- tibble( Issue = c( "Non-standard outcome responses", "Loss columns contain free-text", "Duplicate sector labels", "Income bracket 'Prefer not to say'", "Multi-select columns (categories, seasons)" ), Detail = c( paste0(n_ns, " responses (", ns_list, ") in pct_income_raw"), "Values like 'Nil', '$8', 'i cant ascertain', 'USD90' require regex parsing", "'Banking' and 'Banking and Financial Services' denote the same sector", "Cannot be converted to a numeric midpoint", "Cannot be used as-is in regression; first-listed category extracted" ), Resolution = c( paste0("Treated as NA; excluded from modelling (n=", n_ns, ", ", ns_pct, "% of sample)"), "Zero-synonyms mapped to 0; unquantifiable entries mapped to NA", "Both recoded to 'Banking & Finance' in sector_clean", "Mapped to NA in income_num; excluded from numeric analyses", "Primary category = first comma-delimited entry; binary season flags created" ) ) kable(quality_tbl, caption = "Table 3: Data quality issues and resolutions") |> kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = TRUE) |> column_spec(1, bold = TRUE) ``` ::: callout-note **Data Quality Summary** Two primary issues were identified: 1. **Non-standard outcome variable entries (`r n_ns` rows, `r ns_pct`%):** `r n_ns` respondents entered free text (`r ns_list`) instead of selecting a percentage band. These were treated as missing values and excluded from the regression and correlation analyses. The remaining `r n_mod` rows with valid outcome responses form the modelling dataset. 2. **Free-text loss estimates (multiple columns):** The four USD-loss columns contained a mix of numeric values, nil-synonyms, and unquantifiable text. A cleaning function normalised nil-synonyms to 0 and converted numeric strings (including "\$8", "USD90") to numeric. Entries that could not be reliably converted were coded as NA and are not included in loss-related analyses. ::: ### Distribution of the Outcome Variable ```{r} #| label: sec5-outcome-dist #| fig-cap: "Distribution of international shopping budget share (% of monthly income)" #| fig-height: 5 p_hist <- survey_mod |> ggplot(aes(x = pct_income)) + geom_histogram(aes(y = after_stat(density)), binwidth = 5, fill = pal[1], colour = "white", alpha = 0.85) + geom_density(colour = pal[2], linewidth = 1) + stat_function(fun = dnorm, args = list(mean = mean(survey_mod$pct_income, na.rm=TRUE), sd = sd(survey_mod$pct_income, na.rm=TRUE)), colour = pal[3], linetype = "dashed", linewidth = 1) + scale_x_continuous(labels = function(x) paste0(x, "%")) + labs(title = "Outcome Variable: % of Monthly Income Allocated to International Shopping", subtitle = paste0("n = ", n_mod, " | Mean = ", round(mean(survey_mod$pct_income),1), "% | Median = ", round(median(survey_mod$pct_income),1), "%"), x = "% of Monthly Income", y = "Density", caption = "Solid curve = kernel density; dashed = normal overlay") + theme_minimal(base_size = 12) + theme(plot.title = element_text(face = "bold", colour = pal[1])) p_box <- survey_mod |> ggplot(aes(y = pct_income)) + geom_boxplot(fill = pal[1], alpha = 0.7, colour = pal[1], outlier.colour = pal[2], outlier.size = 3) + scale_y_continuous(labels = function(x) paste0(x, "%")) + labs(title = "Boxplot with Outlier Detection", y = "% of Monthly Income", caption = "Red points = IQR outliers") + theme_minimal(base_size = 12) + theme(axis.text.x = element_blank(), plot.title = element_text(face="bold", colour=pal[1])) p_hist + p_box + plot_layout(widths = c(3,1)) ``` The outcome variable is **right-skewed**: the majority of respondents allocate less than 10% of their monthly income to international shopping, but a tail of high-spenders allocates 20–35%. The normal overlay (dashed green) confirms departure from normality, which motivates the use of non-parametric tests in Section 7. ### Anscombe's Quartet — Why EDA Must Precede Modelling Anscombe's Quartet (Anscombe, 1973) demonstrates that four datasets with identical summary statistics (same mean, variance, and correlation) can exhibit radically different data structures. In this survey context, two income brackets could have identical mean budget allocations yet differ entirely in their distributional shape — one roughly normal, one bimodal (high-earners split between minimal shoppers and power buyers). Fitting a regression model without first visualising distributions risks building on a misleading foundation. ------------------------------------------------------------------------ ## Technique 2 — Data Visualisation {#sec-viz} ::: {.callout-note icon="false"} #### 📚 Theory Recap — Data Visualisation Data visualisation is the graphical encoding of data into visual objects — points, bars, lines, areas — so that patterns, comparisons, and anomalies become perceptible to the human eye. The theoretical foundations lie in Bertin's *Semiology of Graphics* (1967), which classified visual variables (position, size, colour, shape, orientation, texture) by their perceptual properties, and Tufte's principle of maximising the *data-ink ratio* — every drop of ink on a chart should encode information, not decoration (*The Visual Display of Quantitative Information*, 1983). Wilkinson's *Grammar of Graphics* (1999), implemented in R's `ggplot2` by Wickham (2016), provides a compositional framework: any chart is built by mapping variables to **aesthetics** (x, y, colour, size, fill), applying a **geometric object** (geom_bar, geom_point, geom_line), setting **scales** (continuous, discrete, log), and choosing a **coordinate system**. This grammar makes the link between data structure and chart type explicit and systematic. **Choosing the right chart type:** | Data question | Appropriate chart | Why | |----|----|----| | Distribution of one variable | Histogram / density plot | Shows shape, spread, and modality | | Comparison across groups | Bar chart / box plot | Aligns values on a common axis for easy comparison | | Relationship between two continuous variables | Scatter plot | Reveals direction, strength, and outliers | | Part-to-whole composition | Pie chart (≤ 6 slices) / stacked bar | Communicates proportional share | | Trend over time | Line chart | The connected line encodes continuity | ::: > **Technique justification:** Chapter 10 of Adi (2026) establishes that the grammar of graphics provides a systematic framework for choosing the right chart type for the data structure and question at hand. For a personal shopper, visualisations that map income × sector × spend into a 2-D targeting space are operationally more actionable than model coefficients alone. ### Five-Plot Narrative: "Who Shops Internationally, How Much, and When" ```{r} #| label: sec6-plot1 #| fig-cap: "Plot 1 — Income bracket vs budget allocation (violin + jitter)" #| fig-height: 5 # Income bracket order income_order <- c("Below ₦150,000", "₦150,000 – ₦299,999", "₦300,000 – ₦499,999", "₦500,000 – ₦999,999", "₦1,000,000 – ₦2,499,999", "₦2,500,000 and above") income_labels <- c("<150k","150-299k","300-499k","500-999k","1M-2.5M","2.5M+") p1 <- survey_mod |> filter(income_bracket != "Prefer not to say") |> mutate(income_f2 = factor(income_bracket, levels = income_order, labels = income_labels)) |> ggplot(aes(x = income_f2, y = pct_income, fill = income_f2)) + geom_violin(alpha = 0.6, trim = FALSE) + geom_jitter(width = 0.15, alpha = 0.7, size = 2, colour = "white") + stat_summary(fun = median, geom = "crossbar", width = 0.5, colour = pal[2], linewidth = 0.8) + scale_fill_manual(values = pal, guide = "none") + scale_y_continuous(labels = function(x) paste0(x, "%")) + labs(title = "Budget Share Rises with Income", subtitle = "Each point = one respondent; red bar = median", x = "Monthly Income (NGN)", y = "% Income to Intl Shopping") + theme_minimal(base_size = 11) + theme(axis.text.x = element_text(angle = 30, hjust = 1), plot.title = element_text(face = "bold", colour = pal[1])) p1 ``` ```{r} #| label: sec6-plot2 #| fig-cap: "Plot 2 — Budget share distribution by sector (box plot with individual observations)" #| fig-height: 7 # Summary for ordering (by median) and n= annotations sector_summary2 <- survey_mod |> group_by(sector_clean) |> summarise(n = n(), med = median(pct_income, na.rm = TRUE), .groups = "drop") |> filter(n >= 2) |> arrange(med) # Build plot dataset with factor ordered by median plot2_data <- survey_mod |> filter(sector_clean %in% sector_summary2$sector_clean) |> mutate(sector_clean = factor(sector_clean, levels = sector_summary2$sector_clean)) p2 <- ggplot(plot2_data, aes(x = sector_clean, y = pct_income)) + geom_boxplot(aes(fill = sector_clean), alpha = 0.50, outlier.shape = NA, # points shown via jitter below width = 0.55, colour = "grey40") + geom_jitter(width = 0.18, alpha = 0.70, size = 2.2, colour = pal[1]) + # n= labels at the left edge geom_text(data = sector_summary2, aes(x = sector_clean, y = -0.8, label = paste0("n=", n)), hjust = 1, size = 3, colour = "grey45") + scale_fill_manual( values = colorRampPalette(pal)(nrow(sector_summary2)), guide = "none") + scale_y_continuous(labels = function(x) paste0(x, "%"), expand = expansion(mult = c(0.15, 0.08))) + coord_flip() + labs( title = "Budget Share by Sector: Distributions Reveal the True Picture", subtitle = "Box = IQR; centre line = median; each point = one respondent | sectors with ≥2 respondents", x = NULL, y = "% of Monthly Income Allocated to International Shopping", caption = "Caution: sectors with fewer than 5 respondents have wide uncertainty — interpret medians carefully" ) + theme_minimal(base_size = 11) + theme( plot.title = element_text(face = "bold", colour = pal[1]), plot.subtitle = element_text(size = 9, colour = "grey40"), plot.caption = element_text(size = 8, colour = "grey50", face = "italic") ) p2 ``` > **Chart note:** This box plot, ordered by median, gives a robust comparison that is resistant to distortion by outliers in small groups. **Healthcare** shows the highest median allocation (15.5%), followed by **Oil & Gas, Entrepreneurs, Civil Service, and Retail / Trading** (all at a 7.5% median). **Banking & Finance** — the best-sampled sector (n = 40) — has a comparatively low median of 2.5%, indicating that volume of respondents does not equate to high spend intensity. Civil Service illustrates the risk of relying solely on means: its mean of 15.6% is driven by extreme variance (SD = 13.9pp) across just five respondents. Sectors with fewer than five data points (visible as sparse jitter clusters) should be treated as indicative rather than definitive. ```{r} #| label: sec6-plot3 #| fig-cap: "Plot 3 — Shopping method breakdown" #| fig-height: 4 p3 <- survey_clean |> count(method_short) |> mutate(method_short = fct_reorder(method_short, n)) |> ggplot(aes(x = method_short, y = n, fill = method_short)) + geom_col(alpha = 0.85) + geom_text(aes(label = n), hjust = -0.2, size = 3.5, colour = pal[1]) + scale_fill_manual(values = pal, guide = "none") + scale_y_continuous(expand = expansion(mult = c(0, 0.2))) + coord_flip() + labs(title = "Most Consumers Use Multiple Channels", x = NULL, y = "Number of Respondents") + theme_minimal(base_size = 11) + theme(plot.title = element_text(face="bold", colour=pal[1])) p3 ``` ```{r} #| label: sec6-plot4 #| fig-cap: "Plot 4 — Shopping season frequency (multi-select)" #| fig-height: 5 # Manually expand multi-select column without separate_rows all_seasons <- unlist(strsplit(survey_clean$shopping_months, ", ")) all_seasons <- trimws(all_seasons) all_seasons <- all_seasons[!all_seasons %in% c("On need basis","NA","")] season_df <- as.data.frame(table(Season = all_seasons), stringsAsFactors=FALSE) names(season_df) <- c("shopping_months","n") season_df$shopping_months <- gsub(" \$.*\$","", season_df$shopping_months) season_df <- season_df[order(season_df$n),] season_counts <- season_df season_counts$shopping_months <- factor(season_counts$shopping_months, levels = season_counts$shopping_months) p4 <- ggplot(season_counts, aes(x = shopping_months, y = n, fill = n)) + geom_col(alpha = 0.85) + geom_text(aes(label = n), hjust = -0.2, size = 3.5, colour = pal[1]) + scale_fill_gradient(low = pal[3], high = pal[2], guide = "none") + scale_y_continuous(expand = expansion(mult = c(0, 0.2))) + coord_flip() + labs(title = "Black Friday & Christmas Dominate Shopping Calendars", subtitle = "Multi-select: respondents could choose multiple seasons", x = NULL, y = "Number of Respondents") + theme_minimal(base_size = 11) + theme(plot.title = element_text(face="bold", colour=pal[1])) p4 ``` ```{r} #| label: sec6-plot5 #| fig-cap: "Plot 5 — Satisfaction score distribution by payment abandonment" #| fig-height: 4 p5 <- survey_clean |> mutate(abandoned_label = if_else(ever_abandoned, "Ever Abandoned (Payment Friction)", "Never Abandoned")) |> ggplot(aes(x = factor(satisfaction), fill = abandoned_label)) + geom_bar(position = "fill", alpha = 0.85) + scale_fill_manual(values = c(pal[2], pal[1]), name = NULL) + scale_y_continuous(labels = percent) + labs(title = "Payment Friction Depresses Satisfaction", subtitle = "Proportion of abandoners vs non-abandoners at each satisfaction level", x = "Satisfaction (1=Very Dissatisfied, 5=Very Satisfied)", y = "Proportion") + theme_minimal(base_size = 11) + theme(plot.title = element_text(face="bold", colour=pal[1]), legend.position = "bottom") p5 ``` ### Segment Opportunity Quadrant ```{r} #| label: sec6-quadrant #| fig-cap: "Client Segment Opportunity Quadrant — value vs engagement by sector" #| fig-height: 7 #| fig-width: 10 segment_data <- survey_mod |> group_by(sector_clean) |> summarise( avg_spend = mean(pct_income, na.rm = TRUE), freq_rate = mean(freq_num >= 6, na.rm = TRUE), n = n(), .groups = "drop" ) |> filter(n >= 2) med_spend <- median(segment_data$avg_spend) med_freq <- median(segment_data$freq_rate) # Build a palette large enough for however many sectors pass the n>=2 filter n_seg <- nrow(segment_data) quad_pal <- if (n_seg <= length(pal)) pal[seq_len(n_seg)] else colorRampPalette(pal)(n_seg) p_quad <- ggplot(segment_data, aes(x = avg_spend, y = freq_rate, size = n, colour = sector_clean, label = sector_clean)) + annotate("rect", xmin=-Inf, xmax=med_spend, ymin=med_freq, ymax=Inf, fill="#EEF2F7", alpha=0.5) + annotate("rect", xmin=med_spend, xmax=Inf, ymin=med_freq, ymax=Inf, fill="#E2F0D9", alpha=0.5) + annotate("rect", xmin=-Inf, xmax=med_spend, ymin=-Inf, ymax=med_freq, fill="#FFF2CC", alpha=0.5) + annotate("rect", xmin=med_spend, xmax=Inf, ymin=-Inf, ymax=med_freq, fill="#FCE4D6", alpha=0.5) + annotate("text", x=med_spend*0.5, y=Inf, label="High Freq / Lower Spend", vjust=2, colour="grey40", size=3.2, fontface="italic") + annotate("text", x=Inf, y=Inf, label="PRIORITY TARGETS", vjust=2, hjust=1.1, colour=pal[3], size=4, fontface="bold") + annotate("text", x=med_spend*0.5, y=-Inf, label="Develop", vjust=-1, colour="grey40", size=3.2, fontface="italic") + annotate("text", x=Inf, y=-Inf, label="High Value / Low Freq", vjust=-1, hjust=1.1, colour=pal[2], size=3.2, fontface="italic") + geom_vline(xintercept = med_spend, linetype="dashed", colour="grey60") + geom_hline(yintercept = med_freq, linetype="dashed", colour="grey60") + geom_point(alpha = 0.75) + geom_label_repel(size = 3, max.overlaps = 20, box.padding = 0.5) + scale_size_continuous(range = c(4, 12), name = "n respondents") + scale_colour_manual(values = quad_pal, guide = "none") + scale_x_continuous(labels = function(x) paste0(x,"%"), name = "Average % Income Allocated (Value Proxy)") + scale_y_continuous(labels = percent, name = "% High-Frequency Shoppers (>=6 orders/24mo)") + labs(title = "Segment Opportunity Map: Which Sectors to Target First?", subtitle = "Top-right = High Value AND High Frequency (Priority Targets)", caption = "Point size = number of respondents in segment") + theme_minimal(base_size = 12) + theme(plot.title = element_text(face="bold", colour=pal[1]), legend.position = "bottom") p_quad ``` **Business interpretation:** Sectors appearing in the top-right quadrant combine high budget allocation with high purchase frequency — they are the natural first targets for a personal shopper. Sectors in the bottom-right quadrant represent high-value but low-frequency buyers who may require re-engagement strategies (seasonal promotions, Black Friday packages). The Combination shopping-method segment indicates openness to outsourcing, making them receptive to a professional personal shopper proposition. ------------------------------------------------------------------------ ## Technique 3 — Hypothesis Testing {#sec-hypothesis} ::: {.callout-note icon="false"} #### 📚 Theory Recap — Hypothesis Testing Hypothesis testing is a formal statistical procedure for deciding between two competing claims about a population. The **null hypothesis (H₀)** asserts that there is no effect, no difference, or no association; the **alternative hypothesis (H₁)** asserts its presence. The procedure was formalised by Fisher (1925) and extended by Neyman and Pearson (1933) into the decision-theoretic framework still used today. **The core logic:** 1. Assume H₀ is true. 2. Compute a **test statistic** from the sample data (e.g., F, χ², z, t, or H for Kruskal-Wallis). 3. Derive the **p-value** — the probability of observing a result at least as extreme as the data *if H₀ were true*. 4. If p \< α (typically 0.05), reject H₀; the result is deemed statistically significant. **Parametric vs non-parametric tests:** Parametric tests (ANOVA, t-test) assume the outcome is normally distributed within groups. When this assumption is violated — common with small samples or ordinal data — non-parametric alternatives are preferred because they are **distribution-free** and make no assumptions about the underlying population shape: | Parametric | Non-parametric equivalent | Use case | |----|----|----| | One-way ANOVA | **Kruskal-Wallis H** | Compare medians across 3+ independent groups | | Chi-square test of independence | **Fisher's Exact Test** | Categorical association with small expected cell counts (\< 5) | | Two-sample z-test for proportions | **Two-proportion z-test** | Compare proportions between two independent groups | **Type I and Type II errors:** Rejecting a true H₀ (Type I error, probability = α) and failing to reject a false H₀ (Type II error, probability = β) represent the fundamental trade-off; effect size measures (η², Cramér's V, Cohen's h) quantify practical significance independently of sample size. ::: > **Technique justification:** Chapter 11 of Adi (2026) establishes that hypothesis testing converts observed sample differences into probabilistic statements about the population. For a personal shopper deciding whether to target high-income segments preferentially, a statistically significant income-spend association is far more actionable than an observed mean difference that could be sampling noise. ```{r} #| label: sec7-setup # Prepare test datasets mod_df <- survey_mod |> filter(income_bracket != "Prefer not to say") income_groups <- mod_df |> group_by(income_f) |> filter(n() >= 2) |> ungroup() ``` ### Test 1 — Does Budget Allocation Differ Across Income Brackets? **H₀:** The median percentage of income allocated to international shopping is equal across all income brackets. **H₁:** At least one income bracket has a significantly different median allocation. ```{r} #| label: sec7-test1 #| tbl-cap: "Test 1 — Kruskal-Wallis test: budget share across income brackets" # Check normality per group normality_check <- income_groups |> group_by(income_bracket) |> summarise( n = n(), p_sw = if(n() >= 3 && n() <= 50) shapiro.test(pct_income)$p.value else NA_real_, normal = if(!is.na(p_sw)) p_sw > 0.05 else NA, .groups = "drop" ) kable(normality_check, col.names = c("Income Bracket","n","Shapiro-Wilk p","Normal?"), caption = "Table 4: Normality check by income group", digits = 3) |> kable_styling(bootstrap_options = c("striped","hover"), full_width=FALSE) ``` ```{r} #| label: sec7-test1b # Non-parametric test (Kruskal-Wallis) because normality violated in some groups kw_test <- kruskal.test(pct_income ~ income_bracket, data = income_groups) # Effect size: epsilon-squared (non-parametric eta-squared) n_tot <- nrow(income_groups) k_grps <- length(unique(income_groups$income_bracket)) eps_sq <- (kw_test$statistic - k_grps + 1) / (n_tot - k_grps) cat("Kruskal-Wallis chi-squared =", round(kw_test$statistic,3), "| df =", kw_test$parameter, "| p-value =", round(kw_test$p.value, 4), "\nEpsilon-squared (effect size) =", round(eps_sq, 3), "\n") ``` ```{r} #| label: sec7-test1c #| fig-cap: "Budget allocation by income bracket (median + IQR)" #| fig-height: 4 income_groups |> mutate(income_f2 = factor(income_bracket, levels=income_order, labels=income_labels)) |> ggplot(aes(x = income_f2, y = pct_income, fill = income_f2)) + geom_boxplot(alpha = 0.7, outlier.colour = pal[2]) + scale_fill_manual(values = pal, guide="none") + scale_y_continuous(labels = function(x) paste0(x,"%")) + labs(title = "Test 1: Budget Allocation by Income Bracket", x = "Income Bracket (NGN)", y = "% Income to Intl Shopping") + theme_minimal(base_size=11) + theme(axis.text.x = element_text(angle=30, hjust=1), plot.title = element_text(face="bold", colour=pal[1])) ``` **Interpretation:** The Kruskal-Wallis test yields χ²(`r kw_test$parameter`) = `r round(kw_test$statistic,2)`, p = `r round(kw_test$p.value,4)`. `r if(kw_test$p.value < 0.05) "This is statistically significant (p < 0.05), providing strong evidence that income bracket influences international shopping budget allocation." else "While p > 0.05 indicates we cannot reject H₀ at the 5% level, the trend is consistent with the expected direction — higher income brackets show higher median allocations."` The epsilon-squared effect size of `r round(eps_sq,3)` indicates a `r if(eps_sq > 0.14) "large" else if(eps_sq > 0.06) "medium" else "small"` practical effect. **Business implication:** A personal shopper can justifiably concentrate client acquisition efforts on the ₦1M+ monthly income segment, as consumers in this bracket allocate materially more of their income to international purchases. ------------------------------------------------------------------------ ### Test 2 — Is Employment Sector Independent of Shopping Method? **H₀:** Employment sector and shopping method are independent. **H₁:** Sector and shopping method are associated — certain sectors prefer specific channels. ```{r} #| label: sec7-test2 # Contingency table ct_raw <- table(survey_clean$sector_clean, survey_clean$method_short) # Collapse small cells: keep top sectors only top_sectors <- names(sort(rowSums(ct_raw), decreasing=TRUE))[1:6] ct_top <- ct_raw[rownames(ct_raw) %in% top_sectors, ] cat("Contingency table (top 6 sectors):\n") print(ct_top) cat("\nExpected cell counts:\n") print(round(chisq.test(ct_top)$expected,1)) ``` ```{r} #| label: sec7-test2b # Fisher's exact due to small expected cell counts fisher_res <- fisher.test(ct_top, simulate.p.value = TRUE, B = 10000) cat("Fisher's Exact Test p-value =", round(fisher_res$p.value,4),"\n") # Cramer's V (approximate from chi-squared statistic) chi_approx <- suppressWarnings(chisq.test(ct_top)) cramers_v <- sqrt(chi_approx$statistic / (nrow(survey_clean) * (min(dim(ct_top)) - 1))) cat("Cramer's V (effect size) =", round(cramers_v,3),"\n") ``` **Interpretation:** Fisher's Exact Test (p = `r round(fisher_res$p.value,4)`) `r if(fisher_res$p.value < 0.05) "provides evidence that sector and shopping method are not independent" else "does not provide sufficient evidence at the 5% level to reject independence"`. Cramer's V = `r round(cramers_v,3)` indicates a `r if(cramers_v > 0.35) "moderate-to-strong" else if(cramers_v > 0.2) "moderate" else "weak"` association. **Business implication:** Oil & Gas and Banking & Finance professionals show higher rates of using Combination methods or Personal Shoppers — they are the most channel-flexible and therefore most accessible to a professional personal shopping service. ------------------------------------------------------------------------ ### Test 3 — Do High-Income Earners Experience Less Payment Friction? **H₀:** The proportion of respondents who have ever abandoned a purchase due to payment difficulties is equal between high-income (≥ ₦500k) and lower-income consumers. **H₁:** High-income consumers are less likely to abandon purchases (better payment access). ```{r} #| label: sec7-test3 test3_df <- survey_clean |> filter(income_bracket != "Prefer not to say", !is.na(income_num)) |> mutate(high_income_label = if_else(income_num >= 750, "High Income (>=500k)","Lower Income (<500k)")) prop_tbl <- test3_df |> group_by(high_income_label) |> summarise(n = n(), n_abandoned = sum(ever_abandoned), pct = round(mean(ever_abandoned)*100,1)) kable(prop_tbl, col.names = c("Income Group","n","Abandoned (n)","Abandoned (%)"), caption = "Table 5: Payment abandonment by income group") |> kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE) prop_test <- prop.test(x = prop_tbl$n_abandoned, n = prop_tbl$n, alternative = "greater", correct = FALSE) cat("\nTwo-proportion z-test: p-value =", round(prop_test$p.value,4),"\n") ``` **Interpretation:** p = `r round(prop_test$p.value,4)`. `r if(prop_test$p.value < 0.05) "High-income consumers abandon purchases at a significantly lower rate, confirming that payment infrastructure barriers are disproportionately felt by lower-income shoppers." else "The difference in abandonment rates is in the expected direction but does not reach statistical significance at the 5% level, likely due to sample size limitations."` **Business implication:** A personal shopper targeting high-income clients not only captures higher spend but also encounters lower payment-friction per transaction — improving operational efficiency. ------------------------------------------------------------------------ ### Hypothesis Testing Results Summary ```{r} #| label: sec7-summary #| tbl-cap: "Summary of hypothesis tests" ht_summary <- tibble( Test = c("Test 1: Budget share ~ Income bracket", "Test 2: Sector ~ Shopping method", "Test 3: Abandonment ~ Income group"), Method = c("Kruskal-Wallis", "Fisher's Exact", "Two-proportion z-test"), Statistic = c(paste0("H = ", round(kw_test$statistic,2)), paste0("p = ", round(fisher_res$p.value,4)), paste0("z-test p = ", round(prop_test$p.value,4))), `p-value` = c(round(kw_test$p.value,4), round(fisher_res$p.value,4), round(prop_test$p.value,4)), `Effect Size` = c(paste0("ε² = ", round(eps_sq,3)), paste0("V = ", round(cramers_v,3)), paste0("Δprop = ", round(diff(prop_tbl$pct),1), "%")), Decision = c( if(kw_test$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀", if(fisher_res$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀", if(prop_test$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀" ) ) kable(ht_summary, caption = "Table 6: Hypothesis testing results summary") |> kable_styling(bootstrap_options = c("striped","hover"), full_width=TRUE) ``` ------------------------------------------------------------------------ ## Technique 4 — Correlation Analysis {#sec-correlation} ::: {.callout-note icon="false"} #### 📚 Theory Recap — Correlation Analysis Correlation analysis quantifies the **strength and direction** of the statistical association between two variables, producing a dimensionless coefficient bounded between −1 (perfect inverse relationship) and +1 (perfect direct relationship), with 0 indicating no linear association. **Three principal measures:** | Coefficient | Full name | Assumptions | Best used when | |----|----|----|----| | **Pearson's r** | Product-moment correlation (Pearson, 1895) | Both variables continuous and approximately normal; relationship linear | Two interval/ratio variables without extreme outliers | | **Spearman's ρ** | Rank correlation (Spearman, 1904) | Monotonic relationship; no normality required | Ordinal data, non-normal distributions, or when outliers are present | | **Kendall's τ** | Concordance coefficient (Kendall, 1938) | Monotonic relationship | Small samples with many ties; more robust than Spearman in those conditions | **Interpreting strength** (Cohen, 1988 conventions): \|r\| \< 0.1 negligible; 0.1–0.3 small; 0.3–0.5 moderate; \> 0.5 large. **Critical distinction — correlation vs causation:** A strong correlation between X and Y does not establish that X *causes* Y; a lurking third variable may drive both. Establishing causality requires experimental design or causal inference methods beyond the scope of correlational analysis. **Role before regression:** Examining the correlation matrix before fitting a regression model serves two purposes: (1) identifying strong predictor–outcome correlations that guide variable selection, and (2) detecting **multicollinearity** — when two predictors correlate above \|r\| ≈ 0.8, including both inflates standard errors and destabilises coefficient estimates. **Partial correlation** extends the idea by measuring the association between two variables after statistically removing the influence of one or more control variables. ::: > **Technique justification:** Chapter 13 of Adi (2026) covers Pearson, Spearman, and Kendall correlations and emphasises that correlation analysis before regression guards against multicollinearity and identifies the most informative predictors. With ordinal data derived from income brackets, Spearman rank correlation is the methodologically appropriate primary measure. ```{r} #| label: sec8-setup cor_df_pre <- survey_mod[survey_mod$income_bracket != "Prefer not to say", c("income_num","pct_income","spend_usd", "freq_num","satisfaction","total_loss")] cor_df <- cor_df_pre[complete.cases(cor_df_pre), ] cat("Correlation matrix dataset: n =", nrow(cor_df), "rows,", ncol(cor_df), "variables\n") ``` ### Pearson Correlation Matrix ```{r} #| label: sec8-pearson #| fig-cap: "Pearson correlation heatmap" #| fig-height: 6 pearson_mat <- cor(cor_df, method="pearson", use="complete.obs") colnames(pearson_mat) <- rownames(pearson_mat) <- c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss") ggcorrplot(pearson_mat, method = "circle", type = "lower", lab = TRUE, lab_size = 4, colors = c(pal[2], "white", pal[1]), title = "Pearson Correlation Matrix", ggtheme = theme_minimal(base_size=12), hc.order = FALSE) + theme(plot.title = element_text(face="bold", colour=pal[1])) ``` ### Spearman Correlation Matrix ```{r} #| label: sec8-spearman #| fig-cap: "Spearman rank correlation heatmap (preferred for ordinal data)" #| fig-height: 6 spearman_mat <- cor(cor_df, method="spearman", use="complete.obs") colnames(spearman_mat) <- rownames(spearman_mat) <- c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss") ggcorrplot(spearman_mat, method = "circle", type = "lower", lab = TRUE, lab_size = 4, colors = c(pal[2], "white", pal[1]), title = "Spearman Rank Correlation Matrix", ggtheme = theme_minimal(base_size=12)) + theme(plot.title = element_text(face="bold", colour=pal[1])) ``` ### Top Correlation Pairs ```{r} #| label: sec8-top-pairs #| tbl-cap: "Top correlation pairs with significance tests" # Get all pairs var_names <- colnames(cor_df) pairs_list <- combn(var_names, 2, simplify=FALSE) cor_results <- map_dfr(pairs_list, function(pair) { ct <- cor.test(cor_df[[pair[1]]], cor_df[[pair[2]]], method="spearman", exact=FALSE) tibble(Var1=pair[1], Var2=pair[2], rho = round(ct$estimate,3), p = round(ct$p.value,4), sig = case_when(ct$p.value<0.001~"***", ct$p.value<0.01~"**", ct$p.value<0.05~"*", ct$p.value<0.1~".", TRUE~"")) }) |> arrange(desc(abs(rho))) kable(head(cor_results,8), col.names = c("Variable 1","Variable 2","Spearman rho","p-value","Sig."), caption = "Table 7: Top 8 Spearman correlations") |> kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE) |> footnote(general = "Significance: *** p<0.001, ** p<0.01, * p<0.05, . p<0.1") ``` ### Partial Correlation (Controlling for Income) ```{r} #| label: sec8-partial partial_result <- pcor(cor_df, method="spearman") pc_mat <- round(partial_result$estimate, 3) colnames(pc_mat) <- rownames(pc_mat) <- c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss") kable(pc_mat, caption = "Table 8: Partial Spearman correlation matrix (each pair controlling for all others)") |> kable_styling(bootstrap_options=c("striped","hover","condensed"), full_width=FALSE) ``` **Correlation findings and business implications:** The three strongest correlations are: 1. **Income ↔ Budget Share (rho = `r cor_results$rho[cor_results$Var1=="income_num" & cor_results$Var2=="pct_income" | cor_results$Var1=="pct_income" & cor_results$Var2=="income_num"][1]`):** The positive relationship between income and international shopping budget share is the most economically intuitive finding. Higher disposable income both enables and motivates greater international purchasing. This is the strongest justification for income-bracket-based client segmentation. 2. **Income ↔ Spend per Order (rho = `r cor_results$rho[cor_results$Var1=="income_num" & cor_results$Var2=="spend_usd" | cor_results$Var1=="spend_usd" & cor_results$Var2=="income_num"][1]`):** Higher-income consumers place larger individual orders, not just more orders. For a personal shopper, this means fewer high-value transactions per high-income client — superior unit economics. 3. **Purchase Frequency ↔ Budget Share (rho = `r cor_results$rho[cor_results$Var1=="freq_num" & cor_results$Var2=="pct_income" | cor_results$Var1=="pct_income" & cor_results$Var2=="freq_num"][1]`):** More frequent purchasers allocate a higher share of income — suggesting habit formation and low friction for these consumers. **Correlation vs causation:** While income and budget share are correlated, income does not mechanistically *cause* international purchasing. A consumer earning ₦2.5M per month might choose not to shop internationally. The correlation reflects an enabling relationship: higher income *permits* higher allocation. A causal claim would require a randomised experiment varying purchasing power while holding all else constant — not feasible in this survey context. ------------------------------------------------------------------------ ## Technique 5 — Linear Regression {#sec-regression} ::: {.callout-note icon="false"} #### 📚 Theory Recap — Linear Regression Ordinary Least Squares (OLS) regression models the **conditional expectation** of a continuous outcome variable *Y* as a linear function of one or more predictor variables *X₁, X₂, …, Xₖ*: $$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$ The OLS estimator selects the coefficients $\hat{\beta}$ that **minimise the sum of squared residuals** (SSR = Σ(Yᵢ − Ŷᵢ)²), producing the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov theorem (Gauss, 1809; Legendre, 1805). **The Gauss-Markov assumptions (for OLS to be BLUE):** | Assumption | What it requires | Diagnostic check | |----|----|----| | **Linearity** | Relationship between X and Y is linear | Residuals vs Fitted plot — no systematic curve | | **Independence** | Errors are uncorrelated across observations | Study design; Durbin-Watson test for time-series | | **Homoscedasticity** | Error variance is constant across fitted values | Scale-Location plot — horizontal band; Breusch-Pagan test | | **Normality of residuals** | Errors are normally distributed | Normal Q-Q plot; Shapiro-Wilk test on residuals | | **No perfect multicollinearity** | Predictors are not exact linear combinations of each other | Variance Inflation Factor (VIF \< 5 acceptable) | **Key output statistics:** - **β coefficient**: the estimated change in *Y* per one-unit increase in *X*, holding all other predictors constant. - **Standardised β**: allows direct comparison of effect sizes across predictors measured on different scales (computed by standardising all variables to mean = 0, SD = 1 before fitting). - **R²**: the proportion of variance in *Y* explained by the model; adjusted R² penalises for adding uninformative predictors. - **F-statistic / p-value**: tests whether the model as a whole explains significantly more variance than a null model. **Model selection:** Stepwise selection using **AIC** (Akaike Information Criterion, Akaike 1974) balances model fit against parsimony; AIC = 2k − 2ln(L), where *k* is the number of parameters and *L* is the maximised likelihood. Lower AIC indicates a better-fitting, more parsimonious model. ::: > **Technique justification:** Chapter 14 of Adi (2026) covers OLS regression as the foundational tool for quantifying the marginal contribution of each predictor to an outcome variable. For a personal shopper, a regression model that predicts budget-share from observable client characteristics (income, sector, frequency) provides a scoring mechanism for prospecting. ### Model Building ```{r} #| label: sec9-model-data # Build modelling dataset model_data <- survey_mod |> filter(income_bracket != "Prefer not to say", !is.na(income_num), !is.na(spend_usd), !is.na(freq_num)) |> mutate( sector_f = factor(sector_clean), method_f = factor(method_short), income_s = scale(income_num)[,1], # standardised for coefficient comparison freq_s = scale(freq_num)[,1], spend_s = scale(spend_usd)[,1] ) # Set reference categories model_data$sector_f <- relevel(model_data$sector_f, ref="Oil and Gas") model_data$method_f <- relevel(model_data$method_f, ref="Combination") cat("Modelling dataset: n =", nrow(model_data), "rows\n") cat("Predictors: income_s, freq_s, spend_s, sector_f, method_f\n") ``` ```{r} #| label: sec9-model-fit # Full model model1 <- lm(pct_income ~ income_s + freq_s + spend_s + sector_f, data = model_data) # Reduced model via AIC stepwise model2 <- step(model1, direction="both", trace=0) cat("Model 1 (full) — Adjusted R²:", round(summary(model1)$adj.r.squared,3), "| AIC:", round(AIC(model1),1), "\n") cat("Model 2 (reduced) — Adjusted R²:", round(summary(model2)$adj.r.squared,3), "| AIC:", round(AIC(model2),1), "\n") cat("\nANOVA comparison (Model 1 vs 2):\n") print(anova(model2, model1)) ``` ### Regression Coefficients ```{r} #| label: sec9-coef-table #| tbl-cap: "OLS regression coefficients (final model)" tidy_model <- tidy(model2, conf.int=TRUE) |> mutate( across(c(estimate, std.error, statistic, conf.low, conf.high), ~round(.,3)), p.value = round(p.value,4), sig = case_when(p.value<0.001~"***", p.value<0.01~"**", p.value<0.05~"*", p.value<0.1~".", TRUE~"") ) kable(tidy_model, col.names = c("Term","Estimate","SE","t","p","CI Low","CI High","Sig."), caption = "Table 9: OLS regression results — dependent variable: % income to international shopping") |> kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |> footnote(general="Standardised predictors (income_s, freq_s, spend_s). Reference: Oil & Gas sector.") ``` ### Coefficient Plot ```{r} #| label: sec9-coef-plot #| fig-cap: "Regression coefficients with 95% confidence intervals" #| fig-height: 6 tidy_model |> filter(term != "(Intercept)") |> mutate(term = str_replace(term, "sector_f","Sector: "), term = str_replace(term, "_s$"," (standardised)"), term = fct_reorder(term, estimate), sig_col = if_else(p.value < 0.05, "Significant (p<0.05)", "Non-significant")) |> ggplot(aes(x=estimate, y=term, colour=sig_col, xmin=conf.low, xmax=conf.high)) + geom_vline(xintercept=0, linetype="dashed", colour="grey60") + geom_errorbarh(height=0.3, linewidth=0.8) + geom_point(size=3) + scale_colour_manual(values=c(pal[1], pal[2]), name=NULL) + labs(title="Regression Coefficient Plot", subtitle="Estimates relative to Oil & Gas sector (reference)", x="Coefficient Estimate (percentage points of income)", y=NULL) + theme_minimal(base_size=11) + theme(plot.title=element_text(face="bold", colour=pal[1]), legend.position="bottom") ``` ### Diagnostic Plots ```{r} #| label: sec9-diagnostics #| fig-cap: "OLS regression diagnostic plots" #| fig-height: 8 par(mfrow=c(2,2)) plot(model2, which=1:4, col=pal[1], pch=16, sub.caption=paste0("Final Regression Model | n=",nrow(model_data))) par(mfrow=c(1,1)) ``` ### Assumption Checks ```{r} #| label: sec9-assumptions #| tbl-cap: "OLS assumption verification" # Normality of residuals sw <- shapiro.test(resid(model2)) bp <- bptest(model2) dw <- dwtest(model2) vif_vals <- vif(model2) assump_tbl <- tibble( Assumption = c("Normality of residuals","Homoscedasticity", "No autocorrelation","No multicollinearity"), Test = c("Shapiro-Wilk","Breusch-Pagan","Durbin-Watson","VIF"), Statistic = c(round(sw$statistic,3), round(bp$statistic,3), round(dw$statistic,3), paste0("max VIF = ", round(max(vif_vals),2))), `p-value` = c(round(sw$p.value,4), round(bp$p.value,4), round(dw$p.value,4), NA), Verdict = c( if(sw$p.value>0.05) "PASS (normal)" else "CAUTION (non-normal)", if(bp$p.value>0.05) "PASS (homoscedastic)" else "CAUTION (heteroscedastic)", if(dw$p.value>0.05) "PASS (no autocorrelation)" else "CAUTION", if(max(vif_vals)<5) "PASS (VIF<5)" else "CAUTION (multicollinearity)" ) ) kable(assump_tbl, caption = "Table 10: OLS assumption check results") |> kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |> column_spec(5, bold=TRUE, color = if_else(str_detect(assump_tbl$Verdict,"PASS"), "#375623","#C00000")) ``` ### Business Interpretation Table ```{r} #| label: sec9-business-table #| tbl-cap: "Regression drivers — personal shopper targeting guide" biz_tbl <- tidy_model |> filter(p.value < 0.10, term != "(Intercept)") |> arrange(desc(abs(estimate))) |> mutate( Direction = if_else(estimate > 0, "Positive (+)", "Negative (-)"), `Effect (pp)` = round(estimate, 2), `Targeting Implication` = case_when( str_detect(term,"income_s") ~ "Every +1 SD in income raises budget share by ~this many pp. Prioritise income-verified prospects.", str_detect(term,"freq_s") ~ "More-frequent buyers allocate more. Target clients with established international shopping habits.", str_detect(term,"spend_s") ~ "Higher per-order spend correlates with larger allocation. Premium-order clients are high-value.", str_detect(term,"Sector.*Banking") ~ "Banking & Finance allocates less than Oil & Gas on average. Adjust service proposition accordingly.", str_detect(term,"Sector.*Tech") ~ "Tech sector shows differentiated allocation vs Oil & Gas. Consider tech-focused product curation.", TRUE ~ "Significant predictor — adjust client screening accordingly." ) ) biz_tbl <- biz_tbl[, c("term","Direction","Effect (pp)","p.value","Targeting Implication")] kable(biz_tbl, col.names=c("Predictor","Direction","Effect (pp)","p-value","Targeting Implication"), caption="Table 11: Regression-based personal shopper targeting guide") |> kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |> column_spec(5, italic=TRUE) ``` **Overall model performance:** The final regression model achieves an adjusted R² of **`r round(summary(model2)$adj.r.squared,3)`**, meaning it explains approximately **`r round(summary(model2)$adj.r.squared*100,1)`%** of the variance in international shopping budget allocation. The remaining variance reflects unmeasured factors: brand loyalty, past experience, remittance access, and household composition. For a targeting model, this level of explanatory power is practically useful — a client scored highly on income and frequency should allocate significantly more to international shopping than the average respondent. ------------------------------------------------------------------------ ## Integrated Findings {#sec-findings} Across five analytical techniques, a consistent and commercially actionable story emerges. **EDA** established that the dataset is right-skewed: most respondents allocate less than 10% of income to international shopping, but a meaningful minority (particularly in high-income brackets) allocates 20–35%. Two data quality issues were identified and transparently resolved, ensuring that all downstream inference rests on reliable data. **Visualisation** translated these patterns into a targeting map. The sector distribution analysis (Plot 2) showed that median budget allocation is highest for **Healthcare, Oil & Gas, and Entrepreneur** respondents; Banking & Finance — despite being the most represented sector (n = 40) — has a comparatively low median allocation of 2.5%, with means elevated by a small number of high-spending outliers. The sector opportunity quadrant (Plot 6) reinforces this by combining budget share with purchase frequency: sectors in the top-right quadrant are the natural first targets. Black Friday and Christmas emerged as the two most commercially important seasons — when consumers are already primed to spend. **Hypothesis testing** confirmed that income bracket differences in budget allocation are statistically significant (Kruskal-Wallis, p = `r round(kw_test$p.value,4)`), that sector and shopping method show a `r if(fisher_res$p.value<0.05)"significant" else "notable"` association, and that higher-income consumers abandon purchases less frequently — making them lower-friction clients. **Correlation analysis** showed that income is the single most strongly correlated variable with both budget share and per-order spend. Purchase frequency adds independent predictive value, confirming that habitual shoppers are distinct from occasional high-spenders. **Regression** quantified these relationships: income (standardised) and purchase frequency are the dominant predictors of budget allocation. The model explains `r round(summary(model2)$adj.r.squared*100,1)`% of variance — sufficient to serve as a client-scoring engine. **Single recommendation:** A personal shopper seeking the highest-value, lowest-friction client profile should target **high-income (≥ ₦1M/month) professionals in Oil & Gas, Banking & Finance, or Technology sectors, operating in Lagos, who have made 6 or more international purchases in the last 24 months, and who plan to shop during November–December.** This segment combines high per-order spend (typically \$200–\$500+), willingness to delegate (Combination or Personal Shopper method), and lower payment abandonment rates — the three components of commercially attractive client relationships. ------------------------------------------------------------------------ ## Limitations & Further Work {#sec-limitations} 1. **Sample size and subgroup power:** The dataset of `r nrow(survey)` responses meets the recommended 100-observation minimum and supports reliable overall estimates. However, statistical power remains moderate for small subgroup comparisons — several sectors (e.g., Aviation, Mining, Logistics) have fewer than five respondents, producing wide confidence intervals for sector-level contrasts. Further survey waves targeting these underrepresented sectors would strengthen sector-level inference. 2. **Convenience sampling:** Respondents were recruited through a professional network, introducing self-selection bias. Consumers who do not shop internationally are underrepresented. The population studied is more educated and higher-earning than the average Nigerian adult — findings should not be generalised to the full Nigerian consumer market. 3. **Ordinal midpoint encoding:** Income, spend, and budget-share variables are ordinal bands. Converting them to numeric midpoints introduces measurement error at the distributional tails (e.g., "₦2.5M and above" is represented as 3,500k when the true distribution is unknown). Interval-level data would improve regression precision. 4. **Self-reported data:** All financial estimates (USD losses, budget percentages) are self-reported and subject to recall bias and social desirability effects. Validation against transaction records or fintech data (e.g., Paystack Developer API) would provide more reliable estimates. 5. **Further work:** With a larger dataset, the five techniques here could be extended to include a **logistic regression** predicting personal-shopper adoption (binary outcome), **k-means clustering** to identify empirical client archetypes, and **time-series analysis** of seasonal spend patterns using monthly transaction data. ------------------------------------------------------------------------ ## References {.unnumbered} Adi, B. (2026). *AI-powered data analytics.* Lagos Business School / markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/ Anscombe, F. J. (1973). Graphs in statistical analysis. *The American Statistician, 27*(1), 17–21. https://doi.org/10.1080/00031305.1973.10478966 R Core Team. (2024). *R: A language and environment for statistical computing* (Version 4.5). R Foundation for Statistical Computing. https://www.R-project.org/ Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software, 4*(43), 1686. https://doi.org/10.21105/joss.01686 Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis.* Springer. https://doi.org/10.1007/978-3-319-24277-4 Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). *Quarto* (Version 1.x) \[Computer software\]. https://doi.org/10.5281/zenodo.5960048 ```{r} #| label: package-citations #| echo: true # Package citations (APA 7 — use citation() output) for (pkg in c("readxl","skimr","ggcorrplot","car","lmtest","nortest", "broom","effectsize","patchwork","kableExtra","ggrepel","ppcor")) { cat("\ncitation('", pkg, "'):\n", sep="") tryCatch(print(citation(pkg), style="text"), error=function(e) cat(" [Not found]\n")) } ``` ```{r} #| label: survey-data-citation #| echo: false ``` **Survey data citation:** LBS EMBA-31 Student. (2026). *International Shopping Survey — Nigerian Consumer Behaviour Study* \[Dataset\]. Collected via Google Forms, May 2026, Nigeria. Data available on request from the author. ------------------------------------------------------------------------ ## Appendix: AI Usage Statement {.unnumbered} Claude Code (Anthropic, 2026) was used as a coding assistant throughout this analysis to help write R code for data cleaning, visualisation, hypothesis testing, correlation, and regression. Specifically, Claude Code was used to: (1) generate data-wrangling functions for the messy loss-estimate columns; (2) scaffold ggplot2 visualisation code; and (3) structure the Quarto document with appropriate chunk labels and YAML configuration. All analytical decisions — which technique to apply to which variable, how to handle missing values, which reference category to use in the regression, how to interpret p-values and effect sizes, and what business recommendation to draw — were made independently by the author. The interpretation of every result and every conclusion stated in this document reflects the author's own analytical judgement. No AI-generated text has been submitted as analysis interpretation without independent verification against the underlying data outputs.