---
title: "Nigerian Consumer International Shopping: Exploratory & Inferential Analytics"
author: "Idayat Oshodi"
date: today
format:
html:
theme: flatly
toc: true
toc-depth: 3
toc-location: left
code-fold: true
code-tools: true
code-summary: "Show code"
self-contained: true
fig-width: 10
fig-height: 6
number-sections: true
execute:
echo: true
warning: false
message: false
cache: false
---
```{r}
#| label: setup
#| include: false
# ── Package loading ──────────────────────────────────────────────────────────
suppressPackageStartupMessages({
library(readxl)
library(tidyverse)
library(skimr)
library(ggcorrplot)
library(car)
library(lmtest)
library(nortest)
library(broom)
library(effectsize)
library(patchwork)
library(scales)
library(kableExtra)
library(ggrepel)
library(GGally)
library(ppcor)
library(plotly)
library(htmltools)
})
# ── Colour palette ──────────────────────────────────────────────────────────
pal <- c("#1F3864","#C00000","#70AD47","#ED7D31","#7030A0",
"#00B0F0","#FFC000","#FF7C80","#43682B","#833C00")
# ── Data ingestion ───────────────────────────────────────────────────────────
path <- "International Shopping Survey — Nigerian Consumer Behaviour Study (Responses).xlsx"
raw <- read_excel(path)
# ── Column renaming ──────────────────────────────────────────────────────────
survey <- raw |>
rename(
timestamp = `Timestamp`,
age_group = `What is your age group?`,
gender = `What is your gender`,
city = `What city do you currently live in?`,
sector = `What is your employment sector?`,
emp_status = `What best describes your employment status?`,
income_bracket = `What is your approximate monthly income (take-home)?`,
education = `What is your highest educational qualification?`,
purchases_24m = `How many international online purchases have you made in the last 24 months?`,
pct_income_raw = `What percentage of your monthly income do you typically allocate to international shopping?`,
spend_usd_bracket = `What is your typical spend per international shopping order (in USD)?`,
categories = `Which categories do you shop for internationally? (Select all that apply)`,
platforms = `Which international platforms do you shop from or have a personal shopper buy from on your behalf? (Select all that apply)`,
shopping_months = `Which months do you shop most internationally? (Select all that apply)`,
shopping_method = `How do you usually handle your international shopping?`,
payment_method = `How do you typically pay your personal shopper or for international orders?`,
challenges = `What are the biggest challenges you face with international shopping? (Select top 3)`,
payment_abandoned = `Have you ever abandoned an international shopping attempt because of payment difficulties?`,
loss_failed_txn = `How much in USD do you estimate you have lost to Failed or reversed transactions`,
loss_customs = `How much in USD do you estimate you have lost to Customs seizures`,
loss_wrong_items = `How much in USD do you estimate you have lost to Wrong items not refunded`,
loss_fx = `How much in USD do you estimate you have lost to Exchange rate loss between order and payment`,
satisfaction = `How satisfied are you overall with your international shopping experience in Nigeria? Scale of 1 (very dissatisfied) to 5 (very satisfied)`,
improvement = `What single improvement would make you shop internationally more often?`,
frustration_text = `Describe the most frustrating international shopping experience you have had.`
)
# ── Numeric encoding helpers ─────────────────────────────────────────────────
income_map <- c(
"Below ₦150,000" = 75,
"₦150,000 – ₦299,999" = 225,
"₦300,000 – ₦499,999" = 400,
"₦500,000 – ₦999,999" = 750,
"₦1,000,000 – ₦2,499,999" = 1750,
"₦2,500,000 and above" = 3500,
"Prefer not to say" = NA_real_
)
budget_map <- c(
"None" = 0,
"Less than 5%" = 2.5,
"5–10%" = 7.5,
"11–20%" = 15.5,
"21–30%" = 25.5,
"More than 30%" = 35
)
spend_map <- c(
"Below $50" = 30,
"$50 – $99" = 75,
"$100 – $199" = 150,
"$200 – $499" = 350,
"$500 and above" = 700
)
freq_map <- c(
"None (I have tried but abandoned)" = 0,
"1 - 2" = 1.5,
"3 - 5" = 4,
"6 - 10" = 8,
"More than 10" = 12
)
# ── Loss-column cleaner ───────────────────────────────────────────────────────
clean_loss <- function(x) {
x <- as.character(x)
zero_vals <- c("none","nil","nill","nili","nik","non","$0","not applicable",
"no","no loss","nothing","nill","non","0","0.0")
na_strs <- c("na","n/a","i cant ascertain","quite a lot , cant ascertain",
"i haven't done the maths",
"quite a lot , cant ascertain , especially when the dollar was unsable",
"insignificant","100i","above $100")
out <- rep(NA_real_, length(x))
for(i in seq_along(x)) {
xi <- x[i]
if(is.na(xi)) { out[i] <- NA_real_; next }
v <- trimws(tolower(xi))
if(v %in% zero_vals) { out[i] <- 0; next }
if(v %in% na_strs) { out[i] <- NA_real_; next }
cleaned <- gsub("[^0-9.]", "", xi)
if(!is.na(cleaned) && nchar(cleaned) > 0) {
val <- suppressWarnings(as.numeric(cleaned))
out[i] <- if(!is.na(val)) val else NA_real_
}
}
out
}
# ── Build analysis data frame ─────────────────────────────────────────────────
survey_clean <- survey |>
mutate(
# Numeric conversions
income_num = income_map[income_bracket],
pct_income = budget_map[pct_income_raw], # NA for non-standard entries
spend_usd = spend_map[spend_usd_bracket],
freq_num = freq_map[purchases_24m],
loss_failed = clean_loss(loss_failed_txn),
loss_cust = clean_loss(loss_customs),
loss_items = clean_loss(loss_wrong_items),
loss_fx_num = clean_loss(loss_fx),
total_loss = rowSums(cbind(loss_failed, loss_cust, loss_items, loss_fx_num),
na.rm = TRUE),
# Ordered factors
income_f = factor(income_bracket,
levels = c("Below ₦150,000",
"₦150,000 – ₦299,999",
"₦300,000 – ₦499,999",
"₦500,000 – ₦999,999",
"₦1,000,000 – ₦2,499,999",
"₦2,500,000 and above",
"Prefer not to say"), ordered = TRUE),
spend_f = factor(spend_usd_bracket,
levels = c("Below $50","$50 – $99","$100 – $199",
"$200 – $499","$500 and above"), ordered = TRUE),
freq_f = factor(purchases_24m,
levels = c("None (I have tried but abandoned)",
"1 - 2","3 - 5","6 - 10","More than 10"), ordered = TRUE),
# Clean categorical variables
sector_clean = case_when(
sector %in% c("Banking and Financial Services","Banking") ~ "Banking & Finance",
sector %in% c("Fintech") ~ "Technology / Telecoms",
sector %in% c("Fmcg") ~ "Retail / Trading",
sector %in% c("3rd sector") ~ "Third Sector / NGO",
sector %in% c("Pensioner") ~ "Other / Retired",
sector %in% c("Food/Manufacturing")~ "Food / Manufacturing",
TRUE ~ sector
),
method_short = case_when(
shopping_method == "I use a freight forwarder with a US/UK address" ~ "Freight Forwarder",
shopping_method == "I ask a friend or family member abroad to help" ~ "Friend/Family Abroad",
shopping_method == "I try to shop directly with my own card" ~ "Direct (Own Card)",
shopping_method == "I use a combination of all three depending on the item" ~ "Combination",
shopping_method == "I use a personal shopper exclusively" ~ "Personal Shopper",
TRUE ~ shopping_method
),
# High-income flag (>= NGN 500k)
high_income = income_num >= 750 & !is.na(income_num),
# Abandoned purchase
ever_abandoned = payment_abandoned %in% c("Yes, more than once","Yes, once"),
# Season flags
black_friday = str_detect(shopping_months, "November"),
xmas_season = str_detect(shopping_months, "December"),
back_to_school = str_detect(shopping_months, "July"),
# Primary shopping category (first item listed)
primary_category = str_trim(str_split(categories, ",", simplify = TRUE)[,1]),
# Number of categories shopped
n_categories = str_count(categories, ",") + 1L
)
# ── Remove rows where outcome is missing ────────────────────────────────────
n_raw <- nrow(survey_clean)
survey_mod <- survey_clean |> filter(!is.na(pct_income))
n_mod <- nrow(survey_mod)
```
```{=html}
<!-- ══════════════════════════════════════════════════════════════════
PAGE-PER-SECTION NAVIGATION
Each ## section is shown one at a time; Previous / Next buttons
and TOC link clicks drive navigation.
══════════════════════════════════════════════════════════════════ -->
<style>
/* Hide every level-2 section; JS reveals one at a time */
section.level2 { display: none; }
section.level2.pg-active { display: block; }
/* ── Sticky nav bar ── */
#pg-nav {
position: sticky;
top: 0;
z-index: 500;
display: flex;
align-items: center;
justify-content: space-between;
background: #1F3864;
color: #fff;
padding: 10px 20px;
border-radius: 0 0 8px 8px;
margin-bottom: 28px;
box-shadow: 0 3px 10px rgba(0,0,0,0.3);
font-family: inherit;
}
#pg-nav .pg-btn {
background: #ffffff;
color: #1F3864;
border: none;
padding: 7px 20px;
border-radius: 5px;
font-weight: 700;
font-size: 13px;
cursor: pointer;
transition: background 0.15s, color 0.15s;
}
#pg-nav .pg-btn:hover:not(:disabled) { background: #dce8f7; }
#pg-nav .pg-btn:disabled { background: #6e8aab; color: #c5d3e0; cursor: default; }
#pg-counter { font-size: 13px; font-weight: 500; letter-spacing: 0.03em; opacity: 0.92; }
</style>
<script>
(function () {
function init() {
var pages = Array.from(document.querySelectorAll('section.level2'));
if (!pages.length) return;
var cur = 0;
/* ── Insert sticky nav bar before the first section ── */
var nav = document.createElement('div');
nav.id = 'pg-nav';
nav.innerHTML =
'<button class="pg-btn" id="pg-prev">← Previous</button>' +
'<span id="pg-counter"></span>' +
'<button class="pg-btn" id="pg-next">Next →</button>';
var parent = pages[0].parentElement;
parent.insertBefore(nav, pages[0]);
function closeLightbox() {
/* Close any open chart lightbox when changing pages */
var ov = document.getElementById('chart-fs-overlay');
if (ov) ov.classList.remove('active');
document.querySelectorAll('.chart-fs-wrapper.fullscreen').forEach(function (w) {
w.classList.remove('fullscreen');
var b = w.querySelector('.chart-fs-btn');
if (b) b.textContent = '⛶ Expand';
});
document.body.classList.remove('fs-open');
}
function goTo(idx) {
if (idx < 0 || idx >= pages.length) return;
closeLightbox();
pages[cur].classList.remove('pg-active');
pages[idx].classList.add('pg-active');
cur = idx;
document.getElementById('pg-counter').textContent =
'Section ' + (cur + 1) + ' of ' + pages.length;
document.getElementById('pg-prev').disabled = (cur === 0);
document.getElementById('pg-next').disabled = (cur === pages.length - 1);
/* Trigger plotly resize in case a chart just became visible */
setTimeout(function () { window.dispatchEvent(new Event('resize')); }, 150);
window.scrollTo({ top: 0, behavior: 'smooth' });
}
document.getElementById('pg-prev').addEventListener('click', function () { goTo(cur - 1); });
document.getElementById('pg-next').addEventListener('click', function () { goTo(cur + 1); });
/* ── Intercept TOC / sidebar link clicks ── */
function bindTocLinks() {
var selector = '#quarto-sidebar a[href], #TOC a[href], .sidebar-navigation a[href]';
document.querySelectorAll(selector).forEach(function (a) {
a.addEventListener('click', function (e) {
var raw = a.getAttribute('href') || '';
if (!raw.startsWith('#')) return;
var id = raw.slice(1);
/* Find which page contains this anchor */
var idx = -1;
for (var i = 0; i < pages.length; i++) {
if (pages[i].id === id) { idx = i; break; }
if (pages[i].querySelector('[id="' + id.replace(/"/g,'\\\"') + '"]')) { idx = i; break; }
}
if (idx >= 0) { e.preventDefault(); goTo(idx); }
});
});
}
bindTocLinks();
goTo(0);
}
if (document.readyState === 'loading')
document.addEventListener('DOMContentLoaded', init);
else
init();
})();
</script>
```
```{=html}
<!-- ══════════════════════════════════════════════════════════════════
GLOBAL CHART FULLSCREEN / LIGHTBOX SYSTEM
Works for BOTH plotly interactive widgets (.js-plotly-plot)
AND static ggplot2 figures (.quarto-figure).
An "⛶ Expand" button is injected into every chart.
Only one chart can be fullscreen at a time.
Close with ✕ button, Esc, or clicking the dark backdrop.
Charts restore to their exact original size on close.
══════════════════════════════════════════════════════════════════ -->
<div id="chart-fs-overlay"></div>
<style>
/* ── Dark backdrop ── */
#chart-fs-overlay {
display: none; position: fixed; inset: 0;
background: rgba(0,0,0,0.78); z-index: 9990; cursor: pointer;
}
#chart-fs-overlay.active { display: block; }
/* ── Wrapper injected around every chart ── */
.chart-fs-wrapper { position: relative; margin-bottom: 1.2rem; }
/* ── Expand / Close button ── */
.chart-fs-btn {
position: absolute; top: 6px; right: 6px; z-index: 9992;
background: #1F3864; color: #fff; border: none; border-radius: 4px;
padding: 3px 11px; font-size: 11px; font-weight: bold;
cursor: pointer; opacity: 0.82; transition: opacity 0.18s;
white-space: nowrap; letter-spacing: 0.02em;
}
.chart-fs-btn:hover { opacity: 1; background: #274d8c; }
/* ── Fullscreen container: 96 % of viewport ── */
.chart-fs-wrapper.fullscreen {
position: fixed !important;
top: 2vh !important; left: 2vw !important;
width: 96vw !important; height: 96vh !important;
z-index: 9991 !important;
background: #fff !important;
border-radius: 10px;
box-shadow: 0 16px 60px rgba(0,0,0,0.65);
padding: 48px 16px 16px 16px;
overflow: auto;
display: flex;
align-items: center;
justify-content: center;
}
/* ── Plotly widgets fill the expanded container ── */
.chart-fs-wrapper.fullscreen .js-plotly-plot,
.chart-fs-wrapper.fullscreen .html-widget-output,
.chart-fs-wrapper.fullscreen .plotly {
width: 100% !important;
height: 100% !important;
}
/* ── Static ggplot2 images scale proportionally ── */
.chart-fs-wrapper.fullscreen img.img-fluid {
max-width: 100% !important;
max-height: calc(96vh - 90px) !important;
width: auto !important;
height: auto !important;
object-fit: contain;
display: block;
margin: 0 auto;
}
/* ── Figure / caption layout inside fullscreen ── */
.chart-fs-wrapper.fullscreen figure,
.chart-fs-wrapper.fullscreen .quarto-figure {
width: 100%; height: 100%;
display: flex; flex-direction: column;
align-items: center; justify-content: center;
}
.chart-fs-wrapper.fullscreen figcaption {
font-size: 0.82em; color: #555;
margin-top: 8px; text-align: center;
}
/* ── Prevent body scroll when a chart is open ── */
body.fs-open { overflow: hidden; }
</style>
<script>
(function () {
var overlay;
/* ── Close every open chart and restore original size ── */
function closeAll() {
document.querySelectorAll('.chart-fs-wrapper.fullscreen').forEach(function (w) {
w.classList.remove('fullscreen');
var b = w.querySelector('.chart-fs-btn');
if (b) b.textContent = '⛶ Expand';
});
if (overlay) overlay.classList.remove('active');
document.body.classList.remove('fs-open');
/* Fire resize after CSS reverts so plotly redraws at original size */
setTimeout(function () { window.dispatchEvent(new Event('resize')); }, 200);
}
/* ── Wrap one element (plotly widget OR ggplot2 figure) ── */
function wrapEl(el) {
if (!el || el.closest('.chart-fs-wrapper')) return;
var wrap = document.createElement('div');
wrap.className = 'chart-fs-wrapper';
var btn = document.createElement('button');
btn.className = 'chart-fs-btn';
btn.textContent = '⛶ Expand';
el.parentNode.insertBefore(wrap, el);
wrap.appendChild(el);
wrap.insertBefore(btn, el);
btn.addEventListener('click', function (e) {
e.stopPropagation();
var isOpen = wrap.classList.contains('fullscreen');
closeAll(); /* always close others first */
if (!isOpen) {
wrap.classList.add('fullscreen');
overlay.classList.add('active');
document.body.classList.add('fs-open');
btn.textContent = '✕ Close';
/* give CSS transition time then fire resize for plotly */
setTimeout(function () { window.dispatchEvent(new Event('resize')); }, 150);
}
});
}
/* ── Scan the DOM and wrap all eligible chart elements ── */
function wrapAll(root) {
(root || document).querySelectorAll('.js-plotly-plot').forEach(wrapEl);
(root || document).querySelectorAll('.quarto-figure').forEach(wrapEl);
}
/* ── Initialise ── */
function init() {
overlay = document.getElementById('chart-fs-overlay');
if (!overlay) return;
overlay.addEventListener('click', closeAll);
document.addEventListener('keydown', function (e) {
if (e.key === 'Escape') closeAll();
});
/* Wrap charts already in the DOM */
wrapAll();
/* Watch for charts added later (plotly renders asynchronously) */
new MutationObserver(function (muts) {
muts.forEach(function (m) {
m.addedNodes.forEach(function (n) {
if (n.nodeType !== 1) return;
if (n.classList) {
if (n.classList.contains('js-plotly-plot')) wrapEl(n);
if (n.classList.contains('quarto-figure')) wrapEl(n);
}
if (n.querySelectorAll) wrapAll(n);
});
});
}).observe(document.body, { childList: true, subtree: true });
}
document.readyState === 'loading'
? document.addEventListener('DOMContentLoaded', init)
: init();
})();
</script>
```
## Executive Summary {.unnumbered}
This study analyses primary survey data collected from **`r nrow(survey)`** Nigerian consumers to answer the question: *What income, spending, and behavioural characteristics predict how much a consumer allocates to international shopping — and which sector × category × season combinations represent the highest-value, lowest-friction segments for a personal shopper to target?*
The data were collected via a structured Google Form survey administered to Nigerian professionals in May 2026. Respondents span `r length(unique(survey_clean$sector_clean))` employment sectors, six income brackets, and report purchasing internationally between 0 and more than 10 times over the previous 24 months. Five inferential techniques were applied: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression.
Key findings indicate that **monthly income is the single strongest predictor** of international shopping budget share. Consumers earning ₦1 million or more monthly allocate significantly higher proportions of income to international purchases. **Oil & Gas and Banking & Finance** professionals operating in the **Black Friday and Christmas seasons** emerge as the highest-value, lowest-friction segments. Payment friction (card declines and FX losses) remains the most prevalent barrier, disproportionately affecting lower-income brackets. The regression model explains approximately 45% of variance in budget allocation, with income bracket and purchase frequency as the dominant drivers.
------------------------------------------------------------------------
## Professional Disclosure {#sec-disclosure}
**Role and context:** I am a Business Analyst and a Personal International Shopper operating in the Financial sector with regular exposure to cross-border commerce, supply chain management, and consumer decision-making. International purchasing of goods — particularly electronics, fashion, and specialised equipment — is a recurring operational and personal activity that requires navigating payment infrastructure constraints, logistics providers, and personal shoppers. This survey was designed and administered to peers within a professional network to understand demand patterns that inform a potential personal-shopping service.
**Technique justifications:**
- **EDA** is the foundation of any reliable analysis: with survey data, cleaning decisions (how to handle non-standard responses, ordinal midpoints) materially affect every downstream result. EDA makes those decisions transparent and auditable.
- **Data Visualisation** converts multi-dimensional survey patterns into actionable client profiles. A personal shopper benefits more from a visual targeting map than from a regression table.
- **Hypothesis Testing** provides statistical rigour to observed differences: it distinguishes genuine income-driven spend patterns from sampling noise, which matters when deciding which client segments to pursue.
- **Correlation Analysis** identifies which predictors are independently informative versus redundant — critical for avoiding double-counting in a regression model built on ordinal data.
- **Linear Regression** provides a single quantitative model that translates observable client characteristics (income, frequency, sector) into a predicted budget-allocation score — the foundation of a client-scoring tool for a personal shopper.
------------------------------------------------------------------------
## Data Collection & Sampling {#sec-data-collection}
**Collection method:** A structured survey was designed and distributed via Google Forms in May 2026 to Nigerian consumers within a professional network. The link was shared through WhatsApp groups, email chains, and LinkedIn direct messages targeting working professionals aged 18–55.
**Sampling frame:** Convenience sample of Nigerian professionals with prior exposure to international online shopping. The population of interest is economically active Nigerians who have attempted or completed international purchases via platforms such as Amazon, ASOS, Shein, or through personal shoppers.
**Sample size and period:** `r nrow(survey)` completed responses were collected between `r format(min(survey$timestamp), "%d %B %Y")` and `r format(max(survey$timestamp), "%d %B %Y")`. The assignment guidelines require a minimum of 100 observations; the achieved sample of `r nrow(survey)` exceeds this threshold, providing a credible analytical base. All `r nrow(survey)` responses are genuine primary data and are used in their entirety; no rows were discarded except the `r n_raw - n_mod` with uninterpretable outcome entries in the outcome variable.
**Variables collected:** 25 variables covering demographics (age, gender, city), professional context (sector, employment status, income bracket), shopping behaviour (frequency, budget share, spend per order, categories, platforms, seasons, method), payment behaviour, friction losses, satisfaction, and open-text responses.
**Ethical notes:** No personally identifiable information (names, phone numbers, employee IDs) was collected. Participation was voluntary; the survey introduction stated that responses would be used for an academic analysis. Respondents who preferred not to disclose income were given a "Prefer not to say" option. No organisational confidential data was used.
**Data-sharing restrictions:** The dataset will be made available on request from the author; no organisational restrictions apply given the consumer-level nature of the data.
------------------------------------------------------------------------
## Data Description {#sec-data-description}
### Respondent Profile — Pie Charts
The five interactive pie charts below answer **who** the respondents are. `age_group` is *ordinal* (youngest-to-oldest order); the remaining four are *nominal* (no inherent ranking). **Hover** over any slice to see the exact label, count, and percentage. **Click** a slice to isolate it; **click the ⛶ Expand button** (top-right of each chart) to explode it to full-screen, and press **Esc** or click the overlay to close.
```{r}
#| label: sec4-pie-charts
# ── Colour palette ────────────────────────────────────────────────────────────
pie_cols <- c("#1F3864","#C00000","#70AD47","#ED7D31","#7030A0",
"#00B0F0","#FFC000","#FF7C80","#43682B","#833C00",
"#A9D18E","#9DC3E6","#FFD966","#F4B183")
# ── Helper: frequency table with top-n + "Other" rollup ─────────────────────
make_pie_df <- function(x, top_n = NULL) {
x <- as.character(x)
x <- x[!is.na(x) & nchar(trimws(x)) > 0]
tbl <- sort(table(x), decreasing = TRUE)
labels <- names(tbl); counts <- as.integer(tbl)
if (!is.null(top_n) && length(counts) > top_n) {
others <- sum(counts[(top_n + 1):length(counts)])
labels <- c(labels[seq_len(top_n)], "Other")
counts <- c(counts[seq_len(top_n)], others)
}
df <- data.frame(label = labels, n = counts, stringsAsFactors = FALSE)
df$pct <- round(df$n / sum(df$n) * 100, 1)
df
}
# ── Build data for each pie ───────────────────────────────────────────────────
d_age <- make_pie_df(survey_clean$age_group)
d_gender <- make_pie_df(survey_clean$gender)
d_emp <- make_pie_df(survey_clean$emp_status)
d_city <- make_pie_df(survey_clean$city, top_n = 7)
d_sector <- make_pie_df(survey_clean$sector_clean, top_n = 8)
pie_data <- list(d_age, d_gender, d_emp, d_city, d_sector)
pie_titles <- c("Age Group", "Gender", "Employment Status",
"City (Top 7 + Other)", "Employment Sector (Top 8 + Other)")
# ── Domain grid: 3 charts on top row, 2 centred on bottom row ────────────────
dom_x <- list(c(0.01,0.31), c(0.35,0.65), c(0.69,0.99),
c(0.13,0.47), c(0.53,0.87))
dom_y <- list(c(0.54,1.00), c(0.54,1.00), c(0.54,1.00),
c(0.00,0.46), c(0.00,0.46))
ann_x <- sapply(dom_x, mean)
ann_y <- c(1.01, 1.01, 1.01, 0.47, 0.47)
# ── Build multi-trace plotly figure ──────────────────────────────────────────
fig_pie <- plot_ly(height = 720)
for (i in seq_along(pie_data)) {
df <- pie_data[[i]]
cols <- as.list(pie_cols[seq_len(nrow(df))])
fig_pie <- fig_pie |>
add_pie(
data = df,
labels = ~label,
values = ~n,
name = pie_titles[i],
textposition = "inside",
textinfo = "percent",
hovertemplate = paste0(
"<b>%{label}</b><br>Count: %{value}<br>Share: %{percent}<extra></extra>"
),
domain = list(x = dom_x[[i]], y = dom_y[[i]]),
marker = list(colors = cols,
line = list(color = "white", width = 2)),
showlegend = FALSE
)
}
# ── Subtitle annotations (chart titles above each pie) ───────────────────────
ann_list <- lapply(seq_along(pie_titles), function(i) {
list(x = ann_x[i], y = ann_y[i],
text = paste0("<b>", pie_titles[i], "</b>"),
xref = "paper", yref = "paper",
showarrow = FALSE, xanchor = "center", yanchor = "bottom",
font = list(size = 11, color = "#1F3864"))
})
fig_pie |>
plotly::layout(
title = list(
text = paste0(
"<b>Who Are the Respondents?</b>",
"<span style='font-size:12px;color:#777'>",
" n = ", nrow(survey_clean),
" respondents | Hover for counts | Click slices to filter</span>"
),
font = list(size = 15, color = "#1F3864"), x = 0.5
),
annotations = ann_list,
paper_bgcolor = "white",
plot_bgcolor = "white",
margin = list(t = 80, b = 10, l = 10, r = 10)
)
```
------------------------------------------------------------------------
### Variable Summary Table
The table below consolidates all key descriptive statistics into a single readable summary. Each row is one numeric variable derived from the survey; columns show sample size, completeness, central tendency, spread, range, and quartiles.
```{r}
#| label: sec4-summary-table
#| tbl-cap: "Table 1: Descriptive statistics for all key numeric variables"
# ── Human-readable variable labels ───────────────────────────────────────────
var_meta <- data.frame(
col = c("income_num","pct_income","spend_usd",
"freq_num","satisfaction","total_loss"),
Label = c("Monthly Income (₦ '000)",
"Budget Share — Intl Shopping (%)",
"Typical Order Spend (USD)",
"Purchases in Last 24 Months (n)",
"Overall Satisfaction (1–5)",
"Total Estimated USD Losses"),
Source = c("Income bracket midpoint",
"Budget-share band midpoint",
"Spend-band midpoint",
"Frequency-band midpoint",
"Direct survey rating",
"Sum of 4 cleaned loss columns"),
stringsAsFactors = FALSE
)
# ── Build summary row for each variable ──────────────────────────────────────
sum_rows <- lapply(var_meta$col, function(nm) {
v <- survey_clean[[nm]]
v_c <- v[!is.na(v)]
n_valid <- length(v_c)
n_missing <- sum(is.na(v))
pct_complete <- round(n_valid / nrow(survey_clean) * 100, 1)
data.frame(
col = nm,
N_Valid = n_valid,
N_Missing = n_missing,
Pct_Complete = pct_complete,
Mean = round(mean(v_c), 2),
Median = round(median(v_c), 2),
SD = round(sd(v_c), 2),
Min = round(min(v_c), 2),
Q1 = round(quantile(v_c, 0.25), 2),
Q3 = round(quantile(v_c, 0.75), 2),
Max = round(max(v_c), 2),
stringsAsFactors = FALSE
)
})
sum_tbl <- do.call(rbind, sum_rows)
# ── Merge labels ──────────────────────────────────────────────────────────────
sum_tbl <- merge(var_meta, sum_tbl, by = "col")
sum_tbl <- sum_tbl[match(var_meta$col, sum_tbl$col), ] # preserve order
# ── Render ───────────────────────────────────────────────────────────────────
kable(
sum_tbl[, c("Label","Source","N_Valid","N_Missing","Pct_Complete",
"Mean","Median","SD","Min","Q1","Q3","Max")],
col.names = c("Variable","Derived From",
"Valid n","Missing n","Complete %",
"Mean","Median","Std Dev",
"Min (p0)","Q1 (p25)","Q3 (p75)","Max (p100)"),
caption = paste0("Table 1: Descriptive statistics for all key numeric variables (n = ", nrow(survey_clean), " respondents)"),
align = c("l","l","c","c","c","r","r","r","r","r","r","r")
) |>
kable_styling(
bootstrap_options = c("striped","hover","condensed","bordered"),
full_width = TRUE,
font_size = 13
) |>
row_spec(0,
bold = TRUE,
color = "white",
background = "#1F3864") |>
row_spec(1, background = "#EEF2F7") |> # income_num
row_spec(2, background = "#FEF2F2") |> # pct_income (outcome — highlight)
row_spec(3, background = "#EEF2F7") |>
row_spec(4, background = "#FFFFFF") |>
row_spec(5, background = "#EEF2F7") |>
row_spec(6, background = "#FFFFFF") |>
column_spec(1, bold = TRUE, width = "18em") |>
column_spec(2, italic = TRUE, color = "#555555", width = "16em") |>
column_spec(4,
color = ifelse(sum_tbl$N_Missing > 0, "#C00000", "#375623"),
bold = ifelse(sum_tbl$N_Missing > 0, TRUE, FALSE)) |>
footnote(
general = paste0(
"Row highlighted in red = outcome variable (pct_income). ",
"Red Missing n = variable has missing values. ",
"All numeric values derived from ordinal survey bands using band midpoints ",
"(see Section 3 — Data Collection for encoding decisions)."
),
general_title = "Notes: "
)
```
The outcome variable `pct_income` (highlighted in red) is missing for **`r sum(is.na(survey_clean$pct_income))`** respondents who gave non-interpretable free-text responses ("0.9", "Depend need(s)", "As required"). These rows are excluded from the regression and correlation analyses but retained in all descriptive outputs.
------------------------------------------------------------------------
### Distribution of Key Variables
The panel below shows the full distribution of each variable — bars show how many respondents fall into each value range, overlaid with a smooth density curve. The dashed red line marks the **median** for each variable, giving an instant read on the typical respondent.
```{r}
#| label: sec4-histograms
#| fig-cap: "Figure 1: Distribution of all six key numeric variables (hover for values · drag to zoom · double-click to reset)"
# ── Data in long format for faceting ─────────────────────────────────────────
hist_vars <- survey_clean[, c("income_num","pct_income","spend_usd",
"freq_num","satisfaction","total_loss")]
# Build long data frame manually (no pivot_longer to avoid select conflict)
hist_long <- do.call(rbind, lapply(names(hist_vars), function(nm) {
v <- hist_vars[[nm]]
data.frame(variable = nm, value = v[!is.na(v)], stringsAsFactors = FALSE)
}))
# Readable facet labels
facet_labels <- c(
income_num = "Monthly Income (₦ '000)",
pct_income = "Budget Share — Intl Shopping (% of income)",
spend_usd = "Typical Order Spend (USD)",
freq_num = "Purchases in Last 24 Months (n)",
satisfaction = "Overall Satisfaction \n(1 = very dissatisfied, 5 = very satisfied)",
total_loss = "Total Estimated USD Losses"
)
hist_long$facet <- factor(facet_labels[hist_long$variable],
levels = facet_labels)
# Median lines per facet
med_lines <- do.call(rbind, lapply(names(hist_vars), function(nm) {
v <- hist_vars[[nm]]
data.frame(variable = nm,
facet = facet_labels[nm],
med = median(v, na.rm = TRUE),
stringsAsFactors = FALSE)
}))
med_lines$facet <- factor(med_lines$facet, levels = facet_labels)
# Per-variable fill colours (one per panel)
fill_cols <- setNames(
c("#1F3864","#C00000","#ED7D31","#70AD47","#7030A0","#00B0F0"),
facet_labels
)
# ── Plot ──────────────────────────────────────────────────────────────────────
p_hist <- ggplot(hist_long, aes(x = value, fill = facet)) +
geom_histogram(aes(y = after_stat(density)),
bins = 12,
colour = "white",
linewidth = 0.4,
alpha = 0.80) +
geom_density(colour = "grey20",
linewidth = 0.8,
adjust = 1.2) +
geom_vline(data = med_lines,
aes(xintercept = med,
text = paste0("Median: ", round(med, 1))),
colour = "#C00000",
linetype = "dashed",
linewidth = 0.9) +
scale_fill_manual(values = fill_cols, guide = "none") +
facet_wrap(~ facet,
scales = "free",
ncol = 2,
labeller = label_value) +
labs(
title = "Distribution of Key Survey Variables",
subtitle = paste0(
"Bars = histogram density | Curve = smoothed density | ",
"Dashed red line = median | ★ Outcome in red"
),
x = "Value",
y = "Density"
) +
theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", colour = "#1F3864",
size = 15, margin = margin(b = 4)),
plot.subtitle = element_text(colour = "#555555", size = 10,
margin = margin(b = 10)),
strip.text = element_text(face = "bold", colour = "white",
size = 9),
strip.background = element_rect(fill = "#1F3864", colour = NA),
panel.grid.minor = element_blank(),
panel.border = element_rect(colour = "#DDDDDD", fill = NA,
linewidth = 0.5),
axis.title = element_text(colour = "#333333", size = 10),
plot.margin = margin(12, 12, 12, 12)
)
# ── Wrap as interactive plotly ────────────────────────────────────────────────
ggplotly(p_hist, height = 900, tooltip = c("x", "y", "text")) |>
plotly::layout(
hoverlabel = list(bgcolor = "white", font = list(size = 11)),
margin = list(t = 80, b = 40, l = 50, r = 30)
)
```
------------------------------------------------------------------------
### Categorical & Behavioural Variable Inventory
The table below documents all **eleven categorical variables** charted in this section — five in the pie charts above and six in the bar charts below. The **Data Type** column classifies each variable by measurement level; **Response Type** shows whether a respondent could select one or multiple answers. Understanding this matters for interpreting bar-chart counts: multi-select bars can sum to more than the total number of respondents.
```{r}
#| label: sec4-var-table
#| tbl-cap: "Table 2: Categorical variables visualised in this section"
var_desc <- data.frame(
R_Variable = c("age_group","gender","city","sector_clean","emp_status",
"categories","platforms","shopping_months",
"shopping_method","payment_method","improvement"),
Survey_Question = c(
"What is your age group?",
"What is your gender?",
"What city do you currently live in?",
"What is your employment sector?",
"What best describes your employment status?",
"Which categories do you shop for internationally? (select all that apply)",
"Which international platforms do you shop from? (select all that apply)",
"Which months do you shop most internationally? (select all that apply)",
"How do you usually handle your international shopping?",
"How do you typically pay your personal shopper or for international orders?",
"What single improvement would make you shop internationally more often?"
),
Data_Type = c("Ordinal","Nominal","Nominal","Nominal","Nominal",
"Nominal","Nominal","Nominal","Nominal","Nominal","Nominal"),
Response = c("Single-select","Single-select","Single-select",
"Single-select","Single-select",
"Multi-select","Multi-select","Multi-select",
"Single-select","Single-select","Single-select"),
Chart = c("Pie","Pie","Pie","Pie","Pie",
"Bar","Bar","Bar","Bar","Bar","Bar"),
stringsAsFactors = FALSE
)
kable(
var_desc,
col.names = c("R Variable Name","Survey Question",
"Data Type","Response Type","Chart Used"),
caption = "Table 2: Categorical and behavioural variables described in this section",
align = c("l","l","c","c","c")
) |>
kable_styling(
bootstrap_options = c("striped","hover","condensed","bordered"),
full_width = TRUE,
font_size = 13
) |>
row_spec(0, bold = TRUE, color = "white", background = "#1F3864") |>
column_spec(1, bold = TRUE, monospace = TRUE, color = "#1F3864") |>
column_spec(3, italic = TRUE, color = "#555555") |>
row_spec(c(6, 7, 8), background = "#FFF3CD") |>
footnote(
general = paste0(
"Rows highlighted in amber = multi-select variables: one respondent may contribute ",
"multiple values, so bar-chart counts can exceed n = ", nrow(survey_clean), ". ",
"sector_clean is a cleaned version of the raw sector column (Banking variants merged)."
),
general_title = "Notes: "
)
```
------------------------------------------------------------------------
### Shopping Behaviour — Bar Charts
The six interactive charts below answer **how** respondents shop internationally. All six variables are *nominal* (no natural ranking). **Hover** over any bar for the exact count; **drag** to zoom in on any region; **double-click** to reset the view.
> **Multi-select note:** Categories, Platforms, and Months are "select all that apply" questions — one respondent may be counted in multiple bars, so totals across bars exceed `r nrow(survey_clean)`.
```{r}
#| label: sec4-bar-charts
bar_pal <- c("#1F3864","#C00000","#70AD47","#ED7D31","#7030A0","#00B0F0")
# ── Helper: expand multi-select and count ─────────────────────────────────────
split_count <- function(x, top_n = 12) {
items <- unlist(strsplit(as.character(x[!is.na(x) & x != ""]), ",\\s*"))
items <- trimws(items[nchar(trimws(items)) > 0])
tbl <- sort(table(items), decreasing = TRUE)
labels <- names(tbl); counts <- as.integer(tbl)
if (length(counts) > top_n) { labels <- labels[seq_len(top_n)]; counts <- counts[seq_len(top_n)] }
data.frame(label = labels, n = counts, stringsAsFactors = FALSE)
}
# ── Helper: single-select count with "Other" rollup ───────────────────────────
single_count <- function(x, top_n = 12) {
x <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0])
tbl <- sort(table(x), decreasing = TRUE)
labels <- names(tbl); counts <- as.integer(tbl)
if (length(counts) > top_n) {
others <- sum(counts[(top_n + 1):length(counts)])
labels <- c(labels[seq_len(top_n)], "Other")
counts <- c(counts[seq_len(top_n)], others)
}
data.frame(label = labels, n = counts, stringsAsFactors = FALSE)
}
# ── Helper: build one native plotly horizontal bar chart ─────────────────────
make_plotly_bar <- function(df, title, fill_col, x_note = "Count") {
df$lw <- stringr::str_wrap(df$label, width = 30)
# order ascending so largest bar is at the top of the y-axis
df <- df[order(df$n, decreasing = FALSE), ]
df$lw <- factor(df$lw, levels = unique(df$lw))
plot_ly(
df,
x = ~n,
y = ~lw,
type = "bar",
orientation = "h",
marker = list(color = fill_col,
opacity = 0.87,
line = list(color = "white", width = 0.5)),
text = ~paste0("<b>", label, "</b><br>Count: ", n),
hoverinfo = "text",
showlegend = FALSE
) |>
plotly::layout(
title = list(text = paste0("<b>", title, "</b>"),
font = list(size = 12, color = "#1F3864"),
x = 0.02, xanchor = "left"),
xaxis = list(title = x_note,
gridcolor = "#DDDDDD",
zeroline = FALSE,
tickfont = list(size = 10)),
yaxis = list(title = "",
tickfont = list(size = 9),
automargin = TRUE),
height = 400,
paper_bgcolor = "white",
plot_bgcolor = "#FAFAFA",
margin = list(l = 10, r = 30, t = 55, b = 30)
)
}
# ── Build the six bar charts ──────────────────────────────────────────────────
b_cat <- make_plotly_bar(
split_count(survey_clean$categories, top_n = 10),
"1. Categories Shopped Internationally (Top 10 — multi-select)",
bar_pal[1], x_note = "Respondents mentioning"
)
b_plat <- make_plotly_bar(
split_count(survey_clean$platforms, top_n = 10),
"2. Platforms Used (Top 10 — multi-select)",
bar_pal[2], x_note = "Respondents mentioning"
)
b_mon <- make_plotly_bar(
split_count(survey_clean$shopping_months),
"3. Months of Peak International Shopping (multi-select)",
bar_pal[3], x_note = "Respondents mentioning"
)
b_meth <- make_plotly_bar(
single_count(survey_clean$method_short),
"4. How Shopping is Handled (single-select)",
bar_pal[4]
)
b_pay <- make_plotly_bar(
single_count(survey_clean$payment_method, top_n = 8),
"5. Payment Method (Top 8 — single-select)",
bar_pal[5]
)
b_imp <- make_plotly_bar(
single_count(survey_clean$improvement, top_n = 10),
"6. What Would Drive More Shopping (Top 10 — single-select)",
bar_pal[6]
)
# ── Render all six charts stacked (full-width, individually interactive) ──────
htmltools::tagList(b_cat, b_plat, b_mon, b_meth, b_pay, b_imp)
```
------------------------------------------------------------------------
## Technique 1 — Exploratory Data Analysis {#sec-eda}
::: {.callout-note icon="false"}
#### 📚 Theory Recap — Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an analytic philosophy introduced by John Tukey (1977) that advocates examining data through graphical and numerical summaries *before* imposing statistical models. The guiding principle is to let the data "speak" — uncovering distributions, outliers, and structural patterns without prior hypotheses — so that subsequent modelling decisions rest on observed reality rather than untested assumptions.
**Key concepts:**
| Concept | What it measures | Why it matters |
|----|----|----|
| Central tendency (mean, median) | Where data cluster | Median is robust to skew; mean is sensitive to outliers |
| Dispersion (SD, IQR, range) | How spread out values are | Large SD relative to mean signals high heterogeneity |
| Shape (skewness, kurtosis) | Asymmetry and tail weight | Right-skewed income data may need transformation before parametric tests |
| Missing-value audit | Completeness of each variable | Informs whether to impute, exclude, or flag missing cases |
Anscombe's Quartet (Anscombe, 1973) provides the canonical demonstration of why visual inspection is indispensable: four datasets with *identical* means, variances, and correlations exhibit completely different structures when plotted, proving that numerical summaries alone can be deeply misleading.
:::
> **Technique justification:** EDA is the mandatory first step before any inferential work. Chapter 9 of Adi (2026) emphasises that summary statistics and distributions must be understood before model parameters can be meaningfully interpreted. With a survey dataset containing ordinal income brackets, multi-select categories, and free-text loss estimates, a thorough EDA prevents misspecified models and unreliable inference.
### Summary Statistics
```{r}
#| label: sec5-sumstats
#| tbl-cap: "Descriptive statistics for key numeric variables"
num_vars_list <- list(
income_num = survey_clean$income_num,
pct_income = survey_clean$pct_income,
spend_usd = survey_clean$spend_usd,
freq_num = survey_clean$freq_num,
satisfaction= survey_clean$satisfaction,
total_loss = survey_clean$total_loss
)
skew_pearson <- function(v) {
v <- v[!is.na(v)]
if(length(v)<3) return(NA_real_)
round((mean(v) - median(v)) / sd(v), 2)
}
num_summary2 <- do.call(rbind, lapply(names(num_vars_list), function(nm) {
v <- num_vars_list[[nm]]
data.frame(Variable=nm, n=sum(!is.na(v)),
Mean=round(mean(v,na.rm=TRUE),2),
Median=round(median(v,na.rm=TRUE),2),
SD=round(sd(v,na.rm=TRUE),2),
Min=round(min(v,na.rm=TRUE),2),
Max=round(max(v,na.rm=TRUE),2),
Skew=skew_pearson(v),
stringsAsFactors=FALSE)
}))
kable(num_summary2,
caption = "Table 2: Descriptive statistics for key numeric variables") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = FALSE)
```
### Summary Statistics for Categorical Variables
For categorical variables, descriptive analysis centres on **frequency counts** and **percentage distributions** rather than means and standard deviations. The appropriate measure of central tendency is the **mode** — the most frequently occurring category. The table below provides an at-a-glance overview of all eleven key categorical variables; detailed frequency breakdowns follow.
```{r}
#| label: sec5-cat-overview
#| tbl-cap: "Table 3: Categorical variable overview — mode, completeness, and category count"
# ── Helper: compute mode statistics for one categorical column ────────────────
cat_stats <- function(nm, label, df) {
x <- df[[nm]]
x_c <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0])
tbl <- sort(table(x_c), decreasing = TRUE)
mode_val <- if (length(tbl) > 0) names(tbl)[1] else "—"
mode_n <- if (length(tbl) > 0) as.integer(tbl[1]) else 0L
mode_pct <- if (mode_n > 0) round(mode_n / nrow(df) * 100, 1) else 0
data.frame(
Label = label,
N_Valid = length(x_c),
N_Miss = sum(is.na(x)),
N_Cat = length(unique(x_c)),
Mode = mode_val,
Mode_n = mode_n,
Mode_pct = paste0(mode_pct, "%"),
stringsAsFactors = FALSE
)
}
cat_meta <- data.frame(
col = c("age_group","gender","city","sector_clean","emp_status",
"education","income_bracket","purchases_24m",
"method_short","payment_method","payment_abandoned"),
Label = c("Age Group","Gender","City of Residence",
"Employment Sector","Employment Status","Education Level",
"Income Bracket","International Purchases (last 24 mo.)",
"Shopping Method","Payment Method","Ever Abandoned a Purchase?"),
stringsAsFactors = FALSE
)
cat_summary <- do.call(rbind, lapply(
seq_len(nrow(cat_meta)),
function(i) cat_stats(cat_meta$col[i], cat_meta$Label[i], survey_clean)
))
kable(
cat_summary,
col.names = c("Variable","Valid n","Missing","# Categories",
"Mode (most common response)","Mode count","Mode %"),
caption = paste0("Table 3: Categorical variable overview — mode and completeness",
" (n = ", nrow(survey_clean), " respondents)"),
align = c("l","c","c","c","l","c","c")
) |>
kable_styling(
bootstrap_options = c("striped","hover","condensed","bordered"),
full_width = TRUE, font_size = 13
) |>
row_spec(0, bold = TRUE, color = "white", background = "#1F3864") |>
column_spec(1, bold = TRUE, color = "#1F3864") |>
column_spec(5, italic = TRUE, color = "#444444") |>
footnote(
general = paste0(
"Mode = most frequently occurring response. ",
"Missing = blank or NA entries excluded from frequency calculations. ",
"sector_clean merges 'Banking' and 'Banking and Financial Services'."
),
general_title = "Notes: "
)
```
#### Demographic & Employment Profile — Detailed Frequencies
The table below gives the full frequency distribution for the five demographic and employment variables, grouped by variable. Percentages are calculated against the total sample of **`r nrow(survey_clean)`** respondents.
```{r}
#| label: sec5-cat-demo
#| tbl-cap: "Table 4: Frequency distributions — demographic and employment profile"
# ── Helper: frequency data frame for one variable ────────────────────────────
make_freq_df <- function(x, n_total, top_n = NULL, other_label = "All others") {
x_c <- as.character(x[!is.na(x) & nchar(trimws(as.character(x))) > 0])
tbl <- sort(table(x_c), decreasing = TRUE)
labels <- names(tbl); counts <- as.integer(tbl)
if (!is.null(top_n) && length(counts) > top_n) {
others <- sum(counts[(top_n + 1):length(counts)])
labels <- c(labels[seq_len(top_n)], other_label)
counts <- c(counts[seq_len(top_n)], others)
}
data.frame(
Category = labels,
Count = counts,
Pct = paste0(round(counts / n_total * 100, 1), "%"),
stringsAsFactors = FALSE
)
}
N <- nrow(survey_clean)
demo_vars <- list(
"Age Group" = survey_clean$age_group,
"Gender" = survey_clean$gender,
"Employment Status" = survey_clean$emp_status,
"Education Level" = survey_clean$education,
"Employment Sector" = survey_clean$sector_clean
)
# ── Build combined long data frame ────────────────────────────────────────────
demo_rows <- data.frame(Category = character(), Count = integer(),
Pct = character(), stringsAsFactors = FALSE)
grp_label <- character()
grp_start <- integer()
grp_end <- integer()
cursor <- 1L
for (nm in names(demo_vars)) {
df <- make_freq_df(demo_vars[[nm]], N)
grp_label <- c(grp_label, nm)
grp_start <- c(grp_start, cursor)
grp_end <- c(grp_end, cursor + nrow(df) - 1L)
demo_rows <- rbind(demo_rows, df)
cursor <- cursor + nrow(df)
}
# ── Render with pack_rows grouping ────────────────────────────────────────────
k_demo <- kable(
demo_rows,
col.names = c("Category", "Count", "% of sample"),
align = c("l", "c", "c"),
caption = "Table 4: Demographic and employment frequency distributions"
) |>
kable_styling(
bootstrap_options = c("striped","hover","condensed","bordered"),
full_width = TRUE, font_size = 13
) |>
row_spec(0, bold = TRUE, color = "white", background = "#1F3864")
for (i in seq_along(grp_label)) {
k_demo <- k_demo |>
pack_rows(grp_label[i], grp_start[i], grp_end[i],
bold = TRUE, color = "white", background = "#2C5F9E",
label_row_css = "border-top: 2px solid #1F3864;")
}
k_demo
```
#### Geographic & Shopping Behaviour Profile — Detailed Frequencies
```{r}
#| label: sec5-cat-behav
#| tbl-cap: "Table 5: Frequency distributions — geographic and shopping behaviour profile"
behav_vars <- list(
"City of Residence (Top 8)" = list(x = survey_clean$city,
top_n = 8,
other = "All other cities"),
"Income Bracket" = list(x = survey_clean$income_bracket,
top_n = NULL, other = NULL),
"Intl. Purchases — Last 24 Months" = list(x = survey_clean$purchases_24m,
top_n = NULL, other = NULL),
"Shopping Method" = list(x = survey_clean$method_short,
top_n = NULL, other = NULL),
"Payment Method" = list(x = survey_clean$payment_method,
top_n = NULL, other = NULL),
"Ever Abandoned a Purchase?" = list(x = survey_clean$payment_abandoned,
top_n = NULL, other = NULL)
)
behav_rows <- data.frame(Category = character(), Count = integer(),
Pct = character(), stringsAsFactors = FALSE)
grp_label2 <- character()
grp_start2 <- integer()
grp_end2 <- integer()
cursor2 <- 1L
for (nm in names(behav_vars)) {
v <- behav_vars[[nm]]
df <- make_freq_df(v$x, N, top_n = v$top_n,
other_label = if (!is.null(v$other)) v$other else "All others")
grp_label2 <- c(grp_label2, nm)
grp_start2 <- c(grp_start2, cursor2)
grp_end2 <- c(grp_end2, cursor2 + nrow(df) - 1L)
behav_rows <- rbind(behav_rows, df)
cursor2 <- cursor2 + nrow(df)
}
k_behav <- kable(
behav_rows,
col.names = c("Category", "Count", "% of sample"),
align = c("l", "c", "c"),
caption = "Table 5: Geographic and shopping behaviour frequency distributions"
) |>
kable_styling(
bootstrap_options = c("striped","hover","condensed","bordered"),
full_width = TRUE, font_size = 13
) |>
row_spec(0, bold = TRUE, color = "white", background = "#1F3864")
for (i in seq_along(grp_label2)) {
k_behav <- k_behav |>
pack_rows(grp_label2[i], grp_start2[i], grp_end2[i],
bold = TRUE, color = "white", background = "#2C5F9E",
label_row_css = "border-top: 2px solid #1F3864;")
}
k_behav
```
### Data Quality Issues Identified and Resolved
```{r}
#| label: sec5-quality
#| tbl-cap: "Data quality issues and resolutions"
# Compute non-standard outcome entries dynamically
ns_vals <- sort(unique(survey_clean$pct_income_raw[is.na(survey_clean$pct_income)]))
n_ns <- length(ns_vals)
ns_pct <- round(n_ns / nrow(survey_clean) * 100, 1)
ns_list <- paste0('"', ns_vals, '"', collapse = ", ")
quality_tbl <- tibble(
Issue = c(
"Non-standard outcome responses",
"Loss columns contain free-text",
"Duplicate sector labels",
"Income bracket 'Prefer not to say'",
"Multi-select columns (categories, seasons)"
),
Detail = c(
paste0(n_ns, " responses (", ns_list, ") in pct_income_raw"),
"Values like 'Nil', '$8', 'i cant ascertain', 'USD90' require regex parsing",
"'Banking' and 'Banking and Financial Services' denote the same sector",
"Cannot be converted to a numeric midpoint",
"Cannot be used as-is in regression; first-listed category extracted"
),
Resolution = c(
paste0("Treated as NA; excluded from modelling (n=", n_ns, ", ", ns_pct, "% of sample)"),
"Zero-synonyms mapped to 0; unquantifiable entries mapped to NA",
"Both recoded to 'Banking & Finance' in sector_clean",
"Mapped to NA in income_num; excluded from numeric analyses",
"Primary category = first comma-delimited entry; binary season flags created"
)
)
kable(quality_tbl,
caption = "Table 3: Data quality issues and resolutions") |>
kable_styling(bootstrap_options = c("striped","hover","condensed"),
full_width = TRUE) |>
column_spec(1, bold = TRUE)
```
::: callout-note
**Data Quality Summary**
Two primary issues were identified:
1. **Non-standard outcome variable entries (`r n_ns` rows, `r ns_pct`%):** `r n_ns` respondents entered free text (`r ns_list`) instead of selecting a percentage band. These were treated as missing values and excluded from the regression and correlation analyses. The remaining `r n_mod` rows with valid outcome responses form the modelling dataset.
2. **Free-text loss estimates (multiple columns):** The four USD-loss columns contained a mix of numeric values, nil-synonyms, and unquantifiable text. A cleaning function normalised nil-synonyms to 0 and converted numeric strings (including "\$8", "USD90") to numeric. Entries that could not be reliably converted were coded as NA and are not included in loss-related analyses.
:::
### Distribution of the Outcome Variable
```{r}
#| label: sec5-outcome-dist
#| fig-cap: "Distribution of international shopping budget share (% of monthly income)"
#| fig-height: 5
p_hist <- survey_mod |>
ggplot(aes(x = pct_income)) +
geom_histogram(aes(y = after_stat(density)), binwidth = 5,
fill = pal[1], colour = "white", alpha = 0.85) +
geom_density(colour = pal[2], linewidth = 1) +
stat_function(fun = dnorm,
args = list(mean = mean(survey_mod$pct_income, na.rm=TRUE),
sd = sd(survey_mod$pct_income, na.rm=TRUE)),
colour = pal[3], linetype = "dashed", linewidth = 1) +
scale_x_continuous(labels = function(x) paste0(x, "%")) +
labs(title = "Outcome Variable: % of Monthly Income Allocated to International Shopping",
subtitle = paste0("n = ", n_mod, " | Mean = ",
round(mean(survey_mod$pct_income),1), "% | Median = ",
round(median(survey_mod$pct_income),1), "%"),
x = "% of Monthly Income", y = "Density",
caption = "Solid curve = kernel density; dashed = normal overlay") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face = "bold", colour = pal[1]))
p_box <- survey_mod |>
ggplot(aes(y = pct_income)) +
geom_boxplot(fill = pal[1], alpha = 0.7, colour = pal[1], outlier.colour = pal[2],
outlier.size = 3) +
scale_y_continuous(labels = function(x) paste0(x, "%")) +
labs(title = "Boxplot with Outlier Detection",
y = "% of Monthly Income",
caption = "Red points = IQR outliers") +
theme_minimal(base_size = 12) +
theme(axis.text.x = element_blank(), plot.title = element_text(face="bold", colour=pal[1]))
p_hist + p_box + plot_layout(widths = c(3,1))
```
The outcome variable is **right-skewed**: the majority of respondents allocate less than 10% of their monthly income to international shopping, but a tail of high-spenders allocates 20–35%. The normal overlay (dashed green) confirms departure from normality, which motivates the use of non-parametric tests in Section 7.
### Anscombe's Quartet — Why EDA Must Precede Modelling
Anscombe's Quartet (Anscombe, 1973) demonstrates that four datasets with identical summary statistics (same mean, variance, and correlation) can exhibit radically different data structures. In this survey context, two income brackets could have identical mean budget allocations yet differ entirely in their distributional shape — one roughly normal, one bimodal (high-earners split between minimal shoppers and power buyers). Fitting a regression model without first visualising distributions risks building on a misleading foundation.
------------------------------------------------------------------------
## Technique 2 — Data Visualisation {#sec-viz}
::: {.callout-note icon="false"}
#### 📚 Theory Recap — Data Visualisation
Data visualisation is the graphical encoding of data into visual objects — points, bars, lines, areas — so that patterns, comparisons, and anomalies become perceptible to the human eye. The theoretical foundations lie in Bertin's *Semiology of Graphics* (1967), which classified visual variables (position, size, colour, shape, orientation, texture) by their perceptual properties, and Tufte's principle of maximising the *data-ink ratio* — every drop of ink on a chart should encode information, not decoration (*The Visual Display of Quantitative Information*, 1983).
Wilkinson's *Grammar of Graphics* (1999), implemented in R's `ggplot2` by Wickham (2016), provides a compositional framework: any chart is built by mapping variables to **aesthetics** (x, y, colour, size, fill), applying a **geometric object** (geom_bar, geom_point, geom_line), setting **scales** (continuous, discrete, log), and choosing a **coordinate system**. This grammar makes the link between data structure and chart type explicit and systematic.
**Choosing the right chart type:**
| Data question | Appropriate chart | Why |
|----|----|----|
| Distribution of one variable | Histogram / density plot | Shows shape, spread, and modality |
| Comparison across groups | Bar chart / box plot | Aligns values on a common axis for easy comparison |
| Relationship between two continuous variables | Scatter plot | Reveals direction, strength, and outliers |
| Part-to-whole composition | Pie chart (≤ 6 slices) / stacked bar | Communicates proportional share |
| Trend over time | Line chart | The connected line encodes continuity |
:::
> **Technique justification:** Chapter 10 of Adi (2026) establishes that the grammar of graphics provides a systematic framework for choosing the right chart type for the data structure and question at hand. For a personal shopper, visualisations that map income × sector × spend into a 2-D targeting space are operationally more actionable than model coefficients alone.
### Five-Plot Narrative: "Who Shops Internationally, How Much, and When"
```{r}
#| label: sec6-plot1
#| fig-cap: "Plot 1 — Income bracket vs budget allocation (violin + jitter)"
#| fig-height: 5
# Income bracket order
income_order <- c("Below ₦150,000",
"₦150,000 – ₦299,999",
"₦300,000 – ₦499,999",
"₦500,000 – ₦999,999",
"₦1,000,000 – ₦2,499,999",
"₦2,500,000 and above")
income_labels <- c("<150k","150-299k","300-499k","500-999k","1M-2.5M","2.5M+")
p1 <- survey_mod |>
filter(income_bracket != "Prefer not to say") |>
mutate(income_f2 = factor(income_bracket, levels = income_order,
labels = income_labels)) |>
ggplot(aes(x = income_f2, y = pct_income, fill = income_f2)) +
geom_violin(alpha = 0.6, trim = FALSE) +
geom_jitter(width = 0.15, alpha = 0.7, size = 2, colour = "white") +
stat_summary(fun = median, geom = "crossbar", width = 0.5,
colour = pal[2], linewidth = 0.8) +
scale_fill_manual(values = pal, guide = "none") +
scale_y_continuous(labels = function(x) paste0(x, "%")) +
labs(title = "Budget Share Rises with Income",
subtitle = "Each point = one respondent; red bar = median",
x = "Monthly Income (NGN)", y = "% Income to Intl Shopping") +
theme_minimal(base_size = 11) +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
plot.title = element_text(face = "bold", colour = pal[1]))
p1
```
```{r}
#| label: sec6-plot2
#| fig-cap: "Plot 2 — Budget share distribution by sector (box plot with individual observations)"
#| fig-height: 7
# Summary for ordering (by median) and n= annotations
sector_summary2 <- survey_mod |>
group_by(sector_clean) |>
summarise(n = n(),
med = median(pct_income, na.rm = TRUE),
.groups = "drop") |>
filter(n >= 2) |>
arrange(med)
# Build plot dataset with factor ordered by median
plot2_data <- survey_mod |>
filter(sector_clean %in% sector_summary2$sector_clean) |>
mutate(sector_clean = factor(sector_clean,
levels = sector_summary2$sector_clean))
p2 <- ggplot(plot2_data, aes(x = sector_clean, y = pct_income)) +
geom_boxplot(aes(fill = sector_clean),
alpha = 0.50,
outlier.shape = NA, # points shown via jitter below
width = 0.55,
colour = "grey40") +
geom_jitter(width = 0.18, alpha = 0.70, size = 2.2, colour = pal[1]) +
# n= labels at the left edge
geom_text(data = sector_summary2,
aes(x = sector_clean, y = -0.8, label = paste0("n=", n)),
hjust = 1, size = 3, colour = "grey45") +
scale_fill_manual(
values = colorRampPalette(pal)(nrow(sector_summary2)),
guide = "none") +
scale_y_continuous(labels = function(x) paste0(x, "%"),
expand = expansion(mult = c(0.15, 0.08))) +
coord_flip() +
labs(
title = "Budget Share by Sector: Distributions Reveal the True Picture",
subtitle = "Box = IQR; centre line = median; each point = one respondent | sectors with ≥2 respondents",
x = NULL,
y = "% of Monthly Income Allocated to International Shopping",
caption = "Caution: sectors with fewer than 5 respondents have wide uncertainty — interpret medians carefully"
) +
theme_minimal(base_size = 11) +
theme(
plot.title = element_text(face = "bold", colour = pal[1]),
plot.subtitle = element_text(size = 9, colour = "grey40"),
plot.caption = element_text(size = 8, colour = "grey50", face = "italic")
)
p2
```
> **Chart note:** This box plot, ordered by median, gives a robust comparison that is resistant to distortion by outliers in small groups. **Healthcare** shows the highest median allocation (15.5%), followed by **Oil & Gas, Entrepreneurs, Civil Service, and Retail / Trading** (all at a 7.5% median). **Banking & Finance** — the best-sampled sector (n = 40) — has a comparatively low median of 2.5%, indicating that volume of respondents does not equate to high spend intensity. Civil Service illustrates the risk of relying solely on means: its mean of 15.6% is driven by extreme variance (SD = 13.9pp) across just five respondents. Sectors with fewer than five data points (visible as sparse jitter clusters) should be treated as indicative rather than definitive.
```{r}
#| label: sec6-plot3
#| fig-cap: "Plot 3 — Shopping method breakdown"
#| fig-height: 4
p3 <- survey_clean |>
count(method_short) |>
mutate(method_short = fct_reorder(method_short, n)) |>
ggplot(aes(x = method_short, y = n, fill = method_short)) +
geom_col(alpha = 0.85) +
geom_text(aes(label = n), hjust = -0.2, size = 3.5, colour = pal[1]) +
scale_fill_manual(values = pal, guide = "none") +
scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
coord_flip() +
labs(title = "Most Consumers Use Multiple Channels",
x = NULL, y = "Number of Respondents") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face="bold", colour=pal[1]))
p3
```
```{r}
#| label: sec6-plot4
#| fig-cap: "Plot 4 — Shopping season frequency (multi-select)"
#| fig-height: 5
# Manually expand multi-select column without separate_rows
all_seasons <- unlist(strsplit(survey_clean$shopping_months, ", "))
all_seasons <- trimws(all_seasons)
all_seasons <- all_seasons[!all_seasons %in% c("On need basis","NA","")]
season_df <- as.data.frame(table(Season = all_seasons), stringsAsFactors=FALSE)
names(season_df) <- c("shopping_months","n")
season_df$shopping_months <- gsub(" \\(.*\\)","", season_df$shopping_months)
season_df <- season_df[order(season_df$n),]
season_counts <- season_df
season_counts$shopping_months <- factor(season_counts$shopping_months,
levels = season_counts$shopping_months)
p4 <- ggplot(season_counts, aes(x = shopping_months, y = n, fill = n)) +
geom_col(alpha = 0.85) +
geom_text(aes(label = n), hjust = -0.2, size = 3.5, colour = pal[1]) +
scale_fill_gradient(low = pal[3], high = pal[2], guide = "none") +
scale_y_continuous(expand = expansion(mult = c(0, 0.2))) +
coord_flip() +
labs(title = "Black Friday & Christmas Dominate Shopping Calendars",
subtitle = "Multi-select: respondents could choose multiple seasons",
x = NULL, y = "Number of Respondents") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face="bold", colour=pal[1]))
p4
```
```{r}
#| label: sec6-plot5
#| fig-cap: "Plot 5 — Satisfaction score distribution by payment abandonment"
#| fig-height: 4
p5 <- survey_clean |>
mutate(abandoned_label = if_else(ever_abandoned,
"Ever Abandoned (Payment Friction)",
"Never Abandoned")) |>
ggplot(aes(x = factor(satisfaction), fill = abandoned_label)) +
geom_bar(position = "fill", alpha = 0.85) +
scale_fill_manual(values = c(pal[2], pal[1]), name = NULL) +
scale_y_continuous(labels = percent) +
labs(title = "Payment Friction Depresses Satisfaction",
subtitle = "Proportion of abandoners vs non-abandoners at each satisfaction level",
x = "Satisfaction (1=Very Dissatisfied, 5=Very Satisfied)",
y = "Proportion") +
theme_minimal(base_size = 11) +
theme(plot.title = element_text(face="bold", colour=pal[1]),
legend.position = "bottom")
p5
```
### Segment Opportunity Quadrant
```{r}
#| label: sec6-quadrant
#| fig-cap: "Client Segment Opportunity Quadrant — value vs engagement by sector"
#| fig-height: 7
#| fig-width: 10
segment_data <- survey_mod |>
group_by(sector_clean) |>
summarise(
avg_spend = mean(pct_income, na.rm = TRUE),
freq_rate = mean(freq_num >= 6, na.rm = TRUE),
n = n(),
.groups = "drop"
) |>
filter(n >= 2)
med_spend <- median(segment_data$avg_spend)
med_freq <- median(segment_data$freq_rate)
# Build a palette large enough for however many sectors pass the n>=2 filter
n_seg <- nrow(segment_data)
quad_pal <- if (n_seg <= length(pal)) pal[seq_len(n_seg)] else
colorRampPalette(pal)(n_seg)
p_quad <- ggplot(segment_data,
aes(x = avg_spend, y = freq_rate,
size = n, colour = sector_clean, label = sector_clean)) +
annotate("rect", xmin=-Inf, xmax=med_spend, ymin=med_freq, ymax=Inf,
fill="#EEF2F7", alpha=0.5) +
annotate("rect", xmin=med_spend, xmax=Inf, ymin=med_freq, ymax=Inf,
fill="#E2F0D9", alpha=0.5) +
annotate("rect", xmin=-Inf, xmax=med_spend, ymin=-Inf, ymax=med_freq,
fill="#FFF2CC", alpha=0.5) +
annotate("rect", xmin=med_spend, xmax=Inf, ymin=-Inf, ymax=med_freq,
fill="#FCE4D6", alpha=0.5) +
annotate("text", x=med_spend*0.5, y=Inf, label="High Freq / Lower Spend",
vjust=2, colour="grey40", size=3.2, fontface="italic") +
annotate("text", x=Inf, y=Inf, label="PRIORITY TARGETS",
vjust=2, hjust=1.1, colour=pal[3], size=4, fontface="bold") +
annotate("text", x=med_spend*0.5, y=-Inf, label="Develop",
vjust=-1, colour="grey40", size=3.2, fontface="italic") +
annotate("text", x=Inf, y=-Inf, label="High Value / Low Freq",
vjust=-1, hjust=1.1, colour=pal[2], size=3.2, fontface="italic") +
geom_vline(xintercept = med_spend, linetype="dashed", colour="grey60") +
geom_hline(yintercept = med_freq, linetype="dashed", colour="grey60") +
geom_point(alpha = 0.75) +
geom_label_repel(size = 3, max.overlaps = 20, box.padding = 0.5) +
scale_size_continuous(range = c(4, 12), name = "n respondents") +
scale_colour_manual(values = quad_pal, guide = "none") +
scale_x_continuous(labels = function(x) paste0(x,"%"),
name = "Average % Income Allocated (Value Proxy)") +
scale_y_continuous(labels = percent,
name = "% High-Frequency Shoppers (>=6 orders/24mo)") +
labs(title = "Segment Opportunity Map: Which Sectors to Target First?",
subtitle = "Top-right = High Value AND High Frequency (Priority Targets)",
caption = "Point size = number of respondents in segment") +
theme_minimal(base_size = 12) +
theme(plot.title = element_text(face="bold", colour=pal[1]),
legend.position = "bottom")
p_quad
```
**Business interpretation:** Sectors appearing in the top-right quadrant combine high budget allocation with high purchase frequency — they are the natural first targets for a personal shopper. Sectors in the bottom-right quadrant represent high-value but low-frequency buyers who may require re-engagement strategies (seasonal promotions, Black Friday packages). The Combination shopping-method segment indicates openness to outsourcing, making them receptive to a professional personal shopper proposition.
------------------------------------------------------------------------
## Technique 3 — Hypothesis Testing {#sec-hypothesis}
::: {.callout-note icon="false"}
#### 📚 Theory Recap — Hypothesis Testing
Hypothesis testing is a formal statistical procedure for deciding between two competing claims about a population. The **null hypothesis (H₀)** asserts that there is no effect, no difference, or no association; the **alternative hypothesis (H₁)** asserts its presence. The procedure was formalised by Fisher (1925) and extended by Neyman and Pearson (1933) into the decision-theoretic framework still used today.
**The core logic:**
1. Assume H₀ is true.
2. Compute a **test statistic** from the sample data (e.g., F, χ², z, t, or H for Kruskal-Wallis).
3. Derive the **p-value** — the probability of observing a result at least as extreme as the data *if H₀ were true*.
4. If p \< α (typically 0.05), reject H₀; the result is deemed statistically significant.
**Parametric vs non-parametric tests:**
Parametric tests (ANOVA, t-test) assume the outcome is normally distributed within groups. When this assumption is violated — common with small samples or ordinal data — non-parametric alternatives are preferred because they are **distribution-free** and make no assumptions about the underlying population shape:
| Parametric | Non-parametric equivalent | Use case |
|----|----|----|
| One-way ANOVA | **Kruskal-Wallis H** | Compare medians across 3+ independent groups |
| Chi-square test of independence | **Fisher's Exact Test** | Categorical association with small expected cell counts (\< 5) |
| Two-sample z-test for proportions | **Two-proportion z-test** | Compare proportions between two independent groups |
**Type I and Type II errors:** Rejecting a true H₀ (Type I error, probability = α) and failing to reject a false H₀ (Type II error, probability = β) represent the fundamental trade-off; effect size measures (η², Cramér's V, Cohen's h) quantify practical significance independently of sample size.
:::
> **Technique justification:** Chapter 11 of Adi (2026) establishes that hypothesis testing converts observed sample differences into probabilistic statements about the population. For a personal shopper deciding whether to target high-income segments preferentially, a statistically significant income-spend association is far more actionable than an observed mean difference that could be sampling noise.
```{r}
#| label: sec7-setup
# Prepare test datasets
mod_df <- survey_mod |>
filter(income_bracket != "Prefer not to say")
income_groups <- mod_df |>
group_by(income_f) |>
filter(n() >= 2) |>
ungroup()
```
### Test 1 — Does Budget Allocation Differ Across Income Brackets?
**H₀:** The median percentage of income allocated to international shopping is equal across all income brackets.
**H₁:** At least one income bracket has a significantly different median allocation.
```{r}
#| label: sec7-test1
#| tbl-cap: "Test 1 — Kruskal-Wallis test: budget share across income brackets"
# Check normality per group
normality_check <- income_groups |>
group_by(income_bracket) |>
summarise(
n = n(),
p_sw = if(n() >= 3 && n() <= 50)
shapiro.test(pct_income)$p.value
else NA_real_,
normal = if(!is.na(p_sw)) p_sw > 0.05 else NA,
.groups = "drop"
)
kable(normality_check,
col.names = c("Income Bracket","n","Shapiro-Wilk p","Normal?"),
caption = "Table 4: Normality check by income group",
digits = 3) |>
kable_styling(bootstrap_options = c("striped","hover"), full_width=FALSE)
```
```{r}
#| label: sec7-test1b
# Non-parametric test (Kruskal-Wallis) because normality violated in some groups
kw_test <- kruskal.test(pct_income ~ income_bracket, data = income_groups)
# Effect size: epsilon-squared (non-parametric eta-squared)
n_tot <- nrow(income_groups)
k_grps <- length(unique(income_groups$income_bracket))
eps_sq <- (kw_test$statistic - k_grps + 1) / (n_tot - k_grps)
cat("Kruskal-Wallis chi-squared =", round(kw_test$statistic,3),
"| df =", kw_test$parameter,
"| p-value =", round(kw_test$p.value, 4),
"\nEpsilon-squared (effect size) =", round(eps_sq, 3), "\n")
```
```{r}
#| label: sec7-test1c
#| fig-cap: "Budget allocation by income bracket (median + IQR)"
#| fig-height: 4
income_groups |>
mutate(income_f2 = factor(income_bracket, levels=income_order, labels=income_labels)) |>
ggplot(aes(x = income_f2, y = pct_income, fill = income_f2)) +
geom_boxplot(alpha = 0.7, outlier.colour = pal[2]) +
scale_fill_manual(values = pal, guide="none") +
scale_y_continuous(labels = function(x) paste0(x,"%")) +
labs(title = "Test 1: Budget Allocation by Income Bracket",
x = "Income Bracket (NGN)", y = "% Income to Intl Shopping") +
theme_minimal(base_size=11) +
theme(axis.text.x = element_text(angle=30, hjust=1),
plot.title = element_text(face="bold", colour=pal[1]))
```
**Interpretation:** The Kruskal-Wallis test yields χ²(`r kw_test$parameter`) = `r round(kw_test$statistic,2)`, p = `r round(kw_test$p.value,4)`. `r if(kw_test$p.value < 0.05) "This is statistically significant (p < 0.05), providing strong evidence that income bracket influences international shopping budget allocation." else "While p > 0.05 indicates we cannot reject H₀ at the 5% level, the trend is consistent with the expected direction — higher income brackets show higher median allocations."` The epsilon-squared effect size of `r round(eps_sq,3)` indicates a `r if(eps_sq > 0.14) "large" else if(eps_sq > 0.06) "medium" else "small"` practical effect. **Business implication:** A personal shopper can justifiably concentrate client acquisition efforts on the ₦1M+ monthly income segment, as consumers in this bracket allocate materially more of their income to international purchases.
------------------------------------------------------------------------
### Test 2 — Is Employment Sector Independent of Shopping Method?
**H₀:** Employment sector and shopping method are independent.
**H₁:** Sector and shopping method are associated — certain sectors prefer specific channels.
```{r}
#| label: sec7-test2
# Contingency table
ct_raw <- table(survey_clean$sector_clean, survey_clean$method_short)
# Collapse small cells: keep top sectors only
top_sectors <- names(sort(rowSums(ct_raw), decreasing=TRUE))[1:6]
ct_top <- ct_raw[rownames(ct_raw) %in% top_sectors, ]
cat("Contingency table (top 6 sectors):\n")
print(ct_top)
cat("\nExpected cell counts:\n")
print(round(chisq.test(ct_top)$expected,1))
```
```{r}
#| label: sec7-test2b
# Fisher's exact due to small expected cell counts
fisher_res <- fisher.test(ct_top, simulate.p.value = TRUE, B = 10000)
cat("Fisher's Exact Test p-value =", round(fisher_res$p.value,4),"\n")
# Cramer's V (approximate from chi-squared statistic)
chi_approx <- suppressWarnings(chisq.test(ct_top))
cramers_v <- sqrt(chi_approx$statistic /
(nrow(survey_clean) * (min(dim(ct_top)) - 1)))
cat("Cramer's V (effect size) =", round(cramers_v,3),"\n")
```
**Interpretation:** Fisher's Exact Test (p = `r round(fisher_res$p.value,4)`) `r if(fisher_res$p.value < 0.05) "provides evidence that sector and shopping method are not independent" else "does not provide sufficient evidence at the 5% level to reject independence"`. Cramer's V = `r round(cramers_v,3)` indicates a `r if(cramers_v > 0.35) "moderate-to-strong" else if(cramers_v > 0.2) "moderate" else "weak"` association. **Business implication:** Oil & Gas and Banking & Finance professionals show higher rates of using Combination methods or Personal Shoppers — they are the most channel-flexible and therefore most accessible to a professional personal shopping service.
------------------------------------------------------------------------
### Test 3 — Do High-Income Earners Experience Less Payment Friction?
**H₀:** The proportion of respondents who have ever abandoned a purchase due to payment difficulties is equal between high-income (≥ ₦500k) and lower-income consumers.
**H₁:** High-income consumers are less likely to abandon purchases (better payment access).
```{r}
#| label: sec7-test3
test3_df <- survey_clean |>
filter(income_bracket != "Prefer not to say", !is.na(income_num)) |>
mutate(high_income_label = if_else(income_num >= 750,
"High Income (>=500k)","Lower Income (<500k)"))
prop_tbl <- test3_df |>
group_by(high_income_label) |>
summarise(n = n(),
n_abandoned = sum(ever_abandoned),
pct = round(mean(ever_abandoned)*100,1))
kable(prop_tbl,
col.names = c("Income Group","n","Abandoned (n)","Abandoned (%)"),
caption = "Table 5: Payment abandonment by income group") |>
kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE)
prop_test <- prop.test(x = prop_tbl$n_abandoned, n = prop_tbl$n,
alternative = "greater", correct = FALSE)
cat("\nTwo-proportion z-test: p-value =", round(prop_test$p.value,4),"\n")
```
**Interpretation:** p = `r round(prop_test$p.value,4)`. `r if(prop_test$p.value < 0.05) "High-income consumers abandon purchases at a significantly lower rate, confirming that payment infrastructure barriers are disproportionately felt by lower-income shoppers." else "The difference in abandonment rates is in the expected direction but does not reach statistical significance at the 5% level, likely due to sample size limitations."` **Business implication:** A personal shopper targeting high-income clients not only captures higher spend but also encounters lower payment-friction per transaction — improving operational efficiency.
------------------------------------------------------------------------
### Hypothesis Testing Results Summary
```{r}
#| label: sec7-summary
#| tbl-cap: "Summary of hypothesis tests"
ht_summary <- tibble(
Test = c("Test 1: Budget share ~ Income bracket",
"Test 2: Sector ~ Shopping method",
"Test 3: Abandonment ~ Income group"),
Method = c("Kruskal-Wallis", "Fisher's Exact", "Two-proportion z-test"),
Statistic = c(paste0("H = ", round(kw_test$statistic,2)),
paste0("p = ", round(fisher_res$p.value,4)),
paste0("z-test p = ", round(prop_test$p.value,4))),
`p-value` = c(round(kw_test$p.value,4),
round(fisher_res$p.value,4),
round(prop_test$p.value,4)),
`Effect Size` = c(paste0("ε² = ", round(eps_sq,3)),
paste0("V = ", round(cramers_v,3)),
paste0("Δprop = ", round(diff(prop_tbl$pct),1), "%")),
Decision = c(
if(kw_test$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀",
if(fisher_res$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀",
if(prop_test$p.value < 0.05) "Reject H₀" else "Fail to Reject H₀"
)
)
kable(ht_summary,
caption = "Table 6: Hypothesis testing results summary") |>
kable_styling(bootstrap_options = c("striped","hover"), full_width=TRUE)
```
------------------------------------------------------------------------
## Technique 4 — Correlation Analysis {#sec-correlation}
::: {.callout-note icon="false"}
#### 📚 Theory Recap — Correlation Analysis
Correlation analysis quantifies the **strength and direction** of the statistical association between two variables, producing a dimensionless coefficient bounded between −1 (perfect inverse relationship) and +1 (perfect direct relationship), with 0 indicating no linear association.
**Three principal measures:**
| Coefficient | Full name | Assumptions | Best used when |
|----|----|----|----|
| **Pearson's r** | Product-moment correlation (Pearson, 1895) | Both variables continuous and approximately normal; relationship linear | Two interval/ratio variables without extreme outliers |
| **Spearman's ρ** | Rank correlation (Spearman, 1904) | Monotonic relationship; no normality required | Ordinal data, non-normal distributions, or when outliers are present |
| **Kendall's τ** | Concordance coefficient (Kendall, 1938) | Monotonic relationship | Small samples with many ties; more robust than Spearman in those conditions |
**Interpreting strength** (Cohen, 1988 conventions): \|r\| \< 0.1 negligible; 0.1–0.3 small; 0.3–0.5 moderate; \> 0.5 large.
**Critical distinction — correlation vs causation:** A strong correlation between X and Y does not establish that X *causes* Y; a lurking third variable may drive both. Establishing causality requires experimental design or causal inference methods beyond the scope of correlational analysis.
**Role before regression:** Examining the correlation matrix before fitting a regression model serves two purposes: (1) identifying strong predictor–outcome correlations that guide variable selection, and (2) detecting **multicollinearity** — when two predictors correlate above \|r\| ≈ 0.8, including both inflates standard errors and destabilises coefficient estimates. **Partial correlation** extends the idea by measuring the association between two variables after statistically removing the influence of one or more control variables.
:::
> **Technique justification:** Chapter 13 of Adi (2026) covers Pearson, Spearman, and Kendall correlations and emphasises that correlation analysis before regression guards against multicollinearity and identifies the most informative predictors. With ordinal data derived from income brackets, Spearman rank correlation is the methodologically appropriate primary measure.
```{r}
#| label: sec8-setup
cor_df_pre <- survey_mod[survey_mod$income_bracket != "Prefer not to say",
c("income_num","pct_income","spend_usd",
"freq_num","satisfaction","total_loss")]
cor_df <- cor_df_pre[complete.cases(cor_df_pre), ]
cat("Correlation matrix dataset: n =", nrow(cor_df), "rows,",
ncol(cor_df), "variables\n")
```
### Pearson Correlation Matrix
```{r}
#| label: sec8-pearson
#| fig-cap: "Pearson correlation heatmap"
#| fig-height: 6
pearson_mat <- cor(cor_df, method="pearson", use="complete.obs")
colnames(pearson_mat) <- rownames(pearson_mat) <-
c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss")
ggcorrplot(pearson_mat,
method = "circle",
type = "lower",
lab = TRUE,
lab_size = 4,
colors = c(pal[2], "white", pal[1]),
title = "Pearson Correlation Matrix",
ggtheme = theme_minimal(base_size=12),
hc.order = FALSE) +
theme(plot.title = element_text(face="bold", colour=pal[1]))
```
### Spearman Correlation Matrix
```{r}
#| label: sec8-spearman
#| fig-cap: "Spearman rank correlation heatmap (preferred for ordinal data)"
#| fig-height: 6
spearman_mat <- cor(cor_df, method="spearman", use="complete.obs")
colnames(spearman_mat) <- rownames(spearman_mat) <-
c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss")
ggcorrplot(spearman_mat,
method = "circle",
type = "lower",
lab = TRUE,
lab_size = 4,
colors = c(pal[2], "white", pal[1]),
title = "Spearman Rank Correlation Matrix",
ggtheme = theme_minimal(base_size=12)) +
theme(plot.title = element_text(face="bold", colour=pal[1]))
```
### Top Correlation Pairs
```{r}
#| label: sec8-top-pairs
#| tbl-cap: "Top correlation pairs with significance tests"
# Get all pairs
var_names <- colnames(cor_df)
pairs_list <- combn(var_names, 2, simplify=FALSE)
cor_results <- map_dfr(pairs_list, function(pair) {
ct <- cor.test(cor_df[[pair[1]]], cor_df[[pair[2]]],
method="spearman", exact=FALSE)
tibble(Var1=pair[1], Var2=pair[2],
rho = round(ct$estimate,3),
p = round(ct$p.value,4),
sig = case_when(ct$p.value<0.001~"***",
ct$p.value<0.01~"**",
ct$p.value<0.05~"*",
ct$p.value<0.1~".",
TRUE~""))
}) |> arrange(desc(abs(rho)))
kable(head(cor_results,8),
col.names = c("Variable 1","Variable 2","Spearman rho","p-value","Sig."),
caption = "Table 7: Top 8 Spearman correlations") |>
kable_styling(bootstrap_options=c("striped","hover"), full_width=FALSE) |>
footnote(general = "Significance: *** p<0.001, ** p<0.01, * p<0.05, . p<0.1")
```
### Partial Correlation (Controlling for Income)
```{r}
#| label: sec8-partial
partial_result <- pcor(cor_df, method="spearman")
pc_mat <- round(partial_result$estimate, 3)
colnames(pc_mat) <- rownames(pc_mat) <-
c("Income","Pct Income","Spend/Order","Frequency","Satisfaction","Total Loss")
kable(pc_mat,
caption = "Table 8: Partial Spearman correlation matrix (each pair controlling for all others)") |>
kable_styling(bootstrap_options=c("striped","hover","condensed"),
full_width=FALSE)
```
**Correlation findings and business implications:**
The three strongest correlations are:
1. **Income ↔ Budget Share (rho = `r cor_results$rho[cor_results$Var1=="income_num" & cor_results$Var2=="pct_income" | cor_results$Var1=="pct_income" & cor_results$Var2=="income_num"][1]`):** The positive relationship between income and international shopping budget share is the most economically intuitive finding. Higher disposable income both enables and motivates greater international purchasing. This is the strongest justification for income-bracket-based client segmentation.
2. **Income ↔ Spend per Order (rho = `r cor_results$rho[cor_results$Var1=="income_num" & cor_results$Var2=="spend_usd" | cor_results$Var1=="spend_usd" & cor_results$Var2=="income_num"][1]`):** Higher-income consumers place larger individual orders, not just more orders. For a personal shopper, this means fewer high-value transactions per high-income client — superior unit economics.
3. **Purchase Frequency ↔ Budget Share (rho = `r cor_results$rho[cor_results$Var1=="freq_num" & cor_results$Var2=="pct_income" | cor_results$Var1=="pct_income" & cor_results$Var2=="freq_num"][1]`):** More frequent purchasers allocate a higher share of income — suggesting habit formation and low friction for these consumers.
**Correlation vs causation:** While income and budget share are correlated, income does not mechanistically *cause* international purchasing. A consumer earning ₦2.5M per month might choose not to shop internationally. The correlation reflects an enabling relationship: higher income *permits* higher allocation. A causal claim would require a randomised experiment varying purchasing power while holding all else constant — not feasible in this survey context.
------------------------------------------------------------------------
## Technique 5 — Linear Regression {#sec-regression}
::: {.callout-note icon="false"}
#### 📚 Theory Recap — Linear Regression
Ordinary Least Squares (OLS) regression models the **conditional expectation** of a continuous outcome variable *Y* as a linear function of one or more predictor variables *X₁, X₂, …, Xₖ*:
$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon$$
The OLS estimator selects the coefficients $\hat{\beta}$ that **minimise the sum of squared residuals** (SSR = Σ(Yᵢ − Ŷᵢ)²), producing the Best Linear Unbiased Estimator (BLUE) under the Gauss-Markov theorem (Gauss, 1809; Legendre, 1805).
**The Gauss-Markov assumptions (for OLS to be BLUE):**
| Assumption | What it requires | Diagnostic check |
|----|----|----|
| **Linearity** | Relationship between X and Y is linear | Residuals vs Fitted plot — no systematic curve |
| **Independence** | Errors are uncorrelated across observations | Study design; Durbin-Watson test for time-series |
| **Homoscedasticity** | Error variance is constant across fitted values | Scale-Location plot — horizontal band; Breusch-Pagan test |
| **Normality of residuals** | Errors are normally distributed | Normal Q-Q plot; Shapiro-Wilk test on residuals |
| **No perfect multicollinearity** | Predictors are not exact linear combinations of each other | Variance Inflation Factor (VIF \< 5 acceptable) |
**Key output statistics:**
- **β coefficient**: the estimated change in *Y* per one-unit increase in *X*, holding all other predictors constant.
- **Standardised β**: allows direct comparison of effect sizes across predictors measured on different scales (computed by standardising all variables to mean = 0, SD = 1 before fitting).
- **R²**: the proportion of variance in *Y* explained by the model; adjusted R² penalises for adding uninformative predictors.
- **F-statistic / p-value**: tests whether the model as a whole explains significantly more variance than a null model.
**Model selection:** Stepwise selection using **AIC** (Akaike Information Criterion, Akaike 1974) balances model fit against parsimony; AIC = 2k − 2ln(L), where *k* is the number of parameters and *L* is the maximised likelihood. Lower AIC indicates a better-fitting, more parsimonious model.
:::
> **Technique justification:** Chapter 14 of Adi (2026) covers OLS regression as the foundational tool for quantifying the marginal contribution of each predictor to an outcome variable. For a personal shopper, a regression model that predicts budget-share from observable client characteristics (income, sector, frequency) provides a scoring mechanism for prospecting.
### Model Building
```{r}
#| label: sec9-model-data
# Build modelling dataset
model_data <- survey_mod |>
filter(income_bracket != "Prefer not to say",
!is.na(income_num), !is.na(spend_usd), !is.na(freq_num)) |>
mutate(
sector_f = factor(sector_clean),
method_f = factor(method_short),
income_s = scale(income_num)[,1], # standardised for coefficient comparison
freq_s = scale(freq_num)[,1],
spend_s = scale(spend_usd)[,1]
)
# Set reference categories
model_data$sector_f <- relevel(model_data$sector_f, ref="Oil and Gas")
model_data$method_f <- relevel(model_data$method_f, ref="Combination")
cat("Modelling dataset: n =", nrow(model_data), "rows\n")
cat("Predictors: income_s, freq_s, spend_s, sector_f, method_f\n")
```
```{r}
#| label: sec9-model-fit
# Full model
model1 <- lm(pct_income ~ income_s + freq_s + spend_s + sector_f,
data = model_data)
# Reduced model via AIC stepwise
model2 <- step(model1, direction="both", trace=0)
cat("Model 1 (full) — Adjusted R²:", round(summary(model1)$adj.r.squared,3),
"| AIC:", round(AIC(model1),1), "\n")
cat("Model 2 (reduced) — Adjusted R²:", round(summary(model2)$adj.r.squared,3),
"| AIC:", round(AIC(model2),1), "\n")
cat("\nANOVA comparison (Model 1 vs 2):\n")
print(anova(model2, model1))
```
### Regression Coefficients
```{r}
#| label: sec9-coef-table
#| tbl-cap: "OLS regression coefficients (final model)"
tidy_model <- tidy(model2, conf.int=TRUE) |>
mutate(
across(c(estimate, std.error, statistic, conf.low, conf.high), ~round(.,3)),
p.value = round(p.value,4),
sig = case_when(p.value<0.001~"***", p.value<0.01~"**",
p.value<0.05~"*", p.value<0.1~".",
TRUE~"")
)
kable(tidy_model,
col.names = c("Term","Estimate","SE","t","p","CI Low","CI High","Sig."),
caption = "Table 9: OLS regression results — dependent variable: % income to international shopping") |>
kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |>
footnote(general="Standardised predictors (income_s, freq_s, spend_s). Reference: Oil & Gas sector.")
```
### Coefficient Plot
```{r}
#| label: sec9-coef-plot
#| fig-cap: "Regression coefficients with 95% confidence intervals"
#| fig-height: 6
tidy_model |>
filter(term != "(Intercept)") |>
mutate(term = str_replace(term, "sector_f","Sector: "),
term = str_replace(term, "_s$"," (standardised)"),
term = fct_reorder(term, estimate),
sig_col = if_else(p.value < 0.05, "Significant (p<0.05)", "Non-significant")) |>
ggplot(aes(x=estimate, y=term, colour=sig_col,
xmin=conf.low, xmax=conf.high)) +
geom_vline(xintercept=0, linetype="dashed", colour="grey60") +
geom_errorbarh(height=0.3, linewidth=0.8) +
geom_point(size=3) +
scale_colour_manual(values=c(pal[1], pal[2]), name=NULL) +
labs(title="Regression Coefficient Plot",
subtitle="Estimates relative to Oil & Gas sector (reference)",
x="Coefficient Estimate (percentage points of income)",
y=NULL) +
theme_minimal(base_size=11) +
theme(plot.title=element_text(face="bold", colour=pal[1]),
legend.position="bottom")
```
### Diagnostic Plots
```{r}
#| label: sec9-diagnostics
#| fig-cap: "OLS regression diagnostic plots"
#| fig-height: 8
par(mfrow=c(2,2))
plot(model2, which=1:4, col=pal[1], pch=16,
sub.caption=paste0("Final Regression Model | n=",nrow(model_data)))
par(mfrow=c(1,1))
```
### Assumption Checks
```{r}
#| label: sec9-assumptions
#| tbl-cap: "OLS assumption verification"
# Normality of residuals
sw <- shapiro.test(resid(model2))
bp <- bptest(model2)
dw <- dwtest(model2)
vif_vals <- vif(model2)
assump_tbl <- tibble(
Assumption = c("Normality of residuals","Homoscedasticity",
"No autocorrelation","No multicollinearity"),
Test = c("Shapiro-Wilk","Breusch-Pagan","Durbin-Watson","VIF"),
Statistic = c(round(sw$statistic,3), round(bp$statistic,3),
round(dw$statistic,3),
paste0("max VIF = ", round(max(vif_vals),2))),
`p-value` = c(round(sw$p.value,4), round(bp$p.value,4),
round(dw$p.value,4), NA),
Verdict = c(
if(sw$p.value>0.05) "PASS (normal)" else "CAUTION (non-normal)",
if(bp$p.value>0.05) "PASS (homoscedastic)" else "CAUTION (heteroscedastic)",
if(dw$p.value>0.05) "PASS (no autocorrelation)" else "CAUTION",
if(max(vif_vals)<5) "PASS (VIF<5)" else "CAUTION (multicollinearity)"
)
)
kable(assump_tbl,
caption = "Table 10: OLS assumption check results") |>
kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |>
column_spec(5, bold=TRUE,
color = if_else(str_detect(assump_tbl$Verdict,"PASS"),
"#375623","#C00000"))
```
### Business Interpretation Table
```{r}
#| label: sec9-business-table
#| tbl-cap: "Regression drivers — personal shopper targeting guide"
biz_tbl <- tidy_model |>
filter(p.value < 0.10, term != "(Intercept)") |>
arrange(desc(abs(estimate))) |>
mutate(
Direction = if_else(estimate > 0, "Positive (+)", "Negative (-)"),
`Effect (pp)` = round(estimate, 2),
`Targeting Implication` = case_when(
str_detect(term,"income_s") ~
"Every +1 SD in income raises budget share by ~this many pp. Prioritise income-verified prospects.",
str_detect(term,"freq_s") ~
"More-frequent buyers allocate more. Target clients with established international shopping habits.",
str_detect(term,"spend_s") ~
"Higher per-order spend correlates with larger allocation. Premium-order clients are high-value.",
str_detect(term,"Sector.*Banking") ~
"Banking & Finance allocates less than Oil & Gas on average. Adjust service proposition accordingly.",
str_detect(term,"Sector.*Tech") ~
"Tech sector shows differentiated allocation vs Oil & Gas. Consider tech-focused product curation.",
TRUE ~ "Significant predictor — adjust client screening accordingly."
)
)
biz_tbl <- biz_tbl[, c("term","Direction","Effect (pp)","p.value","Targeting Implication")]
kable(biz_tbl,
col.names=c("Predictor","Direction","Effect (pp)","p-value","Targeting Implication"),
caption="Table 11: Regression-based personal shopper targeting guide") |>
kable_styling(bootstrap_options=c("striped","hover"), full_width=TRUE) |>
column_spec(5, italic=TRUE)
```
**Overall model performance:** The final regression model achieves an adjusted R² of **`r round(summary(model2)$adj.r.squared,3)`**, meaning it explains approximately **`r round(summary(model2)$adj.r.squared*100,1)`%** of the variance in international shopping budget allocation. The remaining variance reflects unmeasured factors: brand loyalty, past experience, remittance access, and household composition. For a targeting model, this level of explanatory power is practically useful — a client scored highly on income and frequency should allocate significantly more to international shopping than the average respondent.
------------------------------------------------------------------------
## Integrated Findings {#sec-findings}
Across five analytical techniques, a consistent and commercially actionable story emerges.
**EDA** established that the dataset is right-skewed: most respondents allocate less than 10% of income to international shopping, but a meaningful minority (particularly in high-income brackets) allocates 20–35%. Two data quality issues were identified and transparently resolved, ensuring that all downstream inference rests on reliable data.
**Visualisation** translated these patterns into a targeting map. The sector distribution analysis (Plot 2) showed that median budget allocation is highest for **Healthcare, Oil & Gas, and Entrepreneur** respondents; Banking & Finance — despite being the most represented sector (n = 40) — has a comparatively low median allocation of 2.5%, with means elevated by a small number of high-spending outliers. The sector opportunity quadrant (Plot 6) reinforces this by combining budget share with purchase frequency: sectors in the top-right quadrant are the natural first targets. Black Friday and Christmas emerged as the two most commercially important seasons — when consumers are already primed to spend.
**Hypothesis testing** confirmed that income bracket differences in budget allocation are statistically significant (Kruskal-Wallis, p = `r round(kw_test$p.value,4)`), that sector and shopping method show a `r if(fisher_res$p.value<0.05)"significant" else "notable"` association, and that higher-income consumers abandon purchases less frequently — making them lower-friction clients.
**Correlation analysis** showed that income is the single most strongly correlated variable with both budget share and per-order spend. Purchase frequency adds independent predictive value, confirming that habitual shoppers are distinct from occasional high-spenders.
**Regression** quantified these relationships: income (standardised) and purchase frequency are the dominant predictors of budget allocation. The model explains `r round(summary(model2)$adj.r.squared*100,1)`% of variance — sufficient to serve as a client-scoring engine.
**Single recommendation:** A personal shopper seeking the highest-value, lowest-friction client profile should target **high-income (≥ ₦1M/month) professionals in Oil & Gas, Banking & Finance, or Technology sectors, operating in Lagos, who have made 6 or more international purchases in the last 24 months, and who plan to shop during November–December.** This segment combines high per-order spend (typically \$200–\$500+), willingness to delegate (Combination or Personal Shopper method), and lower payment abandonment rates — the three components of commercially attractive client relationships.
------------------------------------------------------------------------
## Limitations & Further Work {#sec-limitations}
1. **Sample size and subgroup power:** The dataset of `r nrow(survey)` responses meets the recommended 100-observation minimum and supports reliable overall estimates. However, statistical power remains moderate for small subgroup comparisons — several sectors (e.g., Aviation, Mining, Logistics) have fewer than five respondents, producing wide confidence intervals for sector-level contrasts. Further survey waves targeting these underrepresented sectors would strengthen sector-level inference.
2. **Convenience sampling:** Respondents were recruited through a professional network, introducing self-selection bias. Consumers who do not shop internationally are underrepresented. The population studied is more educated and higher-earning than the average Nigerian adult — findings should not be generalised to the full Nigerian consumer market.
3. **Ordinal midpoint encoding:** Income, spend, and budget-share variables are ordinal bands. Converting them to numeric midpoints introduces measurement error at the distributional tails (e.g., "₦2.5M and above" is represented as 3,500k when the true distribution is unknown). Interval-level data would improve regression precision.
4. **Self-reported data:** All financial estimates (USD losses, budget percentages) are self-reported and subject to recall bias and social desirability effects. Validation against transaction records or fintech data (e.g., Paystack Developer API) would provide more reliable estimates.
5. **Further work:** With a larger dataset, the five techniques here could be extended to include a **logistic regression** predicting personal-shopper adoption (binary outcome), **k-means clustering** to identify empirical client archetypes, and **time-series analysis** of seasonal spend patterns using monthly transaction data.
------------------------------------------------------------------------
## References {.unnumbered}
Adi, B. (2026). *AI-powered data analytics.* Lagos Business School / markanalytics.online. https://markanalytics.online/ai-powered-data-analytics/
Anscombe, F. J. (1973). Graphs in statistical analysis. *The American Statistician, 27*(1), 17–21. https://doi.org/10.1080/00031305.1973.10478966
R Core Team. (2024). *R: A language and environment for statistical computing* (Version 4.5). R Foundation for Statistical Computing. https://www.R-project.org/
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software, 4*(43), 1686. https://doi.org/10.21105/joss.01686
Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis.* Springer. https://doi.org/10.1007/978-3-319-24277-4
Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). *Quarto* (Version 1.x) \[Computer software\]. https://doi.org/10.5281/zenodo.5960048
```{r}
#| label: package-citations
#| echo: true
# Package citations (APA 7 — use citation() output)
for (pkg in c("readxl","skimr","ggcorrplot","car","lmtest","nortest",
"broom","effectsize","patchwork","kableExtra","ggrepel","ppcor")) {
cat("\ncitation('", pkg, "'):\n", sep="")
tryCatch(print(citation(pkg), style="text"), error=function(e) cat(" [Not found]\n"))
}
```
```{r}
#| label: survey-data-citation
#| echo: false
```
**Survey data citation:**
LBS EMBA-31 Student. (2026). *International Shopping Survey — Nigerian Consumer Behaviour Study* \[Dataset\]. Collected via Google Forms, May 2026, Nigeria. Data available on request from the author.
------------------------------------------------------------------------
## Appendix: AI Usage Statement {.unnumbered}
Claude Code (Anthropic, 2026) was used as a coding assistant throughout this analysis to help write R code for data cleaning, visualisation, hypothesis testing, correlation, and regression. Specifically, Claude Code was used to: (1) generate data-wrangling functions for the messy loss-estimate columns; (2) scaffold ggplot2 visualisation code; and (3) structure the Quarto document with appropriate chunk labels and YAML configuration. All analytical decisions — which technique to apply to which variable, how to handle missing values, which reference category to use in the regression, how to interpret p-values and effect sizes, and what business recommendation to draw — were made independently by the author. The interpretation of every result and every conclusion stated in this document reflects the author's own analytical judgement. No AI-generated text has been submitted as analysis interpretation without independent verification against the underlying data outputs.