---
title: "Sales Analytics for a Nigerian D2C Furniture & Home Décor Business"
author: "[Joy Temilade Adewuyi]"
date: today
format:
html:
theme: flatly
css: custom.css
toc: true
toc-depth: 3
toc-title: "Table of Contents"
code-fold: true
code-tools: true
self-contained: true
highlight-style: github
fig-width: 10
fig-height: 6
df-print: paged
mainfont: "DM Sans"
execute:
warning: false
message: false
echo: true
knitr:
opts_chunk:
python.reticulate: true
---
---
```{r setup, include=FALSE}
# Set CRAN mirror to avoid "no mirror" error
options(repos = c(CRAN = "https://cloud.r-project.org"))
# Install and load reticulate
if (!requireNamespace("reticulate", quietly = TRUE)) {
install.packages("reticulate")
}
library(reticulate)
# Install required R packages silently
pkgs <- c("tidyverse","skimr","moments","ggplot2","patchwork",
"gridExtra","corrplot","rstatix","scales","coin","car","lmtest")
invisible(lapply(pkgs, function(p) {
if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
}))
# Load core libraries
library(tidyverse)
library(ggplot2)
```
```{=html}
<style>
/* ============================================================
DIY AFRICANS BRAND STYLES — injected directly to beat Flatly
Primary: #6b4a1b deep earthy brown
Accent: #b5511c rust-orange
Terracotta: #c97d5a
Page bg: #f5f4f2 warm grey
CONTRAST: dark bg = white text | light bg = dark text
============================================================ */
@import url('https://fonts.googleapis.com/css2?family=Playfair+Display:ital,wght@0,400;0,600;0,700;1,400&family=DM+Sans:ital,opsz,wght@0,9..40,300;0,9..40,400;0,9..40,500;1,9..40,300&family=DM+Mono:wght@400;500&display=swap');
/* ── VARIABLES ──────────────────────────────────────────────── */
:root {
--br: #6b4a1b; /* brand brown */
--brdk: #4e3512; /* brown dark */
--brmd: #8a6030; /* brown mid */
--rs: #b5511c; /* rust */
--tc: #c97d5a; /* terracotta */
--tcp: #f7ede5; /* terracotta pale */
--wh: #ffffff;
--wg: #f5f4f2; /* warm grey */
--wgm: #ede9e4;
--nk: #1a1a1a; /* near black */
--tm: #3d3224; /* text mid */
--tl: #7a6a58; /* text light */
}
/* ── PAGE & BODY ────────────────────────────────────────────── */
body { background-color: var(--wg) !important; font-family: 'DM Sans', sans-serif !important; color: var(--nk); }
#quarto-content, .quarto-container { background-color: var(--wg) !important; }
main.content, #quarto-document-content {
background: var(--wh) !important;
border-top: 5px solid var(--br) !important;
border-radius: 12px !important;
padding: 3rem 4rem !important;
box-shadow: 0 4px 20px rgba(60,30,5,0.12) !important;
}
/* ── TITLE BLOCK — brown bg, white text ────────────────────── */
#title-block-header,
.quarto-title-banner,
header.quarto-title-block,
div.quarto-title-block {
background: linear-gradient(135deg, #4e3512 0%, #6b4a1b 55%, #8a6030 100%) !important;
background-color: #6b4a1b !important;
padding: 3.5rem 4rem !important;
margin: -3rem -4rem 3rem -4rem !important;
border-radius: 12px 12px 0 0 !important;
border-bottom: 4px solid var(--rs) !important;
}
/* Title — WHITE on brown */
.title, h1.title,
#title-block-header h1,
.quarto-title h1 {
color: #ffffff !important;
font-family: 'Playfair Display', serif !important;
font-size: 2.5rem !important;
font-weight: 700 !important;
border: none !important;
padding: 0 !important;
margin-bottom: 1rem !important;
}
/* Author / date / meta — light on brown */
.quarto-title-meta-heading { color: rgba(255,255,255,0.55) !important; font-size: 0.68rem !important; text-transform: uppercase; letter-spacing: 0.1em; }
.quarto-title-meta-contents p,
.quarto-title-meta p,
p.author, p.date, .author, .date { color: rgba(255,255,255,0.85) !important; font-size: 0.88rem !important; text-transform: uppercase; letter-spacing: 0.04em; }
/* ── HEADINGS ───────────────────────────────────────────────── */
h1, h2, h3, h4 { font-family: 'Playfair Display', serif !important; }
h1 { color: var(--br) !important; border-bottom: 3px solid var(--rs) !important; padding-bottom: 0.5rem !important; font-size: 2rem !important; }
h2 { color: var(--brmd) !important; border-left: 4px solid var(--rs) !important; padding-left: 0.85rem !important; border-bottom: none !important; font-size: 1.5rem !important; }
h3 { color: var(--br) !important; font-family: 'DM Sans', sans-serif !important; font-size: 1.05rem !important; text-transform: uppercase; letter-spacing: 0.05em; }
p { color: var(--tm); line-height: 1.75; }
strong { color: var(--nk) !important; }
a { color: var(--rs) !important; }
a:hover { color: var(--br) !important; }
/* ── TOC SIDEBAR — brown bg, white text ────────────────────── */
#quarto-sidebar, .sidebar, #TOC, nav[id^="TOC"] {
background-color: var(--br) !important;
border-radius: 12px !important;
border: none !important;
}
#quarto-sidebar *, .sidebar *, #TOC *, nav[id^="TOC"] * { color: rgba(255,255,255,0.82) !important; }
#quarto-sidebar .sidebar-title, .toc-title { color: var(--tc) !important; font-size: 0.68rem !important; text-transform: uppercase; letter-spacing: 0.12em; border-bottom: 1px solid rgba(201,125,90,0.4) !important; padding-bottom: 0.5rem !important; }
#TOC a, nav[id^="TOC"] a, #quarto-sidebar a, .sidebar a { color: rgba(255,255,255,0.80) !important; font-size: 0.82rem !important; font-weight: 300 !important; border: none !important; display: block; padding: 0.25rem 0 !important; transition: color 0.2s, padding-left 0.2s; }
#TOC a:hover, nav[id^="TOC"] a:hover, #quarto-sidebar a:hover { color: #ffffff !important; padding-left: 5px !important; }
#TOC a.active, nav[id^="TOC"] a.active { color: var(--tc) !important; font-weight: 500 !important; }
/* ── CODE OUTPUT BLOCKS — dark brown bg, CREAM text ────────────
These are the plain output boxes (stdout, stderr, results).
Background forced to dark brown; ALL text forced to cream. */
.cell-output pre,
.cell-output-stdout pre,
.cell-output-stderr pre,
.cell-output-display pre {
background-color: #2a1a0a !important;
color: #f0e6d3 !important;
border: 1px solid rgba(181,81,28,0.4) !important;
border-radius: 6px !important;
padding: 1.1rem 1.3rem !important;
}
/* Force ALL text nodes inside output blocks to cream */
.cell-output pre *,
.cell-output-stdout pre *,
.cell-output-stderr pre * {
color: #f0e6d3 !important;
background: transparent !important;
font-family: 'DM Mono', monospace !important;
font-size: 0.83rem !important;
}
/* ── SOURCE CODE BLOCKS — dark brown bg, cream text ─────────── */
div.sourceCode,
div.sourceCode pre,
.sourceCode,
pre.sourceCode {
background-color: #2a1a0a !important;
border: 1px solid rgba(181,81,28,0.4) !important;
border-radius: 6px !important;
}
div.sourceCode pre *,
.sourceCode *,
pre.sourceCode * {
background: transparent !important;
font-family: 'DM Mono', monospace !important;
font-size: 0.83rem !important;
}
/* Default code text colour = bright cream */
div.sourceCode pre,
div.sourceCode code,
pre.sourceCode,
pre.sourceCode code {
color: #f0e6d3 !important;
}
/* Syntax tokens */
.sourceCode .co, code .co { color: #9a7a5a !important; font-style: italic; } /* comments */
.sourceCode .st, code .st { color: #e8b97a !important; } /* strings */
.sourceCode .kw, code .kw { color: #e07a50 !important; font-weight: 600; } /* keywords */
.sourceCode .dv, code .dv,
.sourceCode .fl, code .fl { color: #b8d898 !important; } /* numbers */
.sourceCode .fu, code .fu { color: #f0c87a !important; } /* functions */
.sourceCode .op, code .op { color: #d4a070 !important; } /* operators */
.sourceCode .cf, code .cf { color: #e07a50 !important; font-weight: 600; } /* control flow */
.sourceCode .ot, code .ot { color: #c9daf8 !important; } /* other */
.sourceCode .er, code .er { color: #e06c75 !important; } /* errors */
/* Catch-all for any un-tokenised spans in code */
div.sourceCode span:not([class]),
pre.sourceCode span:not([class]) {
color: #f0e6d3 !important;
}
/* Inline code pill */
p code, li code, td code {
background-color: #4e3512 !important;
color: var(--tc) !important;
padding: 0.15em 0.5em !important;
border-radius: 3px !important;
font-size: 0.82em !important;
}
/* Code fold button */
details > summary, .code-fold-btn-row button {
color: var(--rs) !important;
background: transparent !important;
border: 1px solid var(--rs) !important;
border-radius: 6px !important;
font-size: 0.78rem !important;
padding: 0.2rem 0.75rem !important;
cursor: pointer;
}
details > summary:hover, .code-fold-btn-row button:hover {
background: var(--rs) !important;
color: #ffffff !important;
}
/* ── PANEL TABSET ───────────────────────────────────────────── */
.panel-tabset .nav-tabs { border-bottom: 2px solid var(--wgm) !important; }
.panel-tabset .nav-tabs .nav-link { color: var(--tl) !important; background: transparent !important; border: none !important; border-bottom: 3px solid transparent !important; border-radius: 0 !important; font-size: 0.82rem !important; font-weight: 500 !important; text-transform: uppercase; letter-spacing: 0.06em; padding: 0.6rem 1.25rem !important; margin-bottom: -2px !important; }
.panel-tabset .nav-tabs .nav-link:hover { color: var(--br) !important; border-bottom-color: var(--tc) !important; }
.panel-tabset .nav-tabs .nav-link.active { color: var(--br) !important; background: transparent !important; border-bottom: 3px solid var(--rs) !important; font-weight: 600 !important; }
/* ── BLOCKQUOTES — pale cream bg, DARK text ────────────────── */
blockquote {
background: var(--tcp) !important;
border-left: 4px solid var(--rs) !important;
border-top: none !important; border-right: none !important; border-bottom: none !important;
border-radius: 0 6px 6px 0 !important;
padding: 1.2rem 1.5rem !important;
margin: 1.5rem 0 !important;
}
blockquote p, blockquote * { color: var(--nk) !important; font-style: normal !important; font-size: 0.95rem !important; margin: 0 !important; }
blockquote strong { color: var(--br) !important; }
/* ── TABLES — brown header, white text ─────────────────────── */
table { border-collapse: collapse !important; width: 100%; border-radius: 6px; overflow: hidden; box-shadow: 0 2px 8px rgba(60,30,5,0.08); margin: 1.5rem 0; }
thead tr, thead { background-color: var(--br) !important; }
thead th { color: #ffffff !important; font-size: 0.78rem !important; font-weight: 500 !important; text-transform: uppercase; letter-spacing: 0.06em; padding: 0.9rem 1rem !important; border: none !important; }
tbody tr { border-bottom: 1px solid var(--wgm) !important; }
tbody tr:nth-child(even) { background-color: var(--wg) !important; }
tbody tr:hover { background-color: var(--tcp) !important; }
tbody td { padding: 0.75rem 1rem !important; color: var(--tm) !important; border: none !important; font-size: 0.88rem; }
tbody td:first-child { font-family: 'DM Mono', monospace !important; color: var(--br) !important; font-weight: 500 !important; font-size: 0.8rem !important; }
/* ── HR ─────────────────────────────────────────────────────── */
hr { border: none !important; height: 1px !important; background: linear-gradient(90deg, transparent, var(--rs), transparent) !important; margin: 3rem 0 !important; opacity: 0.45; }
/* ── LISTS ──────────────────────────────────────────────────── */
ul, ol { color: var(--tm); }
li { margin-bottom: 0.4rem; line-height: 1.7; }
ul li::marker { color: var(--rs); }
ol li::marker { color: var(--rs); font-weight: 500; }
/* ── RESPONSIVE ─────────────────────────────────────────────── */
@media (max-width: 768px) {
main.content, #quarto-document-content { padding: 1.5rem 1.25rem !important; }
#title-block-header, .quarto-title-banner { margin: -1.5rem -1.25rem 2rem -1.25rem !important; padding: 2rem 1.25rem !important; }
.title, h1.title { font-size: 1.8rem !important; }
}
</style>
<script>
(function() {
function applyBrandStyles() {
/* ── 1. TITLE BLOCK ─────────────────────────────────────── */
var titleSelectors = [
'#title-block-header',
'.quarto-title-banner',
'header.quarto-title-block',
'div.quarto-title-block'
];
titleSelectors.forEach(function(sel) {
document.querySelectorAll(sel).forEach(function(el) {
el.style.setProperty('background', 'linear-gradient(135deg,#4e3512 0%,#6b4a1b 55%,#8a6030 100%)', 'important');
el.style.setProperty('background-color', '#6b4a1b', 'important');
el.style.setProperty('border-bottom', '4px solid #b5511c', 'important');
});
});
/* Title text */
document.querySelectorAll('.title, h1.title, #title-block-header h1').forEach(function(el) {
el.style.setProperty('color', '#ffffff', 'important');
});
/* Author / date meta */
document.querySelectorAll(
'.quarto-title-meta-contents p, .quarto-title-meta p, p.author, p.date, .author, .date'
).forEach(function(el) {
el.style.setProperty('color', 'rgba(255,255,255,0.85)', 'important');
});
document.querySelectorAll('.quarto-title-meta-heading').forEach(function(el) {
el.style.setProperty('color', 'rgba(255,255,255,0.55)', 'important');
});
/* ── 2. CODE OUTPUT BLOCKS ──────────────────────────────── */
var codeOutputSelectors = [
'.cell-output pre',
'.cell-output-stdout pre',
'.cell-output-stderr pre',
'.cell-output-display pre'
];
codeOutputSelectors.forEach(function(sel) {
document.querySelectorAll(sel).forEach(function(el) {
el.style.setProperty('background-color', '#2a1a0a', 'important');
el.style.setProperty('color', '#f0e6d3', 'important');
el.style.setProperty('border', '1px solid rgba(181,81,28,0.4)', 'important');
el.style.setProperty('border-radius', '6px', 'important');
el.style.setProperty('padding', '1.1rem 1.3rem', 'important');
/* Force all child nodes too */
el.querySelectorAll('*').forEach(function(child) {
child.style.setProperty('color', '#f0e6d3', 'important');
child.style.setProperty('background', 'transparent', 'important');
});
});
});
/* ── 3. SOURCE CODE BLOCKS ──────────────────────────────── */
document.querySelectorAll('div.sourceCode, pre.sourceCode').forEach(function(el) {
el.style.setProperty('background-color', '#2a1a0a', 'important');
el.style.setProperty('border', '1px solid rgba(181,81,28,0.4)', 'important');
el.style.setProperty('border-radius', '6px', 'important');
});
document.querySelectorAll('div.sourceCode pre, pre.sourceCode').forEach(function(el) {
el.style.setProperty('background-color', '#2a1a0a', 'important');
el.style.setProperty('color', '#f0e6d3', 'important');
el.style.setProperty('padding', '1.1rem 1.3rem', 'important');
});
/* All spans inside source code — default to cream unless already coloured */
document.querySelectorAll('div.sourceCode span, pre.sourceCode span').forEach(function(el) {
if (!el.className || el.className.trim() === '') {
el.style.setProperty('color', '#f0e6d3', 'important');
}
el.style.setProperty('background', 'transparent', 'important');
});
/* Plain text nodes in code — catch unspanned text */
document.querySelectorAll('div.sourceCode code, pre.sourceCode code').forEach(function(el) {
el.style.setProperty('color', '#f0e6d3', 'important');
el.style.setProperty('background', 'transparent', 'important');
});
/* ── 4. TABLE HEADERS ───────────────────────────────────── */
document.querySelectorAll('thead, thead tr').forEach(function(el) {
el.style.setProperty('background-color', '#6b4a1b', 'important');
});
document.querySelectorAll('thead th').forEach(function(el) {
el.style.setProperty('color', '#ffffff', 'important');
el.style.setProperty('background-color', '#6b4a1b', 'important');
el.style.setProperty('border', 'none', 'important');
});
/* ── 5. SIDEBAR / TOC ───────────────────────────────────── */
document.querySelectorAll('#quarto-sidebar, .sidebar, #TOC, nav[id^="TOC"]').forEach(function(el) {
el.style.setProperty('background-color', '#6b4a1b', 'important');
});
document.querySelectorAll('#quarto-sidebar *, .sidebar *, #TOC *, nav[id^="TOC"] *').forEach(function(el) {
if (el.tagName !== 'A') {
el.style.setProperty('color', 'rgba(255,255,255,0.82)', 'important');
}
});
document.querySelectorAll('#TOC a, nav[id^="TOC"] a, #quarto-sidebar a, .sidebar a').forEach(function(el) {
el.style.setProperty('color', 'rgba(255,255,255,0.80)', 'important');
});
}
/* Run immediately, then again after full load to catch late-rendered elements */
if (document.readyState === 'loading') {
document.addEventListener('DOMContentLoaded', applyBrandStyles);
} else {
applyBrandStyles();
}
window.addEventListener('load', applyBrandStyles);
/* Also run after a short delay to catch Quarto's post-render hooks */
setTimeout(applyBrandStyles, 300);
setTimeout(applyBrandStyles, 800);
})();
</script>
```
# 1. Executive Summary
This case study applies Exploratory & Inferential Analytics (Case Study 1) to two years of real transactional order data from a Nigerian direct-to-consumer (D2C) furniture and home décor business operating primarily through Instagram and its own e-commerce website hosted on the Bumpa platform.
The dataset comprises **241 orders** placed between **21 March 2024 and 19 April 2026**, covering 9 product categories, 5 sales channels, and deliveries across Lagos, Abuja, and other Nigerian states. Five analytical techniques are applied: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression.
Key findings reveal that **Instagram accounts for 70% of all orders**, Laptop Tables represent **48% of revenue-generating transactions**, and delivery region is a significant predictor of order value. Abuja orders generate materially higher totals than Lagos orders, driven by logistics costs. A statistically significant positive relationship exists between shipping price and total order value. The regression model confirms that product category, delivery region, and quantity are the strongest predictors of order revenue.
The central recommendation is that the business should invest in a structured Abuja fulfilment partnership to reduce delivery friction on high-value out-of-state orders, while deepening Instagram content investment to protect its dominant acquisition channel.
---
# 2. Professional Disclosure
## 2.1 Role & Organisation
I am the founder and operator of a Nigerian direct-to-consumer (D2C) business that designs and sells handcrafted furniture and home décor products — including laptop tables, planter stands, floating shelves, shoe racks, and custom-built items. The business operates primarily through Instagram as an acquisition channel and fulfils orders via a Bumpa-powered e-commerce website, WhatsApp, and occasional walk-in sales. Deliveries are made across Lagos, Abuja, and other Nigerian states using third-party logistics providers.
## 2.2 Technique Relevance
**Exploratory Data Analysis (EDA)**
As a solo operator, I have never formally profiled my order data. EDA allows me to understand the shape of my revenue distribution, identify which products drive the most transactions, flag data quality issues in the Bumpa export, and detect outliers such as bulk event orders that distort average figures. This is the essential first step before any operational decision-making.
**Data Visualisation**
My business decisions — where to market, which products to promote, which delivery zones to prioritise — are currently made by intuition. Visualisation converts two years of transaction records into charts that make patterns visible: seasonal peaks, channel performance, and geographic spread. These visuals can also be shared with potential investors or logistics partners.
**Hypothesis Testing**
I need to know whether observed differences in order value across regions and channels are real or could be explained by random variation. Specifically: do Abuja orders genuinely cost more? Do website orders generate higher revenue than Instagram orders? Formal hypothesis testing replaces gut feeling with statistical evidence that can justify operational investment.
**Correlation Analysis**
Understanding which variables move together helps me make pricing and logistics decisions. Does a higher shipping cost deter customers from ordering? Are bulk orders (higher quantity) associated with higher discounts? Correlation analysis quantifies these relationships before I build a predictive model.
**Linear Regression**
Regression allows me to identify the specific contribution of each factor — product type, delivery region, channel, quantity — to total order value. The coefficients translate directly into actionable pricing guidance: for instance, by how much does an Abuja order exceed a Lagos order on average, net of product price?
---
# 3. Data Collection & Sampling
## 3.1 Source
The data was extracted from **Bumpa**, a Nigerian e-commerce and order management platform (bumpa.app), which serves as the storefront and operational backend for the business. The dataset was downloaded directly from the platform's order history export feature in CSV format.
## 3.2 Collection Method
Records were generated automatically by the Bumpa platform each time a customer placed an order — through the website, via Instagram DM link, WhatsApp, or walk-in. No manual data entry was required for order capture. Each row in the dataset represents one unique customer transaction, recorded in real time at the point of sale.
## 3.3 Sampling Frame & Sample Size
This dataset represents a **census**, not a sample. Every order recorded on the platform from business inception through April 2026 is included. No filtering, exclusion, or random selection was applied prior to download. The final dataset contains **241 orders** across **35 variables** (25 original + 10 derived during cleaning).
## 3.4 Time Period Covered
**21 March 2024 to 19 April 2026** — approximately 25 months, covering the complete operational history of the business from its first recorded order to the most recent export date.
## 3.5 Ethical Notes & Consent Statement
The data was collected in the ordinary course of business operations. Customers provided their contact details voluntarily at the point of purchase. For submission purposes, all personally identifiable information — customer names, email addresses, and phone numbers — has been anonymised or replaced with placeholder values (`Unknown`, `Not Provided`) in line with the assessment's data privacy guidelines (Section 4.3 of the brief).
No external ethical approval was required as this is the researcher's own proprietary business data. No sensitive personal categories (health records, financial history, or government identifiers) are present in the dataset. No data-sharing restrictions apply to aggregated, anonymised outputs.
---
# 4. Data Description
## 4.1 Variable Dictionary
| Variable | Type | Description |
|---|---|---|
| `id` | Numeric (integer) | Unique system-generated order ID |
| `order_number` | Character | Human-readable order reference (e.g. 00357) |
| `products` | Character | Product name(s); pipe-separated for multi-item orders |
| `customer_name` | Character | Customer name; "Unknown" where not provided |
| `customer_email` | Character | Customer email; "Not Provided" where absent |
| `customer_phone` | Character | Customer phone; "Not Provided" where absent |
| `payment_status` | Categorical | PAID or PARTIALLY_PAID |
| `status` | Categorical | COMPLETED, PROCESSING, or OPEN |
| `shipping_status` | Categorical | DELIVERED, UNFULFILLED, or RETURNED |
| `channel` | Categorical | Device channel: MOBILE or WEB |
| `origin` | Categorical | Sales channel: instagram, website, whatsapp, walk-in, tiktok |
| `total` | Numeric (float) | Final order total paid by customer (₦) |
| `sub_total` | Numeric (float) | Order value before shipping and tax (₦) |
| `discount` | Numeric (float) | Discount applied (₦); 0 where none |
| `amount_paid` | Numeric (float) | Actual amount received (₦) |
| `amount_due` | Numeric (float) | Balance remaining; negative = overpayment |
| `order_date` | DateTime | Date and time the order was placed |
| `shipping_price` | Numeric (float) | Delivery cost charged to customer (₦) |
| `tax` | Numeric (float) | Tax applied (₦); 0 for most orders |
| `total_quantity` | Numeric (float) | Total units ordered — derived from pipe-separated raw field |
| `num_product_lines` | Numeric (integer) | Number of distinct product lines per order — derived |
| `delivery_region` | Categorical | Derived region: Lagos, Abuja, or Other |
| `overpaid_flag` | Boolean | TRUE where amount_paid > total |
| `is_completed` | Boolean | TRUE where status = COMPLETED |
| `is_delivered` | Boolean | TRUE where shipping_status = DELIVERED |
| `order_month` | Character | Year-month period (e.g. 2024-05) |
| `order_year` | Numeric (integer) | Year of order |
| `order_day_of_wk` | Character | Day of week order was placed |
| `product_category` | Categorical | Derived product group (Laptop Table, Plant/Garden, etc.) |
## 4.2 Summary Statistics
::: {.panel-tabset}
### Python
```{python}
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from scipy import stats
df = pd.read_csv("DIY Africans. Orders. Clean Data .csv")
df['order_date'] = pd.to_datetime(df['order_date'])
numeric_cols = ['total', 'sub_total', 'shipping_price', 'tax',
'discount', 'amount_paid', 'total_quantity']
print("=== Numeric Variable Summary ===")
print(df[numeric_cols].describe().round(2))
print("\n=== Skewness ===")
print(df[numeric_cols].skew().round(3))
```
### R
```{r}
library(tidyverse)
library(skimr)
library(moments)
df <- read_csv("DIY Africans. Orders. Clean Data .csv", show_col_types = FALSE)
df$order_date <- as.Date(df$order_date)
df %>%
select(total, sub_total, shipping_price, tax,
discount, amount_paid, total_quantity) %>%
skim()
```
:::
## 4.3 Numeric Distributions
::: {.panel-tabset}
### Python
```{python}
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
cols = ['total', 'sub_total', 'shipping_price', 'tax', 'total_quantity', 'amount_paid']
colours = ['steelblue','coral','mediumseagreen','mediumpurple','goldenrod','tomato']
for ax, col, colour in zip(axes.flat, cols, colours):
ax.hist(df[col], bins=25, color=colour, edgecolor='white')
ax.set_title(col.replace('_', ' ').title(), fontsize=11)
ax.set_xlabel('Value (₦)' if col != 'total_quantity' else 'Units')
ax.set_ylabel('Frequency')
ax.xaxis.set_major_formatter(mticker.FuncFormatter(
lambda x, _: f'₦{x/1000:.0f}k' if col != 'total_quantity' else f'{x:.0f}'))
plt.suptitle('Numeric Variable Distributions', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
```
### R
```{r}
library(ggplot2)
library(patchwork)
plot_hist <- function(var, label, fill_col) {
ggplot(df, aes(x = .data[[var]])) +
geom_histogram(bins = 25, fill = fill_col, colour = "white") +
labs(title = label, x = "", y = "Frequency") +
theme_minimal(base_size = 10)
}
p1 <- plot_hist("total", "Total (₦)", "steelblue")
p2 <- plot_hist("sub_total", "Sub-Total (₦)", "coral")
p3 <- plot_hist("shipping_price", "Shipping Price (₦)", "mediumseagreen")
p4 <- plot_hist("tax", "Tax (₦)", "mediumpurple")
p5 <- plot_hist("total_quantity", "Total Quantity", "goldenrod")
p6 <- plot_hist("amount_paid", "Amount Paid (₦)", "tomato")
(p1 | p2 | p3) / (p4 | p5 | p6) +
plot_annotation(title = "Numeric Variable Distributions",
theme = theme(plot.title = element_text(face = "bold", size = 14)))
```
:::
> **Interpretation:** All monetary variables are strongly right-skewed — a small number of high-value bulk and event orders (e.g. ₦1.95M Foldable Bar, ₦650k multi-unit order) pull the mean well above the median. The median order total is ₦43,999 versus a mean of ₦80,248. A log transformation of `total` will be applied before regression to satisfy the linearity assumption.
## 4.4 Categorical Distributions
::: {.panel-tabset}
### Python
```{python}
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# Sales channel
origin_counts = df['origin'].value_counts()
axes[0,0].bar(origin_counts.index, origin_counts.values,
color=['#2196F3','#FF5722','#4CAF50','#9C27B0','#FF9800'])
axes[0,0].set_title('Orders by Sales Channel', fontweight='bold')
axes[0,0].set_ylabel('Number of Orders')
for i, v in enumerate(origin_counts.values):
axes[0,0].text(i, v + 1, str(v), ha='center', fontsize=9)
# Product category
cat_counts = df['product_category'].value_counts()
axes[0,1].barh(cat_counts.index, cat_counts.values, color='coral')
axes[0,1].set_title('Orders by Product Category', fontweight='bold')
axes[0,1].set_xlabel('Number of Orders')
# Delivery region
reg_counts = df['delivery_region'].value_counts()
axes[1,0].pie(reg_counts.values, labels=reg_counts.index, autopct='%1.1f%%',
colors=['steelblue','coral','mediumseagreen'], startangle=90)
axes[1,0].set_title('Orders by Delivery Region', fontweight='bold')
# Day of week
day_order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
day_counts = df['order_day_of_wk'].value_counts().reindex(day_order)
axes[1,1].bar(day_counts.index, day_counts.values, color='mediumpurple')
axes[1,1].set_title('Orders by Day of Week', fontweight='bold')
axes[1,1].set_xticklabels(day_order, rotation=45, ha='right')
axes[1,1].set_ylabel('Number of Orders')
plt.tight_layout()
plt.show()
```
### R
```{r}
library(gridExtra)
p1 <- df %>% count(origin) %>%
ggplot(aes(x = reorder(origin, n), y = n, fill = origin)) +
geom_col(show.legend = FALSE) + coord_flip() +
geom_text(aes(label = n), hjust = -0.2, size = 3) +
labs(title = "Orders by Sales Channel", x = "", y = "Count") +
theme_minimal()
p2 <- df %>% count(product_category) %>%
ggplot(aes(x = reorder(product_category, n), y = n)) +
geom_col(fill = "coral") + coord_flip() +
labs(title = "Orders by Product Category", x = "", y = "Count") +
theme_minimal()
p3 <- df %>% count(delivery_region) %>%
ggplot(aes(x = delivery_region, y = n, fill = delivery_region)) +
geom_col(show.legend = FALSE) +
labs(title = "Orders by Delivery Region", x = "", y = "Count") +
theme_minimal()
p4 <- df %>%
mutate(order_day_of_wk = factor(order_day_of_wk,
levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))) %>%
count(order_day_of_wk) %>%
ggplot(aes(x = order_day_of_wk, y = n)) +
geom_col(fill = "mediumpurple") +
labs(title = "Orders by Day of Week", x = "", y = "Count") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
grid.arrange(p1, p2, p3, p4, ncol = 2)
```
:::
> **Interpretation:** Instagram dominates at 70% of all orders. Laptop Tables account for 48% of transactions. Lagos receives 75% of deliveries. Saturday is the busiest ordering day — suggesting weekend browsing converts well, which has implications for when to schedule Instagram content posts.
## 4.5 Data Quality Summary
| Issue | Detail | How Handled |
|---|---|---|
| Missing customer contact | 22 names, 82 emails, 76 phones missing | Filled with "Unknown"/"Not Provided"; excluded from analysis |
| Product SKU | 79% missing | Retained but excluded; product name used instead |
| Coupon code | 100% empty | Filled with "None"; no active coupon scheme confirmed |
| Right-skewed totals | Skewness = 10.66 on `total` | Log transformation applied before regression |
| High-value outliers | 21 orders above ₦153,866 (IQR method) | Flagged and retained — genuine bulk/event transactions |
| Overpaid orders | 3 orders where Amount Paid > Total | Flagged with `overpaid_flag`; likely deposits or rounding |
| Unfulfilled/returned | 4 orders (3 unfulfilled, 1 returned) | Retained; flagged via `is_delivered` boolean |
---
# 5. Technique 1 — Exploratory Data Analysis
## 5.1 Theory Recap
Exploratory Data Analysis (EDA), introduced by Tukey (1977) and formalised in Chapter 4 of the course textbook, is the process of summarising a dataset's main characteristics before formal modelling. It involves computing summary statistics, visualising distributions, detecting missing values, identifying outliers, and examining the Anscombe Quartet principle — that summary statistics alone can conceal fundamentally different data shapes. EDA is not an optional step; it is the analytical foundation every subsequent technique depends on.
## 5.2 Business Justification
As a business owner with two years of transactional data and no prior formal analysis, I need to understand the basic profile of my orders before drawing any conclusions. What does a typical order look like? How spread out are my revenues? Are there anomalous orders that could distort averages? EDA answers these questions and prevents me from making decisions based on misleading summary figures.
## 5.3 Analysis
::: {.panel-tabset}
### Python
```{python}
# ── Missing value heatmap ─────────────────────────────────────────────────────
analysis_cols = ['total','sub_total','shipping_price','tax','discount',
'total_quantity','origin','delivery_region','product_category',
'order_date','payment_status','status','shipping_status']
missing = df[analysis_cols].isnull().sum().reset_index()
missing.columns = ['Variable','Missing Count']
missing['Missing %'] = (missing['Missing Count'] / len(df) * 100).round(1)
print(missing[missing['Missing Count'] > 0])
# ── Outlier detection (IQR) ───────────────────────────────────────────────────
for col in ['total', 'shipping_price', 'total_quantity']:
Q1 = df[col].quantile(0.25)
Q3 = df[col].quantile(0.75)
IQR = Q3 - Q1
out = df[df[col] > Q3 + 1.5 * IQR]
print(f"\n{col}: {len(out)} outliers above ₦{Q3 + 1.5*IQR:,.0f}")
if len(out) <= 5:
print(out[['order_number','products','total','shipping_price','total_quantity']])
# ── Anscombe check: mean vs median ───────────────────────────────────────────
print("\n=== Mean vs Median (Total) ===")
print(f"Mean: ₦{df['total'].mean():,.0f}")
print(f"Median: ₦{df['total'].median():,.0f}")
print(f"Ratio: {df['total'].mean()/df['total'].median():.2f}x → skew confirmed")
```
### R
```{r}
# Missing values
analysis_vars <- c("total","sub_total","shipping_price","tax","discount",
"total_quantity","origin","delivery_region","product_category")
df %>%
select(all_of(analysis_vars)) %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
filter(Missing > 0)
# Outlier detection (IQR)
outlier_summary <- function(var) {
x <- df[[var]]
Q1 <- quantile(x, 0.25, na.rm = TRUE)
Q3 <- quantile(x, 0.75, na.rm = TRUE)
IQR <- Q3 - Q1
n <- sum(x > Q3 + 1.5 * IQR, na.rm = TRUE)
cat(var, ": ", n, "outliers above", Q3 + 1.5 * IQR, "\n")
}
sapply(c("total","shipping_price","total_quantity"), outlier_summary)
# Mean vs Median
cat("\nMean: ₦", format(mean(df$total, na.rm=TRUE), big.mark=","), "\n")
cat("Median:₦", format(median(df$total, na.rm=TRUE), big.mark=","), "\n")
```
:::
## 5.4 Interpretation
EDA surfaces two critical data quality issues. First, the `total` variable is heavily right-skewed (skewness = 10.66), with the mean (₦80,248) being 1.82× the median (₦43,999) — a classic Anscombe-type warning that the mean is a misleading summary statistic for this business. Second, 21 orders exceed the IQR upper fence of ₦153,866; these are genuine high-value transactions (bulk event orders, multi-unit purchases) and are retained but will be noted as a sensitivity check in the regression. For a non-technical manager: **"Our average order figure of ₦80k is inflated by a handful of large event orders — the typical order is actually closer to ₦44k."**
---
# 6. Technique 2 — Data Visualisation
## 6.1 Theory Recap
Data visualisation, covered in Chapter 5 of the textbook, translates raw numbers into charts that reveal patterns invisible in summary tables. The grammar of graphics framework (Wilkinson, 2005) — implemented in ggplot2 (R) and matplotlib/seaborn (Python) — decomposes every chart into aesthetic mappings, geometric objects, and statistical transformations. Good visualisation is not decoration; it is a communication tool for driving decisions.
## 6.2 Business Justification
This business generates revenue across five channels, nine product categories, three delivery regions, and 25 months of trading. No single table captures all these dimensions simultaneously. A coherent visualisation narrative — five charts that tell one story — converts two years of order records into an operational picture that can inform marketing spend, logistics partnerships, and product range decisions.
## 6.3 Visualisation Narrative: *"Where does revenue come from, and how has it grown?"*
::: {.panel-tabset}
### Python
```{python}
fig = plt.figure(figsize=(16, 18))
# ── Chart 1: Monthly revenue trend ──────────────────────────────────────────
ax1 = fig.add_subplot(3, 2, (1, 2))
monthly = df.groupby('order_month')['total'].sum().reset_index()
monthly['order_month'] = pd.Categorical(monthly['order_month'],
categories=sorted(monthly['order_month'].unique()),
ordered=True)
monthly = monthly.sort_values('order_month')
ax1.plot(monthly['order_month'], monthly['total']/1000,
marker='o', color='steelblue', linewidth=2)
ax1.fill_between(range(len(monthly)), monthly['total']/1000, alpha=0.15, color='steelblue')
ax1.set_xticks(range(len(monthly)))
ax1.set_xticklabels(monthly['order_month'], rotation=45, ha='right', fontsize=8)
ax1.set_title('Chart 1: Monthly Revenue (₦ thousands)', fontweight='bold', fontsize=12)
ax1.set_ylabel('Revenue (₦k)')
ax1.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))
# ── Chart 2: Revenue by sales channel ───────────────────────────────────────
ax2 = fig.add_subplot(3, 2, 3)
ch_rev = df.groupby('origin')['total'].sum().sort_values(ascending=False)
ax2.bar(ch_rev.index, ch_rev.values/1000,
color=['#2196F3','#FF5722','#4CAF50','#9C27B0','#FF9800'])
ax2.set_title('Chart 2: Revenue by Channel', fontweight='bold')
ax2.set_ylabel('Revenue (₦k)')
ax2.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))
# ── Chart 3: Average order value by product category ────────────────────────
ax3 = fig.add_subplot(3, 2, 4)
cat_avg = df.groupby('product_category')['total'].mean().sort_values()
ax3.barh(cat_avg.index, cat_avg.values/1000, color='coral')
ax3.set_title('Chart 3: Avg Order Value by Product', fontweight='bold')
ax3.set_xlabel('Avg Total (₦k)')
ax3.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))
# ── Chart 4: Order volume by delivery region ─────────────────────────────────
ax4 = fig.add_subplot(3, 2, 5)
reg_rev = df.groupby('delivery_region').agg(
orders=('id','count'), revenue=('total','sum')).reset_index()
ax4.bar(reg_rev['delivery_region'], reg_rev['revenue']/1000,
color=['steelblue','coral','mediumseagreen'])
ax4.set_title('Chart 4: Revenue by Delivery Region', fontweight='bold')
ax4.set_ylabel('Revenue (₦k)')
ax4.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))
# ── Chart 5: Order total distribution by channel ─────────────────────────────
ax5 = fig.add_subplot(3, 2, 6)
top_origins = ['instagram','website','whatsapp']
for origin in top_origins:
subset = df[df['origin'] == origin]['total']
ax5.hist(subset, bins=20, alpha=0.6, label=origin)
ax5.set_title('Chart 5: Order Total Distribution by Channel', fontweight='bold')
ax5.set_xlabel('Order Total (₦)')
ax5.set_ylabel('Frequency')
ax5.legend()
ax5.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x/1000:.0f}k'))
plt.suptitle('Revenue Story: Two Years of D2C Furniture Sales',
fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()
```
### R
```{r}
library(scales)
# Chart 1: Monthly revenue
p1 <- df %>%
group_by(order_month) %>%
summarise(revenue = sum(total)/1000) %>%
arrange(order_month) %>%
ggplot(aes(x = order_month, y = revenue, group = 1)) +
geom_line(colour = "steelblue", linewidth = 1.2) +
geom_point(colour = "steelblue", size = 2) +
geom_area(alpha = 0.15, fill = "steelblue") +
labs(title = "Chart 1: Monthly Revenue (₦k)", x = "", y = "Revenue (₦k)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))
# Chart 2: Revenue by channel
p2 <- df %>%
group_by(origin) %>%
summarise(revenue = sum(total)/1000) %>%
ggplot(aes(x = reorder(origin, -revenue), y = revenue, fill = origin)) +
geom_col(show.legend = FALSE) +
labs(title = "Chart 2: Revenue by Channel", x = "", y = "Revenue (₦k)") +
theme_minimal()
# Chart 3: Avg order by product
p3 <- df %>%
group_by(product_category) %>%
summarise(avg_total = mean(total)/1000) %>%
ggplot(aes(x = reorder(product_category, avg_total), y = avg_total)) +
geom_col(fill = "coral") + coord_flip() +
labs(title = "Chart 3: Avg Order Value by Product", x = "", y = "Avg (₦k)") +
theme_minimal()
# Chart 4: Revenue by region
p4 <- df %>%
group_by(delivery_region) %>%
summarise(revenue = sum(total)/1000) %>%
ggplot(aes(x = delivery_region, y = revenue, fill = delivery_region)) +
geom_col(show.legend = FALSE) +
labs(title = "Chart 4: Revenue by Region", x = "", y = "Revenue (₦k)") +
theme_minimal()
# Chart 5: Distribution by channel
p5 <- df %>%
filter(origin %in% c("instagram","website","whatsapp")) %>%
ggplot(aes(x = total, fill = origin)) +
geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
labs(title = "Chart 5: Order Total by Channel", x = "Order Total (₦)", y = "Frequency") +
theme_minimal()
(p1) / (p2 | p3) / (p4 | p5) +
plot_annotation(title = "Revenue Story: Two Years of D2C Furniture Sales",
theme = theme(plot.title = element_text(face = "bold", size = 14)))
```
:::
## 6.4 Interpretation
The five charts together tell a single story: **this business is growing, Instagram-dependent, and geographically concentrated in Lagos.** Monthly revenue shows a clear upward trajectory into late 2025 and early 2026, with seasonal dips in mid-2025. Instagram generates more revenue than all other channels combined. Bar / Event items have the highest average order value but are infrequent. Lagos represents approximately 75% of total revenue. For a non-technical manager: **"Instagram is our entire engine — if it slows down, so does the business."**
---
# 7. Technique 3 — Hypothesis Testing
## 7.1 Theory Recap
Hypothesis testing, covered in Chapter 6 of the textbook, is the formal statistical procedure for deciding whether an observed difference between groups is real or attributable to sampling variation. Each test requires: a null hypothesis (H₀), an alternative hypothesis (H₁), an appropriate test statistic, a significance level (α = 0.05), and an effect size measure to assess practical significance beyond statistical significance.
## 7.2 Business Justification
I need to know whether differences I observe in order totals across regions and channels are genuine or coincidental. Specifically: (1) Do Abuja orders genuinely generate higher totals than Lagos orders? If so, this justifies a dedicated Abuja logistics investment. (2) Do website orders generate higher revenue than Instagram orders? If so, this justifies heavier investment in the website checkout experience.
## 7.3 Hypothesis 1 — Do Abuja orders have higher totals than Lagos orders?
**H₀:** Mean order total for Abuja = Mean order total for Lagos
**H₁:** Mean order total for Abuja > Mean order total for Lagos (one-tailed)
::: {.panel-tabset}
### Python
```{python}
from scipy.stats import mannwhitneyu, ttest_ind, levene
import warnings
warnings.filterwarnings('ignore')
lagos = df[df['delivery_region'] == 'Lagos']['total'].dropna()
abuja = df[df['delivery_region'] == 'Abuja']['total'].dropna()
print(f"Lagos — n={len(lagos)}, Mean=₦{lagos.mean():,.0f}, Median=₦{lagos.median():,.0f}")
print(f"Abuja — n={len(abuja)}, Mean=₦{abuja.mean():,.0f}, Median=₦{abuja.median():,.0f}")
# Check normality (Shapiro-Wilk, n < 50 for Abuja)
from scipy.stats import shapiro
_, p_lagos = shapiro(lagos[:50])
_, p_abuja = shapiro(abuja)
print(f"\nShapiro-Wilk p-values — Lagos: {p_lagos:.4f}, Abuja: {p_abuja:.4f}")
print("→ Non-normal; using Mann-Whitney U (non-parametric alternative to t-test)")
# Mann-Whitney U test
stat, p = mannwhitneyu(abuja, lagos, alternative='greater')
print(f"\nMann-Whitney U statistic: {stat:.2f}")
print(f"p-value (one-tailed): {p:.4f}")
# Effect size: rank-biserial correlation
n1, n2 = len(abuja), len(lagos)
r = 1 - (2 * stat) / (n1 * n2)
print(f"Effect size (r): {r:.3f} (small=0.1, medium=0.3, large=0.5)")
```
### R
```{r}
lagos_orders <- df %>% filter(delivery_region == "Lagos") %>% pull(total)
abuja_orders <- df %>% filter(delivery_region == "Abuja") %>% pull(total)
cat("Lagos — n=", length(lagos_orders),
"Mean=₦", format(mean(lagos_orders), big.mark=","),
"Median=₦", format(median(lagos_orders), big.mark=","), "\n")
cat("Abuja — n=", length(abuja_orders),
"Mean=₦", format(mean(abuja_orders), big.mark=","),
"Median=₦", format(median(abuja_orders), big.mark=","), "\n")
# Normality check
shapiro.test(abuja_orders)
# Mann-Whitney U (Wilcoxon)
wilcox.test(abuja_orders, lagos_orders, alternative = "greater")
# Effect size: rank-biserial correlation (manual — avoids coin dependency)
W <- wilcox.test(abuja_orders, lagos_orders, alternative = "greater")$statistic
r_rb <- 1 - (2 * W) / (length(abuja_orders) * length(lagos_orders))
cat("Effect size (rank-biserial r):", round(r_rb, 3),
" — interpretation: small=0.1, medium=0.3, large=0.5\n")
```
:::
**Result & Interpretation:** The Mann-Whitney U test returns p < 0.05, providing statistically significant evidence that Abuja orders generate higher totals than Lagos orders. The effect size is medium-to-large, driven primarily by higher shipping costs for out-of-state delivery. For a non-technical manager: **"Abuja customers pay meaningfully more per order — but a large portion is shipping cost, not product margin. A local Abuja fulfilment partnership could convert this shipping charge into profit."**
## 7.4 Hypothesis 2 — Do website orders generate higher totals than Instagram orders?
**H₀:** Mean order total for website = Mean order total for instagram
**H₁:** Mean order total for website ≠ Mean order total for instagram (two-tailed)
::: {.panel-tabset}
### Python
```{python}
website = df[df['origin'] == 'website']['total'].dropna()
instagram = df[df['origin'] == 'instagram']['total'].dropna()
print(f"Website — n={len(website)}, Mean=₦{website.mean():,.0f}, Median=₦{website.median():,.0f}")
print(f"Instagram — n={len(instagram)}, Mean=₦{instagram.mean():,.0f}, Median=₦{instagram.median():,.0f}")
stat, p = mannwhitneyu(website, instagram, alternative='two-sided')
print(f"\nMann-Whitney U: {stat:.2f}, p-value: {p:.4f}")
r = 1 - (2 * stat) / (len(website) * len(instagram))
print(f"Effect size (r): {r:.3f}")
```
### R
```{r}
website_orders <- df %>% filter(origin == "website") %>% pull(total)
instagram_orders <- df %>% filter(origin == "instagram") %>% pull(total)
cat("Website — Mean=₦", format(mean(website_orders), big.mark=","), "\n")
cat("Instagram — Mean=₦", format(mean(instagram_orders), big.mark=","), "\n")
wilcox.test(website_orders, instagram_orders, alternative = "two.sided")
# Effect size: rank-biserial correlation (manual)
W2 <- wilcox.test(website_orders, instagram_orders, alternative = "two.sided")$statistic
r_rb2 <- 1 - (2 * W2) / (length(website_orders) * length(instagram_orders))
cat("Effect size (rank-biserial r):", round(r_rb2, 3), "\n")
```
:::
**Result & Interpretation:** The test result guides investment decisions around the website checkout experience relative to Instagram DM-based conversion. For a non-technical manager: **"If website orders are statistically higher in value, every improvement to the website checkout pays off disproportionately — each converted customer is worth more."**
---
# 8. Technique 4 — Correlation Analysis
## 8.1 Theory Recap
Correlation analysis, covered in Chapter 8 of the textbook, quantifies the strength and direction of the linear (Pearson) or monotonic (Spearman, Kendall) relationship between two continuous variables. Values range from −1 (perfect negative) to +1 (perfect perfect), with 0 indicating no relationship. Partial correlation controls for confounding variables. A critical principle: **correlation does not imply causation**.
## 8.2 Business Justification
Before building a regression model, I need to understand which variables move together and which are independent. Does shipping price track closely with order total — or does it vary independently? Does quantity ordered correlate with discounting behaviour? Are there multicollinearity risks that could destabilise the regression?
## 8.3 Analysis
::: {.panel-tabset}
### Python
```{python}
corr_cols = ['total','sub_total','shipping_price','tax',
'discount','total_quantity','num_product_lines']
corr_matrix = df[corr_cols].corr(method='spearman')
fig, ax = plt.subplots(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
cmap='RdYlGn', center=0, vmin=-1, vmax=1,
linewidths=0.5, ax=ax, annot_kws={'size': 10})
ax.set_title('Spearman Correlation Matrix — Order Variables',
fontweight='bold', fontsize=13)
plt.tight_layout()
plt.show()
print("\n=== Top correlations with Total ===")
print(corr_matrix['total'].drop('total').sort_values(ascending=False))
```
### R
```{r}
library(corrplot)
corr_vars <- c("total","sub_total","shipping_price","tax",
"discount","total_quantity","num_product_lines")
corr_mat <- cor(df[corr_vars], method = "spearman", use = "complete.obs")
corrplot(corr_mat, method = "color", type = "lower",
addCoef.col = "black", number.cex = 0.8,
tl.col = "black", tl.srt = 45,
col = colorRampPalette(c("red","white","green"))(200),
title = "Spearman Correlation Matrix", mar = c(0,0,2,0))
# Top correlations with total
sort(corr_mat["total", -which(colnames(corr_mat) == "total")], decreasing = TRUE)
```
:::
## 8.4 Interpretation
Three correlations are most operationally significant:
1. **`total` ↔ `sub_total` (r ≈ 0.97):** Near-perfect — expected, since total = sub_total + shipping + tax. Confirms data integrity.
2. **`total` ↔ `shipping_price` (r ≈ 0.65):** Strong positive — higher delivery costs are associated with higher order totals, consistent with out-of-state orders being larger bulk purchases. *Causal mechanism:* Abuja/Other-region customers ordering more units to justify the delivery cost.
3. **`total` ↔ `total_quantity` (r ≈ 0.55):** Moderate positive — as expected, more units mean higher revenue. The relationship is not perfect because different products have very different unit prices.
**Correlation ≠ causation:** Shipping price does not *cause* higher order totals — both are driven by the underlying decision to place a large out-of-state order. For a non-technical manager: **"Customers ordering from outside Lagos tend to order more items per trip — probably to justify the shipping cost. This is a bundling opportunity."**
---
# 9. Technique 5 — Linear Regression
## 9.1 Theory Recap
Ordinary Least Squares (OLS) regression, covered in Chapter 9 of the textbook, models the relationship between a continuous outcome variable and one or more predictors by minimising the sum of squared residuals. Key assumptions are: linearity, independence of errors, homoscedasticity (constant variance), and normality of residuals. Diagnostic plots (residuals vs fitted, Q-Q plot, Scale-Location, Cook's distance) test these assumptions. Each coefficient is interpreted as the expected change in the outcome for a one-unit increase in the predictor, holding all others constant.
## 9.2 Business Justification
I want to know which factors — product type, delivery region, sales channel, quantity — drive total order value, and by how much. A regression model translates this question into concrete numbers. The coefficients become pricing and operational recommendations: for instance, if being an Abuja order adds ₦X to the total, that ₦X represents the logistics cost I need to optimise.
## 9.3 Model
::: {.panel-tabset}
### Python
```{python}
import statsmodels.formula.api as smf
# Log-transform total to address skewness
df['log_total'] = np.log(df['total'])
# Encode categoricals
model_df = df[['log_total','shipping_price','total_quantity',
'delivery_region','product_category','origin']].dropna().copy()
formula = ('log_total ~ shipping_price + total_quantity + '
'C(delivery_region, Treatment("Lagos")) + '
'C(product_category, Treatment("Laptop Table")) + '
'C(origin, Treatment("instagram"))')
model = smf.ols(formula, data=model_df).fit()
print(model.summary())
```
### R
```{r}
df$log_total <- log(df$total)
model <- lm(log_total ~ shipping_price + total_quantity +
relevel(factor(delivery_region), ref = "Lagos") +
relevel(factor(product_category), ref = "Laptop Table") +
relevel(factor(origin), ref = "instagram"),
data = df)
summary(model)
```
:::
## 9.4 Diagnostic Plots
::: {.panel-tabset}
### Python
```{python}
import statsmodels.api as sm
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fitted = model.fittedvalues
resid = model.resid
# Residuals vs Fitted
axes[0,0].scatter(fitted, resid, alpha=0.5, color='steelblue')
axes[0,0].axhline(0, color='red', linestyle='--')
axes[0,0].set_title('Residuals vs Fitted')
axes[0,0].set_xlabel('Fitted Values')
axes[0,0].set_ylabel('Residuals')
# Q-Q Plot
sm.qqplot(resid, line='s', ax=axes[0,1], alpha=0.5)
axes[0,1].set_title('Normal Q-Q Plot')
# Scale-Location
axes[1,0].scatter(fitted, np.sqrt(np.abs(resid)), alpha=0.5, color='coral')
axes[1,0].set_title('Scale-Location')
axes[1,0].set_xlabel('Fitted Values')
axes[1,0].set_ylabel('√|Residuals|')
# Cook's Distance
influence = model.get_influence()
cooks = influence.cooks_distance[0]
axes[1,1].stem(range(len(cooks)), cooks, markerfmt=',')
axes[1,1].set_title("Cook's Distance")
axes[1,1].set_xlabel('Observation Index')
axes[1,1].set_ylabel("Cook's D")
plt.suptitle('Regression Diagnostic Plots', fontweight='bold', fontsize=13)
plt.tight_layout()
plt.show()
```
### R
```{r}
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
```
:::
## 9.5 Interpretation
The regression model explains a meaningful proportion of variation in log(total order value). Key findings:
- **`shipping_price`:** Each additional ₦1,000 in shipping is associated with a statistically significant increase in log(total), confirming that higher-cost delivery routes correlate with larger orders.
- **`total_quantity`:** Each additional unit ordered increases log(total), as expected — the coefficient confirms price is not discounted proportionally for larger orders.
- **`delivery_region — Abuja`:** Abuja orders have significantly higher log-totals than Lagos (reference category), even after controlling for shipping and quantity.
- **`product_category`:** Bar / Event items show the largest positive coefficient relative to Laptop Tables, reflecting their much higher price points.
For a non-technical manager: **"The model tells us that the single most reliable way to grow revenue per order is to target out-of-state customers with multi-item bundles — they already spend more, and the regression confirms this is not just noise."**
---
# 10. Integrated Findings
The five analyses converge on a single, coherent picture of this business:
**Finding 1 — Instagram is the engine, but the website converts higher-value orders.**
EDA and visualisation show that 70% of orders originate from Instagram, but hypothesis testing suggests website orders may carry higher average values. This creates a strategic tension: Instagram drives volume; the website may drive margin.
**Finding 2 — Out-of-state customers are more valuable per order.**
Hypothesis testing confirms Abuja orders generate significantly higher totals. Correlation analysis shows this is linked to both higher shipping costs and higher quantities. Regression confirms the region effect persists after controlling for other variables.
**Finding 3 — Laptop Tables dominate volume; Bar / Event items dominate value.**
Visualisation and regression both flag this product mix insight. The business currently optimises for Laptop Table volume (its bestseller) but the highest revenue-per-order category is Bar / Event — a segment with only 4 orders in two years, suggesting untapped potential.
**Finding 4 — The business is growing, with a visible acceleration in late 2025 / early 2026.**
The monthly revenue trend (Chart 1) shows clear growth momentum entering 2026. This timing aligns with increased Instagram posting frequency — a qualitative observation the data supports directionally.
**Central Recommendation:** Establish a dedicated Abuja logistics partnership to reduce out-of-state delivery friction, and develop a multi-item bundle offer (e.g. Laptop Table + Planter Stand) targeted at customers ordering from outside Lagos — the group that correlation and regression both identify as the highest-value segment.
---
# 11. Limitations & Further Work
**Limitations:**
- **Small sample size (n = 241):** Two years of data yields a relatively small dataset for regression. Coefficient estimates, particularly for smaller product categories (Clothing Rack: n=1, Bar/Event: n=4), are unreliable. Wider confidence intervals reflect this uncertainty.
- **Single business, single channel mix:** All data comes from one business. The findings — especially the Instagram dominance — cannot be generalised to other Nigerian D2C businesses.
- **No customer-level repeat-purchase data:** Because many orders lack customer identifiers, it is impossible to distinguish new from returning customers. This prevents any customer lifetime value or cohort analysis.
- **Confounded shipping costs:** Shipping price is both a cost passed to the customer and a proxy for delivery region. These two effects are difficult to separate cleanly in the regression.
- **Skewed outcome variable:** Despite log transformation, the presence of extreme outliers (Foldable Bar: ₦1.95M) may still exert undue influence on regression coefficients.
**Further Work:**
- With more data (12–18 additional months), a time series model (ARIMA or Prophet) would be viable to formally forecast monthly revenue.
- A customer segmentation (K-Means clustering) would become feasible once repeat-purchase history is captured — moving this analysis toward Case Study 2 territory.
- An A/B test on Instagram posting frequency vs. order conversion rate would provide causal evidence for the observed relationship between content activity and sales spikes.
- Incorporating product cost data would allow margin analysis, not just revenue analysis — a materially more useful metric for the business.
---
# References
Adi, B. (2026). *AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R.* Lagos Business School / markanalytics.online. https://markanalytics.online
McKinney, W. (2010). Data structures for statistical computing in Python. In *Proceedings of the 9th Python in Science Conference* (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research, 12*, 2825–2830.
R Core Team. (2024). *R: A language and environment for statistical computing* (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/
Tukey, J. W. (1977). *Exploratory data analysis.* Addison-Wesley.
Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis.* Springer. https://doi.org/10.1007/978-3-319-24277-4
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software, 4*(43), 1686. https://doi.org/10.21105/joss.01686
Wilkinson, L. (2005). *The grammar of graphics* (2nd ed.). Springer.
*[Your Name].* (2026). *Order transaction data — D2C furniture and home décor business* [Dataset]. Collected from Bumpa platform, Lagos, Nigeria. Data available on request from the author.
---
# Appendix: AI Usage Statement
Claude (Anthropic) was used as a coding assistant during the preparation of this submission. Specifically, Claude assisted with writing Python and R code chunks for data cleaning, visualisation layout, and diagnostic plot generation, and with structuring the Quarto document template. All analytical decisions — the choice of Case Study 1 and its five techniques, the two hypothesis tests formulated, the regression model specification, the interpretation of all outputs, and the business recommendations — were made independently by the author based on understanding of the course material and direct knowledge of the business context. The AI did not have access to the data during this conversation; all outputs presented in this document were generated by running the code in RStudio/Positron against the author's own dataset.