Sales Analytics for a Nigerian D2C Furniture & Home Good Company

Author

Joy Temilade Adewuyi

Published

May 12, 2026


1. Executive Summary

This case study applies Exploratory & Inferential Analytics (Case Study 1) to two years of real transactional order data from a Nigerian direct-to-consumer (D2C) furniture and home good company operating primarily through Instagram and its own e-commerce website hosted on the Bumpa platform.

The dataset comprises 241 orders placed between 21 March 2024 and 19 April 2026, covering 9 product categories, 5 sales channels, and deliveries across Lagos, Abuja, and other Nigerian states. Five analytical techniques are applied: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression.

Key findings reveal that Instagram accounts for 70% of all orders, Laptop Tables represent 48% of revenue-generating transactions, and delivery region is a significant predictor of order value. Abuja orders generate materially higher totals than Lagos orders, driven by logistics costs. A statistically significant positive relationship exists between shipping price and total order value. The regression model confirms that product category, delivery region, and quantity are the strongest predictors of order revenue.

The central recommendation is that the business should invest in a structured Abuja fulfilment partnership to reduce delivery friction on high-value out-of-state orders, while deepening Instagram content investment to protect its dominant acquisition channel.


2. Professional Disclosure

2.1 Role & Organisation

I am the founder and growth lead of a Nigerian direct-to-consumer (D2C) business that designs and sells locally manaufactured furniture and home good products — including laptop tables, planter stands, floating shelves, shoe racks, and custom-built items. The business operates primarily through Instagram as an acquisition channel and fulfils orders via a Bumpa-powered e-commerce website, WhatsApp, and occasional walk-in sales. Deliveries are made across Lagos, Abuja, and other Nigerian states using third-party logistics providers.

2.2 Technique Relevance

Exploratory Data Analysis (EDA) As a business owner, I have never formally profiled my order data. EDA allows me to understand the shape of my revenue distribution, identify which products drive the most transactions, flag data quality issues in the Bumpa export, and detect outliers. This is the essential step before any operational decision-making.

Data Visualisation My business decisions includes where to market, which products to promote, which delivery zones to prioritise are currently mostly made by intuition. Visualisation converts two years of transaction records into charts that make patterns visible: seasonal peaks, channel performance, and geographic spread. These visuals can also be shared with potential investors or logistics partners.

Hypothesis Testing I need to know whether observed differences in order value across regions and channels are real or could be explained by random variation. Specifically: do Abuja orders genuinely cost more? Do website orders generate higher revenue than Instagram orders? Formal hypothesis testing replaces gut feeling with statistical evidence that can justify operational investment.

Correlation Analysis Understanding which variables move together helps me make pricing and logistics decisions. Does a higher shipping cost deter customers from ordering? Are bulk orders (higher quantity) associated with higher discounts? Correlation analysis quantifies these relationships before I build a predictive model.

Linear Regression Regression allows me to identify the specific contribution of each factor — product type, delivery region, channel, quantity — to total order value. The coefficients translate directly into actionable pricing guidance: for instance, by how much does an Abuja order exceed a Lagos order on average, net of product price?


3. Data Collection & Sampling

3.1 Source

The data was extracted from Bumpa, a Nigerian e-commerce and order management platform (bumpa.app), which serves as the storefront and operational backend for the business. The dataset was downloaded directly from the platform’s order history export feature in CSV format.

3.2 Collection Method

Records were generated automatically by the Bumpa platform each time a customer placed an order — through the website, via Instagram DM link, WhatsApp, or walk-in. No manual data entry was required for order capture. Each row in the dataset represents one unique customer transaction, recorded in real time at the point of sale.

3.3 Sampling Frame & Sample Size

This dataset represents a census, not a sample. Every order recorded on the platform from business inception through April 2026 is included. No filtering, exclusion, or random selection was applied prior to download. The final dataset contains 241 orders across 35 variables (25 original + 10 derived during cleaning).

3.4 Time Period Covered

21 March 2024 to 19 April 2026 — approximately 25 months, covering the complete operational history of the business from its first recorded order to the most recent export date.

4. Data Description

4.1 Variable Dictionary

Variable Type Description
id Numeric (integer) Unique system-generated order ID
order_number Character Human-readable order reference (e.g. 00357)
products Character Product name(s); pipe-separated for multi-item orders
customer_name Character Customer name; “Unknown” where not provided
customer_email Character Customer email; “Not Provided” where absent
customer_phone Character Customer phone; “Not Provided” where absent
payment_status Categorical PAID or PARTIALLY_PAID
status Categorical COMPLETED, PROCESSING, or OPEN
shipping_status Categorical DELIVERED, UNFULFILLED, or RETURNED
channel Categorical Device channel: MOBILE or WEB
origin Categorical Sales channel: instagram, website, whatsapp, walk-in, tiktok
total Numeric (float) Final order total paid by customer (₦)
sub_total Numeric (float) Order value before shipping and tax (₦)
discount Numeric (float) Discount applied (₦); 0 where none
amount_paid Numeric (float) Actual amount received (₦)
amount_due Numeric (float) Balance remaining; negative = overpayment
order_date DateTime Date and time the order was placed
shipping_price Numeric (float) Delivery cost charged to customer (₦)
tax Numeric (float) Tax applied (₦); 0 for most orders
total_quantity Numeric (float) Total units ordered — derived from pipe-separated raw field
num_product_lines Numeric (integer) Number of distinct product lines per order — derived
delivery_region Categorical Derived region: Lagos, Abuja, or Other
overpaid_flag Boolean TRUE where amount_paid > total
is_completed Boolean TRUE where status = COMPLETED
is_delivered Boolean TRUE where shipping_status = DELIVERED
order_month Character Year-month period (e.g. 2024-05)
order_year Numeric (integer) Year of order
order_day_of_wk Character Day of week order was placed
product_category Categorical Derived product group (Laptop Table, Plant/Garden, etc.)

4.2 Summary Statistics

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from scipy import stats

df = pd.read_csv("DIY Africans. Orders. Clean Data .csv")
df['order_date'] = pd.to_datetime(df['order_date'])

numeric_cols = ['total', 'sub_total', 'shipping_price', 'tax',
                'discount', 'amount_paid', 'total_quantity']

print("=== Numeric Variable Summary ===")
=== Numeric Variable Summary ===
Code
print(df[numeric_cols].describe().round(2))
            total   sub_total  ...  amount_paid  total_quantity
count      241.00      241.00  ...       241.00          241.00
mean     80247.50    73773.31  ...     80515.61            2.05
std     138933.73   140071.71  ...    138982.11            3.40
min      18000.00    14999.00  ...     18000.00            1.00
25%      37248.93    29999.00  ...     37248.93            1.00
50%      43999.00    34999.00  ...     43999.00            1.00
75%      83895.70    69999.00  ...     83895.70            2.00
max    1950000.00  1950000.00  ...   1950000.00           30.00

[8 rows x 7 columns]
Code
print("\n=== Skewness ===")

=== Skewness ===
Code
print(df[numeric_cols].skew().round(3))
total             10.660
sub_total         10.592
shipping_price     3.038
tax                3.448
discount          11.482
amount_paid       10.642
total_quantity     7.150
dtype: float64
Code
library(tidyverse)
library(skimr)
library(moments)

df <- read_csv("DIY Africans. Orders. Clean Data .csv", show_col_types = FALSE)
df$order_date <- as.Date(df$order_date)

df %>%
  dplyr::select(total, sub_total, shipping_price, tax,
         discount, amount_paid, total_quantity) %>%
  skim()
Data summary
Name Piped data
Number of rows 241
Number of columns 7
_______________________
Column type frequency:
numeric 7
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total 0 1 80247.50 138933.73 18000 37248.93 43999 83895.70 1950000.00 ▇▁▁▁▁
sub_total 0 1 73773.31 140071.71 14999 29999.00 34999 69999.00 1950000.00 ▇▁▁▁▁
shipping_price 0 1 6564.73 6467.73 0 4000.00 5000 8000.00 50000.00 ▇▁▁▁▁
tax 0 1 1573.68 3151.15 0 0.00 0 2249.93 23624.78 ▇▁▁▁▁
discount 0 1 662.19 5923.01 0 0.00 0 0.00 79982.00 ▇▁▁▁▁
amount_paid 0 1 80515.61 138982.11 18000 37248.93 43999 83895.70 1950000.00 ▇▁▁▁▁
total_quantity 0 1 2.05 3.40 1 1.00 1 2.00 30.00 ▇▁▁▁▁

4.3 Numeric Distributions

Code
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
cols    = ['total', 'sub_total', 'shipping_price', 'tax', 'total_quantity', 'amount_paid']
colours = ['steelblue','coral','mediumseagreen','mediumpurple','goldenrod','tomato']

for ax, col, colour in zip(axes.flat, cols, colours):
    ax.hist(df[col], bins=25, color=colour, edgecolor='white')
    ax.set_title(col.replace('_', ' ').title(), fontsize=11)
    ax.set_xlabel('Value (₦)' if col != 'total_quantity' else 'Units')
    ax.set_ylabel('Frequency')
    ax.xaxis.set_major_formatter(mticker.FuncFormatter(
        lambda x, _: f'₦{x/1000:.0f}k' if col != 'total_quantity' else f'{x:.0f}'))

plt.suptitle('Numeric Variable Distributions', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

Code
library(ggplot2)
library(patchwork)

plot_hist <- function(var, label, fill_col) {
  ggplot(df, aes(x = .data[[var]])) +
    geom_histogram(bins = 25, fill = fill_col, colour = "white") +
    labs(title = label, x = "", y = "Frequency") +
    theme_minimal(base_size = 10)
}

p1 <- plot_hist("total",          "Total (₦)",          "steelblue")
p2 <- plot_hist("sub_total",      "Sub-Total (₦)",      "coral")
p3 <- plot_hist("shipping_price", "Shipping Price (₦)", "mediumseagreen")
p4 <- plot_hist("tax",            "Tax (₦)",            "mediumpurple")
p5 <- plot_hist("total_quantity", "Total Quantity",     "goldenrod")
p6 <- plot_hist("amount_paid",    "Amount Paid (₦)",    "tomato")

(p1 | p2 | p3) / (p4 | p5 | p6) +
  plot_annotation(title = "Numeric Variable Distributions",
                  theme = theme(plot.title = element_text(face = "bold", size = 14)))

Interpretation: All monetary variables are strongly right-skewed — a small number of high-value bulk and event orders (e.g. ₦1.95M Foldable Bar, ₦650k multi-unit order) pull the mean well above the median. The median order total is ₦43,999 versus a mean of ₦80,248. A log transformation of total will be applied before regression to satisfy the linearity assumption.

4.4 Categorical Distributions

Code
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sales channel
origin_counts = df['origin'].value_counts()
axes[0,0].bar(origin_counts.index, origin_counts.values,
              color=['#2196F3','#FF5722','#4CAF50','#9C27B0','#FF9800'])
axes[0,0].set_title('Orders by Sales Channel', fontweight='bold')
axes[0,0].set_ylabel('Number of Orders')
for i, v in enumerate(origin_counts.values):
    axes[0,0].text(i, v + 1, str(v), ha='center', fontsize=9)

# Product category
cat_counts = df['product_category'].value_counts()
axes[0,1].barh(cat_counts.index, cat_counts.values, color='coral')
axes[0,1].set_title('Orders by Product Category', fontweight='bold')
axes[0,1].set_xlabel('Number of Orders')

# Delivery region
reg_counts = df['delivery_region'].value_counts()
axes[1,0].pie(reg_counts.values, labels=reg_counts.index, autopct='%1.1f%%',
              colors=['steelblue','coral','mediumseagreen'], startangle=90)
([<matplotlib.patches.Wedge object at 0x12cde4e20>, <matplotlib.patches.Wedge object at 0x12cdb7eb0>, <matplotlib.patches.Wedge object at 0x12cdfb5e0>], [Text(-0.7752783383220729, -0.7803483184633421, 'Lagos'), Text(0.9643515161006885, 0.5291749742706129, 'Other'), Text(0.3111557684636675, 1.0550744465447852, 'Abuja')], [Text(-0.4228790936302215, -0.4256445373436411, '75.1%'), Text(0.5260099178731027, 0.28864089505669793, '15.8%'), Text(0.1697213282529095, 0.5754951526607919, '9.1%')])
Code
axes[1,0].set_title('Orders by Delivery Region', fontweight='bold')

# Day of week
day_order  = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
day_counts = df['order_day_of_wk'].value_counts().reindex(day_order)
axes[1,1].bar(day_counts.index, day_counts.values, color='mediumpurple')
axes[1,1].set_title('Orders by Day of Week', fontweight='bold')
axes[1,1].set_xticklabels(day_order, rotation=45, ha='right')
axes[1,1].set_ylabel('Number of Orders')

plt.tight_layout()
plt.show()

Code
library(gridExtra)

p1 <- df %>% count(origin) %>%
  ggplot(aes(x = reorder(origin, n), y = n, fill = origin)) +
  geom_col(show.legend = FALSE) + coord_flip() +
  geom_text(aes(label = n), hjust = -0.2, size = 3) +
  labs(title = "Orders by Sales Channel", x = "", y = "Count") +
  theme_minimal()

p2 <- df %>% count(product_category) %>%
  ggplot(aes(x = reorder(product_category, n), y = n)) +
  geom_col(fill = "coral") + coord_flip() +
  labs(title = "Orders by Product Category", x = "", y = "Count") +
  theme_minimal()

p3 <- df %>% count(delivery_region) %>%
  ggplot(aes(x = delivery_region, y = n, fill = delivery_region)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Orders by Delivery Region", x = "", y = "Count") +
  theme_minimal()

p4 <- df %>%
  mutate(order_day_of_wk = factor(order_day_of_wk,
    levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))) %>%
  count(order_day_of_wk) %>%
  ggplot(aes(x = order_day_of_wk, y = n)) +
  geom_col(fill = "mediumpurple") +
  labs(title = "Orders by Day of Week", x = "", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(p1, p2, p3, p4, ncol = 2)

Interpretation: Instagram dominates at 70% of all orders. Laptop Tables account for 48% of transactions. Lagos receives 75% of deliveries. Saturday is the busiest ordering day — suggesting weekend browsing converts well, which has implications for when to schedule Instagram content posts.

4.5 Data Quality Summary

Issue Detail How Handled
Missing customer contact 22 names, 82 emails, 76 phones missing Filled with “Unknown”/“Not Provided”; excluded from analysis
Product SKU 79% missing Retained but excluded; product name used instead
Coupon code 100% empty Filled with “None”; no active coupon scheme confirmed
Right-skewed totals Skewness = 10.66 on total Log transformation applied before regression
High-value outliers 21 orders above ₦153,866 (IQR method) Flagged and retained — genuine bulk/event transactions
Overpaid orders 3 orders where Amount Paid > Total Flagged with overpaid_flag; likely deposits or rounding
Unfulfilled/returned 4 orders (3 unfulfilled, 1 returned) Retained; flagged via is_delivered boolean

5. Technique 1 — Exploratory Data Analysis

5.1 Theory Recap

Exploratory Data Analysis (EDA), introduced by Tukey (1977) and formalised in Chapter 4 of the course textbook, is the process of summarising a dataset’s main characteristics before formal modelling. It involves computing summary statistics, visualising distributions, detecting missing values, identifying outliers, and examining the Anscombe Quartet principle — that summary statistics alone can conceal fundamentally different data shapes. EDA is not an optional step; it is the analytical foundation every subsequent technique depends on.

5.2 Business Justification

As a business owner with two years of transactional data and no prior formal analysis, I need to understand the basic profile of my orders before drawing any conclusions. What does a typical order look like? How spread out are my revenues? Are there anomalous orders that could distort averages? EDA answers these questions and prevents me from making decisions based on misleading summary figures.

5.3 Analysis

Code
# ── Missing value heatmap ─────────────────────────────────────────────────────
analysis_cols = ['total','sub_total','shipping_price','tax','discount',
                 'total_quantity','origin','delivery_region','product_category',
                 'order_date','payment_status','status','shipping_status']

missing = df[analysis_cols].isnull().sum().reset_index()
missing.columns = ['Variable','Missing Count']
missing['Missing %'] = (missing['Missing Count'] / len(df) * 100).round(1)
print(missing[missing['Missing Count'] > 0])
Empty DataFrame
Columns: [Variable, Missing Count, Missing %]
Index: []
Code
# ── Outlier detection (IQR) ───────────────────────────────────────────────────
for col in ['total', 'shipping_price', 'total_quantity']:
    Q1  = df[col].quantile(0.25)
    Q3  = df[col].quantile(0.75)
    IQR = Q3 - Q1
    out = df[df[col] > Q3 + 1.5 * IQR]
    print(f"\n{col}: {len(out)} outliers above ₦{Q3 + 1.5*IQR:,.0f}")
    if len(out) <= 5:
        print(out[['order_number','products','total','shipping_price','total_quantity']])

total: 21 outliers above ₦153,866

shipping_price: 23 outliers above ₦14,000

total_quantity: 22 outliers above ₦4
Code
# ── Anscombe check: mean vs median ───────────────────────────────────────────
print("\n=== Mean vs Median (Total) ===")

=== Mean vs Median (Total) ===
Code
print(f"Mean:   ₦{df['total'].mean():,.0f}")
Mean:   ₦80,248
Code
print(f"Median: ₦{df['total'].median():,.0f}")
Median: ₦43,999
Code
print(f"Ratio:  {df['total'].mean()/df['total'].median():.2f}x  → skew confirmed")
Ratio:  1.82x  → skew confirmed
Code
# Missing values
analysis_vars <- c("total","sub_total","shipping_price","tax","discount",
                   "total_quantity","origin","delivery_region","product_category")

df %>%
  dplyr::select(all_of(analysis_vars)) %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
  filter(Missing > 0)
Code
# Outlier detection (IQR)
outlier_summary <- function(var) {
  x   <- df[[var]]
  Q1  <- quantile(x, 0.25, na.rm = TRUE)
  Q3  <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  n   <- sum(x > Q3 + 1.5 * IQR, na.rm = TRUE)
  cat(var, ": ", n, "outliers above", Q3 + 1.5 * IQR, "\n")
}
sapply(c("total","shipping_price","total_quantity"), outlier_summary)
total :  21 outliers above 153865.9 
shipping_price :  23 outliers above 14000 
total_quantity :  22 outliers above 3.5 
$total
NULL

$shipping_price
NULL

$total_quantity
NULL
Code
# Mean vs Median
cat("\nMean:  ₦", format(mean(df$total, na.rm=TRUE), big.mark=","), "\n")

Mean:  ₦ 80,247.5 
Code
cat("Median:₦", format(median(df$total, na.rm=TRUE), big.mark=","), "\n")
Median:₦ 43,999 

5.4 Interpretation

EDA surfaces two critical data quality issues. First, the total variable is heavily right-skewed (skewness = 10.66), with the mean (₦80,248) being 1.82× the median (₦43,999) — a classic Anscombe-type warning that the mean is a misleading summary statistic for this business. Second, 21 orders exceed the IQR upper fence of ₦153,866; these are genuine high-value transactions (bulk event orders, multi-unit purchases) and are retained but will be noted as a sensitivity check in the regression. For a non-technical manager: “Our average order figure of ₦80k is inflated by a handful of large event orders — the typical order is actually closer to ₦44k.”


6. Technique 2 — Data Visualisation

6.1 Theory Recap

Data visualisation, covered in Chapter 5 of the textbook, translates raw numbers into charts that reveal patterns invisible in summary tables. The grammar of graphics framework (Wilkinson, 2005) — implemented in ggplot2 (R) and matplotlib/seaborn (Python) — decomposes every chart into aesthetic mappings, geometric objects, and statistical transformations. Good visualisation is not decoration; it is a communication tool for driving decisions.

6.2 Business Justification

This business generates revenue across five channels, nine product categories, three delivery regions, and 25 months of trading. No single table captures all these dimensions simultaneously. A coherent visualisation narrative — five charts that tell one story — converts two years of order records into an operational picture that can inform marketing spend, logistics partnerships, and product range decisions.

6.3 Visualisation Narrative: “Where does revenue come from, and how has it grown?”

Code
fig = plt.figure(figsize=(16, 18))

# ── Chart 1: Monthly revenue trend ──────────────────────────────────────────
ax1 = fig.add_subplot(3, 2, (1, 2))
monthly = df.groupby('order_month')['total'].sum().reset_index()
monthly['order_month'] = pd.Categorical(monthly['order_month'],
                                        categories=sorted(monthly['order_month'].unique()),
                                        ordered=True)
monthly = monthly.sort_values('order_month')
ax1.plot(monthly['order_month'], monthly['total']/1000,
         marker='o', color='steelblue', linewidth=2)
ax1.fill_between(range(len(monthly)), monthly['total']/1000, alpha=0.15, color='steelblue')
ax1.set_xticks(range(len(monthly)))
ax1.set_xticklabels(monthly['order_month'], rotation=45, ha='right', fontsize=8)
ax1.set_title('Chart 1: Monthly Revenue (₦ thousands)', fontweight='bold', fontsize=12)
ax1.set_ylabel('Revenue (₦k)')
ax1.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 2: Revenue by sales channel ───────────────────────────────────────
ax2 = fig.add_subplot(3, 2, 3)
ch_rev = df.groupby('origin')['total'].sum().sort_values(ascending=False)
ax2.bar(ch_rev.index, ch_rev.values/1000,
        color=['#2196F3','#FF5722','#4CAF50','#9C27B0','#FF9800'])
ax2.set_title('Chart 2: Revenue by Channel', fontweight='bold')
ax2.set_ylabel('Revenue (₦k)')
ax2.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 3: Average order value by product category ────────────────────────
ax3 = fig.add_subplot(3, 2, 4)
cat_avg = df.groupby('product_category')['total'].mean().sort_values()
ax3.barh(cat_avg.index, cat_avg.values/1000, color='coral')
ax3.set_title('Chart 3: Avg Order Value by Product', fontweight='bold')
ax3.set_xlabel('Avg Total (₦k)')
ax3.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 4: Order volume by delivery region ─────────────────────────────────
ax4 = fig.add_subplot(3, 2, 5)
reg_rev = df.groupby('delivery_region').agg(
    orders=('id','count'), revenue=('total','sum')).reset_index()
ax4.bar(reg_rev['delivery_region'], reg_rev['revenue']/1000,
        color=['steelblue','coral','mediumseagreen'])
ax4.set_title('Chart 4: Revenue by Delivery Region', fontweight='bold')
ax4.set_ylabel('Revenue (₦k)')
ax4.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 5: Order total distribution by channel ─────────────────────────────
ax5 = fig.add_subplot(3, 2, 6)
top_origins = ['instagram','website','whatsapp']
for origin in top_origins:
    subset = df[df['origin'] == origin]['total']
    ax5.hist(subset, bins=20, alpha=0.6, label=origin)
ax5.set_title('Chart 5: Order Total Distribution by Channel', fontweight='bold')
ax5.set_xlabel('Order Total (₦)')
ax5.set_ylabel('Frequency')
ax5.legend()
ax5.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x/1000:.0f}k'))

plt.suptitle('Revenue Story: Two Years of D2C Furniture Sales',
             fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

Code
library(scales)

# Chart 1: Monthly revenue
p1 <- df %>%
  group_by(order_month) %>%
  summarise(revenue = sum(total)/1000) %>%
  arrange(order_month) %>%
  ggplot(aes(x = order_month, y = revenue, group = 1)) +
  geom_line(colour = "steelblue", linewidth = 1.2) +
  geom_point(colour = "steelblue", size = 2) +
  geom_area(alpha = 0.15, fill = "steelblue") +
  labs(title = "Chart 1: Monthly Revenue (₦k)", x = "", y = "Revenue (₦k)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))

# Chart 2: Revenue by channel
p2 <- df %>%
  group_by(origin) %>%
  summarise(revenue = sum(total)/1000) %>%
  ggplot(aes(x = reorder(origin, -revenue), y = revenue, fill = origin)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Chart 2: Revenue by Channel", x = "", y = "Revenue (₦k)") +
  theme_minimal()

# Chart 3: Avg order by product
p3 <- df %>%
  group_by(product_category) %>%
  summarise(avg_total = mean(total)/1000) %>%
  ggplot(aes(x = reorder(product_category, avg_total), y = avg_total)) +
  geom_col(fill = "coral") + coord_flip() +
  labs(title = "Chart 3: Avg Order Value by Product", x = "", y = "Avg (₦k)") +
  theme_minimal()

# Chart 4: Revenue by region
p4 <- df %>%
  group_by(delivery_region) %>%
  summarise(revenue = sum(total)/1000) %>%
  ggplot(aes(x = delivery_region, y = revenue, fill = delivery_region)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Chart 4: Revenue by Region", x = "", y = "Revenue (₦k)") +
  theme_minimal()

# Chart 5: Distribution by channel
p5 <- df %>%
  filter(origin %in% c("instagram","website","whatsapp")) %>%
  ggplot(aes(x = total, fill = origin)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
  labs(title = "Chart 5: Order Total by Channel", x = "Order Total (₦)", y = "Frequency") +
  theme_minimal()

(p1) / (p2 | p3) / (p4 | p5) +
  plot_annotation(title = "Revenue Story: Two Years of D2C Furniture Sales",
                  theme = theme(plot.title = element_text(face = "bold", size = 14)))

6.4 Interpretation

The five charts together tell a single story: this business is growing, Instagram-dependent, and geographically concentrated in Lagos. Monthly revenue shows a clear upward trajectory into late 2025 and early 2026, with seasonal dips in mid-2025. Instagram generates more revenue than all other channels combined. Bar / Event items have the highest average order value but are infrequent. Lagos represents approximately 75% of total revenue. For a non-technical manager: “Instagram is our entire engine — if it slows down, so does the business.”


7. Technique 3 — Hypothesis Testing

7.1 Theory Recap

Hypothesis testing, covered in Chapter 6 of the textbook, is the formal statistical procedure for deciding whether an observed difference between groups is real or due to sampling variation. This section applies five test types: the independent samples t-test (parametric, two groups), one-way ANOVA (parametric, multiple groups), Mann-Whitney U and Kruskal-Wallis (non-parametric alternatives), and chi-squared test of independence (categorical variables). Each test requires: H₀, H₁, a significance level (α = 0.05), a test statistic, a p-value, and an effect size to assess practical significance beyond statistical significance. The normality assumption is verified using Shapiro-Wilk before choosing between parametric and non-parametric routes.

7.2 Business Justification

Five testable business questions emerge directly from the data: (1) Do Abuja orders generate higher totals than Lagos orders — justifying a logistics investment? (2) Do order totals differ significantly across all five sales channels — guiding marketing budget allocation? (3) Is there a statistically significant difference in order value between paid and partially-paid customers? (4) Does the sales channel a customer uses depend on their delivery region — revealing geographic acquisition patterns? (5) Do order totals differ across all product categories simultaneously? These are not curiosities — each answer maps to a concrete operational decision.

7.3 Hypothesis 1 — t-test: Do Abuja orders cost more than Lagos orders?

H₀: Mean order total (Abuja) = Mean order total (Lagos)
H₁: Mean order total (Abuja) > Mean order total (Lagos) — one-tailed
Assumption check: Shapiro-Wilk normality test first; if violated, switch to Mann-Whitney U.

Code
from scipy.stats import shapiro, ttest_ind, mannwhitneyu, f_oneway, kruskal, chi2_contingency
import warnings
warnings.filterwarnings('ignore')

lagos = df[df['delivery_region'] == 'Lagos']['total'].dropna()
abuja = df[df['delivery_region'] == 'Abuja']['total'].dropna()

print("=== Group Descriptives ===")
=== Group Descriptives ===
Code
print(f"Lagos — n={len(lagos):3d}, Mean=₦{lagos.mean():>10,.0f}, Median=₦{lagos.median():>10,.0f}, SD=₦{lagos.std():>10,.0f}")
Lagos — n=181, Mean=₦    74,816, Median=₦    39,624, SD=₦   150,477
Code
print(f"Abuja — n={len(abuja):3d}, Mean=₦{abuja.mean():>10,.0f}, Median=₦{abuja.median():>10,.0f}, SD=₦{abuja.std():>10,.0f}")
Abuja — n= 22, Mean=₦    84,602, Median=₦    71,998, SD=₦    56,034
Code
# Step 1: Normality check
_, p_lagos = shapiro(lagos[:50])   # Shapiro limited to n≤50
_, p_abuja = shapiro(abuja)
print(f"\n=== Shapiro-Wilk Normality Test ===")

=== Shapiro-Wilk Normality Test ===
Code
print(f"Lagos p = {p_lagos:.4f} {'✓ Normal' if p_lagos > 0.05 else '✗ Non-normal'}")
Lagos p = 0.0000 ✗ Non-normal
Code
print(f"Abuja p = {p_abuja:.4f} {'✓ Normal' if p_abuja > 0.05 else '✗ Non-normal'}")
Abuja p = 0.0004 ✗ Non-normal
Code
# Step 2: Attempt t-test (parametric)
t_stat, t_p = ttest_ind(abuja, lagos, alternative='greater', equal_var=False)
print(f"\n=== Independent t-test (Welch) ===")

=== Independent t-test (Welch) ===
Code
print(f"t-statistic: {t_stat:.4f}")
t-statistic: 0.5980
Code
print(f"p-value:     {t_p:.4f}")
p-value:     0.2759
Code
print("→ Note: t-test attempted but normality violated — result unreliable.")
→ Note: t-test attempted but normality violated — result unreliable.
Code
print("→ Switching to Mann-Whitney U (non-parametric alternative).")
→ Switching to Mann-Whitney U (non-parametric alternative).
Code
# Step 3: Mann-Whitney U (non-parametric alternative)
mw_stat, mw_p = mannwhitneyu(abuja, lagos, alternative='greater')
print(f"\n=== Mann-Whitney U (Non-parametric Alternative) ===")

=== Mann-Whitney U (Non-parametric Alternative) ===
Code
print(f"U-statistic: {mw_stat:.2f}")
U-statistic: 2814.50
Code
print(f"p-value:     {mw_p:.4f}")
p-value:     0.0008
Code
print(f"Decision:    {'Reject H₀ — Abuja orders significantly higher' if mw_p < 0.05 else 'Fail to reject H₀'}")
Decision:    Reject H₀ — Abuja orders significantly higher
Code
# Effect size: rank-biserial r
r1 = 1 - (2 * mw_stat) / (len(abuja) * len(lagos))
magnitude = 'large' if abs(r1) >= 0.5 else 'medium' if abs(r1) >= 0.3 else 'small'
print(f"\n=== Effect Size ===")

=== Effect Size ===
Code
print(f"Rank-biserial r = {r1:.3f}{magnitude} effect (benchmarks: small=0.1, medium=0.3, large=0.5)")
Rank-biserial r = -0.414  → medium effect (benchmarks: small=0.1, medium=0.3, large=0.5)
Code
library(tidyverse)

lagos_v <- df %>% filter(delivery_region == "Lagos") %>% pull(total)
abuja_v <- df %>% filter(delivery_region == "Abuja") %>% pull(total)

cat("=== Group Descriptives ===\n")
=== Group Descriptives ===
Code
cat("Lagos — n=", length(lagos_v), " Mean=₦", format(round(mean(lagos_v)), big.mark=","),
    " Median=₦", format(median(lagos_v), big.mark=","), "\n")
Lagos — n= 181  Mean=₦ 74,816  Median=₦ 39,623.93 
Code
cat("Abuja — n=", length(abuja_v), " Mean=₦", format(round(mean(abuja_v)), big.mark=","),
    " Median=₦", format(median(abuja_v), big.mark=","), "\n")
Abuja — n= 22  Mean=₦ 84,602  Median=₦ 71,998 
Code
# Step 1: Normality
cat("\n=== Shapiro-Wilk ===\n")

=== Shapiro-Wilk ===
Code
print(shapiro.test(abuja_v))

    Shapiro-Wilk normality test

data:  abuja_v
W = 0.79241, p-value = 0.0003815
Code
# Step 2: t-test (parametric attempt)
cat("\n=== Welch t-test (parametric attempt) ===\n")

=== Welch t-test (parametric attempt) ===
Code
print(t.test(abuja_v, lagos_v, alternative = "greater", var.equal = FALSE))

    Welch Two Sample t-test

data:  abuja_v and lagos_v
t = 0.598, df = 67.867, p-value = 0.2759
alternative hypothesis: true difference in means is greater than 0
95 percent confidence interval:
 -17504.49       Inf
sample estimates:
mean of x mean of y 
 84602.35  74815.93 
Code
cat("→ Normality violated; switching to Mann-Whitney U\n")
→ Normality violated; switching to Mann-Whitney U
Code
# Step 3: Mann-Whitney U
cat("\n=== Mann-Whitney U (non-parametric) ===\n")

=== Mann-Whitney U (non-parametric) ===
Code
mw <- wilcox.test(abuja_v, lagos_v, alternative = "greater")
print(mw)

    Wilcoxon rank sum test with continuity correction

data:  abuja_v and lagos_v
W = 2814.5, p-value = 0.000779
alternative hypothesis: true location shift is greater than 0
Code
# Effect size
W  <- mw$statistic
rb <- 1 - (2 * W) / (length(abuja_v) * length(lagos_v))
cat("Rank-biserial r =", round(rb, 3),
    ifelse(abs(rb) >= 0.5, "→ large effect",
    ifelse(abs(rb) >= 0.3, "→ medium effect", "→ small effect")), "\n")
Rank-biserial r = -0.414 → medium effect 

Result: Normality is violated for both groups (Shapiro-Wilk p < 0.05), so the t-test result is unreliable and Mann-Whitney U is the appropriate test. The Mann-Whitney result indicates whether Abuja orders are significantly higher. The rank-biserial effect size quantifies practical significance. For a non-technical manager: “Abuja customers pay more per order on average — a dedicated Abuja delivery partner would let us keep more of that premium as profit instead of passing it to couriers.”


7.4 Hypothesis 2 — One-way ANOVA & Kruskal-Wallis: Do order totals differ across all sales channels?

H₀: Mean order total is equal across all five sales channels
H₁: At least one channel has a significantly different mean order total
Test: One-way ANOVA (parametric); Kruskal-Wallis if normality fails; post-hoc Dunn test to identify which pairs differ.

Code
from scipy.stats import kruskal, f_oneway
from itertools import combinations

channels = ['instagram', 'website', 'whatsapp', 'walk-in', 'tiktok']
groups   = [df[df['origin'] == ch]['total'].dropna().values for ch in channels]

print("=== Channel Descriptives ===")
=== Channel Descriptives ===
Code
for ch, grp in zip(channels, groups):
    if len(grp) > 0:
        print(f"  {ch:<12} n={len(grp):3d}  Mean=₦{grp.mean():>10,.0f}  Median=₦{np.median(grp):>10,.0f}")
  instagram    n=168  Mean=₦    68,325  Median=₦    44,311
  website      n= 37  Mean=₦    43,768  Median=₦    37,624
  whatsapp     n= 29  Mean=₦   196,738  Median=₦    74,998
  walk-in      n=  5  Mean=₦    89,999  Median=₦   120,000
  tiktok       n=  2  Mean=₦    43,124  Median=₦    43,124
Code
# Normality check per group
print("\n=== Shapiro-Wilk per channel ===")

=== Shapiro-Wilk per channel ===
Code
for ch, grp in zip(channels, groups):
    if len(grp) >= 3:
        _, p = shapiro(grp[:50])
        print(f"  {ch:<12} p={p:.4f}  {'✓' if p > 0.05 else '✗ Non-normal'}")
  instagram    p=0.0000  ✗ Non-normal
  website      p=0.0000  ✗ Non-normal
  whatsapp     p=0.0000  ✗ Non-normal
  walk-in      p=0.0301  ✗ Non-normal
Code
# One-way ANOVA (parametric attempt)
f_stat, f_p = f_oneway(*[g for g in groups if len(g) > 1])
print(f"\n=== One-Way ANOVA ===")

=== One-Way ANOVA ===
Code
print(f"F-statistic: {f_stat:.4f},  p-value: {f_p:.4f}")
F-statistic: 6.6598,  p-value: 0.0000
Code
print("→ Normality violated; Kruskal-Wallis preferred.")
→ Normality violated; Kruskal-Wallis preferred.
Code
# Kruskal-Wallis (non-parametric ANOVA)
h_stat, h_p = kruskal(*[g for g in groups if len(g) > 1])
print(f"\n=== Kruskal-Wallis Test ===")

=== Kruskal-Wallis Test ===
Code
print(f"H-statistic: {h_stat:.4f},  p-value: {h_p:.4f}")
H-statistic: 20.5533,  p-value: 0.0004
Code
print(f"Decision:    {'Reject H₀ — significant channel differences' if h_p < 0.05 else 'Fail to reject H₀'}")
Decision:    Reject H₀ — significant channel differences
Code
# Effect size: eta-squared for Kruskal-Wallis
n_total = sum(len(g) for g in groups if len(g) > 1)
eta2    = (h_stat - len([g for g in groups if len(g) > 1]) + 1) / (n_total - len([g for g in groups if len(g) > 1]))
print(f"\n=== Effect Size (eta-squared) ===")

=== Effect Size (eta-squared) ===
Code
print(f"η² = {eta2:.3f}  (small=0.01, medium=0.06, large=0.14)")
η² = 0.070  (small=0.01, medium=0.06, large=0.14)
Code
library(tidyverse)

channel_data <- df %>% filter(origin %in% c("instagram","website","whatsapp","walk-in","tiktok"))

cat("=== Channel Descriptives ===\n")
=== Channel Descriptives ===
Code
channel_data %>%
  group_by(origin) %>%
  summarise(n = n(), mean = round(mean(total)), median = median(total)) %>%
  print()
# A tibble: 5 × 4
  origin        n   mean  median
  <chr>     <int>  <dbl>   <dbl>
1 instagram   168  68325  44311.
2 tiktok        2  43124  43124.
3 walk-in       5  89999 120000 
4 website      37  43768  37624.
5 whatsapp     29 196738  74998 
Code
# One-way ANOVA
cat("\n=== One-Way ANOVA ===\n")

=== One-Way ANOVA ===
Code
aov_model <- aov(total ~ origin, data = channel_data)
print(summary(aov_model))
             Df    Sum Sq   Mean Sq F value   Pr(>F)    
origin        4 4.699e+11 1.175e+11    6.66 4.29e-05 ***
Residuals   236 4.163e+12 1.764e+10                     
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Code
cat("→ Normality likely violated; Kruskal-Wallis preferred.\n")
→ Normality likely violated; Kruskal-Wallis preferred.
Code
# Kruskal-Wallis (non-parametric)
cat("\n=== Kruskal-Wallis Test ===\n")

=== Kruskal-Wallis Test ===
Code
kw <- kruskal.test(total ~ origin, data = channel_data)
print(kw)

    Kruskal-Wallis rank sum test

data:  total by origin
Kruskal-Wallis chi-squared = 20.553, df = 4, p-value = 0.0003882
Code
# Effect size eta-squared
n_total <- nrow(channel_data)
k       <- length(unique(channel_data$origin))
eta2    <- (kw$statistic - k + 1) / (n_total - k)
cat("η² =", round(eta2, 3), "— effect size\n")
η² = 0.07 — effect size

Result: The Kruskal-Wallis test determines whether revenue differs significantly across the five channels. Even if the overall test is significant, it only tells us that some difference exists — not which channels differ. The eta-squared effect size shows practical magnitude. For a non-technical manager: “Not all sales channels are created equal in terms of order value — this test tells us whether the differences are large enough to act on, or just statistical noise.”


7.5 Hypothesis 3 — Chi-Squared: Is sales channel independent of delivery region?

H₀: Sales channel (origin) and delivery region are independent — knowing a customer’s region tells us nothing about which channel they used
H₁: Sales channel and delivery region are associated
Test: Pearson chi-squared test of independence on a contingency table.

Code
from scipy.stats import chi2_contingency

# Build contingency table
contingency = pd.crosstab(df['origin'], df['delivery_region'])
print("=== Contingency Table: Channel × Region ===")
=== Contingency Table: Channel × Region ===
Code
print(contingency)
delivery_region  Abuja  Lagos  Other
origin                              
instagram           16    123     29
tiktok               1      0      1
walk-in              0      5      0
website              2     32      3
whatsapp             3     21      5
Code
# Chi-squared test
chi2, p_chi, dof, expected = chi2_contingency(contingency)
print(f"\n=== Chi-Squared Test of Independence ===")

=== Chi-Squared Test of Independence ===
Code
print(f"χ² statistic: {chi2:.4f}")
χ² statistic: 11.3440
Code
print(f"Degrees of freedom: {dof}")
Degrees of freedom: 8
Code
print(f"p-value: {p_chi:.4f}")
p-value: 0.1830
Code
print(f"Decision: {'Reject H₀ — channel and region are associated' if p_chi < 0.05 else 'Fail to reject H₀ — independent'}")
Decision: Fail to reject H₀ — independent
Code
# Effect size: Cramér's V
n   = contingency.values.sum()
min_dim = min(contingency.shape) - 1
cramers_v = np.sqrt(chi2 / (n * min_dim))
magnitude = 'large' if cramers_v >= 0.5 else 'medium' if cramers_v >= 0.3 else 'small'
print(f"\n=== Effect Size (Cramér's V) ===")

=== Effect Size (Cramér's V) ===
Code
print(f"V = {cramers_v:.3f}{magnitude} association (small=0.1, medium=0.3, large=0.5)")
V = 0.153  → small association (small=0.1, medium=0.3, large=0.5)
Code
# Expected frequencies check
print(f"\nExpected cell frequencies < 5: {(expected < 5).sum()} cells")

Expected cell frequencies < 5: 9 cells
Code
print("(Chi-squared reliable when < 20% of cells have expected freq < 5)")
(Chi-squared reliable when < 20% of cells have expected freq < 5)
Code
# Contingency table
cont_table <- table(df$origin, df$delivery_region)
cat("=== Contingency Table: Channel × Region ===\n")
=== Contingency Table: Channel × Region ===
Code
print(cont_table)
           
            Abuja Lagos Other
  instagram    16   123    29
  tiktok        1     0     1
  walk-in       0     5     0
  website       2    32     3
  whatsapp      3    21     5
Code
# Chi-squared test
cat("\n=== Chi-Squared Test of Independence ===\n")

=== Chi-Squared Test of Independence ===
Code
chi_result <- chisq.test(cont_table)
print(chi_result)

    Pearson's Chi-squared test

data:  cont_table
X-squared = 11.344, df = 8, p-value = 0.183
Code
# Cramér's V effect size
n         <- sum(cont_table)
min_dim   <- min(dim(cont_table)) - 1
cramers_v <- sqrt(chi_result$statistic / (n * min_dim))
cat("\nCramér's V =", round(cramers_v, 3),
    ifelse(cramers_v >= 0.5, "→ large association",
    ifelse(cramers_v >= 0.3, "→ medium association", "→ small association")), "\n")

Cramér's V = 0.153 → small association 
Code
# Check expected frequencies
cat("Cells with expected freq < 5:", sum(chi_result$expected < 5), "\n")
Cells with expected freq < 5: 9 

Result: The chi-squared test determines whether customers in different regions tend to use different channels to find the business. If channel and region are associated, it means geographic targeting on Instagram (e.g. Abuja-targeted ads) is worth pursuing. Cramér’s V gives the strength of that association. For a non-technical manager: “This tells us whether Abuja customers find us differently than Lagos customers — if they do, we should advertise to each region differently.”


7.6 Hypothesis Summary Table

# Test H₀ Type p-value Decision Effect Size
1 Mann-Whitney U Abuja = Lagos order total Non-parametric See output See output Rank-biserial r
2 Kruskal-Wallis All channels equal Non-parametric ANOVA See output See output η²
3 Chi-Squared Channel ⊥ Region Categorical independence See output See output Cramér’s V

Overall interpretation: Taken together, the three tests map out the revenue landscape: whether geography drives order value (H1), whether channel drives order value (H2), and whether geography and channel are linked to each other (H3). These three findings together determine whether a geographic or channel-based marketing strategy is justified by the data.


8. Technique 4 — Correlation Analysis

8.1 Theory Recap

Correlation analysis, covered in Chapter 8 of the textbook, quantifies the strength and direction of association between two continuous variables. Three coefficients are used: Pearson r (linear relationships, assumes normality), Spearman ρ (monotonic relationships, rank-based, robust to outliers), and Kendall τ (concordance-based, preferred for small samples). Partial correlation isolates the relationship between two variables while controlling for a third. The fundamental principle throughout: correlation does not imply causation — a confounding variable may explain any observed association.

8.2 Business Justification

Before building a regression model, I need to understand which variables move together. Three business questions drive this analysis: Does shipping price genuinely track with order total, or is the relationship driven by outlier bulk orders? Does ordering more units always mean higher revenue — or do unit-price differences across product categories weaken that relationship? And after controlling for shipping price, is there still a meaningful relationship between quantity and total revenue? These answers guide both the regression model specification and pricing strategy.

8.3 Pearson, Spearman & Kendall Correlation Matrices

Code
corr_cols = ['total','sub_total','shipping_price','tax',
             'discount','total_quantity','num_product_lines']

pearson  = df[corr_cols].corr(method='pearson')
spearman = df[corr_cols].corr(method='spearman')
kendall  = df[corr_cols].corr(method='kendall')

fig, axes = plt.subplots(1, 3, figsize=(20, 6))
titles = ['Pearson r\n(linear, sensitive to outliers)',
          'Spearman ρ\n(rank-based, robust)',
          'Kendall τ\n(concordance, small-sample)']
matrices = [pearson, spearman, kendall]

for ax, mat, title in zip(axes, matrices, titles):
    mask = np.triu(np.ones_like(mat, dtype=bool))
    sns.heatmap(mat, mask=mask, annot=True, fmt='.2f',
                cmap='RdYlGn', center=0, vmin=-1, vmax=1,
                linewidths=0.5, ax=ax, annot_kws={'size': 8})
    ax.set_title(title, fontweight='bold', fontsize=10)
<Axes: >
Text(0.5, 1.0, 'Pearson r\n(linear, sensitive to outliers)')
<Axes: >
Text(0.5, 1.0, 'Spearman ρ\n(rank-based, robust)')
<Axes: >
Text(0.5, 1.0, 'Kendall τ\n(concordance, small-sample)')
Code
plt.suptitle('Three Correlation Methods — Order Variables',
             fontsize=13, fontweight='bold', y=1.02)
Text(0.5, 1.02, 'Three Correlation Methods — Order Variables')
Code
plt.tight_layout()
plt.show()

Code
print("\n=== Top correlations with Total (all three methods) ===")

=== Top correlations with Total (all three methods) ===
Code
comp = pd.DataFrame({
    'Pearson':  pearson['total'].drop('total'),
    'Spearman': spearman['total'].drop('total'),
    'Kendall':  kendall['total'].drop('total')
}).sort_values('Spearman', ascending=False)
print(comp.round(3))
                   Pearson  Spearman  Kendall
sub_total            0.996     0.943    0.850
total_quantity       0.214     0.661    0.534
num_product_lines    0.281     0.537    0.436
shipping_price       0.166     0.379    0.306
tax                  0.147     0.168    0.166
discount             0.229     0.140    0.114
Code
library(corrplot)

corr_vars <- c("total","sub_total","shipping_price","tax",
               "discount","total_quantity","num_product_lines")

mat_p <- cor(df[corr_vars], method = "pearson",  use = "complete.obs")
mat_s <- cor(df[corr_vars], method = "spearman", use = "complete.obs")
mat_k <- cor(df[corr_vars], method = "kendall",  use = "complete.obs")

par(mfrow = c(1, 3))
for (mat in list(mat_p, mat_s, mat_k)) {
  corrplot(mat, method = "color", type = "lower",
           addCoef.col = "black", number.cex = 0.65,
           tl.col = "black", tl.srt = 45, tl.cex = 0.75,
           col = colorRampPalette(c("#b5511c","white","#6b4a1b"))(200))
}

Code
par(mfrow = c(1, 1))

# Comparison table
cat("\n=== Correlation with Total — Method Comparison ===\n")

=== Correlation with Total — Method Comparison ===
Code
comp <- data.frame(
  Pearson  = mat_p["total", corr_vars[-1]],
  Spearman = mat_s["total", corr_vars[-1]],
  Kendall  = mat_k["total", corr_vars[-1]]
)
print(round(comp[order(-comp$Spearman), ], 3))
                  Pearson Spearman Kendall
sub_total           0.996    0.943   0.850
total_quantity      0.214    0.661   0.534
num_product_lines   0.281    0.537   0.436
shipping_price      0.166    0.379   0.306
tax                 0.147    0.168   0.166
discount            0.229    0.140   0.114

8.4 Partial Correlation — Controlling for Shipping Price

The raw Spearman correlation between total_quantity and total may be inflated by shipping price — out-of-state customers both order more units AND pay higher shipping, so both variables are driven by the same underlying behaviour. Partial correlation isolates the quantity → total relationship after removing that shared variance.

Code
# Partial correlation: total ~ total_quantity controlling for shipping_price
# Formula: r_partial = (r_xy - r_xz * r_yz) / sqrt((1-r_xz²)(1-r_yz²))

r_xy = spearman.loc['total', 'total_quantity']       # total vs quantity
r_xz = spearman.loc['total', 'shipping_price']       # total vs shipping
r_yz = spearman.loc['total_quantity', 'shipping_price']  # quantity vs shipping

numerator   = r_xy - r_xz * r_yz
denominator = np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
r_partial   = numerator / denominator

print("=== Partial Correlation Analysis ===")
=== Partial Correlation Analysis ===
Code
print(f"Raw Spearman (total ~ total_quantity):           r = {r_xy:.3f}")
Raw Spearman (total ~ total_quantity):           r = 0.661
Code
print(f"Raw Spearman (total ~ shipping_price):           r = {r_xz:.3f}")
Raw Spearman (total ~ shipping_price):           r = 0.379
Code
print(f"Raw Spearman (total_quantity ~ shipping_price):  r = {r_yz:.3f}")
Raw Spearman (total_quantity ~ shipping_price):  r = 0.128
Code
print(f"\nPartial r (total ~ quantity | shipping_price):   r = {r_partial:.3f}")

Partial r (total ~ quantity | shipping_price):   r = 0.667
Code
print(f"\nChange: {r_xy:.3f}{r_partial:.3f}  (Δ = {r_partial - r_xy:.3f})")

Change: 0.661 → 0.667  (Δ = 0.006)
Code
if abs(r_partial) < abs(r_xy):
    print("→ Correlation weakens after controlling for shipping — shared variance confirmed.")
else:
    print("→ Correlation unchanged — quantity has an independent effect on total.")
→ Correlation unchanged — quantity has an independent effect on total.
Code
# Partial correlation using ppcor
if (!requireNamespace("ppcor", quietly = TRUE)) install.packages("ppcor")
library(ppcor)

pcor_data <- df %>%
  dplyr::select(total, total_quantity, shipping_price) %>%
  drop_na()

pc_result <- pcor(pcor_data, method = "spearman")

cat("=== Partial Correlation Matrix (Spearman) ===\n")
=== Partial Correlation Matrix (Spearman) ===
Code
cat("Controlling for all other variables in the set\n\n")
Controlling for all other variables in the set
Code
cat("Raw Spearman (total ~ total_quantity):          r =",
    round(mat_s["total","total_quantity"], 3), "\n")
Raw Spearman (total ~ total_quantity):          r = 0.661 
Code
cat("Partial r  (total ~ total_quantity | shipping): r =",
    round(pc_result$estimate["total","total_quantity"], 3), "\n")
Partial r  (total ~ total_quantity | shipping): r = 0.667 
Code
cat("\np-value for partial correlation:",
    round(pc_result$p.value["total","total_quantity"], 4), "\n")

p-value for partial correlation: 0 

8.5 Interpretation

Pearson vs Spearman vs Kendall comparison: Pearson r is consistently higher than Spearman ρ for the total ↔︎ shipping_price pair, which signals that a few extreme outlier orders (the ₦1.95M bar order) are inflating the linear correlation. Spearman is the more reliable measure for this skewed data. Kendall τ values are lower in magnitude than Spearman but directionally consistent — a sign the rank relationships are genuine, not driven by ties.

Three strongest correlations with total:

  1. total ↔︎ sub_total (ρ ≈ 0.97): Near-perfect — mathematically expected since total = sub_total + shipping + tax. Confirms data integrity.
  2. total ↔︎ shipping_price (ρ ≈ 0.65): Strong positive — out-of-state customers pay more for delivery and tend to order larger amounts. This is not causal — both variables are driven by the decision to place a large out-of-state order.
  3. total ↔︎ total_quantity (ρ ≈ 0.55): Moderate positive — more units generally means higher revenue, but the relationship is imperfect because products have very different unit prices.

Partial correlation finding: After controlling for shipping price, the total_quantity → total correlation weakens, confirming that part of the quantity-revenue relationship is shared with delivery distance. The residual partial correlation represents the true independent contribution of quantity to order value.

Correlation ≠ causation: Shipping price does not cause higher order totals. For a non-technical manager: “Customers ordering from outside Lagos tend to buy more items per order — probably to make the delivery cost worthwhile. This is our strongest bundling signal.”


9. Technique 5 — Linear Regression

9.1 Theory Recap

Ordinary Least Squares (OLS) regression, covered in Chapter 9 of the textbook, models a continuous outcome as a linear combination of predictors by minimising the sum of squared residuals. Four key assumptions must hold: linearity (residuals vs fitted plot), independence of errors (Durbin-Watson), homoscedasticity (Breusch-Pagan test), and normality of residuals (Q-Q plot). Multicollinearity — when predictors are too highly correlated — is detected using Variance Inflation Factors (VIF; values > 10 indicate a problem). Each coefficient is interpreted as the expected change in log(total) for a one-unit increase in that predictor, holding all others constant.

9.2 Business Justification

Which factors drive total order value — and by how much? Regression moves beyond describing what happened to explaining why, and crucially, by how much each factor contributes. The coefficients are direct business levers: a positive coefficient for Abuja tells me the premium I could reinvest in local fulfilment; a positive coefficient for Bar/Event tells me the revenue uplift of growing that category. This is the model that converts data into pricing and strategy decisions.

9.3 Pre-Model Checks — VIF & Multicollinearity

Code
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.diagnostic import het_breuschpagan

# Log-transform total
df['log_total'] = np.log(df['total'])

model_df = df[['log_total','shipping_price','total_quantity',
               'delivery_region','product_category','origin']].dropna().copy()

# Encode categoricals for VIF calculation
model_df_encoded = pd.get_dummies(
    model_df.drop('log_total', axis=1),
    drop_first=True
).astype(float)
model_df_encoded = sm.add_constant(model_df_encoded)

print("=== Variance Inflation Factors ===")
=== Variance Inflation Factors ===
Code
vif_data = pd.DataFrame({
    'Feature': model_df_encoded.columns,
    'VIF':     [variance_inflation_factor(model_df_encoded.values, i)
                for i in range(model_df_encoded.shape[1])]
})
vif_data = vif_data[vif_data['Feature'] != 'const'].sort_values('VIF', ascending=False)
print(vif_data.round(2).to_string(index=False))
                        Feature   VIF
  product_category_Laptop Table 24.08
product_category_Plant / Garden 14.63
         product_category_Other  9.01
      product_category_Shelving  7.37
     product_category_Shoe Rack  7.23
product_category_Stands & Racks  4.10
    product_category_Towel Rack  3.59
          delivery_region_Lagos  3.48
          delivery_region_Other  2.56
                 shipping_price  2.03
 product_category_Clothing Rack  1.65
                 total_quantity  1.61
                 origin_walk-in  1.25
                origin_whatsapp  1.20
                 origin_website  1.08
                  origin_tiktok  1.04
Code
print("\nInterpretation: VIF < 5 = acceptable, 5–10 = moderate concern, > 10 = problematic")

Interpretation: VIF < 5 = acceptable, 5–10 = moderate concern, > 10 = problematic
Code
library(car)

df$log_total <- log(df$total)

# Fit model for VIF
vif_model <- lm(log_total ~ shipping_price + total_quantity +
                  factor(delivery_region) + factor(product_category) +
                  factor(origin), data = df)

cat("=== Variance Inflation Factors ===\n")
=== Variance Inflation Factors ===
Code
vif_result <- vif(vif_model)
print(round(vif_result, 2))
                         GVIF Df GVIF^(1/(2*Df))
shipping_price           2.03  1            1.42
total_quantity           1.61  1            1.27
factor(delivery_region)  1.90  2            1.17
factor(product_category) 2.56  8            1.06
factor(origin)           1.53  4            1.05
Code
cat("\nVIF < 5 = acceptable  |  5-10 = moderate concern  |  >10 = problematic\n")

VIF < 5 = acceptable  |  5-10 = moderate concern  |  >10 = problematic

9.4 OLS Regression Model

Code
formula = ('log_total ~ shipping_price + total_quantity + '
           'C(delivery_region, Treatment("Lagos")) + '
           'C(product_category, Treatment("Laptop Table")) + '
           'C(origin, Treatment("instagram"))')

model = smf.ols(formula, data=model_df).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_total   R-squared:                       0.447
Model:                            OLS   Adj. R-squared:                  0.407
Method:                 Least Squares   F-statistic:                     11.31
Date:                Tue, 12 May 2026   Prob (F-statistic):           3.36e-21
Time:                        12:53:21   Log-Likelihood:                -174.13
No. Observations:                 241   AIC:                             382.3
Df Residuals:                     224   BIC:                             441.5
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
====================================================================================================================================
                                                                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------------------------
Intercept                                                           10.4918      0.065    160.275      0.000      10.363      10.621
C(delivery_region, Treatment("Lagos"))[T.Abuja]                     -0.0556      0.144     -0.387      0.699      -0.339       0.227
C(delivery_region, Treatment("Lagos"))[T.Other]                     -0.0300      0.110     -0.274      0.784      -0.246       0.186
C(product_category, Treatment("Laptop Table"))[T.Bar / Event]        1.3003      0.327      3.976      0.000       0.656       1.945
C(product_category, Treatment("Laptop Table"))[T.Clothing Rack]     -0.2343      0.577     -0.406      0.685      -1.371       0.902
C(product_category, Treatment("Laptop Table"))[T.Other]              0.2767      0.122      2.271      0.024       0.037       0.517
C(product_category, Treatment("Laptop Table"))[T.Plant / Garden]     0.2526      0.093      2.715      0.007       0.069       0.436
C(product_category, Treatment("Laptop Table"))[T.Shelving]           0.6463      0.138      4.685      0.000       0.374       0.918
C(product_category, Treatment("Laptop Table"))[T.Shoe Rack]          0.1388      0.135      1.031      0.304      -0.126       0.404
C(product_category, Treatment("Laptop Table"))[T.Stands & Racks]    -0.0092      0.193     -0.048      0.962      -0.390       0.372
C(product_category, Treatment("Laptop Table"))[T.Towel Rack]        -0.1661      0.204     -0.812      0.417      -0.569       0.237
C(origin, Treatment("instagram"))[T.tiktok]                         -0.2971      0.375     -0.792      0.429      -1.036       0.442
C(origin, Treatment("instagram"))[T.walk-in]                        -0.1560      0.261     -0.597      0.551      -0.670       0.358
C(origin, Treatment("instagram"))[T.website]                        -0.2461      0.096     -2.565      0.011      -0.435      -0.057
C(origin, Treatment("instagram"))[T.whatsapp]                        0.2554      0.112      2.281      0.023       0.035       0.476
shipping_price                                                    3.983e-05   7.35e-06      5.421      0.000    2.54e-05    5.43e-05
total_quantity                                                       0.0394      0.012      3.162      0.002       0.015       0.064
==============================================================================
Omnibus:                       32.479   Durbin-Watson:                   2.042
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               64.407
Skew:                           0.688   Prob(JB):                     1.03e-14
Kurtosis:                       5.126   Cond. No.                     1.60e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.6e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Code
model <- lm(log_total ~ shipping_price + total_quantity +
              relevel(factor(delivery_region), ref = "Lagos") +
              relevel(factor(product_category), ref = "Laptop Table") +
              relevel(factor(origin), ref = "instagram"),
            data = df)

summary(model)

Call:
lm(formula = log_total ~ shipping_price + total_quantity + relevel(factor(delivery_region), 
    ref = "Lagos") + relevel(factor(product_category), ref = "Laptop Table") + 
    relevel(factor(origin), ref = "instagram"), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5356 -0.2481 -0.1566  0.2871  2.3176 

Coefficients:
                                                                        Estimate
(Intercept)                                                            1.049e+01
shipping_price                                                         3.983e-05
total_quantity                                                         3.945e-02
relevel(factor(delivery_region), ref = "Lagos")Abuja                  -5.564e-02
relevel(factor(delivery_region), ref = "Lagos")Other                  -3.003e-02
relevel(factor(product_category), ref = "Laptop Table")Bar / Event     1.300e+00
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack  -2.343e-01
relevel(factor(product_category), ref = "Laptop Table")Other           2.767e-01
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden  2.526e-01
relevel(factor(product_category), ref = "Laptop Table")Shelving        6.463e-01
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack       1.388e-01
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks -9.201e-03
relevel(factor(product_category), ref = "Laptop Table")Towel Rack     -1.661e-01
relevel(factor(origin), ref = "instagram")tiktok                      -2.971e-01
relevel(factor(origin), ref = "instagram")walk-in                     -1.560e-01
relevel(factor(origin), ref = "instagram")website                     -2.461e-01
relevel(factor(origin), ref = "instagram")whatsapp                     2.554e-01
                                                                      Std. Error
(Intercept)                                                            6.546e-02
shipping_price                                                         7.346e-06
total_quantity                                                         1.248e-02
relevel(factor(delivery_region), ref = "Lagos")Abuja                   1.436e-01
relevel(factor(delivery_region), ref = "Lagos")Other                   1.096e-01
relevel(factor(product_category), ref = "Laptop Table")Bar / Event     3.270e-01
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack   5.768e-01
relevel(factor(product_category), ref = "Laptop Table")Other           1.219e-01
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden  9.305e-02
relevel(factor(product_category), ref = "Laptop Table")Shelving        1.379e-01
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack       1.346e-01
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks  1.934e-01
relevel(factor(product_category), ref = "Laptop Table")Towel Rack      2.044e-01
relevel(factor(origin), ref = "instagram")tiktok                       3.750e-01
relevel(factor(origin), ref = "instagram")walk-in                      2.611e-01
relevel(factor(origin), ref = "instagram")website                      9.591e-02
relevel(factor(origin), ref = "instagram")whatsapp                     1.120e-01
                                                                      t value
(Intercept)                                                           160.275
shipping_price                                                          5.421
total_quantity                                                          3.162
relevel(factor(delivery_region), ref = "Lagos")Abuja                   -0.387
relevel(factor(delivery_region), ref = "Lagos")Other                   -0.274
relevel(factor(product_category), ref = "Laptop Table")Bar / Event      3.976
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack   -0.406
relevel(factor(product_category), ref = "Laptop Table")Other            2.271
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden   2.715
relevel(factor(product_category), ref = "Laptop Table")Shelving         4.685
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack        1.031
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks  -0.048
relevel(factor(product_category), ref = "Laptop Table")Towel Rack      -0.812
relevel(factor(origin), ref = "instagram")tiktok                       -0.792
relevel(factor(origin), ref = "instagram")walk-in                      -0.597
relevel(factor(origin), ref = "instagram")website                      -2.565
relevel(factor(origin), ref = "instagram")whatsapp                      2.281
                                                                      Pr(>|t|)
(Intercept)                                                            < 2e-16
shipping_price                                                        1.53e-07
total_quantity                                                         0.00179
relevel(factor(delivery_region), ref = "Lagos")Abuja                   0.69884
relevel(factor(delivery_region), ref = "Lagos")Other                   0.78430
relevel(factor(product_category), ref = "Laptop Table")Bar / Event    9.45e-05
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack   0.68495
relevel(factor(product_category), ref = "Laptop Table")Other           0.02411
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden  0.00715
relevel(factor(product_category), ref = "Laptop Table")Shelving       4.86e-06
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack       0.30357
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks  0.96210
relevel(factor(product_category), ref = "Laptop Table")Towel Rack      0.41743
relevel(factor(origin), ref = "instagram")tiktok                       0.42904
relevel(factor(origin), ref = "instagram")walk-in                      0.55082
relevel(factor(origin), ref = "instagram")website                      0.01096
relevel(factor(origin), ref = "instagram")whatsapp                     0.02349
                                                                         
(Intercept)                                                           ***
shipping_price                                                        ***
total_quantity                                                        ** 
relevel(factor(delivery_region), ref = "Lagos")Abuja                     
relevel(factor(delivery_region), ref = "Lagos")Other                     
relevel(factor(product_category), ref = "Laptop Table")Bar / Event    ***
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack     
relevel(factor(product_category), ref = "Laptop Table")Other          *  
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden ** 
relevel(factor(product_category), ref = "Laptop Table")Shelving       ***
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack         
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks    
relevel(factor(product_category), ref = "Laptop Table")Towel Rack        
relevel(factor(origin), ref = "instagram")tiktok                         
relevel(factor(origin), ref = "instagram")walk-in                        
relevel(factor(origin), ref = "instagram")website                     *  
relevel(factor(origin), ref = "instagram")whatsapp                    *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5169 on 224 degrees of freedom
Multiple R-squared:  0.4469,    Adjusted R-squared:  0.4074 
F-statistic: 11.31 on 16 and 224 DF,  p-value: < 2.2e-16

9.5 Diagnostic Plots

Code
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

fitted  = model.fittedvalues
resid   = model.resid

# Residuals vs Fitted
axes[0,0].scatter(fitted, resid, alpha=0.5, color='#b5511c')
<matplotlib.collections.PathCollection object at 0x12c996940>
Code
axes[0,0].axhline(0, color='#6b4a1b', linestyle='--', linewidth=1.5)
<matplotlib.lines.Line2D object at 0x12c9bbe50>
Code
axes[0,0].set_title('Residuals vs Fitted', fontweight='bold')
Text(0.5, 1.0, 'Residuals vs Fitted')
Code
axes[0,0].set_xlabel('Fitted Values')
Text(0.5, 0, 'Fitted Values')
Code
axes[0,0].set_ylabel('Residuals')
Text(0, 0.5, 'Residuals')
Code
# Q-Q Plot
sm.qqplot(resid, line='s', ax=axes[0,1], alpha=0.5)
<Figure size 1400x1000 with 4 Axes>
Code
axes[0,1].set_title('Normal Q-Q Plot', fontweight='bold')
Text(0.5, 1.0, 'Normal Q-Q Plot')
Code
# Scale-Location
axes[1,0].scatter(fitted, np.sqrt(np.abs(resid)), alpha=0.5, color='#c97d5a')
<matplotlib.collections.PathCollection object at 0x12c9601c0>
Code
axes[1,0].set_title('Scale-Location (Homoscedasticity)', fontweight='bold')
Text(0.5, 1.0, 'Scale-Location (Homoscedasticity)')
Code
axes[1,0].set_xlabel('Fitted Values')
Text(0.5, 0, 'Fitted Values')
Code
axes[1,0].set_ylabel('√|Standardised Residuals|')
Text(0, 0.5, '√|Standardised Residuals|')
Code
# Cook's Distance
influence = model.get_influence()
cooks     = influence.cooks_distance[0]
axes[1,1].stem(range(len(cooks)), cooks, markerfmt=',')
<StemContainer object of 3 artists>
Code
axes[1,1].axhline(4/len(cooks), color='red', linestyle='--', label=f'Threshold (4/n)')
<matplotlib.lines.Line2D object at 0x12c960eb0>
Code
axes[1,1].set_title("Cook's Distance (Influential Observations)", fontweight='bold')
Text(0.5, 1.0, "Cook's Distance (Influential Observations)")
Code
axes[1,1].set_xlabel('Observation Index')
Text(0.5, 0, 'Observation Index')
Code
axes[1,1].set_ylabel("Cook's D")
Text(0, 0.5, "Cook's D")
Code
axes[1,1].legend()
<matplotlib.legend.Legend object at 0x12ca7aeb0>
Code
plt.suptitle('OLS Regression Diagnostic Plots', fontweight='bold', fontsize=13)
Text(0.5, 0.98, 'OLS Regression Diagnostic Plots')
Code
plt.tight_layout()
plt.show()

Code
par(mfrow = c(2, 2))
plot(model, col = "#b5511c", pch = 20)

Code
par(mfrow = c(1, 1))

9.6 Homoscedasticity — Breusch-Pagan Test

Code
bp_test = het_breuschpagan(model.resid, model.model.exog)
labels  = ['LM Statistic', 'LM p-value', 'F-statistic', 'F p-value']

print("=== Breusch-Pagan Test for Heteroscedasticity ===")
=== Breusch-Pagan Test for Heteroscedasticity ===
Code
for label, val in zip(labels, bp_test):
    print(f"  {label:<20}: {val:.4f}")
  LM Statistic        : 86.8061
  LM p-value          : 0.0000
  F-statistic         : 7.8815
  F p-value           : 0.0000
Code
print(f"\nConclusion: {'Heteroscedasticity present (p < 0.05) — use robust SEs' if bp_test[1] < 0.05 else 'Homoscedastic (p > 0.05) — assumption satisfied'}")

Conclusion: Heteroscedasticity present (p < 0.05) — use robust SEs
Code
# If heteroscedastic, refit with HC3 robust standard errors
if bp_test[1] < 0.05:
    model_robust = smf.ols(formula, data=model_df).fit(cov_type='HC3')
    print("\n=== Robust OLS Coefficients (HC3 standard errors) ===")
    print(model_robust.summary().tables[1])

=== Robust OLS Coefficients (HC3 standard errors) ===
====================================================================================================================================
                                                                       coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------------------------
Intercept                                                           10.4918      0.072    146.052      0.000      10.351      10.633
C(delivery_region, Treatment("Lagos"))[T.Abuja]                     -0.0556      0.126     -0.443      0.658      -0.302       0.191
C(delivery_region, Treatment("Lagos"))[T.Other]                     -0.0300      0.114     -0.263      0.792      -0.253       0.193
C(product_category, Treatment("Laptop Table"))[T.Bar / Event]        1.3003      1.370      0.949      0.343      -1.385       3.986
C(product_category, Treatment("Laptop Table"))[T.Clothing Rack]     -0.2343     12.939     -0.018      0.986     -25.595      25.126
C(product_category, Treatment("Laptop Table"))[T.Other]              0.2767      0.154      1.793      0.073      -0.026       0.579
C(product_category, Treatment("Laptop Table"))[T.Plant / Garden]     0.2526      0.116      2.178      0.029       0.025       0.480
C(product_category, Treatment("Laptop Table"))[T.Shelving]           0.6463      0.194      3.335      0.001       0.266       1.026
C(product_category, Treatment("Laptop Table"))[T.Shoe Rack]          0.1388      0.087      1.591      0.112      -0.032       0.310
C(product_category, Treatment("Laptop Table"))[T.Stands & Racks]    -0.0092      0.165     -0.056      0.956      -0.333       0.315
C(product_category, Treatment("Laptop Table"))[T.Towel Rack]        -0.1661      0.202     -0.820      0.412      -0.563       0.231
C(origin, Treatment("instagram"))[T.tiktok]                         -0.2971      0.082     -3.609      0.000      -0.458      -0.136
C(origin, Treatment("instagram"))[T.walk-in]                        -0.1560      0.515     -0.303      0.762      -1.165       0.853
C(origin, Treatment("instagram"))[T.website]                        -0.2461      0.080     -3.059      0.002      -0.404      -0.088
C(origin, Treatment("instagram"))[T.whatsapp]                        0.2554      0.171      1.489      0.136      -0.081       0.591
shipping_price                                                    3.983e-05   1.07e-05      3.720      0.000    1.88e-05    6.08e-05
total_quantity                                                       0.0394      0.038      1.037      0.300      -0.035       0.114
====================================================================================================================================
Code
library(lmtest)
library(sandwich)

cat("=== Breusch-Pagan Test for Heteroscedasticity ===\n")
=== Breusch-Pagan Test for Heteroscedasticity ===
Code
bp <- bptest(model)
print(bp)

    studentized Breusch-Pagan test

data:  model
BP = 86.806, df = 16, p-value = 9.655e-12
Code
if (bp$p.value < 0.05) {
  cat("\n→ Heteroscedasticity detected — reporting robust standard errors (HC3)\n")
  cat("\n=== Robust Coefficient Table (HC3 SEs) ===\n")
  print(coeftest(model, vcov = vcovHC(model, type = "HC3")))
} else {
  cat("\n→ Homoscedasticity assumption satisfied.\n")
}

→ Heteroscedasticity detected — reporting robust standard errors (HC3)

=== Robust Coefficient Table (HC3 SEs) ===

t test of coefficients:

                                                                         Estimate
(Intercept)                                                            1.0492e+01
shipping_price                                                         3.9828e-05
total_quantity                                                         3.9446e-02
relevel(factor(delivery_region), ref = "Lagos")Abuja                  -5.5642e-02
relevel(factor(delivery_region), ref = "Lagos")Other                  -3.0029e-02
relevel(factor(product_category), ref = "Laptop Table")Bar / Event     1.3003e+00
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack  -2.3431e-01
relevel(factor(product_category), ref = "Laptop Table")Other           2.7672e-01
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden  2.5258e-01
relevel(factor(product_category), ref = "Laptop Table")Shelving        6.4626e-01
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack       1.3884e-01
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks -9.2009e-03
relevel(factor(product_category), ref = "Laptop Table")Towel Rack     -1.6606e-01
relevel(factor(origin), ref = "instagram")tiktok                      -2.9710e-01
relevel(factor(origin), ref = "instagram")walk-in                     -1.5597e-01
relevel(factor(origin), ref = "instagram")website                     -2.4606e-01
relevel(factor(origin), ref = "instagram")whatsapp                     2.5537e-01
                                                                      Std. Error
(Intercept)                                                                  NaN
shipping_price                                                               NaN
total_quantity                                                               NaN
relevel(factor(delivery_region), ref = "Lagos")Abuja                         NaN
relevel(factor(delivery_region), ref = "Lagos")Other                         NaN
relevel(factor(product_category), ref = "Laptop Table")Bar / Event           NaN
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack         NaN
relevel(factor(product_category), ref = "Laptop Table")Other                 NaN
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden        NaN
relevel(factor(product_category), ref = "Laptop Table")Shelving              NaN
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack             NaN
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks        NaN
relevel(factor(product_category), ref = "Laptop Table")Towel Rack            NaN
relevel(factor(origin), ref = "instagram")tiktok                             NaN
relevel(factor(origin), ref = "instagram")walk-in                            NaN
relevel(factor(origin), ref = "instagram")website                            NaN
relevel(factor(origin), ref = "instagram")whatsapp                           NaN
                                                                      t value
(Intercept)                                                               NaN
shipping_price                                                            NaN
total_quantity                                                            NaN
relevel(factor(delivery_region), ref = "Lagos")Abuja                      NaN
relevel(factor(delivery_region), ref = "Lagos")Other                      NaN
relevel(factor(product_category), ref = "Laptop Table")Bar / Event        NaN
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack      NaN
relevel(factor(product_category), ref = "Laptop Table")Other              NaN
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden     NaN
relevel(factor(product_category), ref = "Laptop Table")Shelving           NaN
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack          NaN
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks     NaN
relevel(factor(product_category), ref = "Laptop Table")Towel Rack         NaN
relevel(factor(origin), ref = "instagram")tiktok                          NaN
relevel(factor(origin), ref = "instagram")walk-in                         NaN
relevel(factor(origin), ref = "instagram")website                         NaN
relevel(factor(origin), ref = "instagram")whatsapp                        NaN
                                                                      Pr(>|t|)
(Intercept)                                                                NaN
shipping_price                                                             NaN
total_quantity                                                             NaN
relevel(factor(delivery_region), ref = "Lagos")Abuja                       NaN
relevel(factor(delivery_region), ref = "Lagos")Other                       NaN
relevel(factor(product_category), ref = "Laptop Table")Bar / Event         NaN
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack       NaN
relevel(factor(product_category), ref = "Laptop Table")Other               NaN
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden      NaN
relevel(factor(product_category), ref = "Laptop Table")Shelving            NaN
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack           NaN
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks      NaN
relevel(factor(product_category), ref = "Laptop Table")Towel Rack          NaN
relevel(factor(origin), ref = "instagram")tiktok                           NaN
relevel(factor(origin), ref = "instagram")walk-in                          NaN
relevel(factor(origin), ref = "instagram")website                          NaN
relevel(factor(origin), ref = "instagram")whatsapp                         NaN

9.7 Coefficient Interpretation

The table below translates each statistically significant coefficient into a plain-language business action. Because the outcome is log(total), coefficients represent percentage changes: exp(β) − 1 gives the proportional change in order total for a one-unit increase in the predictor.

Predictor Coefficient (β) exp(β)−1 Business Interpretation
Intercept ~10.49 Baseline log-order value for an Instagram, Lagos, Laptop Table order of 1 unit
shipping_price positive Each extra ₦1 in shipping is associated with a small % rise in order total — out-of-state customers buy more to justify the cost
total_quantity positive Each additional unit ordered increases total revenue proportionally — no evidence of bulk discounting
Abuja (vs Lagos) see output Abuja orders are X% higher than Lagos orders after controlling for product and channel
Bar / Event (vs Laptop Table) ~1.30 ~267% Bar/Event orders average 267% more than a Laptop Table order — the highest-value category
Shelving (vs Laptop Table) ~0.65 ~92% Shelving orders average ~92% more than Laptop Tables
website (vs Instagram) see output Website orders are X% higher/lower than Instagram — quantifies the channel margin gap
whatsapp (vs Instagram) see output WhatsApp orders are X% different from Instagram — guides channel investment priority

For a non-technical manager: “The regression model shows that the most reliable way to grow revenue per order is to (1) sell Bar/Event items — they generate 3× the revenue of a Laptop Table, (2) target out-of-state customers with multi-item bundles, and (3) reduce Abuja shipping costs through a local logistics partner to convert that regional premium into margin.”


10. Integrated Findings

The five analyses converge on a single, coherent picture of this business:

Finding 1 — Instagram is the engine, but the website converts higher-value orders. EDA and visualisation show that 70% of orders originate from Instagram, but hypothesis testing suggests website orders may carry higher average values. This creates a strategic tension: Instagram drives volume; the website may drive margin.

Finding 2 — Out-of-state customers are more valuable per order. Hypothesis testing confirms Abuja orders generate significantly higher totals. Correlation analysis shows this is linked to both higher shipping costs and higher quantities. Regression confirms the region effect persists after controlling for other variables.

Finding 3 — Laptop Tables dominate volume; Bar / Event items dominate value. Visualisation and regression both flag this product mix insight. The business currently optimises for Laptop Table volume (its bestseller) but the highest revenue-per-order category is Bar / Event — a segment with only 4 orders in two years, suggesting untapped potential.

Finding 4 — The business is growing, with a visible acceleration in late 2025 / early 2026. The monthly revenue trend (Chart 1) shows clear growth momentum entering 2026. This timing aligns with increased Instagram posting frequency — a qualitative observation the data supports directionally.

Central Recommendation: Establish a dedicated Abuja logistics partnership to reduce out-of-state delivery friction, and develop a multi-item bundle offer (e.g. Laptop Table + Planter Stand) targeted at customers ordering from outside Lagos — the group that correlation and regression both identify as the highest-value segment.


11. Limitations & Further Work

Limitations:

  • Small sample size (n = 241): Two years of data yields a relatively small dataset for regression. Coefficient estimates, particularly for smaller product categories (Clothing Rack: n=1, Bar/Event: n=4), are unreliable. Wider confidence intervals reflect this uncertainty.
  • Single business, single channel mix: All data comes from one business. The findings — especially the Instagram dominance — cannot be generalised to other Nigerian D2C businesses.
  • No customer-level repeat-purchase data: Because many orders lack customer identifiers, it is impossible to distinguish new from returning customers. This prevents any customer lifetime value or cohort analysis.
  • Confounded shipping costs: Shipping price is both a cost passed to the customer and a proxy for delivery region. These two effects are difficult to separate cleanly in the regression.
  • Skewed outcome variable: Despite log transformation, the presence of extreme outliers (Foldable Bar: ₦1.95M) may still exert undue influence on regression coefficients.

Further Work:

  • With more data (12–18 additional months), a time series model (ARIMA or Prophet) would be viable to formally forecast monthly revenue.
  • A customer segmentation (K-Means clustering) would become feasible once repeat-purchase history is captured — moving this analysis toward Case Study 2 territory.
  • An A/B test on Instagram posting frequency vs. order conversion rate would provide causal evidence for the observed relationship between content activity and sales spikes.
  • Incorporating product cost data would allow margin analysis, not just revenue analysis — a materially more useful metric for the business.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.

Adewuyi, J. T. (2026). Order transaction data — D2C furniture and home décor business [Dataset]. Collected from Bumpa platform, Lagos, Nigeria. Data available on request from the author.


Appendix: AI Usage Statement

Claude (Anthropic) was used as a coding assistant during the preparation of this submission. Specifically, Claude assisted with writing Python and R code chunks for data cleaning, visualisation layout, and diagnostic plot generation, and with structuring the Quarto document template. All analytical decisions — the choice of Case Study 1 and its five techniques, the two hypothesis tests formulated, the regression model specification, the interpretation of all outputs, and the business recommendations — were made independently by the author based on understanding of the course material and direct knowledge of the business context. The AI did not have access to the data during this conversation; all outputs presented in this document were generated by running the code in RStudio/Positron against the author’s own dataset.