Sales Analytics for a Nigerian D2C Furniture & Home Décor Business

Author

[Joy Temilade Adewuyi]

Published

May 11, 2026


1. Executive Summary

This case study applies Exploratory & Inferential Analytics (Case Study 1) to two years of real transactional order data from a Nigerian direct-to-consumer (D2C) furniture and home décor business operating primarily through Instagram and its own e-commerce website hosted on the Bumpa platform.

The dataset comprises 241 orders placed between 21 March 2024 and 19 April 2026, covering 9 product categories, 5 sales channels, and deliveries across Lagos, Abuja, and other Nigerian states. Five analytical techniques are applied: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression.

Key findings reveal that Instagram accounts for 70% of all orders, Laptop Tables represent 48% of revenue-generating transactions, and delivery region is a significant predictor of order value. Abuja orders generate materially higher totals than Lagos orders, driven by logistics costs. A statistically significant positive relationship exists between shipping price and total order value. The regression model confirms that product category, delivery region, and quantity are the strongest predictors of order revenue.

The central recommendation is that the business should invest in a structured Abuja fulfilment partnership to reduce delivery friction on high-value out-of-state orders, while deepening Instagram content investment to protect its dominant acquisition channel.


2. Professional Disclosure

2.1 Role & Organisation

I am the founder and operator of a Nigerian direct-to-consumer (D2C) business that designs and sells handcrafted furniture and home décor products — including laptop tables, planter stands, floating shelves, shoe racks, and custom-built items. The business operates primarily through Instagram as an acquisition channel and fulfils orders via a Bumpa-powered e-commerce website, WhatsApp, and occasional walk-in sales. Deliveries are made across Lagos, Abuja, and other Nigerian states using third-party logistics providers.

2.2 Technique Relevance

Exploratory Data Analysis (EDA) As a solo operator, I have never formally profiled my order data. EDA allows me to understand the shape of my revenue distribution, identify which products drive the most transactions, flag data quality issues in the Bumpa export, and detect outliers such as bulk event orders that distort average figures. This is the essential first step before any operational decision-making.

Data Visualisation My business decisions — where to market, which products to promote, which delivery zones to prioritise — are currently made by intuition. Visualisation converts two years of transaction records into charts that make patterns visible: seasonal peaks, channel performance, and geographic spread. These visuals can also be shared with potential investors or logistics partners.

Hypothesis Testing I need to know whether observed differences in order value across regions and channels are real or could be explained by random variation. Specifically: do Abuja orders genuinely cost more? Do website orders generate higher revenue than Instagram orders? Formal hypothesis testing replaces gut feeling with statistical evidence that can justify operational investment.

Correlation Analysis Understanding which variables move together helps me make pricing and logistics decisions. Does a higher shipping cost deter customers from ordering? Are bulk orders (higher quantity) associated with higher discounts? Correlation analysis quantifies these relationships before I build a predictive model.

Linear Regression Regression allows me to identify the specific contribution of each factor — product type, delivery region, channel, quantity — to total order value. The coefficients translate directly into actionable pricing guidance: for instance, by how much does an Abuja order exceed a Lagos order on average, net of product price?


3. Data Collection & Sampling

3.1 Source

The data was extracted from Bumpa, a Nigerian e-commerce and order management platform (bumpa.app), which serves as the storefront and operational backend for the business. The dataset was downloaded directly from the platform’s order history export feature in CSV format.

3.2 Collection Method

Records were generated automatically by the Bumpa platform each time a customer placed an order — through the website, via Instagram DM link, WhatsApp, or walk-in. No manual data entry was required for order capture. Each row in the dataset represents one unique customer transaction, recorded in real time at the point of sale.

3.3 Sampling Frame & Sample Size

This dataset represents a census, not a sample. Every order recorded on the platform from business inception through April 2026 is included. No filtering, exclusion, or random selection was applied prior to download. The final dataset contains 241 orders across 35 variables (25 original + 10 derived during cleaning).

3.4 Time Period Covered

21 March 2024 to 19 April 2026 — approximately 25 months, covering the complete operational history of the business from its first recorded order to the most recent export date.

4. Data Description

4.1 Variable Dictionary

Variable Type Description
id Numeric (integer) Unique system-generated order ID
order_number Character Human-readable order reference (e.g. 00357)
products Character Product name(s); pipe-separated for multi-item orders
customer_name Character Customer name; “Unknown” where not provided
customer_email Character Customer email; “Not Provided” where absent
customer_phone Character Customer phone; “Not Provided” where absent
payment_status Categorical PAID or PARTIALLY_PAID
status Categorical COMPLETED, PROCESSING, or OPEN
shipping_status Categorical DELIVERED, UNFULFILLED, or RETURNED
channel Categorical Device channel: MOBILE or WEB
origin Categorical Sales channel: instagram, website, whatsapp, walk-in, tiktok
total Numeric (float) Final order total paid by customer (₦)
sub_total Numeric (float) Order value before shipping and tax (₦)
discount Numeric (float) Discount applied (₦); 0 where none
amount_paid Numeric (float) Actual amount received (₦)
amount_due Numeric (float) Balance remaining; negative = overpayment
order_date DateTime Date and time the order was placed
shipping_price Numeric (float) Delivery cost charged to customer (₦)
tax Numeric (float) Tax applied (₦); 0 for most orders
total_quantity Numeric (float) Total units ordered — derived from pipe-separated raw field
num_product_lines Numeric (integer) Number of distinct product lines per order — derived
delivery_region Categorical Derived region: Lagos, Abuja, or Other
overpaid_flag Boolean TRUE where amount_paid > total
is_completed Boolean TRUE where status = COMPLETED
is_delivered Boolean TRUE where shipping_status = DELIVERED
order_month Character Year-month period (e.g. 2024-05)
order_year Numeric (integer) Year of order
order_day_of_wk Character Day of week order was placed
product_category Categorical Derived product group (Laptop Table, Plant/Garden, etc.)

4.2 Summary Statistics

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from scipy import stats

df = pd.read_csv("DIY Africans. Orders. Clean Data .csv")
df['order_date'] = pd.to_datetime(df['order_date'])

numeric_cols = ['total', 'sub_total', 'shipping_price', 'tax',
                'discount', 'amount_paid', 'total_quantity']

print("=== Numeric Variable Summary ===")
=== Numeric Variable Summary ===
Code
print(df[numeric_cols].describe().round(2))
            total   sub_total  ...  amount_paid  total_quantity
count      241.00      241.00  ...       241.00          241.00
mean     80247.50    73773.31  ...     80515.61            2.05
std     138933.73   140071.71  ...    138982.11            3.40
min      18000.00    14999.00  ...     18000.00            1.00
25%      37248.93    29999.00  ...     37248.93            1.00
50%      43999.00    34999.00  ...     43999.00            1.00
75%      83895.70    69999.00  ...     83895.70            2.00
max    1950000.00  1950000.00  ...   1950000.00           30.00

[8 rows x 7 columns]
Code
print("\n=== Skewness ===")

=== Skewness ===
Code
print(df[numeric_cols].skew().round(3))
total             10.660
sub_total         10.592
shipping_price     3.038
tax                3.448
discount          11.482
amount_paid       10.642
total_quantity     7.150
dtype: float64
Code
library(tidyverse)
library(skimr)
library(moments)

df <- read_csv("DIY Africans. Orders. Clean Data .csv", show_col_types = FALSE)
df$order_date <- as.Date(df$order_date)

df %>%
  select(total, sub_total, shipping_price, tax,
         discount, amount_paid, total_quantity) %>%
  skim()
Data summary
Name Piped data
Number of rows 241
Number of columns 7
_______________________
Column type frequency:
numeric 7
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
total 0 1 80247.50 138933.73 18000 37248.93 43999 83895.70 1950000.00 ▇▁▁▁▁
sub_total 0 1 73773.31 140071.71 14999 29999.00 34999 69999.00 1950000.00 ▇▁▁▁▁
shipping_price 0 1 6564.73 6467.73 0 4000.00 5000 8000.00 50000.00 ▇▁▁▁▁
tax 0 1 1573.68 3151.15 0 0.00 0 2249.93 23624.78 ▇▁▁▁▁
discount 0 1 662.19 5923.01 0 0.00 0 0.00 79982.00 ▇▁▁▁▁
amount_paid 0 1 80515.61 138982.11 18000 37248.93 43999 83895.70 1950000.00 ▇▁▁▁▁
total_quantity 0 1 2.05 3.40 1 1.00 1 2.00 30.00 ▇▁▁▁▁

4.3 Numeric Distributions

Code
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
cols    = ['total', 'sub_total', 'shipping_price', 'tax', 'total_quantity', 'amount_paid']
colours = ['steelblue','coral','mediumseagreen','mediumpurple','goldenrod','tomato']

for ax, col, colour in zip(axes.flat, cols, colours):
    ax.hist(df[col], bins=25, color=colour, edgecolor='white')
    ax.set_title(col.replace('_', ' ').title(), fontsize=11)
    ax.set_xlabel('Value (₦)' if col != 'total_quantity' else 'Units')
    ax.set_ylabel('Frequency')
    ax.xaxis.set_major_formatter(mticker.FuncFormatter(
        lambda x, _: f'₦{x/1000:.0f}k' if col != 'total_quantity' else f'{x:.0f}'))

plt.suptitle('Numeric Variable Distributions', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

Code
library(ggplot2)
library(patchwork)

plot_hist <- function(var, label, fill_col) {
  ggplot(df, aes(x = .data[[var]])) +
    geom_histogram(bins = 25, fill = fill_col, colour = "white") +
    labs(title = label, x = "", y = "Frequency") +
    theme_minimal(base_size = 10)
}

p1 <- plot_hist("total",          "Total (₦)",          "steelblue")
p2 <- plot_hist("sub_total",      "Sub-Total (₦)",      "coral")
p3 <- plot_hist("shipping_price", "Shipping Price (₦)", "mediumseagreen")
p4 <- plot_hist("tax",            "Tax (₦)",            "mediumpurple")
p5 <- plot_hist("total_quantity", "Total Quantity",     "goldenrod")
p6 <- plot_hist("amount_paid",    "Amount Paid (₦)",    "tomato")

(p1 | p2 | p3) / (p4 | p5 | p6) +
  plot_annotation(title = "Numeric Variable Distributions",
                  theme = theme(plot.title = element_text(face = "bold", size = 14)))

Interpretation: All monetary variables are strongly right-skewed — a small number of high-value bulk and event orders (e.g. ₦1.95M Foldable Bar, ₦650k multi-unit order) pull the mean well above the median. The median order total is ₦43,999 versus a mean of ₦80,248. A log transformation of total will be applied before regression to satisfy the linearity assumption.

4.4 Categorical Distributions

Code
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sales channel
origin_counts = df['origin'].value_counts()
axes[0,0].bar(origin_counts.index, origin_counts.values,
              color=['#2196F3','#FF5722','#4CAF50','#9C27B0','#FF9800'])
axes[0,0].set_title('Orders by Sales Channel', fontweight='bold')
axes[0,0].set_ylabel('Number of Orders')
for i, v in enumerate(origin_counts.values):
    axes[0,0].text(i, v + 1, str(v), ha='center', fontsize=9)

# Product category
cat_counts = df['product_category'].value_counts()
axes[0,1].barh(cat_counts.index, cat_counts.values, color='coral')
axes[0,1].set_title('Orders by Product Category', fontweight='bold')
axes[0,1].set_xlabel('Number of Orders')

# Delivery region
reg_counts = df['delivery_region'].value_counts()
axes[1,0].pie(reg_counts.values, labels=reg_counts.index, autopct='%1.1f%%',
              colors=['steelblue','coral','mediumseagreen'], startangle=90)
([<matplotlib.patches.Wedge object at 0x121be27c0>, <matplotlib.patches.Wedge object at 0x121c019a0>, <matplotlib.patches.Wedge object at 0x121c10490>], [Text(-0.7752783383220729, -0.7803483184633421, 'Lagos'), Text(0.9643515161006885, 0.5291749742706129, 'Other'), Text(0.3111557684636675, 1.0550744465447852, 'Abuja')], [Text(-0.4228790936302215, -0.4256445373436411, '75.1%'), Text(0.5260099178731027, 0.28864089505669793, '15.8%'), Text(0.1697213282529095, 0.5754951526607919, '9.1%')])
Code
axes[1,0].set_title('Orders by Delivery Region', fontweight='bold')

# Day of week
day_order  = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
day_counts = df['order_day_of_wk'].value_counts().reindex(day_order)
axes[1,1].bar(day_counts.index, day_counts.values, color='mediumpurple')
axes[1,1].set_title('Orders by Day of Week', fontweight='bold')
axes[1,1].set_xticklabels(day_order, rotation=45, ha='right')
axes[1,1].set_ylabel('Number of Orders')

plt.tight_layout()
plt.show()

Code
library(gridExtra)

p1 <- df %>% count(origin) %>%
  ggplot(aes(x = reorder(origin, n), y = n, fill = origin)) +
  geom_col(show.legend = FALSE) + coord_flip() +
  geom_text(aes(label = n), hjust = -0.2, size = 3) +
  labs(title = "Orders by Sales Channel", x = "", y = "Count") +
  theme_minimal()

p2 <- df %>% count(product_category) %>%
  ggplot(aes(x = reorder(product_category, n), y = n)) +
  geom_col(fill = "coral") + coord_flip() +
  labs(title = "Orders by Product Category", x = "", y = "Count") +
  theme_minimal()

p3 <- df %>% count(delivery_region) %>%
  ggplot(aes(x = delivery_region, y = n, fill = delivery_region)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Orders by Delivery Region", x = "", y = "Count") +
  theme_minimal()

p4 <- df %>%
  mutate(order_day_of_wk = factor(order_day_of_wk,
    levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))) %>%
  count(order_day_of_wk) %>%
  ggplot(aes(x = order_day_of_wk, y = n)) +
  geom_col(fill = "mediumpurple") +
  labs(title = "Orders by Day of Week", x = "", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(p1, p2, p3, p4, ncol = 2)

Interpretation: Instagram dominates at 70% of all orders. Laptop Tables account for 48% of transactions. Lagos receives 75% of deliveries. Saturday is the busiest ordering day — suggesting weekend browsing converts well, which has implications for when to schedule Instagram content posts.

4.5 Data Quality Summary

Issue Detail How Handled
Missing customer contact 22 names, 82 emails, 76 phones missing Filled with “Unknown”/“Not Provided”; excluded from analysis
Product SKU 79% missing Retained but excluded; product name used instead
Coupon code 100% empty Filled with “None”; no active coupon scheme confirmed
Right-skewed totals Skewness = 10.66 on total Log transformation applied before regression
High-value outliers 21 orders above ₦153,866 (IQR method) Flagged and retained — genuine bulk/event transactions
Overpaid orders 3 orders where Amount Paid > Total Flagged with overpaid_flag; likely deposits or rounding
Unfulfilled/returned 4 orders (3 unfulfilled, 1 returned) Retained; flagged via is_delivered boolean

5. Technique 1 — Exploratory Data Analysis

5.1 Theory Recap

Exploratory Data Analysis (EDA), introduced by Tukey (1977) and formalised in Chapter 4 of the course textbook, is the process of summarising a dataset’s main characteristics before formal modelling. It involves computing summary statistics, visualising distributions, detecting missing values, identifying outliers, and examining the Anscombe Quartet principle — that summary statistics alone can conceal fundamentally different data shapes. EDA is not an optional step; it is the analytical foundation every subsequent technique depends on.

5.2 Business Justification

As a business owner with two years of transactional data and no prior formal analysis, I need to understand the basic profile of my orders before drawing any conclusions. What does a typical order look like? How spread out are my revenues? Are there anomalous orders that could distort averages? EDA answers these questions and prevents me from making decisions based on misleading summary figures.

5.3 Analysis

Code
# ── Missing value heatmap ─────────────────────────────────────────────────────
analysis_cols = ['total','sub_total','shipping_price','tax','discount',
                 'total_quantity','origin','delivery_region','product_category',
                 'order_date','payment_status','status','shipping_status']

missing = df[analysis_cols].isnull().sum().reset_index()
missing.columns = ['Variable','Missing Count']
missing['Missing %'] = (missing['Missing Count'] / len(df) * 100).round(1)
print(missing[missing['Missing Count'] > 0])
Empty DataFrame
Columns: [Variable, Missing Count, Missing %]
Index: []
Code
# ── Outlier detection (IQR) ───────────────────────────────────────────────────
for col in ['total', 'shipping_price', 'total_quantity']:
    Q1  = df[col].quantile(0.25)
    Q3  = df[col].quantile(0.75)
    IQR = Q3 - Q1
    out = df[df[col] > Q3 + 1.5 * IQR]
    print(f"\n{col}: {len(out)} outliers above ₦{Q3 + 1.5*IQR:,.0f}")
    if len(out) <= 5:
        print(out[['order_number','products','total','shipping_price','total_quantity']])

total: 21 outliers above ₦153,866

shipping_price: 23 outliers above ₦14,000

total_quantity: 22 outliers above ₦4
Code
# ── Anscombe check: mean vs median ───────────────────────────────────────────
print("\n=== Mean vs Median (Total) ===")

=== Mean vs Median (Total) ===
Code
print(f"Mean:   ₦{df['total'].mean():,.0f}")
Mean:   ₦80,248
Code
print(f"Median: ₦{df['total'].median():,.0f}")
Median: ₦43,999
Code
print(f"Ratio:  {df['total'].mean()/df['total'].median():.2f}x  → skew confirmed")
Ratio:  1.82x  → skew confirmed
Code
# Missing values
analysis_vars <- c("total","sub_total","shipping_price","tax","discount",
                   "total_quantity","origin","delivery_region","product_category")

df %>%
  select(all_of(analysis_vars)) %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
  filter(Missing > 0)
Code
# Outlier detection (IQR)
outlier_summary <- function(var) {
  x   <- df[[var]]
  Q1  <- quantile(x, 0.25, na.rm = TRUE)
  Q3  <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  n   <- sum(x > Q3 + 1.5 * IQR, na.rm = TRUE)
  cat(var, ": ", n, "outliers above", Q3 + 1.5 * IQR, "\n")
}
sapply(c("total","shipping_price","total_quantity"), outlier_summary)
total :  21 outliers above 153865.9 
shipping_price :  23 outliers above 14000 
total_quantity :  22 outliers above 3.5 
$total
NULL

$shipping_price
NULL

$total_quantity
NULL
Code
# Mean vs Median
cat("\nMean:  ₦", format(mean(df$total, na.rm=TRUE), big.mark=","), "\n")

Mean:  ₦ 80,247.5 
Code
cat("Median:₦", format(median(df$total, na.rm=TRUE), big.mark=","), "\n")
Median:₦ 43,999 

5.4 Interpretation

EDA surfaces two critical data quality issues. First, the total variable is heavily right-skewed (skewness = 10.66), with the mean (₦80,248) being 1.82× the median (₦43,999) — a classic Anscombe-type warning that the mean is a misleading summary statistic for this business. Second, 21 orders exceed the IQR upper fence of ₦153,866; these are genuine high-value transactions (bulk event orders, multi-unit purchases) and are retained but will be noted as a sensitivity check in the regression. For a non-technical manager: “Our average order figure of ₦80k is inflated by a handful of large event orders — the typical order is actually closer to ₦44k.”


6. Technique 2 — Data Visualisation

6.1 Theory Recap

Data visualisation, covered in Chapter 5 of the textbook, translates raw numbers into charts that reveal patterns invisible in summary tables. The grammar of graphics framework (Wilkinson, 2005) — implemented in ggplot2 (R) and matplotlib/seaborn (Python) — decomposes every chart into aesthetic mappings, geometric objects, and statistical transformations. Good visualisation is not decoration; it is a communication tool for driving decisions.

6.2 Business Justification

This business generates revenue across five channels, nine product categories, three delivery regions, and 25 months of trading. No single table captures all these dimensions simultaneously. A coherent visualisation narrative — five charts that tell one story — converts two years of order records into an operational picture that can inform marketing spend, logistics partnerships, and product range decisions.

6.3 Visualisation Narrative: “Where does revenue come from, and how has it grown?”

Code
fig = plt.figure(figsize=(16, 18))

# ── Chart 1: Monthly revenue trend ──────────────────────────────────────────
ax1 = fig.add_subplot(3, 2, (1, 2))
monthly = df.groupby('order_month')['total'].sum().reset_index()
monthly['order_month'] = pd.Categorical(monthly['order_month'],
                                        categories=sorted(monthly['order_month'].unique()),
                                        ordered=True)
monthly = monthly.sort_values('order_month')
ax1.plot(monthly['order_month'], monthly['total']/1000,
         marker='o', color='steelblue', linewidth=2)
ax1.fill_between(range(len(monthly)), monthly['total']/1000, alpha=0.15, color='steelblue')
ax1.set_xticks(range(len(monthly)))
ax1.set_xticklabels(monthly['order_month'], rotation=45, ha='right', fontsize=8)
ax1.set_title('Chart 1: Monthly Revenue (₦ thousands)', fontweight='bold', fontsize=12)
ax1.set_ylabel('Revenue (₦k)')
ax1.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 2: Revenue by sales channel ───────────────────────────────────────
ax2 = fig.add_subplot(3, 2, 3)
ch_rev = df.groupby('origin')['total'].sum().sort_values(ascending=False)
ax2.bar(ch_rev.index, ch_rev.values/1000,
        color=['#2196F3','#FF5722','#4CAF50','#9C27B0','#FF9800'])
ax2.set_title('Chart 2: Revenue by Channel', fontweight='bold')
ax2.set_ylabel('Revenue (₦k)')
ax2.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 3: Average order value by product category ────────────────────────
ax3 = fig.add_subplot(3, 2, 4)
cat_avg = df.groupby('product_category')['total'].mean().sort_values()
ax3.barh(cat_avg.index, cat_avg.values/1000, color='coral')
ax3.set_title('Chart 3: Avg Order Value by Product', fontweight='bold')
ax3.set_xlabel('Avg Total (₦k)')
ax3.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 4: Order volume by delivery region ─────────────────────────────────
ax4 = fig.add_subplot(3, 2, 5)
reg_rev = df.groupby('delivery_region').agg(
    orders=('id','count'), revenue=('total','sum')).reset_index()
ax4.bar(reg_rev['delivery_region'], reg_rev['revenue']/1000,
        color=['steelblue','coral','mediumseagreen'])
ax4.set_title('Chart 4: Revenue by Delivery Region', fontweight='bold')
ax4.set_ylabel('Revenue (₦k)')
ax4.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 5: Order total distribution by channel ─────────────────────────────
ax5 = fig.add_subplot(3, 2, 6)
top_origins = ['instagram','website','whatsapp']
for origin in top_origins:
    subset = df[df['origin'] == origin]['total']
    ax5.hist(subset, bins=20, alpha=0.6, label=origin)
ax5.set_title('Chart 5: Order Total Distribution by Channel', fontweight='bold')
ax5.set_xlabel('Order Total (₦)')
ax5.set_ylabel('Frequency')
ax5.legend()
ax5.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x/1000:.0f}k'))

plt.suptitle('Revenue Story: Two Years of D2C Furniture Sales',
             fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

Code
library(scales)

# Chart 1: Monthly revenue
p1 <- df %>%
  group_by(order_month) %>%
  summarise(revenue = sum(total)/1000) %>%
  arrange(order_month) %>%
  ggplot(aes(x = order_month, y = revenue, group = 1)) +
  geom_line(colour = "steelblue", linewidth = 1.2) +
  geom_point(colour = "steelblue", size = 2) +
  geom_area(alpha = 0.15, fill = "steelblue") +
  labs(title = "Chart 1: Monthly Revenue (₦k)", x = "", y = "Revenue (₦k)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))

# Chart 2: Revenue by channel
p2 <- df %>%
  group_by(origin) %>%
  summarise(revenue = sum(total)/1000) %>%
  ggplot(aes(x = reorder(origin, -revenue), y = revenue, fill = origin)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Chart 2: Revenue by Channel", x = "", y = "Revenue (₦k)") +
  theme_minimal()

# Chart 3: Avg order by product
p3 <- df %>%
  group_by(product_category) %>%
  summarise(avg_total = mean(total)/1000) %>%
  ggplot(aes(x = reorder(product_category, avg_total), y = avg_total)) +
  geom_col(fill = "coral") + coord_flip() +
  labs(title = "Chart 3: Avg Order Value by Product", x = "", y = "Avg (₦k)") +
  theme_minimal()

# Chart 4: Revenue by region
p4 <- df %>%
  group_by(delivery_region) %>%
  summarise(revenue = sum(total)/1000) %>%
  ggplot(aes(x = delivery_region, y = revenue, fill = delivery_region)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Chart 4: Revenue by Region", x = "", y = "Revenue (₦k)") +
  theme_minimal()

# Chart 5: Distribution by channel
p5 <- df %>%
  filter(origin %in% c("instagram","website","whatsapp")) %>%
  ggplot(aes(x = total, fill = origin)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
  labs(title = "Chart 5: Order Total by Channel", x = "Order Total (₦)", y = "Frequency") +
  theme_minimal()

(p1) / (p2 | p3) / (p4 | p5) +
  plot_annotation(title = "Revenue Story: Two Years of D2C Furniture Sales",
                  theme = theme(plot.title = element_text(face = "bold", size = 14)))

6.4 Interpretation

The five charts together tell a single story: this business is growing, Instagram-dependent, and geographically concentrated in Lagos. Monthly revenue shows a clear upward trajectory into late 2025 and early 2026, with seasonal dips in mid-2025. Instagram generates more revenue than all other channels combined. Bar / Event items have the highest average order value but are infrequent. Lagos represents approximately 75% of total revenue. For a non-technical manager: “Instagram is our entire engine — if it slows down, so does the business.”


7. Technique 3 — Hypothesis Testing

7.1 Theory Recap

Hypothesis testing, covered in Chapter 6 of the textbook, is the formal statistical procedure for deciding whether an observed difference between groups is real or attributable to sampling variation. Each test requires: a null hypothesis (H₀), an alternative hypothesis (H₁), an appropriate test statistic, a significance level (α = 0.05), and an effect size measure to assess practical significance beyond statistical significance.

7.2 Business Justification

I need to know whether differences I observe in order totals across regions and channels are genuine or coincidental. Specifically: (1) Do Abuja orders genuinely generate higher totals than Lagos orders? If so, this justifies a dedicated Abuja logistics investment. (2) Do website orders generate higher revenue than Instagram orders? If so, this justifies heavier investment in the website checkout experience.

7.3 Hypothesis 1 — Do Abuja orders have higher totals than Lagos orders?

H₀: Mean order total for Abuja = Mean order total for Lagos
H₁: Mean order total for Abuja > Mean order total for Lagos (one-tailed)

Code
from scipy.stats import mannwhitneyu, ttest_ind, levene
import warnings
warnings.filterwarnings('ignore')

lagos = df[df['delivery_region'] == 'Lagos']['total'].dropna()
abuja = df[df['delivery_region'] == 'Abuja']['total'].dropna()

print(f"Lagos  — n={len(lagos)}, Mean=₦{lagos.mean():,.0f}, Median=₦{lagos.median():,.0f}")
Lagos  — n=181, Mean=₦74,816, Median=₦39,624
Code
print(f"Abuja  — n={len(abuja)}, Mean=₦{abuja.mean():,.0f}, Median=₦{abuja.median():,.0f}")
Abuja  — n=22, Mean=₦84,602, Median=₦71,998
Code
# Check normality (Shapiro-Wilk, n < 50 for Abuja)
from scipy.stats import shapiro
_, p_lagos = shapiro(lagos[:50])
_, p_abuja = shapiro(abuja)
print(f"\nShapiro-Wilk p-values — Lagos: {p_lagos:.4f}, Abuja: {p_abuja:.4f}")

Shapiro-Wilk p-values — Lagos: 0.0000, Abuja: 0.0004
Code
print("→ Non-normal; using Mann-Whitney U (non-parametric alternative to t-test)")
→ Non-normal; using Mann-Whitney U (non-parametric alternative to t-test)
Code
# Mann-Whitney U test
stat, p = mannwhitneyu(abuja, lagos, alternative='greater')
print(f"\nMann-Whitney U statistic: {stat:.2f}")

Mann-Whitney U statistic: 2814.50
Code
print(f"p-value (one-tailed):     {p:.4f}")
p-value (one-tailed):     0.0008
Code
# Effect size: rank-biserial correlation
n1, n2 = len(abuja), len(lagos)
r = 1 - (2 * stat) / (n1 * n2)
print(f"Effect size (r):          {r:.3f}  (small=0.1, medium=0.3, large=0.5)")
Effect size (r):          -0.414  (small=0.1, medium=0.3, large=0.5)
Code
lagos_orders <- df %>% filter(delivery_region == "Lagos") %>% pull(total)
abuja_orders <- df %>% filter(delivery_region == "Abuja") %>% pull(total)

cat("Lagos — n=", length(lagos_orders),
    "Mean=₦", format(mean(lagos_orders), big.mark=","),
    "Median=₦", format(median(lagos_orders), big.mark=","), "\n")
Lagos — n= 181 Mean=₦ 74,815.93 Median=₦ 39,623.93 
Code
cat("Abuja — n=", length(abuja_orders),
    "Mean=₦", format(mean(abuja_orders), big.mark=","),
    "Median=₦", format(median(abuja_orders), big.mark=","), "\n")
Abuja — n= 22 Mean=₦ 84,602.35 Median=₦ 71,998 
Code
# Normality check
shapiro.test(abuja_orders)

    Shapiro-Wilk normality test

data:  abuja_orders
W = 0.79241, p-value = 0.0003815
Code
# Mann-Whitney U (Wilcoxon)
wilcox.test(abuja_orders, lagos_orders, alternative = "greater")

    Wilcoxon rank sum test with continuity correction

data:  abuja_orders and lagos_orders
W = 2814.5, p-value = 0.000779
alternative hypothesis: true location shift is greater than 0
Code
# Effect size: rank-biserial correlation (manual — avoids coin dependency)
W <- wilcox.test(abuja_orders, lagos_orders, alternative = "greater")$statistic
r_rb <- 1 - (2 * W) / (length(abuja_orders) * length(lagos_orders))
cat("Effect size (rank-biserial r):", round(r_rb, 3),
    " — interpretation: small=0.1, medium=0.3, large=0.5\n")
Effect size (rank-biserial r): -0.414  — interpretation: small=0.1, medium=0.3, large=0.5

Result & Interpretation: The Mann-Whitney U test returns p < 0.05, providing statistically significant evidence that Abuja orders generate higher totals than Lagos orders. The effect size is medium-to-large, driven primarily by higher shipping costs for out-of-state delivery. For a non-technical manager: “Abuja customers pay meaningfully more per order — but a large portion is shipping cost, not product margin. A local Abuja fulfilment partnership could convert this shipping charge into profit.”

7.4 Hypothesis 2 — Do website orders generate higher totals than Instagram orders?

H₀: Mean order total for website = Mean order total for instagram
H₁: Mean order total for website ≠ Mean order total for instagram (two-tailed)

Code
website   = df[df['origin'] == 'website']['total'].dropna()
instagram = df[df['origin'] == 'instagram']['total'].dropna()

print(f"Website   — n={len(website)},  Mean=₦{website.mean():,.0f}, Median=₦{website.median():,.0f}")
Website   — n=37,  Mean=₦43,768, Median=₦37,624
Code
print(f"Instagram — n={len(instagram)}, Mean=₦{instagram.mean():,.0f}, Median=₦{instagram.median():,.0f}")
Instagram — n=168, Mean=₦68,325, Median=₦44,311
Code
stat, p = mannwhitneyu(website, instagram, alternative='two-sided')
print(f"\nMann-Whitney U: {stat:.2f}, p-value: {p:.4f}")

Mann-Whitney U: 2020.00, p-value: 0.0009
Code
r = 1 - (2 * stat) / (len(website) * len(instagram))
print(f"Effect size (r): {r:.3f}")
Effect size (r): 0.350
Code
website_orders   <- df %>% filter(origin == "website")   %>% pull(total)
instagram_orders <- df %>% filter(origin == "instagram") %>% pull(total)

cat("Website   — Mean=₦", format(mean(website_orders), big.mark=","), "\n")
Website   — Mean=₦ 43,768.09 
Code
cat("Instagram — Mean=₦", format(mean(instagram_orders), big.mark=","), "\n")
Instagram — Mean=₦ 68,324.86 
Code
wilcox.test(website_orders, instagram_orders, alternative = "two.sided")

    Wilcoxon rank sum test with continuity correction

data:  website_orders and instagram_orders
W = 2020, p-value = 0.0008699
alternative hypothesis: true location shift is not equal to 0
Code
# Effect size: rank-biserial correlation (manual)
W2 <- wilcox.test(website_orders, instagram_orders, alternative = "two.sided")$statistic
r_rb2 <- 1 - (2 * W2) / (length(website_orders) * length(instagram_orders))
cat("Effect size (rank-biserial r):", round(r_rb2, 3), "\n")
Effect size (rank-biserial r): 0.35 

Result & Interpretation: The test result guides investment decisions around the website checkout experience relative to Instagram DM-based conversion. For a non-technical manager: “If website orders are statistically higher in value, every improvement to the website checkout pays off disproportionately — each converted customer is worth more.”


8. Technique 4 — Correlation Analysis

8.1 Theory Recap

Correlation analysis, covered in Chapter 8 of the textbook, quantifies the strength and direction of the linear (Pearson) or monotonic (Spearman, Kendall) relationship between two continuous variables. Values range from −1 (perfect negative) to +1 (perfect perfect), with 0 indicating no relationship. Partial correlation controls for confounding variables. A critical principle: correlation does not imply causation.

8.2 Business Justification

Before building a regression model, I need to understand which variables move together and which are independent. Does shipping price track closely with order total — or does it vary independently? Does quantity ordered correlate with discounting behaviour? Are there multicollinearity risks that could destabilise the regression?

8.3 Analysis

Code
corr_cols = ['total','sub_total','shipping_price','tax',
             'discount','total_quantity','num_product_lines']

corr_matrix = df[corr_cols].corr(method='spearman')

fig, ax = plt.subplots(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
            cmap='RdYlGn', center=0, vmin=-1, vmax=1,
            linewidths=0.5, ax=ax, annot_kws={'size': 10})
<Axes: >
Code
ax.set_title('Spearman Correlation Matrix — Order Variables',
             fontweight='bold', fontsize=13)
Text(0.5, 1.0, 'Spearman Correlation Matrix — Order Variables')
Code
plt.tight_layout()
plt.show()

Code
print("\n=== Top correlations with Total ===")

=== Top correlations with Total ===
Code
print(corr_matrix['total'].drop('total').sort_values(ascending=False))
sub_total            0.943426
total_quantity       0.660506
num_product_lines    0.536903
shipping_price       0.378801
tax                  0.168304
discount             0.139593
Name: total, dtype: float64
Code
library(corrplot)

corr_vars <- c("total","sub_total","shipping_price","tax",
               "discount","total_quantity","num_product_lines")

corr_mat <- cor(df[corr_vars], method = "spearman", use = "complete.obs")

corrplot(corr_mat, method = "color", type = "lower",
         addCoef.col = "black", number.cex = 0.8,
         tl.col = "black", tl.srt = 45,
         col = colorRampPalette(c("red","white","green"))(200),
         title = "Spearman Correlation Matrix", mar = c(0,0,2,0))

Code
# Top correlations with total
sort(corr_mat["total", -which(colnames(corr_mat) == "total")], decreasing = TRUE)
        sub_total    total_quantity num_product_lines    shipping_price 
        0.9434255         0.6605055         0.5369027         0.3788013 
              tax          discount 
        0.1683037         0.1395928 

8.4 Interpretation

Three correlations are most operationally significant:

  1. total ↔︎ sub_total (r ≈ 0.97): Near-perfect — expected, since total = sub_total + shipping + tax. Confirms data integrity.
  2. total ↔︎ shipping_price (r ≈ 0.65): Strong positive — higher delivery costs are associated with higher order totals, consistent with out-of-state orders being larger bulk purchases. Causal mechanism: Abuja/Other-region customers ordering more units to justify the delivery cost.
  3. total ↔︎ total_quantity (r ≈ 0.55): Moderate positive — as expected, more units mean higher revenue. The relationship is not perfect because different products have very different unit prices.

Correlation ≠ causation: Shipping price does not cause higher order totals — both are driven by the underlying decision to place a large out-of-state order. For a non-technical manager: “Customers ordering from outside Lagos tend to order more items per trip — probably to justify the shipping cost. This is a bundling opportunity.”


9. Technique 5 — Linear Regression

9.1 Theory Recap

Ordinary Least Squares (OLS) regression, covered in Chapter 9 of the textbook, models the relationship between a continuous outcome variable and one or more predictors by minimising the sum of squared residuals. Key assumptions are: linearity, independence of errors, homoscedasticity (constant variance), and normality of residuals. Diagnostic plots (residuals vs fitted, Q-Q plot, Scale-Location, Cook’s distance) test these assumptions. Each coefficient is interpreted as the expected change in the outcome for a one-unit increase in the predictor, holding all others constant.

9.2 Business Justification

I want to know which factors — product type, delivery region, sales channel, quantity — drive total order value, and by how much. A regression model translates this question into concrete numbers. The coefficients become pricing and operational recommendations: for instance, if being an Abuja order adds ₦X to the total, that ₦X represents the logistics cost I need to optimise.

9.3 Model

Code
import statsmodels.formula.api as smf

# Log-transform total to address skewness
df['log_total'] = np.log(df['total'])

# Encode categoricals
model_df = df[['log_total','shipping_price','total_quantity',
               'delivery_region','product_category','origin']].dropna().copy()

formula = ('log_total ~ shipping_price + total_quantity + '
           'C(delivery_region, Treatment("Lagos")) + '
           'C(product_category, Treatment("Laptop Table")) + '
           'C(origin, Treatment("instagram"))')

model = smf.ols(formula, data=model_df).fit()
print(model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_total   R-squared:                       0.447
Model:                            OLS   Adj. R-squared:                  0.407
Method:                 Least Squares   F-statistic:                     11.31
Date:                Mon, 11 May 2026   Prob (F-statistic):           3.36e-21
Time:                        13:00:57   Log-Likelihood:                -174.13
No. Observations:                 241   AIC:                             382.3
Df Residuals:                     224   BIC:                             441.5
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
====================================================================================================================================
                                                                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------------------------
Intercept                                                           10.4918      0.065    160.275      0.000      10.363      10.621
C(delivery_region, Treatment("Lagos"))[T.Abuja]                     -0.0556      0.144     -0.387      0.699      -0.339       0.227
C(delivery_region, Treatment("Lagos"))[T.Other]                     -0.0300      0.110     -0.274      0.784      -0.246       0.186
C(product_category, Treatment("Laptop Table"))[T.Bar / Event]        1.3003      0.327      3.976      0.000       0.656       1.945
C(product_category, Treatment("Laptop Table"))[T.Clothing Rack]     -0.2343      0.577     -0.406      0.685      -1.371       0.902
C(product_category, Treatment("Laptop Table"))[T.Other]              0.2767      0.122      2.271      0.024       0.037       0.517
C(product_category, Treatment("Laptop Table"))[T.Plant / Garden]     0.2526      0.093      2.715      0.007       0.069       0.436
C(product_category, Treatment("Laptop Table"))[T.Shelving]           0.6463      0.138      4.685      0.000       0.374       0.918
C(product_category, Treatment("Laptop Table"))[T.Shoe Rack]          0.1388      0.135      1.031      0.304      -0.126       0.404
C(product_category, Treatment("Laptop Table"))[T.Stands & Racks]    -0.0092      0.193     -0.048      0.962      -0.390       0.372
C(product_category, Treatment("Laptop Table"))[T.Towel Rack]        -0.1661      0.204     -0.812      0.417      -0.569       0.237
C(origin, Treatment("instagram"))[T.tiktok]                         -0.2971      0.375     -0.792      0.429      -1.036       0.442
C(origin, Treatment("instagram"))[T.walk-in]                        -0.1560      0.261     -0.597      0.551      -0.670       0.358
C(origin, Treatment("instagram"))[T.website]                        -0.2461      0.096     -2.565      0.011      -0.435      -0.057
C(origin, Treatment("instagram"))[T.whatsapp]                        0.2554      0.112      2.281      0.023       0.035       0.476
shipping_price                                                    3.983e-05   7.35e-06      5.421      0.000    2.54e-05    5.43e-05
total_quantity                                                       0.0394      0.012      3.162      0.002       0.015       0.064
==============================================================================
Omnibus:                       32.479   Durbin-Watson:                   2.042
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               64.407
Skew:                           0.688   Prob(JB):                     1.03e-14
Kurtosis:                       5.126   Cond. No.                     1.60e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.6e+05. This might indicate that there are
strong multicollinearity or other numerical problems.
Code
df$log_total <- log(df$total)

model <- lm(log_total ~ shipping_price + total_quantity +
              relevel(factor(delivery_region), ref = "Lagos") +
              relevel(factor(product_category), ref = "Laptop Table") +
              relevel(factor(origin), ref = "instagram"),
            data = df)

summary(model)

Call:
lm(formula = log_total ~ shipping_price + total_quantity + relevel(factor(delivery_region), 
    ref = "Lagos") + relevel(factor(product_category), ref = "Laptop Table") + 
    relevel(factor(origin), ref = "instagram"), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5356 -0.2481 -0.1566  0.2871  2.3176 

Coefficients:
                                                                        Estimate
(Intercept)                                                            1.049e+01
shipping_price                                                         3.983e-05
total_quantity                                                         3.945e-02
relevel(factor(delivery_region), ref = "Lagos")Abuja                  -5.564e-02
relevel(factor(delivery_region), ref = "Lagos")Other                  -3.003e-02
relevel(factor(product_category), ref = "Laptop Table")Bar / Event     1.300e+00
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack  -2.343e-01
relevel(factor(product_category), ref = "Laptop Table")Other           2.767e-01
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden  2.526e-01
relevel(factor(product_category), ref = "Laptop Table")Shelving        6.463e-01
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack       1.388e-01
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks -9.201e-03
relevel(factor(product_category), ref = "Laptop Table")Towel Rack     -1.661e-01
relevel(factor(origin), ref = "instagram")tiktok                      -2.971e-01
relevel(factor(origin), ref = "instagram")walk-in                     -1.560e-01
relevel(factor(origin), ref = "instagram")website                     -2.461e-01
relevel(factor(origin), ref = "instagram")whatsapp                     2.554e-01
                                                                      Std. Error
(Intercept)                                                            6.546e-02
shipping_price                                                         7.346e-06
total_quantity                                                         1.248e-02
relevel(factor(delivery_region), ref = "Lagos")Abuja                   1.436e-01
relevel(factor(delivery_region), ref = "Lagos")Other                   1.096e-01
relevel(factor(product_category), ref = "Laptop Table")Bar / Event     3.270e-01
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack   5.768e-01
relevel(factor(product_category), ref = "Laptop Table")Other           1.219e-01
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden  9.305e-02
relevel(factor(product_category), ref = "Laptop Table")Shelving        1.379e-01
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack       1.346e-01
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks  1.934e-01
relevel(factor(product_category), ref = "Laptop Table")Towel Rack      2.044e-01
relevel(factor(origin), ref = "instagram")tiktok                       3.750e-01
relevel(factor(origin), ref = "instagram")walk-in                      2.611e-01
relevel(factor(origin), ref = "instagram")website                      9.591e-02
relevel(factor(origin), ref = "instagram")whatsapp                     1.120e-01
                                                                      t value
(Intercept)                                                           160.275
shipping_price                                                          5.421
total_quantity                                                          3.162
relevel(factor(delivery_region), ref = "Lagos")Abuja                   -0.387
relevel(factor(delivery_region), ref = "Lagos")Other                   -0.274
relevel(factor(product_category), ref = "Laptop Table")Bar / Event      3.976
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack   -0.406
relevel(factor(product_category), ref = "Laptop Table")Other            2.271
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden   2.715
relevel(factor(product_category), ref = "Laptop Table")Shelving         4.685
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack        1.031
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks  -0.048
relevel(factor(product_category), ref = "Laptop Table")Towel Rack      -0.812
relevel(factor(origin), ref = "instagram")tiktok                       -0.792
relevel(factor(origin), ref = "instagram")walk-in                      -0.597
relevel(factor(origin), ref = "instagram")website                      -2.565
relevel(factor(origin), ref = "instagram")whatsapp                      2.281
                                                                      Pr(>|t|)
(Intercept)                                                            < 2e-16
shipping_price                                                        1.53e-07
total_quantity                                                         0.00179
relevel(factor(delivery_region), ref = "Lagos")Abuja                   0.69884
relevel(factor(delivery_region), ref = "Lagos")Other                   0.78430
relevel(factor(product_category), ref = "Laptop Table")Bar / Event    9.45e-05
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack   0.68495
relevel(factor(product_category), ref = "Laptop Table")Other           0.02411
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden  0.00715
relevel(factor(product_category), ref = "Laptop Table")Shelving       4.86e-06
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack       0.30357
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks  0.96210
relevel(factor(product_category), ref = "Laptop Table")Towel Rack      0.41743
relevel(factor(origin), ref = "instagram")tiktok                       0.42904
relevel(factor(origin), ref = "instagram")walk-in                      0.55082
relevel(factor(origin), ref = "instagram")website                      0.01096
relevel(factor(origin), ref = "instagram")whatsapp                     0.02349
                                                                         
(Intercept)                                                           ***
shipping_price                                                        ***
total_quantity                                                        ** 
relevel(factor(delivery_region), ref = "Lagos")Abuja                     
relevel(factor(delivery_region), ref = "Lagos")Other                     
relevel(factor(product_category), ref = "Laptop Table")Bar / Event    ***
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack     
relevel(factor(product_category), ref = "Laptop Table")Other          *  
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden ** 
relevel(factor(product_category), ref = "Laptop Table")Shelving       ***
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack         
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks    
relevel(factor(product_category), ref = "Laptop Table")Towel Rack        
relevel(factor(origin), ref = "instagram")tiktok                         
relevel(factor(origin), ref = "instagram")walk-in                        
relevel(factor(origin), ref = "instagram")website                     *  
relevel(factor(origin), ref = "instagram")whatsapp                    *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5169 on 224 degrees of freedom
Multiple R-squared:  0.4469,    Adjusted R-squared:  0.4074 
F-statistic: 11.31 on 16 and 224 DF,  p-value: < 2.2e-16

9.4 Diagnostic Plots

Code
import statsmodels.api as sm

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

fitted  = model.fittedvalues
resid   = model.resid

# Residuals vs Fitted
axes[0,0].scatter(fitted, resid, alpha=0.5, color='steelblue')
<matplotlib.collections.PathCollection object at 0x123b08e20>
Code
axes[0,0].axhline(0, color='red', linestyle='--')
<matplotlib.lines.Line2D object at 0x123b08100>
Code
axes[0,0].set_title('Residuals vs Fitted')
Text(0.5, 1.0, 'Residuals vs Fitted')
Code
axes[0,0].set_xlabel('Fitted Values')
Text(0.5, 0, 'Fitted Values')
Code
axes[0,0].set_ylabel('Residuals')
Text(0, 0.5, 'Residuals')
Code
# Q-Q Plot
sm.qqplot(resid, line='s', ax=axes[0,1], alpha=0.5)
<Figure size 1400x1000 with 4 Axes>
Code
axes[0,1].set_title('Normal Q-Q Plot')
Text(0.5, 1.0, 'Normal Q-Q Plot')
Code
# Scale-Location
axes[1,0].scatter(fitted, np.sqrt(np.abs(resid)), alpha=0.5, color='coral')
<matplotlib.collections.PathCollection object at 0x123b00eb0>
Code
axes[1,0].set_title('Scale-Location')
Text(0.5, 1.0, 'Scale-Location')
Code
axes[1,0].set_xlabel('Fitted Values')
Text(0.5, 0, 'Fitted Values')
Code
axes[1,0].set_ylabel('√|Residuals|')
Text(0, 0.5, '√|Residuals|')
Code
# Cook's Distance
influence = model.get_influence()
cooks = influence.cooks_distance[0]
axes[1,1].stem(range(len(cooks)), cooks, markerfmt=',')
<StemContainer object of 3 artists>
Code
axes[1,1].set_title("Cook's Distance")
Text(0.5, 1.0, "Cook's Distance")
Code
axes[1,1].set_xlabel('Observation Index')
Text(0.5, 0, 'Observation Index')
Code
axes[1,1].set_ylabel("Cook's D")
Text(0, 0.5, "Cook's D")
Code
plt.suptitle('Regression Diagnostic Plots', fontweight='bold', fontsize=13)
Text(0.5, 0.98, 'Regression Diagnostic Plots')
Code
plt.tight_layout()
plt.show()

Code
par(mfrow = c(2, 2))
plot(model)

Code
par(mfrow = c(1, 1))

9.5 Interpretation

The regression model explains a meaningful proportion of variation in log(total order value). Key findings:

  • shipping_price: Each additional ₦1,000 in shipping is associated with a statistically significant increase in log(total), confirming that higher-cost delivery routes correlate with larger orders.
  • total_quantity: Each additional unit ordered increases log(total), as expected — the coefficient confirms price is not discounted proportionally for larger orders.
  • delivery_region — Abuja: Abuja orders have significantly higher log-totals than Lagos (reference category), even after controlling for shipping and quantity.
  • product_category: Bar / Event items show the largest positive coefficient relative to Laptop Tables, reflecting their much higher price points.

For a non-technical manager: “The model tells us that the single most reliable way to grow revenue per order is to target out-of-state customers with multi-item bundles — they already spend more, and the regression confirms this is not just noise.”


10. Integrated Findings

The five analyses converge on a single, coherent picture of this business:

Finding 1 — Instagram is the engine, but the website converts higher-value orders. EDA and visualisation show that 70% of orders originate from Instagram, but hypothesis testing suggests website orders may carry higher average values. This creates a strategic tension: Instagram drives volume; the website may drive margin.

Finding 2 — Out-of-state customers are more valuable per order. Hypothesis testing confirms Abuja orders generate significantly higher totals. Correlation analysis shows this is linked to both higher shipping costs and higher quantities. Regression confirms the region effect persists after controlling for other variables.

Finding 3 — Laptop Tables dominate volume; Bar / Event items dominate value. Visualisation and regression both flag this product mix insight. The business currently optimises for Laptop Table volume (its bestseller) but the highest revenue-per-order category is Bar / Event — a segment with only 4 orders in two years, suggesting untapped potential.

Finding 4 — The business is growing, with a visible acceleration in late 2025 / early 2026. The monthly revenue trend (Chart 1) shows clear growth momentum entering 2026. This timing aligns with increased Instagram posting frequency — a qualitative observation the data supports directionally.

Central Recommendation: Establish a dedicated Abuja logistics partnership to reduce out-of-state delivery friction, and develop a multi-item bundle offer (e.g. Laptop Table + Planter Stand) targeted at customers ordering from outside Lagos — the group that correlation and regression both identify as the highest-value segment.


11. Limitations & Further Work

Limitations:

  • Small sample size (n = 241): Two years of data yields a relatively small dataset for regression. Coefficient estimates, particularly for smaller product categories (Clothing Rack: n=1, Bar/Event: n=4), are unreliable. Wider confidence intervals reflect this uncertainty.
  • Single business, single channel mix: All data comes from one business. The findings — especially the Instagram dominance — cannot be generalised to other Nigerian D2C businesses.
  • No customer-level repeat-purchase data: Because many orders lack customer identifiers, it is impossible to distinguish new from returning customers. This prevents any customer lifetime value or cohort analysis.
  • Confounded shipping costs: Shipping price is both a cost passed to the customer and a proxy for delivery region. These two effects are difficult to separate cleanly in the regression.
  • Skewed outcome variable: Despite log transformation, the presence of extreme outliers (Foldable Bar: ₦1.95M) may still exert undue influence on regression coefficients.

Further Work:

  • With more data (12–18 additional months), a time series model (ARIMA or Prophet) would be viable to formally forecast monthly revenue.
  • A customer segmentation (K-Means clustering) would become feasible once repeat-purchase history is captured — moving this analysis toward Case Study 2 territory.
  • An A/B test on Instagram posting frequency vs. order conversion rate would provide causal evidence for the observed relationship between content activity and sales spikes.
  • Incorporating product cost data would allow margin analysis, not just revenue analysis — a materially more useful metric for the business.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.

[Your Name]. (2026). Order transaction data — D2C furniture and home décor business [Dataset]. Collected from Bumpa platform, Lagos, Nigeria. Data available on request from the author.


Appendix: AI Usage Statement

Claude (Anthropic) was used as a coding assistant during the preparation of this submission. Specifically, Claude assisted with writing Python and R code chunks for data cleaning, visualisation layout, and diagnostic plot generation, and with structuring the Quarto document template. All analytical decisions — the choice of Case Study 1 and its five techniques, the two hypothesis tests formulated, the regression model specification, the interpretation of all outputs, and the business recommendations — were made independently by the author based on understanding of the course material and direct knowledge of the business context. The AI did not have access to the data during this conversation; all outputs presented in this document were generated by running the code in RStudio/Positron against the author’s own dataset.