Sales Analytics for a Nigerian D2C Furniture & Home Décor Business

Author

[Joy Temilade Adewuyi]

Published

May 11, 2026

1. Executive Summary

This case study applies Exploratory & Inferential Analytics (Case Study 1) to two years of real transactional order data from a Nigerian direct-to-consumer (D2C) furniture and home décor business operating primarily through Instagram and its own e-commerce website hosted on the Bumpa platform.

The dataset comprises 241 orders placed between 21 March 2024 and 19 April 2026, covering 9 product categories, 5 sales channels, and deliveries across Lagos, Abuja, and other Nigerian states. Five analytical techniques are applied: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression.

Key findings reveal that Instagram accounts for 70% of all orders, Laptop Tables represent 48% of revenue-generating transactions, and delivery region is a significant predictor of order value. Abuja orders generate materially higher totals than Lagos orders, driven by logistics costs. A statistically significant positive relationship exists between shipping price and total order value. The regression model confirms that product category, delivery region, and quantity are the strongest predictors of order revenue.

The central recommendation is that the business should invest in a structured Abuja fulfilment partnership to reduce delivery friction on high-value out-of-state orders, while deepening Instagram content investment to protect its dominant acquisition channel.

2. Professional Disclosure

2.1 Role & Organisation

I am the founder and operator of a Nigerian direct-to-consumer (D2C) business that designs and sells handcrafted furniture and home décor products — including laptop tables, planter stands, floating shelves, shoe racks, and custom-built items. The business operates primarily through Instagram as an acquisition channel and fulfils orders via a Bumpa-powered e-commerce website, WhatsApp, and occasional walk-in sales. Deliveries are made across Lagos, Abuja, and other Nigerian states using third-party logistics providers.

2.2 Technique Relevance

Exploratory Data Analysis (EDA) As a solo operator, I have never formally profiled my order data. EDA allows me to understand the shape of my revenue distribution, identify which products drive the most transactions, flag data quality issues in the Bumpa export, and detect outliers such as bulk event orders that distort average figures. This is the essential first step before any operational decision-making.

Data Visualisation My business decisions — where to market, which products to promote, which delivery zones to prioritise — are currently made by intuition. Visualisation converts two years of transaction records into charts that make patterns visible: seasonal peaks, channel performance, and geographic spread. These visuals can also be shared with potential investors or logistics partners.

Hypothesis Testing I need to know whether observed differences in order value across regions and channels are real or could be explained by random variation. Specifically: do Abuja orders genuinely cost more? Do website orders generate higher revenue than Instagram orders? Formal hypothesis testing replaces gut feeling with statistical evidence that can justify operational investment.

Correlation Analysis Understanding which variables move together helps me make pricing and logistics decisions. Does a higher shipping cost deter customers from ordering? Are bulk orders (higher quantity) associated with higher discounts? Correlation analysis quantifies these relationships before I build a predictive model.

Linear Regression Regression allows me to identify the specific contribution of each factor — product type, delivery region, channel, quantity — to total order value. The coefficients translate directly into actionable pricing guidance: for instance, by how much does an Abuja order exceed a Lagos order on average, net of product price?

3. Data Collection & Sampling

3.1 Source

The data was extracted from Bumpa, a Nigerian e-commerce and order management platform (bumpa.app), which serves as the storefront and operational backend for the business. The dataset was downloaded directly from the platform’s order history export feature in CSV format.

3.2 Collection Method

Records were generated automatically by the Bumpa platform each time a customer placed an order — through the website, via Instagram DM link, WhatsApp, or walk-in. No manual data entry was required for order capture. Each row in the dataset represents one unique customer transaction, recorded in real time at the point of sale.

3.3 Sampling Frame & Sample Size

This dataset represents a census, not a sample. Every order recorded on the platform from business inception through April 2026 is included. No filtering, exclusion, or random selection was applied prior to download. The final dataset contains 241 orders across 35 variables (25 original + 10 derived during cleaning).

3.4 Time Period Covered

21 March 2024 to 19 April 2026 — approximately 25 months, covering the complete operational history of the business from its first recorded order to the most recent export date.

4. Data Description

4.1 Variable Dictionary

Variable	Type	Description
`id`	Numeric (integer)	Unique system-generated order ID
`order_number`	Character	Human-readable order reference (e.g. 00357)
`products`	Character	Product name(s); pipe-separated for multi-item orders
`customer_name`	Character	Customer name; “Unknown” where not provided
`customer_email`	Character	Customer email; “Not Provided” where absent
`customer_phone`	Character	Customer phone; “Not Provided” where absent
`payment_status`	Categorical	PAID or PARTIALLY_PAID
`status`	Categorical	COMPLETED, PROCESSING, or OPEN
`shipping_status`	Categorical	DELIVERED, UNFULFILLED, or RETURNED
`channel`	Categorical	Device channel: MOBILE or WEB
`origin`	Categorical	Sales channel: instagram, website, whatsapp, walk-in, tiktok
`total`	Numeric (float)	Final order total paid by customer (₦)
`sub_total`	Numeric (float)	Order value before shipping and tax (₦)
`discount`	Numeric (float)	Discount applied (₦); 0 where none
`amount_paid`	Numeric (float)	Actual amount received (₦)
`amount_due`	Numeric (float)	Balance remaining; negative = overpayment
`order_date`	DateTime	Date and time the order was placed
`shipping_price`	Numeric (float)	Delivery cost charged to customer (₦)
`tax`	Numeric (float)	Tax applied (₦); 0 for most orders
`total_quantity`	Numeric (float)	Total units ordered — derived from pipe-separated raw field
`num_product_lines`	Numeric (integer)	Number of distinct product lines per order — derived
`delivery_region`	Categorical	Derived region: Lagos, Abuja, or Other
`overpaid_flag`	Boolean	TRUE where amount_paid > total
`is_completed`	Boolean	TRUE where status = COMPLETED
`is_delivered`	Boolean	TRUE where shipping_status = DELIVERED
`order_month`	Character	Year-month period (e.g. 2024-05)
`order_year`	Numeric (integer)	Year of order
`order_day_of_wk`	Character	Day of week order was placed
`product_category`	Categorical	Derived product group (Laptop Table, Plant/Garden, etc.)

4.2 Summary Statistics

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from scipy import stats

df = pd.read_csv("DIY Africans. Orders. Clean Data .csv")
df['order_date'] = pd.to_datetime(df['order_date'])

numeric_cols = ['total', 'sub_total', 'shipping_price', 'tax',
                'discount', 'amount_paid', 'total_quantity']

print("=== Numeric Variable Summary ===")

=== Numeric Variable Summary ===

Code

print(df[numeric_cols].describe().round(2))

            total   sub_total  ...  amount_paid  total_quantity
count      241.00      241.00  ...       241.00          241.00
mean     80247.50    73773.31  ...     80515.61            2.05
std     138933.73   140071.71  ...    138982.11            3.40
min      18000.00    14999.00  ...     18000.00            1.00
25%      37248.93    29999.00  ...     37248.93            1.00
50%      43999.00    34999.00  ...     43999.00            1.00
75%      83895.70    69999.00  ...     83895.70            2.00
max    1950000.00  1950000.00  ...   1950000.00           30.00

[8 rows x 7 columns]

Code

print("\n=== Skewness ===")


=== Skewness ===

Code

print(df[numeric_cols].skew().round(3))

total             10.660
sub_total         10.592
shipping_price     3.038
tax                3.448
discount          11.482
amount_paid       10.642
total_quantity     7.150
dtype: float64

Code

library(tidyverse)
library(skimr)
library(moments)

df <- read_csv("DIY Africans. Orders. Clean Data .csv", show_col_types = FALSE)
df$order_date <- as.Date(df$order_date)

df %>%
  select(total, sub_total, shipping_price, tax,
         discount, amount_paid, total_quantity) %>%
  skim()

Data summary
Name	Piped data
Number of rows	241
Number of columns	7
_______________________
Column type frequency:
numeric	7
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
total	1	80247.50	138933.73	18000	37248.93	43999	83895.70	1950000.00	▇▁▁▁▁
sub_total	1	73773.31	140071.71	14999	29999.00	34999	69999.00	1950000.00	▇▁▁▁▁
shipping_price	1	6564.73	6467.73	0	4000.00	5000	8000.00	50000.00	▇▁▁▁▁
tax	1	1573.68	3151.15	0	0.00	0	2249.93	23624.78	▇▁▁▁▁
discount	1	662.19	5923.01	0	0.00	0	0.00	79982.00	▇▁▁▁▁
amount_paid	1	80515.61	138982.11	18000	37248.93	43999	83895.70	1950000.00	▇▁▁▁▁
total_quantity	1	2.05	3.40	1	1.00	1	2.00	30.00	▇▁▁▁▁

4.3 Numeric Distributions

Code

fig, axes = plt.subplots(2, 3, figsize=(16, 10))
cols    = ['total', 'sub_total', 'shipping_price', 'tax', 'total_quantity', 'amount_paid']
colours = ['steelblue','coral','mediumseagreen','mediumpurple','goldenrod','tomato']

for ax, col, colour in zip(axes.flat, cols, colours):
    ax.hist(df[col], bins=25, color=colour, edgecolor='white')
    ax.set_title(col.replace('_', ' ').title(), fontsize=11)
    ax.set_xlabel('Value (₦)' if col != 'total_quantity' else 'Units')
    ax.set_ylabel('Frequency')
    ax.xaxis.set_major_formatter(mticker.FuncFormatter(
        lambda x, _: f'₦{x/1000:.0f}k' if col != 'total_quantity' else f'{x:.0f}'))

plt.suptitle('Numeric Variable Distributions', fontsize=14, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

Code

library(ggplot2)
library(patchwork)

plot_hist <- function(var, label, fill_col) {
  ggplot(df, aes(x = .data[[var]])) +
    geom_histogram(bins = 25, fill = fill_col, colour = "white") +
    labs(title = label, x = "", y = "Frequency") +
    theme_minimal(base_size = 10)
}

p1 <- plot_hist("total",          "Total (₦)",          "steelblue")
p2 <- plot_hist("sub_total",      "Sub-Total (₦)",      "coral")
p3 <- plot_hist("shipping_price", "Shipping Price (₦)", "mediumseagreen")
p4 <- plot_hist("tax",            "Tax (₦)",            "mediumpurple")
p5 <- plot_hist("total_quantity", "Total Quantity",     "goldenrod")
p6 <- plot_hist("amount_paid",    "Amount Paid (₦)",    "tomato")

(p1 | p2 | p3) / (p4 | p5 | p6) +
  plot_annotation(title = "Numeric Variable Distributions",
                  theme = theme(plot.title = element_text(face = "bold", size = 14)))

Interpretation: All monetary variables are strongly right-skewed — a small number of high-value bulk and event orders (e.g. ₦1.95M Foldable Bar, ₦650k multi-unit order) pull the mean well above the median. The median order total is ₦43,999 versus a mean of ₦80,248. A log transformation of total will be applied before regression to satisfy the linearity assumption.

4.4 Categorical Distributions

Code

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# Sales channel
origin_counts = df['origin'].value_counts()
axes[0,0].bar(origin_counts.index, origin_counts.values,
              color=['#2196F3','#FF5722','#4CAF50','#9C27B0','#FF9800'])
axes[0,0].set_title('Orders by Sales Channel', fontweight='bold')
axes[0,0].set_ylabel('Number of Orders')
for i, v in enumerate(origin_counts.values):
    axes[0,0].text(i, v + 1, str(v), ha='center', fontsize=9)

# Product category
cat_counts = df['product_category'].value_counts()
axes[0,1].barh(cat_counts.index, cat_counts.values, color='coral')
axes[0,1].set_title('Orders by Product Category', fontweight='bold')
axes[0,1].set_xlabel('Number of Orders')

# Delivery region
reg_counts = df['delivery_region'].value_counts()
axes[1,0].pie(reg_counts.values, labels=reg_counts.index, autopct='%1.1f%%',
              colors=['steelblue','coral','mediumseagreen'], startangle=90)

([<matplotlib.patches.Wedge object at 0x121be27c0>, <matplotlib.patches.Wedge object at 0x121c019a0>, <matplotlib.patches.Wedge object at 0x121c10490>], [Text(-0.7752783383220729, -0.7803483184633421, 'Lagos'), Text(0.9643515161006885, 0.5291749742706129, 'Other'), Text(0.3111557684636675, 1.0550744465447852, 'Abuja')], [Text(-0.4228790936302215, -0.4256445373436411, '75.1%'), Text(0.5260099178731027, 0.28864089505669793, '15.8%'), Text(0.1697213282529095, 0.5754951526607919, '9.1%')])

Code

axes[1,0].set_title('Orders by Delivery Region', fontweight='bold')

# Day of week
day_order  = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday']
day_counts = df['order_day_of_wk'].value_counts().reindex(day_order)
axes[1,1].bar(day_counts.index, day_counts.values, color='mediumpurple')
axes[1,1].set_title('Orders by Day of Week', fontweight='bold')
axes[1,1].set_xticklabels(day_order, rotation=45, ha='right')
axes[1,1].set_ylabel('Number of Orders')

plt.tight_layout()
plt.show()

Code

library(gridExtra)

p1 <- df %>% count(origin) %>%
  ggplot(aes(x = reorder(origin, n), y = n, fill = origin)) +
  geom_col(show.legend = FALSE) + coord_flip() +
  geom_text(aes(label = n), hjust = -0.2, size = 3) +
  labs(title = "Orders by Sales Channel", x = "", y = "Count") +
  theme_minimal()

p2 <- df %>% count(product_category) %>%
  ggplot(aes(x = reorder(product_category, n), y = n)) +
  geom_col(fill = "coral") + coord_flip() +
  labs(title = "Orders by Product Category", x = "", y = "Count") +
  theme_minimal()

p3 <- df %>% count(delivery_region) %>%
  ggplot(aes(x = delivery_region, y = n, fill = delivery_region)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Orders by Delivery Region", x = "", y = "Count") +
  theme_minimal()

p4 <- df %>%
  mutate(order_day_of_wk = factor(order_day_of_wk,
    levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))) %>%
  count(order_day_of_wk) %>%
  ggplot(aes(x = order_day_of_wk, y = n)) +
  geom_col(fill = "mediumpurple") +
  labs(title = "Orders by Day of Week", x = "", y = "Count") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(p1, p2, p3, p4, ncol = 2)

Interpretation: Instagram dominates at 70% of all orders. Laptop Tables account for 48% of transactions. Lagos receives 75% of deliveries. Saturday is the busiest ordering day — suggesting weekend browsing converts well, which has implications for when to schedule Instagram content posts.

4.5 Data Quality Summary

Issue	Detail	How Handled
Missing customer contact	22 names, 82 emails, 76 phones missing	Filled with “Unknown”/“Not Provided”; excluded from analysis
Product SKU	79% missing	Retained but excluded; product name used instead
Coupon code	100% empty	Filled with “None”; no active coupon scheme confirmed
Right-skewed totals	Skewness = 10.66 on `total`	Log transformation applied before regression
High-value outliers	21 orders above ₦153,866 (IQR method)	Flagged and retained — genuine bulk/event transactions
Overpaid orders	3 orders where Amount Paid > Total	Flagged with `overpaid_flag`; likely deposits or rounding
Unfulfilled/returned	4 orders (3 unfulfilled, 1 returned)	Retained; flagged via `is_delivered` boolean

5. Technique 1 — Exploratory Data Analysis

5.1 Theory Recap

Exploratory Data Analysis (EDA), introduced by Tukey (1977) and formalised in Chapter 4 of the course textbook, is the process of summarising a dataset’s main characteristics before formal modelling. It involves computing summary statistics, visualising distributions, detecting missing values, identifying outliers, and examining the Anscombe Quartet principle — that summary statistics alone can conceal fundamentally different data shapes. EDA is not an optional step; it is the analytical foundation every subsequent technique depends on.

5.2 Business Justification

As a business owner with two years of transactional data and no prior formal analysis, I need to understand the basic profile of my orders before drawing any conclusions. What does a typical order look like? How spread out are my revenues? Are there anomalous orders that could distort averages? EDA answers these questions and prevents me from making decisions based on misleading summary figures.

5.3 Analysis

Code

# ── Missing value heatmap ─────────────────────────────────────────────────────
analysis_cols = ['total','sub_total','shipping_price','tax','discount',
                 'total_quantity','origin','delivery_region','product_category',
                 'order_date','payment_status','status','shipping_status']

missing = df[analysis_cols].isnull().sum().reset_index()
missing.columns = ['Variable','Missing Count']
missing['Missing %'] = (missing['Missing Count'] / len(df) * 100).round(1)
print(missing[missing['Missing Count'] > 0])

Empty DataFrame
Columns: [Variable, Missing Count, Missing %]
Index: []

Code

# ── Outlier detection (IQR) ───────────────────────────────────────────────────
for col in ['total', 'shipping_price', 'total_quantity']:
    Q1  = df[col].quantile(0.25)
    Q3  = df[col].quantile(0.75)
    IQR = Q3 - Q1
    out = df[df[col] > Q3 + 1.5 * IQR]
    print(f"\n{col}: {len(out)} outliers above ₦{Q3 + 1.5*IQR:,.0f}")
    if len(out) <= 5:
        print(out[['order_number','products','total','shipping_price','total_quantity']])


total: 21 outliers above ₦153,866

shipping_price: 23 outliers above ₦14,000

total_quantity: 22 outliers above ₦4

Code

# ── Anscombe check: mean vs median ───────────────────────────────────────────
print("\n=== Mean vs Median (Total) ===")


=== Mean vs Median (Total) ===

Code

print(f"Mean:   ₦{df['total'].mean():,.0f}")

Mean:   ₦80,248

Code

print(f"Median: ₦{df['total'].median():,.0f}")

Median: ₦43,999

Code

print(f"Ratio:  {df['total'].mean()/df['total'].median():.2f}x  → skew confirmed")

Ratio:  1.82x  → skew confirmed

Code

# Missing values
analysis_vars <- c("total","sub_total","shipping_price","tax","discount",
                   "total_quantity","origin","delivery_region","product_category")

df %>%
  select(all_of(analysis_vars)) %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
  filter(Missing > 0)

Code

# Outlier detection (IQR)
outlier_summary <- function(var) {
  x   <- df[[var]]
  Q1  <- quantile(x, 0.25, na.rm = TRUE)
  Q3  <- quantile(x, 0.75, na.rm = TRUE)
  IQR <- Q3 - Q1
  n   <- sum(x > Q3 + 1.5 * IQR, na.rm = TRUE)
  cat(var, ": ", n, "outliers above", Q3 + 1.5 * IQR, "\n")
}
sapply(c("total","shipping_price","total_quantity"), outlier_summary)

total :  21 outliers above 153865.9 
shipping_price :  23 outliers above 14000 
total_quantity :  22 outliers above 3.5

$total
NULL

$shipping_price
NULL

$total_quantity
NULL

Code

# Mean vs Median
cat("\nMean:  ₦", format(mean(df$total, na.rm=TRUE), big.mark=","), "\n")


Mean:  ₦ 80,247.5

Code

cat("Median:₦", format(median(df$total, na.rm=TRUE), big.mark=","), "\n")

Median:₦ 43,999

5.4 Interpretation

EDA surfaces two critical data quality issues. First, the total variable is heavily right-skewed (skewness = 10.66), with the mean (₦80,248) being 1.82× the median (₦43,999) — a classic Anscombe-type warning that the mean is a misleading summary statistic for this business. Second, 21 orders exceed the IQR upper fence of ₦153,866; these are genuine high-value transactions (bulk event orders, multi-unit purchases) and are retained but will be noted as a sensitivity check in the regression. For a non-technical manager: “Our average order figure of ₦80k is inflated by a handful of large event orders — the typical order is actually closer to ₦44k.”

6. Technique 2 — Data Visualisation

6.1 Theory Recap

Data visualisation, covered in Chapter 5 of the textbook, translates raw numbers into charts that reveal patterns invisible in summary tables. The grammar of graphics framework (Wilkinson, 2005) — implemented in ggplot2 (R) and matplotlib/seaborn (Python) — decomposes every chart into aesthetic mappings, geometric objects, and statistical transformations. Good visualisation is not decoration; it is a communication tool for driving decisions.

6.2 Business Justification

This business generates revenue across five channels, nine product categories, three delivery regions, and 25 months of trading. No single table captures all these dimensions simultaneously. A coherent visualisation narrative — five charts that tell one story — converts two years of order records into an operational picture that can inform marketing spend, logistics partnerships, and product range decisions.

6.3 Visualisation Narrative: “Where does revenue come from, and how has it grown?”

Code

fig = plt.figure(figsize=(16, 18))

# ── Chart 1: Monthly revenue trend ──────────────────────────────────────────
ax1 = fig.add_subplot(3, 2, (1, 2))
monthly = df.groupby('order_month')['total'].sum().reset_index()
monthly['order_month'] = pd.Categorical(monthly['order_month'],
                                        categories=sorted(monthly['order_month'].unique()),
                                        ordered=True)
monthly = monthly.sort_values('order_month')
ax1.plot(monthly['order_month'], monthly['total']/1000,
         marker='o', color='steelblue', linewidth=2)
ax1.fill_between(range(len(monthly)), monthly['total']/1000, alpha=0.15, color='steelblue')
ax1.set_xticks(range(len(monthly)))
ax1.set_xticklabels(monthly['order_month'], rotation=45, ha='right', fontsize=8)
ax1.set_title('Chart 1: Monthly Revenue (₦ thousands)', fontweight='bold', fontsize=12)
ax1.set_ylabel('Revenue (₦k)')
ax1.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 2: Revenue by sales channel ───────────────────────────────────────
ax2 = fig.add_subplot(3, 2, 3)
ch_rev = df.groupby('origin')['total'].sum().sort_values(ascending=False)
ax2.bar(ch_rev.index, ch_rev.values/1000,
        color=['#2196F3','#FF5722','#4CAF50','#9C27B0','#FF9800'])
ax2.set_title('Chart 2: Revenue by Channel', fontweight='bold')
ax2.set_ylabel('Revenue (₦k)')
ax2.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 3: Average order value by product category ────────────────────────
ax3 = fig.add_subplot(3, 2, 4)
cat_avg = df.groupby('product_category')['total'].mean().sort_values()
ax3.barh(cat_avg.index, cat_avg.values/1000, color='coral')
ax3.set_title('Chart 3: Avg Order Value by Product', fontweight='bold')
ax3.set_xlabel('Avg Total (₦k)')
ax3.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 4: Order volume by delivery region ─────────────────────────────────
ax4 = fig.add_subplot(3, 2, 5)
reg_rev = df.groupby('delivery_region').agg(
    orders=('id','count'), revenue=('total','sum')).reset_index()
ax4.bar(reg_rev['delivery_region'], reg_rev['revenue']/1000,
        color=['steelblue','coral','mediumseagreen'])
ax4.set_title('Chart 4: Revenue by Delivery Region', fontweight='bold')
ax4.set_ylabel('Revenue (₦k)')
ax4.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k'))

# ── Chart 5: Order total distribution by channel ─────────────────────────────
ax5 = fig.add_subplot(3, 2, 6)
top_origins = ['instagram','website','whatsapp']
for origin in top_origins:
    subset = df[df['origin'] == origin]['total']
    ax5.hist(subset, bins=20, alpha=0.6, label=origin)
ax5.set_title('Chart 5: Order Total Distribution by Channel', fontweight='bold')
ax5.set_xlabel('Order Total (₦)')
ax5.set_ylabel('Frequency')
ax5.legend()
ax5.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x/1000:.0f}k'))

plt.suptitle('Revenue Story: Two Years of D2C Furniture Sales',
             fontsize=15, fontweight='bold', y=1.01)
plt.tight_layout()
plt.show()

Code

library(scales)

# Chart 1: Monthly revenue
p1 <- df %>%
  group_by(order_month) %>%
  summarise(revenue = sum(total)/1000) %>%
  arrange(order_month) %>%
  ggplot(aes(x = order_month, y = revenue, group = 1)) +
  geom_line(colour = "steelblue", linewidth = 1.2) +
  geom_point(colour = "steelblue", size = 2) +
  geom_area(alpha = 0.15, fill = "steelblue") +
  labs(title = "Chart 1: Monthly Revenue (₦k)", x = "", y = "Revenue (₦k)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7))

# Chart 2: Revenue by channel
p2 <- df %>%
  group_by(origin) %>%
  summarise(revenue = sum(total)/1000) %>%
  ggplot(aes(x = reorder(origin, -revenue), y = revenue, fill = origin)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Chart 2: Revenue by Channel", x = "", y = "Revenue (₦k)") +
  theme_minimal()

# Chart 3: Avg order by product
p3 <- df %>%
  group_by(product_category) %>%
  summarise(avg_total = mean(total)/1000) %>%
  ggplot(aes(x = reorder(product_category, avg_total), y = avg_total)) +
  geom_col(fill = "coral") + coord_flip() +
  labs(title = "Chart 3: Avg Order Value by Product", x = "", y = "Avg (₦k)") +
  theme_minimal()

# Chart 4: Revenue by region
p4 <- df %>%
  group_by(delivery_region) %>%
  summarise(revenue = sum(total)/1000) %>%
  ggplot(aes(x = delivery_region, y = revenue, fill = delivery_region)) +
  geom_col(show.legend = FALSE) +
  labs(title = "Chart 4: Revenue by Region", x = "", y = "Revenue (₦k)") +
  theme_minimal()

# Chart 5: Distribution by channel
p5 <- df %>%
  filter(origin %in% c("instagram","website","whatsapp")) %>%
  ggplot(aes(x = total, fill = origin)) +
  geom_histogram(bins = 20, alpha = 0.6, position = "identity") +
  labs(title = "Chart 5: Order Total by Channel", x = "Order Total (₦)", y = "Frequency") +
  theme_minimal()

(p1) / (p2 | p3) / (p4 | p5) +
  plot_annotation(title = "Revenue Story: Two Years of D2C Furniture Sales",
                  theme = theme(plot.title = element_text(face = "bold", size = 14)))

6.4 Interpretation

The five charts together tell a single story: this business is growing, Instagram-dependent, and geographically concentrated in Lagos. Monthly revenue shows a clear upward trajectory into late 2025 and early 2026, with seasonal dips in mid-2025. Instagram generates more revenue than all other channels combined. Bar / Event items have the highest average order value but are infrequent. Lagos represents approximately 75% of total revenue. For a non-technical manager: “Instagram is our entire engine — if it slows down, so does the business.”

7. Technique 3 — Hypothesis Testing

7.1 Theory Recap

Hypothesis testing, covered in Chapter 6 of the textbook, is the formal statistical procedure for deciding whether an observed difference between groups is real or attributable to sampling variation. Each test requires: a null hypothesis (H₀), an alternative hypothesis (H₁), an appropriate test statistic, a significance level (α = 0.05), and an effect size measure to assess practical significance beyond statistical significance.

7.2 Business Justification

I need to know whether differences I observe in order totals across regions and channels are genuine or coincidental. Specifically: (1) Do Abuja orders genuinely generate higher totals than Lagos orders? If so, this justifies a dedicated Abuja logistics investment. (2) Do website orders generate higher revenue than Instagram orders? If so, this justifies heavier investment in the website checkout experience.

7.3 Hypothesis 1 — Do Abuja orders have higher totals than Lagos orders?

H₀: Mean order total for Abuja = Mean order total for Lagos
H₁: Mean order total for Abuja > Mean order total for Lagos (one-tailed)

Code

from scipy.stats import mannwhitneyu, ttest_ind, levene
import warnings
warnings.filterwarnings('ignore')

lagos = df[df['delivery_region'] == 'Lagos']['total'].dropna()
abuja = df[df['delivery_region'] == 'Abuja']['total'].dropna()

print(f"Lagos  — n={len(lagos)}, Mean=₦{lagos.mean():,.0f}, Median=₦{lagos.median():,.0f}")

Lagos  — n=181, Mean=₦74,816, Median=₦39,624

Code

print(f"Abuja  — n={len(abuja)}, Mean=₦{abuja.mean():,.0f}, Median=₦{abuja.median():,.0f}")

Abuja  — n=22, Mean=₦84,602, Median=₦71,998

Code

# Check normality (Shapiro-Wilk, n < 50 for Abuja)
from scipy.stats import shapiro
_, p_lagos = shapiro(lagos[:50])
_, p_abuja = shapiro(abuja)
print(f"\nShapiro-Wilk p-values — Lagos: {p_lagos:.4f}, Abuja: {p_abuja:.4f}")


Shapiro-Wilk p-values — Lagos: 0.0000, Abuja: 0.0004

Code

print("→ Non-normal; using Mann-Whitney U (non-parametric alternative to t-test)")

→ Non-normal; using Mann-Whitney U (non-parametric alternative to t-test)

Code

# Mann-Whitney U test
stat, p = mannwhitneyu(abuja, lagos, alternative='greater')
print(f"\nMann-Whitney U statistic: {stat:.2f}")


Mann-Whitney U statistic: 2814.50

Code

print(f"p-value (one-tailed):     {p:.4f}")

p-value (one-tailed):     0.0008

Code

# Effect size: rank-biserial correlation
n1, n2 = len(abuja), len(lagos)
r = 1 - (2 * stat) / (n1 * n2)
print(f"Effect size (r):          {r:.3f}  (small=0.1, medium=0.3, large=0.5)")

Effect size (r):          -0.414  (small=0.1, medium=0.3, large=0.5)

Code

lagos_orders <- df %>% filter(delivery_region == "Lagos") %>% pull(total)
abuja_orders <- df %>% filter(delivery_region == "Abuja") %>% pull(total)

cat("Lagos — n=", length(lagos_orders),
    "Mean=₦", format(mean(lagos_orders), big.mark=","),
    "Median=₦", format(median(lagos_orders), big.mark=","), "\n")

Lagos — n= 181 Mean=₦ 74,815.93 Median=₦ 39,623.93

Code

cat("Abuja — n=", length(abuja_orders),
    "Mean=₦", format(mean(abuja_orders), big.mark=","),
    "Median=₦", format(median(abuja_orders), big.mark=","), "\n")

Abuja — n= 22 Mean=₦ 84,602.35 Median=₦ 71,998

Code

# Normality check
shapiro.test(abuja_orders)


    Shapiro-Wilk normality test

data:  abuja_orders
W = 0.79241, p-value = 0.0003815

Code

# Mann-Whitney U (Wilcoxon)
wilcox.test(abuja_orders, lagos_orders, alternative = "greater")


    Wilcoxon rank sum test with continuity correction

data:  abuja_orders and lagos_orders
W = 2814.5, p-value = 0.000779
alternative hypothesis: true location shift is greater than 0

Code

# Effect size: rank-biserial correlation (manual — avoids coin dependency)
W <- wilcox.test(abuja_orders, lagos_orders, alternative = "greater")$statistic
r_rb <- 1 - (2 * W) / (length(abuja_orders) * length(lagos_orders))
cat("Effect size (rank-biserial r):", round(r_rb, 3),
    " — interpretation: small=0.1, medium=0.3, large=0.5\n")

Effect size (rank-biserial r): -0.414  — interpretation: small=0.1, medium=0.3, large=0.5

Result & Interpretation: The Mann-Whitney U test returns p < 0.05, providing statistically significant evidence that Abuja orders generate higher totals than Lagos orders. The effect size is medium-to-large, driven primarily by higher shipping costs for out-of-state delivery. For a non-technical manager: “Abuja customers pay meaningfully more per order — but a large portion is shipping cost, not product margin. A local Abuja fulfilment partnership could convert this shipping charge into profit.”

7.4 Hypothesis 2 — Do website orders generate higher totals than Instagram orders?

H₀: Mean order total for website = Mean order total for instagram
H₁: Mean order total for website ≠ Mean order total for instagram (two-tailed)

Code

website   = df[df['origin'] == 'website']['total'].dropna()
instagram = df[df['origin'] == 'instagram']['total'].dropna()

print(f"Website   — n={len(website)},  Mean=₦{website.mean():,.0f}, Median=₦{website.median():,.0f}")

Website   — n=37,  Mean=₦43,768, Median=₦37,624

Code

print(f"Instagram — n={len(instagram)}, Mean=₦{instagram.mean():,.0f}, Median=₦{instagram.median():,.0f}")

Instagram — n=168, Mean=₦68,325, Median=₦44,311

Code

stat, p = mannwhitneyu(website, instagram, alternative='two-sided')
print(f"\nMann-Whitney U: {stat:.2f}, p-value: {p:.4f}")


Mann-Whitney U: 2020.00, p-value: 0.0009

Code

r = 1 - (2 * stat) / (len(website) * len(instagram))
print(f"Effect size (r): {r:.3f}")

Effect size (r): 0.350

Code

website_orders   <- df %>% filter(origin == "website")   %>% pull(total)
instagram_orders <- df %>% filter(origin == "instagram") %>% pull(total)

cat("Website   — Mean=₦", format(mean(website_orders), big.mark=","), "\n")

Website   — Mean=₦ 43,768.09

Code

cat("Instagram — Mean=₦", format(mean(instagram_orders), big.mark=","), "\n")

Instagram — Mean=₦ 68,324.86

Code

wilcox.test(website_orders, instagram_orders, alternative = "two.sided")


    Wilcoxon rank sum test with continuity correction

data:  website_orders and instagram_orders
W = 2020, p-value = 0.0008699
alternative hypothesis: true location shift is not equal to 0

Code

# Effect size: rank-biserial correlation (manual)
W2 <- wilcox.test(website_orders, instagram_orders, alternative = "two.sided")$statistic
r_rb2 <- 1 - (2 * W2) / (length(website_orders) * length(instagram_orders))
cat("Effect size (rank-biserial r):", round(r_rb2, 3), "\n")

Effect size (rank-biserial r): 0.35

Result & Interpretation: The test result guides investment decisions around the website checkout experience relative to Instagram DM-based conversion. For a non-technical manager: “If website orders are statistically higher in value, every improvement to the website checkout pays off disproportionately — each converted customer is worth more.”

8. Technique 4 — Correlation Analysis

8.1 Theory Recap

Correlation analysis, covered in Chapter 8 of the textbook, quantifies the strength and direction of the linear (Pearson) or monotonic (Spearman, Kendall) relationship between two continuous variables. Values range from −1 (perfect negative) to +1 (perfect perfect), with 0 indicating no relationship. Partial correlation controls for confounding variables. A critical principle: correlation does not imply causation.

8.2 Business Justification

Before building a regression model, I need to understand which variables move together and which are independent. Does shipping price track closely with order total — or does it vary independently? Does quantity ordered correlate with discounting behaviour? Are there multicollinearity risks that could destabilise the regression?

8.3 Analysis

Code

corr_cols = ['total','sub_total','shipping_price','tax',
             'discount','total_quantity','num_product_lines']

corr_matrix = df[corr_cols].corr(method='spearman')

fig, ax = plt.subplots(figsize=(10, 8))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f',
            cmap='RdYlGn', center=0, vmin=-1, vmax=1,
            linewidths=0.5, ax=ax, annot_kws={'size': 10})

<Axes: >

Code

ax.set_title('Spearman Correlation Matrix — Order Variables',
             fontweight='bold', fontsize=13)

Text(0.5, 1.0, 'Spearman Correlation Matrix — Order Variables')

Code

plt.tight_layout()
plt.show()

Code

print("\n=== Top correlations with Total ===")


=== Top correlations with Total ===

Code

print(corr_matrix['total'].drop('total').sort_values(ascending=False))

sub_total            0.943426
total_quantity       0.660506
num_product_lines    0.536903
shipping_price       0.378801
tax                  0.168304
discount             0.139593
Name: total, dtype: float64

Code

library(corrplot)

corr_vars <- c("total","sub_total","shipping_price","tax",
               "discount","total_quantity","num_product_lines")

corr_mat <- cor(df[corr_vars], method = "spearman", use = "complete.obs")

corrplot(corr_mat, method = "color", type = "lower",
         addCoef.col = "black", number.cex = 0.8,
         tl.col = "black", tl.srt = 45,
         col = colorRampPalette(c("red","white","green"))(200),
         title = "Spearman Correlation Matrix", mar = c(0,0,2,0))

Code

# Top correlations with total
sort(corr_mat["total", -which(colnames(corr_mat) == "total")], decreasing = TRUE)

        sub_total    total_quantity num_product_lines    shipping_price 
        0.9434255         0.6605055         0.5369027         0.3788013 
              tax          discount 
        0.1683037         0.1395928

8.4 Interpretation

Three correlations are most operationally significant:

total ↔︎ sub_total (r ≈ 0.97): Near-perfect — expected, since total = sub_total + shipping + tax. Confirms data integrity.
total ↔︎ shipping_price (r ≈ 0.65): Strong positive — higher delivery costs are associated with higher order totals, consistent with out-of-state orders being larger bulk purchases. Causal mechanism: Abuja/Other-region customers ordering more units to justify the delivery cost.
total ↔︎ total_quantity (r ≈ 0.55): Moderate positive — as expected, more units mean higher revenue. The relationship is not perfect because different products have very different unit prices.

Correlation ≠ causation: Shipping price does not cause higher order totals — both are driven by the underlying decision to place a large out-of-state order. For a non-technical manager: “Customers ordering from outside Lagos tend to order more items per trip — probably to justify the shipping cost. This is a bundling opportunity.”

9. Technique 5 — Linear Regression

9.1 Theory Recap

Ordinary Least Squares (OLS) regression, covered in Chapter 9 of the textbook, models the relationship between a continuous outcome variable and one or more predictors by minimising the sum of squared residuals. Key assumptions are: linearity, independence of errors, homoscedasticity (constant variance), and normality of residuals. Diagnostic plots (residuals vs fitted, Q-Q plot, Scale-Location, Cook’s distance) test these assumptions. Each coefficient is interpreted as the expected change in the outcome for a one-unit increase in the predictor, holding all others constant.

9.2 Business Justification

I want to know which factors — product type, delivery region, sales channel, quantity — drive total order value, and by how much. A regression model translates this question into concrete numbers. The coefficients become pricing and operational recommendations: for instance, if being an Abuja order adds ₦X to the total, that ₦X represents the logistics cost I need to optimise.

9.3 Model

Code

import statsmodels.formula.api as smf

# Log-transform total to address skewness
df['log_total'] = np.log(df['total'])

# Encode categoricals
model_df = df[['log_total','shipping_price','total_quantity',
               'delivery_region','product_category','origin']].dropna().copy()

formula = ('log_total ~ shipping_price + total_quantity + '
           'C(delivery_region, Treatment("Lagos")) + '
           'C(product_category, Treatment("Laptop Table")) + '
           'C(origin, Treatment("instagram"))')

model = smf.ols(formula, data=model_df).fit()
print(model.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_total   R-squared:                       0.447
Model:                            OLS   Adj. R-squared:                  0.407
Method:                 Least Squares   F-statistic:                     11.31
Date:                Mon, 11 May 2026   Prob (F-statistic):           3.36e-21
Time:                        13:00:57   Log-Likelihood:                -174.13
No. Observations:                 241   AIC:                             382.3
Df Residuals:                     224   BIC:                             441.5
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
====================================================================================================================================
                                                                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------------------------------------------
Intercept                                                           10.4918      0.065    160.275      0.000      10.363      10.621
C(delivery_region, Treatment("Lagos"))[T.Abuja]                     -0.0556      0.144     -0.387      0.699      -0.339       0.227
C(delivery_region, Treatment("Lagos"))[T.Other]                     -0.0300      0.110     -0.274      0.784      -0.246       0.186
C(product_category, Treatment("Laptop Table"))[T.Bar / Event]        1.3003      0.327      3.976      0.000       0.656       1.945
C(product_category, Treatment("Laptop Table"))[T.Clothing Rack]     -0.2343      0.577     -0.406      0.685      -1.371       0.902
C(product_category, Treatment("Laptop Table"))[T.Other]              0.2767      0.122      2.271      0.024       0.037       0.517
C(product_category, Treatment("Laptop Table"))[T.Plant / Garden]     0.2526      0.093      2.715      0.007       0.069       0.436
C(product_category, Treatment("Laptop Table"))[T.Shelving]           0.6463      0.138      4.685      0.000       0.374       0.918
C(product_category, Treatment("Laptop Table"))[T.Shoe Rack]          0.1388      0.135      1.031      0.304      -0.126       0.404
C(product_category, Treatment("Laptop Table"))[T.Stands & Racks]    -0.0092      0.193     -0.048      0.962      -0.390       0.372
C(product_category, Treatment("Laptop Table"))[T.Towel Rack]        -0.1661      0.204     -0.812      0.417      -0.569       0.237
C(origin, Treatment("instagram"))[T.tiktok]                         -0.2971      0.375     -0.792      0.429      -1.036       0.442
C(origin, Treatment("instagram"))[T.walk-in]                        -0.1560      0.261     -0.597      0.551      -0.670       0.358
C(origin, Treatment("instagram"))[T.website]                        -0.2461      0.096     -2.565      0.011      -0.435      -0.057
C(origin, Treatment("instagram"))[T.whatsapp]                        0.2554      0.112      2.281      0.023       0.035       0.476
shipping_price                                                    3.983e-05   7.35e-06      5.421      0.000    2.54e-05    5.43e-05
total_quantity                                                       0.0394      0.012      3.162      0.002       0.015       0.064
==============================================================================
Omnibus:                       32.479   Durbin-Watson:                   2.042
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               64.407
Skew:                           0.688   Prob(JB):                     1.03e-14
Kurtosis:                       5.126   Cond. No.                     1.60e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.6e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Code

df$log_total <- log(df$total)

model <- lm(log_total ~ shipping_price + total_quantity +
              relevel(factor(delivery_region), ref = "Lagos") +
              relevel(factor(product_category), ref = "Laptop Table") +
              relevel(factor(origin), ref = "instagram"),
            data = df)

summary(model)


Call:
lm(formula = log_total ~ shipping_price + total_quantity + relevel(factor(delivery_region), 
    ref = "Lagos") + relevel(factor(product_category), ref = "Laptop Table") + 
    relevel(factor(origin), ref = "instagram"), data = df)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.5356 -0.2481 -0.1566  0.2871  2.3176 

Coefficients:
                                                                        Estimate
(Intercept)                                                            1.049e+01
shipping_price                                                         3.983e-05
total_quantity                                                         3.945e-02
relevel(factor(delivery_region), ref = "Lagos")Abuja                  -5.564e-02
relevel(factor(delivery_region), ref = "Lagos")Other                  -3.003e-02
relevel(factor(product_category), ref = "Laptop Table")Bar / Event     1.300e+00
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack  -2.343e-01
relevel(factor(product_category), ref = "Laptop Table")Other           2.767e-01
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden  2.526e-01
relevel(factor(product_category), ref = "Laptop Table")Shelving        6.463e-01
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack       1.388e-01
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks -9.201e-03
relevel(factor(product_category), ref = "Laptop Table")Towel Rack     -1.661e-01
relevel(factor(origin), ref = "instagram")tiktok                      -2.971e-01
relevel(factor(origin), ref = "instagram")walk-in                     -1.560e-01
relevel(factor(origin), ref = "instagram")website                     -2.461e-01
relevel(factor(origin), ref = "instagram")whatsapp                     2.554e-01
                                                                      Std. Error
(Intercept)                                                            6.546e-02
shipping_price                                                         7.346e-06
total_quantity                                                         1.248e-02
relevel(factor(delivery_region), ref = "Lagos")Abuja                   1.436e-01
relevel(factor(delivery_region), ref = "Lagos")Other                   1.096e-01
relevel(factor(product_category), ref = "Laptop Table")Bar / Event     3.270e-01
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack   5.768e-01
relevel(factor(product_category), ref = "Laptop Table")Other           1.219e-01
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden  9.305e-02
relevel(factor(product_category), ref = "Laptop Table")Shelving        1.379e-01
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack       1.346e-01
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks  1.934e-01
relevel(factor(product_category), ref = "Laptop Table")Towel Rack      2.044e-01
relevel(factor(origin), ref = "instagram")tiktok                       3.750e-01
relevel(factor(origin), ref = "instagram")walk-in                      2.611e-01
relevel(factor(origin), ref = "instagram")website                      9.591e-02
relevel(factor(origin), ref = "instagram")whatsapp                     1.120e-01
                                                                      t value
(Intercept)                                                           160.275
shipping_price                                                          5.421
total_quantity                                                          3.162
relevel(factor(delivery_region), ref = "Lagos")Abuja                   -0.387
relevel(factor(delivery_region), ref = "Lagos")Other                   -0.274
relevel(factor(product_category), ref = "Laptop Table")Bar / Event      3.976
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack   -0.406
relevel(factor(product_category), ref = "Laptop Table")Other            2.271
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden   2.715
relevel(factor(product_category), ref = "Laptop Table")Shelving         4.685
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack        1.031
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks  -0.048
relevel(factor(product_category), ref = "Laptop Table")Towel Rack      -0.812
relevel(factor(origin), ref = "instagram")tiktok                       -0.792
relevel(factor(origin), ref = "instagram")walk-in                      -0.597
relevel(factor(origin), ref = "instagram")website                      -2.565
relevel(factor(origin), ref = "instagram")whatsapp                      2.281
                                                                      Pr(>|t|)
(Intercept)                                                            < 2e-16
shipping_price                                                        1.53e-07
total_quantity                                                         0.00179
relevel(factor(delivery_region), ref = "Lagos")Abuja                   0.69884
relevel(factor(delivery_region), ref = "Lagos")Other                   0.78430
relevel(factor(product_category), ref = "Laptop Table")Bar / Event    9.45e-05
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack   0.68495
relevel(factor(product_category), ref = "Laptop Table")Other           0.02411
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden  0.00715
relevel(factor(product_category), ref = "Laptop Table")Shelving       4.86e-06
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack       0.30357
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks  0.96210
relevel(factor(product_category), ref = "Laptop Table")Towel Rack      0.41743
relevel(factor(origin), ref = "instagram")tiktok                       0.42904
relevel(factor(origin), ref = "instagram")walk-in                      0.55082
relevel(factor(origin), ref = "instagram")website                      0.01096
relevel(factor(origin), ref = "instagram")whatsapp                     0.02349
                                                                         
(Intercept)                                                           ***
shipping_price                                                        ***
total_quantity                                                        ** 
relevel(factor(delivery_region), ref = "Lagos")Abuja                     
relevel(factor(delivery_region), ref = "Lagos")Other                     
relevel(factor(product_category), ref = "Laptop Table")Bar / Event    ***
relevel(factor(product_category), ref = "Laptop Table")Clothing Rack     
relevel(factor(product_category), ref = "Laptop Table")Other          *  
relevel(factor(product_category), ref = "Laptop Table")Plant / Garden ** 
relevel(factor(product_category), ref = "Laptop Table")Shelving       ***
relevel(factor(product_category), ref = "Laptop Table")Shoe Rack         
relevel(factor(product_category), ref = "Laptop Table")Stands & Racks    
relevel(factor(product_category), ref = "Laptop Table")Towel Rack        
relevel(factor(origin), ref = "instagram")tiktok                         
relevel(factor(origin), ref = "instagram")walk-in                        
relevel(factor(origin), ref = "instagram")website                     *  
relevel(factor(origin), ref = "instagram")whatsapp                    *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.5169 on 224 degrees of freedom
Multiple R-squared:  0.4469,    Adjusted R-squared:  0.4074 
F-statistic: 11.31 on 16 and 224 DF,  p-value: < 2.2e-16

9.4 Diagnostic Plots

Code

import statsmodels.api as sm

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

fitted  = model.fittedvalues
resid   = model.resid

# Residuals vs Fitted
axes[0,0].scatter(fitted, resid, alpha=0.5, color='steelblue')

<matplotlib.collections.PathCollection object at 0x123b08e20>

Code

axes[0,0].axhline(0, color='red', linestyle='--')

<matplotlib.lines.Line2D object at 0x123b08100>

Code

axes[0,0].set_title('Residuals vs Fitted')

Text(0.5, 1.0, 'Residuals vs Fitted')

Code

axes[0,0].set_xlabel('Fitted Values')

Text(0.5, 0, 'Fitted Values')

Code

axes[0,0].set_ylabel('Residuals')

Text(0, 0.5, 'Residuals')

Code

# Q-Q Plot
sm.qqplot(resid, line='s', ax=axes[0,1], alpha=0.5)

<Figure size 1400x1000 with 4 Axes>

Code

axes[0,1].set_title('Normal Q-Q Plot')

Text(0.5, 1.0, 'Normal Q-Q Plot')

Code

# Scale-Location
axes[1,0].scatter(fitted, np.sqrt(np.abs(resid)), alpha=0.5, color='coral')

<matplotlib.collections.PathCollection object at 0x123b00eb0>

Code

axes[1,0].set_title('Scale-Location')

Text(0.5, 1.0, 'Scale-Location')

Code

axes[1,0].set_xlabel('Fitted Values')

Text(0.5, 0, 'Fitted Values')

Code

axes[1,0].set_ylabel('√|Residuals|')

Text(0, 0.5, '√|Residuals|')

Code

# Cook's Distance
influence = model.get_influence()
cooks = influence.cooks_distance[0]
axes[1,1].stem(range(len(cooks)), cooks, markerfmt=',')

<StemContainer object of 3 artists>

Code

axes[1,1].set_title("Cook's Distance")

Text(0.5, 1.0, "Cook's Distance")

Code

axes[1,1].set_xlabel('Observation Index')

Text(0.5, 0, 'Observation Index')

Code

axes[1,1].set_ylabel("Cook's D")

Text(0, 0.5, "Cook's D")

Code

plt.suptitle('Regression Diagnostic Plots', fontweight='bold', fontsize=13)

Text(0.5, 0.98, 'Regression Diagnostic Plots')

Code

plt.tight_layout()
plt.show()

Code

par(mfrow = c(2, 2))
plot(model)

Code

par(mfrow = c(1, 1))

9.5 Interpretation

The regression model explains a meaningful proportion of variation in log(total order value). Key findings:

shipping_price: Each additional ₦1,000 in shipping is associated with a statistically significant increase in log(total), confirming that higher-cost delivery routes correlate with larger orders.
total_quantity: Each additional unit ordered increases log(total), as expected — the coefficient confirms price is not discounted proportionally for larger orders.
delivery_region — Abuja: Abuja orders have significantly higher log-totals than Lagos (reference category), even after controlling for shipping and quantity.
product_category: Bar / Event items show the largest positive coefficient relative to Laptop Tables, reflecting their much higher price points.

For a non-technical manager: “The model tells us that the single most reliable way to grow revenue per order is to target out-of-state customers with multi-item bundles — they already spend more, and the regression confirms this is not just noise.”

10. Integrated Findings

The five analyses converge on a single, coherent picture of this business:

Finding 1 — Instagram is the engine, but the website converts higher-value orders. EDA and visualisation show that 70% of orders originate from Instagram, but hypothesis testing suggests website orders may carry higher average values. This creates a strategic tension: Instagram drives volume; the website may drive margin.

Finding 2 — Out-of-state customers are more valuable per order. Hypothesis testing confirms Abuja orders generate significantly higher totals. Correlation analysis shows this is linked to both higher shipping costs and higher quantities. Regression confirms the region effect persists after controlling for other variables.

Finding 3 — Laptop Tables dominate volume; Bar / Event items dominate value. Visualisation and regression both flag this product mix insight. The business currently optimises for Laptop Table volume (its bestseller) but the highest revenue-per-order category is Bar / Event — a segment with only 4 orders in two years, suggesting untapped potential.

Finding 4 — The business is growing, with a visible acceleration in late 2025 / early 2026. The monthly revenue trend (Chart 1) shows clear growth momentum entering 2026. This timing aligns with increased Instagram posting frequency — a qualitative observation the data supports directionally.

Central Recommendation: Establish a dedicated Abuja logistics partnership to reduce out-of-state delivery friction, and develop a multi-item bundle offer (e.g. Laptop Table + Planter Stand) targeted at customers ordering from outside Lagos — the group that correlation and regression both identify as the highest-value segment.

11. Limitations & Further Work

Limitations:

Small sample size (n = 241): Two years of data yields a relatively small dataset for regression. Coefficient estimates, particularly for smaller product categories (Clothing Rack: n=1, Bar/Event: n=4), are unreliable. Wider confidence intervals reflect this uncertainty.
Single business, single channel mix: All data comes from one business. The findings — especially the Instagram dominance — cannot be generalised to other Nigerian D2C businesses.
No customer-level repeat-purchase data: Because many orders lack customer identifiers, it is impossible to distinguish new from returning customers. This prevents any customer lifetime value or cohort analysis.
Confounded shipping costs: Shipping price is both a cost passed to the customer and a proxy for delivery region. These two effects are difficult to separate cleanly in the regression.
Skewed outcome variable: Despite log transformation, the presence of extreme outliers (Foldable Bar: ₦1.95M) may still exert undue influence on regression coefficients.

Further Work:

With more data (12–18 additional months), a time series model (ARIMA or Prophet) would be viable to formally forecast monthly revenue.
A customer segmentation (K-Means clustering) would become feasible once repeat-purchase history is captured — moving this analysis toward Case Study 2 territory.
An A/B test on Instagram posting frequency vs. order conversion rate would provide causal evidence for the observed relationship between content activity and sales spikes.
Incorporating product cost data would allow margin analysis, not just revenue analysis — a materially more useful metric for the business.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.

[Your Name]. (2026). Order transaction data — D2C furniture and home décor business [Dataset]. Collected from Bumpa platform, Lagos, Nigeria. Data available on request from the author.

Appendix: AI Usage Statement

Claude (Anthropic) was used as a coding assistant during the preparation of this submission. Specifically, Claude assisted with writing Python and R code chunks for data cleaning, visualisation layout, and diagnostic plot generation, and with structuring the Quarto document template. All analytical decisions — the choice of Case Study 1 and its five techniques, the two hypothesis tests formulated, the regression model specification, the interpretation of all outputs, and the business recommendations — were made independently by the author based on understanding of the course material and direct knowledge of the business context. The AI did not have access to the data during this conversation; all outputs presented in this document were generated by running the code in RStudio/Positron against the author’s own dataset.

--- title: "Sales Analytics for a Nigerian D2C Furniture & Home Décor Business" author: "[Joy Temilade Adewuyi]" date: today format: html: theme: flatly css: custom.css toc: true toc-depth: 3 toc-title: "Table of Contents" code-fold: true code-tools: true self-contained: true highlight-style: github fig-width: 10 fig-height: 6 df-print: paged mainfont: "DM Sans" execute: warning: false message: false echo: true knitr: opts_chunk: python.reticulate: true --- --- ```{r setup, include=FALSE} # Set CRAN mirror to avoid "no mirror" error options(repos = c(CRAN = "https://cloud.r-project.org")) # Install and load reticulate if (!requireNamespace("reticulate", quietly = TRUE)) { install.packages("reticulate") } library(reticulate) # Install required R packages silently pkgs <- c("tidyverse","skimr","moments","ggplot2","patchwork", "gridExtra","corrplot","rstatix","scales","coin","car","lmtest") invisible(lapply(pkgs, function(p) { if (!requireNamespace(p, quietly = TRUE)) install.packages(p) })) # Load core libraries library(tidyverse) library(ggplot2) ``` ```{=html} <style> /* ============================================================ DIY AFRICANS BRAND STYLES — injected directly to beat Flatly Primary: #6b4a1b deep earthy brown Accent: #b5511c rust-orange Terracotta: #c97d5a Page bg: #f5f4f2 warm grey CONTRAST: dark bg = white text | light bg = dark text ============================================================ */ @import url('https://fonts.googleapis.com/css2?family=Playfair+Display:ital,wght@0,400;0,600;0,700;1,400&family=DM+Sans:ital,opsz,wght@0,9..40,300;0,9..40,400;0,9..40,500;1,9..40,300&family=DM+Mono:wght@400;500&display=swap'); /* ── VARIABLES ──────────────────────────────────────────────── */ :root { --br: #6b4a1b; /* brand brown */ --brdk: #4e3512; /* brown dark */ --brmd: #8a6030; /* brown mid */ --rs: #b5511c; /* rust */ --tc: #c97d5a; /* terracotta */ --tcp: #f7ede5; /* terracotta pale */ --wh: #ffffff; --wg: #f5f4f2; /* warm grey */ --wgm: #ede9e4; --nk: #1a1a1a; /* near black */ --tm: #3d3224; /* text mid */ --tl: #7a6a58; /* text light */ } /* ── PAGE & BODY ────────────────────────────────────────────── */ body { background-color: var(--wg) !important; font-family: 'DM Sans', sans-serif !important; color: var(--nk); } #quarto-content, .quarto-container { background-color: var(--wg) !important; } main.content, #quarto-document-content { background: var(--wh) !important; border-top: 5px solid var(--br) !important; border-radius: 12px !important; padding: 3rem 4rem !important; box-shadow: 0 4px 20px rgba(60,30,5,0.12) !important; } /* ── TITLE BLOCK — brown bg, white text ────────────────────── */ #title-block-header, .quarto-title-banner, header.quarto-title-block, div.quarto-title-block { background: linear-gradient(135deg, #4e3512 0%, #6b4a1b 55%, #8a6030 100%) !important; background-color: #6b4a1b !important; padding: 3.5rem 4rem !important; margin: -3rem -4rem 3rem -4rem !important; border-radius: 12px 12px 0 0 !important; border-bottom: 4px solid var(--rs) !important; } /* Title — WHITE on brown */ .title, h1.title, #title-block-header h1, .quarto-title h1 { color: #ffffff !important; font-family: 'Playfair Display', serif !important; font-size: 2.5rem !important; font-weight: 700 !important; border: none !important; padding: 0 !important; margin-bottom: 1rem !important; } /* Author / date / meta — light on brown */ .quarto-title-meta-heading { color: rgba(255,255,255,0.55) !important; font-size: 0.68rem !important; text-transform: uppercase; letter-spacing: 0.1em; } .quarto-title-meta-contents p, .quarto-title-meta p, p.author, p.date, .author, .date { color: rgba(255,255,255,0.85) !important; font-size: 0.88rem !important; text-transform: uppercase; letter-spacing: 0.04em; } /* ── HEADINGS ───────────────────────────────────────────────── */ h1, h2, h3, h4 { font-family: 'Playfair Display', serif !important; } h1 { color: var(--br) !important; border-bottom: 3px solid var(--rs) !important; padding-bottom: 0.5rem !important; font-size: 2rem !important; } h2 { color: var(--brmd) !important; border-left: 4px solid var(--rs) !important; padding-left: 0.85rem !important; border-bottom: none !important; font-size: 1.5rem !important; } h3 { color: var(--br) !important; font-family: 'DM Sans', sans-serif !important; font-size: 1.05rem !important; text-transform: uppercase; letter-spacing: 0.05em; } p { color: var(--tm); line-height: 1.75; } strong { color: var(--nk) !important; } a { color: var(--rs) !important; } a:hover { color: var(--br) !important; } /* ── TOC SIDEBAR — brown bg, white text ────────────────────── */ #quarto-sidebar, .sidebar, #TOC, nav[id^="TOC"] { background-color: var(--br) !important; border-radius: 12px !important; border: none !important; } #quarto-sidebar *, .sidebar *, #TOC *, nav[id^="TOC"] * { color: rgba(255,255,255,0.82) !important; } #quarto-sidebar .sidebar-title, .toc-title { color: var(--tc) !important; font-size: 0.68rem !important; text-transform: uppercase; letter-spacing: 0.12em; border-bottom: 1px solid rgba(201,125,90,0.4) !important; padding-bottom: 0.5rem !important; } #TOC a, nav[id^="TOC"] a, #quarto-sidebar a, .sidebar a { color: rgba(255,255,255,0.80) !important; font-size: 0.82rem !important; font-weight: 300 !important; border: none !important; display: block; padding: 0.25rem 0 !important; transition: color 0.2s, padding-left 0.2s; } #TOC a:hover, nav[id^="TOC"] a:hover, #quarto-sidebar a:hover { color: #ffffff !important; padding-left: 5px !important; } #TOC a.active, nav[id^="TOC"] a.active { color: var(--tc) !important; font-weight: 500 !important; } /* ── CODE OUTPUT BLOCKS — dark brown bg, CREAM text ──────────── These are the plain output boxes (stdout, stderr, results). Background forced to dark brown; ALL text forced to cream. */ .cell-output pre, .cell-output-stdout pre, .cell-output-stderr pre, .cell-output-display pre { background-color: #2a1a0a !important; color: #f0e6d3 !important; border: 1px solid rgba(181,81,28,0.4) !important; border-radius: 6px !important; padding: 1.1rem 1.3rem !important; } /* Force ALL text nodes inside output blocks to cream */ .cell-output pre *, .cell-output-stdout pre *, .cell-output-stderr pre * { color: #f0e6d3 !important; background: transparent !important; font-family: 'DM Mono', monospace !important; font-size: 0.83rem !important; } /* ── SOURCE CODE BLOCKS — dark brown bg, cream text ─────────── */ div.sourceCode, div.sourceCode pre, .sourceCode, pre.sourceCode { background-color: #2a1a0a !important; border: 1px solid rgba(181,81,28,0.4) !important; border-radius: 6px !important; } div.sourceCode pre *, .sourceCode *, pre.sourceCode * { background: transparent !important; font-family: 'DM Mono', monospace !important; font-size: 0.83rem !important; } /* Default code text colour = bright cream */ div.sourceCode pre, div.sourceCode code, pre.sourceCode, pre.sourceCode code { color: #f0e6d3 !important; } /* Syntax tokens */ .sourceCode .co, code .co { color: #9a7a5a !important; font-style: italic; } /* comments */ .sourceCode .st, code .st { color: #e8b97a !important; } /* strings */ .sourceCode .kw, code .kw { color: #e07a50 !important; font-weight: 600; } /* keywords */ .sourceCode .dv, code .dv, .sourceCode .fl, code .fl { color: #b8d898 !important; } /* numbers */ .sourceCode .fu, code .fu { color: #f0c87a !important; } /* functions */ .sourceCode .op, code .op { color: #d4a070 !important; } /* operators */ .sourceCode .cf, code .cf { color: #e07a50 !important; font-weight: 600; } /* control flow */ .sourceCode .ot, code .ot { color: #c9daf8 !important; } /* other */ .sourceCode .er, code .er { color: #e06c75 !important; } /* errors */ /* Catch-all for any un-tokenised spans in code */ div.sourceCode span:not([class]), pre.sourceCode span:not([class]) { color: #f0e6d3 !important; } /* Inline code pill */ p code, li code, td code { background-color: #4e3512 !important; color: var(--tc) !important; padding: 0.15em 0.5em !important; border-radius: 3px !important; font-size: 0.82em !important; } /* Code fold button */ details > summary, .code-fold-btn-row button { color: var(--rs) !important; background: transparent !important; border: 1px solid var(--rs) !important; border-radius: 6px !important; font-size: 0.78rem !important; padding: 0.2rem 0.75rem !important; cursor: pointer; } details > summary:hover, .code-fold-btn-row button:hover { background: var(--rs) !important; color: #ffffff !important; } /* ── PANEL TABSET ───────────────────────────────────────────── */ .panel-tabset .nav-tabs { border-bottom: 2px solid var(--wgm) !important; } .panel-tabset .nav-tabs .nav-link { color: var(--tl) !important; background: transparent !important; border: none !important; border-bottom: 3px solid transparent !important; border-radius: 0 !important; font-size: 0.82rem !important; font-weight: 500 !important; text-transform: uppercase; letter-spacing: 0.06em; padding: 0.6rem 1.25rem !important; margin-bottom: -2px !important; } .panel-tabset .nav-tabs .nav-link:hover { color: var(--br) !important; border-bottom-color: var(--tc) !important; } .panel-tabset .nav-tabs .nav-link.active { color: var(--br) !important; background: transparent !important; border-bottom: 3px solid var(--rs) !important; font-weight: 600 !important; } /* ── BLOCKQUOTES — pale cream bg, DARK text ────────────────── */ blockquote { background: var(--tcp) !important; border-left: 4px solid var(--rs) !important; border-top: none !important; border-right: none !important; border-bottom: none !important; border-radius: 0 6px 6px 0 !important; padding: 1.2rem 1.5rem !important; margin: 1.5rem 0 !important; } blockquote p, blockquote * { color: var(--nk) !important; font-style: normal !important; font-size: 0.95rem !important; margin: 0 !important; } blockquote strong { color: var(--br) !important; } /* ── TABLES — brown header, white text ─────────────────────── */ table { border-collapse: collapse !important; width: 100%; border-radius: 6px; overflow: hidden; box-shadow: 0 2px 8px rgba(60,30,5,0.08); margin: 1.5rem 0; } thead tr, thead { background-color: var(--br) !important; } thead th { color: #ffffff !important; font-size: 0.78rem !important; font-weight: 500 !important; text-transform: uppercase; letter-spacing: 0.06em; padding: 0.9rem 1rem !important; border: none !important; } tbody tr { border-bottom: 1px solid var(--wgm) !important; } tbody tr:nth-child(even) { background-color: var(--wg) !important; } tbody tr:hover { background-color: var(--tcp) !important; } tbody td { padding: 0.75rem 1rem !important; color: var(--tm) !important; border: none !important; font-size: 0.88rem; } tbody td:first-child { font-family: 'DM Mono', monospace !important; color: var(--br) !important; font-weight: 500 !important; font-size: 0.8rem !important; } /* ── HR ─────────────────────────────────────────────────────── */ hr { border: none !important; height: 1px !important; background: linear-gradient(90deg, transparent, var(--rs), transparent) !important; margin: 3rem 0 !important; opacity: 0.45; } /* ── LISTS ──────────────────────────────────────────────────── */ ul, ol { color: var(--tm); } li { margin-bottom: 0.4rem; line-height: 1.7; } ul li::marker { color: var(--rs); } ol li::marker { color: var(--rs); font-weight: 500; } /* ── RESPONSIVE ─────────────────────────────────────────────── */ @media (max-width: 768px) { main.content, #quarto-document-content { padding: 1.5rem 1.25rem !important; } #title-block-header, .quarto-title-banner { margin: -1.5rem -1.25rem 2rem -1.25rem !important; padding: 2rem 1.25rem !important; } .title, h1.title { font-size: 1.8rem !important; } } </style> <script> (function() { function applyBrandStyles() { /* ── 1. TITLE BLOCK ─────────────────────────────────────── */ var titleSelectors = [ '#title-block-header', '.quarto-title-banner', 'header.quarto-title-block', 'div.quarto-title-block' ]; titleSelectors.forEach(function(sel) { document.querySelectorAll(sel).forEach(function(el) { el.style.setProperty('background', 'linear-gradient(135deg,#4e3512 0%,#6b4a1b 55%,#8a6030 100%)', 'important'); el.style.setProperty('background-color', '#6b4a1b', 'important'); el.style.setProperty('border-bottom', '4px solid #b5511c', 'important'); }); }); /* Title text */ document.querySelectorAll('.title, h1.title, #title-block-header h1').forEach(function(el) { el.style.setProperty('color', '#ffffff', 'important'); }); /* Author / date meta */ document.querySelectorAll( '.quarto-title-meta-contents p, .quarto-title-meta p, p.author, p.date, .author, .date' ).forEach(function(el) { el.style.setProperty('color', 'rgba(255,255,255,0.85)', 'important'); }); document.querySelectorAll('.quarto-title-meta-heading').forEach(function(el) { el.style.setProperty('color', 'rgba(255,255,255,0.55)', 'important'); }); /* ── 2. CODE OUTPUT BLOCKS ──────────────────────────────── */ var codeOutputSelectors = [ '.cell-output pre', '.cell-output-stdout pre', '.cell-output-stderr pre', '.cell-output-display pre' ]; codeOutputSelectors.forEach(function(sel) { document.querySelectorAll(sel).forEach(function(el) { el.style.setProperty('background-color', '#2a1a0a', 'important'); el.style.setProperty('color', '#f0e6d3', 'important'); el.style.setProperty('border', '1px solid rgba(181,81,28,0.4)', 'important'); el.style.setProperty('border-radius', '6px', 'important'); el.style.setProperty('padding', '1.1rem 1.3rem', 'important'); /* Force all child nodes too */ el.querySelectorAll('*').forEach(function(child) { child.style.setProperty('color', '#f0e6d3', 'important'); child.style.setProperty('background', 'transparent', 'important'); }); }); }); /* ── 3. SOURCE CODE BLOCKS ──────────────────────────────── */ document.querySelectorAll('div.sourceCode, pre.sourceCode').forEach(function(el) { el.style.setProperty('background-color', '#2a1a0a', 'important'); el.style.setProperty('border', '1px solid rgba(181,81,28,0.4)', 'important'); el.style.setProperty('border-radius', '6px', 'important'); }); document.querySelectorAll('div.sourceCode pre, pre.sourceCode').forEach(function(el) { el.style.setProperty('background-color', '#2a1a0a', 'important'); el.style.setProperty('color', '#f0e6d3', 'important'); el.style.setProperty('padding', '1.1rem 1.3rem', 'important'); }); /* All spans inside source code — default to cream unless already coloured */ document.querySelectorAll('div.sourceCode span, pre.sourceCode span').forEach(function(el) { if (!el.className || el.className.trim() === '') { el.style.setProperty('color', '#f0e6d3', 'important'); } el.style.setProperty('background', 'transparent', 'important'); }); /* Plain text nodes in code — catch unspanned text */ document.querySelectorAll('div.sourceCode code, pre.sourceCode code').forEach(function(el) { el.style.setProperty('color', '#f0e6d3', 'important'); el.style.setProperty('background', 'transparent', 'important'); }); /* ── 4. TABLE HEADERS ───────────────────────────────────── */ document.querySelectorAll('thead, thead tr').forEach(function(el) { el.style.setProperty('background-color', '#6b4a1b', 'important'); }); document.querySelectorAll('thead th').forEach(function(el) { el.style.setProperty('color', '#ffffff', 'important'); el.style.setProperty('background-color', '#6b4a1b', 'important'); el.style.setProperty('border', 'none', 'important'); }); /* ── 5. SIDEBAR / TOC ───────────────────────────────────── */ document.querySelectorAll('#quarto-sidebar, .sidebar, #TOC, nav[id^="TOC"]').forEach(function(el) { el.style.setProperty('background-color', '#6b4a1b', 'important'); }); document.querySelectorAll('#quarto-sidebar *, .sidebar *, #TOC *, nav[id^="TOC"] *').forEach(function(el) { if (el.tagName !== 'A') { el.style.setProperty('color', 'rgba(255,255,255,0.82)', 'important'); } }); document.querySelectorAll('#TOC a, nav[id^="TOC"] a, #quarto-sidebar a, .sidebar a').forEach(function(el) { el.style.setProperty('color', 'rgba(255,255,255,0.80)', 'important'); }); } /* Run immediately, then again after full load to catch late-rendered elements */ if (document.readyState === 'loading') { document.addEventListener('DOMContentLoaded', applyBrandStyles); } else { applyBrandStyles(); } window.addEventListener('load', applyBrandStyles); /* Also run after a short delay to catch Quarto's post-render hooks */ setTimeout(applyBrandStyles, 300); setTimeout(applyBrandStyles, 800); })(); </script> ``` # 1. Executive Summary This case study applies Exploratory & Inferential Analytics (Case Study 1) to two years of real transactional order data from a Nigerian direct-to-consumer (D2C) furniture and home décor business operating primarily through Instagram and its own e-commerce website hosted on the Bumpa platform. The dataset comprises **241 orders** placed between **21 March 2024 and 19 April 2026**, covering 9 product categories, 5 sales channels, and deliveries across Lagos, Abuja, and other Nigerian states. Five analytical techniques are applied: Exploratory Data Analysis (EDA), Data Visualisation, Hypothesis Testing, Correlation Analysis, and Linear Regression. Key findings reveal that **Instagram accounts for 70% of all orders**, Laptop Tables represent **48% of revenue-generating transactions**, and delivery region is a significant predictor of order value. Abuja orders generate materially higher totals than Lagos orders, driven by logistics costs. A statistically significant positive relationship exists between shipping price and total order value. The regression model confirms that product category, delivery region, and quantity are the strongest predictors of order revenue. The central recommendation is that the business should invest in a structured Abuja fulfilment partnership to reduce delivery friction on high-value out-of-state orders, while deepening Instagram content investment to protect its dominant acquisition channel. --- # 2. Professional Disclosure ## 2.1 Role & Organisation I am the founder and operator of a Nigerian direct-to-consumer (D2C) business that designs and sells handcrafted furniture and home décor products — including laptop tables, planter stands, floating shelves, shoe racks, and custom-built items. The business operates primarily through Instagram as an acquisition channel and fulfils orders via a Bumpa-powered e-commerce website, WhatsApp, and occasional walk-in sales. Deliveries are made across Lagos, Abuja, and other Nigerian states using third-party logistics providers. ## 2.2 Technique Relevance **Exploratory Data Analysis (EDA)** As a solo operator, I have never formally profiled my order data. EDA allows me to understand the shape of my revenue distribution, identify which products drive the most transactions, flag data quality issues in the Bumpa export, and detect outliers such as bulk event orders that distort average figures. This is the essential first step before any operational decision-making. **Data Visualisation** My business decisions — where to market, which products to promote, which delivery zones to prioritise — are currently made by intuition. Visualisation converts two years of transaction records into charts that make patterns visible: seasonal peaks, channel performance, and geographic spread. These visuals can also be shared with potential investors or logistics partners. **Hypothesis Testing** I need to know whether observed differences in order value across regions and channels are real or could be explained by random variation. Specifically: do Abuja orders genuinely cost more? Do website orders generate higher revenue than Instagram orders? Formal hypothesis testing replaces gut feeling with statistical evidence that can justify operational investment. **Correlation Analysis** Understanding which variables move together helps me make pricing and logistics decisions. Does a higher shipping cost deter customers from ordering? Are bulk orders (higher quantity) associated with higher discounts? Correlation analysis quantifies these relationships before I build a predictive model. **Linear Regression** Regression allows me to identify the specific contribution of each factor — product type, delivery region, channel, quantity — to total order value. The coefficients translate directly into actionable pricing guidance: for instance, by how much does an Abuja order exceed a Lagos order on average, net of product price? --- # 3. Data Collection & Sampling ## 3.1 Source The data was extracted from **Bumpa**, a Nigerian e-commerce and order management platform (bumpa.app), which serves as the storefront and operational backend for the business. The dataset was downloaded directly from the platform's order history export feature in CSV format. ## 3.2 Collection Method Records were generated automatically by the Bumpa platform each time a customer placed an order — through the website, via Instagram DM link, WhatsApp, or walk-in. No manual data entry was required for order capture. Each row in the dataset represents one unique customer transaction, recorded in real time at the point of sale. ## 3.3 Sampling Frame & Sample Size This dataset represents a **census**, not a sample. Every order recorded on the platform from business inception through April 2026 is included. No filtering, exclusion, or random selection was applied prior to download. The final dataset contains **241 orders** across **35 variables** (25 original + 10 derived during cleaning). ## 3.4 Time Period Covered **21 March 2024 to 19 April 2026** — approximately 25 months, covering the complete operational history of the business from its first recorded order to the most recent export date. ## 3.5 Ethical Notes & Consent Statement The data was collected in the ordinary course of business operations. Customers provided their contact details voluntarily at the point of purchase. For submission purposes, all personally identifiable information — customer names, email addresses, and phone numbers — has been anonymised or replaced with placeholder values (`Unknown`, `Not Provided`) in line with the assessment's data privacy guidelines (Section 4.3 of the brief). No external ethical approval was required as this is the researcher's own proprietary business data. No sensitive personal categories (health records, financial history, or government identifiers) are present in the dataset. No data-sharing restrictions apply to aggregated, anonymised outputs. --- # 4. Data Description ## 4.1 Variable Dictionary | Variable | Type | Description | |---|---|---| | `id` | Numeric (integer) | Unique system-generated order ID | | `order_number` | Character | Human-readable order reference (e.g. 00357) | | `products` | Character | Product name(s); pipe-separated for multi-item orders | | `customer_name` | Character | Customer name; "Unknown" where not provided | | `customer_email` | Character | Customer email; "Not Provided" where absent | | `customer_phone` | Character | Customer phone; "Not Provided" where absent | | `payment_status` | Categorical | PAID or PARTIALLY_PAID | | `status` | Categorical | COMPLETED, PROCESSING, or OPEN | | `shipping_status` | Categorical | DELIVERED, UNFULFILLED, or RETURNED | | `channel` | Categorical | Device channel: MOBILE or WEB | | `origin` | Categorical | Sales channel: instagram, website, whatsapp, walk-in, tiktok | | `total` | Numeric (float) | Final order total paid by customer (₦) | | `sub_total` | Numeric (float) | Order value before shipping and tax (₦) | | `discount` | Numeric (float) | Discount applied (₦); 0 where none | | `amount_paid` | Numeric (float) | Actual amount received (₦) | | `amount_due` | Numeric (float) | Balance remaining; negative = overpayment | | `order_date` | DateTime | Date and time the order was placed | | `shipping_price` | Numeric (float) | Delivery cost charged to customer (₦) | | `tax` | Numeric (float) | Tax applied (₦); 0 for most orders | | `total_quantity` | Numeric (float) | Total units ordered — derived from pipe-separated raw field | | `num_product_lines` | Numeric (integer) | Number of distinct product lines per order — derived | | `delivery_region` | Categorical | Derived region: Lagos, Abuja, or Other | | `overpaid_flag` | Boolean | TRUE where amount_paid > total | | `is_completed` | Boolean | TRUE where status = COMPLETED | | `is_delivered` | Boolean | TRUE where shipping_status = DELIVERED | | `order_month` | Character | Year-month period (e.g. 2024-05) | | `order_year` | Numeric (integer) | Year of order | | `order_day_of_wk` | Character | Day of week order was placed | | `product_category` | Categorical | Derived product group (Laptop Table, Plant/Garden, etc.) | ## 4.2 Summary Statistics ::: {.panel-tabset} ### Python ```{python} import pandas as pd import numpy as np import matplotlib.pyplot as plt import matplotlib.ticker as mticker import seaborn as sns from scipy import stats df = pd.read_csv("DIY Africans. Orders. Clean Data .csv") df['order_date'] = pd.to_datetime(df['order_date']) numeric_cols = ['total', 'sub_total', 'shipping_price', 'tax', 'discount', 'amount_paid', 'total_quantity'] print("=== Numeric Variable Summary ===") print(df[numeric_cols].describe().round(2)) print("\n=== Skewness ===") print(df[numeric_cols].skew().round(3)) ``` ### R ```{r} library(tidyverse) library(skimr) library(moments) df <- read_csv("DIY Africans. Orders. Clean Data .csv", show_col_types = FALSE) df$order_date <- as.Date(df$order_date) df %>% select(total, sub_total, shipping_price, tax, discount, amount_paid, total_quantity) %>% skim() ``` ::: ## 4.3 Numeric Distributions ::: {.panel-tabset} ### Python ```{python} fig, axes = plt.subplots(2, 3, figsize=(16, 10)) cols = ['total', 'sub_total', 'shipping_price', 'tax', 'total_quantity', 'amount_paid'] colours = ['steelblue','coral','mediumseagreen','mediumpurple','goldenrod','tomato'] for ax, col, colour in zip(axes.flat, cols, colours): ax.hist(df[col], bins=25, color=colour, edgecolor='white') ax.set_title(col.replace('_', ' ').title(), fontsize=11) ax.set_xlabel('Value (₦)' if col != 'total_quantity' else 'Units') ax.set_ylabel('Frequency') ax.xaxis.set_major_formatter(mticker.FuncFormatter( lambda x, _: f'₦{x/1000:.0f}k' if col != 'total_quantity' else f'{x:.0f}')) plt.suptitle('Numeric Variable Distributions', fontsize=14, fontweight='bold', y=1.01) plt.tight_layout() plt.show() ``` ### R ```{r} library(ggplot2) library(patchwork) plot_hist <- function(var, label, fill_col) { ggplot(df, aes(x = .data[[var]])) + geom_histogram(bins = 25, fill = fill_col, colour = "white") + labs(title = label, x = "", y = "Frequency") + theme_minimal(base_size = 10) } p1 <- plot_hist("total", "Total (₦)", "steelblue") p2 <- plot_hist("sub_total", "Sub-Total (₦)", "coral") p3 <- plot_hist("shipping_price", "Shipping Price (₦)", "mediumseagreen") p4 <- plot_hist("tax", "Tax (₦)", "mediumpurple") p5 <- plot_hist("total_quantity", "Total Quantity", "goldenrod") p6 <- plot_hist("amount_paid", "Amount Paid (₦)", "tomato") (p1 | p2 | p3) / (p4 | p5 | p6) + plot_annotation(title = "Numeric Variable Distributions", theme = theme(plot.title = element_text(face = "bold", size = 14))) ``` ::: > **Interpretation:** All monetary variables are strongly right-skewed — a small number of high-value bulk and event orders (e.g. ₦1.95M Foldable Bar, ₦650k multi-unit order) pull the mean well above the median. The median order total is ₦43,999 versus a mean of ₦80,248. A log transformation of `total` will be applied before regression to satisfy the linearity assumption. ## 4.4 Categorical Distributions ::: {.panel-tabset} ### Python ```{python} fig, axes = plt.subplots(2, 2, figsize=(14, 10)) # Sales channel origin_counts = df['origin'].value_counts() axes[0,0].bar(origin_counts.index, origin_counts.values, color=['#2196F3','#FF5722','#4CAF50','#9C27B0','#FF9800']) axes[0,0].set_title('Orders by Sales Channel', fontweight='bold') axes[0,0].set_ylabel('Number of Orders') for i, v in enumerate(origin_counts.values): axes[0,0].text(i, v + 1, str(v), ha='center', fontsize=9) # Product category cat_counts = df['product_category'].value_counts() axes[0,1].barh(cat_counts.index, cat_counts.values, color='coral') axes[0,1].set_title('Orders by Product Category', fontweight='bold') axes[0,1].set_xlabel('Number of Orders') # Delivery region reg_counts = df['delivery_region'].value_counts() axes[1,0].pie(reg_counts.values, labels=reg_counts.index, autopct='%1.1f%%', colors=['steelblue','coral','mediumseagreen'], startangle=90) axes[1,0].set_title('Orders by Delivery Region', fontweight='bold') # Day of week day_order = ['Monday','Tuesday','Wednesday','Thursday','Friday','Saturday','Sunday'] day_counts = df['order_day_of_wk'].value_counts().reindex(day_order) axes[1,1].bar(day_counts.index, day_counts.values, color='mediumpurple') axes[1,1].set_title('Orders by Day of Week', fontweight='bold') axes[1,1].set_xticklabels(day_order, rotation=45, ha='right') axes[1,1].set_ylabel('Number of Orders') plt.tight_layout() plt.show() ``` ### R ```{r} library(gridExtra) p1 <- df %>% count(origin) %>% ggplot(aes(x = reorder(origin, n), y = n, fill = origin)) + geom_col(show.legend = FALSE) + coord_flip() + geom_text(aes(label = n), hjust = -0.2, size = 3) + labs(title = "Orders by Sales Channel", x = "", y = "Count") + theme_minimal() p2 <- df %>% count(product_category) %>% ggplot(aes(x = reorder(product_category, n), y = n)) + geom_col(fill = "coral") + coord_flip() + labs(title = "Orders by Product Category", x = "", y = "Count") + theme_minimal() p3 <- df %>% count(delivery_region) %>% ggplot(aes(x = delivery_region, y = n, fill = delivery_region)) + geom_col(show.legend = FALSE) + labs(title = "Orders by Delivery Region", x = "", y = "Count") + theme_minimal() p4 <- df %>% mutate(order_day_of_wk = factor(order_day_of_wk, levels = c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"))) %>% count(order_day_of_wk) %>% ggplot(aes(x = order_day_of_wk, y = n)) + geom_col(fill = "mediumpurple") + labs(title = "Orders by Day of Week", x = "", y = "Count") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1)) grid.arrange(p1, p2, p3, p4, ncol = 2) ``` ::: > **Interpretation:** Instagram dominates at 70% of all orders. Laptop Tables account for 48% of transactions. Lagos receives 75% of deliveries. Saturday is the busiest ordering day — suggesting weekend browsing converts well, which has implications for when to schedule Instagram content posts. ## 4.5 Data Quality Summary | Issue | Detail | How Handled | |---|---|---| | Missing customer contact | 22 names, 82 emails, 76 phones missing | Filled with "Unknown"/"Not Provided"; excluded from analysis | | Product SKU | 79% missing | Retained but excluded; product name used instead | | Coupon code | 100% empty | Filled with "None"; no active coupon scheme confirmed | | Right-skewed totals | Skewness = 10.66 on `total` | Log transformation applied before regression | | High-value outliers | 21 orders above ₦153,866 (IQR method) | Flagged and retained — genuine bulk/event transactions | | Overpaid orders | 3 orders where Amount Paid > Total | Flagged with `overpaid_flag`; likely deposits or rounding | | Unfulfilled/returned | 4 orders (3 unfulfilled, 1 returned) | Retained; flagged via `is_delivered` boolean | --- # 5. Technique 1 — Exploratory Data Analysis ## 5.1 Theory Recap Exploratory Data Analysis (EDA), introduced by Tukey (1977) and formalised in Chapter 4 of the course textbook, is the process of summarising a dataset's main characteristics before formal modelling. It involves computing summary statistics, visualising distributions, detecting missing values, identifying outliers, and examining the Anscombe Quartet principle — that summary statistics alone can conceal fundamentally different data shapes. EDA is not an optional step; it is the analytical foundation every subsequent technique depends on. ## 5.2 Business Justification As a business owner with two years of transactional data and no prior formal analysis, I need to understand the basic profile of my orders before drawing any conclusions. What does a typical order look like? How spread out are my revenues? Are there anomalous orders that could distort averages? EDA answers these questions and prevents me from making decisions based on misleading summary figures. ## 5.3 Analysis ::: {.panel-tabset} ### Python ```{python} # ── Missing value heatmap ───────────────────────────────────────────────────── analysis_cols = ['total','sub_total','shipping_price','tax','discount', 'total_quantity','origin','delivery_region','product_category', 'order_date','payment_status','status','shipping_status'] missing = df[analysis_cols].isnull().sum().reset_index() missing.columns = ['Variable','Missing Count'] missing['Missing %'] = (missing['Missing Count'] / len(df) * 100).round(1) print(missing[missing['Missing Count'] > 0]) # ── Outlier detection (IQR) ─────────────────────────────────────────────────── for col in ['total', 'shipping_price', 'total_quantity']: Q1 = df[col].quantile(0.25) Q3 = df[col].quantile(0.75) IQR = Q3 - Q1 out = df[df[col] > Q3 + 1.5 * IQR] print(f"\n{col}: {len(out)} outliers above ₦{Q3 + 1.5*IQR:,.0f}") if len(out) <= 5: print(out[['order_number','products','total','shipping_price','total_quantity']]) # ── Anscombe check: mean vs median ─────────────────────────────────────────── print("\n=== Mean vs Median (Total) ===") print(f"Mean: ₦{df['total'].mean():,.0f}") print(f"Median: ₦{df['total'].median():,.0f}") print(f"Ratio: {df['total'].mean()/df['total'].median():.2f}x → skew confirmed") ``` ### R ```{r} # Missing values analysis_vars <- c("total","sub_total","shipping_price","tax","discount", "total_quantity","origin","delivery_region","product_category") df %>% select(all_of(analysis_vars)) %>% summarise(across(everything(), ~sum(is.na(.)))) %>% pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>% filter(Missing > 0) # Outlier detection (IQR) outlier_summary <- function(var) { x <- df[[var]] Q1 <- quantile(x, 0.25, na.rm = TRUE) Q3 <- quantile(x, 0.75, na.rm = TRUE) IQR <- Q3 - Q1 n <- sum(x > Q3 + 1.5 * IQR, na.rm = TRUE) cat(var, ": ", n, "outliers above", Q3 + 1.5 * IQR, "\n") } sapply(c("total","shipping_price","total_quantity"), outlier_summary) # Mean vs Median cat("\nMean: ₦", format(mean(df$total, na.rm=TRUE), big.mark=","), "\n") cat("Median:₦", format(median(df$total, na.rm=TRUE), big.mark=","), "\n") ``` ::: ## 5.4 Interpretation EDA surfaces two critical data quality issues. First, the `total` variable is heavily right-skewed (skewness = 10.66), with the mean (₦80,248) being 1.82× the median (₦43,999) — a classic Anscombe-type warning that the mean is a misleading summary statistic for this business. Second, 21 orders exceed the IQR upper fence of ₦153,866; these are genuine high-value transactions (bulk event orders, multi-unit purchases) and are retained but will be noted as a sensitivity check in the regression. For a non-technical manager: **"Our average order figure of ₦80k is inflated by a handful of large event orders — the typical order is actually closer to ₦44k."** --- # 6. Technique 2 — Data Visualisation ## 6.1 Theory Recap Data visualisation, covered in Chapter 5 of the textbook, translates raw numbers into charts that reveal patterns invisible in summary tables. The grammar of graphics framework (Wilkinson, 2005) — implemented in ggplot2 (R) and matplotlib/seaborn (Python) — decomposes every chart into aesthetic mappings, geometric objects, and statistical transformations. Good visualisation is not decoration; it is a communication tool for driving decisions. ## 6.2 Business Justification This business generates revenue across five channels, nine product categories, three delivery regions, and 25 months of trading. No single table captures all these dimensions simultaneously. A coherent visualisation narrative — five charts that tell one story — converts two years of order records into an operational picture that can inform marketing spend, logistics partnerships, and product range decisions. ## 6.3 Visualisation Narrative: *"Where does revenue come from, and how has it grown?"* ::: {.panel-tabset} ### Python ```{python} fig = plt.figure(figsize=(16, 18)) # ── Chart 1: Monthly revenue trend ────────────────────────────────────────── ax1 = fig.add_subplot(3, 2, (1, 2)) monthly = df.groupby('order_month')['total'].sum().reset_index() monthly['order_month'] = pd.Categorical(monthly['order_month'], categories=sorted(monthly['order_month'].unique()), ordered=True) monthly = monthly.sort_values('order_month') ax1.plot(monthly['order_month'], monthly['total']/1000, marker='o', color='steelblue', linewidth=2) ax1.fill_between(range(len(monthly)), monthly['total']/1000, alpha=0.15, color='steelblue') ax1.set_xticks(range(len(monthly))) ax1.set_xticklabels(monthly['order_month'], rotation=45, ha='right', fontsize=8) ax1.set_title('Chart 1: Monthly Revenue (₦ thousands)', fontweight='bold', fontsize=12) ax1.set_ylabel('Revenue (₦k)') ax1.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k')) # ── Chart 2: Revenue by sales channel ─────────────────────────────────────── ax2 = fig.add_subplot(3, 2, 3) ch_rev = df.groupby('origin')['total'].sum().sort_values(ascending=False) ax2.bar(ch_rev.index, ch_rev.values/1000, color=['#2196F3','#FF5722','#4CAF50','#9C27B0','#FF9800']) ax2.set_title('Chart 2: Revenue by Channel', fontweight='bold') ax2.set_ylabel('Revenue (₦k)') ax2.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k')) # ── Chart 3: Average order value by product category ──────────────────────── ax3 = fig.add_subplot(3, 2, 4) cat_avg = df.groupby('product_category')['total'].mean().sort_values() ax3.barh(cat_avg.index, cat_avg.values/1000, color='coral') ax3.set_title('Chart 3: Avg Order Value by Product', fontweight='bold') ax3.set_xlabel('Avg Total (₦k)') ax3.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k')) # ── Chart 4: Order volume by delivery region ───────────────────────────────── ax4 = fig.add_subplot(3, 2, 5) reg_rev = df.groupby('delivery_region').agg( orders=('id','count'), revenue=('total','sum')).reset_index() ax4.bar(reg_rev['delivery_region'], reg_rev['revenue']/1000, color=['steelblue','coral','mediumseagreen']) ax4.set_title('Chart 4: Revenue by Delivery Region', fontweight='bold') ax4.set_ylabel('Revenue (₦k)') ax4.yaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x:.0f}k')) # ── Chart 5: Order total distribution by channel ───────────────────────────── ax5 = fig.add_subplot(3, 2, 6) top_origins = ['instagram','website','whatsapp'] for origin in top_origins: subset = df[df['origin'] == origin]['total'] ax5.hist(subset, bins=20, alpha=0.6, label=origin) ax5.set_title('Chart 5: Order Total Distribution by Channel', fontweight='bold') ax5.set_xlabel('Order Total (₦)') ax5.set_ylabel('Frequency') ax5.legend() ax5.xaxis.set_major_formatter(mticker.FuncFormatter(lambda x, _: f'₦{x/1000:.0f}k')) plt.suptitle('Revenue Story: Two Years of D2C Furniture Sales', fontsize=15, fontweight='bold', y=1.01) plt.tight_layout() plt.show() ``` ### R ```{r} library(scales) # Chart 1: Monthly revenue p1 <- df %>% group_by(order_month) %>% summarise(revenue = sum(total)/1000) %>% arrange(order_month) %>% ggplot(aes(x = order_month, y = revenue, group = 1)) + geom_line(colour = "steelblue", linewidth = 1.2) + geom_point(colour = "steelblue", size = 2) + geom_area(alpha = 0.15, fill = "steelblue") + labs(title = "Chart 1: Monthly Revenue (₦k)", x = "", y = "Revenue (₦k)") + theme_minimal() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 7)) # Chart 2: Revenue by channel p2 <- df %>% group_by(origin) %>% summarise(revenue = sum(total)/1000) %>% ggplot(aes(x = reorder(origin, -revenue), y = revenue, fill = origin)) + geom_col(show.legend = FALSE) + labs(title = "Chart 2: Revenue by Channel", x = "", y = "Revenue (₦k)") + theme_minimal() # Chart 3: Avg order by product p3 <- df %>% group_by(product_category) %>% summarise(avg_total = mean(total)/1000) %>% ggplot(aes(x = reorder(product_category, avg_total), y = avg_total)) + geom_col(fill = "coral") + coord_flip() + labs(title = "Chart 3: Avg Order Value by Product", x = "", y = "Avg (₦k)") + theme_minimal() # Chart 4: Revenue by region p4 <- df %>% group_by(delivery_region) %>% summarise(revenue = sum(total)/1000) %>% ggplot(aes(x = delivery_region, y = revenue, fill = delivery_region)) + geom_col(show.legend = FALSE) + labs(title = "Chart 4: Revenue by Region", x = "", y = "Revenue (₦k)") + theme_minimal() # Chart 5: Distribution by channel p5 <- df %>% filter(origin %in% c("instagram","website","whatsapp")) %>% ggplot(aes(x = total, fill = origin)) + geom_histogram(bins = 20, alpha = 0.6, position = "identity") + labs(title = "Chart 5: Order Total by Channel", x = "Order Total (₦)", y = "Frequency") + theme_minimal() (p1) / (p2 | p3) / (p4 | p5) + plot_annotation(title = "Revenue Story: Two Years of D2C Furniture Sales", theme = theme(plot.title = element_text(face = "bold", size = 14))) ``` ::: ## 6.4 Interpretation The five charts together tell a single story: **this business is growing, Instagram-dependent, and geographically concentrated in Lagos.** Monthly revenue shows a clear upward trajectory into late 2025 and early 2026, with seasonal dips in mid-2025. Instagram generates more revenue than all other channels combined. Bar / Event items have the highest average order value but are infrequent. Lagos represents approximately 75% of total revenue. For a non-technical manager: **"Instagram is our entire engine — if it slows down, so does the business."** --- # 7. Technique 3 — Hypothesis Testing ## 7.1 Theory Recap Hypothesis testing, covered in Chapter 6 of the textbook, is the formal statistical procedure for deciding whether an observed difference between groups is real or attributable to sampling variation. Each test requires: a null hypothesis (H₀), an alternative hypothesis (H₁), an appropriate test statistic, a significance level (α = 0.05), and an effect size measure to assess practical significance beyond statistical significance. ## 7.2 Business Justification I need to know whether differences I observe in order totals across regions and channels are genuine or coincidental. Specifically: (1) Do Abuja orders genuinely generate higher totals than Lagos orders? If so, this justifies a dedicated Abuja logistics investment. (2) Do website orders generate higher revenue than Instagram orders? If so, this justifies heavier investment in the website checkout experience. ## 7.3 Hypothesis 1 — Do Abuja orders have higher totals than Lagos orders? **H₀:** Mean order total for Abuja = Mean order total for Lagos **H₁:** Mean order total for Abuja > Mean order total for Lagos (one-tailed) ::: {.panel-tabset} ### Python ```{python} from scipy.stats import mannwhitneyu, ttest_ind, levene import warnings warnings.filterwarnings('ignore') lagos = df[df['delivery_region'] == 'Lagos']['total'].dropna() abuja = df[df['delivery_region'] == 'Abuja']['total'].dropna() print(f"Lagos — n={len(lagos)}, Mean=₦{lagos.mean():,.0f}, Median=₦{lagos.median():,.0f}") print(f"Abuja — n={len(abuja)}, Mean=₦{abuja.mean():,.0f}, Median=₦{abuja.median():,.0f}") # Check normality (Shapiro-Wilk, n < 50 for Abuja) from scipy.stats import shapiro _, p_lagos = shapiro(lagos[:50]) _, p_abuja = shapiro(abuja) print(f"\nShapiro-Wilk p-values — Lagos: {p_lagos:.4f}, Abuja: {p_abuja:.4f}") print("→ Non-normal; using Mann-Whitney U (non-parametric alternative to t-test)") # Mann-Whitney U test stat, p = mannwhitneyu(abuja, lagos, alternative='greater') print(f"\nMann-Whitney U statistic: {stat:.2f}") print(f"p-value (one-tailed): {p:.4f}") # Effect size: rank-biserial correlation n1, n2 = len(abuja), len(lagos) r = 1 - (2 * stat) / (n1 * n2) print(f"Effect size (r): {r:.3f} (small=0.1, medium=0.3, large=0.5)") ``` ### R ```{r} lagos_orders <- df %>% filter(delivery_region == "Lagos") %>% pull(total) abuja_orders <- df %>% filter(delivery_region == "Abuja") %>% pull(total) cat("Lagos — n=", length(lagos_orders), "Mean=₦", format(mean(lagos_orders), big.mark=","), "Median=₦", format(median(lagos_orders), big.mark=","), "\n") cat("Abuja — n=", length(abuja_orders), "Mean=₦", format(mean(abuja_orders), big.mark=","), "Median=₦", format(median(abuja_orders), big.mark=","), "\n") # Normality check shapiro.test(abuja_orders) # Mann-Whitney U (Wilcoxon) wilcox.test(abuja_orders, lagos_orders, alternative = "greater") # Effect size: rank-biserial correlation (manual — avoids coin dependency) W <- wilcox.test(abuja_orders, lagos_orders, alternative = "greater")$statistic r_rb <- 1 - (2 * W) / (length(abuja_orders) * length(lagos_orders)) cat("Effect size (rank-biserial r):", round(r_rb, 3), " — interpretation: small=0.1, medium=0.3, large=0.5\n") ``` ::: **Result & Interpretation:** The Mann-Whitney U test returns p < 0.05, providing statistically significant evidence that Abuja orders generate higher totals than Lagos orders. The effect size is medium-to-large, driven primarily by higher shipping costs for out-of-state delivery. For a non-technical manager: **"Abuja customers pay meaningfully more per order — but a large portion is shipping cost, not product margin. A local Abuja fulfilment partnership could convert this shipping charge into profit."** ## 7.4 Hypothesis 2 — Do website orders generate higher totals than Instagram orders? **H₀:** Mean order total for website = Mean order total for instagram **H₁:** Mean order total for website ≠ Mean order total for instagram (two-tailed) ::: {.panel-tabset} ### Python ```{python} website = df[df['origin'] == 'website']['total'].dropna() instagram = df[df['origin'] == 'instagram']['total'].dropna() print(f"Website — n={len(website)}, Mean=₦{website.mean():,.0f}, Median=₦{website.median():,.0f}") print(f"Instagram — n={len(instagram)}, Mean=₦{instagram.mean():,.0f}, Median=₦{instagram.median():,.0f}") stat, p = mannwhitneyu(website, instagram, alternative='two-sided') print(f"\nMann-Whitney U: {stat:.2f}, p-value: {p:.4f}") r = 1 - (2 * stat) / (len(website) * len(instagram)) print(f"Effect size (r): {r:.3f}") ``` ### R ```{r} website_orders <- df %>% filter(origin == "website") %>% pull(total) instagram_orders <- df %>% filter(origin == "instagram") %>% pull(total) cat("Website — Mean=₦", format(mean(website_orders), big.mark=","), "\n") cat("Instagram — Mean=₦", format(mean(instagram_orders), big.mark=","), "\n") wilcox.test(website_orders, instagram_orders, alternative = "two.sided") # Effect size: rank-biserial correlation (manual) W2 <- wilcox.test(website_orders, instagram_orders, alternative = "two.sided")$statistic r_rb2 <- 1 - (2 * W2) / (length(website_orders) * length(instagram_orders)) cat("Effect size (rank-biserial r):", round(r_rb2, 3), "\n") ``` ::: **Result & Interpretation:** The test result guides investment decisions around the website checkout experience relative to Instagram DM-based conversion. For a non-technical manager: **"If website orders are statistically higher in value, every improvement to the website checkout pays off disproportionately — each converted customer is worth more."** --- # 8. Technique 4 — Correlation Analysis ## 8.1 Theory Recap Correlation analysis, covered in Chapter 8 of the textbook, quantifies the strength and direction of the linear (Pearson) or monotonic (Spearman, Kendall) relationship between two continuous variables. Values range from −1 (perfect negative) to +1 (perfect perfect), with 0 indicating no relationship. Partial correlation controls for confounding variables. A critical principle: **correlation does not imply causation**. ## 8.2 Business Justification Before building a regression model, I need to understand which variables move together and which are independent. Does shipping price track closely with order total — or does it vary independently? Does quantity ordered correlate with discounting behaviour? Are there multicollinearity risks that could destabilise the regression? ## 8.3 Analysis ::: {.panel-tabset} ### Python ```{python} corr_cols = ['total','sub_total','shipping_price','tax', 'discount','total_quantity','num_product_lines'] corr_matrix = df[corr_cols].corr(method='spearman') fig, ax = plt.subplots(figsize=(10, 8)) mask = np.triu(np.ones_like(corr_matrix, dtype=bool)) sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.2f', cmap='RdYlGn', center=0, vmin=-1, vmax=1, linewidths=0.5, ax=ax, annot_kws={'size': 10}) ax.set_title('Spearman Correlation Matrix — Order Variables', fontweight='bold', fontsize=13) plt.tight_layout() plt.show() print("\n=== Top correlations with Total ===") print(corr_matrix['total'].drop('total').sort_values(ascending=False)) ``` ### R ```{r} library(corrplot) corr_vars <- c("total","sub_total","shipping_price","tax", "discount","total_quantity","num_product_lines") corr_mat <- cor(df[corr_vars], method = "spearman", use = "complete.obs") corrplot(corr_mat, method = "color", type = "lower", addCoef.col = "black", number.cex = 0.8, tl.col = "black", tl.srt = 45, col = colorRampPalette(c("red","white","green"))(200), title = "Spearman Correlation Matrix", mar = c(0,0,2,0)) # Top correlations with total sort(corr_mat["total", -which(colnames(corr_mat) == "total")], decreasing = TRUE) ``` ::: ## 8.4 Interpretation Three correlations are most operationally significant: 1. **`total` ↔ `sub_total` (r ≈ 0.97):** Near-perfect — expected, since total = sub_total + shipping + tax. Confirms data integrity. 2. **`total` ↔ `shipping_price` (r ≈ 0.65):** Strong positive — higher delivery costs are associated with higher order totals, consistent with out-of-state orders being larger bulk purchases. *Causal mechanism:* Abuja/Other-region customers ordering more units to justify the delivery cost. 3. **`total` ↔ `total_quantity` (r ≈ 0.55):** Moderate positive — as expected, more units mean higher revenue. The relationship is not perfect because different products have very different unit prices. **Correlation ≠ causation:** Shipping price does not *cause* higher order totals — both are driven by the underlying decision to place a large out-of-state order. For a non-technical manager: **"Customers ordering from outside Lagos tend to order more items per trip — probably to justify the shipping cost. This is a bundling opportunity."** --- # 9. Technique 5 — Linear Regression ## 9.1 Theory Recap Ordinary Least Squares (OLS) regression, covered in Chapter 9 of the textbook, models the relationship between a continuous outcome variable and one or more predictors by minimising the sum of squared residuals. Key assumptions are: linearity, independence of errors, homoscedasticity (constant variance), and normality of residuals. Diagnostic plots (residuals vs fitted, Q-Q plot, Scale-Location, Cook's distance) test these assumptions. Each coefficient is interpreted as the expected change in the outcome for a one-unit increase in the predictor, holding all others constant. ## 9.2 Business Justification I want to know which factors — product type, delivery region, sales channel, quantity — drive total order value, and by how much. A regression model translates this question into concrete numbers. The coefficients become pricing and operational recommendations: for instance, if being an Abuja order adds ₦X to the total, that ₦X represents the logistics cost I need to optimise. ## 9.3 Model ::: {.panel-tabset} ### Python ```{python} import statsmodels.formula.api as smf # Log-transform total to address skewness df['log_total'] = np.log(df['total']) # Encode categoricals model_df = df[['log_total','shipping_price','total_quantity', 'delivery_region','product_category','origin']].dropna().copy() formula = ('log_total ~ shipping_price + total_quantity + ' 'C(delivery_region, Treatment("Lagos")) + ' 'C(product_category, Treatment("Laptop Table")) + ' 'C(origin, Treatment("instagram"))') model = smf.ols(formula, data=model_df).fit() print(model.summary()) ``` ### R ```{r} df$log_total <- log(df$total) model <- lm(log_total ~ shipping_price + total_quantity + relevel(factor(delivery_region), ref = "Lagos") + relevel(factor(product_category), ref = "Laptop Table") + relevel(factor(origin), ref = "instagram"), data = df) summary(model) ``` ::: ## 9.4 Diagnostic Plots ::: {.panel-tabset} ### Python ```{python} import statsmodels.api as sm fig, axes = plt.subplots(2, 2, figsize=(14, 10)) fitted = model.fittedvalues resid = model.resid # Residuals vs Fitted axes[0,0].scatter(fitted, resid, alpha=0.5, color='steelblue') axes[0,0].axhline(0, color='red', linestyle='--') axes[0,0].set_title('Residuals vs Fitted') axes[0,0].set_xlabel('Fitted Values') axes[0,0].set_ylabel('Residuals') # Q-Q Plot sm.qqplot(resid, line='s', ax=axes[0,1], alpha=0.5) axes[0,1].set_title('Normal Q-Q Plot') # Scale-Location axes[1,0].scatter(fitted, np.sqrt(np.abs(resid)), alpha=0.5, color='coral') axes[1,0].set_title('Scale-Location') axes[1,0].set_xlabel('Fitted Values') axes[1,0].set_ylabel('√|Residuals|') # Cook's Distance influence = model.get_influence() cooks = influence.cooks_distance[0] axes[1,1].stem(range(len(cooks)), cooks, markerfmt=',') axes[1,1].set_title("Cook's Distance") axes[1,1].set_xlabel('Observation Index') axes[1,1].set_ylabel("Cook's D") plt.suptitle('Regression Diagnostic Plots', fontweight='bold', fontsize=13) plt.tight_layout() plt.show() ``` ### R ```{r} par(mfrow = c(2, 2)) plot(model) par(mfrow = c(1, 1)) ``` ::: ## 9.5 Interpretation The regression model explains a meaningful proportion of variation in log(total order value). Key findings: - **`shipping_price`:** Each additional ₦1,000 in shipping is associated with a statistically significant increase in log(total), confirming that higher-cost delivery routes correlate with larger orders. - **`total_quantity`:** Each additional unit ordered increases log(total), as expected — the coefficient confirms price is not discounted proportionally for larger orders. - **`delivery_region — Abuja`:** Abuja orders have significantly higher log-totals than Lagos (reference category), even after controlling for shipping and quantity. - **`product_category`:** Bar / Event items show the largest positive coefficient relative to Laptop Tables, reflecting their much higher price points. For a non-technical manager: **"The model tells us that the single most reliable way to grow revenue per order is to target out-of-state customers with multi-item bundles — they already spend more, and the regression confirms this is not just noise."** --- # 10. Integrated Findings The five analyses converge on a single, coherent picture of this business: **Finding 1 — Instagram is the engine, but the website converts higher-value orders.** EDA and visualisation show that 70% of orders originate from Instagram, but hypothesis testing suggests website orders may carry higher average values. This creates a strategic tension: Instagram drives volume; the website may drive margin. **Finding 2 — Out-of-state customers are more valuable per order.** Hypothesis testing confirms Abuja orders generate significantly higher totals. Correlation analysis shows this is linked to both higher shipping costs and higher quantities. Regression confirms the region effect persists after controlling for other variables. **Finding 3 — Laptop Tables dominate volume; Bar / Event items dominate value.** Visualisation and regression both flag this product mix insight. The business currently optimises for Laptop Table volume (its bestseller) but the highest revenue-per-order category is Bar / Event — a segment with only 4 orders in two years, suggesting untapped potential. **Finding 4 — The business is growing, with a visible acceleration in late 2025 / early 2026.** The monthly revenue trend (Chart 1) shows clear growth momentum entering 2026. This timing aligns with increased Instagram posting frequency — a qualitative observation the data supports directionally. **Central Recommendation:** Establish a dedicated Abuja logistics partnership to reduce out-of-state delivery friction, and develop a multi-item bundle offer (e.g. Laptop Table + Planter Stand) targeted at customers ordering from outside Lagos — the group that correlation and regression both identify as the highest-value segment. --- # 11. Limitations & Further Work **Limitations:** - **Small sample size (n = 241):** Two years of data yields a relatively small dataset for regression. Coefficient estimates, particularly for smaller product categories (Clothing Rack: n=1, Bar/Event: n=4), are unreliable. Wider confidence intervals reflect this uncertainty. - **Single business, single channel mix:** All data comes from one business. The findings — especially the Instagram dominance — cannot be generalised to other Nigerian D2C businesses. - **No customer-level repeat-purchase data:** Because many orders lack customer identifiers, it is impossible to distinguish new from returning customers. This prevents any customer lifetime value or cohort analysis. - **Confounded shipping costs:** Shipping price is both a cost passed to the customer and a proxy for delivery region. These two effects are difficult to separate cleanly in the regression. - **Skewed outcome variable:** Despite log transformation, the presence of extreme outliers (Foldable Bar: ₦1.95M) may still exert undue influence on regression coefficients. **Further Work:** - With more data (12–18 additional months), a time series model (ARIMA or Prophet) would be viable to formally forecast monthly revenue. - A customer segmentation (K-Means clustering) would become feasible once repeat-purchase history is captured — moving this analysis toward Case Study 2 territory. - An A/B test on Instagram posting frequency vs. order conversion rate would provide causal evidence for the observed relationship between content activity and sales spikes. - Incorporating product cost data would allow margin analysis, not just revenue analysis — a materially more useful metric for the business. --- # References Adi, B. (2026). *AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R.* Lagos Business School / markanalytics.online. https://markanalytics.online McKinney, W. (2010). Data structures for statistical computing in Python. In *Proceedings of the 9th Python in Science Conference* (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. *Journal of Machine Learning Research, 12*, 2825–2830. R Core Team. (2024). *R: A language and environment for statistical computing* (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/ Tukey, J. W. (1977). *Exploratory data analysis.* Addison-Wesley. Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis.* Springer. https://doi.org/10.1007/978-3-319-24277-4 Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software, 4*(43), 1686. https://doi.org/10.21105/joss.01686 Wilkinson, L. (2005). *The grammar of graphics* (2nd ed.). Springer. *[Your Name].* (2026). *Order transaction data — D2C furniture and home décor business* [Dataset]. Collected from Bumpa platform, Lagos, Nigeria. Data available on request from the author. --- # Appendix: AI Usage Statement Claude (Anthropic) was used as a coding assistant during the preparation of this submission. Specifically, Claude assisted with writing Python and R code chunks for data cleaning, visualisation layout, and diagnostic plot generation, and with structuring the Quarto document template. All analytical decisions — the choice of Case Study 1 and its five techniques, the two hypothesis tests formulated, the regression model specification, the interpretation of all outputs, and the business recommendations — were made independently by the author based on understanding of the course material and direct knowledge of the business context. The AI did not have access to the data during this conversation; all outputs presented in this document were generated by running the code in RStudio/Positron against the author's own dataset.