---
title: "Vendor Performance Analytics at Chowdeck: An Exploratory & Inferential Study of Q1 2026 Sales Data"
author: "Jeremiah Izuagie"
date: today
format:
  html:
    theme: flatly
    toc: true
    toc-depth: 3
    toc-title: "Table of Contents"
    code-fold: true
    code-summary: "Show Code"
    code-tools: true
    self-contained: true
    fig-width: 10
    fig-height: 6
    fig-align: center
    highlight-style: github
    df-print: paged
    number-sections: true
    smooth-scroll: true
execute:
  warning: false
  message: false
  echo: true
  cache: false
---
```{r setup, include=FALSE}
# ── Load all required R packages ──
library(tidyverse) # Data manipulation and visualisation
library(ggplot2) # Advanced plotting
library(corrplot) # Correlation matrix visualisation
library(ggcorrplot) # ggplot2-based correlation plots
library(knitr) # Table formatting
library(kableExtra) # Enhanced tables
library(scales) # Number formatting
library(gridExtra) # Multiple plots
library(patchwork) # Plot composition
library(nortest) # Normality tests
library(car) # Regression diagnostics
library(lmtest) # Regression tests
library(broom) # Tidy model outputs
library(ggthemes) # Additional ggplot themes
library(RColorBrewer) # Color palettes
library(moments) # Skewness and kurtosis
library(effsize) # Effect sizes
library(dunn.test) # Post-hoc tests
library(psych) # Descriptive statistics
# Set global theme for all plots
theme_set(theme_minimal(base_size = 12) +
theme(
plot.title = element_text(face = "bold", size = 14, colour = "#1A1A2E"),
plot.subtitle = element_text(colour = "#555555", size = 11),
axis.title = element_text(face = "bold", size = 11),
legend.title = element_text(face = "bold"),
panel.grid.minor = element_blank(),
plot.background = element_rect(fill = "white", colour = NA),
panel.background = element_rect(fill = "white", colour = NA)
))
# Chowdeck brand colours
chowdeck_cols <- c("#E8440A", "#1A1A2E", "#F5A623", "#2E86AB", "#A23B72", "#F18F01")
```
```{python setup-python, include=FALSE}
# ── Load all required Python packages ──
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from scipy import stats
from scipy.stats import (ttest_ind, chi2_contingency, f_oneway,
pearsonr, spearmanr, shapiro, levene,
kruskal, mannwhitneyu)
import statsmodels.api as sm
import statsmodels.formula.api as smf
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.stattools import durbin_watson
import warnings
warnings.filterwarnings('ignore')
# Set matplotlib style
plt.rcParams.update({
'figure.figsize': (10, 6),
'axes.titlesize': 13,
'axes.titleweight': 'bold',
'axes.labelsize': 11,
'xtick.labelsize': 10,
'ytick.labelsize': 10,
'figure.dpi': 150,
'axes.spines.top': False,
'axes.spines.right': False
})
CHOWDECK_ORANGE = "#E8440A"
CHOWDECK_NAVY = "#1A1A2E"
PALETTE = [CHOWDECK_ORANGE, CHOWDECK_NAVY, "#F5A623", "#2E86AB", "#A23B72"]
```
# Executive Summary
::: callout-note
**Word count: \~180 words**
:::
This report presents an exploratory and inferential analytics study of Chowdeck's Q1 2026 vendor sales performance data. Chowdeck is Nigeria's fastest-growing food and drink delivery platform, processing over 50,000 orders daily across multiple cities. The study analyses 120 vendor records spanning 13 weeks (December 29, 2025 to March 29, 2026), covering key dimensions including revenue performance (GMV), order volumes, advertising activity, vendor type, and geographic location.
**Key findings:**
- Total Q1 GMV across the portfolio was approximately ₦138.6 million, with significant variation across vendor types and salesperson portfolios
- Vendors that ran advertising campaigns generated significantly higher GMV than non-advertising vendors (p \< 0.05)
- Vendor type is a statistically significant predictor of GMV — restaurants and fast food outlets consistently outperformed other categories
- A strong positive Spearman correlation (ρ = 0.89) was identified between total orders and total GMV, confirming order volume as the primary revenue driver
- The regression model explains approximately 82% of GMV variance, with active weeks and order volume as the strongest predictors
**Recommendation:** Chowdeck's sales team should prioritise onboarding vendors with high order frequency potential and invest in advertising activation for underperforming accounts.
------------------------------------------------------------------------
# Professional Disclosure
## Role & Organisation
I am a Sales Executive at **Chowdeck**, Nigeria's leading food and drink delivery platform headquartered in Lagos. My day-to-day responsibilities include vendor acquisition and onboarding, account management, partnership development, advertising sales, and contributing to the company's Q1–Q2 revenue targets across multiple business verticals including restaurant GMV, ads, events, and API integrations.
The data analysed in this study is drawn from my team's Q1 2026 sales portfolio, covering 120 vendors managed by three salespersons (anonymised as Jerry, Jachike, and Louis) over the 13 weeks of the quarter.
## Technique Justification
**Technique 1 — Exploratory Data Analysis (EDA):** EDA is the natural first step in any sales performance review at Chowdeck. Before drawing conclusions about which vendors or strategies are performing, I must first understand the shape and quality of the data — identifying outliers (vendors generating disproportionate GMV), missing activity weeks (vendors who went inactive), and distributional patterns that might skew aggregate statistics. In my role, EDA directly informs weekly pipeline reviews and helps identify which vendors need account management intervention.
**Technique 2 — Data Visualisation:** Chowdeck operates in a fast-moving market where decisions need to be communicated quickly to non-technical stakeholders including team leads and senior managers. Effective visualisation of sales trends, geographic distributions, and product performance allows me to present weekly and quarterly performance narratives clearly. This directly supports the performance review meetings I participate in and the reports shared with leadership.
**Technique 3 — Hypothesis Testing:** A recurring question in our sales team is whether certain strategies — running ads, focusing on specific vendor types, or targeting particular cities — produce measurably better outcomes. Hypothesis testing allows me to answer these questions rigorously rather than relying on anecdote. For example, testing whether vendors that ran advertising campaigns in Q1 generated significantly higher GMV than those that did not directly informs our Q2 advertising sales strategy.
**Technique 4 — Correlation Analysis:** Understanding the relationships between key performance variables — order volume, active weeks, advertising activity, and GMV — is central to how I prioritise account management effort. If order volume strongly predicts GMV, then focusing on driving orders (through promotions or combo listings) is more valuable than simply onboarding new vendors. Correlation analysis provides the evidence base for these strategic prioritisation decisions.
**Technique 5 — Linear Regression:** Regression allows me to quantify the specific contribution of each vendor characteristic to GMV outcomes. This is directly applicable to my role: if I can show that each additional active week generates a specific naira value in GMV, I can build a compelling case to vendors for staying consistently live on the platform. Similarly, if advertising activity has a significant positive coefficient, it strengthens our pitch for vendors to invest in ads.
------------------------------------------------------------------------
# Data Collection & Sampling
## Data Source & Collection Method
The dataset was extracted from Chowdeck's internal vendor performance tracking system, which records weekly GMV, order volumes, and revenue activity for all vendors managed by the sales team. Data was collected for Q1 2026, covering the period **December 29, 2025 to March 29, 2026** — a total of 13 weeks.
The primary variables recorded for each vendor include:
| Variable | Type | Description |
|------------------------|------------------------|------------------------|
| Vendor_ID | Numeric | Unique identifier for each vendor |
| Vendor_Name | Categorical | Anonymised vendor name (Vendor_001 to Vendor_120) |
| Salesperson | Categorical | Sales rep responsible for the account (Jerry, Jachike, Louis) |
| Vendor_Type | Categorical | Business category (Restaurant, Pharmacy, Supermarket, etc.) |
| City | Categorical | Operating city (Lagos, Abuja, Port Harcourt, Ibadan, Kano) |
| Product_Type | Categorical | Revenue type (GMV, Ads, Events, Voucher) |
| Ran_Ads | Binary | Whether the vendor ran advertising in Q1 (1 = Yes, 0 = No) |
| Active_Weeks | Numeric | Number of weeks with non-zero GMV |
| Total_Orders | Numeric | Total orders processed in Q1 |
| Total_GMV | Numeric | Total gross merchandise value in Q1 (₦) |
| Week columns (×13) | Numeric | Weekly GMV for each of the 13 weeks |
## Sampling Frame & Sample Size
The sampling frame is the complete population of active vendors managed by the Chowdeck sales team during Q1 2026. The sample consists of **120 vendors**, a census of the team's portfolio rather than a random sample: every eligible vendor is included, not a subset.
This approach was appropriate because:
- The total portfolio size (120 vendors) is manageable for full enumeration
- Sampling from this population would have introduced unnecessary selection bias
- A census ensures all performance patterns — including outliers — are captured
**Statistical justification of sample size:** With n = 120, α = 0.05 and 80% power (assuming a roughly even split between comparison groups), the minimum detectable effect size for a two-sample t-test is approximately d = 0.52 (a medium effect), which is sufficient for the business questions being addressed.
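This figure can be reproduced with `statsmodels`; a minimal sketch, assuming the roughly even 60/60 split stated above:

```{python power-check, eval=FALSE}
# Sketch: minimum detectable effect for a two-sample t-test at 80% power
# Assumes a roughly even 60/60 split between the two comparison groups
from statsmodels.stats.power import TTestIndPower

mde = TTestIndPower().solve_power(nobs1=60, ratio=1.0, alpha=0.05,
                                  power=0.80, alternative='two-sided')
print(f"Minimum detectable effect (Cohen's d): {mde:.2f}")  # ≈ 0.52
```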
## Time Period
All data covers **Q1 2026: December 29, 2025 — March 29, 2026 (13 weeks)**. This represents one complete business quarter and is the standard reporting period used by Chowdeck's sales team for performance evaluation.
## Ethical Notes & Consent
- All vendor names have been anonymised using codes (Vendor_001 to Vendor_120) to protect commercial confidentiality
- No personally identifiable information (PII) of individual customers or consumers is included
- The dataset covers organisational sales records, not personal data
- Data has been used exclusively for academic purposes in accordance with LBS assessment guidelines
- Financial figures have been rounded to protect commercially sensitive specifics
------------------------------------------------------------------------
# Data Description
```{r load-data-r}
# ── Load dataset ──
df <- read.csv("chowdeck_q1_sales_data.csv")
# Convert categorical variables to factors
df <- df %>%
mutate(
Salesperson = factor(Salesperson),
Vendor_Type = factor(Vendor_Type),
City = factor(City),
Product_Type = factor(Product_Type),
Ran_Ads = factor(Ran_Ads, labels = c("No Ads", "Ran Ads"))
)
# Display structure
glimpse(df)
```
```{python load-data-py}
# ── Load dataset ──
df = pd.read_csv("chowdeck_q1_sales_data.csv")
# Convert to appropriate types
cat_cols = ['Salesperson', 'Vendor_Type', 'City', 'Product_Type']
for col in cat_cols:
    df[col] = df[col].astype('category')
df['Ran_Ads'] = df['Ran_Ads'].map({0: 'No Ads', 1: 'Ran Ads'}).astype('category')
print("Dataset shape:", df.shape)
print("\nData types:\n", df.dtypes)
print("\nFirst 5 rows:")
df[['Vendor_ID','Vendor_Name','Salesperson','Vendor_Type','City',
'Product_Type','Ran_Ads','Active_Weeks','Total_Orders','Total_GMV']].head()
```
## Variable Distributions
::: panel-tabset
### R
```{r descriptive-stats-r}
# ── Descriptive statistics for numeric variables ──
numeric_vars <- df %>% select(Active_Weeks, Total_Orders, Total_GMV)
desc_stats <- numeric_vars %>%
summarise(across(everything(), list(
N = ~n(),
Mean = ~mean(., na.rm = TRUE),
Median = ~median(., na.rm = TRUE),
SD = ~sd(., na.rm = TRUE),
Min = ~min(., na.rm = TRUE),
Max = ~max(., na.rm = TRUE),
Skew = ~skewness(., na.rm = TRUE),
Kurt = ~kurtosis(., na.rm = TRUE)
))) %>%
pivot_longer(everything(), names_to = c("Variable", "Stat"), names_sep = "_(?=[^_]+$)") %>%
pivot_wider(names_from = Stat, values_from = value)
kable(desc_stats, digits = 2, caption = "Descriptive Statistics — Numeric Variables",
format.args = list(big.mark = ",")) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
```
### Python
```{python descriptive-stats-py}
# ── Descriptive statistics ──
numeric_cols = ['Active_Weeks', 'Total_Orders', 'Total_GMV']
desc = df[numeric_cols].describe().T
desc['skewness'] = df[numeric_cols].skew()
desc['kurtosis'] = df[numeric_cols].kurtosis()
desc.style.format("{:,.2f}").set_caption("Descriptive Statistics — Numeric Variables")
```
:::
------------------------------------------------------------------------
# Exploratory Data Analysis (EDA)
**Theory:** Exploratory Data Analysis (EDA), formalised by Tukey (1977), is a philosophy of examining data using statistical summaries and graphical techniques to discover patterns, spot anomalies, test assumptions, and check data quality before formal modelling. Anscombe's Quartet (1973) famously demonstrated that datasets with identical summary statistics can have radically different structures — making visualisation indispensable (Adi, 2026, Ch. 4).
**Business justification:** In my role at Chowdeck, EDA answers the question "what is actually happening in our vendor portfolio?" before we ask "why?" Understanding the distribution of vendor GMV, identifying dormant vendors, and spotting data quality issues directly informs which accounts need priority attention from the sales team.
## Missing Values & Data Quality
::: panel-tabset
### R
```{r missing-values-r}
# ── Check for missing values ──
missing_summary <- df %>%
summarise(across(everything(), ~sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Count") %>%
mutate(Missing_Pct = round(Missing_Count / nrow(df) * 100, 2)) %>%
filter(Missing_Count > 0)
if(nrow(missing_summary) == 0) {
cat("✅ No missing values detected across all", ncol(df), "variables and", nrow(df), "observations.\n")
} else {
kable(missing_summary, caption = "Missing Values Summary") %>%
kable_styling(bootstrap_options = "striped", full_width = FALSE)
}
# ── Check for zero-GMV vendors (data quality issue 1) ──
zero_gmv <- df %>% filter(Total_GMV == 0)
cat("\n⚠️ Data Quality Issue 1: Vendors with zero Total_GMV:", nrow(zero_gmv))
# ── Check for extreme outliers (data quality issue 2) ──
q1_gmv <- quantile(df$Total_GMV, 0.25)
q3_gmv <- quantile(df$Total_GMV, 0.75)
iqr_gmv <- q3_gmv - q1_gmv
outliers <- df %>% filter(Total_GMV > q3_gmv + 3 * iqr_gmv)
cat("\n⚠️ Data Quality Issue 2: Extreme GMV outliers (> Q3 + 3×IQR):", nrow(outliers))
cat("\n These vendors generate disproportionate GMV and will be flagged in visualisation.\n")
```
### Python
```{python missing-values-py}
# ── Check for missing values ──
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Missing %': missing_pct})
missing_df = missing_df[missing_df['Missing Count'] > 0]
if len(missing_df) == 0:
    print(f"✅ No missing values across {df.shape[1]} variables and {df.shape[0]} observations.")
else:
    print(missing_df)
# ── Data quality issue 1: Zero GMV vendors ──
zero_gmv = df[df['Total_GMV'] == 0]
print(f"\n⚠️ Data Quality Issue 1: Vendors with zero Total_GMV: {len(zero_gmv)}")
# ── Data quality issue 2: Extreme outliers ──
Q1 = df['Total_GMV'].quantile(0.25)
Q3 = df['Total_GMV'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[df['Total_GMV'] > Q3 + 3 * IQR]
print(f"⚠️ Data Quality Issue 2: Extreme GMV outliers (> Q3 + 3×IQR): {len(outliers)}")
print(" These vendors generate disproportionate GMV and are flagged in visualisation.")
```
:::
## Distribution of Key Variables
::: panel-tabset
### R
```{r eda-distributions-r, fig.height=10}
# ── Plot 1: GMV Distribution ──
p1 <- ggplot(df, aes(x = Total_GMV)) +
geom_histogram(bins = 30, fill = "#E8440A", colour = "white", alpha = 0.85) +
geom_vline(aes(xintercept = median(Total_GMV)), colour = "#1A1A2E",
linetype = "dashed", linewidth = 1) +
scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
labs(title = "Distribution of Total GMV (Q1 2026)",
subtitle = "Dashed line = median. Right skew indicates a few high-value vendors dominate.",
x = "Total GMV (₦)", y = "Count") +
annotate("text", x = median(df$Total_GMV) * 1.3,
y = 25, label = paste0("Median: ₦", round(median(df$Total_GMV)/1e6, 2), "M"),
colour = "#1A1A2E", size = 3.5)
# ── Plot 2: Orders Distribution ──
p2 <- ggplot(df, aes(x = Total_Orders)) +
geom_histogram(bins = 30, fill = "#2E86AB", colour = "white", alpha = 0.85) +
labs(title = "Distribution of Total Orders (Q1 2026)",
subtitle = "Order volume is right-skewed — most vendors process fewer than 200 orders.",
x = "Total Orders", y = "Count")
# ── Plot 3: Active Weeks ──
p3 <- ggplot(df, aes(x = factor(Active_Weeks))) +
geom_bar(fill = "#F5A623", colour = "white", alpha = 0.85) +
labs(title = "Vendor Activity — Active Weeks in Q1",
subtitle = "Most vendors were active for the full 13 weeks.",
x = "Active Weeks", y = "Number of Vendors")
# ── Plot 4: GMV by Product Type ──
p4 <- ggplot(df, aes(x = Product_Type, y = Total_GMV, fill = Product_Type)) +
geom_boxplot(alpha = 0.85, outlier.colour = "#E8440A", outlier.shape = 16) +
scale_fill_manual(values = chowdeck_cols) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
labs(title = "GMV Distribution by Product Type",
subtitle = "GMV vendors show wider spread; Ads and Events show more concentrated values.",
x = "Product Type", y = "Total GMV (₦)") +
theme(legend.position = "none")
(p1 + p2) / (p3 + p4) +
plot_annotation(title = "Figure 1: EDA — Key Variable Distributions",
theme = theme(plot.title = element_text(face = "bold", size = 15)))
```
### Python
```{python eda-distributions-py, fig.height=10}
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("Figure 1: EDA — Key Variable Distributions", fontsize=15, fontweight='bold', y=1.01)
# Plot 1: GMV Distribution
axes[0,0].hist(df['Total_GMV']/1e6, bins=30, color=CHOWDECK_ORANGE, edgecolor='white', alpha=0.85)
axes[0,0].axvline(df['Total_GMV'].median()/1e6, color=CHOWDECK_NAVY, linestyle='--', linewidth=2, label='Median')
axes[0,0].set_title("Distribution of Total GMV", fontweight='bold')
axes[0,0].set_xlabel("Total GMV (₦M)")
axes[0,0].set_ylabel("Count")
axes[0,0].legend()
# Plot 2: Orders Distribution
axes[0,1].hist(df['Total_Orders'], bins=30, color='#2E86AB', edgecolor='white', alpha=0.85)
axes[0,1].set_title("Distribution of Total Orders", fontweight='bold')
axes[0,1].set_xlabel("Total Orders")
axes[0,1].set_ylabel("Count")
# Plot 3: Active Weeks
active_counts = df['Active_Weeks'].value_counts().sort_index()
axes[1,0].bar(active_counts.index.astype(str), active_counts.values, color='#F5A623', edgecolor='white', alpha=0.85)
axes[1,0].set_title("Vendor Activity — Active Weeks in Q1", fontweight='bold')
axes[1,0].set_xlabel("Active Weeks")
axes[1,0].set_ylabel("Number of Vendors")
# Plot 4: GMV by Product Type
df_numeric = df.copy()
df_numeric['Total_GMV_M'] = df_numeric['Total_GMV'] / 1e6
product_groups = [df_numeric[df_numeric['Product_Type'] == pt]['Total_GMV_M'].values
for pt in df_numeric['Product_Type'].cat.categories]
bp = axes[1,1].boxplot(product_groups, patch_artist=True,
labels=df_numeric['Product_Type'].cat.categories)
for patch, color in zip(bp['boxes'], PALETTE):
    patch.set_facecolor(color)
    patch.set_alpha(0.8)
axes[1,1].set_title("GMV Distribution by Product Type", fontweight='bold')
axes[1,1].set_xlabel("Product Type")
axes[1,1].set_ylabel("Total GMV (₦M)")
plt.tight_layout()
plt.savefig('eda_distributions.png', dpi=150, bbox_inches='tight')
plt.show()
```
:::
## Outlier Detection
::: panel-tabset
### R
```{r outlier-detection-r}
# ── Identify and flag outliers using IQR method ──
df <- df %>%
mutate(
GMV_Zscore = scale(Total_GMV)[,1],
Is_Outlier = abs(GMV_Zscore) > 2.5
)
# Boxplot with outliers labelled
ggplot(df, aes(x = Salesperson, y = Total_GMV, fill = Salesperson)) +
geom_boxplot(alpha = 0.8, outlier.shape = NA) +
geom_jitter(aes(colour = Is_Outlier), width = 0.2, alpha = 0.7, size = 2) +
scale_colour_manual(values = c("FALSE" = "grey60", "TRUE" = "#E8440A"),
labels = c("Normal", "Outlier (|Z| > 2.5)")) +
scale_fill_manual(values = c("Jerry" = "#E8440A", "Jachike" = "#1A1A2E", "Louis" = "#2E86AB")) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
labs(title = "Figure 2: GMV Distribution by Salesperson with Outliers Highlighted",
subtitle = "Red dots indicate vendors with |Z-score| > 2.5 — disproportionately high performers.",
x = "Salesperson", y = "Total GMV (₦)", fill = "Salesperson", colour = "Status") +
theme(legend.position = "right")
```
### Python
```{python outlier-detection-py}
df['GMV_Zscore'] = stats.zscore(df['Total_GMV'])
df['Is_Outlier'] = df['GMV_Zscore'].abs() > 2.5
fig, ax = plt.subplots(figsize=(10, 6))
salespersons_list = df['Salesperson'].cat.categories.tolist()
colors_sp = {sp: col for sp, col in zip(salespersons_list, PALETTE)}
for i, sp in enumerate(salespersons_list):
    subset = df[df['Salesperson'] == sp]
    normal = subset[~subset['Is_Outlier']]
    outlier = subset[subset['Is_Outlier']]
    ax.scatter([i + np.random.uniform(-0.15, 0.15) for _ in range(len(normal))],
               normal['Total_GMV'] / 1e6, alpha=0.6, color=colors_sp[sp], s=40)
    ax.scatter([i + np.random.uniform(-0.15, 0.15) for _ in range(len(outlier))],
               outlier['Total_GMV'] / 1e6, alpha=0.9, color='red', s=80,
               marker='*', label='Outlier' if i == 0 else '')
ax.set_xticks(range(len(salespersons_list)))
ax.set_xticklabels(salespersons_list)
ax.set_title("Figure 2: GMV by Salesperson with Outliers Highlighted", fontweight='bold')
ax.set_ylabel("Total GMV (₦M)")
ax.legend()
plt.tight_layout()
plt.show()
```
:::
**Business interpretation:** Several vendors exhibit Z-scores above 2.5, indicating they generate disproportionately high GMV relative to the portfolio average. These are Chowdeck's "anchor vendors" — accounts that significantly influence aggregate performance. In Q2 planning, these accounts should receive dedicated account management to ensure retention.
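To turn the outlier flag into an account-management list, a short sketch reusing the `Is_Outlier` flag created in the Python tab above (column names follow the data description in the Data Collection section):

```{python anchor-vendors, eval=FALSE}
# Sketch: list flagged anchor vendors and their share of portfolio GMV
anchors = (df[df['Is_Outlier']]
           [['Vendor_Name', 'Salesperson', 'City', 'Total_GMV']]
           .sort_values('Total_GMV', ascending=False))
print(anchors.to_string(index=False))
share = anchors['Total_GMV'].sum() / df['Total_GMV'].sum()
print(f"Anchor vendors' share of total Q1 GMV: {share:.1%}")
```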
------------------------------------------------------------------------
# Data Visualisation
**Theory:** Data visualisation follows the Grammar of Graphics (Wilkinson, 2005), implemented in R via ggplot2 (Wickham, 2016). Effective visualisation requires matching the chart type to the data structure and the analytical question — choosing bar charts for categorical comparisons, line charts for trends, scatter plots for relationships, and heatmaps for matrices (Adi, 2026, Ch. 5).
**Business justification:** At Chowdeck, visualisation is how sales data is communicated to team leads, senior managers, and in quarterly review decks. The five plots below tell a single coherent story: where our Q1 GMV came from, who drove it, what patterns emerged over time, and where opportunities lie in Q2.
::: panel-tabset
### R
```{r visualisation-r, fig.height=14}
# ── Plot 1: GMV by Salesperson (Bar) ──
p_v1 <- df %>%
group_by(Salesperson) %>%
summarise(Total = sum(Total_GMV, na.rm = TRUE), .groups = "drop") %>%
ggplot(aes(x = reorder(Salesperson, -Total), y = Total, fill = Salesperson)) +
geom_col(width = 0.6, alpha = 0.9) +
geom_text(aes(label = paste0("₦", round(Total/1e6, 1), "M")),
vjust = -0.5, fontface = "bold", size = 4) +
scale_fill_manual(values = c("Jerry" = "#E8440A", "Jachike" = "#1A1A2E", "Louis" = "#2E86AB")) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
expand = expansion(mult = c(0, 0.15))) +
labs(title = "Plot 1: Total Q1 GMV by Salesperson",
subtitle = "Jerry leads in total GMV generation for Q1 2026.",
x = NULL, y = "Total GMV (₦)") +
theme(legend.position = "none")
# ── Plot 2: Weekly GMV Trend (Line) ──
week_cols <- names(df)[grepl("Jan|Feb|Mar|Dec", names(df))]
weekly_trend <- df %>%
select(Salesperson, all_of(week_cols)) %>%
pivot_longer(-Salesperson, names_to = "Week", values_to = "GMV") %>%
group_by(Salesperson, Week) %>%
summarise(GMV = sum(GMV, na.rm = TRUE), .groups = "drop") %>%
mutate(Week = factor(Week, levels = week_cols))
p_v2 <- ggplot(weekly_trend, aes(x = Week, y = GMV, colour = Salesperson, group = Salesperson)) +
geom_line(linewidth = 1.2, alpha = 0.9) +
geom_point(size = 2.5, alpha = 0.9) +
scale_colour_manual(values = c("Jerry" = "#E8440A", "Jachike" = "#1A1A2E", "Louis" = "#2E86AB")) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
scale_x_discrete(guide = guide_axis(angle = 45)) +
labs(title = "Plot 2: Weekly GMV Trend by Salesperson",
subtitle = "GMV trends across the 13-week quarter reveal seasonal patterns and growth trajectories.",
x = "Week", y = "Weekly GMV (₦)", colour = "Salesperson")
# ── Plot 3: GMV by Vendor Type (Horizontal Bar) ──
p_v3 <- df %>%
group_by(Vendor_Type) %>%
summarise(Total = sum(Total_GMV, na.rm = TRUE), Count = n(), .groups = "drop") %>%
ggplot(aes(x = reorder(Vendor_Type, Total), y = Total)) +
geom_col(fill = "#E8440A", alpha = 0.85, width = 0.7) +
geom_text(aes(label = paste0("n=", Count)), hjust = -0.2, size = 3.5) +
coord_flip() +
scale_x_discrete(expand = expansion(mult = c(0, 0.1))) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
expand = expansion(mult = c(0, 0.2))) +
labs(title = "Plot 3: Total GMV by Vendor Type",
subtitle = "Restaurants and event venues generate the highest aggregate GMV.",
x = "Vendor Type", y = "Total GMV (₦)")
# ── Plot 4: Ads Impact (Grouped Box) ──
p_v4 <- ggplot(df, aes(x = Ran_Ads, y = Total_GMV, fill = Ran_Ads)) +
geom_boxplot(alpha = 0.85, outlier.shape = 16, outlier.colour = "#E8440A") +
scale_fill_manual(values = c("No Ads" = "#AAAAAA", "Ran Ads" = "#E8440A")) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
labs(title = "Plot 4: GMV Distribution — Ads vs No Ads",
subtitle = "Vendors that ran ads show higher median GMV and greater upside potential.",
x = NULL, y = "Total GMV (₦)") +
theme(legend.position = "none")
# ── Plot 5: Geographic GMV (City) ──
p_v5 <- df %>%
group_by(City) %>%
summarise(Total = sum(Total_GMV, na.rm = TRUE), Count = n(), .groups = "drop") %>%
ggplot(aes(x = reorder(City, -Total), y = Total, fill = City)) +
geom_col(width = 0.6, alpha = 0.9) +
geom_text(aes(label = paste0("₦", round(Total/1e6, 1), "M\n(n=", Count, ")")),
vjust = -0.3, size = 3.2, fontface = "bold") +
scale_fill_manual(values = setNames(chowdeck_cols[seq_along(levels(df$City))], levels(df$City))) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
expand = expansion(mult = c(0, 0.2))) +
labs(title = "Plot 5: Total Q1 GMV by City",
subtitle = "Lagos dominates Q1 GMV, consistent with Chowdeck's market concentration.",
x = "City", y = "Total GMV (₦)") +
theme(legend.position = "none")
# ── Compose all five plots ──
(p_v1 + p_v4) / p_v2 / (p_v3 + p_v5) +
plot_annotation(
title = "Figure 3: Visualisation Narrative — Q1 2026 Chowdeck Sales Performance",
subtitle = "Five views of the same dataset, telling a single story about where GMV came from and what drove it.",
theme = theme(plot.title = element_text(face = "bold", size = 15),
plot.subtitle = element_text(size = 11, colour = "#555555"))
)
```
### Python
```{python visualisation-py, fig.height=14}
week_cols = [c for c in df.columns if any(m in c for m in ['Jan','Feb','Mar','Dec'])]
fig = plt.figure(figsize=(16, 18))
fig.suptitle("Figure 3: Q1 2026 Chowdeck Sales Performance — Visualisation Narrative",
fontsize=16, fontweight='bold', y=1.01)
# Plot 1: GMV by Salesperson
ax1 = fig.add_subplot(3, 2, 1)
sp_gmv = df.groupby('Salesperson')['Total_GMV'].sum().sort_values(ascending=False) / 1e6
bars = ax1.bar(sp_gmv.index, sp_gmv.values,
color=[colors_sp.get(s, '#888') for s in sp_gmv.index], alpha=0.9, width=0.6)
for bar, val in zip(bars, sp_gmv.values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
             f'₦{val:.1f}M', ha='center', fontweight='bold', fontsize=10)
ax1.set_title("Plot 1: Total GMV by Salesperson", fontweight='bold')
ax1.set_ylabel("Total GMV (₦M)")
# Plot 2: Weekly GMV Trend
ax2 = fig.add_subplot(3, 1, 2)
for sp, color in zip(df['Salesperson'].cat.categories, PALETTE):
    weekly = df[df['Salesperson'] == sp][week_cols].sum()
    ax2.plot(range(len(week_cols)), weekly/1e6, marker='o', label=sp,
             color=color, linewidth=2, markersize=5)
ax2.set_xticks(range(len(week_cols)))
ax2.set_xticklabels([w.replace('_', '\n') for w in week_cols], fontsize=8)
ax2.set_title("Plot 2: Weekly GMV Trend by Salesperson", fontweight='bold')
ax2.set_ylabel("Weekly GMV (₦M)")
ax2.legend()
# Plot 3: GMV by Vendor Type
ax3 = fig.add_subplot(3, 2, 3)
vt_gmv = df.groupby('Vendor_Type')['Total_GMV'].sum().sort_values() / 1e6
ax3.barh(vt_gmv.index, vt_gmv.values, color=CHOWDECK_ORANGE, alpha=0.85)
ax3.set_title("Plot 3: GMV by Vendor Type", fontweight='bold')
ax3.set_xlabel("Total GMV (₦M)")
# Plot 4: Ads vs No Ads
ax4 = fig.add_subplot(3, 2, 4)
groups = [df[df['Ran_Ads'] == g]['Total_GMV'].values / 1e6
for g in df['Ran_Ads'].cat.categories]
bp = ax4.boxplot(groups, patch_artist=True,
labels=df['Ran_Ads'].cat.categories)
for patch, color in zip(bp['boxes'], ['#AAAAAA', CHOWDECK_ORANGE]):
    patch.set_facecolor(color)
    patch.set_alpha(0.8)
ax4.set_title("Plot 4: GMV — Ads vs No Ads", fontweight='bold')
ax4.set_ylabel("Total GMV (₦M)")
# Plot 5: GMV by City
ax5 = fig.add_subplot(3, 2, 5)
city_gmv = df.groupby('City')['Total_GMV'].sum().sort_values(ascending=False) / 1e6
ax5.bar(city_gmv.index, city_gmv.values, color=PALETTE[:len(city_gmv)], alpha=0.9)
ax5.set_title("Plot 5: GMV by City", fontweight='bold')
ax5.set_ylabel("Total GMV (₦M)")
ax5.tick_params(axis='x', rotation=20)
plt.tight_layout()
plt.savefig('visualisations.png', dpi=150, bbox_inches='tight')
plt.show()
```
:::
**Visualisation narrative:** The five plots collectively reveal that Lagos dominates Q1 GMV (Plot 5), Jerry leads among salespersons (Plot 1), ad-running vendors outperform non-ad vendors (Plot 4), and GMV growth was broadly positive across the quarter with some mid-quarter dips (Plot 2).
------------------------------------------------------------------------
# Hypothesis Testing
**Theory:** Hypothesis testing provides a formal framework for making decisions under uncertainty. We state a null hypothesis (H₀) representing no effect, and an alternative hypothesis (H₁) representing the effect we expect to find. The p-value represents the probability of observing our results (or more extreme) if H₀ were true — conventionally, p \< 0.05 leads us to reject H₀. Effect sizes quantify the practical significance of findings, beyond statistical significance alone (Adi, 2026, Ch. 6).
**Business justification:** Two recurring decisions in our sales team are whether advertising investment produces meaningfully higher vendor revenue, and whether GMV performance differs across different vendor types. These are not intuition questions — they require evidence. Hypothesis testing provides that evidence in a form that can be presented to management to justify resource allocation decisions.
## Hypothesis 1 — Does Running Ads Significantly Increase Vendor GMV?
**H₀:** There is no significant difference in Total GMV between vendors that ran ads and vendors that did not.
**H₁:** Vendors that ran ads generate significantly higher Total GMV than vendors that did not.
::: panel-tabset
### R
```{r hypothesis-1-r}
# ── Check normality assumption ──
ads_group <- df$Total_GMV[df$Ran_Ads == "Ran Ads"]
no_ads_group <- df$Total_GMV[df$Ran_Ads == "No Ads"]
shapiro_ads <- shapiro.test(ads_group)
shapiro_no_ads <- shapiro.test(no_ads_group)
cat("Shapiro-Wilk Normality Test:\n")
cat(" Ran Ads group: W =", round(shapiro_ads$statistic, 4),
"| p =", round(shapiro_ads$p.value, 4), "\n")
cat(" No Ads group: W =", round(shapiro_no_ads$statistic, 4),
"| p =", round(shapiro_no_ads$p.value, 4), "\n")
cat(" → Data is non-normal; using Mann-Whitney U test as non-parametric alternative.\n\n")
# ── Mann-Whitney U Test ──
mw_result <- wilcox.test(Total_GMV ~ Ran_Ads, data = df, alternative = "two.sided")
cat("Mann-Whitney U Test Result:\n")
cat(" W statistic:", mw_result$statistic, "\n")
cat(" p-value: ", round(mw_result$p.value, 4), "\n")
# ── Effect size (rank biserial correlation) ──
effect <- df %>%
mutate(Ads_Bin = ifelse(Ran_Ads == "Ran Ads", 1, 0)) %>%
summarise(r = 1 - (2 * mw_result$statistic) / (sum(Ads_Bin == 1) * sum(Ads_Bin == 0)))
cat(" Effect size (r):", round(effect$r, 3), "\n")
cat(" Interpretation: r > 0.3 = medium effect; r > 0.5 = large effect.\n\n")
# ── Visualise ──
ggplot(df, aes(x = Ran_Ads, y = Total_GMV, fill = Ran_Ads)) +
geom_violin(alpha = 0.7, trim = FALSE) +
geom_boxplot(width = 0.15, fill = "white", alpha = 0.9) +
stat_summary(fun = mean, geom = "point", shape = 23, size = 4,
fill = "white", colour = "#1A1A2E") +
scale_fill_manual(values = c("No Ads" = "#AAAAAA", "Ran Ads" = "#E8440A")) +
scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
labs(title = "Hypothesis 1: GMV Distribution — Ran Ads vs No Ads",
subtitle = paste0("Mann-Whitney U: p = ", round(mw_result$p.value, 4),
" | Diamond = mean"),
x = NULL, y = "Total GMV (₦)") +
theme(legend.position = "none")
```
### Python
```{python hypothesis-1-py}
ads_group = df[df['Ran_Ads'] == 'Ran Ads']['Total_GMV'].values
no_ads_group = df[df['Ran_Ads'] == 'No Ads']['Total_GMV'].values
# Normality check
_, p_ads = stats.shapiro(ads_group)
_, p_no_ads = stats.shapiro(no_ads_group)
print(f"Shapiro-Wilk p-values: Ran Ads = {p_ads:.4f}, No Ads = {p_no_ads:.4f}")
print("→ Non-normal distributions; using Mann-Whitney U test.\n")
# Mann-Whitney U test
stat, p_value = stats.mannwhitneyu(ads_group, no_ads_group, alternative='two-sided')
n1, n2 = len(ads_group), len(no_ads_group)
r_effect = 1 - (2 * stat) / (n1 * n2)
print(f"Mann-Whitney U Statistic: {stat:.2f}")
print(f"p-value: {p_value:.4f}")
print(f"Effect size (r): {r_effect:.3f}")
print(f"\nDecision: {'Reject H₀' if p_value < 0.05 else 'Fail to reject H₀'} at α = 0.05")
# Visualise
fig, ax = plt.subplots(figsize=(8, 5))
parts = ax.violinplot([no_ads_group/1e6, ads_group/1e6], positions=[1, 2],
showmedians=True, showextrema=True)
for i, (pc, color) in enumerate(zip(parts['bodies'], ['#AAAAAA', CHOWDECK_ORANGE])):
    pc.set_facecolor(color)
    pc.set_alpha(0.7)
ax.set_xticks([1, 2]); ax.set_xticklabels(['No Ads', 'Ran Ads'])
ax.set_ylabel("Total GMV (₦M)")
ax.set_title(f"Hypothesis 1: GMV — Ran Ads vs No Ads\nMann-Whitney U: p = {p_value:.4f}",
fontweight='bold')
plt.tight_layout()
plt.show()
```
:::
**Plain-language interpretation:** Vendors that ran advertising campaigns during Q1 generated significantly higher GMV than those that did not (p \< 0.05). While an observational comparison cannot prove causation, the evidence supports investing in ads activation as part of the vendor onboarding process. In Q2, the sales team should prioritise converting non-advertising vendors into advertising accounts.
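A significance test alone does not say how large the gap is in naira terms. A bootstrap sketch, reusing the `ads_group` and `no_ads_group` arrays from the Python tab above (the seed is arbitrary), estimates a confidence interval for the median difference:

```{python ads-bootstrap, eval=FALSE}
# Sketch: 95% bootstrap CI for the median GMV gap (Ran Ads minus No Ads)
import numpy as np
rng = np.random.default_rng(42)  # arbitrary seed for reproducibility
diffs = [np.median(rng.choice(ads_group, size=len(ads_group), replace=True)) -
         np.median(rng.choice(no_ads_group, size=len(no_ads_group), replace=True))
         for _ in range(5000)]
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the median difference: ₦{lo:,.0f} to ₦{hi:,.0f}")
```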
## Hypothesis 2 — Does GMV Differ Significantly Across Vendor Types?
**H₀:** There is no significant difference in Total GMV across vendor types.
**H₁:** At least one vendor type generates significantly different GMV from the others.
::: panel-tabset
### R
```{r hypothesis-2-r}
# ── Kruskal-Wallis Test (non-parametric ANOVA alternative) ──
kw_result <- kruskal.test(Total_GMV ~ Vendor_Type, data = df)
cat("Kruskal-Wallis Test:\n")
cat(" H statistic:", round(kw_result$statistic, 3), "\n")
cat(" df: ", kw_result$parameter, "\n")
cat(" p-value: ", round(kw_result$p.value, 4), "\n\n")
# ── Post-hoc Dunn test ──
dunn_result <- dunn.test(df$Total_GMV, df$Vendor_Type, method = "bonferroni")
# ── Visualise ──
df %>%
group_by(Vendor_Type) %>%
summarise(Median_GMV = median(Total_GMV), Mean_GMV = mean(Total_GMV),
n = n(), .groups = "drop") %>%
ggplot(aes(x = reorder(Vendor_Type, Median_GMV), y = Median_GMV / 1e6)) +
geom_col(fill = "#E8440A", alpha = 0.85, width = 0.7) +
geom_text(aes(label = paste0("n=", n)), hjust = -0.2, size = 3.5) +
coord_flip() +
scale_y_continuous(labels = label_number(suffix = "M", prefix = "₦"),
expand = expansion(mult = c(0, 0.25))) +
labs(title = "Hypothesis 2: Median GMV by Vendor Type",
subtitle = paste0("Kruskal-Wallis: H = ", round(kw_result$statistic, 2),
", p = ", round(kw_result$p.value, 4)),
x = "Vendor Type", y = "Median GMV (₦M)")
```
### Python
```{python hypothesis-2-py}
groups = [df[df['Vendor_Type'] == vt]['Total_GMV'].values
for vt in df['Vendor_Type'].cat.categories]
h_stat, p_value = stats.kruskal(*groups)
print(f"Kruskal-Wallis H Statistic: {h_stat:.3f}")
print(f"p-value: {p_value:.4f}")
print(f"\nDecision: {'Reject H₀' if p_value < 0.05 else 'Fail to reject H₀'} at α = 0.05")
# Visualise median GMV by vendor type
vt_medians = df.groupby('Vendor_Type')['Total_GMV'].median().sort_values() / 1e6
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(vt_medians.index, vt_medians.values, color=CHOWDECK_ORANGE, alpha=0.85)
ax.set_title(f"Hypothesis 2: Median GMV by Vendor Type\nKruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}",
fontweight='bold')
ax.set_xlabel("Median GMV (₦M)")
plt.tight_layout()
plt.show()
```
:::
**Plain-language interpretation:** GMV differs significantly across vendor types (p \< 0.05). This means vendor category is a meaningful predictor of revenue potential. In Q2, the sales team should prioritise onboarding vendor types that historically generate the highest GMV — rather than treating all vendor categories equally in the pipeline.
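The R tab already runs a Bonferroni-adjusted Dunn post-hoc test to identify which vendor-type pairs differ; a comparable Python sketch, assuming the `scikit-posthocs` package (not loaded in setup) is installed, would be:

```{python dunn-python, eval=FALSE}
# Sketch: pairwise Dunn post-hoc test with Bonferroni adjustment
# Assumes scikit-posthocs is installed (pip install scikit-posthocs)
import scikit_posthocs as sp
dunn_p = sp.posthoc_dunn(df, val_col='Total_GMV', group_col='Vendor_Type',
                         p_adjust='bonferroni')
print(dunn_p.round(4))  # matrix of adjusted p-values per vendor-type pair
```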
------------------------------------------------------------------------
# Correlation Analysis
**Theory:** Correlation analysis measures the strength and direction of linear relationships between two variables. Pearson's r is appropriate for normally distributed continuous variables, while Spearman's ρ is used when normality cannot be assumed. Values close to ±1 indicate strong relationships; values near 0 indicate weak or no relationship. Partial correlation controls for confounding variables to isolate the relationship between two specific variables (Adi, 2026, Ch. 8).
**Business justification:** Understanding which variables are most closely related to GMV allows me to prioritise account management effort. If active weeks is strongly correlated with GMV, then the key lever is keeping vendors consistently live on the platform. If order volume is the strongest correlate, then driving orders through promotions and combos is more valuable.
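The theory above mentions partial correlation, which none of the code below computes directly. A minimal residual-based sketch, controlling the Ads–GMV relationship for order volume (the `partial_corr` helper is introduced here for illustration):

```{python partial-corr, eval=FALSE}
# Sketch: Spearman-style partial correlation of Ads and GMV, controlling for Orders
import numpy as np
import pandas as pd
from scipy import stats

def partial_corr(x, y, control):
    # Rank-transform so the result stays in Spearman terms
    x, y, z = (pd.Series(v).rank().to_numpy() for v in (x, y, control))
    # Residualise x and y on the control with a simple linear fit
    rx = x - np.polyval(np.polyfit(z, x, 1), z)
    ry = y - np.polyval(np.polyfit(z, y, 1), z)
    return stats.pearsonr(rx, ry)

ads = (df['Ran_Ads'] == 'Ran Ads').astype(int)
r, p = partial_corr(ads, df['Total_GMV'], df['Total_Orders'])
print(f"Partial correlation (Ads ↔ GMV | Orders): r = {r:.3f}, p = {p:.4f}")
```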
::: panel-tabset
### R
```{r correlation-r}
# ── Compute Spearman correlation matrix (non-normal data) ──
numeric_df <- df %>%
mutate(Ads_Activity = as.numeric(Ran_Ads == "Ran Ads")) %>%
select(Active_Weeks, Total_Orders, Total_GMV, Ads_Activity)
cor_matrix <- cor(numeric_df, method = "spearman", use = "complete.obs")
# ── Correlation heatmap ──
ggcorrplot(cor_matrix,
method = "square",
type = "lower",
lab = TRUE,
lab_size = 4,
colors = c("#2E86AB", "white", "#E8440A"),
title = "Figure 4: Spearman Correlation Matrix — Key Numeric Variables",
ggtheme = theme_minimal()) +
theme(plot.title = element_text(face = "bold", size = 13))
```
### Python
```{python correlation-py}
# ── Compute Spearman correlation matrix ──
df['Ads_Numeric'] = (df['Ran_Ads'] == 'Ran Ads').astype(int)
corr_cols = ['Active_Weeks', 'Total_Orders', 'Total_GMV', 'Ads_Numeric']
corr_labels = ['Active Weeks', 'Total Orders', 'Total GMV', 'Ran Ads']
corr_matrix = df[corr_cols].corr(method='spearman')
fig, ax = plt.subplots(figsize=(8, 6))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.3f', cmap='RdBu_r',
center=0, vmin=-1, vmax=1, ax=ax, square=True,
xticklabels=corr_labels, yticklabels=corr_labels,
linewidths=0.5, cbar_kws={'shrink': 0.8})
ax.set_title("Figure 4: Spearman Correlation Matrix", fontweight='bold')
plt.tight_layout()
plt.show()
```
:::
## Top Correlations — Business Interpretation
::: panel-tabset
### R
```{r top-correlations-r}
# ── Scatter: Total Orders vs Total GMV ──
p_c1 <- ggplot(df, aes(x = Total_Orders, y = Total_GMV / 1e6, colour = Salesperson)) +
geom_point(alpha = 0.7, size = 3) +
geom_smooth(method = "lm", se = TRUE, colour = "#1A1A2E", linewidth = 1.2) +
scale_colour_manual(values = c("Jerry" = "#E8440A", "Jachike" = "#1A1A2E", "Louis" = "#2E86AB")) +
scale_y_continuous(labels = label_number(suffix = "M", prefix = "₦")) +
labs(title = "Strongest Correlation: Orders vs GMV",
subtitle = paste0("Spearman ρ = ", round(cor_matrix["Total_Orders","Total_GMV"], 3)),
x = "Total Orders", y = "Total GMV (₦M)", colour = "Salesperson")
# ── Scatter: Active Weeks vs GMV ──
p_c2 <- ggplot(df, aes(x = Active_Weeks, y = Total_GMV / 1e6, colour = Vendor_Type)) +
geom_jitter(alpha = 0.7, size = 2.5, width = 0.2) +
geom_smooth(method = "lm", se = TRUE, colour = "#1A1A2E", linewidth = 1.2) +
scale_y_continuous(labels = label_number(suffix = "M", prefix = "₦")) +
labs(title = "Second Correlation: Active Weeks vs GMV",
subtitle = paste0("Spearman ρ = ", round(cor_matrix["Active_Weeks","Total_GMV"], 3)),
x = "Active Weeks", y = "Total GMV (₦M)")
p_c1 + p_c2 +
plot_annotation(title = "Figure 5: Two Strongest Correlations with GMV")
```
### Python
```{python top-correlations-py}
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
# Orders vs GMV
rho1, p1 = stats.spearmanr(df['Total_Orders'], df['Total_GMV'])
for sp, color in zip(df['Salesperson'].cat.categories, PALETTE):
    subset = df[df['Salesperson'] == sp]
    ax1.scatter(subset['Total_Orders'], subset['Total_GMV']/1e6,
                alpha=0.7, color=color, label=sp, s=50)
m1, b1 = np.polyfit(df['Total_Orders'], df['Total_GMV']/1e6, 1)
x1 = np.linspace(df['Total_Orders'].min(), df['Total_Orders'].max(), 100)
ax1.plot(x1, m1*x1+b1, color=CHOWDECK_NAVY, linewidth=2)
ax1.set_title(f"Orders vs GMV\nSpearman ρ = {rho1:.3f}, p = {p1:.4f}", fontweight='bold')
ax1.set_xlabel("Total Orders"); ax1.set_ylabel("Total GMV (₦M)"); ax1.legend()
# Active Weeks vs GMV
rho2, p2 = stats.spearmanr(df['Active_Weeks'], df['Total_GMV'])
ax2.scatter(df['Active_Weeks'] + np.random.uniform(-0.2,0.2,len(df)),
df['Total_GMV']/1e6, alpha=0.6, color=CHOWDECK_ORANGE, s=50)
m2, b2 = np.polyfit(df['Active_Weeks'], df['Total_GMV']/1e6, 1)
x2 = np.linspace(df['Active_Weeks'].min(), df['Active_Weeks'].max(), 100)
ax2.plot(x2, m2*x2+b2, color=CHOWDECK_NAVY, linewidth=2)
ax2.set_title(f"Active Weeks vs GMV\nSpearman ρ = {rho2:.3f}, p = {p2:.4f}", fontweight='bold')
ax2.set_xlabel("Active Weeks"); ax2.set_ylabel("Total GMV (₦M)")
fig.suptitle("Figure 5: Top Two Correlations with Total GMV", fontweight='bold', y=1.02)
plt.tight_layout()
plt.show()
```
:::
**Business interpretation of top correlations** (a short decomposition sketch follows this list):
1. **Total Orders ↔ Total GMV (strongest):** Order volume is the primary driver of revenue. Every additional order processed translates directly into GMV. This means the most effective Q2 strategy is increasing order frequency per vendor — through promotions, combo menus, and featured listings — rather than simply onboarding more vendors.
2. **Active Weeks ↔ Total GMV:** Vendors that remain consistently active across the full 13 weeks generate significantly more GMV than those with gaps. This validates the importance of vendor retention and churn prevention as a sales priority.
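Because GMV is mechanically the product of orders and average order value (AOV), part of the orders–GMV correlation is arithmetic. A short sketch separating the two levers (the `AOV` column is derived here, not present in the source data):

```{python aov-decomposition, eval=FALSE}
# Sketch: GMV = Orders × AOV, so check each lever separately
df['AOV'] = df['Total_GMV'] / df['Total_Orders'].replace(0, np.nan)
valid = df['AOV'].notna()
rho_orders, _ = stats.spearmanr(df['Total_Orders'], df['Total_GMV'])
rho_aov, _ = stats.spearmanr(df.loc[valid, 'AOV'], df.loc[valid, 'Total_GMV'])
print(f"Orders vs GMV: ρ = {rho_orders:.3f} | AOV vs GMV: ρ = {rho_aov:.3f}")
```

If the AOV correlation is much weaker, promotions that lift order frequency will beat tactics that try to grow basket size.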
------------------------------------------------------------------------
# Linear Regression
**Theory:** Ordinary Least Squares (OLS) regression models the relationship between a continuous outcome variable (Total GMV) and one or more predictor variables. The regression coefficient for each predictor represents the expected change in the outcome for a one-unit increase in that predictor, holding all other variables constant. Model diagnostics — residual plots, Q-Q plots, leverage, and VIF — validate the assumptions of linearity, homoscedasticity, normality of residuals, and absence of multicollinearity (Adi, 2026, Ch. 9).
**Business justification:** Regression gives me the ability to quantify the specific naira value contribution of each vendor characteristic to GMV. If I can tell a vendor "running ads is associated with ₦X additional GMV per quarter, all else equal", that is a compelling, evidence-based sales argument. The regression model also allows us to predict expected GMV for new vendors before they onboard — a tool that can prioritise pipeline effort.
::: panel-tabset
### R
```{r regression-r, fig.height=10}
# ── Prepare regression data ──
df_reg <- df %>%
mutate(
Log_GMV = log(Total_GMV + 1),
Log_Orders = log(Total_Orders + 1),
Ads_Binary = as.numeric(Ran_Ads == "Ran Ads"),
Vendor_Type = relevel(Vendor_Type, ref = "Restaurant")
)
# ── Fit OLS model ──
model <- lm(Log_GMV ~ Log_Orders + Active_Weeks + Ads_Binary + City + Vendor_Type,
data = df_reg)
# ── Model summary ──
model_summary <- tidy(model, conf.int = TRUE) %>%
mutate(across(where(is.numeric), ~round(., 4)))
kable(model_summary, caption = "Table 1: OLS Regression Results (Dependent Variable: log(Total GMV))",
col.names = c("Term","Estimate","Std. Error","t Statistic","p-value","CI Lower","CI Upper")) %>%
kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
row_spec(which(model_summary$p.value < 0.05), bold = TRUE, background = "#FFF0EB")
cat("\nModel Performance:\n")
cat(" R² =", round(summary(model)$r.squared, 4), "\n")
cat(" Adjusted R² =", round(summary(model)$adj.r.squared, 4), "\n")
cat(" F-statistic =", round(summary(model)$fstatistic[1], 3),
"| p-value:", format.pval(pf(summary(model)$fstatistic[1],
summary(model)$fstatistic[2],
summary(model)$fstatistic[3],
lower.tail = FALSE)), "\n")
# ── Diagnostic plots ──
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(model, which = 1, main = "Residuals vs Fitted")
plot(model, which = 2, main = "Q-Q Plot of Residuals")
plot(model, which = 3, main = "Scale-Location")
plot(model, which = 5, main = "Residuals vs Leverage")
```
### Python
```{python regression-py}
# ── Prepare regression data ──
df_reg = df.copy()
df_reg['Log_GMV'] = np.log(df_reg['Total_GMV'] + 1)
df_reg['Log_Orders'] = np.log(df_reg['Total_Orders'] + 1)
df_reg['Ads_Binary'] = (df_reg['Ran_Ads'] == 'Ran Ads').astype(int)
# One-hot encode categoricals
df_encoded = pd.get_dummies(df_reg[['Log_Orders','Active_Weeks','Ads_Binary',
'City','Vendor_Type']], drop_first=True)
df_encoded = df_encoded.astype(float)
X = sm.add_constant(df_encoded)
y = df_reg['Log_GMV']
# ── Fit OLS model ──
ols_model = sm.OLS(y, X).fit()
print(ols_model.summary())
# ── Diagnostic plots ──
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
# Residuals vs Fitted
fitted = ols_model.fittedvalues
residuals = ols_model.resid
axes[0].scatter(fitted, residuals, alpha=0.6, color=CHOWDECK_ORANGE, s=40)
axes[0].axhline(0, color='black', linestyle='--', linewidth=1)
axes[0].set_xlabel("Fitted Values"); axes[0].set_ylabel("Residuals")
axes[0].set_title("Residuals vs Fitted", fontweight='bold')
# Q-Q Plot
sm.qqplot(residuals, line='s', ax=axes[1], alpha=0.6, color=CHOWDECK_ORANGE)
axes[1].set_title("Q-Q Plot of Residuals", fontweight='bold')
plt.tight_layout()
plt.show()
```
:::
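The setup chunk imports `het_breuschpagan` and `durbin_watson`, and the theory above mentions VIF; a sketch applying all three to the Python fit (reusing `ols_model` and `df_encoded` from the tab above):

```{python ols-diagnostics, eval=FALSE}
# Sketch: formal diagnostics for the OLS fit: heteroscedasticity (BP),
# autocorrelation (Durbin-Watson) and multicollinearity (VIF)
from statsmodels.stats.outliers_influence import variance_inflation_factor

bp_lm, bp_p, _, _ = het_breuschpagan(ols_model.resid, ols_model.model.exog)
print(f"Breusch-Pagan: LM = {bp_lm:.2f}, p = {bp_p:.4f} (H0: homoscedastic residuals)")
print(f"Durbin-Watson: {durbin_watson(ols_model.resid):.2f} (values near 2 suggest no autocorrelation)")

X_num = sm.add_constant(df_encoded[['Log_Orders', 'Active_Weeks', 'Ads_Binary']])
for i, name in enumerate(X_num.columns):
    print(f"VIF {name:<13}: {variance_inflation_factor(X_num.values, i):.2f}")
```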
**Plain-language interpretation of coefficients:**
- **Log(Total Orders):** Each 1% increase in order volume is associated with approximately a 0.85% increase in GMV — confirming orders as the primary revenue lever.
- **Active Weeks:** Each additional week of vendor activity is associated with meaningfully higher GMV — quantifying the cost of vendor churn.
- **Ran Ads (binary):** Vendors that ran ads generated significantly higher GMV on average, even after controlling for orders and activity; the advertising premium is real (see the coefficient-conversion sketch after this list).
- **City effects:** Lagos vendors generate the highest baseline GMV, consistent with market concentration.
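With a log-transformed outcome, a dummy coefficient β translates to a percentage effect of (e^β − 1) × 100. A one-line sketch pulling the fitted Ads coefficient from the Python model above:

```{python ads-coef-pct, eval=FALSE}
# Sketch: convert the log-scale Ads coefficient into a percentage GMV effect
beta_ads = ols_model.params['Ads_Binary']
print(f"Running ads is associated with ≈ {np.exp(beta_ads) - 1:.1%} higher GMV, "
      "holding orders, activity, city and vendor type constant")
```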
**Recommendation for a non-technical manager:** "For every additional week we keep a vendor consistently active on the platform, we expect to see their quarterly GMV increase. This means every vendor who goes inactive is not just a lost account — it is a lost revenue opportunity we can quantify. The data strongly supports investing in vendor retention as a direct GMV growth strategy."
------------------------------------------------------------------------
# Integrated Findings
The five analytical techniques applied to this dataset converge on a single, coherent recommendation:
**The primary driver of vendor GMV at Chowdeck is order volume, sustained over consistent weekly activity, amplified by advertising investment.**
| Technique | Key Finding | Business Implication |
|------------------------|------------------------|------------------------|
| EDA | Right-skewed GMV; few anchor vendors dominate | Protect top accounts; identify and grow mid-tier vendors |
| Visualisation | Lagos dominates; Jerry leads; ads vendors outperform | Prioritise Lagos pipeline; replicate Jerry's approach |
| Hypothesis Testing | Ads significantly increase GMV; vendor type matters | Activate ads for all eligible vendors in Q2 |
| Correlation | Orders (ρ=0.89) and active weeks strongly correlated with GMV | Drive order frequency; prevent vendor churn |
| Regression | Orders, active weeks, and ads all significant predictors | Build a vendor scoring model to prioritise Q2 pipeline |
**Single integrated recommendation:** Chowdeck's Q2 sales strategy should focus on three levers in priority order — (1) increase order frequency for existing high-potential vendors through promotions and combo menus, (2) activate advertising for the 40%+ of vendors currently not running ads, and (3) retain anchor vendors through dedicated account management. A vendor that is consistently active, running ads, and driving high order volume is the highest-value account on the platform.
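The table above points to a vendor scoring model; a minimal sketch of how that could work, ranking vendors by model-predicted GMV (reusing `ols_model`, `X`, and `df_reg` from the regression section; a production version would score new pipeline vendors with the same features):

```{python vendor-scoring, eval=FALSE}
# Sketch: rank vendors by predicted quarterly GMV from the fitted OLS model
pred_log_gmv = ols_model.predict(X)                  # predictions on the log scale
df_reg['Predicted_GMV'] = np.exp(pred_log_gmv) - 1   # back-transform to naira
priority = (df_reg[['Vendor_Name', 'Salesperson', 'Predicted_GMV']]
            .sort_values('Predicted_GMV', ascending=False)
            .head(10))
print(priority.to_string(index=False))
```

Note that the naive back-transform slightly understates expected GMV (retransformation bias), but for prioritising the pipeline the ranking is unaffected.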
------------------------------------------------------------------------
# Limitations & Further Work
## Limitations
- **Sample coverage:** The 120-vendor dataset represents one sales team's Q1 portfolio. It is not a random sample of all Chowdeck vendors and should not be generalised to the full platform without validation.
- **Causality:** Correlation and regression establish association, not causation. The finding that ads are associated with higher GMV could reflect reverse causality — higher-performing vendors may be more likely to invest in ads.
- **Omitted variables:** The model does not include vendor quality ratings, menu size, pricing strategy, or competitive intensity by location — all of which may confound the observed relationships.
- **Time period:** Q1 2026 covers 13 weeks and may not be representative of full-year patterns — particularly seasonal effects around Easter and school resumption.
- **Non-normality:** GMV data is heavily right-skewed, requiring non-parametric tests and log-transformation in regression. While addressed, this limits direct interpretability of raw coefficients.
## Further Work
- Extend the dataset to cover all four quarters of 2025–2026 to enable seasonal trend analysis
- Include competitor price data and category-level demand to model market share dynamics
- Build a predictive model (logistic regression or random forest) to classify vendors at churn risk before they go inactive
- Conduct a controlled experiment — randomly assign advertising credits to a subset of vendors and measure GMV lift — to establish causality rather than correlation
- Incorporate customer-level data to understand the relationship between vendor GMV and customer lifetime value
------------------------------------------------------------------------
# References
Adi, B. (2026). *AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R*. Lagos Business School / markanalytics.online. <https://markanalytics.online>
Anscombe, F. J. (1973). Graphs in statistical analysis. *The American Statistician*, 27(1), 17–21. <https://doi.org/10.1080/00031305.1973.10478966>
R Core Team. (2024). *R: A language and environment for statistical computing* (Version 4.4). R Foundation for Statistical Computing. <https://www.R-project.org/>
Tukey, J. W. (1977). *Exploratory data analysis*. Addison-Wesley.
Van Rossum, G., & Drake, F. L. (2009). *Python 3 reference manual*. CreateSpace.
Wickham, H. (2016). *ggplot2: Elegant graphics for data analysis*. Springer. <https://doi.org/10.1007/978-3-319-24277-4>
Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. *Journal of Open Source Software*, 4(43), 1686. <https://doi.org/10.21105/joss.01686>
Wilkinson, L. (2005). *The grammar of graphics* (2nd ed.). Springer.
```{r package-citations, eval=FALSE}
# Run these in your R console to retrieve APA citations for each package used
citation("ggplot2")
citation("corrplot")
citation("ggcorrplot")
citation("kableExtra")
citation("patchwork")
citation("broom")
citation("effsize")
citation("moments")
citation("dunn.test")
citation("car")
```
------------------------------------------------------------------------
# Appendix: AI Usage Statement
This analysis was completed as part of the MMBA 8 Data Analytics II assessment at Lagos Business School. The following AI tools were used during the preparation of this document:
**Claude (Anthropic)** was used to assist with structuring the Quarto document template, suggesting appropriate R and Python package combinations for each analytical technique, and generating starter code blocks for the visualisation and regression sections.
**Independent analytical judgements made by the author include:**
- Selection of Case Study 1 as the most appropriate option given the professional context
- Decision to use log-transformation for GMV due to observed right-skewness in the data
- Choice of Mann-Whitney U and Kruskal-Wallis as non-parametric alternatives after failing normality tests
- Selection of Spearman rather than Pearson correlation given non-normal distributions
- Framing of the two hypotheses based on genuine business questions arising from Q1 sales operations at Chowdeck
- All plain-language business interpretations and the integrated recommendation
The data collection methodology, professional disclosure, and business context descriptions reflect the author's own direct professional experience and were not generated by AI.