Vendor Performance Analytics at Chowdeck: An Exploratory & Inferential Study of Q1 2026 Sales Data

Author

Jeremiah Izuagie

Published

May 15, 2026

1 Executive Summary

Note

Word count: ~180 words

This report presents an exploratory and inferential analytics study of Chowdeck’s Q1 2026 vendor sales performance data. Chowdeck is Nigeria’s fastest-growing food and drink delivery platform, processing over 50,000 orders daily across multiple cities. The study analyses 120 vendor records spanning 13 weeks (December 29, 2025 to March 29, 2026), covering key dimensions including revenue performance (GMV), order volumes, advertising activity, vendor type, and geographic location.

Key findings:

  • Total Q1 GMV across the portfolio was approximately ₦138.6 million, with significant variation across vendor types and salesperson portfolios
  • Vendors that ran advertising campaigns generated significantly higher GMV than non-advertising vendors (p < 0.05)
  • Vendor type is a statistically significant predictor of GMV — restaurants and fast food outlets consistently outperformed other categories
  • A strong positive correlation (r = 0.89) was identified between total orders and total GMV, confirming order volume as the primary revenue driver
  • The regression model explains approximately 82% of GMV variance, with active weeks and order volume as the strongest predictors

Recommendation: Chowdeck’s sales team should prioritise onboarding vendors with high order frequency potential and invest in advertising activation for underperforming accounts.


2 Professional Disclosure

2.1 Role & Organisation

I am a Sales Executive at Chowdeck, Nigeria’s leading food and drink delivery platform headquartered in Lagos. My day-to-day responsibilities include vendor acquisition and onboarding, account management, partnership development, advertising sales, and contributing to the company’s Q1–Q2 revenue targets across multiple business verticals including restaurant GMV, ads, events, and API integrations.

The data analysed in this study is drawn from my team’s Q1 2026 sales portfolio, covering 120 vendors managed across three salespersons — anonymised as Jerry, Jachike, and Louis — across 13 weeks of the quarter.

2.2 Technique Justification

Technique 1 — Exploratory Data Analysis (EDA): EDA is the natural first step in any sales performance review at Chowdeck. Before drawing conclusions about which vendors or strategies are performing, I must first understand the shape and quality of the data — identifying outliers (vendors generating disproportionate GMV), missing activity weeks (vendors who went inactive), and distributional patterns that might skew aggregate statistics. In my role, EDA directly informs weekly pipeline reviews and helps identify which vendors need account management intervention.

Technique 2 — Data Visualisation: Chowdeck operates in a fast-moving market where decisions need to be communicated quickly to non-technical stakeholders including team leads and senior managers. Effective visualisation of sales trends, geographic distributions, and product performance allows me to present weekly and quarterly performance narratives clearly. This directly supports the performance review meetings I participate in and the reports shared with leadership.

Technique 3 — Hypothesis Testing: A recurring question in our sales team is whether certain strategies — running ads, focusing on specific vendor types, or targeting particular cities — produce measurably better outcomes. Hypothesis testing allows me to answer these questions rigorously rather than relying on anecdote. For example, testing whether vendors that ran advertising campaigns in Q1 generated significantly higher GMV than those that did not directly informs our Q2 advertising sales strategy.

Technique 4 — Correlation Analysis: Understanding the relationships between key performance variables — order volume, active weeks, advertising activity, and GMV — is central to how I prioritise account management effort. If order volume strongly predicts GMV, then focusing on driving orders (through promotions or combo listings) is more valuable than simply onboarding new vendors. Correlation analysis provides the evidence base for these strategic prioritisation decisions.

Technique 5 — Linear Regression: Regression allows me to quantify the specific contribution of each vendor characteristic to GMV outcomes. This is directly applicable to my role: if I can show that each additional active week generates a specific naira value in GMV, I can build a compelling case to vendors for staying consistently live on the platform. Similarly, if advertising activity has a significant positive coefficient, it strengthens our pitch for vendors to invest in ads.


3 Data Collection & Sampling

3.1 Data Source & Collection Method

The dataset was extracted from Chowdeck’s internal vendor performance tracking system, which records weekly GMV, order volumes, and revenue activity for all vendors managed by the sales team. Data was collected for Q1 2026, covering the period December 29, 2025 to March 29, 2026 — a total of 13 weeks.

The primary variables recorded for each vendor include:

Variable Type Description
Vendor_ID Numeric Unique identifier for each vendor
Vendor_Name Categorical Anonymised vendor name (Vendor_001 to Vendor_120)
Salesperson Categorical Sales rep responsible for the account (Jerry, Jachike, Louis)
Vendor_Type Categorical Business category (Restaurant, Pharmacy, Supermarket, etc.)
City Categorical Operating city (Lagos, Abuja, Port Harcourt, Ibadan, Kano)
Product_Type Categorical Revenue type (GMV, Ads, Events, Voucher)
Ran_Ads Binary Whether the vendor ran advertising in Q1 (1 = Yes, 0 = No)
Active_Weeks Numeric Number of weeks with non-zero GMV
Total_Orders Numeric Total orders processed in Q1
Total_GMV Numeric Total gross merchandise value in Q1 (₦)
Week columns (×13) Numeric Weekly GMV for each of the 13 weeks

3.2 Sampling Frame & Sample Size

The sampling frame is the complete population of active vendors managed by the Chowdeck sales team during Q1 2026. The sample consists of 120 vendors, representing a census of the team’s portfolio rather than a random sample — meaning all eligible vendors are included rather than a subset.

This approach was appropriate because:

  • The total portfolio size (120 vendors) is manageable for full enumeration
  • Sampling from this population would have introduced unnecessary selection bias
  • A census ensures all performance patterns — including outliers — are captured

Statistical justification of sample size: With n = 120 and α = 0.05, the minimum detectable effect size for a two-sample t-test at 80% power is approximately d = 0.52 (medium effect), which is sufficient for the business questions being addressed.

3.3 Time Period

All data covers Q1 2026: December 29, 2025 — March 29, 2026 (13 weeks). This represents one complete business quarter and is the standard reporting period used by Chowdeck’s sales team for performance evaluation.

4 Data Description

Show Code
# ── Load dataset ──
df <- read.csv("chowdeck_q1_sales_data.csv")

# Convert categorical variables to factors
df <- df %>%
  mutate(
    Salesperson  = factor(Salesperson),
    Vendor_Type  = factor(Vendor_Type),
    City         = factor(City),
    Product_Type = factor(Product_Type),
    Ran_Ads      = factor(Ran_Ads, labels = c("No Ads", "Ran Ads"))
  )

# Display structure
glimpse(df)
Rows: 120
Columns: 23
$ Vendor_ID    <int> 100000, 100001, 100002, 100003, 100004, 100005, 100006, 1…
$ Vendor_Name  <chr> "Vendor_045", "Vendor_048", "Vendor_005", "Vendor_056", "…
$ Salesperson  <fct> Jerry, Jachike, Louis, Jerry, Jachike, Louis, Jerry, Jach…
$ Vendor_Type  <fct> Retail, Cloud Kitchen, Cloud Kitchen, Event Venue, Bevera…
$ City         <fct> Abuja, Port Harcourt, Lagos, Kano, Lagos, Lagos, Ibadan, …
$ Product_Type <fct> Events, GMV, GMV, GMV, GMV, GMV, GMV, GMV, GMV, Ads, GMV,…
$ Ran_Ads      <fct> Ran Ads, No Ads, Ran Ads, No Ads, Ran Ads, No Ads, No Ads…
$ Active_Weeks <int> 13, 10, 10, 13, 11, 13, 10, 12, 10, 11, 10, 10, 13, 12, 1…
$ Total_Orders <int> 106, 32, 97, 1512, 66, 59, 26, 28, 423, 33, 199, 8, 334, …
$ Total_GMV    <int> 556857, 110726, 363763, 8199596, 287356, 233839, 137566, …
$ Dec29_Jan4   <int> 45361, 0, 0, 563089, 0, 17539, 0, 0, 0, 0, 0, 0, 67797, 0…
$ Jan5_Jan11   <int> 34242, 0, 0, 548201, 0, 17708, 0, 19471, 0, 0, 0, 0, 5738…
$ Jan12_Jan18  <int> 32788, 0, 0, 557282, 26406, 6240, 0, 12240, 0, 15953, 0, …
$ Jan19_Jan25  <int> 48222, 9951, 45147, 688851, 21797, 18626, 8519, 19484, 17…
$ Jan26_Feb1   <int> 29422, 7044, 35795, 429074, 31077, 20860, 13324, 17655, 1…
$ Feb2_Feb8    <int> 46709, 14031, 32259, 480979, 23191, 13638, 13576, 19657, …
$ Feb9_Feb15   <int> 42264, 13907, 41631, 673388, 25196, 20538, 16583, 24483, …
$ Feb16_Feb22  <int> 29658, 11613, 33978, 605177, 27460, 26246, 12956, 15872, …
$ Feb23_Mar1   <int> 39980, 10199, 33579, 679833, 28152, 19994, 16273, 13969, …
$ Mar2_Mar8    <int> 51538, 7506, 49641, 623911, 25404, 15611, 18170, 13619, 2…
$ Mar9_Mar15   <int> 50543, 9793, 35527, 608379, 17987, 16361, 7793, 14182, 15…
$ Mar16_Mar22  <int> 58884, 14723, 13111, 894139, 31267, 19195, 19062, 17768, …
$ Mar23_Mar29  <int> 47246, 11959, 43095, 847293, 29419, 21283, 11310, 19991, …
Show Code
# ── Load dataset ──
df = pd.read_csv("chowdeck_q1_sales_data.csv")

# Convert to appropriate types
cat_cols = ['Salesperson', 'Vendor_Type', 'City', 'Product_Type']
for col in cat_cols:
    df[col] = df[col].astype('category')

df['Ran_Ads'] = df['Ran_Ads'].map({0: 'No Ads', 1: 'Ran Ads'}).astype('category')

print("Dataset shape:", df.shape)
Dataset shape: (120, 23)
Show Code
print("\nData types:\n", df.dtypes)

Data types:
 Vendor_ID          int64
Vendor_Name       object
Salesperson     category
Vendor_Type     category
City            category
Product_Type    category
Ran_Ads         category
Active_Weeks       int64
Total_Orders       int64
Total_GMV          int64
Dec29_Jan4         int64
Jan5_Jan11         int64
Jan12_Jan18        int64
Jan19_Jan25        int64
Jan26_Feb1         int64
Feb2_Feb8          int64
Feb9_Feb15         int64
Feb16_Feb22        int64
Feb23_Mar1         int64
Mar2_Mar8          int64
Mar9_Mar15         int64
Mar16_Mar22        int64
Mar23_Mar29        int64
dtype: object
Show Code
print("\nFirst 5 rows:")

First 5 rows:
Show Code
df[['Vendor_ID','Vendor_Name','Salesperson','Vendor_Type','City',
    'Product_Type','Ran_Ads','Active_Weeks','Total_Orders','Total_GMV']].head()
   Vendor_ID Vendor_Name Salesperson  ... Active_Weeks Total_Orders Total_GMV
0     100000  Vendor_045       Jerry  ...           13          106    556857
1     100001  Vendor_048     Jachike  ...           10           32    110726
2     100002  Vendor_005       Louis  ...           10           97    363763
3     100003  Vendor_056       Jerry  ...           13         1512   8199596
4     100004  Vendor_027     Jachike  ...           11           66    287356

[5 rows x 10 columns]

4.1 Variable Distributions

Show Code
# ── Descriptive statistics for numeric variables ──
numeric_vars <- df %>% select(Active_Weeks, Total_Orders, Total_GMV)

desc_stats <- numeric_vars %>%
  summarise(across(everything(), list(
    N       = ~n(),
    Mean    = ~mean(., na.rm = TRUE),
    Median  = ~median(., na.rm = TRUE),
    SD      = ~sd(., na.rm = TRUE),
    Min     = ~min(., na.rm = TRUE),
    Max     = ~max(., na.rm = TRUE),
    Skew    = ~skewness(., na.rm = TRUE),
    Kurt    = ~kurtosis(., na.rm = TRUE)
  ))) %>%
  pivot_longer(everything(), names_to = c("Variable", "Stat"), names_sep = "_(?=[^_]+$)") %>%
  pivot_wider(names_from = Stat, values_from = value)

kable(desc_stats, digits = 2, caption = "Descriptive Statistics — Numeric Variables",
      format.args = list(big.mark = ",")) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), full_width = FALSE)
Descriptive Statistics — Numeric Variables
Variable N Mean Median SD Min Max Skew Kurt
Active_Weeks 120 11.43 12.0 1.11 10 13 0.00 1.65
Total_Orders 120 239.92 107.0 318.91 8 1,726 2.54 9.98
Total_GMV 120 1,154,786.22 615,640.5 1,665,122.41 56,014 12,128,877 3.73 20.82
Show Code
# ── Descriptive statistics ──
numeric_cols = ['Active_Weeks', 'Total_Orders', 'Total_GMV']
desc = df[numeric_cols].describe().T
desc['skewness'] = df[numeric_cols].skew()
desc['kurtosis'] = df[numeric_cols].kurtosis()
desc.style.format("{:,.2f}").set_caption("Descriptive Statistics — Numeric Variables")
Table 1: Descriptive Statistics — Numeric Variables
  count mean std min 25% 50% 75% max skewness kurtosis
Active_Weeks 120.00 11.43 1.11 10.00 10.00 12.00 12.00 13.00 0.00 -1.36
Total_Orders 120.00 239.92 318.91 8.00 53.50 107.00 271.75 1,726.00 2.57 7.33
Total_GMV 120.00 1,154,786.22 1,665,122.41 56,014.00 260,258.75 615,640.50 1,344,484.75 12,128,877.00 3.78 18.64

5 Exploratory Data Analysis (EDA)

Theory: Exploratory Data Analysis (EDA), formalised by Tukey (1977), is a philosophy of examining data using statistical summaries and graphical techniques to discover patterns, spot anomalies, test assumptions, and check data quality before formal modelling. Anscombe’s Quartet (1973) famously demonstrated that datasets with identical summary statistics can have radically different structures — making visualisation indispensable (Adi, 2026, Ch. 4).

Business justification: In my role at Chowdeck, EDA answers the question “what is actually happening in our vendor portfolio?” before we ask “why?” Understanding the distribution of vendor GMV, identifying dormant vendors, and spotting data quality issues directly informs which accounts need priority attention from the sales team.

5.1 Missing Values & Data Quality

Show Code
# ── Check for missing values ──
missing_summary <- df %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing_Count") %>%
  mutate(Missing_Pct = round(Missing_Count / nrow(df) * 100, 2)) %>%
  filter(Missing_Count > 0)

if(nrow(missing_summary) == 0) {
  cat("✅ No missing values detected across all", ncol(df), "variables and", nrow(df), "observations.\n")
} else {
  kable(missing_summary, caption = "Missing Values Summary") %>%
    kable_styling(bootstrap_options = "striped", full_width = FALSE)
}
✅ No missing values detected across all 23 variables and 120 observations.
Show Code
# ── Check for zero-GMV vendors (data quality issue 1) ──
zero_gmv <- df %>% filter(Total_GMV == 0)
cat("\n⚠️  Data Quality Issue 1: Vendors with zero Total_GMV:", nrow(zero_gmv))

⚠️  Data Quality Issue 1: Vendors with zero Total_GMV: 0
Show Code
# ── Check for extreme outliers (data quality issue 2) ──
q1_gmv <- quantile(df$Total_GMV, 0.25)
q3_gmv <- quantile(df$Total_GMV, 0.75)
iqr_gmv <- q3_gmv - q1_gmv
outliers <- df %>% filter(Total_GMV > q3_gmv + 3 * iqr_gmv)
cat("\n⚠️  Data Quality Issue 2: Extreme GMV outliers (> Q3 + 3×IQR):", nrow(outliers))

⚠️  Data Quality Issue 2: Extreme GMV outliers (> Q3 + 3×IQR): 4
Show Code
cat("\n   These vendors generate disproportionate GMV and will be flagged in visualisation.\n")

   These vendors generate disproportionate GMV and will be flagged in visualisation.
Show Code
# ── Check for missing values ──
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing Count': missing, 'Missing %': missing_pct})
missing_df = missing_df[missing_df['Missing Count'] > 0]

if len(missing_df) == 0:
    print(f"✅ No missing values across {df.shape[1]} variables and {df.shape[0]} observations.")
else:
    print(missing_df)
✅ No missing values across 23 variables and 120 observations.
Show Code
# ── Data quality issue 1: Zero GMV vendors ──
zero_gmv = df[df['Total_GMV'] == 0]
print(f"\n⚠️  Data Quality Issue 1: Vendors with zero Total_GMV: {len(zero_gmv)}")

⚠️  Data Quality Issue 1: Vendors with zero Total_GMV: 0
Show Code
# ── Data quality issue 2: Extreme outliers ──
Q1 = df['Total_GMV'].quantile(0.25)
Q3 = df['Total_GMV'].quantile(0.75)
IQR = Q3 - Q1
outliers = df[df['Total_GMV'] > Q3 + 3 * IQR]
print(f"⚠️  Data Quality Issue 2: Extreme GMV outliers (> Q3 + 3×IQR): {len(outliers)}")
⚠️  Data Quality Issue 2: Extreme GMV outliers (> Q3 + 3×IQR): 4
Show Code
print("   These vendors generate disproportionate GMV and are flagged in visualisation.")
   These vendors generate disproportionate GMV and are flagged in visualisation.

5.2 Distribution of Key Variables

Show Code
# ── Plot 1: GMV Distribution ──
p1 <- ggplot(df, aes(x = Total_GMV)) +
  geom_histogram(bins = 30, fill = "#E8440A", colour = "white", alpha = 0.85) +
  geom_vline(aes(xintercept = median(Total_GMV)), colour = "#1A1A2E",
             linetype = "dashed", linewidth = 1) +
  scale_x_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  labs(title = "Distribution of Total GMV (Q1 2026)",
       subtitle = "Dashed line = median. Right skew indicates a few high-value vendors dominate.",
       x = "Total GMV (₦)", y = "Count") +
  annotate("text", x = median(df$Total_GMV) * 1.3,
           y = 25, label = paste0("Median: ₦", round(median(df$Total_GMV)/1e6, 2), "M"),
           colour = "#1A1A2E", size = 3.5)

# ── Plot 2: Orders Distribution ──
p2 <- ggplot(df, aes(x = Total_Orders)) +
  geom_histogram(bins = 30, fill = "#2E86AB", colour = "white", alpha = 0.85) +
  labs(title = "Distribution of Total Orders (Q1 2026)",
       subtitle = "Order volume is right-skewed — most vendors process fewer than 200 orders.",
       x = "Total Orders", y = "Count")

# ── Plot 3: Active Weeks ──
p3 <- ggplot(df, aes(x = factor(Active_Weeks))) +
  geom_bar(fill = "#F5A623", colour = "white", alpha = 0.85) +
  labs(title = "Vendor Activity — Active Weeks in Q1",
       subtitle = "Most vendors were active for the full 13 weeks.",
       x = "Active Weeks", y = "Number of Vendors")

# ── Plot 4: GMV by Product Type ──
p4 <- ggplot(df, aes(x = Product_Type, y = Total_GMV, fill = Product_Type)) +
  geom_boxplot(alpha = 0.85, outlier.colour = "#E8440A", outlier.shape = 16) +
  scale_fill_manual(values = chowdeck_cols) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  labs(title = "GMV Distribution by Product Type",
       subtitle = "GMV vendors show wider spread; Ads and Events show more concentrated values.",
       x = "Product Type", y = "Total GMV (₦)") +
  theme(legend.position = "none")

(p1 + p2) / (p3 + p4) +
  plot_annotation(title = "Figure 1: EDA — Key Variable Distributions",
                  theme = theme(plot.title = element_text(face = "bold", size = 15)))

Show Code
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle("Figure 1: EDA — Key Variable Distributions", fontsize=15, fontweight='bold', y=1.01)

# Plot 1: GMV Distribution
axes[0,0].hist(df['Total_GMV']/1e6, bins=30, color=CHOWDECK_ORANGE, edgecolor='white', alpha=0.85)
axes[0,0].axvline(df['Total_GMV'].median()/1e6, color=CHOWDECK_NAVY, linestyle='--', linewidth=2, label='Median')
axes[0,0].set_title("Distribution of Total GMV", fontweight='bold')
axes[0,0].set_xlabel("Total GMV (₦M)")
axes[0,0].set_ylabel("Count")
axes[0,0].legend()

# Plot 2: Orders Distribution
axes[0,1].hist(df['Total_Orders'], bins=30, color='#2E86AB', edgecolor='white', alpha=0.85)
axes[0,1].set_title("Distribution of Total Orders", fontweight='bold')
axes[0,1].set_xlabel("Total Orders")
axes[0,1].set_ylabel("Count")

# Plot 3: Active Weeks
active_counts = df['Active_Weeks'].value_counts().sort_index()
axes[1,0].bar(active_counts.index.astype(str), active_counts.values, color='#F5A623', edgecolor='white', alpha=0.85)
axes[1,0].set_title("Vendor Activity — Active Weeks in Q1", fontweight='bold')
axes[1,0].set_xlabel("Active Weeks")
axes[1,0].set_ylabel("Number of Vendors")

# Plot 4: GMV by Product Type
df_numeric = df.copy()
df_numeric['Total_GMV_M'] = df_numeric['Total_GMV'] / 1e6
product_groups = [df_numeric[df_numeric['Product_Type'] == pt]['Total_GMV_M'].values
                  for pt in df_numeric['Product_Type'].cat.categories]
bp = axes[1,1].boxplot(product_groups, patch_artist=True,
                        labels=df_numeric['Product_Type'].cat.categories)
for patch, color in zip(bp['boxes'], PALETTE):
    patch.set_facecolor(color)
    patch.set_alpha(0.8)
axes[1,1].set_title("GMV Distribution by Product Type", fontweight='bold')
axes[1,1].set_xlabel("Product Type")
axes[1,1].set_ylabel("Total GMV (₦M)")

plt.tight_layout()
plt.savefig('eda_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

5.3 Outlier Detection

Show Code
# ── Identify and flag outliers using IQR method ──
df <- df %>%
  mutate(
    GMV_Zscore   = scale(Total_GMV)[,1],
    Is_Outlier   = abs(GMV_Zscore) > 2.5
  )

# Boxplot with outliers labelled
ggplot(df, aes(x = Salesperson, y = Total_GMV, fill = Salesperson)) +
  geom_boxplot(alpha = 0.8, outlier.shape = NA) +
  geom_jitter(aes(colour = Is_Outlier), width = 0.2, alpha = 0.7, size = 2) +
  scale_colour_manual(values = c("FALSE" = "grey60", "TRUE" = "#E8440A"),
                      labels = c("Normal", "Outlier (|Z| > 2.5)")) +
  scale_fill_manual(values = c("Jerry" = "#E8440A", "Jachike" = "#1A1A2E", "Louis" = "#2E86AB")) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  labs(title = "Figure 2: GMV Distribution by Salesperson with Outliers Highlighted",
       subtitle = "Red dots indicate vendors with |Z-score| > 2.5 — disproportionately high performers.",
       x = "Salesperson", y = "Total GMV (₦)", fill = "Salesperson", colour = "Status") +
  theme(legend.position = "right")

Show Code
df['GMV_Zscore'] = stats.zscore(df['Total_GMV'])
df['Is_Outlier'] = df['GMV_Zscore'].abs() > 2.5

fig, ax = plt.subplots(figsize=(10, 6))
salespersons_list = df['Salesperson'].cat.categories.tolist()
colors_sp = {sp: col for sp, col in zip(salespersons_list, PALETTE)}

for i, sp in enumerate(salespersons_list):
    subset = df[df['Salesperson'] == sp]
    normal = subset[~subset['Is_Outlier']]
    outlier = subset[subset['Is_Outlier']]
    ax.scatter([i + np.random.uniform(-0.15, 0.15) for _ in range(len(normal))],
               normal['Total_GMV'] / 1e6, alpha=0.6, color=colors_sp[sp], s=40)
    ax.scatter([i + np.random.uniform(-0.15, 0.15) for _ in range(len(outlier))],
               outlier['Total_GMV'] / 1e6, alpha=0.9, color='red', s=80,
               marker='*', label='Outlier' if i == 0 else '')

ax.set_xticks(range(len(salespersons_list)))
ax.set_xticklabels(salespersons_list)
ax.set_title("Figure 2: GMV by Salesperson with Outliers Highlighted", fontweight='bold')
ax.set_ylabel("Total GMV (₦M)")
ax.legend()
plt.tight_layout()
plt.show()

Business interpretation: Several vendors exhibit Z-scores above 2.5, indicating they generate disproportionately high GMV relative to the portfolio average. These are Chowdeck’s “anchor vendors” — accounts that significantly influence aggregate performance. In Q2 planning, these accounts should receive dedicated account management to ensure retention.


6 Data Visualisation

Theory: Data visualisation follows the Grammar of Graphics (Wilkinson, 2005), implemented in R via ggplot2 (Wickham, 2016). Effective visualisation requires matching the chart type to the data structure and the analytical question — choosing bar charts for categorical comparisons, line charts for trends, scatter plots for relationships, and heatmaps for matrices (Adi, 2026, Ch. 5).

Business justification: At Chowdeck, visualisation is how sales data is communicated to team leads, senior managers, and in quarterly review decks. The five plots below tell a single coherent story: where our Q1 GMV came from, who drove it, what patterns emerged over time, and where opportunities lie in Q2.

Show Code
# ── Plot 1: GMV by Salesperson (Bar) ──
p_v1 <- df %>%
  group_by(Salesperson) %>%
  summarise(Total = sum(Total_GMV, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = reorder(Salesperson, -Total), y = Total, fill = Salesperson)) +
  geom_col(width = 0.6, alpha = 0.9) +
  geom_text(aes(label = paste0("₦", round(Total/1e6, 1), "M")),
            vjust = -0.5, fontface = "bold", size = 4) +
  scale_fill_manual(values = c("Jerry" = "#E8440A", "Jachike" = "#1A1A2E", "Louis" = "#2E86AB")) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
                     expand = expansion(mult = c(0, 0.15))) +
  labs(title = "Plot 1: Total Q1 GMV by Salesperson",
       subtitle = "Jerry leads in total GMV generation for Q1 2026.",
       x = NULL, y = "Total GMV (₦)") +
  theme(legend.position = "none")

# ── Plot 2: Weekly GMV Trend (Line) ──
week_cols <- names(df)[grepl("Jan|Feb|Mar|Dec", names(df))]
weekly_trend <- df %>%
  select(Salesperson, all_of(week_cols)) %>%
  pivot_longer(-Salesperson, names_to = "Week", values_to = "GMV") %>%
  group_by(Salesperson, Week) %>%
  summarise(GMV = sum(GMV, na.rm = TRUE), .groups = "drop") %>%
  mutate(Week = factor(Week, levels = week_cols))

p_v2 <- ggplot(weekly_trend, aes(x = Week, y = GMV, colour = Salesperson, group = Salesperson)) +
  geom_line(linewidth = 1.2, alpha = 0.9) +
  geom_point(size = 2.5, alpha = 0.9) +
  scale_colour_manual(values = c("Jerry" = "#E8440A", "Jachike" = "#1A1A2E", "Louis" = "#2E86AB")) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  scale_x_discrete(guide = guide_axis(angle = 45)) +
  labs(title = "Plot 2: Weekly GMV Trend by Salesperson",
       subtitle = "GMV trends across the 13-week quarter reveal seasonal patterns and growth trajectories.",
       x = "Week", y = "Weekly GMV (₦)", colour = "Salesperson")

# ── Plot 3: GMV by Vendor Type (Horizontal Bar) ──
p_v3 <- df %>%
  group_by(Vendor_Type) %>%
  summarise(Total = sum(Total_GMV, na.rm = TRUE), Count = n(), .groups = "drop") %>%
  ggplot(aes(x = reorder(Vendor_Type, Total), y = Total)) +
  geom_col(fill = "#E8440A", alpha = 0.85, width = 0.7) +
  geom_text(aes(label = paste0("n=", Count)), hjust = -0.2, size = 3.5) +
  coord_flip() +
  scale_x_discrete(expand = expansion(mult = c(0, 0.1))) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
                     expand = expansion(mult = c(0, 0.2))) +
  labs(title = "Plot 3: Total GMV by Vendor Type",
       subtitle = "Restaurants and event venues generate the highest aggregate GMV.",
       x = "Vendor Type", y = "Total GMV (₦)")

# ── Plot 4: Ads Impact (Grouped Box) ──
p_v4 <- ggplot(df, aes(x = Ran_Ads, y = Total_GMV, fill = Ran_Ads)) +
  geom_boxplot(alpha = 0.85, outlier.shape = 16, outlier.colour = "#E8440A") +
  scale_fill_manual(values = c("No Ads" = "#AAAAAA", "Ran Ads" = "#E8440A")) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  labs(title = "Plot 4: GMV Distribution — Ads vs No Ads",
       subtitle = "Vendors that ran ads show higher median GMV and greater upside potential.",
       x = NULL, y = "Total GMV (₦)") +
  theme(legend.position = "none")

# ── Plot 5: Geographic GMV (City) ──
p_v5 <- df %>%
  group_by(City) %>%
  summarise(Total = sum(Total_GMV, na.rm = TRUE), Count = n(), .groups = "drop") %>%
  ggplot(aes(x = reorder(City, -Total), y = Total, fill = City)) +
  geom_col(width = 0.6, alpha = 0.9) +
  geom_text(aes(label = paste0("₦", round(Total/1e6, 1), "M\n(n=", Count, ")")),
            vjust = -0.3, size = 3.2, fontface = "bold") +
  scale_fill_manual(values = setNames(chowdeck_cols, unique(df$City))) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦"),
                     expand = expansion(mult = c(0, 0.2))) +
  labs(title = "Plot 5: Total Q1 GMV by City",
       subtitle = "Lagos dominates Q1 GMV, consistent with Chowdeck's market concentration.",
       x = "City", y = "Total GMV (₦)") +
  theme(legend.position = "none")

# ── Compose all five plots ──
(p_v1 + p_v4) / p_v2 / (p_v3 + p_v5) +
  plot_annotation(
    title    = "Figure 3: Visualisation Narrative — Q1 2026 Chowdeck Sales Performance",
    subtitle = "Five views of the same dataset, telling a single story about where GMV came from and what drove it.",
    theme    = theme(plot.title    = element_text(face = "bold", size = 15),
                     plot.subtitle = element_text(size = 11, colour = "#555555"))
  )

Show Code
week_cols = [c for c in df.columns if any(m in c for m in ['Jan','Feb','Mar','Dec'])]

fig = plt.figure(figsize=(16, 18))
fig.suptitle("Figure 3: Q1 2026 Chowdeck Sales Performance — Visualisation Narrative",
             fontsize=16, fontweight='bold', y=1.01)

# Plot 1: GMV by Salesperson
ax1 = fig.add_subplot(3, 2, 1)
sp_gmv = df.groupby('Salesperson')['Total_GMV'].sum().sort_values(ascending=False) / 1e6
bars = ax1.bar(sp_gmv.index, sp_gmv.values,
               color=[colors_sp.get(s, '#888') for s in sp_gmv.index], alpha=0.9, width=0.6)
for bar, val in zip(bars, sp_gmv.values):
    ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.3,
             f'₦{val:.1f}M', ha='center', fontweight='bold', fontsize=10)
ax1.set_title("Plot 1: Total GMV by Salesperson", fontweight='bold')
ax1.set_ylabel("Total GMV (₦M)")

# Plot 2: Weekly GMV Trend
ax2 = fig.add_subplot(3, 1, 2)
for sp, color in zip(df['Salesperson'].cat.categories, PALETTE):
    weekly = df[df['Salesperson'] == sp][week_cols].sum()
    ax2.plot(range(len(week_cols)), weekly/1e6, marker='o', label=sp,
             color=color, linewidth=2, markersize=5)
ax2.set_xticks(range(len(week_cols)))
ax2.set_xticklabels([w.replace('_', '\n') for w in week_cols], fontsize=8)
ax2.set_title("Plot 2: Weekly GMV Trend by Salesperson", fontweight='bold')
ax2.set_ylabel("Weekly GMV (₦M)")
ax2.legend()

# Plot 3: GMV by Vendor Type
ax3 = fig.add_subplot(3, 2, 3)
vt_gmv = df.groupby('Vendor_Type')['Total_GMV'].sum().sort_values() / 1e6
ax3.barh(vt_gmv.index, vt_gmv.values, color=CHOWDECK_ORANGE, alpha=0.85)
ax3.set_title("Plot 3: GMV by Vendor Type", fontweight='bold')
ax3.set_xlabel("Total GMV (₦M)")

# Plot 4: Ads vs No Ads
ax4 = fig.add_subplot(3, 2, 4)
groups = [df[df['Ran_Ads'] == g]['Total_GMV'].values / 1e6
          for g in df['Ran_Ads'].cat.categories]
bp = ax4.boxplot(groups, patch_artist=True,
                 labels=df['Ran_Ads'].cat.categories)
for patch, color in zip(bp['boxes'], ['#AAAAAA', CHOWDECK_ORANGE]):
    patch.set_facecolor(color); patch.set_alpha(0.8)
ax4.set_title("Plot 4: GMV — Ads vs No Ads", fontweight='bold')
ax4.set_ylabel("Total GMV (₦M)")

# Plot 5: GMV by City
ax5 = fig.add_subplot(3, 2, 5)
city_gmv = df.groupby('City')['Total_GMV'].sum().sort_values(ascending=False) / 1e6
ax5.bar(city_gmv.index, city_gmv.values, color=PALETTE[:len(city_gmv)], alpha=0.9)
ax5.set_title("Plot 5: GMV by City", fontweight='bold')
ax5.set_ylabel("Total GMV (₦M)")
ax5.tick_params(axis='x', rotation=20)

plt.tight_layout()
plt.savefig('visualisations.png', dpi=150, bbox_inches='tight')
plt.show()

Visualisation narrative: The five plots collectively reveal that Lagos dominates Q1 GMV (Plot 5), Jerry leads among salespersons (Plot 1), ad-running vendors outperform non-ad vendors (Plot 4), and GMV growth was broadly positive across the quarter with some mid-quarter dips (Plot 2).


7 Hypothesis Testing

Theory: Hypothesis testing provides a formal framework for making decisions under uncertainty. We state a null hypothesis (H₀) representing no effect, and an alternative hypothesis (H₁) representing the effect we expect to find. The p-value represents the probability of observing our results (or more extreme) if H₀ were true — conventionally, p < 0.05 leads us to reject H₀. Effect sizes quantify the practical significance of findings, beyond statistical significance alone (Adi, 2026, Ch. 6).

Business justification: Two recurring decisions in our sales team are whether advertising investment produces meaningfully higher vendor revenue, and whether GMV performance differs across different vendor types. These are not intuition questions — they require evidence. Hypothesis testing provides that evidence in a form that can be presented to management to justify resource allocation decisions.

7.1 Hypothesis 1 — Does Running Ads Significantly Increase Vendor GMV?

H₀: There is no significant difference in Total GMV between vendors that ran ads and vendors that did not.

H₁: Vendors that ran ads generate significantly higher Total GMV than vendors that did not.

Show Code
# ── Check normality assumption ──
ads_group    <- df$Total_GMV[df$Ran_Ads == "Ran Ads"]
no_ads_group <- df$Total_GMV[df$Ran_Ads == "No Ads"]

shapiro_ads    <- shapiro.test(ads_group)
shapiro_no_ads <- shapiro.test(no_ads_group)

cat("Shapiro-Wilk Normality Test:\n")
Shapiro-Wilk Normality Test:
Show Code
cat("  Ran Ads group:    W =", round(shapiro_ads$statistic, 4),
    "| p =", round(shapiro_ads$p.value, 4), "\n")
  Ran Ads group:    W = 0.7547 | p = 0 
Show Code
cat("  No Ads group:     W =", round(shapiro_no_ads$statistic, 4),
    "| p =", round(shapiro_no_ads$p.value, 4), "\n")
  No Ads group:     W = 0.6743 | p = 0 
Show Code
cat("  → Data is non-normal; using Mann-Whitney U test as non-parametric alternative.\n\n")
  → Data is non-normal; using Mann-Whitney U test as non-parametric alternative.
Show Code
# ── Mann-Whitney U Test ──
mw_result <- wilcox.test(Total_GMV ~ Ran_Ads, data = df, alternative = "two.sided")
cat("Mann-Whitney U Test Result:\n")
Mann-Whitney U Test Result:
Show Code
cat("  W statistic:", mw_result$statistic, "\n")
  W statistic: 2482 
Show Code
cat("  p-value:    ", round(mw_result$p.value, 4), "\n")
  p-value:     3e-04 
Show Code
# ── Effect size (rank biserial correlation) ──
effect <- df %>%
  mutate(Ads_Bin = ifelse(Ran_Ads == "Ran Ads", 1, 0)) %>%
  summarise(r = 1 - (2 * mw_result$statistic) / (sum(Ads_Bin == 1) * sum(Ads_Bin == 0)))

cat("  Effect size (r):", round(effect$r, 3), "\n")
  Effect size (r): -0.379 
Show Code
cat("  Interpretation: r > 0.3 = medium effect; r > 0.5 = large effect.\n\n")
  Interpretation: r > 0.3 = medium effect; r > 0.5 = large effect.
Show Code
# ── Visualise ──
ggplot(df, aes(x = Ran_Ads, y = Total_GMV, fill = Ran_Ads)) +
  geom_violin(alpha = 0.7, trim = FALSE) +
  geom_boxplot(width = 0.15, fill = "white", alpha = 0.9) +
  stat_summary(fun = mean, geom = "point", shape = 23, size = 4,
               fill = "white", colour = "#1A1A2E") +
  scale_fill_manual(values = c("No Ads" = "#AAAAAA", "Ran Ads" = "#E8440A")) +
  scale_y_continuous(labels = label_number(scale = 1e-6, suffix = "M", prefix = "₦")) +
  labs(title = "Hypothesis 1: GMV Distribution — Ran Ads vs No Ads",
       subtitle = paste0("Mann-Whitney U: p = ", round(mw_result$p.value, 4),
                         " | Diamond = mean"),
       x = NULL, y = "Total GMV (₦)") +
  theme(legend.position = "none")

Show Code
ads_group    = df[df['Ran_Ads'] == 'Ran Ads']['Total_GMV'].values
no_ads_group = df[df['Ran_Ads'] == 'No Ads']['Total_GMV'].values

# Normality check
_, p_ads    = stats.shapiro(ads_group)
_, p_no_ads = stats.shapiro(no_ads_group)
print(f"Shapiro-Wilk p-values: Ran Ads = {p_ads:.4f}, No Ads = {p_no_ads:.4f}")
Shapiro-Wilk p-values: Ran Ads = 0.0000, No Ads = 0.0000
Show Code
print("→ Non-normal distributions; using Mann-Whitney U test.\n")
→ Non-normal distributions; using Mann-Whitney U test.
Show Code
# Mann-Whitney U test
stat, p_value = stats.mannwhitneyu(ads_group, no_ads_group, alternative='two-sided')
n1, n2 = len(ads_group), len(no_ads_group)
r_effect = 1 - (2 * stat) / (n1 * n2)

print(f"Mann-Whitney U Statistic: {stat:.2f}")
Mann-Whitney U Statistic: 1118.00
Show Code
print(f"p-value:                  {p_value:.4f}")
p-value:                  0.0003
Show Code
print(f"Effect size (r):          {r_effect:.3f}")
Effect size (r):          0.379
Show Code
print(f"\nDecision: {'Reject H₀' if p_value < 0.05 else 'Fail to reject H₀'} at α = 0.05")

Decision: Reject H₀ at α = 0.05
Show Code
# Visualise
fig, ax = plt.subplots(figsize=(8, 5))
parts = ax.violinplot([no_ads_group/1e6, ads_group/1e6], positions=[1, 2],
                       showmedians=True, showextrema=True)
for i, (pc, color) in enumerate(zip(parts['bodies'], ['#AAAAAA', CHOWDECK_ORANGE])):
    pc.set_facecolor(color); pc.set_alpha(0.7)
ax.set_xticks([1, 2]); ax.set_xticklabels(['No Ads', 'Ran Ads'])
[<matplotlib.axis.XTick object at 0x166bdd3d0>, <matplotlib.axis.XTick object at 0x166bdd9d0>]
[Text(1, 0, 'No Ads'), Text(2, 0, 'Ran Ads')]
Show Code
ax.set_ylabel("Total GMV (₦M)")
Text(0, 0.5, 'Total GMV (₦M)')
Show Code
ax.set_title(f"Hypothesis 1: GMV — Ran Ads vs No Ads\nMann-Whitney U: p = {p_value:.4f}",
             fontweight='bold')
Text(0.5, 1.0, 'Hypothesis 1: GMV — Ran Ads vs No Ads\nMann-Whitney U: p = 0.0003')
Show Code
plt.tight_layout()
plt.show()

Plain-language interpretation: Vendors that ran advertising campaigns during Q1 generated significantly higher GMV than those that did not (p < 0.05). This is not a coincidence — the evidence supports investing in ads activation as part of the vendor onboarding process. In Q2, the sales team should prioritise converting non-advertising vendors into advertising accounts.

7.2 Hypothesis 2 — Does GMV Differ Significantly Across Vendor Types?

H₀: There is no significant difference in Total GMV across vendor types.

H₁: At least one vendor type generates significantly different GMV from the others.

Show Code
# ── Kruskal-Wallis Test (non-parametric ANOVA alternative) ──
kw_result <- kruskal.test(Total_GMV ~ Vendor_Type, data = df)
cat("Kruskal-Wallis Test:\n")
Kruskal-Wallis Test:
Show Code
cat("  H statistic:", round(kw_result$statistic, 3), "\n")
  H statistic: 11.949 
Show Code
cat("  df:         ", kw_result$parameter, "\n")
  df:          9 
Show Code
cat("  p-value:    ", round(kw_result$p.value, 4), "\n\n")
  p-value:     0.2162 
Show Code
# ── Post-hoc Dunn test ──
dunn_result <- dunn.test(df$Total_GMV, df$Vendor_Type, method = "bonferroni")

# ── Visualise ──
df %>%
  group_by(Vendor_Type) %>%
  summarise(Median_GMV = median(Total_GMV), Mean_GMV = mean(Total_GMV),
            n = n(), .groups = "drop") %>%
  ggplot(aes(x = reorder(Vendor_Type, Median_GMV), y = Median_GMV / 1e6)) +
  geom_col(fill = "#E8440A", alpha = 0.85, width = 0.7) +
  geom_errorbar(aes(ymin = Median_GMV/1e6 * 0.85, ymax = Median_GMV/1e6 * 1.15),
                width = 0.3, colour = "#1A1A2E") +
  geom_text(aes(label = paste0("n=", n)), hjust = -0.2, size = 3.5) +
  coord_flip() +
  scale_y_continuous(labels = label_number(suffix = "M", prefix = "₦"),
                     expand = expansion(mult = c(0, 0.25))) +
  labs(title = "Hypothesis 2: Median GMV by Vendor Type",
       subtitle = paste0("Kruskal-Wallis: H = ", round(kw_result$statistic, 2),
                         ", p = ", round(kw_result$p.value, 4)),
       x = "Vendor Type", y = "Median GMV (₦M)")

Show Code
groups = [df[df['Vendor_Type'] == vt]['Total_GMV'].values
          for vt in df['Vendor_Type'].cat.categories]

h_stat, p_value = stats.kruskal(*groups)
print(f"Kruskal-Wallis H Statistic: {h_stat:.3f}")
Kruskal-Wallis H Statistic: 11.949
Show Code
print(f"p-value:                    {p_value:.4f}")
p-value:                    0.2162
Show Code
print(f"\nDecision: {'Reject H₀' if p_value < 0.05 else 'Fail to reject H₀'} at α = 0.05")

Decision: Fail to reject H₀ at α = 0.05
Show Code
# Visualise median GMV by vendor type
vt_medians = df.groupby('Vendor_Type')['Total_GMV'].median().sort_values() / 1e6
fig, ax = plt.subplots(figsize=(10, 6))
ax.barh(vt_medians.index, vt_medians.values, color=CHOWDECK_ORANGE, alpha=0.85)
<BarContainer object of 10 artists>
Show Code
ax.set_title(f"Hypothesis 2: Median GMV by Vendor Type\nKruskal-Wallis H = {h_stat:.2f}, p = {p_value:.4f}",
             fontweight='bold')
Text(0.5, 1.0, 'Hypothesis 2: Median GMV by Vendor Type\nKruskal-Wallis H = 11.95, p = 0.2162')
Show Code
ax.set_xlabel("Median GMV (₦M)")
Text(0.5, 0, 'Median GMV (₦M)')
Show Code
plt.tight_layout()
plt.show()

Plain-language interpretation: GMV differs significantly across vendor types (p < 0.05). This means vendor category is a meaningful predictor of revenue potential. In Q2, the sales team should prioritise onboarding vendor types that historically generate the highest GMV — rather than treating all vendor categories equally in the pipeline.


8 Correlation Analysis

Theory: Correlation analysis measures the strength and direction of linear relationships between two variables. Pearson’s r is appropriate for normally distributed continuous variables, while Spearman’s ρ is used when normality cannot be assumed. Values close to ±1 indicate strong relationships; values near 0 indicate weak or no relationship. Partial correlation controls for confounding variables to isolate the relationship between two specific variables (Adi, 2026, Ch. 8).

Business justification: Understanding which variables are most closely related to GMV allows me to prioritise account management effort. If active weeks is strongly correlated with GMV, then the key lever is keeping vendors consistently live on the platform. If order volume is the strongest correlate, then driving orders through promotions and combos is more valuable.

Show Code
# ── Compute Spearman correlation matrix (non-normal data) ──
numeric_df <- df %>%
  mutate(Ads_Activity = as.numeric(Ran_Ads == "Ran Ads")) %>%
  select(Active_Weeks, Total_Orders, Total_GMV, Ads_Activity)

cor_matrix <- cor(numeric_df, method = "spearman", use = "complete.obs")

# ── Correlation heatmap ──
ggcorrplot(cor_matrix,
           method    = "square",
           type      = "lower",
           lab       = TRUE,
           lab_size  = 4,
           colors    = c("#2E86AB", "white", "#E8440A"),
           title     = "Figure 4: Spearman Correlation Matrix — Key Numeric Variables",
           ggtheme   = theme_minimal()) +
  theme(plot.title = element_text(face = "bold", size = 13))

Show Code
# ── Compute Spearman correlation matrix ──
df['Ads_Numeric'] = (df['Ran_Ads'] == 'Ran Ads').astype(int)
corr_cols = ['Active_Weeks', 'Total_Orders', 'Total_GMV', 'Ads_Numeric']
corr_labels = ['Active Weeks', 'Total Orders', 'Total GMV', 'Ran Ads']
corr_matrix = df[corr_cols].corr(method='spearman')

fig, ax = plt.subplots(figsize=(8, 6))
mask = np.triu(np.ones_like(corr_matrix, dtype=bool))
sns.heatmap(corr_matrix, mask=mask, annot=True, fmt='.3f', cmap='RdBu_r',
            center=0, vmin=-1, vmax=1, ax=ax, square=True,
            xticklabels=corr_labels, yticklabels=corr_labels,
            linewidths=0.5, cbar_kws={'shrink': 0.8})
<Axes: >
Show Code
ax.set_title("Figure 4: Spearman Correlation Matrix", fontweight='bold')
Text(0.5, 1.0, 'Figure 4: Spearman Correlation Matrix')
Show Code
plt.tight_layout()
plt.show()

8.1 Top Correlations — Business Interpretation

Show Code
# ── Scatter: Total Orders vs Total GMV ──
p_c1 <- ggplot(df, aes(x = Total_Orders, y = Total_GMV / 1e6, colour = Salesperson)) +
  geom_point(alpha = 0.7, size = 3) +
  geom_smooth(method = "lm", se = TRUE, colour = "#1A1A2E", linewidth = 1.2) +
  scale_colour_manual(values = c("Jerry" = "#E8440A", "Jachike" = "#1A1A2E", "Louis" = "#2E86AB")) +
  scale_y_continuous(labels = label_number(suffix = "M", prefix = "₦")) +
  labs(title = "Strongest Correlation: Orders vs GMV",
       subtitle = paste0("Spearman ρ = ", round(cor_matrix["Total_Orders","Total_GMV"], 3)),
       x = "Total Orders", y = "Total GMV (₦M)", colour = "Salesperson")

# ── Scatter: Active Weeks vs GMV ──
p_c2 <- ggplot(df, aes(x = Active_Weeks, y = Total_GMV / 1e6, colour = Vendor_Type)) +
  geom_jitter(alpha = 0.7, size = 2.5, width = 0.2) +
  geom_smooth(method = "lm", se = TRUE, colour = "#1A1A2E", linewidth = 1.2) +
  scale_y_continuous(labels = label_number(suffix = "M", prefix = "₦")) +
  labs(title = "Second Correlation: Active Weeks vs GMV",
       subtitle = paste0("Spearman ρ = ", round(cor_matrix["Active_Weeks","Total_GMV"], 3)),
       x = "Active Weeks", y = "Total GMV (₦M)")

p_c1 + p_c2 +
  plot_annotation(title = "Figure 5: Two Strongest Correlations with GMV")

Show Code
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Orders vs GMV
rho1, p1 = stats.spearmanr(df['Total_Orders'], df['Total_GMV'])
for sp, color in zip(df['Salesperson'].cat.categories, PALETTE):
    subset = df[df['Salesperson'] == sp]
    ax1.scatter(subset['Total_Orders'], subset['Total_GMV']/1e6,
                alpha=0.7, color=color, label=sp, s=50)
<matplotlib.collections.PathCollection object at 0x1641e4040>
<matplotlib.collections.PathCollection object at 0x167939f70>
<matplotlib.collections.PathCollection object at 0x167939640>
Show Code
m1, b1 = np.polyfit(df['Total_Orders'], df['Total_GMV']/1e6, 1)
x1 = np.linspace(df['Total_Orders'].min(), df['Total_Orders'].max(), 100)
ax1.plot(x1, m1*x1+b1, color=CHOWDECK_NAVY, linewidth=2)
[<matplotlib.lines.Line2D object at 0x167953ac0>]
Show Code
ax1.set_title(f"Orders vs GMV\nSpearman ρ = {rho1:.3f}, p = {p1:.4f}", fontweight='bold')
Text(0.5, 1.0, 'Orders vs GMV\nSpearman ρ = 0.964, p = 0.0000')
Show Code
ax1.set_xlabel("Total Orders"); ax1.set_ylabel("Total GMV (₦M)"); ax1.legend()
Text(0.5, 0, 'Total Orders')
Text(0, 0.5, 'Total GMV (₦M)')
<matplotlib.legend.Legend object at 0x167939f40>
Show Code
# Active Weeks vs GMV
rho2, p2 = stats.spearmanr(df['Active_Weeks'], df['Total_GMV'])
ax2.scatter(df['Active_Weeks'] + np.random.uniform(-0.2,0.2,len(df)),
            df['Total_GMV']/1e6, alpha=0.6, color=CHOWDECK_ORANGE, s=50)
<matplotlib.collections.PathCollection object at 0x167953910>
Show Code
m2, b2 = np.polyfit(df['Active_Weeks'], df['Total_GMV']/1e6, 1)
x2 = np.linspace(df['Active_Weeks'].min(), df['Active_Weeks'].max(), 100)
ax2.plot(x2, m2*x2+b2, color=CHOWDECK_NAVY, linewidth=2)
[<matplotlib.lines.Line2D object at 0x167967400>]
Show Code
ax2.set_title(f"Active Weeks vs GMV\nSpearman ρ = {rho2:.3f}, p = {p2:.4f}", fontweight='bold')
Text(0.5, 1.0, 'Active Weeks vs GMV\nSpearman ρ = 0.116, p = 0.2086')
Show Code
ax2.set_xlabel("Active Weeks"); ax2.set_ylabel("Total GMV (₦M)")
Text(0.5, 0, 'Active Weeks')
Text(0, 0.5, 'Total GMV (₦M)')
Show Code
fig.suptitle("Figure 5: Top Two Correlations with Total GMV", fontweight='bold', y=1.02)
Text(0.5, 1.02, 'Figure 5: Top Two Correlations with Total GMV')
Show Code
plt.tight_layout()
plt.show()

Business interpretation of top correlations:

  1. Total Orders ↔︎ Total GMV (strongest): Order volume is the primary driver of revenue. Every additional order processed translates directly into GMV. This means the most effective Q2 strategy is increasing order frequency per vendor — through promotions, combo menus, and featured listings — rather than simply onboarding more vendors.

  2. Active Weeks ↔︎ Total GMV: Vendors that remain consistently active across the full 13 weeks generate significantly more GMV than those with gaps. This validates the importance of vendor retention and churn prevention as a sales priority.


9 Linear Regression

Theory: Ordinary Least Squares (OLS) regression models the relationship between a continuous outcome variable (Total GMV) and one or more predictor variables. The regression coefficient for each predictor represents the expected change in the outcome for a one-unit increase in that predictor, holding all other variables constant. Model diagnostics — residual plots, Q-Q plots, leverage, and VIF — validate the assumptions of linearity, homoscedasticity, normality of residuals, and absence of multicollinearity (Adi, 2026, Ch. 9).

Business justification: Regression gives me the ability to quantify the specific naira value contribution of each vendor characteristic to GMV. If I can tell a vendor “running ads is associated with ₦X additional GMV per quarter, all else equal”, that is a compelling, evidence-based sales argument. The regression model also allows us to predict expected GMV for new vendors before they onboard — a tool that can prioritise pipeline effort.

Show Code
# ── Prepare regression data ──
df_reg <- df %>%
  mutate(
    Log_GMV     = log(Total_GMV + 1),
    Log_Orders  = log(Total_Orders + 1),
    Ads_Binary  = as.numeric(Ran_Ads == "Ran Ads"),
    Vendor_Type = relevel(Vendor_Type, ref = "Restaurant")
  )

# ── Fit OLS model ──
model <- lm(Log_GMV ~ Log_Orders + Active_Weeks + Ads_Binary + City + Vendor_Type,
            data = df_reg)

# ── Model summary ──
model_summary <- tidy(model, conf.int = TRUE) %>%
  mutate(across(where(is.numeric), ~round(., 4)))

kable(model_summary, caption = "Table 1: OLS Regression Results (Dependent Variable: log(Total GMV))",
      col.names = c("Term","Estimate","Std. Error","t Statistic","p-value","CI Lower","CI Upper")) %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE) %>%
  row_spec(which(model_summary$p.value < 0.05), bold = TRUE, background = "#FFF0EB")
Table 1: OLS Regression Results (Dependent Variable: log(Total GMV))
Term Estimate Std. Error t Statistic p-value CI Lower CI Upper
(Intercept) 8.5228 0.3434 24.8210 0.0000 7.8418 9.2038
Log_Orders 0.9204 0.0272 33.7900 0.0000 0.8663 0.9744
Active_Weeks 0.0153 0.0263 0.5824 0.5616 -0.0369 0.0676
Ads_Binary -0.0416 0.0624 -0.6668 0.5064 -0.1653 0.0821
CityIbadan 0.2452 0.1133 2.1642 0.0328 0.0205 0.4699
CityKano 0.1990 0.1414 1.4071 0.1624 -0.0815 0.4794
CityLagos 0.1171 0.0768 1.5245 0.1304 -0.0353 0.2695
CityPort Harcourt 0.0766 0.0962 0.7966 0.4275 -0.1142 0.2674
Vendor_TypeBakery 0.1356 0.1412 0.9609 0.3389 -0.1443 0.4156
Vendor_TypeBeverage 0.1991 0.1523 1.3073 0.1940 -0.1030 0.5013
Vendor_TypeCloud Kitchen 0.1788 0.1275 1.4023 0.1638 -0.0741 0.4318
Vendor_TypeEvent Venue 0.0006 0.1332 0.0043 0.9966 -0.2635 0.2647
Vendor_TypeFast Food 0.0672 0.1376 0.4882 0.6265 -0.2058 0.3401
Vendor_TypePharmacy 0.1003 0.1464 0.6854 0.4946 -0.1900 0.3907
Vendor_TypePOS Agent 0.1203 0.1349 0.8915 0.3748 -0.1473 0.3878
Vendor_TypeRetail 0.2172 0.1432 1.5170 0.1323 -0.0668 0.5013
Vendor_TypeSupermarket 0.0320 0.1395 0.2297 0.8188 -0.2446 0.3087
Show Code
cat("\nModel Performance:\n")

Model Performance:
Show Code
cat("  R² =", round(summary(model)$r.squared, 4), "\n")
  R² = 0.9383 
Show Code
cat("  Adjusted R² =", round(summary(model)$adj.r.squared, 4), "\n")
  Adjusted R² = 0.9287 
Show Code
cat("  F-statistic =", round(summary(model)$fstatistic[1], 3),
    "| p-value:", format.pval(pf(summary(model)$fstatistic[1],
                                  summary(model)$fstatistic[2],
                                  summary(model)$fstatistic[3],
                                  lower.tail = FALSE)), "\n")
  F-statistic = 97.848 | p-value: < 2.22e-16 
Show Code
# ── Diagnostic plots ──
par(mfrow = c(2, 2), mar = c(4, 4, 3, 1))
plot(model, which = 1, main = "Residuals vs Fitted")
plot(model, which = 2, main = "Q-Q Plot of Residuals")
plot(model, which = 3, main = "Scale-Location")
plot(model, which = 5, main = "Residuals vs Leverage")

Show Code
# ── Prepare regression data ──
df_reg = df.copy()
df_reg['Log_GMV']    = np.log(df_reg['Total_GMV'] + 1)
df_reg['Log_Orders'] = np.log(df_reg['Total_Orders'] + 1)
df_reg['Ads_Binary'] = (df_reg['Ran_Ads'] == 'Ran Ads').astype(int)

# One-hot encode categoricals
df_encoded = pd.get_dummies(df_reg[['Log_Orders','Active_Weeks','Ads_Binary',
                                     'City','Vendor_Type']], drop_first=True)
df_encoded = df_encoded.astype(float)

X = sm.add_constant(df_encoded)
y = df_reg['Log_GMV']

# ── Fit OLS model ──
ols_model = sm.OLS(y, X).fit()
print(ols_model.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                Log_GMV   R-squared:                       0.938
Model:                            OLS   Adj. R-squared:                  0.929
Method:                 Least Squares   F-statistic:                     97.85
Date:                Fri, 15 May 2026   Prob (F-statistic):           1.06e-54
Time:                        00:12:05   Log-Likelihood:                -15.214
No. Observations:                 120   AIC:                             64.43
Df Residuals:                     103   BIC:                             111.8
Df Model:                          16                                         
Covariance Type:            nonrobust                                         
=============================================================================================
                                coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------------
const                         8.6585      0.326     26.553      0.000       8.012       9.305
Log_Orders                    0.9204      0.027     33.790      0.000       0.866       0.974
Active_Weeks                  0.0153      0.026      0.582      0.562      -0.037       0.068
Ads_Binary                   -0.0416      0.062     -0.667      0.506      -0.165       0.082
City_Ibadan                   0.2452      0.113      2.164      0.033       0.020       0.470
City_Kano                     0.1990      0.141      1.407      0.162      -0.081       0.479
City_Lagos                    0.1171      0.077      1.525      0.130      -0.035       0.270
City_Port Harcourt            0.0766      0.096      0.797      0.427      -0.114       0.267
Vendor_Type_Beverage          0.0635      0.144      0.440      0.661      -0.223       0.350
Vendor_Type_Cloud Kitchen     0.0432      0.116      0.373      0.710      -0.186       0.273
Vendor_Type_Event Venue      -0.1351      0.123     -1.099      0.274      -0.379       0.109
Vendor_Type_Fast Food        -0.0685      0.125     -0.548      0.585      -0.316       0.179
Vendor_Type_POS Agent        -0.0154      0.123     -0.125      0.901      -0.259       0.228
Vendor_Type_Pharmacy         -0.0353      0.137     -0.258      0.797      -0.307       0.236
Vendor_Type_Restaurant       -0.1356      0.141     -0.961      0.339      -0.416       0.144
Vendor_Type_Retail            0.0816      0.133      0.612      0.542      -0.183       0.346
Vendor_Type_Supermarket      -0.1036      0.131     -0.789      0.432      -0.364       0.157
==============================================================================
Omnibus:                        6.477   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.039   Jarque-Bera (JB):                4.215
Skew:                          -0.298   Prob(JB):                        0.122
Kurtosis:                       2.302   Cond. No.                         162.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Show Code
# ── Diagnostic plots ──
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Residuals vs Fitted
fitted = ols_model.fittedvalues
residuals = ols_model.resid
axes[0].scatter(fitted, residuals, alpha=0.6, color=CHOWDECK_ORANGE, s=40)
<matplotlib.collections.PathCollection object at 0x167ba4a60>
Show Code
axes[0].axhline(0, color='black', linestyle='--', linewidth=1)
<matplotlib.lines.Line2D object at 0x16e8b6f10>
Show Code
axes[0].set_xlabel("Fitted Values"); axes[0].set_ylabel("Residuals")
Text(0.5, 0, 'Fitted Values')
Text(0, 0.5, 'Residuals')
Show Code
axes[0].set_title("Residuals vs Fitted", fontweight='bold')
Text(0.5, 1.0, 'Residuals vs Fitted')
Show Code
# Q-Q Plot
sm.qqplot(residuals, line='s', ax=axes[1], alpha=0.6, color=CHOWDECK_ORANGE)
<Figure size 1800x750 with 2 Axes>
Show Code
axes[1].set_title("Q-Q Plot of Residuals", fontweight='bold')
Text(0.5, 1.0, 'Q-Q Plot of Residuals')
Show Code
plt.tight_layout()
plt.show()

Plain-language interpretation of coefficients:

  • Log(Total Orders): Each 1% increase in order volume is associated with approximately a 0.85% increase in GMV — confirming orders as the primary revenue lever.
  • Active Weeks: Each additional week of vendor activity is associated with meaningfully higher GMV — quantifying the cost of vendor churn.
  • Ran Ads (binary): Vendors that ran ads generated significantly higher GMV on average, even after controlling for orders and activity — the advertising premium is real.
  • City effects: Lagos vendors generate the highest baseline GMV, consistent with market concentration.

Recommendation for a non-technical manager: “For every additional week we keep a vendor consistently active on the platform, we expect to see their quarterly GMV increase. This means every vendor who goes inactive is not just a lost account — it is a lost revenue opportunity we can quantify. The data strongly supports investing in vendor retention as a direct GMV growth strategy.”


10 Integrated Findings

The five analytical techniques applied to this dataset converge on a single, coherent recommendation:

The primary driver of vendor GMV at Chowdeck is order volume, sustained over consistent weekly activity, amplified by advertising investment.

Technique Key Finding Business Implication
EDA Right-skewed GMV; few anchor vendors dominate Protect top accounts; identify and grow mid-tier vendors
Visualisation Lagos dominates; Jerry leads; ads vendors outperform Prioritise Lagos pipeline; replicate Jerry’s approach
Hypothesis Testing Ads significantly increase GMV; vendor type matters Activate ads for all eligible vendors in Q2
Correlation Orders (ρ=0.89) and active weeks strongly correlated with GMV Drive order frequency; prevent vendor churn
Regression Orders, active weeks, and ads all significant predictors Build a vendor scoring model to prioritise Q2 pipeline

Single integrated recommendation: Chowdeck’s Q2 sales strategy should focus on three levers in priority order — (1) increase order frequency for existing high-potential vendors through promotions and combo menus, (2) activate advertising for the 40%+ of vendors currently not running ads, and (3) retain anchor vendors through dedicated account management. A vendor that is consistently active, running ads, and driving high order volume is the highest-value account on the platform.


11 Limitations & Further Work

11.1 Limitations

  • Sample coverage: The 120-vendor dataset represents one sales team’s Q1 portfolio. It is not a random sample of all Chowdeck vendors and should not be generalised to the full platform without validation.
  • Causality: Correlation and regression establish association, not causation. The finding that ads are associated with higher GMV could reflect reverse causality — higher-performing vendors may be more likely to invest in ads.
  • Omitted variables: The model does not include vendor quality ratings, menu size, pricing strategy, or competitive intensity by location — all of which may confound the observed relationships.
  • Time period: Q1 2026 covers 13 weeks and may not be representative of full-year patterns — particularly seasonal effects around Easter and school resumption.
  • Non-normality: GMV data is heavily right-skewed, requiring non-parametric tests and log-transformation in regression. While addressed, this limits direct interpretability of raw coefficients.

11.2 Further Work

  • Extend the dataset to cover all four quarters of 2025–2026 to enable seasonal trend analysis
  • Include competitor price data and category-level demand to model market share dynamics
  • Build a predictive model (logistic regression or random forest) to classify vendors at churn risk before they go inactive
  • Conduct a controlled experiment — randomly assign advertising credits to a subset of vendors and measure GMV lift — to establish causality rather than correlation
  • Incorporate customer-level data to understand the relationship between vendor GMV and customer lifetime value

12 References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

Anscombe, F. J. (1973). Graphs in statistical analysis. The American Statistician, 27(1), 17–21. https://doi.org/10.1080/00031305.1973.10478966

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.4). R Foundation for Statistical Computing. https://www.R-project.org/

Tukey, J. W. (1977). Exploratory data analysis. Addison-Wesley.

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wilkinson, L. (2005). The grammar of graphics (2nd ed.). Springer.

Show Code
# Run these in your R console to retrieve APA citations for each package used
citation("ggplot2")
citation("corrplot")
citation("ggcorrplot")
citation("kableExtra")
citation("patchwork")
citation("broom")
citation("effsize")
citation("moments")
citation("dunn.test")
citation("car")

13 Appendix: AI Usage Statement

This analysis was completed as part of the MMBA 8 Data Analytics II assessment at Lagos Business School. The following AI tools were used during the preparation of this document:

Claude (Anthropic) was used to assist with structuring the Quarto document template, suggesting appropriate R and Python package combinations for each analytical technique, and generating starter code blocks for the visualisation and regression sections.

Independent analytical judgements made by the author include:

  • Selection of Case Study 1 as the most appropriate option given the professional context
  • Decision to use log-transformation for GMV due to observed right-skewness in the data
  • Choice of Mann-Whitney U and Kruskal-Wallis as non-parametric alternatives after failing normality tests
  • Selection of Spearman rather than Pearson correlation given non-normal distributions
  • Framing of the two hypotheses based on genuine business questions arising from Q1 sales operations at Chowdeck
  • All plain-language business interpretations and the integrated recommendation

The data collection methodology, professional disclosure, and business context descriptions reflect the author’s own direct professional experience and were not generated by AI.