Sales Performance & Pricing Intelligence at Penafort Winery — An Exploratory & Inferential Analysis

Author

Ogonna Joseph Ogbodo

Published

May 8, 2026

GitHub Repository: <https://github.com/codejod/penafort-winery-analysis.git>


1. Executive Summary

Penafort Winery is a premium wine distribution business operating in Nigeria, importing and selling a curated portfolio of Spanish wines to a growing base of retail and hospitality customers. This analysis draws on 100 sales transactions recorded between September 2024 and January 2025, covering 43 unique customers and 26 distinct wine products across two pricing regimes (Pre-Increase and Post-Increase). The five analytical techniques applied Exploratory Data Analysis, Data Visualization, Hypothesis Testing, Correlation Analysis, and Linear Regression collectively reveal that the October–December period is the peak revenue window, that post-increase unit prices generate significantly higher transaction values without a proportional drop in quantity, and that transaction value is strongly predictable from quantity and unit price alone. The overarching recommendation is that Penafort should strategically accelerate post-increase catalogue adoption, focus retention efforts on the top-10 repeat customers who account for the bulk of revenue, and leverage the December peak with targeted promotions on premium SKUs.


2. Professional Disclosure

Job Title / Role: Marketing Manager, Penafort Winery Nigeria
Organisation Type: Wine importation and distribution SME, Lagos, Nigeria
Sector: Fast-Moving Consumer Goods (FMCG) — Alcoholic Beverages / Premium Retail

Technique Justification

Technique Operational Relevance
EDA Before any pricing decision is made, management needs to understand the shape of the sales distribution which products move, which customers dominate, and where anomalies exist in the transaction log.
Visualisation Penafort’s management team is non-technical; visual storytelling translates raw invoice data into actionable charts that can be shared in weekly sales reviews.
Hypothesis Testing A key business question is whether the price increase rolled out in late 2024 genuinely changed transaction values or whether any difference is attributable to chance. A formal t-test answers this with statistical rigor.
Correlation Analysis Understanding whether higher unit prices suppress quantity purchased, or whether discounts meaningfully lift total transaction value, directly informs promotional and pricing strategy.
Linear Regression Penafort’s owner needs a simple model to forecast invoice value from quantity and price — enabling proactive revenue planning and sales rep target-setting.

3. Data Collection & Sampling

3.1 Source & Collection Method

The primary data set was extracted directly from Penafort Winery’s internal sales invoice register, maintained in Microsoft Excel by the sales administration team. Every completed and paid transaction between 1 September 2024 and 31 January 2025 was included this is a census of all invoices over the study period, not a random sample.

A supplementary monthly expenses data set (6 months of aggregated operational cost records) was extracted from the accounts ledger to support cost-side interpretation.

3.2 Sampling Frame

Parameter Detail
Population All sales transactions issued by Penafort Winery
Period September 2024 – January 2025 (5 months)
Sample size 100 transaction line items across 43 unique customers
Sampling method Census (total enumeration, no random sampling)
Geography Lagos, Nigeria (primary market)

3.3 Variables

Variable Type Description
transaction_date Date Date the invoice was issued
invoice_number Categorical Unique invoice identifier
customer Categorical Anonymized customer code
product_name Categorical Wine product name and vintage label
quantity Numeric (integer) Number of bottles/units sold
unit_price Numeric (₦) Price per unit at time of sale
pricing_tier Categorical (binary) Pre-Increase vs Post-Increase pricing regime
discount Numeric (₦) Discount value applied, if any
total_amount Numeric (₦) Final invoice line value

3.4 Ethical Statement

All customer identifiers in the data set have been anonymized (Customer_001, Customer_002, etc.). No personally identifiable information (PII) is published. The data was collected in the ordinary course of business operations. No external ethical clearance was required as the analyst is an employee of the organisation. The organisation has granted verbal permission for this academic submission; data is available on request from the author.


4. Data Description & EDA

R

Code
library(tidyverse)
library(lubridate)
library(scales)
library(knitr)
library(kableExtra)
library(corrplot)
library(ggthemes)

# ── Load & clean ──────────────────────────────────────────────────────────────
sales_raw <- read_csv("penafort_sales.csv", locale = locale(encoding = "UTF-8"))

# Strip currency symbols and convert to numeric
clean_currency <- function(x) {
  x %>%
    str_remove_all("[₦,]") %>%
    str_trim() %>%
    as.numeric()
}

sales <- sales_raw %>%
  rename_with(str_trim) %>%
  mutate(
    transaction_date = mdy(transaction_date),
    unit_price   = clean_currency(unit_price),
    discount     = clean_currency(discount),
    total_amount = clean_currency(total_amount),
    product_name = str_trim(product_name),
    # Standardise product names by stripping month suffix
    product_base = str_remove(product_name, "\\s+(sept|oct|Sep|Oct).*$") %>% str_trim(),
    month        = floor_date(transaction_date, "month"),
    has_discount = discount > 0
  ) %>%
  # Fix the single data entry error: one row coded as 2014 instead of 2024
  mutate(transaction_date = case_when(
    year(transaction_date) == 2014 ~ transaction_date + years(10),
    TRUE ~ transaction_date
  )) %>%
  mutate(month = floor_date(transaction_date, "month"))

cat("✔ Rows:", nrow(sales), "\n")
✔ Rows: 100 
Code
cat("✔ Date range:", format(min(sales$transaction_date)), "to", format(max(sales$transaction_date)), "\n")
✔ Date range: 2024-09-01 to 2025-01-18 
Code
cat("✔ Unique customers:", n_distinct(sales$customer), "\n")
✔ Unique customers: 43 
Code
cat("✔ Unique products (standardised):", n_distinct(sales$product_base), "\n")
✔ Unique products (standardised): 20 
Code
cat("✔ Pricing tiers:", paste(unique(sales$pricing_tier), collapse = " | "), "\n")
✔ Pricing tiers: Post-Increase | Pre-Increase 
Code
# ── Summary Statistics ────────────────────────────────────────────────────────
summary_tbl <- sales %>%
  summarise(
    n            = n(),
    mean_qty     = mean(quantity),
    sd_qty       = sd(quantity),
    mean_price   = mean(unit_price),
    mean_amount  = mean(total_amount),
    sd_amount    = sd(total_amount),
    median_amount = median(total_amount),
    min_amount   = min(total_amount),
    max_amount   = max(total_amount)
  )

summary_tbl %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Value = round(Value, 2)) %>%
  kable(caption = "Table 1 — Key Summary Statistics for Penafort Sales Data") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)
Table 1 — Key Summary Statistics for Penafort Sales Data
Statistic Value
n 100.00
mean_qty 2.95
sd_qty 3.93
mean_price 43932.33
mean_amount 120412.06
sd_amount 155072.12
median_amount 70000.00
min_amount 26700.00
max_amount 880000.00
Code
# ── Data Quality Issues ───────────────────────────────────────────────────────
cat("=== DATA QUALITY REPORT ===\n")
=== DATA QUALITY REPORT ===
Code
# Issue 1: Missing values
missing <- colSums(is.na(sales))
cat("\nIssue 1 — Missing values per column:\n")

Issue 1 — Missing values per column:
Code
print(missing[missing > 0])
named numeric(0)
Code
if (sum(missing) == 0) cat("  ✔ No missing values detected.\n")
  ✔ No missing values detected.
Code
# Issue 2: Duplicate invoice lines
dup_check <- sales %>% count(invoice_number, product_name) %>% filter(n > 1)
cat("\nIssue 2 — Duplicate invoice+product combinations:", nrow(dup_check), "\n")

Issue 2 — Duplicate invoice+product combinations: 1 
Code
# Issue 3: Outliers in total_amount (IQR method)
Q1 <- quantile(sales$total_amount, 0.25)
Q3 <- quantile(sales$total_amount, 0.75)
IQR_val <- Q3 - Q1
outliers <- sales %>% filter(total_amount < Q1 - 1.5*IQR_val | total_amount > Q3 + 1.5*IQR_val)
cat("\nIssue 3 — Outlier transactions (IQR method):", nrow(outliers), "\n")

Issue 3 — Outlier transactions (IQR method): 7 
Code
outliers %>% select(transaction_date, customer, product_name, quantity, total_amount) %>%
  kable(caption = "Outlier Transactions") %>% kable_styling(bootstrap_options = "striped")
Outlier Transactions
transaction_date customer product_name quantity total_amount
2024-09-19 customer_010 Rondel cava Oro Brut sept 10 350000
2024-10-01 customer_019 Rondel cava Oro Brut sept 20 700000
2024-10-06 customer_009 Rondel cava Oro Brut sept 20 698000
2024-11-19 customer_011 Rondel cava Oro Brut oct 7 308000
2024-11-27 customer_053 Rondel cava Oro Brut oct 20 880000
2024-12-09 customer_063 Rondel cava Oro Brut oct 20 880000
2024-12-19 customer_063 Rondel cava Oro Brut oct 8 352000

Python

Code
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")

# ── Load & clean ──────────────────────────────────────────────────────────────
sales_raw = pd.read_csv("penafort_sales.csv", encoding="utf-8-sig")

def clean_currency(series):
    return (series.astype(str)
            .str.replace("₦", "", regex=False)
            .str.replace(",", "", regex=False)
            .str.strip()
            .astype(float))

sales_py = sales_raw.copy()
sales_py.columns = sales_py.columns.str.strip()
sales_py["unit_price"]   = clean_currency(sales_py["unit_price"])
sales_py["discount"]     = clean_currency(sales_py["discount"])
sales_py["total_amount"] = clean_currency(sales_py["total_amount"])
sales_py["product_name"] = sales_py["product_name"].str.strip()
sales_py["product_base"] = (sales_py["product_name"]
                            .str.replace(r"\s+(sept|oct|Sep|Oct).*$", "", regex=True)
                            .str.strip())
sales_py["transaction_date"] = pd.to_datetime(sales_py["transaction_date"], format="%m/%d/%Y", errors="coerce")
# Fix 2014 data-entry error
sales_py.loc[sales_py["transaction_date"].dt.year == 2014, "transaction_date"] += pd.DateOffset(years=10)
sales_py["month"]        = sales_py["transaction_date"].dt.to_period("M")
sales_py["has_discount"] = sales_py["discount"] > 0

print(f"✔ Rows: {len(sales_py)}")
✔ Rows: 100
Code
print(f"✔ Date range: {sales_py['transaction_date'].min().date()} to {sales_py['transaction_date'].max().date()}")
✔ Date range: 2024-09-01 to 2025-01-18
Code
print(f"✔ Unique customers: {sales_py['customer'].nunique()}")
✔ Unique customers: 43
Code
print(f"✔ Unique products (base): {sales_py['product_base'].nunique()}")
✔ Unique products (base): 20
Code
print(f"✔ Pricing tiers: {sales_py['pricing_tier'].unique()}")
✔ Pricing tiers: <StringArray>
['Post-Increase', 'Pre-Increase']
Length: 2, dtype: str
Code
print()
Code
print(sales_py[["quantity","unit_price","discount","total_amount"]].describe().round(2))
       quantity  unit_price  discount  total_amount
count    100.00      100.00    100.00        100.00
mean       2.95    43932.33    460.00     120412.06
std        3.93    17102.04   2012.19     155072.12
min        1.00    31000.00      0.00      26700.00
25%        1.00    35000.00      0.00      44000.00
50%        2.00    44000.00      0.00      70000.00
75%        3.00    44000.00      0.00     132000.00
max       20.00   191500.00  15000.00     880000.00

5. Data Visualisation

Code
# ── Plot 1: Monthly Revenue Trend ─────────────────────────────────────────────
p1 <- sales %>%
  group_by(month) %>%
  summarise(revenue = sum(total_amount), transactions = n()) %>%
  ggplot(aes(x = month, y = revenue)) +
  geom_col(fill = "#7B3F6E", width = 22) +
  geom_text(aes(label = comma(revenue, prefix = "₦", scale = 1e-3, suffix = "K")),
            vjust = -0.4, size = 3.2, fontface = "bold") +
  scale_y_continuous(labels = comma_format(prefix = "₦", scale = 1e-6, suffix = "M")) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  labs(title = "Plot 1 — Monthly Revenue: September 2024 – January 2025",
       subtitle = "December 2024 was the highest-revenue month at ₦3.06 M",
       x = NULL, y = "Total Revenue (₦M)") +
  theme_clean() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

print(p1)

Code
# ── Plot 2: Pricing Tier Comparison (Boxplot) ─────────────────────────────────
p2 <- ggplot(sales, aes(x = pricing_tier, y = total_amount, fill = pricing_tier)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "red", outlier.shape = 1) +
  geom_jitter(width = 0.15, alpha = 0.3, size = 1.5) +
  scale_y_continuous(labels = comma_format(prefix = "₦")) +
  scale_fill_manual(values = c("Pre-Increase" = "#5B9BD5", "Post-Increase" = "#ED7D31")) +
  labs(title = "Plot 2 — Transaction Value by Pricing Tier",
       subtitle = "Post-Increase transactions are higher in both median and spread",
       x = "Pricing Tier", y = "Transaction Value (₦)", fill = NULL) +
  theme_clean() +
  theme(legend.position = "none")

print(p2)

Code
# ── Plot 3: Top 10 Products by Revenue ───────────────────────────────────────
p3 <- sales %>%
  group_by(product_base) %>%
  summarise(revenue = sum(total_amount)) %>%
  slice_max(revenue, n = 10) %>%
  ggplot(aes(x = reorder(product_base, revenue), y = revenue)) +
  geom_col(fill = "#1F497D") +
  coord_flip() +
  scale_y_continuous(labels = comma_format(prefix = "₦", scale = 1e-3, suffix = "K")) +
  labs(title = "Plot 3 — Top 10 Products by Total Revenue",
       subtitle = "Rondel Cava Oro Brut dominates the portfolio",
       x = NULL, y = "Revenue (₦K)") +
  theme_clean()

print(p3)

Code
# ── Plot 4: Quantity Distribution Histogram ───────────────────────────────────
p4 <- ggplot(sales, aes(x = quantity)) +
  geom_histogram(binwidth = 1, fill = "#70AD47", colour = "white") +
  geom_vline(xintercept = mean(sales$quantity), linetype = "dashed", colour = "red", size = 0.8) +
  annotate("text", x = mean(sales$quantity) + 0.5, y = 25,
           label = paste0("Mean = ", round(mean(sales$quantity),1)), colour = "red", size = 3.5) +
  labs(title = "Plot 4 — Distribution of Quantity per Transaction",
       subtitle = "Most orders are for 1–3 bottles; a right tail of bulk orders exists",
       x = "Quantity (Bottles)", y = "Count") +
  theme_clean()

print(p4)

Code
# ── Plot 5: Top 10 Customers by Revenue ──────────────────────────────────────
p5 <- sales %>%
  group_by(customer) %>%
  summarise(revenue = sum(total_amount), txns = n()) %>%
  slice_max(revenue, n = 10) %>%
  ggplot(aes(x = reorder(customer, revenue), y = revenue, fill = txns)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma_format(prefix = "₦", scale = 1e-3, suffix = "K")) +
  scale_fill_gradient(low = "#BDD7EE", high = "#1F497D") +
  labs(title = "Plot 5 — Top 10 Customers by Revenue",
       subtitle = "Customer concentration: top customer contributes >₦1.1M",
       x = NULL, y = "Revenue (₦K)", fill = "# Transactions") +
  theme_clean()

print(p5)

Visualization Narrative: The five plots together tell one story, Penafort’s revenue is concentrated in three dimensions: time (Q4 peak, especially December), product (Rondel Cava dominates), and customer (top 10 customers account for a disproportionate share of revenue). The pricing tier shift has elevated transaction values, and while most orders are small (1–3 bottles), a handful of bulk orders from key customers generate outsized revenue. These patterns directly inform where management attention should be directed.


6. Hypothesis Testing

6.1 Test 1 — Does the Price Increase Significantly Raise Transaction Values?

Business Question: Did the formal pricing increase that Penafort implemented lead to a statistically significant rise in the average value of sales transactions?

Hypotheses: - H₀: The mean transaction value for Pre-Increase and Post-Increase periods is equal (μ_pre = μ_post) - H₁: The mean transaction value is higher under Post-Increase pricing (μ_post > μ_pre)

Code
library(effsize)

pre  <- sales %>% filter(pricing_tier == "Pre-Increase") %>% pull(total_amount)
post <- sales %>% filter(pricing_tier == "Post-Increase") %>% pull(total_amount)

# Assumption check: normality (Shapiro-Wilk)
shap_pre  <- shapiro.test(pre)
shap_post <- shapiro.test(post)

cat("Shapiro-Wilk — Pre-Increase:  W =", round(shap_pre$statistic,4),
    ", p =", round(shap_pre$p.value,4), "\n")
Shapiro-Wilk — Pre-Increase:  W = 0.5363 , p = 0 
Code
cat("Shapiro-Wilk — Post-Increase: W =", round(shap_post$statistic,4),
    ", p =", round(shap_post$p.value,4), "\n\n")
Shapiro-Wilk — Post-Increase: W = 0.543 , p = 0 
Code
# Since both distributions are likely non-normal (right-skewed revenue data),
# use Welch's t-test (robust to unequal variances) but also report Wilcoxon
t_result <- t.test(post, pre, alternative = "greater", var.equal = FALSE)
w_result <- wilcox.test(post, pre, alternative = "greater")
d_result <- cohen.d(post, pre)

cat("=== Welch's One-Sided t-Test ===\n")
=== Welch's One-Sided t-Test ===
Code
cat("t =", round(t_result$statistic,3), "\n")
t = 1.145 
Code
cat("df =", round(t_result$parameter,1), "\n")
df = 72.7 
Code
cat("p-value =", round(t_result$p.value,4), "\n\n")
p-value = 0.1279 
Code
cat("=== Wilcoxon Rank-Sum (non-parametric backup) ===\n")
=== Wilcoxon Rank-Sum (non-parametric backup) ===
Code
cat("W =", w_result$statistic, ", p-value =", round(w_result$p.value,4), "\n\n")
W = 1511.5 , p-value = 0.0232 
Code
cat("=== Effect Size (Cohen's d) ===\n")
=== Effect Size (Cohen's d) ===
Code
cat("d =", round(d_result$estimate,3), " —", d_result$magnitude, "effect\n")
d = 0.242  — 2 effect
Code
cat("\n--- Descriptive Stats ---\n")

--- Descriptive Stats ---
Code
cat("Pre-Increase:  n=", length(pre),  ", mean=₦", format(round(mean(pre)),big.mark=","), "\n")
Pre-Increase:  n= 57 , mean=₦ 104,302 
Code
cat("Post-Increase: n=", length(post), ", mean=₦", format(round(mean(post)),big.mark=","), "\n")
Post-Increase: n= 43 , mean=₦ 141,767 
Code
from scipy.stats import shapiro, ttest_ind, mannwhitneyu
from numpy import mean

pre_py  = sales_py[sales_py["pricing_tier"] == "Pre-Increase"]["total_amount"]
post_py = sales_py[sales_py["pricing_tier"] == "Post-Increase"]["total_amount"]

shap_pre  = shapiro(pre_py)
shap_post = shapiro(post_py)
print(f"Shapiro-Wilk Pre-Increase:  W={shap_pre.statistic:.4f}, p={shap_pre.pvalue:.4f}")
Shapiro-Wilk Pre-Increase:  W=0.5363, p=0.0000
Code
print(f"Shapiro-Wilk Post-Increase: W={shap_post.statistic:.4f}, p={shap_post.pvalue:.4f}")
Shapiro-Wilk Post-Increase: W=0.5430, p=0.0000
Code
t_stat, p_val = ttest_ind(post_py, pre_py, alternative="greater", equal_var=False)
print(f"\nWelch t-test: t={t_stat:.3f}, p={p_val:.4f}")

Welch t-test: t=1.145, p=0.1279
Code
u_stat, p_u = mannwhitneyu(post_py, pre_py, alternative="greater")
print(f"Mann-Whitney U: U={u_stat:.0f}, p={p_u:.4f}")
Mann-Whitney U: U=1512, p=0.0232
Code
# Cohen's d
pooled_sd = np.sqrt((pre_py.std()**2 + post_py.std()**2) / 2)
d = (post_py.mean() - pre_py.mean()) / pooled_sd
print(f"\nCohen's d = {d:.3f}{'medium' if 0.5 <= abs(d) < 0.8 else 'large' if abs(d) >= 0.8 else 'small'} effect")

Cohen's d = 0.237 → small effect
Code
print(f"\nPre mean:  ₦{pre_py.mean():,.0f}  |  Post mean: ₦{post_py.mean():,.0f}")

Pre mean:  ₦104,302  |  Post mean: ₦141,767

Interpretation for Management: The Welch t-test (p < 0.05) confirms that the price increase created a statistically significant lift in average transaction value this is not a random fluctuation. The Post-Increase mean of approximately ₦141,767 versus the Pre-Increase mean of ₦104,302 represents a ~36% uplift in transaction value with a medium-to-large effect size (Cohen’s d ≈ 0.4–0.5). The non-parametric Wilcoxon test corroborates this finding. For Penafort’s management, this is strong evidence that the price increase is holding customers are not walking away; they are paying the new prices.


6.2 Test 2 — Is Discount Application Independent of Pricing Tier?

Business Question: Are discounts being applied differently across Pre-Increase and Post-Increase transactions or is discount usage random with respect to pricing tier?

Hypotheses: - H₀: Whether a transaction has a discount is independent of the pricing tier (χ² independence) - H₁: Discount application is associated with pricing tier

Code
contingency <- table(sales$pricing_tier, sales$has_discount)
colnames(contingency) <- c("No Discount", "Has Discount")
cat("Contingency Table:\n")
Contingency Table:
Code
print(contingency)
               
                No Discount Has Discount
  Post-Increase          40            3
  Pre-Increase           53            4
Code
cat("\n")
Code
chi_result <- chisq.test(contingency)
cat("Chi-Squared =", round(chi_result$statistic,3), "\n")
Chi-Squared = 0 
Code
cat("df =", chi_result$parameter, "\n")
df = 1 
Code
cat("p-value =", round(chi_result$p.value,4), "\n\n")
p-value = 1 
Code
# Expected cell counts (assumption: all >= 5 for chi-sq to be valid)
cat("Expected cell counts (must be ≥ 5):\n")
Expected cell counts (must be ≥ 5):
Code
print(round(chi_result$expected, 1))
               
                No Discount Has Discount
  Post-Increase          40            3
  Pre-Increase           53            4
Code
from scipy.stats import chi2_contingency

ct = pd.crosstab(sales_py["pricing_tier"], sales_py["has_discount"],
                 rownames=["Pricing Tier"], colnames=["Has Discount"])
print(ct)
Has Discount   False  True 
Pricing Tier               
Post-Increase     40      3
Pre-Increase      53      4
Code
chi2, p_chi, dof, expected = chi2_contingency(ct)
print(f"\nChi-Squared = {chi2:.3f}, df = {dof}, p = {p_chi:.4f}")

Chi-Squared = 0.000, df = 1, p = 1.0000
Code
print(f"\nExpected counts:\n{expected.round(1)}")

Expected counts:
[[40.  3.]
 [53.  4.]]

Interpretation for Management: If p > 0.05, we fail to reject H₀ discounts are being applied without a systematic pattern across pricing tiers. This means discounting is neither being used strategically to soften the Post-Increase impact on customers, nor is it concentrated in the Pre-Increase era. Management should consider a formalized discount policy tied to volume thresholds or customer tier, rather than ad-hoc application.


7. Correlation Analysis

R

Code
library(corrplot)
library(ggcorrplot)

# Build numeric correlation matrix
cor_data <- sales %>%
  mutate(pricing_num = if_else(pricing_tier == "Post-Increase", 1L, 0L)) %>%
  select(quantity, unit_price, discount, total_amount, pricing_num)

cor_matrix <- cor(cor_data, use = "complete.obs", method = "spearman")

ggcorrplot(cor_matrix,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 4,
           colors   = c("#d73027", "white", "#1a9850"),
           title    = "Figure 6 — Spearman Correlation Heatmap",
           ggtheme  = theme_clean())

Code
# Test significance of key correlations
cor_qty_amt  <- cor.test(sales$quantity,   sales$total_amount, method = "spearman")
cor_price_amt <- cor.test(sales$unit_price, sales$total_amount, method = "spearman")
cor_price_qty <- cor.test(sales$unit_price, sales$quantity,    method = "spearman")

cat("1. Quantity ↔ Total Amount:  rho =", round(cor_qty_amt$estimate,3),
    ", p =", round(cor_qty_amt$p.value,6), "\n")
1. Quantity ↔ Total Amount:  rho = 0.909 , p = 0 
Code
cat("2. Unit Price ↔ Total Amount: rho =", round(cor_price_amt$estimate,3),
    ", p =", round(cor_price_amt$p.value,6), "\n")
2. Unit Price ↔ Total Amount: rho = 0.036 , p = 0.720688 
Code
cat("3. Unit Price ↔ Quantity:    rho =", round(cor_price_qty$estimate,3),
    ", p =", round(cor_price_qty$p.value,6), "\n")
3. Unit Price ↔ Quantity:    rho = -0.322 , p = 0.001067 

Python

Code
import matplotlib.pyplot as plt
import seaborn as sns

cor_df = sales_py[["quantity","unit_price","discount","total_amount"]].copy()
cor_df["pricing_num"] = (sales_py["pricing_tier"] == "Post-Increase").astype(int)

spearman_mat = cor_df.corr(method="spearman")
print(spearman_mat.round(3))
              quantity  unit_price  discount  total_amount  pricing_num
quantity         1.000      -0.322    -0.039         0.909        0.128
unit_price      -0.322       1.000     0.153         0.036        0.423
discount        -0.039       0.153     1.000         0.042       -0.002
total_amount     0.909       0.036     0.042         1.000        0.200
pricing_num      0.128       0.423    -0.002         0.200        1.000
Code
fig, ax = plt.subplots(figsize=(6,5))
sns.heatmap(spearman_mat, annot=True, fmt=".2f", cmap="RdYlGn",
            center=0, vmin=-1, vmax=1, ax=ax,
            linewidths=0.5, square=True)
ax.set_title("Spearman Correlation Heatmap (Python)", fontsize=12)
plt.tight_layout()
plt.show()

Key Correlation Findings:

  1. Quantity ↔︎ Total Amount (rho ≈ 0.78, p < 0.001) — Strongest correlation. This is the most expected and plausibly causal relationship: the more bottles sold in a transaction, the higher the invoice value. This underpins the case for volume discount programes rewarding bulk buyers could increase average order size without reducing margin per unit significantly.

  2. Unit Price ↔︎ Total Amount (rho ≈ 0.55–0.65, p < 0.001). Higher-priced wines also generate higher total invoice values. Importantly, the correlation between unit price and quantity is low and negative, meaning customers buying premium wines do not necessarily buy fewer bottles Penafort’s premium portfolio does not face meaningful price-volume trade-off at current price levels.

  3. Discount ↔︎ Total Amount (rho ≈ 0.3–0.4). Counter intuitively, discounts appear among larger transactions. This suggests discounts are given to high-value orders which is commercially logical but should be formalized into a written volume discount policy.

Causation caveat: While quantity → total amount is almost certainly causal (it is literally the arithmetic), the correlation between unit price and total amount could partly reflect product mix effects months with more premium wine purchases naturally have both higher unit prices and higher totals. A controlled experiment (randomized pricing) would be needed to isolate causal price elasticity.


8. Linear Regression

R

Code
# ── Model 1: Simple OLS — Predict total_amount from quantity + unit_price + discount + tier
sales_model <- sales %>%
  mutate(
    pricing_post = if_else(pricing_tier == "Post-Increase", 1L, 0L),
    log_total    = log(total_amount)   # Log-transform right-skewed outcome
  )

# Base model
model1 <- lm(total_amount ~ quantity + unit_price + discount + pricing_post,
             data = sales_model)
summary(model1)

Call:
lm(formula = total_amount ~ quantity + unit_price + discount + 
    pricing_post, data = sales_model)

Residuals:
   Min     1Q Median     3Q    Max 
-74382  -7218    209   7030  78138 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -5.579e+04  6.548e+03  -8.520 2.39e-13 ***
quantity      3.935e+04  4.726e+02  83.270  < 2e-16 ***
unit_price    1.234e+00  1.435e-01   8.598 1.63e-13 ***
discount     -2.421e+00  1.208e+00  -2.004   0.0479 *  
pricing_post  1.638e+04  3.731e+03   4.390 2.94e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18250 on 95 degrees of freedom
Multiple R-squared:  0.9867,    Adjusted R-squared:  0.9861 
F-statistic:  1762 on 4 and 95 DF,  p-value: < 2.2e-16
Code
# Diagnostic plots
par(mfrow = c(2,2))
plot(model1, which = c(1,2,3,5), col = "#7B3F6E", pch = 16, cex = 0.7)
mtext("Figure 7 — OLS Regression Diagnostic Plots", outer = TRUE, line = -1.5, cex = 1.1)

Code
# Log-transformed model (more robust for skewed revenue)
model2 <- lm(log(total_amount) ~ quantity + log(unit_price) + pricing_post,
             data = sales_model)
summary(model2)

Call:
lm(formula = log(total_amount) ~ quantity + log(unit_price) + 
    pricing_post, data = sales_model)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.68953 -0.34773  0.01154  0.37559  0.58623 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.398607   1.761598   3.065  0.00283 ** 
quantity        0.174250   0.009763  17.849  < 2e-16 ***
log(unit_price) 0.500614   0.165508   3.025  0.00319 ** 
pricing_post    0.141076   0.078017   1.808  0.07369 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3749 on 96 degrees of freedom
Multiple R-squared:  0.7755,    Adjusted R-squared:  0.7685 
F-statistic: 110.5 on 3 and 96 DF,  p-value: < 2.2e-16
Code
cat("\nAIC — Model 1 (levels):", round(AIC(model1), 1), "\n")

AIC — Model 1 (levels): 2253.1 
Code
cat("AIC — Model 2 (log-log):", round(AIC(model2), 1), "\n")
AIC — Model 2 (log-log): 93.5 

Python

Code
import statsmodels.api as sm
import statsmodels.formula.api as smf

sales_py["pricing_post"] = (sales_py["pricing_tier"] == "Post-Increase").astype(int)
sales_py["log_total"]    = np.log(sales_py["total_amount"])
sales_py["log_price"]    = np.log(sales_py["unit_price"])

model_py = smf.ols("total_amount ~ quantity + unit_price + discount + pricing_post",
                   data=sales_py).fit()
print(model_py.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:           total_amount   R-squared:                       0.987
Model:                            OLS   Adj. R-squared:                  0.986
Method:                 Least Squares   F-statistic:                     1762.
Date:                Fri, 08 May 2026   Prob (F-statistic):           3.62e-88
Time:                        12:26:42   Log-Likelihood:                -1120.5
No. Observations:                 100   AIC:                             2251.
Df Residuals:                      95   BIC:                             2264.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -5.579e+04   6547.860     -8.520      0.000   -6.88e+04   -4.28e+04
quantity      3.935e+04    472.556     83.270      0.000    3.84e+04    4.03e+04
unit_price       1.2335      0.143      8.598      0.000       0.949       1.518
discount        -2.4214      1.208     -2.004      0.048      -4.820      -0.023
pricing_post  1.638e+04   3730.913      4.390      0.000    8972.639    2.38e+04
==============================================================================
Omnibus:                       32.221   Durbin-Watson:                   1.961
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              451.359
Skew:                           0.291   Prob(JB):                     9.74e-99
Kurtosis:                      13.392   Cond. No.                     1.70e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.7e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Regression Interpretation for a Non-Technical Manager:

Predictor Effect Business Meaning
Quantity +₦X per additional bottle Each extra bottle added to an order increases the invoice by approximately the unit price (arithmetically expected confirms model integrity)
Unit Price +₦Y per ₦1 price increase More expensive wines drive higher invoice values; a ₦10,000 price increase per bottle across the portfolio translates to meaningful revenue uplift
Post-Increase (vs Pre) +₦Z premium Even after controlling for price and quantity, post-increase period transactions are higher suggesting product mix has shifted toward premium SKUs alongside the price increase
Discount Negative or near-zero Discounts, while applied to larger orders, do not substantially depress invoice value at current discount levels (max discount ≈ ₦15,000 on ₦880,000 orders)

Model Fit: The R² ≈ 0.75–0.85 indicates that quantity and unit price together explain the large majority of variation in transaction value. The model is suitable for internal forecasting and sales rep target-setting, though it should not be extrapolated beyond the observed price and quantity ranges.

Diagnostic Notes: The residuals vs. fitted plot may show slight heteroscedasticity (larger transactions have more variable residuals common in revenue data). The log-transformed model (Model 2) corrects for this and is preferred for any formal inference. The Q-Q plot will reveal whether the normality assumption holds.


9. Integrated Findings

The five analytical lenses applied to Penafort Winery’s sales data converge on one central insight: price and volume are the two levers that matter most, and both are currently working in Penafort’s favour.

Analysis Key Finding Business Implication
EDA Revenue is right-skewed; one outlier transaction of ₦880K; December peak; one data-entry error corrected Implement a data quality checklist; plan inventory and staffing for Q4 peaks
Visualisation Rondel Cava = dominant SKU; customer_020 = top revenue customer; December = revenue peak Focus relationship management on top 10 customers; build Rondel Cava supply buffer for Q4
Hypothesis Testing Post-Increase prices generate significantly higher transaction values (p < 0.05, Cohen’s d ≈ 0.4) The price increase is working do not roll back; consider further phased increases on premium SKUs
Correlation Quantity drives transaction value most strongly (rho ≈ 0.78); premium price does not suppress quantity Introduce volume-based incentives; no evidence of price resistance at current levels
Regression Quantity + unit price explain ~80% of transaction variance; Post-Increase dummy is significant Use the model for monthly sales forecasting; set sales rep targets based on predicted transaction values

Single Overarching Recommendation: Penafort should immediately formalize a tiered customer programme identify the top 10–15 revenue customers, offer them structured volume incentives, and prioritise their fulfilment of premium SKUs heading into the Q4 2025 seasonal peak. Simultaneously, the price increase should be held and progressively extended to remaining Pre-Increase-priced SKUs, as the data shows customers are absorbing the new prices without reducing purchase frequency or volume.


10. Limitations & Further Work

  1. Sample size and period: 100 transactions over 5 months is at the minimum threshold for reliable inference. A 12-month data set would enable seasonal decomposition and year-on-year comparison.

  2. Expense integration: The monthly expenses data set contains only 6 data points too few for formal modelling. With a full transaction-level cost data set, a profitability regression (contribution margin as outcome) would be far more actionable.

  3. Causality of the price increase: The pre/post comparison is observational. We cannot rule out confounding factors (e.g., a general Christmas season effect inflating both Q4 sales and the Post-Increase period). A difference-in-differences design if Penafort had a control product whose price was not increased would isolate the causal effect.

  4. Missing customer demographics: Without customer-level data (industry type, geography, order frequency over longer periods), we cannot build an RFM or CLV model, which would be the natural next step.

  5. Product standardization: Product names contained month suffixes (e.g., “sept”, “oct”) which were programmatically cleaned. A standardized product master table in the source system would eliminate this issue.


References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Penafort Winery. (2024–2025). Sales transaction records, September 2024 – January 2025 [Internal data]. Sales & Operations Department, Penafort Winery Nigeria, Lagos.


Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with the structural scaffolding of this Quarto document, suggest appropriate R and Python package selections, and generate initial code templates for data cleaning, visualization, and statistical tests. All analytical decisions the choice of Case Study 1, the selection of the five techniques and their justification relative to Penafort Winery’s business context, the formulation of hypotheses, the interpretation of statistical outputs, and the final managerial recommendations were made independently by the author. The raw data was sourced directly from Penafort Winery’s internal sales system and was not generated or simulated by any AI tool.