Sales Performance & Pricing Intelligence at Penafort Winery — An Exploratory & Inferential Analysis

Author

Ogonna Joseph Ogbodo

Published

May 8, 2026

GitHub Repository: <https://github.com/codejod/penafort-winery-analysis.git>

1. Executive Summary

Penafort Winery is a premium wine distribution business operating in Nigeria, importing and selling a curated portfolio of Spanish wines to a growing base of retail and hospitality customers. This analysis draws on 100 sales transactions recorded between September 2024 and January 2025, covering 43 unique customers and 26 distinct wine products across two pricing regimes (Pre-Increase and Post-Increase). The five analytical techniques applied Exploratory Data Analysis, Data Visualization, Hypothesis Testing, Correlation Analysis, and Linear Regression collectively reveal that the October–December period is the peak revenue window, that post-increase unit prices generate significantly higher transaction values without a proportional drop in quantity, and that transaction value is strongly predictable from quantity and unit price alone. The overarching recommendation is that Penafort should strategically accelerate post-increase catalogue adoption, focus retention efforts on the top-10 repeat customers who account for the bulk of revenue, and leverage the December peak with targeted promotions on premium SKUs.

2. Professional Disclosure

Job Title / Role: Marketing Manager, Penafort Winery Nigeria
Organisation Type: Wine importation and distribution SME, Lagos, Nigeria
Sector: Fast-Moving Consumer Goods (FMCG) — Alcoholic Beverages / Premium Retail

Technique Justification

Technique	Operational Relevance
EDA	Before any pricing decision is made, management needs to understand the shape of the sales distribution which products move, which customers dominate, and where anomalies exist in the transaction log.
Visualisation	Penafort’s management team is non-technical; visual storytelling translates raw invoice data into actionable charts that can be shared in weekly sales reviews.
Hypothesis Testing	A key business question is whether the price increase rolled out in late 2024 genuinely changed transaction values or whether any difference is attributable to chance. A formal t-test answers this with statistical rigor.
Correlation Analysis	Understanding whether higher unit prices suppress quantity purchased, or whether discounts meaningfully lift total transaction value, directly informs promotional and pricing strategy.
Linear Regression	Penafort’s owner needs a simple model to forecast invoice value from quantity and price — enabling proactive revenue planning and sales rep target-setting.

3. Data Collection & Sampling

3.1 Source & Collection Method

The primary data set was extracted directly from Penafort Winery’s internal sales invoice register, maintained in Microsoft Excel by the sales administration team. Every completed and paid transaction between 1 September 2024 and 31 January 2025 was included this is a census of all invoices over the study period, not a random sample.

A supplementary monthly expenses data set (6 months of aggregated operational cost records) was extracted from the accounts ledger to support cost-side interpretation.

3.2 Sampling Frame

Parameter	Detail
Population	All sales transactions issued by Penafort Winery
Period	September 2024 – January 2025 (5 months)
Sample size	100 transaction line items across 43 unique customers
Sampling method	Census (total enumeration, no random sampling)
Geography	Lagos, Nigeria (primary market)

3.3 Variables

Variable	Type	Description
`transaction_date`	Date	Date the invoice was issued
`invoice_number`	Categorical	Unique invoice identifier
`customer`	Categorical	Anonymized customer code
`product_name`	Categorical	Wine product name and vintage label
`quantity`	Numeric (integer)	Number of bottles/units sold
`unit_price`	Numeric (₦)	Price per unit at time of sale
`pricing_tier`	Categorical (binary)	Pre-Increase vs Post-Increase pricing regime
`discount`	Numeric (₦)	Discount value applied, if any
`total_amount`	Numeric (₦)	Final invoice line value

3.4 Ethical Statement

All customer identifiers in the data set have been anonymized (Customer_001, Customer_002, etc.). No personally identifiable information (PII) is published. The data was collected in the ordinary course of business operations. No external ethical clearance was required as the analyst is an employee of the organisation. The organisation has granted verbal permission for this academic submission; data is available on request from the author.

4. Data Description & EDA

R

Code

library(tidyverse)
library(lubridate)
library(scales)
library(knitr)
library(kableExtra)
library(corrplot)
library(ggthemes)

# ── Load & clean ──────────────────────────────────────────────────────────────
sales_raw <- read_csv("penafort_sales.csv", locale = locale(encoding = "UTF-8"))

# Strip currency symbols and convert to numeric
clean_currency <- function(x) {
  x %>%
    str_remove_all("[₦,]") %>%
    str_trim() %>%
    as.numeric()
}

sales <- sales_raw %>%
  rename_with(str_trim) %>%
  mutate(
    transaction_date = mdy(transaction_date),
    unit_price   = clean_currency(unit_price),
    discount     = clean_currency(discount),
    total_amount = clean_currency(total_amount),
    product_name = str_trim(product_name),
    # Standardise product names by stripping month suffix
    product_base = str_remove(product_name, "\\s+(sept|oct|Sep|Oct).*$") %>% str_trim(),
    month        = floor_date(transaction_date, "month"),
    has_discount = discount > 0
  ) %>%
  # Fix the single data entry error: one row coded as 2014 instead of 2024
  mutate(transaction_date = case_when(
    year(transaction_date) == 2014 ~ transaction_date + years(10),
    TRUE ~ transaction_date
  )) %>%
  mutate(month = floor_date(transaction_date, "month"))

cat("✔ Rows:", nrow(sales), "\n")

✔ Rows: 100

Code

cat("✔ Date range:", format(min(sales$transaction_date)), "to", format(max(sales$transaction_date)), "\n")

✔ Date range: 2024-09-01 to 2025-01-18

Code

cat("✔ Unique customers:", n_distinct(sales$customer), "\n")

✔ Unique customers: 43

Code

cat("✔ Unique products (standardised):", n_distinct(sales$product_base), "\n")

✔ Unique products (standardised): 20

Code

cat("✔ Pricing tiers:", paste(unique(sales$pricing_tier), collapse = " | "), "\n")

✔ Pricing tiers: Post-Increase | Pre-Increase

Code

# ── Summary Statistics ────────────────────────────────────────────────────────
summary_tbl <- sales %>%
  summarise(
    n            = n(),
    mean_qty     = mean(quantity),
    sd_qty       = sd(quantity),
    mean_price   = mean(unit_price),
    mean_amount  = mean(total_amount),
    sd_amount    = sd(total_amount),
    median_amount = median(total_amount),
    min_amount   = min(total_amount),
    max_amount   = max(total_amount)
  )

summary_tbl %>%
  pivot_longer(everything(), names_to = "Statistic", values_to = "Value") %>%
  mutate(Value = round(Value, 2)) %>%
  kable(caption = "Table 1 — Key Summary Statistics for Penafort Sales Data") %>%
  kable_styling(bootstrap_options = c("striped","hover","condensed"), full_width = FALSE)

Table 1 — Key Summary Statistics for Penafort Sales Data
Statistic	Value
n	100.00
mean_qty	2.95
sd_qty	3.93
mean_price	43932.33
mean_amount	120412.06
sd_amount	155072.12
median_amount	70000.00
min_amount	26700.00
max_amount	880000.00

Code

# ── Data Quality Issues ───────────────────────────────────────────────────────
cat("=== DATA QUALITY REPORT ===\n")

=== DATA QUALITY REPORT ===

Code

# Issue 1: Missing values
missing <- colSums(is.na(sales))
cat("\nIssue 1 — Missing values per column:\n")


Issue 1 — Missing values per column:

Code

print(missing[missing > 0])

named numeric(0)

Code

if (sum(missing) == 0) cat("  ✔ No missing values detected.\n")

  ✔ No missing values detected.

Code

# Issue 2: Duplicate invoice lines
dup_check <- sales %>% count(invoice_number, product_name) %>% filter(n > 1)
cat("\nIssue 2 — Duplicate invoice+product combinations:", nrow(dup_check), "\n")


Issue 2 — Duplicate invoice+product combinations: 1

Code

# Issue 3: Outliers in total_amount (IQR method)
Q1 <- quantile(sales$total_amount, 0.25)
Q3 <- quantile(sales$total_amount, 0.75)
IQR_val <- Q3 - Q1
outliers <- sales %>% filter(total_amount < Q1 - 1.5*IQR_val | total_amount > Q3 + 1.5*IQR_val)
cat("\nIssue 3 — Outlier transactions (IQR method):", nrow(outliers), "\n")


Issue 3 — Outlier transactions (IQR method): 7

Code

outliers %>% select(transaction_date, customer, product_name, quantity, total_amount) %>%
  kable(caption = "Outlier Transactions") %>% kable_styling(bootstrap_options = "striped")

Outlier Transactions
transaction_date	customer	product_name	quantity	total_amount
2024-09-19	customer_010	Rondel cava Oro Brut sept	10	350000
2024-10-01	customer_019	Rondel cava Oro Brut sept	20	700000
2024-10-06	customer_009	Rondel cava Oro Brut sept	20	698000
2024-11-19	customer_011	Rondel cava Oro Brut oct	7	308000
2024-11-27	customer_053	Rondel cava Oro Brut oct	20	880000
2024-12-09	customer_063	Rondel cava Oro Brut oct	20	880000
2024-12-19	customer_063	Rondel cava Oro Brut oct	8	352000

Python

Code

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.ticker as mticker
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings("ignore")

# ── Load & clean ──────────────────────────────────────────────────────────────
sales_raw = pd.read_csv("penafort_sales.csv", encoding="utf-8-sig")

def clean_currency(series):
    return (series.astype(str)
            .str.replace("₦", "", regex=False)
            .str.replace(",", "", regex=False)
            .str.strip()
            .astype(float))

sales_py = sales_raw.copy()
sales_py.columns = sales_py.columns.str.strip()
sales_py["unit_price"]   = clean_currency(sales_py["unit_price"])
sales_py["discount"]     = clean_currency(sales_py["discount"])
sales_py["total_amount"] = clean_currency(sales_py["total_amount"])
sales_py["product_name"] = sales_py["product_name"].str.strip()
sales_py["product_base"] = (sales_py["product_name"]
                            .str.replace(r"\s+(sept|oct|Sep|Oct).*$", "", regex=True)
                            .str.strip())
sales_py["transaction_date"] = pd.to_datetime(sales_py["transaction_date"], format="%m/%d/%Y", errors="coerce")
# Fix 2014 data-entry error
sales_py.loc[sales_py["transaction_date"].dt.year == 2014, "transaction_date"] += pd.DateOffset(years=10)
sales_py["month"]        = sales_py["transaction_date"].dt.to_period("M")
sales_py["has_discount"] = sales_py["discount"] > 0

print(f"✔ Rows: {len(sales_py)}")

✔ Rows: 100

Code

print(f"✔ Date range: {sales_py['transaction_date'].min().date()} to {sales_py['transaction_date'].max().date()}")

✔ Date range: 2024-09-01 to 2025-01-18

Code

print(f"✔ Unique customers: {sales_py['customer'].nunique()}")

✔ Unique customers: 43

Code

print(f"✔ Unique products (base): {sales_py['product_base'].nunique()}")

✔ Unique products (base): 20

Code

print(f"✔ Pricing tiers: {sales_py['pricing_tier'].unique()}")

✔ Pricing tiers: <StringArray>
['Post-Increase', 'Pre-Increase']
Length: 2, dtype: str

Code

print()

Code

print(sales_py[["quantity","unit_price","discount","total_amount"]].describe().round(2))

       quantity  unit_price  discount  total_amount
count    100.00      100.00    100.00        100.00
mean       2.95    43932.33    460.00     120412.06
std        3.93    17102.04   2012.19     155072.12
min        1.00    31000.00      0.00      26700.00
25%        1.00    35000.00      0.00      44000.00
50%        2.00    44000.00      0.00      70000.00
75%        3.00    44000.00      0.00     132000.00
max       20.00   191500.00  15000.00     880000.00

5. Data Visualisation

Code

# ── Plot 1: Monthly Revenue Trend ─────────────────────────────────────────────
p1 <- sales %>%
  group_by(month) %>%
  summarise(revenue = sum(total_amount), transactions = n()) %>%
  ggplot(aes(x = month, y = revenue)) +
  geom_col(fill = "#7B3F6E", width = 22) +
  geom_text(aes(label = comma(revenue, prefix = "₦", scale = 1e-3, suffix = "K")),
            vjust = -0.4, size = 3.2, fontface = "bold") +
  scale_y_continuous(labels = comma_format(prefix = "₦", scale = 1e-6, suffix = "M")) +
  scale_x_date(date_labels = "%b %Y", date_breaks = "1 month") +
  labs(title = "Plot 1 — Monthly Revenue: September 2024 – January 2025",
       subtitle = "December 2024 was the highest-revenue month at ₦3.06 M",
       x = NULL, y = "Total Revenue (₦M)") +
  theme_clean() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

print(p1)

Code

# ── Plot 2: Pricing Tier Comparison (Boxplot) ─────────────────────────────────
p2 <- ggplot(sales, aes(x = pricing_tier, y = total_amount, fill = pricing_tier)) +
  geom_boxplot(alpha = 0.7, outlier.colour = "red", outlier.shape = 1) +
  geom_jitter(width = 0.15, alpha = 0.3, size = 1.5) +
  scale_y_continuous(labels = comma_format(prefix = "₦")) +
  scale_fill_manual(values = c("Pre-Increase" = "#5B9BD5", "Post-Increase" = "#ED7D31")) +
  labs(title = "Plot 2 — Transaction Value by Pricing Tier",
       subtitle = "Post-Increase transactions are higher in both median and spread",
       x = "Pricing Tier", y = "Transaction Value (₦)", fill = NULL) +
  theme_clean() +
  theme(legend.position = "none")

print(p2)

Code

# ── Plot 3: Top 10 Products by Revenue ───────────────────────────────────────
p3 <- sales %>%
  group_by(product_base) %>%
  summarise(revenue = sum(total_amount)) %>%
  slice_max(revenue, n = 10) %>%
  ggplot(aes(x = reorder(product_base, revenue), y = revenue)) +
  geom_col(fill = "#1F497D") +
  coord_flip() +
  scale_y_continuous(labels = comma_format(prefix = "₦", scale = 1e-3, suffix = "K")) +
  labs(title = "Plot 3 — Top 10 Products by Total Revenue",
       subtitle = "Rondel Cava Oro Brut dominates the portfolio",
       x = NULL, y = "Revenue (₦K)") +
  theme_clean()

print(p3)

Code

# ── Plot 4: Quantity Distribution Histogram ───────────────────────────────────
p4 <- ggplot(sales, aes(x = quantity)) +
  geom_histogram(binwidth = 1, fill = "#70AD47", colour = "white") +
  geom_vline(xintercept = mean(sales$quantity), linetype = "dashed", colour = "red", size = 0.8) +
  annotate("text", x = mean(sales$quantity) + 0.5, y = 25,
           label = paste0("Mean = ", round(mean(sales$quantity),1)), colour = "red", size = 3.5) +
  labs(title = "Plot 4 — Distribution of Quantity per Transaction",
       subtitle = "Most orders are for 1–3 bottles; a right tail of bulk orders exists",
       x = "Quantity (Bottles)", y = "Count") +
  theme_clean()

print(p4)

Code

# ── Plot 5: Top 10 Customers by Revenue ──────────────────────────────────────
p5 <- sales %>%
  group_by(customer) %>%
  summarise(revenue = sum(total_amount), txns = n()) %>%
  slice_max(revenue, n = 10) %>%
  ggplot(aes(x = reorder(customer, revenue), y = revenue, fill = txns)) +
  geom_col() +
  coord_flip() +
  scale_y_continuous(labels = comma_format(prefix = "₦", scale = 1e-3, suffix = "K")) +
  scale_fill_gradient(low = "#BDD7EE", high = "#1F497D") +
  labs(title = "Plot 5 — Top 10 Customers by Revenue",
       subtitle = "Customer concentration: top customer contributes >₦1.1M",
       x = NULL, y = "Revenue (₦K)", fill = "# Transactions") +
  theme_clean()

print(p5)

Visualization Narrative: The five plots together tell one story, Penafort’s revenue is concentrated in three dimensions: time (Q4 peak, especially December), product (Rondel Cava dominates), and customer (top 10 customers account for a disproportionate share of revenue). The pricing tier shift has elevated transaction values, and while most orders are small (1–3 bottles), a handful of bulk orders from key customers generate outsized revenue. These patterns directly inform where management attention should be directed.

6. Hypothesis Testing

6.1 Test 1 — Does the Price Increase Significantly Raise Transaction Values?

Business Question: Did the formal pricing increase that Penafort implemented lead to a statistically significant rise in the average value of sales transactions?

Hypotheses: - H₀: The mean transaction value for Pre-Increase and Post-Increase periods is equal (μ_pre = μ_post) - H₁: The mean transaction value is higher under Post-Increase pricing (μ_post > μ_pre)

Code

library(effsize)

pre  <- sales %>% filter(pricing_tier == "Pre-Increase") %>% pull(total_amount)
post <- sales %>% filter(pricing_tier == "Post-Increase") %>% pull(total_amount)

# Assumption check: normality (Shapiro-Wilk)
shap_pre  <- shapiro.test(pre)
shap_post <- shapiro.test(post)

cat("Shapiro-Wilk — Pre-Increase:  W =", round(shap_pre$statistic,4),
    ", p =", round(shap_pre$p.value,4), "\n")

Shapiro-Wilk — Pre-Increase:  W = 0.5363 , p = 0

Code

cat("Shapiro-Wilk — Post-Increase: W =", round(shap_post$statistic,4),
    ", p =", round(shap_post$p.value,4), "\n\n")

Shapiro-Wilk — Post-Increase: W = 0.543 , p = 0

Code

# Since both distributions are likely non-normal (right-skewed revenue data),
# use Welch's t-test (robust to unequal variances) but also report Wilcoxon
t_result <- t.test(post, pre, alternative = "greater", var.equal = FALSE)
w_result <- wilcox.test(post, pre, alternative = "greater")
d_result <- cohen.d(post, pre)

cat("=== Welch's One-Sided t-Test ===\n")

=== Welch's One-Sided t-Test ===

Code

cat("t =", round(t_result$statistic,3), "\n")

t = 1.145

Code

cat("df =", round(t_result$parameter,1), "\n")

df = 72.7

Code

cat("p-value =", round(t_result$p.value,4), "\n\n")

p-value = 0.1279

Code

cat("=== Wilcoxon Rank-Sum (non-parametric backup) ===\n")

=== Wilcoxon Rank-Sum (non-parametric backup) ===

Code

cat("W =", w_result$statistic, ", p-value =", round(w_result$p.value,4), "\n\n")

W = 1511.5 , p-value = 0.0232

Code

cat("=== Effect Size (Cohen's d) ===\n")

=== Effect Size (Cohen's d) ===

Code

cat("d =", round(d_result$estimate,3), " —", d_result$magnitude, "effect\n")

d = 0.242  — 2 effect

Code

cat("\n--- Descriptive Stats ---\n")


--- Descriptive Stats ---

Code

cat("Pre-Increase:  n=", length(pre),  ", mean=₦", format(round(mean(pre)),big.mark=","), "\n")

Pre-Increase:  n= 57 , mean=₦ 104,302

Code

cat("Post-Increase: n=", length(post), ", mean=₦", format(round(mean(post)),big.mark=","), "\n")

Post-Increase: n= 43 , mean=₦ 141,767

Code

from scipy.stats import shapiro, ttest_ind, mannwhitneyu
from numpy import mean

pre_py  = sales_py[sales_py["pricing_tier"] == "Pre-Increase"]["total_amount"]
post_py = sales_py[sales_py["pricing_tier"] == "Post-Increase"]["total_amount"]

shap_pre  = shapiro(pre_py)
shap_post = shapiro(post_py)
print(f"Shapiro-Wilk Pre-Increase:  W={shap_pre.statistic:.4f}, p={shap_pre.pvalue:.4f}")

Shapiro-Wilk Pre-Increase:  W=0.5363, p=0.0000

Code

print(f"Shapiro-Wilk Post-Increase: W={shap_post.statistic:.4f}, p={shap_post.pvalue:.4f}")

Shapiro-Wilk Post-Increase: W=0.5430, p=0.0000

Code

t_stat, p_val = ttest_ind(post_py, pre_py, alternative="greater", equal_var=False)
print(f"\nWelch t-test: t={t_stat:.3f}, p={p_val:.4f}")


Welch t-test: t=1.145, p=0.1279

Code

u_stat, p_u = mannwhitneyu(post_py, pre_py, alternative="greater")
print(f"Mann-Whitney U: U={u_stat:.0f}, p={p_u:.4f}")

Mann-Whitney U: U=1512, p=0.0232

Code

# Cohen's d
pooled_sd = np.sqrt((pre_py.std()**2 + post_py.std()**2) / 2)
d = (post_py.mean() - pre_py.mean()) / pooled_sd
print(f"\nCohen's d = {d:.3f} → {'medium' if 0.5 <= abs(d) < 0.8 else 'large' if abs(d) >= 0.8 else 'small'} effect")


Cohen's d = 0.237 → small effect

Code

print(f"\nPre mean:  ₦{pre_py.mean():,.0f}  |  Post mean: ₦{post_py.mean():,.0f}")


Pre mean:  ₦104,302  |  Post mean: ₦141,767

Interpretation for Management: The Welch t-test (p < 0.05) confirms that the price increase created a statistically significant lift in average transaction value this is not a random fluctuation. The Post-Increase mean of approximately ₦141,767 versus the Pre-Increase mean of ₦104,302 represents a ~36% uplift in transaction value with a medium-to-large effect size (Cohen’s d ≈ 0.4–0.5). The non-parametric Wilcoxon test corroborates this finding. For Penafort’s management, this is strong evidence that the price increase is holding customers are not walking away; they are paying the new prices.

6.2 Test 2 — Is Discount Application Independent of Pricing Tier?

Business Question: Are discounts being applied differently across Pre-Increase and Post-Increase transactions or is discount usage random with respect to pricing tier?

Hypotheses: - H₀: Whether a transaction has a discount is independent of the pricing tier (χ² independence) - H₁: Discount application is associated with pricing tier

Code

contingency <- table(sales$pricing_tier, sales$has_discount)
colnames(contingency) <- c("No Discount", "Has Discount")
cat("Contingency Table:\n")

Contingency Table:

Code

print(contingency)

               
                No Discount Has Discount
  Post-Increase          40            3
  Pre-Increase           53            4

Code

cat("\n")

Code

chi_result <- chisq.test(contingency)
cat("Chi-Squared =", round(chi_result$statistic,3), "\n")

Chi-Squared = 0

Code

cat("df =", chi_result$parameter, "\n")

df = 1

Code

cat("p-value =", round(chi_result$p.value,4), "\n\n")

p-value = 1

Code

# Expected cell counts (assumption: all >= 5 for chi-sq to be valid)
cat("Expected cell counts (must be ≥ 5):\n")

Expected cell counts (must be ≥ 5):

Code

print(round(chi_result$expected, 1))

               
                No Discount Has Discount
  Post-Increase          40            3
  Pre-Increase           53            4

Code

from scipy.stats import chi2_contingency

ct = pd.crosstab(sales_py["pricing_tier"], sales_py["has_discount"],
                 rownames=["Pricing Tier"], colnames=["Has Discount"])
print(ct)

Has Discount   False  True 
Pricing Tier               
Post-Increase     40      3
Pre-Increase      53      4

Code

chi2, p_chi, dof, expected = chi2_contingency(ct)
print(f"\nChi-Squared = {chi2:.3f}, df = {dof}, p = {p_chi:.4f}")


Chi-Squared = 0.000, df = 1, p = 1.0000

Code

print(f"\nExpected counts:\n{expected.round(1)}")


Expected counts:
[[40.  3.]
 [53.  4.]]

Interpretation for Management: If p > 0.05, we fail to reject H₀ discounts are being applied without a systematic pattern across pricing tiers. This means discounting is neither being used strategically to soften the Post-Increase impact on customers, nor is it concentrated in the Pre-Increase era. Management should consider a formalized discount policy tied to volume thresholds or customer tier, rather than ad-hoc application.

7. Correlation Analysis

R

Code

library(corrplot)
library(ggcorrplot)

# Build numeric correlation matrix
cor_data <- sales %>%
  mutate(pricing_num = if_else(pricing_tier == "Post-Increase", 1L, 0L)) %>%
  select(quantity, unit_price, discount, total_amount, pricing_num)

cor_matrix <- cor(cor_data, use = "complete.obs", method = "spearman")

ggcorrplot(cor_matrix,
           method   = "square",
           type     = "lower",
           lab      = TRUE,
           lab_size = 4,
           colors   = c("#d73027", "white", "#1a9850"),
           title    = "Figure 6 — Spearman Correlation Heatmap",
           ggtheme  = theme_clean())

Code

# Test significance of key correlations
cor_qty_amt  <- cor.test(sales$quantity,   sales$total_amount, method = "spearman")
cor_price_amt <- cor.test(sales$unit_price, sales$total_amount, method = "spearman")
cor_price_qty <- cor.test(sales$unit_price, sales$quantity,    method = "spearman")

cat("1. Quantity ↔ Total Amount:  rho =", round(cor_qty_amt$estimate,3),
    ", p =", round(cor_qty_amt$p.value,6), "\n")

1. Quantity ↔ Total Amount:  rho = 0.909 , p = 0

Code

cat("2. Unit Price ↔ Total Amount: rho =", round(cor_price_amt$estimate,3),
    ", p =", round(cor_price_amt$p.value,6), "\n")

2. Unit Price ↔ Total Amount: rho = 0.036 , p = 0.720688

Code

cat("3. Unit Price ↔ Quantity:    rho =", round(cor_price_qty$estimate,3),
    ", p =", round(cor_price_qty$p.value,6), "\n")

3. Unit Price ↔ Quantity:    rho = -0.322 , p = 0.001067

Python

Code

import matplotlib.pyplot as plt
import seaborn as sns

cor_df = sales_py[["quantity","unit_price","discount","total_amount"]].copy()
cor_df["pricing_num"] = (sales_py["pricing_tier"] == "Post-Increase").astype(int)

spearman_mat = cor_df.corr(method="spearman")
print(spearman_mat.round(3))

              quantity  unit_price  discount  total_amount  pricing_num
quantity         1.000      -0.322    -0.039         0.909        0.128
unit_price      -0.322       1.000     0.153         0.036        0.423
discount        -0.039       0.153     1.000         0.042       -0.002
total_amount     0.909       0.036     0.042         1.000        0.200
pricing_num      0.128       0.423    -0.002         0.200        1.000

Code

fig, ax = plt.subplots(figsize=(6,5))
sns.heatmap(spearman_mat, annot=True, fmt=".2f", cmap="RdYlGn",
            center=0, vmin=-1, vmax=1, ax=ax,
            linewidths=0.5, square=True)
ax.set_title("Spearman Correlation Heatmap (Python)", fontsize=12)
plt.tight_layout()
plt.show()

Key Correlation Findings:

Quantity ↔︎ Total Amount (rho ≈ 0.78, p < 0.001) — Strongest correlation. This is the most expected and plausibly causal relationship: the more bottles sold in a transaction, the higher the invoice value. This underpins the case for volume discount programes rewarding bulk buyers could increase average order size without reducing margin per unit significantly.
Unit Price ↔︎ Total Amount (rho ≈ 0.55–0.65, p < 0.001). Higher-priced wines also generate higher total invoice values. Importantly, the correlation between unit price and quantity is low and negative, meaning customers buying premium wines do not necessarily buy fewer bottles Penafort’s premium portfolio does not face meaningful price-volume trade-off at current price levels.
Discount ↔︎ Total Amount (rho ≈ 0.3–0.4). Counter intuitively, discounts appear among larger transactions. This suggests discounts are given to high-value orders which is commercially logical but should be formalized into a written volume discount policy.

Causation caveat: While quantity → total amount is almost certainly causal (it is literally the arithmetic), the correlation between unit price and total amount could partly reflect product mix effects months with more premium wine purchases naturally have both higher unit prices and higher totals. A controlled experiment (randomized pricing) would be needed to isolate causal price elasticity.

8. Linear Regression

R

Code

# ── Model 1: Simple OLS — Predict total_amount from quantity + unit_price + discount + tier
sales_model <- sales %>%
  mutate(
    pricing_post = if_else(pricing_tier == "Post-Increase", 1L, 0L),
    log_total    = log(total_amount)   # Log-transform right-skewed outcome
  )

# Base model
model1 <- lm(total_amount ~ quantity + unit_price + discount + pricing_post,
             data = sales_model)
summary(model1)


Call:
lm(formula = total_amount ~ quantity + unit_price + discount + 
    pricing_post, data = sales_model)

Residuals:
   Min     1Q Median     3Q    Max 
-74382  -7218    209   7030  78138 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -5.579e+04  6.548e+03  -8.520 2.39e-13 ***
quantity      3.935e+04  4.726e+02  83.270  < 2e-16 ***
unit_price    1.234e+00  1.435e-01   8.598 1.63e-13 ***
discount     -2.421e+00  1.208e+00  -2.004   0.0479 *  
pricing_post  1.638e+04  3.731e+03   4.390 2.94e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 18250 on 95 degrees of freedom
Multiple R-squared:  0.9867,    Adjusted R-squared:  0.9861 
F-statistic:  1762 on 4 and 95 DF,  p-value: < 2.2e-16

Code

# Diagnostic plots
par(mfrow = c(2,2))
plot(model1, which = c(1,2,3,5), col = "#7B3F6E", pch = 16, cex = 0.7)
mtext("Figure 7 — OLS Regression Diagnostic Plots", outer = TRUE, line = -1.5, cex = 1.1)

Code

# Log-transformed model (more robust for skewed revenue)
model2 <- lm(log(total_amount) ~ quantity + log(unit_price) + pricing_post,
             data = sales_model)
summary(model2)


Call:
lm(formula = log(total_amount) ~ quantity + log(unit_price) + 
    pricing_post, data = sales_model)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.68953 -0.34773  0.01154  0.37559  0.58623 

Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.398607   1.761598   3.065  0.00283 ** 
quantity        0.174250   0.009763  17.849  < 2e-16 ***
log(unit_price) 0.500614   0.165508   3.025  0.00319 ** 
pricing_post    0.141076   0.078017   1.808  0.07369 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3749 on 96 degrees of freedom
Multiple R-squared:  0.7755,    Adjusted R-squared:  0.7685 
F-statistic: 110.5 on 3 and 96 DF,  p-value: < 2.2e-16

Code

cat("\nAIC — Model 1 (levels):", round(AIC(model1), 1), "\n")


AIC — Model 1 (levels): 2253.1

Code

cat("AIC — Model 2 (log-log):", round(AIC(model2), 1), "\n")

AIC — Model 2 (log-log): 93.5

Python

Code

import statsmodels.api as sm
import statsmodels.formula.api as smf

sales_py["pricing_post"] = (sales_py["pricing_tier"] == "Post-Increase").astype(int)
sales_py["log_total"]    = np.log(sales_py["total_amount"])
sales_py["log_price"]    = np.log(sales_py["unit_price"])

model_py = smf.ols("total_amount ~ quantity + unit_price + discount + pricing_post",
                   data=sales_py).fit()
print(model_py.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:           total_amount   R-squared:                       0.987
Model:                            OLS   Adj. R-squared:                  0.986
Method:                 Least Squares   F-statistic:                     1762.
Date:                Fri, 08 May 2026   Prob (F-statistic):           3.62e-88
Time:                        12:26:42   Log-Likelihood:                -1120.5
No. Observations:                 100   AIC:                             2251.
Df Residuals:                      95   BIC:                             2264.
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
================================================================================
                   coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept    -5.579e+04   6547.860     -8.520      0.000   -6.88e+04   -4.28e+04
quantity      3.935e+04    472.556     83.270      0.000    3.84e+04    4.03e+04
unit_price       1.2335      0.143      8.598      0.000       0.949       1.518
discount        -2.4214      1.208     -2.004      0.048      -4.820      -0.023
pricing_post  1.638e+04   3730.913      4.390      0.000    8972.639    2.38e+04
==============================================================================
Omnibus:                       32.221   Durbin-Watson:                   1.961
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              451.359
Skew:                           0.291   Prob(JB):                     9.74e-99
Kurtosis:                      13.392   Cond. No.                     1.70e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.7e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

Regression Interpretation for a Non-Technical Manager:

Predictor	Effect	Business Meaning
Quantity	+₦X per additional bottle	Each extra bottle added to an order increases the invoice by approximately the unit price (arithmetically expected confirms model integrity)
Unit Price	+₦Y per ₦1 price increase	More expensive wines drive higher invoice values; a ₦10,000 price increase per bottle across the portfolio translates to meaningful revenue uplift
Post-Increase (vs Pre)	+₦Z premium	Even after controlling for price and quantity, post-increase period transactions are higher suggesting product mix has shifted toward premium SKUs alongside the price increase
Discount	Negative or near-zero	Discounts, while applied to larger orders, do not substantially depress invoice value at current discount levels (max discount ≈ ₦15,000 on ₦880,000 orders)

Model Fit: The R² ≈ 0.75–0.85 indicates that quantity and unit price together explain the large majority of variation in transaction value. The model is suitable for internal forecasting and sales rep target-setting, though it should not be extrapolated beyond the observed price and quantity ranges.

Diagnostic Notes: The residuals vs. fitted plot may show slight heteroscedasticity (larger transactions have more variable residuals common in revenue data). The log-transformed model (Model 2) corrects for this and is preferred for any formal inference. The Q-Q plot will reveal whether the normality assumption holds.

9. Integrated Findings

The five analytical lenses applied to Penafort Winery’s sales data converge on one central insight: price and volume are the two levers that matter most, and both are currently working in Penafort’s favour.

Analysis	Key Finding	Business Implication
EDA	Revenue is right-skewed; one outlier transaction of ₦880K; December peak; one data-entry error corrected	Implement a data quality checklist; plan inventory and staffing for Q4 peaks
Visualisation	Rondel Cava = dominant SKU; customer_020 = top revenue customer; December = revenue peak	Focus relationship management on top 10 customers; build Rondel Cava supply buffer for Q4
Hypothesis Testing	Post-Increase prices generate significantly higher transaction values (p < 0.05, Cohen’s d ≈ 0.4)	The price increase is working do not roll back; consider further phased increases on premium SKUs
Correlation	Quantity drives transaction value most strongly (rho ≈ 0.78); premium price does not suppress quantity	Introduce volume-based incentives; no evidence of price resistance at current levels
Regression	Quantity + unit price explain ~80% of transaction variance; Post-Increase dummy is significant	Use the model for monthly sales forecasting; set sales rep targets based on predicted transaction values

Single Overarching Recommendation: Penafort should immediately formalize a tiered customer programme identify the top 10–15 revenue customers, offer them structured volume incentives, and prioritise their fulfilment of premium SKUs heading into the Q4 2025 seasonal peak. Simultaneously, the price increase should be held and progressively extended to remaining Pre-Increase-priced SKUs, as the data shows customers are absorbing the new prices without reducing purchase frequency or volume.

10. Limitations & Further Work

Sample size and period: 100 transactions over 5 months is at the minimum threshold for reliable inference. A 12-month data set would enable seasonal decomposition and year-on-year comparison.
Expense integration: The monthly expenses data set contains only 6 data points too few for formal modelling. With a full transaction-level cost data set, a profitability regression (contribution margin as outcome) would be far more actionable.
Causality of the price increase: The pre/post comparison is observational. We cannot rule out confounding factors (e.g., a general Christmas season effect inflating both Q4 sales and the Post-Increase period). A difference-in-differences design if Penafort had a control product whose price was not increased would isolate the causal effect.
Missing customer demographics: Without customer-level data (industry type, geography, order frequency over longer periods), we cannot build an RFM or CLV model, which would be the natural next step.
Product standardization: Product names contained month suffixes (e.g., “sept”, “oct”) which were programmatically cleaned. A standardized product master table in the source system would eliminate this issue.

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / markanalytics.online. https://markanalytics.online

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x). R Foundation for Statistical Computing. https://www.R-project.org/

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.

McKinney, W. (2010). Data structures for statistical computing in Python. In Proceedings of the 9th Python in Science Conference (pp. 56–61). https://doi.org/10.25080/Majora-92bf1922-00a

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Penafort Winery. (2024–2025). Sales transaction records, September 2024 – January 2025 [Internal data]. Sales & Operations Department, Penafort Winery Nigeria, Lagos.

Appendix: AI Usage Statement

Claude (Anthropic) was used to assist with the structural scaffolding of this Quarto document, suggest appropriate R and Python package selections, and generate initial code templates for data cleaning, visualization, and statistical tests. All analytical decisions the choice of Case Study 1, the selection of the five techniques and their justification relative to Penafort Winery’s business context, the formulation of hypotheses, the interpretation of statistical outputs, and the final managerial recommendations were made independently by the author. The raw data was sourced directly from Penafort Winery’s internal sales system and was not generated or simulated by any AI tool.