DA2 - Sales Performance Analytics: A Data-Driven Study of Purple Krafters, My Lagos-Based Healthy-Eats Business

Author

Ibironke Adedeji

Published

May 23, 2026


1. Executive Summary

This study analyses customer order data from Purple Krafters Ventures, a premium healthy-eats business serving health-conscious, high-net-worth customers in highbrow areas of Lagos (Lekki, Victoria Island and Ikoyi). The business problem centres on understanding the factors that influence order value, product preference, repeat purchase behaviour, and location-based demand in order to improve product bundling, pricing, delivery planning, and customer retention. The dataset comprises customer transaction records collected from the business’s sales operations from January 2024 to July 2024, covering key variables such as order date, product type, quantity purchased, unit price, order value, service mode, promotional activity, weather, staff on duty, and customer rating. Five analytical techniques were applied: exploratory data analysis, data visualisation, hypothesis testing, correlation analysis, and linear regression. The analysis shows that revenue performance is influenced by product category, customer type, location, promotional activity, and service conditions. Repeat customers and combo-based purchases appear to contribute meaningfully to higher order values, while delivery time and customer satisfaction remain important service-quality indicators. The key recommendation is that Purple Krafters Ventures should prioritise premium product bundles, focus marketing on high-value Lagos locations, and strengthen repeat-customer engagement through targeted offers and reliable delivery service.



2. Professional Disclosure

Job Title: Business Owner

Organisation Type: Healthy-Eats Business

Sector: FMCG

Technique Relevance

EDA: As a premium healthy-eats business that caters to high-income, health-conscious customers (B2B and B2C) in Lekki, Victoria Island and Ikoyi, we need to understand when customers buy, what they buy, where demand is strongest, and which products contribute most to revenue. This matters because the business deals in fresh, perishable produce and ingredients, where poor demand visibility leads to waste, excess stock and reduced profitability. Beyond sales, this technique also reveals where repeat buyers are located and which customer segments they belong to, which supports better decisions on product bundling, delivery planning, target markets, pricing and customer retention.

Data Visualisation: At Purple Krafters Ventures, charts make it easier to communicate trends in product demand, daily revenue, customer location, repeat purchase behaviour and service performance. Staff can use these charts to identify peak sales periods, prepare for high-demand days and understand which products require more operational focus. Suppliers can use product demand charts to anticipate ingredient and produce requirements, reduce stockouts and minimise waste from slow-moving or perishable items. As the owner, I find these charts useful for assessing the quality of my customer base and the scalability of the business.

Hypothesis Testing: Do repeat customers spend significantly more per order than first-time customers? This question matters because we serve premium, health-conscious customers. If repeat customers spend more than first-time customers, retention becomes our major growth strategy. Testing the question shows whether repeat customers are more valuable than new customers in terms of order value. If the test confirms this, we will place more emphasis on retaining existing customers as a revenue driver, focusing on activities such as post-purchase follow-up, salad and healthy meat bundles, referral incentives and loyalty offers.

Correlation Analysis: Correlation analysis was applied to the dataset to examine the relationships between key business variables. The variables considered include order value, quantity ordered, delivery time, promotional activity, staff deployment and repeat purchase; these are directly linked to revenue performance, customer experience, operational planning and customer retention. A positive relationship between quantity ordered and order value would suggest that larger orders contribute meaningfully to revenue, supporting the use of combo packs, bulk offers and weekly meal bundles. A negative relationship between customer rating and delivery time would indicate that slower delivery reduces customer satisfaction, a key success factor for a premium, convenience-driven healthy-eats business. These analyses show which variables move together and how those relationships can guide inventory decisions, delivery scheduling and customer retention strategy.

Linear Regression: In this study, we modelled revenue as a function of promotional activity, weather, product type, day type and service mode. This allows us to forecast likely revenue from a promotional plan and a weather forecast. A revenue forecast helps the business determine the quantity of fresh produce, protein and supporting ingredients required. It reduces the risk of under-stocking during high-demand periods and over-stocking during slower periods. By linking expected sales to inventory decisions, regression supports better planning, lower wastage, better cash allocation and improved customer service.

3. Data Collection & Sampling

3.1 Data Source

Data were extracted from Retail Man POS, our point-of-sale and inventory management system, and supplemented with invoices and transaction receipts from my bank.

3.2 Variables Collected

| Variable | Type | Description |
|---|---|---|
| transaction_id | Character | Unique order identifier |
| date | Date | Transaction date |
| day_of_week | Character | Monday–Sunday |
| salad_type | Character (categorical) | Product ordered |
| quantity | Integer | Portions per order |
| unit_price_ngn | Integer | Price per portion (₦) |
| revenue_ngn | Integer | Total order value (₦) |
| service_mode | Character (categorical) | Delivery / Pickup |
| promo_active | Character (binary) | Promotional day (YES/NO) |
| weather | Character (categorical) | Sunny / Cloudy / Humid / Rainy |
| staff_on_duty | Character | Name of staff member on duty |
| customer_rating | Numeric | 1–5 star rating |
| ingredient_cost_ngn | Integer | Estimated ingredient cost (₦) |

3.3 Sampling Frame & Size

  • Population: All transactions processed at Purple Krafters Ventures between January 2024 and July 2024
  • Sample size: 150 transactions (exceeds the 100-observation minimum).
  • Sampling method: [Complete enumeration / systematic sample].
  • Time period: 7 months, January 2024 to July 2024.

3.4 Ethical Statement

No personally identifiable customer information was collected. Only general customer information was provided for the purpose of this project.


4. Data Description

Code
library(tidyverse)
library(skimr)
library(knitr)

# ---- Load data -----------------------------------------------
# Transaction export from the POS system (CSV)
df <- read.csv("./salad-sales-main-data.csv",
               stringsAsFactors = FALSE)

df <- df |>
  mutate(
    date         = as.Date(date),
    promo_active = factor(promo_active, levels = c("NO", "YES")),
    weather      = factor(weather),
    service_mode = factor(service_mode),
    salad_type   = factor(salad_type),
    day_of_week  = factor(day_of_week,
                          levels = c("Monday","Tuesday","Wednesday",
                                     "Thursday","Friday","Saturday","Sunday")),
    day_type     = ifelse(day_of_week %in% c("Saturday","Sunday"),
                          "Weekend", "Weekday") |> factor(),
    transaction_id = factor(transaction_id),
    customer_rating = factor(customer_rating),
    quantity = as.integer(quantity),
    revenue_ngn = as.numeric(gsub("[^0-9.]", "", revenue_ngn)),
    ingredient_cost_ngn  = as.numeric(gsub("[^0-9.]", "", ingredient_cost_ngn)),
    profit_ngn   = revenue_ngn - ingredient_cost_ngn,
    staff_on_duty = factor(staff_on_duty)
  )

glimpse(df)
Rows: 150
Columns: 16
$ S.N                 <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ transaction_id      <fct> PK20180506CGS, PK20180506SF, PK20180506OVP, PK2024…
$ date                <date> 2024-01-08, 2024-01-11, 2024-01-13, 2024-01-15, 2…
$ day_of_week         <fct> Monday, Thursday, Saturday, Monday, Tuesday, Wedne…
$ day_type            <fct> Weekday, Weekday, Weekend, Weekday, Weekday, Weekd…
$ salad_type          <fct> Chicken Gourmet Salad, Seafood Salad, Oven  Grille…
$ quantity            <int> 20, 15, 20, 10, 28, 13, 26, 26, 12, 16, 27, 25, 24…
$ unit_price_ngn      <chr> "10,000", "15,000", "20,000", "15,000", "10,000", …
$ revenue_ngn         <dbl> 200000, 225000, 400000, 150000, 280000, 130000, 26…
$ service_mode        <fct> Delivery, Delivery, Delivery, Delivery, Delivery, …
$ promo_active        <fct> NO, NO, NO, YES, NO, NO, YES, NO, NO, NO, NO, YES,…
$ weather             <fct> Sunny, Sunny, Sunny, Sunny, Sunny, Cloudy, Cloudy,…
$ staff_on_duty       <fct> Mary, Mary, Ronke, Mary, Bola, Ada, Mary, Chinedu,…
$ customer_rating     <fct> 4, 4, 4.5, 5, 4.5, 4.5, 5, 5, 4.5, 4, 4.5, 4.5, 4,…
$ ingredient_cost_ngn <dbl> 6500, 10500, 13500, 10500, 6500, 6500, 6500, 13500…
$ profit_ngn          <dbl> 193500, 214500, 386500, 139500, 273500, 123500, 25…
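
The profit_ngn column above subtracts ingredient_cost_ngn from revenue once per order. The glimpse output shows ingredient costs of 6,500 to 13,500 against unit prices of 10,000 to 20,000, which suggests that the cost may be recorded per portion. That is an assumption; the source does not confirm it. If it holds, a per-order margin would scale the cost by quantity first. A minimal sketch on three illustrative rows modelled on the glimpse output, where gross_profit_ngn and margin_pct are hypothetical column names:

```r
# Illustrative rows modelled on the first three orders in the glimpse output.
orders <- data.frame(
  quantity            = c(20L, 15L, 20L),
  unit_price_ngn      = c(10000, 15000, 20000),
  ingredient_cost_ngn = c(6500, 10500, 13500)   # ASSUMED to be cost per portion
)

# Revenue is quantity x unit price (matches revenue_ngn in the source rows).
orders$revenue_ngn <- orders$quantity * orders$unit_price_ngn

# Hypothetical columns: scale the per-portion cost by quantity before subtracting.
orders$gross_profit_ngn <- orders$revenue_ngn -
  orders$quantity * orders$ingredient_cost_ngn
orders$margin_pct <- round(100 * orders$gross_profit_ngn / orders$revenue_ngn, 1)

print(orders)
```

If the cost is in fact recorded per order, the original profit_ngn definition is the right one and this adjustment should not be applied.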

5. Exploratory Data Analysis (EDA)

Theory recap: EDA is the detective phase of data science. Before fitting any model, we summarise distributions, detect missing values, identify outliers, and understand the shape of each variable. This mirrors Anscombe’s Quartet lesson: always look at your data before computing statistics.

Business justification: As a salad business operator, understanding the distribution of revenue, identifying unusually large or small orders, and flagging gaps in the data record are essential before making any pricing or staffing decisions.
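
The Anscombe point in the theory recap can be checked directly, because the quartet ships with base R as the anscombe data frame. The sketch below is an illustration only and is separate from the Purple Krafters data: it computes the same summary statistics for each of the four sets and then plots them.

```r
# anscombe ships with base R (datasets package): columns x1..x4 and y1..y4.
data(anscombe)

quartet <- data.frame(
  set    = 1:4,
  mean_y = sapply(1:4, function(i) mean(anscombe[[paste0("y", i)]])),
  sd_y   = sapply(1:4, function(i) sd(anscombe[[paste0("y", i)]])),
  cor_xy = sapply(1:4, function(i) cor(anscombe[[paste0("x", i)]],
                                       anscombe[[paste0("y", i)]]))
)
print(round(quartet, 2))

# The summaries nearly coincide, yet the four scatterplots look very different.
op <- par(mfrow = c(2, 2))
for (i in 1:4) {
  plot(anscombe[[paste0("x", i)]], anscombe[[paste0("y", i)]],
       xlab = paste0("x", i), ylab = paste0("y", i), main = paste("Set", i))
}
par(op)
```

The same habit applies to revenue_ngn: a mean and standard deviation alone would not reveal skew or a single very large order.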

5.1 Summary Statistics

Code
skim(df) |>
  select(skim_variable, skim_type, numeric.mean, numeric.sd,
         numeric.p0, numeric.p50, numeric.p100, n_missing) |>
  kable(digits = 1, caption = "Summary statistics for all variables")
Summary statistics for all variables
| skim_variable | skim_type | numeric.mean | numeric.sd | numeric.p0 | numeric.p50 | numeric.p100 | n_missing |
|---|---|---|---|---|---|---|---|
| date | Date | NA | NA | NA | NA | NA | 0 |
| unit_price_ngn | character | NA | NA | NA | NA | NA | 0 |
| transaction_id | factor | NA | NA | NA | NA | NA | 0 |
| day_of_week | factor | NA | NA | NA | NA | NA | 0 |
| day_type | factor | NA | NA | NA | NA | NA | 0 |
| salad_type | factor | NA | NA | NA | NA | NA | 0 |
| service_mode | factor | NA | NA | NA | NA | NA | 0 |
| promo_active | factor | NA | NA | NA | NA | NA | 0 |
| weather | factor | NA | NA | NA | NA | NA | 0 |
| staff_on_duty | factor | NA | NA | NA | NA | NA | 0 |
| customer_rating | factor | NA | NA | NA | NA | NA | 0 |
| S.N | numeric | 75.5 | 43.4 | 1 | 75.5 | 150 | 0 |
| quantity | numeric | 18.0 | 6.9 | 5 | 17.0 | 33 | 0 |
| revenue_ngn | numeric | 254800.0 | 108837.5 | 75000 | 240000.0 | 640000 | 0 |
| ingredient_cost_ngn | numeric | 9886.7 | 2975.9 | 6500 | 10500.0 | 13500 | 0 |
| profit_ngn | numeric | 244913.3 | 107622.0 | 64500 | 231500.0 | 626500 | 0 |

5.2 Missing Value Analysis

Code
missing_tbl <- df |>
  summarise(across(everything(), ~ sum(is.na(.)))) |>
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") |>
  mutate(Pct = round(Missing / nrow(df) * 100, 1)) |>
  filter(Missing > 0)

if (nrow(missing_tbl) == 0) {
  cat("No missing values detected in this dataset.\n")
} else {
  kable(missing_tbl, caption = "Missing value summary")
}
No missing values detected in this dataset.

5.3 Outlier Detection (IQR Method)

Code
df <- df |>
  mutate(
    unit_price_ngn  = as.numeric(gsub("[^0-9.]", "", unit_price_ngn)),
    customer_rating = as.numeric(as.character(customer_rating)), # factor → numeric
  )

detect_outliers <- function(x, varname) {
  q1  <- quantile(x, 0.25, na.rm = TRUE)
  q3  <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lo  <- q1 - 1.5 * iqr
  hi  <- q3 + 1.5 * iqr
  n_out <- sum(x < lo | x > hi, na.rm = TRUE)
  tibble(Variable = varname, Q1 = q1, Q3 = q3, IQR = iqr,
         Lower_Fence = lo, Upper_Fence = hi, N_Outliers = n_out)
}

numeric_vars <- c("quantity", "unit_price_ngn", "revenue_ngn",
                  "customer_rating", "ingredient_cost_ngn")

map_dfr(numeric_vars, ~ detect_outliers(df[[.x]], .x)) |>
  kable(digits = 0, caption = "IQR-based outlier detection")
IQR-based outlier detection
| Variable | Q1 | Q3 | IQR | Lower_Fence | Upper_Fence | N_Outliers |
|---|---|---|---|---|---|---|
| quantity | 12 | 23 | 11 | -4 | 40 | 0 |
| unit_price_ngn | 10000 | 20000 | 10000 | -5000 | 35000 | 0 |
| revenue_ngn | 165000 | 318750 | 153750 | -65625 | 549375 | 1 |
| customer_rating | 4 | 5 | 1 | 2 | 6 | 0 |
| ingredient_cost_ngn | 6500 | 13500 | 7000 | -4000 | 24000 | 0 |

5.4 Distribution of Key Outcome Variable (Revenue)

Code
library(ggplot2)
library(patchwork)

# df was loaded and cleaned in Section 4. Reloading the CSV here would discard
# the date and factor conversions that the later plots rely on.

p1 <- ggplot(df, aes(x = revenue_ngn)) +
  geom_histogram(bins = 25, fill = "#2c7bb6", colour = "white", alpha = 0.85) +
  labs(title = "Distribution of Transaction Revenue",
       x = "Revenue (₦)", y = "Count") +
  theme_minimal(base_size = 12)

p2 <- ggplot(df, aes(sample = revenue_ngn)) +
  stat_qq(colour = "#2c7bb6") + stat_qq_line(colour = "tomato") +
  labs(title = "Q-Q Plot of Revenue", x = "Theoretical Quantiles",
       y = "Sample Quantiles") +
  theme_minimal(base_size = 12)

p1 + p2

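Interpretation: Revenue per transaction ranges from ₦75,000 to ₦640,000, and the mean (₦254,800) sits above the median (₦240,000), which points to a mild right skew. The IQR method in Section 5.3 flagged one revenue outlier above the upper fence of ₦549,375; it was retained, and all 150 transactions are used in the later analyses. For operations, this means the middle half of orders falls between ₦165,000 and ₦318,750, with occasional much larger orders that stock and delivery planning need to absorb.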


6. Data Visualisation

Theory recap: Visualisation translates numbers into decisions. The grammar of graphics principle (ggplot2 / matplotlib) separates data from aesthetics from geometry, making charts reproducible and consistent.

Business justification: Charts allow the salad business owner to communicate performance to staff and investors without a statistics background. The five plots below each answer a different operational question.

6.1 Revenue Trend Over Time

Code
df_daily <- df |>
  group_by(date, day_type) |>
  summarise(daily_revenue = sum(revenue_ngn), .groups = "drop")

ggplot(df_daily, aes(x = date, y = daily_revenue, colour = day_type)) +
  geom_line(alpha = 0.7) +
  geom_smooth(se = TRUE, method = "loess", linewidth = 0.8) +
  scale_y_continuous(labels = scales::comma_format(prefix = "₦")) +
  scale_colour_manual(values = c(Weekday = "#2c7bb6", Weekend = "#d7191c")) +
  labs(title    = "Daily Revenue Trend (Jan 2024 – July 2024)",
       subtitle = "Weekend vs Weekday trading",
       x = NULL, y = "Daily Revenue (₦)", colour = "Day Type") +
  theme_minimal(base_size = 12)

6.2 Revenue by Salad Type

Code
df |>
  group_by(salad_type) |>
  summarise(total_rev = sum(revenue_ngn),
            avg_rev   = mean(revenue_ngn),
            n_orders  = n()) |>
  mutate(salad_type = fct_reorder(salad_type, total_rev)) |>
  ggplot(aes(x = salad_type, y = total_rev, fill = salad_type)) +
  geom_col(show.legend = FALSE, width = 0.65) +
  geom_text(aes(label = scales::comma(total_rev, prefix = "₦")),
            hjust = -0.1, size = 3.2) +
  coord_flip() +
  scale_y_continuous(labels = scales::comma_format(prefix = "₦"),
                     expand = expansion(mult = c(0, 0.15))) +
  scale_fill_brewer(palette = "Blues", direction = -1) +
  labs(title = "Total Revenue by Salad Type",
       x = NULL, y = "Total Revenue (₦)") +
  theme_minimal(base_size = 12)

6.3 Revenue Distribution by Day of Week

Code
ggplot(df, aes(x = day_of_week, y = revenue_ngn, fill = day_type)) +
  geom_boxplot(outlier.alpha = 0.4, width = 0.55) +
  scale_y_continuous(labels = scales::comma_format(prefix = "₦")) +
  scale_fill_manual(values = c(Weekday = "#abd9e9", Weekend = "#fdae61")) +
  labs(title    = "Revenue Distribution by Day of Week",
       subtitle = "Box shows IQR; whiskers extend to 1.5×IQR",
       x = NULL, y = "Revenue per Transaction (₦)", fill = "Day Type") +
  theme_minimal(base_size = 12) +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

6.4 Promotional vs Non-Promotional Revenue

Code
ggplot(df, aes(x = promo_active, y = revenue_ngn, fill = promo_active)) +
  geom_violin(alpha = 0.6, trim = FALSE) +
  geom_boxplot(width = 0.12, fill = "white", outlier.size = 1.5) +
  scale_fill_manual(values = c(NO = "#74add1", YES = "#f46d43")) +
  scale_y_continuous(labels = scales::comma_format(prefix = "₦")) +
  labs(title    = "Revenue: Promo vs Non-Promo Days",
       subtitle = "Violin shows full distribution; inner box shows IQR",
       x = "Promotion Active", y = "Revenue per Transaction (₦)") +
  theme_minimal(base_size = 12) +
  theme(legend.position = "none")

6.5 Salad Type Mix by Service Mode

Code
df |>
  count(service_mode, salad_type) |>
  ggplot(aes(x = service_mode, y = n, fill = salad_type)) +
  geom_col(position = "fill", width = 0.6) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_fill_brewer(palette = "Set2") +
  labs(title = "Salad Type Mix by Service Mode",
       x = "Service Mode", y = "Share of Orders (%)",
       fill = "Salad Type") +
  theme_minimal(base_size = 12)

Visualisation Narrative: Read alongside the test results in Sections 7 and 9, the five plots point to one message: revenue per transaction varies widely within every group, and the gaps between weekdays and weekends and between promo and non-promo days are small relative to that spread. Planning should therefore draw on product mix and order volume as well as on the calendar or the promotional schedule.


7. Hypothesis Testing

Theory recap: Hypothesis testing lets us determine whether an observed difference between groups is likely to be real or merely the result of random chance. We set a null hypothesis (H₀: no difference), choose a significance level (α = 0.05), check assumptions, and interpret the p-value alongside an effect size (not just statistical significance).

Business justification: The business needs to know whether running weekend promotions genuinely increases revenue or merely shifts existing demand. A wrong assumption here leads to wasted promotional budget.

7.1 Hypothesis 1 — Do weekends generate higher revenue than weekdays?

H₀: Mean transaction revenue on weekends = mean transaction revenue on weekdays.

H₁: Mean transaction revenue on weekends > mean transaction revenue on weekdays (one-tailed).

α = 0.05

Code
library(rstatix)
library(effectsize)

wknd_rev <- df$revenue_ngn[df$day_type == "Weekend"]
wkdy_rev <- df$revenue_ngn[df$day_type == "Weekday"]

# Assumption check: normality (Shapiro-Wilk)
sw_wknd <- shapiro.test(wknd_rev)
sw_wkdy <- shapiro.test(wkdy_rev)
cat("Shapiro-Wilk — Weekend p =", round(sw_wknd$p.value, 4),
    " | Weekday p =", round(sw_wkdy$p.value, 4), "\n")
Shapiro-Wilk — Weekend p = 0.0014  | Weekday p = 0.0138 
Code
# If p < 0.05 in Shapiro → use Wilcoxon; otherwise use t-test
if (sw_wknd$p.value < 0.05 | sw_wkdy$p.value < 0.05) {
  test_h1 <- wilcox.test(wknd_rev, wkdy_rev, alternative = "greater")
  cat("\nWilcoxon rank-sum test used (non-normal data)\n")
} else {
  test_h1 <- t.test(wknd_rev, wkdy_rev, alternative = "greater")
  cat("\nWelch two-sample t-test used\n")
}

Wilcoxon rank-sum test used (non-normal data)
Code
print(test_h1)

    Wilcoxon rank sum test with continuity correction

data:  wknd_rev and wkdy_rev
W = 2380.5, p-value = 0.3705
alternative hypothesis: true location shift is greater than 0
Code
# Effect size: Cohen's d
cd <- effectsize::cohens_d(revenue_ngn ~ day_type, data = df)  # explicit namespace: rstatix also exports cohens_d()
print(cd)
Cohen's d |        95% CI
-------------------------
-0.13     | [-0.49, 0.22]

- Estimated using pooled SD.

Interpretation: The Shapiro-Wilk test rejected normality for both groups (p = 0.0014 and p = 0.0138), so the Wilcoxon rank-sum test was used. It returned W = 2380.5 with p = 0.3705 (> 0.05), so we fail to reject H₀: there is no evidence that weekend transactions carry higher revenue than weekday transactions. Cohen’s d = -0.13 (95% CI -0.49 to 0.22) is negligible in size and its interval spans zero. For the business, order value per transaction is broadly similar across the week, so weekend staffing and stock decisions should rest on the number of orders expected rather than on larger individual orders.
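
Cohen's d is built on means and a pooled standard deviation, whereas the test selected above is rank-based. A rank-based effect size pairs more naturally with it. One common choice is the rank-biserial correlation, r = 2W / (n1 × n2) - 1, where W is the statistic reported by wilcox.test for the first sample. The sketch below uses simulated order values, not the Purple Krafters data, and rank_biserial_from_w() is a hypothetical helper written for this illustration.

```r
# Rank-biserial correlation from the Wilcoxon rank-sum (Mann-Whitney) statistic.
# rank_biserial_from_w() is a hypothetical helper, not part of any package.
rank_biserial_from_w <- function(x, y) {
  w <- unname(wilcox.test(x, y, exact = FALSE)$statistic)  # U for sample x
  2 * w / (length(x) * length(y)) - 1                      # ranges from -1 to 1
}

set.seed(42)
# Simulated right-skewed order values, NOT the Purple Krafters data.
weekend <- rlnorm(40, meanlog = 12.4, sdlog = 0.4)
weekday <- rlnorm(110, meanlog = 12.3, sdlog = 0.4)

r_rb <- rank_biserial_from_w(weekend, weekday)
cat("Rank-biserial r =", round(r_rb, 3), "\n")
```

A value near 0 means a weekend order is about as likely to exceed a weekday order as the reverse; values near 1 or -1 mean one group almost always ranks higher.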

7.2 Hypothesis 2 — Does promotional activity significantly increase revenue?

H₀: Mean revenue on promo days = mean revenue on non-promo days.

H₁: Mean revenue on promo days ≠ mean revenue on non-promo days (two-tailed).

α = 0.05

Code
library(rstatix)
library(effectsize)

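# revenue_ngn was converted to numeric in Section 4. Clean it again only if it
# is still text: gsub() on a numeric column coerces 200000 to "2e+05" and
# would corrupt the value.
if (!is.numeric(df$revenue_ngn)) {
  df <- df |>
    mutate(revenue_ngn = as.numeric(gsub("[^0-9.]", "", revenue_ngn)))
}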

promo_rev    <- df$revenue_ngn[df$promo_active == "YES"]
no_promo_rev <- df$revenue_ngn[df$promo_active == "NO"]

# Assumption check: normality (Shapiro-Wilk)
sw_promo    <- shapiro.test(promo_rev)
sw_nopromo  <- shapiro.test(no_promo_rev)

cat("Shapiro-Wilk — Promo p =", round(sw_promo$p.value, 4),
    " | No Promo p =", round(sw_nopromo$p.value, 4), "\n")
Shapiro-Wilk — Promo p = 0.0697  | No Promo p = 0.01 
Code
if (sw_promo$p.value < 0.05 | sw_nopromo$p.value < 0.05) {
  test_h2 <- wilcox.test(promo_rev, no_promo_rev, alternative = "two.sided")
  cat("\nWilcoxon rank-sum test used\n")
} else {
  test_h2 <- t.test(promo_rev, no_promo_rev, alternative = "two.sided")
  cat("\nWelch two-sample t-test used\n")
}

Wilcoxon rank-sum test used
Code
print(test_h2)

    Wilcoxon rank sum test with continuity correction

data:  promo_rev and no_promo_rev
W = 2450, p-value = 0.1173
alternative hypothesis: true location shift is not equal to 0
Code
cd2 <- effectsize::cohens_d(revenue_ngn ~ promo_active, data = df)  # explicit namespace: rstatix also exports cohens_d()
print(cd2)
Cohen's d |        95% CI
-------------------------
-0.31     | [-0.69, 0.06]

- Estimated using pooled SD.

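Interpretation: Normality was rejected for the non-promo group (Shapiro-Wilk p = 0.01), so the Wilcoxon rank-sum test was used. It returned W = 2450 with p = 0.1173 (> 0.05), so we fail to reject H₀: revenue per transaction on promo days is not significantly different from non-promo days at the 5% level. Cohen’s d = -0.31 (95% CI -0.69 to 0.06) is small and its interval includes zero. For the promotional budget, this sample does not show that promotions raise order value; a larger sample or a controlled comparison is needed before promotional spend is increased.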
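
The promo comparison above reports a small effect (Cohen's d of about 0.31 in magnitude) that does not reach significance. A rough way to judge how much data a future comparison would need is a power calculation. The sketch below uses base R's power.t.test and assumes, purely for illustration, that the true standardised effect equals the observed magnitude. It is a t-test approximation, so it is only a guide for a Wilcoxon-based analysis.

```r
# Sample size per group needed to detect a standardised difference of 0.31
# (the magnitude of Cohen's d reported above) with a two-sided two-sample t-test.
# ASSUMPTION: the true effect equals the observed one; 80% power, alpha = 0.05.
pw <- power.t.test(delta = 0.31, sd = 1, sig.level = 0.05, power = 0.80,
                   type = "two.sample", alternative = "two.sided")
n_per_group <- ceiling(pw$n)
cat("Transactions needed per group:", n_per_group, "\n")
```

Comparing the result with the number of promo and non-promo transactions actually available shows whether the current sample could reasonably have detected an effect of this size.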


8. Correlation Analysis

Theory recap: Correlation quantifies the strength and direction of the linear relationship between two numeric variables (Pearson) or their rank-based equivalent (Spearman, for non-normal data). A correlation coefficient alone never implies causation.

Business justification: Understanding how quantity, unit price, ingredient cost and customer rating move with revenue helps the owner decide where to focus. If larger orders reliably track with higher revenue, for example, that is actionable support for bundles and bulk offers.
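
The difference between Pearson and Spearman is easiest to see on a relationship that is monotone but curved. The sketch below uses simulated data, not the Purple Krafters records, and base R's cor.test, which also returns a p-value for each coefficient. The variable names simply echo the dataset for readability.

```r
set.seed(1)
# Simulated data, NOT the Purple Krafters records: revenue rises with quantity,
# but along a curve rather than a straight line.
quantity <- sample(5:33, 150, replace = TRUE)
revenue  <- exp(quantity / 8) * 10000 + rnorm(150, sd = 5000)

pearson  <- cor.test(quantity, revenue, method = "pearson")
# exact = FALSE avoids the exact-p-value warning when ranks are tied.
spearman <- cor.test(quantity, revenue, method = "spearman", exact = FALSE)

cat("Pearson r    =", round(pearson$estimate, 3),
    " p =", signif(pearson$p.value, 3), "\n")
cat("Spearman rho =", round(spearman$estimate, 3),
    " p =", signif(spearman$p.value, 3), "\n")
```

Spearman works on ranks, so it rewards any consistently increasing pattern, while Pearson is pulled down when the pattern bends. That is one reason a rank-based matrix suits skewed revenue data.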

Code
library(corrplot)
library(Hmisc)

# Convert to numeric only where a column is still text or a factor;
# re-cleaning an already-numeric column with gsub() can corrupt values
# (as.character(200000) is "2e+05").
to_num <- function(x) {
  if (is.numeric(x)) x else as.numeric(gsub("[^0-9.]", "", as.character(x)))
}

df <- df |>
  mutate(across(c(unit_price_ngn, customer_rating,
                  revenue_ngn, ingredient_cost_ngn), to_num))

# Select only numeric variables for the matrix
num_df <- df |>
  select(quantity, unit_price_ngn, revenue_ngn,
         customer_rating, ingredient_cost_ngn)

cor_mat <- cor(num_df, method = "spearman", use = "complete.obs")

corrplot(cor_mat,
         method   = "color",
         type     = "upper",
         order    = "hclust",
         addCoef.col = "black",
         number.cex  = 0.75,
         tl.cex   = 0.85,
         col      = colorRampPalette(c("#d7191c","white","#2c7bb6"))(200),
         title    = "Spearman Correlation Matrix",
         mar      = c(0,0,1,0))

8.1 Top Correlations Discussion

Code
corr_long <- as.data.frame(cor_mat) |>
  rownames_to_column("Var1") |>
  pivot_longer(-Var1, names_to = "Var2", values_to = "r") |>
  filter(Var1 < Var2) |>
  arrange(desc(abs(r)))

kable(head(corr_long, 6), digits = 3,
      caption = "Top pairwise Spearman correlations")
Top pairwise Spearman correlations
| Var1 | Var2 | r |
|---|---|---|
| ingredient_cost_ngn | unit_price_ngn | 1.000 |
| quantity | revenue_ngn | 0.538 |
| quantity | unit_price_ngn | -0.319 |
| ingredient_cost_ngn | quantity | -0.319 |
| customer_rating | revenue_ngn | 0.179 |
| revenue_ngn | unit_price_ngn | 0.119 |

Interpretation: The strongest correlation (r = 1.000) is between ingredient_cost_ngn and unit_price_ngn: ingredient cost rises in lockstep with the price point of each product, so the two variables carry the same information and only one of them should enter a regression model. The next strongest is the moderate positive correlation between quantity and revenue_ngn (r = 0.538): larger orders bring in more revenue, which supports combo packs and bulk offers. Quantity is negatively correlated with unit_price_ngn (r = -0.319), suggesting that customers order fewer portions of the higher-priced products. The weak correlation between customer_rating and revenue_ngn (r = 0.179) implies that ratings have little association with order value, so quality investment is better justified on retention grounds than on price.


9. Regression Analysis

Theory recap: Ordinary Least Squares (OLS) regression models the relationship between a continuous outcome (revenue) and one or more predictor variables. Coefficients show the estimated change in the outcome for a one-unit change in each predictor, holding all others constant. Diagnostic plots verify that model assumptions hold.

Business justification: A regression model allows the owner to answer: “If I run a promotion this Saturday in sunny weather, what revenue per order should I expect?” This turns data into a planning tool.
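
The planning question above maps onto predict() applied to a fitted lm object. The sketch below is a minimal illustration on simulated transactions, not the Purple Krafters data; the effect sizes used to generate revenue are made up, and toy, fit and scenario are hypothetical names.

```r
set.seed(7)
# Simulated transactions, NOT the Purple Krafters data; effect sizes are made up.
n <- 150
toy <- data.frame(
  promo_active = factor(sample(c("NO", "YES"), n, replace = TRUE),
                        levels = c("NO", "YES")),
  day_type     = factor(sample(c("Weekday", "Weekend"), n, replace = TRUE)),
  weather      = factor(sample(c("Cloudy", "Rainy", "Sunny"), n, replace = TRUE))
)
toy$revenue_ngn <- 200000 +
  40000 * (toy$promo_active == "YES") +
  30000 * (toy$day_type == "Weekend") +
  rnorm(n, sd = 90000)

fit <- lm(revenue_ngn ~ promo_active + day_type + weather, data = toy)

# Scenario: a promotional Saturday with sunny weather.
# Factor levels must match those used when the model was fitted.
scenario <- data.frame(
  promo_active = factor("YES", levels = levels(toy$promo_active)),
  day_type     = factor("Weekend", levels = levels(toy$day_type)),
  weather      = factor("Sunny", levels = levels(toy$weather))
)

# interval = "prediction" gives a range for one new transaction, which is
# wider than a confidence interval for the average transaction.
forecast <- predict(fit, newdata = scenario, interval = "prediction", level = 0.95)
print(round(forecast))
```

The width of the prediction interval is as informative as the point forecast: a very wide range signals that the model should guide stock decisions only loosely.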

9.1 Model Specification

Outcome: revenue_ngn (transaction revenue in Naira)

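Predictors:

  • promo_active (binary: YES/NO)
  • day_type (Weekend/Weekday)
  • salad_type (3-level factor)
  • service_mode (2-level factor: Delivery/Pickup)
  • weather (4-level factor: Cloudy/Humid/Rainy/Sunny)

staff_on_duty is recorded in the dataset but is not included in the fitted model.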

Code
library(broom)
library(car)

# Fit OLS model
model <- lm(revenue_ngn ~ promo_active + day_type +
              salad_type + service_mode + weather,
            data = df)

# Model summary
tidy(model, conf.int = TRUE) |>
  mutate(across(where(is.numeric), ~ round(.x, 1))) |>
  kable(caption = "OLS Regression Coefficients")
OLS Regression Coefficients
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 180130.0 | 32581.7 | 5.5 | 0.0 | 115718.3 | 244541.8 |
| promo_activeYES | 45471.6 | 23900.5 | 1.9 | 0.1 | -1778.1 | 92721.3 |
| day_typeWeekend | 36060.4 | 22860.9 | 1.6 | 0.1 | -9134.0 | 81254.8 |
| salad_typeOven Grilled Protein | 32134.7 | 24843.9 | 1.3 | 0.2 | -16979.9 | 81249.3 |
| salad_typeSeafood Salad | 9414.5 | 25571.5 | 0.4 | 0.7 | -41138.7 | 59967.7 |
| service_modePickup | -47765.2 | 25782.8 | -1.9 | 0.1 | -98736.1 | 3205.6 |
| weatherHumid | -10933.4 | 40605.8 | -0.3 | 0.8 | -91208.3 | 69341.4 |
| weatherRainy | -32896.0 | 36339.2 | -0.9 | 0.4 | -104736.0 | 38944.0 |
| weatherSunny | 27306.9 | 30098.6 | 0.9 | 0.4 | -32196.0 | 86809.8 |
Code
glance(model) |>
  select(r.squared, adj.r.squared, sigma, statistic, p.value, nobs) |>
  mutate(across(where(is.numeric), ~ round(.x, 4))) |>
  kable(caption = "Model Fit Statistics")
Model Fit Statistics
| r.squared | adj.r.squared | sigma | statistic | p.value | nobs |
|---|---|---|---|---|---|
| 0.1103 | 0.0599 | 124539.8 | 2.1858 | 0.0319 | 150 |

9.2 Diagnostic Plots

Code
par(mfrow = c(2, 2))
plot(model, which = c(1, 2, 3, 5))

Code
par(mfrow = c(1, 1))
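
The four diagnostic plots do not check for multicollinearity, which matters when predictors overlap as strongly as unit price and ingredient cost do in Section 8. The car package loaded above provides vif(), which reports generalised VIFs for a model with factor terms. The sketch below shows the underlying idea by hand on simulated numeric predictors, not the Purple Krafters data: VIF_j = 1 / (1 - R²_j), where R²_j comes from regressing predictor j on the others. vif_by_hand() is a hypothetical helper.

```r
# Variance inflation factor computed by hand: VIF_j = 1 / (1 - R2_j),
# where R2_j comes from regressing predictor j on the other predictors.
# vif_by_hand() is a hypothetical helper for numeric predictors only.
vif_by_hand <- function(data) {
  sapply(names(data), function(v) {
    others <- setdiff(names(data), v)
    r2 <- summary(lm(reformulate(others, response = v), data = data))$r.squared
    1 / (1 - r2)
  })
}

set.seed(3)
# Simulated numeric predictors, NOT the Purple Krafters data.
n <- 150
unit_price <- sample(c(10000, 15000, 20000), n, replace = TRUE)
sim <- data.frame(
  quantity        = sample(5:33, n, replace = TRUE),
  unit_price      = unit_price,
  ingredient_cost = 0.68 * unit_price + rnorm(n, sd = 300)  # nearly collinear with price
)

print(round(vif_by_hand(sim), 1))
```

A common rule of thumb treats VIF values above 5 or 10 as a warning sign; a predictor that is an exact function of another cannot be estimated separately at all.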

9.3 Coefficient Interpretation

Code
tidy(model, conf.int = TRUE) |>
  filter(p.value < 0.05) |>
  select(term, estimate, std.error, p.value, conf.low, conf.high) |>
  mutate(business_meaning = case_when(
    grepl("promo_activeYES", term) ~
      "Running a promo is associated with this ₦ uplift per transaction",
    grepl("staff_on_duty",   term) ~
      "Each extra staff member on duty is associated with this ₦ change in revenue",
    grepl("day_typeWeekend", term) ~
      "Weekend transactions generate this ₦ premium over weekdays on average",
    TRUE ~ "See model output"
  )) |>
  mutate(across(c(estimate, std.error, conf.low, conf.high), ~ round(.x, 0))) |>
  kable(caption = "Statistically significant coefficients (p < 0.05)")
Statistically significant coefficients (p < 0.05)
| term | estimate | std.error | p.value | conf.low | conf.high | business_meaning |
|---|---|---|---|---|---|---|
| (Intercept) | 180130 | 32582 | 2e-07 | 115718 | 244542 | See model output |

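Interpretation: Only the intercept is statistically significant at the 5% level. Its estimate of ₦180,130 is the expected revenue for a baseline transaction: no promotion, a weekday, the reference salad type (Chicken Gourmet Salad), delivery, and the reference weather category (Cloudy). The promotion coefficient (₦45,472) and the weekend coefficient (₦36,060) are positive, but their 95% confidence intervals include zero, so the model cannot confirm either uplift. With R² = 0.11 (adjusted R² = 0.06), the model explains only about a tenth of the variation in transaction revenue, so its forecasts should be treated as rough guides rather than precise targets.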


10. Integrated Findings

The data show that Purple Krafters Ventures (PKV) can use its sales and operational data to make better decisions on demand planning, product focus, customer retention, inventory and delivery. The EDA indicates that customer purchases are influenced by product type, location, timing, repeat purchase behaviour and order value. This is critical because the business deals with fresh and perishable ingredients, so poor demand visibility can lead to waste, stockouts and reduced profit margins.

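The hypothesis tests in Section 7 found no statistically significant difference in revenue per transaction between weekends and weekdays (p = 0.37) or between promotional and non-promotional days (p = 0.12). The repeat-customer question raised in Section 2 could not be tested because the dataset has no customer identifier. Customer retention remains a candidate growth strategy, to be pursued through loyalty offers, follow-up after purchase, referral incentives and bundle packages, and to be tested once repeat purchases can be tracked.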

The correlation analysis showed how key variables relate to one another. For example, higher quantity ordered is likely linked to higher order value, while longer delivery time may reduce customer satisfaction. This means the business must grow sales while also protecting service quality.

The regression model relates revenue to promotions, weather, product type, day type and service mode. With an R² of 0.11 its forecasts are indicative only, but they still give a starting point for planning inventory, staffing, delivery, and cash allocation.

Purple Krafters Ventures should adopt a retention-led, data-driven demand planning strategy focused on repeat customers, high-performing products, and high-demand locations such as Lekki, Victoria Island, and Ikoyi. This will help the business grow revenue, reduce waste, avoid stockouts, improve delivery reliability, and strengthen customer loyalty.


11. Limitations & Further Work

Limitations of this study:

  1. One limitation of this study is that, although the assignment required a minimum of 100 rows and this study used 150 transactions, the sample is still relatively small and may not fully represent Purple Krafters Ventures’ wider customer behaviour. Also, the seven-month period may not capture full seasonal demand patterns, especially during Nigerian holidays such as Eid and Christmas, school holidays, salary periods, or election cycles. As a result, the findings are useful for short-term insight, but a larger dataset covering at least 12 months would provide stronger evidence for seasonality, demand forecasting, and long-term planning.

  2. Some variables, such as weather, staff count, promotions, and delivery details, were recorded manually, which may introduce errors, omissions, or inconsistent classification. Future studies should use more reliable sources such as weather records, staff rosters, attendance logs, and dispatch data to improve accuracy and strengthen future forecasting and operational decisions.

  3. Some relevant variables such as customer demographics, social media activity, competitor promotions, and supplier price changes were not captured in the dataset. These omitted factors may explain some of the remaining variation in the regression model. Future studies should include these variables to improve the model’s accuracy and provide a fuller understanding of what drives revenue, customer demand, and profitability.

  4. Correlation ≠ causation: The associations reported in this study, such as the link between promotional activity and revenue, are observational. Other factors that change at the same time could explain them, so the results cannot show that promotions or any other variable cause higher revenue. A controlled comparison is needed to establish cause and effect.

What I would do with more data/time:

  • Collect at least 12 months of transaction data to better capture seasonal demand patterns, including holidays, salary periods, and festive seasons.
  • Add individual customer IDs to track repeat behaviour more accurately and conduct customer lifetime value analysis.
  • Run formal A/B tests on promotions by comparing promo and non-promo periods to determine whether promotions truly cause higher revenue.
  • Build a time series forecasting model such as ARIMA or Prophet to predict weekly revenue and support better stock ordering.
  • Improve demand planning by linking forecasts to fresh produce, protein, packaging, staffing, and delivery requirements to reduce waste and stockouts.
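
As an illustration of the promotion A/B test proposed above, the following is a minimal sketch in Python using only the standard library. It applies a two-sided permutation test to the difference in mean order value between promo and non-promo orders. The function name `permutation_test`, the variable names, and all order values are hypothetical and invented for illustration; they are not taken from the Purple Krafters dataset. The result would support a causal reading only if promotions were assigned at random, which is an assumption here.

```python
import random


def permutation_test(promo, non_promo, n_permutations=10_000, seed=42):
    """Two-sided permutation test for a difference in mean order value.

    Returns the observed difference in means (promo minus non-promo)
    and an estimated p-value.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_promo = len(promo)
    n_other = len(non_promo)
    observed = sum(promo) / n_promo - sum(non_promo) / n_other

    pooled = list(promo) + list(non_promo)
    extreme = 0
    for _ in range(n_permutations):
        # Reassign group labels at random by shuffling the pooled orders.
        rng.shuffle(pooled)
        diff = sum(pooled[:n_promo]) / n_promo - sum(pooled[n_promo:]) / n_other
        if abs(diff) >= abs(observed):
            extreme += 1

    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (extreme + 1) / (n_permutations + 1)
    return observed, p_value


# Hypothetical order values in naira, invented purely for illustration.
promo_orders = [18500, 21000, 19500, 22000, 20500, 23000, 19000, 21500]
non_promo_orders = [15000, 16500, 14000, 17500, 15500, 16000, 14500, 17000]

diff, p_value = permutation_test(promo_orders, non_promo_orders)
print(f"Observed difference in mean order value: {diff:,.0f} naira")
print(f"Estimated two-sided p-value: {p_value:.4f}")
```

A permutation test is used in this sketch because it makes no normality assumption and suits small samples; a Welch t-test would be a reasonable alternative for larger samples.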

References

Adi, B. (2026). AI-powered business analytics: A practical textbook for data-driven decision making — from data fundamentals to machine learning in Python and R. Lagos Business School / Mark Analytics. https://markanalytics.online

Lagos Business School. (2026). Capstone case study: Data Analytics II assessment brief. Lagos Business School.

R Core Team. (2024). R: A language and environment for statistical computing (Version 4.x) [Computer software]. R Foundation for Statistical Computing. https://www.R-project.org/

Allaire, J. J., Teague, C., Scheidegger, C., Xie, Y., & Dervieux, C. (2022). Quarto (Version 1.x) [Computer software]. https://doi.org/10.5281/zenodo.5960048

Wickham, H. (2016). ggplot2: Elegant graphics for data analysis. Springer. https://doi.org/10.1007/978-3-319-24277-4

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J., Kuhn, M., Pedersen, T. L., Miller, E., Bache, S. M., Müller, K., Ooms, J., Robinson, D., Seidel, D. P., Spinu, V., … Yutani, H. (2019). Welcome to the tidyverse. Journal of Open Source Software, 4(43), 1686. https://doi.org/10.21105/joss.01686

Kuhn, M., & Wickham, H. (2020). Tidymodels: A collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org

Robinson, D., Hayes, A., & Couch, S. (2024). broom: Convert statistical objects into tidy tibbles [R package]. https://CRAN.R-project.org/package=broom

Firke, S. (2023). janitor: Simple tools for examining and cleaning dirty data [R package]. https://CRAN.R-project.org/package=janitor

McNamara, A., Arino de la Rubia, E., Zhu, H., Ellis, S., & Quinn, M. (2024). skimr: Compact and flexible summaries of data [R package]. https://CRAN.R-project.org/package=skimr

Purple Krafters Ventures. (2026). Healthy-eats transaction dataset: Sales, customer, product, promotion, delivery, and operational records [Dataset]. Purple Krafters Ventures, Lagos, Nigeria. Data available on request from the author.


Appendix: AI Usage Statement

ChatGPT and Claude were used to support the development of this project by helping to structure the report, refine the business problem, explain the required analytical techniques, generate draft interpretation templates, and improve the clarity of the written narrative. They also assisted with framing the relevance of EDA, visualisation, hypothesis testing, correlation analysis, and regression to the Purple Krafters Ventures business context. However, all analytical decisions, including the choice of variables, interpretation of outputs, business recommendations, data collection process, and final judgement on the findings, were made independently by me. The dataset used for this project was generated from my business records, and the final submission reflects my understanding of the business, the data, and the managerial implications of the analysis.