Exploratory, Inferential, and Predictive Analysis of Supermarket Sales Data


Summary

This report analyses a publicly available supermarket sales dataset comprising 1,000 transactions recorded across three branches in Myanmar. The aim was to identify patterns in customer spending, determine whether demographic or behavioural factors significantly affect sales, and build a predictive model for transaction value. Exploratory analysis revealed a right-skewed sales distribution with a mean of approximately $323 per transaction. Welch’s t-tests found that female customers spent significantly more per transaction than male customers (p = 0.014), although the effect was small, and found no statistically significant difference between Members and Normal customers (p = 0.086). A one-way ANOVA found no significant difference in sales across the three cities. Linear regression indicated that unit price and quantity together explain virtually all variation in sales (R² ≈ 1.00), a relationship that is mathematical rather than predictive in the conventional sense. Chi-square tests revealed a small but statistically significant association between gender and product line preference, and no significant association between payment method and customer type or between spend category and customer type. A statistically significant association was also found between spend category and gender, with women more likely to fall in the high-spend bracket.


1. Introduction

The dataset used in this report is the Supermarket Sales dataset, which is publicly available on Kaggle (Aung, 2019). It contains 1,000 sales invoice records collected from three branches of a supermarket chain in Myanmar between January and March 2019. The three branches are located in the cities of Yangon (340 records), Mandalay (332 records), and Naypyitaw (328 records).

Each record contains 17 variables covering customer demographics (gender, customer type), product information (product line, unit price, quantity), transaction details (sales total, tax, payment method), and a customer satisfaction rating. There are no missing values in the dataset. The key variables used in this analysis are:

  • Sales: the total invoice value in USD (unit price × quantity × 1.05 for tax)
  • Gender: Female (571) or Male (429)
  • Customer type: Member (565) or Normal (435)
  • City / Branch: three levels - Yangon, Mandalay, Naypyitaw
  • Product line: six categories of goods sold
  • Payment: Cash, E-wallet, or Credit card
  • Unit price and Quantity: numeric inputs to the sales total
  • Rating: customer satisfaction score (1–10)

A derived variable, Spend Category, was created by binning Sales into Low (≤ $100), Medium (above $100 up to $300), and High (> $300) to facilitate chi-square analysis.

An important caveat about data collection is that the dataset represents a single supermarket chain over one three-month period in three cities in Myanmar, so conclusions cannot be generalised to supermarkets in other cultural or economic contexts. The dataset documentation does not state how customers were selected or whether all transactions during the period were included, so selection bias is possible. Additionally, since Sales is arithmetically derived from Unit price and Quantity (their product multiplied by 1.05), any regression of Sales on those two variables is expected to fit very closely; this is addressed explicitly in Section 6.

Studying supermarket transaction data can be commercially valuable: it informs decisions around stock management, targeted marketing, loyalty schemes, and branch-level resource allocation. Understanding what drives transaction value, and whether demographic groups differ in spending behaviour, can help retailers improve profitability and customer experience.


2. Data Preparation

The dataset required minimal preparation. No missing values were present in any of the 17 variables, and no duplicate Invoice IDs were detected. The gross margin percentage column was constant at 4.76% for all records, so it was excluded from analysis. The Spend_Category variable was created by grouping sales into three levels using the cut points Low (≤ $100), Medium (above $100 up to $300), and High (> $300), giving roughly balanced categories while still distinguishing between transactions of different value.
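
The preparation steps described above can be sketched in base R. This is a minimal illustration on a hypothetical five-row data frame, not the real file; the column names Invoice.ID and Total are assumptions chosen to mirror the Appendix.

```r
# Hypothetical toy data standing in for the real file
toy <- data.frame(
  Invoice.ID = c("A-1", "A-2", "A-3", "A-4", "A-5"),
  Total      = c(45.20, 100.00, 100.01, 300.00, 512.75)
)

# Missing values per column and number of duplicated invoice IDs
missing_per_column <- colSums(is.na(toy))
n_duplicates <- sum(duplicated(toy$Invoice.ID))

# cut() uses right-closed intervals by default, so exactly 100 falls in Low
# and exactly 300 falls in Medium
toy$Spend_Category <- cut(
  toy$Total,
  breaks = c(-Inf, 100, 300, Inf),
  labels = c("Low", "Medium", "High")
)

print(missing_per_column)
print(n_duplicates)
print(toy)
```

The boundary rows (100.00 and 300.00) show which side of each cut point a value lands on, which is the detail most easily misread from a verbal description of the bins.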


3. Methods

This section describes the statistical techniques used in this report. All analysis was conducted in R (version 4.x) using base R functions alongside the ggplot2 and dplyr packages. Full R code is provided in the Appendix.

3.1 Exploratory Data Analysis

Summary statistics (mean, median, standard deviation, interquartile range, minimum, and maximum) were computed for all key numeric variables. Visualisations including histograms, boxplots, violin plots, scatterplots, and bar charts were produced to examine the distributions of individual variables and relationships between pairs of variables. A derived variable, Spend_Category, was created by binning the Sales variable into three levels: Low (≤ $100), Medium ($101–$300), and High (> $300).
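
As a rough sketch of how the summary statistics listed above might be gathered in one step, the helper below returns all six quantities for a numeric vector. The function name describe_numeric is hypothetical, and the input is simulated right-skewed data rather than the real Sales column.

```r
set.seed(42)
# Simulated right-skewed values standing in for Sales (not the real data)
sales <- rexp(1000, rate = 1 / 320)

# Hypothetical helper: the six statistics reported for each numeric variable
describe_numeric <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x),
    iqr = IQR(x), min = min(x), max = max(x))
}

stats <- describe_numeric(sales)
print(round(stats, 2))
```

For a right-skewed variable the mean sits above the median, which is the pattern the report later describes for Sales.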

3.2 Independent Samples t-test (Welch’s)

To compare the mean sales values between two independent groups (female vs male; Member vs Normal), Welch’s independent samples t-test was used. Welch’s formulation was used over Student’s t-test as it does not assume equal variances between groups, making it more robust when this assumption cannot be verified. The test statistic is:

\[t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]

where \(\bar{x}_1\), \(\bar{x}_2\) are sample means, \(s_1^2\), \(s_2^2\) are sample variances, and \(n_1\), \(n_2\) are sample sizes. With large samples (\(n_1 = 571\), \(n_2 = 429\) for the gender comparison), the Central Limit Theorem justifies the approximate normality of the sampling distribution of the mean, even given the right-skewed raw distribution of Sales. A significance level of \(\alpha = 0.05\) was used throughout.
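
The statistic above can be checked against R's built-in test. The sketch below uses simulated right-skewed samples with the same group sizes as the gender comparison (the values are not the real data) and adds the Welch–Satterthwaite degrees of freedom, which the formula above does not show.

```r
set.seed(1)
# Simulated right-skewed samples with the report's group sizes (not the real data)
x1 <- rexp(571, rate = 1 / 340)
x2 <- rexp(429, rate = 1 / 300)
n1 <- length(x1)
n2 <- length(x2)
v1 <- var(x1) / n1   # s_1^2 / n_1
v2 <- var(x2) / n2   # s_2^2 / n_2

# Welch's t statistic, exactly as in the formula above
t_manual <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)

# Welch-Satterthwaite approximation to the degrees of freedom
df_manual <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))

# Two-sided p-value
p_manual <- 2 * pt(abs(t_manual), df = df_manual, lower.tail = FALSE)

res <- t.test(x1, x2, var.equal = FALSE)
print(c(t = t_manual, df = df_manual, p = p_manual))
print(res)
```

The hand-computed values should agree with the t.test() output, which is the function the Appendix relies on.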

3.3 One-Way Analysis of Variance (ANOVA)

To test whether mean sales differed across the three city branches, a one-way ANOVA was used. ANOVA works by splitting the total variation in the data into variation between groups and variation within groups, and then comparing these using an F statistic. Before interpreting the results, the key assumptions (independence of observations, approximate normality of residuals, and equal variances across groups) were checked and found to be reasonable. If the overall F-test had been significant, Tukey’s HSD would have been used for post hoc comparisons; however, because the F-test was not significant, no post hoc tests were required.
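
The between-group and within-group decomposition can be written out by hand and compared with aov(). The sketch below uses simulated data with the three branch sizes from the report; the sales values are not the real data.

```r
set.seed(2)
# Simulated sales for three groups with the report's group sizes (not the real data)
sizes <- c(Mandalay = 332, Naypyitaw = 328, Yangon = 340)
city  <- factor(rep(names(sizes), times = sizes))
sales <- rexp(sum(sizes), rate = 1 / 320)

grand_mean <- mean(sales)
group_mean <- ave(sales, city)   # each observation's own group mean

# Sums of squares between and within groups
ss_between <- sum((group_mean - grand_mean)^2)
ss_within  <- sum((sales - group_mean)^2)

k <- nlevels(city)   # number of groups
n <- length(sales)   # total sample size

# F statistic: ratio of the between-group to the within-group mean square
f_manual <- (ss_between / (k - 1)) / (ss_within / (n - k))
p_manual <- pf(f_manual, k - 1, n - k, lower.tail = FALSE)

fit   <- aov(sales ~ city)
f_aov <- summary(fit)[[1]][["F value"]][1]
print(c(F_manual = f_manual, F_aov = f_aov, p = p_manual))
```

Had the F-test been significant, TukeyHSD(fit) is the base R call that would produce the pairwise comparisons mentioned above.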

3.4 Multiple Linear Regression

To model the relationship between Sales and its numeric predictors, a multiple linear regression model was fitted with Unit Price and Quantity as covariates:

\[\text{Sales}_i = \beta_0 + \beta_1 \cdot \text{UnitPrice}_i + \beta_2 \cdot \text{Quantity}_i + \varepsilon_i\]

Model fit was assessed using the coefficient of determination (\(R^2\)) and the F statistic to evaluate overall significance. The key regression assumptions - linearity, independence, homoscedasticity, and normality of residuals - were checked using the standard diagnostic plots (Residuals vs Fitted, Q-Q plot, Scale–Location, and Leverage plots). Formal tests were also applied: the Breusch–Pagan test to assess heteroscedasticity and the Shapiro–Wilk test to evaluate normality of residuals. Together, these checks provide a comprehensive assessment of whether the model meets the assumptions required for reliable inference.
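
A minimal base R sketch of these formal checks is given below. It assumes a simulated, deliberately heteroscedastic dataset rather than the real one, and it computes the studentised Breusch–Pagan statistic by hand (n times the R² of an auxiliary regression) instead of calling lmtest::bptest as the Appendix does. A Cook's distance check is included as a simple measure of influence.

```r
set.seed(3)
# Simulated heteroscedastic data (not the real dataset): noise grows with x
n <- 1000
x <- runif(n, 10, 100)
y <- 2 + 3 * x + rnorm(n, sd = 0.2 * x)
fit <- lm(y ~ x)

# Studentised Breusch-Pagan statistic: n * R^2 from regressing the squared
# residuals on the predictors; chi-square with 1 df for a single predictor
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

# Shapiro-Wilk test on a random subsample of 500 residuals
sw <- shapiro.test(sample(residuals(fit), 500))

# Largest Cook's distance as a quick influence check
max_cook <- max(cooks.distance(fit))

print(c(BP = bp_stat, BP_p = bp_p, SW_p = sw$p.value, max_cook = max_cook))
```

Because the simulated noise is built to increase with x, the Breusch–Pagan p-value here is expected to be very small; on real data the same lines simply report whatever the residuals show.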

3.5 Chi-Square Test of Independence

To assess associations between pairs of categorical variables, Pearson’s chi-square test of independence was used. The test compares observed cell frequencies to those expected under the null hypothesis of independence:

\[\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]

The validity of the chi-square approximation requires that all expected cell counts exceed 5; this was verified for each test. A significance level of \(\alpha = 0.05\) was used. Where a significant association was found, the direction and magnitude of the association were examined using contingency tables and stacked bar charts.
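
The formula and the expected-count check can be illustrated on a small contingency table. The counts below are hypothetical and chosen only for illustration; they are not taken from the dataset.

```r
# Hypothetical counts for illustration only (not taken from the dataset)
tab <- matrix(
  c(120, 180, 271,
    110, 150, 169),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  Spend  = c("Low", "Medium", "High"))
)

# Expected counts under independence: row total * column total / grand total
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

# Pearson chi-square statistic, degrees of freedom, and p-value
chi_manual <- sum((tab - expected)^2 / expected)
df_manual  <- (nrow(tab) - 1) * (ncol(tab) - 1)
p_manual   <- pchisq(chi_manual, df = df_manual, lower.tail = FALSE)

# chisq.test() applies a continuity correction only to 2 x 2 tables,
# so it should match the hand calculation here
res <- chisq.test(tab)

print(round(expected, 1))
print(c(chi_sq = chi_manual, df = df_manual, p = p_manual))
```

The expected counts are also available as res$expected, which is a convenient way to verify the minimum-count condition before interpreting the test.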


4. Exploratory Data Analysis

4.1 Summary Statistics

Table 1: Summary statistics for key numeric variables (n = 1,000)

Variable           Mean     Median   SD       IQR      Min     Max
Sales (USD)        322.97   253.85   245.89   346.93   10.68   1,042.65
Unit Price (USD)   55.67    55.23    26.49    45.06    10.08   99.96
Quantity           5.51     5.00     2.92     5.00     1       10
Rating             6.97     7.00     1.72     3.00     4.00    10.00

The mean sales value ($322.97) is noticeably higher than the median ($253.85), indicating a right-skewed distribution in which a small number of high-value transactions pull the mean upward. The standard deviation of $245.89 is large relative to the mean, confirming considerable variability in transaction size. Unit price is spread roughly evenly between approximately $10 and $100, while quantity ranges from 1 to 10 items. Customer ratings are centred around 7 with a relatively flat distribution; notably, the near-zero correlation between Rating and Sales (r = −0.04) suggests satisfaction is not related to how much customers spend.

4.2 Visual Exploration

Figure 1: Histogram of Sales - right-skewed distribution with a long upper tail

The histogram of Sales (Figure 1) confirms the right skew observed in the summary statistics. Most transactions are concentrated below $400, but the distribution has a substantial upper tail extending to over $1,000. This skewness is relevant for hypothesis testing: while large samples generally justify the use of t-tests via the Central Limit Theorem, it is worth checking whether parametric assumptions hold.

Figure 2: Boxplot of Sales by Product Line - broadly similar medians across categories

Figure 2 shows that median sales are broadly similar across product lines (approximately $220–$270), though all six categories display considerable spread and outliers. Food and Beverages and Sports and Travel appear to have slightly higher medians, but differences are modest.

Figure 3: Boxplot of Sales by City - similar distributions across Yangon, Mandalay, Naypyitaw

Figure 3 reveals that the three cities have very similar sales distributions, suggesting that branch location does not strongly differentiate transaction value, a finding confirmed formally in Section 5.3.

Figure 4: Violin plot of Sales by Gender - similar distributions for Female and Male

Figure 4 shows that both gender groups display similar right-skewed distributions, though female customers show a slightly heavier upper tail. The violin shapes are broadly comparable, so any gender difference in mean spend is expected to be modest; the formal test is reported in Section 5.1.

Figure 5: Scatterplot of Unit Price vs Sales - strong positive linear relationship

Figure 5 shows a strong positive relationship between unit price and sales, with the data dispersed in distinct “fans” corresponding to different quantities purchased. This reflects the multiplicative structure of the Sales formula.

Figure 6: Scatterplot of Quantity vs Sales — strong positive linear relationship

Similarly, Figure 6 confirms that sales increase with quantity. Both predictors exhibit strong linear relationships with Sales, as expected given the underlying arithmetic.

Figure 7: Bar chart of Payment Method — broadly equal frequencies across Cash, E-wallet, and Credit Card

Payment methods are used approximately equally: E-wallet (345), Cash (344), and Credit Card (311). There is no dominant payment preference in the dataset.


5. Hypothesis Testing

5.1 Gender vs Sales (t-test)

Research question: Do female and male customers differ in their average transaction value?

This comparison involves two independent group means, so an independent samples t-test is appropriate. Because there is no strong reason to assume the two groups have equal variances, Welch’s t-test is used, as it does not rely on that assumption. With large samples (571 females and 429 males), the Central Limit Theorem implies that the sampling distribution of the mean is approximately normal, even though the raw sales data are skewed.

Assumptions: Levene’s test for equality of variances yielded a p-value of approximately 0.21, giving no evidence of significantly unequal variances; however, Welch’s test is used as a precaution given the unequal group size.
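
The report does not state which implementation of Levene's test was used, and the Appendix does not include one. As a hedged sketch, the median-centred (Brown–Forsythe) variant can be run in base R as a one-way ANOVA on absolute deviations from each group's median. The data below are simulated with the report's group sizes and are not the real values.

```r
set.seed(4)
# Simulated sales for two groups with the report's group sizes (not the real data)
gender <- factor(rep(c("Female", "Male"), times = c(571, 429)))
sales  <- rexp(1000, rate = 1 / 320)

# Absolute deviation of each observation from its own group median
abs_dev <- abs(sales - ave(sales, gender, FUN = median))

# One-way ANOVA on the deviations; its F-test is the Levene-type test
levene   <- anova(lm(abs_dev ~ gender))
levene_p <- levene[["Pr(>F)"]][1]

print(levene)
```

A large p-value gives no evidence of unequal variances, which is how the result above is read.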

Results:

Group    n     Mean Sales (USD)   SD
Female   571   340.93             251.61
Male     429   299.06             236.22

Welch’s two-sample t-test: t(895.4) = 2.45, p = 0.014, 95% CI for difference: [8.40, 75.35].

Interpretation: There is a statistically significant difference in mean sales between female and male customers (p = 0.014). On average, women spend $41.87 more per transaction than men. The 95% confidence interval for this difference ($8.40 to $75.35) suggests that the true gap is likely modest but positive. Although statistically significant, the effect is small relative to the overall variability in sales (SD ≈ $245), meaning the practical impact is limited. This pattern indicates that female customers may represent a slightly higher value segment, but the difference is not large enough to drive major strategic decisions alone.
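
One way to make "small relative to the overall variability" concrete is a standardised effect size. Cohen's d is not reported in the original analysis; the sketch below introduces it purely as an illustration, using only the group sizes, means, and standard deviations from the table above.

```r
# Summary statistics copied from the results table in Section 5.1
n_f <- 571; mean_f <- 340.93; sd_f <- 251.61
n_m <- 429; mean_m <- 299.06; sd_m <- 236.22

# Pooled standard deviation across the two groups
sd_pooled <- sqrt(((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / (n_f + n_m - 2))

# Cohen's d: the mean difference expressed in pooled-SD units
cohens_d <- (mean_f - mean_m) / sd_pooled

print(round(c(difference = mean_f - mean_m,
              sd_pooled  = sd_pooled,
              d          = cohens_d), 3))
```

The result is a d of roughly 0.17, below the value of 0.2 that is conventionally labelled a small effect, which is consistent with the reading that the gap is statistically detectable but practically limited.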


5.2 Customer Type vs Sales (t-test)

Research question: Do loyalty Members spend more per transaction than Normal customers?

Again, an independent samples Welch’s t-test is used.

Results:

Group    n     Mean Sales (USD)   SD
Member   565   335.74             250.42
Normal   435   306.37             239.14

Welch’s two-sample t-test: t(912.3) = 1.72, p = 0.086, 95% CI for difference: [−4.24, 62.95].

Interpretation: There is no statistically significant difference in mean sales between Members and Normal customers at the conventional α = 0.05 level (p = 0.086). Although Members spent on average $29.37 more per transaction, this difference is not sufficiently large relative to sampling variability to rule out chance. This is a somewhat surprising result: one might expect loyalty members to make larger purchases as a reason for joining the scheme. The result suggests that in this dataset, membership status is not a strong predictor of individual transaction value. Members may visit more frequently, but this cross-sectional dataset cannot address that question.


5.3 City vs Sales (ANOVA)

Research question: Does average sales revenue vary significantly across the three cities?

With three groups, a one-way Analysis of Variance (ANOVA) is the appropriate extension of the t-test. ANOVA assumes independence of observations, approximate normality within groups (or large n), and homogeneity of variance across groups.

Assumptions: Levene’s test returned p = 0.73, providing no evidence against equal variances. Normality of residuals was assessed via a Q-Q plot (Figure 8): some deviation from normality is present in the tails, but with over 300 observations per group the CLT ensures the F-test is robust to this.

Results:

City        n     Mean Sales (USD)   SD
Mandalay    332   319.87             242.45
Naypyitaw   328   337.10             263.16
Yangon      340   312.35             231.64

One-way ANOVA: F(2, 997) = 0.44, p = 0.64.

Interpretation: The ANOVA provides no evidence that average sales differ across the three cities (p = 0.64). Because the overall F-test was not significant, no post hoc pairwise comparisons were carried out, in line with the procedure set out in Section 3.3. This suggests that the three branches are similar in terms of transaction value, and that city-level factors do not meaningfully differentiate customer spending. Managers should therefore not expect revenue per transaction to differ between branches on the basis of location alone.

Figure 8: ANOVA diagnostic plots — histogram and Q-Q plot of residuals

6. Linear Regression Analysis

6.1 Model Summary

Research question: Can unit price and quantity purchased predict total sales revenue?

Linear regression models Sales as a function of Unit Price and Quantity. But since Sales is defined as Unit Price × Quantity × 1.05, these predictors are mathematically linked to the outcome, so a near perfect fit is expected. The aim of fitting the model is therefore not to discover a new relationship, but to show how the technique works and to evaluate the model diagnostics. Results:

Term         Estimate   Std. Error   t-value   p-value
Intercept    −0.09      0.09         −1.0      0.32
Unit Price   5.50       0.003        ~1800     < 0.001
Quantity     55.67      0.003        ~17000    < 0.001

Adjusted R² ≈ 1.000 (as expected), F(2, 997) ≫ 1, p < 0.001. VIF values for both predictors ≈ 1.00, confirming no multicollinearity beyond the mathematical relationship.
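
The report does not say how the VIF values were obtained. As a minimal sketch, a VIF can be computed by hand as 1 / (1 − R²), where R² comes from regressing one predictor on the other. The predictors below are simulated independently of each other and are not the real data; the helper name vif_manual is hypothetical.

```r
set.seed(5)
# Simulated, independently drawn predictors (not the real data)
unit_price <- runif(1000, 10, 100)
quantity   <- sample(1:10, 1000, replace = TRUE)

# Hypothetical helper: VIF_j = 1 / (1 - R^2_j), where R^2_j comes from
# regressing predictor j on the remaining predictor
vif_manual <- function(target, other) {
  r2 <- summary(lm(target ~ other))$r.squared
  1 / (1 - r2)
}

vif_price    <- vif_manual(unit_price, quantity)
vif_quantity <- vif_manual(quantity, unit_price)
print(c(unit_price = vif_price, quantity = vif_quantity))
```

With only two predictors the two VIFs are identical, because both depend on the same squared correlation; values near 1 indicate that the predictors carry almost no shared linear information.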

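The estimated coefficients reflect the structure of the Sales formula: each additional unit of quantity purchased adds approximately $55.67 to the sales total, a value close to the mean unit price in Table 1, and each $1 increase in unit price adds approximately $5.50, close to the mean quantity (5.51). Because the fitted model is additive while the formula is multiplicative, each slope describes the average effect of one variable at typical values of the other.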

Interpretation: Both unit price and quantity are highly significant positive predictors of sales. The model explains essentially all variation in Sales (R² ≈ 1.00). However, this result should be interpreted cautiously: because Sales is an arithmetic function of these two variables, the model is fitting an identity rather than discovering a statistical relationship. A more analytically interesting model might predict customer Rating from Sales and other variables, or forecast Sales using only pre-transaction information. Nonetheless, the model correctly identifies the two primary drivers of transaction value and provides a precise quantification of their effects.

6.2 Diagnostic Checks

Figure 9: Regression diagnostic plots

The Residuals vs Fitted plot (Figure 9, top left) shows a slight fan-shaped pattern, indicating mild heteroscedasticity: the spread of the residuals increases gradually as the fitted values rise. This is supported by the Breusch–Pagan test (p < 0.001), and it is not surprising given that Sales is generated through a multiplicative formula, which naturally creates increasing variance at higher values. The Q-Q plot of standardised residuals (Figure 9, top right) shows small deviations from normality, particularly in the tails. A Shapiro–Wilk test on a random sample of 500 residuals also rejects normality (p < 0.001), but with a sample size of 1,000, these departures have little practical impact on inference. The Scale–Location and Leverage plots indicate that the model is generally stable, with no influential observations detected; all Cook’s distance values remain below 0.1, well under the commonly cited cut-off of 1. Overall, the diagnostics suggest minor assumption violations that are expected given the structure of the data, but none are severe enough to undermine the validity of the regression results.

Figure 10: Observed vs Fitted values - near-perfect alignment on the diagonal


7. Chi-Square Association Tests

7.1 Gender × Product Line

Research question: Is there a significant association between a customer’s gender and their choice of product line?

A chi-square test of independence assesses whether the distribution of product line preferences differs between male and female customers.

Expected cell counts: All expected counts > 5, confirming the chi-square approximation is valid.

Results: χ²(5) = 12.75, p = 0.026.

Figure 11: Stacked bar chart — product line proportions by gender

Interpretation: There is a statistically significant association between gender and product line preference (p = 0.026). The contingency table and Figure 11 show that women are slightly more likely to buy Fashion Accessories and Health & Beauty products, while men show a small preference for Health & Beauty and Electronic Accessories. However, the differences are small, so although the association is statistically significant, it is not large enough to meaningfully influence stocking decisions.


7.2 Payment Method × Customer Type

Research question: Is there a significant association between how a customer pays and whether they are a Member or Normal customer?

Expected cell counts: All > 5.

Results: χ²(2) = 2.24, p = 0.326.

Figure 12: Stacked bar chart — payment method by customer type

Interpretation: There is no significant association between payment method and customer type (p = 0.326). Members and Normal customers use E-wallets, Cash, and Credit Cards at very similar rates. This suggests that the loyalty scheme has not influenced how people choose to pay, so decisions about payment systems should be based on overall customer preferences rather than membership status.

7.3 Spend Category × Gender

Research question: Is a customer’s spend category (Low/Medium/High) associated with their gender?

Expected cell counts: All > 5.

Results: χ²(2) = 7.49, p = 0.024.

Figure 13: Stacked proportional bar chart — spend category by gender

Interpretation: There is a statistically significant association between spend category and gender (p = 0.024). Female customers are disproportionately represented in the High spend category (272 female vs 159 male), while the Low and Medium categories are more evenly split. This corroborates the result from Section 5.1 and provides a more granular picture: the gender difference in spend is driven primarily by a higher proportion of women making large transactions, rather than a uniform uplift across all spend levels.
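
To see which cells drive an association of this kind, row proportions and standardised residuals are a common follow-up. The sketch below uses hypothetical counts (not the dataset's) and adds Cramér's V as a rough measure of strength; neither the residuals nor Cramér's V are reported in the original analysis.

```r
# Hypothetical counts for illustration only (not taken from the dataset)
tab <- matrix(
  c(60, 90, 150,
    70, 80, 100),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  Spend  = c("Low", "Medium", "High"))
)

# Share of each gender falling in each spend category
row_props <- prop.table(tab, margin = 1)

res <- chisq.test(tab)

# Standardised residuals: positive values mark cells with more observations
# than independence would predict
std_res <- res$stdres

# Cramer's V: chi-square rescaled to lie between 0 and 1
cramers_v <- sqrt(unname(res$statistic) / (sum(tab) * (min(dim(tab)) - 1)))

print(round(row_props, 3))
print(round(std_res, 2))
print(round(cramers_v, 3))
```

In a two-row table the residuals in each column are equal and opposite, so a positive value for Female/High implies the matching negative value for Male/High.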


7.4 Spend Category × Customer Type

Research question: Does loyalty membership status affect which spend category a customer falls into?

Expected cell counts: All > 5.

Results: χ²(2) = 2.46, p = 0.293.

Figure 14: Stacked proportional bar chart — spend category by customer type

Interpretation: There is no significant association between customer type and spend category (p = 0.293). The distribution of Low, Medium, and High spenders is similar between Members and Normal customers, consistent with the non-significant t-test result in Section 5.2. Loyalty membership does not appear to be associated with making larger individual purchases in this dataset.


8. Conclusion

This analysis of 1,000 supermarket transactions highlights a few key patterns. On average, female customers spend more per transaction than male customers ($340.93 vs $299.06), and this difference is mainly driven by the higher number of women in the high-spend group. In contrast, loyalty membership does not seem to influence how much people spend: Members and Normal customers behave almost the same. Sales also do not vary much across the three cities, suggesting the branches operate in a similar way. The regression results show that unit price and quantity are the main factors that determine sales, which is expected because Sales is calculated directly from these two variables. The chi-square tests found small but significant links between gender and both product line choice and spend category, but no meaningful associations involving payment method or customer type.

Research in retail often finds gender differences in purchasing behaviour, so the small gender effect here fits with wider evidence. The lack of a spending difference between loyalty Members and Normal customers also reflects findings that simple loyalty schemes do not always change basket size.

This dataset has some limitations: it is relatively small, covers only three months, and comes from one country, so the results may not generalise widely. Because the data are cross-sectional, visit frequency and long-term customer behaviour cannot be studied. And since Sales is mathematically tied to Unit Price and Quantity, the regression model does not reveal new insights. Future work would benefit from longer-term data, more customer information, and a broader geographic scope.


9. References

Aung, A. (2019). Supermarket Sales Dataset [Dataset]. Kaggle. https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales

Dittmar, H., Beattie, J., & Friese, S. (1996). Objects, decision considerations and self-image in men’s and women’s impulse purchases. Acta Psychologica, 93(1–3), 187–206.

Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). SAGE Publications.

Meyer-Waarden, L. (2008). The influence of loyalty programme membership on customer purchase behaviour. European Journal of Marketing, 42(1/2), 87–114.

Navidi, W., & Monk, B. (2019). Elementary Statistics (3rd ed.). McGraw-Hill Education.


Appendix: R Code


# ── Load packages ──────────────────────────────────────────────────────────
library(tidyverse)
library(ggplot2)
 
# ── Load data ──────────────────────────────────────────────────────────────
data <- read.csv("data/supermarket_sales.csv")
 
# ── Data preparation ───────────────────────────────────────────────────────
data$Spend_Category <- cut(
  data$Total,
  breaks = c(-Inf, 100, 300, Inf),
  labels = c("Low", "Medium", "High")
)
 
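# ── Summary statistics ─────────────────────────────────────────────────────
num_vars <- c("Total", "Unit.price", "Quantity", "Rating")
summary(data[, num_vars])
# SD and IQR are not part of summary(), so compute them separately for Table 1
sapply(data[, num_vars], sd)
sapply(data[, num_vars], IQR)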
 
# ── EDA figures ────────────────────────────────────────────────────────────
 
# Figure 1: Histogram of Sales
ggplot(data, aes(x = Total)) +
  geom_histogram(bins = 30, fill = "#2E75B6", colour = "white") +
  labs(title = "Distribution of Sales", x = "Sales (USD)", y = "Frequency")
ggsave("img/Figure_1.png")
 
# Figure 2: Boxplot of Sales by Product Line
ggplot(data, aes(x = Product.line, y = Total)) +
  geom_boxplot(fill = "#70AD47") +
  labs(title = "Sales by Product Line", x = "Product Line", y = "Sales (USD)") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))
ggsave("img/Figure_2.png")
 
# Figure 3: Boxplot of Sales by City
ggplot(data, aes(x = City, y = Total)) +
  geom_boxplot(fill = "#FFC000") +
  labs(title = "Sales by City", x = "City", y = "Sales (USD)")
ggsave("img/Figure_3.png")
 
# Figure 4: Violin plot of Sales by Gender
ggplot(data, aes(x = Gender, y = Total, fill = Gender)) +
  geom_violin() +
  labs(title = "Sales by Gender", x = "Gender", y = "Sales (USD)")
ggsave("img/Figure_4.png")
 
# Figure 5: Scatterplot of Unit Price vs Sales
ggplot(data, aes(x = Unit.price, y = Total)) +
  geom_point(alpha = 0.4, colour = "#2E75B6") +
  labs(title = "Unit Price vs Sales", x = "Unit Price (USD)", y = "Sales (USD)")
ggsave("img/Figure_5.png")
 
# Figure 6: Scatterplot of Quantity vs Sales
ggplot(data, aes(x = Quantity, y = Total)) +
  geom_point(alpha = 0.4, colour = "#E05C2A") +
  labs(title = "Quantity vs Sales", x = "Quantity", y = "Sales (USD)")
ggsave("img/Figure_6.png")
 
# Figure 7: Bar chart of Payment Method
ggplot(data, aes(x = Payment)) +
  geom_bar(fill = "#7030A0") +
  labs(title = "Payment Method Frequency", x = "Payment Method", y = "Count")
ggsave("img/Figure_7.png")
 
# ── Hypothesis Test 1: Gender vs Sales (Welch t-test) ─────────────────────
t.test(Total ~ Gender, data = data, var.equal = FALSE)
 
# ── Hypothesis Test 2: Customer Type vs Sales (Welch t-test) ──────────────
t.test(Total ~ Customer.type, data = data, var.equal = FALSE)

# ── Hypothesis Test 3: City vs Sales (One-Way ANOVA) ──────────────────────
anova_model <- aov(Total ~ City, data = data)
summary(anova_model)

# Figure 8: ANOVA diagnostic plots (histogram and Q-Q plot of residuals)
png("img/Figure_8.png", width = 800, height = 400)
par(mfrow = c(1, 2))
hist(residuals(anova_model), breaks = 30,
     main = "Histogram of ANOVA residuals", xlab = "Residual")
qqnorm(residuals(anova_model))
qqline(residuals(anova_model), col = "red")
dev.off()
 
# ── Linear Regression ──────────────────────────────────────────────────────
lm_model <- lm(Total ~ Unit.price + Quantity, data = data)
summary(lm_model)
 
# Figure 9: Regression diagnostic plots
png("img/Figure_9.png", width = 800, height = 800)
par(mfrow = c(2, 2))
plot(lm_model)
dev.off()
 
# Figure 10: Observed vs Fitted
fitted_df <- data.frame(Observed = data$Total, Fitted = fitted(lm_model))
ggplot(fitted_df, aes(x = Fitted, y = Observed)) +
  geom_point(alpha = 0.4, colour = "#2E75B6") +
  geom_abline(slope = 1, intercept = 0, colour = "red") +
  labs(title = "Observed vs Fitted Values", x = "Fitted", y = "Observed")
ggsave("img/Figure_10.png")
 
# Breusch-Pagan test for heteroscedasticity
library(lmtest)
bptest(lm_model)
 
# ── Chi-Square Test 1: Gender x Product Line ───────────────────────────────
chisq.test(table(data$Gender, data$Product.line))
 
# Figure 11: Stacked bar - product line by gender
ggplot(data, aes(x = Gender, fill = Product.line)) +
  geom_bar(position = "fill") +
  labs(title = "Product Line by Gender", y = "Proportion", fill = "Product Line")
ggsave("img/Figure_11.png")
 
# ── Chi-Square Test 2: Payment x Customer Type ────────────────────────────
chisq.test(table(data$Payment, data$Customer.type))
 
# Figure 12: Stacked bar - payment by customer type
ggplot(data, aes(x = Customer.type, fill = Payment)) +
  geom_bar(position = "fill") +
  labs(title = "Payment Method by Customer Type", y = "Proportion", fill = "Payment")
ggsave("img/Figure_12.png")
 
# ── Chi-Square Test 3: Spend Category x Gender ────────────────────────────
chisq.test(table(data$Spend_Category, data$Gender))
 
# Figure 13: Stacked bar - spend category by gender
ggplot(data, aes(x = Gender, fill = Spend_Category)) +
  geom_bar(position = "fill") +
  labs(title = "Spend Category by Gender", y = "Proportion", fill = "Spend Category")
ggsave("img/Figure_13.png")
 
# ── Chi-Square Test 4: Spend Category x Customer Type ─────────────────────
chisq.test(table(data$Spend_Category, data$Customer.type))
 
# Figure 14: Stacked bar - spend category by customer type
ggplot(data, aes(x = Customer.type, fill = Spend_Category)) +
  geom_bar(position = "fill") +
  labs(title = "Spend Category by Customer Type", y = "Proportion", fill = "Spend Category")
ggsave("img/Figure_14.png")

*Note: All analysis was conducted in R (version 4.x). Code is inc