This report analyses a publicly available supermarket sales dataset comprising 1,000 transactions recorded across three branches in Myanmar. The aim was to identify patterns in customer spending, determine whether demographic or behavioural factors significantly affect sales, and build a predictive model for transaction value. Exploratory analysis revealed a right-skewed sales distribution with a mean of approximately £323 per transaction. Welch’s t-tests found a small but statistically significant difference in average spend between female and male customers (p = 0.014), with female customers spending about £42 more per transaction, but no significant difference between Members and Normal customers (p = 0.086). A one-way ANOVA found no significant difference in sales across the three cities. Linear regression indicated that unit price and quantity together explain virtually all variation in sales (R² ≈ 1.00), a relationship that is mathematical rather than predictive in the conventional sense. Chi-square tests found small but statistically significant associations between gender and product line preference (p = 0.026) and between spend category and gender (p = 0.024), with women more likely to fall in the high-spend bracket, but no significant association between payment method and customer type or between customer type and spend category.
The dataset used in this report is the Supermarket Sales dataset, which is publicly available on Kaggle (Aung, 2019). It contains 1,000 sales invoice records collected from three branches of a supermarket chain in Myanmar between January and March 2019. The three branches are located in the cities of Yangon (340 records), Mandalay (332 records), and Naypyitaw (328 records).
Each record contains 17 variables covering customer demographics (gender, customer type), product information (product line, unit price, quantity), transaction details (sales total, tax, payment method), and a customer satisfaction rating. There are no missing values in the dataset. The key variables used in this analysis are Sales, Unit Price, Quantity, and Rating (numeric), together with Gender, Customer Type, City, Product Line, and Payment method (categorical).
A derived variable, Spend Category, was created by binning Sales into Low (≤ £100), Medium (£101–£300), and High (> £300) to facilitate chi-square analysis.
An important caveat is that the dataset represents a single supermarket chain over one three-month period in three cities in Myanmar, so conclusions cannot be generalised to supermarkets in other cultural or economic contexts. The dataset does not state how customers were selected or whether all transactions during the period were included, so selection bias is possible. Additionally, since Sales is arithmetically derived from Unit price and Quantity (their product multiplied by 1.05), any regression of Sales on those two variables will approach a perfect fit; this is addressed explicitly in Section 5.
Studying supermarket transaction data can be commercially valuable: it informs decisions around stock management, targeted marketing, loyalty schemes, and branch-level resource allocation. Understanding what drives transaction value, and whether demographic groups differ in spending behaviour, can help retailers improve profitability and customer experience.
The dataset required minimal preparation. No missing values were present in any of the 17 variables, and no duplicate Invoice IDs were detected. The gross margin percentage column was constant at 4.76% for all records, so it was excluded from the analysis. The Spend_Category variable was created by grouping sales into three levels using the cut points Low (≤ £100), Medium (£101–£300), and High (> £300), giving roughly balanced categories while still distinguishing between transactions of different value.
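The preparation checks described above can be sketched on a small made-up data frame. The values below are invented for illustration, and the column names (`Invoice.ID`, `Total`, `gross.margin.percentage`) are assumptions about how the file is read into R rather than confirmed details of the dataset.

```r
# Toy data standing in for the real file; values and column names are illustrative only.
toy <- data.frame(
  Invoice.ID = c("A-1", "A-2", "A-3", "A-4", "A-5"),
  Total = c(45.10, 100.00, 250.75, 300.00, 812.40),
  gross.margin.percentage = rep(4.76, 5)
)

# Count missing values per column (all zeros means no missing data).
missing_per_column <- colSums(is.na(toy))

# anyDuplicated() returns 0 when every Invoice ID is unique.
n_duplicate_ids <- anyDuplicated(toy$Invoice.ID)

# Flag columns that take a single value and therefore carry no information.
constant_columns <- names(toy)[sapply(toy, function(col) length(unique(col)) == 1)]

# Bin sales into Low (<= 100), Medium (100, 300], and High (> 300).
toy$Spend_Category <- cut(
  toy$Total,
  breaks = c(-Inf, 100, 300, Inf),
  labels = c("Low", "Medium", "High")
)

print(missing_per_column)
print(n_duplicate_ids)
print(constant_columns)
print(table(toy$Spend_Category))
```

Because `cut()` uses right-closed intervals by default, a sale of exactly 100 falls in Low and a sale of exactly 300 falls in Medium, which matches the stated cut points.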
This section describes the statistical techniques used in this report. All analysis was conducted in R (version 4.x) using base R functions alongside the ggplot2, dplyr, and lmtest packages. Full R code is provided in the Appendix.
Summary statistics (mean, median, standard deviation, interquartile
range, minimum, and maximum) were computed for all key numeric
variables. Visualisations including histograms, boxplots, violin plots,
scatterplots, and bar charts were produced to examine the distributions
of individual variables and relationships between pairs of variables. A
derived variable, Spend_Category, was created by binning
the Sales variable into three levels: Low (≤ $100), Medium ($101–$300),
and High (> $300).
To compare the mean sales values between two independent groups (female vs male; Member vs Normal), Welch’s independent samples t-test was used. Welch’s formulation was used over Student’s t-test as it does not assume equal variances between groups, making it more robust when this assumption cannot be verified. The test statistic is:
\[t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}\]
where \(\bar{x}_1\), \(\bar{x}_2\) are sample means, \(s_1^2\), \(s_2^2\) are sample variances, and \(n_1\), \(n_2\) are sample sizes. With large samples (\(n_1 = 571\), \(n_2 = 429\) for the gender comparison), the Central Limit Theorem justifies the approximate normality of the sampling distribution of the mean, even given the right-skewed raw distribution of Sales. A significance level of \(\alpha = 0.05\) was used throughout.
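As a check on the formula above, the following sketch computes the Welch statistic by hand and compares it with R's built-in `t.test()`. The data are simulated (exponential draws with arbitrary group sizes and means), and the helper `welch_t()` is a hypothetical name introduced for this example. The degrees of freedom use the Welch–Satterthwaite approximation, which the text does not spell out.

```r
set.seed(1)
# Simulated right-skewed "sales" for two groups; sizes and means are arbitrary.
group_a <- rexp(300, rate = 1 / 340)
group_b <- rexp(200, rate = 1 / 300)

welch_t <- function(x1, x2) {
  n1 <- length(x1)
  n2 <- length(x2)
  v1 <- var(x1) / n1  # s_1^2 / n_1
  v2 <- var(x2) / n2  # s_2^2 / n_2
  t_stat <- (mean(x1) - mean(x2)) / sqrt(v1 + v2)
  # Welch-Satterthwaite approximation to the degrees of freedom
  df <- (v1 + v2)^2 / (v1^2 / (n1 - 1) + v2^2 / (n2 - 1))
  p_value <- 2 * pt(-abs(t_stat), df)
  c(t = t_stat, df = df, p = p_value)
}

by_hand <- welch_t(group_a, group_b)
built_in <- t.test(group_a, group_b, var.equal = FALSE)

print(by_hand)
print(built_in)
```

The fractional degrees of freedom reported for Welch's test (for example 895.4 rather than 998) come from this approximation.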
To test whether mean sales differed across the three city branches, a one-way ANOVA was used. ANOVA splits the total variation in the data into variation between groups and variation within groups, and compares the two using an F statistic. Before interpreting the results, the key assumptions (independence of observations, approximate normality of residuals, and equal variances across groups) were checked and found to be reasonable. Tukey’s HSD was planned for post hoc comparisons if the overall F test was significant; because it was not, no post hoc tests were required.
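The two-step procedure described above (overall F test first, Tukey's HSD only if it is significant) can be sketched as follows. The data are simulated for three hypothetical branches labelled A, B, and C, with the third given a higher mean by construction so that the post hoc step is triggered. None of this reproduces the report's data.

```r
set.seed(2)
# Simulated sales for three hypothetical branches; branch C has a higher mean by design.
sim <- data.frame(
  City = factor(rep(c("A", "B", "C"), each = 100)),
  Total = c(rnorm(100, 300, 50), rnorm(100, 300, 50), rnorm(100, 360, 50))
)

fit <- aov(Total ~ City, data = sim)
anova_table <- summary(fit)[[1]]
f_p_value <- anova_table[["Pr(>F)"]][1]
print(anova_table)

# Post hoc pairwise comparisons are only interpreted when the overall F test is significant.
if (f_p_value < 0.05) {
  posthoc <- TukeyHSD(fit)
  print(posthoc)
}
```

`TukeyHSD()` returns one row per pair of groups, each with an adjusted confidence interval and p-value, so the family-wise error rate stays at the nominal level.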
To model the relationship between Sales and its numeric predictors, a multiple linear regression model was fitted with Unit Price and Quantity as covariates:
\[\text{Sales}_i = \beta_0 + \beta_1 \cdot \text{UnitPrice}_i + \beta_2 \cdot \text{Quantity}_i + \varepsilon_i\]
Model fit was assessed using the coefficient of determination (\(R^2\)) and the F statistic to evaluate overall significance. The key regression assumptions (linearity, independence, homoscedasticity, and normality of residuals) were checked using the standard diagnostic plots (Residuals vs Fitted, Q-Q plot, Scale–Location, and Leverage plots). Formal tests were also applied: the Breusch–Pagan test to assess heteroscedasticity and the Shapiro–Wilk test to evaluate normality of residuals. Together, these checks provide a comprehensive assessment of whether the model meets the assumptions required for reliable inference.
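A minimal sketch of the two formal tests is given below on simulated data whose error spread grows with one predictor. The variable names (`x1`, `x2`, `y`) and all parameter values are assumptions made for illustration. The Breusch–Pagan statistic is computed by hand in its studentised form (n times the R² from regressing squared residuals on the predictors), so the example does not depend on any add-on package. It illustrates the idea and is not a substitute for a packaged implementation.

```r
set.seed(3)
n <- 1000
# Simulated data whose error spread grows with x1; not the report's data.
x1 <- runif(n, 10, 100)
x2 <- sample(1:10, n, replace = TRUE)
y <- 5 + 2 * x1 + 3 * x2 + rnorm(n, sd = 0.5 * x1)

fit <- lm(y ~ x1 + x2)
res <- residuals(fit)

# Shapiro-Wilk test on a random subsample of 500 residuals
sw <- shapiro.test(sample(res, 500))

# Studentised Breusch-Pagan statistic: n * R^2 from regressing the squared
# residuals on the predictors, referred to a chi-square with 2 df.
aux <- lm(I(res^2) ~ x1 + x2)
bp_stat <- n * summary(aux)$r.squared
bp_p <- pchisq(bp_stat, df = 2, lower.tail = FALSE)

print(sw$p.value)
print(c(BP = bp_stat, p = bp_p))
```

A small Breusch–Pagan p-value indicates that the residual variance changes with the predictors, which is the pattern a fan-shaped Residuals vs Fitted plot shows visually.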
To assess associations between pairs of categorical variables, Pearson’s chi-square test of independence was used. The test compares observed cell frequencies to those expected under the null hypothesis of independence:
\[\chi^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}\]
The validity of the chi-square approximation requires that all expected cell counts exceed 5; this was verified for each test. A significance level of \(\alpha = 0.05\) was used. Where a significant association was found, the direction and magnitude of the association were examined using contingency tables and stacked bar charts.
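The expected counts and the test statistic can be computed directly from a contingency table, as in the sketch below. The 2 × 3 table of counts is made up for illustration and is not taken from the dataset.

```r
# Hypothetical 2 x 3 table of counts (rows: gender, columns: spend category); numbers are made up.
observed <- matrix(
  c(30, 45, 25,
    40, 35, 25),
  nrow = 2, byrow = TRUE,
  dimnames = list(Gender = c("Female", "Male"),
                  Spend = c("Low", "Medium", "High"))
)
n <- sum(observed)

# Expected count for cell (i, j) under independence: row total * column total / n
expected <- outer(rowSums(observed), colSums(observed)) / n

# Pearson chi-square statistic and its degrees of freedom
chi_sq <- sum((observed - expected)^2 / expected)
df <- (nrow(observed) - 1) * (ncol(observed) - 1)
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

# Validity check used in the report: every expected count should exceed 5
all_expected_ok <- all(expected > 5)

built_in <- chisq.test(observed)
print(expected)
print(c(chi_sq = chi_sq, df = df, p = p_value))
print(all_expected_ok)
```

In practice the same expected counts are available as `chisq.test(observed)$expected`, which is a convenient way to run the validity check before reading the p-value.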
Table 1: Summary statistics for key numeric variables (n = 1,000)
| Variable | Mean | Median | SD | IQR | Min | Max |
|---|---|---|---|---|---|---|
| Sales (USD) | 322.97 | 253.85 | 245.89 | 346.93 | 10.68 | 1,042.65 |
| Unit Price (USD) | 55.67 | 55.23 | 26.49 | 45.06 | 10.08 | 99.96 |
| Quantity | 5.51 | 5.00 | 2.92 | 5.00 | 1 | 10 |
| Rating | 6.97 | 7.00 | 1.72 | 3.00 | 4.00 | 10.00 |
The mean sales value (£322.97) is noticeably higher than the median (£253.85), indicating a right-skewed distribution in which a small number of high-value transactions pull the mean upward. The standard deviation of £245.89 is large relative to the mean, confirming considerable variability in transaction size. Unit price is spread roughly evenly between approximately £10 and £100, while quantity ranges from 1 to 10 items. Customer ratings are centred around 7 with a relatively flat distribution; notably, the near-zero correlation between Rating and Sales (r = −0.04) suggests that satisfaction is essentially unrelated to how much customers spend.
The histogram of Sales (Figure 1) confirms the right skew observed in the summary statistics. Most transactions are concentrated below £400, but the distribution has a substantial upper tail extending to over £1,000. This skewness is relevant for hypothesis testing: while large samples generally justify the use of t-tests via the Central Limit Theorem, it is worth checking whether parametric assumptions hold.
Figure 2 shows that median sales are broadly similar across product lines (approximately £220–£270), though all six categories display considerable spread and outliers. Food and Beverages and Sports and Travel appear to have slightly higher medians, but differences are modest.
Figure 3 reveals that the three cities have very similar sales distributions, suggesting that branch location does not strongly differentiate transaction value — a finding confirmed formally in Section 4.3.
Figure 4 shows that both gender groups display similar right-skewed distributions, though female customers show a slightly heavier upper tail. The violin shapes are broadly comparable, which is consistent with the small (though statistically significant) difference in means reported in Section 4.1.
Figure 5 shows a strong positive relationship between unit price and sales, with the data dispersed in distinct “fans” corresponding to different quantities purchased. This reflects the multiplicative structure of the Sales formula.
Similarly, Figure 6 confirms that sales increase with quantity. Both predictors exhibit strong linear relationships with Sales, as expected given the underlying arithmetic.
Payment methods are used approximately equally: E-wallet (345), Cash (344), and Credit Card (311). There is no dominant payment preference in the dataset.
Research question: Do female and male customers differ in their average transaction value?
This comparison involves two independent group means, so an independent samples t-test is appropriate. Because there is no strong reason to assume the two groups have equal variances, Welch’s t-test is used, as it does not rely on that assumption. With large samples (571 female and 429 male customers), the Central Limit Theorem implies that the sampling distribution of the mean is approximately normal, even though the raw sales data are skewed.
Assumptions: Levene’s test for equality of variances yielded a p-value of approximately 0.21, giving no evidence of unequal variances; Welch’s test is nevertheless used as a precaution given the unequal group sizes.
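Levene's test can be reproduced in base R as a one-way ANOVA on absolute deviations from each group's centre. The sketch below uses the median-centred variant (also known as the Brown–Forsythe form) on simulated data. Whether the report used the median-centred or the mean-centred version is not stated, so that choice is an assumption.

```r
set.seed(4)
# Simulated skewed sales for two groups of unequal size; illustrative only.
sim <- data.frame(
  Gender = factor(rep(c("Female", "Male"), times = c(120, 80))),
  Total = rexp(200, rate = 1 / 320)
)

# Absolute deviation of each observation from its own group's median
abs_dev <- abs(sim$Total - ave(sim$Total, sim$Gender, FUN = median))

# Levene's test is a one-way ANOVA on those absolute deviations
levene_table <- anova(lm(abs_dev ~ sim$Gender))
levene_p <- levene_table[["Pr(>F)"]][1]

print(levene_table)
```

A large p-value here gives no evidence that the spread differs between groups, which is how the reported value of approximately 0.21 is read.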
Results:
| Group | n | Mean Sales (USD) | SD |
|---|---|---|---|
| Female | 571 | 340.93 | 251.61 |
| Male | 429 | 299.06 | 236.22 |
Welch’s two-sample t-test: t(895.4) = 2.45, p = 0.014, 95% CI for difference: [8.40, 75.35].
Interpretation: There is a statistically significant difference in mean sales between female and male customers (p = 0.014). On average, women spend £41.87 more per transaction than men. The 95% confidence interval for this difference (£8.40 to £75.35) suggests that the true gap is likely modest but positive. Although statistically significant, the effect is small relative to the overall variability in sales (SD ≈ £245), meaning the practical impact is limited. This pattern indicates that female customers may represent a slightly higher value segment, but the difference is not large enough to drive major strategic decisions alone.
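The statement that the effect is small relative to overall variability can be quantified with a standardised effect size. The sketch below computes Cohen's d from the summary statistics in the table above. Cohen's d is not part of the original analysis, and is added here only as one conventional way to express the size of the gap.

```r
# Summary statistics as reported in the gender comparison table
n_f <- 571; mean_f <- 340.93; sd_f <- 251.61
n_m <- 429; mean_m <- 299.06; sd_m <- 236.22

# Pooled standard deviation across the two groups
sd_pooled <- sqrt(((n_f - 1) * sd_f^2 + (n_m - 1) * sd_m^2) / (n_f + n_m - 2))

# Cohen's d: the mean difference expressed in pooled-SD units
cohens_d <- (mean_f - mean_m) / sd_pooled

print(round(sd_pooled, 1))
print(round(cohens_d, 2))  # 0.17
```

A value of about 0.17 falls below the 0.2 conventionally labelled a small effect, which supports the reading that the difference is statistically detectable but modest in practical terms.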
Research question: Do loyalty Members spend more per transaction than Normal customers?
Again, an independent samples Welch’s t-test is used.
Results:
| Group | n | Mean Sales (USD) | SD |
|---|---|---|---|
| Member | 565 | 335.74 | 250.42 |
| Normal | 435 | 306.37 | 239.14 |
Welch’s two-sample t-test: t(912.3) = 1.72, p = 0.086, 95% CI for difference: [−4.24, 62.95].
Interpretation: There is no statistically significant difference in mean sales between Members and Normal customers at the conventional α = 0.05 level (p = 0.086). Although Members spent on average £29.37 more per transaction, this difference is not sufficiently large relative to sampling variability to rule out chance. This is a somewhat surprising result: one might expect loyalty members to make larger purchases as a reason for joining the scheme. The result suggests that in this dataset, membership status is not a strong predictor of individual transaction value — though Members may visit more frequently, a question this cross-sectional dataset cannot address.
Research question: Does average sales revenue vary significantly across the three cities?
With three groups, a one-way Analysis of Variance (ANOVA) is the appropriate extension of the t-test. ANOVA assumes independence of observations, approximate normality within groups (or large n), and homogeneity of variance across groups.
Assumptions: Levene’s test returned p = 0.73, providing no evidence against equal variances. Normality of residuals was assessed via a Q-Q plot (Figure 8): some deviation from normality is present in the tails, but with over 300 observations per group the CLT ensures the F-test is robust to this.
Results:
| City | n | Mean Sales (USD) | SD |
|---|---|---|---|
| Mandalay | 332 | 319.87 | 242.45 |
| Naypyitaw | 328 | 337.10 | 263.16 |
| Yangon | 340 | 312.35 | 231.64 |
One-way ANOVA: F(2, 997) = 0.44, p = 0.64.
Interpretation: The ANOVA provides no evidence that average sales differ across the three cities (p = 0.64). Because the overall F test was not significant, no post hoc comparisons were carried out. This suggests that the three branches are similar in terms of transaction value, and that city-level factors do not meaningfully differentiate customer spending. Managers should therefore not expect location alone to produce differences in revenue per transaction between branches.
Research question: Can unit price and quantity purchased predict total sales revenue?
Linear regression models Sales as a function of Unit Price and Quantity. However, since Sales is defined as Unit Price × Quantity × 1.05, these predictors are mathematically linked to the outcome, so a near-perfect fit is expected. The aim of fitting the model is therefore not to discover a new relationship, but to show how the technique works and to evaluate the model diagnostics.
Results:
| Term | Estimate | Std. Error | t-value | p-value |
|---|---|---|---|---|
| Intercept | −0.09 | 0.09 | −1.0 | 0.32 |
| Unit Price | 5.50 | 0.003 | ~1800 | < 0.001 |
| Quantity | 55.67 | 0.003 | ~17000 | < 0.001 |
Adjusted R² ≈ 1.000 (as expected), F(2, 997) ≫ 1, p < 0.001. VIF values for both predictors ≈ 1.00, indicating that Unit Price and Quantity are essentially uncorrelated with each other, so multicollinearity between the predictors is not a concern.
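For reference, a variance inflation factor can be computed without add-on packages as 1 / (1 − R²), where R² comes from regressing one predictor on the others. The sketch below uses simulated, independently drawn predictors. The helper name `vif_by_hand()` is hypothetical, and the data are not the report's.

```r
set.seed(5)
n <- 1000
# Independent simulated predictors; illustrative only.
unit_price <- runif(n, 10, 100)
quantity <- sample(1:10, n, replace = TRUE)

# VIF for a predictor = 1 / (1 - R^2) from regressing it on the other predictor(s)
vif_by_hand <- function(target, others) {
  r2 <- summary(lm(target ~ ., data = others))$r.squared
  1 / (1 - r2)
}

vif_price <- vif_by_hand(unit_price, data.frame(quantity))
vif_quantity <- vif_by_hand(quantity, data.frame(unit_price))

print(c(unit_price = vif_price, quantity = vif_quantity))
```

With only two predictors both VIFs are identical, since each equals 1 / (1 − r²) for the same pairwise correlation r. Values near 1 indicate that the predictors carry almost no shared information.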
The estimated coefficients reflect the structure of the formula: each additional unit of quantity purchased adds approximately £55.67 to the sales total, a value close to the mean unit price in Table 1, and each £1 increase in unit price adds approximately £5.50 to sales, a value close to the mean quantity (5.51).
Interpretation: Both unit price and quantity are highly significant positive predictors of sales. The model explains essentially all variation in Sales (R² ≈ 1.00). However, this result should be interpreted cautiously: because Sales is an arithmetic function of these two variables, the model is fitting an identity rather than discovering a statistical relationship. A more analytically interesting model might predict customer Rating from Sales and other variables, or forecast Sales using only pre-transaction information. Nonetheless, the model correctly identifies the two primary drivers of transaction value and provides a precise quantification of their effects.
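One way to see why this is an identity rather than a statistical relationship is to note that the stated formula becomes exactly additive on the log scale: log(Sales) = log(1.05) + log(Unit Price) + log(Quantity). The sketch below applies the formula to simulated inputs and fits that log-scale model. It is an illustration added here, not a model fitted in the report.

```r
set.seed(6)
n <- 200
# Simulated inputs; the stated Sales formula is applied to them exactly.
unit_price <- runif(n, 10, 100)
quantity <- sample(1:10, n, replace = TRUE)
sales <- 1.05 * unit_price * quantity

# On the log scale the formula is exactly linear, so the fit has no residual error.
log_fit <- lm(log(sales) ~ log(unit_price) + log(quantity))
coefs <- unname(coef(log_fit))

print(round(coefs, 4))          # intercept = log(1.05), both slopes = 1
print(round(exp(coefs[1]), 4))  # back-transformed intercept = 1.05
```

The fit simply returns the constants already present in the formula, which is the sense in which regressing Sales on its own components reveals nothing new.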
The Residuals vs Fitted plot (Figure 9, top left) shows a slight fan-shaped pattern, indicating mild heteroscedasticity: the spread of the residuals increases gradually as the fitted values rise. This is supported by the Breusch–Pagan test (p < 0.001), and it is not surprising given that Sales is generated through a multiplicative formula, which naturally creates increasing variance at higher values. The Q-Q plot of standardised residuals (Figure 9, top right) shows small deviations from normality, particularly in the tails. A Shapiro–Wilk test on a random sample of 500 residuals also rejects normality (p < 0.001), but with a sample size of 1,000 these departures have little practical impact on inference about the coefficients. The Scale–Location and Leverage plots indicate that the model is generally stable, with no influential observations detected: all Cook’s distance values remain below 0.1, well under the commonly used threshold of 1. Overall, the diagnostics suggest minor assumption violations that are expected given the structure of the data, but none is severe enough to undermine the validity of the regression results.
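The influence check can be sketched with `cooks.distance()`, as below. The data are simulated from a simple well-behaved linear model, and the two cut-offs shown (D > 1 and the stricter D > 4/n) are common rules of thumb rather than thresholds taken from the report.

```r
set.seed(7)
n <- 200
# Simulated, well-behaved linear data; illustrative only.
x <- runif(n, 0, 10)
y <- 1 + 2 * x + rnorm(n)
fit <- lm(y ~ x)

# Cook's distance measures how much the fitted values change when one point is dropped.
cooks <- cooks.distance(fit)

# Two common rules of thumb: D > 1, or the stricter D > 4 / n
n_above_1 <- sum(cooks > 1)
n_above_4_over_n <- sum(cooks > 4 / n)

print(max(cooks))
print(c(above_1 = n_above_1, above_4_over_n = n_above_4_over_n))
```

The stricter 4/n rule usually flags a handful of points even in clean data, so flagged observations are candidates for inspection rather than automatic removal.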
Research question: Is there a significant association between a customer’s gender and their choice of product line?
A chi-square test of independence assesses whether the distribution of product line preferences differs between male and female customers.
Expected cell counts: All expected counts > 5, confirming the chi-square approximation is valid.
Results: χ²(5) = 12.75, p = 0.026.
Interpretation: There is a statistically significant association between gender and product line preference (p = 0.026). The contingency table and Figure 11 show that women are slightly more likely to buy Fashion Accessories and Health & Beauty products, while men show a small preference for Health & Beauty and Electronic Accessories. However, the differences are small, so although the association is statistically significant, it is not large enough to meaningfully influence stocking decisions.
Research question: Is there a significant association between how a customer pays and whether they are a Member or Normal customer?
Expected cell counts: All > 5.
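Results: χ²(2) = 2.24, p = 0.326.
Interpretation: There is no statistically significant association between payment method and customer type (p = 0.326). The data give no indication that Members and Normal customers favour different payment methods.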
Research question: Is a customer’s spend category (Low/Medium/High) associated with their gender?
Expected cell counts: All > 5.
Results: χ²(2) = 7.49, p = 0.024.
Interpretation: There is a statistically significant association between spend category and gender (p = 0.024). Female customers are disproportionately represented in the High spend category (272 female vs 159 male), while the Low and Medium categories are more evenly split. This corroborates the result from Section 4.1 and provides a more granular picture: the gender difference in spend is driven primarily by a higher proportion of women making large transactions, rather than a uniform uplift across all spend levels.
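Because the two gender groups differ in size, the High-spend counts are easier to compare as within-group shares. The short calculation below uses only the counts reported in the text (272 and 159 High-spend transactions; 571 female and 429 male customers).

```r
# Counts reported in the text: High-spend transactions and group sizes by gender
high_spend <- c(Female = 272, Male = 159)
group_size <- c(Female = 571, Male = 429)

# Share of each gender's transactions that fall in the High category
share_high <- high_spend / group_size

print(round(100 * share_high, 1))  # Female 47.6, Male 37.1
```

Roughly 48% of transactions by female customers fall in the High category, compared with about 37% for male customers, a gap of around ten percentage points.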
Research question: Does loyalty membership status affect which spend category a customer falls into?
Expected cell counts: All > 5.
Results: χ²(2) = 2.46, p = 0.293.
Interpretation: There is no significant association between customer type and spend category (p = 0.293). The distribution of Low, Medium, and High spenders is similar between Members and Normal customers, consistent with the non-significant t-test result in Section 4.2. Loyalty membership does not appear to be associated with making larger individual purchases in this dataset.
This analysis of 1,000 supermarket transactions highlights a few key patterns. On average, female customers spend more per transaction than male customers (£340.93 vs £299.06), and this difference is mainly driven by the higher proportion of women in the high-spend group. In contrast, loyalty membership does not appear to influence how much people spend: the difference between Members and Normal customers was not statistically significant. Sales also do not vary much across the three cities, suggesting the branches operate in a similar way. The regression results show that unit price and quantity are the main factors that determine sales, which is expected because Sales is calculated directly from these two variables. The chi-square tests found small but significant links between gender and both product line choice and spend category, but no significant associations involving payment method or customer type.
Research in retail often finds gender differences in purchasing behaviour, so the small gender effect here fits with wider evidence. The lack of a spending difference between loyalty Members and Normal customers also reflects findings that simple loyalty schemes do not always change basket size.
This dataset has some limitations: it is relatively small, covers only three months, and comes from one country, so the results may not generalise widely. Because the data are cross-sectional, visit frequency and long-term customer behaviour cannot be studied. Since Sales is mathematically tied to Unit Price and Quantity, the regression model does not reveal new insights. Future work would benefit from longer-term data, more customer information, and a broader geographic scope.
Aung, A. (2019). Supermarket Sales Dataset [Dataset]. Kaggle. https://www.kaggle.com/datasets/aungpyaeap/supermarket-sales
Dittmar, H., Beattie, J., & Friese, S. (1996). Objects, decision considerations and self-image in men’s and women’s impulse purchases. Acta Psychologica, 93(1–3), 187–206.
Field, A. (2018). Discovering Statistics Using IBM SPSS Statistics (5th ed.). SAGE Publications.
Meyer-Waarden, L. (2008). The influence of loyalty programme membership on customer purchase behaviour. European Journal of Marketing, 42(1/2), 87–114.
Navidi, W., & Monk, B. (2019). Elementary Statistics (3rd ed.). McGraw-Hill Education.
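# ── Load packages ──────────────────────────────────────────────────────────
library(tidyverse)  # attaches ggplot2 and dplyr
# ── Load data ──────────────────────────────────────────────────────────────
data <- read.csv("data/supermarket_sales.csv")
# Create the figure output folder if it does not already exist (used by ggsave() and png())
dir.create("img", showWarnings = FALSE)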
# ── Data preparation ───────────────────────────────────────────────────────
data$Spend_Category <- cut(
data$Total,
breaks = c(-Inf, 100, 300, Inf),
labels = c("Low", "Medium", "High")
)
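# ── Summary statistics ─────────────────────────────────────────────────────
num_vars <- data[, c("Total", "Unit.price", "Quantity", "Rating")]
summary(num_vars)      # mean, median, minimum, and maximum
sapply(num_vars, sd)   # standard deviations reported in Table 1
sapply(num_vars, IQR)  # interquartile ranges reported in Table 1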
# ── EDA figures ────────────────────────────────────────────────────────────
# Figure 1: Histogram of Sales
ggplot(data, aes(x = Total)) +
geom_histogram(bins = 30, fill = "#2E75B6", colour = "white") +
labs(title = "Distribution of Sales", x = "Sales (USD)", y = "Frequency")
ggsave("img/Figure_1.png")
# Figure 2: Boxplot of Sales by Product Line
ggplot(data, aes(x = Product.line, y = Total)) +
geom_boxplot(fill = "#70AD47") +
labs(title = "Sales by Product Line", x = "Product Line", y = "Sales (USD)") +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
ggsave("img/Figure_2.png")
# Figure 3: Boxplot of Sales by City
ggplot(data, aes(x = City, y = Total)) +
geom_boxplot(fill = "#FFC000") +
labs(title = "Sales by City", x = "City", y = "Sales (USD)")
ggsave("img/Figure_3.png")
# Figure 4: Violin plot of Sales by Gender
ggplot(data, aes(x = Gender, y = Total, fill = Gender)) +
geom_violin() +
labs(title = "Sales by Gender", x = "Gender", y = "Sales (USD)")
ggsave("img/Figure_4.png")
# Figure 5: Scatterplot of Unit Price vs Sales
ggplot(data, aes(x = Unit.price, y = Total)) +
geom_point(alpha = 0.4, colour = "#2E75B6") +
labs(title = "Unit Price vs Sales", x = "Unit Price (USD)", y = "Sales (USD)")
ggsave("img/Figure_5.png")
# Figure 6: Scatterplot of Quantity vs Sales
ggplot(data, aes(x = Quantity, y = Total)) +
geom_point(alpha = 0.4, colour = "#E05C2A") +
labs(title = "Quantity vs Sales", x = "Quantity", y = "Sales (USD)")
ggsave("img/Figure_6.png")
# Figure 7: Bar chart of Payment Method
ggplot(data, aes(x = Payment)) +
geom_bar(fill = "#7030A0") +
labs(title = "Payment Method Frequency", x = "Payment Method", y = "Count")
ggsave("img/Figure_7.png")
# ── Hypothesis Test 1: Gender vs Sales (Welch t-test) ─────────────────────
t.test(Total ~ Gender, data = data, var.equal = FALSE)
# Figure 8: Boxplot of Sales by Gender
ggplot(data, aes(x = Gender, y = Total, fill = Gender)) +
geom_boxplot() +
labs(title = "Sales by Gender", x = "Gender", y = "Sales (USD)")
ggsave("img/Figure_8.png")
# ── Hypothesis Test 2: Customer Type vs Sales (Welch t-test) ──────────────
t.test(Total ~ Customer.type, data = data, var.equal = FALSE)
# ── Hypothesis Test 3: City vs Sales (One-Way ANOVA) ──────────────────────
anova_model <- aov(Total ~ City, data = data)
summary(anova_model)
# ── Linear Regression ──────────────────────────────────────────────────────
lm_model <- lm(Total ~ Unit.price + Quantity, data = data)
summary(lm_model)
# Figure 9: Regression diagnostic plots
png("img/Figure_9.png", width = 800, height = 800)
par(mfrow = c(2, 2))
plot(lm_model)
dev.off()
# Figure 10: Observed vs Fitted
fitted_df <- data.frame(Observed = data$Total, Fitted = fitted(lm_model))
ggplot(fitted_df, aes(x = Fitted, y = Observed)) +
geom_point(alpha = 0.4, colour = "#2E75B6") +
geom_abline(slope = 1, intercept = 0, colour = "red") +
labs(title = "Observed vs Fitted Values", x = "Fitted", y = "Observed")
ggsave("img/Figure_10.png")
# Breusch-Pagan test for heteroscedasticity
library(lmtest)
bptest(lm_model)
# ── Chi-Square Test 1: Gender x Product Line ───────────────────────────────
chisq.test(table(data$Gender, data$Product.line))
# Figure 11: Stacked bar - product line by gender
ggplot(data, aes(x = Gender, fill = Product.line)) +
geom_bar(position = "fill") +
labs(title = "Product Line by Gender", y = "Proportion", fill = "Product Line")
ggsave("img/Figure_11.png")
# ── Chi-Square Test 2: Payment x Customer Type ────────────────────────────
chisq.test(table(data$Payment, data$Customer.type))
# Figure 12: Stacked bar - payment by customer type
ggplot(data, aes(x = Customer.type, fill = Payment)) +
geom_bar(position = "fill") +
labs(title = "Payment Method by Customer Type", y = "Proportion", fill = "Payment")
ggsave("img/Figure_12.png")
# ── Chi-Square Test 3: Spend Category x Gender ────────────────────────────
chisq.test(table(data$Spend_Category, data$Gender))
# Figure 13: Stacked bar - spend category by gender
ggplot(data, aes(x = Gender, fill = Spend_Category)) +
geom_bar(position = "fill") +
labs(title = "Spend Category by Gender", y = "Proportion", fill = "Spend Category")
ggsave("img/Figure_13.png")
# ── Chi-Square Test 4: Spend Category x Customer Type ─────────────────────
chisq.test(table(data$Spend_Category, data$Customer.type))
# Figure 14: Stacked bar - spend category by customer type
ggplot(data, aes(x = Customer.type, fill = Spend_Category)) +
geom_bar(position = "fill") +
labs(title = "Spend Category by Customer Type", y = "Proportion", fill = "Spend Category")
ggsave("img/Figure_14.png")
*Note: All analysis was conducted in R (version 4.x). Code is inc