MATH1324 Assignment 3

Group/Individual Details

Arion Barzoucas-Evans (s3650046)
Chitra Venkitachalam Iyer (s3622880)
Napapach Dechawatthanaphokin (s3613572)

Executive Statement

The purpose of this paper is to investigate whether there is a statistically significant difference in product prices between Coles and Woolworths supermarkets. Hence, our null hypothesis will be that there is no price difference.

To examine this, a random sample of matching products was collected from the two aforementioned supermarkets and their prices recorded. The data collection was performed using a stratified sampling method with home brand and non-home brand products as the population strata. The home brand stratum was calculated to be approximately 20% of the population which matches the stratum proportion in our sample. This was achieved by using simple random sampling on Coles products and then matching the product observed with the corresponding one from Woolworths. Product matches were made according to brand and item size while excluding any “on sale” prices.
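The proportional allocation described above can be sketched in base R. This is only an illustration under stated assumptions: the `catalogue` data frame, its `home_brand` flag and the 1000-product size are hypothetical stand-ins, and the 14/56 split is inferred from the 20% stratum share applied to a sample of 70 items rather than taken from the original collection procedure.

```r
set.seed(1)  # arbitrary seed, only for reproducibility of the sketch

# Hypothetical product catalogue: 1000 products, 20% flagged as home brand
catalogue <- data.frame(
  product_id = 1:1000,
  home_brand = rep(c(TRUE, FALSE), times = c(200, 800))
)

n_total <- 70

# Proportional allocation: each stratum gets its population share of the sample
n_home  <- round(n_total * mean(catalogue$home_brand))
n_other <- n_total - n_home

# Simple random sampling within each stratum
home_ids  <- sample(catalogue$product_id[catalogue$home_brand], n_home)
other_ids <- sample(catalogue$product_id[!catalogue$home_brand], n_other)

strat_sample <- catalogue[catalogue$product_id %in% c(home_ids, other_ids), ]

# Share of home brand products in the sample
mean(strat_sample$home_brand)
```

In the actual study each sampled Coles product would then be matched to the corresponding Woolworths product by brand and item size.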

In order to test for a statistically significant difference in prices, a paired-samples t-test was performed. On average, Coles products were $\$0.393$ more expensive than the matching Woolworths products, with a standard deviation of $\$0.775$ in the price differences. The paired-samples t-test found a statistically significant mean difference, t(df = 69) = 4.246, p<0.01, 95% CI [0.208, 0.578].

Consequently, there is statistically significant evidence pointing towards Woolworths having lower prices than Coles.

Data source: https://www.coles.com.au/, https://www.woolworths.com.au/

Load Packages and Data

library(dplyr)
library(knitr)
library(granova)

prices <- read.csv("D:/RMIT/Data-Visualisation/R/Prices.csv")

Summary Statistics

The variables of interest for the analysis are Coles_Per_Item and Woolworths_Per_Item, each containing prices per item from the corresponding store. These were combined into a third variable, d, containing the difference in price for each individual item. A summary of these variables is presented in Table 1 below.

prices <- prices %>% mutate(d = Coles_Per_Item - Woolworths_Per_Item)

coles_summ <- prices  %>% summarise(Min = min(Coles_Per_Item,na.rm = TRUE),
                                    Q1 = quantile(Coles_Per_Item,probs = .25,na.rm=TRUE),
                                    Median = median(Coles_Per_Item, na.rm = TRUE),
                                    Q3 = quantile(Coles_Per_Item,probs = .75,na.rm=TRUE),
                                    Max = max(Coles_Per_Item,na.rm = TRUE),
                                    Mean = mean(Coles_Per_Item, na.rm = TRUE),
                                    SD = sd(Coles_Per_Item, na.rm = TRUE),
                                    n = n()
                                   )

wool_summ <- prices  %>% summarise(Min = min(Woolworths_Per_Item,na.rm = TRUE),
                      Q1 = quantile(Woolworths_Per_Item,probs = .25,na.rm = TRUE),
                      Median = median(Woolworths_Per_Item, na.rm = TRUE),
                      Q3 = quantile(Woolworths_Per_Item,probs = .75,na.rm = TRUE),
                      Max = max(Woolworths_Per_Item,na.rm = TRUE),
                      Mean = mean(Woolworths_Per_Item, na.rm = TRUE),
                      SD = sd(Woolworths_Per_Item, na.rm = TRUE),
                      n = n()
                     )

diff_summ <- prices  %>% summarise(Min = min(d,na.rm = TRUE),
                                   Q1 = quantile(d,probs = .25,na.rm = TRUE),
                                   Median = median(d, na.rm = TRUE),
                                   Q3 = quantile(d,probs = .75,na.rm = TRUE),
                                   Max = max(d,na.rm = TRUE),
                                   Mean = mean(d, na.rm = TRUE),
                                   SD = sd(d, na.rm = TRUE),
                                   n = n()
                                  )


comb <- rbind(coles_summ, wool_summ,diff_summ)

colnames(comb) <- c("Minimum","Q1","Median","Q3","Maximum","Mean",
                    "Standard Deviation","Observations")

rownames(comb) <- c("Coles","Woolworths","Difference")
kable(round(comb,2), caption = "Summary Comparison", row.names = TRUE)

Summary Comparison
	Minimum	Q1	Median	Q3	Maximum	Mean	Standard Deviation	Observations
Coles	0.83	3.97	5.75	12.47	53.00	9.21	9.66	70
Woolworths	0.65	3.84	5.20	11.87	55.00	8.81	9.46	70
Difference	-2.00	0.00	0.21	0.56	2.56	0.39	0.77	70
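The three summary blocks above differ only in the column they summarise, so the same table could be built with one helper function. The following is a minimal base R sketch, not the code used for the report: `summ_stats` is a hypothetical helper, and the `toy` data frame contains made-up prices standing in for the real `prices` data.

```r
# Hypothetical helper returning the same eight statistics as the table above
summ_stats <- function(x) {
  data.frame(
    Min    = min(x, na.rm = TRUE),
    Q1     = unname(quantile(x, probs = 0.25, na.rm = TRUE)),
    Median = median(x, na.rm = TRUE),
    Q3     = unname(quantile(x, probs = 0.75, na.rm = TRUE)),
    Max    = max(x, na.rm = TRUE),
    Mean   = mean(x, na.rm = TRUE),
    SD     = sd(x, na.rm = TRUE),
    n      = length(x)  # counts all rows, as n() does in the original code
  )
}

# Made-up prices for four products (not the report's data)
toy <- data.frame(
  Coles_Per_Item      = c(2.00, 3.50, 5.00, 8.00),
  Woolworths_Per_Item = c(1.80, 3.50, 4.60, 7.50)
)
toy$d <- toy$Coles_Per_Item - toy$Woolworths_Per_Item

# Apply the helper to every column and stack the one-row results
comb_toy <- do.call(rbind, lapply(toy, summ_stats))
rownames(comb_toy) <- c("Coles", "Woolworths", "Difference")
round(comb_toy, 2)
```

With the real data, `lapply(prices[c("Coles_Per_Item", "Woolworths_Per_Item", "d")], summ_stats)` would play the role of the three `summarise()` calls.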

According to Table 1, Woolworths items are on average $\$0.39$ cheaper than Coles items, with a standard deviation of $\$0.77$ in the price differences. This trend is supported by the line plot in Figure 1, which visualises the price difference for each item across the two stores. A slight drop in price from Coles to Woolworths is observed for most items. In addition, there are two outliers in each store; however, the price difference between these is small, so there is no need to exclude them from the test.

matplot(t(data.frame(prices$Coles_Per_Item, prices$Woolworths_Per_Item)),
        type = "b", pch = 16, col = "1", lty = 1, 
        xlab = "Store", ylab = "Price ($)", xaxt = "n")
axis(1, at = 1:2, labels = c("Coles", "Woolworths"))
title(main = "Coles-Woolworths Prices")
title(sub = "Figure 1 - Coles/Woolworths line plot price comparison",
      cex.sub = 0.80, font.sub = 3, col.sub = "gray40", line = 4, adj = 0)

Hypothesis Test

A paired-samples t-test with a significance level of $\alpha=0.05$ was chosen to test for a statistically significant difference in prices between the two stores. This is because observations (prices) were recorded twice for the same sample of products and hence the measurements are dependent. Since the sample contains 70 pairs, which is more than 30 observations, the Central Limit Theorem implies that the sampling distribution of the mean difference is approximately normal, so the normality of the individual differences was not formally checked.
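The Central Limit Theorem argument can be illustrated with a small simulation. This is a sketch under assumptions, not part of the original analysis: the differences are drawn from an exponential distribution (strongly right-skewed, mean 1, standard deviation 1) purely as a hypothetical non-normal population, and the number of replications is arbitrary.

```r
set.seed(123)  # arbitrary seed for reproducibility

n_pairs <- 70    # same number of pairs as in the report
n_reps  <- 5000  # number of simulated samples (arbitrary choice)

# Mean of n_pairs skewed "differences", repeated n_reps times
sample_means <- replicate(n_reps, mean(rexp(n_pairs, rate = 1)))

# The sample means centre on the population mean (1) and their spread
# is close to the theoretical standard error 1 / sqrt(n_pairs)
mean(sample_means)
sd(sample_means)
1 / sqrt(n_pairs)
```

For the real data, a direct visual check of the differences is also possible with `qqnorm(prices$d)` followed by `qqline(prices$d)`.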

The statistical hypotheses for the t-test are as follows: $H_0:\mu_\Delta=0$ versus $H_A:\mu_\Delta\ne0$, where $\mu_\Delta$ denotes the population mean price difference (Coles minus Woolworths).

result <- t.test(prices$Coles_Per_Item, prices$Woolworths_Per_Item, 
       paired = TRUE, alternative = "two.sided")

result_table <- cbind(round(result$statistic,3), result$parameter, 
                      round(result$p.value,6), round(t(result$conf.int),3),
                      round(result$estimate,3), result$null.value, result$alternative)
colnames(result_table) <- c("Test Statistic", "DF","p-value", "Lower CI Bound",
                            "Upper CI Bound","Mean Dif. Est.","H0","HA")

kable(result_table, caption = "Paired t-test Coles_Per_Item/Woolworths_Per_Item Prices")

Paired t-test Coles_Per_Item/Woolworths_Per_Item Prices
	Test Statistic	DF	p-value	Lower CI Bound	Upper CI Bound	Mean Dif. Est.	H0	HA
t	4.246	69	6.7e-05	0.208	0.578	0.393	0	two.sided
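A paired-samples t-test is equivalent to a one-sample t-test on the per-item differences, which is why the difference variable d carries all the information used by the test. The sketch below demonstrates this on simulated prices; the vectors `coles` and `woolworths` are hypothetical and are not the report's data.

```r
set.seed(42)  # arbitrary seed for reproducibility

# Simulated paired prices for 70 products (hypothetical values)
coles      <- round(runif(70, min = 1, max = 20), 2)
woolworths <- round(coles - rnorm(70, mean = 0.4, sd = 0.8), 2)

# Paired-samples t-test on the two price vectors
paired <- t.test(coles, woolworths, paired = TRUE, alternative = "two.sided")

# One-sample t-test on the differences against a null mean of 0
d <- coles - woolworths
one_sample <- t.test(d, mu = 0, alternative = "two.sided")

# Both approaches give the same t statistic, degrees of freedom and p-value
c(paired = unname(paired$statistic), one_sample = unname(one_sample$statistic))
c(paired = paired$p.value, one_sample = one_sample$p.value)
```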

A visual representation of the t-test is shown in Figure 2. This includes a scatter plot of the product prices (black points), the price differences (blue X’s) and a confidence interval for the mean difference.

x <- granova.ds(data.frame(prices$Coles_Per_Item, prices$Woolworths_Per_Item),
           xlab = "Coles Prices ($)", ylab = "Woolworths Prices ($)")

title(sub = "Figure 2 - Mean difference plot for the paired-samples t-test",
      cex.sub = 0.80, font.sub = 3, col.sub = "gray40", line = 4, adj = 0)

Interpretation

As stated above, the null hypothesis for the paired t-test is that there is no difference in the mean product prices across the two stores, and the alternative hypothesis is that there is a difference.

The paired-samples t-test in Table 2 yielded a test statistic of t(df = 69) = 4.246, a p-value of p<0.01, and a 95% confidence interval of [0.208, 0.578] for the mean difference. The p-value represents the probability of observing a sample mean difference of $\bar{d}=0.393$, or one more extreme, assuming the null hypothesis is true. Since $p<0.01<0.05=\alpha$, the null hypothesis is rejected. This is also confirmed by the confidence interval for the mean difference, which does not contain the null value of 0.
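As a cross-check, the reported test results can be approximately reproduced by hand from the summary values alone, using the standard paired t-test formulas. The inputs below are the rounded figures quoted in this report, so the outputs agree with Table 2 only up to rounding; the variable names are chosen for this illustration.

```r
# Rounded summary values quoted in the report
d_bar <- 0.393  # mean price difference (Coles minus Woolworths)
s_d   <- 0.775  # standard deviation of the differences
n     <- 70     # number of matched product pairs

se     <- s_d / sqrt(n)  # standard error of the mean difference
df     <- n - 1          # degrees of freedom
t_stat <- d_bar / se     # test statistic under H0: mean difference = 0

# Two-sided p-value and 95% confidence interval for the mean difference
p_val <- 2 * pt(-abs(t_stat), df)
ci    <- d_bar + c(-1, 1) * qt(0.975, df) * se

round(c(t = t_stat, df = df, lower = ci[1], upper = ci[2]), 3)
p_val
```

The small gap between this t statistic (about 4.24) and the reported 4.246 comes from using the rounded mean and standard deviation rather than the raw data.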

Discussion

In conclusion, statistically significant evidence was found to suggest that Woolworths is, on average, a cheaper supermarket than Coles.

However, there are many limitations to this investigation such as time and resource constraints. The sample was collected solely from the online stores of the supermarkets. This does not take into account different prices that may occur in different regions. In addition, the sample size, even though adequate, is far from desirable, thus limiting the scope of the analysis to a general price comparison. A larger sample collected by systematic or stratified sampling (using product categories as strata) would allow for a deeper investigation into differences in product types between the two stores.

Another important factor when comparing supermarket prices is product sales. This analysis has excluded any prices of items that were on sale. Yet the number and type of items on sale, as well as the frequency of sales, play an important role in the cost efficiency of each supermarket.

Finally, future investigations ought to consider loyalty schemes such as point collection which might have an impact on the choice of supermarket.