The aim of this investigation is to determine whether Coles or Woolworths offers lower prices. Prices were randomly sampled from the aggregator website https://grocerycop.com.au on 17/09/2018 using a bespoke VBA script (refer to Appendix A for the methodology and code). 300 products were randomly chosen, of which 64 were available at both Coles and Woolworths. An initial look at these 64 items found some outliers in a category called ‘Drinks & Tobacco’, which consisted of 11 items. As alcohol cannot be purchased inside Coles and Woolworths supermarkets (it is usually sold in a separate store located next to the supermarket), this category was removed entirely. Removing this category did not affect the randomness of the sampling.
A paired-samples t-test was used to test for a significant mean difference between Coles and Woolworths prices. The mean difference was found to be $0.726. Visual inspection of the Q-Q plot for price differences suggests a high number of zeros (i.e. no price difference) and some evidence of non-normality. However, the central limit theorem ensures that the t-test can still be applied given the large sample size. The paired-samples t-test found a statistically significant mean price difference between Coles and Woolworths, t(52) = 3.893, p < 0.001, 95% CI [0.352, 1.100].
#Load packages
library(magrittr); library(dplyr); library(car); library(ggplot2); library(readxl); library(tidyr); library(lattice); library(granova)
#Load and Clean Data
dataset <- read_excel("data.xlsx") #read the data set into the object
WorkingDataset <- dataset %>% filter(!(CategoryName %in% "Drinks & Tobacco")) #Filter out Alcohol&Tobacco
#Tidy the data
tidy_WorkingDataset <- WorkingDataset %>% gather(-c(1:5), key = "StoreName", value = "Price")
sapply(tidy_WorkingDataset, class) #inspect type of each variable
## CategoryNumber CategoryName Page ProductNumber Product
## "numeric" "character" "numeric" "numeric" "character"
## StoreName Price
## "character" "numeric"
#StoreName, CategoryName and CategoryNumber should be factors as they are categorical variables
tidy_WorkingDataset <- tidy_WorkingDataset %>% mutate(
CategoryName =factor(CategoryName,
levels = c("Baby, Health & Beauty","Bakery","Clothing, Household & Pet","Entertainment & International Food","Freezer","Fridge","Meat & Seafood","Pantry"),
labels = c("Baby,Health&Beauty","Bakery","Clothing,HouseHold","Entertainment",
"Freezer","Fridge","Meat&Seafood","Pantry")),
CategoryNumber = as.factor(CategoryNumber),
StoreName = as.factor(StoreName)
)
#Summary Statistics For Dataset
tidy_WorkingDataset %>% group_by(StoreName) %>% summarise(
Min = min(Price,na.rm = TRUE),
Q1 = quantile(Price,probs = .25,na.rm = TRUE),
Median = median(Price, na.rm = TRUE),
Q3 = quantile(Price,probs = .75,na.rm = TRUE),
Max = max(Price,na.rm = TRUE),
Mean = mean(Price, na.rm = TRUE),
SD = sd(Price, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Price))
)
#screen the sample distribution:
hist(tidy_WorkingDataset$Price, col = "grey", main = "Histogram: Store Prices", xlab = "Price")
abline(v = mean(WorkingDataset$Woolworths, na.rm = TRUE), col = "darkgreen", lwd = 2) #Woolworths mean price (dark green)
abline(v = mean(WorkingDataset$Coles, na.rm = TRUE), col = "red", lwd = 2) #Coles mean price (red)
legend("topright", c("Coles Avg Price", "Woolworths Avg Price"), fill = c("red", "darkgreen"))
#Screen for outliers:
boxplot(Price ~ StoreName, data = tidy_WorkingDataset, col = c("red", "darkgreen"), xlab = "Price in A$", horizontal = TRUE, main = "Boxplot of Coles and Woolworths Grocery Prices")
#category wise comparison (ggplot2 is already loaded above)
boxplot <- ggplot(tidy_WorkingDataset, aes(x = CategoryName, y = Price, fill = StoreName)) + geom_boxplot()
print(boxplot + ggtitle("Boxplot per Category and Store") + theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom")) #Add and centre the title, move the legend below the plot
Trend & high-level interpretation:
Insights and trends from the summary statistics and visualisations are discussed in the findings below.
The paired or dependent t-test was chosen because the study required the same product to be measured twice: once at Coles and once at Woolworths. This test assumes that the mean price difference between Coles and Woolworths is normally distributed. To check this assumption, a Q-Q plot is drawn to assess whether the sample of differences is approximately normal. Strictly, this check is not required: given the large sample size (53 > 30), the central limit theorem can be invoked, as the mean of any sufficiently large sample is approximately normally distributed regardless of the underlying population distribution. The significance level is set at the conventional 0.05 (equivalently, a 95% confidence level).
WorkingDataset <- WorkingDataset %>% mutate(d=Coles - Woolworths)
invisible(qqPlot(WorkingDataset$d, dist='norm', ylab = "Price Difference in $", main = "Quantile-Quantile Plot"))
Most values fall within the 95% confidence bands for the corresponding quantiles, with the exception of some outliers on the far right and a cluster near (0, 0). The Q-Q plot shows a considerably higher number of zero price differences than would be expected under a normal distribution. Due to the large sample size, however, the t-test can still be used.
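As a supplementary check (not part of the original analysis), a Shapiro-Wilk test on the price differences can formalise the visual impression from the Q-Q plot; given the sample size of 53, the t-test remains appropriate regardless of its outcome.
#Supplementary normality check on the price differences (sketch; not required given the CLT argument above)
shapiro.test(WorkingDataset$d)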
#Run the t-test
t.test(WorkingDataset$Coles,WorkingDataset$Woolworths, paired = TRUE, alternative = "two.sided")
##
## Paired t-test
##
## data: WorkingDataset$Coles and WorkingDataset$Woolworths
## t = 3.8926, df = 52, p-value = 0.000284
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.3515781 1.0997427
## sample estimates:
## mean of the differences
## 0.7256604
The hypothesis test for this study is as follows:
\[H_0: \mu_\Delta = 0\] \[H_A: \mu_\Delta \neq 0\]
where \(\mu_\Delta\) is the mean difference in price between Coles and Woolworths. The results of the paired t-test are reported above: t(52) = 3.893, p < 0.001, 95% CI [0.352, 1.100]. Since p < 0.05 and the confidence interval does not contain zero, we reject the null hypothesis: the paired-samples t-test is statistically significant.
invisible(granova.ds(data.frame(WorkingDataset$Coles, WorkingDataset$Woolworths), ylab = "Woolworths Price", xlab = "Coles Price", main = "Paired Sample Visualisation for Mean Difference"))
The granova visualisation shows a high concentration of products clustered closely together on the identity line. The remaining price differences are also clustered, but to the right of the identity line, which suggests that Woolworths is cheaper than Coles. It can also be seen that three products are markedly cheaper at Woolworths than at Coles.
The thick red dotted line shows the mean difference, and the 95% confidence interval (in green) narrowly misses capturing the solid black line, which represents the null value of zero.
Findings: The Q-Q plot for the price differences shows a spike of identical values centred around zero. Roughly 20 products have a difference between -$0.02 and $0.01, accounting for roughly a third of the matched products. This is very likely intentional, as both retailers price-match popular goods and/or have agreements with suppliers that fix a recommended retail price.
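The size of that spike can be checked directly; a minimal sketch, using the thresholds quoted above:
#Quantify the spike of (near) identical prices noted above
nearZero <- with(WorkingDataset, d >= -0.02 & d <= 0.01) #thresholds taken from the text
sum(nearZero, na.rm = TRUE) #number of matched products priced (almost) identically
mean(nearZero, na.rm = TRUE) #as a proportion of the 53 matched products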
The granova visualisation shows that for the remaining products, Coles generally has the higher price. This is reflected in the mean price difference, the 95% confidence interval of the paired-samples t-test, and the statistically significant finding that Woolworths is cheaper than Coles.
Strengths and limitations: The strength of this study is the random sampling via a VBA script to collect the data, which avoids much of the human bias that can occur when selecting items off a grocery shelf.
The biggest limitation is the reliance on a third-party aggregator site. It is not known how that site scrapes the Woolworths and Coles websites for products, whether its product list is complete, how often the data is updated, and so on. In addition, the website may apply its own algorithms and usually displays the most popular items first.
Improvements: The point of this study is to find which supermarket is cheaper. ‘Cheaper’ is ambiguous and can mean different things; in this study, matched products were used to determine which of the two major supermarkets is cheaper. A more realistic approach would be a stratified sampling design in which a basket of goods from one supermarket is compared with an equivalent basket from the other. This allows comparison between products that are not identical, better emulates what a typical shopper would buy, and therefore better predicts which supermarket is cheaper overall. A sketch of such a stratified sample is shown below.
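A minimal sketch of such a stratified comparison using the existing tidy data; drawing five items per category is an arbitrary choice, and it assumes every category has at least five priced products at each store.
#Sketch of a stratified 'basket' comparison: the same number of items drawn from every category at each store
set.seed(2018) #arbitrary seed so the sketch is reproducible
basket <- tidy_WorkingDataset %>%
  filter(!is.na(Price)) %>%
  group_by(StoreName, CategoryName) %>%
  sample_n(size = 5) %>% #assumes at least 5 priced products per category per store
  ungroup()
basket %>% group_by(StoreName) %>% summarise(BasketCost = sum(Price))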
Physically visiting stores and comparing prices would also help diversify the dataset rather than relying on a single website.
Category number to category name mapping used for sampling:
1 Fridge
2 Bakery
3 Pantry
4 Freezer
5 Baby, Health & Beauty
6 Meat & Seafood
7 Entertainment & International Food
8 Clothing, Household & Pet
9 Drinks & Tobacco
#VBA Code used for random sampling of the website:
Option Explicit
'Requires project references to "Microsoft Internet Controls" and "Microsoft HTML Object Library"
Private Sub getCWPrices()
Dim IE As New InternetExplorer
Dim ShoppingItems As MSHTML.IHTMLElementCollection
Dim innnerElements As MSHTML.IHTMLElementCollection
Dim productName As String
Dim colesPrice As Double
Dim woolworthsPrice As Double
Dim ShoppingItem As MSHTML.IHTMLElement
Dim i As Integer
Dim j As Integer
Dim catagory As String
Dim pageNumber As Integer
Dim productNumber As Integer
Dim Doc As HTMLDocument
Dim url As String
IE.Visible = False
For i = 0 To 299 'Loop over the 300 pre-generated random sample rows
catagory = Cells(Cells(1 + i, 1), 9) 'Column 1 holds the category number; column 9 of that row holds the category URL slug
pageNumber = Cells(1 + i, 2) 'Randomly generated page number within the category
productNumber = Cells(1 + i, 3) 'Randomly generated product position on the page
url = "https://grocerycop.com.au/products/category/" & catagory & "?s=productDetail.description&p=" & pageNumber - 1
IE.Navigate url
Do 'Wait for the page and its scripts to finish loading
DoEvents
Application.Wait (Now + TimeValue("0:00:03"))
Loop Until IE.ReadyState = READYSTATE_COMPLETE
Set Doc = IE.Document
Set ShoppingItems = Doc.getElementsByClassName("prod-el") 'Each product tile on the page
On Error Resume Next 'Skip products with a missing name or a missing price at either store
productName = ShoppingItems(productNumber).getElementsByTagName("h5")(0).innerText
colesPrice = ShoppingItems(productNumber).getElementsByClassName("left coles")(0).getElementsByClassName("price")(0).innerText
woolworthsPrice = ShoppingItems(productNumber).getElementsByClassName("right woolworths")(0).getElementsByClassName("price")(0).innerText
On Error GoTo 0 'Restore normal error handling
Debug.Print productName & ", Coles: " & colesPrice & ", Woolworths Price: " & woolworthsPrice
Cells(i + 1, 4) = productName 'Write the scraped name and prices back to the sheet
Cells(i + 1, 5) = colesPrice
Cells(i + 1, 6) = woolworthsPrice
productName = vbNullString 'Reset before the next iteration
colesPrice = 0
woolworthsPrice = 0
Next i
End Sub
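The script above reads pre-generated random indices (category number, page, product position) from columns 1-3 of the worksheet; their generation is not shown in the appendix. A minimal R sketch of how those indices could be produced follows, with the page and product-position ranges assumed rather than taken from the site.
#Sketch of generating the random sample indices the VBA script reads from columns 1-3 (ranges are assumptions)
set.seed(20180917) #assumed seed for reproducibility
sampleIndex <- data.frame(
  CategoryNumber = sample(1:9, 300, replace = TRUE), #nine categories listed above
  Page = sample(1:20, 300, replace = TRUE), #assumed maximum number of pages per category
  ProductNumber = sample(0:23, 300, replace = TRUE) #assumed products per page (0-indexed, as the script expects)
)
write.csv(sampleIndex, "sample_index.csv", row.names = FALSE) #paste into the workbook the VBA script reads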