Group Details:

Executive Statement:

The aim of this investigation is to determine whether Coles or Woolworths offer the lowest prices. Prices were randomly chosen from an aggregator website https://grocerycop.com.au on 17/09/2018 using a bespoke VBA script (refer to Appendix A for the methodology and code). 300 products were randomly chosen, of which 64 were avaialble at both Coles and Woolworths. An initial look at these 64 items found some outliers due to a category called ‘Drinks & Tobacco’ which consisted of 11 items. As alcohol cannot be purchased inside Coles and Woolworths (usually purchasable in a seperate store located next to the supermarket), this category was removed entirely. Removing this category did not impact on the randomness of the VBA script.

A paired-samples t-test was used to test for a signifcant mean difference between Coles and Woolworths. The mean difference was found to be $0.726. Visual inspection of the Q-Q plot for price differences suggests that there is a high number of zeros (i.e no price difference) and there is some evidence of non-normality. However the central limit theorem ensured that the t-test could be applied due to the large sample size. The paired-samples t-test found a statistically signifncant mean price difference between Coles and woolworths. t(df = 52) = 3.893, p < 0.001, 95% [0.352 1.1].

Load Packages and Data

#Load packages
library(magrittr); library(dplyr); library(car); library(ggplot2); library(readxl); library(tidyr); library(lattice); library(granova)
## Warning: package 'dplyr' was built under R version 3.5.2
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## Warning: package 'readxl' was built under R version 3.5.2
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
#Load and Clean Data
dataset <- read_excel("data.xlsx") #read the data set into the object 

WorkingDataset <- dataset %>% filter(!(CategoryName %in% "Drinks & Tobacco")) #Filter out Alcohol&Tobacco 

#Tidy the data
tidy_WorkingDataset <- WorkingDataset %>% gather(-c(1:5), key = "StoreName", value = "Price")

sapply(tidy_WorkingDataset, class) #inspect type of each variable
## CategoryNumber   CategoryName           Page  ProductNumber        Product 
##      "numeric"    "character"      "numeric"      "numeric"    "character" 
##      StoreName          Price 
##    "character"      "numeric"
#StoreName,CategoryName and CategoryNumber should be a factor as its a categorical variable 
tidy_WorkingDataset <- tidy_WorkingDataset %>% mutate(
  
  CategoryName =factor(CategoryName,
                       levels = c("Baby, Health & Beauty","Bakery","Clothing, Household & Pet","Entertainment & International Food","Freezer","Fridge","Meat & Seafood","Pantry"),
                       labels = c("Baby,Health&Beauty","Bakery","Clothing,HouseHold","Entertainment",
                       "Freezer","Fridge","Meat&Seafood","Pantry")),
  CategoryNumber = as.factor(CategoryNumber),
  StoreName = as.factor(StoreName)
)

Summary Statistics

#Summary Statistics For Dataset 
tidy_WorkingDataset %>% group_by(StoreName) %>% summarise(
                                         Min = min(Price,na.rm = TRUE),
                                         Q1 = quantile(Price,probs = .25,na.rm = TRUE),
                                         Median = median(Price, na.rm = TRUE),
                                         Q3 = quantile(Price,probs = .75,na.rm = TRUE),
                                         Max = max(Price,na.rm = TRUE),
                                         Mean = mean(Price, na.rm = TRUE),
                                         SD = sd(Price, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Price))
  
)
#screen the sample distribution:
hist(tidy_WorkingDataset$Price,col = "grey", main = "Histogram:Store Prices", xlab = "Price" )
abline(v = mean(WorkingDataset$Woolworths), col = "darkgreen",lwd = 2) #Pop mean (red)
abline(v= mean(WorkingDataset$Coles),
       col = "red", lwd = 2) #Sample Mean (blue)
legend("topright", c("Coles Avg Price", "Woolworths Avg Price"), fill=c("red", "darkgreen"))

#Screen for outliers:
boxplot(data = tidy_WorkingDataset, Price ~ StoreName, col=c("red","darkgreen"), xlab = "Price in A$", horizontal = TRUE, main = "Boxplot of Coles and Woolworth Grocery Prices") 

#category wise comparison 
library(ggplot2)
tidy_WorkingDataset$boxcolor <- with(tidy_WorkingDataset,
                                  ifelse(tidy_WorkingDataset$StoreName %in% "Woolworths","green","red"))
boxplot <- ggplot(tidy_WorkingDataset,aes(x= CategoryName, y= Price, fill=StoreName), main = "Boxplot per Category and Store") + geom_boxplot()
print(boxplot + ggtitle("Boxplot per Category and Store") + theme(plot.title = element_text(hjust = 0.5), legend.position = "bottom"))  #Add title and center

Trend & high level interpretation:
Insights and trends from the summary statistics and visualisations are:

Hypothesis Test

The paired or dependent t-test was chosen as the study required the same product to be measured twice - once at Coles and the other at Woolworths. As part of running this test, an assumption is made that the mean price difference between Coles and Woolworths is normally distributed. In order to verify this assumption, a QQplot is drawn to assess whether the sample is normally distributed or not. However this is not strictly required as due to the large sample size involved (53 > 30), the Central Limit theorem can be invoked as any sufficinetly large samples would have a mean that is normally distributed regardless of the underlying population distribution. The significance level is set at the default level (0.95).

WorkingDataset <- WorkingDataset %>% mutate(d=Coles - Woolworths)
invisible(qqPlot(WorkingDataset$d, dist='norm', ylab = "Price Difference in $", main = "Quantile-Quantile Plot"))

Most values are within the 95% confidence interval bands for that quartile, with the exception of some outliers on the far right and a cluster near 0,0. The QQplot is showing that there is a significantly higher number of zero price differences than expected in a normal distribution. Fortunately due to large sample size, the t-test can still be used.

#Run the t-test
t.test(WorkingDataset$Coles,WorkingDataset$Woolworths, paired = TRUE, alternative =  "two.sided")
## 
##  Paired t-test
## 
## data:  WorkingDataset$Coles and WorkingDataset$Woolworths
## t = 3.8926, df = 52, p-value = 0.000284
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.3515781 1.0997427
## sample estimates:
## mean of the differences 
##               0.7256604

Interpretation:

The hypothesis test for this study is as follows:

\[H_0:\mu\Delta = 0\] \[H_A:\mu\Delta \neq 0\]

With \(\mu\Delta\) being the mean difference in price between Coles and Woolworths. The results of the paired t.test is as follows:

-Therefore we reject the Null Hypothesis.

The paired sample T-test is statistically significant.

invisible(granova.ds(data.frame(WorkingDataset$Coles, WorkingDataset$Woolworths), ylab = "Woolworths Price", xlab = "Coles Price", main = "Paired Sample Visualisation for Mean Difference"))

The granova visualisation shows that there is a high concentration of products which are very closely clustered together and also on the identity line. The price difference is also clustered together, but to the right of the identity line. This suggests that Woolworths is cheaper than Coles. It can also be seen that there are 3 products which are significantly cheaper at Woolworths than Coles.

The thick red dotted line shows the mean difference, and the 95% confidence interval in green just misses capturing the solid black line which is the null.

Discussion:

Findings: The QQplot for price difference shows that there is a spike of identical values centered around 0. Looking at this data, roughly 20 products have a difference of $0.01 to $-0.02, which account for roughly 33% of the total dataset. This would be definitely intentional as both parties are looking to price match popular goods and/or have agreements with their suppliers whereby the retail recommended price is set.

The granova visualisation shows that for the remainder of the products, Coles generally has the higher price compared to Woolworths. This is further reflected in the mean price difference, the 95% confidence interval of the paired sample t-test, and the statistically significant finding Woolworths is cheaper than Coles.

Strengths and limitations: The strength of this study is being able to randomly sample via a VBA script in order to collect the data. This helps avoid a lot of the human bias which is possible when selecting items off a grocery shelf.

The biggest limitation is being reliant on a third party aggrevator site. it is not known how that site scrapes Woolworth/Coles websites for products, whether the list is complete, how often the data is updated, etc. In addition the website may have its own algorithrims and usually displays the most popular items first.

Improvements: The point of this study is to find which supermarket is cheaper. Cheaper is an ambiguous word and can mean different things - in this specific study, matched products were used to determine which of the 2 major supermarket is cheaper. However a more accurate way would be to conduct a Stratified sampling technique whereby a basket of goods from 1 supermarket is compared with another. This allows comparison between products that are not identical, as well as better emulate what a typical shopper would buy, and thus properly predict which supermarket is cheaper.

Actually physically visiting stores and comparing prices would also help diversify the dataset and not make it reliant on a single website.

Appendix A - Sampling Methodology and VBA Code

  1. Categories were predefined on the webpage. We indexed these categories as shown below:

1 Fridge
2 Bakery
3 Pantry
4 Freezer
5 Baby, Health & Beauty
6 Meat & Seafood
7 Entertainment & International Food
8 Clothing, Household & Pet
9 Drinks & Tobacco

  1. Once a category was clicked the website would then navigate to that category’s list which was spread of many pages. By default, this data was sorted by relevance
  2. Each page contained 24 items
  3. An excel macro was created to scrape the website for data. The macro performed the following tasks:
  1. Generated a random number between 1 and 10 for category
  2. Generated a random number between 1 and 30 for page number
  3. Generated a random number between 1 and 24 for item index on the chosen page
  4. If a sale price was detected, a result of 0 was returned. This was done by the following:
    a. non sale price items had the html tag while sale items did not.
    b. the macro ignored occurrences of where the html tag was not present
  5. Prices for both Coles and Woolworths were recorded to the spreadsheet
#VBA Code used for random sampling of the website:

Option Explicit
Private Sub getCWPrices()
    Dim IE As New InternetExplorer
    Dim ShoppingItems   As MSHTML.IHTMLElementCollection
    Dim innnerElements  As MSHTML.IHTMLElementCollection
    Dim productName     As String
    Dim colesPrice      As Double
    Dim woolworthsPrice As Double
    Dim ShoppingItem    As MSHTML.IHTMLElement
    Dim i               As Integer
    Dim j               As Integer
    Dim catagory        As String
    Dim pageNumber      As Integer
    Dim productNumber   As Integer
    Dim Doc             As HTMLDocument
    Dim url             As String
    
    IE.Visible = False
    
    For i = 0 To 300
        catagory = Cells(Cells(1 + i, 1), 9)
        pageNumber = Cells(1 + i, 2)
        productNumber = Cells(1 + i, 3)
        url = "https://grocerycop.com.au/products/category/" & catagory & "?s=productDetail.description&p=" & pageNumber - 1
        IE.Navigate url
        Do
            DoEvents
            Application.Wait (Now + TimeValue("0:00:3"))
        Loop Until IE.ReadyState = READYSTATE_COMPLETE
        Set Doc = IE.Document
        Set ShoppingItems = Doc.getElementsByClassName("prod-el")
        On Error Resume Next
        productName = ShoppingItems(productNumber).getElementsByTagName("h5")(0).innerText
        colesPrice = ShoppingItems(productNumber).getElementsByClassName("left coles")(0).getElementsByClassName("price")(0).innerText
        woolworthsPrice = ShoppingItems(productNumber).getElementsByClassName("right woolworths")(0).getElementsByClassName("price")(0).innerText
        On Error GoTo 0
        Debug.Print productName & ", Coles: " & colesPrice & ", Woolworths Price: " & woolworthsPrice
        Cells(i + 1, 4) = productName
        Cells(i + 1, 5) = colesPrice
        Cells(i + 1, 6) = woolworthsPrice
        productName = vbNullString
        colesPrice = 0
        woolworthsPrice = 0
    Next i
End Sub