Group Details
- Student1: Meihui Lee (S3176428)
- Student2: Daniel Tay (S3089145)
Executive Statement
In recent months supermarket giants Woolworths and Coles have been cutting prices to fend off challenger Aldi. Our initial assumption is that due to both supermarkets having similar purchasing power and financial strength, there will be no overall difference between its prices.
The objective of this investigation is to determine which, Woolworths or Coles, is cheaper, based on statistical evidence.
Using the Simple Random Sampling method we collected samples from a full list of items obtained from the Woolworths website using a data scraper. The data population is defined to only exact matched items on regular prices. With a sample size of 100 (>30), referring to the Central Limit Theorem we assume our sampling distribution to be approximately normally distributed.
With descriptive statistics and box plots we made observations and performed hypothesis testing to determine if there is a statistically significant difference between item prices. The sample mean difference between item prices in Woolworths against Coles was found to be -0.1889 (SD = 0.7635). The paired-samples t-test found the p-value to be 0.015, which is less than our significance level of 0.05. The 95% Confidence Interval [-0.340, -0.037] does not capture our null hypothesis value of 0.
The investigation found a statistically significant mean difference between prices in Woolworths and Coles. Woolworths prices are found to be significantly cheaper compared to Coles.
Load Packages and Data
library(readr)
library(dplyr)
library(DT)
library(car)
library(granova)
Price_War_Samples <- read_csv("~/Dropbox/RMIT - Master of Analytics/Intro to Stats/Assignment 3 - Price Wars/Price Wars/Excel All Files/Price War Samples.csv")
Data Collection:
To ensure every item has an equal chance of being selected, a full list of items was obtained from the Woolworths website using a data miner application. We limit these items to consumables, excluding items that we think do not behave the same way as regular food and groceries. This will be our population.
The list is sorted, cleaned, compiled into Excel and ordered. We use the RANDBETWEEN function in Excel to randomly generate numbers to pick items from the ordered list. We define an accurate match to be of exact name, size and weight. If an item picked does not have an accurate match in Coles, we skip over the item. The item is excluded from our population.
Current prices are obtained from respective websites on 2017/MAY/10.
An extract of the sample is shown as following:
datatable(Price_War_Samples, options = list(
initComplete = JS("
function(settings, json) {
$(this.api().table().header()).css({
'background-color': '#000',
'color': '#fff'
});
}")
))
Note:
Skips in the Sample No indicate where an accurate match could not be found in Coles. 195 random numbers were generated to achieve a sample size of 100 accurately matched items between Woolworths and Coles.
Var is the price difference between the two supermarkets, Woolworths against Coles, where a negative variance indicates Woolworths is cheaper than Coles and vice versa.
Summary Statistics
We begin the investigation by analysing the descriptive statistics of both sample results from Woolworths and Coles:
Woolworths
Price_War_Samples %>% summarise (Min = min(Woolworths,na.rm = TRUE),
Q1 = quantile(Woolworths,probs = .25,na.rm = TRUE),
Median = median(Woolworths, na.rm = TRUE),
Q3 = quantile(Woolworths,probs = .75,na.rm = TRUE),
Max = max(Woolworths,na.rm = TRUE),
Mean = mean(Woolworths, na.rm = TRUE),
SD = sd(Woolworths, na.rm = TRUE),
Range = Max - Min, na.rm = TRUE,
n = n(),
Missing = sum(is.na(Woolworths)))
Coles
Price_War_Samples %>% summarise (Min = min(Coles,na.rm = TRUE),
Q1 = quantile(Coles,probs = .25,na.rm = TRUE),
Median = median(Coles, na.rm = TRUE),
Q3 = quantile(Coles,probs = .75,na.rm = TRUE),
Max = max(Coles,na.rm = TRUE),
Mean = mean(Coles, na.rm = TRUE),
SD = sd(Coles, na.rm = TRUE),
Range = Max - Min, na.rm = TRUE,
n = n(),
Missing = sum(is.na(Coles)))
We visualise using a side-by-side box plot of the prices:
boxplot(
Price_War_Samples$Woolworths,
Price_War_Samples$Coles,
ylab = "Item Price",
xlab = "Supermarket",
col=c("green", "red"))
axis(1, at = 1:2, labels = c("Woolworths", "Coles"))

Observation:
Based on the sample mean (4.7546 v 4.9435) it appears that Woolworths is slightly cheaper than Coles.
Woolworths has a larger standard deviation (4.012 v 3.919) and a larger range (37.31 v 35.72) compared to Coles.
Referring to the summary statistics and box plot, item prices in Woolworths and Coles show very slight differences. The hypothesis test will help us consider whether this difference is statistically significant.
Hypothesis Test
We reject the one sample t-test as we do not have a population mean. We reject the two sample t-test as our samples are matched, ie. not independent.
Based on our strict matching rules where items are accurately matched between both supermarkets, each price observation is said to be dependent. Hence we choose the paired samples t-test, also known as the dependent samples t-test.
We use the paired-samples t-test to determine if the mean price difference between Woolworths and Coles can be considered statistically significant. Based on our sample size of 100 (>30) we can assume our data to be normally distributed. We set the significance level at 0.05. The statistical hypotheses:
\[H_0: \mu_\Delta = 0\] \[H_A: \mu_\Delta \neq 0\]
We calculate the descriptive statistics for the mean difference between Woolworths and Coles:
Price_War_Samples %>% summarise (Min = min(Var,na.rm = TRUE),
Q1 = quantile(Var,probs = .25,na.rm = TRUE),
Median = median(Var, na.rm = TRUE),
Q3 = quantile(Var,probs = .75,na.rm = TRUE),
Max = max(Var,na.rm = TRUE),
Mean = mean(Var, na.rm = TRUE),
SD = sd(Var, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Var)))
boxplot (Price_War_Samples$Var,
ylab = "Prices",
xlab = "Price Variance (Woolworths v Coles)",
col=c("yellow"))

NA
Observation:
- It appears that Woolworths is cheaper than Coles on a mean of 0.1889. We note though that the price variance ranges from being $2.10 cheaper to $4.06 more expensive.
Although we have a sample size >30 we still look at the Q-Q plot to check normality of the differences:
qqPlot(Price_War_Samples$Var, dist="norm")

The Q-Q plot above shows that apart from a few outliers the majority of our sample follows a normal distribution.
We calculate the paired sample t-test using the t.test() function in R:
t.test(Price_War_Samples$Woolworths, Price_War_Samples$Coles,
paired = TRUE,
conf.level = 0.95,
alternative = "two.sided")
Paired t-test
data: Price_War_Samples$Woolworths and Price_War_Samples$Coles
t = -2.4743, df = 99, p-value = 0.01505
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.34038615 -0.03741385
sample estimates:
mean of the differences
-0.1889
We visualise the mean difference using a scatter plot. The plot reports the mean difference and confidence intervals from the paired samples t-test:
granova.ds(
data.frame(Price_War_Samples$Woolworths, Price_War_Samples$Coles),
xlab = "Woolworths",
ylab = "Coles"
)
Summary Stats
n 100.000
mean(x) 4.755
mean(y) 4.944
mean(D=x-y) -0.189
SD(D) 0.763
ES(D) -0.247
r(x,y) 0.982
r(x+y,d) 0.123
LL 95%CI -0.340
UL 95%CI -0.037
t(D-bar) -2.474
df.t 99.000
pval.t 0.015

Interpretation
We use a paired-samples t-test to test for a significant mean difference between item prices in Woolworths against Coles. The mean difference was found to be -0.1889 (SD = 0.7635).
The paired-samples t-test found the following:
- p-value of 0.015 is less than our significance level of 0.05
- The null hypothesis value of 0 is not captured in our 95% Confidence Interval [-0.340, -0.037]
The results above found a statistically significant mean difference between prices in Woolworths and Coles. Woolworths prices are found to be significantly cheaper compared to Coles.
Discussion
Based on our findings, the initial assumption that no overall price difference between the two supermarkets appears to be incorrect. Looking at the Dependent Sample Assessment Plot we see that the majority of samples indicate Woolworths being cheaper. However, these plots (where Woolworths was cheaper) are very close to the identity line. This indicates that Woolworths prices are just slightly cheaper than Coles. In the few instances where Coles is cheaper, the difference is more significant.
When we look back at the descriptive statistics for the mean difference, the Max of differences (Coles being cheaper than Woolworths) is $4.06 where else the Min of differences (Woolworths being cheaper than Coles) is just $2.10. These observations indicate a possibility of each supermarket having a different pricing strategy. Woolworths seem more likely to give slight better prices on its overall products compared to Coles but Coles is more likely to give a larger saving to its customers on fewer products.
Having obtained a full list of items from the Woolworths website we are confident of the Simple Random Sampling method, where every item in Woolworths had an equal chance of being selected. We trust this to be the most effective probability-based sampling method.
We note though that comparing only regular priced items in this investigation is a limitation, noting that special discount prices offered by each supermarket could significantly change our findings of whether Woolworths or Coles is cheaper. However due to these special discount prices varying from week to week, pricing data would have to be collected over a significant number of weeks. Due to time constraints we were unable to do so thus unable to fairly compare the impact of these special discount prices.
For better results we should definitely take into account the impact of special discount prices. Future investigations could also focus on specific categories of items to provide a targeted insight.
---
title: "MATH1324 Assignment 3"
subtitle: "Supermarket Price Wars"
output: html_notebook
---

## Group Details

* Student1: Meihui Lee (S3176428)
* Student2: Daniel Tay (S3089145)

## Executive Statement

In recent months supermarket giants Woolworths and Coles have been cutting prices to fend off challenger Aldi. Our initial assumption is that due to both supermarkets having similar purchasing power and financial strength, there will be no overall difference between its prices.

The objective of this investigation is to determine which, Woolworths or Coles, is cheaper, based on statistical evidence.

Using the Simple Random Sampling method we collected samples from a full list of items obtained from the Woolworths website using a data scraper. The data population is defined to only exact matched items on regular prices. With a sample size of 100 (>30), referring to the Central Limit Theorem we assume our sampling distribution to be approximately normally distributed.

With descriptive statistics and box plots we made observations and performed hypothesis testing to determine if there is a statistically significant difference between item prices. The sample mean difference between item prices in Woolworths against Coles was found to be -0.1889 (SD = 0.7635). The paired-samples t-test found the p-value to be 0.015, which is less than our significance level of 0.05. The 95% Confidence Interval [-0.340, -0.037] does not capture our null hypothesis value of 0.

The investigation found a statistically significant mean difference between prices in Woolworths and Coles. Woolworths prices are found to be significantly cheaper compared to Coles.

## Load Packages and Data

```{r}
library(readr)
library(dplyr)
library(DT)
library(car)
library(granova)

Price_War_Samples <- read_csv("~/Dropbox/RMIT - Master of Analytics/Intro to Stats/Assignment 3 - Price Wars/Price Wars/Excel All Files/Price War Samples.csv")

```

Data Collection:

1. To ensure every item has an equal chance of being selected, a full list of items was obtained from the Woolworths website using a data miner application. We limit these items to consumables, excluding items that we think do not behave the same way as regular food and groceries. This will be our population. 

2. The list is sorted, cleaned, compiled into Excel and ordered. We use the RANDBETWEEN function in Excel to randomly generate numbers to pick items from the ordered list. We define an accurate match to be of exact name, size and weight. If an item picked does not have an accurate match in Coles, we skip over the item. The item is excluded from our population.

3. Current prices are obtained from respective websites on 2017/MAY/10.

An extract of the sample is shown as following:

```{r}

datatable(Price_War_Samples, options = list(
  initComplete = JS("
    function(settings, json) {
      $(this.api().table().header()).css({
        'background-color': '#000',
        'color': '#fff'
      });
    }")
))

```

Note:

1. Skips in the Sample No indicate where an accurate match could not be found in Coles. 195 random numbers were generated to achieve a sample size of 100 accurately matched items between Woolworths and Coles.

2. Var is the price difference between the two supermarkets, Woolworths against Coles, where a negative variance indicates Woolworths is cheaper than Coles and vice versa.

## Summary Statistics

We begin the investigation by analysing the descriptive statistics of both sample results from Woolworths and Coles:

Woolworths

```{r}

Price_War_Samples %>% summarise                  (Min = min(Woolworths,na.rm = TRUE),
                                                  Q1 = quantile(Woolworths,probs = .25,na.rm = TRUE),
                                                  Median = median(Woolworths, na.rm = TRUE),
                                                  Q3 = quantile(Woolworths,probs = .75,na.rm = TRUE),
                                                  Max = max(Woolworths,na.rm = TRUE),
                                                  Mean = mean(Woolworths, na.rm = TRUE),
                                                  SD = sd(Woolworths, na.rm = TRUE),
                                                  Range = Max - Min, na.rm = TRUE,
                                                  n = n(),
                                                  Missing = sum(is.na(Woolworths)))
```

Coles

```{r}

Price_War_Samples %>% summarise                  (Min = min(Coles,na.rm = TRUE),
                                                  Q1 = quantile(Coles,probs = .25,na.rm = TRUE),
                                                  Median = median(Coles, na.rm = TRUE),
                                                  Q3 = quantile(Coles,probs = .75,na.rm = TRUE),
                                                  Max = max(Coles,na.rm = TRUE),
                                                  Mean = mean(Coles, na.rm = TRUE),
                                                  SD = sd(Coles, na.rm = TRUE),
                                                  Range = Max - Min, na.rm = TRUE,
                                                  n = n(),
                                                  Missing = sum(is.na(Coles)))

```

We visualise using a side-by-side box plot of the prices:

```{r}

boxplot(
  Price_War_Samples$Woolworths,
  Price_War_Samples$Coles,
  ylab = "Item Price",
  xlab = "Supermarket",
  col=c("green", "red"))
axis(1, at = 1:2, labels = c("Woolworths", "Coles"))

```

Observation:

1. Based on the sample mean (4.7546 v 4.9435) it appears that Woolworths is slightly cheaper than Coles.

2. Woolworths has a larger standard deviation (4.012 v 3.919) and a larger range (37.31 v 35.72) compared to Coles.

Referring to the summary statistics and box plot, item prices in Woolworths and Coles show very slight differences. The hypothesis test will help us consider whether this difference is statistically significant. 

## Hypothesis Test

We reject the one sample t-test as we do not have a population mean. We reject the two sample t-test as our samples are matched, ie. not independent.

Based on our strict matching rules where items are accurately matched between both supermarkets, each price observation is said to be dependent. Hence we choose the **paired samples t-test, also known as the dependent samples t-test**. 

We use the paired-samples t-test to determine if the mean price difference between Woolworths and Coles can be considered statistically significant. Based on our sample size of 100 (>30) we can assume our data to be normally distributed. We set the significance level at 0.05. The statistical hypotheses:

$$H_0: \mu_\Delta = 0$$
$$H_A: \mu_\Delta \neq 0$$

We calculate the descriptive statistics for the mean difference between Woolworths and Coles:

```{r}


Price_War_Samples %>% summarise                  (Min = min(Var,na.rm = TRUE),
                                                  Q1 = quantile(Var,probs = .25,na.rm = TRUE),
                                                  Median = median(Var, na.rm = TRUE),
                                                  Q3 = quantile(Var,probs = .75,na.rm = TRUE),
                                                  Max = max(Var,na.rm = TRUE),
                                                  Mean = mean(Var, na.rm = TRUE),
                                                  SD = sd(Var, na.rm = TRUE),
                                                  n = n(),
                                                  Missing = sum(is.na(Var)))


```
```{r}

boxplot   (Price_War_Samples$Var,
           ylab = "Prices",
           xlab = "Price Variance (Woolworths v Coles)",
           col=c("yellow"))   
    
```

Observation:

1. It appears that Woolworths is cheaper than Coles on a mean of 0.1889. We note though that the price variance ranges from being $2.10 cheaper to $4.06 more expensive.

Although we have a sample size >30 we still look at the Q-Q plot to check normality of the differences:

```{r}

qqPlot(Price_War_Samples$Var, dist="norm")

```

The Q-Q plot above shows that apart from a few outliers the majority of our sample follows a normal distribution. 

We calculate the paired sample t-test using the t.test() function in R:

```{r}

t.test(Price_War_Samples$Woolworths, Price_War_Samples$Coles,
       paired = TRUE,
       conf.level = 0.95,
       alternative = "two.sided")

```

We visualise the mean difference using a scatter plot. The plot reports the mean difference and confidence intervals from the paired samples t-test:

```{r, message=FALSE, warning=FALSE, fig.height=9, fig.width=9}

granova.ds(
  data.frame(Price_War_Samples$Woolworths, Price_War_Samples$Coles),
  xlab = "Woolworths",
  ylab = "Coles"
  )

```

## Interpretation

We use a paired-samples t-test to test for a significant mean difference between item prices in Woolworths against Coles. The mean difference was found to be -0.1889 (SD = 0.7635). 

The paired-samples t-test found the following:

a) p-value of 0.015 is less than our significance level of 0.05
b) The null hypothesis value of 0 is not captured in our 95% Confidence Interval [-0.340, -0.037]

The results above found a statistically significant mean difference between prices in Woolworths and Coles. Woolworths prices are found to be significantly cheaper compared to Coles.

## Discussion 

Based on our findings, the initial assumption that no overall price difference between the two supermarkets appears to be incorrect. Looking at the Dependent Sample Assessment Plot we see that the majority of samples indicate Woolworths being cheaper. However, these plots (where Woolworths was cheaper) are very close to the identity line. This indicates that  Woolworths prices are just slightly cheaper than Coles. In the few instances where Coles is cheaper, the difference is more significant. 

When we look back at the descriptive statistics for the mean difference, the Max of differences (Coles being cheaper than Woolworths) is $4.06 where else the Min of differences (Woolworths being cheaper than Coles) is just $2.10. These observations indicate a possibility of each supermarket having a different pricing strategy. Woolworths seem more likely to give slight better prices on its overall products compared to Coles but Coles is more likely to give a larger saving to its customers on fewer products.  

Having obtained a full list of items from the Woolworths website we are confident of the Simple Random Sampling method, where every item in Woolworths had an equal chance of being selected. We trust this to be the most effective probability-based sampling method.

We note though that comparing only regular priced items in this investigation is a limitation, noting that special discount prices offered by each supermarket could significantly change our findings of whether Woolworths or Coles is cheaper. However due to these special discount prices varying from week to week, pricing data would have to be collected over a significant number of weeks. Due to time constraints we were unable to do so thus unable to fairly compare the impact of these special discount prices.

For better results we should definitely take into account the impact of special discount prices. Future investigations could also focus on specific categories of items to provide a targeted insight. 



