Individual Details
- Student Name: Balkan Misirli (s3650250)
Executive Statement
The aim of the investigation was to determine whether Coles or Woolworths has cheaper prices in general. As the population of products/prices for each chain is difficult to obtain and each product is not always matched in the other store, it is necessary to make inferences on the population based on an observable subset of matched products.
The entire inventory of Coles’ online shop was scraped and tabulated per category, with roughly 22,200 total products. To obtain a sample subset of 0.5% of these products, 111 items were chosen at random from each table proportional to the ratio of that entire category’s total products to the overall total number of products.
Random numbers were chosen via Google Sheets’ random function, which then corresponded to rows in the Coles scraped dataset. The chosen products were then matched manually in Woolworths’ online shop. If a match existed, these were taken into the sample. If no match was found, another random item was chosen until a match was found. Items which were ‘on special’ in either chain were not taken into the sample.
The results of the testing indicated that the null hypothesis, that both chains had the same pricing over their entire inventory, was very likely to be incorrect. The one sample t test returned a p-value of < 0.001.
Load Packages and Data
library(car)
library(dplyr)
prices <- read.csv("/Users/balkan_misirli/Downloads/ColesWoolPricesV5.csv")
# data was manually collated in Google Sheets
lev <- read.csv("/Users/balkan_misirli/Downloads/ColesWoolPricesLEV3.csv")
# the long format of the same data, for use in other functions further down
price_diff <- prices %>% mutate(diff_price = Coles.Cost.Quantity - Woolworths.Cost.Quantity, diff_price_perc = diff_price/Coles.Cost.Quantity)
# Price difference and price difference percentage columns were added to transform the values into standardized measures of price difference.
Summary Statistics
The price per unit (kilogram/litre/each) was used in this investigation. In this sample, there were 80 items where Woolworths was cheaper, 15 where both were the same price and the remaining 16 items where Coles was cheaper.
When viewed with side-by-side boxplots the summary statistics look quite similar, with Woolworths having the slightly lower appearance. However, without removing the outliers this plot is largely unreadable.
The summary statistics of the sample show that Coles has a higher mean and standard deviation than Woolworths. the Min/Max statistics emphasise the extremely wide spectrum of values in this sample.
The Q-Q plots of the data did not indicate normality, but given the large sample size this should not be a problem. Also, this type of plot is likely meaningless given the values of each product are not really linked to one another.
The plot of price difference in percentage terms visually exposes the relationship at play within the sample. The red line is price parity, and most points lie above it- showing Coles to be more expensive than Woolworths in this sample.
boxplot(prices$Coles.Cost.Quantity, prices$Woolworths.Cost.Quantity, names = c("Coles", "Woolworths"), ylab = "Dollars")

boxplot(prices$Coles.Cost.Quantity, prices$Woolworths.Cost.Quantity, names = c("Coles", "Woolworths"), ylab = "Dollars", outline = FALSE)

lev %>% group_by(Chain) %>% summarise(Min = min(cost.quantity,na.rm = TRUE),
Q1 = quantile(cost.quantity,probs = .25,na.rm = TRUE),
Median = median(cost.quantity, na.rm = TRUE),
Q3 = quantile(cost.quantity,probs = .75,na.rm = TRUE),
Max = max(cost.quantity,na.rm = TRUE),
Mean = mean(cost.quantity, na.rm = TRUE),
SD = sd(cost.quantity, na.rm = TRUE),
n = n(),
Missing = sum(is.na(cost.quantity)))
prices$Coles.Cost.Quantity %>% qqPlot(dist = "norm")

prices$Woolworths.Cost.Quantity %>% qqPlot(dist = "norm")

plot(price_diff$diff_price_perc*100, ylab = "% Price Diff (+ve: Coles dearer, -ve: WW dearer", xlab = "Product Number")
abline(h = 0, col = "red")

Hypothesis Test
The hypothesis test chosen was the One Sample T Test on the percentage difference in prices per unit of weight (with Coles price being the denominator). This unit was chosen because although the magnitude of prices varied greatly the relative price difference was rather smaller. However, this means that a 5c difference in a 20c product would be weighted equally with a 5 dollar difference in a 20 dollar product.
The null hypothesis was that the mean price percentage difference between these populations is zero. A significance level of alpha = 0.01 was used, such that the null will be rejected if the p value indicates probability of obtaining this sample mean, given the null being true, is less than 0.01.
To test for homogeneity of variance, the Levene’s test was implemented on the two samples. Resulting in p = 0.82, the test failed to reject the null hypothesis. Thus, the population variance of Coles and Woolworths are likely to be identical.
#
leveneTest(lev$cost.quantity ~ Chain, data = lev)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 0.0496 0.8239
220
t.test(price_diff$diff_price_perc, mu = 0, conf.level = 0.99)
One Sample t-test
data: price_diff$diff_price_perc
t = 5.2924, df = 110, p-value = 6.224e-07
alternative hypothesis: true mean is not equal to 0
99 percent confidence interval:
0.02734727 0.08102037
sample estimates:
mean of x
0.05418382
Interpretation
The t-test resulted in a p-value of < 0.001 and 99% confidence interval of between 2.73% and 8.1% price percentage difference. This result means there is statistically significant evidence to reject the null hypothesis (that the population mean lies at 0% difference) in favour of the alternate. In this case, it appears likely that Woolworths is cheaper than Coles overall.
Implicit in this statement is that this sample may be within the sliver of possible valid samples which would give this result even though the two populations are equal. Very unlikely but possible.
Discussion
Based on this sample, it can be inferred with strong evidence that Woolworths is cheaper than Coles over the entire set of products.
During data collection, it was noticed that one chain seemed consistently cheaper than the other depending on category. If indeed so, it may imply that each category has its own distinct probability distribution. Further investigation may be useful in uncovering better pricing by chain per category. Problems originating from this point were likely avoided by sampling products proportionally by category such that the results were a representative subset of the population by category.
It was also noticed that items which did not have an identical match (but were broadly matched) seemed to have greater price variance than those that did. This may indicate two separate probability distributions for exactly matched versus non-matched products. This may render our findings unrepresentative of reality.
This investigation fails to consider that each chain sets prices strategically, at least partly in response to the pricing signals of the other- their primary competitor. It may be more important for a chain to be cheaper on one product than another. Consequently, another possible improvement to this investigation could be to weight the products by their relative importance in the average shopper’s basket. Perhaps being a lot cheaper on an exotic item is not preferable to being a little cheaper on an everyday item.
If this investigation were to be repeated, it would certainly merit some investment of time in building a more capable web scraping program to obtain Woolworths’ price data as well as some functionality to synthesize/fuzzy match products. Collecting and processing a broadly representative sample dataset was by far the most time consuming portion of this venture.
---
title: "MATH1324 Assignment 3"
subtitle: "Supermarket Price Wars"
output: html_notebook
---

## Individual Details

* Student Name: Balkan Misirli (s3650250)

## Executive Statement

The aim of the investigation was to determine whether Coles or Woolworths has cheaper prices in general. As the population of products/prices for each chain is difficult to obtain and each product is not always matched in the other store, it is necessary to make inferences on the population based on an observable subset of matched products.

The entire inventory of Coles' online shop was scraped and tabulated per category, with roughly 22,200 total products. To obtain a sample subset of 0.5% of these products, 111 items were chosen at random from each table proportional to the ratio of that entire category’s total products to the overall total number of products. 

Random numbers were chosen via Google Sheets’ random function, which then corresponded to rows in the Coles scraped dataset. The chosen products were then matched manually in Woolworths’ online shop. If a match existed, these were taken into the sample. If no match was found, another random item was chosen until a match was found. Items which were 'on special' in either chain were not taken into the sample.

The results of the testing indicated that the null hypothesis, that both chains had the same pricing over their entire inventory, was very likely to be incorrect. The one sample t test returned a p-value of < 0.001.


## Load Packages and Data

```{r}
library(car)
library(dplyr)
prices <- read.csv("/Users/balkan_misirli/Downloads/ColesWoolPricesV5.csv")
# data was manually collated in Google Sheets

lev <- read.csv("/Users/balkan_misirli/Downloads/ColesWoolPricesLEV3.csv")
# the long format of the same data, for use in other functions further down

price_diff <- prices %>% mutate(diff_price = Coles.Cost.Quantity - Woolworths.Cost.Quantity, diff_price_perc = diff_price/Coles.Cost.Quantity)
# Price difference and price difference percentage columns were added to transform the values into standardized measures of price difference.
```

## Summary Statistics

The price per unit (kilogram/litre/each) was used in this investigation. In this sample, there were 80 items where Woolworths was cheaper, 15 where both were the same price and the remaining 16 items where Coles was cheaper.

When viewed with side-by-side boxplots the summary statistics look quite similar, with Woolworths having the slightly lower appearance. However, without removing the outliers this plot is largely unreadable.

The summary statistics of the sample show that Coles has a higher mean and standard deviation than Woolworths. the Min/Max statistics emphasise the extremely wide spectrum of values in this sample.

The Q-Q plots of the data did not indicate normality, but given the large sample size this should not be a problem. Also, this type of plot is likely meaningless given the values of each product are not really linked to one another.

The plot of price difference in percentage terms visually exposes the relationship at play within the sample. The red line is price parity, and most points lie above it- showing Coles to be more expensive than Woolworths in this sample.

```{r}

boxplot(prices$Coles.Cost.Quantity, prices$Woolworths.Cost.Quantity, names = c("Coles", "Woolworths"), ylab = "Dollars")

boxplot(prices$Coles.Cost.Quantity, prices$Woolworths.Cost.Quantity, names = c("Coles", "Woolworths"), ylab = "Dollars", outline = FALSE)

lev %>% group_by(Chain) %>% summarise(Min = min(cost.quantity,na.rm = TRUE),
                        Q1 = quantile(cost.quantity,probs = .25,na.rm = TRUE),
                        Median = median(cost.quantity, na.rm = TRUE),
                        Q3 = quantile(cost.quantity,probs = .75,na.rm = TRUE),
                        Max = max(cost.quantity,na.rm = TRUE),
                        Mean = mean(cost.quantity, na.rm = TRUE),
                        SD = sd(cost.quantity, na.rm = TRUE),
                        n = n(),
                        Missing = sum(is.na(cost.quantity)))

prices$Coles.Cost.Quantity %>% qqPlot(dist = "norm")
prices$Woolworths.Cost.Quantity %>% qqPlot(dist = "norm")

plot(price_diff$diff_price_perc*100, ylab = "% Price Diff (+ve: Coles dearer, -ve: WW dearer", xlab = "Product Number")
abline(h = 0, col = "red")
```


## Hypothesis Test

The hypothesis test chosen was the One Sample T Test on the percentage difference in prices per unit of weight (with Coles price being the denominator). This unit was chosen because although the magnitude of prices varied greatly the relative price difference was rather smaller. However, this means that a 5c difference in a 20c product would be weighted equally with a 5 dollar difference in a 20 dollar product.

The null hypothesis was that the mean price percentage difference between these populations is zero. A significance level of alpha = 0.01 was used, such that the null will be rejected if the p value indicates probability of obtaining this sample mean, given the null being true, is less than 0.01.

To test for homogeneity of variance, the Levene's test was implemented on the two samples. Resulting in p = 0.82, the test failed to reject the null hypothesis. Thus, the population variance of Coles and Woolworths are likely to be identical.

```{r}
#

leveneTest(lev$cost.quantity ~ Chain, data = lev)
t.test(price_diff$diff_price_perc, mu = 0, conf.level = 0.99)


```

## Interpretation

The t-test resulted in a p-value of < 0.001 and 99% confidence interval of between 2.73% and 8.1% price percentage difference. This result means there is statistically significant evidence to reject the null hypothesis (that the population mean lies at 0% difference) in favour of the alternate. In this case, it appears likely that Woolworths is cheaper than Coles overall.

Implicit in this statement is that this sample may be within the sliver of possible valid samples which would give this result even though the two populations are equal. Very unlikely but possible.

## Discussion 

Based on this sample, it can be inferred with strong evidence that Woolworths is cheaper than Coles over the entire set of products.

During data collection, it was noticed that one chain seemed consistently cheaper than the other depending on category. If indeed so, it may imply that each category has its own distinct probability distribution. Further investigation may be useful in uncovering better pricing by chain per category. Problems originating from this point were likely avoided by sampling products proportionally by category such that the results were a representative subset of the population by category.

It was also noticed that items which did not have an identical match (but were broadly matched) seemed to have greater price variance than those that did. This may indicate two separate probability distributions for exactly matched versus non-matched products. This may render our findings unrepresentative of reality.

This investigation fails to consider that each chain sets prices strategically, at least partly in response to the pricing signals of the other- their primary competitor. It may be more important for a chain to be cheaper on one product than another. Consequently, another possible improvement to this investigation could be to weight the products by their relative importance in the average shopper's basket. Perhaps being a lot cheaper on an exotic item is not preferable to being a little cheaper on an everyday item.

If this investigation were to be repeated, it would certainly merit some investment of time in building a more capable web scraping program to obtain Woolworths’ price data as well as some functionality to synthesize/fuzzy match products. Collecting and processing a broadly representative sample dataset was by far the most time consuming portion of this venture.


