MATH1324 Assignment 3

Group/Individual Details

Tim Davidson (s3664882)

Executive Statement

The purpose of this investigation is to consider whether there is a statistically significant difference between prices of products sold by Coles and Woolworths. The investigation will use statistical evidence to draw an inference about which supermarket is cheaper.

Coles and Woolworths sell a large number of products, and not all could be included in the investigation, so a random sample of products was chosen. A probability-based sampling method was used to select the sample because it meant:

the sample could be identified quickly
there was no need to collect data on all products sold by Coles and Woolworths
there was less chance of bias (which may occur with non-probability-based methods).

The sampling process was:

identify which “categories” appear in both the Woolworths and Coles online stores
randomly select a “category” to analyse
this random selection process (conducted within R) identified “bakery” for further analysis
collect data on “like for like” bakery products (products sold by both supermarkets)
specifically, data was collected on bread products sold by both supermarkets. There was more “like for like” data available on bread than on other bakery products.
an analysis was conducted on 700g bread varieties, because this product size (700g) had the most data available.
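
The random category selection described above can be sketched in base R as follows. This is a minimal illustration only: the category names and the seed value are hypothetical, and the real list was read from Categories.csv. Because sample() draws afresh on each run, fixing a seed is one way to make the selected category repeatable when the report is re-knitted.

```r
# Hypothetical category names; the real list comes from Categories.csv.
categories <- c("Bakery", "Dairy", "Pantry", "Drinks", "Frozen")

# Fixing the seed (an arbitrary value) makes the draw repeatable.
set.seed(1324)
chosen <- sample(categories, size = 1)

# Re-running with the same seed selects the same category again.
set.seed(1324)
chosen_again <- sample(categories, size = 1)
stopifnot(identical(chosen, chosen_again))
```

Without the set.seed() call, each run could select a different category.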

Hypothesis testing was used to test the statistical significance of the difference in prices. A two-sample t-test was used to compare the difference between the mean price of 700g bread varieties.

The investigation found no statistically significant difference between the mean price of bread sold by Coles and Woolworths (p = 0.65). Although the Woolworths sample mean was slightly lower than the Coles sample mean, the evidence is not sufficient to conclude that either supermarket is cheaper on average.

However, as this study was limited to bread products, it may be helpful for future investigations to analyse a broader cross section of products from different categories.

Load Packages and Data

library(data.table)
library(rmarkdown)
library(dplyr)

## -------------------------------------------------------------------------

## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!

## -------------------------------------------------------------------------

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(knitr)
library(readr)
Categories <- read_csv("~/Desktop/Categories.csv")

## Parsed with column specification:
## cols(
##   Category = col_character()
## )

View(Categories)
Categories <- data.table(Categories)
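# No seed is set, so this draw changes on every run; the output below need not
# match the category used in the analysis (bakery).
Categories[sample(.N,1)]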

##                  Category
## 1: Meat, seafood and deli

Bread <- read_csv("~/Desktop/Bread.csv", 
    col_types = cols(Price = col_number(), 
        Size = col_number(), `Store number` = col_number()))
View(Bread)
Bread$`Store number`<- Bread$`Store number` %>% factor(levels=c(1,2), labels=c("Coles","Woolworths"), ordered = TRUE)
Bread$`Store number` %>% levels

## [1] "Coles"      "Woolworths"

Summary Statistics

The estimated difference between the means is $0.17 (Coles mean - Woolworths mean).
Both supermarkets have the same minimum, maximum and median price for 700g bread varieties sampled.
It is difficult to say definitively from the boxplot, which does not display means, but the summary table shows that Woolworths has a slightly lower mean price for 700g bread than Coles. A statistical test is needed to determine whether this difference is statistically significant.

Size_filter <- Bread %>% filter(Size == 700)
Size_filter %>%  group_by(`Store number`) %>% summarise(Min = min(Price,na.rm = TRUE),
                                         Q1 = quantile(Price,probs = .25,na.rm = TRUE),
                                         Median = median(Price, na.rm = TRUE),
                                         Q3 = quantile(Price,probs = .75,na.rm = TRUE),
                                         Max = max(Price,na.rm = TRUE),
                                         Mean = mean(Price, na.rm = TRUE),
                                         SD = sd(Price, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Price)))

## # A tibble: 2 × 10
##   `Store number`   Min    Q1 Median    Q3   Max     Mean        SD     n
##            <ord> <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>     <dbl> <int>
## 1          Coles     2     3      3 4.675   4.9 3.638889 0.9299835    18
## 2     Woolworths     2     3      3 4.425   4.9 3.472222 0.9404664    18
## # ... with 1 more variables: Missing <int>

# Use column names in the formula so that data = Size_filter (700g only) is actually used
boxplot(Price ~ `Store number`, data = Size_filter)

Type of hypothesis test:

Two-sample t-test

Justification:

The two-sample t-test is used to compare two population means, namely the mean prices of 700g bread varieties sold by Coles and by Woolworths. This test is considered the most appropriate because the Coles and Woolworths prices are treated as independent samples and the data for both supermarkets are assumed to be approximately normally distributed.

The paired-samples t-test was not considered appropriate because the same sample is not being measured twice.

Hypothesis:

Null hypothesis: mean of Coles prices - mean of Woolworths prices = 0
Alternate hypothesis: mean of Coles prices - mean of Woolworths prices ≠ 0

Assumptions:

Comparing two independent means with unknown population variance.
Data is normally distributed
Equal variance is not assumed

Decision Rules:

Reject the null hypothesis if:
p-value < 0.05 (α significance level)
95% Confidence Interval (CI) of the difference between means does not capture the null value of 0
Otherwise, fail to reject the null hypothesis.
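
As a rough sketch of how these decision rules can be applied, the code below computes a Welch two-sample t-test by hand from the 700g summary statistics reported in the summary table above. The variable names are illustrative only, and the calculation assumes the rounded means, standard deviations and sample sizes shown in that table.

```r
# Summary statistics for 700g bread, copied from the summary table.
mean_coles <- 3.638889; sd_coles <- 0.9299835; n_coles <- 18
mean_wool  <- 3.472222; sd_wool  <- 0.9404664; n_wool  <- 18
alpha <- 0.05

# Welch test: standard error built from the separate sample variances.
diff_means <- mean_coles - mean_wool
v_coles <- sd_coles^2 / n_coles
v_wool  <- sd_wool^2 / n_wool
se      <- sqrt(v_coles + v_wool)
t_stat  <- diff_means / se

# Welch-Satterthwaite approximation to the degrees of freedom.
df <- (v_coles + v_wool)^2 /
  (v_coles^2 / (n_coles - 1) + v_wool^2 / (n_wool - 1))

# Two-sided p-value and 95% CI for the difference between means.
p_value <- 2 * pt(-abs(t_stat), df)
ci <- diff_means + c(-1, 1) * qt(1 - alpha / 2, df) * se

# Decision rules: both lead to the same decision for a two-sided test.
reject_by_p  <- p_value < alpha
reject_by_ci <- ci[1] > 0 || ci[2] < 0
```

For a two-sided test the two rules always agree, so reject_by_p and reject_by_ci should match.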

library(car)

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

Store_filter_Woolworths <- Size_filter %>% filter(Store == "Woolworths")
Store_filter_Woolworths$Price %>% qqPlot(dist="norm")

Store_filter_Coles <- Size_filter %>% filter(Store == "Coles")
Store_filter_Coles$Price %>% qqPlot(dist="norm")

# Normality was checked visually with the Q-Q plots above, because both samples have n < 30.
# Levene's test checks homogeneity of variance between the two supermarkets.
leveneTest(Price ~ Store, data = Size_filter)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.3421 0.5625
##       34

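# Note: the formula references Bread directly, so the data argument has no effect
# and the test below runs on every size in Bread, not only the 700g subset.
# To restrict the test to 700g bread, use: t.test(Price ~ `Store number`, data = Size_filter)
t.test(Bread$Price ~ Bread$`Store number`, data = Size_filter)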

## 
##  Welch Two Sample t-test
## 
## data:  Bread$Price by Bread$`Store number`
## t = 0.46144, df = 121.88, p-value = 0.6453
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.4427131  0.7118310
## sample estimates:
##      mean in group Coles mean in group Woolworths 
##                 3.853607                 3.719048

Interpretation

We assumed the populations were normally distributed. However, because the sample size for both groups was <30, normality was checked visually with the qqPlot function. The Coles data appear acceptable, but caution is needed for Woolworths: most points fall within the 95% CI for the normal quantiles, and the few that fall outside the red lines are considered minor.
Homogeneity of variance was checked using Levene’s test. The p-value (0.56) was > 0.05, so there is no evidence against homogeneity of variance.
Despite the p-value from Levene’s test, equal variance was not assumed, and the Welch two-sample t-test was used.
Estimated difference between means: 3.85−3.72 = $0.13 (Coles - Woolworths)
These means and the degrees of freedom (121.88) come from the t.test call as run, which used all sizes in the Bread data; the 700g-only means in the summary table are 3.64 and 3.47.
95% CI of difference between means [-0.44, 0.71], which includes 0
p-value = 0.65 (> 0.05)
Decision: fail to reject the null hypothesis
Failing to reject the null hypothesis means that the observed difference between means is not statistically significant.

Discussion

The investigation found no statistically significant difference between the mean price of bread varieties sold by Coles and Woolworths, t(df=121.88)=0.46, p=0.65, difference between means = $0.13, 95% CI [-0.44, 0.71]. These statistics reflect the t.test call as run, which used all sizes in the Bread data; the 700g-only subset shows a similarly small difference of $0.17. Although the Woolworths sample mean was slightly lower, the test does not support an inference that either supermarket is cheaper on average.

This study was limited to bread products, so it may be helpful for future investigations to analyse a broader cross section of products from different categories. This particular investigation was focused on trying to ensure no bias in the random selection of products. There are other approaches to probability based sampling that would still limit bias in the selection process, whilst including a broader cross section of products in the sample. This lesson will be considered in future investigations.
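
One probability-based approach that would cover a broader cross section is stratified random sampling, where a fixed number of products is drawn at random within each category. The sketch below uses base R only; the catalogue, category names, seed and per-category sample size are all hypothetical and serve only to illustrate the idea, not a method used in this investigation.

```r
# Hypothetical product catalogue: 5 categories with 20 products each.
set.seed(1324)
catalogue <- data.frame(
  category   = rep(c("Bakery", "Dairy", "Pantry", "Drinks", "Frozen"), each = 20),
  product_id = 1:100
)

# Draw the same number of products at random within each category (stratum).
per_category <- 4
strata  <- split(catalogue, catalogue$category)
sampled <- do.call(
  rbind,
  lapply(strata, function(d) d[sample(nrow(d), per_category), ])
)

# Every category is represented equally in the sample.
table(sampled$category)
```

Each category contributes the same number of randomly chosen products, so no single category dominates the comparison.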