The purpose of this investigation is to consider whether there is a statistically significant difference between prices of products sold by Coles and Woolworths. The investigation will use statistical evidence to draw an inference about which supermarket is cheaper.
Coles and Woolworths sell a large number of products, and not all could be included in the investigation. A random sample of products was chosen using a probability-based sampling method. This method was used because it meant:
The sampling process was:
Hypothesis testing was used to test the statistical significance of the difference in prices. A two-sample t-test was used to compare the difference between the mean price of 700g bread varieties.
The investigation found no statistically significant difference between the mean bread prices at Coles and Woolworths (p = 0.645). Woolworths' sample mean was slightly lower than Coles', but the difference is too small relative to the variation in prices to conclude that either supermarket is cheaper.
However, as this study was limited to bread products, it may be helpful for future investigations to analyse a broader cross section of products from different categories.
library(data.table)
library(rmarkdown)
library(dplyr)
## -------------------------------------------------------------------------
## data.table + dplyr code now lives in dtplyr.
## Please library(dtplyr)!
## -------------------------------------------------------------------------
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, first, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(knitr)
library(readr)
Categories <- read_csv("~/Desktop/Categories.csv")
## Parsed with column specification:
## cols(
## Category = col_character()
## )
View(Categories)
Categories <- data.table(Categories)
Categories[sample(.N,1)]
## Category
## 1: Meat, seafood and deli
Bread <- read_csv("~/Desktop/Bread.csv",
col_types = cols(Price = col_number(),
Size = col_number(), `Store number` = col_number()))
View(Bread)
Bread$`Store number`<- Bread$`Store number` %>% factor(levels=c(1,2), labels=c("Coles","Woolworths"), ordered = TRUE)
Bread$`Store number` %>% levels
## [1] "Coles" "Woolworths"
Size_filter <- Bread %>% filter(Size == 700)
Size_filter %>% group_by(`Store number`) %>% summarise(Min = min(Price,na.rm = TRUE),
Q1 = quantile(Price,probs = .25,na.rm = TRUE),
Median = median(Price, na.rm = TRUE),
Q3 = quantile(Price,probs = .75,na.rm = TRUE),
Max = max(Price,na.rm = TRUE),
Mean = mean(Price, na.rm = TRUE),
SD = sd(Price, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Price)))
## # A tibble: 2 × 10
##   `Store number`   Min    Q1 Median    Q3   Max     Mean        SD     n
##            <ord> <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>     <dbl> <int>
## 1          Coles     2     3      3 4.675   4.9 3.638889 0.9299835    18
## 2     Woolworths     2     3      3 4.425   4.9 3.472222 0.9404664    18
## # ... with 1 more variables: Missing <int>
# Bare column names are looked up in Size_filter, so only the 700g rows are plotted
boxplot(Price ~ `Store number`, data = Size_filter)
Type of hypothesis test:
Two-sample t-test
Justification:
The two-sample t-test is used to compare two population means, namely the mean price of 700g bread varieties sold by Coles and by Woolworths. This test is considered the most appropriate because the Coles and Woolworths samples are independent of each other and the prices in each sample are assumed to be approximately normally distributed.
The paired-samples t-test was not considered appropriate because the same items are not being measured twice.
Hypothesis:
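A sketch of the hypotheses, inferred from the two-sided alternative reported in the t-test output ("true difference in means is not equal to 0"). The symbols $\mu_C$ and $\mu_W$ are assumed notation for the population mean prices at Coles and Woolworths; they do not appear in the original report.

```latex
\begin{align*}
H_0 &: \mu_C - \mu_W = 0 \\
H_A &: \mu_C - \mu_W \neq 0
\end{align*}
```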
Assumptions:
Decision Rules:
Reject the null hypothesis if:
The 95% confidence interval (CI) of the difference between means does not capture 0, the difference specified by the null hypothesis
Otherwise, fail to reject the null hypothesis.
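The decision rule above can be sketched in a few lines of R. This is a minimal illustration on synthetic prices, not the study's data; the object names `coles`, `woolworths`, `result` and `reject_null` are hypothetical.

```r
# Synthetic prices for illustration only (not the Bread data set)
set.seed(1)
coles      <- rnorm(18, mean = 3.6, sd = 0.9)
woolworths <- rnorm(18, mean = 3.5, sd = 0.9)

# t.test() runs a Welch two-sample t-test by default
result <- t.test(coles, woolworths)
ci <- result$conf.int  # 95% CI for the difference between means

# Reject the null hypothesis only if the CI excludes 0
reject_null <- ci[1] > 0 || ci[2] < 0
reject_null
```

For a two-sided test at the 5% level, this check should agree with comparing the p-value to 0.05.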
library(car)
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
Store_filter_Woolworths <- Size_filter %>% filter(Store == "Woolworths")
Store_filter_Woolworths$Price %>% qqPlot(dist="norm")
Store_filter_Coles <- Size_filter %>% filter(Store == "Coles")
Store_filter_Coles$Price %>% qqPlot(dist="norm")
# Normality needs to be checked because both samples have n < 30. It was checked through a visual inspection of the Q-Q plots.
leveneTest(Price ~ Store, data = Size_filter)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value Pr(>F)
## group  1  0.3421 0.5625
##       34
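To show what the median-centred Levene test computes, the sketch below reproduces it in base R as a one-way ANOVA on absolute deviations from each group's median. The data are synthetic and the names `demo`, `AbsDev`, `levene_table`, `levene_p` and `pooled` are hypothetical. Using the result to choose `var.equal` is one possible approach, not necessarily the one taken in this report.

```r
# Synthetic prices for illustration only (not the Bread data set)
set.seed(7)
demo <- data.frame(
  Price = c(rnorm(18, 3.6, 0.9), rnorm(18, 3.5, 0.9)),
  Store = factor(rep(c("Coles", "Woolworths"), each = 18))
)

# Absolute deviation of each price from its own store's median
demo$AbsDev <- abs(demo$Price - ave(demo$Price, demo$Store, FUN = median))

# One-way ANOVA on those deviations is the median-centred Levene test
levene_table <- anova(lm(AbsDev ~ Store, data = demo))
levene_p <- levene_table[["Pr(>F)"]][1]

# If equal variances are not rejected, a pooled-variance t-test is an option;
# otherwise var.equal is FALSE and the Welch test is used
pooled <- t.test(Price ~ Store, data = demo, var.equal = levene_p > 0.05)
pooled$parameter  # degrees of freedom used by the chosen test
```

With 18 observations per store, the ANOVA table has 1 and 34 degrees of freedom, matching the layout of the Levene output above.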
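# Note: the formula refers to Bread$... columns directly, so the test below runs on every row of Bread; Size_filter is not used for these variables
t.test(Bread$Price ~ Bread$`Store number`, data = Size_filter)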
##
## Welch Two Sample t-test
##
## data: Bread$Price by Bread$`Store number`
## t = 0.46144, df = 121.88, p-value = 0.6453
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.4427131 0.7118310
## sample estimates:
## mean in group Coles mean in group Woolworths
## 3.853607 3.719048
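The degrees of freedom reported above (121.88) are larger than two samples of 18 can produce, which suggests the test ran on all rows of `Bread` rather than the 700g subset. The sketch below illustrates why, using a small synthetic data frame; `bread_demo`, `size_demo`, `fit_subset` and `fit_all` are hypothetical names and the prices are made up.

```r
# Synthetic data for illustration only: 60 rows, 20 of them 700g
set.seed(42)
bread_demo <- data.frame(
  Price = round(runif(60, 2, 5), 2),
  Size  = rep(c(700, 650, 800), each = 20),
  Store = rep(c("Coles", "Woolworths"), times = 30)
)
size_demo <- subset(bread_demo, Size == 700)

# Bare column names are looked up in `data`, so only the 700g rows are used
fit_subset <- t.test(Price ~ Store, data = size_demo)

# Columns referenced with `$` bypass `data`, so every row of bread_demo is used
fit_all <- t.test(bread_demo$Price ~ bread_demo$Store, data = size_demo)

fit_subset$parameter  # Welch df, at most 18 for 10 + 10 rows
fit_all$parameter     # Welch df, at least 29 for 30 + 30 rows
```

Applied to this report, the equivalent call would be `t.test(Price ~ `Store number`, data = Size_filter)`; its output is not shown here because it was not part of the original run.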
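The investigation found no statistically significant difference between the mean price of bread sold by Coles and Woolworths, t(df=121.88)=0.46, p=0.645, difference between means = $0.13, 95% CI [-0.44, 0.71]. Because the 95% CI captures 0, the null hypothesis is not rejected. Woolworths' sample mean was $0.13 lower than Coles', but this difference is not large enough to infer that Woolworths is cheaper on average. The degrees of freedom (121.88) are also larger than two samples of 18 can produce, which indicates that the t.test call used every row of Bread rather than only the 700g subset in Size_filter; the 700g comparison would need to be rerun on Size_filter to confirm the result for that size.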
This study was limited to bread products, so it may be helpful for future investigations to analyse a broader cross section of products from different categories. This particular investigation was focused on trying to ensure no bias in the random selection of products. There are other approaches to probability based sampling that would still limit bias in the selection process, whilst including a broader cross section of products in the sample. This lesson will be considered in future investigations.