MATH1324 Assignment 2

Group/Individual Details

Yonn April (s3727210)
Andrew Chen (s3488195)

Executive Statement

The aim of this investigation is to determine which major supermarket, Coles or Woolworths, is cheaper. Sample data on products available from both stores were collected, and hypothesis testing was performed to determine whether there is a statistically significant difference in prices between the stores.

For the investigation, sample data was collected online. The website http://www.grocerycop.com.au compares prices of products between the two major supermarkets and was used to sample data for this investigation.

To test this, the null hypothesis is defined as \(H_0: \mu_1 - \mu_2 = 0\) and the alternative hypothesis as \(H_A: \mu_1 - \mu_2 \neq 0\), where \(\mu_1\) is the mean price of products from Coles and \(\mu_2\) is the mean price of products from Woolworths. The null hypothesis \(H_0\) states that the mean prices at the two stores do not differ. The alternative hypothesis \(H_A\) states that the mean prices at the two stores differ.

To explore whether there is enough statistical evidence to reject the null hypothesis, a two-sample, two-tailed t-test is used. Before applying this test, its assumptions must be checked. Q-Q plots were used to assess normality, and homogeneity of variances was tested using Levene’s test. The outcome of these checks determines which form of the t-test is best suited to the data. The Q-Q plots showed right-skewed prices, but the sample sizes are large enough (\(n > 30\) per store) for the Central Limit Theorem to apply, and Levene’s test supported homogeneity of variances.

Once the assumptions had been checked, an appropriate two-sample t-test was selected. The t-test result was not statistically significant, so the decision is to fail to reject \(H_0\). In the context of this sample, there is insufficient evidence that the mean prices at Coles and Woolworths differ, implying that neither supermarket is significantly cheaper than the other.

Load Packages and Data

# This is a chunk where you can load the necessary data and packages required to reproduce the report
# You should also include your code required to prepare your data for analysis. 

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(readr)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

groceries <- read.csv("groceryshopping.csv")

Summary Statistics

# This is a chunk for your summary statistics and visualisation code

# Statistical summary by store
stats <- groceries %>% group_by(store) %>% summarise(Min=min(price,na.rm=TRUE),
                        Q1=quantile(price,probs=0.25,na.rm=TRUE),
                        Median=median(price,na.rm=TRUE),
                        Q3=quantile(price,probs=0.75,na.rm=TRUE),
                        Mean=mean(price,na.rm=TRUE),
                        SD=sd(price,na.rm=TRUE),
                        n=n(),
                        Missing=sum(is.na(price)))
stats

The mean prices at the two stores appear similar, although the standard deviations differ somewhat.

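# Look up each store's mean by name rather than by row position
meanPriceColes <- stats$Mean[stats$store == "coles"]
meanPriceWool <- stats$Mean[stats$store == "woolworths"]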

meanPriceDiff <- meanPriceColes - meanPriceWool

boxplot(price ~ store, data = groceries, ylab = "Price", ylim = c(0,15), main = "Price by store")

Median values between Coles and Woolworths are similar in value. Woolworths appears to have a slightly lower median price than Coles.

Hypothesis Test

# This is a chunk for your hypothesis testing code.
#Test normal distribution
coles <- groceries %>% filter(store=="coles")
coles$price %>% qqPlot(dist = "norm")

## [1] 110  64

woolworths <- groceries %>% filter(store=="woolworths")
woolworths$price %>% qqPlot(dist = "norm")

## [1] 110 104

Both samples are right-skewed (towards higher prices), but the sample size for each store exceeds 30, so by the Central Limit Theorem the sampling distribution of the mean will be approximately normal.
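The Central Limit Theorem argument can be illustrated with a small simulation. This is only a sketch: the log-normal "prices" and their parameters are assumptions chosen to give a right-skewed shape and are not the groceries data; `n = 186` is the per-store sample size reported in the Interpretation section.

```r
# Simulated right-skewed "prices" (log-normal); NOT the groceries data.
set.seed(1)
n    <- 186    # per-store sample size reported in the Interpretation section
reps <- 5000   # number of simulated samples (arbitrary choice)

# Draw many samples of size n and record the mean of each one
sample_means <- replicate(reps, mean(rlnorm(n, meanlog = 1.5, sdlog = 0.8)))

# Skewness helper: third standardised moment
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

raw_skew  <- skewness(rlnorm(100000, meanlog = 1.5, sdlog = 0.8))
mean_skew <- skewness(sample_means)

raw_skew    # skewness of the individual simulated prices
mean_skew   # skewness of the simulated sample means
```

Under these assumptions the individual values are strongly skewed, while the sample means are far closer to symmetric, which is the property the t-test relies on.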

#Homogeneity of variances
levene <- leveneTest(price ~ store,  data = groceries)
levene

If Levene’s test gives \(p > 0.05\), the assumption of equal population variances is not rejected.
The \(p\)-value from Levene’s test is 0.64, so equal population variances can be assumed.
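The decision rule above can be sketched in base R. The data frame `sim` is a simulated stand-in (hypothetical values, not the groceries data), and the hand-rolled calculation assumes the median-centred form of Levene’s test, which is the default in `car::leveneTest`: a one-way ANOVA on absolute deviations from each group’s median.

```r
# Simulated stand-in for the groceries data (hypothetical values)
set.seed(42)
sim <- data.frame(
  store = factor(rep(c("coles", "woolworths"), each = 186)),
  price = c(rlnorm(186, meanlog = 1.6, sdlog = 0.7),
            rlnorm(186, meanlog = 1.6, sdlog = 0.7))
)

# Median-centred Levene's test: one-way ANOVA on the absolute deviations
# of each price from its own store's median
sim$abs_dev <- abs(sim$price - ave(sim$price, sim$store, FUN = median))
levene_p <- anova(lm(abs_dev ~ store, data = sim))[["Pr(>F)"]][1]

# Decision rule from the text: assume equal variances when p > 0.05,
# otherwise fall back to Welch's t-test (var.equal = FALSE)
equal_var <- levene_p > 0.05
fit <- t.test(price ~ store, data = sim, var.equal = equal_var)
fit$method
```

If the Levene \(p\)-value had been below 0.05, the same call with `var.equal = FALSE` would run Welch’s t-test, which does not pool the variances.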

#Apply two-sample t-test

#H0: m1 = m2
#HA: m1 != m2

result <- t.test(
  price ~ store,
  data = groceries,
  var.equal = TRUE,
  alternative = "two.sided"
)
result

## 
##  Two Sample t-test
## 
## data:  price by store
## t = 0.80575, df = 370, p-value = 0.4209
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6449569  1.5404407
## sample estimates:
##      mean in group coles mean in group woolworths 
##                 6.866613                 6.418871

With the assumptions of normality and homogeneity of variances checked, a two-sample, two-tailed t-test assuming equal variances was performed, giving the output above.
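For reference, the standard pooled (equal-variance) two-sample t statistic is written out below. The symbols \(\bar{x}_i\), \(s_i\) and \(n_i\) (sample mean, standard deviation and size for store \(i\)) are notation introduced here for illustration and do not appear elsewhere in the report.

\[
t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},
\qquad
df = n_1 + n_2 - 2
\]

With \(n_1 = n_2 = 186\), this gives \(df = 186 + 186 - 2 = 370\), which agrees with the degrees of freedom in the t-test output.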

Interpretation

A two-sample t-test was used to test for a significant difference between the mean retail prices at Woolworths and Coles. The Q-Q plots for both Woolworths and Coles show that the price distributions are skewed to the right. However, by the Central Limit Theorem, the two-sample t-test can still be applied because of the large sample size (\(n = 186\) per store).

Levene’s test of homogeneity of variance gave a \(p\)-value of 0.64, which is greater than \(\alpha = 0.05\). Therefore, homogeneity of variances is assumed for the two-sample t-test. The results of the t-test show no statistically significant difference, \(t(df = 370) = 0.81\), \(p = 0.42 > \alpha = 0.05\), with a 95% CI for the difference in means of \([-0.64, 1.54]\). The observed difference between the sample means (Coles minus Woolworths) is 0.45, and the confidence interval contains 0. Therefore, \(H_0\) cannot be rejected.
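As a cross-check, the reported confidence interval and \(p\)-value can be recovered from the printed t-test output alone. This sketch uses only the numbers shown in that output; the variable names are illustrative.

```r
# Values copied from the t.test() output reported above
mean_coles <- 6.866613
mean_wool  <- 6.418871
t_stat     <- 0.80575
df         <- 370

mean_diff <- mean_coles - mean_wool     # observed difference in sample means
se_diff   <- mean_diff / t_stat         # standard error implied by t = diff / SE
t_crit    <- qt(0.975, df)              # two-sided 95% critical value
ci        <- mean_diff + c(-1, 1) * t_crit * se_diff
p_value   <- 2 * pt(-abs(t_stat), df)   # two-sided p-value

round(mean_diff, 2)   # 0.45
round(ci, 2)          # -0.64  1.54
round(p_value, 2)     # 0.42
```

Because the interval spans 0, the data are compatible with no difference in mean price, which is why \(H_0\) is not rejected.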

Discussion

Sample data were collected randomly, by hand, from each of the categories (1. Baby, Health & Beauty, 2. Bakery, 3. Clothing, Household & Pet, 4. Entertainment and International, 5. Freezer, 6. Fridge, 7. Fruit/Veg, 8. Pantry). Categories with no common items available in both supermarkets were excluded (e.g. meat and seafood). Where an item was on sale, its sale price was used instead of its original price, and only items available in both Woolworths and Coles were selected.

According to the results, there is not enough statistical evidence to suggest that either supermarket is cheaper than the other. It can be concluded that products are not significantly cheaper at one store compared to the other.

However, there are several limitations in this investigation. Firstly, the effect of special sale prices in both supermarkets was not controlled for. This may have affected the average prices in both supermarkets and may have made them appear more similar. Secondly, it is difficult to select sample data randomly by hand, as human bias can enter the selection method used in this investigation. Future experiments could therefore select items using computer-generated random numbers. The data are also not evenly distributed among the grocery categories in the sample dataset. Finally, prices may vary by location, so the sample data in this investigation may not accurately represent prices at a different location.
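One way the computer-generated selection suggested above might look is sketched here. The `catalogue` data frame, its product names, the four categories and the allocation of 10 items per category are all hypothetical; a real study would start from a list of products stocked by both stores.

```r
# Hypothetical catalogue of products stocked by both stores (placeholder names)
set.seed(2024)  # fixed seed so the draw is reproducible
categories <- c("Bakery", "Freezer", "Fridge", "Pantry")
catalogue <- data.frame(
  product  = paste0("item_", 1:200),
  category = rep(categories, each = 50)
)

per_category <- 10  # equal allocation per category (an assumed design choice)

# Within each category, draw row indices at random without replacement
rows_by_category <- split(seq_len(nrow(catalogue)), catalogue$category)
picked  <- unlist(lapply(rows_by_category, function(rows) sample(rows, per_category)))
sampled <- catalogue[picked, ]

table(sampled$category)
```

Drawing the same number of items from every category also addresses the uneven spread of data across categories noted above.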

Even though there is no significant difference between the average prices of the two supermarkets, there may be differences within individual categories. Further research could test whether the average prices at the two supermarkets differ for each category.
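A possible starting point for such a per-category analysis is sketched below. The data are simulated (hypothetical prices for three categories), and the Holm adjustment is a suggestion that was not part of this investigation; it is included because running several tests at once increases the chance of a false positive.

```r
# Simulated stand-in data: hypothetical prices for three categories at both stores
set.seed(7)
categories <- c("Bakery", "Freezer", "Pantry")
sim <- expand.grid(item = 1:30,
                   store = c("coles", "woolworths"),
                   category = categories)
sim$price <- rlnorm(nrow(sim), meanlog = 1.5, sdlog = 0.6)

# One two-sample t-test per category, keeping only the p-value
p_raw <- sapply(split(sim, sim$category), function(d) {
  t.test(price ~ store, data = d, var.equal = TRUE)$p.value
})

# Holm adjustment for running several tests at once
p_adj <- p.adjust(p_raw, method = "holm")
data.frame(category = names(p_raw), p_raw = p_raw, p_holm = p_adj)
```

Each category would still need its own assumption checks, particularly where the number of items per category is small.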