Group/Individual Details

Executive Statement

This investigation analyses whether there is a statistically significant difference in average price between Coles and Woolworths. The report examines the average price difference across all sampled products of the two supermarkets, as well as the difference within each product category and, where a difference exists, which supermarket is cheaper.

The sample consists of 200 product observations, each recording the product name, the product category and the price at both Coles and Woolworths. The data were collected from the grocerycop website, which lists the prices of the same product at Coles and Woolworths (http://www.grocerycop.com.au/products/category/fridge?page=7).

The 200 products are grouped into 5 categories of 40 observations each. The grocerycop website sorts all Coles and Woolworths products into 10 categories: fridge, bakery, fruit & vegetables, pantry, freezer, baby & health & beauty, meat & seafood, Entertainment & International Food, clothing & household & pet, and drinks & tobacco. This report combines these 10 website categories into 5 categories: fridge (fridge & freezer), food (fruit & vegetables, meat & seafood, bakery), household (Entertainment & International Food, clothing & household & pet), baby health (baby & health & beauty, bakery) and pantry & drinks (pantry & drinks & tobacco). Each of the 5 categories contains 2 of the website categories, and 20 products were selected from each of the 10 website categories to form the total sample of 200 observations.

To make the selection random, the researcher picked 1 to 3 products at random from each page of the website, considering only products that list a price for both Coles and Woolworths. Spreading the selection across pages keeps the sample as varied as possible and makes it more representative.
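The selection described above was done by hand. As an illustration only, the same idea (20 products drawn at random from each of the 10 website categories) can be sketched in base R. The `frame` table, its column names and the seed are hypothetical and stand in for a full product listing, which the report does not include.

```r
# Hypothetical sampling frame: a made-up table standing in for the website listing.
set.seed(2018)  # fixed seed so the draw can be repeated
frame <- data.frame(
  product = paste0("product_", 1:500),
  web_cat = rep(paste0("cat_", 1:10), each = 50),
  stringsAsFactors = FALSE
)

# Draw 20 products at random, without replacement, from each website category.
draw_one <- function(rows) rows[sample(nrow(rows), 20), ]
sampled <- do.call(rbind, lapply(split(frame, frame$web_cat), draw_one))

# Count the sampled products per website category.
table(sampled$web_cat)
```

Fixing the seed means the same 200 products are drawn each time the script is run, which makes the sample reproducible.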

The hypothesis tests found no statistically significant difference in average price between Coles and Woolworths, either across all sampled products or within any of the 5 categories. The analysis therefore provides no evidence that the average prices at Coles and Woolworths differ, in any category or overall.

Load Packages and Data

# Load the packages and data required to reproduce the report.
# Install any missing package once from the console, e.g. install.packages("dplyr").
library(readr)
library(magrittr)
library(ggplot2)
library(dplyr)
library(car)

# Read the 200 sampled products (columns used below: all_products, categories, coles, woolworths).
allprices <- read.csv("~/Desktop/allprices.csv")

Summary Statistics

allprices %>% group_by(categories) %>% summarise(Min = min(coles,na.rm = TRUE),
                            Max=max(coles,na.rm=TRUE),
                            Q1 = quantile(coles,probs = .25,na.rm = TRUE),
                           Median = median(coles, na.rm = TRUE),
                           Q3 = quantile(coles,probs = .75,na.rm = TRUE),
                           Mean = mean(coles, na.rm = TRUE),
                           SD = sd(coles, na.rm = TRUE),
                           n = n(),
                           Missing = sum(is.na(coles)))
allprices %>% group_by(categories) %>% summarise(Min = min(woolworths,na.rm = TRUE),
                            Max=max(woolworths,na.rm=TRUE),
                            Q1 = quantile(woolworths,probs = .25,na.rm = TRUE),
                           Median = median(woolworths, na.rm = TRUE),
                           Q3 = quantile(woolworths,probs = .75,na.rm = TRUE),
                           Mean = mean(woolworths, na.rm = TRUE),
                           SD = sd(woolworths, na.rm = TRUE),
                           n = n(),
                           Missing = sum(is.na(woolworths)))
# Create a variable for the price difference between Coles and Woolworths
allprices <- allprices %>% mutate(differ = coles - woolworths)
# Calculate summary statistics of the price difference (Coles minus Woolworths) for the same products, by category.
allprices %>% group_by(categories) %>% summarise(Min = min(differ,na.rm = TRUE),
                            Max=max(differ,na.rm=TRUE),
                            Q1 = quantile(differ,probs = .25,na.rm = TRUE),
                           Median = median(differ, na.rm = TRUE),
                           Q3 = quantile(differ,probs = .75,na.rm = TRUE),
                           Mean = mean(differ, na.rm = TRUE),
                           SD = sd(differ, na.rm = TRUE),
                           n = n(),
                           Missing = sum(is.na(differ)))
# Boxplots of the price difference by category (factor() keeps this a boxplot even if categories is read as text)
allprices %>% plot(differ ~ factor(categories), data = ., ylab="difference between coles and woolworths", xlab="category of products",
                  col="blue",main="difference of price by category",ylim=c(-4,4))
grid()

# Calculate summary statistics of the price difference (Coles minus Woolworths) across all products
allprices %>% summarise(Min = min(differ,na.rm = TRUE),
                            Max=max(differ,na.rm=TRUE),
                            Q1 = quantile(differ,probs = .25,na.rm = TRUE),
                           Median = median(differ, na.rm = TRUE),
                           Q3 = quantile(differ,probs = .75,na.rm = TRUE),
                           Mean = mean(differ, na.rm = TRUE),
                           SD = sd(differ, na.rm = TRUE),
                           n = n(),
                           Missing = sum(is.na(differ)))
# Boxplot of the price difference for the same products at Coles and Woolworths
allprices$differ%>%boxplot(.,ylab="difference", 
                  col="blue",main="difference between the price of coles and woolworths",ylim=c(-4,4))
grid()

# Plot of the price difference for each product
plot(differ ~ all_products, data = allprices, ylab="difference", xlab="product",
     col="orangered",main="difference of price by product")
grid()

From the boxplots and the scatter plot, the price difference between Coles and Woolworths (Coles minus Woolworths) sits slightly above zero, which suggests that products at Coles might be slightly more expensive than at Woolworths. Across the 200 sample observations the mean difference is 0.258 with a standard deviation of 2.199. By category, the mean differences for baby & health, food, fridge and pantry & drinks are all greater than zero (1.014, 0.133, 0.0803 and 0.250, with standard deviations of 2.723, 1.020, 0.775 and 3.027 respectively), which suggests that Coles prices in these categories tend to be higher than Woolworths prices. For the household category the mean difference is negative (-0.185, with a standard deviation of 2.374), which suggests that Woolworths prices in this category tend to be higher than Coles prices.

Hypothesis tests are then conducted to examine whether these differences are statistically significant.

Hypothesis Test

The analysis conducts a hypothesis test to check whether there is a difference in average price between Coles and Woolworths across all sampled products. It also conducts a hypothesis test within each of the 5 categories to check whether the average price differs between Coles and Woolworths and, if so, which supermarket is cheaper.

The tests are conducted under the assumptions that the variances of Coles and Woolworths prices are unequal and that the sample means are approximately normally distributed.
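To make the unequal-variance assumption concrete, the sketch below computes the Welch t statistic, the Welch-Satterthwaite degrees of freedom and the two-sided p-value by hand, then compares them with `t.test(..., var.equal = FALSE)`. The two samples are simulated and are not the report's data; the means and standard deviations are arbitrary choices for illustration.

```r
# Simulated prices only: two made-up samples with different spreads.
set.seed(1)
x <- rnorm(40, mean = 7.3, sd = 3)   # stands in for one store's prices
y <- rnorm(40, mean = 7.6, sd = 5)   # stands in for the other store's prices

# Welch t statistic: difference in means over the unpooled standard error.
vx <- var(x) / length(x)
vy <- var(y) / length(y)
t_stat <- (mean(x) - mean(y)) / sqrt(vx + vy)

# Welch-Satterthwaite approximation to the degrees of freedom.
df <- (vx + vy)^2 / (vx^2 / (length(x) - 1) + vy^2 / (length(y) - 1))

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(-abs(t_stat), df)

# The built-in test should agree with the hand calculation.
res <- t.test(x, y, var.equal = FALSE)
c(t = t_stat, df = df, p = p_val)
```

Because the standard error is not pooled, the degrees of freedom are generally not a whole number, which is why the outputs in this report show values such as 396.56.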

# Check the average price difference between Coles and Woolworths for all products
# We conduct two-sample t-tests to check whether the average price at Coles is higher than at Woolworths. Because the two-sample t-test assumes the data are drawn from a normally distributed population, we first take a brief look at whether the prices at Coles and Woolworths are approximately normal.
price_coles <- allprices$coles %>% qqPlot(dist="norm")

price_woolworths<- allprices$woolworths %>% qqPlot(dist="norm")

# The Q-Q plots show that the prices at both Coles and Woolworths do not follow a normal distribution closely. However, the sample size is 200, much larger than 30, so by the central limit theorem the sample mean is still approximately normally distributed, and we can conduct the two-sample t-test.
#
# The first step is to check whether the variances of Coles and Woolworths prices are equal.
leveneTest(allprices$coles,allprices$woolworths)
allprices$woolworths coerced to factor.
Levene's Test for Homogeneity of Variance (center = median)
       Df F value   Pr(>F)   
group 107  1.7596 0.002884 **
       92                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Levene's test: null hypothesis H0: σcoles = σwoolworths; alternative hypothesis HA: σcoles ≠ σwoolworths.
#The reported p-value is 0.00288, which is less than the 0.05 significance level, so the null hypothesis of equal variances is rejected. Note that in the call above the second argument is used as the grouping factor (hence the "coerced to factor" message and the 107 group degrees of freedom), so this output does not directly compare the variance of Coles prices with the variance of Woolworths prices and should be read with caution. Welch's test does not assume equal variances, so it remains a safe choice either way.
#Allowing for unequal variances, and given that the sample average price at Woolworths is lower than at Coles, we conduct the Welch two-sample t-test with the null hypothesis that the average price at Woolworths equals that at Coles and the alternative hypothesis that the average price at Woolworths is lower than at Coles.
t.test(allprices$woolworths,allprices$coles, var.equal = FALSE, alternative = "less")

    Welch Two Sample t-test

data:  allprices$woolworths and allprices$coles
t = -0.24756, df = 396.56, p-value = 0.4023
alternative hypothesis: true difference in means is less than 0
95 percent confidence interval:
   -Inf 1.4619
sample estimates:
mean of x mean of y 
   7.3166    7.5749 
# The p-value is 0.4023, which is greater than the 0.05 significance level, so we fail to reject the null hypothesis: there is insufficient evidence that the average price at Woolworths is lower than at Coles.
# We can also run the two-sided test to check whether the average prices at Coles and Woolworths differ in either direction.
t.test(allprices$woolworths,allprices$coles, var.equal = FALSE, alternative = "two.sided")

    Welch Two Sample t-test

data:  allprices$woolworths and allprices$coles
t = -0.24756, df = 396.56, p-value = 0.8046
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.309517  1.792917
sample estimates:
mean of x mean of y 
   7.3166    7.5749 
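For the variance check used earlier in this chunk, a comparison of the two stores needs the store as the grouping variable. The sketch below, on simulated data only, stacks the two price columns into a long table and runs a median-centred Levene test as a one-way ANOVA on absolute deviations from each store's median. The names `wide`, `long` and `abs_dev` are introduced here for illustration.

```r
# Simulated data only: 'wide' mimics a table with one price column per store.
set.seed(42)
wide <- data.frame(coles = rnorm(200, 7.5, 4), woolworths = rnorm(200, 7.3, 4))

# Stack the two price columns so that 'store' is the grouping factor.
long <- data.frame(
  price = c(wide$coles, wide$woolworths),
  store = factor(rep(c("coles", "woolworths"), each = nrow(wide)))
)

# Median-centred Levene test: one-way ANOVA on absolute deviations
# from each store's median price.
long$abs_dev <- abs(long$price - ave(long$price, long$store, FUN = median))
levene <- anova(lm(abs_dev ~ store, data = long))
levene

# With the car package loaded, the equivalent call would be:
# leveneTest(price ~ store, data = long)
```

With two stores and 200 prices each, this form of the test has 1 and 398 degrees of freedom, which is a quick way to confirm that the grouping is set up as intended.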
#then we can check if there is a difference for price of coles and woolworths in each of the five categories.
#check the price difference of coles and woolworths in fridge category
fridge<-allprices%>%filter(categories=="Fridge")
t.test(fridge$woolworths,fridge$coles, var.equal = FALSE, alternative = "two.sided")

    Welch Two Sample t-test

data:  fridge$woolworths and fridge$coles
t = -0.16323, df = 77.753, p-value = 0.8708
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.0590721  0.8985721
sample estimates:
mean of x mean of y 
  5.09825   5.17850 
#check the price difference of coles and woolworths in food category
food<-allprices%>%filter(categories=="food")
t.test(food$woolworths,food$coles, var.equal = FALSE, alternative = "two.sided")

    Welch Two Sample t-test

data:  food$woolworths and food$coles
t = -0.1953, df = 77.996, p-value = 0.8457
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.483186  1.218186
sample estimates:
mean of x mean of y 
  5.41775   5.55025 
#check the price difference of coles and woolworths in baby_health category
baby_health<-allprices%>%filter(categories=="baby_health")
t.test(baby_health$woolworths,baby_health$coles, var.equal = FALSE, alternative = "two.sided")

    Welch Two Sample t-test

data:  baby_health$woolworths and baby_health$coles
t = -0.69668, df = 76.338, p-value = 0.4881
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.910678  1.883678
sample estimates:
mean of x mean of y 
  7.80925   8.82275 
#check the price difference of coles and woolworths in household category
household<-allprices%>%filter(categories=="household")
t.test(household$woolworths,household$coles, var.equal = FALSE, alternative = "two.sided")

    Welch Two Sample t-test

data:  household$woolworths and household$coles
t = 0.20475, df = 77.315, p-value = 0.8383
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.609739  1.978739
sample estimates:
mean of x mean of y 
   6.8080    6.6235 
#check the price difference of coles and woolworths in pantry_drinks category
pantry_drinks<-allprices%>%filter(categories=="pantry_drinks")
t.test(pantry_drinks$woolworths,pantry_drinks$coles, var.equal = FALSE, alternative = "two.sided")

    Welch Two Sample t-test

data:  pantry_drinks$woolworths and pantry_drinks$coles
t = -0.052394, df = 77.728, p-value = 0.9583
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -9.740156  9.240656
sample estimates:
mean of x mean of y 
 11.44975  11.69950 
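The five per-category tests above repeat the same call. As a sketch under assumptions, the loop below runs the same two-sided Welch test within each category and collects the results in one table. The `demo` data frame is simulated with the same column names and category labels as `allprices`, so the numbers it produces are not the report's results.

```r
# Simulated stand-in for allprices: 5 categories x 40 products, made-up prices.
set.seed(7)
demo <- data.frame(
  categories = rep(c("Fridge", "food", "baby_health", "household", "pantry_drinks"), each = 40),
  coles      = round(runif(200, 1, 20), 2),
  woolworths = round(runif(200, 1, 20), 2),
  stringsAsFactors = FALSE
)

# Run the same two-sided Welch test within each category and keep the key numbers.
by_cat <- lapply(split(demo, demo$categories), function(d) {
  tt <- t.test(d$woolworths, d$coles, var.equal = FALSE, alternative = "two.sided")
  data.frame(category = d$categories[1],
             t = unname(tt$statistic), df = unname(tt$parameter),
             p_value = tt$p.value,
             lower = tt$conf.int[1], upper = tt$conf.int[2])
})
results <- do.call(rbind, by_cat)
results
```

Collecting the statistics in one data frame makes it easier to compare categories side by side and avoids copy-and-paste errors between near-identical calls.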

Interpretation

The price difference between Coles and Woolworths has been tested for all sampled products and for the products in each of the 5 categories.

1. Hypothesis test for the price difference between Coles and Woolworths across all sampled products. Allowing for unequal variances, and given that the sample average price at Woolworths is lower than at Coles, we conduct the Welch two-sample t-test to check whether the average price at Woolworths is lower than at Coles.

Null hypothesis H0: the average price of products at Woolworths is equal to that at Coles. Alternative hypothesis H1: the average price of products at Woolworths is less than that at Coles. Significance level: α = 0.05.

p-value = 0.4023 > α = 0.05. If the null hypothesis were true, a difference at least as far in the direction of H1 as the one observed would occur with probability 40.23%.

95% confidence interval (-∞, 1.4619]: this one-sided interval for the difference in mean price (Woolworths minus Coles) contains 0. The sample difference in means is -0.2583. The interval comes from a procedure that captures the true difference in 95% of repeated samples.

Therefore, there is insufficient evidence to reject the null hypothesis. The data do not show that the average price of all products at Woolworths is lower than at Coles, although this does not prove that the two averages are equal.
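Because every product in the sample is priced at both supermarkets, the per-product difference could also be tested directly with a paired t-test. This is offered only as a possible alternative to the Welch test used in this report, not as the report's method. The sketch uses simulated prices, and the size of the product-level variation is an arbitrary assumption.

```r
# Simulated data only: each position is one product priced at both stores.
set.seed(123)
base_price <- runif(200, 1, 30)                 # product-level price
coles      <- base_price + rnorm(200, 0.2, 1)
woolworths <- base_price + rnorm(200, 0.0, 1)

# Paired test: works on the per-product differences, so the large
# product-to-product variation in price level cancels out.
paired <- t.test(coles, woolworths, paired = TRUE)

# It is the same as a one-sample t-test on the differences.
one_sample <- t.test(coles - woolworths, mu = 0)

# For comparison, the unpaired Welch test on the same numbers.
welch <- t.test(coles, woolworths, var.equal = FALSE)

c(paired = paired$p.value, welch = welch$p.value)
```

In simulated data like this, the standard error of the paired differences is much smaller than the unpooled two-sample standard error, so the paired test is more sensitive to a small systematic price gap. Whether pairing is appropriate depends on both columns really holding the same product.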

2. Hypothesis tests for the price difference between Coles and Woolworths in each of the 5 categories. A Welch two-sample t-test is conducted within each category to check whether the average price of the sampled products differs between Coles and Woolworths.

(1) Fridge category, which includes fridge and freezer products at the two supermarkets. Null hypothesis H0: the average price of fridge products at Woolworths is equal to that at Coles. Alternative hypothesis H1: the average price of fridge products at Woolworths is not equal to that at Coles. Significance level: α = 0.05.

p-value = 0.8708 > α = 0.05. If the null hypothesis were true, a difference at least as extreme as the one observed would occur with probability 87.08%.

95% confidence interval [-1.059, 0.899]: this interval for the difference in mean price in the fridge category (Woolworths minus Coles) contains 0. The sample difference in means is -0.0803.

Therefore, there is insufficient evidence to reject the null hypothesis. The data give no evidence that the average price of fridge products differs between Woolworths and Coles.

(2) Baby health category, which includes baby health care and adult skin care products at the two supermarkets. Null hypothesis H0: the average price of baby health products at Woolworths is equal to that at Coles. Alternative hypothesis H1: the average price of baby health products at Woolworths is not equal to that at Coles. Significance level: α = 0.05.

p-value = 0.4881 > α = 0.05. If the null hypothesis were true, a difference at least as extreme as the one observed would occur with probability 48.81%.

95% confidence interval [-3.911, 1.884]: this interval for the difference in mean price in the baby health category (Woolworths minus Coles) contains 0. The sample difference in means is -1.014.

Therefore, there is insufficient evidence to reject the null hypothesis. The data give no evidence that the average price of baby health products differs between Woolworths and Coles.

(3) Household category, which includes entertainment, clothing, household and pet products at the two supermarkets.

Null hypothesis H0: the average price of household products at Woolworths is equal to that at Coles. Alternative hypothesis H1: the average price of household products at Woolworths is not equal to that at Coles. Significance level: α = 0.05.

p-value = 0.8383 > α = 0.05. If the null hypothesis were true, a difference at least as extreme as the one observed would occur with probability 83.83%.

95% confidence interval [-1.610, 1.979]: this interval for the difference in mean price in the household category (Woolworths minus Coles) contains 0. The sample difference in means is 0.1845.

Therefore, there is insufficient evidence to reject the null hypothesis. The data give no evidence that the average price of household products differs between Woolworths and Coles.

(4) Pantry and drinks category, which includes pantry and drinks products at the two supermarkets. Null hypothesis H0: the average price of pantry and drinks products at Woolworths is equal to that at Coles. Alternative hypothesis H1: the average price of pantry and drinks products at Woolworths is not equal to that at Coles. Significance level: α = 0.05.

p-value = 0.9583 > α = 0.05. If the null hypothesis were true, a difference at least as extreme as the one observed would occur with probability 95.83%.

95% confidence interval [-9.740, 9.241]: this interval for the difference in mean price in the pantry and drinks category (Woolworths minus Coles) contains 0. The sample difference in means is -0.2498.

Therefore, there is insufficient evidence to reject the null hypothesis. The data give no evidence that the average price of pantry and drinks products differs between Woolworths and Coles.

(5) Food category, which includes fruit, vegetables, bakery, meat and seafood at the two supermarkets. Null hypothesis H0: the average price of food products at Woolworths is equal to that at Coles. Alternative hypothesis H1: the average price of food products at Woolworths is not equal to that at Coles. Significance level: α = 0.05.

p-value = 0.8457 > α = 0.05. If the null hypothesis were true, a difference at least as extreme as the one observed would occur with probability 84.57%.

95% confidence interval [-1.483, 1.218]: this interval for the difference in mean price in the food category (Woolworths minus Coles) contains 0. The sample difference in means is -0.1325.

Therefore, there is insufficient evidence to reject the null hypothesis. The data give no evidence that the average price of food products differs between Woolworths and Coles.

In summary, there is no statistically significant difference in average price between Coles and Woolworths, either across all sampled products or within any of the 5 categories. The analysis therefore provides no evidence that the average prices at Coles and Woolworths differ, in any category or overall.

Discussion

There are some limitations. First, when testing the price difference within each category, the sample size is 40 per category, which is only slightly above 30 observations. The precision of the tests could be improved by using a larger sample. Secondly, the classification could be more specific. In this analysis the classification is relatively coarse: all Woolworths and Coles products are grouped into only 5 categories. Products could be sorted into more specific categories, which would allow a more accurate comparison and overview of price levels in each category between the two supermarkets.
Thirdly, the Q-Q plots in this report show some outlying points that fall relatively far outside the 95% confidence envelope for normality, which might make the tests less accurate. These outliers arise from a few extremely large price differences. Their influence could be reduced by more appropriate categorisation, because price differences within the same category are more likely to be of a similar size. For example, price differences for bakery products are relatively small, whereas price differences for alcohol products are relatively large.
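One way to gauge how much such outliers matter is to flag them and rerun a test without them as a sensitivity check. The sketch below does this on simulated differences only. The vector `differ` here is made up, with five extreme values planted by hand, and it is not the `differ` column of the report's data.

```r
# Simulated differences only: mostly small, plus a few planted extreme values.
set.seed(99)
differ <- c(rnorm(195, 0.2, 0.8), c(-9, 8, 10, 12, -11))

# boxplot.stats() flags points more than 1.5 times the interquartile range
# beyond the hinges (approximately the quartiles), the boxplot whisker rule.
out_vals <- boxplot.stats(differ)$out
out_rows <- which(differ %in% out_vals)

length(out_vals)                   # number of flagged points
t.test(differ)$p.value             # all observations
t.test(differ[-out_rows])$p.value  # flagged points removed
```

If the conclusion is the same with and without the flagged points, the outliers are unlikely to be driving the result. If it changes, the flagged products deserve a closer look before anything is dropped.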

