This report examines which of Australia's two major supermarkets, Coles or Woolworths, is cheaper.

Procedure

Sample

Variables

Main findings

Conclusions

Load Packages and Data

library(readr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(lattice)
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
setwd("C:/Users/ria93/Desktop/Stats/Assignment2")
colesWoolie <- read_csv("ColesWoolies.csv")
## Parsed with column specification:
## cols(
##   ProductID = col_integer(),
##   `Product Category` = col_character(),
##   `Product Name` = col_character(),
##   `Product Quantity` = col_integer(),
##   `Coles Price` = col_double(),
##   `Woolworths Price` = col_double(),
##   Random = col_integer()
## )
head(colesWoolie)
colesWoolie_random <- colesWoolie %>% filter(Random == 0)

head(colesWoolie_random)

Summary Statistics

colesWoolie_random <- colesWoolie_random %>%  mutate (diff = `Coles Price` - `Woolworths Price`)

head(colesWoolie_random)
colesWoolie_random$`Coles Price` %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   3.600   5.000   5.663   7.000  12.900
colesWoolie_random$`Woolworths Price` %>% summary()
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.350   3.700   4.620   5.592   7.000  12.900
colesWoolie_random %>% summarise( Mean_Coles = mean(`Coles Price`, na.rm = TRUE),   
                                  SD_Coles = sd(`Coles Price`, na.rm = TRUE),   
                                  Mean_Woolworths = mean(`Woolworths Price`, na.rm = TRUE),             
                                  SD_woolworths = sd(`Woolworths Price`, na.rm = TRUE),
                                  Mean_Difference = Mean_Coles - Mean_Woolworths,
                                  SD_Difference = sd(`Coles Price` -  `Woolworths Price`, na.rm = TRUE),
                                  n = n()) 
matplot(t(data.frame(colesWoolie_random$`Coles Price`,                               
                     colesWoolie_random$`Woolworths Price`)),
        type="b", pch=19, col=1, lty=1, xlab= "Supermarkets", ylab="Price ($)",         
        xaxt = "n")  
axis(1, at=1:2, labels=c("Coles","Woolworths")) 

ColesWooliesGather <- colesWoolie_random %>% gather(`Coles Price`,`Woolworths Price`, key = "store", value = "price")

head(ColesWooliesGather)
ColesWooliesGather %>% boxplot(price ~ store, data = ., horizontal=TRUE)

colesWoolie_random$`Coles Price` %>% qqPlot(dist="norm") 

## [1] 32  9
colesWoolie_random$`Woolworths Price` %>% qqPlot(dist="norm") 

## [1] 32  9

Hypothesis Test

  • We use a t-test to assess whether the observed difference in mean prices is statistically significant, and therefore whether one supermarket is cheaper. A t-test is appropriate because it compares two population means. Because every product is priced at both stores, the test reported in this section is run in its paired form (paired = TRUE). The test uses the following statistical hypotheses:

    H0:M1 - M2 = 0 (Null Hypothesis)

    HA:M1 - M2 != 0 (Alternate Hypothesis)

    where M1 and M2 are the population mean prices at Coles and Woolworths respectively.

  • The null hypothesis (H0) assumes that the difference between the two supermarkets' mean prices is 0, whereas the alternative hypothesis (HA) assumes that the difference is not 0.

  • From the summary statistics above, the sample estimate of the difference between the Coles and Woolworths mean prices is 5.663 - 5.592 = 0.071.

  • First, we use Levene’s test to check the homogeneity of variance (the equal-variance assumption). The result determines which form of the two-sample t-test is appropriate: if equal variance can be assumed, the equal-variance form is used to compare Coles and Woolworths.

    Levene’s Test follows statistical hypotheses:

    H0: σ1^2 = σ2^2

    HA: σ1^2 != σ2^2

    where σ1^2 and σ2^2 are the population variances of Coles and Woolworths prices respectively.

  • The p-value for Levene’s test of equal variance in product price between Coles and Woolworths was p = 0.9007. Since p > 0.05, we fail to reject H0, so it is safe to assume equal variance. Had the p-value been less than 0.05, we would have rejected H0 and could not have assumed equal variance.

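  • With equal variance supported, we proceed to a two-sided t-test. The test is two-sided because either Coles or Woolworths could turn out to be cheaper. The test actually run is the paired form (paired = TRUE), which works on per-product price differences and does not itself rely on the equal-variance assumption.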
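
The Levene's test described above can be sketched in base R without the car package. This is only an illustration under stated assumptions: the prices are hypothetical, and the computation uses the median-centred variant (absolute deviations from each store's median, followed by a one-way ANOVA), which is the default centring in car::leveneTest.

```r
# Hypothetical prices in long format (one row per product per store);
# these values are illustrative and are not taken from ColesWoolies.csv.
prices <- data.frame(
  store = factor(rep(c("Coles", "Woolworths"), each = 6)),
  price = c(1.5, 3.6, 5.0, 7.0, 9.5, 12.9,
            1.4, 3.7, 4.6, 7.0, 9.0, 12.9)
)

# Absolute deviation of each price from its own store's median.
prices$abs_dev <- abs(prices$price - ave(prices$price, prices$store, FUN = median))

# One-way ANOVA on the absolute deviations; its F test is the Levene statistic.
levene_table <- anova(lm(abs_dev ~ store, data = prices))
levene_p <- levene_table[["Pr(>F)"]][1]

levene_p         # p-value for H0: equal variances
levene_p > 0.05  # TRUE means we fail to reject equal variance at the 5% level
```

Storing the grouping column as a factor, as in this sketch, should also avoid the "group coerced to factor" warning that leveneTest prints when it is given a character column.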

leveneTest(price ~ store, data = ColesWooliesGather)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
pttest <- t.test(colesWoolie_random$`Coles Price`, colesWoolie_random$`Woolworths Price`,paired = TRUE, alternative = "two.sided") 
 
pttest
## 
##  Paired t-test
## 
## data:  colesWoolie_random$`Coles Price` and colesWoolie_random$`Woolworths Price`
## t = 1.2439, df = 40, p-value = 0.2208
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04434887  0.18630009
## sample estimates:
## mean of the differences 
##              0.07097561
Tabulated_t_statistic<-qt(0.025,df=40)
Tabulated_t_statistic
## [1] -2.021075

Interpretation

A paired-samples t-test was used to test for a significant mean difference between Coles and Woolworths prices. The mean difference (Coles minus Woolworths) was $0.07 (SD = 0.37). Visual inspection of the Q-Q plot of the difference scores suggested that the data were approximately normally distributed. The paired-samples t-test found no statistically significant mean difference between Coles and Woolworths prices, t(40) = 1.24, p = 0.221, 95% CI [-0.04, 0.19].

  • We run a paired t-test because the data are dependent: the same products are priced in both stores.
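
To illustrate why pairing matters, the sketch below uses a small set of hypothetical prices (not the report's data) to show that a paired t-test is the same as a one-sample t-test on the per-product differences, and that an independent two-sample test on the same numbers gives a different, typically much smaller, t statistic. The variable names are made up for this example.

```r
# Hypothetical prices for the same six products at both stores
# (illustrative values only, not taken from the report's data).
coles      <- c(2.00, 3.50, 5.00, 7.20, 9.00, 12.90)
woolworths <- c(1.90, 3.70, 4.60, 7.00, 9.10, 12.50)

# Paired test: each product acts as its own control.
paired <- t.test(coles, woolworths, paired = TRUE)

# Equivalent one-sample test on the per-product differences.
one_sample <- t.test(coles - woolworths, mu = 0)

# Independent two-sample test: ignores the pairing, so the large
# product-to-product price variation inflates the standard error.
independent <- t.test(coles, woolworths, var.equal = TRUE)

c(paired      = unname(paired$statistic),
  one_sample  = unname(one_sample$statistic),
  independent = unname(independent$statistic))
```

The paired and one-sample statistics coincide, with n - 1 degrees of freedom, while the independent test has 2n - 2 degrees of freedom and a standard error dominated by the spread of prices across products.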

Following the paired t-test procedure, we made the following observations:

  • First, the p-value is 0.2208, which is greater than 0.05. If the true mean difference were 0, a difference at least this large would be expected in about 22% of samples, so the evidence against H0 is weak.
  • Second, the 95% confidence interval for the mean price difference, [-0.044, 0.186], contains 0, so a true difference of 0 is consistent with the data.
  • Third, the absolute value of our test’s t-statistic (1.24) is less than the tabulated critical value (2.02). Hence, we cannot reject the null hypothesis.
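
As a check on the three observations above, the reported quantities can be recomputed from the printed t-test output alone. The sketch below takes the mean difference, t statistic and degrees of freedom from that output and derives the rest; the variable names are chosen for this example.

```r
# Values copied from the paired t-test output in this report.
mean_diff <- 0.07097561
t_stat    <- 1.2439
df        <- 40
n         <- df + 1             # 41 product pairs

se_diff <- mean_diff / t_stat   # standard error of the mean difference
sd_diff <- se_diff * sqrt(n)    # standard deviation of the differences

t_crit  <- qt(0.975, df)        # two-sided 5% critical value
p_value <- 2 * pt(-abs(t_stat), df)
ci      <- mean_diff + c(-1, 1) * t_crit * se_diff

round(c(sd = sd_diff, t_crit = t_crit, p = p_value,
        lower = ci[1], upper = ci[2]), 3)
```

Up to rounding of the printed t statistic, the results should agree with the SD, critical value, p-value and confidence interval quoted in this section.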

Based on these three outcomes, the mean difference between Coles and Woolworths prices is not statistically significant. We cannot reject the null hypothesis, and we conclude that product prices at Coles and Woolworths are almost the same.

Discussion

Strengths

  • We applied our hypothesis-testing knowledge to the investigation through several methods.
  • We randomly collected a sufficiently large sample across various product categories to support the investigation.

Limitations

  • The data in this report were collected at one point in time, but prices fluctuate every day. A product may be cheaper in one supermarket today and cheaper in the other on another day. Collecting prices at several points in time would help us understand these differences.
  • We did not compare seasonal on-sale products. These could be analysed separately.
  • Online prices can lag behind in-store prices. Supermarkets in different areas can also charge different prices for the same product, so the findings may vary by area.
  • The overlap of brands between the two stores is small: most products are stocked by only one of them. We also did not compare the two stores' home brands.
  • Because home brands were excluded, our sample is biased toward certain products and does not give a complete picture of which supermarket is cheaper.

Improvements

  • We can collect more historical data and compare it year by year, month by month, or category by category to see whether any trends exist.
  • We can create a data scraper (to scrape prices from the websites and save them to Excel), which would give a more accurate picture and save the time spent collecting prices manually.
  • We can also divide the data into lower and higher price ranges to get a clearer picture of the difference.
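
The price-range idea in the last point could be sketched as follows. The prices are hypothetical, and the choice to split at the median of each product's average price across the two stores is an assumption made for this example rather than a method used in the report.

```r
# Hypothetical per-product prices (illustrative only).
prices <- data.frame(
  coles      = c(1.50, 2.80, 3.60, 5.00, 7.00, 8.50, 10.00, 12.90),
  woolworths = c(1.35, 3.00, 3.70, 4.62, 7.00, 8.20, 10.50, 12.90)
)
prices$diff <- prices$coles - prices$woolworths

# Split products into two bands at the median of their average price.
avg_price   <- (prices$coles + prices$woolworths) / 2
prices$band <- ifelse(avg_price <= median(avg_price), "lower", "higher")

# Mean Coles-minus-Woolworths difference and product count within each band.
band_summary <- aggregate(diff ~ band, data = prices,
                          FUN = function(d) c(mean = mean(d), n = length(d)))
band_summary
```

With enough products in each band, a separate paired t-test per band would show whether any price gap is concentrated among cheaper or dearer items.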

Conclusion

  • Product prices at Coles and Woolworths are, on average, almost the same: the mean difference between Coles and Woolworths prices is not statistically significant.