Ria Talwar (s3729618)
Radhika Zawar (s3734939)

This report examines which of Australia's two major supermarkets, Coles or Woolworths, is cheaper.

Procedure

The raw data were collected from http://www.grocerycop.com.au/, a website that compares prices at Coles and Woolworths and is updated daily.
All the data were collected on 18 September 2018. Discounted items were excluded because a discount may be valid only on a particular day and would not represent the overall picture.
A random sample was drawn in Excel using the RANDBETWEEN function, as explained in the Sample section.
A paired-samples t-test was conducted.
The results were interpreted and conclusions were drawn.

Sample

The data contains 73 observations of 7 variables with products across 9 categories.
Product name and quantity are the same across Coles and Woolworths so there is no difference due to quantity or brand.
There are 2 columns for Coles and Woolworths prices.
A Random column is generated with Excel's RANDBETWEEN(0,1) function, which assigns each product a value of 0 or 1. All rows with a random value of 0 are selected to form the sample for our analysis.
The sample, named colesWoolie_random, contains 41 observations of 7 variables and is the main dataset for our analysis.
No home-brand or discounted products were included in the data.
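
The selection step above was done in Excel, but the same idea can be sketched in R. This is only an illustration under stated assumptions: the `products` data frame is a stand-in for the 73 collected products, and the seed value is hypothetical (Excel's RANDBETWEEN is volatile and recalculates, so the original draw cannot be replayed).

```r
# Hypothetical stand-in for the 73 products collected from the website
products <- data.frame(ProductID = 1:73)

# A fixed seed makes the draw repeatable (assumption: not part of the original workflow)
set.seed(2018)

# Equivalent of RANDBETWEEN(0,1): each product independently gets 0 or 1
products$Random <- sample(0:1, nrow(products), replace = TRUE)

# Keep only the rows flagged 0, as in the report
selected <- products[products$Random == 0, ]
nrow(selected)
```

Because every product has a 50% chance of being kept, the sample size is itself random; the report's draw happened to keep 41 of the 73 products.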

Variables

The following variables were included in the dataset:
1. ProductID- It is a unique number assigned to each product. Before random selection, the ProductID values were sequential. After random sampling and filtering, they no longer form an unbroken sequence.
2. Product Category- Products from 9 different categories, such as Bakery, Freezer, Fruit & Vegetables, Meat & Seafood and Pantry, were included.
3. Product Name- Identifies the name of the product.
4. Product Quantity- Gives the product quantity so that the comparison between Coles and Woolworths product is accurate.
5. Coles Price- Price of a product at Coles.
6. Woolworths Price- Price of a product at Woolworths.
7. Random- This field will be 0 throughout as we are selecting only 0 after random sampling.

Main findings

We used a paired-samples t-test to determine which of the two supermarkets, Coles or Woolworths, is cheaper.

Conclusions

It was concluded that Coles and Woolworths had almost the same prices for the products considered, so we cannot support the claim that one is cheaper than the other.

Load Packages and Data

We set our working directory to the location where we saved our dataset using the setwd() function.
We read the dataset using the read_csv() function from the readr package, store it in colesWoolie, and then store the random sample in colesWoolie_random.

library(readr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)
library(lattice)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

setwd("C:/Users/ria93/Desktop/Stats/Assignment2")
colesWoolie <- read_csv("ColesWoolies.csv")

## Parsed with column specification:
## cols(
##   ProductID = col_integer(),
##   `Product Category` = col_character(),
##   `Product Name` = col_character(),
##   `Product Quantity` = col_integer(),
##   `Coles Price` = col_double(),
##   `Woolworths Price` = col_double(),
##   Random = col_integer()
## )

head(colesWoolie)

colesWoolie_random <- colesWoolie %>% filter(Random == 0)

head(colesWoolie_random)

Summary Statistics

We add a column to our random sample using the mutate() function. This column, diff, is the price difference between Coles and Woolworths.
We use the summary() function on the Coles and Woolworths prices to obtain summary statistics: the minimum, first quartile, median, mean, third quartile and maximum. These statistics show a small difference between the means and between the interquartile ranges. To determine whether this difference is significant, we apply further tests to the sample.
We are also interested in the standard deviation of each store's prices, the mean difference between Coles and Woolworths prices, and the standard deviation of the differences. For that we used the summarise() function. It shows that the mean difference is 0.07097561 and the standard deviation of the differences is 0.3653683.
As the sample is not very large, we use matplot() for a better visual representation. It draws one line per product between its two prices, which helps to show the direction of any trend.
The two box plots are visually similar, and Coles seems to have slightly higher prices than Woolworths. Before running the t-test, we visually check normality using Q-Q plots. This is not strictly required here because the sample size is greater than 30 (Coles n = 41 and Woolworths n = 41), so by the central limit theorem the sampling distribution of the mean can be treated as approximately normal.
In the Q-Q plots, the dotted curves correspond to a 95% confidence envelope for the normal quantiles. In both groups some points fall outside the envelope in the tails, which suggests that the tails are heavier than we would expect under a normal distribution. However, given the sample size (n = 41), we do not have to worry about this problem.
Some values may appear to be outliers because their prices are higher than those of most products under consideration. However, they are genuine prices rather than data errors, so we do not remove them from this analysis.

colesWoolie_random <- colesWoolie_random %>%  mutate (diff = `Coles Price` - `Woolworths Price`)

head(colesWoolie_random)

colesWoolie_random$`Coles Price` %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.500   3.600   5.000   5.663   7.000  12.900

colesWoolie_random$`Woolworths Price` %>% summary()

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.350   3.700   4.620   5.592   7.000  12.900

colesWoolie_random %>% summarise( Mean_Coles = mean(`Coles Price`, na.rm = TRUE),
                                  SD_Coles = sd(`Coles Price`, na.rm = TRUE),
                                  Mean_Woolworths = mean(`Woolworths Price`, na.rm = TRUE),
                                  SD_woolworths = sd(`Woolworths Price`, na.rm = TRUE),
                                  Mean_Difference = Mean_Coles - Mean_Woolworths,
                                  SD_Difference = sd(`Coles Price` -  `Woolworths Price`, na.rm = TRUE),
                                  n = n())

matplot(t(data.frame(colesWoolie_random$`Coles Price`,                               
                     colesWoolie_random$`Woolworths Price`)),
        type="b", pch=19, col=1, lty=1, xlab= "Supermarkets", ylab="Price ($)",         
        xaxt = "n")  
axis(1, at=1:2, labels=c("Coles","Woolworths"))

ColesWooliesGather <- colesWoolie_random %>% gather(`Coles Price`,`Woolworths Price`, key = "store", value = "price")

head(ColesWooliesGather)

ColesWooliesGather %>% boxplot(price ~ store, data = ., horizontal=TRUE)

colesWoolie_random$`Coles Price` %>% qqPlot(dist="norm")

## [1] 32  9

colesWoolie_random$`Woolworths Price` %>% qqPlot(dist="norm")

## [1] 32  9
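
Since the paired-samples t-test works on the per-product differences, a normality check of the diff column can also be useful. The sketch below shows one way to do this with base R. It runs on simulated prices, not the report's data: the `coles` and `woolworths` vectors, the seed and the noise level are all assumptions made purely for illustration.

```r
# Simulated stand-in prices (assumption: NOT the report's data)
set.seed(1)
coles      <- round(runif(41, min = 1.5, max = 13), 2)
woolworths <- round(coles + rnorm(41, mean = 0, sd = 0.35), 2)

# Per-product differences, as in the diff column
d <- coles - woolworths

# Coordinates of the normal Q-Q plot, computed without drawing
qq <- qqnorm(d, plot.it = FALSE)

# Draw the plot only in an interactive session
if (interactive()) {
  plot(qq, xlab = "Theoretical quantiles", ylab = "Sample quantiles",
       main = "Normal Q-Q plot of price differences")
  qqline(d)
}

# Shapiro-Wilk test as a numeric complement to the visual check
sw <- shapiro.test(d)
sw$p.value
```

With the real data, `d` would simply be `colesWoolie_random$diff`.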

Hypothesis Test

We use a t-test to assess whether this difference is statistically significant and, if so, which supermarket is cheaper. A t-test compares two population means; because each product is priced at both stores, the comparison is made through the per-product price differences. The test follows these statistical hypotheses:

$H_0: \mu_1 - \mu_2 = 0$ (Null Hypothesis)

$H_A: \mu_1 - \mu_2 \neq 0$ (Alternate Hypothesis)

where $\mu_1$ and $\mu_2$ are the mean prices at Coles and Woolworths respectively.
The null hypothesis ($H_0$) assumes that the difference between the mean prices of the two supermarkets is 0, whereas the alternate hypothesis assumes that the difference is not 0.
From the summary statistics above, the difference between Coles and Woolworths estimated by the sample is 5.663 - 5.592 = 0.071.
First, we use Levene’s test to check homogeneity of variance (the assumption of equal variance). This matters when choosing the type of two-sample t-test: if equal variance can be assumed, the equal-variance form of the two-sample t-test applies when comparing Coles and Woolworths.

Levene’s Test follows statistical hypotheses:

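$H_0: \sigma_1^2 = \sigma_2^2$

$H_A: \sigma_1^2 \neq \sigma_2^2$

where $\sigma_1^2$ and $\sigma_2^2$ are the variances of Coles and Woolworths prices respectively.
The p-value for Levene’s test of equal variance for product price between Coles and Woolworths was p = 0.9007. Since p > 0.05, we fail to reject H0, so it is safe to assume equal variance. Had the p-value been less than 0.05, we would have rejected the null hypothesis and could not have assumed equal variance.
Because the same products are priced at both stores, the two price columns are dependent, so the test run below is the paired-samples t-test with a two-sided alternative. It works on the per-product differences and does not itself require equal variances; Levene’s test is reported as a supplementary check. The alternative is two-sided because either Coles or Woolworths could be the cheaper store.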

leveneTest(price ~ store, data = ColesWooliesGather)

## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
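
To make explicit what leveneTest() computes: with its default setting (center = median), Levene’s test is a one-way ANOVA on the absolute deviations of each price from its store's median. The following base R sketch reproduces that idea on simulated data. The `prices` data frame and the seed are assumptions for illustration only, so the resulting p-value is not the report's 0.9007.

```r
# Simulated stand-in for ColesWooliesGather (assumption: NOT the report's data)
set.seed(42)
prices <- data.frame(
  store = rep(c("Coles Price", "Woolworths Price"), each = 41),
  price = round(runif(82, min = 1.5, max = 13), 2)
)

# Median price within each store, repeated for every row of that store
group_median <- ave(prices$price, prices$store, FUN = median)

# Absolute deviation of each price from its own store's median
prices$abs_dev <- abs(prices$price - group_median)

# One-way ANOVA on the absolute deviations gives the Levene statistic
levene_fit <- anova(lm(abs_dev ~ store, data = prices))
levene_p   <- levene_fit[["Pr(>F)"]][1]
levene_p
```

A large p-value here would mean that the spread of prices around each store's median is similar in the two stores.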

pttest <- t.test(colesWoolie_random$`Coles Price`, colesWoolie_random$`Woolworths Price`,paired = TRUE, alternative = "two.sided") 
 
pttest

## 
##  Paired t-test
## 
## data:  colesWoolie_random$`Coles Price` and colesWoolie_random$`Woolworths Price`
## t = 1.2439, df = 40, p-value = 0.2208
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.04434887  0.18630009
## sample estimates:
## mean of the differences 
##              0.07097561

Tabulated_t_statistic<-qt(0.025,df=40)
Tabulated_t_statistic

## [1] -2.021075
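
As a cross-check, the paired t statistic, p-value and confidence interval can be recomputed by hand from the summary values reported in this report (mean difference 0.07097561, standard deviation of the differences 0.3653683, n = 41). The variable names below are illustrative; only the three input numbers come from the analysis above.

```r
# Summary values reported in this report
mean_diff <- 0.07097561  # mean of (Coles Price - Woolworths Price)
sd_diff   <- 0.3653683   # standard deviation of the differences
n         <- 41          # number of paired products

se_diff <- sd_diff / sqrt(n)      # standard error of the mean difference
t_stat  <- mean_diff / se_diff    # paired t statistic
dof     <- n - 1                  # degrees of freedom

t_crit  <- qt(0.975, dof)                 # two-sided 5% critical value
p_value <- 2 * pt(-abs(t_stat), dof)      # two-sided p-value
ci      <- mean_diff + c(-1, 1) * t_crit * se_diff  # 95% confidence interval

round(c(t = t_stat, p = p_value, lower = ci[1], upper = ci[2]), 4)
```

Up to rounding, these values should agree with the t.test() output above (t = 1.2439, p-value = 0.2208, interval from about -0.044 to 0.186), and |t| is smaller than the critical value of about 2.02.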

Interpretation

A paired-samples t-test was used to test for a significant mean difference between the prices at Coles and Woolworths. The mean difference was found to be 0.07 (SD = 0.37). Visual inspection of the Q-Q plots showed some departure from normality in the tails, but with n = 41 the sampling distribution of the mean difference can be treated as approximately normal. The paired-samples t-test found no statistically significant mean difference between Coles and Woolworths prices.

We ran a paired t-test because the data are dependent: we are dealing with the same products and examining their prices in two stores.

Following the procedure of paired t.test we came up with the following observations:

First, the p-value is 0.2208, which is greater than 0.05. If the true mean difference were zero, a difference at least as large as the one observed would occur in about 22% of samples, so the data do not provide sufficient evidence that the population mean prices differ.
Second, the 95% confidence interval for the mean price difference, [-0.044, 0.186], contains 0 (the sample mean difference is 0.07).
Third, our test’s t statistic (1.24) is smaller in absolute value than the tabulated critical value (2.02). Hence, we cannot reject the null hypothesis.

Based on these three outcomes, the mean difference between Coles and Woolworths prices is NOT statistically significant; we cannot reject the null hypothesis, and we conclude that product prices at Coles and Woolworths are almost the same.

Discussion

Strengths

We applied our knowledge of hypothesis testing to this investigation through several methods.
We also randomly selected a large enough sample from various categories to support this investigation.

Limitations

The data in the report were collected at a particular time, but prices fluctuate every day. A product may be cheaper in one supermarket today and cheaper in the other on another day. Collecting prices at different points in time would help to understand these differences.
We didn’t compare seasonal on-sale products. These products could be considered separately.
Online prices always lag somewhat behind prices in real stores. Also, supermarkets in different areas can have different prices for the same product, so the findings can vary by area.
The overlap of brands between the two stores is not large; the majority of products at the two stores differ. Also, we did not compare the home brands of the two stores.
Excluding home brands biased our sample towards certain products, so it does not give a complete and true picture of which supermarket is cheaper.

Improvements

We can collect more historical data and compare year by year, month by month, or category by category to see whether any trends exist.
We can create a data scraper (to scrape data from the website and save it to Excel) to build a more accurate picture. It would also save the time spent on manual collection.
We can also divide the data into lower and higher price ranges to get a clearer picture of the difference.

Conclusion

Product prices at Coles and Woolworths are almost the same for the products considered: the mean difference between Coles and Woolworths prices is NOT statistically significant.

MATH1324 Assignment 2

Supermarket Price Wars
