Supermarket Price Wars

Executive Statement

Coles and Woolworths are the dominant supermarket chains in Australia and are engaged in intense price competition. The purpose of this investigation is to analyse the price difference between matched products at these two supermarkets to find out which is cheaper.

A sample of 81 products from seven categories was taken for the investigation; a sample of this size helps to reduce the standard error. Products were matched by brand, type and size/weight. Data was collected manually from https://www.grocerygetter.com.au/#!/shop on 19 September 2019. The dataset, called supermarket, contains 81 observations and 4 variables:

Prod_name: the name of the product
Prod_category: the category of the product (the products are grouped into 7 categories)
Coles_price: the product price at Coles
Woolworth_price: the product price at Woolworths

The product categories are:

Bread & Bakery
Dairy & Eggs
Freezer
Fruits&Veggie
Health&Beauty
Meat&Seafood
Pets

Home-brand products and discounted products were excluded from the investigation, because the attributes of home-brand products vary between the supermarkets and to avoid bias. Statistical summaries, a side-by-side box plot, a bar chart and a scatter plot are used to summarise the variables and visualise the data. The analysis starts from the assumption that Coles and Woolworths have the same mean price (the null hypothesis).

A paired-sample t-test was used to test for a significant difference in prices. The test found no statistically significant difference between the prices of Coles and Woolworths. Hence, on this sample, the two supermarkets price the matched products similarly.

Load Packages and Data

library(readr)

## Warning: package 'readr' was built under R version 3.6.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.6.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(knitr)

## Warning: package 'knitr' was built under R version 3.6.3

library(rmarkdown)

## Warning: package 'rmarkdown' was built under R version 3.6.3

library(ggplot2)
library(magrittr)

## Warning: package 'magrittr' was built under R version 3.6.3

library(reshape2)

## Warning: package 'reshape2' was built under R version 3.6.3

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(granova)

## Warning: package 'granova' was built under R version 3.6.3

supermarket <- read_csv("supermarket1.csv")

## Parsed with column specification:
## cols(
##   Prod_name = col_character(),
##   Prod_category = col_character(),
##   Coles_price = col_double(),
##   Woolworth_price = col_double()
## )

head(supermarket)

Summary Statistics

# Summary 

Coles <- supermarket%>%summarise(Min = min(Coles_price,na.rm = TRUE),
                                    Q1 = quantile(Coles_price,probs = .25,na.rm=TRUE),
                                    Median = median(Coles_price, na.rm = TRUE),
                                    Q3 = quantile(Coles_price,probs = .75,na.rm=TRUE),
                                    Max = max(Coles_price,na.rm = TRUE),
                                    Mean = mean(Coles_price, na.rm = TRUE),
                                    SD = sd(Coles_price, na.rm = TRUE),
                                    n = n()
                                   )

Woolworths <- supermarket  %>% summarise(Min = min(Woolworth_price,na.rm = TRUE),
                      Q1 = quantile(Woolworth_price,probs = .25,na.rm = TRUE),
                      Median = median(Woolworth_price, na.rm = TRUE),
                      Q3 = quantile(Woolworth_price,probs = .75,na.rm = TRUE),
                      Max = max(Woolworth_price,na.rm = TRUE),
                      Mean = mean(Woolworth_price, na.rm = TRUE),
                      SD = sd(Woolworth_price, na.rm = TRUE),
                      n = n()
                     )

supermarket <- supermarket %>% mutate(difference=Coles_price-Woolworth_price) 
Differences <- supermarket%>% summarise(Min = min(difference,na.rm = TRUE),
                                   Q1 = quantile(difference,probs = .25,na.rm = TRUE),
                                   Median = median(difference, na.rm = TRUE),
                                   Q3 = quantile(difference,probs = .75,na.rm = TRUE),
                                   Max = max(difference,na.rm = TRUE),
                                   Mean = mean(difference, na.rm = TRUE),
                                   SD = sd(difference, na.rm = TRUE),
                                   n = n()
                                  )

# Combine the three summaries; convert to a data frame so row names can be set
combination <- as.data.frame(rbind(Coles, Woolworths, Differences))

rownames(combination) <- c("Coles", "Woolworths", "Differences")

colnames(combination) <- c("Minimum","Q1","Median","Q3","Maximum","Mean",
                    "Standard Deviation","Total count")


kable(round(combination,2), caption = "Summary table of Coles and Woolworths", row.names = TRUE)

Summary table of Coles and Woolworths
	Minimum	Q1	Median	Q3	Maximum	Mean	Standard Deviation	Total count
Coles	0.9	3.60	4.79	6.29	13.00	5.36	2.56	81
Woolworths	1.0	4.00	4.80	6.30	13.00	5.23	2.24	81
Differences	-2.5	-0.51	-0.01	0.50	4.24	0.13	1.23	81

#Summary by product category

supermarket %>% group_by (Prod_category) %>% summarise(Min = min(difference,na.rm = TRUE),
                                         Q1 = quantile(difference,probs = .25,na.rm = TRUE),
                                         Median = median(difference, na.rm = TRUE),
                                         Q3 = quantile(difference,probs = .75,na.rm = TRUE),
                                         Max = max(difference,na.rm = TRUE),
                                         Mean = mean(difference, na.rm = TRUE),
                                         SD = sd(difference, na.rm = TRUE),
                                         sum=sum(difference, na.rm = TRUE),
                                         n = n(),
                                         Missing_value = sum(is.na(difference)))

The mean price at Coles (5.36) is slightly higher than at Woolworths (5.23). The mean price difference (Coles minus Woolworths) is 0.13, indicating that Woolworths products are on average slightly cheaper than the same products at Coles, although the median difference is close to zero (-0.01). The second table shows the price differences by product category. It shows that Pets, Bread & Bakery, Freezer and Health&Beauty products are more expensive at Coles than at Woolworths.

#boxplot
boxplot(supermarket$Coles_price, supermarket$Woolworth_price,
 names=c("Coles", "Woolworths"),
 main="Boxplot of Coles and Woolworths",
 xlab="Supermarkets", ylab="Price", col=c("yellow", "green"))

The boxes are comparatively short and there is no obvious difference between the two box plots. This shows that overall product prices are similar in the two supermarkets, although there is a slight difference between the sample means of Coles and Woolworths.

# Grouped Bar Plot

table1<-supermarket %>% group_by (Prod_category) %>% summarise(
                                                         Coles= mean(Coles_price, na.rm = TRUE),
                                                         Woolworths = mean(Woolworth_price, na.rm = TRUE)
                                                        )
df_long <- melt(as.data.frame(table1), id.vars = "Prod_category")

ggplot(df_long, aes(x = Prod_category, y = value, fill = variable)) + 
  geom_bar(stat = "identity", position = "dodge")

The bar chart shows that Health&Beauty products are more expensive on average than the other product categories. Both supermarkets have almost the same pricing for Dairy & Eggs. Health&Beauty, Freezer and Pets products are more expensive at Coles than at Woolworths.

#scatterplot

x <- supermarket$Coles_price
y <- supermarket$Woolworth_price
fit <- lm(y ~ x)  # fit the regression once and reuse it below
plot(x, y, main = "Woolworths Price VS Coles Price",
     xlab = "Coles Price", ylab = "Woolworths Price",
     pch = 19, frame = FALSE)
abline(fit, col = "blue")

# annotate the plot with the fitted slope
text(x=7.5, y=8.5, cex=0.6, col="darkblue",
 labels=paste0("Slope = ", round(coef(fit)[2], 3)), srt=42)
abline(fit, col="red", lty=6)

The scatter plot shows an upward pattern, indicating a positive relationship between the product prices of Coles and Woolworths. The fitted regression line has a slope of 0.769, meaning that the Woolworths price rises by about 0.77 for each 1.00 increase in the Coles price, so prices in the two supermarkets move closely together.
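As a rough consistency check, the slope of a simple linear regression equals the correlation multiplied by the ratio of the standard deviations. The sketch below uses only rounded figures reported in this document (the correlation r(x,y) = 0.878 from the granova summary later in the report, and the standard deviations from the summary table), so it should agree with the fitted slope only approximately.

```r
# Rounded values taken from this report's own output
r    <- 0.878  # correlation between Coles and Woolworths prices, r(x,y)
sd_x <- 2.56   # standard deviation of Coles prices (predictor)
sd_y <- 2.24   # standard deviation of Woolworths prices (response)

# Slope of the least-squares line of y on x
slope <- r * sd_y / sd_x
round(slope, 3)
```

The result is close to the plotted slope of 0.769; the small gap comes from using rounded inputs.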

Hypothesis Test

A paired-sample (dependent-sample) t-test is used to determine whether Coles or Woolworths is cheaper. The sample mean for Coles is 5.36 and the sample mean for Woolworths is 5.23; the two means are so close that it is difficult to judge by inspection, so the test is used to assess whether the difference is statistically significant. The hypotheses are as follows:

H0: μ_Coles - μ_Woolworths = 0
HA: μ_Coles - μ_Woolworths ≠ 0

Q-Q plots are used to check normality visually. Some data points in the Q-Q plots of Coles and Woolworths depart from normality, but the majority lie within the 95% confidence envelope. Since the sample size is n = 81, the Central Limit Theorem implies that the sampling distribution of the mean is approximately normal (a common rule of thumb is n > 30), so the paired-sample t-test can be performed even if the normality assumption is violated. Strictly, the normality assumption of the paired test applies to the price differences rather than to each supermarket's prices separately.

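Levene's test is conducted to check homogeneity of variance, comparing its p-value with the standard 0.05 level. The reported p-value is 0.02, which is less than 0.05, so the null hypothesis of equal variances is rejected. This result should be read with caution: leveneTest() expects a response and a grouping factor, so the call below, which passes the two price columns directly and triggers a coercion warning, does not compare the variances of the two supermarkets. In any case, equal variance is an assumption of the independent two-sample t-test, not of the paired t-test, which is computed on the within-product differences, so the paired test can proceed. The Dependent Sample Assessment plot is used to visualise the paired Coles and Woolworths prices.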

A 95% confidence interval, corresponding to a significance level of 0.05, is used for the t-test.

#normality
qqPlot(supermarket$Coles_price, dist="norm" , main = "Q-Q Plot of Coles")

## [1] 37 39

qqPlot(supermarket$Woolworth_price, dist="norm" , main ="Q-Q Plot of Woolworth")

## [1] 40 28
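Because the paired test works on the differences, the same visual check can also be applied to them directly. The following is a minimal sketch in base R on simulated differences (hypothetical values, not the study data); with the real data one would pass supermarket$difference, the column created in the summary statistics section.

```r
# Simulated paired differences (hypothetical; mean and SD loosely based on the summary table)
set.seed(42)
sim_diff <- rnorm(81, mean = 0.13, sd = 1.23)

# Normal Q-Q plot of the differences, with a reference line through the quartiles
qq <- qqnorm(sim_diff, main = "Q-Q Plot of price differences (simulated)")
qqline(sim_diff, col = "red")
```

Points that follow the reference line closely suggest that the differences are approximately normal.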

#Levene test
leveneTest(supermarket$Coles_price , supermarket$Woolworth_price)

## Warning in leveneTest.default(supermarket$Coles_price,
## supermarket$Woolworth_price): supermarket$Woolworth_price coerced to factor.
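Levene's test compares the spread of a response across the levels of a grouping factor, so the prices need to be stacked in long format with a store label. The sketch below shows, in base R and on simulated prices (hypothetical values, not the study data), what the median-centred version of the test computes: a one-way ANOVA on the absolute deviations from each group's median. The names sim, store and abs_dev are illustrative assumptions.

```r
# Simulated prices in long format (hypothetical values, not the study data)
set.seed(1)
sim <- data.frame(
  price = c(rnorm(81, mean = 5.4, sd = 2.6), rnorm(81, mean = 5.2, sd = 2.2)),
  store = factor(rep(c("Coles", "Woolworths"), each = 81))
)

# Absolute deviation of each price from the median of its own store
sim$abs_dev <- abs(sim$price - ave(sim$price, sim$store, FUN = median))

# One-way ANOVA on the absolute deviations; the p-value for 'store'
# is the median-centred Levene test p-value
levene_fit <- anova(lm(abs_dev ~ store, data = sim))
levene_p   <- levene_fit[["Pr(>F)"]][1]
levene_p
```

With the car package loaded, the equivalent call on data of this shape would be leveneTest(price ~ store, data = sim).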

#Paired-sample t-test
t.test(supermarket$Coles_price , supermarket$Woolworth_price,  paired = TRUE, alternative = "two.sided")

## 
##  Paired t-test
## 
## data:  supermarket$Coles_price and supermarket$Woolworth_price
## t = 0.97973, df = 80, p-value = 0.3302
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.1377522  0.4049127
## sample estimates:
## mean of the differences 
##               0.1335802
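The t statistic, p-value and confidence interval above can be reproduced by hand from the summary statistics of the differences. The sketch below uses the mean difference from the t-test output and the standard deviation SD(D) = 1.227 from the granova summary in this report; because that standard deviation is rounded, the results should match the output only to about two or three decimals.

```r
# Summary statistics of the paired differences, as reported in this document
mean_d <- 0.1335802  # mean of the differences (t-test output)
sd_d   <- 1.227      # SD of the differences, rounded to three decimals
n      <- 81         # number of matched products

se_d   <- sd_d / sqrt(n)                                     # standard error of the mean difference
t_stat <- mean_d / se_d                                      # paired t statistic
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)                   # two-sided p-value
ci     <- mean_d + c(-1, 1) * qt(0.975, df = n - 1) * se_d   # 95% confidence interval
d_eff  <- mean_d / sd_d                                      # standardised effect size

round(c(t = t_stat, p = p_val, lower = ci[1], upper = ci[2], ES = d_eff), 3)
```

The standardised effect size of roughly 0.11 is small, which is consistent with the non-significant result.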

#Dependent Sample Assessment plot
granova.ds(
 data.frame(supermarket$Coles_price, supermarket$Woolworth_price),
 xlab = "Coles",
 ylab = "Woolworths"
 )

##             Summary Stats
## n                  81.000
## mean(x)             5.362
## mean(y)             5.229
## mean(D=x-y)         0.134
## SD(D)               1.227
## ES(D)               0.109
## r(x,y)              0.878
## r(x+y,d)            0.266
## LL 95%CI           -0.138
## UL 95%CI            0.405
## t(D-bar)            0.980
## df.t               80.000
## pval.t              0.330

Interpretation

Based on the hypothesis test performed, the following can be concluded:

The p-value of 0.3302 is greater than 0.05, so we fail to reject H0. This means there is no statistically significant evidence that the mean prices at Coles and Woolworths differ; it does not prove that the prices are equal.
The 95% confidence interval is [-0.1377522, 0.4049127]. Since the interval contains 0, the mean difference stated in H0, we fail to reject H0.
In the Dependent Sample Assessment plot, the green bar overlaps the identity line, so the observed difference is not statistically significant.

Hence, the prices at Coles and Woolworths do not show a statistically significant difference.

Discussion

From the summary statistics, the mean price at Coles is slightly greater than at Woolworths. Based on the p-value and the 95% CI (p = 0.3302, 95% CI [-0.1377522, 0.4049127]), a statistically significant difference between the mean prices of Coles and Woolworths could not be established. The data was collected for a sample of 81 products. The data set could be improved by collecting more data, which would reduce the sampling error and give more precise conclusions. Woolworths appears slightly cheaper based on the plots, but the difference is not statistically significant.
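To give a sense of how much more data would be needed, a power calculation can estimate the number of matched products required to detect a difference of the size observed here. The sketch below uses power.t.test() from base R with the observed mean difference and SD of the differences; the 80% power target is an assumption chosen for illustration, not a figure from this investigation.

```r
# Pairs needed to detect the observed mean difference with a paired t-test
res <- power.t.test(
  delta       = 0.1335802,  # observed mean difference (Coles minus Woolworths)
  sd          = 1.227,      # SD of the within-product differences
  sig.level   = 0.05,       # significance level used in this report
  power       = 0.80,       # assumed target power (illustrative)
  type        = "paired",
  alternative = "two.sided"
)

n_pairs <- ceiling(res$n)   # round up to whole products
n_pairs
```

The required number of pairs is far larger than the 81 products sampled, which suggests that a difference this small would need a much bigger sample to be detected reliably.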

Strengths of the investigation:

Product prices were observed online, although this had limitations.
Customers will get a basic idea of the cost of the goods.
Hypothesis testing knowledge is applied using several methods.

Limitations of the investigation:

The data collected for Coles and Woolworths was limited within each department of the supermarket.
Product prices may differ over time and by location. Collecting data over a period of time would help to understand and analyse the price differences better.
Seasonal products could be included in future.

A future investigation could be improved by collecting more data from each department of the supermarkets, which would improve accuracy and provide a better outcome. Moreover, the data-gathering process could be automated using a web scraper that collects the data from the website and saves it as an Excel file, which would save time.