Coles and Woolworths are the dominant players in the Australian supermarket market and compete intensely on price. The purpose of this investigation is to analyse the price difference between matched products at these two supermarket giants to find out which is cheaper.
A sample of 81 products from seven different categories was taken for the investigation in order to reduce the standard error. Products were matched according to brand, type and size/weight. Data was collected manually from https://www.grocerygetter.com.au/#!/shop on 19th September 2019. The dataset, called supermarket, contains 81 observations and 4 variables, which are Prod_name (product name), Prod_category (product category), Coles_price (price at Coles) and Woolworth_price (price at Woolworths).
The product categories are outlined below:
Home-brand products and discounted products were excluded from the investigation because the attributes of home-brand products vary between the supermarkets, and to avoid bias. Statistical summaries, side-by-side box plots, a bar chart and a scatter plot are used to summarise the variables and visualise the data for analysis. The analysis begins from the assumption that Coles and Woolworths have the same mean price (the null hypothesis).
A paired-sample t-test was used to determine whether the prices differ significantly. The t-test found no statistically significant difference between the prices at Coles and Woolworths. Hence, the sample gives no evidence that either supermarket is cheaper on average, and their pricing for these products appears broadly similar.
library(readr)
## Warning: package 'readr' was built under R version 3.6.3
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.6.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(knitr)
## Warning: package 'knitr' was built under R version 3.6.3
library(rmarkdown)
## Warning: package 'rmarkdown' was built under R version 3.6.3
library(ggplot2)
library(magrittr)
## Warning: package 'magrittr' was built under R version 3.6.3
library(reshape2)
## Warning: package 'reshape2' was built under R version 3.6.3
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
library(granova)
## Warning: package 'granova' was built under R version 3.6.3
supermarket <- read_csv("supermarket1.csv")
## Parsed with column specification:
## cols(
## Prod_name = col_character(),
## Prod_category = col_character(),
## Coles_price = col_double(),
## Woolworth_price = col_double()
## )
head(supermarket)
# Summary
Coles <- supermarket%>%summarise(Min = min(Coles_price,na.rm = TRUE),
Q1 = quantile(Coles_price,probs = .25,na.rm=TRUE),
Median = median(Coles_price, na.rm = TRUE),
Q3 = quantile(Coles_price,probs = .75,na.rm=TRUE),
Max = max(Coles_price,na.rm = TRUE),
Mean = mean(Coles_price, na.rm = TRUE),
SD = sd(Coles_price, na.rm = TRUE),
n = n()
)
Woolworths <- supermarket %>% summarise(Min = min(Woolworth_price,na.rm = TRUE),
Q1 = quantile(Woolworth_price,probs = .25,na.rm = TRUE),
Median = median(Woolworth_price, na.rm = TRUE),
Q3 = quantile(Woolworth_price,probs = .75,na.rm = TRUE),
Max = max(Woolworth_price,na.rm = TRUE),
Mean = mean(Woolworth_price, na.rm = TRUE),
SD = sd(Woolworth_price, na.rm = TRUE),
n = n()
)
supermarket <- supermarket %>% mutate(difference=Coles_price-Woolworth_price)
Differences <- supermarket%>% summarise(Min = min(difference,na.rm = TRUE),
Q1 = quantile(difference,probs = .25,na.rm = TRUE),
Median = median(difference, na.rm = TRUE),
Q3 = quantile(difference,probs = .75,na.rm = TRUE),
Max = max(difference,na.rm = TRUE),
Mean = mean(difference, na.rm = TRUE),
SD = sd(difference, na.rm = TRUE),
n = n()
)
# Convert to a plain data frame: setting row names on a tibble is deprecated
combination <- as.data.frame(rbind(Coles, Woolworths, Differences))
rownames(combination) <- c("Coles", "Woolworths", "Differences")
colnames(combination) <- c("Minimum","Q1","Median","Q3","Maximum","Mean",
"Standard Deviation","Total count")
kable(round(combination,2), caption = "Summary table of Coles and Woolworths", row.names = TRUE)
| | Minimum | Q1 | Median | Q3 | Maximum | Mean | Standard Deviation | Total count |
|---|---|---|---|---|---|---|---|---|
| Coles | 0.9 | 3.60 | 4.79 | 6.29 | 13.00 | 5.36 | 2.56 | 81 |
| Woolworths | 1.0 | 4.00 | 4.80 | 6.30 | 13.00 | 5.23 | 2.24 | 81 |
| Differences | -2.5 | -0.51 | -0.01 | 0.50 | 4.24 | 0.13 | 1.23 | 81 |
#Summary by product category
supermarket %>% group_by (Prod_category) %>% summarise(Min = min(difference,na.rm = TRUE),
Q1 = quantile(difference,probs = .25,na.rm = TRUE),
Median = median(difference, na.rm = TRUE),
Q3 = quantile(difference,probs = .75,na.rm = TRUE),
Max = max(difference,na.rm = TRUE),
Mean = mean(difference, na.rm = TRUE),
SD = sd(difference, na.rm = TRUE),
sum=sum(difference, na.rm = TRUE),
n = n(),
Missing_value = sum(is.na(difference)))
The mean price at Coles (5.36) is slightly higher than at Woolworths (5.23). The mean price difference (Coles minus Woolworths) is 0.13, indicating that Woolworths products are on average slightly cheaper than Coles products. The second table shows the price comparison by product category. It shows that Pets, Bread & Bakery, Freezer and Health&Beauty products are more expensive at Coles than at Woolworths.
#boxplot
# Pass the two price vectors directly; the pipe and data = . were redundant
boxplot(supermarket$Coles_price, supermarket$Woolworth_price,
names=c("Coles", "Woolworths"),
main="Boxplot of Coles and Woolworths",
xlab="Supermarkets", ylab="Price", col=c("yellow", "green"))
The boxes are comparatively short and there is no obvious difference between the two box plots. This shows that overall product prices are similar in the two supermarkets, although there is a slight difference between the sample means of Coles and Woolworths.
# Grouped Bar Plot
table1<-supermarket %>% group_by (Prod_category) %>% summarise(
Coles= mean(Coles_price, na.rm = TRUE),
Woolworths = mean(Woolworth_price, na.rm = TRUE)
)
df_long <- melt(table1, id.var = "Prod_category")
ggplot(df_long, aes(x = Prod_category, y = value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge")
The bar chart shows that Health&Beauty is more expensive on average than the other product categories. Both supermarkets have almost identical pricing for Dairy&Eggs. Health&Beauty, Freezer and Pets products are more expensive at Coles than at Woolworths.
#scatterplot
x <- supermarket$Coles_price
y <- supermarket$Woolworth_price
fit <- lm(y ~ x) # fit the regression once and reuse it below
plot(x, y, main = "Woolworths Price VS Coles Price",
xlab = "Coles Price", ylab = "Woolworths Price",
pch = 19, frame = FALSE)
abline(fit, col = "blue")
text(x=7.5, y=8.5, cex=0.6, col="darkblue",
labels=paste0("Slope = ", round(coef(fit)[2], 3)), srt=42)
abline(fit, col="red", lty=6)
The scatter plot shows an upward pattern, which indicates a positive relationship between the product prices at Coles and Woolworths. The fitted regression line has a slope of 0.769, so the prices in the two supermarkets move closely together.
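As a rough consistency check, the slope of a simple linear regression of y on x equals the correlation multiplied by the ratio of the standard deviations. The sketch below uses only rounded figures reported elsewhere in this report (r(x,y) = 0.878 and the two standard deviations from the summary table), so small rounding differences from 0.769 are expected.

```r
# Rounded values reported elsewhere in this report
r    <- 0.878  # r(x,y) from the dependent sample assessment summary
sd_x <- 2.56   # standard deviation of Coles prices (summary table)
sd_y <- 2.24   # standard deviation of Woolworths prices (summary table)

# For a simple linear regression of y on x: slope = r * sd_y / sd_x
slope_check <- r * sd_y / sd_x
round(slope_check, 3)
```

A slope below 1 together with a high correlation is what one would expect when Coles prices are slightly more spread out than Woolworths prices.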
A paired-sample (dependent-sample) t-test is used to determine whether Coles or Woolworths is cheaper. The sample mean for Coles is 5.36 and the sample mean for Woolworths is 5.23. The two means are so close that it is difficult to tell them apart by inspection, so the test is used to assess whether the difference is statistically significant. The hypothesis is as follows:
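The formal statement does not appear in this copy of the report. Assuming the two-sided test on the mean paired difference that the t-test output reports (alternative hypothesis: true difference in means is not equal to 0), it can be sketched as follows, where the symbol $\mu_\Delta$ is introduced here for the population mean of the per-product difference (Coles price minus Woolworths price):

```latex
\begin{aligned}
H_0 &: \mu_\Delta = 0 \\
H_A &: \mu_\Delta \neq 0
\end{aligned}
```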
Q-Q plots are used to check normality visually. Since the sample size is n = 81, the Central Limit Theorem implies that the sampling distribution of the mean is approximately normal (a sample size above 30 is the usual rule of thumb), so the paired-sample t-test can be performed even if the normality assumption is violated. In the Q-Q plots for Coles and Woolworths some data points depart from normality, but these departures can be disregarded for that reason. The majority of the data points lie within the 95% confidence envelope.
Levene's test is conducted to assess homogeneity of variance, comparing its p-value with the standard 0.05 level. Here the p-value is 0.02, which is less than 0.05, so the null hypothesis of equal variances is rejected. This does not affect the choice of test: the paired-sample t-test works on the per-product price differences and does not assume equal variances in the two samples. A dependent sample assessment plot is used to visualise the paired Coles and Woolworths prices.
A 95% confidence level (significance level 0.05) is used for the t-test.
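For a paired test, the normality assumption applies to the per-product differences, not to each price column separately, so a Q-Q plot of the differences is a natural extra check. The following is a minimal base-R sketch on a hypothetical toy data frame (the prices are made up for illustration and are not the study data). In the report itself the `difference` column of `supermarket` would presumably take its place.

```r
# Hypothetical toy prices for illustration only (NOT the study data)
toy <- data.frame(
  Coles_price     = c(3.50, 4.80, 6.00, 2.20, 9.90, 5.00, 7.50, 4.00),
  Woolworth_price = c(3.60, 4.50, 6.30, 2.00, 9.00, 5.00, 7.20, 4.40)
)

# Per-product difference, the quantity the paired t-test analyses
d <- toy$Coles_price - toy$Woolworth_price

# Normal Q-Q plot of the differences with a reference line
qqnorm(d, main = "Q-Q Plot of price differences")
qqline(d)
```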
#normality
qqPlot(supermarket$Coles_price, dist="norm" , main = "Q-Q Plot of Coles")
## [1] 37 39
qqPlot(supermarket$Woolworth_price, dist="norm" , main ="Q-Q Plot of Woolworth")
## [1] 40 28
#Levene test
leveneTest(supermarket$Coles_price , supermarket$Woolworth_price)
## Warning in leveneTest.default(supermarket$Coles_price,
## supermarket$Woolworth_price): supermarket$Woolworth_price coerced to factor.
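The warning above arises because `leveneTest()` treats its second argument as a grouping factor, so the call as written may not be comparing the variances of the two price columns. A minimal sketch of the usual long-format call follows, again on a hypothetical toy data frame (not the study data). The names `toy`, `long`, `price` and `store` are introduced here for illustration.

```r
library(car)

# Hypothetical toy prices for illustration only (NOT the study data)
toy <- data.frame(
  Coles_price     = c(3.50, 4.80, 6.00, 2.20, 9.90, 5.00, 7.50, 4.00),
  Woolworth_price = c(3.60, 4.50, 6.30, 2.00, 9.00, 5.00, 7.20, 4.40)
)

# Stack the two price columns: one numeric response, one grouping factor
long <- data.frame(
  price = c(toy$Coles_price, toy$Woolworth_price),
  store = factor(rep(c("Coles", "Woolworths"), each = nrow(toy)))
)

# Levene's test of equal variances across the two supermarkets
lev <- leveneTest(price ~ store, data = long)
lev
```

Because the paired t-test analyses differences within each product, this check is informative about the two price distributions but is not a precondition for the test.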
#Paired sample t-test
t.test(supermarket$Coles_price , supermarket$Woolworth_price, paired = TRUE, alternative = "two.sided")
##
## Paired t-test
##
## data: supermarket$Coles_price and supermarket$Woolworth_price
## t = 0.97973, df = 80, p-value = 0.3302
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1377522 0.4049127
## sample estimates:
## mean of the differences
## 0.1335802
#Dependent Sample Assessment plot
granova.ds(
data.frame(supermarket$Coles_price, supermarket$Woolworth_price),
xlab = "Coles",
ylab = "Woolworths"
)
## Summary Stats
## n 81.000
## mean(x) 5.362
## mean(y) 5.229
## mean(D=x-y) 0.134
## SD(D) 1.227
## ES(D) 0.109
## r(x,y) 0.878
## r(x+y,d) 0.266
## LL 95%CI -0.138
## UL 95%CI 0.405
## t(D-bar) 0.980
## df.t 80.000
## pval.t 0.330
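The reported test statistic can be reproduced by hand from the summary figures above, which makes the link between the mean difference, its standard error and the confidence interval explicit. This sketch uses the printed values (n = 81, mean difference 0.1335802, SD(D) = 1.227); because SD(D) is rounded, the results should agree with the output only to about three decimal places.

```r
# Values taken from the outputs printed above (SD is rounded)
n      <- 81
mean_d <- 0.1335802  # mean of the differences (Coles - Woolworths)
sd_d   <- 1.227      # SD(D) from the dependent sample assessment summary

se_d   <- sd_d / sqrt(n)                     # standard error of the mean difference
t_stat <- mean_d / se_d                      # paired t statistic
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value
ci     <- mean_d + c(-1, 1) * qt(0.975, df = n - 1) * se_d  # 95% CI
es_d   <- mean_d / sd_d                      # standardised effect size, ES(D)

round(c(t = t_stat, p = p_val, lower = ci[1], upper = ci[2], ES = es_d), 3)
```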
Based on the hypothesis test performed, the following has been interpreted:
Hence, the prices at Coles and Woolworths do not show a statistically significant difference.
From the summary statistics, the mean price at Coles is slightly greater than at Woolworths. Based on the p-value and the 95% CI, a statistically significant difference between the mean prices at Coles and Woolworths could not be established: p = 0.3302, 95% CI [-0.138, 0.405], which contains 0. The data was collected for a sample of 81 products. The dataset could be improved by collecting more data, which would decrease the sampling error and produce more precise conclusions. Woolworths appears slightly cheaper based on the plots, but the difference is not statistically significant.
Strengths of the investigation:
The limitation of the investigation:
A future investigation could be improved by collecting more data from each department of the supermarkets, which would improve accuracy and provide a more reliable outcome. Moreover, the data-gathering process could be automated using a data scraper that collects the data from the website and saves it as an Excel file. This would save time.
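To give a rough sense of how much more data would be needed, the sketch below uses `power.t.test()` from base R. It rests on two loud assumptions that are not part of the original analysis: that the true mean difference and the standard deviation of the differences equal the observed values (0.1335802 and 1.227), and that 80% power is the target.

```r
# Assumptions (not from the original analysis):
#   - true mean difference equals the observed 0.1335802
#   - true SD of the differences equals the observed 1.227
#   - target power of 0.80 at the 0.05 significance level
res <- power.t.test(
  delta       = 0.1335802,
  sd          = 1.227,
  sig.level   = 0.05,
  power       = 0.80,
  type        = "paired",
  alternative = "two.sided"
)

# Number of matched product pairs required under these assumptions
n_needed <- ceiling(res$n)
n_needed
```

Under these assumptions the required number of matched pairs runs to several hundred, which suggests that manual collection would be impractical and supports the case for automated scraping.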