MATH1324 Assignment 2

Student Detail

Divya Ulaganathan (s3759465)

Executive Statement

Two of the Australia’s largest supermarket chains, Coles and Woolworths are highly reputed for their budget-friendly low cost products. The aim of this investigation is to find the supermarket that offers the cheapest price by analyzing random samples of matched product prices from Coles and Woolworths. To guide the investigation, a random sample of 164 matched products and their respective prices(AUD) were collected from their websites (https://www.coles.com.au/ and https://www.woolworths.com.au/). In order to collect a representative sample, stratified sampling technique was used by segregating the products based on its perishability factor. The sample dataset contains products of various categories like bakery, cosmetics, cleaning supplies, fruits, vegetables, etc., price from coles, price from woolworths, perishability and data collection date. For non-perishable products, only respective home-brand was considered. Branded products were excluded as certain brands were not available on either of the supermarkets and in addition discounts were neglected. Data was collected using stratified sampling technique. The fore mentioned constraints were implemented at the data collection process. Outliers were detected using boxplots and then they were removed. Two samples of prices were initially compared by using the summary statistics. Then the relevant hypothesis was constructed and before testing the hypothesis, the assumption on normality the two samples were tested. Normality test was done using the Q-Q plot. Since the two samples were dependent, the paired-sample t-test was used to test the hypothesis on mean price differences between the supermarkets. The results from the investigation, failed to provide a statistically significant evidence that either of the super market is cheaper than the other.

Load Packages and Data

# loading the necessary data and packages required to reproduce the report

library(readxl)

## Warning: package 'readxl' was built under R version 3.5.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.5.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(magrittr)

## Warning: package 'magrittr' was built under R version 3.5.3

library(pastecs)

## Warning: package 'pastecs' was built under R version 3.5.3

## 
## Attaching package: 'pastecs'

## The following object is masked from 'package:magrittr':
## 
##     extract

## The following objects are masked from 'package:dplyr':
## 
##     first, last

library(car)

## Warning: package 'car' was built under R version 3.5.3

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

library(granova)

## Warning: package 'granova' was built under R version 3.5.3

#data import
price_wars <- read_excel("price_wars.xlsx")
View(price_wars)

#Assigning factors
price_wars$Perishability <- price_wars$Perishability %>% factor(levels = c('p','np'), labels=c("Perishable","Non-perishable"))

price_wars$Category <- price_wars$Category %>% as.factor()

price_wars$`Unit of measure` <- price_wars$`Unit of measure` %>% as.factor()

Identyfying and removing outliers

#Visualisation - Boxplot to identify the existence outliers
boxplot(price_wars$Coles, price_wars$Woolworths, ylab = "Prices (AUD)", col=c("darkred","darkgreen"), main= "Price comparison between supermarkets - with Outliers", names = c("Coles","Woolworths"))

Based on the above boxplot, some outliers which are significantly far from the whiskers were observed. Hence the tuples relevent to those values are removed in the upcoming code segment.

#Remove outliers
boxplot_Prices <- boxplot(price_wars$Coles, price_wars$Woolworths, plot=FALSE)
boxplot_Prices

## $stats
##        [,1]   [,2]
## [1,]  1.000  0.990
## [2,]  6.000  6.525
## [3,]  8.395  8.500
## [4,] 11.550 12.000
## [5,] 19.500 20.000
## 
## $n
## [1] 164 164
## 
## $conf
##          [,1]     [,2]
## [1,] 7.710256 7.824509
## [2,] 9.079744 9.175491
## 
## $out
## [1] 22.000 21.000 26.000 25.000 22.250 20.625 25.000 26.000
## 
## $group
## [1] 1 1 1 1 1 1 2 2
## 
## $names
## [1] "1" "2"

# Creating a filter matrix to identify and remove the outliers
filter_matrix <- data.frame(group = boxplot_Prices$group, outliers= boxplot_Prices$out)
filter_matrix

# Filtering out the outliers
prices_filt <- price_wars %>% filter(!(Coles %in% filter_matrix$outliers) & !(Woolworths %in% filter_matrix$outliers))
# Recheck the boxplot
boxplot(prices_filt$Coles, prices_filt$Woolworths,col=c("lightcoral","lightgreen"), ylab="Prices (AUD)", main = "Price comparison between supermarkets - outlier removed", names = c("Coles","Woolworths"))

The mean difference visualization using a scatter plot:

granova.ds(
  data.frame(prices_filt$Coles, prices_filt$Woolworths),
  xlab = "Coles prices",
  ylab = "Woolworths prices",
  main = "Mean difference between Coles and Woolworths products"
  )

##             Summary Stats
## n                 158.000
## mean(x)             8.976
## mean(y)             9.177
## mean(D=x-y)        -0.200
## SD(D)               1.971
## ES(D)              -0.102
## r(x,y)              0.879
## r(x+y,d)           -0.091
## LL 95%CI           -0.510
## UL 95%CI            0.109
## t(D-bar)           -1.277
## df.t              157.000
## pval.t              0.203

These data concern prices for 158 samples from Coles and Woolworths supermarkets. The Y scores shows the prices of Woolworths and X scores show prices of Coles. Conventional analysis yields a t-statistic equal to -1.277, suggesting that prices are significantly less in Coles. Indeed, the standardized effect size of .57 is moderate-to-large by conventional standards. The cluster of points at the extreme left shows that price difference from Woolworths samples are more carrying than Coles. There is a high correlation (r = 0.879) between Coles and Woolworths prices is related to the relatively small variance of the differences.

Summary Statistics

#Summary - Prices of Coles
prices_filt %>% summarise(Min = min(Coles, na.rm=TRUE), Q1 = quantile(Coles, probs = .25, na.rm=TRUE), Median = median(Coles, na.rm=TRUE), Q3 = quantile(Coles, probs = .75, na.rm=TRUE), IQR = quantile(Coles, probs = .75, na.rm=TRUE)-quantile(Coles, probs = .25, na.rm=TRUE), Max = max(Coles, na.rm=TRUE), Mean = mean(Coles, na.rm=TRUE), SD = sd(Coles, na.rm=TRUE), n= n())

#Summary - Prices of Woolworths
prices_filt %>% summarise(Min = min(Woolworths, na.rm=TRUE), Q1 = quantile(Woolworths, probs = .25, na.rm=TRUE), Median = median(Woolworths, na.rm=TRUE), Q3 = quantile(Woolworths, probs = .75, na.rm=TRUE), IQR = quantile(Woolworths, probs = .75, na.rm=TRUE)-quantile(Woolworths, probs = .25, na.rm=TRUE), Max = max(Woolworths, na.rm=TRUE), Mean = mean(Woolworths, na.rm=TRUE), SD = sd(Woolworths, na.rm=TRUE), n=n())

The above summary statistics tables don’t depict a significant difference in the price distributions of the two supermarkets. There is a slight difference in average prices where Woolworths having the higher average.

# Creating the price difference column
prices_filt <- prices_filt %>% mutate(d= Coles - Woolworths)

# Summary - Price Differences
prices_filt %>% summarise(Min = min(d, na.rm=TRUE), Q1 = quantile(d, probs = .25, na.rm=TRUE), Median = median(d, na.rm=TRUE), Q3 = quantile(d, probs = .75, na.rm=TRUE), IQR = quantile(d, probs = .75, na.rm=TRUE)-quantile(d, probs = .25, na.rm=TRUE), Max = max(d, na.rm=TRUE), Mean = mean(d, na.rm=TRUE), SD = sd(d, na.rm=TRUE), n= n())

The mean difference is very low, which is less than a dollar. The minimum difference is nearly 10 AUD which is significant.

Hypothesis Test

\(H_0 \implies \mu_\bigtriangleup\) = 0 ; \(H_A \implies \mu_\bigtriangleup \neq\) 0

Where \(\mu_\bigtriangleup\) = population mean of the price differences in prices of Coles and Woolworths products. (i.e. \(\mu_\bigtriangleup\) = \(\mu_c\)-\(\mu_w\) )

i.e. \(H_0\): There is no price difference in the two supermarkets. \(H_A\): There is a price difference in the two supermarkets. The paired samples t- test was used to compare the difference between the mean prices of Coles and Woolworths. To elaborate the analysis three subsets of the sample were considered, all the products in the sample (sample size=158), perishable products (sample size=103) and non-perishable products (sample size=55).

Assumptions for the paired sample samples t-test: 1. The paired difference follow a normal distribution or large samples were used. Before conducting the hypothesis test the above assumption was verified by the following code segments.

Testing assumption of Normality using Q-Q plots

#Testing the Assumption of Normality:
#qqPlot for the price differences of all products
prices_filt$d %>% qqPlot(dist='norm',main='Q-Q plot price differences of all products')

## [1] 157  43

#qqPlot for price difference of perishable products
perishable_products <- prices_filt %>% filter(Perishability == "Perishable")
perishable_products$d %>% qqPlot(dist='norm',main='Q-Q plot for price differences of perishable products')

## [1] 102  36

#qqPlot for price difference of non-perishable products
nonPerishable_products<- prices_filt %>% filter(Perishability == "Non-perishable")
nonPerishable_products$d %>% qqPlot(dist='norm',main='Q-Q plot for price differences of non-perishable products')

## [1] 13 30

The cumulative price differences distrubution of the two supermarkets are not normal according to the above three Q-Q plots. Since we have a sufficiently large samples of size of greater than 30, we can still assume normality based on the Central Limit theorem. Now we can proceed to the two-sample hypothesis test for the above mentioned three sets, with 95% confidence interval.

Hypothesis test 1: cumulative prices difference Hypothesis test 2: perishable products price difference Hypothesis test 3: non-perishable products price difference

#Hypothesis test 1: cumulative prices difference
t.test(prices_filt$Coles, prices_filt$Woolworths, paired = TRUE, alternative = "two.sided",conf.level = .95)

## 
##  Paired t-test
## 
## data:  prices_filt$Coles and prices_filt$Woolworths
## t = -1.2772, df = 157, p-value = 0.2034
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.5099514  0.1094451
## sample estimates:
## mean of the differences 
##              -0.2002532

#Hypothesis test 2: perishable products price difference
t.test(perishable_products$Coles, perishable_products$Woolworths, paired = TRUE, alternative = "two.sided",conf.level = .95)

## 
##  Paired t-test
## 
## data:  perishable_products$Coles and perishable_products$Woolworths
## t = -1.0999, df = 102, p-value = 0.274
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6409501  0.1836686
## sample estimates:
## mean of the differences 
##              -0.2286408

#Hypothesis test 3: non-perishable products price difference
t.test(nonPerishable_products$Coles, nonPerishable_products$Woolworths, paired = TRUE, alternative = "two.sided",conf.level = .95)

## 
##  Paired t-test
## 
## data:  nonPerishable_products$Coles and nonPerishable_products$Woolworths
## t = -0.64255, df = 54, p-value = 0.5232
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -0.6060449  0.3118631
## sample estimates:
## mean of the differences 
##              -0.1470909

Interpretation

Based on the paired sample t-test, the p-value is greater than 0.05 in the three hypothesis tests. The mean price difference of all products is -0.204 which is captured by the confidence interval [ -0.5136765 0.1049423]. While testing the perishable and non-perishable product prices individually between the supermarkets, the p-value reamined to be greater than 0.05. The mean price difference of perishable products -0.229 is captured by the confidence interval[-0.6409501 , 0.1836686] and the mean price difference of non-perishable product price -0.159 is captured by the confidence interval[-0.6156959, 0.2978777].

Hence we fail to reject the null hypothesis that there is no significant difference in prices for both Coles and Woolworths. The results of the test doesn’t provide required evidence to conclude either of the supermarket is cheap than the other.

Based on the paired two sample t-test, the p-value is greater than 0.05 for all the product prices.

The mean price difference is -0.200 as the p > .05 which is captured by the 95% confidence interval [-0.5099514 , 0.1094451] for cumulative price difference.So, we fail to reject \(H_0\) of hypothesis 1. While testing the perishable and non-perishable product prices individually between the supermarkets, the p-value reamined to be greater than 0.05. The mean price difference of perishable products -0.229 is captured by the confidence interval[ -0.6409501, 0.1836686] and the mean price difference of non-perishable product price -0.147 is captured by the confidence interval[-0.5099514, 0.1094451]. Hence, we fail to reject \(H_0\) of hypothesis 2 and 3 as well. Thus,we fail to reject the null hypothesis \(H_0\) as there is no significant difference in prices for both Coles and Woolworths. There isint any significant difference between the prices of perishable and non perishable products of coles and woolworths to determine the cheap supermarket for perishable or non perishable products. The results of the test doesn’t provide required evidence to conclude either of the supermarket is cheap than the other.

Discussion

The prices in Coles and Woolworths based on their online prices for Melbourne region was examined, and data was categorize according to their perishability factor during the data collection. Products that are manufactured by the respective home brands were considered. The dataset proved to be a major strength as we had a mixed set of products from different categories of a large sample size (164). The paired samples t-test was used to identify whether there is a price difference in the products of the two supermarkets, by testing the mean price difference between Coles and Woolworths. Although the q-q plot doesn’t support the normality assumption of any of the three sets which we conducted the hypothesis test, the central limit theorem ensured that the t-test could be applied due to the large sample size in each group. Paired samples t-test was conducted for the three hypothesis tests and the results didn’t provide statistically significant evidence to reject the null hypothesis, which is there is no difference in the prices of the two supermarkets. Although a slight variance was observed in prices of individual products during data collection, with the above analysis it can be concluded that the pricing strategies between the two supermarkets were decently balanced. In future, we could further investigate by examining the store specific offers and online offers on Australia and extend the dataset by considering the common products from different brands other than home brands.

References

[1]https://www.woolworths.com.au/ [2]https://www.coles.com.au/ [3]class notes and demo reference. [4] https://astral-theory-157510.appspot.com/secured/MATH1324_Module_05.html [5] https://astral-theory-157510.appspot.com/secured/MATH1324_Module_06.html [6] https://astral-theory-157510.appspot.com/secured/MATH1324_Module_07.html