Introduction to Statistics : Assignment III

Objective:
Objective - Detail:
Hypothesis:
Data Collection:
Data URL
Data Pre-Processing
Hypothesis Testing
Conclusion
Limitations
Executive Summary

Objective:

The objective of this assignment is to investigate which major supermarket, Coles or Woolworths, is cheaper.

Objective - Detail:

To achieve this objective, we collect sample data from both supermarkets and then perform a paired-sample t-test to see whether the observed price difference is statistically significant.

Hypothesis:

We start the experiment assuming that there is no price difference between the two supermarkets when enough products are sampled.

We collect a random sample of the same products from both supermarkets and frame the problem statement as a null and an alternative hypothesis.

Formally, we define the null and alternative hypotheses as below.

Null Hypothesis: There is no price difference between Coles and Woolworths, $H_0: \mu_\Delta = 0$

Alternative Hypothesis: There is a significant price difference between Coles and Woolworths, $H_A: \mu_\Delta \neq 0$, where $\mu_\Delta$ is the population mean of the per-product price difference (Coles minus Woolworths).
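For reference, the paired test works on the per-product differences rather than on the two price columns separately. Writing $d_i$ for the Coles price minus the Woolworths price of product $i$, and using $\bar{d}$, $s_d$ and $n$ (shorthand introduced here, not taken from the original analysis) for the sample mean of the differences, their sample standard deviation and the number of pairs, the standard paired t statistic is:

$$
t = \frac{\bar{d} - 0}{s_d / \sqrt{n}}, \qquad \text{df} = n - 1
$$

Under $H_0$ this statistic follows a t distribution with $n - 1$ degrees of freedom, provided the mean difference is approximately normally distributed.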

Data Collection:

We picked a variety of products across all categories: Fridge, Bakery, Fruit & Vegetables, Pantry, Freezer, Baby-Health-Beauty, Meat - Seafood, Entertainment, Intl Food, Clothing, Household and Pets, Drinks and Tobacco. In collecting these samples we also tried to avoid surveyor bias by picking no more than a handful of items from a single category.

With the data collected, we run a paired-sample t-test to check whether there is a statistically significant price difference and, if so, which major supermarket is cheaper.

Data URL

Data Source: Grocery Cop
Data Owner: Grocery Cop Limited

Data Pre-Processing

Setting Global Parameters

This code snippet sets the global chunk options for the R Markdown file.

knitr::opts_chunk$set(fig.width=15, fig.height=6, fig.align = "center",warning=FALSE, message=FALSE)

Loading libraries and setting working directory path

As the next step, we load all the required libraries and then set the colour palette and the input and output paths.

library(dplyr)
library(ggplot2)
library(RColorBrewer)
library(knitr)
library(car)
library(granova)
library(lattice)
colors = brewer.pal(8, "Dark2")

data_path = "."
out_path = "."

Read and combine all the data sets into a single Data Frame

Since all the team members collected data, we collated it offline into a single CSV file. Here we load that file into a data frame.

# read.csv expects comma-separated fields and "." as the decimal mark
sp_mkt_data = read.csv(file=file.path(data_path,"prices.csv"),
                       header=TRUE, blank.lines.skip=TRUE)

# change columns to numeric
sp_mkt_data$Coles = as.numeric(as.character(sp_mkt_data$Coles))
sp_mkt_data$Woolworths = as.numeric(as.character(sp_mkt_data$Woolworths))

# rename columns
colnames(sp_mkt_data) = c("Prod_Name","Coles_Price","Woolworths_Price","ProdCategory")

#create a column for price difference
sp_mkt_data$Price_Diff = sp_mkt_data$Coles_Price - sp_mkt_data$Woolworths_Price

Removing Outliers and Summarising Data

We also need to check whether there are any outliers in the data set.

# define margins
par(mar=c(5,11,5,5))
# draw box plots
sp_mkt_data %>% boxplot(Price_Diff ~ ProdCategory, data=.,
                main = "Price Difference Distribution in Products from Coles and WW",
                xlab = "$ Difference",
                col = brewer.pal(7, name = "RdBu"),
                las=2,
                horizontal = T)
# show guiding line for 0
abline(v=0,col="red")
grid(nx=16,ny=10)

The raw data contain a few outliers with a price difference of more than AUD 5 in either direction. Upon investigation, these appear to be special offers, so we filter out any product whose price difference exceeds $5 in absolute value.

# filtering data
sp_mkt_data_fil = sp_mkt_data %>% filter(Price_Diff >= -5 & Price_Diff <= 5)
# define margins
par(mar=c(5,11,5,5))
# draw box plots
sp_mkt_data_fil %>% boxplot(Price_Diff ~ ProdCategory, data=.,
                main = "Price Difference Distribution in Products from Coles and WW (filtered)",
                xlab = "$ Difference",
                col = brewer.pal(7, name = "RdBu"),
                las=2,
                horizontal = T)
# show guiding line for 0
abline(v=0,col="red")
grid(nx=16,ny=10)

With this filter in place we remove 17 observations from the raw data. The box plot indicates that the differences lean slightly towards the positive side, that is, Coles appears slightly more expensive.
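The number of observations dropped by a filter like this can be checked by comparing row counts before and after. The sketch below uses a small made-up data frame (`toy`, `toy_fil` and `n_removed` are illustrative names, not objects from the original analysis) and the same rule of keeping differences within $5 of zero.

```r
library(dplyr)

# Toy stand-in for sp_mkt_data (values are made up for illustration)
toy <- data.frame(
  Prod_Name  = c("A", "B", "C", "D"),
  Price_Diff = c(0.50, -6.20, 5.00, 7.10),
  stringsAsFactors = FALSE
)

# Keep rows whose absolute price difference is at most $5
toy_fil <- toy %>% filter(abs(Price_Diff) <= 5)

# Number of observations dropped by the filter
n_removed <- nrow(toy) - nrow(toy_fil)
n_removed
```

Applied to the real data, the same subtraction on `sp_mkt_data` and `sp_mkt_data_fil` would be expected to give the 17 removed observations reported above.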

As the next step, we compute some basic statistics for the filtered sample data.
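The chunk that builds `summary_table` is not shown in this report. One way such a table could be assembled with dplyr is sketched below on made-up data; the column names follow the table in this section, while `toy`, `by_category` and `overall` are illustrative names and the rounding to three decimals is an assumption.

```r
library(dplyr)

# Toy stand-in for sp_mkt_data_fil (values are made up for illustration)
toy <- data.frame(
  ProdCategory = c("Bakery", "Bakery", "Pantry", "Pantry", "Pantry"),
  Price_Diff   = c(0.50, -0.10, 1.00, 0.20, 0.60),
  stringsAsFactors = FALSE
)

# Per-category count, mean and standard deviation of the price difference
by_category <- toy %>%
  group_by(ProdCategory) %>%
  summarise(
    TotalProducts         = n(),
    Mean_Price_Difference = round(mean(Price_Diff), 3),
    StandardDeviation     = round(sd(Price_Diff), 3)
  )

# One overall row across all products
overall <- toy %>%
  summarise(
    ProdCategory          = "Total - All Products",
    TotalProducts         = n(),
    Mean_Price_Difference = round(mean(Price_Diff), 3),
    StandardDeviation     = round(sd(Price_Diff), 3)
  )

# Stack the per-category rows and the overall row
summary_table <- bind_rows(by_category, overall)
summary_table
```

Replacing `toy` with `sp_mkt_data_fil` would give a table of the same shape as the one displayed in this section.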

# Display the summary table
knitr::kable(summary_table, caption = "Summary of Data Samples collected")

Summary of Data Samples collected
ProdCategory	TotalProducts	Mean_Price_Difference	StandardDeviation
Baby, Health & Beauty	20	0.914	1.362
Bakery	40	0.260	1.393
Clothing, Household & Pet	19	0.165	1.794
Drinks & Tobacco	31	-0.307	1.933
Entertainment & Intl Food	28	0.316	1.286
Freezer	29	0.447	1.870
Fridge	39	0.191	1.685
Meat & Seafood	27	0.066	0.950
Pantry	36	0.934	1.933
Total - All Products	269	0.323	1.641

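Again, this summary table shows a positive overall mean price difference (Coles minus Woolworths) of about $0.32, which suggests that Coles is slightly more expensive than Woolworths. Drinks & Tobacco is the only category with a negative mean difference.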

Test for normality

We then plot the data to see whether the price differences fit the normal distribution.

# define margins
par(mar=c(5,11,5,5))

sp_mkt_data_fil$Price_Diff %>% qqPlot(distribution="norm",main="Test to see if Data fits Normal Distribution")

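As we can see, the price differences do not closely follow a normal distribution. However, by the Central Limit Theorem, when the sample size is large (a common rule of thumb is more than 30), the sampling distribution of the mean difference is approximately normal even if the individual differences are not. With 269 pairs, we therefore proceed with the t-test.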
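To illustrate the Central Limit Theorem argument, the simulation below draws repeated samples of size 269 from a deliberately skewed population and looks at how the sample means behave. The exponential population, the seed and all variable names are assumptions chosen for illustration; nothing here uses the supermarket data.

```r
set.seed(42)  # arbitrary seed so the sketch is repeatable

# Hypothetical, clearly non-normal (right-skewed) population
population <- rexp(10000, rate = 1)

# Draw many samples of size 269 and keep each sample mean
n <- 269
sample_means <- replicate(2000, mean(sample(population, n)))

# The sample means should centre on the population mean, with a spread
# close to sd(population) / sqrt(n)
c(mean_of_means = mean(sample_means), population_mean = mean(population))
c(sd_of_means = sd(sample_means), theoretical_se = sd(population) / sqrt(n))

# Q-Q plot of the sample means against the normal distribution
qqnorm(sample_means)
qqline(sample_means)
```

Even though the individual values are strongly skewed, the Q-Q plot of the sample means should lie close to the reference line, which is the property the t-test relies on.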

Hypothesis Testing

Now that we have cleaned up the data set, we perform a paired t-test to see whether this difference is statistically significant. We use a two-tailed test with a confidence level of 0.95 (significance level = 0.05).

# run the T-Test
sp_mkt_t_test =
  t.test(
  x= sp_mkt_data_fil$Coles_Price, 
  y= sp_mkt_data_fil$Woolworths_Price,
  paired = T,
  alternative = "two.sided",
  conf.level = 0.95
  )
# display the t-test results
sp_mkt_t_test

## 
##  Paired t-test
## 
## data:  sp_mkt_data_fil$Coles_Price and sp_mkt_data_fil$Woolworths_Price
## t = 3.2298, df = 268, p-value = 0.001393
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1261923 0.5202761
## sample estimates:
## mean of the differences 
##               0.3232342

Results interpretation: The mean price difference (Coles minus Woolworths) was $0.323.

The t statistic was 3.23 on 268 degrees of freedom.

The p-value from the t-test was 0.001 (0.001393 before rounding).

Since p = 0.001 < $\alpha$ = 0.05, we reject $H_0$. If $H_0: \mu_\Delta = 0$ were true, observing a sample mean difference of 0.323, or one more extreme, would be very unlikely. We therefore find statistical evidence to support $H_A$.

The 95% confidence interval for the mean difference was [0.126, 0.520]. This interval does not contain the hypothesised value $\mu_\Delta = 0$, which is consistent with rejecting $H_0$.
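As a rough cross-check, the reported figures can be approximately reproduced from the summary statistics alone. The sketch below assumes n = 269 pairs, the mean difference from the t-test output and the rounded standard deviation of 1.641 from the summary table; the variable names are illustrative. Because the standard deviation is rounded, the results should agree with the t-test output only to about two decimal places.

```r
# Summary statistics taken from the report
n     <- 269        # number of paired observations after filtering
d_bar <- 0.3232342  # mean of the differences (Coles minus Woolworths)
s_d   <- 1.641      # standard deviation of the differences (rounded)

# Standard error of the mean difference
se <- s_d / sqrt(n)

# Paired t statistic for H0: mean difference = 0
t_stat <- d_bar / se

# Two-sided p-value from the t distribution with n - 1 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 1)

# 95% confidence interval for the mean difference
ci <- d_bar + c(-1, 1) * qt(0.975, df = n - 1) * se

t_stat
p_value
ci
```

The values should land close to the reported t = 3.23, p = 0.001 and 95% CI [0.126, 0.520].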

Conclusion

A two-tailed, paired-sample t-test was used to determine whether the mean price difference between Coles and Woolworths differed significantly from the assumed value of zero. The 0.05 level of significance was used. The sample’s mean price difference (Coles minus Woolworths) was $0.323, SD = $1.641.

The paired-sample t-test found the mean price at Woolworths to be statistically significantly lower than the mean price at Coles, t(268) = 3.23, p = 0.001, 95% CI [0.126, 0.520].

Limitations

We compared only identical products in identical pack sizes across the two supermarkets, so products that differed even slightly in volume were excluded.
We also could not compare home-brand products, because their volumes differed slightly between the two supermarkets and because they might not come from the same supplier or region.

Executive Summary

The objective of this assignment was to investigate which major supermarket, Coles or Woolworths, is cheaper. We started from the assumption that there is no price difference between the two. To test this hypothesis we collected prices for 286 products across 9 product categories (Baby, Health & Beauty; Drinks & Tobacco; Clothing, Household & Pet; Meat & Seafood; Entertainment & Intl Food; Freezer; Fridge; Bakery; Pantry) and, after removing 17 outliers, ran a paired t-test on the remaining 269. The results showed that Coles was more expensive than Woolworths for the products we sampled, by about $0.32 per product on average.