Group/Individual Details

Executive Statement

Objective: This report is written to analyse the price difference of the products of the two giant supermarkets(Coles and Woolworths)

Procedure: To begin with we assume that both are having same prices and one is not cheaper than the other (Null hypothesis). Data was collected manually from the website http://www.grocerycop.com.au/ between 5-May-2017 to 7-May-2017. With 35 products each from 9 product categories a total of 315 products were chosen from the website where product pictures, weight/cupsize/cup volume and Price dates matched.Categories are as follows:

  1. Baby/Health/Beauty
  2. Bakery
  3. Clothing/Household/Pet
  4. Drinks/Tobacco
  5. Entertainment/International food
  6. Pantry
  7. Freezer
  8. Fridge
  9. Fruits/Vegetables

During initial analysis we found that it would be better to drop home brand products and products on special to avoid biasness.

Variables: For this reporting, the following data was captured: a) product category, b) product price, c) price volume, d) Cup price, e) cup volume, f) Supermarket name, g) Price date.

Main Findings: As sample size is greater than 30 we can discard the normality issue and assume that the the data is normally distributed around the mean.

The \(p\)-value generated for the paired t.test with 95%CI was (0.001574) which shows that the test is statistically significant to reject the null hypothesis mentioned above i.e. Coles and Woolworths do not have same prices but one of them(woolworths) is cheaper than the other(coles). In other words the mean difference is not equal to zero which is our alternate hypothesis(\(H_A\)). The confidence interval[0.1643315 0.5115971] did not accomodate the null hypothesis(mean=0).

Load Packages and Data

library(dplyr)
library(readr)
library(magrittr)
library(lattice)
library(ggplot2)
library(Hmisc)
library(car)
library(granova)
library(rmarkdown)
#Loading data where price data of both stores are in two different column
Final_excel <- read_csv("D:/Intro to Statistics/Assignment3/Final_excel.csv")
#Loading data where price data of both stores are in same column ("Price")
SIngleprice_Column <- read_csv("D:/Intro to Statistics/Assignment3/SIngleprice_Column.csv")
#Finding difference between two superstores' prices
Final_excel <- Final_excel %>% mutate(Price_diff = (Price_Coles-Price_Woolworths))

New variable price_diff is created for price difference of the two supermakets.

Summary Statistics with all data samples

From the summary statistics, it has been found that on an average Woolworths products are $0.338 cheaper than Coles. Out of 9 product categories, Woolworths found cheaper in 6 categories. It is also noteworthy that SD is quite higher than the mean depicts the variability of the sample. Boxplot with product categories shows the distribution of products for both supermarkets

#Summary statistics of Prices by store
SIngleprice_Column %>% group_by (Store) %>% summarise(Min = min(Price.,na.rm = TRUE),
                                         Q1 = quantile(Price.,probs = .25,na.rm = TRUE),
                                         Median = median(Price., na.rm = TRUE),
                                         Q3 = quantile(Price.,probs = .75,na.rm = TRUE),
                                         Max = max(Price.,na.rm = TRUE),
                                         Mean = mean(Price., na.rm = TRUE),
                                         SD = sd(Price., na.rm = TRUE),
                                         sum=sum(Price., na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Price.)))
#Summary statistics of Prices by Product categories
SIngleprice_Column %>% group_by (Product_Category) %>% summarise(Min = min(Price.,na.rm = TRUE),
                                         Q1 = quantile(Price.,probs = .25,na.rm = TRUE),
                                         Median = median(Price., na.rm = TRUE),
                                         Q3 = quantile(Price.,probs = .75,na.rm = TRUE),
                                         Max = max(Price.,na.rm = TRUE),
                                         Mean = mean(Price., na.rm = TRUE),
                                         SD = sd(Price., na.rm = TRUE),
                                         sum=sum(Price., na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Price.)))
#Summary statistics of Prices_difference by Product categories
Final_excel %>% group_by (Product_Category) %>% summarise(Min = min(Price_diff,na.rm = TRUE),
                                         Q1 = quantile(Price_diff,probs = .25,na.rm = TRUE),
                                         Median = median(Price_diff, na.rm = TRUE),
                                         Q3 = quantile(Price_diff,probs = .75,na.rm = TRUE),
                                         Max = max(Price_diff,na.rm = TRUE),
                                         Mean = mean(Price_diff, na.rm = TRUE),
                                         SD = sd(Price_diff, na.rm = TRUE),
                                         sum=sum(Price_diff, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Price_diff)))
#Box plot of Price($) by Store with all 9 categories
SIngleprice_Column %>% boxplot(Price. ~ Store, data = ., xlab = "Price($)", ylab = "Store", col="Cyan", main = "Box plot of Price($) by Stores", horizontal=TRUE)

#Box plot of Price by category Where price data are put in single price column
SIngleprice_Column %>% boxplot(Price. ~ Product_Category_code, data = ., ylab = "Price($)", xlab = "Product Category", col="grey", main = "Box plot of Price($) by product categories")

Filter out high-priced product category

After reviewing the data, it is obvious that product category#9 “Drinks_Tobacco” has a very high price ranges from $1.8 to $110 with an average of $35 which is significantly different from other 8 categories. As overall mean price for both coles and woolworths is around $9, so the Drinks-Tobacco products are raising the overall mean considerably. To make a case with similar prices, it has been decided from the summary statistics to filterout the Drinks_Tobacco category from the whole sample. After doing the data filtering, we can observe that there are less outliers in our data set. We assume that coles is expensive, based on the mean value that we get but whether it is statistically significant that should be determined by Hypothesis test.

#Filtered data (Category!=9)
Final_data_filt <- Final_excel %>% filter(Product_Category_code!=9)
SIngleprice_Column_filt <- SIngleprice_Column %>% filter(Product_Category_code!=9)
#Box plot of Price($) by Store after filtering out product category'9'
SIngleprice_Column_filt %>% boxplot(Price. ~ Store, data = ., xlab = "Price($)", ylab = "Store", col="Cyan", main = "Box plot of Price($) by Stores", horizontal=TRUE)

Hypothesis Test

To become more sure about the price difference, hypothesis .It is noteworthy that as total sample is above 30, thus Welch Test has been chosen directly. As exactly same products were chosen to see the price difference between Coles and Woolworths, ‘Paired Sample t-test’ fits with the objective of the test by using 95% confidence interval, test was applied to check whether their mean value for the overall bucket prices are the same or not.

Formulate Hypothesis The population mean of these differences is denoted: \[\mu_\Delta\]

Null Hypothesis: \[H_0: \mu_\Delta = 0\]

Alternate Hypothesis: \[H_A: \mu_\Delta \neq 0\]

#Histogram of price_difference
Final_data_filt$Price_diff %>% histogram()

#paired sample tt-test using the t.test() function
t.test(Final_data_filt$Price_Coles, Final_data_filt$Price_Woolworths,
       paired = TRUE,
       alternative = "two.sided")

    Paired t-test

data:  Final_data_filt$Price_Coles and Final_data_filt$Price_Woolworths
t = 3.8316, df = 279, p-value = 0.0001574
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.1643315 0.5115971
sample estimates:
mean of the differences 
              0.3379643 

Through the histogram of price difference we observe that there is a slight trend of left skewed. And we verified the assumption by using the paird sample t-test. A paired-samples T-Test results with 280 sample data from both stores shows that the mean difference between Coles and Woolworths bucket is $0.338. Test results also gives the t value as 3.8316 (with the degree of freedom=279) where \(p\)value (0.0001574) is < 0.05 with 95% CI [0.1643315, 0.5115971]. As \(P\)value<0.05 and null hypothesized value is not captured within the 95% CI values, thus null hypothesis is rejected.

Additional analysis:

  • In addition to the Paired sample t-test for the overall bucket for both stores, product category wise test was performed to see comparison at product category level. To support this the analysis is below: #### T-test result by product categories
result=rep(0,8)
 
  for(i in 1:8){
  a=Final_data_filt$Price_diff[(i*35-34):(i*35)]
  b=t.test(a,mu=0,alternative = "two.sided")
 
  if(b$p.value<0.05)
   result[i]=b$estimate
  else
   result[i]=0}
result=data.frame(result)
row.names(result)=c("Fridge","Bakery","Fruit_Vegetables","Pantry","Freezer","Baby_Health_Beauty","Entertainment_Internationalfoods","Clothing_HH_Pet")
result

The above results show that out of the 8 categories, Coles is expensive in three categories (Fridge, Pantry, Baby_Health_Beauty), while there is no significant difference in the remaining 5 categories i.e. a) Bakery, b) Fruit-Veg, c) Freezer, d) Entertainment-International foods e) Closthing_HH-Pet.

Discussion

Further Improvement areas

  • We conclude that the two supermarkets do not have the same prices for similar product range and coles is slightly expensive than woolworths overall.

  • The limitations here was we did not take into consideration products of home brand and products on special in our analysis,but the strength remains that we could compare them in an unbiased way by dropping home brand products and concentrated more on daily consumables rather than less used products (TV,Mobile Phones,Headphones,DVD etc).

  • We did the analysis from customer perspective to find out which is cheaper in both the supermarkets,but if we were to analyse the data from supermarkets perspective to find how they can benefit from studying the competitors price range and cover more customer base in future by being competitive in prices in certain categories and leading the supermarket scoreborad.

  • These price ranges can vary with time so that constraint needs to be kept in mind always for future analysis as prices keep fluctuating on daily basis.

---
title: "MATH1324 Assignment 3"
subtitle: "Supermarket Price Wars"
output: html_notebook
---

# Group/Individual Details


* Amrin Shaikh   (s3654729)


# Executive Statement 


**Objective:** 
This report is written to analyse the price difference of the products of the two giant supermarkets(Coles and Woolworths)   

**Procedure:**
To begin with we assume that both are having same prices and one is not cheaper than the other (Null hypothesis). Data was collected manually from the website <http://www.grocerycop.com.au/> between 5-May-2017 to 7-May-2017.
With 35 products each from 9 product categories a total of 315 products were chosen from the website where product pictures, weight/cupsize/cup volume and Price dates matched.Categories are as follows:


1.	Baby/Health/Beauty
2.	Bakery
3.	Clothing/Household/Pet
4.	Drinks/Tobacco
5.	Entertainment/International food
6.	Pantry
7.	Freezer
8.	Fridge
9.	Fruits/Vegetables

During initial analysis we found that it would be better to drop home brand products and products on special to avoid biasness.

**Variables:**
For this reporting, the following data was captured: a) product category, b) product price, c) price volume, d) Cup price, e) cup volume, f) Supermarket name, g) Price date. 

**Main Findings:**
As sample size is greater than 30 we can discard the normality issue and assume that the the data is normally distributed around the mean.

  + Summary statistics tells that mean price difference between the Coles and Woolworths is minimal **$0.338** and overall bucket price were *$2897* and *$2804* respectively. 
Out of 9 product categories, *Drinks_Tobacco* has a very high price ranges from *$1.8* to *$110* with an average of **$35** which is significantly different from other 8 categories and thus we decided to filter out this category from the main analysis. Total sample was thus 280.

  + Though descriptive statistics says Woolworths is cheaper, but without statistical test, it cant be proved. Based on data nature, **Paired sample t-test** was found to be appropriate for the hypothesis test as it tries to calculate the difference in mean for dependent sample. 
  
The $p$-value generated for the paired t.test with 95%CI was (0.001574) which shows that the test is statistically significant to reject the null hypothesis mentioned above i.e. Coles and Woolworths do not have same prices but one of them(woolworths) is cheaper than the other(coles). In other words the mean difference is not equal to zero which is our alternate hypothesis($H_A$). The confidence interval[0.1643315 0.5115971] did not accomodate the null hypothesis(mean=0).


# Load Packages and Data

```{r message=FALSE, warning=FALSE}
library(dplyr)
library(readr)
library(magrittr)
library(lattice)
library(ggplot2)
library(Hmisc)
library(car)
library(granova)
library(rmarkdown)


#Loading data where price data of both stores are in two different column
Final_excel <- read_csv("D:/Intro to Statistics/Assignment3/Final_excel.csv")

#Loading data where price data of both stores are in same column ("Price")
SIngleprice_Column <- read_csv("D:/Intro to Statistics/Assignment3/SIngleprice_Column.csv")

#Finding difference between two superstores' prices
Final_excel <- Final_excel %>% mutate(Price_diff = (Price_Coles-Price_Woolworths))
```
New variable price_diff is created for price difference of the two supermakets.

# Summary Statistics with all data samples
From the summary statistics, it has been found that on an average Woolworths products are **$0.338** cheaper than Coles. Out of 9 product categories, Woolworths found cheaper in 6 categories. It is also noteworthy that SD is quite higher than the mean depicts the variability of the sample. 
Boxplot with product categories shows the distribution of products for both supermarkets   

```{r}
#Summary statistics of Prices by store
SIngleprice_Column %>% group_by (Store) %>% summarise(Min = min(Price.,na.rm = TRUE),
                                         Q1 = quantile(Price.,probs = .25,na.rm = TRUE),
                                         Median = median(Price., na.rm = TRUE),
                                         Q3 = quantile(Price.,probs = .75,na.rm = TRUE),
                                         Max = max(Price.,na.rm = TRUE),
                                         Mean = mean(Price., na.rm = TRUE),
                                         SD = sd(Price., na.rm = TRUE),
                                         sum=sum(Price., na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Price.)))

#Summary statistics of Prices by Product categories
SIngleprice_Column %>% group_by (Product_Category) %>% summarise(Min = min(Price.,na.rm = TRUE),
                                         Q1 = quantile(Price.,probs = .25,na.rm = TRUE),
                                         Median = median(Price., na.rm = TRUE),
                                         Q3 = quantile(Price.,probs = .75,na.rm = TRUE),
                                         Max = max(Price.,na.rm = TRUE),
                                         Mean = mean(Price., na.rm = TRUE),
                                         SD = sd(Price., na.rm = TRUE),
                                         sum=sum(Price., na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Price.)))

#Summary statistics of Prices_difference by Product categories
Final_excel %>% group_by (Product_Category) %>% summarise(Min = min(Price_diff,na.rm = TRUE),
                                         Q1 = quantile(Price_diff,probs = .25,na.rm = TRUE),
                                         Median = median(Price_diff, na.rm = TRUE),
                                         Q3 = quantile(Price_diff,probs = .75,na.rm = TRUE),
                                         Max = max(Price_diff,na.rm = TRUE),
                                         Mean = mean(Price_diff, na.rm = TRUE),
                                         SD = sd(Price_diff, na.rm = TRUE),
                                         sum=sum(Price_diff, na.rm = TRUE),
                                         n = n(),
                                         Missing = sum(is.na(Price_diff)))

#Box plot of Price($) by Store with all 9 categories
SIngleprice_Column %>% boxplot(Price. ~ Store, data = ., xlab = "Price($)", ylab = "Store", col="Cyan", main = "Box plot of Price($) by Stores", horizontal=TRUE)

#Box plot of Price by category Where price data are put in single price column
SIngleprice_Column %>% boxplot(Price. ~ Product_Category_code, data = ., ylab = "Price($)", xlab = "Product Category", col="grey", main = "Box plot of Price($) by product categories")
```

### Filter out high-priced product category
After reviewing the data, it is obvious that product category#9 "Drinks_Tobacco" has a very high price ranges from **$1.8** to **$110** with an average of **$35** which is significantly different from other 8 categories. As overall mean price for both coles and woolworths is around **$9**, so the Drinks-Tobacco products are raising the overall mean considerably. To make a case with similar prices, it has been decided from the summary statistics to filterout the Drinks_Tobacco category from the whole sample. After doing the data filtering, we can observe that there are less outliers in our data set. We assume that coles is expensive, based on the mean value that we get but whether it is statistically significant that should be determined by Hypothesis test.



```{r}

#Filtered data (Category!=9)
Final_data_filt <- Final_excel %>% filter(Product_Category_code!=9)
SIngleprice_Column_filt <- SIngleprice_Column %>% filter(Product_Category_code!=9)

#Box plot of Price($) by Store after filtering out product category'9'
SIngleprice_Column_filt %>% boxplot(Price. ~ Store, data = ., xlab = "Price($)", ylab = "Store", col="Cyan", main = "Box plot of Price($) by Stores", horizontal=TRUE)
```




## Hypothesis Test

To become more sure about the price difference, hypothesis .It is noteworthy that as total sample is  above 30, thus Welch Test has been chosen directly. As exactly same products were chosen to see the price  difference between Coles and Woolworths, **'Paired Sample t-test'** fits with the objective of the test by using **95% confidence interval**, test was applied to check whether their mean value for the overall bucket prices are the same or not. 

**Formulate Hypothesis**
The population mean of these differences is denoted:
                              $$\mu_\Delta$$




Null Hypothesis:           $$H_0: \mu_\Delta = 0$$




Alternate Hypothesis:      $$H_A: \mu_\Delta \neq 0$$

 
```{r}
#Histogram of price_difference
Final_data_filt$Price_diff %>% histogram()

#paired sample tt-test using the t.test() function
t.test(Final_data_filt$Price_Coles, Final_data_filt$Price_Woolworths,
       paired = TRUE,
       alternative = "two.sided")

```
Through the histogram of price difference we observe that there is a slight trend of left skewed. And we verified the assumption by using the paird sample t-test. 
A paired-samples T-Test results with 280 sample data from both stores shows that the mean difference between Coles and Woolworths bucket is **$0.338**. Test results also gives the t value as 3.8316 (with the degree of freedom=279) where $p$value **(0.0001574)** is < 0.05 with 95% CI [0.1643315, 0.5115971]. As $P$value<0.05 and null hypothesized value is not captured within the 95% CI values, thus null hypothesis is rejected.

 **Additional analysis:**
  
  + In addition to the Paired sample t-test for the overall bucket for both stores, product category wise test was performed to see comparison at product category level. 
To support this the analysis is below:
#### T-test result by product categories

```{r}
result=rep(0,8)
 
  for(i in 1:8){
  a=Final_data_filt$Price_diff[(i*35-34):(i*35)]
  b=t.test(a,mu=0,alternative = "two.sided")
 
  if(b$p.value<0.05)
   result[i]=b$estimate
  else
   result[i]=0}

result=data.frame(result)
row.names(result)=c("Fridge","Bakery","Fruit_Vegetables","Pantry","Freezer","Baby_Health_Beauty","Entertainment_Internationalfoods","Clothing_HH_Pet")
result
```

The above results show that out of the 8 categories, **Coles is expensive in three categories** (Fridge, Pantry, Baby_Health_Beauty), while there is no significant difference in the remaining 5 categories i.e. a) Bakery, b) Fruit-Veg, c) Freezer, d) Entertainment-International foods  e) Closthing_HH-Pet.
 

## Discussion 
#### Further Improvement areas

  + We conclude that the two supermarkets do not have the same prices for similar product range and coles is slightly expensive than woolworths overall.

  + The limitations here was we did not take into consideration products of home brand and products on special in our analysis,but the strength remains that we could compare them in an unbiased way by dropping home brand products and concentrated more on daily consumables rather than less used products (TV,Mobile Phones,Headphones,DVD etc).

  + We did the analysis from customer perspective to find out which is cheaper in both the supermarkets,but if we were to analyse the data from supermarkets perspective to find how they can benefit from studying the competitors price range and cover more customer base in future by being competitive in prices in certain categories and leading the supermarket scoreborad.

  + These price ranges can vary with time so that constraint needs to be kept in mind always for future analysis as prices keep fluctuating on daily basis.


