Executive Statement

An investigation is done to identify which Australian supermarket among coles and Woolworths sells products at a cheaper price. Data containing details of products(ProductName, Prices, Category)from both stores were scrapped online, a sample of approximately 50 products were taken from each category namely ‘Clothing, Household & Pet’,‘Pantry’,‘Freezer’,‘Baby, Health & Beauty’,‘Fridge’,‘Bakery’,‘Meat & Seafood’ and ‘Drinks & Tobacco’.The dataset was checked for different anomalies and the dataset was cleaned and made efficient for analysis.The final cleaned dataset named ‘clean_final_trans’ was used for analysis which contains 4 variables ‘name’,‘store’,‘price’,‘category’.Prices are in Australian dollars and are from 16 May 2019. Using this dataset, descriptive statistics, box plot and qqplot visuals were created to investigate price variations. And a paired ttest was used to test for a significant price difference.

From the summary statistics, it was found that Woolworths sells products at a slightly lower price compared to coles online. From the t-test results, based on the p-value and 95% confidence interval, we tend to reject the null hypothesis Ho:???? = 0 and the test is found to be statistically significant.

Load Packages and Data

library(readxl) 
library(car) 
## Loading required package: carData
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:car':
## 
##     recode
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr) 
options(max.print=1000000) 
require(qqplotr)
## Loading required package: qqplotr
## Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
## logical.return = TRUE, : there is no package called 'qqplotr'
##Loads the initial uncleaned scrapped data.
 uncleaned <- read_xlsx("coles%wools.xlsx")
 
 ##Deletes all entries where there's no value.
 cleaned <- na.omit(uncleaned) 
 
 ##Clears all entries with 'null' value from coles and woolworth's prices columns. 
 cleaned<-cleaned%>% filter(coles!="null") 
 cleaned<-cleaned%>% filter(wools!="null")
 

 
 ##Apply string split to coles price column for distinguishing the before and after of fer price values
 cleaned$before <- sapply(strsplit(as.character(cleaned$coles),' '), "[", 1) 
 cleaned$after <- sapply(strsplit(as.character(cleaned$coles),' '), "[", 2) 
 
 ##Replacing the before offer price column with after discount(current price) value.
 cleaned[ !is.na(cleaned$after) , "before" ]  <- cleaned[ !is.na(cleaned$after), "after" ]
 
 ##Removing special character ('$') from the price columns.
 cleaned$wools<-str_remove(cleaned$wools, "[$]") 
 cleaned$coles<-str_remove(cleaned$before, "[$]") 
 
##Assigning values within coles and wools column as numeric types.
 cleaned$coles <- as.numeric(cleaned$coles) 
 cleaned$wools <- as.numeric(cleaned$wools)

Summary Statistics

Using R to summarise the data from the investigation. Including appropriate plot to help visualise the data. Describe the trend.

#########Visualization############
 
cleaned %>% boxplot(cleaned$coles, data = ., ylab = "Price")

cleaned %>% boxplot(cleaned$coles, data = ., ylab = "Price")

Hypothesis Test

Use R to perform an appropriate hypothesis test to determine which supermarket is the cheapest. You need to explain your choice of hypothesis test, any assumptions and the significance level.

cleaned %>% boxplot(cleaned$wools, data = ., ylab = "Price")

##Removing identified outliers
 
boxplot<-cleaned %>% boxplot(cleaned$coles , data = ., ylab = "Price", plot = FALSE) 
filt_mat <- data.frame(group = boxplot$group, outliers = boxplot$out) 
clean_coles<-cleaned %>% filter(!(coles %in% filt_mat$outliers) ) 
clean_coles %>% boxplot(clean_coles$coles, data = ., ylab = "Price")

boxplot1<-clean_coles %>% boxplot(clean_coles$wools , data = ., ylab = "Price", plot = FALSE) 
filt_mat1 <- data.frame(group = boxplot1$group, outliers = boxplot1$out) 
clean_final<-clean_coles %>% filter(!(wools %in% filt_mat1$outliers) ) 
clean_final %>% boxplot(clean_final$wools, data = ., ylab = "Price")

##Creating separate dataframe to store coles and woolworths data.
 clean_data_1<-data.frame(name=clean_final$name,store=c("coles"),price=clean_final$coles,category=clean_final$category) 
clean_data_2<-data.frame(name=clean_final$name,store=c("wools"),price=clean_final$wools,category=clean_final$category) 
##Merging both dataframe to a single dataframe using, rbind()
 clean_final_trans<-rbind(clean_data_1,clean_data_2) 
##Plotting boxplots for product prices from both stores 
 clean_final_trans %>% boxplot(price ~ store, data = ., ylab = "Price")

##Filtering products based on store - 'Coles' & 'Woolworth'and Plotting QQ-Plots to c heck normality
 
 
#For store - 'Coles'
 clean_coles_fil <- clean_final_trans %>% filter(store == "coles")
clean_coles_fil$price %>% qqPlot(dist="norm")

## [1] 85 30
#For store - 'Woolworth'
 clean_wools_fil <- clean_final_trans %>% filter(store == "wools")
clean_wools_fil$price %>% qqPlot(dist="norm")

## [1] 128 136
##Summarizing the measurement variables
 
#Summarizing by store
clean_final_trans %>% group_by(store) %>%   summarise( 
  Min = min(price, na.rm = TRUE),
  Q1 = quantile(price, probs = .25,na.rm = TRUE),
  Median = median(price, na.rm = TRUE),
  Q3 = quantile(price, probs = .75, na.rm = TRUE),
  Max = max(price, na.rm = TRUE),
  Mean = mean(price, na.rm = TRUE),
  SD = sd(price, na.rm = TRUE), 
  n = n()   )

Using the dataset summary and box plots for each supermarket, resulted in Woolworths having a lower mean price in comparison to Coles mean price. The median price of both these stores was found to be the same. QQplot for both the retailers were plotted and it is notable that neither Coles or Woolworths follow a normal distribution and both have a similar distribution pattern. The outliers were then removed in order to obtain a proper visualization of data. A few outliers were still observed but we assume the samples to be normally distributed since our sample size is much greater than 30 and also the hypothesis testing is robust against slight deviations to the normal distribution.

Hypothesis Test

Using R to perform an appropriate hypothesis test to determine which supermarket is the cheapest with significance level.

##Checking equality of variance for the two populations using LaveneTest
 leveneTest(price ~ store, data = clean_final_trans)
##Performing the paired t-test considering all assumptions
 t.test(   price ~ store,   data = clean_final_trans,   paired = TRUE,   var.equal = FALSE,   alternative = "greater",   conf.interval=0.95 )
## 
##  Paired t-test
## 
## data:  price by store
## t = 3.1123, df = 241, p-value = 0.00104
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  0.1396521       Inf
## sample estimates:
## mean of the differences 
##               0.2974793

A paired t-test was chosen for hypothesis testing with H0:???? = 0 & HA:???? ??? 0.Significance level is 5% or 0.05.

Levene’s test is performed to check homogeneity of variance’s of both the population samples. From the Levene’s test, the p-value is found to be 0.3502, which is greater than 0.05 (the significance level). Hence we assume the variances to be equal and unknown. We also assume the population data to be normally distributed and the Alternate hypothesis (Ha) to be such that the price of products in Woolworths is less than that at Coles. A paired-sample t-test is carried out to check the hypotheses.

Interpretation

Interpret the results of the hypothesis test by interpreting the -value and confidence intervals and commenting on the statistical significance of the findings. A paired t-test was used to test for a significant difference between the mean product prices between Woolworths and Coles. From the t-test that’s been carried out, the p-value is found to be 0.001. We already know that the significance level(??) is 0.05. Clearly, the p-value is lesser than the significance level(??) and the 95% CI of the mean difference does not capture Ho:???? = 0. This indicates that there is statistical evidence that Woolworth is cheaper compared to Coles and we reject the null hypothesis (Ho) and test is statistically significant.

Discussion

##.Findings & Conclusion : Findings from the above test performed we have found that the p-value is less than 0.05 in the Levene’s test and hence we assumed the homogeneity of the variances and also assumed the normality of distribution since the sample’s size is greater than 30. From the results of paired sample t-test, we reject the null hypothesis because the p-value is less than the significance level and the 95% CI mean differences do not capture Ho:???? = 0. Thereby concluding that there is a significant difference between both stores. ie, the test is statistically significant. While there may be a significant price difference for the sample data collected but there are some limitations that need to be considered.

##.Strengths : One of the strengths of investigation is the consistency and accuracy of the products used, the brand, product and size. this is also a limitation of the investigation. With this investigation, we were able to draw a rough view of products prices for customers who shopped online from these stores.

##.Limitations : One of the main limitations was on the data collected. The data was collected from Coles and Woolworths online store than location based. Not every product that was in stock in Woolworths had stock in coles and vice-versa, so predicting a general trend in variation of price on daily items from each category was not precise

##.Improvements: One of the improvements that can be done for future investigations is to have a check at Coles and Woolworths own items like Milk, Dairy products, Eggs as there could be some price variation in own brand items. The data collected now had limited entries for each category, for more accuracy it would be best if more data are collected. Prepare a prices comparison of both these stores based on locations(Different suburbs and cities).