This report examines which of Australia's two major supermarkets, Coles or Woolworths, is cheaper.
Procedure
Sample
Variables
Main findings
Conclusions
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
library(lattice)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
setwd("C:/Users/ria93/Desktop/Stats/Assignment2")
colesWoolie <- read_csv("ColesWoolies.csv")
## Parsed with column specification:
## cols(
## ProductID = col_integer(),
## `Product Category` = col_character(),
## `Product Name` = col_character(),
## `Product Quantity` = col_integer(),
## `Coles Price` = col_double(),
## `Woolworths Price` = col_double(),
## Random = col_integer()
## )
head(colesWoolie)
colesWoolie_random <- colesWoolie %>% filter(Random == 0)
head(colesWoolie_random)
We add a column to the random sample with the mutate() function. This column, diff, holds the price difference between Coles and Woolworths (Coles price minus Woolworths price).
We apply the summary() function to the Coles and Woolworths prices to obtain summary statistics: the minimum, first quartile, median, mean, third quartile and maximum. These statistics show a difference between the two means (5.663 versus 5.592) and a small difference in the interquartile ranges. To determine whether this difference is significant, we apply further tests to the sample.
We are also interested in the standard deviation of each store's prices, the mean difference between Coles and Woolworths prices, and the standard deviation of the differences. For that we use the summarise() function. It shows that the mean difference is 0.07097561 and the standard deviation of the differences is 0.3653683.
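The standard deviation of the differences is not the same quantity as the difference between the two standard deviations. The sketch below illustrates this on a small made-up example; the prices are hypothetical (not the report's data) and the variable names are illustrative only.

```r
# Hypothetical prices for four products at each store (illustrative only)
coles      <- c(2.0, 3.5, 5.0, 7.0)
woolworths <- c(2.0, 3.0, 5.5, 7.0)

# Per-product differences, Coles minus Woolworths, as in the diff column
price_diff <- coles - woolworths

mean_diff   <- mean(price_diff)            # mean of the paired differences
sd_of_diff  <- sd(price_diff)              # SD of the paired differences
diff_of_sds <- sd(coles) - sd(woolworths)  # difference between the two SDs

# The two spread measures generally differ; a paired analysis uses sd_of_diff
print(c(mean_diff = mean_diff, sd_of_diff = sd_of_diff, diff_of_sds = diff_of_sds))
```

In the summarise() call used in this report, SD_Difference corresponds to sd_of_diff here.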
As the sample is not very large, we use matplot() for a clearer visual representation. The plot draws one line per product between its Coles price and its Woolworths price, which helps show the direction of the trend.
The two plots are visually similar, and Coles seems to have slightly higher prices than Woolworths. Before running the t-test, we check normality visually using Q-Q plots. This check is not strictly required here because the sample size exceeds 30 (Coles n = 41, Woolworths n = 41), which is large enough for the sampling distribution of the mean to be treated as approximately normal.
In the Q-Q plots, the dotted lines mark the 95% confidence envelope for the normal quantiles. In both groups some points fall outside the envelope in the tails, which suggests that the tails are heavier than we would expect under a normal distribution. However, because the sample is large enough (n = 41), this is not a serious concern.
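The large-sample argument above can be illustrated with a short simulation. This is only a sketch: it assumes a right-skewed exponential distribution as a stand-in for non-normal data (this is not a claim about the actual price distribution), and the seed and number of replications are arbitrary choices.

```r
set.seed(1)   # arbitrary seed so the sketch is reproducible

n    <- 41    # sample size used in this report
reps <- 5000  # number of simulated samples (arbitrary choice)

# Draw repeated samples from a right-skewed exponential distribution
# (mean 1, SD 1) and keep the mean of each sample
sample_means <- replicate(reps, mean(rexp(n, rate = 1)))

# Theory: the sample means should centre on 1 with SD close to 1 / sqrt(n)
print(c(mean_of_means = mean(sample_means),
        sd_of_means   = sd(sample_means),
        theory_sd     = 1 / sqrt(n)))
```

Passing sample_means to hist() or qqnorm() should show a roughly bell-shaped distribution even though the individual values are strongly skewed.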
Some values may appear to be outliers because they are much higher than the typical product price in the sample. However, they are genuine prices rather than data errors, so we do not remove them from the analysis.
colesWoolie_random <- colesWoolie_random %>% mutate (diff = `Coles Price` - `Woolworths Price`)
head(colesWoolie_random)
colesWoolie_random$`Coles Price` %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.500 3.600 5.000 5.663 7.000 12.900
colesWoolie_random$`Woolworths Price` %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.350 3.700 4.620 5.592 7.000 12.900
colesWoolie_random %>% summarise(
  Mean_Coles = mean(`Coles Price`, na.rm = TRUE),
  SD_Coles = sd(`Coles Price`, na.rm = TRUE),
  Mean_Woolworths = mean(`Woolworths Price`, na.rm = TRUE),
  SD_woolworths = sd(`Woolworths Price`, na.rm = TRUE),
  Mean_Difference = Mean_Coles - Mean_Woolworths,
  SD_Difference = sd(`Coles Price` - `Woolworths Price`, na.rm = TRUE),
  n = n())
matplot(t(data.frame(colesWoolie_random$`Coles Price`,
colesWoolie_random$`Woolworths Price`)),
type="b", pch=19, col=1, lty=1, xlab= "Supermarkets", ylab="Price ($)",
xaxt = "n")
axis(1, at=1:2, labels=c("Coles","Woolworths"))
ColesWooliesGather <- colesWoolie_random %>% gather(`Coles Price`,`Woolworths Price`, key = "store", value = "price")
head(ColesWooliesGather)
ColesWooliesGather %>% boxplot(price ~ store, data = ., horizontal=TRUE)
colesWoolie_random$`Coles Price` %>% qqPlot(dist="norm")
## [1] 32 9
colesWoolie_random$`Woolworths Price` %>% qqPlot(dist="norm")
## [1] 32 9
We use a two-sample t-test to assess whether this difference is statistically significant and, if so, which supermarket is cheaper. The two-sample t-test is appropriate because it compares the difference between two population means. It tests the following statistical hypotheses:
$H_0: \mu_1 - \mu_2 = 0$ (null hypothesis)
$H_A: \mu_1 - \mu_2 \neq 0$ (alternative hypothesis)
where $\mu_1$ and $\mu_2$ are the mean prices at Coles and Woolworths respectively.
The null hypothesis (H0) states that the difference between the mean prices at the two supermarkets is 0, whereas the alternative hypothesis states that the difference is not 0.
From the summary statistics above, the sample estimate of the difference between Coles and Woolworths is 5.663 - 5.592 = 0.071.
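As a rough check, the t statistic for the paired test run later in this report can be reproduced from the mean difference, the standard deviation of the differences and the sample size reported above. The variable names below are illustrative, and the formula is the standard paired-samples one, t = mean difference / (SD of differences / sqrt(n)).

```r
# Values reported earlier in this report
mean_diff <- 0.07097561  # mean of (Coles - Woolworths)
sd_diff   <- 0.3653683   # standard deviation of the differences
n         <- 41          # number of products in the random sample

# Standard error of the mean difference
se_diff <- sd_diff / sqrt(n)

# Paired t statistic and its degrees of freedom
t_stat <- mean_diff / se_diff
df     <- n - 1

print(c(se_diff = se_diff, t_stat = t_stat, df = df))
```

The result should agree with the t.test() output (t = 1.2439, df = 40) up to rounding.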
First we use Levene's test to check homogeneity of variance (the assumption of equal variances), because this determines which form of the two-sample t-test to use. If equal variances can be assumed, we use the two-sample t-test assuming equal variances to compare Coles and Woolworths.
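Levene's test has the following statistical hypotheses:
$H_0: \sigma_1^2 = \sigma_2^2$
$H_A: \sigma_1^2 \neq \sigma_2^2$
where $\sigma_1^2$ and $\sigma_2^2$ are the population variances of Coles and Woolworths prices respectively.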
The p-value for Levene's test of equal variance for product price between Coles and Woolworths was p = 0.9007. Since p > 0.05, we fail to reject H0, which means it is safe to assume equal variances. Had the p-value been less than 0.05, we would have rejected the null hypothesis and could not have assumed equal variances.
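To show what the test measures, the sketch below computes a Levene-type test by hand in base R. With its default center = median, leveneTest() from the car package amounts to a one-way ANOVA on the absolute deviations of each value from its group median. The prices here are hypothetical (not the report's data) and the object names are illustrative.

```r
# Hypothetical prices for two stores (illustrative only, not the report's data)
prices <- data.frame(
  store = factor(rep(c("Coles", "Woolworths"), each = 5)),
  price = c(1.50, 3.60, 5.00, 7.00, 12.90,
            1.35, 3.70, 4.62, 7.00, 12.90)
)

# Absolute deviation of each price from the median of its own store
group_median   <- ave(prices$price, prices$store, FUN = median)
prices$abs_dev <- abs(prices$price - group_median)

# One-way ANOVA on the absolute deviations gives the Levene-type F test
levene_fit <- anova(lm(abs_dev ~ store, data = prices))
levene_F   <- levene_fit[["F value"]][1]
levene_p   <- levene_fit[["Pr(>F)"]][1]

print(c(F = levene_F, p = levene_p))
```

A large p-value, as in this toy example, gives no evidence that the spread of prices differs between the stores.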
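Having found no evidence against equal variances, we can now proceed to the t-test with a two-sided alternative. The test is two-sided because we want to detect a difference in either direction, whether Coles or Woolworths is cheaper. Because the same products are priced at both supermarkets, the observations are matched, so the t.test() call below sets paired = TRUE.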
leveneTest(price ~ store, data = ColesWooliesGather)
## Warning in leveneTest.default(y = y, group = group, ...): group coerced to
## factor.
pttest <- t.test(colesWoolie_random$`Coles Price`, colesWoolie_random$`Woolworths Price`,paired = TRUE, alternative = "two.sided")
pttest
##
## Paired t-test
##
## data: colesWoolie_random$`Coles Price` and colesWoolie_random$`Woolworths Price`
## t = 1.2439, df = 40, p-value = 0.2208
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.04434887 0.18630009
## sample estimates:
## mean of the differences
## 0.07097561
Tabulated_t_statistic<-qt(0.025,df=40)
Tabulated_t_statistic
## [1] -2.021075
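The decision rule can be sketched from the values printed above. This is a minimal check, not part of the original analysis: the variable names are illustrative, and the standard error is recovered from the rounded t statistic, so the results match the t.test() output only up to rounding.

```r
# Values taken from the t.test() output above
t_stat    <- 1.2439
df        <- 40
mean_diff <- 0.07097561
alpha     <- 0.05

# Two-sided critical value (the positive counterpart of qt(0.025, df))
t_crit <- qt(1 - alpha / 2, df)

# Two-sided p-value for the observed t statistic
p_value <- 2 * pt(-abs(t_stat), df)

# Standard error implied by the output, then the 95% confidence interval
se_diff <- mean_diff / t_stat
ci      <- mean_diff + c(-1, 1) * t_crit * se_diff

# Reject H0 only if |t| exceeds the critical value
reject_h0 <- abs(t_stat) > t_crit

print(list(t_crit = t_crit, p_value = p_value, ci = ci, reject_h0 = reject_h0))
```

The three criteria (p-value against alpha, |t| against the critical value, and whether the interval contains 0) always lead to the same decision for a two-sided test at the same level.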
A paired-samples t-test was used to test for a significant mean difference between Coles and Woolworths prices. The mean difference (Coles minus Woolworths) was 0.07 (SD = 0.37). Visual inspection of the Q-Q plots, together with the sample size (n = 41), suggested that the normality assumption was reasonable. The paired-samples t-test found no statistically significant mean difference between Coles and Woolworths prices, t(40) = 1.24, p = 0.221, 95% CI [-0.04, 0.19].
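Following the paired t-test procedure, we made three observations: the p-value (0.2208) is greater than 0.05; the absolute value of the t statistic (1.2439) is smaller than the critical value (2.021); and the 95% confidence interval for the mean difference (-0.044 to 0.186) contains 0.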
Based on these three outcomes, the mean difference between Coles and Woolworths prices is not statistically significant and we fail to reject the null hypothesis. We therefore conclude that product prices at Coles and Woolworths are, on average, almost the same.
Strengths
Limitations
Improvements
Conclusion