Introduction

Woolworths currently becomes one of the largest grocery stores in Australia as they are successful in using pricing strategy resulting in positively pushing customer buying behavior. With this sustainable development, Woolworths intend to expand their business internationally such as United States, Denmark or Canada to increase profit as well as their market share. However, this international expansion is a really tough challenge once Woolworths must profound the level of groceries demand of the local people in each area. To choose a good location to avoid business failure, Woolworths hires our company to conduct market surveys and analyse to deeply understand average spending level on groceries of residents spreading cities of the countries mentioned above.

Problem Statement

With the retail industry, we realize that the cost of living may influence on the shopping trend, which will be represented by the groceries index. First of all, to examine whether there is any relationship between two these variables, we need to check whether they have linear relationship or not. If the cost of living has a positive linear relationship with groceries index, customers with higher standard of living will tend to consume more groceries items. So, Woolworths can invest in the new markets which have higher cost of living. Otherwise, we will choose another potential factor to help Woolworth choose the best location for their investment.

Introduce Data

Based on kaggle website, we collect the file of Cost of Living Index with 536 observations and 8 variables including Rank, City, Cost of Living Index, Rent Index, Cost of living Plus Rent Index, Groceries Index, Restaurant Price Index and Local Purchasing Power Index. However, the purpose of this research is to focus on the cost of living index and groceries index to give Woolworths a comprehensive overview of the relationship between 2 these variables. So, we will select only 4 main variables below to highlight what we want to convey to Woolworths.

setwd("~/Desktop/Week3")
Cost_of_living_index <- read.csv("Cost_of_living_index.csv", stringsAsFactors = FALSE)
Cost_of_living_index <- Cost_of_living_index[, c("Rank", "City", "Cost.of.Living.Index","Groceries.Index")]
head(Cost_of_living_index)

##   Rank                  City Cost.of.Living.Index Groceries.Index
## 1    1     Hamilton, Bermuda               137.56          126.56
## 2    2   Zurich, Switzerland               128.65          127.35
## 3    3    Basel, Switzerland               126.89          120.44
## 4    4 Lausanne, Switzerland               119.62          116.35
## 5    5     Bern, Switzerland               118.42          114.54
## 6    6   Geneva, Switzerland               118.33          112.08

Check Data Structure

As you can see, Rank is not an numeric variable as dataset is representing.

str(Cost_of_living_index)

## 'data.frame':    536 obs. of  4 variables:
##  $ Rank                : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ City                : chr  "Hamilton, Bermuda" "Zurich, Switzerland" "Basel, Switzerland" "Lausanne, Switzerland" ...
##  $ Cost.of.Living.Index: num  138 129 127 120 118 ...
##  $ Groceries.Index     : num  127 127 120 116 115 ...

Convert Data

We convert Rank from numeric to factor class to stand for its nature as ordinal variable.

Cost_of_living_index$Rank <-as.factor(Cost_of_living_index$Rank)
str(Cost_of_living_index)

## 'data.frame':    536 obs. of  4 variables:
##  $ Rank                : Factor w/ 536 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ City                : chr  "Hamilton, Bermuda" "Zurich, Switzerland" "Basel, Switzerland" "Lausanne, Switzerland" ...
##  $ Cost.of.Living.Index: num  138 129 127 120 118 ...
##  $ Groceries.Index     : num  127 127 120 116 115 ...

Visualise Data

plot(Cost_of_living_index$Groceries.Index ~ Cost_of_living_index$Cost.of.Living.Index)

As you can see, as the cost of living increases, so too does groceries index. This is a positive relationship.

Decsriptive Statistics

Cost_of_living_index %>% summarise(Min = min(Groceries.Index,na.rm = TRUE),
                                           Q1 = quantile(Groceries.Index,probs = .25,na.rm = TRUE),
                                           Median = median(Groceries.Index, na.rm = TRUE),
                                           Q3 = quantile(Groceries.Index,probs = .75,na.rm = TRUE),
                                           Max = max(Groceries.Index,na.rm = TRUE),
                                           Mean = mean(Groceries.Index, na.rm = TRUE),
                                           SD = sd(Groceries.Index, na.rm = TRUE),
                                           n = n(),
                                           Missing = sum(is.na(Groceries.Index))) -> table1
knitr::kable(table1)

Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
18.01	31.4975	51.57	62.7575	127.35	49.75821	19.3209	536	0

Cost_of_living_index %>% summarise(Min = min(Cost.of.Living.Index,na.rm = TRUE),
                                           Q1 = quantile(Cost.of.Living.Index,probs = .25,na.rm = TRUE),
                                           Median = median(Cost.of.Living.Index, na.rm = TRUE),
                                           Q3 = quantile(Cost.of.Living.Index,probs = .75,na.rm = TRUE),
                                           Max = max(Cost.of.Living.Index,na.rm = TRUE),
                                           Mean = mean(Cost.of.Living.Index, na.rm = TRUE),
                                           SD = sd(Cost.of.Living.Index, na.rm = TRUE),
                                           n = n(),
                                           Missing = sum(is.na(Cost.of.Living.Index))) -> table2
knitr::kable(table2)

Min	Q1	Median	Q3	Max	Mean	SD	n	Missing
19.26	39.37	61.97	70.865	137.56	57.18757	19.98492	536	0

Sum of Square Distance

sum_x <- sum(Cost_of_living_index$Cost.of.Living.Index)
sum_x

## [1] 30652.54

sum_y <- sum(Cost_of_living_index$Groceries.Index)
sum_y

## [1] 26670.4

sum_x_sq <- sum(Cost_of_living_index$Cost.of.Living.Index^2)
sum_x_sq

## [1] 1966622

sum_y_sq <- sum(Cost_of_living_index$Groceries.Index^2)
sum_y_sq

## [1] 1526785

sum_xy <- sum(Cost_of_living_index$Cost.of.Living.Index*Cost_of_living_index$Groceries.Index)
sum_xy

## [1] 1722830

Sum of Square Distance Cont.

n <- length(Cost_of_living_index$Groceries.Index)
n

## [1] 536

Lxx <- sum_x_sq-((sum_x^2)/n)
Lxx

## [1] 213677.4

Lyy <- sum_y_sq-((sum_y^2)/n)
Lyy

## [1] 199714

Lxy = sum_xy - (((sum_x)*(sum_y))/n)
Lxy

## [1] 197614.2

b = Lxy/Lxx
b

## [1] 0.924825

a = mean(Cost_of_living_index$Groceries.Index - b*mean(Cost_of_living_index$Cost.of.Living.Index))
a

## [1] -3.130289

Demonstrate Linear Relationship

plot(Cost_of_living_index$Groceries.Index ~ Cost_of_living_index$Cost.of.Living.Index)
abline(a = a, b = b, col= "red")

This plot determine the relationship between x and y variables, the cost of living and groceries index, that linear regression is suitable. The data exhibited a positive linear trend. A positive linear relationship occurs when as the predictor variable increases in value, so too do the values for the dependent variable. In this situation, higher groceries index are associated with higher cost of living values.

As the data exhibit signs of a positive linear relationship, we can proceed with fitting the linear regression model.

Hypothesis Testing- Overall Model

oanh <- lm(Groceries.Index~ Cost.of.Living.Index, data=Cost_of_living_index)
summary(oanh)

## 
## Call:
## lm(formula = Groceries.Index ~ Cost.of.Living.Index, data = Cost_of_living_index)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.3272  -3.8839  -0.2281   3.9196  20.9016 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -3.13029    0.73838  -4.239 2.64e-05 ***
## Cost.of.Living.Index  0.92482    0.01219  75.867  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.635 on 534 degrees of freedom
## Multiple R-squared:  0.9151, Adjusted R-squared:  0.9149 
## F-statistic:  5756 on 1 and 534 DF,  p-value: < 2.2e-16

The R-squared is a measure of goodness of fit for linear regression. The better the line fits the data, the higher R-squared will be. In this case, around 92% of the variability in groceries index can be explained by a linear relationship with cost of living index.

We can see that the F-statistic reported in the summary as F=5756 with df1=1 and df2=534 resulting in the p-value reported in the summary to be p<.001. As p is less than the 0.05 level of significance, we reject H0:The data do not fit the linear regression model. There was statistically significant evidence that the data fit a linear regression model.

Interpret Intercept/Slope

oanh %>% summary() %>% coef()

##                       Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)          -3.130289 0.73838455 -4.239376  2.641244e-05
## Cost.of.Living.Index  0.924825 0.01219002 75.867363 3.746184e-288

The intercept is reported as a=-3.13. The intercept is the average value for y when x=0. In this case, this value represents groceries index when the cost of living is equal to 0. Given that the cost of living of 0 is impossible, so the constant(intercept/a) typically has no meaningful interpretation.

The slope of the regression line was reported as b=0.92. The slope represents the average increase in y following a one unit increase in x. In relation to this case, a one unit increase in cost of living was related to an average increase in groceries index of .92 units. This is a positive change.

Hypothesis Testing- Intercept

To test the statistical significance of the constant, we set the following statistical hypotheses:

\[H_0: α = 0\] \[H_A: α \ne 0\] This hypothesis is tested using a t-statistic, reported as t=-4.24,p<.001. The intercept is statistically significant at the 0.05 level. This means that there is statistically significant evidence that the intercept is not 0.

oanh %>% confint()

##                           2.5 %     97.5 %
## (Intercept)          -4.5807842 -1.6797948
## Cost.of.Living.Index  0.9008787  0.9487713

R reports the 95% CI for a to be [-4.58, -1.68]. H0:α=0 is clearly not captured by this interval, so was rejected.

Hypothesis Testing- Slope

The hypothesis test of the slope, b, was as follows:

\[H_0: β = 0\] \[H_A: β \ne 0\] The slope was also tested using a t-statistic which was reported as t=75.87,p<.001. we reject H0. There was statistically significant evidence that cost of living was positively related to groceries index.

oanh %>% confint()

##                           2.5 %     97.5 %
## (Intercept)          -4.5807842 -1.6797948
## Cost.of.Living.Index  0.9008787  0.9487713

R reports the 95% CI for b to be [0.90, 0.95]. This 95% CI does not capture H0, therefore it was rejected.

Summarise the linear relationship:

plot(Groceries.Index~ Cost.of.Living.Index, data = Cost_of_living_index, xlab = "Cost of Living Index", ylab = "Groceries Index")
abline(oanh, col = "red")

Test Assumption: Residuals vs. Fitted

oanh %>% plot(which = 1)

The relationship between fitted values and residuals is nearly flat (look at the red line), this is a good indication that we are modelling a linear relationship.

Test Assumption: Normal Q-Q

oanh %>% plot(which = 2)

The plot above suggests there are no major deviations from normality. It would be safe to assume the residuals are approximately normally distributed.

Test Assumption: Scale-Location

oanh %>% plot(which = 3)

This is another plot used to check homoscedasticity. The red line should be close to flat and the variance in the square root of the standardised residuals should be consistent across predicted (fitted values).

Test Assumption: Influential Cases

oanh %>% plot(which = 5)

If we remove outliers, the regression model won’t change much.

Correlation Coefficient r

r <- cor(Cost_of_living_index$Groceries.Index, Cost_of_living_index$Cost.of.Living.Index,
use = "complete.obs")
r

## [1] 0.9566093

CIr(r, n = 536, level = .95)

## [1] 0.9487852 0.9632607

A Pearson’s correlation was calculated to measure the strength of the linear relationship between the cost of living and groceries index. The confidence interval does not capture H0:r=0, therefore, H0 was rejected. The positive correlation was statistically significant, r=.957, 95% CI [0.949, 0.963].

Discussion

A linear regression model was fitted to predict the dependent variable, groceries index, using measures of the cost of living values as a single predictor. Prior to fitting the regression, a scatter plot assessing the bivariate relationship between these 2 variables was inspected. The scatter plot demonstrated evidence of a positive linear relationship. The overall regression model was statistically significant, F(1,534)=5756, p<.001, and explained 91.51% of the variability in groceries index, R-square=0.9151. The estimated regression equation was Groceries Index=-3.13+.92∗Cost of Living Index. The positive slope for cost of living was statistically significant, b=0.92, t(534)=75.87, p<.001, 95% CI [0.9, 0.95]. The intercept was statistical significant, a=-3.13,t(534)=-4.23, p<.001, 95% CI [-4.58,-1.68]. Final inspection of the residuals supported normality and homoscedasticity. In short, there was a statistically significant strongly positive linear relationship between the cost of living and groceries index. Therefore, in order to exactly understand the level of groceries demand or the groceries price which residents are willing to pay in cities of countries, Woolworths should focus on the cost of living as well as its ranking to analyse and evaluate in choosing a good investment.

During this investigation, we have to deal with limitation as well as strength:

*Strength: Based on kaggle website, we have an advantage in collecting data from various cities of countries (n>30,normal distribution). This advantage will help Woolworths gain more choices to refine the best location for their challege investment through the cost of living and its ranking of each area.

*Limitation: Based on kaggle website, some data files lack of description, so it takes us long time to look for data files with completeness and reliability.

Woolworths’s explosion to international market