Thi Hoang Yen Nguyen - s3594445
02 June 2019
Woolworths currently becomes one of the largest grocery stores in Australia as they are successful in using pricing strategy resulting in positively pushing customer buying behavior. With this sustainable development, Woolworths intend to expand their business internationally such as United States, Denmark or Canada to increase profit as well as their market share. However, this international expansion is a really tough challenge once Woolworths must profound the level of groceries demand of the local people in each area. To choose a good location to avoid business failure, Woolworths hires our company to conduct market surveys and analyse to deeply understand average spending level on groceries of residents spreading cities of the countries mentioned above.
With the retail industry, we realize that the cost of living may influence on the shopping trend, which will be represented by the groceries index. First of all, to examine whether there is any relationship between two these variables, we need to check whether they have linear relationship or not. If the cost of living has a positive linear relationship with groceries index, customers with higher standard of living will tend to consume more groceries items. So, Woolworths can invest in the new markets which have higher cost of living. Otherwise, we will choose another potential factor to help Woolworth choose the best location for their investment.
Based on kaggle website, we collect the file of Cost of Living Index with 536 observations and 8 variables including Rank, City, Cost of Living Index, Rent Index, Cost of living Plus Rent Index, Groceries Index, Restaurant Price Index and Local Purchasing Power Index. However, the purpose of this research is to focus on the cost of living index and groceries index to give Woolworths a comprehensive overview of the relationship between 2 these variables. So, we will select only 4 main variables below to highlight what we want to convey to Woolworths.
setwd("~/Desktop/Week3")
Cost_of_living_index <- read.csv("Cost_of_living_index.csv", stringsAsFactors = FALSE)
Cost_of_living_index <- Cost_of_living_index[, c("Rank", "City", "Cost.of.Living.Index","Groceries.Index")]
head(Cost_of_living_index)## Rank City Cost.of.Living.Index Groceries.Index
## 1 1 Hamilton, Bermuda 137.56 126.56
## 2 2 Zurich, Switzerland 128.65 127.35
## 3 3 Basel, Switzerland 126.89 120.44
## 4 4 Lausanne, Switzerland 119.62 116.35
## 5 5 Bern, Switzerland 118.42 114.54
## 6 6 Geneva, Switzerland 118.33 112.08
As you can see, Rank is not an numeric variable as dataset is representing.
str(Cost_of_living_index)## 'data.frame': 536 obs. of 4 variables:
## $ Rank : int 1 2 3 4 5 6 7 8 9 10 ...
## $ City : chr "Hamilton, Bermuda" "Zurich, Switzerland" "Basel, Switzerland" "Lausanne, Switzerland" ...
## $ Cost.of.Living.Index: num 138 129 127 120 118 ...
## $ Groceries.Index : num 127 127 120 116 115 ...
We convert Rank from numeric to factor class to stand for its nature as ordinal variable.
Cost_of_living_index$Rank <-as.factor(Cost_of_living_index$Rank)
str(Cost_of_living_index)## 'data.frame': 536 obs. of 4 variables:
## $ Rank : Factor w/ 536 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ City : chr "Hamilton, Bermuda" "Zurich, Switzerland" "Basel, Switzerland" "Lausanne, Switzerland" ...
## $ Cost.of.Living.Index: num 138 129 127 120 118 ...
## $ Groceries.Index : num 127 127 120 116 115 ...
plot(Cost_of_living_index$Groceries.Index ~ Cost_of_living_index$Cost.of.Living.Index) As you can see, as the cost of living increases, so too does groceries index. This is a positive relationship.
Cost_of_living_index %>% summarise(Min = min(Groceries.Index,na.rm = TRUE),
Q1 = quantile(Groceries.Index,probs = .25,na.rm = TRUE),
Median = median(Groceries.Index, na.rm = TRUE),
Q3 = quantile(Groceries.Index,probs = .75,na.rm = TRUE),
Max = max(Groceries.Index,na.rm = TRUE),
Mean = mean(Groceries.Index, na.rm = TRUE),
SD = sd(Groceries.Index, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Groceries.Index))) -> table1
knitr::kable(table1)| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 18.01 | 31.4975 | 51.57 | 62.7575 | 127.35 | 49.75821 | 19.3209 | 536 | 0 |
Cost_of_living_index %>% summarise(Min = min(Cost.of.Living.Index,na.rm = TRUE),
Q1 = quantile(Cost.of.Living.Index,probs = .25,na.rm = TRUE),
Median = median(Cost.of.Living.Index, na.rm = TRUE),
Q3 = quantile(Cost.of.Living.Index,probs = .75,na.rm = TRUE),
Max = max(Cost.of.Living.Index,na.rm = TRUE),
Mean = mean(Cost.of.Living.Index, na.rm = TRUE),
SD = sd(Cost.of.Living.Index, na.rm = TRUE),
n = n(),
Missing = sum(is.na(Cost.of.Living.Index))) -> table2
knitr::kable(table2)| Min | Q1 | Median | Q3 | Max | Mean | SD | n | Missing |
|---|---|---|---|---|---|---|---|---|
| 19.26 | 39.37 | 61.97 | 70.865 | 137.56 | 57.18757 | 19.98492 | 536 | 0 |
sum_x <- sum(Cost_of_living_index$Cost.of.Living.Index)
sum_x## [1] 30652.54
sum_y <- sum(Cost_of_living_index$Groceries.Index)
sum_y## [1] 26670.4
sum_x_sq <- sum(Cost_of_living_index$Cost.of.Living.Index^2)
sum_x_sq## [1] 1966622
sum_y_sq <- sum(Cost_of_living_index$Groceries.Index^2)
sum_y_sq## [1] 1526785
sum_xy <- sum(Cost_of_living_index$Cost.of.Living.Index*Cost_of_living_index$Groceries.Index)
sum_xy ## [1] 1722830
n <- length(Cost_of_living_index$Groceries.Index)
n## [1] 536
Lxx <- sum_x_sq-((sum_x^2)/n)
Lxx## [1] 213677.4
Lyy <- sum_y_sq-((sum_y^2)/n)
Lyy## [1] 199714
Lxy = sum_xy - (((sum_x)*(sum_y))/n)
Lxy## [1] 197614.2
b = Lxy/Lxx
b## [1] 0.924825
a = mean(Cost_of_living_index$Groceries.Index - b*mean(Cost_of_living_index$Cost.of.Living.Index))
a## [1] -3.130289
plot(Cost_of_living_index$Groceries.Index ~ Cost_of_living_index$Cost.of.Living.Index)
abline(a = a, b = b, col= "red")This plot determine the relationship between x and y variables, the cost of living and groceries index, that linear regression is suitable. The data exhibited a positive linear trend. A positive linear relationship occurs when as the predictor variable increases in value, so too do the values for the dependent variable. In this situation, higher groceries index are associated with higher cost of living values.
As the data exhibit signs of a positive linear relationship, we can proceed with fitting the linear regression model.
oanh <- lm(Groceries.Index~ Cost.of.Living.Index, data=Cost_of_living_index)
summary(oanh)##
## Call:
## lm(formula = Groceries.Index ~ Cost.of.Living.Index, data = Cost_of_living_index)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.3272 -3.8839 -0.2281 3.9196 20.9016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.13029 0.73838 -4.239 2.64e-05 ***
## Cost.of.Living.Index 0.92482 0.01219 75.867 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.635 on 534 degrees of freedom
## Multiple R-squared: 0.9151, Adjusted R-squared: 0.9149
## F-statistic: 5756 on 1 and 534 DF, p-value: < 2.2e-16
The R-squared is a measure of goodness of fit for linear regression. The better the line fits the data, the higher R-squared will be. In this case, around 92% of the variability in groceries index can be explained by a linear relationship with cost of living index.
We can see that the F-statistic reported in the summary as F=5756 with df1=1 and df2=534 resulting in the p-value reported in the summary to be p<.001. As p is less than the 0.05 level of significance, we reject H0:The data do not fit the linear regression model. There was statistically significant evidence that the data fit a linear regression model.
oanh %>% summary() %>% coef()## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.130289 0.73838455 -4.239376 2.641244e-05
## Cost.of.Living.Index 0.924825 0.01219002 75.867363 3.746184e-288
The intercept is reported as a=-3.13. The intercept is the average value for y when x=0. In this case, this value represents groceries index when the cost of living is equal to 0. Given that the cost of living of 0 is impossible, so the constant(intercept/a) typically has no meaningful interpretation.
The slope of the regression line was reported as b=0.92. The slope represents the average increase in y following a one unit increase in x. In relation to this case, a one unit increase in cost of living was related to an average increase in groceries index of .92 units. This is a positive change.
To test the statistical significance of the constant, we set the following statistical hypotheses:
\[H_0: α = 0\] \[H_A: α \ne 0\] This hypothesis is tested using a t-statistic, reported as t=-4.24,p<.001. The intercept is statistically significant at the 0.05 level. This means that there is statistically significant evidence that the intercept is not 0.
oanh %>% confint()## 2.5 % 97.5 %
## (Intercept) -4.5807842 -1.6797948
## Cost.of.Living.Index 0.9008787 0.9487713
R reports the 95% CI for a to be [-4.58, -1.68]. H0:α=0 is clearly not captured by this interval, so was rejected.
The hypothesis test of the slope, b, was as follows:
\[H_0: β = 0\] \[H_A: β \ne 0\] The slope was also tested using a t-statistic which was reported as t=75.87,p<.001. we reject H0. There was statistically significant evidence that cost of living was positively related to groceries index.
oanh %>% confint()## 2.5 % 97.5 %
## (Intercept) -4.5807842 -1.6797948
## Cost.of.Living.Index 0.9008787 0.9487713
R reports the 95% CI for b to be [0.90, 0.95]. This 95% CI does not capture H0, therefore it was rejected.
plot(Groceries.Index~ Cost.of.Living.Index, data = Cost_of_living_index, xlab = "Cost of Living Index", ylab = "Groceries Index")
abline(oanh, col = "red")oanh %>% plot(which = 1) The relationship between fitted values and residuals is nearly flat (look at the red line), this is a good indication that we are modelling a linear relationship.
oanh %>% plot(which = 2) The plot above suggests there are no major deviations from normality. It would be safe to assume the residuals are approximately normally distributed.
oanh %>% plot(which = 3) This is another plot used to check homoscedasticity. The red line should be close to flat and the variance in the square root of the standardised residuals should be consistent across predicted (fitted values).
oanh %>% plot(which = 5) If we remove outliers, the regression model won’t change much.
r <- cor(Cost_of_living_index$Groceries.Index, Cost_of_living_index$Cost.of.Living.Index,
use = "complete.obs")
r## [1] 0.9566093
CIr(r, n = 536, level = .95)## [1] 0.9487852 0.9632607
A Pearson’s correlation was calculated to measure the strength of the linear relationship between the cost of living and groceries index. The confidence interval does not capture H0:r=0, therefore, H0 was rejected. The positive correlation was statistically significant, r=.957, 95% CI [0.949, 0.963].
A linear regression model was fitted to predict the dependent variable, groceries index, using measures of the cost of living values as a single predictor. Prior to fitting the regression, a scatter plot assessing the bivariate relationship between these 2 variables was inspected. The scatter plot demonstrated evidence of a positive linear relationship. The overall regression model was statistically significant, F(1,534)=5756, p<.001, and explained 91.51% of the variability in groceries index, R-square=0.9151. The estimated regression equation was Groceries Index=-3.13+.92∗Cost of Living Index. The positive slope for cost of living was statistically significant, b=0.92, t(534)=75.87, p<.001, 95% CI [0.9, 0.95]. The intercept was statistical significant, a=-3.13,t(534)=-4.23, p<.001, 95% CI [-4.58,-1.68]. Final inspection of the residuals supported normality and homoscedasticity. In short, there was a statistically significant strongly positive linear relationship between the cost of living and groceries index. Therefore, in order to exactly understand the level of groceries demand or the groceries price which residents are willing to pay in cities of countries, Woolworths should focus on the cost of living as well as its ranking to analyse and evaluate in choosing a good investment.
During this investigation, we have to deal with limitation as well as strength:
*Strength: Based on kaggle website, we have an advantage in collecting data from various cities of countries (n>30,normal distribution). This advantage will help Woolworths gain more choices to refine the best location for their challege investment through the cost of living and its ranking of each area.
*Limitation: Based on kaggle website, some data files lack of description, so it takes us long time to look for data files with completeness and reliability.
We collect the data from this link https://www.kaggle.com/debdutta/cost-of-living-index-by-country