Read.me
The structure of the notebook is as the following:
- Data Description: description of the data and mini case
- 5-Step Framework for Regression Analysis
- Model Comparison: a model without
Brand Equity for
comparison
- Risk Control: to check the normality and equal variance
assumptions
Make sure you have downloaded and installed the course package “MSR”
from Canvas. Check Canvas (in Module - Week 1) about how to do it. For
this notebook, you need to download and load a R package
ggplot2.
library(MSR)
library(ggplot2)
Data Description
We first load the Marks and Spencer data from the MSR package using
data() function.
data("marks_spencer")
head(marks_spencer)
## Week Advertising Promotion Price Brand_Equity Sales
## 1 1 6.643165 No 2.664169 99.01042 208.9463
## 2 2 6.551415 Yes 4.799289 99.31082 159.9861
## 3 3 3.606885 No 4.798553 99.44092 157.0999
## 4 4 4.323537 No 4.674879 99.52237 168.7835
## 5 5 5.859314 No 3.101983 99.53748 203.5160
## 6 6 6.366751 No 4.583886 99.67113 165.1949
For a detailed description of the data, please use the following
command:
help("marks_spencer") # or you can use ?marks_spencer
## starting httpd help server ... done
From the data, we know the sales from Week 1 to
Week 75. Using the first 75 weeks of data, we want to
predict the sales from Week 76 to Week 100. It
is known ad hoc that the sales of Marks & Spencer decreased after
Week 75. The question is: can we
predict the sales decrease?
We are comparing two models, one model with Brand Equity and the
other model without Brand Equity, to highlight the importance of
including consumer-related measures in prediction.
Running Regression
Analysis Step-by-Step
In this section, we will run a linear regression to predict sales by
following the framework discussed in class step-by-step. For this
analysis, we are using Sales as the DV and all the other
variables as the IVs.
Note: The framework is presented to give a structure to follow as a
novice. If you have become an expert, no need to follow it exactly.
We are using the first 75 weeks to estimate the model (or as the
train set), and the remaining 25 weeks to do prediction (or as the test
set). So, from the full data, we first obtain a train set.
train <- marks_spencer[1:75,]
str(train)
## 'data.frame': 75 obs. of 6 variables:
## $ Week : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Advertising : num 6.64 6.55 3.61 4.32 5.86 ...
## $ Promotion : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 2 1 ...
## $ Price : num 2.66 4.8 4.8 4.67 3.1 ...
## $ Brand_Equity: num 99 99.3 99.4 99.5 99.5 ...
## $ Sales : num 209 160 157 169 204 ...
Examining the
data
To get a good prediction, we want good IVs that are relevant (or
correlated with the DV). In addition, we do NOT want IVs to be highly
correlated with one another, which lead to information redundancy and
eventually bad predictions.
Checking the
VIFs
We want to check if there’s any multi-collinearity problem. We do NOT
want IVs to be highly correlated. To this end, we calculate the VIFs.
You are already given a function called vif along with the
data. This function takes two inputs: the names of IVs that you want to
check in a data frame and the data frame containing those IVs, and
outputs the VIF values of the variables (as a vector).
# getting the variable names of the IVs
vnames <- colnames(train)[2:5]
# calculating the vif's
vif(vnames,train)
## PromotionYes Advertising Price Brand_Equity
## 1.112713 1.047414 1.008407 1.068874
The criterion for the VIF values is if they are larger than 10. In
the output, no VIFs are larger than 10. So, we do not have collinearity
issue.
Estimating the
model
Next, we estimate our linear regression model with the
lm function by translating the equation into an R formula.
Please see Canvas for the formula language.
model_no_controls <- lm(Sales ~ Price, data = train)
summary(model_no_controls)
##
## Call:
## lm(formula = Sales ~ Price, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.868 -17.852 6.217 16.776 32.608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 241.718 11.301 21.390 < 2e-16 ***
## Price -20.747 2.746 -7.556 9.57e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.56 on 73 degrees of freedom
## Multiple R-squared: 0.4389, Adjusted R-squared: 0.4312
## F-statistic: 57.1 on 1 and 73 DF, p-value: 9.566e-11
Validating the
model
From the results above, we first validate the model. We check a few
things.
The overall
significance
The idea of the overall significance is to compare the model we run
to a null model with no IVs. If our model significantly
outperforms the null model, it means it’s worth it. Or in statistical
term, we test the null hypothesis that the coefficients
(\(\beta\)’s) of all IVs are zero. Or,
\[H_0:\beta_1=\beta_2=...=0\] For this
model, the F-stat is 57.1, with a p-value < 0.05. So, we can reject
the null hypothesis, and the model is of predictive value.
The significance of
coefficients
We have included multiple IVs in the model. However, it’s possible
that an IV does not contribute to the prediction accuracy. To check
this, we focus on an individual coefficient and test the following null
hypothesis for a particular IV: \[H_0:\beta_k=0\] The coefficient of price
\(\beta_\text{price}\) is -20.75, with
a standard error of 2.74. The p-value < 0.05. So, we reject the null
hypothesis, and conclude that Price is of predictive
value.
Model fit
To see how well the model fits with our data, we look at
R-squared. The direct interpretation of R-squared is the
percentage of variation explained by the model. Here, the R-squared of
the model is 0.4389 or 43.89%. It means 94.85% of variation in the sales
is captured by the model.
There is no clear cutoff value for
R-squared. You need to consider the specific setting to judge
whether the R-squared is good enough. For example, in sales prediction,
we usually expect a big R-squared (e.g., 90%). This is because the
retailers oftentimes are faced with a relatively stable environment
where consumers show persistent habits of buying products. In our case,
the value of 43.89% indicates a fit that is below expectation.
Making
predictions
After estimating the model, we can use it to do prediction. In the
case, we want to predict the sales from Week 76 to
Week 100. Let’s first create a prediction data set (or test
set).
test <- marks_spencer[76:100,]
str(test)
## 'data.frame': 25 obs. of 6 variables:
## $ Week : num 76 77 78 79 80 81 82 83 84 85 ...
## $ Advertising : num 6.57 6.61 6.55 6.94 5.43 ...
## $ Promotion : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 2 2 1 ...
## $ Price : num 3.62 3.76 4.52 4.67 4.51 ...
## $ Brand_Equity: num 53 50.8 45.8 44.9 44.5 ...
## $ Sales : num NA NA NA NA NA NA NA NA NA NA ...
To predict the sales, we use the predict function.
Please use ?predict for more information of this
function.
sales_no_controls <- predict(model_no_controls,
newdata = test)
#coerce into a data frame
sales_no_controls <- as.data.frame(sales_no_controls)
Next, let’s create a line plot of the predicted sales to see the
predicted trend.
# add week id for plotting
sales_no_controls$week <- 76:100
# a line plot of the predicted sales
ggplot(data = sales_no_controls,
aes(x = week, y = sales_no_controls)) + geom_line()

Model Comparison
In this section, we are running a model control variables. The
analyses are the same as in Section 3. We would not do things
step-by-step. If you are interested, you can practice with this model by
yourself.
The model specification is as the following (with Brand Equity not
included): \[\text{Sales} = \beta_0+
\beta_1\text{Price} + \beta_2\text{Advertising} +
\beta_3\text{Promotion} + \beta_4\text{Brand_Equity} +e\] Next,
we estimate our linear regression model with the lm
function by translating the equation into an R formula. Please see
Canvas for the formula language.
model_with_controls <- lm(Sales ~ Price + Advertising + Promotion + Brand_Equity, data = train)
summary(model_with_controls)
##
## Call:
## lm(formula = Sales ~ Price + Advertising + Promotion + Brand_Equity,
## data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.493 -3.082 0.006 4.011 13.866
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 102.34877 7.33386 13.956 <2e-16 ***
## Price -20.21582 0.85321 -23.694 <2e-16 ***
## Advertising 1.74314 0.67864 2.569 0.0123 *
## PromotionYes -0.78282 1.68069 -0.466 0.6428
## Brand_Equity 1.46121 0.05805 25.170 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.363 on 70 degrees of freedom
## Multiple R-squared: 0.9485, Adjusted R-squared: 0.9455
## F-statistic: 322.1 on 4 and 70 DF, p-value: < 2.2e-16
To compare this model with the model with Brand Equity, we use
adjusted R-squared. This is because the two model have
different no. of IVs. So, we must adjust for this
difference. The adjusted R-squared of the model with vs. without
controls are 0.9455 vs. 0.4312. Therefore, the model with controls fits
the data better.
The coefficient of price in this model is -20.22, larger than that in
the model without control variables (-20.75). The relatively small
change in price coefficient does not mean we have a good estimation of
price effect. It is possible that we fail to add other control
variables, such as seasonality or competition.
Risk Control
After running the models, we want to check if the statistical
assumptions of linear regression are Okay. As discussed in class, the
violation of these assumptions leads to biased estimation and bad
prediction. Over the years, people have developed many tests for these
assumptions. We will focus on two assumptions: 1) the normality
assumption; and 2) the equal variance assumption.
We will use the full model (the model with brand equity or ) as an
example. For other models, the procedure works the same.
The normality
assumption
For the normality assumption, we are using KS test. KS test checks a
variable against a normal distribution and the value of the KS test
tells you whether the variable follows a normal distribution.
Note that the null hypothesis of the test is: \[H_0: \text{The variable follows a normal
distribution}.\] For the normality assumption to hold, the null
hypothesis should not be rejected. In other words, the KS test should
NOT be significant if the normality assumption holds.
First, we obtain the residuals from the estimation results and
visualize the distribution with a histogram.
# obtain the residuals using the residuals() function
error_term <- residuals(model_no_controls)
# build a histogram
# leave the data empty as we only plot a variable error_term
ggplot( , aes(x=error_term)) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the plot, it seems the error_term does not follow a normal
distribution. We may further use KS test to do the formal testing. Note
that by default, the KS test compares a variable to a standard normal
distribution, or a normal distribution with mean 0 and standard
deviation 1. Therefore, we need to first re-scale the residuals to make
it consistent with the default distribution.
# scale() function standardizes a variable use help(scale) to see more.
std_error_term <- scale(error_term)
# the ks test with ks.test();
#"pnorm" means testing against the standard normal distribution
ks.test(std_error_term,"pnorm")
##
## Exact one-sample Kolmogorov-Smirnov test
##
## data: std_error_term
## D = 0.13614, p-value = 0.1129
## alternative hypothesis: two-sided
From the test results, p-value is 0.11 > 0.05. So, we cannot
reject the null hypothesis. This implies the normality assumption is
Okay.
The equal variance
assumption
Another assumption of linear regression is all residuals follow the
same normal distribution or the same variance. Although there is a
formal test for this, it is beyond the scope of this course. Instead, we
rely on plotting to eyeball this assumption.
To produce this plot, we need two values: the residuals and the DV
(sales in this case). We do a scatter plot with the residuals as the
Y-axis and the predicted DV as the X-axis. In addition, sometimes it’s
not so easy to see the patterns due to the scales of variables. So, we
rescale the variables by standardization (with the scale()
function).
# getting the predicted sales of the test set
test_sales_no_controls <- predict(model_no_controls)
# plotting to see the residuals vs. predicted sales
ggplot( , aes(y = std_error_term, x = scale(test_sales_no_controls)))+
geom_point() + geom_hline(yintercept = 0)

The trick is to check whether at different value of the DV (sales),
the span or the range of the residuals are similar. You may add
auxiliary lines to help you check.
Equal Variance Scatter Plot
From the plot, we an see that the range or span of residuals are
similar. So, the equal variance assumption is likely to hold. Note that
as we rely on eyeballing, it can be subjective. For the learning
purpose, as long as you know how to use the plot to detect equal
variance, your conclusion is not an issue.
---
title: 'Lecture 2 - Market Response Model'
output:
  html_document:
    df_print: default
    code_download: yes
    theme: readable
    toc: yes
    toc_float: 
      collapsed: no
      smooth_scroll: no
    number_sections: yes
  pdf_document:
    toc: yes
---

# Read.me
The structure of the notebook is as the following:

* Data Description: description of the data and mini case
* 5-Step Framework for Regression Analysis
* Model Comparison: a model without `Brand Equity` for comparison
* Risk Control: to check the normality and equal variance assumptions

Make sure you have downloaded and installed the course package "MSR" from Canvas. Check Canvas (in Module - Week 1) about how to do it. For this notebook, you need to download and load a R package `ggplot2`.

```{r message=FALSE,warning=FALSE}
library(MSR)
library(ggplot2)
```

# Data Description 

We first load the Marks and Spencer data from the MSR package using `data()` function.

```{r }
data("marks_spencer")
head(marks_spencer)
```

For a detailed description of the data, please use the following command:
```{r}
help("marks_spencer") # or you can use ?marks_spencer
```

From the data, we know the sales from `Week 1` to `Week 75`. Using the first 75 weeks of data, we want to predict the sales from `Week 76` to `Week 100`. It is known ad hoc that the sales of Marks & Spencer decreased after `Week 75`. The question is: <span style="color: red;">can we predict the sales decrease?</span>

We are comparing two models, one model with Brand Equity and the other model without Brand Equity, to highlight the importance of including consumer-related measures in prediction. 

# Running Regression Analysis Step-by-Step 

In this section, we will run a linear regression to predict sales by following the framework discussed in class step-by-step. For this analysis, we are using `Sales` as the DV and all the other variables as the IVs. 

Note: The framework is presented to give a structure to follow as a novice. If you have become an expert, no need to follow it exactly. 

We are using the first 75 weeks to estimate the model (or as the train set), and the remaining 25 weeks to do prediction (or as the test set). So, from the full data, we first obtain a train set. 

```{r}
train <- marks_spencer[1:75,]
str(train)
```

## Examining the data

To get a good prediction, we want good IVs that are relevant (or correlated with the DV). In addition, we do NOT want IVs to be highly correlated with one another, which lead to information redundancy and eventually bad predictions. 

### Checking the VIFs

We want to check if there's any multi-collinearity problem. We do NOT want IVs to be highly correlated. To this end, we calculate the VIFs. You are already given a function called `vif` along with the data. This function takes two inputs: the names of IVs that you want to check in a data frame and the data frame containing those IVs, and outputs the VIF values of the variables (as a vector).  

```{r}
# getting the variable names of the IVs
vnames <- colnames(train)[2:5]

# calculating the vif's 
vif(vnames,train)
```

The criterion for the VIF values is if they are larger than 10. In the output, no VIFs are larger than 10. So, we do not have collinearity issue. 

## Formulating the model

We start the simplest model of using `Price` as the IV and `Sales` as the DV. The regression equation is as the following: 
$$\text{Sales} = \beta_0+\beta_1\text{Price}+e$$

## Estimating the model

Next, we estimate our linear regression model with the `lm` function by translating the equation into an R formula. Please see Canvas for the formula language. 

```{r}
model_no_controls <- lm(Sales ~ Price, data = train)
summary(model_no_controls)
```

## Validating the model

From the results above, we first validate the model. We check a few things. 

### The overall significance

The idea of the overall significance is to compare the model we run to a `null model` with no IVs. If our model significantly outperforms the null model, it means it's worth it. Or in statistical term, we test the `null hypothesis` that the coefficients ($\beta$'s) of all IVs are zero. Or, 
$$H_0:\beta_1=\beta_2=...=0$$
For this model, the F-stat is 57.1, with a p-value < 0.05. So, we can reject the null hypothesis, and the model is of predictive value. 

### The significance of coefficients

We have included multiple IVs in the model. However, it's possible that an IV does not contribute to the prediction accuracy. To check this, we focus on an individual coefficient and test the following null hypothesis for a particular IV: 
$$H_0:\beta_k=0$$
The coefficient of price $\beta_\text{price}$ is -20.75, with a standard error of  2.74. The p-value < 0.05. So, we reject the null hypothesis, and conclude that `Price` is of predictive value. 

### Model fit

To see how well the model fits with our data, we look at `R-squared`. The direct interpretation of R-squared is the percentage of variation explained by the model. Here, the R-squared of the model is 0.4389 or 43.89%. It means 94.85% of variation in the sales is captured by the model. 

<span style="color: red;">There is no clear cutoff value for R-squared.</span> You need to consider the specific setting to judge whether the R-squared is good enough. For example, in sales prediction, we usually expect a big R-squared (e.g., 90%). This is because the retailers oftentimes are faced with a relatively stable environment where consumers show persistent habits of buying products. In our case, the value of 43.89% indicates a fit that is below expectation. 

## Making predictions

After estimating the model, we can use it to do prediction. In the case, we want to predict the sales from `Week 76` to `Week 100`. Let's first create a prediction data set (or test set). 

```{r}
test <- marks_spencer[76:100,]
str(test)
```

To predict the sales, we use the `predict` function. Please use `?predict` for more information of this function. 

```{r}
sales_no_controls <- predict(model_no_controls,
                            newdata = test)
#coerce into a data frame
sales_no_controls <- as.data.frame(sales_no_controls)
```

Next, let's create a line plot of the predicted sales to see the predicted trend. 

```{r, fig.width=4.5, fig.height=3.375}
# add week id for plotting
sales_no_controls$week <- 76:100

# a line plot of the predicted sales
ggplot(data = sales_no_controls,
       aes(x = week, y = sales_no_controls)) + geom_line()
```

# Model Comparison

In this section, we are running a model control variables. The analyses are the same as in Section 3. We would not do things step-by-step. If you are interested, you can practice with this model by yourself. 

The model specification is as the following (with Brand Equity not included):
$$\text{Sales} = \beta_0+ \beta_1\text{Price} + \beta_2\text{Advertising} + \beta_3\text{Promotion} + \beta_4\text{Brand_Equity} +e$$
Next, we estimate our linear regression model with the `lm` function by translating the equation into an R formula. Please see Canvas for the formula language. 

```{r}
model_with_controls <- lm(Sales ~ Price + Advertising + Promotion + Brand_Equity, data = train)
summary(model_with_controls)
```

To compare this model with the model with Brand Equity, we use `adjusted R-squared`. This is because the two model have different no. of IVs. So, we must `adjust` for this difference. The adjusted R-squared of the model with vs. without controls are 0.9455 vs. 0.4312. Therefore, the model with controls fits the data better. 

The coefficient of price in this model is -20.22, larger than that in the model without control variables (-20.75). The relatively small change in price coefficient does not mean we have a good estimation of price effect. It is possible that we fail to add other control variables, such as seasonality or competition. 

# Risk Control 

After running the models, we want to check if the statistical assumptions of linear regression are Okay. As discussed in class, the violation of these assumptions leads to biased estimation and bad prediction. Over the years, people have developed many tests for these assumptions. We will focus on two assumptions: 1) the normality assumption; and 2) the equal variance assumption. 

We will use the full model (the model with brand equity or ) as an example. For other models, the procedure works the same. 

## The normality assumption 

For the normality assumption, we are using KS test. KS test checks a variable against a normal distribution and the value of the KS test tells you whether the variable follows a normal distribution. 

Note that the null hypothesis of the test is: 
$$H_0: \text{The variable follows a normal distribution}.$$
For the normality assumption to hold, the null hypothesis should not be rejected. In other words, the KS test should NOT be significant if the normality assumption holds. 

First, we obtain the residuals from the estimation results and visualize the distribution with a histogram. 

```{r, fig.width=4.5, fig.height=3.375}
# obtain the residuals using the residuals() function
error_term <- residuals(model_no_controls)

# build a histogram
# leave the data empty as we only plot a variable error_term
ggplot( , aes(x=error_term)) + geom_histogram()

```

From the plot, it seems the error_term does not follow a normal distribution. We may further use KS test to do the formal testing. Note that by default, the KS test compares a variable to a standard normal distribution, or a normal distribution with mean 0 and standard deviation 1. Therefore, we need to first re-scale the residuals to make it consistent with the default distribution. 

```{r}
# scale() function standardizes a variable use help(scale) to see more.
std_error_term <- scale(error_term)

# the ks test with ks.test(); 
#"pnorm" means testing against the standard normal distribution
ks.test(std_error_term,"pnorm")
```

From the test results, p-value is 0.11 > 0.05. So, we cannot reject the null hypothesis. This implies the normality assumption is Okay. 

## The equal variance assumption

Another assumption of linear regression is all residuals follow the same normal distribution or the same variance. Although there is a formal test for this, it is beyond the scope of this course. Instead, we rely on plotting to eyeball this assumption.

To produce this plot, we need two values: the residuals and the DV (sales in this case). We do a scatter plot with the residuals as the Y-axis and the predicted DV as the X-axis. In addition, sometimes it's not so easy to see the patterns due to the scales of variables. So, we rescale the variables by standardization (with the `scale()` function). 

```{r, fig.width=4.5, fig.height=3.375}
# getting the predicted sales of the test set
test_sales_no_controls <- predict(model_no_controls)

# plotting to see the residuals vs. predicted sales
ggplot( , aes(y = std_error_term, x = scale(test_sales_no_controls)))+ 
  geom_point() + geom_hline(yintercept = 0)
```

The trick is to check whether at different value of the DV (sales), the span or the range of the residuals are similar. You may add auxiliary lines to help you check.

![Equal Variance Scatter Plot](equal_variance.jpg){width=80%}

From the plot, we an see that the range or span of residuals are similar. So, the equal variance assumption is likely to hold. Note that as we rely on eyeballing, it can be subjective. For the learning purpose, as long as you know how to use the plot to detect equal variance, your conclusion is not an issue.  