Multiple Linear Regression

Author

J Sigma

In simple linear regression, we try to explain or predict a response variable using one explanatory variable.

For example:

In reality, final marks are rarely determined by lecture attendance alone.

Other factors may matter too:

So instead of using one explanatory variable, we use multiple predictors simultaneously.

Simple Linear Regression vs Multiple Linear Regression

Population and Sample Models

We have already established, for simple linear regression, that the population model is given by

\[y_{i}=\beta_{0}+\beta_{1}x +\epsilon_{i}\]

where

  • \(i\) refers to a specific observation

  • \(\beta_{0}\) is the intercept parameter

  • \(\beta_{1}\) is the slope parameter; and

  • \(\epsilon_{i}\) is the error for the \(i\)-th observation

The sample model is then given by

\[\hat{y}_{i}=\hat{\beta_{0}}+\hat{\beta}_{1}x\]

since we assume that the errors are normally distributed with a mean of \(0\). Here, there is one dependent variable and only one explanatory variable. For multiple linear regression, we have more than one explanatory variable. So, we adjust the population model for simple linear regression slightly. We have:

\[y_{i}=\beta_{0}+\beta_{1}x_{1i}+\beta_{2}x_{2i}+\dots+\beta_{p}x_{pi}+\epsilon_{i}\]

Here, we have \(p\) independent variables.

  • \(i\) refers to an observation

  • \(\beta_{0}\) is the intercept parameter

  • \(\beta_{j}\) is the slope parameter for the \(j\)-th independent variable; and

  • \(\epsilon_{i}\) is the error

For multiple linear regression, the sample model is given by

\[\hat{y}_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{1i}+\hat{\beta}_{2}x_{2i}+\dots+\hat{\beta}_{p}x_{pi}\]

Visually, we have the following difference

Here, we can see that simple linear regression can be represented by a line on a two-dimensional plane. In contrast, multiple linear regression forms a plane in three-dimensional space. We could have a hyperplane for more than two variables

Estimating the \(\beta\) Parameters

Like in simple linear regression, the ordinary least squares (OLS) method is used to estimate the \(\beta\) parameters by minimising \(\sum\epsilon_{i}^{2}\)

Aim of Multiple Linear Regression

  1. To model the dependent variable using many independent variables
  2. To predict the value of the dependent variable from the values of the multiple independent variables
  3. To understand how a dependent variable changes with many independent variables

We will try to achieve these aims by using the following example problem:

WarningWorking Example

A company produces Fresh, a brand of detergent. In order to manage its inventory more effectively and make revenue predictions, this company would like to better predict the demand for Fresh. To develop a prediction model, the company has gathered data concerning demand for Fresh over the last 30 sales periods. The first few lines of the dataset are shown below:

Code
#########################
# IMPORTING DATA INTO R
#########################

fresh_data <- data.frame(
  demand = c(
    7.38, 8.51, 9.52, 7.50, 9.33, 8.28, 8.75, 7.87,
    7.10, 8.00, 7.89, 8.15, 9.10, 8.86, 8.90, 8.87,
    9.26, 9.30, 8.75, 7.95, 7.65, 7.27, 8.30, 8.50,
    8.75, 9.21, 8.27, 7.67, 7.93, 9.26
  ),

  fresh_price = c(
    3.85, 3.75, 3.70, 3.70, 3.60, 3.60, 3.60, 3.80,
    3.80, 3.85, 3.90, 3.90, 3.70, 3.75, 3.75, 3.80,
    3.70, 3.80, 3.70, 3.80, 3.80, 3.75, 3.70, 3.55,
    3.60, 3.65, 3.70, 3.75, 3.80, 3.70
  ),

  ads_expenditure = c(
    5.5, 6.75, 7.25, 5.5, 7.0, 6.5, 6.75, 5.25,
    5.25, 6.0, 6.5, 6.25, 7.0, 6.9, 6.8, 6.8,
    7.1, 7.0, 6.8, 6.5, 6.25, 6.0, 6.5, 7.0,
    6.8, 6.8, 6.5, 5.75, 5.8, 6.8
  ),

  size = c(
    "Small", "Big", "Big", "Small", "Big", "Small", "Big", "Small",
    "Small", "Small", "Small", "Small", "Big", "Big", "Big", "Big",
    "Big", "Big", "Big", "Small", "Small", "Small", "Small", "Big",
    "Big", "Big", "Small", "Small", "Small", "Big"
  ),

  ads_campaign = c(
    "B", "B", "B", "A", "C", "A", "C", "C",
    "B", "C", "A", "C", "C", "A", "B", "B",
    "B", "A", "B", "B", "C", "A", "A", "A",
    "A", "B", "C", "B", "C", "C"
  ),

  competitor_price = c(
    3.80, 4.00, 4.30, 3.70, 3.85, 3.80, 3.75, 3.85,
    3.65, 4.00, 4.10, 4.00, 4.10, 4.20, 4.10, 4.10,
    4.20, 4.30, 4.10, 3.75, 3.75, 3.65, 3.90, 3.65,
    4.10, 4.25, 3.65, 3.75, 3.85, 4.25
  )
)

# first few (6) line of data
head(fresh_data)
  demand fresh_price ads_expenditure  size ads_campaign competitor_price
1   7.38        3.85            5.50 Small            B             3.80
2   8.51        3.75            6.75   Big            B             4.00
3   9.52        3.70            7.25   Big            B             4.30
4   7.50        3.70            5.50 Small            A             3.70
5   9.33        3.60            7.00   Big            C             3.85
6   8.28        3.60            6.50 Small            A             3.80

Here, we have

  • \(y \implies \text{demand for Fresh (in $100 000$s)}\)

  • \(x_{1} \implies \text{price for Fresh (in $10$ rands)}\)

  • \(x_{2} \implies \text{ad expenditure to promote Fresh (in $1000$ rands)}\)

  • \(x_{3} \implies \text{size of the company (big or small)}\)

  • \(x_{4} \implies \text{ads campaign used by the company}\). (A: TV campaigns, B: Mixture of TV and Radio ads, C: Mixture of TV, radio, magazine and newspaper ads)

  • \(x_{5} \implies \text{average competitor price for liquid detergents}\)

Correlation Analysis

For multiple linear regression, we calculate the correlation between \(y\) and each independent variable, \(x_{j}\). So, for \(p\) independent variables, we end up having \(p\) correlation coefficients.

Note

Later, we will be interested in looking at the correlation between each of the independent variables. As it turns out, this is important for discussing how much value the explanatory variables add to explaining how \(y\) changes.

Since we want the correlation coefficients between all of our variables, a convenient way of doing this is by creating a correlation matrix, as opposed to manually calculating the correlation between all the variables.

Code
#########################
# CORRELATION MATRIX
#########################

fresh_numeric <- fresh_data[, c(1,2,3,6)]

# here, we only used variables from the dataset which are numeric (or continuous)
# because we can create a correlation matrix for those without introducing 
# any new intricacies 

# later on, we will see how we many include "catgeorical varibales", c(4,5)

cor(fresh_numeric)
                     demand fresh_price ads_expenditure competitor_price
demand            1.0000000 -0.45893255       0.8816511       0.75368059
fresh_price      -0.4589326  1.00000000      -0.4687930       0.07836681
ads_expenditure   0.8816511 -0.46879301       1.0000000       0.60454000
competitor_price  0.7536806  0.07836681       0.6045400       1.00000000
  • For demand and Fresh price : \(r=-0.469\) which indicates a moderate negative correlation between demand and the price of Fresh

  • For demand and ads expenditure : \(r=0.882\) which indicates a strong positive relationship between demand and ads expenditure

  • For demand and competitor price : \(r=0.754\) which indicates a fairly strong positive relationship between demand and competitor price

Correlation Matrix and Pairwise Plot

A correlation matrix plot is a visual representation of a correlation matrix. Here, each cell represents the correlation between two variables, and contains the value for \(r\) for each pair of variables.

  • Red blocks indicate a positive correlation, and the intensity increases with the strength of correlation

  • Blue blocks indicate a negative correlation and the intensity increases with the strength of correlation

  • White blocks are where there is little to no correlation between variables

A pairwise plot plots every single variable against every other variable, and shows

  • scatter plots in the lower triangle. This gives a visual idea of the relationship between variables, and a qualitative idea of how strong it is

  • density curves along the diagonal which show the spread of the values of a particular variable. This gives an idea about whether the values are skewed or centred; and

  • correlation coefficients between two variables in the upper triangle

Multicollinearity

Multicollinearity occurs when there are high correlations between the independent variables in a multiple linear regression analysis. This is not a good thing, because it introduces redundancies in explaining the variability in the dependent variable.

For our example,

  • there is no concern in the correlation between price and competitor price since we have that \(r=0.078\). This is a very weak positive correlation

  • there is some concern in the correlation between price and ads expenditure. Here, \(r=-0.469\). This is a relatively strong positive correlation and is significant enough to cause some redundancies in explaining changes the dependent variable

  • there is concern in the correlation between competitor price and ads expenditure since \(r=0.605\). This is a moderately strong positive correlation between the two variables

Note

For now, we continue with this example to explore other things we are able to do in MLR using R. However, in the real world, we might have wanted to do away with some variables provided we have thoroughly analysed the impacts of doing so.

Consequences of Multicollinearity for a Regression Analysis

Multicollinearity is problematic because we do not know which variable explains the variability in the dependent variable, and this causes uncertainty in the parameter estimates.

It affects analyses in the following ways:

  • Inflated standard error, and therefore inflated \(p\)-values

  • removing/adding correlated independent variables from the model results in larger changes to parameter estimates

  • An increase in RSE when adding a correlated independent variable to the model which causes the errors in the model to be bigger. This lowers the accuracy of the model

Multiple Linear Regression with Continuous Explanatory Variables

Here, we only consider a multiple linear regression analysis using only numeric variables. If our data contains categorical variables – which, in our case, it does – we do not include them in the regression analysis.

Later on, we will look at what to do so as to include categorical variables.

Performing Multiple Linear Regression in R

Code
############################################
# FITTING A MULTIPLE LINEAR REGRESSION MODEL
############################################

fresh_fit <- lm(demand ~ fresh_price + ads_expenditure + competitor_price, data=fresh_data)

summary(fresh_fit)

Call:
lm(formula = demand ~ fresh_price + ads_expenditure + competitor_price, 
    data = fresh_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.41083 -0.12355 -0.00404  0.13311  0.49957 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        7.0705     2.3335   3.030 0.005471 ** 
fresh_price       -2.2826     0.6088  -3.749 0.000896 ***
ads_expenditure    0.5162     0.1201   4.297 0.000215 ***
competitor_price   1.6530     0.2819   5.864  3.5e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.224 on 26 degrees of freedom
Multiple R-squared:  0.9053,    Adjusted R-squared:  0.8943 
F-statistic: 82.82 on 3 and 26 DF,  p-value: 1.976e-13

We get our regression equation as:

\[\hat{y}=7.0705-2.2826(\text{fresh price})+0.5162(\text{ads expenditure})+1.6530(\text{competitor price})\]

Interpreting the \(\beta\) Coefficients

ImportantGeneral Interpretation of the Coefficients

Interpretation of \(\beta_{0}\) : it is the average estimated value of \(y\) when all \(x_{j}=0\) for \(j \in \{1,2,3, \dots, p\}\)

Interpretation of \(\hat{\beta}_{j}\) : it is the average estimated change in \(y\) for a unit increase in \(x_{j}\), holding all the other independent variables constant

For our example,

\(\hat{\beta}_{0}\) : on average, the demand for Fresh is \(7.0705\times100000=707050\) bottles when all the indepedendent variables take on a value of \(0\).

\(\hat{\beta}_{1}\) : on average, the demand for Fresh decreases by \(2.2826\times100000=228260\) bottles for each additional \(\text{R}10\) increase in the price of fresh, holding all the other independent variables constant.

\(\hat{\beta}_{2}\) : on average, the demand for Fresh increases by \(0.5162\times100000=51620\) bottles for each additional \(\text{R}1000\) increase in the ads expenditure to promote Fresh, ceteris paribus.

\(\hat{\beta}_{3}\) : on average, the demand for Fresh increases by \(1.6530\times100000=165300\) bottles for each additional \(\text{R}10\) increase in the price for Fresh, ceteris paribus.

Testing the Significance of the \(\beta\) Estimates

In multiple linear regression, we do a significance test on each of the \(\beta_{j}\) estimates separately. The test proceeds as we normally would do it. Suppose we do a significance test for the ads expenditure estimate. Then, we would have that

\[H_{0} : \beta_{2}=0\]

\[\text{and}\]

\[H_{1}:\beta_{2}\neq0\]

are our null and alternative hypotheses. We define \(\alpha=0.05\). The test statistic still follows a \(t\) distribution, and we have that

\[\begin{align*} t=\frac{\hat{\beta}_{j}-\beta_{j}}{se(\hat{\beta}_{j})} \sim t_{n-p-1} \end{align*}\]

with \(p\) being the number of independent variables used to model the dependent variable. For our case, we get that

\[\begin{align*} t &=\frac{0.5162-(0)}{0.1201} \sim t_{30-3-1}\\ &=4.298 \sim t_{27} \end{align*}\]

Note

The test statistic is also given in the model summary

We can, then, calculate the \(p\)-value associated with the test statistic:

Code
####################################
# CALCULATING THE P-VALUE MANUALLY
####################################

pval <- 2*pt(q=4.298, df=26, lower.tail=FALSE)
pval
[1] 0.0002141684

This value could have also be obtained from the model summary.

We then reject the null hypothesis since the \(p\)-value is less than the significance level, and conclude that there is significant evidence of a linear relationship between the ads expenditure and the demand of Fresh.

Confidence Intervals for \(\beta_{j}\) Estimates

We can get a confidence interval around our estimates. These are ranges of values for which we a relatively sure that they contain the true \(\beta\) values. We calculate them using

\[\text{CI}=\beta_{j}\pm t_{\alpha/2, n-p-1}\times se(\hat{\beta_{j}})\]

As an example, we can calculate a \(95\%\) confidence interval for the ads expenditure as follows:

Code
##########################################
# CRITICAL VALUE FOR CONF. INT.
##########################################

# critical value
tcrit <- qt(p=0.975, df=26)
tcrit
[1] 2.055529

and so we have that

\[\begin{align*} \text{CI} &= 0.5162 \pm 2.06\times0.1201\\ &=[0.27,0.76] \end{align*}\]

Checking Overall Model Significance

The way we check overall model significance for multiple linear regression is the same as the way we check it for simple linear regression. Again, we begin by gathering some information

Source of Variation \(\text{df}\) Sum of Squares Mean Squares F
Regression \(p\) \(SS_{reg}=(\hat{y}_{i}-\bar{y})^{2}\) \(MS_{reg}=\frac{SS_{reg}}{p}\) \(F=\frac{MS_{reg}}{MSE}\)
Error (Residuals) \(n-p-1\) \(SSE=(y_{i}-\hat{y}_{i})^{2}\) \(MSE=\frac{SSE}{n-p-1}\)
Total \(n-1\) \(SS_{tot}=(y_{i}-\bar{y})^{2}\)

We the use the \(\text{f}\)-statistic to perform an \(\text{F}\) test. As before, we have that \(\sqrt{MSE}=RSE\).

Our null and alternative hypotheses are given by

\[H_{0}: \beta_{1}=\beta_{2}=\dots=\beta{p}=0 \text{ (null model)}\]

\[\text{and}\]

\[H_{1}:\text{at least one $\beta_{j}\neq0$}\]

We perform the significance test at the \(5\%\) level. We move on to calculating the test statistic.

Code
########################################################
# CALCULATING THE TEST STATISTIC FROM FIRST PRINCIPLES
########################################################

p <- length(coef(fresh_fit)[-1]) # number of beta estimates (less the 
#interecpt)

n <- nobs(fresh_fit) #number of observations 

# Regression
y <- fresh_data$demand
ybar <- mean(y)
yhat <- fitted(fresh_fit)

SS_reg <- sum((yhat-ybar)^2)

# Error (Residuals)
SSE <- sum((y-yhat)^2)

# MS Regression
MS_reg <- SS_reg/p

# MS Errors
MSE <- SSE/(n-p-1)

# F statistic 
f <- MS_reg/MSE
f
[1] 82.82327

We have, then, that our test statistic is given by \(\text{F}=82.82\) with \(27\) degrees of freedom. Then,

Code
##############
# P-VALUE
##############

pval <- pf(q=82.82327, df1=p, df2=n-p-1, lower.tail=F)
pval
[1] 1.976379e-13

Since our \(p\)-value is much less than the significance level we have defined, we reject the null hypothesis and conclude that there is significant evidence that our model is different from the null model.

Coefficient of Determination

For multiple linear regression, the coefficient of determination, \(R^{2}\), is given by:

\[R^{2}=\frac{SS_{reg}}{SS_{tot}}=\frac{\sum(y_{i}-\hat{y}_{i})^{2}}{\sum(y_{i}-\bar{y})^{2}}\]

There is an issue here:

  • \(SS_{tot}\) only depends on \(y_{i}\), and so it does not change when we add new model predictors

  • In MLS, adding a new variable gives the model more flexibility (or degrees of freedom) since we can always set the new coefficient to zero and reproduce the old fit.

  • \(SS_{reg}\) can only stay the same or increase when adding new variables to the model. This means that the value of \(R^{2}\) never decreases when adding new variables. This rewards model overfitting.

The issue here is that if we use \(R^{2}\) to compare models, models with more variables will always be favoured even when they are worse-off than smaller models. This motivates the use of the adjusted coefficient of determination to compare models.

Adjusted Coefficient of Determination

The adjusted coefficient of determination adjusts the \(R^{2}\) value to consider the size of the model. In this way, it penalises the addition of new explanatory variables. It is defined by:

\[R^{2}_{\text{adj}}=1-\left(\frac{n-1}{n-p-1}\right)(1-R^{2})\]

We still have the bound

\[0\leq R^{2}_{\text{adj}} \leq 1\]

The adjusted \(R^{2}\) describes the proportion of variation in the dependent variable that is explained by the proportion of variation in the independent variables, taking the sample size and number of independent variables into account.

We can obtain it from the model summary.

Code
summary(fresh_fit)

Call:
lm(formula = demand ~ fresh_price + ads_expenditure + competitor_price, 
    data = fresh_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.41083 -0.12355 -0.00404  0.13311  0.49957 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)        7.0705     2.3335   3.030 0.005471 ** 
fresh_price       -2.2826     0.6088  -3.749 0.000896 ***
ads_expenditure    0.5162     0.1201   4.297 0.000215 ***
competitor_price   1.6530     0.2819   5.864  3.5e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.224 on 26 degrees of freedom
Multiple R-squared:  0.9053,    Adjusted R-squared:  0.8943 
F-statistic: 82.82 on 3 and 26 DF,  p-value: 1.976e-13

From this, we can see that the coefficient of determination is \(0.8943\).

Model Checking

For multiple linear regression, we have the same model assumptions as in simple linear regression. Those are

  • Linearity

  • Normal distribution of errors with a mean of \(0\)

  • Constant error variance

  • Independence of errors

As such, we check the model in the same way.

Code
##############################
# CHECKING MODEL ASSUMPTIONS
##############################

#assessing linearity
plot(fresh_data$fresh_price, fresh_data$demand)

Code
plot(fresh_data$ads_expenditure, fresh_data$demand)

Code
plot(fresh_data$competitor_price, fresh_data$demand)

Code
# assessing normality of errors

# histogram
hist(residuals(fresh_fit))

Code
# qqplot 
qqnorm(residuals(fresh_fit))
qqline(residuals(fresh_fit))

Code
# constant error variance
plot(fresh_fit, which=1, add.smooth=F)

Code
# independence of errors
plot(residuals(fresh_fit))

From these plots, it looks like our data meets all of the assumptions.

Prediction

Prediction for multiple linear regression works in the same manner as with simple linear regression, only with the adjustment that we need to provide values for all the independent variables to make a prediction. We already have that:

\[\hat{y}=7.0705-2.2826(\text{fresh price})+0.5162(\text{ads expenditure})+1.6530(\text{competitor price})\]

Suppose we want to know the demand predicted from Fresh when the price of Fresh is \(\text{R}50\), the average competitor price is \(\text{R}55\), and the advertising expenditure is \(\text{R}8000\). Then, the predicted demand would be given by

\[\begin{align*} \hat{y} &= 7.0705-2.2826(5)+0.5126(8)+1.6530(5.5)\\ &= 8.8786 \end{align*}\]

So, the predicted demand would be \(8.8786\times100000=887860\) bottles.

Confidence and Prediction Intervals

Because of how complicated manual methods of finding prediction and confidence intervals are for multiple linear regression, we stick to using R to obtain these.

Code
#####################################
# PREDICTING FOR A SINGLE INDIVIDUAL
#####################################

newdat <- list(fresh_price=5, ads_expenditure=8, competitor_price=5.5)
# makes it easier on the eye

predict(fresh_fit, newdata=newdat, interval="prediction")
      fit      lwr      upr
1 8.87845 7.296745 10.46016
Code
###############################
# PREDICTING FOR AN AVERAGE
###############################

predict(fresh_fit, newdata=newdat, interval="confidence")
      fit      lwr      upr
1 8.87845 7.365223 10.39168
  • Prediction interval : \([7.297,10.46]\)

  • Confidence interval : \([7.365, 10.39]\)

Multiple Linear Regression with Categorical Explanatory Variables

Sometimes, our datasets include both numerical and categorical variables. Before, we bypassed this by removing the categorical variables, but these variables may be very important. So, we introduce a way to include them into the model.

Let us start by considering a variable with two levels.

Note

Levels refer to the number of subcategories that exist for a categorical variable.

Take the company size variable from our data which has the levels “big” and “small”. We can make the category of company sizes binary, letting small take a value of \(0\) and “big” take a value of \(1\). This allows us to treat the categorical variable as numeric while still preserving the fact that it has two levels. Our regression equation will, then, be given by

\[\hat{y}_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{1i}+\hat{\beta}_{2}x_{2i}+\hat{\beta}_{3}x_{3i}+\hat{\beta}_{4}x_{4i}\]

with \(\hat{\beta}_{4}\) being the change in demand when the size of the company is big versus when it is small.

recall : we chose “big” to take on the value of \(1\) (and “small” to take on a value of \(0\)). We interpret this as a change in the independent variable when the independent variable takes on a value of \(1\) compared to \(0\). The reason why we do this is because when the company is small, the last term falls away, and so does not affect the demand. However, when the company is big, the last term is present and affects the demand.

This treats small as a reference category. What happens, geometrically, is that when a company is big the intercept is adjusted by \(\hat{\beta}_{4}\). This captures the change in demand when the company is big. Since “small” is the reference category, the intercept parameter captures what happens when a company is small.

Including Binary Variables in R

Code
##############################
# DEFAULT REFERENCE CATEGORY
##############################

# this assigns the category that comes first in the alphabet as the 
# reference category
fresh_data$size <- as.factor(fresh_data$size) # makes "size" numeric

# viewing data with new changes
str(fresh_data)
'data.frame':   30 obs. of  6 variables:
 $ demand          : num  7.38 8.51 9.52 7.5 9.33 8.28 8.75 7.87 7.1 8 ...
 $ fresh_price     : num  3.85 3.75 3.7 3.7 3.6 3.6 3.6 3.8 3.8 3.85 ...
 $ ads_expenditure : num  5.5 6.75 7.25 5.5 7 6.5 6.75 5.25 5.25 6 ...
 $ size            : Factor w/ 2 levels "Big","Small": 2 1 1 2 1 2 1 2 2 2 ...
 $ ads_campaign    : chr  "B" "B" "B" "A" ...
 $ competitor_price: num  3.8 4 4.3 3.7 3.85 3.8 3.75 3.85 3.65 4 ...
Code
contrasts(fresh_data$size) # shows us explicitly which variable is 0 and 1
      Small
Big       0
Small     1

Notice that, in the table, size is given as \(2\)’s and \(1\)’s. That is the nature of R’s output. \(1 \rightarrow\) reference category (or \(0\)). What is important to note here is that “big” is assigned as the reference category since it comes first in the alphabet. But this is not how we established our example; we want “small” to be the reference category. We can do this by manually assigning “small” to be a reference category.

Code
################################
# MANUAL REFERENCE CATEGORY
################################

fresh_data$size <- relevel(fresh_data$size, ref="Small")
contrasts(fresh_data$size)
      Big
Small   0
Big     1

As we can see, “small” is now the reference category.

What if There are Two or More Levels?

When there are two or more variables, we do the same thing, but now we will have a dummy variable for each variable that is not a reference category.

Suppose we consider adding the ads campaign variable which has three levels. This would result in two more \(\beta\) estimates added to the model. So, our model now becomes

\[\hat{y}_{i}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{1i}+\hat{\beta}_{2}x_{2i}+\hat{\beta}_{3}x_{3i}+\hat{\beta}_{4}x_{6i}+\hat{\beta}_{5}x_{5i}+\hat{\beta}_{6}x_{6i}\]

Assuming that we use “C” as the reference category, \(\hat{\beta}_{5}\) would represent the change in demand when the company uses ads campaign “A” as opposed to “C”, and \(\hat{\beta}_{6}\) would represent the change in demand when the company uses ads campaign “B” as opposed to “C”. Since we are using “C” as the reference category, we need to manually assign it as the reference category in R.

Code
###############################################
# MANUAL REFERENCE CATEGORY FOR ADS CAMPAIGN
###############################################

fresh_data$ads_campaign <- as.factor(fresh_data$ads_campaign)
fresh_data$ads_campaign <- relevel(fresh_data$ads_campaign, ref="C")

contrasts(fresh_data$ads_campaign)
  A B
C 0 0
A 1 0
B 0 1

Regression Analysis with Categorical Variables

We can, then, perform a full regression analysis including all the variables of the dataset

Code
###########################
# FULL REGRESSION ANALYSIS
###########################

fresh_full <- lm(demand ~ fresh_price + ads_expenditure + competitor_price + size + ads_campaign, data = fresh_data)

summary(fresh_full)

Call:
lm(formula = demand ~ fresh_price + ads_expenditure + competitor_price + 
    size + ads_campaign, data = fresh_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.30301 -0.11195  0.01392  0.07748  0.33904 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       7.74990    1.81931   4.260 0.000295 ***
fresh_price      -2.22741    0.55818  -3.990 0.000576 ***
ads_expenditure   0.45755    0.09858   4.641 0.000114 ***
competitor_price  1.54869    0.25246   6.134 2.94e-06 ***
sizeBig           0.16275    0.13484   1.207 0.239699    
ads_campaignA    -0.34975    0.07976  -4.385 0.000216 ***
ads_campaignB    -0.19574    0.07563  -2.588 0.016435 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1664 on 23 degrees of freedom
Multiple R-squared:  0.9537,    Adjusted R-squared:  0.9417 
F-statistic: 79.03 on 6 and 23 DF,  p-value: 3.456e-14

We then obtain the regression equation as

\[\begin{align*} \hat{y}= \hspace{0.2em} &7.75-2.23(\text{fresh price})+0.46(\text{ads expenditure})+1.55(\text{competitor price})+\\ & 0.16(\text{size: Big}) - 0.35(\text{ads campaign A}) - 0.2(\text{ads campaign B}) \end{align*}\]

Interpretation of the \(\beta\) Estimates

ImportantInterpretation of the Parameter Estimates

\(\beta_{0}\) : it is the average value of \(y\) when all the continuous variables are zero and the categorical variables are at the reference category.

\(\beta_{continuous_{j}}\) : it is the average estimated change in \(y\) for every additional unit increase in \(x_{continuous_{j}}\) and all the categorical variables are at the reference category.

\(\beta_{categorical_{j}}\) : it is the average estimated change in \(y\) when \(x_{categorical_{j}}\) takes on a dummy category compared to when it takes on a reference category.

In our example:

\(\beta_{0}\) : the average demand for Fresh is \(7.75\times100000=775000\) bottles when the price and competitor price are \(\text{R}0\), no ads expenditure, the size of the company is small, and the company uses ads campaign A and/or B.

\(\beta_{1}\) : the average demand for Fresh decreases by \(2.23\times100000=223000\) bottles for every additional \(\text{R}10\) increase in the price of Fresh when the company is small, and when the company uses ads campaign C.

\(\beta_{3}\) : the average demand of Fresh increases by \(0.46\times100000=46000\) bottles for every additional \(\text{R}1000\) spent on advertising when the company is small, and the company uses ads campaign C.

\(\beta_{3}\) : the average demand for Fresh increases by \(1.55\times100000=155000\) bottles for every additional \(\text{R}10\) increase in the competitor price for similar detergents when the company is small, and when the company uses ads campaign C.

\(\beta_{4}\) : the average demand for Fresh increases by \(0.16\times100000=16000\) bottles when the company is big compared to when it is small

\(\beta_{5}\) : the average demand for Fresh decreases by \(0.35\times100000=35000\) bottles when the company uses ads campaign A compared to when the company uses ads campaign C

\(\beta_{6}\) : the average demand for Fresh decreases by \(0.2\times100000=20000\) bottles when the company uses ads campaign B compared to C

The effect of a altering levels of a category on a multiple linear regression model is that it shifts the model up and down.

The intercept is at a baseline when the categorical variable is at a reference. The entire plane then shifts up or down when the categorical variable takes on a dummy category. We say that the dummy category “adjusts the intercept” (when taking a value of 1).)

Testing the Significance of the \(\beta_{j}\) Estimates

The test is the same as it is for continuous variables, and only the hypotheses and conclusion differ.

If \(H_{0}\) is rejected: we conclude that there is significant evidence in a difference in \(y\) between the dummy category and the reference category

If \(H_{0}\) is not rejected: we conclude that there is no evidence of a significant difference in \(y\) between the dummy category and the reference category.

Note

Of course, this comes from a null hypotheses of no difference in \(y\) between between the dummy category and the reference category. The alternative hypothesis stands to reason that there is a significant difference.

Interactions

The regression models we have looked at so far are called additive regression models. Their regression equation is characterised by a linear combination of the independent variables.

Sometimes, however, we may have reason to belive that the relationship between \(y\) and some independent variable \(x_{i}\) is affected by another independent variable \(x_{k}\). In this case, we have to modify the effect of \(x_{i}\) on \(y\) for different values of \(x_{k}\). To do this, we introduce an interaction term.

Suppose we have a regression model where the effect of \(x_{1}\) is influence by \(x_{2}\). Then, our model my look like:

\[\hat{y}=\hat{\beta}_{0}+\hat{\beta}_{1}x_{1i}+\hat{\beta}_{2}x_{2i}+\hat{\beta}_{3}x_{1i}x_{2i}\]

with \(\hat{\beta}_{3}x_{1i}x_{2i}\) being the interaction term. The effect of \(x_{1}\) on \(y\) is now given by \(\hat{\beta}_{1}\) and \(\hat{\beta}_{3}x_{3i}\). We can do this in R.

Code
###########################
# MULTIPLICATIVE MODELS
###########################

fresh_inter <- lm(demand ~ fresh_price * ads_campaign, 
                  data=fresh_data)

# we use * to show interaction between two variables
# for now, we exclude the other variables if we added the others 
# we would have 
# lm(demand ~ fresh_price*ads_campaign + ... + ... , data=fresh_data)

summary(fresh_inter)

Call:
lm(formula = demand ~ fresh_price * ads_campaign, data = fresh_data)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.98869 -0.30334  0.00246  0.41857  1.08154 

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)   
(Intercept)                 25.764      7.030   3.665  0.00122 **
fresh_price                 -4.628      1.877  -2.466  0.02118 * 
ads_campaignA              -14.488      9.764  -1.484  0.15088   
ads_campaignB               21.648     13.415   1.614  0.11966   
fresh_price:ads_campaignA    3.824      2.620   1.460  0.15737   
fresh_price:ads_campaignB   -5.758      3.578  -1.609  0.12067   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.57 on 24 degrees of freedom
Multiple R-squared:  0.4337,    Adjusted R-squared:  0.3157 
F-statistic: 3.676 on 5 and 24 DF,  p-value: 0.01304

From this, we obtain our multiplicative regression model as

\[\begin{align*} \hat{y}= \hspace{0.2em} &25.76-4.63(\text{fresh price})-14.49(\text{ads campaign A})+21.65(\text{ads campaign B})\\ & +3.82(\text{fresh price * ads campaign A})-5.76(\text{fresh price * ads campaign B}) \end{align*}\]

Interpreting Interaction Terms

\(\hat{\beta}_{4}\) : this is the adjustment of the slope of fresh price when the advertising campaign is A compared to C

\(\hat{\beta}_{5}\) : this is the adjustments of the slope of fresh price when the advertising campaign is B compared to C.

This is the visual effect of interaction terms when we have different category levels:

Different category levels adjust the intercept for different levels of a category. On the other hand, interaction terms adjust the slope.