Multiple Linear Regression

Emilio A. Laca

15 March 2022

Objectives

  1. Analyze data problems and determine if MLR is applicable.
  2. Determine the type of goal for specific applications of MLR.
  3. Assess collinearity and its effects.
  4. Formulate good candidate MLR models in R.
  5. Fit MLR models in R and check assumptions.

What is Multiple Linear Regression?

MLR is applicable when there are:

  1. Continuous response or dependent variable \(Y\).
  2. More than one continuous predictor, or independent variable, \(X_1, X_2, \cdots, X_p\),

where \(p\) is the number of predictors (note that if we use powers of a predictor, e.g., \(X_1^2\), each power counts as an additional predictor),

and the relationship between response and predictors is as follows:

\[Y_i = \beta_0 + \beta_1 X_{1 i}+ \beta_2 X_{2 i} + \ldots + \beta_p X_{p i} + \epsilon_i\]

where

\(Y_i\) is the value of the response variable in observation i (row i of the data),

\(\beta_0\) is the intercept or expected value of the response variable when all predictors are zero,

\(\beta_1\) is the partial regression coefficient for predictor \(X_1\),

\(X_{1i}\) is the value of predictor \(X_1\) for observation i,

\(\epsilon_i\) is the error for observation i.

and

\[\epsilon \sim N(0, \sigma^2 I)\]

(In practice, not all predictors need to be continuous; categorical predictors can be included through indicator variables.)
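As a minimal sketch of fitting such a model in R (simulated data; all names and values are illustrative, not from any example in these notes):

    # Simulate a small data set with two continuous predictors
    set.seed(1)
    n  <- 100
    x1 <- rnorm(n, mean = 10, sd = 2)
    x2 <- rnorm(n, mean = 5, sd = 1)
    y  <- 2 + 0.8 * x1 - 1.5 * x2 + rnorm(n, sd = 1)  # true betas: 2, 0.8, -1.5

    fit <- lm(y ~ x1 + x2)  # Y_i = b0 + b1*X1_i + b2*X2_i + e_i
    summary(fit)            # estimates, standard errors, and R^2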

Example

  • An MLR model is developed to estimate the canopy mass of mesquite shrubs.

  • Predictors are height, largest horizontal diameter, and perpendicular diameter.

  • In this example, the predictors are random variables whose values are measured with error.

  • Strictly, however, MLR assumes that predictors are fixed and known without error.

  • Alternatively, predictors can be random, but the variance among their true values has to be much larger than the measurement-error variance.

  • Otherwise, partial regression coefficients (\(\beta\)s) are biased toward zero.

[Figure: mesquite shrub]

Effects of measurement error in predictors

Blue: distribution of measurement errors of tree diameter. The regression coefficient is biased downward in absolute value by a factor of 0.97.

Maroon: distribution of tree diameters among trees. The regression coefficient is biased downward in absolute value by a factor of 0.69.
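For simple linear regression with classical measurement error, the attenuation factor is \(\lambda = \sigma_X^2 / (\sigma_X^2 + \sigma_{error}^2)\), where \(\sigma_X^2\) is the variance of the true predictor values and \(\sigma_{error}^2\) is the measurement-error variance. A minimal simulation sketch in R (parameter values are illustrative, chosen so that \(\lambda \approx 0.69\), as in the maroon example):

    # Attenuation (regression dilution) caused by measurement error in a predictor
    set.seed(2)
    n      <- 10000
    true_x <- rnorm(n, mean = 30, sd = 6)          # true tree diameters
    y      <- 1 + 0.5 * true_x + rnorm(n, sd = 2)  # true slope = 0.5
    meas_x <- true_x + rnorm(n, sd = 4)            # diameters measured with error

    coef(lm(y ~ true_x))["true_x"]  # close to 0.5 (unbiased)
    coef(lm(y ~ meas_x))["meas_x"]  # close to 0.5 * 36/(36 + 16) = 0.35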

Everything has units

Mass, length, and time are fundamental dimensions. Kilograms, meters, and seconds are some of the units used for these dimensions. \(\beta_0\) has dimensions (and units) equal to the dimensions of Y. \(\beta_j\) has dimensions (and units) equal to the dimensions of Y divided by the dimensions of \(X_j\). For example, if Y is canopy mass in kg and \(X_j\) is height in m, then \(\beta_j\) has units of kg/m.

Use and interpretation of the model require knowing the units.

Types of goals for MLR determine procedure

At a minimum, one must understand the difference between two extremes of a continuum:

  1. In the first case we want to make sure that we obtain unbiased estimates of the \(\beta\)s with the least variance (or a good combination of a little bias that leads to much smaller variance, which is called biased regression and is outside the scope of this topic).

  2. In the second case we want to make sure that we get accurate and precise estimates of the response Y for a given set of values of the predictors, \(\mu_{Y|X=x}\).

Focus on parameters:

Estimate the contribution of carcass traits, sex, date of sale, and growing period to the price of finished cattle.

[Figure: steer with schematic of cuts and prices]

Focus on response variable:

Determine the percentage body fat in a person or animal using measurements that can be performed quickly in the doctor’s office, for example, height, thickness of triceps skin fold, girth, thigh circumference, mass.

[Figure: body fat calipers]

Collinearity and its ill effects

Effects of collinearity

Collinearity causes the following negative effects on multiple linear regression:

  1. Large variance in the estimated partial regression coefficients (see the simulation sketch after this list).
  2. Unstable regression equation from sample to sample.
  3. Large changes in the estimated effects of predictors caused by small changes in the response variable.
  4. Large variance in the estimated expected response value for predictor values far from their means.
  5. Inability to unambiguously attribute explanatory power to all predictors.
  6. Dependence of the “significance” (p-values) of predictors on the order in which they are entered in the model.
  7. Values of regression coefficients that are dependent on which predictors are included in model.
  8. Biased regression coefficients when not all relevant predictors are in the model.
  9. Potentially overfitted models, i.e., models that predict much better in the training sample than in validation or testing data.
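A minimal simulation sketch of effects 1 and 2 above (all values are illustrative): the standard deviation of \(\hat{\beta}_1\) across repeated samples grows sharply when the predictors are strongly correlated.

    # Sampling variability of beta1-hat: uncorrelated vs. collinear predictors
    set.seed(3)
    one_beta1 <- function(rho) {
      n  <- 50
      x1 <- rnorm(n)
      x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)  # cor(x1, x2) is about rho
      y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)
      coef(lm(y ~ x1 + x2))["x1"]
    }
    sd(replicate(2000, one_beta1(rho = 0)))     # uncorrelated predictors
    sd(replicate(2000, one_beta1(rho = 0.95)))  # collinear: about 3 times larger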

Assessing collinearity

\[VIF_j = 1/(1 - R_j^2)\]

where \(j\) identifies the predictor being considered and \(R_j^2\) is the proportion of the variance of \(X_j\) explained by the rest of the predictors (the other X’s).

\(R_j^2\) can be obtained by simply regressing \(X_j\) on the other X’s and obtaining the \(R^2\) or coefficient of determination of the model.

The function car::vif() will calculate the VIFs for any linear model in R.

The VIF gets its name from the fact that it is the factor by which the variance of the estimated partial regression coefficient (\(\hat{\beta}_j\)) is inflated by the presence of collinearity.

Each predictor has its own VIF, as in the sketch below.
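A minimal sketch (simulated data; requires the car package) showing both car::vif() and the by-hand calculation described above:

    library(car)  # provides vif()

    set.seed(4)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # collinear with x1
    x3 <- rnorm(n)                               # unrelated predictor
    y  <- 1 + x1 + x2 + x3 + rnorm(n)

    fit <- lm(y ~ x1 + x2 + x3)
    vif(fit)  # one VIF per predictor

    # VIF for x1 by hand: regress x1 on the other predictors
    r2_1 <- summary(lm(x1 ~ x2 + x3))$r.squared
    1 / (1 - r2_1)  # matches vif(fit)["x1"]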

Effects of collinearity in predictors

Uncorrelated predictors

[Figure: sampling distribution of the regression plane without collinearity]

Collinear data

[Figure: sampling distribution of the regression plane with collinearity]

What to do about collinearity

Nothing, unless the goal of the study requires it.

Methods to ameliorate the effects of collinearity are particularly necessary when the focus of the study is the estimation of the partial regression coefficients (\(\beta\)s).

Some techniques to deal with collinearity are (ridge and lasso are sketched in code after this list):

  1. Variable selection or model reduction.
  2. Partial least squares regression (a type of biased regression).
  3. Principal components regression (a type of biased regression).
  4. Ridge regression (a type of biased regression).
  5. Lasso regression (a type of biased regression).
  6. Elastic net (a type of biased regression).
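As one possible sketch of two of these techniques, here is a hedged example using the glmnet package (an assumption; these notes do not prescribe an implementation). In cv.glmnet(), alpha = 0 gives ridge and alpha = 1 gives lasso, with the penalty strength lambda chosen by cross-validation:

    library(glmnet)  # penalized (biased) regression: ridge, lasso, elastic net

    set.seed(5)
    n  <- 100
    x1 <- rnorm(n)
    x2 <- 0.95 * x1 + sqrt(1 - 0.95^2) * rnorm(n)  # strongly collinear with x1
    X  <- cbind(x1, x2, x3 = rnorm(n))             # predictor matrix
    y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)

    ridge <- cv.glmnet(X, y, alpha = 0)  # ridge: shrinks coefficients toward zero
    lasso <- cv.glmnet(X, y, alpha = 1)  # lasso: can set coefficients exactly to zero
    coef(ridge, s = "lambda.min")
    coef(lasso, s = "lambda.min")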

Example: Body fat

Go to RStudio Cloud