Emilio A. Laca
15 March 2022
Multiple Linear Regression (MLR) is one of the most commonly used and useful statistical methods.
MLR is a very compact and general modeling approach with conceptually simple equations and calculations.
The model relates a response variable Y to p predictors \(X_1, \ldots, X_p\), where p is the number of predictors (note that if we use powers of any predictor, e.g., \(X_1^2\), each power counts as another predictor).
The relationship between response and predictors is as follows:
\[Y_i = \beta_0 + \beta_1 X_{1 i}+ \beta_2 X_{2 i} + \ldots + \beta_p X_{p i} + \epsilon_i\]
where
\(Y_i\) is the value of the response variable in observation i (row i of the data),
\(\beta_0\) is the intercept or expected value of the response variable when all predictors are zero,
\(\beta_1\) is the partial regression coefficient for predictor \(X_1\),
\(X_{1i}\) is the value of predictor \(X_1\) for observation i,
\(\epsilon_i\) is the error for observation i.
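and
\[\epsilon \sim N(0, \sigma^2 I)\]
that is, the errors are independent and normally distributed with mean zero and common variance \(\sigma^2\) (I is the identity matrix).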
(The predictors are usually thought of as continuous variables, but in reality there is no need for all of them to be continuous.)
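As a compact restatement of the equation above, the model can be written in matrix form. The symbols \(\mathbf{X}\) and \(\boldsymbol{\beta}\) and the sample size n are notation assumed here for illustration; they do not appear in the text above.

\[
\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon},
\qquad
\mathbf{X} =
\begin{pmatrix}
1 & X_{11} & X_{21} & \cdots & X_{p1} \\
1 & X_{12} & X_{22} & \cdots & X_{p2} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & X_{1n} & X_{2n} & \cdots & X_{pn}
\end{pmatrix},
\qquad
\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^\top
\]

Row i of \(\mathbf{X}\) holds the predictor values for observation i, and the leading column of ones carries the intercept \(\beta_0\).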
An MLR model is developed to determine canopy mass of mesquite shrubs.
Predictors are height, largest horizontal diameter and perpendicular diameter.
In this example, the predictors are random variables whose values are measured with error.
Predictors have to be fixed and known without error.
Or they can be random, but the variance of the predictor has to be much larger than the variance of the measurement error.
Otherwise, partial regression coefficients (\(\beta\)s) are biased towards zero.
[Figure: shrub]
Blue: distribution of measurement errors of tree diameter.
Regression coefficient is biased downward in absolute value by a factor of 0.97.
Maroon: distribution of tree diameters among trees.
Regression coefficient is biased downward in absolute value by a factor of 0.69.
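The size of this bias can be checked with a small simulation. The sketch below assumes the classical setting for a single predictor, in which the measurement error is additive and independent of both the true predictor and the model error, so that the slope is attenuated by the factor \(\sigma_X^2 / (\sigma_X^2 + \sigma_u^2)\). The function names, the standard deviations, and the true slope are all hypothetical; the standard deviations were chosen only so that the factors come out close to 0.97 and 0.69.

```python
import random


def ols_slope(x, y):
    """Least-squares slope of y on a single predictor x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_attenuation(sd_true, sd_error, beta=2.0, n=20000, seed=1):
    """Return (estimated slope / true slope, theoretical attenuation factor)."""
    rng = random.Random(seed)
    # True predictor values (e.g. diameters), varying among individuals.
    x_true = [rng.gauss(10.0, sd_true) for _ in range(n)]
    # The response depends on the TRUE predictor values.
    y = [1.0 + beta * x + rng.gauss(0.0, 1.0) for x in x_true]
    # We only observe the predictor with additive measurement error.
    x_obs = [x + rng.gauss(0.0, sd_error) for x in x_true]
    ratio = ols_slope(x_obs, y) / beta
    expected = sd_true ** 2 / (sd_true ** 2 + sd_error ** 2)
    return ratio, expected


# Hypothetical settings: predictor spread much larger than, or similar to,
# the measurement error.
for sd_true, sd_error in [(5.7, 1.0), (1.5, 1.0)]:
    ratio, expected = simulate_attenuation(sd_true, sd_error)
    print(f"sd_true={sd_true}, sd_error={sd_error}: "
          f"estimated/true slope = {ratio:.3f}, theory = {expected:.3f}")
```

When the spread of the predictor among individuals is large relative to the measurement error, the factor stays close to 1 and the bias is negligible; when the two are of similar size, the estimated slope shrinks noticeably towards zero.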
Mass, length, and time are fundamental dimensions. Kilograms, meters and seconds are some of the units used for the previous dimensions. \(\beta_0\) has dimensions (and units) equal to the dimensions of Y. \(\beta_j\) has dimensions (and units) equal to the dimensions of Y divided by the dimensions of \(X_j\).
Use and interpretation of the model require knowing the units.
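A minimal sketch of why units matter: the same data are fitted with the predictor expressed in meters and in centimeters. The heights and masses are made-up numbers for illustration only, and `ols_fit` is a hypothetical helper, not a function from any package.

```python
def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: shrub height in meters, canopy mass in kilograms.
height_m = [1.0, 1.5, 2.0, 2.5, 3.0]
mass_kg = [0.8, 1.9, 3.1, 3.9, 5.2]

# Same heights expressed in centimeters.
height_cm = [100.0 * h for h in height_m]

b0_m, b1_m = ols_fit(height_m, mass_kg)    # slope in kg per m
b0_cm, b1_cm = ols_fit(height_cm, mass_kg)  # slope in kg per cm

print(f"intercept: {b0_m:.4f} kg (m fit), {b0_cm:.4f} kg (cm fit)")
print(f"slope: {b1_m:.4f} kg/m vs {b1_cm:.6f} kg/cm")
```

The intercept keeps the units of Y and does not change, whereas the slope changes by exactly the conversion factor of the predictor, so a coefficient reported without units cannot be interpreted.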
At a minimum, one must understand the difference between the two extremes of a continuum of goals: a focus on the parameters and a focus on the response variable.
In the first case we want to make sure that we obtain unbiased estimates of \(\beta\)s with the least variance (or a good combination of a little bias that leads to much smaller variance, which is called biased regression and is outside the scope of this topic).
In the second case we want to make sure that we get accurate and precise estimates of the response Y for a given set of values of the predictors, \(\mu_{Y|X=x}\).
Focus on parameters:
Estimate the contribution of carcass traits, sex, date of sale and growing period to the price of finished cattle.
[Figure: steer with schematic of cuts and prices]
Focus on response variable:
Determine the percentage body fat in a person or animal using measurements that can be performed quickly in the doctor’s office, for example, height, thickness of triceps skin fold, girth, thigh circumference, mass.
[Figure: body fat calipers]
Collinearity is the presence of correlation between two predictors.
Multicollinearity is the presence of a statistical linear relationship involving more than two predictors.
Collinearity causes the following negative effects on multiple linear regression:
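The severity of collinearity for each predictor is measured with the variance inflation factor (VIF):
\[VIF_j = \frac{1}{1 - R_j^2}\]
where j refers to the identity of the predictor being considered and \(R_j^2\) is the proportion of the variance of \(X_j\) explained by the rest of the predictors or X’s.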
\(R_j^2\) can be obtained by simply regressing \(X_j\) on the other X’s and obtaining the \(R^2\) or coefficient of determination of the model.
The function car::vif() will calculate the VIFs for a linear model fitted in R.
The VIF gets its name from the fact that it is the factor by which the variance of the estimated partial regression coefficient (\(\hat{\beta}_j\)) is inflated because of the presence of collinearity.
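One standard way to see the inflation is the sampling variance of an estimated coefficient under the usual MLR assumptions. The notation is assumed here for illustration: n is the number of observations, \(\bar{X}_j\) is the sample mean of predictor j, and \(\sigma^2\) is the error variance.

\[
\mathrm{Var}(\hat{\beta}_j) = \frac{\sigma^2}{\sum_{i=1}^{n} (X_{ji} - \bar{X}_j)^2} \times \frac{1}{1 - R_j^2}
\]

The first factor is the variance the coefficient would have if \(X_j\) were uncorrelated with the other predictors, and the second factor is \(VIF_j\). For example, \(R_j^2 = 0.8\) gives \(VIF_j = 5\), so the standard error of \(\hat{\beta}_j\) is \(\sqrt{5} \approx 2.24\) times larger than it would be without collinearity.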
Each predictor has a value of VIF.
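The definition can be followed step by step in a short sketch: regress each predictor on the others, take the \(R^2\), and compute \(1/(1 - R_j^2)\). The code below is written in plain Python for illustration and is not how car::vif() is implemented. The helper names (`solve`, `r_squared`, `vif`) and the simulated predictors are hypothetical, and solving the normal equations directly is a simplification that is adequate only for small, well-behaved examples.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - partial) / m[i][i]
    return x


def r_squared(y, predictors):
    """R^2 of the least-squares regression of y on the predictors, with intercept."""
    n = len(y)
    cols = [[1.0] * n] + predictors
    xtx = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(u * v for u, v in zip(ci, y)) for ci in cols]
    beta = solve(xtx, xty)
    fitted = [sum(b * c[i] for b, c in zip(beta, cols)) for i in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def vif(predictors):
    """VIF of each predictor: 1 / (1 - R_j^2), regressing X_j on the other X's."""
    result = []
    for j, xj in enumerate(predictors):
        others = predictors[:j] + predictors[j + 1:]
        result.append(1.0 / (1.0 - r_squared(xj, others)))
    return result


# Simulated predictors (hypothetical): x2 is strongly related to x1,
# x3 is independent of both.
rng = random.Random(42)
n_obs = 500
x1 = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
x2 = [a + rng.gauss(0.0, 0.5) for a in x1]
x3 = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]

vifs = vif([x1, x2, x3])
for name, value in zip(["x1", "x2", "x3"], vifs):
    print(f"VIF({name}) = {value:.2f}")
```

In this simulated setup the population \(R_j^2\) for x1 and x2 is 0.8, so their VIFs should come out near 5, while the VIF of the independent predictor x3 should be close to 1.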
[Figure: sampling distribution of the fitted plane without collinearity]
[Figure: sampling distribution of the fitted plane with collinearity]
Nothing needs to be done about collinearity, unless the goal of the study requires it.
Methods to ameliorate the effects of collinearity are particularly necessary when the focus of study is: